Imports, Namespaces, and References

I had a decent idea of how importing and namespaces worked, but it wasn’t until I started setting up unit tests that everything clicked. False assumptions I had about namespaces and imports caused test failures. They forced me to do things the right way, so that I was changing the right objects instead of creating objects in random places. I learned that most things in Python are references, and when you think of variables that way, namespaces and imports make a lot of sense.

Some basics

When you write a module, you start defining things: functions, variables, classes, etc. When you define something, Python creates an object in memory. The variable name you use is how you refer to that chunk of memory. Everything is references behind the scenes.

A namespace is a container for these references. It maps variable names to addresses in memory. Any module you write has its own namespace. When you define variable mylist in your module, you are defining mylist in your module’s namespace. Variables defined in other modules, or I should say in other modules’ namespaces, are not accessible in your module – unless you import them.
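As a quick sketch of the namespace-as-dict idea (using the standard math module purely for illustration; my_constant is a made-up name):

```python
import math

# A module's namespace is just a dict mapping names to objects;
# vars(module) (equivalently module.__dict__) exposes it directly.
ns = vars(math)
assert ns['pi'] is math.pi   # two lookups, one reference, same object
assert 'sqrt' in ns          # every top-level name in math is a key

# Adding a key to the dict is the same as defining a variable
# in that module's namespace.
ns['my_constant'] = 42
assert math.my_constant == 42
```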

You will work with open source libraries, and to unlock their power you need to access their functions and classes, which live in the library’s namespace. You gain access through importing. At a high level Python imports modules by doing the following:

  1. See if the module name is in sys.modules – this is like a cache of imported modules.
  2. If not, look for the file that matches the module name.
  3. Run the module being imported. This sets up all its global variables, including the variables for classes & functions.
  4. Put the loaded module into sys.modules.
  5. Set up references so you can access the imports – this is where the fun happens.

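Steps 1 and 4 are easy to see with any standard module – here json, though nothing about json is special:

```python
import sys
import json

# Step 4: after an import, the module object sits in sys.modules,
# keyed by its name.
assert 'json' in sys.modules

# Step 1: a repeat import just hands back the cached object.
import json as j2
assert j2 is json is sys.modules['json']
```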
I want to explain these in a little more detail as well as how they affect modifying variables inside of an import. I will go through the steps out of order because step 1, loading from the cache, doesn’t mean much except after step 4, putting a module into the cache.

Step 2: Finding what to import

Finding the modules is pretty useful and interesting, but it does not have a lot to do with references, so I am skipping the details here. Also, the Python docs outline this clearly and succinctly:

https://docs.python.org/3.3/tutorial/modules.html#the-module-search-path

Step 3: Executing Code

After finding the file, Python loads and executes it. To give a concrete example, I have 2 simple modules:

cars.py

car_list = [1]

def myfun():
    return car_list

print "Importing..."

dealer.py

import cars

If you run dealer.py, you see the following:

> python dealer.py
Importing...

So by loading and executing I mean Python runs everything at the top level. That includes setting variables, defining functions, and running any other code, such as the print statement in cars.

It also does this in each module’s own namespace. Defining car_list in cars basically adds a pointer in cars to this new list. It does not affect dealer, which could have its own, completely independent variable named car_list. After the import is done, dealer will also not have direct access to car_list. cars has the reference/pointer to that object in memory, and the only way dealer can access the list is to go through cars: cars.car_list.

Step 4 & Step 1

Caching takes place after loading the module. Python stores the module in a module object, which it puts into the sys.modules dictionary. The key is usually the name of the module, with one exception: if the module is the one you are directly running, its key is ‘__main__‘. (Aside: this is why we use if __name__ == ‘__main__‘.)
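You can verify the ‘__main__‘ key yourself – the script you run directly is cached under that name, not under its filename:

```python
import sys

# The directly-run script is stored under '__main__', which is exactly
# why the `if __name__ == '__main__'` idiom works.
assert '__main__' in sys.modules
main_mod = sys.modules['__main__']
assert main_mod.__name__ == '__main__'
```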

What’s cool is that you can access modules through sys.modules. So back to the simple module. After importing it, you can access the list as cars.car_list. You can also access it through sys.modules:

>>> import sys
>>> sys.modules['cars'].car_list
[1]

The caching in sys.modules has many effects. Importing a cached module means it doesn’t get executed again. Reimporting something is cheap! It also means that all the classes, functions, and variables defined at the top level are only defined once, and statements only execute once. The print statement in cars won’t print if it is imported again, or imported anywhere else in the same process.
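To see the execute-once behavior without the cars file, here is a self-contained sketch that writes a throwaway module (the name noisy is made up) and imports it twice:

```python
import os
import sys
import tempfile

# Create a tiny module whose top-level code has a visible side effect.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'noisy.py'), 'w') as f:
    f.write("counter = []\ncounter.append(1)\n")
sys.path.insert(0, tmpdir)

import noisy
import noisy  # cached in sys.modules: the top-level code does NOT run again

# The append ran exactly once, no matter how many times we import.
assert noisy.counter == [1]
assert sys.modules['noisy'] is noisy
```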

Step 5: Reference Setup

This is where the most interesting stuff happens. Depending on how you import something, references are set up in your module’s namespace to let you access what you imported. There are a few variations, and I have some code that shows what Python effectively does. Remember, the modules are already loaded into sys.modules at this point.

  1. import cars

    cars = sys.modules['cars']
  2. import cars as vehicles

    vehicles = sys.modules['cars']
  3. from cars import car_list

    car_list = sys.modules['cars'].car_list

No matter how you import something, code from a module always runs in its own namespace. So when cars looks up a variable, it looks it up in cars’s own namespace, not in the namespace of whoever imported it.
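The same three variations with a standard module show that every form hands out references to (or into) one shared module object:

```python
import sys
import math              # 1. plain import
import math as m         # 2. import ... as
from math import pi      # 3. from ... import

# Forms 1 and 2 bind different names to the SAME cached module object.
assert m is math is sys.modules['math']

# Form 3 copies one reference out of math's namespace into ours.
assert pi is sys.modules['math'].pi
```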

Mutability

Imported variables behave the same way as regular variables, which depends on mutability.

Remember, when Python creates an object, it puts it somewhere in memory. For mutable objects (dict, list, most other objects), when you make a change to that object, you change the value in memory. For immutable objects (int, float, str, tuple), you can never change the value in memory. For immutable objects you can reassign the variable to point to a new value, but you cannot change the original value in memory.

To illustrate this I can use the id function, which returns the memory address of the object a variable points to. I modify the value of an int and a list.

Immutable:

>>> a = 1
>>> id(a)
140619792058920
>>> a += 1
>>> id(a)
140619792058896

Mutable

>>> b = [1]
>>> id(b)
4444200832
>>> b += [2]
>>> id(b)
4444200832

When I change the value of a, the immutable integer, the id changes. But when I do the same for a list, the id remains the same.

There is a sneaky catch though! Many times when you assign a variable, even a mutable one, you change the reference.

>>> b = [1]
>>> id(b)
4320637944
>>> b = b + [2]
>>> id(b)
4320647936

Behind the scenes, Python is creating a new list and changing where the b reference is pointing. The difference between this case and the += case above can be subtle. I wouldn’t worry about it much – it’s never affected me – but it’s good to be aware of.
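One place the difference does bite is when two names share a list – a minimal sketch:

```python
a = [1]
b = a              # a and b now reference the same list object

b += [2]           # in-place: mutates the shared list
assert a == [1, 2]
assert a is b

b = b + [3]        # builds a NEW list and rebinds b; a is untouched
assert a == [1, 2]
assert b == [1, 2, 3]
assert a is not b
```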

Modifying Imports

So how can I change values in imported modules? And how does mutability affect imports?

The way you access, and sometimes change, variables in imported modules may or may not have an effect on the module. This is where unit tests taught me a lot, since my tests imported modules and tried to set variables in order to set up certain test cases. I learned it all comes down to references. There are a few key rules.

Modules execute their code in their own namespace

When a function in cars tries to use car_list, it looks up car_list in cars’s namespace. That means the module is only affected if:

  • you update the object data behind that reference, or
  • you change where the reference points in the module’s namespace

There are some simple ways to update what’s behind the reference – and they work no matter how you import. Let’s run this code:

tester_1.py

import cars
from cars import car_list, myfun
# update data through the cars module
cars.car_list.append(2)
print "updated through cars:", cars.myfun()
# update through my namespace
car_list.append(3)
print "updated imported var:", myfun()

This outputs:

> python tester_1.py
Importing...
updated through cars: [1, 2]
updated imported var: [1, 2, 3]

It’s trickier to change cars’s reference, which does depend on how you import:

tester_2.py

import cars
from cars import car_list, myfun
# change the reference through the cars module
cars.car_list = [2]
print "changed cars reference    :", cars.myfun()
# the imported variable still points to the original value
print "test original import      :", car_list
# changing the imported variable will not affect the "real" value
# so this should print the same as the first call
car_list = [3]
print "changed imported reference:", myfun()

This outputs:

> python tester_2.py
Importing...
changed cars reference    : [2]
test original import      : [1]
changed imported reference: [2]

In both cases, we are creating a new list, and changing what the variables are referencing. But remember, myfun is in cars, and gets variables from the cars namespace. In the first case we are going to the cars namespace, and setting its car_list to this new list. When myfun runs, it gets the reference we just updated and returns our new list.

When we from cars import car_list, we create a car_list in the tester_2 namespace which points to the same list cars has. When cars’s reference is set to [2], tester_2’s reference stays unchanged, which is why it keeps the value [1]. When we update car_list in tester_2’s namespace, it only changes where tester_2 is pointing. It does not change where cars points (nor the value in memory that cars points to), so it has no effect on how the cars module runs. So when we check myfun, it returns [2].
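The same behavior shows up with any real module – rebinding an imported name has no effect on the module it came from:

```python
import math
from math import pi

pi = 3  # rebinds only THIS namespace's reference

# math's own reference is untouched, so code inside math still
# sees the original value.
assert math.pi != 3
assert abs(math.sin(math.pi)) < 1e-9  # math's functions use the real pi
```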

Final thoughts

When you run into a problem where you set a variable in one module, then access it through another, knowing what’s going on behind the scenes helps make sense of it all. With imports it’s easy to figure out what’s going on once you realize that variables are pointers to objects in memory. If your code is not doing what you expect when setting variables, then something is pointing to the wrong object. Knowing how Python works behind the scenes with namespaces and imports will help you figure out what’s going on. It’s really not bad once you realize it’s all just references.

Python & C

As promised, I took a script written in Python and ported parts of it to C. The Python interface is the same, but behind the scenes is blazing fast C code. The example is a little different from what I usually do, but I discovered a few things I had never seen.

I am working off the same example as the previous post, How to Write Faster Python, which has the full original code. The example takes an input image and resizes it down by a factor of N. It takes NxN blocks and brings them down to 1 pixel which is the median RGB of the original block.

Small Port

This example is unusual for me because it does one task, resizing, and the core of that task is a very small bit of logic – the median function. I first ported the median code to C and checked the performance. I know the C code is better because C’s qsort lets you specify how many elements of the array to sort. In Python you sort the whole list (as far as I know), so in C I was able to avoid the checks on the size of the list vs. the number of elements I want to sort.

Below is the code, but I have a few changes I want to mention as well.

myscript2.py:

import cv2
import numpy as np
import math
from datetime import datetime
import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')
fast_tools.median.argtypes = (ctypes.c_void_p, ctypes.c_int)


def _median(list, ne):
    """
    Return the median.
    """

    ret = fast_tools.median(list.ctypes.data, ne)
    return ret


def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    num_elems = block_size * block_size
    block_b = np.zeros(num_elems, dtype=np.uint8)
    block_g = np.zeros(num_elems, dtype=np.uint8)
    block_r = np.zeros(num_elems, dtype=np.uint8)
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            num_elems = (r_max-row_min) * (c_max-col_min)
            block_i = 0
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b[block_i] = img[r_i, c_i, 0]
                    block_g[block_i] = img[r_i, c_i, 1]
                    block_r[block_i] = img[r_i, c_i, 2]
                    block_i += 1

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b, num_elems)),
                int(_median(block_g, num_elems)),
                int(_median(block_r, num_elems))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

myscript_tools.c

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <math.h>

int cmpfunc (const void * a, const void * b)
{
   return ( *(uint8_t*)a - *(uint8_t*)b );
}

int _median(uint8_t * list, int ne){
    //sort inplace
    qsort(list, ne, sizeof(uint8_t), cmpfunc);
    int i;
    if (ne % 2 == 0){
        i = (int)(ne / 2);
        return ((int)(list[i-1]) + (int)(list[i])) / 2;
    } else {
        i = (int)(ne / 2);
        return list[i];
    }
}

int median(const void * outdatav, int ne){
    int med;
    uint8_t * outdata = (uint8_t *) outdatav;
    med = _median(outdata, ne);
    return med;
}

and to compile:

gcc -fPIC -shared -o myscript_tools.so myscript_tools.c

Note: I discovered the .so filename should not match any Python script you plan to import. Otherwise you will get a “ImportError: dynamic module does not define init function” because Python is trying to import the C code instead of your Python module.

The main change I made was to port the median function to C. I also switched from using a list of colors to using a numpy array of colors. I did this because it’s easier to pass numpy arrays, and because that is more useful in general. Also, so I wouldn’t have to convert the list to a numpy array on every median call, I created the numpy array in the resize_median code itself. Before this last change, my benchmark showed I was actually running slower. The faster C code was not making up for all the list-to-numpy-array conversions!

After the 2nd fix I timed it

python -m timeit "import myscript2; myscript2.run()"

and got

500 loops, best of 3: 2.04 sec per loop

Nice! That’s compared to 2.51 from my last post, so it’s around 18% faster.

All C (sort of)

I wasn’t happy with the 18% boost – I thought it would be faster! But, it was only a very small portion ported. I decided to port more, but there wasn’t a good natural place to split the resize function up any further, so I ported the whole thing. This way I could also show how to send and receive a whole image back from C – which I do by reference/pointer.

myscript2.py

import cv2
import numpy as np
import math
from datetime import datetime
import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')

def c_resize_median(img, block_size):
    height, width, depth = img.shape

    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create a numpy array where the C will write the output
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    fast_tools.resize_median(ctypes.c_void_p(img.ctypes.data),
                             ctypes.c_int(height), ctypes.c_int(width),
                             ctypes.c_int(block_size),
                             ctypes.c_void_p(out_img.ctypes.data))
    return out_img


def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = c_resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

myscript_tools.c

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <math.h>

int cmpfunc (const void * a, const void * b)
{
   return ( *(uint8_t*)a - *(uint8_t*)b );
}

int _median(uint8_t * list, int ne){
    //sort inplace
    qsort(list, ne, sizeof(uint8_t), cmpfunc);
    int i;
    if (ne % 2 == 0){
        i = (int)(ne / 2);
        return ((int)(list[i-1]) + (int)(list[i])) / 2;
    } else {
        i = (int)(ne / 2);
        return list[i];
    }
}

void resize_median(const void * bgr_imgv, const int height, const int width,
                   const int block_size, void * outdatav) {
    const uint8_t * bgr_img = (uint8_t  *) bgr_imgv;
    uint8_t * outdata = (uint8_t *) outdatav;

    int col, row, c_max, r_max, c_si, r_si;
    int max_val;
    const int num_elems = block_size * block_size;
    int j, offset;

    uint8_t *b_values;
    uint8_t *g_values;
    uint8_t *r_values;
    b_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);
    g_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);
    r_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);

    int out_i = 0;

    row = 0;
    while (row < height) {
        col = 0;
        r_max = row + block_size;
        if (r_max > height) {
            r_max = height;
        }
        while (col < width) {
            c_max = col + block_size;
            if (c_max > width) {
                c_max = width;
            }

            // block info:
            j = 0;
            for (r_si = row; r_si < r_max; ++r_si) {
                for (c_si = col; c_si < c_max; ++c_si) {
                    offset = ((r_si*width)+c_si)*3;
                    b_values[j] = bgr_img[offset];
                    g_values[j] = bgr_img[offset+1];
                    r_values[j] = bgr_img[offset+2];
                    j += 1;
                }
            }

            // have the block info, get medians
            max_val = j;
            outdata[out_i]   = _median(b_values, max_val);
            outdata[out_i+1] = _median(g_values, max_val);
            outdata[out_i+2] = _median(r_values, max_val);

            // update indexes
            out_i += 3;
            col += block_size;
        }
        row += block_size;
    }

    free(b_values);
    free(g_values);
    free(r_values);
}

This one is a little more complicated. With multi-dimensional arrays, numpy flattens the data. The new array is [pixel00_blue, pixel00_green, pixel00_red, pixel01_blue, ...] – see the offset variable above for the row/column equation. I also pass pointers with no type, so before using the arrays, I have to typecast them. I realize this example is unusual because the whole script is basically ported, but it illustrates many things that are non-trivial and required some time to figure out.
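The offset arithmetic can be checked in pure Python – here a toy 2x3 “image” built from nested lists, flattened row-major just like a C-contiguous numpy array:

```python
height, width = 2, 3

# Give each channel a value we can recognize: 100*row + 10*col + channel.
nested = [[[100 * r + 10 * c + ch for ch in range(3)]
           for c in range(width)]
          for r in range(height)]

# Flatten row-major: [pixel00_b, pixel00_g, pixel00_r, pixel01_b, ...]
flat = [v for row in nested for px in row for v in px]

# The offset formula from the C code recovers every pixel channel.
for r in range(height):
    for c in range(width):
        for ch in range(3):
            offset = ((r * width) + c) * 3 + ch
            assert flat[offset] == nested[r][c][ch]
```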

And for the moment of truth…

500 loops, best of 3: 440 msec per loop

82% faster!

Although it’s mostly C at this point.

Final Thoughts

So in this case, the Python took about 5 times as long to run compared to the C version. When something takes a few milliseconds, 5-10x isn’t that important, but when it gets more complex, it starts to matter (assuming you’re not I/O bound). Here I ported pretty much the whole module. I think this only makes sense with a library, and not code that you will refer to often – Python is just so easy to read.

An idea popped into my head as I was writing this as well. This should be easy to port to RapydScript. JavaScript has the Image object I could use in place of opencv. I wonder where JavaScript would fall on the spectrum.

How to Write Faster Python

This is a high level guide on how to approach speeding up slow apps. I have been writing a computer vision app on the backend of a server and came to the realization that it is easy to write slow Python, and Python itself is not fast. If you’re iterating over a 1000 x 1000 pixel (1 megapixel) image, whatever you’re doing inside the loop will run 1 million times. My code ran an iterative algorithm, and the first version took over 2 minutes to run. I was able to get that time down to under 10 seconds – and it only took a few hours’ time with minimal code changes. The hardest part was converting some snippets of code into C, which I plan to detail in the future. The basic approach I followed was:

  1. Set a baseline with working code & test
  2. Find the bottlenecks
  3. Find a way to optimize bottlenecks
  4. Repeat

And fair warning, these sorts of optimizations only work if you have selected a good algorithm. As we used to say at work “You can’t polish a turd”.

Slow, but working code

I am going to start with an example program. This resizes an image using the medians of each region. If you resize to 1/4th of the original size the new image will be the medians of 4×4 blocks. It is not too important you understand the details, but I still included code for reference:

import cv2
import numpy as np
import math
from datetime import datetime

def _median(list):
    """
    Return the median
    """

    sorted_list = sorted(list)
    if len(sorted_list) % 2 == 0:
        #have to take avg of middle two
        i = len(sorted_list) / 2
        #int() to convert from uint8 to avoid overflow
        return (int(sorted_list[i-1]) + int(sorted_list[i])) / 2.0
    else:
        #find the middle (remembering that lists start at 0)
        i = len(sorted_list) / 2
        return sorted_list[i]

def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    #TODO: optimize this, maybe precalculate height & widths
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            block_b = []
            block_g = []
            block_r = []
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b.append(img[r_i, c_i, 0])
                    block_g.append(img[r_i, c_i, 1])
                    block_r.append(img[r_i, c_i, 2])

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b)),
                int(_median(block_g)),
                int(_median(block_r))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

This takes around 3 seconds to run against a random 1000×1000 image from Wikipedia (link):

500 loops, best of 3: 3.09 sec per loop

So this is a good starting point – it works! I am also saving the output image, so that when I make changes, I can spotcheck that I didn’t break something.

Profiling

Python’s standard library has a profiler that works very well (see The Python Profilers). Although I’m not a fan of the text output – I feel it does not intuitively roll up results – you can dump stats to a file which you can view in a UI.

So I can profile the script:

python -m cProfile -o stats.profile myscript.py

Note that it can sometimes add a lot of overhead.

And after installing runsnakerun I can view the stats:

runsnake stats.profile

which shows me the following (RunSnakeRun GUI screenshot):

So there are a few things that jump out as easy fixes. First, the app spends around 1/3rd of its time appending colors. Second, finding the median takes a significant amount of time as well – most of it in the sorted call – though this is not visible in the image above. The other lines of code are not significant enough to display.

Optimizing

I have a few quick rules of thumb for optimizing:

  • creating/appending to lists is slow due to memory allocation – try to use list comprehensions
  • try not to run operations that create copies of objects unless it’s required – this is also slow due to memory allocation
  • dereferencing is slow: if you’re doing mylist[i] several times, just do myvar = mylist[i] up front
  • use libraries as much as possible – many are written in C/C++ and fast

Other than that, make use of search like Google or DuckDuckGo. You can tweak your Python code (you might discover you’re doing something wrong!), use Cython, write C libraries, or find another solution.
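As a small sanity check of the first rule of thumb, the two styles below build identical lists, and timeit will generally show the comprehension winning (exact numbers vary by machine):

```python
import timeit

def with_append(n):
    out = []
    for i in range(n):
        out.append(i * 2)   # repeated attribute lookup + call per element
    return out

def with_comprehension(n):
    return [i * 2 for i in range(n)]  # the list grows inside the C loop

assert with_append(1000) == with_comprehension(1000)

t_app = timeit.timeit(lambda: with_append(1000), number=200)
t_cmp = timeit.timeit(lambda: with_comprehension(1000), number=200)
print(t_app, t_cmp)  # the comprehension is usually the smaller number
```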

So profiling tells me that appending is slowing my code. I can get around this problem by declaring the list once and keeping track of how many elements I “add”. I also know that sorted() is not preferred because it creates a copy of the original list. I can instead use list.sort() and sort the list in place. I make these changes and run the code, and see the output is still good so I probably did not break it. Let’s time it.

500 loops, best of 3: 2.51 sec per loop

That’s almost 20% faster! Not bad for a few minutes of effort.
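The sorted() vs. list.sort() difference is easy to demonstrate in isolation:

```python
data = [3, 1, 2]

# sorted() allocates and returns a brand-new list...
copied = sorted(data)
assert copied == [1, 2, 3]
assert data == [3, 1, 2]      # original order untouched
assert copied is not data

# ...while list.sort() reorders the existing object in place.
data.sort()
assert data == [1, 2, 3]
```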

For completeness, here is the modified code:

import cv2
import numpy as np
import math
from datetime import datetime

def _median(list, ne):
    """
    Return the median.
    """

    #sort inplace
    if ne != len(list):
        list = list[0:ne]
    list.sort()
    if len(list) % 2 == 0:
        #have to take avg of middle two
        i = len(list) / 2
        #int() to convert from uint8 to avoid overflow
        return (int(list[i-1]) + int(list[i])) / 2.0
    else:
        #find the middle (remembering that lists start at 0)
        i = len(list) / 2
        return list[i]

def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    #TODO: optimize this, maybe precalculate height & widths
    num_elems = block_size * block_size
    block_b = [0] * num_elems
    block_g = [0] * num_elems
    block_r = [0] * num_elems
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            num_elems = (r_max-row_min) * (c_max-col_min)
            block_i = 0
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b[block_i] = img[r_i, c_i, 0]
                    block_g[block_i] = img[r_i, c_i, 1]
                    block_r[block_i] = img[r_i, c_i, 2]
                    block_i += 1

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b, num_elems)),
                int(_median(block_g, num_elems)),
                int(_median(block_r, num_elems))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

Repeat

Now that I optimized the code a little, I repeat the process to make it even better. I try to decide what should be slow, and what should not. So, for example, sorting is not very fast here, but that makes sense. Sorting is the most complex part of this code – the rest is iterating and keeping track of observed colors.

Final Thoughts

After having optimized a few projects I have a few final thoughts and lessons learned:

  • Optimizing a slow/bad algorithm is a waste of time, you’ll need to redesign it anyways
  • The granularity of the profiler is sometimes a little funny. You can get around this by making sections of code into functions – metrics on functions are usually captured.
  • The profile will group the calls to the library functions together even if called from different places. So be careful with metrics for libraries/utils – i.e. numpy.
  • Use search to find ways to optimize your code.

If all else fails, you can implement slow parts in C relatively painlessly. This is the option I usually go with, and I will detail it out with some code in the next few weeks.

Have fun speeding up your code!

Python to JavaScript: Compilers vs. Translators

One thing Alex and I always say about RapydScript is that it is really JavaScript with a Pythonic syntax. Following that, people often ask me what that means for them – they want to know how this affects how they develop code, and how it is different from something like Pyjamas/Pyjs. I want to answer that here for anyone who has wondered what “RapydScript is Pythonic JavaScript” means, how compilers like Pyjs are different from translators like RapydScript, and why I (full disclosure) prefer translators.

An Example

A clear example of the difference is with division. Say your source looks like:

a = 5
b = 0
c = a / b

The translator will output JavaScript like:

var a = 5;
var b = 0;
var c = a / b;

The translator process is very simple to understand – it’s pretty much just changing the syntax, but this leads to some gotchas. When this code runs, c will be set to Infinity, a JavaScript constant, while the original Pythonic source would have raised an exception.
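For contrast, here is what the original Pythonic source does at the same spot:

```python
a = 5
b = 0

# Python raises where the translated JavaScript silently yields Infinity.
try:
    c = a / b
except ZeroDivisionError:
    c = 'raised'

assert c == 'raised'
```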

A compiler, on the other hand, attempts to mimic Python exactly so it may have an output more similar to:

var py_int = function(val){
    this.val = val;
};
py_int.prototype.div = function(denom){
    if (denom.val == 0) {
        throw ZeroDivisionError;
    }
    return new py_int(this.val / denom.val);
};
var a = new py_int(5);
var b = new py_int(0);
var c = a.div(b);

The variables here will all be objects that include methods for all the operations. The division doesn’t directly divide 2 numbers, it runs the divide method in the objects. So when this code runs it will throw a ZeroDivisionError exception just like Python does.

So what are the tradeoffs?

Writing using a compiler is nice because you get to think like a Python developer, which can abstract away some things like cross-browser support. It also means that, in many cases, code can be moved between the frontend and backend with no changes. So it’s easy to have Python code compile to JavaScript. But if you’re doing something that’s JavaScript specific, like getting HTML elements or taking in keyboard inputs, the compiler you’re using will have to have a working and documented API for accessing these functions.

The real drawback, though, is that the output code is slower, significantly heavier, and, with the compilers I’ve used, unreadable. There are several issues I have with this, but it really boils down to this: unreadable code leads to useless tracebacks when running code in a browser, and apps don’t run (or run well) on mobile devices.

Translators take a very different approach and don’t try to run just like Python. The idea here is to do 80% of the work for 20% of the cost. Your code may look like Python but it will run like JavaScript, as with the division example. There’s a lot of overlap between the two languages, but they’re not exactly the same, so the main drawback here is that you might see some unexpected, but predictable, behavior. This is easy to manage if you know JavaScript, and if not, the differences are easy to learn.

There are some nice benefits to using a translator. Your input code and output code will look very similar. They will be roughly the same size and it will be easy to map a line with an error in the JavaScript output back to the Pythonic input for easy debugging. The output will be on the order of kB instead of MB and will run faster.

So those are the main differences between compilers and translators like RapydScript. RapydScript is really JavaScript behind the scenes so it will behave differently than Python, which lets it run a lot more efficiently.

Which is right for you?

The choice of which to use comes down to a few things:

  • First, something I have not mentioned yet: libraries. In general, JavaScript libraries work better with translators and Python libraries work better with compilers. The one caveat is that compilers can only translate pure Python, so if you’re using something like numpy, which uses C, there’s no easy answer for you.
  • Second, if you don’t know any JavaScript, you will have a tougher time with a translator. Speaking from experience though, JavaScript is not very different from Python, and I encourage you to try a translator because you’ll save time debugging, and your app will be more maintainable in the long term.
  • Lastly is performance requirements. If performance is important, you will want to use a translator over a compiler.

I think it makes sense for a beginner who may be writing a simple internal app with no performance requirements to use a compiler. But if you’ll be writing many apps, or even one complex one, I would go with a translator. Having picked up the differences between JavaScript and Python, I now exclusively use RapydScript, a translator, even when I don’t need the performance. It’s easier to debug in the browser where the code actually runs, which saves me time.

A Simple web2py App

I know this is waaaaay overdue, but better late than never. Over the last year and a half I’ve switched over to using Linux and away from Pyjamas (I’m still using Python though). I’ll have more info in future posts over the next couple of weeks. Don’t worry if you’re using Windows though, the steps here work in both Linux and Windows (thank you Python!).

Installing and Running web2py

Start by downloading web2py (source – version 2.2.1 for me). I am doing this in XP, so I extracted the code to C:\Projects\web2py. Open a command window, navigate to your web2py dir and start it up with the command

> python web2py.py -a tmp_pass

This starts up web2py on 127.0.0.1:8000 with the admin password set to tmp_pass. You can use the -h option to see how to set web2py up in other ways. One thing to note is that if the server is running on 127.0.0.1, you won’t be able to access it using your real IP address. If you want to test your server from an external computer, have web2py bind to the IP 0.0.0.0.

With web2py running, I could then visit http://127.0.0.1:8000, where a somewhat hidden admin interface button let me log in using my admin password, tmp_pass.

Creating an App

There is a panel on the right, with a section called “New simple application” which you can use to create an app. This sets up all the template files for you. In my case I created a program called cyborg.

The server shows a list of files which I could edit. It’s a lot of code, definitely more than I wanted for my app, but that should be easy to clean up later on. With the web2py server up, I navigated to http://127.0.0.1:8000/cyborg/, which showed a page with some interesting bullets, including:

  • You visited the url /cyborg/
  • Which called the function index() located in the file web2py/applications/cyborg/controllers/default.py
  • The output of the file is a dictionary that was rendered by the view web2py/applications/cyborg/views/default/index.html

These are the MVC files discussed in my previous post.

Simplifying the App

I decided to investigate each of the steps taken to run the code and try to trim unneeded code: 1) Well, this one is obvious.

2) The default code has calls for using databases, uploading files, etc. – actions I don’t plan on supporting, at least not yet. First, I want to get comfortable with everything. I changed default.py to the simplest function possible:

# -*- coding: utf-8 -*-

def index():
    return dict(message="Hello World", message2="How's it going?")

3) The view should get the dictionary from index() and render it. Really, what’s happening is that the view is loaded, and any Python code embedded in it can use the dictionary returned by index(). Python code is embedded between {{ }}:

{{if 'message' in globals():}}
<h3>{{=message}}</h3>
{{pass}}
<br>
{{if 'message2' in globals():}}
<h3>{{=message2}}</h3>
{{pass}}

The only real gotcha I found was that you need to have a pass aligned with every if. I think this is because there isn’t a way to unindent in the HTML files.
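Conceptually, the dict returned by index() becomes the namespace the view's embedded code runs in. Here is a toy illustration of that idea – web2py's real template engine is far more capable, and the render() function here is my own name, not part of web2py:

```python
import re

def render(template, context):
    # Substitute each {{=expr}} with the expression evaluated in a
    # namespace built from the controller's returned dict.
    return re.sub(r"\{\{=(.+?)\}\}",
                  lambda m: str(eval(m.group(1), dict(context))),
                  template)

page = render("<h3>{{=message}}</h3>", {"message": "Hello World"})
print(page)  # <h3>Hello World</h3>
```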

Cleanup

Since I only plan on using a couple of functions to start, I wanted to remove all the files that seemed unnecessary. I removed things one at a time, checking that the page still worked after each step. I removed these folders completely (I think most get recreated by web2py when the app runs, but with just dummy files):

  • databases (this folder has the database files)
  • errors (this is a logs folder)
  • languages (translation files)
  • models (usually where you put information about your database)
  • private
  • static
  • uploads

And cleaned the folders:

  • controllers: only kept default.py
  • views: only kept views\default\index.html

I reloaded http://127.0.0.1:8000/cyborg/ and was happy to see the messages I passed in through the dict.

Pyjamas Alternatives for Web Development

In my last post I mentioned that I’m switching away from Pyjamas. In this one, I will explain the alternatives I looked into, as well as the pros and cons of each. While doing my research (which started almost half a year ago), I stumbled upon multiple Python-inspired alternatives to JavaScript, many of which have been abandoned, and only a few of which have reached the stage where they’re suitable for large projects. Out of those I have settled on a solution that works for me, but there is no perfect framework/language; your choice could be different than mine. Let’s look at a few frameworks and decide how usable each one is for writing a full application.

Pyjamas

While I might have issues with how Pyjamas approaches web development, I have to admit that it has gotten quite far, and it’s quite possible to write a usable application with it.

Verdict: usable – if overhead is not a problem

Skulpt

Skulpt does not process your Python code at all until execution time – a unique approach that I haven’t seen any other framework take. It’s essentially a browser-based eval() for your Python. This is a neat concept, since it can be used to extend another framework to support Python-like eval() (something that even Pyjamas doesn’t do), allowing your script to write and execute code on the fly. While I’m not a big fan of the eval() function, this would definitely be neat for a framework that aims for complete Python compatibility. Similarly, one of Skulpt’s major disadvantages is having to include the entire “compiler” as a script in your page; if this project ever grows to have the same functionality as Pyjamas, the overhead of loading the extra logic could be enormous. The second disadvantage is speed: Python interpreted and executed in the browser on the fly will never be as fast as even boilerplate-based Pyjamas, let alone pure JavaScript. The final problem is that you can’t obfuscate/compress your code at all, since the interpreter needs to make sense of your Python code (including the whitespace); this might be a non-issue for open-source projects.

Verdict: not usable

pyxc-js

Like Skulpt, this compiler is unsuitable for any large-scale web development, but it has an elegant import mechanism, which is something many other frameworks/compilers lack. Just as Skulpt can be used to implement eval(), this compiler can be used to implement proper import in your compiler (I know Pyjamas has proper import, but it’s heavily integrated into the rest of the framework). This project has been abandoned, and the author does not respond to emails. Aside from its import mechanism (which is based on a directed graph the compiler builds at compile time, eliminating any unused and repeated imports), this project doesn’t have much to offer. The documentation is lacking, it took me a while (and some tweaks) to be able to compile anything, the included test cases don’t work, and the generated code is rather messy (it often spits out multiple semicolons at the end of a line, blank lines get semicolons too, and if I remember correctly there are occasional bugs in how your code gets compiled). On the plus side, it offers an internal compression mechanism that can shorten your code further (although I’m not sure how much I trust it, given the quality of the non-compressed code it generates).

Verdict: not usable

PyCow

This was the first framework aside from Pyjamas that I felt could support large projects. PyCow compiles Python code into MooTools-based JavaScript. More importantly, its source code is cleanly laid out and the original developer is very responsive and helpful (despite having abandoned the project). It relies on templates for many of its conversions rather than hard-coding them in the source or implementing alternative names for JavaScript functions. For example, list.append() gets converted to list.push() at compile time using a template specified in a hash table. This is a great idea, since it makes the compiler code cleaner and the final output stays closer to real JavaScript, introducing less overhead (and less need for a large stdlib). The disadvantage of this approach is that we have no good way of knowing whether a certain method is part of our language and needs to be replaced by its JavaScript equivalent, or whether it belongs to another API and should be left alone. Here is an example where replacing the append() would break our code:

container = $('#element'); // jQuery used to create a container of elements
...
container.append(item);    // replace this append() and jQuery will complain

PyCow is also MooTools-based, and I’d prefer a language that doesn’t tie me to a certain framework. I investigated rewriting it such that it generates classes using pure JavaScript, but later realized it would not be worth the effort. One other minor pet-peeve is that PyCow is written for Python 2.5, which happened to be around the same time the Python developers were rewriting their AST parser. As a result, PyCow uses a transitional AST package and needs to be rewritten to support the later one that came after 2.6 (2.5 is still closer to the current AST, however, than the AST parser Pyjamas uses from the compiler module, which has been deprecated as of Python 2.6). This framework also doesn’t support any importing of modules, but I managed to fix that in my local version by ripping out the import mechanism from pyxc-js. Unfortunately, I’ve moved on since then, abandoning my PyCow enhancements (let me know if you’re interested in continuing from where I left off, however, and I can send you the source code).

Verdict: usable

py2js (Variation 1)

There are at least 3 different frameworks by this name, and they’re only vaguely related to each other. The first one has been abandoned since 2009, but it impressed me for several reasons. First of all, it takes PyCow’s templating one step further, allowing you to map virtually any function/method using more complex filtering. For instance, %0 represents the class that the method is being called on, %1 represents the first argument, and %* represents all of them. You could use this to add very powerful templating. For example, if you want to output a list of elements as a comma-delimited string in Python, you could use join() as follows: ', '.join(list). In JavaScript, however, join() is a method of the array/list (which actually makes more sense to me) rather than the string, requiring you to do list.join(', ') instead. But say you wanted to auto-convert your Python joins to JavaScript – all you would have to do is add the following rule to your template hash:

'%0.join(%1)' : '%1.join(%0)'
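As a toy illustration of that rule – using a string-level regex, whereas the real compiler matches templates against the AST, and with a function name of my own invention:

```python
import re

def translate_join(src):
    # In the spirit of '%0.join(%1)' : '%1.join(%0)': %0 is the string
    # being called on, %1 the argument -- swap them for JavaScript.
    return re.sub(r"('[^']*')\.join\((\w+)\)", r"\2.join(\1)", src)

print(translate_join("out = ', '.join(names)"))  # out = names.join(', ')
```

An AST-based version is immune to many of the false matches a regex would make, but as the jQuery container.append() example above shows, even the AST can't tell whose method it is renaming.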

I was able to extend this variation of py2js to support multiple Python features with JavaScript alternatives this way. As mentioned earlier, however, this method is susceptible to renaming functions whose names you have no control over.

This compiler relies on templates heavily, introducing another advantage most other compilers lack. Once the code has been converted to AST, it can be manipulated almost entirely using templates (with little need to write additional code). With some tweaks, it can become a universal source-to-source compiler (well, universal is a bit strong of a word here, statically typed languages are a whole other animal). Imagine being able to convert Python to Perl, Ruby, or any other dynamically typed language just by generating a map of logic equivalents. Something like this, for example, could allow you to convert the syntax of a Python function to Perl (block would then further get interpreted by similar templates for loops, conditionals, etc.):

### Input Template
def <method>(%*):
    <block>

### Output Template
sub <method>{
    my (%*) = @_;
    <block>;
}

Admittedly, the fancier you get with the language (i.e. using eval()), the more likely you are to run into problems with the template breaking. This compiler was inspired by py2py, another project designed to clean up Python code. In theory, any input language can be supported as well, as long as you write your own AST parser that converts the code to a Pythonic AST tree (or a subset of one). This is a fun project in itself, and I might continue it when I have more time. As a compiler, however, this project is not yet suitable for prime time, lacking a lot of functionality, including classes.

Verdict: not usable

py2js (Variation 2)

The only similarity between this py2js and the previous one is the name; variation 2 just happens to be another project with the same name as variation 1. In fact, even the compile mechanism for this py2js is very different from the other compilers mentioned before. The compiler works by “running” your code at compile time. You write your function/class in Python, then add a @JavaScript decorator to it if you want it included in the compiled output. You then “run” your program and it dumps JavaScript to STDOUT. While it’s a bit awkward to add a @JavaScript decorator to every single function, this offers something most other compilers (including Pyjamas) lack. Since it actually runs your code, you get the benefit of partial “compile-time” error catching, a feature typically only seen in statically typed languages. You can be sure, for example, that all syntax errors will be caught, as well as undeclared variables/methods. The disadvantage is that this compiler is a bit too strict: all your code has to be pure Python, and all your variables (or stubs for them) have to exist at compile time. So, for example, if you want to use something from jQuery, you had better have a jQuery stub written in Python (it doesn’t need to mimic jQuery functionality, however, only to have the method names defined). This project was also abandoned several years ago in a premature stage, which is where the 3rd variation comes in.

Verdict: not usable

py2js (Variation 3) (now Pyjaco)

The 3rd variation started off as a fork of the 2nd variation, after 3 developers decided to take over the original, abandoned, project. Over time, this variation morphed into a powerful compiler, whose functionality is only outdone by Pyjamas itself. It added support for Pythonic assertions, tuples, classes, static methods, getattr()/setattr(), *args/**kwargs, better integration with JavaScript (allowing you to call JavaScript logic by wrapping it inside JS() or automatically handling the conversion at compile time if you use one of predefined globals (document, window, etc.)) as well as many other features you would expect from Python. It has its own standard library with most Python stdlib implemented.

It, however, also falls short in a few areas. Like Pyjamas, it tries to build Pythonic error handling on top of JavaScript. Admittedly, this task might be much easier here, due to partial “compile-time” error catching, which Pyjamas lacks. It also doesn’t properly translate Python’s math methods to JavaScript, a task that shouldn’t be too hard, however.

Additionally, while it’s convenient to have document and window be treated like special classes, this can add confusion when coding. In some cases your strings and numbers will get converted to JavaScript equivalents automatically, while in others you have to do so manually (when invoking a native JavaScript method and passing it a string, for example). A py2js string is a wrapper around a native JavaScript string with additional Python functionality; the same applies to many other objects. This is a good thing, since it allows support for Python’s methods without having to overwrite native objects (and I believe Pyjamas takes the same approach for its primitive objects – arrays/strings/etc.). This is a minor pet-peeve, however, and would not be a problem most of the time.

Another annoyance that I already mentioned earlier, is having to explicitly define which functions you want compiled via @JavaScript tag + str() call on the function or class (but this is just a matter of personal preference).

The biggest problem with this third variation, however, is copyright issues (which the developers are trying to resolve now – they could already have been resolved). In its quest to acquire similar functionality as Pyjamas, this compiler has “borrowed” a lot of code from other projects (including the first py2js I mentioned). This py2js project itself carries MIT license, not all of the projects it borrows from, however, are compatible with that license. Additionally, some of these projects were not given proper mention in the copyright notice due to an oversight by one of the developers. As a result, the developers had a falling out, and now forked a couple alternative versions of this project, one of which is still maintained and aims to rewrite the code in question and/or get proper authorization to use it.

Additionally, if this project joins forces with the new Pyjamas, I believe both projects will benefit a lot. The code generated by py2js is significantly smaller (Pyjamas averages 1:10 ratio for non-compiled:compiled code, while py2js is closer to 1:2 – not counting stdlib) and more readable than Pyjamas, while retaining most of Pyjamas’ features. Pyjamas, on the other hand, already has its own implementation for most of py2js code that’s subject to copyright issues, which could be used to solve the biggest problem with py2js. Finally, py2js’ compile-time error catching makes it easier to build Pythonic exception handling on top of JavaScript. And to remedy the annoyance of @JavaScript decorators, the AST parser can be used to append them automatically to temporary version of the code. This can also be used as an opportunity to update Pyjamas to use the latest AST parser implementation, which it will need anyway if it ever wants to be compatible with Python 3 (which is missing the deprecated AST implementation).

Verdict: usable

UPDATE: I’ve been informed that this project has now become Pyjaco, and the copyright is no longer an issue. So for those who want to stay closer to Python, this is a very solid alternative. Christian, the project leader, also informed me that not all of the details I mentioned are accurate. Also, the developers seem to have misinterpreted the original post as me claiming that RapydScript (my own compiler) is the best for everything. That was not my intent, and I tried to avoid this issue by mentioning in the first paragraph of the original article that my choice is based on my own projects and the flexibility they need (mostly Grafpad), even stating that “your choice could be different than mine”. I hope they don’t hold a grudge against me because of this misunderstanding.

PyvaScript

PyvaScript is the opposite extreme of Pyjamas. It’s perhaps the closest to pure JavaScript of all the Python-like languages. It was one of the first alternatives I looked into, and admittedly I originally decided it would be unsuitable for large projects. There are numerous disadvantages compared to other Python frameworks. Its stdlib is puny, and most of the Python methods you would want aren’t available. It not only lacks import logic, but even classes aren’t implemented. It can’t handle negative indexes for arrays. Its compiler isn’t even based on the Python AST (it uses OMeta); as a result, it’s often oblivious to errors other compilers would catch, and at other times it chokes on issues that other compilers have no problem with.

Its advantages, however, provide the perfect fit for the niche that Pyjamas left open. It’s the most JavaScript-compatible solution, supporting almost all JavaScript functionality out of the box. I can manipulate other JavaScript and the DOM without the need for any wrappers or special syntax. Need to use jQuery? Just do it the same way you would in JavaScript: $(element).method() – the compiler won’t complain about the dollar sign. PyvaScript also supports anonymous functions, something regular Python lacks (except one-liner lambdas), which makes it easier to write your code like JavaScript (I didn’t see the advantage of this at first, but believe me, when writing web apps it really helps – especially when using other JavaScript code as a tutorial).

Since PyvaScript adds no bells or whistles to your code, nor requires any sort of wrappers, your code behaves a lot more like native JavaScript, allowing you to use closures more effectively and to bind functions to objects dynamically without the language complaining. At first this seemed like a bad idea, but after playing with it, I realized that with proper use of this, I can actually write cleaner code than with Python itself. Most importantly, PyvaScript is extremely lightweight, adding almost no overhead to the generated code, and does not force a framework (MooTools, jQuery, etc.) on the user. I also realized that PyvaScript’s lazy compilation has its own advantages (which I will explain later).

Verdict: usable – but doesn’t alleviate much pain from plain JS

CoffeeScript

CoffeeScript shares a structure similar to Python’s, so it deserves a mention as well. If you ignore its ugly syntax and poor choice of variable scoping (preferring globals over locals), you will see that it has all the same advantages as PyvaScript, which makes it a good candidate for a Pyjamas replacement as well. It has a similar feel to Python (although it feels closer to Ruby), introduces list comprehensions, and even adds classes (something PyvaScript does not). It also adds namespaces, preventing variables in different modules from interfering. If I invert the scoping design, remove all the junk variables like on/off (synonyms for true/false), and modify the syntax to use Python tokens, this will be my ideal language for web development – but that’s a project for later.

Verdict: usable

And the winner is…

RapydScript, which I didn’t even mention yet. RapydScript (best described as PyvaScript++) is my wrapper around PyvaScript that provides the best of both worlds (for me at least). I’ve made a few enhancements (abusing PyvaScript’s lazy compilation to allow me to auto-generate broken Python code that becomes proper JavaScript once compiled by PyvaScript) that achieve almost everything I would want, and just about all of CoffeeScript’s functionality. Some of the new RapydScript features include support for Pythonic classes including inheritance (single inheritance only, but you can bind functions from any class to fake multiple inheritance), Pythonic optional function arguments, anonymous functions now supported in dictionaries/object literals (something PyvaScript chokes on), a beefed-up stdlib (also optimized already-implemented methods from PyvaScript’s stdlib), support for multi-line comments/doc-strings, preliminary (compile-time) checking for syntax errors and issues that PyvaScript chokes on (because of minor bugs in PyvaScript), module importing (currently everything gets dumped into a single namespace, but the compiler does warn about conflicting names), checking proper use of Math functions, and automatic insertion of the ‘new’ keyword when creating an object (not sure why CoffeeScript doesn’t already do the same).

To me, the main advantages of RapydScript over PyvaScript are the ability to break down my modules into separate files (like I would in Python), an easier time building large projects due to a proper class implementation (class declaration is done the same way as in native Python), Pythonic declaration of optional arguments to a function (I’m not a big fan of JavaScript’s solution for optional arguments), and support for anonymous functions as hash values (which allows me to build object literals the same way as in JavaScript). As for other projects, the main advantages of RapydScript are seamless integration with the DOM and other JavaScript libraries/modules (just treat them like regular Python objects), the ability to use both Python and JavaScript best practices as well as rely on JavaScript tutorials (one of the biggest problems for projects in their alpha stage is lack of documentation; for RapydScript you really don’t need any), and lack of bloat (RapydScript gets me as close to Python as I need without forcing any boilerplate on me). As a nice bonus, I think I can add support for the advanced compilation mode of Google’s Closure compiler with only minor tweaks.

If you want to give RapydScript a try, you can find it in the following repository: https://bitbucket.org/pyjeon/rapydscript. Keep in mind, this is still early alpha version and some of the functionality might change. Chances are, I will eventually rewrite this, using CoffeeScript as base, but the functionality should stay very similar (after all, I am rewriting Grafpad in this, and I don’t want to rewrite it a third time). In my next post, I will dive into more detail about RapydScript, for those interested in using it and needing a tutorial.

Code Tags and Syntax Highlighting

UPDATE: Now that I’m using WordPress, I no longer rely on this. With markdown and markdown syntax highlighting plugins, all this logic is automatically handled for me at page render time. This is still a good reference for people using blogspot or implementing their own syntax highlighting on the back-end. Also, as Adam has mentioned, for blogspot there are other alternatives available that are simpler to set up if you don’t want to do your own implementation.

I really like the flexibility of blogspot, it’s one of very few free blogging platforms that allows me to customize just about any feature of my blog, from custom CSS stylesheets to unique JavaScript snippets. Naturally, I was surprised that there was no <code> tag to wrap code blocks in despite so many existing programmer blogs. Luckily, a quick search revealed another blog that resolved this via a simple CSS tag. So I followed the same advice, but somehow staring at monotone text in a code block was about as satisfying as coding in notepad.

As I started investigating, I found multiple solutions, ranging from online parsers to various language libraries. My favorite was pygments, a Python library (are you surprised?) for syntax highlighting. After a little more searching, I found the following article that not only explains how to use pygments but also writes a quick script for parsing Python files. The script had a small glitch: instead of indir=r'', I had to use indir=r'.', so the code I started with ended up looking like this:

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
import os

formatter = HtmlFormatter()

indir= r'.'
for x in os.listdir(indir):
    if x.endswith('.py'):
        infile = os.path.join(indir, x)
        outfile = infile.replace('.py', '.html')
        text = ''.join(list(open(infile)))
        html = highlight(text, PythonLexer(), formatter)
        f = open(outfile, 'w')
        f.write(html)
        f.close()
        print 'finished:', x

I then decided to go a step further and update the script to handle multiple programming/markup languages, since I also use CSS, HTML, XML, and JavaScript in my blog. But creating a separate file for every snippet of code I decide to use in my blog post didn’t seem like a good solution.

Instead, I decided to parse out and highlight all the <code> tags from a single file. This would allow me to type an entire blog post as one text file and run it through the parser once, instead of combining multiple files into a single blog post. I was initially thinking of using Python’s regular expression library, but decided to use an XML parsing library instead. It just seemed like the right tool for the job: it can handle additional attributes added to the <code> tag, as well as edit the document tree in place. In other words, with an XML parser I’m much less likely to screw something up with my future “improvements”.

There are several XML libraries available in Python; I decided to go with xml.dom. The main reason I chose it is that its naming scheme is the same as that of Pyjamas and JavaScript’s DOM methods (unlike lxml, whose method names feel awkward to me). xml.dom also happens to be the library Pyjamas Desktop uses for XML parsing (I’ll write another tutorial on that soon).

As far as applying correct syntax highlighting based on the language, there are a few ways to go about it. One is to use the guess_lexer() method from pygments, which can usually detect the correct language automatically. However, a lot of my code blocks are two-liners that would probably look the same in many different languages, so I decided against that. As a blog writer, I should know what language I’m posting the code in anyway, so it’s very easy to specify it myself. Since I’m using a real XML parser, I can add as many attributes to my <code> tag as I wish. So I decided to specify the language via the “class” attribute (i.e. <code class='python'>).

I then imported the lexers I wanted to use, and defined a dictionary mapping “class” names to lexers (note that omitting the class attribute entirely defaults to non-highlighting TextLexer):

from pygments.lexers import HtmlLexer, JavascriptLexer, CssLexer, XmlLexer, \
                            PythonLexer, TextLexer
import xml.dom.minidom

lexers = {'': TextLexer,
          'html': HtmlLexer,
          'js': JavascriptLexer,
          'css': CssLexer,
          'xml': XmlLexer,
          'python': PythonLexer}

The next step was to loop through all of these code tags, selecting the correct lexer based on the “class” attribute, and applying it to the contents. The last step was to convert the XML DOM tree back to xml format before outputting it to the file. Trying something like this, however, will not work:

for element in code_tags:
    lexer = lexers[element.getAttribute('class')]()
    element.firstChild.nodeValue = \
        highlight(element.firstChild.nodeValue, lexer, formatter)
html = doc.toxml()

The problem with the above code is that xml.dom automatically escapes the < and > angle brackets to &lt; and &gt;, respectively, so that the node value isn’t interpreted as part of the tree. This means all of the pretty markup pygments added will be shown to the reader as literal text instead of being interpreted by the browser. Since I’m outputting all of the “XML” back to the document anyway, I figured there is no harm in letting xml.dom “interpret” the pygments output, so I replaced the last line of the for loop with the following:

element.replaceChild(xml.dom.minidom.parseString(
    highlight(element.firstChild.nodeValue, lexer, formatter))
    .documentElement, element.firstChild)
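
The escaping behavior and the replaceChild() workaround can both be seen in isolation with just the standard library (the <span> markup below stands in for pygments output):

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<code>x</code>')
el = doc.documentElement

# Assigning markup to nodeValue gets it escaped on output...
el.firstChild.nodeValue = '<span>x</span>'
escaped = doc.toxml()

# ...while parsing it and grafting the resulting element into the
# tree keeps the markup intact.
el.replaceChild(
    xml.dom.minidom.parseString('<span>x</span>').documentElement,
    el.firstChild)
raw = doc.toxml()
```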

This worked great for Python, CSS, and Javascript, but HTML and XML got “interpreted” into the tree before I could pass them to pygments (element.firstChild.nodeValue did not exist, since the contents were parsed as child elements rather than text). The trick was to apply the reverse of the previous solution: serialize the children back into a string via the toxml() call before highlighting. The problem is, if you call toxml() on the entire element, you’ll end up also syntax-highlighting the <code> tag itself, which we’re trying to keep hidden from the viewer. And since a nodeList doesn’t have a toxml() method, you’ll simply have to loop through each child individually, as follows:

for child in element.childNodes:
    element.replaceChild(xml.dom.minidom.parseString(
        highlight(child.toxml(), lexer, formatter))
        .documentElement, child)

The work is almost complete. If you’ve been following along, adding these lines to your code and attempting to run it, you probably noticed that xml.dom throws a syntax error when it tries to parse your text file. The reason is that xml.dom requires all of the file’s contents to be inside a single root element, and this is true of all XML content (i.e. webpages are always inside the <html> tag, svg images are always inside the <svg> tag). My initial solution while testing was to wrap the text body in a fake <root> tag by hand. It works, but this wouldn’t be much of a convenience tool if it made me jump through extra hoops, so for the final solution I automated it. It’s really not hard: add <root> to the beginning of the string and </root> to the end before parsing it with xml.dom, and at the end simply remove the last 7 characters and the first 28 (to account for </root>, and for <root> plus the <?xml version="1.0" ?> declaration xml.dom slaps on). The final version of the code ended up looking as follows:

from pygments import highlight
from pygments.lexers import HtmlLexer, JavascriptLexer, CssLexer, XmlLexer, \
                            PythonLexer, TextLexer
from pygments.formatters import HtmlFormatter
import os
import xml.dom.minidom

lexers = {'': TextLexer,
          'html': HtmlLexer,
          'js': JavascriptLexer,
          'css': CssLexer,
          'xml': XmlLexer,
          'python': PythonLexer}

formatter = HtmlFormatter()

indir = r'.'
for x in os.listdir(indir):
    if x.endswith('.txt'):
        infile = os.path.join(indir, x)
        outfile = infile.replace('.txt', '.html')
        data = list(open(infile))
        data.insert(0, '<root>')
        data.append('</root>')
        text = ''.join(data)
        doc = xml.dom.minidom.parseString(text)
        code_tags = doc.getElementsByTagName('code')
        for element in code_tags:
            lexer = lexers[element.getAttribute('class')]()
            for child in element.childNodes:
                element.replaceChild(xml.dom.minidom.parseString(
                    highlight(child.toxml(), lexer, formatter))
                    .documentElement, child)
        html = doc.toxml()[28:-7]
        f = open(outfile, 'w')
        f.write(html)
        f.close()
        print 'finished:', x
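
The <root> wrapping and the [28:-7] slicing can be sanity-checked on their own: 28 is len('<?xml version="1.0" ?>') + len('<root>'), and 7 is len('</root>'):

```python
import xml.dom.minidom

text = 'some prose with <code>x = 1</code> inline'
doc = xml.dom.minidom.parseString('<root>' + text + '</root>')

# toxml() prepends '<?xml version="1.0" ?><root>' (28 characters)
# and appends '</root>' (7 characters); slicing strips both.
html = doc.toxml()[28:-7]
```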

You can take this a step further by customizing the <code> tag for each language you use. For example, if you want line numbers, just add the following to the formatter initialization line:

formatter = HtmlFormatter(linenos=True)

Also, don’t forget to generate the CSS classes that actually define the highlighting colors and place them in your blog’s stylesheet. To generate the stylesheet, just run the following command in bash:

pygmentize -S default -f html > style.css

And just to prove that my new parser works, I’ve used it to highlight this entire blog post (note, this was originally posted on Blogspot, this WordPress post no longer uses this mechanism). When I have the time, I might go back and highlight my old posts the same way as well.