Python & C

Python CAs promised, I took a script written in Python and ported parts of it to C. The Python interface is the same, but behind the scenes is blazing fast C code. The example is a little different from what I usually do, but I discovered a few things I had never seen.

I am working off the same example as the previous post, How to Write Faster Python, which has the full original code. The example takes an input image and resizes it down by a factor of N. It takes NxN blocks and brings them down to 1 pixel which is the median RGB of the original block.

Small Port

This example is unusual for me because it’s doing 1 task, resizing, and the core of this 1 task is a very small bit of logic – the median function. I first ported the median code to C, and checked the performance. I know the C code is better because C’s sorting function lets you specify how many elements from the list you want to sort. In Python, you sort the whole list (as far as I know) and so I was able to avoid checks on the size of the list vs. the number of elements I want to sort.

Below is the code, but I have a few changes I want to mention as well.

myscript2.py:

import cv2
import numpy as np
import math
from datetime import datetime
import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')
fast_tools.median.argtypes = (ctypes.c_void_p, ctypes.c_int)


def _median(list, ne):
    """
    Return the median.
    """

    ret = fast_tools.median(list.ctypes.data, ne)
    return ret


def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    num_elems = block_size * block_size
    block_b = np.zeros(num_elems, dtype=np.uint8)
    block_g = np.zeros(num_elems, dtype=np.uint8)
    block_r = np.zeros(num_elems, dtype=np.uint8)
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            num_elems = (r_max-row_min) * (c_max-col_min)
            block_i = 0
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b[block_i] = img[r_i, c_i, 0]
                    block_g[block_i] = img[r_i, c_i, 1]
                    block_r[block_i] = img[r_i, c_i, 2]
                    block_i += 1

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b, num_elems)),
                int(_median(block_g, num_elems)),
                int(_median(block_r, num_elems))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

myscript_tools.c

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <math.h>

int cmpfunc (const void * a, const void * b)
{
   return ( *(uint8_t*)a - *(uint8_t*)b );
}

int _median(uint8_t * list, int ne){
    //sort inplace
    qsort(list, ne, sizeof(uint8_t), cmpfunc);
    int i;
    if (ne % 2 == 0){
        i = (int)(ne / 2);
        return ((int)(list[i-1]) + (int)(list[i])) / 2;
    } else {
        i = (int)(ne / 2);
        return list[i];
    }
}

int median(const void * outdatav, int ne){
    int med;
    uint8_t * outdata = (uint8_t *) outdatav;
    med = _median(outdata, ne);
    return med;
}

and to compile:

gcc -fPIC -shared -o myscript_tools.so myscript_tools.c

Note: I discovered the .so filename should not match any Python script you plan to import. Otherwise you will get a “ImportError: dynamic module does not define init function” because Python is trying to import the C code instead of your Python module.

The main change I made was to port the median function to C. I also switched from using a list of colors, to using a numpy array of colors. I did this because it’s easier to pass numpy arrays, and because that is more useful in general. Also, so I wouldn’t have to convert the list to a numpy array on every median call, I crated the numpy array in the resize_median code itself. Before this last change, my benchmark showed I was actually running slower. The faster C code was not making up for all the list –> numpy array conversions!

After the 2nd fix I timed it

python -m timeit "import myscript2; myscript2.run()"

and got

500 loops, best of 3: 2.04 sec per loop

Nice! That’s compared to 2.51 from my last post, so it’s around 18% faster.

All C (sort of)

I wasn’t happy with the 18% boost – I thought it would be faster! But, it was only a very small portion ported. I decided to port more, but there wasn’t a good natural place to split the resize function up any further, so I ported the whole thing. This way I could also show how to send and receive a whole image back from C – which I do by reference/pointer.

myscript2.py

import cv2
import numpy as np
import math
from datetime import datetime
import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')

def c_resize_median(img, block_size):
    height, width, depth = img.shape

    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create a numpy array where the C will write the output
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    fast_tools.resize_median(ctypes.c_void_p(img.ctypes.data),
                             ctypes.c_int(height), ctypes.c_int(width),
                             ctypes.c_int(block_size),
                             ctypes.c_void_p(out_img.ctypes.data))
    return out_img


def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = c_resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

myscript_tools.c

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <math.h>

int cmpfunc (const void * a, const void * b)
{
   return ( *(uint8_t*)a - *(uint8_t*)b );
}

int _median(uint8_t * list, int ne){
    //sort inplace
    qsort(list, ne, sizeof(uint8_t), cmpfunc);
    int i;
    if (ne % 2 == 0){
        i = (int)(ne / 2);
        return ((int)(list[i-1]) + (int)(list[i])) / 2;
    } else {
        i = (int)(ne / 2);
        return list[i];
    }
}

void resize_median(const void * bgr_imgv, const int height, const int width,
                   const int block_size, void * outdatav) {
    const uint8_t * bgr_img = (uint8_t  *) bgr_imgv;
    uint8_t * outdata = (uint8_t *) outdatav;

    int col, row, c_max, r_max, c_si, r_si;
    int max_val;
    const int num_elems = block_size * block_size;
    int j, offset;

    uint8_t *b_values;
    uint8_t *g_values;
    uint8_t *r_values;
    b_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);
    g_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);
    r_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);

    int out_i = 0;

    row = 0;
    while (row < height) {
        col = 0;
        r_max = row + block_size;
        if (r_max > height) {
            r_max = height;
        }
        while (col < width) {
            c_max = col + block_size;
            if (c_max > width) {
                c_max = width;
            }

            // block info:
            j = 0;
            for (r_si = row; r_si < r_max; ++r_si) {
                for (c_si = col; c_si < c_max; ++c_si) {
                    offset = ((r_si*width)+c_si)*3;
                    b_values[j] = bgr_img[offset];
                    g_values[j] = bgr_img[offset+1];
                    r_values[j] = bgr_img[offset+2];
                    j += 1;
                }
            }

            // have the block info, get medians
            max_val = j;
            outdata[out_i]   = _median(b_values, max_val);
            outdata[out_i+1] = _median(g_values, max_val);
            outdata[out_i+2] = _median(r_values, max_val);

            // update indexes
            out_i += 3;
            col += block_size;
        }
        row += block_size;
    }

    free(b_values);
    free(g_values);
    free(r_values);
}

This one is a little more complicated. With multi-dimensional arrays, numpy flattens the data. The new array is [pixel00_blue, pixel00_green, pixel00_red, pixel01_blue, ...] – see the offset variable above for the row/column equation. I also pass pointers with no type, so before using the arrays, I have to typecast them. I realize this example is unusual because the whole script is basically ported, but it illustrates many things that are non-trivial and required some time to figure out.

And for the moment of truth…

500 loops, best of 3: 440 msec per loop

82% faster!

Although it’s a mostly C at this point.

Final Thoughts

So in this case, the Python took about 5 times as long to run compared to the C version. When something takes a few milliseconds, 5-10x isn’t that important, but when it gets more complex, it starts to matter (assuming you’re not I/O bound). Here I ported pretty much the whole module. I think this only makes sense with a library, and not code that you will refer to often – Python is just so easy to read.

An idea popped in my head as I was writing this as well. This should be easy to port to RapydScript. JavaScript has the Image object I could use in place of opencv. I wonder where JavaScript would fall on the spectrum.