As promised, I took a script written in Python and ported parts of it to C. The Python interface is the same, but behind the scenes is blazing fast C code. The example is a little different from what I usually do, but I discovered a few things I had never seen.

I am working off the same example as the previous post, How to Write Faster Python, which has the full original code. The example takes an input image and resizes it down by a factor of N. It takes NxN blocks and brings them down to 1 pixel which is the median RGB of the original block.

## Small Port

This example is unusual for me because it’s doing 1 task, resizing, and the core of this 1 task is a very small bit of logic – the median function. I first ported the median code to C, and checked the performance. I know the C code is better because C’s sorting function lets you specify how many elements from the list you want to sort. In Python, you sort the whole list (as far as I know) and so I was able to avoid checks on the size of the list vs. the number of elements I want to sort.

Below is the code, but I have a few changes I want to mention as well.

myscript2.py:

`import cv2`

import numpy as np

import math

from datetime import datetime

import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')

fast_tools.median.argtypes = (ctypes.c_void_p, ctypes.c_int)

def _median(list, ne):

"""

Return the median.

"""

ret = fast_tools.median(list.ctypes.data, ne)

return ret

def resize_median(img, block_size):

"""

Take the original image, break it down into blocks of size block_size

and get the median of each block.

"""

#figure out how many blocks we'll have in the output

height, width, depth = img.shape

num_w = math.ceil(float(width)/block_size)

num_h = math.ceil(float(height)/block_size)

#create the output image

out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

#iterate over the img and get medians

row_min = 0

num_elems = block_size * block_size

block_b = np.zeros(num_elems, dtype=np.uint8)

block_g = np.zeros(num_elems, dtype=np.uint8)

block_r = np.zeros(num_elems, dtype=np.uint8)

while row_min < height:

r_max = row_min + block_size

if r_max > height:

r_max = height

col_min = 0

new_row = []

while col_min < width:

c_max = col_min + block_size

if c_max > width:

c_max = width

#block info:

num_elems = (r_max-row_min) * (c_max-col_min)

block_i = 0

for r_i in xrange(row_min, r_max):

for c_i in xrange(col_min, c_max):

block_b[block_i] = img[r_i, c_i, 0]

block_g[block_i] = img[r_i, c_i, 1]

block_r[block_i] = img[r_i, c_i, 2]

block_i += 1

#have the block info, sort by L

median_colors = [

int(_median(block_b, num_elems)),

int(_median(block_g, num_elems)),

int(_median(block_r, num_elems))

]

out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors

col_min += block_size

row_min += block_size

return out_img

def run():

img = cv2.imread('Brandberg_Massif_Landsat.jpg')

block_size = 16

start_time = datetime.now()

resized = resize_median(img, block_size)

print "Time: {0}".format(datetime.now() - start_time)

cv2.imwrite('proc.png', resized)

if __name__ == '__main__':

run()

myscript_tools.c

`#include <stdint.h>`

#include <stdio.h>

#include <stdlib.h> // for malloc

#include <math.h>

int cmpfunc (const void * a, const void * b)

{

return ( *(uint8_t*)a - *(uint8_t*)b );

}

int _median(uint8_t * list, int ne){

//sort inplace

qsort(list, ne, sizeof(uint8_t), cmpfunc);

int i;

if (ne % 2 == 0){

i = (int)(ne / 2);

return ((int)(list[i-1]) + (int)(list[i])) / 2;

} else {

i = (int)(ne / 2);

return list[i];

}

}

int median(const void * outdatav, int ne){

int med;

uint8_t * outdata = (uint8_t *) outdatav;

med = _median(outdata, ne);

return med;

}

and to compile:

```
gcc -fPIC -shared -o myscript_tools.so myscript_tools.c
```

Note: I discovered the .so filename should *not* match any Python script you plan to import. Otherwise you will get a “ImportError: dynamic module does not define init function” because Python is trying to import the C code instead of your Python module.

The main change I made was to port the median function to C. I also switched from using a list of colors, to using a numpy array of colors. I did this because it’s easier to pass numpy arrays, and because that is more useful in general. Also, so I wouldn’t have to convert the list to a numpy array on every median call, I crated the numpy array in the resize_median code itself. Before this last change, my benchmark showed I was actually running *slower*. **The faster C code was not making up for all the list –> numpy array conversions!**

After the 2nd fix I timed it

```
python -m timeit "import myscript2; myscript2.run()"
```

and got

```
500 loops, best of 3: 2.04 sec per loop
```

Nice! That’s compared to 2.51 from my last post, so it’s around 18% faster.

## All C (sort of)

I wasn’t happy with the 18% boost – I thought it would be faster! But, it was only a very small portion ported. I decided to port more, but there wasn’t a good natural place to split the resize function up any further, so I ported the whole thing. This way I could also show how to send and receive a whole image back from C – which I do by reference/pointer.

myscript2.py

`import cv2`

import numpy as np

import math

from datetime import datetime

import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')

def c_resize_median(img, block_size):

height, width, depth = img.shape

num_w = math.ceil(float(width)/block_size)

num_h = math.ceil(float(height)/block_size)

#create a numpy array where the C will write the output

out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

fast_tools.resize_median(ctypes.c_void_p(img.ctypes.data),

ctypes.c_int(height), ctypes.c_int(width),

ctypes.c_int(block_size),

ctypes.c_void_p(out_img.ctypes.data))

return out_img

def run():

img = cv2.imread('Brandberg_Massif_Landsat.jpg')

block_size = 16

start_time = datetime.now()

resized = c_resize_median(img, block_size)

print "Time: {0}".format(datetime.now() - start_time)

cv2.imwrite('proc.png', resized)

if __name__ == '__main__':

run()

myscript_tools.c

`#include <stdint.h>`

#include <stdio.h>

#include <stdlib.h> // for malloc

#include <math.h>

int cmpfunc (const void * a, const void * b)

{

return ( *(uint8_t*)a - *(uint8_t*)b );

}

int _median(uint8_t * list, int ne){

//sort inplace

qsort(list, ne, sizeof(uint8_t), cmpfunc);

int i;

if (ne % 2 == 0){

i = (int)(ne / 2);

return ((int)(list[i-1]) + (int)(list[i])) / 2;

} else {

i = (int)(ne / 2);

return list[i];

}

}

void resize_median(const void * bgr_imgv, const int height, const int width,

const int block_size, void * outdatav) {

const uint8_t * bgr_img = (uint8_t *) bgr_imgv;

uint8_t * outdata = (uint8_t *) outdatav;

int col, row, c_max, r_max, c_si, r_si;

int max_val;

const int num_elems = block_size * block_size;

int j, offset;

uint8_t *b_values;

uint8_t *g_values;

uint8_t *r_values;

b_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);

g_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);

r_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);

int out_i = 0;

row = 0;

while (row < height) {

col = 0;

r_max = row + block_size;

if (r_max > height) {

r_max = height;

}

while (col < width) {

c_max = col + block_size;

if (c_max > width) {

c_max = width;

}

// block info:

j = 0;

for (r_si = row; r_si < r_max; ++r_si) {

for (c_si = col; c_si < c_max; ++c_si) {

offset = ((r_si*width)+c_si)*3;

b_values[j] = bgr_img[offset];

g_values[j] = bgr_img[offset+1];

r_values[j] = bgr_img[offset+2];

j += 1;

}

}

// have the block info, get medians

max_val = j;

outdata[out_i] = _median(b_values, max_val);

outdata[out_i+1] = _median(g_values, max_val);

outdata[out_i+2] = _median(r_values, max_val);

// update indexes

out_i += 3;

col += block_size;

}

row += block_size;

}

free(b_values);

free(g_values);

free(r_values);

}

This one is a little more complicated. With multi-dimensional arrays, numpy flattens the data. The new array is [pixel00_blue, pixel00_green, pixel00_red, pixel01_blue, ...] – see the `offset`

variable above for the row/column equation. I also pass pointers with no type, so before using the arrays, I have to typecast them. I realize this example is unusual because the whole script is basically ported, but it illustrates many things that are non-trivial and required some time to figure out.

And for the moment of truth…

```
500 loops, best of 3: 440 msec per loop
```

82% faster!

Although it’s a mostly C at this point.

## Final Thoughts

So in this case, the Python took about 5 times as long to run compared to the C version. When something takes a few milliseconds, 5-10x isn’t that important, but when it gets more complex, it starts to matter (assuming you’re not I/O bound). Here I ported pretty much the whole module. I think this only makes sense with a library, and not code that you will refer to often – Python is just so easy to read.

An idea popped in my head as I was writing this as well. This should be easy to port to RapydScript. JavaScript has the `Image`

object I could use in place of opencv. I wonder where JavaScript would fall on the spectrum.

What could be the results with a cython approach? Personally I don’t have any experience on it, but I would be curious.

I haven’t dug into Cython, but it seems like something that’s full of advantages with very few drawbacks. My guess is that it would be a little slower than C. After thinking about benchmarking RapydScript, I want to do another performance post, so I will add Cython in there as well.