Python & C

Python CAs promised, I took a script written in Python and ported parts of it to C. The Python interface is the same, but behind the scenes is blazing fast C code. The example is a little different from what I usually do, but I discovered a few things I had never seen.

I am working off the same example as the previous post, How to Write Faster Python, which has the full original code. The example takes an input image and resizes it down by a factor of N. It takes NxN blocks and brings them down to 1 pixel which is the median RGB of the original block.

Small Port

This example is unusual for me because it’s doing 1 task, resizing, and the core of this 1 task is a very small bit of logic – the median function. I first ported the median code to C, and checked the performance. I know the C code is better because C’s sorting function lets you specify how many elements from the list you want to sort. In Python, you sort the whole list (as far as I know) and so I was able to avoid checks on the size of the list vs. the number of elements I want to sort.

Below is the code, but I have a few changes I want to mention as well.

myscript2.py:

import cv2
import numpy as np
import math
from datetime import datetime
import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')
fast_tools.median.argtypes = (ctypes.c_void_p, ctypes.c_int)


def _median(list, ne):
    """
    Return the median.
    """

    ret = fast_tools.median(list.ctypes.data, ne)
    return ret


def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    num_elems = block_size * block_size
    block_b = np.zeros(num_elems, dtype=np.uint8)
    block_g = np.zeros(num_elems, dtype=np.uint8)
    block_r = np.zeros(num_elems, dtype=np.uint8)
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            num_elems = (r_max-row_min) * (c_max-col_min)
            block_i = 0
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b[block_i] = img[r_i, c_i, 0]
                    block_g[block_i] = img[r_i, c_i, 1]
                    block_r[block_i] = img[r_i, c_i, 2]
                    block_i += 1

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b, num_elems)),
                int(_median(block_g, num_elems)),
                int(_median(block_r, num_elems))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

myscript_tools.c

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <math.h>

int cmpfunc (const void * a, const void * b)
{
   return ( *(uint8_t*)a - *(uint8_t*)b );
}

int _median(uint8_t * list, int ne){
    //sort inplace
    qsort(list, ne, sizeof(uint8_t), cmpfunc);
    int i;
    if (ne % 2 == 0){
        i = (int)(ne / 2);
        return ((int)(list[i-1]) + (int)(list[i])) / 2;
    } else {
        i = (int)(ne / 2);
        return list[i];
    }
}

int median(const void * outdatav, int ne){
    int med;
    uint8_t * outdata = (uint8_t *) outdatav;
    med = _median(outdata, ne);
    return med;
}

and to compile:

gcc -fPIC -shared -o myscript_tools.so myscript_tools.c

Note: I discovered the .so filename should not match any Python script you plan to import. Otherwise you will get a “ImportError: dynamic module does not define init function” because Python is trying to import the C code instead of your Python module.

The main change I made was to port the median function to C. I also switched from using a list of colors, to using a numpy array of colors. I did this because it’s easier to pass numpy arrays, and because that is more useful in general. Also, so I wouldn’t have to convert the list to a numpy array on every median call, I crated the numpy array in the resize_median code itself. Before this last change, my benchmark showed I was actually running slower. The faster C code was not making up for all the list –> numpy array conversions!

After the 2nd fix I timed it

python -m timeit "import myscript2; myscript2.run()"

and got

500 loops, best of 3: 2.04 sec per loop

Nice! That’s compared to 2.51 from my last post, so it’s around 18% faster.

All C (sort of)

I wasn’t happy with the 18% boost – I thought it would be faster! But, it was only a very small portion ported. I decided to port more, but there wasn’t a good natural place to split the resize function up any further, so I ported the whole thing. This way I could also show how to send and receive a whole image back from C – which I do by reference/pointer.

myscript2.py

import cv2
import numpy as np
import math
from datetime import datetime
import ctypes

fast_tools = ctypes.cdll.LoadLibrary('./myscript_tools.so')

def c_resize_median(img, block_size):
    height, width, depth = img.shape

    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create a numpy array where the C will write the output
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    fast_tools.resize_median(ctypes.c_void_p(img.ctypes.data),
                             ctypes.c_int(height), ctypes.c_int(width),
                             ctypes.c_int(block_size),
                             ctypes.c_void_p(out_img.ctypes.data))
    return out_img


def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = c_resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

myscript_tools.c

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <math.h>

int cmpfunc (const void * a, const void * b)
{
   return ( *(uint8_t*)a - *(uint8_t*)b );
}

int _median(uint8_t * list, int ne){
    //sort inplace
    qsort(list, ne, sizeof(uint8_t), cmpfunc);
    int i;
    if (ne % 2 == 0){
        i = (int)(ne / 2);
        return ((int)(list[i-1]) + (int)(list[i])) / 2;
    } else {
        i = (int)(ne / 2);
        return list[i];
    }
}

void resize_median(const void * bgr_imgv, const int height, const int width,
                   const int block_size, void * outdatav) {
    const uint8_t * bgr_img = (uint8_t  *) bgr_imgv;
    uint8_t * outdata = (uint8_t *) outdatav;

    int col, row, c_max, r_max, c_si, r_si;
    int max_val;
    const int num_elems = block_size * block_size;
    int j, offset;

    uint8_t *b_values;
    uint8_t *g_values;
    uint8_t *r_values;
    b_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);
    g_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);
    r_values = (uint8_t *) malloc(sizeof(uint8_t) * num_elems);

    int out_i = 0;

    row = 0;
    while (row < height) {
        col = 0;
        r_max = row + block_size;
        if (r_max > height) {
            r_max = height;
        }
        while (col < width) {
            c_max = col + block_size;
            if (c_max > width) {
                c_max = width;
            }

            // block info:
            j = 0;
            for (r_si = row; r_si < r_max; ++r_si) {
                for (c_si = col; c_si < c_max; ++c_si) {
                    offset = ((r_si*width)+c_si)*3;
                    b_values[j] = bgr_img[offset];
                    g_values[j] = bgr_img[offset+1];
                    r_values[j] = bgr_img[offset+2];
                    j += 1;
                }
            }

            // have the block info, get medians
            max_val = j;
            outdata[out_i]   = _median(b_values, max_val);
            outdata[out_i+1] = _median(g_values, max_val);
            outdata[out_i+2] = _median(r_values, max_val);

            // update indexes
            out_i += 3;
            col += block_size;
        }
        row += block_size;
    }

    free(b_values);
    free(g_values);
    free(r_values);
}

This one is a little more complicated. With multi-dimensional arrays, numpy flattens the data. The new array is [pixel00_blue, pixel00_green, pixel00_red, pixel01_blue, ...] – see the offset variable above for the row/column equation. I also pass pointers with no type, so before using the arrays, I have to typecast them. I realize this example is unusual because the whole script is basically ported, but it illustrates many things that are non-trivial and required some time to figure out.

And for the moment of truth…

500 loops, best of 3: 440 msec per loop

82% faster!

Although it’s a mostly C at this point.

Final Thoughts

So in this case, the Python took about 5 times as long to run compared to the C version. When something takes a few milliseconds, 5-10x isn’t that important, but when it gets more complex, it starts to matter (assuming you’re not I/O bound). Here I ported pretty much the whole module. I think this only makes sense with a library, and not code that you will refer to often – Python is just so easy to read.

An idea popped in my head as I was writing this as well. This should be easy to port to RapydScript. JavaScript has the Image object I could use in place of opencv. I wonder where JavaScript would fall on the spectrum.

Displaced Aggression

Displaced Aggression I’m going through a very tough situation in my life – hence the slower development of Grafpad and RapydScript. I won’t get into the details, but basically I’m being sued over a house purchase that fell through. It’s a very stressful situation to be in, and I don’t believe the other party is enjoying it any more than I do. The question then is, why don’t we just discuss it like human beings?

I tried to reason with the other party repeatedly to minimize damages to both. She is not interested in negotiating, her attorney says “she is angry”. She insists that she wants to recover her damages. I still fail to see what those damages are. The mortgage is paid off, she may have incurred a few hundred dollars of extra utility bills she wouldn’t have paid if she moved earlier. I already offered several thousand to make up for it and avoid headaches, we also suggested that she offer her own “just” amount. She refused, her response was that she is angry and intends to take it to court – a court that will take over a year to get to and is just as likely to rule in my favor as in hers.

As a result of the lawsuit, both parties have already incurred several thousand dollars more in lawsuit fees. What did I do that she is so angry about, you may ask? I simply wasn’t able to complete the transaction due to a perfect storm of unfortunate events (recent change in local laws, problems with the bank due to these laws, as well as misinformation from a third party). Another buyer walked away right before me as well, and the seller was angry with them too. I should have seen that as a red flag.

Why am I sharing this? Because I don’t believe that the seller is trying to make a quick buck off of me. If the goal was to maximize her own good, she would have taken the settlement or offered a counter, minimizing harm to both parties, while still maximizing her gain. That is how rational people think. Unfortunately, it seems like it’s not what the seller is after. She is angry. It’s not her own good or damages she cares about, it’s maximizing harm to me – even if it means incurring the same harm herself in the process. Why? Because she is angry, because she wants to “punish” someone for the pain she feels.

When you think like this, you will eventually find someone to punish. Problem is, you’re punishing another victim, you’re not addressing the real cause of the problem. She was angry at the previous buyer, that same anger was later projected at me, amplified by additional headaches. Why was she angry at me? Because this house was a pain to sell (it was on the market for over a year) and because I happened to be the easiest target in this interaction. When you’re angry, you don’t think rationally. When you’re angry, you’re simply searching for a scapegoat, you’re not searching for a solution. If we asked ourselves “how to solve this problem?” more often and “who caused the problem?” less often, we’d create less stress for everyone and we’d need less lawsuits to settle things.

How to Write Faster Python

Peugeot F1 This is a high level guide on how to approach speeding up slow apps. I have been writing a computer vision app on the backend of a server and came to the realization that it is easy to write slow Python, and Python itself is not fast. If you’re iterating over a 1000 x 1000 pixels (1 megapixel) image, whatever you’re doing inside will run 1 million times. My code ran an iterative algorithm, and it took over 2 minutes to run the first version of the code. I was able to get that time down to under 10 seconds – and it only took a few hours time with minimal code changes. The hardest part was converting some snippets of code into C, which I plan to detail in the future. The basic approach I followed was:

  1. Set a baseline with working code & test
  2. Find the bottlenecks
  3. Find a way to optimize bottlenecks
  4. Repeat

And fair warning, these sorts of optimizations only work if you have selected a good algorithm. As we used to say at work “You can’t polish a turd”.

Slow, but working code

I am going to start with an example program. This resizes an image using the medians of each region. If you resize to 1/4th of the original size the new image will be the medians of 4×4 blocks. It is not too important you understand the details, but I still included code for reference:

import cv2
import numpy as np
import math
from datetime import datetime

def _median(list):
    """
    Return the median
    """

    sorted_list = sorted(list)
    if len(sorted_list) % 2 == 0:
        #have to take avg of middle two
        i = len(sorted_list) / 2
        #int() to convert from uint8 to avoid overflow
        return (int(sorted_list[i-1]) + int(sorted_list[i])) / 2.0
    else:
        #find the middle (remembering that lists start at 0)
        i = len(sorted_list) / 2
        return sorted_list[i]

def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    #TODO: optimize this, maybe precalculate height & widths
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            block_b = []
            block_g = []
            block_r = []
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b.append(img[r_i, c_i, 0])
                    block_g.append(img[r_i, c_i, 1])
                    block_r.append(img[r_i, c_i, 2])

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b)),
                int(_median(block_g)),
                int(_median(block_r))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

This takes around 3 seconds to run against a random 1000×1000 image from Wikipedia (link):

500 loops, best of 3: 3.09 sec per loop

So this is a good starting point – it works! I am also saving the output image, so that when I make changes, I can spotcheck that I didn’t break something.

Profiling

Python’s standard library has a profiler that works very well (see The Python Profilers). Although I’m not a fan of the text output – I feel it does not intuitively roll up results – you can dump stats to a file which you can view in a UI.

So I can profile the script:

python -m cProfile -o stats.profile myscript.py

Note that it can sometimes add a lot of overhead.

And after installing runsnakerun I can view the stats:

runsnake stats.profile

which shows me: Runsnakerun GUI

So there are a few things that jump out as easy fixes. First, that app spends around 1/3rd of the time appending colors. Second, finding the median takes a significant amount of time as well – most of it from the sorted call – this is not visible in the image above. The other lines of code are not significant enough to display.

Optimizing

I have a few quick rules of thumb for optimizing:

  • creating/appending to lists is slow due to memory allocation – try to use list comprehensions
  • try not to run operations that create copies of objects unless it’s required – this is also slow due to memory allocation
  • dereferencing is slow: if you’re doing mylist[i] several times, just do myvar = mylist[i] up front
  • use libraries as much as possible – many are written in C/C++ and fast

Other than that, make use of search like Google or DuckDuckGo. You can tweak your Python code (you might discover you’re doing something wrong!), use Cython, write C libraries, or find another solution.

So profiling tells me that appending is slowing my code. I can get around this problem by declaring the list once and keeping track of how many elements I “add”. I also know that sorted() is not preferred because it creates a copy of the original list. I can instead use list.sort() and sort the list in place. I make these changes and run the code, and see the output is still good so I probably did not break it. Let’s time it.

500 loops, best of 3: 2.51 sec per loop

That’s almost 20% faster! Not bad for a few minutes of effort.

For completeness, here is the modified code:

import cv2
import numpy as np
import math
from datetime import datetime

def _median(list, ne):
    """
    Return the median.
    """

    #sort inplace
    if ne != len(list):
        list = list[0:ne]
    list.sort()
    if len(list) % 2 == 0:
        #have to take avg of middle two
        i = len(list) / 2
        #int() to convert from uint8 to avoid overflow
        return (int(list[i-1]) + int(list[i])) / 2.0
    else:
        #find the middle (remembering that lists start at 0)
        i = len(list) / 2
        return list[i]

def resize_median(img, block_size):
    """
    Take the original image, break it down into blocks of size block_size
    and get the median of each block.
    """

    #figure out how many blocks we'll have in the output
    height, width, depth = img.shape
    num_w = math.ceil(float(width)/block_size)
    num_h = math.ceil(float(height)/block_size)

    #create the output image
    out_img = np.zeros((num_h, num_w, 3), dtype=np.uint8)

    #iterate over the img and get medians
    row_min = 0
    #TODO: optimize this, maybe precalculate height & widths
    num_elems = block_size * block_size
    block_b = [0] * num_elems
    block_g = [0] * num_elems
    block_r = [0] * num_elems
    while row_min < height:
        r_max = row_min + block_size
        if r_max > height:
            r_max = height

        col_min = 0
        new_row = []
        while col_min < width:
            c_max = col_min + block_size
            if c_max > width:
                c_max = width

            #block info:
            num_elems = (r_max-row_min) * (c_max-col_min)
            block_i = 0
            for r_i in xrange(row_min, r_max):
                for c_i in xrange(col_min, c_max):
                    block_b[block_i] = img[r_i, c_i, 0]
                    block_g[block_i] = img[r_i, c_i, 1]
                    block_r[block_i] = img[r_i, c_i, 2]
                    block_i += 1

            #have the block info, sort by L
            median_colors = [
                int(_median(block_b, num_elems)),
                int(_median(block_g, num_elems)),
                int(_median(block_r, num_elems))
            ]

            out_img[int(row_min/block_size), int(col_min/block_size)] = median_colors
            col_min += block_size
        row_min += block_size
    return out_img

def run():
    img = cv2.imread('Brandberg_Massif_Landsat.jpg')
    block_size = 16
    start_time = datetime.now()
    resized = resize_median(img, block_size)
    print "Time: {0}".format(datetime.now() - start_time)
    cv2.imwrite('proc.png', resized)

if __name__ == '__main__':
    run()

Repeat

Now that I optimized the code a little, I repeat the process to make it even better. I try to decide what should be slow, and what should not. So, for example, sorting is not very fast here, but that makes sense. Sorting is the most complex part of this code – the rest is iterating and keeping track of observed colors.

Final Thoughts

After having optimized a few projects I have a few final thoughts and lessons learned:

  • Optimizing a slow/bad algorithm is a waste of time, you’ll need to redesign it anyways
  • The granularity of the profiler is sometimes a little funny. You can get around this by making sections of code into functions – metrics on functions are usually captured.
  • The profile will group the calls to the library functions together even if called from different places. So be careful with metrics for libraries/utils – i.e. numpy.
  • Use search to find ways to optimize your code.

If all else fails, you can implement slow parts in C relatively painlessly. This is the option I usually go with, and I will detail it out with some code in the next few weeks.

Have fun speeding up your code!