High Performance Computing in Python

As you have seen, python is a very user-friendly, powerful language. Packages like numpy, scipy, matplotlib, ipython, and pandas combined with its interpreted nature make python an excellent choice for scientific work. Despite these compelling tools, many people hold the (mis)conception that python is too slow for real scientific work and compiled languages like C, C++, or Fortran should be used instead. While it is true that writing tight, nested loops in pure python won’t result in stellar performance, python has numerous tools for speeding up computation either directly in python or by compiling extensions modules to be used within python.

The purpose of this lab is to introduce you to the resources available for making your python programs achieve maximum performance and efficiency.

Python-only Solutions


The blist package provides a drop-in replacement for the python list, but provides better performance when modifying large lists. The following table was taken from the python package index (pypi) entry for blist.

blist performance
Use Case blist list Operation
Insertion into or removal from a list O(log n) O(n) x + x
Taking slices of lists O(log n) O(n) x[1:-1]
Making shallow copies of lists O(1) O(n) x = x[:]
Changing slices of lists O(log n + log k) O(n + k) x[1:k] = x[-k:1]

The creation and manipulation of a blist object should be very natural to users of the python list.

# Import blist.blist
In [287]: from blist import blist

# Create python list with 50 million elements
In [288]: py_x = [1, 2, 3, 4, 5] * int(1e7)

# Create blist from existing python list
In [289]: %timeit bx_1 = blist(py_x)
1 loops, best of 3: 508 ms per loop

It would have been faster to create the blist in a way analogous to how we created py_x

In [290]: %timeit bx_1 = blist([1, 2, 3, 4, 5]) * int(1e7)
100000 loops, best of 3: 11 us per loop


Verify the results in the table by performing the given operations on list and blist objects of size n = [10, 100, 500, 1000, 2500, 5000, 10000, 12000, 25000, 50000, 100000, 1000000, 5000000] (it doesn’t matter what data is in the object, I used a single int).

Perform each operation 5 times on each type of list and compute the mean of those 5 observations for your estimated time.

Generate one figure for each operation. Each figure should have two subplots (each with a logged y-axis): 1. The timing results as a function of n 2. the expected results as a function of n (from table).

Hint: for k in the changing a slice operation use some fraction of the size of the list. Also, while the scales will be different when comparing the actual and expected results, we really care about the shape of the curves. (The curve for O(n) will not match the data very well).

The moral of the blist story is that when you are working with large python lists and performance is an issue, use blist.blist instead.


numpy is the standard package for scientific computing in python. You have been exposed to a lot of numpy functionality, so we will not spend time on it here. I simply couldn’t talk about high performance computing in python without at least mentioning numpy.


While numpy is very powerful and excellent at most array processing and manipulation tasks, it isn’t perfect. One of the biggest flaws with numpy is how it handles intermediate results in complex array expressions. When doing computation on arrays, there are two main extremes:

  1. Doing operations on whole arrays at a time, storing intermediate results in separate arrays
  2. Looping over each element of the array, performing the operation, and combining the results.

numpy uses the first method, which makes for very fast, but very memory intensive operations. For example if you had two numpy arrays, x and y and you wanted to compute 2*x + 3*y numpy will first compute 2*x and store it in memory, them compute 3*y and store it in memory, and finally compute the sum of those two objects. For smaller arrays this is not a problem and the evaluation is near optimal, however for larger arrays this can cause problems for at least two reasons: 1. If the individual components are large enough, they won’t all fit into memory at the same time, and 2. If you had many more terms in the expression, eventually they won’t all fit into memory.

If you wanted to avoid those problems completely you could take the second approach. The problem with that method, at least in an interpreted language like python, is that on each iteration of the loop the interpreter will have to check types to verify that the operation is legal.

numexpr uses an approach that is in the middle of the two. Before computation, arrays are split into manageable chunks, which are acted on like method 1 above and put together like method 2 above. You use numexpr by passing a string representing the operation you want to perform to the numexpr.evaluate function, then numexpr compiles it into machine code to be run on its own internal virtual machine that can be run in parallel if you have a multi-core machine. The return value of numexpr is a numpy array that is exactly equal to the array you would have obtained using only numpy.

We will show some examples

In [291]: import numpy as np

In [292]: import numexpr as ne

In [293]: x = np.arange(1e6)

In [294]: y = np.arange(1e6)

In [295]: z = np.arange(1e6)

In [296]: np_ans = x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x

In [297]: %timeit x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x
10 loops, best of 3: 90.3 ms per loop

Now using numexpr

In [298]: ne_ans = ne.evaluate('x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x')

In [299]: %timeit ne.evaluate('x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x')
100 loops, best of 3: 13.6 ms per loop

Just to make sure we got the same answer, let’s check:

In [300]: np.allclose(np_ans, ne_ans)
Out[300]: True

When I ran this on my computer I got 77.5 ms for numpy and 12.3 ms for numexpr. That is a speedup of 6.3 over the already speedy numpy! Not bad.


Investigate the ability of numexpr.evaluate to speed up the same operation on arrays of differing sizes. Evaluate the following expressions with plain numpy and then with numexpr:

  • \(2x + 3y + |z-2| \sqrt{(x-y)^2}\)
  • \(\frac{\left(x^{2y} (\sin x - \cos y) \right)}{\exp z}\)

Use random arrays (using np.random.randn or something like that) where x.shape=(n, n), y.shape=n and z.shape=(2, n, n). Use n=[1, 10, 100, 300, 700, 1000, 2000].

Plot execution time as a function of n for numpy and numexpr on the same axis for each of the expressions above.


When I was writing this question I considered having max(n)=10000. When I tried that on my computer the numpy case took over 3 minutes to evaluate and took 5.98 GB of memory. The numexpr case only took 7.5 seconds, but used up to 367% of the processing power of one of the cores on my machine.

The moral of the numexpr story is that when you are doing arithmetic on arrays and you want to minimize the memory footprint of the calculations while optimizing performance, use numexpr instead of vanilla numpy.

“Compiling” Python


To test the performance of the tools in the rest of this lab we will write functions to solve three main problems.

  1. The pairwise distance problem: This function should take in a numpy array (or similar object) that is n by m. Each row will be an m-tuple representing the coordinates for a vector in \(R^m\). The return value of this function will be an n by n array of euclidean distances between each point (row) in the input array and every other point. We ask you to use three loops to do this (row, column of result matrix and then inner one to loop over points in the m-tuples). We recognize that this inner loop could easily be removed using numpy, but as this is just an example algorithm we want you to do it this way so you can see the largest gains from using the tools described in this section.

  2. Project Euler problem 14 (I will quote the text below). For this problem we ask you to take a brute force approach. One such approach would be to define a function that finds the chain length for a given integer and then to wrap that function in a for loop to go over all the numbers. Problem text:

    The following iterative sequence is defined for the set of positive integers:

    \[n \rightarrow n / 2 \quad \text{(n is even)}\]\[n \rightarrow 3n + 1 \quad \text{(n is odd)}\]

    Using the rule above and starting with 13, we generate the following sequence:

    \[13 \rightarrow 40 \rightarrow 20 \rightarrow 10 \rightarrow 5 \rightarrow 16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1\]

    It can be seen that this sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1.

    Which starting number, under one million, produces the longest chain?

    NOTE: Once the chain starts the terms are allowed to go above one million.

  3. Image filtering. This problem is a bit more difficult so I will just give you some python code that could do it for you...

import numpy as np

def filter2d(image, filt):
    m, n = image.shape
    m_f, n_f = filt.shape
    m_f2 = m_f // 2
    n_f2 = n_f // 2
    result = np.zeros_like(image)
    for i in range(m_f2, m - m_f2):
        for j in range(n_f2, n - n_f2):
            num = 0.0
            for ii in range(m_f):
                for jj in range(n_f):
                    num += (filt[m_f-1-ii, n_f-1-jj] *
                            image[i-m_f2+ii, j-n_f2+jj])
            result[i, j] = num
    return result


Write python functions that solve problems 1 and 2 in the ways described above. Time the execution of these functions under the following circumstances and record the results in a pandas DataFrame (note we will be adding the same test results for Cython and numba so account for that when designing your DataFrame):

  1. For the pairwise distance problem pass in random arrays of the following sizes: (10, 5), (100, 5), (500, 5), (800, 5), (1000, 5)

  2. There is really only one way to solve the Collatz sequence problem so just do it!

  3. For the filtering problem pass in arrays of sizes (note i:(m,n) means image and f:(m, n) means filter):
    • i:(10, 10)-f:(2, 2)
    • i:(50, 50)-f:(5, 5)
    • i(100, 100)-f(10, 10)
    • i(250, 250)-f(20, 20)


Background and Setup

numba is a relatively new project that has received a lot of attention in the past 6-12 months. From the numba github site we read

Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the remarkable LLVM compiler infrastructure to compile Python syntax to machine code.

It is aware of NumPy arrays as typed memory regions and so can speed-up code using NumPy arrays. Other, less well-typed code will be translated to Python C-API calls effectively removing the “interpreter” but not removing the dynamic indirection.

Numba is also not a tracing jit. It compiles your code before it gets run either using run-time type information or type information you provide in the decorator.

Numba is a mechanism for producing machine code from Python syntax and typed data structures such as those that exist in NumPy.

As stated in the quote, numba is built on top of the LLVM (low level virtual machine) infrastructure and therefore requires that you have a number of dependencies installed on your computer. Some of these dependencies are non-trivial to install. In fact, it took about a week of trying before one of the tech supervisors at the BYU super-computing office was able to correctly install numba and its dependencies. Luckily for us, numba is sponsored by the people at Continuum Analytics and is therefore part of Anaconda. If you installed the Anaconda python distribution, you have numba. If you didn’t install Anaconda, you might think about doing that now, or you might just want to get together with someone else in the class who did and work through these problems on their machine.


Assuming you have numba properly set up (or you are working with someone who does) you should be ready to dive in and speeding up your python code!

Most of the work you will do with numba will happen through two main decorators: jit and autojit. We have not talked much about decorators as they are a somewhat advanced python language feature. The main idea behind decorators is that they are wrapped around functions or classes and transform them in some way. Decorators are applied using the @decorator_name syntax as shown below:

def my_fun():

The above code is exactly equal to the following:

def my_fun():

my_fun = decorator(my_fun)

As numba is a fairly new project, there isn’t a lot of documentation yet. I will do my best to explain the key ideas behind making numba work, mostly through examples.

Types in numba

numba works because it compiles your code to machine code the first time you call it. In order for this compilation to happen, numba needs to know what type each of your objects is. There are a number of types that can be imported from numba using from numba import type_name syntax. Below is a table with some of the most common types.

Type Name Result Type
float_ float32
double float64
longdouble float128
char signed char
int8 int8 (char)
int16 int16
int32 int32
int64 int64
complex64 float complex
complex128 double complex
complex256 long double complex

The following notes about the types system in numba were taken from the numba documentation:

Unsigned integer counterparts are available under the name uint8 etc. Also, short-names are available with the style ‘<char>N’ where char is ‘b’, ‘i’, ‘u’, ‘f’, and ‘c’ for boolean, integer, unsigned, float and complex types respectively with ‘N’ indicating the number of bytes in the type. Thus, f8 is equivalent to float64, and c16 is equivalent to double complex.

Native platform-dependent types are also available under names such as int_, short, ulonglong, etc.

Types are names that can be imported from the numba namespace. Alternatively, they can be specified in strings in the jit decorator.

Using the Decorators

As noted above, most of the work you do with numba will come through one of two decorators: jit and autojit. Both serve a similar purpose: to compile the enclosing function or class into machine code the first time that function or class is called. The main difference between jit and autojit is that autojit will, at a relatively minor performance cost, use python to infer data types before sending your code to the LLVM.

To use the jit decorator, you place @jit on the line directly above the def or class statement for the object you would like numba to compile (from now on I will just talk about functions, but you should know that these general principles apply also to classes). You will then tell the jit decorator about the data types for the input and output objects for the function. There are a few ways to do this:

  1. Pass a single argument that is a sytring. This string will be in the following format out_type(in_type1, in_type2, ...). The out_type and various in_type arguments can be any object type that numba knows about. For example I had a function that returned a single floating point value and had one argument that was an array of floats I could do either of the following:
  1. Pass single object with names you have imported from numba namespace. This will be identical to the first option with the except that you don’t have to wrap the types in a string and you must have imported the argument types from numba. So I could use the same calls to @jit after having done the following import:
from numba import f8, double
  1. Pass in and out arguments directly to restype and argtype keyword arguments. This method is a bit more verbose than the first two, but can be more straigtforward. Following the example above and assuming I had done the correct imports I could call either of the below:
@jit(restype=double, argtypes=double[:])
@jit(restype=f8, argtypes=f8[:])

Notice that in the above examples we specified the array with the syntax type[:]. This means numba should expect a 1d array of dtype type. If we had more dimensions to our array we would simply add more colons. For example, if we had a 3d array of integers we would specify int_[:, :, :].

Another thing to note is that we can get performance boosts by telling numba how the memory is laid out for our arrays. By default, numpy creates c-type arrays. This simply means that the memory for the array is allocated such that elements along the last dimension are closer to one another (in terms of physical memory address) than elements in the preceding dimensions. If the array is laid out in Fortran-contiguous ordering this would be the opposite. Either way, if you know the continuity of your array you can get performance enhancements by letting numba know about it by replacing the : where you find the closest elements with ::1. I show some examples below.

  • For a 3d c-contiguous array of floating point values you would say double[:, :, ::1]
  • For a 2d Fortran-contiguous array of integers you would say int_[::1, :]
  • For a 1d contiguous array of long integers (doesn’t matter if it is c or Fortran contiguous - they are the same) you would put long[::1].

The use of the autojit decorator is a bit simpler, but comes at a small performance hit. When using autojit you simply put @autojit() in the line just above your function declaration and it will infer the argument and return types when you call the function. It is easy to see why this is a bit slower at times because you are forcing python to do the work of determining types on the fly, rather than suppling them yourself.


Imagine that I wanted to write a python function that would integrate sin(x) over a given interval, dividing the interval into a certain number of bins. I could write a python function that looks like the following:

In [301]: from __future__ import division

In [302]: from math import sin

In [303]: def py_integrate_sin(a, b, N):
   .....:     """
   .....:     Integrate sin(x) from a, b by splitting into N intervals
   .....:     """
   .....:     s = 0.
   .....:     dx = (b - a) / N
   .....:     for i in xrange(N):
   .....:         s += sin(a + i*dx)
   .....:     return s * dx

In [304]: a, b, N = 1., 6., 100000

In [305]: %timeit py_integrate_sin(a, b, N)
10 loops, best of 3: 27 ms per loop

I could now use the numba decorators to see how much better we can do. Note that I could not get ipython to allow decorators in this file so I will simply put the code here and then use the alternative (non-@) syntax to apply the decorator just below it.

from numba import double, long_, int_, jit, autojit

@jit(double(double, double, long_), locals={'i':long_})
def numba1_integrate_sin(a, b, N):
    s = 0.
    dx = (b - a) / N
    for i in xrange(N):
        s += sin(a + i*dx)
    return s * dx

def numba2_integrate_sin(a, b, N):
    s = 0.
    dx = (b - a) / N
    for i in xrange(N):
        s += sin(a + i*dx)
    return s * dx
In [306]: from numba import double, long_, jit, autojit

In [307]: numba1_integrate_sin = jit('f8(f8, f8, long_)', locals={'i':long_})(py_integrate_sin)

In [308]: %timeit numba1_integrate_sin(a, b, N)
1000 loops, best of 3: 1.75 ms per loop

Now we will time the @autojit version. Note that performance is almost identical because the function is so simple. Other discrepancies come because my computer was not equally busy during the execution of both functions.

In [309]: numba2_integrate_sin = autojit()(py_integrate_sin)

In [310]: %timeit numba2_integrate_sin(a, b, N)
1000 loops, best of 3: 1.77 ms per loop

Now suppose we want to write a function that would raise a square matrix to the Nth power. A Python implementation might look like this:

In [311]: def py_matrix_power(mat, n):
   .....:     m = len(mat)
   .....:     out = np.zeros_like(mat)
   .....:     out[:] = mat[:]
   .....:     temp = np.zeros_like(mat)
   .....:     for _ in xrange(n - 1):
   .....:         for ii in xrange(m):
   .....:             for jj in xrange(m):
   .....:                 for kk in xrange(m):
   .....:                     temp[ii, jj] += out[ii, kk] * mat[kk, jj]
   .....:         out[:] = temp[:]
   .....:         temp = np.zeros_like(mat)
   .....:     return out

I will test this out by raising a random 50 by 50 matrix to the 15th power.

In [312]: x = np.random.randn(50, 50)

In [313]: %timeit py_matrix_power(x, 15)
1 loops, best of 3: 3.32 s per loop

A numba version might look like this:

def numba_matrix_power(mat, n):
    m = len(mat)
    out = np.zeros_like(mat)
    out[:] = mat[:]
    temp = np.zeros_like(mat)

    for _ in xrange(n - 1):
        for ii in xrange(m):
            for jj in xrange(m):
                for kk in xrange(m):
                    temp[ii, jj] += out[ii, kk] * mat[kk, jj]
        out[:] = temp[:]
        temp = np.zeros_like(mat)

    return out
In [314]: numba_matrix_power = autojit()(py_matrix_power)

In [315]: %timeit numba_matrix_power(x, 15)
1 loops, best of 3: 2.56 ms per loop



Make your existing python functions work with numba. Repeat the timing exercises from above and put the results in the DataFrame.


Background and Setup

An alternative to using numba to compile your python code is Cython. Cython is a very mature project and is widely recognized as a (if not the) community standard for interfacing C/C++ code with python as well as boosting the performance of actual python code. The following was taken from the homepage of the Cython documentation:

Cython is a programming language based on Python, with extra syntax allowing for optional static type declarations. It aims to become a superset of the Python language which gives it high-level, object-oriented, functional, and dynamic programming. The source code gets translated into optimized C/C++ code and compiled as Python extension modules. This allows for both very fast program execution and tight integration with external C libraries, while keeping up the high programmer productivity for which the Python language is well known.

So, instead of using a just-in-time compiler like numba, Cython actually generates C/C++ code for you in a form that is very easy to compile into a python extension module. Some people have said that Cython is just python with some type declarations thrown in. While the Cython syntax is very much like python, Cython is more than that because as it is eventually translated into C, any native C code you have is available to Cython and can therefore be used in your Cython code. But, we are getting ahead of ourselves now so we will back up a bit and cover some of the Cython basics.


As with numba, Cython helps your code achieve speed gains by compiling your python code in to a machine readable language. This is a three step process.

  1. Write your Cython code in a .pyx file
  2. Compile your .pyx` file to an equivalent .c file
  3. Compile the .c file to a .so (or .pyd on Windows) file to be imported directly into python.

There are several ways to complete steps 2 and 3 (build your Cython code):

  • Write a distutils setup.py file – reccomended way
  • Use the pyximport module to allow Cython do to the work in the background – limited.
  • Do it by hand, calling the compiler with all the correct flags and links – difficult and only advisable for experienced C programmers.
  • Use the Sage or Ipython Notebook cell magics – also limited because the compiled modules live only in the notebook session.

Throughout this instruction we will show you how to use setup.py

Types in Cython

Before compilation, Cython code is converted into C code that follows the Python C API, It is by typing variables that your code gains speed and efficiency. Because the code is compliant with the python C API, any native python or C type is valid for assignment to your Cython objects. However, you will see much bigger speed gains when you use C types (allowing your python to be more like c is the whole point behind Cython).


C types like enum, union, and struct are available in Cython, even if they don’t have a direct python counterpart.

Types are assigned to variables using the cdef keyword. Below are some examples.

cdef int my_int = 1
cdef double my_float = 1.1
cdef unsigned int my_uint = 42
cdef long my_long = 1000000001

If you have more than one cdef to perform at a time, just put them all in a cdef block.


    int my_int = 1
    double my_float = 1.1
    unsigned int my_uint = 42
    long my_long = 1000000001

Like C, you don’t have to assign a value at the same time as assigning a type. For example, the following code is just as valid as the above.


    int my_int
    double my_float
    unsigned int my_uint
    long my_long

my_int = 1
my_float = 1.1
my_uint = 42
my_long = 1000000001

Functions in Cython come in 3 varieties:

  1. Python functions created with the def keyword. These functions are the ones you have been used to and are directly accessible to python.
  2. C functions defined with the cdef keyword. These functions are given explicit types and are very fast. Because they are C functions they are not accessible to python (they are only defined within Cython .pyx modules)
  3. Hybrid functions defined with the cpdef keyword. These functions can be given type information and can be thought of as both a def and a cdef function. That is, a function with that name is accessible to python, but if accessed from within Cython, Cython will use the C version with a small performance loss over the standard cdef function.

Some valid cdef and cpdef statements appear below:

# only accessible in Cython
cdef int sum_ints(int a, int b)

# Accessible in both python and Cython
cpdef double divide(double a, double b)

# Notice C memory-view syntax
cdef double sum_2darr(double[:, ::1] arr)


Notice that I was able to define a 2d c-contiguous array using the same syntax that numba used (double[:, ::1]). This is a relatively new addition to Cython that, in my opinion, is a great enhancement. One part of working with numpy arrays from within Cython that was relatively slow was accessing slices of the array. To overcome this, the numpy people created a Cython interface that allowed for very fast access to array elements via array buffers. The syntax for the above array would have been the following np.ndarray[double, ndim=2, mode='c']. Recently, however, the Cython people have created a much more robust memoryview interface that allows the data stored in a numpy array to be accessed natively as if it were a C array. This is what the double[:, ::1] syntax is telling Cython to do with arr. Not only is this syntax a bit more compact and easy to remember, but also it actually comes with performance boosts as it allows Cython to treat the array more like a native C array. For that reason, we will only be using the memoryview syntax in this lab.

Writing the setup.py file

One of the most difficult aspects of Cython for a new user is getting a handle on building and compiling extension modules. The most reliable, portable, and robust way of doing this is to let python take care of the building, compiling, and linking for you! I will give you a template setup.py file that you can use for your projects and will then explain what the different parts are.

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# some Extension args:
#   Extension(name, sources, indlude_dirs=None, ..., language=None)

# Basic Cython file
cy_1 = Extension('cy_1', ['cy_1.pyx'])

# Second extension that calls existing C++ code.
cy_2 = Extension('cy_2', ['cy_2.pyx', 'cy_2.cpp', 'cy_2.h'], language='c++')

# The main python function that builds the stuff
  name = 'My Cython Extension',
  cmdclass = {'build_ext': build_ext},  # Use Cython's build_ext routine
  ext_modules = [cy_1, cy_2] # Must be list, even with only one Extension

After you have written this setup.py file and your Cython code, you can build the extension from the command line (UNIX terminal or Windows command prompt) using the command

This will tell python that you simply want to build the Extension s defined in setup.py in the current working directory. You can then launch python, ipython, or write a script and import the Cython extensions just like you would any other Python library using the Extension ‘s name (first argument you gave to Extension .)

Calling external C/C++ code

One of the main use cases for Cython is to enable Python to call pre-written C and C++ code. This is done using an extern block. The basic sytax is cdef extern from "filename": followed by indented lines specifying the interface for various functions or classes I would like to use. For example, if I wanted to use the sin function in the C standard library file math.h I could do the following:

cdef extern from "math.h":
    double sin(double)

This worked because in the file math.h there is a function sin that receives a single argument of dtype double and outputs a double. The user should note that much of the C and C++ libraries have already been exposed to Cython by the Cython developers. The sin function could also be exposed by the following Python code.

from libc.math cimport sin

Note the cimport statement instead of import. This is special Cython syntax that tells the compiler to look for a Cython block in a particular file (in this case libc.math) that defines the object being imported (the sin function). For a complete listing of which C and C++ standard library functions are availible see the libc and libcpp folders from this point in the Cython github repo.

Another example might be including a Rectangle class within the shapes namespace in a local .cpp file I have written (Note this was taken from the Cython documentation and includes object oriented constructs we haven’t talked about yet so don’t worry about the details - I just want you to know you can do something like this).

cdef extern from "Rectangle.h" namespace "shapes":
    # Expose class
    cdef cppclass Rectangle:
        # Give method signatures
        Rectangle(int, int, int, int) except +
        int x0, y0, x1, y1
        int getLength()
        int getHeight()
        int getArea()
        void move(int, int)


I completely skipped one of the main features of Cython: defining custom extension types (classes). This was deliberate and I did it because we have not, yet, gone object oriented programming in Python. After you have handle on basic object oriented programming I strongly encourage you to review the Cython documentation for defining classes and using existing C++ classes.


I will show the same examples I had for numba here.

# File integrate.pyx

from libc.math cimport sin

cpdef double cy_sin_integrate(double a, double b, long N):
    cdef int i
    cdef double dx, s
    s = 0.
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

The associated setup.py file

# File setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

integrate = Extension('integrate', ['integrate.pyx'])

# The main python function that builds the stuff
  name = 'My Cython Extension',
  cmdclass = {'build_ext': build_ext},
  ext_modules = [integrate]

The timing...

In [5]: from integrate import cy_sin_integrate

In [6]: %timeit cy_sin_integrate(a, b, N)
100 loops, best of 3: 3.77 ms per loop

Now the .pyx and setup.py for the matrix power problem:

# File mat_power.pyx
import numpy as np
cimport cython

# compiler directives to speed up array handling
cpdef double[:, ::1] cy_matrix_power(double[:, ::1], mat, int n):
    cdef int m = len(mat)
    cdef double[:, ::1] out = np.zeros_like(mat)
    out[:] = mat[:]
    cdef double[:, ::1] temp = np.zeros_like(mat)

    cdef int it, ii, jj, kk

    for it in xrange(n - 1):
        for ii in xrange(m):
            for jj in xrange(m):
                for kk in xrange(m):
                    temp[ii, jj] += out[ii, kk] * mat[kk, jj]
        out[:] = temp[:]
        temp = np.zeros_like(mat)

    return out
# file: setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

mat_power = Extension('mat_power', ['mat_power.pyx'])

# The main python function that builds the stuff
  name = 'My Cython Extension',
  cmdclass = {'build_ext': build_ext},
  ext_modules = [mat_power]

And testing it ...

In [7]: from mat_power import cy_matrix_power

In [8]: %timeit cy_matrix_power(x, 15)
100 loops, best of 3: 3.1 ms per loop



Work through the main 3 problems using Cython this time. Repeat the timing exercises from above and put the results in the DataFrame.

Wrapping Compiled Extensions

We have already talked a little bit about how you can use Cython to wrap existing C/C++ (and even some Fortran) libraries. The topic of wrapping in general is far beyond the scope of this lab, but I will at least mention a few of the more useful tools out there.

For wrapping C/C++:

  • Cython: Call external C function in a python like interface that gets compiled into a Python extension module
  • SWIG: Auto-generate python wrapper code based on C/C++ header files. Hands-off, low-flexibility type wrapping.
  • Boost.Python: Auto-generate python wrapper code based on C/C++ header files. Hands-off, low-flexibility type wrapping.
  • xdress: Automatically create Cython wrapper files from C/C++ header files. In the middle of writing your own extensions using Cython and having SWIG or Boost.Python do it for you.

For wrapping Fortran:

  • f2py: The standard. What you should probably use if you have serious fortran to wrap.
  • Cython: Nice integration with Python, C, and C++.


Analyze your timing results for each of the three implementations for each of the three problems. Compute ratio of time saved, as well as plots of time vs. argument size when applicable.

NOTE: I am giving you a lot of flexibility here. I want you to be creative in how you use the data analysis tools you have developed thus far. Spend about 10-15 minutes coming up with a compelling way to share the results you have gathered during this lab.