.. _performance_lab:

************************************
High Performance Computing in Python
************************************

As you have seen, python is a very user-friendly, powerful language. Packages like numpy, scipy, matplotlib, ipython, and pandas combined with its interpreted nature make python an excellent choice for scientific work. Despite these compelling tools, many people hold the (mis)conception that python is too slow for real scientific work and compiled languages like C, C++, or Fortran should be used instead. While it is true that writing tight, nested loops in pure python won't result in stellar performance, python has numerous tools for speeding up computation either directly in python or by compiling extensions modules to be used within python.

The purpose of this lab is to introduce you to the resources available for making your python programs achieve maximum performance and efficiency.

Python-only Solutions
=====================

blist
-----

The `blist <https://pypi.python.org/pypi/blist>`_ package  provides a drop-in replacement for the python list, but provides better performance when modifying large lists. The following table was taken from the python package index (pypi) entry for `blist <https://pypi.python.org/pypi/blist>`_.


.. csv-table:: blist performance
    :header: Use Case, blist, list, Operation
    :widths: 23, 11, 9, 15

    Insertion into or removal from a list, O(log n), O(n), x + x
    Taking slices of lists, O(log n), O(n), x[1:-1]
    Making shallow copies of lists, O(1), O(n), x = x[:]
    Changing slices of lists, O(log n  + log k), O(n + k), x[1:k] = x[-k:1]

The creation and manipulation of a blist object should be very natural to users of the python ``list``.

.. ipython:: python

    # Import blist.blist
    from blist import blist

    # Create python list with 50 million elements
    py_x = [1, 2, 3, 4, 5] * int(1e7)

    # Create blist from existing python list
    %timeit bx_1 = blist(py_x)

It would have been faster to create the blist in a way analogous to how we created ``py_x``

.. ipython:: python

    %timeit bx_1 = blist([1, 2, 3, 4, 5]) * int(1e7)

.. todo::

    Verify the results in the table by performing the given operations on ``list`` and blist objects of size n = [10, 100, 500, 1000, 2500, 5000, 10000, 12000, 25000, 50000, 100000, 1000000, 5000000] (it doesn't matter what data is in the object, I used a single int).

    Perform each operation 5 times on each type of list and compute the mean of those 5 observations for your estimated time.

    Generate one figure for each operation. Each figure should have two subplots (each with a logged y-axis): 1. The timing results as a function of n 2. the expected results as a function of n (from table).

    Hint: for ``k`` in the changing a slice operation use some fraction of the size of the list. Also, while the scales will be different when comparing the actual and expected results, we really care about the shape of the curves. (The curve for O(n) will not match the data very well).

The moral of the blist story is that when you are working with large python lists and performance is an issue, use ``blist.blist`` instead.

numpy
-----

numpy is the standard package for scientific computing in python. You have been exposed to a lot of numpy functionality, so we will not spend time on it here. I simply couldn't talk about high performance computing in python without at least mentioning numpy.

numexpr
-------

While numpy is very powerful and excellent at most array processing and manipulation tasks, it isn't perfect. One of the biggest flaws with numpy is how it handles intermediate results in complex array expressions. When doing computation on arrays, there are two main extremes:

1. Doing operations on whole arrays at a time, storing intermediate results in separate arrays
2. Looping over each element of the array, performing the operation, and combining the results.

numpy uses the first method, which makes for very fast, but very memory intensive operations. For example if you had two numpy arrays, ``x`` and ``y`` and you wanted to compute ``2*x + 3*y`` numpy will first compute ``2*x`` and store it in memory, them compute ``3*y`` and store it in memory, and finally compute the sum of those two objects. For smaller arrays this is not a problem and the evaluation is near optimal, however for larger arrays this can cause problems for at least two reasons: 1. If the individual components are large enough, they won't all fit into memory at the same time, and 2. If you had many more terms in the expression, eventually they won't all fit into memory.

If you wanted to avoid those problems completely you could take the second approach. The problem with that method, at least in an interpreted language like python, is that on each iteration of the loop the interpreter will have to check types to verify that the operation is legal.

numexpr uses an approach that is in the middle of the two. Before computation, arrays are split into manageable chunks, which are acted on like method 1 above and put together like method 2 above. You use numexpr by passing a string representing the operation you want to perform to the ``numexpr.evaluate`` function, then numexpr compiles it into machine code to be run on its own internal virtual machine that can be run in parallel if you have a multi-core machine. The return value of numexpr is a numpy array that is exactly equal to the array you would have obtained using only numpy.

We will show some examples

.. ipython:: python

    import numpy as np
    import numexpr as ne

    x = np.arange(1e6)
    y = np.arange(1e6)
    z = np.arange(1e6)

    np_ans = x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x

    %timeit x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x

Now using numexpr

.. ipython:: python

    ne_ans = ne.evaluate('x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x')

    %timeit ne.evaluate('x**2 + y**2 + z**2 + 2*z*y + 2*x*y + 2*z*x')

Just to make sure we got the same answer, let's check:

.. ipython:: python

    np.allclose(np_ans, ne_ans)

When I ran this on my computer I got 77.5 ms for numpy and 12.3 ms for numexpr. That is a speedup of 6.3 over the already speedy numpy! Not bad.

.. todo::

    Investigate the ability of ``numexpr.evaluate`` to speed up the same operation on arrays of differing sizes. Evaluate the following expressions with plain numpy and then with numexpr:

    - :math:`2x + 3y + |z-2| \sqrt{(x-y)^2}`
    - :math:`\frac{\left(x^{2y} (\sin x - \cos y) \right)}{\exp z}`

    Use random arrays (using np.random.randn or something like that) where ``x.shape=(n, n)``, ``y.shape=n`` and ``z.shape=(2, n, n)``. Use n=[1, 10, 100, 300, 700, 1000, 2000].

    Plot execution time as a function of n for numpy and numexpr on the same axis for each of the expressions above.

.. note::

    When I was writing this question I considered having ``max(n)=10000``. When I tried that on my computer the numpy case took over 3 minutes to evaluate and took 5.98 GB of memory. The numexpr case only took 7.5 seconds, but used up to 367% of the processing power of one of the cores on my machine.

The moral of the numexpr story is that when you are doing arithmetic on arrays and you want to minimize the memory footprint of the calculations while optimizing performance, use numexpr instead of vanilla numpy.

"Compiling" Python
==================

Setup
-----

To test the performance of the tools in the rest of this lab we will write functions to solve three main problems.

1. The pairwise distance problem: This function should take in a numpy array (or similar object) that is n by m. Each row will be an m-tuple representing the coordinates for a vector in :math:`R^m`. The return value of this function will be an n by n array of euclidean distances between each point (row) in the input array and every other point. We ask you to use three loops to do this (row, column of result matrix and then inner one to loop over points in the m-tuples). We recognize that this inner loop could easily be removed using numpy, but as this is just an example algorithm we want you to do it this way so you can see the largest gains from using the tools described in this section.
2. Project Euler problem 14 (I will quote the text below). For this problem we ask you to take a brute force approach. One such approach would be to define a function that finds the chain length for a given integer and then to wrap that function in a for loop to go over all the numbers. Problem text:

    The following iterative sequence is defined for the set of positive integers:

    .. math::

        n \rightarrow  n / 2 \quad \text{(n is even)}

        n \rightarrow  3n + 1 \quad \text{(n is odd)}

    Using the rule above and starting with 13, we generate the following sequence:

    .. math::

        13 \rightarrow 40 \rightarrow 20 \rightarrow 10 \rightarrow 5 \rightarrow 16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1

    It can be seen that this sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1.

    Which starting number, under one million, produces the longest chain?

    NOTE: Once the chain starts the terms are allowed to go above one million.
3. Image filtering. This problem is a bit more difficult so I will just give you some python code that could do it for you...

.. code-block:: python

    import numpy as np

    def filter2d(image, filt):
        m, n = image.shape
        m_f, n_f = filt.shape
        m_f2 = m_f // 2
        n_f2 = n_f // 2
        result = np.zeros_like(image)
        for i in range(m_f2, m - m_f2):
            for j in range(n_f2, n - n_f2):
                num = 0.0
                for ii in range(m_f):
                    for jj in range(n_f):
                        num += (filt[m_f-1-ii, n_f-1-jj] *
                                image[i-m_f2+ii, j-n_f2+jj])
                result[i, j] = num
        return result

.. image is random nxn matrix where n=20, 50, 100, 200, 500 and filt is 10x10

.. todo::

    Write python functions that solve problems 1 and 2 in the ways described above. Time the execution of these functions under the following circumstances and record the results in a pandas DataFrame (note we will be adding the same test results for Cython and numba so account for that when designing your DataFrame):

    1. For the pairwise distance problem pass in random arrays of the following sizes: (10, 5), (100, 5), (500, 5), (800, 5), (1000, 5)
    2. There is really only one way to solve the Collatz sequence problem so just do it!
    3. For the filtering problem pass in arrays of sizes (note i:(m,n) means image and f:(m, n) means filter):
        - i:(10, 10)-f:(2, 2)
        - i:(50, 50)-f:(5, 5)
        - i(100, 100)-f(10, 10)
        - i(250, 250)-f(20, 20)

numba
-----

Background and Setup
^^^^^^^^^^^^^^^^^^^^

numba is a relatively new project that has received a lot of attention in the past 6-12 months. From the numba github site we read

    Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the remarkable LLVM compiler infrastructure to compile Python syntax to machine code.

    It is aware of NumPy arrays as typed memory regions and so can speed-up code using NumPy arrays. Other, less well-typed code will be translated to Python C-API calls effectively removing the "interpreter" but not removing the dynamic indirection.

    Numba is also not a tracing jit. It compiles your code before it gets run either using run-time type information or type information you provide in the decorator.

    Numba is a mechanism for producing machine code from Python syntax and typed data structures such as those that exist in NumPy.

As stated in the quote, numba is built on top of the LLVM (low level virtual machine) infrastructure and therefore requires that you have a number of dependencies installed on your computer. Some of these dependencies are non-trivial to install. In fact, it took about a week of trying before one of the tech supervisors at the BYU super-computing office was able to correctly install numba and its dependencies. Luckily for us, numba is sponsored by the people at Continuum Analytics and is therefore part of Anaconda. If you installed the Anaconda python distribution, you have numba. If you didn't install Anaconda, you might think about doing that now, or you might just want to get together with someone else in the class who did and work through these problems on their machine.

Basics
^^^^^^

Assuming you have numba properly set up (or you are working with someone who does) you should be ready to dive in and speeding up your python code!

Most of the work you will do with numba will happen through two main decorators: ``jit`` and ``autojit``. We have not talked much about decorators as they are a somewhat advanced python language feature. The main idea behind decorators is that they are wrapped around functions or classes and transform them in some way. Decorators are applied using the ``@decorator_name`` syntax as shown below:

.. code-block:: python

    @decorator
    def my_fun():
        pass

The above code is exactly equal to the following:

.. code-block:: python

    def my_fun():
        pass

    my_fun = decorator(my_fun)

As numba is a fairly new project, there isn't a lot of documentation yet. I will do my best to explain the key ideas behind making numba work, mostly through examples.

Types in numba
^^^^^^^^^^^^^^

numba works because it compiles your code to machine code the first time you call it. In order for this compilation to happen, numba needs to know what type each of your objects is. There are a number of types that can be imported from numba using ``from numba import type_name`` syntax. Below is a table with some of the most common types.

==========  ===================
Type Name   Result Type
==========  ===================
float\_     float32
double      float64
longdouble  float128
char        signed char
int8        int8 (char)
int16       int16
int32       int32
int64       int64
complex64   float complex
complex128  double complex
complex256  long double complex
==========  ===================

The following notes about the types system in numba were taken from the numba documentation:

    Unsigned integer counterparts are available under the name uint8 etc. Also, short-names are available with the style ‘<char>N’ where char is ‘b’, ‘i’, ‘u’, ‘f’, and ‘c’ for boolean, integer, unsigned, float and complex types respectively with ‘N’ indicating the number of bytes in the type. Thus, f8 is equivalent to float64, and c16 is equivalent to double complex.

    Native platform-dependent types are also available under names such as int\_, short, ulonglong, etc.

    Types are names that can be imported from the numba namespace. Alternatively, they can be specified in strings in the jit decorator.

Using the Decorators
^^^^^^^^^^^^^^^^^^^^

As noted above, most of the work you do with numba will come through one of two decorators: ``jit`` and ``autojit``. Both serve a similar purpose: to compile the enclosing function or class into machine code the first time that function or class is called. The main difference between ``jit`` and ``autojit`` is that ``autojit`` will, at a relatively minor performance cost, use python to infer data types before sending your code to the LLVM.

To use the ``jit`` decorator, you place ``@jit`` on the line directly above the ``def`` or ``class`` statement for the object you would like numba to compile (from now on I will just talk about functions, but you should know that these general principles apply also to classes). You will then tell the ``jit`` decorator about the data types for the input and output objects for the function. There are a few ways to do this:

1. Pass a single argument that is a sytring. This string will be in the following format ``out_type(in_type1, in_type2, ...)``. The ``out_type`` and various ``in_type`` arguments can be any object type that numba knows about. For example I had a function that returned a single floating point value and had one argument that was an array of floats I could do either of the following:

.. code-block:: python

    @jit('f8(f8[:])')
    @jit('double(double[:])')

2. Pass single object with names you have imported from numba namespace. This will be identical to the first option with the except that you don't have to wrap the types in a string and you must have imported the argument types from numba. So I could use the same calls to ``@jit`` after having done the following import:

.. code-block:: python

    from numba import f8, double
    @jit(f8(f8[:]))
    @jit(double(double[:]))

3. Pass in and out arguments directly to ``restype`` and ``argtype`` keyword arguments. This method is a bit more verbose than the first two, but can be more straigtforward. Following the example above and assuming I had done the correct imports I could call either of the below:

.. code-block:: python

    @jit(restype=double, argtypes=double[:])
    @jit(restype=f8, argtypes=f8[:])

Notice that in the above examples we specified the array with the syntax ``type[:]``. This means numba should expect a 1d array of dtype ``type``. If we had more dimensions to our array we would simply add more colons. For example, if we had a 3d array of integers we would specify ``int_[:, :, :]``.

Another thing to note is that we can get performance boosts by telling numba how the memory is laid out for our arrays. By default, numpy creates c-type arrays. This simply means that the memory for the array is allocated such that elements along the last dimension are closer to one another (in terms of physical memory address) than elements in the preceding dimensions. If the array is laid out in Fortran-contiguous ordering this would be the opposite. Either way, if you know the continuity of your array you can get performance enhancements by letting numba know about it by replacing the ``:`` where you find the closest elements with ``::1``. I show some examples below.

- For a 3d c-contiguous array of floating point values you would say ``double[:, :, ::1]``
- For a 2d Fortran-contiguous array of integers you would say ``int_[::1, :]``
- For a 1d contiguous array of long integers (doesn't matter if it is c or Fortran contiguous - they are the same) you would put ``long[::1]``.

The use of the ``autojit`` decorator is a bit simpler, but comes at a small performance hit. When using ``autojit`` you simply put ``@autojit()`` in the line just above your function declaration and it will infer the argument and return types when you call the function. It is easy to see why this is a bit slower at times because you are forcing python to do the work of determining types on the fly, rather than suppling them yourself.

Examples
^^^^^^^^

Imagine that I wanted to write a python function that would integrate sin(x) over a given interval, dividing the interval into a certain number of bins. I could write a python function that looks like the following:

.. ipython:: python

    from __future__ import division
    from math import sin

    def py_integrate_sin(a, b, N):
        """
        Integrate sin(x) from a, b by splitting into N intervals
        """
        s = 0.
        dx = (b - a) / N
        for i in xrange(N):
            s += sin(a + i*dx)
        return s * dx

    a, b, N = 1., 6., 100000

    %timeit py_integrate_sin(a, b, N)

I could now use the numba decorators to see how much better we can do. Note that I could not get ipython to allow decorators in this file so I will simply put the code here and then use the alternative (non-``@``) syntax to apply the decorator just below it.

.. code-block:: python

    from numba import double, long_, int_, jit, autojit

    @jit(double(double, double, long_), locals={'i':long_})
    def numba1_integrate_sin(a, b, N):
        s = 0.
        dx = (b - a) / N
        for i in xrange(N):
            s += sin(a + i*dx)
        return s * dx


    @autojit()
    def numba2_integrate_sin(a, b, N):
        s = 0.
        dx = (b - a) / N
        for i in xrange(N):
            s += sin(a + i*dx)
        return s * dx

.. ipython:: python

    from numba import double, long_, jit, autojit

    numba1_integrate_sin = jit('f8(f8, f8, long_)', locals={'i':long_})(py_integrate_sin)

    %timeit numba1_integrate_sin(a, b, N)

Now we will time the ``@autojit`` version. Note that performance is almost identical because the function is so simple. Other discrepancies come because my computer was not equally busy during the execution of both functions.

.. ipython:: python

    numba2_integrate_sin = autojit()(py_integrate_sin)

    %timeit numba2_integrate_sin(a, b, N)

Now suppose we want to write a function that would raise a square matrix to the Nth power. A Python implementation might look like this:

.. ipython:: python

    def py_matrix_power(mat, n):
        m = len(mat)
        out = np.zeros_like(mat)
        out[:] = mat[:]
        temp = np.zeros_like(mat)

        for _ in xrange(n - 1):
            for ii in xrange(m):
                for jj in xrange(m):
                    for kk in xrange(m):
                        temp[ii, jj] += out[ii, kk] * mat[kk, jj]
            out[:] = temp[:]
            temp = np.zeros_like(mat)

        return out

I will test this out by raising a random 50 by 50 matrix to the 15th power.

.. ipython:: python

    x = np.random.randn(50, 50)
    %timeit py_matrix_power(x, 15)

A numba version might look like this:

.. code-block:: python

    @autojit()
    def numba_matrix_power(mat, n):
        m = len(mat)
        out = np.zeros_like(mat)
        out[:] = mat[:]
        temp = np.zeros_like(mat)

        for _ in xrange(n - 1):
            for ii in xrange(m):
                for jj in xrange(m):
                    for kk in xrange(m):
                        temp[ii, jj] += out[ii, kk] * mat[kk, jj]
            out[:] = temp[:]
            temp = np.zeros_like(mat)

        return out

.. ipython:: python

    numba_matrix_power = autojit()(py_matrix_power)
    %timeit numba_matrix_power(x, 15)

Exercises
^^^^^^^^^

.. todo::

    Make your existing python functions work with numba. Repeat the timing exercises from above and put the results in the DataFrame.

Cython
------

Background and Setup
^^^^^^^^^^^^^^^^^^^^

An alternative to using numba to compile your python code is Cython. Cython is a very mature project and is widely recognized as a (if not the) community standard for interfacing C/C++ code with python as well as boosting the performance of actual python code. The following was taken from the homepage of the `Cython documentation <http://docs.Cython.org/src/quickstart/overview.html>`_:

    Cython is a programming language based on Python, with extra syntax allowing for optional static type declarations. It aims to become a superset of the Python language which gives it high-level, object-oriented, functional, and dynamic programming. The source code gets translated into optimized C/C++ code and compiled as Python extension modules. This allows for both very fast program execution and tight integration with external C libraries, while keeping up the high programmer productivity for which the Python language is well known.

So, instead of using a just-in-time compiler like numba, Cython actually generates C/C++ code for you in a form that is very easy to compile into a python extension module. Some people have said that Cython is just python with some type declarations thrown in. While the Cython syntax is very much like python, Cython is more than that because as it is eventually translated into C, any native C code you have is available to Cython and can therefore be used in your Cython code. But, we are getting ahead of ourselves now so we will back up a bit and cover some of the Cython basics.

Basics
^^^^^^

As with numba, Cython helps your code achieve speed gains by compiling your python code in to a machine readable language. This is a three step process.

1. Write your Cython code in a ``.pyx`` file
2. Compile your `.pyx`` file to an equivalent ``.c`` file
3. Compile the ``.c`` file to a ``.so`` (or ``.pyd`` on Windows) file to be imported directly into python.

There are several ways to complete steps 2 and 3 (build your Cython code):

- Write a distutils ``setup.py`` file -- reccomended way
- Use the ``pyximport`` module to allow Cython do to the work in the background -- limited.
- Do it by hand, calling the compiler with all the correct flags and links -- difficult and only advisable for experienced C programmers.
- Use the Sage or Ipython Notebook cell magics -- also limited because the compiled modules live only in the notebook session.

Throughout this instruction we will show you how to use ``setup.py``

Types in Cython
^^^^^^^^^^^^^^^

Before compilation, Cython code is converted into C code that follows the `Python C API <http://docs.python.org/2/c-api/>`_, It is by typing variables that your code gains speed and efficiency. Because the code is compliant with the python C API, any native python or C type is valid for assignment to your Cython objects. However, you will see much bigger speed gains when you use C types (allowing your python to be more like c is the whole point behind Cython).

.. note::

    C types like ``enum``, ``union``, and ``struct`` are available in Cython, even if they don't have a direct python counterpart.

Types are assigned to variables using the ``cdef`` keyword. Below are some examples.

.. code-block:: python

    cdef int my_int = 1
    cdef double my_float = 1.1
    cdef unsigned int my_uint = 42
    cdef long my_long = 1000000001

If you have more than one ``cdef`` to perform at a time, just put them all in a ``cdef`` block.

.. code-block:: python

    cdef:

        int my_int = 1
        double my_float = 1.1
        unsigned int my_uint = 42
        long my_long = 1000000001

Like C, you don't have to assign a value at the same time as assigning a type. For example, the following code is just as valid as the above.

.. code-block:: python

    cdef:

        int my_int
        double my_float
        unsigned int my_uint
        long my_long

    my_int = 1
    my_float = 1.1
    my_uint = 42
    my_long = 1000000001

Functions in Cython come in 3 varieties:

1. Python functions created with the ``def`` keyword. These functions are the ones you have been used to and are directly accessible to python.
2. C functions defined with the ``cdef`` keyword. These functions are given explicit types and are very fast. Because they are C functions they are not accessible to python (they are only defined within Cython ``.pyx`` modules)
3. Hybrid functions defined with the ``cpdef`` keyword. These functions can be given type information and can be thought of as both a ``def`` and a ``cdef`` function. That is, a function with that name is accessible to python, but if accessed from within Cython, Cython will use the C version with a small performance loss over the standard ``cdef`` function.

Some valid ``cdef`` and ``cpdef`` statements appear below:

.. code-block:: python

    # only accessible in Cython
    cdef int sum_ints(int a, int b)

    # Accessible in both python and Cython
    cpdef double divide(double a, double b)

    # Notice C memory-view syntax
    cdef double sum_2darr(double[:, ::1] arr)

.. note::

    Notice that I was able to define a 2d c-contiguous array using the same syntax that numba used (``double[:, ::1]``). This is a relatively new addition to Cython that, in my opinion, is a great enhancement. One part of working with numpy arrays from within Cython that was relatively slow was accessing slices of the array. To overcome this, the numpy people created a Cython interface that allowed for very fast access to array elements via array buffers. The syntax for the above array would have been the following ``np.ndarray[double, ndim=2, mode='c']``. Recently, however, the Cython people have created a much more robust memoryview interface that allows the data stored in a numpy array to be accessed natively as if it were a C array. This is what the ``double[:, ::1]`` syntax is telling Cython to do with ``arr``. Not only is this syntax a bit more compact and easy to remember, but also it actually comes with performance boosts as it allows Cython to treat the array more like a native C array. For that reason, we will only be using the memoryview syntax in this lab.

Writing the ``setup.py`` file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the most difficult aspects of Cython for a new user is getting a handle on building and compiling extension modules. The most reliable, portable, and robust way of doing this is to let python take care of the building, compiling, and linking for you! I will give you a template ``setup.py`` file that you can use for your projects and will then explain what the different parts are.

.. code-block:: python

    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    # some Extension args:
    #   Extension(name, sources, indlude_dirs=None, ..., language=None)

    # Basic Cython file
    cy_1 = Extension('cy_1', ['cy_1.pyx'])

    # Second extension that calls existing C++ code.
    cy_2 = Extension('cy_2', ['cy_2.pyx', 'cy_2.cpp', 'cy_2.h'], language='c++')

    # The main python function that builds the stuff
    setup(
      name = 'My Cython Extension',
      cmdclass = {'build_ext': build_ext},  # Use Cython's build_ext routine
      ext_modules = [cy_1, cy_2] # Must be list, even with only one Extension
    )

After you have written this ``setup.py`` file and your Cython code, you can build the extension from the command line (UNIX terminal or Windows command prompt) using the command

.. .. code-block:: bash

    python setup.py build_ext --inplace

This will tell python that you simply want to build the ``Extension`` s defined in ``setup.py`` in the current working directory. You can then launch python, ipython, or write a script and import the Cython extensions just like you would any other Python library using the ``Extension`` 's name (first argument you gave to ``Extension`` .)

Calling external C/C++ code
^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the main use cases for Cython is to enable Python to call pre-written C and C++ code. This is done using an ``extern`` block. The basic sytax is ``cdef extern from "filename":`` followed by indented lines specifying the interface for various functions or classes I would like to use. For example, if I wanted to use the ``sin`` function in the C standard library file ``math.h`` I could do the following:

.. code-block:: python

    cdef extern from "math.h":
        double sin(double)

This worked because in the file ``math.h`` there is a function ``sin`` that receives a single argument of dtype ``double`` and outputs a ``double``. The user should note that much of the C and C++ libraries have already been exposed to Cython by the Cython developers. The ``sin`` function could also be exposed by the following Python code.

.. code-block:: python

    from libc.math cimport sin

Note the ``cimport`` statement instead of ``import``. This is special Cython syntax that tells the compiler to look for a Cython block in a particular file (in this case libc.math) that defines the object being imported (the sin function). For a complete listing of which C and C++ standard library functions are availible see the libc and libcpp folders from this point in the `Cython github repo <https://github.com/cython/cython/tree/master/Cython/Includes>`_.

Another example might be including a ``Rectangle`` class within the ``shapes`` namespace in a local ``.cpp`` file I have written (Note this was taken from the Cython documentation and includes object oriented constructs we haven't talked about yet so don't worry about the details - I just want you to know you can do something like this).

.. code-block:: python

    cdef extern from "Rectangle.h" namespace "shapes":
        # Expose class
        cdef cppclass Rectangle:
            # Give method signatures
            Rectangle(int, int, int, int) except +
            int x0, y0, x1, y1
            int getLength()
            int getHeight()
            int getArea()
            void move(int, int)

.. warning::

    I completely skipped one of the main features of Cython: defining custom extension types (classes). This was deliberate and I did it because we have not, yet, gone object oriented programming in Python. After you have handle on basic object oriented programming I strongly encourage you to review the Cython documentation for defining classes and using existing C++ classes.


Examples
^^^^^^^^

I will show the same examples I had for numba here.

.. code-block:: python

    # File integrate.pyx

    from libc.math cimport sin

    cpdef double cy_sin_integrate(double a, double b, long N):
        cdef int i
        cdef double dx, s
        s = 0.
        dx = (b - a) / N
        for i in range(N):
            s += f(a + i * dx)
        return s * dx

The associated setup.py file

.. code-block:: python

    # File setup.py

    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    integrate = Extension('integrate', ['integrate.pyx'])

    # The main python function that builds the stuff
    setup(
      name = 'My Cython Extension',
      cmdclass = {'build_ext': build_ext},
      ext_modules = [integrate]
    )

The timing...

.. code-block:: python

    In [5]: from integrate import cy_sin_integrate

    In [6]: %timeit cy_sin_integrate(a, b, N)
    100 loops, best of 3: 3.77 ms per loop

Now the ``.pyx`` and ``setup.py`` for the matrix power problem:

.. code-block:: python

    # File mat_power.pyx
    import numpy as np
    cimport cython

    # compiler directives to speed up array handling
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef double[:, ::1] cy_matrix_power(double[:, ::1], mat, int n):
        cdef int m = len(mat)
        cdef double[:, ::1] out = np.zeros_like(mat)
        out[:] = mat[:]
        cdef double[:, ::1] temp = np.zeros_like(mat)

        cdef int it, ii, jj, kk

        for it in xrange(n - 1):
            for ii in xrange(m):
                for jj in xrange(m):
                    for kk in xrange(m):
                        temp[ii, jj] += out[ii, kk] * mat[kk, jj]
            out[:] = temp[:]
            temp = np.zeros_like(mat)

        return out

.. code-block:: python

    # file: setup.py

    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    mat_power = Extension('mat_power', ['mat_power.pyx'])

    # The main python function that builds the stuff
    setup(
      name = 'My Cython Extension',
      cmdclass = {'build_ext': build_ext},
      ext_modules = [mat_power]
    )

And testing it ...

.. code-block:: python

    In [7]: from mat_power import cy_matrix_power

    In [8]: %timeit cy_matrix_power(x, 15)
    100 loops, best of 3: 3.1 ms per loop

Exercises
^^^^^^^^^

.. todo::

    Work through the main 3 problems using Cython this time. Repeat the timing exercises from above and put the results in the DataFrame.


Wrapping Compiled Extensions
============================

We have already talked a little bit about how you can use Cython to wrap existing C/C++ (and even some Fortran) libraries. The topic of wrapping in general is far beyond the scope of this lab, but I will at least mention a few of the more useful tools out there.

For wrapping C/C++:

- Cython: Call external C function in a python like interface that gets compiled into a Python extension module
- SWIG: Auto-generate python wrapper code based on C/C++ header files. Hands-off, low-flexibility type wrapping.
- Boost.Python: Auto-generate python wrapper code based on C/C++ header files. Hands-off, low-flexibility type wrapping.
- xdress: Automatically create Cython wrapper files from C/C++ header files. In the middle of writing your own extensions using Cython and having SWIG or Boost.Python do it for you.

For wrapping Fortran:

- f2py: The standard. What you should probably use if you have serious fortran to wrap.
- Cython: Nice integration with Python, C, and C++.

.. todo::

    Analyze your timing results for each of the three implementations for each of the three problems. Compute ratio of time saved, as well as plots of time vs. argument size when applicable.

    NOTE: I am giving you a lot of flexibility here. I want you to be creative in how you use the data analysis tools you have developed thus far. Spend about 10-15 minutes coming up with a compelling way to share the results you have gathered during this lab.