Write Your Own C-extension to Speed Up Python by 100x

How to write, compile, package and import your own, superfast C-module into Python

Write Your Own C-extension to Speed Up Python by 100x
It’s like installing a jet engine on your truck! (image by Y S on Unsplash)

Python is a fantastic language that is very easy to pick up, very fast to develop with, and very clear to read. All these benefits come at a cost: Python is pretty slow compared to some other languages. I can strongly recommend reading this article before continuing to get a clear idea of the problem we’re trying to solve. The goal of this article is to answer this question:

How can we combine Python’s ease of development without sacrificing speed?

The sarcastic answer is to rewrite the project in another language but that’s not what we’re here for. You’re a Python programmer, already have a lot of programs written in Python, and just want to speed up a small part of it. Also: if you are used to writing in Python, the transition to another language like C# or Java might be pretty rough.

We’ll combine the best of two worlds: we extend our program with a small module that we write in C. Python programmers can just import this package, don’t have to know a single line of C, and can still enjoy the 100x speed increase.

Importing the C-module into our Python program (image by (Kat Sazonova on Unsplash)

Writing in C? Sounds difficult

Written in C!?” I hear you ask.

You just talked about a rough transition to Java, now we’re going for C?!”. True, writing code in C might be a little challenging but you’ll see that the 100x speed increase is definitely worth it!

Also, we only have to re-write a small part of our code to C (in our case just a single function).

Isn’t this the same as re-writing the project in another language?

The beauty of the solution that this article describes is that you only have to rewrite the slow parts of your code. Imagine we have programmed an API in Python that receives and analyzes audio files. It makes sense to rewrite the function that analyzes the audio file in C since this is the bottleneck of the project. We’d waste a lot of time rewriting our API.

Can’t this be done simpler?

Yes, there are simpler ways to create the C-extension that is going to speed up our program so much. In this article, we used Cython to convert some Python-like code to a C-module that achieves roughly the same performance increase.

However, in this post, we’re going to go the hard way and write our own module in C because it gives us a very interesting peek at the inner workings of Python and how it integrates modules written in C.

When does it make sense to create a C-module?

The type of task we’re able to optimize is CPU-heavy task, not the I/O-tasks like waiting for a response. Waiting for an API isn’t faster in ‘faster languages’.

We want to optimize a small part of our code that performs a very CPU-heavy task. These kinds of tasks are very well suited for optimizing in C.

Let’s set up shop and speed up this program (image by Damir Kopezhanov on Unsplash)

Setup

First things first: setting up a virtual environment. This is not strictly necessary but it’s best practice to keep your dependencies untangled.

As you’ve read in the previous we’ll need a function that does a lot of calculating. We’ll use a simple example: calculating the number of primes within a range. Here’s the vanilla Python code that’ll do that.

The code above looks a bit messy and I hear you think “WHILE LOOPS?! FLAGS?!”. Trust me, they’re in there for a good reason.

Also notice that this is not the most efficient way to calculate the prime number but that’s not the point: we just need a function that takes a lot of computing!


You are already using C-compiled functionalities

Instead of the while loops and flags, we can use the built-in function range(). This passes generation, iteration, and checking if we’re done to a C-module that is much faster. Let’s upgrade that nasty function with range():

Notice that this code is not only more readable: it’s faster too. We can use these two functions to look up the number of primes between 0 and 100.000:[Vanilla] examined 100000 numbers; found 9592 primes in 30.38632 sec
[V+range] examined 100000 numbers; found 9592 primes in 20.00026 sec

Using some built-in C-modules has already improved execution speeds a little but we’re only getting started.

We’ll have to get our hands a bit dirty but the results will be phenomenal (image by Adi Goldstein on Unsplash)

Writing a C module for Python

The first thing we have to do is to convert the prime-finding function to C. Then, somehow, we have to get Python to talk with that C-function. The solution for this is to wrap the C-function in a Python module.

You’re already familiar with these; think of time, os, and sys e.g. We’ll call our module Fastcount.

At the end of this part we’ll have our module installed so you can import the module and execute a method on the module like this:

import Fastcount

We’ll do this in 3 steps:

  1. Re-write the prime-finding function in C
  2. Package the C-function in a Python module
  3. Build and install the module

Step 1: the C-function

This part might be a bit hard if you’re unfamiliar with the C language. It’s much like Python but much more restricted. Check it out:

Apart from some syntax here and there, this function looks a lot like the nasty one we’ve written in the previous chapter.


2. Wrapping the C-function in a module

Okay, so we have a function written in C and a Python file. How can we access the C-function from the Python file? We have to take a few steps:

How all elements fit together (image by author)

We’ve already defined our C-function above so let’s wrap that C-function in a Python object and (black) and work our way outward:

2.1 wrap the C-function into a Python Object.
Let’s go through the code:

Remember that everything in Python is an object? In fact everything in Python is a PyObject. Even declaring integer results in a PyObject. Under the hood, the Python engine works with this structure to enable dynamic typing.

In the code above we’re wrapping our C-function in a PyObject. This function parses the arguments that Python sends us using the PyArg_ParseTuple function. The ii indicates that we expect two integers (more info here).

Next, we call the C-function that the found number of primes. Lastly, we return the found number of primes after we convert them to a PyLong; an object that Python can interpret as a type long.

2.2 add the Python Object to a list of methods.
The below code specifies the function name that we call in Python:

Here we define a list of all methods that our module has.

In this example we define primecounter as the function name; this is what we are going to call the function in Python. Then we refer to the function that creates the PyObject from the previous step. METH_VARAGS defines the signature: it expects self and *args from Python. Lastly, we define a method description for the docstring.

We can add more objects in there if we want our module to have more functions but for the purpose of this demo, we’ll keep it simple. This PyMethodDef also requires line 3; the line containing all the NULL’s.

2.3 Create the Module Definition on which we register the module name, description and the list of methods.
Here’s the code:

This piece of code defines our module. The first item is required for creating a PyModuleDef. In lines 4 and 5 we specify the name and description of our module.

In line 6 we can specify the amount of memory needed to store the state of our program. It’s required when your program is used in multiple sub-interpreters.

The negative value indicates that this module doesn’t support sub-interpreters. Specify the memory requirement of your module to be allocated on each sub-interpreter session with a non-negative value.

The last item, in line 7, refers to the list of methods that we’ve specified in the previous step.

2.4 Create a Initialization function that creates our module from the Module Definition.
The final piece! This is the function that Python will call when it imports our module for the first time:

We use PyModule_Create and pass it a reference to the PyModuleDef from the previous part. This will return a PyObject in which our C-function is wrapped. Check out the whole code here.


3. Building, installing, and running the extension

This part resembles the step in the process of creating your own public or private Python Package. We have to create a setup.py file that points to our C-code from the previous step and then create the package. Let’s go:

The code above is pretty self-explanatory; the most important line is line 11 where we specify where we can find the C-file that we’ve written in step 2.

Next step: simply call python setup.py install. This will take all of our code and package it in a module called Fastcount. Now in Python, we can:

Troubleshooting

Windows: calling python setup.py install might get you an error that reads something like:Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools

You can solve this by installing C++ build tools that you can download here.

Let’s race algorithms (image by Jonathan Chng on Unsplash)

Benchmarking

Let’s put our code to the test; we want to count the number of prime numbers between 0 and 500.000. In order to do this we need to check approximately 1.3 billion numbers; plenty of work for my poor laptop. We are going to benchmark:

  • vanilla Python
  • vanilla Python + built-ins (like range)
  • Fastcount (our C-module)
  • Fastcount MP

After executing all methods multiple times, and taking the least amount of time, the results are in. All methods found the correct number of primes (41.538) and this is how long they took (lower is better):

Execution times of all methods to find primes in a range (lower is better, image by author)

Using the built-in functions like range() already eliminates roughly 35% of the time it takes to complete. Even though this is a pretty nice increase, our own module finishes the task almost 33 times faster than vanilla Python.

To gain even more speed we spread out all calculations over multiple CPUs by multiprocessing our function which completes our calculation 102 times faster than vanilla Python. Check out this article or this one to learn more about safe multi-tasking in Python using threads and processes.


Conclusion

This was a very long and pretty complex article but we learned a lot about how Python functions under the hood. We’ve written, compiled, packaged, and imported our own, custom C-module. Although this article was pretty lengthy, in the end, we got execution speeds down from over 10 minutes to almost 6 seconds: eliminating execution time by 99%!

If you have suggestions/clarifications please comment so I can improve this article. In the meantime, check out my other articles on all kinds of programming-related topics like these:

Happy coding!

— Mike

P.S: like what I’m doing? Follow me!