6 Steps to Make this Pandas Dataframe Operation 100 Times Faster
Cython for Data Science: Combine Pandas with Cython for an incredible speed improvement
In this article you’ll learn how to speed up Pandas’ df.apply() function by more than 100x. We take Pandas’ standard dataframe.apply function and upgrade it with a bit of Cython, cutting execution time from 3 minutes to under 2 seconds. At the end of this article you’ll:
- understand why df.apply() is slow
- understand how to speed up the apply with Cython
- know how to replace the apply by passing the array to Cython
- be able to multi-process your Cython-function to squeeze out the maximum amount of speed
- annoy your coworkers with the fact that your code is always so much faster than theirs
Before we begin I highly recommend reading this article about why Python is so slow; it helps you understand the type of problem we’re trying to solve in this article. You can also check out this article about getting started with Cython.
Isn’t Pandas already pretty fast?
True, it’s built on NumPy, which is written in C, and that part is very fast. df.apply(), however, applies a Python function to each row, and the Python function itself is slow. That’s one part of the problem we’re trying to solve. The other part is that the looping over rows happens in Python as well. First, we’ll set up, and then we’ll go about fixing these problems in 6 steps.
Setup
These kinds of tutorials always work best with a practical example, so for this project, we’ll imagine that we have a webshop. Because we already have 17 million customers, we decide to open a physical store so that customers can pick up their products, saving us delivery costs. We have a few locations picked out but which would be the best one? We determine that we’d best settle where the average distance to all of our clients is the smallest.
We want to calculate the average distances for each location but we’re busy people and don’t want to spend a long time waiting for the distance calculation to finish.
Our goal is to optimize calculating the average distance from all of our customers to a certain location.
Loading data
For this we load all of our 17 million customers from the database (using this article) into a dataframe (all unnecessary columns are hidden):

customer_id        lat       lon
0            52.131980  5.510361
1            52.438026  6.815252
2            51.238809  4.447790
3            51.163722  3.588959
4            52.559595  5.483185
Distance calculation function
The function below can be used to calculate the spherical distance between two points on the globe.
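The original snippet is not reproduced here, but a pure-Python haversine function along these lines would do the job (the function name, argument order, and the 6371 km Earth radius are assumptions used for this sketch):

```python
from math import sin, cos, asin, sqrt, pi

def calculate_distance(lat1: float, lon1: float,
                       lat2: float, lon2: float) -> float:
    """Spherical (haversine) distance between two (lat, lon) points, in km."""
    # convert degrees to radians
    r_lat1, r_lat2 = lat1 * pi / 180, lat2 * pi / 180
    d_lat = (lat2 - lat1) * pi / 180
    d_lon = (lon2 - lon1) * pi / 180
    # haversine formula; 6371 km is the mean radius of the Earth
    a = sin(d_lat / 2) ** 2 + cos(r_lat1) * cos(r_lat2) * sin(d_lon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))
```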
Installing dependencies
Spoiler alert: we’re going to use Cython to solve this problem. We are using CythonBuilder to take care of compiling, building, and packaging our package. Install both with:

pip install cython
pip install cythonbuilder
When this is installed, call cybuilder init. This will create a folder in your root called ext. If you are unfamiliar with working with a terminal, check out this article.
Calculating the average distance
We have selected a location that we want to calculate the average customer distance to. We are going to solve this problem in 6 steps. In every step, we’ll improve our code and achieve more speed. We’ll start with just Python and gradually add more Cython and other optimizations.
Step 1. Pure Python
We’ll df.apply the distance-calculation function to our dataframe, assign the result to a new column, and, lastly, average that column.
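A sketch of what this step could look like, run on a tiny stand-in for the 17-million-row dataframe (the store coordinates and the distance function shown here are assumptions, since the original listing isn’t included):

```python
from math import sin, cos, asin, sqrt, pi

import pandas as pd

def calculate_distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km (the function from the setup section)."""
    r_lat1, r_lat2 = lat1 * pi / 180, lat2 * pi / 180
    d_lat = (lat2 - lat1) * pi / 180
    d_lon = (lon2 - lon1) * pi / 180
    a = sin(d_lat / 2) ** 2 + cos(r_lat1) * cos(r_lat2) * sin(d_lon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

# tiny stand-in for the customer dataframe
df = pd.DataFrame({'lat': [52.131980, 52.438026, 51.238809],
                   'lon': [5.510361, 6.815252, 4.447790]})
store_lat, store_lon = 52.0907, 5.1214  # hypothetical store location

# apply the distance function row by row, then average the new column
df['distance'] = df.apply(
    lambda row: calculate_distance(row['lat'], row['lon'],
                                   store_lat, store_lon),
    axis=1)
avgdist = df['distance'].mean()
```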
This works, but a lot can be improved. The function finishes in roughly 3 minutes. This will be our benchmark:

[1_purepy] 17M rows 179037ms
Step 2. Cythonize
In this part, we’ll just put our Python code in a Cython file. Create a new file called geopack.pyx in ext/pyxfiles. If this folder doesn’t exist yet, you probably forgot to call cybuilder init. Just copy and paste your Python function into this new file so it looks like below (don’t forget the imports):
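The original listing isn’t shown here, but the file could look like the sketch below: the same pure-Python haversine function described earlier, plus the imports it needs.

```python
# ext/pyxfiles/geopack.pyx - step 2: still plain Python, not yet optimized
from math import sin, cos, asin, sqrt, pi

def calculate_distance(lat1, lon1, lat2, lon2):
    """Haversine distance between two (lat, lon) points, in kilometers."""
    r_lat1, r_lat2 = lat1 * pi / 180, lat2 * pi / 180
    d_lat = (lat2 - lat1) * pi / 180
    d_lon = (lon2 - lon1) * pi / 180
    a = sin(d_lat / 2) ** 2 + cos(r_lat1) * cos(r_lat2) * sin(d_lon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))
```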
Next, we’ll need to compile, build and package this module. Luckily this is very easy with CythonBuilder. Call cybuilder build to do just this. Then we can import from the ext folder that CythonBuilder created and use the function like below:
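The usage could look roughly like this sketch (it assumes the compiled package is importable as ext.geopack and that df, store_lat, and store_lon are defined as in step 1; it won’t run without the build step):

```python
from ext import geopack  # the package that cybuilder build produced

df['distance'] = df.apply(
    lambda row: geopack.calculate_distance(row['lat'], row['lon'],
                                           store_lat, store_lon),
    axis=1)
avgdist = df['distance'].mean()
```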
Easy as that! This code completes in 2.7 minutes, which is already a bit faster, even though we haven’t optimized anything yet. Let’s look at our benchmark and start optimizing.

[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
Step 3. Optimizing
In this step, we are optimizing our Cython code. Mostly we just add types so that the code can be compiled and doesn’t have to go through the interpreter. Let’s look at the new function below and then go through all changes.
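The original listing is not reproduced here, but an optimized version could look like the sketch below (the exact compiler directives and variable names are assumptions; the layout roughly matches the line numbers discussed next):

```cython
from libc.math cimport sin, cos, asin, sqrt, pi
import cython

@cython.nonecheck(False)
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef float calculate_distance(double lat1, double lon1,
                               double lat2, double lon2):
    cdef:
        double r_lat1 = lat1 * pi / 180, r_lat2 = lat2 * pi / 180
        double d_lat = (lat2 - lat1) * pi / 180
        double d_lon = (lon2 - lon1) * pi / 180
    return 6371 * 2 * asin(sqrt(
        sin(d_lat / 2) ** 2 + cos(r_lat1) * cos(r_lat2) * sin(d_lon / 2) ** 2))
```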
Line 1 and 2
We’re cimporting the C versions of the math functions now; these are a bit faster than their Python counterparts.
Line 4–7
These are compiler directives; they tell the compiler to avoid certain checks. We avoid checking for Nones, zero-division, and some checks that have to do with bounds and wraparound when we use loops. Check the docs for more information.
Line 8
First, we define our function not with def but with cpdef. This makes the function accessible from both C and Python. Then we define our return type (the float part in cpdef float). Lastly, we type our inputs (like double lat1).
Line 11–13
Adding types for variables we use in this function.
Results
After we’ve adjusted our function, we call cybuilder build again so that our package is updated. When we run the new function we see that our code is a bit faster, finishing in roughly 2.4 minutes:

[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
We’ve already shaved off 17% of the execution time, but this is not nearly enough. The reason that df.apply() is slow despite the optimized function is that all looping still happens in Python. So let’s Cythonize the loop!
Step 4. Cythonizing the loop
We’ll create a new function in Cython that receives two arrays (all of our customers’ latitudes and longitudes) and two floats (the store lat and the store lon).
The only thing this function does is loop through all of the data and call the distance function we defined earlier. Execute cybuilder build and run this function like the code below:
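A sketch of such a loop function (the function name, argument order, and use of float32 memoryviews are assumptions; it would live in the same geopack.pyx next to calculate_distance):

```cython
import numpy as np

cpdef float[:] calculate_distances(double lat, double lon,
                                   double[:] lats, double[:] lons):
    """Distance from (lat, lon) to every customer (lats[i], lons[i])."""
    cdef:
        int i
        int n = lats.shape[0]
        float[:] distances = np.empty(n, dtype=np.float32)
    for i in range(n):
        # call the typed distance function from the previous step
        distances[i] = calculate_distance(lats[i], lons[i], lat, lon)
    return distances
```

After building, calling it could look like avgdist = np.asarray(geopack.calculate_distances(store_lat, store_lon, df['lat'].to_numpy(), df['lon'].to_numpy())).mean().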
When we run this new function we finish in 2.3 seconds. That’s a pretty dramatic speed upgrade! In the next part, we’ll discover how to squeeze out even more speed.

[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
[4_CyLoop] 17M rows 2346ms (x76.327)
Step 5. Aggregation in Cython
Since we’re interested in the average distance, and we already include the loop in Cython, we don’t have to return an array. We can just calculate the average inside the Cython function we’ve created in the previous step. That'll simplify our function as well:
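The aggregating version could look like this sketch (same assumptions as before; it replaces the array-returning loop with a running total):

```cython
cpdef float calculate_mean_distance(double lat, double lon,
                                    double[:] lats, double[:] lons):
    """Average distance from (lat, lon) to all (lats[i], lons[i])."""
    cdef:
        int i
        int n = lats.shape[0]
        float total = 0
    for i in range(n):
        total += calculate_distance(lats[i], lons[i], lat, lon)
    return total / n
```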
We call our function directly like:

avgdist = geopack.calculate_mean_distance(
    store_lat,
    store_lon,
    df['lat'].to_numpy(),
    df['lon'].to_numpy()
)
This shaves off a small bit of time:

[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
[4_CyLoop] 17M rows 2346ms (x76.327)
[5_CyAggr] 17M rows 2225ms (x80.468)
Step 6. Using multiple CPUs to process in parallel
We have multiple CPUs, right? So why not use some? In the elegant piece of code below we use a ProcessPool to divide the work over multiple processes. Each process runs simultaneously and uses a different CPU.
Processes take some time to initialize so make sure you have enough work for them to make the startup cost worthwhile. Read this article to learn more about multitasking in Python.
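A sketch of such a ProcessPool setup (the helper name and worker count are assumptions; note that np.array_split may produce slightly unequal chunks, so the mean of the chunk means is only approximately the overall mean):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from ext import geopack  # the compiled package from the previous steps

def mean_distance_parallel(store_lat, store_lon, lats, lons, workers=4):
    # split the coordinate arrays into one chunk per worker
    lat_chunks = np.array_split(lats, workers)
    lon_chunks = np.array_split(lons, workers)
    # each process runs the Cython function on its own chunk, on its own CPU
    with ProcessPoolExecutor(max_workers=workers) as pool:
        means = pool.map(geopack.calculate_mean_distance,
                         [store_lat] * workers, [store_lon] * workers,
                         lat_chunks, lon_chunks)
    # chunks are near-equal in size, so averaging the chunk means is fine here
    return float(np.mean(list(means)))

avgdist = mean_distance_parallel(store_lat, store_lon,
                                 df['lat'].to_numpy(), df['lon'].to_numpy())
```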
When we execute the code above and have all of our CPUs working simultaneously on the task at hand, we end up with the final result: the maximum amount of speed that we can squeeze out of this task:

[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
[4_CyLoop] 17M rows 2346ms (x76.327)
[5_CyAggr] 17M rows 2225ms (x80.468)
[6_CyProc] 17M rows 1640ms (x109.169)
As you can see, we have sped up this task almost 110-fold, eliminating over 99% of the pure-Python execution time. Notice that all we needed to do for this incredible speed increase was:
- copy our Python function to Cython
- add some types
- add a simple function that handles the looping
- add 4 lines to handle multi-processing
Not bad! Now let’s calculate the optimal distance.
Using our optimized function to calculate the best location for our store
We can now execute the code below to finally get our answer:
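The final comparison could look like this sketch (the city-centre coordinates are approximate assumptions; for brevity it calls the single-process Cython function from step 5):

```python
import numpy as np
from ext import geopack

# candidate store locations: approximate (lat, lon) city-centre coordinates
locations = {'Amsterdam': (52.3676, 4.9041),
             'Utrecht':   (52.0907, 5.1214),
             'Groningen': (53.2194, 6.5665)}

lats, lons = df['lat'].to_numpy(), df['lon'].to_numpy()
for city, (lat, lon) in locations.items():
    avgdist = geopack.calculate_mean_distance(lat, lon, lats, lons)
    print(f'{city}: {avgdist:.2f} km avg')
```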
The default .apply() method would calculate the average distances to our 3 locations in roughly 9 minutes. Because of our new, optimized, multi-processed function, we only have to wait around 5 seconds! The results below indicate that Groningen is the best city for us to have our store!

Amsterdam: 115.93 km avg
Utrecht: 111.56 km avg
Groningen: 102.54 km avg
Conclusion
This article builds on the foundation laid by this one. I hope to have demonstrated that you can combine the ease of coding in Python with the efficiency of C to improve certain Pandas operations with relative ease and achieve incredible speed increases. For more information about why Python functions are so slow, check out this article.
I hope everything was as clear as I intended, but if not, please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics like these:
- Why Python is so slow and how to speed it up
- Getting started with Cython: How to perform >1.7 billion calculations per second in Python
- Write your own C extension to speed up Python x100
- Advanced multi-tasking in Python: applying and benchmarking threadpools and processpools
- Virtual environments for absolute beginners — what is it and how to create one (+ examples)
- Create and publish your own Python package
- Create Your Custom, private Python Package That You Can PIP Install From Your Git Repository
- Create a fast auto-documented, maintainable, and easy-to-use Python API in 5 lines of code with FastAPI
- Dramatically improve your database insert speed with a simple upgrade
Happy coding!
— Mike
P.S: like what I’m doing? Follow me!