6 Steps to Make this Pandas Dataframe Operation 100 Times Faster

Cython for Data Science: Combine Pandas with Cython for an incredible speed improvement

Sure doesn’t look that fast, let’s speed it up! (image by Theodor Lundqvist on Unsplash)

In this article you’ll learn how to speed up Pandas’ df.apply() function over 100x: we take the standard dataframe.apply function and upgrade it with a bit of Cython, cutting execution time from 3 minutes to under 2 seconds. At the end of this article you’ll:

  • understand why df.apply() is slow
  • understand how to speed up the apply with Cython
  • know how to replace the apply by passing the array to Cython
  • be able to multi-process your Cython-function to squeeze out the maximum amount of speed
  • annoy your coworkers with the fact that your code is always so much faster than theirs

Before we begin I highly recommend reading this article about why Python is so slow; it helps you understand the type of problem we’re trying to solve in this article. You can also check out this article about getting started with Cython.


Isn’t Pandas already pretty fast?

True, it’s built on NumPy, which is written in C, and it’s very fast. df.apply(), however, applies a Python function to each row, and that Python function is slow. This is one part of the problem we’re trying to solve. The other part is that the looping itself also happens in Python. First we’ll set up, and then we’ll go about fixing these problems in 6 steps.

Setup

These kinds of tutorials always work best with a practical example, so for this project we’ll imagine that we run a webshop. Because we already have 17 million customers, we decide to open a physical store so that customers can pick up their products, saving us delivery costs. We have a few locations picked out, but which would be the best one? We decide that we’d best settle where the average distance to all of our customers is smallest.

We want to calculate the average distances for each location but we’re busy people and don’t want to spend a long time waiting for the distance calculation to finish.

Our goal is to optimize calculating the average distance from all of our customers to a certain location.

Loading data

For this we load all of our 17 million customers from the database (using this article) into a dataframe (all unnecessary columns are hidden):

customer_id          lat        lon
         0    52.131980    5.510361
         1    52.438026    6.815252
         2    51.238809    4.447790
         3    51.163722    3.588959
         4    52.559595    5.483185
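The loading code itself lives in the linked article; a minimal, self-contained sketch of the idea (here using an in-memory SQLite database with a few sample rows as a stand-in for the real customer database, and an assumed `customers` table name) could look like this:

```python
import sqlite3

import pandas as pd

# Stand-in for the real database: an in-memory SQLite DB with sample rows.
# The article loads 17 million customers from an actual database instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id INTEGER, lat REAL, lon REAL)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (0, 52.131980, 5.510361),
        (1, 52.438026, 6.815252),
        (2, 51.238809, 4.447790),
    ],
)

# Load the customer coordinates into a dataframe
df = pd.read_sql("SELECT customer_id, lat, lon FROM customers", con)
print(df.head())
```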

Distance calculation function

The function below can be used to calculate the spherical distance between two points on the globe.
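The original snippet is embedded as a gist that isn’t shown here; a pure-Python haversine implementation along these lines (the function name and the 6371 km earth radius are assumptions on my part) would do the job:

```python
from math import asin, cos, radians, sin, sqrt

def get_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two (lat, lon) points in degrees."""
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is roughly the earth's radius
```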

Installing dependencies

Spoiler alert: we’re going to use Cython to solve this problem. We are using CythonBuilder to take care of compiling, building, and packaging our package. Install it with:

pip install cython
pip install cythonbuilder

When this is installed, call cybuilder init. This will create a folder called ext in your project root. If you are unfamiliar with working in a terminal, check out this article.

We’ll go from this to a hypersonic jet in 6 steps (image by History in HD on Unsplash)

Calculating the average distance

We have selected a location that we want to calculate the average customer distance to. We are going to solve this problem in 6 steps. In every step, we’ll improve our code and achieve more speed. We’ll start with just Python and gradually add more Cython and other optimizations.

Step 1. Pure Python

We’ll df.apply the distance-calculation function to our dataframe, assign the result to a new column, and, lastly, average that column.
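The gist for this step isn’t shown here; a self-contained sketch of the same idea (a tiny sample frame and an assumed candidate location stand in for the real 17M-row data) could look like this:

```python
from math import asin, cos, radians, sin, sqrt

import pandas as pd

def get_distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# A tiny sample; the article runs this on all 17 million rows
df = pd.DataFrame({
    "lat": [52.131980, 52.438026, 51.238809],
    "lon": [5.510361, 6.815252, 4.447790],
})
store_lat, store_lon = 52.0907, 5.1214  # candidate location (assumed)

# Step 1: apply the pure-Python function row by row, then average the column
df["distance"] = df.apply(
    lambda row: get_distance(store_lat, store_lon, row["lat"], row["lon"]),
    axis=1,
)
avgdist = df["distance"].mean()
print(f"{avgdist:.2f} km avg")
```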

This works but a lot can be improved. The function finishes in roughly 3 minutes. This will be our benchmark:

[1_purepy]  17M rows   179037ms


Step 2. Cythonize

In this part, we’ll just put our Python code in a Cython file. Create a new file called geopack.pyx in ext/pyxfiles. If this folder doesn’t exist yet you probably forgot to call cybuilder init. Just copy and paste your Python function into this new file so it looks like below (don’t forget the imports):
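The gist isn’t shown here; since this step is a plain copy-paste, the file is still ordinary Python (the get_distance name is my assumption):

```python
# ext/pyxfiles/geopack.pyx
from math import asin, cos, radians, sin, sqrt

def get_distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```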

Next, we’ll need to compile, build and package this module. Luckily this is very easy with CythonBuilder. Call cybuilder build to do just this. Then we can import from the ext folder that CythonBuilder created and use the function like below:
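The usage gist isn’t shown here; assuming the function above was named get_distance, calling the compiled version could look like this (note that we still loop via df.apply, only the function itself is compiled):

```python
from ext import geopack  # the package that cybuilder build produced

df["distance"] = df.apply(
    lambda row: geopack.get_distance(
        store_lat, store_lon, row["lat"], row["lon"]),
    axis=1,
)
avgdist = df["distance"].mean()
```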

Easy as that! This code completes in 2.7 minutes, which is already a bit faster even though we haven’t optimized anything yet. Let’s look at our benchmark and start optimizing.

[1_purepy]  17M rows   179037ms
[2_pyINcy]  17M rows   163102ms      (x1.098)


Step 3. Optimizing

In this step, we are optimizing our Cython code. Mostly we just add types so that the code can be compiled and doesn’t have to go through the interpreter. Let’s look at the new function below and then go through all changes.
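The gist with the optimized function isn’t shown here; a version consistent with the line-by-line discussion below (line numbers match approximately; variable names are my own) could look like:

```cython
from libc.math cimport M_PI, asin, cos, sin, sqrt
cimport cython

@cython.nonecheck(False)
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef float get_distance(double lat1, double lon1, double lat2, double lon2):
    """ Haversine distance in km """
    cdef:
        double rlat1 = lat1 * M_PI / 180, rlat2 = lat2 * M_PI / 180
        double dlat = (lat2 - lat1) * M_PI / 180
        double dlon = (lon2 - lon1) * M_PI / 180, a
    a = sin(dlat / 2) ** 2 + cos(rlat1) * cos(rlat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```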

Lines 1 and 2
We’re now cimporting C math functions instead of using the Python ones. These are a bit faster.

Lines 4–7
These are compiler directives; they tell the compiler to skip certain checks. We avoid checking for None, zero division, and some checks that have to do with bounds and wraparound when we use loops. Check the docs for more information.

Line 8
First, we define our function not with def but with cpdef. This makes the function accessible from both C and Python. Then we define our return type (the float part in cpdef float). Lastly, we type our inputs (e.g. double lat1).

Lines 11–13
Adding types for the variables we use in this function.

Results
After we’ve adjusted our function we call cybuilder build again so that our package is updated. When we run the new function we see that our code is a bit faster, finishing in roughly 2.4 minutes:

[1_purepy]  17M rows   179037ms
[2_pyINcy]  17M rows   163102ms      (x1.098)
[3_optiCy]  17M rows   148108ms      (x1.209)

We’ve already shaved off 17% of the execution time, but this is not nearly enough. The reason that df.apply() is still slow despite the optimized function is that all looping happens in Python. So let’s Cythonize the loop!


Step 4. Cythonizing the loop

We’ll create a new function in Cython that receives two arrays (all of our customers’ latitudes and longitudes) and two floats (the store’s latitude and longitude).

The only thing this function does is loop through all of the data and call the function we’ve defined earlier (line 20). Execute cybuilder build and run this function like the code below:
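The gists for this step aren’t shown here; a sketch of a typed loop that fills an output array by calling the Step-3 function (names and the memoryview signature are assumptions) could be:

```cython
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double[:] get_distances(double lat, double lon,
                              double[:] lats, double[:] lons):
    cdef int i
    cdef int n = lats.shape[0]
    cdef double[:] out = np.empty(n, dtype=np.float64)
    for i in range(n):
        # Call the optimized Step-3 function for every customer
        out[i] = get_distance(lat, lon, lats[i], lons[i])
    return out
```

After rebuilding, calling it could look like:

```python
import numpy as np
from ext import geopack

distances = geopack.get_distances(
    store_lat, store_lon,
    df["lat"].to_numpy(), df["lon"].to_numpy(),
)
avgdist = np.asarray(distances).mean()
```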

When we run this new function we finish in 2.3 seconds. That’s a pretty dramatic speed upgrade! In the next part, we’ll discover how to squeeze out even more speed.

[1_purepy]  17M rows   179037ms
[2_pyINcy]  17M rows   163102ms      (x1.098)
[3_optiCy]  17M rows   148108ms      (x1.209)
[4_CyLoop]  17M rows     2346ms     (x76.327)


Step 5. Aggregation in Cython

Since we’re interested in the average distance, and we already include the loop in Cython, we don’t have to return an array. We can just calculate the average inside the Cython function we’ve created in the previous step. That'll simplify our function as well:
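The gist isn’t shown here; a sketch of that aggregating function, reusing get_distance and matching the calculate_mean_distance name used in the call below (the decorators mirror Step 3), could be:

```cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double calculate_mean_distance(double lat, double lon,
                                     double[:] lats, double[:] lons):
    cdef int i
    cdef int n = lats.shape[0]
    cdef double total = 0
    for i in range(n):
        total += get_distance(lat, lon, lats[i], lons[i])
    # Return only the average; no array has to cross back into Python
    return total / n
```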

We call our function directly like:

avgdist = geopack.calculate_mean_distance(
   store_lat,
   store_lon,
   df['lat'].to_numpy(),
   df['lon'].to_numpy()
)

This shaves off a small bit of time:

[1_purepy]  17M rows   179037ms
[2_pyINcy]  17M rows   163102ms      (x1.098)
[3_optiCy]  17M rows   148108ms      (x1.209)
[4_CyLoop]  17M rows     2346ms     (x76.327)
[5_CyAggr]  17M rows     2225ms     (x80.468)


Step 6. Using multiple CPUs to process in parallel

We have multiple CPUs, right? So why not use some? In the elegant piece of code below we use a ProcessPool to divide the work over multiple processes. Each process runs simultaneously and uses a different CPU.
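The gist isn’t shown here; a sketch of the idea using concurrent.futures (the helper name and chunking are my own; the compiled geopack module from the previous steps is assumed) could look like this:

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from ext import geopack  # compiled in the previous steps

def mean_distance_parallel(store_lat, store_lon, lats, lons, workers=None):
    workers = workers or multiprocessing.cpu_count()
    # One chunk per worker; each process runs the Cython function on its chunk
    lat_chunks = np.array_split(lats, workers)
    lon_chunks = np.array_split(lons, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            geopack.calculate_mean_distance,
            [store_lat] * workers, [store_lon] * workers,
            lat_chunks, lon_chunks,
        ))
    # Weight each chunk's mean by its size (chunks can differ by one row)
    sizes = [len(c) for c in lat_chunks]
    return float(np.average(results, weights=sizes))
```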

Processes take some time to initialize so make sure you have enough work for them to make the startup cost worthwhile. Read this article to learn more about multitasking in Python.

When we execute the code above and have all of our CPUs working simultaneously on the task at hand, we end up with the final results: the maximum amount of speed that we can squeeze out of this task:

[1_purepy]  17M rows   179037ms
[2_pyINcy]  17M rows   163102ms      (x1.098)
[3_optiCy]  17M rows   148108ms      (x1.209)
[4_CyLoop]  17M rows     2346ms     (x76.327)
[5_CyAggr]  17M rows     2225ms     (x80.468)
[6_CyProc]  17M rows     1640ms    (x109.169)

As you can see, we have sped up this task almost 110-fold, eliminating over 99% of the pure-Python execution time. Notice that all we needed to do for this incredible speed increase was:

  • copy our Python function into a Cython file
  • add some types
  • add a simple function that handles looping
  • add 4 lines to handle multi-processing.

Not bad! Now let’s calculate the optimal distance.


Using our optimized function to calculate the best location for our store

We can now execute the code below to finally get our answer:
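The final gist isn’t shown here; a sketch (the city coordinates are approximate, and the compiled geopack module is assumed) could be:

```python
# Approximate coordinates of the three candidate cities (assumption)
locations = {
    "Amsterdam": (52.3676, 4.9041),
    "Utrecht":   (52.0907, 5.1214),
    "Groningen": (53.2194, 6.5665),
}

lats = df["lat"].to_numpy()
lons = df["lon"].to_numpy()
for city, (lat, lon) in locations.items():
    avgdist = geopack.calculate_mean_distance(lat, lon, lats, lons)
    print(f"{city}: {avgdist:>8.2f} km avg")
```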

The default .apply() method would calculate the average distances to our 3 locations in roughly 9 minutes. Because of our new, optimized, multi-processed function, we only have to wait around 5 seconds! The results below indicate that Groningen is the best city for us to have our store!

Amsterdam:  115.93 km avg
Utrecht:    111.56 km avg
Groningen:  102.54 km avg

Now we’re cruising over 100x faster, at Mach 3.3 (image by NASA on Unsplash)

Conclusion

This article builds on the foundation laid by this one. I hope to have demonstrated that you can combine the ease of coding in Python with the efficiency of C to improve certain pandas operations with relative ease and achieve incredible speed increases. For more information about why Python functions are so slow, check out this article.

I hope everything was as clear as I intended, but if not, please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics like these:

Happy coding!

— Mike

P.S: like what I’m doing? Follow me!