6 Steps to Make this Pandas Dataframe Operation 100 Times Faster

Cython for Data Science: Combine Pandas with Cython for an incredible speed improvement

Mike Huls

Dec 13, 2021 — 7 min read

Sure doesn’t look that fast, let’s speed it up! (image by Theodor Lundqvist on unsplash)

In this article you’ll learn how to improve Panda’s df.apply() function to speed it up over 100x. This article takes Pandas’ standard dataframe.apply function and upgrades it with a bit of Cython to speed up execution from 3 minutes to under 2 seconds. At the end of this article you’ll:

understand why df.apply() is slow
understand how to speed up the apply with Cython
know how to replace the apply by passing the array to Cython
be able to multi-process your Cython-function to squeeze out the maximum amount of speed
Annoy your coworkers with the fact that your code is aways so much faster than theirs

Before we begin I highly recommend reading this article about why Python is so slow; it helps you understand the type of problem we’re trying to solve in this article. You can also check out this article about getting started with Cython.

Isn’t Pandas already pretty fast?

True, it’s built on Numpy which is written in C. It’s very fast. Df.apply, however, applies a Python function to an array. The slow Python function is not fast. This is one part of the problem we’re trying to solve. The other part is that looping occurs in Python as well. First, we’ll set up, and then we’ll go about fixing these problems in 6 steps.

Setup

These kinds of tutorials always work best with a practical example, so for this project, we’ll imagine that we have a webshop. Because we already have 17 million customers, we decide to open a physical store so that customers can pick up their products, saving us delivery costs. We have a few locations picked out but which would be the best one? We determine that we’d best settle where the average distance to all of our clients is the smallest.

We want to calculate the average distances for each location but we’re busy people and don’t want to spend a long time waiting for the distance calculation to finish.

Our goal is to optimize calculating the average distance from all of our customers to a certain location.

Loading data

For this we load all of our 17 million customers from the database (using this article) into a dataframe (all unnecessary columns are hidden):customer_id lat lon
0 52.131980 5.510361
1 52.438026 6.815252
2 51.238809 4.447790
3 51.163722 3.588959
4 52.559595 5.483185

Distance calculation function

The function below can be used to calculate the spherical distance between two points on the globe.

	from math import sin, cos, sqrt, atan2, pi

	def calculate_distance(lat1:float, lon1:float, lat2:float, lon2:float):
	""" Calculates sperical distance between two points """
	# approximate radius of earth in km
	R = 6373.0

	# Convert to radians
	lat1 = lat1 * pi / 180
	lon1 = lon1 * pi / 180
	lat2 = lat2 * pi / 180
	lon2 = lon2 * pi / 180

	# Calculate distance and return
	a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
	c = 2 * atan2(sqrt(a), sqrt(1 - a))
	return R * c

view raw cy_df_apply1_pyfunction_distancecalc.py hosted with ❤ by GitHub

Installing dependencies

Spoiler alert: we’re going to use Cython to solve this problem. We are using CythonBuilder to take care of compiling, building, and packaging our package. Install it with:pip install cython
pip install cythonbuilder

When this is installed call cybuilder init. This will create a folder in your root called ext. If you are unfamiliar with working with a terminal check out this article.

We’ll go from this to a hypersonic jet in 6 steps (image by History in HD on Unsplash)

Calculating the average distance

We have selected a location that we want to calculate the average customer distance to. We are going to solve this problem in 6 steps. In every step, we’ll improve our code and achieve more speed. We’ll start with just Python and gradually add more Cython and other optimizations.

Step 1. Pure Python

We’ll df.apply the distance-calculation function to our dataframe, assign the result to a new column, and, lastly, average that column.

	store_lat = 52.34804116646895
	store_lon = 4.891191148345134
	df['distance'] = df_source.apply(lambda x: calculate_distance(store_lat, store_lon, x['lat'], x['lon']), axis=1)
	avgdist = df['distance'].mean()

view raw cy_df_apply2_s1_purepython.py hosted with ❤ by GitHub

This works but a lot can be improved. The function finishes in roughly 3 minutes. This will be our benchmark:[1_purepy] 17M rows 179037ms

Step 2. Cythonize

In this part, we’ll just put our Python code in a Cython file. Create a new file called geopack.pyx in ext/pyxfiles. If this folder doesn’t exist yet you probably forgot to call cybuilder init. Just copy and paste your Python function into this new file so it looks like below (don’t forget the imports):

	from math import sin, cos, sqrt, atan2, pi

	def calculate_distance(lat1:float, lon1:float, lat2:float, lon2:float):
	""" Calculates sperical distance between two points """
	# approximate radius of earth in km
	R = 6373.0

	# Convert to radians
	lat1 = lat1 * pi / 180
	lon1 = lon1 * pi / 180
	lat2 = lat2 * pi / 180
	lon2 = lon2 * pi / 180

	# Calculate distance and return
	a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
	c = 2 * atan2(sqrt(a), sqrt(1 - a))
	return R * c

view raw cy_df_apply1_pyfunction_distancecalc.py hosted with ❤ by GitHub

Next, we’ll need to compile, build and package this module. Luckily this is very easy with CythonBuilder. Call cybuilder build to do just this. Then we can import from the ext folder that CythonBuilder created and use the function like below:

	from ext import geopack
	df['distance'] = df_source.apply(lambda x: geopack.calculate_distance(store_lat, store_lon, x['lat'], x['lon']), axis=1)
	avgdist = df['distance'].mean()

view raw cy_df_apply3_s2_cythonize.py hosted with ❤ by GitHub

Easy like that! This code completes in 2.7 minutes, which is already a bit faster even though we haven’t even optimized anything. Let’s look at our benchmark and start optimizing.[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)

Step 3. Optimizing

In this step, we are optimizing our Cython code. Mostly we just add types so that the code can be compiled and doesn’t have to go through the interpreter. Let’s look at the new function below and then go through all changes.

	cimport cython
	from libc.math cimport sin, cos, sqrt, atan2, pi

	@cython.cdivision(True)
	@cython.boundscheck(False)
	@cython.wraparound(False)
	@cython.nonecheck(False)
	cpdef float calculate_distance(double lat1, double lon1, double lat2, double lon2):
	""" Calculates sperical distance between two points """
	# approximate radius of earth in km
	cdef float R = 6373.0
	cdef float a
	cdef float c

	# Convert to radians
	lat1 = lat1 * pi / 180
	lon1 = lon1 * pi / 180
	lat2 = lat2 * pi / 180
	lon2 = lon2 * pi / 180

	# Calculate distance and return
	a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
	c = 2 * atan2(sqrt(a), sqrt(1 - a))
	return R * c

view raw cy_df_apply4_s3_optimize.py hosted with ❤ by GitHub

Line 1 and 2
We’re importing Cython functions now. These are a bit faster

Line 4–7
These are compiler directives; they tell the compile to avoid certain checks. We avoid checking for Nones, Zero-division, and some checks that have to do with bounds and wraparound when we use loops. Check docs for more information.

Line 8
First, we define our function not with def but cpdef. This makes both C and Python can access the function. Then we define our return type (the float part in cpdef float. Lastly, we type our inputs (like double lat1 e.g.).

Line 11–13
Adding types for variables we use in this function.

Results
After we’ve adjusted our function we call cybuilder build again so that our package is updated. When we run the new function we see that our code is a bit faster, finishing in roughly 2.4 minutes:[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)

We’ve already shaved off 17% of execution times but this is not nearly enough. The reason that df.apply() is slow despite the optimized function is that all looping happens in Python. So let’s Cythonize the loop!

Step 4. Cythonizing the loop

We’ll create a new function in Cython that will receive two arrays (all of our customer’s latitudes and longitudes, and two floats (the store-lat and store-long).

	@cython.cdivision(True)
	@cython.boundscheck(False)
	@cython.wraparound(False)
	@cython.nonecheck(False)
	cpdef np.ndarray[double] apply_distance_calculation(
	double dnx_lat,
	double dnx_lon,
	np.ndarray[double] col_lat2,
	np.ndarray[double] col_lon2
	):

	# retrieve record count
	cdef int nrecords = len(col_lat2)

	# Create empty array to return
	cdef np.ndarray[double] res = np.empty(nrecords)

	# Fill the empty array
	for i in range(nrecords):
	res[i] = calculate_distance(dnx_lat, dnx_lon, col_lat2[i], col_lon2[i])

	return res

view raw cy_df_apply5_s4_cyloop.py hosted with ❤ by GitHub

The only thing this function will do is to loop through all of the data and call the function we’ve defined earlier (line 20). Execute cybuilder build and run this function like the code below:

	df['distance'] = geopack.apply_distance_calculation(
	dnx_lat,
	dnx_lon,
	df['lat'].to_numpy(),
	df['lon'].to_numpy()
	)
	avgdist = df['distance'].mean()

view raw cy_df_apply6_s4_cylooprun.py hosted with ❤ by GitHub

When we run this new function we finish in 2.3 seconds. That’s a pretty dramatic speed upgrade! In the next part, we’ll discover how to squeeze out even more speed.[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
[4_CyLoop] 17M rows 2346ms (x76.327)

Step 5. Aggregation in Cython

Since we’re interested in the average distance, and we already include the loop in Cython, we don’t have to return an array. We can just calculate the average inside the Cython function we’ve created in the previous step. That'll simplify our function as well:

	@cython.cdivision(True)
	@cython.boundscheck(False)
	@cython.wraparound(False)
	@cython.nonecheck(False)
	cpdef double apply_distance_calculation(
	double dnx_lat,
	double dnx_lon,
	np.ndarray[double] col_lat2,
	np.ndarray[double] col_lon2
	):

	cdef double sum_distance
	cdef int nrecords = len(col_lat2)

	# Calculate sum_distance
	for i in range(nrecords):
	sum_distance += calculate_distance(dnx_lat, dnx_lon, col_lat2[i], col_lon2[i])

	return sum_distance / nrecords

view raw cy_df_apply7_s5_aggregate.py hosted with ❤ by GitHub

We call our function directly likeavgdist = geopack.calculate_mean_distance(
store_lat,
store_lon,
df['lat'].to_numpy(),
df['lon'].to_numpy()
)

This shaves off a small bit of time:[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
[4_CyLoop] 17M rows 2346ms (x76.327)
[5_CyAggr] 17M rows 2225ms (x80.468)

Step 6. Using multiple CPUs to process in parallel

We have multiple CPUs, right? So why not use some? In the elegant piece of code below we use a ProcessPool to divide the work over multiple processes. Each process runs simultaneously and uses a different CPU.

	with ProcessPoolExecutor(max_workers=num_workers) as exe:
	sections = np.array_split(df_source, num_workers)
	jobs = [exe.submit(geopack.calculate_mean_distance, dnx_lat, dnx_lon, s['lat'].to_numpy(), s['lon'].to_numpy()) for s in sections]

	avgdist = sum(job.result() for job in jobs) / len(jobs)

view raw cy_df_apply8_s6_cymp.py hosted with ❤ by GitHub

Processes take some time to initialize so make sure you have enough work for them to make the startup cost worthwhile. Read this article to learn more about multitasking in Python.

When we execute the code above and have all of our CPU’s working simultaneously on the task at hand, we end up with the final results; the maximum amount of speed that we can squeeze out of this task:[1_purepy] 17M rows 179037ms
[2_pyINcy] 17M rows 163102ms (x1.098)
[3_optiCy] 17M rows 148108ms (x1.209)
[4_CyLoop] 17M rows 2346ms (x76.327)
[5_CyAggr] 17M rows 2225ms (x80.468)
[6_CyProc] 17M rows 1640ms (x109.169)

As you see we have sped up this task almost 110-fold; eliminating over 99% of the Pure Python execution time. Notice that the things we need to do for this incredible speed increase are:

copy our function to Python
add some types
add a simple function that handles looping
add 4 lines to handle multi-processing.

Not bad! Now let’s calculate the optimal distance.

Using our optimized function to calculate the best location for our store.

We can now execute the code below to finally get our answer:

	num_workers = 12
	for loc in locations:
	with ProcessPoolExecutor(max_workers=num_workers) as exe:
	sections = np.array_split(df_source, num_workers)
	jobs = [exe.submit(geopack.calculate_mean_distance, loc['lat'], loc['lon'], s['lat'].to_numpy(), s['lon'].to_numpy()) for s in sections]

	avgdist = sum(job.result() for job in jobs) / len(jobs)
	print(f"{loc['name']}: {avgdist} avg")

view raw cy_df_apply9_result.py hosted with ❤ by GitHub

The default .apply() method would calculate the average distances to our 3 locations in roughly 9 minutes. Because of our new, optimized, multi-processed function, we only have to wait for around 5 seconds! The results below indicate that Groningen is the best city for us to have our store!Amsterdam: 115.93 km avg
Utrecht: 111.56 km avg
Groningen: 102.54 km avg

Now we’re cruising over 100x faster rat Mach 3.3 (image by NASA on Unsplash)

Conclusion

This article built forth upon the foundation laid by this one. I hope to have demonstrated that you can combine the ease of coding in Python with the efficiency of C to improve certain pandas operations with relative ease and achieve incredible speed increases. For more information about why the Python function is so slow check out this article.

I hoped everything was as clear as I hope it to be but if this is not the case please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics like these:

Happy coding!

— Mike

P.S: like what I’m doing? Follow me!