4 Easy-to-Implement, High-Impact Tweaks for Supercharging Your Python Code’s Performance
How to detect, understand, and eliminate bottlenecks in Python for a 1500x speed increase
My philosophy centers around attempting simple solutions before resorting to complex ones. By exploring the easy methods in this article, you may find the performance increase you need, sparing yourself the intricacies and countless hours required for implementing and debugging multiprocessing, threads, or packages written in another language.
In this article we’ll delve into the tooling and four methods for speeding up any Python code using minimal, easy-to-implement techniques. We’ll analyze our code, detect bottlenecks and fix them in a structured way. We’ll do this by decreasing the amount of work Python has to do.
If you have to cover a certain distance as quickly as possible, you can either drive faster or shorten your route. Similarly, instead of having Python perform a lot of operations faster, you can also reduce the number of operations.
Ultimately, you will gain a deeper understanding of code performance, acquire valuable skills in code analysis to avoid bottlenecks during development and become a better developer. Let’s code!
Contents
We are analyzing our problem in three parts:
In Part A we define what we mean by performance and discuss the profiler, which we’ll use in the next part to measure our code.
In Part B we use these tools to spot bottlenecks: we measure our code and work through a practical example to understand why our function underperforms.
In Part C we discuss ways to eliminate bottlenecks. Having learned in the previous parts how to detect problem code and analyze the cause of the slowdown, we go through strategies for increasing performance:
- Choosing the right data structure
- Eliminating slow-running code (like nested loops)
- Using built-in functions
In the end you’ll be able to apply these general strategies to any piece of code.
A. About performance
Let’s first get some definitions and preparation out of the way.
1. What is performance?
Since we are trying to optimize the performance of our code, we first need a clear definition of what we mean when we talk about performance:
Performance: the amount of useful work accomplished, estimated in terms of time needed, resources used, etc.
For brevity, I’ve chosen to monitor the execution speed of a function in this article. Memory usage and size are relevant as well, but I’ve chosen to omit them to keep this article a bit shorter.
2. How to measure performance?
Python has a very handy built-in tool called cProfile. This package provides deterministic profiling of Python programs. This means that precise timings are made for the intervals between all function calls, function returns and exception events. This differs from statistical profiling, which uses random samples to deduce relative indications of where time is being spent.
Using cProfile is really easy:
import cProfile
from typing import List

def uppercase_list(a_list: List) -> List:
    return [str(item).upper() for item in a_list]

cProfile.run(statement="uppercase_list(['hello', 'world', '!'])")
This produces the following result:
8 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 cprofile_example.py:6(uppercase_list)
1 0.000 0.000 0.000 0.000 cprofile_example.py:8(<listcomp>)
1 0.000 0.000 0.000 0.000 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
3 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
A quick explanation of the columns:
- ncalls: number of calls
- tottime: total time spent in the given function (excl. subfunctions)
- percall: tottime / ncalls
- cumtime: cumulative time spent in this function and all subfunctions
- percall: cumtime / primitive calls
- last column: location -> the function name at its line number in its file (filename:lineno(function))
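By the way, once the report grows beyond a handful of lines it helps to sort it. cProfile.run accepts a sort argument for exactly this; a minimal sketch using the same toy function, sorting by total time so the most expensive functions come first:

import cProfile
from typing import List

def uppercase_list(a_list: List) -> List:
    return [str(item).upper() for item in a_list]

# Sort the report by the tottime column instead of by standard name
cProfile.run(statement="uppercase_list(['hello', 'world', '!'])", sort="tottime")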
As you see, cProfile provides us with an easy-to-use tool to check the performance of our code. Let’s now focus on how to actually improve our Python code.
B. Profiling: spotting and understanding bottlenecks
Let’s start out with some code that needs optimizing. As an example we’ll use a function that does a very simple thing: it takes two lists and returns how many distinct items are present in both lists (case-insensitive).
Let’s check out this (on purpose super-sub-optimal) function:
def duplicate_count_dumb(list_1: List, list_2: List) -> int:
    """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """
    duplicates: List[str] = []
    for item1 in list_1:
        for item2 in list_2:
            if str(item1).upper() == str(item2).upper():
                if str(item1).upper() not in duplicates:
                    duplicates.append(str(item1).upper())
    return len(duplicates)
This function is clearly not a smart way to solve our problem, but it will nicely demonstrate how we can use cProfile to spot bottlenecks.
1. Profiling our function
When we use cProfile on our function, this is the result when we pass two lists that both contain 1,000 five-letter words:
2000008 function calls in 0.523 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.520 0.520 <string>:1(<module>)
1 0.381 0.381 0.520 0.520 list_matching.py:43(duplicate_count_dumb)
1 0.002 0.002 0.523 0.523 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2000002 0.139 0.000 0.139 0.000 {method 'upper' of 'str' objects}
Notice that we execute a little over 2 million function calls in a little over half a second. Let’s first add this time to our benchmark so we have something to compare with later:
┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐
│ dupcount 1k items │ Min (ms) │ Max (ms) │ Avg (ms) │
├───────────────────┼───────────────┼───────────────┼───────────────┤
│ default │ 520.405100000 │ 529.250500000 │ 523.529233333 │
└───────────────────┴───────────────┴───────────────┴───────────────┘
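(The article doesn’t show the benchmarking harness itself; numbers like the ones above could be collected with something like the sketch below, where the random word lists and the three-run setup are my own assumptions.)

import random
import string
import timeit

# Hypothetical input data: two lists of 1,000 random five-letter words
random.seed(42)
list_1 = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(1_000)]
list_2 = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(1_000)]

# Three timed runs of a single call each, converted to milliseconds
runs = timeit.repeat(lambda: duplicate_count_dumb(list_1, list_2), repeat=3, number=1)
runs_ms = [run * 1_000 for run in runs]
print(f"Min {min(runs_ms):.3f} ms | Max {max(runs_ms):.3f} ms | Avg {sum(runs_ms) / len(runs_ms):.3f} ms")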
2. Spotting bottlenecks
The most important line is the bottom one; there we see that we call the upper method roughly 2 million times. This is because we compare each item in list_1 with each item in list_2.
Both items need to be stringified and uppercased in order to be compared. Since both lists are 1,000 items long, this means we must perform 1 million comparisons (1,000 x 1,000). The problem is that we call upper on both item1 and item2 for every one of those million comparisons.
This means that even in the best case upper will be called 2 million times. If items match there are additional upper calls when we check for their presence in the duplicates list and when we append them (which is why the profile shows 2,000,002 calls and not 2,000,000).
3. Why is this slow?
Python is loved for its ease of use, but this comes with a downside: under the hood, memory allocation in Python is much slower than in other programming languages like C (in which Python itself is written).
Since strings in Python are immutable, under the hood a new string object is created every time you call upper. Memory has to be allocated for each of these objects, which Python does relatively slowly. Do this 2 million times and you’ll start to notice the slowdown.
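To make the immutability point concrete, here is a small check (CPython behaviour, my own example): every call to upper hands you a freshly allocated string, even when the resulting value is identical.

word = "hello"
upper_1 = word.upper()
upper_2 = word.upper()

# The original string is untouched and each call builds a brand-new object
print(word)                # hello
print(upper_1 == upper_2)  # True  -> same value
print(upper_1 is upper_2)  # False -> two separate objects, each freshly allocated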
C. Eliminating bottlenecks
Now that we have spotted our bottleneck it’s time to speed up our code. Remember that our goal is to make Python do less work.
1. Make Python do less work — understand your code
One easy way to do less is to uppercase each item in the list before looping since we don’t care about the casing anyway:
def duplicate_count_smarter(list_1: List, list_2: List) -> int:
    """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """
    duplicates: Set[str] = set()
    list_1 = [str(w).upper() for w in list_1]
    list_2 = [str(w).upper() for w in list_2]
    for item1 in list_1:
        for item2 in list_2:
            if item1 == item2:
                duplicates.add(item1)
    return len(duplicates)
As you see we first uppercase each item in each list. When we profile our code again we see that this simple change brought the number of function calls down from roughly 2 million to roughly 2 thousand. This is because we now only have to uppercase each item in list_1 and in list_2 once (1,000 + 1,000 = 2,000).
2007 function calls in 0.022 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
1 0.019 0.019 0.019 0.019 list_matching.py:74(duplicate_count_smarter)
1 0.000 0.000 0.000 0.000 list_matching.py:77(<listcomp>)
1 0.000 0.000 0.000 0.000 list_matching.py:78(<listcomp>)
1 0.003 0.003 0.022 0.022 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2000 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
We call the upper and str functions roughly 1,000x fewer times, leading to a nice performance increase; execution time decreased 28x, from 523 ms to 18 ms:
┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐
│ dupcount 1k items │ Min (ms) │ Max (ms) │ Avg (ms) │
├───────────────────┼───────────────┼───────────────┼───────────────┤
│ default │ 520.405100000 │ 529.250500000 │ 523.529233333 │
│ first_upper │ 18.440300000 │ 18.665600000 │ 18.548466667 │ (x28)
└───────────────────┴───────────────┴───────────────┴───────────────┘
Also notice that I’ve replaced the duplicates list with a set, so we don’t need to check whether an item is already in it. Using a set takes even more work out of Python’s hands since sets only contain unique values by default.
2. Avoid nested loops
Another easy-to-spot bottleneck is the fact that we loop over each item in list_2 for each item in list_1. If both lists contain 10 items this comes down to (10 x 10 =) 100 comparisons. If both lists contain 1,000 items we have to do 1 million comparisons. You see how the number of comparisons grows quadratically with the size of the inputs and how this quickly escalates our performance problems. Read more about Big-O notation for more information.
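A tiny illustration of that growth (the list sizes here are arbitrary, chosen just for demonstration):

# Every item in list_1 is compared against every item in list_2,
# so the number of comparisons grows quadratically with the list size
for n in (10, 100, 1_000, 10_000):
    print(f"two lists of {n:>6,} items -> {n * n:>12,} comparisons")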
Instead of looping twice and comparing items we can also implement the code below:
def duplicate_count_with_in(list_1: List, list_2: List) -> int:
    """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """
    duplicates: Set[str] = set()
    list_1 = [str(w).upper() for w in list_1]
    list_2 = [str(w).upper() for w in list_2]
    for item1 in list_1:
        if item1 in list_2:
            duplicates.add(item1)
    return len(duplicates)
In this example we only loop through list_1 and check whether list_2 contains that value. Let’s check out the benchmark:
┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐
│ dupcount 1k items │ Min (ms) │ Max (ms) │ Avg (ms) │
├───────────────────┼───────────────┼───────────────┼───────────────┤
│ default │ 520.405100000 │ 529.250500000 │ 523.529233333 │
│ first_upper │ 18.440300000 │ 18.665600000 │ 18.548466667 │ (x28)
│ with_in │ 11.774500000 │ 12.401600000 │ 12.183367000 │ (x43) │
└───────────────────┴───────────────┴───────────────┴───────────────┘
With two simple changes we’ve increased performance a lot, but there’s a way to increase it even more.
3. Use the right data structure — deduplicating the lists
In order to save even more work we can de-duplicate both lists since this brings down the number of comparisons we have to perform. We’ll take a short detour to think about a function to de-duplicate our lists. Then we’ll integrate this into our list-comparison-function. The first function you might come up with could look like this:
def dedupper(the_list: List) -> List:
    """ Removes duplicates from list (case-insensitive) """
    return_list = []
    for entry in the_list:
        _entry = str(entry).upper()
        if _entry not in return_list:
            return_list.append(_entry)
    return return_list
Nothing too special; it just goes through every item in the list and puts it into another list if it isn’t already in there. The bulk of the work in this function is checking whether a value is already present in the return list. If we take a list of 100,000 words, the function takes a little over a minute to complete:
┌─Benchmark────────────────────┬──────────────┐
│ duplicate count (100,000x) │ Avg (ms) │
├──────────────────────────────┼──────────────┤
│ dedup_dumb │ 61386.198600 │
└──────────────────────────────┴──────────────┘
Let’s optimize this function to use sets: a set can only contain unique values. This way we don’t have to let Python check whether a value is already present in a collection; the set handles this for us. After rewriting we end up with a more readable and better performing function:
def dedupper_smart(the_list: List) -> List:
    """ Removes duplicates from list (case-insensitive) """
    return list({str(entry).upper() for entry in the_list})
As you see we use a set comprehension to loop through each item in the list. Once this is done we end up with a set that contains only the unique, uppercased values of our list. Before returning we cast it back to a list. When we compare this new function to the old one we see that it’s over 2,800x faster!
┌─Benchmark────────────────────┬──────────────┐
│ duplicate count (100,000x) │ Avg (ms) │
├──────────────────────────────┼──────────────┤
│ dedup_dumb │ 61386.19860 │
│ dedup_smart │ 21.85930 │(x2,808)
└──────────────────────────────┴──────────────┘
The reason sets are this much faster is that a set is implemented as a hash table (in C, in which Python itself is also written): checking whether a value is already present is a single hash lookup instead of a scan over every item, as it is for a list. In addition, a set is unordered and can only contain unique values, so all the explicit "is it already in there?" checks disappear. This, again, takes work out of Python’s hands, increasing performance.
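You can see this difference yourself with a quick, machine-dependent measurement (my own sketch, not from the article): looking up a value that sits at the end of a large list is dramatically slower than the same lookup in a set.

import timeit

# Worst case for the list: the value we search for is at the very end,
# so every preceding item has to be checked first
haystack_list = list(range(100_000))
haystack_set = set(haystack_list)

list_time = timeit.timeit(lambda: 99_999 in haystack_list, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in haystack_set, number=1_000)
print(f"list membership: {list_time:.4f}s | set membership: {set_time:.6f}s")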
Let’s implement the set in our list-matching-function and benchmark:
def duplicate_count_dedup_lists(list_1: List, list_2: List) -> int:
    """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """
    duplicates: Set[str] = set()
    set_1 = {str(entry).upper() for entry in list_1}
    set_2 = {str(entry).upper() for entry in list_2}
    for item1 in set_1:
        if item1 in set_2:
            duplicates.add(item1)
    return len(duplicates)
As you see the lists now first get deduplicated into sets, and we loop over those sets instead. When we benchmark we see a huge performance increase:
┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐
│ dupcount 1k items │ Min (ms) │ Max (ms) │ Avg (ms) │
├───────────────────┼───────────────┼───────────────┼───────────────┤
│ default │ 520.405100000 │ 529.250500000 │ 523.529233333 │
│ first_upper │ 18.440300000 │ 18.665600000 │ 18.548466667 │ (x28)
│ with_in │ 11.774500000 │ 12.401600000 │ 12.183367000 │ (x43) │
│ dedup_lists_set │ 0.351700000 │ 0.593000000 │ 0.496767000 │ (x1,053)
└───────────────────┴───────────────┴───────────────┴───────────────┘
Our function completes in half a millisecond instead of half a second. Still, we might have one more trick up our sleeve for the next part.
4. Using data structure functions — set intersection
One last optimization we can add is to skip the looping altogether and use the set’s intersection method:
def duplicate_count_smartest(list_1: List, list_2: List) -> int:
    """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """
    set_1 = {str(entry).upper() for entry in list_1}
    set_2 = {str(entry).upper() for entry in list_2}
    return len(set_1.intersection(set_2))
As you see we end up with a nicely readable, compact function. The intersection method returns a set of the items that are present in both set_1 and set_2. Lastly we return the length of this set. Let’s add this to our benchmark and see the final results:
┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐
│ dupcount 1k items │ Min (ms) │ Max (ms) │ Avg (ms) │
├───────────────────┼───────────────┼───────────────┼───────────────┤
│ default │ 520.405100000 │ 529.250500000 │ 523.529233333 │
│ first_upper │ 18.440300000 │ 18.665600000 │ 18.548466667 │ (x28)
│ with_in │ 11.774500000 │ 12.401600000 │ 12.183367000 │ (x43) │
│ dedup_lists_set │ 0.351700000 │ 0.593000000 │ 0.496767000 │ (x1,053)
│ sets │ 0.340600000 │ 0.369600000 │ 0.351167000 │ (x1,490)
└───────────────────┴───────────────┴───────────────┴───────────────┘
Not only is our final function a lot more readable, it’s also almost 1500x faster than what we started with.
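As a quick sanity check, here is the final function on a couple of made-up word lists (my own example, not from the article):

# "HELLO" and "PYTHON" appear in both lists once casing is ignored
words_a = ["Hello", "world", "PYTHON", "python"]
words_b = ["hello", "Python", "rust"]
print(duplicate_count_smartest(words_a, words_b))  # 2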
Further optimization?
Maybe you’ve gone through all of these steps and your code’s performance is still not good enough. Then there is the option of applying multiprocessing, threads, or even rewriting (part of) a function in a more optimized language like C, Rust or Cython. The goal of this article was not to reach for the big guns right away but to first squeeze out every little drop of performance that Python can offer. If further optimization is required, the methods in this article will still complement those heavier tools and make the end result even faster.
Conclusion
In this article we’ve gone through a very practical example of spotting, understanding and eliminating performance bottlenecks. We’ve seen that you don’t have to make complex, structural changes to your code right away; sometimes small, easy-to-implement changes can optimize your code a lot. When this is not enough there is always the option of using threads, multiple processes or writing a Python package in a compiled language.
I think the most important part is to truly understand your code, understand how Python works and where it is less performant, and apply the right data structures. For this, check out my other articles or this presentation.
I hope this article was as clear as I intended it to be, but if this is not the case please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics like these:
- Git for absolute beginners: understanding Git with the help of a video game
- Create and publish your own Python package
- Create a fast auto-documented, maintainable, and easy-to-use Python API in 5 lines of code with FastAPI
Happy coding!
— Mike
P.S: like what I’m doing? Follow me!