4 Easy-to-Implement, High-Impact Tweaks for Supercharging Your Python Code’s Performance

How to detect, understand, and eliminate bottlenecks in Python for a 1500x speed increase

Your Python code after this article (image by SpaceX on Unsplash)

My philosophy centers around attempting simple solutions before resorting to complex ones. By exploring the easy methods in this article, you may find the performance increase you need, sparing yourself the intricacies and countless hours required for implementing and debugging multiprocessing, threads, or packages written in another language.

In this article we’ll delve into the tools and four methods for speeding up any Python code with minimal, easy-to-implement techniques. We’ll analyze our code, detect bottlenecks, and fix them in a structured way. We’ll do this by decreasing the amount of work Python has to do.

If you have to cross a certain distance as quickly as possible you can either drive faster or shorten your path. Similarly, instead of having Python perform a lot of operations faster, you can also reduce the number of operations.

Ultimately, you will gain a deeper understanding of code performance, acquire valuable skills in code analysis to avoid bottlenecks during development, and become a better developer. Let’s code!

Contents

We are analyzing our problem in three parts:

In part A we define what we mean by performance and discuss the profiler, which we’ll use in the next part to measure our code.

Part B revolves around using tools to spot the bottleneck: we measure our code, find the performance bottleneck, and use a practical example to understand why our function underperforms.

In part C we discuss ways to eliminate bottlenecks. In the previous parts we’ve learnt how to detect problem code and analyze it so we know the cause of the slowdown. In this part we discuss strategies for increasing performance:

  • Choosing the right data structure
  • Eliminating slow-running code (like nested loops)
  • Using built-in functions

In the end you’ll be able to apply these general strategies to any piece of code.


A. About performance

Let’s first get some definitions and preparation out of the way.


1. What is performance?

Since we are trying to optimize the performance of our code, we first need a clear definition of what we mean when we talk about performance:

Performance: the amount of useful work accomplished estimated in terms of time needed, resources used, etc

For brevity, I’ve chosen to monitor only the execution speed of a function in this article. Memory usage and size are relevant as well, but I’ve chosen to omit them to keep this article a bit shorter.


2. How to measure performance?

Python has a very handy tool built in called cProfile. This package provides deterministic profiling of Python programs: precise timings are made for the intervals between all function calls, function returns, and exception events. This differs from statistical profiling, which uses random samples to deduce relative indications of where time is being spent.

Using cProfile is really easy:

import cProfile
from typing import List


def uppercase_list(a_list: List) -> List:
  # stringify and uppercase every item in the list
  return [str(item).upper() for item in a_list]


# profile a single call and print the timing report
cProfile.run(statement="uppercase_list(['hello', 'world', '!'])")

This produces the following result:

8 function calls in 0.000 seconds 
Ordered by: standard name 
 
ncalls  tottime  percall  cumtime  percall  filename:lineno(function) 
    1    0.000    0.000    0.000    0.000   <string>:1(<module>) 
    1    0.000    0.000    0.000    0.000   cprofile_example.py:6(uppercase_list) 
    1    0.000    0.000    0.000    0.000   cprofile_example.py:8(<listcomp>) 
    1    0.000    0.000    0.000    0.000   {built-in method builtins.exec} 
    1    0.000    0.000    0.000    0.000   {method 'disable' of '_lsprof.Profiler' objects} 
    3    0.000    0.000    0.000    0.000   {method 'upper' of 'str' objects}

A quick explanation about the columns:

  • ncalls: number of calls
  • tottime: total time spent in the given function (excl. subfunctions)
  • percall: tottime / ncalls
  • cumtime: cumulative time spent in this function and all subfunctions
  • percall: cumtime / primitive calls
  • last column: location -> function name at line number in filename
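
For a real program this plain-text dump can get long. cProfile can also write its results to a file, which you can then sort and filter with the standard pstats module. Below is a minimal sketch of that workflow (the filename profile.out is just a placeholder):

import cProfile
import pstats
from typing import List


def uppercase_list(a_list: List) -> List:
  return [str(item).upper() for item in a_list]


# write the raw profiling data to a file instead of printing it directly
cProfile.run(statement="uppercase_list(['hello', 'world', '!'])", filename="profile.out")

# load the stats, sort by cumulative time and print the 10 most expensive entries
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)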

As you see cProfile provides us with an easy-to-use tool to check the performance of our code. Let’s now focus on how to actually improve our Python code.


B. Profiling: spotting and understanding bottlenecks

Let’s start out with some code that needs optimizing. As an example we’ll use a function that does a very simple thing: it takes two lists and returns how many distinct items are present in both lists (case-insensitive).

Let’s check out this (on purpose super-sub-optimal) function:

def duplicate_count_dumb(list_1:List, list_2:List) -> int: 
  """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """ 
  duplicates: List[str] = [] 
  for item1 in list_1: 
    for item2 in list_2: 
      if str(item1).upper() == str(item2).upper(): 
        if (str(item1).upper() not in duplicates): 
          duplicates.append(str(item1).upper()) 
  return len(duplicates)

This function is clearly not a smart way to solve our problem but it will really nicely demonstrate how we can use cProfile to spot bottlenecks.


1. Profiling our function

When we use cProfile on our function, this is the result when we pass two lists that both contain 1,000 five-letter words:

2000008 function calls in 0.523 seconds 
Ordered by: standard name 
 
 ncalls  tottime  percall  cumtime  percall filename:lineno(function) 
      1    0.000    0.000    0.520    0.520 <string>:1(<module>) 
      1    0.381    0.381    0.520    0.520 list_matching.py:43(duplicate_count_dumb) 
      1    0.002    0.002    0.523    0.523 {built-in method builtins.exec} 
      1    0.000    0.000    0.000    0.000 {built-in method builtins.len} 
      1    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects} 
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects} 
2000002    0.139    0.000    0.139    0.000 {method 'upper' of 'str' objects}

Notice that we execute a little over 2 million function calls in a little over half a second. Let’s first add this time to our benchmark so we have something to compare with later:

┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐ 
│ dupcount 1k items │      Min (ms) │      Max (ms) │      Avg (ms) │ 
├───────────────────┼───────────────┼───────────────┼───────────────┤ 
│ default           │ 520.405100000 │ 529.250500000 │ 523.529233333 │ 
└───────────────────┴───────────────┴───────────────┴───────────────┘
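
As an aside, the exact benchmark harness isn’t shown in this article, but a minimal sketch along these lines reproduces the min/max/avg numbers above on your own machine. The random_words and benchmark helpers below are hypothetical names I’ve chosen, and duplicate_count_dumb is the function defined earlier:

import random
import string
import time
from typing import Callable, List


def random_words(n: int, length: int = 5) -> List[str]:
  """ Generates n random lowercase words of the given length """
  return ["".join(random.choices(string.ascii_lowercase, k=length)) for _ in range(n)]


def benchmark(func: Callable, *args, runs: int = 3) -> None:
  """ Times a function a few times and prints min/max/avg in milliseconds """
  timings = []
  for _ in range(runs):
    start = time.perf_counter()
    func(*args)
    timings.append((time.perf_counter() - start) * 1000)
  print(f"min {min(timings):.3f} ms | max {max(timings):.3f} ms | avg {sum(timings) / len(timings):.3f} ms")


# two lists of 1,000 random five-letter words, as used in the article
list_1 = random_words(1_000)
list_2 = random_words(1_000)
benchmark(duplicate_count_dumb, list_1, list_2)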

2. Spotting bottlenecks

The most important line is the bottom one: here we see that we call the upper method roughly 2 million times. This is because we compare each item in list_1 with each item in list_2.

Both items need to be stringified and uppercased in order to be compared. Since both lists are 1,000 items long, this means we must perform 1 million comparisons (1,000 x 1,000). The problem is that we uppercase both item1 and item2 for every single one of those comparisons.

This means that upper will be called at least 2 million times. When items match, there are a few additional calls for checking their presence in the duplicates list and appending them (which is why the profiler reports 2,000,002 calls instead of exactly 2,000,000).


3. Why is this slow?

Python is loved for its ease of use, but this comes with a downside: under the hood, Python’s memory allocation is much slower than in languages like C (in which Python itself is written).

Why Python is so slow and how to speed it up
Take a look under the hood to see where Python’s bottlenecks lie

Since strings in Python are immutable, under the hood a new string object is created every time you call upper(). For each of these objects memory has to be allocated, which Python does relatively slowly. When you then do this 2 million times you’ll start to notice the slowdown. Read more about Python’s design in the article above or watch the presentation below.
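
As a quick aside, you can verify this immutability yourself in the REPL: upper() never modifies the original string but always allocates a brand-new one (the exact id values will differ on your machine):

word = "hello"
print(id(word))        # memory address of the original string object
upper_word = word.upper()
print(id(upper_word))  # a different address: upper() allocated a new string
print(word)            # prints 'hello' -> the original string is unchanged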


C. Eliminating bottlenecks

Now that we have spotted our bottleneck it’s time to speed up our code. Remember that our goal is to make Python do less work.


1. Make Python do less work — understand your code

One easy way to do less work is to uppercase each item in both lists before looping, since we don’t care about the casing anyway:

def duplicate_count_smarter(list_1: List, list_2: List) -> int: 
  """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """ 
  duplicates: Set[str] = set()  # requires "from typing import Set"
  # uppercase every item once, up front, instead of inside the nested loop
  list_1 = [str(w).upper() for w in list_1] 
  list_2 = [str(w).upper() for w in list_2] 
  for item1 in list_1: 
    for item2 in list_2: 
      if item1 == item2: 
        duplicates.add(item1) 
  return len(duplicates)

As you see, we first uppercase each item in each list, before the loops. When we profile our code again we see that this simple change brought the number of function calls down from roughly 2 million to roughly 2 thousand: we now only have to uppercase each item in list_1 and list_2 once (1k + 1k = 2k).

2007 function calls in 0.022 seconds 
 
Ordered by: standard name 
 
ncalls  tottime  percall  cumtime  percall filename:lineno(function) 
    1    0.000    0.000    0.019    0.019 <string>:1(<module>) 
    1    0.019    0.019    0.019    0.019 list_matching.py:74(duplicate_count_smarter) 
    1    0.000    0.000    0.000    0.000 list_matching.py:77(<listcomp>) 
    1    0.000    0.000    0.000    0.000 list_matching.py:78(<listcomp>) 
    1    0.003    0.003    0.022    0.022 {built-in method builtins.exec} 
    1    0.000    0.000    0.000    0.000 {built-in method builtins.len} 
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects} 
 2000    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}

We call the upper and str functions 1,000x fewer times, leading to a nice performance increase: execution time decreased 28-fold, from 523 ms to roughly 18 ms:

┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐ 
│ dupcount 1k items │      Min (ms) │      Max (ms) │      Avg (ms) │ 
├───────────────────┼───────────────┼───────────────┼───────────────┤ 
│ default           │ 520.405100000 │ 529.250500000 │ 523.529233333 │ 
│ first_upper       │  18.440300000 │  18.665600000 │  18.548466667 │ (x28) 
└───────────────────┴───────────────┴───────────────┴───────────────┘

Also notice that I’ve replaced the duplicates list with a set, so we no longer need to check whether an item is already contained in it. Using sets takes even more work out of Python’s hands, since sets only contain unique values by definition.


2. Avoid nested loops

Another easy-to-spot bottleneck is the fact that we loop over each item in list_2 for each item in list_1. If both lists contain 10 items this comes down to (10 x 10 =) 100 comparisons. If both lists contain 1,000 items we have to do 1 million comparisons. You see how the number of comparisons grows quadratically with the size of the inputs and how this quickly escalates our performance problems. Read more about Big-O notation for more information.

Instead of looping twice and comparing items we can also implement the code below:

def duplicate_count_with_in(list_1: List, list_2: List) -> int: 
  """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """ 
  duplicates: Set[str] = set()  # requires "from typing import Set"
  list_1 = [str(w).upper() for w in list_1] 
  list_2 = [str(w).upper() for w in list_2] 
  for item1 in list_1: 
    # "in" replaces the inner loop: let Python search list_2 for us
    if item1 in list_2: 
      duplicates.add(item1) 
  return len(duplicates)

In this example we only loop through list_1 and check if list_2 contains that value. Let’s check out the benchmark:

┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐ 
│ dupcount 1k items │      Min (ms) │      Max (ms) │      Avg (ms) │ 
├───────────────────┼───────────────┼───────────────┼───────────────┤ 
│ default           │ 520.405100000 │ 529.250500000 │ 523.529233333 │ 
│ first_upper       │  18.440300000 │  18.665600000 │  18.548466667 │ (x28) 
│ with_in           │  11.774500000 │  12.401600000 │  12.183367000 │ (x43) 
└───────────────────┴───────────────┴───────────────┴───────────────┘

With two simple changes we’ve increased performance a lot but there’s a way to increase performance even more.


3. Use the right data structure — deduplicating the lists

In order to save even more work we can de-duplicate both lists since this brings down the number of comparisons we have to perform. We’ll take a short detour to think about a function to de-duplicate our lists. Then we’ll integrate this into our list-comparison-function. The first function you might come up with could look like this:

def dedupper(the_list: List) -> List: 
  """ Removes duplicates from list (case-insensitive) """ 
  return_list = [] 
  for entry in the_list: 
    _entry = str(entry).upper() 
    if (_entry not in return_list): 
      return_list.append(_entry) 
  return return_list

Nothing too special; it just goes through every item in the list and puts it into another list if it isn’t already in there. The bulk of the work in this function is checking whether a value is already present in the result list. If we take a list of 100,000 words, the function takes a little over a minute to complete:

┌─Benchmark────────────────────┬──────────────┐ 
│ duplicate count (100,000x)   │     Avg (ms) │ 
├──────────────────────────────┼──────────────┤ 
│ dedup_dumb                   │ 61386.198600 │ 
└──────────────────────────────┴──────────────┘

Let’s optimize this function to use sets. A set can only contain unique values, so we don’t have to let Python check whether a value is already present in a collection; the set handles this for us. After rewriting we end up with a more readable and better-performing function:

def dedupper_smart(the_list: List) -> List: 
    """ Removes duplicates from list (case-insensitive) """ 
    return list({str(entry).upper() for entry in the_list})

As you see we use a set comprehension to loop through each item in the list. After this is done we end up with a set that contains only unique, uppercase values of our list. Before returning we cast it to a list. When we compare this new function to the old we see that it’s over 2,800x faster!

┌─Benchmark────────────────────┬──────────────┐ 
│ duplicate count (100,000x)   │     Avg (ms) │ 
├──────────────────────────────┼──────────────┤ 
│ dedup_dumb                   │ 61386.19860  │ 
│ dedup_smart                  │    21.85930  │(x2,808) 
└──────────────────────────────┴──────────────┘

The reason sets are this much faster is that they are implemented as hash tables in C (the language Python itself is written in): checking whether a value is already present is a near-constant-time hash lookup instead of the scan through every element that a list requires. In addition, a set is unordered and can only contain unique values, which saves a lot of work checking and ordering items. This, again, takes work out of Python’s hands, increasing performance.
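
To see the difference a hash-based membership test makes for yourself, here is a quick, hypothetical timeit comparison between looking a value up in a list and in a set (the absolute numbers will vary per machine, but the gap is dramatic):

import timeit

values = list(range(100_000))
as_list = values
as_set = set(values)

# look up a value near the end of the collection, 1,000 times each:
# the list scans item by item, the set does a single hash lookup,
# so the list lookup is orders of magnitude slower
print(timeit.timeit(lambda: 99_999 in as_list, number=1_000))
print(timeit.timeit(lambda: 99_999 in as_set, number=1_000))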

Let’s implement the set in our list-matching-function and benchmark:

def duplicate_count_dedup_lists(list_1: List, list_2: List) -> int: 
  """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """ 
  duplicates: Set[str] = set()  # requires "from typing import Set"
  # de-duplicate and uppercase both lists in one go with set comprehensions
  set_1 = {str(entry).upper() for entry in list_1} 
  set_2 = {str(entry).upper() for entry in list_2} 
  for item1 in set_1: 
    if item1 in set_2: 
      duplicates.add(item1) 
  return len(duplicates)

As you see, the lists are now first de-duplicated by converting them to sets. When we benchmark we see a huge performance increase:

┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐ 
│ dupcount 1k items │      Min (ms) │      Max (ms) │      Avg (ms) │ 
├───────────────────┼───────────────┼───────────────┼───────────────┤ 
│ default           │ 520.405100000 │ 529.250500000 │ 523.529233333 │ 
│ first_upper       │  18.440300000 │  18.665600000 │  18.548466667 │ (x28) 
│ with_in           │  11.774500000 │  12.401600000 │  12.183367000 │ (x43) 
│ dedup_lists_set   │   0.351700000 │   0.593000000 │   0.496767000 │ (x1,053) 
└───────────────────┴───────────────┴───────────────┴───────────────┘

Our function now completes in half a millisecond instead of half a second. Still, we have one more trick up our sleeve for the next part.


4. Using data structure functions — set intersection

One last optimization that we can add is to skip looping altogether and use the set’s intersection method:

def duplicate_count_smartest(list_1: List, list_2: List) -> int: 
  """ Returns the distinct number of items that are present in both provided lists (case-insensitive) """ 
  set_1 = {str(entry).upper() for entry in list_1} 
  set_2 = {str(entry).upper() for entry in list_2} 
  # intersection() returns the items that appear in both sets
  return len(set_1.intersection(set_2))

As you see we end up with a nicely readable, compact function. The intersection method returns a set of the items that are present in both set_1 and set_2. Lastly we return the length of this set. Let’s add this to our benchmark and see the final results:

┌─Benchmark─────────┬───────────────┬───────────────┬───────────────┐ 
│ dupcount 1k items │      Min (ms) │      Max (ms) │      Avg (ms) │ 
├───────────────────┼───────────────┼───────────────┼───────────────┤ 
│ default           │ 520.405100000 │ 529.250500000 │ 523.529233333 │ 
│ first_upper       │  18.440300000 │  18.665600000 │  18.548466667 │ (x28) 
│ with_in           │  11.774500000 │  12.401600000 │  12.183367000 │ (x43) 
│ dedup_lists_set   │   0.351700000 │   0.593000000 │   0.496767000 │ (x1,053) 
│ sets              │   0.340600000 │   0.369600000 │   0.351167000 │ (x1,490) 
└───────────────────┴───────────────┴───────────────┴───────────────┘

Not only is our final function a lot more readable, it’s also almost 1500x faster than what we started with.
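
As a small side note: the same intersection can also be written with the & operator, which is equivalent to calling intersection() on two sets; which one you use is purely a matter of readability:

set_1 = {"HELLO", "WORLD", "!"}
set_2 = {"WORLD", "PYTHON"}

print(set_1.intersection(set_2))  # {'WORLD'}
print(set_1 & set_2)              # {'WORLD'} - identical result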


Further optimization?

Maybe you’ve gone through all of these steps and your code’s performance is still not good enough. Then there is the option to apply multiprocessing, threads, or even rewrite (part of) a function in a more optimized language like C, Rust, or Cython. The goal of this article was not to reach for the big guns right away but to first squeeze out every drop of performance that plain Python can offer. If further optimization is still required afterwards, the work done with the methods in this article will still pay off and make that code even faster.

Advanced multi-tasking in Python: Applying and benchmarking thread pools and process pools in 6…
Safely and easily apply multi-tasking to your code

Conclusion

In this article we’ve gone through a very practical example of spotting, understanding, and eliminating performance bottlenecks. We’ve seen that you don’t have to make complex, structural changes to your code right away; sometimes small, easy-to-implement changes can optimize your code a lot. When this is not enough, there is always the option of using threads, multiple processes, or writing a Python package in a compiled language.

I think the most important part is to truly understand your code, understand how Python works and where it is less performant, and apply the right data structures. For this, check out my other articles here or this presentation.

I hope this article was as clear as I intended it to be, but if this is not the case please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics.

Happy coding!

— Mike

P.S: like what I’m doing? Follow me!
