Why Python is so slow and how to speed it up

Take a look under the hood to see where Python’s bottlenecks lie

Mike Huls

Dec 8, 2021 — 10 min read

Let’s find out how the Python engine works so that we can go faster (image by Kevin Butz on Unsplash)

In this article we’ll discover that Python is not a bad language that is just very slow. It is optimized for the purpose it was built for: easy syntax, readable code and a lot of freedom for the developer. These design choices, however, do make Python code slower than code written in languages like C and Java.

Understanding how Python works under the hood will show us why it’s slower. Once the causes are clear we can work our way around them. After reading this article you’ll have a clear understanding of:

how Python is designed and works under the hood
why these design choices affect execution speed
how we can work around some of these bottlenecks to increase the speed of our code significantly

This article is split into three parts. In part A we take a look at how Python is designed. Then, in part B, we see how and why these design choices affect speed. Finally, in part C we’ll learn how to work around the bottlenecks that result from Python’s design and how we can speed up our code significantly.
Let’s go!

Part A — Python’s design

Let’s start off with a definition. Wikipedia describes Python as:

Python is an interpreted, high-level, general-purpose programming language. It is dynamically typed and garbage-collected.

Believe it or not, you’re going to understand the two sentences above after you’ve read this article. This definition gives a nice glimpse of Python’s design: being high-level, interpreted, general-purpose, dynamically typed and garbage-collected takes a lot of hassle away from the developer.

In the next parts we’ll go through these elements of design, explain what they mean for Python’s performance and conclude with a practical example.

Python is like a kite; easy to use and not super fast. C is like a fighter jet; super fast but not exactly easy to work with (image by Shyam on Unsplash)

Slowness vs waiting

First let’s talk about what we’re trying to measure when we say “slow”. Your code can be slow for a multitude of reasons but not all of them are Python’s fault. Let’s say that there are two types of tasks:

I/O-tasks
CPU-tasks

Examples of I/O tasks are writing a file, requesting some data from an API, printing a page; they involve waiting. Although they cause your program to take more time to execute, this is not Python’s fault. It’s just waiting for a response; a faster language cannot wait faster. This kind of slowness is not what we’re trying to solve in this article. As we’ll see later we can thread these types of tasks (also described in this article).
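To make the difference between waiting and computing concrete, here is a minimal sketch. It assumes a stand-in function, fake_io_task (a hypothetical name), that only sleeps to imitate waiting on a file or an API. Running five of these one after another takes roughly five times as long as running them in threads, because the threads all wait at the same time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(task_id: int, wait_seconds: float = 0.2) -> int:
    """Hypothetical stand-in for an I/O call; it does nothing but wait."""
    time.sleep(wait_seconds)  # sleeping releases the GIL, like real I/O waits do
    return task_id


def run_sequential(n_tasks: int) -> float:
    """Run the tasks one by one and return the elapsed time in seconds."""
    start = time.perf_counter()
    for i in range(n_tasks):
        fake_io_task(i)
    return time.perf_counter() - start


def run_threaded(n_tasks: int) -> float:
    """Run the tasks in a thread pool and return the elapsed time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        list(pool.map(fake_io_task, range(n_tasks)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(5):.2f}s")
    print(f"threaded:   {run_threaded(5):.2f}s")
```

The exact timings depend on your machine, but the threaded run should land close to the duration of a single task.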

In this article we figure out why Python executes CPU-tasks more slowly than other languages.

Dynamically typed vs Statically typed

Python is dynamically typed. In languages like C, Java or C++ all variables are statically typed; this means that you write down the specific type of a variable, like int my_var = 1;.
In Python we can just type my_var = 1. We can then even assign a new value of a totally different type, like my_var = "a string". We’ll see how this works under the hood in the next chapter.
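A short sketch of what this looks like in practice. The helper add is a hypothetical function used only for illustration; it shows that whether an operation is legal is decided when the line runs, not before.

```python
my_var = 1
first_type = type(my_var).__name__    # the type belongs to the value, not the name

my_var = "a string"                   # same name, now bound to a str
second_type = type(my_var).__name__

print(first_type, second_type)


def add(a, b):
    # No declared types: whether "+" is legal is only known at runtime.
    return a + b


print(add(1, 2))        # two integers: fine
print(add("a", "b"))    # two strings: also fine
try:
    add(1, "b")         # int + str: fails only when this line executes
except TypeError as exc:
    print("TypeError:", exc)
```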

Although dynamic typing is pretty easy for the developer, it has some major downsides as we’ll see in the next parts.

Compiled vs Interpreted

Compiling code means to take a program in one language and convert it into another language, usually a lower level than the source. When you compile a program written in C you convert the source code to machine code (which are actual instructions for the CPU), after which you can run your program.

Python works a little differently:

Source code is not compiled into machine code but into platform-independent bytecode. Like machine code, bytecode consists of instructions, but instead of being executed by the CPU they are executed by an interpreter.
Source code gets compiled while running. Python compiles files as needed instead of compiling everything before running the program.
The interpreter steps through the bytecode and, for each instruction, runs the machine code that implements it.

Python is compiled to bytecode rather than machine code largely because it is dynamically typed. Because we don’t specify the type of a variable beforehand, we have to wait for the actual value in order to determine whether what we’re trying to do is actually legal (like adding two integers) before the matching machine code can run. This is what the interpreter does. In statically typed, compiled languages these type checks and the translation to machine code happen before the code runs.
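We can peek at this bytecode with the standard library’s dis module. The sketch below assumes a tiny example function called add (a hypothetical name); the exact instruction names differ between Python versions, so treat the listing as illustrative.

```python
import dis


def add(a, b):
    return a + b


# The compiled bytecode lives on the function's code object.
raw_bytecode = add.__code__.co_code    # a bytes object
opnames = [ins.opname for ins in dis.get_instructions(add)]

print(raw_bytecode)
print(opnames)
dis.dis(add)    # human-readable listing; opcodes vary by Python version
```

Notice that nothing in the bytecode says what type a and b are. The generic "binary" instruction has to inspect its operands every time it runs.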

In summary: code is slowed down by the compilation and interpretation that occur at runtime. Compare this to a statically typed, compiled language, which just runs the CPU instructions once compiled.

It’s actually possible to extend Python with compiled modules that are written in C. This article and this article demonstrate how you can code your own extension in C to speed up your code 100x.

Garbage collection and memory management

When you create a variable in Python, the interpreter automatically picks out a spot in memory that is large enough for the value of the variable and stores it there. Then, when the variable is not needed anymore, the slot of memory gets freed again so that other processes can use it again.

In C, the language in which Python is written, this process is not automated at all. When you declare a variable you need to specify its type so that the correct amount of memory can be allocated. Freeing memory you have allocated is also manual.

So how does Python keep track of which variable to garbage-collect? For each object Python keeps track of how many references point to that object. If an object’s reference count drops to 0 then we can conclude that it isn’t used anymore and that its memory can be deallocated. We’ll see this in action in the next chapter.
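Reference counts can be observed from Python itself. The sketch below is specific to CPython and uses a hypothetical class called Payload. The absolute numbers returned by sys.getrefcount are an implementation detail; what matters here is that the count goes up by one when we add a second name, and that the object disappears once the last reference is gone.

```python
import sys
import weakref


class Payload:
    """Small hypothetical class; a plain object() cannot be weakly referenced."""


def refcount_demo():
    obj = Payload()
    before = sys.getrefcount(obj)    # absolute value is implementation-specific
    alias = obj                      # a second name for the same object
    after = sys.getrefcount(obj)
    del alias
    return before, after


def deallocation_demo():
    obj = Payload()
    probe = weakref.ref(obj)         # a weak reference does not raise the refcount
    alive_before = probe() is not None
    del obj                          # the last strong reference is gone
    alive_after = probe() is not None
    return alive_before, alive_after


if __name__ == "__main__":
    print(refcount_demo())
    print(deallocation_demo())
```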

Single-thread vs multi-threaded

Some languages, like Java, allow you to run code in parallel on multiple CPUs. Python, however, executes code on a single thread on a single CPU at a time by design. The mechanism that makes sure of this is called the GIL: the Global Interpreter Lock. The GIL makes sure that the interpreter executes only one thread at any given time.

The problem the GIL solves stems from the way Python uses reference counting for memory management. A variable’s reference count needs to be protected from situations where two threads simultaneously increase or decrease the count. This can cause all kinds of weird bugs, from memory leaks (when an object is no longer necessary but is not removed) to, worse, incorrect release of memory. In the latter case a variable gets removed from memory while other variables still need it.

In short: because of the way garbage collection is designed, Python has to implement a GIL to ensure that only one thread runs at a time. There are ways to circumvent the GIL though; read this article to thread or multiprocess your code and speed it up significantly.

Part B — A look under the hood: Python’s design in practice

Enough with all the theory, let’s see some action! Now that we know how Python is designed, we’ll compare the simple declaration of a variable in both C and Python. This way we can see how Python manages its memory and why its design choices result in slow execution times compared to C.

Now that we’ve totally deconstructed Python, let’s put it back together and check out how it runs (image by Jordan Bebek on Unsplash)

Declaring a variable in C

Let’s start out by declaring an integer in C called c_num:

int c_num = 42;

When we execute this line of code our machine does the following:

Allocate enough memory for an integer at a certain address (location in memory)
Assign the value 42 to the location of the memory that’s allocated in the previous step
Point c_num to that value

Imagine there now exists an object in memory that looks like this:

Representation of an integer variable called c_num with the value 42 (image by author)

If we assign a new number to c_num we write the new number to the same address, overwriting the previous value. This means that the variable is mutable.

We’ve assigned the value 404 to c_num (image by author)

Notice that the address (or location in memory) did not change. Think of it as c_num owning a piece of memory big enough for an integer. You’ll see in the next part that this differs from how Python works.
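We can imitate this C behaviour from Python with the standard library’s ctypes module, which gives us a real C int. This is only a rough sketch of the idea, not how you would normally write Python: the value is overwritten in place and the address stays the same.

```python
import ctypes

c_num = ctypes.c_int(42)                  # a real C int living in its own memory
address_before = ctypes.addressof(c_num)

c_num.value = 404                         # overwrite the value in place
address_after = ctypes.addressof(c_num)

print(c_num.value, address_before == address_after)
print(ctypes.sizeof(c_num), "bytes")      # typically 4 on common platforms
```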

Declaring a variable in Python

We’ll do the exact same thing as in the previous part; declare an integer:

py_num = 42

This line of code kicks off the following steps during execution:

Create a PyObject, allocating enough memory at an address
Set the PyObject’s typecode to integer (as determined by the interpreter)
Set the PyObject’s value to 42
Create a name called py_num
Point py_num to the PyObject
Increment the PyObject’s refcount by 1

Under the hood the first thing that’s done is to create a PyObject. This is what is meant by the phrase ‘everything in Python is an object’. Python might have the int, str and float types but under the hood every Python variable is just a PyObject. This is why dynamic typing is possible.

Notice that PyObject is not an object in Python. It’s a struct in C that represents all Python objects. If you are interested in how this PyObject works in C, check out this article, where we code our own C extension for Python that increases execution speed 100x!

The steps above create the (simplified) objects in memory below:

Our Python integer in memory (simplified) (image by Author)

You’ll immediately notice that we execute more steps and need more memory to store an integer. In addition to the type and value, we also store the refcount for garbage collection purposes. Also you’ll notice that the variable we’ve created, py_num, doesn’t own a block of memory. The memory is owned by the newly created PyObject to which py_num points.
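The extra memory is easy to measure. The sketch below compares the size of a plain C int (via ctypes) with the size of the whole Python integer object as reported by sys.getsizeof. The exact numbers depend on your platform and Python version, so no specific values are assumed here.

```python
import ctypes
import sys

py_num = 42

c_int_size = ctypes.sizeof(ctypes.c_int)    # size of a plain C int
py_int_size = sys.getsizeof(py_num)         # size of the whole Python int object

print(type(py_num))                         # the type is stored with the object
print(c_int_size, py_int_size)
```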

Technically speaking Python has no variables like C has; Python has names. Variables own pieces of memory and can be overwritten; names are references to objects.

So what happens when we want to assign a different value to py_num?

Create a new PyObject at a certain address, allocating enough memory
Set the PyObject’s typecode to integer
Set the PyObject’s value to 404 (the new value)
Point py_num to the new PyObject
Increment the new PyObject’s refcount by 1
Decrease the old PyObject’s refcount by 1

These steps alter the memory like in the image below:

Our memory after assigning a new value to py_num (image by author)

The image above demonstrates that instead of assigning a new value to py_num, we bind the name py_num to a new object. This way we can also assign a value of a different type, because a new PyObject is created every time. py_num just points to a different PyObject. We don’t overwrite like in C, we just point to another object.

Also notice that the refcount on the old object is 0; this will make sure it gets cleaned up by the garbage collector.
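Rebinding can be checked with the built-in id() function, which in CPython returns the memory address of the object a name points to. A minimal sketch:

```python
py_num = 42
first_id = id(py_num)     # in CPython, id() is the object's memory address

py_num = 404              # bind the name to a different int object
second_id = id(py_num)

py_num = "a string"       # rebinding to another type works the same way
third_type = type(py_num).__name__

other = py_num            # two names, one object
same_object = other is py_num

print(first_id != second_id, third_type, same_object)
```

Compare this with the C variable from before: there the address stayed the same and the value changed, here the name stays the same and the object it points to changes.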

Part C — How to speed things up

In the previous parts we’ve dug deep into Python’s design and have seen the consequences in action. We can conclude that the main problems for execution speed are:

Interpretation: compilation and interpretation occur at runtime due to the dynamic typing of variables. For the same reason, every time we create or “overwrite” a “variable” we have to create a new PyObject, pick an address in memory and allocate enough memory for it.
Single thread: the way garbage collection is designed forces a GIL, limiting all execution to a single thread on a single CPU

Time to speed up our hot rod with this jet engine (image by Kaspars Eglitis on Unsplash)

So, with all the knowledge of this article, how do we remedy these problems? Some tips below:

Use built-ins that are implemented in C, like range()
I/O-tasks release the GIL so they can be threaded; you can wait for many tasks to finish simultaneously (more info here and here)
Run CPU-tasks in parallel by multiprocessing (more info)
Create and import your own C-module into Python; you extend Python with pieces of compiled C-code that are 100x faster than Python. (info)
Not an experienced C-programmer? Write Python-like code that Cython compiles to C and then neatly packages into a Python package. It offers the readability and easy syntax of Python with the speed of C (more info)
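As a small illustration of the first tip, the sketch below sums a range of numbers twice: once with a hand-written loop and once with the built-ins sum() and range(). The function names are hypothetical and the timings will vary per machine, so run it yourself rather than trusting any particular number.

```python
import timeit


def total_with_loop(n: int) -> int:
    total = 0
    for i in range(n):      # every iteration runs through the bytecode interpreter
        total += i
    return total


def total_with_builtin(n: int) -> int:
    return sum(range(n))    # the loop happens inside C code


if __name__ == "__main__":
    n = 1_000_000
    loop_time = timeit.timeit(lambda: total_with_loop(n), number=5)
    builtin_time = timeit.timeit(lambda: total_with_builtin(n), number=5)
    print(f"loop:    {loop_time:.3f}s")
    print(f"builtin: {builtin_time:.3f}s")
```

Both functions return the same result; the built-in version simply leaves less work for the interpreter.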

Now that’s fast! (image by SpaceX on Unsplash)

Conclusion

If you’re still reading this, the complexity and length of this article haven’t scared you off. Kudos to you! I hope to have shed some light on how Python works under the hood and how to work around its bottlenecks.

If you have suggestions or clarifications, please comment so I can improve this article. In the meantime, check out my other articles on all kinds of programming-related topics.

Happy coding!

— Mike

P.S: like what I’m doing? Follow me!