Dramatically improve your database insert speed with a simple upgrade

4 levels of creating blazingly fast database connections in Python

Your Python script after being upgraded with fast_executemany (Image by NASA on Unsplash)

Uploading data to your database is easy with Python. Load your data into a Pandas dataframe and use the dataframe.to_sql() method. But have you ever noticed that the insert takes a lot of time when working with large tables? We don’t have time to sit around, waiting for our queries to finish!

Graph that shows the extreme speed of fast_executemany
Image based on François Leblanc on leblancfg.com

With a few tweaks you can make inserts A LOT faster. Compare the write times of the brown bars (the default to_sql() method) with the green bars (our goal). Also notice that the vertical axis is on a logarithmic scale! At the end of this article you’ll be able to perform lightning-fast database operations. Ready? Let’s go!


Goals and steps

We’ll work towards the superfast insertion method in two steps. The first part focuses on how to properly connect to our database; the second part explores four ways to insert data, ordered from slowest to fastest.


1. Connecting to our database

In order to communicate with any database at all, you first need to create a database engine. This engine translates your Python objects (like a Pandas dataframe) into something that can be inserted into the database. In order to do so it needs to know how to access your database; this is what a connection string is for. We’ll first create a connection string and then use it to create our database engine.

The connection string can be seen as the key to using our database (Image by Pixabay on Pexels)

1.1 Create a connection string

The connection string contains information about the type of database, the ODBC driver, and the database credentials we need to access the database.

constring = "mssql+pyodbc://USERNAME:PASSWORD@DATABASESERVER_IP/DATABASENAME?driver=SQL+Server+Native+Client+11.0"

In the example above we create a string for connecting to a Microsoft SQL Server database. As you can see we specify our database credentials (username, password, the IP of our database server, and the name of our database), as well as the driver we are using. The format of a connection string differs per database; check out connectionstrings.com for an overview of what your connection string should look like.
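The pattern is the same for other databases; below are a few common examples (the hosts, ports and credentials are placeholders, and you’ll need the matching driver package such as psycopg2 or pymysql installed):

# PostgreSQL (requires the psycopg2 package)
constring_pg = "postgresql+psycopg2://USERNAME:PASSWORD@HOST:5432/DATABASENAME"

# MySQL (requires the pymysql package)
constring_mysql = "mysql+pymysql://USERNAME:PASSWORD@HOST:3306/DATABASENAME"

# SQLite (file-based, no credentials needed)
constring_sqlite = "sqlite:///path/to/database.db"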

About drivers: the driver needs to match the database you’re using. In the example above we use the SQL Server Native Client 11.0 driver, which needs to be installed on the machine you’re running your Python scripts on. Check which drivers are installed by searching for ‘ODBC Data Sources’ (under the Drivers tab). If your required driver is not installed you can easily download it (e.g. https://www.microsoft.com/en-us/download/details.aspx?id=36434).
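You can also list the installed drivers from Python itself, using the pyodbc package that our mssql+pyodbc connection string relies on:

import pyodbc

# prints the names of all ODBC drivers installed on this machine
print(pyodbc.drivers())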

1.2 Creating a database engine with our connection string

Once the connection string is valid it is easy to create the database engine. I usually create mine like below.

import sqlalchemy 
dbEngine = sqlalchemy.create_engine(constring, connect_args={'connect_timeout': 10}, echo=False)

Setting echo to True allows you to see all the queries that are executed. If you set echo to the string ‘debug’ the result rows will be printed as well.
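For illustration (you would normally pick just one of these):

# log every SQL statement the engine executes
dbEngine = sqlalchemy.create_engine(constring, echo=True)

# additionally log the rows returned by each query
dbEngine = sqlalchemy.create_engine(constring, echo='debug')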

1.3 Testing our database engine

Use the little script below to test your connection. If all went well it should print that the engine is valid. If something went wrong it’ll print the error.

try:
    with dbEngine.connect() as con:
        con.execute("SELECT 1")
    print('engine is valid')
except Exception as e:
    print(f'Engine invalid: {str(e)}')


2. Four levels of fast insertion methods

Now that we are able to connect to our database we can start inserting data into it. We’ll explore 4 ways of inserting data, ending with the fastest. At the end of this article you’ll find a detailed summary of pandas’ to_sql method.

Let’s race these methods and see which one wins (Image by Absolute Charm on Pexels)

2.1 The vanilla to_sql method

You can call this method on a dataframe and pass it the database engine. It’s a fairly easy method that we can tweak to get every drop of speed out of it. In the example below we create a dataframe and just upload it.

import pandas as pd

# 1. Create a dataframe
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})
print(df.head())

# the dataframe looks like:
#    numbers colors
# 0        1    red
# 1        2  white
# 2        3   blue

# 2. Upload this dataframe
df.to_sql(con=dbEngine, schema="dbo", name="colortable", if_exists="replace", index=False)

As you can see we only have to specify our connection (the dbEngine we created earlier), the schema we want to put our new table in, and the new table name. In addition we can determine what to do if the specified schema.tablename already exists (in our case: replace it) and whether we want to put an index on the table (see the overview at the end of this article for a complete list of parameters).

The example above is the easiest way of inserting data, but also the slowest. The problem is that we write the entire dataframe all at once, creating an insert statement for each record. On a small table like our color table this isn’t a huge problem, but on large tables it definitely is.
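If you want to measure this on your own tables, a minimal timing sketch (df_large stands in for whatever large dataframe you’re uploading):

import time

start = time.perf_counter()
df_large.to_sql(con=dbEngine, schema="dbo", name="largetable", if_exists="replace", index=False)
print(f"vanilla to_sql took {time.perf_counter() - start:.1f} seconds")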

2.2 Chunking

Our code will run a bit better if we add a chunksize. This will write the data in batches of the specified chunksize, saving a lot of memory.

df_large.to_sql(con=dbEngine, schema="dbo", name="largetable", if_exists="replace", index=False, chunksize=1000)

Still, this method creates an insert statement for each record in our table, which remains very slow.

2.3 Multi-insert

Adding the ‘multi’ method will improve the insertion speed a lot. Instead of writing one insert statement per record we can now send multiple rows in one statement. The database can process multiple records in one operation instead of an operation per record.

IMPORTANT: this method will not work for a Microsoft SQL Server database, and it isn’t necessary there either; see method 2.4 below.

df_target.to_sql(con=dbEngine, schema="dbo", name="targettable", if_exists="replace", index=False, chunksize=1000, method='multi')

The ideal chunksize depends on your table dimensions. A table with a lot of columns needs a smaller chunk-size than a table that has only 3.
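A rule of thumb for picking that chunksize: every cell in a chunk becomes a parameter in the generated insert statement, and databases cap the number of parameters per statement (SQL Server, for example, at 2100, which is one reason method='multi' is problematic there). A hypothetical helper based on that limit:

# hypothetical helper: keep rows-per-statement under the database's
# parameter limit (2100 for SQL Server; check your own database's limit)
def safe_chunksize(df, max_params=2100):
    return max(1, max_params // len(df.columns))

df_target.to_sql(con=dbEngine, schema="dbo", name="targettable", if_exists="replace",
                 index=False, chunksize=safe_chunksize(df_target), method='multi')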

For many databases this is the fastest way to write data. For Microsoft SQL Server, however, there is a still faster option.

2.4 SQL Server fast_executemany

SQLAlchemy 1.3 provides us with the fast_executemany option when creating the dbEngine for SQL Server. This method is the fastest way of writing a dataframe to a SQL Server database.

dbEngine = sqlalchemy.create_engine(constring, fast_executemany=True, connect_args={'connect_timeout': 10}, echo=False)
df_target.to_sql(con=dbEngine, schema="dbo", name="targettable", if_exists="replace", index=False, chunksize=1000)

In the code above you see that we have to adjust our database engine a bit: we just have to add the fast_executemany option. That’s it.
Then, when we write our dataframe to the database, the only thing we have to remember is to not specify a method (or to set method=None). This is because with fast_executemany enabled, the rows are already sent to the database in batches.
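If you’re stuck on a SQLAlchemy version older than 1.3, a commonly used workaround is to switch on pyodbc’s fast_executemany through an event listener instead (a sketch, not part of the recipe above):

from sqlalchemy import event

@event.listens_for(dbEngine, 'before_cursor_execute')
def set_fast_executemany(conn, cursor, statement, params, context, executemany):
    # enable pyodbc's batched parameter binding for executemany calls
    if executemany:
        cursor.fast_executemany = True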

fast_executemany is like an F1 car in a horse race! (Image by Chuttersnap on Unsplash)

Conclusion

With these simple upgrades you now have the tools to improve your Python-to-database connections. I’ve included a summary with example code and steps below. Happy coding!

-Mike

P.S.: like what I’m doing? Follow me!


Overview: to_sql parameters

Below is an overview and explanation of all the settings of the to_sql function.

  • con (sqlalchemy engine)
    the database connection (sqlalchemy engine)
  • name (str): required
    name of the table you want to write to
  • schema (str): Default: default database schema (dbo)
    name of the schema you want to write to.
  • if_exists (str): Default: ‘fail’
    What to do if the specified table already exists?
    - ‘fail’: throws an error
    - ‘append’: appends the data to the specified table
    - ‘replace’: replaces the table altogether (WARNING: this is dangerous)
  • index (bool): Default: True
    If set to True, the dataframe’s index is written to the table as an extra, indexed column
  • index_label (str or sequence): Default: None
    Column label for the index column(s). If None, the index name is used.
  • chunksize (int): Default: None
    Write rows in batches of size [chunksize]. This is a good idea if you have a lot of records that need to be uploaded. This saves memory. If None: writes all rows at once
  • dtype (dict or scalar): Default: None
    Specify datatypes.
    If a scalar is specified it is applied to all columns in the dataframe before writing to the database. To specify a datatype per column, provide a dictionary where the dataframe column names are the keys and the values are sqlalchemy types (e.g. sqlalchemy.Float); see the sketch after this list.
  • method (str): Default None
    Controls the SQL insertion clause
    - None: uses standard SQL insert clause (one per row)
    - ‘multi’: passes multiple values to a single Insert clause.
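
As promised, a short sketch of the dtype parameter in action, reusing the color table from section 2.1:

import sqlalchemy

# map each dataframe column to an explicit sqlalchemy type
df.to_sql(con=dbEngine, schema="dbo", name="colortable", if_exists="replace", index=False,
          dtype={'numbers': sqlalchemy.Integer, 'colors': sqlalchemy.String(10)})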