SQL — Understand how indices work under the hood to speed up your queries.
No more waiting for slow queries to finish
Ever wondered how a database engine can return data for your queries that fast? How it can search through many tables, and millions of records in a flash? This article explores how the database engine works under the hood and sheds light on how to design your tables and indices in the most optimal way. No more waiting for queries to finish!
As usual we’ll first set up an example, creating some sample data so we have something to work with. Then we’ll check out how the database engine performs without indices. Then we’ll add indices to speed up our queries, demonstrating how you can too. At the end of this article you’ll:
- understand what an index is
- understand the types of indices and their differences
- understand how the indices work
- know in which situations to use which type of index
Note that in this article we’re using SQL Server, but the principle apply to many other relational databases like Postgres and MySQL e.g. The syntax might differ though. Let code!
Setup
For illustrating the code in this article I’ve created a table that a lot of applications use. It holds 20 milion records of User information and can be used to register new users, check passwords when logging in and changing user information. I’ve generated the table with Python, using the superfast insertion method described in this article. The code for creating the table looks like this:CREATE TABLE [dbo].[Users] (
[Created] DATETIME NOT NULL DEFAULT GETUTCDATE()
, [Modified] DATETIME NOT NULL DEFAULT GETUTCDATE()
, [FirstName] NVARCHAR(100) NOT NULL
, [LastName] NVARCHAR(100) NOT NULL
, [UserName] NVARCHAR(100) NOT NULL
, [Password] NVARCHAR(100) NOT NULL
, [Age] INT NULL
);
As you can see it contains some columns that relate to the user (the last 5) and two columns that keep track of when the record is created and updated. This table has some problems which we’re going to solve in this article.
Imagine that our Users table is used for a website like Reddit. On every login we need to check whether the username and password. Also, sometimes users change their email, username or password so we need to update as well. In addition we need to be able to add new users by inserting new records. How can we perform fast queries in such a big table?
The heap
We’ve currently designed our table in the dumbest and slowest way possible. When we are looking for a particular user our table has to take a look at every record in the table. This kind of table is called a Heap Table and examining each record in the table is called a Table Scan. Imagine going to a hotel and checking each single room before determining which one is yours! Very inefficient.
Querying
Let’s try to find records by the LastName column:SELECT *
FROM Med.dbo.UsersHeap WITH (NOLOCK)
WHERE LastName = 'LastName123123'
Executing the query above will take quite some time; the database engine has to scan every record in the table. It cannot stop at the first-found record because there might be even more user with the same last name. Also the data is unordered so the record can be in any position (in any row number that is).
Let’s analyze the execution plan:
You’ll see that the database engine performs a table scan which took almost a second! When you hover your mouse on this right-most block it shows you the image below:
You’ll see that it actually read all 20 million records and returned only 2. These kinds of Heap tables and table scans are very inefficient and should not be used. We can improve a lot by adding an index.
Clustered index
Let’s add a clustered index to our table: we’ll add a new column to our table that stores our primary key; an integer with a unique value for each row. These values are then sorted and stored physically, in a tree-like structure.
How it works
But how does the database engine use this index to retrieve rows quicker? Let’s use the hotel-example again: You need to find room E512.
- The letter E indicates that you have to go to the east-wing of the hotel
- The first number (5) indicates that we have to go to the 5th floor
- Once we exit the elevator we see a sign that says that rooms 1–20 are on the left, and rooms 21–40 are on the right. Since the rooms are ordered we don’t need to look far when finding our room!
The database engine works a lot like this. Instead of visiting each single room in the hotel and checking whether it’s ours, we use the tree-like structure of the index that leads us to our destination much quicker. The only difference is that instead of the three branches that we use (the east-wing, 5th floor, right side of the hall), the database engine uses many, many more.
Creating the index
Let’s fix our table:CREATE TABLE [dbo].[Users] (
[Id] INTEGER IDENTITY PRIMARY KEY NOT NULL
, [Created] DATETIME NOT NULL DEFAULT GETUTCDATE()
, [Modified] DATETIME NOT NULL DEFAULT GETUTCDATE()
, [FirstName] NVARCHAR(100) NOT NULL
, [LastName] NVARCHAR(100) NOT NULL
, [UserName] NVARCHAR(100) NOT NULL
, [Password] NVARCHAR(100) NOT NULL
, [Age] INT NULL
);
The magic is in the second line; when you provide a PRIMARY KEY column the database engine creates a clustered index automatically. The IDENTITY part will generate a new integer for every record. Our Id column is new. THe new table looks like this:
Querying
Now we can retrieve records by the Id column. Let’s say we want to retrieve records of Id = 42:SELECT *
FROM Med.dbo.Users WITH (NOLOCK)
WHERE Id = 42
When we execute this, the database engine uses our newly created index; it is much, much faster since it uses the hotel-like method. We can also see this in the execution plan:
See that we’re using a clustered index seek? This indicates that we’re using our clustered index. Notice the time too: 0.000 seconds; that’s fast! We can also check this out in the execution details of this tile:
See that we’ve only read one single row? This is quite the difference compared to our table scan! But what if we want to use another column to filter on?
Nonclustered index
Okay, so the primary key is on the Id column and we can find our records pretty fast using the clustered index. How can we optimize queries where we filter on other columns? For these columns we can create a nonclustered index.
How it works
A nonclustered index works much like the index in a book. These contain certain words and where these words are used. In a economics book you might see that the term ‘inflation’ is used on page 42, 119 and 246. A nonclusted index works a lot like this: it takes all values from the column as keys and registers all the corresponding id’s (from the clustered index). A nonclustered index needs a clustered index to operate. These keys-values-pairs are ordered and stored in a tree-likes structure of their own so we can quickly locate them. This operation is called an Index Scan.
First we’ll create the index:CREATE NONCLUSTERED INDEX ix_users_lastname
ON Med.dbo.Users(LastName)
Image you’re filtering on LastName = ‘Cartman’:
- Then the nonclustered index will perform a key lookup: it will go through the tree, looking for the key called ‘Cartman’. It will find 3 records in our table that have this last name; they have id’s 4, 16 and 333.
- In step 2 we’ll use our clustered index, performing an index seek, to return our actual records. Let’s see this in action!
Querying
Now we can retrieve records by the Name column. Let’s say we want to retrieve records of LastName = ‘LastName456456’:SELECT *
FROM Med.dbo.Users WITH (NOLOCK)
WHERE LastName = 'LastName456456'
This query is superfast as well; as you can see in the execution plan below all operations finished in 0.000 seconds:
You’ll see that it first executes the key lookup; it looks for the value of our column and returns the id’s. Then these id’s get used in the index seek. These two indices working together guarantee a user-friendly, superfast query!
Conclusion
In this article I’ve hoped to shed some light on speeding up your tables by adding indices. This article focused mainly on creating indices and retrieving data from tables. In future article’s we’ll get into when creating an index is actually detrimental for performance and how to best insert data to prevent frequent updates of your index. Follow me to stay notified!
In the meantime, check out my other articles on all kinds of programming-related topics like these:
- DELETE INTO another table
- UPDATE INTO another table
- Insert, delete and update in ONE statement
- UPDATE SELECT a batch of records
- Save upserting
- Inserting into a UNIQUE table
- Inserting values with joined id’s from another table
Happy coding!
— Mike
P.S: like what I’m doing? Follow me!