We’re not all DBAs: Indexes For Developers
We know they speed up queries, but what’s going on under the hood? How do they work?
An index is a structure (commonly a B-Tree, but not required) that we attach to a table that keeps certain columns of that table organized and in memory. Indexes are a solution to the age old problem that going to disk is slow- by caching data in memory you save yourself time reading a records from disk that will mostly be discarded. It’s more efficient to keep common queryable columns in a searchable in-memory store with a reference to where on disk the rest of that row can be found. Having an indexed column lets you find what you need quickly and go to disk specifically for what you need.
There are lots of explanations of B-Trees- Markus Winand has a beautiful explanation. I also give a hearty shout out to Markus in general, his site and book are full of great content and his explanations are amazing- highly recommended.
As a developer needing to work with and manipulate databases constantly, there’s a few useful points on indexes that tend to be forgotten. Let’s consider a few.
An indexed column’s values don’t have to be unique. Cardinality, in database jargon, refers to “how unique” the values in an index are; high cardinality means the underlying values have little repetition. When evaluating a column to decide if we want to put an index on it, high cardinality (low repetition) is good for selectivity, but a column doesn’t need to have perfect cardinality to be a candidate. Indexes are able to handle non-unique values by scanning sequentially through the leaves of the underlying tree. The linkage between tree leaves helps prevent unnecessary operations stemming from jumping around through the internals of the tree, making it less costly to navigate through the entries in the index. Scanning data in this way makes doing an equality check (
val = 'matt') with multiple results a more lightweight operation. We can leverage this structure to scan through B-Tree indexes too- things like range scans (
val > Xand
val <= Y) and even some regexes (
first_name like 'matt%'). Careful though- even when you’re reading from an index in a range, there can still be a lot of data to read, making your query slow.
When columns are unique, during creation of an index, we can add the constraint that the index be unique, and that allows the optimizer leverage the uniqueness for lookup performance.
Multi-column indexes. Firstly, if you didn’t know you could do this- now you do! The underlying behavior and implementation of these indexes vary by DBMS, but across the board, multi-column indexes are versatile in that they can be used to query across all or a subset of columns in the index. To greatly simplify (this varies by DBMS), you can think of think of a table with 3 columns: col1, col2 and col3. When the multi column index is created, the values of all 3 columns are combined into a three part “value”: col1|col2|col3 and sorted- when a query is done, we traverse the tree by comparing against each “section” of the value one at a time and navigate the tree that way. With this arrangement, by supplying 3 values, we can traverse the tree as quickly as we would with a single column index, but giving us greater selectivity since querying by just one column might yield a ton of data to be returned. I mentioned that multi-column indexes support subsets of columns- queries can make use of a multi-column index if we query by only col1, col1 and col2, or all three- each column we add to the query increases the effectiveness of the index lookup because we are adding selectivity and so, less rows to pull back from disk. In some schema designs, multi-column indexes can be a performance boost because we’re able to eliminate so many rows by being selective in our seeking.
Index only scans. With multi-column indexes, it’s possible that the data you’re looking for are held by an index and, therefore, in memory. With the data so available, there’s no need to go to disk if the index can simply give us the data we need. Say you had a table with columns
agein an index and you wanted to find the age for people named ‘Matt Gale’. Your query could look like
select age from person where first_name = 'Matt' and last_name = 'Gale'. In this case there would be no need to go to disk because age is part of the index, so the optimizer just returns age values directly from the index. This can be a really nice resource saver if you can exploit it.
In my experience building applications, Object Relational Mappings (ORMs) over time, tend to grab too many columns during a query and developers don’t give much thought to lookups by more than just a single value and so favour single column indexes as a result. To be able to leverage index only scans, look into lazy loading more of your columns to see if you can squeeze some performance out of your queries. Something to always keep in mind when doing any query against a database is how much data will I get back from this query? Queries can be slow because of lack of proper indexes, but also because the volume of disk accesses required.
For example, let’s say you have a large table with an index on a boolean column (very low cardinality). If you tried to search for rows matching
false your time spent doing disk access is going to be huge (you’re returning half the table!) relative to the tiny amount of time spent looking values up in an index. This is an example of a situation where indexing is not what will save you time, you need to be more selective in the rows you want to get back. Indexes are an important part of database performance but they are not the only thing to consider.
I hope this gives a bit of insight! We don’t all have to be DBAs to write sufficiently fast queries and we shouldn’t need to be. As developers, getting familiar with the core structures of a database is a sufficiently pragmatic way to spot and improve performance. With that, go forth and write fast queries!