Looking Under the Hood: The Basics of Relational Databases
How standard relational databases are implemented
If you use databases often and want to understand how they work, this article is for you. It isn’t a guide on how to use databases or how to design schemas; it’s just about how standard relational databases are implemented. For relational databases like MySQL or PostgreSQL, how do things actually work under the hood? How is data laid out on disk? What is this “Query Planner” I’ve heard so much about? Understanding the nuts and bolts of relational databases is a great foundation for optimizing performance and troubleshooting issues.
How are rows stored on a disk?
Relational databases are stored by row, or as they’re called in relational-database-land, tuples or records. This means all the data in a particular row of a database is next to each other on disk. This is different from some types of databases (such as those designed for analytics, like Snowflake) that store all the data in a given column together.
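To make the difference concrete, here’s a small sketch (in Python, with toy in-memory lists standing in for disk) of the same table laid out row-wise versus column-wise:

```python
# The same table, laid out two ways.
rows = [
    (1, "Harry Potter", 3),
    (2, "Hermione Granger", 3),
]

# Row store: all fields of one record sit together
# (good for "fetch everything about student 2").
row_store = [field for record in rows for field in record]

# Column store: all values of one column sit together
# (good for "average year across all students").
column_store = [field for column in zip(*rows) for field in column]

print(row_store)     # [1, 'Harry Potter', 3, 2, 'Hermione Granger', 3]
print(column_store)  # [1, 2, 'Harry Potter', 'Hermione Granger', 3, 3]
```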
Say we have a table called Students that looks like this:

ID  Name               Wand Wood   Year
1   Harry Potter       Holly       3
2   Hermione Granger   Vine        3
Remember, a disk is just a long sequence of bytes, so on disk the data is stored like this:
1 Harry Potter Holly 3 2 Hermione Granger Vine 3
Or another way to look at that same data on disk (with the Name field padded to 16 characters and Wand Wood to 12):

1 Harry Potter\0\0\0\0 Holly\0\0\0\0\0\0\0 3 2 Hermione Granger Vine\0\0\0\0\0\0\0\0 3

Notice the gaps of just \0 (null, empty) repeated? There are a few reasons for these gaps.
Easy edits
First, since Wand Wood could be up to 12 characters, the database leaves extra space at the end of shorter values in case you later edit the wood to a longer name. Imagine if you stored Harry Potter in exactly the 12 characters his name needs, and then in 4th year he decided to change his official name to Harry J. Potter. You’d have no space left, so you’d have to slide all the remaining characters down on disk, meaning you’d rewrite the entire rest of the table after his row! That sounds slow. Better to leave a gap, since storage is cheap.
Easy navigation
It’s also important for all tuples to take up the same amount of room so that you can easily hop between them. If you wanted to know where row 5 starts and all the rows were different lengths, you’d need to know the size of each row before row 5 and add them up. Instead, we can just compute 4*ROW_LENGTH, which is the same for every row, letting you easily navigate the table.
Within a specific tuple, since the fields are all consistent sizes, you can easily locate a field by looking up the size of each one before it (or a precomputed offset) and hopping to it.
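Here’s a sketch of this fixed-width layout using Python’s struct module (the field widths are made up for illustration: an 8-byte id, 16 bytes for the name, 12 for the wand wood, 8 bytes for the year):

```python
import struct

# Hypothetical fixed-width record layout. Shorter strings are
# padded with \0 bytes, just like the gaps described above.
RECORD = struct.Struct("q16s12sq")

def pack_row(student_id, name, wood, year):
    return RECORD.pack(student_id, name.encode(), wood.encode(), year)

# Two records laid out back to back, as on disk.
table = pack_row(1, "Harry Potter", "Holly", 3) + \
        pack_row(2, "Hermione Granger", "Vine", 3)

# Because every row is RECORD.size bytes, row n starts at n * RECORD.size:
# no need to measure the rows before it.
def read_row(data, n):
    offset = n * RECORD.size
    sid, name, wood, year = RECORD.unpack_from(data, offset)
    return sid, name.rstrip(b"\0").decode(), wood.rstrip(b"\0").decode(), year

print(read_row(table, 1))  # (2, 'Hermione Granger', 'Vine', 3)
```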
Memory alignment
Another more subtle reason for the gaps is that data is stored to align to the space it will take in memory, making it faster to read into memory: if the data is aligned to memory size on disk, then you can read data directly into memory from disk without having to shift it around. On modern 64-bit machines, data generally sits in chunks of size 64 bits for fastest access unless you have some pressing reason to try to squeeze in extra data—like varints in protobufs trying to reduce the size of data sent over a wire.
Data in your database will be spaced out so that it aligns into 64-bit chunks and can easily be read into memory for fast access.
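You can see alignment padding directly with Python’s struct module: the '@' prefix uses the platform’s native alignment (inserting padding bytes), while '=' packs fields back to back. The exact aligned size assumes a typical 64-bit platform.

```python
import struct

# A 1-byte char followed by an 8-byte integer.
aligned = struct.calcsize("@cq")  # native alignment: pads after the char
packed  = struct.calcsize("=cq")  # no alignment: fields back to back

print(aligned)  # 16 on a typical 64-bit machine (1 + 7 padding + 8)
print(packed)   # 9 (1 + 8, no padding)
```

The padded layout wastes 7 bytes per record but lets the CPU read the integer in a single aligned access.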
Organizing tuples in a table
Going up a layer, hardware storage is generally divided into segments called blocks, and databases group tuples into fixed-size pages built on top of them (8 kilobytes by default in Postgres). A page is read all at once and transferred to memory in one go. For this reason, relevant tuples are stored sequentially within a page so they can all be read in at once. And you wouldn’t want a tuple to start on one page and end on another: if you did that, you would have to read in two whole pages to read one tuple. How data is organized within a page is beyond the scope of this article, but here’s some documentation on how it’s done in Postgres for those interested. (Though one thing to note for later is there’s usually some header data associated with each tuple.)
If a table spans multiple blocks, the blocks will point to each other. It’s helpful to have all of a given table as close together on disk as possible so that it can all be read into memory in as few loads as possible.
It’s important to remember that these databases were designed in an era when hard disks prevailed, not SSDs. Hard disks have an interesting property: it takes MUCH longer to find a particular block (a seek) than to read data from that block (and any subsequent blocks). While seeks are still slower than reads on SSDs, the difference is much less pronounced—a seek is ~9 ms on a modern HDD vs ~0.1 ms on a modern SSD. That means database reads would be way, way slower if data that tends to be accessed together were spread out on the disk. Many of the design decisions that went into the major relational databases are vestigial consequences of this old technology.
That being said, you can tune databases to be more optimized for SSDs. For instance, see random_page_cost and effective_io_concurrency for Postgres.
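For example (a sketch, not a recommendation; the right values depend on your hardware and workload), you might lower Postgres’s assumed cost of random I/O on an SSD:

```sql
-- Postgres defaults assume spinning disks, where a random page read
-- costs ~4x a sequential one (random_page_cost defaults to 4.0).
-- On an SSD the gap is much smaller:
SET random_page_cost = 1.1;

-- Allow more concurrent I/O requests, which SSDs handle well:
SET effective_io_concurrency = 200;
```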
Some common operations
We can go over some simple operations on databases with just the information we have so far.
Adding a row
Adding a row is pretty simple in a regular table. You navigate to the end of the existing rows (there would usually be a nice pointer to get you there) and write the relevant data.
Adding a row is much trickier if your database is stored in sorted order. Then you must use a table scan or index lookup (more on this later) to find the relevant place in the database. If there happens to be an empty slot (for example, from a previously deleted row), you can add your row; if not, you’ll need to shift everything that comes after it over, or split the full page in two (called a page split). Fancy pointer magic can avert having to shift things, but remember that it might result in more disk seeks, which are slow.
This is why databases are not usually stored sorted and instead rely on indexes to maintain ordering (again, more on indexes shortly!).
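A toy sketch of why sorted inserts are expensive: inserting into the middle of a contiguous array shifts every later element, which on disk would mean rewriting all of them.

```python
import bisect

# A table kept in sorted order by year (an in-memory stand-in for disk).
years = [1, 2, 4, 5, 6, 7]

# Appending to the end touches one slot...
years.append(8)

# ...but inserting year 3 in its sorted position shifts everything from
# that index onward one slot to the right -- on disk, a rewrite of all of it.
pos = bisect.bisect_left(years, 3)
years.insert(pos, 3)

print(pos)    # 2: every element from index 2 onward had to shift
print(years)  # [1, 2, 3, 4, 5, 6, 7, 8]
```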
Deleting a row
The logical thing to do when deleting a row might be to slide everything after it back to conserve space, but remember that, as with adding to a sorted table, that would mean rewriting the entire table that comes after the row. Instead, a flag in the tuple’s header is usually just flipped to indicate that the row is now deleted. The tuple is now morbidly called a tombstone or dead row. Dead rows can later be reused, and if you have a lot of them, it might be a good idea to consolidate your table, called vacuuming in Postgres and done with OPTIMIZE TABLE in MySQL.
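A toy sketch of tombstoning, with a hypothetical one-flag header on each tuple:

```python
# Each tuple carries a small header; here it's just a "dead" flag.
table = [
    {"dead": False, "id": 1, "name": "Harry Potter"},
    {"dead": False, "id": 2, "name": "Hermione Granger"},
]

def delete_row(table, row_id):
    # No data is moved or rewritten: we only flip the header flag.
    for tup in table:
        if tup["id"] == row_id:
            tup["dead"] = True

def scan(table):
    # Readers must skip tombstones.
    return [t for t in table if not t["dead"]]

delete_row(table, 1)
print([t["name"] for t in scan(table)])  # ['Hermione Granger']
print(len(table))  # 2: the dead row still occupies space until vacuumed
```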
Indexes
Okay so we’ve got our tables laid out on disk. Generally, if you want to access data now, you’ll need to scan the entire table and check each row to see if it matches your search query. This can obviously be very slow for a large table… Enter indexes.
Indexes are pretty simple. They store relevant columns in a smaller, faster data structure that you can use to hop quickly to the desired part of your table. As part of designing a database schema, an engineer thinks about which columns and combinations of columns would benefit from this fast lookup structure and creates an index for them. You can always add indexes later on too as you add new columns and data access patterns.
Indexes are usually stored in a tree structure because they have super fast O(log(n)) lookup. The particular structure that’s most common is a B-tree, though you can configure this.
For instance, above in our students table we could add an index on year that would let you sort, select, or do cutoffs by student year in O(log(n)). More on this in the next section.
So to summarize, without an index you need to scan a table to find a row. An index lets you use a fast data structure to look up where your data is without having to scan the whole table. It also lets you walk through data in a sorted order without having to read the whole table into memory or store it sorted.
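You can see an index in action with SQLite from Python (a sketch; SQLite’s planner output differs from Postgres’s and varies by version, so the plan text below is indicative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, year INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Harry Potter", 3), (2, "Hermione Granger", 3)])
conn.execute("CREATE INDEX idx_year ON students (year)")

# Ask the planner how it would run the query, without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM students WHERE year > 2"
).fetchall()
for row in plan:
    print(row)  # the detail column typically mentions idx_year

rows = conn.execute("SELECT name FROM students WHERE year > 2").fetchall()
print(rows)  # [('Harry Potter',), ('Hermione Granger',)]
```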
Query planning and execution
So now the real purpose of a database: querying! What is the process that takes us from SELECT * FROM students to actually getting the data we need?
This happens in 4 stages:
Parsing: Parsing is relatively straightforward. Just take the query text and turn it into a structured representation (a parse tree), as any compiler or interpreter would.
Query re-writing: this is the query planning stage, where sophisticated optimizations can transform the basic query you wrote into a very speedy one. This is where a lot of the magic (and confusion!) happens.
Physical query plan creation: take the logical query plan from step 2 and turn it into a physical query plan: what do you actually read, from which tables, columns, indexes, etc.? While step 2 seems like it has most of the “optimization”, step 3 actually has some as well. For instance, how should the data be passed between stages: via memory or disk?
Execution: Execute the query and return the results
The query planner generates multiple possible plans and then scores them based on how long it thinks they’ll take and how many resources they’ll consume, then it picks the best one.
Let’s dig in a little bit more to steps 2 and 3, collectively called the query optimizer. Let’s look at a few simple examples. To get a basic sense of why this would be needed, consider the query
SELECT * FROM students WHERE year > 2
on our students table above. Remember, we have an index on year. So here are two options:
Read every row in students and, if the year is > 2, copy that row into our result set. This is called a table scan, and if you’re scanning tables of any significant size in an interactive application, it’s almost always bad to do this because you have to read the entire table into memory and do computations on it.
Use an index on year to find pointers to rows where year > 2, and copy those rows into our result set.
It might seem like there’s an obvious choice: isn’t O(log(n)) (for option 2) better than O(n)? Often it is, and that’s why we make indexes in the first place! But there are a few circumstances where it might not be. What if 99.9% of rows have year > 2? Then it might actually cost us time to read in the index rather than just go straight to the table scan. Or maybe the data is very small, and it’s faster just to run through the data than hop between the index and the data. Databases actually keep statistics internally on the contents of tables so that they can make intelligent decisions about these types of tradeoffs.
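SQLite exposes its statistics directly: after running ANALYZE, the sqlite_stat1 table holds the per-index row counts the planner consults when weighing these tradeoffs (a sketch; Postgres keeps the analogous data in pg_stats, populated by its own ANALYZE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, year INTEGER)")
# 100 rows, with only 4 distinct years.
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(i, i % 4) for i in range(100)])
conn.execute("CREATE INDEX idx_year ON students (year)")

# Gather statistics about table and index contents.
conn.execute("ANALYZE")

# The planner's raw inputs: table name, index name, and row-count stats.
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # e.g. [('students', 'idx_year', '100 25')]
```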
The above example is super simple. As queries increase in complexity, so too do query plans. There are several ways to plan join operations, and which one to pick can depend on many factors, from how big the tables are to the amount of memory to the cardinality of different fields.
The query optimizer can get extremely fancy, and it can sometimes do a bad job. You’re probably most accustomed to actually looking at the query planner’s results when it is doing a bad job—that’s when a query that should be fast isn’t and you have to go in and debug why with an explain command.
Finally, since query planning is NP-complete (read: slow), there’s a tradeoff: how much time do you spend searching for an optimal query vs just going for it? This tradeoff is often configurable.
Databases in Practice
I hope this guide helps you feel a little more confident delving into the world of relational databases. Relational databases, while simple on the surface, are extremely complex. They have managed to power most software for decades. Postgres has dozens of parameters just to configure the query planner, not to mention everything else. You’ll have no shortage of interesting intricacies to delve into as you continue your journey, from write-ahead logging to replication to transactions.
Sources and Further reading
I drew from a variety of sources from discussions with experts to reading Wikipedia, but I mostly used these:
In addition to the above, a few other interesting things in database-land:
Read about Dremel, a distributed analytics database by Google (the basis for Snowflake, BigQuery, and many modern OLAP databases)
Read about Aurora, an innovative use of hardware by AWS
Read about Spanner, a globally consistent database created at Google
Please leave other recommendations! And huge thanks to my reviewers, Julie Tibshirani, Dylan Visher, and Stuart Cornuelle.