Recently I’ve been implementing a subset of SQLite (the world’s most used database, btw) from scratch in Go. I’ll share what I’ve learned about how SQLite stores data on disk which will help us understand key database concepts.
Let’s keep it hands-on and create a new SQLite database, a Users table, and insert a single user.
michal@michal-ThinkPad-T490s:$ sqlite3 mydb.db
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> CREATE TABLE Users (
Id INT PRIMARY KEY,
Username VARCHAR(32),
Email VARCHAR(255));
sqlite> INSERT INTO Users (Id, Username, Email) VALUES (1, "michal", "michal@example.com");
sqlite> SELECT * FROM Users WHERE Id = 1;
1|michal|michal@example.com
sqlite>
As expected, everything works just fine. Let’s exit and reopen the database.
sqlite> .exit
$sqlite3 mydb.db
sqlite> SELECT * FROM Users WHERE Id = 1;
1|michal|michal@example.com
We got our user back. At some point, SQLite wrote our changes to the mydb.db
file - SQLite is unusual because it stores all data in a single file. Let’s inspect this file with a hex editor to see if we can find our data. The first 16 bytes of the file inform us of the SQLite format used.
A bit deeper, we find the schema for the Users
table.
And finally, the row we inserted!
Let’s see what else we can learn about this file by running stat mydb.db
.
michal@michal-ThinkPad-T490s:$ stat mydb.db
File: mydb.db
Size: 12288 Blocks: 24 IO Block: 4096 regular file
Device: 10302h/66306d Inode: 8938166 Links: 1
The file size is 12288 bytes, we’ll come back to this. We might also notice that the IO Block is 4096. This also turns out to be the default page size on my machine, but in general, it depends on the OS and file system.
SQLite stores the database configuration in the first 100 bytes of the root page. The page size is stored at bytes 16-17 and it is 4096 bytes. The number of pages used by the DB is stored in bytes 28-31 and it is 3. Multiply those two together and we get back the file size 4096 * 3 = 12288, exactly what we found with stat
earlier!
Let’s recap. We can create a table, insert rows, and store it on disk. Then we can retrieve it from the disk.
Why do we split the database into pages?
I’ve gone ahead and populated the Users
table with 100k entries using the following format: sprintf(“%d, user%, user%d@example.com”, i, i, i)
sqlite> SELECT COUNT(Id) FROM Users;
100000
To store those 100k entries, SQLite uses 1361 pages taking up 5.4MB. SQLite internally uses a B+ tree data structure to store data. B+ trees are balanced N-ary trees with a (usually) large number of children per internal node. You can think of them like balanced binary search trees, but with more than 2 children per node, stored values only in leaf nodes, and information on how to traverse the tree in the interior nodes.
Each node in the tree corresponds to a physical page stored on disk. Inserted rows are stored in leaf nodes only. Index information is stored in interior nodes. I might mix calling them pages and nodes, but they mean the same thing.
Leaf pages
Let’s start with the simpler leaf pages. Both leaf and interior nodes share the same header format visualized in the diagram below. I’ve provided a legend for the fields that we care about.
The 0th byte of the page determines the page type - 0x0d
indicates a leaf page. Bytes 3-4 store the number of cells stored in the node. In the leaf page shown below it’s 123 cells. The entries are stored in reverse order, so we can see that user123
is stored first. The next two bytes store the offset at which the first entry starts - 254. Bytes 8-11 are only used by interior pages.
Once we get to the first entry, we see an array of rows. When I was implementing my version of SQLite from scratch, I also stored a pointer to the parent page in the leaf page to make the implementation a little simpler, since sometimes you need to access the parent node from a leaf.
Next, let’s take a look at the more exciting interior nodes.
Interior pages
So far, we know how to store data. What’s missing is a way to efficiently retrieve it. That is precisely the role of interior pages in a B+ tree. Let’s see how they achieve this.
Interior pages have the 0th byte set to 0x05
.
The header format is the same as for leaf nodes, so the number of cells is in bytes 3-4, here it’s 2 cells. The start of the cell array is indicated by bytes 5-6 as offset from the start of the page: 4082. The right child pointer is stored at bytes 8-11: 1216.
Let’s look at the array of cells and try to understand the cell format for interior pages.
The array starts at 0x1ff2
. Each cell is composed of an uint32 page pointer followed by a varint max primary key.
Interpreting the big-endian uint32 is straightforward: 0x00000267 = 615
.
Varint is an encoding that can represent up to 64-bit unsigned ints using 1 to 9 bytes but is more memory efficient for smaller integers.
To decode a varint, we need to look at each byte, starting with the most significant one, and write it in binary:
0x8493370
x84 = 10000100
0x93 = 10010011
0x37 = 00110111
When we reach a byte with the most significant bit set to 0, we have hit the last byte of the varint, which is how I knew that we only needed these 3 bytes. The stored value is then calculated by concatenating the 7 least significant bits:
0000100 | 0010011 | 0110111 = 68023 in decimal
Similarly for the second cell:
0x00000266 = 614
0x82962C = 0000010 | 0010110 | 0101100 = 35628
The values stored as varints indicate that all rows in the subtree pointed to by the page pointer have a primary key lower or equal to this key. Think of it again like a generalization of binary search trees.
So what does this mean? It means that if we are looking for a row with a primary key below or equal to 35628, we should visit page 614. If the key is greater than that but less than or equal to 68023, we should visit page 615. For anything larger, we should follow the right-most pointer to page 1216.
It’s giving us directions on how to find the row we are looking for!
Looking for user4242
Now that we understand interior and leaf pages, let’s look for the user with Id = 4242
to see how this works end to end.
In the previous section, we looked at the root interior node and found that page 614 has entries <= 35628 so let’s look there.
We need to go to page 614, which is at byte offset (614-1) * 4096 (0-indexed) in the mydb.db
file. It’s again an internal node as indicated by the 0th byte. I’ll just list the cell entries but notice that I can do some skipping. It turns out that SQLite keeps the cells in sorted order - we can use binary search!
Page_num, max_key
461, 0000010|0010101|1001101 = 35533
…
79,0110000|0111011 = 6203
…
60, 0100100|0100011 = 4643
…
58, 0100011|0111011 = 4539
57, 0100010|1010011 = 4435
56,0100001|1101011 = 4331
55, 0100001|0000011 = 4227
…
4, 123
User4242
should be on page 56. Let’s inspect it and woohoo - it’s a leaf node! We should be close!
Notice how the largest entered user is indeed user4331
as indicated by the max key stored in the parent interior node.
Let’s go find our user4242
. If we scroll a bit down, we finally see the row we wanted!
This is amazing - we were able to find our row by just visiting 3 pages: the root page, internal page 614, and leaf page 56. This is important because it means SQLite only had to read those 3 pages from the disk to find the row! Reading a specific page from a file can be implemented efficiently with fseek
and reading only PAGE_SIZE
bytes.
Disk IO tends to be the slowest part of a database, especially if using physical hard drives. If we didn’t have an index (interior nodes of the B+ tree), we would have to read all 1361 pages. Here we used a tiny database - just 100k rows, but this logarithmic scaling property becomes crucial as the database grows.
If you are wondering how I learned this, I’ve been implementing a subset of SQLite from scratch in Go. I started by following cstack’s SQLite from scratch blog series. For SQLite implementational details, the official SQLite documentation is fantastic.
I hope you enjoyed this deep dive into database internals! If you did, consider subscribing and/or following me on LinkedIn.