Databases

Database Internals

How databases actually work under the hood — from storage engines and crash recovery to query optimization and connection pooling.

01 / Storage Engines

B-Tree vs LSM-Tree Storage

Every database stores data on disk through a storage engine. The two dominant paradigms are B-Tree based engines and LSM-Tree based engines. They make fundamentally different trade-offs between read performance, write performance, and space usage.

B-Tree Engines

B-Trees are self-balancing tree structures where data is organized in fixed-size pages (typically 4–16 KB). Each page holds sorted keys and pointers to child pages. When a page overflows, it splits into two; when it becomes too sparse, it merges with a sibling. This keeps the tree balanced and lookups at O(log n).

Used by: InnoDB (MySQL), PostgreSQL, SQL Server, Oracle. B-Trees excel at point reads and range scans because data is already sorted on disk.
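A point lookup that descends fixed-size pages can be sketched in a few lines. This is a simplification under assumed names (the `Page` class and its fields are illustrative; real pages pack keys and pointers into an on-disk byte format):

```python
from bisect import bisect_right

class Page:
    """Simplified B-Tree page: sorted keys plus child pointers or values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys within this page
        self.children = children    # child pages (internal node) or None
        self.values = values        # row values (leaf node) or None

def lookup(page, key):
    """Descend root-to-leaf; one binary search per page, O(log n) overall."""
    while page.children is not None:
        # pick the child whose key range covers `key`
        page = page.children[bisect_right(page.keys, key)]
    i = bisect_right(page.keys, key) - 1
    if i >= 0 and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page([10, 20], values=["a", "b"])
leaf2 = Page([30, 40], values=["c", "d"])
root = Page([30], children=[leaf1, leaf2])
print(lookup(root, 20))  # b
```

Range scans follow the same descent to the first matching leaf, then walk sibling leaves in key order.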

LSM-Tree Engines

LSM-Trees (Log-Structured Merge Trees) buffer writes in an in-memory memtable. When the memtable reaches a threshold, it is flushed to disk as an immutable, sorted SSTable (Sorted String Table). Background compaction periodically merges SSTables to reclaim space and keep read performance manageable.

Used by: RocksDB, Cassandra, LevelDB, ScyllaDB. LSM-Trees excel at write-heavy workloads because writes are always sequential appends.

B-Tree Write Path: Find leaf page → Update in-place → Split if full
LSM-Tree Write Path: Write to memtable → Flush to SSTable → Background compaction
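The LSM write path can be sketched as follows. This is a toy model under stated assumptions: the memtable is a plain dict and the flush threshold is tiny purely for illustration; a real engine uses a skiplist plus a WAL and writes SSTables to disk files:

```python
import bisect

MEMTABLE_LIMIT = 3   # flush threshold; tiny purely for illustration

memtable = {}        # in-memory buffer (a real engine uses a skiplist + WAL)
sstables = []        # immutable sorted runs, newest first

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # flush: sort once, write an immutable SSTable, reset the memtable
        sstables.insert(0, sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:                       # newest data wins
        return memtable[key]
    for run in sstables:                      # newest-to-oldest; bloom filters
        i = bisect.bisect_left(run, (key,))   # would let us skip most runs
        if i < len(run) and run[i][0] == key:
            return run[i][1]
    return None

put("a", 1); put("b", 2); put("c", 3)         # third put triggers a flush
print(get("b"))  # 2
```

Note how every flush is a sequential write of an already-sorted run; no page in place is ever modified.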
Compaction Strategies
Leveled compaction (LevelDB, RocksDB default): SSTables are organized into levels of exponentially increasing size. Each level has no overlapping key ranges, giving better read performance but higher write amplification.
Size-tiered compaction (Cassandra default): SSTables of similar size are merged together. Simpler and faster for writes, but can have more space amplification and slower reads due to overlapping key ranges.
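Whatever the strategy, the core of compaction is merging sorted runs so that the newest version of each key wins. A minimal sketch (runs ordered newest-first; names illustrative):

```python
def compact(runs):
    """Merge sorted runs into one sorted run; for a duplicated key the
    newest version wins. `runs` is ordered newest-first."""
    merged = {}
    for run in reversed(runs):      # apply oldest first, so newer overwrites
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())

old = [("a", 1), ("b", 1)]
new = [("a", 2)]
print(compact([new, old]))  # [('a', 2), ('b', 1)]
```

A production engine streams this merge instead of materializing a dict, and also drops tombstones (deletion markers) once no older run can contain the key.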
02 / Amplification Trade-offs

Read, Write & Space Amplification

No storage engine can minimize all three amplification factors simultaneously. Understanding these trade-offs is key to choosing the right engine for your workload.

Amplification | B-Tree                                           | LSM-Tree
Read          | Low — single traversal from root to leaf         | Higher — may check memtable + multiple SSTables (mitigated by bloom filters)
Write         | Moderate — random I/O, write page + WAL          | Higher — data written multiple times during compaction
Space         | Lower — in-place updates, but page fragmentation | Higher — multiple copies exist until compaction
Rule of Thumb
B-Trees favor read-heavy, transactional workloads (OLTP). LSM-Trees favor write-heavy, append-mostly workloads (time-series, logging, messaging). Hybrid approaches like TiDB use both.
03 / Durability & Recovery

WAL & Buffer Pool

Write-Ahead Log (WAL)

Before any change hits the actual data pages, it is first recorded in a sequential, append-only Write-Ahead Log. If the database crashes, it replays the WAL to recover committed transactions that hadn't been flushed to data files yet.

WAL provides the D (Durability) in ACID. Both B-Tree and LSM-Tree engines use WAL — PostgreSQL calls it WAL, MySQL/InnoDB calls it the redo log.

Crash Recovery Flow: Crash → Read WAL from last checkpoint → Redo committed txns → Undo uncommitted txns
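The redo pass of recovery can be sketched as follows. The log format here is invented for the example, and this sketch only redoes committed work; real recovery (ARIES-style) also runs an undo pass, because dirty pages from uncommitted transactions may already have reached disk:

```python
# Each record: (txn_id, op, payload). Illustrative format, not any real engine's.
wal = [
    (1, "begin",  None),
    (1, "write",  ("x", 10)),
    (1, "commit", None),
    (2, "begin",  None),
    (2, "write",  ("y", 20)),   # txn 2 never committed before the crash
]

def recover(log):
    """Replay the WAL: apply only writes belonging to committed txns."""
    committed = {txn for txn, op, _ in log if op == "commit"}
    pages = {}
    for txn, op, payload in log:
        if op == "write" and txn in committed:   # redo committed work
            key, value = payload
            pages[key] = value
        # uncommitted writes are simply never applied
    return pages

print(recover(wal))  # {'x': 10}
```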

Buffer Pool

The buffer pool is a region of memory that caches frequently accessed disk pages. When a query needs data, the engine first checks the buffer pool. On a miss, it loads the page from disk into the pool (evicting a less-used page if needed).

Pages modified in memory but not yet written to disk are called dirty pages. A background process called checkpoint periodically flushes dirty pages to disk, which also advances the WAL recovery point — reducing crash recovery time.
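A buffer pool with LRU eviction can be sketched with an `OrderedDict`. This is a toy under assumed names (`read_page_from_disk` stands in for real I/O, and real pools use more sophisticated policies such as InnoDB's midpoint LRU or PostgreSQL's clock sweep):

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU page cache sketch; eviction policy is plain LRU."""
    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page = read_page_from_disk
        self.pages = OrderedDict()            # page_id -> [data, dirty_flag]

    def get(self, page_id):
        if page_id in self.pages:             # hit: bump to most-recent
            self.pages.move_to_end(page_id)
            return self.pages[page_id][0]
        if len(self.pages) >= self.capacity:  # miss: evict the LRU page
            _, (_, dirty) = self.pages.popitem(last=False)
            # a real pool would write the evicted page back to disk if dirty
        data = self.read_page(page_id)
        self.pages[page_id] = [data, False]
        return data

    def mark_dirty(self, page_id):
        self.pages[page_id][1] = True         # flushed later by checkpoint

pool = BufferPool(2, lambda pid: f"page-{pid}")
pool.get(1); pool.get(2); pool.get(1)         # page 2 is now least recent
pool.get(3)                                   # evicts page 2
```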

Practical Tip
In InnoDB, set innodb_buffer_pool_size to 70–80% of available RAM. In PostgreSQL, shared_buffers is typically set to 25% of RAM (the OS page cache handles the rest).
04 / Query Processing

From SQL to Execution

A SQL query goes through several transformation stages before any data is touched. Understanding this pipeline helps explain why the same logical query can have wildly different performance characteristics.

Query Processing Pipeline: SQL Text → Parse & Validate → Logical Plan → Optimize → Physical Plan → Execute

Cost-Based Optimization

The optimizer generates multiple candidate physical plans and estimates the cost of each using table statistics: row counts, index cardinality, histogram distributions, and page counts. It picks the plan with the lowest estimated cost.

Cardinality estimation is the hardest part — getting it wrong by even 10x can cause the optimizer to choose a terrible join order or skip an index entirely. This is why ANALYZE / UPDATE STATISTICS matters.
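To see why estimates matter, here is a toy cost model, loosely inspired by PostgreSQL's `seq_page_cost`/`random_page_cost` idea; all constants and function names are illustrative:

```python
def seq_scan_cost(pages, rows, cpu_per_row=0.01):
    # read every page sequentially, plus per-row CPU work
    return pages + rows * cpu_per_row

def index_scan_cost(matching_rows, random_page_cost=4.0):
    # worst case: one random heap fetch per matching row
    return matching_rows * random_page_cost

# Table: 1,000 pages, 100,000 rows.
print(seq_scan_cost(1000, 100_000))   # 2000.0
print(index_scan_cost(100))           # 400.0  -- good estimate: index wins
print(index_scan_cost(10_000))        # 40000.0 -- stats off by 100x: the
                                      # "cheap" index plan is far worse
```

With an accurate estimate of 100 matching rows the optimizer correctly picks the index; if stale stats hide the fact that 10,000 rows match, the same choice is 20x worse than a sequential scan.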

Join Algorithms

Nested Loop Join

For each row in the outer table, scan the inner table. Works well when the inner side has an index. O(n * m) worst case.

Hash Join

Build a hash table from the smaller table, then probe it with the larger table. O(n + m). Great for large, unsorted equi-joins.
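The build/probe structure fits in a few lines (column names here are invented for the example):

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: hash the smaller input, probe it with the larger one."""
    table = defaultdict(list)
    for row in build_rows:                        # build phase: O(n)
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                        # probe phase: O(m)
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

users  = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"}]
rows = hash_join(users, orders, "id", "user_id")  # both of ada's orders join
```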

Merge Join

Both inputs must be sorted on the join key. Walks both sides in lockstep. O(n + m). Ideal when data is already sorted or indexes exist.
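The lockstep walk is a two-pointer merge; a sketch under the assumption that both inputs arrive sorted on the join key (duplicate keys on the right are handled by re-scanning the equal-key group):

```python
def merge_join(left, right, key):
    """Both inputs must be sorted on `key`; walk them in lockstep, O(n + m)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit the cross product of the equal-key groups
            j2 = j
            while j2 < len(right) and right[j2][key] == lk:
                out.append({**left[i], **right[j2]})
                j2 += 1
            i += 1
    return out
```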

Key Insight
The optimizer is only as good as its statistics. Stale stats lead to bad plans. Run ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) after large data changes.
05 / Concurrency Control

MVCC — Multi-Version Concurrency Control

MVCC allows readers and writers to operate concurrently without blocking each other. Instead of locking rows, the database keeps multiple versions of each row. Each transaction sees a consistent snapshot of the data as of its start time.

How It Works

When a row is updated, the database doesn't overwrite the old version. It creates a new version tagged with the writing transaction's ID. Readers see only versions committed before their snapshot, providing snapshot isolation.
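The visibility check can be sketched with PostgreSQL-style xmin/xmax markers. This is heavily simplified; real visibility rules also account for in-progress and aborted transactions, subtransactions, and hint bits:

```python
def visible(version, snapshot_xid, committed):
    """A version is visible if its creator committed before our snapshot,
    and no committed txn from before our snapshot has invalidated it."""
    created = version["xmin"] in committed and version["xmin"] <= snapshot_xid
    deleted = (version["xmax"] is not None
               and version["xmax"] in committed
               and version["xmax"] <= snapshot_xid)
    return created and not deleted

versions = [
    {"value": "v1", "xmin": 100, "xmax": 105},   # superseded by txn 105
    {"value": "v2", "xmin": 105, "xmax": None},  # current version
]
committed = {100, 105}

# A snapshot taken between the two txns still sees the old version:
old_snapshot = [v["value"] for v in versions if visible(v, 102, committed)]
new_snapshot = [v["value"] for v in versions if visible(v, 110, committed)]
print(old_snapshot, new_snapshot)  # ['v1'] ['v2']
```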

Aspect                 | PostgreSQL                                | MySQL (InnoDB)
Old versions stored in | Main heap (same table)                    | Separate undo log (rollback segment)
Cleanup mechanism      | VACUUM removes dead tuples                | Purge thread cleans undo log
Versioning approach    | Tuple has xmin/xmax transaction IDs       | Row has a hidden rollback pointer to undo log
Bloat risk             | Table bloat if VACUUM falls behind        | Undo log growth if long-running transactions
Watch Out
Long-running transactions are the enemy of MVCC. In PostgreSQL, they prevent VACUUM from reclaiming dead tuples, causing table bloat. In MySQL, they prevent undo log purging, growing the undo tablespace. Always keep transactions short.
06 / Connection Management

Connection Pooling

Every database connection consumes memory (typically 5–10 MB per connection in PostgreSQL). Creating a new connection involves TCP handshake, TLS negotiation, and authentication — taking 20–50ms. Connection poolers sit between the application and the database, maintaining a pool of reusable connections.
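An application-side pool is essentially a thread-safe queue of pre-opened connections. A minimal sketch, where `connect` is any connection factory you supply (e.g. `psycopg2.connect` with arguments bound via `functools.partial`):

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Minimal application-side pool sketch; names are illustrative."""
    def __init__(self, connect, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):             # pay the setup cost once, up front
            self._pool.put(connect())

    @contextmanager
    def acquire(self, timeout=None):
        conn = self._pool.get(timeout=timeout)   # blocks when exhausted
        try:
            yield conn
        finally:
            self._pool.put(conn)                 # always return it

pool = ConnectionPool(connect=lambda: object(), size=2)
with pool.acquire() as conn:
    pass  # run queries on `conn` here
```

Real pools like HikariCP add health checks, idle timeouts, and leak detection on top of this core shape.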

PgBouncer

Lightweight, single-threaded pooler for PostgreSQL. Supports transaction-level and statement-level pooling. Minimal overhead.

ProxySQL

Feature-rich MySQL proxy with connection pooling, query routing, read/write splitting, and query caching built in.

HikariCP

JVM-side connection pool known for low latency and high throughput. Default pool in Spring Boot. Manages connections at the application layer.

Pooling Modes (PgBouncer)

Mode        | Connection Assigned            | Trade-off
Session     | For the entire client session  | Safest, but least efficient — same as no pooling in many cases
Transaction | For one transaction at a time  | Best balance — most common mode. Cannot use session-level features (PREPARE, LISTEN/NOTIFY)
Statement   | For a single statement         | Most aggressive reuse, but no multi-statement transactions allowed
Sizing Rule
A good starting formula for pool size: connections = (core_count * 2) + effective_spindle_count. For SSDs, start with core_count * 2 to core_count * 4. More connections does not mean more throughput — context switching kills performance beyond the sweet spot.
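As a quick sanity check on the formula (the function name is ours; treat the result as a starting point for load testing, not a final answer):

```python
import os

def pool_size(core_count, effective_spindle_count=0):
    """Starting-point heuristic: (cores * 2) + effective spindles."""
    return core_count * 2 + effective_spindle_count

cores = os.cpu_count() or 4
print(pool_size(cores, effective_spindle_count=1))
# e.g. 8 cores and 1 effective spindle -> start from 17 connections
```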

Test Yourself

Question 01
Which storage engine design is optimized for write-heavy workloads?
LSM-Trees buffer all writes in memory (memtable) and flush sequentially to disk. This converts random writes into sequential I/O, which is much faster on both HDDs and SSDs.
Question 02
In an LSM-Tree, what is the correct order of the write path?
Writes go to the in-memory memtable first. When it reaches a size threshold, it's flushed as an immutable SSTable on disk. Background compaction then merges SSTables to reduce read amplification and reclaim space.
Question 03
What is the primary purpose of the Write-Ahead Log (WAL)?
The WAL records all changes before they are applied to data pages. If the database crashes, it replays the WAL to recover committed transactions, providing the Durability guarantee in ACID.
Question 04
What are "dirty pages" in the context of a buffer pool?
Dirty pages are in-memory pages that have been modified but not yet written back to disk. The checkpoint process periodically flushes dirty pages to persistent storage.
Question 05
Which join algorithm builds a hash table from the smaller input and probes it with the larger?
Hash Join builds a hash table from the smaller (build) side and then scans the larger (probe) side, looking up each row in the hash table. It's O(n + m) and ideal for large equi-joins without sorted input.
Question 06
In PostgreSQL's MVCC implementation, where are old row versions stored?
PostgreSQL stores old tuple versions directly in the main heap. Dead tuples are removed by the VACUUM process. This differs from InnoDB, which stores old versions in a separate undo/rollback segment.
Question 07
Why is cardinality estimation important in cost-based optimization?
The optimizer uses cardinality estimates to predict how many rows each operation will produce. If these estimates are off by orders of magnitude, the optimizer may choose the wrong join algorithm, skip a useful index, or pick a terrible join order.
Question 08
Which PgBouncer pooling mode offers the best balance of safety and efficiency for most applications?
Transaction mode assigns a server connection for the duration of a single transaction, then returns it to the pool. This gives good multiplexing while still supporting multi-statement transactions. Session mode is too conservative; statement mode is too restrictive.
Question 09
What problem do long-running transactions cause in PostgreSQL's MVCC?
VACUUM can only remove tuple versions that are no longer visible to any active transaction. A long-running transaction holds an old snapshot, preventing cleanup of dead tuples created after it started, leading to table bloat.
Question 10
Which compaction strategy organizes SSTables into levels with non-overlapping key ranges?
Leveled compaction (used by LevelDB and RocksDB by default) arranges SSTables into levels where each level has non-overlapping key ranges. This ensures at most one SSTable per level needs to be checked for a given key, improving read performance at the cost of higher write amplification from more frequent compaction.