B-Tree vs LSM-Tree Storage
Every database stores data on disk through a storage engine. The two dominant paradigms are B-Tree based engines and LSM-Tree based engines. They make fundamentally different trade-offs between read performance, write performance, and space usage.
B-Tree Engines
B-Trees are balanced, self-adjusting tree structures where data is organized in fixed-size pages (typically 4–16 KB). Each page holds sorted keys and pointers to child pages. When a page overflows, it splits into two; when it becomes too sparse, it merges with a sibling. This keeps the tree balanced and lookups at O(log n).
Used by: InnoDB (MySQL), PostgreSQL, SQL Server, Oracle. B-Trees excel at point reads and range scans because data is already sorted on disk.
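The split-on-overflow behavior is easy to see in miniature. The sketch below is illustrative only: a fixed key count (`ORDER`) stands in for a byte-sized page limit, values and sibling merges are omitted, and the middle key simply moves up on a split.

```python
import bisect

ORDER = 4  # max keys per page; real engines bound pages by bytes (4-16 KB), not key count

class Page:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys in this page
        self.children = []   # child pages (len(keys) + 1 of them when internal)

class BTree:
    def __init__(self):
        self.root = Page()

    def search(self, key):
        node = self.root
        while True:  # single root-to-leaf traversal: O(log n)
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True
            if node.leaf:
                return False
            node = node.children[i]

    def insert(self, key):
        if len(self.root.keys) == ORDER:   # full root: grow the tree by one level
            new_root = Page(leaf=False)
            new_root.children.append(self.root)
            self._split(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)
            return
        i = bisect.bisect_left(node.keys, key)
        if len(node.children[i].keys) == ORDER:  # split a full child before descending
            self._split(node, i)
            if key > node.keys[i]:
                i += 1
        self._insert_nonfull(node.children[i], key)

    def _split(self, parent, i):
        """Split the overflowing page parent.children[i], promoting its middle key."""
        child = parent.children[i]
        mid = ORDER // 2
        right = Page(leaf=child.leaf)
        right.keys = child.keys[mid + 1:]
        promoted = child.keys[mid]
        child.keys = child.keys[:mid]
        if not child.leaf:
            right.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]
        parent.keys.insert(i, promoted)
        parent.children.insert(i + 1, right)

tree = BTree()
for k in [5, 1, 9, 3, 7, 2, 8]:
    tree.insert(k)
print(tree.search(7), tree.search(42))  # True False
```

Because every page stays between half-full and full, the tree's height grows only when the root itself splits, which is what keeps lookups logarithmic.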
LSM-Tree Engines
LSM-Trees (Log-Structured Merge Trees) buffer writes in an in-memory memtable. When the memtable reaches a threshold, it is flushed to disk as an immutable, sorted SSTable (Sorted String Table). Background compaction periodically merges SSTables to reclaim space and keep read performance manageable.
Used by: RocksDB, Cassandra, LevelDB, ScyllaDB. LSM-Trees excel at write-heavy workloads because writes are always sequential appends.
Size-tiered compaction (Cassandra default): SSTables of similar size are merged together. Simpler and faster for writes, but can have more space amplification and slower reads due to overlapping key ranges.
Leveled compaction (LevelDB/RocksDB default): SSTables are organized into levels, with non-overlapping key ranges within each level. Reads and space usage improve, at the cost of extra write amplification.
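The write path above can be sketched end to end. Everything here (the `LSMStore` name, the dict memtable, the full-merge `compact`) is an illustrative simplification; real engines use ordered memtables, binary search plus bloom filters on SSTables, and incremental compaction.

```python
class LSMStore:
    """Toy LSM engine: dict memtable flushed to immutable sorted runs ("SSTables")."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # real engines use a skip list or other ordered map
        self.sstables = []       # newest first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value            # write touches only memory (plus a WAL)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())   # one sequential write of a sorted run
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # newest data first...
            return self.memtable[key]
        for run in self.sstables:             # ...then SSTables, newest to oldest
            for k, v in run:                  # real engines binary-search + bloom-filter
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge every run into one, keeping only the newest version of each key."""
        newest = {}
        for run in reversed(self.sstables):   # oldest first, so newer runs overwrite
            newest.update(run)
        self.sstables = [sorted(newest.items())]

store = LSMStore(memtable_limit=2)
store.put("a", 1); store.put("b", 2)   # second put triggers a flush
store.put("a", 3); store.put("c", 4)   # newer "a" shadows the old version
print(store.get("a"))                  # 3
store.compact()                        # two overlapping runs collapse into one
```

Note how `get` may touch several runs before compaction but only one afterward; that is the read-amplification trade compaction is paying down.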
Read, Write & Space Amplification
No storage engine can minimize all three amplification factors simultaneously. Understanding these trade-offs is key to choosing the right engine for your workload.
| Amplification | B-Tree | LSM-Tree |
|---|---|---|
| Read | Low — single traversal from root to leaf | Higher — may check memtable + multiple SSTables (mitigated by bloom filters) |
| Write | Moderate — random I/O, write page + WAL | Higher — data written multiple times during compaction |
| Space | Lower — in-place updates, but page fragmentation | Higher — multiple copies exist until compaction |
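The LSM write-amplification row can be made concrete with a back-of-envelope estimate. The model and constants below are assumptions for illustration, not measurements.

```python
# Back-of-envelope write amplification for a leveled LSM (simplified model):
# each key is written once to the WAL, once at memtable flush, then rewritten
# roughly `fanout` times per level as compaction pushes it deeper.
fanout = 10      # size ratio between adjacent levels (RocksDB defaults to 10)
levels = 4       # assumed depth of the tree for this example
write_amplification = 1 + 1 + fanout * levels
print(write_amplification)   # 42: one logical write costs ~42 physical writes
```

The estimate grows linearly with tree depth, which is why compaction tuning (fanout, level count, strategy) matters so much on write-heavy workloads.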
WAL & Buffer Pool
Write-Ahead Log (WAL)
Before any change hits the actual data pages, it is first recorded in a sequential, append-only Write-Ahead Log. If the database crashes, it replays the WAL to recover committed transactions that hadn't been flushed to data files yet.
WAL provides the D (Durability) in ACID. Both B-Tree and LSM-Tree engines use WAL — PostgreSQL calls it WAL, MySQL/InnoDB calls it the redo log.
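A minimal sketch of the idea, assuming JSON-per-line records and a dict standing in for the data files (real WALs log page-level redo records and rotate log segments):

```python
import json, os, tempfile

class MiniWAL:
    """Append-only log: every change is made durable here before data pages move."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # on disk before the commit is acknowledged

    def replay(self):
        """Crash recovery: re-apply logged changes in order to rebuild state."""
        state = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    state[rec["key"]] = rec["value"]
        return state

wal = MiniWAL(os.path.join(tempfile.mkdtemp(), "wal.log"))
wal.append({"key": "balance", "value": 100})
wal.append({"key": "balance", "value": 90})
# Simulate a crash: in-memory state is gone, but the log survives.
print(wal.replay())   # {'balance': 90}
```

The `fsync` before acknowledging is the whole point: a write the client saw succeed can always be reconstructed from the log, even if the data pages never made it to disk.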
Buffer Pool
The buffer pool is a region of memory that caches frequently accessed disk pages. When a query needs data, the engine first checks the buffer pool. On a miss, it loads the page from disk into the pool (evicting a less-used page if needed).
Pages modified in memory but not yet written to disk are called dirty pages. A background process called checkpoint periodically flushes dirty pages to disk, which also advances the WAL recovery point — reducing crash recovery time.
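The hit/miss/evict/checkpoint cycle can be sketched with an ordinary LRU. This is illustrative only; production pools use refinements such as InnoDB's midpoint-insertion LRU or PostgreSQL's clock-sweep.

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU page cache with dirty-page tracking and a checkpoint flush."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk            # page_id -> data, stands in for the data files
        self.pages = OrderedDict()  # page_id -> data, LRU order (newest last)
        self.dirty = set()

    def get(self, page_id):
        if page_id in self.pages:               # hit: refresh LRU position
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        data = self.disk[page_id]               # miss: read the page from disk
        self._admit(page_id, data)
        return data

    def put(self, page_id, data):
        self._admit(page_id, data)
        self.dirty.add(page_id)                 # modified in memory only

    def _admit(self, page_id, data):
        self.pages[page_id] = data
        self.pages.move_to_end(page_id)
        while len(self.pages) > self.capacity:
            victim, vdata = self.pages.popitem(last=False)  # evict least-recently used
            if victim in self.dirty:            # dirty victim: write back first
                self.disk[victim] = vdata
                self.dirty.discard(victim)

    def checkpoint(self):
        """Flush all dirty pages; a real engine then advances the WAL recovery point."""
        for page_id in list(self.dirty):
            self.disk[page_id] = self.pages[page_id]
        self.dirty.clear()

disk = {1: "page-1", 2: "page-2", 3: "page-3"}   # stand-in data files
pool = BufferPool(capacity=2, disk=disk)
pool.put(1, "page-1'")       # dirty page, exists only in memory
pool.get(2); pool.get(3)     # capacity exceeded: page 1 is evicted and written back
print(disk[1])               # page-1'
```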
A common rule of thumb on a dedicated MySQL server is to set innodb_buffer_pool_size to 70–80% of available RAM. In PostgreSQL, shared_buffers is typically set to about 25% of RAM (the OS page cache handles the rest).
From SQL to Execution
A SQL query goes through several transformation stages before any data is touched. Understanding this pipeline helps explain why the same logical query can have wildly different performance characteristics.
Cost-Based Optimization
The optimizer generates multiple candidate physical plans and estimates the cost of each using table statistics: row counts, index cardinality, histogram distributions, and page counts. It picks the plan with the lowest estimated cost.
Cardinality estimation is the hardest part — an estimate off by even 10x can cause the optimizer to choose a terrible join order or skip an index entirely. This is why ANALYZE / UPDATE STATISTICS matters.
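A toy cost model shows how the estimate drives the choice, and how a wrong selectivity flips it. The constants loosely mirror PostgreSQL's defaults, but the formulas are a sketch, not the real planner.

```python
# Illustrative cost model; constants loosely mirror PostgreSQL's defaults
# (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01).
SEQ_PAGE_COST, RANDOM_PAGE_COST, CPU_TUPLE_COST = 1.0, 4.0, 0.01

def seq_scan_cost(pages, rows):
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(rows, selectivity):
    matching = rows * selectivity
    # pessimistic sketch: every matching row costs one random page read
    return matching * (RANDOM_PAGE_COST + CPU_TUPLE_COST)

def choose_plan(pages, rows, selectivity):
    if index_scan_cost(rows, selectivity) < seq_scan_cost(pages, rows):
        return "index scan"
    return "seq scan"

# 10,000 pages, 1M rows: a highly selective predicate favors the index.
print(choose_plan(10_000, 1_000_000, 0.0001))  # index scan
# But if stale statistics said 0.01% while the true selectivity is 10%,
# the "index plan" the optimizer kept would cost ~20x more than a seq scan.
print(choose_plan(10_000, 1_000_000, 0.10))    # seq scan
```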
Join Algorithms
Nested loop join: For each row in the outer table, scan the inner table for matches. Works well when the inner side has an index. O(n * m) worst case.
Hash join: Build a hash table from the smaller table, then probe it with the larger table. O(n + m). Great for large, unsorted equi-joins.
Sort-merge join: Both inputs must be sorted on the join key; the algorithm walks both sides in lockstep. O(n + m). Ideal when the data is already sorted or sorted indexes exist.
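Of the three, the hash join is the simplest to sketch. The version below is illustrative (in-memory only; real executors spill partitions to disk when the build side doesn't fit in memory).

```python
from collections import defaultdict

def hash_join(smaller, larger, key_small, key_large):
    """Build a hash table on the smaller input, probe it with the larger one."""
    table = defaultdict(list)
    for row in smaller:                      # build phase: O(n)
        table[row[key_small]].append(row)
    out = []
    for row in larger:                       # probe phase: O(m)
        for match in table.get(row[key_large], []):
            out.append({**match, **row})     # emit the combined row
    return out

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 10}, {"user_id": 1, "total": 5},
          {"user_id": 3, "total": 7}]
# Joins the two user-1 orders; the user_id=3 order has no match (inner join).
print(hash_join(users, orders, "id", "user_id"))
```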
Run ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) after large data changes so the optimizer's statistics stay current.
MVCC — Multi-Version Concurrency Control
MVCC allows readers and writers to operate concurrently without blocking each other. Instead of locking rows, the database keeps multiple versions of each row. Each transaction sees a consistent snapshot of the data as of its start time.
How It Works
When a row is updated, the database doesn't overwrite the old version. It creates a new version tagged with the writing transaction's ID. Readers see only versions committed before their snapshot, providing snapshot isolation.
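Visibility checking can be sketched in the PostgreSQL style, with xmin/xmax transaction IDs on each version. This simplification treats any committed txid at or below the snapshot as visible; the real check also consults the set of transactions in progress at snapshot time.

```python
class RowVersion:
    """A row version tagged with creating (xmin) and deleting (xmax) txn IDs,
    in the style of PostgreSQL tuple headers."""
    def __init__(self, value, xmin, xmax=None):
        self.value, self.xmin, self.xmax = value, xmin, xmax

def visible(v, snapshot_txid, committed):
    """Visible iff created by a transaction committed at/before our snapshot,
    and not deleted by such a transaction."""
    created = v.xmin in committed and v.xmin <= snapshot_txid
    deleted = v.xmax is not None and v.xmax in committed and v.xmax <= snapshot_txid
    return created and not deleted

committed = {100, 105}
versions = [RowVersion("v1", xmin=100, xmax=105),  # superseded by txn 105
            RowVersion("v2", xmin=105)]            # current version

print([v.value for v in versions if visible(v, 103, committed)])  # ['v1']
print([v.value for v in versions if visible(v, 110, committed)])  # ['v2']
```

The older snapshot keeps reading "v1" even after the update commits, which is exactly how readers avoid blocking on writers.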
| Aspect | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| Old versions stored in | Main heap (same table) | Separate undo log (rollback segment) |
| Cleanup mechanism | VACUUM removes dead tuples | Purge thread cleans undo log |
| Versioning approach | Tuple has xmin/xmax transaction IDs | Row has a hidden rollback pointer to undo log |
| Bloat risk | Table bloat if VACUUM falls behind | Undo log growth if long-running transactions |
Connection Pooling
Every database connection consumes memory (typically 5–10 MB per connection in PostgreSQL). Creating a new connection involves TCP handshake, TLS negotiation, and authentication — taking 20–50ms. Connection poolers sit between the application and the database, maintaining a pool of reusable connections.
PgBouncer: Lightweight, single-threaded pooler for PostgreSQL. Supports session-, transaction-, and statement-level pooling. Minimal overhead.
ProxySQL: Feature-rich MySQL proxy with connection pooling, query routing, read/write splitting, and query caching built in.
HikariCP: JVM-side connection pool known for low latency and high throughput. Default pool in Spring Boot. Manages connections at the application layer.
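Whatever the pooler, the core mechanic is the same: pay the connection cost once, then recycle. A minimal application-side sketch (the factory and the string "connections" are stand-ins for real TCP/TLS sessions):

```python
import queue

class ConnectionPool:
    """Minimal app-side pool: pre-open N connections, hand them out, take them back."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())        # pay the connection cost once, up front

    def acquire(self, timeout=None):
        return self._pool.get(timeout=timeout)  # blocks while all connections are out

    def release(self, conn):
        self._pool.put(conn)

# Stand-in "connection" factory; a real one would do TCP + TLS + authentication.
counter = iter(range(1000))
pool = ConnectionPool(lambda: f"conn-{next(counter)}", size=2)

c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()      # reuses c1 instead of opening a new connection
assert c3 == c1
```

The blocking `acquire` is also the back-pressure mechanism: when the pool is exhausted, callers queue up instead of piling new connections onto the database.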
Pooling Modes (PgBouncer)
| Mode | Connection Assigned | Trade-off |
|---|---|---|
| Session | For the entire client session | Safest, but least efficient — same as no pooling in many cases |
| Transaction | For one transaction at a time | Best balance — most common mode. Cannot use session-level features (PREPARE, LISTEN/NOTIFY) |
| Statement | For a single statement | Most aggressive reuse, but no multi-statement transactions allowed |
A common sizing rule of thumb: connections = (core_count * 2) + effective_spindle_count. For SSDs, start with core_count * 2 to core_count * 4. More connections does not mean more throughput — context switching kills performance beyond the sweet spot.
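Encoded as a helper, the formula gives a starting value to load-test from, not a target (the function name is illustrative):

```python
import os

def pool_size(core_count, effective_spindle_count=0):
    """Starting point only; validate with a load test before settling on it."""
    return core_count * 2 + effective_spindle_count

print(pool_size(os.cpu_count() or 4))           # e.g. 16 on an 8-core SSD box
print(pool_size(8, effective_spindle_count=2))  # 18 with two spinning disks
```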