B-Tree vs LSM-Tree Storage
Every database stores data on disk through a storage engine. The two dominant paradigms are B-Tree based engines and LSM-Tree based engines. They make fundamentally different trade-offs between read performance, write performance, and space usage.
B-Tree Engines
B-Trees are balanced, self-adjusting tree structures where data is organized in fixed-size pages (typically 4–16 KB). Each page holds sorted keys and pointers to child pages. When a page overflows, it splits into two; when it becomes too sparse, it merges with a sibling. This keeps the tree balanced and lookups at O(log n).
Used by: InnoDB (MySQL), PostgreSQL, SQL Server, Oracle. B-Trees excel at point reads and range scans because data is already sorted on disk.
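The split-on-overflow behavior is easy to see in miniature. The sketch below is illustrative only: a fixed key count (`ORDER`) stands in for a byte-sized page limit, values and sibling merges are omitted, and the middle key simply moves up on a split.

```python
import bisect

ORDER = 4  # max keys per page; real engines bound pages by bytes (4-16 KB), not key count

class Page:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys in this page
        self.children = []   # child pages (len(keys) + 1 of them when internal)

class BTree:
    def __init__(self):
        self.root = Page()

    def search(self, key):
        node = self.root
        while True:  # single root-to-leaf traversal: O(log n)
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True
            if node.leaf:
                return False
            node = node.children[i]

    def insert(self, key):
        if len(self.root.keys) == ORDER:   # full root: grow the tree by one level
            new_root = Page(leaf=False)
            new_root.children.append(self.root)
            self._split(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)
            return
        i = bisect.bisect_left(node.keys, key)
        if len(node.children[i].keys) == ORDER:  # split a full child before descending
            self._split(node, i)
            if key > node.keys[i]:
                i += 1
        self._insert_nonfull(node.children[i], key)

    def _split(self, parent, i):
        """Split the overflowing page parent.children[i], promoting its middle key."""
        child = parent.children[i]
        mid = ORDER // 2
        right = Page(leaf=child.leaf)
        right.keys = child.keys[mid + 1:]
        promoted = child.keys[mid]
        child.keys = child.keys[:mid]
        if not child.leaf:
            right.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]
        parent.keys.insert(i, promoted)
        parent.children.insert(i + 1, right)

tree = BTree()
for k in [5, 1, 9, 3, 7, 2, 8]:
    tree.insert(k)
print(tree.search(7), tree.search(42))  # True False
```

Because every page stays between half-full and full, the tree's height grows only when the root itself splits, which is what keeps lookups logarithmic.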
LSM-Tree Engines
LSM-Trees (Log-Structured Merge Trees) buffer writes in an in-memory memtable. When the memtable reaches a threshold, it is flushed to disk as an immutable, sorted SSTable (Sorted String Table). Background compaction periodically merges SSTables to reclaim space and keep read performance manageable.
Used by: RocksDB, Cassandra, LevelDB, ScyllaDB. LSM-Trees excel at write-heavy workloads because writes are always sequential appends.
Size-tiered compaction (Cassandra default): SSTables of similar size are merged together. Simpler and faster for writes, but can have more space amplification and slower reads due to overlapping key ranges.
Leveled compaction (LevelDB/RocksDB default): SSTables are organized into levels, with non-overlapping key ranges within each level. Reads and space usage improve, at the cost of extra write amplification.
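The write path above can be sketched end to end. Everything here (the `LSMStore` name, the dict memtable, the full-merge `compact`) is an illustrative simplification; real engines use ordered memtables, binary search plus bloom filters on SSTables, and incremental compaction.

```python
class LSMStore:
    """Toy LSM engine: dict memtable flushed to immutable sorted runs ("SSTables")."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # real engines use a skip list or other ordered map
        self.sstables = []       # newest first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value            # write touches only memory (plus a WAL)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())   # one sequential write of a sorted run
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # newest data first...
            return self.memtable[key]
        for run in self.sstables:             # ...then SSTables, newest to oldest
            for k, v in run:                  # real engines binary-search + bloom-filter
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge every run into one, keeping only the newest version of each key."""
        newest = {}
        for run in reversed(self.sstables):   # oldest first, so newer runs overwrite
            newest.update(run)
        self.sstables = [sorted(newest.items())]

store = LSMStore(memtable_limit=2)
store.put("a", 1); store.put("b", 2)   # second put triggers a flush
store.put("a", 3); store.put("c", 4)   # newer "a" shadows the old version
print(store.get("a"))                  # 3
store.compact()                        # two overlapping runs collapse into one
```

Note how `get` may touch several runs before compaction but only one afterward; that is the read-amplification trade compaction is paying down.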
Read, Write & Space Amplification
No storage engine can minimize all three amplification factors simultaneously. Understanding these trade-offs is key to choosing the right engine for your workload.
| Amplification | B-Tree | LSM-Tree |
|---|---|---|
| Read | Low — single traversal from root to leaf | Higher — may check memtable + multiple SSTables (mitigated by bloom filters) |
| Write | Moderate — random I/O, write page + WAL | Higher — data written multiple times during compaction |
| Space | Lower — in-place updates, but page fragmentation | Higher — multiple copies exist until compaction |
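The LSM write-amplification row can be made concrete with a back-of-envelope estimate. The model and constants below are assumptions for illustration, not measurements.

```python
# Back-of-envelope write amplification for a leveled LSM (simplified model):
# each key is written once to the WAL, once at memtable flush, then rewritten
# roughly `fanout` times per level as compaction pushes it deeper.
fanout = 10      # size ratio between adjacent levels (RocksDB defaults to 10)
levels = 4       # assumed depth of the tree for this example
write_amplification = 1 + 1 + fanout * levels
print(write_amplification)   # 42: one logical write costs ~42 physical writes
```

The estimate grows linearly with tree depth, which is why compaction tuning (fanout, level count, strategy) matters so much on write-heavy workloads.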
WAL & Buffer Pool
Write-Ahead Log (WAL)
Before any change hits the actual data pages, it is first recorded in a sequential, append-only Write-Ahead Log. If the database crashes, it replays the WAL to recover committed transactions that hadn't been flushed to data files yet.
WAL provides the D (Durability) in ACID. Both B-Tree and LSM-Tree engines use WAL — PostgreSQL calls it WAL, MySQL/InnoDB calls it the redo log.
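A minimal sketch of the idea, assuming JSON-per-line records and a dict standing in for the data files (real WALs log page-level redo records and rotate log segments):

```python
import json, os, tempfile

class MiniWAL:
    """Append-only log: every change is made durable here before data pages move."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # on disk before the commit is acknowledged

    def replay(self):
        """Crash recovery: re-apply logged changes in order to rebuild state."""
        state = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    state[rec["key"]] = rec["value"]
        return state

wal = MiniWAL(os.path.join(tempfile.mkdtemp(), "wal.log"))
wal.append({"key": "balance", "value": 100})
wal.append({"key": "balance", "value": 90})
# Simulate a crash: in-memory state is gone, but the log survives.
print(wal.replay())   # {'balance': 90}
```

The `fsync` before acknowledging is the whole point: a write the client saw succeed can always be reconstructed from the log, even if the data pages never made it to disk.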
Buffer Pool
The buffer pool is a region of memory that caches frequently accessed disk pages. When a query needs data, the engine first checks the buffer pool. On a miss, it loads the page from disk into the pool (evicting a less-used page if needed).
Pages modified in memory but not yet written to disk are called dirty pages. A background process called checkpoint periodically flushes dirty pages to disk, which also advances the WAL recovery point — reducing crash recovery time.
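The hit/miss/evict/checkpoint cycle can be sketched with an ordinary LRU. This is illustrative only; production pools use refinements such as InnoDB's midpoint-insertion LRU or PostgreSQL's clock-sweep.

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU page cache with dirty-page tracking and a checkpoint flush."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk            # page_id -> data, stands in for the data files
        self.pages = OrderedDict()  # page_id -> data, LRU order (newest last)
        self.dirty = set()

    def get(self, page_id):
        if page_id in self.pages:               # hit: refresh LRU position
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        data = self.disk[page_id]               # miss: read the page from disk
        self._admit(page_id, data)
        return data

    def put(self, page_id, data):
        self._admit(page_id, data)
        self.dirty.add(page_id)                 # modified in memory only

    def _admit(self, page_id, data):
        self.pages[page_id] = data
        self.pages.move_to_end(page_id)
        while len(self.pages) > self.capacity:
            victim, vdata = self.pages.popitem(last=False)  # evict least-recently used
            if victim in self.dirty:            # dirty victim: write back first
                self.disk[victim] = vdata
                self.dirty.discard(victim)

    def checkpoint(self):
        """Flush all dirty pages; a real engine then advances the WAL recovery point."""
        for page_id in list(self.dirty):
            self.disk[page_id] = self.pages[page_id]
        self.dirty.clear()

disk = {1: "page-1", 2: "page-2", 3: "page-3"}   # stand-in data files
pool = BufferPool(capacity=2, disk=disk)
pool.put(1, "page-1'")       # dirty page, exists only in memory
pool.get(2); pool.get(3)     # capacity exceeded: page 1 is evicted and written back
print(disk[1])               # page-1'
```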
A common rule of thumb on a dedicated MySQL server is to set innodb_buffer_pool_size to 70–80% of available RAM. In PostgreSQL, shared_buffers is typically set to about 25% of RAM (the OS page cache handles the rest).
From SQL to Execution
A SQL query goes through several transformation stages before any data is touched. Understanding this pipeline helps explain why the same logical query can have wildly different performance characteristics.
Cost-Based Optimization
The optimizer generates multiple candidate physical plans and estimates the cost of each using table statistics: row counts, index cardinality, histogram distributions, and page counts. It picks the plan with the lowest estimated cost.
Cardinality estimation is the hardest part — an estimate off by even 10x can cause the optimizer to choose a terrible join order or skip an index entirely. This is why ANALYZE / UPDATE STATISTICS matters.
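A toy cost model shows how the estimate drives the choice, and how a wrong selectivity flips it. The constants loosely mirror PostgreSQL's defaults, but the formulas are a sketch, not the real planner.

```python
# Illustrative cost model; constants loosely mirror PostgreSQL's defaults
# (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01).
SEQ_PAGE_COST, RANDOM_PAGE_COST, CPU_TUPLE_COST = 1.0, 4.0, 0.01

def seq_scan_cost(pages, rows):
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(rows, selectivity):
    matching = rows * selectivity
    # pessimistic sketch: every matching row costs one random page read
    return matching * (RANDOM_PAGE_COST + CPU_TUPLE_COST)

def choose_plan(pages, rows, selectivity):
    if index_scan_cost(rows, selectivity) < seq_scan_cost(pages, rows):
        return "index scan"
    return "seq scan"

# 10,000 pages, 1M rows: a highly selective predicate favors the index.
print(choose_plan(10_000, 1_000_000, 0.0001))  # index scan
# But if stale statistics said 0.01% while the true selectivity is 10%,
# the "index plan" the optimizer kept would cost ~20x more than a seq scan.
print(choose_plan(10_000, 1_000_000, 0.10))    # seq scan
```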
Join Algorithms
Nested loop join: For each row in the outer table, scan the inner table for matches. Works well when the inner side has an index. O(n * m) worst case.
Hash join: Build a hash table from the smaller table, then probe it with the larger table. O(n + m). Great for large, unsorted equi-joins.
Sort-merge join: Both inputs must be sorted on the join key; the algorithm walks both sides in lockstep. O(n + m). Ideal when the data is already sorted or sorted indexes exist.
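Of the three, the hash join is the simplest to sketch. The version below is illustrative (in-memory only; real executors spill partitions to disk when the build side doesn't fit in memory).

```python
from collections import defaultdict

def hash_join(smaller, larger, key_small, key_large):
    """Build a hash table on the smaller input, probe it with the larger one."""
    table = defaultdict(list)
    for row in smaller:                      # build phase: O(n)
        table[row[key_small]].append(row)
    out = []
    for row in larger:                       # probe phase: O(m)
        for match in table.get(row[key_large], []):
            out.append({**match, **row})     # emit the combined row
    return out

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 10}, {"user_id": 1, "total": 5},
          {"user_id": 3, "total": 7}]
# Joins the two user-1 orders; the user_id=3 order has no match (inner join).
print(hash_join(users, orders, "id", "user_id"))
```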
Run ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) after large data changes so the optimizer's statistics stay current.
MVCC — Multi-Version Concurrency Control
MVCC allows readers and writers to operate concurrently without blocking each other. Instead of locking rows, the database keeps multiple versions of each row. Each transaction sees a consistent snapshot of the data as of its start time.
How It Works
When a row is updated, the database doesn't overwrite the old version. It creates a new version tagged with the writing transaction's ID. Readers see only versions committed before their snapshot, providing snapshot isolation.
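Visibility checking can be sketched in the PostgreSQL style, with xmin/xmax transaction IDs on each version. This simplification treats any committed txid at or below the snapshot as visible; the real check also consults the set of transactions in progress at snapshot time.

```python
class RowVersion:
    """A row version tagged with creating (xmin) and deleting (xmax) txn IDs,
    in the style of PostgreSQL tuple headers."""
    def __init__(self, value, xmin, xmax=None):
        self.value, self.xmin, self.xmax = value, xmin, xmax

def visible(v, snapshot_txid, committed):
    """Visible iff created by a transaction committed at/before our snapshot,
    and not deleted by such a transaction."""
    created = v.xmin in committed and v.xmin <= snapshot_txid
    deleted = v.xmax is not None and v.xmax in committed and v.xmax <= snapshot_txid
    return created and not deleted

committed = {100, 105}
versions = [RowVersion("v1", xmin=100, xmax=105),  # superseded by txn 105
            RowVersion("v2", xmin=105)]            # current version

print([v.value for v in versions if visible(v, 103, committed)])  # ['v1']
print([v.value for v in versions if visible(v, 110, committed)])  # ['v2']
```

The older snapshot keeps reading "v1" even after the update commits, which is exactly how readers avoid blocking on writers.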
| Aspect | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| Old versions stored in | Main heap (same table) | Separate undo log (rollback segment) |
| Cleanup mechanism | VACUUM removes dead tuples | Purge thread cleans undo log |
| Versioning approach | Tuple has xmin/xmax transaction IDs | Row has a hidden rollback pointer to undo log |
| Bloat risk | Table bloat if VACUUM falls behind | Undo log growth if long-running transactions |
Connection Pooling
Every database connection consumes memory (typically 5–10 MB per connection in PostgreSQL). Creating a new connection involves TCP handshake, TLS negotiation, and authentication — taking 20–50ms. Connection poolers sit between the application and the database, maintaining a pool of reusable connections.
PgBouncer: Lightweight, single-threaded pooler for PostgreSQL. Supports session-, transaction-, and statement-level pooling. Minimal overhead.
ProxySQL: Feature-rich MySQL proxy with connection pooling, query routing, read/write splitting, and query caching built in.
HikariCP: JVM-side connection pool known for low latency and high throughput. Default pool in Spring Boot. Manages connections at the application layer.
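Whatever the pooler, the core mechanic is the same: pay the connection cost once, then recycle. A minimal application-side sketch (the factory and the string "connections" are stand-ins for real TCP/TLS sessions):

```python
import queue

class ConnectionPool:
    """Minimal app-side pool: pre-open N connections, hand them out, take them back."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())        # pay the connection cost once, up front

    def acquire(self, timeout=None):
        return self._pool.get(timeout=timeout)  # blocks while all connections are out

    def release(self, conn):
        self._pool.put(conn)

# Stand-in "connection" factory; a real one would do TCP + TLS + authentication.
counter = iter(range(1000))
pool = ConnectionPool(lambda: f"conn-{next(counter)}", size=2)

c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()      # reuses c1 instead of opening a new connection
assert c3 == c1
```

The blocking `acquire` is also the back-pressure mechanism: when the pool is exhausted, callers queue up instead of piling new connections onto the database.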
Pooling Modes (PgBouncer)
| Mode | Connection Assigned | Trade-off |
|---|---|---|
| Session | For the entire client session | Safest, but least efficient — same as no pooling in many cases |
| Transaction | For one transaction at a time | Best balance — most common mode. Cannot use session-level features (PREPARE, LISTEN/NOTIFY) |
| Statement | For a single statement | Most aggressive reuse, but no multi-statement transactions allowed |
A common sizing rule of thumb: connections = (core_count * 2) + effective_spindle_count. For SSDs, start with core_count * 2 to core_count * 4. More connections does not mean more throughput — context switching kills performance beyond the sweet spot.
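Encoded as a helper, the formula gives a starting value to load-test from, not a target (the function name is illustrative):

```python
import os

def pool_size(core_count, effective_spindle_count=0):
    """Starting point only; validate with a load test before settling on it."""
    return core_count * 2 + effective_spindle_count

print(pool_size(os.cpu_count() or 4))           # e.g. 16 on an 8-core SSD box
print(pool_size(8, effective_spindle_count=2))  # 18 with two spinning disks
```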