Latency Numbers Every Programmer Should Know
Computer memory is organized as a hierarchy trading off speed, size, and cost. Data close to the CPU is tiny and blazing fast; data far away is massive and slow. Every layer acts as a cache for the layer below it.
| Level | Typical Latency | Size | Cost / GB |
|---|---|---|---|
| Register | ~0.3 ns (1 cycle) | ~1 KB | -- |
| L1 Cache | ~1 ns (3-4 cycles) | 32-64 KB per core | ~$7,000 |
| L2 Cache | ~4 ns (10-12 cycles) | 256 KB - 1 MB per core | ~$1,000 |
| L3 Cache | ~10 ns (30-40 cycles) | 8-64 MB shared | ~$200 |
| RAM (DRAM) | ~100 ns | 8-512 GB | ~$3-5 |
| SSD (NVMe) | ~10-100 µs | 256 GB - 8 TB | ~$0.08 |
| HDD | ~5-10 ms | 1-20 TB | ~$0.02 |
| Network (same DC) | ~0.5 ms | Unlimited | Variable |
| Network (cross-region) | ~50-150 ms | Unlimited | Variable |
Cache Lines and Spatial Locality
CPUs don't read individual bytes from memory. They fetch cache lines (typically 64 bytes). When you access one byte, the entire 64-byte line is pulled into cache. This is why iterating over an array sequentially is fast -- neighboring elements are already in cache (spatial locality). Jumping around memory (like chasing pointers in a linked list) defeats this prefetching.
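A minimal C sketch of the effect (sizes and speedups vary by machine): both functions sum the same 64 MB grid, and only the traversal order differs.

```c
#include <stddef.h>

#define N 4096
static int grid[N][N];  // 64 MB

// Row-major traversal: consecutive elements share cache lines,
// so most accesses hit in L1 (spatial locality).
long sum_rows(void) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += grid[i][j];
    return sum;
}

// Column-major traversal: each access lands N*sizeof(int) bytes away,
// touching a new cache line (and often a new page) every iteration.
long sum_cols(void) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += grid[i][j];
    return sum;
}
```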
Temporal Locality
Data you accessed recently is likely to be accessed again soon. Caches exploit this by keeping recently-used data close to the CPU. Loop variables, stack frames, and hot code paths all benefit from temporal locality.
DRAM Architecture & Modern Memory
RAM (Random Access Memory) uses DRAM (Dynamic RAM) -- each bit is stored as a charge in a tiny capacitor paired with a transistor. "Dynamic" because the charge leaks away and must be refreshed thousands of times per second.
How DRAM Works
A DRAM chip is organized as a 2D grid of rows and columns. To read a cell, the memory controller sends a row address (activating an entire row into the "row buffer"), then a column address to select specific bytes. Accessing data already in the row buffer is much faster than activating a new row -- this is called a row buffer hit.
DRAM Access Sequence:
1. ACTIVATE (open a row) -- ~13 ns (tRCD)
2. READ/WRITE (column command) -- ~13 ns (CL)
3. PRECHARGE (close the row) -- ~13 ns (tRP)

Row buffer hit: skip steps 1 and 3 -- just ~13 ns.
Row buffer miss: all three steps -- ~40 ns plus bus transfer.
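As a back-of-the-envelope check, assume a DDR4-3200 part with CL = tRCD = tRP = 22 cycles (a common JEDEC timing; the specific part is an assumption here). Converting cycles to nanoseconds reproduces the figures above:

```c
#include <stdio.h>

// DDR transfers twice per clock, so a DDR4-3200 part has a
// 1600 MHz I/O clock; CL/tRCD/tRP are counted in those clocks.
int main(void) {
    double mt_per_s = 3200e6;          // DDR4-3200 (assumed part)
    double clock_hz = mt_per_s / 2.0;  // 1600 MHz
    int cl = 22, trcd = 22, trp = 22;  // typical CL22 timings

    double ns = 1e9 / clock_hz;        // 0.625 ns per cycle
    printf("row-buffer hit : %.1f ns (CL only)\n", cl * ns);
    printf("row-buffer miss: %.1f ns (tRCD+CL+tRP)\n", (trcd + cl + trp) * ns);
    return 0;
}
// Prints ~13.8 ns for a hit and ~41.2 ns for a miss.
```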
DDR4 vs DDR5
| Property | DDR4 | DDR5 |
|---|---|---|
| Data Rate | 2133-3200 MT/s | 4800-8400 MT/s |
| Voltage | 1.2V | 1.1V |
| Prefetch | 8n | 16n |
| Channels per DIMM | 1 (64-bit) | 2 (32-bit each) |
| Max DIMM Capacity | 64 GB | 256 GB |
| ECC on-die | No | Yes (internal) |
| Burst Length | 8 | 16 |
Channels & Bandwidth
A memory channel is an independent 64-bit (DDR4) or 32-bit (DDR5) data path between the CPU and memory. Dual-channel doubles throughput. For example, DDR4-3200 single-channel delivers ~25.6 GB/s; dual-channel delivers ~51.2 GB/s. More channels = more bandwidth, but latency stays roughly the same.
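The arithmetic behind those numbers is just transfers per second times bus width (a theoretical peak; real-world throughput lands somewhat below it):

```c
#include <stdio.h>

// Peak bandwidth = transfer rate x bus width, per channel.
int main(void) {
    double mt_per_s  = 3200e6;  // DDR4-3200: 3200 million transfers/s
    double bus_bytes = 8;       // 64-bit channel = 8 bytes per transfer
    double per_channel = mt_per_s * bus_bytes;        // 25.6 GB/s
    printf("single-channel: %.1f GB/s\n", per_channel / 1e9);
    printf("dual-channel  : %.1f GB/s\n", 2 * per_channel / 1e9);
    return 0;
}
```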
ECC Memory
ECC (Error-Correcting Code) memory adds extra bits to detect and correct single-bit errors. Essential for servers, databases, and any system where silent data corruption is unacceptable. DDR5 includes on-die ECC for internal reliability, but full ECC DIMMs still add additional parity bits for end-to-end protection.
Page Tables, TLB, and Huge Pages
Virtual memory gives each process its own isolated address space, mapping virtual addresses to physical addresses. The OS and CPU hardware collaborate to make this transparent and fast.
Page Tables
Memory is divided into fixed-size pages (default 4 KB on x86). The page table maps each virtual page to a physical frame. On x86-64, there are 4 levels of page tables (PGD → PUD → PMD → PTE), each indexed by 9 bits of the virtual address. This means a single address translation could require 4 memory accesses -- which is why the TLB exists.
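A sketch of how the hardware slices a virtual address into those four indices (x86-64 with 4 KB pages; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

// Split an x86-64 virtual address (4-level paging) into the four
// 9-bit table indices plus the 12-bit page offset.
int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;  // arbitrary example
    unsigned pgd = (vaddr >> 39) & 0x1ff;    // bits 47-39
    unsigned pud = (vaddr >> 30) & 0x1ff;    // bits 38-30
    unsigned pmd = (vaddr >> 21) & 0x1ff;    // bits 29-21
    unsigned pte = (vaddr >> 12) & 0x1ff;    // bits 20-12
    unsigned off = vaddr & 0xfff;            // bits 11-0
    printf("PGD %u -> PUD %u -> PMD %u -> PTE %u, offset 0x%x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```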
TLB (Translation Lookaside Buffer)
The TLB is a small, very fast cache inside the CPU that stores recent virtual-to-physical page translations. A TLB hit resolves the address in ~1 ns. A TLB miss triggers a page table walk (4 memory accesses = ~400 ns worst case, though upper levels are often cached). The L1 TLB typically holds 64-128 entries; L2 TLB holds 1024-2048 entries.
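A rough model of what the hit rate buys you, plugging in ~1 ns per hit and ~100 ns per walk from above (illustrative numbers, not measurements):

```c
#include <stdio.h>

// Average translation cost = hit_rate * hit + miss_rate * (hit + walk).
int main(void) {
    double hit_ns = 1.0, walk_ns = 100.0;
    double rates[] = { 0.90, 0.99, 0.999 };
    for (int i = 0; i < 3; i++) {
        double miss = 1.0 - rates[i];
        double eff  = rates[i] * hit_ns + miss * (hit_ns + walk_ns);
        printf("hit rate %.3f -> %.2f ns per translation\n", rates[i], eff);
    }
    return 0;  // even a 1% miss rate doubles the average cost
}
```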
Huge Pages
With 4 KB pages, 1 GB of memory requires 262,144 page table entries, easily overwhelming the TLB. Huge pages solve this:
| Page Size | Entries for 1 GB | TLB Pressure | Use Case |
|---|---|---|---|
| 4 KB (default) | 262,144 | High | General purpose |
| 2 MB (huge) | 512 | Low | Databases, VMs, large heaps |
| 1 GB (giant) | 1 | Minimal | HPC, large memory-mapped files |
Running echo 1024 > /proc/sys/vm/nr_hugepages pre-allocates 1024 huge pages (2 GB total). Applications can request them with mmap() and MAP_HUGETLB, or rely on Transparent Huge Pages (THP), which the kernel manages automatically -- though THP can cause latency spikes from background compaction.
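A minimal sketch of requesting a 2 MB huge page explicitly (Linux-specific; the mmap() call fails unless huge pages have been reserved as above):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;  // one 2 MB huge page
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  // likely: no huge pages reserved
        return 1;
    }
    memset(p, 0, len);  // one TLB entry now covers the whole 2 MB
    munmap(p, len);
    return 0;
}
```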
Minor Faults, Major Faults, and Copy-on-Write
A page fault occurs when a process accesses a virtual page that isn't currently mapped to a physical frame. The CPU traps to the OS kernel, which resolves the fault.
Minor fault: the page is already in memory (e.g., in the page cache) but not yet mapped in this process's page table. The kernel just updates the page table entry. Cost: ~1-10 µs.
Major fault: the page must be fetched from disk (swap or a file). The process blocks while the kernel issues I/O. Cost: ~1-10 ms (SSD) or ~5-20 ms (HDD). Extremely expensive.
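You can watch both counters with getrusage(2) (or /usr/bin/time -v); a minimal sketch:

```c
#include <stdio.h>
#include <sys/resource.h>

// Print this process's fault counts so far: ru_minflt counts
// minor faults (no I/O), ru_majflt counts major faults (I/O).
int main(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults: %ld\n", ru.ru_minflt);
    printf("major faults: %ld\n", ru.ru_majflt);
    return 0;
}
```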
mmap -- Memory-Mapped Files
mmap() maps a file directly into a process's virtual address space. Reads and writes become page faults that the kernel satisfies from the page cache. This avoids the read()/write() system call overhead and lets the kernel manage caching. Databases like SQLite and LMDB use mmap extensively.
```c
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, madvise, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Memory-map a file for reading
int fd = open("data.bin", O_RDONLY);
struct stat st;
fstat(fd, &st);  // file size = st.st_size
void *ptr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Access data -- triggers minor/major faults as needed
int val = ((int*)ptr)[1000];
// Advise kernel about access pattern
madvise(ptr, st.st_size, MADV_SEQUENTIAL); // prefetch ahead
munmap(ptr, st.st_size);
close(fd);
```
Copy-on-Write (CoW)
When a process calls fork(), the kernel doesn't copy all memory. Instead, both parent and child share the same physical pages, marked read-only. Only when one process writes to a page does the kernel copy that page -- a minor page fault triggers the copy. This makes fork() nearly instant regardless of process size.
Redis uses fork() for background persistence (RDB snapshots). If the dataset is 10 GB and the write rate is high during a snapshot, CoW can double memory usage as modified pages get copied. This is why Redis needs 2x memory headroom when persistence is enabled.
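A sketch of the mechanism (the 1 GB size is arbitrary): fork() returns almost immediately, and memory is only duplicated as the child writes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

// fork() is near-instant even with a large heap: pages are shared
// read-only and copied lazily, one at a time, on first write.
int main(void) {
    size_t len = 1UL << 30;          // 1 GB (arbitrary demo size)
    char *buf = malloc(len);
    memset(buf, 1, len);             // fault every page in

    pid_t pid = fork();              // fast: nothing is copied yet
    if (pid == 0) {
        // Child: each first write copies one 4 KB page (minor fault),
        // so this loop gradually duplicates the whole gigabyte.
        for (size_t i = 0; i < len; i += 4096)
            buf[i] = 2;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", buf[0]);  // 1: unchanged by child
    free(buf);
    return 0;
}
```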
Stack, Heap, malloc, and Allocators
Stack vs Heap
| Property | Stack | Heap |
|---|---|---|
| Allocation | Automatic (push/pop) | Manual or GC-managed |
| Speed | ~1 ns (just move stack pointer) | ~50-100 ns (malloc) |
| Size | Fixed (typically 1-8 MB) | Limited by available memory |
| Fragmentation | None (LIFO ordering) | Can be severe |
| Thread Safety | Inherently safe (per-thread) | Requires synchronization |
| Lifetime | Scoped to function | Until explicitly freed |
How malloc Works
malloc() requests memory from the OS via brk() (extend the data segment) or mmap() (large allocations, typically > 128 KB). The allocator maintains free lists of previously freed blocks, serving new requests from these lists when possible to avoid system calls.
```c
// glibc malloc internals (simplified)
struct chunk {
    size_t prev_size;  // size of previous chunk (only meaningful if it is free)
    size_t size;       // size of this chunk + flags (in-use, mmap'd, etc.)
    // user data starts here when allocated
    // when free: fd/bk pointers for a doubly-linked free list
};
```
Modern Allocators
jemalloc: used by Firefox and Redis (and formerly Rust's default allocator). Uses arenas (multiple independent heaps) to reduce lock contention. Thread-local caches avoid locking for most small allocations. Excellent fragmentation control via size classes and slab allocation.
tcmalloc: Google's Thread-Caching Malloc. Each thread has a local cache of free objects. Small allocations (<256 KB) are served from the thread cache with zero locking. Objects move between thread caches and the central heap in batches.
mimalloc: from Microsoft Research. Achieves excellent performance through free list sharding -- each page has its own free list. Up to 2x faster than jemalloc/tcmalloc on some benchmarks. The default allocator in languages like Koka and Lean.
Fragmentation
External fragmentation: free memory is broken into small non-contiguous chunks. A 1 MB allocation fails even though 2 MB total is free. Internal fragmentation: allocated blocks are larger than requested (e.g., you ask for 20 bytes but get a 32-byte slot due to size classes). Modern allocators use size-class binning to minimize both types.
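Internal fragmentation is easy to observe on glibc with malloc_usable_size() (a glibc extension; other allocators expose similar introspection):

```c
#include <malloc.h>  // malloc_usable_size (glibc-specific)
#include <stdio.h>
#include <stdlib.h>

// glibc rounds each request up to a size class, so small
// allocations carry extra slack -- internal fragmentation.
int main(void) {
    for (size_t req = 1; req <= 64; req *= 2) {
        void *p = malloc(req);
        printf("requested %2zu bytes, got %2zu usable\n",
               req, malloc_usable_size(p));
        free(p);
    }
    return 0;
}
```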
Reference Counting, Mark-and-Sweep, and Beyond
Garbage collection (GC) automates memory management by reclaiming objects that are no longer reachable. Different strategies make different tradeoffs between throughput, latency, and memory overhead.
Reference Counting
Each object tracks how many references point to it. When the count drops to zero, the object is immediately freed. Simple and deterministic, but cannot handle reference cycles (A → B → A). Python uses reference counting as its primary GC, with a cycle detector as backup. Swift/Objective-C use ARC (Automatic Reference Counting), which inserts retain/release calls at compile time.
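A minimal single-threaded sketch of intrusive reference counting in C (real implementations such as ARC use atomic counters; the Object type and obj_* names are illustrative):

```c
#include <stdlib.h>

typedef struct {
    int refcount;
    char data[64];
} Object;

Object *obj_new(void) {
    Object *o = calloc(1, sizeof *o);
    o->refcount = 1;                // the creator holds one reference
    return o;
}
void obj_retain(Object *o)  { o->refcount++; }
void obj_release(Object *o) {
    if (--o->refcount == 0)         // last reference gone:
        free(o);                    // freed immediately, no GC pause
}

int main(void) {
    Object *o = obj_new();          // refcount = 1
    obj_retain(o);                  // refcount = 2 (second owner)
    obj_release(o);                 // refcount = 1
    obj_release(o);                 // refcount = 0 -> freed here
    return 0;
}
```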
Mark-and-Sweep
The collector starts from GC roots (stack variables, global variables, registers) and marks all reachable objects. Everything unmarked is swept (freed). Simple but requires a stop-the-world pause -- all application threads are frozen during collection. Used as the foundation for most modern GCs.
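A toy mark-and-sweep collector in C, assuming every object carries a mark bit and at most two outgoing references (a sketch of the algorithm, with illustrative Obj/gc names, not a production design):

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct Obj {
    bool marked;
    struct Obj *ref[2];   // outgoing edges to other objects
    struct Obj *next;     // all-objects list used by the sweep phase
} Obj;

static Obj *all_objects = NULL;

static Obj *obj_new(void) {
    Obj *o = calloc(1, sizeof *o);
    o->next = all_objects;           // register on the global list
    all_objects = o;
    return o;
}

static void mark(Obj *o) {
    if (!o || o->marked) return;     // null edge or already visited
    o->marked = true;
    mark(o->ref[0]);                 // recurse into children
    mark(o->ref[1]);
}

// roots: stack/global/register references, provided by the runtime
void gc(Obj **roots, int nroots) {
    for (int i = 0; i < nroots; i++) // 1. mark everything reachable
        mark(roots[i]);

    Obj **p = &all_objects;          // 2. sweep the unmarked
    while (*p) {
        Obj *o = *p;
        if (!o->marked) {
            *p = o->next;            // unlink and free garbage
            free(o);
        } else {
            o->marked = false;       // reset for the next cycle
            p = &o->next;
        }
    }
}

int main(void) {
    Obj *root = obj_new();
    root->ref[0] = obj_new();        // reachable through root
    obj_new();                       // unreachable: swept below
    Obj *roots[] = { root };
    gc(roots, 1);                    // frees only the third object
    return 0;
}
```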
Generational GC
Based on the generational hypothesis: most objects die young. Memory is split into generations: a small young generation (nursery) where new objects are allocated, and a larger old generation for objects that survive.
Young generation collections (minor GC) are fast because most objects are dead. Objects surviving multiple collections are promoted to the old generation, which is collected less frequently (major GC). This reduces GC pause times dramatically for most workloads.
Modern JVM Collectors
G1 (Garbage-First): divides the heap into equal-sized regions (~2048). Collects "garbage-first" regions with the most reclaimable space. Target pause time: 200 ms (configurable). Default since Java 9.
ZGC: sub-millisecond pauses regardless of heap size (multi-TB heaps). Uses colored pointers and load barriers. Concurrent compaction. Production-ready since Java 15.
Shenandoah: similar goals to ZGC (low pause times). Uses Brooks forwarding pointers. Concurrent evacuation. Available in OpenJDK. Pause times <10 ms for most workloads.
Escape Analysis
The JVM's JIT compiler analyzes whether an object "escapes" the method that created it. If an object never escapes (never stored in a field, never passed to another thread), the JIT can keep it off the heap entirely -- via scalar replacement, its fields live in registers or on the stack -- eliminating GC pressure. This is why small, short-lived objects in hot loops are often free in Java: the JIT optimizes them away.
Go's compiler performs the same analysis; run go build -gcflags='-m' to see which variables escape to the heap. Reducing heap escapes can dramatically reduce GC pressure.