Hardware & Compute

Memory Hierarchy & RAM

From registers to network storage -- how data moves through the memory hierarchy, how RAM actually works, and how virtual memory, page tables, and garbage collectors keep everything running.

01 / The Memory Hierarchy

Latency Numbers Every Programmer Should Know

Computer memory is organized as a hierarchy trading off speed, size, and cost. Data close to the CPU is tiny and blazing fast; data far away is massive and slow. Every layer acts as a cache for the layer below it.

Memory Hierarchy (fastest to slowest)
Registers
L1 Cache
L2 Cache
L3 Cache
RAM
SSD
HDD
Network
Level                  | Typical Latency        | Size                    | Cost / GB
Register               | ~0.3 ns (1 cycle)      | ~1 KB                   | --
L1 Cache               | ~1 ns (3-4 cycles)     | 32-64 KB per core       | ~$7,000
L2 Cache               | ~4 ns (10-12 cycles)   | 256 KB - 1 MB per core  | ~$1,000
L3 Cache               | ~10 ns (30-40 cycles)  | 8-64 MB shared          | ~$200
RAM (DRAM)             | ~100 ns                | 8-512 GB                | ~$3-5
SSD (NVMe)             | ~10-100 µs             | 256 GB - 8 TB           | ~$0.08
HDD                    | ~5-10 ms               | 1-20 TB                 | ~$0.02
Network (same DC)      | ~0.5 ms                | Unlimited               | Variable
Network (cross-region) | ~50-150 ms             | Unlimited               | Variable
The Key Intuition
L1 cache is ~100x faster than RAM, and RAM is another ~100-1000x faster than SSD. A single cache miss that goes all the way to RAM costs as much as ~100 L1 hits. Design your data structures to be cache-friendly -- this is why arrays often beat linked lists in practice.

Cache Lines and Spatial Locality

CPUs don't read individual bytes from memory. They fetch cache lines (typically 64 bytes). When you access one byte, the entire 64-byte line is pulled into cache. This is why iterating over an array sequentially is fast -- neighboring elements are already in cache (spatial locality). Jumping around memory (like chasing pointers in a linked list) defeats this prefetching.
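
To make spatial locality concrete, here is a small C sketch (the array size and stride are illustrative assumptions): the sequential loop touches each 64-byte line once, while the strided loop forces a new line fetch on nearly every access, so it typically runs several times slower despite doing the same number of additions.

// Illustrative only: compare sequential vs. strided traversal of the same array
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   // 16M ints (~64 MB), assumed larger than L3 cache

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = 1;

    long sum = 0;

    // Sequential: 16 consecutive 4-byte ints share one 64-byte cache line
    for (int i = 0; i < N; i++) sum += a[i];

    // Strided by 16 ints (64 bytes): nearly every access lands on a new line
    for (int j = 0; j < 16; j++)
        for (int i = j; i < N; i += 16) sum += a[i];

    printf("%ld\n", sum);   // prevent the loops from being optimized away
    free(a);
    return 0;
}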

Temporal Locality

Data you accessed recently is likely to be accessed again soon. Caches exploit this by keeping recently-used data close to the CPU. Loop variables, stack frames, and hot code paths all benefit from temporal locality.

02 / RAM Deep Dive

DRAM Architecture & Modern Memory

RAM (Random Access Memory) uses DRAM (Dynamic RAM) -- each bit is stored as a charge in a tiny capacitor paired with a transistor. "Dynamic" because the charge leaks away and must be refreshed thousands of times per second.

How DRAM Works

A DRAM chip is organized as a 2D grid of rows and columns. To read a cell, the memory controller sends a row address (activating an entire row into the "row buffer"), then a column address to select specific bytes. Accessing data already in the row buffer is much faster than activating a new row -- this is called a row buffer hit.

DRAM Access Sequence:
  1. ACTIVATE (open a row)     ~13 ns  (tRCD)
  2. READ/WRITE (column cmd)   ~13 ns  (CL)
  3. PRECHARGE (close the row)  ~13 ns  (tRP)

Row buffer hit: skip step 1+3, just ~13 ns
Row buffer miss: all three steps, ~40 ns + bus transfer

DDR4 vs DDR5

Property           | DDR4           | DDR5
Data Rate          | 2133-3200 MT/s | 4800-8400 MT/s
Voltage            | 1.2 V          | 1.1 V
Prefetch           | 8n             | 16n
Channels per DIMM  | 1 (64-bit)     | 2 (32-bit each)
Max DIMM Capacity  | 64 GB          | 256 GB
On-die ECC         | No             | Yes (internal)
Burst Length       | 8              | 16
DDR5's Big Win
DDR5 splits each DIMM into two independent 32-bit channels. This doubles the number of concurrent memory requests the controller can handle, improving effective bandwidth for multi-threaded workloads even when raw latency is similar to DDR4.

Channels & Bandwidth

A memory channel is an independent 64-bit (DDR4) or 32-bit (DDR5) data path between the CPU and memory. Dual-channel doubles throughput. For example, DDR4-3200 single-channel delivers ~25.6 GB/s; dual-channel delivers ~51.2 GB/s. More channels = more bandwidth, but latency stays roughly the same.
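
For reference, the dual-channel figure above is just the transfer rate multiplied by the bus width and the channel count:

\text{BW} = 3200\ \text{MT/s} \times 8\ \text{B/transfer} \times 2\ \text{channels} = 51.2\ \text{GB/s}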

ECC Memory

ECC (Error-Correcting Code) memory adds extra bits to detect and correct single-bit errors. Essential for servers, databases, and any system where silent data corruption is unacceptable. DDR5 includes on-die ECC for internal reliability, but full ECC DIMMs still add additional parity bits for end-to-end protection.

03 / Virtual Memory

Page Tables, TLB, and Huge Pages

Virtual memory gives each process its own isolated address space, mapping virtual addresses to physical addresses. The OS and CPU hardware collaborate to make this transparent and fast.

Page Tables

Memory is divided into fixed-size pages (default 4 KB on x86). The page table maps each virtual page to a physical frame. On x86-64, there are 4 levels of page tables (PGD → PUD → PMD → PTE), each indexed by 9 bits of the virtual address. This means a single address translation could require 4 memory accesses -- which is why the TLB exists.

x86-64 Virtual Address Translation (4-level)
PGD (9 bits)
PUD (9 bits)
PMD (9 bits)
PTE (9 bits)
Offset (12 bits)
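
As a sketch of how the hardware slices an address (the sample value is arbitrary), each 9-bit field below indexes one level of the page table and the low 12 bits select a byte within the 4 KB page:

// Decompose a 48-bit x86-64 virtual address into page-table indices (4 KB pages)
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t vaddr = 0x00007f3a12345678ULL;       // example address (made up)

    uint64_t offset = vaddr & 0xFFF;              // bits 0-11:  byte within the page
    uint64_t pte    = (vaddr >> 12) & 0x1FF;      // bits 12-20: page table index
    uint64_t pmd    = (vaddr >> 21) & 0x1FF;      // bits 21-29: page middle directory
    uint64_t pud    = (vaddr >> 30) & 0x1FF;      // bits 30-38: page upper directory
    uint64_t pgd    = (vaddr >> 39) & 0x1FF;      // bits 39-47: page global directory

    printf("PGD=%llu PUD=%llu PMD=%llu PTE=%llu offset=0x%llx\n",
           (unsigned long long)pgd, (unsigned long long)pud,
           (unsigned long long)pmd, (unsigned long long)pte,
           (unsigned long long)offset);
    return 0;
}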

TLB (Translation Lookaside Buffer)

The TLB is a small, very fast cache inside the CPU that stores recent virtual-to-physical page translations. A TLB hit resolves the address in ~1 ns. A TLB miss triggers a page table walk (4 memory accesses = ~400 ns worst case, though upper levels are often cached). The L1 TLB typically holds 64-128 entries; L2 TLB holds 1024-2048 entries.

Huge Pages

With 4 KB pages, 1 GB of memory requires 262,144 page table entries, easily overwhelming the TLB. Huge pages solve this:

Page Size      | Entries for 1 GB | TLB Pressure | Use Case
4 KB (default) | 262,144          | High         | General purpose
2 MB (huge)    | 512              | Low          | Databases, VMs, large heaps
1 GB (giant)   | 1                | Minimal      | HPC, large memory-mapped files
Enabling Huge Pages on Linux
echo 1024 > /proc/sys/vm/nr_hugepages pre-allocates 1024 huge pages (2 GB total). Applications can use mmap() with MAP_HUGETLB, or use Transparent Huge Pages (THP) which the kernel manages automatically -- though THP can cause latency spikes from background compaction.
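
A minimal sketch of the MAP_HUGETLB route, assuming huge pages have been reserved as above (the call fails if none are available):

// Request one anonymous 2 MB huge page (Linux-specific MAP_HUGETLB flag)
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;                    // one 2 MB huge page
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");                 // e.g. no huge pages reserved
        return 1;
    }
    ((char *)p)[0] = 42;                             // touch the page
    munmap(p, len);
    return 0;
}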

04 / Page Faults & Memory Mapping

Minor Faults, Major Faults, and Copy-on-Write

A page fault occurs when a process accesses a virtual page that isn't currently mapped to a physical frame. The CPU traps to the OS kernel, which resolves the fault.

Minor Page Fault

The page is already in memory (e.g., in the page cache) but not yet mapped in this process's page table. The kernel just updates the page table entry. Cost: ~1-10 µs.

Major Page Fault

The page must be fetched from disk (swap or a file). The process blocks while the kernel issues I/O. Cost: ~1-10 ms (SSD) or ~5-20 ms (HDD). Extremely expensive.

mmap -- Memory-Mapped Files

mmap() maps a file directly into a process's virtual address space. Reads and writes become page faults that the kernel satisfies from the page cache. This avoids the read()/write() system call overhead and lets the kernel manage caching. Databases like SQLite and LMDB use mmap extensively.

// Memory-map a file for reading
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("data.bin", O_RDONLY);
struct stat st;
fstat(fd, &st);                            // query the file size
size_t file_size = st.st_size;
void *ptr = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

// Advise kernel about the access pattern before touching the data
madvise(ptr, file_size, MADV_SEQUENTIAL);  // prefetch ahead

// Access data -- triggers minor/major faults as needed
int val = ((int*)ptr)[1000];

munmap(ptr, file_size);
close(fd);

Copy-on-Write (CoW)

When a process calls fork(), the kernel doesn't copy all memory. Instead, both parent and child share the same physical pages, marked read-only. Only when one process writes to a page does the kernel copy that page -- a minor page fault triggers the copy. This makes fork() nearly instant regardless of process size.
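
A small sketch of the mechanism (buffer size and values are arbitrary): after fork(), reads in the child cost nothing extra, but the first write to a shared page triggers a minor fault and a private 4 KB copy.

// Demonstrate copy-on-write after fork()
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 100 * 1024 * 1024;       // 100 MB, touched so every page is resident
    char *buf = malloc(size);
    memset(buf, 0xAB, size);

    pid_t pid = fork();                    // near-instant: pages shared, marked read-only
    if (pid == 0) {
        char c = buf[0];                   // read: no fault, shares the parent's page
        buf[0] = c + 1;                    // write: minor fault, kernel copies this page only
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}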

Redis and CoW
Redis uses fork() for background persistence (RDB snapshots). If the dataset is 10 GB and the write rate is high during a snapshot, CoW can double memory usage as modified pages get copied. This is why Redis needs 2x memory headroom when persistence is enabled.

05 / Memory Allocation

Stack, Heap, malloc, and Allocators

Stack vs Heap

Property       | Stack                           | Heap
Allocation     | Automatic (push/pop)            | Manual or GC-managed
Speed          | ~1 ns (just move stack pointer) | ~50-100 ns (malloc)
Size           | Fixed (typically 1-8 MB)        | Limited by available memory
Fragmentation  | None (LIFO ordering)            | Can be severe
Thread Safety  | Inherently safe (per-thread)    | Requires synchronization
Lifetime       | Scoped to function              | Until explicitly freed

How malloc Works

malloc() requests memory from the OS via brk() (extend the data segment) or mmap() (large allocations, typically > 128 KB). The allocator maintains free lists of previously freed blocks, serving new requests from these lists when possible to avoid system calls.

// glibc malloc internals (simplified)
struct chunk {
    size_t prev_size;   // size of previous chunk (if free)
    size_t size;        // size of this chunk + flags (in-use, mmap'd, etc.)
    // user data starts here when allocated
    // when free: fd/bk pointers for doubly-linked free list
};

Modern Allocators

jemalloc

Used by Firefox and Redis, and formerly the default allocator for Rust binaries. Uses arenas (multiple independent heaps) to reduce lock contention. Thread-local caches avoid locking for most small allocations. Excellent fragmentation control via size classes and slab allocation.

tcmalloc

Google's Thread-Caching Malloc. Each thread has a local cache of free objects. Small allocations (<256 KB) served from thread cache with zero locking. Transfers objects between thread cache and central heap in batches.

mimalloc

Microsoft Research. Achieves excellent performance through free list sharding -- each page has its own free list. Up to 2x faster than jemalloc/tcmalloc on some benchmarks. Originally built for the runtimes of the Koka and Lean languages.

Fragmentation

External fragmentation: free memory is broken into small non-contiguous chunks. A 1 MB allocation fails even though 2 MB total is free. Internal fragmentation: allocated blocks are larger than requested (e.g., you ask for 20 bytes but get a 32-byte slot due to size classes). Modern allocators use size-class binning to minimize both types.

06 / Garbage Collection

Reference Counting, Mark-and-Sweep, and Beyond

Garbage collection (GC) automates memory management by reclaiming objects that are no longer reachable. Different strategies make different tradeoffs between throughput, latency, and memory overhead.

Reference Counting

Each object tracks how many references point to it. When the count drops to zero, the object is immediately freed. Simple and deterministic, but cannot handle reference cycles (A → B → A). Python uses reference counting as its primary GC, with a cycle detector as backup. Swift/Objective-C use ARC (Automatic Reference Counting), which inserts retain/release calls at compile time.
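
A minimal sketch of manual reference counting in C (the Object type and helper names are made up for illustration):

// Each object carries a count; the last release frees it deterministically
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int refcount;
    char payload[64];
} Object;

Object *obj_new(void) {
    Object *o = calloc(1, sizeof(Object));
    o->refcount = 1;                       // the creator holds the first reference
    return o;
}

void obj_retain(Object *o)  { o->refcount++; }

void obj_release(Object *o) {
    if (--o->refcount == 0) {              // no more owners: free immediately
        printf("freeing object\n");
        free(o);
    }
}

int main(void) {
    Object *o = obj_new();
    obj_retain(o);                         // a second owner appears
    obj_release(o);                        // first owner done, object survives
    obj_release(o);                        // second owner done, object freed
    return 0;
}

If two such objects retained pointers to each other, neither count would ever reach zero -- exactly the cycle problem described above.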

Mark-and-Sweep

The collector starts from GC roots (stack variables, global variables, registers) and marks all reachable objects. Everything unmarked is swept (freed). Simple but requires a stop-the-world pause -- all application threads are frozen during collection. Used as the foundation for most modern GCs.
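
The core idea fits in a few lines; below is a toy sketch over a fixed pool of nodes (the pool size, indices, and graph shape are made up for illustration -- real collectors walk stacks, registers, and globals).

// Toy mark-and-sweep over a fixed pool of nodes
#include <stdbool.h>
#include <stdio.h>

#define POOL 8

typedef struct {
    int left, right;                       // indices of referenced nodes, -1 for none
    bool marked, in_use;
} Node;

static Node pool[POOL];

static void mark(int i) {                  // phase 1: mark everything reachable
    if (i < 0 || pool[i].marked) return;
    pool[i].marked = true;
    mark(pool[i].left);
    mark(pool[i].right);
}

static void sweep(void) {                  // phase 2: free whatever was never marked
    for (int i = 0; i < POOL; i++) {
        if (pool[i].in_use && !pool[i].marked)
            pool[i].in_use = false;        // reclaim unreachable node
        pool[i].marked = false;            // reset mark bits for the next cycle
    }
}

int main(void) {
    // Graph: root 0 -> 1 -> 2; node 3 only references itself (a garbage cycle)
    for (int i = 0; i < 4; i++) { pool[i].in_use = true; pool[i].left = pool[i].right = -1; }
    pool[0].left = 1; pool[1].left = 2; pool[3].left = 3;

    mark(0);                               // start from the GC root
    sweep();                               // node 3 is collected despite its self-cycle

    for (int i = 0; i < 4; i++)
        printf("node %d: %s\n", i, pool[i].in_use ? "live" : "collected");
    return 0;
}

Note that node 3 forms a cycle and is still collected -- unlike reference counting, reachability-based collection has no trouble with cycles.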

Generational GC

Based on the generational hypothesis: most objects die young. Memory is split into generations:

Generational GC Layout (Java HotSpot)
Eden (new objects)
Survivor S0/S1
Old Generation

Young generation collections (minor GC) are fast because most objects are dead. Objects surviving multiple collections are promoted to the old generation, which is collected less frequently (major GC). This reduces GC pause times dramatically for most workloads.

Modern JVM Collectors

G1 GC

Divides heap into equal-sized regions (~2048). Collects "garbage-first" regions with the most reclaimable space. Target pause time: 200ms (configurable). Default in Java 9+.

ZGC

Sub-millisecond pauses regardless of heap size (multi-TB heaps). Uses colored pointers and load barriers. Concurrent compaction. Production-ready in Java 15+.

Shenandoah

Similar goals to ZGC (low pause). Uses Brooks forwarding pointers. Concurrent evacuation. Available in OpenJDK. Pause times <10ms for most workloads.

Escape Analysis

The JVM's JIT compiler analyzes whether an object "escapes" the method that created it. If an object never escapes (never stored in a field, never passed to another thread), the JVM can stack-allocate it instead of heap-allocating it, eliminating GC pressure entirely. This is why small, short-lived objects in hot loops are often free in Java -- the JIT optimizes them away.

Practical Tip
In Go, escape analysis determines stack vs heap allocation at compile time. Run go build -gcflags='-m' to see which variables escape to the heap. Reducing heap escapes can dramatically reduce GC pressure.

Test Yourself

Question 01
An L1 cache reference takes ~1 ns. Approximately how long does a main memory (RAM) reference take?
Main memory (DRAM) access takes approximately 100 ns, making it about 100x slower than an L1 cache hit. This is why cache-friendly data structures matter so much for performance.
Question 02
What is the typical cache line size on modern x86 processors?
Modern x86 CPUs use 64-byte cache lines. When you access a single byte, the entire 64-byte block is fetched into cache, which is why sequential access patterns are so much faster than random access.
Question 03
What key architectural change did DDR5 introduce per DIMM compared to DDR4?
DDR5 splits each DIMM into two independent 32-bit sub-channels (vs DDR4's single 64-bit channel). This doubles concurrent memory request capacity and improves effective bandwidth for multi-threaded workloads.
Question 04
On x86-64 with 4 KB pages, how many levels does the page table have?
x86-64 uses a 4-level page table (PGD, PUD, PMD, PTE), each indexed by 9 bits of the 48-bit virtual address. The remaining 12 bits form the page offset. Note: 5-level paging (LA57) exists for 57-bit addressing but is not the default.
Question 05
A process accesses a virtual page that is already in the page cache but not mapped in its page table. What type of fault occurs?
A minor (soft) page fault occurs when the page is already in physical memory but not yet mapped in the process's page table. The kernel just creates the mapping -- no disk I/O needed. A major fault requires fetching data from disk.
Question 06
Why does Redis need ~2x memory headroom when background persistence (RDB snapshots) is enabled?
Redis uses fork() for snapshots. The child process shares pages via Copy-on-Write. As the parent writes to pages during the snapshot, CoW triggers page copies, potentially duplicating a large fraction of memory if the write rate is high.
Question 07
What is the primary advantage of jemalloc and tcmalloc over glibc's default malloc?
Both jemalloc and tcmalloc use thread-local caches (and arenas in jemalloc's case) so most small allocations are served without any locking. This dramatically reduces contention in multi-threaded applications compared to glibc's ptmalloc2.
Question 08
What fundamental problem does reference counting fail to handle without additional mechanisms?
Reference counting cannot detect cycles -- if A references B and B references A, both counts stay at 1 even when no external references exist. Python solves this with a supplementary cycle-detecting collector. Swift avoids it with weak references.
Question 09
What does escape analysis allow the JVM's JIT compiler to do?
If escape analysis determines an object never leaves the method (not stored in fields, not passed to other threads), the JIT can allocate it on the stack or even eliminate the allocation entirely (scalar replacement). This removes GC pressure for short-lived objects.
Question 10
What is ZGC's primary design goal?
ZGC is designed for ultra-low latency with sub-millisecond pause times, even on heaps of several terabytes. It achieves this through concurrent compaction, colored pointers, and load barriers, performing almost all GC work concurrently with application threads.