How a CPU Executes Instructions
Every instruction a CPU runs passes through a fundamental cycle: Fetch the instruction from memory, Decode it to determine the operation and operands, and Execute the operation. This is the simplest mental model, but modern CPUs expand this into a deeper pipeline for throughput.
The Classic 5-Stage Pipeline
The MIPS-style 5-stage pipeline became the textbook model for RISC processors. Each stage takes one clock cycle, and ideally a new instruction enters the pipeline every cycle, yielding a throughput of one instruction per clock (IPC = 1) at steady state.
| Stage | What Happens | Key Hardware |
|---|---|---|
| IF — Instruction Fetch | PC (program counter) addresses the I-cache. The instruction is read and PC increments (or a branch target is loaded). | I-cache, branch predictor, PC register |
| ID — Instruction Decode | Opcode is decoded. Source registers are read from the register file. Immediates are sign-extended. | Decoder, register file read ports |
| EX — Execute | The ALU performs the operation (add, shift, compare). For branches, the condition is evaluated. | ALU, branch resolution unit |
| MEM — Memory Access | Loads read from D-cache; stores write to D-cache. Non-memory instructions pass through. | D-cache, store buffer, TLB |
| WB — Write-Back | The result is written to the destination register in the register file. | Register file write ports |
Beyond the Textbook: Modern Front-Ends
Real x86-64 CPUs (like Zen 4 or Golden Cove) don't execute x86 instructions directly. The front-end decodes variable-length x86 instructions into fixed-width micro-ops (uops). A micro-op cache (uop cache / DSB on Intel) stores previously decoded uops to bypass the decode stage entirely on hot loops. This is why "decode width" and "uop cache hit rate" matter for performance tuning.
ISA: x86-64, ARM, RISC-V
The Instruction Set Architecture is the contract between hardware and software. It defines registers, instructions, memory model, and encoding. Two philosophies dominate: CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer).
| Property | x86-64 (CISC) | ARM (RISC) | RISC-V (RISC) |
|---|---|---|---|
| Encoding | Variable-length (1-15 bytes) | Fixed 32-bit (A64) or mixed 16/32-bit (Thumb2) | Fixed 32-bit (base), 16-bit (C extension) |
| GP Registers | 16 (RAX-R15) | 31 (X0-X30) + SP, XZR | 32 (x0-x31), x0 hardwired to 0 |
| Memory Model | TSO (Total Store Order) — strong | Weakly ordered (requires barriers) | RVWMO (weak, release/acquire fences) |
| Condition Codes | EFLAGS register (implicit) | NZCV flags (explicit via S-suffix) | No flags register; compare-and-branch |
| SIMD | SSE/AVX/AVX-512 (128-512 bit) | NEON (128-bit), SVE/SVE2 (variable up to 2048-bit) | V extension (variable-length vectors) |
| Licensing | Proprietary (Intel/AMD) | Proprietary (Arm Ltd. licenses) | Open standard (free, no royalty) |
| Primary Domain | Desktops, servers, HPC | Mobile, embedded, Apple Silicon, servers (Graviton) | Embedded, research, growing in servers |
Registers in Detail
General-purpose registers — On x86-64: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15. Used for arithmetic, addressing, and passing function arguments (System V ABI: RDI, RSI, RDX, RCX, R8, R9).
RIP (instruction pointer) — Points to the next instruction to fetch. Not directly writable; changed by jumps, calls, and returns. RIP-relative addressing is standard in x86-64 for position-independent code.
RSP (stack pointer) — Points to the top of the stack. Implicitly modified by PUSH, POP, CALL, RET. The System V ABI requires 16-byte alignment at function call boundaries.
RFLAGS — Contains status flags (ZF, CF, OF, SF) set by arithmetic ops, and system flags (IF for interrupts, DF for string direction). Conditional branches read these flags.
Pipeline Hazards & Branch Prediction
Pipelining overlaps instruction execution for throughput. But dependencies between instructions create hazards that can stall or corrupt the pipeline.
Three Types of Hazards
| Hazard Type | Cause | Solution |
|---|---|---|
| Data Hazard (RAW, WAR, WAW) | Instruction needs a result not yet produced. E.g., ADD R1, R2, R3 followed by SUB R4, R1, R5 — R1 isn't written back yet. | Forwarding/bypassing (result sent directly from EX stage to next instruction's EX input). Stall if load-use dependency (1-cycle bubble). |
| Control Hazard | Branch instruction changes program flow. Pipeline has already fetched subsequent instructions that may be wrong. | Branch prediction (static/dynamic). On mispredict, flush pipeline (penalty = pipeline depth). |
| Structural Hazard | Two instructions need the same hardware unit simultaneously (e.g., single-ported memory). | Duplicate hardware (separate I-cache and D-cache), or stall one instruction. |
Branch Prediction
Modern CPUs predict branches with >97% accuracy using multi-level predictors. A misprediction flushes 15-20 cycles of work on deep pipelines, making prediction accuracy critical for performance.
BTB (Branch Target Buffer) — A cache mapping branch instruction addresses to their predicted target addresses. Enables fetching from the target before the branch is even decoded.
2-bit saturating counter — Classic predictor: each branch has a 2-bit counter (strongly taken, weakly taken, weakly not-taken, strongly not-taken). Requires two consecutive mispredictions to flip from a strong state.
TAGE (TAgged GEometric) — State-of-the-art: uses multiple tables indexed by different history lengths. Captures both short and long correlations in branch patterns.
Perceptron predictor — Uses a neural-network-like weighted sum of branch history bits. AMD Zen architectures use perceptron-based predictors for improved accuracy on complex patterns.
Superscalar Execution
Superscalar CPUs can issue multiple instructions per cycle to parallel execution units. A 6-wide superscalar core (like Zen 4) can dispatch up to 6 uops per cycle. The actual IPC depends on instruction mix, dependencies, and cache behavior. Achieving IPC > 4 in real workloads is exceptional.
Use perf stat on Linux to measure IPC directly: it reports retired instructions, cycles, and their ratio.
Out-of-Order Execution & Speculation
In-order execution wastes cycles when an instruction stalls (e.g., waiting for a cache miss). Out-of-order (OoO) execution allows the CPU to look ahead and execute independent instructions while earlier ones wait, then retire results in program order to maintain correctness.
| Component | Role | Typical Size (Zen 4) |
|---|---|---|
| Register Alias Table (RAT) | Maps architectural registers to physical registers. Eliminates WAR/WAW hazards via register renaming. | ~200+ physical integer registers |
| Reorder Buffer (ROB) | Tracks all in-flight instructions in program order. Ensures in-order retirement and enables precise exceptions. | 320 entries (Zen 4) |
| Reservation Station (RS) | Holds instructions waiting for operands. When all operands are ready, the instruction is dispatched to an execution unit. | ~92 entries per scheduler |
| Store Buffer | Holds stores until retirement. Enables store-to-load forwarding. Drain on serializing instructions (e.g., MFENCE). | 64 entries |
Speculative Execution
The CPU speculatively executes instructions past unresolved branches. If the prediction is correct, the speculated results commit. If wrong, the ROB squashes all speculated uops and restarts from the correct path. This "free" work is invisible to software—except when it isn't.
Multi-Core, Hyper-Threading & Memory Topology
Multi-Core Architecture
A multi-core CPU has multiple independent execution engines (cores) on a single die. Each core has private L1/L2 caches; all cores share the L3 (LLC). Cores communicate through the cache coherence protocol and a shared interconnect (ring bus on older Intel, mesh on Xeon, Infinity Fabric on AMD).
Simultaneous Multi-Threading (SMT / Hyper-Threading)
SMT allows a single physical core to present as two (or more) logical cores. Each logical core has its own architectural state (registers, PC) but shares execution units, caches, and TLBs. When one thread stalls (e.g., on a cache miss), the other thread can use the execution units.
NUMA (Non-Uniform Memory Access)
In multi-socket systems (and even AMD's chiplet designs), memory is physically attached to specific CPU sockets/dies. Accessing "local" memory is faster (~80ns) than "remote" memory on another socket (~140ns). The OS exposes NUMA topology, and performance-critical applications use numactl or libnuma to bind threads and allocate memory on the correct node.
NUMA node — A group of cores and their directly-attached memory. On a 2-socket system, there are typically 2 NUMA nodes. AMD chiplets may expose multiple nodes per socket.
Remote-access penalty — Cross-node access adds 40-100ns. For data structures accessed by threads on different nodes, this penalty is paid on every cache miss. Prefer node-local allocation.
Interleaving — Round-robin memory allocation across nodes. Good for workloads with unpredictable access patterns. Averages out local vs. remote latency.
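Typical `numactl` invocations look like the following (the binary names are placeholders; the flags are standard `numactl` options):

```shell
# Inspect the machine's NUMA topology and per-node free memory
numactl --hardware

# Pin a latency-sensitive process's threads and allocations to node 0
numactl --cpunodebind=0 --membind=0 ./my_service

# Interleave allocations across all nodes for bandwidth-bound workloads
numactl --interleave=all ./my_batch_job
```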
Cache Hierarchy, Coherence & False Sharing
The Cache Hierarchy
Caches exploit temporal locality (recently used data will be used again) and spatial locality (nearby data will be used soon). Every memory access checks L1 first, then L2, then L3, then DRAM. Each level is larger but slower.
| Level | Typical Size | Latency (cycles) | Associativity | Scope |
|---|---|---|---|---|
| L1 I-cache | 32-64 KB | ~4 cycles | 8-way | Per core |
| L1 D-cache | 32-48 KB | ~4-5 cycles | 8-12 way | Per core |
| L2 | 256 KB - 1 MB | ~12-14 cycles | 8-16 way | Per core |
| L3 (LLC) | 16-96 MB | ~40-50 cycles | 16-way | Shared across cores |
| DRAM | GBs - TBs | ~200+ cycles (~60-100ns) | N/A | Shared (NUMA-aware) |
Cache Lines
The fundamental unit of cache transfer is the cache line, which is 64 bytes on virtually all modern x86 and ARM processors. When you read a single byte, the entire 64-byte line is fetched. This means:
Sequential access is cheap (after one miss, the next fifteen 4-byte values come from the already-fetched line), while scattered access is expensive (touching one byte per line wastes most of the fetched bandwidth). Data layout therefore matters: struct-of-arrays often outperforms array-of-structs for this reason.
Cache Coherence: MESI Protocol
In multi-core systems, each core has private L1/L2 caches that may hold copies of the same memory location. The MESI protocol ensures all cores see a consistent view of memory by tracking the state of each cache line.
M (Modified) — This cache has the only valid copy and it's dirty (changed). Must write back to memory before another core can read it.
E (Exclusive) — This cache has the only copy and it's clean (matches memory). Can transition to M on write without bus traffic.
S (Shared) — Multiple caches hold this line, all clean. A write requires invalidating other copies first (generates bus traffic).
I (Invalid) — This cache line is not valid. Any access requires fetching from L3/memory or another core's cache (cache-to-cache transfer).
False Sharing
False sharing occurs when two threads write to different variables that happen to reside on the same 64-byte cache line. Even though there's no logical data sharing, the hardware coherence protocol bounces the cache line between cores on every write, serializing access and destroying performance.
The fix is to place each thread's variable on its own cache line:

[thread_0_counter | ... padding ... | thread_1_counter]

In C++, use alignas(64) or __attribute__((aligned(64))). In Java: the @Contended annotation (requires -XX:-RestrictContended). In Go: pad structs manually to 64 bytes. The perf c2c tool on Linux can detect false sharing at runtime.
SIMD, Crypto Extensions & Atomics
SIMD: Single Instruction, Multiple Data
SIMD instructions operate on wide registers (128-512 bits) that pack multiple data elements. A single VADDPS ymm0, ymm1, ymm2 adds 8 floats in parallel. This is how media codecs, ML inference, scientific computing, and even memcpy achieve high throughput.
| Extension | Register Width | Elements per Op (float32) | Introduced |
|---|---|---|---|
| SSE | 128-bit (XMM) | 4 | Pentium III (1999) |
| AVX | 256-bit (YMM) | 8 | Sandy Bridge (2011) |
| AVX-512 | 512-bit (ZMM) | 16 | Xeon Phi / Skylake-X (2017) |
| ARM NEON | 128-bit (V registers) | 4 | ARMv7 / Cortex-A8 |
| ARM SVE/SVE2 | 128-2048 bit (scalable) | 4-64 (implementation-dependent) | ARMv8.2+ (Graviton 3) |
AES-NI: Hardware Crypto
AESENC / AESDEC instructions perform one round of AES encryption/decryption in a single instruction (~4 cycles latency, 1 cycle throughput when pipelined). This yields ~10 GB/s AES-256 throughput on modern CPUs, making software AES faster than dedicated hardware in many cases. OpenSSL, BoringSSL, and the Linux kernel all use AES-NI when available.
Atomic Instructions & CAS
Lock-free and concurrent data structures rely on hardware atomic instructions. The most important is Compare-And-Swap (CAS): atomically read a memory location, compare with an expected value, and write a new value only if the comparison succeeds.
On x86: LOCK CMPXCHG (CAS), LOCK XADD (fetch-and-add), LOCK BTS (test-and-set). The LOCK prefix locks the cache line for the duration of the operation. On ARM: the LDXR/STXR (load-exclusive / store-exclusive) pair implements LL/SC (Load-Linked / Store-Conditional), which is more flexible than CAS and avoids the ABA problem at the hardware level.
LOCK CMPXCHG on an uncontended cache line costs ~20 cycles. Under heavy contention (many cores CAS-ing the same line), it can exceed 200 cycles due to cache line bouncing. This is why lock-free queues use padding and per-core sharding to reduce contention on hot cache lines.
Test Yourself
Suppose perf stat reports a high rate of "remote DRAM accesses" for your multi-threaded application. What is the most effective fix?