Hardware & Compute

CPU Architecture & Execution

From fetch-decode-execute to speculative out-of-order pipelines, cache coherence, and SIMD. Everything a senior engineer needs to reason about what the silicon is actually doing beneath your code.

01 / The Execution Cycle

How a CPU Executes Instructions

Every instruction a CPU runs passes through a fundamental cycle: Fetch the instruction from memory, Decode it to determine the operation and operands, and Execute the operation. This is the simplest mental model, but modern CPUs expand this into a deeper pipeline for throughput.

The Classic 5-Stage Pipeline

The MIPS-style 5-stage pipeline became the textbook model for RISC processors. Each stage takes one clock cycle, and ideally a new instruction enters the pipeline every cycle, yielding a throughput of one instruction per clock (IPC = 1) at steady state.

Classic 5-stage RISC pipeline: IF (Fetch) → ID (Decode) → EX (Execute) → MEM (Memory) → WB (Write-Back)
IF (Instruction Fetch): The PC (program counter) addresses the I-cache; the instruction is read and the PC increments (or a branch target is loaded). Key hardware: I-cache, branch predictor, PC register.

ID (Instruction Decode): The opcode is decoded, source registers are read from the register file, and immediates are sign-extended. Key hardware: decoder, register file read ports.

EX (Execute): The ALU performs the operation (add, shift, compare); for branches, the condition is evaluated. Key hardware: ALU, branch resolution unit.

MEM (Memory Access): Loads read from the D-cache and stores write to it; non-memory instructions pass through. Key hardware: D-cache, store buffer, TLB.

WB (Write-Back): The result is written to the destination register in the register file. Key hardware: register file write ports.
Key Insight
Pipeline depth is a tradeoff. Deeper pipelines (Intel Prescott had 31 stages) allow higher clock frequencies because each stage does less work, but they amplify the penalty of branch mispredictions and pipeline stalls. Modern designs settle around 14-19 stages.

Beyond the Textbook: Modern Front-Ends

Real x86-64 CPUs (like Zen 4 or Golden Cove) don't execute x86 instructions directly. The front-end decodes variable-length x86 instructions into fixed-width micro-ops (uops). A micro-op cache (uop cache / DSB on Intel) stores previously decoded uops to bypass the decode stage entirely on hot loops. This is why "decode width" and "uop cache hit rate" matter for performance tuning.

02 / Instruction Set Architectures

ISA: x86-64, ARM, RISC-V

The Instruction Set Architecture is the contract between hardware and software. It defines registers, instructions, memory model, and encoding. Two philosophies dominate: CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer).

Encoding: x86-64 uses variable-length instructions (1-15 bytes); ARM uses fixed 32-bit encoding (A64) or mixed 16/32-bit (Thumb-2); RISC-V uses fixed 32-bit in the base ISA, with 16-bit forms in the C extension.

GP registers: x86-64 has 16 (RAX-R15); ARM has 31 (X0-X30) plus SP and XZR; RISC-V has 32 (x0-x31), with x0 hardwired to zero.

Memory model: x86-64 uses TSO (Total Store Order), a strong model; ARM is weakly ordered and requires explicit barriers; RISC-V uses RVWMO, a weak model with release/acquire fences.

Condition codes: x86-64 sets the EFLAGS register implicitly; ARM sets the NZCV flags explicitly via the S-suffix; RISC-V has no flags register and uses compare-and-branch instructions instead.

SIMD: x86-64 has SSE/AVX/AVX-512 (128-512 bit); ARM has NEON (128-bit) and SVE/SVE2 (variable width up to 2048-bit); RISC-V defines the V extension (variable-length vectors).

Licensing: x86-64 is proprietary (Intel/AMD); ARM is proprietary (Arm Ltd. licenses it); RISC-V is an open standard (free, no royalties).

Primary domain: x86-64 dominates desktops, servers, and HPC; ARM dominates mobile, embedded, Apple Silicon, and servers (Graviton); RISC-V serves embedded and research, and is growing in servers.
CISC vs RISC Today
The distinction is largely academic. Modern x86 CPUs decode CISC instructions into RISC-like micro-ops internally. ARM has accumulated enough instructions that calling it "reduced" is generous. What actually matters for performance is the microarchitecture (pipeline depth, execution units, cache hierarchy), not the ISA surface.

Registers in Detail

General-Purpose (GPRs)

On x86-64: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15. Used for arithmetic, addressing, and passing function arguments (System V ABI: RDI, RSI, RDX, RCX, R8, R9).

Instruction Pointer (RIP)

Points to the next instruction to fetch. Not directly writable; changed by jumps, calls, and returns. RIP-relative addressing is standard in x86-64 for position-independent code.

Stack Pointer (RSP)

Points to the top of the stack. Implicitly modified by PUSH, POP, CALL, RET. The System V ABI requires 16-byte alignment at function call boundaries.

Flags (RFLAGS)

Contains status flags (ZF, CF, OF, SF) set by arithmetic ops, and system flags (IF for interrupts, DF for string direction). Conditional branches read these flags.

03 / Pipelining & Superscalar Execution

Pipeline Hazards & Branch Prediction

Pipelining overlaps instruction execution for throughput. But dependencies between instructions create hazards that can stall or corrupt the pipeline.

Three Types of Hazards

Data hazard (RAW, WAR, WAW): An instruction needs a result that has not yet been produced, e.g., ADD R1, R2, R3 followed by SUB R4, R1, R5 before R1 is written back. Solution: forwarding/bypassing (the result is sent directly from the EX stage to the next instruction's EX input); a load-use dependency still forces a 1-cycle bubble.

Control hazard: A branch changes program flow after the pipeline has already fetched subsequent instructions that may be wrong. Solution: branch prediction (static or dynamic); on a mispredict, flush the pipeline (penalty proportional to pipeline depth).

Structural hazard: Two instructions need the same hardware unit simultaneously (e.g., a single-ported memory). Solution: duplicate the hardware (separate I-cache and D-cache), or stall one instruction.

Branch Prediction

Modern CPUs predict branches with >97% accuracy using multi-level predictors. A misprediction flushes 15-20 cycles of work on deep pipelines, making prediction accuracy critical for performance.

Branch prediction in the pipeline: BTB lookup → predict taken/not-taken → fetch from the predicted path → verify at execute. On a mispredict: flush the pipeline, restore state, and fetch from the correct path.
BTB (Branch Target Buffer)

A cache mapping branch instruction addresses to their predicted target addresses. Enables fetching from the target before the branch is even decoded.

2-bit Saturating Counter

Classic predictor: each branch has a 2-bit counter (strongly taken, weakly taken, weakly not-taken, strongly not-taken). Requires two consecutive mispredictions to flip.
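
A minimal C++ sketch of this predictor, assuming the common state encoding 0-3 from strongly not-taken to strongly taken (real hardware keeps a table of these counters indexed by branch-address bits):

```cpp
#include <cstdint>

// 2-bit saturating counter: 0 = strongly not-taken, 1 = weakly not-taken,
// 2 = weakly taken, 3 = strongly taken.
struct TwoBitCounter {
    uint8_t state = 2;  // start weakly taken

    bool predict_taken() const { return state >= 2; }

    // Saturate at 0 and 3: a single anomaly in a stable branch pattern
    // weakens a strong state but does not flip the prediction.
    void update(bool taken) {
        if (taken && state < 3)  ++state;
        if (!taken && state > 0) --state;
    }
};
```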

TAGE Predictor

State-of-the-art: uses multiple tables indexed by different history lengths (Tagged Geometric). Captures both short and long correlations in branch patterns.

Perceptron Predictor

Uses a neural-network-like weighted sum of branch history bits. AMD Zen architectures use perceptron-based predictors for improved accuracy on complex patterns.

Superscalar Execution

Superscalar CPUs can issue multiple instructions per cycle to parallel execution units. A 6-wide superscalar core (like Zen 4) can dispatch up to 6 uops per cycle. The actual IPC depends on instruction mix, dependencies, and cache behavior. Achieving IPC > 4 in real workloads is exceptional.

Practical Impact
When profiling: if your code's IPC is below 1.0, you're likely memory-bound (stalling on cache misses). If IPC is between 2 and 4, you're likely compute-bound. Use perf stat on Linux to measure IPC directly.
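
A minimal sketch of how dependency chains cap IPC, assuming the compiler does not auto-vectorize (e.g., -O2 without -ffast-math): both functions do the same work, but the second exposes four independent chains that a superscalar, out-of-order core can overlap.

```cpp
#include <cstddef>

// One serial dependency chain: each add waits on the previous one,
// so IPC stays low no matter how wide the core is.
double sum_serial(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Four independent accumulators: the scheduler can keep several
// floating-point adds in flight per cycle.
double sum_parallel_chains(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // scalar tail
    return (s0 + s1) + (s2 + s3);
}
```

Running each version under perf stat should show the four-accumulator loop retiring markedly more instructions per cycle.
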
04 / Out-of-Order & Speculative Execution

Out-of-Order Execution & Speculation

In-order execution wastes cycles when an instruction stalls (e.g., waiting for a cache miss). Out-of-order (OoO) execution allows the CPU to look ahead and execute independent instructions while earlier ones wait, then retire results in program order to maintain correctness.

Out-of-order execution engine: fetch & decode → rename (RAT) → reorder buffer (ROB) → schedulers / reservation stations → execution units → ROB retirement (in order) → commit to architectural state.
Register Alias Table (RAT): Maps architectural registers to physical registers, eliminating WAR/WAW hazards via register renaming. Zen 4: ~200+ physical integer registers.

Reorder Buffer (ROB): Tracks all in-flight instructions in program order, ensuring in-order retirement and enabling precise exceptions. Zen 4: 320 entries.

Reservation Station (RS): Holds instructions waiting for operands; when all operands are ready, the instruction is dispatched to an execution unit. Zen 4: ~92 entries per scheduler.

Store Buffer: Holds stores until retirement and enables store-to-load forwarding; drained on serializing instructions (e.g., MFENCE). Zen 4: 64 entries.

Speculative Execution

The CPU speculatively executes instructions past unresolved branches. If the prediction is correct, the speculated results commit. If wrong, the ROB squashes all speculated uops and restarts from the correct path. This "free" work is invisible to software—except when it isn't.

Spectre & Meltdown (2018)
These attacks exploit speculative execution side effects. Meltdown (CVE-2017-5754): Speculative loads bypass kernel/user privilege checks; the data leaves traces in the cache hierarchy readable via timing side-channels. Fixed with KPTI (kernel page table isolation). Spectre (CVE-2017-5753, -5715): Mistrains branch predictors to speculatively access attacker-chosen memory. Mitigations include retpolines, IBRS, and array bounds clamping. The fundamental issue—that microarchitectural state changes during speculation are observable—continues to produce new variants.
05 / Multi-Core, SMT & NUMA

Multi-Core, Hyper-Threading & Memory Topology

Multi-Core Architecture

A multi-core CPU has multiple independent execution engines (cores) on a single die. Each core has private L1/L2 caches; all cores share the L3 (LLC). Cores communicate through the cache coherence protocol and a shared interconnect (ring bus on older Intel, mesh on Xeon, Infinity Fabric on AMD).

Simultaneous Multi-Threading (SMT / Hyper-Threading)

SMT allows a single physical core to present as two (or more) logical cores. Each logical core has its own architectural state (registers, PC) but shares execution units, caches, and TLBs. When one thread stalls (e.g., on a cache miss), the other thread can use the execution units.

SMT layout, one physical core with two logical threads: each thread keeps its own architectural state (Thread 0 state, Thread 1 state), while both share a single execution engine (ALUs, FPUs, load/store units) and the L1/L2 caches.
When SMT Hurts
SMT typically gives 15-30% throughput improvement for mixed workloads. But for latency-sensitive or cache-thrashing workloads, the second thread competes for cache and execution resources, increasing tail latency. Many HPC and real-time systems disable SMT. Security-sensitive environments disable it to prevent side-channel attacks between co-resident threads.

NUMA (Non-Uniform Memory Access)

In multi-socket systems (and even AMD's chiplet designs), memory is physically attached to specific CPU sockets/dies. Accessing "local" memory is faster (~80ns) than "remote" memory on another socket (~140ns). The OS exposes NUMA topology, and performance-critical applications use numactl or libnuma to bind threads and allocate memory on the correct node (a libnuma sketch follows the terms below).

NUMA Node

A group of cores and their directly-attached memory. On a 2-socket system, there are typically 2 NUMA nodes. AMD chiplets may expose multiple nodes per socket.

Interconnect Latency

Cross-node access adds 40-100ns. For data structures accessed by threads on different nodes, this penalty is paid on every cache miss. Prefer node-local allocation.

numactl --interleave

Round-robin memory allocation across nodes. Good for workloads with unpredictable access patterns. Averages out local vs. remote latency.
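
A minimal sketch of node-local placement with libnuma, assuming a Linux system with libnuma installed (link with -lnuma); error handling is elided:

```cpp
#include <numa.h>   // libnuma API
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    // Run this thread on node 0, then allocate on the same node so
    // every access to the buffer is local.
    numa_run_on_node(0);
    void* buf = numa_alloc_onnode(1 << 20, 0);  // 1 MiB on node 0
    // ... hot data lives in buf, touched only by node-0 threads ...
    numa_free(buf, 1 << 20);
    return 0;
}
```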

06 / CPU Caches & Coherence

Cache Hierarchy, Coherence & False Sharing

The Cache Hierarchy

Caches exploit temporal locality (recently used data will be used again) and spatial locality (nearby data will be used soon). Every memory access checks L1 first, then L2, then L3, then DRAM. Each level is larger but slower.

Typical size, load latency, associativity, and scope per level:

L1 I-cache: 32-64 KB, ~4 cycles, 8-way, per core.
L1 D-cache: 32-48 KB, ~4-5 cycles, 8-12 way, per core.
L2: 256 KB - 1 MB, ~12-14 cycles, 8-16 way, per core.
L3 (LLC): 16-96 MB, ~40-50 cycles, 16-way, shared across cores.
DRAM: GBs to TBs, ~200+ cycles (~60-100 ns), shared (NUMA-aware).

Cache Lines

The fundamental unit of cache transfer is the cache line, which is 64 bytes on virtually all modern x86 and ARM processors. When you read a single byte, the entire 64-byte line is fetched. This means:

Practical Consequence
Iterating an array sequentially is fast because each cache line fetch brings 64 bytes of useful data (spatial locality). Iterating a linked list with scattered nodes is slow because each pointer chase may trigger a new cache line fetch for just 8 bytes of useful data. Data structure layout matters enormously—struct-of-arrays often outperforms array-of-structs for this reason.
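
A minimal sketch of the two layouts (the Particle fields are illustrative): summing only x touches every byte of each fetched line in the SoA version, but only a quarter of each line in the AoS version.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: each 32-byte Particle drags y, z, and mass into
// the cache even when a loop reads only x.
struct Particle { double x, y, z, mass; };

double sum_x_aos(const std::vector<Particle>& ps) {
    double s = 0.0;
    for (const Particle& p : ps) s += p.x;  // 8 useful bytes per 32 fetched
    return s;
}

// Struct-of-arrays: the x values are contiguous, so every 64-byte line
// fetched holds eight useful doubles.
struct Particles {
    std::vector<double> x, y, z, mass;
};

double sum_x_soa(const Particles& ps) {
    double s = 0.0;
    for (double v : ps.x) s += v;
    return s;
}
```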

Cache Coherence: MESI Protocol

In multi-core systems, each core has private L1/L2 caches that may hold copies of the same memory location. The MESI protocol ensures all cores see a consistent view of memory by tracking the state of each cache line.

Modified (M)

This cache has the only valid copy and it's dirty (changed). Must write back to memory before another core can read it.

Exclusive (E)

This cache has the only copy and it's clean (matches memory). Can transition to M on write without bus traffic.

Shared (S)

Multiple caches hold this line, all clean. A write requires invalidating other copies first (generates bus traffic).

Invalid (I)

This cache line is not valid. Any access requires fetching from L3/memory or another core's cache (cache-to-cache transfer).

False Sharing

False sharing occurs when two threads write to different variables that happen to reside on the same 64-byte cache line. Even though there's no logical data sharing, the hardware coherence protocol bounces the cache line between cores on every write, serializing access and destroying performance.

False sharing, two variables on one cache line: the 64-byte line holds [thread_0_counter | padding | thread_1_counter]. Core 0's write invalidates Core 1's copy; Core 1's write invalidates Core 0's copy; the line ping-pongs between cores on every write.
Fix: Padding / Alignment
Prevent false sharing by padding structures to cache-line boundaries. In C/C++: alignas(64) or __attribute__((aligned(64))). In Java: @Contended annotation (requires -XX:-RestrictContended). In Go: pad structs manually to 64 bytes. The perf c2c tool on Linux can detect false sharing at runtime.
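
A minimal C++ sketch of the before/after, assuming 64-byte lines (C++17's std::hardware_destructive_interference_size expresses this portably where the standard library provides it):

```cpp
#include <atomic>

// Before: both counters typically land on the same 64-byte line, so
// two threads incrementing "their own" counter still fight over it.
struct SharedLine {
    std::atomic<long> counter0{0};
    std::atomic<long> counter1{0};
};

// After: alignas(64) pushes each counter onto its own cache line,
// eliminating the coherence ping-pong.
struct SeparateLines {
    alignas(64) std::atomic<long> counter0{0};
    alignas(64) std::atomic<long> counter1{0};
};

static_assert(sizeof(SeparateLines) >= 128,
              "each counter occupies its own line");
```
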
07 / Specialized Instructions

SIMD, Crypto Extensions & Atomics

SIMD: Single Instruction, Multiple Data

SIMD instructions operate on wide registers (128-512 bits) that pack multiple data elements. A single VADDPS ymm0, ymm1, ymm2 adds 8 floats in parallel. This is how media codecs, ML inference, scientific computing, and even memcpy achieve high throughput.
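
A minimal AVX intrinsics sketch of 8-wide float addition (compile with -mavx; the assumption that n is a multiple of 8 keeps the example short, and a real version would handle the scalar tail):

```cpp
#include <immintrin.h>  // AVX intrinsics
#include <cstddef>

// c[i] = a[i] + b[i], processed 8 floats at a time. Each iteration
// compiles down to a VADDPS on 256-bit YMM registers.
void add_f32(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats (unaligned ok)
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
}
```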

Register width, float32 elements per op, and introduction per extension:

SSE: 128-bit (XMM), 4 floats per op; Pentium III (1999).
AVX: 256-bit (YMM), 8 floats per op; Sandy Bridge (2011).
AVX-512: 512-bit (ZMM), 16 floats per op; Xeon Phi (2016) / Skylake-X (2017).
ARM NEON: 128-bit (V registers), 4 floats per op; ARMv7 / Cortex-A8.
ARM SVE/SVE2: scalable 128-2048 bit, 4-64 floats per op (implementation-dependent); ARMv8.2+ (e.g., Graviton 3).
AVX-512 Frequency Throttling
On many Intel CPUs, heavy AVX-512 usage causes the core (or entire chip) to reduce clock frequency by 100-200 MHz to stay within power limits. This means AVX-512 code can actually slow down surrounding non-SIMD code running on other SMT threads or the same core. Always benchmark end-to-end, not just the SIMD kernel.

AES-NI: Hardware Crypto

AESENC / AESDEC instructions perform one round of AES encryption/decryption in a single instruction (~4 cycles latency, 1 cycle throughput when pipelined). This yields ~10 GB/s AES-256 throughput on modern CPUs, making software AES faster than dedicated hardware in many cases. OpenSSL, BoringSSL, and the Linux kernel all use AES-NI when available.
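
A minimal sketch of one AES-128 block encryption using these intrinsics (compile with -maes; key expansion into the 11 round keys is omitted here):

```cpp
#include <immintrin.h>  // AES-NI intrinsics

// Encrypts one 16-byte block. rk must hold the 11 expanded AES-128
// round keys (from AESKEYGENASSIST or a software key schedule).
__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
    block = _mm_xor_si128(block, rk[0]);         // initial AddRoundKey
    for (int r = 1; r < 10; ++r)
        block = _mm_aesenc_si128(block, rk[r]);  // one full AES round each
    return _mm_aesenclast_si128(block, rk[10]);  // last round (no MixColumns)
}
```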

Atomic Instructions & CAS

Lock-free and concurrent data structures rely on hardware atomic instructions. The most important is Compare-And-Swap (CAS): atomically read a memory location, compare with an expected value, and write a new value only if the comparison succeeds.

CAS operation: read *addr; if the old value equals the expected value, write the new value and succeed; otherwise return the old value so the caller can retry.

On x86: LOCK CMPXCHG (CAS), LOCK XADD (fetch-and-add), LOCK BTS (test-and-set). The LOCK prefix locks the cache line for the duration of the operation. On ARM: the LDXR/STXR (load-exclusive / store-exclusive) pair implements LL/SC (Load-Linked / Store-Conditional), which is more flexible than CAS and avoids the ABA problem at the hardware level.
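
In portable C++, std::atomic's compare-exchange maps to LOCK CMPXCHG on x86 and to an LDXR/STXR loop (or a single CAS on ARMv8.1+). A minimal retry-loop sketch for an arbitrary read-modify-write:

```cpp
#include <atomic>

// Atomically doubles `value` using the canonical CAS retry loop.
// On failure, compare_exchange_weak reloads `old` with the current
// value, so each retry works with fresh data.
long atomic_double(std::atomic<long>& value) {
    long old = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(
            old, old * 2,
            std::memory_order_acq_rel,      // ordering on success
            std::memory_order_relaxed)) {   // ordering on failure
        // another thread won the race; loop retries with updated `old`
    }
    return old * 2;
}
```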

Performance Note
A LOCK CMPXCHG on an uncontended cache line costs ~20 cycles. Under heavy contention (many cores CAS-ing the same line), it can exceed 200 cycles due to cache line bouncing. This is why lock-free queues use padding and per-core sharding to reduce contention on hot cache lines.

Test Yourself

Question 01
A 20-stage pipeline CPU has a branch misprediction rate of 5%. Approximately how many cycles are wasted per 1000 instructions due to mispredictions (assuming one branch every 5 instructions)?
1000 instructions / 5 = 200 branches. 200 branches * 5% mispredict = 10 mispredictions. 10 * 20 cycles penalty = 200 wasted cycles. This illustrates why deep pipelines amplify misprediction costs.
Question 02
Two threads on different cores each increment their own counter variable. Both counters are adjacent in memory within the same 64-byte region. Performance is 10x worse than expected. What is the most likely cause?
This is a textbook false sharing scenario. The two counters occupy the same 64-byte cache line. Every write by one core invalidates the other core's copy, forcing a cache-to-cache transfer (~40-80 cycles) on each access. The fix is to pad each counter to 64 bytes so they occupy separate cache lines.
Question 03
In the MESI protocol, a cache line in state "Exclusive" transitions to which state when the local core writes to it?
When a core writes to an Exclusive line, no bus traffic is needed because no other cache has a copy. The line simply transitions to Modified, indicating it's dirty (differs from main memory). This silent upgrade is a key advantage of the Exclusive state over Shared.
Question 04
What is the primary purpose of the Reorder Buffer (ROB) in an out-of-order CPU?
The ROB maintains program order among instructions that may execute out of order. It enables precise exceptions (if instruction N faults, all prior instructions have committed and no later instructions have) and supports rollback on branch mispredictions by squashing speculated entries.
Question 05
x86-64 has a Total Store Order (TSO) memory model. Which reordering of loads and stores does TSO permit?
TSO guarantees that stores appear in program order and loads appear in program order, but a load can be reordered before an earlier store (to a different address) because the store may still be in the store buffer. This is the only reordering TSO permits, and it's why you need MFENCE (or a LOCK'd instruction) to implement sequential consistency on x86.
Question 06
An application running on a 2-socket NUMA system shows high latency. perf stat reports a high rate of "remote DRAM accesses." What is the most effective fix?
Remote DRAM accesses mean threads are accessing memory attached to the other socket, incurring 40-100ns extra latency. Binding both the process's threads and its memory allocations to the same NUMA node ensures local memory access. SMT doesn't increase memory bandwidth, and NUMA behavior is an architectural issue independent of ISA.
Question 07
Why does ARM use Load-Exclusive / Store-Exclusive (LDXR/STXR) instead of a direct Compare-And-Swap instruction like x86's LOCK CMPXCHG?
LL/SC (Load-Linked / Store-Conditional, which LDXR/STXR implements) tracks whether the cache line was written to at all, not just whether the value changed. This means it naturally detects the ABA scenario (value changes A→B→A) where CAS would incorrectly succeed. It also doesn't require locking the bus, fitting better with RISC philosophy. Note: ARMv8.1's LSE extension did add CAS/CASP instructions, which scale better under contention, but LL/SC remains the fundamental primitive.
Question 08
A function using AVX-512 intrinsics runs 4x faster in a microbenchmark but only 1.5x faster in the full application. Which factor most likely explains this discrepancy?
On many Intel CPUs, sustained AVX-512 execution causes a "license-based" frequency reduction (100-200 MHz lower clocks). This affects not just the SIMD code but ALL code running on the throttled core (and sometimes neighboring cores). The microbenchmark doesn't capture this because it only measures the SIMD kernel in isolation. This is why end-to-end benchmarking is critical for AVX-512 adoption decisions.