Hardware & Compute

Storage & I/O

From spinning platters to flash cells, from RAID arrays to copy-on-write file systems -- how data is persisted, organized, and made durable at every layer of the stack.

01 / Hard Disk Drives

HDD Internals

A hard disk drive stores data on spinning magnetic platters. A read/write head floats nanometers above each platter surface on an actuator arm. Data is organized into concentric tracks, each divided into fixed-size sectors (typically 512 bytes or 4 KiB with Advanced Format).

HDD read path: seek to track → rotational delay → data transfer

Latency breakdown

Seek time is the time for the head to move to the correct track (3-15 ms average). Rotational latency is the time for the target sector to spin under the head -- on average half a rotation. A 7200 RPM drive completes one rotation in ~8.3 ms, so average rotational latency is ~4.15 ms. Transfer time is typically negligible for small reads.

Key Insight
Sequential reads are dramatically faster than random reads on HDDs. Sequential access avoids repeated seeks and rotational delays, achieving 100-200 MB/s, while random 4 KiB reads may yield only 0.5-1 MB/s effective throughput.

IOPS and scheduling

IOPS (I/O Operations Per Second) for a typical 7200 RPM HDD is roughly 75-150 for random reads. The OS I/O scheduler reorders pending requests to minimize head movement. Classic algorithms include SCAN (the elevator algorithm) and C-SCAN; Linux historically used CFQ (Completely Fair Queuing), which has since been superseded by multiqueue schedulers such as mq-deadline and BFQ.
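As a rough illustration of where those numbers come from, here is a minimal sketch (illustrative values only, not a benchmark) that derives average random-read latency and IOPS from the seek and rotation figures above.

```c
#include <stdio.h>

/* Rough model of HDD random-read latency and IOPS.
 * Values are illustrative; real drives vary with queue depth,
 * request locality, and scheduler behavior. */
int main(void) {
    double rpm            = 7200.0;
    double avg_seek_ms    = 8.0;                    /* mid-range of 3-15 ms        */
    double rotation_ms    = 60.0 * 1000.0 / rpm;    /* ~8.33 ms per full rotation  */
    double rot_latency_ms = rotation_ms / 2.0;      /* ~4.17 ms on average         */
    double xfer_ms        = 0.05;                   /* 4 KiB transfer, ~negligible */

    double per_io_ms = avg_seek_ms + rot_latency_ms + xfer_ms;
    printf("avg latency per random read: %.2f ms\n", per_io_ms);
    printf("estimated random-read IOPS : %.0f\n", 1000.0 / per_io_ms);
    return 0;
}
```

With these numbers the model lands around 80 IOPS, consistent with the 75-150 range quoted above.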

02 / Solid-State Drives

SSD & NAND Flash

SSDs use NAND flash memory -- no moving parts. Data is stored in floating-gate (or, in modern 3D NAND, charge-trap) transistors organized into pages (~4-16 KiB) grouped into blocks (~256-4096 pages). Reads and writes happen at page granularity, but erases happen at block granularity -- this asymmetry drives much of SSD complexity.

NAND types

Type | Bits/Cell | Endurance (P/E cycles) | Speed | Cost
SLC | 1 | ~100,000 | Fastest | Highest
MLC | 2 | ~10,000 | Fast | Moderate
TLC | 3 | ~3,000 | Moderate | Lower
QLC | 4 | ~1,000 | Slowest | Lowest

Flash Translation Layer (FTL)

The FTL maps logical block addresses (LBAs) to physical NAND pages, hiding flash complexities from the OS. It handles:

Write amplification -- because you cannot overwrite a page in-place (must erase whole block first), the FTL writes data to a new clean page and marks the old one invalid. Garbage collection later reclaims blocks, but may move valid pages, amplifying total writes. A write amplification factor (WAF) of 2-3x is common.

Wear leveling -- distributes writes evenly across all blocks so no single block wears out prematurely. Static wear leveling also relocates cold data to give heavily-written blocks a rest.
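To make the write-amplification idea above concrete, here is a minimal sketch of the arithmetic (the byte counts are hypothetical; real drives expose comparable counters through SMART or NVMe log pages).

```c
#include <stdio.h>

/* Write amplification factor (WAF) = bytes written to NAND / bytes written by the host.
 * The extra NAND writes come from garbage collection relocating still-valid pages. */
int main(void) {
    double host_bytes = 100.0e9;   /* 100 GB written by the OS (hypothetical)      */
    double gc_bytes   = 150.0e9;   /* valid pages copied during garbage collection */
    double nand_bytes = host_bytes + gc_bytes;

    printf("WAF = %.1fx\n", nand_bytes / host_bytes);   /* 2.5x, within the common 2-3x range */
    return 0;
}
```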

TRIM matters
When the OS deletes a file, the SSD doesn't know those pages are free unless the OS sends a TRIM command. Without TRIM, the FTL cannot reclaim invalidated pages efficiently, leading to higher write amplification and degraded performance over time.

NVMe vs SATA

SATA SSDs use the AHCI protocol designed for spinning disks -- limited to a single command queue of 32 entries. NVMe (Non-Volatile Memory Express) was designed for flash: up to 65,535 queues with 65,536 entries each, communicating over PCIe. NVMe drives achieve 3-7 GB/s sequential reads vs SATA's ~550 MB/s ceiling.

03 / I/O Interfaces

PCIe, DMA & IOMMU

PCIe generations

Generation | Per-Lane Bandwidth | x4 Link (NVMe) | Encoding
PCIe 3.0 | ~1 GB/s | ~3.9 GB/s | 128b/130b
PCIe 4.0 | ~2 GB/s | ~7.9 GB/s | 128b/130b
PCIe 5.0 | ~4 GB/s | ~15.8 GB/s | 128b/130b
PCIe 6.0 | ~8 GB/s | ~31.5 GB/s | PAM4 + FEC
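The per-lane figures follow directly from each generation's raw transfer rate and line encoding. The sketch below shows that arithmetic for the 128b/130b generations; protocol overhead trims a few percent more, and PCIe 6.0's PAM4/FLIT encoding needs a different formula.

```c
#include <stdio.h>

/* Effective per-lane bandwidth (approx.) = raw rate (GT/s) * encoding efficiency / 8 bits per byte. */
int main(void) {
    struct { const char *gen; double gts; } g[] = {
        {"PCIe 3.0",  8.0},
        {"PCIe 4.0", 16.0},
        {"PCIe 5.0", 32.0},
    };
    double eff = 128.0 / 130.0;   /* 128b/130b encoding */

    for (int i = 0; i < 3; i++) {
        double gbs = g[i].gts * eff / 8.0;   /* GB/s per lane */
        printf("%s: ~%.2f GB/s per lane, ~%.1f GB/s on an x4 link\n",
               g[i].gen, gbs, gbs * 4.0);
    }
    return 0;
}
```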

Direct Memory Access (DMA)

DMA allows peripherals (disks, NICs) to transfer data directly to/from main memory without involving the CPU for every byte. The CPU sets up a DMA descriptor (source, destination, length), the DMA controller handles the transfer, and raises an interrupt upon completion. This frees CPU cycles for computation while I/O proceeds in parallel.

DMA transfer flow: CPU programs DMA → device ↔ RAM transfer → interrupt on completion
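The descriptor the CPU hands to the controller is typically just a small record of source, destination, and length. The struct below is a hypothetical illustration of that idea, not any specific controller's register layout (NVMe, for example, uses PRP or SGL lists).

```c
#include <stdint.h>

/* Hypothetical DMA descriptor -- illustrative only; real devices define their own layouts. */
struct dma_descriptor {
    uint64_t src_addr;   /* bus/IOVA address to read from              */
    uint64_t dst_addr;   /* bus/IOVA address to write to               */
    uint32_t length;     /* number of bytes to transfer                */
    uint32_t flags;      /* e.g., "raise an interrupt when complete"   */
    uint64_t next;       /* optional pointer for chaining descriptors  */
};

/* Typical flow: the driver fills in a descriptor, writes its address to a device
 * register (a "doorbell"), and the device copies data to or from RAM on its own,
 * raising an interrupt when the transfer finishes. */
```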

IOMMU

The IOMMU (I/O Memory Management Unit) sits between DMA-capable devices and physical memory. It translates device-visible I/O virtual addresses to physical addresses, providing memory isolation (a device can only access memory regions mapped to it) and enabling device passthrough in virtualized environments (e.g., VFIO in Linux). Intel calls theirs VT-d; AMD calls theirs AMD-Vi.

04 / RAID

RAID Levels

RAID (Redundant Array of Independent Disks) combines multiple drives to improve performance, capacity, or fault tolerance. Understanding the tradeoffs is critical for system design.

RAID 0 -- Striping

Data striped across N drives. Read/write throughput scales with N. Zero redundancy -- any single drive failure loses all data. Usable capacity: 100%.

RAID 1 -- Mirroring

Every write goes to two (or more) drives. Survives one drive failure. Read throughput can double. Usable capacity: 50%. Simple but expensive.

RAID 5 -- Distributed Parity

Data and parity striped across N drives. Survives one drive failure. Usable capacity: (N-1)/N. Write penalty: each write requires read-modify-write of parity.

RAID 6 -- Double Parity

Like RAID 5 but with two parity blocks per stripe. Survives two simultaneous drive failures. Higher write penalty. Usable capacity: (N-2)/N.

RAID 10 -- Striped Mirrors

RAID 0 over RAID 1 pairs. Excellent read/write performance with one-drive-per-pair fault tolerance. Usable capacity: 50%. Preferred for databases.
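The capacity figures above are easy to reproduce for any array size; a minimal sketch, assuming N identical drives and two-way mirrors for RAID 1 and RAID 10:

```c
#include <stdio.h>

/* Usable fraction of raw capacity for common RAID levels with N identical drives. */
int main(void) {
    int n = 4;   /* number of drives in the array */
    printf("RAID 0 : %3.0f%% usable (no redundancy)\n",             100.0);
    printf("RAID 1 : %3.0f%% usable (survives 1 drive failure)\n",  100.0 / 2);
    printf("RAID 5 : %3.0f%% usable (survives 1 drive failure)\n",  100.0 * (n - 1) / n);
    printf("RAID 6 : %3.0f%% usable (survives 2 drive failures)\n", 100.0 * (n - 2) / n);
    printf("RAID 10: %3.0f%% usable (1 failure per mirror pair)\n", 100.0 / 2);
    return 0;
}
```

With n = 4 this prints 75% for RAID 5 and 50% for RAID 6, matching the (N-1)/N and (N-2)/N formulas.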

RAID is not a backup
RAID protects against hardware failure, not accidental deletion, corruption, ransomware, or disasters. Always maintain separate backups.

05 / File Systems

File Systems & Journaling

Inodes

An inode is a metadata structure stored on disk that describes a file: permissions, ownership, timestamps, size, and pointers to data blocks. Directories are simply files whose data maps filenames to inode numbers. The inode count is typically fixed at format time (ext4) -- running out of inodes means you can't create files even with free space.
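Those inode fields are exactly what the stat(2) system call returns; a minimal sketch that prints a few of them ("example.txt" is a placeholder path):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Print selected inode metadata for a file. "example.txt" is a placeholder. */
int main(void) {
    struct stat st;
    if (stat("example.txt", &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("inode number : %llu\n", (unsigned long long)st.st_ino);
    printf("size (bytes) : %lld\n", (long long)st.st_size);
    printf("hard links   : %llu\n", (unsigned long long)st.st_nlink);
    printf("mode (octal) : %o\n",   (unsigned int)st.st_mode & 07777);
    return 0;
}
```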

Journaling

A journal (or log) records intended metadata (and optionally data) changes before applying them to the main file system. On crash recovery, the journal is replayed to bring the file system to a consistent state. This avoids lengthy fsck scans. Journaling modes in ext4:

Mode | What's journaled | Safety | Performance
journal | Metadata + data | Highest | Slowest
ordered (default) | Metadata only, data written first | High | Good
writeback | Metadata only, data order not guaranteed | Lower | Fastest

File system comparison

FS | Key Features | Max Volume | Use Case
ext4 | Journaling, extents, delayed allocation | 1 EiB | Linux default, general purpose
XFS | Excellent large-file performance, parallel I/O | 8 EiB | Large files, media, HPC
ZFS | Copy-on-write, checksums, snapshots, built-in RAID | 256 ZiB | Data integrity, NAS, enterprise
Btrfs | CoW, snapshots, subvolumes, inline compression | 16 EiB | Linux CoW alternative to ZFS

ZFS checksums
ZFS checksums every block of data and metadata using a Merkle tree. On read, if a checksum mismatch is detected and a redundant copy exists (mirror or raidz), ZFS automatically repairs the corruption -- silent data corruption (bit rot) is caught and healed transparently.

06 / Durability & Storage Models

Durability Guarantees & Storage Types

fsync and fdatasync

Calling write() only places data in the kernel page cache -- a power loss before the OS flushes it means data loss. fsync(fd) forces all dirty pages and metadata for the file to stable storage. fdatasync(fd) is similar but skips metadata that isn't needed to read the data (e.g., st_atime), making it slightly faster.

Write durability path: write() → page cache → fsync() → disk platter / NAND
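A minimal sketch of that path in code, using a placeholder filename; without the fsync() call, a power loss after write() returns can still lose the record.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write() only reaches the page cache; fsync() forces data and metadata to stable storage. */
int main(void) {
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);  /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "record-1\n";
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

    if (fsync(fd) != 0) { perror("fsync"); return 1; }   /* durable from this point on */
    /* fdatasync(fd) would skip non-essential metadata such as timestamps. */

    close(fd);
    return 0;
}
```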

Write-Ahead Log (WAL)

Databases use a WAL to guarantee durability without flushing every data page on each commit. The sequence: (1) append the change to the WAL, (2) fsync the WAL, (3) acknowledge the commit to the client. Data pages are flushed lazily in the background. On crash, the WAL is replayed to recover committed transactions. PostgreSQL, SQLite, and most ACID databases rely on this pattern.
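A bare-bones sketch of that commit sequence, assuming a single append-only log file and a made-up record format; real WALs add checksums, log sequence numbers, and group commit.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Commit = (1) append the change to the WAL, (2) fsync the WAL, (3) acknowledge the client.
 * Data pages are flushed lazily later. "wal.log" and the record format are hypothetical. */
static int wal_commit(int wal_fd, const char *record) {
    if (write(wal_fd, record, strlen(record)) < 0) return -1;   /* (1) append           */
    if (fsync(wal_fd) != 0) return -1;                          /* (2) force to storage */
    return 0;                                                   /* (3) caller may ack   */
}

int main(void) {
    int wal_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (wal_fd < 0) { perror("open"); return 1; }

    if (wal_commit(wal_fd, "SET balance=42 WHERE id=7\n") == 0)
        printf("commit acknowledged\n");   /* safe: the WAL record is on stable storage */

    close(wal_fd);
    return 0;
}
```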

O_DIRECT

Opening a file with O_DIRECT bypasses the kernel page cache, issuing I/O directly between user-space buffers and the disk. Databases like MySQL (InnoDB) use this because they manage their own buffer pool and don't want the OS double-caching pages. Buffers, I/O sizes, and file offsets must typically be aligned to the device's logical block size (commonly 512 bytes or 4 KiB).

Important
O_DIRECT does not guarantee durability by itself -- you still need fsync() or O_DSYNC to ensure data reaches stable storage. O_DIRECT only bypasses the page cache.
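A minimal sketch of an O_DIRECT write, assuming a 4 KiB logical block size and a placeholder path; note the aligned buffer and the explicit fsync() for durability.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t blk = 4096;   /* assumed logical block size */
    void *buf;

    /* O_DIRECT requires the buffer, I/O size, and offset to be block-aligned. */
    if (posix_memalign(&buf, blk, blk) != 0) return 1;
    memset(buf, 'A', blk);

    int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, blk) < 0) { perror("write"); return 1; }
    if (fsync(fd) != 0) { perror("fsync"); return 1; }   /* O_DIRECT alone is not durability */

    close(fd);
    free(buf);
    return 0;
}
```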

Block vs Object vs File Storage

Type | Abstraction | Access | Examples
Block | Fixed-size blocks (LBAs) | Low-level, requires FS on top | EBS, iSCSI, local disk
File | Hierarchical files & directories | POSIX API, NFS/SMB | NFS, EFS, CIFS
Object | Flat namespace, key → blob + metadata | HTTP API (PUT/GET/DELETE) | S3, GCS, MinIO

Block storage offers the lowest latency and is used for databases and boot volumes. File storage provides shared access with familiar semantics. Object storage scales massively for unstructured data (images, logs, backups) with built-in replication, but at higher latency and with consistency guarantees that vary by platform (Amazon S3, for example, now offers strong read-after-write consistency).

Test Yourself

Question 01
Why are sequential reads dramatically faster than random reads on an HDD?
Each random read incurs seek time (3-15 ms) plus rotational latency (~4 ms). Sequential reads keep the head on the same or adjacent tracks, eliminating most of this mechanical overhead.
Question 02
What does a higher bit-per-cell count in NAND flash (SLC → QLC) primarily sacrifice?
Storing more bits per cell requires finer voltage level distinctions, which reduces both endurance (fewer program/erase cycles before wear-out) and read/write speed. SLC lasts ~100K cycles; QLC ~1K cycles.
Question 03
Why is the TRIM command important for SSD performance?
Without TRIM, the FTL cannot tell which pages hold deleted data. It must treat them as valid during garbage collection, causing unnecessary page copies and higher write amplification.
Question 04
What is the main advantage of NVMe over SATA for SSDs?
SATA uses AHCI with a single queue of 32 commands. NVMe supports up to 65,535 queues each with 65,536 entries, communicating directly over PCIe lanes for much higher throughput and lower latency.
Question 05
In RAID 5 with 4 drives, what is the usable capacity?
RAID 5 uses one drive's worth of capacity for distributed parity. With 4 drives, usable capacity is (N-1)/N = 3/4 = 75%.
Question 06
What does DMA (Direct Memory Access) allow?
DMA offloads bulk data movement from the CPU. The CPU sets up a transfer descriptor, then the DMA controller moves data between device and RAM independently, raising an interrupt when complete.
Question 07
What distinguishes ZFS from ext4 regarding data integrity?
ZFS stores checksums in a Merkle tree for every data and metadata block. If corruption is detected on read and a mirror or raidz copy exists, ZFS automatically repairs it. ext4 has metadata checksums (optional) but no data checksum or self-healing.
Question 08
Why do databases like PostgreSQL use a Write-Ahead Log (WAL)?
The WAL records changes sequentially. Only the WAL must be fsynced at commit time -- an efficient sequential write. Dirty data pages are flushed lazily in the background. On crash, the WAL is replayed to recover committed transactions.
Question 09
What does O_DIRECT do when opening a file?
O_DIRECT causes reads and writes to go directly between user-space buffers and the storage device, skipping the page cache. It does NOT guarantee durability -- you still need fsync() for that.
Question 10
Which storage model is best suited for storing billions of images with HTTP API access and built-in replication?
Object storage (e.g., S3, GCS) uses a flat namespace accessed via HTTP APIs, scales to billions of objects, and provides built-in replication and durability. It's ideal for unstructured data like images, logs, and backups.