HDD Internals
A hard disk drive stores data on spinning magnetic platters. A read/write head floats nanometers above each platter surface on an actuator arm. Data is organized into concentric tracks, each divided into fixed-size sectors (typically 512 bytes or 4 KiB with Advanced Format).
Latency breakdown
Seek time is the time for the head to move to the correct track (typically 3-15 ms on average). Rotational latency is the time for the target sector to spin under the head -- on average, half a rotation. A 7200 RPM drive completes one rotation in ~8.3 ms, so average rotational latency is ~4.2 ms. Transfer time is typically negligible for small reads.
IOPS and scheduling
IOPS (I/O Operations Per Second) for a typical 7200 RPM HDD is roughly 75-150 for random reads. The OS I/O scheduler reorders pending requests to minimize head movement. Classic algorithms include SCAN (elevator) and C-SCAN; Linux implementations include the (now-removed) CFQ (Completely Fair Queuing) and the current mq-deadline scheduler.
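As a back-of-the-envelope check, the IOPS ceiling falls directly out of the latency components above. A minimal sketch in C, assuming an 8.5 ms mid-range seek time (an illustrative figure, not from any specific drive):

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures for a commodity 7200 RPM drive. */
    double rpm = 7200.0;
    double avg_seek_ms = 8.5;                      /* assumed mid-range seek */
    double rotation_ms = 60000.0 / rpm;            /* ~8.33 ms per rotation  */
    double avg_rot_latency_ms = rotation_ms / 2.0; /* ~4.17 ms on average    */

    /* Each random read pays one seek plus half a rotation on average;
       transfer time is ignored as negligible for small reads. */
    double avg_service_ms = avg_seek_ms + avg_rot_latency_ms;
    double iops = 1000.0 / avg_service_ms;

    printf("avg service time: %.2f ms -> ~%.0f random-read IOPS\n",
           avg_service_ms, iops);   /* ~79 IOPS with these numbers */
    return 0;
}
```

With these assumptions the result lands near 79 IOPS, squarely in the 75-150 range quoted above; faster seeks and scheduler reordering push toward the upper end.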
SSD & NAND Flash
SSDs use NAND flash memory -- no moving parts. Data is stored in floating-gate transistors organized into pages (~4-16 KiB) grouped into blocks (~256-4096 pages). Reads and writes happen at page granularity, but erases happen at block granularity -- this asymmetry drives much of SSD complexity.
NAND types
| Type | Bits/Cell | Endurance (P/E cycles) | Speed | Cost |
|---|---|---|---|---|
| SLC | 1 | ~100,000 | Fastest | Highest |
| MLC | 2 | ~10,000 | Fast | Moderate |
| TLC | 3 | ~3,000 | Moderate | Lower |
| QLC | 4 | ~1,000 | Slowest | Lowest |
Flash Translation Layer (FTL)
The FTL maps logical block addresses (LBAs) to physical NAND pages, hiding flash complexities from the OS. It handles:
Write amplification -- because you cannot overwrite a page in place (the whole block must be erased first), the FTL writes data to a new clean page and marks the old one invalid. Garbage collection later reclaims blocks, but may first have to move valid pages, amplifying total writes. A write amplification factor (WAF) of 2-3x is common (see the worked example after this list).
Wear leveling -- distributes writes evenly across all blocks so no single block wears out prematurely. Static wear leveling also relocates cold data to give heavily-written blocks a rest.
TRIM -- lets the OS tell the drive which LBAs no longer hold live data (e.g., after a file is deleted). Without TRIM, the FTL cannot reclaim invalidated pages efficiently, leading to higher write amplification and degraded performance over time (a TRIM sketch also follows this list).
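To make the WAF definition concrete, a tiny sketch in C: WAF is simply the bytes the NAND actually wrote divided by the bytes the host asked to write. The counter values here are hypothetical; real drives expose similar counters via SMART or vendor tools.

```c
#include <stdio.h>
#include <stdint.h>

/* Write amplification factor: total bytes physically written to NAND
   divided by the bytes the host requested. Values below are made up. */
static double waf(uint64_t host_bytes, uint64_t nand_bytes) {
    return (double)nand_bytes / (double)host_bytes;
}

int main(void) {
    /* Example: host wrote 100 GiB, but garbage collection relocated
       valid pages, so the NAND saw 250 GiB of writes. */
    uint64_t gib = 1024ULL * 1024 * 1024;
    printf("WAF = %.1f\n", waf(100 * gib, 250 * gib)); /* prints 2.5 */
    return 0;
}
```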
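On Linux, userspace can trigger TRIM on a mounted filesystem via the FITRIM ioctl, which is what the fstrim(8) utility uses. A minimal sketch, assuming a hypothetical mount point at /mnt, root privileges, and a device that supports discard:

```c
/* Linux-only: ask the kernel to TRIM free space on a mounted filesystem. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */

int main(void) {
    int fd = open("/mnt", O_RDONLY);   /* hypothetical mount point */
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range r = {
        .start  = 0,
        .len    = (__u64)-1,   /* cover the whole filesystem */
        .minlen = 0,
    };
    if (ioctl(fd, FITRIM, &r) < 0) { perror("FITRIM"); close(fd); return 1; }

    /* The kernel updates r.len with the number of bytes trimmed. */
    printf("trimmed %llu bytes\n", (unsigned long long)r.len);
    close(fd);
    return 0;
}
```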
NVMe vs SATA
SATA SSDs use the AHCI protocol designed for spinning disks -- limited to a single command queue of 32 entries. NVMe (Non-Volatile Memory Express) was designed for flash: up to 65,535 queues with 65,536 entries each, communicating over PCIe. NVMe drives achieve 3-7 GB/s sequential reads vs SATA's ~550 MB/s ceiling.
PCIe, DMA & IOMMU
PCIe generations
| Generation | Per-Lane Bandwidth | x4 Link (NVMe) | Encoding |
|---|---|---|---|
| PCIe 3.0 | ~1 GB/s | ~3.9 GB/s | 128b/130b |
| PCIe 4.0 | ~2 GB/s | ~7.9 GB/s | 128b/130b |
| PCIe 5.0 | ~4 GB/s | ~15.8 GB/s | 128b/130b |
| PCIe 6.0 | ~8 GB/s | ~31.5 GB/s | PAM4 + FEC |
Direct Memory Access (DMA)
DMA allows peripherals (disks, NICs) to transfer data directly to/from main memory without involving the CPU for every byte. The CPU sets up a DMA descriptor (source, destination, length), the DMA controller handles the transfer, and raises an interrupt upon completion. This frees CPU cycles for computation while I/O proceeds in parallel.
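The exact descriptor layout is device-specific, but an illustrative sketch in C shows the three pieces of information the CPU hands to a DMA engine before stepping out of the way (the struct and its flags are invented for illustration, not any real device's format):

```c
#include <stdint.h>

/* Illustrative only: real descriptor formats vary per device. */
struct dma_descriptor {
    uint64_t src_addr;   /* bus/physical address to read from         */
    uint64_t dst_addr;   /* bus/physical address to write to          */
    uint32_t length;     /* number of bytes to transfer               */
    uint32_t flags;      /* e.g., "raise interrupt on completion" bit */
};
```

In practice, drivers typically chain many such descriptors into a ring buffer that the device walks autonomously, interrupting the CPU only when work completes.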
IOMMU
The IOMMU (I/O Memory Management Unit) sits between DMA-capable devices and physical memory. It translates device-visible I/O virtual addresses to physical addresses, providing memory isolation (a device can only access memory regions mapped to it) and enabling device passthrough in virtualized environments (e.g., VFIO in Linux). Intel calls theirs VT-d; AMD calls theirs AMD-Vi.
RAID Levels
RAID (Redundant Array of Independent Disks) combines multiple drives to improve performance, capacity, or fault tolerance. Understanding the tradeoffs is critical for system design.
RAID 0 -- Data striped across N drives. Read/write throughput scales with N. Zero redundancy -- any single drive failure loses all data. Usable capacity: 100%.
RAID 1 -- Every write goes to two (or more) drives. Survives one drive failure. Read throughput can double. Usable capacity: 50%. Simple but expensive.
RAID 5 -- Data and parity striped across N drives. Survives one drive failure. Usable capacity: (N-1)/N. Write penalty: each write requires a read-modify-write of the parity (see the parity sketch after this list).
RAID 6 -- Like RAID 5 but with two parity blocks per stripe. Survives two simultaneous drive failures. Higher write penalty. Usable capacity: (N-2)/N.
RAID 10 -- RAID 0 over RAID 1 pairs. Excellent read/write performance; tolerates one drive failure per mirror pair. Usable capacity: 50%. Preferred for databases.
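To make the RAID 5 parity math concrete: parity is the bytewise XOR of the stripe's data blocks, and because XOR is self-inverse, a lost block is recovered by XOR-ing the survivors with the parity. A toy sketch in C (block size and contents are made up; real stripes use KiB-sized chunks):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BLOCK 8   /* toy block size */

/* Accumulate src into acc with bytewise XOR. */
static void xor_into(uint8_t *acc, const uint8_t *src) {
    for (int i = 0; i < BLOCK; i++) acc[i] ^= src[i];
}

int main(void) {
    uint8_t d0[BLOCK] = "drive-0", d1[BLOCK] = "drive-1", d2[BLOCK] = "drive-2";

    /* Parity = d0 XOR d1 XOR d2. */
    uint8_t parity[BLOCK] = {0};
    xor_into(parity, d0); xor_into(parity, d1); xor_into(parity, d2);

    /* Simulate losing drive 1, then rebuild it from the survivors. */
    uint8_t rebuilt[BLOCK] = {0};
    xor_into(rebuilt, d0); xor_into(rebuilt, d2); xor_into(rebuilt, parity);

    printf("rebuilt: %s (%s)\n", (char *)rebuilt,
           memcmp(rebuilt, d1, BLOCK) == 0 ? "matches" : "MISMATCH");
    return 0;
}
```

The write penalty follows from the same identity: updating one data block means reading the old data and old parity, XOR-ing out the old data and XOR-ing in the new, then writing both back.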
File Systems & Journaling
Inodes
An inode is a metadata structure stored on disk that describes a file: permissions, ownership, timestamps, size, and pointers to data blocks. Directories are simply files whose data maps filenames to inode numbers. The inode count is typically fixed at format time (ext4) -- running out of inodes means you can't create files even with free space.
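A quick way to see inode metadata from userspace is stat(2). A minimal sketch (the path is arbitrary):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("/etc/hostname", &st) != 0) { perror("stat"); return 1; }

    /* These fields come straight from the file's inode. */
    printf("inode:   %lu\n", (unsigned long)st.st_ino);
    printf("mode:    %o\n",  (unsigned)(st.st_mode & 07777)); /* perm bits */
    printf("links:   %lu\n", (unsigned long)st.st_nlink);
    printf("size:    %lld bytes\n", (long long)st.st_size);
    printf("uid:gid  %u:%u\n", (unsigned)st.st_uid, (unsigned)st.st_gid);
    return 0;
}
```

Note that the filename itself is absent: it lives in the directory entry, which maps the name to the inode number printed above.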
Journaling
A journal (or log) records intended metadata (and optionally data) changes before applying them to the main file system. On crash recovery, the journal is replayed to bring the file system to a consistent state. This avoids lengthy fsck scans. Journaling modes in ext4:
| Mode | What's journaled | Safety | Performance |
|---|---|---|---|
| journal | Metadata + data | Highest | Slowest |
| ordered (default) | Metadata only, data written first | High | Good |
| writeback | Metadata only, data order not guaranteed | Lower | Fastest |
File system comparison
| FS | Key Features | Max Volume | Use Case |
|---|---|---|---|
| ext4 | Journaling, extents, delayed allocation | 1 EiB | Linux default, general purpose |
| XFS | Excellent large-file performance, parallel I/O | 8 EiB | Large files, media, HPC |
| ZFS | Copy-on-write, checksums, snapshots, built-in RAID | 256 ZiB | Data integrity, NAS, enterprise |
| Btrfs | CoW, snapshots, subvolumes, inline compression | 16 EiB | Linux CoW alternative to ZFS |
Durability Guarantees & Storage Types
fsync and fdatasync
Calling write() only places data in the kernel page cache -- a power loss before the OS flushes it means data loss. fsync(fd) forces all dirty pages and metadata for the file to stable storage. fdatasync(fd) is similar but skips metadata that isn't needed to read the data (e.g., st_atime), making it slightly faster.
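A minimal sketch of the difference in practice: write() alone leaves the data in the page cache, and only the sync call makes it durable (the path is hypothetical):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/durable.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "hello, stable storage\n";
    if (write(fd, msg, sizeof msg - 1) != (ssize_t)(sizeof msg - 1)) {
        perror("write"); return 1;   /* data is still only in the page cache */
    }

    /* Block until the data is on stable storage; use fsync(fd) instead
       if the file's metadata must also be durable. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```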
Write-Ahead Log (WAL)
Databases use a WAL to guarantee durability without flushing every data page on each commit. The sequence: (1) append the change to the WAL, (2) fsync the WAL, (3) acknowledge the commit to the client. Data pages are flushed lazily in the background. On crash, the WAL is replayed to recover committed transactions. PostgreSQL, SQLite, and most ACID databases rely on this pattern.
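A minimal sketch of the commit ordering in C; the record format and file name are made up, and real WALs add checksums, log sequence numbers, and group commit:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Commit sequence: append, flush, then (and only then) acknowledge. */
static int wal_commit(int wal_fd, const char *record, size_t len) {
    /* (1) Append the change to the WAL. */
    if (write(wal_fd, record, len) != (ssize_t)len) return -1;

    /* (2) Force the record to stable storage BEFORE acknowledging. */
    if (fdatasync(wal_fd) != 0) return -1;

    /* (3) Safe to report the commit to the client. */
    return 0;
}

int main(void) {
    int fd = open("/tmp/db.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "SET balance=42\n";
    if (wal_commit(fd, rec, sizeof rec - 1) == 0)
        puts("commit acknowledged");   /* data pages flush lazily later */

    close(fd);
    return 0;
}
```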
O_DIRECT
Opening a file with O_DIRECT bypasses the kernel page cache, issuing I/O directly between user-space buffers and the disk. Databases like MySQL (InnoDB) use this because they manage their own buffer pool and don't want the OS double-caching pages. Buffers, file offsets, and transfer sizes must be aligned, typically to the device's logical block size (often 512 bytes or 4 KiB).
O_DIRECT does not guarantee durability by itself -- you still need fsync() or O_DSYNC to ensure data reaches stable storage. O_DIRECT only bypasses the page cache.
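A sketch of a direct I/O write in C, assuming 4096-byte alignment is sufficient for the device and using a hypothetical path (O_DIRECT is Linux-specific, and some filesystems, e.g., tmpfs, reject it):

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t align = 4096;   /* assumed safe alignment */

    int fd = open("/data/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    /* Buffer, offset, and length must all be aligned. */
    void *buf;
    if (posix_memalign(&buf, align, align) != 0) return 1;
    memset(buf, 'x', align);

    if (pwrite(fd, buf, align, 0) != (ssize_t)align) { perror("pwrite"); return 1; }

    /* Bypassing the page cache is not durability: still flush. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```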
Block vs Object vs File Storage
| Type | Abstraction | Access | Examples |
|---|---|---|---|
| Block | Fixed-size blocks (LBAs) | Low-level, requires FS on top | EBS, iSCSI, local disk |
| File | Hierarchical files & directories | POSIX API, NFS/SMB | NFS, EFS, CIFS |
| Object | Flat namespace, key → blob + metadata | HTTP API (PUT/GET/DELETE) | S3, GCS, MinIO |
Block storage offers the lowest latency and is used for databases and boot volumes. File storage provides shared access with familiar semantics. Object storage scales massively for unstructured data (images, logs, backups) with built-in replication, but comes with higher latency, and some object stores offer only eventual consistency.
Test Yourself
What does O_DIRECT do when opening a file?