HDD Internals
A hard disk drive stores data on spinning magnetic platters. A read/write head floats nanometers above each platter surface on an actuator arm. Data is organized into concentric tracks, each divided into fixed-size sectors (typically 512 bytes or 4 KiB with Advanced Format).
Latency breakdown
Seek time is the time for the head to move to the correct track (typically 3-15 ms on average). Rotational latency is the time for the target sector to spin under the head -- on average, half a rotation. A 7200 RPM drive completes one rotation in ~8.3 ms, so average rotational latency is ~4.2 ms. Transfer time is typically negligible for small reads.
IOPS and scheduling
IOPS (I/O Operations Per Second) for a typical 7200 RPM HDD is roughly 75-150 for random reads. The OS I/O scheduler reorders pending requests to minimize head movement. Classic algorithms include SCAN (elevator) and C-SCAN; Linux implementations include the (now-removed) CFQ (Completely Fair Queuing) and the current mq-deadline scheduler.
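As a back-of-the-envelope check, the IOPS ceiling falls directly out of the latency components above. A minimal sketch in C, assuming an 8.5 ms mid-range seek time (an illustrative figure, not from any specific drive):

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures for a commodity 7200 RPM drive. */
    double rpm = 7200.0;
    double avg_seek_ms = 8.5;                      /* assumed mid-range seek */
    double rotation_ms = 60000.0 / rpm;            /* ~8.33 ms per rotation  */
    double avg_rot_latency_ms = rotation_ms / 2.0; /* ~4.17 ms on average    */

    /* Each random read pays one seek plus half a rotation on average;
       transfer time is ignored as negligible for small reads. */
    double avg_service_ms = avg_seek_ms + avg_rot_latency_ms;
    double iops = 1000.0 / avg_service_ms;

    printf("avg service time: %.2f ms -> ~%.0f random-read IOPS\n",
           avg_service_ms, iops);   /* ~79 IOPS with these numbers */
    return 0;
}
```

With these assumptions the result lands near 79 IOPS, squarely in the 75-150 range quoted above; faster seeks and scheduler reordering push toward the upper end.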
SSD & NAND Flash
SSDs use NAND flash memory -- no moving parts. Data is stored in floating-gate transistors organized into pages (~4-16 KiB) grouped into blocks (~256-4096 pages). Reads and writes happen at page granularity, but erases happen at block granularity -- this asymmetry drives much of SSD complexity.
NAND types
| Type | Bits/Cell | Endurance (P/E cycles) | Speed | Cost |
|---|---|---|---|---|
| SLC | 1 | ~100,000 | Fastest | Highest |
| MLC | 2 | ~10,000 | Fast | Moderate |
| TLC | 3 | ~3,000 | Moderate | Lower |
| QLC | 4 | ~1,000 | Slowest | Lowest |
Flash Translation Layer (FTL)
The FTL maps logical block addresses (LBAs) to physical NAND pages, hiding flash complexities from the OS. It handles:
Write amplification -- because you cannot overwrite a page in place (the whole block must be erased first), the FTL writes data to a new clean page and marks the old one invalid. Garbage collection later reclaims blocks, but may first have to move valid pages, amplifying total writes. A write amplification factor (WAF) of 2-3x is common (see the worked example after this list).
Wear leveling -- distributes writes evenly across all blocks so no single block wears out prematurely. Static wear leveling also relocates cold data to give heavily-written blocks a rest.
TRIM -- lets the OS tell the drive which LBAs no longer hold live data (e.g., after a file is deleted). Without TRIM, the FTL cannot reclaim invalidated pages efficiently, leading to higher write amplification and degraded performance over time (a TRIM sketch also follows this list).
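To make the WAF definition concrete, a tiny sketch in C: WAF is simply the bytes the NAND actually wrote divided by the bytes the host asked to write. The counter values here are hypothetical; real drives expose similar counters via SMART or vendor tools.

```c
#include <stdio.h>
#include <stdint.h>

/* Write amplification factor: total bytes physically written to NAND
   divided by the bytes the host requested. Values below are made up. */
static double waf(uint64_t host_bytes, uint64_t nand_bytes) {
    return (double)nand_bytes / (double)host_bytes;
}

int main(void) {
    /* Example: host wrote 100 GiB, but garbage collection relocated
       valid pages, so the NAND saw 250 GiB of writes. */
    uint64_t gib = 1024ULL * 1024 * 1024;
    printf("WAF = %.1f\n", waf(100 * gib, 250 * gib)); /* prints 2.5 */
    return 0;
}
```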
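On Linux, userspace can trigger TRIM on a mounted filesystem via the FITRIM ioctl, which is what the fstrim(8) utility uses. A minimal sketch, assuming a hypothetical mount point at /mnt, root privileges, and a device that supports discard:

```c
/* Linux-only: ask the kernel to TRIM free space on a mounted filesystem. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */

int main(void) {
    int fd = open("/mnt", O_RDONLY);   /* hypothetical mount point */
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range r = {
        .start  = 0,
        .len    = (__u64)-1,   /* cover the whole filesystem */
        .minlen = 0,
    };
    if (ioctl(fd, FITRIM, &r) < 0) { perror("FITRIM"); close(fd); return 1; }

    /* The kernel updates r.len with the number of bytes trimmed. */
    printf("trimmed %llu bytes\n", (unsigned long long)r.len);
    close(fd);
    return 0;
}
```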
NVMe vs SATA
SATA SSDs use the AHCI protocol designed for spinning disks -- limited to a single command queue of 32 entries. NVMe (Non-Volatile Memory Express) was designed for flash: up to 65,535 queues with 65,536 entries each, communicating over PCIe. NVMe drives achieve 3-7 GB/s sequential reads vs SATA's ~550 MB/s ceiling.
PCIe, DMA & IOMMU
PCIe generations
| Generation | Per-Lane Bandwidth | x4 Link (NVMe) | Encoding |
|---|---|---|---|
| PCIe 3.0 | ~1 GB/s | ~3.9 GB/s | 128b/130b |
| PCIe 4.0 | ~2 GB/s | ~7.9 GB/s | 128b/130b |
| PCIe 5.0 | ~4 GB/s | ~15.8 GB/s | 128b/130b |
| PCIe 6.0 | ~8 GB/s | ~31.5 GB/s | PAM4 + FEC |
Direct Memory Access (DMA)
DMA allows peripherals (disks, NICs) to transfer data directly to/from main memory without involving the CPU for every byte. The CPU sets up a DMA descriptor (source, destination, length), the DMA controller handles the transfer, and raises an interrupt upon completion. This frees CPU cycles for computation while I/O proceeds in parallel.
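The exact descriptor layout is device-specific, but an illustrative sketch in C shows the three pieces of information the CPU hands to a DMA engine before stepping out of the way (the struct and its flags are invented for illustration, not any real device's format):

```c
#include <stdint.h>

/* Illustrative only: real descriptor formats vary per device. */
struct dma_descriptor {
    uint64_t src_addr;   /* bus/physical address to read from         */
    uint64_t dst_addr;   /* bus/physical address to write to          */
    uint32_t length;     /* number of bytes to transfer               */
    uint32_t flags;      /* e.g., "raise interrupt on completion" bit */
};
```

In practice, drivers typically chain many such descriptors into a ring buffer that the device walks autonomously, interrupting the CPU only when work completes.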
IOMMU
The IOMMU (I/O Memory Management Unit) sits between DMA-capable devices and physical memory. It translates device-visible I/O virtual addresses to physical addresses, providing memory isolation (a device can only access memory regions mapped to it) and enabling device passthrough in virtualized environments (e.g., VFIO in Linux). Intel calls theirs VT-d; AMD calls theirs AMD-Vi.
RAID Levels
RAID (Redundant Array of Independent Disks) combines multiple drives to improve performance, capacity, or fault tolerance. Understanding the tradeoffs is critical for system design.
RAID 0 -- Data striped across N drives. Read/write throughput scales with N. Zero redundancy -- any single drive failure loses all data. Usable capacity: 100%.
RAID 1 -- Every write goes to two (or more) drives. Survives one drive failure. Read throughput can double. Usable capacity: 50%. Simple but expensive.
RAID 5 -- Data and parity striped across N drives. Survives one drive failure. Usable capacity: (N-1)/N. Write penalty: each write requires a read-modify-write of the parity (see the parity sketch after this list).
RAID 6 -- Like RAID 5 but with two parity blocks per stripe. Survives two simultaneous drive failures. Higher write penalty. Usable capacity: (N-2)/N.
RAID 10 -- RAID 0 over RAID 1 pairs. Excellent read/write performance; tolerates one drive failure per mirror pair. Usable capacity: 50%. Preferred for databases.
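To make the RAID 5 parity math concrete: parity is the bytewise XOR of the stripe's data blocks, and because XOR is self-inverse, a lost block is recovered by XOR-ing the survivors with the parity. A toy sketch in C (block size and contents are made up; real stripes use KiB-sized chunks):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BLOCK 8   /* toy block size */

/* Accumulate src into acc with bytewise XOR. */
static void xor_into(uint8_t *acc, const uint8_t *src) {
    for (int i = 0; i < BLOCK; i++) acc[i] ^= src[i];
}

int main(void) {
    uint8_t d0[BLOCK] = "drive-0", d1[BLOCK] = "drive-1", d2[BLOCK] = "drive-2";

    /* Parity = d0 XOR d1 XOR d2. */
    uint8_t parity[BLOCK] = {0};
    xor_into(parity, d0); xor_into(parity, d1); xor_into(parity, d2);

    /* Simulate losing drive 1, then rebuild it from the survivors. */
    uint8_t rebuilt[BLOCK] = {0};
    xor_into(rebuilt, d0); xor_into(rebuilt, d2); xor_into(rebuilt, parity);

    printf("rebuilt: %s (%s)\n", (char *)rebuilt,
           memcmp(rebuilt, d1, BLOCK) == 0 ? "matches" : "MISMATCH");
    return 0;
}
```

The write penalty follows from the same identity: updating one data block means reading the old data and old parity, XOR-ing out the old data and XOR-ing in the new, then writing both back.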
File Systems & Journaling
Inodes
An inode is a metadata structure stored on disk that describes a file: permissions, ownership, timestamps, size, and pointers to data blocks. Directories are simply files whose data maps filenames to inode numbers. The inode count is typically fixed at format time (ext4) -- running out of inodes means you can't create files even with free space.
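A quick way to see inode metadata from userspace is stat(2). A minimal sketch (the path is arbitrary):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("/etc/hostname", &st) != 0) { perror("stat"); return 1; }

    /* These fields come straight from the file's inode. */
    printf("inode:   %lu\n", (unsigned long)st.st_ino);
    printf("mode:    %o\n",  (unsigned)(st.st_mode & 07777)); /* perm bits */
    printf("links:   %lu\n", (unsigned long)st.st_nlink);
    printf("size:    %lld bytes\n", (long long)st.st_size);
    printf("uid:gid  %u:%u\n", (unsigned)st.st_uid, (unsigned)st.st_gid);
    return 0;
}
```

Note that the filename itself is absent: it lives in the directory entry, which maps the name to the inode number printed above.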
Journaling
A journal (or log) records intended metadata (and optionally data) changes before applying them to the main file system. On crash recovery, the journal is replayed to bring the file system to a consistent state. This avoids lengthy fsck scans. Journaling modes in ext4:
| Mode | What's journaled | Safety | Performance |
|---|---|---|---|
| journal | Metadata + data | Highest | Slowest |
| ordered (default) | Metadata only, data written first | High | Good |
| writeback | Metadata only, data order not guaranteed | Lower | Fastest |
File system comparison
| FS | Key Features | Max Volume | Use Case |
|---|---|---|---|
| ext4 | Journaling, extents, delayed allocation | 1 EiB | Linux default, general purpose |
| XFS | Excellent large-file performance, parallel I/O | 8 EiB | Large files, media, HPC |
| ZFS | Copy-on-write, checksums, snapshots, built-in RAID | 256 ZiB | Data integrity, NAS, enterprise |
| Btrfs | CoW, snapshots, subvolumes, inline compression | 16 EiB | Linux CoW alternative to ZFS |
Durability Guarantees & Storage Types
fsync and fdatasync
Calling write() only places data in the kernel page cache -- a power loss before the OS flushes it means data loss. fsync(fd) forces all dirty pages and metadata for the file to stable storage. fdatasync(fd) is similar but skips metadata that isn't needed to read the data (e.g., st_atime), making it slightly faster.
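A minimal sketch of the difference in practice: write() alone leaves the data in the page cache, and only the sync call makes it durable (the path is hypothetical):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/durable.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "hello, stable storage\n";
    if (write(fd, msg, sizeof msg - 1) != (ssize_t)(sizeof msg - 1)) {
        perror("write"); return 1;   /* data is still only in the page cache */
    }

    /* Block until the data is on stable storage; use fsync(fd) instead
       if the file's metadata must also be durable. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```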
Write-Ahead Log (WAL)
Databases use a WAL to guarantee durability without flushing every data page on each commit. The sequence: (1) append the change to the WAL, (2) fsync the WAL, (3) acknowledge the commit to the client. Data pages are flushed lazily in the background. On crash, the WAL is replayed to recover committed transactions. PostgreSQL, SQLite, and most ACID databases rely on this pattern.
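A minimal sketch of the commit ordering in C; the record format and file name are made up, and real WALs add checksums, log sequence numbers, and group commit:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Commit sequence: append, flush, then (and only then) acknowledge. */
static int wal_commit(int wal_fd, const char *record, size_t len) {
    /* (1) Append the change to the WAL. */
    if (write(wal_fd, record, len) != (ssize_t)len) return -1;

    /* (2) Force the record to stable storage BEFORE acknowledging. */
    if (fdatasync(wal_fd) != 0) return -1;

    /* (3) Safe to report the commit to the client. */
    return 0;
}

int main(void) {
    int fd = open("/tmp/db.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "SET balance=42\n";
    if (wal_commit(fd, rec, sizeof rec - 1) == 0)
        puts("commit acknowledged");   /* data pages flush lazily later */

    close(fd);
    return 0;
}
```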
O_DIRECT
Opening a file with O_DIRECT bypasses the kernel page cache, issuing I/O directly between user-space buffers and the disk. Databases like MySQL (InnoDB) use this because they manage their own buffer pool and don't want the OS double-caching pages. Buffers, file offsets, and transfer sizes must be aligned, typically to the device's logical block size (often 512 bytes or 4 KiB).
O_DIRECT does not guarantee durability by itself -- you still need fsync() or O_DSYNC to ensure data reaches stable storage. O_DIRECT only bypasses the page cache.
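A sketch of a direct I/O write in C, assuming 4096-byte alignment is sufficient for the device and using a hypothetical path (O_DIRECT is Linux-specific, and some filesystems, e.g., tmpfs, reject it):

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t align = 4096;   /* assumed safe alignment */

    int fd = open("/data/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    /* Buffer, offset, and length must all be aligned. */
    void *buf;
    if (posix_memalign(&buf, align, align) != 0) return 1;
    memset(buf, 'x', align);

    if (pwrite(fd, buf, align, 0) != (ssize_t)align) { perror("pwrite"); return 1; }

    /* Bypassing the page cache is not durability: still flush. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```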
Block vs Object vs File Storage
| Type | Abstraction | Access | Examples |
|---|---|---|---|
| Block | Fixed-size blocks (LBAs) | Low-level, requires FS on top | EBS, iSCSI, local disk |
| File | Hierarchical files & directories | POSIX API, NFS/SMB | NFS, EFS, CIFS |
| Object | Flat namespace, key → blob + metadata | HTTP API (PUT/GET/DELETE) | S3, GCS, MinIO |
Block storage offers the lowest latency and is used for databases and boot volumes. File storage provides shared access with familiar semantics. Object storage scales massively for unstructured data (images, logs, backups) with built-in replication, but comes with higher latency, and some object stores offer only eventual consistency.
Test Yourself
What does O_DIRECT do when opening a file?