
Distributed Systems Fundamentals

The core principles behind building reliable systems across multiple machines: time, consensus, consistency, replication, partitioning, transactions, and failure handling.

01 / Fallacies & Foundations

The 8 Fallacies of Distributed Computing

Peter Deutsch (with the eighth added later by James Gosling) identified eight false assumptions that developers new to distributed systems invariably make. Every design decision in this field stems from accepting that these assumptions are wrong.

1. The network is reliable

Packets get dropped, cables get cut, switches fail. You must design for message loss.

2. Latency is zero

Every network hop adds delay. Cross-datacenter calls can be 50-150ms. Design accordingly.

3. Bandwidth is infinite

Serialization overhead and network saturation are real constraints on throughput.

4. The network is secure

Every network boundary is an attack surface. Encrypt, authenticate, authorize.

5. Topology doesn't change

Servers come and go. Cloud instances are ephemeral. Service discovery matters.

6. There is one administrator

Multiple teams, orgs, and cloud providers own different parts of the infrastructure.

7. Transport cost is zero

Serialization, deserialization, and moving data across the wire all have real costs.

8. The network is homogeneous

Different hardware, OS versions, protocols, and software stacks must interoperate.

Why This Matters
Every protocol, algorithm, and pattern in distributed systems exists because one or more of these fallacies was violated. When you see retry logic, timeouts, consensus protocols, or replication -- they all trace back to these realities.
02 / Time & Ordering

Clocks in Distributed Systems

There is no global clock. Physical clocks drift, and NTP can only reduce -- not eliminate -- skew. Distributed systems need logical or hybrid clocks to establish event ordering.

Lamport Timestamps

Each process maintains a counter. On local events, increment. On send, attach the counter. On receive, set counter to max(local, received) + 1. This gives a partial order: if a → b then L(a) < L(b), but the converse is not true.

Lamport Clock Rules
Local event: C = C + 1
Send: C = C + 1, then attach C to the message
Receive: C = max(C, msg.C) + 1
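
The rules translate directly into code. A minimal sketch in Python (class and method names are illustrative, not from any particular library):

Lamport Clock (Python sketch)
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Rule 1: increment on every local event.
        self.counter += 1
        return self.counter

    def send(self):
        # Rule 2: increment, then attach the counter to the outgoing message.
        self.counter += 1
        return self.counter  # timestamp carried in the message

    def receive(self, msg_counter):
        # Rule 3: take the max of local and received, then increment.
        self.counter = max(self.counter, msg_counter) + 1
        return self.counter

# Usage: if a -> b (a happened before b), then L(a) < L(b).
a, b = LamportClock(), LamportClock()
t_send = a.send()           # event on process A
t_recv = b.receive(t_send)  # causally later event on process B
assert t_send < t_recv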

Vector Clocks

Each process maintains a vector of counters, one per process. This captures causal ordering: you can determine if two events are causally related or concurrent. The tradeoff is O(N) space where N is the number of processes.

Vector Clock Comparison
VC(a) < VC(b) iff every entry of VC(a) is ≤ the corresponding entry of VC(b) and at least one entry is strictly less. If neither clock dominates the other, the events are concurrent.
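
The comparison rule is easy to state in code. A small sketch, assuming each clock is a plain list indexed by process ID:

Vector Clock Comparison (Python sketch)
def happened_before(a, b):
    # VC(a) < VC(b): every entry of a is <= the matching entry of b,
    # and at least one entry is strictly less.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def concurrent(a, b):
    # Neither clock dominates the other.
    return not happened_before(a, b) and not happened_before(b, a)

# Usage with three processes:
assert happened_before([1, 0, 0], [2, 1, 0])  # causally ordered
assert concurrent([2, 0, 0], [0, 1, 0])       # concurrent events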

Hybrid Logical Clocks (HLC)

Combines physical time with a logical counter. Uses the physical timestamp when possible, falls back to the logical component to break ties. Gives you causality tracking with timestamps that are close to wall-clock time. Used in CockroachDB.
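
A condensed sketch of the HLC update rules from the Kulkarni et al. formulation; using time.time_ns() as the physical clock source is an assumption for illustration:

Hybrid Logical Clock (Python sketch)
import time

class HLC:
    def __init__(self):
        self.l = 0  # largest physical time seen so far
        self.c = 0  # logical counter to break ties when physical time stalls

    def now(self):
        # Local or send event.
        pt = time.time_ns()  # assumed physical clock source
        prev_l = self.l
        self.l = max(prev_l, pt)
        self.c = self.c + 1 if self.l == prev_l else 0
        return (self.l, self.c)

    def update(self, m_l, m_c):
        # Receive event: merge the incoming timestamp (m_l, m_c).
        pt = time.time_ns()
        prev_l = self.l
        self.l = max(prev_l, m_l, pt)
        if self.l == prev_l and self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif self.l == prev_l:
            self.c += 1
        elif self.l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        return (self.l, self.c)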

Google TrueTime

Hardware-assisted: GPS receivers and atomic clocks in every datacenter. The API returns an interval [earliest, latest] instead of a point. Spanner waits out the uncertainty window before committing, guaranteeing external consistency. Achieves ~1-7ms uncertainty bounds.
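
TrueTime itself depends on Google's hardware, but the commit-wait idea can be sketched against a hypothetical interval API; EPSILON here is an assumed uncertainty bound, not a figure from any real deployment:

Commit Wait (Python sketch)
import time
from dataclasses import dataclass

EPSILON = 0.004  # assumed +/- 4 ms clock uncertainty

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    # Hypothetical TrueTime-like call: returns an interval, not a point.
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(commit_ts):
    # Spanner-style commit wait: only acknowledge once commit_ts is
    # guaranteed to be in the past on every node's clock.
    while tt_now().earliest <= commit_ts:
        time.sleep(0.0005)
    # ... acknowledge the commit to the client here

# Usage: pick the commit timestamp at the upper bound of "now", then wait it out.
commit(tt_now().latest)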

Clock | Ordering | Space | Needs Special HW
Lamport | Partial (happens-before) | O(1) | No
Vector | Causal | O(N) | No
HLC | Causal + wall-clock approx | O(1) | No
TrueTime | External consistency | O(1) | Yes (GPS/atomic)
03 / Consensus & Consistency

Agreement and Consistency Models

FLP Impossibility

Fischer, Lynch, and Paterson proved (1985) that in an asynchronous system where even one process can crash, no deterministic consensus algorithm can guarantee termination. In practice, we work around FLP by using timeouts (partial synchrony), randomization, or failure detectors.

FLP Implication
FLP does not say consensus is impossible -- it says you cannot have all three of safety, liveness, and fault tolerance in a purely asynchronous model. Real systems use timeouts to escape this bound.

Paxos (Conceptual)

Lamport's protocol for reaching consensus on a single value. Three roles: proposers, acceptors, learners. Two phases: Prepare/Promise (proposer claims a ballot number) and Accept/Accepted (proposer sends value, acceptors agree). Correct but notoriously difficult to implement and extend to multi-decree (log) use cases.
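
To make the acceptor's role concrete, here is a sketch of single-decree acceptor logic only; it omits the proposer, learners, and persistence, so it is an illustration rather than a usable Paxos:

Paxos Acceptor (Python sketch)
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest ballot promised so far
        self.accepted_n = -1        # ballot of the accepted value, if any
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1: promise not to accept ballots lower than n,
        # and report any previously accepted value.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        # Phase 2: accept unless a higher ballot has been promised since.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject", self.promised_n)

A proposer needs promises from a majority before starting phase 2, and if any promise reports a previously accepted value, it must propose the value with the highest ballot number it saw.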

Raft

Designed for understandability. Decomposes consensus into three sub-problems:

Raft Sub-Problems
Leader Election
Log Replication
Safety

Leader Election: Nodes start as followers. If a follower receives no heartbeat within its election timeout, it becomes a candidate, increments its term, and requests votes. A candidate wins with a majority. Each term has at most one leader.

Log Replication: The leader receives client requests, appends entries to its log, and replicates them to followers via AppendEntries RPCs. An entry is committed once a majority of nodes have persisted it. Followers apply committed entries to their state machines.

Terms: Monotonically increasing logical clock. Every RPC includes the sender's term. If a node receives a higher term, it steps down. Stale-term messages are rejected.
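
A sketch of how a node might apply the term rules to every incoming RPC (field and role names are illustrative, not taken from any specific Raft implementation):

Raft Term Check (Python sketch)
FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class Node:
    def __init__(self):
        self.current_term = 0
        self.role = FOLLOWER
        self.voted_for = None

    def on_rpc(self, sender_term):
        # Rule 1: a higher term always makes us adopt it and step down.
        if sender_term > self.current_term:
            self.current_term = sender_term
            self.role = FOLLOWER
            self.voted_for = None
        # Rule 2: a stale term is rejected outright.
        if sender_term < self.current_term:
            return False  # reject; the reply carries our (higher) term
        return True       # process the RPC normally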

Consistency Models

Model | Guarantee | Example Systems
Linearizability | Operations appear instantaneous at some point between invocation and response. Strongest single-object model. | Zookeeper (writes), etcd
Sequential | All processes see the same order of operations, consistent with each process's program order. | Zookeeper (reads)
Causal | Causally related operations are seen in order; concurrent operations may be seen in any order. | MongoDB (causal sessions)
Eventual | If no new updates, all replicas converge to the same value. No ordering guarantees during convergence. | DynamoDB, Cassandra
Tunable | Application chooses consistency per operation via quorum settings (R, W, N). | Cassandra, Riak

Practical Rule
Linearizability is expensive (coordination required). Most systems pick the weakest consistency model their application can tolerate, then strengthen it only where needed (e.g., for payments, locks, or leader election).
04 / Replication

Replication Strategies

Replication provides fault tolerance and read scalability. The key question is: who accepts writes?

Single-Leader (Primary-Backup)

One node (leader) handles all writes and streams changes to followers. Followers serve reads. Simple, but the leader is a bottleneck and a single point of failure until failover completes. Replication can be synchronous (strong consistency, higher latency) or asynchronous (risk of data loss on leader failure).

Multi-Leader

Multiple nodes accept writes, each replicating to others. Useful for multi-datacenter setups. The hard problem is conflict resolution: last-writer-wins (LWW), merge functions, or CRDTs. Operational complexity is significantly higher.
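
To make "CRDT" concrete, here is a sketch of a grow-only counter, one of the simplest conflict-free replicated data types; the node IDs and fixed cluster size are simplifying assumptions:

G-Counter CRDT (Python sketch)
class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes  # one slot per replica

    def increment(self):
        # Each replica only ever increments its own slot.
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter how merges are ordered or repeated.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas accept writes independently, then merge:
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3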

Leaderless (Dynamo-style)

Any replica accepts reads and writes. Uses quorum parameters: with N replicas, write to W and read from R. As long as R + W > N, reads and writes overlap and you are guaranteed to see the latest write.

Quorum Overlap (N=3, W=2, R=2)
Replica A (W, R)
Replica B (W, R)
Replica C

At least one replica in both the write set and read set has the latest value.
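
The overlap claim can be checked by brute force. A small sketch that enumerates every possible write set and read set:

Quorum Overlap Check (Python sketch)
from itertools import combinations

def quorums_always_overlap(n, w, r):
    # True if every W-node write set intersects every R-node read set.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

assert quorums_always_overlap(3, 2, 2)      # R + W > N: overlap guaranteed
assert not quorums_always_overlap(3, 1, 2)  # R + W = N: overlap not guaranteed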

Strategy | Write Throughput | Consistency | Conflict Handling
Single-leader | Limited by leader | Strong (sync) or eventual (async) | None (single writer)
Multi-leader | Higher (parallel writes) | Eventual + conflict resolution | LWW, CRDTs, custom merge
Leaderless | High | Tunable via R, W, N | Read repair, anti-entropy
05 / Partitioning

Data Partitioning (Sharding)

When data outgrows a single node, you split it across partitions. The goals: even data distribution and even query load.

Hash Partitioning

Apply a hash function to the key and assign to a partition based on the hash. Good distribution but destroys key ordering -- range queries require scatter-gather across all partitions.

Consistent Hashing

Nodes are placed on a hash ring. Each key is assigned to the next node clockwise from its hash position. Adding or removing a node only affects its neighbors, not the entire cluster.

Consistent Hash Ring
Node A @ 90°
Node B @ 180°
Node C @ 270°
Node D @ 0°

Virtual Nodes (vnodes): Each physical node gets many positions on the ring. This smooths out load imbalances caused by uneven hash distribution or heterogeneous hardware. Cassandra, Riak, and the original Dynamo design all use vnodes.
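
A compact sketch of a ring with virtual nodes, using a sorted list of hash positions; the MD5 hash and the vnode count of 64 are arbitrary choices for illustration:

Consistent Hash Ring (Python sketch)
import bisect
import hashlib

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each physical node claims `vnodes` positions on the ring.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise: first position >= hash(key), wrapping past zero.
        idx = bisect.bisect_left(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["A", "B", "C", "D"])
print(ring.node_for("user:42"))  # stable assignment; only neighbors move on resize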

Range Partitioning

Keys are sorted and split into contiguous ranges. Preserves ordering (efficient range scans) but risks hot spots if keys are not uniformly distributed (e.g., timestamp-prefixed keys).

Rebalancing

When nodes join or leave, data must be redistributed. Strategies: fixed number of partitions (only move partitions, not split them), dynamic splitting (split when partitions get too large), or proportional to node count. Rebalancing should be gradual and not block reads/writes.

Key Tradeoff
Hash partitioning gives even distribution but kills range queries. Range partitioning preserves order but can create hot spots. Many systems combine the two; DynamoDB, for example, hashes the partition key and sorts by an optional sort key within each partition.
06 / Transactions & Failures

Distributed Transactions and Failure Handling

Two-Phase Commit (2PC)

A coordinator asks all participants to prepare (vote yes/no). If all vote yes, it sends commit; otherwise abort. The problem: if the coordinator crashes after prepare but before commit/abort, participants are blocked -- they cannot proceed and hold locks indefinitely.

2PC Protocol
Coordinator: Prepare
Participants: Vote
Coordinator: Commit/Abort
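
A sketch of the coordinator's logic, with the blocking window called out; the participant interface (prepare/commit/abort) is hypothetical:

2PC Coordinator (Python sketch)
def two_phase_commit(participants, txn):
    # Phase 1: ask every participant to prepare (acquire locks, persist intent).
    votes = [p.prepare(txn) for p in participants]

    # Phase 2: commit only if everyone voted yes; otherwise abort everywhere.
    decision = "commit" if all(votes) else "abort"
    # NOTE: if the coordinator crashes right here -- after prepare, before
    # broadcasting the decision -- participants are blocked holding locks.
    for p in participants:
        if decision == "commit":
            p.commit(txn)
        else:
            p.abort(txn)
    return decision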

Saga Pattern

Breaks a distributed transaction into a sequence of local transactions, each with a compensating action (undo). If step N fails, compensating actions for steps N-1 through 1 are executed in reverse.

Choreography: Each service publishes events; the next service reacts. Simple but hard to track and debug as the number of steps grows.

Orchestration: A central orchestrator directs the saga, calling each service in sequence. Easier to reason about but introduces a single coordination point.
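
An orchestrated saga can be sketched as a list of (action, compensation) pairs; the service names in the usage example are hypothetical:

Saga Orchestrator (Python sketch)
def run_saga(steps):
    # steps: list of (action, compensation) callables.
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Step failed: undo the completed steps in reverse order.
            for undo in reversed(completed):
                undo()
            return "rolled back"
    return "committed"

# Usage (hypothetical services):
# run_saga([(reserve_inventory, release_inventory),
#           (charge_payment,    refund_payment),
#           (create_shipment,   cancel_shipment)])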

Transactional Outbox Pattern

Write the business data and the outgoing message/event to the same database in a local transaction. A separate process (poller or CDC) reads the outbox table and publishes to the message broker. Guarantees at-least-once delivery without distributed transactions.

Outbox Pattern
App: Write data + outbox row (single txn)
Poller/CDC reads outbox
Publish to broker
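
A sketch of both sides of the pattern, with SQLite standing in for the application database; table and column names are illustrative:

Transactional Outbox (Python sketch)
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id, total):
    # The business row and the outbox row commit or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("order-created", json.dumps({"id": order_id, "total": total})))

def relay(publish):
    # Poller: publish unpublished rows, then mark them. At-least-once, because
    # a crash between publish() and the UPDATE causes a re-publish on restart.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(1, 99.50)
relay(lambda topic, payload: print(topic, payload))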

Failure Handling

Split-Brain: A network partition causes two groups of nodes to each believe they are the active cluster. Both accept writes, leading to divergence. Prevented by requiring a quorum (majority) to operate, or by using fencing.

Fencing Tokens: When a leader is elected or a lock is acquired, a monotonically increasing token is issued. Storage services reject requests with stale tokens. This prevents a "zombie" leader (one that was partitioned but thinks it's still leader) from making writes.
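
A sketch of the storage-side check, with an in-memory dict standing in for the storage service:

Fencing Token Check (Python sketch)
class FencedStore:
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        # Reject anything older than the newest token we have seen.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value

store = FencedStore()
store.write(token=33, key="config", value="v1")      # current leader
try:
    store.write(token=32, key="config", value="v0")  # zombie leader
except PermissionError as e:
    print(e)  # stale fencing token 32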

Idempotency: Design operations so that executing them multiple times has the same effect as once. Critical for safe retries. Common approach: assign a unique request ID and deduplicate on the server side.
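
A sketch of server-side deduplication keyed on a client-supplied request ID; in practice the seen-requests table would live in durable storage, not process memory:

Idempotent Request Handling (Python sketch)
processed = {}  # request_id -> cached response

def handle_payment(request_id, amount):
    # Replays of the same request_id return the original result
    # instead of charging twice.
    if request_id in processed:
        return processed[request_id]
    result = {"charged": amount, "status": "ok"}  # side effect happens once
    processed[request_id] = result
    return result

first = handle_payment("req-7f3a", 25.00)
retry = handle_payment("req-7f3a", 25.00)  # safe retry after a timeout
assert first is retry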

Failure Detectors: Used to suspect that a node has crashed. Based on heartbeat timeouts. Can be tuned for accuracy vs. speed. The phi accrual failure detector (used in Akka, Cassandra) outputs a suspicion level rather than a binary alive/dead.
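
A simplified sketch of the phi accrual idea, assuming heartbeat inter-arrival times are roughly normally distributed; the real detector in Akka and Cassandra is more elaborate:

Phi Accrual Suspicion (Python sketch)
import math
from statistics import NormalDist

def phi(time_since_last, intervals):
    # intervals: history of observed heartbeat inter-arrival times (seconds).
    dist = NormalDist.from_samples(intervals)
    # Probability that a heartbeat arrives later than what we've waited so far.
    p_later = 1.0 - dist.cdf(time_since_last)
    p_later = max(p_later, 1e-12)   # avoid log(0)
    return -math.log10(p_later)     # suspicion level, not a binary alive/dead

history = [1.0, 1.1, 0.9, 1.05, 0.95]  # ~1 s heartbeats
print(phi(1.2, history))  # low phi: probably just jitter
print(phi(5.0, history))  # high phi: very likely crashed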

The Zombie Leader Problem
A leader that was network-partitioned may come back and still believe it is leader. Without fencing tokens, it can overwrite data written by the new leader. Always use fencing or epoch numbers to guard against this.

Test Yourself

Question 01
Which of the 8 fallacies does NTP attempt to mitigate (but cannot fully solve)?
NTP synchronizes clocks by measuring network round-trip time, but because latency is variable and non-zero, it can never achieve perfect synchronization. The residual uncertainty is why we need logical/hybrid clocks.
Question 02
What can vector clocks determine that Lamport timestamps cannot?
Lamport timestamps only give a partial order: L(a) < L(b) does not mean a happened before b. Vector clocks can distinguish causality from concurrency -- if neither vector dominates the other, the events are concurrent.
Question 03
In Raft, what happens when a follower receives an AppendEntries RPC with a term lower than its own?
In Raft, any message with a stale (lower) term is rejected. This ensures that deposed leaders cannot continue replicating entries. The sender will discover it is no longer leader when it receives the rejection.
Question 04
What does linearizability guarantee?
Linearizability is the strongest single-object consistency model. It requires that operations appear atomic and ordered consistently with real-time -- each op takes effect at a single point between call and return.
Question 05
In a leaderless replication system with N=5, what is the minimum W such that R=3 guarantees overlap?
The quorum condition is R + W > N. With N=5 and R=3, we need W > 2, so W = 3 is the minimum. This ensures at least one node is in both the read set and write set.
Question 06
What is the main advantage of consistent hashing over simple hash-mod-N partitioning?
With hash-mod-N, changing N reshuffles almost all keys. Consistent hashing places nodes on a ring, so adding or removing a node only affects the keys between it and its neighbor. Vnodes further smooth the distribution.
Question 07
What is the fundamental weakness of two-phase commit (2PC)?
2PC is a blocking protocol. After voting "yes" in the prepare phase, a participant cannot unilaterally decide to commit or abort -- it must wait for the coordinator. If the coordinator crashes, participants are stuck holding locks until it recovers.
Question 08
What does the transactional outbox pattern guarantee?
The outbox pattern writes the event to a local table in the same transaction as the business data. A separate process publishes it. Since the poller/CDC may re-read rows, it guarantees at-least-once (not exactly-once). Consumers must be idempotent.
Question 09
How do fencing tokens prevent the zombie leader problem?
Each leader election produces a monotonically increasing fencing token. The storage layer tracks the highest token it has accepted. When a zombie leader (with an old, lower token) tries to write, the storage rejects it because its token is stale.
Question 10
What does the FLP impossibility result state?
FLP (1985) proves that in a purely asynchronous model, even one possible crash makes deterministic consensus unsolvable (liveness cannot be guaranteed while maintaining safety). Real systems sidestep this with timeouts and partial synchrony assumptions.