The 8 Fallacies of Distributed Computing
L. Peter Deutsch identified the false assumptions that developers new to distributed systems invariably make; James Gosling later added the eighth. Much of distributed-systems design follows from accepting that each of these assumptions is false.
1. The network is reliable. Packets get dropped, cables get cut, switches fail. You must design for message loss.
2. Latency is zero. Every network hop adds delay. Cross-datacenter calls can be 50-150ms. Design accordingly.
3. Bandwidth is infinite. Network saturation is a real constraint on throughput.
4. The network is secure. Every network boundary is an attack surface. Encrypt, authenticate, authorize.
5. Topology doesn't change. Servers come and go. Cloud instances are ephemeral. Service discovery matters.
6. There is one administrator. Multiple teams, orgs, and cloud providers own different parts of the infrastructure.
7. Transport cost is zero. Serialization, deserialization, and moving data across the wire all have real costs.
8. The network is homogeneous. Different hardware, OS versions, protocols, and software stacks must interoperate.
Clocks in Distributed Systems
There is no global clock. Physical clocks drift, and NTP can only reduce -- not eliminate -- skew. Distributed systems need logical or hybrid clocks to establish event ordering.
Lamport Timestamps
Each process maintains a counter. On local events, increment. On send, attach the counter. On receive, set counter to max(local, received) + 1. This gives a partial order: if a → b then L(a) < L(b), but the converse is not true.
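The update rules above can be sketched directly (a minimal illustration; the `LamportClock` class and method names are ours, not from any particular library):

```python
class LamportClock:
    """Minimal Lamport timestamp sketch; one instance per process."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Local event: increment.
        self.counter += 1
        return self.counter

    def send(self):
        # Sending is itself an event: increment, then attach the counter
        # to the outgoing message.
        self.counter += 1
        return self.counter

    def receive(self, received):
        # On receive: jump past both our clock and the sender's timestamp.
        self.counter = max(self.counter, received) + 1
        return self.counter
```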
Vector Clocks
Each process maintains a vector of counters, one per process. This captures causal ordering: you can determine if two events are causally related or concurrent. The tradeoff is O(N) space where N is the number of processes.
VC(a) < VC(b) iff every entry of VC(a) is ≤ the corresponding entry of VC(b) and at least one is strictly less. If neither dominates the other, the events are concurrent.
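The dominance check can be written as a small comparison function (a sketch; clocks are equal-length lists indexed by process, and the function name is ours):

```python
def vc_compare(a, b):
    """Compare two equal-length vector clocks.

    Returns "before", "after", "equal", or "concurrent".
    """
    less = any(x < y for x, y in zip(a, b))
    greater = any(x > y for x, y in zip(a, b))
    if less and greater:
        return "concurrent"  # neither clock dominates the other
    if less:
        return "before"      # a happened-before b
    if greater:
        return "after"       # b happened-before a
    return "equal"
```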
Hybrid Logical Clocks (HLC)
Combines physical time with a logical counter. Uses the physical timestamp when possible, falls back to the logical component to break ties. Gives you causality tracking with timestamps that are close to wall-clock time. Used in CockroachDB.
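The receive-side update can be sketched as follows, assuming timestamps are `(physical, logical)` pairs and `now` is the local wall-clock reading (the function signature is ours):

```python
def hlc_update(local, received, now):
    """One HLC receive step; local and received are (physical, logical) pairs.

    The physical part never goes backward; the logical counter breaks ties
    when physical timestamps collide, and resets when the wall clock advances
    past both.
    """
    l_pt, l_c = local
    r_pt, r_c = received
    pt = max(l_pt, r_pt, now)
    if pt == l_pt == r_pt:
        c = max(l_c, r_c) + 1
    elif pt == l_pt:
        c = l_c + 1
    elif pt == r_pt:
        c = r_c + 1
    else:
        # Wall clock is ahead of both timestamps: reset the logical part.
        c = 0
    return (pt, c)
```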
Google TrueTime
Hardware-assisted: GPS receivers and atomic clocks in every datacenter. The API returns an interval [earliest, latest] instead of a point. Spanner waits out the uncertainty window before committing, guaranteeing external consistency. Achieves ~1-7ms uncertainty bounds.
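The commit-wait idea can be sketched against a hypothetical `now_interval()` callable standing in for TrueTime's `TT.now()` (which returns the `[earliest, latest]` interval); the function name and polling loop are ours:

```python
import time

def commit_wait(commit_ts, now_interval):
    """Sketch of Spanner-style commit wait.

    After choosing commit_ts, block until every clock in the system is
    certainly past it, i.e. until earliest > commit_ts. Only then is the
    commit acknowledged, which yields external consistency.
    """
    while now_interval()[0] <= commit_ts:
        time.sleep(0.001)  # wait out the uncertainty window
```

With ~1-7ms uncertainty bounds, the wait is short; the tighter the interval, the cheaper external consistency becomes.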
| Clock | Ordering | Space | Needs Special HW |
|---|---|---|---|
| Lamport | Partial (happens-before) | O(1) | No |
| Vector | Causal | O(N) | No |
| HLC | Causal + wall-clock approx | O(1) | No |
| TrueTime | External consistency | O(1) | Yes (GPS/atomic) |
Agreement and Consistency Models
FLP Impossibility
Fischer, Lynch, and Paterson proved (1985) that in an asynchronous system where even one process can crash, no deterministic consensus algorithm can guarantee termination. In practice, we work around FLP by using timeouts (partial synchrony), randomization, or failure detectors.
Paxos (Conceptual)
Lamport's protocol for reaching consensus on a single value. Three roles: proposers, acceptors, learners. Two phases: Prepare/Promise (proposer claims a ballot number) and Accept/Accepted (proposer sends value, acceptors agree). Correct but notoriously difficult to implement and extend to multi-decree (log) use cases.
Raft
Designed for understandability. Decomposes consensus into three sub-problems:
Leader Election: Nodes start as followers. If a follower receives no heartbeat within its election timeout, it becomes a candidate, increments its term, and requests votes. A candidate wins with a majority. Each term has at most one leader.
Log Replication: The leader receives client requests, appends entries to its log, and replicates them to followers via AppendEntries RPCs. An entry is committed once a majority of nodes have persisted it. Followers apply committed entries to their state machines.
Terms: Monotonically increasing logical clock. Every RPC includes the sender's term. If a node receives a higher term, it steps down. Stale-term messages are rejected.
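The term rules can be condensed into a single function (an illustrative sketch; the name and return convention are ours):

```python
def handle_rpc_term(current_term, state, rpc_term):
    """Apply Raft's term rules to an incoming RPC.

    state is "leader", "candidate", or "follower". Returns
    (new_term, new_state, accept); accept=False means the message carried
    a stale term and must be rejected.
    """
    if rpc_term > current_term:
        # Higher term observed: adopt it and step down to follower.
        return rpc_term, "follower", True
    if rpc_term < current_term:
        # Stale term: reject; the sender steps down on seeing our term.
        return current_term, state, False
    return current_term, state, True
```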
Consistency Models
| Model | Guarantee | Example Systems |
|---|---|---|
| Linearizability | Operations appear instantaneous at some point between invocation and response. Strongest single-object model. | Zookeeper (writes), etcd |
| Sequential | All processes see the same order of operations, consistent with each process's program order. | Zookeeper (reads) |
| Causal | Causally related operations are seen in order; concurrent operations may be seen in any order. | MongoDB (causal sessions) |
| Eventual | If no new updates, all replicas converge to the same value. No ordering guarantees during convergence. | DynamoDB, Cassandra |
| Tunable | Application chooses consistency per operation via quorum settings (R, W, N). | Cassandra, Riak |
Replication Strategies
Replication provides fault tolerance and read scalability. The key question is: who accepts writes?
Single-Leader (Primary-Backup)
One node (leader) handles all writes and streams changes to followers. Followers serve reads. Simple, but the leader is a bottleneck and a single point of failure until failover completes. Replication can be synchronous (strong consistency, higher latency) or asynchronous (risk of data loss on leader failure).
Multi-Leader
Multiple nodes accept writes, each replicating to others. Useful for multi-datacenter setups. The hard problem is conflict resolution: last-writer-wins (LWW), merge functions, or CRDTs. Operational complexity is significantly higher.
Leaderless (Dynamo-style)
Any replica accepts reads and writes. Uses quorum parameters: with N replicas, write to W and read from R. If R + W > N, the write set and read set must overlap: at least one replica in the read set holds the latest acknowledged write. (Sloppy quorums and concurrent writes can still weaken this guarantee in practice.)
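The overlap condition is trivial to check numerically (helper name is ours):

```python
def quorum_overlap(n, w, r):
    """Dynamo-style quorum condition: read and write sets must intersect."""
    return r + w > n

# N=3, W=2, R=2: any read quorum shares a replica with any write quorum.
assert quorum_overlap(3, 2, 2)
# W=1, R=1: fast, but a read may miss the replica holding the latest write.
assert not quorum_overlap(3, 1, 1)
```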
| Strategy | Write Throughput | Consistency | Conflict Handling |
|---|---|---|---|
| Single-leader | Limited by leader | Strong (sync) or eventual (async) | None (single writer) |
| Multi-leader | Higher (parallel writes) | Eventual + conflict resolution | LWW, CRDTs, custom merge |
| Leaderless | High | Tunable via R, W, N | Read repair, anti-entropy |
Data Partitioning (Sharding)
When data outgrows a single node, you split it across partitions. The goals: even data distribution and even query load.
Hash Partitioning
Apply a hash function to the key and assign to a partition based on the hash. Good distribution but destroys key ordering -- range queries require scatter-gather across all partitions.
Consistent Hashing
Nodes are placed on a hash ring. Each key is assigned to the next node clockwise from its hash position. Adding or removing a node only affects its neighbors, not the entire cluster.
Virtual Nodes (vnodes): Each physical node gets many positions on the ring. This smooths out load imbalances caused by uneven hash distribution or heterogeneous hardware. Cassandra and the original Dynamo design use vnodes.
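A ring with vnodes fits in a few lines (illustrative sketch; MD5 is used here only as a convenient, stable hash, and the class name is ours):

```python
import bisect
import hashlib

def _h(key):
    """Map a string onto the hash ring (stable across runs)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node appears at `vnodes` positions on the ring.
        self.ring = sorted(
            (_h(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def lookup(self, key):
        # First ring position clockwise from the key's hash (wraps around).
        i = bisect.bisect(self.keys, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Removing a node from the `nodes` list and rebuilding only reassigns keys whose clockwise successor was one of that node's positions; all other keys stay put.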
Range Partitioning
Keys are sorted and split into contiguous ranges. Preserves ordering (efficient range scans) but risks hot spots if keys are not uniformly distributed (e.g., timestamp-prefixed keys).
Rebalancing
When nodes join or leave, data must be redistributed. Strategies: fixed number of partitions (only move partitions, not split them), dynamic splitting (split when partitions get too large), or proportional to node count. Rebalancing should be gradual and not block reads/writes.
Distributed Transactions and Failure Handling
Two-Phase Commit (2PC)
A coordinator asks all participants to prepare (vote yes/no). If all vote yes, it sends commit; otherwise abort. The problem: if the coordinator crashes after prepare but before commit/abort, participants are blocked -- they cannot proceed and hold locks indefinitely.
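The coordinator's side can be sketched as follows, with a toy `StubParticipant` standing in for a real resource manager (both names are ours; real 2PC also persists decisions to a log, which is omitted here):

```python
class StubParticipant:
    """Toy participant: votes as configured and records its final state."""

    def __init__(self, vote):
        self.vote = vote
        self.state = "pending"

    def prepare(self):
        return self.vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator side of 2PC (a sketch of the protocol in the text)."""
    # Phase 1 (prepare): collect votes. all() short-circuits on the first
    # "no"; aborting a participant that never prepared is a harmless no-op.
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): every participant voted yes.
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```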
Saga Pattern
Breaks a distributed transaction into a sequence of local transactions, each with a compensating action (undo). If step N fails, compensating actions for steps N-1 through 1 are executed in reverse.
Choreography: Each service publishes events; the next service reacts. Simple but hard to track and debug as the number of steps grows.
Orchestration: A central orchestrator directs the saga, calling each service in sequence. Easier to reason about but introduces a single coordination point.
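The orchestration variant can be sketched as a loop over `(action, compensate)` pairs (illustrative; on failure it runs compensations in reverse, exactly as described above):

```python
def run_saga(steps):
    """Orchestrated saga sketch: steps is a list of (action, compensate).

    Runs actions in order; if one raises, runs the compensations for all
    completed steps in reverse order, then re-raises.
    """
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise
```

A choreographed saga distributes this loop across services as event handlers; the compensation logic per step is the same.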
Transactional Outbox Pattern
Write the business data and the outgoing message/event to the same database in a local transaction. A separate process (poller or CDC) reads the outbox table and publishes to the message broker. Guarantees at-least-once delivery without distributed transactions.
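A sketch using SQLite as the local database (table, column, and function names are ours; `publish` stands in for the broker client):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, item):
    # One local transaction covers the business row AND the event row,
    # so they commit or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "order_placed", "id": order_id}),),
        )

def drain_outbox(publish):
    # The relay polls unpublished rows, publishes, then marks them.
    # A crash between publish and update causes a re-publish on restart,
    # hence at-least-once delivery.
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        with db:
            db.execute(
                "UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```

Consumers should therefore be idempotent, since the relay may deliver a message more than once.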
Failure Handling
Split-Brain: A network partition causes two groups of nodes to each believe they are the active cluster. Both accept writes, leading to divergence. Prevented by requiring a quorum (majority) to operate, or by using fencing.
Fencing Tokens: When a leader is elected or a lock is acquired, a monotonically increasing token is issued. Storage services reject requests with stale tokens. This prevents a "zombie" leader (one that was partitioned but thinks it's still leader) from making writes.
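The storage-side check amounts to a one-line comparison (a sketch; the class name is ours):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            # A newer leader has already written: this sender is a zombie.
            raise PermissionError("stale fencing token")
        self.highest_token = token
        self.data[key] = value
```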
Idempotency: Design operations so that executing them multiple times has the same effect as once. Critical for safe retries. Common approach: assign a unique request ID and deduplicate on the server side.
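The request-ID approach can be sketched as a wrapper that caches results by ID (names are ours; a production version would persist the cache and bound its size):

```python
def make_idempotent(handler):
    """Wrap a handler so repeated request IDs return the cached result
    instead of re-executing the operation."""
    seen = {}

    def wrapped(request_id, *args):
        if request_id not in seen:
            seen[request_id] = handler(*args)
        return seen[request_id]

    return wrapped
```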
Failure Detectors: Used to suspect that a node has crashed. Based on heartbeat timeouts. Can be tuned for accuracy vs. speed. The phi accrual failure detector (used in Akka, Cassandra) outputs a suspicion level rather than a binary alive/dead.