What Is a Hash Table?
A hash table stores key-value pairs by computing a hash function on the key to produce an index into an array of buckets. The pair is stored in the bucket at that index. Because computing the hash and indexing into an array are both O(1) operations, the average-case time for insert, lookup, and delete is O(1), assuming a good hash function and a bounded load factor.
Every hash table has three components: (1) a hash function that maps keys to integers, (2) a compression function (usually modulo) that maps those integers to valid bucket indices, and (3) a collision-resolution strategy for when two keys land in the same bucket.
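The first two stages can be sketched in a few lines (the function name here is illustrative, not from any library):

```python
def bucket_index(key, num_buckets):
    """Map a key to a bucket: hash, then compress with modulo."""
    h = hash(key)           # stage 1: hash function -> integer
    return h % num_buckets  # stage 2: compression -> index in range(num_buckets)

# Stage 3, collision resolution, decides what happens when two keys
# produce the same index; the strategies are covered below.
print(bucket_index("alice", 8))
```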
Designing Good Hash Functions
A good hash function has two properties: uniformity (each bucket is equally likely) and the avalanche effect (a small change in input flips roughly half the output bits). Without these, clustering causes collisions that destroy O(1) performance.
Division Method
h(k) = k mod m where m is the table size. Simple and fast. Works best when m is a prime not close to a power of 2, since patterns in keys tend to align with powers of 2.
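A quick illustration of why the choice of m matters. With m = 16, keys that share low bits collide; a prime like 13 spreads the same keys out:

```python
def h_div(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

even_keys = [0, 8, 16, 24, 32]
print([h_div(k, 16) for k in even_keys])  # [0, 8, 0, 8, 0] -- only low 4 bits matter
print([h_div(k, 13) for k in even_keys])  # [0, 8, 3, 11, 6] -- spread across buckets
```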
Multiplication Method
h(k) = floor(m * (k * A mod 1)), where k * A mod 1 denotes the fractional part of k * A and A is a constant in (0, 1) (Knuth suggests A = (sqrt(5) - 1) / 2, about 0.618). Less sensitive to the choice of m than the division method, so the table size can be any power of 2 for fast masking.
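The formula translates directly; note that the patterned keys that collided under the division method with m = 16 now land in distinct buckets:

```python
import math

A = (math.sqrt(5) - 1) / 2  # Knuth's suggested constant, ~0.618

def h_mult(k, m):
    """Multiplication method: h(k) = floor(m * frac(k * A))."""
    frac = (k * A) % 1.0  # fractional part of k * A
    return int(m * frac)

m = 16  # a power of 2 is fine here
print([h_mult(k, m) for k in [0, 8, 16, 24, 32]])  # [0, 15, 14, 13, 12]
```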
Universal Hashing
Pick the hash function randomly from a family at startup: h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key, a is drawn uniformly from {1, ..., p-1}, and b from {0, ..., p-1}. This guarantees that no adversary can choose pathological inputs in advance, giving O(1) expected time regardless of key distribution.
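A minimal sketch of drawing one member of this family (the factory name is illustrative):

```python
import random

P = 2_147_483_647  # a large prime (2^31 - 1), assumed larger than any key

def make_universal_hash(m):
    """Draw h(k) = ((a*k + b) mod p) mod m at random from the family."""
    a = random.randrange(1, P)  # a must be nonzero
    b = random.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

h = make_universal_hash(100)
print(h(42))  # some bucket in range(100); varies per run, stable within one
```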
Chaining vs Open Addressing
When two keys hash to the same bucket, the table needs a strategy. The two families are chaining (store a linked list per bucket) and open addressing (find another empty slot in the array itself).
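Chaining is the simpler of the two to write down. A minimal sketch, using a Python list per bucket in place of a linked list (class and method names are illustrative):

```python
class ChainedHashTable:
    """Minimal chaining table: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (possibly a collision): append

    def get(self, key):
        for k, v in self._bucket(key):   # scan the one bucket the key can be in
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("a", 1)
t.put("a", 2)
print(t.get("a"))  # 2
```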
Open Addressing Variants
Linear probing. Probe sequence: h+1, h+2, h+3... Cache-friendly but suffers from primary clustering -- long runs of occupied slots snowball.
Quadratic probing. Probe sequence: h+1, h+4, h+9... Breaks primary clustering but introduces secondary clustering (keys sharing a home slot follow the same probe path). The table size must be chosen carefully -- e.g., a prime with load factor below 0.5, or a power of 2 with triangular-number increments -- so the sequence is guaranteed to reach an empty slot.
Double hashing. Probe sequence: h + i*h2(k). A second hash function determines the step size, eliminating both primary and secondary clustering; h2(k) must be nonzero and coprime with the table size so every slot is reachable.
Robin Hood hashing. When inserting, if the new key has traveled farther from its home slot than the current occupant, swap them. This equalizes probe distances, keeping the variance of lookup lengths small.
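The displacement-swapping rule can be sketched on top of linear probing (a simplified model: it assumes a free slot exists, and real implementations also track metadata for deletion):

```python
def robin_hood_insert(table, key, val):
    """Insert into an open-addressed table (a list of None or (key, val)),
    swapping with any occupant that sits closer to its home slot."""
    m = len(table)
    pos = hash(key) % m
    dist = 0  # how far the pair being placed has traveled from its home
    while True:
        if table[pos] is None:
            table[pos] = (key, val)
            return
        okey, oval = table[pos]
        odist = (pos - hash(okey) % m) % m  # occupant's travel distance
        if odist < dist:
            # occupant is "richer" (closer to home): take its slot,
            # then continue placing the evicted pair
            table[pos] = (key, val)
            key, val, dist = okey, oval, odist
        pos = (pos + 1) % m
        dist += 1

table = [None] * 8
for k in ["a", "b", "c"]:
    robin_hood_insert(table, k, ord(k))
```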
| Property | Chaining | Open Addressing |
|---|---|---|
| Load factor | Can exceed 1.0 | Must stay below 1.0 (ideally < 0.7) |
| Cache behavior | Poor (pointer chasing) | Good (contiguous memory) |
| Deletion | Simple (remove node) | Needs tombstones or backward shift |
| Worst-case lookup | O(n) per bucket | O(n) with full table |
| Extra memory | Pointers per node | None (slots in array) |
Growing the Table
The load factor alpha = n / m (items / buckets) controls performance. As alpha rises, collisions increase and lookups slow down. Most implementations resize when alpha crosses a threshold (0.75 for Java HashMap, 2/3 for Python dict).
Rehashing allocates a new, larger array (typically 2x the size), then reinserts every element by recomputing its hash against the new table size. This single operation is O(n), but since it happens after O(n) insertions, the amortized cost per insert remains O(1).
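For a chaining table, the rehash step looks like this (a sketch; the function name is illustrative):

```python
def rehash(old_buckets):
    """Grow a chained table to twice the size, reinserting every pair.
    Each bucket is a list of (key, value) pairs."""
    new_size = 2 * len(old_buckets)
    new_buckets = [[] for _ in range(new_size)]
    for bucket in old_buckets:
        for key, value in bucket:
            # recompute against the NEW size: most indices change
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets

buckets = [[("a", 1), ("b", 2)], [("c", 3)]]
buckets = rehash(buckets)  # now 4 buckets, same 3 pairs
```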
Shrinking
Some implementations also shrink the table when alpha drops below a lower threshold (e.g., 0.25) to reclaim memory. The same amortized argument applies. The key is to ensure the grow and shrink thresholds don't cause thrashing -- the shrink threshold must be less than half the grow threshold.
Cuckoo Hashing and Consistent Hashing
Cuckoo Hashing
Uses two hash functions h1 and h2. Each key can live in exactly one of two positions: T1[h1(k)] or T2[h2(k)]. Lookup checks at most two slots, giving O(1) worst-case reads.
On insert, if both slots are occupied, the new key evicts the existing occupant, which then gets reinserted at its alternate position. This chain of displacements can cycle, triggering a rehash with new hash functions. In practice, cuckoo hashing keeps alpha below 0.5 per table.
def cuckoo_insert(key, val):
    for _ in range(MAX_KICKS):
        pos1 = h1(key) % size
        if T1[pos1] is None:            # empty slot in the first table
            T1[pos1] = (key, val)
            return
        # evict the T1 occupant; it becomes the pair to place next
        (key, val), T1[pos1] = T1[pos1], (key, val)
        pos2 = h2(key) % size
        if T2[pos2] is None:            # empty slot in the second table
            T2[pos2] = (key, val)
            return
        # evict the T2 occupant; the loop retries it in T1
        (key, val), T2[pos2] = T2[pos2], (key, val)
    # the displacement chain cycled: rebuild with fresh hash functions
    rehash_with_new_functions()
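Lookup is where cuckoo hashing pays off: at most two probes, no loop. A self-contained sketch (the tables and the two hash functions here are illustrative stand-ins, not tied to any particular implementation):

```python
size = 8
T1 = [None] * size  # each slot holds None or a (key, value) pair
T2 = [None] * size

def h1(k):
    return hash(k)

def h2(k):
    # illustrative second hash; real code uses an independent function
    return hash(k) // size

def cuckoo_lookup(key):
    """Check the only two slots where key can live."""
    entry = T1[h1(key) % size]
    if entry is not None and entry[0] == key:
        return entry[1]
    entry = T2[h2(key) % size]
    if entry is not None and entry[0] == key:
        return entry[1]
    raise KeyError(key)
```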
Consistent Hashing
Used in distributed systems (Dynamo, Cassandra, CDNs) to spread data across servers. Both keys and server IDs are hashed onto a circular hash ring. A key is stored on the first server encountered clockwise from its position on the ring.
When a server is added or removed, only about K/N keys need to move on average (K = total keys, N = total servers), compared to rehashing everything. Virtual nodes replicate each server at multiple points on the ring to improve balance -- a server with 2x capacity gets 2x virtual nodes.
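A minimal ring with virtual nodes might look like this (a sketch; MD5 is used only because it is a stable, well-spread hash, and the class and method names are illustrative):

```python
import bisect
import hashlib

def ring_hash(s):
    """Stable position on the ring (Python's hash() is randomized per process)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=3):
        # each server appears at `vnodes` points on the ring
        self.ring = sorted(
            (ring_hash(f"{srv}#{i}"), srv)
            for srv in servers
            for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        """First server clockwise from the key's position."""
        i = bisect.bisect_right(self.points, ring_hash(key))
        return self.ring[i % len(self.ring)][1]  # wrap around past the top

ring = ConsistentHashRing(["s1", "s2", "s3"])
print(ring.server_for("user:42"))  # one of s1/s2/s3, stable across runs
```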
Hash Tables in Practice
Python (CPython dict)
Open addressing with perturbation-based probing. Compact layout since 3.6 preserves insertion order. Resizes at 2/3 load. Uses SipHash for string keys to resist HashDoS.
Java (HashMap)
Chaining with linked lists that convert to red-black trees when a bucket exceeds 8 entries (the treeify threshold). Resizes at 0.75 load factor. Default capacity is 16.
Go (map)
Array of buckets, each holding 8 key-value pairs. Overflow buckets chain together. Uses incremental rehashing -- old buckets are evacuated lazily during operations.
Redis (dict)
Two hash tables for incremental rehashing. Under 128 entries with small values, uses a ziplist (compact encoding) instead. Rehash moves one bucket per operation to avoid latency spikes.
Comparison with Other Structures
| Structure | Lookup | Insert | Ordered? | Space | Best For |
|---|---|---|---|---|---|
| Hash Table | O(1) avg | O(1) avg | No | O(n) | Fast key-value access |
| BST (balanced) | O(log n) | O(log n) | Yes | O(n) | Ordered iteration, range queries |
| Skip List | O(log n) | O(log n) | Yes | O(n) | Concurrent ordered maps (lock-free) |
| Bloom Filter | O(k) | O(k) | No | O(m) bits | Membership test (no false negatives) |