Data Structures

Hash Tables

The workhorse of constant-time lookups. How a simple function turns keys into addresses, what happens when two keys collide, and the clever variants powering everything from Python dicts to distributed caches.

01 / Fundamentals

What Is a Hash Table?

A hash table stores key-value pairs by computing a hash function on the key to produce an index into an array of buckets. The value is stored at that index. Because computing the hash and indexing into an array are both O(1) operations, the average-case time for insert, lookup, and delete is O(1).

Hash mapping: key to bucket

key: "alice"  -->  hash("alice") = 4829  -->  4829 % 8 = 5  -->  bucket[5]

0  ---
1  "bob": 42
2  ---
3  "dave": 7
4  ---
5  "alice": 99
6  ---
7  "eve": 13

Key Insight
Worst-case lookup degrades to O(n) when many keys collide into the same bucket. The quality of the hash function and collision-resolution strategy determine how often this happens in practice.

The three components of every hash table: (1) a hash function that maps keys to integers, (2) a compression function (usually modulo) that maps those integers to valid bucket indices, and (3) a collision-resolution strategy for when two keys land in the same bucket.
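
As a minimal sketch (class and method names here are illustrative), a chained table wires the three components together:

class SimpleHashTable:
    """Toy chained hash table showing the three components:
    (1) hash function, (2) compression via modulo, (3) chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # (1) hash the key, then (2) compress to a valid bucket index
        return hash(key) % len(self.buckets)

    def put(self, key, val):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite
                bucket[i] = (key, val)
                return
        bucket.append((key, val))      # (3) colliding keys share the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)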

02 / Hash Functions

Designing Good Hash Functions

A good hash function has two properties: uniformity (each bucket is equally likely) and the avalanche effect (a small change in input flips roughly half the output bits). Without these, clustering causes collisions that destroy O(1) performance.

Division Method

h(k) = k mod m where m is the table size. Simple and fast. Works best when m is a prime not close to a power of 2, since patterns in keys tend to align with powers of 2.

Multiplication Method

h(k) = floor(m * (k * A mod 1)) where A is an irrational constant (Knuth suggests A = (sqrt(5) - 1) / 2). Less sensitive to the choice of m than the division method, so the table size can be any power of 2 for fast masking.
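
A sketch of both methods for integer keys; the constants (769, 1024) are illustrative choices, not prescribed values:

import math

def division_hash(k, m=769):
    """Division method: m is a prime not close to a power of 2."""
    return k % m

def multiplication_hash(k, m=1024):
    """Multiplication method: m may be a power of 2.
    A = (sqrt(5) - 1) / 2 is Knuth's suggested constant."""
    A = (math.sqrt(5) - 1) / 2
    return int(m * ((k * A) % 1))   # take the fractional part, scale by m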

Universal Hashing

Pick the hash function randomly from a family at startup: h(k) = ((a*k + b) mod p) mod m, where p is a large prime and a, b are random. This guarantees that no adversary can choose pathological inputs, giving O(1) expected time regardless of key distribution.
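
A sketch of drawing one function from the family; p here is the Mersenne prime 2^61 - 1, and keys are assumed smaller than p:

import random

P = (1 << 61) - 1  # large prime; must exceed every possible key

def make_universal_hash(m):
    """Draw h(k) = ((a*k + b) mod p) mod m at random from the family."""
    a = random.randrange(1, P)   # a must be nonzero
    b = random.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

h = make_universal_hash(16)      # fixed once, when the table is created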

Watch Out
Never use a simple identity hash (h(k) = k) in production. Sequential integer keys produce sequential bucket indices, causing worst-case clustering with linear probing.

03 / Collision Resolution

Chaining vs Open Addressing

When two keys hash to the same bucket, the table needs a strategy. The two families are chaining (store a linked list per bucket) and open addressing (find another empty slot in the array itself).

Chaining: linked lists at each bucket

0  "cat" --> "fox" --> null
1  ---
2  "dog" --> null
3  "ant" --> "bee" --> "eel" --> null
4  ---

Open addressing (linear probing): insert "fox" at h=2; the slot is occupied, so probe forward to 3

0  "cat"
1  ---
2  "dog"   <-- collision!
3  "fox"   <-- placed here
4  ---

Open Addressing Variants

Linear Probing

Probe sequence: h+1, h+2, h+3... Cache-friendly but suffers from primary clustering -- long runs of occupied slots snowball.
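
A lookup sketch under linear probing (deletion tombstones omitted); table is assumed to be a list of (key, value) tuples or None:

def probe_lookup(table, key):
    """Scan forward from the home slot; an empty slot means absent."""
    m = len(table)
    home = hash(key) % m
    for step in range(m):
        slot = table[(home + step) % m]
        if slot is None:
            return None          # empty slot: the key was never placed
        if slot[0] == key:
            return slot[1]       # found it
    return None                  # table full and key absent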

Quadratic Probing

Probe: h+1, h+4, h+9... Breaks primary clustering but introduces secondary clustering: keys sharing a home slot still follow the same probe path. The table size and probe sequence must be chosen together (e.g., a prime-sized table kept at most half full, or a power-of-2 table probed with triangular numbers), or the sequence may never reach some empty slots.

Double Hashing

Probe: h + i*h2(k). A second hash function determines the step size, eliminating both primary and secondary clustering.
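
A sketch of the probe sequence; the second hash here is an illustrative stand-in, and with a prime table size any step in [1, m-1] reaches every slot:

def double_hash_probes(key, m):
    """Yield the probe sequence h1 + i*h2 (mod m) for i = 0, 1, 2, ..."""
    home = hash(key) % m
    step = 1 + (hash((key, "step")) % (m - 1))  # stand-in h2, never zero
    for i in range(m):
        yield (home + i * step) % m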

Robin Hood Hashing

When inserting, if the new key has traveled farther from its home than the occupant, swap them. This equalizes probe distances, giving tight worst-case bounds.
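
An insert sketch where each slot stores (key, value, distance-from-home) so probe distances can be compared; it assumes the table still has a free slot:

def robin_hood_insert(table, key, val):
    """The entry farther from its home slot keeps the contested slot."""
    m = len(table)
    i = hash(key) % m
    dist = 0                               # our distance from home so far
    while True:
        slot = table[i]
        if slot is None:
            table[i] = (key, val, dist)
            return
        okey, oval, odist = slot
        if okey == key:
            table[i] = (key, val, odist)   # same key: overwrite in place
            return
        if odist < dist:                   # occupant is closer to home:
            table[i] = (key, val, dist)    # steal its slot ("rob the rich")
            key, val, dist = okey, oval, odist
        i = (i + 1) % m
        dist += 1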

Property            Chaining                  Open Addressing
Load factor         Can exceed 1.0            Must stay below 1.0 (ideally < 0.7)
Cache behavior      Poor (pointer chasing)    Good (contiguous memory)
Deletion            Simple (remove node)      Needs tombstones or backward shift
Worst-case lookup   O(n) per bucket           O(n) with full table
Extra memory        Pointers per node         None (slots in array)
04 / Load Factor and Rehashing

Growing the Table

The load factor alpha = n / m (items / buckets) controls performance. As alpha rises, collisions increase and lookups slow down. Most implementations resize when alpha crosses a threshold (0.75 for Java HashMap, 2/3 for Python dict).

Rehashing allocates a new, larger array (typically 2x the size), then reinserts every element by recomputing its hash against the new table size. This single operation is O(n), but since it happens after O(n) insertions, the amortized cost per insert remains O(1).
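
A growth sketch for a chained table, following the doubling-and-reinsert scheme above (the threshold and names are illustrative):

def maybe_grow(buckets, n, threshold=0.75):
    """Double the bucket array once n/m crosses the threshold,
    then reinsert every entry under the new modulus."""
    if n / len(buckets) <= threshold:
        return buckets
    new = [[] for _ in range(2 * len(buckets))]
    for bucket in buckets:
        for key, val in bucket:
            new[hash(key) % len(new)].append((key, val))  # O(n) total work
    return new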

Amortized Analysis
Think of it like doubling a dynamic array. Each element "pays" for its own future rehash. After n insertions, total rehash work is 1 + 2 + 4 + ... + n = O(n), so amortized cost per insert = O(n)/n = O(1).

Shrinking

Some implementations also shrink the table when alpha drops below a lower threshold (e.g., 0.25) to reclaim memory. The same amortized argument applies. The key is to ensure the grow and shrink thresholds don't cause thrashing -- the shrink threshold must be less than half the grow threshold.

05 / Advanced Variants

Cuckoo Hashing and Consistent Hashing

Cuckoo Hashing

Uses two hash functions h1 and h2. Each key can live in exactly one of two positions: T1[h1(k)] or T2[h2(k)]. Lookup checks at most two slots, giving O(1) worst-case reads.
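
Lookup is therefore at most two array reads; a sketch using the same T1, T2, h1, h2, and size as the insert code below:

def cuckoo_lookup(key):
    # the key can only live in one of exactly two slots
    for table, h in ((T1, h1), (T2, h2)):
        entry = table[h(key) % size]
        if entry is not None and entry[0] == key:
            return entry[1]
    return None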

On insert, if both slots are occupied, the new key evicts the existing occupant, which then gets reinserted at its alternate position. This chain of displacements can cycle, triggering a rehash with new hash functions. In practice, cuckoo hashing keeps alpha below 0.5 per table.

def cuckoo_insert(key, val):
    for _ in range(MAX_KICKS):
        pos1 = h1(key) % size
        if T1[pos1] is None:
            T1[pos1] = (key, val)
            return
        # evict the occupant of T1 and carry it toward its alternate slot
        (key, val), T1[pos1] = T1[pos1], (key, val)
        pos2 = h2(key) % size
        if T2[pos2] is None:
            T2[pos2] = (key, val)
            return
        (key, val), T2[pos2] = T2[pos2], (key, val)
    # the displacement chain cycled: rebuild with fresh hash functions
    rehash_with_new_functions()

Consistent Hashing

Used in distributed systems (Dynamo, Cassandra, CDNs) to spread data across servers. Both keys and server IDs are hashed onto a circular hash ring. A key is stored on the first server encountered clockwise from its position on the ring.

Consistent hashing ring: servers S1, S2, S3, and S4 sit at hashed positions on the ring; walking clockwise, k1 -> S2, k2 -> S3, k3 -> S4.

When a server is added or removed, only K/N keys need to move (K = total keys, N = total servers), compared to rehashing everything. Virtual nodes replicate each server at multiple points on the ring to improve balance -- a server with 2x capacity gets 2x virtual nodes.

Why Virtual Nodes?
With only a few physical servers, the ring partitions can be highly uneven. Placing 100-200 virtual nodes per server smooths out the distribution, ensuring each server handles roughly its fair share of keys.
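
A minimal ring sketch with virtual nodes, using bisect for the clockwise search; MD5 serves only as a stable illustrative hash, and the class name and vnode count are hypothetical:

import bisect
import hashlib

def ring_hash(s):
    """Stable hash so every process sees the same ring layout."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        # place each server at `vnodes` points on the ring
        self.ring = sorted(
            (ring_hash("%s#%d" % (s, i)), s)
            for s in servers
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def server_for(self, key):
        """Return the first server clockwise from the key's position."""
        i = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
ring.server_for("k1")  # -> whichever server is next clockwise, e.g. "S2"
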
06 / Real-World Implementations

Hash Tables in Practice

Python dict

Open addressing with perturbation-based probing. Compact layout since 3.6 preserves insertion order. Resizes at 2/3 load. Uses SipHash for string keys to resist HashDoS.

Java HashMap

Chaining with linked lists that convert to red-black trees when a bucket exceeds 8 entries (treeify threshold). Resizes at 0.75 load factor. Default capacity is 16.

Go map

Array of buckets, each holding 8 key-value pairs. Overflow buckets chain together. Uses incremental rehashing -- old buckets are evacuated lazily during operations.

Redis Hash

Two hash tables for incremental rehashing. Under 128 entries with small values, uses a ziplist (compact encoding) instead. Rehash moves one bucket per operation to avoid latency spikes.

Comparison with Other Structures

Structure        Lookup     Insert     Ordered?   Space      Best For
Hash Table       O(1) avg   O(1) avg   No         O(n)       Fast key-value access
BST (balanced)   O(log n)   O(log n)   Yes        O(n)       Ordered iteration, range queries
Skip List        O(log n)   O(log n)   Yes        O(n)       Concurrent ordered maps (lock-free)
Bloom Filter     O(k)       O(k)       No         O(m) bits  Membership test (no false negatives)

When Not to Use a Hash Table
If you need ordered traversal, range queries, or worst-case O(log n) guarantees, a balanced BST or skip list is the better choice. Hash tables also waste memory at low load factors and can have unpredictable latency spikes during rehashing.

Test Yourself

Question 01
What is the average-case time complexity for lookup in a hash table?
Hash functions compute the bucket index in constant time, and accessing an array element by index is O(1). With a good hash function and low load factor, collisions are rare, so the average case is O(1).
Question 02
Which property of a hash function ensures that a single-bit change in input flips roughly half the output bits?
The avalanche effect means that a small change in the input (even a single bit) causes a dramatic change in the output. This prevents similar keys from clustering into the same buckets.
Question 03
In open addressing with linear probing, what problem occurs when occupied slots form long contiguous runs?
Primary clustering is specific to linear probing. Long runs of occupied slots grow larger because any key that hashes into the run must probe to the end of it, extending it further. Quadratic probing and double hashing avoid this.
Question 04
What is the load factor of a hash table with 12 elements and 16 buckets?
Load factor = n/m = 12/16 = 0.75. This is exactly the default resize threshold for Java's HashMap.
Question 05
Cuckoo hashing guarantees O(1) worst-case for which operation?
In cuckoo hashing, a key can only be in one of two positions (T1[h1(k)] or T2[h2(k)]), so lookup always checks exactly two slots -- O(1) worst-case. Insert, however, may trigger a chain of evictions or even a full rehash.
Question 06
In consistent hashing, when a server is removed from the ring, which keys need to be redistributed?
Only the keys that mapped to the removed server need to move. They shift to the next server clockwise on the ring. All other keys remain on their current servers, which is why consistent hashing minimizes disruption.
Question 07
Java's HashMap converts a bucket from a linked list to a red-black tree when the chain length exceeds what threshold?
Java 8 introduced treeification: when a bucket's linked list grows past 8 nodes, it converts to a red-black tree, improving worst-case lookup from O(n) to O(log n) for that bucket. It untreeifies back to a list when the count drops to 6.
Question 08
Why is the amortized cost of insertion O(1) even though rehashing is O(n)?
The table doubles when full, so after n insertions, total rehash work is 1 + 2 + 4 + ... + n, which sums to O(n). Dividing by the n insertions that triggered it gives O(1) amortized cost per insert. This is the same argument used for dynamic arrays.