Threads vs Processes
A process is an independent execution unit with its own virtual address space, file descriptor table, and heap. A thread is a lightweight execution unit within a process -- threads share the process's address space, heap, and file descriptors, but each thread has its own stack, register set, program counter, and thread-local storage (TLS).
Per-process (one copy, shared by every thread):
- Virtual address space
- File descriptor table
- Signal handlers
- PID, UID, GID
- Memory mappings

From a thread's point of view:
- Shared: heap, code, global data, FDs
- Private: stack
- Private: registers, PC
- Private: thread-local storage (TLS)
- Private: signal mask
Creating a thread is cheaper than forking a process because there is no need to duplicate the entire address space. Context-switching between threads within the same process is also faster -- the TLB and page tables remain valid. However, shared memory means one misbehaving thread can corrupt the entire process.
On Linux, both are represented by the same kernel structure: task_struct. The clone() syscall with CLONE_VM | CLONE_FS | CLONE_FILES creates a thread; without those flags, it creates a new process. There is no fundamental kernel distinction -- threads are just processes that share resources.
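A minimal sketch (assuming Linux and a C++11 compiler) that makes the sharing difference concrete: a thread's write to a global is visible to its creator, while a forked child writes only to its own copy-on-write copy.

```cpp
// Sketch: thread vs fork on Linux. The thread shares the address space,
// so its write to `shared` is visible after join(); the forked child
// modifies its own copy-on-write copy, which the parent never sees.
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <thread>

int shared = 0;

int main() {
    std::thread t([] { shared = 1; });
    t.join();
    assert(shared == 1);                     // same address space: write is visible

    pid_t pid = fork();
    if (pid == 0) { shared = 2; _exit(0); }  // child changes only its own copy
    waitpid(pid, nullptr, 0);
    assert(shared == 1);                     // parent's value is unchanged
}
```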
Thread-Local Storage
TLS allows each thread to have its own copy of a variable. In C, __thread int x; or _Thread_local int x; (C11). In POSIX, pthread_key_create() provides dynamic TLS. This is how errno works in multithreaded programs -- each thread gets its own errno without contention.
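A minimal sketch of the same idea in C++ (thread_local is the C++11 spelling of C11's _Thread_local): each thread sees its own counter, which is exactly how a per-thread errno avoids contention.

```cpp
// Each thread gets a private copy of `counter` -- no locks, no contention.
#include <cstdio>
#include <thread>

thread_local int counter = 0; // one instance per thread

void work(const char* name) {
    for (int i = 0; i < 3; ++i) ++counter;            // touches only this thread's copy
    std::printf("%s: counter = %d\n", name, counter); // prints 3 in every thread
}

int main() {
    std::thread a(work, "A"), b(work, "B");
    a.join();
    b.join();
    work("main"); // the main thread has its own copy too
}
```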
Threading Models
The mapping between user-level threads and kernel-level threads defines the threading model of a runtime:
| Model | Mapping | Pros | Cons | Examples |
|---|---|---|---|---|
| 1:1 | 1 user thread = 1 kernel thread | True parallelism, simple scheduling | Expensive creation, limited by OS thread count | Linux pthreads, Java, Rust |
| N:1 | N user threads = 1 kernel thread | Cheap creation, fast context switch | No parallelism, one blocking call blocks all | Early Java (green threads), Ruby MRI (GIL) |
| M:N | M user threads = N kernel threads | Best of both: cheap + parallel | Complex scheduler, debugging is hard | Go (goroutines), Erlang BEAM, Java virtual threads |
Green Threads
Green threads are scheduled in user space by the runtime rather than the OS kernel. They have tiny stacks (Go starts goroutines at 2-8 KB vs. 1-8 MB for kernel threads) and can be created in the millions. The runtime multiplexes them onto a pool of OS threads. When a green thread blocks on I/O, the runtime parks it and schedules another -- no kernel context switch needed.
Go's M:N Scheduler
Go uses three entities: G (goroutine), M (OS thread/machine), and P (logical processor). Each P has a local run queue of Gs. Work-stealing: when a P's queue is empty, it steals Gs from another P. When a G makes a blocking syscall, the M detaches from P, and P gets a new M -- this prevents one blocking call from stalling the scheduler.
Synchronization Primitives
When threads share mutable state, you need synchronization to prevent data races. The choice of primitive depends on contention level, hold time, and access patterns.
Mutex
Mutual exclusion lock. Only one thread holds it at a time. Others block (sleep) until it's released. Use for protecting critical sections with non-trivial hold times.
Spinlock
Busy-waits (spins) in a loop until the lock is free. No context switch overhead, but wastes CPU. Best for very short critical sections (a few instructions) in kernel code or when you know contention is rare.
Read-Write Lock
Allows multiple concurrent readers OR one exclusive writer. Great for read-heavy workloads. Watch for writer starvation -- some implementations prefer readers.
Semaphore
A counter that allows up to N threads to enter. Binary semaphore (N=1) behaves like a mutex. Counting semaphore (N>1) limits concurrency, e.g., connection pools.
Condition Variable
Lets a thread wait until a condition is true. Always used with a mutex. Pattern: lock(); while (!cond) wait(); unlock();. The while loop handles spurious wakeups (see the sketch below).
Compare-and-Swap (CAS)
Atomic CPU instruction: CAS(addr, expected, new). If *addr == expected, set *addr = new and return true. The foundation of lock-free data structures.
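A minimal sketch of the condition-variable pattern above, using a toy one-slot mailbox (the `slot` name is illustrative, not from any library). The predicate overload of cv.wait is the idiomatic C++ spelling of the while loop: it re-checks the condition after every wakeup, so spurious wakeups are harmless.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>

std::mutex m;
std::condition_variable cv;
std::optional<int> slot; // toy one-slot mailbox

void produce(int v) {
    {
        std::lock_guard<std::mutex> lk(m);
        slot = v;
    }
    cv.notify_one();
}

int consume() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return slot.has_value(); }); // equivalent to: while (!cond) wait()
    int v = *slot;
    slot.reset();
    return v;
}
```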
Memory Barriers / Fences
CPUs and compilers reorder instructions for performance. Memory barriers enforce ordering constraints. A store fence ensures all stores before the fence are visible before any store after it. An acquire-release pair (C++ memory_order_acquire / memory_order_release) is the most common pattern: the release store publishes data, and the acquire load sees it.
// C++ acquire-release example
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int data = 0;

int main() {
    std::thread producer([] {
        data = 42;
        ready.store(true, std::memory_order_release);     // release store publishes data
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) {} // acquire load sees the publish
        assert(data == 42);                               // guaranteed to see 42
    });
    producer.join();
    consumer.join();
}
A common mistake: sprinkling memory_order_relaxed everywhere for "performance" and then wondering why your lock-free queue loses messages. Relaxed ordering guarantees atomicity but imposes no ordering relative to other memory operations. Default to seq_cst (sequential consistency) unless you've proven you need weaker ordering.
Concurrency Hazards
Concurrency bugs are among the hardest to diagnose because they are timing-dependent, non-deterministic, and often disappear under a debugger.
Race Conditions
When the outcome depends on the relative timing of two or more threads. A data race is a specific case: two threads access the same memory location concurrently, at least one is a write, and there is no synchronization. In C/C++, data races are undefined behavior. In Go, the race detector (go run -race) catches them at runtime.
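A minimal C++ sketch of a data race (the iteration count is arbitrary): two threads do unsynchronized read-modify-writes on a plain int, which is undefined behavior. Building with -fsanitize=thread (ThreadSanitizer in GCC/Clang) flags it at runtime, much like Go's -race.

```cpp
// Data race: ++counter is a read-modify-write with no synchronization.
#include <cstdio>
#include <thread>

int counter = 0; // plain int -- not std::atomic

int main() {
    auto inc = [] { for (int i = 0; i < 100000; ++i) ++counter; };
    std::thread a(inc), b(inc);
    a.join();
    b.join();
    std::printf("%d\n", counter); // rarely 200000; lost updates are typical
}
```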
Deadlock
Two or more threads are waiting for each other to release resources, forming a cycle. Deadlock requires all four Coffman conditions simultaneously:
| Condition | Description | Break It By |
|---|---|---|
| Mutual Exclusion | Resource is held exclusively | Use sharable resources (read locks) |
| Hold and Wait | Holding one resource, waiting for another | Acquire all resources atomically |
| No Preemption | Resources can't be forcibly taken | Allow timeouts / forced release |
| Circular Wait | Cycle in the wait graph | Impose a global lock ordering |
The classic case: thread A holds lock 1 and waits for lock 2, while thread B holds lock 2 and waits for lock 1. Both threads wait forever. Breaking any one Coffman condition prevents this.
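A sketch of breaking circular wait in C++: std::scoped_lock (C++17) acquires multiple mutexes using a deadlock-avoidance algorithm, so the two threads below cannot deadlock even though they name the locks in opposite orders.

```cpp
#include <mutex>

std::mutex l1, l2;

void thread_a() {
    std::scoped_lock lk(l1, l2); // acquires both or neither -- no hold-and-wait
    // ... critical section ...
}

void thread_b() {
    std::scoped_lock lk(l2, l1); // opposite order is still safe
    // ... critical section ...
}
```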
Other Hazards
Livelock
Threads are not blocked but keep changing state in response to each other without making progress. Like two people stepping aside for each other in a hallway.
Starvation
A thread never gets scheduled or never acquires a lock because higher-priority threads keep preempting it. Fairness policies (FIFO lock queues) prevent this.
Priority Inversion
A high-priority thread waits on a lock held by a low-priority thread, which is preempted by medium-priority threads. Fix: the priority inheritance protocol (the lock holder temporarily inherits the waiter's priority).
ABA Problem
CAS sees value A, but another thread changed it to B and back to A in between. The CAS succeeds even though the state is semantically different. Fix: tagged pointers (add a version counter, as sketched below) or hazard pointers.
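A minimal sketch of the tagged-pointer fix for ABA (the 32/32-bit packing is illustrative): pairing the value with a version counter in one 64-bit atomic means A -> B -> A still bumps the version, so a stale CAS fails.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> slot{0}; // high 32 bits: version tag, low 32 bits: value

bool tagged_cas(uint32_t expected, uint32_t desired) {
    uint64_t cur = slot.load();
    if (static_cast<uint32_t>(cur) != expected) return false; // value already moved
    uint64_t ver = (cur >> 32) + 1;                           // bump the version tag
    uint64_t next = (ver << 32) | desired;
    return slot.compare_exchange_strong(cur, next);           // fails if value OR version changed
}
```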
Async I/O & Event Loops
The key insight: most server workloads are I/O-bound, not CPU-bound. Instead of dedicating one thread per connection (which famously stops scaling around ~10K connections -- the C10K problem), use a small number of threads and non-blocking I/O.
Blocking vs Non-Blocking
Blocking I/O: the thread sleeps until the operation completes. Simple but wastes threads. Non-blocking I/O: the syscall returns immediately with EAGAIN if data isn't ready. You need an I/O multiplexer to know when to retry.
| Multiplexer | OS | Mechanism | Notes |
|---|---|---|---|
| select | All | Scan FD bitmap each call | Limited to 1024 FDs, O(n) per call |
| poll | All | Array of FD structs | No FD limit, still O(n) |
| epoll | Linux | Kernel event queue | O(1) per ready event, edge/level triggered |
| kqueue | BSD/macOS | Kernel event queue | Similar to epoll, also handles signals/timers |
| io_uring | Linux 5.1+ | Shared ring buffer | True async I/O, zero-copy, submission batching |
The epoll event loop skeleton:

loop:
    epoll_wait()             // block until at least one FD is ready
    for each ready FD:
        read/write/accept    // perform the non-blocking I/O
    epoll_ctl()              // add, remove, or re-arm FDs as needed
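A minimal level-triggered epoll echo loop fleshing out that skeleton (Linux only; error handling and non-blocking-mode setup elided for brevity):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);  // block until something is ready
        for (int i = 0; i < n; ++i) {
            int fd = ready[i].data.fd;
            if (fd == listen_fd) {                // new connection: register it
                int conn = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {                              // readable client: echo back
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) { epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr); close(fd); }
                else        { write(fd, buf, r); }
            }
        }
    }
}
```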
Reactor vs Proactor
Reactor (epoll, kqueue, Node.js, Tokio): the event loop notifies you when I/O is ready, then you perform the I/O synchronously. Proactor (io_uring, Windows IOCP): you submit the I/O operation, and the kernel notifies you when it is complete. Proactor avoids the extra syscall after notification and enables true zero-copy.
Concurrency Models
Different languages and runtimes take fundamentally different approaches to managing concurrency. The choice shapes how you think about program structure.
Shared Memory + Locks
The traditional model: threads share memory and use mutexes/atomics to synchronize. Flexible but error-prone. Used in C, C++, Java, Rust (with ownership safety).
CSP (Channels)
Communicating Sequential Processes: goroutines communicate via channels, not shared memory. "Don't communicate by sharing memory; share memory by communicating." Go's core model.
Actor Model
Each actor has private state and a mailbox. Actors communicate only by sending async messages. No shared state, no locks. Used by Erlang/OTP, Akka (Scala/Java), Elixir. Scales to millions of actors and across network boundaries.
Async/Await
Cooperative multitasking with a state-machine transform. The compiler turns async functions into state machines that yield at await points (a sketch follows below). Single-threaded (JS) or multi-threaded (Rust Tokio, Python asyncio + thread pool). No data races if single-threaded.
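A hand-written sketch of what that transform roughly produces (the names a(), b(), and start_io() are hypothetical stand-ins, not real APIs): an async function like `a(); await io(); b();` becomes a resumable object that records where to continue.

```cpp
// Hypothetical compiler output for: async fn f() { a(); await io(); b(); }
void a();
void b();
void start_io(void (*on_complete)(void*), void* ctx); // event loop calls back later

struct AsyncF {
    enum class State { Start, AfterIo, Done } state = State::Start;

    static void on_io_done(void* self) { static_cast<AsyncF*>(self)->resume(); }

    void resume() {
        switch (state) {
        case State::Start:
            a();
            state = State::AfterIo;
            start_io(&on_io_done, this); // the await point: yield to the event loop
            return;
        case State::AfterIo:
            b();
            state = State::Done;
            return;
        case State::Done:
            return;
        }
    }
};
```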
CSP vs Actor Model
| Aspect | CSP (Go Channels) | Actor Model (Erlang) |
|---|---|---|
| Communication | Named channels (typed, directional) | Direct message to actor mailbox |
| Identity | Goroutines are anonymous | Actors have addresses (PIDs) |
| Blocking | Send/receive can block (synchronous channels) | Send is always async (fire-and-forget) |
| Distribution | In-process only (no built-in networking) | Location transparency -- same API local or remote |
| Fault tolerance | Manual error handling | Supervisor trees, "let it crash" |
Rust's Approach
Rust is unique: the ownership and borrow checker prevent data races at compile time. Send and Sync traits mark types that are safe to transfer or share across threads. Mutex<T> owns its data, so you cannot access the data without locking. This makes "fearless concurrency" a reality -- if safe Rust compiles, there are no data races (though deadlocks and higher-level race conditions are still possible).
Test Yourself
Why do you wait on a condition variable in a while loop (not an if)?