Threads vs Processes
A process is an independent execution unit with its own virtual address space, file descriptor table, and heap. A thread is a lightweight execution unit within a process -- threads share the process's address space, heap, and file descriptors, but each thread has its own stack, register set, program counter, and thread-local storage (TLS).
Per-process (one copy, shared by every thread):
- Virtual address space
- File descriptor table
- Signal handlers
- PID, UID, GID
- Memory mappings

From a thread's point of view:
- Shared: heap, code, global data, FDs
- Private: stack
- Private: registers, PC
- Private: thread-local storage (TLS)
- Private: signal mask
Creating a thread is cheaper than forking a process because there is no need to duplicate the entire address space. Context-switching between threads within the same process is also faster -- the TLB and page tables remain valid. However, shared memory means one misbehaving thread can corrupt the entire process.
On Linux, both are represented by the same kernel structure: task_struct. The clone() syscall with CLONE_VM | CLONE_FS | CLONE_FILES creates a thread; without those flags, it creates a new process. There is no fundamental kernel distinction -- threads are just processes that share resources.
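A minimal sketch (assuming Linux and a C++11 compiler) that makes the sharing difference concrete: a thread's write to a global is visible to its creator, while a forked child writes only to its own copy-on-write copy.

```cpp
// Sketch: thread vs fork on Linux. The thread shares the address space,
// so its write to `shared` is visible after join(); the forked child
// modifies its own copy-on-write copy, which the parent never sees.
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <thread>

int shared = 0;

int main() {
    std::thread t([] { shared = 1; });
    t.join();
    assert(shared == 1);                     // same address space: write is visible

    pid_t pid = fork();
    if (pid == 0) { shared = 2; _exit(0); }  // child changes only its own copy
    waitpid(pid, nullptr, 0);
    assert(shared == 1);                     // parent's value is unchanged
}
```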
Thread-Local Storage
TLS allows each thread to have its own copy of a variable. In C, __thread int x; or _Thread_local int x; (C11). In POSIX, pthread_key_create() provides dynamic TLS. This is how errno works in multithreaded programs -- each thread gets its own errno without contention.
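A minimal sketch of the same idea in C++ (thread_local is the C++11 spelling of C11's _Thread_local): each thread sees its own counter, which is exactly how a per-thread errno avoids contention.

```cpp
// Each thread gets a private copy of `counter` -- no locks, no contention.
#include <cstdio>
#include <thread>

thread_local int counter = 0; // one instance per thread

void work(const char* name) {
    for (int i = 0; i < 3; ++i) ++counter;            // touches only this thread's copy
    std::printf("%s: counter = %d\n", name, counter); // prints 3 in every thread
}

int main() {
    std::thread a(work, "A"), b(work, "B");
    a.join();
    b.join();
    work("main"); // the main thread has its own copy too
}
```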
Threading Models
The mapping between user-level threads and kernel-level threads defines the threading model of a runtime:
| Model | Mapping | Pros | Cons | Examples |
|---|---|---|---|---|
| 1:1 | 1 user thread = 1 kernel thread | True parallelism, simple scheduling | Expensive creation, limited by OS thread count | Linux pthreads, Java, Rust |
| N:1 | N user threads = 1 kernel thread | Cheap creation, fast context switch | No parallelism, one blocking call blocks all | Early Java (green threads), Ruby MRI (GIL) |
| M:N | M user threads = N kernel threads | Best of both: cheap + parallel | Complex scheduler, debugging is hard | Go (goroutines), Erlang BEAM, Java virtual threads |
Green Threads
Green threads are scheduled in user space by the runtime rather than the OS kernel. They have tiny stacks (Go starts goroutines at 2-8 KB vs. 1-8 MB for kernel threads) and can be created in the millions. The runtime multiplexes them onto a pool of OS threads. When a green thread blocks on I/O, the runtime parks it and schedules another -- no kernel context switch needed.
Go's M:N Scheduler
Go uses three entities: G (goroutine), M (OS thread/machine), and P (logical processor). Each P has a local run queue of Gs. Work-stealing: when a P's queue is empty, it steals Gs from another P. When a G makes a blocking syscall, the M detaches from P, and P gets a new M -- this prevents one blocking call from stalling the scheduler.
Synchronization Primitives
When threads share mutable state, you need synchronization to prevent data races. The choice of primitive depends on contention level, hold time, and access patterns.
Mutex
Mutual exclusion lock. Only one thread holds it at a time. Others block (sleep) until it's released. Use for protecting critical sections with non-trivial hold times.
Spinlock
Busy-waits (spins) in a loop until the lock is free. No context switch overhead, but wastes CPU. Best for very short critical sections (a few instructions) in kernel code or when you know contention is rare.
Read-Write Lock
Allows multiple concurrent readers OR one exclusive writer. Great for read-heavy workloads. Watch for writer starvation -- some implementations prefer readers.
Semaphore
A counter that allows up to N threads to enter. Binary semaphore (N=1) behaves like a mutex. Counting semaphore (N>1) limits concurrency, e.g., connection pools.
Condition Variable
Lets a thread wait until a condition is true. Always used with a mutex. Pattern: lock(); while (!cond) wait(); unlock();. The while loop handles spurious wakeups (see the sketch below).
Compare-and-Swap (CAS)
Atomic CPU instruction: CAS(addr, expected, new). If *addr == expected, set *addr = new and return true. The foundation of lock-free data structures.
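A minimal sketch of the condition-variable pattern above, using a toy one-slot mailbox (the `slot` name is illustrative, not from any library). The predicate overload of cv.wait is the idiomatic C++ spelling of the while loop: it re-checks the condition after every wakeup, so spurious wakeups are harmless.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>

std::mutex m;
std::condition_variable cv;
std::optional<int> slot; // toy one-slot mailbox

void produce(int v) {
    {
        std::lock_guard<std::mutex> lk(m);
        slot = v;
    }
    cv.notify_one();
}

int consume() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return slot.has_value(); }); // equivalent to: while (!cond) wait()
    int v = *slot;
    slot.reset();
    return v;
}
```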
Memory Barriers / Fences
CPUs and compilers reorder instructions for performance. Memory barriers enforce ordering constraints. A store fence ensures all stores before the fence are visible before any store after it. An acquire-release pair (C++ memory_order_acquire / memory_order_release) is the most common pattern: the release store publishes data, and the acquire load sees it.
// C++ acquire-release example
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int data = 0;

int main() {
    std::thread producer([] {
        data = 42;
        ready.store(true, std::memory_order_release);     // release store publishes data
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) {} // acquire load sees the publish
        assert(data == 42);                               // guaranteed to see 42
    });
    producer.join();
    consumer.join();
}
A common mistake: sprinkling memory_order_relaxed everywhere for "performance" and then wondering why your lock-free queue loses messages. Relaxed ordering guarantees atomicity but imposes no ordering relative to other memory operations. Default to seq_cst (sequential consistency) unless you've proven you need weaker ordering.
Concurrency Hazards
Concurrency bugs are among the hardest to diagnose because they are timing-dependent, non-deterministic, and often disappear under a debugger.
Race Conditions
When the outcome depends on the relative timing of two or more threads. A data race is a specific case: two threads access the same memory location concurrently, at least one is a write, and there is no synchronization. In C/C++, data races are undefined behavior. In Go, the race detector (go run -race) catches them at runtime.
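A minimal C++ sketch of a data race (the iteration count is arbitrary): two threads do unsynchronized read-modify-writes on a plain int, which is undefined behavior. Building with -fsanitize=thread (ThreadSanitizer in GCC/Clang) flags it at runtime, much like Go's -race.

```cpp
// Data race: ++counter is a read-modify-write with no synchronization.
#include <cstdio>
#include <thread>

int counter = 0; // plain int -- not std::atomic

int main() {
    auto inc = [] { for (int i = 0; i < 100000; ++i) ++counter; };
    std::thread a(inc), b(inc);
    a.join();
    b.join();
    std::printf("%d\n", counter); // rarely 200000; lost updates are typical
}
```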
Deadlock
Two or more threads are waiting for each other to release resources, forming a cycle. Deadlock requires all four Coffman conditions simultaneously:
| Condition | Description | Break It By |
|---|---|---|
| Mutual Exclusion | Resource is held exclusively | Use sharable resources (read locks) |
| Hold and Wait | Holding one resource, waiting for another | Acquire all resources atomically |
| No Preemption | Resources can't be forcibly taken | Allow timeouts / forced release |
| Circular Wait | Cycle in the wait graph | Impose a global lock ordering |
The classic case: thread A holds lock 1 and waits for lock 2, while thread B holds lock 2 and waits for lock 1. Both threads wait forever. Breaking any one Coffman condition prevents this.
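A sketch of breaking circular wait in C++: std::scoped_lock (C++17) acquires multiple mutexes using a deadlock-avoidance algorithm, so the two threads below cannot deadlock even though they name the locks in opposite orders.

```cpp
#include <mutex>

std::mutex l1, l2;

void thread_a() {
    std::scoped_lock lk(l1, l2); // acquires both or neither -- no hold-and-wait
    // ... critical section ...
}

void thread_b() {
    std::scoped_lock lk(l2, l1); // opposite order is still safe
    // ... critical section ...
}
```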
Other Hazards
Livelock
Threads are not blocked but keep changing state in response to each other without making progress. Like two people stepping aside for each other in a hallway.
Starvation
A thread never gets scheduled or never acquires a lock because higher-priority threads keep preempting it. Fairness policies (FIFO lock queues) prevent this.
Priority Inversion
A high-priority thread waits on a lock held by a low-priority thread, which is preempted by medium-priority threads. Fix: the priority inheritance protocol (the lock holder temporarily inherits the waiter's priority).
ABA Problem
CAS sees value A, but another thread changed it to B and back to A in between. The CAS succeeds even though the state is semantically different. Fix: tagged pointers (add a version counter, as sketched below) or hazard pointers.
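A minimal sketch of the tagged-pointer fix for ABA (the 32/32-bit packing is illustrative): pairing the value with a version counter in one 64-bit atomic means A -> B -> A still bumps the version, so a stale CAS fails.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> slot{0}; // high 32 bits: version tag, low 32 bits: value

bool tagged_cas(uint32_t expected, uint32_t desired) {
    uint64_t cur = slot.load();
    if (static_cast<uint32_t>(cur) != expected) return false; // value already moved
    uint64_t ver = (cur >> 32) + 1;                           // bump the version tag
    uint64_t next = (ver << 32) | desired;
    return slot.compare_exchange_strong(cur, next);           // fails if value OR version changed
}
```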
Async I/O & Event Loops
The key insight: most server workloads are I/O-bound, not CPU-bound. Instead of dedicating one thread per connection (which famously stops scaling around ~10K connections -- the C10K problem), use a small number of threads and non-blocking I/O.
Blocking vs Non-Blocking
Blocking I/O: the thread sleeps until the operation completes. Simple but wastes threads. Non-blocking I/O: the syscall returns immediately with EAGAIN if data isn't ready. You need an I/O multiplexer to know when to retry.
| Multiplexer | OS | Mechanism | Notes |
|---|---|---|---|
| select | All | Scan FD bitmap each call | Limited to 1024 FDs, O(n) per call |
| poll | All | Array of FD structs | No FD limit, still O(n) |
| epoll | Linux | Kernel event queue | O(1) per ready event, edge/level triggered |
| kqueue | BSD/macOS | Kernel event queue | Similar to epoll, also handles signals/timers |
| io_uring | Linux 5.1+ | Shared ring buffer | True async I/O, zero-copy, submission batching |
The epoll event loop skeleton:

loop:
    epoll_wait()             // block until at least one FD is ready
    for each ready FD:
        read/write/accept    // perform the non-blocking I/O
    epoll_ctl()              // add, remove, or re-arm FDs as needed
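A minimal level-triggered epoll echo loop fleshing out that skeleton (Linux only; error handling and non-blocking-mode setup elided for brevity):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);  // block until something is ready
        for (int i = 0; i < n; ++i) {
            int fd = ready[i].data.fd;
            if (fd == listen_fd) {                // new connection: register it
                int conn = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {                              // readable client: echo back
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) { epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr); close(fd); }
                else        { write(fd, buf, r); }
            }
        }
    }
}
```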
Reactor vs Proactor
Reactor (epoll, kqueue, Node.js, Tokio): the event loop notifies you when I/O is ready, then you perform the I/O synchronously. Proactor (io_uring, Windows IOCP): you submit the I/O operation, and the kernel notifies you when it is complete. Proactor avoids the extra syscall after notification and enables true zero-copy.
Concurrency Models
Different languages and runtimes take fundamentally different approaches to managing concurrency. The choice shapes how you think about program structure.
Shared Memory + Locks
The traditional model: threads share memory and use mutexes/atomics to synchronize. Flexible but error-prone. Used in C, C++, Java, Rust (with ownership safety).
CSP (Channels)
Communicating Sequential Processes: goroutines communicate via channels, not shared memory. "Don't communicate by sharing memory; share memory by communicating." Go's core model.
Actor Model
Each actor has private state and a mailbox. Actors communicate only by sending async messages. No shared state, no locks. Used by Erlang/OTP, Akka (Scala/Java), Elixir. Scales to millions of actors and across network boundaries.
Async/Await
Cooperative multitasking with a state-machine transform. The compiler turns async functions into state machines that yield at await points (a sketch follows below). Single-threaded (JS) or multi-threaded (Rust Tokio, Python asyncio + thread pool). No data races if single-threaded.
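A hand-written sketch of what that transform roughly produces (the names a(), b(), and start_io() are hypothetical stand-ins, not real APIs): an async function like `a(); await io(); b();` becomes a resumable object that records where to continue.

```cpp
// Hypothetical compiler output for: async fn f() { a(); await io(); b(); }
void a();
void b();
void start_io(void (*on_complete)(void*), void* ctx); // event loop calls back later

struct AsyncF {
    enum class State { Start, AfterIo, Done } state = State::Start;

    static void on_io_done(void* self) { static_cast<AsyncF*>(self)->resume(); }

    void resume() {
        switch (state) {
        case State::Start:
            a();
            state = State::AfterIo;
            start_io(&on_io_done, this); // the await point: yield to the event loop
            return;
        case State::AfterIo:
            b();
            state = State::Done;
            return;
        case State::Done:
            return;
        }
    }
};
```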
CSP vs Actor Model
| Aspect | CSP (Go Channels) | Actor Model (Erlang) |
|---|---|---|
| Communication | Named channels (typed, directional) | Direct message to actor mailbox |
| Identity | Goroutines are anonymous | Actors have addresses (PIDs) |
| Blocking | Send/receive can block (synchronous channels) | Send is always async (fire-and-forget) |
| Distribution | In-process only (no built-in networking) | Location transparency -- same API local or remote |
| Fault tolerance | Manual error handling | Supervisor trees, "let it crash" |
Rust's Approach
Rust is unique: the ownership and borrow checker prevent data races at compile time. Send and Sync traits mark types that are safe to transfer or share across threads. Mutex<T> owns its data, so you cannot access the data without locking. This makes "fearless concurrency" a reality -- if safe Rust compiles, there are no data races (though deadlocks and higher-level race conditions are still possible).
Test Yourself
Why do you wait on a condition variable in a while loop (not an if)?