Anatomy of a Process
A process is more than just running code. The kernel tracks each process through a Process Control Block (PCB), a data structure (called task_struct in Linux, ~6 KB) containing everything the OS needs: PID, register state, memory maps, open file descriptors, signal handlers, scheduling priority, and more.
Every process exists in one of several well-defined states. The kernel transitions processes between these states in response to system calls, interrupts, and scheduling decisions.
- New → Ready: process created and admitted by the scheduler
- Ready → Running: scheduler dispatches the process onto a CPU
- Running → Waiting: process blocks on I/O or a resource
- Waiting → Ready: I/O completes or resource becomes available
- Running → Ready: preempted by scheduler (time slice expired)
- Running → Terminated: process exits or is killed
What the PCB Holds
| Field | Purpose |
|---|---|
| PID / PPID | Unique identifier and parent's PID |
| Register state | Saved CPU registers for context switching |
| Memory maps | Page table pointer, virtual address space layout |
| File descriptors | Table of open files, sockets, pipes |
| Signal handlers | Registered handlers for each signal type |
| Scheduling info | Priority, nice value, CPU time consumed |
| Credentials | UID, GID, capabilities |
Fork, Exec, and the Process Tree
On Unix, new processes are created via the fork() + exec() pattern. fork() creates a child process that is a near-exact copy of the parent. exec() replaces the child's address space with a new program.
```c
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *args[] = {"ls", "-l", NULL};
    int status;

    pid_t pid = fork();            // create child
    if (pid == 0) {
        // child process
        execvp("ls", args);        // replace with new program
        _exit(127);                // reached only if exec fails
    } else {
        // parent process
        waitpid(pid, &status, 0);  // wait for child
    }
}
```
Copy-on-Write (CoW)
fork() does not actually copy all memory. Instead, parent and child share the same physical pages marked as read-only. Only when either process writes to a page does the kernel copy it. This makes fork() extremely fast, often completing in under 1 ms even for processes with GBs of memory.
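Copy-on-write is easy to observe: after fork(), a write by the child triggers a private page copy, so the parent's data is untouched. A minimal sketch (the cow_isolated helper name is ours, not a real API):

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent's buffer is unchanged after the child
 * writes to its own copy-on-write copy of the page. */
int cow_isolated(void) {
    static char buf[4096];
    memset(buf, 'A', sizeof buf);
    pid_t pid = fork();
    if (pid == 0) {            /* child: this write copies the page */
        buf[0] = 'B';
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* parent: its page was never touched */
    return buf[0] == 'A';
}
```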
PID 1 and the Process Tree
Every process has a parent (PPID). The first userspace process is init (or systemd) with PID 1. All other processes descend from it, forming a tree visible via pstree.
Zombie: A child that has exited but whose parent hasn't called wait() to reap it. It occupies a slot in the process table (no memory, but a PID). Too many zombies can exhaust the PID space. Fix: the parent must call wait() or handle SIGCHLD.
Process Memory Layout
Each process gets its own virtual address space. The kernel maps virtual addresses to physical frames via page tables. The layout follows a standard convention on modern systems.
Key Segments
| Segment | Contents | Growth |
|---|---|---|
| text | Machine code instructions | Fixed, read-only |
| data | Initialized global/static variables | Fixed |
| BSS | Uninitialized globals (zeroed by kernel) | Fixed |
| heap | malloc()/brk() allocations | Grows upward |
| mmap | Shared libraries, memory-mapped files | Middle region |
| stack | Local variables, return addresses, frames | Grows downward |
You can inspect a live process's actual layout with cat /proc/self/maps.
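The segments can also be located from inside a program by printing one address from each. A minimal sketch (print_layout is our name; with ASLR enabled, the heap and stack addresses change between runs):

```c
#include <stdio.h>
#include <stdlib.h>

int initialized = 42;   /* data segment */
int uninitialized;      /* BSS, zeroed by the kernel */

/* Print one representative address from each segment. */
void print_layout(void) {
    int local = 0;                     /* stack */
    int *heap = malloc(sizeof *heap);  /* heap */
    printf("text : %p\n", (void *)print_layout);
    printf("data : %p\n", (void *)&initialized);
    printf("bss  : %p\n", (void *)&uninitialized);
    printf("heap : %p\n", (void *)heap);
    printf("stack: %p\n", (void *)&local);
    free(heap);
}
```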
CPU Scheduling
The scheduler decides which ready process gets the CPU next. This decision profoundly affects system responsiveness and throughput.
Preemptive vs Cooperative
| Aspect | Preemptive | Cooperative |
|---|---|---|
| Who decides | OS forces context switch via timer interrupt | Process voluntarily yields |
| Fairness | Guaranteed, no starvation | One misbehaving process can hog CPU |
| Complexity | Needs careful synchronization | Simpler, fewer race conditions |
| Used in | All modern OS kernels | Event loops (Node.js), coroutines, early Mac OS |
Linux CFS (Completely Fair Scheduler)
Linux uses CFS as the default scheduler. Instead of fixed time slices, CFS tracks vruntime (virtual runtime) for each process. The process with the lowest vruntime gets the CPU next. CFS uses a red-black tree to find this process in O(log n).
CFS aims for perfect fairness: if two processes have equal priority, each should get exactly 50% of CPU time. In practice, it achieves near-perfect fairness with sub-millisecond granularity.
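The pick-lowest-vruntime rule can be sketched without the red-black tree: a linear scan finds the same task CFS would pick, just in O(n) instead of O(log n). The struct and function names here are illustrative, not kernel API:

```c
#include <stddef.h>

struct task {
    const char *name;
    unsigned long vruntime;  /* weighted CPU time consumed so far */
};

/* Return the index of the task with the smallest vruntime.
 * CFS performs this lookup via a red-black tree keyed on vruntime. */
size_t pick_next(const struct task *runqueue, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (runqueue[i].vruntime < runqueue[best].vruntime)
            best = i;
    return best;
}
```

A task that sleeps a lot (I/O-bound) accumulates little vruntime, so it wins this comparison as soon as it wakes, which is exactly the latency boost described below.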
You can adjust a process's priority at launch with nice -n 10 ./my_program, or change a running process with renice -n 5 -p PID.
CPU-bound vs I/O-bound
CPU-bound: Spends most time computing (video encoding, ML training). Benefits from longer time slices. Throughput-sensitive.
I/O-bound: Spends most time waiting on I/O (web servers, databases). Needs low latency to resume quickly. CFS naturally boosts these because their vruntime stays low.
Context Switching
When the scheduler decides to run a different process, the kernel performs a context switch: saving the current process's state and loading the next one's. This is pure overhead — no useful work happens during a context switch.
What Gets Saved
CPU state: General-purpose registers, program counter, stack pointer, flags register. Saved on the kernel stack and referenced from the PCB.
Memory context: The CR3 register on x86 points to the process's page table; changing it switches the virtual address space.
FPU/SIMD state: Floating point and vector registers (SSE, AVX). Can be lazily saved/restored to reduce overhead.
The Cost
A context switch itself takes 1-10 microseconds of direct CPU time. But the indirect cost is much larger: the TLB (Translation Lookaside Buffer) gets flushed, and CPU caches go cold. This can cause thousands of cache misses as the new process warms up, adding 10-100 microseconds of effective cost.
Modern x86 CPUs reduce the TLB penalty with PCID (tagged TLB entries that survive an address-space switch); check for support with grep pcid /proc/cpuinfo.
IPC Mechanisms
Processes have isolated address spaces, so they need explicit mechanisms to communicate. The kernel provides several IPC primitives, each with different tradeoffs.
Pipes: Unidirectional byte stream between related processes. Created with pipe(). Shell pipes (|) use this. Named pipes (FIFOs) work between unrelated processes.
Message queues: Kernel-managed queue of typed messages. Processes send/receive discrete messages. POSIX (mq_open) or System V (msgget) variants.
Shared memory: Fastest IPC: multiple processes map the same physical pages. No kernel involvement for reads/writes. Needs synchronization (semaphores/mutexes). shmget() or mmap() with MAP_SHARED.
Semaphores: Synchronization primitive, not data transfer. Controls access to shared resources. sem_wait() decrements, sem_post() increments. Often used alongside shared memory.
Signals: Asynchronous notifications: SIGTERM, SIGKILL, SIGCHLD, SIGUSR1, etc. Limited bandwidth (just a number), but useful for control flow. Two cannot be caught, blocked, or ignored: SIGKILL and SIGSTOP.
Unix domain sockets: Full-duplex, connection-oriented IPC via filesystem path. Same API as network sockets but ~2x faster (no TCP/IP stack). Used by Docker, PostgreSQL, systemd.
IPC Performance Comparison
| Mechanism | Throughput | Latency | Best For |
|---|---|---|---|
| Shared Memory | Highest (memcpy speed) | Lowest | High-throughput data sharing |
| Unix Domain Socket | High (~6 GB/s) | Low (~2 us) | Client-server on same host |
| Pipe | Moderate (~3 GB/s) | Low (~3 us) | Parent-child streaming |
| Message Queue | Moderate | Moderate | Decoupled message passing |
| Signal | Minimal (just signal number) | Low | Process control, notifications |