Kernel Space vs User Space
The kernel is the core of an operating system. It has direct access to hardware, memory, and privileged CPU instructions. To protect the system from buggy or malicious programs, modern CPUs enforce privilege levels -- known on x86 as protection rings.
User space (Ring 3) is where your applications run. Code here cannot directly access hardware, execute privileged CPU instructions, or touch another process's memory. When it needs to do any of those things, it must ask the kernel.
Kernel space (Ring 0) has unrestricted access. The kernel manages processes, memory, file systems, networking, and device drivers. A bug in kernel code can crash the entire machine.
Crossing the Boundary
A system call (syscall) is the programmatic interface between user space and kernel space. When your program calls read(), it's not calling a regular function -- it triggers a controlled mode switch into Ring 0.
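You can watch these boundary crossings with strace, which logs every syscall a process makes:
# Every read() here is a trap into the kernel, not an ordinary function call
strace -e trace=read cat /etc/hostname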
The Cost of Syscalls
Each syscall costs roughly 100-1000 nanoseconds depending on the operation. The overhead comes from saving/restoring registers, TLB flushes, cache pollution, and, on affected CPUs, Meltdown mitigations (KPTI's page-table switching). Minimizing syscalls is a key optimization -- it's why stdio buffers reads in user space and epoll delivers many events per call.
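A rough way to see this in practice, using strace's -c summary (exact counts and timings vary by system): copying 64KB one byte at a time makes ~130,000 read/write syscalls, while one 64KB buffer needs just two.
# Same data moved, ~130,000 syscalls vs a handful
strace -c dd if=/dev/zero of=/dev/null bs=1 count=65536 2>&1 | tail -5
strace -c dd if=/dev/zero of=/dev/null bs=65536 count=1 2>&1 | tail -5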
vDSO: Skipping the Switch
Some syscalls don't actually need kernel privileges. gettimeofday() and clock_gettime() just read a clock value. Linux maps a small shared library called the vDSO (virtual Dynamic Shared Object) into every process's address space. These "syscalls" execute entirely in user space -- no trap, no mode switch, only tens of nanoseconds of cost.
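You can confirm the vDSO is mapped into your own process:
# The [vdso] region appears in every process's memory map
grep vdso /proc/self/maps
# The dynamic linker resolves it before any real library
ldd /bin/true | grep vdso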
A few syscalls you'll encounter constantly:
- fork() -- create a child process
- exec() -- replace the process image
- open() / read() / write() -- file I/O
- mmap() -- map files or anonymous memory
- socket() -- create a network endpoint
- epoll_ctl() -- register interest in I/O events
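To see the classic fork+exec pattern in action, trace a shell spawning a command (strace -f follows child processes; on Linux, fork() is implemented via the clone() syscall):
# The shell clones itself, then the child replaces its image with ls
strace -f -e trace=clone,execve sh -c 'ls' 2>&1 | head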
Monolithic, Micro, and Hybrid
How much code runs in Ring 0 is a fundamental design decision. The three major approaches trade off performance, reliability, and complexity.
| Property | Monolithic | Microkernel | Hybrid |
|---|---|---|---|
| In Ring 0 | Drivers, FS, networking, IPC -- everything | Only IPC, scheduling, basic memory | Core services in Ring 0, some drivers in user space |
| Performance | Fast (no IPC overhead) | Slower (many context switches) | Good (selective optimization) |
| Reliability | One driver bug can crash system | Faulty services restart independently | Better than monolithic, less than micro |
| Examples | Linux, FreeBSD | Mach, QNX, seL4 | Windows NT, macOS XNU |
| Use cases | Servers, desktops, phones | Safety-critical (avionics, medical) | Consumer OS, gaming consoles |
Monolithic doesn't mean static: Linux can load and unload drivers at runtime as kernel modules. Run lsmod to see currently loaded modules.
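For example (module names vary by machine, and some, like loop, may be built into your kernel instead):
# First few currently loaded modules
lsmod | head -5
# Metadata for one module
modinfo loop | head -5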
Inside the Linux Kernel
Linux's monolithic kernel contains several major subsystems, each responsible for a critical OS function.
Scheduler: The Completely Fair Scheduler (CFS) uses a red-black tree to pick the task with the smallest virtual runtime -- O(log n) insertion, O(1) pick-next. Replaced by EEVDF as the default in kernel 6.6+.
Memory management: A buddy allocator hands out page-sized blocks; the slab allocator (SLUB) serves small kernel objects on top of it. When memory runs low, the OOM killer picks a process to sacrifice.
VFS: The Virtual File System provides a uniform interface (open, read, write) regardless of the underlying filesystem -- ext4, XFS, btrfs, NFS, or even /proc.
Networking: Implements TCP/IP from L2 (Ethernet) up: socket buffers (sk_buff), Netfilter (iptables/nftables), and traffic control (tc). XDP provides an early fast path that processes packets before the regular stack allocates sk_buffs.
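Each subsystem exposes its state under /proc and /sys; a few quick probes (slabtop ships with procps-ng):
# Free blocks per zone and order -- the buddy allocator's free lists
cat /proc/buddyinfo
# Slab caches managed by SLUB (one-shot output)
sudo slabtop -o | head
# Every filesystem type the VFS currently knows about
cat /proc/filesystems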
The OOM Killer
When the system is out of both memory and swap, Linux's OOM killer selects a process to terminate. It computes an oom_score for each process, based primarily on how much memory killing it would reclaim. You can bias the choice via /proc/<pid>/oom_score_adj (-1000 means never kill, +1000 means prefer to kill).
# Check a process's OOM score
cat /proc/$(pidof nginx)/oom_score
# Make a process immune to OOM killer
echo -1000 > /proc/$(pidof critical-app)/oom_score_adj
Top-Half and Bottom-Half Processing
When hardware needs attention (a network packet arrives, a key is pressed), it sends an interrupt to the CPU. The CPU stops what it's doing and jumps to the kernel's interrupt handler. But interrupt handlers must be fast -- they run with interrupts disabled, so they block everything else on that CPU.
Linux solves this with a two-phase approach:
Top half: Runs immediately with interrupts disabled. Does the minimum: acknowledge the hardware, copy critical data, schedule bottom-half work. Must finish in microseconds.
Bottom half: Runs later with interrupts enabled. Three mechanisms exist:
| Mechanism | Context | Can Sleep? | Use Case |
|---|---|---|---|
| Softirq | Interrupt (atomic) | No | Networking, block I/O -- high-frequency, per-CPU |
| Tasklet | Interrupt (atomic) | No | Simpler deferred work, built on softirqs |
| Workqueue | Process (kernel thread) | Yes | Any work that needs to sleep (e.g., allocate memory) |
If you see ksoftirqd threads consuming CPU in top, that's the kernel processing accumulated bottom-half work -- heavy network traffic generates many softirqs. Tools like mpstat -I SCPU show softirq CPU time per core.
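Both views are available with standard tooling (mpstat is from the sysstat package):
# Softirq counts per CPU, broken down by type (NET_RX, BLOCK, TIMER, ...)
cat /proc/softirqs
# Percentage of CPU time spent in softirq handling, per core
mpstat -I SCPU 1 1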
The Building Blocks of Containers
Containers are not a kernel primitive. They're a combination of three kernel features working together: namespaces for isolation, cgroups for resource limits, and OverlayFS for layered filesystems.
Linux Namespaces
Each namespace type isolates a specific system resource, giving a process the illusion of having its own instance:
| Namespace | Isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees its init as PID 1 |
| NET | Network stack | Own interfaces, IP addresses, routing tables |
| MNT | Mount points | Own filesystem tree, isolated from host mounts |
| UTS | Hostname | Container can set its own hostname |
| IPC | System V IPC, POSIX queues | Isolated shared memory and semaphores |
| USER | User/group IDs | Root inside container maps to unprivileged user on host |
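You can try namespaces first-hand with unshare from util-linux; no container runtime needed. The USER namespace demo requires no root at all:
# Root inside the new user namespace, your normal user outside
unshare --user --map-root-user id
# A new PID namespace: the shell sees itself as PID 1 (needs root)
sudo unshare --pid --fork --mount-proc sh -c 'echo $$; ps aux'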
Cgroups v2
While namespaces handle what a process can see, cgroups (control groups) handle how much it can use. Cgroups v2 provides a unified hierarchy for limiting CPU, memory, I/O, and PIDs.
# Limit a cgroup to 50% of one CPU and 256MB memory
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
echo "268435456" > /sys/fs/cgroup/myapp/memory.max
# See current memory usage
cat /sys/fs/cgroup/myapp/memory.current
Under the hood, that's all a container runtime does: call clone() with namespace flags, assign the child to a cgroup, and mount an OverlayFS root. That's it. There's no "container" syscall -- it's just clever use of existing primitives.
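You can approximate it by hand with the unshare CLI (a sketch only: it covers the namespaces but skips the cgroup and OverlayFS steps, and requires root):
# New PID, mount, UTS, network, and IPC namespaces around a shell
sudo unshare --pid --fork --mount-proc --uts --net --ipc bash
# Inside: ps shows only this shell, and changes stay in the namespaces
ps aux
hostname mini-container
ip link    # just a lone, down loopback interface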