Operating Systems

Kernel & System Calls

How user programs talk to hardware through the kernel, the architectures that shape modern operating systems, and the machinery that makes containers possible.

01 / The Kernel

Kernel Space vs User Space

The kernel is the core of an operating system. It has direct access to hardware, memory, and CPU privileged instructions. To protect the system from buggy or malicious programs, modern CPUs enforce protection rings.

CPU Protection Rings
Ring 3 — User Space (applications, libraries, shells)
mode switch boundary (trap / syscall instruction)
Ring 0 — Kernel Space (scheduler, drivers, memory manager)

User space (Ring 3) is where your applications run. Code here cannot directly access hardware, execute privileged CPU instructions, or touch another process's memory. When it needs to do any of those things, it must ask the kernel.

Kernel space (Ring 0) has unrestricted access. The kernel manages processes, memory, file systems, networking, and device drivers. A bug in kernel code can crash the entire machine.

Why the Split Matters
A null-pointer dereference in user space kills one process. The same bug in kernel space triggers a kernel panic and takes down the entire system. The ring boundary is the OS's primary safety net.
02 / System Calls

Crossing the Boundary

A system call (syscall) is the programmatic interface between user space and kernel space. When your program calls read(), it's not calling a regular function -- it triggers a controlled mode switch into Ring 0.

Syscall Flow
User calls read()
libc wrapper
SYSCALL instruction (trap)
Mode switch to Ring 0
Kernel handler runs
Result placed in register
SYSRET (back to Ring 3)
libc returns to caller
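This flow can be watched from user space with strace (assuming it is installed), which logs every crossing into Ring 0:

```shell
# Log the file-I/O syscalls a command makes; each output line is one
# Ring 3 -> Ring 0 -> Ring 3 round trip
strace -e trace=openat,read,write,close cat /proc/version
```

Adding -c instead prints a per-syscall count and timing summary, which is a quick way to see how syscall-heavy a program is.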

The Cost of Syscalls

Each syscall costs roughly 100-1000 nanoseconds depending on the operation. The overhead comes from saving/restoring registers, TLB flushes, cache pollution, and Meltdown/Spectre mitigations such as KPTI. Minimizing syscall counts is a key optimization -- this is why stdio buffers data instead of issuing one read() per byte, and why epoll batches event notifications.
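The payoff from batching is easy to demonstrate with dd: both commands below move the same 100,000 bytes, but the first makes a read()/write() round trip per byte while the second makes roughly one in total (timings vary by machine; the point is the gap):

```shell
# ~200,000 syscalls vs. ~2 -- same data, very different wall-clock time
time dd if=/dev/zero of=/dev/null bs=1 count=100000
time dd if=/dev/zero of=/dev/null bs=100000 count=1
```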

vDSO: Skipping the Switch

Some syscalls don't actually need kernel privileges. gettimeofday() and clock_gettime() just read a clock value. Linux maps a small shared library called the vDSO (virtual Dynamic Shared Object) into every process's address space. These "syscalls" execute entirely in user space -- no trap, no mode switch, and a cost of only a few tens of nanoseconds.
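The vDSO is visible in any process's memory map, which makes the claim easy to verify on a Linux box:

```shell
# Look for the [vdso] region the kernel mapped into this process
grep vdso /proc/self/maps

# glibc's dynamic linker reports it as linux-vdso.so.1 -- a "library"
# with no file on disk
ldd /bin/ls | grep vdso
```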

Key Syscalls to Know
fork() -- create a child process. exec() -- replace process image. open() / read() / write() -- file I/O. mmap() -- map files/memory. socket() -- create network endpoint. epoll_ctl() -- register interest in I/O events.
03 / Kernel Architectures

Monolithic, Micro, and Hybrid

How much code runs in Ring 0 is a fundamental design decision. The three major approaches trade off performance, reliability, and complexity.

Kernel Architecture Comparison
Property    | Monolithic                                 | Microkernel                           | Hybrid
In Ring 0   | Drivers, FS, networking, IPC -- everything | Only IPC, scheduling, basic memory    | Core services in Ring 0, some drivers in user space
Performance | Fast (no IPC overhead)                     | Slower (many context switches)        | Good (selective optimization)
Reliability | One driver bug can crash the system        | Faulty services restart independently | Better than monolithic, less than micro
Examples    | Linux, FreeBSD                             | Mach, QNX, seL4                       | Windows NT, macOS XNU
Use cases   | Servers, desktops, phones                  | Safety-critical (avionics, medical)   | Consumer OS, gaming consoles
Linux is Monolithic -- But Modular
Linux puts everything in Ring 0 for speed, but supports loadable kernel modules (LKMs) that can be inserted and removed at runtime. You get monolithic performance with some of the flexibility of a microkernel. Run lsmod to see currently loaded modules.
04 / Linux Subsystems

Inside the Linux Kernel

Linux's monolithic kernel contains several major subsystems, each responsible for a critical OS function.

CFS (Scheduler)

Completely Fair Scheduler uses a red-black tree to pick the task with the smallest virtual runtime. O(log n) insertion, O(1) pick-next. Replaced by EEVDF in kernel 6.6+.

Memory Manager

Buddy allocator for page-sized blocks, slab allocator (SLUB) for small kernel objects. When memory runs low, the OOM killer picks a process to sacrifice.

VFS

Virtual File System provides a uniform interface (open, read, write) regardless of the underlying filesystem -- ext4, XFS, btrfs, NFS, or /proc.
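The set of filesystem drivers currently registered with the VFS is itself exported as a virtual file; entries tagged nodev (like proc and tmpfs) have no backing block device:

```shell
# Filesystems this kernel's VFS can mount right now
head -8 /proc/filesystems
```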

Network Stack

Implements TCP/IP from L2 (Ethernet) up. Socket buffers (sk_buff), Netfilter (iptables/nftables), and traffic control (tc). Supports XDP (eXpress Data Path) for programmable fast-path packet processing at the driver level, before an sk_buff is even allocated.

The OOM Killer

When the system is out of memory and swap, Linux's OOM killer selects a process to terminate. It calculates an oom_score for each process based primarily on its memory footprint (resident set, swap, and page tables), with adjustments for privileged processes. You can influence it via /proc/<pid>/oom_score_adj (-1000 to +1000).

# Check a process's OOM score
cat /proc/$(pidof nginx)/oom_score

# Make a process immune to the OOM killer (negative values require root)
echo -1000 > /proc/$(pidof critical-app)/oom_score_adj
05 / Interrupts

Top-Half and Bottom-Half Processing

When hardware needs attention (a network packet arrives, a key is pressed), it sends an interrupt to the CPU. The CPU stops what it's doing and jumps to the kernel's interrupt handler. But interrupt handlers must be fast -- they run with interrupts disabled, so they block everything else.

Linux solves this with a two-phase approach:

Interrupt Processing
Hardware IRQ
Top Half (fast, IRQs off)
Bottom Half (deferred, IRQs on)

Top half: Runs immediately with interrupts disabled. Does the minimum: acknowledge the hardware, copy critical data, schedule bottom-half work. Must finish in microseconds.

Bottom half: Runs later with interrupts enabled. Three mechanisms exist:

Mechanism | Context                 | Can Sleep? | Use Case
Softirq   | Interrupt (atomic)      | No         | Networking, block I/O -- high-frequency, per-CPU
Tasklet   | Interrupt (atomic)      | No         | Simpler deferred work, built on softirqs
Workqueue | Process (kernel thread) | Yes        | Any work that needs to sleep (e.g., allocate memory)
Why This Matters
If you've ever seen ksoftirqd consuming CPU in top, that's the kernel processing accumulated bottom-half work. Heavy network traffic generates many softirqs. Tools like mpstat -I SCPU show softirq CPU time per core.
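The per-CPU counters behind that softirq work live in /proc/softirqs; re-running the command under network load shows NET_RX climbing:

```shell
# Softirq counts per CPU since boot (header row lists the CPUs)
grep -E 'CPU|NET_RX|NET_TX|TIMER' /proc/softirqs
```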
06 / Namespaces, Cgroups & Containers

The Building Blocks of Containers

Containers are not a kernel primitive. They're a combination of three kernel features working together: namespaces for isolation, cgroups for resource limits, and OverlayFS for layered filesystems.

Linux Namespaces

Each namespace type isolates a specific system resource, giving a process the illusion of having its own instance:

Namespace Isolation
Each namespace creates an isolated view. A container typically uses all six.

Namespace | Isolates                   | Effect
PID       | Process IDs                | Container sees its init as PID 1
NET       | Network stack              | Own interfaces, IP addresses, routing tables
MNT       | Mount points               | Own filesystem tree, isolated from host mounts
UTS       | Hostname                   | Container can set its own hostname
IPC       | System V IPC, POSIX queues | Isolated shared memory and semaphores
USER      | User/group IDs             | Root inside container maps to unprivileged user on host
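A process's namespace memberships are exposed as symlinks under /proc/<pid>/ns; two processes share a namespace exactly when the corresponding links have the same inode number:

```shell
# One symlink per namespace type, e.g. pid -> pid:[4026531836]
ls -l /proc/self/ns
```

Comparing these inodes for a containerized process and a host process shows precisely which namespaces the container runtime created.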

Cgroups v2

While namespaces handle what a process can see, cgroups (control groups) handle how much it can use. Cgroups v2 provides a unified hierarchy for limiting CPU, memory, I/O, and PIDs.

# Limit a cgroup to 50% of one CPU and 256MB memory
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
echo "268435456" > /sys/fs/cgroup/myapp/memory.max

# See current memory usage
cat /sys/fs/cgroup/myapp/memory.current
Container = Namespaces + Cgroups + OverlayFS
Docker, Podman, and containerd all work the same way at the kernel level: clone() (or unshare()) with namespace flags, assign the process to a cgroup, mount an OverlayFS root. That's it. There's no "container" syscall -- it's just clever use of existing primitives.
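That recipe can be sketched with util-linux's unshare, a thin wrapper over the same namespace flags (usually needs root; the hostname minibox is an arbitrary example, and the cgroup/OverlayFS steps are omitted for brevity):

```shell
# New PID + UTS namespaces: the child shell runs as PID 1 in its own
# namespace and can set a hostname without touching the host's
sudo unshare --pid --fork --uts sh -c '
  hostname minibox
  echo "hostname: $(hostname)  PID: $$"
'
```

A real runtime additionally unshares MNT/NET/IPC/USER, pivots into an OverlayFS root, and places the child in a cgroup before exec()ing the container's init.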

Test Yourself

Question 01
Which CPU ring does user application code execute in?
Ring 3 is the least privileged ring where user applications run. Ring 0 is reserved for the kernel. Most modern OSes only use Ring 0 and Ring 3, skipping Rings 1 and 2 entirely.
Question 02
What does the vDSO optimize?
The vDSO (virtual Dynamic Shared Object) maps a small kernel-provided library into every process. Calls like gettimeofday() execute entirely in user space, avoiding the ~100ns+ cost of a real mode switch.
Question 03
Which kernel architecture does Linux use?
Linux is a monolithic kernel -- all core services (drivers, filesystem, networking) run in Ring 0. However, it supports loadable kernel modules (LKMs) that can be inserted at runtime, giving it some modularity without the IPC overhead of a microkernel.
Question 04
What is the primary advantage of a microkernel over a monolithic kernel?
In a microkernel, drivers and services run in user space. If a driver crashes, only that service is affected -- it can be restarted without rebooting. This comes at the cost of IPC overhead between services, which is why microkernels are typically slower than monolithic kernels.
Question 05
Which Linux bottom-half mechanism can sleep (block)?
Workqueues run in kernel thread context (process context), which means they can sleep, allocate memory with GFP_KERNEL, and do other blocking operations. Softirqs and tasklets run in interrupt context and must not sleep.
Question 06
A container's init process appears as PID 1 inside the container. Which namespace creates this illusion?
The PID namespace gives each container its own set of process IDs. The first process in a new PID namespace becomes PID 1 from the container's perspective, even though it has a different PID on the host.
Question 07
What does Linux's OOM killer use to select which process to terminate?
The OOM killer calculates an oom_score for each process. Processes using more memory get higher scores and are more likely to be killed. The score can be influenced via /proc/<pid>/oom_score_adj, with -1000 making a process immune.
Question 08
What are the three kernel features that make up a Linux container?
A container is not a single kernel feature but a combination: namespaces provide isolation (PID, network, mounts, etc.), cgroups enforce resource limits (CPU, memory, I/O), and overlayFS provides the layered filesystem. There is no "container" syscall.