Kernel Space vs User Space
The kernel is the core of an operating system. It has direct access to hardware, memory, and privileged CPU instructions. To protect the system from buggy or malicious programs, modern CPUs enforce privilege levels -- known on x86 as protection rings.
User space (Ring 3) is where your applications run. Code here cannot directly access hardware, execute privileged CPU instructions, or touch another process's memory. When it needs to do any of those things, it must ask the kernel.
Kernel space (Ring 0) has unrestricted access. The kernel manages processes, memory, file systems, networking, and device drivers. A bug in kernel code can crash the entire machine.
Crossing the Boundary
A system call (syscall) is the programmatic interface between user space and kernel space. When your program calls read(), it's not calling a regular function -- it triggers a controlled mode switch into Ring 0.
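You can watch these boundary crossings with strace, which logs every syscall a process makes:
# Every read() here is a trap into the kernel, not an ordinary function call
strace -e trace=read cat /etc/hostname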
The Cost of Syscalls
Each syscall costs roughly 100-1000 nanoseconds depending on the operation. The overhead comes from saving/restoring registers, TLB flushes, cache pollution, and, on affected CPUs, Meltdown mitigations (KPTI's page-table switching). Minimizing syscalls is a key optimization -- it's why stdio buffers reads in user space and epoll delivers many events per call.
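A rough way to see this in practice, using strace's -c summary (exact counts and timings vary by system): copying 64KB one byte at a time makes ~130,000 read/write syscalls, while one 64KB buffer needs just two.
# Same data moved, ~130,000 syscalls vs a handful
strace -c dd if=/dev/zero of=/dev/null bs=1 count=65536 2>&1 | tail -5
strace -c dd if=/dev/zero of=/dev/null bs=65536 count=1 2>&1 | tail -5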
vDSO: Skipping the Switch
Some syscalls don't actually need kernel privileges. gettimeofday() and clock_gettime() just read a clock value. Linux maps a small shared library called the vDSO (virtual Dynamic Shared Object) into every process's address space. These "syscalls" execute entirely in user space -- no trap, no mode switch, only tens of nanoseconds of cost.
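You can confirm the vDSO is mapped into your own process:
# The [vdso] region appears in every process's memory map
grep vdso /proc/self/maps
# The dynamic linker resolves it before any real library
ldd /bin/true | grep vdso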
A few syscalls you'll encounter constantly:
- fork() -- create a child process
- exec() -- replace the process image
- open() / read() / write() -- file I/O
- mmap() -- map files or anonymous memory
- socket() -- create a network endpoint
- epoll_ctl() -- register interest in I/O events
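To see the classic fork+exec pattern in action, trace a shell spawning a command (strace -f follows child processes; on Linux, fork() is implemented via the clone() syscall):
# The shell clones itself, then the child replaces its image with ls
strace -f -e trace=clone,execve sh -c 'ls' 2>&1 | head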
Monolithic, Micro, and Hybrid
How much code runs in Ring 0 is a fundamental design decision. The three major approaches trade off performance, reliability, and complexity.
| Property | Monolithic | Microkernel | Hybrid |
|---|---|---|---|
| In Ring 0 | Drivers, FS, networking, IPC -- everything | Only IPC, scheduling, basic memory | Core services in Ring 0, some drivers in user space |
| Performance | Fast (no IPC overhead) | Slower (many context switches) | Good (selective optimization) |
| Reliability | One driver bug can crash system | Faulty services restart independently | Better than monolithic, less than micro |
| Examples | Linux, FreeBSD | Mach, QNX, seL4 | Windows NT, macOS XNU |
| Use cases | Servers, desktops, phones | Safety-critical (avionics, medical) | Consumer OS, gaming consoles |
Monolithic doesn't mean static: Linux can load and unload drivers at runtime as kernel modules. Run lsmod to see currently loaded modules.
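For example (module names vary by machine, and some, like loop, may be built into your kernel instead):
# First few currently loaded modules
lsmod | head -5
# Metadata for one module
modinfo loop | head -5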
Inside the Linux Kernel
Linux's monolithic kernel contains several major subsystems, each responsible for a critical OS function.
Scheduler: The Completely Fair Scheduler (CFS) uses a red-black tree to pick the task with the smallest virtual runtime -- O(log n) insertion, O(1) pick-next. Replaced by EEVDF as the default in kernel 6.6+.
Memory management: A buddy allocator hands out page-sized blocks; the slab allocator (SLUB) serves small kernel objects on top of it. When memory runs low, the OOM killer picks a process to sacrifice.
VFS: The Virtual File System provides a uniform interface (open, read, write) regardless of the underlying filesystem -- ext4, XFS, btrfs, NFS, or even /proc.
Networking: Implements TCP/IP from L2 (Ethernet) up: socket buffers (sk_buff), Netfilter (iptables/nftables), and traffic control (tc). XDP provides an early fast path that processes packets before the regular stack allocates sk_buffs.
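Each subsystem exposes its state under /proc and /sys; a few quick probes (slabtop ships with procps-ng):
# Free blocks per zone and order -- the buddy allocator's free lists
cat /proc/buddyinfo
# Slab caches managed by SLUB (one-shot output)
sudo slabtop -o | head
# Every filesystem type the VFS currently knows about
cat /proc/filesystems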
The OOM Killer
When the system is out of both memory and swap, Linux's OOM killer selects a process to terminate. It computes an oom_score for each process, based primarily on how much memory killing it would reclaim. You can bias the choice via /proc/<pid>/oom_score_adj (-1000 means never kill, +1000 means prefer to kill).
# Check a process's OOM score
cat /proc/$(pidof nginx)/oom_score
# Make a process immune to OOM killer
echo -1000 > /proc/$(pidof critical-app)/oom_score_adj
Top-Half and Bottom-Half Processing
When hardware needs attention (a network packet arrives, a key is pressed), it sends an interrupt to the CPU. The CPU stops what it's doing and jumps to the kernel's interrupt handler. But interrupt handlers must be fast -- they run with interrupts disabled, so they block everything else on that CPU.
Linux solves this with a two-phase approach:
Top half: Runs immediately with interrupts disabled. Does the minimum: acknowledge the hardware, copy critical data, schedule bottom-half work. Must finish in microseconds.
Bottom half: Runs later with interrupts enabled. Three mechanisms exist:
| Mechanism | Context | Can Sleep? | Use Case |
|---|---|---|---|
| Softirq | Interrupt (atomic) | No | Networking, block I/O -- high-frequency, per-CPU |
| Tasklet | Interrupt (atomic) | No | Simpler deferred work, built on softirqs |
| Workqueue | Process (kernel thread) | Yes | Any work that needs to sleep (e.g., allocate memory) |
If you see ksoftirqd threads consuming CPU in top, that's the kernel processing accumulated bottom-half work -- heavy network traffic generates many softirqs. Tools like mpstat -I SCPU show softirq CPU time per core.
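Both views are available with standard tooling (mpstat is from the sysstat package):
# Softirq counts per CPU, broken down by type (NET_RX, BLOCK, TIMER, ...)
cat /proc/softirqs
# Percentage of CPU time spent in softirq handling, per core
mpstat -I SCPU 1 1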
The Building Blocks of Containers
Containers are not a kernel primitive. They're a combination of three kernel features working together: namespaces for isolation, cgroups for resource limits, and OverlayFS for layered filesystems.
Linux Namespaces
Each namespace type isolates a specific system resource, giving a process the illusion of having its own instance:
| Namespace | Isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees its init as PID 1 |
| NET | Network stack | Own interfaces, IP addresses, routing tables |
| MNT | Mount points | Own filesystem tree, isolated from host mounts |
| UTS | Hostname | Container can set its own hostname |
| IPC | System V IPC, POSIX queues | Isolated shared memory and semaphores |
| USER | User/group IDs | Root inside container maps to unprivileged user on host |
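You can try namespaces first-hand with unshare from util-linux; no container runtime needed. The USER namespace demo requires no root at all:
# Root inside the new user namespace, your normal user outside
unshare --user --map-root-user id
# A new PID namespace: the shell sees itself as PID 1 (needs root)
sudo unshare --pid --fork --mount-proc sh -c 'echo $$; ps aux'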
Cgroups v2
While namespaces handle what a process can see, cgroups (control groups) handle how much it can use. Cgroups v2 provides a unified hierarchy for limiting CPU, memory, I/O, and PIDs.
# Limit a cgroup to 50% of one CPU and 256MB memory
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
echo "268435456" > /sys/fs/cgroup/myapp/memory.max
# See current memory usage
cat /sys/fs/cgroup/myapp/memory.current
Under the hood, that's all a container runtime does: call clone() with namespace flags, assign the child to a cgroup, and mount an OverlayFS root. That's it. There's no "container" syscall -- it's just clever use of existing primitives.
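You can approximate it by hand with the unshare CLI (a sketch only: it covers the namespaces but skips the cgroup and OverlayFS steps, and requires root):
# New PID, mount, UTS, network, and IPC namespaces around a shell
sudo unshare --pid --fork --mount-proc --uts --net --ipc bash
# Inside: ps shows only this shell, and changes stay in the namespaces
ps aux
hostname mini-container
ip link    # just a lone, down loopback interface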