Infrastructure

Kubernetes container orchestration

How Kubernetes manages containerized workloads at scale -- from the control plane and worker nodes to networking, scaling, scheduling, and deployment strategies.

01 / Architecture

Control Plane & Worker Nodes

A Kubernetes cluster has two halves: the control plane that makes global decisions (scheduling, detecting failures, responding to events) and worker nodes that run your actual application containers.

[Diagram: control plane components -- API Server, etcd, Scheduler, Controller Manager]

Control Plane

kube-apiserver is the front door -- every kubectl command, every internal component, and every webhook hits this REST API. It validates requests, persists state to etcd, and serves as the hub for all cluster communication.

etcd is the single source of truth: a distributed key-value store holding all cluster state. Losing etcd without backups means losing the cluster.

kube-scheduler watches for newly created Pods with no assigned node, then picks the best node based on resource requirements, affinity rules, taints, and other constraints.

kube-controller-manager runs a collection of control loops (Deployment controller, ReplicaSet controller, Node controller, etc.) that continuously reconcile desired state with actual state.

Worker Nodes

kubelet is the node agent. It takes PodSpecs from the API server and ensures the described containers are running and healthy. It reports node status back to the control plane.

kube-proxy maintains network rules on each node, implementing the Service abstraction by programming iptables or IPVS rules so that traffic to a Service ClusterIP reaches the right Pod.

Container runtime (containerd, CRI-O) does the actual work of pulling images, creating containers, and managing their lifecycle via the Container Runtime Interface (CRI).

Key Insight
Kubernetes is declarative: you tell it what you want (desired state in etcd) and controllers continuously reconcile reality to match. This reconciliation loop is the core design pattern of the entire system.
02 / Core Objects

Pods, Deployments, Services & Ingress

A Pod is the smallest deployable unit -- one or more containers sharing the same network namespace and storage volumes. Pods are ephemeral; they get an IP but that IP dies with the Pod.

A Deployment manages a ReplicaSet, which in turn manages Pods. It gives you declarative updates, rollback history, and scaling. You almost never create Pods directly.
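
A minimal Deployment sketch (the names web and nginx:1.27 are illustrative, not from the original):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web          # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80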

Service Types

Because Pod IPs are transient, a Service provides a stable virtual IP (ClusterIP) and DNS name that load-balances traffic across matching Pods.

Service Type  | Scope                      | Use Case
ClusterIP     | Internal only              | Default; inter-service communication within the cluster
NodePort      | External via node IP:port  | Dev/testing; exposes a static port (30000-32767) on every node
LoadBalancer  | External via cloud LB      | Production; provisions a cloud load balancer automatically
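
A minimal ClusterIP Service sketch, assuming the hypothetical app: web Pods from the Deployment example above:

apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: web            # routes to Pods carrying this label
  ports:
  - port: 80            # stable Service port
    targetPort: 80      # container port on the Pods
  # type defaults to ClusterIP when omitted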

Ingress

Ingress sits in front of Services and provides HTTP/HTTPS routing -- host-based and path-based rules, TLS termination, and more. It requires an Ingress Controller (NGINX, Traefik, or a cloud-native one) to function.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-svc
            port:
              number: 80
03 / Configuration & Workloads

ConfigMap, Secret, Storage & Specialized Workloads

ConfigMap

Key-value config data injected as environment variables or mounted as files. Not for secrets -- data is stored in plaintext.

Secret

Base64-encoded sensitive data (passwords, tokens). Not encrypted at rest by default -- enable EncryptionConfiguration or use external secret managers.
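
A sketch of a Pod consuming both; the object names (app-config, db-credentials) are hypothetical and assumed to already exist:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "env && sleep 3600"]
    envFrom:
    - configMapRef:
        name: app-config          # every key becomes an env var
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials    # assumed existing Secret
          key: password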

PV / PVC

PersistentVolume (PV) is a cluster-level storage resource. PersistentVolumeClaim (PVC) is a user's request for storage. Decouples storage provisioning from consumption.
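
A minimal PVC sketch (the storage class name is an assumption about your cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard    # assumed class; omit to use the cluster default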

StatefulSet

Like a Deployment but gives each Pod a stable hostname (pod-0, pod-1) and persistent storage. Essential for databases and stateful apps.
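
A sketch showing per-replica storage via volumeClaimTemplates; names (db, db-headless) are illustrative, and the headless Service is assumed to exist:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless        # assumed headless Service for stable DNS
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:           # one PVC per replica (data-db-0, data-db-1, ...)
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi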

DaemonSet

Ensures a copy of a Pod runs on every node (or a subset). Used for log collectors, monitoring agents, and CNI plugins.

Job / CronJob

Job runs a Pod to completion. CronJob schedules Jobs on a cron schedule. Use for batch processing, backups, and periodic tasks.
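
A minimal CronJob sketch (name and command are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"           # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: busybox:1.36
            command: ["sh", "-c", "echo backing up && sleep 10"]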

Namespace

Virtual cluster partitioning. Provides scope for names, resource quotas, and RBAC policies. Default namespaces: default, kube-system, kube-public.

Warning
Kubernetes Secrets are base64-encoded, not encrypted. Anyone with API access can read them. For production, use sealed-secrets, HashiCorp Vault, or cloud KMS-backed encryption at rest.
04 / Networking

Pod Networking, CNI & Service Mesh

Kubernetes networking has one fundamental rule: every Pod gets its own IP, and any Pod can reach any other Pod without NAT. This flat network model is implemented by a CNI plugin.

CNI Plugins

Plugin | Approach                      | Strengths
Calico | BGP routing or VXLAN overlay  | Mature, strong NetworkPolicy support, scales to thousands of nodes
Cilium | eBPF-based dataplane          | High performance, L7 visibility, transparent encryption, identity-based policies

Network Policies

By default, all Pods can talk to all Pods. A NetworkPolicy is a firewall rule scoped to a namespace that restricts ingress and/or egress traffic based on Pod labels, namespace selectors, or CIDR blocks.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}       # applies to all pods in namespace
  policyTypes:
  - Ingress             # deny all inbound by default
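
Pairing the deny-all above with a targeted allow rule shows label-based selection; the labels here (api, frontend) and port are hypothetical:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api              # policy applies to api Pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend     # only frontend Pods may connect
    ports:
    - protocol: TCP
      port: 8080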

Service Mesh

A service mesh (Istio, Linkerd) adds a sidecar proxy to every Pod, giving you mutual TLS, traffic shaping, retries, circuit breaking, and distributed tracing -- all without changing application code.

[Diagram: service mesh data path -- App Container -> Sidecar Proxy (Envoy) -> Network -> Sidecar Proxy -> App Container]
Tip
Cilium can replace kube-proxy entirely and handle service load balancing via eBPF, eliminating iptables overhead. It can also replace a service mesh for mTLS and L7 policies without sidecars.
05 / Scaling & Scheduling

Autoscaling, Resource Management & Pod Placement

Autoscaling

HPA (Horizontal Pod Autoscaler) adds or removes Pod replicas based on CPU, memory, or custom metrics. VPA (Vertical Pod Autoscaler) adjusts resource requests/limits on existing Pods. Cluster Autoscaler adds or removes nodes when Pods cannot be scheduled or nodes are underutilized.

Autoscaler         | What It Scales            | When to Use
HPA                | Pod count (horizontal)    | Stateless workloads with variable traffic
VPA                | Pod CPU/memory (vertical) | When you don't know right-sized requests; typically not combined with HPA on the same metric
Cluster Autoscaler | Node count                | When pending Pods exist due to insufficient cluster capacity
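
A minimal HPA sketch targeting the hypothetical web Deployment from earlier at 70% average CPU:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70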

Resource Requests & Limits

requests are what the scheduler uses to find a node with enough capacity. limits are the hard ceiling enforced by the kubelet -- exceed memory limits and your container gets OOMKilled; exceed CPU limits and it gets throttled.

QoS Classes
Kubernetes assigns a QoS class based on how you set requests and limits:
Guaranteed -- requests == limits for all containers. Highest priority, last to be evicted.
Burstable -- at least one container sets a request or limit, but the Pod doesn't meet the Guaranteed criteria. Medium priority.
BestEffort -- no requests or limits set. First to be evicted under pressure.
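
A sketch tying requests, limits, and QoS together (values are illustrative); because requests equal limits for the only container, this Pod would be Guaranteed:

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m          # == request, so the Pod qualifies as Guaranteed
        memory: 256Mi      # exceeding this gets the container OOMKilled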

Scheduling Constraints

Taints & Tolerations: A node taint repels Pods unless the Pod has a matching toleration. Used to reserve nodes (e.g., GPU nodes, dedicated tenant nodes).
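
A sketch, assuming a node was tainted with kubectl taint nodes gpu-node-1 gpu=true:NoSchedule (node name and key are hypothetical); only Pods carrying the matching toleration can schedule there:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule     # matches the node taint above
  containers:
  - name: trainer
    image: busybox:1.36
    command: ["sleep", "3600"]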

Node Affinity: Schedule Pods to nodes matching label expressions (required or preferred). Like a more expressive nodeSelector.

Pod Affinity / Anti-Affinity: Co-locate Pods together (affinity) or spread them apart (anti-affinity) based on topology domains (zone, node). Anti-affinity is critical for HA -- ensuring replicas land on different nodes or zones.
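
A sketch of required anti-affinity spreading replicas of a hypothetical app: web across nodes; this fragment belongs inside a Deployment's Pod template:

spec:                      # Pod template spec fragment
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname   # at most one replica per node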

PodDisruptionBudget (PDB): Limits how many Pods in a Deployment can be down simultaneously during voluntary disruptions (node drains, cluster upgrades). For example, minAvailable: 2 ensures at least 2 replicas stay running.
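
A minimal PDB sketch for the minAvailable: 2 case described above (name and labels illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # voluntary disruptions must leave 2 replicas running
  selector:
    matchLabels:
      app: web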

06 / Deployment Strategies

Rolling, Blue-Green & Canary

Rolling Update (default)

Kubernetes gradually replaces old Pods with new ones, controlled by maxUnavailable and maxSurge. Zero-downtime by default. Easy rollback with kubectl rollout undo.

Rolling update flow: [v1 v1 v1] -> [v1 v1 v2] -> [v1 v2 v2] -> [v2 v2 v2]
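
The maxUnavailable and maxSurge knobs live under the Deployment's strategy field; a fragment sketch with illustrative values:

spec:                      # Deployment spec fragment
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # at most 1 Pod below the desired replica count
      maxSurge: 1          # at most 1 extra Pod above the desired count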

Blue-Green

Run two identical environments (blue = current, green = new). Once the green environment is verified, switch the Service selector to point at green. Instant rollback by switching back. Downside: requires double the resources during transition.
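
The cutover is just a selector change on the Service; the labels here (version: blue/green) are an illustrative convention:

apiVersion: v1
kind: Service
metadata:
  name: app-svc
spec:
  selector:
    app: myapp
    version: green         # was "blue"; changing this flips all traffic at once
  ports:
  - port: 80
    targetPort: 8080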

Canary

Route a small percentage of traffic (e.g., 5%) to the new version. Monitor error rates and latency. Gradually increase traffic if metrics look good. Requires either weighted Service routing (via Istio, Linkerd, or Argo Rollouts) or manual ReplicaSet manipulation.
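
A sketch of weighted routing with an Istio VirtualService, assuming a DestinationRule elsewhere defines the stable and canary subsets (all names hypothetical):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-vs
spec:
  hosts:
  - app-svc
  http:
  - route:
    - destination:
        host: app-svc
        subset: stable
      weight: 95
    - destination:
        host: app-svc
        subset: canary
      weight: 5            # 5% of traffic goes to the new version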

Strategy   | Downtime | Resource Cost       | Rollback Speed
Rolling    | None     | Low (gradual)       | Moderate (roll forward/back)
Blue-Green | None     | High (2x resources) | Instant (switch selector)
Canary     | None     | Low-Medium          | Fast (shift traffic back)
Best Practice
Use Argo Rollouts or Flagger for automated canary and blue-green deployments with metric-driven promotion. They integrate with Prometheus and your service mesh to auto-promote or rollback based on real error rates.

Test Yourself

Question 01
Which control plane component is responsible for persisting all cluster state?
etcd is the distributed key-value store that holds all cluster state. The API server reads from and writes to etcd, but etcd is the actual persistence layer.
Question 02
What happens when a container exceeds its memory limit?
Exceeding the memory limit triggers an OOMKill. The container is terminated and restarted according to its restartPolicy. CPU limits, by contrast, result in throttling rather than termination.
Question 03
Which Service type provisions an external cloud load balancer?
LoadBalancer type tells the cloud provider to provision an external load balancer that forwards traffic to the Service. ClusterIP is internal-only, and NodePort exposes a port on each node without a dedicated LB.
Question 04
A Pod has requests == limits for all containers. What QoS class does it receive?
When every container in a Pod has requests equal to limits for both CPU and memory, Kubernetes assigns the Guaranteed QoS class. These Pods are the last to be evicted under node pressure.
Question 05
Which workload object provides stable network identities and persistent storage per replica?
StatefulSet gives each Pod a stable hostname (pod-0, pod-1, ...) and can provision a PersistentVolumeClaim per replica. This makes it suitable for databases and other stateful applications.
Question 06
What mechanism prevents Pods from being scheduled on a node unless they explicitly opt in?
A taint on a node repels all Pods that don't have a matching toleration. This is the "opt-in" model -- Pods must explicitly tolerate the taint to be scheduled there. Node affinity attracts Pods but doesn't repel others.
Question 07
Which deployment strategy requires roughly double the cluster resources during the transition?
Blue-green runs two full environments simultaneously (the current "blue" and the new "green"), so you need double the resources during the transition window. Rolling and canary introduce new Pods gradually.
Question 08
What does Cilium use as its underlying technology for high-performance packet processing?
Cilium uses eBPF (extended Berkeley Packet Filter) programs that run in the Linux kernel for high-performance, programmable packet processing without the overhead of iptables chains.
Question 09
By default, what is the Kubernetes pod-to-pod networking policy?
The Kubernetes networking model requires a flat network where every Pod can reach every other Pod by IP without NAT. NetworkPolicies are additive restrictions on top of this default-allow model.
Question 10
Which autoscaler adds or removes cluster nodes based on pending Pod scheduling?
The Cluster Autoscaler watches for Pods stuck in Pending state due to insufficient node resources and triggers the cloud provider to add nodes. It also scales down underutilized nodes. HPA scales Pod replicas, and VPA adjusts Pod resource requests.