Observability

Understanding the internals of production systems through logging, metrics, and distributed tracing -- plus the SLO frameworks and alerting strategies that turn raw signals into actionable insight.

01 / The Three Pillars

Logs, Metrics, and Traces

Observability is the ability to understand a system's internal state from its external outputs. It rests on three complementary signal types, each answering a different question about your running software.

The Three Pillars of Observability: Logs + Metrics + Traces

Logs

Discrete, timestamped records of events. Great for debugging individual requests. Answer: "What happened?"

Metrics

Numeric measurements aggregated over time. Cheap to store and query at scale. Answer: "How is the system performing?"

Traces

End-to-end journey of a request across services. Answer: "Where did time get spent?"

Key Insight
No single pillar is sufficient. Metrics tell you something is wrong, logs tell you what went wrong, and traces tell you where in the call chain it broke.
02 / Logging

Structured Logging Done Right

Modern logging means structured, machine-parseable records -- not free-form text sprinkled with println. JSON is the dominant format because every aggregation tool can ingest it.

Log Levels

Level | When to Use | Alert?
DEBUG | Verbose detail for local development | No
INFO | Normal operations (startup, config loaded) | No
WARN | Recoverable issues (retry succeeded, fallback used) | Sometimes
ERROR | Failures that need attention (unhandled exception, downstream timeout) | Yes
FATAL | Process cannot continue | Immediately

Structured Log Example

{
  "timestamp": "2026-04-04T10:23:01.442Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "correlation_id": "order-7891",
  "message": "Charge failed",
  "error": "gateway_timeout",
  "duration_ms": 30012,
  "user_id": "u_REDACTED"
}
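
Producing Records Like This (Sketch)

A minimal sketch using Python's standard logging module -- one possible implementation, not the only one. The JsonFormatter class, the hard-coded "payment-api" service name, and the list of extra fields are all illustrative assumptions:

import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # Render each log record as a single JSON object per line.
    EXTRA_FIELDS = ("trace_id", "correlation_id", "error", "duration_ms", "user_id")

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Copy structured fields passed via logger.error(..., extra={...}).
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Charge failed", extra={"error": "gateway_timeout", "duration_ms": 30012})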

Correlation IDs

A correlation ID (or request ID) is a unique identifier propagated across all services handling a single user request. It lets you grep across thousands of log lines and reconstruct the full story of one transaction.
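
Correlation ID Propagation (Sketch)

A sketch of the propagation logic, assuming the ID travels in an X-Request-ID header (a common convention, not something the text above mandates):

import uuid

def get_or_create_correlation_id(headers: dict) -> str:
    # Reuse the caller's ID if one arrived; otherwise mint one at the edge.
    return headers.get("X-Request-ID") or str(uuid.uuid4())

def outbound_headers(headers: dict, correlation_id: str) -> dict:
    # Forward the same ID so every downstream hop logs the same value.
    out = dict(headers)
    out["X-Request-ID"] = correlation_id
    return out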

Log Aggregation

ELK Stack

Elasticsearch + Logstash + Kibana. Powerful full-text search. Heavy on resources.

Grafana Loki

Indexes only labels, not log content. Much cheaper storage. Pairs with Grafana dashboards.

PII Masking
Never log raw emails, credit card numbers, or passwords. Redact or hash PII at the logging layer before it reaches your aggregation pipeline. Regulations like GDPR make this a legal requirement, not just best practice.
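
PII Redaction (Sketch)

One possible redaction pass, using a stable hash so related log lines stay correlatable after masking. The regex covers only email addresses and is deliberately simplistic:

import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_pii(message: str) -> str:
    # Replace each email with a short, stable hash before the line ships.
    def _hash(match):
        digest = hashlib.sha256(match.group().encode()).hexdigest()[:8]
        return "email_" + digest
    return EMAIL_RE.sub(_hash, message)

# mask_pii("login failed for alice@example.com")
# -> "login failed for email_<8-char digest>"
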
03 / Metrics

Counters, Gauges, and Histograms

Metrics are pre-aggregated numeric values. Unlike logs, they have a fixed cost regardless of traffic volume, making them the backbone of dashboards and alerting.

Metric Types

Type | Behavior | Example
Counter | Monotonically increasing value; only goes up (resets on restart) | http_requests_total
Gauge | Can go up or down; a point-in-time snapshot | memory_usage_bytes
Histogram | Samples observations into configurable buckets; lets you compute percentiles | request_duration_seconds
Summary | Like histogram but calculates quantiles client-side; cannot be aggregated across instances | rpc_duration_seconds
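
Metric Types in Code (Sketch)

A sketch of all four types with the prometheus_client Python library -- the library choice is an assumption; the text names Prometheus itself but no client:

import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "route"])
MEMORY = Gauge("memory_usage_bytes", "Resident memory in bytes")
LATENCY = Histogram("request_duration_seconds", "Request latency",
                    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))
RPC = Summary("rpc_duration_seconds", "RPC latency (client-side quantiles)")

start_http_server(8000)  # expose /metrics for Prometheus to scrape

REQUESTS.labels(method="GET", route="/checkout").inc()  # counter: only goes up
MEMORY.set(512 * 1024 * 1024)                           # gauge: set to any value
with LATENCY.time():                                    # histogram: bucketed timing
    time.sleep(0.05)                                    # simulated work
RPC.observe(0.03)                                       # summary: client-side quantiles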

RED Method (Request-Scoped)

RED -- for every service: Rate (req/sec) + Errors (failures/sec) + Duration (latency)

The RED method works best for request-driven services (APIs, web servers). Instrument these three signals for every endpoint and you cover most alerting needs.
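
RED Instrumentation (Sketch)

Continuing the prometheus_client sketch above, RED reduces to two counters and a histogram per route; the decorator and metric names are illustrative:

import functools
import time
from prometheus_client import Counter, Histogram

REQS = Counter("app_requests_total", "Requests (Rate)", ["route"])
ERRS = Counter("app_request_errors_total", "Failed requests (Errors)", ["route"])
DUR = Histogram("app_request_duration_seconds", "Latency (Duration)", ["route"])

def red_instrumented(route: str):
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            REQS.labels(route=route).inc()
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                ERRS.labels(route=route).inc()  # count the failure, then re-raise
                raise
            finally:
                DUR.labels(route=route).observe(time.perf_counter() - start)
        return wrapper
    return decorator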

USE Method (Resource-Scoped)

USE -- for every resource (CPU, disk, network): Utilization (%) + Saturation (queue depth) + Errors (count)

The USE method targets infrastructure resources: CPU, memory, disk, network interfaces. It catches bottlenecks that RED misses.
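
USE Gauges (Sketch)

A sketch of USE-style CPU gauges, assuming the psutil library for readings. In practice, per-resource error counts usually come from an exporter such as node_exporter rather than application code:

import psutil
from prometheus_client import Gauge

CPU_UTILIZATION = Gauge("cpu_utilization_percent", "CPU utilization (U)")
CPU_SATURATION = Gauge("cpu_load_1m", "1-minute load average, a saturation proxy (S)")

def collect_cpu():
    CPU_UTILIZATION.set(psutil.cpu_percent(interval=None))
    CPU_SATURATION.set(psutil.getloadavg()[0])  # runnable-task queue depth proxy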

Tooling: Prometheus + Grafana

Prometheus scrapes /metrics endpoints using a pull model, stores the resulting time series, and evaluates alerting rules. Grafana provides the visualization layer. Together they are the de facto open-source metrics stack.

Cardinality Trap
Every unique combination of label values creates a separate time series. Adding a user_id label to a metric creates millions of series and will crash Prometheus. Keep label cardinality low -- use logs or traces for high-cardinality debugging.
04 / Distributed Tracing

Following a Request Across Services

In a microservices architecture, a single user action may fan out to dozens of services. Distributed tracing reconstructs that entire journey as a tree of spans within a trace.

Anatomy of a Trace

Trace = Tree of Spans

API Gateway (root span)
|-- Auth Service
`-- Order Service
    |-- Payment Service
    `-- Inventory DB

Each span records: operation name, start time, duration, status, and key-value tags. Spans reference their parent via a parent_span_id, forming the tree.
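
Parent and Child Spans (Sketch)

A sketch of that tree using the OpenTelemetry Python SDK (introduced below); service and span names are illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("handle_order") as root:      # root span
    root.set_attribute("order.id", "order-7891")
    with tracer.start_as_current_span("charge_payment"):        # child: parent_span_id = root
        pass  # call the payment service here
    with tracer.start_as_current_span("reserve_inventory"):     # sibling child
        pass  # query the inventory DB here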

Context Propagation

For tracing to work across process boundaries, the trace_id and span_id must be passed in request headers (e.g., traceparent in W3C Trace Context). Libraries handle this automatically if you instrument your HTTP clients and servers.
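
Injecting and Extracting Context (Sketch)

With OpenTelemetry's Python API this is one call on each side; the plain dict below stands in for real HTTP headers:

from opentelemetry.propagate import extract, inject

# Client side: stamp outgoing headers with the current trace context.
headers = {}
inject(headers)  # adds e.g. {"traceparent": "00-<trace_id>-<span_id>-01"}

# Server side: rebuild the context from incoming headers, then start the
# next span inside it so it joins the same trace.
# ctx = extract(incoming_headers)
# with tracer.start_as_current_span("handle_request", context=ctx): ...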

OpenTelemetry (OTel)

OpenTelemetry is the CNCF standard that unifies tracing, metrics, and logging instrumentation into one vendor-neutral SDK. It provides auto-instrumentation for popular frameworks and exports data to any backend.

OTel Pipeline: App + OTel SDK -> OTel Collector -> Jaeger / Zipkin / Tempo

Sampling

Tracing every request is expensive. Sampling strategies reduce volume while preserving signal:

Head-based

Decide at request entry whether to trace (e.g., 10% random). Simple, but may miss rare errors.

Tail-based

Buffer all spans; decide after the trace completes. Keeps error traces, drops boring ones. Requires more memory.
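
Configuring a Head-Based Sampler (Sketch)

Head-based sampling is typically set on the tracer provider, as in this OpenTelemetry sketch; tail-based sampling, by contrast, is usually configured in the OTel Collector rather than in-process:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Keep ~10% of traces. The decision is derived from the trace ID at the
# root span, so every span in a given trace gets the same verdict.
provider = TracerProvider(sampler=TraceIdRatioBased(0.10))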

05 / Alerting

Alert on Symptoms, Not Causes

Good alerting wakes someone up only when users are impacted. Bad alerting pages on-call engineers for high CPU that self-resolves in 30 seconds.

Symptoms vs. Causes

Type | Example | Should Page?
Symptom | Error rate > 1% for 5 minutes | Yes
Symptom | p99 latency > 2s for 10 minutes | Yes
Cause | CPU usage at 90% | No (use as dashboard signal)
Cause | Pod restarted | No (unless error rate also rises)

SLO-Based Alerting

Instead of arbitrary thresholds, derive alerts from your SLO. If you promise 99.9% availability, alert when you are burning through your error budget faster than expected. This is the burn rate approach.

Burn Rate
A burn rate of 1x means you will exactly exhaust your monthly error budget in 30 days. A burn rate of 14.4x means you will exhaust it in roughly 50 hours (about two days). Multi-window alerts combine a fast window (e.g., 5 min) with a slow window (e.g., 1 hour) to catch both spikes and sustained degradation while reducing false positives.
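
Burn-Rate Arithmetic (Sketch)

The arithmetic behind those numbers, as a quick sketch:

SLO = 0.999
WINDOW_DAYS = 30

budget_minutes = (1 - SLO) * WINDOW_DAYS * 24 * 60  # 43.2 minutes
for rate in (1, 3, 6, 14.4):
    print(f"{rate}x burn -> budget exhausted in {WINDOW_DAYS / rate:.1f} days")
# 1x -> 30.0 days, 3x -> 10.0 days, 6x -> 5.0 days, 14.4x -> 2.1 days (~50 hours)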

Avoiding Alert Fatigue

Reduce noise

Delete or demote alerts that fire but never lead to action. If no one acts on it, it should not page.

Attach runbooks

Every alert should link to a runbook with diagnosis steps. Reduces mean-time-to-resolution and on-call stress.

Severity tiers

Page for P1 (user-facing outage), ticket for P2 (degraded but functional), dashboard for P3 (informational).

06 / SLOs, SLIs, and SLAs

Reliability as a Measurable Contract

Google's SRE framework gives us precise vocabulary for talking about reliability targets. The three concepts form a chain: measure (SLI), target (SLO), consequence (SLA).

Term | Definition | Example
SLI (Service Level Indicator) | A quantitative measure of service behavior | Ratio of successful HTTP requests to total requests
SLO (Service Level Objective) | A target value for an SLI over a time window | 99.9% of requests succeed over a 30-day window
SLA (Service Level Agreement) | A contractual promise with consequences for violation | "If uptime drops below 99.5%, customer gets service credits"

Error Budgets

The error budget is 100% - SLO. With a 99.9% SLO over 30 days, you have 43.2 minutes of allowed downtime. The error budget is not waste -- it is the investment capacity for shipping features, deploying changes, and running experiments.

Error Budget Calculation (30-day window)
SLO 99.9% -> budget 0.1% -> 43.2 min/month

When the error budget is nearly exhausted, teams freeze feature deployments and focus on reliability work. When there is budget remaining, ship aggressively. This creates a data-driven negotiation between feature velocity and stability.

Multi-Window Burn-Rate Alerts

The standard approach uses two alert rules per SLO:

Fast Burn (Page)

A 14.4x burn rate over a 1-hour window, confirmed by a 5-minute window. Catches acute incidents that would exhaust the monthly budget in about two days (~50 hours).

Slow Burn (Ticket)

A 3x burn rate over a 3-day window, confirmed by a 6-hour window. Catches gradual degradation that would exhaust the budget in 10 days. Both rules are sketched in code below.
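
Two-Rule Evaluation (Sketch)

In code form, the two rules reduce to window comparisons. The error-ratio inputs are assumed to come from your metrics store; nothing below is a real query API:

BUDGET = 1 - 0.999  # 0.1% allowed error ratio

def burn_rate(error_ratio: float) -> float:
    # How many times faster than the sustainable pace budget is burning.
    return error_ratio / BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    # Fast burn: both the long and short windows must exceed 14.4x, so a
    # spike that has already ended does not page anyone.
    return burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4

def should_ticket(err_3d: float, err_6h: float) -> bool:
    # Slow burn: 3x over 3 days, confirmed by the 6-hour window.
    return burn_rate(err_3d) > 3 and burn_rate(err_6h) > 3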

Putting It Together
Define SLIs from the user's perspective (latency, availability, correctness). Set SLOs that balance user happiness with engineering velocity. Derive alerts from burn rate, not arbitrary thresholds. Attach runbooks to every alert. Review SLOs quarterly.

Test Yourself

Question 01
Which observability signal is best suited for understanding how long a request spent in each microservice?
Distributed traces break a request into spans across services, showing exactly where time was spent. Metrics give aggregates, logs give event details, but only traces provide the end-to-end request timeline.
Question 02
Why is adding a user_id label to a Prometheus metric dangerous?
Each unique label combination creates a separate time series. With millions of users, you get millions of series, which exhausts memory and storage. This is the "cardinality explosion" problem.
Question 03
The RED method measures three things for every service. What are they?
RED stands for Rate (requests/sec), Errors (failed requests/sec), and Duration (latency distribution). Utilization, saturation, and errors instead describe the USE method, which targets infrastructure resources, not services.
Question 04
What is the error budget for a service with a 99.9% SLO over a 30-day window?
Error budget = 100% - 99.9% = 0.1%. Over 30 days (43,200 minutes): 43,200 x 0.001 = 43.2 minutes of allowed downtime.
Question 05
Which sampling strategy buffers all spans and decides whether to keep a trace after it completes?
Tail-based sampling waits until the full trace is available, then decides based on attributes like error status or latency. Head-based sampling decides at the start of the request, before the outcome is known.
Question 06
You get paged because CPU is at 95%. Users are unaffected. According to observability best practices, this alert is:
The best practice is to alert on symptoms (user-visible impact like error rate or latency), not causes (CPU, memory). High CPU is useful on a dashboard but should not page unless it correlates with user impact.
Question 07
What distinguishes a Summary metric from a Histogram in Prometheus?
Summaries compute quantiles (like p99) on the client, which means you cannot meaningfully aggregate them across multiple pods. Histograms use server-side bucket counting, which can be aggregated. This is why histograms are generally preferred.
Question 08
A burn rate of 14.4x on a 99.9% SLO means the error budget will be exhausted in approximately:
At 1x burn rate, the 30-day budget lasts 30 days. At 14.4x, it lasts 30 days / 14.4 = ~2.08 days... wait -- the budget itself is 43.2 minutes. At 14.4x consumption: 43.2 min * 60 / 14.4 = the budget-equivalent time is 30 days / 14.4 = ~50 hours. The standard Google SRE reference states 14.4x burns through the monthly budget in ~2 hours for a fast-burn page alert window.