Logs, Metrics, and Traces
Observability is the ability to understand a system's internal state from its external outputs. It rests on three complementary signal types, each answering a different question about your running software.
Logs: discrete, timestamped records of events. Great for debugging individual requests. Answer: "What happened?"
Metrics: numeric measurements aggregated over time. Cheap to store and query at scale. Answer: "How is the system performing?"
Traces: the end-to-end journey of a request across services. Answer: "Where did time get spent?"
Structured Logging Done Right
Modern logging means structured, machine-parseable records -- not free-form text sprinkled with println. JSON is the dominant format because every aggregation tool can ingest it.
Log Levels
| Level | When to Use | Alert? |
|---|---|---|
| DEBUG | Verbose detail for local development | No |
| INFO | Normal operations (startup, config loaded) | No |
| WARN | Recoverable issues (retry succeeded, fallback used) | Sometimes |
| ERROR | Failures that need attention (unhandled exception, downstream timeout) | Yes |
| FATAL | Process cannot continue | Immediately |
Structured Log Example
```json
{
  "timestamp": "2026-04-04T10:23:01.442Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "correlation_id": "order-7891",
  "message": "Charge failed",
  "error": "gateway_timeout",
  "duration_ms": 30012,
  "user_id": "u_REDACTED"
}
```
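For illustration, here is one way to emit records in that shape with Python's standard logging module and a small custom JSON formatter. The formatter class and the `fields` convention are a sketch, not any particular library's API; the field names simply mirror the example above.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
                         + f".{int(record.msecs):03d}Z",
            "level": record.levelname,
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via logging's `extra` argument.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits a record shaped like the example above.
logger.error("Charge failed", extra={"fields": {
    "trace_id": "abc123def456",
    "correlation_id": "order-7891",
    "error": "gateway_timeout",
    "duration_ms": 30012,
}})
```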
Correlation IDs
A correlation ID (or request ID) is a unique identifier propagated across all services handling a single user request. It lets you grep across thousands of log lines and reconstruct the full story of one transaction.
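A minimal sketch of how one service can carry that ID, using Python's contextvars so the value set at the edge is visible to every log call and outgoing request on the same request path. The header name is an assumption; X-Request-ID is equally common.

```python
import uuid
from contextvars import ContextVar

# Assumed header name; many systems use X-Request-ID instead.
CORRELATION_HEADER = "X-Correlation-ID"
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

def handle_incoming(headers: dict) -> None:
    """Reuse the caller's correlation ID, or mint one at the edge of the system."""
    correlation_id.set(headers.get(CORRELATION_HEADER) or uuid.uuid4().hex)

def outgoing_headers() -> dict:
    """Attach the same ID to every downstream call so log lines can be joined later."""
    return {CORRELATION_HEADER: correlation_id.get()}
```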
Log Aggregation
ELK stack (Elasticsearch + Logstash + Kibana): powerful full-text search. Heavy on resources.
Grafana Loki: indexes only labels, not log content. Much cheaper storage. Pairs with Grafana dashboards.
Counters, Gauges, and Histograms
Metrics are pre-aggregated numeric values. Unlike logs, they have a fixed cost regardless of traffic volume, making them the backbone of dashboards and alerting.
Metric Types
| Type | Behavior | Example |
|---|---|---|
| Counter | Monotonically increasing value; only goes up (resets on restart) | http_requests_total |
| Gauge | Can go up or down; a point-in-time snapshot | memory_usage_bytes |
| Histogram | Samples observations into configurable buckets; lets you compute percentiles | request_duration_seconds |
| Summary | Like histogram but calculates quantiles client-side; cannot be aggregated across instances | rpc_duration_seconds |
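The four types map directly onto client libraries. A sketch using the Python prometheus_client package follows; the metric names echo the table and the bucket boundaries are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only ever increments (resets to zero when the process restarts).
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])

# Gauge: point-in-time value that can rise or fall.
MEMORY = Gauge("memory_usage_bytes", "Resident memory in bytes")

# Histogram: observations sorted into buckets; percentiles are computed at query time.
LATENCY = Histogram("request_duration_seconds", "Request latency",
                    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5))

# Summary: tracks count and sum client-side (some clients also compute quantiles);
# unlike histograms, quantiles cannot be aggregated across instances.
RPC_LATENCY = Summary("rpc_duration_seconds", "RPC latency")

REQUESTS.labels(method="GET", status="200").inc()
MEMORY.set(512 * 1024 * 1024)
LATENCY.observe(0.42)
RPC_LATENCY.observe(0.17)
```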
RED Method (Request-Scoped)
The RED method tracks Rate (requests per second), Errors (failed requests), and Duration (latency distribution). It works best for request-driven services (APIs, web servers): instrument these three signals for every endpoint and you cover most alerting needs.
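A sketch of what RED instrumentation around a single endpoint can look like, again with prometheus_client; the wrapper and metric names are illustrative, not a standard API.

```python
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Requests by endpoint and status",
                   ["endpoint", "status"])
DURATION = Histogram("http_request_duration_seconds", "Request latency by endpoint",
                     ["endpoint"])

def instrumented(endpoint, handler, *args, **kwargs):
    """Wrap a handler so it emits all three RED signals."""
    start = time.perf_counter()
    status = "500"
    try:
        result = handler(*args, **kwargs)
        status = "200"
        return result
    finally:
        # Rate and Errors come from the same counter, split by status label.
        REQUESTS.labels(endpoint=endpoint, status=status).inc()
        # Duration feeds the latency histogram.
        DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)
```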
USE Method (Resource-Scoped)
The USE method tracks Utilization, Saturation, and Errors for infrastructure resources: CPU, memory, disk, network interfaces. It catches bottlenecks that RED misses.
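A rough sketch of exporting USE-style gauges from application code, assuming the third-party psutil package for system readings; in practice node_exporter usually provides these metrics.

```python
import psutil  # assumed dependency for system readings
from prometheus_client import Gauge

CPU_UTIL = Gauge("node_cpu_utilization_percent", "CPU utilization")
MEM_UTIL = Gauge("node_memory_utilization_percent", "Memory utilization")
LOAD_SAT = Gauge("node_load_per_cpu", "1-minute load average per CPU (saturation proxy)")

def collect_use_metrics():
    CPU_UTIL.set(psutil.cpu_percent(interval=1))                # Utilization
    MEM_UTIL.set(psutil.virtual_memory().percent)               # Utilization
    LOAD_SAT.set(psutil.getloadavg()[0] / psutil.cpu_count())   # Saturation
    # Errors (NIC drops, disk I/O errors) come from kernel counters,
    # typically exported by node_exporter rather than application code.
```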
Tooling: Prometheus + Grafana
Prometheus scrapes /metrics endpoints on a pull model, stores time-series data, and evaluates alerting rules. Grafana provides visualization. Together they are the de facto open-source metrics stack.
Cardinality warning: adding a user_id label to a metric creates one time series per user -- millions of series that can overwhelm Prometheus. Keep label cardinality low -- use logs or traces for high-cardinality debugging.
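A sketch of the pull model and of keeping labels bounded, using prometheus_client's built-in HTTP exporter; the port and metric names are illustrative.

```python
from prometheus_client import Counter, start_http_server

# Good: bounded label values (a handful of endpoints and status codes).
CHARGES = Counter("charges_total", "Charge attempts", ["endpoint", "status"])

# Bad idea (shown only as a comment): a user_id label would create one series
# per user -- unbounded cardinality. Put user_id in logs or trace tags instead.
# CHARGES_BY_USER = Counter("charges_by_user_total", "...", ["user_id"])

# Serves /metrics on port 9102 in a background thread for Prometheus to scrape.
start_http_server(9102)
CHARGES.labels(endpoint="/charge", status="200").inc()
```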
Following a Request Across Services
In a microservices architecture, a single user action may fan out to dozens of services. Distributed tracing reconstructs that entire journey as a tree of spans within a trace.
Anatomy of a Trace
Each span records: operation name, start time, duration, status, and key-value tags. Spans reference their parent via a parent_span_id, forming the tree.
Context Propagation
For tracing to work across process boundaries, the trace_id and span_id must be passed in request headers (e.g., traceparent in W3C Trace Context). Libraries handle this automatically if you instrument your HTTP clients and servers.
OpenTelemetry (OTel)
OpenTelemetry is the CNCF standard that unifies tracing, metrics, and logging instrumentation into one vendor-neutral SDK. It provides auto-instrumentation for popular frameworks and exports data to any backend.
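A sketch of manual instrumentation with the OpenTelemetry Python SDK: a parent span, a child span that records its parent automatically, and context injection for the next hop. Span names and attribute values are illustrative.

```python
from opentelemetry import trace, propagate
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal SDK setup: export finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payment-api")

with tracer.start_as_current_span("checkout") as parent:
    parent.set_attribute("order.id", "order-7891")
    with tracer.start_as_current_span("charge-card") as child:
        # The child span records its parent's span id automatically,
        # forming the span tree described above.
        child.set_attribute("gateway", "example-gateway")

        # Cross-process propagation: inject the current context into outgoing
        # headers (adds a W3C `traceparent` header); the callee extracts it
        # and continues the same trace.
        headers = {}
        propagate.inject(headers)
```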
Sampling
Tracing every request is expensive. Sampling strategies reduce volume while preserving signal:
Head-based sampling: decide at request entry whether to trace (e.g., 10% random). Simple, but may miss rare errors.
Tail-based sampling: buffer all spans and decide after the trace completes. Keeps error traces, drops boring ones. Requires more memory.
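Head-based sampling is typically configured in the SDK; a sketch with OpenTelemetry's ratio-based sampler follows (the 10% ratio is illustrative). Tail-based sampling generally lives in a collector rather than in application code.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based: the root of the trace makes the decision; children follow the parent.
sampler = ParentBased(root=TraceIdRatioBased(0.10))   # keep roughly 10% of traces
trace.set_tracer_provider(TracerProvider(sampler=sampler))

# Tail-based sampling is usually configured in a collector (e.g., the
# OpenTelemetry Collector's tail_sampling processor), not here, because it
# must buffer whole traces before deciding which to keep.
```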
Alert on Symptoms, Not Causes
Good alerting wakes someone up only when users are impacted. Bad alerting pages on-call engineers for high CPU that self-resolves in 30 seconds.
Symptoms vs. Causes
| Type | Example | Should Page? |
|---|---|---|
| Symptom | Error rate > 1% for 5 minutes | Yes |
| Symptom | p99 latency > 2s for 10 minutes | Yes |
| Cause | CPU usage at 90% | No (use as dashboard signal) |
| Cause | Pod restarted | No (unless error rate also rises) |
SLO-Based Alerting
Instead of arbitrary thresholds, derive alerts from your SLO. If you promise 99.9% availability, alert when you are burning through your error budget faster than expected. This is the burn rate approach.
Avoiding Alert Fatigue
Prune noisy alerts: delete or demote alerts that fire but never lead to action. If no one acts on it, it should not page.
Link runbooks: every alert should link to a runbook with diagnosis steps. This reduces mean-time-to-resolution and on-call stress.
Tier by severity: page for P1 (user-facing outage), ticket for P2 (degraded but functional), dashboard for P3 (informational).
Reliability as a Measurable Contract
Google's SRE framework gives us precise vocabulary for talking about reliability targets. The three concepts form a chain: measure (SLI), target (SLO), consequence (SLA).
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behavior | Ratio of successful HTTP requests to total requests |
| SLO (Service Level Objective) | A target value for an SLI over a time window | 99.9% of requests succeed over a 30-day window |
| SLA (Service Level Agreement) | A contractual promise with consequences for violation | "If uptime drops below 99.5%, customer gets service credits" |
Error Budgets
The error budget is 100% - SLO. With a 99.9% SLO over 30 days, you have 43.2 minutes of allowed downtime. The error budget is not waste -- it is the investment capacity for shipping features, deploying changes, and running experiments.
When the error budget is nearly exhausted, teams freeze feature deployments and focus on reliability work. When there is budget remaining, ship aggressively. This creates a data-driven negotiation between feature velocity and stability.
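The arithmetic is simple enough to sanity-check; a few budgets for a 30-day window:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime per window."""
    return (1 - slo) * window_days * 24 * 60

print(allowed_downtime_minutes(0.999))    # 43.2 minutes
print(allowed_downtime_minutes(0.9999))   # 4.32 minutes
print(allowed_downtime_minutes(0.99))     # 432.0 minutes (7.2 hours)
```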
Multi-Window Burn-Rate Alerts
The standard approach uses two alert rules per SLO:
Fast burn: 14.4x burn rate over 1 hour, sustained for 5 minutes. Catches acute incidents that would exhaust the 30-day budget in roughly two days.
Slow burn: 3x burn rate over 3 days, sustained for 6 hours. Catches gradual degradation that would exhaust the budget in 10 days.
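The exhaustion times follow directly from the burn rates; for a 30-day window:

```python
SLO_WINDOW_HOURS = 30 * 24   # 720 hours

def hours_to_exhaust_budget(burn_rate: float) -> float:
    """At burn rate N, the error budget is consumed N times faster than allowed."""
    return SLO_WINDOW_HOURS / burn_rate

print(hours_to_exhaust_budget(14.4))  # 50.0 hours  (~2 days)  -> fast-burn page
print(hours_to_exhaust_budget(3))     # 240.0 hours (10 days)  -> slow-burn ticket
```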
Test Yourself
Why is adding a user_id label to a Prometheus metric dangerous?