What to Measure and Why
You cannot improve what you do not measure, but measuring the wrong thing is worse than measuring nothing. Performance engineering begins with choosing the right metrics and understanding what they actually tell you.
Latency and Percentiles
Latency is the time between a request being sent and its response arriving. The trap: reporting average latency hides the experience of your worst-off users. If P50 is 20ms but P99 is 1.2s, one in a hundred users waits 60x longer than the median user.
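To make the tail concrete, here is a minimal sketch (plain Python, nearest-rank percentile convention, made-up latency numbers) showing how a healthy-looking mean can coexist with a terrible P99:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(values)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(k, 0)]

# 98 fast requests plus 2 slow outliers
latencies_ms = [20] * 98 + [1200] * 2
print(sum(latencies_ms) / len(latencies_ms))  # mean: 43.6 ms -- looks fine
print(percentile(latencies_ms, 50))           # P50: 20 ms
print(percentile(latencies_ms, 99))           # P99: 1200 ms -- the tail the mean hides
```

Note that real monitoring systems use streaming estimators (t-digest, HDR histograms) rather than sorting raw samples, but the definition is the same.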
Throughput
QPS (queries per second) and TPS (transactions per second) measure how much work a system completes per unit time. High throughput with high latency often means requests are queuing. High throughput with low latency is the goal.
Other Critical Metrics
| Metric | What It Measures | Watch For |
|---|---|---|
| Bandwidth | Data volume per second (Gbps) | Network saturation, large payloads |
| IOPS | I/O operations per second | Disk bottlenecks, random vs sequential |
| Saturation | How full a resource is (0-100%) | Queuing begins well before 100% |
| Error Rate | Failed requests / total requests | Load shedding masquerading as "fast" responses |
The Math Behind Scaling
Three laws form the theoretical backbone of performance engineering. They tell you what is possible before you write a single line of code.
Amdahl's Law
If a fraction s of your workload is serial (cannot be parallelized), then the maximum speedup with N processors is:
Speedup(N) = 1 / (s + (1 - s) / N)
Example: s = 5% serial
N = 10 cores → Speedup = 6.9x
N = 100 cores → Speedup = 16.8x
N = ∞ cores → Speedup = 20x (hard ceiling!)
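The figures above can be reproduced with a one-line function; a quick sketch:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Maximum speedup under Amdahl's Law for a given serial fraction."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)

for n in (10, 100, 1_000_000):
    print(f"N={n}: {amdahl_speedup(0.05, n):.1f}x")
# N=10: 6.9x
# N=100: 16.8x
# N=1000000: 20.0x  (approaching the 1/s = 20x ceiling)
```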
Little's Law
A deceptively simple but universally applicable relationship for any stable system:
L = λ × W
L = average number of items in system
λ = average arrival rate
W = average time each item spends in system
Example: If requests arrive at 500/sec (λ) and each takes 200ms (W):
L = 500 × 0.2 = 100 concurrent requests in the system
This tells you how many connections, threads, or workers you need. It works for queues, databases, thread pools, and even checkout lines at the grocery store.
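The worked example translates directly into code; a minimal sketch, including the rearranged form that gives maximum sustainable throughput for a fixed worker count:

```python
def littles_law_concurrency(arrival_rate_per_sec, avg_time_sec):
    """L = lambda * W: average number of items in the system."""
    return arrival_rate_per_sec * avg_time_sec

# 500 req/s arriving, each spending 200 ms in the system
print(littles_law_concurrency(500, 0.200))  # -> 100.0 concurrent requests

# Rearranged (lambda = L / W): with only 80 workers and 200 ms per request,
# throughput above this rate means the queue grows without bound
print(80 / 0.200)  # -> 400.0 req/s
```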
Universal Scalability Law (USL)
Extends Amdahl's Law by adding a coherence penalty term: as you add nodes, they must coordinate (cache invalidation, locks, consensus). Beyond a certain point, adding capacity actually decreases throughput.
The USL model: C(N) = N / (1 + α(N-1) + β·N·(N-1)) where α is contention and β is coherence delay. When β > 0, there is a point of negative returns.
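A short sketch (with illustrative α and β values, assumed rather than fitted to any real system) makes the point of negative returns visible:

```python
def usl_throughput(n, alpha, beta):
    """USL relative capacity: C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Illustrative coefficients: 5% contention, 0.1% coherence delay
alpha, beta = 0.05, 0.001
capacity = {n: usl_throughput(n, alpha, beta) for n in range(1, 201)}
peak_n = max(capacity, key=capacity.get)
print(peak_n)  # -> 31: the node count where throughput peaks
print(capacity[peak_n] > capacity[200])  # -> True: 200 nodes are slower than 31
```

The peak sits near the analytic optimum N* = sqrt((1 - α) / β) ≈ 30.8; every node added beyond it spends more time on coordination than it contributes in capacity.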
Finding the Bottleneck
Profiling answers "where is the time going?" before you optimize. Guessing where bottlenecks are is wrong more often than you think.
CPU Flame Graphs
Flame graphs (invented by Brendan Gregg) visualize stack traces sampled over time. The x-axis is the sampled stack population (wider = more time spent), and the y-axis is stack depth. They instantly reveal which functions dominate CPU time.
# Generate a flame graph on Linux
perf record -g -p <PID> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# For Java (using async-profiler)
./asprof -d 30 -f flame.html <PID>
Memory Profiling
Track heap allocations, object retention, and GC behavior. In garbage-collected languages, excessive allocation rates create GC pressure even if peak memory is fine. Tools: jmap/MAT (Java), memray (Python), pprof (Go), Chrome DevTools (JS).
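For Python, the standard library's tracemalloc can attribute allocations to source lines directly; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

junk = ["x" * 100 for _ in range(10_000)]  # simulate allocation churn

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.compare_to(baseline, "lineno")
for stat in top_stats[:3]:
    print(stat)  # each line: file:lineno, size delta, allocation count
```

The same snapshot-diff workflow applies to finding slow leaks: take snapshots minutes apart in a soak test and diff them.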
I/O Profiling
Disk and network I/O are often the real bottleneck but are harder to profile. Use iostat, iotop, strace (syscall tracing), and bpftrace for production-safe tracing on Linux.
Breaking Things on Purpose
Load testing validates whether your system meets performance targets under realistic conditions. Different test types answer different questions.
| Test Type | Purpose | Duration | Load Pattern |
|---|---|---|---|
| Load | Can we handle expected traffic? | 5-30 min | Ramp to target QPS |
| Stress | Where does the system break? | Until failure | Ramp beyond capacity |
| Soak | Memory leaks, resource exhaustion | 4-24 hours | Sustained target QPS |
| Spike | Can we handle traffic bursts? | Short bursts | Sudden 10x jumps |
Tools
k6: JavaScript-based, scriptable. Excellent for CI/CD integration. Outputs detailed percentile data. Good for protocol-level testing (HTTP, gRPC, WebSocket).
Locust: Python-based, user-behavior modeling. Distributed mode for high load. Great when tests need complex logic or database seeding.
wrk / wrk2: Ultra-lightweight C-based HTTP benchmarking. wrk2 fixes coordinated omission. Best for raw throughput testing of a single endpoint.
// k6 example: ramp to 200 VUs over 1 minute, hold for 3 minutes, then ramp down
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 200 },
    { duration: '3m', target: 200 },
    { duration: '1m', target: 0 },
  ],
};

export default function () {
  const res = http.get('https://api.example.com/items');
  check(res, { 'status 200': (r) => r.status === 200 });
}
Performance Killers
These patterns silently degrade performance. They often pass code review because they look correct, but under load they are devastating.
N+1 queries: Fetching a list of N items, then issuing a separate query for each item's related data. Turns 1 query into N+1. Fix: use JOINs, eager loading, or DataLoader batching.
Chatty APIs: Client makes many small requests instead of one composite request. Each round trip adds latency. Fix: aggregate endpoints, GraphQL, or BFF (Backend for Frontend).
Blocking the event loop: A single blocking call in an async event loop starves all other coroutines. In Node.js or Python asyncio, this freezes the entire server. Fix: use async-compatible libraries or offload to a thread pool.
Allocation churn: Allocating millions of short-lived objects forces frequent garbage collection pauses. Fix: reuse objects, use object pools, avoid unnecessary boxing, prefer primitives.
Lock contention: Multiple threads competing for the same lock serialize execution, negating parallelism. Fix: lock-free data structures, finer-grained locks, or share-nothing architecture.
Unbounded queues: Queues without size limits let producers outpace consumers, eventually exhausting memory. Under load, latency grows unboundedly. Fix: bounded queues with backpressure.
# N+1 query example (Python/SQLAlchemy)
from sqlalchemy.orm import joinedload

# BAD: N+1 — 1 query for users, then N queries for orders
users = session.query(User).all()
for user in users:
    print(user.orders)  # Each access triggers a lazy-load query

# GOOD: Eager loading — 1 query with JOIN
users = session.query(User).options(joinedload(User.orders)).all()
for user in users:
    print(user.orders)  # Already loaded, no extra query
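The unbounded-queue fix is just as mechanical; a minimal asyncio sketch in which a bounded queue makes the producer wait (backpressure) instead of letting memory grow:

```python
import asyncio

async def producer(queue, n):
    for i in range(n):
        await queue.put(i)  # suspends here when the queue is full: backpressure

async def consumer(queue, processed):
    while True:
        item = await queue.get()
        processed.append(item)  # stand-in for real work
        queue.task_done()

async def run(n=100):
    queue = asyncio.Queue(maxsize=10)  # bounded: at most 10 items buffered
    processed = []
    workers = [asyncio.create_task(consumer(queue, processed)) for _ in range(2)]
    await producer(queue, n)
    await queue.join()  # wait until every item has been handled
    for w in workers:
        w.cancel()
    return processed

processed = asyncio.run(run())
print(len(processed))  # -> 100: all items handled, never more than 10 in memory
```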
Amortizing Connection Cost
Establishing a new connection is expensive. A TCP handshake costs 1 RTT, TLS adds 1-2 more, and database authentication adds another 1-2. For a cross-region connection at 80ms RTT, that is 3-5 round trips, or 240-400ms, before the first byte of useful work.
A connection pool maintains a set of pre-established connections. When your code needs a connection, it borrows one from the pool and returns it when done. This amortizes the setup cost across thousands of requests.
Database Connection Pools
| Pool | Language/DB | Key Feature |
|---|---|---|
| HikariCP | Java / JDBC | Fastest JVM pool. Minimal overhead, no-frills design. Default in Spring Boot. |
| PgBouncer | Any / PostgreSQL | External proxy. Transaction-level pooling lets 1000s of clients share a few PG connections. |
| pgpool-II | Any / PostgreSQL | Also handles replication, load balancing. Heavier than PgBouncer. |
HTTP Connection Pools
HTTP/1.1 keep-alive reuses TCP connections across requests. HTTP/2 multiplexes many requests over a single connection. Most HTTP client libraries manage pools internally, but you must configure max connections per host and idle timeouts.
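As a concrete sketch (assuming the third-party requests library; pool_connections and pool_maxsize are its actual HTTPAdapter parameters), configuring a per-host HTTP connection pool looks like this:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Cache pools for up to 10 distinct hosts; keep up to 50 connections per host
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Every call through this session reuses pooled keep-alive connections
# instead of paying the TCP/TLS handshake per request:
# resp = session.get("https://api.example.com/items")
```

Creating a fresh `requests.get(...)` per call, by contrast, opens and closes a connection each time, which is exactly the cost pooling exists to amortize.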
Pool Sizing
A widely cited starting point (from HikariCP's pool-sizing guidance): pool_size = (core_count * 2) + disk_spindles. For SSDs, that simplifies to roughly core_count * 2. Counter-intuitively, smaller pools often outperform larger ones because they reduce context switching and lock contention on the database side.
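The formula is trivial to encode; a sketch (the function name is illustrative, not from any library):

```python
import os

def recommended_pool_size(core_count, disk_spindles=0):
    # (cores * 2) + spindles; treat spindles as ~0 for SSD/NVMe storage
    return core_count * 2 + disk_spindles

print(recommended_pool_size(8))               # -> 16 for an 8-core SSD box
print(recommended_pool_size(os.cpu_count() or 1))  # sized for this machine
```

Treat the result as a starting point for load testing, not a final answer.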
# PgBouncer config example (pgbouncer.ini)
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction # Return conn to pool after each transaction
default_pool_size = 20 # Connections per user/database pair
max_client_conn = 1000 # Max client connections to PgBouncer
reserve_pool_size = 5 # Extra conns for burst traffic
reserve_pool_timeout = 3 # Seconds before using reserve pool