
Performance Engineering

Measuring, modeling, and eliminating bottlenecks. From latency percentiles and queueing theory to flame graphs, load testing, and the anti-patterns that silently kill throughput.

01 / Key Metrics

What to Measure and Why

You cannot improve what you do not measure, but measuring the wrong thing is worse than measuring nothing. Performance engineering begins with choosing the right metrics and understanding what they actually tell you.

Latency and Percentiles

Latency is the time between a request being sent and its response arriving. The trap: reporting average latency hides the experience of your worst-off users. If P50 is 20ms but P99 is 1.2s, one in a hundred users waits 60x longer than the median user.

Latency distribution (not symmetric): P50 = 20ms · P95 = 180ms · P99 = 1.2s · P99.9 = 4.8s
Why Averages Lie
Averages are heavily influenced by outliers in skewed distributions. A single 30-second timeout can shift the average dramatically while the median barely moves. Worse, averages are not additive across services: if service A averages 50ms and service B averages 50ms, the average of A+B is NOT necessarily 100ms. Always report P50, P95, and P99.
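
A few lines of Python's standard library make the point concrete (the numbers are made up to mirror the scenario above):

import statistics

latencies_ms = [20] * 99 + [30_000]     # 99 requests at 20ms, one 30-second timeout
print(statistics.mean(latencies_ms))    # 319.8 — the "average" user does not exist
print(statistics.median(latencies_ms))  # 20 — the typical experience barely moves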

Throughput

QPS (queries per second) and TPS (transactions per second) measure how much work a system completes per unit time. High throughput with high latency often means requests are queuing. High throughput with low latency is the goal.

Other Critical Metrics

| Metric | What It Measures | Watch For |
| --- | --- | --- |
| Bandwidth | Data volume per second (Gbps) | Network saturation, large payloads |
| IOPS | I/O operations per second | Disk bottlenecks, random vs sequential |
| Saturation | How full a resource is (0-100%) | Queuing begins well before 100% |
| Error Rate | Failed requests / total requests | Load shedding masking as "fast" responses |
The USE Method (Brendan Gregg)
For every resource, check: Utilization (% busy), Saturation (queue depth), and Errors. This systematically catches bottlenecks that ad-hoc monitoring misses.
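
As a rough sketch, a scripted USE spot check might look like this in Python, assuming the third-party psutil package (utilization and saturation for the CPU, error counters for the NIC; getloadavg is Unix-only):

import os
import psutil

utilization = psutil.cpu_percent(interval=1)      # U: % busy over a 1s window
load_1m, _, _ = os.getloadavg()                   # 1-minute run-queue average
saturation = max(0.0, load_1m - os.cpu_count())   # S: runnable tasks beyond cores
nic = psutil.net_io_counters()                    # E: NIC error counters
print(f"CPU U={utilization:.0f}%  S={saturation:.1f}  NIC errors={nic.errin + nic.errout}")
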
02 / Fundamental Laws

The Math Behind Scaling

Three laws form the theoretical backbone of performance engineering. They tell you what is possible before you write a single line of code.

Amdahl's Law

If a fraction s of your workload is serial (cannot be parallelized), then the maximum speedup with N processors is:

Speedup(N) = 1 / (s + (1 - s) / N)

Example: s = 5% serial
  N = 10  cores  →  Speedup = 6.9x
  N = 100 cores  →  Speedup = 16.8x
  N = ∞   cores  →  Speedup = 20x  (hard ceiling!)
The Serial Fraction Trap
Even 5% serial work limits you to 20x speedup no matter how many cores you throw at it. This is why "just add more servers" eventually stops working. Identify and shrink the serial fraction first.
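
The ceiling is easy to verify numerically; a few lines of Python reproduce the table above:

def amdahl_speedup(serial_fraction: float, n: float) -> float:
    return 1 / (serial_fraction + (1 - serial_fraction) / n)

for n in (10, 100, 1e9):
    print(f"N={n:>13,.0f}  speedup={amdahl_speedup(0.05, n):.1f}x")
# 6.9x, 16.8x, then 20.0x — the 1/s ceiling, no matter how large N grows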

Little's Law

A deceptively simple but universally applicable relationship for any stable system:

L = λ × W

L = average number of items in system
λ = average arrival rate
W = average time each item spends in system

Example: If requests arrive at 500/sec (λ) and each takes 200ms (W):
  L = 500 × 0.2 = 100 concurrent requests in the system

This tells you how many connections, threads, or workers you need. It works for queues, databases, thread pools, and even checkout lines at the grocery store.
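
The arithmetic is trivial, which is the point; a sketch with the numbers from the example above:

arrival_rate = 500       # λ: requests per second
time_in_system = 0.200   # W: seconds per request
in_flight = arrival_rate * time_in_system   # L = λ × W
print(in_flight)         # 100.0 → provision ~100 threads/connections, or requests queue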

Universal Scalability Law (USL)

Extends Amdahl's Law by adding a coherence penalty term: as you add nodes, they must coordinate (cache invalidation, locks, consensus). Beyond a certain point, adding capacity actually decreases throughput.

[Figure: throughput vs. concurrency — linear scaling (ideal), Amdahl's Law (plateaus), USL (peaks, then drops)]

The USL model: C(N) = N / (1 + α(N-1) + β·N·(N-1)) where α is contention and β is coherence delay. When β > 0, there is a point of negative returns.
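
A quick numerical sketch of that point of negative returns (the α and β values here are hypothetical; in practice you fit them to load-test measurements):

def usl_throughput(n: int, alpha: float, beta: float) -> float:
    # C(N) = N / (1 + α(N-1) + β·N·(N-1))
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

alpha, beta = 0.05, 0.001   # illustrative contention / coherence coefficients
peak = max(range(1, 201), key=lambda n: usl_throughput(n, alpha, beta))
print(peak)                 # ~31 nodes here; beyond this, throughput declines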

03 / Profiling

Finding the Bottleneck

Profiling answers "where is the time going?" before you optimize. Intuition about where the bottleneck lies is wrong more often than you would expect; measure before you change anything.

CPU Flame Graphs

Flame graphs (invented by Brendan Gregg) visualize stack traces sampled over time. The x-axis is the sampled stack population (wider = more time spent), and the y-axis is stack depth. They instantly reveal which functions dominate CPU time.

# Generate a flame graph on Linux
perf record -g -p <PID> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# For Java (using async-profiler)
./asprof -d 30 -f flame.html <PID>
Reading a Flame Graph
Look for wide plateaus: functions that account for a large share of samples. Colors are random and carry no meaning. A frame that is wide at the top of the graph is a leaf function doing real on-CPU work; a frame that is wide near the bottom spends its time in the child frames stacked above it.

Memory Profiling

Track heap allocations, object retention, and GC behavior. In garbage-collected languages, excessive allocation rates create GC pressure even if peak memory is fine. Tools: jmap/MAT (Java), memray (Python), pprof (Go), Chrome DevTools (JS).
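
For Python specifically, the standard library's tracemalloc is enough for a first pass at finding allocation hot spots; a minimal sketch:

import tracemalloc

tracemalloc.start()
data = ["x" * 100 for _ in range(100_000)]   # deliberately allocation-heavy
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)   # top allocation sites by size: file, line, total bytes, count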

I/O Profiling

Disk and network I/O are often the real bottleneck but are harder to profile. Use iostat, iotop, strace (syscall tracing), and bpftrace for production-safe tracing on Linux.

Benchmarking Pitfalls
Coordinated omission: if your load generator waits for each response before sending the next request, high-latency responses reduce the measured load, making the system look faster than it is. Use open-loop benchmarks. Also beware: JIT warmup, OS page cache, and CPU frequency scaling can all invalidate results.
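
The difference between the two loops fits in a few lines. This toy asyncio sketch (with a stand-in for the real request) sends at a fixed rate without waiting for responses, which is what an open-loop generator does:

import asyncio
import time

async def fake_request():
    await asyncio.sleep(0.5)   # stand-in for a slow server response

async def open_loop(rate_per_sec: float, duration_sec: float):
    interval = 1 / rate_per_sec
    tasks, start = [], time.monotonic()
    while time.monotonic() - start < duration_sec:
        tasks.append(asyncio.create_task(fake_request()))  # fire; do NOT await
        await asyncio.sleep(interval)                      # hold the schedule
    await asyncio.gather(*tasks)

asyncio.run(open_loop(rate_per_sec=20, duration_sec=2))
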
04 / Load Testing

Breaking Things on Purpose

Load testing validates whether your system meets performance targets under realistic conditions. Different test types answer different questions.

| Test Type | Purpose | Duration | Load Pattern |
| --- | --- | --- | --- |
| Load | Can we handle expected traffic? | 5-30 min | Ramp to target QPS |
| Stress | Where does the system break? | Until failure | Ramp beyond capacity |
| Soak | Memory leaks, resource exhaustion | 4-24 hours | Sustained target QPS |
| Spike | Can we handle traffic bursts? | Short bursts | Sudden 10x jumps |

Tools

k6

JavaScript-based, scriptable. Excellent for CI/CD integration. Outputs detailed percentile data. Good for protocol-level testing (HTTP, gRPC, WebSocket).

Locust

Python-based, user-behavior modeling. Distributed mode for high load. Great when tests need complex logic or database seeding.

wrk / wrk2

Ultra-lightweight C-based HTTP benchmarking. wrk2 fixes coordinated omission. Best for raw throughput testing of a single endpoint.

# k6 example: ramp to 200 VUs over 1 minute, hold for 3, then ramp down
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 200 },
    { duration: '3m', target: 200 },
    { duration: '1m', target: 0 },
  ],
};

export default function () {
  const res = http.get('https://api.example.com/items');
  check(res, { 'status 200': (r) => r.status === 200 });
}
Load Testing Best Practice
Always test from a separate machine (not the same host as the server). Use realistic data, not the same cached request repeated. Monitor the load generator itself — it can become the bottleneck.
05 / Anti-Patterns

Performance Killers

These patterns silently degrade performance. They often pass code review because they look correct, but under load they are devastating.

N+1 Queries

Fetching a list of N items, then issuing a separate query for each item's related data. Turns 1 query into N+1. Fix: use JOINs, eager loading, or DataLoader batching.

Chatty APIs

Client makes many small requests instead of one composite request. Each round trip adds latency. Fix: aggregate endpoints, GraphQL, or BFF (Backend for Frontend).

Blocking I/O in Async

A single blocking call in an async event loop starves all other coroutines. In Node.js or Python asyncio, this freezes the entire server. Fix: use async-compatible libraries or offload to a thread pool.
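
A minimal asyncio illustration of both the failure mode and the standard fixes (run_in_executor is the stdlib escape hatch for legacy blocking calls):

import asyncio
import time

async def bad_handler():
    time.sleep(1)            # BAD: blocks the event loop; all coroutines stall

async def good_handler():
    await asyncio.sleep(1)   # GOOD: yields control while waiting

async def wrapped_legacy_call():
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, 1)  # offload to a thread pool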

GC Pressure

Allocating millions of short-lived objects forces frequent garbage collection pauses. Fix: reuse objects, use object pools, avoid unnecessary boxing, prefer primitives.

Lock Contention

Multiple threads competing for the same lock serialize execution, negating parallelism. Fix: lock-free data structures, finer-grained locks, or share-nothing architecture.

Unbounded Queues

Queues without size limits let producers outpace consumers, eventually exhausting memory. Under load, latency grows unboundedly. Fix: bounded queues with backpressure.

# N+1 query example (Python/SQLAlchemy)
from sqlalchemy.orm import joinedload

# BAD: N+1 — 1 query for users, then N queries for orders
users = session.query(User).all()
for user in users:
    print(user.orders)  # Each access triggers a lazy-load query

# GOOD: Eager loading — 1 query with JOIN
users = session.query(User).options(joinedload(User.orders)).all()
for user in users:
    print(user.orders)  # Already loaded, no extra query
The Unbounded Queue Trap
An unbounded queue under sustained overload does not shed load — it accumulates it. Memory grows linearly with time. When the process finally OOMs, all queued work is lost. Always set a max size and decide what to do when the queue is full: drop, reject, or block.
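
A bounded queue with an explicit rejection policy is a few lines in Python's standard library; a sketch of the "reject" option:

import queue

jobs = queue.Queue(maxsize=1000)   # hard cap: memory cannot grow without bound

def submit(job) -> bool:
    try:
        jobs.put_nowait(job)       # never blocks the producer
        return True
    except queue.Full:
        return False               # shed load: caller gets an explicit rejection
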
06 / Connection Pooling

Amortizing Connection Cost

Establishing a new connection is expensive. A TCP handshake costs 1 RTT, TLS adds 1-2 more (one for TLS 1.3, two for TLS 1.2), and database authentication adds yet another round trip. For a cross-region connection at 80ms RTT, that is 3-4 round trips, or 240-320ms, before the first byte of useful work.

[Diagram: cost of a new connection — TCP handshake (1 RTT) → TLS handshake (1-2 RTT) → auth/login (1 RTT) → ready]

A connection pool maintains a set of pre-established connections. When your code needs a connection, it borrows one from the pool and returns it when done. This amortizes the setup cost across thousands of requests.

Database Connection Pools

| Pool | Language/DB | Key Feature |
| --- | --- | --- |
| HikariCP | Java / JDBC | Fastest JVM pool. Minimal overhead, no-frills design. Default in Spring Boot. |
| PgBouncer | Any / PostgreSQL | External proxy. Transaction-level pooling lets 1000s of clients share a few PG connections. |
| pgpool-II | Any / PostgreSQL | Also handles replication, load balancing. Heavier than PgBouncer. |

HTTP Connection Pools

HTTP/1.1 keep-alive reuses TCP connections across requests. HTTP/2 multiplexes many requests over a single connection. Most HTTP client libraries manage pools internally, but you must configure max connections per host and idle timeouts.
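
In Python's requests library, for example, pool limits live on the transport adapter (the sizes below are illustrative, not recommendations):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,   # number of distinct hosts to keep pools for
    pool_maxsize=50,       # max persistent connections per host
)
session.mount("https://", adapter)
# Reuse `session` everywhere: each request rides a warm TCP/TLS connection.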

Pool Sizing

HikariCP's Formula
For database pools, a good starting point: pool_size = (core_count * 2) + disk_spindles. For SSDs, that simplifies to roughly core_count * 2. Counter-intuitively, smaller pools often outperform larger ones because they reduce context switching and lock contention on the database side.
# PgBouncer config example (pgbouncer.ini)
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction          # Return conn to pool after each transaction
default_pool_size = 20           # Connections per user/database pair
max_client_conn = 1000           # Max client connections to PgBouncer
reserve_pool_size = 5            # Extra conns for burst traffic
reserve_pool_timeout = 3         # Seconds before using reserve pool
Connection Pool Exhaustion
If your pool is too small or connections are not returned (leaked), threads block waiting for a free connection. This manifests as sudden latency spikes and timeouts under moderate load. Always set a connection checkout timeout and monitor pool utilization.
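
With SQLAlchemy, for instance, all three knobs are constructor arguments (the connection URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@localhost/mydb",
    pool_size=20,       # steady-state connections
    max_overflow=5,     # burst headroom beyond pool_size
    pool_timeout=3,     # seconds to wait for a free connection before raising
)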

Test Yourself

Question 01
A service has P50 latency of 15ms and P99 latency of 2.5s. The average latency is 120ms. Which metric best represents the experience of the worst-affected users?
P99 captures the tail latency experienced by the worst 1% of users. The average is skewed by these outliers and misrepresents both the typical and worst-case experience. Always look at high percentiles to understand tail behavior.
Question 02
According to Amdahl's Law, if 10% of a workload is serial, what is the maximum possible speedup with infinite processors?
Speedup = 1/s where s is the serial fraction. With s = 0.10, the maximum speedup is 1/0.10 = 10x, regardless of how many processors you add. The serial portion becomes the bottleneck.
Question 03
Using Little's Law: if a web server handles 1000 requests/sec and each request takes an average of 50ms, how many concurrent requests are in the system?
L = λ × W = 1000 req/sec × 0.050 sec = 50 concurrent requests. This tells you that you need at least 50 worker threads (or equivalent concurrency) to handle this load without queuing.
Question 04
What is "coordinated omission" in the context of load testing?
In a closed-loop benchmark, the client waits for each response before sending the next request. When the server is slow, the client sends fewer requests, reducing measured load. This makes the system appear faster than it really is under sustained traffic. Open-loop generators (like wrk2) fix this by sending requests at a fixed rate regardless of response time.
Question 05
What type of load test is best for detecting memory leaks and slow resource exhaustion?
Soak tests run at a sustained load for hours (4-24h). They are specifically designed to catch problems that only manifest over time: memory leaks, connection leaks, log file growth, file descriptor exhaustion, and gradual performance degradation.
Question 06
A page loads a list of 50 blog posts, then fetches each post's author with a separate query. What anti-pattern is this?
This is a classic N+1 query: 1 query to get 50 posts, then 50 individual queries to get each post's author. The fix is to use a JOIN or eager loading to fetch posts and authors in a single query (or at most 2 queries with an IN clause).
Question 07
Why do smaller database connection pools often outperform larger ones?
Each database connection consumes server resources (memory, a backend process/thread). More connections mean more context switching on the DB server and more contention for shared resources like buffer pools and WAL locks. A smaller pool lets each connection get more CPU time and reduces coordination overhead.
Question 08
In a flame graph, what does a wide plateau at the top of the stack indicate?
The top of the stack in a flame graph represents leaf functions — code that is actually "on CPU" rather than waiting for a child function to return. A wide plateau at the top means that function is directly consuming a large share of CPU time and is a prime optimization target.
Question 09
What happens when you make a blocking I/O call inside a Python asyncio event loop?
Asyncio uses a single-threaded event loop. A blocking call (like time.sleep() or a synchronous HTTP request) blocks the entire thread, preventing the event loop from scheduling any other coroutines. All concurrent work stalls until the blocking call returns. Use async libraries or run_in_executor() to offload blocking work.
Question 10
What does the Universal Scalability Law add beyond Amdahl's Law?
Amdahl's Law models contention (serial fraction) but assumes no coordination cost between processors. The USL adds a coherence/crosstalk penalty (β term) that accounts for the cost of keeping caches, state, or data consistent across nodes. When β > 0, adding too many nodes causes throughput to actually decrease — a retrograde region.