Networking

TCP Deep Dive

How the Transmission Control Protocol provides reliable, ordered, byte-stream delivery over an unreliable network -- from the 3-way handshake to congestion control algorithms and the reasons HTTP/3 moved to QUIC.

01 / Fundamentals

What TCP Guarantees

TCP (specified today by RFC 9293, which obsoletes the original RFC 793) sits at the transport layer and provides four core guarantees that raw IP does not: connection-oriented communication, reliable delivery, ordered byte arrival, and a byte-stream abstraction that hides packet boundaries from the application.

Connection-Oriented

Both endpoints must establish a connection before any data flows. State is maintained at each end.

Reliable Delivery

Every byte is acknowledged. Lost segments are retransmitted. Checksums detect corruption.

Ordered

Sequence numbers ensure the receiver reassembles data in the exact order it was sent, even if packets arrive out of order.

Byte-Stream

Applications read and write a continuous stream of bytes. TCP decides how to segment it into packets (segments) internally.

TCP vs UDP at a glance
TCP trades latency for reliability. UDP trades reliability for speed. Neither is "better" -- the choice depends on whether your application can tolerate lost or reordered data.
Property    | TCP                             | UDP
------------|---------------------------------|------------------------------
Connection  | Connection-oriented (handshake) | Connectionless
Reliability | ACKs + retransmission           | Best-effort, no ACKs
Ordering    | Guaranteed via sequence numbers | No ordering guarantee
Overhead    | 20-60 byte header               | 8 byte header
Use cases   | HTTP, SSH, databases            | DNS, video streaming, gaming
02 / Connection Lifecycle

Handshake, Data Transfer, and Teardown

The 3-Way Handshake

Before any data flows, client and server synchronize sequence numbers via three segments: SYN, SYN-ACK, ACK. This prevents stale connections from being accepted and lets both sides agree on initial sequence numbers (ISNs).

TCP 3-Way Handshake
Client                                    Server
CLOSED       SYN (seq=x) →                LISTEN
SYN_SENT     ← SYN-ACK (seq=y, ack=x+1)   SYN_RCVD
ESTAB        ACK (ack=y+1) →              ESTAB

The 4-Way Teardown

Either side can initiate a close. Because TCP is full-duplex, each direction must be shut down independently. The initiator sends FIN, gets ACK, then the other side sends its FIN and receives ACK. This produces the well-known FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT progression on the initiating side.

TCP 4-Way Teardown
Initiator                                 Responder
ESTAB        FIN (seq=m) →                ESTAB
FIN_WAIT_1   ← ACK (ack=m+1)              CLOSE_WAIT
FIN_WAIT_2   ← FIN (seq=n)                LAST_ACK
TIME_WAIT    ACK (ack=n+1) →              CLOSED
TIME_WAIT = 2 x MSL
The initiating side stays in TIME_WAIT for twice the Maximum Segment Lifetime (typically 60s total on Linux). This ensures delayed duplicates from the old connection expire before the same port pair is reused. On busy servers, many sockets in TIME_WAIT can exhaust ephemeral ports -- mitigations include SO_REUSEADDR, the net.ipv4.tcp_tw_reuse sysctl, and widening the ephemeral port range.
03 / Flow Control

Sequence Numbers, ACKs, and the Sliding Window

Every byte in a TCP stream has a 32-bit sequence number. The receiver advertises a receive window (rwnd) -- the number of bytes it can buffer. The sender must not have more than rwnd bytes outstanding (sent but not yet acknowledged).

Sliding Window Concept
ACKed bytes
Sent, awaiting ACK
Sendable (within window)
Not yet sendable

Window Scaling (RFC 7323)

The original 16-bit window field caps rwnd at 64 KB -- far too small for high-bandwidth, high-latency links (bandwidth-delay product). The window scale option, negotiated during the handshake, applies a left-shift of up to 14 bits, supporting windows up to ~1 GB.

Cumulative vs Selective ACK

Standard TCP ACKs are cumulative: an ACK number of N means "I have received all bytes up to N-1." This is simple but forces retransmission of everything after a gap. Selective ACK (SACK), defined in RFC 2018, lets the receiver report non-contiguous blocks it has received, so the sender retransmits only the missing segments.

Practical Impact
On a 100 Mbps link with 50 ms RTT, the bandwidth-delay product is 625 KB. Without window scaling, TCP can only fill 64 KB / 625 KB = ~10% of the pipe. Window scaling is essential for modern networks.
04 / Congestion Control

Slow Start, CUBIC, and BBR

Flow control prevents overwhelming the receiver. Congestion control prevents overwhelming the network. The sender maintains a congestion window (cwnd). The effective sending window is min(cwnd, rwnd).

Congestion Window Growth (Classic TCP)
Slow Start
cwnd doubles each RTT
ssthresh reached
Congestion Avoidance
cwnd += 1 MSS per RTT
Loss detected

Phases of Congestion Control

Phase                | Trigger                    | Behavior
---------------------|----------------------------|------------------------------------------------------------------
Slow Start           | Connection open or timeout | cwnd starts at 1-10 MSS; doubles each RTT (exponential growth)
Congestion Avoidance | cwnd ≥ ssthresh            | cwnd grows by ~1 MSS per RTT (linear / additive increase)
Fast Retransmit      | 3 duplicate ACKs           | Retransmit missing segment immediately without waiting for timeout
Fast Recovery        | After fast retransmit      | ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS, then linear growth

CUBIC (Linux default)

CUBIC uses a cubic function of time since the last loss event to set cwnd, making it less aggressive near the previous maximum but more aggressive when far below it. This produces faster recovery on high-BDP links than the classic Reno/NewReno linear increase.

BBR (Bottleneck Bandwidth and RTT)

Developed by Google, BBR does not use packet loss as the primary congestion signal. Instead it estimates the bottleneck bandwidth and the minimum RTT, then paces packets to match. This works much better on links where bufferbloat causes loss-based algorithms to underperform.

CUBIC vs BBR
CUBIC reacts to loss (fills buffers until packets drop). BBR proactively models the link (tries to avoid filling buffers). BBR typically achieves lower latency and higher throughput on lossy or bufferbloated paths, but can be unfair to CUBIC flows sharing the same bottleneck.
05 / TCP State Machine and Tuning Knobs

States, Options, and Practical Tuning

The 11 TCP States

TCP State Machine (all states)
CLOSED
No connection
LISTEN
Server waiting for SYN
SYN_SENT
Client sent SYN
SYN_RCVD
Server got SYN, sent SYN-ACK
ESTABLISHED
Data transfer
FIN_WAIT_1
Sent FIN, awaiting ACK
FIN_WAIT_2
Got ACK for FIN
CLOSE_WAIT
Got FIN, app not closed yet
CLOSING
Both sides sent FIN simultaneously
LAST_ACK
Sent FIN, awaiting final ACK
TIME_WAIT
Wait 2xMSL before reuse

Important Options and Tuning

Nagle's Algorithm

Coalesces small writes into fewer, larger segments. Reduces overhead but adds latency. Disable with TCP_NODELAY for latency-sensitive apps (games, trading).

Delayed ACK

Receiver waits up to ~40 ms hoping to piggyback the ACK on a data response. Interacts badly with Nagle's -- disable one or the other if you see 40 ms stalls.

TCP Keepalive

Sends probes after idle period (default 2 hours on Linux) to detect dead peers. Configure via tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.

MSS (Max Segment Size)

Announced during handshake. Typically 1460 bytes (1500 MTU minus 40 bytes of IP+TCP headers). Path MTU discovery adjusts this.

TCP Fast Open (TFO)

Allows data in the SYN packet on repeat connections using a cached cookie. Saves one RTT on connection setup. Requires both client and server support.

SO_REUSEADDR / SO_REUSEPORT

REUSEADDR lets you bind to a port in TIME_WAIT. REUSEPORT lets multiple sockets bind to the same port for load balancing across threads.

Nagle + Delayed ACK = Latency Trap
When Nagle is on, the sender waits to coalesce small writes until it gets an ACK. When delayed ACK is on, the receiver waits to piggyback. Both sides end up waiting on each other, introducing up to 40 ms of artificial delay per write. Set TCP_NODELAY on interactive connections.
06 / Head-of-Line Blocking and Socket Programming

TCP's Limits and the Path to QUIC

Head-of-Line Blocking

TCP guarantees ordered delivery within a single byte stream. If segment #5 is lost, segments #6, #7, #8 must wait in the receive buffer even though they arrived fine. For HTTP/2, which multiplexes many logical streams over one TCP connection, a single lost packet blocks all streams. This is TCP-level head-of-line blocking.

Head-of-Line Blocking in HTTP/2 over TCP
Stream A pkt
Stream B pkt LOST
Stream C pkt (blocked)
Stream A pkt (blocked)

HTTP/3 solves this by replacing TCP with QUIC, which runs over UDP and implements its own per-stream reliability. A lost packet on stream B only blocks stream B -- streams A and C continue unaffected.

Socket Programming: Server Lifecycle

Understanding TCP's connection model is essential for server programming. The kernel manages two queues for each listening socket:

Server Socket Lifecycle
socket()
bind()
listen(backlog)
accept()
read/write
close()
Queue                   | Contains                                     | Controlled by
------------------------|----------------------------------------------|--------------------------
SYN queue (incomplete)  | Connections in SYN_RCVD state                | tcp_max_syn_backlog
Accept queue (complete) | ESTABLISHED connections waiting for accept() | listen(backlog) argument
SYN Flood Attacks
An attacker sends many SYN packets with spoofed source IPs. The server allocates state for each in the SYN queue and sends SYN-ACKs that will never be answered. The SYN queue fills up, blocking legitimate connections. Defenses: SYN cookies (encode state in the ISN, avoid allocating queue entries), rate limiting, and firewall rules. Enable via net.ipv4.tcp_syncookies = 1.
// Minimal TCP server in C (error checks omitted for brevity)
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080),
        .sin_addr.s_addr = INADDR_ANY
    };
    bind(server_fd, (struct sockaddr*)&addr, sizeof(addr));
    listen(server_fd, 128);  // backlog = 128 (caps the accept queue)

    while (1) {
        int client_fd = accept(server_fd, NULL, NULL);
        // read/write on client_fd ...
        close(client_fd);
    }
}

Test Yourself

Question 01
How many packets are exchanged during a normal TCP connection establishment?
TCP uses a 3-way handshake: the client sends SYN, the server responds with SYN-ACK, and the client completes with ACK. This synchronizes sequence numbers on both sides.
Question 02
Why does the TIME_WAIT state last for 2x MSL?
TIME_WAIT lasts 2x MSL so that any delayed segments from the just-closed connection have time to expire in the network. This prevents them from being misinterpreted as belonging to a new connection on the same port pair.
Question 03
What is the maximum receive window size with the window scaling option?
The window field is 16 bits (max 65535). Window scaling applies a left-shift of up to 14 bits, giving a maximum of 65535 × 16384 ≈ 1 GB.
Question 04
During TCP slow start, how does the congestion window (cwnd) grow?
Despite being called "slow" start, cwnd actually doubles each RTT (exponential growth). It's "slow" only compared to sending at full rate immediately. Growth switches to linear (congestion avoidance) once cwnd reaches ssthresh.
Question 05
What triggers TCP fast retransmit?
When the sender receives 3 duplicate ACKs (4 total ACKs for the same sequence number), it infers a packet was lost and retransmits immediately without waiting for the retransmission timeout.
Question 06
What problem does setting TCP_NODELAY solve?
TCP_NODELAY disables Nagle's algorithm, which coalesces small writes into larger segments. This is critical for latency-sensitive applications like real-time games, financial trading, and interactive protocols where every small message should be sent immediately.
Question 07
Why did HTTP/3 switch from TCP to QUIC (over UDP)?
TCP delivers a single ordered byte stream. When HTTP/2 multiplexes many streams over one TCP connection, a single lost packet blocks all streams. QUIC implements per-stream reliability over UDP, so loss on one stream does not block others.
Question 08
What does the listen() backlog parameter control?
The backlog limits the size of the accept queue -- completed connections (ESTABLISHED) that have not yet been picked up by accept(). When this queue is full, new incoming connections may be dropped or receive a RST.
Question 09
How do SYN cookies defend against SYN flood attacks?
SYN cookies encode the essential connection parameters (MSS, timestamp, etc.) into the server's initial sequence number. The server does not allocate any state for the SYN_RCVD entry. When the client's ACK arrives, the server reconstructs the state from the sequence number.
Question 10
How does BBR differ from CUBIC in its approach to congestion control?
CUBIC (and Reno) treat packet loss as the congestion signal -- they fill buffers until packets drop. BBR instead estimates the bottleneck bandwidth (BtlBw) and minimum RTT (RTprop), then paces packets to match, aiming to keep queues empty rather than full.