How the Transmission Control Protocol provides reliable, ordered, byte-stream delivery over an unreliable network -- from the 3-way handshake to congestion control algorithms and the reasons HTTP/3 moved to QUIC.
01 / Fundamentals
What TCP Guarantees
TCP (originally specified in RFC 793, now obsoleted by RFC 9293) sits at the transport layer and provides four core guarantees that raw IP does not: connection-oriented communication, reliable delivery, ordered byte arrival, and a byte-stream abstraction that hides packet boundaries from the application.
Connection-Oriented
Both endpoints must establish a connection before any data flows. State is maintained at each end.
Reliable Delivery
Every byte is acknowledged. Lost segments are retransmitted. Checksums detect corruption.
Ordered
Sequence numbers ensure the receiver reassembles data in the exact order it was sent, even if packets arrive out of order.
Byte-Stream
Applications read and write a continuous stream of bytes. TCP decides how to segment it into packets (segments) internally.
TCP vs UDP at a glance
TCP trades latency for reliability. UDP trades reliability for speed. Neither is "better" -- the choice depends on whether your application can tolerate lost or reordered data.
Property    | TCP                             | UDP
----------- | ------------------------------- | ----------------------------
Connection  | Connection-oriented (handshake) | Connectionless
Reliability | ACKs + retransmission           | Best-effort, no ACKs
Ordering    | Guaranteed via sequence numbers | No ordering guarantee
Overhead    | 20-60 byte header               | 8-byte header
Use cases   | HTTP, SSH, databases            | DNS, video streaming, gaming
02 / Connection Lifecycle
Handshake, Data Transfer, and Teardown
The 3-Way Handshake
Before any data flows, client and server synchronize sequence numbers via three segments: SYN, SYN-ACK, ACK. This prevents stale connections from being accepted and lets both sides agree on initial sequence numbers (ISNs).
TCP 3-Way Handshake
Client                                          Server
CLOSED      -- SYN (seq=x) -->                  LISTEN
SYN_SENT    <-- SYN-ACK (seq=y, ack=x+1) --     SYN_RCVD
ESTAB       -- ACK (ack=y+1) -->                ESTAB
The 4-Way Teardown
Either side can initiate a close. Because TCP is full-duplex, each direction must be shut down independently. The initiator sends FIN, gets ACK, then the other side sends its FIN and receives ACK. This produces the well-known FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT progression on the initiating side.
TCP 4-Way Teardown
Initiator                                       Peer
ESTAB       -- FIN (seq=m) -->                  ESTAB
FIN_WAIT_1  <-- ACK (ack=m+1) --                CLOSE_WAIT
FIN_WAIT_2  <-- FIN (seq=n) --                  LAST_ACK
TIME_WAIT   -- ACK (ack=n+1) -->                CLOSED
TIME_WAIT = 2 x MSL
The initiating side stays in TIME_WAIT for twice the Maximum Segment Lifetime (typically 60s total on Linux). This ensures delayed duplicates from the old connection expire before the same port pair is reused. On busy servers, many sockets in TIME_WAIT can exhaust ephemeral ports -- mitigations include SO_REUSEADDR, tcp_tw_reuse, and reducing MSL.
03 / Flow Control
Sequence Numbers, ACKs, and the Sliding Window
Every byte in a TCP stream has a 32-bit sequence number. The receiver advertises a receive window (rwnd) -- the number of bytes it can buffer. The sender must not have more than rwnd bytes outstanding (sent but not yet acknowledged).
Sliding Window Concept
ACKed bytes | Sent, awaiting ACK | Sendable (within window) | Not yet sendable
            |<----------- send window (at most rwnd bytes) ----------->|
Window Scaling (RFC 7323)
The original 16-bit window field caps rwnd at 64 KB -- far too small for high-bandwidth, high-latency links (bandwidth-delay product). The window scale option, negotiated during the handshake, applies a left-shift of up to 14 bits, supporting windows up to ~1 GB.
Cumulative vs Selective ACK
Standard TCP ACKs are cumulative: an ACK number of N means "I have received all bytes up to N-1." This is simple but forces retransmission of everything after a gap. Selective ACK (SACK), defined in RFC 2018, lets the receiver report non-contiguous blocks it has received, so the sender retransmits only the missing segments.
Practical Impact
On a 100 Mbps link with 50 ms RTT, the bandwidth-delay product is 625 KB. Without window scaling, TCP can only fill 64 KB / 625 KB = ~10% of the pipe. Window scaling is essential for modern networks.
04 / Congestion Control
Slow Start, CUBIC, and BBR
Flow control prevents overwhelming the receiver. Congestion control prevents overwhelming the network. The sender maintains a congestion window (cwnd). The effective sending window is min(cwnd, rwnd).
Congestion Window Growth (Classic TCP)
Slow Start (cwnd doubles each RTT) → ssthresh reached → Congestion Avoidance (cwnd += 1 MSS per RTT) → Loss detected
Phases of Congestion Control
Phase                | Trigger                    | Behavior
-------------------- | -------------------------- | ------------------------------------------------------------------
Slow Start           | Connection open or timeout | cwnd starts at 1-10 MSS; doubles each RTT (exponential growth)
Congestion Avoidance | cwnd ≥ ssthresh            | cwnd grows by ~1 MSS per RTT (linear / additive increase)
Fast Retransmit      | 3 duplicate ACKs           | Retransmit missing segment immediately without waiting for timeout
Fast Recovery        | After fast retransmit      | ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS, then linear growth
CUBIC (Linux default)
CUBIC uses a cubic function of time since the last loss event to set cwnd, making it less aggressive near the previous maximum but more aggressive when far below it. This produces faster recovery on high-BDP links than the classic Reno/NewReno linear increase.
BBR (Bottleneck Bandwidth and RTT)
Developed by Google, BBR does not use packet loss as the primary congestion signal. Instead it estimates the bottleneck bandwidth and the minimum RTT, then paces packets to match. This works much better on links where buffer bloat causes loss-based algorithms to underperform.
CUBIC vs BBR
CUBIC reacts to loss (fills buffers until packets drop). BBR proactively models the link (tries to avoid filling buffers). BBR typically achieves lower latency and higher throughput on lossy or bufferbloated paths, but can be unfair to CUBIC flows sharing the same bottleneck.
05 / TCP State Machine and Tuning Knobs
States, Options, and Practical Tuning
The 11 TCP States
TCP State Machine (all states)
CLOSED No connection
LISTEN Server waiting for SYN
SYN_SENT Client sent SYN
SYN_RCVD Server got SYN, sent SYN-ACK
ESTABLISHED Data transfer
FIN_WAIT_1 Sent FIN, awaiting ACK
FIN_WAIT_2 Got ACK for FIN
CLOSE_WAIT Got FIN, app not closed yet
CLOSING Both sides sent FIN simultaneously
LAST_ACK Sent FIN, awaiting final ACK
TIME_WAIT Wait 2xMSL before reuse
Important Options and Tuning
Nagle's Algorithm
Coalesces small writes into fewer, larger segments. Reduces overhead but adds latency. Disable with TCP_NODELAY for latency-sensitive apps (games, trading).
Delayed ACK
Receiver waits up to ~40 ms hoping to piggyback the ACK on a data response. Interacts badly with Nagle's -- disable one or the other if you see 40 ms stalls.
TCP Keepalive
Sends probes after idle period (default 2 hours on Linux) to detect dead peers. Configure via tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
MSS (Max Segment Size)
Announced during handshake. Typically 1460 bytes (1500 MTU minus 40 bytes of IP+TCP headers). Path MTU discovery adjusts this.
TCP Fast Open (TFO)
Allows data in the SYN packet on repeat connections using a cached cookie. Saves one RTT on connection setup. Requires both client and server support.
SO_REUSEADDR / SO_REUSEPORT
REUSEADDR lets you bind to a port in TIME_WAIT. REUSEPORT lets multiple sockets bind to the same port for load balancing across threads.
Nagle + Delayed ACK = Latency Trap
When Nagle is on, the sender waits to coalesce small writes until it gets an ACK. When delayed ACK is on, the receiver waits to piggyback. Both sides end up waiting on each other, introducing up to 40 ms of artificial delay per write. Set TCP_NODELAY on interactive connections.
06 / Head-of-Line Blocking and Socket Programming
TCP's Limits and the Path to QUIC
Head-of-Line Blocking
TCP guarantees ordered delivery within a single byte stream. If segment #5 is lost, segments #6, #7, #8 must wait in the receive buffer even though they arrived fine. For HTTP/2, which multiplexes many logical streams over one TCP connection, a single lost packet blocks all streams. This is TCP-level head-of-line blocking.
Head-of-Line Blocking in HTTP/2 over TCP
Stream A pkt | Stream B pkt (LOST) | Stream C pkt (blocked) | Stream A pkt (blocked)
HTTP/3 solves this by replacing TCP with QUIC, which runs over UDP and implements its own per-stream reliability. A lost packet on stream B only blocks stream B -- streams A and C continue unaffected.
Socket Programming: Server Lifecycle
Understanding TCP's connection model is essential for server programming. The kernel manages two queues for each listening socket:
Server Socket Lifecycle
socket() → bind() → listen(backlog) → accept() → read/write → close()
Queue                   | Contains                                        | Controlled By
----------------------- | ----------------------------------------------- | ------------------------
SYN Queue (incomplete)  | Connections in SYN_RCVD state                   | tcp_max_syn_backlog
Accept Queue (complete) | Connections in ESTABLISHED waiting for accept() | listen(backlog) argument
SYN Flood Attacks
An attacker sends many SYN packets with spoofed source IPs. The server allocates state for each in the SYN queue and sends SYN-ACKs that will never be answered. The SYN queue fills up, blocking legitimate connections. Defenses: SYN cookies (encode state in the ISN, avoid allocating queue entries), rate limiting, and firewall rules. Enable via net.ipv4.tcp_syncookies = 1.
// Minimal TCP server in C (error handling kept to the essentials)
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) { perror("socket"); exit(1); }

    int opt = 1;  // allow quick restart while old sockets sit in TIME_WAIT
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080),
        .sin_addr.s_addr = INADDR_ANY
    };
    if (bind(server_fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }
    listen(server_fd, 128);  // backlog = 128: accept-queue limit

    while (1) {
        int client_fd = accept(server_fd, NULL, NULL);
        // read/write on client_fd ...
        close(client_fd);
    }
}
Test Yourself
Question 01
How many packets are exchanged during a normal TCP connection establishment?
TCP uses a 3-way handshake: the client sends SYN, the server responds with SYN-ACK, and the client completes with ACK. This synchronizes sequence numbers on both sides.
Question 02
Why does the TIME_WAIT state last for 2x MSL?
TIME_WAIT lasts 2x MSL so that any delayed segments from the just-closed connection have time to expire in the network. This prevents them from being misinterpreted as belonging to a new connection on the same port pair.
Question 03
What is the maximum receive window size with the window scaling option?
The window field is 16 bits (max 65535). Window scaling applies a left-shift of up to 14 bits, giving a maximum of 65535 × 16384 ≈ 1 GB.
Question 04
During TCP slow start, how does the congestion window (cwnd) grow?
Despite being called "slow" start, cwnd actually doubles each RTT (exponential growth). It's "slow" only compared to sending at full rate immediately. Growth switches to linear (congestion avoidance) once cwnd reaches ssthresh.
Question 05
What triggers TCP fast retransmit?
When the sender receives 3 duplicate ACKs (4 total ACKs for the same sequence number), it infers a packet was lost and retransmits immediately without waiting for the retransmission timeout.
Question 06
What problem does setting TCP_NODELAY solve?
TCP_NODELAY disables Nagle's algorithm, which coalesces small writes into larger segments. This is critical for latency-sensitive applications like real-time games, financial trading, and interactive protocols where every small message should be sent immediately.
Question 07
Why did HTTP/3 switch from TCP to QUIC (over UDP)?
TCP delivers a single ordered byte stream. When HTTP/2 multiplexes many streams over one TCP connection, a single lost packet blocks all streams. QUIC implements per-stream reliability over UDP, so loss on one stream does not block others.
Question 08
What does the listen() backlog parameter control?
The backlog limits the size of the accept queue -- completed connections (ESTABLISHED) that have not yet been picked up by accept(). When this queue is full, new incoming connections may be dropped or receive a RST.
Question 09
How do SYN cookies defend against SYN flood attacks?
SYN cookies encode the essential connection parameters (MSS, timestamp, etc.) into the server's initial sequence number. The server does not allocate any state for the SYN_RCVD entry. When the client's ACK arrives, the server reconstructs the state from the sequence number.
Question 10
How does BBR differ from CUBIC in its approach to congestion control?
CUBIC (and Reno) treat packet loss as the congestion signal -- they fill buffers until packets drop. BBR instead estimates the bottleneck bandwidth (BtlBw) and minimum RTT (RTprop), then paces packets to match, aiming to keep queues empty rather than full.