Reference notes.

TCP (Transmission Control Protocol) provides reliable, ordered, error-checked delivery of data between applications. It’s the transport layer protocol behind HTTP, SSH, SMTP, and most internet traffic.

TCP vs UDP

Feature             TCP                                    UDP
Connection          Connection-oriented (handshake)        Connectionless
Reliability         Guaranteed delivery, retransmissions   Best-effort, no retransmissions
Ordering            Ordered delivery                       No ordering guarantee
Flow control        Yes (sliding window)                   No
Congestion control  Yes (CUBIC, BBR)                       No
Header size         20-60 bytes                            8 bytes
Use cases           HTTP, SSH, email, file transfer        DNS, gaming, streaming, VoIP, QUIC

UDP wins when speed matters more than reliability, or when the application handles reliability itself (e.g., QUIC implements its own reliable transport over UDP).
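The contrast is visible directly at the socket API level. A minimal Python sketch over loopback (ports OS-assigned; the byte strings are arbitrary):

```python
import socket

# TCP: connection-oriented stream socket; connect() runs the handshake.
tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_srv.bind(("127.0.0.1", 0))            # port 0: let the OS pick
tcp_srv.listen(1)

tcp_cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_cli.connect(tcp_srv.getsockname())    # three-way handshake happens here
conn, _ = tcp_srv.accept()
tcp_cli.sendall(b"hello")
tcp_msg = conn.recv(5)                    # reliable, ordered byte stream

# UDP: connectionless datagram socket; no handshake, no guarantees.
udp_srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_srv.bind(("127.0.0.1", 0))
udp_cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_cli.sendto(b"hi", udp_srv.getsockname())   # fire and forget
udp_msg, _ = udp_srv.recvfrom(64)

print(tcp_msg, udp_msg)  # b'hello' b'hi'
for s in (tcp_cli, conn, tcp_srv, udp_cli, udp_srv):
    s.close()
```

On loopback the UDP datagram arrives too, but nothing in the API promised it would; the TCP path would have retransmitted on loss.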

Connection Lifecycle

Three-Way Handshake (Connection Setup)

Client              Server
  |--- SYN ----------->|    1. Client sends SYN (seq=x)
  |<-- SYN-ACK --------|    2. Server responds SYN-ACK (seq=y, ack=x+1)
  |--- ACK ----------->|    3. Client sends ACK (ack=y+1)
  |                     |    Connection established

Both sides agree on initial sequence numbers, preventing old packets from being confused with new connections.
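The handshake is performed by the kernel inside connect(): a SYN sent to a port with no listener is answered with a RST, which Python surfaces as ConnectionRefusedError. A small sketch (the closed port is found by briefly binding and releasing an ephemeral port):

```python
import socket

# Reserve an ephemeral port, then close it so nothing listens there.
tmp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tmp.bind(("127.0.0.1", 0))
closed_port = tmp.getsockname()[1]
tmp.close()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # The SYN goes out, the kernel answers with RST (no listener), and
    # the three-way handshake never completes.
    cli.connect(("127.0.0.1", closed_port))
    result = "connected"
except ConnectionRefusedError:
    result = "refused"
finally:
    cli.close()
print(result)  # refused
```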

Four-Way Teardown (Connection Close)

Client              Server
  |--- FIN ----------->|    1. Client done sending
  |<-- ACK ------------|    2. Server acknowledges
  |<-- FIN ------------|    3. Server done sending
  |--- ACK ----------->|    4. Client acknowledges
  |                     |    Connection closed

Either side can initiate the close. The side that sends the first FIN (the active closer) enters the TIME_WAIT state after the final ACK and holds the connection's port tuple for 2× MSL (60 seconds on Linux) so that delayed packets from the old connection cannot be mistaken for a new one.
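TIME_WAIT is why a server restarted immediately after shutdown can fail to bind its old port; the conventional remedy is SO_REUSEADDR, sketched here in Python:

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow bind() to succeed even if connections on this port are still
# sitting in TIME_WAIT from a previous instance of the server.
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
reuse = srv.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
print("SO_REUSEADDR:", reuse)  # non-zero once set
srv.close()
```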

TCP Fast Open (TFO)

Eliminates one round trip on repeat connections. The server issues a cookie on the first connection; on later connections the client sends the cookie together with data in the SYN, so the server can deliver data to the application before the handshake completes. Saves one RTT per connection, often tens of milliseconds on WAN paths. Supported in Linux since kernel 3.7.
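A hedged sketch of enabling TFO server-side in Python (TCP_FASTOPEN is Linux-specific, so the helper guards for it; enable_tfo_server and the queue length of 16 are illustrative, not a standard API):

```python
import socket

def enable_tfo_server(sock, qlen=16):
    # Illustrative helper: enables TFO on a not-yet-listening socket
    # where the platform exposes TCP_FASTOPEN (Linux 3.7+); qlen caps
    # pending TFO requests. Returns False where the constant is absent.
    if not hasattr(socket, "TCP_FASTOPEN"):
        return False
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, qlen)
    return True

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
enabled = enable_tfo_server(srv)
srv.listen(1)
print("TFO enabled:", enabled)
srv.close()
```

On Linux, a client can then place data in the SYN with sock.sendto(data, socket.MSG_FASTOPEN, addr), falling back to a normal handshake when no cookie is cached. Kernel-wide behaviour is still gated by net.ipv4.tcp_fastopen.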

Flow Control

Sliding Window

TCP uses a receive window to prevent the sender from overwhelming the receiver. The receiver advertises how many bytes it can accept; the sender must not have more than that amount of unacknowledged data in flight.

Window scaling (RFC 7323) — The original 16-bit window field limits the window to 64KB, far too small for modern high-bandwidth links. Window scaling extends this up to ~1GB using a multiplier negotiated during the handshake. Essential for high-bandwidth, high-latency links (e.g., 10Gbps with 50ms RTT needs ~62MB window).
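The figures above fall straight out of the bandwidth-delay product; a quick check in Python:

```python
def required_window_bytes(bandwidth_bps, rtt_seconds):
    # Bandwidth-delay product: bytes that must be in flight to keep
    # the pipe full for one full round trip.
    return bandwidth_bps / 8 * rtt_seconds

bdp = required_window_bytes(10e9, 0.050)       # 10 Gbps, 50 ms RTT
print(f"needed window: {bdp / 1e6:.1f} MB")    # 62.5 MB

# Without scaling the window tops out at 64 KiB, which at 50 ms RTT
# caps throughput regardless of link speed:
cap_bps = 65535 * 8 / 0.050
print(f"unscaled cap: {cap_bps / 1e6:.1f} Mbit/s")  # ~10.5 Mbit/s
```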

Nagle’s Algorithm

Buffers small writes and sends them together to reduce the number of tiny packets: a small segment is held until either a full segment's worth of data accumulates or the previous packet is acknowledged. This interacts poorly with delayed ACKs (the receiver may hold its ACK for up to ~200 ms, typically ~40 ms on Linux), so write-write-read patterns can stall for a full delayed-ACK timer. Because of the latency cost in interactive applications, most modern applications disable Nagle with the TCP_NODELAY socket option.
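Disabling Nagle is a one-line socket option; a Python sketch:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle: small writes are sent immediately instead of being
# coalesced while waiting for the previous segment's ACK.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY:", nodelay)  # non-zero once set
sock.close()
```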

Congestion Control

Congestion control prevents TCP from overwhelming the network (distinct from flow control, which protects the receiver). The sender maintains a congestion window (cwnd) limiting data in flight.

Classic Phases

  1. Slow start — cwnd starts small (typically 10 segments), doubles each RTT. Grows exponentially until loss occurs or ssthresh is reached.
  2. Congestion avoidance — After reaching ssthresh, cwnd grows linearly (1 segment per RTT). Probes for capacity gradually.
  3. Fast retransmit — After 3 duplicate ACKs, retransmit the lost segment immediately without waiting for a timeout.
  4. Fast recovery — After fast retransmit, halve cwnd and continue with congestion avoidance (don’t restart from slow start).
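The first two phases can be sketched as a per-RTT simulation (units are segments; ssthresh = 64 is chosen arbitrarily, and clamping growth exactly to ssthresh is a simplification):

```python
def grow_cwnd(cwnd, ssthresh):
    # Slow start: double per RTT until ssthresh (clamping to ssthresh
    # is a simplification); then congestion avoidance: +1 segment/RTT.
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)
    return cwnd + 1

cwnd, ssthresh = 10, 64        # IW10 start; ssthresh arbitrary
history = [cwnd]
for _ in range(8):
    cwnd = grow_cwnd(cwnd, ssthresh)
    history.append(cwnd)
print(history)  # [10, 20, 40, 64, 65, 66, 67, 68, 69]
```

Exponential doubling up to ssthresh, linear probing after: the classic sawtooth ramp between loss events.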

CUBIC

The default congestion control algorithm in Linux, Windows, and macOS. Loss-based — uses packet loss as the signal for congestion.

CUBIC uses a cubic function to grow cwnd, which is:

  • Aggressive far from the last loss point — quickly probes for new capacity
  • Conservative near the last loss point — avoids re-triggering congestion
  • Concave growth approaching the previous maximum, convex growth beyond it

Works well on high-bandwidth, high-latency links. Dominant algorithm on the internet.
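The growth function from RFC 8312 can be evaluated directly (C = 0.4 and β = 0.7 are the RFC's constants; W_max = 100 segments is arbitrary):

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    # RFC 8312: W(t) = C*(t - K)^3 + W_max, where K is the time the
    # curve needs to climb back to W_max after the post-loss cut to
    # beta * W_max.
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0  # cwnd (segments) at the last loss event; arbitrary
for t in (0, 2, 4, 6, 8):
    print(f"t={t}s  cwnd={cubic_window(t, w_max):.1f}")
# Concave up to ~W_max (t < K, about 4.2 s here), convex probing beyond.
```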

BBR (Bottleneck Bandwidth and Round-trip time)

Google’s model-based congestion control. Instead of reacting to loss, BBR builds an explicit model of the network path using:

  • Estimated bottleneck bandwidth — Maximum delivery rate observed
  • Minimum RTT — Baseline round-trip time without queuing

BBR paces packets at the estimated bottleneck rate and limits in-flight data to bandwidth × RTT, targeting minimal queuing delay. Avoids “bufferbloat” — the problem where deep router buffers fill up, adding seconds of latency without dropping packets.
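The core of the model is arithmetic on those two estimates; a sketch (the 2× head-room factor matches BBRv1's steady-state cwnd_gain; the example figures are arbitrary):

```python
def bbr_limits(btlbw_bps, min_rtt_s, cwnd_gain=2.0):
    # BBR paces at the estimated bottleneck bandwidth and caps data in
    # flight near the bandwidth-delay product (BDP); the 2x head room
    # mirrors BBRv1's steady-state cwnd_gain.
    bdp_bytes = btlbw_bps / 8 * min_rtt_s
    return {
        "pacing_rate_bps": btlbw_bps,
        "bdp_bytes": bdp_bytes,
        "inflight_cap_bytes": cwnd_gain * bdp_bytes,
    }

limits = bbr_limits(btlbw_bps=100e6, min_rtt_s=0.020)  # 100 Mbit/s, 20 ms
print(limits)  # BDP = 250,000 bytes; in-flight cap = 500,000 bytes
```

Because the cap tracks the BDP rather than growing until loss, standing queues (and the latency they add) stay small.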

BBRv3 (Linux 6.x patches, IETF draft as of 2025) improves on earlier versions, with a reported ~12% reduction in retransmit rate. It addresses the primary criticism of BBRv1, fairness: v1 was overly aggressive against competing CUBIC flows. BBRv3 integrates ECN and loss signals alongside its model-based approach. Still an experimental IETF draft; fairness with CUBIC under deep-buffer conditions remains an active research area.

ECN (Explicit Congestion Notification)

Routers mark packets (set CE bit) when queues are filling, instead of dropping them. TCP endpoints can react to congestion before loss occurs. Requires support at both endpoints and intervening routers. Increasingly important for data centre networks and BBR’s evolution.

L4S (Low Latency, Low Loss, Scalable Throughput)

Emerging framework combining ECN with new congestion control algorithms for ultra-low latency. Target use cases: real-time collaboration, cloud gaming, VR/AR.

Reliability Mechanisms

Selective Acknowledgements (SACK)

Without SACK, a single lost packet forces retransmission of everything after it. SACK allows the receiver to report exactly which segments it has received, so the sender only retransmits what’s actually missing. Enabled by default in most modern operating systems.
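The bookkeeping can be illustrated with segment numbers instead of byte sequence ranges; a simplified sketch (sack_blocks is an illustrative helper, not a kernel API):

```python
def sack_blocks(received, next_expected):
    # Simplified sketch using segment numbers rather than byte ranges:
    # report contiguous runs of out-of-order data above the cumulative
    # ACK point so the sender retransmits only the gaps.
    blocks = []
    run = None
    for seg in sorted(s for s in received if s >= next_expected):
        if run is not None and seg == run[1] + 1:
            run[1] = seg
        else:
            run = [seg, seg]
            blocks.append(run)
    return [tuple(b) for b in blocks]

# Segments 1-2 arrived, 3 was lost, 4-5 arrived, 6 was lost, 7 arrived.
blocks_out = sack_blocks({1, 2, 4, 5, 7}, next_expected=3)
print(blocks_out)  # [(4, 5), (7, 7)] -> only 3 and 6 need retransmitting
```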

TCP Keepalive

Detects dead connections by sending periodic probes when no data has been exchanged. Default on Linux: probe after 2 hours, then every 75 seconds, give up after 9 probes. Configurable per-socket. Application-level keepalives (e.g., HTTP/2 PING) are generally preferred.
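Per-socket configuration in Python (the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT constants are Linux-specific, hence the guards; the values are illustrative):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Per-socket overrides of the very long kernel defaults; these three
# constants are Linux-specific, hence the guards.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before first probe
if hasattr(socket, "TCP_KEEPINTVL"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
if hasattr(socket, "TCP_KEEPCNT"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

ka = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("SO_KEEPALIVE:", ka)  # non-zero once set
sock.close()
```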

Retransmission Timeout (RTO)

Calculated from smoothed RTT measurements. If no ACK arrives before the RTO expires, the segment is retransmitted. Uses exponential backoff on repeated timeouts. Modern TCP uses timestamps (RFC 7323) for more accurate RTT measurement.
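The RFC 6298 smoothing can be reproduced in a few lines (the RTT samples are made up; the 1-second minimum the RFC imposes is noted in a comment but omitted so the dynamics stay visible):

```python
def update_rto(srtt, rttvar, rtt_sample, alpha=1/8, beta=1/4):
    # RFC 6298 smoothing: RTTVAR tracks deviation, SRTT the mean.
    # (The RFC additionally clamps RTO to a 1 s minimum.)
    if srtt is None:                       # first measurement
        srtt, rttvar = rtt_sample, rtt_sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    return srtt, rttvar, srtt + 4 * rttvar

srtt = rttvar = None
for sample in (0.100, 0.120, 0.110, 0.300):   # seconds; spike at the end
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
print(f"SRTT={srtt*1000:.1f} ms  RTO={rto*1000:.0f} ms")  # SRTT=128.0 ms  RTO=426 ms
```

Note how one RTT spike inflates RTTVAR, and hence the RTO, far more than it moves SRTT: the timer backs off quickly on jittery paths.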

Common Issues

Issue                    Cause                                          Diagnosis
High retransmissions     Congestion, packet loss, bad links             ss -ti, Wireshark retransmission filter
Small window size        Receiver or sender limited, window scaling off ss -ti (check rcv_space, snd_cwnd)
TIME_WAIT accumulation   Many short connections                         ss -s, consider connection pooling
Connection resets (RST)  Firewall, application crash, port not open     tcpdump, check firewall rules
High latency             Bufferbloat, Nagle + delayed ACK               Enable TCP_NODELAY, check buffer sizes
SYN flood                DDoS attack                                    SYN cookies (net.ipv4.tcp_syncookies)

Tuning (Linux)

Key sysctl parameters:

  • net.core.rmem_max / net.core.wmem_max — Maximum socket buffer sizes
  • net.ipv4.tcp_congestion_control — Set to bbr or cubic
  • net.ipv4.tcp_window_scaling — Enable window scaling (default: on)
  • net.ipv4.tcp_fastopen — Enable TCP Fast Open (3 = client + server)
  • net.ipv4.tcp_syncookies — SYN flood protection (default: on)
  • net.ipv4.tcp_tw_reuse — Reuse TIME_WAIT sockets for outgoing connections (safe for clients)
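These sysctls set system-wide ceilings; individual sockets request within them via setsockopt. A Python sketch (the 1 MiB request is arbitrary):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request a 1 MiB receive buffer; the kernel clamps the request to
# net.core.rmem_max, and Linux reports back double the value it set
# (the extra accounts for kernel bookkeeping overhead).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
eff = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("effective rcvbuf:", eff)
sock.close()
```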
