Reference notes.
TCP (Transmission Control Protocol) provides reliable, ordered, error-checked delivery of data between applications. It’s the transport layer protocol behind HTTP, SSH, SMTP, and most internet traffic.
TCP vs UDP
| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented (handshake) | Connectionless |
| Reliability | Guaranteed delivery, retransmissions | Best-effort, no retransmissions |
| Ordering | Ordered delivery | No ordering guarantee |
| Flow control | Yes (sliding window) | No |
| Congestion control | Yes (CUBIC, BBR) | No |
| Header size | 20-60 bytes | 8 bytes |
| Use cases | HTTP, SSH, email, file transfer | DNS, gaming, streaming, VoIP, QUIC |
UDP wins when speed matters more than reliability, or when the application handles reliability itself (e.g., QUIC implements its own reliable transport over UDP).
Connection Lifecycle
Three-Way Handshake (Connection Setup)
```
Client                        Server
  |--- SYN ----------->|      1. Client sends SYN (seq=x)
  |<-- SYN-ACK --------|      2. Server responds SYN-ACK (seq=y, ack=x+1)
  |--- ACK ----------->|      3. Client sends ACK (ack=y+1)
  |                    |      Connection established
```
Both sides agree on initial sequence numbers, preventing old packets from being confused with new connections.
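The handshake lives entirely in the kernel, which a minimal loopback sketch in Python (standard library only) makes visible: `connect()` returns once SYN → SYN-ACK → ACK completes, and `accept()` merely dequeues the already-established connection.

```python
import socket

# connect() blocks until the kernel finishes the three-way handshake;
# accept() then hands us a connection that is already established.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))         # port 0: let the kernel pick a free port
server.listen()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())  # three-way handshake happens here

conn, addr = server.accept()          # handshake already done in-kernel
ok = addr == client.getsockname()
print(ok)  # True

for s in (conn, client, server):
    s.close()
```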
Four-Way Teardown (Connection Close)
```
Client                        Server
  |--- FIN ----------->|      1. Client done sending
  |<-- ACK ------------|      2. Server acknowledges
  |<-- FIN ------------|      3. Server done sending
  |--- ACK ----------->|      4. Client acknowledges
  |                    |      Connection closed
```
Either side can initiate the close. The side that sends the first FIN enters TIME_WAIT after sending the final ACK (2× MSL; 60 seconds on Linux) so that delayed packets from the old connection can drain before the port pair is reused. In practice, steps 2 and 3 are often combined into a single FIN-ACK segment.
TCP Fast Open (TFO)
Eliminates one round trip on repeat connections. The server issues a cookie on the first connection; the client includes it, along with data, in the SYN of subsequent connections, so data transfer can begin before the handshake completes. This saves one RTT, often tens of milliseconds. Supported in Linux 3.7+.
Flow Control
Sliding Window
TCP uses a receive window to prevent the sender from overwhelming the receiver. The receiver advertises how many bytes it can accept; the sender must not have more than that amount of unacknowledged data in flight.
Window scaling (RFC 7323) — The original 16-bit window field limits the window to 64KB, far too small for modern high-bandwidth links. Window scaling extends this up to ~1GB using a multiplier negotiated during the handshake. Essential for high-bandwidth, high-latency links (e.g., 10Gbps with 50ms RTT needs ~62MB window).
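The numbers above follow from the bandwidth-delay product (bytes in flight needed to keep the pipe full); a quick check:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bandwidth_bps / 8 * rtt_s

# 10 Gbit/s at 50 ms RTT needs a ~62.5 MB window:
print(bdp_bytes(10e9, 0.050))   # 62500000.0 bytes

# Maximum scaled window: the 16-bit field shifted by up to 14 bits:
print(65535 << 14)              # 1073725440 bytes, ~1 GiB
```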
Nagle’s Algorithm
Buffers small writes and coalesces them to reduce the number of tiny packets: a new small segment is held back until either enough data accumulates to fill a segment or the previous in-flight packet is acknowledged. This adds latency in interactive applications, so it can be disabled with the TCP_NODELAY socket option. It interacts especially badly with delayed ACKs (the receiver may hold an ACK for ~40ms on Linux, up to ~200ms elsewhere, hoping to piggyback it on data), producing the classic Nagle/delayed-ACK stall. Most modern applications set TCP_NODELAY.
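Disabling Nagle is a one-line socket option in most languages; in Python:

```python
import socket

# Nagle's algorithm is on by default; latency-sensitive protocols turn it off.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(bool(nodelay))  # True
s.close()
```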
Congestion Control
Congestion control prevents TCP from overwhelming the network (distinct from flow control, which protects the receiver). The sender maintains a congestion window (cwnd) limiting data in flight.
Classic Phases
- Slow start — cwnd starts small (typically 10 segments) and roughly doubles each RTT (one extra segment per ACK received). Grows exponentially until loss occurs or ssthresh is reached.
- Congestion avoidance — After reaching ssthresh, cwnd grows linearly (1 segment per RTT). Probes for capacity gradually.
- Fast retransmit — After 3 duplicate ACKs, retransmit the lost segment immediately without waiting for a timeout.
- Fast recovery — After fast retransmit, halve cwnd and continue with congestion avoidance (don’t restart from slow start).
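The phases above can be sketched as a toy per-RTT simulation (segment counts only; real stacks track bytes and many more state transitions):

```python
def grow_cwnd(cwnd, ssthresh):
    """One RTT of growth: slow start below ssthresh, linear above it."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)   # exponential growth, capped at ssthresh
    return cwnd + 1                       # congestion avoidance: +1 segment/RTT

def on_triple_dupack(cwnd):
    """Fast recovery: halve cwnd; new ssthresh = new cwnd."""
    ssthresh = max(cwnd // 2, 2)
    return ssthresh, ssthresh

cwnd, ssthresh = 10, 64
history = [cwnd]
for _ in range(5):
    cwnd = grow_cwnd(cwnd, ssthresh)
    history.append(cwnd)
print(history)               # [10, 20, 40, 64, 65, 66]
print(on_triple_dupack(66))  # (33, 33)
```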
CUBIC
The default congestion control algorithm in Linux, Windows, and macOS. Loss-based — uses packet loss as the signal for congestion.
CUBIC uses a cubic function to grow cwnd, which is:
- Aggressive far from the last loss point — quickly probes for new capacity
- Conservative near the last loss point — avoids re-triggering congestion
- Concave growth approaching the previous maximum, convex growth beyond it
Works well on high-bandwidth, high-latency links. Dominant algorithm on the internet.
BBR (Bottleneck Bandwidth and Round-trip time)
Google’s model-based congestion control. Instead of reacting to loss, BBR builds an explicit model of the network path using:
- Estimated bottleneck bandwidth — Maximum delivery rate observed
- Minimum RTT — Baseline round-trip time without queuing
BBR paces packets at the estimated bottleneck rate and limits in-flight data to bandwidth × RTT, targeting minimal queuing delay. Avoids “bufferbloat” — the problem where deep router buffers fill up, adding seconds of latency without dropping packets.
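The arithmetic behind those two targets can be sketched as follows; the gain constants here are BBRv1's steady-state values, while the real algorithm cycles its pacing gain and runs a multi-phase state machine:

```python
def bbr_targets(btl_bw_bps, min_rtt_s, pacing_gain=1.0, cwnd_gain=2.0):
    """Toy BBR steady-state targets: pacing rate and in-flight cap.
    Gains are assumed steady-state constants, not the full state machine."""
    bdp = btl_bw_bps / 8 * min_rtt_s          # bytes that exactly fill the pipe
    return btl_bw_bps * pacing_gain, cwnd_gain * bdp

# 100 Mbit/s bottleneck, 20 ms min RTT -> BDP of 250 KB, inflight cap 2x BDP:
rate_bps, inflight_bytes = bbr_targets(100e6, 0.020)
print(inflight_bytes)  # 500000.0
```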
BBRv3 (Linux 6.x, IETF draft as of 2025) improves on earlier versions with a 12% reduction in retransmit rate. Addresses the primary criticism: fairness — BBRv1 was overly aggressive against CUBIC flows. BBRv3 integrates ECN and loss signals alongside its model-based approach. Still an experimental IETF draft; fairness with CUBIC under deep-buffer conditions remains an active research area.
ECN (Explicit Congestion Notification)
Routers mark packets (set CE bit) when queues are filling, instead of dropping them. TCP endpoints can react to congestion before loss occurs. Requires support at both endpoints and intervening routers. Increasingly important for data centre networks and BBR’s evolution.
L4S (Low Latency, Low Loss, Scalable Throughput)
Emerging framework combining ECN with new congestion control algorithms for ultra-low latency. Target use cases: real-time collaboration, cloud gaming, VR/AR.
Reliability Mechanisms
Selective Acknowledgements (SACK)
Without SACK, a single lost packet forces retransmission of everything after it. SACK allows the receiver to report exactly which segments it has received, so the sender only retransmits what’s actually missing. Enabled by default in most modern operating systems.
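The sender-side logic is essentially gap detection over the receiver's reported byte ranges; a toy model (real SACK blocks ride in TCP options and interact with retransmission state):

```python
def missing_segments(highest_sent, sacked):
    """Given SACK blocks as (start, end) byte ranges the receiver reported,
    return the gaps the sender must retransmit. Toy model."""
    gaps, expected = [], 0
    for start, end in sorted(sacked):
        if start > expected:
            gaps.append((expected, start))  # a hole before this SACK block
        expected = max(expected, end)
    if expected < highest_sent:
        gaps.append((expected, highest_sent))
    return gaps

# Bytes 0-1000 sent; receiver SACKed 0-200 and 400-700:
print(missing_segments(1000, [(0, 200), (400, 700)]))  # [(200, 400), (700, 1000)]
```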
TCP Keepalive
Detects dead connections by sending periodic probes when no data has been exchanged. Default on Linux: probe after 2 hours, then every 75 seconds, give up after 9 probes. Configurable per-socket. Application-level keepalives (e.g., HTTP/2 PING) are generally preferred.
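Per-socket configuration looks like this in Python; the timing values below are illustrative (tightened from the 2-hour default), and the TCP_KEEP* option names are Linux-specific, hence the guard:

```python
import socket

# Enable keepalive and tighten the probe schedule (example values).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-specific option names
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)  # idle secs before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)  # secs between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)     # failed probes before reset
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print(bool(keepalive_on))  # True
s.close()
```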
Retransmission Timeout (RTO)
Calculated from smoothed RTT measurements. If no ACK arrives before the RTO expires, the segment is retransmitted. Uses exponential backoff on repeated timeouts. Modern TCP uses timestamps (RFC 7323) for more accurate RTT measurement.
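The RFC 6298 estimator keeps exponentially weighted averages of the RTT and its variance, with RTO = SRTT + 4×RTTVAR clamped to a floor (1 second per the RFC; the 200 ms floor below is Linux's, assumed here). A sketch:

```python
def update_rto(srtt, rttvar, sample, alpha=1/8, beta=1/4, min_rto=0.2):
    """RFC 6298 update for one RTT sample (all times in seconds)."""
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
    srtt = (1 - alpha) * srtt + alpha * sample
    return srtt, rttvar, max(srtt + 4 * rttvar, min_rto)

srtt, rttvar = 0.100, 0.050            # warmed-up state after earlier samples
for sample in (0.110, 0.090, 0.300):   # a latency spike inflates the RTO
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
print(round(rto, 3))  # 0.423 -- the spike pushed RTO well above SRTT
```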
Common Issues
| Issue | Cause | Diagnosis |
|---|---|---|
| High retransmissions | Congestion, packet loss, bad links | ss -ti, Wireshark retransmission filter |
| Small window size | Receiver or sender limited, window scaling off | ss -ti (check rcv_space, snd_cwnd) |
| TIME_WAIT accumulation | Many short connections | ss -s, consider connection pooling |
| Connection resets (RST) | Firewall, application crash, port not open | tcpdump, check firewall rules |
| High latency | Bufferbloat, Nagle + delayed ACK | Enable TCP_NODELAY, check buffer sizes |
| SYN flood | DDoS attack | SYN cookies (net.ipv4.tcp_syncookies) |
Tuning (Linux)
Key sysctl parameters:
- net.core.rmem_max / net.core.wmem_max — Maximum socket buffer sizes
- net.ipv4.tcp_congestion_control — Set to bbr or cubic
- net.ipv4.tcp_window_scaling — Enable window scaling (default: on)
- net.ipv4.tcp_fastopen — Enable TCP Fast Open (3 = client + server)
- net.ipv4.tcp_syncookies — SYN flood protection (default: on)
- net.ipv4.tcp_tw_reuse — Reuse TIME_WAIT sockets (safe for clients)
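Current values can be inspected without root by reading /proc/sys directly; a small helper (Linux-only by nature, so it returns None elsewhere):

```python
from pathlib import Path

def sysctl(name):
    """Read a sysctl value via /proc/sys (Linux); returns None elsewhere."""
    path = Path("/proc/sys") / name.replace(".", "/")
    return path.read_text().strip() if path.exists() else None

# e.g. 'cubic' or 'bbr' on Linux, None on other platforms
print(sysctl("net.ipv4.tcp_congestion_control"))
```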
See Also
- OSI Model — Where TCP fits (Layer 4)
- Network Troubleshooting — Diagnostic tools for TCP issues
References
- RFC 9293 - TCP — The updated TCP specification (2022, replaces RFC 793)
- RFC 5681 - TCP Congestion Control
- BBR Congestion Control (IETF draft)
- Cloudflare - A Brief History of TCP
- Julia Evans - Networking Zines