Reference notes.

TCP (Transmission Control Protocol) provides reliable, ordered, error-checked delivery of data between applications. It’s the transport layer protocol behind HTTP, SSH, SMTP, and most internet traffic.

TCP vs UDP

Feature             TCP                                    UDP
Connection          Connection-oriented (handshake)        Connectionless
Reliability         Guaranteed delivery, retransmissions   Best-effort, no retransmissions
Ordering            Ordered delivery                       No ordering guarantee
Flow control        Yes (sliding window)                   No
Congestion control  Yes (CUBIC, BBR)                       No
Header size         20-60 bytes                            8 bytes
Use cases           HTTP, SSH, email, file transfer        DNS, gaming, streaming, VoIP, QUIC

UDP wins when speed matters more than reliability, or when the application handles reliability itself (e.g., QUIC implements its own reliable transport over UDP).
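The contrast is visible directly at the socket API level. A minimal Python sketch over loopback (ports OS-assigned; the byte strings are arbitrary):

```python
import socket

# TCP: connection-oriented stream socket; connect() runs the handshake.
tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_srv.bind(("127.0.0.1", 0))            # port 0: let the OS pick
tcp_srv.listen(1)

tcp_cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_cli.connect(tcp_srv.getsockname())    # three-way handshake happens here
conn, _ = tcp_srv.accept()
tcp_cli.sendall(b"hello")
tcp_msg = conn.recv(5)                    # reliable, ordered byte stream

# UDP: connectionless datagram socket; no handshake, no guarantees.
udp_srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_srv.bind(("127.0.0.1", 0))
udp_cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_cli.sendto(b"hi", udp_srv.getsockname())   # fire and forget
udp_msg, _ = udp_srv.recvfrom(64)

print(tcp_msg, udp_msg)  # b'hello' b'hi'
for s in (tcp_cli, conn, tcp_srv, udp_cli, udp_srv):
    s.close()
```

On loopback the UDP datagram arrives too, but nothing in the API promised it would; the TCP path would have retransmitted on loss.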

Connection Lifecycle

Three-Way Handshake (Connection Setup)

Client              Server
  |--- SYN ----------->|    1. Client sends SYN (seq=x)
  |<-- SYN-ACK --------|    2. Server responds SYN-ACK (seq=y, ack=x+1)
  |--- ACK ----------->|    3. Client sends ACK (ack=y+1)
  |                     |    Connection established

Both sides agree on initial sequence numbers, preventing old packets from being confused with new connections.
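The handshake is performed by the kernel inside connect(): a SYN sent to a port with no listener is answered with a RST, which Python surfaces as ConnectionRefusedError. A small sketch (the closed port is found by briefly binding and releasing an ephemeral port):

```python
import socket

# Reserve an ephemeral port, then close it so nothing listens there.
tmp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tmp.bind(("127.0.0.1", 0))
closed_port = tmp.getsockname()[1]
tmp.close()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # The SYN goes out, the kernel answers with RST (no listener), and
    # the three-way handshake never completes.
    cli.connect(("127.0.0.1", closed_port))
    result = "connected"
except ConnectionRefusedError:
    result = "refused"
finally:
    cli.close()
print(result)  # refused
```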

Four-Way Teardown (Connection Close)

Client              Server
  |--- FIN ----------->|    1. Client done sending
  |<-- ACK ------------|    2. Server acknowledges
  |<-- FIN ------------|    3. Server done sending
  |--- ACK ----------->|    4. Client acknowledges
  |                     |    Connection closed

Either side can initiate the close. The side that sends the first FIN (the active closer) enters the TIME_WAIT state after the final ACK and holds the connection's port tuple for 2× MSL (60 seconds on Linux) so that delayed packets from the old connection cannot be mistaken for a new one.
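TIME_WAIT is why a server restarted immediately after shutdown can fail to bind its old port; the conventional remedy is SO_REUSEADDR, sketched here in Python:

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow bind() to succeed even if connections on this port are still
# sitting in TIME_WAIT from a previous instance of the server.
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
reuse = srv.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
print("SO_REUSEADDR:", reuse)  # non-zero once set
srv.close()
```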

TCP Fast Open (TFO)

Eliminates one round trip on repeat connections. The server issues a cookie on the first connection; on later connections the client sends the cookie together with data in the SYN, so the server can deliver data to the application before the handshake completes. Saves one RTT per connection, often tens of milliseconds on WAN paths. Supported in Linux since kernel 3.7.
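A hedged sketch of enabling TFO server-side in Python (TCP_FASTOPEN is Linux-specific, so the helper guards for it; enable_tfo_server and the queue length of 16 are illustrative, not a standard API):

```python
import socket

def enable_tfo_server(sock, qlen=16):
    # Illustrative helper: enables TFO on a not-yet-listening socket
    # where the platform exposes TCP_FASTOPEN (Linux 3.7+); qlen caps
    # pending TFO requests. Returns False where the constant is absent.
    if not hasattr(socket, "TCP_FASTOPEN"):
        return False
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, qlen)
    return True

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
enabled = enable_tfo_server(srv)
srv.listen(1)
print("TFO enabled:", enabled)
srv.close()
```

On Linux, a client can then place data in the SYN with sock.sendto(data, socket.MSG_FASTOPEN, addr), falling back to a normal handshake when no cookie is cached. Kernel-wide behaviour is still gated by net.ipv4.tcp_fastopen.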

Flow Control

Sliding Window

TCP uses a receive window to prevent the sender from overwhelming the receiver. The receiver advertises how many bytes it can accept; the sender must not have more than that amount of unacknowledged data in flight.

Window scaling (RFC 7323) — The original 16-bit window field limits the window to 64KB, far too small for modern high-bandwidth links. Window scaling extends this up to ~1GB using a multiplier negotiated during the handshake. Essential for high-bandwidth, high-latency links (e.g., 10Gbps with 50ms RTT needs ~62MB window).
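The figures above fall straight out of the bandwidth-delay product; a quick check in Python:

```python
def required_window_bytes(bandwidth_bps, rtt_seconds):
    # Bandwidth-delay product: bytes that must be in flight to keep
    # the pipe full for one full round trip.
    return bandwidth_bps / 8 * rtt_seconds

bdp = required_window_bytes(10e9, 0.050)       # 10 Gbps, 50 ms RTT
print(f"needed window: {bdp / 1e6:.1f} MB")    # 62.5 MB

# Without scaling the window tops out at 64 KiB, which at 50 ms RTT
# caps throughput regardless of link speed:
cap_bps = 65535 * 8 / 0.050
print(f"unscaled cap: {cap_bps / 1e6:.1f} Mbit/s")  # ~10.5 Mbit/s
```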

Nagle’s Algorithm

Buffers small writes and sends them together to reduce the number of tiny packets: a small segment is held until either a full segment's worth of data accumulates or the previous packet is acknowledged. This interacts poorly with delayed ACKs (the receiver may hold its ACK for up to ~200 ms, typically ~40 ms on Linux), so write-write-read patterns can stall for a full delayed-ACK timer. Because of the latency cost in interactive applications, most modern applications disable Nagle with the TCP_NODELAY socket option.
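Disabling Nagle is a one-line socket option; a Python sketch:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle: small writes are sent immediately instead of being
# coalesced while waiting for the previous segment's ACK.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY:", nodelay)  # non-zero once set
sock.close()
```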

Congestion Control

Congestion control prevents TCP from overwhelming the network (distinct from flow control, which protects the receiver). The sender maintains a congestion window (cwnd) limiting data in flight.

Classic Phases

  1. Slow start — cwnd starts small (typically 10 segments), doubles each RTT. Grows exponentially until loss occurs or ssthresh is reached.
  2. Congestion avoidance — After reaching ssthresh, cwnd grows linearly (1 segment per RTT). Probes for capacity gradually.
  3. Fast retransmit — After 3 duplicate ACKs, retransmit the lost segment immediately without waiting for a timeout.
  4. Fast recovery — After fast retransmit, halve cwnd and continue with congestion avoidance (don’t restart from slow start).
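The first two phases can be sketched as a per-RTT simulation (units are segments; ssthresh = 64 is chosen arbitrarily, and clamping growth exactly to ssthresh is a simplification):

```python
def grow_cwnd(cwnd, ssthresh):
    # Slow start: double per RTT until ssthresh (clamping to ssthresh
    # is a simplification); then congestion avoidance: +1 segment/RTT.
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)
    return cwnd + 1

cwnd, ssthresh = 10, 64        # IW10 start; ssthresh arbitrary
history = [cwnd]
for _ in range(8):
    cwnd = grow_cwnd(cwnd, ssthresh)
    history.append(cwnd)
print(history)  # [10, 20, 40, 64, 65, 66, 67, 68, 69]
```

Exponential doubling up to ssthresh, linear probing after: the classic sawtooth ramp between loss events.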

CUBIC

The default congestion control algorithm in Linux, Windows, and macOS. Loss-based — uses packet loss as the signal for congestion.

CUBIC uses a cubic function to grow cwnd, which is:

  • Aggressive far from the last loss point — quickly probes for new capacity
  • Conservative near the last loss point — avoids re-triggering congestion
  • Concave growth approaching the previous maximum, convex growth beyond it

Works well on high-bandwidth, high-latency links. Dominant algorithm on the internet.
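The growth function from RFC 8312 can be evaluated directly (C = 0.4 and β = 0.7 are the RFC's constants; W_max = 100 segments is arbitrary):

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    # RFC 8312: W(t) = C*(t - K)^3 + W_max, where K is the time the
    # curve needs to climb back to W_max after the post-loss cut to
    # beta * W_max.
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0  # cwnd (segments) at the last loss event; arbitrary
for t in (0, 2, 4, 6, 8):
    print(f"t={t}s  cwnd={cubic_window(t, w_max):.1f}")
# Concave up to ~W_max (t < K, about 4.2 s here), convex probing beyond.
```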

BBR (Bottleneck Bandwidth and Round-trip time)

Google’s model-based congestion control. Instead of reacting to loss, BBR builds an explicit model of the network path using:

  • Estimated bottleneck bandwidth — Maximum delivery rate observed
  • Minimum RTT — Baseline round-trip time without queuing

BBR paces packets at the estimated bottleneck rate and limits in-flight data to bandwidth × RTT, targeting minimal queuing delay. Avoids “bufferbloat” — the problem where deep router buffers fill up, adding seconds of latency without dropping packets.
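The core of the model is arithmetic on those two estimates; a sketch (the 2× head-room factor matches BBRv1's steady-state cwnd_gain; the example figures are arbitrary):

```python
def bbr_limits(btlbw_bps, min_rtt_s, cwnd_gain=2.0):
    # BBR paces at the estimated bottleneck bandwidth and caps data in
    # flight near the bandwidth-delay product (BDP); the 2x head room
    # mirrors BBRv1's steady-state cwnd_gain.
    bdp_bytes = btlbw_bps / 8 * min_rtt_s
    return {
        "pacing_rate_bps": btlbw_bps,
        "bdp_bytes": bdp_bytes,
        "inflight_cap_bytes": cwnd_gain * bdp_bytes,
    }

limits = bbr_limits(btlbw_bps=100e6, min_rtt_s=0.020)  # 100 Mbit/s, 20 ms
print(limits)  # BDP = 250,000 bytes; in-flight cap = 500,000 bytes
```

Because the cap tracks the BDP rather than growing until loss, standing queues (and the latency they add) stay small.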

BBRv3 (Linux 6.x patches, IETF draft as of 2025) improves on earlier versions, with a reported ~12% reduction in retransmit rate. It addresses the primary criticism of BBRv1, fairness: v1 was overly aggressive against competing CUBIC flows. BBRv3 integrates ECN and loss signals alongside its model-based approach. Still an experimental IETF draft; fairness with CUBIC under deep-buffer conditions remains an active research area.

ECN (Explicit Congestion Notification)

Routers mark packets (set CE bit) when queues are filling, instead of dropping them. TCP endpoints can react to congestion before loss occurs. Requires support at both endpoints and intervening routers. Increasingly important for data centre networks and BBR’s evolution.

L4S (Low Latency, Low Loss, Scalable Throughput)

Emerging framework combining ECN with new congestion control algorithms for ultra-low latency. Target use cases: real-time collaboration, cloud gaming, VR/AR.

Reliability Mechanisms

Selective Acknowledgements (SACK)

Without SACK, a single lost packet forces retransmission of everything after it. SACK allows the receiver to report exactly which segments it has received, so the sender only retransmits what’s actually missing. Enabled by default in most modern operating systems.
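The bookkeeping can be illustrated with segment numbers instead of byte sequence ranges; a simplified sketch (sack_blocks is an illustrative helper, not a kernel API):

```python
def sack_blocks(received, next_expected):
    # Simplified sketch using segment numbers rather than byte ranges:
    # report contiguous runs of out-of-order data above the cumulative
    # ACK point so the sender retransmits only the gaps.
    blocks = []
    run = None
    for seg in sorted(s for s in received if s >= next_expected):
        if run is not None and seg == run[1] + 1:
            run[1] = seg
        else:
            run = [seg, seg]
            blocks.append(run)
    return [tuple(b) for b in blocks]

# Segments 1-2 arrived, 3 was lost, 4-5 arrived, 6 was lost, 7 arrived.
blocks_out = sack_blocks({1, 2, 4, 5, 7}, next_expected=3)
print(blocks_out)  # [(4, 5), (7, 7)] -> only 3 and 6 need retransmitting
```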

TCP Keepalive

Detects dead connections by sending periodic probes when no data has been exchanged. Default on Linux: probe after 2 hours, then every 75 seconds, give up after 9 probes. Configurable per-socket. Application-level keepalives (e.g., HTTP/2 PING) are generally preferred.
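Per-socket configuration in Python (the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT constants are Linux-specific, hence the guards; the values are illustrative):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Per-socket overrides of the very long kernel defaults; these three
# constants are Linux-specific, hence the guards.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before first probe
if hasattr(socket, "TCP_KEEPINTVL"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
if hasattr(socket, "TCP_KEEPCNT"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

ka = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("SO_KEEPALIVE:", ka)  # non-zero once set
sock.close()
```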

Retransmission Timeout (RTO)

Calculated from smoothed RTT measurements. If no ACK arrives before the RTO expires, the segment is retransmitted. Uses exponential backoff on repeated timeouts. Modern TCP uses timestamps (RFC 7323) for more accurate RTT measurement.
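The RFC 6298 smoothing can be reproduced in a few lines (the RTT samples are made up; the 1-second minimum the RFC imposes is noted in a comment but omitted so the dynamics stay visible):

```python
def update_rto(srtt, rttvar, rtt_sample, alpha=1/8, beta=1/4):
    # RFC 6298 smoothing: RTTVAR tracks deviation, SRTT the mean.
    # (The RFC additionally clamps RTO to a 1 s minimum.)
    if srtt is None:                       # first measurement
        srtt, rttvar = rtt_sample, rtt_sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    return srtt, rttvar, srtt + 4 * rttvar

srtt = rttvar = None
for sample in (0.100, 0.120, 0.110, 0.300):   # seconds; spike at the end
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
print(f"SRTT={srtt*1000:.1f} ms  RTO={rto*1000:.0f} ms")  # SRTT=128.0 ms  RTO=426 ms
```

Note how one RTT spike inflates RTTVAR, and hence the RTO, far more than it moves SRTT: the timer backs off quickly on jittery paths.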

Common Issues

Issue                    Cause                                          Diagnosis
High retransmissions     Congestion, packet loss, bad links             ss -ti, Wireshark retransmission filter
Small window size        Receiver or sender limited, window scaling off ss -ti (check rcv_space, snd_cwnd)
TIME_WAIT accumulation   Many short connections                         ss -s, consider connection pooling
Connection resets (RST)  Firewall, application crash, port not open     tcpdump, check firewall rules
High latency             Bufferbloat, Nagle + delayed ACK               Enable TCP_NODELAY, check buffer sizes
SYN flood                DDoS attack                                    SYN cookies (net.ipv4.tcp_syncookies)

Tuning (Linux)

Key sysctl parameters:

  • net.core.rmem_max / net.core.wmem_max — Maximum socket buffer sizes
  • net.ipv4.tcp_congestion_control — Set to bbr or cubic
  • net.ipv4.tcp_window_scaling — Enable window scaling (default: on)
  • net.ipv4.tcp_fastopen — Enable TCP Fast Open (3 = client + server)
  • net.ipv4.tcp_syncookies — SYN flood protection (default: on)
  • net.ipv4.tcp_tw_reuse — Reuse TIME_WAIT sockets for outgoing connections (safe for clients)
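These sysctls set system-wide ceilings; individual sockets request within them via setsockopt. A Python sketch (the 1 MiB request is arbitrary):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request a 1 MiB receive buffer; the kernel clamps the request to
# net.core.rmem_max, and Linux reports back double the value it set
# (the extra accounts for kernel bookkeeping overhead).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
eff = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("effective rcvbuf:", eff)
sock.close()
```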
