Lecture 2

Flow control: prevent overloading the receiver
Congestion control: deal with problems due to congestion in the network
Both are used together to determine the final rate of data to be sent

TCP Flow Control

Regulated with variable sliding window advertised as the receive window
Advertised in packets
# of bytes the receiver is willing to accept
Rate matching between the sender and receiver
Based on the available buffers of the sender and receiver

What happens when the network layer (IP, L3) delivers data faster than the TCP socket buffer can accept?
Segments that do not fit in the buffer will be discarded

On the receiver's side

There is limited buffer space, the RcvBuffer
The available space is the Receive window (rwnd)
AdvertisedWindow = RcvBuffer - (LastByteRcvd - LastByteRead)
(i.e. just advertise your free space)
The size of the receiver buffer can be set in socket options
Default of 4096 bytes
Many OSes automatically adjust the size
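The advertised-window formula above can be sketched as follows (a minimal illustration; the function name and values are made up):

```python
def advertised_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """rwnd = RcvBuffer - (LastByteRcvd - LastByteRead): the free buffer space."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

# 4096-byte buffer, 1000 bytes received, application has read 200 -> 800 buffered
print(advertised_window(4096, 1000, 200))   # 3296
```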

Sliding Window

  1. Receiver advertises RcvBuffer
  2. Sender sets sliding window equal to RcvBuffer
  3. During transmission, Advertised Window gets updated
  4. Use the minimum value between the initial sliding window and the advertised window size

Usable Window

Usable = AdvertisedWindow - (LastByteSent - LastByteAcked)

The usable window may be 0
This means the sender must stop transmitting data
To recover, the sender periodically sends a Zero Window Probe (ZWP); the receiver responds with an ACK advertising its current (possibly still 0) window size
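The usable-window formula can be sketched the same way (illustrative helper and numbers):

```python
def usable_window(advertised: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """Usable = AdvertisedWindow - (LastByteSent - LastByteAcked)."""
    return advertised - (last_byte_sent - last_byte_acked)

print(usable_window(8000, 11000, 5000))   # 2000: sender may still send 2000 bytes
print(usable_window(6000, 11000, 5000))   # 0: window full, send zero-window probes
```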

Silly Window Syndrome

Occurs when small segments are transmitted, resulting in inefficient use of the network pipe

E.g. suppose a sender generates data at only 1-byte at a time
Instead of sending this byte by byte, accumulate it

  1. TCP sends the first piece of data immediately, even if it is only 1 byte
  2. For subsequent data, accumulate it in the buffer, but only while there are unacknowledged segments in flight
  3. The buffered data is sent when:
    1. Receiver sends ACK for unacknowledged segment
    2. Enough data has accumulated to fill the Maximum Segment Size (MSS)
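The send/accumulate rule above can be sketched as a single decision function (a Nagle-style simplification; the names and MSS value are illustrative):

```python
MSS = 1460  # assumed maximum segment size in bytes

def should_send(buffered_bytes: int, unacked_segments: int) -> bool:
    """Rule from the steps above: a full MSS is always sent;
    small data is sent only when nothing is outstanding."""
    if buffered_bytes >= MSS:
        return True
    return unacked_segments == 0

print(should_send(1, 0))      # True:  first byte goes out immediately
print(should_send(1, 3))      # False: accumulate behind unacked segments
print(should_send(1460, 3))   # True:  a full MSS is sent regardless
```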

When the buffer is big enough (not the constraint), what defines the speed then?

When the path consists of multiple links, the max rate is defined by the slowest link, the bottleneck

The Bandwidth-Delay Product: bandwidth (bits per second) × RTT (seconds)

A network with a large BDP (> 10^5 bits) is called a Long Fat Network (LFN)
i.e. a network with high RTT and high capacity

Refinement through options

Maximum Segment Size (MSS)

Maximum Segment Size: the largest amount of data in bytes that a device can handle in a single unfragmented piece
We don't want fragmentation: more overhead + more chance of packets being lost
Resulting datagram will be MSS + 40 bytes (20-byte TCP header + 20-byte IP header)

Window Scaling Option

1 Gbps link with 100 ms RTT -> BDP = 10^8 bits = 12.5 MB

Use the window scaling option, since the maximum value of the 16-bit receive window field is only 64 KB (65,535 bytes)

e.g. RWND = 65 KB and WSCALE = 4 -> effective window is 65 KB × 2^4 ≈ 1 MB
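The scaling computation is just a left shift of the 16-bit window field by the negotiated shift count, e.g.:

```python
def effective_window(rwnd_field: int, wscale: int) -> int:
    """The advertised 16-bit window is left-shifted by the negotiated wscale."""
    return rwnd_field << wscale

# Max 16-bit field (65,535) with shift 4 -> ~1 MB effective window
print(effective_window(65535, 4))   # 1048560
```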

Timestamp Option TODO:

SACK Option

Allows selectively acknowledging out-of-order segments

Congestion control

How does TCP limit send rate?
How does TCP sense congestion?
What algorithm should we use to change the sending rate?

Data flow left-to-right

TCP is, in principle, a fair mechanism
In the above example, both machines on the left get about 0.5 Gbps each

Router with Infinite buffer

The switch is modelled as a queue (an M/M/1 queue: Markovian arrivals and service, single server)
As each sender's input rate approaches half the link bandwidth (so the combined load approaches capacity), the queueing delay approaches infinity
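The blow-up in delay can be checked numerically with the M/M/1 mean-delay formula (a sketch; `mm1_delay` is an illustrative helper):

```python
def mm1_delay(arrival_rate: float, service_rate: float) -> float:
    """Mean time in an M/M/1 system: T = 1 / (mu - lambda)."""
    assert arrival_rate < service_rate, "unstable at or above capacity"
    return 1.0 / (service_rate - arrival_rate)

# With service rate mu = 1.0, delay explodes as load approaches capacity
for lam in (0.5, 0.9, 0.99):
    print(lam, mm1_delay(lam, 1.0))
```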

Router with Finite Buffer

Packets are dropped when the finite buffer is full, so there are re-transmissions

When there is congestion, there will be:
  • retransmissions, including unneeded retransmissions of packets that were only delayed
  • wasted capacity on upstream links carrying packets that are later dropped

Congestion: a problem in the network?

Congestion indicates a 'problem' in the network

Is congestion bad?

Not really, if you know how to manage it
Congestion is unavoidable since we want to maximise network/link usage

End-to-end Congestion Control

No explicit feedback from network
Inferred from end-to-end observed loss or delay
Used by TCP

Network-assisted Congestion Control

Routers provide feedback to end systems
Single bit indicating congestion (Explicit Congestion Notification)

Congestion Control Algorithm

It has 4 components:

  1. Slow start
    • Start with a low cwnd: min(4·MSS, max(2·MSS, 4380 bytes))
    • Increase exponentially (2x every RTT)
    • On loss, restart from cwnd = 1 MSS
  2. Congestion avoidance
    • Once close enough to the bottleneck, grow more smoothly (+1 MSS per RTT)
    • ssthresh is the cwnd threshold that determines the shift from slow start to congestion avoidance (initially very high; set to cwnd/2 every time congestion is hit)
  3. Fast retransmit/recovery
    • If the sender receives 3 duplicate ACKs for the same data:
    1. ssthresh = cwnd/2.
    2. retransmit the missing segment
    3. cwnd = ssthresh + 3 MSS (inflate slightly to keep the pipe full)
    4. Each time another duplicate ACK arrives, set cwnd = cwnd + 1MSS . Then, send a new data segment if allowed by the value of cwnd.
    5. When the ACK for the retransmission arrives: exit Fast Recovery
      • set cwnd = ssthresh (half of what was in flight when entering fast recovery)
      • move to congestion avoidance (linear increase).
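The slow-start / congestion-avoidance / fast-recovery rules above can be sketched as simple update functions (cwnd in bytes; names and values are illustrative):

```python
MSS = 1460  # assumed segment size in bytes

def on_ack(cwnd: float, ssthresh: float) -> float:
    """One new ACK: exponential growth below ssthresh (slow start),
    roughly +1 MSS per RTT above it (congestion avoidance)."""
    if cwnd < ssthresh:
        return cwnd + MSS              # slow start: +1 MSS per ACK
    return cwnd + MSS * MSS / cwnd     # congestion avoidance

def on_triple_dup_ack(cwnd: float):
    """Fast retransmit/recovery entry: ssthresh = cwnd/2, cwnd inflated by 3 MSS."""
    ssthresh = cwnd / 2
    return ssthresh + 3 * MSS, ssthresh   # (new cwnd, new ssthresh)

cwnd, ssthresh = 1460.0, 64000.0
for _ in range(4):                     # four ACKs while in slow start
    cwnd = on_ack(cwnd, ssthresh)
print(cwnd)                            # 7300.0
```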


Congestion Window

Beside the receive window, the sender also maintains a congestion window (cwnd)
Used to restrict data flow to less than the rwnd when congestion occurs
Works the same as the receive window

Effective window = min(rwnd, cwnd)

Congestion Detection

Two mechanisms to detect congestion: a retransmission timeout, and three duplicate ACKs

The congestion window is not static
Increases or decreases based on the arrival of ACKs
When they arrive steadily, cwnd increases
When they stop or repeat, cwnd decreases
Called self-clocking TCP / bandwidth probing

TCP New Reno

TCP Reno and fast recovery can only deal with a single lost packet
TCP New Reno, on the other hand, can deal with multiple losses
Stays in fast recovery mode until all data that was outstanding when fast recovery started is ACKed
After that, return to congestion avoidance, with cwnd same as when recovery started

TCP Fairness

When K sessions share the same bottleneck of bandwidth R, each should get an average rate of R/K

  1. Flow A has full capacity, flow B is starting
  2. Congestion happens
  3. Multiplicative decrease causes A to lose more than B
  4. Thus B has gained on A
  5. This repeats until equal
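The convergence argument above can be simulated with a toy model (synchronized rounds; both flows halve whenever the shared link is overloaded; names and numbers are illustrative):

```python
def aimd_converge(a: float, b: float, capacity: float, rounds: int = 200):
    """Two AIMD flows sharing one bottleneck: additive increase of 1 unit
    per round, multiplicative decrease by 1/2 on overload."""
    for _ in range(rounds):
        a += 1.0
        b += 1.0
        if a + b > capacity:   # congestion: both flows halve
            a /= 2.0
            b /= 2.0
    return a, b

a, b = aimd_converge(90.0, 10.0, 100.0)
print(abs(a - b) < 5.0)   # True: the initially unequal flows end up nearly equal
```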

Not Always Fair

Reno updates cwnd based on ACK, which depends on RTT
Lower RTT -> faster cwnd increases
Nothing prevents a host from opening multiple TCP sessions at the same time, grabbing a larger share
UDP is aggressive, just spray and pray, no congestion control

Congestion Control Improvements

AIMD has some problems:
Large decrease: cwnd is halved on every congestion event
Slow ramp-up afterwards

Example

High-speed long-distance network with:
Bandwidth-Delay Product = 125 MB
MSS= 1460 bytes
RTT = 100 ms

To fully utilize link:
cwnd = BDP/MSS = 125×10^6 / 1460 ≈ 85,616 MSS
Increase of 1 MSS per RTT (congestion avoidance)
Since we need to reach 85,616 MSS, it will take 85,616 RTTs
Time required is 85,616 × 0.1 s ≈ 8,562 s ≈ 2.4 hours
A single packet loss would cause the window to drop by 50%, requiring another ~42,000 RTTs (over an hour) to recover.
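The arithmetic in this example can be checked directly:

```python
# Numbers from the example above; congestion avoidance adds 1 MSS per RTT
bdp_bytes = 125e6     # 125 MB bandwidth-delay product
mss = 1460            # bytes
rtt = 0.1             # seconds

target_mss = bdp_bytes / mss       # window needed to fill the pipe
seconds = target_mss * rtt         # one RTT per MSS of growth
print(round(target_mss), round(seconds), round(seconds / 3600, 1))  # 85616 8562 2.4
```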

Scalable TCP

Instead of reduction by 50%, reduce it by 12.5% on congestion
When ramping up, instead of +1 MSS per RTT, increase by about 1% of cwnd

High-Speed TCP (HSTCP)

For small windows, behaves like Reno
For large windows, it becomes more aggressive
AI: window grows faster
MD: Decrease less aggressively

TCP Cubic

Is there something better than AIMD?
After loss, congestion rate probably hasn't changed much
At the start, ramp up fast to near the max
Then slowly approach the actual max
The congestion window is a cubic function of time since the last congestion event

Cubic is more fair since cwnd growth depends on the time since the last congestion event, rather than on ACK arrivals
ACK arrivals depend on RTT, so Cubic's growth is largely independent of RTT
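The cubic growth curve can be sketched following RFC 8312 (C = 0.4 and β = 0.7 are the RFC defaults; the window starts at β·W_max and returns to W_max at time K):

```python
def cubic_window(t: float, w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """W(t) = C*(t - K)^3 + W_max, with K the time to climb back to W_max
    after the window was cut to beta * W_max (RFC 8312)."""
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0
k = ((w_max * (1 - 0.7)) / 0.4) ** (1 / 3)   # same K as inside the function
print(round(cubic_window(0.0, w_max), 1))    # 70.0  -> starts at beta * W_max
print(round(cubic_window(k, w_max), 1))      # 100.0 -> back at W_max at t = K
```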

Network Assisted Congestion Control

Explicit Congestion Notification (ECN)

Two bits in IP header (ToS field) used to indicate congestion
Sender marks packets as ECN capable in the header (ECN Capable Transport, ECT)
0: No
1: Yes
If a router experiences congestion, it marks the packet instead of dropping it
The receiver then sets the ECE (ECN-Echo) bit on TCP ACKs to notify the sender
The sender treats the ECE bit as a congestion signal and reduces cwnd
It sets CWR = 1 (Congestion Window Reduced) to tell the receiver it has reacted

DataCenter TCP (DCTCP)

Congestion Algorithms that were built for long, fat networks do not work very well for data centers

Because:
Buffers are smaller, you end up with dropped packets faster
All switches are under common administrative control, thus adopting ECN is possible
In the wild, you cannot assume everyone supports ECN
Traffic has low-latency requirements (so we use small buffers; we don't want packets sitting around)
Traffic competes with large high-bandwidth flow
E.g. control traffic vs big data transfer
Small traffic has to wait behind the big traffic in the buffer

DCTCP's idea: extract more detailed feedback from the boolean ECN marks
Reduce cwnd based on the fraction of ECN marked packets
Keep queues short by reacting early to congestion, but don't cut back to the point we sacrifice throughput

How It Works

Switch side:
Mark packets when queue length >K
Sender side:
Maintain running average of fraction of packets marked with ECN
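The sender-side update can be sketched as follows (g = 1/16 is the default EWMA gain in RFC 8257; the helper name and numbers are illustrative):

```python
def dctcp_update(alpha: float, frac_marked: float, cwnd: float, g: float = 1 / 16):
    """DCTCP sender update per window of data:
    alpha <- (1 - g) * alpha + g * F, then cwnd <- cwnd * (1 - alpha / 2)."""
    alpha = (1 - g) * alpha + g * frac_marked
    return alpha, cwnd * (1 - alpha / 2)

# Half the window marked: the cut is far gentler than Reno's blanket halving
alpha, cwnd = dctcp_update(alpha=0.0, frac_marked=0.5, cwnd=100.0)
print(alpha, cwnd)   # 0.03125 98.4375
```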

L4S: Low Latency, Low Loss, Scalable Throughput

DCTCP is not suitable for the public internet
Aggressive and not fair
Not ideal for long RTT
Best in controlled environment, where we control ECN config

Alternative: TCP Prague
Scalable, ECN-based congestion control
Use fraction of ECN marked packets for decrease and increase of cwnd
RTT independent
When a non-ECN path is detected, fall back to Reno or Cubic
Ensures fairness, making it usable on public internet
Requires Dual Queue Coupled AQM (Active Queue Management)
L4S flows in one queue
TCP flows in another queue
Two queues are coupled, to ensure L4S flows do not starve TCP flows

BBR

All of the aforementioned are based on loss
It is also possible to use delay
Detect congestion before it happens, instead of overshooting (loss)

Loss-Based Congestion Control

If packets are lost before congestion (e.g. on lossy links), throughput stays low even in an uncongested network
If loss only occurs after congestion, buffers fill up first (bufferbloat), causing high delays

BBR Basics

Dynamically estimate max BW and min RTT
If RTT increases -> likely queue buildup
If BW increases -> path can handle more data

Steps:

  1. Startup phase:
    1. Sequentially probe max BW and min RTT
    2. Double cwnd every RTT
  2. DRAIN phase:
    1. Sending rate is reduced to get rid of the queue created during the startup phase
  3. Steady state:
    1. Vary pacing rate to keep inflight near BDP = maxBW × minRTT

Looks like a heartbeat
Peaks are BBR probing for bandwidth
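The steady-state targets can be sketched as below (a toy helper; real BBR cycles pacing_gain through a fixed schedule to produce the probing peaks):

```python
def bbr_targets(max_bw_bps: float, min_rtt_s: float, pacing_gain: float = 1.0):
    """Steady state: keep data inflight near BDP = maxBW * minRTT, and pace
    sends at pacing_gain * maxBW (gains > 1 are the periodic probe peaks)."""
    bdp_bits = max_bw_bps * min_rtt_s
    return bdp_bits, pacing_gain * max_bw_bps

bdp, rate = bbr_targets(1e9, 0.125)   # 1 Gbps path, 125 ms min RTT
print(bdp, rate)                      # 125000000.0 1000000000.0
```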

QUIC

https://dl.acm.org/doi/pdf/10.1145/3098822.3098842