Lecture 2
Flow control: prevent overloading the receiver
Congestion control: deal with problems due to congestion in the network
Both are used together to determine the final rate of data to be sent
TCP Flow Control
Regulated with variable sliding window advertised as the receive window
Advertised in packets
# of bytes the receiver is willing to accept
Rate matching between the sender and receiver
Based on the available buffers of the sender and receiver
What if the network layer (IP, L3) delivers data faster than the TCP socket buffer can accept?
The excess segments are discarded
On the receiver's side
There is limited buffer space, the RcvBuffer
The available space is the Receive window (rwnd)
(i.e. just advertise your free space)
The size of the receiver buffer can be set in socket options
Default of 4096
Many OSes automatically adjust the size
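As a minimal sketch of the socket-option point above: the receive buffer can be requested via `SO_RCVBUF`, though the OS may round or double the value and may auto-tune it later.

```python
import socket

# Sketch: requesting a TCP receive buffer (RcvBuffer) size via socket options.
# The kernel may grant more than requested (Linux doubles it for bookkeeping),
# and OS auto-tuning may adjust it afterwards.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)

# Read back the size the kernel actually granted
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(actual)
s.close()
```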
Sliding Window
- Receiver advertises RcvBuffer
- Sender sets sliding window equal to RcvBuffer
- During transmission, the Advertised Window gets updated
- Use the minimum value between the initial sliding window and the advertised window size
Usable Window
!400
The usable window may be 0
The sender must then stop transmitting; if the receiver's later window update is lost, the connection would deadlock
To avoid this, the sender periodically sends a Zero Window Probe (ZWP), which elicits an ACK advertising the current (possibly still 0) available window size
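The usable-window bookkeeping can be sketched as follows (hypothetical helper names; the advertised window covers data from the last acknowledged byte, so bytes already in flight count against it):

```python
def usable_window(advertised_window, last_byte_sent, last_byte_acked):
    """Bytes the sender may still transmit without overrunning the receiver.

    Hypothetical helper: in-flight (sent but unacknowledged) bytes are
    subtracted from the window the receiver advertised.
    """
    in_flight = last_byte_sent - last_byte_acked
    return max(0, advertised_window - in_flight)

print(usable_window(4096, 3000, 1000))  # 2096 bytes still usable
print(usable_window(4096, 5096, 1000))  # 0 -> zero window, sender must wait
```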
Silly Window Syndrome
Occurs when small segments are transmitted, resulting in inefficient use of the network pipe
E.g. suppose a sender generates data at only 1-byte at a time
Instead of sending it byte by byte, accumulate it (sender-side avoidance, i.e. Nagle's algorithm):
- TCP sends the first piece of data immediately, even if only 1 byte
- For subsequent data, accumulate it as long as there are unacknowledged segments in flight
- The buffered data is sent when:
- Receiver sends ACK for unacknowledged segment
- Enough data has accumulated to fill the Maximum Segment Size (MSS)
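The send/accumulate decision above can be sketched as a small predicate (hypothetical helper, not real kernel code):

```python
def should_send_now(pending_bytes, mss, unacked_in_flight):
    """Nagle-style decision sketch: send small data immediately only when
    nothing is outstanding; otherwise accumulate until a full MSS is
    buffered or an ACK clears the pipe."""
    if pending_bytes >= mss:
        return True               # a full segment is ready
    return not unacked_in_flight  # small data only when the pipe is empty

assert should_send_now(1, 1460, unacked_in_flight=False)     # first byte goes out
assert not should_send_now(1, 1460, unacked_in_flight=True)  # accumulate
assert should_send_now(1460, 1460, unacked_in_flight=True)   # full MSS ready
```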
Link Capacity
When the buffer is big enough (not the constraint), what defines the speed then?
When the path consists of multiple links, the max rate is defined by the slowest link, the bottleneck
The Bandwidth-Delay Product: BDP = bandwidth × RTT, the amount of data that fits 'in flight' on the path
A network with a large BDP (high RTT and high capacity) is called a long fat network
- If TCP window = BDP, utilisation is 100%
- If TCP window < BDP, link is underutilised
- If TCP window > BDP, link is congested
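The BDP arithmetic is straightforward; a quick sketch with illustrative numbers (1 Gbps, 100 ms):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-Delay Product: the data that must be 'in flight'
    to keep the bottleneck link fully utilised."""
    return bandwidth_bps * rtt_s / 8  # convert bits to bytes

bdp = bdp_bytes(1e9, 0.1)  # 1 Gbps link, 100 ms RTT
print(bdp)                 # ~12.5 MB

# Window vs BDP determines utilisation: < BDP underutilised,
# = BDP fully utilised, > BDP the excess just queues up.
for window in (bdp / 2, bdp, 2 * bdp):
    utilisation = min(1.0, window / bdp)
    print(window, utilisation)
```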
Refinement through options
Maximum Segment Size (MSS)
Maximum Segment Size: the largest amount of data in bytes that a device can handle in a single unfragmented piece
We don't want fragmentation: more overhead + more chance of packets being lost
Resulting datagram will be MSS plus the IP and TCP headers (typically 40 bytes)
Window Scaling Option
1 Gbps link with 100 ms RTT -> BDP = 12.5 MB
Use the window scaling option, since without it the max receive window is 65,535 bytes (the window field is only 16 bits)
e.g. RWND = 65,535 bytes and WSCALE = 4 -> effective window is 65,535 × 2^4 ≈ 1 MB
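The scaling itself is just a left shift of the 16-bit window field by the factor negotiated in the SYN:

```python
def effective_window(rwnd_field, wscale):
    """Window scaling (RFC 7323): the 16-bit window field is shifted
    left by the scale factor negotiated in the SYN (max shift 14)."""
    assert 0 <= wscale <= 14
    return rwnd_field << wscale

print(effective_window(65535, 0))   # 65,535 B: the unscaled 64 KB cap
print(effective_window(65535, 4))   # 1,048,560 B, ~1 MB (lecture example)
print(effective_window(65535, 14))  # ~1 GB, the maximum
```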
Timestamp Option
Adds timestamps to segments, allowing accurate RTT measurement and protection against wrapped sequence numbers (PAWS)
SACK Option
Allows selectively acknowledging out-of-order segments
Congestion control
How does TCP limit send rate?
How does TCP sense congestion?
What algorithm should we use to change the sending rate?
Data flow left-to-right
TCP is in principle, a fair mechanism
In the above example, both machines on the left get an equal share of the bottleneck bandwidth
Router with Infinite buffer
The switch is modelled as an M/M/1 queue (Poisson arrivals, exponential service times, single server)
As each sender's input approaches half the output bandwidth (so the combined load approaches capacity), the queueing delay approaches infinity
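The delay blow-up follows directly from the M/M/1 mean-delay formula, T = 1/(μ − λ); a quick numeric sketch:

```python
def mm1_delay(arrival_rate, service_rate):
    """Mean time in an M/M/1 system: T = 1 / (mu - lambda).
    Grows without bound as the arrival rate approaches capacity."""
    assert arrival_rate < service_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # packets/s the switch can serve
for lam in (10, 50, 90, 99):
    print(lam, mm1_delay(lam, mu))  # delay explodes near capacity
```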
Router with Finite Buffer
There are re-transmissions
- Application layer input = application layer output
- Transport layer input also includes retransmissions
When there is congestion, there will be:
- More work (retransmissions) for given actual data (goodput)
- Unneeded retransmissions (multiple copies)
Congestion: a problem in the network?
Congestion indicates a 'problem' in the network
- Long delays due to queueing
- Lost packets and retransmission due to buffer overflows
- Unneeded retransmissions
Is congestion bad?
Not really, if you know how to manage it
Congestion is unavoidable since we want to maximise network/link usage
End-to-end Congestion Control
No explicit feedback from network
Inferred from end-to-end observed loss or delay
Used by TCP
Network-assisted Congestion Control
Routers provide feedback to end systems
Single bit indicating congestion (Explicit Congestion Notification)
Congestion Control Algorithm
It has 4 components:
- Slow start
	- Start with a low cwnd (1 MSS)
	- Increase exponentially (2x every RTT)
	- On loss (timeout), restart from cwnd = 1 MSS
- Congestion avoidance
	- Once close enough to the bottleneck, use something smoother (linear increase, +1 MSS per RTT)
	- ssthresh is the cwnd threshold that determines the shift from slow start to congestion avoidance (initially very high; set to cwnd/2 every time congestion is hit)
- Fast retransmit/recovery
- If sender receives 3 duplicate ACKs for the same data
- ssthresh = cwnd/2.
- retransmit the missing segment
- cwnd = ssthresh + 3 MSS (inflate slightly to keep the pipe full)
- Each time another duplicate ACK arrives, set cwnd = cwnd + 1 MSS. Then send a new data segment if allowed by the value of cwnd.
- When the ACK for the retransmission arrives: exit Fast Recovery
- set cwnd = ssthresh (in flight bytes when entering fast recovery)
- move to congestion avoidance (linear increase).
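The state changes above can be sketched as one update function (a simplified Reno model in MSS units, with hypothetical event names, not real stack code):

```python
def next_cwnd(cwnd, ssthresh, event, mss=1):
    """One step of the simplified Reno behaviour from the notes.
    Units are MSS. Events: 'ack_rtt' (one RTT of ACKs), 'timeout',
    'triple_dup_ack'. Returns (cwnd, ssthresh)."""
    if event == "timeout":
        return mss, max(cwnd // 2, 2 * mss)       # back to slow start
    if event == "triple_dup_ack":
        ssthresh = max(cwnd // 2, 2 * mss)
        return ssthresh, ssthresh                 # cwnd after fast recovery exits
    if cwnd < ssthresh:
        return min(2 * cwnd, ssthresh), ssthresh  # slow start: double per RTT
    return cwnd + mss, ssthresh                   # congestion avoidance: +1 MSS

cwnd, ssthresh = 1, 64
for _ in range(8):
    cwnd, ssthresh = next_cwnd(cwnd, ssthresh, "ack_rtt")
print(cwnd)  # 66: exponential up to ssthresh=64, then linear
```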
Congestion Window
Beside the receive window, the sender also maintains a congestion window (cwnd)
Used to restrict data flow to less than the rwnd when congestion occurs
Works the same as the receive window
Effective window = min(rwnd, cwnd)
Congestion Detection
Two mechanisms to detect congestion:
- Timeouts
- Duplicate ACKs (active indication of lost packet)
The congestion window is not static
Increases or decreases based on the arrival of ACKS
When they arrive steadily, cwnd increases
When they stop or repeat, cwnd decreases
Called self-clocking TCP / bandwidth probing
TCP New Reno
TCP Reno's fast recovery can only deal with a single lost packet per window
TCP New Reno, on the other hand, can deal with multiple losses
Stays in fast recovery mode until all data that was outstanding when fast recovery started is ACKed
After that, return to congestion avoidance, with cwnd same as when recovery started
TCP Fairness
When K sessions share the same bottleneck of bandwidth R, each should get an average rate of R/K
- Flow A has full capacity, flow B is starting
- Congestion happens
- Multiplicative decrease causes A to lose more than B
- Thus B has gained on A
- This repeats until equal
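The convergence argument above can be verified with a toy simulation of two AIMD flows on one bottleneck (illustrative units, not a protocol implementation): additive increase preserves the gap between the flows, while each multiplicative decrease halves it, so the rates converge.

```python
def aimd_converge(rate_a, rate_b, capacity, rounds=500):
    """Two AIMD flows sharing a bottleneck: +1 unit per round each while
    there is spare capacity, both halved whenever the link overflows."""
    for _ in range(rounds):
        if rate_a + rate_b > capacity:       # congestion: multiplicative decrease
            rate_a, rate_b = rate_a / 2, rate_b / 2
        else:                                # additive increase for both
            rate_a, rate_b = rate_a + 1, rate_b + 1
    return rate_a, rate_b

a, b = aimd_converge(100, 0, capacity=100)   # B starts from nothing
print(a, b, abs(a - b))                      # the gap shrinks toward zero
```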
Not Always Fair
Reno updates cwnd based on ACK, which depends on RTT
Lower RTT -> faster cwnd increases
Nothing prevents multiple TCP sessions at the same time
UDP is aggressive, just spray and pray, no congestion control
Congestion Control Improvements
AIMD has some problems:
- Multiplicative decrease halves cwnd on every loss, a huge drop
- Additive increase (+1 MSS per RTT) ramps up slowly
Example
High-speed long-distance network with:
Bandwidth-Delay Product = 125 MB
MSS= 1460 bytes
RTT = 100 ms
To fully utilize link:
cwnd = BDP/MSS = 125×10^6/1460 ≈ 85,616 MSS
Increase of 1 MSS per RTT
Since we need to reach 85,616 MSS, it will take ~85,616 RTTs (in congestion avoidance)
Time required is 85,616 × 0.1 s ≈ 8,562 s ≈ 2.4 hours
A single packet loss would cause the window to drop by 50%, requiring another ~42,000 RTTs (over an hour) to recover.
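The arithmetic of this example in a few lines:

```python
# Reproducing the lecture arithmetic: how long plain AIMD takes to fill
# a 125 MB BDP path (e.g. 10 Gbps x 100 ms) at +1 MSS per RTT.
BDP_BYTES = 125e6
MSS = 1460
RTT = 0.1  # seconds

target = BDP_BYTES / MSS      # cwnd needed, in MSS
print(round(target))          # ~85,616 MSS
print(target * RTT)           # ~8,562 s, i.e. ~2.4 hours of ramp-up
print(target / 2 * RTT)       # recovery time after one 50% cut: ~4,281 s
```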
Scalable TCP
Instead of reduction by 50%, reduce it by 12.5% on congestion
When ramping up, instead of +1MSS, use +1% cwnd
High-Speed TCP (HSTCP)
For small windows, behaves like Reno
For large windows, it becomes more aggressive
AI: window grows faster
MD: Decrease less aggressively
TCP Cubic
Is there something better than AIMD?
After a loss, the available bottleneck capacity probably hasn't changed much
At the start, ramp up fast to near the max
Then slowly approach the actual max
The congestion window is a cubic function of time since the last congestion event

Cubic is more fair since cwnd depends on time, instead of ACKs
ACKs depend on RTT, so Cubic's cwnd is not affected by RTT
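The cubic growth function can be written out directly (standard RFC 8312 form; the constants C and beta here are the usual defaults):

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC growth: W(t) = C*(t - K)^3 + W_max, where K is the time
    needed to grow back to W_max after cutting cwnd to beta*W_max."""
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0
print(cubic_window(0, w_max))      # 70: starts at beta*W_max right after loss
print(cubic_window(4.217, w_max))  # plateaus near W_max around t = K
```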
Network Assisted Congestion Control
Explicit Congestion Notification (ECN)
Two bits in IP header (ToS field) used to indicate congestion
Sender marks packets as ECN capable in the header (ECN Capable Transport, ECT)
0: No
1: Yes
If a router experiences congestion, it marks the packet instead of dropping it
Receiver then sets ECE (ECN Congestion Experienced) bit on TCP ACK, to notify sender
Sender treats ECE bit as a congestion signal, and reduces cwnd
Sets CWR=1 (Congestion Window Reduced) to indicate to the receiver that it has reacted
DataCenter TCP (DCTCP)
Congestion Algorithms that were built for long, fat networks do not work very well for data centers
Because:
Buffers are smaller, you end up with dropped packets faster
All switches are under common administrative control, thus adopting ECN is possible
In the wild, you cannot assume everyone supports ECN
Traffic has low-latency requirements (so we use small buffers; we don't want packets sitting around)
Traffic competes with large high-bandwidth flow
E.g. control traffic vs big data transfer
Small traffic has to wait behind the big traffic in the buffer
Extract more detailed feedback from boolean ECN marks
Reduce cwnd based on the fraction of ECN marked packets
Keep queues short by reacting early to congestion, but don't cut back to the point we sacrifice throughput
How It Works
Switch side:
Mark packets when the queue length exceeds a threshold K
Sender side:
Maintain running average of fraction of packets marked with ECN
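The sender-side update can be sketched as follows (simplified from RFC 8257; one call per observation window, with the usual gain g = 1/16):

```python
def dctcp_update(alpha, cwnd, marked, total, g=1 / 16):
    """One DCTCP observation window: keep a running average alpha of the
    fraction F of ECN-marked packets, then cut cwnd in proportion to
    alpha instead of always halving like Reno."""
    f = marked / total
    alpha = (1 - g) * alpha + g * f  # EWMA of the marked fraction
    if marked:                       # congestion observed this window
        cwnd = cwnd * (1 - alpha / 2)
    return alpha, cwnd

alpha, cwnd = dctcp_update(0.0, 100.0, marked=2, total=20)
print(alpha, cwnd)  # mild marking -> small cut, not Reno's 50%
```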
L4S: Low Latency, Low Loss, Scalable Throughput
DCTCP is not suitable for the public internet
Aggressive and not fair
Not ideal for long RTT
Best in controlled environment, where we control ECN config
Alternative, TCP Prague
Scalable, ECN-based congestion control
Use fraction of ECN marked packets for decrease and increase of cwnd
RTT independent
When a non-ECN path is detected, fall back to Reno or Cubic
Ensures fairness, making it usable on public internet
Requires Dual Queue Coupled AQM (Active Queue Management)
L4S flows in one queue
TCP flows in another queue
Two queues are coupled, to ensure L4S flows do not starve TCP flows
BBR
All of the aforementioned are based on loss
It is also possible to use delay
Detect congestion before it happens, instead of overshooting (loss)
Loss-Based Congestion Control
- With small buffers, packet loss occurs before real congestion, giving low throughput even on an uncongested network
- With large buffers, loss only occurs after long queues build up (bufferbloat), giving high delays
BBR Basics
Dynamically estimate max BW and min RTT
If RTT increases -> likely queue buildup
If BW increases -> path can handle more data
Steps:
- Startup phase:
- Sequentially probe max BW and min RTT
- Double cwnd every RTT
- DRAIN phase:
- Sending rate is reduced to get rid of the queue created during the startup phase
- Steady state:
- Vary pacing rate to keep inflight near the BDP (estimated max BW × min RTT)
Looks like a heartbeat
Peaks are BBR probing for bandwidth
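BBR's model can be summarised in a tiny sketch (illustrative sample values, not the real implementation): bandwidth is estimated as the max recent delivery rate, propagation delay as the min recent RTT, and their product is the steady-state inflight target.

```python
def bbr_targets(delivery_rate_samples, rtt_samples):
    """Sketch of BBR's path model: bottleneck bandwidth = max recent
    delivery rate, RTprop = min recent RTT; the steady-state inflight
    target is their product (the BDP)."""
    btl_bw = max(delivery_rate_samples)  # bytes/s
    rt_prop = min(rtt_samples)           # seconds
    return btl_bw, rt_prop, btl_bw * rt_prop

bw, rtt, inflight = bbr_targets([9e8, 1.2e9, 1.1e9], [0.11, 0.10, 0.13])
print(inflight)  # target inflight = 1.2e9 * 0.10 = 1.2e8 bytes
```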