BBR Congestion-Based Congestion Control

The internet is not moving as well as it should/could
Stems from the design choice to interpret TCP packet loss as "congestion"
True at the time due to technical limitations
As networks evolved (much faster links, much deeper buffers), this equivalence became less true

Even today's best loss-based CC, CUBIC, is still problematic
With large buffers, loss-based CC keeps them full, causing bufferbloat
With small buffers, CC misinterprets loss as a signal of congestion -> low throughput
To solve the above, we need something else than loss-based

Congestion and Bottlenecks

Any TCP connection has exactly one slowest link (i.e. bottleneck) in each direction
It determines the max data-delivery rate
Queues form there (upstream links deliver data faster than the bottleneck can drain it, so the excess accumulates at the bottleneck)

From TCP's viewpoint, any path with any amount of links behaves the same as a single link with identical RTT and bottleneck rate
Two physical constraints
RTprop: round-trip propagation time (akin to the length of a pipe, baseline delay)
BtlBw: bottleneck bandwidth (akin to the minimum diameter of a pipe)
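These two constraints multiply into the pipe's capacity. A quick worked example (the link numbers here are made up for illustration, not from the notes):

```python
# Hypothetical link parameters, chosen only for illustration:
btlbw_bps = 100e6           # BtlBw: 100 Mbit/s bottleneck bandwidth
rtprop_s = 0.040            # RTprop: 40 ms round-trip propagation time

# Bandwidth-delay product: how much data the "pipe" holds when exactly full.
bdp_bytes = (btlbw_bps / 8) * rtprop_s
print(f"BDP = {bdp_bytes / 1e3:.0f} kB")  # BDP = 500 kB
```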

![[Pasted image 20260313082810.png|400]]
Region 1:
When not enough data to fill the pipe, behaviour is determined by RTprop
Transition at inflight=BtlBw×RTprop, a.k.a. the pipe's Bandwidth-delay product (BDP)
Region 2:
When there is enough data, BtlBw dominates
Pipe is full at this point -> extra inflight creates a queue at the bottleneck
Creates the dependence of RTT on inflight data shown in top center
Region 3:
Packets start getting dropped once the excess exceeds the buffer capacity
Congestion is sustained operation in this region
Congestion control is a scheme to bound how far to the right a connection operates (on average)
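The three regions in the figure can be captured in a toy piecewise model (my own sketch, not from the paper): below the BDP, RTT stays at RTprop and delivery rate grows with inflight; above it, delivery rate saturates at BtlBw and the excess inflight just sits in a queue, inflating RTT.

```python
def rtt(inflight, btlbw, rtprop):
    """Round-trip time as a function of data in flight (toy model)."""
    bdp = btlbw * rtprop
    if inflight <= bdp:            # Region 1: pipe not full, baseline delay
        return rtprop
    return inflight / btlbw        # Region 2: queueing delay grows linearly

def delivery_rate(inflight, btlbw, rtprop):
    """Delivery rate as a function of data in flight (toy model)."""
    bdp = btlbw * rtprop
    if inflight <= bdp:            # Region 1: rate limited by inflight/RTprop
        return inflight / rtprop
    return btlbw                   # Region 2: bottleneck saturated

# Made-up units: 10 pkts/s bottleneck, 2 s RTprop -> BDP = 20 pkts.
btlbw, rtprop = 10.0, 2.0
print(rtt(10, btlbw, rtprop), delivery_rate(10, btlbw, rtprop))  # 2.0 5.0
print(rtt(40, btlbw, rtprop), delivery_rate(40, btlbw, rtprop))  # 4.0 10.0
```

Region 3 would kick in once `inflight - bdp` exceeds the bottleneck buffer, at which point packets are dropped rather than queued.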

Loss-based congestion control lives at the edge of the middle (bandwidth-limited) region
Deliver full BtlBw at the cost of high RTT, frequent packet loss
Back when memory was expensive (we've come full circle), buffer sizes were only slightly larger than BDP
Minimises the excess delay of loss-based CC, as it starts dropping packets sooner, not letting RTT grow too much
As memory grew orders of magnitude cheaper, buffers became many times larger than the BDP
Loss-based CC now fills these large buffers before seeing any loss, so Region 2 stretches out -> much higher RTT before experiencing loss (bufferbloat)

The boundary between R1 and R2 would be a better operating point
Maximise bandwidth, minimise delay
Kleinrock proved (1979) this operating point is optimal; Jaffe later proved no distributed algorithm could converge to it

Although it is impossible to disambiguate what caused a change to RTT, a connection's behaviour over time tells a clear(er) story
It can be caused by e.g.:
Path-length change
Bottleneck bandwidth decrease
Queueing delay due to other traffic
Combining these measurements with a control loop can create a CC protocol that reacts to actual congestion, not loss or queue delay

Characterising the Bottleneck

A connection runs best when:
Packet arrival rate = BtlBw (full pipe), and
Total data in flight = BDP (=BtlBw×RTprop)

First condition ensures the bottleneck is fully utilised
No guarantee of no queue: 10 initial packets into a pipe with a BDP of 5 means there will be a standing queue of 5 packets at the bottleneck that will not dissipate
Second guarantees there is enough data to not starve the bottleneck, yet not overfill the pipe
No guarantee of no queue: a connection sending a BDP in BDP/2 bursts gets full bottleneck utilization, but with an average queue of BDP/4 (the queue jumps to BDP/2 at each burst, drains to 0, so it averages BDP/4)
The only way to minimise the queue is to meet both conditions
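The BDP/2-burst claim can be checked with a small fluid simulation of the bottleneck queue (my own sketch; units are arbitrary and chosen for illustration):

```python
# Bursts of BDP/2 arrive at intervals that keep the average send rate
# equal to BtlBw; the bottleneck drains continuously.
btlbw = 1.0                       # drain rate (packets per time unit)
bdp = 8.0
burst = bdp / 2                   # sender emits a BDP in two BDP/2 bursts
period = burst / btlbw            # inter-burst gap matching BtlBw on average

dt = 0.001                        # simulation time step
steps_per_period = round(period / dt)
n = 100 * steps_per_period        # simulate 100 burst periods
q = q_sum = 0.0
for step in range(n):
    if step % steps_per_period == 0:
        q += burst                # burst lands on the queue
    q = max(0.0, q - btlbw * dt)  # bottleneck drains continuously
    q_sum += q

avg = q_sum / n
print(avg)                        # time-average queue, close to BDP/4 = 2.0
```

The sender's average rate equals BtlBw (full utilization), yet a standing sawtooth queue remains: rate balance alone does not keep the queue empty.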

Both BtlBw and RTprop vary over the life of a connection, thus they must be estimated continuously
TCP tracks RTT at any time t as: RTT_t = RTprop_t + η_t, with η_t ≥ 0
η_t represents the "noise" introduced by queues along the path, ack aggregation, ...
RTprop only changes when the path changes, it is estimated as the minimum RTT within a set time window

To estimate BtlBw, use the max delivery rate in a set time window (typically 6 to 10 x RTT)
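Both estimators are windowed filters over per-packet samples: a windowed min for RTprop, a windowed max for BtlBw. A minimal sketch (window lengths and sample values are made up; this is not BBR's actual max-filter implementation):

```python
from collections import deque

class WindowedEstimator:
    """Track the min (for RTprop) or max (for BtlBw) of recent samples."""
    def __init__(self, window, pick):
        self.window = window       # window length in seconds
        self.pick = pick           # min for RTprop, max for BtlBw
        self.samples = deque()     # (timestamp, value) pairs

    def update(self, now, value):
        self.samples.append((now, value))
        # Expire samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        return self.pick(v for _, v in self.samples)

rtprop_est = WindowedEstimator(window=10.0, pick=min)  # min-RTT filter
btlbw_est = WindowedEstimator(window=1.0, pick=max)    # max-delivery-rate filter

# Feed in some made-up per-ack RTT samples (seconds):
for t, rtt_sample in [(0.0, 0.052), (0.5, 0.047), (1.0, 0.061)]:
    rtprop = rtprop_est.update(t, rtt_sample)
print(rtprop)  # 0.047: smallest RTT seen within the window
```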

However, when one is observed the other cannot be: measuring BtlBw requires filling the pipe, which creates a queue that inflates RTT and obscures RTprop; measuring RTprop requires an empty queue, which means the pipe is not full

... More details on how BBR is implemented with some pseudo code ...