[e2e] Why do we need TCP flow control (rwnd)?
Fred Baker
fred at cisco.com
Tue Jul 1 11:26:17 PDT 2008
On Jul 1, 2008, at 9:03 AM, David P. Reed wrote:
> From studying actual congestive collapses, one can figure out how to
> prevent them.
OK, glad to hear it. I apologize for the form of the data I will
offer; it is of the rawest type. But you may find it illuminating.
Before I start, please understand that the network I am about to
discuss had a very serious problem at one time and has since fixed it.
So while the charts are a good illustration of a bad situation, this
discussion should not reflect negatively on the network as it is
today.
The scenario is a university network in Africa that, at the time,
connected to the great wide world via VSAT. It had O(20K) students
behind a pair of links from two companies, one at 512 KBPS and one at
1 MBPS. I was there in 2004 and had a file (my annual performance
review) that I needed to upload. The process took all day and even at
the end of it failed to complete. Wondering why, I whipped out that
great little bit of shareware named PingPlotter (which I found very
useful back when I ran on a Windows system) and took this picture:
ftp://ftpeng.cisco.com/fred/collapse/ams3-dmzbb-gw1.cisco.com.gif
The second column from the left is the loss rate; as you can see, it
was between 30 and 40%. The little red lines across the bottom mark
individual losses and show that they were ongoing.
Based on this data, I convinced the school to increase its bandwidth
by an order of magnitude. It kept the same two links, but now at 5 and
10 MBPS. Six months later I repeated the experiment, this time from
the other end and without PingPlotter, because I had changed computers
and PingPlotter doesn't run on my Mac (wah!). Instead, I ran
simultaneous pings to the system's two addresses, and in so doing
measured the ping RTT on both the 5 and 10 MBPS paths. The difference
between the raw file and the "edited" file below is 22 data points
that were clear outliers. You will see very clearly that the 10 MBPS
path was not overloaded and didn't experience significant queuing
delays (although the satellite delay is pretty obvious), while the 5
MBPS path was heavily loaded throughout the day and many samples were
in the 2000 ms ballpark.
ftp://ftpeng.cisco.com/fred/collapse/Makerere-April-4-2005-edited.pdf
ftp://ftpeng.cisco.com/fred/collapse/Makerere-April-4-2005.pdf
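A rough way to reproduce this kind of measurement without PingPlotter
might look like the sketch below: it pings two addresses in parallel
and logs each reply's RTT with a timestamp. The addresses are
placeholders (TEST-NET-1), the options assume a Linux-style ping, and
this is only an illustrative stand-in, not the tool actually used.

    #!/usr/bin/env python3
    # Rough sketch: probe two addresses in parallel, log per-reply RTT.
    # The hosts below are placeholders, not the addresses measured above.
    import re, subprocess, threading, time

    HOSTS = ["192.0.2.1", "192.0.2.2"]   # placeholder addresses
    INTERVAL = 2.0                        # seconds between probes per host

    def probe(host):
        while True:
            # One echo request; -W caps the wait so losses show up quickly.
            out = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                                 capture_output=True, text=True).stdout
            m = re.search(r"time=([\d.]+) ms", out)
            rtt = m.group(1) + " ms" if m else "lost"
            print(time.strftime("%H:%M:%S"), host, rtt, flush=True)
            time.sleep(INTERVAL)

    for h in HOSTS:
        threading.Thread(target=probe, args=(h,), daemon=True).start()
    time.sleep(8 * 3600)   # log for a working day, then exit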
The delay distribution, for all that high delay on the 5 MBPS path, is
surprisingly similar to what one finds on any other link. Visually, it
could be confused with a Poisson distribution.
ftp://ftpeng.cisco.com/fred/collapse/Makerere-April-4-2005-delay-distribution.pdf
Looking at it on a log-linear scale, however, the difference between
the two links becomes pretty obvious. The overprovisioned link looks
normal, but the saturated link shows clearly bimodal behavior. When it's
not all that busy, delays are nominal, but it has a high density
around 2000 ms RTT and a scattering in between. When it is saturated -
which it is much of the day - TCP is driving to the cliff, and the
link's timing reflects the fact.
ftp://ftpeng.cisco.com/fred/collapse/Makerere-April-4-2005-log-linear-delay-distribution.pdf
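To illustrate why the log-linear view is the revealing one, here is a
small sketch that plots the same histogram on a linear and on a log
y-axis. The samples are synthetic, generated only to mimic the shape
described above (nominal satellite RTTs plus a heavy cluster near
2000 ms); they are not the measured data.

    # Sketch: the same RTT histogram on linear and log-linear axes.
    # The samples below are synthetic and purely illustrative.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    rtt_ms = np.concatenate([
        rng.normal(650, 60, 20000),   # lightly loaded: nominal satellite RTT
        rng.normal(2000, 120, 1500),  # saturated: queue near full
        rng.uniform(700, 1900, 800),  # a scattering in between
    ])

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(rtt_ms, bins=100)
    ax1.set(title="linear", xlabel="RTT (ms)", ylabel="samples")
    ax2.hist(rtt_ms, bins=100)
    ax2.set_yscale("log")
    ax2.set(title="log-linear", xlabel="RTT (ms)")
    plt.tight_layout()
    plt.show()

With proportions like these, the second mode is nearly invisible on
the linear axis but obvious on the logarithmic one.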
A sample size of one is an example, not a study - data, not
information. But I think the example, coupled with our knowledge of
queuing theory and general experience, supports four comments:
(1) there ain't nothin' quite like having enough bandwidth. If the
offered load vastly exceeds capacity, nobody gets anything done. This
is the classic congestive collapse scenario as predicted in RFCs 896
and 970. A review of Nagle's game theory discussion in RFC 970 is
illuminating.
(2) there ain't nothin' quite like having enough bandwidth. In a
statistical network, if the offered load approximates capacity, delay
is maximized, and loss (which is the extreme case of delay) erodes the
network's effectiveness; the back-of-the-envelope calculation after
these comments puts rough numbers on this.
(3) TCP's congestion control algorithms seek to maximize throughput,
but will work with whatever capacity they find available. If a link is
in a congestive collapse scenario, increasing capacity by an order of
magnitude results in TCP being released to increase its windows and,
through the "fast retransmit" heuristic, recover from occasional
losses in stride. It will do so, and the result will be to use the
available capacity regardless of what it is.
(4) congestion control algorithms that tune to the cliff obtain no
better throughput than algorithms that tune to the knee. That is by
definition: both the knee and the cliff maximize throughput, but the
cliff also maximizes queue depth at the bottleneck. Hence, algorithms
that tune to the knee are no worse for the individual end system, but
better for the network and the aggregate of its users. The difference
between a link that is overprovisioned and one on which offered load
approximates capacity is that on the former TCP moves data freely,
while on the latter TCP has to work around the fragility in the
network to provide adequate service in the face of performance issues.
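To put rough numbers behind comment (2), the textbook M/M/1 result for
mean time in system, W = 1/(mu - lambda), shows delay growing without
bound as offered load approaches capacity. The sketch below assumes a
512 KBPS link and 1500-byte packets purely for illustration; it is not
derived from the measurements above.

    # Back-of-the-envelope M/M/1 delay vs. utilization.
    # Link speed and packet size are assumptions for illustration only.
    LINK_BPS = 512_000                 # 512 KBPS link
    PKT_BITS = 1500 * 8                # 1500-byte packets
    mu = LINK_BPS / PKT_BITS           # service rate, ~42.7 packets/s

    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        lam = rho * mu                 # offered load, packets/s
        w_ms = 1000.0 / (mu - lam)     # mean time in system, ms
        print(f"utilization {rho:.0%}: mean delay ~ {w_ms:6.1f} ms")

Even with these made-up numbers the shape matches the plots: delay is
modest up to the knee and then climbs very steeply, into the
2000-ms-and-beyond range, as utilization approaches 1.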