[TriLUG] Tuning WAN links
Aaron S. Joyner
aaron at joyner.ws
Thu Nov 1 03:34:57 EDT 2007
Jeremy Portzer wrote:
> Shawn Hood wrote:
>
>> I've run iperf between two RHEL4 boxes connected to the 3560s. The highest
>> throughput I've been able to get is ~45 Mbit/s by increasing the buffer sizes
>> in /etc/sysctl and using massive window sizes on iperf. I was hoping you
>> guys could point me in the right direction. I need to do some reading about
>> how to get the most out of this link, and any reference would be greatly
>> appreciated. Will this be a matter of creating a Linux router on each end to
>> shape the traffic destined for this link? Is this something better suited
>> to proprietary technology that claims to 'auto-tune' these kinds of links?
>> I'm fairly fluent when it comes to talking about this stuff 'in theory,' but
>> have yet to get any hands-on experience.
>>
>>
>
> I don't know too much about iperf*, but my general "scientific method"
> approach to troubleshooting makes me wonder if you have a "control" in your
> experiment. What is the maximum throughput you can get between two Red
> Hat boxes with the same type of interface (I assume single GigE card?),
> just directly connected? What about the capabilities of the switches or
> other network equipment between the RH boxes and the routers?
>
> Just wanted to make sure you're pointing your finger at the right culprit.
>
> --Jeremy
>
> *meaning, I've never heard of it before reading your post!
>
To follow onto Jeremy's very good suggestion... Although I think you've
already picked up on this from your explicit mention that it's a high-latency
link, that particular fact is a big deal. To really simulate the problem,
it's fun to set up 3 computers: two boxes with a crossover cable between
them. Benchmark their throughput shoveling data around with your tool of
choice (I like netcat and some interface counter scripts; iperf will probably
work fine). Then drop an OpenBSD box into the middle of that crossover cable,
running pf, and inject some latency into the link. The pf firewall has
particularly good support for this kind of lab testing, although the same can
be accomplished with iptables and tc, with some care (a rough sketch of the
tc approach is below). It's been some years since I had the time to sit down
and do this, but it can be really enlightening to see how the queuing degrades
under latency at different line rates. What you're likely to notice is that
with the default Linux TCP settings, higher throughputs suffer more from
higher latencies than lower throughputs do. That is to say, your 3 Mbit cable
or DSL line is reasonably okay with 100 ms of latency and doesn't lose much of
its rate to the latency, but even a 100 Mbit connection will see dramatic
losses in throughput with the addition of just 5 to 10 ms of latency.
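If you want to try the Linux-only version of that lab, something along these
lines should work. Treat it as a sketch: the hostnames, interface names
(eth0/eth1), window size, and delay values are just placeholders, and the
netem queuing discipline needs a reasonably recent 2.6 kernel.

  # Baseline: measure throughput over the crossover cable first.
  serverbox$ iperf -s -w 8M
  clientbox$ iperf -c serverbox -w 8M -t 30

  # Then, on the Linux box sitting in the middle of the link, use tc's
  # netem discipline to add delay on both interfaces:
  middlebox# tc qdisc add dev eth0 root netem delay 25ms
  middlebox# tc qdisc add dev eth1 root netem delay 25ms
  # 25 ms in each direction gives roughly a 50 ms round trip.

  # Re-run the iperf test, compare, then clean up:
  middlebox# tc qdisc del dev eth0 root
  middlebox# tc qdisc del dev eth1 root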
Gigabit Ethernet doesn't even run at gigabit speeds with NO added latency and
the traditional stock TCP settings. :) Turning up the buffer sizes lets you
get reasonable throughput, until you inject even minor latency, and then it
all goes to hell again. Getting reasonable throughput on high-latency gigabit
links requires *extremely* large TCP buffer sizes, to allow enough packets to
remain in flight while the ACKs for them make their way back.
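For reference, the knobs involved are the standard Linux TCP memory sysctls.
A minimal sketch follows; the 16 MB maximums and the 8 MB iperf window are
only illustrative, and "remotehost" is a placeholder. The right values depend
on your own bandwidth-delay product, which the math below walks through.

  # /etc/sysctl.conf (apply with sysctl -p, or set individually via sysctl -w)
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # min, default, and max buffer sizes, in bytes
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

  # then ask iperf for a large window explicitly:
  iperf -c remotehost -w 8M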
For those interested, a short digression into math. On a 1 Gbit link, you're
sending 1,000,000,000 bits every second, or 125,000,000 bytes, aka 125 MB.
This works out to sending 125 KB every 1 ms. By default, the Linux buffers
for writing TCP data (tcp_wmem, tcp_rmem) on most older Linux systems are
around 13 KB; on newer systems, like the Gutsy Gibbon laptop I'm writing
from, it's around 4 MB. You can probably imagine that a 13 KB buffer is hard
for most applications to refill reliably at the required 0.1 ms intervals to
sustain 125 KB every 1 ms. Thus, by increasing this buffer to a larger size,
your app can write data in larger chunks, increasing the likelihood that the
buffer will have data in it to keep the flow going. Also, and really more
importantly, these values are used by the kernel to calculate the dynamic
window size for the TCP connection. In short, the window roughly corresponds
to the amount of data that's allowed to be in flight at any given moment
before an ACK is received. This is required because the TCP stack may be
asked to retransmit any of those packets that are dropped on the way to the
other end, so it has to keep them on hand for that eventuality. So, if we
have 5 ms of latency, we need to keep 5 * 125 KB of buffer space, aka 625 KB.
If we have 50 ms of latency, that's 6,250 KB, or roughly 6 MB. If things get
really crazy and we have 500 ms of latency (don't laugh, this is a fact of
life, or at least of physics, in the satellite world), that's over 60 MB of
buffer space. It gets worse when you consider packet loss on the link: as
the likelihood that you have to transmit a packet twice goes up (i.e. the
first packet is lost, then the retransmitted packet is also lost), things get
exponentially ugly, so to speak. :)
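If you want to play with the numbers yourself, the back-of-the-envelope
formula is just bandwidth times latency. A quick sketch in the shell, using
the 1 Gbit link speed and the latencies from above:

  # buffer needed (bytes) = (bits per second / 8) * latency in seconds
  echo $(( 1000000000 / 8 * 5   / 1000 ))   # 1 Gbit/s,   5 ms ->   625000 bytes (~625 KB)
  echo $(( 1000000000 / 8 * 50  / 1000 ))   # 1 Gbit/s,  50 ms ->  6250000 bytes (~6 MB)
  echo $(( 1000000000 / 8 * 500 / 1000 ))   # 1 Gbit/s, 500 ms -> 62500000 bytes (~60 MB)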
Anyway, it's late and I'm rambling, so it's time for bed. Hopefully the
rambling has been educational for someone.
Aaron S. Joyner