[TriLUG] Tuning WAN links
Aaron S. Joyner
aaron at joyner.ws
Thu Nov 1 03:34:57 EDT 2007
Jeremy Portzer wrote:
> Shawn Hood wrote:
>
>> I've run iperf between two RHEL4 boxes connected to the 3560s. The highest
>> throughput I've been able to get is ~45 Mbit/s by increasing the buffer sizes
>> in /etc/sysctl and using massive window sizes on iperf. I was hoping you
>> guys could point me in the right direction. I need to do some reading about
>> how to get the most out of this link, and any reference would be greatly
>> appreciated. Will this be a matter of creating a Linux router on each end to
>> shape the traffic destined for this link? Is this something better suited
>> to proprietary technology that claims to 'auto-tune' these kinds of links?
>> I'm fairly fluent when it comes to talking about this stuff 'in theory,' but
>> have yet to get any hands-on experience.
>>
>>
>
> I don't know too much about iperf*, but my general "scientific method"
> approach to troubleshooting makes me wonder if you have a "control" in your
> experiment. What is the maximum throughput you can get between two Red
> Hat boxes with the same type of interface (I assume single GigE card?),
> just directly connected? What about the capabilities of the switches or
> other network equipment between the RH boxes and the routers?
>
> Just wanted to make sure you're pointing your finger at the right culprit.
>
> --Jeremy
>
> *meaning, I've never heard of it before reading your post!
>
To follow onto Jeremy's very good suggestion... Although I think you've
already picked up on this from your explicit mention that it's a high-latency
link, that particular fact is a big deal. To really simulate the problem,
it's fun to set up 3 computers: two boxes with a crossover cable between
them. Benchmark their throughput shoveling data around with your tool of
choice (I like netcat and some interface counter scripts; iperf will probably
work fine). Then drop an OpenBSD box into the middle of that crossover cable,
running pf, and inject some latency into the link. The pf firewall has
particularly good support for this kind of lab testing, although the same can
be accomplished with iptables and tc, with some care (a rough sketch of the
tc approach is below). It's been some years since I had the time to sit down
and do this, but it can be really enlightening to see how the queuing degrades
under latency at different line rates. What you're likely to notice is that
with the default Linux TCP settings, higher throughputs suffer more from
higher latencies than lower throughputs do. That is to say, your 3 Mbit cable
or DSL line is reasonably okay with 100 ms of latency and doesn't lose much of
its rate to the latency, but even a 100 Mbit connection will see dramatic
losses in throughput with the addition of just 5 to 10 ms of latency.
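If you want to try the Linux-only version of that lab, something along these
lines should work. Treat it as a sketch: the hostnames, interface names
(eth0/eth1), window size, and delay values are just placeholders, and the
netem queuing discipline needs a reasonably recent 2.6 kernel.

  # Baseline: measure throughput over the crossover cable first.
  serverbox$ iperf -s -w 8M
  clientbox$ iperf -c serverbox -w 8M -t 30

  # Then, on the Linux box sitting in the middle of the link, use tc's
  # netem discipline to add delay on both interfaces:
  middlebox# tc qdisc add dev eth0 root netem delay 25ms
  middlebox# tc qdisc add dev eth1 root netem delay 25ms
  # 25 ms in each direction gives roughly a 50 ms round trip.

  # Re-run the iperf test, compare, then clean up:
  middlebox# tc qdisc del dev eth0 root
  middlebox# tc qdisc del dev eth1 root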
Gigabit Ethernet doesn't even run at gigabit speeds with NO added latency and
the traditional stock TCP settings. :) Turning up the buffer sizes lets you
get reasonable throughput, until you inject even minor latency, and then it
all goes to hell again. Getting reasonable throughput on high-latency gigabit
links requires *extremely* large TCP buffer sizes, to allow enough packets to
remain in flight while the ACKs for them make their way back.
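For reference, the knobs involved are the standard Linux TCP memory sysctls.
A minimal sketch follows; the 16 MB maximums and the 8 MB iperf window are
only illustrative, and "remotehost" is a placeholder. The right values depend
on your own bandwidth-delay product, which the math below walks through.

  # /etc/sysctl.conf (apply with sysctl -p, or set individually via sysctl -w)
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # min, default, and max buffer sizes, in bytes
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

  # then ask iperf for a large window explicitly:
  iperf -c remotehost -w 8M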
For those interested, a short digression into math. On a 1 Gbit link, you're
sending 1,000,000,000 bits every second, or 125,000,000 bytes, aka 125 MB.
This works out to sending 125 KB every 1 ms. By default, the Linux buffers
for writing TCP data (tcp_wmem, tcp_rmem) on most older Linux systems are
around 13 KB; on newer systems, like the Gutsy Gibbon laptop I'm writing
from, it's around 4 MB. You can probably imagine that a 13 KB buffer is hard
for most applications to refill reliably at the required 0.1 ms intervals to
sustain 125 KB every 1 ms. Thus, by increasing this buffer to a larger size,
your app can write data in larger chunks, increasing the likelihood that the
buffer will have data in it to keep the flow going. Also, and really more
importantly, these values are used by the kernel to calculate the dynamic
window size for the TCP connection. In short, the window roughly corresponds
to the amount of data that's allowed to be in flight at any given moment
before an ACK is received. This is required because the TCP stack may be
asked to retransmit any of those packets that are dropped on the way to the
other end, so it has to keep them on hand for that eventuality. So, if we
have 5 ms of latency, we need to keep 5 * 125 KB of buffer space, aka 625 KB.
If we have 50 ms of latency, that's 6,250 KB, or roughly 6 MB. If things get
really crazy and we have 500 ms of latency (don't laugh, this is a fact of
life, or at least of physics, in the satellite world), that's over 60 MB of
buffer space. It gets worse when you consider packet loss on the link: as
the likelihood that you have to transmit a packet twice goes up (i.e. the
first packet is lost, then the retransmitted packet is also lost), things get
exponentially ugly, so to speak. :)
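If you want to play with the numbers yourself, the back-of-the-envelope
formula is just bandwidth times latency. A quick sketch in the shell, using
the 1 Gbit link speed and the latencies from above:

  # buffer needed (bytes) = (bits per second / 8) * latency in seconds
  echo $(( 1000000000 / 8 * 5   / 1000 ))   # 1 Gbit/s,   5 ms ->   625000 bytes (~625 KB)
  echo $(( 1000000000 / 8 * 50  / 1000 ))   # 1 Gbit/s,  50 ms ->  6250000 bytes (~6 MB)
  echo $(( 1000000000 / 8 * 500 / 1000 ))   # 1 Gbit/s, 500 ms -> 62500000 bytes (~60 MB)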
Anyway, it's late and I'm rambling, so it's time for bed. Hopefully the
rambling has been educational for someone.
Aaron S. Joyner