[TriLUG] MSN bot is pounding my website...
Aaron S. Joyner
aaron at joyner.ws
Thu Dec 9 17:10:24 EST 2004
gregbrown at mindspring.com wrote:
>for all hits
>cat access_log| awk '{ print $1 }' | sort | uniq -c | sort -gr | head
>
>
Disclaimer: this adds nothing of value to the actual conversation at hand.
Just a style preference, but I prefer to use cut instead of awk, as it
seems to be just the right tool for that particular job you're doing.
In the example you give, the replacement cut command would be ...
access_log | cut -f 1 -d ' ' | sort ... It also seems at first glance to
me that awk, being a (much more capable, but correspondingly) heavier
tool for the job would probably be slower at the task. I set up a bit of
an artificial test to determine one way or the other which one was more
efficient. I took a mail log with about 9 million lines in it, and
cat'd it through each of the programs, throwing the output to /dev/null,
and repeated the process three times to get a little bit of an average.
awk took about 54 seconds on average, cut took about 43. awk spent
about 25.5 seconds processing in user space for each run, cut spent
about 6.5. The difference of 11 seconds in real time, and the larger gap
of about 19 seconds in user time going by the figures above, show clearly
that awk is paying attention to the entire line when it reads it in,
whereas cut shortcuts when it has achieved its goal of getting to the
first space. The rest of the time
is simply how slow the disks are. :) For comparison, it took an
average of 38 seconds to do a "wc -l" of this file.
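
For anyone who wants to try the comparison themselves, here is a minimal sketch of the two pipelines run side by side. The three-line sample log is made up for illustration (it is not the mail log from my test), the variable names are arbitrary, and I use "sort -nr" rather than "sort -gr" only because -n is the more portable flag. One caveat worth knowing: awk splits on runs of whitespace and ignores leading blanks, while cut treats every single space as a delimiter, so the two agree only when the first field starts at column one, as it does in a typical access_log.

```shell
#!/bin/sh
# Build a tiny sample log (hypothetical data, for illustration only).
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [09/Dec/2004:17:00:01 -0500] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [09/Dec/2004:17:00:02 -0500] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [09/Dec/2004:17:00:03 -0500] "GET /b HTTP/1.1" 404 64
EOF

# Count hits per client using awk to pull out the first field.
awk_out=$(awk '{ print $1 }' "$log" | sort | uniq -c | sort -nr | head)

# Same count, using cut with a single space as the field delimiter.
cut_out=$(cut -d ' ' -f 1 "$log" | sort | uniq -c | sort -nr | head)

# Report whether the two approaches produced identical output.
if [ "$awk_out" = "$cut_out" ]; then
    echo "awk and cut agree:"
    echo "$cut_out"
else
    echo "outputs differ"
fi

rm -f "$log"
```

To reproduce the timing on a large file, something along the lines of "time cut -d ' ' -f 1 bigfile > /dev/null" and the matching awk command should do, assuming a shell such as bash where time is available; repeat each a few times so the disk cache does not favor whichever runs second.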
So in short, even on really large inputs, it's not going to make more
than 10-15 seconds' worth of difference. But if you're an efficiency
nut, or dealing with ridiculous data sets, hopefully I added one more
tool to your bag of text-mangling tricks. :)
Aaron S. Joyner