[TriLUG] Small puzzle: Fix my bad one-liner [Was: TLDP]

Jym Williams Zavada trilugj at jrwz.net
Tue Jan 6 05:45:01 EST 2015


Thank you Cristóbal, I had fun with your little puzzle!  I absolve you of 
your sin, your penance is complete.  Go and sin no more. ;)

Anyways, here's my take on "a more succinct, readable, and/or efficient 
one-liner that counts distinct IPv4 addresses that have accessed the 
www.tldp.org virtual host on January 3rd":

$ sudo gzip -dc 2015/01/www.tldp.org.vhost[123].access.log.20150103.gz |\
     awk '{addrs[$1]++}END{for(ip in addrs){x++};print x}'

And here's some of my reasoning behind it:

- Annoyingly, on some *nix implementations, zcat pulls the rug out from
   underfoot by expecting a .Z suffix for compressed files.  Experience has
   taught me that if there are multiple and roughly equal ways of setting up
   a command-line, but only one of them is cross-platform compatible, make
   using that one way a habit, and drop the others like you would a hot pan
   from the oven while wearing no mitts!

- Yes, I know the glob syntax for the log files' sequential node numbers
   is better expressed as [1-3] instead of [123], and would in fact save
   typing were there more than three in the sequence.  But being a touch
   typist (and ofttimes a lazy one at that), I found it more comfortable and
   quicker to type the numeric sequence entirely left-handedly, as opposed to
   the left-right-left zigzag for a dash between the one and the three.

- Awk is too often overlooked, especially given that Perl and Python are
   frequently considered the greater "hotness" for sysadmin scripting
   these days.  I know it's an "old school" bias, but to me, awk is like a
   Jaguar, compared to Perl as a Trans Am and Python as a Dodge Charger.
   And what do you know, awk has associative arrays too, imagine that!
   They are precisely what the doctor ordered for eliminating two extra
   piped processes (namely sort and wc).
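
For anyone who wants to poke at the three points above without access to the
real logs, here is a rough sketch rather than the exact command I ran.  The
file names (demo.vhostN.access.log.gz), the temporary directory, and the
192.0.2.x addresses are all made up for illustration.  It decompresses with
gzip -dc, picks up the three node files with the [123] glob, and lets awk's
associative array do the de-duplication that sort and wc did before.

```shell
#!/bin/sh
# Hypothetical sample data: three tiny gzipped logs standing in for the
# real per-node vhost logs. Names and addresses are invented.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' '192.0.2.1 - - "GET /a"' '192.0.2.2 - - "GET /b"' |
    gzip -c > "$tmp/demo.vhost1.access.log.gz"
printf '%s\n' '192.0.2.2 - - "GET /c"' '192.0.2.3 - - "GET /d"' |
    gzip -c > "$tmp/demo.vhost2.access.log.gz"
printf '%s\n' '192.0.2.1 - - "GET /e"' |
    gzip -c > "$tmp/demo.vhost3.access.log.gz"

# gzip -dc concatenates the decompressed files on stdout.
# The array addrs gets one key per distinct first field (the client IP);
# the END block counts the keys. "n + 0" prints 0 instead of a blank
# line if the input happens to be empty.
count=$(gzip -dc "$tmp"/demo.vhost[123].access.log.gz |
    awk '{ addrs[$1]++ } END { for (ip in addrs) n++; print n + 0 }')

echo "$count"
```

With the sample data above, the five log lines hold three distinct addresses,
so the script prints 3.  The array keeps only one entry per address in
memory, which is why no sorting pass is needed.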

Cheers everyone, I hope you all enjoy the New Year!

-Jim Williams Zavada

On Mon, 5 Jan 2015 at 10:43, cristobalpalmer at gmail.com wrote:

> My last post was a bit intemperate and I regret that. As penance, I’m 
> presenting a puzzle.
>
>> On Jan 4, 2015, at 9:00 PM, cristobalpalmer at gmail.com wrote:
>> 
>> $ for i in 1 2 3; do sudo zcat 2015/01/www.tldp.org.vhost$i.access.log.20150103.gz | awk '{print $1}' | sort -u >> /tmp/tldp ; done ; sort -u /tmp/tldp | wc -l
>> 34300
>> 
>> About 34k distinct IPv4 addresses accessed it yesterday. Presumably for 
>> documentation.
>
> My one-liner is pretty bad. It got the job (give a reasonable estimate of 
> distinct clients accessing www.tldp.org for one recent day) done, but I 
> count at least four things wrong with it. Using any tools that are part of 
> the default install of the distro of your choice from the last three 
> years, please construct a more succinct, readable, and/or efficient 
> one-liner that counts distinct IPv4 addresses that have accessed the 
> www.tldp.org virtual host on January 3rd.
>
> Things you should note:
>
>  * There are three different log files, one for each vhost node (i.e., there are three OS instances running identical web hosting stacks, and each has a log file that sits in a single shared directory)
>  * A typical line looks like this (mangled for privacy):
>    192.168.1.50 - - [03/Jan/2015:20:55:45 -0500] "GET /LDP/Linux-Filesystem-Hierarchy/html/index.html HTTP/1.1" 200 4860 "http://ubuntuforums.org/showthread.php?t=1637306" "Mozilla/5.0"
>  * Answers that involve real metrics tools and log analysis tools are cool and good, but not in the spirit of this puzzle.[0]
>
> Cheers,
> --
> Cristóbal Palmer
> Technical Director, ibiblio.org
> University of North Carolina at Chapel Hill
> CB #3456, Manning Hall, Chapel Hill, NC 27599-3456
>
> [0] We (ibiblio) got out of the analytics/metrics game for our several hundred vhosts back when Google stopped sales of Urchin. We shifted analytics responsibilities to the individual vhosts, and the vast majority went with Google Analytics. Possibly we’ll revisit this when we get through more of our high-priority infrastructure changes.

