[TriLUG] Need some help parsing a file

William Sutton william at trilug.org
Tue Dec 31 20:37:00 EST 2013


At that point, you're back to my cat | perl -pe solution, which was a lot
cleaner than the sed nastiness below.

William Sutton

On Tue, 31 Dec 2013, Steve Litt wrote:

> On Tue, 31 Dec 2013 14:33:57 -0500
> Tom Barron <tpb at dyncloud.net> wrote:
>
>> On Tue, Dec 31, 2013 at 11:46:45AM -0500, R Radford wrote:
>>> The tr solution would also reduce repeated spaces in the filename,
>>> so would not work in that (hopefully extreme, but legal) case.
>>>
>>
>> Yeah, good catch.
>>
>> So let's modify the test file so that the filename on the third line
>> has two spaces in a spot where we can keep track of them,
>> between 'file' and '.txt':
>>
>> tbarron@home:~$ cat input.txt
>> 11/09/2013 11:49 AM    7,887,098 this is filename 1.txt
>> 11/10/2013  12:50 PM          886,666 this be 2.txt
>> 11/11/2013  04:23 AM		666 tab me file  .txt
>>
>> tbarron@home:~$ od -cb input.txt
>> 0000000   1   1   /   0   9   /   2   0   1   3       1   1   :   4   9
>>         061 061 057 060 071 057 062 060 061 063 040 061 061 072 064 071
>> 0000020       A   M                   7   ,   8   8   7   ,   0   9   8
>>         040 101 115 040 040 040 040 067 054 070 070 067 054 060 071 070
>> 0000040       t   h   i   s       i   s       f   i   l   e   n   a   m
>>         040 164 150 151 163 040 151 163 040 146 151 154 145 156 141 155
>> 0000060   e       1   .   t   x   t  \n   1   1   /   1   0   /   2   0
>>         145 040 061 056 164 170 164 012 061 061 057 061 060 057 062 060
>> 0000100   1   3           1   2   :   5   0       P   M
>>         061 063 040 040 061 062 072 065 060 040 120 115 040 040 040 040
>> 0000120                           8   8   6   ,   6   6   6       t   h
>>         040 040 040 040 040 040 070 070 066 054 066 066 066 040 164 150
>> 0000140   i   s       b   e       2   .   t   x   t  \n   1   1   /   1
>>         151 163 040 142 145 040 062 056 164 170 164 012 061 061 057 061
>> 0000160   1   /   2   0   1   3           0   4   :   2   3       A   M
>>         061 057 062 060 061 063 040 040 060 064 072 062 063 040 101 115
>> 0000200  \t  \t   6   6   6       t   a   b       m   e       f   i   l
>>         011 011 066 066 066 040 164 141 142 040 155 145 040 146 151 154
>> 0000220   e           .   t   x   t  \n  \n
>>         145 040 040 056 164 170 164 012 012
>> 0000231
>>
>> Without resorting to python or perl, and trying to avoid complex
>> regexes and stick to a functional/pipeline approach without any
>> iteration, this is the best I can figure at the moment:
>>
>> tbarron@home:~$ awk '{$1=$2=$3=$4=""; print $0}' input.txt | sed -e"s/[ 	]*//"
>> this is filename 1.txt
>> this be 2.txt
>> tab me file .txt
>>
>> Note that there's a space and a tab in the character class used in the
>> sed regex. I could use '\s', but my sed implementation doesn't grok
>> '\t' as tab.
>
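
One caveat on the awk step quoted above, as a minimal sketch rather than a verdict on the approach: assigning to any field makes awk rebuild $0 from its fields, joined with the output field separator (a single space by default). So the doubled space between 'file' and '.txt' does not survive, which matches the single space in the output shown. The snippet below feeds the third sample line through the same awk program; the variable name and the simplified leading-space sed are my own assumptions, not part of the original pipeline.

```shell
# Third sample line from the thread, with two tabs before the size and
# two spaces between "file" and ".txt".
# Assigning to $1..$4 forces awk to rebuild $0 with single-space
# separators, so the doubled space in the filename is collapsed.
collapsed=$(
  printf '11/11/2013  04:23 AM\t\t666 tab me file  .txt\n' |
    awk '{$1=$2=$3=$4=""; print $0}' |
    sed -e 's/^ *//'   # strip the leading spaces left by the emptied fields
)

# Brackets make the remaining whitespace visible.
printf '[%s]\n' "$collapsed"
```

If the exact spacing inside filenames matters, the field-assignment trick is the part to avoid; anything that edits the line as a string without touching awk fields keeps the original whitespace.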
> LOL, life gets hairy when you can't count on filenames not having
> spaces, and you can't even count on the fields being fixed width.
> Here's the strongarm I came up with for such a case, which doesn't look as
> simple as your awk implementation:
>
> sed -e "s#\([0-9]\{2\}/\)\{2\}[0-9]\{4\}[ \t]\+[0-9]\{2\}:[0-9]\{2\}[ \t]\+[AP]M[ \t]\+[0-9,]\+[ \t]\+##"
>
> Maaaaaan, don't you wish gnu sed had a \d for digit? That [0-9] stuff
> gets old.
>
> All I can say in favor of the preceding sed command is that it
> absolutely, positively will work, regardless of how many spaces,
> commas, or anything else the filename has in it, as long as dates are
> guaranteed mm/dd/yyyy and times are guaranteed hh:mm [AP]M and filesizes
> are composed exclusively of digits or commas, and the date, time, size
> and filename are separated by whitespace. The preceding sed command does
> not give greedy matching a chance, but matches first the date, then the
> time, then the size, and turns them plus the whitespace after the size
> into nothing.
>
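
For anyone who wants to try the quoted sed approach against the three sample lines, here is a hedged sketch. It is a POSIX-flavored respelling of the regex, not the command as posted: [[:space:]]\{1,\} stands in for [ \t]\+, since \+ and \t inside a bracket expression are GNU sed extensions. The function name strip_prefix and the variable names are made up for illustration.

```shell
# Remove "date time AM/PM size" plus the whitespace after the size,
# leaving the filename untouched (including any repeated spaces).
strip_prefix() {
  sed -e 's#\([0-9]\{2\}/\)\{2\}[0-9]\{4\}[[:space:]]\{1,\}[0-9]\{2\}:[0-9]\{2\}[[:space:]]\{1,\}[AP]M[[:space:]]\{1,\}[0-9,]\{1,\}[[:space:]]\{1,\}##'
}

# The three sample lines from the thread, recreated with printf so the
# tabs and the doubled space are explicit.
names=$(
  {
    printf '11/09/2013 11:49 AM    7,887,098 this is filename 1.txt\n'
    printf '11/10/2013  12:50 PM          886,666 this be 2.txt\n'
    printf '11/11/2013  04:23 AM\t\t666 tab me file  .txt\n'
  } | strip_prefix
)

printf '%s\n' "$names"
```

Because the substitution only consumes the leading date, time, and size, the two spaces in the third filename come through intact.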
> You know what this reminds me of? In the early days of the PC
> revolution, the mainframe guys used to give us (PC programmers) files
> containing report printouts, and have us parse the reports back into
> data. I once asked the mainframe manager why she didn't just have her
> people work with the underlying data, and she said "well, we could, but
> we'd have to write a program!". Hey, I was paid by the hour, it's all
> good.
>
> If filenames were guaranteed not to contain spaces, the preceding could
> be done, with just as much reliability, like this:
>
> cat junk.txt | sed -e "s/.*[ \t]//"
>
> But noooooooooo!
>
> SteveT
>
> Steve Litt                *  http://www.troubleshooters.com/
> Troubleshooting Training  *  Human Performance
>

