[TriLUG] Need some help parsing a file
Steve Litt
slitt at troubleshooters.com
Tue Dec 31 20:02:24 EST 2013
On Tue, 31 Dec 2013 14:33:57 -0500
Tom Barron <tpb at dyncloud.net> wrote:
> On Tue, Dec 31, 2013 at 11:46:45AM -0500, R Radford wrote:
> > The tr solution would also reduce repeated spaces in the filename,
> > so would not work in that (hopefully extreme, but legal) case.
> >
>
> Yeah, good catch.
>
> So let's modify the third line of the test file to have two spaces in
> the third line's filename, in a spot where we can keep track of them,
> between 'file' and '.txt':
>
> tbarron at home:~$ cat input.txt
> 11/09/2013 11:49 AM    7,887,098 this is filename 1.txt
> 11/10/2013  12:50 PM          886,666 this be 2.txt
> 11/11/2013  04:23 AM		666 tab me file  .txt
>
> tbarron at home:~$ od -cb input.txt
> 0000000   1   1   /   0   9   /   2   0   1   3       1   1   :   4   9
>         061 061 057 060 071 057 062 060 061 063 040 061 061 072 064 071
> 0000020       A   M                   7   ,   8   8   7   ,   0   9   8
>         040 101 115 040 040 040 040 067 054 070 070 067 054 060 071 070
> 0000040       t   h   i   s       i   s       f   i   l   e   n   a   m
>         040 164 150 151 163 040 151 163 040 146 151 154 145 156 141 155
> 0000060   e       1   .   t   x   t  \n   1   1   /   1   0   /   2   0
>         145 040 061 056 164 170 164 012 061 061 057 061 060 057 062 060
> 0000100   1   3           1   2   :   5   0       P   M
>         061 063 040 040 061 062 072 065 060 040 120 115 040 040 040 040
> 0000120                           8   8   6   ,   6   6   6       t   h
>         040 040 040 040 040 040 070 070 066 054 066 066 066 040 164 150
> 0000140   i   s       b   e       2   .   t   x   t  \n   1   1   /   1
>         151 163 040 142 145 040 062 056 164 170 164 012 061 061 057 061
> 0000160   1   /   2   0   1   3           0   4   :   2   3       A   M
>         061 057 062 060 061 063 040 040 060 064 072 062 063 040 101 115
> 0000200  \t  \t   6   6   6       t   a   b       m   e       f   i   l
>         011 011 066 066 066 040 164 141 142 040 155 145 040 146 151 154
> 0000220   e           .   t   x   t  \n  \n
>         145 040 040 056 164 170 164 012 012
> 0000231
>
> Without resorting to python or perl, and trying to avoid complex
> regexes and stick to a functional/pipeline approach without any
> iteration, this is the best I can figure at the moment:
>
> tbarron at home:~$ awk '{$1=$2=$3=$4=""; print $0}' input.txt | sed -e"s/[ 	]*//"
> this is filename 1.txt
> this be 2.txt
> tab me file .txt
>
> Note that there's a space and a tab in the character class used in the
> sed regex - could use '\s' but my sed implementation doesn't grok
> '\t' as tab.

LOL, life gets hairy when you can't count on filenames not having
spaces, and you can't even count on the fields being fixed width.
Here's the strongarm I came up with for such a case, which doesn't look
as simple as your awk implementation:

sed -e "s#\([0-9]\{2\}/\)\{2\}[0-9]\{4\}[ \t]\+[0-9]\{2\}:[0-9]\{2\}[ \t]\+[AP]M[ \t]\+[0-9,]\+[ \t]\+##"

Maaaaaan, don't you wish gnu sed had a \d for digit? That [0-9] stuff
gets old.
All I can say in favor of the preceding sed command is that it
absolutely, positively will work, regardless of how many spaces,
commas, or anything else the filename has in it, as long as dates are
guaranteed mm/dd/yyyy and times are guaranteed hh:mm [AP]M and filesizes
are composed exclusively of digits or commas, and the date, time, size
and filename are separated by whitespace. The preceding sed command does
not give greedy matching a chance, but matches first the date, then the
time, then the size, and turns them plus the whitespace after the size
into nothing.
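
To put that claim to a quick test, here is a minimal sketch that runs the
same pattern over the double-space line from the quoted test file, next to
the quoted awk pipeline. Two things in it are assumptions rather than part
of the original commands: [[:blank:]] and \{1,\} are swapped in for [ \t]
and \+ so that it does not lean on GNU sed extensions, and the function
names strip_sed and strip_awk are made up for the example.

```shell
#!/bin/sh
# Recreate the third test line from the quoted message: two spaces after the
# date, two tabs before the size, two spaces between "file" and ".txt".
line=$(printf '11/11/2013  04:23 AM\t\t666 tab me file  .txt')

# Portable respelling of the sed command above (assumption: [[:blank:]] and
# \{1,\} stand in for [ \t] and \+). Matches date, time, AM/PM, and size in
# order, plus the whitespace after each, and deletes the lot.
strip_sed() {
    sed -e 's#\([0-9]\{2\}/\)\{2\}[0-9]\{4\}[[:blank:]]\{1,\}[0-9]\{2\}:[0-9]\{2\}[[:blank:]]\{1,\}[AP]M[[:blank:]]\{1,\}[0-9,]\{1,\}[[:blank:]]\{1,\}##'
}

# The awk pipeline from the quoted message, for comparison: blank the first
# four fields, then strip the leading whitespace that is left behind.
strip_awk() {
    awk '{$1=$2=$3=$4=""; print $0}' | sed -e 's/[[:blank:]]*//'
}

sed_result=$(printf '%s\n' "$line" | strip_sed)
awk_result=$(printf '%s\n' "$line" | strip_awk)

# Brackets make the spacing inside each result visible.
printf 'sed: [%s]\n' "$sed_result"
printf 'awk: [%s]\n' "$awk_result"
```

The sed version should print the filename with both spaces between "file"
and ".txt" intact. The awk version prints it with only one, because
assigning to a field makes awk rebuild the whole line with single-space
separators.
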
You know what this reminds me of? In the early days of the PC
revolution, the mainframe guys used to give us (PC programmers) files
containing report printouts, and have us parse the reports back into
data. I once asked the mainframe manager why she didn't just have her
people work with the underlying data, and she said "well, we could, but
we'd have to write a program!". Hey, I was paid by the hour, it's all
good.
If filenames were guaranteed not to contain spaces, the preceding could
be done, with just as much reliability, like this:
cat junk.txt | sed -e "s/.*[ \t]//"
But noooooooooo!
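
A small sketch of why that shortcut only holds when filenames have no
spaces: the greedy .* swallows everything up to the last blank on the line.
The filename report_1.txt and the function name last_field are hypothetical,
and [[:blank:]] is used in place of [ \t] on the assumption that portability
beyond GNU sed matters.

```shell
#!/bin/sh
# Greedy shortcut: delete everything up to and including the last blank.
last_field() {
    sed -e 's/.*[[:blank:]]//'
}

# Filename without spaces (hypothetical name): the whole name survives.
no_spaces=$(printf '11/09/2013 11:49 AM 7,887,098 report_1.txt\n' | last_field)

# Filename with spaces, taken from the quoted test file: only the last
# word survives.
with_spaces=$(printf '11/09/2013 11:49 AM 7,887,098 this is filename 1.txt\n' | last_field)

printf '[%s]\n' "$no_spaces"
printf '[%s]\n' "$with_spaces"
```

The first line printed should be [report_1.txt] and the second [1.txt],
which is exactly the failure that forces the longer pattern.
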
SteveT
Steve Litt * http://www.troubleshooters.com/
Troubleshooting Training * Human Performance