[TriLUG] Need some help parsing a file

Tue Dec 31 01:21:47 EST 2013

On Mon, Dec 30, 2013 at 10:49:09PM -0500, Steve Litt wrote:
> On Sun, 29 Dec 2013 21:04:16 -0500
> Brian Blater <brb.lists at gmail.com> wrote:
> 
> > This has never been my forte and just can't seem to figure out what I
> > need to do.
> > 
> > I've got a file that basically has a directory listing. I need to
> > parse out everything but the filenames. The format of the document is
> > basically like this:
> > 
> > 11/09/2013  11:49 AM         7,887,098 this is filename 1.txt
> > 11/05/2013  08:09 PM        11,652,690 this is filename 2.sh
> > 
> > Basically I need to strip the date, time and bytes and just leave the
> > filename. Filenames will have spaces and various characters, but it is
> > always after the bytes and spaces are what separate everything.
> 
> I'd take advantage of the fact that you want to get rid of the first
> whitespace and everything after it:
> 
> cat junk.txt | sed -e"s/\s.*//"
> 
> I tried the preceding, and it worked perfectly.
> 
> Personally, I think AWK's a little bit overkill for this (but I use AWK
> all the time for tougher parsing), and using Perl for this (or Python
> or Ruby) is insanity.
> 
> The cut option's also excellent, but I remember regex a lot better than
> cut's arguments and options. And as someone says, by removing the
> first space and everything after it, you get around implementation
> problems, unless some version of ls prepends lines with spaces, or you
> use ls -l.
> 
> Thanks,
> 
> SteveT
> 
> Steve Litt                *  http://www.troubleshooters.com/
> Troubleshooting Training  *  Human Performance

To me the chief lesson from this thread is how a bunch of skilled
programmers can have so much variance in their interpretation of the
requirements.

I read them to say that the solution should yield the fields *after*
the 'size' field.  The 'sed' solution just proposed solves a different
problem than that one since it yields the first field from the input
file:

tbarron at home:~$ echo '11/09/2013  11:49 AM        7,887,098 this is filename 1.txt' | sed -e"s/\s.*//"
11/09/2013

Now 'cut' by itself doesn't quite cut it as I understand Brian's
requirements because he doesn't indicate that there is a fixed number
of spaces between fields, or even that there are spaces instead of
tabs, etc.
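
A quick way to see the problem, as a minimal sketch: with a single
space as the delimiter, cut treats every extra space as the boundary
of an empty field, so the field numbers shift from line to line.  The
helper name fields_from_fifth is made up for illustration.

```shell
# Hypothetical helper: naive attempt that splits on single spaces.
fields_from_fifth() {
    cut -d ' ' -f 5-
}

# Single spaces everywhere: the fifth field really is where we want to start.
printf '%s\n' 'a b c d e f' | fields_from_fifth     # prints: e f

# One extra space after the first field adds an empty field, so
# everything shifts and "d" leaks into the output.
printf '%s\n' 'a  b c d e f' | fields_from_fifth    # prints: d e f
```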

But we can use that cool trailing minus/hyphen notation from cut for
"all the fields from here to the end of the line" and get a robust
solution if we combine it in a pipeline with 'tr'.

Here is a short file with different numbers of fields after the size,
and different numbers of spaces and tabs between fields.  I think it
breaks the solutions proposed thus far.

tbarron at home:~$ cat input.txt
11/09/2013 11:49 AM    7,887,098 this is filename 1.txt
11/10/2013  12:50 PM          886,666 this be 2.txt
11/11/2013  04:23 AM		666 tab me file.txt

tbarron at home:~$ od -cb input.txt
0000000   1   1   /   0   9   /   2   0   1   3       1   1   :   4   9
        061 061 057 060 071 057 062 060 061 063 040 061 061 072 064 071
0000020       A   M                   7   ,   8   8   7   ,   0   9   8
        040 101 115 040 040 040 040 067 054 070 070 067 054 060 071 070
0000040       t   h   i   s       i   s       f   i   l   e   n   a   m
        040 164 150 151 163 040 151 163 040 146 151 154 145 156 141 155
0000060   e       1   .   t   x   t  \n   1   1   /   1   0   /   2   0
        145 040 061 056 164 170 164 012 061 061 057 061 060 057 062 060
0000100   1   3           1   2   :   5   0       P   M                
        061 063 040 040 061 062 072 065 060 040 120 115 040 040 040 040
0000120                           8   8   6   ,   6   6   6       t   h
        040 040 040 040 040 040 070 070 066 054 066 066 066 040 164 150
0000140   i   s       b   e       2   .   t   x   t  \n   1   1   /   1
        151 163 040 142 145 040 062 056 164 170 164 012 061 061 057 061
0000160   1   /   2   0   1   3           0   4   :   2   3       A   M
        061 057 062 060 061 063 040 040 060 064 072 062 063 040 101 115
0000200  \t  \t   6   6   6       t   a   b       m   e       f   i   l
        011 011 066 066 066 040 164 141 142 040 155 145 040 146 151 154
0000220   e   .   t   x   t  \n  \n
        145 056 164 170 164 012 012
0000227

We can use the "squeeze" option to tr in combination with a character
class for horizontal whitespace to put a single tab between each text
field in the input file, so that cut (which likes tabs natively) can
just keep everything from the fifth field on.  (Love that '5-' to pick
up all the remaining fields and keep a nice functional, non-iterative
approach that I, at least, don't know how to match with 'awk' for this
problem.)

tbarron at home:~$ cat input.txt | tr -s '[:blank:]' '\t'  | cut -f 5- 
this	is	filename	1.txt
this	be	2.txt
tab	me	file.txt

We can follow up with another 'tr' if the tabs in the output are
bothersome:

tbarron at home:~$ tr -s '[:blank:]' '\t'  < input.txt  | cut -f 5-  | tr '\t' ' '
this is filename 1.txt
this be 2.txt
tab me file.txt
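
One caveat with squeezing, and a sketch of a way around it: tr -s also
collapses runs of blanks inside the filename itself, so a name with
two consecutive spaces would come out with one.  If that matters, a
sed substitution can remove just the first four fields and leave the
rest of the line untouched.  This assumes a sed that accepts -E for
extended regular expressions (GNU and BSD sed both do); the function
name strip_listing_prefix is hypothetical.

```shell
# Hypothetical helper: drop optional leading blanks plus the first four
# blank-separated fields (date, time, AM/PM, size); keep the rest as-is.
strip_listing_prefix() {
    sed -E 's/^[[:blank:]]*([^[:blank:]]+[[:blank:]]+){4}//'
}

# Tabs before the size, and a double space inside the filename.
printf '11/11/2013  04:23 AM\t\t666 tab  me file.txt\n' | strip_listing_prefix
# prints: tab  me file.txt
```

Run as "strip_listing_prefix < input.txt", this should give the same
three filenames as the tr and cut pipeline on the sample file above.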

-- Tom Barron