[TriLUG] simple regular expression to strip HTML?
Jon Carnes
jonc at nc.rr.com
Wed Feb 18 23:42:14 EST 2004
On Wed, 2004-02-18 at 23:23, Jeremy Portzer wrote:
> On Wed, 18 Feb 2004, Tanner Lovelace wrote:
>
> > Jeremy Portzer said the following on 2/18/04 9:28 PM:
> >
> > > Does anyone know of a quick-and-dirty regular expression that will strip
> > > simple HTML tags? I'm not looking for something that is necessarily
> > > 100% safe/tested, but something reasonable that will work. It needs to
> > > use the regular C regexp set of calls, not Perl extensions.
> > >
> > > For example: "<em>Bold</em> type" should substitute to "Bold type"
> > >
> >
> > Doing some experimentation, I see that Perl quantifiers are normally
> > greedy, but if you append ? to a quantifier it turns that off. So, this
> > should remove all html tags from a file:
> >
> > perl -pi -e 's/<.*?>//g' [filename]
> >
> > I have tested this and it seems to work for me. YMMV.
>
> Unfortunately, the non-greedy operator (the question mark) is not
> standard in the C library regexp() call, which I'm using. However, the
> following accomplishes something similar (my thanks to 'scalar' on IRC):
> s/<[^>]+>//g
>
> This doesn't take into account cases where a > character might be quoted
> within a value inside an HTML tag, but I don't need to worry about that
> for my simple application.
>
> Thanks for the help everyone (both here and on IRC).
Sorry to come in late on this one. I have a library of such clever
things. Here is a small sed statement that does the same thing. Looks
very familiar....
# This sed statement prints the file to stdout with all html tags removed
# (the file itself is left unchanged)
sed -e 's/<[^>]*>//g' myfile.html
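For the C regexp calls asked about in the original question, the same
pattern could be applied roughly as follows. This is only a sketch,
assuming the POSIX regcomp()/regexec() interface from <regex.h> is the one
in use; strip_tags is a made-up helper name, not something from this
thread. It uses the <[^>]+> form quoted above, so an empty "<>" is left
alone, whereas the sed version with * would remove it.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the thread): returns a newly malloc'd copy
 * of `in` with every match of the POSIX ERE <[^>]+> removed, or NULL on
 * failure. The caller frees the result. */
char *strip_tags(const char *in)
{
    regex_t re;
    regmatch_t m;
    char *out, *dst;
    const char *src = in;

    if (regcomp(&re, "<[^>]+>", REG_EXTENDED) != 0)
        return NULL;

    out = malloc(strlen(in) + 1);   /* output is never longer than input */
    if (out == NULL) {
        regfree(&re);
        return NULL;
    }
    dst = out;

    /* Each match is a tag: copy the text before it, then skip past it. */
    while (regexec(&re, src, 1, &m, 0) == 0) {
        memcpy(dst, src, (size_t)m.rm_so);
        dst += m.rm_so;
        src += m.rm_eo;
    }
    strcpy(dst, src);               /* copy whatever follows the last tag */

    regfree(&re);
    return out;
}
```

Calling strip_tags("<em>Bold</em> type") should hand back "Bold type".
Like the one-liners above, it does not cope with a > quoted inside an
attribute value.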
Jon