[TriLUG] simple regular expression to strip HTML?
Jeremy Portzer
jeremyp at pobox.com
Wed Feb 18 23:23:37 EST 2004
On Wed, 18 Feb 2004, Tanner Lovelace wrote:
> Jeremy Portzer said the following on 2/18/04 9:28 PM:
>
> > Does anyone know of a quick-and-dirty regular expression that will strip
> > simple HTML tags? I'm not looking for something that is necessarily
> > 100% safe/tested, but something reasonable that will work. It needs to
> > use the regular C regexp set of calls, not Perl extensions.
> >
> > For example: "<em>Bold</em> type" should substitute to "Bold type"
> >
>
> Doing some experimentation, I see that perl is normally greedy, but
> if you postpend a quantifier with ? it turns that off. So, this
> should remove all html tags from a file:
>
> perl -pi -e 's/<.*?>//g' [filename]
>
> I have tested this and it seems to work for me. YMMV.

Unfortunately, the non-greedy operator (the question mark) is not part of
the standard C library regexp calls, which I'm using. However, the
following accomplishes something similar (my thanks to 'scalar' on IRC):

    s/<[^>]+>//g

This doesn't take into account cases where a > character might be quoted
within a value inside an HTML tag, but I don't need to worry about that
for my simple application.
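
A minimal sketch of how the pattern above might be applied from C, assuming
the POSIX regcomp()/regexec() interface from <regex.h> (the original
application's exact regexp API is not stated). The function name strip_tags
is hypothetical. REG_EXTENDED is used because in POSIX basic syntax an
unescaped + is a literal character rather than a quantifier.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc'd copy of `in` with every
 * match of <[^>]+> removed, or NULL on error. The caller frees the result. */
char *strip_tags(const char *in)
{
    regex_t re;
    regmatch_t m;

    /* Extended syntax so that + means "one or more". */
    if (regcomp(&re, "<[^>]+>", REG_EXTENDED) != 0)
        return NULL;

    /* The output can never be longer than the input. */
    char *out = malloc(strlen(in) + 1);
    if (out == NULL) {
        regfree(&re);
        return NULL;
    }

    size_t len = 0;
    const char *p = in;

    /* Each match is at least three characters long, so p always advances. */
    while (regexec(&re, p, 1, &m, 0) == 0) {
        memcpy(out + len, p, (size_t)m.rm_so);  /* text before the tag */
        len += (size_t)m.rm_so;
        p += m.rm_eo;                           /* skip over the tag */
    }

    strcpy(out + len, p);                       /* text after the last tag */
    regfree(&re);
    return out;
}
```

With this sketch, strip_tags("<em>Bold</em> type") should yield "Bold type".
An input such as <a title="x > y">link</a> is cut at the first > and leaves
 y">link behind, which is the quoted-value limitation described above.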

Thanks for the help everyone (both here and on IRC).

--Jeremy

--
/---------------------------------------------------------------------\
| Jeremy Portzer jeremyp at pobox.com trilug.org/~jeremy |
| GPG Fingerprint: 712D 77C7 AB2D 2130 989F E135 6F9F F7BC CC1A 7B92 |
\---------------------------------------------------------------------/