[TriLUG] simple regular expression to strip HTML?

Tanner Lovelace lovelace at wayfarer.org
Thu Feb 19 01:33:16 EST 2004


William Sutton said the following on 2/19/04 12:18 AM:

> Yeah, as Jeremy pointed out in a subsequent reply, you really wanted
> 
> s/<[^>]+>//g
> 
> rather than
> 
> s/<.*?>//g
> 
> The reason is the latter expression will nuke EVERYTHING, including > 
> characters.  The first expression breaks out like this (for those that 
> don't do Perl regex):

It's not true that it will nuke "EVERYTHING".  Without the ?, yes:
.* is greedy and runs through to the last > on the line.  With the
question mark it is non-greedy and stops at the first >, so the two
are almost equivalent.  The difference is that with your regex you
*must* have at least one character between the brackets for it to
match.  If you changed the + to a * (like Jon C did in his regex)
they would match the same text, at least in perl (where there's
*always* more than one way of doing things. :-).  The one exception
is a tag that spans a line break: . doesn't match a newline unless
you add /s, while [^>] does.
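
For anyone who wants to see the difference rather than take my word
for it, here is a rough sketch.  It uses Python's re module as a
stand-in for perl (an assumption on my part that it is close enough;
for these four patterns the greedy, non-greedy and negated-class
behavior is the same).  The sample strings are made up for
illustration.

```python
import re

# Made-up sample: two real tags, an empty "<>", and a stray ">" in the text.
sample = "<b>bold</b> and <> and 1 > 0"

# Greedy: .* runs to the LAST ">" on the line, so nearly everything goes.
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy: .*? stops at the FIRST ">" after each "<".
lazy = re.sub(r"<.*?>", "", sample)

# Negated class with "+": needs at least one character between the brackets,
# so the empty "<>" survives.
plus = re.sub(r"<[^>]+>", "", sample)

# Negated class with "*": also strips the empty "<>", same result as lazy.
star = re.sub(r"<[^>]*>", "", sample)

print(repr(greedy))  # ' 0'
print(repr(lazy))    # 'bold and  and 1 > 0'
print(repr(plus))    # 'bold and <> and 1 > 0'
print(repr(star))    # 'bold and  and 1 > 0'

# A tag split across two lines: "." does not cross the newline by default,
# but [^>] does.
multiline = "<a\nhref='x'>link"
lazy_multiline = re.sub(r"<.*?>", "", multiline)
star_multiline = re.sub(r"<[^>]*>", "", multiline)

print(repr(lazy_multiline))  # "<a\nhref='x'>link"  (unchanged)
print(repr(star_multiline))  # 'link'
```

The stray ">" in "1 > 0" is what makes the greedy version so
destructive, and the empty "<>" is the only place the + and *
versions disagree.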

Jeremy, sorry for missing your last line that mentioned no perl.  Doh!
Still, I'd be interested to know what the standard "greedy" behavior
of the C regex calls is.  Do you know offhand?  If it isn't greedy,
then removing the ? from my regex should make it the same as the
other ones listed.

Cheers,
Tanner
-- 
Tanner Lovelace       | Don't move! Or I'll fill ya full of... little
lovelace at wayfarer.org | yellow bolts of light! - Commander John Crichton


