[TriLUG] simple regular expression to strip HTML?

Tanner Lovelace lovelace at wayfarer.org
Wed Feb 18 22:57:39 EST 2004


Jeremy Portzer said the following on 2/18/04 9:28 PM:

> I would have posted this to the dev@ list, but we've discontinued it...
> :-|

That's ok, I'd rather have it here anyway.

> Does anyone know of a quick-and-dirty regular expression that will strip
> simple HTML tags?  I'm not looking for something that is necessarily
> 100% safe/tested, but something reasonable that will work.  It needs to
> use the regular C regexp set of calls, not Perl extensions.
> 
> For example:  "<em>Bold</em> type" should substitute to "Bold type"
> 

Well, if you want to remove everything between brackets, you could
try this:

s/<.*>//g

But I don't remember offhand whether that will be greedy or not.  I
think it depends on what you call it from (perl, sed, awk, ed, etc.)
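
A quick way to check is to feed the example string through sed.  This
is only a sketch: strip_greedy is a made-up helper name, and it assumes
a POSIX-style sed.  As far as I know, * is greedy in sed, awk and ed as
well, so the match runs from the first < to the last > on the line:

```shell
# Sketch: watch the greedy match swallow the text between two tags.
# strip_greedy is a hypothetical helper; it filters stdin to stdout.
strip_greedy() {
    sed 's/<.*>//g'
}

# "<em>Bold</em>" is matched as one piece, so only " type" is printed.
printf '%s\n' '<em>Bold</em> type' | strip_greedy
```

The word "Bold" disappears along with the tags, which is not what
the example in the question asks for.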

Doing some experimentation, I see that perl is normally greedy, but
if you follow a quantifier with ? it turns that off.  So, this
should remove all HTML tags from a file:

perl -pi -e 's/<.*?>//g' [filename]

I have tested this and it seems to work for me.  YMMV.
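
Since the question asked for the regular C regexp calls rather than
Perl extensions: the *? form is, as far as I know, a Perl extension
that POSIX regcomp()/regexec() don't support.  A negated character
class gets a similar result with ordinary greedy matching.  The sketch
below uses sed as a stand-in for a POSIX regex engine; strip_tags is a
made-up helper name, and the same pattern string should be usable from
C, though I haven't tried that here:

```shell
# Sketch: strip tags without a non-greedy quantifier.
# <[^>]*> means "<", then any run of characters that are not ">",
# then ">", so each match stops at the first closing bracket.
# strip_tags is a hypothetical helper; it filters stdin to stdout.
strip_tags() {
    sed 's/<[^>]*>//g'
}

printf '%s\n' '<em>Bold</em> type' | strip_tags            # Bold type
printf '%s\n' '<b>one</b> and <i>two</i>' | strip_tags     # one and two
```

Like the perl version, this works one line at a time, and it will be
fooled by a literal > inside an attribute value.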

For those who aren't familiar with regexes, here's what it means:

perl - run the perl executable (Duh :)
-p - Assume a standard loop around command line specified code
-i - edit in place
-e - execute the following code

Code:

s - This is a substitution regular expression
/  - The next characters are the pattern to find.
<  - This is the first character of the pattern.
.  - Match any character ...
*  -    ... 0 or more times
?  - Don't be greedy in matching (i.e. stop gobbling up chars as soon
      as you find the next character specified, rather than when you
      find the last occurrence of it).
 >  - This is the last character of the pattern.
/  - End of the find pattern, start of the replace pattern.
/  - End of the replace pattern (note we're replacing with nothing)
g  - Do this for all such patterns, not just the first one on each line.

You then specify all the files you want to operate on.  Because of
the -p switch, Perl will read each of them and feed it one line at a
time to the code we've specified (so a tag that is split across two
lines won't be matched).  The -i will do the file editing in-place.
You can also specify a file extension (like -i.orig) and it will back
up the original file before doing the edit.

Jeremy, will that do what you want?

Cheers,
Tanner
-- 
Tanner Lovelace       | Don't move! Or I'll fill ya full of... little
lovelace at wayfarer.org | yellow bolts of light! - Commander John Crichton


