[TriLUG] simple regular expression to strip HTML?
    Tanner Lovelace 
    lovelace at wayfarer.org
       
    Wed Feb 18 22:57:39 EST 2004
    
    
  
Jeremy Portzer said the following on 2/18/04 9:28 PM:
> I would have posted this to the dev@ list, but we've discontinued it...
> :-|
That's ok, I'd rather have it here anyway.
> Does anyone know of a quick-and-dirty regular expression that will strip
> simple HTML tags?  I'm not looking for something that is necessarily
> 100% safe/tested, but something reasonable that will work.  It needs to
> use the regular C regexp set of calls, not Perl extensions.
> 
> For example:  "<em>Bold</em> type" should substitute to "Bold type"
> 
Well, if you want to remove everything between brackets, you could
try this:
s/<.*>//g
But I don't remember offhand if that will be greedy or not, and whether
you can change that depends on what you call it from (perl, sed, awk,
ed, etc.).  If it is greedy, it will match from the first < to the last >
on the line and eat the text in between.
Doing some experimentation, I see that perl is normally greedy, but
if you follow a quantifier with ? it turns that off.  So, this
should remove all HTML tags from a file:
perl -pi -e 's/<.*?>//g' [filename]
I have tested this and it seems to work for me.  YMMV.
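Since the original question asked for something without Perl extensions,
here is a rough sketch of the same idea using only POSIX syntax.  The
non-greedy ? is a Perl extension, but a negated character class gets a
similar result: [^>]* can't run past a '>', so each tag is matched on its
own.  The function names strip_greedy and strip_tags are made up for
illustration, and sed is just standing in as a convenient POSIX regex tool.

```shell
# Hypothetical helper names, for illustration only.

# Greedy: <.*> runs from the first '<' to the LAST '>' on the line.
strip_greedy() {
    sed 's/<.*>//g'
}

# Negated class: [^>]* cannot cross a '>', so each tag is matched separately.
strip_tags() {
    sed 's/<[^>]*>//g'
}

printf '%s\n' '<em>Bold</em> type' | strip_greedy   # prints " type"
printf '%s\n' '<em>Bold</em> type' | strip_tags     # prints "Bold type"
```

The pattern <[^>]*> uses nothing beyond basic POSIX regex syntax, so it
should carry over to the C regexp calls too, though I have only sketched
it with sed here.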
For those who aren't familiar with regexes, here's what that perl command means:
perl - run the perl executable (Duh :)
-p - Assume a standard loop around command line specified code
-i - edit in place
-e - execute the following code
Code:
s - This is a substitution regular expression
/  - The next characters are the pattern to find.
<  - This is the first character of the pattern.
.  - Match any character ...
*  -    ... 0 or more times
?  - Don't be greedy in matching (i.e. stop gobbling up chars
      at the first occurrence of the next character specified rather
      than at the last one).
>  - This is the last character of the pattern.
/  - End of the find pattern, start of the replace pattern.
/  - End of the replace pattern (note we're replacing with nothing)
g  - Do this for all such patterns, not just the first one on each line.
You then specify all the files you want to operate on.  Perl,
because of the -p switch, will read each one and feed it one
line at a time to the code we've specified.  The -i will do the
file editing in-place.  You can also specify a file extension
(like -i.orig) and it will back up the original file before doing
the edit.
Jeremy, will that do what you want?
Cheers,
Tanner
-- 
Tanner Lovelace       | Don't move! Or I'll fill ya full of... little
lovelace at wayfarer.org | yellow bolts of light! - Commander John Crichton
    
    