[TriLUG] simple regular expression to strip HTML?

Thu Feb 19 00:18:43 EST 2004

Yeah, as Jeremy pointed out in a subsequent reply, you really wanted

s/<[^>]+>//g

rather than

s/<.*?>//g

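The reason is that the ? after the * is a Perl extension, which the regular 
C regex calls don't support.  Without it, .* is greedy and will nuke 
EVERYTHING up to the last > on the line, including any > characters in 
between.  The first expression breaks out like this (for those who don't 
do Perl regex):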

s/       //	substitute
  <		any open angle bracket
   [  ]		a character class
    ^		caret inside a character class negates the class
     >		that is, we exclude closing angle brackets
       +	match one or more occurrences of the expression
        >	until we get to a closed angle bracket
           g	perform the replacement globally

In other words, for anything beginning with a <, delete the <, then delete 
any non-> characters, then delete the trailing > character, and repeat the 
process globally.
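
Here is a quick way to see the difference for yourself.  This is only a 
sketch: it uses sed -E (POSIX extended regex) as a stand-in for the C regex 
calls, and the function names strip_tags and strip_greedy are made up for 
the illustration.

```shell
# Negated character class: each match stops at the first > it meets.
strip_tags() {
    sed -E 's/<[^>]+>//g'
}

# Greedy dot-star, for comparison: the match runs to the LAST > on the line.
strip_greedy() {
    sed -E 's/<.*>//g'
}

printf '%s\n' '<em>Bold</em> type' | strip_tags     # prints "Bold type"
printf '%s\n' '<em>Bold</em> type' | strip_greedy   # prints " type"
```

The character class version needs no Perl extensions, so the same pattern 
should carry over to regcomp()/regexec() with REG_EXTENDED.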

William

On Wed, 18 Feb 2004, Tanner Lovelace wrote:

> Jeremy Portzer said the following on 2/18/04 9:28 PM:
> 
> > I would have posted this to the dev@ list, but we've discontinued it...
> > :-|
> 
> That's ok, I'd rather have it here anyway.
> 
> > Does anyone know of a quick-and-dirty regular expression that will strip
> > simple HTML tags?  I'm not looking for something that is necessarily
> > 100% safe/tested, but something reasonable that will work.  It needs to
> > use the regular C regexp set of calls, not Perl extensions.
> > 
> > For example:  "<em>Bold</em> type" should substitute to "Bold type"
> > 
> 
> Well, if you want to remove everything between brackets, you could
> try this:
> 
> s/<.*>//g
> 
> But, I don't remember offhand if that will be greedy or not.  I think
> it depends on what you call it from (perl, sed, awk, ed, etc...)
> 
> Doing some experimentation, I see that perl is normally greedy, but
> if you follow a quantifier with ? it turns that off.  So, this
> should remove all html tags from a file:
> 
> perl -pi -e 's/<.*?>//g' [filename]
> 
> I have tested this and it seems to work for me.  YMMV.
> 
> For those who aren't familiar with regexes, here's what it means:
> 
> perl - run the perl executable (Duh :)
> -p - Assume a standard loop around command line specified code
> -i - edit in place
> -e - execute the following code
> 
> Code:
> 
> s - This is a substitution regular expression
> /  - The next characters are the pattern to find.
> <  - This is the first letter of the pattern.
> .  - Match any character ...
> *  -    ... 0 or more times
> ?  - Don't be greedy in matching (i.e. stop gobbling up chars
>       as soon as you find the next character specified rather than
>       when you find the last one of the next character specified).
>  >  - This is the last character of the pattern.
> /  - End of the find pattern, start of the replace pattern.
> /  - End of the replace pattern (note we're replacing with nothing)
> g  - Do this for all such patterns, not just the first one on each line.
> 
> You then specify all the files you want to operate on.  Perl,
> because of the -p switch will read them all in and feed them one
> line at a time to the code we've specified.  The -i will do the
> file editing in-place.  You can also specify a file extension
> (like -i.orig) and it will back up the original file before doing
> the edit.
> 
> Jeremy, will that do what you want?
> 
> Cheers,
> Tanner
>