[TriLUG] simple regular expression to strip HTML?

Wed Feb 18 23:42:14 EST 2004

On Wed, 2004-02-18 at 23:23, Jeremy Portzer wrote:
> On Wed, 18 Feb 2004, Tanner Lovelace wrote:
> 
> > Jeremy Portzer said the following on 2/18/04 9:28 PM:
> > 
> > > Does anyone know of a quick-and-dirty regular expression that will strip
> > > simple HTML tags?  I'm not looking for something that is necessarily
> > > 100% safe/tested, but something reasonable that will work.  It needs to
> > > use the regular C regexp set of calls, not Perl extensions.
> > > 
> > > For example:  "<em>Bold</em> type" should substitute to "Bold type"
> > > 
> > 
> > Doing some experimentation, I see that Perl's quantifiers are greedy
> > by default, but if you follow a quantifier with ? it becomes
> > non-greedy.  So, this should remove all HTML tags from a file:
> > 
> > perl -pi -e 's/<.*?>//g' [filename]
> > 
> > I have tested this and it seems to work for me.  YMMV.
> 
> Unfortunately, the non-greedy operator (the question mark) is not 
> standard to the C library regexp() call, which I'm using.  However, the 
> following accomplishes something similar (my thanks to 'scalar' on IRC):
> 	s/<[^>]+>//g
> 
> This doesn't take into account cases where a > character might be quoted 
> within a value inside an HTML tag, but I don't need to worry about that 
> for my simple application.
> 
> Thanks for the help everyone (both here and on IRC).

Sorry to come in late on this one.  I have a library of such clever
things.  Here is a small sed statement that does the same thing.  Looks
very familiar....

   # This sed statement prints the file to stdout with all HTML tags
   # removed (the file itself is not modified)
   sed -e 's/<[^>]*>//g' myfile.html
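As a rough sketch of why the negated character class matters, the snippet below runs the question's example through three cases: the [^>] pattern, a greedy <.*> for comparison, and the quoted '>' limitation mentioned earlier in the thread. The function name strip_tags is only an illustrative helper, not something from the thread, and the sample strings other than the one from the original question are made up for the demonstration.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper name used for illustration only.
# It reads stdin and writes the text to stdout with simple tags removed.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# The example from the original question.
# prints: Bold type
printf '%s\n' '<em>Bold</em> type' | strip_tags

# For comparison, a greedy <.*> matches from the first '<' to the last '>'
# on the line, so the text between the tags is removed as well.
# prints: " type" (with a leading space, without the quotes)
printf '%s\n' '<em>Bold</em> type' | sed -e 's/<.*>//g'

# Known limitation: a '>' inside a quoted attribute value ends the match
# early, leaving part of the tag behind.
# prints: ' y">link' (with a leading space, without the single quotes)
printf '%s\n' '<a title="x > y">link</a>' | strip_tags
```

Since sed works one line at a time, a tag that is split across two lines would also be left alone by this pattern.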

Jon