[TriLUG] simple regular expression to strip HTML?
William Sutton
william at trilug.org
Thu Feb 19 00:18:43 EST 2004
Yeah, as Jeremy pointed out in a subsequent reply, you really wanted
s/<[^>]+>//g
rather than
s/<.*?>//g
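The reason is that . matches ANY character, including > characters, so the
latter expression only stops at the right place because of the non-greedy ?,
and that is a Perl extension you can't use with the C regexp calls. Leave the
? off and it will nuke EVERYTHING from the first < to the last >. The first
expression breaks out like this (for those that don't do Perl regex):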
s/       //   substitute
  <           any open angle bracket
   [  ]       a character class
    ^         caret inside a character class negates the class
     >        that is, we exclude closing angle brackets
       +      operate on one or more occurrences of the expression
        >     until we get to a closed angle bracket
           g  perform the replacement globally
In other words, for anything beginning with a <, delete the <, then delete
any non-> characters, then delete the trailing > character, and repeat the
process globally.
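
For anyone who wants to try this without Perl, here is a minimal sketch using
sed, whose basic regular expressions belong to the same POSIX family as the C
regexp calls. The function name strip_tags is made up for illustration, and it
uses * instead of + because + is not portable in sed's basic syntax; the only
difference is that an empty <> gets removed too.

```shell
# Hypothetical helper (name is an assumption): strip simple HTML tags from stdin.
strip_tags() {
    # <      an open angle bracket
    # [^>]*  zero or more characters that are not a closing bracket
    # >      the closing angle bracket
    sed 's/<[^>]*>//g'
}

# Jeremy's example; this prints: Bold type
printf '%s\n' '<em>Bold</em> type' | strip_tags
```

sed works one line at a time, so a tag that is split across two lines is left
alone.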
William
On Wed, 18 Feb 2004, Tanner Lovelace wrote:
> Jeremy Portzer said the following on 2/18/04 9:28 PM:
>
> > I would have posted this to the dev@ list, but we've discontinued it...
> > :-|
>
> That's ok, I'd rather have it here anyway.
>
> > Does anyone know of a quick-and-dirty regular expression that will strip
> > simple HTML tags? I'm not looking for something that is necessarily
> > 100% safe/tested, but something reasonable that will work. It needs to
> > use the regular C regexp set of calls, not Perl extensions.
> >
> > For example: "<em>Bold</em> type" should substitute to "Bold type"
> >
>
> Well, if you want to remove everything between brackets, you could
> try this:
>
> s/<.*>//g
>
> But, I don't remember offhand if that will be greedy or not. I think
> it depends on what you call it from (perl, sed, awk, ed, etc...)
>
> Doing some experimentation, I see that perl is normally greedy, but
> if you follow a quantifier with ? it turns that off. So, this
> should remove all HTML tags from a file:
>
> perl -pi -e 's/<.*?>//g' [filename]
>
> I have tested this and it seems to work for me. YMMV.
>
> For those who aren't familiar with regexes, here's what it means:
>
> perl - run the perl executable (Duh :)
> -p - Assume a standard loop around command line specified code
> -i - edit in place
> -e - execute the following code
>
> Code:
>
> s - This is a substitution regular expression
> / - The next characters are the pattern to find.
> < - This is the first letter of the pattern.
> . - Match any character ...
> * - ... 0 or more times
> ? - Don't be greedy in matching (i.e. stop gobbling up chars
> as soon as you find the next character specified rather than
> when you find the last one of the next character specified).
> > - This is the last character of the pattern.
> / - End of the find pattern, start of the replace pattern.
> / - End of the replace pattern (note we're replacing with nothing)
> g - Do this for all such patterns, not just the first one on each line.
>
> You then specify all the files you want to operate on. Perl,
> because of the -p switch, will read them all in and feed them one
> line at a time to the code we've specified. The -i will do the
> file editing in-place. You can also specify a file extension
> (like -i.orig) and it will backup the original file before doing
> the edit.
>
> Jeremy, will that do what you want?
>
> Cheers,
> Tanner
>
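
On Tanner's question of whether s/<.*>//g is greedy: a quick sketch with sed,
which uses POSIX basic regular expressions, suggests that outside Perl it is.
The variable names are assumptions made up for illustration.

```shell
line='<em>Bold</em> type'

# Greedy: .* runs on to the LAST > on the line, so "Bold" is deleted too.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')

# Negated class: each match stops at the first >, so only the tags go.
negated=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'greedy:  [%s]\n' "$greedy"    # prints: greedy:  [ type]
printf 'negated: [%s]\n' "$negated"   # prints: negated: [Bold type]
```

POSIX matching always takes the longest match, and there is no ? switch to turn
that off, which is why the negated character class is the portable choice.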