[TriLUG] simple regular expression to strip HTML?
Tanner Lovelace
lovelace at wayfarer.org
Wed Feb 18 22:57:39 EST 2004
Jeremy Portzer said the following on 2/18/04 9:28 PM:
> I would have posted this to the dev@ list, but we've discontinued it...
> :-|
That's ok, I'd rather have it here anyway.
> Does anyone know of a quick-and-dirty regular expression that will strip
> simple HTML tags? I'm not looking for something that is necessarily
> 100% safe/tested, but something reasonable that will work. It needs to
> use the regular C regexp set of calls, not Perl extensions.
>
> For example: "<em>Bold</em> type" should substitute to "Bold type"
>
Well, if you want to remove everything between brackets, you could
try this:
s/<.*>//g
But I don't remember offhand whether that will be greedy or not. I think
it depends on what you call it from (perl, sed, awk, ed, etc.).
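Here is a quick check of the greediness question, sketched with sed on the example string from Jeremy's post (assuming a POSIX-style sed on your PATH; the variable names are just for illustration). POSIX regexes take the longest match, so .* runs from the first < to the last > on the line. A negated bracket expression gets around that without any Perl extensions, and the same pattern should be usable with the C regcomp()/regexec() calls:

```shell
# Greedy: .* runs from the first '<' to the last '>' on the line,
# so the text between the two tags is swallowed as well.
greedy=$(printf '%s\n' '<em>Bold</em> type' | sed 's/<.*>//g')

# Portable alternative: [^>]* cannot cross a '>', so each tag is
# matched on its own and the text between tags survives.
stripped=$(printf '%s\n' '<em>Bold</em> type' | sed 's/<[^>]*>//g')

printf '[%s]\n[%s]\n' "$greedy" "$stripped"
```

The first result keeps only " type"; the second gives "Bold type", which is what the original question asked for.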
Doing some experimentation, I see that perl is normally greedy, but
if you follow a quantifier with ? it turns that off. So, this
should remove all HTML tags from a file:
perl -pi -e 's/<.*?>//g' [filename]
I have tested this and it seems to work for me. YMMV.
For those who aren't familiar with regexes, here's what it means:
perl - run the perl executable (Duh :)
-p - Assume a standard loop around command line specified code
-i - edit in place
-e - execute the following code
Code:
s - This is a substitution regular expression
/ - The next characters are the pattern to find.
< - This is the first letter of the pattern.
. - Match any character ...
* - ... 0 or more times
? - Don't be greedy in matching (i.e. stop gobbling up chars
as soon as you find the next character specified, rather than
when you find the last occurrence of that character).
> - This is the last character of the pattern.
/ - End of the find pattern, start of the replace pattern.
/ - End of the replace pattern (note we're replacing with nothing)
g - Do this for all such patterns, not just the first one on each line.
You then specify all the files you want to operate on. Perl,
because of the -p switch, will read each of them and feed it one
line at a time to the code we've specified. The -i will do the
file editing in-place. You can also specify a file extension
(like -i.orig) and it will back up the original file before doing
the edit.
Jeremy, will that do what you want?
Cheers,
Tanner
--
Tanner Lovelace | Don't move! Or I'll fill ya full of... little
lovelace at wayfarer.org | yellow bolts of light! - Commander John Crichton