[TriLUG] [hopefully] quickie help with a regex

Kevin Hunter hunteke at earlham.edu
Mon Aug 9 00:38:01 EDT 2010


At 8:35pm -0600 Sun, 08 Aug 2010, Warren Myers wrote:
> (technically this text is in XML/HTML documents (and yes, I know
> regexes are bad for HTML, but in this instance, it's what I
> need)),

You know the details, of course, but in my experience with folks
who've asked me for help with HTML and regexes, literally 19 times out
of 20 a regex was not the approach they really wanted (just the first
one that came to mind).
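
For comparison, here is a minimal sketch of the parser-based approach,
assuming the quoted text of interest is attribute values in well-formed
XML.  The sample document and variable names are made up for
illustration, not taken from your files; DOMDocument is PHP's bundled
DOM extension.

```php
<?php
// Hypothetical sample document, not from the original thread.
$xml = "<root><element attr1='asdf' attr2='jkl'/></root>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// Collect every attribute value of every <element> node.
$values = array();
foreach ($doc->getElementsByTagName('element') as $el) {
    foreach ($el->attributes as $attr) {
        $values[$attr->name] = $attr->value;
    }
}

print_r($values);
```

The parser deals with quoting and entity decoding itself, so none of
the escaping questions further down come up.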

> I'm looking to match text inside single quotes using PHP [...]but
> am having a little trouble with the formatting.
>
> I *think* what I want is:
>      [\'][.]*]\']
               ^  ^
Thomas already pointed out the possibly mismatched brackets of the 
character class.
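
One more thing that may explain the "only sometimes" behavior: inside a
character class, '.' is a literal period, so '[.]*' matches only a run
of dots (or nothing).  A small sketch, using a made-up test string:

```php
<?php
// Hypothetical test string, not from the original thread.
$str = "attr='asdf' dots='...'";

// Inside a character class, '.' is a literal period.
preg_match("/'[.]*'/", $str, $literal);

// Outside a class, '.' matches any character except newline.
preg_match("/'.*?'/", $str, $any);

echo $literal[0], "\n";
echo $any[0], "\n";
```

The first pattern skips 'asdf' entirely and only finds the quoted dots;
the second finds 'asdf'.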

> Is that right, or am I way off? It seems to only match sometimes.

It's difficult to say if you're way off without seeing the larger
context of the problem.  Regexes are difficult not in concept but in
implementation, because the smallest detail can keep them from matching
what you intended.  The question you need to ask is "What -- /exactly/ --
am I trying to match?"  Are embedded quotes accepted?  How about
embedded newlines?  Do I want the values of XML element attributes?

Does this script help elucidate anything for you?

-----
$ cat test.php
#!/usr/bin/php
<?php

$str = "<element attr1='asdf\'jkl' attr2='asd\nf'>";

echo "Regexes against string: -->$str<--\n";

$regexes = array(
	"Match w/ greedy regex"        => "/'.*'/",
	"Match with nongreedy regex"   => "/'.*?'/",
	"Match embedded newline"       => "/'[^']*'/",
	  # won't reach the newline in the current string (the match
	  # stops at the first closing quote), but try removing every
	  # single quote in $str except the first and last

	"Match attributes w/ anchor"   => "/='.*?'/",
	"Match w/ embedded quote"      => "/'.*?(?<!\\\\)'/",
	  # That's really '(?<!\\)', but the pattern is interpreted
	  # twice (once by PHP's string parser, once by the regex
	  # engine), so the backslash needs escaping not once, but twice.

	  # The 'negative lookbehind assertion' keeps an escaped
	  # quote (\') from ending the match.  Not perfect: for
	  # example, the regex would miss the final quote of
	  # "'asdf\\'", where the backslash is itself escaped.
);

foreach ( $regexes as $description => $regex ) {
	echo "\n$description\n";
	if ( preg_match( $regex, $str, $matches ) ) {
		foreach ( $matches as $k => $v) {
			echo "$k -> $v\n";
		}
	}
}
?>

$ php test.php
# ...
-----
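
If backslash escapes do matter to you, one common way around the
lookbehind's blind spot is to consume escape sequences as units, and
preg_match_all() will return every match rather than just the first.
This is only a sketch under my own assumptions: the attr3 value is added
for illustration, and backslash is assumed to be the escape character.

```php
<?php
// Same string as test.php, plus a hypothetical attr3 whose value
// ends in an escaped backslash.
$str = "<element attr1='asdf\\'jkl' attr2='asd\nf' attr3='asdf\\\\'>";

// After PHP's string parsing the pattern is: '(?:[^'\\]|\\.)*'
// i.e. a quote, then any run of non-quote/non-backslash characters
// or backslash-plus-any-character pairs, then the closing quote.
// The /s flag lets an escaped newline count as a pair too.
$regex = "/'(?:[^'\\\\]|\\\\.)*'/s";

preg_match_all($regex, $str, $matches);

foreach ($matches[0] as $k => $v) {
    echo "$k -> $v\n";
}
```

Because each backslash is consumed together with the character after
it, an escaped quote never ends the match, and an escaped backslash
just before the closing quote no longer hides it.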

Cheers,

Kevin


