[TriLUG] A curious regular expression

William Sutton william at trilug.org
Tue Apr 10 00:07:08 EDT 2007


> >
> > Potential matches include:
> > .#&%
> > *#&%
> > .@#&%
> > *@@#&%
> 
> Huh?  I missed the first part of the thread, but which REGEX language are you 
> talking about?  If we're talking Perlish regex, don't the brackets make it a 
> character class?  That is, the {,99} doesn't indicate a quantifier, it just 
> adds the characters '{', ',', '9', and '}' to the character class.

{x,y} in perl regex is in fact a quantifier; empirical tests show that 
when x isn't specified, it is treated as '1'; thus, we're talking about 
1-99 (inclusive) sequential '@' characters.

Short perl program to illustrate:

#####
use strict;
use warnings;

my @strings = (
               '#&%',      # no leading .*, no @*
               '@#&%',     # no leading .*, has @*
               '+#&%',     # leading .*, no @*
               '+@#&%',    # leading .*, has @*
              );

foreach my $string (@strings)
{
    # report whether the pattern is found anywhere in $string
    my $result = $string =~ m/[.*\@{,99}]#&%/ ? "matches" : "does not match";
    print "string $string $result\n";
}
#####

Now then, the regex (in perl, dunno what regex language was originally 
being used) is as follows:

[		# character class
	.*	# 0 or more characters
	\@{,99}	# and 1-99 '@' characters
]		# end character class
#&%		# followed by '#&%'

in other words, you must have at least a single '@' somewhere in the 
character class; before or after .* doesn't matter; can have 0 or more 
characters (of unspecified value) either before or after the '@' 
character(s), and the string has to also contain '#&%' following the 
character class.
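
A quick cross-check, sketched in Python rather than Perl (so it says 
nothing definitive about Perl itself): if the braces act as a quantifier 
on '@', a string with no '@' in front of '#&%' should not match; if the 
brackets turn everything inside them into a plain character class, then 
'9#&%', ',#&%' and '}#&%' should match.  The probe strings and the 
probe() helper are made up for this illustration.

```python
import re

# Pattern from the thread, as a raw string so "\@" reaches the regex engine unchanged.
PATTERN = re.compile(r"[.*\@{,99}]#&%")

# Hypothetical probe strings: none of them contains an '@'.
PROBES = ["9#&%", ",#&%", "}#&%", "+#&%"]


def probe(strings):
    """Map each string to True if the pattern is found anywhere in it."""
    return {s: PATTERN.search(s) is not None for s in strings}


if __name__ == "__main__":
    for s, found in probe(PROBES).items():
        print(f"{s!r}: {'matches' if found else 'does not match'}")
```

In Python's re module the first three probes match and '+#&%' does not, 
which is what the character-class reading predicts.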

William

> 
> #---------
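> #!/usr/bin/env python
> import re
> # raw string so the backslash in \@ is passed to the regex engine as-is
> regex = re.compile(r"[.*\@{,99}]#&%")
> for s in [".#&%", "*#&%", ".@#&%", "*@@#&%", "{#&%"]:
>     # regex.match only succeeds if the match starts at the first character
>     if regex.match(s) is not None:
>         print(s, "matched!")
>     else:
>         print(s, "didn't match.")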
> #---------
> 
> .#&% matched!
> *#&% matched!
> .@#&% didn't match.
> *@@#&% didn't match.
> {#&% matched!
> 
> Of course, I'm cheating a bit there.  regex.search actually matches all of 
> those strings.  regex.match forces the match to start at the start of the 
> string that we're checking.  I did that because you said...
> 
> > or any combination of the leading . and * followed by 0 to 99 @'s, 
> > then followed by the string #&%.
> 
> See?  They'll still match the regex, but the 0 to 99 @'s isn't relevant.  The 
> regex [.*\@{,99}]#&% matches the string that consists of one of the following 
> characters
> . * @ { , 9 }
> followed by the string
> #&%
> 
> So, 99 @'s followed by #&% would be a match, but only on the final @#&%.  The 
> previous 98 @'s aren't part of the matching text.  You could just as easily 
> have had a string like 
> s = "Some long sentence with exactly ninety-eight characters in it that 
> doesn't match the regex itself @#&%"
> 
> So, the regex matches the strings
> .#&%
> *#&%
> @#&%
> {#&%
> ,#&%
> 9#&%
> }#&%
> 
> You can, of course, still find a match within a larger string, but I don't 
> think that's what you were saying in your reply.
> 
> Gosh, I killed 20 minutes on this e-mail.  Next time I'll delete the thread 
> *without* a brief scan to see if any interesting tangents popped up.  ;-)
> 
> ---Tom
> 


