This chapter is from the book

Anchors, Grouping, and Backreferences

Grouping

Usage

(pattern)
(?:pattern)
(?=pattern)
(?!pattern)
(?<=pattern)
(?<!pattern)
(?#text)
internal modifiers: i, m, s, x

Description

Parentheses in regular expressions are used for grouping subpatterns within the larger pattern. This can be done to provide:

  • Limited action to quantifiers: /\bba(na)+na\b/ # "But I don't know when to stop"

  • Limited scope to alternation: /my favorite stooge is (Moe|Curly|Larry)\./

  • Limited range for modifiers: /The dog made a ((?i)spot)\./

  • Introduction for assertions: /Jimmy (?=Buffet|the Greek)/

  • Capturing for backreferences: /\b([a-z]+)\s+\1\b/

The ability to capture subpatterns for backreferences is covered in the entry on backreferences. Some of the examples in this section assume prior knowledge of backreferences.

Simple parentheses (pattern) and the (?:pattern) form allow you to group a subpattern of a regular expression. Once grouped, a quantifier can be applied to just that portion of the regular expression:

m/\w+:      # Match the first field (it's required)
 (?:[^:]*:){3}  # Match (and discard) the next three fields
 ([^:]*)    # Match (and capture) the next field
/x;
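
To see the quantified group in action, here is a minimal sketch that applies the pattern above to a made-up, passwd-style record. The sample data, the leading ^ anchor, and the variable names are assumptions added for illustration only.

```perl
use strict;
use warnings;

# Hypothetical colon-delimited record; not taken from the text.
my $record = 'clintp:x:500:100:Clinton Pierce:/home/clintp:/bin/sh';

my $fifth = '';
if ($record =~ m/^\w+:          # Match the first field (it's required)
                 (?:[^:]*:){3}  # Match (and discard) the next three fields
                 ([^:]*)        # Match (and capture) the next field
                /x) {
    $fifth = $1;   # Only the one capturing group sets $1
}
print "Fifth field: $fifth\n";
```

Because the three skipped fields use (?:...), the field of interest lands in $1 rather than in a higher-numbered register.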

Grouping also limits alternation, making it clear exactly which alternatives the | symbol chooses between:

m/oats|peas|beans$/; # oats, peas or beans (but beans at the end)
m/(oats|peas|beans)$/;# Any of oats, peas or beans only at the end
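
The difference between the two patterns is easiest to see on sample input. The following sketch runs both against two made-up strings; the strings and the %result hash are illustrative assumptions.

```perl
use strict;
use warnings;

my %result;
for my $text ('oats for breakfast', 'a bowl of beans') {
    $result{$text} = {
        # Ungrouped: the $ anchor belongs to the "beans" alternative only.
        ungrouped => ($text =~ m/oats|peas|beans$/   ? 1 : 0),
        # Grouped: the $ anchor applies to whichever alternative matches.
        grouped   => ($text =~ m/(oats|peas|beans)$/ ? 1 : 0),
    };
    print "$text: ungrouped=$result{$text}{ungrouped} ",
          "grouped=$result{$text}{grouped}\n";
}
```

The first string matches only the ungrouped pattern, because oats appears but not at the end; the second string matches both.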

Internal modifiers can have their scope limited (in fact, internal modifiers can only be specified with parentheses). So in the following:

m/Tony\s(?i:the)\sTiger/;

the phrase is matched only if Tony and Tiger are capitalized exactly as shown; the word the, however, is matched without regard to case. (Something similar could have been accomplished with [Tt]he, which accepts only the and The.)

The difference between () and (?:) is that the (?:pattern) form of parentheses doesn't capture the subpattern matched, whereas (pattern) does; (?:pattern) provides grouping without the capturing side effect. This makes a difference if you're using backreferences. See the backreferences entry.

The constructs (?=pattern), (?!pattern), (?<=pattern), and (?<!pattern) are all used to "look around" the current match to see what either precedes or follows it. They are zero-width assertions, meaning that the subpattern contained within is only used to look ahead or look behind the current point of the match to see whether something is true or not.

Pattern    Name and Description

(?=pattern) Positive lookahead. Is only true if pattern is seen after the current point of the match. So /Abraham\s(?=Simpson|Lincoln)/ matches only if Abraham is followed by Lincoln or Simpson. The benefit is that the last name is not absorbed by the match. See the later examples.
(?!pattern) Negative lookahead. True only if pattern is not seen after the current point of the match. So if /^(?:\d{1,3}\.){3}\d{1,3}$/ matches an IP address (and some bad ones too, such as 999.888.777.666), /^(?!(?:0+\.){3}0+)(?:\d{1,3}\.){3}\d{1,3}$/x matches those same IP addresses, but disallows 0.0.0.0.
(?<=pattern) Positive lookbehind. This asserts that pattern was seen before the current point in the match. /(?<=bar)foo/ matches only if foo was directly preceded by bar. There is a restriction on this subpattern: it must be fixed-width, so /(?<=bar.*)foo/ isn't allowed.
(?<!pattern) Negative lookbehind. True only if pattern was not seen before the current point in the match. /(?<!bar)foo/ is true only if foo was not directly preceded by bar. Like positive lookbehind, the subpattern must be fixed-width.
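
The following sketch exercises three of the assertions from the table. The test strings and variable names are made up for illustration; the patterns themselves are the ones shown above.

```perl
use strict;
use warnings;

# Positive lookahead: the last name must follow, but is not consumed.
my $consumed = '';
if ('Abraham Lincoln' =~ /Abraham\s(?=Simpson|Lincoln)/) {
    $consumed = $&;    # The text actually matched: "Abraham " only
}

# Negative lookahead: reject 0.0.0.0 but accept other dotted quads.
my $ip = qr/^(?!(?:0+\.){3}0+)(?:\d{1,3}\.){3}\d{1,3}$/;
my @accepted = grep { $_ =~ $ip } ('192.168.1.1', '0.0.0.0', '10.0.0.1');

# Negative lookbehind: count each foo not directly preceded by bar.
my $count = () = 'barfoo foo' =~ /(?<!bar)foo/g;

print "consumed=[$consumed] accepted=@accepted count=$count\n";
```

Note that the lookahead leaves Lincoln unconsumed, so a following pattern (or the next /g match) can still see it.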

The (?#text) construct is used to place comments in the body of a regular expression. For example, if the expression is long and convoluted, you might say:

/\D\d{5}(-\d{4})?(?# ZIP+4 optional)\D/

Because perl needs the ) to know when to terminate the comment, you cannot include a literal ) in the comment itself.

A cleaner way to include comments within a regular expression is to use the /x modifier to the expression.

The internal modifiers are modifiers (such as /i, /s, /x) that are applied to only a portion of the regular expression. They are specified with the non-capturing parentheses mechanism, either by inserting the modifiers after the ? but before the colon, or by placing them alone in parentheses after a lone ? as shown here:

(?modifiers:pattern) (?modifiers)

To add a modifier to a portion of the expression, use the following modifier value:

if (/Linus Torvalds wrote L(?i:inux)/) { }

This match is case sensitive except the letter-sequence inux, which can be uppercase, lowercase or a mix. A modifier can be removed by preceding it with a dash:

(?modifiers_to_add-modifiers_to_remove:pattern)

For example,

if (/(?-i:Linus) wrote Linux/i) { }

The preceding match is not case sensitive, except the portion matching Linus.
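
As a rough check of both forms, the sketch below tries each pattern against a string that should match and one that should not. The test strings and the names $partial, $mostly, and @outcome are assumptions made for this example.

```perl
use strict;
use warnings;

# Case sensitive except for "inux".
my $partial = qr/Linus Torvalds wrote L(?i:inux)/;
# Case insensitive except for "Linus".
my $mostly  = qr/(?-i:Linus) wrote Linux/i;

my @checks = (
    [ 'Linus Torvalds wrote LINUX', $partial ],  # inux may be any case
    [ 'linus torvalds wrote Linux', $partial ],  # linus must be capitalized
    [ 'Linus WROTE LINUX',          $mostly  ],  # only Linus is case sensitive
    [ 'linus wrote linux',          $mostly  ],  # lowercase linus is rejected
);

my @outcome = map { $_->[0] =~ $_->[1] ? 1 : 0 } @checks;
print "@outcome\n";
```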

Alternation

Usage

pat|pat

Description

The | metacharacter is used to make the regular expression engine choose between two potential matches; this is called alternation. The | should be placed between potential choices within the pattern:

/cat|dogfish/

This matches either cat or dogfish. The alternation extends outward from the | to the end of the innermost enclosing parentheses or to another alternation symbol.

/(cat|dog)fish/;    # Either "catfish" or "dogfish"
/(cat|dog|sword)fish/; # catfish, dogfish or swordfish

The alternation extends outward to include any anchors or zero-width assertions that are within the enclosed scope:

s/^\s+|\s+$//g; # Remove leading/trailing whitespace

An empty alternative can be specified, which allows you to choose between a few choices or nothing at all:

/(cat|dog|sword|)fish/; # catfish, dogfish or swordfish or just fish

Perl's regex engine tries the alternatives left to right and selects the first one that matches. Thus, if one alternative is a prefix of a later alternative, or is empty, it should be placed at the end:

/paper|paperbacks|paperweight/;  # The last two will never match
/(paperbacks|paperweight|paper)/; # Better!
/paper(backs|weight)?/;      # Even better still!

/(|bugle|bugs|bugaboo)/;    # The empty choice will always match
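
The effect of ordering can be observed by capturing what each version matches. This is a small sketch; the test word and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $word = 'paperweight';

# Prefix first: the engine settles for "paper" and stops.
my ($prefix_first) = $word =~ /(paper|paperbacks|paperweight)/;

# Longer alternatives first: the whole word is matched.
my ($prefix_last)  = $word =~ /(paperbacks|paperweight|paper)/;

# Shared prefix factored out, with an optional suffix.
my ($factored)     = $word =~ /(paper(?:backs|weight)?)/;

print "$prefix_first, $prefix_last, $factored\n";
```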

Alternation isn't always the best choice for determining whether a list of things will match. Because of the way that Perl's regex engine works, a list of alternations such as the following:

/than|that|thaw|them|then|they|thin|this|thud|thug|thus/

will run much slower than if the match is re-written as follows:

/th(?:an|at|aw|em|en|ey|in|is|ud|ug|us)/

The regex engine can't scan through the alternations and notice the obvious: the program is trying to match four-letter words that begin with th. It's not that smart (yet). By giving it a hint that a literal th will need to match before the alternations need to be searched, the speedup is tremendous. In this case, it is nearly 25 times faster for a large volume of text.

So avoid alternation for simple cases similar to:

m/\b\w(a|e|i|o|u)\w\b/; # 3 letter words, vowel in the middle

when a character class ([aeiou]) or another construct would work better.
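
Before swapping one form for another, it is worth confirming that the rewritten pattern accepts the same input. The sketch below does that for both rewrites discussed here, using made-up word lists. It checks equivalence only; how large the speed difference is depends on the perl in use and is best measured separately.

```perl
use strict;
use warnings;

# Hypothetical word list: the eleven th- words plus three decoys.
my @words = qw(than that thaw them then they thin this thud thug thus
               the thorn other);

my @plain    = grep { /than|that|thaw|them|then|they|thin|this|thud|thug|thus/ } @words;
my @factored = grep { /th(?:an|at|aw|em|en|ey|in|is|ud|ug|us)/ } @words;

# Same idea for the vowel example: alternation versus character class.
my @short = qw(cat dog sky tree);
my @alt   = grep { /\b\w(a|e|i|o|u)\w\b/ } @short;
my @class = grep { /\b\w[aeiou]\w\b/ } @short;

print scalar(@plain), " th- words; 3-letter vowel words: @class\n";
```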

NOTE

See Also

character classes in this book

Capturing and Backreferences

Usage

()
\1 \2 \3 \n
$1 $2 $3 $n

Description

The parentheses in regular expressions, in addition to grouping and other functions mentioned in the grouping entry, also have a side effect: patterns matched within parentheses are stored, and can be used later in the expression or later in the program outside of the expression. This storage of matched patterns is called capturing, and the references to the captured values are called backreferences.

Each set of capturing parentheses encountered takes the portion of the target string matched by the pattern and stores it in a register. The registers are numbered 1, 2, 3, and so on up to the number of capturing parentheses in the entire pattern.

During the match, any captured values are available by referring to the proper register with \register. This allows you to refer to something previously matched later in the pattern:

/(\w+)\s\1/; # Look for repeated words, separated by a space.

In the preceding example, (\w+) captures word characters into the first capture register, and \1 looks for whatever word was stored there after the whitespace character.

After the match has completed (or during the substitution phase with the s/// operator), the captured values will appear in the variables named $1, $2, $3, and so on up to the number of capturing parentheses in the pattern.

if ( s/(\w+)\s\1/$1/ ) { # Remove repeated words, separated by a space.
  print "Removed duplicate word $1\n";
}

In this example, the backreference \1 is used to find the repeated word as shown previously. During the substitution, $1 is used to put back just one instance of the repeated word. After the match, $1 is still set to the captured value during the match.
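
Here is the same substitution as a self-contained sketch, run against a sample lyric. The variable names $lyric and $removed are assumptions made for this example.

```perl
use strict;
use warnings;

my $lyric   = "She loves you yeah yeah yeah";
my $removed = '';

# \1 finds the repeat inside the pattern; $1 puts one copy back.
if ($lyric =~ s/(\w+)\s\1/$1/) {
    $removed = $1;    # Still set after a successful substitution
}

print "Result: $lyric\n";
print "Removed duplicate word $removed\n";
```

Without the /g modifier only the first repeated pair is collapsed, so one repetition remains in the result.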

Some notes about the variables $1, $2, and so on are as follows:

  • They're dynamically scoped. So, given the following code:

    $_="She loves you yeah yeah yeah";
    {
      if ( s/(\w+)\s\1/$1/ ) {
        $match=1;
      }
    }
    print "Removed a $1" if $match;

    Because the match occurred within a block of its own (the bare block), $1's value isn't valid outside of that block. Treat them as though they had been declared with local.

  • They're only set if the match succeeds. If the match fails, they keep whatever values an earlier successful match left in them. A very common programming mistake is to assume that the match succeeded and then proceed using $1 and company with whatever values they happen to have:

    @addr=(q{From: Bill Murray <bmurray@ttsd.k12.or.us>},
       q{From: Clinton Pierce <clintp@geeksalad.org>},
       q{From: Chris Doyle him@bootlegtoys.com},
       q{From: Shelley.Robertson@samspublishing.com},);
    for(@addr) {
      m/From: (\w+ \w+) <?([\w@.]+)>?/;
      print "You got mail from $1\n";
    }

    In this example, because the last bit of data isn't as well-formed as the others, the match actually fails, but the program goes blindly on using $1.

  • You cannot use $1, $2, $3, and so on in the left-hand portion of the substitution operator. Notice this attempt:

    s/(\w+)\s$1/$1/; # WRONG

    The $1 is scanned as a regular variable name when the regular expression is first parsed. It will have the old value of $1 (if any) from a previous match.

  • Multiple sets of parentheses will cause the capture registers to be used in the order encountered. If the parentheses nest, each opening ( assigns the next register.

    $name="James T. Kirk";
    if ($name=~m/^((\w+)\s(\w+\.?)?\s(\w+))$/) {
      print "First: $2\n"; # First name
      print "Middle: $3\n"; # Middle name/initial
      print "Last: $4\n";  # Last name
      print "Whole: $1\n"; # Whole name
    }
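
The notes above suggest a simple habit: test the result of the match before touching $1. The sketch below reworks the mail-header loop that way. The arrays @senders and @skipped are hypothetical names, and the quantifier is placed inside the second group so that it captures the whole address.

```perl
use strict;
use warnings;

my @addr = (
    q{From: Bill Murray <bmurray@ttsd.k12.or.us>},
    q{From: Clinton Pierce <clintp@geeksalad.org>},
    q{From: Chris Doyle him@bootlegtoys.com},
    q{From: Shelley.Robertson@samspublishing.com},
);

my (@senders, @skipped);
for my $line (@addr) {
    # Use $1 only inside the branch where the match succeeded.
    if ($line =~ m/From: (\w+ \w+) <?([\w\@.]+)>?/) {
        push @senders, $1;
    }
    else {
        push @skipped, $line;
    }
}

print "You got mail from $_\n" for @senders;
print "Could not parse: $_\n"  for @skipped;
```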

Example Listing 3.5

# Read a file in the format
#    key=value
#    key2=value2
#  and assign the data to %conf appropriately
# ** This is done with a clever code trick in the
#  match operator entry. See TIMTOWTDI in action!

open(CONFIG, "config") || die "Can't open config: $!";
while(<CONFIG>) {
  if (m/^([^=]+)=(.*)$/) { # Look for FOO=BAR
    $conf{$1}=$2;
  }
}

NOTE

See Also

local, dynamic scope, match operator, Regular Expression Special Variables, and Character shorthand in this book

Line Anchors

Usage

\A ^ \z \Z $

Description

Anchors are used within regular expression patterns to describe a location. Sometimes the location is relative to something else (\b) or the location can be absolute (\A). Because they don't match an actual character but make an assertion about the state of the match, they also are called zero-width assertions.

The first anchor (appropriately) is ^, which causes the match to happen at the beginning of the string. So,

if (m/^whales/) { }

will only be true if whales occurs at the beginning of $_. If whales occurs anywhere else in $_, the match won't succeed.

Next is the $ metacharacter that only matches at the end of a string:

if (m/Stimpy$/) { }

This pattern will only match if Stimpy occurs at the end of the string. These two metacharacters can be combined for interesting effects:

if (/^$/) { }  # Matches empty lines
# Here, the middle "doesn't matter", but the beginning and
#  endings that must match are well-defined.
if (/^In the beginning.*Amen$/) {}
if (m/^/) { }  # Will always match

When you think you understand $ and ^, read on.

The first few anchors describe the beginning and ending of a string. These are complicated by the fact that "end of a string" can often mean "end of a logical line" or "end of the storage unit," depending on who you ask. The /m modifier on a regular expression match (or substitution) can change which meaning you want. The same goes for "beginning of a string."

From now on in this entry, I'll refer to a logical line and a string. A string is the entire storage unit. A logical line begins at the beginning of the string, or just after a newline character, and extends to the next newline character or to the end of the string. Take, for example, the following string of characters in $t:

$t=q{That whim on the way
And again I took the day off
To roam the river's edge};

The string contains two newline characters: one following the word way and one following off. Three logical lines are in the one string.

The ^ metacharacter will match at the beginning of the string, unless /m is used as a modifier on the match. In that case, ^ can match at the beginning of any logical line in the string.

The $ metacharacter will match at the end of the string, or just before a newline character that ends the string, unless /m is used as a modifier on the match. In that case, $ can match at the end of any logical line in the string.

So observe the following matches against $t from the preceding:

if ($t=~/way$/) { }  # False! Without /m, way isn't at the end of the string
if ($t=~/way$/m) { } # True! With /m, way is at the end of a line
if ($t=~/^That/) { } # Always true!
if ($t=~/^And/) { }  # False! Without /m, And isn't at the beginning
if ($t=~/^And/m) { } # True! With /m, And is at the beginning of a line

while($t=~/(\w+)$/g) { # Prints only "edge", because
  print "$1";  # without /m, there is only one "end of line"
}

while($t=~/(\w+)$/gm) { # Prints way, off and edge
  print "$1";   #  because each represents an "end of line"
}            #  with /m

The \A metacharacter matches the beginning of the string always, and without regard to the /m modifier being used on the match. So in the sample string $t, the expression $t=~/\A\w+/m will only match the word That. The \z metacharacter similarly will always match at the end of the string, regardless of whether /m is in effect.

The \Z metacharacter is similar to \z, with one difference: \z matches only at the very end of the string, behind (to the right of) a trailing newline character if there is one. The \Z metacharacter also matches just in front of a newline character that ends the string.
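
The differences among these anchors show up most clearly on a string that ends in a newline. The following sketch compares them; the string "edge\n" and the hash %at_end are assumptions made for the example, and $t repeats the sample text from earlier in this entry.

```perl
use strict;
use warnings;

my $line = "edge\n";    # Note the trailing newline
my %at_end = (
    dollar => ($line =~ /edge$/    ? 1 : 0),  # $ tolerates the final newline
    big_z  => ($line =~ /edge\Z/   ? 1 : 0),  # so does \Z
    z      => ($line =~ /edge\z/   ? 1 : 0),  # \z does not
    nl_z   => ($line =~ /edge\n\z/ ? 1 : 0),  # unless the newline is matched
);

my $t = "That whim on the way\nAnd again I took the day off\n"
      . "To roam the river's edge";
my @line_starts  = $t =~ /^(\w+)/mg;    # ^ with /m: every logical line
my @string_start = $t =~ /\A(\w+)/mg;   # \A ignores /m: start of string only

print "line starts: @line_starts; string start: @string_start\n";
```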

NOTE

See Also

multi match and word anchors in this book

Word Anchors

Usage

\b \B

Description

The word anchors \b and \B are zero-width assertions that deal with the boundary between nonword characters (\W) and word characters (\w). The beginning and ending of a string are considered nonword characters.

The \b assertion matches the boundary between \w and \W characters. So, \bFOO matches FOO but only if the character preceding FOO is not a \w. The \B assertion matches everywhere that \b does not: between two \w characters or between two \W characters. Thus \BFOO will find FOO, but only if it's preceded by a word character.

$t=q{There was a young lady from Hyde
Who ate a green apple and died.
While her lover lamented
The apple fermented
And made cider inside her inside.};

$t=~m/\bher\b/;  # Matches "her" but not "There"
$t=~m/\Bher\B/;  # Matches the "her" in "There"
$t=~m/\bide\b/;  # Matches nothing! Not cider nor inside
$t=~m/\bThere/;   # Matches There, because the start of the string is a word boundary

Within a character class, \b stands for backspace and not a word boundary.

A common mistake is to assume that \b matches what people consider to be word boundaries. It doesn't: punctuation such as @ and . separates words, while _ is a word character. So, clintp@geeksalad.org is three words, U.S.A is also three, but War_And_Peace is only one word.
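
Counting the pieces that \b delimits makes the point concrete. In this sketch the %words hash is a hypothetical name; the three strings are the ones mentioned above.

```perl
use strict;
use warnings;

my %words;
for my $text ('clintp@geeksalad.org', 'U.S.A', 'War_And_Peace') {
    # Each run of \w characters between boundaries counts as one word.
    my @found = $text =~ /\b(\w+)\b/g;
    $words{$text} = scalar @found;
    print "$text: $words{$text} word(s): @found\n";
}
```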

NOTE

See Also

line anchors in this book

Multimatch Anchor

Usage

\G

Description

Similar to the line anchors, the multimatch anchor is used to match positions within a string as opposed to actually matching characters. It is in that class of metacharacters called zero-width assertions.

The \G metacharacter matches the position right after the previous regular expression match. For example, given the following code:

$_="One fish, two fish, red fish, blue fish";
m/\b\w{3}\b/g; # Matches "One"
m/\G\W+(\w+)/; # $1 is fish
m/\b\w{3}\b/g; # Picks up "two"
m/\G\W+(\w+)/; # $1 is fish (number two)

\G is useful for incrementally bumping along within a string with regular expressions. The location marked by \G can be reset by assigning to the pos function:

pos($_)=0;   # Reset \G to the beginning
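
The position that \G refers to can be inspected with pos as well as assigned. The sketch below records it at each step; the @seen array is an assumption made so the values can be examined afterward.

```perl
use strict;
use warnings;

$_ = "One fish, two fish, red fish, blue fish";
my @seen;

if (m/\b\w{3}\b/g) {          # Matches "One" and remembers the position
    push @seen, pos($_);      # Offset just past "One"
}
if (m/\G\W+(\w+)/) {          # No /g: peek ahead from that position
    push @seen, $1;           # The word that follows
    push @seen, pos($_);      # The position is left untouched
}

pos($_) = 0;                  # Reset \G to the beginning
if (m/\G(\w+)/) {
    push @seen, $1;           # Back at the first word
}

print "@seen\n";
```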

The advantage of \G over look-ahead or look-behind assertions is that you get to write smaller (and simpler!) regular expressions. The /g modifier causes each match to resume at the position where the last /g match left off. The \G assertion, used without /g, allows you to look ahead from that position without destroying it.

Example Listing 3.6

# Take apart the given paragraph looking for
#  phrases joined with the conjunctions "nor" and "or".
# Note that "now or later" and "later Or no" are both
#  picked up. With a single regular expression and no \G
#  this would be much more complicated.

# C.J. lyrics and music by Bob Dorough (c)1973
$t=q{Conjunction Junction, what's your function?
Hookin' up two cars to one when you say
Something like this choice: Either now or later,
Or no choice. Neither now nor ever. (Hey that's clever)
Eat this or that, grow thin or fat.};

# The expression here picks up a word at a time, remembering
#  where we left off with /g
while( $t=~m/(\w+)/g ) {
  $left=$1;

  # Matching with \G here doesn't ruin our position in
  #  the match above...because we didn't use /g.
  if ($t=~/\G\W+(n?or)\W+(\w+)/i) {
    print "$left $1 $2\n";
  }
}

NOTE

See Also

line anchors in this book

Match Modifiers

Usage

m//cgimosx
qr//imosx
s///egimosx

Description

This section describes the modifiers used with regular expression matches, substitutions, and compilations. Some modifiers are particular to an operator:

Modifier    Particular To

/g Match and Substitution Operators
/gc Match Operators
/e Substitution Operators

These modifiers are discussed along with the particular operators to which they apply elsewhere in this book.

The /i modifier causes the regular expression to match without regard to case. During the match, no distinction is made between upper- and lowercase letters, including those within character classes:

m/Scrabble/i;  # Matches Scrabble or scrabble or sCrAbBlE or...

The locale pragma causes a wider range of alphabetic characters to be recognized, and sensitivity of upper- and lowercase characters will expand appropriately.

The /m modifier causes the meaning of the ^ and $ anchors to change. With the /m modifier, ^ and $ will match at the beginning and end of logical lines (possibly multiple logical lines) within a target string. Some examples of this are in the "Line Anchors" section.

The /s modifier causes the nature of the . (dot) metacharacter to change. Normally, dot matches any single character except a newline character (\n). With /s in place, the newline is a potential match for .:

$text=q{You are my sunshine, my only sunshine.
  You make me happy, when skies are grey.};
$text=~m/You.*/; # Matches from "You are" to "sunshine."
$text=~m/You.*/s; # Matches from "You are" to "grey."

The /o modifier causes perl to only compile a regular expression once. Normally, a regular expression containing variables is recompiled each time perl encounters the expression.

$pat='\w+\W\w+';
while(<>) {
  if (/$pat/o) {
    $a++;
  }
}

In this example, the pattern in $pat is only changed outside of the loop. Perl doesn't realize this, so each pass through the loop, the pattern /$pat/ has to be recompiled by the regex engine. Giving perl the hint with /o that the pattern won't change allows the regex engine to skip the recompilation.

This optimization only makes sense when the pattern contains a value that could potentially change ($pat shown previously). Also, if the /o optimization is used and you do change the variables that make up the pattern, subsequent pattern matches won't reflect those changes.
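
Another way to compile a variable pattern just once is the qr// operator listed in the Usage lines above. The following is only a sketch of that alternative, not the approach described in this entry; the in-memory @input list stands in for lines read from a file, and the names $re and $count are assumptions.

```perl
use strict;
use warnings;

my $pat = '\w+\W\w+';
my $re  = qr/$pat/;    # Compiled here, once, outside the loop

# Hypothetical input standing in for lines read with <>.
my @input = ("hello world\n", "single\n", "two words here\n");

my $count = 0;
for my $line (@input) {
    $count++ if $line =~ $re;    # Reuses the compiled pattern
}

print "$count matching lines\n";
```

Unlike /o, this makes the moment of compilation explicit: changing $pat later has no effect until qr// is evaluated again.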

The /x modifier allows you to add whitespace and comments within a regular expression. Specifically, the rules are as follows:

  • All whitespace in a regular expression becomes insignificant, except within a character class.

  • Comments extend from the # character to the end of the line, or the end of the expression.

  • Literal #s in the expression must be escaped with a \ or represented as a hex or octal constant.

# The FAQ answer to "how to print a number with commas"
$_="1234567890";
1 while     # Repeat ad nauseam...
  s/^     #  start at the beginning, and
    (-?\d+) #  absorb all of the digits (maybe a -)
    (\d{3}) #  except for the last three.
  /$1,$2/x;  #  Put a comma before those three
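
Wrapped in a subroutine, the same substitution can be tried on several values. The name commify and the sample numbers are assumptions made for this sketch; the pattern is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper built around the substitution shown above.
sub commify {
    my ($number) = @_;
    1 while $number =~ s/^        # start at the beginning, and
                         (-?\d+)  # absorb all of the digits (maybe a -)
                         (\d{3})  # except for the last three.
                        /$1,$2/x; # Put a comma before those three
    return $number;
}

my @formatted = map { commify($_) } (1234567890, -1234, 999);
print "@formatted\n";
```

Each pass of the loop inserts one comma, working from right to left, and the loop ends when fewer than four leading digits remain.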

NOTE

See Also

match operator and substitution operator in this book
