
Using Regular Expressions in Perl

In this sample chapter you will learn how to use regular expressions, a skill that's fundamental to any Perl programmer.
This sample chapter is excerpted from Perl Developer's Dictionary, by Clinton Pierce.
Regular Expression Basics


Understanding how to use regular expressions is fundamental to any Perl programmer. The essential purpose of a regular expression is to match a pattern, and Perl provides two operators for doing just that: m// (match) and s/// (substitute). (The ins and outs of those operators are covered in their own entries.)

When Perl encounters a regular expression, it's handed to a regular expression engine and compiled into a special kind of state machine (a Nondeterministic Finite Automaton). This state machine is used against your data to determine whether the regular expression matches your data. For example, to use the match operator to test whether the word fudge exists in a scalar value:

$r=q{"Oh fudge!" Only that's not what I said.};
if ($r =~ m/fudge/) {
 # ...
}

The regular expression engine takes /fudge/, compiles a state machine to use against $r, and executes the state machine. If it was successful, the pattern matched.

This was a simple example, and could have been accomplished more quickly with the index function. The regular expression engine comes in handy because the pattern can contain metacharacters. Regular expression metacharacters are used to specify things that might or might not be in the data, things that look different (uppercase? lowercase?) in the data, or portions of the pattern that you just don't care about.

The simplest metacharacter is the . (dot). Within a regular expression, the dot stands for a "don't care" position. Any character will be matched by a dot:

m/m..n/; # Matches: main, mean, moan, morn, moon, honeymooner,
     # m--n, "m n", m00n, m..n, m22n etc... (but not "mn")

The exception is that a dot won't normally match a newline character. For that to happen, the match must have the /s modifier tacked on to the end. See the modifiers entry for details.
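
The newline exception can be seen in a minimal sketch like the one below. The variable names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

my $text = "line one\nline two";

# Without /s the dot refuses to match "\n", so "one" and "line"
# cannot be bridged across the line break.
my $without_s = ($text =~ m/one.line/)  ? 1 : 0;

# With /s the dot matches any character, newline included.
my $with_s    = ($text =~ m/one.line/s) ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```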

Metacharacters stand in for other characters (see "Character Shorthand") or stand in for entire classes of characters (character classes). They also specify quantity (quantifiers), choices (alternators), or positions (anchors).

In general, something that is normally a metacharacter can be made "unspecial" by prefixing it with a backslash, which is sometimes called "escaping" the character. So to match a literal m..n (with real dots), change the expression to

m/m\.\.n/; # Matches only m..n

The full list of metacharacters is \, |, ^, $, *, +, ?, ., (, ), [, {

Everything else in Perl's regular expressions matches itself. A normal character (nonmetacharacter) can sometimes be turned into a metacharacter by adding a backslash. For example, "d" is just a letter "d". However, preceded by a backslash,

\d

it matches a digit. More of this is covered in the "Character Shorthand" section. The entire set of metacharacters as well as some contrived metacharacters are covered elsewhere in this book.

As you browse the remainder of this section, keep in mind that there are just a few rules associated with regular expression matching. These are summarized as follows:

  • The goal is for the match to succeed as a whole. Everything else takes a backseat to that goal.

  • The entire pattern must be used to match the given data.

  • The match that begins the earliest (the leftmost) will be taken first.

  • Unless otherwise directed (with ?), quantifiers will always match as much as possible, and still have the expression match.

To sum up: the largest possible first match is normally taken.
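
As a rough illustration of the leftmost and "as much as possible" rules, consider the sketch below. The sample data and variable names are invented for this example.

```perl
use strict;
use warnings;

my $data = "xx aaa bbbb aaaaa";

# Leftmost wins: the first run of a's is taken, even though a
# longer run appears later in the string.
my ($greedy) = $data =~ m/(a+)/;

# Within that leftmost position, a+ takes all it can. A trailing ?
# asks the quantifier for as little as possible instead.
my ($lazy) = $data =~ m/(a+?)/;

print "greedy: $greedy, lazy: $lazy\n";
```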

For more information on how regular expression engines work, see the book Mastering Regular Expressions by Jeffrey Friedl.


See Also

m//, s///, character classes, alternation, quantifiers, character shorthand, line anchors, word anchors, grouping, backreferences and qr in this book

Basic Metacharacters and Operators

Match Operator




The m// operator is Perl's pattern match operator. The pattern is first interpolated as though it were a double-quoted string: scalar variables are expanded, backslash escapes are translated, and so on. Afterward, the pattern is compiled for the regular expression engine.

Next, the pattern is matched against the $_ variable unless the match operator has been bound to another value with the =~ operator.

m/(?:\(?\d{3}\)?-)?\d{3}-\d{4}/;   # Match against $_
$t=~m/(?:\(?\d{3}\)?-)?\d{3}-\d{4}/; # Match against $t

In a scalar context, the match operator returns true if it succeeds and false if it fails. With the /g modifier, in scalar context the match will proceed along the target string, returning true each time, until the target string is exhausted.

The modifiers (other than /g and /c) are described in the Match Modifiers entry.

In a list context, the match operator returns a list consisting of all the matched portions of the pattern that were captured with parentheses (as well as setting $1, $2 and so on as a side-effect of the match). If there are no parentheses in the match, the list (1) is returned. If the match fails, the empty list is returned.

In a list context with the /g modifier, the list of substrings matched by capturing parentheses is returned. If no parentheses are in the pattern, it returns the entire contents of each match.

$_=q{I do not like green eggs and ham, I do not like them Sam I Am};

$match=m/\w+/;    # $match=1
$match=m/(\w+)/g;   # $match=1, $1="I"
$match=m/(\w+)/g;   # $match=1, $1="do"
$match=m/(\w+)/g;   # $match=1, $1="not" .. and so on

@match=m/\w*am\b/i;    # @match=(1)
@match=m/(\b\w{4}\b)/i;  # @match=('like');
@match=m/(\w+)\W+(\w+)/i; # @match=qw(I do);

@match=m/\w*am\b/ig;   # @match=qw( ham Sam Am )
@match=m/(\b\w{4}\b)/ig; # @match=qw( like eggs like them )
@match=m/(\w+)\W+(\w+)/ig; # @match=qw( I do not like [...] Sam I Am )

After a failed match with the /g modifier, the search position is normally reset to the beginning of the string. If the /c modifier also is specified, this won't happen, and the next /g search will continue where the old one failed. This is useful if you're matching against a target string that might be appended to during successive checks of the match.
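
The effect of /c on the search position can be observed with Perl's built-in pos function, which reports where the next /g search on a string will start. The following is a small sketch with an invented sample string, not code from the original listing.

```perl
use strict;
use warnings;

my $str = "aaa bbb";

my $ok = $str =~ m/a+/g;        # succeeds; the position is now after "aaa"
my $after_match = pos($str);

$ok = $str =~ m/zzz/gc;         # fails, but /c keeps the position
my $kept = pos($str);

$ok = $str =~ m/zzz/g;          # fails without /c; the position is reset
my $reset = defined(pos($str)) ? pos($str) : 'undef';

print "after match: $after_match, with /c: $kept, without /c: $reset\n";
```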

The delimiters within the match operator can be changed by specifying another character after the initial m. Any character except whitespace can be used, and using the delimiter of ' has the side-effect of not allowing string interpolation to be performed before the regular expression is compiled. Balanced characters (such as (), [], {}, and <>) can be used to contain the expression.

m/\/home\/clintp\/bin/;  # Match clintp's /bin
m!/home/clintp/bin!;   # Somewhat more sane
m/$ENV{HOME}\/bin/;    # Match the user's own /bin
m'$ENV{HOME}/bin';    # Match literal '$ENV{HOME}/bin' -- useless?

If you're content with using // as delimiters for the pattern, the m can be omitted from the match operator:

while( <IRCLOG> ) {
  if (/<(?:Abigail|Addi)>/) { # Look ma, no "m"!

    # See below for explanation of //
    if (grep(//, @users)) {
      print LOG "$_\n";
    }
  }
}

If the pattern is omitted completely, the pattern from the last successful regular expression match is used. In the previous sample of code, the expression <(?:Abigail|Addi)> is re-used for the grep's pattern.

Example Listing 3.1

# The example from the "backreferences" section
#  re-worked to use the list-context-with-/g return
#  value of the match operator.

open(CONFIG, "config") || die "Can't open config: $!";
  local $/;


See Also

Substitution operator, ??, and match modifiers in this book

Substitution Operator




The s/// operator is Perl's substitution operator. The pattern is first interpolated as though it were a double-quoted string: scalar variables are expanded, backslash escapes are translated, and so on. Afterward, the pattern is compiled for the regular expression engine.

The pattern is then used to match against a target string; by default, the $_ variable is used unless another value is bound using the =~ operator.

s/today/yesterday/;      # Change string in $_
$t=~s/yesterday/long ago/;  # Change string in $t

If the pattern is successfully matched against the target string, the matched portion is substituted using the replacement.

The substitution operator returns the number of substitutions made. If no substitutions were made, the substitution operator returns false (the empty string). The return value is the same in both scalar and list contexts.

$_="It was, like, ya know, like, totally cool!";
$changes=s/It/She/;     # $changes=1, for the match
$changes=s/\slike,//g;   # $changes=2, for both matches

The /g modifier causes the substitution operator to repeat the match as often as possible. Unlike the match operator, /g has no other side effects (such as walking along the match in scalar context); it simply repeats the substitution as often as possible for nonoverlapping regions of the target string.

During the substitution, captured patterns from the pattern portion of the operator are available during the replacement part of the operator as $1, $2, and so on. If the /g modifier is used, the captured patterns are refreshed for each replacement.

$_="One fish two fish red fish blue fish";
s/(\w+)\s(\w+)/$2 $1/g; # Swap words for "fish One fish two..."

The /e modifier causes Perl to evaluate the replacement portion of the substitution for each replacement about to happen as though it were being run with eval {}. The replacement expression is syntax checked at compile time and variable substitutions occur at runtime, the same as eval {}.

# Make this URL component "safe" by changing non-letters
#  to 2-digit hex codes (RFC 1738)
$text=~s/(\W)/sprintf('%%%02x', ord($1))/ge;

# Perform word substitutions from a list...
%abrv=( 'A.D.' => 'Anno Domini', 'a.m.' => 'ante meridiem',
 'p.m.' => 'post meridiem', 'e.g.' => 'exempli gratia',
 'etc.' => 'et cetera',   'i.e.' => 'id est');
$text=qq{I awoke at 6 a.m. and went home, etc.};
$text=~s/([\w.]+)/exists $abrv{$1}?$abrv{$1}:$1/eg;

The delimiters within the substitution operator can be changed by specifying another character after the initial s. Any character except whitespace can be used, and using the delimiter of ' has the side-effect of not allowing string interpolation to be performed before the regular expression is compiled. Balanced characters (such as (), [], {}, and <>) can be used to contain the pattern and replacement. Additionally, a different set of characters can be used to encase the pattern and the replacement:

s/\/home\/clintp/\/users\/clintp/g;  # Ugh!
s,/home/clintp,/users/clintp,g;    # Whew! Better.
s{/home/clintp}
  {/users/clintp}g;         # This is really clear

The match modifiers (other than /e and /g) are covered in the entry on match modifiers.

Example Listing 3.2

# This function takes its argument and renders it in
#  Pig-Latin following the traditional rules for Pig Latin
# (Note that there's a substitution within a substitution.)
  my $notvowel=qr/[^aeiou_]/i; # _ is because of \w

  sub igpay_atinlay {
    local $_=shift;

    # Match the word
    s{(\w+)}{
      local $_=$1;
      # Now re-arrange the leading consonants
      #  or if none, append "yay"
      s/^($notvowel+)(.*)/$2$1ay/ or $_.="yay";
      $_; # Return the result
    }ge;
    return $_;
  }
print igpay_atinlay("Hello world"); # "elloHay orldway"


See Also

match operator, match modifiers, capturing, and backreferences in this book

Character Shorthand


Regular expressions, similar to double-quoted strings, also allow you to specify hard-to-type characters as digraphs (backslash sequences), by name or ASCII/Unicode number.

They differ from double-quoted context in that, within a regular expression, you're trying to match the given character, not trying to emit it. A single digraph might match more than one kind of character.

The simplest character shorthand is for the common unprintables. These are as follows:



\t A tab (TAB and HT)
\n A newline (LF, NL). On systems with multicharacter line termination characters, it matches both characters.
\r A carriage return (CR)
\a An alarm character (BEL)
\e An escape character (ESC)

They also can represent any ASCII character using the octal or hexadecimal code for that character. The format for the codes is \digits for octal and \xdigits for hexadecimal. So to represent a SYN (ASCII 22) character, you can say

/\x16/; # Match SYN (hex)
/\026/; # Match SYN (oct)

However, beware that using \digits can cause ambiguity with backreferences (captured pieces of a regexp). The sequence \2 can mean either ASCII 2 (STX), or it can mean the item that was captured from the second set of parentheses.

Ambiguous references are resolved in this manner: if the pattern has at least as many sets of capturing parentheses as the number given, \digits refers to that capture; otherwise, the value is the corresponding ASCII value (in octal). Within a character class, \digits will never stand for a backreference. Single-digit references such as \digit always stand for a backreference, except for \0, which means ASCII 0 (NUL).

To avoid this mess, specify octal ASCII codes using three digits (with a leading zero if necessary). Backreferences will never have a leading zero, and there probably won't be more than 100 backreferences in a regular expression.
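
A short sketch of these rules follows. The test strings are invented for illustration; the escapes themselves are the ones discussed above.

```perl
use strict;
use warnings;

my $syn = chr(22);                          # SYN, decimal 22

my $hex_ok   = ($syn =~ /\x16/) ? 1 : 0;    # hexadecimal escape
my $octal_ok = ($syn =~ /\026/) ? 1 : 0;    # three-digit octal escape

# Outside a class, \1 is a backreference: it matches whatever the
# first set of parentheses captured.
my $doubled  = ("xaay" =~ /(a)\1/) ? 1 : 0;

# Inside a class, \1 cannot be a backreference; it is octal 1 (SOH).
my $in_class = (chr(1) =~ /[\1]/) ? 1 : 0;

print "hex: $hex_ok, octal: $octal_ok, backref: $doubled, class: $in_class\n";
```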

Wide (multibyte) characters can be specified in hex by surrounding the hex code with {} to contain the entire sequence of digits. At the time of writing, the utf8 pragma also had to be in effect; newer perl releases accept \x{...} without it.

use utf8;
/\x{262f}/;   # Unicode YIN YANG

When the character is a named character, you can specify the name with a \N{name} sequence if the charnames module has been included.

use charnames ':full';
s/\N{CLOUD}/\N{LIGHTNING}/g; # The weather worsens!

Control-character sequences can be specified directly with \ccharacter. For example, the control-g character is a BEL, and it can be represented as \cg; the control-t character is \ct.

Example Listing 3.2

# Dump the file given on STDIN/command line translating any
#  low-value ASCII characters to their symbolic notation


while(<>) {
  tr/\200-\377/\000-\177/; # Strip 8th bit too.
  foreach(split(//)) {
    printf "%4s ", $_;
  }
}


See Also

charnames module documentation

character classes in this book

Character Classes


Character classes in Perl are used to match a single character with a particular property. For example, if you want to match a single alphabetic uppercase character, it would be nice to have a convenient way to describe that property. In Perl, surround the characters that describe the property with a set of square brackets:

m/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/;

This expression will match a single, alphabetic, uppercase character (at least for English speakers). This is a character class, and stands in for a single character.

Ranges can be used to simplify the character class:

m/[A-Z]/;

Ranges that seem natural (0-9, A-Z, A-M, a-z, n-z) will work. If you're familiar with ASCII collating sequence, other less natural ranges (such as [!-/]) can be constructed. Ranges can be combined simply by putting them next to each other within the class:

m/[A-Za-z]/;  # Upper and lowercase alphabetics

Some characters have special meaning within a character class and deserve attention:

The dash (-) character must either be preceded by a backslash, or should appear first within a character class. (Otherwise it might appear to be the beginning of a range.)

A closing bracket (]) within a character class should be preceded by a backslash, or it might be mistaken for the end of the class.

The ^ character, when it appears first within the class, will negate the character class. That is, every possible character that doesn't have the property described by the character class will match. For example:

m/[^A-Z]/;  # Match anything BUT an uppercase, alphabetic character

Remember that negating a character class might include some things you didn't expect. In the preceding example, control characters, whitespace, Unicode characters, 8-bit characters, and everything else imaginable would be matched, just not A-Z.

In general, any other metacharacter (including the special character classes later in this section) can be included within a character class. The exceptions are the characters .+()*|$^, which all have their mundane meanings when they appear within a character class, and backreferences (\1, \2), which don't work within character classes. The \b sequence means "backspace" in a character class, and not a word boundary.
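
A few of these special cases can be checked directly. The sketch below uses invented test strings and only illustrates the rules just described.

```perl
use strict;
use warnings;

# A dash placed first is literal, so this class means "dash, a, or z".
my $dash_literal = ("-" =~ /[-az]/) ? 1 : 0;
my $m_matches    = ("m" =~ /[-az]/) ? 1 : 0;   # no range here, so "m" fails

# . + and * are ordinary characters inside a class, so "x" is not matched.
my $x_matches    = ("x" =~ /[.+*]/) ? 1 : 0;

# \b means backspace inside a class.
my $backspace    = ("\x08" =~ /[\b]/) ? 1 : 0;

print "dash: $dash_literal, m: $m_matches, x: $x_matches, bs: $backspace\n";
```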

The hexadecimal, octal, Unicode, and control sequences for characters also work just fine within character classes:

m/[\ca-\cz]/;  # Match all control characters
m/[\x80-\xff]/; # Match high-bit-on characters
use charnames qw(:full);

In Perl regular expressions, common character classes also can be represented by convenient shortcuts. These are listed as follows:



Shortcut  What It Matches

\d        Digits [0-9]
\D        Nondigits [^0-9]
\s        Whitespace [\x20\t\n\r\f]
\S        Non-whitespace [^\x20\t\n\r\f]
\w        Word character [a-zA-Z0-9_]
\W        Non-word character [^a-zA-Z0-9_]

These shortcuts can be used within regular character classes or by themselves within a pattern match:

if (/x[\da-f]/i) { } # Match something hex-ish
s/(\w+)/reverse $1/e; # Reverse word-things only

The actual meaning of these will change if a locale is in effect. So, when perl encounters a string such as ¡feliz cumpleaños!, the exact meaning of metacharacters such as \w will change. This code

$a="\xa1feliz cumplea\xf1os!";  # Happy birthday, feliz cumpleaños
while($a=~m/(\w+)/g) {
  print "Word: $1\n";
}

will find three words in that text: feliz, cumplea, and os. The \xf1 (n with a tilde) character isn't recognized as a word character. This code:

use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "sp.ISO8859-1"); # Spanish, Latin-1 encoding

$a="\xa1feliz cumplea\xf1os!"; # Happy b-day.
while($a=~m/(\w+)/g) {
  print "Word: $1\n";
}

works as a Spanish speaker would expect, finding the words feliz and cumpleaños. The locale can be negated by specifying a bytes pragma within the lexical block, causing the character classes to go back to their original meanings.

Perl also defines character classes to match sets of Unicode characters. These are called Unicode properties, and are represented by \p{property}. The list of properties is extensive because Unicode's property list is long and perl adds a few custom properties to that list as well. Because the Unicode support in Perl is (currently) in flux, your best bet to find out what is currently implemented is to consult the perlunicode manual page for the version of perl that you're interested in.
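
As a small sketch of the \p{property} syntax, the example below uses two standard Unicode general-category properties, L (letter) and Lu (uppercase letter). The sample word is reused from the locale example; the exact set of supported properties depends on your perl version, so treat this as an illustration rather than a reference.

```perl
use strict;
use warnings;

my $word = "cumplea\x{f1}os";     # contains U+00F1, n with a tilde

# \p{L} matches any Unicode letter, so every character of the word counts.
my @letters = $word =~ /(\p{L})/g;

# \p{Lu} matches uppercase letters only; this word has none.
my $has_upper = ($word =~ /\p{Lu}/) ? 1 : 0;

print scalar(@letters), " letters, uppercase present: $has_upper\n";
```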

The last kind of character class shortcut (other than user-defined ones covered in the section on character classes) is defined by POSIX. Within another character class, the POSIX classes can be used to match even more specific kinds of characters. They all have the following form:

[:class:]

where class is the character class you're trying to match. To negate the class, write it as follows: [:^class:].



Class   What It Matches

ascii   7-bit ASCII characters (with an ord value <128)
alpha   Matches a letter
lower   Matches a lowercase alpha
upper   Matches an uppercase alpha
digit   Matches a decimal digit
alnum   Matches both alpha and digit characters
space   Matches a whitespace character (just like \s)
punct   Matches a punctuation character
print   Matches alnum, punct, or space
graph   Matches alnum or punct
word    Matches alnum or underscore
xdigit  Matches hex digits: digit, a-f, and A-F
cntrl   The ASCII characters with an ord value <32 (control characters)

To use the POSIX character classes, they must be within another character class:

for(split(//,$line)) {
  if (/[[:print:]]/) { print; }
}

Using a POSIX class on its own:

if (/[:print:]/) { } # WRONG!

won't have the intended effect. The previous bit of code would match :, p, r, i, n, and t.

If the locale pragma is in effect, the POSIX classes will work as the corresponding C library functions such as isalpha, isalnum, isascii, and so on.

Example Listing 3.3

# Analyze the file on STDIN (or the command line) to get the
# makeup. A typical MS-Word doc is about 60-70% high-bit
# characters and control codes. This book in XML form was
# less than 4% control codes, 10.8% punctuation, 18.2% whitespace
# and 69% alphanumeric characters.

use warnings;
use strict;
my(%chars, $total, %props, $code, %summary);
# Take the file apart, summarize the frequency for
#  each character.
while(<>) {
  $chars{$_}++ for(split(//));
  $total+=length; # Assumed: running count for the percentages below
}

# Warning: space and cntrl overlap so >100% is possible!
%props=(alpha => "Alphabetic",  digit => "Numeric",
  space => "Whitespace",  punct => "Punctuation",
  cntrl => "Control characters",
  '^ascii' => "8-bit characters");

# Build the code to analyze each kind of character
#  and classify it according to the POSIX classes above.
$code.="\$summary{'$_'}+=\$chars{\$_} if /[[:$_:]]/;\n"
  for(keys %props);
eval "for(keys %chars){ $code }";

foreach my $type (keys %props) {
  no warnings 'uninitialized';
  printf "%-18s %6d %4.1f%%\n", $props{$type}, $summary{$type},
    ($summary{$type}/$total)*100;
}


See Also

bytes, utf8, and POSIX module documentation

perlunicode in the perl documentation


isalpha in the C library reference

Quantifiers





Quantifiers are used to specify how many of a preceding item to match. That item can be a single character (/a*/), a group (/(foo)?/), or it can be anything that stands in for a single character such as a character class (/\w+/).

The first quantifier is ?, which means to match the preceding item zero or one times (in other words, the preceding item is optional).

/flowers?/;    # "flower" or "flowers" will match
/foo[0-9]?/;    # foo1, foo2 or just foo will match
/\b[A-Z]\w+('s)?\b/;  # Matches things like "Bob's" or "Carol" --
      #  capitalized singular words, possibly possessed

# Match day of week names like 'Mon', 'Thurs' and Friday.
# (caution: also matches oddities like 'Satur' -- this can be
#  remedied, but makes a lousy example.)
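
A pattern along the lines these comments describe might look like the following sketch. It is a hypothetical pattern written for illustration (the name $day and the sample sentence are assumptions): each day name is a stem followed by an optional "day".

```perl
use strict;
use warnings;

# Hypothetical pattern: a stem, then "day" made optional with ?
my $day = qr/\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)(?:day)?\b/;

my @found = "Mon, Thurs and Friday; also Satur" =~ /($day)/g;

print join(",", @found), "\n";
```

Because "day" is optional for every stem, the oddity "Satur" is accepted along with the intended abbreviations.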

Any portion of a match quantified by ? will always be successful. Sometimes an item will be found, and sometimes not, but the match will always work.

The quantifier * is similar to ? in that the quantified item is optional, except * specifies that the preceding item can match zero or more times. Specifically, the quantified item should be matched as many times as possible and still have the regular expression match succeed. So,


/fo*bar/;

matches 'fobar', 'foobar', 'foooobar', and also 'fbar'. The * quantifier will always match positively, but whether a matching item will be found is another question. Because of this, beware of expressions such as the following:

/[A-Z]*\w*/;

You might hope it will match a series of uppercase characters and then a set of word characters, and it will. But it also will match numbers, empty strings, and binary data. Because everything in this expression is optional, the expression will always match.

With * you can absorb unwanted material to make your match less specific:

# This matches any of: <body>, <body background="">,
#  <body background="foo.gif">, <body onload="alert()">,
#  or <body onload="alert()" background="foo.gif">

In the preceding example, * was used to make [^"] match empty quote marks, or quote marks with something inside; it was used to make the attribute match (foo="bar") optional, and repeat it as often as necessary.
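
One way to write such a pattern is sketched below. This is an assumed pattern for illustration, not necessarily the one the text originally showed: [^"]* allows empty or filled quote marks, and the outer * makes the whole attribute group optional and repeatable.

```perl
use strict;
use warnings;

# Hypothetical pattern: "<body", then zero or more name="value" attributes.
my $body = qr/<body(?:\s+\w+="[^"]*")*>/;

my @tags = (
  '<body>',
  '<body background="">',
  '<body background="foo.gif">',
  '<body onload="alert()">',
  '<body onload="alert()" background="foo.gif">',
);

my $matched = grep { /$body/ } @tags;             # how many of the tags match
my $run_on  = ('<bodyonload="alert()">' =~ /$body/) ? 1 : 0;

print "$matched of ", scalar(@tags), " matched; run-on tag matched: $run_on\n";
```

Requiring \s+ rather than \s* in front of each attribute is what keeps the run-on tag from matching.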

The + quantifier requires the match not only to succeed at least once, but also as many times as possible and still have the regular expression match be successful. So, it's similar to *, except that at least one match is guaranteed. In the preceding example, the space following the \w+ was specified as \s+; otherwise items such as <bodyonload="alert()"> would match.


/fo+bar/;

This matches 'fobar', 'foobar', and of course 'fooooobar'. But unlike *, it will not match 'fbar'.

Perl also allows you to match an item a minimal, fixed, or maximum number of times with the {} quantifiers.



{min,max} Matches at least min times, but at most max times.
{min,} Matches at least min times, with no upper limit.
{count} Matches exactly count times.

Keep in mind that with the {min,} and {min,max} quantifiers, the match will absorb as many characters as it can while still letting the match as a whole succeed. Thus with the following:

$_="Python";
if (m/\w(\w{1,5})\w\w/) {
  print "Matched ", length($1), "\n";
}

The $1 variable winds up with only three characters because the first \w matched P, the last \w's needed "on" to be successful, and that left "yth" for the quantified \w.

Perl's quantifiers are normally maximal matching, meaning that they match as many characters as possible but still allow the regular expression as a whole to match. This is also called greedy matching.

The ? quantifier has another meaning in Perl: when affixed to a *, +, or {} quantifier, it causes the quantifier to match as few characters as necessary for the match to be successful. This is called minimal matching (or lazy matching).

Take the following code:

$_=q{"You maniacs!" he yelled at the surf. "You blew it up!"};
while (m/(".*")/g) {
  print "$1\n";
}

It might surprise you to see that the regular expression grabs the entire string, not just each quote individually. That's because ".*" matches as much as possible between the quote marks, including other quote marks. Changing the expression to:


m/(".*?")/g

solves this problem by asking * to match as little as possible for the match to succeed.

Keep in mind that ? is just a convenient shorthand and might not represent the best possible solution to the problem. The pattern /"[^"]*"/ would have been a more efficient choice because the amount of backtracking by the regular expression engine to be done would have been less. But there is programmer efficiency to consider.
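
The three approaches can be compared side by side. This sketch reuses the sample string from above; the array names are invented for illustration.

```perl
use strict;
use warnings;

my $text = q{"You maniacs!" he yelled at the surf. "You blew it up!"};

my @greedy  = $text =~ m/(".*")/g;      # one match spanning both quotations
my @lazy    = $text =~ m/(".*?")/g;     # one match per quotation
my @negated = $text =~ m/("[^"]*")/g;   # same result as the lazy form

print scalar(@greedy), " greedy, ", scalar(@lazy), " lazy, ",
      scalar(@negated), " negated\n";
```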


See Also

m operator in this book

Modification Characters


\Q \E \L \l \U \u


The modification characters used in string literals (in an interpolated context) are available in regular expressions as well. See the entry on modification characters for a list.

Understand that these "metacharacters" aren't really metacharacters at all. They do their work because regular expression match operators allow interpolation to happen when the pattern is first examined. Just as \L and \U are only effective in double-quoted strings, they're only effective in regular expressions when the pattern is first examined by perl.

$foo='\U';             # Assumed: $foo holds a literal \U
if (m/${foo}blah/) { } # Won't look for BLAH, but 'Ublah'
if (m/\Ublah/) { }   # Will look for BLAH
if (m/(a)\U\1/) { }   # Won't look for aA as you might hope

Most useful among these in regular expressions is the \Q modifier. The \Q modifier is used to quote any metacharacters that follow. When accepting something that will be used in a pattern match from an untrusted source, it is vitally important that you not put the pattern into the regular expression directly. Take this small sample:

# A CGI form is a _VERY_ untrustworthy source of info.

use CGI qw(:form :standard);
print header();
$pat=param("SEARCH"); # Assumed: the form field named in Listing 3.4
# ...sometime later...
if (/$pat/) {
  # ...
}

The trouble with this is that handing $pat to the regular expression engine opens up your system to running code that's determined solely by the user. If the user is malicious, he can:

  • Specify a regular expression that will never terminate (endless backtracking).

  • Specify a regular expression that will use indeterminate amounts of memory.

  • Specify a regular expression that can run perl code with a (?{}) pattern.

The third one is probably the most malicious, so it is disabled unless a use re 'eval' pragma is in effect or the pattern is compiled with a qr operator.

The \Q modifier will cause perl to treat the contents of the pattern literally until an \E is encountered.
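
The difference \Q makes can be seen in a minimal sketch like this one. The value of $pat stands in for hypothetical user input, and quotemeta is Perl's built-in function form of \Q.

```perl
use strict;
use warnings;

my $pat  = 'a.c (v1+)';              # hypothetical input full of metacharacters
my $line = 'found a.c (v1+) here';

# Quoted: every character of $pat must appear literally.
my $literal_hit = ($line =~ /\Q$pat\E/) ? 1 : 0;

# Unquoted, the dot is a wildcard and the parentheses group,
# so an unrelated string matches.
my $unquoted = ('abc v1' =~ /$pat/) ? 1 : 0;

# Quoted, that same string no longer matches.
my $quoted_miss = ('abc v1' =~ /\Q$pat\E/) ? 1 : 0;

# quotemeta shows what \Q does to the pattern text.
my $escaped = quotemeta($pat);

print "literal: $literal_hit, unquoted: $unquoted, quoted: $quoted_miss\n";
print "escaped pattern: $escaped\n";
```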

Example Listing 3.4

# A re-working of the inline sample above a little
#  more safe.* A form parameter "SEARCH" is used to
#  check the GECOS (often "name") field.
# *[Of course, software is only completely "safe" when
#  it's not being used. --cap]

use CGI qw(:form :standard);

print header(-type => 'text/plain');
$pat=param("SEARCH"); # The form parameter described above

push(@ARGV, "/etc/passwd");
while(<>) {
  ($name)=(split(/:/, $_))[4];
  if ($name=~/\Q$pat\E/) {
    print "Yup, ${pat}'s in there somewhere!\n";
  }
}


See Also

Modification characters in this book
