Using Regular Expressions in Perl

In this sample chapter you will learn how to use regular expressions, a skill that's fundamental to any Perl programmer.
This sample chapter is excerpted from Perl Developer's Dictionary, by Clinton Pierce.

Regular Expression Basics

Description

Understanding how to use regular expressions is fundamental to any Perl programmer. The essential purpose of a regular expression is to match a pattern, and Perl provides two operators for doing just that: m// (match) and s/// (substitute). (The ins and outs of those operators are covered in their own entries.)

When Perl encounters a regular expression, it's handed to a regular expression engine and compiled into a special kind of state machine (a Nondeterministic Finite Automaton). This state machine is used against your data to determine whether the regular expression matches your data. For example, to use the match operator to test whether the word fudge exists in a scalar value:

$r=q{"Oh fudge!" Only that's not what I said.};
if ($r =~ m/fudge/) {
 # ...
}

The regular expression engine takes /fudge/, compiles a state machine to use against $r, and executes the state machine. If it was successful, the pattern matched.

This was a simple example, and could have been accomplished more quickly with the index function. The regular expression engine comes in handy because the pattern can contain metacharacters. Regular expression metacharacters are used to specify things that might or might not be in the data, things that look different (uppercase? lowercase?) in the data, or portions of the pattern that you just don't care about.
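
For the purely literal case, the index alternative might look like the following sketch. The helper name contains_literal is an assumption made for illustration; it isn't part of the original text.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original text): true when $needle
# occurs literally somewhere in $haystack.
sub contains_literal {
    my ($haystack, $needle) = @_;
    return index($haystack, $needle) >= 0 ? 1 : 0;
}

my $r = q{"Oh fudge!" Only that's not what I said.};
print "index: found\n" if contains_literal($r, 'fudge');
print "regex: found\n" if $r =~ m/fudge/;
```

Both tests succeed here; index simply has no notion of metacharacters, so it can only look for exact text.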

The simplest metacharacter is the . (dot). Within a regular expression, the dot stands for a "don't care" position. Any character will be matched by a dot:

m/m..n/; # Matches: main, mean, moan, morn, moon, honeymooner,
     # m--n, "m n", m00n, m..n, m22n etc... (but not "mn")

The exception is that a dot won't normally match a newline character. For that to happen, the match must have the /s modifier tacked on to the end. See the modifiers entry for details.
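
A small sketch of that exception, assuming a hypothetical helper named dot_matches that tries the same pattern with and without /s:

```perl
use strict;
use warnings;

# Hypothetical helper: try m..n against $text without and with /s,
# returning a (plain, with_s) pair of 1/0 flags.
sub dot_matches {
    my ($text) = @_;
    my $plain  = ($text =~ m/m..n/)  ? 1 : 0;
    my $with_s = ($text =~ m/m..n/s) ? 1 : 0;
    return ($plain, $with_s);
}

# Two newlines sit between the m and the n.
my ($plain, $with_s) = dot_matches("m\n\nn");
print "without /s: $plain, with /s: $with_s\n";   # without /s: 0, with /s: 1
```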

Metacharacters stand in for other characters (see "Character Shorthand") or stand in for entire classes of characters (character classes). They also specify quantity (quantifiers), choices (alternation), or positions (anchors).

In general, something that is normally a metacharacter can be made "unspecial" by prefixing it with a backslash, which is sometimes called "escaping" the character. So to match a literal m..n (with real dots), change the expression to

m/m\.\.n/; # Matches only m..n

The full list of metacharacters is \, |, ^, $, *, +, ?, ., (, ), [, {

Everything else in Perl's regular expressions matches itself. A normal character (nonmetacharacter) can sometimes be turned into a metacharacter by adding a backslash. For example, "d" is just a letter "d". However, preceded by a backslash,

/\d/

it matches a digit. More of this is covered in the "Character Shorthand" section. The entire set of metacharacters as well as some contrived metacharacters are covered elsewhere in this book.

As you browse the remainder of this section, keep in mind that there are just a few rules associated with regular expression matching. These are summarized as follows:

  • The goal is for the match to succeed as a whole. Everything else takes a backseat to that goal.

  • The entire pattern must be used to match the given data.

  • The match that begins the earliest (the leftmost) will be taken first.

  • Unless otherwise directed (with ?), quantifiers will always match as much as possible, and still have the expression match.

To sum up: the largest possible first match is normally taken.
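
The "leftmost first, then as much as possible" rules can be seen in a short sketch. The helper matched_part and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re in $text, or undef.
sub matched_part {
    my ($text, $re) = @_;
    return $text =~ m/($re)/ ? $1 : undef;
}

# Leftmost wins: "cat" is found before the longer "caterpillar".
print matched_part("a cat and a caterpillar", qr/cat\w*/), "\n";   # cat

# Greedy: .* runs from the first "a" to the last one.
print matched_part("banana", qr/a.*a/), "\n";                      # anana
```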

For more information on how regular expression engines work, see the book Mastering Regular Expressions by Jeffrey Friedl.

NOTE

See Also

m//, s///, character classes, alternation, quantifiers, character shorthand, line anchors, word anchors, grouping, backreferences and qr in this book

Basic Metacharacters and Operators

Match Operator

m//

Usage

m/pattern/modifiers

Description

The m// operator is Perl's pattern match operator. The pattern is first interpolated as though it were a double-quoted string: scalar variables are expanded, backslash escapes are translated, and so on. Afterward, the pattern is compiled for the regular expression engine.

Next, the pattern is matched against the $_ variable unless another target has been bound to the match operator with the =~ operator.

m/(?:\(?\d{3}\)?-)?\d{3}-\d{4}/;   # Match against $_
$t=~m/(?:\(?\d{3}\)?-)?\d{3}-\d{4}/; # Match against $t

In a scalar context, the match operator returns true if it succeeds and false if it fails. With the /g modifier, in scalar context the match will proceed along the target string, returning true each time, until the target string is exhausted.

The modifiers (other than /g and /c) are described in the Match Modifiers entry.

In a list context, the match operator returns a list consisting of all the matched portions of the pattern that were captured with parentheses (as well as setting $1, $2 and so on as a side-effect of the match). If there are no parentheses in the match, the list (1) is returned. If the match fails, the empty list is returned.

In a list context with the /g modifier, the list of substrings matched by capturing parentheses is returned. If no parentheses are in the pattern, it returns the entire contents of each match.

$_=q{I do not like green eggs and ham, I do not like them Sam I Am};

$match=m/\w+/;    # $match=1
$match=m/(\w+)/g;   # $match=1, $1="I"
$match=m/(\w+)/g;   # $match=1, $1="do"
$match=m/(\w+)/g;   # $match=1, $1="not" .. and so on

@match=m/\w*am\b/i;    # @match=(1)
@match=m/(\b\w{4}\b)/i;  # @match=('like');
@match=m/(\w+)\W+(\w+)/i; # @match=qw(I do);

@match=m/\w*am\b/ig;   # @match=qw( ham Sam Am )
@match=m/(\b\w{4}\b)/ig; # @match=qw( like eggs like them )
@match=m/(\w+)\W+(\w+)/ig; # @match=qw( I do not like [...] Sam I Am )
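
The scalar-context /g behavior is most often seen in a while loop. As a rough sketch (the helper words_with_offsets is hypothetical), Perl's built-in pos function shows where each successive match left off:

```perl
use strict;
use warnings;

# Hypothetical helper: walk along $text with /g in scalar context and
# return [word, offset just past the word] pairs.
sub words_with_offsets {
    my ($text) = @_;
    my @found;
    while ($text =~ m/(\w+)/g) {
        push @found, [$1, pos($text)];
    }
    return @found;
}

for my $hit (words_with_offsets("green eggs and ham")) {
    print "$hit->[0] ends at offset $hit->[1]\n";
}
```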

After a failed match with the /g modifier, the search position is normally reset to the beginning of the string. If the /c modifier also is specified, this won't happen, and the next /g search will continue where the old one failed. This is useful if you're matching against a target string that might be appended to during successive checks of the match.
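
One common use of /gc is a small tokenizer, sketched below. The \G anchor (not covered in this entry) pins each attempt to the position where the previous /g match ended; the tokenize function and its token labels are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: split simple arithmetic text into tokens.
# Because of /c, a failed alternative leaves the search position
# alone, so the next alternative is tried at the same spot.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    while (1) {
        if    ($src =~ m/\G\s+/gc)       { next }                     # skip blanks
        elsif ($src =~ m/\G(\d+)/gc)     { push @tokens, "NUM:$1" }
        elsif ($src =~ m/\G([-+*\/])/gc) { push @tokens, "OP:$1" }
        else                             { last }                     # end or unknown input
    }
    return @tokens;
}

print join(' ', tokenize("12 + 345*6")), "\n";   # NUM:12 OP:+ NUM:345 OP:* NUM:6
```

Without /c, the first failed alternative would reset the position to the start of the string and the loop would never make progress past the first token.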

The delimiters within the match operator can be changed by specifying another character after the initial m. Any character except whitespace can be used, and using the delimiter of ' has the side-effect of not allowing string interpolation to be performed before the regular expression is compiled. Balanced characters (such as (), [], {}, and <>) can be used to contain the expression.

m/\/home\/clintp\/bin/;  # Match clintp's /bin
m!/home/clintp/bin!;   # Somewhat more sane
m/$ENV{HOME}\/bin/;    # Match the user's own /bin
m'$ENV{HOME}/bin';    # Match literal '$ENV{HOME}/bin' -- useless?
m{/home/clintp};

If you're content with using // as delimiters for the pattern, the m can be omitted from the match operator:

while( <IRCLOG> ) {
  if (/<(?:Abigail|Addi)>/) { # Look ma, no "m"!

    # See below for explanation of //
    if (grep(//, @users)) {
      print LOG "$_\n";
    }
  }
}

If the pattern is omitted completely, the pattern from the last successful regular expression match is used. In the previous sample of code, the expression <(?:Abigail|Addi)> is re-used for the grep's pattern.

Example Listing 3.1

# The example from the "backreferences" section
#  re-worked to use the list-context-with-/g return
#  value of the match operator.

open(CONFIG, "config") || die "Can't open config: $!";
{
  local $/;
  %conf=<CONFIG>=~m/([^=]+)=(.*)\n/g;
}

NOTE

See Also

Substitution operator, ??, and match modifiers in this book

Substitution Operator

s///

Usage

s/pattern/replacement/modifiers

Description

The s/// operator is Perl's substitution operator. The pattern is first interpolated as though it were a double-quoted string: scalar variables are expanded, backslash escapes are translated, and so on. Afterward, the pattern is compiled for the regular expression engine.

The pattern is then used to match against a target string; by default, the $_ variable is used unless another value is bound using the =~ operator.

s/today/yesterday/;      # Change string in $_
$t=~s/yesterday/long ago/;  # Change string in $t

If the pattern is successfully matched against the target string, the matched portion is substituted using the replacement.

The substitution operator returns the number of substitutions made. If no substitutions were made, the substitution operator returns false (the empty string). The return value is the same in both scalar and list contexts.

$_="It was, like, ya know, like, totally cool!";
$changes=s/It/She/;     # $changes=1, for the match
$changes=s/\slike,//g;   # $changes=2, for both matches

The /g modifier causes the substitution operator to repeat the match as often as possible. Unlike the match operator, /g has no other side effects (such as walking along the match in scalar context); it simply repeats the substitution as often as possible for nonoverlapping regions of the target string.

During the substitution, captured patterns from the pattern portion of the operator are available during the replacement part of the operator as $1, $2, and so on. If the /g modifier is used, the captured patterns are refreshed for each replacement.

$_="One fish two fish red fish blue fish";
s/(\w+)\s(\w+)/$2 $1/g; # Swap words for "fish One fish two..."

The /e modifier causes Perl to evaluate the replacement portion of the substitution for each replacement about to happen as though it were being run with eval {}. The replacement expression is syntax checked at compile time and variable substitutions occur at runtime, the same as eval {}.

# Make this URL component "safe" by changing non-letters
#  to 2-digit hex codes (RFC 1738)
$text=~s/(\W)/sprintf('%%%02x', ord($1))/ge;

# Perform word substitutions from a list...
%abrv=( 'A.D.' => 'Anno Domini', 'a.m.' => 'ante meridiem',
 'p.m.' => 'post meridiem', 'e.g.' => 'exempli gratia',
 'etc.' => 'et cetera',   'i.e.' => 'id est');
$text=qq{I awoke at 6 a.m. and went home, etc.};
$text=~s/([\w.]+)/exists $abrv{$1}?$abrv{$1}:$1/eg;

The delimiters within the substitution operator can be changed by specifying another character after the initial s. Any character except whitespace can be used, and using the delimiter of ' has the side-effect of not allowing string interpolation to be performed before the regular expression is compiled. Balanced characters (such as (), [], {}, and <>) can be used to contain the pattern and replacement. Additionally, a different set of characters can be used to encase the pattern and the replacement:

s/\/home\/clintp/\/users\/clintp/g;  # Ugh!
s,/home/clintp,/users/clintp,g;    # Whew! Better.
s[/home/clintp]
  {/users/clintp}g;         # This is really clear

The match modifiers (other than /e and /g) are covered in the entry on match modifiers.

Example Listing 3.2

# This function takes its argument and renders it in
#  Pig-Latin following the traditional rules for Pig Latin
# (Note that there's a substitution within a substitution.)
{
  my $notvowel=qr/[^aeiou_]/i; # _ is because of \w

  sub igpay_atinlay {
    local $_=shift;

    # Match the word
    s[(\w+)]
      {
      local $_=$1;
      # Now re-arrange the leading consonants
      #  or if none, append "yay"
      s/^($notvowel+)(.*)/$2$1ay/
        or
        s/$/yay/;
      $_; # Return the result
      }ge;
    return $_;
  }
}
print igpay_atinlay("Hello world"); # "elloHay orldway"

NOTE

See Also

match operator, match modifiers, capturing, and backreferences in this book

Character Shorthand

Description

Regular expressions, similar to double-quoted strings, also allow you to specify hard-to-type characters as digraphs (backslash sequences), by name or ASCII/Unicode number.

They differ from double-quoted context in that, within a regular expression, you're trying to match the given character, not trying to emit it. A single digraph might match more than one kind of character.

The simplest character shorthand is for the common unprintables. These are as follows:

Character  Matches
\t         A tab (TAB and HT)
\n         A newline (LF, NL). On systems with multicharacter line terminators, text-mode I/O normally translates the terminator to a single \n on input.
\r         A carriage return (CR)
\a         An alarm character (BEL)
\e         An escape character (ESC)

They also can represent any ASCII character using the octal or hexadecimal code for that character. The formats for the codes are \digits for octal and \xdigits for hexadecimal. So to represent a SYN (ASCII 22) character, you can say

/\x16/; # Match SYN (hex)
/\026/; # Match SYN (oct)

However, beware that using \digits can cause ambiguity with backreferences (captured pieces of a regexp). The sequence \2 can mean either ASCII 2 (STX), or it can mean the item that was captured from the second set of parentheses.

Ambiguous references are resolved in this manner: if the pattern contains at least as many sets of capturing parentheses as the number given by digits, \digits refers to that capture; otherwise, the value is the corresponding ASCII value (in octal). Within a character class, \digits will never stand for a backreference. Single-digit references such as \digit always stand for a backreference, except for \0, which means ASCII 0 (NUL).

To avoid this mess, specify octal ASCII codes using three digits (with a leading zero if necessary). Backreferences will never have a leading zero, and there probably won't be more than 100 backreferences in a regular expression.
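
A brief sketch of the two unambiguous forms follows. Both helper names are made up for this illustration.

```perl
use strict;
use warnings;

# \1 after a group is a backreference: the same word character twice in a row.
sub has_doubled_char { my ($t) = @_; return $t =~ m/(\w)\1/ ? 1 : 0 }

# \001 (three digits, leading zero) is the octal code for SOH.
sub has_soh { my ($t) = @_; return $t =~ m/\001/ ? 1 : 0 }

print has_doubled_char("moon"), has_doubled_char("main"), "\n";   # 10
print has_soh("a\001b"), has_soh("moon"), "\n";                   # 10
```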

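Wide (multibyte) characters can be specified in hex by surrounding the hex code with {} to contain the entire sequence of digits. Early Unicode-aware versions of perl expected the utf8 pragma to be in effect as well; since Perl 5.8, \x{...} works without it, and use utf8 only declares that the source file itself is encoded in UTF-8.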

use utf8;
/\x{262f}/;   # Unicode YIN YANG

When the character is a named character, you can specify the name with a \N{name} sequence if the charnames module has been included.

use charnames ':full';
s/\N{CLOUD}/\N{LIGHTNING}/g; # The weather worsens!

Control-character sequences can be specified directly with \ccharacter. For example, the control-g character is a BEL, and it can be represented as \cg; the control-t character is \ct.

Example Listing 3.2

# Dump the file given on STDIN/command line translating any
#  low-value ASCII characters to their symbolic notation

@names{(0..32)}=qw( NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF
    CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN
    EM SUB ESC FS GS RS US SPACE);
$names{127}='DEL';

while(<>) {
  tr/\200-\377/\000-\177/; # Strip 8th bit too.
  foreach(split(//)) {
    s/([\000-\x1f\x7f])/$names{ord($1)}/e;
    printf "%4s ", $_;
  }
}

NOTE

See Also

charnames module documentation

character classes in this book

Character Classes

Description

Character classes in Perl are used to match a single character with a particular property. For example, if you want to match a single alphabetic uppercase character, it would be nice to have a convenient way to describe this property. In Perl, surround the characters that describe the property with a set of square brackets:

m/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/

This expression will match a single, alphabetic, uppercase character (at least for English speakers). This is a character class, and stands in for a single character.

Ranges can be used to simplify the character class:

m/[A-Z]/

Ranges that seem natural (0-9, A-Z, A-M, a-z, n-z) will work. If you're familiar with ASCII collating sequence, other less natural ranges (such as [!-/]) can be constructed. Ranges can be combined simply by putting them next to each other within the class:

m/[A-Za-z]/;  # Upper and lowercase alphabetics

Some characters have special meaning within a character class and deserve attention:

The dash (-) character must either be preceded by a backslash, or should appear first or last within a character class. (Otherwise it might appear to be part of a range.)

A closing bracket (]) within a character class should be preceded by a backslash, or it might be mistaken for the end of the class.

The ^ character, when it appears first in the class, will negate the character class. That is, every possible character that doesn't have the property described by the character class will match. For example:

m/[^A-Z]/;  # Match anything BUT an uppercase, alphabetic character

Remember that negating a character class might include some things you didn't expect. In the preceding example, control characters, whitespace, Unicode characters, 8-bit characters, and everything else imaginable would be matched, just not A-Z.
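
The special characters described above can be combined in one class. The following is only a sketch; the helper names only_listed and has_unlisted are hypothetical, and the ^ and $ outside the brackets are line anchors, covered in their own entry.

```perl
use strict;
use warnings;

# One class, used two ways: a literal dash (listed first), an escaped
# closing bracket, and the range a-c.
sub only_listed  { my ($t) = @_; return $t =~ m/^[-\]a-c]+$/ ? 1 : 0 }

# The same class negated with a leading ^ inside the brackets.
sub has_unlisted { my ($t) = @_; return $t =~ m/[^-\]a-c]/   ? 1 : 0 }

print only_listed("a-b]c"), only_listed("a-d"), "\n";      # 10
print has_unlisted("abc"), has_unlisted("abc\n"), "\n";    # 01
```

The last call shows the surprise mentioned above: the trailing newline counts as "not in the class," so the negated class matches it.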

In general, any other metacharacter (including the special character classes later in this section) can be included within a character class. Some exceptions to this are the characters .+()*|$^ which all have their mundane meanings when they appear within a character class, and backreferences (\1, \2) don't work within character classes. The \b sequence means "backspace" in a character class, and not a word boundary.

The hexadecimal, octal, Unicode, and control sequences for characters also work just fine within character classes:

m/[\ca-\cz]/;  # Match Control-A through Control-Z
m/[\x80-\xff]/; # Match high-bit-on characters
use charnames qw(:full);
m/[\N{ARIES}\N{SCORPIUS}\N{PISCES}\N{CANCER}\N{SAGITTARIUS}]/;

In Perl regular expressions, common character classes also can be represented by convenient shortcuts. These are listed as follows:

Class  Name                What It Matches
\d     Digits              [0-9]
\D     Nondigits           [^0-9]
\s     Whitespace          [\x20\t\n\r\f]
\S     Non-whitespace      [^\x20\t\n\r\f]
\w     Word character      [a-zA-Z0-9_]
\W     Non-word character  [^a-zA-Z0-9_]

These shortcuts can be used within regular character classes or by themselves within a pattern match:

if (/x[\da-f]/i) { } # Match something hex-ish
s/(\w+)/reverse $1/e; # Reverse word-things only

The actual meaning of these will change if a locale is in effect. So, when perl encounters a string such as ¡feliz cumpleaños!, the exact meaning of metacharacters such as \w will change. This code

$a="\xa1feliz cumplea\xf1os!";  # Happy birthday, feliz cumpleaños
while($a=~m/(\w+)/g) {
  print "Word: $1\n";
}

will find three words in that text: feliz, cumplea, and os. The \xf1 (n with a tilde) character isn't recognized as a word character. This code:

use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "sp.ISO8859-1"); # Spanish, Latin-1 encoding

$a="\xa1feliz cumplea\xf1os!"; # Happy b-day.
while($a=~m/(\w+)/g) {
  print "Word: $1\n";
}

works as a Spanish speaker would expect, finding the words feliz and cumpleaños. The locale's effect can be switched off within a lexical block with no locale, causing the character classes to go back to their original meanings.

Perl also defines character classes to match sets of Unicode characters. These are called Unicode properties, and are represented by \p{property}. The list of properties is extensive because Unicode's property list is long and perl adds a few custom properties to that list as well. Because the Unicode support in Perl is (currently) in flux, your best bet to find out what is currently implemented is to consult the perlunicode manual page for the version of perl that you're interested in.
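
As one hedged illustration, the general-category property \p{Lu} (uppercase letter) is available in any reasonably modern perl (5.8 or later is assumed here). The helper name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $t begins with an uppercase letter in
# the Unicode sense, not just A-Z.
sub starts_with_uppercase {
    my ($t) = @_;
    return $t =~ m/^\p{Lu}/ ? 1 : 0;
}

print starts_with_uppercase("Perl"), "\n";        # 1
print starts_with_uppercase("\x{416}uk"), "\n";   # 1 (CYRILLIC CAPITAL LETTER ZHE)
print starts_with_uppercase("\x{436}uk"), "\n";   # 0 (the small letter)
```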

The last kind of character class shortcut (other than user-defined ones covered in the section on character classes) is defined by POSIX. Within another character class, the POSIX classes can be used to match even more specific kinds of characters. They all have the following form:

[:class:]

where class is the character class you're trying to match. To negate the class, write it as follows: [:^class:].

Class   Meaning
ascii   7-bit ASCII characters (with an ord value <128)
alpha   Matches a letter
lower   Matches a lowercase alpha
upper   Matches an uppercase alpha
digit   Matches a decimal digit
alnum   Matches both alpha and digit characters
space   Matches a whitespace character (just like \s)
punct   Matches a punctuation character
print   Matches alnum, punct, or space
graph   Matches alnum and punct
word    Matches alnum or underscore
xdigit  Matches hex digits: digit, a-f, and A-F
cntrl   Matches control characters: the ASCII characters with an ord value <32, plus DEL (127)

To use the POSIX character classes, they must be within another character class:

for(split(//,$line)) {
  if (/[[:print:]]/) { print; }
}

Using a POSIX class on its own:

if (/[:print:]/) { } # WRONG!

won't have the intended effect. The previous bit of code would match :, p, r, i, n, and t.

If the locale pragma is in effect, the POSIX classes will work as the corresponding C library functions such as isalpha, isalnum, isascii, and so on.
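
A short sketch of the POSIX classes, including the negated form, might look like this. The count_class helper and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the characters of $text that belong to
# the named POSIX class (a leading ^ in the name negates it).
sub count_class {
    my ($text, $class) = @_;
    my $count = () = $text =~ m/[[:${class}:]]/g;
    return $count;
}

my $line = "Room 101, Floor 3!";
printf "%-7s %2d\n", $_, count_class($line, $_)
    for qw(digit upper punct space ^digit);
```

For this sample the counts are 4 digits, 2 uppercase letters, 2 punctuation characters, 3 whitespace characters, and 14 non-digits.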

Example Listing 3.3

# Analyze the file on STDIN (or the command line) to get the
# makeup. A typical MS-Word doc is about 60-70% high-bit
# characters and control codes. This book in XML form was
# less than 4% control codes, 10.8% punctuation, 18.2% whitespace
# and 69% alphanumeric characters.

use warnings;
use strict;
my(%chars, $total, %props, $code, %summary);
# Take the file apart, summarize the frequency for
#  each character.
while(<>) {
  $chars{$_}++ for(split(//));
  $total+=length;
}

# Warning: space and cntrl overlap so >100% is possible!
%props=(alpha => "Alphabetic",  digit => "Numeric",
  space => "Whitespace",  punct => "Punctuation",
  cntrl => "Control characters",
  '^ascii' => "8-bit characters");

# Build the code to analyze each kind of character
#  and classify it according to the POSIX classes above.
$code.="\$summary{'$_'}+=\$chars{\$_} if /[[:$_:]]/;\n"
  for(keys %props);
eval "for(keys %chars){ $code }";

foreach my $type (keys %props) {
  no warnings 'uninitialized';
  printf "%-18s %6d %4.1f%%\n", $props{$type}, $summary{$type},
    ($summary{$type}/$total)*100;
}

NOTE

See Also

bytes, utf8, and POSIX module documentation

perlunicode in the perl documentation

isalpha in the C library reference

Quantifiers

Usage

{min,max}
{min,}
{min}
*
+
?

Description

Quantifiers are used to specify how many of a preceding item to match. That item can be a single character (/a*/), a group (/(foo)?/), or it can be anything that stands in for a single character such as a character class (/\w+/).

The first quantifier is ?, which means to match the preceding item zero or one times (in other words, the preceding item is optional).

/flowers?/;    # "flower" or "flowers" will match
/foo[0-9]?/;    # foo1, foo2 or just foo will match
/\b[A-Z]\w+('s)?\b/;  # Matches things like "Bob's" or "Carol" --
      #  capitalized singular words, possibly possessed

# Match day of week names like 'Mon', 'Thurs' and Friday.
# (caution: also matches oddities like 'Satur' -- this can be
#  remedied, but makes a lousy example.)
/(Mon|Tues?|Wed(nes)?|Thu(rs)?|Fri|Sat(ur)?|Sun)(day)?/;

Any portion of a match quantified by ? will always be successful. Sometimes an item will be found, and sometimes not, but the match will always work.

The quantifier * is similar to ? in that the quantified item is optional, except * specifies that the preceding item can match zero or more times. Specifically, the quantified item should be matched as many times as possible and still have the regular expression match succeed. So,

/fo*bar/;

matches 'fobar', 'foobar', 'foooobar', and also 'fbar'. The * quantifier will always match positively, but whether a matching item will be found is another question. Because of this, beware of expressions such as the following:

/[A-Z]*\w*/

You might hope it will match a series of uppercase characters and then a set of word characters, and it will. But it also will match numbers, empty strings, and binary data. Because everything in this expression is optional, the expression will always match.
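
To see how little an all-optional pattern can promise, consider this sketch (the helper what_matched is hypothetical). It reports the text actually consumed by the match:

```perl
use strict;
use warnings;

# Hypothetical helper: return what $re matched in $text, or undef on failure.
sub what_matched {
    my ($text, $re) = @_;
    return $text =~ m/($re)/ ? $1 : undef;
}

my $optional = qr/[A-Z]*\w*/;
for my $text ("ABCdef", "!!!", "!!! ABCdef") {
    my $got = what_matched($text, $optional);
    printf "%-14s matched '%s' (length %d)\n", "'$text'", $got, length $got;
}
```

The third string is the instructive one: the match succeeds at the very first position with nothing consumed, so the engine never moves on to find ABCdef later in the string.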

With * you can absorb unwanted material to make your match less specific:

# This matches any of: <body>, <body background="">,
#  <body background="foo.gif">, <body onload="alert()">,
#  or <body onload="alert()" background="foo.gif">
/<\w+(\s+\w+="[^"]*")*>/;

In the preceding example, * was used to make [^"] match empty quote marks, or quote marks with something inside; it was used to make the attribute match (foo="bar") optional, and repeat it as often as necessary.

The + quantifier requires the match not only to succeed at least once, but also as many times as possible and still have the regular expression match be successful. So, it's similar to *, except that at least one match is guaranteed. In the preceding example, the space following the \w+ was specified as \s+; otherwise items such as <bodyonload="alert()"> would match.

/fo+bar/;

This matches 'fobar', 'foobar', and of course 'fooooobar'. But unlike *, it will not match 'fbar'.

Perl also allows you to match an item a minimum, fixed, or maximum number of times with the {} quantifiers.

Quantifier  Meaning
{min,max}   Matches at least min times, but at most max times.
{min,}      Matches at least min times, but as many as necessary for the match to succeed.
{count}     Matches exactly count times.

Keep in mind that with the {min,} and {min,max} quantifiers, the quantified item will absorb as many characters as it can while still letting the match as a whole succeed. Thus with the following:

$_="Python";
if (m/\w(\w{1,5})\w\w/) {
  print "Matched ", length($1), "\n";
}

The $1 variable winds up with only three characters because the first \w matched P, the last \w's needed "on" to be successful, and that left "yth" for the quantified \w.

Perl's quantifiers are normally maximal matching, meaning that they match as many characters as possible but still allow the regular expression as a whole to match. This is also called greedy matching.

The ? quantifier has another meaning in Perl: when affixed to a *, +, or {} quantifier, it causes the quantifier to match as few characters as necessary for the match to be successful. This is called minimal matching (or lazy matching).

Take the following code:

$_=q{"You maniacs!" he yelled at the surf. "You blew it up!"};
while (m/(".*")/g) {
  print "$1\n";
}

It might surprise you to see that the regular expression grabs the entire string, not just each quote individually. That's because ".*" matches as much as possible between the quote marks, including other quote marks. Changing the expression to:

m/".*?"/g

solves this problem by asking * to match as little as possible for the match to succeed.

Keep in mind that ? is just a convenient shorthand and might not represent the best possible solution to the problem. The pattern /"[^"]*"/ would have been a more efficient choice because the amount of backtracking by the regular expression engine to be done would have been less. But there is programmer efficiency to consider.
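
The three approaches can be compared side by side. This is a sketch only; the helper names are invented for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers: collect every match of one pattern in list context.
sub quotes_greedy  { my ($t) = @_; my @q = $t =~ m/(".*")/g;    return @q }
sub quotes_lazy    { my ($t) = @_; my @q = $t =~ m/(".*?")/g;   return @q }
sub quotes_negated { my ($t) = @_; my @q = $t =~ m/("[^"]*")/g; return @q }

my $text = q{"You maniacs!" he yelled at the surf. "You blew it up!"};
my @greedy  = quotes_greedy($text);
my @lazy    = quotes_lazy($text);
my @negated = quotes_negated($text);

print scalar(@greedy),  " greedy match\n";              # 1
print scalar(@lazy),    " lazy matches\n";              # 2
print scalar(@negated), " negated-class matches\n";     # 2
```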

NOTE

See Also

m operator in this book

Modification Characters

Usage

\Q \E \L \l \U \u

Description

The modification characters used in string literals (in an interpolated context) are available in regular expressions as well. See the entry on modification characters for a list.

Understand that these "metacharacters" aren't really metacharacters at all. They do their work because regular expression match operators allow interpolation to happen when the pattern is first examined. In much the same way that \L and \U are only effective in double-quoted strings, they're only effective in regular expressions when the pattern is first examined by perl.

$foo='\U';
if (m/${foo}blah/) { } # Won't look for BLAH, but 'Ublah'
if (m/\Ublah/) { }   # Will look for BLAH
if (m/(a)\U\1/) { }  # Won't look for aA as you might hope

Most useful among these in regular expressions is the \Q modifier. The \Q modifier is used to quote any metacharacters that follow. When accepting something that will be used in a pattern match from an untrusted source, it is vitally important that you not put the pattern into the regular expression directly. Take this small sample:

# A CGI form is a _VERY_ untrustworthy source of info.

use CGI qw(:form :standard);
print header();
$pat=param("SEARCH");
# ...sometime later...
if (/$pat/) {
}

The trouble with this is that handing $pat to the regular expression engine opens up your system to running code that's determined solely by the user. If the user is malicious, he can:

  • Specify a regular expression that will never terminate (endless backtracking).

  • Specify a regular expression that will use indeterminate amounts of memory.

  • Specify a regular expression that can run perl code with a (?{}) pattern.

The third one is probably the most malicious, so it is disabled unless a use re 'eval' pragma is in effect or the pattern is compiled with a qr operator.

The \Q modifier will cause perl to treat the contents of the pattern literally until an \E is encountered.
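
The difference is easy to see with a harmless pattern. In this sketch the helper names are hypothetical and 'a.c' stands in for user input; quotemeta is the built-in function that performs the same quoting as \Q.

```perl
use strict;
use warnings;

# Hypothetical helpers: match $pat as a regular expression, or as literal text.
sub raw_match     { my ($text, $pat) = @_; return $text =~ m/$pat/      ? 1 : 0 }
sub literal_match { my ($text, $pat) = @_; return $text =~ m/\Q$pat\E/ ? 1 : 0 }

my $pat = 'a.c';   # stand-in for user-supplied input
print "raw:     ", raw_match('abc', $pat), "\n";       # 1: the dot is a wildcard
print "literal: ", literal_match('abc', $pat), "\n";   # 0: the dot must be a real dot
print "quoted:  ", quotemeta($pat), "\n";              # a\.c
```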

Example Listing 3.4

# A re-working of the inline sample above a little
#  more safe.* A form parameter "SEARCH" is used to
#  check the GECOS (often "name") field.
# *[Of course, software is only completely "safe" when
#  it's not being used. --cap]

use CGI qw(:form :standard);

print header(-type => 'text/plain');
$pat=param("SEARCH");

push(@ARGV, "/etc/passwd");
while(<>) {
  ($name)=(split(/:/, $_))[4];
  if ($name=~/\Q$pat\E/) {
    print "Yup, ${pat}'s in there somewhere!\n";
  }
}

NOTE

See Also

Modification characters in this book


Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020