Core Python Applications Programming: Regular Expressions

Python supports regular expressions through the standard library re module. In this chapter, Wesley Chun gives you a concise introduction to regular expressions. Because the treatment is brief, only the most common aspects of regexes used in everyday Python programming are covered.

This excerpt is from the Rough Cuts version of the book and may not represent the final version of this material.


Chapter Topics

  • Introduction/Motivation
  • Special Characters and Symbols
  • Regular Expressions and Python
  • re Module

1.1 Introduction/Motivation

Manipulating text/data is a big thing. If you don’t believe me, look very carefully at what computers primarily do today. Word processing, “fill-out-form” Web pages, streams of information coming from a database dump, stock quote information, news feeds—the list goes on and on. Because we may not know the exact text or data that we have programmed our machines to process, it becomes advantageous to be able to express this text or data in patterns that a machine can recognize and take action upon.

If I were running an electronic mail (e-mail) archiving company, and you were one of my customers who requested all of your e-mail sent and received last February, for example, it would be nice if I could set a computer program to collate and forward that information to you, rather than having a human being read through your e-mail and process your request manually. You would be horrified (and infuriated) that someone would be rummaging through your messages, even if his or her eyes were supposed to be looking only at the time stamps. Another example request might be to look for a subject line like “ILOVEYOU,” indicating a virus-infected message, and remove those e-mail messages from your personal archive. This raises the question of how we can program machines with the ability to look for patterns in text.

Regular expressions provide such an infrastructure for advanced text pattern matching, extraction, and/or search-and-replace functionality. To put it simply, a regular expression (a “regex” for short) is a string that uses special symbols and characters to indicate pattern repetition or to represent multiple characters so that it can “match” a set of strings with similar characteristics described by the pattern (Figure 15–1). In other words, regexes enable matching of multiple strings; a regex pattern that matched only one string would be rather boring and ineffective, wouldn’t you say?

Figure 15–1 You can use regular expressions, such as the one here, to recognize strings that look like Python identifiers. “[A-Za-z]\w+” means the first character should be alphabetic, i.e., either A–Z or a–z, followed by at least one (+) word character (\w): a letter, digit, or underscore. (This is a simplification: real Python identifiers may also begin with an underscore or consist of a single character.) In our filter, notice how many strings go into the filter, but the only ones to come out are the ones we asked for via the regex. One example that did not make it was “4xZ” because it starts with a digit.
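The filter in the figure can be sketched in a few lines using the re module. This is only an illustration: the input strings are made up for the example, and fullmatch() is used here on the assumption that the whole string, not just a prefix, should fit the pattern.

```python
import re

# Pattern from the figure: one letter, then one or more word characters.
IDENT = re.compile(r"[A-Za-z]\w+")

# Hypothetical input strings, chosen only for illustration.
candidates = ["foo", "4xZ", "x", "bar_2", "my var", "_private"]

# fullmatch() succeeds only if the entire string fits the pattern.
accepted = [s for s in candidates if IDENT.fullmatch(s)]
print(accepted)  # ['foo', 'bar_2']
```

Note that “x” and “_private” are rejected even though both are legal Python identifiers, because this pattern demands a leading letter and at least two characters in total.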

Python supports regexes through the standard library re module. This introductory subsection gives you a concise overview. Because it is brief, only the most common aspects of regexes used in everyday Python programming will be covered. Your experience will, of course, vary. We highly recommend reading any of the official supporting documentation as well as external texts on this interesting subject. You will never look at strings the same way again!

1.1.1 Your First Regular Expression

As we mentioned above, regexes are strings containing text and special characters that describe a pattern with which to recognize multiple strings. Every regular expression is defined over an alphabet of symbols. For general text, that alphabet is the set of all uppercase and lowercase letters plus numeric digits. Specialized alphabets are also possible, for instance, one consisting of only the characters “0” and “1”. The set of all strings over this alphabet is the set of all binary strings, i.e., “0,” “1,” “00,” “01,” “10,” “11,” “100,” etc.
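To make the idea of an alphabet concrete, here is a small sketch that enumerates the shortest binary strings and then checks membership with a regex. The pattern [01]+ and the name BINARY are our own choices for this illustration, not something defined earlier in the text.

```python
import re
from itertools import product

# All strings of length 1 and 2 over the alphabet {"0", "1"}.
binary_strings = ["".join(p) for n in (1, 2) for p in product("01", repeat=n)]
print(binary_strings)  # ['0', '1', '00', '01', '10', '11']

# A character class stands for "any one symbol of this alphabet";
# the + repeats it one or more times.
BINARY = re.compile(r"[01]+")
print(bool(BINARY.fullmatch("100")))  # True
print(bool(BINARY.fullmatch("102")))  # False: "2" is not in the alphabet
```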

Let us look at the most basic of regular expressions now to show you that although regexes are sometimes considered an “advanced topic,” they can also be rather simple. Using the standard alphabet for general text, we present some simple regexes and the strings that their patterns describe. The following regular expressions are the most basic, “true vanilla,” as it were. Each consists of a string pattern that matches only one string, the string defined by the regular expression. We now present the regexes followed by the strings that match them:

Regex Pattern    String(s) Matched
foo              foo
Python           Python
abc123           abc123

The first regular expression pattern from the above chart is “foo.” This pattern contains no special symbols, so it matches only the literal characters it spells out, and the only string that matches it is “foo.” The same thing applies to “Python” and “abc123.” The power of regular expressions comes in when special characters are used to define character sets, subgroup matching, and pattern repetition. It is these special symbols that allow a regex to match a set of strings rather than a single one.
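As a quick check of the chart above, the following sketch tries each literal pattern against a handful of strings. The candidate strings are hypothetical, and fullmatch() is assumed here so that a pattern must account for the entire string.

```python
import re

# The three literal patterns from the chart.
patterns = ["foo", "Python", "abc123"]

# Hypothetical test strings, including near misses.
candidates = ["foo", "Foo", "Python", "python", "abc123", "abc124"]

# For each pattern, collect the candidates it matches in full.
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
# foo -> ['foo']
# Python -> ['Python']
# abc123 -> ['abc123']
```

Each pattern accepts exactly one string, and matching is case-sensitive by default, which is why “Foo” and “python” are left out.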
