Regular Expressions 102: Text Translation in C++

With the new Regular Expression library in the C++11 specification, you can perform complex text translation simply by specifying a couple of text patterns and then just calling a function. Brian Overland, author of C++ for the Impatient, demonstrates how easy and yet powerful these features are to use.

If you've programmed in C++ or one of the other popular programming languages for a while, you probably know how to perform complex text-translation tasks, such as reformatting source code from one HTML format to another. Such tasks are always doable by examining one character at a time and then making a series of decisions. But how would you like to be able to perform complex text translation by specifying a couple of text patterns and then just calling a function? That's possible with the new Regular Expression library included in the C++11 specification (known as C++0x during its development). In this article, I'll focus on text translation by search-and-replace.

The Most Useful Regular-Expression Characters

First let's do some review. My previous article "Regular Expressions 101" explained the basic use of the C++ Regular Expression library. The following table reviews some of those basics.

Special Character(s)    Matches

.                       Any one character.
[range]                 A range of characters such as [a-z], which matches any one lowercase letter, or [0-9], which matches any one digit. You can also build complex ranges, as in [abm-z0-9], which matches any one of the following: a, b, any lowercase letter from m to z inclusive, or a digit.
expression*             An expression repeated zero or more times. As I explain shortly, the asterisk (*) is an expression modifier, not a separate expression.
expression+             An expression repeated one or more times.
(expression)            A group.

The last syntax element in the table, the group, is important as a way of specifying which characters to repeat. Groups are also essential to search-and-replace operations, which I'll delve into later in this article.

These five syntax elements are a relatively small subset of the full regular-expression syntax, but they're by far the most widely used. With these elements, you can specify a wide variety of patterns.

Subtleties of the Repeat Operators

The repeat operators, * and +, involve a subtlety that nearly all manuals and articles on regular expressions gloss over or misstate. Consider the following regular expression:

ca*t

To understand this pattern, you should first note that, by default, a character without special meaning is interpreted literally. That is, usually a character "is what it is." In the regular expression ca*t, a regular-expression function first looks for c in the input string:

c

The function doesn't then try to match a—or at least, not exactly! Instead, the characters a* are taken together to form a sub-pattern that says "Match zero or more copies of the letter a." Therefore, the expression ca*t matches any of the following strings:

ct
cat
caat
caaat

And so on. It might seem reasonable to say that a* means "Match a and then match zero or more copies of a"—but that's not how it works.

Understanding Groups

Finally, let's review the purpose of groups. The repeat operators, * and +, apply to just one character—the character that precedes them—unless the preceding characters are in a range or in a group. For example, what does the following match?

bana(na)+

The final occurrence of na is in a group. As special characters, the parentheses are not matched literally, but rather are used to specify the group. So if you call a function that tries to match this pattern, the function tries to match characters in the input string in this order:

  1. Match bana exactly once.
  2. Match one or more occurrences of na. (Using the asterisk (*) in place of the plus sign would match zero or more occurrences instead.)

So any of the following strings is a match:

banana
bananana
banananana
bananananana

and so on.
