Searching and Matching with Regular Expressions in C++
In my years of programming work, I’ve seen even experienced programmers write overly complex code that effectively reinvents the wheel far too often. One area where I see this problem in particular is string searching, matching, and replacing. For simple searches and matches in C++, the standard string class does just fine, as do the original C functions that are still available. For example, if you need to find where in a string the first instance of the substring abc occurs, the string class method find does the job, as does the C function strstr. (The similarly named find_first_of is a different kind of search: it looks for the first occurrence of any one of the characters a, b, or c, not for the whole substring.)
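As a quick illustration of the simple case, here is a minimal sketch that finds a substring both ways. The helper names are hypothetical and exist only for this example. The third helper is included for contrast, because find_first_of searches for any single character from a set rather than for a whole substring.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: offset of the first occurrence of needle in haystack,
// or std::string::npos if it does not occur, using std::string::find.
std::size_t find_with_string(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}

// Hypothetical helper: the same search using the C function strstr.
// strstr returns a pointer to the match (or a null pointer), so the
// pointer is converted to an offset to make the two results comparable.
std::size_t find_with_strstr(const char* haystack, const char* needle) {
    const char* hit = std::strstr(haystack, needle);
    return hit ? static_cast<std::size_t>(hit - haystack) : std::string::npos;
}

// For contrast: offset of the first character in haystack that is equal to
// ANY character in chars. This is a character-set search, not a substring search.
std::size_t find_any_char(const std::string& haystack, const std::string& chars) {
    return haystack.find_first_of(chars);
}
```

Given the input xxcabc and the search text abc, both substring searches report offset 3, while find_any_char reports offset 2, the position of the stray c.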
But what if you need to perform more complex searches? Suppose you’re writing a program that allows the user to enter either an email address or a user ID into a box. You want to determine whether the user entered an email address or a user ID, while simultaneously determining whether the email address or user ID is of a valid format. For example, suppose valid user IDs must be eight characters, at least one of which is a number. An email address must be a valid Internet email address. You can probably imagine how much code such a scenario could entail if all you have to work with are the methods in the string class.
I won’t even try to describe an algorithm here; any attempt I make without spending some serious thinking time would likely have at least an error or two. And don’t let your ego get in the way; even if you think you can write a perfect algorithm, do you really want to spend the time? I’m sure you or I could come up with an algorithm that’s correct, but it would take some time, and the system testers over in QA would then have to spend their own time verifying it.
Introducing Regular Expressions
Instead, I propose that a solution to such a problem should make use of a technology that has been around for years but is seriously overlooked: regular expressions. Many programmers have heard of regular expressions, but too many don’t know what they are.
What is a regular expression? How about this for a working definition to get you going: A regular expression is a pattern.
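To make that definition concrete for the user-ID-or-email scenario described earlier, here is a minimal sketch, not a definitive implementation. It assumes the std::regex library from C++11, which may not be the library this article goes on to use. It also assumes that a user ID consists only of letters and digits, and it uses a deliberately loose email pattern (something, an @ sign, something, a dot, something) rather than a full Internet email address validator. The names InputKind and classify_input are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
enum class InputKind { Email, UserId, Invalid };

// Hypothetical helper: decides whether input looks like an email address,
// a user ID, or neither.
InputKind classify_input(const std::string& input) {
    // Assumption: a user ID is exactly eight letters or digits.
    static const std::regex user_id_shape(R"([A-Za-z0-9]{8})");
    // Matches a single digit; used to enforce "at least one number".
    static const std::regex any_digit(R"([0-9])");
    // Assumption: a loose shape check only (text, '@', text, '.', text,
    // with no whitespace or extra '@'). Real email validation is stricter.
    static const std::regex email_shape(R"([^@\s]+@[^@\s]+\.[^@\s]+)");

    // regex_match requires the WHOLE string to fit the pattern.
    if (std::regex_match(input, email_shape)) {
        return InputKind::Email;
    }
    // regex_search only requires the pattern to occur somewhere in the string.
    if (std::regex_match(input, user_id_shape) && std::regex_search(input, any_digit)) {
        return InputKind::UserId;
    }
    return InputKind::Invalid;
}
```

Under these assumptions, user@example.com is classified as an email address and abc12345 as a user ID, while abcdefgh (no digit) and abc1234 (only seven characters) are rejected. The pattern syntax itself is not explained at this point; the sketch is only meant to show how little code the checks require once patterns are available.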
Most Windows users are familiar with the *.* pattern for matching filenames, or similar patterns such as *.cpp. The pattern *.cpp matches any file that ends with .cpp. The pattern a*.txt matches all filenames that start with a and end with .txt. The idea, of course, is that the asterisk (*) matches any sequence of characters, including none at all. A somewhat less popular pattern uses the question mark (?), which matches any single character. For example, abc??.txt will match any filename starting with abc, followed by any two characters, and then .txt. So abc01.txt will match, as will abcde.txt, but abc012.txt will not.
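The wildcard rules just described can be sketched in a few lines of code. This is an illustration of the simplified rules as stated in the text, not a reproduction of what Windows or DOS actually do; the real systems have extra quirks (for example, *.* also matching names that contain no dot) that this sketch ignores. The function name wildcard_match is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if name fits pattern, where '*' stands
// for any sequence of characters (including none) and '?' stands for
// exactly one character. Every other character must match literally.
bool wildcard_match(const std::string& pattern, const std::string& name) {
    std::size_t p = 0;                        // current position in pattern
    std::size_t n = 0;                        // current position in name
    std::size_t star = std::string::npos;     // position of the most recent '*'
    std::size_t mark = 0;                     // where in name that '*' began

    while (n < name.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            // Remember the star and first try letting it match nothing.
            star = p++;
            mark = n;
        } else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == name[n])) {
            // A '?' or an identical character consumes one character of name.
            ++p;
            ++n;
        } else if (star != std::string::npos) {
            // Mismatch: back up and let the last '*' absorb one more character.
            p = star + 1;
            n = ++mark;
        } else {
            return false;
        }
    }
    // Any stars left at the end of the pattern can match the empty string.
    while (p < pattern.size() && pattern[p] == '*') {
        ++p;
    }
    return p == pattern.size();
}
```

With this sketch, the pattern abc??.txt accepts abc01.txt and abcde.txt and rejects abc012.txt, in line with the examples above.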
Figure 1 shows a familiar example of pattern matching in use. This dialog box contains a bunch of patterns at the bottom; only filenames matching the pattern (along with directories) appear in the main part of the dialog box.
Figure 1 Windows and DOS use a simple pattern-matching system.
Regular expressions are simply a variation on this filename-matching theme. But instead of matching just filenames, regular expressions are used for matching strings in general. Unfortunately, the familiar * pattern isn’t used to mean "any string." In a regular expression, the asterisk instead means "zero or more repetitions of whatever comes immediately before it." But that’s okay; the rules are still easy to learn. In the following section, I’ll present some of the rules.
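As a small preview of how the asterisk differs between the two notations, the following sketch again assumes the C++11 std::regex library with its default syntax, which may differ in detail from the library and rules presented next. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the regular-expression counterpart of the wildcard
// *.cpp. Here '.' means "any one character", '*' repeats it zero or more
// times, and "\." is a literal dot.
bool has_cpp_extension(const std::string& name) {
    static const std::regex pattern(R"(.*\.cpp)");
    return std::regex_match(name, pattern);
}

// Hypothetical helper: in a regular expression, a* does NOT mean "a followed
// by anything". It means "zero or more a characters", so it accepts only
// strings made up entirely of the letter a (including the empty string).
bool is_run_of_a(const std::string& text) {
    static const std::regex pattern("a*");
    return std::regex_match(text, pattern);
}
```

Under these assumptions, has_cpp_extension accepts main.cpp, and is_run_of_a accepts aaa and the empty string but rejects abc, which the filename wildcard a* would have accepted.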