Commercial C and C++ programmers progress through these three stages of parsing maturity:
1. Those who have never coded with regular expressions (REs). When there's a need for parsing, these folks do it "by hand," rely on lex and yacc, and try to remember from their college days what LALR(1) means.
2. Those who have experienced the power of REs for themselves, and are intoxicated enough to assume that REs solve all problems.
3. Those who know when REs are a good fit, and when they aren't.
Let's see how quickly we can reach the third stage.
Intoxication around REs is understandable. They're great for eliminating the tedium of common situations when dealing with erratically formatted data. Think, for a moment, about what it would take to extract the first and last name from a line where variable whitespace and middle names might intervene. Accommodating the following cases, whose columns are separated by a mixture of tabs and spaces, is not difficult, but it takes care, and the resulting code is rarely pretty or edifying.
First Last
First Middle Last
First      Last
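To make the care involved concrete, here is a minimal sketch of one by-hand approach in C++. The function name split_name_by_hand is hypothetical, and the sketch assumes that a line with no tab or blank, or a line that begins with whitespace, should simply be rejected.

```cpp
#include <string>

// Hypothetical helper: extracts the first and last fields of a line by hand.
// Returns false when the line has no tab or blank, or starts with one.
bool split_name_by_hand(const std::string& line,
                        std::string& first, std::string& last)
{
    const char* whitespace = " \t";

    // The first name ends at the first tab or blank.
    std::string::size_type first_end = line.find_first_of(whitespace);
    if (first_end == std::string::npos || first_end == 0)
        return false;

    // The last name starts just after the last tab or blank.
    // That search cannot fail here, because one was already found above.
    std::string::size_type last_ws = line.find_last_of(whitespace);

    first = line.substr(0, first_end);
    last = line.substr(last_ws + 1);   // empty if the line ends in whitespace
    return true;
}
```

Even in this short form, each boundary condition (no separator, leading whitespace, trailing whitespace) needs its own decision, and that is where hand-written parsers tend to go wrong.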
An RE for such a case, though, can be as concise as the following:
This says, "Put everything at the beginning, before the first tab or blank, into one variable; and put everything at the end, after the last tab or blank, into a second variable." That's just what we want!
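As a sketch of how such an expression might be used from C++, the following assumes the pattern `^([^ \t]+).*[ \t]([^ \t]*)$`. That pattern fits the description above but is an assumption, not necessarily the exact expression intended. The sketch relies on std::regex from C++11, and the helper name split_name_re is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: splits a line into first and last name with an RE.
// Returns false when the line does not match the assumed pattern.
bool split_name_re(const std::string& line,
                   std::string& first, std::string& last)
{
    // Assumed pattern: group 1 is everything before the first tab or blank,
    // group 2 is everything after the last tab or blank (possibly empty).
    static const std::regex pattern(R"(^([^ \t]+).*[ \t]([^ \t]*)$)");

    std::smatch match;
    if (!std::regex_match(line, match, pattern))
        return false;

    first = match[1].str();
    last = match[2].str();
    return true;
}
```

The two parenthesized groups correspond to the "two variables" in the description. Everything between them, middle names and stray whitespace alike, is consumed by the `.*[ \t]` in the middle.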
That sort of expressive power explains why so many modern runtime libraries (not just in C and C++, but also in Java, C#, Python, and other languages) include RE interfaces. There are times, however, when REs do too much or too little. At the low end, RE enthusiasm apparently makes some programmers forget about the capability that C and C++ runtime libraries have to handle easy problems on their own. When patterns are simple enough, for instance, a strchr() or strstr() can make for a more maintainable solution than even the briefest RE.
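For illustration, here is a minimal sketch of two such simple cases handled with the standard C string functions alone. The helper names, the "key=value" line format, and the "ERROR" marker are all hypothetical, chosen only to show the calls.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns the text after the first '=' in a
// "key=value" line, or an empty string when no '=' is present.
std::string value_after_equals(const char* line)
{
    const char* eq = std::strchr(line, '=');
    return eq ? std::string(eq + 1) : std::string();
}

// Hypothetical helper: reports whether a line contains a fixed marker.
bool mentions_error(const char* line)
{
    return std::strstr(line, "ERROR") != nullptr;
}
```

Neither function needs a pattern compiler or any pattern syntax, and a maintainer can see the intent of each at a glance.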
An even thornier problem is that REs model only a fraction of the parsers we want, and many programmers haven't learned to recognize when they do not apply. In formal terms, a regular expression describes the set of strings that conform to a "regular grammar."
Regular grammars form a subset of "context-free grammars." A regular grammar generates strings that can be parsed left to right without backtracking, and whose symbol matching is narrow: matches against enumerated collections or certain sequentially repeated matches. In the example above, we allow anything other than whitespace in the last name, and any number (from zero on up) of such characters.