Regular Expressions Do Not Solve All Problems

Good tools make all the difference when they're used on jobs where they "fit." What tasks are right for regular expressions, and when are there better choices? Cameron Laird runs it down for you.
Commercial C and C++ programmers progress through these three stages of parsing maturity:

  • Those who have never coded with regular expressions (REs). When there's a need for parsing, these folks do it "by hand," rely on lex and yacc, and try to remember from their college days what LALR(1) means.

  • Those who have experienced the power of REs for themselves, and are intoxicated enough to assume that REs solve all problems.

  • Those who know when REs are a good fit, and when they aren't.

Let's see how quickly we can reach the third stage.

Regular Excitement

Intoxication around REs is understandable. They're great for eliminating the tedium of common situations when dealing with erratically formatted data. Think, for a moment, about what it would take to extract the first and last name from a line where variable whitespace and middle names might intervene. Handling the following cases by hand, with columns separated by a mixture of tabs and spaces, is not difficult, but it takes care, and the resulting code is rarely pretty or edifying.

 First Last
 First  Middle   Last
 First       Last

An RE for such a case, though, can be as concise as the following:

 ^([^\s]*).*\s([^\s]*)$

This says, "Put everything at the beginning, before the first tab or blank, into one variable; and put everything at the end, after the last tab or blank, into a second variable." That's just what we want!
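As a rough illustration, the extraction might look like this in C++17 with std::regex. This is a sketch rather than the article's own code. The function name extract_names is made up for the example, and the pattern is a variant that skips surrounding whitespace and demands one whitespace character just before the final group, so that the greedy .* stops short of the last name.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper (not from the article): returns {first, last} when the
// line holds at least two whitespace-separated words, std::nullopt otherwise.
std::optional<std::pair<std::string, std::string>>
extract_names(const std::string& line) {
    // ^\s*   optional leading whitespace
    // (\S+)  first run of non-whitespace          -> group 1
    // .*\s   anything, ending in one whitespace character; .* is greedy,
    //        so this reaches the last gap in the line
    // (\S+)  final run of non-whitespace          -> group 2
    // \s*$   optional trailing whitespace
    static const std::regex pattern(R"(^\s*(\S+).*\s(\S+)\s*$)");

    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return std::nullopt;  // fewer than two words
    }
    return std::make_pair(m[1].str(), m[2].str());
}
```

Called on "First  Middle   Last", this sketch yields the pair First and Last; a line holding a single word produces no match.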

That sort of expressive power explains why so many modern runtime libraries—not just in C and C++, but also Java, C#, Python, and other languages—include RE interfaces. There are times, however, when REs do too much or too little. At the low end, RE enthusiasm apparently makes some programmers forget about the capability that C and C++ runtime libraries have to handle easy problems on their own. When patterns are simple enough, for instance, a strchr() or strstr() can make for a more maintainable solution than even the briefest RE.
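For the simple end of the spectrum, a sketch along these lines shows what strstr() and strchr() can do without any RE machinery. The helper names contains_literal and field_before are invented for this example, as is the idea of splitting a line at a separator character.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: true if `line` contains the literal text `needle`.
bool contains_literal(const char* line, const char* needle) {
    return std::strstr(line, needle) != nullptr;
}

// Hypothetical helper: returns the text before the first `sep` character,
// or the whole line if `sep` does not occur in it.
std::string field_before(const char* line, char sep) {
    const char* hit = std::strchr(line, sep);
    if (hit == nullptr) {
        return std::string(line);
    }
    return std::string(line, hit);  // copies the range [line, hit)
}
```

Neither helper needs a pattern to be compiled or escaped, which is much of the maintainability argument for fixed strings and single characters.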

An even thornier problem is that REs model only a fraction of the parsers we want, and many programmers haven't learned to recognize when they do not apply. In formal terms, a regular expression describes the set of strings generated by a "regular grammar."

Regular grammars form a subset of "context-free grammars." A regular grammar generates strings that can be recognized left to right without backtracking, and whose symbol matching is narrow: matches against enumerated collections of characters, or certain sequentially repeated matches. In the example above, we allow anything other than whitespace in the last name, and any number of such characters, from zero on up.
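One way to picture the "left to right without backtracking" property is a hand-written scanner for the name example. The sketch below is an assumption-laden illustration, not how any particular RE library works: the function name first_and_last and the two-state design are made up here. It reads each character exactly once and keeps only a small, fixed amount of control state.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical single-pass scanner: visits each character once, left to
// right, and never revisits earlier input. Returns {first word, last word}.
// Both are empty for a line with no words; a one-word line yields that word
// in both positions.
std::pair<std::string, std::string> first_and_last(const std::string& line) {
    enum class State { InGap, InWord };
    State state = State::InGap;
    std::string first;
    std::string current;  // the word being read, or the most recent one
    bool have_first = false;

    for (unsigned char c : line) {
        if (std::isspace(c)) {
            // A gap ends the current word; remember it if it was the first.
            if (state == State::InWord && !have_first) {
                first = current;
                have_first = true;
            }
            state = State::InGap;
        } else {
            // A new word starts: discard the previous one.
            if (state == State::InGap) {
                current.clear();
            }
            current += static_cast<char>(c);
            state = State::InWord;
        }
    }
    if (!have_first) {
        first = current;  // zero or one word on the line
    }
    return {first, current};
}
```

The control logic needs only two states, which is the kind of narrow matching the paragraph above describes.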
