Repetitive Searches in C++
- 19.1 Brute-Force Searches
- 19.2 The <tt>regex_iterator</tt> Class Template
- 19.3 The <tt>regex_token_iterator</tt> Class Template
- Exercises
There are only two or three human stories, and they go on repeating themselves as fiercely as if they had never happened before.
—O Pioneers!
WILLA CATHER
Did you spot the problem with the example program that searched for code snippets in a text file at the beginning of Chapter 18?
#include <regex> #include <iostream> #include <fstream> #include <string> #include <stdlib.h> using std::tr1::regex; using std::tr1::regex_search; using std::tr1::smatch; using std::string; using std::ifstream; using std::cout; static void show_matches(const char *fname) { // scan file named by fname, line by line ifstream input(fname); string str; smatch match; const char *expr = "<CODE>(.*)</CODE>"; regex rgx(expr,regex::icase); while (getline(input,str)) { // check line for match if (regex_search(str,match,rgx)) cout << match[1] << '\n'; } } int main(int argc,char *argv[]) { // search for code snippets in text file if (argc != 2) { // wrong number of arguments cout << "Usage: snippets <filename>\n"; return EXIT_FAILURE; } try { // search the file show_matches(argv[1]); } catch(...) { // something went wrong cout << "Error\n"; return EXIT_FAILURE; } return 0; }
In lines that have multiple code snippets, everything between the first "<CODE>" and the last "</CODE>" is listed as a single snippet. To separate multiple snippets, we first have to change the regular expression a bit so that it doesn't swallow multiple snippets. In this case, we can replace the ".*" with a nongreedy repetition:
string expr = " <CODE>(.*?) </CODE>";
Now, to resume searching after the text that matched, we have to change the code. To do that, instead of searching the entire line from the file, we use a pair of iterators that point at the contents of the line. After a match, we advance the first iterator to point at the character immediately following the match and search again.
Example 19.1. Repeated Searches (regexiter/repeated.cpp)
#include <regex> #include <iostream> #include <fstream> #include <string> #include <stdlib.h> using std::tr1::regex; using std::tr1::regex_search; using std::tr1::smatch; using std::string; using std::ifstream; using std::cout; static void show_matches (const char *fname) { //scan file named by fname, line by line ifstream input (fname); string str; smatch match; string expr = "<CODE>(.*?) </CODE>"; regex rgx (expr, regex::icase); while (getline (input, str)) { // check line for match string::const_iterator first = str.begin (); string::const_iterator second = str.end (); while (regex_search (first, second, match, rgx)) { // show match, then skip past it cout << match[1]<< '\n'; first += match.position () + match.length (); } } } int main(int argc, char *argv[]) { // search for code snippets in text file if (argc != 2) { // wrong number of arguments cout << "Usage : snippets <filename>\n"; return EXIT_FAILURE; } try { // search the file show_matches (argv [1]); } catch (...) { // something went wrong cout << "Error\n"; return EXIT_FAILURE; } return 0; }
Don't be fooled, though: Repetitive searches aren't usually that easy to write. For example, if the regular expression begins with a "^", simply restarting the search after a match, as the previous example does, can lead to wrong answers. The following program searches the target text "abcdef" for subsequences that match the regular expression "^(abc|def)". The only one is the initial "abc", but the program finds two, reporting that "def" also matches.
Example 19.2. Naive Search Doesn't Work (regexiter/naive.cpp)
#include <regex> #include <iostream> #include <string> using std::tr1::regex; using std::tr1::regex_search; using std::tr1::smatch; using std::string; using std::cout; int main() { // search for regular expression in text string str = "abcdef"; string::const_iterator first = str.begin (); string::const_iterator second = str.end (); smatch match; string expr = "^(abc | def)"; regex rgx(code); while (regex_search(first, second, match, rgx)) { // check range for match cout << match [0] << '\n'; first+=match.position () + match.length (); } return 0; }
In this chapter, we look first at the complications that any repetitive search has to allow for and the techniques for fixing problems (Section 19.1). Then we look at prewritten solutions, in the form of the class template regex_iterator (Section 19.2) and the class template regex_token_iterator (Section 19.3).
19.1 Brute-Force Searches
In Chapter 17 we looked at several flags that you can pass to the regular expression search functions to change the details of regular expression matching. Here, we look at some of those flags again but in the context of specific problems that arise in repetitive searches. Eventually, we'll build a search function that avoids these problems; you can judge for yourself whether that's a better approach than using the two forms of regular expression iterator that the TR1 library provides.
19.1.1 Lost Anchors
Earlier in this chapter, we looked at a naive search function that reported two matches when applying the regular expression "^(abc|def)" to the target text "abcdef". The problem with simply repeating the same search at a new location in the target text, as that program did, is that on the second call to regex_search, the target text is passed, effectively, as "def", which does match the regular expression. That is, we chopped off the start of the target text but didn't tell the search function that we had done that, so it matched the "^" at the beginning of the regular expression to the beginning of the text that we passed, even though the text was not the beginning of the target text. The solution to this problem is simply to tell the search function that we're not at the beginning of a line, so "^" shouldn't match. To do that, we use the flag match_not_bol for all searches except the first.
Example 19.3. Preserving Anchors (regexiter/search1.cpp)
#include <regex> #include <iostream> using std::tr1::regex; using std::tr1::cmatch; using std::tr1::regex_search; using namespace std::tr1::regex_constants; using std::cout; static void search (const char*tgt, const char *expr) { // show all subsequences of tgt that match expr regex rgx (expr); cmatch match; match_flag_type flags = match_default; const char*first = tgt; const char*last = tgt+strlen (tgt); for (;;) { // keep trying if (regex_search(first, last, match, rgx, flags)) { // show match, move past it cout << match.str () << "at offset" << (match [0].first-tgt) << '\n'; first+=match.position ()+match.length (); flags = flags | match_not_bol; } else break; } } int main() { // demonstrate use of match_not_bol const char*expr = "^(abc | def)"; const char*tgt = "abcdef"; search (tgt, expr); return 0; }
19.1.2 Lost Word Boundaries
The regular expression "\babc" should match the target text "abcabc" in one place: the first occurrence of the character sequence "abc". The second "abc" doesn't match, because it doesn't start at a word boundary. If you try the previous search function with this regular expression and target text, it will find two matches. The problem is similar to the one with lost anchors: When we restart the search after the first match, the regular expression engine treats the start of the text as a word boundary. You might be tempted to fix that with the same approach we used before, by adding the flag match_not_-bow after a successful match. But the two cases are different: A "^" can match only at the beginning of the original target text, so it's okay to simply disallow that match once we've moved away from the beginning of the text. A word boundary can occur inside the target text as well as at the beginning, so we have to be careful to disable matching the beginning of a word only when we're not at the beginning of a word. That can be done by checking whether the last character in a match can be in a word and, if so, prohibiting matching the beginning of a word on the next pass. That solves half the problem.
The other half of the problem occurs with a regular expression like "\b3", when matched against the target text "33". The first "3" is at a word boundary, so it should match. The second "3" is not at a word boundary, so it should not match. But the previous version of search will find that the second one matches because in the target text that's passed for the second search, it is at the beginning of the target text. So we also need to disable matching of the end of a word when the previous character cannot be in a word.
But there's an easier way. The regular expression engine already knows how to identify characters that can be in a word, so we don't need to write that logic ourselves. All we need to do is tell the engine that it can look at the character in front of the target text to decide whether it's at the beginning of a word. That's what the flag match_prev_avail does. Of course, we should do that only when we know that a valid character is in front of the target text. Once we've moved forward in the target text, we know that we can look behind the current position.
Example 19.4. Preserving Word Boundaries (regexiter/search2.cpp)
#include <regex> #include <iostream> using std::tr1::regex; using std::tr1::cmatch; using std::tr1::regex_search; using namespace std::tr1::regex_constants; using std::cout; static void search (const char*tgt, const char*expr) { // show all subsequences of tgt that match expr regex rgx (expr); cmatch match; match_flag_type flags = match_default; const char*first = tgt; const char*last = tgt+strlen (tgt); for (;;) { // keep trying if (regex_search (first, last, match, rgx, flags)) { // show match, move past it cout << match.str () <<"at offset" << (match [0]. first - tgt) << '\n'; first+=match.position ()+match.length (); flags = flags | match_not_bol | match_prev_avail; } else break; } } int main() { //demonstrate use of match_not_bol const char*expr = "\\babc"; const char*tgt = "abcabc"; search (tgt, expr); return 0; }
19.1.3 Empty Matches
To understand the problem that empty matches pose, we first need to look at empty matches in more detail. The regular expression "a*" matches a sequence of zero or more repetitions of the character 'a'. When it matches zero characters, that's an empty match. If you call regex_search to see whether that regular expression matches the target text "bcd", the answer will be that it matches, right at the beginning.
Example 19.5. Empty Match (regexiter/empty.cpp)
#include <regex> #include <iostream> using std::tr1::regex; using std::tr1::cmatch; using std::tr1::regex_search; using namespace std::tr1::regex_constants; using std::cout; int main() { // show empty match const char*expr = "a*"; regex rgx(expr); cmatch match; const char*tgt = "bcd"; if (regex_search(tgt, match, rgx)) { // show the match cout << "Matched at offset" << match.position () << ",with length" << match.length () << '\n'; } return 0; }
If we use the search function that we wrote to eliminate lost anchors to search for all occurrences of "a*" in the target text "bcd", we'll get into trouble. The first match is at offset 0, and its length is 0, so the function will adjust the position in the target string by zero characters and call regex_-search again. This will loop until you get bored and terminate the program.
There are two obvious solutions. First, move the position in the target text forward by one character when you get an empty match. Second, temporarily prohibit empty matches. Both work for some cases, but, as we'll see, you really need a combination of the two.
This version of search implements the first fix.
Example 19.6. Jumping Past Empty Matches (regexiter/search3.cpp)
#include <regex> #include <iostream> #include <string> using std::tr1::regex; using std::tr1::cmatch; using std::tr1::regex_search; using namespace std::tr1::regex_constants; using std::cout; using std::string; static void search (const char*tgt, const char*expr) { // show all subsequences of tgt that match expr regex rgx(expr); cmatch match; match_flag_type flags = match_default; const char*first = tgt; const char*last = tgt+strlen(tgt); string empty("[empty]"); for(;;) { // keep trying if (regex_search(first, last, match, rgx, flags)) { // show match, move past it cout << (match.length()?match.str ():empty) << "at offset" << (match[0].first-tgt) << '\n'; if (match.length()!=0) first+=match.position()+match.length(); else if (first == last) break; else ++first; flags = flags| match_not_bol|match_prev_avail; } else break; } } int main() { // demonstrate use of match_not_null const char*expr = "a*"; const char*tgt = "bcd"; search (tgt, expr); return 0; }
Note the test for first == last; without this, the function will increment first past the end of the target text if an empty match occurs at the end of the target text. This works fine for the regular expression "a*", but try it with the regular expression "a*|c". It doesn't see that the regular expression matches the "c" in the target text. That's because it finds the empty match at that position and jumps past it.
This version of search implements the second fix, using the flag match_-not_null to prevent empty matches until after the next successful match.
Example 19.7. Blocking Empty Matches (regexiter/search4.cpp)
#include <regex> #include <iostream> #include <string> using std::tr1::regex; using std::tr1::cmatch; using std::tr1::regex_search; using namespace std::tr1:: regex_constants; using std::cout; using std::string; static void search(const char*tgt, const char*expr) { // show all subsequences of tgt that match expr regex rgx(expr); cmatch match; match_flag_type flags = match_default; match_flag_type mod = match_default; const char*first = tgt; const char*last = tgt+strlen(tgt); string empty("[empty]"); for(;;) { // keep trying if (regex_search (first, last, match, rgx, flags| mod)) { // show match, move past it cout << (match.length()? match.str():empty) << "at offset" << (match [0].first-tgt) << '\n'; if (match.length()!=0) { // move past match, clear modifier flags first+=match.position()+match.length(); mod = match_default; } else mod = match_not_null; flags = flags | match_not_bol | match_prev_avail; } else break; } } int main() { // demonstrate use of match_not_bol const char * expr = "a*|c"; const char*tgt = "bcd"; search(tgt, expr); return 0; }
This program does, indeed, find the match of "c", but it's not right, because it misses the empty match before "c". We've shut off empty matches for too long. The fix is to shut off empty matches only at the current position in the target text. To do that, we need two changes. First, we need to add the flag match_continuous, so that the regular expression search engine won't look for matches that occur after the start of the target text. That way, we control when the search advances further into the target text. Second, if that constrained search fails, we need to turn off the constraint and move to the next position in the target text. That is, we need to combine the two previous attempted solutions.
Example 19.8. Fixing an Empty Match (regexiter/search5.cpp)
#include <regex> # include <iostream> # include <string> using std::tr1::regex; using std::tr1::cmatch; using std::tr1::regex_search; using namespace std::tr1::regex_constants; using std::cout; using std::string; static void search(const char*tgt, const char*expr) { // show all subsequences of tgt that match expr regex rgx(expr); cmatch match; match_flag_type flags = match_default; match_flag_type mod = match_default; const char*first = tgt; const char* ast = tgt+ strlen(tgt); string empty("[empty]"); for(;;) { // keep trying if (regex_search(first, last, match, rgx, flags | mod)) { // show match, move past it cout << (match.length()? match.str():empty) << "at offset" << (match [0].first-tgt) << '\n'; if (match.length()!=0) { // move past match, clear modifier flags first+=match.position()+ match.length(); mod = match_default; } else mod = match_not_null | match_continuous; flags = flags | match_not_bol | match_prev_avail; } else if (mod != match_default && first!= last) { // move past failed match, clear modifier flags ++first; mod = match_default; } else break; } } int main() { // demonstrate use of match_not_bol const char *expr = "a*|c"; const char *tgt = "bcd"; search(tgt, expr); return 0; }
Now we have a robust search function. It's a little difficult to reuse, 1 though, because the action that it performs when it finds a match is embedded in the code that finds the match. Although this code can be made more generic, in most cases, you should use one of the two forms of iterator that the TR1 library provides, rather than trying to adapt this explicit loop.