- 7.1 Introduction
- 7.2 Fundamentals of Characters and Strings
- 7.3 Class String
- 7.4 Class StringBuilder
- 7.5 Class Character
- 7.6 Tokenizing Strings
- 7.7 Intro to Natural Language Processing (NLP)-at the Root of Generative AI<sup><a id="fn7_5" href="ch07.xhtml#fn7_5a">5</a></sup>
- 7.8 Objects-Natural Case Study: Intro to Regular Expressions in NLP
- 7.9 Objects-Natural Security Case Study: pMa5tfEKwk59dTvC04Ft1IFQz9mEXnkfYXZwxk4ujGE=
- 7.10 Wrap-Up
7.8 Objects-Natural Case Study: Intro to Regular Expressions in NLP
Sometimes, you’ll need to recognize patterns in text, like phone numbers, e-mail addresses, ZIP codes, web addresses, Social Security numbers and more. A regular expression (regex) String describes a search pattern for matching characters in other Strings. Regular expressions can help you extract data from unstructured text, such as social media posts. They’re also crucial for preparing text data for processing so it can be used to train artificial intelligence models used in natural language processing (NLP) and generative AI. Before working with text data, you’ll often use regular expressions to validate it. For example, you might check that:
A U.S. ZIP code consists of five digits (such as 02215) or five digits followed by a hyphen and four more digits (such as 02215-4775).
A last name contains only letters, spaces, apostrophes and hyphens.
An e-mail address contains only the allowed characters in the allowed order.
A U.S. Social Security number contains three digits, a hyphen, two digits, a hyphen and four digits, and adheres to other rules about the specific numbers used in each group of digits.
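As a minimal sketch of the first check above, the following hypothetical ZipCheck class (its class, constant and method names are our own, not from this chapter's figures) validates a U.S. ZIP code with String method matches. It assumes the simple format described above: the parentheses group the optional hyphen-and-four-digits part, and the ? quantifier, discussed in Section 7.8.1, makes that group optional.

```java
// Hypothetical helper for illustration; not one of the chapter's figures.
public class ZipCheck {
    // five digits, optionally followed by a hyphen and four more digits
    private static final String ZIP_REGEX = "\\d{5}(-\\d{4})?";

    public static boolean isValidZip(String text) {
        return text.matches(ZIP_REGEX); // matches must match the entire String
    }

    public static void main(String[] args) {
        String[] candidates = {"02215", "02215-4775", "0221", "02215-47"};

        for (String candidate : candidates) {
            System.out.printf("%s: %b%n", candidate, isValidZip(candidate));
        }
    }
}
```

A check like this confirms only the format. It does not tell you whether the ZIP code actually exists.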
For common patterns, you rarely need to create regular expressions yourself. Various websites offer repositories of existing regular expressions that you can copy and use. Many of these sites also let you test regular expressions to determine whether they meet your needs.
Other Uses of Regular Expressions
Regular expressions are also used to:
Extract data from text (known as scraping, text mining and parsing)—for example, locating all URLs in a webpage.
Clean data—for example, removing data that’s not required, eliminating duplicate data, handling incomplete data, fixing typos, ensuring consistent data formats, removing formatting, changing text cases, dealing with outliers and more.
Transform data into other formats—for example, reformatting tab-separated or space-separated values into comma-separated values (CSV) for an application that requires data to be in CSV format.
Generative AI
1 Prompt genAIs to locate the best sites for testing regular expressions. Test that the regular expressions we show in this chapter match their intended text patterns.
7.8.1 Matching Complete Strings to Patterns
Figure 7.19 demonstrates matching entire Strings to regular expression patterns. We broke the program into parts for discussion purposes.
Matching Literal Characters
The String class’s matches method returns true if the String that calls the method matches the pattern in its argument. By default, pattern matching is case-sensitive—you’ll see how to perform case-insensitive matches later. Let’s begin by matching literal characters that match themselves. Line 8 calls the matches method twice, attempting to match the pattern "02215" containing only literal characters (in this case, digits) that match themselves in the specified order:
The call "02215".matches("02215") returns true—"02215" matches the pattern in the matches method’s argument.
The call "51220".matches("02215") returns false—"51220" and the pattern in the argument have the same digits but "51220" has them in the wrong order.
 1  // Fig. 7.19: RegexExamples.java
 2  // Matching entire Strings to regular expressions.
 3  public class RegexExamples {
 4     public static void main(String[] args) {
 5        // fully match a pattern of literal characters
 6        System.out.println("Matching against: 02215");
 7        System.out.printf("02215: %b; 51220: %b%n%n",
 8           "02215".matches("02215"), "51220".matches("02215"));
 9
Matching against: 02215
02215: true; 51220: false
Fig. 7.19 | Matching entire Strings to regular expressions.
Metacharacters, Character Classes and Quantifiers
Regular expressions typically contain various special symbols called metacharacters:
[] {} () \ * + ^ $ ? . |
The \ metacharacter begins each of the predefined character classes, several of which are shown in the following table with the groups of characters they match:
| Character class | Matches |
|---|---|
| \d | Any digit (0–9). |
| \D | Any character that is not a digit. |
| \s | Any whitespace character (such as a space, tab or newline). |
| \S | Any character that is not a whitespace character. |
| \w | Any word character (also called an alphanumeric character), that is, any uppercase or lowercase letter, any digit or an underscore. |
| \W | Any character that is not a word character. |
Recall that a backslash (\) in a Java String literal begins an escape sequence. The predefined character classes, such as \d, are not valid Java escape sequences, so you must precede each with an extra backslash in a Java String literal (for example, "\\d"). A backslash also lets a regular expression match a metacharacter as a literal character: the regular expression \$ matches a dollar sign ($) and \\ matches a backslash (\). To represent these patterns in a Java String literal, you must escape each backslash, so
to match a dollar sign ($), specify \\$ in the pattern String, and
to match a backslash (\), specify \\\\ in the pattern String.
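The two escaping rules above can be sketched as follows. EscapeDemo and its method names are hypothetical, chosen only for illustration. The last statement uses the Pattern class's static quote method, which returns a pattern String that treats its entire argument as literal text, as an alternative to escaping metacharacters by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not one of the chapter's figures.
public class EscapeDemo {
    // the regex \$ is written as the Java String literal "\\$"
    public static boolean isDollarSign(String text) {
        return text.matches("\\$");
    }

    // the regex \\ is written as the Java String literal "\\\\"
    public static boolean isBackslash(String text) {
        return text.matches("\\\\");
    }

    public static void main(String[] args) {
        System.out.printf("$: %b%n", isDollarSign("$"));

        // the String literal "\\" contains one backslash character
        System.out.printf("backslash: %b%n", isBackslash("\\"));

        // Pattern.quote treats every character in its argument literally
        System.out.printf("quoted $: %b%n", "$".matches(Pattern.quote("$")));
    }
}
```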
Matching Digits
Let’s validate a five-digit United States ZIP code. In the regular expression \d{5}, the character class \d is a regular expression escape sequence that matches one character, specifically one digit (0–9). To match more than one, follow the character class with a quantifier. The quantifier {5} repeats \d five times to match five consecutive digits (line 13):
The first call to matches returns true—"02215" contains five consecutive digits.
The second call to matches returns false—"9876" contains only four consecutive digits.
10        // fully match five digits
11        System.out.println("Matching against: \\d{5}");
12        System.out.printf("02215: %b; 9876: %b%n%n",
13           "02215".matches("\\d{5}"), "9876".matches("\\d{5}"));
14
Matching against: \d{5}
02215: true; 9876: false
Custom Character Classes
Characters in square brackets, [], define a custom character class that matches a single character. For example, [aeiou] matches a lowercase vowel, [A-Z] matches an uppercase letter, [a-z] matches a lowercase letter, and [a-zA-Z] matches any lowercase or uppercase letter. Line 18 defines a custom character class to validate a simple first name with no spaces or punctuation. A first name might contain many letters.
[A-Z] matches one uppercase letter, and
[a-z]* matches any number of lowercase letters.
The * quantifier matches zero or more occurrences of the subexpression to its left (in this case, [a-z]). So [A-Z][a-z]* matches an uppercase letter followed by any number of lowercase letters, as in "Vivaan", "Jo" and "E":
15        // match a word that starts with a capital letter
16        System.out.println("Matching against: [A-Z][a-z]*");
17        System.out.printf("Angel: %b; tina: %b%n%n",
18           "Angel".matches("[A-Z][a-z]*"), "tina".matches("[A-Z][a-z]*"));
19
Matching against: [A-Z][a-z]*
Angel: true; tina: false
When a custom character class starts with a caret (^), the class matches any character that’s not specified. So [^a-z] (line 23) matches any character that’s not a lowercase letter:
20        // match any character that's not a lowercase letter
21        System.out.println("Matching against: [^a-z]");
22        System.out.printf("A: %b; a: %b%n%n",
23           "A".matches("[^a-z]"), "a".matches("[^a-z]"));
24
Matching against: [^a-z]
A: true; a: false
Metacharacters in a custom character class are treated as literal characters. So [*+$] (line 28) matches one *, + or $ character:
25        // match metacharacters as literals in a custom character class
26        System.out.println("Matching against: [*+$]");
27        System.out.printf("*: %b; !: %b%n%n",
28           "*".matches("[*+$]"), "!".matches("[*+$]"));
29
Matching against: [*+$]
*: true; !: false
* vs. + Quantifier
To require at least one lowercase letter in a first name, replace the * quantifier in line 18 with the + quantifier (line 33), which matches at least one occurrence of the subexpression to its left ([a-z]):
30        // match a capital letter followed by at least one lowercase letter
31        System.out.println("Matching against: [A-Z][a-z]+");
32        System.out.printf("Angel: %b; T: %b%n%n",
33           "Angel".matches("[A-Z][a-z]+"), "T".matches("[A-Z][a-z]+"));
34
Matching against: [A-Z][a-z]+
Angel: true; T: false
Both * and + are greedy—they match as many characters as possible. So, the regular expression [A-Z][a-z]+ matches any word that begins with a capital letter followed by at least one lowercase letter. You’ll explore lazy quantifiers in a genAI exercise at the end of this section.
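To see greedy matching at work, consider the following sketch, which contrasts the greedy + quantifier with its lazy form +? (a quantifier followed by ?). The GreedyVsLazy class, its method names and the sample text are hypothetical. The code assumes two features not yet covered at this point: the . metacharacter, which by default matches any character except a line terminator, and String method replaceAll, which Section 7.8.2 introduces.

```java
// Hypothetical demo class for illustration; not one of the chapter's figures.
public class GreedyVsLazy {
    // greedy: .+ consumes as many characters as possible before the final >
    public static String replaceGreedy(String text) {
        return text.replaceAll("<.+>", "X");
    }

    // lazy: .+? consumes as few characters as possible before the next >
    public static String replaceLazy(String text) {
        return text.replaceAll("<.+?>", "X");
    }

    public static void main(String[] args) {
        String text = "<b>one</b><b>two</b>";
        System.out.printf("Greedy: %s%n", replaceGreedy(text)); // X
        System.out.printf("Lazy: %s%n", replaceLazy(text)); // XoneXXtwoX
    }
}
```

The greedy pattern treats everything from the first < through the last > as one match, whereas the lazy pattern matches each of the four tags separately.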
Other Quantifiers
The ? quantifier matches zero or one occurrence of the subexpression to its left. In the regular expression labell?ed (line 38), the subexpression is the literal character "l". So, in the matches calls in line 38, the regular expression matches labelled (the U.K. English spelling) and labeled (the U.S. English spelling), but in the matches call in line 39, the regular expression does not match the misspelled word labellled:
35        // match zero or one occurrence of a subexpression
36        System.out.println("Matching against: labell?ed");
37        System.out.printf("labelled: %b; labeled: %b; labellled: %b%n%n",
38           "labelled".matches("labell?ed"), "labeled".matches("labell?ed"),
39           "labellled".matches("labell?ed"));
40
Matching against: labell?ed
labelled: true; labeled: true; labellled: false
You can use the {n,} quantifier to match at least n occurrences of the subexpression to its left. The regular expression in lines 44–45 (\d{3,}) matches Strings consisting of three or more digits. Recall that the \ in this regular expression must be escaped with another \ in the Java String literal:
41        // match n (3) or more occurrences of a subexpression
42        System.out.println("Matching against: \\d{3,}");
43        System.out.printf("123: %b; 1234567890: %b; 12: %b%n%n",
44           "123".matches("\\d{3,}"), "1234567890".matches("\\d{3,}"),
45           "12".matches("\\d{3,}"));
46
Matching against: \d{3,}
123: true; 1234567890: true; 12: false
You can use the {n,m} quantifier to match between n and m (inclusive) occurrences of the subexpression to its left. The regular expressions in lines 50–51 (\d{3,6}) match Strings consisting of three to six digits:
47        // match n to m inclusive (3-6), occurrences of a subexpression
48        System.out.println("Matching against: \\d{3,6}");
49        System.out.printf("123: %b; 123456: %b; 1234567: %b; 12: %b%n",
50           "123".matches("\\d{3,6}"), "123456".matches("\\d{3,6}"),
51           "1234567".matches("\\d{3,6}"), "12".matches("\\d{3,6}"));
52     }
53  }
Matching against: \d{3,6}
123: true; 123456: true; 1234567: false; 12: false
Generative AI
1 Prompt genAIs with the regular expression [b-df-hj-np-tv-z]. Ask them to write Java code showing that the pattern matches only lowercase consonants. Run the code to confirm that the pattern works properly.
2 Regular expression quantifiers are greedy by default. Prompt genAIs for a brief Java tutorial with simple examples showing how lazy regular expression quantifiers match patterns.
7.8.2 Replacing Substrings
The String class’s replaceAll method replaces patterns in a String. Let’s convert a tab-delimited String to comma-delimited (Fig. 7.20). Method replaceAll (line 9) receives:
the regex pattern to match (the tab character "\t") and
the replacement text (","),
and returns a new String containing the modifications. The String class also provides the replaceFirst method, which replaces only the first matching substring.
 1  // Fig. 7.20: RegexReplacement.java
 2  // Regular expression replacements.
 3  public class RegexReplacement {
 4     public static void main(String[] args) {
 5        // replace tabs with commas
 6        String s1 = "1\t2\t3\t4";
 7        System.out.printf("Original string: %s%n", s1);
 8        System.out.printf("New string with commas replacing tabs: %s%n",
 9           s1.replaceAll("\\t", ","));
10     }
11  }
Original string: 1       2       3       4
New string with commas replacing tabs: 1,2,3,4
Fig. 7.20 | Regular expression replacements.
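The following sketch contrasts replaceFirst with replaceAll and shows a quantifier in a replacement pattern. The ReplaceFirstVsAll class, its method names and the sample data are hypothetical and appear here only for illustration.

```java
// Hypothetical demo class for illustration; not one of the chapter's figures.
public class ReplaceFirstVsAll {
    // replaces only the first tab with a comma
    public static String firstTabToComma(String text) {
        return text.replaceFirst("\\t", ",");
    }

    // replaces each run of one or more spaces with a single comma
    public static String spacesToCommas(String text) {
        return text.replaceAll(" +", ",");
    }

    public static void main(String[] args) {
        // only the tab between 1 and 2 becomes a comma
        System.out.println(firstTabToComma("1\t2\t3\t4"));

        // displays 1,2,3,4
        System.out.println(spacesToCommas("1  2 3   4"));
    }
}
```

Both methods return new Strings. Java Strings are immutable, so the original text is never modified.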
7.8.3 Searching for Matches with Classes Pattern and Matcher
In addition to class String’s regular-expression capabilities, the java.util.regex package provides capabilities that help developers work with regular expressions:
An object of the Pattern class represents a regular expression.
An object of the Matcher class contains a regular-expression pattern and a String in which to search for the pattern.
These classes provide more control over the regular-expression, pattern-matching process.
Finding a Match Anywhere in a String
To introduce Pattern and Matcher, let’s search for literal substrings anywhere in the String s1 (Fig. 7.21, line 9). Line 13 invokes the Pattern class’s static method compile, which creates a Pattern object representing a regular expression (in this case, the literal characters "Programming"). Compiling a regular expression once and reusing the resulting Pattern object avoids recompiling it on every use, which improves performance for regular expressions that are used more than once, such as in a loop, which we’ll do momentarily. Line 14 calls the Pattern object’s matcher method, which creates and returns a Matcher object that can search for the Pattern’s regular expression in the String s1 ("Programming is fun"). Line 15 calls the Matcher object’s find method to search for "Programming" in s1. The method returns true if the pattern is found; otherwise, it returns false. Lines 17–19 and 21–23 repeat the preceding process to search for the patterns "fun" and "fn", respectively.
 1  // Fig. 7.21: RegexMatching.java
 2  // Classes Pattern and Matcher.
 3  import java.util.regex.Pattern;
 4  import java.util.regex.Matcher;
 5
 6  public class RegexMatching {
 7     public static void main(String[] args) {
 8        // performing a simple match
 9        String s1 = "Programming is fun";
10        System.out.printf("s1: %s%n%n", s1);
11        System.out.println("Search anywhere in s1:");
12
13        Pattern pattern1 = Pattern.compile("Programming");
14        Matcher matcher1 = pattern1.matcher(s1);
15        boolean found1 = matcher1.find();
16
17        Pattern pattern2 = Pattern.compile("fun");
18        Matcher matcher2 = pattern2.matcher(s1);
19        boolean found2 = matcher2.find();
20
21        Pattern pattern3 = Pattern.compile("fn");
22        Matcher matcher3 = pattern3.matcher(s1);
23        boolean found3 = matcher3.find();
24
25        System.out.printf("Programming: %b; fun: %b; fn: %b%n%n",
26           found1, found2, found3);
27
s1: Programming is fun

Search anywhere in s1:
Programming: true; fun: true; fn: false
Fig. 7.21 | Classes Pattern and Matcher.
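The benefit of compiling a regular expression once is easiest to see when the same Pattern is applied to many Strings. The following sketch assumes a hypothetical ReusePattern class with sample lines of our own. It compiles the literal pattern "fun" one time, then creates a new Matcher for each line in a loop.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not one of the chapter's figures.
public class ReusePattern {
    // compiled once, then shared by every call to countContaining
    private static final Pattern FUN = Pattern.compile("fun");

    // counts the lines in which the pattern occurs at least once
    public static int countContaining(String[] lines) {
        int count = 0;

        for (String line : lines) {
            if (FUN.matcher(line).find()) { // new Matcher, same Pattern
                ++count;
            }
        }

        return count;
    }

    public static void main(String[] args) {
        String[] lines =
            {"Programming is fun", "Regex is functional", "No match here"};
        System.out.printf("Lines containing fun: %d%n",
            countContaining(lines)); // 2
    }
}
```

The second line counts as a match because find locates "fun" inside "functional". The pattern does not require "fun" to be a whole word.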
Ignoring Case in a Regular Expression and Viewing the Matching Text
Like String comparisons, regular expression pattern matching is case-sensitive by default. The Pattern class provides constants that customize how regular expressions perform matches. Lines 33–34 configure a Pattern that performs case-insensitive matches—as indicated by the constant Pattern.CASE_INSENSITIVE—for the literal "Gabriela". Line 35 creates a Matcher object to search for that pattern in s2. Line 37 calls Matcher method find to determine whether s2 contains the pattern. If so, the Matcher method group (line 39) returns the substring of s2 that matched the pattern ("GABRIELA" in this case).
28        // ignoring case
29        String s2 = "GABRIELA ALVAREZ";
30        System.out.printf("s2: %s%n%n", s2);
31        System.out.println("Case insensitive search for Gabriela in s2:");
32
33        Pattern pattern4 =
34           Pattern.compile("Gabriela", Pattern.CASE_INSENSITIVE);
35        Matcher matcher4 = pattern4.matcher(s2);
36
37        if (matcher4.find()) {
38           System.out.printf("Gabriela found%n");
39           System.out.printf("Matched text: %s%n%n", matcher4.group());
40        }
41        else {
42           System.out.printf("Gabriela not found%n");
43        }
44
s2: GABRIELA ALVAREZ

Case insensitive search for Gabriela in s2:
Gabriela found
Matched text: GABRIELA
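Java regular expressions also support the embedded flag (?i), which turns on case-insensitive matching from within the pattern itself. This is convenient with String methods like matches, which do not accept Pattern constants. The following sketch is an alternative to the approach in Fig. 7.21, not a replacement for it. The InlineFlag class and its firstMatch method are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not one of the chapter's figures.
public class InlineFlag {
    // returns the first substring of text matching regex, or null if none
    public static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String s2 = "GABRIELA ALVAREZ";

        // (?i) enables case-insensitive matching for the rest of the pattern
        System.out.println(firstMatch("(?i)gabriela", s2)); // GABRIELA

        // without the flag, the lowercase pattern does not match
        System.out.println(firstMatch("gabriela", s2)); // null

        // the embedded flag also works with String method matches
        System.out.println(s2.matches("(?i)gabriela alvarez")); // true
    }
}
```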
Finding All Matches in a String
Let’s extract all the 10-digit phone numbers of the form ###-###-#### from a String. The following code finds each substring in contact (lines 46–47) matching phonePattern (line 48) and displays the matching text. The pattern
\d{3}-\d{3}-\d{4}
matches any substring containing three digits (\d{3}), a hyphen (-), three digits (\d{3}), a hyphen (-) and four digits (\d{4})—contact contains two such phone numbers. Line 49 creates a Matcher object for locating phonePattern matches in contact. When searching for multiple matches, you call the Matcher method find in a loop condition (line 52). Lines 52–54 iterate while find returns true—that is, until there are no more matches in contact. If there’s a match, line 53 calls the Matcher method group to get the substring that was found then displays it. The next call to Matcher method find automatically continues from the character following the preceding match.
45        // finding all matches
46        String contact =
47           "Lea Dubois, Home: 555-555-1234, Work: 555-555-4321";
48        Pattern phonePattern = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
49        Matcher phoneMatcher = phonePattern.matcher(contact);
50
51        System.out.printf("Finding phone numbers in:%n%s%n", contact);
52        while (phoneMatcher.find()) {
53           System.out.printf(" %s%n", phoneMatcher.group());
54        }
55     }
56  }
Finding phone numbers in:
Lea Dubois, Home: 555-555-1234, Work: 555-555-4321
 555-555-1234
 555-555-4321
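A Matcher can report more than the matched text. The following sketch goes beyond Fig. 7.21 by wrapping each part of the phone pattern in parentheses to form capturing groups, so that group(1) returns only the first three digits of each match. It also uses Matcher methods start and end to report where each match was found. The PhoneGroups class, its areaCodes method and the sample data in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not one of the chapter's figures.
public class PhoneGroups {
    // parentheses create capturing groups 1, 2 and 3
    private static final Pattern PHONE =
        Pattern.compile("(\\d{3})-(\\d{3})-(\\d{4})");

    // collects the first three digits of every phone number in text
    public static List<String> areaCodes(String text) {
        List<String> codes = new ArrayList<>();
        Matcher matcher = PHONE.matcher(text);

        while (matcher.find()) {
            codes.add(matcher.group(1)); // text captured by group 1
        }

        return codes;
    }

    public static void main(String[] args) {
        String contact = "Home: 617-555-1234, Work: 212-555-4321";
        Matcher matcher = PHONE.matcher(contact);

        while (matcher.find()) {
            // start is the index of the match's first character;
            // end is the index just past its last character
            System.out.printf("%s at [%d, %d)%n",
                matcher.group(), matcher.start(), matcher.end());
        }

        System.out.printf("Area codes: %s%n", areaCodes(contact));
    }
}
```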
7.8.4 Simple Data Wrangling Steps Used to Prepare Text for Training NLP and Generative AI Models
Data does not always come ready for analysis or training artificial intelligence (AI) models. It might be incorrect, in the wrong format or even missing. Industry experience has shown that data scientists spend as much as 80% of their time preparing data before analyzing it.<sup>8</sup> Preparing data for analysis is called data munging or data wrangling. From this point forward, we’ll say data wrangling. Figure 7.22 uses String and regex capabilities to perform several common data-wrangling tasks that might be used to prepare text for training the large language models (LLMs) that generative AIs use to understand and respond to your prompts. Some of these tasks are the ones we listed in Section 7.7. Others include:
converting text to all lowercase letters (shown in Section 7.3.7),
removing punctuation,
collapsing consecutive whitespace characters to a single space,
removing personally identifiable information (PII),
normalizing numbers (sometimes specific numeric values are not relevant for analysis or training, so you might replace them with tags like <NUMBER> or <NUM>),
removing numbers and
removing special symbols.
Whether these are performed on a particular corpus (body of text) typically depends on
what an AI model needs to learn from the corpus or
how the model needs to analyze it.
For example, you might eliminate stop words for a model that summarizes word frequencies for a corpus’s most important words. Keeping stop words could be important for a model that’s learning sentence structure so it can respond to your prompts in complete, well-structured sentences, like those produced in generative AI applications.
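Two of the tasks listed above, removing PII and normalizing numbers, can be sketched with replaceAll. This is a simplified illustration under stated assumptions: the NormalizeAndRedact class, its method names and the <PHONE> tag are hypothetical, and the only PII it recognizes is phone numbers of the form ###-###-####. Real PII removal must handle many more kinds of data.

```java
// Hypothetical demo class for illustration; not one of the chapter's figures.
public class NormalizeAndRedact {
    // replaces phone numbers of the form ###-###-#### with a tag
    public static String redactPhones(String text) {
        return text.replaceAll("\\d{3}-\\d{3}-\\d{4}", "<PHONE>");
    }

    // replaces each run of one or more digits with a tag
    public static String normalizeNumbers(String text) {
        return text.replaceAll("\\d+", "<NUMBER>");
    }

    public static void main(String[] args) {
        String text = "Call 555-555-1234 about the 32 bits in 4 bytes";

        // redact first; otherwise each digit group in the phone number
        // would be replaced with its own <NUMBER> tag
        String result = normalizeNumbers(redactPhones(text));
        System.out.println(result);
    }
}
```

The order of the steps matters. Normalizing numbers first would turn the phone number into <NUMBER>-<NUMBER>-<NUMBER>, which the phone pattern could no longer recognize.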
Figure 7.22 performs several data wrangling steps on the text block corpus defined in lines 7–14, which intentionally contains extra whitespace. We’ve broken this example into pieces for discussion purposes.
 1  // Fig. 7.22: Wrangling.java
 2  // Simple data wrangling tasks to prepare text to train language models.
 3  import java.util.Arrays;
 4
 5  public class Wrangling {
 6     public static void main(String[] args) {
 7        String corpus = """
 8           In C++, an int on one machine might be 16 bits (2 bytes)
 9           of memory, on another 32 bits (4 bytes), and on another
10           64 bits (8 bytes). In Java, int values are always 32 bits
11           (4 bytes). The eight primitive types in Java -- boolean, byte,
12           char, double, float, int, long and short -- are portable across
13           all computer platforms that support Java.
14           """;
15        System.out.printf("INITIAL CORPUS:%n%s%n", corpus);
16
INITIAL CORPUS:
In C++, an int on one machine might be 16 bits (2 bytes)
of memory, on another 32 bits (4 bytes), and on another
64 bits (8 bytes). In Java, int values are always 32 bits
(4 bytes). The eight primitive types in Java -- boolean, byte,
char, double, float, int, long and short -- are portable across
all computer platforms that support Java.
Fig. 7.22 | Simple data wrangling tasks to prepare text to train language models.
Lowercasing/Case Folding
A common data wrangling step is to convert an entire corpus to lowercase letters. Recall that Java compares Strings lexicographically by default, so the words "happy" and "Happy" are different. However, they are still the same word, so for a task like word-frequency counting, we’d want to treat them as identical. You could use regular expressions to replace all uppercase letters with lowercase letters but, as you saw in Section 7.3.7, Java provides String method toLowerCase for that purpose. In this step’s output, note that toLowerCase changed “C++” to “c++.”
17        // convert corpus to lowercase
18        String lowercased = corpus.toLowerCase();
19        System.out.printf("LOWERCASED CORPUS:%n%s%n", lowercased);
20
LOWERCASED CORPUS:
in c++, an int on one machine might be 16 bits (2 bytes)
of memory, on another 32 bits (4 bytes), and on another
64 bits (8 bytes). in java, int values are always 32 bits
(4 bytes). the eight primitive types in java -- boolean, byte,
char, double, float, int, long and short -- are portable across
all computer platforms that support java.
Removing Punctuation
For some kinds of text analysis and training, you must remove punctuation to make the text easier to tokenize. Line 22 introduces a predefined character class, \p{Punct}, which by default matches any ASCII punctuation or symbol character (such as !, +, - and .), and uses it in a replaceAll call to replace every such character in the String lowercased with an empty String (""). Once again, note that we must precede a backslash in a regular expression with an extra backslash:
21        // remove punctuation
22        String noPunctuation = lowercased.replaceAll("\\p{Punct}", "");
23        System.out.printf("PUNCTUATION REMOVED:%n%s%n", noPunctuation);
24
PUNCTUATION REMOVED:
in c an int on one machine might be 16 bits 2 bytes
of memory on another 32 bits 4 bytes and on another
64 bits 8 bytes in java int values are always 32 bits
4 bytes the eight primitive types in java boolean byte
char double float int long and short are portable across
all computer platforms that support java
In this step’s output, note that removing punctuation transformed “c++” to “c.” So in this series of data wrangling steps, “C++” became “c++” then “c.” You should always perform wrangling operations cautiously.
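One reason for caution is that different punctuation classes remove different characters. In Java, \p{Punct} by default covers only the ASCII punctuation and symbol characters, whereas the Unicode category \p{P} covers punctuation across scripts, such as curly quotation marks, but not symbols like +. The following sketch compares the two. The PunctClasses class, its method names and the sample text are hypothetical.

```java
// Hypothetical demo class for illustration; not one of the chapter's figures.
public class PunctClasses {
    // removes ASCII punctuation and symbols such as ! + - . $
    public static String removeAsciiPunct(String text) {
        return text.replaceAll("\\p{Punct}", "");
    }

    // removes characters in the Unicode punctuation category (P)
    public static String removeUnicodePunct(String text) {
        return text.replaceAll("\\p{P}", "");
    }

    public static void main(String[] args) {
        // \u201C and \u201D are left and right curly double quotation marks
        String text = "\u201Cc++\u201D rocks!";

        // removes + and ! but keeps the curly quotation marks
        System.out.println(removeAsciiPunct(text));

        // removes the curly quotation marks and ! but keeps ++
        System.out.println(removeUnicodePunct(text));
    }
}
```

Neither class is right for every corpus. Here, \p{P} happens to preserve "c++" because + is classified as a math symbol rather than punctuation.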
Collapsing and Normalizing Whitespace
This example’s original String was inconsistently spaced. It’s common to normalize spacing by collapsing consecutive whitespace characters to a single space. In line 26, replaceAll uses the regular expression \s+ to replace each run of one or more whitespace characters with a single space, so any individual newline (\n) or tab (\t) character is also replaced with a space. After this operation, all tokens in the String are separated from one another by exactly one space. Though we did not do so here, we also could call strip on the resulting String to remove any leading or trailing whitespace (in this example, the text block’s final newline becomes a single trailing space).
25        // collapsing consecutive whitespace characters to one space
26        String spacingNormalized = noPunctuation.replaceAll("\\s+", " ");
27        System.out.printf("SPACING NORMALIZED:%n%s%n%n", spacingNormalized);
28
SPACING NORMALIZED:
in c an int on one machine might be 16 bits 2 bytes of memory on another 32 bits 4 bytes and on another 64 bits 8 bytes in java int values are always 32 bits 4 bytes the eight primitive types in java boolean byte char double float int long and short are portable across all computer platforms that support java
Tokenizing
Now that we’ve performed several data wrangling steps on our initial text block, line 30 calls String method split to tokenize the text, and line 31 displays the resulting array of tokens, using Arrays method toString to create a comma-separated String representation of the tokens. Language models might use these tokens to perform a frequency analysis.
29        // tokenizing the text
30        String[] tokenized = spacingNormalized.split(" ");
31        System.out.printf("TOKENS:%n%s%n", Arrays.toString(tokenized));
32     }
33  }
TOKENS:
[in, c, an, int, on, one, machine, might, be, 16, bits, 2, bytes, of, memory, on, another, 32, bits, 4, bytes, and, on, another, 64, bits, 8, bytes, in, java, int, values, are, always, 32, bits, 4, bytes, the, eight, primitive, types, in, java, boolean, byte, char, double, float, int, long, and, short, are, portable, across, all, computer, platforms, that, support, java]
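As a rough sketch of the frequency analysis mentioned above, the following code counts how often each token occurs while skipping stop words. The WordFrequency class and its count method are hypothetical, and the stop-word list is a tiny illustrative assumption, far shorter than the lists used in practice.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical demo class for illustration; not one of the chapter's figures.
public class WordFrequency {
    // tiny illustrative stop-word list (an assumption, not a standard list)
    private static final Set<String> STOP_WORDS =
        Set.of("in", "an", "on", "of", "and", "the", "be", "are", "that");

    // maps each non-stop-word token to its number of occurrences
    public static Map<String, Integer> count(String[] tokens) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key

        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum); // add 1 to the count
            }
        }

        return counts;
    }

    public static void main(String[] args) {
        String[] tokens =
            "in java int values are always 32 bits in java".split(" ");
        System.out.println(count(tokens));
    }
}
```

A TreeMap keeps the tokens in sorted order, which makes the displayed counts easier to scan. A HashMap would work equally well if order does not matter.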
Generative AI
1 Prompt genAIs with the code from Fig. 7.22. Ask them to modify the code to normalize numbers by replacing them with <NUMBER> tags, then confirm the results by running the updated program.
2 Prompt genAIs for brief explanations and examples of other character classes like \p{Punct} that you can use in Java.
3 Prompt genAIs for a brief Java tutorial on removing personally identifiable information (PII) from a corpus.
