Strings, NLP and Regex: Generative AI Foundations
- 7.1 Introduction
- 7.2 Fundamentals of Characters and Strings
- 7.3 Class String
- 7.4 Class StringBuilder
- 7.5 Class Character
- 7.6 Tokenizing Strings
- 7.7 Intro to Natural Language Processing (NLP)-at the Root of Generative AI
- 7.8 Objects-Natural Case Study: Intro to Regular Expressions in NLP
- 7.9 Objects-Natural Security Case Study: pMa5tfEKwk59dTvC04Ft1IFQz9mEXnkfYXZwxk4ujGE=
- 7.10 Wrap-Up
Objectives
In this chapter, you’ll:
- Create and manipulate immutable character-string objects of class String.
- Create and manipulate mutable character-string objects of class StringBuilder.
- Test characters for various attributes using static methods of class Character.
- Tokenize Strings.
- Use generative AI to enhance your Java learning experience.
- Be introduced to natural language processing, a key foundation of generative AI.
- See an objects-natural case study on using regular expressions to recognize patterns in text for data validation, transformation and extraction. These operations are commonly used to prepare text for training the language models behind generative AI.
- See an objects-natural case study on pMa5tfEKwk59dTvC04Ft1IFQz9mEXnkfYXZwxk4ujGE= (see Section 7.9).
7.1 Introduction
This chapter discusses Java’s string- and character-processing capabilities. The techniques presented here are appropriate for validating program input, displaying information to users and performing many common text-based manipulations. You saw several string-processing capabilities in earlier chapters. This chapter discusses in detail the java.lang package’s String, StringBuilder and Character classes. You’ll continue using generative AI to enhance your learning experience.
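As a quick preview of the three classes named above, here is a minimal sketch. The class name StringBasicsSketch and its method names are hypothetical, chosen only for illustration; the String, StringBuilder and Character calls are standard java.lang methods.

```java
// A minimal sketch; the class and method names are hypothetical.
class StringBasicsSketch {
    // Strings are immutable: toUpperCase returns a new String
    // and leaves the original object unchanged.
    static String shout(String s) {
        return s.toUpperCase();
    }

    // A StringBuilder is mutable: reverse modifies the builder's
    // own contents, and toString copies them into a new String.
    static String reverse(String s) {
        StringBuilder builder = new StringBuilder(s);
        builder.reverse();
        return builder.toString();
    }

    // Character's static methods test individual characters.
    static int countDigits(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String original = "Java 21";
        System.out.println(shout(original));       // JAVA 21
        System.out.println(original);              // Java 21 (unchanged)
        System.out.println(reverse(original));     // 12 avaJ
        System.out.println(countDigits(original)); // 2
    }
}
```

Note that original still prints as "Java 21" after the call to shout, which is what immutability means in practice.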
Introduction to Natural Language Processing (NLP)
Section 7.7 introduces natural language processing (NLP), an important data science and AI topic and a key foundational technology for generative AI. NLP helps computers understand, analyze and process text. While writing this book, we used the paid NLP tool Grammarly to help tune the writing and ensure the text’s readability for a broad audience. We also used various generative AIs’ NLP capabilities to evaluate the chapters for breadth and depth of coverage, verify the content and suggest potential improvements.
Objects-Natural Case Study: Using Regular Expressions to Search Strings for Patterns, Validate Data and Replace Substrings
We continue our objects-natural approach with two case studies. Section 7.8 introduces regular expressions (regexes for short), which are crucial in today’s data-rich applications for
- cleaning and preparing data for analysis and large language model (LLM) training, a process called data wrangling or data munging,
- data mining, such as locating names, organizations, URLs, email addresses and phone numbers in large bodies of text, and
- transforming data to other formats, such as converting tab-delimited data to comma-separated-values (CSV) data.
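Two of the uses listed above, cleaning and transforming, can be sketched with String's replaceAll method, which takes a regular expression as its first argument. The class name RegexTransformSketch and its method names are hypothetical. The tab-to-comma conversion assumes that no field contains a comma, quote or newline; real CSV data generally needs quoting rules that this sketch does not handle.

```java
// A minimal sketch; the class and method names are hypothetical.
class RegexTransformSketch {
    // Cleaning: trim the ends, then collapse each run of
    // whitespace (spaces, tabs, newlines) into a single space.
    static String normalizeWhitespace(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Transforming: replace every tab with a comma.
    // Assumes no field itself contains a comma, quote or newline.
    static String tabsToCsv(String line) {
        return line.replaceAll("\t", ",");
    }

    public static void main(String[] args) {
        System.out.println(normalizeWhitespace("  too   many \t spaces  "));
        // too many spaces
        System.out.println(tabsToCsv("name\tcity\tzip"));
        // name,city,zip
    }
}
```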
Via class String and classes Matcher and Pattern from the java.util.regex package, we’ll use regular expressions to match patterns in text. Earlier chapters mentioned the importance of validating user input, especially in industrial-strength code. The capabilities presented in this chapter are frequently used to validate data and in
- natural-language processing,
- preparing data for training generative AIs and
- enabling generative AIs to create responses to your prompts.
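To make validation and extraction with Pattern and Matcher concrete, here is a minimal sketch. The class name RegexMatchSketch, its method names and the two formats it checks (a five-digit ZIP code and a ddd-ddd-dddd phone number) are assumptions chosen for illustration, not formats prescribed by this chapter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch; the class name, method names and formats are hypothetical.
class RegexMatchSketch {
    // Assumed format: exactly five digits.
    private static final Pattern ZIP = Pattern.compile("\\d{5}");

    // Assumed format: ddd-ddd-dddd, bounded so it does not
    // match inside a longer run of digits.
    private static final Pattern PHONE =
        Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b");

    // Validation: matches() succeeds only if the ENTIRE input fits the pattern.
    static boolean isValidZip(String input) {
        return ZIP.matcher(input).matches();
    }

    // Extraction: find() locates each successive match within a larger text.
    static List<String> findPhones(String text) {
        List<String> phones = new ArrayList<>();
        Matcher matcher = PHONE.matcher(text);
        while (matcher.find()) {
            phones.add(matcher.group());
        }
        return phones;
    }

    public static void main(String[] args) {
        System.out.println(isValidZip("02215"));  // true
        System.out.println(isValidZip("0221"));   // false
        System.out.println(findPhones("Call 555-123-4567 or 555-987-6543."));
        // [555-123-4567, 555-987-6543]
    }
}
```

For a one-off check, String's matches method performs the same whole-string test in a single call, as in "02215".matches("\\d{5}"). Compiling a Pattern once is usually preferable when the same regular expression is applied many times.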
Objects-Natural Security Case Study: pMa5tfEKwk59dTvC04Ft1IFQz9mEXnkfYXZwxk4ujGE=
The title of our second objects-natural case study is not a mistake. See Section 7.9 for what these seemingly random characters, digits and symbols mean.

