7.6 Tokenizing Strings

When you read a sentence, your mind breaks it into tokens: individual words and punctuation marks that convey meaning to you. Compilers also perform tokenization, breaking statements into individual pieces such as keywords, identifiers, operators and other programming-language elements. In natural language processing (NLP) and generative AI, tokenizing text into sentences and individual words is a key step in preparing text for analysis (such as word-frequency counting) and for training the large language models (LLMs) that help computers process natural language. In this section, we’ll use the String class’s split method to break a String into its component tokens. Tokens are separated from one another by delimiters, typically whitespace characters such as space, tab, newline and carriage return. Other characters, such as commas, can also serve as delimiters. Figure 7.18 demonstrates String’s split method.

 1   // Fig. 7.18: TokenTest.java
 2   // Tokenizing with String method split
 3   import java.util.Scanner;
 4
 5   public class TokenTest {
 6      public static void main(String[] args) {
 7         // get sentence
 8         var scanner = new Scanner(System.in);
 9         System.out.println("Enter a sentence and press Enter");
10         String sentence = scanner.nextLine();
11
12         // process user sentence
13         String[] tokens = sentence.split(" ");
14         System.out.printf("%nNumber of elements: %d%nThe tokens are:%n",
15            tokens.length);
16
17         for (String token : tokens) {
18            System.out.printf("  %s%n", token);
19         }
20      }
21   }

Enter a sentence and press Enter
This sentence has five tokens

Number of elements: 5
The tokens are:
  This
  sentence
  has
  five
  tokens

Fig. 7.18 | Tokenizing with String method split.

Line 10 stores the user input in the variable sentence. Line 13 invokes the String method split with the String argument " ", which returns an array of Strings. The space character in the argument String is the delimiter split uses to separate the tokens. Method split treats its argument as a regular-expression pattern, so more complex tokenizing is possible; Section 7.8 introduces pattern matching in text using regular expressions. Lines 14–15 display the tokens array’s length, which equals the number of tokens in sentence when the words are separated by single spaces. Lines 17–19 output each token on a separate line.
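
As a rough sketch of the more complex tokenizing that a regular-expression argument allows, the following example compares splitting on a single space with splitting on runs of whitespace. The class name SplitSketch and the helper method tokenize are hypothetical names chosen for illustration; they are not part of Fig. 7.18. The sketch assumes you want to ignore leading and trailing whitespace and to treat an all-whitespace line as containing no tokens.

```java
import java.util.Arrays;

class SplitSketch {
   // hypothetical helper (not from Fig. 7.18): split on runs of whitespace
   static String[] tokenize(String sentence) {
      // trim first; otherwise leading whitespace yields an empty first element
      String trimmed = sentence.trim();

      // "".split(...) returns a one-element array containing "", so handle it here
      if (trimmed.isEmpty()) {
         return new String[0];
      }

      // \s+ matches one or more whitespace characters (space, tab, newline, ...)
      return trimmed.split("\\s+");
   }

   public static void main(String[] args) {
      String text = "This  sentence has\tfive   tokens";

      // a single-space delimiter yields empty Strings between adjacent spaces
      // and does not split at the tab
      System.out.printf("split(\" \") elements: %d%n", text.split(" ").length);

      // splitting on runs of whitespace yields only the words
      System.out.println(Arrays.toString(tokenize(text)));

      // a comma followed by optional whitespace as the delimiter
      System.out.println(Arrays.toString("red, green,blue".split(",\\s*")));
   }
}
```

With this input, split(" ") returns seven elements, three of them empty Strings, whereas tokenize returns the five words. Trimming before splitting is a design choice in this sketch; whether empty tokens should be kept or discarded depends on the application.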
