7.7 Intro to Natural Language Processing (NLP)—at the Root of Generative AI

Every day, we use natural language in various forms of communication. Many of the following examples may be familiar to you:

  • You read your text messages and check the latest news clips.

  • You have conversations with family, friends and colleagues.

  • You are Deaf or hard-of-hearing and communicate with friends and family via sign language and enjoy closed-captioned video programs.

  • You are blind or have low vision, and read braille, listen to audiobooks, and use a screen reader that reads aloud what’s on the computer screen.

  • You read e-mails, distinguishing junk from important communications.

  • You receive a client’s e-mail in a language you don’t know well, translate it online, then respond in English, knowing your client can usually translate your e-mail back to their language.

  • You travel through a neighborhood (maybe via car, bike, or wheelchair, or on foot), observing road signs such as “Stop” and “Pedestrian Crossing.”

  • You give your phone verbal commands like “call home” or “play classical music” or ask questions like, “Where’s the nearest gas station?”

  • You teach a child how to speak and read.

  • You learn a foreign language.

Natural Language Processing (NLP) helps computers understand, analyze and process human text and speech. It’s performed on text collections composed of emails, social media posts, conversations, movie reviews, Shakespeare’s plays, historical documents, news items, meeting logs, and more. A text collection is known as a corpus, the plural of which is corpora.

Some typical NLP applications include:

  • Natural language understanding—understanding text or spoken language.

  • Sentiment analysis—determining whether text has positive, neutral or negative sentiment. For example, companies analyze the sentiment of social media posts about their products.

  • Readability assessment—determining how readable text is based on vocabulary, word lengths, sentence lengths, sentence structure, topics covered and more. While writing this book, we used the paid version of the NLP tool Grammarly to help us tune our writing to ensure its quality and readability for a broad audience.

  • Intelligent virtual assistants—software that helps you perform everyday tasks. Popular intelligent virtual assistants include Amazon Alexa, Apple Siri, Microsoft Copilot, Google Assistant, OpenAI ChatGPT and Anthropic Claude.

  • Text summarization—summarizing the key points of a text corpus. This can save time for busy people.

  • Speech recognition—converting speech to text.

  • Speech synthesis—converting text to speech.

  • Language identification—receiving a text when you don’t know its language in advance, then automatically determining the language.

  • Interlanguage translation—converting text to other spoken languages.

  • Named-entity recognition (NER)—locating and categorizing items in a text corpus like dates, times, quantities, places, people, things, organizations and more.

  • Chatbots—AI-based software that humans interact with via natural language. One popular chatbot application is automated customer support.

  • Similarity detection—examining documents to determine how alike they are. Basic similarity metrics include average sentence length, frequency distribution of sentence lengths, average word length, frequency distribution of word lengths, frequency distribution of word usage, and more.
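
The basic similarity metrics mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production approach: the function name basic_metrics, the two sample documents, and the naive regular expressions for splitting sentences and words are all assumptions made for this example.

```python
import re
from collections import Counter


def basic_metrics(text):
    """Return a few simple style metrics for a non-empty block of text."""
    # Naive sentence split (an assumption): break after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive word tokenization (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "word_length_distribution": Counter(len(w) for w in words),
        "word_frequencies": Counter(words),
    }


# Hypothetical sample documents, invented for illustration.
doc_a = "The cat sat. The cat ran."
doc_b = "A dog barked loudly at the passing car."

metrics_a = basic_metrics(doc_a)
metrics_b = basic_metrics(doc_b)

print(metrics_a["avg_sentence_length"])  # 3.0
print(metrics_b["avg_sentence_length"])  # 8.0
print(metrics_b["avg_word_length"])      # 3.875
```

Comparing the two dictionaries side by side gives a rough sense of how alike the documents are in style. Real similarity-detection tools typically go well beyond these counts.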

Many lower-level NLP tasks support the applications above, including:

  • Tokenization—splitting text into tokens, which are meaningful units, such as words and numbers.

  • Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun, verb, adjective, etc.

  • Noun phrase extraction—locating groups of words representing nouns, such as “the red brick factory.”

  • Spelling and grammar checking and correction.

  • Stemming—reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”

  • Lemmatization—like stemming, but produces actual words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”

  • Word frequency counting—determining how often each word appears in a corpus.

  • Stop-word elimination—removing common words such as a, and, as, at, be, do, for, have, he, I, in, it, not, of, on, that, the, to, with and you, so that analysis can focus on the important words in a corpus.

  • n-grams—producing sequences of n consecutive words from a corpus to identify words that frequently appear next to one another. N-grams are commonly used for predictive text input, such as when your smartphone suggests possible next words as you enter a text message, or your coding support tool lists the possible methods for an object you’re working with.
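
Several of the lower-level tasks listed above (tokenization, stop-word elimination, word frequency counting and n-grams) can be sketched with Python's standard library alone. The following is a rough illustration under stated assumptions, not how any particular NLP library works: the helper names tokenize, remove_stop_words, ngrams and suggest_next are hypothetical, the sample sentence is invented, and the stop-word set is only the short list of examples given in the text.

```python
import re
from collections import Counter

# Illustrative stop-word set copied from the examples in the text;
# real stop-word lists are longer.
STOP_WORDS = {"a", "and", "as", "at", "be", "do", "for", "have", "he", "i",
              "in", "it", "not", "of", "on", "that", "the", "to", "with", "you"}


def tokenize(text):
    """Split text into lowercase word and number tokens (naive regex approach)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def suggest_next(tokens, word, k=3):
    """Suggest up to k words that most often follow `word`, using bigram counts."""
    followers = Counter(b for a, b in ngrams(tokens, 2) if a == word)
    return [w for w, _ in followers.most_common(k)]


text = "The cat sat on the mat and the cat slept on the mat."
tokens = tokenize(text)
content_words = remove_stop_words(tokens)
word_counts = Counter(content_words)         # word frequency counting
bigram_counts = Counter(ngrams(tokens, 2))   # 2-grams (bigrams)

print(tokens[:4])                    # ['the', 'cat', 'sat', 'on']
print(word_counts.most_common(2))    # [('cat', 2), ('mat', 2)]
print(bigram_counts[("on", "the")])  # 2
print(suggest_next(tokens, "the"))   # ['cat', 'mat']
```

The suggest_next helper hints at how n-gram counts could drive predictive text input: given the word just typed, it proposes the words that most often followed it in the corpus. Stemming and lemmatization are omitted here because they generally rely on language-specific rules or dictionaries supplied by an NLP library.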

Generative AI

1 Prompt genAIs for a list of the top-20 n-grams in your preferred spoken language.
