Natural Language Processing

This document introduces natural language processing (NLP) as the bridge between human communication and computer comprehension. It explains how computers translate between unstructured human language and structured data using techniques such as tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition, and how the resulting structured data feeds AI applications.


Understanding Natural Language Processing

Natural language processing comes into play whenever computers attempt to comprehend human communication. When listening to words and sentences, humans naturally form an understanding from the structure of the language. When computers perform this same task, it is called natural language processing, or NLP.

NLP is useful across a wide range of AI applications, serving as a fundamental capability that enables machines to interact with human language in meaningful ways.


From Unstructured to Structured Data

NLP begins with something called unstructured text, which represents natural human speech patterns. This is simply how people communicate in everyday language.

Unstructured Text Example

Consider the statement “add eggs and milk to my shopping list.” Humans understand exactly what this means, but the text remains unstructured from a computer’s perspective. The computer cannot directly process or act upon this natural language input without translation.

Structured Representation

Computers require a structured representation of the same information that they can process. A structured version might include a shopping list element with sub-elements within it, such as an item for eggs and an item for milk. This hierarchical, organized format represents structured data that computers can manipulate and understand.
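
As a rough illustration, the structured version of the shopping list request could be written as a nested Python dictionary. The field names used here ("shopping_list", "items", "name") are assumptions made for this sketch, not a standard format.

```python
import json

# Hypothetical structured form of "add eggs and milk to my shopping list".
# The keys "shopping_list", "items", and "name" are illustrative assumptions.
shopping_list = {
    "shopping_list": {
        "items": [
            {"name": "eggs"},
            {"name": "milk"},
        ]
    }
}

# Because the data is structured, a program can address each part directly.
item_names = [item["name"] for item in shopping_list["shopping_list"]["items"]]
print(item_names)

# The same structure can be serialized for another system to consume.
print(json.dumps(shopping_list, indent=2))
```

The point of the hierarchy is that a program can reach "the second item" without having to interpret any language.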

The Role of NLP

The job of natural language processing is to translate between these two representations. NLP sits right in the middle, serving as the bridge that translates between unstructured and structured data.


Natural Language Understanding and Generation

The translation process between unstructured and structured data occurs in two directions, each with its own designation.

Natural Language Understanding (NLU)

When translation moves from unstructured to structured data, this process is called Natural Language Understanding or NLU. This direction focuses on comprehending human language and converting it into a format computers can process.

Natural Language Generation (NLG)

When translation moves from structured to unstructured data, this process is called Natural Language Generation or NLG. This direction focuses on creating human-readable text from structured computer data.
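
The two directions can be sketched with a deliberately tiny example. The functions `understand` and `generate` below are hypothetical names, and the sketch assumes a single fixed sentence pattern; real NLU and NLG systems handle far more variation.

```python
import re

def understand(utterance: str) -> dict:
    """Toy NLU: map one fixed sentence pattern to a structured command."""
    match = re.fullmatch(r"add (.+) to my shopping list\.?", utterance.strip().lower())
    if match is None:
        raise ValueError("utterance does not fit the one pattern this sketch knows")
    # Split "eggs and milk" or "eggs, bread, and milk" into individual items.
    items = [part for part in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1)) if part]
    return {"action": "add", "list": "shopping", "items": items}

def generate(command: dict) -> str:
    """Toy NLG: turn the structured command back into a readable sentence."""
    items = command["items"]
    if len(items) == 1:
        listed = items[0]
    else:
        listed = ", ".join(items[:-1]) + " and " + items[-1]
    return f"Add {listed} to my {command['list']} list."

command = understand("add eggs and milk to my shopping list")
print(command)
print(generate(command))
```

Here `understand` goes from unstructured to structured (NLU) and `generate` goes back again (NLG).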

Most NLP applications focus primarily on natural language understanding, that is, on going from unstructured to structured representations.


Use Cases for Natural Language Processing

NLP proves invaluable across numerous applications where computers must interpret or generate human language.

Machine Translation

Machine translation involves converting text from one language to another. This process requires understanding the context of sentences rather than simply translating individual words.

Translation cannot succeed by taking each individual word from one language and substituting the equivalent word in another language. The overall structure and context of what is being communicated must be understood to produce accurate translations.

A classic, possibly apocryphal, anecdote illustrates this principle. Taking the phrase “the spirit is willing, but the flesh is weak,” translating it from English to Russian, and then translating that Russian version back into English reportedly produced “the vodka is good, but the meat is rotten.” The result completely misses the intended meaning of the original sentence. NLP techniques that account for context help prevent such misunderstandings.

Virtual Assistants and Chatbots

Virtual assistants such as Siri or Alexa take spoken human utterances and derive commands to execute based on those inputs. These systems must interpret natural language instructions and convert them into actionable commands.

Chatbots operate similarly but work with written language. They take written input and use it to traverse a decision tree in order to take appropriate actions. NLP proves essential for both virtual assistants and chatbots to function effectively.
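
A minimal sketch of a chatbot walking a decision tree might look like the following. The tree contents, the action names, and the keyword matching are all assumptions invented for illustration.

```python
# Hypothetical decision tree: inner nodes have keyword-labelled branches,
# leaf nodes name an action. None of these names come from a real product.
DECISION_TREE = {
    "prompt": "Do you need help with an order or an account?",
    "branches": {
        "order": {
            "prompt": "Do you want to track or cancel the order?",
            "branches": {
                "track": {"action": "show_tracking_page"},
                "cancel": {"action": "start_cancellation"},
            },
        },
        "account": {"action": "open_account_settings"},
    },
}

def traverse(tree: dict, messages: list[str]) -> str:
    """Follow one branch per user message until an action (leaf) is reached."""
    node = tree
    for message in messages:
        if "action" in node:
            break
        words = message.lower().split()
        for keyword, child in node["branches"].items():
            if keyword in words:
                node = child
                break
    # If no leaf was reached, the bot would have to ask again.
    return node.get("action", "ask_again")

print(traverse(DECISION_TREE, ["I have a question about my order", "please cancel it"]))
```

Matching on single keywords is the weakest part of this sketch; the NLP techniques described later are what make the matching step more robust.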

Sentiment Analysis

Sentiment analysis takes text, perhaps an email message or a product review, and attempts to derive the sentiment expressed within it. The analysis determines whether content expresses positive or negative sentiment.

Beyond simple positive or negative classification, sentiment analysis can also determine whether text is written as a serious statement or employs sarcasm. NLP provides the tools necessary to extract these nuanced emotional indicators from text.
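
One very simple way to approximate positive or negative classification is to count words from small sentiment word lists. The lists and the `sentiment` function below are assumptions made for this sketch, not a description of how production systems work.

```python
import re

# Tiny illustrative word lists; real sentiment lexicons are far larger.
POSITIVE_WORDS = {"great", "love", "excellent", "good", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "broken", "disappointed"}

def sentiment(text: str) -> str:
    """Classify text by counting positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(token in POSITIVE_WORDS for token in tokens)
    score -= sum(token in NEGATIVE_WORDS for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this blender, it works great"))
print(sentiment("Terrible product, arrived broken"))
# Sarcasm defeats plain word counting: this complaint is scored as positive.
print(sentiment("Oh great, it broke on day one"))
```

The last call shows why sarcasm detection needs more than word counts: the only listed word in the sentence is "great".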

Spam Detection

Spam detection examines email messages and determines whether they constitute legitimate correspondence or spam. The system looks for indicators within the message content that suggest spam classification.

Overused words, poor grammar, and inappropriate claims of urgency can all indicate that a message is likely spam. NLP enables systems to identify these patterns and filter unwanted messages effectively.
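
The indicators above can be sketched as a handful of heuristic checks. The phrase lists and thresholds are invented for illustration, and because checking grammar requires a real grammar checker, this sketch substitutes a count of exclamation marks as a crude stand-in; that substitution is an assumption, not part of the description above.

```python
import re

# Illustrative lists only; real filters learn such signals from labelled mail.
URGENCY_PHRASES = ("act now", "urgent", "immediately", "last chance")
OVERUSED_WORDS = {"free", "winner", "prize", "guaranteed"}

def spam_indicators(message: str) -> list[str]:
    """Return the names of the heuristic indicators found in the message."""
    text = message.lower()
    tokens = re.findall(r"[a-z']+", text)
    found = []
    if any(phrase in text for phrase in URGENCY_PHRASES):
        found.append("urgency")
    if sum(token in OVERUSED_WORDS for token in tokens) >= 2:
        found.append("overused words")
    # Crude stand-in for "poor grammar": heavy use of exclamation marks.
    if message.count("!") >= 3:
        found.append("excessive punctuation")
    return found

def looks_like_spam(message: str) -> bool:
    """Flag a message when at least two indicators are present."""
    return len(spam_indicators(message)) >= 2

print(spam_indicators("URGENT!!! You are a WINNER, claim your FREE prize now"))
print(looks_like_spam("Can we move the meeting to Tuesday?"))
```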


How Natural Language Processing Works

NLP does not function as a single algorithm. Rather, it operates more like a collection of tools that can be applied to resolve various language processing challenges.

Input to NLP

The input to NLP consists of unstructured text, either written text or spoken text that has been converted to written text through a speech-to-text algorithm. Once text is available in written form, NLP processing can begin.


Tokenization

The first stage of NLP is called tokenization. This process involves taking a string and breaking it down into manageable chunks called tokens.

Consider the unstructured text “add eggs and milk to my shopping list.” This sentence contains eight words, which could become eight tokens. From this point forward, NLP processes one token at a time as it traverses through the text.

Tokenization establishes the foundation for all subsequent NLP operations by segmenting continuous text into discrete units that can be individually analyzed.
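
A minimal word-level tokenizer can be sketched with a regular expression. This is only one possible approach, assumed here for illustration; production tokenizers handle punctuation, contractions, and non-English scripts with much more care.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    # \w+ matches a run of letters or digits; the optional group keeps
    # contractions written with a straight apostrophe together.
    return re.findall(r"\w+(?:'\w+)*", text.lower())

tokens = tokenize("Add eggs and milk to my shopping list.")
print(tokens)       # ['add', 'eggs', 'and', 'milk', 'to', 'my', 'shopping', 'list']
print(len(tokens))  # 8
```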


Stemming

Once text has been broken down into tokens, the first processing stage that can be performed is called stemming. Stemming derives the word stem for a given token.

How Stemming Works

Consider the words “running,” “runs,” and “ran.” All three are forms of “run.” Stemming strips affixes, most commonly suffixes such as “-ing” and “-s,” to arrive at a common word stem. Irregular forms such as “ran” generally cannot be reduced by affix removal alone and require a lookup of some kind.
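
The idea can be sketched with a deliberately naive suffix-stripping function. The suffix list and the `naive_stem` name are assumptions for this example; established stemmers such as the Porter stemmer use a much larger set of rules. Because this sketch only strips suffixes, an irregular form such as "ran" passes through unchanged, and mapping it to "run" would need a lookup table or lemmatization.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # ordered so longer suffixes are tried first

def naive_stem(token: str) -> str:
    """Strip one common suffix; a toy stand-in for a real stemming algorithm."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo the consonant doubling added before "-ing"/"-ed": "runn" -> "run".
            if suffix in ("ing", "ed") and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for word in ("running", "runs", "ran"):
    print(word, "->", naive_stem(word))
```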

Limitations of Stemming

Stemming does not work well for every token. For example, “universal” and “university” do not meaningfully stem down to “universe.” The relationship between these words requires deeper semantic understanding than simple prefix and suffix removal can provide.


Lemmatization

For situations where stemming proves inadequate, another tool called lemmatization becomes available. Lemmatization takes a given token and learns its meaning through dictionary definitions.

Deriving the Lemma

From the dictionary definition, lemmatization derives the root or lemma of a word. Consider the word “better.” The lemma of “better” is “good” because “better” is derived from “good.”

In contrast, a suffix-stripping stemmer might reduce “better” to “bet,” which has nothing to do with the original meaning. This difference demonstrates why the choice between stemming and lemmatization matters significantly for a given token.

The decision to use stemming or lemmatization for a particular token can substantially impact the accuracy and usefulness of NLP results.
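
At its simplest, lemmatization can be pictured as a dictionary lookup. The tiny `LEMMA_DICTIONARY` below is hand-written for illustration and is an assumption of this sketch; a real lemmatizer consults a full lexicon and usually takes the part of speech into account.

```python
# Hand-written toy dictionary mapping word forms to their lemma.
LEMMA_DICTIONARY = {
    "better": "good",
    "ran": "run",
    "running": "run",
    "runs": "run",
    "eggs": "egg",
}

def lemmatize(token: str) -> str:
    """Look the token up; fall back to the lowercased token when it is unknown."""
    word = token.lower()
    return LEMMA_DICTIONARY.get(word, word)

print(lemmatize("better"))  # good
print(lemmatize("ran"))     # run
print(lemmatize("milk"))    # milk
```

Unlike suffix stripping, the lookup handles irregular forms, at the cost of needing an entry for every form it should recognize.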


Part of Speech Tagging

Part of speech tagging examines where a token is used within the context of a sentence to determine its grammatical function.

Context-Dependent Meaning

Consider the word “make.” In the sentence “I’m going to make dinner,” “make” functions as a verb. However, in the question “what make is your laptop,” “make” serves as a noun.

The position and usage of a token within a sentence determine its part of speech. Part of speech tagging helps derive this contextual information, which proves essential for understanding the overall meaning and structure of sentences.
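
To make the role of context concrete, the sketch below tags a word by looking only at the word before it. The cue lists and the `tag_by_previous_word` function are assumptions chosen to fit the two "make" sentences; real taggers use statistical or neural models over the whole sentence.

```python
# Illustrative cue words only; they are tuned to the two example sentences.
NOUN_CUES = {"a", "an", "the", "what", "which", "this", "that"}
VERB_CUES = {"to", "will", "can", "should", "must"}

def tag_by_previous_word(tokens: list[str], index: int) -> str:
    """Guess the part of speech of tokens[index] from the preceding token."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    if previous in NOUN_CUES:
        return "NOUN"
    if previous in VERB_CUES:
        return "VERB"
    return "UNKNOWN"

verb_sentence = "I'm going to make dinner".split()
noun_sentence = "what make is your laptop".split()
print(tag_by_previous_word(verb_sentence, verb_sentence.index("make")))  # VERB
print(tag_by_previous_word(noun_sentence, noun_sentence.index("make")))  # NOUN
```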


Named Entity Recognition

Named entity recognition asks whether a given token has an associated entity that provides additional semantic information.

Entity Categories

Consider the token “Arizona.” This token has an entity classification of a U.S. state, which provides geographical and political context. Similarly, the token “Ralph” has an entity classification of a person’s name, indicating it refers to an individual rather than a common noun.

Named entity recognition enables NLP systems to identify and categorize proper nouns, locations, organizations, dates, and other significant entities within text.
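
A very small version of this idea is a lookup against a list of known names, sometimes called a gazetteer. The `GAZETTEER` contents and label names below are assumptions for this sketch; practical systems combine such lists with models that use the surrounding context.

```python
# Toy gazetteer mapping lowercased tokens to an entity label.
GAZETTEER = {
    "arizona": "US_STATE",
    "ralph": "PERSON",
}

def recognize_entities(tokens: list[str]) -> list[tuple[str, str]]:
    """Return (token, label) pairs for tokens found in the gazetteer."""
    return [
        (token, GAZETTEER[token.lower()])
        for token in tokens
        if token.lower() in GAZETTEER
    ]

print(recognize_entities("Ralph moved to Arizona last year".split()))
```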


The NLP Toolkit

The techniques described—tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition—represent some of the primary tools available in the NLP toolkit. These tools can be selectively applied to transform unstructured human speech into structured data that computers can understand.

From Unstructured to Structured

The progression from raw text through these various processing stages results in structured data. This structured data can then be applied to all sorts of AI applications, enabling computers to perform sophisticated language-based tasks.
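
Putting a few stages together, the sketch below turns a sentence into one structured record per token. The lookup tables and the record layout ("token", "lemma", "entity") are assumptions made for illustration, not a standard output format.

```python
import re

LEMMAS = {"eggs": "egg", "bought": "buy"}               # assumed toy lemma table
ENTITIES = {"arizona": "US_STATE", "ralph": "PERSON"}   # assumed toy gazetteer

def analyze(text: str) -> list[dict]:
    """Tokenize the text, then attach a lemma and an entity label to each token."""
    records = []
    for token in re.findall(r"\w+", text):
        lowered = token.lower()
        records.append({
            "token": token,
            "lemma": LEMMAS.get(lowered, lowered),
            "entity": ENTITIES.get(lowered),  # None when no entity is known
        })
    return records

records = analyze("Ralph bought eggs in Arizona")
for record in records:
    print(record)
```

Each record is ordinary structured data, so downstream code can filter, count, or store it without interpreting the sentence again.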

Application to AI

Once unstructured text has been converted to structured data through NLP techniques, that structured information becomes available for use in machine learning models, decision systems, information retrieval applications, and numerous other AI-driven solutions.


NLP Techniques Summary

The core NLP techniques work together to enable comprehensive language processing:

| Technique | Purpose | Example |
| --- | --- | --- |
| Tokenization | Break text into discrete units | “add eggs and milk” becomes tokens: add, eggs, and, milk |
| Stemming | Extract word stems | “running” becomes “run” |
| Lemmatization | Find dictionary root form | “better” becomes “good” |
| Part of Speech Tagging | Identify grammatical function | “make” can be verb or noun depending on context |
| Named Entity Recognition | Identify and categorize entities | “Arizona” is identified as U.S. state |

Conclusion

Natural language processing serves as the essential bridge between human communication and computer comprehension, translating unstructured text into structured data through a collection of specialized techniques. These techniques include tokenization for breaking text into manageable units, stemming and lemmatization for normalizing word forms, part of speech tagging for understanding grammatical context, and named entity recognition for identifying significant entities. Together, these tools enable computers to understand language context, perform accurate translations, power virtual assistants and chatbots, analyze sentiment, and detect spam. The structured data produced by NLP provides the foundation for sophisticated AI applications that interact with human language in meaningful and useful ways.


FAQs

Natural language processing (NLP) is the process by which computers attempt to understand and process human language. It occurs when computers translate between unstructured human speech or text and structured data that machines can comprehend and act upon.

Unstructured text is natural human speech or writing, such as “add eggs and milk to my shopping list.” While humans easily understand this format, it remains unstructured from a computer’s perspective because it lacks the organized format that computers require for processing.

The primary role of NLP is to serve as a bridge that translates between unstructured human language and structured data that computers can process. NLP sits in the middle of this translation process, enabling computers to understand human communication.

Natural Language Understanding (NLU) is the process of translating from unstructured to structured data, focusing on comprehending human language. Natural Language Generation (NLG) is the reverse process, translating from structured data to unstructured human-readable text. NLU helps computers understand language, while NLG helps computers generate language.

Machine translation requires understanding context because it cannot succeed by simply translating individual words. The overall structure and context of sentences must be understood to produce accurate translations. Without context, translations can completely miss the intended meaning, as demonstrated by the phrase “the spirit is willing, but the flesh is weak” becoming “the vodka is good, but the meat is rotten” when translated from English to Russian and back again.

Match each use case with its description.

| Use Case | Description |
| --- | --- |
| A. Virtual Assistants | 1. Determining whether content expresses positive or negative sentiment |
| B. Sentiment Analysis | 2. Converting text from one language to another while preserving context |
| C. Spam Detection | 3. Taking human utterances and deriving commands to execute |
| D. Machine Translation | 4. Examining messages for indicators like overused words or poor grammar |

Answer: A-3, B-1, C-4, D-2.

Tokenization is the first stage of NLP that involves taking a string of text and breaking it down into manageable chunks called tokens. For example, the sentence “add eggs and milk to my shopping list” contains eight words that could become eight tokens. From this point forward, NLP processes one token at a time.

Stemming derives the word stem for a given token by removing prefixes and suffixes and normalizing tense. For example, “running,” “runs,” and “ran” all have the word stem “run.” Stemming extracts the fundamental root of words to normalize different word forms.

Stemming does not work well for every token. For example, “universal” and “university” do not meaningfully stem down to “universe.” The relationship between these words requires deeper semantic understanding than simple prefix and suffix removal can provide, which is why lemmatization is sometimes needed instead.

Lemmatization is a technique that takes a given token and learns its meaning through dictionary definitions, then derives the root or lemma of the word. For example, the lemma of “better” is “good” because “better” is derived from “good,” whereas stemming would produce “bet,” which lacks semantic meaning.

Which of the following statements about stemming and lemmatization is true?

  1. Stemming is always more accurate than lemmatization
  2. The choice between stemming and lemmatization can substantially impact the accuracy and usefulness of NLP results
  3. Lemmatization is faster but less accurate than stemming
  4. They produce identical results for all tokens
(2) The choice between stemming and lemmatization matters significantly for a given token and can substantially impact the accuracy and usefulness of NLP results. Stemming works well for some words but fails for others, while lemmatization uses dictionary definitions to derive more semantically meaningful roots.

Part of speech tagging examines where a token is used within the context of a sentence to determine its grammatical function. For example, “make” can function as a verb in “I’m going to make dinner” or as a noun in “what make is your laptop.” The position and usage of a token within a sentence determine its part of speech.

Named entity recognition (NER) identifies whether a given token has an associated entity that provides additional semantic information. For example, “Arizona” has an entity classification of a U.S. state, while “Ralph” has an entity classification of a person’s name. NER enables systems to identify and categorize proper nouns, locations, organizations, dates, and other significant entities.

True or false: NLP functions as a single algorithm that processes all language tasks in the same way.

False. NLP does not function as a single algorithm. Rather, it operates more like a collection of tools (including tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition) that can be selectively applied to resolve various language processing challenges.

The input to NLP consists of unstructured text, which can be either written text or spoken text that has been converted to written text through a speech-to-text algorithm. Once text is available in written form, NLP processing can begin.

Virtual assistants such as Siri or Alexa take human utterances (spoken language) and derive commands to execute based upon those inputs. Chatbots operate similarly but work with written language, taking written input and using it to traverse a decision tree. Both rely on NLP, but they process different input modalities.

Spam detection examines email messages for indicators within the message content that suggest spam classification, including:

  • Overused words
  • Poor grammar
  • Inappropriate claims of urgency

These patterns help NLP systems identify and filter unwanted messages effectively.

Which of the following is not an NLP technique?

  1. Tokenization
  2. Lemmatization
  3. Image recognition
  4. Named entity recognition
(3) Image recognition is not an NLP technique. The core NLP techniques discussed include tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition, all of which work with text data rather than images.

Sentiment analysis uses NLP to take text such as email messages or product reviews and derive the sentiment expressed within them. NLP enables systems to determine whether content expresses positive or negative sentiment and can even distinguish between serious statements and sarcasm.

Once unstructured text has been converted to structured data through NLP techniques, that structured information becomes available for use in machine learning models, decision systems, information retrieval applications, and numerous other AI-driven solutions. The structured data provides the foundation for sophisticated AI applications that interact with human language.