This document introduces natural language processing, explaining how computers translate between unstructured human language and structured data through techniques like tokenization, stemming, lemmatization, part of speech tagging and named entity recognition.
This document explores natural language processing as the bridge between human communication and computer comprehension. Through a comprehensive examination of NLP techniques including tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition, the discussion reveals how computers transform unstructured text into structured data for AI applications.
Natural language processing takes place whenever humans communicate and computers attempt to comprehend that communication. Humans listening to words and sentences naturally form an understanding from the structure of the language. When a computer performs this same task, it is performing natural language processing, or NLP.
NLP is useful across a wide range of AI applications because it is the fundamental capability that enables machines to interact with human language in meaningful ways.
NLP begins with something called unstructured text, which represents natural human speech patterns. This is simply how people communicate in everyday language.
Consider the statement “add eggs and milk to my shopping list.” Humans understand exactly what this means, but the text remains unstructured from a computer’s perspective. The computer cannot directly process or act upon this natural language input without translation.
Computers require a structured representation of the same information that they can process. A structured version might include a shopping list element with sub-elements within it, such as an item for eggs and an item for milk. This hierarchical, organized format represents structured data that computers can manipulate and understand.
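As a rough illustration, the structured version of the shopping request could be held in a nested Python dictionary. The key names (`shopping_list`, `items`, `name`) are assumptions made for this sketch, not a format the text prescribes.

```python
import json

# Unstructured input: plain human language.
utterance = "add eggs and milk to my shopping list"

# One possible structured representation (hypothetical key names):
# a shopping list element containing one sub-element per item.
structured = {
    "shopping_list": {
        "items": [
            {"name": "eggs"},
            {"name": "milk"},
        ]
    }
}

# Structured data can be traversed and manipulated directly.
item_names = [item["name"] for item in structured["shopping_list"]["items"]]
print(item_names)  # ['eggs', 'milk']

# It can also be serialized for other programs to consume.
print(json.dumps(structured, indent=2))
```

Unlike the raw sentence, this form lets a program list, add, or remove items without any knowledge of grammar.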
The job of natural language processing is to translate between these two representations. NLP sits right in the middle, serving as the bridge that translates between unstructured and structured data.
The translation process between unstructured and structured data occurs in two directions, each with its own designation.
When translation moves from unstructured to structured data, this process is called Natural Language Understanding or NLU. This direction focuses on comprehending human language and converting it into a format computers can process.
When translation moves from structured to unstructured data, this process is called Natural Language Generation or NLG. This direction focuses on creating human-readable text from structured computer data.
Most NLP applications focus on the first direction: going from unstructured to structured representations through natural language understanding.
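The two directions can be sketched with a pair of toy functions. This is a minimal illustration that assumes a single fixed sentence pattern; the function names `understand` and `generate` and the dictionary keys are hypothetical, and real NLU and NLG systems are far more general.

```python
import re

def understand(utterance: str) -> dict:
    """Toy NLU: unstructured text -> structured data.

    Only handles the pattern 'add <items> to my shopping list'.
    """
    match = re.fullmatch(r"add (.+) to my shopping list", utterance.strip().lower())
    if match is None:
        raise ValueError("utterance does not match the toy pattern")
    # Split the item phrase on commas and on the standalone word 'and'.
    parts = re.split(r",|\band\b", match.group(1))
    items = [part.strip() for part in parts if part.strip()]
    return {"action": "add", "target": "shopping_list", "items": items}

def generate(command: dict) -> str:
    """Toy NLG: structured data -> human-readable text."""
    items = command["items"]
    if len(items) == 1:
        phrase = items[0]
    else:
        phrase = ", ".join(items[:-1]) + " and " + items[-1]
    return f"I added {phrase} to your shopping list."

command = understand("add eggs and milk to my shopping list")
print(command)
print(generate(command))  # I added eggs and milk to your shopping list.
```

The hard part in practice is `understand`: human phrasing varies endlessly, which is one reason most of the effort goes into the unstructured-to-structured direction.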
NLP proves invaluable across numerous applications where computers must interpret or generate human language.
Machine translation involves converting text from one language to another. This process requires understanding the context of sentences rather than simply translating individual words.
Translation cannot succeed by taking each individual word from one language and substituting the equivalent word in another language. The overall structure and context of what is being communicated must be understood to produce accurate translations.
A classic example of translation failure demonstrates this principle. Taking the phrase “the spirit is willing, but the flesh is weak” and translating it from English to Russian, then translating that Russian version back into English, produces “the vodka is good, but the meat is rotten.” This result completely misses the intended context of the original sentence. NLP helps prevent such contextual misunderstandings.
Virtual assistants such as Siri or Alexa on phones take human utterances and derive commands to execute based upon those inputs. These systems must interpret natural language instructions and convert them into actionable commands.
Chatbots operate similarly but work with written language. They take written input and use it to traverse a decision tree in order to take appropriate actions. NLP proves essential for both virtual assistants and chatbots to function effectively.
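The decision-tree traversal can be sketched as follows. The tree contents, the action names, and the keyword-matching rule are all invented for illustration; a production chatbot would interpret replies with much more than keyword lookup.

```python
import re

# A hypothetical decision tree: each node is either a question with
# keyword-labelled branches or a leaf holding an action name.
TREE = {
    "question": "Do you need help with an order or with your account?",
    "branches": {
        "order": {
            "question": "Do you want to track or cancel the order?",
            "branches": {
                "track": {"action": "show_tracking_page"},
                "cancel": {"action": "start_cancellation"},
            },
        },
        "account": {"action": "open_account_settings"},
    },
}

def traverse(tree: dict, replies: list[str]) -> str:
    """Follow, for each reply, the first branch whose keyword appears in it."""
    node = tree
    for reply in replies:
        if "action" in node:
            break  # already at a leaf; ignore further replies
        words = re.findall(r"[a-z']+", reply.lower())
        for keyword, child in node["branches"].items():
            if keyword in words:
                node = child
                break
        else:
            # No branch keyword was found in this reply.
            return "ask_to_rephrase"
    # If the replies ran out before reaching a leaf, another question is needed.
    return node.get("action", "ask_followup_question")

print(traverse(TREE, ["I have a problem with my order", "please cancel it"]))
```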
Sentiment analysis takes text, perhaps an email message or a product review, and attempts to derive the sentiment expressed within it. The analysis determines whether content expresses positive or negative sentiment.
Beyond simple positive or negative classification, sentiment analysis can also determine whether text is written as a serious statement or employs sarcasm. NLP provides the tools necessary to extract these nuanced emotional indicators from text.
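A very small lexicon-based scorer gives a feel for the basic positive-or-negative decision. The word lists and the extra "neutral" outcome are assumptions of this sketch; real sentiment systems rely on much larger lexicons or trained models.

```python
import re

# Tiny hand-made word lists, chosen only for illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "broken"}

def sentiment(text: str) -> str:
    """Count positive and negative words and compare the totals."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this laptop, the screen is excellent"))  # positive
print(sentiment("Terrible battery and a broken hinge"))          # negative
print(sentiment("Oh great, it stopped working again"))           # positive
```

The last call shows why sarcasm is hard: the sentence is clearly a complaint, but a word-counting approach sees only "great" and labels it positive.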
Spam detection examines email messages and determines whether they constitute legitimate correspondence or spam. The system looks for indicators within the message content that suggest spam classification.
Overused words, poor grammar, and inappropriate claims of urgency can all indicate that a message is likely spam. NLP enables systems to identify these patterns and filter unwanted messages effectively.
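Two of these indicators, overused words and claims of urgency, can be approximated with simple lookups. The word list, the phrase list, and the threshold of two overused words are hypothetical choices for this sketch; detecting poor grammar is left out because it needs more than pattern matching.

```python
import re

# Hypothetical indicator lists, chosen only for illustration.
OVERUSED_WORDS = {"free", "winner", "guaranteed", "prize"}
URGENCY_PHRASES = ("act now", "urgent", "immediately", "last chance")

def spam_indicators(message: str) -> list[str]:
    """Return the names of the spam indicators found in a message."""
    text = message.lower()
    tokens = re.findall(r"[a-z']+", text)
    found = []
    # Flag the message when at least two overused words appear.
    if sum(t in OVERUSED_WORDS for t in tokens) >= 2:
        found.append("overused words")
    # Flag the message when any urgency phrase appears as a substring.
    if any(phrase in text for phrase in URGENCY_PHRASES):
        found.append("urgency")
    return found

print(spam_indicators("URGENT: you are a winner! Claim your free prize, act now"))
print(spam_indicators("Meeting moved to 3pm, see agenda attached"))
```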
NLP does not function as a single algorithm. Rather, it operates more like a collection of tools that can be applied to resolve various language processing challenges.
The input to NLP consists of unstructured text, either written text or spoken text that has been converted to written text through a speech-to-text algorithm. Once text is available in written form, NLP processing can begin.
The first stage of NLP is called tokenization. This process involves taking a string and breaking it down into manageable chunks called tokens.
Consider the unstructured text “add eggs and milk to my shopping list.” This sentence contains eight words, which could become eight tokens. From this point forward, NLP processes one token at a time as it traverses through the text.
Tokenization establishes the foundation for all subsequent NLP operations by segmenting continuous text into discrete units that can be individually analyzed.
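A minimal tokenizer can be written with a regular expression. This sketch assumes English text and treats each word or punctuation mark as one token; real tokenizers handle many more cases, and the function name `tokenize` is just an illustrative choice.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # A word may contain internal apostrophes (e.g. "I'm");
    # any other non-space character becomes its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = tokenize("add eggs and milk to my shopping list")
print(tokens)       # ['add', 'eggs', 'and', 'milk', 'to', 'my', 'shopping', 'list']
print(len(tokens))  # 8
```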
Once text has been broken down into tokens, one of the first processing steps that can be performed is stemming. Stemming derives the word stem for a given token.
Consider the words “running,” “runs,” and “ran.” The word stem for all three of these variations is “run.” Stemming works by stripping affixes, chiefly suffixes, to arrive at the fundamental word stem. Rule-based stemmers handle regular forms such as “running” and “runs” this way; an irregular form such as “ran” generally cannot be reduced to “run” by affix removal alone.
Stemming does not work well for every token. For example, “universal” and “university” do not meaningfully stem down to “universe.” The relationship between these words requires deeper semantic understanding than simple prefix and suffix removal can provide.
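The idea of suffix removal, and its limits, can be seen in a deliberately crude stemmer. The two rules below are invented for this sketch and are not the rules of any published stemming algorithm.

```python
def naive_stem(token: str) -> str:
    """Strip a few common English suffixes using hard-coded rules."""
    word = token.lower()
    if word.endswith("ing") and len(word) >= 6:
        word = word[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if word[-1] == word[-2] and word[-1] not in "aeiouls":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) >= 4:
        word = word[:-1]
    return word

for token in ["running", "runs", "ran"]:
    print(token, "->", naive_stem(token))
```

The regular forms "running" and "runs" both reduce to "run", but "ran" passes through unchanged because no suffix rule applies to it. Gaps like this are what motivate the dictionary-based approach of lemmatization.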
For situations where stemming proves inadequate, another tool called lemmatization becomes available. Lemmatization takes a given token and learns its meaning through dictionary definitions.
From the dictionary definition, lemmatization derives the root or lemma of a word. Consider the word “better.” The lemma of “better” is “good” because “better” is derived from “good.”
In contrast, stripping the suffix from “better” would leave a stem such as “bet,” which has no semantic connection to the original word. This difference demonstrates why the choice between stemming and lemmatization matters significantly for a given token.
The decision to use stemming or lemmatization for a particular token can substantially impact the accuracy and usefulness of NLP results.
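The contrast can be sketched by placing a dictionary lookup next to a crude suffix rule. The `LEMMAS` table is a tiny hand-built stand-in for a real lexical resource, and `crude_stem` is a hypothetical rule written only to reproduce the "bet" example; neither reflects a specific library.

```python
# A tiny hand-built lemma dictionary; real lemmatizers consult large
# lexical resources and usually take the part of speech into account.
LEMMAS = {
    "better": "good",
    "best": "good",
    "ran": "run",
    "running": "run",
    "runs": "run",
}

def lemmatize(token: str) -> str:
    """Look the token up; fall back to the token itself if it is unknown."""
    word = token.lower()
    return LEMMAS.get(word, word)

def crude_stem(token: str) -> str:
    """Strip a trailing 'er' and undo a doubled final letter (illustrative rule)."""
    word = token.lower()
    if word.endswith("er") and len(word) >= 5:
        word = word[:-2]
        if word[-1] == word[-2]:
            word = word[:-1]
    return word

print(lemmatize("better"))   # good
print(crude_stem("better"))  # bet
```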
Part of speech tagging examines where a token is used within the context of a sentence to determine its grammatical function.
Consider the word “make.” In the sentence “I’m going to make dinner,” “make” functions as a verb. However, in the question “what make is your laptop,” “make” serves as a noun.
The position and usage of a token within a sentence determines its part of speech. Part of speech tagging helps derive this contextual information, which proves essential for understanding the overall meaning and structure of sentences.
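A toy rule for the word "make" shows how neighbouring tokens can decide the tag. The cue word lists and the tag names are assumptions of this sketch; practical taggers are typically statistical models trained on annotated text, not hand-written rules like these.

```python
import re

# Hypothetical cue words: what precedes "make" hints at its role.
DETERMINERS = {"a", "an", "the", "this", "that", "what", "which"}
VERB_CUES = {"to", "will", "can", "should", "must"}

def tag_make(sentence: str) -> str:
    """Tag the first occurrence of 'make' using the preceding token."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    index = tokens.index("make")  # raises ValueError if "make" is absent
    previous = tokens[index - 1] if index > 0 else ""
    if previous in VERB_CUES:
        return "VERB"
    if previous in DETERMINERS:
        return "NOUN"
    return "UNKNOWN"

print(tag_make("I'm going to make dinner"))   # VERB
print(tag_make("What make is your laptop?"))  # NOUN
```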
Named entity recognition asks whether a given token has an associated entity that provides additional semantic information.
Consider the token “Arizona.” This token has an entity classification of a U.S. state, which provides geographical and political context. Similarly, the token “Ralph” has an entity classification of a person’s name, indicating it refers to an individual rather than a common noun.
Named entity recognition enables NLP systems to identify and categorize proper nouns, locations, organizations, dates, and other significant entities within text.
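The simplest form of entity lookup is a gazetteer, a table that maps known names to entity types. The table below contains only the two examples from the text, and the label strings are illustrative; real systems also use context to recognize names they have never seen.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types.
GAZETTEER = {
    "arizona": "U.S. state",
    "ralph": "person name",
}

def recognize_entities(text: str) -> list[tuple[str, str]]:
    """Return (token, entity type) pairs for tokens found in the gazetteer."""
    entities = []
    for token in re.findall(r"\w+", text):
        label = GAZETTEER.get(token.lower())
        if label is not None:
            entities.append((token, label))
    return entities

print(recognize_entities("Ralph moved to Arizona last year"))
```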
The techniques described—tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition—represent some of the primary tools available in the NLP toolkit. These tools can be selectively applied to transform unstructured human speech into structured data that computers can understand.
The progression from raw text through these various processing stages results in structured data. This structured data can then be applied to all sorts of AI applications, enabling computers to perform sophisticated language-based tasks.
Once unstructured text has been converted to structured data through NLP techniques, that structured information becomes available for use in machine learning models, decision systems, information retrieval applications, and numerous other AI-driven solutions.
The core NLP techniques work together to enable comprehensive language processing:
| Technique | Purpose | Example |
|---|---|---|
| Tokenization | Break text into discrete units | “add eggs and milk” becomes tokens: add, eggs, and, milk |
| Stemming | Extract word stems | “running” becomes “run” |
| Lemmatization | Find dictionary root form | “better” becomes “good” |
| Part of Speech Tagging | Identify grammatical function | “make” can be verb or noun depending on context |
| Named Entity Recognition | Identify and categorize entities | “Arizona” is identified as U.S. state |
Natural language processing serves as the essential bridge between human communication and computer comprehension, translating unstructured text into structured data through a collection of specialized techniques. These techniques include tokenization for breaking text into manageable units, stemming and lemmatization for normalizing word forms, part of speech tagging for understanding grammatical context, and named entity recognition for identifying significant entities. Together, these tools enable computers to understand language context, perform accurate translations, power virtual assistants and chatbots, analyze sentiment, and detect spam. The structured data produced by NLP provides the foundation for sophisticated AI applications that interact with human language in meaningful and useful ways.
| Use Case | Description |
|---|---|
| A. Virtual Assistants | 1. Determining whether content expresses positive or negative sentiment |
| B. Sentiment Analysis | 2. Converting text from one language to another while preserving context |
| C. Spam Detection | 3. Taking human utterances and deriving commands to execute |
| D. Machine Translation | 4. Examining messages for indicators like overused words or poor grammar |
Answers: A-3, B-1, C-4, D-2.
(2) The choice between stemming and lemmatization matters significantly for a given token and can substantially impact the accuracy and usefulness of NLP results. Stemming works well for some words but fails for others, while lemmatization uses dictionary definitions to derive more semantically meaningful roots.
True or false: NLP functions as a single algorithm that processes all language tasks in the same way.
False. NLP does not function as a single algorithm. Rather, it operates more like a collection of tools (including tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition) that can be selectively applied to resolve various language processing challenges.
Spam detection examines email messages for indicators within the message content that suggest spam classification, including overused words, poor grammar, and inappropriate claims of urgency.
(3) Image recognition is not an NLP technique. The core NLP techniques discussed include tokenization, stemming, lemmatization, part of speech tagging, and named entity recognition, all of which work with text data rather than images.