Fundamentals of Natural Language Processing
Table of Contents
- Fundamentals of Natural Language Processing
- Three Ways to Build NLP Systems
- Rule-Based Systems
- Supervised Learning
- Reinforcement Learning
- Data Requirements Comparison
- Building a Rule-Based Sentiment Classifier
- Feature-Based Scoring
- Making Decisions: The Inference Rule
- Hard Cases for Rule-Based Systems
- Transitioning to Supervised Learning
- Scoring Function
- Example: Bag of Words (BoW)
- Learning the Parameters
- Structured Perceptron
- Limitations of BoW
- Neural Networks
- Beyond Classification
- Softmax
- The Evolution: Ranking, Generation, and Actions
Fundamentals of Natural Language Processing
Natural Language Processing, or NLP, is a technology that enables computers to understand and process human language. It has many practical applications, such as translation apps and voice assistants. Computers also use NLP for tasks like filtering out spam emails and suggesting code completions.
NLP is essentially about helping computers work with language, taking text in, then producing text out. This process involves more than just processing language. There are three key capabilities that NLP enables. First, it captures meaning in a structured way for tasks like question answering, using techniques such as embeddings to create useful representations. Second, NLP generates language, producing text or code for tasks like dialogue and translation. Third, NLP connects language to action, allowing computers to use language to perform tasks, solve problems, and interact with environments, which is where AI agents are used.
In essence, most NLP tasks can be framed as: take some input $x$, produce some output $y$. Both input and output might involve language.
| Task | Input (x) | Output (y) |
|---|---|---|
| Translation | Text in one language | Text in another language |
| Text Classification | Any text | A category label |
| Image Captioning | An image | Text describing the image |
| Information Retrieval | Search query | Ranked list of documents |
| Agent Tasks | Environment state | An action to take |
Three Ways to Build NLP Systems
Rule-Based Systems
The simplest approach is to write explicit logic by hand in a rule-based system, using regular expressions, keyword matching, or explicit "if-then" rules. To implement this, we define patterns that match certain inputs. When a pattern matches, the system outputs a predetermined response.
Example:
```python
def classify(x: str) -> str:
    # A simple rule-based approach using a list of sports keywords
    sports_keywords = ["baseball", "soccer", "football", "tennis"]
    if any(keyword in x.lower() for keyword in sports_keywords):
        return "sports"
    else:
        return "other"
```
The trade-off:
| Pros | Cons |
|---|---|
| We can see exactly why a decision was made | Fails to handle nuance, typos, or context |
| No training data required to get started | Doesn't generalize beyond what we've explicitly coded |
| Fast to implement for simple tasks | Misses implicit meanings |
A simple classifier:
| Input | Predicted Label | Logic |
|---|---|---|
| "I love to play baseball." | sports | Matches "baseball" |
| "The stock price is going up." | other | No keywords present |
| "He got a hat-trick yesterday." | other | Misses sports context without the specific word |
| "He is wearing tennis shoes." | sports | Matches "tennis" |
The last case shows the flaw: "tennis shoes" isn't about tennis, but our simple rule classifies it as sports anyway. This illustrates the brittleness of rule-based approaches.
Supervised Learning
In machine learning, supervised learning (SL) is a paradigm in which an algorithm learns to map input data to a specific output based on example input-output pairs. This involves training a statistical model on labeled data, meaning each piece of input data is provided with the correct output. For instance, to have a model identify cats in images, supervised learning would involve feeding it many images of cats (inputs) that are explicitly labeled "cat" (outputs).
In other words, the machine learns rules from data instead of having them written by hand: the system looks at many examples of input-output pairs to figure out the pattern. To make this work, we need a dataset of labeled examples. A model learns a function $f$ such that $f(x) \approx y$ (mapping an input to an output), and we use optimization algorithms to reduce the error between its output and the correct output. More data, especially high-quality data, usually results in better performance.
Reinforcement Learning
The system learns by interacting with an environment and receiving feedback. To do this, an agent takes actions, such as generating text, and receives a reward based on the quality of that action. The agent then adjusts its behavior to maximize rewards. The system has three key components: the current context, known as the state, the possible outputs, or actions, and a numerical signal that indicates success or failure, which is the reward.
While supervised learning and unsupervised learning algorithms respectively attempt to discover patterns in labeled and unlabeled data, reinforcement learning involves training an agent through interactions with its environment. To learn to maximize rewards from these interactions, the agent makes decisions between trying new actions to learn more about the environment (exploration), or using current knowledge of the environment to take the best action (exploitation).
Data Requirements Comparison
The "cost" of building a system is measured by the type and amount of data needed:
| Method | Data Needed | Performance Guarantee |
|---|---|---|
| Intuition-based Rules | None | Low (No guarantees) |
| Spot-check Rules | Small amount of unlabeled data | Medium (Verified on small samples) |
| Supervised Learning | Large labeled training set | High (Predictable on similar data) |
| Reinforcement Learning | Interactive environment/Reward model | Potentially very high (Optimizes for goals) |
Building a Rule-Based Sentiment Classifier
The Task Setup
- Input ($x$): Raw text of a review (e.g., "I love this movie")
- Output ($y$): A discrete label: $1$ (positive), $0$ (neutral), or $-1$ (negative)
- Dataset ($D$): A collection of $N$ examples $\{(x_i, y_i)\}_{i=1}^N$
Feature-Based Scoring
Here, a "feature" is a measurable property extracted from the input text (for example, whether a particular word appears). Feature-based scoring turns an input into a set of such features and combines them into a single numeric score.
Computers can't process raw text on their own, so it needs to be converted into a numerical format called a feature vector. This vector is a list of numbers that stands for the text, allowing machine learning models to analyze and score the data using math. The linear scoring function defines how this process works: $$s_\theta(x) = \mathbf{w}^\top f(x)$$
- The Feature Vector $f(x)$: This is a column vector of $h$ dimensions derived from the input text $x$. In a sentiment analysis context, such as a movie review, $f(x)$ might represent word counts. For instance, if $f(x) = [1, 0]^\top$, it signifies that a positive indicator like "love" appears once, while a negative indicator like "hate" is absent. This converts "messy" linguistic patterns into a fixed-size numerical representation.
- The Weight Vector $\mathbf{w}$: Also a column vector of size $h$, these weights are parameters learned during the model's training phase. They dictate the importance and direction of each feature. A positive weight (e.g., $w_1 = 5.0$) suggests a feature correlates with a positive outcome, while a negative weight (e.g., $w_2 = -3.0$) indicates the opposite.
- The Output Score $s_\theta(x)$: The final result is a scalar value produced by the dot product of the weight and feature vectors. This is essentially a weighted sum: $$s_\theta(x) = \sum_{i=1}^{h} w_i f_i(x)$$
In our specific example, using $\mathbf{w} = [5.0, -3.0]^\top$ and $f(x) = [1, 0]^\top$, the calculation is $5.0 \times 1 + (-3.0) \times 0 = 5.0$. A higher resulting score indicates a higher probability of a positive classification, forming the backbone of modern text classification techniques.
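As a quick sanity check, the dot-product calculation above can be reproduced in a few lines of Python (the weights and feature values are the illustrative numbers from the text):

```python
# Linear scoring: s(x) = w . f(x), using the example values from the text.
w = [5.0, -3.0]   # weights for the indicators ["love", "hate"]
f_x = [1, 0]      # feature vector: "love" appears once, "hate" is absent

score = sum(w_i * f_i for w_i, f_i in zip(w, f_x))
print(score)  # 5.0
```

The positive score pushes the classification toward the positive class, exactly as in the worked example.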
Making Decisions: The Inference Rule
Once we have the score, we need to map it to a discrete category:
$$ g(x) = \begin{cases} 1 & \text{if } s(x) > 0 \\ 0 & \text{if } s(x) = 0 \\ -1 & \text{if } s(x) < 0 \end{cases} $$
Note: This is a "hard" decision boundary. If the score is even slightly above zero, it's classified as positive.
Every NLP system has three phases. The first phase, modeling (or parameterization), decides the form of the scoring calculation and which parameters are adjustable. The second phase is learning, during which the parameters are given specific values: in rule-based systems, humans set them manually (for example, assigning a weight of $+2$ to the term "excellent"), while machine learning systems learn them from data. The third phase, inference, happens at runtime: it calculates the score for a new, unseen sentence and outputs the corresponding label.
Hard Cases for Rule-Based Systems
Low-Frequency Words & Saliency
Reviewers often use words like "tenuous" or "mucking" that are not found in a basic list of good or bad words. For instance, a reviewer might say a material link is too "tenuous" to anchor, which is a negative comment. The problem is that if "tenuous" is not in the system's dictionary, it will default to a neutral sentiment.
Solution: We can use sentiment dictionaries such as SentiWordNet, which contain thousands of words with pre-assigned sentiment scores.
Conjugation & Morphology
A word can take many forms. If a system only looks for the root "act", it may miss "acted" in a phrase like "entertainingly acted, magnificently shot", even though the phrase is clearly positive.
Solution: We can use lemmatization, which reduces words to their root form, such as "acted" to "act". Another approach is Part-of-Speech (POS) Tagging, which identifies whether a word is an adjective or verb, allowing us to weight it differently.
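As a toy illustration of lemmatization (a real system would use a lemmatizer such as NLTK's or spaCy's; the lemma table and keyword list below are made up):

```python
# Toy lemmatization: map inflected forms back to a root before keyword lookup.
LEMMAS = {"acted": "act", "acting": "act", "acts": "act"}
positive_words = {"act", "magnificent"}   # hypothetical sentiment keyword list

def lemmatize(word: str) -> str:
    # Fall back to the word itself when no lemma is known
    return LEMMAS.get(word, word)

tokens = ["entertainingly", "acted", "magnificently", "shot"]
hits = [t for t in tokens if lemmatize(t) in positive_words]
print(hits)  # ['acted']
```

Without the lemma table, the lookup for "act" would miss "acted" entirely.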
Negation (The "Flip" Logic)
Negation words like "not", "don't", and "never" completely reverse sentiment. For example, the phrase "This one is not nearly as dreadful..." is actually positive, even though "dreadful" alone is negative. The system currently labels this phrase as negative because it sees the word "dreadful" and ignores the preceding "not".
Solution: We can use dependency parsing to detect when a negation modifier is attached to a sentiment word, and then adjust the score accordingly by multiplying it by -1.
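A sketch of the flip logic; for simplicity this uses a fixed-window heuristic rather than a full dependency parse, and the word weights and negator list are invented:

```python
# Negation handling sketch: flip a word's sentiment when a negation token
# appears within the preceding window of tokens.
NEGATORS = {"not", "never", "don't"}
WEIGHTS = {"dreadful": -3.0, "great": 2.0}   # illustrative word weights

def score(tokens, window=3):
    total = 0.0
    for i, tok in enumerate(tokens):
        w = WEIGHTS.get(tok, 0.0)
        # Flip the sign if a negator occurs shortly before this word
        if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            w *= -1
        total += w
    return total

print(score("this one is not nearly as dreadful".split()))  # 3.0
```

The "not" within three tokens of "dreadful" flips its weight from $-3.0$ to $+3.0$, matching the intuition that the review is actually positive.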
Metaphor & Analogy
Sentiment is often implied through comparison rather than direct adjectives. For instance, the phrase "Has all the depth of a wading pool" is negative, implying shallowness. The words "depth" and "pool" are not inherently negative.
This is a challenging case for rules, and it typically requires supervised learning or large language models that understand cultural context and world knowledge.
Multi-Language Support
Rule-based systems are language-dependent, which means English rules won't work for Japanese. For instance, the sentence "見事に視聴者の心を掴む作品でした。" is positive.
Solution: One possible solution is to translate everything to English before classifying, using machine translation. Another approach is to use multilingual embeddings, which represent words from different languages in the same vector space.
Transitioning to Supervised Learning
As we have seen, rule-based systems are brittle, and so far we have set the weights $\mathbf{w}$ by hand. A key shift in NLP is to learn $\mathbf{w}$ from data; this moves us from the rule-based approach to supervised learning.
Task setup for this transition: We now assume access to labeled datasets:
- Training set ($D_{train}$): Used to learn the parameters
- Validation set ($D_{dev}$): Used to tune hyperparameters
- Test set ($D_{test}$): Used to evaluate final performance
Each dataset consists of $(x, y)$ pairs where $x$ is the input text and $y$ is the correct label.
Scoring Function
In supervised learning, classification is based on a compatibility score $s_\theta(x, y) \in \mathbb{R}$, which measures how well a specific output $y$ fits a given input $x$. A higher score indicates that the model believes $y$ is the correct label for $x$. The framework is related to Energy-Based Models (EBMs), where the energy $E(x, y)$ is defined as the negative of the score:
$$E(x, y) = -s_\theta(x, y)$$
Learning is the process of adjusting the model so that correct pairs $(x, y)$ have low energy (a high score) and incorrect pairs have high energy (a low score).

For binary tasks, a single weight vector $\mathbf{w}$ computes a base score $s_\theta(x) = \mathbf{w}^\top f_\phi(x)$, and the label $y$ is factored in to determine compatibility:
$$s_\theta(x, y) = y \cdot s_\theta(x)$$
If the model's score and the actual label share the same sign, the product is positive, indicating a correct prediction; if the signs disagree, the product is negative, indicating an incorrect one.

When dealing with multiple classes, a single vector is not enough. A weight matrix $\mathbf{W} \in \mathbb{R}^{h \times K}$ is used, giving each of the $K$ classes its own dedicated weight vector. The features $f_\phi(x) \in \mathbb{R}^{h \times 1}$ are extracted from the input and multiplied by the weights:
$$s_\theta(x) = \mathbf{W}^\top f_\phi(x)$$
This yields a vector of scores, one per class, and the compatibility score for a specific label $y$ is the value at the $y$-th index of that vector:
$$s_\theta(x, y) = s_\theta(x)[y]$$
The objective during training is to adjust $\mathbf{W}$ so that the score at the index of the true label $y$ is significantly higher than the scores for all other possible classes.
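The binary and multi-class compatibility scores can be sketched in plain Python; the multi-class weights below reuse the illustrative "love"/"hate" values from this section, and the binary weights are invented:

```python
# Compatibility scores. Binary: s(x, y) = y * s(x).
# Multi-class: W^T f(x) gives one score per class.
def base_score(w, f_x):
    return sum(wi * fi for wi, fi in zip(w, f_x))

# Binary case: label y is -1 or +1
w = [2.0, -1.0]                   # illustrative weights
f_x = [1, 1]
s = base_score(w, f_x)            # 1.0
print(s * (+1), s * (-1))         # y = +1 is more compatible

# Multi-class case: one weight column per class (K = 3 here)
W = [[2.4, 1.5, -0.5],            # weights for the word "love"
     [-3.5, -2.0, 3.2]]           # weights for the word "hate"
f_x = [1, 0]                      # "love" present, "hate" absent
scores = [sum(W[i][k] * f_x[i] for i in range(len(f_x))) for k in range(3)]
print(scores)                     # [2.4, 1.5, -0.5]
```

The highest entry of `scores` identifies the most compatible class, which is what training tries to align with the true label.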
Example: Bag of Words (BoW)
The Bag of Words (BoW) model is a straightforward yet powerful feature representation where text is treated as an unordered collection of its constituent words. We define a vocabulary $V$ containing all unique words in a dataset. Each word at position $t$ in a sequence is represented as a one-hot vector $x_t \in \{0,1\}^V$. The document-level feature vector $f(x)$ is constructed by summing these individual vectors: $f(x) = \sum_{t=1}^{T} x_t$. For instance, with a vocabulary of $V=4$ containing {excellent, bad, movie, dull}, the phrase "excellent movie" results in a feature vector $f(x) = [1, 0, 1, 0]^\top$. This process creates a frequency count for each word, but it discards structural information like syntax or word order, which is why it is called the "bag" model.
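The construction above can be sketched directly, using the vocabulary from the example in the text:

```python
# Bag-of-Words feature vector: the sum of one-hot vectors over the tokens.
vocab = ["excellent", "bad", "movie", "dull"]        # V = 4
word_index = {w: i for i, w in enumerate(vocab)}

def bow_features(text):
    f = [0] * len(vocab)
    for token in text.lower().split():
        if token in word_index:                       # out-of-vocabulary words are ignored
            f[word_index[token]] += 1
    return f

print(bow_features("excellent movie"))  # [1, 0, 1, 0]
```

Note that word order is already gone at this point: any permutation of the same tokens yields the same vector.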
Therefore, when we apply the scoring function $s(x) = \mathbf{w}^\top f(x)$, the calculation expands into a weighted sum where each word contributes independently: $s(x) = \sum_{i=1}^{V} w_i \cdot f_i(x)$. In a binary classification task like sentiment analysis, the parameters in the weight vector $\mathbf{w}$ represent the "sentimental polarities" of specific words. For instance, if our model learns that "excellent" has a weight of $+2.5$ and "dull" has a weight of $-1.8$, the appearance of these words directly shifts the final score toward the positive or negative class. If a review says "excellent movie," and the weight for "movie" is a neutral $0.1$, the total score $s(x) = (2.5 \times 1) + (0.1 \times 1) = 2.6$ would strongly suggest a positive classification. This demonstrates how BoW reduces complex linguistic nuance to a simple linear combination of independent word influences.
Example:
| Word | Weight |
|---|---|
| love | +2.4 |
| hate | -3.5 |
In the multi-class case, $\mathbf{W} \in \mathbb{R}^{V \times K}$: each word has $K$ weights (one per class).
Example:
| Word | positive | neutral | negative |
|---|---|---|---|
| love | 2.4 | 1.5 | -0.5 |
| hate | -3.5 | -2.0 | 3.2 |
Inference
In machine learning, inference is the phase where a trained model is used to predict the output for new, unseen input data. Unlike the training phase, where parameters ($\theta$ or $\mathbf{w}$) are adjusted to minimize error, inference keeps the parameters fixed and focuses purely on calculation.
Once we have a scoring mechanism, the final step in the supervised learning pipeline is inference, where the model makes a concrete prediction $\hat{y}$ based on the calculated scores. This is formally expressed as finding the label $y$ that maximizes the compatibility score: $$\hat{y} = \arg\max_{y \in \mathcal{Y}} s_\theta(x, y)$$ In a multi-class scenario, the model evaluates every possible label, such as "Cat", "Dog", or "Bird", and selects the one with the highest numerical score.
In binary classification, where $y$ can be either -1 or +1, the logic is simpler. Since the compatibility score is defined as $s_\theta(x, y) = y \cdot s_\theta(x)$, the value is maximized when the sign of the label matches the sign of the base score. If $s_\theta(x)$ is positive, choosing $y = +1$ yields a positive result, while $y = -1$ yields a negative one. The prediction rule then becomes a simple sign check: $$\hat{y} = \text{sign}(s_\theta(x))$$ Linear models are prevalent because they are efficient. At test time, the computer only needs to compute a dot product and check if the result is greater than or less than zero. For instance, if a sentiment model processes a review and outputs a raw score of $-2.4$, the $\text{sign}$ function maps this to $-1$, predicting a negative review without further computation.
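A minimal sketch of both inference rules (the scores are the illustrative values used in this section):

```python
# Inference: argmax over class scores in the multi-class case, which
# reduces to a simple sign check in the binary case since s(x, y) = y * s(x).
def predict_binary(base_score):
    return 1 if base_score > 0 else -1

def predict_multiclass(scores):              # scores: one entry per class
    return max(range(len(scores)), key=lambda k: scores[k])

print(predict_binary(-2.4))                  # -1: the negative-review example
print(predict_multiclass([2.4, 1.5, -0.5])) # 0: the first class wins
```

At test time this is all the computation required: a dot product followed by a comparison.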
Learning the Parameters
Once the model architecture and scoring functions are defined, the focus shifts to parameter learning. This process involves adjusting the weights $\mathbf{w}$ (or $\theta$) so the model makes correct predictions. A loss function $\mathcal{L}(\theta, D)$ is used to measure the model's performance. This loss function is a mathematical yardstick that quantifies error across a dataset $D$. It is calculated as the sum of individual errors $L(x, y, \theta)$ for every training example. $$\mathcal{L}(\theta, D) = \sum_{(x,y) \in D} L(x, y, \theta)$$ The ultimate goal of training is the optimization objective: finding the specific parameters $\theta$ that minimize this total error on the training data: $$\min_{\theta} \mathcal{L}(\theta, D_{train})$$ The model refines its weights by minimizing this objective, which lowers the energy for correct pairs and raises it for incorrect ones. This process turns the task of learning into a formal optimization problem. The model typically uses algorithms like Gradient Descent to solve this problem, and this process transforms raw data into a functional predictive system.
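The optimization loop can be sketched end to end. The text leaves the specific loss open, so this sketch uses squared error purely for illustration, with plain gradient descent on a tiny made-up dataset:

```python
# Optimization sketch: total loss = sum of per-example losses, minimized by
# gradient descent. Squared error and the toy data are illustrative choices.
def predict(w, f_x):
    return sum(wi * fi for wi, fi in zip(w, f_x))

def total_loss(w, data):
    return sum((predict(w, f_x) - y) ** 2 for f_x, y in data)

data = [([1, 0], 1.0), ([0, 1], -1.0)]    # toy (f(x), y) pairs
w = [0.0, 0.0]
lr = 0.1
for _ in range(100):                       # gradient descent steps
    grad = [0.0, 0.0]
    for f_x, y in data:
        err = predict(w, f_x) - y
        for i in range(len(w)):
            grad[i] += 2 * err * f_x[i]    # d/dw_i of (w.f(x) - y)^2
    w = [wi - lr * gi for wi, gi in zip(w, grad)]

print([round(wi, 3) for wi in w])          # converges to [1.0, -1.0]
```

The loss shrinks toward zero as the weights approach the values that reproduce the labels, which is exactly the "lower the energy for correct pairs" behavior described above.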
Structured Perceptron
The Structured Perceptron algorithm offers a classic, discrete approach to solving this optimization problem without requiring complex calculus. It relies on a simple additive update rule based on individual mistakes. For every training example $(x, y)$, the model first performs inference to find the current predicted label: $$\hat{y} = \arg\max_{y'} s_\mathbf{w}(x, y')$$ The logic of the update is straightforward:
- If correct ($\hat{y} = y$): The model is already compatible with the truth. So we do nothing.
- If wrong ($\hat{y} \neq y$): Trigger a weight update using the rule $\mathbf{w} \leftarrow \mathbf{w} + y \cdot f(x)$.
```python
# Perceptron training loop over the dataset D (labels y in {-1, +1})
for x, y in D:
    y_pred = model.predict(x)    # inference with the current weights
    if y_pred != y:              # mistake: move w toward y * f(x)
        w = w + y * f(x)
```
This rule has a geometric interpretation. When we add the feature vector $f(x)$ of a positive example ($y = +1$) to the weights, the weight vector $\mathbf{w}$ becomes more aligned with those features, which increases the score for that correct pair. If $y = -1$, we subtract the features, decreasing the score. As this process is repeated, correct pairs get higher scores, or lower energy, and incorrect predictions are penalized.
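Putting the update rule together with Bag-of-Words features gives a small, runnable perceptron; the vocabulary and training sentences below are invented for illustration:

```python
# Structured perceptron on a toy binary sentiment task.
vocab = ["love", "great", "hate", "boring"]
index = {w: i for i, w in enumerate(vocab)}

def f(text):                               # Bag-of-Words features
    v = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in index:
            v[index[tok]] += 1
    return v

D = [("love this great movie", 1), ("hate this boring movie", -1),
     ("great and i love it", 1), ("boring i hate it", -1)]

w = [0.0] * len(vocab)
for _ in range(5):                         # a few epochs over the data
    for x, y in D:
        fx = f(x)
        score = sum(wi * fi for wi, fi in zip(w, fx))
        y_pred = 1 if score > 0 else -1    # inference by sign
        if y_pred != y:                    # mistake-driven update
            w = [wi + y * fi for wi, fi in zip(w, fx)]

# After training, every training example is classified correctly
preds = [1 if sum(wi * fi for wi, fi in zip(w, f(x))) > 0 else -1 for x, _ in D]
print(preds)  # [1, -1, 1, -1]
```

On this linearly separable toy set, a single mistake on the first example is enough to push the weights into a region where all four sentences are labeled correctly.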
Limitations of BoW
The Bag of Words method is simple, but it has serious limitations. For one, it treats different forms of the same word, like "love" and "loved", as completely different words with no shared meaning. This is a problem because words like "love" and "adore" have similar sentiments, but are given no shared representation. The order of words also poses a problem. In a sentence like "not good", the Bag of Words method treats it as just "good", which completely changes the meaning. Another issue is that it cannot capture the contrast between different parts of a sentence, such as "interesting but boring".
Neural Networks
The transition from traditional models to modern deep learning is a fundamental shift in how machines understand language. In the legacy Bag of Words (BoW) approach, the feature function $f(x)$ is a static, manually designed count of words. This function is fixed, so the model can only learn the weights $\mathbf{w}$ in the linear scoring function $s_\theta(x) = \mathbf{w}^\top f(x)$. The model has a significant limitation: it cannot tell the difference between "The dog bit the man" and "The man bit the dog" because they have the same word count. The BoW approach also does not capture semantic relationships, like the similarity between "happy" and "joyful", or contextual shifts, such as the effect of "not" on the meaning of "good". The model ignores word order, syntax, and nuanced interactions, which are important aspects of language.
Modern deep learning resolves these limitations by introducing Representation Learning, where the feature extractor itself becomes a learned function $f_\phi(x)$. By replacing the static $f(x)$ with a neural network containing internal parameters $\phi$, the scoring function evolves into $s_\theta(x) = \mathbf{w}^\top f_\phi(x)$. Now, the optimization process adjusts both the scoring weights $\mathbf{w}$ and the feature-extraction parameters $\phi$. This allows the machine to move beyond human-defined "feature engineering" and instead discover for itself which complex patterns and relationships in the data are most relevant for the task at hand.
This new approach follows a sophisticated Deep Learning Pipeline:
- Embeddings: Raw text is first mapped into dense, high-dimensional vectors that capture initial semantic meaning.
- Neural Processing: Architectures like RNNs (for sequences) or Transformers (for global context) process these embeddings while respecting word order and structural hierarchy.
- Feature Vector: The network compresses this information into a final, learned numerical representation $f(x)$.
- Scoring: This vector is multiplied by the weights to produce a final compatibility score $s(x)$.
The challenge is now about designing the architecture. Researchers face critical design questions, such as choosing the right model, like CNNs or Transformers, and selecting an optimization algorithm, such as SGD or Adam. They also need to consider the overall scale of the network. These choices are key to creating systems that can understand human context and intent, going beyond just counting words.
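To make the idea of a learned feature extractor concrete, here is a toy sketch in plain Python: $f_\phi$ is a single tanh layer with made-up, fixed parameters, and the score is $\mathbf{w}^\top f_\phi(x)$. In a real system both $\phi$ and $\mathbf{w}$ would be trained jointly by gradient descent.

```python
import math

# Feature-extractor parameters phi (a 2x2 matrix, values invented) and
# scoring weights w. Only the *shapes* matter for the illustration.
phi = [[0.5, -0.2],
       [0.1, 0.8]]
w = [1.0, -1.0]

def f_phi(x):
    # Learned features: tanh(phi @ x), one nonlinearity per hidden unit
    return [math.tanh(sum(p * xi for p, xi in zip(row, x))) for row in phi]

def score(x):
    # Same linear scoring as before, but over *learned* features
    return sum(wi * hi for wi, hi in zip(w, f_phi(x)))

print(round(score([1.0, 0.0]), 4))
```

The pipeline is unchanged from the linear model; what changed is that the feature function now has its own parameters that gradient descent can adjust.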
Beyond Classification
We previously focused on a small, discrete set of labels, but now we generalize the output space $y$ to include a wide range of possibilities, such as a translated sentence, a specific response, or a robotic action. This transition shows that modern NLP and AI are based on a unified framework. The process of completing a task, regardless of its complexity, involves three steps: parameterization, which is choosing the form of the scoring function $s_\theta(x, y)$, learning, which uses labeled data, rewards, or feedback to tune $\theta$, and inference, which predicts the output $\hat{y}$ by searching for the most compatible $y$ or sampling from a distribution.
Softmax
To make numerical scores useful for decision-making, we use the Softmax function. This function turns raw scores into a valid probability distribution, which we denote as $p_\theta(y|x)$. The process involves exponentiating the scores, making them all positive. We then normalize these values by dividing them by the sum of all possible exponents, which ensures that the total probability is 1. $$p_\theta(y|x) = \frac{\exp(s_\theta(x,y))}{\sum_{y'} \exp(s_\theta(x,y'))}$$ This transformation is critical because higher scores lead to much higher probabilities, while lower scores are suppressed. If a model processes the input "I hate this movie," the softmax function might greatly reduce the neutral and positive scores, resulting in a $0.98$ probability for the negative class. The model can also sample from this probability distribution, which is expressed as $\hat{y} \sim p_\theta(y|x)$. This introduces diversity and stochastic behavior into AI responses, since the model is not always picking the single highest score.
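A minimal softmax implementation (the raw class scores below are invented; subtracting the maximum before exponentiating is a standard numerical-stability trick):

```python
import math

def softmax(scores):
    m = max(scores)                          # stability: shift scores by the max
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw scores for the classes [negative, neutral, positive]
probs = softmax([4.0, 0.0, -1.0])
print([round(p, 3) for p in probs])          # [0.976, 0.018, 0.007]
```

The largest score dominates after exponentiation, while the probabilities still sum to 1, so the model can either pick the argmax or sample $\hat{y} \sim p_\theta(y|x)$.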
The Evolution: Ranking, Generation, and Actions
This unified view enables us to apply the same fundamental scoring logic to different domains. In ranking, used in search engines, we order candidates $y_1, y_2, \dots$ so that $s(x, y_1) > s(x, y_2) > \dots$. For generation, the output $y$ is a combinatorial sequence: Large Language Models (LLMs) rely on repeated sampling, where the model predicts the next token $y_t$ based on the input $x$ and the previously generated tokens $y_{<t}$, i.e. $y_t \sim p_\theta(y_t \mid x, y_{<t})$. For actions, the same idea applies: given the current environment state $x$, an agent scores the possible actions $y$ and selects or samples the one to take.
Every system in modern NLP, whether it's a simple spam filter or a sophisticated LLM, relies on a single conceptual idea: learning a scoring function and deciding how to use it. This is essentially what the field of NLP is about. Models differ in terms of the complexity of their scoring functions, how their parameters are optimized, and how they search for the best result. Understanding this concept is fundamental to machine learning and artificial intelligence. It provides a basis for everything that follows in these fields.