The Energy Function

Have you ever tried to fit a square peg into a round hole? It feels wrong, awkward, and requires a lot of effort or energy.

In machine learning, an energy function turns this exact intuition into math. It assigns a numerical value representing how compatible different pieces of information are with one another.

The Formal Definition

Formally, we define an energy function like this: $$e : X \times Z \times \Theta \rightarrow \mathbb{R}$$

  • The function takes three inputs ($X$, $Z$, and $\Theta$).
  • It returns a single real number ($\mathbb{R}$), which we call the energy.

Breakdown:

1. Observed Data ($x \in X$)

  • $X$ represents the set of all data we can actually observe.
  • Examples: An image of a cat, a spoken sentence, or a row in a spreadsheet.

2. Latent Variables ($z \in Z$)

  • $Z$ represents hidden or "latent" variables.
  • These are things that exist but we cannot observe directly. We have to guess or infer them.
  • Examples: The hidden underlying topic of a document, the cluster a data point belongs to, or the internal mental state of a system.

3. Model Parameters ($\theta \in \Theta$)

  • $\theta$ (theta) represents the internal settings of our machine learning model.
  • These parameters act like the "dials and knobs" that control how the model behaves.
  • Examples: The weights in a neural network, the coefficients of a regression model, or the cluster centers in K-means.

The Output: Energy

When we plug these three things into our function $e(x,z,\theta)$ we get a real number. This number is the energy.
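As a concrete (and entirely toy) sketch, here is what an energy function can look like in code. The linear model, the variable names, and the numbers below are illustrative assumptions, not a standard definition:

```python
def energy(x, z, theta):
    """Toy energy: squared error of a line with slope theta and a
    hidden offset z explaining the observation x = (input, output).
    Low energy means the pieces fit together well."""
    inp, out = x
    return (out - (theta * inp + z)) ** 2

# A compatible combination: the point (1, 2) lies on y = 2x + 0
low = energy((1.0, 2.0), 0.0, 2.0)    # energy 0.0
# An incompatible one: same point, but a hidden offset of 5
high = energy((1.0, 2.0), 5.0, 2.0)   # energy 25.0
```

Plugging in data, a latent value, and parameters yields one real number; nothing more is required of an energy function.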


What Does Energy Actually Mean?

At its core, energy measures compatibility or mismatch.

  • Lower energy $\rightarrow$ Better compatibility (the data, hidden variables, and model agree nicely).
  • Higher energy $\rightarrow$ Worse compatibility (there is friction, mismatch, or contradiction).

Therefore, a good machine learning model should assign low energy to combinations of $(x, z)$ that fit together well, and high energy to those that do not.

This concept is heavily inspired by statistical physics, where systems tend to settle into states of minimal energy (like a ball rolling down a hill to rest at the lowest point). Machine learning borrows this elegant idea!


Latent Variables

The variable $z$ represents hidden structure. Because we can only observe $x$ and not $z$, our model can't just compute a single, definitive energy value for a given observation.

Instead, there are multiple possible energy values depending on what the hidden element $z$ might be. This creates uncertainty.

To handle this uncertainty mathematically, we calculate two important metrics:

1. Expected Energy (Mean)

The expected energy averages out our uncertainty over all possible hidden states. It is defined as: $$e_{\mu}(x,\theta) = \mathbb{E}[e(x,z,\theta)]$$

To calculate this, we take the weighted average over every possible latent variable: $$e_{\mu}(x,\theta) = \sum_{z\in Z} p(z) e(x,z,\theta)$$

What is happening here? For every possible hidden configuration $z$, we calculate its energy $e(x,z,\theta)$ and multiply it by how probable that state is, $p(z)$. Then, we sum them all up. This tells us the average energy of our observation $x$, factoring in our uncertainty about the unobserved $z$.

2. Energy Variance

Variance measures how much the energy swings depending on $z$: $$e_v(x,\theta) = \mathbb{E}[(e(x,z,\theta) - e_{\mu}(x,\theta))^2]$$

  • Small variance: The energy is stable. No matter what the hidden variable $z$ turns out to be, the overall compatibility remains roughly the same.
  • Large variance: The energy heavily depends on which $z$ is chosen. This gives insight into how much uncertainty the model has when interpreting the input $x$.
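Both quantities can be computed directly for a small discrete toy model. The scalar energy, the two latent states, and the uniform prior $p(z)$ below are assumptions made purely for illustration:

```python
import numpy as np

def energy(x, z, theta):
    """Toy scalar energy: squared distance to the parameter theta[z]."""
    return float((x - theta[z]) ** 2)

theta = np.array([0.0, 4.0])   # one parameter per latent state
p_z   = np.array([0.5, 0.5])   # prior probability of each latent state
x = 1.0

# Energy of the observation under each possible latent state
energies = np.array([energy(x, z, theta) for z in range(2)])

e_mu = float(np.sum(p_z * energies))                # expected energy
e_v  = float(np.sum(p_z * (energies - e_mu) ** 2))  # energy variance
```

Here the two latent states give energies 1 and 9, so the expected energy is 5 and the variance is 16: the model's interpretation of $x$ depends strongly on which hidden state is true.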

The Insight

Different machine learning tasks simply correspond to minimizing the exact same energy function, but with respect to different variables!

The type of problem you are solving simply depends on which variable is unknown. Let's look at a few examples.

Supervised Learning

Imagine our observation consists of two parts: an input $x$ (e.g., an image) and an output $y$ (e.g., the label "dog"). Let's assume there are no sneaky hidden variables $z$.

Our energy function becomes: $$e([x,y], \emptyset, \theta)$$ (The $\emptyset$ just means the latent variable $z$ is empty here).

Given a brand-new input $x'$, we want the model to predict the output $\hat{y}$. How do we do it? We find the $y$ that results in the lowest possible energy: $$\hat{y} = \arg\min_{y \in Y} e([x',y], \emptyset, \theta)$$

The Intuition: The model searches through all possible outputs and picks the one that is most compatible (lowest energy) with the input $x'$.

  • If $y$ is a discrete category (like "cat" or "dog"), this is called Classification.
  • If $y$ is a continuous number (like predicting house prices), this is called Regression.
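A minimal sketch of this argmin for classification, assuming a toy linear-score energy (the weight matrix and the input are made up for illustration):

```python
import numpy as np

def energy_xy(x, y, theta):
    """Energy of pairing input x with label y: the negative score of
    a toy linear classifier, so low energy = compatible pair."""
    return float(-(theta[y] @ x))

theta = np.array([[ 1.0, -1.0],    # weight vector for class 0
                  [-1.0,  1.0]])   # weight vector for class 1
x_new = np.array([2.0, 0.5])

# Inference: evaluate every candidate label, keep the lowest energy
y_hat = int(np.argmin([energy_xy(x_new, y, theta) for y in (0, 1)]))
```

For a discrete label set this argmin is a simple search; for regression the same argmin runs over a continuum and is typically solved analytically or by gradient descent.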

Clustering

Now let's change the scenario. Suppose $x$ is a data point, and $z$ represents the cluster that data point belongs to. We observe the data point $x$, but the cluster assignment $z$ is hidden from us.

To figure out which cluster $x$ belongs to, we solve: $$\hat{z} = \arg\min_{z \in Z} e(x,z,\theta)$$

The Intuition: We evaluate every possible cluster and assign $x$ to the one that gives the lowest energy (maximum compatibility). For instance, the famous K-Means algorithm is just minimizing an energy function that measures the geometric distance to cluster centers.
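Sketched with a K-means-style squared-distance energy, the cluster-assignment step looks like this (the centers and the data point are illustrative):

```python
import numpy as np

def cluster_energy(x, z, centers):
    """K-means-style energy: squared distance to the z-th center."""
    return float(np.sum((x - centers[z]) ** 2))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])  # theta = cluster centers
x = np.array([9.0, 11.0])

# Inference of the hidden cluster assignment: argmin over z
z_hat = int(np.argmin([cluster_energy(x, z, centers)
                       for z in range(len(centers))]))
```

The point lands in the cluster whose center gives the lowest energy, i.e. the nearest one.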

Representation Learning

What if $z$ isn't a simple distinct cluster, but instead a continuous hidden representation (like a vector of numbers)?

Even then the formula stays exactly the same! Instead of picking a discrete cluster, we are inferring a continuous latent space. You see this principle in action within Autoencoders, Factor Analysis, and Latent Variable Models.
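For a continuous $z$, inference becomes an optimization over the latent space rather than a search over discrete options. A gradient-descent sketch, assuming a toy linear "decoder" $W$ and a squared reconstruction energy (all values are illustrative):

```python
import numpy as np

def energy(x, z, W):
    """Autoencoder-flavored reconstruction energy: how badly the
    linear decoder W reconstructs x from the latent code z."""
    return float(np.sum((x - W @ z) ** 2))

W = np.array([[1.0, 0.0],
              [0.0, 2.0]])   # toy decoder parameters (theta)
x = np.array([3.0, 4.0])

# Infer the continuous latent code by gradient descent on the energy
z, lr = np.zeros(2), 0.1
for _ in range(200):
    grad = -2.0 * W.T @ (x - W @ z)   # d energy / d z
    z -= lr * grad
```

The loop converges to the latent code $z \approx (3, 2)$ that reconstructs $x$ exactly; in a real autoencoder the encoder network is trained to output this minimizer in one shot.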


Inference = Minimizing Energy

Across all these examples, the guiding principle is identical:

Given partial information, infer the missing variables by finding the state with the lowest energy.

This process of finding the missing variable that best explains our observation is what we call Inference in machine learning.


The Learning Problem: How Do We Find $\theta$?

Up until now, we've assumed that the model parameters ($\theta$) were already perfectly tuned. But the whole point of learning is to figure out those parameters from data!

A naïve approach might be: "Just minimize the average energy for all our training data!" $$\min_{\theta \in \Theta} \mathbb{E}_{x \sim p_{data}} [e(x,\emptyset,\theta)]$$

The Meaning: Tweak the model's dials ($\theta$) so it assigns low energy to the real data we've observed.

The Problem: Naïve Approach

If we only tell the model to lower the energy for our training data, it might "cheat." It could easily adjust $\theta$ so that it assigns low energy to literally everything in the universe. If everything has low energy, the model can no longer distinguish between good, realistic data and absolute garbage data.

To fix this, we must also ensure that undesirable, fake, or incorrect observations receive HIGH energy.

The Solution: Regularization

To prevent the model from cheating, we add a penalty called a regularizer: $$\min_{\theta \in \Theta} \mathbb{E}_{x \sim p_{data}} [e(x,\emptyset,\theta)] + R(\theta)$$ $R(\theta)$ is a regularization function added to the objective. Things like L2 weight penalties, sparsity constraints, or margin constraints are common choices.

Regularization forces the model to be honest. It ensures the model fits the real data well, but strongly prevents it from collapsing into a lazy, trivial solution where "everything has low energy."
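A sketch of this regularized learning objective for a one-parameter model, using an L2 penalty as $R(\theta)$ (the data-generating slope of 2, the penalty strength, and the learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from y ≈ 2x; theta is the slope we want to learn
xs = rng.normal(size=100)
ys = 2.0 * xs + 0.1 * rng.normal(size=100)

lam = 0.01  # regularization strength (assumed)

def objective(theta):
    """Average energy over the data plus an L2 regularizer R(theta)."""
    energies = (ys - theta * xs) ** 2           # per-example energy
    return float(np.mean(energies) + lam * theta ** 2)

# Learn theta by gradient descent on the regularized objective
theta, lr = 0.0, 0.1
for _ in range(500):
    grad = float(np.mean(-2 * xs * (ys - theta * xs)) + 2 * lam * theta)
    theta -= lr * grad
```

The learned slope lands close to 2, pulled very slightly toward 0 by the penalty; that small bias is the price paid for ruling out trivial solutions.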

(Note: In fully energy-based models, ensuring bad data gets high energy is often handled directly by contrasting real data against negatively sampled data.)


Adding Latent Variables to the Mix

When our data has hidden structures ($z$), learning becomes a dual-challenge. We have to solve two problems at the exact same time:

  1. Infer the hidden variables ($z$).
  2. Optimize the model's parameters ($\theta$).

This is common in mixture models or factor analysis. Algorithms like Expectation-Maximization (EM) are specifically designed to handle this delicate balancing act: taking turns guessing the hidden variables based on the current parameters, and then updating the parameters based on those guesses.
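A toy version of this alternating scheme for two 1-D clusters, using hard assignments as in K-means (the data points and starting centers are made up for illustration):

```python
import numpy as np

# Toy 1-D data drawn around two hidden cluster centers (0 and 5)
x = np.array([0.1, -0.2, 0.0, 5.1, 4.9, 5.0])

theta = np.array([-1.0, 6.0])   # initial guesses for the centers

# Alternate: (1) infer z given theta, (2) update theta given z
for _ in range(10):
    # Step 1: infer each point's cluster = argmin_z (x_i - theta_z)^2
    z = np.argmin((x[:, None] - theta[None, :]) ** 2, axis=1)
    # Step 2: re-fit each center as the mean of its assigned points
    for k in range(2):
        if np.any(z == k):
            theta[k] = x[z == k].mean()
```

After a few rounds the centers settle near 0 and 5: inference and learning, each easy on its own, bootstrap one another toward a joint low-energy state.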


Summary

From an energy perspective, almost every machine learning problem boils down to three core steps:

  1. Model Definition: Define the energy function $e(x,z,\theta)$. This establishes the architectural shape and rules of your model.
  2. Learning: Estimate the best dials (parameters) from your training data by minimizing the objective. $$\theta^* = \arg\min_\theta (\text{objective})$$
  3. Inference: Use the tuned model in the real world. Given some new partial observation, guess the missing piece by minimizing the energy.

The energy framework is beautiful because it acts as a universal translator for machine learning. It unifies dozens of seemingly unrelated methods under one elegant mathematical umbrella.

| Task | What minimizes the energy? |
| --- | --- |
| Classification | Energy over discrete categories/outputs |
| Regression | Energy over continuous numbers |
| Clustering | Energy over cluster assignments |
| Representation Learning | Energy over continuous latent vectors |

The same mathematical engine powers all these paradigms. The only things that truly change from one algorithm to the next are how the energy function is defined and which variables we minimize it over.

If you have made it this far, I have put together a set of questions to check your understanding. It is linked here: Link
