Let me add one nice explanation of relative entropy that I didn’t have time
to include in the class. I learned it from a MathOverflow answer by
Ashok to a question by John Baez.

Suppose we’re in the presence of some unknown probability distribution on
a finite set $\Xi$. We’re able to observe the results of drawing from this
distribution several times; call these results $x_1, \ldots, x_n \in \Xi$.

*First question:* what would we guess is the unknown distribution?
When the question is posed in that generality, the only reasonable answer
is that it’s exactly the distribution we observed. This is called the
**empirical distribution** of the data. Formally, it’s the linear
combination

$\hat{p} = \frac{1}{n} \Bigl( \delta_{x_1} + \cdots + \delta_{x_n} \Bigr)$

where $\delta_x$ means the point mass (point/Dirac measure) at $x$. So if
$x_1, \ldots, x_n$ are all distinct, the empirical distribution $\hat{p}$
returns each of the $x_i$ with probability $1/n$. In general, it returns $x
\in \Xi$ with probability

$\frac{1}{n} \bigl|\bigl\{ i \in \{1, \ldots, n\} \colon x_i = x\bigr\}\bigr|.$
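As a quick illustration, here's a minimal Python sketch (the sample data is invented) that builds the empirical distribution by counting occurrences:

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(samples):
    """Return the empirical distribution p-hat of the observed samples,
    as a dict mapping each outcome x to the fraction of draws equal to x."""
    n = len(samples)
    return {x: Fraction(c, n) for x, c in Counter(samples).items()}

# Six draws from an unknown distribution on Xi = {'a', 'b', 'c'}:
draws = ['a', 'b', 'a', 'c', 'a', 'b']
print(empirical_distribution(draws))
# {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}
```

Using `Fraction` keeps the probabilities exact, which matches the formula above: each outcome $x$ gets probability (number of $i$ with $x_i = x$)$/n$.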

*Second, more subtle, question:* let’s now make the assumption that
the unknown distribution belongs to a certain known family $(p_\theta)_{\theta
\in \Theta}$ of distributions on $\Xi$. Based on the data observed, which $p_\theta$ would we guess is the
unknown distribution?

This is a classic question of statistical inference, and one way of
answering is to use the maximum likelihood method. This means choosing
$\theta$ to maximize the probability of observing the data that was
observed. Under the distribution $p_\theta$, that probability is

$p_\theta(x_1) p_\theta(x_2) \cdots p_\theta(x_n).$

So, we choose the value of $\theta$ that maximizes this probability.
(Let’s assume for simplicity that there’s exactly one such $\theta$). When
we’re thinking of $x_1, \ldots, x_n$ as fixed and $\theta$ as variable,
this probability is referred to as the **likelihood** of $\theta$.
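To make this concrete, here's a small Python sketch. The one-parameter family and the data are invented for illustration: $\Xi = \{0, 1\}$ with $p_\theta(1) = \theta$. Since $\log$ is increasing, maximizing the likelihood is the same as maximizing the log-likelihood, which is numerically better behaved:

```python
import math

# A hypothetical one-parameter family on the two-point set Xi = {0, 1}:
# p_theta(1) = theta and p_theta(0) = 1 - theta, for theta in (0, 1).
def p(theta, x):
    return theta if x == 1 else 1 - theta

def log_likelihood(theta, samples):
    # log(p_theta(x_1) * ... * p_theta(x_n)), computed as a sum of logs
    return sum(math.log(p(theta, x)) for x in samples)

samples = [1, 1, 0, 1, 0]                # invented data: three 1s, two 0s
grid = [i / 100 for i in range(1, 100)]  # theta = 0.01, 0.02, ..., 0.99
theta_mle = max(grid, key=lambda t: log_likelihood(t, samples))
print(theta_mle)  # 0.6, the empirical frequency of 1s in the data
```

A grid search stands in for whatever optimization the family actually calls for; the point is only that the maximizer recovers the empirical frequency.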

But we can also think geometrically. Relative entropy isn’t a metric, but
it behaves *something* like it, so we might wish to choose $\theta$ to
minimize the “distance” between $p_\theta$ and the observed/empirical
distribution $\hat{p}$. As $\theta$ moves within $\Theta$, think of
$p_\theta$ as defining a curve in the space of all distributions on $\Xi$. We’re looking for the point on the curve closest to $\hat{p}$.

More exactly, we want to minimize the entropy of $\hat{p}$ relative to
$p_\theta$. So, choose $\theta$ to minimize $H(\hat{p} \| p_\theta)$.
Call this the “minimal entropy method”.
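Here's a Python sketch of the minimal entropy method for an invented one-parameter family on a two-point set, using the usual convention $0 \log 0 = 0$:

```python
import math

def relative_entropy(p_hat, q):
    # H(p_hat || q) = sum over x of p_hat(x) * log(p_hat(x) / q(x)),
    # with the convention 0 * log 0 = 0 (terms with p_hat(x) = 0 drop out).
    return sum(px * math.log(px / q[x]) for x, px in p_hat.items() if px > 0)

# A hypothetical family on Xi = {0, 1}: p_theta = {0: 1 - theta, 1: theta}.
samples = [1, 1, 0, 1, 0]  # invented data: three 1s, two 0s
n = len(samples)
p_hat = {0: samples.count(0) / n, 1: samples.count(1) / n}

grid = [i / 100 for i in range(1, 100)]
theta_best = min(grid, key=lambda t: relative_entropy(p_hat, {0: 1 - t, 1: t}))
print(theta_best)  # 0.6: the same theta that maximum likelihood picks
```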

The punchline is this:

*The maximum likelihood method and the minimal entropy method are
equivalent*.

Indeed, since each $x \in \Xi$ occurs exactly $n\hat{p}(x)$ times among
$x_1, \ldots, x_n$, a routine calculation shows that

$\log\Bigl( p_\theta(x_1) \cdots p_\theta(x_n) \Bigr)
=
n \sum_{x \in \Xi} \hat{p}(x) \log p_\theta(x)
=
-n \Bigl( H(\hat{p} \| p_\theta) + H(\hat{p}) \Bigr),$

where $H(\hat{p})$ denotes the ordinary (Shannon) entropy of $\hat{p}$.

Since the left-hand side is an increasing function of the likelihood
$p_\theta(x_1) \cdots p_\theta(x_n)$, and the right-hand side is a
decreasing function of the relative entropy $H(\hat{p} \| p_\theta)$,
choosing $\theta$ to maximize the likelihood is equivalent to choosing
$\theta$ to minimize the relative entropy.
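For the skeptical, the identity is also easy to check numerically. This Python sketch uses natural logarithms throughout and an invented two-point family, comparing the two sides directly:

```python
import math
from collections import Counter

def entropy(p):
    # Shannon entropy H(p) = -sum of p(x) * log p(x), natural log
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def relative_entropy(p, q):
    # H(p || q) = sum of p(x) * log(p(x) / q(x))
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

samples = [1, 1, 0, 1, 0]  # invented data
n = len(samples)
p_hat = {x: c / n for x, c in Counter(samples).items()}

theta = 0.3                # any theta in (0, 1) will do
p_theta = {0: 1 - theta, 1: theta}

lhs = sum(math.log(p_theta[x]) for x in samples)  # log-likelihood
rhs = -n * (relative_entropy(p_hat, p_theta) + entropy(p_hat))
print(abs(lhs - rhs) < 1e-12)  # True
```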
