## February 22, 2017

### Functional Equations III: Explaining Relative Entropy

#### Posted by Tom Leinster

Much of this functional equations course is about entropy and its cousins, such as means, norms, and measures of diversity. So I thought it worth spending one session talking purely about ways of understanding entropy, without actually proving anything about it. I wanted especially to explain how to think about relative entropy — also known as relative information, information gain, and Kullback-Leibler divergence.

My strategy was to do this via coding theory. Information is a slippery concept, and reasoning about it takes some practice. But putting everything in the framework of coding makes everything more concrete. The central point is:

The entropy of a distribution is the mean number of bits per symbols in an optimal encoding.

All this and more is in the course notes. The part we did today starts on page 11.

Next week: relative entropy is the only quantity that satisfies a couple of reasonable properties.

Posted at February 22, 2017 12:13 AM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2941

## 1 Comment & 0 Trackbacks

### Re: Functional Equations III: Explaining Relative Entropy

Let me add one nice explanation of relative entropy that I didn’t have time to include in the class. I learned it from a MathOverflow answer by Ashok to a question by John Baez.

Suppose we’re in the presence of some unknown probability distribution on a finite set $\Xi$. We’re able to observe the results of drawing from this distribution several times; call these results $x_1, \ldots, x_n \in \Xi$.

First question:   what would we guess is the unknown distribution? When the question is posed in that generality, the only reasonable answer is that it’s exactly the distribution we observed. This is called the empirical distribution of the data. Formally, it’s the linear combination

$\hat{p} = \frac{1}{n} \Bigl( \delta_{x_1} + \cdots + \delta_{x_n} \Bigr)$

where $\delta_x$ means the point mass (point/Dirac measure) at $x$. So if $x_1, \ldots, x_n$ are all distinct, the empirical distribution $\hat{p}$ returns each of $x_i$ with probability $1/n$. In general, it returns $x \in \Xi$ with probability

$\frac{1}{n} \bigl|\bigl\{ i \in \{1, \ldots, n\} \colon x_i = x\bigr\}\bigr|.$

Second, more subtle, question:   let’s now make the assumption that the unknown distribution belongs to a certain known family $(p_\theta)_{\theta \in \Theta}$ of distributions on $\Xi$. Based on the data observed, which $p_\theta$ would we guess is the unknown distribution?

This is a classic question of statistical inference, and one way of answering is to use the maximum likelihood method. This means choosing $\theta$ to maximize the probability of observing the data that was observed. Under the distribution $p_\theta$, that probability is

$p_\theta(x_1) p_\theta(x_2) \cdots p_\theta(x_n).$

So, we choose the value of $\theta$ that maximizes this probability. (Let’s assume for simplicity that there’s exactly one such $\theta$). When we’re thinking of $x_1, \ldots, x_n$ as fixed and $\theta$ as variable, this probability is referred to as the likelihood of $\theta$.

But we can also think geometrically. Relative entropy isn’t a metric, but it behaves something like it, so we might wish to choose $\theta$ to minimize the “distance” between $p_\theta$ and the observed/empirical distribution $\hat{p}$. As $\theta$ moves within $\Theta$, think of $p_\theta$ as defining a curve in the space of all distributions on $\Xi$. We’re looking for the point on the curve closest to $\hat{p}$.

More exactly, we want to minimize the entropy of $\hat{p}$ relative to $p_\theta$. So, choose $\theta$ to minimize $H(\hat{p} \| p_\theta)$. Call this the “minimal entropy method”.

The punchline is this:

The maximum likelihood method and the minimal entropy method are equivalent.

Indeed, a routine calculation shows that

$\log\Bigl( p_\theta(x_1) \cdots p_\theta(x_n) \Bigr) = -n \Bigl( H(\hat{p} \| p_\theta) + H(\hat{p}) \Bigr).$

Since the left-hand side is an increasing function of the likelihood $p_\theta(x_1) \cdots p_\theta(x_n)$, and the right-hand side is a decreasing function of the relative entropy $H(\hat{p} \| p_\theta)$, choosing $\theta$ to maximize the likelihood is equivalent to choosing $\theta$ to minimize the relative entropy.

Posted by: Tom Leinster on February 25, 2017 10:25 PM | Permalink | Reply to this

Post a New Comment