Skip to the Main Content

Note:These pages make extensive use of the latest XHTML and CSS Standards. They ought to look great in any standards-compliant modern browser. Unfortunately, they will probably look horrible in older browsers, like Netscape 4.x and IE 4.x. Moreover, many posts use MathML, which is, currently only supported in Mozilla. My best suggestion (and you will thank me when surfing an ever-increasing number of sites on the web which have been crafted to use the new standards) is to upgrade to the latest version of your browser. If that's not possible, consider moving to the Standards-compliant and open-source Mozilla browser.

February 22, 2017

Functional Equations III: Explaining Relative Entropy

Posted by Tom Leinster

Much of this functional equations course is about entropy and its cousins, such as means, norms, and measures of diversity. So I thought it worth spending one session talking purely about ways of understanding entropy, without actually proving anything about it. I wanted especially to explain how to think about relative entropy — also known as relative information, information gain, and Kullback-Leibler divergence.

My strategy was to do this via coding theory. Information is a slippery concept, and reasoning about it takes some practice. But putting everything in the framework of coding makes everything more concrete. The central point is:

The entropy of a distribution is the mean number of bits per symbols in an optimal encoding.

All this and more is in the course notes. The part we did today starts on page 11.

Next week: relative entropy is the only quantity that satisfies a couple of reasonable properties.

Posted at February 22, 2017 12:13 AM UTC

TrackBack URL for this Entry:

1 Comment & 0 Trackbacks

Re: Functional Equations III: Explaining Relative Entropy

Let me add one nice explanation of relative entropy that I didn’t have time to include in the class. I learned it from a MathOverflow answer by Ashok to a question by John Baez.

Suppose we’re in the presence of some unknown probability distribution on a finite set Ξ\Xi. We’re able to observe the results of drawing from this distribution several times; call these results x 1,,x nΞx_1, \ldots, x_n \in \Xi.

First question:   what would we guess is the unknown distribution? When the question is posed in that generality, the only reasonable answer is that it’s exactly the distribution we observed. This is called the empirical distribution of the data. Formally, it’s the linear combination

p^=1n(δ x 1++δ x n) \hat{p} = \frac{1}{n} \Bigl( \delta_{x_1} + \cdots + \delta_{x_n} \Bigr)

where δ x\delta_x means the point mass (point/Dirac measure) at xx. So if x 1,,x nx_1, \ldots, x_n are all distinct, the empirical distribution p^\hat{p} returns each of x ix_i with probability 1/n1/n. In general, it returns xΞx \in \Xi with probability

1n|{i{1,,n}:x i=x}|. \frac{1}{n} \bigl|\bigl\{ i \in \{1, \ldots, n\} \colon x_i = x\bigr\}\bigr|.

Second, more subtle, question:   let’s now make the assumption that the unknown distribution belongs to a certain known family (p θ) θΘ(p_\theta)_{\theta \in \Theta} of distributions on Ξ\Xi. Based on the data observed, which p θp_\theta would we guess is the unknown distribution?

This is a classic question of statistical inference, and one way of answering is to use the maximum likelihood method. This means choosing θ\theta to maximize the probability of observing the data that was observed. Under the distribution p θp_\theta, that probability is

p θ(x 1)p θ(x 2)p θ(x n). p_\theta(x_1) p_\theta(x_2) \cdots p_\theta(x_n).

So, we choose the value of θ\theta that maximizes this probability. (Let’s assume for simplicity that there’s exactly one such θ\theta). When we’re thinking of x 1,,x nx_1, \ldots, x_n as fixed and θ\theta as variable, this probability is referred to as the likelihood of θ\theta.

But we can also think geometrically. Relative entropy isn’t a metric, but it behaves something like it, so we might wish to choose θ\theta to minimize the “distance” between p θp_\theta and the observed/empirical distribution p^\hat{p}. As θ\theta moves within Θ\Theta, think of p θp_\theta as defining a curve in the space of all distributions on Ξ\Xi. We’re looking for the point on the curve closest to p^\hat{p}.

More exactly, we want to minimize the entropy of p^\hat{p} relative to p θp_\theta. So, choose θ\theta to minimize H(p^p θ)H(\hat{p} \| p_\theta). Call this the “minimal entropy method”.

The punchline is this:

The maximum likelihood method and the minimal entropy method are equivalent.

Indeed, a routine calculation shows that

log(p θ(x 1)p θ(x n))=n(H(p^p θ)+H(p^)). \log\Bigl( p_\theta(x_1) \cdots p_\theta(x_n) \Bigr) = -n \Bigl( H(\hat{p} \| p_\theta) + H(\hat{p}) \Bigr).

Since the left-hand side is an increasing function of the likelihood p θ(x 1)p θ(x n)p_\theta(x_1) \cdots p_\theta(x_n), and the right-hand side is a decreasing function of the relative entropy H(p^p θ)H(\hat{p} \| p_\theta), choosing θ\theta to maximize the likelihood is equivalent to choosing θ\theta to minimize the relative entropy.

Posted by: Tom Leinster on February 25, 2017 10:25 PM | Permalink | Reply to this

Post a New Comment