### Relative Entropy

#### Posted by John Baez

You may recall how Tom Leinster, Tobias Fritz and I cooked up a neat category-theoretic characterization of entropy in a long conversation here on this blog. Now Tobias and I have a sequel giving a category-theoretic characterization of *relative* entropy. But since some people might be put off by the phrase ‘category-theoretic characterization’, it’s called:

- A Bayesian characterization of relative entropy.

I’ve written about this paper before, on my other blog:

- Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.
- Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat},$ and how relative entropy defines a functor on this category.
- Relative Entropy (Part 3): statement of our main theorem, which characterizes relative entropy up to a constant multiple as the only functor $F : \mathrm{FinStat} \to [0,\infty)$ with a few nice properties.

But now the paper is actually done! Let me give a compressed version of the whole story here… with sophisticated digressions buried in some parenthetical remarks that you’re free to skip if you want.

Our paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or—by people who like to use jargon to make their work seem obscure—‘Kullback-Leibler divergence’.

Here’s the basic idea. Whenever you have two probability distributions $p$ and $q$ on the same finite set $X,$ you can define the **entropy of $q$ relative to** $p$:

$S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right)$

Here we set

$q_x \ln\left( \frac{q_x}{p_x} \right)$

equal to $\infty$ when $p_x = 0,$ unless $q_x$ is also zero, in which case we set it equal to 0. Relative entropy thus takes values in $[0,\infty].$
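Since the conventions about $0$ and $\infty$ are easy to get wrong, here is a minimal sketch of the formula in Python. (The function name and the list representation of distributions are mine, just for illustration.)

```python
import math

def rel_entropy(q, p):
    """S(q,p) = sum over x of q_x ln(q_x/p_x), with the conventions above."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue          # the term is 0 when q_x = 0, even if p_x = 0
        if px == 0:
            return math.inf   # the term is infinite when p_x = 0 < q_x
        total += qx * math.log(qx / px)
    return total

print(rel_entropy([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.511
print(rel_entropy([0.5, 0.5], [1.0, 0.0]))  # inf: q puts mass where p has none
```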

Intuitively speaking, $S(q,p)$ measures how surprised you’d be if you thought a situation was described by a probability distribution $p$… but then someone came along and said no, it’s really $q$.

Or if ‘surprise’ sounds too subjective, it’s the *expected amount of information gained* when you discover the probability distribution is really $q,$ when you’d thought it was $p.$

Tobias and I wanted to use category theory to say what’s so great about relative entropy. We did it using a category $\mathrm{FinStat}$ where:

- an object $(X,q)$ consists of a finite set $X$ and a probability distribution $x \mapsto q_x$ on that set;
- a morphism $(f,s) : (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y,$ together with a probability distribution $x \mapsto s_{x y}$ on $X$ for each element $y \in Y$, with the property that $s_{x y} = 0$ unless $f(x) = y$.

If the raw math seems hard to swallow, perhaps some honey-coated words will help it go down. I think of an object of $\mathrm{FinStat}$ as a system with some finite set of **states** together with a probability distribution on its states. This lets me think of a morphism

$(f,s) : (X,q) \to (Y,r)$

in a nice way. First, there’s a **measurement process** $f : X \to Y$, a function from the set $X$ of states of some system being measured to the set $Y$ of states of some measurement apparatus. The condition that $f$ be measure-preserving says the probability that the apparatus winds up in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$\displaystyle{ r_y = \sum_{x \in f^{-1}(y)} q_x }$

Second, there’s a **hypothesis** $s$. This is a guess about the probability that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y.$ The guess is the number $s_{x y}$.
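To make this concrete, here is a tiny example (the numbers are mine, not from the paper). Let $X = \{1,2,3\}$ with true distribution $q = (1/2, 1/4, 1/4)$, let $Y = \{a,b\}$, and let the measurement process be $f(1) = f(2) = a$ and $f(3) = b$. Then $f$ is measure-preserving, since $r_a = q_1 + q_2 = 3/4$ and $r_b = q_3 = 1/4$. A hypothesis must give a probability distribution supported on each fiber of $f$: for instance $s_{1 a} = s_{2 a} = 1/2$ and $s_{3 b} = 1$, with all other $s_{x y} = 0$.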

Now, suppose we have any morphism

$(f,s) : (X,q) \to (Y,r)$

in $\mathrm{FinStat}.$ From this we get two probability distributions on $X$. First, we have the probability distribution $p$ given by

$\displaystyle{ p_x = \sum_{y \in Y} s_{x y} r_y } \qquad \qquad \heartsuit\heartsuit\heartsuit$

This is our best guess about the probability that the system is in any given state, given our hypothesis and the probability distribution of measurement results. Second, we have the ‘true’ probability distribution $q$. So, we can compute the relative entropy $S(q,p)$: the amount of information we gain when we learn the true distribution $q$, having expected $p$.
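In the tiny example above, equation $\heartsuit\heartsuit\heartsuit$ gives $p_1 = s_{1 a} r_a = 3/8$, $p_2 = 3/8$ and $p_3 = s_{3 b} r_b = 1/4$, so

$\displaystyle{ S(q,p) = \tfrac{1}{2} \ln \frac{1/2}{3/8} + \tfrac{1}{4} \ln \frac{1/4}{3/8} + \tfrac{1}{4} \ln \frac{1/4}{1/4} \approx 0.042 }$

Had we guessed $s_{1 a} = 2/3$ and $s_{2 a} = 1/3$ instead, we would have found $p = q$ and gained no information.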

In fact, this way of assigning relative entropies to morphisms defines a functor

$RE : \mathrm{FinStat} \to [0,\infty]$

where we use $[0,\infty]$ to denote the category with one object, the numbers $0 \le x \le \infty$ as morphisms, and addition as composition. More precisely, if

$(f,s) : (X,q) \to (Y,r)$

is any morphism in $\mathrm{FinStat},$ we define

$RE(f,s) = S(q,p)$

where $p$ is defined as in equation $\heartsuit\heartsuit\heartsuit$. This tells us how surprised we are when we learn the true probability distribution $q$, if our measurement results were distributed according to $r$ and our hypothesis was $s$.

The fact that $RE$ is a functor is nontrivial and rather interesting! It says that given any composable pair of measurement processes:

$(X,q) \stackrel{(f,s)}{\longrightarrow} (Y,r) \stackrel{(g,t)}{\longrightarrow} (Z,u)$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$RE((g,t) \circ (f,s)) = RE(g,t) + RE(f,s) .$
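We can check this equation numerically. Here is a small Python sketch; the helper names and the representation of a hypothesis $s$ as a list `s[y]` of distributions on $X$ are my own conventions, not the paper’s.

```python
import math

def rel_entropy(q, p):
    # as in the earlier sketch; all supports are full in this example
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def guess(s, r, n):
    # equation ♥♥♥: p_x = sum over y of s[y][x] * r_y
    return [sum(s[y][x] * r[y] for y in range(len(r))) for x in range(n)]

# (f,s) : (X,q) -> (Y,r) with X = {0,1,2}, Y = {0,1}, f(0) = f(1) = 0, f(2) = 1
q = [0.5, 0.25, 0.25]
r = [0.75, 0.25]                        # pushforward of q along f
s = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # a (non-optimal) hypothesis

# (g,t) : (Y,r) -> (Z,u) with Z = {0}, g(0) = g(1) = 0
u = [1.0]
t = [[2/3, 1/3]]

# the composite hypothesis: ts[z][x] = sum over y of s[y][x] * t[z][y]
ts = [[sum(s[y][x] * t[z][y] for y in range(2)) for x in range(3)]
      for z in range(1)]

RE_f  = rel_entropy(q, guess(s, r, 3))
RE_g  = rel_entropy(r, guess(t, u, 2))
RE_gf = rel_entropy(q, guess(ts, u, 3))
print(RE_gf, RE_f + RE_g)   # both print ≈ 0.0589
```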

We prove that $RE$ is a functor. However, we go further: we *characterize* relative entropy by saying that up to a constant multiple, $RE$ is the *unique* functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.

### Lower semicontinuity

The first condition is that $RE$ is lower semicontinuous. The set $P(X)$ of probability distributions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does *not* define a continuous function

$\begin{array}{rcl} S : P(X) \times P(X) &\to& [0,\infty] \\ (q,p) &\mapsto & S(q,p) . \end{array}$

The problem is that

$\displaystyle{ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) }$

and $q_x \ln(q_x/p_x)$ is $\infty$ when $p_x = 0$ and $q_x > 0$ — but it’s $0$ when $p_x = q_x = 0.$

So, it turns out that $S$ is only **lower semicontinuous**, meaning that if $p^i , q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$ then

$S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i)$
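For example, take $X = \{1,2\}$, fix $p = (0,1)$, and let $q^i = (1/i,\, 1-1/i)$. Then $S(q^i, p) = \infty$ for every $i$, since each $q^i$ puts mass where $p$ puts none. But $q^i \to q = (0,1)$ and $S(q,p) = 0$. So the relative entropy can jump *down* in the limit, which is exactly what lower semicontinuity permits; it can never jump up.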

We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $RE$ maps morphisms to morphisms in a lower semicontinuous way.

(Lower semicontinuity may seem like an annoying property. But there’s a way to redeem it. There’s a sneaky topology on $[0,\infty]$ such that a function taking values in $[0,\infty]$ is lower semicontinuous (in the lim inf sense above) if and only if it’s continuous with respect to this sneaky topology!

Using this idea, we can make $\mathrm{FinStat}$ and $[0,\infty]$ into topological categories — that is, categories internal to Top — in such a way that lower semicontinuity simply says

$RE : \mathrm{FinStat} \to [0,\infty]$

is a continuous functor.

A bit confusingly, this sneaky topology on $[0,\infty]$ is called the **upper topology**: its open sets are the empty set, the whole space, and the intervals $(a,\infty]$. I’ve fallen in love with the upper topology on $[0,\infty]$. Why?

Well, $[0,\infty]$ is a very nice rig, or ‘ring without negatives’. Addition is defined in the obvious way, and multiplication is defined in the almost-obvious way, except that

$0 \cdot \infty = \infty \cdot 0 = 0$

Even this is actually obvious if you remember that it’s required by the definition of a rig. But if you try to put the ordinary closed interval topology on $[0,\infty]$, you’ll see multiplication is not continuous, because $a \cdot \infty$ is infinite when $a \gt 0$ but then it suddenly jumps down to zero when $a$ hits zero. However, multiplication *is* continuous if we give $[0,\infty]$ the upper topology! Then $[0,\infty]$ becomes a topological rig.)

### Convex linearity

The second condition is that $RE$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat},$ and then the functor $RE$ maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty].$

Intuitively, this means that if we take a coin with probability $\lambda$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $\lambda$ times the expected information gain of the first process plus $1-\lambda$ times the expected information gain of the second process.
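In symbols (the $\oplus$ notation here is mine): given morphisms $\phi : (X,q) \to (Y,r)$ and $\psi : (X',q') \to (Y',r')$ and $\lambda \in [0,1]$, there is a convex combination $\lambda \phi \oplus (1-\lambda) \psi$ whose underlying sets are the disjoint unions $X \sqcup X'$ and $Y \sqcup Y'$, and convex linearity says

$RE(\lambda \phi \oplus (1-\lambda) \psi) = \lambda \, RE(\phi) + (1-\lambda) \, RE(\psi) .$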

(Operadically, the point is that both $\mathrm{FinStat}$ and $[0,\infty]$ are algebras of an operad **P** whose operations are convex linear combinations. The $n$-ary operations in **P** are just probability distributions on an $n$-element set. In other words, they’re points in the $(n-1)$-simplex.

So, saying that $RE$ is convex linear means that

$RE: \mathrm{FinStat} \to [0,\infty]$

is a map of **P**-algebras. But we avoid discussing this in our paper because $\mathrm{FinStat}$, being a category, is just a ‘weak’ **P**-algebra, and we decided this would be too much for our poor little readers.

For those who like fine nuances: **P** is a *topological* operad, and $\mathrm{FinStat}$ and $[0,\infty]$ are algebras of this in the topological category TopCat. As I mentioned, $\mathrm{FinStat}$ is a ‘weak’ **P**-algebra, meaning the laws for convex linear combinations hold only up to coherent natural isomorphism. $[0,\infty]$ is strict… but to get convex linear combinations like $\lambda \cdot 0 + (1 - \lambda) \cdot \infty$ to behave continuously, we have to give $[0,\infty]$ the upper topology!)

### Vanishing on a subcategory

The third condition is that $RE$ vanishes on morphisms $(f,s) : (X,q) \to (Y,r)$ where the hypothesis $s$ is **optimal**. By this, we mean that equation $\heartsuit\heartsuit\heartsuit$ gives a probability distribution $p$ equal to the ‘true’ one, $q$.

That makes a lot of sense conceptually: we don’t gain any information upon learning the truth about a situation if we already *knew* the truth!
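Concretely, since $s_{x y} = 0$ unless $f(x) = y$, equation $\heartsuit\heartsuit\heartsuit$ reduces to $p_x = s_{x f(x)} \, r_{f(x)}$. So when every $r_y$ is positive, the optimal hypothesis is exactly the one given by conditional probability:

$\displaystyle{ s_{x y} = \frac{q_x}{r_y} \quad \mathrm{whenever} \quad f(x) = y }$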

(But the subcategory of $\mathrm{FinStat}$ where we keep all the objects but only these ‘optimal’ morphisms also has a nice category-theoretic significance. Tom Leinster called it **FP** in a post on this blog.

That’s because it’s the ‘free **P**-algebra on an internal **P**-algebra’, where **P** is the operad I mentioned. I won’t explain what this means here, because Tom did it! Suffice it to say that it’s a shockingly abstract piece of operad theory that nonetheless manages to capture the concept of entropy very neatly. But that’s *plain old* entropy, not relative entropy.)

### The result

Here, then, is our main result:

**Theorem.** Any lower semicontinuous, convex-linear functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$F(f,s) = c RE(f,s)$

for any morphism $(f,s) : (X,q) \to (Y,r)$ in $\mathrm{FinStat}.$

### The proof

The proof is surprisingly hard. Or maybe we’re just surprisingly bad at proving things. But the interesting thing is this: the proof is swift and effective in the ‘generic’ case — the case where the support of the probability measures involved is the whole set they’re living on, and the constant $c$ is finite.

It takes some more work to handle the case where the probability measures have smaller support.

But the really hard work starts when we handle the case that, in the end, has $c = \infty$. Then the proof becomes more like analysis than what you normally expect in category theory. We slowly corner the result, blocking off all avenues of escape. Then we close in, grab its neck, and strangle it, crushing its larynx ever tighter, as it loses the will to fight back and finally expires… still twitching.

You’ve got to read the proof to understand what I mean.

## Re: Relative Entropy

You might include the name “cross-entropy” among your list – I saw it used a lot when I was thinking about these things.