### Category-Theoretic Characterizations of Entropy II

#### Posted by John Baez

We’re having a lively conversation about different notions related to entropy, and how to understand these using category theory. I don’t want to stop this conversation, but I have something long to say, which seems like a good excuse for a new blog post.

A while back, Tom Leinster took Faddeev’s characterization of Shannon entropy and gave it a slick formulation in terms of operads. I’ve been trying to understand this slick formulation a bit better, and I’ve made a little progress.

Tom’s idea revolves around a special law obeyed by the Shannon entropy. I’ll remind you of that law, or tell you about it for the first time in case you didn’t catch it yet. I’ll explain it in a very lowbrow way, as opposed to Tom’s beautiful highbrow way. Then, I’ll derive it from a simpler law obeyed by something called the ‘partition function’.

I like this, because I’ve got a bit of physics in my blood, and physicists know that the partition function is a very important concept. But I hope you’ll see: you don’t need to know physics to understand or enjoy this stuff!

#### A math puzzle

We begin with a sadly familiar problem:

Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution $P$. If you go to the $i$th restaurant, you then choose a dish from the menu according to some probability distribution $Q_i$.

How surprising will your choice be, on average?

Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.

This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.

Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability $p$ is defined to be $- \ln(p)$. Then, if there are a bunch of possible outcomes distributed according to some probability distribution $P$, the ‘expected’ surprise or **Shannon entropy** is:

$S(P) = -\sum_i p_i \ln(p_i) \qquad (2)$

where $p_i$ is the probability of the $i$th outcome. This is a weighted average where each surprise is weighted by its probability of happening.

So, we really do have a well-defined math puzzle, and it’s not even very hard.
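
In code, the definition is a one-liner. Here is a minimal Python sketch (the function name `shannon_entropy` is my own choice, and entropy is measured in nats to match the $\ln$ above):

```python
import math

def shannon_entropy(p):
    """Shannon entropy -sum_i p_i ln(p_i) of a probability distribution, in nats.

    Terms with p_i = 0 contribute zero, by the usual convention 0 ln(0) = 0.
    """
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

# A fair coin: each outcome has surprise -ln(1/2), so the expected
# surprise is ln(2).
print(shannon_entropy([0.5, 0.5]))  # → 0.6931... = ln 2
```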

#### Glomming together probabilities

But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution $P$ on the set of restaurants. Then you choose a meal according to some probability distribution $Q_i$. If there are $n$ restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call

$P \circ (Q_1, \dots, Q_n )$

A bit more formally:

Suppose $P$ is a probability distribution on the set $\{1,\dots, n\}$ and $Q_i$ are probability distributions on finite sets $X_i$, where $i = 1, \dots, n$. Suppose the probability distribution $P$ assigns a probability $p_i$ to each element $i \in \{1,\dots, n\}$, and suppose the distribution $Q_i$ assigns a probability $q_{i j}$ to each element $j \in X_i$.

Then we can **glom together** the probability distributions $Q_i$ using $P$ and get a new probability distribution $P \circ (Q_1, \dots, Q_n)$ on the disjoint union of all the sets $X_i$. The resulting probability for each element $(i,j)$ in the disjoint union $\sqcup_{i = 1}^n X_i$ is just $p_i q_{i j}$.

These probabilities sum to 1 as they should:

$\sum_{i, j} p_i q_{i j} = \sum_i \left( p_i \sum_j q_{i j} \right) = \sum_i p_i = 1$

Note: what $j$ ranges over depends on $i$! But it’s all kosher.
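
Here is a short Python sketch of the glomming operation (the name `glom` is mine). It returns the distribution $P \circ (Q_1, \dots, Q_n)$ on the disjoint union, represented as a dictionary keyed by pairs $(i,j)$:

```python
def glom(P, Qs):
    """Glom the distributions Qs = [Q_1, ..., Q_n] together using P.

    P is a list of probabilities p_i; each Q_i is a list of probabilities
    q_ij.  Returns P ∘ (Q_1, ..., Q_n) as a dict mapping the point (i, j)
    of the disjoint union to the probability p_i * q_ij.
    """
    return {(i, j): p_i * q_ij
            for i, (p_i, Q_i) in enumerate(zip(P, Qs))
            for j, q_ij in enumerate(Q_i)}

# Two restaurants: one chosen 70% of the time with two dishes, the other
# 30% of the time with three dishes.
R = glom([0.7, 0.3], [[0.5, 0.5], [0.2, 0.3, 0.5]])
print(sum(R.values()))  # sums to 1, up to floating-point rounding
```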

#### Glomming together numbers

There’s something even simpler than glomming together probability measures: we can glom together *numbers!*

Suppose we have a probability measure $P$ on the set $\{1, \dots, n\}$ and for each element $i \in \{1, \dots, n\}$ we have a number $x_i$. Then we can define a new number $P(x_1, \dots, x_n)$ by

$P(x_1, \dots, x_n) = \sum_{i = 1}^n p_i x_i$

This is just the ‘weighted average’ of the numbers $x_i$. We have already seen a weighted average in the definition of Shannon entropy, equation (2).

#### Glomming together entropies

Now we’re ready to answer the math puzzle:

$S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))$

On the left we’re glomming together a bunch of probability distributions using $P$ and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using $P$. But there’s also an extra term: the entropy of $P$!

In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely $P(S(Q_1), \ldots, S(Q_n))$. It’ll be that *plus* the expected surprise of your choice of restaurant, namely $S(P)$.

You might not have expected such a simple formula. But it’s easy to prove:

$\begin{aligned} S(P\circ(Q_1, \ldots, Q_n)) &= -\sum_{i, j} p_i q_{i j} \ln(p_i q_{i j}) \\ &= - \sum_{i, j} p_i q_{i j} \ln(p_i) - \sum_{i, j} p_i q_{i j} \ln(q_{i j}) \\ &= -\sum_{i} p_i \ln(p_i) + \sum_{i} p_i S(Q_i) \\ &= S(P) + P(S(Q_1), \dots, S(Q_n)) \end{aligned}$
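
As a sanity check, here is a short Python verification of this law on a random example (the helper names `glom`, `weighted_average`, and `normalize` are my own):

```python
import math
import random

def shannon_entropy(p):
    """-sum_i p_i ln(p_i), the Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def glom(P, Qs):
    """The glommed distribution P ∘ (Q_1, ..., Q_n), flattened to a list."""
    return [p_i * q for p_i, Q_i in zip(P, Qs) for q in Q_i]

def weighted_average(P, xs):
    """P acting on numbers: P(x_1, ..., x_n) = sum_i p_i x_i."""
    return sum(p_i * x for p_i, x in zip(P, xs))

def normalize(w):
    """Rescale nonnegative weights so they sum to 1."""
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
P = normalize([random.random() for _ in range(3)])
Qs = [normalize([random.random() for _ in range(k)]) for k in (2, 4, 3)]

lhs = shannon_entropy(glom(P, Qs))
rhs = shannon_entropy(P) + weighted_average(P, [shannon_entropy(Q) for Q in Qs])
print(abs(lhs - rhs) < 1e-12)  # → True
```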

#### Faddeev’s theorem

The formula for glomming together entropies is more important than you might think: it comes very close to *characterizing* Shannon entropy! In 1957, D. K. Faddeev published a result that we can restate as follows:

**Theorem (Faddeev)**: *Suppose $F$ is a function sending probability measures on finite sets to real numbers. Suppose that:*

- $F$ is invariant under bijections.
- $F$ is continuous.
- $F(P\circ(Q_1, \ldots, Q_n)) = F(P) + P(F(Q_1), \ldots, F(Q_n))$.

*Then $F$ is a real multiple of Shannon entropy.*

In item 1 we’re using the fact that a bijection $f: X \to X'$ between finite sets together with a probability distribution on $X$ gives a probability distribution on $X'$; we want these to have the same entropy. In item 2 we’re using the standard topology on $\mathbb{R}^n$ to put a topology on the set of probability distributions on any $n$-element set. For a proof of this theorem, see the beginning of Rényi’s 1961 paper.

#### What’s going on?

While this equation is cute:

$S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))$

it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution $P$ can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map $S$ from probability distributions to numbers. So, if you’re algebraically inclined, you would want $S$ to be a ‘homomorphism’, obeying a law like this:

$S(P\circ(Q_1, \ldots, Q_n)) = P(S(Q_1), \ldots, S(Q_n)) \qquad (\text{sorry, not true})$

We see laws of this sort all over math. But the true law has an extra term. What’s going on?

The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.

Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.

For starters, consider another law that has an ‘extra term’:

$\frac{d}{d x} (f g) = \left(\frac{d}{d x} f\right) g + f \left(\frac{d}{d x} g\right)$

In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ‘derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.

Let’s see how.

#### Glomming together partition functions

There’s something more fundamental than the Shannon entropy of a probability distribution: namely, its partition function. At least that’s how physicists think. Let me explain.

Suppose $P$ is a probability measure on a finite set $X$. Let $p_i$ be the probability of the element $i \in X$. We can always write

$p_i = e^{-H_i}$

where

$H : X \to (-\infty, \infty]$

is a function called the energy or **Hamiltonian**. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy $+\infty$ to account for zero probabilities.

But physicists always go further and introduce a parameter $\beta \in (0,\infty)$ which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make $p_i$ into a function of $\beta$:

$p_i(\beta) = e^{-\beta H_i}$

or if you prefer, simply

$p_i(\beta) = p_i^\beta$

When $\beta = 1$, these numbers are just the probabilities $p_i$. But when $\beta \ne 1$, there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the **partition function**:

$Z(P) = \sum_{i \in X} e^{-\beta H_i}$

But I won’t do that just now: I’ll let $P(\beta)$ be the measure on the set $X$ that assigns the measure $p_i(\beta)$ to the point $i \in X$.

Now everything is a function of $\beta$! But everything reduces to something familiar at $\beta = 1$, so we can stretch our old notation without breaking it. We now have functions $p_i$ sending numbers $\beta \in (0,\infty)$ to numbers, and a function $P$ sending numbers $\beta$ to measures on $X$. These reduce to our original $p_i$ and $P$ at $\beta = 1$. The partition function is also a function of $\beta$, and it equals 1 at $\beta = 1$.

But here’s the important thing: *the partition function obeys a simpler law than entropy when we glom probability measures together!* Suppose $P$ is a probability distribution on the set $\{1,\dots, n\}$ and $Q_i$ are probability distributions on finite sets $X_i$, where $i = 1, \dots, n$. Then

$Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n)) \qquad (3)$

So, in fancy jargon, $Z$ is an operad algebra homomorphism!

But I need to explain what this equation *means*. First of all, everything is a function of $\beta$. Second, while $P$ and $Q_i$ are only *probability* measures at $\beta = 1$, they’re perfectly fine *measures* at other values of $\beta$, so all our ‘glomming’ formulas still make sense.

Let’s check to make sure we know what’s going on. An expression like $P \circ (Q_1, \dots, Q_n)$ could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers $\beta \in (0,\infty)$ to measures on that set. So, we can extend $P$ and $Q_i$ this way and then ‘glom’ them to get a measure-valued function of $\beta$. On the other hand, we can glom them and then extend them. Luckily, the results agree.

Let’s see why. Suppose $P$ assigns the point $i \in X$ the probability $p_i$. When we extend this to a function of $\beta$ we get $p_i^\beta$. Similarly, suppose $Q_i$ assigns the point $j \in X_i$ the probability $q_{ i j}$. When we extend this to a function of $\beta$ we get $q_{i j}^\beta$. When we glom these, the measure of the point $(i,j) \in \sqcup X_i$ will be this function of $\beta$:

$p_i^\beta q_{i j}^\beta$

But this equals

$(p_i q_{i j})^\beta$

which is the result of glomming and then extending.

The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,

$Z(P \circ (Q_1, \dots, Q_n)) = \sum_{i, j} p_i^\beta q_{i j}^\beta = \sum_i p_i^\beta Z(Q_i) = P (Z(Q_1), \dots, Z(Q_n))$
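
Here is a quick numerical check of this law in Python, at several values of $\beta$ (the names `Z`, `glom`, and `apply_P` are my own, not standard):

```python
def Z(p, beta):
    """Partition function of the measure p at inverse temperature beta:
    sum_i p_i^beta, where p_i = e^{-H_i}."""
    return sum(x ** beta for x in p)

def glom(P, Qs):
    """The glommed measure P ∘ (Q_1, ..., Q_n): all products p_i q_ij."""
    return [p_i * q for p_i, Q_i in zip(P, Qs) for q in Q_i]

def apply_P(P, xs, beta):
    """P acting on numbers, as a function of beta: sum_i p_i^beta x_i."""
    return sum(p_i ** beta * x for p_i, x in zip(P, xs))

P = [0.2, 0.5, 0.3]
Qs = [[0.6, 0.4], [0.1, 0.9], [0.25, 0.25, 0.5]]

# The homomorphism law holds at every beta, not just at beta = 1.
for beta in (0.5, 1.0, 2.0, 3.7):
    lhs = Z(glom(P, Qs), beta)
    rhs = apply_P(P, [Z(Q, beta) for Q in Qs], beta)
    assert abs(lhs - rhs) < 1e-12
```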

#### From partition function to entropy

How is the Shannon entropy of a probability distribution related to its partition function? Simple:

$S(P) = - \left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} \qquad (4)$

Here I must emphasize that I’m only talking about the Shannon entropy $S(P)$ of the original probability distribution, ‘at $\beta = 1$’. Physicists extend $S(P)$ to a function of $\beta$, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating $S$ and $Z$ a bit more complicated, so let’s not.

Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.

We start with the rule for glomming partition functions:

$Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))$

We differentiate with respect to $\beta$ and use the product rule, taking advantage of the fact that $P (Z(Q_1), \dots, Z(Q_n))$ is linear in $P$ but also linear in each of the functions $Z(Q_i)$:

$\frac{d}{d \beta} Z(P \circ (Q_1, \dots, Q_n)) = \frac{d P}{d \beta} (Z(Q_1), \dots, Z(Q_n)) + P (\frac{d}{d \beta} Z(Q_1), \dots, \frac{d}{d \beta} Z(Q_n))$

Now set $\beta = 1$. We can use equation (4) and the fact that all our partition functions equal 1 at $\beta = 1$:

$- S(P \circ (Q_1, \dots, Q_n)) = \left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1} - P(S(Q_1), \dots, S(Q_n))$

Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that $S$ is like a ‘derivation’. We just need to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:

$\left. \frac{d P}{d \beta} (1,\dots, 1)\right|_{\beta = 1} = \sum_i \left. \frac{d p_i(\beta)}{d \beta} \right|_{\beta = 1}$

and we have

$\frac{d p_i(\beta)} {d\beta} = \frac{d}{d\beta} e^{-\beta H_i} = -H_i e^{-\beta H_i}$

so

$\left. \frac{d p_i(\beta)} {d\beta} \right|_{\beta = 1} = - H_i e^{-H_i} = \ln(p_i) p_i$

so

$\left. \frac{d P}{d \beta} (1,\dots, 1)\right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)$

So, we’ve got it!

$S(P \circ (Q_1, \dots, Q_n)) = S(P) + P(S(Q_1), \dots, S(Q_n))$

*The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!*

But before we go further, let’s check rule (4). For starters,

$\left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \left. \frac{d}{d \beta} \sum_i e^{-\beta H_i} \right|_{\beta = 1}$

But we’ve just seen that

$\left. \frac{d}{d\beta} e^{-\beta H_i} \right|_{\beta = 1} = p_i \ln(p_i)$

so

$\left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)$
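
We can also confirm this numerically in Python, approximating the derivative of $Z$ at $\beta = 1$ by a central finite difference (the helper names are mine):

```python
import math

def Z(p, beta):
    """Partition function sum_i p_i^beta."""
    return sum(x ** beta for x in p)

def shannon_entropy(p):
    """-sum_i p_i ln(p_i), in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

P = [0.1, 0.2, 0.3, 0.4]
h = 1e-6
# Central finite-difference approximation to dZ/dβ at β = 1.
dZ = (Z(P, 1 + h) - Z(P, 1 - h)) / (2 * h)
print(abs(-dZ - shannon_entropy(P)) < 1e-6)  # → True
```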

#### The upshot

In our notes on the nLab you’ll see a deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!

Let $FinProb_0$ be the groupoid of finite probability measure spaces (where, for technical reasons, we exclude measures that assign the measure zero to a nonempty set). Then $Z$ can be thought of as a functor from $FinProb_0$ to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, $Z$ assigns to each object $P \in FinProb_0$ its partition function $Z(P)$… but we can recover $P$ up to isomorphism from $Z(P)$.

Now, $FinProb_0$ is an algebra of a certain operad $\mathbf{P}$ whose $n$-ary operations are the probability measures on the set $\{1, \dots, n\}$. This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a $\mathbf{P}$-algebra. Furthermore, $Z$ becomes a homomorphism of $\mathbf{P}$-algebras. This last statement mostly just says that

$Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))$

But then we can take $Z$, differentiate it with respect to $\beta$, and set $\beta = 1$. By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of the $\mathbf{P}$-algebra $FinProb_0$. Concretely, we’ve seen this mostly says that

$S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))$

But there should also be an abstract-nonsense theory of derivations of operad algebras. In fact I discussed this a bit back in week299, but only a little. So, there’s a lot I don’t understand, such as:

How does my ‘derivation’ way of thinking about the law $S(P\circ(Q_1, \ldots, Q_n)) =$ $S(P) + P(S(Q_1), \ldots, S(Q_n))$ relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?

And if we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of *q*-derivative of minus the logarithm of the partition function. The *q*-derivative is a kind of ‘twisted derivation’, but I bet a bunch of the same calculations work in some modified form.
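
That last claim is easy to test numerically. Since $Z = 1$ at $\beta = 1$, we have $\ln Z(1) = 0$, so the $q$-derivative of $-\ln Z$ at $\beta = 1$ is just $-\ln Z(q)/(q-1)$, which is the Rényi entropy $S_q = \frac{1}{1-q}\ln \sum_i p_i^q$. A hedged Python sketch (function names mine):

```python
import math

def Z(p, beta):
    """Partition function sum_i p_i^beta."""
    return sum(x ** beta for x in p)

def renyi_entropy(p, q):
    """Rényi entropy S_q = ln(sum_i p_i^q) / (1 - q), for q ≠ 1."""
    return math.log(Z(p, q)) / (1 - q)

def q_derivative_at_1(f, q):
    """The q-derivative (f(qx) - f(x)) / ((q - 1) x) evaluated at x = 1."""
    return (f(q) - f(1)) / (q - 1)

P = [0.1, 0.2, 0.3, 0.4]
for q in (0.5, 2.0, 3.0):
    lhs = q_derivative_at_1(lambda b: -math.log(Z(P, b)), q)
    assert abs(lhs - renyi_entropy(P, q)) < 1e-12
```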

## Re: Category-Theoretic Characterizations of Entropy II

All of this seems like a very complicated way of saying that P(X,Y)=P(Y|X)P(X).