## May 18, 2011

### An Operadic Introduction to Entropy

#### Posted by Tom Leinster

Bless British trains. A two-hour delay with nothing to occupy me provided the perfect opportunity to figure out the relationships between some of the results that John, Tobias and I have come up with recently.

This post is intended to serve two purposes. First, for those who have been following, it ties together several of the theorems we’ve found. I hope it will make them seem less like a bunch of distinct but vaguely similar results and more like a coherent whole.

Second, for those who haven’t been following, it’s an introduction to entropy, and our recent results on it, for the categorically minded.

I will not assume:

• that you know anything whatsoever about entropy
• that you’ve read any other posts on this blog.

I will assume:

• that you know the definitions of operad and algebra for an operad
• that you know a bit of category theory, including roughly what category theorists mean when they use the word lax.

### Operads, algebras and internal algebras

Let $P$ be an operad in $Set$. I’ll keep it open as to whether $P$ is symmetric or not. It makes sense to talk about $P$-algebras in any category with finite products, and in particular in $Cat$. But $Cat$ is actually a 2-category, so there are further variations available to us.

For a start, we can consider weak (or pseudo) $P$-algebras, where the axioms only hold up to coherent isomorphism. In what follows, I won’t be scrupulously careful to distinguish between strict and weak $P$-algebras. That’s much the same kind of carelessness as when we pretend that a monoidal category is strict.

But more importantly, we can vary the flavour of maps between $P$-algebras. In particular, we can consider lax maps. Given $P$-algebras $B$ and $A$ in $Cat$, a lax map $B \to A$ is a functor $F: B \to A$ together with a natural transformation

$\begin{matrix} P_n \times B^n &\stackrel{1 \times F^n}{\to} &P_n \times A^n \\ \downarrow &\Leftarrow &\downarrow \\ B &\stackrel{F}{\to} &A \end{matrix}$

for each $n \in \mathbb{N}$, satisfying coherence conditions.

Now let $A$ be a $P$-algebra in $Cat$. An internal $P$-algebra in $A$ is a lax map $1 \to A$ of $P$-algebras, where $1$ is the terminal $P$-algebra in $Cat$. (Previously, I’ve sometimes called this a ‘lax point’ of $A$. The internal algebra terminology is, I think, due to Michael Batanin.) The examples that follow shortly should explain the name.

Explicitly, an internal $P$-algebra in $A$ consists of:

• an object $a \in A$
• for each operation $p \in P_n$, a map $\alpha_p: p(a, \ldots, a) \to a$

satisfying the axioms

• $\alpha_{p \circ (r_1, \ldots, r_n)} = \alpha_p \circ p(\alpha_{r_1}, \ldots, \alpha_{r_n})$ for any operations $p \in P_n$ and $r_i \in P_{k_i}$
• $\alpha_1 = 1_a$, where the $1$ on the left-hand side is the unit of the operad
• if we’re doing symmetric operads, $\alpha_{\sigma p} = \alpha_p$ whenever $p \in P_n$ and $\sigma \in S_n$.

Examples

1. Take the terminal non-symmetric operad $P$. Then a $P$-algebra $A$ in $Cat$ is a monoidal category, and an internal algebra in $A$ is an internal monoid in $A$.
2. Fix a monoid $M$, and let $P$ be the non-symmetric operad whose only operations are unary, the monoid of unary operations being $M$. I’ll call this the operad for $M$-sets. Then a $P$-algebra in $Cat$ is a category $A$ with a left action by $M$. An internal algebra in $A$ is an object $a$ together with a map $\alpha_m: m\cdot a \to a$ for each $m \in M$, satisfying the evident action-like axioms.

Although I started with an operad $P$ in $Set$, the concept of internal $P$-algebra makes sense for an operad $P$ in any category $\mathcal{E}$ with finite limits. Our $P$-algebras would then be categories internal to $\mathcal{E}$ (and equipped with a $P$-algebra structure).

The only case we’ll need is $\mathcal{E} = Top$, the category of topological spaces. So $P$ will be a topological operad, and our $P$-algebras $A$ will be topological categories, that is, categories internal to $Top$. I’ll write $Cat(Top)$ for the category of topological categories. The explicit description of internal $P$-algebras still holds, except that we add the condition that the assignment

$p \mapsto \alpha_p$

defines a continuous map from $P_n$ to the space of maps in $A$.

If you’re not comfortable with internal categories, all you need to extract from the previous two paragraphs is ‘some continuity conditions will be slapped on’.

I’ll introduce a particularly important topological operad, $\mathbf{P}$. The elements of $\mathbf{P}_n$ are the probability distributions on an $n$-element set:

$\mathbf{P}_n = \{ \mathbf{p} \in \mathbb{R}^n | p_i \geq 0, \sum_i p_i = 1 \}.$

This is the $(n - 1)$-simplex, and is topologized as such.

To define the composition, imagine that you have a distribution $\mathbf{p}$ on $\{1, \ldots, n\}$ and then further distributions $\mathbf{r}_i$ on $\{1, \ldots, k_i\}$ for each $i \in \{1, \ldots, n\}$. From this you can create a new distribution on $\{1, \ldots, k_1 + \cdots + k_n\}$: first choose an element $i \in \{1, \ldots, n\}$ according to $\mathbf{p}$, then choose an element $j \in \{1, \ldots, k_i\}$ according to $\mathbf{r}_i$. Your output element is $k_1 + \cdots + k_{i - 1} + j$. That is,

$\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n) = (p_1 r_{1 1}, \ldots, p_1 r_{1 k_1},   \ldots,   p_n r_{n 1}, \ldots, p_n r_{n k_n}).$

The unit of the operad $\mathbf{P}$ is, inevitably, the unique probability distribution $(1) \in \mathbf{P}_1$.
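For the computationally minded, this composition can be sketched in a few lines of Python (the function name `compose` is mine, not standard notation from the post):

```python
from math import isclose

def compose(p, rs):
    """Operadic composition in P: first choose i according to p, then
    choose j according to rs[i].  Returns the flat tuple
    (p_1 r_{1,1}, ..., p_1 r_{1,k_1}, ..., p_n r_{n,1}, ..., p_n r_{n,k_n})."""
    return [p[i] * r_ij for i, r in enumerate(rs) for r_ij in r]

p = [0.5, 0.5]
rs = [[0.3, 0.7], [1.0]]
q = compose(p, rs)
# q = [0.15, 0.35, 0.5] -- again a probability distribution
```

Note that the output always sums to 1, confirming that composition lands back in $\mathbf{P}_{k_1 + \cdots + k_n}$.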

I’ll also introduce a particularly important $P$-algebra in $Cat(Top)$.

As a topological category, it’s the additive monoid $\mathbb{R}_+ = [0, \infty)$, regarded as a one-object category. In this post, when $\mathbb{R}_+$ appears as a category, it will always mean this. (The order on $\mathbb{R}_+$ could also be used to make it into a category, but that won’t come into this story.)

The $\mathbf{P}$-algebra structure is trivial on objects, since the category $\mathbb{R}_+$ only has one object. On maps — that is, on real numbers — it’s given by

$\mathbf{p}(x_1, \ldots, x_n) = \sum_i p_i x_i$

($\mathbf{p} \in \mathbf{P}_n$, $x_i \in \mathbb{R}_+$).

Now, what’s an internal $\mathbf{P}$-algebra in $\mathbb{R}_+$?

Well, according to our explicit description, it is:

• an object of the category $\mathbb{R}_+$ — but $\mathbb{R}_+$ only has one object
• an element $\alpha_\mathbf{p} \in \mathbb{R}_+$ for each probability distribution $\mathbf{p} \in \mathbf{P}_n$, which I’ll now write as $\alpha(\mathbf{p})$ for legibility

satisfying the axioms

• $\alpha(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = \alpha(\mathbf{p}) + \sum_i p_i \alpha(\mathbf{r}_i)$
• $\alpha((1)) = 0$
• $\alpha(\sigma \mathbf{p}) = \alpha(\mathbf{p})$ for any permutation $\sigma$ (that is, $\alpha$ is symmetric in its arguments)
• $\alpha: \mathbf{P}_n \to \mathbb{R}_+$ is continuous, for each $n$.

This has now become a very explicit problem: find all the sequences of functions

$\Bigl( \mathbf{P}_n \stackrel{\alpha}{\to} \mathbb{R}_+ \Bigr)$

satisfying these four axioms. Fortunately, there’s already a theorem solving this problem. It was found and proved by Faddeev in the 1950s. To state Faddeev’s theorem, we need the definition of Shannon entropy. For a probability distribution $\mathbf{p} \in \mathbf{P}_n$ on a finite set, the Shannon entropy is

$H(\mathbf{p}) = -\sum_i p_i \ln(p_i)$

where by convention, $0 \ln(0) = 0$. This gives a sequence of functions

$\Bigl( \mathbf{P}_n \stackrel{H}{\to} \mathbb{R}_+ \Bigr).$

It’s easy to verify that $H$ satisfies the four axioms above. And obviously the set of $\alpha$s satisfying those four axioms is closed under multiplication by nonnegative scalars, so $c H$ satisfies them too for any $c \geq 0$. Faddeev proved the converse:

every $\alpha$ satisfying the four axioms above is a nonnegative scalar multiple of Shannon entropy.

In this sense, the internal $\mathbf{P}$-algebras in $\mathbb{R}_+$ ‘are’ the scalar multiples of Shannon entropy.
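The first axiom — the chain rule — is the substantial one, and it’s easy to verify for $H$ by hand; here is a small numerical check as well (a Python sketch; the names are mine):

```python
from math import log, isclose

def H(p):
    """Shannon entropy, with the convention 0 ln 0 = 0."""
    return -sum(x * log(x) for x in p if x > 0)

def compose(p, rs):
    """Operadic composition in P, as defined above."""
    return [p[i] * r_ij for i, r in enumerate(rs) for r_ij in r]

p = [0.2, 0.8]
rs = [[0.5, 0.5], [0.1, 0.6, 0.3]]
# first axiom (the chain rule): H(p o (r_1, ..., r_n)) = H(p) + sum_i p_i H(r_i)
lhs = H(compose(p, rs))
rhs = H(p) + sum(p[i] * H(r) for i, r in enumerate(rs))
```

The identity holds exactly, not just for this example: expand $-\sum_{i,j} p_i r_{i j} \ln(p_i r_{i j})$ and the logarithm splits the sum into the two terms on the right.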

All the theorems described in the rest of this post will depend on this one.

### The free $P$-algebra containing an internal $P$-algebra

Fix a symmetric operad $P$, say in $Set$ or $Top$. The free $P$-algebra containing an internal $P$-algebra is a $P$-algebra $D$ in $Cat$ (or $Cat(Top)$) equipped with an internal $P$-algebra $(d, \delta)$ and initial as such.

What does this mean? Well, first of all, $d$ is supposed to denote an object of $D$, and $\delta$ is a family of maps $\delta_p: p(d, \ldots, d) \to d$, one for each operation $p$ of $P$, as in the definition of internal algebra. The initiality means that for any $P$-algebra $A$ and internal $P$-algebra $(a, \alpha)$ in $A$, there is a unique (strict) map $F: D \to A$ of $P$-algebras such that

$F(d) = a, \qquad F(\delta_p) = \alpha_p \text{ for all } p.$

A slightly more abstract way to put it is that for all $P$-algebras $A$, there is an isomorphism

$\{\text{strict maps } D \to A\} \cong \{\text{internal } P\text{-algebras in } A\}.$

The two statements are equivalent by the Yoneda Lemma.

This being a universal property, it characterizes $D$ and $(d, \delta)$ uniquely up to isomorphism, if they exist. And they do exist: here’s an explicit description.

• An object of $D$ is a pair $(n, p)$ where $n \in \mathbb{N}$ and $p \in P_n$.
• A map $(m, s) \to (n, p)$ in $D$ consists of:
• a map $f: \{1, \ldots, m\} \to \{1, \ldots, n\}$ of finite sets, together with
• for each $i \in \{1, \ldots, n\}$, an element $r_i \in P_{|f^{-1}(i)|}$,
such that $s = \tau\cdot (p \circ (r_1, \ldots, r_n))$, where $\tau \in S_m$ is a certain permutation describing the way in which the fibres $f^{-1}(i)$ are interleaved.
• I’ll leave you to figure out what the $P$-action on $D$ must be.
• $d = (1, 1_P)$, and I’ll also leave you to figure out what $\delta$ must be.

A few comments are in order. First, this description might seem a bit fearsome, but it’s not so bad; I’ll give a couple of examples in a moment. Second, if we were doing non-symmetric operads then we’d require $f$ to be order-preserving, and there wouldn’t be a permutation $\tau$. Third, $D$ is only a weak $P$-algebra. It would be strict except for the symmetries. But as I said earlier, I’m going to ignore the difference between weak and strict. Fourth, if we’re doing topological operads then $D$ is a category in $Top$, and I hope you can guess what the topology is.

Examples

1. Let $P$ be the terminal non-symmetric operad, so that $P$-algebras in $Cat$ are monoidal categories and internal $P$-algebras are internal monoids. Then $D$ is the monoidal category of finite totally ordered sets (or a skeleton thereof), including the empty set. The ‘generic’ internal monoid $(d, \delta)$ in $D$ has $d = 1$, the one-element totally ordered set. It’s been well-known since at least Categories for the Working Mathematician that a monoidal functor from $D$ to another monoidal category $A$ amounts to an internal monoid in $A$.
2. Fix a monoid $M$, and let $P$ be the operad for $M$-sets. If you regard $M$ as a category with a single object $\star$, then $D$ is the slice category $M/\star$. In other words, the objects of $D$ are the elements of $M$, and a map $s \to p$ in $D$ is an element $r \in M$ such that $s = p r$. If $M$ has cancellation then $D$ is the poset of elements of $M$, ordered by divisibility.

Now for the most important example: the operad $\mathbf{P}$ for probability distributions. An object of $D$ is a finite probability space. I’ll write a typical object as $(X, \mathbf{p})$, where $X$ is a finite set and $\mathbf{p} = (p_i)_{i \in X}$ is a probability distribution on $X$. A map in $D$ is almost just a measure-preserving map. Precisely, a map $(Y, \mathbf{s}) \to (X, \mathbf{p})$ in $D$ consists of a map of sets $f: Y \to X$ together with, for each $i \in X$, a probability distribution $\mathbf{r}_i$ on the fibre $f^{-1}(i)$, such that

$s_j = p_i r_{i j}$

for each $i \in X$ and $j \in f^{-1}(i)$. But if $p_i \neq 0$ for all $i \in X$ then this formula determines $\mathbf{r}_i$ uniquely, and it follows that a map $(Y, \mathbf{s}) \to (X, \mathbf{p})$ amounts to a map of sets $f: Y \to X$ preserving measure, that is, satisfying

$p_i = \sum_{j \in f^{-1}(i)} s_j.$

I’ll write this category $D$ as $\mathbf{FP}$, for two reasons: it’s a category formed freely from the operad $\mathbf{P}$, and it’s almost the same as the category of finite probability spaces.

The $\mathbf{P}$-algebra structure on $\mathbf{FP}$ can be described as follows. Let $\mathbf{p} \in \mathbf{P}_n$. When $\mathbf{p}$ is applied to an $n$-tuple of objects

$(X_1, \mathbf{r}_1), \ldots, (X_n, \mathbf{r}_n)$

of $\mathbf{FP}$, the result is the object

$(X_1 + \cdots + X_n, \mathbf{s})$

where $s_j = p_i r_{i j}$ whenever $j \in X_i$. The action of $\mathbf{P}$ on the maps in $\mathbf{FP}$ is defined in a similar way.

The universal property of $\mathbf{FP}$ implies that

strict maps of $\mathbf{P}$-algebras $\mathbf{FP} \to \mathbb{R}_+$ correspond to internal $\mathbf{P}$-algebras in $\mathbb{R}_+$.

But Faddeev’s theorem classifies the internal $\mathbf{P}$-algebras in $\mathbb{R}_+$: they correspond to the nonnegative scalar multiples of Shannon entropy. So this says:

strict maps of $\mathbf{P}$-algebras $\mathbf{FP} \to \mathbb{R}_+$ correspond to nonnegative scalar multiples of Shannon entropy.

How they correspond can be extracted from what I’ve said so far, but let me bring it down to earth by stating it as explicitly as I know how. In it, I’ll denote maps in $\mathbf{FP}$ by Greek letters such as $\phi$, to save me from writing pairs such as $(f, \mathbf{r})$ every time.

Theorem P  Let $F$ be a function from the set of maps in $\mathbf{FP}$ to the set $\mathbb{R}_+$, satisfying the following conditions:

1. Functoriality: $F(\psi \circ \phi) = F(\psi) + F(\phi)$ for all composable maps $\psi$, $\phi$ in $\mathbf{FP}$
2. Convex combinations: $F(\lambda \phi + (1 - \lambda) \psi) = \lambda F(\phi) + (1 - \lambda) F(\psi)$ for all maps $\phi$, $\psi$ in $\mathbf{FP}$ and all $\lambda \in [0, 1]$
3. Continuity: $F$ is continuous.

Then there is some $c \geq 0$ such that for all maps $\phi: (Y, \mathbf{s}) \to (X, \mathbf{p})$ in $\mathbf{FP}$,

$F(\phi) = c(H(\mathbf{s}) - H(\mathbf{p})).$

Conversely, for any $c \geq 0$, the function $F$ defined by this formula satisfies the conditions above.

Every part of this theorem comes directly from what we know so far, with the aid of a little routine calculation. For example, the convex combinations axiom (which says that $F$ defines a map of $\mathbf{P}$-algebras) should in principle involve convex combinations of any number of maps, but an easy induction shows that two suffices. The formula $H(\mathbf{s}) - H(\mathbf{p})$ is perhaps a surprise, but again pops out if you work through the details of the correspondence.
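To see concretely why the difference form is compatible with functoriality, note that the $H$-terms telescope along composites. A quick numerical illustration (a Python sketch with $c = 1$; the names are mine):

```python
from math import log, isclose

def H(p):
    """Shannon entropy, with 0 ln 0 = 0."""
    return -sum(x * log(x) for x in p if x > 0)

def F(s, p, c=1.0):
    """Information loss c(H(s) - H(p)) of a map (Y, s) -> (X, p)."""
    return c * (H(s) - H(p))

# a composable chain (Z, t) -> (Y, s) -> (X, p):
# pair off four outcomes into two, then merge everything to a point
t = [0.1, 0.2, 0.3, 0.4]
s = [0.3, 0.7]   # 0.1 + 0.2 and 0.3 + 0.4
p = [1.0]
# functoriality holds because the H-terms telescope:
# (H(t) - H(s)) + (H(s) - H(p)) = H(t) - H(p)
```

Since $H(\mathbf{p}) = 0$ for the one-point space, the total information lost along the chain is just $H(\mathbf{t})$, the entropy of the space we started with.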

I’ve called this Theorem P. Later there will be Theorems P$'$, M and M$'$, each one characterizing entropy in some categorical way. I’ll show you how they all connect up.

### The actual category of finite probability spaces

You might not like the fact that Theorem P involves the category $\mathbf{FP}$, which isn’t quite the same as the category $FinProb$ of finite probability spaces. It’s only zero probabilities that prevent it from being the same. But we have to face the fact: it’s different.

So, can we find a Theorem P$'$ with $FinProb$ instead of $\mathbf{FP}$? Yes, we can. To do so, let’s identify what makes the two categories different.

Let $(Y, \mathbf{s})$ and $(X, \mathbf{p})$ be finite probability spaces. In both categories, a map $(Y, \mathbf{s}) \to (X, \mathbf{p})$ involves a measure-preserving map $f: Y \to X$. In $FinProb$, it’s exactly that, but in $\mathbf{FP}$, it also involves a probability distribution $\mathbf{r}_i$ on each fibre $f^{-1}(i)$, satisfying a compatibility condition. That condition determines $\mathbf{r}_i$ uniquely when $p_i \neq 0$, but not when $p_i = 0$.

Let’s say that a $\mathbf{P}$-algebra $A$ in $Cat(Top)$ is special (for want of a better word) if whenever we have $\mathbf{p} \in \mathbf{P}_n$ and maps

$\phi_1, \psi_1: a_1 \to b_1, \quad \ldots, \quad \phi_n, \psi_n: a_n \to b_n$

in $A$ satisfying the condition that

$p_i \neq 0 \quad \Rightarrow \quad \phi_i = \psi_i$

($1 \leq i \leq n$), then

$\mathbf{p}(\phi_1, \ldots, \phi_n) = \mathbf{p}(\psi_1, \ldots, \psi_n).$

This basically says that when you multiply a map $\phi: a \to b$ by zero, the result doesn’t depend on $\phi$. ‘Multiplying by zero’ doesn’t make sense in this world of convex combinations, but that’s the idea. For this reason, $FinProb$ is special and $\mathbf{FP}$ is not.

Another special $\mathbf{P}$-algebra is the monoid $\mathbb{R}_+$, since if $\mathbf{p} \in \mathbf{P}_n$ and $\mathbf{x}, \mathbf{y} \in \mathbb{R}_+^n$ with $x_i = y_i$ whenever $p_i \neq 0$, then $\sum p_i x_i = \sum p_i y_i$.

You can easily force a non-special $\mathbf{P}$-algebra to be special, by identifying $\mathbf{p}(\phi_1, \ldots, \phi_n)$ with $\mathbf{p}(\psi_1, \ldots, \psi_n)$ whenever $\phi_i = \psi_i$ in all the coordinates where $p_i \neq 0$. In other words, the inclusion

$SpecialAlg(\mathbf{P}) \hookrightarrow Alg(\mathbf{P})$

has a left adjoint. (When I speak of $\mathbf{P}$-algebras, I will always mean algebras in $Cat(Top)$.) And when this left adjoint is applied to $\mathbf{FP}$, the result is $FinProb$.

This adjointness immediately tells us that

$Hom_{Alg(\mathbf{P})}(FinProb, \mathbb{R}_+) \cong Hom_{Alg(\mathbf{P})}(\mathbf{FP}, \mathbb{R}_+).$

So by Theorem P,

strict maps of $\mathbf{P}$-algebras $FinProb \to \mathbb{R}_+$ correspond to nonnegative scalar multiples of Shannon entropy.

Or explicitly:

Theorem P$'$  Let $F$ be a function from the set of maps in $FinProb$ to the set $\mathbb{R}_+$, satisfying the following conditions:

1. Functoriality: $F(g \circ f) = F(g) + F(f)$ for all composable maps $g$, $f$ in $FinProb$
2. Convex combinations: $F(\lambda f + (1 - \lambda) g) = \lambda F(f) + (1 - \lambda) F(g)$ for all maps $f$, $g$ in $FinProb$ and all $\lambda \in [0, 1]$
3. Continuity: $F$ is continuous.

Then there is some $c \geq 0$ such that for all maps $f: (Y, \mathbf{s}) \to (X, \mathbf{p})$ in $FinProb$,

$F(f) = c(H(\mathbf{s}) - H(\mathbf{p})).$

Conversely, for any $c \geq 0$, the function $F$ defined by this formula satisfies the conditions above.

The theorem is now in a very widely understandable form: no operads, no funny categories.

### From probability spaces to measure spaces

In some sense there’s not much difference between a measure space (at least, a finite one) and a probability space: unless the measure space has total measure zero, which is pretty boring, you can always scale it so that it’s a probability space. Nonetheless, there are some advantages and disadvantages to each. In some contexts, one seems more natural than the other.

Both Theorems P and P$'$, concerning probability spaces, have analogues for measure spaces. I’ll explain them now.

To switch from probability to measure spaces, we switch operads.

The operad $\mathbf{P}$ will be replaced by an operad I’ll call $\mathbf{M}$. First note that for any monoid $G$, there’s an operad whose set of $n$-ary operations is $G^n$, and with multiplication given by

$\mathbf{g} \circ (\mathbf{h}_1, \ldots, \mathbf{h}_n) = (g_1 h_{1 1}, \ldots, g_1 h_{1 k_1},   \ldots,   g_n h_{n 1}, \ldots, g_n h_{n k_n})$

for $\mathbf{g} \in G^n$ and $\mathbf{h}_i \in G^{k_i}$. Let $\mathbf{M}$ be the operad resulting in this way from the multiplicative monoid $[0, \infty)$.

An $\mathbf{M}$-algebra in $Set$ consists of a set $X$ together with a map

$[0, \infty)^n \times X^n \to X$

for each $n$, satisfying axioms. We might write this as

$((\mu_1, \ldots, \mu_n), (x_1, \ldots, x_n)) \mapsto \sum_i \mu_i x_i.$

In fact, an $\mathbf{M}$-algebra $X$ in $Set$ amounts to a commutative monoid equipped with an action by the multiplicative monoid $[0, \infty)$, acting by monoid homomorphisms. So we do have

$\lambda(x + y) = \lambda x + \lambda y, \quad \lambda 0 = 0$

(which is what “acting by monoid homomorphisms” means), but we don’t have

$(\lambda_1 + \lambda_2) x = \lambda_1 x + \lambda_2 x, \quad 0 x = 0.$

There’s no chance of having either of these, first because the definition of the operad $\mathbf{M}$ doesn’t mention the additive structure on $[0, \infty)$, and second because these are equations of the type that an operad can’t encode: variables are duplicated or disappear from one side of the equation to the other. So, an $\mathbf{M}$-algebra is quite like a module over the rig $[0, \infty)$, but not entirely.

From now on, I’ll regard $\mathbf{M}$ as a topological operad (in the obvious way), and ‘algebras’ for it will always be (strict or weak) algebras in $Cat(Top)$, as for $\mathbf{P}$.

There’s an obvious inclusion of operads $\mathbf{P} \hookrightarrow \mathbf{M}$. This induces a functor

$Alg(\mathbf{M}) \to Alg(\mathbf{P})$

which for general reasons has a left adjoint. Write $\mathbf{FM} \in Alg(\mathbf{M})$ for the image under this left adjoint of $\mathbf{FP} \in Alg(\mathbf{P})$. Explicitly:

• an object of $\mathbf{FM}$ is a pair $(X, \mu)$ where $X$ is a finite set and $\mu$ is a measure on $X$ (with all subsets regarded as measurable), or equivalently an element of $[0, \infty)^X$
• a map $(Y, \nu) \to (X, \mu)$ in $\mathbf{FM}$ consists of a map of sets $f: Y \to X$ together with a probability measure $\mathbf{r}_i$ on the fibre $f^{-1}(i)$ for each $i \in X$, such that $\nu_j = \mu_i r_{i j}$ for each $i \in X$ and $j \in f^{-1}(i)$.

As for $\mathbf{FP}$, this category $\mathbf{FM}$ is almost the same as the category $FinMeas$ whose objects are finite sets equipped with a measure, and whose maps are the measure-preserving maps. Indeed, when each $\mu_i$ is nonzero, the maps $(Y, \nu) \to (X, \mu)$ in $\mathbf{FM}$ are exactly the measure-preserving maps. This resemblance to $FinMeas$ is the reason for the name $\mathbf{FM}$. (It is not the free $\mathbf{M}$-algebra containing an internal $\mathbf{M}$-algebra.)

Once you start thinking about it, it’s pretty obvious what the $\mathbf{M}$-algebra structure of $\mathbf{FM}$ must be: it’s the only possible thing. I won’t write it out.

The additive monoid $\mathbb{R}_+$ is naturally an $\mathbf{M}$-algebra, by taking linear combinations. Its underlying $\mathbf{P}$-algebra structure is the one we met before. So by adjointness,

$Hom_{Alg(\mathbf{M})}(\mathbf{FM}, \mathbb{R}_+) \cong Hom_{Alg(\mathbf{P})}(\mathbf{FP}, \mathbb{R}_+).$

But Theorem P tells us exactly what the right-hand side is. Hence

strict maps of $\mathbf{M}$-algebras $\mathbf{FM} \to \mathbb{R}_+$ correspond to nonnegative scalar multiples of Shannon entropy.

Again, the precise form of the correspondence follows from the details of what’s gone before. (Of course, I’m not imagining that you’re diligently writing out those details, but I am imagining that you’re imagining their existence.) When you unwind it all, you get the following explicit result. It refers to the Shannon entropy of an arbitrary measure space $(X, \mu)$, where $X$ is a finite set. This is, by definition,

$H(\mu) = \Vert\mu\Vert\ln\Vert\mu\Vert - \sum_i \mu_i \ln(\mu_i)$

where $\Vert\mu\Vert = \sum_i \mu_i$.
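As a sanity check, this formula is homogeneous of degree 1 and reduces to ordinary Shannon entropy when $\Vert\mu\Vert = 1$ — exactly the behaviour the homogeneity axiom below will demand. A small Python sketch (names mine; assumes the total measure is nonzero):

```python
from math import log, isclose

def H_meas(mu):
    """Entropy of a finite measure with nonzero total mass:
    H(mu) = ||mu|| ln ||mu|| - sum_i mu_i ln(mu_i), with 0 ln 0 = 0."""
    total = sum(mu)
    return total * log(total) - sum(x * log(x) for x in mu if x > 0)

mu = [2.0, 3.0, 5.0]
lam = 0.7
# degree-1 homogeneity: H(lam * mu) = lam * H(mu)
hom_lhs = H_meas([lam * x for x in mu])
hom_rhs = lam * H_meas(mu)
# on a probability measure, ||mu|| = 1 and the first term vanishes,
# leaving ordinary Shannon entropy
prob = [x / sum(mu) for x in mu]
shannon = -sum(x * log(x) for x in prob)
```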

Theorem M  Let $F$ be a function from the set of maps in $\mathbf{FM}$ to the set $\mathbb{R}_+$, satisfying the following conditions:

1. Functoriality: $F(\psi \circ \phi) = F(\psi) + F(\phi)$ for all composable maps $\psi$, $\phi$ in $\mathbf{FM}$
2. Additivity: $F(\phi + \psi) = F(\phi) + F(\psi)$ for all maps $\phi$, $\psi$ in $\mathbf{FM}$
3. Homogeneity: $F(\lambda\phi) = \lambda F(\phi)$ for all maps $\phi$ in $\mathbf{FM}$ and all $\lambda \in [0, \infty)$
4. Continuity: $F$ is continuous.

Then there is some $c \geq 0$ such that for all maps $\phi: (Y, \nu) \to (X, \mu)$ in $\mathbf{FM}$,

$F(\phi) = c(H(\nu) - H(\mu)).$

Conversely, for any $c \geq 0$, the function $F$ defined by this formula satisfies the conditions above.
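The additivity axiom can also be checked concretely: maps preserve total measure, so in each difference $H(\nu) - H(\mu)$ the $\Vert\cdot\Vert \ln \Vert\cdot\Vert$ terms cancel, and the differences add over direct sums. A Python sketch with two explicit merges (the names are mine):

```python
from math import log, isclose

def H_meas(mu):
    """H(mu) = ||mu|| ln ||mu|| - sum_i mu_i ln(mu_i), with 0 ln 0 = 0."""
    total = sum(mu)
    return total * log(total) - sum(x * log(x) for x in mu if x > 0)

def F(nu, mu, c=1.0):
    """F(phi) = c(H(nu) - H(mu)) for a map phi: (Y, nu) -> (X, mu)."""
    return c * (H_meas(nu) - H_meas(mu))

# phi: (1,1) -> (2) and psi: (2,3) -> (5), both measure-preserving merges;
# their direct sum is the merge (1,1,2,3) -> (2,5)
lhs = F([1.0, 1.0, 2.0, 3.0], [2.0, 5.0])
rhs = F([1.0, 1.0], [2.0]) + F([2.0, 3.0], [5.0])
# the ||.|| ln ||.|| terms cancel within each difference, so lhs == rhs
```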

If you’re still reading, maybe you can guess what the fourth and final theorem will be. It is in fact the main theorem of the (first) paper that John, Tobias and I are writing, and it involves $FinMeas$ rather than this funny category $\mathbf{FM}$.

So, imitating what we did for $\mathbf{P}$, let’s say that an $\mathbf{M}$-algebra $A$ is special if for all maps

$\phi, \psi: a \to b$

in $A$, the maps $0\phi, 0\psi: 0 a \to 0 b$ are equal. The inclusion

$SpecialAlg(\mathbf{M}) \hookrightarrow Alg(\mathbf{M})$

has a left adjoint, which sends the non-special algebra $\mathbf{FM}$ to the special algebra $FinMeas$.

The $\mathbf{M}$-algebra $\mathbb{R}_+$ is also special. So by adjointness,

$Hom_{Alg(\mathbf{M})}(FinMeas, \mathbb{R}_+) \cong Hom_{Alg(\mathbf{M})}(\mathbf{FM}, \mathbb{R}_+)$

which by Theorem M gives

strict maps of $\mathbf{M}$-algebras $FinMeas \to \mathbb{R}_+$ correspond to nonnegative scalar multiples of Shannon entropy.

Explicitly:

Theorem M$'$  Let $F$ be a function from the set of maps in $FinMeas$ to the set $\mathbb{R}_+$, satisfying the following conditions:

1. Functoriality: $F(g \circ f) = F(g) + F(f)$ for all composable maps $g$, $f$ in $FinMeas$
2. Additivity: $F(f + g) = F(f) + F(g)$ for all maps $f$, $g$ in $FinMeas$
3. Homogeneity: $F(\lambda f) = \lambda F(f)$ for all maps $f$ in $FinMeas$ and all $\lambda \in [0, \infty)$
4. Continuity: $F$ is continuous.

Then there is some $c \geq 0$ such that for all maps $f: (Y, \nu) \to (X, \mu)$ in $FinMeas$,

$F(f) = c(H(\nu) - H(\mu)).$

Conversely, for any $c \geq 0$, the function $F$ defined by this formula satisfies the conditions above.

### Tying it all together

We began with the operad $\mathbf{P}$, and the fact that the internal $\mathbf{P}$-algebras in the $\mathbf{P}$-algebra $\mathbb{R}_+$ correspond to the scalar multiples of Shannon entropy. This is Faddeev’s theorem from the 1950s, dressed up.

From there, we proved four closely related theorems: P, P$'$, M and M$'$. The purpose of this last section is to give an overview of how they’re related.

We have a commutative square of forgetful functors

$\begin{matrix} SpecialAlg(\mathbf{M}) &\hookrightarrow &Alg(\mathbf{M}) \\ \downarrow & &\downarrow \\ SpecialAlg(\mathbf{P}) &\hookrightarrow &Alg(\mathbf{P}) \end{matrix}$

(the one down the right-hand edge having been induced by the inclusion of operads $\mathbf{P} \hookrightarrow \mathbf{M}$). As always, $Alg$ refers to algebras in $Cat(Top)$.

Each of these forgetful functors has a left adjoint, giving another commutative square

$\begin{matrix} SpecialAlg(\mathbf{M}) &\leftarrow &Alg(\mathbf{M}) \\ \uparrow & &\uparrow \\ SpecialAlg(\mathbf{P}) &\leftarrow &Alg(\mathbf{P}). \end{matrix}$

There is a canonical object of $Alg(\mathbf{P})$, namely, the free $\mathbf{P}$-algebra $\mathbf{FP}$ containing an internal $\mathbf{P}$-algebra. Applying to $\mathbf{FP}$ the functors shown in this last square gives:

$\begin{matrix} FinMeas & ↤ &\mathbf{FM} \\ ↥ & & ↥ \\ FinProb & ↤ &\mathbf{FP}. \end{matrix}$

The universal property of $\mathbf{FP}$ tells us that the maps from it to $\mathbb{R}_+$ are the internal $\mathbf{P}$-algebras in $\mathbb{R}_+$, which by Faddeev’s theorem are the scalar multiples of Shannon entropy. This is “Theorem P”. Then adjointness gives us three more theorems for free:

$\begin{matrix} Theorem M' & ↤ &Theorem M \\ ↥ & & ↥ \\ Theorem P' & ↤ &Theorem P. \end{matrix}$

From this point of view, Theorem P (concerning $\mathbf{FP}$) is the fundamental theorem from which the rest follow. Theorem M$'$ (concerning $FinMeas$, and included in our paper) is, at the other extreme, the height of refinement. Nevertheless, it can be stated in a totally direct and explicit way.

Posted at May 18, 2011 5:27 AM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2401

### Re: An Operadic Introduction to Entropy

[in an earlier version of this post Tom had asked how to typeset “mapsto”-arrows that point in various directions, my reply below refers to that]

You need to explicitly code them as unicode characters: typing

  &#x21A4; &#x21A5; &#x21A6; or &#x21A7;


produces

↤ ↥ ↦ or ↧

Posted by: Urs Schreiber on May 18, 2011 6:36 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Fantastic; thanks.

Urs’s comment will make no sense to anyone who didn’t read the post as soon as he did, but I’ll leave it there as a memorial to an act of kindness.

Posted by: Tom Leinster on May 18, 2011 6:46 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I should add that this is thanks to Andrew Stacey, who dug out the codes from somewhere: I had asked him the same question just two weeks ago here.

Posted by: Urs Schreiber on May 18, 2011 7:03 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I have two questions:

you observe that there is an equivalent reformulation of some aspects of the notion of entropy of finite probability distributions in terms of algebras over some operad.

Do you have a hunch as for what that means and what it might be good for? Usually, when phrasing simple well-understood concepts in more general abstract language, one wants to use this in a next step to do something that hasn’t been done before. As in “Now that I have shown that I can re-drill that little hole with this huge power tool, I will use that power tool to drill a tunnel through the Alps.” Which tunnel would that be, do you have an idea? Can you give us a perspective?

My second question (maybe related, maybe not) is: have you thought about how your discussion might generalize to non-finite measure spaces? “Measure space” is a big word for a function on a finite set. Can one pass the discussion to the case of general measure spaces or general probability spaces?

Posted by: Urs Schreiber on May 18, 2011 6:56 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Urs wrote:

Do you have a hunch as for what that means and what it might be good for? Usually, when phrasing simple well-understood concepts in more general abstract language, one wants to use this in a next step to do something that hasn’t been done before.

Amusingly, the main thing I know the operads are good for is that they got us to discover this result, which doesn’t mention operads at all:

Theorem P$'$  Let $F$ be a function assigning a nonnegative real number to any measure-preserving map between finite sets equipped with probability measures. Suppose $F$ obeys the following axioms:

1. Functoriality: $F(g \circ f) = F(g) + F(f)$ for all composable maps $g$, $f$
2. Convex combinations: $F(\lambda f + (1 - \lambda) g) = \lambda F(f) + (1 - \lambda) F(g)$ for all maps $f$, $g$ and all $\lambda \in [0, 1]$
3. Continuity: $F$ is continuous.

Then there is some $c \geq 0$ such that for all measure-preserving maps $f : \mathbf{s} \to \mathbf{p}$:

$F(f) = c(H(\mathbf{s}) - H(\mathbf{p})).$

where $H$ is the Shannon entropy. Conversely, for any $c \geq 0$, the function $F$ defined by this formula satisfies the conditions above.

There’s a long tradition of theorems characterizing Shannon entropy, going back to Shannon himself. Different theorems have different advantages, but I think this one is the best.

I’m biased, of course! Every mother thinks their newborn child is the most beautiful of all. But I think this one explains how Shannon entropy arises automatically and inevitably whenever you want a way to measure the ‘information loss’ of a measure-preserving map between finite probability spaces.

We can state it without mentioning operads, and we can prove it without mentioning operads, but we got to it by thinking about operads. Why? Mainly because we’re grown-ups, so for a long time we’ve been wondering what that ‘convex combination’ stuff is all about — i.e., what kind of algebraic structure a convex space really is. It’s not a module of a ring, but it’s something very close: it’s an algebra of an operad. Thinking that way led Tom to the initial statement of his result on Shannon entropy as a lax point. And that result, by successive processes of simplification, led to Theorem P$'$.

It’s very much like how thinking about rings and modules can lead to a theorem about a particular module of a particular ring whose statement and even proof doesn’t require the words ‘ring’ and ‘module’.

I think of all this stuff as part of a program to clarify the foundations of statistical mechanics using category theory. We’ve proved that entropy is a functor uniquely characterized by certain properties, and elsewhere in our discussion we’ve proved the same sort of result for the partition function. The concepts of free energy and temperature are also showing up in this story. So, it’s eventually going to be clear that whenever you think about a category of systems and processes that has certain properties, like being an algebra for the Giry monad, the concepts of thermodynamics will apply. In the final stages of this program, which I will probably not stick around for, some set of mathematical concepts will emerge as the inevitable ones for doing everything really nicely. But right now it’s just getting started, so it’s natural to just use whatever math we know to try to figure out what’s going on. But if we get results that don’t need all that math to state, perhaps we should just peel those off and state them in as lowbrow a way as possible. I’m not sure.

My second question (maybe related, maybe not) is: have you thought about how your discussion might generalize to non-finite measure spaces? “Measure space” is a big word for a function on a finite set. Can one pass the discussion to the case of general measure spaces or general probability spaces?

Yes, that should work fine. In particular, there should be a version of Theorem P$'$ that drops the finiteness assumption. The obvious generalization doesn’t make sense at all: there’s no such thing as the entropy of a probability measure on a general measure space. But the correct generalization should work fine. It’ll be a bit more technical; the advantage of the finite case is that you can develop the overall strategy quicker, and then replace sums by integrals and stuff like that later.

Posted by: John Baez on May 18, 2011 9:21 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

John wrote:

I think of all this stuff as part of a program to clarify the foundations of statistical mechanics using category theory.

And I think of all this stuff as part of a programme to clarify the mathematical concept of diversity using category theory.

“Diversity” here might refer to ecological diversity, and indeed a lot of work on the mathematics of diversity has been done in that context. But it’s actually a much more general concept, which has been used in other parts of biology, economics, linguistics, soil science, …, and, I would say, really belongs to mathematics.

I want to find a theorem saying: “if you want a measure of diversity that obeys this, this and this sensible condition, then it has to be one of these.”

Taking a wider view still, I think of all that as part of a huge programme to measure the size of everything.

(There must be a joke to be made here, concerning Oscar Wilde’s description of a cynic as a man who knows the price of everything and the value of nothing. I leave that as an exercise to the reader.)

Posted by: Tom Leinster on May 18, 2011 8:26 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I see. It wasn’t clear to me that this statement was found initially by way of the operad.

But I am not sure yet that I understand this explanation of yours. You write:

I think this one explains how Shannon entropy arises automatically and inevitably whenever you want a way to measure the ‘information loss’ of a measure-preserving map between finite probability spaces.

Where is the “information loss” touched on by the statement? The central piece of the characterization seems to be the preservation of convex combinations. Why is that to be thought of as characteristic of a function that measures “information loss”?

Posted by: Urs Schreiber on May 18, 2011 9:42 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Good question. The quantity in the conclusions of the theorem, $H(\mathbf{s}) - H(\mathbf{p})$, clearly is information loss: we explain why in the introduction to our paper. And it clearly does get preserved by convex linear combinations. But why is that condition something we should expect?

The quick and annoying answer is that this condition, when unpackaged, gives the conditions that everyone starting from Shannon has used to characterize entropy. We arrived at this condition by trying to take those conditions and write them in increasingly terse and elegant ways.

But there’s probably some better, more conceptual answer. Something like this:

Say you put a physical system in a mixed state C as follows: you flip a 40%-60% coin and use the result to decide to prepare the system in either mixed state A or mixed state B. Suppose you know that applying a deterministic but irreversible operation to mixed state A will lose 7 bits of information on average. Suppose you know that applying the same operation to mixed state B will lose 5 bits of information on average. Then applying that operation to mixed state C will lose

$.4 \times 7 + .6 \times 5$

bits of information on average.

Yeah, that makes sense.

It’s worth noting that ‘Tsallis entropy’ arises when you change the rule here. So at least some people have considered other options.

Posted by: John Baez on May 18, 2011 10:46 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

The quantity in the conclusions of the theorem, $H(\mathbf{s}) - H(\mathbf{p})$, clearly is information loss: we explain why

Not sure what you mean. Isn’t this the standard century-old explanation of entropy in terms of information? Isn’t this in fact Shannon’s big insight?

What’s the new explanation that “explains how Shannon entropy arises automatically and inevitably whenever you want a way to measure the ‘information loss’”?

I understand that information loss has the convexity property, but so it seems to me rather that your statement $P'$ achieves the converse: it shows that any reasonable such map that preserves convexity must in fact measure information loss. You don’t put that in, you get that out of $P'$, it seems to me. No?

Posted by: Urs Schreiber on May 18, 2011 11:24 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Urs wrote:

Not sure what you mean.

If you read the introduction to our paper you’ll see what I meant.

What’s the new explanation that “explains how Shannon entropy arises automatically and inevitably whenever you want a way to measure the ‘information loss’”?

It’s Theorem P$\,'$. This theorem basically says that if you believe

1. The information lost by a composite of processes is the sum of the information losses at each stage.
2. The information lost when a process is applied to a convex linear combination of mixed states equals the convex linear combination of information losses for each of those states.
3. The information lost by a process depends continuously on the mixed state it’s applied to.

then you must believe information is given by Shannon entropy.

If you’re already 100% convinced that information is given by Shannon entropy and uninterested in reexamining why this must be the case, of course this is boring. But personally I think it’s quite cool how the function $- \sum_i p_i \ln p_i$ emerges from axioms 1-3. The axioms all look sort of bland and ‘linear’, while the function looks interestingly ‘nonlinear’.

Posted by: John Baez on May 18, 2011 5:56 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Indeed I did!

But I still don’t see that the assumptions in $P'$ tell us anything immediate about information. Rather what comes out of $P'$ is something that Shannon told us has to do with information, and $P'$ characterizes that instead quite differently, by convexity preservation.

I think you are arguing circularly: of course entropy does satisfy the assumptions of $P'$ – by the very content of $P'$ – and hence after the fact we know that the assumptions of $P'$ have to do with information loss – via Shannon’s work.

Maybe it’s just hard to re-read a new statement about a well-known object as if we didn’t already know that object well.

Anyway, I’ll leave it at that. It’s not such an important point.

Posted by: Urs Schreiber on May 18, 2011 6:28 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Urs wrote:

Maybe it’s just hard to re-read a new statement about a well-known object as if we didn’t already know that object well.

I think that underlies the communication problem we’re having here. I’m playing a game where I say: pretend we didn’t know about Shannon information and had a few intuitions about ‘information loss’. What would be the simplest, most innocent-looking assumptions that would force us to conclude that information loss is the difference in Shannon entropies?

Theorem P$'$ provides one answer to that question. Suppose information loss is nonnegative, additive when you compose processes, convex-linear when you take probabilistic mixtures of processes, and depends continuously on the process. Then the information loss must be the difference in Shannon entropies!

(Of course, there’s also a background assumption about what I mean by ‘process’ here.)

It’s very much like the ‘foundations of quantum mechanics’ game where people try to find simple axioms that lead to the usual formulation in terms of complex Hilbert spaces.

When we play such games, we already know the conclusion we’re trying to reach. So you might say these games are pointless: aren’t we bound to win? No, we’re not. We only ‘win’ if we find an extremely plausible set of assumptions that forces us to the conclusions we want. Then we’ve learned something new about the concept we’re studying.

Just in case it got lost: the supposedly interesting point of my original remark had nothing to do with any of this. It was that unlike previous characterizations of Shannon entropy, ours explicitly focuses on the information loss associated to a process, rather than the information associated to a state. This is where categories come in: information loss becomes a functor from a category to a category with only one object, but nonnegative reals as morphisms. And this is why our characterization is able to look different — and, I think, nicer — than previous ones.

Posted by: John Baez on May 19, 2011 1:24 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Great post, Tom! I understood it perfectly!

Well, I guess that’s not setting too high a bar for what counts as success; I’ve been working with you on this stuff for the last few weeks, so if I didn’t understand what you said something would be seriously amiss.

Here’s one question I have: should we put this theorem into our first paper?

Theorem P$'$  Let $F$ be a function assigning a nonnegative real number to any measure-preserving map between finite sets equipped with probability measures. Suppose $F$ obeys the following axioms:

1. Functoriality: $F(g \circ f) = F(g) + F(f)$ for all composable maps $g$, $f$
2. Convex combinations: $F(\lambda f + (1 - \lambda) g) = \lambda F(f) + (1 - \lambda) F(g)$ for all maps $f$, $g$ and all $\lambda \in [0, 1]$
3. Continuity: $F$ is continuous.

Then there is some $c \geq 0$ such that for all maps $f: (Y, \mathbf{s}) \to (X, \mathbf{p})$

$F(f) = c(H(\mathbf{s}) - H(\mathbf{p})).$

Conversely, for any $c \geq 0$, the function $F$ defined by this formula satisfies the conditions above.

It’s easy to state, it’s easy to understand, it doesn’t mention operads, and it easily follows from the main result we did prove.

In fact, most people would like it better than the result we proved, because 1) you don’t need to extend the concept of entropy from probability measures to measures to understand it, and 2) the additivity and homogeneity clauses of the theorem we did prove have been lumped into a single clause here: ‘Convex combinations’.

So, we could use this result to rewrite our paper in a way that makes our paper look more ‘traditional’ and more elegant to most eyes.

Of course it’s annoying to have to rewrite what could have been an almost-done paper. But we wouldn’t have to do massive rewriting if we kept in the stuff about $FinMeas$ and made the above theorem—the ‘headline’ theorem—be a corollary of the result we did prove. I guess it would take a 3-6 hour burst of insanely energetic work, the sort of thing I do routinely.

Of course if we were typical grant-seeking publish-or-perish types we would write 4 papers, separately proving Theorems $M, P, M\,'$ and $P\,'$, and then go around giving talks saying what you just said, showing how they all fit together.

Posted by: John Baez on May 18, 2011 8:25 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Hmm, Theorem $P'$ might indeed be easier to explain, so I wouldn’t be against it.

On the other hand, for future work on this I think that Theorem $M$ is the most natural one. My personal impression is that we are barely scratching the surface of something deeper. And in order to find out what that is, the formulation as Theorem $M$ has certain advantages over the other ones, two of which I would like to explain in the following.

Firstly, only in this formulation can one notice that Shannon entropy is secretly the “additivity defect”

$H(p_1,\ldots,p_n)=D(\sum_i p_i)-\sum_i D(p_i)$

of the non-linear derivation $D(x)=x\log x$. I have been mentioning this observation several times, but have so far not received any comment on it — is it unclear what I mean?
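In case a concrete check helps: here is a minimal Python sketch (the name `defect` is ours) confirming that for a normalized distribution the additivity defect of $D(x) = x \log x$ is exactly Shannon entropy, since $\sum_i p_i = 1$ and $D(1) = 0$.

```python
import math

def D(x):
    """The non-linear 'derivation' D(x) = x log x, extended by D(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def defect(p):
    """Additivity defect of D on a finite measure p: D(sum p_i) - sum D(p_i)."""
    return D(sum(p)) - sum(D(x) for x in p)

def shannon(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# For a probability distribution, sum(p) = 1 and D(1) = 0, so the defect
# collapses to -sum_i p_i log p_i, i.e. Shannon entropy.
p = [0.1, 0.2, 0.3, 0.4]
assert abs(defect(p) - shannon(p)) < 1e-12
```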

Secondly, potentially one can still see the entropy functor as a sort of decategorification: it’s a kind of decategorification which retains the morphisms, but throws away the objects by mapping the whole category down to a monoid! Then due to Eckmann-Hilton, one should expect the functor to identify the tensor product of morphisms with composition of morphisms. Looking at Theorem $M$, that’s precisely what entropy does!

I am at a conference right now until next Monday, so it will be hard to keep up with the discussion, but I will try.

Posted by: Tobias Fritz on May 18, 2011 5:06 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Shannon entropy is secretly the “additivity defect” […] is it unclear what I mean?

I believe it’s clear to me. It’s an interesting way of looking at things, though I don’t know what its consequences are.

due to Eckmann-Hilton, one should expect the functor to identify the tensor product of morphisms with composition of morphisms. Looking at Theorem M, that’s precisely what entropy does!

Posted by: Tom Leinster on May 18, 2011 8:04 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

…it’s a kind of decategorification which retains the morphisms, but throws away the objects by mapping the whole category down to a monoid!

Is there a whiff of a Grothendieck construction happening here then?

Posted by: David Corfield on May 19, 2011 10:32 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

David wrote:

Is there a whiff of a Grothendieck construction happening here then?

Well, Tom has characterized our category $FinProb$ as follows: start from the functor $\mathbf{P}: FinSet \to Set$ which assigns to a finite set $X$ the set of probability distributions on $X$, and operates on morphisms in the obvious way. It’s the functor part of the convexity monad, restricted to $FinSet$. Then an object of $FinProb$ is a pair $(X,p)$ where $X \in FinSet$ and $p \in \mathbf{P}(X)$, and a morphism $f: (X,p) \to (Y,q)$ is a function $f: X \to Y$ such that $f_*(p) = q$. So $FinProb$ is the comma category of $Set$ with respect to the functor $\mathbf{P}$ and the functor $1 \to Set$ which points to a one-element set.

You’ve certainly known this, but my question is: is this an instance of the Grothendieck construction?

Since I don’t really grasp the Grothendieck construction in generality, I don’t understand why you think that it could also show up in my observation that

…[the entropy functor is] a kind of decategorification which retains the morphisms, but throws away the objects by mapping the whole category down to a monoid!

Could you elaborate a bit?

Posted by: Tobias Fritz on May 19, 2011 4:47 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

The Grothendieck construction (at least the one I think is meant) is simply this:

Take your functor $F:C \to Set$. Consider the forgetful functor $Set_* \to Set$, where $Set_*$ is the category of pointed sets and functions. Take the (strict!) pullback $C\times_{Set} Set_* \to C$. This is the Grothendieck construction.

A similar thing is true when you replace $Set$ by $Cat$, and then you use weakly pointed functors as arrows in $Cat_*$ (basepoints are preserved up to a specified arrow).

(David already mentioned this, but perhaps it is worth saying twice)

Posted by: David Roberts on May 20, 2011 2:54 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Tobias wrote:

is this an instance of the Grothendieck construction?

Yes. I think I previously referred to $FinProb$ as the category of elements of the restricted functor $\mathbf{P}: FinSet \to Set$. “Grothendieck construction” and “category of elements” are synonyms.

Quite generally, let $\mathbf{A}$ be any category and $X: \mathbf{A} \to Set$ any functor. Its category of elements, or Grothendieck construction, is the category whose objects are pairs $(a, x)$ with $a \in \mathbf{A}$ and $x \in X(a)$, and whose maps $(a, x) \to (a', x')$ are maps $f: a \to a'$ in $\mathbf{A}$ such that $(X f)(x) = x'.$

As you say, it can be viewed as a comma category. We have functors

$\begin{matrix} & &\mathbf{A}\\ & &\downarrow X\\ 1&\stackrel{1}{\to}&Set, \end{matrix}$

and the category of elements of $X$ is the comma category from the functor $1$ to the functor $X$.

There are some variants. For example, there’s a slightly different construction for contravariant functors $\mathbf{A}^{op} \to Set$ (so that in both cases you have a covariant forgetful functor from the category of elements to $\mathbf{A}$). There’s also a more subtle construction suitable for $Cat$-valued functors, which is very much connected to the story of fibrations.

Posted by: Tom Leinster on May 20, 2011 5:18 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Posted by: David Corfield on May 20, 2011 5:59 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Let me think. An easy example is when you have a group $G$ acting on a set $X$. The group is a one object category, $B G$, with invertible morphisms. You want to represent the group action in terms of the category (groupoid) sitting above with morphisms $(g, x): x \to g\cdot x$, for $x$ in $X$. So this action groupoid is collapsed onto the group by projection. That makes the words

throws away the objects by mapping the whole category down to a monoid!

ring a bell.

But the action groupoid is the pullback of the forgetful functor $PointedSet \to Set$ along the functor $B G \to Set$ corresponding to the action.

Posted by: David Corfield on May 19, 2011 5:31 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Posted by: Tobias Fritz on May 20, 2011 6:44 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Tobias wrote:

Hmm, Theorem P′ might indeed be easier to explain, so I wouldn’t be against it.

Okay, great. So now I just need to check with Tom.

Tom: is it okay if I rewrite the paper so that the introduction starts by discussing our characterization of Shannon entropy for probability measures rather than general measures? I think most people will find this easier to follow.

Then, later in the introduction, I can say a bit about general measures.

On the other hand, for future work on this I think that Theorem M is the most natural one.

I can’t tell yet, but I like the theorem for general measures, so I want to keep it in our paper, and derive the theorem for probability measures as a corollary.

So, I’m suggesting that in the section “The Main Result” I’ll keep Theorem 1 the way it is, but then state the characterization of Shannon entropy for probability measures as a separate theorem, and point out how it follows.

Is that okay, guys?

Firstly, only this is how one can notice that Shannon entropy is secretly the “additivity defect”

$H(p_1,\ldots,p_n)=D(\sum_i p_i)-\sum_i D(p_i)$

of the non-linear derivation $D(x)=x\log x$. I have been mentioning this observation several times, but so far not received any comment on it – is it unclear what I mean?

I didn’t understand before, because I didn’t see you write a formula of the form

$f(\sum_i x_i) - \sum_i f(x_i)$

Now I understand what you’re talking about! Hmm!

I don’t want to discuss this ‘linearity defect’ in the current paper, which is quite nice already. I also don’t want to talk about $x \mapsto x \ln x$ being a ‘nonlinear derivation’ in this paper. They’re interesting ideas, but I don’t know what to do with them yet. Right now I just want to finish this paper, so I can stop thinking about it.

Nonetheless, here’s an idea about ‘linearity defects’:

If you have a function $f: G \to H$ from a group to an abelian group, it’s called a ‘1-cochain’ in group cohomology, and its ‘coboundary’ is defined to be the function $d f : G \times G \to H$ given by

$d f(x,y) = f(x) + f(y) - f(x y)$

This measures the failure of $f$ to be a homomorphism (especially if $f$ is ‘normalized’, meaning $f(1) = 0$).

So, your formula generalizes this formula for $d f$.

Posted by: John Baez on May 20, 2011 8:26 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Tom: is it okay if I rewrite the paper so that the introduction starts by discussing our characterization of Shannon entropy for probability measures rather than general measures?

Yes. I also agree with the plan of keeping the “Main Result” as it is and deducing the result on probability measures.

Posted by: Tom Leinster on May 20, 2011 8:33 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

John wrote:

[..] Is that okay, guys?

Yes! I just wish I would have some more time to contribute to this. Starting Tuesday I should be back on track.

Nonetheless, here’s an idea about ‘linearity defects’: [John explains $1$-cochains in group cohomology]

Great observation! I have been suspecting for a while that there’s a (co-)chain complex lurking around somewhere, and this is another indication for it.

The other two hints for this are:

• When we modify a non-linear derivation $D$ by a linear term to $D'(x)=D(x)+c\cdot x$ for some constant $c$, the additivity defect

$D(\sum_i p_i)-\sum_i D(p_i)$

is preserved. Maybe we can think of the linear term as the coboundary of a $0$-cochain, which disappears after taking the coboundary again, due to $\partial^2=0$.

• Given the entropy $H$ as a function on objects, we have defined the entropy of a morphism $f$ to be

$H(\mathrm{dom}(f))-H(\mathrm{cod}(f))$

which looks like the coboundary operator on a graph (thinking of the category as a graph): when $H$ is a function on vertices, this formula converts it into a function on edges. This is part of the cochain complex of the nerve of the category.
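The first point above is easy to verify numerically. A Python sketch (the name `defect` is ours): the linear term contributes $c \sum_i p_i - \sum_i c\, p_i = 0$ to the defect, so the defect, i.e. the entropy, is unchanged.

```python
import math

def D(x):
    """The non-linear derivation D(x) = x log x, extended by D(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def defect(f, p):
    """Additivity defect f(sum_i p_i) - sum_i f(p_i)."""
    return f(sum(p)) - sum(f(x) for x in p)

c = 2.7                                  # any constant
Dprime = lambda x: D(x) + c * x          # modify D by a linear term

# The linear part is itself additive, so it drops out of the defect.
p = [0.2, 0.5, 1.3]                      # a general non-normalized measure
assert abs(defect(D, p) - defect(Dprime, p)) < 1e-12
```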

Posted by: Tobias Fritz on May 21, 2011 9:45 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Regarding cochain-entropy connections, Bruce Ebanks writes in General solution of the 2-cocycle functional equation on solvable groups about the cocycle equation for a function $F: G \times G \to V$,

$F(x, y) + F(x y, z) = F(x, y z) + F(y, z), \qquad (1)$

that

The cocycle equation (1) appears in the context of information theory in Forte and Daróczy, where it is used to characterize Shannon’s entropy. There the variables represent probabilities, and equation (1) holds on a proper subset of the first quadrant in the Euclidean plane. The cocycle equation is the key to determining whether an information measure has the branching property.

 B. Forte and Z. Daróczy, A characterization of Shannon’s entropy, Boll. Unione Mat. Ital. (4) 1 (1968), 631–635.

So $F$ is a 2-cocycle for the trivial action of $G$, the positive reals with multiplication, on $V$ the reals under addition? What does that tell us?

Posted by: David Corfield on May 24, 2011 5:19 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Ha! I had written down this formula for the binary entropy of a (non-normalized) measure space as item (4) in this comment:

$H(a,b)+H(a+b,c)=H(b,c)+H(a,b+c)$

Here, $G$ is the monoid $(\mathbb{R}_{\geq 0},+)$. I did not know that this comes up in group cohomology. (Or, as in this case, monoid cohomology.)
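One can confirm the cocycle identity numerically via the additivity-defect formula for binary entropy (a Python sketch; `H2` is our name for the binary entropy of a non-normalized measure):

```python
import math

def D(x):
    """D(x) = x log x, extended by D(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def H2(a, b):
    """Binary entropy of the non-normalized measure (a, b): the additivity defect of D."""
    return D(a + b) - D(a) - D(b)

# The 2-cocycle identity over the additive monoid (R_{>=0}, +):
# both sides equal D(a+b+c) - D(a) - D(b) - D(c), since the D(a+b)
# (resp. D(b+c)) terms cancel in pairs.
a, b, c = 0.3, 1.1, 2.5
lhs = H2(a, b) + H2(a + b, c)
rhs = H2(b, c) + H2(a, b + c)
assert abs(lhs - rhs) < 1e-12
```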

In fact it came up again today in an email conversation with John and Tom about what exactly to include in our paper, and Tom called it “cocycle-y”.

So do Forte and Daróczy tell us anything more about this in the context of Shannon entropy? Or maybe this paper, which is reference  of Ebanks:

C. T. Ng, Representations for measures of information with the branching property, Inform. and Control 25 (1974), 45–56.

(Unfortunately, this is behind a paywall which I cannot cross, while the paper by Forte and Daróczy may not exist in digital form…)

So what is the “branching property” of an information measure?

Posted by: Tobias Fritz on May 24, 2011 6:47 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

If I’m reading the bottom of p. 220 of this correctly, $H^n(\mathbb{R}_{\geq 0}, \mathbb{R}) = 0$ for all $n \geq 1$. So why is entropy, the coboundary of $f(x) = x \log x$, of interest? It’s a 2-cocycle, of course, but so what?

Posted by: David Corfield on May 25, 2011 1:34 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

The reference seems to be saying the following: if $(S,\circ)$ is a monoid which contains a ‘zero element’ $0\in S$, which presumably means that $0\circ s=0$ for any $s\in S$, then $H^n(S,A)=0$ for any $n\geq 2$ and any $S$-module $A$. Correct?

So which monoid structure on $\mathbb{R}_{\geq 0}$ are we talking about here when saying that $H^n(\mathbb{R}_{\geq 0},\mathbb{R})=0$? In order for such a zero element to exist, we better consider the monoid $(\mathbb{R}_{\geq 0},\cdot)$, i.e. a monoid under multiplication.

However, binary Shannon (and Tsallis) entropy $H_{\alpha}(p_1,p_2)$ is a $2$-cocycle over the monoid $(\mathbb{R}_{\geq 0},+)$! In this case, there is no zero element, so the statement in the reference doesn’t apply.

But it may still be the case that the cohomology vanishes, so that we can conclude that the $2$-cocycle $H_{\alpha}(p_1,p_2)$ is actually a coboundary. If so, one could possibly use this to find a more categorical proof of our entropy characterization theorems, as you had called for here.

Posted by: Tobias Fritz on May 25, 2011 2:36 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Oh yes, the additive monoid $\mathbb{R}_{\geq 0}$. But I still don’t see what’s special about $f(x) = x \log x$. Why don’t I take, say, $g(x) = \exp(x)$? Then its coboundary is $G(x, y) = \exp(x) + \exp(y) - \exp(x + y)$, and that’s, of course, a fine 2-cocycle.

It would be good to get a look at the paper in this comment which shows how the 2-cocycle equation characterizes Shannon’s entropy. There must be another condition.

Posted by: David Corfield on May 25, 2011 3:58 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

As we know, $f(x)=x\log x$ is special because it has the derivation property:

$f(xy)=x f(y)+f(x)y$

If we regard $f$ as $1$-cochain over the multiplicative monoid $(\mathbb{R}_{\geq 0},\cdot)$ with values in the bimodule $\mathbb{R}$, this equation states that $f$ is a cocycle!

So somehow we start from a $1$-cocycle $f$ over $(\mathbb{R}_{\geq 0},\cdot)$ and turn it into a $2$-cocycle over $(\mathbb{R}_{\geq 0},+)$ by applying the coboundary operator corresponding to the additive monoid.
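Both steps are easy to confirm numerically, under the convention $f(0) = 0$ (a Python sketch; the names are ours):

```python
import math

def f(x):
    """f(x) = x log x, extended by f(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

# 1-cocycle over (R_{>=0}, *) with bimodule coefficients: the derivation
# (Leibniz) property f(xy) = x f(y) + f(x) y.
x, y = 1.7, 3.2
assert abs(f(x * y) - (x * f(y) + f(x) * y)) < 1e-12

# Applying the coboundary of the *additive* monoid to f yields the binary
# entropy 2-cocycle H(a, b) = f(a + b) - f(a) - f(b).
H2 = lambda a, b: f(a + b) - f(a) - f(b)
a, b, c = 0.3, 1.1, 2.5
assert abs(H2(a, b) + H2(a + b, c) - H2(b, c) - H2(a, b + c)) < 1e-12
```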

I’m getting a strong urge to study distributive laws for operads.

Posted by: Tobias Fritz on May 25, 2011 6:31 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

In the case of the multiplicative monoid $\mathbb{R}_{\geq 0}$ acting on $\mathbb{R}$ by multiplication, won’t the coboundary of $f$ be

$\partial f(x, y) = x f(y) - f(x y) + f(x)$

according to the general formula? So that $f(x) = x \log x$ is not a cocycle.

Posted by: David Corfield on May 26, 2011 9:05 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Sorry, I should have been more explicit in my previous comment. I was not talking about ordinary monoid cohomology of a monoid $G$ with coefficients in a $G$-module $A$, but about monoid cohomology of a monoid $G$ with coefficients in a $G$-$G$-bimodule $A$. So now $G$ acts on $A$ both from the left and from the right, and these two actions commute.

I haven’t actually seen this spelled out anywhere for groups or monoids; but this is what’s commonly done in the case of Hochschild cohomology for rings and algebras; see e.g. page 10 of The Cohomology of Noncommutative Spaces. So I gather that for a function $f:G^n\rightarrow A$, where $A$ is a $G$-$G$-bimodule, one can define its coboundary as

$(\delta f)(g_1,\ldots,g_{n+1})=g_1\cdot f(g_2,\ldots,g_{n+1})+\sum_{i=1}^n (-1)^i f(g_1,\ldots,g_i g_{i+1},\ldots, g_{n+1})+(-1)^{n+1} f(g_1,\ldots,g_n)\cdot g_{n+1}$

So this is a bit more symmetrical than the usual formula for the coboundary in group cohomology, since the last term is analogous to the first one.

For $n=1$, this expression vanishes precisely when $f$ has the derivation property.

Posted by: Tobias Fritz on May 26, 2011 11:44 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Ebanks’ own Branching measures of information on strings gives you a clue. He says the branching property is discussed in On measures of information and their characterizations by Aczél and Daróczy.

Posted by: David Corfield on May 24, 2011 10:14 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I wonder what this all means. Are we going to have to learn about operadic cohomology?

As for the branching property, from what I can glean from Inset information measures on open domains, it’s what’s being picked out by the ‘magical property’ of Shannon entropy.

Posted by: David Corfield on May 25, 2011 9:30 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

For what it’s worth, I learnt today (from Pierre Baudot) that the connection between entropy and cohomology is in fact due to Kontsevich! Some of the things that we have been discussing here can be found on p.43-44 of this paper — and there’s also other cool stuff that we didn’t touch upon! The gist is that cohomological thinking lets one solve the functional equations relevant for characterizing entropy in a very natural and straightforward manner.

Posted by: Tobias Fritz on August 26, 2014 6:35 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

A small correction: the connection between the functional equation for (binary) entropy and cohomology already appears in this paper of Cathelineau, which is a bit older than Kontsevich’s work.

Posted by: Tobias Fritz on August 26, 2014 6:51 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Is there any possibility of getting a category theoretic handle on Fadeev’s proof?

Some useful perspective on this might be possible with access to both Čencov’s proof in his book that, up to a constant factor, there’s a unique Riemannian metric which is invariant under the maps of your $\mathbf{FP}$, and L. L. Campbell’s extension of the result to $\mathbf{FM}$ in An Extended Čencov Characterisation of the Information Metric. Campbell tells us that not only is he extending Čencov’s result, but that also unlike in the latter’s book

neither the statement of our result nor the proof use the language of category theory.

As I keep harping on about, I feel sure Campbell’s result must be linked to yours. He’s working with your category $\mathbf{FM}$, and shows that the only sensible metrics on the manifolds of measures over each set are variants of the Fisher metric. He also notes that his method of proof is

reminiscent of Khinchin’s characterization theorem for entropy.

Khinchin and Fadeev seem to get mentioned together often as characterizers of entropy.

Posted by: David Corfield on May 18, 2011 9:11 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

David wrote:

Is there any possibility of getting a category theoretic handle on Fadeev’s proof?

That I would like to know, too! For something like Fadeev’s original proof, it’s going to be difficult if not impossible. It crucially relies on the following characterization of the logarithm due to Erdős:

Suppose $f:\mathbb{N}\rightarrow\mathbb{R}$ is a function such that $f(mn)=f(m)+f(n)$ and such that $\lim\inf_n (f(n+1)-f(n))\geq 0$. Then $f$ is a multiple of $n\mapsto\log n$.

One can weaken the second hypothesis a little bit, but in any case what one needs to use in a proof along the lines of Fadeev’s is a characterization of the logarithm like this one.
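To see why the second hypothesis is needed, note that functions satisfying $f(mn)=f(m)+f(n)$ (sometimes called completely additive) other than multiples of the logarithm do exist: just pick arbitrary values on the primes. A quick numerical sketch in Python (all names are mine) checks both hypotheses for $\log$ and exhibits a completely additive function violating the limit condition:

```python
import math

def factorize(n):
    """Prime factorization of n as a dict {prime: exponent}."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def additive_from_primes(prime_values):
    """A completely additive f built from arbitrary values on the primes;
    f(mn) = f(m) + f(n) holds automatically."""
    def f(n):
        return sum(e * prime_values.get(p, 0.0) for p, e in factorize(n).items())
    return f

# math.log satisfies both hypotheses of the lemma:
for m in range(2, 30):
    for n in range(2, 30):
        assert abs(math.log(m * n) - math.log(m) - math.log(n)) < 1e-12

# A pathological additive f (f(2) = 1, f(p) = 0 on all other primes) shows
# why the second hypothesis matters: 2^k - 1 is odd, so
# f(2^k) - f(2^k - 1) = k, which does not tend to 0.
f = additive_from_primes({2: 1.0})
gaps = [f(2 ** k) - f(2 ** k - 1) for k in range(3, 10)]
assert gaps == [float(k) for k in range(3, 10)]
```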

Of course this doesn’t mean that one cannot find categorical proofs along different lines.

Posted by: Tobias Fritz on May 18, 2011 10:13 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

David wrote:

Is there any possibility of getting a category theoretic handle on Fadeev’s proof?

Unfortunately I’ve never read Fadeev’s proof! Maxim Budaev has kindly given us a link to Fadeev’s original paper in Russian — but alas, I don’t read Russian. The paper has been translated into German, which I can sometimes read:

• D. K. Fadeev, Zum Begriff der Entropie eines endlichen Wahrscheinlichkeitsschemas, Arbeiten zur Informationstheorie I, Berlin, Deutscher Verlag der Wissenschaften, 1957, pp. 85-90.

but I haven’t managed to get ahold of this.

Rényi’s paper provides a very helpful clue. He claims to give, at the very start, a simpler proof of Fadeev’s key technical lemma:

Lemma: Suppose $f : \mathbb{N}^+ \to \mathbb{R}$ obeys $f(n m) = f(n) + f(m)$ and $\lim_{n \to \infty} f(n+1) - f(n) = 0.$ Then for some constant $c$, $f(n) = c \ln n$ for all $n$.

The proof is not thrilling and would probably not benefit from the application of category theory.

Annoyingly, Rényi doesn’t say how this lemma is used to prove Fadeev’s theorem! I bet I could figure it out with a little work… it’s embarrassing that I haven’t done so yet… hmm, maybe I just did.

I think the basic idea is this. Suppose $I$ is a function obeying the hypotheses of Fadeev’s theorem:

Theorem. Suppose $I$ is a map sending lists of probabilities $(p_1, \dots, p_n)$ summing to 1 to nonnegative real numbers. Suppose $I$ obeys these three axioms:

1. $I$ is invariant under permutations.

2. Whenever $0 \le t \le 1$,

$I(t p_1, (1-t)p_1, p_2, \dots, p_n) = I(p_1, \dots, p_n) + p_1 I(t, 1-t)$

3. $I$ is continuous.

Then $I$ is a constant nonnegative multiple of Shannon entropy. In other words, for some $c \ge 0$,

$I(p_1, \dots, p_n) = - c \sum_i p_i \, \ln p_i$

Let’s define $f(n)$ to be $I$ of a probability distribution with $n$ equally likely alternatives:

$f(n) = I(1/n, \dots, 1/n)$

We want this to be proportional to $\ln n$. I see how to use Fadeev’s axioms 1 and 2 to show

$f(n m) = f(n) + f(m)$

I imagine he also uses the continuity axiom 3 to prove

$\lim_{n \to \infty} f(n+1) - f(n) = 0.$

though I don’t see how. If so, Fadeev’s theorem will kick in and show that indeed, $f(n)$ is a constant times $\ln n$.

The point is that Fadeev’s axioms 1 and 2 are equivalent to a superficially much more general rule for how to compute $I$ of $P \circ (Q_1, \dots, Q_n)$. Here $P \circ (Q_1, \dots, Q_n)$ is the result of using a probability measure $P = (p_1, \dots, p_n)$ on an $n$-element set to glom together $n$ different probability measures $Q_1, \dots, Q_n$. Fadeev’s axioms 1 and 2 imply

(1)$I(P\circ(Q_1, \ldots, Q_n)) = I(P) + \sum_{i=1}^n p_i I(Q_i)$

Note that we can construct the probability measure

$(1/n m, \dots, 1/n m)$

by glomming together $n$ equally weighted copies of

$(1/m, \dots , 1/m)$

So, equation (1) implies

$f(n m) = f(n) + f(m).$

Voilà!
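The chain rule (1) and the glomming argument above are easy to check numerically for Shannon entropy. A minimal sketch in Python (the names `shannon`, `glom` and `rand_dist` are mine, not notation from the thread):

```python
import math
import random

def shannon(p):
    """Shannon entropy (natural logarithm) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def glom(P, Qs):
    """P ∘ (Q_1, ..., Q_n): weight each Q_i by p_i and concatenate."""
    return [p_i * q for p_i, Q in zip(P, Qs) for q in Q]

def rand_dist(n):
    """A random probability distribution on n points."""
    w = [random.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

random.seed(0)
P = rand_dist(3)
Qs = [rand_dist(2), rand_dist(4), rand_dist(3)]

# Equation (1): I(P ∘ (Q_1, ..., Q_n)) = I(P) + sum_i p_i I(Q_i)
lhs = shannon(glom(P, Qs))
rhs = shannon(P) + sum(p_i * shannon(Q) for p_i, Q in zip(P, Qs))
assert abs(lhs - rhs) < 1e-12

# Glomming n copies of the uniform m-point distribution with uniform weights
# gives the uniform nm-point distribution, so f(nm) = f(n) + f(m):
f = lambda k: shannon([1.0 / k] * k)
n, m = 4, 6
assert abs(f(n * m) - f(n) - f(m)) < 1e-12
```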

I don’t see how to improve this stuff using more category theory. It’s true that we are secretly using the poset of natural numbers where $a \le b$ if $a$ divides $b$. This poset underlies some of the theory of Dirichlet series of ‘completely additive functions’, i.e. those obeying

$f(n m) = f(n) + f(m).$

You may remember Rota’s work on Möbius inversion for posets, which later led to the discovery of Leinster measure for categories. Rota’s work was a generalization of some classic ideas on Möbius inversion for Dirichlet series. But all these ideas seem to be used in a fairly minor way in Fadeev’s argument.

I could be wrong; maybe there’s something more here.

Posted by: John Baez on May 18, 2011 10:25 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

David wrote:

As I keep harping on about, I feel sure Campbell’s result must be linked to yours.

You sound convincing. One of us should look into that when the dust settles.

Khinchin and Fadeev seem to get mentioned together often as characterizers of entropy.

Yes, they’re very similar. If I remember correctly, Fadeev improved an earlier result of Khinchin.

Posted by: John Baez on May 18, 2011 10:59 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I think that’s right. Khinchin seems to have been the first to take Shannon’s work and extract the pure mathematics of it from the engineering context. He wrote that Shannon’s pioneering 1948 paper wasn’t entirely mathematically rigorous, either, and that he was fixing that.

I think Khinchin’s characterization involved the hypothesis that, for each finite set, the uniform distribution on that set maximizes entropy. Later Fadeev saw how to whittle down the hypotheses further, dropping that one.

It’s all in the book of Aczél and Daróczy that keeps being mentioned here.

Posted by: Tom Leinster on May 18, 2011 8:12 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Regarding the link to Cencov-Campbell, it could be something like the conditions of your theorem are requiring that morphisms be sent to conditional Shannon entropies. This entails that the morphisms are acting to preserve the smooth structure in the neighbourhood of their domain, that is, they induce isometries of the Fisher metric.

The requirement that your morphisms induce isometries requires that the metric be the Fisher metric (Cencov’s result). But other entropies (Tsallis, Renyi) have the Fisher metric as second derivative. So the conditions of your theorem must be singling out the Shannon entropy.

I wonder then if you separate out what is forcing a conditional entropy, what is forcing the entropy to have the Fisher metric as second derivative, what is forcing the entropy to be in the Tsallis family, and what is forcing it to be the Shannon entropy.

Posted by: David Corfield on May 19, 2011 11:46 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

This is very nice! I hadn’t gotten around to reading any of the links which explain what operads have to do with the story, but you explained it very nicely. A couple of things puzzle me though:

1. You said that the free P-algebra containing an internal P-algebra, with your explicit description, is only a weak P-algebra, but you asserted its universal property strictly, by way of an isomorphism that talks about strict maps. This sort of thing always makes me queasy; usually it seems like strict maps between weak objects are not correct. What’s going on?

2. I think there’s something odd about the notion of special $\mathbf{P}$-algebra. Certainly, it feels somewhat ad hoc from the operadic point of view. In particular, why is the zero condition imposed for the action on morphisms, but not for the action on objects?

Posted by: Mike Shulman on May 18, 2011 8:45 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

1. It’s possible that I haven’t got this quite right, but what I had in mind was that the $P$-algebra $A$ in $Cat$ against which we’re testing would be a strict $P$-algebra. (I didn’t say this, though.)

For example, if $P$ is the terminal non-symmetric operad then although $D =$(finite totally ordered sets) is a non-strict monoidal category, the monoidal category $A$ into which we’re mapping is supposed to be strict. The statement would be that strict monoidal functors $D \to A$ are the same as monoids in $A$. (“The same” refers to an equivalence of categories.) That’s right, isn’t it?

It would make me queasy the other way round: if we were talking about strict maps from something strict to something weak. (We probably wouldn’t expect there to be any.)

2. I agree that “special” is an odd and ad hoc condition. It’s also decidedly non-operadic, in the sense that equations such as “$0\cdot f = 0\cdot g$” don’t have the same variables on each side. I chose it for entirely contingent reasons.

One reason for not imposing it on objects is that that wouldn’t be true in $FinProb$ or $FinMeas$. Let’s take $FinMeas$. Multiplying a measure space $(X, \mu)$ by a scalar $t$ gives the space $(X, t\mu)$. In particular, multiplying $(X, \mu)$ by $0$ gives the measure space $(X, 0)$. But that’s not isomorphic to $(X', 0)$ unless $X \cong X'$. Similarly, for $FinProb$, we don’t in general have $0\cdot(X, p) + 1\cdot(Y, q) \cong 0\cdot(X', p') + 1\cdot(Y, q).$

Posted by: Tom Leinster on May 18, 2011 9:40 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Yes, I agree, you can have strict maps from a weak gadget to a strict gadget. But a weak gadget won’t be determined up to isomorphism (or, in general, even up to equivalence, although it might be so in this case) by just knowing its strict maps into strict gadgets.

The basic statement of the monoidal-category version that I recall from CWM is that the skeletal category $\Delta_+$ of finite totally ordered sets, which is strict monoidal, is free on a monoid with respect to strict monoidal functors into strict monoidal categories (in the sense of an isomorphism of categories). This characterizes it up to isomorphism. I think we can also conclude, using the coherence theorem for monoidal categories and the “cofibrancy” of $\Delta_+$, that it has an analogous universal property with respect to weak monoidal functors into weak monoidal categories (in the sense of an equivalence of categories), which characterizes it up to equivalence. Since its non-skeletal version is equivalent to it, that non-skeletal version shares the latter up-to-equivalence universal property.

But I don’t immediately see a universal property for the non-skeletal version that is about strict monoidal functors into strict monoidal categories.

Posted by: Mike Shulman on May 18, 2011 11:30 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

a weak gadget won’t be determined up to isomorphism (or, in general, even up to equivalence, although it might be so in this case) by just knowing its strict maps into strict gadgets.

Yes, good point. So perhaps I should have been talking about weak maps all along.

Posted by: Tom Leinster on May 18, 2011 11:40 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

But since we’re talking about probability and measure, shouldn’t we be allowed to ignore things that happen on sets of measure zero? (Did this come up somewhere earlier in this conversation?) My inclination in defining the category of finite probability (or measure) spaces would be to allow functions which are defined only almost everywhere (= everywhere except on a set of measure zero) and consider two functions equal if they agree almost everywhere. I think this category would also be “special on objects”. For instance, any measure space $X$ with total measure zero will be isomorphic to the empty set, since we have empty functions $\emptyset \to X$ and $X\to \emptyset$ which are defined almost everywhere, and whose composites in either direction are equal to the identity almost everywhere.

Moreover, I claim that this category, and the stronger notion of “special”, appear more naturally from the operadic point of view. There is a notion of semi-cartesian operad, which is an operad in which one is allowed to discard, but not duplicate, inputs. More precisely, a semi-cartesian operad $P$ is equipped with functions $j_\ast \colon P_n \to P_m$ for every injection $j\colon n\hookrightarrow m$, and for which the composition is suitably equivariant. We think of $j_\ast$ as saying “take an $n$-ary operation $\alpha$, and make it into an $m$-ary operation by discarding some of the $m$ inputs and putting the rest into $\alpha$.”

I mentioned semicartesian operads a while ago for a different reason. I didn’t say anything about their algebras then, but those are easy to define: an algebra for a semicartesian operad $P$ is an ordinary algebra for the underlying operad with the additional property that $\begin{matrix} P_n \times X^m & \to & P_n\times X^n \\ {j_\ast \times 1}\downarrow & &\downarrow \\ P_m \times X^m & \to & X \end{matrix}$ commutes for any injection $j$, where the top arrow is the projection $X^m \to X^n$ induced by $j$. In other words, the intuitive description of $j_\ast\colon P_n \to P_m$ is actually what it does on the algebra.

Now the probability-distributions operad $\mathbf{P}$ is semicartesian in an obvious way: the injections just insert zeros. And I think that by the same logic that gives us your description of the free $P$-algebra containing an internal $P$-algebra for an ordinary operad $P$, we can describe the free $P$-algebra containing an internal $P$-algebra for a semicartesian operad $P$ as follows:

• its objects are again pairs $(n,p)$, where $n\in \mathbb{N}$ and $p\in P_n$.

• A map $(m,s) \to (n,p)$ consists of:

• An injection $j\colon k \hookrightarrow m$ and a $t \in P_k$ such that $j_\ast(t) = s$ (this takes the place of your permutation $\tau$),

• A map $f\colon k \to n$ of finite sets, and

• For each $i\in n$, an element $r_i \in P_{|f^{-1}(i)|}$, such that

• $t = p \circ (r_1, \dots ,r_n)$.

• Equality between maps $(j,t,f,(r_i))$ and $(j',t',f',(r'_i))$ from $(m,s)$ to $(n,p)$ is the equivalence relation generated by the existence of an injection $u\colon k \hookrightarrow k'$ compatible with all the structure on both sides.

Now when $P=\mathbf{P}$ is the semicartesian probability-distributions operad, for any $s\in \mathbf{P}_m$ there is a unique smallest $k$ such that there exists an injection $j\colon k\hookrightarrow m$ and a $t\in \mathbf{P}_k$ with $j_\ast(t)=s$. Namely, $k$ is the number of nonzero probabilities occurring in $s$, and $t$ consists exactly of those nonzero probabilities. Therefore, in the free $\mathbf{P}$-algebra containing an internal $\mathbf{P}$-algebra, we can describe a morphism $(m,s) \to (n,p)$ as

• A map $f\colon m' \to n$, where $m'$ is the subset of $m$ consisting of the points with nonzero measure, and

• For each $i\in n$, an element $r_i \in P_{|f^{-1}(i)|}$, such that

• $s' = \tau \cdot p \circ (r_1, \dots ,r_n)$ for some permutation $\tau$, where $s'$ is the restriction of $s$ to the points with nonzero measure.

I think this is precisely an almost-everywhere-defined measure-preserving function, considered modulo almost-everywhere equality.

(Of course, a “fully cartesian operad”, in which we can duplicate as well as discard inputs, is the same as a Lawvere theory, and hence also the same as a finitary monad on Set. Might this have something to do with your other characterization of $FinProb$ in terms of the Giry monad?)

Posted by: Mike Shulman on May 19, 2011 12:51 AM | Permalink | PGP Sig | Reply to this

### Re: An Operadic Introduction to Entropy

That’s an interesting thought, and I don’t know quite what to make of it.

On the one hand, frameworks in which the “almost everywhere” condition pops up naturally (without having been deliberately incorporated) are certainly interesting. Semicartesian operads are a reasonable concept, I think, and in fact I’ve brushed up against them in a couple of related contexts quite recently. One was my post on the characterization of the $p$-norms by Guillaume Aubrun and Ion Nechita. In a nutshell, the point is that

$\Vert (x_1, \ldots, x_n, 0) \Vert_p = \Vert(x_1, \ldots, x_n) \Vert_p$

but in general

$\Vert (x_1, \ldots, x_{n - 2}, x_{n - 1} + x_n) \Vert_p \neq \Vert (x_1, \ldots, x_n) \Vert_p.$

So injections and non-injections genuinely have different properties.
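This asymmetry is easy to see concretely. A tiny Python sketch (names are mine): padding a vector with a zero (an injection) preserves the $p$-norm, while adding two coordinates together (a non-injection) generally does not.

```python
def p_norm(xs, p):
    """The p-norm of a finite sequence."""
    return sum(abs(x) ** p for x in xs) ** (1.0 / p)

x = [3.0, 1.0, 2.0]
p = 3

# Appending a zero coordinate leaves the norm unchanged ...
assert abs(p_norm(x + [0.0], p) - p_norm(x, p)) < 1e-12

# ... but merging the last two coordinates changes it:
merged = [3.0, 1.0 + 2.0]
assert abs(p_norm(merged, p) - p_norm(x, p)) > 0.1
```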

I also found a different way in which “almost everywhere” arises naturally. Over here, I defined a category $FinMeasFun$ whose objects are finite sets equipped with both a measure and a function to $[0, \infty)$. To put it a bit more categorically: there’s a covariant functor

$Meas: FinSet \to Set$

assigning to each finite set the set of (nonnegative) measures on it, and defined on morphisms by pushforward, and there’s also a contravariant functor

$Fun: FinSet^{op} \to Set$

assigning to each finite set the set of functions from it to $[0, \infty)$. Then $FinMeasFun$ is the pullback over $FinSet$ of the categories of elements of these two functors.

The joke is that a measure and a $[0, \infty)$-valued function on a finite set are the same thing. So if you have an object $(X, \mu)$ of $FinMeas$, you get an object $(X, \mu, \mu)$ of $FinMeasFun$. It turns out that a map

$(X, \mu, \mu) \to (Y, \nu, \nu)$

in $FinMeasFun$ is what might be called an equivalence of measure spaces: it’s a measure-preserving map $f: (X, \mu) \to (Y, \nu)$ for which there exist subspaces $\bar{X} \subseteq X$, $\bar{Y} \subseteq Y$ of full measure such that $f^{-1}\bar{Y} = \bar{X}$ and $f$ restricts to a bijection $\bar{X} \to \bar{Y}$.

On the other hand, I’m not super keen on introducing semicartesian operads. They seem a bit tailored to the job. They’re closely related to what I think are called “semicartesian monoidal categories”, i.e. monoidal categories in which the unit object is terminal. (If my memory is right, then presumably that’s the reason why the name was chosen.) Again this seems a bit of an odd condition.

Posted by: Tom Leinster on May 20, 2011 6:15 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

You’re right that semicartesian monoidal categories are the origin of the name “semicartesian operad,” but I don’t think we should let any prejudice against the former influence our opinion of the latter. The intuition behind a semicartesian operad is, I think, really very simple and natural. Whereas in an ordinary operad we cannot discard or duplicate inputs, and in a Lawvere theory we can do both, in a semicartesian operad we can discard but not duplicate. Linear logic people would talk about “allowing weakening but not contraction.”

It just so happens that when we formalize generalized multicategories in terms of monads, the monad which corresponds to semicartesian operads is the one whose algebras are semicartesian monoidal categories. But there are other places, too, where a monad is important for a particular purpose that doesn’t involve consideration of its algebras.

It is an intriguing question, though, why semicartesian operads have anything to do with finite measure spaces. For instance, the probability-distributions operad is actually not just semicartesian, but a Lawvere theory: we can duplicate inputs. So what is the free $\mathbf{P}$-algebra containing an internal $\mathbf{P}$-algebra when we regard $\mathbf{P}$ as a Lawvere theory? It seems to me that in this case, a morphism between finite probability spaces will be allowed not only to be defined only almost everywhere, but to “split” the probability of a point in the domain between the probabilities of two or more points in the codomain, rather than just mapping it to a single such point. So our choice not to go all the way to that extreme reflects a desire to keep the points of probability spaces “atomic”—while a choice to use semicartesian operads exactly reflects a desire to allow those points to be “ignored” if they have zero probability, but under no circumstances to be “split”.

So maybe in this sense semicartesian operads are “tailored to the purpose” of making almost-everywhere pop out. Maybe the real thing to take away is that if we didn’t want to identify things that agree almost everywhere, then the morphisms in $\mathbf{F P}$, rather than $FinProb$, would be the “right” ones to think about between finite probability spaces!

Posted by: Mike Shulman on May 20, 2011 11:06 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Regarding your point 2, here’s something that puzzles me.

The way I presented it in my post, the category $\mathbf{FP}$ arises naturally from the operad $\mathbf{P}$, but the category $FinProb$ doesn’t: it’s only obtained after the introduction of the ad hoc condition of specialness. However, $FinProb$ does arise naturally — but in a different way.

More specifically, it arises naturally from a monad $\mathbf{T}$ on $Set$. The endofunctor part sends a set $A$ to the set $\mathbf{T}(A)$ of finitely-supported probability distributions on $A$. The underlying operad of the monad $\mathbf{T}$ is $\mathbf{P}$. This monad is what we’ve sometimes been calling the (finitary) Giry monad; its algebras are convex spaces.

To see how the monad $\mathbf{T}$ gives rise naturally to the category $FinProb$, first throw away its unit and multiplication, leaving just the endofunctor of $Set$. Now restrict it to a functor $FinSet \to Set$. (This isn’t actually throwing anything away, as long as you remember that it’s a finitary endofunctor.) The category of elements of this functor $FinSet \to Set$ is $FinProb$.

I have no idea how these two descriptions of $FinProb$ can be related. For example, suppose we start with an arbitrary (finitary?) monad $T$ on $Set$. On the one hand, we can take the category of elements $E$ of the restricted functor $T: FinSet \to Set$. On the other hand, we can take the underlying operad $P$ of $T$, and form the free $P$-algebra $F P$ containing an internal $P$-algebra. Is there a canonical functor $F P \to E$, generalizing the quotient functor $\mathbf{FP} \to FinProb$? I don’t know. I haven’t tried to figure it out.

Posted by: Tom Leinster on May 18, 2011 10:52 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

You said that the free P-algebra containing an internal P-algebra, with your explicit description, is only a weak P-algebra, but you asserted its universal property strictly, by way of an isomorphism that talks about strict maps. This sort of thing always makes me queasy; usually it seems like strict maps between weak objects are not correct. What’s going on?

Yes this had me confused as well. After thinking a bit about it, though, I decided that perhaps there is something a bit subtle going on. If P is a non-symmetric operad, then the monad it induces on a category E preserves pullbacks; whence it induces a 2-monad on Cat(E). In this situation, we know that the inclusion of P-algebras and strict maps into P-algebras and lax maps has a left adjoint (under some boundedness assumptions). Thus it is possible to form the free strict P-algebra containing an internal P-algebra.

On the other hand, if P is a symmetric operad, then it does induce a monad on Cat(E) by the usual formula (maybe one needs to assume that E is extensive for this) but I see no reason why this monad should be a 2-monad. Under these circumstances I am not quite sure how much two-dimensional monad theory carries over intact. So perhaps there is no strict P-algebra which classifies lax P-algebra maps out of 1. That sounds a bit unlikely on the face of it, though.

Posted by: Richard Garner on May 19, 2011 6:38 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Posted by: Mike Shulman on May 19, 2011 7:21 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I’m a little confused … is the empty set allowed to be a “finite measure space”, under your definition, Tom? It looks like it to me; on the other hand, it seems that you don’t allow the empty set to be a finite probability space (because of the condition that $\sum p_i=1$).

Posted by: Charles Rezk on May 19, 2011 5:08 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Hi, Charles. Yes, the empty set is allowed as a measure space. So is any other set equipped with the zero measure. But, for the reason that you give, the empty set can’t be made into a probability space.

Does that cause a problem? I’m hoping not. (But funnily enough, I did realize earlier today that there was at least one point in the nascent paper by John, Tobias and me where we needed to add a couple of words to cover the possibility of an empty measure space.)

Posted by: Tom Leinster on May 19, 2011 7:56 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I assume you mean the comment saying that any measure on a finite set is some function times a probability measure.

Posted by: John Baez on May 20, 2011 7:40 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Oops, I never followed this up. I was thinking of where we define the entropy of a measure space. But I’ll go through the draft now and try to catch all instances of this kind of thing. The battle is to do it without making things look too complicated.

Posted by: Tom Leinster on May 20, 2011 7:44 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I just fixed a mistake of this type before equation (4). But yes, there’s at least one more, and I’d love it if you fixed them all.

Posted by: John Baez on May 20, 2011 7:54 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I only found that one other one, in the definition of entropy of a measure space. Fixed now.

Posted by: Tom Leinster on May 20, 2011 9:01 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I really like this operad stuff, and personally I’d enjoy reading a paper based on this blog post. But since we already have a paper almost finished that proves nice versions of Theorems M and P that don’t mention operads, it seems slightly ‘decadent’ to write another paper doing it all with operads. I’d feel better about it if we could dream up some extra trick that more clearly justifies the fancy math. Remember, I’m an applied mathematician now.

Here are some very vague thoughts that come up when I start trying to imagine what that trick might be.

The idea of the ‘free $O$-algebra containing an internal $O$-algebra’ could be interesting for lots of operads $O$.

The first example Tom gives, namely the free monoidal category on an internal monoid, is really famous. It’s the ‘augmented’ version of the category of simplices. This is incredibly important in algebraic topology.

The most important example Tom gives, namely $\mathbf{F P}$, is also very ‘simplicial’ in flavor, since now there’s a simplex of objects for each natural number $n$: namely, all the probability distributions on the $n$-element set.

Is it a coincidence that these are both related to simplices, or not?

Is there some interesting relation between probability theory and the simplicial stuff people do in algebraic topology?

Posted by: John Baez on May 20, 2011 8:53 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

That’s funny: I wrote a reply to John’s comment a day or two ago, but it’s not here. I guess I forgot to press “post”.

it seems slightly ‘decadent’ to write another paper doing it all with operads.

I agree, for the moment. One mysterious thing about the operadic approach described in my post is the role of the categories $\mathbf{FP}$ and $\mathbf{FM}$, which aren’t quite the same as $FinProb$ and $FinMeas$. Perhaps the work of Čencov and Campbell that David’s mentioned will provide some clues. Or perhaps the semicartesian operads that Mike’s talking about are the way forward. I don’t know.

The idea of the ‘free $O$-algebra containing an internal $O$-algebra’ could be interesting for lots of operads $O$.

Is it a coincidence that these are both related to simplices, or not?

I’m guessing that it’s a coincidence, though it would be interesting to be proved wrong. There’s something intrinsically simplicial about anything to do with operads, just because operads are operads for the free monoid monad (or to put it differently, algebras for the terminal operad are monoids), and the theory of monoids is bound up with the augmented simplex category. Or for that matter, there’s something intrinsically simplicial about anything to do with categories: $\Delta$ is inherent in the theory of categories. I can’t discern anything especially simplicial about this particular situation.

I gave one other example of the free $O$-algebra containing an internal $O$-algebra. That was where $O$ is the operad for $M$-sets ($M$ being some monoid). There, the construction gives the slice category $M/\star$, which if $M$ has cancellation is the poset of elements of $M$ ordered by divisibility. This is an important construction too, in its own small way, but doesn’t seem particularly simplicial.

Is there some interesting relation between probability theory and the simplicial stuff people do in algebraic topology?

Interesting question. I’m now trying to think about the singular simplicial complex of a space in terms of probability distributions. Hmm.

Posted by: Tom Leinster on May 22, 2011 8:27 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Tim Porter has mentioned stuff several times about simplicial sets where the face and degeneracy maps ‘have measure/probability’. Not sure where, but here on the Cafe and over at the Forum I think.

Posted by: David Roberts on May 23, 2011 2:26 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Is there some interesting relation between probability theory and the simplicial stuff people do in algebraic topology?

Barycentres as maximum entropy distributions?

While we’re on the subject of looking to further links to mathematics, I wonder if operadic understanding can help make sense of the appearance of entropy elsewhere. There’s some good material at Entropy in Ergodic Theory and Dynamical Systems, originally constructed by John’s old friend, Chris Hillman. There’s an intriguing sounding All Entropies Agree for an SFT, an expository paper by Chris which

Discusses the relationships between various notions of entropy including metric entropy, topological entropy, algorithmic entropy (see below) and galois entropy, which relates entropy to the notion of “geometric asymmetry”.

No sign of his What is Information? there, but you can find it at CiteSeer.

Posted by: David Corfield on May 23, 2011 9:20 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

In the main post above, Tom has characterized the category $\mathbf{FP}$ as the free $\mathbf{P}$-algebra containing an internal $\mathbf{P}$-algebra. On the other hand, what seems to be the nicest formulation of our results is Theorem $M$, which is about a different category $\mathbf{FM}$. Tom has remarked that $\mathbf{FM}$ is not the free $\mathbf{M}$-algebra containing an internal $\mathbf{M}$-algebra.

My question is: is there a different abstract characterization of $\mathbf{FM}$? If not, does this make the category $\mathbf{FP}$ somehow more natural than $\mathbf{FM}$?

Posted by: Tobias Fritz on May 24, 2011 11:46 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

There is another very nice category which might play a role in upcoming investigations. This comment is sort of a follow-up to observation (4) of this comment.

I am going to define a category $FinStochRel$, a category of “stochastic relations”. It is the probabilistic sibling of $FinRel$, the category of finite sets and relations. Caveat: googling for “category of stochastic relations” turns up lots of results, but these authors seem to be talking about $FinStoch$ and variants of it that make sense also for infinite Borel spaces.

Just as for $FinProb$, an object of $FinStochRel$ is also a finite probability space $(X,p)$. We can think of this as a random variable $X$ with distribution $p$. A morphism from $(X,p)$ to $(Y,q)$ is defined to be a probability distribution on the cartesian product $X\times Y$ such that its marginals are $p$ and $q$, respectively. Spelling this out, this amounts to a non-negative number $s(x,y)$ for each $x\in X$ and $y\in Y$ such that the following conditions hold:

$\sum_{x\in X}s(x,y)=q(y) \quad \forall y\in Y \qquad \sum_{y\in Y} s(x,y) = p(x) \quad \forall x\in X$

Thinking of the objects as random variables, such a morphism is a joint distribution between the two random variables. Specifying a morphism allows us to talk about stuff like the correlation between the random variables or their mutual information, which is a measure of the correlation.

The main point is that joint distributions can be composed: when $s$ is a measure on $X\times Y$ corresponding to a morphism $(X,p)\rightarrow (Y,q)$, and similarly $t$ is a morphism $(Y,q)\rightarrow (Z,r)$, then we can define a composed morphism $(X,p)\rightarrow (Z,r)$ as follows:

$(t\circ s)(x,z)=\sum_{y\in Y} \frac{s(x,y)t(y,z)}{q(y)}$

If the denominator of this expression vanishes, then so does the numerator; in this case, we set $0/0=0$. It is not hard to check that $(t\circ s)$ is a probability measure on $X\times Z$ which has $p$ as its marginal on $X$ and $r$ as its marginal on $Z$.

The identity morphism of $(X,p)$ is given by the measure on $X\times X$ which is the push-forward of $p$ along the diagonal $X\rightarrow X\times X$. Associativity of composition should also be clear.
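The composition rule above is easy to check numerically. Here is a minimal sketch (the function name and the numbers are mine, not from any library), verifying that the composite of two compatible joint distributions is again a joint distribution with the right marginals:

```python
import numpy as np

def compose(t, s, q):
    """Compose joint distributions: s lives on X x Y, t on Y x Z, and q is
    their common marginal on Y.  Returns the matrix of
    (t o s)(x,z) = sum_y s(x,y) t(y,z) / q(y), with 0/0 read as 0."""
    inv_q = np.divide(1.0, q, out=np.zeros_like(q), where=q > 0)
    return (s * inv_q[np.newaxis, :]) @ t

# Marginals p, q and compatible joint distributions s, t (made-up numbers):
p = np.array([0.5, 0.5])
q = np.array([0.3, 0.7])
s = np.array([[0.3, 0.2],
              [0.0, 0.5]])   # rows sum to p, columns sum to q
t = np.array([[0.1, 0.2],
              [0.3, 0.4]])   # rows sum to q, columns sum to r = (0.4, 0.6)

ts = compose(t, s, q)
assert np.allclose(ts.sum(axis=1), p)            # marginal on X is p
assert np.allclose(ts.sum(axis=0), [0.4, 0.6])   # marginal on Z is r
```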

Some nice properties of this category $FinStochRel$:

• There is a faithful functor $FinProb \rightarrow FinStochRel$, since a measure-preserving function $f:(X,p)\rightarrow (Y,q)$ defines a joint distribution as follows:

$s_f(x,y)= \left\{ \begin{array}{cl} p(x) & \text{if} \: f(x)=y \\ 0 & \text{otherwise} \end{array} \right.$

Similarly, there is also a faithful functor $FinProb^{op} \rightarrow FinStochRel$. I think both also apply to $\mathbf{FP}$ instead of $FinProb$.

• There is an obvious involution on the morphisms of $FinStochRel$ which turns it into a dagger category. This involution interchanges the two functors of the previous bullet point.

• $FinStochRel$ relates to $1/FinStoch$ as $\mathbf{FP}$ relates to $FinProb$. In fact, from a joint distribution $s$ on $X\times Y$ we obtain a conditional distribution by applying the usual formula for conditional probabilities:

$P(y|x) = \frac{s(x,y)}{p(x)}$

which is, if no vanishing denominators occur, a morphism in $1/FinStoch$ from $(X,p)$ to $(Y,q)$. This correspondence is compatible with composition.

• Every morphism in $FinStochRel$ can be regarded as a span in $FinProb$, defined by measure-preserving maps $(X\times Y,s)\rightarrow (X,p)$ and $(X\times Y,s)\rightarrow (Y,q)$. Conversely, every span in $FinProb$ defines a morphism in $FinStochRel$. This is much like the relation between $Rel$ and $Set$.

What are some other nice properties of this category? And how does it fit into the general abstract story on operads and monads? Given the existence of the dagger structure: is it the universal $\mathbf{P}$-algebra containing an internal $\mathbf{P}$-bialgebra? (I don’t know what that means.) Does it arise from a distributive law between $FinProb$ and itself, or between $\mathbf{FP}$ and itself?

Posted by: Tobias Fritz on May 24, 2011 1:01 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Where does the tensor product of probabilities fit with entropy, especially Rényi? I got confused by “glomming”, as I initially thought it was a tensor.

In “Entropy as a functor”, $X \otimes Y$ comes up in Corollary 2 for finite measure spaces $X$ and $Y$. Also here (May 24) and in the (May 18) comment about Eckmann–Hilton.

So, is $H_e(X \otimes Y) = H_e(X) + H_e(Y)$ for all entropies $e$ under consideration?

Do I need to look at the Laplace transform for a proof?

Or more likely Fourier transform product goes to convolution.

With differential entropy, an integral over $(-\infty,\infty)$ replaces the Shannon sum. This leads to an uncertainty bound and an application to Rényi entropies.
http://en.wikipedia.org/wiki/Hirschman_uncertainty

From this norm we are able to establish a lower bound on the sum of the (differential) Rényi entropies … with $1/\alpha + 1/\beta = 2$.

This holds for Shannon entropy as the limit.

Posted by: William Patton on May 29, 2011 9:22 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Recall that when category theorists talk about a “tensor”, they don’t necessarily mean a tensor product in the usual algebraic sense, i.e. a tensor product of vector spaces or rings or the like. They rather mean the product in a monoidal category, which is an abstract generalization of an ordinary tensor product. It is often also denoted by $\otimes$.

In particular, in the comment you refer to, I had been using “tensor product” in this sense. More specifically, I was talking about the monoidal category $FinMeas$ with the direct sum as the tensor product. In this case, it’s less confusing to write $\oplus$ instead of $\otimes$, so that’s what I’m going to do in the following.

Then yes, the formula

$H_\alpha(X\oplus Y) = H_\alpha(X) + H_\alpha(Y)$

holds for the entropies which we consider in our paper. Here, $H_\alpha$ is the Tsallis entropy of order $\alpha$, with Shannon entropy as the special case $\alpha=1$, of a not necessarily normalized measure. Usually this is only defined for normalized measures, i.e. probability measures, so we have to say what we mean by the entropy of a non-normalized measure. See the paper.

Okay, but usually when one talks about the additivity of entropies, one actually means that the joint entropy of two independent random variables equals the sum of the single-variable entropies, right? So how does this fit into the present context? In fact, given measure spaces $(X,p)$ and $(Y,q)$ there is a natural measure $p\otimes q$ defined on the cartesian product $X\times Y$, given by $(p\otimes q)_{(i,j)}=p_i q_j$. In fact, this is a special case of the glomming operation, where all the $q(1),\ldots,q(n)$ are equal to $q$.

Then, using the notation from our paper, one can apply the ‘glomming formula’ to show that Shannon entropy satisfies the equation $H(p\otimes q) = \|p\| \cdot H(q) + H(p) \cdot \|q\|$

I haven’t actually worked out the Tsallis case of this yet.
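Both formulas can be checked numerically in the Shannon case. A small sketch, assuming the convention $H(p) = -\sum_i p_i \log p_i$ for a not-necessarily-normalized measure $p$ (my reading of the paper’s extension; treat it as an assumption):

```python
import numpy as np

def H(p):
    """Shannon entropy of a not-necessarily-normalized measure,
    H(p) = -sum_i p_i log p_i (assumed convention)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def mass(p):
    return float(np.sum(p))

p = np.array([0.2, 0.3, 0.7])   # total mass 1.2
q = np.array([0.5, 1.5])        # total mass 2.0

# Direct sum = measure on the disjoint union of the underlying sets:
direct_sum = np.concatenate([p, q])
assert np.isclose(H(direct_sum), H(p) + H(q))

# Product measure (p (x) q)_{(i,j)} = p_i q_j on the cartesian product:
product = np.outer(p, q)
assert np.isclose(H(product), mass(p) * H(q) + H(p) * mass(q))
```

For normalized $p$ and $q$ both checks reduce to the familiar additivity of Shannon entropy under independent products.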

Maybe we should mention part of this stuff in the paper?

Posted by: Tobias Fritz on May 29, 2011 11:18 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Tobias wrote:

Maybe we should mention part of this stuff in the paper?

I’d like the current paper to be a clean statement of a single idea, rather than a collection of all the related insights we happen to have gotten so far.

The stuff about cocycles may become the seed of another paper. If you could tersely summarize what you and David figured out, maybe we can try to understand more deeply what it means. I guess one thing worth bearing in mind is that throughout algebra, 2-cocycles are good because they allow us to construct ‘central extensions’ of algebraic gadgets. 3-cocycles are good because they let us centrally extend algebraic gadgets and obtain ‘2-gadgets’. 4-cocycles are good because they let us construct ‘3-gadgets’. And so on. But it looks like your discussion only got up to 2-cocycles… which is perfectly fine, of course.

However, I think it will be much quicker to write a short paper explaining the partition function as a decategorification process: we already have a bunch written up here. Since the Tsallis entropy is minus the $q$-derivative of the partition function, this should also give a nice way to think about Tsallis entropy (and Shannon entropy, as $q \to 1$).

Your ideas on entropy inequalities are also almost ripe for turning into a paper. My hope is that by adding inequalities as 2-morphisms to the monoidal category we’re calling $\mathbb{R}_+$, we can formulate a lot of these inequalities in a nice unified way.

Yet another idea I want to pursue is giving a characterization of von Neumann entropy for finite-dimensional quantum systems that mimics our characterization of Shannon entropy. I think this should be quite easy. And since I’m at the Centre for Quantum Technologies, not the Centre for Classical Probabilistic Technologies, I feel a certain duty to do this.

There should also be something worthwhile about studying $FinStoch$. In probability theory it’s a bit limiting to study deterministic functions between probability measure spaces. In my blog entries on network theory, I’m trying to study stochastic maps as analogous to linear operators between Hilbert spaces, setting up an analogy between quantum mechanics and probability theory and then an analogy between quantum field theory and stochastic Petri nets.

I suspect we will get tired of writing papers before we run out of ideas. So, it might be good to ponder our strategy and our desires.

Posted by: John Baez on May 31, 2011 10:29 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

All of this looks like stuff we should work on! But there’s a lot to do, and I agree that it may well be too much for us…

At the moment I’m most fascinated by the cohomology story; I will try to summarize my understanding of our observations later in this comment.

What you said about $FinStoch$ sounds very cool. And as I started to write down here, it seems that relative Shannon entropy (and relative Rényi entropy) is actually monotonically decreasing under measure-preserving stochastic maps, so that a category built from $FinStoch$ could be the natural home for relative entropy. Or its sibling which I had constructed here: it’s a dagger category! That should also remind us of quantum mechanics.

So, now on to the cohomology business. The framework we need is monoid cohomology of a monoid $M$ with values in an $M$-$M$-bimodule $A$. By this, (I think) I mean the Hochschild cohomology of the monoid algebra; but never mind fancy words, I will attempt to explain it. So $A$ is an abelian group together with a left action of $M$ as well as a right action of $M$, and these two actions commute:

$m (a n)=(m a) n\quad\forall a\in A,\: m,n\in M$

(whoops, pardon the pun!) Then we define an $n$-cochain to be any map $f:M^n\rightarrow A$. Thanks to $A$ being an abelian group, the set of all $n$-cochains forms an abelian group $C_n$. Define the coboundary operator $d:C_n\rightarrow C_{n+1}$ as follows:

$(df)(m_1,\ldots,m_{n+1})=m_1 f(m_2,\ldots,m_{n+1})+\sum_{i=1}^n (-1)^i f(m_1,\ldots,m_i m_{i+1},\ldots,m_{n+1})+(-1)^{n+1} f(m_1,\ldots, m_n) m_{n+1}$

Note that this uses both the left and the right action of $M$ on $A$. It should satisfy $d^2=0$, but verifying this involves many terms and I haven’t yet checked it through to the end.

So those $f$ with $d f=0$ are the $n$-cocycles, and those of the form $f=d g$ for some $g\in C_{n-1}$ are the $n$-coboundaries. The $n$-cocycles and $n$-coboundaries form abelian groups, with the latter contained in the former; the corresponding quotient abelian group is the cohomology $H^n(M,A)$. We obtain ordinary group cohomology as the special case where the right action is trivial.

So what does this stuff have to do with entropy? We will have to consider two monoids: the additive monoid $(\mathbb{R}_{\geq 0},+)$ and the multiplicative monoid $(\mathbb{R}_{\geq 0},\cdot)$. We take these to act on the abelian group $\mathbb{R}=(\mathbb{R},+)$ as follows: $(\mathbb{R}_{\geq 0},+)$ acts trivially, while $(\mathbb{R}_{\geq 0},\cdot)$ acts by scalar multiplication.

What is a $1$-cocycle over $(\mathbb{R}_{\geq 0},\cdot)$ with coefficients in the bimodule $\mathbb{R}$? Unwinding the given definitions, it should be a function $D:\mathbb{R}_{\geq 0}\rightarrow \mathbb{R}$ such that

$D(x y)=x D(y)+D(x)y$

which we have previously also called a “non-linear derivation”. The prototypical examples are $D(x)=x\log x$ and its scalar multiples. Are there any other (continuous) ones?
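The derivation property of $D(x) = x log x$ follows directly from $log(x y) = log x + log y$; a quick numerical sanity check:

```python
import math

def D(x):
    # the 'non-linear derivation' D(x) = x log x
    return x * math.log(x)

# D(x y) = x D(y) + D(x) y, since x y log(x y) = x y log x + x y log y:
for x in (0.3, 1.0, 2.5):
    for y in (0.7, 4.2):
        assert abs(D(x * y) - (x * D(y) + D(x) * y)) < 1e-9
```

Any scalar multiple $c \cdot D$ satisfies the same equation, since the equation is linear in $D$.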

Now such a $D$ also is a $1$-cochain over $(\mathbb{R}_{\geq 0},+)$ with coefficients in $\mathbb{R}$. Its coboundary is given by

$H(x,y)=D(x+y)-D(x)-D(y)$

which is, for $D(x)=x\log x$, just the binary Shannon entropy of the two-element measure space with singletons of measure $x$ and $y$.

Since this coboundary is automatically a cocycle, we get the equation

$0=(d H)(x,y,z)=H(y,z)-H(x+y,z)+H(x,y+z)-H(x,y)$

which can also be derived from the additivity and functoriality of entropy of our paper.

So this wraps up my personal present understanding of the ‘entropy as cohomology’ story. It’s funny that we need to consider $\mathbb{R}_{\geq 0}$ both as an additive and a multiplicative monoid. What is this trying to tell us?

Posted by: Tobias Fritz on May 31, 2011 1:42 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

This paper – Some Functional Equations in Groups and Rings, by Jessen, Karpf and Thorup – might be relevant. If you look at theorem 2 on p. 3, a function $F$ satisfying symmetry, a cocycle equation and $F(a c, b c) = c F(a, b)$, as entropy does, will pair up nicely with a $G$ which is identically zero, see equation $\epsilon$. Hence, an $f$ for which $f(a b) = b f(a) + a f(b)$ gives an appropriate $F = f(a + b) - f(a) -f(b)$ and $G = 0$.

This is for functions on the whole ring, but there is some discussion on p. 6 of extending functions from the positive elements. I don’t have time to look now, but maybe we now see why John was right to have his jaw hit the floor – maybe $(x y)log(x y) = y (x log x) + x (y log y)$ is the key to the good properties of Shannon entropy.

So, are there other functions $f$ for which $f(a b) = b f(a) + a f(b)$?

Posted by: David Corfield on June 8, 2011 1:56 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

So, are there other functions $f$ for which $f(a b) = b f(a) + a f(b)$?

Essentially not. The only measurable functions $f: (0, \infty) \to \mathbb{R}$ satisfying this equation are the scalar multiples of $a \mapsto a \log a$. Actually, you can weaken “measurable” somewhat, but I’ll take it that measurability isn’t too harsh a condition.

Proof: write $g(a) = f(a)/a$. Dividing the original equation through by $a b$, we have $g(a b) = g(a) + g(b).$ If $f$ is measurable then so is $g$, and the only measurable solutions of this last equation are multiples of the logarithm.

(Why? Put $h(t) = g(e^t)$ ($t \in \mathbb{R}$). Then $h(t + u) = h(t) + h(u)$ and $h$ in turn is measurable. This is called the Cauchy functional equation. Cauchy proved that the only continuous solutions are those of the form $h(t) = A t$, where $A$ is a constant. Since his time, the hypothesis of continuity has been weakened in many ways, and measurability is one of them.)

Posted by: Tom Leinster on June 8, 2011 2:49 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I like the sound of this explanation for why Shannon entropy is special:

Because the first cohomology group of the monoid $(\mathbb{R}_{\geq 0}, \cdot)$ with coefficients in the bimodule $\mathbb{R}$ is one-dimensional, generated by the cocycle $x log x$.

Posted by: David Corfield on June 8, 2011 3:31 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I like the sound of this explanation for why Shannon entropy is special:

I tend to like the sound of sentences like this, too. And a characterization of anything as a representative of a unique cohomology class up to scale is generally important and also much stronger than a characterization of something as a weak algebra of some operad. Because the former statement is already fully invariant.

But what I have trouble seeing with much of this discussion so far is where it is headed and what it is all supposed to be telling us. I haven’t been paying full attention, so maybe that’s just my fault. And I don’t necessarily need an application to be interested in an abstract statement. But what I think I do need is some bigger perspective on what it’s all “telling us”. I don’t see this perspective here. Yet.

Posted by: Urs Schreiber on June 8, 2011 4:28 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

What do you tend to look for next when you’ve characterized something as a representative of a unique cohomology class up to scale? Should you think of deformations?

So there’s a deformation of $(\mathbb{R}_{\geq 0}, \cdot)$’s two-sided action by multiplication on $\mathbb{R}$, which has as (left) action:

$r \cdot x = (r + \epsilon r log r)x.$

But why would you want such a thing?

Posted by: David Corfield on June 8, 2011 5:08 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

What do you tend to look for next when you’ve characterized something as a representative of a unique cohomology class up to scale?

The first thing to do whenever you find a cohomology class is: you use it as a characteristic class.

(This works for every cohomology class. The point of its being unique is only that there is a particular reason to consider that class, instead of some other – simply since there is no other in that degree!)

Every cohomology class on some $X$ with values in some $A$ is an equivalence class of morphisms $c : X \to A$ – the cocycles – in some suitable $\infty$-category.

Here it seems we need to look at the $(\infty,2)$-category of $\infty$-categories, where

• $X = B \mathbb{R}_{\geq 0}$ is the delooping category of the multiplicative monoid of non-negative real numbers;

• $A = B ( (\mathbb{R}_{\geq 0},\cdot) \ltimes (\mathbb{R},+) )$ is the category with a single object, morphisms labeled by pairs $(x,k)$ and composition given by

$(x_2, k_2)\circ (x_1, k_1) := (x_1 \cdot x_2 ,\, x_1 k_2 + x_2 k_1) \,.$

Notice that this composition is indeed associative

\begin{aligned} ((x_1, k_1) \cdot (x_2, k_2)) \cdot (x_3, k_3) & = (x_1 x_2 , x_1 k_2 + k_1 x_2) \cdot (x_3, k_3) \\ & = (x_1 x_2 x_3, (x_1 k_2 + k_1 x_2) x_3 + x_1 x_2 k_3 ) \\ & = ( x_1 x_2 x_3 , k_1 x_2 x_3 + x_1 k_2 x_3 + x_1 x_2 k_3 ) \end{aligned} \,,

which is the same as

\begin{aligned} (x_1, k_1) \cdot ((x_2, k_2) \cdot (x_3, k_3)) & = (x_1, k_1) \cdot ( x_2 x_3, x_2 k_3 + k_2 x_3 ) \\ & = (x_1 x_2 x_3, k_1 x_2 x_3 + x_1 (x_2 k_3 + k_2 x_3)) \\ & = ( x_1 x_2 x_3 , k_1 x_2 x_3 + x_1 k_2 x_3 + x_1 x_2 k_3 ) \end{aligned}

Notice also that this comes with a forgetful functor $A \to B \mathbb{R}_{\geq 0}$.

• $c$ is the functor $B \mathbb{R}_{\geq 0} \to A$ that sends morphisms labeled by $x$ to morphisms labeled by $(x, x \log x)$.

The functoriality is equivalently the cocycle property (as always for 1-cocycles):

\begin{aligned} c( x_2 \circ x_1 ) & = (x_1 x_2, \; x_1 x_2 log (x_1 x_2)) \\ & = (x_1 x_2, \; x_1 x_2 (log x_1 + log x_2)) \\ &= (x_2,\; x_2 log x_2) \circ (x_1,\; x_1 log x_1) \\ & = c(x_2) \circ c(x_1) \end{aligned} \,.
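Both the associativity of the composition in $A$ and this functoriality of $c$ can be replayed numerically; a small sketch:

```python
import math

def comp(m2, m1):
    # composition in A: (x2, k2) o (x1, k1) = (x1 x2, x1 k2 + x2 k1)
    (x2, k2), (x1, k1) = m2, m1
    return (x1 * x2, x1 * k2 + x2 * k1)

def c(x):
    # the 1-cocycle x |-> (x, x log x)
    return (x, x * math.log(x))

def close(a, b):
    return all(abs(u - v) < 1e-9 for u, v in zip(a, b))

x1, x2, x3 = 0.4, 2.5, 1.7
# associativity of the composition in A:
assert close(comp(comp(c(x3), c(x2)), c(x1)),
             comp(c(x3), comp(c(x2), c(x1))))
# functoriality of c: c(x2 o x1) = c(x2) o c(x1)
assert close(c(x1 * x2), comp(c(x2), c(x1)))
```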

Notice that $c$ is not just any such functor, but one that respects the way $A$ sits over $B \mathbb{R}_{\geq 0}$

$\array{ B \mathbb{R}_{\geq 0} &\stackrel{c}{\to}& A \\ & {}_{\mathllap{id}}\searrow & \downarrow \\ && B \mathbb{R}_{\geq 0} } \,.$

So this $c$ is really a cocycle in twisted cohomology. In the present context this is traditionally just called “cohomology with values in a nontrivial module”, but saying it that way does not make the functoriality that we’ll need for discussing characteristic classes manifest. Some exposition of all this for the groupoidal case is at the $n$Lab entry on group cohomology, but the generalization needed here from groupoids to categories is immediate: everything works verbatim as before.

This means now that for $X$ any other object (some $\infty$-category, for instance some ordered Cech category of an open cover of some topological space), and

$f : X \to B \mathbb{R}_{\geq 0}$

some cocycle on $X$, which we may identify with an $\mathbb{R}_{\geq 0}$-principal bundle over $X$ (notice: a monoid-principal bundle, something rarely considered in traditional literature), there is the corresponding characteristic class, the equivalence class of the composite morphism

$c(f) : X \stackrel{f}{\to} B \mathbb{R}_{\geq 0} \stackrel{c}{\to} A \,.$

You can also think of this as the pullback of the characteristic class $c$ on the universal $\mathbb{R}_{\geq 0}$-bundle over $B \mathbb{R}_{\geq 0}$ along $f$ to $X$.

So that’s the general story. Now that $c$ is that functor $x \mapsto x \ln x$ we maybe want to say something like “$c(f)$ is the entropy of $f$”. But I don’t see yet how that sentence would relate to something ordinarily called the entropy of something else.

But if it does, that would be interesting.

Posted by: Urs Schreiber on June 8, 2011 6:49 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

At Characteristic class it says

In practice one is interested in this notion for particularly simple objects $B$…This serves to characterize cohomology with coefficients in a complicated object $A$ by a collection of cohomology classes with simpler coefficients.

Here you might think the $B = B ( (\mathbb{R}_{\geq 0},\cdot) \ltimes (\mathbb{R},+) )$ is more complicated than the $A = B \mathbb{R}_{\geq 0}$.

The process of cocycle composition with $c:A \to B$, which sends $[X, A]$ to $[X, B]$, would seem to be sending $(\mathbb{R}_{\geq 0})$-principal bundles to $(\mathbb{R}_{\geq 0},\cdot) \ltimes (\mathbb{R},+)$-principal bundles. Could that be seen as involving a deformation of the principal bundle, a map into its tangent bundle?

Posted by: David Corfield on June 9, 2011 9:16 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

If you had a discrete space $X$, let’s say of 3 points, then I take it there’s one principal $\mathbb{R}_{\geq 0}$-bundle over it, and its sections form $(\mathbb{R}_{\geq 0})^3$.

Won’t composition with the cocycle $c$ give us a way to move to sections of the principal $((\mathbb{R}_{\geq 0},\cdot) \ltimes (\mathbb{R},+))$ bundle, which points the way to differential geometry entering into the study of measurable functions/probability distributions, as in information geometry?

Posted by: David Corfield on June 9, 2011 9:55 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

If you had a discrete space $X$, let’s say of 3 points, then I take it there’s one principal $\mathbb{R}_{\geq 0}$-bundle over it and its sections form $(\mathbb{R}_{\geq 0})^3$.

That’s right, yes. Generally, an “$\mathbb{R}_{\geq 0}$-bundle” on a topologically discrete category is a labelling of the morphisms of that category in $\mathbb{R}_{\geq 0}$ such that under composition of morphisms, labels multiply.

Similarly, an $\mathbb{R}_{\geq 0} \ltimes \mathbb{R}$-bundle on that topologically discrete category is a labelling of the morphisms by pairs $(x,k) \in \mathbb{R}_{\geq 0} \times \mathbb{R}$, such that under composition of morphisms the labels follow the twisted composition rule that I wrote out above.

Once we pass to topologically non-discrete spaces, we’ll have to think more carefully about what we shall mean by morphisms $X \to B \mathbb{R}_{\geq 0}$. These should now be morphisms of “cohesive $\infty$-categories” – namely of $(\infty,2)$-sheaves over some geometric category. Details of this will depend on details of the setup.

In principle we can discuss this, but I’d really need some motivating example or something that indicates that there is something to be found at the end of this road. So far all we have is a 1-cocycle given by the formula “$x \mapsto x log x$”. That may remind us of entropy. But do we have any indication that this is more than a superficial resemblance?

Posted by: Urs Schreiber on June 9, 2011 10:11 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Fair enough to wonder about the point of it all. I guess one motivation for me is that it has seemed mysterious why differential geometry seems to crop up quite naturally in statistical inference, as in Amari’s information geometry.

I have an inkling that the uniqueness of $x log x$ as a cocycle has to do with the uniqueness of the Fisher information metric. I don’t know if my comments on sections and tangent bundles made any sense.

That may remind us of entropy.

The function $f(x) = x log x$ should do a little more than remind us of entropy. If $(p_i)$ is a probability distribution, i.e., $\sum p_i = 1$, then the entropy of this distribution is $f(\sum p_i) - \sum f(p_i)$, a measure of the defect of linearity of $f$.
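A quick numerical check of this: for $f(x) = x log x$ and a probability distribution $(p_i)$, the defect $f(\sum p_i) - \sum f(p_i)$ equals $-\sum p_i log p_i$, since $f(1) = 0$:

```python
import math

def f(x):
    return x * math.log(x)

p = [0.1, 0.2, 0.3, 0.4]
# entropy as the defect of linearity of f:
defect = f(sum(p)) - sum(f(pi) for pi in p)
shannon = -sum(pi * math.log(pi) for pi in p)
assert abs(defect - shannon) < 1e-9
```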

Posted by: David Corfield on June 9, 2011 10:50 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

David, you write:

The function $f(x) = x log x$ should do a little more than remind us of entropy.

Then in the next sentence, as if to prove your point, you… say why it reminds us of entropy. ;-)

See what I mean: yes, that formula is the central ingredient in the formulas for entropy. But why on earth is it important then to say that it is a cocycle in some sense? What does it buy us?

It would buy us something if we could now take some general fact of cohomology theory and see how it translates into facts of probability theory. Because that might indicate then that we could take more general facts from cohomology theory, to get a new understanding of probability theory.

I would find that interesting, but I don’t see it yet.

If you’d ask me, I’d say that the hints around us are telling us that to abstractly understand measure theory and probability theory, we should try to abstractly understand the category of von Neumann algebras. There is a remarkable amount of structure in that, and remarkably little of it has been put into some general abstract context, as far as I can see.

For instance my gut feeling is that if we knew the answer to the following question, we would have a good general abstract understanding of measure and probability theory:

What is a von Neumann $n$-algebra for $n \geq 1$ and do they play a role in $(1+n)$-dimensional quantum field theory in the way ordinary von Neumann algebras do in 2-dimensional QFT – where they are vastly important?

A gut feeling would be that once we know the answer to this, a general abstract characterization of entropy will just drop out.

But I don’t know. Also I haven’t really thought about it much at all.

Posted by: Urs Schreiber on June 9, 2011 3:08 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Well that’s what I was hoping to find out. I thought that chasing up $x log x$ as the solitary cohomology class was worth a shot, and with your help, I thought I was glimpsing an explanation for the differential structure involved in statistical inference. That I would find impressive, even if entropy turned out to be a secondary notion.

Doesn’t $[x log x]$ being the sole cocycle class mean that there’s only one good way of extending $(\mathbb{R}_{\geq 0})$-bundles to $(\mathbb{R}_{\geq 0} \ltimes \mathbb{R})$-bundles? And, in terms of sections, wouldn’t this be something like forming a tangent space on a cone of positive measures? If that’s all hopelessly wrong, then I can’t see any profit in pursuing it.

Posted by: David Corfield on June 9, 2011 6:19 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Doesn’t $[x log x]$ being the sole cocycle class mean that there’s only one good way of extending $\mathbb{R}_{\geq 0}$-bundles to $\mathbb{R}_{\geq 0} \ltimes \mathbb{R}$-bundles?

Yes, that’s right. So this is something that exists and is god-given and worth taking notice of. It may be good for something.

But what I am lacking is general abstract insight into that fact that you emphasized above: while $x log x$ is a cocycle over one monoid, for it to relate to entropy one has to consider its coboundary with respect to another monoid.

It is this maneuver which I can’t see a good general abstract interpretation of. This maneuver might just be a “coincidence of small formulas”. I don’t know. This is probably the central reason why I can’t see a genuine connection between the cocycle property and entropy as such.

I thought I was glimpsing an explanation for the differential structure involved in statistical inference.

Okay, sorry, I didn’t really follow that. Is there a quick way that I could get a rough idea of that differential structure involved in statistical inference?

Posted by: Urs Schreiber on June 9, 2011 8:48 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Is there a quick way that I could get a rough idea of that differential structure involved in statistical inference?

Beyond a rapid look at the Wikipedia page on information geometry, Shalizi has a useful page about it, which points you off to proper treatments, such as Differential geometry in statistical inference. John’s run a series of posts from a physics point of view, starting here.

I think I must be missing something because it should be easy to work out, but what does the god-given way of extending $\mathbb{R}_{\geq 0}$-bundles to $\mathbb{R}_{\geq 0} \ltimes \mathbb{R}$-bundles actually do? In particular what would happen to the space of sections of the bundle on $n$ points, $(\mathbb{R}_{\geq 0})^n$?

Posted by: David Corfield on June 10, 2011 8:48 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Thanks, David. Sorry for the slow reply, for some reason I didn’t see this comment appear first.

So I started putting a tiny little bit of references and content into the $n$Lab entry information geometry.

I haven’t yet got hold of a more detailed source that explains these things. The Wikipedia entry seems a bit rough. And following Shalizi’s links I found PDFs of the front pages of books, but not yet the books themselves. (It’s probably all there, but I am doing this in a bit of haste.)

But it sounds like I want to look into this a bit. For instance, if relative entropy is not quite a Riemannian metric but Fisher information is indeed a curvature, this would seem to suggest forgetting about Riemannian geometry and instead thinking in terms of just connections and curvature. What is the connection that Fisher information is the curvature of?

I’d like to find out. But maybe not right now. Need to run.

Posted by: Urs Schreiber on June 14, 2011 7:34 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Okay, I got hold of the book Differential geometry in statistical inference by Amari et al. Around p. 12 of the introduction it surveys in some detail the role of various affine connections on these information spaces, and section 5 has the “Bundle theory of statistical models”.

All that sounds interesting. I’ll have to see if I can get some insight out of this in the few minutes that I have available.

Posted by: Urs Schreiber on June 14, 2011 8:32 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

You’ll perhaps have seen by now that there’s a one-parameter family of connections compatible with the Fisher information metric, each of which has a dual. The most famous dual pair are what Amari calls the exponential connection and the mixture connection.

Each connection gives rise to a divergence. The exponential or 1-connection gives rise to the KL-divergence, and the mixture or $(-1)$-connection to the inverse KL-divergence.

Fitting parameters in an exponential family, such as the normal distributions, to empirical data can be seen as a projection along a 1-geodesic to the collection of distributions with matching moments, or alternatively as a projection from the empirical distribution to the exponential family along a $(-1)$-geodesic.

Posted by: David Corfield on June 14, 2011 10:11 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I guess it must just do the obvious:

$(x_1, \ldots, x_n) \mapsto ((x_1, x_1 \log x_1), \ldots, (x_n, x_n \log x_n)).$

I see that my mention of deformations or tangent spaces hasn’t been given the nod, but if it is right to think in those terms, this would be a kind of vector field.

Posted by: David Corfield on June 10, 2011 2:14 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Here you might think the $B (\mathbb{R}_{\geq 0} \ltimes \mathbb{R})$ is more complicated than the $B \mathbb{R}_{\geq 0}$.

That’s right, but that’s only because here we are not looking at a plain characteristic class, but at one with coefficients in a module, hence at characteristic classes in twisted cohomology.

The thing is that

$\array{ B (\mathbb{R}_{\geq 0} \ltimes \mathbb{R}) \\ \downarrow \\ B \mathbb{R}_{\geq 0} }$

is something like a $B \mathbb{R}$-monoid 2-bundle over $B \mathbb{R}_{\geq 0}$ (a delooped monoid extension analogous to how a bundle gerbe is a groupoid extension), which makes it globally look more complicated. But locally it is simple: the fiber is just the monoid $B \mathbb{R}$.

And since $x \log x$ is not a plain cocycle, but one with coefficients in the $\mathbb{R}_{\geq 0}$-module $\mathbb{R}$, we are not homming directly into $B \mathbb{R}$ but into something more complicated, whose fibers look like that.

So it’s still simple, modulo the effect of the twist. But, true, since $\mathbb{R}_{\geq 0}$ is itself already quite simple (for instance: commutative), the characteristic class in this case is hardly much simpler.

Posted by: Urs Schreiber on June 9, 2011 9:57 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Here’s a new piece of evidence for the conjecture that the cohomological point of view for general rigs is good and useful.

So instead of working with the rig $\mathbb{R}_{\geq 0}$, let’s consider $\mathbb{R}_{\geq 0}^2$. This rig consists of pairs of nonnegative real numbers with componentwise addition and multiplication. Sounds quite dull, doesn’t it?

Just as any other rig, $\mathbb{R}_{\geq 0}^2$ defines a monad on $\Set$, and in particular a “free algebra” functor $T:\Fin\Set\to\Set$. If we consider the category of elements of $T$, we get the following: an object is a finite set $X$ together with two functions $p_1:X\to\mathbb{R}_{\geq 0}$ and $p_2:X\to\mathbb{R}_{\geq 0}$. The values of those functions are denoted by $p_{1,i}$ and $p_{2,i}$ for $i\in X$. So the first index indexes the components in $\mathbb{R}_{\geq 0}^2$, while the second one is a function or sequence index.

If we think of these two functions as defining two measures on $X$, then the morphisms are precisely those functions which preserve both measures. This is analogous to Tom’s observation that $\Fin\Meas$ is the category of elements of the “free $\mathbb{R}_{\geq 0}$-module” functor. So let’s call our new category $\Fin\Meas_2$.

Now what are the maps $D:\mathbb{R}_{\geq 0}^2 \to \mathbb{R}_{\geq 0}^2$ which have the derivation property $D(xy)=xD(y)+D(x)y$? One example is this: $D((x_1,x_2))=(x_1\log\frac{x_1}{x_2},x_2\log\frac{x_2}{x_1})$ I’m not going to write down here the brief calculation needed to verify this. Well, I’m actually slightly lying in saying that this $D$ works, since it’s not defined when either of the arguments becomes $0$. But I would hope that one can remedy this by working with the extended reals which also include $\infty$.
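For the record, here is a quick numerical check of the Leibniz rule for this $D$ (a Python sketch of my own; the helper names are not from the thread, and multiplication in the rig is componentwise as above):

```python
import math

# Sanity check: D((x1, x2)) = (x1*log(x1/x2), x2*log(x2/x1)) satisfies
# D(xy) = x*D(y) + D(x)*y, with componentwise multiplication in R_{>=0}^2.

def D(x):
    x1, x2 = x
    return (x1 * math.log(x1 / x2), x2 * math.log(x2 / x1))

def mul(x, y):
    return (x[0] * y[0], x[1] * y[1])

def add(x, y):
    return (x[0] + y[0], x[1] + y[1])

x, y = (0.3, 0.7), (2.5, 1.2)      # arbitrary strictly positive pairs
lhs = D(mul(x, y))
rhs = add(mul(x, D(y)), mul(D(x), y))
assert all(abs(l - r) < 1e-12 for l, r in zip(lhs, rhs))
```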

For $\Fin\Meas$, we had considered the additivity defect of such a non-linear derivation in order to define entropy. Let’s do the same here! For a sequence $p_{\cdot,1},\ldots,p_{\cdot,n}\in\mathbb{R}_{\geq 0}^2$, we have that the additivity defect $D\left(\sum_i p_{\cdot,i}\right)-\sum_i D(p_{\cdot,i})$ has $\left(\sum_i p_{1,i}\right)\log\frac{\sum_i p_{1,i}}{\sum_i p_{2,i}}-\sum_i p_{1,i}\log\frac{p_{1,i}}{p_{2,i}}$ as its first component. (Of course one obtains the second component by exchanging $1$ and $2$, so I don’t want to write that one down as well.)

When both measures are normalized ($\sum_i p_{1,i}=1=\sum_i p_{2,i}$), then we may recognize this formula as relative entropy!
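Concretely, one can watch relative entropy drop out of the additivity defect (again a sketch of my own, with made-up normalized measures and natural logarithms):

```python
import math

# For normalized measures p1, p2, the additivity defect of the
# componentwise derivation D((a, b)) = (a*log(a/b), b*log(b/a)) has
# minus the KL divergence D_KL(p1 || p2) as its first component.

def D(pair):
    a, b = pair
    return (a * math.log(a / b), b * math.log(b / a))

p1 = [0.2, 0.5, 0.3]          # first measure, normalized
p2 = [0.4, 0.4, 0.2]          # second measure, normalized

total = D((sum(p1), sum(p2)))                    # D applied to the sum
termwise = [D((a, b)) for a, b in zip(p1, p2)]
defect = (total[0] - sum(t[0] for t in termwise),
          total[1] - sum(t[1] for t in termwise))

kl = sum(a * math.log(a / b) for a, b in zip(p1, p2))
assert abs(defect[0] + kl) < 1e-12   # first component = -KL(p1 || p2)
```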

By the same reasoning as in our paper, we can get an ‘information loss’ functor for relative entropy, $\Fin\Meas_2\to\mathbb{R}_{\geq 0}^2$, which is again linear. Now the obvious question is:

• Does this functoriality and linearity (together with continuity) characterize relative entropy?

OK, I hope someone understands what I mean, can comment on it, and, if necessary, explain it better.

(BTW, I would have liked to add more links to previous comments. However by now, the discussion has proliferated so much that I don’t remember where things were said…)

Posted by: Tobias Fritz on June 10, 2011 2:59 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Here’s another thought for the record in case we revisit all this, relating to Urs’ observations and succeeding comments. I think it’s right.

Each $\mathbb{R}_{\geq 0}$-module has a unique (up to a scale factor) vector field on it which is invariant under the action of elements of $\mathbb{R}_{\geq 0}$.

In one dimension, the vector field is $(x, -x \log x)$. When $r$ acts on it, it does so via multiplication in $\mathbb{R}_{\geq 0} \ltimes \mathbb{R}$,

$(r, -r \log r) \cdot (x, -x \log x) = (r x, -r x \log r - r x \log x) = (r x, -r x \log (r x)).$
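For anyone who wants to check this numerically: assuming the semidirect product multiplication $(a, s)\cdot(c, t) = (a c, a t + s c)$, which is what the displayed computation uses, a short sketch is:

```python
import math

# Acting by (r, -r log r) on (x, -x log x) in R_{>=0} ⋉ R should land
# back on the graph of the same vector field: (r x, -(r x) log(r x)).

def smul(p, q):
    a, s = p
    c, t = q
    return (a * c, a * t + s * c)

def pt(x):
    # the point (x, -x log x) on the invariant vector field
    return (x, -x * math.log(x))

r, x = 1.7, 0.42
prod = smul(pt(r), pt(x))
assert abs(prod[0] - r * x) < 1e-12
assert abs(prod[1] - (-(r * x) * math.log(r * x))) < 1e-12
```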

Then say you consider the three-dimensional module $\mathbb{R}^3_{\geq 0}$ with its invariant vector field $((x, y, z), (-x \log x, -y \log y, -z \log z))$, and you project this module onto a one-dimensional module by adding components. Then the difference between the image of the vector at $(x, y, z)$ and the vector at $(x + y + z)$ is the entropy of $(x, y, z)$.
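A quick numerical check of this claim for a normalized triple (my own sketch; for a normalized triple the field at the image point $x + y + z = 1$ vanishes, so the difference is exactly the Shannon entropy):

```python
import math

# Push the invariant vector field on R_{>=0}^3 forward along the sum
# map and compare with the field at the image point.

def field(v):
    # components of the invariant vector field at v
    return [-a * math.log(a) for a in v]

p = [0.2, 0.5, 0.3]                     # a normalized triple
pushed = sum(field(p))                  # image of the vector under the sum map
s = sum(p)                              # = 1 here
at_image = -s * math.log(s)             # the field at the image point (= 0)
diff = pushed - at_image

H = -sum(a * math.log(a) for a in p)    # Shannon entropy of (x, y, z)
assert abs(diff - H) < 1e-12
```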

Does that seem contrived?

Posted by: David Corfield on June 13, 2011 6:04 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Does that seem contrived?

Maybe a bit, yes. For instance, why that invariance property? And why would you push forward (which is what you are doing) the specific vector field $(1,1,1)$ on $\mathbb{R}_{\geq 0}^3$?

But I can certainly see what you are trying to get at. Let me think a bit more, and maybe read a bit more about information geometry. I’ll try to get back to you.

Posted by: Urs Schreiber on June 14, 2011 8:05 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Ah, something far less contrived. Take a look at L. L. Campbell’s An Extended Censov Characterization of the Information Metric. This develops Censov’s result that the only Riemannian metric to put on the simplices of probability distributions over a finite set, if you want it to be preserved under the kind of fine-graining that Tom, John and Tobias have been considering, is (more or less) the Fisher information metric.

Campbell extends this to show that if you now take the cones of positive measures, $\mathbb{R}^n_{+}$, the required metric at $x = (x_1, ..., x_n)$ is the Shahshahani metric

$g_{i j} = |x| \delta_{i j} / x_i,$ where $|x| = \sum_i x_i$.

He notes that on the simplex of probabilities, where $|x| = 1$, the normal vector at $x$ is $N = \sum x_i X_i$, where $X_i$ is the obvious basis of the tangent space. Then if $U = \sum u_i x_i X_i$

$\langle U, N \rangle (x) = \sum u_i x_i.$

So if we choose our $U$ to be the invariant vector field, as above, then

$\langle U, N \rangle (x) = H(x),$

the entropy at $x$.
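Here is that computation done numerically (a sketch of mine; the point on the simplex is made up):

```python
import math

# With the Shahshahani metric g_ij = |x| δ_ij / x_i, the inner product
# of U = Σ u_i x_i X_i with the normal N = Σ x_i X_i on the simplex
# (|x| = 1) is Σ u_i x_i; taking u_i = -log x_i gives Shannon entropy.

x = [0.2, 0.5, 0.3]                      # a point on the simplex
norm = sum(x)                            # |x| = 1
u = [-math.log(xi) for xi in x]          # the invariant vector field

U = [ui * xi for ui, xi in zip(u, x)]    # components of U
N = x                                    # components of N

inner = sum((norm / xi) * Ui * Ni for xi, Ui, Ni in zip(x, U, N))
H = -sum(xi * math.log(xi) for xi in x)  # Shannon entropy
assert abs(inner - H) < 1e-12
```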

Posted by: David Corfield on June 14, 2011 2:54 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I see, good point.

So above I complained that the vector field $(1,1,...,1)$ (with respect to the naive metric) or $(x_1, \cdots, x_n)$ (with respect to that extended Fisher metric) that you needed to contract that “canonical vector field” with to get the entropy was not well motivated. But, as you point out, it is actually a natural thing to consider, being the normal vector (up to normalization) to the subspace of probability measures inside all positive measures.

Okay, so that makes it look better. There is still a piece of story-line missing, but I agree that this now has the potential to make one wonder.

Posted by: Urs Schreiber on June 14, 2011 6:25 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Hmm. That’s odd.

Yes, I see what you mean. I would however need much more time than I have now to scan the literature to see what’s going on here.

Thanks for the link to the book by Savaré on gradient flow in non-smooth metric spaces.

One thing the JKO result seems to clarify is the question that was discussed in this or some related thread: whether (relative) entropy ought to be thought of as a kind of metric on the space of probability distributions.

It seems that one can read JKO as saying that, no, the entropy functional rather plays the role of the action functional for statistical mechanics – when the space of probability distributions is instead equipped with the Wasserstein metric.

Posted by: Urs Schreiber on June 16, 2011 6:50 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Hi David,

can’t reply to your latest message right now, since I am sitting in a talk.

But, incidentally, that talk is about Wasserstein metrics on spaces of probability measures over some $X$, and the Jordan–Kinderlehrer–Otto theorem, which says that heat flow on $X$ is the gradient flow of the Shannon entropy functional with respect to the Wasserstein metric.

Not sure if this is at all relevant here. But given the superficial coincidence, I thought I’d drop a note here.

Posted by: Urs Schreiber on June 14, 2011 3:17 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Hmm. That’s odd. It sounds like they ought to be closely related approaches, but I don’t see much overlap in the references.

Posted by: David Corfield on June 15, 2011 5:35 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Doesn’t one tend to think that if a cocycle is a coboundary, it’s unlikely to be interesting by itself? So is what’s doing the work the fact that entropy is derived from a multiplicative 1-cocycle? Or is it perhaps that there’s an interaction between multiplicative and additive monoid cohomology?

Above, in Ebanks’ remark, it’s entropy as a 2-cocycle that’s taken as key to ‘branching’, which seems to be about the ‘magical property’, or whatever it used to be called.

Posted by: David Corfield on June 3, 2011 1:47 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

David wrote:

[..] Or is it perhaps that there’s an interaction between multiplicative and additive monoid cohomology?

Yes, this is the impression I have at the moment. But I can’t present any more evidence than what we’ve already discussed, so it may be more of an intuition.

Posted by: Tobias Fritz on June 4, 2011 12:23 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Perhaps we could pose this as a MathOverflow question: Is there anything known about semiring (rig) cohomology? Or maybe something more specific.

Google gives 1 hit for “Cohomology of semirings”.

Posted by: David Corfield on June 4, 2011 4:27 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Google gives 1 hit for “Cohomology of semirings”.

Did you try to search for the keywords “entropy cohomology” and “entropy cocycle”? I get some roughly reasonable looking hits this way, though it’s hard to discern in a few minutes.

For instance there is an article

Entropic Fluctuations in Statistical Mechanics I. Classical Dynamical Systems (pdf)

which in its section 3.1 introduces the “entropy cocycle” of a dynamical system and around its equation (3.5) identifies it with the system’s mean entropy production rate.

There is a thesis

Spectral, ergodic and cohomological problems in dynamical systems (html)

which says things like

Choosing special cocycles leads to invariants of the dynamical system like for example the entropy or the index of an embedded system.

There is an article

Entropy of Operator-valued Random Variables by some L. Akant

which considers entropy as a cocycle.

And so on.

Posted by: Urs Schreiber on June 4, 2011 8:34 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Sounds like that’s worth pursuing. E.g., the Akant article says:

Now we understand that the entropy of the single matrix models has the mathematical meaning of a non-trivial 1-cocycle of the group of automorphisms. It explains why we could not express the entropy as a function of the moments.

Posted by: David Corfield on June 4, 2011 9:06 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Thanks, Tobias, for providing us with that excellent summary of what you know about entropy and the cohomology of the nonnegative reals. And thanks, David and Urs, for pushing the conversation a bit further. I’ve been so busy finishing our first paper on entropy and thinking about its quantum generalization that I’m only reading this now!

Urs wrote:

But what I have trouble seeing with much of this discussion so far is where it is headed and what it is all supposed to be telling us. I haven’t been paying full attention, so maybe that’s just my fault.

I don’t think it’s just you! I think we have a pile of clues, but we’re not quite sure what mystery they will let us solve, yet.

Let me just think out loud here. The first big clue is that some sort of first cohomology of the topological monoid $(\mathbb{R}_+, \cdot)$ with coefficients in the bimodule $(\mathbb{R}_+, \cdot)$ is 1-dimensional, generated by the cocycle

$D(x) = x \ln x$

This is a fancy way of saying the equation

$D(x y) = D(x) y + x D(y)$

And why did we get interested in this equation in the first place? It’s because Tom pointed out that it’s equivalent to an important rule about the entropy of a product of two probability distributions:

$H(p \otimes q) = H(p) + H(q)$

Here $H$ is the Shannon entropy and $p \otimes q$ is the probability distribution on the set $S \times T$ built from probability distributions $p$ on $S$ and $q$ on $T$ in the obvious way:

$(p \otimes q)_{i j} = p_i q_j$
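This rule is easy to check numerically (a quick sketch of my own, with arbitrary distributions and natural logarithms):

```python
import math

def H(p):
    # Shannon entropy of a finite probability distribution (natural log),
    # with the convention 0 log 0 = 0.
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.5, 0.25, 0.25]
q = [0.1, 0.9]
pq = [pi * qj for pi in p for qj in q]   # (p ⊗ q)_{ij} = p_i q_j

assert abs(H(pq) - (H(p) + H(q))) < 1e-12
```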

Now, I believe this fact:

$H(p \otimes q) = H(p) + H(q)$

says that our ‘information loss’ functor

$F : FinProb \to \mathbb{R}_+$

is monoidal. Here I’m reverting to the notation of our first paper. Namely:

• $FinProb$ is the category of finite sets with probability measures, and measure-preserving functions between these,
• $\mathbb{R}_+$ refers to the one-object category with nonnegative reals as morphisms and addition as composition,
• and for any morphism $f: (X,p) \to (Y,q)$ in $FinProb$, $F(f)$ is its information loss $F(f) = H(p) - H(q)$

The category $FinProb$ becomes a monoidal category with the usual product of probability measures, described above. The one-object category $\mathbb{R}_+$ becomes monoidal if we take addition to be the tensor product of morphisms as well as the composite. (By Eckmann–Hilton, there’s no other choice!) And then I believe

$H(p \otimes q) = H(p) + H(q)$

is equivalent to having

$F(f \otimes g) = F(f) + F(g)$

for every morphism in $FinProb$.
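To spell this out, here is a small numerical sketch of my own: a morphism of $FinProb$ is encoded as a function together with the pushforward measure, and the check is that the information loss of a tensor product of morphisms is the sum of the losses. The helper names are mine, not from the paper.

```python
import math

def H(p):
    # Shannon entropy with the convention 0 log 0 = 0
    return -sum(x * math.log(x) for x in p if x > 0)

def pushforward(p, f, m):
    """Pushforward of p along f: {0..len(p)-1} -> {0..m-1}."""
    q = [0.0] * m
    for i, pi in enumerate(p):
        q[f(i)] += pi
    return q

# f collapses a 3-point space to 2 points; g collapses 2 points to 1.
p = [0.5, 0.25, 0.25]; f = lambda i: min(i, 1); q = pushforward(p, f, 2)
p2 = [0.1, 0.9];       g = lambda j: 0;         q2 = pushforward(p2, g, 1)

F_f = H(p) - H(q)          # information loss of f
F_g = H(p2) - H(q2)        # information loss of g

# Tensor product: (f ⊗ g)(i, j) = (f(i), g(j)) on the product measures.
ptensor = [a * b for a in p for b in p2]
qtensor = [a * b for a in q for b in q2]
F_tensor = H(ptensor) - H(qtensor)

assert abs(F_tensor - (F_f + F_g)) < 1e-12
```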

So: our cocycle condition is equivalent to entropy being a monoidal functor. So it’s not a ‘silly’ condition; there’s something important about it.

Also: this stuff is all about relationships between addition and multiplication: that’s why we’re constantly using $\mathbb{R}_+$ as a commutative monoid in both ways.

Okay. Further thoughts should go in another comment…

Posted by: John Baez on June 9, 2011 1:01 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Well, I got distracted by being given the go-ahead to put our first paper on the arXiv and also submit it for publication!

So this is just a dumb little comment before I give up and go home. We can ask: what are the properties of $\mathbb{R}_+$ that give rise to this cocycle $D(x) = x \ln x$ and then the whole theory of Shannon entropy? And it seems that a large portion of these properties are captured by the idea that $\mathbb{R}_+$ is a commutative rig with logarithm. Well, the logarithm of zero isn’t defined, but we say $0 \ln 0$ still is, so there’s a subtlety here. But apart from that wrinkle it seems we just need the identity

$\ln(x y) = \ln x + \ln y$

I don’t know if people have thought about commutative rigs with logarithm. They seem more likely to have thought about commutative rings with exponential.

Commutative rigs with logarithm aren’t the algebras of a Lawvere theory, since $\ln 0$ isn’t well-defined, but somehow working with $D(x) = x \ln x$ sidesteps that problem: ‘commutative rigs with $D$ obeying $D(x y) = D(x) y + x D(y)$’ are algebras of a Lawvere theory, and thus ‘purely algebraic’ gadgets in some very strict sense.

However, the definition of the category $FinProb$ cannot be done merely starting from a commutative rig with $D$ obeying $D(x y) = D(x) y + x D(y)$: we need the concept of convex linear combination as well, normally captured with the help of the condition $0 \le x \le 1$, which involves inequalities.

Anyway, people of a certain ilk might like to continue these reflections and come up with a good concept of a ‘commutative rig with entropy’, one that allows us to generalize the whole story in our first paper from $\mathbb{R}_+$ to all commutative rigs with entropy.

However, I’m not sure why this is worthwhile, and it’s time for dinner!

Posted by: John Baez on June 9, 2011 2:12 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

It’s such an obvious fact that it hardly needs mentioning, but for some reason I feel compelled to. Namely: the logarithm is what provides the isomorphism between $(\mathbb{R}, +)$ and $((0, \infty), \cdot)$. Both the additive and multiplicative structure of the reals matter to us.

That doesn’t answer any of John’s questions, but it’s surely a basic consideration.

Posted by: Tom Leinster on June 9, 2011 2:20 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Interesting! So do you think we could get an explicit description of the Lawvere theory of rigs together with a (not necessarily linear) derivation?

I’m also not sure where this is heading. My dream would be to uncover the ‘right’ algebraic structure of entropy and then find out that this kind of structure is well-known in other fields in completely different contexts. Or, alternatively, that it is not known but that it would help one to come up with notions of ‘entropy’ in other mathematical contexts which have nothing to do with information theory, but still turn out to be useful. Of course, most likely this hope is very naive.

Posted by: Tobias Fritz on June 9, 2011 6:20 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Tobias wrote:

So do you think we could get an explicit description of the Lawvere theory of rigs together with a (not necessarily linear) derivation?

I guess it depends on what you mean by ‘explicit’. For me, here’s a completely explicit description: the theory of a commutative rig $R$ equipped with $D: R \to R$ such that $D(x y) = D(x) y + x D(y)$.

In other words, I know that this specifies a Lawvere theory, since it’s all about an object $R$ equipped with a bunch of morphisms satisfying a bunch of equations that all make sense in any category with finite products.

There’s a standard machine for turning this kind of description of the Lawvere theory into an actual category with finite products. It’s called the theory of ‘sketches’, and you can read about it in (for example) the free online book Toposes, Triples and Theories.

But the really great thing is that you don’t need to worry much about sketches most of the time. As long as you have some general idea of how they work, you’re okay.

Posted by: John Baez on June 10, 2011 5:48 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I wrote:

So do you think we could get an explicit description of the Lawvere theory of rigs together with a (not necessarily linear) derivation?

John wrote:

I guess it depends on what you mean by ‘explicit’. For me, here’s a completely explicit description: the theory of a commutative rig R equipped with D:R→R such that D(xy)=D(x)y+xD(y).

Sure. While that is clear, it seems that I was not clear ;) What I meant was something like the following: an explicit description of the Lawvere theory of $R$-modules for a rig $R$ is the category of matrices over $R$. (It has natural numbers as objects, and these specify the size of the matrices which are the morphisms.)

Now what is a similarly explicit description of the Lawvere theory of a commutative rig R equipped with $D:R\rightarrow R$ such that $D(xy)=D(x)y+xD(y)$? But then since I don’t know the answer even for the theory of a commutative rig without additional structure, there might not be much to say here. (Except for uttering ‘distributive law’, but that’s not what I’m looking for.)

Posted by: Tobias Fritz on June 10, 2011 10:50 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

The Lawvere theory of commutative rigs does have an easy explicit description.

The objects are, as ever, the natural numbers.

A map $m \to n$ is an $n$-tuple of $m$-ary operations in the theory of commutative rigs. But an $m$-ary operation in the theory of commutative rigs is a way of taking $m$ elements $x_1, \ldots, x_m$ of an arbitrary commutative rig and turning them into a single element. In other words, it’s a polynomial over $\mathbb{N}$ in $m$ commuting variables. So, $\Hom(m, n) = \mathbb{N}[x_1, \ldots, x_m]^n.$

Composition is by substitution of polynomials into one another.
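As a toy illustration of composition by substitution (my own sketch, representing an $m$-ary operation extensionally as a Python function rather than as a polynomial over $\mathbb{N}$):

```python
# A map m -> n is an n-tuple of m-ary operations; composition
# substitutes the components of one tuple into the other.

# f: 2 -> 2, the pair of binary operations (x1*x2, x1 + x2)
f = (lambda x1, x2: x1 * x2, lambda x1, x2: x1 + x2)

# g: 2 -> 1, the single binary operation y1 + y1*y2
g = (lambda y1, y2: y1 + y1 * y2,)

def compose(g, f):
    """g o f: substitute the components of f into each component of g."""
    return tuple(lambda *xs, gi=gi: gi(*(fi(*xs) for fi in f)) for gi in g)

h = compose(g, f)   # h: 2 -> 1

# By substitution, h(x1, x2) = x1*x2 + x1*x2*(x1 + x2).
x1, x2 = 3, 5
assert h[0](x1, x2) == x1 * x2 + x1 * x2 * (x1 + x2)
```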

We got lucky here, in the sense that the theory of commutative rigs is unusually amenable. It may be that the theory of rigs with nonlinear derivations is also unusually amenable.

But it should be emphasized that this is an exceptional situation. To take a theory presented by operations and equations, and write down an explicit description of the Lawvere theory, is in general cumbersome. Essentially this amounts to describing the left adjoint to the forgetful functor $(algebras) \to Set.$ A typical example is the theory of groups. If you’ve ever been through the classical and fiddly development of free groups, you’ll know how cumbersome it can be.

Posted by: Tom Leinster on June 10, 2011 3:06 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Thanks a bunch! I can see the general picture now. So in the present case, the problem boils down to describing what a “free rig equipped with a non-linear derivation” is. Sounds nasty.

Posted by: Tobias Fritz on June 10, 2011 3:22 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Having gotten overwhelmed by the discussion here, I decided just to wait for the first paper to be in pretty readable form. I really like it! In particular, I think the organizational decisions (i.e., including the category-theoretic formulation but saving it for the end) were good. Being a picky reader, though, I made a list of typos and nitpicks. John suggested to me via email that, in the spirit of doing this project in public, I should post them here.

I only have two small comments with any real content. First, in explaining the continuity condition of Theorem 1, am I missing something, or could you just use sequences instead of nets here? I guess nets are probably necessary when you later describe how $FinMeas$ is a topological category, but at this point $X$ and $Y$ are fixed finite sets. Of course there’s nothing wrong with using nets, but I’ve found that some mathematicians (not I!) react to nets with extreme distaste, especially when they aren’t strictly necessary.

Second, near the end of the section “Why Shannon entropy works”, it would be nice to be more explicit that “related by the function $f$” just means $y = f(x)$. (Besides that, I find the current version of that passage quite clear.)

And here are some typos:

In defining direct sums in $FinMeas$, the domain and codomain of $f'$ should be apostrophized.

In the explanation of the continuity condition, “we can think of $\Phi$”, not $F$.

There’s a reference to Corollaries 5 and 6, apparently from an earlier version of the numbering.

In Definition 2, the finite sets should be equipped with probability measures.

In the second paragraph of “Why Shannon entropy works”, the $p_i$ is missing in the formula for $\| p \|$. Later in that section, $\sum_r q_j$ should have $r$ changed to $j$, and in easily checking equation (3), an $r$ should be changed to $q$.

Posted by: Mark Meckes on May 26, 2011 4:58 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Thanks a lot, Mark! I’ll make all these changes.

I’m to blame for the nets. My feeling was that to describe the topology we should say which nets converge, not only which sequences, so we might as well use nets right from the start. But if some readers don’t like nets (because they don’t understand them: to know them is to love them), it makes sense to save the nets for when they’re really needed.

Posted by: John Baez on May 27, 2011 2:47 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Okay, I fixed all the errors Mark caught. Tobias will be happy to know that in fixing the strangely misnumbered references to the Corollaries, I figured out how to get all Definitions, Theorems and Corollaries numbered sequentially.

(One reason we’re writing this paper in full public view is to teach youngsters what it’s like to write a paper. It’s not all glamorous brainstorming: there are many tiresome chores involved, like endless typos that need to be fixed. So, with apologies to readers who know this already, I thought we should carry out all this dirty work on the blog.)

Posted by: John Baez on May 27, 2011 3:40 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Damn, you beat me to it ;)

Anyway, it’s great that we had Mark as a beta tester! So now I suppose we should wait for Tom to be back and then finalize the paper? Let me know when you deem it ready for the conversion to LaTeX.

Posted by: Tobias Fritz on May 27, 2011 4:28 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Why not publish this in wiki format on the nJournal and forego latex?

Posted by: Eric on May 28, 2011 1:22 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Eric wrote:

Why not publish this in wiki format on the nJournal and forego latex?

We’ve spent a lot of time simplifying this paper to the point where no knowledge of category theory is required to understand it. The paper is infinitely simpler than this blog entry, for example. At present, except for the last section, all we ask is that the reader be able to read the words ‘morphism’ and ‘functorial’ without experiencing a seizure and collapsing in twisted spasms on the floor. We did this because the key result is really simple and deserves to be widely understood.

Given all this, publishing the paper on the nJournal would not make sense: it should either be in a general math journal, or a journal about entropy or information.

Posted by: John Baez on May 31, 2011 10:58 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

John wrote:

Actually, now that we’re only talking about probability measures in the main part, maybe we can even do without this fancy term and just use ‘measure-preserving function’?

Posted by: Tobias Fritz on May 31, 2011 11:38 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I think John made the point somewhere else that we need to be a bit careful here. The same function corresponds to multiple different morphisms. That is, we might have a function $f: X \to Y$ between finite sets, and probability measures $p$, $p'$ on $X$ and $q$, $q'$ on $Y$ such that $f: (X, p) \to (Y, q), \quad f: (X, p') \to (Y, q')$ are both measure-preserving. But (contrary to what I’ve just done) we shouldn’t use the same letter for both, since our “$F$” might give them different values.

Posted by: Tom Leinster on May 31, 2011 11:48 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

While it’s true that the same function may correspond to different morphisms, I thought that John’s point referred to the morphisms $f$ and $\lambda f$, for $\lambda\in\mathbb{R}_+$, having the same underlying function. That only happens when we work in $\Fin\Meas$, and definitely requires us to talk about ‘morphisms’ instead of ‘measure-preserving functions’.

Due to the reason you mention, it would be a bit sloppy to call a morphism

$f:(X,p)\rightarrow (Y,q)$

just a ‘measure-preserving function’. But there’s less trouble with doing this in $\Fin\Prob$ than in $\Fin\Meas$. So if we always carefully speak of functions between probability spaces $(X,p)$ instead of functions between sets $X$ only, then maybe it would be clear what we mean?

Posted by: Tobias Fritz on May 31, 2011 12:12 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Ah, I see your point. Still, do you think anyone will be scared off by the word “morphism”?

Posted by: Tom Leinster on May 31, 2011 12:16 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Here’s another small comment. Following what is now Definition 3 ($FinProb$), it’s worth noting that the convex combination $\lambda p \oplus (1-\lambda) q$ is a special case of a construction well-known to probabilists and statisticians: it’s a mixture.

Posted by: Mark Meckes on May 27, 2011 4:34 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Thanks very much, Mark.

To add my own point of view on nets: I was brought up on filters and ultrafilters, and on occasions that someone mentions nets, it takes me a while to remember/figure out what the basic definitions are. I’ve never seen the advantage of nets over filters, which is not to deny that there are any, but it’s a foreign language for me.

Posted by: Tom Leinster on May 29, 2011 5:43 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I could pretty well summarize my own perspective on nets v. filters as: “I was brought up on nets, and on occasions that someone mentions filters and ultrafilters, it takes me a while to remember/figure out what the basic definitions are. I’ve never seen the advantage of filters over nets, which is not to deny that there are any, but it’s a foreign language for me.”

But my point here was that I think some people have never gotten comfortable with either nets or filters, and may be confused/annoyed if a topology is described in such “highbrow” terms where sequences would suffice.

Posted by: Mark Meckes on May 29, 2011 6:34 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I think some people have never gotten comfortable with either nets or filters, and may be confused/annoyed if a topology is described in such “highbrow” terms where sequences would suffice.

I agree!

Ultrafilters do have some claim to be natural or inevitable in a way that, to my knowledge, nets and filters don’t. For example, the Stone-Cech compactification of a discrete space is the set of ultrafilters on it.

Posted by: Tom Leinster on May 29, 2011 10:38 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Mark wrote:

But my point here was that I think some people have never gotten comfortable with either nets or filters, and may be confused/annoyed if a topology is described in such “highbrow” terms where sequences would suffice.

Having grown up on nets, I don’t know the standard slick way to avoid them and define a topology just using sequences. I can define a topology by saying which nets converge. Then I can note that the topology is first-countable and say “so in fact nets are overkill: a subset is closed iff every convergent sequence in that subset converges to a point in that subset.” But what’s the best, or most popular, way to define a topology while only saying which sequences converge?

I suppose you can say which sequences converge and then say a set is closed iff every convergent sequence in that subset converges to a point in that subset… and then let the reader check that this concept of closed set really defines a topology.

Or of course we could just forget the nets and say what the open sets are. But in the example we’re doing, that seemed sort of annoying.

Posted by: John Baez on May 31, 2011 10:49 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

If someone hands you a set, a bunch of sequences in it, and for each sequence a point to which it’s supposed to converge, you can always take the finest topology in which that is in fact true. Maybe that’s the same as what you wrote in your penultimate paragraph. I don’t know which topologies can be described that way — is it the first countable ones?

Posted by: Tom Leinster on May 31, 2011 11:09 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

But what’s the best, or most popular, way to define a topology while only saying which sequences converge?

I don’t know about best, but the most popular way would probably be to just say “the topology is first countable, and here’s a description of the convergent sequences.” Mathematically a bit sloppy, but I think clear enough to make the meaning unambiguous to those who care. Many readers will probably just want to be reassured that the continuity condition is basically what they would think it should be, as opposed to something more exotic.

Posted by: Mark Meckes on May 31, 2011 1:47 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

The nLab has a stubby entry on sequential spaces. In particular, sequentiality is more general than first countability.

Posted by: Mike Shulman on June 1, 2011 1:10 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Thanks!

Posted by: Tom Leinster on June 1, 2011 10:35 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

John wrote:

I think of all this stuff as part of a program to clarify the foundations of statistical mechanics using category theory.

Tom wrote:

And I think of all this stuff as part of a programme to clarify the mathematical concept of diversity using category theory.

I think this is great. Now ‘statistical mechanics’ may sound like a branch of physics, and ‘diversity’ may sound like a concept from biology, but that’s exactly one of the walls I’m trying to tear down. As Tom notes, diversity is really a very general concept:

“Diversity” here might refer to ecological diversity, and indeed a lot of work on the mathematics of diversity has been done in that context. But it’s actually a much more general concept, which has been used in other parts of biology, economics, linguistics, soil science, …, and, I would say, really belongs to mathematics.

On the other hand, while I said ‘statistical mechanics’, I could and probably should have said ‘statistical reasoning’, because a lot of the techniques and concepts developed in statistical mechanics are really applicable to reasoning about any situation where probability is involved. Entropy, temperature, the partition function — all these should be part of the repertoire of anyone interested in statistical reasoning!
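As a minimal numerical illustration of how these three fit together (the energies and function names below are made up for the example): for energies $E_i$, the partition function $Z(\beta) = \sum_i e^{-\beta E_i}$ packages the Gibbs distribution $p_i = e^{-\beta E_i}/Z$, whose entropy runs from $\log n$ at infinite temperature ($\beta \to 0$) down to $0$ at zero temperature ($\beta \to \infty$):

```python
import math

def gibbs(energies, beta):
    """Gibbs distribution at inverse temperature beta."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)  # the partition function
    return [w / Z for w in weights]

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

E = [0.0, 1.0, 2.0]
print(shannon_entropy(gibbs(E, 0.0)))   # ~ log 3: uniform at infinite temperature
print(shannon_entropy(gibbs(E, 10.0)))  # ~ 0: concentrated on the ground state
```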

I keep dreaming of applying more techniques from statistical mechanics to the study of biodiversity. I think I’ve finally found an opening. People interested in this should check out this paper:

I summarized a small portion in my blog post Information Geometry (Part 8). I think this stuff is really exciting!

Posted by: John Baez on May 27, 2011 3:02 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

This seems like a lot of fancy machinery applied to some simple concepts. However, there are lots of interesting open mathematical problems involving entropy, so if the machinery can help solve them, I’d encourage it! But if it can’t help solve such problems, is there a reason to learn it? OK, here are two example problems (not very good ones perhaps, as they both involve quantum entropy and are probably both ridiculously hard):

1) Characterize quantum states for which strong subadditivity is approximately saturated. These states have a very simple and beautiful characterization when $S_{AB} + S_{BC} - S_B - S_{ABC} = 0$. But what happens when it is close to zero?

2) Determine whether there is operational non-additivity of minimum output entropy, that is, whether the non-additivity of minimum output entropy holds even for the regularized form $\lim_{n \rightarrow \infty} (1/n) H_{min}(E^n)$.

Posted by: matt on June 1, 2011 1:38 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Matt wrote:

This seems like a lot of fancy machinery applied to some simple concepts.

Well, it’s a matter of perspective. There are probably regular readers here who regard concepts such as “operad” and “category” as more basic and simple than concepts such as “entropy” and even “logarithm”.

I don’t want to take up a position and argue for it, because clearly it comes down to a subjective judgement. And I don’t make any great claims for the usefulness of the set of thoughts in my post. But I do think there can be some benefit in seeing how standard concepts (e.g. entropy) arise inevitably from logical structures (e.g. operads) that were created for quite different purposes. More broadly, it can be useful to have multiple motivations for the same concept, even if that doesn’t immediately lead to the solution of any open problems.

Thanks for the example problems. Personally I don’t have the background to tackle them — I don’t know about quantum states, strong subadditivity, saturation, operational non-additivity, or minimum output entropy. But hopefully some other readers will do better than me. Where do these terms come from? Is it quantum information theory?

By the way, to include LaTeX in comments here, you need to use the dropdown menu in the comments form and select one of the options containing the word “MathML”. I use “itex to MathML with parbreaks”. Then enclose your math in dollar signs, as usual.

Posted by: Tom Leinster on June 1, 2011 3:07 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Yes, quantum information theory is the source of most of those terms. But it is also a wonderful source of problems where even a better understanding would be useful. For example, strong subadditivity is the statement that given any probability distribution on three variables $A$, $B$, $C$, a certain combination of entropies is non-negative: $S_{AB}+S_{BC}-S_B-S_{ABC}\geq 0$. It is trivial to prove this inequality classically (by repeated application of the chain rule), and yet for quantum states it is a very difficult theorem, first proven by Lieb and Ruskai. Since then, multiple other proofs for quantum systems have been found, yet there is still none that is really elementary. So even a new proof of that would be of interest, as would understanding the various inequalities that might hold on systems of four or more variables.
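Classical strong subadditivity is at least easy to check numerically. Here is a quick sketch (the dictionary representation of the joint distribution is just for illustration) that verifies the inequality for random joint distributions on three binary variables:

```python
import itertools, math, random

def entropy(p):
    """Shannon entropy of a list of probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)

def marginal(p, keep):
    """Marginalize a dict {(a, b, c): prob} onto the coordinates in `keep`."""
    out = {}
    for key, prob in p.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + prob
    return list(out.values())

random.seed(0)
for _ in range(100):
    # random joint distribution p(a, b, c) on {0, 1}^3
    weights = [random.random() for _ in range(8)]
    total = sum(weights)
    p = {abc: w / total
         for abc, w in zip(itertools.product([0, 1], repeat=3), weights)}
    S_ABC = entropy(list(p.values()))
    S_AB = entropy(marginal(p, (0, 1)))
    S_BC = entropy(marginal(p, (1, 2)))
    S_B = entropy(marginal(p, (1,)))
    # strong subadditivity, up to floating-point tolerance
    assert S_AB + S_BC - S_B - S_ABC >= -1e-12
```

Of course, this only spot-checks the classical inequality; the quantum version, with von Neumann entropies of partial traces, is exactly where the difficulty matt describes begins.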

Posted by: matt on June 1, 2011 6:49 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

For a really simple summary of the paper Tobias, Tom and I have written, try this blog entry over on Azimuth:

• John Baez, A characterization of entropy.

Note: the three characters in the photo do not represent the three coauthors. One is the paper itself, and I’ll let you decide who the other two are.

Posted by: John Baez on June 2, 2011 12:23 PM | Permalink | Reply to this
Read the post Category-Theoretic Characterizations of Entropy III
Weblog: The n-Category Café
Excerpt: Now we're trying to generalize our characterization of entropy from the classical to the quantum case.
Tracked: June 5, 2011 9:27 AM

### Re: An Operadic Introduction to Entropy

Tobias, Tom and I had sort of decided that the operadic characterizations of entropy described in this blog article were a bit too esoteric to be worth publishing. So, it’s amusing to find that someone has already cited this work:

Abstract: Thermodynamic semirings are deformed additive structures on characteristic one semirings, defined using a binary information measure. The algebraic properties of the semiring encode thermodynamical and information theoretic properties of the entropy function. Besides the case of the Shannon entropy, which arises in the context of geometry over the field with one element and the Witt construction in characteristic one, there are other interesting thermodynamic semirings associated to the Rényi and Tsallis entropies, and to the Kullback–Leibler divergence, with connections to information geometry, multifractal analysis, and statistical mechanics. A more general theory of thermodynamic semirings is then formulated in categorical terms, by encoding all partial associativity and commutativity constraints into an entropy operad and a corresponding information algebra.

Of course we’ve spent ages on this blog talking about Rényi and Tsallis entropies, the Kullback–Leibler divergence, and semirings (which we call ‘rigs’) and their applications to thermodynamics. But the section most relevant to the current blog entry is Section 10, ‘Entropy Operad’. It cites our unpublished work. Maybe we should try to publish it!

Posted by: John Baez on August 16, 2011 12:54 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Wow! It’s hard to imagine a paper combining more Café interests than this one. Well, I suppose they could have devised a thermodynamic 2-rig.

Posted by: David Corfield on August 16, 2011 1:33 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

They say that their results are “cominiscent” of your operadic characterizations of entropy. I think this is just a typo; the Google hits for it are false positives due to bad OCR of the Latin commiscent, third person plural of commisceo, “to intermingle”. But it’s a lovely word which should by rights have a meaning! I want to use it in sentences. E.g., “The linked cluster theorem, which states that the partition or generating function over all diagrams is the exponential of that for connected diagrams, is cominiscent of the functorial composition of generalized combinatorial species which we saw in the context of the Bell numbers.”

Posted by: Blake Stacey on August 16, 2011 4:56 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Posted by: Tom Leinster on August 16, 2011 5:53 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Hmmm, a link from the $n$-Café to the TV Tropes wiki… I believe we are approaching the Grand Unification of time sinks.

I think cominiscence might get at an idea or experience I’ve been wishing I had a word for. And, what better way to establish the perfect cromulence of a word than by providing a definition?

cominiscent, adj. Recalling to mind, with the implication that the subject recalled and that which provoked the recollection should be considered together as parts of a potential whole, even if that “bigger picture” has yet to be painted. Descriptive of the feeling one has while studying mathematics that concept $A$ and idea $B$ should be related. Has a bittersweet tinge, when one reflects that the possible relationship glimpsed “through a glass darkly” today might be forehead-slappingly obvious next year, and one’s own particular journey from obscurity to clarity might not even be a good way to teach the subject. Cf. Egan’s “Interesting Truths”: the sensation that one may be in the vicinity of an as-yet-unknown Interesting Truth is an occasion for cominiscence. Example: “Shape invariance in supersymmetric quantum mechanics is cominiscent of fixed points in renormalization-group theory, since in both cases, we do something to the state space, adjust some parameters and find ourselves back at the same partition function we started with, and this constraint allows us to solve for lots of stuff we care about.”

Posted by: Blake Stacey on August 17, 2011 4:44 PM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

I like that definition of ‘cominiscent’, Blake! But I’d imagined this word was similar to some actual word in French, or Italian, or whatever Mathilde Marcolli’s native language might be.

I think the opposite of ‘reminiscent’ should be ‘preminiscent’ — related to ‘premonition’.

And then of course there’s just plain old ‘meniscent’.

Posted by: John Baez on August 18, 2011 11:39 AM | Permalink | Reply to this

### Re: An Operadic Introduction to Entropy

Posted by: Blake Stacey on August 18, 2011 6:51 PM | Permalink | Reply to this
Read the post Shannon Entropy from Category Theory
Weblog: The n-Category Café
Excerpt: A talk about how to characterize Shannon entropy using category theory.
Tracked: May 2, 2022 1:46 AM
