## May 1, 2011

### Category-Theoretic Characterizations of Entropy

#### Posted by John Baez

Tobias Fritz, Tom Leinster and I seem to be writing a paper on category-theoretic characterizations of entropy—not only the good old Shannon entropy, but also the Rényi entropy $S_q$, which depends on a parameter $q$ and reduces to the Shannon entropy as $q \to 1$. Curiously, you can’t get the Shannon entropy by just sticking $q = 1$ in the definition of Rényi entropy: you really need to sneak up on it, taking a limit as $q \to 1$.

Tobias has worked on the category whose morphisms are stochastic matrices, and a category-theoretic approach to convex sets. Tom likes category-theoretic characterizations of familiar concepts. So it was natural for them to become interested in this topic. But for me, it started by trying to understand Rényi entropy.

A while back I wrote a paper on Rényi entropy and free energy and blogged about it here. I got a bunch of very helpful feedback, which improved the paper immensely and allowed me state the main result in this way: the Rényi entropy is a ‘$q$-deformed version’ of the ordinary entropy. The ordinary Shannon entropy can be defined as minus the derivative of free energy $F(T)$ with respect to temperature $T$. To get the Rényi entropy, we must use a generalization of the usual derivative, called a ‘q-derivative’:

$S_q = - \frac{F(T/q) - F(T)}{1/q - 1}$

To be completely honest, this should be called a ‘$1/q$-derivative’. But that’s no big deal. What’s more interesting is that now it’s obvious why taking the limit $q \to 1$ is a good idea: that’s how $q$-derivatives reduce to ordinary derivatives!

Even more interesting is that $q$-derivatives show up throughout the theory of quantum groups, the combinatorics of finite fields, and elsewhere. Does that shed light on why they’re showing up here too? I wish I knew.

To better understand the connection between $q$-derivatives and Rényi entropy, it would help to understand what’s important—if anything—about Rényi entropy. It’s not exactly a central concept in statistical mechanics. As far as I can tell, it’s more of a curiosity. But there are a few interesting results about it here and there, and one of the more interesting can be found in Rényi’s original paper:

• A. Rényi, On measures of information and entropy, in Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability 1960, pp. 547–561. Also freely available online via Project Euclid.

This paper starts with a nice characterization of Shannon entropy due to Fadeev, and then Theorem 3 gives a characterization of Rényi entropy.

Now Tom Leinster has taken Fadeev’s characterization of Shannon entropy and stated it in beautiful category-theoretic terms. Tobias Fritz suggested a possible avenue toward doing something similar with Rényi entropy, but we’re not there yet. We were discussing this on my other blog, but when Tom uttered the word ‘2-monad’, I realized it was time to move the conversation here.

Most of our ideas so far are recorded here:

If you stare at this stuff carefully, you’ll notice an interesting ‘level shift’ is at work, both in our treatment of:

• finite probability measure spaces (which have entropies),

and:

• nonnegative real numbers (which are entropies: in other words, entropy takes values in $[0,\infty)$).

For the nonnegative real numbers, this level shift goes like this:

1. Sometimes we think of $[0,\infty)$ as a category with objects in $[0,\infty)$ and a unique morphism from $x$ to $y$ iff $x \ge y$.
2. Sometimes we think of $[0,\infty)$ as a category with one object, with morphisms in $[0,\infty)$ and composition given by addition.

Of course there’s no real conflict: we can combine these viewpoints by working with a monoidal category with $[0,\infty)$ as its set of objects, a unique morphism from $x$ to $y$ iff $x \ge y$, and addition as the tensor product. Indeed we sometimes want to get multiplication into the game too, and then we get a ‘rig category’. And sometimes we want to use the topology on $[0,\infty),$ too. Then we’ve got a rig category internal to $\mathrm{Top}$.

All this is bog standard1: it’s just a fancy way of thinking about $[0,\infty)$ as a topological ordered rig.

For finite probability measure spaces, the corresponding level shift is a bit more murky in my mind, and that’s what I’d like to start by talking about. I’m hoping it works in a very analogous way! But let’s see:

1. Sometimes we work with the category $FinProb$ where finite probability measure spaces are objects and measure-preserving maps are morphisms.
2. Sometimes we work with an operad, the ‘Giry operad’, where the set of $n$-ary operations is the set of probability measures on the $n$-point set.

Of course this set is also known as the $(n-1)$-simplex, so all of a sudden instead of doing statistical mechanics we’re in the simplicial world that homotopy theorists like to inhabit! Maybe these worlds are connected more intimately than people suspect. But never mind that for now. My question is:

What’s a nice way to combine viewpoints 1. and 2. for finite probability measure spaces, which is closely analogous to something that also works for $[0,\infty)$?

We could then hope to say that the Shannon entropy is a map

$S : FinProb \to [0,\infty)$

that preserves a lot of the structures that $FinProb$ and $[0,\infty)$ have in common. Indeed, it might be uniquely characterized this way, based on Tom’s reinterpretation of Fadeev’s theorem.

You’ll notice an irksome apparent disanalogy between what I did with $[0,\infty)$ and what I’m doing with finite probability measure spaces: in the second case, and only in that case, do we see operads entering the game.

But that’s probably because things haven’t been cleaned up enough yet.

I think that this is one of those puzzles where finally having had the wits to state it, I should be able to solve it. Anyway, this blog entry is mainly meant as a place where Tobias, Tom and interested bystanders can join me in discussing our paper.

1 I love it when Brits say ‘bog standard’, so I wanted to see if I could pull it off convincingly. How did I do? Apparently England is so wet that bogs are considered ‘standard’. I just like the way it sounds: curt and dismissive.

Posted at May 1, 2011 8:04 AM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2394

### Re: Category-Theoretic Characterizations of Entropy

So, to start answering my own question, let me cook up a monoidal category that’s related to probability measure spaces but also to the ‘Giry operad’, with a monoidal structure somehow analogous to addition in $[0,\infty)$.

Whenever we have any monoidal category $M$ and any object $x \in M$ we get an operad, the co-concrete operad of $x$, whose set of $n$-ary operations is $hom(x, x^{\otimes n})$.

So, let’s create a category where the objects are finite sets and the morphisms are ‘stochastic maps’. More precisely, a morphism $F: S \to T$ in this category is function $F: S \to P(T)$ where $P(T)$ is the space of probability measures on $T$. This functor $P$ is part of a monad called the ‘Giry monad’ and our ability to compose morphisms arises from this fact.

In jargonesque terms, we’re forming the full subcategory of the Kleisli category of Giry monad, where the objects are just finite sets.

Let’s hope this category becomes monoidal with disjoint union of sets as the tensor product.

Then the set of morphisms from $1$ to $n = 1+ 1 + \cdots + 1$ is the set of probability measures on the $n$-element set. So, the co-concrete operad of the 1-element set seems to be the Giry operad!

The appearance of the ‘co’ here is a bit curious, but otherwise I like this construction.

Posted by: John Baez on May 1, 2011 11:02 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Hi John. I think what you’re doing here is part of the general story about theories and operads. In other words, what you wrote doesn’t use anything special about the particular theory we’re considering.

What you call a co-concrete operad is generally called a co-endomorphism operad. The category of finite sets and stochastic maps is the opposite of the Lawvere theory corresponding to the Giry monad.

Generally, take a finitary algebraic theory $T$. (As you know, there are several equivalent ways of defining “finitary algebraic theory”; use whichever you like.) Write $T_n$ for the set of $n$-ary operations. The Lawvere theory $L$ corresponding to $T$ satisfies $L(n, 1) = T_n$. The finitary monad on $Set$ corresponding to $T$, which I’ll also call $T$, satisfies $T(\{1, \ldots, n\}) = T_n$. The underlying operad $P_T$ in $Set$ has $T_n$ as its set of $n$-ary operations. So the endomorphism operad of $1 \in L$ is always $P_T$.

Posted by: Tom Leinster on May 1, 2011 7:30 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Ah, I see that the category of finite sets and stochastic maps, and the observation that this is the opposite of our Lawvere theory, are in Tobias’s paper A presentation of the category of stochastic matrices. I suspect they’re also in some oldish work of Giry and Lawvere.

Posted by: Tom Leinster on May 1, 2011 10:35 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

So the endomorphism operad of $1\in L$ is always $P_T$.

Yes, I guess this relation between monads, operads and algebraic theories was part of my thinking. The main thing that puzzled me was that:

The category of finite sets and stochastic maps is the opposite of the Lawvere theory corresponding to the Giry monad.

(emphasis mine). The category of finite sets and stochastic maps seems much more interesting than its opposite, since a ‘random process’ in physics, biology, etc. etc. can be thought of as a stochastic map, or a family of such. It makes a lot of sense to assign an entropy to a stochastic map: the entropy says how much randomness the map introduces. It also makes a lot of sense to think of a probability distribution on a finite set $S$ as a special case of a stochastic map: namely, a ‘random element’ $f : 1 \to S$).

So, I was very happy with this category, but puzzled that this is the opposite of some important Lawvere theory. But now I’m thinking that this is just that usual thing about how for any Lawvere theory $Th$, the category of finitely generated free models is the opposite of $Th$. Right?

Posted by: John Baez on May 2, 2011 1:22 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Right. So we have several ways of thinking about the same category:

1. the category of finite sets and stochastic maps
2. the category of natural numbers and stochastic matrices
3. the opposite of the Lawvere theory of convex algebras
4. the full subcategory of the Kleisli category of the theory of convex algebras consisting of only the finite sets
5. the category of finitely generated free convex algebras
6. the category of simplices and convex-combination-preserving maps.

By “convex algebra” I mean algebra for the Giry monad. In description (6), I mean the category whose objects are the simplices $\Delta_n$ ($n \in \mathbb{N}$) and whose maps are the set maps $f$ preserving convex combinations: $f(\sum p_i x_i) = \sum p_i f(x_i)$ whenever $p_i \geq 0$ and $\sum_i p_i = 1$. (So it’s like an affine map, except for the restriction $p_i \geq 0$.)

You could think of this category as a cousin of the category of finite-dimensional vector spaces.

Posted by: Tom Leinster on May 2, 2011 1:56 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

So we have several ways of thinking about the same category…

Yes, it’s a wonderful category, one of those primordial ones that repays endless investigation. I suspect that once we understand it thoroughly enough, the nice characterizations of Shannon entropy and Rényi entropy will just pop out. They’ll be the only functors out of this category (or monoidal 2-category!) that preserve certain amounts of structure.

That’s my game plan, anyway.

Posted by: John Baez on May 2, 2011 4:49 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

For some reason your comment reminded me of this classic:

Lisa Simpson: I’m going to become a vegetarian.
Homer Simpson: Does that mean you’re not going to eat any pork?
Lisa: Yes.
Homer: Bacon?
Homer: Ham?
Lisa: Dad, all those meats come from the same animal!
Homer: Right Lisa, some wonderful, magical animal!

Posted by: Eric Forgy on May 2, 2011 8:54 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

You could think of this category as a cousin of the category of finite-dimensional vector spaces.

That’s essentially the idea mentioned here that we can think of

convex spaces as ‘modules’ of a generalized ring, very much as vector spaces are modules of a field.

Right?

I wonder then if Durov’s work on generalized rings provides any useful tools. E.g., what would be the relevant generalized algebraic geometry?

Posted by: David Corfield on May 2, 2011 10:15 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Right?

Not sure. It’s crucial that the simplices are the (finitely generated) free convex algebras, and that the finite-dimensional vector spaces are the (finitely generated) free vector spaces… because every vector space is free.

I don’t have a clear picture in my mind of the analogy between convex algebras and vector spaces (or modules), but the freeness was part of the idea I was trying to convey.

Posted by: Tom Leinster on May 2, 2011 7:03 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

It seems to me as though the category of finitely generated free convex algebras (= simplices) is quite analogous to the category of finitely generated free vector spaces. We always look at the finitely-generated free things when defining a Lawvere theory, right, whether or not every algebra for the theory is free?

Posted by: Mike Shulman on May 2, 2011 7:13 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Right: that’s exactly the point I was trying to make.

Posted by: Tom Leinster on May 3, 2011 3:31 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Durov mentions the monad $\Delta$ on p. 112, and abstract convex sets as $\Delta$-modules on p. 145. And in the overview p. 36

Generalized ring $\Delta := Aff_{\mathbb{R}} \cap \mathbb{R}_{\ge 0} \subset \mathbb{R}$. Set $\Delta (n)$ is the standard simplex in $\mathbb{R}^n$, and $\Delta$-Mod consists of abstract convex sets.

Posted by: David Corfield on May 2, 2011 8:38 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Right. However, having studied this category for a few months, I came to the conclusion that it is usually more natural and always technically more convenient to consider conical combinations instead of convex combinations. In other words, $\mathbb{R}_{\geq 0}$-semimodules are nicer than convex algebras, but not less general. Another option is to consider only those conical combinations where the total weight is $\leq 1$. In order to get the corresponding Lawvere theories, we simply have to replace the category of stochastic matrices by the category of all matrices with entries in $\mathbb{R}_{\geq 0}$ or by the category of matrices over $\mathbb{R}_{\geq 0}$ where each column sums to $\leq 1$. In any case, by this we are back to the trichotomy between probability measures, measures, and subprobability measures.

By the way, the stuff that is in my paper A presentation of the category of stochastic matrices is also just general abstract nonsense about theories. It does the following: given a presentation of a (non-symmetric) operad over $Set$, it gives a presentation of the associated Lawvere theory. It is so simple that it might as well be trivial to the experts. Some day I would like to see whether this is in the case, and if not, then rewrite it in generality.

Posted by: Tobias Fritz on May 2, 2011 10:47 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

By “the associated” Lawvere theory of an operad, do you mean the Lawvere theory with the same algebras as the operad? You may know that the relationship (which is an adjunction) between Lawvere theories and operads is part of a general picture of “generalized operads / multicategories”, which I said a bit about in another context here. But I don’t know to what extent one can extract an “operation on presentations” (i.e. given a presentation for one, get a presentation for the other) from the abstract nonsense.

Posted by: Mike Shulman on May 2, 2011 7:18 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Let’s see. A presentation of an operad is a parallel pair $FX \stackrel{\to}{\to} FY$ of maps of operads, where $X$ and $Y$ are signatures (objects of $Set^\mathbb{N}$), $F$ assigns to a signature the free operad on it, and the two maps are any maps of operads (not necessarily of the form “$F$ of something”). The operad presented is the coequalizer of this pair.

The same goes for Lawvere theories, now interpreting $F$ as the free theory on a signature.

In the adjunction between theories and operads, the left adjoint goes from operads and theories. So that’s going to turn a presentation of an operad into a presentation of a theory. (And if you think about it concretely, that’s clear too.) The other way round doesn’t seem instantly obvious from this point of view, but I’ve only given it a minute’s thought.

Posted by: Tom Leinster on May 3, 2011 3:40 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Here’s a related thought, concerning your second list:

1. the category $FinProb$

Again, it comes from some general stuff about theories.

There’s a one-to-one correspondence between finitary endofunctors of $Set$ and functors $FinSet \to Set$. (In one direction, the correspondence is given by restriction. In the other, it’s given by left Kan extension.) But also, for any category $C$, there’s a one-to-one correspondence between functors $C \to Set$ and discrete opfibrations over $C$. So: finitary endofunctors of $Set$ correspond to discrete opfibrations over $FinSet$.

Now, a finitary algebraic theory corresponds to a finitary monad $T$ on $Set$, and therefore gives rise to a finitary endofunctor of $Set$ (by throwing away the unit and multiplication). So we can apply the correspondence just described. The resulting category $E(T)$, discretely opfibred over $FinSet$, has:

• as objects: pairs $(n, \nu)$ where $n$ is a finite set and $\nu \in T(n)$ is an $n$-ary operation
• as maps $(n, \nu) \to (m, \mu)$: maps $f: n \to m$ of finite sets such that $\mu$ is the “push-forward” of $\nu$ along $f$, i.e. $(T f)(\nu) = \mu$.

The discrete opfibration $E(T) \to FinSet$ is the obvious forgetful functor.

If we apply this construction to the Giry monad then the resulting category $E(T)$ is precisely $FinProb$.

So the category $FinProb$, equipped with the forgetful functor $FinProb \to FinSet$, carries the same information as the endofunctor part of the Giry monad.

Posted by: Tom Leinster on May 1, 2011 8:12 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

And as for your first list…

1. the poset $([0, \infty), \geq)$
2. the monoid $([0, \infty), +, 0)$

…it’s also the case that the first item can be derived from the second by throwing away some information.

(I say “also” because I showed in this previous comment that the category $FinProb$ can be obtained from the Giry monad by throwing away some information. Actually the second item on your second list was the Giry operad, but I’m hoping you don’t mind.)

The poset $([0, \infty), \geq)$ is obtained from the monoid $([0, \infty), +, 0)$ by taking the (unique) slice category. That is, $([0, \infty), +, 0)$ is a one-object category, and the slice category over the unique object is the poset $([0, \infty), \geq)$.

I call this “throwing away some information” for the following reason. Whenever you have a monoid $M$, you can view it as a category with single object $\star$ and form the slice category $M/\star$. Its objects are the elements of $m$, and a map $m \to n$ is an element $p \in M$ such that $m = n p$. If $M$ has cancellation then $M/\star$ is a preordered set, the order relation being divisibility. But different monoids can have the same underlying preordered set. For example, all groups have the same underlying preordered set, up to equivalence: it’s trivial. All groups of the same order have the same underlying preordered set, up to isomorphism.

I’m not sure if the following is relevant, but I feel compelled to finish the thought. Whenever you have a category $C$ and an object $c \in C$, you not only have a slice category $C/c$: you also have a forgetful functor $C/c \to C$. This is a discrete fibration. And discrete fibrations on $C$ correspond to functors $C^{op} \to Set$; in this case, it’s the representable $C(-, c)$. (What else could it possibly be?)

Applied to our monoid $([0, \infty), +, 0)$, this gives a forgetful functor $([0, \infty), \geq) \to ([0, \infty), +, 0).$ It’s trivial on objects, and sends a map $x \geq y$ in the domain category to the map $x - y$ in the codomain category. It corresponds to the unique representable functor $([0, \infty), +, 0)^{op} \to Set$. Now functors $([0, \infty), +, 0)^{op} \to Set$ are sets with a right action by this monoid, and in this case it’s the right regular representation: $[0, \infty)$ acting on itself, on the right, by addition.

So all of that stuff comes out of the monoid $[0, \infty)$ by general categorical mechanisms. You can find this in Section 4 of Lawvere’s paper Taking categories seriously.

Posted by: Tom Leinster on May 2, 2011 12:00 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Though I seem to have exposed the Giry operad as a part of a monoidal category $FinStoch$ with

• finite sets as objects
• stochastic maps as morphisms

this is off by a ‘level shift’ from the monoidal category $M$ with $[0,\infty)$ as objects and + as tensor product.

After all, the entropy of a probability measure on a finite set is a number. But numbers are objects in $M$, while probability measures on finite sets are certain morphisms in $FinStoch$.

So let’s make up a monoidal category $X$ with one object, numbers in $[0,\infty)$ as morphisms from this object to itself, and + as composition of morphisms. By Eckmann-Hilton, + will also be the tensor product of morphisms. This will be good.

Now, I claim that Shannon entropy can be reinterpreted as a monoidal functor

$F: FinStoch \to X$

How? Well, first of all I should define $F$ on morphisms (on objects it’s trivial). A morphism $f: 1 \to T$ in $FinStoch$ is just a function $F: 1 \to P(T)$, where $P(T)$ is the set of probability measures on the finite set $T$. So, it’s a probability measure on $T$. Let $F(f)$ be the Shannon entropy of this probability measure.

Next, a general morphism $f : S \to T$ in $FinStoch$ is a function $f : S \to P(T)$. So, it’s an $S$-tuple of probability measures on $T$. Let $F(f)$ be the sum of the Shannon entropies of these probability measures.

Now, I’m hoping that $F$ is a monoidal functor!

The proof is supposed to be reminiscent of Tom’s proof that Shannon entropy defines a ‘lax point’. Indeed I’m trying to develop this stuff as an alternative way of saying everything Tom said. While his formulation is elegant, it doesn’t seem to emphasize the fact that a probability distribution and its entropy are ‘things of the same type’. That’s what I’m trying to do now.

It may seem odd to check that $F$ is monoidal before I check it’s a functor, but I think preservation of tensor products is easier — at least after I tell you the tensor product in $FinStoch$. Let me do that.

Here’s the idea: given an $S$-tuple of probability distributions on $T$ and an $S'$-tuple of probability distributions on $T'$, there’s an obvious $S+S'$-tuple of probability distributions on $T+T'$.

More precisely, suppose we have morphisms $f : S \to T$ and $f': S' \to T'$ in $FinStoch$. Then I want their tensor product to be a morphism from $S+S'$ to $T+T'$. Such a thing is a function $S+S' \to P(T+T')$ or in other words, it’s a pair of functions $S \to P(T+T') , \qquad S' \to P(T+T')$ But there are natural inclusions $P(T) \to P(T + T')$ $P(T') \to P(T + T')$ so we can get a pair of the above sort from a pair of functions $S \to P(T) , \qquad S' \to P(T')$ and that’s what $f$ and $f'$ actually are!

Well, now I’m in a position to check that $F$ is a monoidal functor—but now it’s time for dinner, so I’ll quit here.

Posted by: John Baez on May 1, 2011 1:46 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

It seems to me that this $F$ is not a functor. Consider for example the object $S=\{0,1\}$. The set of probability distributions $P(S)$ contains the ‘fair coin’, which is the distribution where both outcomes occur with probability $1/2$.

Let’s look at the endomorphism $f:S\rightarrow S$ which maps both elements to the fair coin. It has the property that $f g=f$ for any other $g\in End(S)$. If $H$ would be compatible with this kind of composition, then we would deduce $H(f)+H(g)=H(f)$, so that $H(g)=0$!

Posted by: Tobias Fritz on May 1, 2011 8:16 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I agree with Tobias’s conclusion that $F$ isn’t a functor. According to my calculations, it is monoidal and does preserve identities, but doesn’t preserve composition.

Before I give some details, may I suggest that we avoid using $T$ for finite sets? We’ve got a monad kicking around, and monads are usually called $T$. Plus, $T$ stands for temperature. (It’s one of several notational clashes. There’s $p$, which wants to stand for both a probability measure and the parameter $p$ of “$p$-norm”. And there’s $\mu$, which wants to stand for both a measure and the multiplication of a monad.)

Anyway, write $P$ for the Giry monad on finite sets. I’ll think of an element $p \in P(A)$ as a family $(p_a)_{a \in A}$ of nonnegative reals summing to $1$. On maps, $P$ takes a measure and pushes it forward: that is, for $f: A \to B$ and $p \in P(A)$, $(P f)(p) = \Bigl(\sum_{a \in f^{-1}(b)} p_a\Bigr)_{b \in B}.$ For a finite set $A$, the unit map $A \to P(A)$ sends $a \in A$ to the point measure at $a$. The multiplication map $P^2(A) \to P(A)$ sends $\Pi \in P^2(A)$ to $\sum_{p \in P(A)} \Pi_p p$.

Take stochastic maps $f: A \to P(B)$ and $g: B \to P(C)$. Their composite in $FinStoch$ is the composite $A \stackrel{f}{\to} P(B) \stackrel{P(g)}{\to} P^2(C) \stackrel{mult_C}{\to} P(C).$ in $FinSet$. If I’ve done it right, this is $a \mapsto \Bigl(\sum_b f(a)_b g(b)_c\Bigr)_{c \in C}.$ You can think of a map $A \to P(B)$ as an $A \times B$ matrix of nonnegative reals with the property that each row — or do I mean column? — sums to $1$. This is a stochastic matrix. From this point of view, composition is matrix multiplication.

Now, the image of this composite map under $F$ is $\sum_a H\Bigl(\Bigl(\sum_b f(a)_b g(b)_c\Bigr)_{c \in C} \Bigr) = \sum_a \Bigl\{ H(f(a)) + \sum_b f(a)_b H(g(b)) \Bigr\}$ by the formula for the entropy of a convex combination (a straightforward generalization of your “magical property”). This is equal to $\sum_a H(f(a)) + \sum_{a, b} f(a)_b H(g(b)).$ But for $F$ to be a functor, we’d need that to be equal to $\sum_a H(f(a)) + \sum_b H(g(b)),$ and in general, it’s not.

Posted by: Tom Leinster on May 1, 2011 9:35 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

Before I give some details, may I suggest that we avoid using $T$ for finite sets? We’ve got a monad kicking around, and monads are usually called $T$. Plus, $T$ stands for temperature.

I’ve never been big on calling monads $T$; it would make sense for ‘tonads’, but… anyway, you’re right, temperature is $T$ in this game.

So: the functoriality fails!

But here’s a fallback plan. Lets take that monoidal category $X$ whose morphisms are numbers in $[0,\infty)$, and make it into a monoidal 2-category using inequalities as 2-morphisms. (A ‘monoidal 2-poset’.) Doesn’t your calculation show

$F(f g) \le F(f) + F(g) ?$

So that we’re getting a lax 2-functor?

Tobias’ counterexample illustrates why this is a sensible notion: the amount of randomness in a composite process should be less than or equal to the sum of the amounts of randomness introduced at each stage.

But how come your ‘lax point’ idea used the magical property to prove an equation, while here we’re just using it to prove an inequality? Maybe in some important cases this inequality is really an equation.

Posted by: John Baez on May 2, 2011 1:38 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

John wrote:

Maybe in some important cases this inequality is really an equation.

I had hoped it was an equation in cases where the composite $f g$ of morphisms in $FinStoch$ is of the ‘co-operadic’ type

$(f_1 + \cdots f_n) f$

where $f: 1 \to n$ and $f_i : 1 \to n_i$, and I’m writing $+$ for the tensor product in $FinStoch$. However, calculations seem to show that even in this case we only get

$F((f_1 + \cdots f_n) f) \le F(f_1 + \cdots f_n) + F(f)$

It would be nice if someone could check this.

So, right now I’m stuck when it comes to translating Tom’s operadic characterization of Shannon entropy into a property of the map

$F: FinStoch \to X$

It’s a monoidal lax 2-functor (right?) but that fact alone is not enough to capture the ‘magical property’ of Shannon entropy (right?). If so, I think we need to take into account more structure that $FinStoch$ and $X$ have in common, and say that $F$ preserves some of this extra structure. Something about ‘convex linear combinations’.

Posted by: John Baez on May 2, 2011 4:46 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I haven’t thought about the inequalities yet, but isn’t $X$ merely a monoidal category rather than a monoidal 2-category? The objects of the monoidal category are the nonnegative reals, the morphisms are inequalities, and the monoidal structure is addition. (When you talk about the inequalities as being the 2-morphisms, I think you’re already viewing the monoidal category as a one-object 2-category.)

Posted by: Tom Leinster on May 2, 2011 6:16 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

I haven’t thought about the inequalities yet, but isn’t $X$ merely a monoidal category rather than a monoidal 2-category? The objects of the monoidal category are the nonnegative reals, the morphisms are inequalities, and the monoidal structure is addition.

Well, there’s a lot of level-shifting going on in this game! But in the comment where I introduced ‘X, I defined it as a monoidal category with one object, elements of $[0,\infty)$ as morphisms, and addition as composition and also tensor product of morphisms. The reason for this ‘overly elevated’ description of $[0,\infty)$ is that I wanted to think of Shannon entropy as giving a functor

$F: FinStoch \to X ,$

and it’s the morphisms, not the objects, of $FinStoch$ that get assigned entropies.

So then, when Tobias and you pointed out that $F$ is not a functor, but merely obeys

$F(f g) \le F(f) + F(g)$

I fought back by including 2-morphisms in $X$, namely inequalities between numbers. This makes $X$ into a monoidal 2-category. And, I believe it makes $F$ into a lax functor, which however is strictly monoidal. I’m hoping you’ll check me on that!

Now, you may think it’s absurdly highbrow to treat $[0,\infty)$ as a monoidal 2-category, but monoidal 2-categories seem like the ‘least fancy’ kind of $n$-category that allows us to treat entropy as some sort of lax $n$-functor

$F: FinStoch \to X$

By the way, I like the of shifting viewpoints on $FinStoch$ and $X$ using the slice construction you described here, both for $FinStoch$ and $X$. But actually, it seems that for a uniform treatment I need a coslice category instead. It seems that:

1) Starting from the category with one object, $[0,\infty)$ as morphisms, and $+$ as composition, and forming the coslice category under the only object, we get the poset $([0,\infty), \le)$.

2) Starting from the monoidal category with one object, $[0,\infty)$ as morphisms, and $+$ as both composition and tensor product, and forming the coslice category under the only object, we get the monoidal poset $([0,\infty), \le)$.

3) Starting from the monoidal category $FinStoch$ with finite sets as objects and stochastic maps as morphisms, and forming the coslice category under the terminal object, we get the monoidal category $FinProb$.

Since $FinProb$ and $FinStoch$ are two important things that entropy is defined on, both interesting to us, but entropy assigns numbers to objects of $FinProb$ but morphisms of $FinStoch$, I think there’s something worth pondering here.

Posted by: John Baez on May 2, 2011 9:51 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Interesting that coslice categories seem to appear frequently!

Question: What does your $FinProb$ stand for, exactly? Previously I thought that we had taken it to have finite probability spaces as objects and measure-preserving functions as morphisms. But in your point 3) above, you also seem to have included stochastic maps. With this definition of $FinProb$, entropy is not functorial!

Using the former definition of $FinProb$, I had been thinking about coslice categories under objects of $FinProb$ here. The coslice category under $\Omega\in FinProb$ seems to be the category of random variables with sample space $\Omega$. Then we can describe joint distributions of random variables in terms of the categorical product on this coslice category! (This also proves that $FinProb$ itself does not have products: if it would, the coslice categories would have to inherit them; in this case, the product would not depend on $\Omega$; but for two random variables with given distribution, their joint distribution does depend on the sample space. qed.)

Posted by: Tobias Fritz on May 2, 2011 11:05 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tobias wrote:

Question: What does your $FinProb$ stand for, exactly? Previously I thought that we had taken it to have finite probability spaces as objects and measure-preserving functions as morphisms. But in your point 3) above, you also seem to have included stochastic maps.

Whoops! I hadn’t noticed that the coslice category of objects under $1$ in $FinStoch$ had more morphisms than $FinProb$.

With this definition of $FinProb$, entropy is not functorial!

I guess you’re right. Not in the original sense, where we had a functor from $FinProb$ to the category with numbers as objects and inequalities $\ge$ as morphisms. But how about mapping $FinProb$ to some 2-category with numbers for objects (entropies of probability distributions), numbers for morphisms (entropies of stochastic maps), addition for composition of morphisms, and inequalities as 2-morphisms. Could entropy be a lax functor?

(When I say “some category with…” it’s because I can think of more than one possibility like this. I haven’t had time to sort through the details, and I’d be glad for help.)

Posted by: John Baez on May 3, 2011 1:58 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I thought the entropy notion attached to arrows of what you call $FinStoch$ is what is called conditional entropy, since a map $p: X \to Y$ is a conditional distribution.

It’s only given a specific distribution on $X$ that you get a numerical value for conditional entropy. Given such a distribution, say $r(x)$, then there is a joint distribution on $X \times Y$:

$q(x, y) = r(x) \cdot p(y|x).$

Then conditional entropy = $H(q) - H(r)$.

It’s not straightforward to me what kind of entropy thing is associated to $p$. It’s not just a number.

Posted by: David Corfield on May 2, 2011 11:08 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

It seems that for some stochastic matrices there is an entropy. This is the case for certain $n$ by $n$ stochastic matrices, where for the $r(x)$ above you use the stationary distribution.

Posted by: David Corfield on May 2, 2011 4:15 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

David wrote:

I thought the entropy notion attached to arrows of what you call FinStoch is what is called conditional entropy, since a map $p:X \to Y$ is a conditional distribution.

I’m a bit under the weather; maybe that explains why I can’t see why a stochastic map deserves to be called a conditional distribution. Say we have a stochastic map $f: S \to S'$ For each point in $S$, a stochastic map gives a probability distribution on $S'$. Here’s how I’m associating an entropy to that. If $S$ is a single point, $f: S \to S'$ is a probability distribution on the finite set $S'$, and I take its Shannon entropy. If $S$ is a bunch of points, I get a bunch of probability distributions, and I add their Shannon entropies.

Since my definition of the entropy for a stochastic map $f: S \to S'$ uses a sum over points of $S$, it’s pretty easy to show that this entropy gives a map

$F : FinStoch \to X$

(with $X$ defined as above) that is strictly monoidal:

$F(f + g) = F(f) + F(g)$

However, it’s only a lax 2-functor:

$F(f g) \le F(x) + F(y)$

which captures one of Tobias’ observations about laxness and inequalities involving entropy.

Posted by: John Baez on May 3, 2011 2:16 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I’m a bit under the weather; maybe that explains why I can’t see why a stochastic map deserves to be called a conditional distribution.

But then you go to explain how each point in $S$ gives a probability distribution on $S'$. That’s precisely the same as a conditional probability, $P(s'|s)$. For fixed $s \in S$, the sum of $P(s'|s)$ over $s'$ is one.

So then, yes you have an entropy for each $s \in S$. You want a single number now, so choose to add them. I was just pointing out that the standard way of treating the entropy of $f$ is called conditional entropy which retains the full set of entropies for each point $s$, and waits for a distribution on $S$ to form the expectation of the entropy.

Posted by: David Corfield on May 3, 2011 8:53 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

And I wonder why you’re adding the entropies for each point $s$.

I could see the sense in taking the mean, which is working out how much uncertainty is introduced by stochastic $f: S \to S'$ relative to a uniform distribution on $S$. There’s at least a barycentric flavour to that choice.

Doing it your way, the entropy for the stochastic map $f:S \to S'$ which is uniform on $S'$, and independent of $S$, depends on the size of $S$. I can’t see the intuitive sense of that.

Posted by: David Corfield on May 3, 2011 9:33 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

David wrote:

But then you go to explain how each point in $S$ gives a probability distribution on $S′$. That’s precisely the same as a conditional probability…

I guess maybe it’s a special case or something… I’m a bit too feverish to think about this.

And I wonder why you’re adding the entropies for each point $s$.

That, at least, I can answer. Formally speaking, it’s because that gave me a map

$F : FinStoch \to X$

that’s monoidal:

$F(f + g) = F(f) + F(g)$

Intuitively speaking, it’s because a stochastic map $f: n \to m$ is an $n$-tuple of independent random variables taking values in the finite set $m$, so I’d expect the information in the whole bunch to be gotten by adding up the information of each one.

There may be other good things to do, but right now all I’m trying to do—with miserable lack of success—is capture Fadeev’s characterization of Shannon entropy in some neat category-theoretic language. None of my attempts have succeeded in capturing the magical property of Shannon entropy.

Tom has two nice ways to do it already! But neither completely satisfies me, for reasons explained here. I suspect I can take his results and restate them in a way that suits my philosophy of decategorification. But perhaps not today…

Posted by: John Baez on May 3, 2011 10:24 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I hope your fever improves soon. I can see that in very concrete terms, we’re dealing with row stochastic matrices. To sum two such matrices, we just make a block diagonal matrix.

Now you want to assign a number, called entropy, to any such matrix in such a way that the entropy of the sum is equal to the sum of the entropies of the block submatrices. And a simple way to do this is just to take the entropy of a matrix to be the sum of the entropies of each of its rows.

It doesn’t strike me as a natural thing to do, but let’s see if it accords with Tom’s set up.

Posted by: David Corfield on May 3, 2011 1:03 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

A little while ago, John wrote:

$F(f g) \leq F(f) + F(g)?$

There’s been an overwhelming amount of activity on this thread, which is great, but means that I’ve only just got to this. I think the answer’s no.

To recap: we had the category $FinStoch$, whose objects are the finite sets and whose maps $A \to B$ are the $A \times B$ row-stochastic matrices, i.e. matrices of nonnegative reals whose rows sum to $1$. Composition, if written in diagrammatic order, is matrix multiplication: the composite of

$A \stackrel{f}{\to} B \stackrel{g}{\to} C$

is the matrix product $f g$. We assigned to each map $f: A \to B$ in $FinStoch$ the real number

$F(f) = \sum_a H((f_{a b})_{b \in B})$

where $H$ is Shannon entropy. (By definition of $FinStoch$, the expression in brackets is a probability distribution, so we can indeed take its Shannon entropy.)

Now put $A = B = C = \{1, 2\}$ and

$f = \begin{pmatrix} 1&0\\ 1&0 \end{pmatrix}, \quad g = \begin{pmatrix} 1/2&1/2\\ 0&1 \end{pmatrix}.$

Then

$F(f) + F(g) = [H(1, 0) + H(1, 0)] + [H(1/2, 1/2) + H(0, 1)] = \log 2.$

On the other hand,

$F(f g) = F \begin{pmatrix} 1/2&1/2\\ 1/2&1/2 \end{pmatrix} = 2 H(1/2, 1/2) = 2 \log 2.$

Hence $F(f g) \gt F(f) + F(g)$.

A similar calculation shows that $F(g f) \lt F(g) + F(f)$. So there is no universally true inequality of this type, either way round.

Posted by: Tom Leinster on May 8, 2011 4:54 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Entropy as a functor

Great. Very intriguing stuff.

This is possible consider the Entropy respect to the Valuations Counting Measure. The valuations are generated by process of objects construction(reconstruction)-diagram building dynamics-constructive morphisms. Then Entropy is dynamical variable.
If process depends(defined) on the entropy then appears non trivial natural transformations process(respect to induced endofunctors process). That weakens the associativity and any conditions represented by finite diagrams.

The Entropy dynamically classifying the objects by their complexity (defined). This results in some kind of convergence-the Entropic Memory convergence.
In Algorithmic point of view the search algorithm-search for greater complexity. This algorithm has cognitive property, the entropic memory.
Figuratively, the valuating process stays no long where was recently.

More detail:

Posted by: interested bystander on May 1, 2011 5:20 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I love “seem to be writing a paper”! It encourages my fond hope that without anyone trying, a paper will simply become written.

On the most important question, I think your usage of “bog standard” is pretty much on target. It maybe doesn’t sound quite right… but perhaps that’s because it sounds funny coming from a Californian. As your link points out, a key factor in the phrase’s success is that “bog” is British slang for “toilet” — which makes “bog standard” doubly British.

Posted by: Tom Leinster on May 1, 2011 7:18 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

I think your usage of “bog standard” is pretty much on target.

I was hoping you’d say it was spot-on.

Posted by: John Baez on May 3, 2011 4:16 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I have just expanded our notes to include categorical formulations of information-theoretic inequalities. Again slice and coslice categories appear abundantly. In particular, one can define the category of random variables over a sample space $\Omega$ as the coslice category of $\Fin\Prob$ under $\Omega$.

I would appreciate some feedback! Please adjust terminology, notation, or anything else as you see fit. Tomorrow I will take a closer look at Tom’s section on power means and try to do likewise.

Posted by: Tobias Fritz on May 2, 2011 4:27 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I have been working some more on the section about categorical interpretations of the basic inequalities of information theory. Is it clear what am I up to?

Two questions:

(1) I have used the term ‘bislice category’ for the following: let $C$ be a category, $x, y\in C$ objects, and $f:x\rightarrow y$ a morphism. Now consider the category $x/C/y$ where the objects are triples $(z,i_z,p_z)$ with $i_z:x\rightarrow z$ and $p_z:z\rightarrow y$ such that $p_z i_z=f$. Morphisms are the obvious ones compatible with the given data. Does this kind of construction have a name? If not, is ‘bislice category’ appropriate? (It seems to be a full subcategory of a bislice category…)

(2) Is there a better way to introduce the relative entropy $H(X/Z)$ than by defining it as $H(X)-H(Z)$? This definition seems a bit ad hoc. Also, is the term ‘relative entropy’ appropriate for this concept, which slightly differs from ordinary relative entropy?

Also I have been looking at Tom’s section on power means. It seems excellently written and I have nothing to say about the open questions mentioned, so I did not actually change anything. One thing that’s missing is an explanation of the relation to entropies — I suppose it’s motivated by the rig-theoretical characterization of Rényi entropies (John’s corollary 2)?

Posted by: Tobias Fritz on May 5, 2011 2:01 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Hi Tobias. I’ll try to explain what the power means are doing there quite soon, partly in answer to your implied question and partly in answer to John’s question. But I’ve been at a conference all day and haven’t got the energy right now.

Posted by: Tom Leinster on May 5, 2011 11:33 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I’m really busy this second, but: we definitely should not use “relative entropy” to mean the difference in entropies; that term already has a standard meaning and that other standard concept will eventually become very important in what we’re talking about!

Posted by: John Baez on May 6, 2011 2:27 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

But ordinary relative entropy is a difference of entropies, isn’t it? For random variables $X$ and $Y$, the relative entropy is

$H(X|Y) = H(X,Y) - H(Y)$

where $(X,Y)$ is the joint distribution of $X$ and $Y$. This is a special case of my concept when considering the projection $(X,Y)\rightarrow Y$, which is a measure-preserving function.

What I’m doing is just extending this to other measure-preserving functions.

I’ll think of a better term anyway. What about ‘entropy over $Y$’?

Posted by: Tobias Fritz on May 6, 2011 8:30 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Sorry, to be more clear I probably should have written $X\times Y$ instead of $(X,Y)$.

Posted by: Tobias Fritz on May 6, 2011 8:33 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

That’s usually referred to as conditional entropy. When I have read the term ‘relative entropy’ it has always been as a synonym for Kullback-Leibler divergence, which concerns a non-symmetric distance on between distributions over the same space.

Posted by: David Corfield on May 6, 2011 10:24 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Ooops! That has cleared up a misconception: I had somehow thought of these two concepts as the same thing, although clearly they are not.

Then, of course I agree completely with John: I should definitely *not* use the term ‘relative entropy’. And relative entropy is indeed a very natural concept and it should be interesting to characterize it in categorical language!

Posted by: Tobias Fritz on May 6, 2011 11:22 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I’ve never got on top of the various entropy-like quantities that take two arguments, including but probably not limited to:

• relative entropy
• conditional entropy
• cross entropy
• joint entropy
• mutual information
• Kullback–Leibler distance (= Kullback–Leibler divergence = relative entropy?)
Posted by: Tom Leinster on May 6, 2011 3:19 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

We are already talking in this thread about the Kullback-Leibler distance = Kullback-Leibler divergence = relative entropy. The cross entropy equals entropy plus relative entropy, so mathematically there’s nothing new going on here.

The joint entropy, conditional entropy and mutual information are crucial tools of information theory. Shannon’s noisy channel coding theorem is all about mutual information. So let me try to explain these quantities a bit.

Joint entropy is the simplest of these: if $X$ and $Y$ are random variables with joint distribution $X\times Y$, then the joint entropy $H(X,Y)$ is defined to be the ordinary entropy of the joint distribution:

$H(X,Y)=H(X\times Y)$

So there is nothing mysterious going on here. If $H(X)$ quantifies the amount of randomness in $X$, then $H(X,Y)$ just quantifies the amount of random in $X$ and $Y$ together.

How much randomness can there be in $X$ and $Y$ together? Obviously, it should be at least as much as is contained in $X$ alone or in $Y$ alone:

$H(X,Y)\geq H(X),\quad H(X,Y)\geq H(Y)$

There are situations when these are actually equalities: for example when $X=Y$. On the other hand, $X$ and $Y$ together can contain at most as much randomness as $X$ contains plus what $Y$ contains:

$H(X,Y)\leq H(X)+H(Y)$

This holds with equality if and only if $X$ and $Y$ are probabilistically independent: this is the additivity of entropy.

Conditional entropy and mutual information can now be defined as derived quantities. They are precisely those which the above inequalities guarantee to be non-negative. So, conditional entropy is

$H(X|Y)=H(X,Y)-H(Y)$

It quantifies something like this: given that we know the value of $Y$, how much additional randomness do we expect to be contained in $X$? Mutual information is the quantity

$I(X:Y)=H(X)+H(Y)-H(X,Y)$

It quantifies the amount of knowledge one has about $X$ when one knows the value of $Y$, and vice versa. It vanishes if and only if $X$ and $Y$ are probabilistically independent.

Since these are just simple combinations of (joint) entropies, inventing additional terms like ‘conditional entropy’ and ‘mutual information’ is redundant. It helps in getting more succinct notation and in confusing people by throwing around many different terms which refer to the same things.

Besides the two inequalities above, which are the non-negativity of conditional entropy and mutual entropy, there is another one which states the non-negativity of conditional mutual information, which is a quantity that unifies and extends the above concepts. I have found categorical formulations of all these inequalities here (and erroneously used the term “relative entropy” where I should have said “conditional entropy”). There are many other useful inequalities which can be derived from these, e.g. the data processing inequality.

Posted by: Tobias Fritz on May 7, 2011 10:09 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Thanks, Tobias! Your comment will be a very useful reference.

Posted by: Tom Leinster on May 8, 2011 3:24 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

One thing to be careful about is that the notation “$H(X,Y)$” is commonly used for both the cross entropy and the joint entropy. David has taken it to mean cross entropy, while in my contributions it always stands for joint entropy.

Posted by: Tobias Fritz on May 9, 2011 11:45 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

There should be a relationship between conditional entropy $H(Y|X)$ and the KL-divergence (relative entropy) between the joint distribution, $p(x, y)$, on $X \times Y$ and the distribution

$q(x, y) = p(x)/|Y|.$

Let $|Y| = n$. Then

$\begin{array}{ccl} KL(P \| Q) &=& -\sum_{x, y} p(x, y) log \frac{q(x, y)}{p(x, y)} \\ &=& -\sum_{x, y} p(x, y) log \frac{p(x)/n}{p(x, y)}\\ &=& -\sum_{x, y} p(x, y) (-log n + log \frac{p(x)}{p(x, y)})\\ &=& log n - H(Y|X).\\ \end{array}$

Or something like that.

Posted by: David Corfield on May 6, 2011 1:52 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

As David said, $H(X, Y) - H(X)$ is called conditional entropy. The relative entropy is something else, and something very important in its own right: I explained it in part 6 of my series on information geometry. I was trying not to use much math there, but secretly here’s how I think about it.

In general, there’s no way to define the entropy of a single probability measure on measure space $X$. Entropy is defined for a pair of probability measures $\mu, \nu$ such that $\mu$ is absolutely continuous with respect to $\nu$:

$S(\mu,\nu) = \int\; \frac{d \mu}{d \nu} \ln \left(\frac{ d \mu}{d \nu}\right) \; d \nu$

where the fraction is a Radon-Nikodym derivative. This is the entropy of $\mu$ relative to $\nu$.

In the special case where $X$ is a finite set, and only in that case, there exists a canonical probability measure $\nu$ on $X$, namely normalized counting measure. So in that case and only in that case can we talk about the entropy of a single probability measure $\mu$, by which we really mean its entropy relative to the canonical probability measure (plus a constant depending on the cardinality of $X$).

The ‘relativity’ of relative entropy has an important philosophical significance, which I explained much more carefully over on part 6. Namely: there is no good absolute notion of the information contained in a piece of data! It only makes sense to ask how much information some data contains relative to the prior beliefs of the person who obtains that data!

Since we’ve mainly been talking about probability measures on finite sets in this discussion, we are not yet really grappling with the ‘relativity’ of entropy. That’s okay for a while, but someday someone will go further and think about relative entropy as a functor. And maybe it’ll be us. After all, Rényi’s own characterization of Rényi entropy is very clearly a characterization of relative Rényi entropy. He doesn’t provide any really interesting characterization of ‘absolute’ Rényi entropy.

Posted by: John Baez on May 6, 2011 11:23 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I’ve been meaning to take the time to try to absorb what you guys are doing here, but I’ve been busy at a conference and you’re making it hard to catch up. But now you’ve raised a couple details I’ve thought about lately (and discussed a little with Tom in email), so I have a spot to grab hold.

The first (trivial) point is that of course $S(\mu,\nu)$ makes perfect sense even if $\nu$ isn’t a probability measure. As you point out, if $X$ is a finite set and you normalize $\nu$ you get an additive fudge factor, but letting $\nu$ be non-normalized counting measure you get Shannon entropy exactly. Similarly you get differential entropy on $\mathbb{R}$, or even $\mathbb{R}^n$ if $\nu$ is Lebesgue measure, which of course can’t be normalized. So if you want to think about the right version of your story for relative entropy, it seems unnatural to me to restrict attention to the case when $\nu$ is a probability measure.

The more interesting point is specifically to do with this sentence:

In the special case where $X$ is a finite set, and only in that case, there exists a canonical probability measure $\nu$ on $X$, namely normalized counting measure.

Namely, in what relevant precise sense is counting measure–or normalized counting measure—canonical? (I can think of many canonical descriptions of normalized counting measure; my point is that it’s not clear to me what bearing any of them has on this circle of ideas.)

I guess to summarize, I’m trying to understand what mathematical justifications there are for not doing everything on a more-or-less arbitrary measure space $(X,\nu)$ in the first place.

Posted by: Mark Meckes on May 6, 2011 2:25 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Mark Meckes wrote:

So if you want to think about the right version of your story for relative entropy, it seems unnatural to me to restrict attention to the case when $\nu$ is a probability measure.

[…]

I guess to summarize, I’m trying to understand what mathematical justifications there are for not doing everything on a more-or-less arbitrary measure space $(X,\nu)$ in the first place.

I guess I’m approaching this whole project not as a topic in pure mathematics but from the viewpoint of probability theory—and its twin offspring: statistics, and statistical mechanics.

From a mathematical viewpoint, $S(\mu,\nu)$ certainly makes sense whenever $\mu$ and $\nu$ are nonnegative measures, $\mu$ is absolutely continuous to $\nu$, and the integral converges. And it’s even a concept worth thinking about!

But I’m mostly thinking about probability theory, which offers a nice conceptual interpretation of what’s going on. A probability measure describes your beliefs about some situation, and relative entropy $S(\mu,\nu)$ is the ‘information gain’ as you update your prior belief $\nu$ to a new one, $\mu$. This is the reason I’m interested in this stuff.

In this context it’s hard to know what it means to, say, multiply a probability measure by 2. Do you believe everything is twice as likely to happen as before?

However, even in probability theory people sometimes consider ‘improper priors’, meaning nonnegative measures with infinite total mass, like Lebesgue measure on the real line. If someone claims a particle is ‘randomly located on the real line’, we might try to describe that using the measure $d x$, which can’t be normalized to a probability measure. This causes certain problems (divergent integrals, and difficulty in deciding on a correct normalization), but it’s sometimes a course worth pursuing.

Namely, in what relevant precise sense is counting measure–or normalized counting measure—canonical?

By ‘canonical’ I try to always mean ‘functorial with respect to isomorphisms’. I could explain what that means, but maybe you’d be happier if I say what it amounts to in this special case. It amounts to saying that normalized counting measure on a finite set $S$ is invariant under every bijection $f: S \to S$.

On the other hand, for a measurable space $X$ with infinitely many points, there’s typically1 no probability measure on $X$ that’s invariant under all measurable maps $f : X \to X$ with measurable inverses.

This gives the theory of probability measures on finite sets a distinctive feel. The canonical nature of entropy in this special case is an example of what I mean; when we leave this realm we really need relative entropy.

1 “Typically”? Well, as I try to prove this claim I seem to need some condition that ensures there are lots of measurable subsets of $X$. All the counterexamples seem absurdly pathological. The claim is true if, for example, $X$ is a Polish space.

Posted by: John Baez on May 8, 2011 4:17 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Mark wrote:

I’ve been meaning to take the time to try to absorb what you guys are doing here, but I’ve been busy at a conference and you’re making it hard to catch up.

We certainly are. Sorry. Personally, I’m not even on top of my own contributions, let alone anyone else’s. I guess we’ll stop generating new ideas after a while, and move into a more organizational phase.

One thing on my mental to-do list is to try out the idea you explained to me by email: that the key ingredient when defining entropy is an operation for turning (suitable) measures into functions. This operation is barely visible when we’re only looking at finite sets. Nevertheless, if the idea’s right then it should provide a guide for generalizing everything we do from finite sets to more general spaces.

Posted by: Tom Leinster on May 8, 2011 5:25 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

One thing on my mental to-do list is to try out the idea you explained to me by email: that the key ingredient when defining entropy is an operation for turning (suitable) measures into functions. This operation is barely visible when we’re only looking at finite sets.

This remark, and what you’ve written about pairing measures with functions, is very close to things Tobias and I have been saying. So I think we’re all on similar wavelengths. But it might be worthwhile laying out the slightly different versions;

1. Tobias is studying relative Rényi as a functor out of the coslice category $2 / FinStoch$. An object in here is a finite set with two probability distributions on it, say $\nu$ and $\mu$.

(We’d better make sure that $\nu \ll \mu$ if we’re going to define any sort of relative entropy $H(\nu, \mu)$. So as often the case, it seems that nowhere vanishing probability distributions on finite sets are more convenient than arbitrary probability distributions on finite sets.)

2. I keep running around emphasizing that for finite sets there’s a canonical probability measure, which allows us to turn nonnegative functions into probability measures, and lets us define the entropy of one other probability measure. And I keep warning everyone that this fact is somewhat deceptive, because it goes away when we get to infinite measurable spaces. And I keep arguing that this is a mathematical hint that fundamentally, relative entropy is what really matters. The real reason I believe this is that I’m a Bayesian, so I believe the ‘prior$\mu$ must never be taken for granted.

So: I’m very happy to study the category of finite sets equipped with nowhere vanishing probability distributions, because it’s so beautiful. But when we go to infinite sets, we really want to consider a different category, to get pairs of measures (or pairs consisting of a measure and a function) into the game. And it may be good to warm up already when working with finite sets.

(I’m perfectly happy to write a paper where we do a lot of things nicely for finite sets, and then let other people generalize the ideas to other measurable spaces. I’m trying to develop a surgical strike approach to math where I do something quickly and then move on. So, I don’t mind collaborating on just one paper even if the really good results show up later in some series of papers.)

Posted by: John Baez on May 8, 2011 8:54 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

John wrote:

By ‘canonical’ I try to always mean ‘functorial with respect to isomorphisms’. I could explain what that means, but maybe you’d be happier if I say what it amounts to in this special case. It amounts to saying that normalized counting measure on a finite set $S$ is invariant under every bijection $f:S \to S$.

I’m happy enough with ‘functorial with respect to isomorphisms’ if it also comes with a hint as to which categories to think about. (Fortunately, more than hints have been given in this discussion.) And on reflection, I’m much happier with that than with your spelling out the special case.

The reason for that is that once you point out that normalized counting measure on a finite set is the unique probability measure invariant under all bijections, I think “Yes, but up to a scalar multiple, non-normalized counting measure is the unique measure invariant under all bijections, even on countable sets. And non-normalized counting measure also has the property, which normalized counting measure lacks, of playing nicely with disjoint unions.” (With or without normalization, it plays nicely with Cartesian products.)

On the other hand, I can’t think of any category structure to put on countable measure spaces that makes the first statement say anything about functoriality, or for that matter makes the second statement say anything about sums, or even makes the product of measure spaces into a product in the category. (An even stronger statement is true: I don’t know any very nice category of measure—as opposed to probability—spaces at all. This is another point I’ve discussed briefly with Tom.) So telling me you want functoriality with respect to isomorphisms makes a much more compelling case than telling me you want invariance under bijections.

Posted by: Mark Meckes on May 9, 2011 3:52 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

While we’re on the topic, I’ll share something I wrote to Mark a month or so ago. It’s a quick-and-dirty characterization of counting measure and its scalar multiples. Note that normalized counting measure—I mean, counting measure normalized to be a probability measure—is not a scalar multiple of counting measure, as the scale factor you’d need depends on the cardinality of the set.

Let $M$ be the category whose objects are finite measure spaces (with all subsets measurable) and whose maps are the injections $i: X \to Y$ such that $i$ induces a measure-isomorphism between $X$ and $iX$.

Let $I$ be the category finite sets and injections.

There is a forgetful functor $M \to I$. Little theorem: its sections are precisely the functors

$F_t: I \to M \quad (t \in [0, \infty))$

where $F_t(X)$ equips a set $X$ with counting measure scaled by a factor of $t$. (Proof: think about the one-element set.)

It’s a pretty artificial characterization. Maybe someone can improve on it.

Posted by: Tom Leinster on May 9, 2011 8:55 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I guess I’m misusing the word ‘canonical’ as defined on the nLab: there it’s used to mean ‘natural with respect to isomorphisms’. But there should also be some name for ‘functorial with respect to isomorphisms’, because there are lots of things like that. So I guess for now I’ll use ‘canonical’.

For example, every countable set has a ‘canonical’ measure on it, namely counting measure. It’s ‘canonical’ in some hand-wavy sense of not requiring any arbitrary choices for its definition. This construction doesn’t extend to a functor $F: C \to D$ from the category of countable sets and functions to the category of measure spaces and measure-preserving maps. But it does extend to a functor $F: C_0 \to D_0$ where the subscript indicates the ‘core’ of a category: that is, the subcategory with the same objects but only the isomorphisms as morphisms. So, the construction is ‘canonical’ in my sense of ‘being functorial under isomorphisms’. Note that any multiple of counting measure is also canonical.

Here’s a similar example. Let $R$ be the category of Riemannian manifolds and isometries. There’s a standard way to make any Riemannian into a measure space by giving it a kind of ‘Lebesgue measure’. (For oriented Riemannian manifolds people often talk about the ‘volume form’, but we don’t need an orientation to get a measure.) This construction doesn’t extend to a functor $F: R \to D$ from $R$ to the category of measure spaces and measure-preserving maps. But it does extend to a functor $F: R_0 \to D_0$.

By the way, I feel like apologizing for my slightly grumpy reaction to your suggestion that we extend a lot of the ideas being discussed here from probability measure spaces to measure spaces. At the time I couldn’t think of a really good reason for doing that, except that generalizing is always nice. But you’ll see that a key idea in my new post is to take any probability measure on a finite set and get from it (canonically!) a 1-parameter family of measures on that set. These can be normalized to give probability measures, and physicist always do that before computing their entropy… but I need to not normalize them to get certain things to work. I’m not looking at the entropy of non-probability measures (yet), but such measures are definitely sneaking into the story.

Posted by: John Baez on May 10, 2011 1:37 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I guess I’m misusing the word ‘canonical’ as defined on the nLab: there it’s used to mean ‘natural with respect to isomorphisms’.

Another good reason not to use “canonical” in that sense. (-: If we said “core-natural” instead for “natural with respect to isomorphisms”, then we could analogously say “core-functorial” for “functorial with respect to isomorphisms”.

Posted by: Mike Shulman on May 10, 2011 2:55 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

It would be nice if there were terms with a bit more of a warm fuzzy glow than ‘core-natural’ and ‘core-functorial’. I might suggest ‘canonical’ and ‘god-given’, but after the debacle concerning ‘evil’ I think I won’t keep pushing my religion on people.

Seriously: there’s actually an advantage of using terminology that’s clunky and unappealing here, which is that it’s less likely to be used in the vague hand-wavy way that ‘canonical’ and ‘natural’ already are. So, it’s more likely that if someone uses it, they really know what they mean.

Posted by: John Baez on May 10, 2011 5:19 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Your two paragraphs seem to me to contradict each other. Are “core-natural” and “core-functorial” insufficiently clunky and unappealing? Or do you want something that is both clunky and unappealing and has a warm fuzzy glow? (-:

Posted by: Mike Shulman on May 10, 2011 11:50 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Mike wrote:

Yes, they do. On the one hand, there are advantages to the warm fuzzy glow of a term like ‘canonical’, because constructions with this property really are good, and it’s nice to have the term indicate that. On the other hand, terms with warm fuzzy glows tend to get used vaguely, so there’s an advantage to a horrendously clunky and unappealing term like ‘core-functorial’: nobody would ever say that unless they know what it means and mean exactly that.

(By the way, I do know that ‘warm fuzzy glow’ is a mixed metaphor. I suppose you could stick a lightbulb in a wool sock…)

Posted by: John Baez on May 11, 2011 8:03 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Just to be clear, I never meant to suggest considering the entropy of an arbitrary measure, which I also couldn’t see a really good reason to do. I was just suggesting considering the relative entropy of a probability measure with respect to an arbitrary measure, which is a thoroughly classical thing.

Posted by: Mark Meckes on May 10, 2011 5:16 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Yes, thanks for reminding us about that stuff! Somehow I forgot to take this philosophy into account during the present project.

I have been thinking about how to regard relative entropy (i.e. the Kullback-Leibler divergence) and Rényi’s $q$-deformed versions of it as functors. It seems to fit into the picture rather nicely. Arguably more nicely than with the bare entropies themselves!

Let’s consider the category $\Fin\Stoch$ this time. Objects are natural numbers and morphisms are stochastic matrices of the appropriate sizes. We can also call these Markov morphisms. So, in contrast to before, we can now allow probabilistic maps!

The protagonist enters: the coslice category $2/\Fin\Stoch$. An object of this category is just a finite set together with two probability distributions on it. A morphism of this category is a Markov morphisms which preserves these distributions. As a coslice category, $2/\Fin\Stoch$ is a discrete opfibration over $\Fin\Stoch$.

I conjecture that relative Rényi entropy $I_q(P|Q)$ is a functor$I_q:2/\Fin\Stoch \rightarrow \overline{\mathbb{R}}_+$where again $\overline{\mathbb{R}}_+=\mathbb{R}_+\cup\{\infty\}$ is regarded as a poset.

This claim boils down to asserting that $I_q$ is non-increasing under Markov morphisms. I gather that this is known, but haven’t yet been able to find a reference for it. (A special case is contained in Rényi’s paper.) The following proof attempt is not yet complete:

(1) By the observations in A presentation of the category of stochastic matrices, every Markov morphisms can be factored into certain ‘generating’ ones which are of the following type:

(a) permuting the outcomes,

(b) creating a new outcome with zero probability,

© merge two outcomes into one,

(d) splitting one outcome with probability $p$ into two new outcomes which have probability $\lambda p$ and $(1-\lambda)p$, respectively.

Hence, it is enough to prove the assertion for each case separately.

(2) It is clear that $I_q$ does not change under generators of type (a) or (b). For (d), this follows from a simple calculation.

(3) For (c), the corresponding inequality is less trivial to prove. I have verified it in the cases $q=1/2$, $1$ and $2$ and found that if it holds for some $q$, then it also holds for $q'=1-q$. So it remains to prove it in generality. Also, I have disregarded degenerate cases like zero probabilities.

In fact, this monotonicity property seems very reminiscent of David’s remark on the Fisher-Rao metric. What is the precise relation between the Fisher-Rao metric and the Kullback-Leibler divergence? Is the former some sort of infinitesimal version of the latter?

(This might be explained in John’s information geometry posts, but I have to admit to not having followed them all.)

Posted by: Tobias Fritz on May 6, 2011 2:39 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

What is the precise relation between the Fisher-Rao metric and the Kullback-Leibler divergence? Is the former some sort of infinitesimal version of the latter?

Yes, but this is true of all members of a 1 parameter family of divergences, linked to a corresponding family of affine connections. Information geometry studies all this:

The mathematician Cencov (Chentsov) proved in the 1960s and 1970s that on the space of probability distributions on a sample space containing at least three points,

• There exists a unique intrinsic metric. It is the Fisher information metric.

• There exists a unique one parameter family of affine connections. It is the family of $\alpha$-affine connections later popularized by Amari.

Both of these uniqueness are, of course, up to the multiplication by a constant.

Amari and Nagaoka’s study in the 1980s brought all these results together, with the introduction of the concept of dual-affine connections, and the interplay among metric, affine connection and divergence. In particular,

• Given a Riemannian metric $g$ and a family of dual affine connections $\Gamma_{\alpha}$, there exists a unique set of dual divergences $D_{\alpha}$ defined by them.

• Given the family of dual divergences $D_{\alpha}$, the metric and affine connections can be uniquely determined by second order and third order differentiations.

Posted by: David Corfield on May 6, 2011 3:29 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Not wishing to be left out, there is of course a Tsallis relative $\alpha$-entropy:

$H_{\alpha} (P \|Q) = - \sum p_i ln_{\alpha} \frac{q_i}{p_i} = \frac{1}{1 - \alpha}(1 - \sum p_i^{\alpha} q_i^{1- \alpha}),$ that $ln_{\alpha}$ being a deformed logarithm

$ln_{\alpha} x \equiv \frac{x^{1 - \alpha} - 1}{1 - \alpha}.$

Something I always meant to work out was which, if either, of the Renyi or Tsallis divergences forms Amari’s family. Favouring looking up over hard work, equation (3.78) in Amari’s book settles it in Tsallis’s favour (reparameterising the $\alpha$ and taking the $p$ and $q$ as normalised).

Posted by: David Corfield on May 6, 2011 5:21 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

OK, so for Tsallis entropy $H_\alpha$,

$-H_\alpha(P \Vert Q) = H_\alpha(P) + \sum p_i^\alpha \ln_\alpha(q_i)$

(a simple calculation from the equation you gave), whereas

$H_\alpha(P \circ (P_1, \ldots, P_n)) = H_\alpha(P) + \sum p_i^\alpha H_\alpha(P_i).$

I’m a total novice at this, but the conclusion I tentatively draw is that relative entropy is extremely closely related to the entropy of a composite, for finite sets at least. Would you agree?

The minus sign on the left-hand side of the first equation is a bit puzzling. Maybe it’s a hint that $-H_\alpha(P \Vert Q)$ is a more basic quantity than $H_\alpha(P \Vert Q)$, even though it’s negative. Anyway, let’s try the same thing for diversity, a.k.a. Rényi extropy. This is related to Tsallis entropy by

$D_\alpha(P) = (1 - (\alpha - 1) H_\alpha(P))^{1/(1 - \alpha)}$

which suggests putting

$D_\alpha(P \Vert Q) = \bigl(1 - (\alpha - 1) [-H_\alpha(P \Vert Q)]\bigr)^{1/(1 - \alpha)}.$

(I’m not sure I’m handling the sign issue correctly, but I’ll plough on.) This reduces to

$D_\alpha(P \Vert Q) = \bigl( \sum p_i^\alpha q_i^{1 - \alpha} \bigr)^{1/(1 - \alpha)},$

which is extremely similar to the equation

$D_\alpha(P \circ (P_1, \ldots, P_n)) = \bigl( \sum p_i^\alpha D_\alpha(P_i)^{1 - \alpha} \bigr)^{1/(1 - \alpha)}.$

Posted by: Tom Leinster on May 8, 2011 9:45 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

The minus sign on the left-hand side of the first equation is a bit puzzling.

Well it’s more usual in the Shannon case to think of the equation as

$H(P\|Q) = H(P, Q) - H(P).$

Divergence from $P$ to $Q$ = cross entropy of $P$ and $Q$ $-$ entropy of $P$.

Cross entropy is minimized when $P = Q$. If the true distribution is $P$, encoding outcomes with a code based on $P$ is most efficient, or one is least surprised with a world distributed according to $P$, if one believes it to be distributed according to $P$.

A Tsallis cross entropy should be no problem to define.

Posted by: David Corfield on May 8, 2011 10:55 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tobias wrote:

What is the precise relation between the Fisher-Rao metric and the Kullback-Leibler divergence? Is the former some sort of infinitesimal version of the latter?

As you feared, precisely this question was the topic of Part 7 of my information geometry posts.

First of all, you should say ‘relative entropy’, not ‘Pullback-Liebling divergence’ or something like that. Why? Because the latter term is quite uninformative and esoteric-sounding, while the former makes it absolutely clear what’s going on: entropy only makes sense relative to some prior assumptions about what’s likely to happen.

Even someone who knows nothing about the subject will find the phrase “relative entropy” vaguely interesting (relativity is cool, so is entropy)… but not “Kullback-Leibler divergence”. It’s too important a concept to be burdened with a name like that.

Okay, now that that’s straightened out:

Briefly, the Fisher-Rao metric is the Hessian of the relative entropy.

More precisely, if $M$ is the manifold consisting of all probability distributions on some measure space, and $S(x,y)$ is the entropy of $x \in M$ relative to $y \in M$, then

$\frac{\partial}{\partial x_i} \frac{\partial}{\partial x_j} S(x,y)|_{x = y} = g_{i j}(y)$

is the Fisher-Rao metric. Here I’m using some arbitrary local coordinates near the point $y \in M$.

This works regardless of whether $M$ is finite-dimensional or infinite-dimensional. I gave the proof and discussed the meaning of this fact over there.

So, the Fisher-Rao metric is a 2nd-order infinitesimal version of relative entropy.

Posted by: John Baez on May 7, 2011 7:50 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tobias had a technical question here. I think he’s trying to show that the Rényi entropy $I_q$ obeys the inequality

(1)$I_q(p_1, \dots p_n) \le I_q (\lambda p_1, (1-\lambda) p_1, p_2, \dots, p_n)$

when $p_i \ge 0$, $\sum_{i=1}^n p_i = 1$, $0 \le \lambda \le 1$.

Since

$I_q(p_1, \dots p_n) = \frac{1}{1 - q} \sum_i p_i^q$

I think that (1) is equivalent to

(2)$\lambda^q + (1-\lambda)^q \le 1$

for $q \ge 1$, and the reverse inequality for $q \le 1$. Right?

Let’s look at the case $q \ge 1$. Then (2) is true because it’s true at the point $\lambda = 0$ and

(3)$\frac{d}{d\lambda} (\lambda^q + (1-\lambda)^q ) \le 0$

for $0 \le \lambda \le \frac{1}{2}$, so (2) keeps getting ‘truer’ as $\lambda$ increases up to $\frac{1}{2}$. But past $\frac{1}{2}$ we can just switch $\lambda$ and $1- \lambda$.

Why is (3) true for $0 \le \lambda \le \frac{1}{2}$? Well, do the derivative:

$\frac{d}{d\lambda} (\lambda^q + (1-\lambda)^q ) = q (\lambda^{q-1} - (1-\lambda)^{q-1})$

and use the fact that $x^{q-1}$ is a nondecreasing function of $x$ for $x \ge 0$, $q \ge 1$.

Assuming I haven’t done something dumb, the case $q \le 1$ should succumb to the same sort of argument. I guess I’ll let someone else try that.

Posted by: John Baez on May 8, 2011 10:28 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

The inequality holds for all $q$. Write $D_q$ for the diversity or Rényi extropy of order $q$. Since exponential is order-preserving, the inequality to be proved is equivalent to

$D_q(p_1, \ldots, p_n) \leq D_q (\lambda p_1, (1 - \lambda) p_1, p_2, \ldots, p_n).$

The right-hand side is

$D_q\Bigl(\mathbf{p} \circ \bigl( (\lambda, 1 - \lambda), (1), \ldots, (1) \bigr) \Bigr).$

By this earlier comment, this is equal to

$D_q(\mathbf{p}) \cdot M_{1 - q} \Bigl( \mathbf{p}^{(q)}, \bigl( D_q(\lambda, 1 - \lambda), D_q(1), \ldots, D_q(1) \bigr) \Bigr).$

Here $\mathbf{p}^{(q)}$ is the escort distribution (though for this argument it doesn’t matter what that is), and $M_{1 - q}(\mathbf{w}, \mathbf{x})$ denotes the power mean of order $1 - q$ of the numbers $x_1, \ldots, x_n$, weighted by $w_1, \ldots, w_n$. But diversities are always $\geq 1$, and a mean of numbers $\geq 1$ is $\geq 1$. So this is greater than or equal to $D_q(\mathbf{p}) \cdot 1$, as required.

More generally, the same argument shows that

$D_q(\mathbf{p}) \leq D_q(\mathbf{p} \circ (\mathbf{p}_1, \ldots, \mathbf{p}_n))$

as you’d expect from your favourite interpretation of entropy.

Posted by: Tom Leinster on May 8, 2011 10:58 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tobias had a technical question here. I think he’s trying to show that the Rényi entropy $I_q$ obeys the inequality [..]

Thanks for getting back to this. I’m glad to see that someone comments on my attempts ;) Apparently I haven’t explained myself well, since my incomplete proof is about something else…

I was talking about the relative Rényi entropy $I_q(P|Q)$ which takes two distributions as arguments, and I would like to show the following inequality: from $P=(p_1,\ldots,p_n)$ we construct

$P'=(p_1+p_2,p_3,\ldots,p_n)$

and use the same formula to define $Q'$ in terms of $Q$. I wonder whether the relative Rényi entropy satisfies

$I_q(P'|Q') \leq I_q(P|Q) \quad ?$

I can prove this for particular values of $q$, but so far not in generality. If it’s true, and if one can take care of zero probabilities and this kind of things, then that would turn relative Rényi entropy into a nice functor

$2/\Fin\Stoch\rightarrow\R_+$

Posted by: Tobias Fritz on May 8, 2011 3:15 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Just to be clear, are you using $I$ in the same way that Rényi does in his paper? On page 554, he defines

$I_q(P | R) = \frac{1}{q - 1} \log \Bigl( \sum p_i^q r_i^{1 - q} \Bigr).$

Is that the definition of relative entropy you’re using?

Sorry if you’ve already said this somewhere. It’s hard to keep track.

Posted by: Tom Leinster on May 8, 2011 3:51 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Yes, I have also noticed that it’s often hard keep track of things. So probably I should provide more detail and background information in my comments…

Then I might as well write down the inequality that I am trying to prove. I will now also write $r_i$ for the entries of the second distribution in order to avoid a clash of notation with ‘$q$’ as the parameter of relative Rényi entropy.

So here comes the problem:

$\frac{(p_1+p_2)^q}{(r_1+r_2)^{q-1}}\leq \frac{p_1^q}{r_1^{q-1}}+\frac{p_2^q}{r_2^{q-1}}\quad ?$

As of now I would like to leave aside the cases where any probability vanishes. So the assumptions are $p_i\in(0,1)$ with $p_1+p_2\leq 1$, and likewise for $r_1$, $r_2$.

Posted by: Tobias Fritz on May 8, 2011 4:12 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

That’s true for all $q \in [-\infty, 0] \cup [1, \infty]$. For $q \in [0, 1]$, the opposite inequality holds.

I’ll use the power mean notation that I mentioned here. Probably the most important theorem about power means is that $M_t$ is (non-strictly) increasing in $t \in [-\infty, \infty]$. (A special case of this that most people know is $M_0 \leq M_1$, the arithmetic-geometric mean inequality.)

Write

$\mathbf{w} = \Bigl( \frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2} \Bigr) \in \Delta_2$

(where $\Delta_n = \Delta^{n - 1}$), and

$\mathbf{x} = \Bigl( \frac{p_1}{r_1}, \frac{p_2}{r_2} \Bigr) \in (0, \infty)^2.$

$M_{-1}(\mathbf{w}, \mathbf{x})^{q - 1} \leq M_{q - 1}(\mathbf{w}, \mathbf{x})^{q - 1}?$

If $q \geq 1$ then raising to the power of $q - 1$ is an increasing function and $-1 \leq q - 1$, so the inequality holds. If $q \leq 0$ then raising to the power of $q - 1$ is a decreasing function and $-1 \geq q - 1$, so again the inequality holds. If $0 \leq q \leq 1$ then raising to the power of $q - 1$ is a decreasing function and $-1 \leq q - 1$, so the opposite inequality holds.

(I’m kind of ignoring the cases $q = 1, \pm \infty$, but by continuity that all works out.)

Posted by: Tom Leinster on May 8, 2011 5:41 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Cool! Since the $q-1$ factor in the definition of relative Rényi entropy reverses the inequalities again when $q\lt 1$, we get that the conjectured inequality holds for $q\geq 0$. I will write this down tomorrow as a new section in our notes.
Posted by: Tobias Fritz on May 9, 2011 10:12 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Ah, good. I was too lazy to reproduce the reduction of your original hoped-for inequality

$I_q(P' | R') \leq I_q(P | R)$

$\frac{(p_1 + p_2)^q}{(r_1 + r_2)^{q - 1}} \leq \frac{p_1^q}{r_1^{q - 1}} + \frac{p_2^q}{r_2^{q - 1}}$

but I was rather hoping that you’d made a slip in the reduction, multiplying through by $(q - 1)$ on each side and forgetting to reverse the inequality when $q \lt 1$. If I understand correctly, you did make some such slip, so that by my argument,

$I_q(P' | Q') \leq I_q(P | Q)$

for all $q \in [0, \infty]$. And presumably the opposite inequality holds for all $q \in [-\infty, 0]$. Have I got that right?

Posted by: Tom Leinster on May 9, 2011 10:34 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

In case you want to avoid operads, here’s an operad-less version of the operadic characterization of Shannon entropy.

The price you pay is having to work with monoidal 2-categories. I’ll skip some details. First I’ll state the result as quickly as possible. Then I’ll explain it a bit, including what it has to do with the operadic version.

Notation: I will regard a probability distribution on a finite set $E$ as a family $p = (p_e)_{e \in E}$ of nonnegative real numbers summing to 1.

Let $M_\Delta$ be the symmetric monoidal category in which:

• an object is a finite set
• a map $A \to B$ is a map $f: A \to B$ of finite sets together with a probability distribution on each fibre $f^{-1}(b)$ ($b \in B$). I will write such a map as a pair $(f, p)$, where $p(b) = (p(b)_a)_{a \in f^{-1}(b)}$ is the probability distribution on $f^{-1}(b)$.
• you should be able to guess what composition is
• tensor product is disjoint union.

Symmetric monoidal functors $M_\Delta \to Cat$ are the same thing as algebras in $\Cat$ for the operad $\Delta = (\Delta_n)_{n \in \mathbb{N}}$; but I’m not allowed to mention operads, so I’ll just describe the two such symmetric monoidal functors that I need. (By “monoidal” I mean weak/pseudo/strong monoidal.)

The first one is constant at the terminal category $1$.

The second one, $R: M_\Delta \to Cat$, is defined as follows.

• On objects, $R(A)$ is the monoid $(\mathbb{R}^A, +, 0)$ as a one-object category.
• On maps, take $(f, p): A \to B$ in $M_\Delta$. The induced functor (monoid homomorphism) $\mathbb{R}^A \to \mathbb{R}^B$ sends $x \in \mathbb{R}^A$ to $\Bigl(\sum_{a \in f^{-1}(b)} p(b)_a x_a \Bigr)_{b \in B}.$

Now, $Cat$ is actually a monoidal 2-category, and we can regard $M_\Delta$ as a monoidal 2-category with only identity 2-morphisms. We may therefore speak of monoidal lax transformations between monoidal functors $M_\Delta \to Cat$. Note the order of the words “monoidal” and “lax”: the monoidal structure is preserved up to iso, but the 2-category structure is only preserved laxly. (I’m dropping the word “symmetric” just to make things less cluttered, but really it’s there.)

In particular, we can consider monoidal lax transformations from the constant functor $1$ to $R$.

Theorem (Fadeev)   The monoidal lax transformations $1 \to R$ are precisely the real multiples of Shannon entropy.

Note:  Not quite. Fadeev’s theorem has a continuity condition, so I think I need to be working with (2-)categories internal to, or enriched in, $Top$. I haven’t yet figured out the correct thing to say, but I’m confident it can be done.

Now I’ll explain how this relates to the operadic characterization, which will also indicate why it’s true.

For any algebraic theory $T$, there’s a free finite product category $L_T$ containing a model of $T$, namely, the Lawvere theory of $T$. Its objects are the finite sets (or natural numbers), and a map $A \to B$ consists of an operation of arity $|A|$ for each $b \in B$.

Similarly, for any symmetric operad $P$, there’s a free symmetric monoidal category $M_P$ containing an algebra for $P$. Its objects are the finite sets (or natural numbers), and a map $A \to B$ consists of a map $f: A \to B$ of finite sets together with an operation of arity $|f^{-1}(b)|$ for each $b \in B$. I’d like to call this the symmetric monoidal theory of $P$.

In this case we start with the symmetric operad $\Delta$ whose $n$-ary component $\Delta_n$ is the $(n - 1)$-simplex. The resulting category $M_\Delta$ is the one described above. Its universal property is that for any symmetric monoidal category $E$, the symmetric monoidal functors $M_\Delta \to E$ are the $\Delta$-algebras in $E$.

We take $E = Cat$. The two algebras we’re interested in are the terminal one, and the monoid $(\mathbb{R}, +, 0)$ with the obvious action by $\Delta$. (It’s a convex space.) The monoidal functor $M_\Delta \to Cat$ corresponding to the algebra $\mathbb{R}$ is the $R$ described above.

Lax maps between $\Delta$-algebras in $Cat$ then correspond to monoidal lax transformations between monoidal functors $M_\Delta \to Cat$. So the theorem just stated is equivalent to my original statement: the lax maps $1 \to (\mathbb{R}, +, 0)$ are the scalar multiples of Shannon entropy.

One more thing. We’ve encountered at least four categories so far:

1. convex algebras/spaces
2. finite sets and stochastic maps
3. finite probability spaces
4. $M_\Delta$

I claim that, almost, all four come about in an automatic way from the Giry monad. The first is its category of algebras. The second is its Lawvere theory. The fourth is constructed in two steps: first you take the underlying symmetric operad of the Giry monad, and then you take the symmetric monoidal theory of that operad.

But what about the third, finite probability spaces? Well, it’s almost $M_\Delta/1$, the slice of $M_\Delta$ over its generating object $1$. The “almost” has to do with zero probabilities. I’ll explain the near-equivalence $FinProb \approx M_\Delta/1$ by pretending that zero probabilities don’t exist.

So: an object of $M_\Delta/1$ is a finite set $A$ with a map $(f, p): A \to 1$. Here $f$ is the unique map of sets $A \to 1$, and $p$ provides a probability distribution on each fibre. In other words, an object of $M_\Delta/1$ is a finite probability space $(A, p)$.

A map $(A, p) \to (B, r)$ in $M_\Delta/1$ is a pair $(h, u)$ where $h: A \to B$ is a map of sets and $u$ provides a probability distribution on each fibre $h^{-1}(b)$. And for it to be a map in the slice category, a certain triangle has to commute, and that turns out to be equivalent to the equation $p_a = r_{h(a)} u(h(a))_a$ for all $a \in A$. Dividing gives $u(h(a))_a$ in terms of $p$ and $r$. With a bit more calculation, you see that $u$ is uniquely determined by $p$ and $r$, in such a way that the pair $(h, u)$ amounts to just a measure-preserving map $h$. Thus, it’s the same as a map in $FinProb$ — as long as you ignore those zero probabilities.

I see that there have been lots of comments about slice categories so far, and I haven’t read them all, so maybe someone’s already reached this conclusion.

It might be possible to take advantage of the forgetful functor $FinProb \approx M_\Delta/1 \to M_\Delta$ in order to characterize entropy in terms of transformations between functors $FinProb \to Cat$. A snag is that $FinProb$ and $M_\Delta/1$ aren’t monoidal categories. But perhaps there’s a good way forward.

Posted by: Tom Leinster on May 3, 2011 4:34 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

This looks like great stuff, Tom! It will take me a while to digest it. I just thought I should respond to this:

In case you want to avoid operads…

I’m not trying to avoid operads, though my thrashing around may look like that. What I’m trying to do is take your characterizations of Shannon entropy and re-express them in a way that looks something like this:

Theorem: The only maps

$F: A \to X$

with certain properties are the multiples of Shannon entropy.

Here $A$ should be some gadget related to probability measures or stochastic maps or some other nice thing that could have an entropy. $X$ should be a gadget of the same mathematical type, but related to numbers: namely, things that are entropies.

I don’t really care what type of gadget $A$ and $X$ are: categories, monoidal 2-categories, algebras for some operad or monad or 2-monad, possibly internal to Top, etc. $F$ should be a map between gadgets of this type, possibly lax.

I want a theorem like this because I strongly feel that entropy, like cardinality, is a map that takes an object and distills it into a number saying how ‘big’ it is (how much information it has). Certain things we can do with these objects match up to certain things we can do with these numbers, and that’s why entropy is useful.

I think your ‘lax point’ characterization of Shannon entropy is secretly doing what I want here… but only secretly. Your lax points look like maps

$G : 1 \to X$

where $1$ is something very dull and $X$ looks like ‘numbers’. Somehow the probability measures are hidden in the laxness and the fact that $1$ and $X$ are being treated as $P$-algebras.

The idea that a lax map of $P$-algebras $F : 1 \to X$ looks like a $P$-algebra internal to $X$ is not the part I’m having trouble with. Jim and I used this idea to try to formalize the ‘microcosm principle’ in Definition 57 of HDA2. We said something roughly like this:

Definition. Let $O$ be an operad, let $A$ be an $(n+1)$-categorical $O$-algebra, and let $1$ be the terminal $(n+1)$-categorical $O$-algebra. Then we define an $n$-categorical $O$-algebra object in $A$ to be a morphism of $O$-opetopic sets $a: \tau \to A$. If $n = 0$, we call this simply an $O$-algebra object in $A$.

But somehow I’m being very slow about taking this lax map formulation and translating it into one that I think people will find more intuitive. No matter how fancy the machinery, I think people will appreciate clearly seeing a map sending things that have entropy to their entropies.

Hmm, maybe after that rant I’m finally ready to do this…

Posted by: John Baez on May 3, 2011 7:02 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I too am struggling to convert Tom’s account into what I know. $f: A \to B$ in $M_{\Delta}$ is a stochastic function (aka conditional probability distribution) in reverse. So I’m happier with $(M_{\Delta})^{op}$, but that shouldn’t be too big an obstacle. I think I just need to see an answer to the following:

What needs to be specified to give the monoidal lax transformation, $\alpha$, corresponding to Shannon entropy?

Presumably, for each finite set $A$, it will assign a functor, $\alpha(A)$, from the terminal category to the category $R(A)$. Each only has a single object. Doesn’t the arrow in the terminal category have to be sent to the constant zero function from $A$ to $\mathbb{R}$? What else?

Posted by: David Corfield on May 3, 2011 12:52 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

David wrote:

What needs to be specified to give the monoidal lax transformation, $\alpha$, corresponding to Shannon entropy?

I’ll let Tom answer that… I understand his operadic approach better than his new approach. But I won’t explain the operadic approach, either: I’ll just say what crucial property of Shannon entropy seems to be captured by the ‘laxness’ in that approach. This property has the so-called magical property in Fadeev’s theorem as a special case, so it indeed seems crucial if we’re trying to uniquely characterize Shannon entropy in some nice way.

Suppose $P$ is a probability distribution on the set $\{1, \dots, n\}$ and $P_i$ are probability distributions on some finite sets $S_i$, where $1 \le i \le n$. Then there’s a way to form a probability distribution on the set $S_1 + \cdots + S_n$. I’ll call this probability distribution $P \circ (P_1, \dots, P_n)$. Here’s how it works.

Say you want to pick a random element of $S_1 + \cdots + S_n$. First you choose a number from $1$ to $n$ using the probability distribution $P$. Then, if you picked the number $i$, you choose an element of $S_i$ using the probability distribution $P_i$.

Now, the crucial property of the Shannon entropy $S$ is:

$S(P \circ (P_1, \dots, P_n)) = S(P) + P(S(P_1), \dots, S(P_n))$

The only mysterious thing here should be the expression $P(S(P_1), \dots , S(P_n))$. Here I’m using the fact that a probability measure $P$ on the set $\{1,\dots, n\}$ lets us form a weighted average of $n$ numbers. I’m using $P(S(P_1), \dots S(P_n))$ to mean the weighted average of the numbers $S(P_i)$.

Tom saw that this crucial property would fall out naturally from having a certain lax map of operad algebras!

On the one hand we can use $P$ to glom together $n$ probability measure spaces. On the other, we can use it to glom together $n$ numbers. But we don’t have

$S(P (P_1, \dots P_n)) = P(S(P_1), \dots S(P_n))$

as you might naively hope. There’s an extra term, and that’s the ‘laxness’.

Posted by: John Baez on May 3, 2011 1:49 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Right. But you’ll notice in this explanation that you only apply entropy $S$ to distributions, arrows into $1$, or out of $1$ if you prefer the opposite category. That’s why I wanted to see what that $\alpha(A)$ does, to know how it fits with Tom’s induced functor $\mathbb{R}^A \to \mathbb{R}^B$, as I rather suspect that he isn’t according an entropy to an arrow $f: A \to B$, for non-singleton $B$.

Posted by: David Corfield on May 3, 2011 3:32 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Posted by: Tom Leinster on May 3, 2011 9:38 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I’ve said it implicitly lots of times, and John’s just said it fairly explicitly, but let me say it even more explicitly: the “magical property” of Shannon entropy $S$ is equivalent to

1. the formula

$S(P\circ(P_1, \ldots, P_n)) = S(P) + P(S(P_1), \ldots, S(P_n))$

for the entropy of an operadic composite, together with

2. the formula

$S((1)) = 0$

for the entropy of the operadic unit (the unique distribution on a one-element set).

This equivalence

magical property $\iff$ (1) + (2)

is under the assumption that $H$ is symmetric in its arguments.

Recall that the “magical property” is one of Fadeev’s axioms:

$S(t p_1, (1 - t)p_1, p_2, \ldots, p_n) = S(p_1, \ldots, p_n) + p_1 S(t, 1 - t).$

Now assume (1) and (2) above, and let’s prove the magical property. Take $P = (p_1, \ldots, p_n)$, $P_1 = (t, 1 - t)$ and $P_i = (1)$ for $i \geq 2$: then (1) gives

$S(t p_1, (1 - t)p_1, p_2, \ldots, p_n) = S(p_1, \ldots, p_n) + p_1 S(t, 1 - t) + \sum_{i = 2}^n p_i S((1))$

which by (2) gives the magical property. Conversely, assume the magical property. Taking $n = 1$, $p_1 = 1$ and $t = 1$ gives

$S(1, 0) = S((1)) + 1S(1, 0)$

from which (2) follows immediately. A straightforward induction (using symmetry) gives you (1).

So the magical property is just a very parsimonious way of saying how entropy interacts with the structure of our operad. For that reason, I propose to call it the operadic property.

Posted by: Tom Leinster on May 3, 2011 5:27 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

To relate this to what I was saying above about conditional entropy, the magic formula,

$S(P∘(P_1,...,P_n))=S(P)+P(S(P_1),...,S(P_n)),$

expresses

$H(X, Y) = H(X) + H(Y|X),$

where $Y$ is distributed over the space which is the disjoint union of the $S_i$.

Posted by: David Corfield on May 4, 2011 8:51 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Up there, David wrote:

$f: A \to B$ in $M_\Delta$ is a stochastic function (aka conditional probability distribution) in reverse.

Really? You’re saying that a map from $A$ to $B$ in $M_\Delta$ is a stochastic map from $B$ to $A$, i.e. a function from $B$ to the set $P(A)$ of probability distributions on $A$? I don’t think it is.

The definition is that a map $A \to B$ in $M_\Delta$ is a pair $(f, p)$ where $f: A \to B$ is a map of sets and $p = (p(b))_{b \in B}$ is a family consisting of a probability distribution $p(b)$ on $f^{-1}(b)$ for each $b \in B$.

Example: suppose that $A$ and $B$ each have cardinality 2. Then there are two maps $A \to B$ in $M_\Delta$ (both isomorphisms), but there are infinitely many stochastic maps from $B$ to $A$.

What needs to be specified to give the monoidal lax transformation, $\alpha$, corresponding to Shannon entropy?

Right. We’ve got two functors $1, R: M_\Delta \to \Cat$. Recall that $R$ assigns to a finite set $A$ the additive monoid $\mathbb{R}^A$, seen as a one-object category.

We’re interested in monoidal lax transformations $\alpha: 1 \to R$. The “monoidal” part is essentially a property rather than structure, so let’s forget about it for now and just think about lax transformations $\alpha: 1 \to R$.

A strict transformation $\alpha: 1 \to R$ consists of

• for each finite set $A \in M_\Delta$, a functor $\alpha_A: 1 \to R(A)$

such that

• for each map $(f, p): A \to B$ in $M_\Delta$, the square of functors $\begin{matrix} 1 &\stackrel{\alpha_A}{\to} &R(A) &\\ \downarrow& &\downarrow&R(f, p) \\ 1 &\stackrel{\alpha_B}{\to} &R(B)& \end{matrix}$ commutes.

But as you noticed, all of this is trivial: since $R(A)$ has only one object, the functors $\alpha_A$ are uniquely determined, and the commutativity of the square is automatic.

A lax transformation is the same, except that that square of functors isn’t required to commute: there is instead a natural transformation inside it. Thus, for each map $(f, p): A \to B$ in $M_\Delta$, we have a natural transformation $\begin{matrix} 1 &\stackrel{\alpha_A}{\to} &R(A)&\\ \downarrow&\Downarrow \alpha_{(f, p)} &\downarrow&R(f, p) \\ 1 &\stackrel{\alpha_B}{\to} &R(B).& \end{matrix}$ A natural transformation of this type is simply an element of the monoid $R(B) = \mathbb{R}^B$. So, what needs to be specified is: for each map $(f, p): A \to B$ in $M_\Delta$, an element $\alpha_{(f, p)} \in \mathbb{R}^B$. In particular, taking $B$ to be a one-element set, this means specifying a real number for each probability distribution on a finite set.

Of course, the natural transformations $\alpha_{(f, p)}$ are subject to some coherence axioms, and we also have to remember that $\alpha$ was supposed to be monoidal. The upshot is that $\alpha_{(f, p)} \in \mathbb{R}^B$ must have $b$-component $C \cdot H(p(b))$ where $H$ is Shannon entropy and $C$ is a real constant independent of everything.

Posted by: Tom Leinster on May 3, 2011 9:36 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Really?

Oh, you’re right. You could say they’re special stochastic functions, where in the row stochastic matrix every column has only one nonzero entry (or interchange ‘row’ and ‘column’ looking at things your way). Grouping them according to fibres, the stochastic matrix would be a series of distributions, stepping down, indexed by $b \in B$, where your $A$ is the disjoint union of $S_i$ mentioned here.

I hadn’t really got the idea of lax points before, so thanks. Do they crop up interestingly in a range of places?

So, what needs to be specified is: for each map $(f, p):A \to B$ in $M_{\Delta}$, an element $\alpha: (f,p)\in \mathbb{R}^B$.

Great, so one doesn’t even have an entropy per se for one of your special stochastic functions. It awaits a distribution on $B$ to be able to give a number, the conditional entropy.

Posted by: David Corfield on May 4, 2011 9:09 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

David wrote:

You could say [maps in $M_\Delta$ are] special stochastic functions

What you’re doing here is comparing

the free finite product category containing an algebra for the Giry monad

with

the free symmetric monoidal category containing an algebra for the Giry operad.

(By the “Giry monad” I mean, probably improperly, the monad on $Set$ whose algebras are convex spaces. By the “Giry operad” I mean its underlying symmetric operad.)

The first category is the Lawvere theory of the Giry monad, which is the category $FinStoch$. The second is what I called $M_\Delta$. By the universal property, there’s a canonical functor $M_\Delta \to FinStoch$. This has nothing to do with the particular monad we’re studying: it’s a completely general construction.

This functor $M_\Delta \to FinStoch$ is bijective on objects (which in both cases are natural numbers). I guess it’s faithful, which would justify your description of maps in $M_\Delta$ as “special stochastic functions”.

Regarding lax points: yes, they crop up interestingly elsewhere. Michael Batanin called them internal algebras. Here’s why.

Let $P$ be an operad of sets, and let $Y$ be a $P$-algebra in $Cat$. A lax point, or internal $P$-algebra, is a lax map $1 \to Y$ of $P$-algebras in $Cat$.

Example  Let $P$ be the terminal operad (non-symmetric, say). Then $P$-algebras in a monoidal category are monoids; in particular, $P$-algebras in $Cat$ are (strict) monoidal categories. An internal $P$-algebra in a monoidal category $Y$ is a lax monoidal functor $1 \to Y$, which is the same thing as a monoid in $Y$. Since $P$-algebras are the same thing as monoids, another way to phrase this is “an internal monoid in $Y$ is an internal monoid in $Y$”. So it’s good terminology, in this case at least.

Example  Let $M$ be a monoid. Let $P$ be the operad whose unary operations form the monoid $M$, and with no non-unary operations. Then $P$-algebras in a monoidal category are objects equipped with a left $M$-action. In particular, $P$-algebras in $Cat$ are “$M$-categories”. An internal $P$-algebra in an $M$-category $Y$ is an object $y \in Y$ equipped with a map

$m \cdot y \to y$

for each $m \in M$, satisfying the evident action-like axioms. It’s reasonable to call this an “internal $M$-object in $Y$”, so again the terminology is sensible.

I’d add the caveat, though, that the category we’re concerned with here is a one-object category (the additive monoid $\mathbb{R}$). So if we want to say something like “Shannon entropy and its multiples are the only internal convex algebras in $\mathbb{R}$”, we need to do so with the awareness that there’s a level shift.

Posted by: Tom Leinster on May 8, 2011 5:52 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Thanks. I guess I’d like to know whether there’s something deep about Shannon entropy being an internal algebra, or whether it’s a fluke. Could there be something systematic about it? Hence my query below as to what might happen with a slight shift of monad to subprobabilities or all measures.

Is there also a Rényi-Tsallis monad, where points of $\mathbb{R}$ might be one of the two entropies?

Posted by: David Corfield on May 8, 2011 11:04 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Nice! I think I understand the ideas.

One question about your category $M_\Delta$: are the morphisms automatically surjective maps? For a probability distribution to exist on a fiber, that fiber in particular has to be nonempty, right? So each fiber has to be nonempty, meaning that the function is surjective.

On the issue of zero probabilities: maybe one can deal with this by identifying two measure-preserving maps whenever they coincide almost surely? This defines a quotient category of $\Fin\Prob$ in which two probability spaces are isomorphic whenever there exists a measure-preserving bijection between their supports.

Posted by: Tobias Fritz on May 3, 2011 1:36 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Thanks. Yes, the morphisms in $M_\Delta$ are automatically surjective, because there’s no probability distribution on the empty set.

I’m not worried about the fact that $M_\Delta/1$ and $FinProb$ are only nearly isomorphic (because of zero probabilities). It might be that $FinProb$ isn’t quite the right category to be considering. The way I view the present evidence is that $M_\Delta/1$ has more of a right to be considered categorically natural than $FinProb$ does, because the former arises canonically from the finitary Giry monad and the latter, as far as I know, doesn’t.

Posted by: Tom Leinster on May 3, 2011 5:15 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Short memory or what? Two days previously, I’d observed that $FinProb$ does indeed arise canonically from the Giry monad.

So now we have two ever-so-slightly different categories, $FinProb$ and $M_\Delta/1$, each arising canonically from the Giry monad. Strange.

Posted by: Tom Leinster on May 8, 2011 4:17 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

In reply to this comment of John’s: at the moment I’m just generating stuff.

A little while ago, John, I sent you an email preemptively apologizing that I won’t be able to put as much energy into this as I’d like to. It might look like that apology was misplaced, given that I’ve written quite a lot here, but actually what I’ve done has been pretty hurried. If I wasn’t for other things going on, I’d be taking a more reflective approach, trying to see the connections between the many ideas so far rather than just flinging out more and more of them. I’m hoping someone else will do the connecting up: maybe you, maybe Tobias, maybe someone else, maybe a future me.

Anyway, I agree with you that it would be nice to describe entropy as the unique map

(things that have entropy) $\to$ numbers

with certain properties. Of course Rényi and Fadeev have already done this. We’re trying to make entropy look categorically inevitable, though, and it may be that the best categorical description is not of this form.

Re lax points or internal algebras, I should probably have mentioned a while ago that this has very strong echos of the microcosm principle; I had realized this, but should have said so.

Posted by: Tom Leinster on May 3, 2011 1:43 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

What kind of changes to this set-up might be expected to continue to yield something interesting? E.g., what are the lax points of $\mathbb{R}$ for the subprobability monad? I see there’s a notion of entropy for subprobability measures

$H(P) = \frac{-\sum p_i log p_i}{\sum p_i}.$

But there seem to be alternatives, e.g.,

$H(P) = -\sum \frac{p_i}{\sum p_i} log (\frac{p_i}{\sum p_i}).$

At first glance, operadic composition looks unlikely to work.

Then there’s the full cone of positive measures.

Posted by: David Corfield on May 5, 2011 12:02 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Thanks for the new stuff, Tobias! I’ll check it out! In my opinion, all three of us should feel free to systematize notation between the various parts of this ‘paper’, and otherwise improve it.

Speaking of notational matters: I’m a bit nervous about using $\Delta_n$ for the $(n-1)$-simplex, since $\Delta^n$ is often used for the $n$-simplex. But maybe that’s okay because they’re both fundamental in different ways.

I’m also slightly nervous about using $\Delta$ for the monad sending any set to the set of finitely supported probability measures on that set, since $\Delta$ is often used for the category of simplices (not with convex maps, but with maps sending vertices to vertices in an order-preserving way). I’m a bit more nervous about calling this monad $\Delta$ the ‘Giry monad’, since people seem to use that phrase for the monad sending any Polish space $X$ to the Polish space of subprobability measures on $X$. At least that’s what they do here and here.

So, for now, I’m tend to use $P(X)$ for the space of probability measures on a space $X$, and call it $P(n)$ when $X = \{1, \dots, n\}$. This $P$ is a monad on Polish spaces… but I’ll be quite happy if we focus attention to probability measures on finite sets in this paper, in which case we could also use finitely supported probability measures if that’s technically more convenient.

Wow, what a boring and annoying comment! And worst of all, it seems like I’m picking on Tom throughout. Anyway, I thought I’d mention these small things. I had some other more interesting things to say, but I’ll do that in another comment.

Posted by: John Baez on May 3, 2011 6:05 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

My attitude to these things is just to write macros. And as we’re some way off actually having a paper, it doesn’t bother me too much.

Anyway, I probably agree with you about most things.

On $\Delta_n$ for the $(n - 1)$-simplex, well, the reason I chose this notation is of course that $\Delta^n$ is the $n$-simplex. But yeah, then we’re more or less compelled to use $\Delta$ for the corresponding operad or monad or theory, which sucks. I think $\mathbf{P}$ is a good alternative.

I like to use $P$ for a generic operad, though I don’t mind breaking that habit. I know you’re not into $T$ for monad, John, but it really is standard (from Theory or Triple, I guess) and I don’t see a great reason to change it. Confusion with temperature might not be a real risk.

As for what the “Giry monad” is, I’m quite likely using it in a nonstandard way. Giry herself constructs the monad assigning to a measurable or Polish space the space of probability measures on it (I mean, she constructs two different monads). What I’ve been calling the Giry monad, I think Tobias calls the finitary Giry monad. I prefer descriptive names: perhaps the “convex monad”?

Mostly I’m not really bothered yet because we seem to be a long way off having a paper: I mean, we don’t even know what the main result is yet, do we? But that’s not to say that it’s not worth settling on conventions early.

Posted by: Tom Leinster on May 3, 2011 1:47 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

More importantly: why are you writing about power means on our nLab pages, Tom? I’m not complaining—it’s great, but I feel you’re way ahead of me and I’m dimly trying to comprehend what you’re up to. But now I finally have a guess: you’re going to show that if we replacing the usual ‘weighted arithmetic mean’ by a power mean in this formula crucial to the characterization of Shannon entropy:

$S(P \circ (P_1, \dots , P_n)) = S(P) + P(S(P_1), \dots , S(P_n))$

we get a characterization of Rényi entropy! In other words, we apply $P$ to the list of numbers $S(P_1), \dots , S(P_n)$ using a power mean instead of the arithmetic mean.

This could explain why you’re classifying ‘convex algebra structures on $\mathbb{R}_+$ satisfying the homogeneity axiom’.

But I guess I have another question beside my general question “am I reading your mind correctly?” Namely: does the Rényi entropy with parameter $q$ really obey a modified version of the above equation where $P$ acts via a power mean rather than the arithmetic mean?

Posted by: John Baez on May 3, 2011 3:15 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Unfortunately the above formula does not hold… Here comes a counterexample.

An interesting special case of Rényi entropy is the min-entropy $H_\infty$. Given an $n$-element probability distribution $P=(p_1,\ldots,p_n)$, its min-entropy is

$H_\infty(P)=\log\frac{1}{\max_i p_i}$

(Side note: the corresponding convex algebra structure on $\mathbb{R}_+$ in this case is the maximum operation, i.e. tropical addition!)

So let’s use $H_q$ for $q=\infty$ and consider the distributions $P_1=(1/2,1/2)$, $P_2=(1)$ and $P=(1/2,1/2)$. Then $P\circ (P_1,P_2)=(1/4,1/4,1/2)$, so that the left-hand side becomes $log(2)$. However, both terms on the right-hand side are also $log(2)$! So in this case, the conjectured equation does not hold.

Due to continuity in $q$, this also provides a counterexample for any other large enough value of $q$.

Posted by: Tobias Fritz on May 3, 2011 6:12 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

To add to what Tobias wrote, here’s the general formula for the Rényi entropies of an operadic composite.

Notation: $H_q$ is the Rényi entropy of order $q \in [0, \infty]$. Its exponential is $D_q = \exp(H_q)$, the “Rényi extropy” of order $q$. Ecologists call $D_q$ the Hill number of order $q$; the $D$ stands for “diversity”.

We’re going to take the operadic composite $P \circ (P_1, \ldots, P_n)$ where $P = (p_1, \ldots, p_n)$. Since I’m more used to working with the extropies than the entropies, I’ll give the formula for those first:

$D_q(P\circ(P_1, \ldots, P_n)) = \begin{cases} \biggl( \sum_{i: p_i \gt 0} p_i^q D_q(P_i)^{1 - q} \biggr)^{1/(1-q)} & if q \neq 1, \infty \\ D_1(P) \prod_i D_1(P_i)^{p_i} & if q = 1 \\ min_{i: p_i \gt 0} (D_\infty(P_i)/p_i) & if q = \infty. \end{cases}$

Now let’s see if I can successfully take logarithms. I reckon this translates to:

$H_q(P\circ(P_1, \ldots, P_n)) = \begin{cases} \frac{1}{1 - q} \log \sum_{i: p_i \gt 0} p_i^q \exp((1 - q)H_q(P_i)) & if q \neq 1, \infty \\ H_1(P) + \sum_i p_i H_1(P_i) & if q = 1 \\ min_{i: p_i \gt 0} (H_\infty(P_i) - \log p_i) & if q = \infty. \end{cases}$

Note that in all cases, the right-hand side is a function of $P = (p_1, \ldots, p_n)$ and $H_q(P_1), \ldots, H_q(P_n)$: you don’t need to know $P_1, \ldots, P_n$ themselves. But only in the case $q = 1$ does the right-hand side involve a term $H_q(P)$. (See comment below.)

Posted by: Tom Leinster on May 3, 2011 6:40 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

So in the language of conditional entropy, we can say that it’s much harder to define a conditional Rényi entropy, as explained here.

Posted by: David Corfield on May 4, 2011 10:36 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

On the other hand, the Tsallis entropy works more easily in this respect. Here you’re looking for a chain rule

$H(X, Y) = H(X) + H(Y|X) + g(H(X), H(Y|X)).$

If you plump for the simple $g(x, y) = const.x y$, then you reach the Tsallis entropy, satisfying:

$T_q(X, Y) = T_q(X) + T_q(Y|X) - (q - 1)T_q(X)T_q(Y|X).$

The definition of Tsallis entropy is $T_q = \frac{1}{q - 1}(1 - \sum_i p_i^q)$. The relation with Rényi entropy, $H_q$ is given by

$T_q = \frac{1}{q - 1} (1 - e^{(q - 1)H_q}).$

It seems to be commonly the case that some of the nice properties of the Shannon entropy are retained by the Rényi entropy and some by the Tsallis entropy, but that leaving $q = 1$ behind you can’t have both at the same time.

There’s a good explanation of the above in Generalized information and entropy measures in physics.

Posted by: David Corfield on May 4, 2011 11:34 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I don’t get it. The Rényi and (so-called) Tsallis entropies are related by an invertible transformation. So surely anything you can do with one, you can do with the other?

Posted by: Tom Leinster on May 4, 2011 10:39 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I meant that there are pleasant features of the Shannon entropy found in Rényi entropy but not in Tsallis entropy, and vice versa.

For example, the Rényi entropy of two independent systems is the sum of their individual Rényi entropies. One can formulate a Tsallis entropy version of independence but it’s messier – there’s an adjustment term depending on $q$ and the product of the entropies.

Similarly, Tsallis entropy comes into its own when you’re looking to express the entropy of a joint system by decomposing it, while the Rényi entropy formulation is much messier ((48) of this).

Posted by: David Corfield on May 5, 2011 9:37 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

OK, got you. So we still have the same functional dependencies, but some of them look simpler with Rényi than with Tsallis, and others look simpler with Tsallis than with Rényi.

That’s interesting to me, because I’d pretty much reached the point where (the exponential of) Rényi entropy seemed to me superior in every way to Tsallis entropy.

Posted by: Tom Leinster on May 5, 2011 11:29 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Actually, David, I’m not convinced that that equation (48) is really so messy — at least, not if you use the exponential $D_q$ of Rényi entropy. It then becomes the formula

$D_q(P \circ (P_1, \ldots, P_n)) = \Bigl( \sum p_i^q D_q(P_i)^{1 - q} \Bigr)^{1/(1 - q)}$

that we’ve been discussing (e.g. here). The corresponding Tsallis formula is, if I’m not mistaken,

$T_q(P \circ (P_1, \ldots, P_n)) = T_q(P) + \sum p_i^q T_q(P_i).$ That’s nice, and generalizes the $q = 1$ case nicely. Actually, I think that’s the nicest formula involving Tsallis entropy that I’ve seen. But it doesn’t seem an order of magnitude cleaner than the formula for $D_q$.

Posted by: Tom Leinster on May 6, 2011 5:21 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Beck suggests that

by taking the exponential one clearly removes the logarithm in the definition of the Rényi entropies in eq. (15). This means one is effectively back to the Tsallis entropies.

So $D_q$ and $T_q$ might be taken to be similar enough that only one needs to be defined. Could it be that the ‘magic formula’ for $T_q$ points to an operadic composition in the clearest way?

Posted by: David Corfield on May 6, 2011 8:33 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I can’t understand why Beck says this. Rényi entropy, Rényi extropy and Tsallis entropy are all related to each other by invertible, order-preserving transformations. In that sense they’re all “effectively” the same. But why does he believe that Rényi extropy and Tsallis entropy are particularly similar? Rényi extropy is,

$\Bigl( \sum p_i^q \Bigr)^{1/(1 - q)}$

while Tsallis entropy is

$\frac{1}{q - 1} \Bigl( 1 - \sum p_i^q \Bigr).$

I wouldn’t call them any more similar to each other than to Rényi entropy,

$\log \biggl( \Bigl( \sum p_i^q \Bigr)^{1/(1 - q)} \biggr).$

In fact, if you look at the case $q = 1$, these three expressions become, respectively,

$e^{H(\mathbf{p})}, \quad H(\mathbf{p}), \quad H(\mathbf{p}).$

So it’s the second and the third (the two entropies) that are the same here, not the first and the second.

In short, I don’t see what he’s driving at.

Posted by: Tom Leinster on May 8, 2011 6:31 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I want to explain something about the relationship between Rényi entropy $H_q$, Rényi extropy $D_q$, and “Tsallis” entropy $T_q$, and why I’ve come to regard Rényi extropy as the superior form. (I put “Tsallis” in quotes because the entropies that often bear his name enjoyed a lively existence in information theory and theoretical ecology for over 20 years before Tsallis discovered them.) These three quantities are, respectively,

$H_q(\mathbf{p}) = \frac{1}{1 - q} \log \Bigl( \sum p_i^q \Bigr), \quad D_q(\mathbf{p}) = \Bigl( \sum p_i^q \Bigr)^{1/(1 - q)}, \quad T_q(\mathbf{p}) = \frac{1}{q - 1} \Bigl( 1 - \sum p_i^q \Bigr).$

As I mentioned before, each is related to the others by an invertible, order-preserving transformation, and to that extent they’re interchangeable.

But the unique selling point of Rényi extropy is that it’s an effective number:

$D_q(1/n, \ldots, 1/n) = n.$

Since the uniform distribution maximizes $D_q$, another way to say this is that $max_{\mathbf{p}} D_q(\mathbf{p}) = n$, where the maximum is over probability distributions $\mathbf{p}$ on $\{1, \ldots, n\}$.

Any kind of entropy can be transformed, in a more or less unique way, into an effective number. For example, the effective number form of both Rényi entropy and Tsallis entropy is Rényi extropy. This is because all three are functions of $\sum p_i^q$ and Rényi extropy is an effective number.

In ecology, $D_q(\mathbf{p})$ is called the Hill number of order $q$, and is used as a measure of diversity. That’s why I use the letter D, and that’s why I prefer the term diversity (of order $q$) to “Rényi extropy” or “Hill number”. The idea is that we have an ecological community made up of $n$ species in proportions $p_1, \ldots, p_n$. Diversity is maximized when the species are evenly balanced. I wrote a long explanation a couple of years ago.

The effective number property makes $D_q$ behave very intuitively. If the diversity of an ecosystem is 2144, that means it’s as diverse as a system of 2144 species in equal proportions. There are “effectively” 2144 species.

Lou Jost has pointed out with compelling examples that if your entropy is not an effective number, the numbers it produces will be treacherous.

To elaborate on his example here, suppose there’s been an oil spill. Ecologists claim it’s caused devastation, but the oil company announces that biodiversity in the marine ecosystem has dropped by just 2%. Pressed for details, the oil company scientists say they’re using a standard measure of diversity, the Tsallis entropy of order $2$:

$T_2(\mathbf{p}) = 1 - \sum p_i^2$

(known to ecologists as the Simpson or Gini-Simpson index). They reveal that the pre-spill diversity was 0.99, and now it’s 0.97, which is indeed a drop of 2%.

But the ecologists are right. If $T_2(\mathbf{p}) = 0.99$ then $D_2(\mathbf{p}) = 100$, and if $T_2(\mathbf{p}) = 0.97$ then $D_2(\mathbf{p}) = 33$. So before the spill, the marine community was as diverse as a community of 100 species in equal proportions, and after the spill, it’s as diverse as a community of 33 species in equal proportions. If the pre- and post-spill communities really did consist of species in equal proportions, then the spill would have wiped out two thirds of them. Any sensible measure — that is, any effective number — would drop by two thirds. But $T_2$, not being an effective number, drops by a completely misleading 2%.

Going back to the pure mathematics of it, one reason I favour effective numbers is that they seem to be the right thing in my ongoing efforts to study “size”. For example, I observed above that the cardinality of a finite set is the maximum value of $D_q$ of a probability distribution on it, precisely because $D_q$ is an effective number. There’s also a version of $D_q$ for distributions on metric spaces, and its maximum value is closely related to the magnitude of the space. Again, that only works because $D_q$ is an effective number.

I guess the unique selling point of Tsallis entropy is that it can be expressed as a sum

$T_q(\mathbf{p}) = \sum_i \tau_q(p_i)$

for some function $\tau_q$ (namely, $x \mapsto \frac{1}{q - 1}(x - x^q)$). Rewritten as

$T_q(\mathbf{p}) = \sum_i p_i \sigma_q(p_i),$

this is the interpretation of entropy as expected surprise; $\sigma_q$ assigns to a probability a measure of surprise at the event occurring. But I don’t really know. Is that what users of Tsallis entropy would point to as its fundamental property?

Posted by: Tom Leinster on May 8, 2011 7:37 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Is that what users of Tsallis entropy would point to as its fundamental property?

We could do with a Tsallis entropy fan to present their case. It does seem to be that with Tsallis you can modify equations satisfied by Shannon entropy in the simplest way, e.g., back here, perhaps by using deformed logarithms.

I think also that the fact that the Tsallis divergence (relative entropy) fits into Amari’s family of $\alpha$-divergences must be telling us something.

There’s some interesting comparative work done by Amari in the context of three families of divergences in Families of Alpha- Beta- and Gamma-Divergences: Flexible and Robust Measures of Similarities, especially in the appendices. Hmm, the deformed exponential of the Tsallis entropy equals the ordinary exponential of the Renyi entropy.

Something I’d certainly hope to see explained if category theory is to throw light on entropy, etc., is why differential geometry seems to emerge so naturally. It hasn’t always been presented in the most straightforward of ways, but time and again when I worked with a machine learning group those who understood various algorithms as performing projection along geodesics came up with the best ideas. There’s even the EM-algorithm which involves alternating projections along the geodesics of a pair of dual connections (whose divergences are Kullback-Leibler, or whatever we decided to call it, and inverse Kullback-Leibler).

Amari explains this in one of his clearest papers Information geometry in optimization, machine learning and statistical inference. For a basic video introduction to information geometry try Dasgupta’s lecture. You have to see once in your life the idea of projecting from a prior along a geodesic to the space of distributions matching in some respect the empirically observed distribution. This is where the corresponding exponential family intersects that space. Alternatively, you project from the empirical distribution to the exponential family.

Posted by: David Corfield on May 9, 2011 11:18 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I was wrong, and the truth is more interesting than I’d believed.

Back here, I wrote down the formula for the Rényi entropy of an operadic composite. Actually, things look nicer if you use the exponential of Rényi entropy, the “Rényi extropy” or Hill number $D_q$. We had a composite

$P \circ (P_1, \ldots, P_n)$

in the operad for convex algebras (whose space of $n$-ary operations is $\Delta_n = \Delta^{n - 1}$), and we were writing $P = (p_1, \ldots, p_n)$. I correctly gave the formula

$D_q(P \circ (P_1, \ldots, P_n)) = \begin{cases} \biggl( \sum_{i: p_i \gt 0} p_i^q D_q(P_i)^{1 - q} \biggr)^{1/(1 - q)}& if q \neq 1, \infty\\ D_1(P) \prod_i D_1(P_i)^{p_i}& if q = 1\\ min_{i: p_i \gt 0} (D_\infty(P_i)/p_i)& if q = \infty. \end{cases}$

But then I (essentially) wrote:

only in the case $q = 1$ does the right-hand side involve a term $D_q(P)$.

And while that’s literally true, it’s morally false, in the following interesting way.

Every probability distribution $P = (p_1, \ldots, p_n)$ gives rise to a one-parameter family of probability distributions $P^{(q)}$ ($q \in \mathbb{R}$), sometimes called its escort distributions. Write

$Z(P): q \mapsto \sum_i p_i^q$

for the partition function of $P$, and $p^{(q)}_i = p_i^q/Z(P)(q)$; then put

$P^{(q)} = (p^{(q)}_1, \ldots, p^{(q)}_n).$

This is a probability distribution. In terms of the vector space structure on $\Delta_n$, the escort distributions are the points on the 1-dimensional subspace spanned by $P$.

Now the formula above becomes

$D_q(P \circ (P_1, \ldots, P_n)) = D_q(P) \cdot \biggl( \sum_{i: p_i \gt 0} p^{(q)}_i D_q(P_i)^{1 - q} \biggr)^{1/(1 - q)}$

for $q \neq 1$. Expressed in this way, the right-hand side does involve a term $D_q(P)$.

In the product on the right-hand side, the second factor is a (weighted) power mean. Here’s some standard notation. Let $w = (w_1, \ldots, w_n) \in \Delta_n$, let $x = (x_1, \ldots, x_n) \in [0, \infty)^n$, and let $t \in [-\infty, \infty]$. The power mean of $x_1, \ldots, x_n$, weighted by $w_1, \ldots, w_n$, of order $t$, is

$M_t(w, x) = \begin{cases} \Bigl( \sum w_i x_i^t \Bigr)^{1/t} &if t \neq -\infty, 0, \infty\\ min x_i &if t = -\infty\\ \prod x_i^{w_i} &if t = 0\\ max x_i &if t = \infty \end{cases}$

where the sum, min, product and max are over all $i$ such that $w_i \gt 0$. So then,

$D_q(P \circ (P_1, \ldots, P_n)) = D_q(P) \cdot M_{1 - q}(P^{(q)}, D_q(P_\bullet)),$

where $D_q(P_\bullet) \in \mathbb{R}^n$ has $i$th coordinate $D_q(P_i)$. This holds uniformly for all $q \in [0, \infty]$.

As an aside, $D_q(P \circ (P_1, \ldots, P_n))$ itself can be expressed as a power mean:

$D_q(P \circ (P_1, \ldots, P_n)) = M_{1 - q}(w, D_q(P_\bullet)/w_\bullet).$

I don’t know what to do with this thought.

All of this is part of my answer to John and Tobias when they ask why I’m bringing power means into this conversation.

Posted by: Tom Leinster on May 6, 2011 5:06 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Thanks a million, Tom! Sitting in the talks today I got some good ideas, and your calculations seem to be related. I’ll try to say something more precise soon, but right now I’ll just say that these ‘escort distributions’ (I’d never heard that term) are the equilibrium states at different temperatures of a physical system with a fixed Hamiltonian (which physicists think about all the time, and are crucial to my understanding of Rényi entropy). So, I’m very happy to see them showing up: this is starting to look like what I always wanted! I think we’ll soon have a very nice paper. I’m not sure what it’ll say, but (curiously) I think the really hard part is almost done.

Posted by: John Baez on May 6, 2011 5:24 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

John wrote:

‘escort distributions’ (I’d never heard that term) are the equilibrium states at different temperatures of a physical system with a fixed Hamiltonian.

I think it is good idea to considering the entropy evolution, in this interpretation, as Hamiltonians renormalization flow: morphisms in Hamiltonians’ category (homomorphisms of the bilinear forms).

Because of the entropy is a “more flexible measure of dimensionality” (rational in constructive case), it is possible considering this flow as homotopies chain connecting the Hamiltonians even with different equilibrium dimensions.

Posted by: Maxim Budaev on May 6, 2011 9:32 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Sitting in a talk today, I learned about epitopes. Of course, I thought of you and Jim.

I really hoped he’d go on to talk about the homology of epitopes.

I hope you’re feeling well again.

Posted by: Tom Leinster on May 6, 2011 5:33 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

Sitting in a talk today, I learned about epitopes. Of course, I thought of you and Jim.

(Tom’s making a joke about ‘opetopes’.)

By the way: what is this conference you’re at?

I hope you’re feeling well again.

Thanks – I’m feeling great. First of all, it was just a 36-hour bug. Second of all, your new calculations on the quasi-operadic nature of Rényi extropy fill me with hope. Third of all, they resonate in an exciting way with the ideas I had this morning. And fourth of all, I’ve been spending lots of time listening to talks by Susan Holmes, whose work combines fascinating math with lots and lots of attention to real-world biological data. Ant networks, pyrosequencing, metagenomics, the microbiome, treespace, the statistical analysis of heterogeneous datasets… my mind is reeling in a most pleasant way from a bombardment of exciting new ideas!

I hope that sometime you and Mike Shulman and anyone else interested in operads visits Azimuth and tell me what you think about the phylogenetic operad. It’s very similar to some classic operads showing up in homotopy theory, but I’ve never seen it before.

Posted by: John Baez on May 6, 2011 10:53 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

By the way: what is this conference you’re at?

It was the open day (actually two days) of the Boyd Orr Centre for Population and Ecosystem Health. This is the interdisciplinary centre in Glasgow that I joined last year and wrote about here.

It was pretty good. I learned a lot about infectious diseases. (Fun fact: keeping cows drastically cuts your risk of catching malaria.) Christina Cobbold gave a great talk on two aspects of diversity. The first part was on the diversity of variants of trypanosome, the parasite that causes sleeping sickness. Its nasty trick is to keep changing its surface, so that as soon as your immune system generates the white blood cells to fight it, it mutates into something slightly different. The new white cells are now useless, so your poor immune system has to start the process all over again. If you get infected, it’s a bit like catching many diseases in succession: you get ill, then recover, then get ill again, then recover… all because of the diversity of the variants.

The second part was about our joint work on measuring diversity in general. If all those epidemiogists and ecologists and evolutionary biologists in the audience were watching carefully, they’d have seen the formula for Rényi extropy.

Posted by: Tom Leinster on May 8, 2011 3:48 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Looks like James Griffin got there first. (-:

I’m having trouble keeping up with everything I’m supposed to keep up with right now, but I do intend to drop by Azimuth sometime.

Posted by: Mike Shulman on May 6, 2011 10:29 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

I thought a little bit about this a year or two ago and wonder if you’ve considered the following point of view.

Consider the category of finite probability spaces and surjections. We have a covariant functor to vector spaces which gives functions and assigns to a map the corresponding ‘integration along the fiber’. Consider this instead as a functor to affine spaces and try to deform the constant term (fixing the linear part). One easily checks that Shannon entropy is the unique deformation, i.e. that the assignment

$\begin{matrix} X &\mapsto& Fun(X)\\ f: X \to Y &\mapsto &f_* + H(f) \end{matrix}$

is a functor iff $H(f)$ is a multiple of the Shannon entropy. To be clear about what kind of object this is, $H(f)$ is a function on the target of $f$. I think that to be perfectly honest we require some continuity here.

To see why this is so, lets just write out functoriality in a simple case

$X \xrightarrow{\pi} Y \to \ast$

$H(X \to \ast) + Pushforward(X \to \ast) = Pushforward(Y \to \ast) ( \pi_* + H(\pi) ) + H(Y \to \ast)$

or in other words:

$H(X \to \ast) = Pushforward( Y \to \ast ) (H(\pi:X \to Y)) + H(Y\to \ast)$

which is exactly the usual formula “total entropy = expected entropy of a subsystem + entropy from selecting a subsystem”.

Is this at all related to what you guys are doing?

[Josh: I fixed the math. To enter Latex at this blog, choose a text filter such as “itex to MathML with parbreaks” from the drop-down menu on the comment form. Most Latex works as usual, including AMS embellishments. Displays go between double dollars. Tom]

### Re: Category-Theoretic Characterizations of Entropy

Could you expand on this some more? In particular, do you really mean to start from the category of finite probability spaces with surjections? That category is equivalent to the category of finite sets with surjections, and defining entropy here doesn’t make much sense. Maybe you meant to take the category of finite probability spaces with measure-preserving functions? If so, then we are using this category a lot. We call it $\Fin\Prob$.

Then, can you give a more precise definition of the functor? What is ‘integration along the fiber’? And how do you prove that Shannon entropy is the unique (say, continuous) deformation of this functor?

The “total entropy = expected entropy of a subsystem + entropy from selecting a subsystem” formula is indeed something we are concerned with. Tom has found an operadic formulation of it.

Posted by: Tobias Fritz on May 9, 2011 6:25 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tobias,

Sure, I suppose I ought to be using your category FinProb. In spelling out the assumptions I need in the following, my approach lost a lot of its charm, but here’s what I meant:

My first functor, which I’ll be deforming, takes a finite probability space to the vector space of functions on it, and literally averages them over the fibers of a map. For a map to a point, this is “expectation”, and for a general map it is some kind of conditional expectation.

For a map f: X –> Y, I am going to define H(f), an element of Functions(Y). I would like to do this in such a way that the following assignment is a functor:

On objects:
X |–> Functions(X)

On maps:
f: X –> Y
is carried to
f_* + H(f): Functions(X) –> Functions(Y)

Lets look at what this actually says in the case of a composition

f: X –> Y
Y –> pt

I’ll call pushforward to a point by its right name, expectation. And I’ll denote H(X –> pt) just by H(X)

Lets follow the zero function in Functions(X), we find

H(X) = Expectation(H(f))+H(Y)

What we want now very nearly falls out of what you folks are calling Fadeev’s theorem.

I think I need to say something about symmetry, this follows immediately if we demand nonnegativity of H(f) (compose a map and its inverse), but I’m not so sure in general.

I also need to impose compatibility with base change. Nothing that I can see forces H(f)(y) to be H(f^{-1}(y)).

I’d appreciate it if a moderator could delete my previous post.

Josh

### Re: Category-Theoretic Characterizations of Entropy

Josh: it’s tough to delete your old post without rendering all comments on that post invisible, and while I could manage it in half an hour, Tobias’ post beginning “Could you expand on that some more?” would then become quite mysterious, along with your remark “In spelling out the assumptions I need in the following, my approach lost a lot of its charm.” So, let’s leave history as it is.

(By the way: are you summing over each fiber, or averaging over it? The usual pushforward operation on functions sums over the fiber, and that’s functorial. Does averaging over the fiber give a functor?)

Posted by: John Baez on May 10, 2011 4:58 AM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

John,

I’m averaging. Functoriality of this operation is the “tower law” of conditional expectation.

Josh

### Re: Category-Theoretic Characterizations of Entropy

So I now verified that this makes sense. What’s nice about it is that it’s a bit less abstract than Tom’s operadic formulation. There are a few subtleties which I would like to discuss. First of all, working with $\Fin\Prob$ as the category of all finite probability spaces and measure-preserving functions is once again problematic: in degenerate cases with zero probabilities, conditional probabilities don’t exist, but these are required for defining the ‘integration over the fibers’.

Secondly, as you already noticed, the symmetry, i.e. the invariance under isomorphism, is not automatic. Without artificially imposing non-negativity, it does not follow from your other requirements. The following reasoning shows why.

Let us fix any non-trivial object $Z\in\Fin\Prob$. Then given any functor $F:\Fin\Prob \rightarrow \Aff,$ we can modify it as follows: We choose any fixed $g\in\Trans(F(Z))$. (Here, $\Trans(F(Z))$ is the vector space of translations on $F(Z)$. It acts on $F(Z)$ by addition.) Then we can define a new functor $F': \Fin\Prob \rightarrow \Aff$ as follows: $F'$ equals $F$ on objects. On morphisms $f:X\rightarrow Y$, we define $F'(f)$ as follows: $F'(f)=F(f)+\delta_{X,Z}g-\delta_{Y,Z}g$ where $\delta_{X,Z}$ is $1$ if $X$ is equal to $Z$ (equal, not just isomorphic), and $0$ otherwise. You can check easily that this turns $F'$ into a functor.

In fact, what I have done here is to construct $F'$ from $F$ using a gauge transformation. So, what might be correct is the following:

“Any deformation of the ‘integration along the fiber’ functor $\Fin\Prob\rightarrow\Aff$ is a multiple of Shannon entropy up to a gauge transformation.”

Thirdly, how can we express categorically the requirement that the new functor $\Fin\Prob\rightarrow\Aff$ is supposed to be a ‘deformation’ of the original functor $\Fin\Prob\rightarrow\Vect$? One possibility is to go from affine space to projective space: given an affine space, consider its set of lines. Two lines are called equivalent if they are parallel. The equivalence classes of lines then are the projective space associated to the affine space. Morphisms between affine spaces descend to morphisms of the associated projective spaces, since they map lines to lines. So we have a functor $\Aff\rightarrow\Proj$. ‘Deformation’ now means that the composed functors $\Fin\Prob\rightarrow\Proj$ have to coincide.

Posted by: Tobias Fritz on May 10, 2011 6:07 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Regarding your last paragraph, could the vector space structure on simplices be relevant?

Posted by: Tom Leinster on May 10, 2011 10:06 PM | Permalink | Reply to this

### Re: Category-Theoretic Characterizations of Entropy

Tobias,

Thanks for your interest in this. The way I was thinking of things was really as actual vector spaces with affine maps between them. In this case there is a nice functor to the ordinary category of vector spaces just by taking the derivative (linear part). This is really the sense that I’m deforming, which as you’ve suggested is not really the category of abstract affine spaces.