

May 8, 2011

Category-Theoretic Characterizations of Entropy II

Posted by John Baez

We’re having a lively conversation about different notions related to entropy, and how to understand these using category theory. I don’t want to stop this conversation, but I have something long to say, which seems like a good excuse for a new blog post.

A while back, Tom Leinster took Fadeev’s characterization of Shannon entropy and gave it a slick formulation in terms of operads. I’ve been trying to understand this slick formulation a bit better, and I’ve made a little progress.

Tom’s idea revolves around a special law obeyed by the Shannon entropy. I’ll remind you of that law, or tell you about it for the first time in case you didn’t catch it yet. I’ll explain it in a very lowbrow way, as opposed to Tom’s beautiful highbrow way. Then, I’ll derive it from a simpler law obeyed by something called the ‘partition function’.

I like this, because I’ve got a bit of physics in my blood, and physicists know that the partition function is a very important concept. But I hope you’ll see: you don’t need to know physics to understand or enjoy this stuff!

A math puzzle

We begin with a sadly familiar problem:

Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution P. If you go to the i-th restaurant, you then choose a dish from the menu according to some probability distribution Q_i. How surprising will your choice be, on average?

Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.

This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.

Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability p is defined to be -\ln(p). Then, if there are a bunch of possible outcomes distributed according to some probability distribution P, the ‘expected’ surprise or Shannon entropy is:

(1)  S(P) = - \sum_i p_i \ln(p_i)

where p_i is the probability of the i-th outcome. This is a weighted average where each surprise is weighted by its probability of happening.

So, we really do have a well-defined math puzzle, and it’s not even very hard.
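For the computationally minded, here is a minimal Python sketch of this definition. The function name shannon_entropy is just my label, and I use the usual convention that outcomes with probability zero contribute nothing:

```python
import math

def shannon_entropy(P):
    """Shannon entropy S(P) = -sum_i p_i ln(p_i), with 0 ln 0 = 0."""
    return sum(-p * math.log(p) for p in P if p > 0)

# A fair coin has entropy ln 2: each outcome is equally surprising.
print(round(shannon_entropy([0.5, 0.5]), 4))  # 0.6931 (= ln 2)

# A certain outcome carries no surprise at all.
print(shannon_entropy([1.0]))                 # 0.0
```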

Glomming together probabilities

But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution P on the set of restaurants. Then you choose a meal according to some probability distribution Q_i. If there are n restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call

P \circ (Q_1, \dots, Q_n)

A bit more formally:

Suppose P is a probability distribution on the set \{1,\dots, n\} and Q_i are probability distributions on finite sets X_i, where i = 1, \dots, n. Suppose the probability distribution P assigns a probability p_i to each element i \in \{1,\dots, n\}, and suppose the distribution Q_i assigns a probability q_{i j} to each element j \in X_i.

Then we can glom together the probability distributions Q_i using P and get a new probability distribution P \circ (Q_1, \dots, Q_n) on the disjoint union of all the sets X_i. The resulting probability for each element (i,j) in the disjoint union \sqcup_{i = 1}^n X_i is just p_i q_{i j}.

These probabilities sum to 1 as they should:

\sum_{i, j} p_i q_{i j} = \sum_i \left( p_i \sum_j q_{i j} \right) = \sum_i p_i = 1

Note: what j ranges over depends on i! But it’s all kosher.
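Here is how the glomming operation might look in code. This is an illustrative sketch, with distributions represented as plain lists of probabilities; glom is my own name for the operation:

```python
def glom(P, Qs):
    """P ∘ (Q_1, ..., Q_n): the probability of (i, j) is p_i * q_ij."""
    return [(i, j, p * q)
            for i, (p, Q) in enumerate(zip(P,Qs))
            for j, q in enumerate(Q)]

# Two restaurants chosen with probabilities 0.3 and 0.7,
# with menus of 2 and 3 dishes respectively.
composite = glom([0.3, 0.7], [[0.5, 0.5], [0.2, 0.3, 0.5]])

total = sum(pq for _, _, pq in composite)
print(round(total, 12))  # 1.0: the result is again a probability distribution
```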

Glomming together numbers

There’s something even simpler than glomming together probability measures: we can glom together numbers!

Suppose we have a probability measure P on the set \{1, \dots, n\} and for each element i \in \{1, \dots, n\} we have a number x_i. Then we can define a new number P(x_1, \dots, x_n) by

P(x_1, \dots, x_n) = \sum_{i = 1}^n p_i x_i

This is just the ‘weighted average’ of the numbers x_i. We have already seen a weighted average in the definition of Shannon entropy, equation (1).

Glomming together entropies

Now we’re ready to answer the math puzzle:

(2)  S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

On the left we’re glomming together a bunch of probability distributions using P and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using P. But there’s also an extra term: the entropy of P!

In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely P(S(Q_1), \ldots, S(Q_n)). It’ll be that plus the expected surprise of your choice of restaurant, namely S(P).

You might not have expected such a simple formula. But it’s easy to prove:

\begin{aligned}
S(P\circ(Q_1, \ldots, Q_n)) &= -\sum_{i, j} p_i q_{i j} \ln(p_i q_{i j}) \\
&= -\sum_{i, j} p_i q_{i j} \ln(p_i) - \sum_{i, j} p_i q_{i j} \ln(q_{i j}) \\
&= -\sum_{i} p_i \ln(p_i) + \sum_{i} p_i S(Q_i) \\
&= S(P) + P(S(Q_1), \dots, S(Q_n))
\end{aligned}
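If you’d rather see the law checked numerically than proved, here is a quick sketch, with made-up numbers for P and the Q_i:

```python
import math

def shannon_entropy(P):
    """S(P) = -sum_i p_i ln(p_i), with 0 ln 0 = 0."""
    return sum(-p * math.log(p) for p in P if p > 0)

P  = [0.3, 0.7]
Qs = [[0.5, 0.5], [0.2, 0.3, 0.5]]

# Left side: entropy of the glommed distribution.
lhs = shannon_entropy([p * q for p, Q in zip(P, Qs) for q in Q])

# Right side: S(P) plus the P-weighted average of the S(Q_i).
rhs = shannon_entropy(P) + sum(p * shannon_entropy(Q) for p, Q in zip(P, Qs))

print(abs(lhs - rhs) < 1e-12)  # True
```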

Fadeev’s theorem

The formula for glomming together entropies is more important than you might think: it comes very close to characterizing Shannon entropy! In 1957, D. K. Fadeev published a result that we can restate as follows:

Theorem (Fadeev): Suppose F is a function sending probability measures on finite sets to real numbers. Suppose that:

  1. F is invariant under bijections.
  2. F is continuous.
  3. F(P\circ(Q_1, \ldots, Q_n)) = F(P) + P(F(Q_1), \ldots, F(Q_n)).

Then F is a real multiple of Shannon entropy.

In item 1 we’re using the fact that a bijection f: X \to X' between finite sets together with a probability distribution on X gives a probability distribution on X'; we want these to have the same entropy. In item 2 we’re using the standard topology on \mathbb{R}^n to put a topology on the set of probability distributions on any n-element set. For a proof of this theorem, see the beginning of Rényi’s 1961 paper.

What’s going on?

While this equation is cute:

S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution P can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map S from probability distributions to numbers. So, if you’re algebraically inclined, you would want S to be a ‘homomorphism’, obeying a law like this:

S(P\circ(Q_1, \ldots, Q_n)) = P(S(Q_1), \ldots, S(Q_n)) \qquad (sorry, not true)

We see laws of this sort all over math. But the true law has an extra term. What’s going on?

The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.

Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.

For starters, consider another law that has an ‘extra term’:

\frac{d}{d x}(f g) = \left(\frac{d}{d x} f\right) g + f \left(\frac{d}{d x} g\right)

In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ‘derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.

Let’s see how.

Glomming together partition functions

There’s something more fundamental than the Shannon entropy of a probability distribution: namely, its partition function. At least that’s how physicists think. Let me explain.

Suppose P is a probability measure on a finite set X. Let p_i be the probability of the element i \in X. We can always write

p_i = e^{-H_i}

where

H : X \to (-\infty, \infty]

is a function called the energy or Hamiltonian. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy +\infty to account for zero probabilities.

But physicists always go further and introduce a parameter \beta \in (0,\infty) which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make p_i into a function of \beta:

p_i(\beta) = e^{-\beta H_i}

or if you prefer, simply

p_i(\beta) = p_i^\beta

When \beta = 1, these numbers are just the probabilities p_i. But when \beta \ne 1, there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the partition function:

Z(P) = \sum_{i \in X} e^{-\beta H_i}

But I won’t do that just now: I’ll let P(\beta) be the measure on the set X that assigns the measure p_i(\beta) to the point i \in X.

Now everything is a function of \beta! But everything reduces to something familiar at \beta = 1, so we can stretch our old notation without breaking it. We now have functions p_i sending numbers \beta \in (0,\infty) to numbers, and a function P sending numbers \beta to measures on X. These reduce to our original p_i and P at \beta = 1. The partition function is also a function of \beta, and it equals 1 at \beta = 1.
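In code, the partition function of this one-parameter family might look like this. A sketch, with \beta as an explicit argument:

```python
def Z(P, beta):
    """Partition function of the family p_i(beta) = p_i**beta."""
    return sum(p ** beta for p in P if p > 0)

P = [0.2, 0.3, 0.5]
print(round(Z(P, 1.0), 12))  # 1.0: at beta = 1 we recover a probability measure
print(round(Z(P, 2.0), 12))  # 0.38: away from beta = 1 it is not normalized
```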

But here’s the important thing: the partition function obeys a simpler law than entropy when we glom probability measures together! Suppose P is a probability distribution on the set \{1,\dots, n\} and Q_i are probability distributions on finite sets X_i, where i = 1, \dots, n. Then

(3)  Z(P \circ (Q_1, \dots, Q_n)) = P(Z(Q_1), \dots, Z(Q_n))

So, in fancy jargon, Z is an operad algebra homomorphism!

But I need to explain what this equation means. First of all, everything is a function of \beta. Second, while P and Q_i are only probability measures at \beta = 1, they’re perfectly fine measures at other values of \beta, so all our ‘glomming’ formulas still make sense.

Let’s check to make sure we know what’s going on. An expression like P \circ (Q_1, \dots, Q_n) could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers \beta \in (0,\infty) to measures on that set. So, we can extend P and Q_i this way and then ‘glom’ them to get a measure-valued function of \beta. On the other hand, we can glom them and then extend them. Luckily, the results agree.

Let’s see why. Suppose P assigns the point i \in \{1,\dots, n\} the probability p_i. When we extend this to a function of \beta we get p_i^\beta. Similarly, suppose Q_i assigns the point j \in X_i the probability q_{i j}. When we extend this to a function of \beta we get q_{i j}^\beta. When we glom these, the measure of the point (i,j) \in \sqcup X_i will be this function of \beta:

p_i^\beta q_{i j}^\beta

But this equals

(p_i q_{i j})^\beta

which is the result of glomming and then extending.

The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,

Z(P \circ (Q_1, \dots, Q_n)) = \sum_{i, j} p_i^\beta q_{i j}^\beta = \sum_i p_i^\beta Z(Q_i) = P(Z(Q_1), \dots, Z(Q_n))
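Here’s a numerical sanity check of this homomorphism law at an arbitrary \beta. A sketch with made-up distributions; note that on the right, ‘glomming with P’ at general \beta means weighting by p_i^\beta, since P too has been extended to a function of \beta:

```python
def Z(R, beta):
    """Partition function of the family r_i(beta) = r_i**beta."""
    return sum(r ** beta for r in R)

P  = [0.3, 0.7]
Qs = [[0.5, 0.5], [0.2, 0.3, 0.5]]
beta = 2.5

# Left: partition function of the glommed distribution.
lhs = Z([p * q for p, Q in zip(P, Qs) for q in Q], beta)

# Right: glom the numbers Z(Q_i) using P extended to this beta.
rhs = sum(p ** beta * Z(Q, beta) for p, Q in zip(P, Qs))

print(abs(lhs - rhs) < 1e-12)  # True
```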

From partition function to entropy

How is the Shannon entropy of a probability distribution related to its partition function? Simple:

(4)  S(P) = - \left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1}

Here I must emphasize that I’m only talking about the Shannon entropy S(P) of the original probability distribution, ‘at \beta = 1’. Physicists extend S(P) to a function of \beta, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating S and Z a bit more complicated, so let’s not.

Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.

We start with the rule for glomming partition functions:

Z(P(Q 1,,Q n))=P(Z(Q 1),,Z(Q n)) Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

We differentiate with respect to \beta and use the product rule, taking advantage of the fact that P(Z(Q_1), \dots, Z(Q_n)) is linear in P but also linear in each of the functions Z(Q_i):

\frac{d}{d \beta} Z(P \circ (Q_1, \dots, Q_n)) = \frac{d P}{d \beta} (Z(Q_1), \dots, Z(Q_n)) + P\left(\frac{d}{d \beta} Z(Q_1), \dots, \frac{d}{d \beta} Z(Q_n)\right)

Now set \beta = 1. We can use equation (4) and the fact that all our partition functions equal 1 at \beta = 1:

- S(P \circ (Q_1, \dots, Q_n)) = \left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1} - P(S(Q_1), \dots, S(Q_n))

Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that S is like a ‘derivation’. We just need to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:

\left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1} = \sum_i \left. \frac{d p_i(\beta)}{d \beta} \right|_{\beta = 1}

and we have

\frac{d p_i(\beta)}{d \beta} = \frac{d}{d \beta} e^{-\beta H_i} = -H_i e^{-\beta H_i}

so

\left. \frac{d p_i(\beta)}{d \beta} \right|_{\beta = 1} = -H_i e^{-H_i} = \ln(p_i)\, p_i

so

\left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)

So, we’ve got it!

S(P \circ (Q_1, \dots, Q_n)) = S(P) + P(S(Q_1), \dots, S(Q_n))

The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!

But before we go further, let’s check rule (4). For starters,

\left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \left. \frac{d}{d \beta} \sum_i e^{-\beta H_i} \right|_{\beta = 1}

But we’ve just seen that

\left. \frac{d}{d \beta} e^{-\beta H_i} \right|_{\beta = 1} = p_i \ln(p_i)

so

\left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)
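Rule (4) is also easy to check numerically with a finite difference. A sketch with a made-up distribution, approximating the derivative by a central difference:

```python
import math

def Z(P, beta):
    """Partition function of the family p_i(beta) = p_i**beta."""
    return sum(p ** beta for p in P)

def shannon_entropy(P):
    """S(P) = -sum_i p_i ln(p_i)."""
    return sum(-p * math.log(p) for p in P if p > 0)

P = [0.2, 0.3, 0.5]
h = 1e-6

# Central difference approximates dZ/dbeta at beta = 1.
dZ = (Z(P, 1 + h) - Z(P, 1 - h)) / (2 * h)

print(abs(-dZ - shannon_entropy(P)) < 1e-8)  # True: S(P) = -dZ/dbeta at beta = 1
```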

The upshot

In our notes on the nLab you’ll see a deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!

Let FinProb_0 be the groupoid of finite probability measure spaces (where, for technical reasons, we exclude measures that assign measure zero to a nonempty set). Then Z can be thought of as a functor from FinProb_0 to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, Z assigns to each object P \in FinProb_0 its partition function Z(P)… but we can recover P up to isomorphism from Z(P).

Now, FinProb_0 is an algebra of a certain operad \mathbf{P} whose n-ary operations are the probability measures on the set \{1, \dots, n\}. This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a \mathbf{P}-algebra. Furthermore, Z becomes a homomorphism of \mathbf{P}-algebras. This last statement mostly just says that

Z(P \circ (Q_1, \dots, Q_n)) = P(Z(Q_1), \dots, Z(Q_n))

But then we can take Z, differentiate it with respect to \beta, and set \beta = 1. By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of the \mathbf{P}-algebra FinProb_0. Concretely, we’ve seen this mostly says that

S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

But there should also be an abstract-nonsense theory of derivations of operad algebras. In fact I discussed this back in week299, but only a little. So, there’s a lot I don’t understand, such as:

How does my ‘derivation’ way of thinking about the law S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n)) relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?

And if we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of q-derivative of minus the logarithm of the partition function. The q-derivative is a kind of ‘twisted derivation’, but I bet a bunch of the same calculations work in some modified form.

Posted at May 8, 2011 12:57 PM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2396

129 Comments & 2 Trackbacks

Re: Category-Theoretic Characterizations of Entropy II

All of this seems like a very complicated way of saying that P(X,Y)=P(Y|X)P(X).

Posted by: Matt Leifer on May 9, 2011 1:03 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Fadeev characterized Shannon entropy as the unique-up-to-scale function of probability distributions with 3 properties. Tom showed that these properties could be summarized by saying you’ve got a lax point for the operad whose operations are convex linear combinations. I’m trying to understand what that means, and I noticed that you can derive these 3 properties from the fact that 1) entropy can be written as a derivative of the partition function, and 2) the partition function is a homomorphism of algebras for that operad.

So all this is just a complicated way of stating that the probability of X and Y is the probability of X times the probability of Y given X?

That fact is indeed relevant, but I think you’re exaggerating. All theorems are tautologies, and complicated theorems are proved using simpler ones, but that doesn’t mean they’re just complicated ways of saying the simpler things.

Posted by: John Baez on May 9, 2011 2:56 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Matt Leifer said that

All of this seems like a very complicated way of saying that P(X,Y)=P(Y|X)P(X).

Maybe we can restate this, but with more constructive connotation:

What do you all think is the name of that operad \mathbf{P} (following the notation of Tom’s latest post) which governs all this discussion? Isn’t it true that a natural name is:

the operad of conditional probabilities

?

Its n-ary operations are probability distributions on n random variables, and its composition operation is precisely forming conditional probabilities. Right?

Given that, as Tom’s post succinctly summarizes, the upshot of the discussion here is “many statements about entropy can naturally be understood also as statements about the weak algebras of this operad”, it does seem that it’s not too far off if one said that

All this seems like a general abstract way of talking about conditional probability.

That statement sounds both true and useful to me.

Posted by: Urs Schreiber on May 18, 2011 8:38 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Physicists extend S(P) to a function of \beta, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating S and Z a bit more complicated, so let’s not.

So what happens? Instead of the term

\left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1},

you get

\frac{d P}{d \beta} (Z(Q_1),\dots, Z(Q_n)).

That’s not going to lead anywhere very pretty, is it?

Posted by: David Corfield on May 9, 2011 3:32 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I haven’t found anything pretty to say yet—but I haven’t tried very hard, either.

Physicists would find it absurd to claim that anything specially nice happens at \beta = 1, since \beta = 1/k T and there’s nothing special about the temperature T that gives 1 when multiplied by Boltzmann’s constant. After all, this depends on our choice of units!

On the other hand, I’m making \beta = 1 special by starting with probabilities p_i and then letting

p_i(\beta) = p_i^\beta.

This gives a 1-parameter family of measures that happens to be a probability measure at \beta = 1. So, the partition function Z equals 1 at \beta = 1, and various standard physics formulas simplify at this point.

I pulled this trick in my Rényi entropy paper, too. But then the physicists piled on and said it looked weird to treat \beta = 1 as special. The mathematician in me wanted to say “but precisely because \beta doesn’t depend on our choice of units, we’re free to pick units where \beta = 1 at the temperature we’re interested in!” But I know physicists don’t do that: they may choose units where the speed of light or Boltzmann’s constant is 1, but not where the temperature of their lab is 1. So I rewrote the paper in a way that doesn’t give any special significance to \beta = 1, and it became a lot better. For example, that’s when it became clear that the Rényi entropy was the q-derivative of free energy.

Here’s what we’d need to ponder if we wanted to go beyond \beta = 1 here:

The usual relation between entropy and the partition function, good for any value of \beta, is

S(\beta) = \ln Z(\beta) - \beta \frac{d}{d \beta} \ln Z(\beta)

In this blog post I only consider systems where Z(1) = 1, so I get

S(1) = - \left. \frac{d}{d \beta} Z(\beta) \right|_{\beta = 1}

which is equation (4). I only consider the entropy at \beta = 1. So, the question would be how to go beyond this case.

The general equation

(1)  S(\beta) = \ln Z(\beta) - \beta \frac{d}{d \beta} \ln Z(\beta)

looks fairly wretched… but it’s an incredibly important law of physics! The physicists rewrite it like this:

(2)  F = E - T S

Here F is the free energy (the energy available to do work), E is the expected value of the energy, T is the temperature and S is the entropy. T S is the energy in the form of heat, so this equation says “the energy available to do work is the energy minus the energy in the form of heat”.

In units where k=1k = 1 we have

F = - \frac{1}{\beta} \ln Z(\beta)

E = - \frac{d}{d \beta} \ln Z(\beta)

T = \frac{1}{\beta}

so we can see (2) is equivalent to (1), and we can also prove (1). There’s a long story behind this stuff, of course. It’s all well-known, but my notebooks are full of calculations where I got myself up to speed, so I don’t have the energy to explain it all here.
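For concreteness, the equivalence can be checked numerically. A sketch with a made-up distribution, in units where k = 1, using the formulas for F, E and T above:

```python
import math

p = [0.2, 0.3, 0.5]                    # probabilities at beta = 1
H = [-math.log(x) for x in p]          # energies: p_i = exp(-H_i)
beta = 2.0

Zb = sum(math.exp(-beta * h) for h in H)        # partition function Z(beta)
probs = [math.exp(-beta * h) / Zb for h in H]   # normalized Gibbs state

S = sum(-q * math.log(q) for q in probs)        # entropy S(beta)
E = sum(q * h for q, h in zip(probs, H))        # expected energy
T = 1 / beta                                    # temperature (k = 1)
F = -math.log(Zb) / beta                        # free energy

print(abs(F - (E - T * S)) < 1e-12)  # True: F = E - T S
```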

What fascinates me is that all these ideas from ‘thermodynamics’ are available for use as soon as we have a probability measure on a finite set! We just introduce the concept of temperature using p_i(\beta) = p_i^\beta and we’re off and running.

In statistics this trick is lurking in the very important idea of an ‘exponential family’ of probability measures. I was very happy to realize that exponential families are the same as Gibbs states in physics, which is what I’m talking about now. More recently, Tom has told us the probability distribution

p_i^\beta / Z(\beta)

is called an escort distribution in some other circles. (Which circles, Tom?) So the same tricks keep turning up, but I think the physicists have gone the furthest with them—using them to explain how steam engines and refrigerators work, etcetera etcetera—so there’s a lot of mileage to be gotten by taking ideas from statistical mechanics and applying them elsewhere.

Posted by: John Baez on May 10, 2011 3:12 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

is called an escort distribution in some circles. (Which circles, Tom?)

I learned it from some physics paper. I think it was Nonadditive conditional entropy and its significance for local realism by Abe and Rajagopal.

You might be surprised to hear that I’ve been reading physics papers, John. This was in aid of understanding diversity, a couple of years ago. While I’m at it, I’ll cite a couple of other ones I discovered in the same way, both of which could be relevant to us:

David recently mentioned a closely related paper too, in the previous thread.

Posted by: Tom Leinster on May 10, 2011 3:46 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I was very happy to realize that exponential families are the same as Gibbs states in physics, which is what I’m talking about now.

So do you have something in physics resembling the information geometric idea of projection along geodesics in exponential families?

Here’s one of the simplest examples I have thought about. The world is going to deliver to me a sample taken from a distribution over 4 outcomes (00, 01, 10, 11). My task is to approximate the true distribution (p_{00}, p_{01}, p_{10}, p_{11}), with \sum p_{i j} = 1. It’s a three-dimensional space. Before I know anything, I take as my best guess the uniform distribution p_{i j} = 1/4, i.e., the maximum entropy distribution.

I decide that when I’m given the sample distribution, q_{i j}, I’m going to move to the closest distribution to the uniform which matches it in terms of the expected proportion of 0s, i.e., I’ll ensure 2 p_{00} + p_{01} + p_{10} = 2 q_{00} + q_{01} + q_{10}.

With one constraint, the subspace of distributions whose expected proportion of 0s matches the observed proportion is two-dimensional. ‘Orthogonal’ to this subspace is the one-dimensional family of distributions for which the two digits are independent, i.e., the family (t^2, t(1-t), t(1-t), (1-t)^2). Notice that the equations defining this space are log-linear, in the sense that \log p_{00} - \log p_{01} - \log p_{10} + \log p_{11} = 0 and \log p_{01} = \log p_{10}, unlike the two-dimensional subspace, which was defined by a linear equation. This is related to why they’re called exponential families.

The subspaces intersect at the maximum entropy distribution whose expected proportion of 0s matches the empirical distribution’s. We have followed a geodesic (according to the KL-divergence) from the initial MaxEnt distribution to the two-dimensional subspace. Alternatively, we have followed a geodesic from the empirical distribution to the exponential family (according to the inverse KL-divergence).

If p is the uniform distribution and q the empirical distribution, then I’m choosing an r for which

D(p\|q) = D(p\|r) + D(r\|q).

This is likened to Pythagoras’s theorem and motivates the idea of the divergence as a squared length.
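This Pythagorean identity can be checked numerically. Below is a sketch with a made-up empirical distribution; I write the KL divergence in the standard orientation D(x\|y) = \sum_i x_i \ln(x_i/y_i), in which the identity comes out as D(q\|p) = D(q\|r) + D(r\|p) (conventions for the order of the arguments differ):

```python
import math

def kl(x, y):
    """KL divergence D(x||y) = sum_i x_i ln(x_i / y_i)."""
    return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

p = [0.25, 0.25, 0.25, 0.25]   # uniform prior over 00, 01, 10, 11
q = [0.40, 0.20, 0.25, 0.15]   # observed empirical distribution

# Number of 0s in each outcome; the constraint matches its expectation.
a = [2, 1, 1, 0]
t = sum(ai * qi for ai, qi in zip(a, q)) / 2       # expected proportion of 0s

# The member of the independent family matching that expectation.
r = [t * t, t * (1 - t), t * (1 - t), (1 - t) * (1 - t)]

lhs = kl(q, p)
rhs = kl(q, r) + kl(r, p)
print(abs(lhs - rhs) < 1e-12)  # True: the divergences add like squared lengths
```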

Another interpretation is that we are to play a game where I begin with p, you select a distribution q and tell me only the expected proportion of 0s, and I am to select a distribution r. I must pay D(r\|q) to you. By opting for the strategy above, I have guaranteed that my penalty is lessened by D(p\|r) from the initial D(p\|q). And wherever you choose on the subspace, I will have lessened the penalty by that amount.

If I adopt any other strategy, then once you know it, you can use it to choose a q so that my penalty increases. Ideas of equilibrium, saddle points and minimising maximum loss come in here.

So, eventually, back to my original question, in statistical physics do you move from an initial distribution to a maximum entropy distribution which matches some specified macroscopic expectation value? If so, is this seen in a differential geometric light?

Posted by: David Corfield on May 10, 2011 2:41 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

David wrote:

So, eventually, back to my original question, in statistical physics do you move from an initial distribution to a maximum entropy distribution which matches some specified macroscopic expectation value? If so, is this seen in a differential geometric light?

The entropy (in generalized form, as a convex functional maximized on the equidistribution) may be a source of a geometry-measure duality, and thus a way to estimate the complexity of a geometry. Roughly, for example, the entropy may estimate an effective curvature and dimension of an attractor, because the measure-concentration effect on an attractor decreases the entropy.

Posted by: Maxim Budaev on May 11, 2011 9:42 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I’ve only seen the term “escort distribution” used in the literature to which I pointed in the Azimuth discussion. Of course, that’s a statement about my own limited experience as much as anything else. The most accessible instance, in terms of paedagogical style, was this textbook:

They describe their motivations in the following way:

[W]e deal with thermodynamic tools used to analyse complicated, possibly fractal probability distributions. These distributions can, for example, be generated by chaotic mappings, i.e., they are given in terms of the natural invariant probability density and the attractor of the map. Nevertheless, the considerations of part III are quite generally applicable to any probability distribution, no matter how it is generated. To any such probability distribution one can attribute deduced distributions, which we call ‘escort distributions’. These are classes of probability distributions that are constructed as a tool for scanning the structure of a given probability distribution in more detail. The escort distributions have the same formal structure as the generalized canonical distributions of thermodynamics. They yield the same basic thermodynamic relations, hence they are the common root of the various analogies to thermodynamics.

Posted by: Blake Stacey on May 11, 2011 4:56 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks, Blake! So, those guys clearly understand the point I’m so excited about these days: that any probability distribution gives a family of ‘escort distributions’, which then let us apply all the tools of statistical mechanics. This idea should have millions of implications and applications. Someone should explain some of these in a book on probability theory—not a book entitled Thermodynamics of Chaotic Systems, which is likely to scare off anyone who doesn’t already know and love statistical mechanics.

Unfortunately even I can’t resist using words like ‘inverse temperature’ $\beta$, ‘Hamiltonian’ $H$ and ‘partition function’ $Z$ while explaining this idea. I don’t know better names for the quantities involved in taking a probability distribution $p_i$, writing them as

$$p_i = e^{-H_i}$$

and then embedding them in a 1-parameter family of probability distributions

$$p_i(\beta) = e^{-\beta H_i}/Z(\beta)$$

But while these physics-flavored buzzwords provide the link to a huge existing literature, they may fool hard-core probability theorists into thinking we’re talking about ‘physics’ here. Someone should write a book where they just say

$$p_i(\beta) = p_i^\beta / Z(\beta)$$

and never give names to these variables. A whole different audience would then find this trick appealing!
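For what it’s worth, the trick is tiny in code. Here is a minimal sketch in Python (the function name `escort` is mine, following the terminology above):

```python
import math

def escort(p, beta):
    """Escort distribution p_i(beta) = p_i^beta / Z(beta)."""
    weights = [pi ** beta for pi in p]
    Z = sum(weights)
    return [w / Z for w in weights]

p = [0.5, 0.3, 0.2]
print(escort(p, 1.0))  # beta = 1 recovers p itself
print(escort(p, 0.0))  # beta = 0 gives the uniform distribution
print(escort(p, 5.0))  # large beta concentrates on the most probable outcome
```

So the one-parameter family interpolates between the uniform distribution and a point mass on the most probable outcome, passing through $p$ itself at $\beta = 1$.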

Posted by: John Baez on May 11, 2011 7:47 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Judging from the references I’ve seen and the passage that Blake quotes, I would guess that Beck and Schlögl invented the term ‘escort distributions’. I quite like it. Partly I like just having a name for it, and partly I like the image that each probability distribution is constantly escorted by infinitely many other distributions.

John wrote:

I don’t know better names for the quantities involved in taking a probability distribution $p_i$, writing them as

$$p_i = e^{-H_i}$$

and then embedding them in a 1-parameter family of probability distributions

$$p_i(\beta) = e^{-\beta H_i}/Z(\beta).$$

What I don’t see is the point of writing $p_i = e^{-H_i}$ in the first place. Why not just leave it as $p_i$? The quantity $H_i = -\log p_i$ does of course appear in the definition of Shannon entropy, but we can just call it $-\log p_i$ rather than giving it a new name (as you point out at the end of your comment).

I’m sure the reason why I don’t see the point is that I don’t know enough physics. If I understood the importance of Hamiltonians, maybe I wouldn’t be able to resist writing $p_i = e^{-H_i}$ either.

Incidentally, I now see why physicists like to use $S$ for entropy, rather than Shannon’s original notation $H$.

Posted by: Tom Leinster on May 11, 2011 1:35 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

I’m sure the reason why I don’t see the point is that I don’t know enough physics.

You don’t actually need to know much physics, though it could help. None of this really has anything to do with ‘energy’ or ‘Hamiltonians’—that’s just an example physicists like. The main thing you need to know is this.

Suppose $H : X \to \mathbb{R}$ is some function. Suppose you want to maximize the Shannon entropy of a probability distribution $p: X \to [0,1]$ while ensuring that the mean value of $H$ is some number $c$:

$$\sum_{i \in X} p_i H_i = c$$

Then you need to choose $p$ of this form:

$$p_i = \frac{1}{Z(\beta)} e^{-\beta H_i}$$

for some real number $\beta \in \mathbb{R}$, where

$$Z(\beta) = \sum_{i \in X} e^{-\beta H_i}$$

In short: the goal of maximizing entropy subject to the constraint that some function $H$ has a given mean value forces us to think about $e^{-\beta H_i}$.

This is Gibbs’ great idea, which gives a rule to guess a probability distribution to describe a situation when all we know is the expected value of some quantity.

It easily generalizes to situations where we have a bunch of quantities, not just one.

The most important quantity in the universe is ‘energy’, so that’s the example physicists often choose to exemplify this idea. But the idea really has nothing to do with ‘energy’, nor even physics.
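Gibbs’ prescription is easy to check numerically. Here is a sketch in Python (the three-point example, the target mean $c = 0.5$, and the bisection solver are mine): solve for the $\beta$ whose Gibbs distribution has mean $c$, then verify its entropy beats every other distribution on the same outcomes with the same mean.

```python
import math

H = [0.0, 1.0, 2.0]   # a "Hamiltonian": any real-valued function on outcomes
c = 0.5               # target mean value of H

def gibbs(beta):
    w = [math.exp(-beta * h) for h in H]
    Z = sum(w)
    return [x / Z for x in w]

def mean_H(p):
    return sum(pi * h for pi, h in zip(p, H))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Solve for beta by bisection: mean_H(gibbs(beta)) is decreasing in beta.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean_H(gibbs(mid)) > c:
        lo = mid
    else:
        hi = mid
p_star = gibbs((lo + hi) / 2)

# On these three outcomes, the distributions with mean 0.5 form the
# one-parameter family (0.5 + t, 0.5 - 2t, t) for 0 < t < 0.25.
best = max(entropy([0.5 + t, 0.5 - 2 * t, t])
           for t in [i / 1000 for i in range(1, 250)])
assert entropy(p_star) >= best - 1e-9   # the Gibbs distribution wins
```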

Posted by: John Baez on May 11, 2011 3:44 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks very much, John. That’s interesting.

As a kind of experiment, let me have a go at rephrasing what you said in a way that doesn’t explicitly involve negative exponentials. I’m not saying the following way is better — I’m just playing around.

I start from a position of taking diversity (= Rényi extropy) to be the primary quantity, rather than entropy. The diversity of order $1$ is the exponential of Shannon entropy, namely

$$D_1(\mathbf{p}) = \prod_i p_i^{-p_i}.$$

And, for instance, where the Shannon entropy $S$ of a composite distribution is given by

$$S(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = S(\mathbf{p}) + \sum_i p_i S(\mathbf{r}_i),$$

the 1-diversity is given by

$$D_1(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D_1(\mathbf{p}) \cdot \prod_i D_1(\mathbf{r}_i)^{p_i}.$$

So where a user of entropy might naturally consider arithmetic means $\sum_i p_i H_i$, a user of diversity might naturally consider geometric means $\prod_i K_i^{p_i}$.
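The multiplicative chain rule for $D_1$ is easy to verify numerically. A sketch in Python (the example distributions are mine; the composite distribution has probabilities $p_i r_{i j}$):

```python
import math

def D1(p):
    """Diversity of order 1: the exponential of Shannon entropy."""
    return math.prod(pi ** (-pi) for pi in p if pi > 0)

p = [0.5, 0.5]
r1 = [0.2, 0.8]
r2 = [0.3, 0.3, 0.4]

# Composite distribution: first choose i according to p, then j according to r_i.
composite = [pi * rij for pi, r in zip(p, [r1, r2]) for rij in r]

lhs = D1(composite)
rhs = D1(p) * D1(r1) ** p[0] * D1(r2) ** p[1]
assert abs(lhs - rhs) < 1e-12   # D_1(p o (r_1, r_2)) = D_1(p) * prod_i D_1(r_i)^{p_i}
```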

Now to restate the theorem of Gibbs that you described. Let $K: X \to (0, \infty)$ be some function. (Think $K = e^H$.) Suppose we want to maximize the 1-diversity of a probability distribution $\mathbf{p}: X \to [0, 1]$ while ensuring that the geometric mean of $K$ is some positive number $C$:

$$\prod_{i \in X} K_i^{p_i} = C.$$

(Think $C = e^c$.) Then you need to choose $\mathbf{p}$ of this form:

$$p_i = \frac{1}{Z(\beta)} K_i^{-\beta}$$

for some real number $\beta$, where

$$Z(\beta) = \sum_{i \in X} K_i^{-\beta}.$$

There: no exponentials. Of course this is completely trivial, but it represents a small shift in viewpoint.

Posted by: Tom Leinster on May 11, 2011 6:25 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Or to put the conclusion more succinctly: $\mathbf{p}$ is an escort distribution of $K$.

Here I’m expanding the usage of “escort distribution” so that every nonzero measure on a finite set has a one-parameter family of escort distributions, not just every probability measure. In my previous comment, $\mathbf{p}$ is the escort distribution of the measure $K$ with parameter $-\beta$. The escort distributions are always probability distributions/measures, so $K$ will only be a member of its own family of escort distributions if it’s a probability measure.

Posted by: Tom Leinster on May 11, 2011 7:27 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Beautiful! That reminds me of my brief forays into Bayesian statistics, when I was told that often the best prior probability distribution to choose is the one which maximizes our ignorance (i.e. maximizes the entropy) subject to our definite knowledge. In particular, if all you know is that a probability distribution on the real line has a given second moment about a given point $\mu$, i.e. a given value of $\int p(x) (x-\mu)^2 \, dx$, then the maximum entropy distribution is a Gaussian with mean $\mu$, i.e. $p(x) = \frac{1}{Z(\beta)} e^{-\beta (x-\mu)^2}$.

Although I’m not actually following the main direction of this conversation, I’m still learning a lot from the sidelines. (-:

Posted by: Mike Shulman on May 11, 2011 11:50 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Mike wrote:

That reminds me of my brief forays into Bayesian statistics, when I was told that often the best prior probability distribution to choose is the one which maximizes our ignorance (i.e. maximizes the entropy) subject to our definite knowledge.

Right!

This ‘maximum entropy’ strategy was introduced by Josiah Willard Gibbs around 1900 and is the foundation of statistical mechanics. For example, if you tell a physicist that you have a box of gas, and say a few things you know about it, they’ll assume the gas molecules have random positions and momenta, distributed according to a probability distribution that maximizes entropy subject to the constraints provided by what you know. It works!

In particular, if you tell me the expected energy is some number $c$, I’ll assume the probability distribution goes like

$$p_i = e^{-\beta H_i}/Z(\beta)$$

and I’ll choose $\beta$ so that

$$\sum_i p_i H_i = c$$

(There’s also a continuous version, where I do an integral instead of a sum.)

So far, this seems like a nice math trick. But then $\beta$ turns out to have a marvelous significance: it’s $1/kT$, where $k$ is Boltzmann’s constant and $T$ is temperature!

Boltzmann’s constant is just an artifact of the fact that people separately dreamt up units of energy and temperature before they understood this stuff. If they had, they’d probably have chosen units where Boltzmann’s constant equals 1. So don’t worry about that.

What matters more is the profound insight into ‘temperature’. Shockingly, temperature is just a quantity that shows up automatically when you try to maximize entropy subject to the constraint that the energy has some mean value!

And indeed the reciprocal of temperature, or more precisely $\beta$, is ultimately more important than temperature. This explains that law you may have heard, about it being impossible to cool things down to ‘absolute zero’ temperature. There’s no standard name for $\beta$, so I call it coolness. Then the law says “you can’t attain infinite coolness”, which sounds pretty intuitive.

Much later, E. T. Jaynes realized that the maximum entropy strategy is useful everywhere. It’s hard to believe nobody had noticed this before, but the idea seems to have hit him with the force of a religious revelation, and he proselytized for it relentlessly. This made the principle of maximum entropy well-known outside of physics. In his book (free online) he wrote:

This is quite a neat mathematical procedure, and, of course, you recognize what we have been doing here. These relations are all just the standard equations of statistical mechanics given to us by J. Willard Gibbs, but now in a disembodied form with all the physics removed.

Indeed, virtually all known thermodynamic relations, found over more than a Century by the most diverse and difficult kinds of physical reasoning and experimentation, are now seen as straightforward mathematical identities of the Maximum Entropy formalism. This makes it clear that those relations are actually independent of any particular physical assumptions and are properties of extended logic in general, giving us a new insight into why the relations of thermodynamics are so general, independent of the properties of any particular substance. This historically oldest application, Gibbsian statistical mechanics, is still the most used (although many of its users are still unaware of its generality). We stress that generality by considering statistical mechanics only in Chapter 29, after a few other applications have been expounded.

He liked to give examples like this: suppose you have an unfair 6-sided die and the mean of the numbers you get when you roll it is 5. What’s the probability that when you roll it, it’ll come up showing 2?

It seems that you don’t have enough information to answer this problem, but the principle of maximum entropy gives a recipe for answering it. His book argues that this recipe is ‘right’.
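The recipe for the die is concrete enough to sketch in a few lines of Python (the bisection solver is mine): maximizing entropy subject to a mean of 5 gives $p_k \propto e^{-\beta k}$, with $\beta$ chosen so the mean comes out right.

```python
import math

faces = [1, 2, 3, 4, 5, 6]

def maxent(beta):
    """Maximum entropy distribution with p_k proportional to exp(-beta k)."""
    w = [math.exp(-beta * k) for k in faces]
    Z = sum(w)
    return [x / Z for x in w]

def mean(p):
    return sum(k * pk for k, pk in zip(faces, p))

# mean(maxent(beta)) is decreasing in beta, so solve mean = 5 by bisection.
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean(maxent(mid)) > 5.0:
        lo = mid
    else:
        hi = mid
p = maxent((lo + hi) / 2)
print(round(p[1], 4))   # the maximum entropy probability of rolling a 2
```

Since the prescribed mean of 5 exceeds the fair-die mean of 3.5, the solver finds a negative $\beta$, so the probabilities increase from face 1 to face 6.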

At some point—I’m not sure when!—it was realized that many famous 1-parameter families of probability distributions are of the form

$$p_i = e^{-\beta H_i}/Z(\beta)$$

or

$$p(x) = e^{-\beta H(x)}/Z(\beta)$$

for various interesting functions $H$. So, they maximize entropy subject to constraints on the mean value of $H$. People now call probability distributions of this form exponential families, and they play an important role in statistics.

Your example is a great one: the Gaussian distribution maximizes entropy subject to a constraint on the mean value of $H(x) = (x - \mu)^2$. The mean value of $(x - \mu)^2$ is called the second moment of $x$ about $\mu$, or the variance of $x$, or the square of the standard deviation.

Posted by: John Baez on May 12, 2011 1:30 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

For what it’s worth, Beck and Schlögl use “bit-number” instead of “energy”.

Posted by: Blake Stacey on May 11, 2011 4:42 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

David wrote:

So, eventually, back to my original question, in statistical physics do you move from an initial distribution to a maximum entropy distribution which matches some specified macroscopic expectation value? If so, is this seen in a differential geometric light?

The idea of choosing the maximum entropy distribution that matches some specified macroscopic expectation value has been important in statistical mechanics ever since Gibbs. It’s absolutely everywhere, and its ramifications have been explored to the $n$th degree.

However, getting from some initial distribution (a ‘prior’, I suppose) to some new improved one by following a geodesic is not something I’ve seen physicists talk about. I’ve only seen that in work on ‘information geometry’, and I still don’t understand it very well.

Why do you care about the path you move along to reach your new best guess based on new data? It’s as if we’re trying to shoot an arrow and hit the bull’s-eye. The physicist will be very concerned about hitting the right spot, but your “geodesic” description seems to focus on the arrow’s path to the target. Why? What’s so important about that?

If I understood that I might see some connection to physics. After all, physicists are very concerned about objects zipping along geodesics, minimizing action, and the like.

Posted by: John Baez on May 11, 2011 8:43 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Why do you care about the path you move along to reach your new best guess based on new data? It’s as if we’re trying to shoot an arrow and hit the bull’s-eye. The physicist will be very concerned about hitting the right spot, but your “geodesic” description seems to focus on the arrow’s path to the target. Why? What’s so important about that?

OK, maybe it’s not the movement along the geodesic that’s important, but what is important is that the space of distributions is foliated in dual ways, such that each foliation contains the geodesics of one of a pair of dual connections.

In my example above, the three-dimensional space is foliated into two-dimensional leaves according to the expected proportion of $0$s. Any leaf is flat so far as the connection corresponding to linear relations between coordinates is concerned. It is also foliated orthogonally by one-dimensional leaves, such as the line of distributions with the probability of $0$ and $1$ independent, along with other lines keeping constant the degree of dependence. These latter are defined by conditions which are linear in the logarithms of the coordinates. They’re geodesics too, according to the corresponding connection.

Updating a probability corresponds to moving to the point of intersection between the leaf of your prior distribution according to the latter foliation, and the leaf containing the empirical distribution according to the former foliation. If you made the move in two steps, perhaps first with an initial sample, and then with a second, you would arrive at the same place as taking the samples together. So the path itself matters.

That last property is a feature of geodesics, but does it make much sense to you to frame matters in terms of geodesics if you don’t care about the ‘time’ it takes to move along them?

Posted by: David Corfield on May 11, 2011 9:22 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I see that there’s a generalized Fadeev characterization, which has as its third axiom

$$S_q(x_1, \ldots, x_{n-1}, \mu x_n, (1 - \mu) x_n) = S_q(x_1, \ldots, x_n) + x_n^q S_q(\mu, 1 - \mu).$$

This, with the symmetry and continuity axioms, characterizes Tsallis entropy, taking us back to the Tsallis/Rényi debate.
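The axiom is easy to verify numerically for the Tsallis form $S_q(p) = (1 - \sum_i p_i^q)/(q - 1)$. A sketch in Python (the example distribution and parameters are mine):

```python
def tsallis(p, q):
    """Tsallis q-entropy for q != 1: (1 - sum p_i^q) / (q - 1)."""
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

x = [0.1, 0.3, 0.6]
mu, q = 0.25, 1.7

# Split the last outcome x_n into mu*x_n and (1 - mu)*x_n.
split = x[:-1] + [mu * x[-1], (1 - mu) * x[-1]]

lhs = tsallis(split, q)
rhs = tsallis(x, q) + x[-1] ** q * tsallis([mu, 1 - mu], q)
assert abs(lhs - rhs) < 1e-12   # the third axiom holds
```

Note the exponent $q$ on the weight $x_n^q$: setting $q = 1$ (in the limit) recovers the ordinary Fadeev recursion for Shannon entropy, with weight $x_n$.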

Posted by: David Corfield on May 9, 2011 5:24 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I don’t think a debate is what we need. We can all make subjective judgements over which approach gives the simplest formulas, and my own subjective judgement is that the formulas given by Rényi extropy and Tsallis entropy are about equal in simplicity. But much more useful, I think, is to give precise statements of the distinguishing mathematical properties of each formulation.

For example, I explained that Rényi extropy $D_q$ is an effective number

$$D_q(1/n, \ldots, 1/n) = n$$

—and that this distinguishes it from Rényi entropy and Tsallis entropy. Effective numbers are very important, so this is a crucial property; it’s what led me long ago to the belief that treating Rényi extropy as the primary player is likely to be fruitful for what I want to do. I don’t know what the distinguishing property of Tsallis entropy is.

Posted by: Tom Leinster on May 9, 2011 7:43 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

How about the following line of thought?

We should recognise that $D_q$ is an important quantity, but it’s tendentious to call it Rényi extropy. Given that $D_q$ is a 1-parameter family of quantities, it’s every bit as natural to take its $q$-deformed logarithm, rather than natural logarithms, for all $q$. The $q$-logarithm of $D_q$ is the Tsallis $q$-entropy. We have just as much right to call $D_q$ ‘Tsallis extropy’.

Posted by: David Corfield on May 10, 2011 9:09 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I thought extropy simply meant the exponential of entropy? So Bloggs extropy means the exponential of Bloggs entropy. I understand your argument, but it seems to me to sow confusion. If we’re going to use the word extropy, we should stick to one unambiguous meaning for it.

And, as I keep saying, “Tsallis entropy” is really a terrible misnomer. Tsallis may have made new contributions to physics in his work on this entropy, but if we’re talking about the mere mathematical definition then he was about the sixth person to discover it: Havrda, Charvát, Daróczy, Patil, and Taillie all came before him, by some years. I think most of these people are still alive, so it matters a bit more to get it right.

Anyway, I don’t want to call it Rényi extropy: I want to call it diversity.

Posted by: Tom Leinster on May 10, 2011 9:20 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

My comment wasn’t really about how we should name certain constructions, but more about credit rating. Two ways to pick up credit are (1) to be of intrinsic importance and (2) to be tightly associated to something of importance. I was just suggesting an argument that the latter route, via association to $D_q$, seemed not to give Rényi entropy any special advantage.

As for credit for Tsallis entropy, or whatever it should be called, I guess its main selling points are its role as a $q$-deformed Shannon entropy, so that it satisfies a bunch of $q$-deformed equations, such as this one, and the fact that its associated divergence forms Amari’s 1-parameter family.

Posted by: David Corfield on May 10, 2011 10:01 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I was looking at one of the Amari papers you mentioned just as your comment appeared, but I haven’t been able to get anything out of it yet.

As for the various entropies, I see it like this. Practically speaking, anything called an entropy can be transformed, in an invertible way, into an effective number. This puts an equivalence relation on the class of entropies: two entropies are equivalent if they have the same effective number form.

All entropies in the same equivalence class can be viewed as “tightly associated”, because they can all be obtained from one another by an invertible transformation. But only one entropy in each equivalence class has the special property of being an effective number.

Posted by: Tom Leinster on May 10, 2011 10:38 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

If you are after a more interpretative approach to Tsallis entropy, try Flemming Topsøe’s Interaction between Truth and Belief as the key to entropy and other quantities of statistical physics:

The approach aims at providing a genuine interpretation, rather than relying either on analogies based on formal mathematical manipulations or else – more fruitfully, but not satisfactory – on axiomatic characterizations.

There are further slides on this topic at his webpage.

Posted by: David Corfield on May 10, 2011 10:44 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks, David! So, I’ll have to ponder this Tsallis entropy formula

$$T_q(x_1, \ldots, x_{n-1}, \mu x_n, (1 - \mu) x_n) = T_q(x_1, \ldots, x_n) + x_n^q T_q(\mu, 1 - \mu).$$

which Tom earlier called “the nicest formula involving Tsallis entropy that I’ve seen”, alongside his similar formula for glomming together Rényi entropies.

But I’m especially glad that this formula shows up as part of a characterization theorem for Tsallis entropy. I hadn’t known that. Indeed, I haven’t yet checked out that book Tom and Mark Meckes keep talking about:

  • J. Aczél and Z. Daróczy, On measures of information and their characterizations, Mathematics in Science and Engineering, vol. 115, Academic Press, New York, 1975.

so I feel quite ignorant about results that characterize different kinds of entropy. I could be missing out on lots of good ideas that way!

Luckily, the National University of Singapore does own this book. Anyone who doesn’t have access to it can read this review and drool: “one has the impression that they have so thoroughly explored the field, that there is little chance for the discovery of really new properties of Shannon’s entropy and eventually Rényi’s entropy”.

Posted by: John Baez on May 10, 2011 4:42 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Here are some more ideas. They are possibly naive and not directly related to the intriguing ideas discussed in the above post; sorry about that. Let me know if you want more detail on any one of these ideas. I don’t want to write too much in one comment.

(1) Suppose we are given objects $Q_1, \ldots, Q_n \in FinProb$ as in the main post above. Is it possible to regard the object $P \circ (Q_1, \ldots, Q_n)$ as a weighted colimit, using enrichment over the category $FinMes$ (finite measure spaces with measure-preserving functions)?

(2) So far, we have always been taking the target category to be given by the real numbers, whether as a monoid, a poset, or a symmetric monoidal category which combines the previous two points of view. But following the idea that entropy is an instance of decategorification, shouldn’t the target category be a result instead of an assumption?

(3) Following up on the previous point: how can one characterize the natural numbers as a decategorification of $FinSet$? Would this be a good example which could guide us in extending it to our setting?

(4) I have been thinking a bit more about the coslice category $1/FinStoch$. It is the category of finite probability spaces together with measure-preserving probabilistic functions (stochastic matrices). It contains two interesting subcategories: $FinProb$ on the one hand, which consists precisely of the measure-preserving deterministic functions; these morphisms do nothing but identify some of the events in their domain. But it also contains the subcategory of those probabilistic functions where each element of the codomain has at most one preimage; these morphisms do nothing but add additional randomness to some outcomes. This subcategory is the opposite category of what Tom has been calling $M_\Delta$. The punchline: these two subcategories smell like a weak factorization system on $FinStoch$! Every stochastic map can be decomposed as first adding more randomness, and then identifying some outcomes. I am also confident that the lifting property holds, but this remains to be checked.

Moreover, a crucial property of entropy is that it decreases (non-strictly) under the morphisms of the first subcategory, while it increases (non-strictly) under morphisms from the second subcategory!
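Both monotonicity properties are easy to illustrate numerically. A sketch in Python (the example distribution and the particular merge/split are mine):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]

# A morphism of the first kind: deterministically identify two outcomes.
merged = [p[0] + p[1], p[2]]
assert entropy(merged) <= entropy(p)   # coarse-graining decreases entropy

# A morphism of the second kind: add randomness by splitting an outcome
# stochastically (each codomain element has at most one preimage).
split = [p[0], p[1], 0.6 * p[2], 0.4 * p[2]]
assert entropy(split) >= entropy(p)    # adding randomness increases entropy
```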

Posted by: Tobias Fritz on May 9, 2011 10:55 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I find (4) intriguing. It might be worth thinking about this in the full generality of monads. All three categories ($FinStoch$, $FinProb$ and $M_\Delta$) are derived in a canonical way from the monad $\mathbf{P}$ for convex algebras. $FinStoch$ is the opposite of the Lawvere theory of $\mathbf{P}$, $FinProb$ is the category of elements of the restricted functor $\mathbf{P}: FinSet \to Set$ (explanation), and $M_\Delta$ is the free symmetric monoidal category containing an algebra for the underlying operad of $\mathbf{P}$. (Now that I’m calling the monad $\mathbf{P}$ rather than $\Delta$, this category should be called $M_{\mathbf{P}}$ rather than $M_\Delta$.)

I’m hoping someone else will think about this…

By the way, are you sure you mean $M_\Delta$, rather than $M_\Delta/1$? The objects of $M_\Delta$ are, after all, just finite sets.

Posted by: Tom Leinster on May 10, 2011 3:21 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I hadn’t really been following the previous discussion, but I enjoyed reading this post! I like the idea of “differentiating” operad homomorphisms.

A somewhat off-topic question: is this “partition function” related to the “partition function” of quantum field theory, which one also sometimes seems to differentiate with respect to something in the exponent to get, er, something possibly related to Feynman diagrams?

Posted by: Mike Shulman on May 10, 2011 3:00 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Mike wrote:

A somewhat off-topic question: is this “partition function” related to the “partition function” of quantum field theory, which one also sometimes seems to differentiate with respect to something in the exponent to get, er, something possibly related to Feynman diagrams?

Yes indeed!

Right now we’re talking about partition functions in statistical mechanics, which you get by summing over ‘states’ of a system, and what you sum is $\exp(-\beta H)$, where $H$ is the ‘energy’ of a state and $\beta$ is essentially ‘inverse temperature’. Everything you want to know in statistical mechanics, you can get by differentiating the partition function and playing various tricks.

There are also partition functions in quantum field theory, which you get by summing over ‘histories’ of a system, and what you sum is $\exp(i S/\hbar)$, where $S$ is the ‘action’ of a history and $\hbar$ is ‘Planck’s constant’. Everything you want to know in quantum field theory, you can get by differentiating the partition function and playing various tricks.

They’re incredibly similar. This is a basic fact of physics, which however remains deeply mysterious.

The really big difference is the factor of $i$. This is related to how probabilities in statistical mechanics are real while amplitudes in quantum field theory are complex. But often people play a trick and do quantum field theory using ‘imaginary time’, which sticks an extra factor of $i$ into the action, leaving us with a real exponent in $\exp(i S/\hbar)$. Then quantum field theory can be made to look exactly like statistical mechanics.

Whether they use this extra trick or not, physicists constantly exploit the strong analogy between statistical mechanics and quantum field theory. So, you’ll often see techniques developed in particle physics or string theory getting applied to more mundane and often very practical problems. Or conversely!

Posted by: John Baez on May 10, 2011 3:59 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Very nice, thank you! Now I feel like I understand the partition function in quantum field theory a little tiny bit better. Have people made any progress in explaining this deep mystery?

Posted by: Mike Shulman on May 10, 2011 11:55 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I bumped into a webpage with some inequalities involving Rényi entropy:

• Jörgen Cederlöf, Authentication in Quantum Key Growing, 4.4.2 Chain rule of Rényi entropy.

The inequalities in Theorem 3 look interesting because they involve both Rényi entropy and Shannon entropy. The proofs are pretty simple calculations using Jensen’s inequality. Tobias might find a lax functor or something lurking around here.

There’s also this:

Van Erven says:

Rényi entropy is well studied [Aczél and Daróczy, 1975, Ben-Bassat and Raviv, 1978], but although Rényi divergence appears in many computations, it has hitherto not been studied systematically. This chapter is intended as a reference document, which treats the basic properties of Rényi divergence in detail.

For those not keeping score, “Rényi divergence” is the same as “relative Rényi entropy”.

Over here, Tobias was asking if relative Rényi entropy decreases when you lump two outcomes together. I guess Tom answered that affirmatively here. I think this result is a special case of van Erven’s Theorem 6.11 (the Data Processing Inequality). There he shows that relative Rényi entropy decreases whenever you go to a coarser $\sigma$-subalgebra. He’s working with general measure spaces, but in the case of finite sets this basically just means lumping outcomes into groups.

I inserted a reference to van Erven’s work in Tobias’ section on The basic inequalities of information theory. Feel free to move that reference! But we should probably keep it around somewhere, and we should probably not spend any space in our paper(s) proving known inequalities, since we’ve got a lot to talk about.

Posted by: John Baez on May 10, 2011 5:56 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

More on why partition functions are so great. Tobias wrote:

Following up on the previous point: how can one characterize the natural numbers as a decategorification of FinSet?

I’ll tell you!

Would this be a good example which could guide us in extending it to our setting?

Yes, and I’ve been using this example to guide my writeup of partition functions. I should probably add a lot more background material, though! For example, this:

Decategorification is a systematic process attaching to any essentially small category CC its set of isomorphism classes C̲\underline{C}, together with a functor

[]:C 0C̲ [\cdot] : C_0 \to \underline{C}

sending any object cCc \in C to its isomorphism class [c][c]. Here C 0C_0 is the core of CC, meaning the category with the same objects but only the isomorphisms of CC as morphisms.

So, you can just say “the natural numbers are the decategorification of FinSet”. But if you want to be more precise, you can say there’s a bijection

$$\underline{FinSet} \cong \mathbb{N}$$

such that the composite functor

$$[\cdot] : FinSet_0 \to \underline{FinSet} \cong \mathbb{N}$$

sends lots of structures in $FinSet$ to similar structures in $\mathbb{N}$.

For example, it sends the coproduct and product of finite sets to $+$ and $\times$ in $\mathbb{N}$. (And this is what we expect children to learn, in an intuitive way, when they first learn to add and multiply: is it any wonder that some find it difficult?)
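To make this concrete, here is a tiny Python sanity check (my own sketch; the helper names are mine, not from the post) that taking isomorphism classes, i.e. cardinalities, turns the coproduct of finite sets into $+$ and the product into $\times$:

```python
def disjoint_union(a, b):
    """Coproduct in FinSet: tag elements so the two copies stay apart."""
    return {(0, x) for x in a} | {(1, y) for y in b}

def product(a, b):
    """Product in FinSet: ordered pairs."""
    return {(x, y) for x in a for y in b}

A = {"a", "b", "c"}
B = {1, 2}

# Decategorification [S] = len(S) is a rig homomorphism:
assert len(disjoint_union(A, B)) == len(A) + len(B)   # [A + B] = [A] + [B]
assert len(product(A, B)) == len(A) * len(B)          # [A x B] = [A] * [B]
```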

In short, we can say that “the decategorification of the rig category $FinSet$ is isomorphic to the rig of natural numbers”.

At the end of my section on the partition function, I show that the decategorification of the rig category $FinProb$ is isomorphic to the rig of partition functions.

By the way, this is only true for the version of $FinProb$ where the probability measures assign a nonzero measure to each point. Do we want a better name for this category???

But anyway, this blog post takes advantage of another structure that $FinProb$ and the rig of partition functions have in common. They are both algebras of the operad I'm calling $Convex$: the one whose operations are convex linear combinations. And decategorification is also a homomorphism of convex algebras.

So, we can also say the decategorification of the convex algebra $FinProb$ is isomorphic to the convex algebra of partition functions.

Note that $FinProb$ is a convex algebra in the category of topological categories, as explained by Tom in our ‘paper’. Its decategorification is a convex algebra in the category of topological spaces. That's just how it should be: the decategorification of a topological category is a topological space.

It really annoys me that everything I just said sounds sophisticated and fancy. It’s all very simple, and blitheringly obvious.

Posted by: John Baez on May 10, 2011 7:04 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

A couple of small thoughts:

the operad I'm calling $Convex$

I'm not sold on this notation. You pointed out to me that $\Delta$ was a bad name for this operad, since $\Delta$ is typically used to denote a certain category. But similarly, single words are often used to denote categories. I'd expect $Convex$ to denote the category of convex algebras. (Yes, an algebra for $Convex$ is a convex algebra, and people use $Lie$ for the operad whose algebras are Lie algebras, but I'm not sold on that either.)

I'm gravitating towards $\mathbf{P}$ for this operad. Its elements (probability distributions) can then be called $\mathbf{p}$ or $P$.

Note that $FinProb$ is a convex algebra in the category of topological categories, as explained by Tom

Maybe I'm just having a momentary block, but I don't know what you mean by $FinProb$ being a convex algebra, despite having apparently explained it myself.

Posted by: Tom Leinster on May 10, 2011 8:43 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

I'm gravitating towards $\mathbf{P}$ for this operad.

Okay, good - let's say that's settled. I was secretly using $P$ a lot, but for all sorts of reasons $\mathbf{P}$ is better.

I'm going crazy trying to edit our nLab page down to a more coherent article. I'll rename the operad you were calling $\Delta$ to ‘$\mathbf{P}$’, and I'll feel better.

Maybe I'm just having a momentary block, but I don't know what you mean by $FinProb$ being a convex algebra, despite having apparently explained it myself.

Oh, whoops - I guess you didn't explain that. You explained the idea of $\mathbf{P}$-algebras in $Cat(Top)$, and I think $FinProb$ is one of those.

Let’s see.

It’s a topological category in a hopefully obvious way (though there’s a choice here if we decide not to admit finite probability measures with points of measure zero).

More interestingly, suppose we have an $n$-ary operation $P \in \mathbf{P}$, meaning an $n$-tuple

$$P = (p_1, \dots, p_n)$$

of probabilities that sum to 1. We want this to act on $FinProb$, giving a functor

$$\alpha(P): FinProb^n \to FinProb$$

and we want these functors to make $FinProb$ into an algebra of $\mathbf{P}$.

So, for starters, suppose we have an $n$-tuple of objects $Q_1, \dots, Q_n \in FinProb$. Then we can form a probability measure space $\alpha(P)(Q_1, \dots, Q_n)$ by taking the disjoint union of the $Q_i$ and multiplying the measure on $Q_i$ by the number $p_i$. It's really cute to write this as

$$\alpha(P)(Q_1, \dots, Q_n) = p_1 Q_1 + \cdots + p_n Q_n$$

where $+$ is disjoint union of measure spaces and we're using the obvious way to multiply a measure space by a number: just multiply the measure by that number!
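Here is a little Python sketch of this construction on objects (the representation of a finite probability space as a dict from outcomes to probabilities, and the name `convex_combination`, are my own illustrative choices):

```python
def convex_combination(P, Qs):
    """alpha(P)(Q_1, ..., Q_n): disjoint union of the spaces Q_i,
    with the measure on Q_i rescaled by p_i.  Outcomes are tagged
    (i, x) to keep the disjoint copies apart."""
    assert len(P) == len(Qs) and abs(sum(P) - 1.0) < 1e-12
    return {(i, x): p * q
            for i, (p, Q) in enumerate(zip(P, Qs))
            for x, q in Q.items()}

Q1 = {"heads": 0.5, "tails": 0.5}
Q2 = {"a": 0.2, "b": 0.8}
R = convex_combination((0.3, 0.7), [Q1, Q2])

assert abs(sum(R.values()) - 1.0) < 1e-12   # still a probability measure
assert abs(R[(1, "b")] - 0.7 * 0.8) < 1e-12  # measure on Q_2 scaled by p_2
```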

I think we can also define how $P$ acts on morphisms in $FinProb$, giving us a functor

$$\alpha(P): FinProb^n \to FinProb$$

I was using this idea all over my blog post, secretly. It's implicit in laws like this:

$$Z(P \circ (Q_1, \dots, Q_n)) = P(Z(Q_1), \dots, Z(Q_n))$$

You can think of $P \circ (Q_1, \dots, Q_n)$ as composition in the operad $\mathbf{P}$ if you want. But that's a very close relative of the idea I just explained! And viewed in the proper light, the above law can be reinterpreted as saying that $Z$ is a homomorphism of $\mathbf{P}$-algebras:

$$Z(\alpha(P)(Q_1, \dots, Q_n)) = P(Z(Q_1), \dots, Z(Q_n))$$

At least I hope so. Am I screwing up somewhere?

Posted by: John Baez on May 10, 2011 10:44 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks! I was having a momentary block.

Posted by: Tom Leinster on May 10, 2011 10:49 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Got it! It's pretty, even though (or probably because) it's not what I had in mind. I will explain the latter stuff towards the end of this comment.

Tom has found that the ‘old’ $FinProb$, which allows zero probabilities, can naturally be constructed from the Giry monad (the convex combinations monad). It's basically the comma category of $Set$ with respect to the functors $1 \rightarrow Set$ and $\mathbf{P}: Set \rightarrow Set$. So does the ‘new’ category $FinProb$, where zero probabilities are disallowed, also have such an abstract description? At the moment, I don't see how to do this. If it's not possible, then this might be an argument for keeping the zero probabilities.

Then, about (either version of) $FinProb$: isn't it a bit evil to say that it is a $\mathbf{P}$-algebra? I mean, the $\mathbf{P}$-algebra structure is not compatible with operadic composition ‘on the nose’, or is it? It seems analogous to me to saying that $Set$ with respect to cartesian product is a strict monoidal category, which it is not. We might need some kind of ‘associators’ or so, leading to some high-brow category theory…

What I had in mind for viewing entropy as a decategorification is something which remembers a little bit more of the morphisms of the original category. For an important property of entropy is one which I explained here: it decreases under identifying outcomes, while it increases under adding noise. So instead of taking the decategorification to be only a rig, maybe we want it to be an ordered rig which takes this into account?

More generally, suppose we have a category $C$ which contains a subcategory $L \subseteq C$ which we think of as morphisms which ‘increase size’, and a subcategory $R \subseteq C$ which we think of as morphisms which ‘decrease size’. Then, we may consider functors from $C$ to posets which send morphisms in $L$ to $\leq$ and morphisms in $R$ to $\geq$. The decategorification of $C$ should then be a poset together with such a functor from $C$, such that this data is universal for all such functors from $C$ to posets.

Does this make any sense at all?

Posted by: Tobias Fritz on May 10, 2011 1:13 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias wrote:

Then, about (either version of) FinProb: isn’t it a bit evil to say that it is a P-algebra? I mean, the P-algebra structure is not compatible with operadic composition ‘on the nose’, or is it?

Good point! Since I made $FinProb$ into a $\mathbf{P}$-algebra using

$$\alpha(P)(Q_1, \dots, Q_n) = p_1 Q_1 + \cdots + p_n Q_n$$

where $+$ is coproduct of measure spaces, and the coproduct only obeys the associative law up to a specified isomorphism, it'll only be a ‘weak’ algebra of this operad.

We might need some kind of ‘associators’ or so, leading to some high-brow category theory…

Yeah, but we’re highbrow guys around here. I bet weak algebras of operads are already well-understood. Tom is the one to ask about this, since he wrote the book on operads and higher categories. But I could probably reinvent the theory myself if needed, since operads are monoid objects in a certain monoidal category, and we already know how to weaken the concept of ‘action of a monoid object’.

It's also quite possible that a strictification is fairly benign here. After all, $FinProb$ is very very much like the operad $\mathbf{P}$ itself, especially if we work with the version of $FinProb$ that allows for probability measures having points of measure zero. Tom already explained this somewhere… and since there's nothing ‘weak’ about $\mathbf{P}$, there's probably a fairly reasonable category equivalent to $FinProb$ that's a strict $\mathbf{P}$-algebra.

Again, I think Tom could easily handle this on a napkin while eating breakfast.

What I had in mind for viewing entropy as a decategorification is something which remembers a little bit more of the morphisms of the original category.

We can probably work this out; I get what you mean but I haven’t figured out the right way to formalize it, and now it’s my bedtime.

Posted by: John Baez on May 10, 2011 4:28 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I bet weak algebras of operads are already well-understood.

Yup. They’re nothing to be scared of. The definition is no more complicated than the definition of monoidal category (which is in fact a special case).

Tom is the one to ask about this, since he wrote the book on operads and higher categories.

Regard the grey hairs.

But I could probably reinvent the theory myself if needed

Yes indeed.

there's probably a fairly reasonable category equivalent to $FinProb$ that's a strict $\mathbf{P}$-algebra.

I guess we could take the category whose objects are pairs $(n, \mathbf{p})$ where $n \in \mathbb{N}$ and $\mathbf{p} \in \mathbf{P}_n$, and whose maps $(n, \mathbf{p}) \to (m, \mathbf{r})$ are measure-preserving maps $\{1, \ldots, n\} \to \{1, \ldots, m\}$. That'll do it, right?

Posted by: Tom Leinster on May 10, 2011 11:00 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I bet weak algebras of operads are already well-understood.

Weak algebras, also known as homotopy algebras or $\infty$-algebras, were the very motivation for the introduction of the notion of operad in the first place.

Posted by: Urs Schreiber on May 10, 2011 11:45 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Weak algebras, also known as homotopy algebras or ∞-algebras, were the very motivation for the introduction of the notion of operad in the first place.

No, I don’t think that’s really true, not in the sense that John means “weak algebra”. The goal of operads was (more or less) to describe things like “homotopy monoids” and “homotopy commutative monoids”, but this is done by defining such a thing to be a strict algebra over a different operad. One can then study “resolutions” of operads with the property that their strict algebras are a good notion of “weak algebra” for the starting operad, but that’s (a priori) a different thing from starting from a “weak algebra” for a given operad in the sense that the operad laws hold up to specified isomorphism/equivalence. Of course the two are related, but I think it would be more correct to say that the introduction of the notion of operad was a way precisely to avoid the notion of “weak algebra” in the sense that John means.

Posted by: Mike Shulman on May 11, 2011 2:34 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

No, I don’t think that’s really true,

Only if you distinguish a concept from its presentation. Which I’d think we wouldn’t do if we are categorically minded.

And the difference between writing down a component of an associator and having an abstract associator included in the operad itself is almost just a typographical one. Similarly, somebody who thinks of pseudofunctors as strict 2-functors out of a resolving 2-category still knows and sees everything there is to know and see about weak functors, it’s just a way of organizing the information.

So I think it is important to notice that the notion of weak algebras over an operad is a very well studied one. But if you disagree, let’s leave it at that, it’s not such an important point here.

Posted by: Urs Schreiber on May 11, 2011 8:32 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Only if you distinguish a concept from its presentation. Which I’d think we wouldn’t do if we are categorically minded.

Great! Does that mean I can ignore the differences between all the definitions of weak n-category, since they are all presentations of the same concept? (-:

the notion of weak algebras over an operad is a very well studied one.

The intuitive concept of “weak algebra over an operad” is, indeed, a very well studied one. But the particular precise definition of “weak algebra over an operad” which John was referring to, in contrast to some other different precise definitions of the same intuitive concept, is not.

Probably we are not actually disagreeing, though, only using the same words to mean different things.

Posted by: Mike Shulman on May 11, 2011 11:36 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Back in Entropy, diversity and cardinality (part 1), I parenthetically observed that the function $x \mapsto -x\log x$ is a derivation, and that this fact is exactly what makes Shannon entropy obey the law

$$H(X \times Y) = H(X) + H(Y).$$

I never understood what was going on there. Does your way of thinking shed some light on it?

Posted by: Tom Leinster on May 10, 2011 8:52 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Not yet, but it might once I pick my jaw up off the floor! I never knew that

$$D(x) = -x \ln x$$

obeys the product rule

$$D(x y) = (D x) y + x (D y)$$

This is an utter shock: I’m supposed to know basic algebra facts like this!

Saying it's a ‘derivation’ isn't quite right. That accounts for some of the speed at which my jaw hit the floor: $[0,\infty)$ is a familiar rig to me, so I was astounded to hear it had a derivation I didn't know about. But a derivation of a rig must obey

$$D(0) = 0$$

and the sum rule

$$D(x + y) = D(x) + D(y)$$

as well as the product rule, and your $D$ doesn't obey the sum rule (though it does have $D(0) = 0$).
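For the skeptical, here's a quick numeric check (a Python sketch of my own, not from the thread) that $D(x) = -x \ln x$ obeys the product rule but fails the sum rule:

```python
from math import log

# D(x) = -x ln x, defined on (0, infinity)
D = lambda x: -x * log(x)

x, y = 0.3, 0.7

# Product rule: D(xy) = D(x) y + x D(y), since
# -xy ln(xy) = y(-x ln x) + x(-y ln y).
assert abs(D(x * y) - (D(x) * y + x * D(y))) < 1e-12

# Sum rule fails: D(x + y) = D(1) = 0, but D(x) + D(y) > 0 here.
assert abs(D(x + y) - (D(x) + D(y))) > 0.1
```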

Nonetheless, you have just made clear that the concept of a derivation from a mere monoid to a bimodule of that monoid makes sense, and that here’s an interesting example.

And more important, I’m sure this must be related to all the junk I’ve been saying lately! Thanks—I definitely need to mull over this new clue.

Posted by: John Baez on May 10, 2011 11:57 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Taking up on this clue, let's try to define Shannon entropy also for non-normalized distributions $p = (p_1, \ldots, p_n)$. In Rényi's paper, we can find the definition

$$H(p) = -\sum_i \frac{p_i}{\sum_j p_j} \log p_i$$

but I would like to try this instead:

$$H(p) = -\sum_i p_i \log \left( \frac{p_i}{\sum_j p_j} \right)$$

We can also write this as

$$H(p) = \left(\sum_i p_i\right)\log\left(\sum_i p_i\right) - \sum_i p_i\log p_i$$

and notice that $H$ actually is the “linearity defect” of the function $D(x) = x\log(x)$!
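A quick numeric check of this defect claim (my own Python sketch; `H` and `D` follow the formulas above, and the test distribution is arbitrary):

```python
from math import log

def H(p):
    """Entropy of a non-normalized distribution, as defined above."""
    t = sum(p)
    return -sum(pi * log(pi / t) for pi in p)

# D(x) = x log x; H(p) should equal D(sum p) - sum D(p_i).
D = lambda x: x * log(x)

p = [0.5, 1.5, 2.0]          # non-normalized: total measure 4
defect = D(sum(p)) - sum(D(pi) for pi in p)
assert abs(H(p) - defect) < 1e-12
```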

It has other nice properties which I am currently working on, so more about that later… Eventually I would hope that we can use this to replace the category $\mathbf{FP}$ with its $\mathbf{P}$-algebra structure by a similar category of finite measure spaces, where the additional structure is then only given by direct sums of measure spaces (coproducts) and scalar multiples.

Do you like this idea?

By the way, your writing speed is faster than I can read!

Posted by: Tobias Fritz on May 12, 2011 11:15 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias wrote:

Do you like this idea?

Yes, if it works. I can imagine you wanting to redo some of our work on the ‘convex linear combination’ monad by the ‘nonnegative linear combination’ monad, whose algebras are modules of the rig $[0,\infty)$. You could then systematically replace $\mathbf{FP}$ by some category, let's call it $\mathbf{FQ}$ for now, which is related to measures on finite sets in a way similar to how $\mathbf{FP}$ is related to probability measures on finite sets.

However, right now I am very eager to not study new ideas, but merely to finish a paper explaining Tom’s operadic characterization of Shannon entropy in a way that lots of people can understand. I hope you and Tom will agree that this is a reasonable goal, though certainly not all we might someday want to do. There is too much to do!

My idea is to start with the simplest ideas and then lead up to ideas that take more expertise to understand, like Tom's ‘lax points’ idea. Somewhere in between might be some material where we consider $\mathbb{R}$, not merely as a 1-object category, but as a 1-object 2-category with inequalities as morphisms. This might be a way to incorporate your work on inequalities.

I still think there should be a beautiful relation between

1) entropy as minus the derivative of the partition function

2) the partition function as the decategorification of the category of finite probability measure spaces

and

3) entropy as a map of $\mathbf{P}$-algebras.

But I will think about this ‘on the side’.

By the way, your writing speed is faster than I can read!

That’s probably because you need to think about what I’m saying, while I don’t.

Posted by: John Baez on May 12, 2011 12:20 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

lead up to ideas that take more expertise to understand, like Tom’s ‘lax points’ idea.

We don’t need to mention that at all, assuming my post yesterday was right. Notice that the direct statement of the theorem there doesn’t mention lax points.

Posted by: Tom Leinster on May 12, 2011 12:37 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I would have thought that to entice people into reading about operads and/or $\mathbf{P}$-algebras for entropy it would be good to indicate a range of directions that might then be naturally taken which otherwise would not have been.

Is there anything interesting about all $\mathbf{P}$-algebras? Is there a $\mathbf{Q} \to \mathbf{P}$, or the other way, such that Shannon entropy throws light on some aspect of $\mathbf{Q}$-algebras? Etc.

Looks like Tobias is providing an answer to the second below.

Posted by: David Corfield on May 12, 2011 2:24 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

David wrote:

I would have thought that to entice people into reading about operads and/or $\mathbf{P}$-algebras for entropy it would be good to indicate a range of directions that might then be naturally taken which otherwise would not have been.

I can think of stuff like that. But first I want to figure out what’s really going on here, which means I need to write stuff up.

Posted by: John Baez on May 13, 2011 6:01 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

However, right now I am very eager to not study new ideas, but merely to finish a paper explaining Tom’s operadic characterization of Shannon entropy in a way that lots of people can understand. I hope you and Tom will agree that this is a reasonable goal, though certainly not all we might someday want to do. There is too much to do!

Yes, absolutely! I agree that the present result is a good moment to take a halt and try to explain it to people. So we should find a way to make it as simple and as appealing as possible! And I think that precisely this can be done by following the idea of allowing non-normalized probabilities. I am quite excited about this at the moment, so I hope that these preliminary results are correct and that my rant doesn’t sound flamboyant.

Allowing non-normalized distributions entails going from convex combinations to non-negative linear combinations. A more concise term is ‘conical combinations’, so let me use this. What is an algebra for the monad of conical combinations? Well, it's a set $X$ together with an addition $(x,y) \mapsto x+y$ and a scalar multiplication $x \mapsto p \cdot x$ for each $p \in \mathbb{R}_+$, satisfying the obvious compatibility conditions. If you think about it a bit, this is precisely saying that $X$ is a module over the rig $\mathbb{R}_+$! So, in the paper we don't even need to utter the word ‘algebra of a monad’, we can just say ‘module’, which will certainly appeal to more people.

Moreover, in this approach we even get the target category $\mathbb{R}_+$ without artificially postulating it: it's the free module on one generator! And since rescaling this module by a scalar is a module isomorphism, we can even say that all the scalar multiples of Shannon entropy are isomorphic, so there is only one entropy functor.

Now, on to the details. In the following, I will try to reproduce something similar to Tom's result about Shannon entropy as a functor $\mathbf{FP} \rightarrow \mathbb{R}_+$.

Let's start with the category $FinMeas$ where objects are finite measure spaces and morphisms are measure-preserving functions between them; in particular, any morphism preserves the normalization, so there are no morphisms between spaces of different total measure. There is a notion of direct sum $\oplus$ of measure spaces, where we take the disjoint union of the underlying sets and keep the measure on each set. There is also a notion of scalar multiplication of an object: for a given weight $\lambda \in \mathbb{R}_+$, we just take a measure space and scale the whole measure by $\lambda$.

Unfortunately, these operations do not satisfy (weak) distributivity. E.g. $p \oplus p$ is not isomorphic to $2p$. But at least there is a nice measure-preserving function $p \oplus p \rightarrow 2p$, so I imagine that one can identify direct sum $\oplus$ and $\lambda$-multiplication as part of the structure of some sort of lax algebra of the operad for $\mathbb{R}_+$-modules, whatever that means. But that remains to be seen.

Now we consider functors

$$H: FinMeas \rightarrow \mathbb{R}_+$$

which are compatible with $\oplus$ and $\lambda$-multiplication, i.e.

$$H(f \oplus g) = H(f) + H(g), \qquad H(\lambda f) = \lambda H(f)$$

for all morphisms $f$ and $g$ and all $\lambda \in \mathbb{R}_+$.

To be continued in the next comment…

Posted by: Tobias Fritz on May 12, 2011 1:39 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

…continued from the previous comment.

Of course, besides the above conditions on $H$, there is also functoriality

$$H(g \circ f) = H(g) + H(f).$$

Ignoring continuity issues for now, the claim is that any such functor is a scalar multiple of the relative entropy

$$S(f: p \rightarrow r) = S(p) - S(r) = \left(\sum_i p_i\right)\log\left(\sum_i p_i\right) - \left(\sum_i r_i\right)\log\left(\sum_i r_i\right) - \sum_i p_i\log p_i + \sum_i r_i\log r_i = \sum_i p_i \log\frac{r_{f(i)}}{p_i}$$

where the last equality holds since $f$ is measure-preserving. It is simple to check that this functor $S$ indeed satisfies the required conditions.
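Here's a small numeric check (my own Python sketch; the encoding of a morphism as a list `f` of target indices is an illustrative choice) that the last expression agrees with $S(p) - S(r)$ for a measure-preserving map:

```python
from math import log

def S_obj(p):
    """Entropy of a finite measure space p = (p_1, ..., p_n),
    per the formula S(p) = (sum p) log(sum p) - sum p_i log p_i."""
    t = sum(p)
    return t * log(t) - sum(x * log(x) for x in p)

def S_mor(p, f, r):
    """Relative entropy of a measure-preserving f: p -> r,
    where f[i] is the index in r that outcome i maps to."""
    return sum(pi * log(r[f[i]] / pi) for i, pi in enumerate(p))

p = [0.1, 0.4, 0.2, 0.3]
f = [0, 0, 1, 1]          # lump the outcomes pairwise
r = [0.5, 0.5]            # pushforward measure, so f is measure-preserving

assert abs(S_mor(p, f, r) - (S_obj(p) - S_obj(r))) < 1e-12
```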

So it remains to show the converse: given a functor $H$ which is compatible with direct sum and $\lambda$-multiplication, show that it is a scalar multiple of $S$.

Notation: given a finite measure space $p = (p_1, \ldots, p_n)$, I will write $\hat{p} = \sum_i p_i$ for its total measure, and later also consider the one-element measure space $(\hat{p})$.

As a special case of functoriality, we obtain $H(\mathrm{id}_p) = 0$ for any $p \in FinMeas$. Now we can define the entropy of an object to be

$$H(p) = H(p \rightarrow (\hat{p})).$$

Then it follows from functoriality as applied to $p \rightarrow r \rightarrow (\hat{r})$ that $H$ is uniquely determined by the entropies of objects, and the entropy of a morphism is the difference of the entropies of domain and codomain,

$$H(f: p \rightarrow r) = H(p) - H(r).$$

Now we can decompose $f$ as a fiberwise direct sum as follows: each fiber $f^{-1}(i)$ is a measure space in its own right, carrying the measure $p(i)$, so that $p \cong \oplus_i p(i)$ and $\widehat{p(i)} = r_i$. So we get from the compatibility of $H$ with direct sums,

$$H(f) = \sum_i H(p(i) \rightarrow r_i)$$

As the final step, we can rescale by a factor of $1/r_i$ and use the homogeneity property of $H$ to obtain

$$H(p(i) \rightarrow r_i) = r_i \cdot H\left(\frac{p(i)}{r_i} \rightarrow (1)\right) = r_i \cdot H\left(\frac{p(i)}{r_i}\right)$$

Putting the previous three equations together, we have the operadic property, this time even for measure spaces without normalization! Since Fadeev's axiom is a special case of this, we recover $H$ as a scalar multiple of relative Shannon entropy $S$.
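Here's a numeric check of the fiberwise decomposition for the relative entropy $S$ itself (my own Python sketch, again encoding a morphism as a list of target indices; none of the names are from the thread):

```python
from math import log

def S_obj(p):
    """S(p) = (sum p) log(sum p) - sum p_i log p_i."""
    t = sum(p)
    return t * log(t) - sum(x * log(x) for x in p)

def S_mor(p, f, r):
    """Relative entropy of a measure-preserving f: p -> r."""
    return sum(pi * log(r[f[i]] / pi) for i, pi in enumerate(p))

p = [0.1, 0.4, 0.2, 0.3]
f = [0, 0, 1, 1]
r = [0.5, 0.5]

# Fiberwise formula: H(f) = sum_j r_j * H(p(j)/r_j), where p(j) is the
# fiber of f over j, rescaled to total measure 1.
fiberwise = sum(
    r[j] * S_obj([p[k] / r[j] for k in range(len(p)) if f[k] == j])
    for j in range(len(r)))

assert abs(S_mor(p, f, r) - fiberwise) < 1e-12
```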

So, everything essentially works as before, except that convex spaces and $\mathbf{P}$-algebras become redundant. This may sound strange coming from someone who has studied convex spaces for a substantial amount of time; but then the latter experience is precisely why I think that they are normally redundant.

The cleanest instance of the operadic property may actually be as follows:

$$S(p \oplus r) = S(p) + S(r) + S((\hat{p}, \hat{r}))$$

Posted by: Tobias Fritz on May 12, 2011 2:25 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

And in order to adapt this to Tom's characterization of $q$-Tsallis entropy, it is enough to change the homogeneity condition to $H(\lambda f) = \lambda^q H(f)$: et voilà!

Posted by: Tobias Fritz on May 12, 2011 4:17 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Sorry, this last comment was a bit premature. But I think that it should work out if one defines Tsallis entropy/surprise entropy for general measures as

$$S_q(p) = \frac{1}{q-1}\left(\left(\sum_i p_i\right)^q - \sum_i p_i^q\right)$$

Just like the Shannon entropy is the “additivity defect” of the function $x \mapsto x\log x$, the $q$-Tsallis entropy is the “additivity defect” of the function

$$x \mapsto \frac{x^q}{q-1}$$
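A quick numeric check (my own Python sketch, with an arbitrary non-normalized test measure) that $S_q$ is indeed the additivity defect of $x \mapsto x^q/(q-1)$:

```python
def S_q(p, q):
    """Tsallis-type entropy for a non-normalized measure p, as above."""
    return (sum(p) ** q - sum(x ** q for x in p)) / (q - 1)

def F(x, q):
    """The function whose additivity defect S_q should be."""
    return x ** q / (q - 1)

p = [0.5, 1.5, 2.0]
q = 2.0
defect = F(sum(p), q) - sum(F(x, q) for x in p)
assert abs(S_q(p, q) - defect) < 1e-12
```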

Posted by: Tobias Fritz on May 12, 2011 4:27 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias wrote

Allowing non-normalized distributions entails going from convex combinations to non-negative linear combinations. A more concise term is ‘conical combinations’, so let me use this. What is an algebra for the monad of conical combinations? Well, it's a set $X$ together with an addition $(x,y) \mapsto x+y$ and a scalar multiplication $x \mapsto p \cdot x$ for each $p \in \mathbb{R}_+$, satisfying the obvious compatibility conditions. If you think about it a bit, this is precisely saying that $X$ is a module over the rig $\mathbb{R}_+$! So, in the paper we don't even need to utter the word ‘algebra of a monad’, we can just say ‘module’, which will certainly appeal to more people.

Yes, that’s certainly true. Due to the sin of pride, I can’t resist mentioning that I recently said:

I can imagine you wanting to redo some of our work on the ‘convex linear combination’ monad by the ‘nonnegative linear combination’ monad, whose algebras are modules of the rig [0,∞).

However, I hadn’t thought of using this to express Tom’s idea using more familiar language. That would be neat!

Right now I’m trying to write up Tom’s idea in a way that mentions operads but avoids monads—at least at first. I like the operad for convex linear combinations and I think it deserves to be explored, because (as you know) there’s a fairly big literature on physical theories that assume a convex set of states. I sort of wish people working on these would have a reason to start using operads.

For example:

• Howard Barnum, Ross Duncan and Alex Wilce, Symmetry, compact closure and dagger compactness for categories of convex operational models.

On the other hand, you could argue that what a lot of this work does is take a convex set and embed it inside a cone, which is a [0,∞)-module. So maybe modules of rigs will suffice. I guess you’ll soon find out!

Posted by: John Baez on May 12, 2011 5:09 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Is there any need to mention either operads or monads? I’m not saying that we don’t want to, just that as far as I can see, we don’t need to.

What I mean is this. A convex space can be described as a set equipped with some binary operations satisfying some axioms. Anyone who's seen the definition of topological group will understand immediately that this definition can automatically be extended to a definition of convex space in $Top$ or $Cat$, for instance. Almost anyone who knows what a monad is will understand immediately that there is an associated monad; so maybe there's no need to tell them so explicitly.

Similarly, an algebra for the convexity operad $\mathbf{P}$ can be described as a set (or whatever) equipped with the same binary operations, satisfying a subset of the previous set of axioms. (Just drop “$t x + (1-t)x = x$”, I guess.) Almost anyone who knows what an operad is will understand that because of the form of the axioms, there's an associated operad; so again, maybe there's little need to tell them so.

Of course some people like operads and monads and find it positively helpful to see things from that point of view. (And of course, I’m one of them.) So it depends what we’re trying to achieve.

Posted by: Tom Leinster on May 12, 2011 6:14 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

Is there any need to mention either operads or monads? I’m not saying that we don’t want to, just that as far as I can see, we don’t need to.

I think it’s good to explain everything first with a minimum of technology, then again with an intermediate amount, and then again with a maximum amount. For example, there’s something devastatingly cool about saying that multiples of Shannon entropy are the only ‘lax points’ of some sort. But this sort of observation can occur near the end of the paper, after the children have been put to bed and we can talk freely.

Posted by: John Baez on May 13, 2011 6:10 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

However, I hadn’t thought of using this to express Tom’s idea using more familiar language. That would be neat!

It might not have been explained well, but that is precisely what I did in these two comments.

And, at least to me, it seems significantly more insightful than the version which you’re about to explain in the draft. For example, the idea of entropy as the “linearity defect” or “additivity defect” of a non-linear function is something which is not apparent when working with normalized measures only. Besides, it all generalizes to Tsallis entropy in the way I indicated here. I can provide more detail on this tomorrow.

Is it clear what I have been doing?

Posted by: Tobias Fritz on May 12, 2011 7:05 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias wrote:

It might not have been explained well, but that is precisely what I did in these two comments. Is it clear what I have been doing?

You posted the 2nd of those comments after I wrote mine, but before I posted it. (I had dinner in the meantime.) It’ll take me a while to read what you’ve done and understand it. It could be great.

Posted by: John Baez on May 13, 2011 6:05 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

your DD doesn’t obey the sum rule

Ah, I’d forgotten that was in the definition of derivation. I only remembered the exciting part, not boring old linearity. Sorry for any jaw injuries.

Posted by: Tom Leinster on May 10, 2011 12:07 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

In the proof of Equation (2), shouldn’t there be minus signs in the right hand sides of the first 3 lines?

Posted by: Stuart Presnell on May 10, 2011 3:19 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Yes, thanks! I’ll fix that.

Posted by: John Baez on May 11, 2011 2:13 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

By the way, I cannot help but notice the similarity between the partition function Z(β)= j=1 np j βZ(\beta)=\sum_{j=1}^n p_j^\beta and the Riemann zeta function ζ(β)= j=1 (1j) β\zeta(\beta)=\sum_{j=1}^\infty\left(\frac{1}{j}\right)^\beta.

The bridge between the two concepts could be provided by operator zeta functions. For an operator OO with eigenvalues λ i\lambda_i, this is defined to be ζ O(β)= i(1λ i) β\zeta_O(\beta)=\sum_i \left( \frac{1}{\lambda_i} \right)^\beta.

In certain mathematical circles, the partition function is also known as the trace of the heat kernel: when the Hamiltonian operator HH is the Laplacian on a Riemannian manifold, then heat flow on that manifold is governed by the heat kernel e tHe^{tH}, where now tt is the amount of time for which heat is allowed to flow. The trace of the heat kernel, as a function of tt, is precisely the partition function (up to the sign of the argument).
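For concreteness, the relation described here is easy to check numerically: taking the "Hamiltonian" whose energy levels are E_j = -ln p_j, the trace of the heat kernel at time β reproduces the partition function Z(β). (This is a quick Python sketch with an arbitrary made-up distribution, not part of the original comment.)

```python
import math

# A finite probability distribution p and its partition function
# Z(beta) = sum_j p_j^beta, viewed as the trace of the "heat kernel"
# exp(-beta*H) for the Hamiltonian with energy levels E_j = -ln p_j.
p = [0.5, 0.3, 0.2]

def partition(p, beta):
    return sum(pj ** beta for pj in p)

def heat_trace(energies, beta):
    return sum(math.exp(-beta * E) for E in energies)

energies = [-math.log(pj) for pj in p]
beta = 2.5
# Since p_j^beta = exp(-beta * (-ln p_j)), the two agree exactly.
assert abs(partition(p, beta) - heat_trace(energies, beta)) < 1e-12
# At beta = 1 the partition function is just the total probability, 1.
assert abs(partition(p, 1.0) - 1.0) < 1e-12
```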

Posted by: Tobias Fritz on May 10, 2011 3:28 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Since the Riemann hypothesis is such a famous open problem, people have tried thinking about the Riemann zeta function in lots of ways.

In week199, I described how the Riemann zeta function can be thought of as the partition function of the “free Riemann gas”, a quantum system consisting of arbitrarily many noninteracting “primons”. Each primon has this spectrum of energy levels:

ln2,ln3,ln5,ln7,ln11, \ln 2, \ln 3, \ln 5, \ln 7, \ln 11, \dots

This idea was advocated by Bernard Julia. Unfortunately, by itself it doesn’t seem to give enough new insights to make progress on the Riemann hypothesis.

However, there has been an intensive search to find a classical system which gives the free Riemann gas when quantized. Gutzwiller discovered a famous formula relating periodic orbits of a classical system to the energy levels of its quantized version. For a free particle on a Riemannian manifold these periodic orbits are closed geodesics. An excellent context for examining this relationship in detail is a free particle on a hyperbolic surface: a quotient of the hyperbolic plane by some discrete group of symmetries. In week215 you’ll see a detailed analogy between number fields and hyperbolic surfaces which was worked out by Darin Brown and others. Both entities have a kind of zeta function, and indeed the whole apparatus of algebraic number theory seems to have an analogue in the theory of hyperbolic surfaces.

This is part of a deeper analogy between primes and knots, which I discussed in week257. The analogy between algebraic number fields and Riemann surfaces (e.g. hyperbolic surfaces) has many aspects, but if we look at the spectrum of an algebraic number field with the eyes of étale cohomology it looks not 2-dimensional but 3-dimensional. From this viewpoint primes in an algebraic number field are analogous not to closed geodesics on a surface, but knots in a 3-manifold!

For an explanation of this fact make sure to read not only the body of week257 but also the remarks by James Borger and Jordan Ellenberg in the “Addenda”.

There is a lot of deep mathematics supporting these strange analogies, and we can hope that some of it will fit into a clearer picture when people prove the Riemann hypothesis.

Posted by: John Baez on May 11, 2011 1:59 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I need to stop doing so much of this stuff and focus on other things for a while — revising a paper, examining a thesis, and running this meeting.

But before I do that, here’s a version of my “lax point” result that I think some people will like better. It’s closer to John’s ideal of characterizing (Shannon) entropy as a functor

(things that should have entropy) \to (numbers that are entropies).

I’ve only just figured it out, and I seem to be going through a mistake-making phase, so read sceptically…

First I’ll state it as briefly as possible. Then I’ll explain.

Statement of theorem

Let FP\mathbf{FP} be the category in which:

  • an object is a pair (n,p)(n, \mathbf{p}) where nn is a natural number and pP n\mathbf{p} \in \mathbf{P}_n
  • a map (f,r):(m,s)(n,p)(f, \mathbf{r}): (m, \mathbf{s}) \to (n, \mathbf{p}) is a map f:{1,,m}{1,,n}f: \{1, \ldots, m\} \to \{1, \ldots, n\} of finite sets together with, for each i{1,,n}i \in \{1, \ldots, n\}, a probability distribution r(i)\mathbf{r}(i) on the fibre f 1(i)f^{-1}(i), such that

    s j=r(i) jp i s_j = \mathbf{r}(i)_j p_i

    whenever i{1,,n}i \in \{1, \ldots, n\} and jf 1(i)j \in f^{-1}(i).

This is a topological category: the space of objects is the disjoint union of the spaces P n\mathbf{P}_n (nn \in \mathbb{N}), and similarly for the space of maps.

It’s also a (strict) algebra in TopCatTopCat for the symmetric, topological operad P\mathbf{P}, in a fairly obvious way.

The monoid (,+,0)(\mathbb{R}, +, 0), viewed as a one-object category R\mathbf{R}, is also a P\mathbf{P}-algebra in TopCatTopCat, in the usual way.

Theorem?  The strict maps of P\mathbf{P}-algebras FPR\mathbf{FP} \to \mathbf{R} correspond to the real multiples of Shannon entropy.

I need to say what “correspond” means. Let F:FPRF: \mathbf{FP} \to \mathbf{R} be a strict map of P\mathbf{P}-algebras. This means that FF is an ordinary functor satisfying some axioms. On objects it is trivial, since R\mathbf{R} has only one object. Now FP\mathbf{FP} has a terminal object (1,(1))(1, (1)), so for each object (n,p)(n, \mathbf{p}) of FP\mathbf{FP} the functor FF can be applied to the unique map

(n,p)(1,(1)) (n, \mathbf{p}) \to (1, (1))

to produce a map in R\mathbf{R}, that is, a real number. Hence every map F:FPRF: \mathbf{FP} \to \mathbf{R} of P\mathbf{P}-algebras gives rise to a sequence of functions

(P nF n) n. \Bigl( \mathbf{P}_n \stackrel{F_n}{\to} \mathbb{R} \Bigr)_{n \in \mathbb{N}}.

The theorem states that (i) if F n=G nF_n = G_n for all nn then F=GF = G, and (ii) for all FF, there exists real cc such that F nF_n is cc times Shannon entropy for all nn.

Explanation

The inspiration for this was as follows. An internal monoid in a (strict) monoidal category \mathbb{C} can be viewed as a lax monoidal functor 11 \to \mathbb{C}. This is something like my ‘lax point’ characterization of entropy. But a monoid in \mathbb{C} can also be viewed as a strict monoidal functor 𝔻\mathbb{D} \to \mathbb{C}, where 𝔻\mathbb{D} is the category of (possibly empty) finite ordinals. I hoped to find an analogue of 𝔻\mathbb{D}. The analogue would be a P\mathbf{P}-algebra in CatCat, rather than a monoidal category.

Rather tautologically, the defining characteristic of 𝔻\mathbb{D} is that it’s

the free monoidal category containing an internal monoid.

(If you’re uneasy with such language, try this post. The relevant passage begins “The point I want to make…”) So our analogue of 𝔻\mathbb{D} should be

the free P\mathbf{P}-algebra in CatCat containing an internal P\mathbf{P}-algebra.

(I should really say TopCatTopCat rather than CatCat, but never mind.) But what’s an internal P\mathbf{P}-algebra in a P\mathbf{P}-algebra in CatCat? Well, “internal algebra” was terminology introduced by Michael Batanin to mean the same as what I’ve been calling “lax point”. I explained the thinking here.

This description says that our analogue of 𝔻\mathbb{D} should have the following property: strict maps from it to another P\mathbf{P}-algebra \mathbb{C} correspond naturally to internal P\mathbf{P}-algebras in (== lax points of) \mathbb{C}. In particular, this will mean that the strict maps from it to R\mathbf{R} correspond to the lax points of R\mathbf{R}, which are the multiples of Shannon entropy.

The analogue of 𝔻\mathbb{D} is, in fact, the category called FP\mathbf{FP}. In other words, FP\mathbf{FP} is the free P\mathbf{P}-algebra in CatCat containing an internal P\mathbf{P}-algebra. You can construct such a free thing for any operad: there’s nothing special about P\mathbf{P} here. And when you figure it all out, what you get is the explicit description above.

The name FP\mathbf{FP} is sort of a (very esoteric) pun. It’s a free thing formed from the operad P\mathbf{P}. But it’s also extremely similar to the category of finite probability spaces, FinProbFinProb. It would be identical if it weren’t for the existence of zero probabilities. Going back to the definition of a map in FP\mathbf{FP}, you can see that r\mathbf{r} is determined by s\mathbf{s} and p\mathbf{p} (as long as no probabilities are zero), since r(i) j=s j/p i\mathbf{r}(i)_j = s_j/p_i. From there you can persuade yourself that when p\mathbf{p} is nowhere zero, maps (m,s)(n,p)(m, \mathbf{s}) \to (n, \mathbf{p}) are just measure-preserving maps {1,,m}{1,,n}\{1,\ldots, m\} \to \{1, \ldots, n\}.

(In fact, this category FP\mathbf{FP} is the same FinProbFinProb-like category that I called M Δ/1M_\Delta/1 back here, although now that I’m calling our operad P\mathbf{P} rather than Δ\Delta, it should be called M P/1M_{\mathbf{P}}/1 instead. I’m sure the isomorphism FP=M P/1\mathbf{FP} = M_{\mathbf{P}}/1 holds for completely general reasons that have nothing to do with the operad in question.)

Assuming this is all correct — and I hope someone will check it — it strongly suggests a way of handling zero probabilities: use FP\mathbf{FP}, not FinProbFinProb.

Final thought: did John or Tobias already discover this result? It seems very similar to things they’ve been doing. If so, apologies: I haven’t read everything carefully. Or maybe they already discovered it was false…

Posted by: Tom Leinster on May 11, 2011 12:11 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

Theorem? The strict maps of P\mathbf{P}-algebras FP\mathbf{FP} \to \mathbb{R} correspond to the real multiples of Shannon entropy.

As you note, this is exactly the type of result I’ve been looking for all along! But I’m rather puzzled, because I can’t see how this theorem is true. I wish it were. Maybe I’m making a silly mistake.

Suppose

S:FPS: \mathbf{FP} \to \mathbb{R}

is the map that assigns any object of FP\mathbf{FP} its Shannon entropy. Suppose PP is an nn-ary operation of P\mathbf{P} and suppose Q 1,,Q nQ_1, \dots, Q_n are objects of FP\mathbf{FP}. Then if SS is a strict map of P\mathbf{P}-algebras, it seems we should have

S(P(Q 1,,Q n))=P(S(Q 1),,S(Q n)) S(P(Q_1, \dots, Q_n)) = P(S(Q_1), \dots , S(Q_n))

But as you know, this equation isn’t true. Instead, we have an extra term:

S(P(Q 1,,Q n))=S(P)+P(S(Q 1),,S(Q n)) S(P(Q_1, \dots, Q_n)) = S(P) + P(S(Q_1), \dots , S(Q_n))

In this extra term we’re taking advantage of the fact that operations of P\mathbf{P} have Shannon entropy, just like objects of FP\mathbf{FP} (which are actually the same thing).

Am I mixed up? How do you get the extra term?

Needless to say, I’m looking for as lowbrow an explanation as possible. The highbrow explanation about “FP\mathbf{FP} is the free P\mathbf{P}-algebra in CatCat containing an internal P\mathbf{P}-algebra” is very beautiful and convincing, but it’s sort of like someone has proved by abstract arguments that I can fly: when I actually flap my arms, it doesn’t seem to work.

Posted by: John Baez on May 11, 2011 3:14 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I think Tom wants \mathbb{R} to be considered as a monoid, not as a poset. And we better take +\mathbb{R}_+ instead of all of \mathbb{R}. So any functor

S:FP +S:\mathbf{FP} \rightarrow \mathbb{R}_+

is trivial on objects, since +\mathbb{R}_+ has only one object.

What such a functor does is that it assigns entropy values to the morphisms of FP\mathbf{FP}. I have been wanting to do this all along, and now Tom has found out how to use this approach together with the P\mathbf{P}-algebra structure on FP\mathbf{FP} in order to formulate the ‘magical property’ (or operadic property) as saying that SS is compatible with the P\mathbf{P}-algebra structure. Given that I understood this correctly, let me try to expand on Tom’s explanation a bit. So suppose we have such a functor SS which is also a map of P\mathbf{P}-algebras.

For any object (n,p)FP(n,\mathbf{p})\in\mathbf{FP}, we have a unique morphism to the terminal object *=(1,(1))\ast=(1,(1)). Now we think of the “SS-entropy” of (n,p)(n,\mathbf{p}) as the value of the functor SS when applied to this morphism to the terminal object,

S((n,p)*).S((n,\mathbf{p})\rightarrow \ast).

Now when we have a morphism (f,r):(m,s)(n,p)(f,\mathbf{r}):(m,\mathbf{s})\rightarrow(n,\mathbf{p}), we can consider the commutative triangle formed by (f,r)(f,\mathbf{r}) and the two terminal morphisms (m,s)*(m,\mathbf{s})\rightarrow \ast and (n,p)*(n,\mathbf{p})\rightarrow \ast. An application of functoriality gives

S((m,s)*)=S((n,p)*)+S((f,r))S((m,\mathbf{s})\rightarrow \ast)=S((n,\mathbf{p})\rightarrow \ast)+S((f,\mathbf{r}))

so that S(f)S(f) is uniquely determined by the SS-entropy of (m,s)(m,\mathbf{s}) and (n,p)(n,\mathbf{p}). This is Tom’s statement (i). Back on Azimuth, I had erroneously claimed this equation to be a rephrasing of the operadic property.

The main point follows now: we need to make use of the P\mathbf{P}-algebra structure! The point is that the morphism ff can be decomposed into a “convex combination” of its components f if_i, where the weights are precisely the probabilities p ip_i on the codomain. By the “components” f if_i, I mean the morphisms

(f i,r(i)):(f 1(i),r(i))*(f_i,\mathbf{r}(i)):(f^{-1}(i),\mathbf{r}(i))\rightarrow \ast

which are just the terminal morphisms from the given distributions on the fibers of ff, regarded as objects in their own right. Then,

f= i=1 np if if=\sum_{i=1}^n p_i f_i

where this convex combination makes sense due to the P\mathbf{P}-algebra structure on FP\mathbf{FP}.

Since the functor SS is assumed to preserve the P\mathbf{P}-algebra structure, we can deduce

S(f)= i=1 np iS(f i)= i=1 np iS((f 1(i),r(i)))S(f)=\sum_{i=1}^n p_i S(f_i)=\sum_{i=1}^n p_i S((f^{-1}(i),\mathbf{r}(i)))

which, together with the previous equation involving S(f)S(f), is nothing but the operadic property!
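The operadic property being recovered here is easy to spot-check numerically: if a distribution s\mathbf{s} arises by composing p\mathbf{p} with fibre distributions r(i)\mathbf{r}(i), then H(s)=H(p)+ ip iH(r(i))H(\mathbf{s}) = H(\mathbf{p}) + \sum_i p_i H(\mathbf{r}(i)). (A quick Python sketch with arbitrary made-up distributions, not part of the original argument.)

```python
import math

# Numerical check of the operadic ("magical") property of Shannon entropy:
# composing p with fibre distributions r(i) gives a distribution s with
# H(s) = H(p) + sum_i p_i * H(r(i)).
def H(dist):
    return -sum(x * math.log(x) for x in dist if x > 0)

p = [0.5, 0.3, 0.2]                         # distribution on the codomain
r = [[0.25, 0.75], [1.0], [0.6, 0.1, 0.3]]  # one distribution per fibre
s = [p_i * r_ij for p_i, r_i in zip(p, r) for r_ij in r_i]

assert abs(sum(s) - 1.0) < 1e-12            # s really is a distribution
lhs = H(s)
rhs = H(p) + sum(p_i * H(r_i) for p_i, r_i in zip(p, r))
assert abs(lhs - rhs) < 1e-12
```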

Posted by: Tobias Fritz on May 11, 2011 12:08 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias wrote:

I think Tom wants \mathbb{R} to be considered as a monoid, not as a poset.

Oh! Thanks! That’ll change everything!

Yes, that’s indeed what Tom said. I had been so busy lately trying to think of entropy as a decategorification process sending objects to their entropies and discarding the morphisms, that I hadn’t noticed he was talking here about the entropy of morphisms in FP\mathbf{FP}. And of course it’s especially easy to get confused because both objects and morphisms of FP\mathbf{FP} come with a reasonable concept of entropy! But you’re saying that we should think of the entropy of an object as the entropy of its unique morphism to the terminal object.

I haven’t carefully read your comment, but I think that now when I flap my arms I’ll actually start to fly. Thanks.

I’m still having a bit of trouble giving up my dream of thinking about entropy as fundamentally some sort of decategorification process, but I guess I can adapt. This was my big mistake all along. Anyway, I seem to have shown that it’s the partition function that serves to decategorify FinProbFinProb, not the entropy. I’ll check to see how that story works when I use FP\mathbf{FP}.

It will all fit together into a nice picture in the end… of that I’m sure.

By the way, I like “operadic property” as a more respectable substitute for “magical property”. I introduced the latter term because I wanted something very attention-getting and easy to remember. But “operadic property” is something we can use in our paper.

Posted by: John Baez on May 11, 2011 12:53 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

John wrote:

when I actually flap my arms, it doesn’t seem to work.

I’m enjoying visualizing this. But unfortunately Tobias has now explained enough that you’ve probably stopped flapping. [Edit: apparently John and I were writing comments at about the same time, and I see now that he’s actually taking off.]

Tobias is right — thanks for giving all that explanation. I’m not using the order structure on \mathbb{R} at all, just its additive monoid structure. The key sentence in my comment is this:

The monoid (,+,0)(\mathbb{R}, +, 0), viewed as a one-object category R\mathbf{R}

Note the different typefaces (yeah, probably too sneaky). We’re not talking about functors FP\mathbf{FP} \to \mathbb{R}: that would have no meaning as I’ve set things up, because I haven’t defined a category called \mathbb{R}. We’re talking about functors FPR\mathbf{FP} \to \mathbf{R}.

I think I said enough that anyone who wanted to could prove the theorem (assuming it’s correct). In particular, I said how to start with a map F:FPRF: \mathbf{FP} \to \mathbf{R} of P\mathbf{P}-algebras in CatCat and produce a sequence

F n:P n F_n: \mathbf{P}_n \to \mathbb{R}

of functions, and I said that these F nF_ns are always proportional to Shannon entropy (with the same constant of proportionality for all nn). But maybe it would be helpful to say how it works the other way round. Starting with Shannon entropy, what’s the resulting map FPR\mathbf{FP} \to \mathbf{R} of P\mathbf{P}-algebras?

Well, R\mathbf{R} has only one object, so there’s nothing to say there. We need to specify which real number gets assigned to a map

(f,r):(m,s)(n,p) (f, \mathbf{r}): (m, \mathbf{s}) \to (n, \mathbf{p})

in FP\mathbf{FP}. Recall the notation:

  • mm and nn are natural numbers
  • s\mathbf{s} and p\mathbf{p} are probability distributions on {1,,m}\{1, \ldots, m\} and {1,,n}\{1, \ldots, n\} respectively
  • f:{1,,m}{1,,n}f: \{1, \ldots, m\} \to \{1, \ldots, n\} is a map of finite sets
  • r\mathbf{r} assigns to each i{1,,n}i \in \{1, \ldots, n\} a probability distribution r(i)\mathbf{r}(i) on f 1(i)f^{-1}(i), in such a way that
  • s j=p ir(i) js_j = p_i \mathbf{r}(i)_j whenever jf 1(i)j \in f^{-1}(i).

Our functor FPR\mathbf{FP} \to \mathbf{R} corresponding to Shannon entropy HH is

(f,r) i=1 np iH(r(i)). (f, \mathbf{r}) \mapsto \sum_{i = 1}^n p_i H(\mathbf{r}(i)).

If all the p ip_is are nonzero then r\mathbf{r} is uniquely determined by ff, since r(i) j=s j/p i\mathbf{r}(i)_j = s_j/p_i. So maps into (n,p)(n, \mathbf{p}) in FP\mathbf{FP} are just measure-preserving maps. In that situation, our functor sends a measure-preserving map

f:(m,s)(n,p) f: (m, \mathbf{s}) \to (n, \mathbf{p})

to

i=1 np iH((s jp i) jf 1(i)). \sum_{i = 1}^n p_i H\Bigl( \Bigl( \frac{s_j}{p_i} \Bigr)_{j \in f^{-1}(i)} \Bigr).

Finally, Tobias wrote:

we better take +\mathbb{R}_+ instead of all of \mathbb{R}.

You may well be right, but why is that?

Posted by: Tom Leinster on May 11, 2011 1:12 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Concerning the last question on why to take +\mathbb{R}_+ instead of \mathbb{R}: Well, actually I thought you had a typo and wanted to say +\mathbb{R}_+… does Fadeev’s theorem still apply when the target category is all of \mathbb{R}?

Then, I think that saying that the target category is \mathbb{R} or +\mathbb{R}_+ should also be justified somehow by categorical considerations. Can one define the P\mathbf{P}-algebra +\mathbb{R}_+ somehow canonically in terms of the operad P\mathbf{P} alone? Or in terms of the monad associated to P\mathbf{P}? (BTW, have we settled on a term for the latter? “Giry monad”, “convex combinations monad”, “convexity monad”?)

Posted by: Tobias Fritz on May 11, 2011 2:13 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

does Fadeev’s theorem still apply when the target category [set?] is all of \mathbb{R}?

I don’t know. Rényi is vague on that point. It shouldn’t be difficult to figure it out, but I haven’t.

Anyway, I had no reason for using \mathbb{R} rather than +\mathbb{R}_+. On reflection, +\mathbb{R}_+ seems a more natural choice. I agree with you that it would be nice to extract the monoid +\mathbb{R}_+ canonically from P\mathbf{P} in some way.

Concerning names, again I don’t know. John questioned my use of “Giry monad”, given that Giry herself and later authors used that for a more sophisticated monad on the category of measure spaces or Polish spaces. I think your term “finitary Giry monad” is sensible and accurate.

Personally I have a preference for terminology that’s descriptive rather than being named after the originator, so I’d probably vote for something involving the word “convex(ity)”. I refer to the monad corresponding to the theory of groups as any of “the monad for groups”, “the free group monad” or “the group monad”. Here, I’d be happy with “the monad for convex spaces”, but that’s a bit long. I thought for a bit about “the convex monad”, but the trouble is that it’s the algebras that are convex, not the monad itself.

So I think, in conclusion, that my favourite is “the convexity monad”. We can also talk about “the convexity operad”.

Posted by: Tom Leinster on May 11, 2011 2:41 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

I think your term “finitary Giry monad” is sensible and accurate.

Personally I have a preference for terminology that’s descriptive rather than being named after the originator, so I’d probably vote for something involving the word “convex(ity)”.

I agree with you. Really fundamental concepts deserve descriptive names. It would be terrible if, for example, the 2-category CatCat were called the ‘Eilenberg–MacLane 2-category’, or the category of algebras of a monad were called the ‘Eilenberg–Moore category’. Names like that repel people from the subject being discussed, because you have to be an insider to have any clue about what’s going on.

It would be especially bad if the monad we’re talking about, which is much simpler and more fundamental than the Giry monad, were given a scarier name: the ‘finitary Giry monad’.

I refer to the monad corresponding to the theory of groups as any of “the monad for groups”, “the free group monad” or “the group monad”. Here, I’d be happy with “the monad for convex spaces”, but that’s a bit long.

At present I’ve edited the nLab page so this monad is called “the monad for convex algebras”, and its algebras are called “convex algebras”. Is “convex space” the standard term? I like that better.

The all-knowing nLab uses “convex space”, though it says the concept has “been rediscovered many times under many different names.” So, let’s use that. This page also has a lot of references, and a theorem due to Marshall Stone saying when a convex space can be embedded in n\mathbb{R}^n. That theorem, or at least the ideas surrounding it, may help answer some of Tom’s puzzles on our nLab page.

I thought for a bit about “the convex monad”, but the trouble is that it’s the algebras that are convex, not the monad itself.

Yeah, I had the same thought.

So I think, in conclusion, that my favourite is “the convexity monad”. We can also talk about “the convexity operad”.

Right now in our page we only refer to the monad once by name. So it’s possible that “the monad for convex spaces” is okay. It has the virtue of being maximally clear: if you sort of know what a monad is, and what a convex space is, you’ll automatically sort of know what the “monad for convex spaces” is.

Here’s another question: what should the symbol for this monad be? I think we’ll use the symbol more than the name! We can’t call it \mathbb{P}. Should we break out the annoyingly fancy fonts and call it 𝒫\mathcal{P} or something?

LaTeX has a special symbol for the Weierstrass P-function: we could use that! It’s \wp. I think it usually comes out looking a bit more upper-case than it just did here. I’ve never understood why this is among their most basic symbols.

Posted by: John Baez on May 12, 2011 3:07 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

John wrote:

what should the symbol for this monad be? […] We can’t call it \mathbb{P}.

Can we not? I’m not totally sure what your thinking is. First, I’ve been using P\mathbf{P} rather than \mathbb{P} for the operad, but I guess you’re not distinguishing here between bold and blackboard bold. (And it’s probably good not to do so, given that we also have to be able to write these symbols — on a blackboard, for instance.)

But more importantly, the operad P\mathbf{P} is the underlying operad of the convexity monad: there’s a forgetful functor from monads on sets to operads, and it carries the convexity monad to the convexity operad. It’s quite normal to use the same letter for a thing and one of its underlying things, after perhaps muttering a few words about harmless abuse of language.

If we were talking about, say, the underlying endofunctor of our monad, I’d have no problem with using the same letter for the monad and the endofunctor. Of course, it would get confusing if we were talking in the same breath about algebras for the underlying endofunctor and algebras for the monad itself, and in that case we might have to crack open some new notation. But my default position would be to use the same letter unless the need for disambiguation arose.

So we could just wait and see whether we really need two different pieces of notation. Or is it already clear to you that we will?

Posted by: Tom Leinster on May 12, 2011 4:52 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

John wrote:

what should the symbol for this monad be? […] We can’t call it \mathbb{P}.

Can we not? I’m not totally sure what your thinking is.

My ‘thinking’ is that I meant to type P\mathbf{P} here, but I slipped and typed \mathbb{P}.  

It’s quite normal to use the same letter for a thing and one of its underlying things, after perhaps muttering a few words about harmless abuse of language.

Sure. But as you know, I like to explain stuff and connect different communities of mathematicians. I try to write papers so that they’re readable by the union of the various communities who might find them interesting—not the intersection.

So, I’d like our paper to be comprehensible to at least a few of the smarter people who study entropy and the convex set formalism for probabilistic theories (more about that later). I think if I write a decent explanation, I can tell them what they need to know about monads and operads. But I don’t think they’ll become so comfortable with those concepts that they’ll be ready to use the same letter for both! I was imagining them struggling to understand this stuff, thinking that would be the straw that breaks the camel’s back.

But maybe not. Anyway, I plan to start writing some stuff now—right this second!—and I’ll see how it goes. We can always change things around, but it’s slightly easier to merge two notations into one than split one in two.

Posted by: John Baez on May 12, 2011 5:59 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

it’s slightly easier to merge two notations into one than split one into two

Yes, good point. Let’s go with two symbols then. P\mathbf{P} for the operad, T\mathbf{T} or M\mathbf{M} or C\mathbf{C} (for convex) for the monad?

Posted by: Tom Leinster on May 12, 2011 12:41 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I wrote:

Our functor FPR\mathbf{FP} \to \mathbf{R} corresponding to Shannon entropy HH is

(f,r) i=1 np iH(r(i)). (f, \mathbf{r}) \mapsto \sum_{i = 1}^n p_i H(\mathbf{r}(i)).

Here we had a map (f,r):(m,s)(n,p)(f, \mathbf{r}): (m, \mathbf{s}) \to (n, \mathbf{p}) in FP\mathbf{FP}.

But I failed to notice that the sum above can be simplified. A totally elementary calculation shows that it’s equal to

H(s)H(p) H(\mathbf{s}) - H(\mathbf{p})

(and in particular is independent of r\mathbf{r}). This provides the link to Tobias’s theorem alluded to here.
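This "totally elementary calculation" can be spot-checked numerically: pushing a distribution s\mathbf{s} forward along ff to get p\mathbf{p}, and taking the conditional distributions r(i) j=s j/p i\mathbf{r}(i)_j = s_j/p_i on the fibres, the weighted sum of fibre entropies collapses to H(s)H(p)H(\mathbf{s}) - H(\mathbf{p}). (A quick Python sketch with an arbitrary made-up example, not from the comment.)

```python
import math

# Check that sum_i p_i * H(r(i)) = H(s) - H(p), where p is the pushforward
# of s along f and r(i)_j = s_j / p_i is the conditional on the fibre of i.
def H(dist):
    return -sum(x * math.log(x) for x in dist if x > 0)

s = [0.1, 0.4, 0.2, 0.3]
fibres = [[0, 1], [2, 3]]   # f^{-1}(1) = {1, 2} and f^{-1}(2) = {3, 4}
p = [sum(s[j] for j in fib) for fib in fibres]

conditional_sum = sum(
    p_i * H([s[j] / p_i for j in fib]) for p_i, fib in zip(p, fibres)
)
assert abs(conditional_sum - (H(s) - H(p))) < 1e-12
```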

Posted by: Tom Leinster on May 16, 2011 8:34 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

there should also be an abstract-nonsense theory of derivations of operad algebras

For instance section 7 of Hinich’s Homological algebra of homotopy algebras . See nLab: derivation – on an algebra over an operad.

This definition has the property that for the associative operad it reduces to the ordinary notion of derivations. But it seems to differ from what you consider above.

Posted by: Urs Schreiber on May 11, 2011 1:41 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

A little technical question while I’m writing up stuff. The nLab defines a convex space to be a set XX equipped with operations c p:X×XXc_p: X \times X \to X for each number 0p10 \le p \le 1, obeying the following laws:

1. c 0(a,b)=bc_0(a,b) = b,

2. c 1(a,b)=ac_1(a,b) = a (this one is redundant),

3. c p(a,a)=ac_p(a,a) = a,

4. c p(a,b)=c 1p(b,a)c_p(a,b) = c_{1-p}(b,a),

5. c p(c q(a,b),c)=c pq(a,c r(b,c))c_p(c_q(a,b),c) = c_{p q}(a,c_r(b,c)) for (1pq)r=(1q)p(1 - p q)r = (1 - q)p.
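These axioms are easy to spot-check for the basic example of a convex space, the real line with c p(a,b)=pa+(1p)bc_p(a,b) = p a + (1-p)b. (A quick Python sketch with arbitrary test values, not from the post; the variable names are my own.)

```python
# The standard convex structure on the reals, c_p(a, b) = p*a + (1-p)*b,
# with spot checks of the five axioms for a convex space.
def c(p, a, b):
    return p * a + (1 - p) * b

a, b, x = 2.0, 5.0, -1.0
p, q = 0.3, 0.6
assert c(0, a, b) == b                              # axiom 1
assert c(1, a, b) == a                              # axiom 2 (redundant)
assert c(p, a, a) == a                              # axiom 3
assert abs(c(p, a, b) - c(1 - p, b, a)) < 1e-12     # axiom 4
# Axiom 5 holds when (1 - p*q)*r = (1 - q)*p:
r = (1 - q) * p / (1 - p * q)
assert abs(c(p, c(q, a, b), x) - c(p * q, a, c(r, b, x))) < 1e-12
```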

On the other hand, Tobias leaves out the first two.

I think we want the first two. Can Tobias derive them from the rest, or did he just leave them out by accident?

Posted by: John Baez on May 12, 2011 11:39 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I don’t think that these conditions follow from the other ones. In my original paper, I did include them. In the talk you linked to, I didn’t, because I only considered weights p \in (0,1), so that p = 0 and p = 1 were explicitly excluded. This did not make any difference, since everything I did was completely algebraic, so including these trivial operations is kind of pointless.

In the present context where we have a topology, we also want c_p(x,y) to converge to x as p \rightarrow 0. This means that we should consider the closed interval of weights p \in [0,1] as parametrizing the operations and include the extra condition.

Posted by: Tobias Fritz on May 12, 2011 12:06 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Okay, all that makes sense. Thanks!

Posted by: John Baez on May 12, 2011 12:23 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I’m totally failing in my attempt not to think about this stuff, but given that, I might as well build on my failure by sharing something new with you.

It’s a categorical characterization of the entropies

S_q(\mathbf{p}) = \begin{cases} \frac{1}{q - 1} \Bigl( 1 - \sum p_i^q \Bigr) & \text{if } q \neq 1 \\ -\sum p_i \log p_i & \text{if } q = 1 \end{cases}

discovered by Havrda and Charvát, then rediscovered by Daróczy, then studied in the book of Aczél and Daróczy, then rediscovered again by Patil and Taillie, then rediscovered yet again by Tsallis, after whom physicists name it. As I can’t bear to join them in that, I want to call it surprise entropy, mostly because of its interpretation as expected surprise (explained in John’s post above and my post here), but also because it’s surprising how many people have discovered it. I use the letter S_q because it stands for Surprise, and because that’s what physicists use. (It also stands for Tsallis, if you talk fast.)
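(A small numerical aside of my own, not part of Tom’s argument: the q = 1 clause really is the limit of the q ≠ 1 clause, so the two-case definition fits together continuously.)

```python
from math import log

def surprise_entropy(p, q):
    """S_q(p): the Havrda-Charvat / Tsallis 'surprise entropy'."""
    if q == 1:
        return -sum(x * log(x) for x in p if x > 0)
    return (1 - sum(x ** q for x in p)) / (q - 1)

p = [0.5, 0.25, 0.25]
shannon = surprise_entropy(p, 1)
# As q -> 1 from either side, S_q(p) approaches the Shannon entropy:
assert abs(surprise_entropy(p, 1.0001) - shannon) < 1e-3
assert abs(surprise_entropy(p, 0.9999) - shannon) < 1e-3
```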

Anyway, regard the topological monoid (\mathbb{R}_+, +, 0) as a one-object topological category \mathbf{R}_+, as usual. Fix q \geq 0. Then \mathbf{R}_+ becomes an algebra for the symmetric operad \mathbf{P} via

\mathbf{p}(x_1, \ldots, x_n) = \sum_i p_i^q x_i.

Denote this \mathbf{P}-algebra by \mathbf{R}_+^{(q)}.

(Aside: this is an algebra for the convexity operad \mathbf{P}, but decidedly not for the convexity monad. For instance, if \mathbf{p} = (1/2, 1/2) then \mathbf{p}(x, x) = 2^{1 - q} x \neq x if q \neq 1.)

Theorem  The lax points of the \mathbf{P}-algebra \mathbf{R}_+^{(q)} are the nonnegative scalar multiples of S_q.

Proof: just as in the Shannon case, a lax point of \mathbf{R}_+^{(q)} amounts to a sequence of functions S : \mathbf{P}_n \to \mathbb{R}_+ satisfying symmetry, continuity and

S(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = S(\mathbf{p}) + \mathbf{p}(S(\mathbf{r}_1), \ldots, S(\mathbf{r}_n))

which means

S(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = S(\mathbf{p}) + \sum_i p_i^q S(\mathbf{r}_i).

Now apply the generalized Fadeev theorem proved as Theorem V.2 of this paper of Furuichi that David pointed out.
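In case anyone wants to see the key equation in action: here is a numerical spot-check (mine, not part of the proof) that S_q really does satisfy S(p ∘ (r_1, …, r_n)) = S(p) + Σ_i p_i^q S(r_i), for several values of q.

```python
from math import log

def S_q(p, q):
    """Surprise entropy: (1 - sum p_i^q)/(q - 1), or Shannon entropy at q = 1."""
    if q == 1:
        return -sum(x * log(x) for x in p if x > 0)
    return (1 - sum(x ** q for x in p)) / (q - 1)

p = [0.2, 0.3, 0.5]
rs = [[0.5, 0.5], [1.0], [0.1, 0.6, 0.3]]        # one distribution r_i per p_i

for q in [0.5, 1, 2, 3]:
    # the operadic composite p o (r_1, ..., r_n) has entries p_i * r_ij
    composite = [pi * rij for pi, ri in zip(p, rs) for rij in ri]
    lhs = S_q(composite, q)
    rhs = S_q(p, q) + sum(pi ** q * S_q(ri, q) for pi, ri in zip(p, rs))
    assert abs(lhs - rhs) < 1e-9

print("generalized chain rule holds")
```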

Corollary  The strict maps of \mathbf{P}-algebras \mathbf{FP} \to \mathbf{R}_+^{(q)} correspond to the nonnegative scalar multiples of S_q.

This follows by the characterization in my comment above of \mathbf{FP} as the free \mathbf{P}-algebra in TopCat containing an internal \mathbf{P}-algebra.

John, given that this is such a straightforward generalization of the Shannon case, what do you think about including this?

Posted by: Tom Leinster on May 12, 2011 1:02 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

So you have a one-parameter family of \mathbf{P}-algebras. Is it at all common to consider continuous families of algebras?

Posted by: David Corfield on May 12, 2011 1:12 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

David wrote:

So you have a one-parameter family of P-algebras. Is it at all common to consider continuous families of algebras?

Yes, it’s very common and extremely important. For example, when we quantize a system, we replace the commutative algebra of observables by a one-parameter family of noncommutative algebras. They’re parametrized by a number called Planck’s constant, or \hbar, which says ‘how quantum’ things are. When this constant is 0 we’re back to the commutative algebra we started with.

You’ll see in the post we’re commenting on that I embedded a certain P\mathbf{P}-algebra homomorphism from FinProbFinProb to [0,)[0,\infty) into a 1-parameter family of P\mathbf{P}-algebra homomorphisms. They’re parametrized by a number called inverse temperature, or β\beta, which says ‘how cool’ things are. When this constant is 1 we’re back to the original problem.

These two examples are related: I recently explained how \hbar is a lot like 1/β1/\beta, and quantum field theory is like statistical mechanics.

However, now you’re asking about a 1-parameter family of P\mathbf{P}-algebras, not a 1-parameter family of homomorphisms out of a fixed one. They’re closely related ideas but I don’t have it all straight in my mind. If I did, I’d be closer to understanding Rényi entropy. I’ve already explained how Rényi entropy is a qq-deformed derivative, which reduces to the ordinary Shannon entropy as q1q \to 1. So that’s a big clue. But a bunch of stuff is still confusing.

(Here qq is somehow related to β\beta. It takes a certain amount of suffering to line up all the analogies precisely, what with q,,βq, \hbar, \beta and TT all serving as parameters in different but related subjects.)

That was a physics-flavored answer. Here’s a more mathematical answer.

For any sort of structure, ‘deformation theory’ studies parametrized families of that sort of structure, and people usually focus on 1-parameter families. These 1-parameter families may be parametrized either by the real line or a ‘formal’ version of the real line, where instead of everything becoming a continuous or smooth or polynomial function of a real variable, they become a formal power series in that variable.

Formal deformations of associative algebras have been intensively studied: you can classify them with the help of Hochschild cohomology. Indeed, any sort of deformation theory goes hand-in-hand with a cohomology theory. This is one aspect of the ‘layer-cake’ or ‘Postnikov tower’ philosophy of cohomology. So, there’s probably a cohomology theory that classifies P\mathbf{P}-algebra deformations—but I doubt thinking about that is the best way to make progress right now.

Following the tradition of throwing a book at someone when they ask a question, I’ll toss this at you:

However, it’s only a chapter, and only 3 pages long, so it won’t hurt too much.

Posted by: John Baez on May 13, 2011 5:54 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

John, given that this is such a straightforward generalization of the Shannon case, what do you think about including this?

It’s a great-looking result! Congratulations!

I’m afraid I haven’t taken the time to understand it yet, since I’ve spent the day carefully going through Tobias’ characterization of Shannon entropy. I wonder what you think about his result? I know you’re busy now, so you may want to wait before reading that material. That’s okay.

However, regardless of that, we should include your new result in our paper. We can either include Tobias’ result about FinMeas and the main results we know about \mathbf{FP}, or figure out a way to phrase your new result in terms of FinMeas, or both. A couple of different formulations ain’t a bad thing.

Posted by: John Baez on May 13, 2011 12:15 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Okay, Tobias - let me see if I understand your comment that:

Putting the previous three equations together, we have the operadic property, this time even for measure spaces without normalization!

I don’t see why… so I’ll try to figure it out. If I’m going to write this stuff up, I might as well start now. I’ll start at the beginning.

You are starting with a category FinMeas whose objects are finite sets equipped with measures and whose morphisms are measure-preserving functions. (You called the objects ‘finite measure spaces’, but unfortunately that means something completely different!)

This category is symmetric monoidal under disjoint union, which we denote by \oplus. We should remember that this is not a coproduct in the categorical sense.

There is also an action of the multiplicative monoid [0,\infty) as endofunctors of FinMeas. Numbers in [0,\infty) act on the objects of FinMeas by rescaling the measures on our finite sets. They act trivially on the morphisms.

Unfortunately, these operations do not satisfy (weak) distributivity.

Actually they do in the following sense: while we don’t have

(\lambda + \lambda')p \cong \lambda p \oplus \lambda' p

we do have a canonical isomorphism

\lambda (p \oplus q) \cong \lambda p \oplus \lambda q

and so presumably we could show that [0,\infty) acts not just as endofunctors on FinMeas, but as symmetric monoidal endofunctors. Indeed, we seem to have a monoidal functor from the multiplicative monoid [0,\infty) to the monoidal category End_{SymMonCat}(FinMeas). This is an annoyingly precise way to summarize a bunch of plausible rules I haven’t checked yet. But let’s move on.

Next, you consider \mathbb{R}_+. By the way, I’m slightly scared to use the symbol \mathbb{R}_+, both because I’ve never seen it before, and because I’ve often seen \mathbb{R}^+ used to mean the set of strictly positive numbers, i.e. (0,\infty).

However, I will guess what you mean and define \mathbb{R}_+ to be the one-object category with the additive monoid [0,\infty) as morphisms. It seems we need a name other than [0,\infty) for this category.

Next, you’re telling me that \mathbb{R}_+ has the same sort of structure as FinMeas. First, it’s a symmetric monoidal category (with addition as tensor product of morphisms). Second, the multiplicative monoid [0,\infty) acts on it as endofunctors in an obvious way (by multiplication). Third, [0,\infty) acts not just as endofunctors, but as symmetric monoidal endofunctors. And fourth, this action gives a monoidal functor from the multiplicative monoid [0,\infty) to the monoidal category End_{SymMonCat}(\mathbb{R}_+).

And your claim is that all the functors

F : FinMeas \to \mathbb{R}_+

respecting all this structure are nonnegative multiples of Shannon entropy.

Wow, this approach is incredibly intuitive and non-technical!

Just kidding. If this is true, I really think it’s great! And of course we can say this all less technically. But I wanted to say what I think is really going on in this setup. There’s a nice but somewhat novel interaction between the ‘addition’ \oplus and the multiplication of measures by numbers. Indeed, given a monoid M, I now think a symmetric monoidal category (X, \oplus, 0) equipped with a monoidal functor

M \to End_{SymMonCat}(X)

may deserve to be called an M-module category. If so, we’re talking about the case where M is the multiplicative monoid [0,\infty). FinMeas and \mathbb{R}_+ are module categories for this, and you’re claiming all the module homomorphisms are multiples of Shannon entropy.

But let me see if it’s true…

Posted by: John Baez on May 13, 2011 7:17 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Yes, that’s exactly what I was getting at! I’m sorry for not having been able to explain it in an understandable way; in the initial excitement, I committed even more slips than usual, like this one:

(You called the objects ‘finite measure spaces’, but unfortunately that means something completely different!)

What I meant by ‘not satisfying weak distributivity’ was not the isomorphism

\lambda(p \oplus q) \cong \lambda p \oplus \lambda q

but the non-isomorphism of (\lambda + \lambda')p with \lambda p \oplus \lambda' p. Though in the end, this may be a feature rather than a bug: at least there is an obvious natural measure-preserving function

\lambda p \oplus \lambda' p \rightarrow (\lambda + \lambda')p

So could this be some kind of ‘lax distributivity’? (Something similar also occurs in the \mathbb{P}-algebra formulation, where \frac{1}{2}p + \frac{1}{2}p is not isomorphic to p.)

By the way, I’m slightly scared to use the symbol \mathbb{R}_+, both because I’ve never seen it before, and because I’ve often seen \mathbb{R}^+ used to mean the set of strictly positive numbers, i.e. (0,\infty).

Actually, I usually write \mathbb{R}_{\geq 0} for precisely this reason. If I remember correctly, I saw Tom writing \mathbb{R}_+ first and adopted his notation. Anyway, what’s important is that you interpreted it correctly.

And your claim is that all the functors

F : FinMeas \to \mathbb{R}_+

respecting all this structure are nonnegative multiples of Shannon entropy.

Right. Where it should be kept in mind that such a functor assigns entropy values to morphisms. And we think of the entropy of such a morphism as the entropy of the domain measure space minus the entropy of the codomain measure space. It’s conditional entropy!

Given any such functor, we can then define the entropy of an object p = (p_1, \ldots, p_n) to be the entropy of the unique morphism that maps from that object to a one-element measure space. This one-element measure space I have denoted by (\hat{p}) = (\sum_i p_i). And a few times I forgot the brackets around \hat{p} in the two comments above…

Then, it makes sense to talk about the entropy of objects, and that is what we’re really interested in. And by functoriality (triangle involving any morphism together with both its domain and codomain mapping to the same one-element measure space), the entropy of a morphism is then equal to the entropy of its domain minus the entropy of its codomain.

I’m feeling exhausted today, so I will better take a break…

Posted by: Tobias Fritz on May 13, 2011 9:18 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

So, in this post Tobias is looking at homomorphisms of [0,\infty)-module categories

F : FinMeas \to \mathbb{R}_+

and trying to classify them.

But if you don’t understand what a ‘homomorphism of [0,\infty)-module categories’ is, fear not! Apparently all we need to know is this. Given morphisms f, g in FinMeas and any number \lambda \in [0,\infty), we want

(1) F(f \oplus g) = F(f) + F(g)
(2) F(\lambda f) = \lambda F(f)

and also

(3) F(f \circ g) = F(f) + F(g)

in the case when f and g are composable. Furthermore, we need F to be continuous in the obvious sense.

Hey, wait a minute. An object of FinMeas is a finite set with a measure on it, and a number \lambda \in [0,\infty) acts on it by rescaling the measure. But a morphism is just a measure-preserving function, and there’s no good way to multiply such a function by a number. That’s why I had defined \lambda f to be the same function as f, but going between rescaled measure spaces. So now I’m getting a bit nervous about equation (2).

Tobias is claiming that all F with the above properties come from nonnegative scalar multiples of Shannon entropy, in a certain sense.

What’s this sense? Well, for starters, suppose S is Shannon entropy, which Tobias defines for a measure p on a finite set as follows:

S(p) = \Bigl(\sum_i p_i\Bigr) \log \Bigl(\sum_i p_i\Bigr) - \sum_i p_i \log p_i

(Note that this does reduce to the usual Shannon entropy when \sum_i p_i = 1, and note that I owe Mark yet another apology for saying I didn’t want to define entropy for measures that weren’t probability measures! Now we’re doing just that.)
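Here’s a little code sketch (mine) of this extended entropy, checking both that it reduces to ordinary Shannon entropy on probability measures and that it is homogeneous, S(λp) = λS(p):

```python
from math import log

def xlogx(x):
    return x * log(x) if x > 0 else 0.0

def S(p):
    """Tobias's extended Shannon entropy of a measure p on a finite set."""
    return xlogx(sum(p)) - sum(xlogx(x) for x in p)

prob = [0.5, 0.25, 0.25]                 # a probability measure
assert abs(S(prob) - (-sum(xlogx(x) for x in prob))) < 1e-12

lam = 2.7                                # homogeneity: S(lam * p) = lam * S(p)
assert abs(S([lam * x for x in prob]) - lam * S(prob)) < 1e-12
```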

Then, for each morphism f : p \to r, Tobias can take F(f) to be

F(f) = S(p) - S(r)

and he claims this obeys (1), (2), and (3).

(He called F(f) ‘relative entropy’, but we know that’s not the right term.)

Let’s follow Tobias in checking these equations. His big idea is to expand out F, like this:

F(f) = S(p) - S(r) = \left(\sum_i p_i \right)\log \left(\sum_i p_i\right) - \left(\sum_j r_j \right)\log\left(\sum_j r_j\right) - \sum_i p_i \log p_i + \sum_j r_j \log r_j

where i ranges over points of the measure space where p lives, and j ranges over points of the space where r lives.

Then, because f is measure-preserving, we have

\sum_i p_i = \sum_j r_j

so the first two terms cancel and we get

F(f) = - \sum_i p_i \log p_i + \sum_j r_j \log r_j

Next, again because f is measure-preserving, we have

r_j = \sum_{i : f(i) = j} p_i

so

\begin{aligned} \sum_j r_j \log r_j &= \sum_j \sum_{i: f(i) = j} p_i \log r_j \\ &= \sum_j \sum_{i: f(i) = j} p_i \log r_{f(i)} \\ &= \sum_i p_i \log r_{f(i)} \end{aligned}

where in the last step we notice that summing over all points i that map to j and then summing over all j is the same as summing over all i.

So,

\begin{aligned} F(f) &= - \sum_i p_i \log p_i + \sum_j r_j \log r_j \\ &= \sum_i \left( - p_i \log p_i + p_i \log r_{f(i)} \right) \\ &= \sum_i p_i \log \frac{r_{f(i)}}{p_i} \end{aligned}

Now, one advantage of this formulation as a sum is that it makes equation (1):

F(f \oplus g) = F(f) + F(g)

completely obvious from the definitions.

It also makes it easy to check the equation I was worried about, namely (2):

F(\lambda f) = \lambda F(f)

We just note that

F(\lambda f) = \sum_i \lambda p_i \log \frac{\lambda r_{f(i)}}{\lambda p_i} = \lambda F(f)

On the other hand, to check equation (3):

F(f \circ g) = F(f) + F(g)

it’s easier to use the original definition

F(f) = S(p) - S(r)

since then if g : q \to p and f : p \to r we have

F(f \circ g) = S(q) - S(r) = S(q) - S(p) + S(p) - S(r) = F(f) + F(g)

So, Shannon entropy does indeed give a functor obeying (1)-(3). And since we didn’t actually choose a specific base for our logarithm, any positive real multiple of Shannon entropy works just as well. And so, of course, does 0 times Shannon entropy.
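Since everything here is finite, properties (1)–(3) can also be checked numerically. Here’s a small script (my own sketch, with arbitrary example measures) that builds a measure-preserving map, its pushforward, and a composable partner:

```python
from math import log

def xlogx(x):
    return x * log(x) if x > 0 else 0.0

def S(p):
    # extended Shannon entropy: S(p) = (sum p_i) log(sum p_i) - sum p_i log p_i
    return xlogx(sum(p)) - sum(xlogx(x) for x in p)

def F(p, r):
    # F(f) = S(domain) - S(codomain) for a measure-preserving f : p -> r
    return S(p) - S(r)

def pushforward(p, f, size):
    # r_j = sum of p_i over the fiber f^{-1}(j)
    r = [0.0] * size
    for i, pi in enumerate(p):
        r[f[i]] += pi
    return r

p = [0.1, 0.2, 0.3, 0.4]
f = [0, 0, 1, 1]                  # f(i), a map {0,1,2,3} -> {0,1}
r = pushforward(p, f, 2)

# (1) F(f + g) = F(f) + F(g): disjoint union of two maps
p2, f2 = [0.3, 0.7], [0, 0]
r2 = pushforward(p2, f2, 1)
assert abs(F(p + p2, r + r2) - (F(p, r) + F(p2, r2))) < 1e-12

# (2) F(lambda f) = lambda F(f): rescale both domain and codomain measures
lam = 3.0
assert abs(F([lam * x for x in p], [lam * x for x in r]) - lam * F(p, r)) < 1e-12

# (3) F(f o g) = F(f) + F(g), with g : q -> p another measure-preserving map
q_meas = [0.05, 0.05, 0.1, 0.1, 0.3, 0.4]
g = [0, 0, 1, 1, 2, 3]
fg = [f[g[i]] for i in range(len(q_meas))]                 # composite map
assert all(abs(a - b) < 1e-12 for a, b in zip(pushforward(q_meas, fg, 2), r))
assert abs(F(q_meas, r) - (F(q_meas, p) + F(p, r))) < 1e-12
```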

So now we just need to follow Tobias’ argument that these properties characterize the nonnegative multiples of Shannon entropy.

Posted by: John Baez on May 13, 2011 8:57 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Arggh, I know that ‘relative entropy’ is a nonsense term in this context, so I don’t know why I had used it again. Sorry. It should have been conditional entropy.

Then, are you sure you got the signs right? Shannon entropy comes with a negative sign:

S(p) = -\sum_i p_i \log p_i

This sign is required for making S non-negative. With this, we should agree about S(domain) - S(codomain) for the entropy loss or conditional entropy or entropy of a map or whatever we want to call it.

Posted by: Tobias Fritz on May 13, 2011 9:29 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias:

Then, are you sure you got the signs right?

I’m never, ever sure I got the signs right. Negative numbers were invented by the devil and passed on to Venetian bankers so they could get us in debt: it’s no accident they were written in red. I know that Shannon entropy is

S(p) = - \sum_i p_i \log p_i \ge 0

but I frequently forget to put that minus sign in. However, I did put it in above. And then since physical processes tend to increase entropy, I wanted to define the entropy gain of a morphism f : p \to r to be F(f) = S(r) - S(p). But now I see that in the situation we’re discussing, that comes out negative! There’s something philosophically funny about that, but it’s clearly true: morphisms can map many points to one, thus reducing entropy.

So yeah, I was wrong and you were right. We need F(f) = S(p) - S(r).

I’m going to take the liberty of fixing this mistake in my comment above, to reduce the number of people who get confused.

Thanks. Now I need to check the most interesting part of your argument! I’m proceeding slowly, not because you explained this poorly, but because I feel like writing up everything carefully, so I can stick it in the paper.

Posted by: John Baez on May 13, 2011 10:00 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Now I’ll check Tobias’ argument that any continuous functor

F : FinMeas \to \mathbb{R}_+

obeying

(1) F(f \oplus g) = F(f) + F(g)
(2) F(\lambda f) = \lambda F(f)

is given by a multiple of Shannon entropy in this way: if f : p \to r is a morphism in FinMeas, then

(3) F(f) = c(S(p) - S(r))

for some c \ge 0.

In addition to properties (1) and (2), we’ll also explicitly use functoriality, which says:

(4) F(f \circ g) = F(f) + F(g)

For a while I was having trouble understanding Tobias’ argument, but then all of a sudden it seemed obvious. It’s very beautiful! So, I’ll just repeat it in slightly different words. I’ll probably put in too much detail to be very clear, but I want to be 100% sure everything is right.

We’ll think of an object of FinMeas as a finite set X together with an X-tuple of numbers p_i \in [0,\infty). We’ll write this object as p for short. The ‘total mass’ of this object is the number

\hat p = \sum_{i \in X} p_i

And we’ll use (c) to mean the following object of FinMeas: the one-point set \{1\} together with the number c.

A key trick is to note that for any object p there’s a unique morphism

!_p : p \to (\hat{p})

We can think of this as the map that crushes p down to a point and loses all the information that p had. So, we define the ‘entropy’ of the object p by

H(p) = F(!_p)

We will see that this is a multiple of the ordinary Shannon entropy.

Moreover, if we have any morphism

f : p \to r

then

(\hat{p}) = (\hat{r})

and

!_p = !_r \circ f

So, by (4), we have

F(!_p) = F(!_r) + F(f)

or in other words

(5) F(f) = H(p) - H(r)

So, F of any morphism is determined by the entropy of its source and its target!

Now we can decompose f : p \to r as a fiberwise direct sum as follows: for each point j in the codomain, the fiber f^{-1}(j) becomes a measure space by virtue of being a subset of the domain. We call this measure space p(j). So,

p \cong \bigoplus_j p(j)

This lets us write f as the direct sum of a bunch of maps crushing each fiber down to a point, together with a map g from the empty set to all the points not in the range of f:

f = g \oplus \bigoplus_j !_{p(j)}

I think it’s easy to see from (1), (2) and (4) that

F(g) = 0

so it follows from (1) that

F(f) = \sum_j F(!_{p(j)}) = \sum_j H(p(j))

so by (5),

H(p) - H(r) = \sum_j H(p(j))

or

H(p) = H(r) + \sum_j H(p(j))

Now, note quite generally that H is homogeneous:

H(\lambda q) = \lambda H(q)

thanks to property (2). We can divide the measure on the fiber p(j) by its total mass, which is r_j, to get a probability measure space \frac{1}{r_j} p(j). By homogeneity, the entropy of this space is

H(p(j)) = r_j H(\frac{1}{r_j} p(j))

and so

H(p) = H(r) + \sum_j r_j H(\frac{1}{r_j} p(j))

Lo and behold! This is the operadic property that, according to Fadeev’s theorem, characterizes Shannon entropy up to a scalar multiple—well, at least in conjunction with three other properties: continuity, bijection-invariance and H(1) = 0.
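To make sure the signs haven’t fooled anyone, here’s a numerical check (mine, using ordinary Shannon entropy and a concrete map) of this operadic property, with the fibers and pushforward computed explicitly:

```python
from math import log

def H(p):
    """Ordinary Shannon entropy of a probability measure."""
    return -sum(x * log(x) for x in p if x > 0)

p = [0.1, 0.2, 0.3, 0.4]           # a probability measure on {0,1,2,3}
f = [0, 1, 0, 1]                   # a map {0,1,2,3} -> {0,1}

# the fiber p(j) over each point j, and the pushforward r_j = total mass of p(j)
fibers = {j: [pi for i, pi in enumerate(p) if f[i] == j] for j in set(f)}
r = {j: sum(fiber) for j, fiber in fibers.items()}

rhs = H(list(r.values())) + sum(
    r[j] * H([x / r[j] for x in fibers[j]]) for j in fibers
)
assert abs(H(p) - rhs) < 1e-12
print("H(p) = H(r) + sum_j r_j H(p(j)/r_j) holds")
```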

A couple of small loose ends:

1) I divided by r_j, which is illegal when r_j = 0. The fibers of f : p \to r over points of measure zero will have measure zero; we cannot normalize them to get probability measure spaces. We need to do something else here. You’ll note these ‘fibers over points of measure zero’ are a constant headache. However, these fibers have H = 0, so if I weren’t so tired I’d probably see that they pose no problem for us now.

2) Fadeev’s theorem also requires that H(1) = 0, meaning that H of the one-point probability measure space (1) is 0. This is easy to check because the map !_{(1)} is the identity, so by functoriality F(!_{(1)}) = 0—but that’s H(1), by definition.

Posted by: John Baez on May 13, 2011 11:42 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I forgot to say: I do think Tobias’ version of Fadeev’s theorem is the prettiest I’ve seen so far. Since I’ve probably managed to make it sound a bit scary, let me summarize it here:

Theorem. Suppose F : FinMeas \to \mathbb{R}_+ is a continuous functor obeying

F(f \oplus g) = F(f) + F(g)

and

F(\lambda f) = \lambda F(f).

Then for any morphism f : p \to q in FinMeas,

F(f) = c(S(p) - S(q))

for some c \ge 0, where S(p) is the Shannon entropy of the measure space p.

There’s a bit that needs to be defined for average people to understand this, most notably the entropy of a measure that’s not a probability measure… but we can just say that we’re extending by homogeneity: if p is a probability measure and \lambda \ge 0, then we define S(\lambda p) = \lambda S(p). I find that slightly less scary than

S(p) = \sum_i p_i \log\Bigl(\sum_i p_i\Bigr) - \sum_i p_i \log p_i

though this formula is not without its charm.

Also: this theorem is about nonnegative measures on finite sets, but I believe we could drop that restriction and replace \mathbb{R}_+ by \mathbb{R} (the one-object category having reals as morphisms and addition as composition) and get a similar theorem. So, I think that’s really just a matter of taste.

Things may change when we try to get inequalities into the game, e.g. by treating \mathbb{R}_+ as a one-object 2-category.

Posted by: John Baez on May 13, 2011 12:07 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I’m getting left behind on this, and it’s going to get worse this coming week. But if I go quiet, don’t interpret that as lack of enthusiasm; it’s just the pressure of other things. I think this stuff is great. In particular, I think this result of Tobias is great.

While waiting for a delayed train, I had a whole lot of thoughts about Tobias’s result, but I’ll have to save them for another time. Just two little things for now:

  1. I wondered for a while whether there was a hypothesis missing from your statement of Tobias’s theorem, John. For each pair of objects X, Y \in FinMeas, there is a canonical isomorphism X \oplus Y \to Y \oplus X. I thought you might need to demand that F maps each such isomorphism to 0 \in \mathbb{R}_+. But now I realize that’s unnecessary. Like any functor, F preserves isomorphisms, so it maps each of these symmetry isomorphisms to an invertible element of the additive monoid \mathbb{R}_+. And it so happens that the only such element is 0. If we were using \mathbb{R} instead of \mathbb{R}_+, we wouldn’t be able to get away with this.
  2. I haven’t checked the details, but I wonder whether there’s a similar theorem as follows:

    Theorem?  Suppose that F : FinProb \to \mathbb{R}_+ is a continuous functor obeying

    F(\lambda f \oplus (1 - \lambda) g) = \lambda F(f) + (1 - \lambda) F(g)

    whenever \lambda \in [0, 1]. Then for any morphism f : p \to q in FinProb,

    F(f) = c(S(p) - S(q))

    for some c \geq 0, where S(p) is the Shannon entropy of the probability space p.

    This is identical to one of my previous results, except that \mathbf{FP} has been replaced by FinProb. Again there’s more to say here, but I won’t say it at the moment.

Posted by: Tom Leinster on May 16, 2011 2:53 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

I’m getting left behind on this, and it’s going to get worse this coming week. But if I go quiet, don’t interpret that as lack of enthusiasm; it’s just the pressure of other things.

Okay, thanks for letting me know. I’m hoping to write up two papers on this stuff and move on to other things, but I expect that you and Tobias will want to come back and polish them up a bit. If you have major new ideas (and I’m afraid at least one of you will), I’m hoping they can go in other papers. I’m really hoping to avoid the ‘morass of mathematics’ that tends to turn every collaboration of mine into a long, drawn-out affair. Not that I’m uninterested in this stuff! There’s just too much to do. Ars longa, vita brevis, and the planet is burning…

I wonder whether there’s a hypothesis missing from your statement of Tobias’s theorem, John. For each pair of objects X, Y \in FinMeas, there is a canonical isomorphism X \oplus Y \to Y \oplus X. Do you not need to demand that F maps each such isomorphism to 0 \in \mathbb{R}_+?

Since I’m assuming that

F(f \circ g) = F(f) + F(g)

(or, in the later more sophisticated treatment, that F : FinMeas \to \mathbb{R}_+ is a functor), F must send every isomorphism in FinMeas to an isomorphism in \mathbb{R}_+. But the only isomorphism in \mathbb{R}_+ is zero.

I suppose this counts as a reason to use \mathbb{R}_+ instead of \mathbb{R}.

I haven’t checked the details, but I wonder whether there’s a similar theorem as follows…

I think that’s quite possible. It’s worth noting that as a category, FinMeas is a coproduct of categories isomorphic to FinProb, one for each \lambda \gt 0, together with a copy of FinSet, which corresponds to the measures with total mass 0. We can write

FinMeas = \bigsqcup_{\lambda \ge 0} FinMeas_\lambda

where FinMeas_\lambda is the category of finite sets equipped with measures of total mass \lambda.

The structures we’re using on FinMeas are \oplus and scalar multiplication. But \oplus restricts to a bunch of functors

\oplus : FinMeas_\lambda \times FinMeas_\mu \to FinMeas_{\lambda + \mu}

while multiplying a measure by c \ge 0 gives a functor

c \cdot : FinMeas_\lambda \to FinMeas_{c \lambda}

which is an isomorphism if c \ne 0.

So, there’s a sense in which switching between FinMeas and FinProb is purely a matter of bookkeeping.

(Sure, the measures of total mass zero are also there in FinMeas, but I don’t think that’s a huge big deal. In particular, since we’re assuming F(\lambda f) = \lambda F(f), every morphism f between finite sets with measures of total mass zero must have F(f) = 0. Note also that such F are determined by their values on probability measures.)

Posted by: John Baez on May 16, 2011 4:26 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

If the first part of John’s comment looks like it doesn’t make sense, that’s my fault. Having realized the answer to my own question, I edited the comment of mine that John was replying to. But presumably he was composing his comment at about the same time.

Posted by: Tom Leinster on May 16, 2011 4:31 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

As well as the ‘horizontal’ account of FinMeas in terms of the ‘sheets’ of fixed total mass \lambda, there is of course also a ‘vertical’ account in terms of the collections of measures over the finite sets.

Lebanon here describes your measure-preserving maps as congruent embeddings by a Markov morphism. Then there’s a result by Čencov, extended by Lebanon, about the possible metrics supported on the fibres. Essentially the Fisher-Rao metric is the only one.

It would be worth tying your and their results together.

Posted by: David Corfield on May 17, 2011 9:00 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tom wrote:

This is identical to one of my previous results […]

Yes! In fact, I was just taking your formulation and replacing probability measures by arbitrary measures. That’s all! Whether we work with FinMeas or the analog of \mathbf{FP} doesn’t matter much, and I decided to go with FinMeas since it’s easier to define. So I think that John is vastly exaggerating when he calls our result “Tobias’ characterization of Shannon entropy”.

Posted by: Tobias Fritz on May 16, 2011 11:18 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Au contraire! “Tobias’ characterization of Shannon entropy” is perfectly accurate. I certainly wasn’t intending to claim any kind of priority. I’m just trying to see the relationships between the various results we’ve got.

Posted by: Tom Leinster on May 16, 2011 7:01 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

John’s quoting the formula S(p) = \sum_i p_i \log (\sum_i p_i) - \sum_i p_i \log p_i jogged my memory and reminded me that I do know a context in which people consider (essentially) entropies of non-probability measures, although it’s usually stated in terms of functions instead. If \mu is a fixed reference measure on X and f is a nonnegative measurable function, then the entropy of f with respect to \mu is Ent_\mu(f) = \int (f \log f) \, d\mu - (\int f \, d\mu) \log (\int f \, d\mu). Thinking of f as the Radon-Nikodym derivative of some other measure \nu with respect to \mu, this is the same as the relative entropy of \nu with respect to \mu, appropriately modified to be homogeneous in \nu. This kind of expression appears as the left-hand side (i.e., the smaller side) in logarithmic Sobolev inequalities.
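On a finite space, Ent_\mu(f) works out to be the total mass \int f \, d\mu times the relative entropy of the normalized measure f \, d\mu / \int f \, d\mu with respect to \mu. A quick numerical check of this (just a sketch; the function names are my own):

```python
import math

def ent(f, mu):
    """Ent_mu(f) = \int f log f dmu - (\int f dmu) log(\int f dmu) on a finite space."""
    m = sum(fi * mi for fi, mi in zip(f, mu))  # total mass of nu = f dmu
    integral = sum(fi * math.log(fi) * mi for fi, mi in zip(f, mu) if fi > 0)
    return integral - m * math.log(m)

def rel_ent(nu, mu):
    """Relative entropy D(nu || mu) for finite probability vectors."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

mu = [0.2, 0.3, 0.5]        # reference probability measure
f = [1.0, 4.0, 0.5]         # nonnegative density
m = sum(fi * mi for fi, mi in zip(f, mu))
nu_bar = [fi * mi / m for fi, mi in zip(f, mu)]  # normalized nu

# Ent_mu(f) = m * D(nu_bar || mu): the homogeneous modification of relative entropy
assert abs(ent(f, mu) - m * rel_ent(nu_bar, mu)) < 1e-12
```

In particular this makes the nonnegativity of Ent_\mu(f) manifest, since relative entropy is nonnegative.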

Posted by: Mark Meckes on May 16, 2011 3:31 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Interesting! So what is a logarithmic Sobolev inequality? Some googling turns up some complicated-looking inequalities involving Gaussian measures on \mathbb{R}^n. Are there more basic versions for finite probability spaces?

Posted by: Tobias Fritz on May 16, 2011 7:54 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I was rather surprised to find there’s no Wikipedia page to link. A probability measure \mu is said to satisfy a logarithmic Sobolev inequality (LSI) with constant C if Ent_\mu(f^2) \le 2 C \int \vert \nabla f \vert^2 \, d\mu for every (reasonably nice) function f. The main classical example is standard Gaussian measure on \mathbb{R}^n, for which C = 1. Also, normalized volume measure on a compact Riemannian manifold satisfies an LSI under some curvature conditions.

For discrete settings of course you need some way to make sense of \vert \nabla f \vert^2. The major example in which LSIs are well-studied is finite state space Markov chains, for which see this paper by Diaconis and Saloff-Coste.

Posted by: Mark Meckes on May 17, 2011 1:24 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks! What I find most curious about this is that the left-hand side of this inequality actually involves the square of the function! In the paper you linked to, even complex-valued f are considered, and then one uses |f|^2 as a probability density function defining a measure. This is just as if f were a quantum-mechanical wave function! In quantum mechanics, the absolute value squared of a particle’s wave function is the probability density of the particle; usually, this is a density with respect to the Lebesgue measure instead of the Gaussian \mu. So let’s consider instead the case of a compact Riemannian manifold, where \mu is the canonical measure, for which according to your explanation the inequality also holds.

If we think of f as the wave function of a particle living on the manifold, then we can interpret the left-hand side of the inequality as the amount of randomness in the particle’s position.

What about the right-hand side? Is there a similar interpretation? If we apply Green’s first identity then, since the boundary terms vanish, we obtain

\int (\nabla f)^2 \, d\mu = -\int f \Delta f \, d\mu

(assuming for simplicity that f is real-valued). Up to a multiplicative constant, the right-hand side is nothing but the expectation value of the energy of the particle in the wave function f!

Conclusion: the logarithmic Sobolev inequality bounds the position uncertainty of a quantum particle in terms of the particle’s energy!

But wait, can this really be right? What if f is a constant function? Then the right-hand side vanishes, while the left-hand side is strictly positive?

So in the case of a compact Riemannian manifold with \mu being the volume measure, does the inequality really look like this:

Ent_\mu(f^2) \leq -\int f \Delta f \, d\mu \quad ?

Posted by: Tobias Fritz on May 17, 2011 11:14 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

What I find most curious about this is that the left-hand side of this inequality actually involves the square of the function!

Since we’re interested in information theory here, it’s interesting to note that in the continuous context replacing f with \sqrt{f} gives the equivalent formulation Ent_\mu(f) \le \frac{C}{2} \int \frac{|\nabla f|^2}{f} \, d\mu. The right-hand side of this inequality is called the Fisher information of f with respect to \mu.

This is just as if f were a quantum-mechanical wave function!

That’s a very interesting observation; I’d never thought of it that way. However, Len Gross (whose seminal paper on the subject you linked to above) was motivated by considerations — which I know nothing about — from quantum field theory, so for all I know that’s exactly how he thinks about it.

So let’s consider instead the case of a compact Riemannian manifold, where \mu is the canonical measure, for which according to your explanation the inequality also holds.

Don’t quote me on that without independently verifying what additional hypotheses are needed (I don’t have access to my books at the moment). I think a strictly positive lower bound on Ricci curvature suffices, and the constant C will depend on that bound and maybe also the dimension. Certainly it holds on the sphere with its usual Riemannian structure.

The reformulation at the end of your comment is correct, up to the absence of the C (and of course a choice of sign convention for \Delta).

But wait, can this really be right? What if f is a constant function? Then the right-hand side vanishes, while the left-hand side is strictly positive?

If f is constant then the left-hand side also vanishes. Keep in mind that Ent_\mu(|f|^2) is a relative entropy. It measures the difference in position uncertainty between \mu (which has maximum uncertainty) and |f|^2 \, d\mu. So an LSI bounds the position uncertainty of a particle from below in terms of its energy.

Posted by: Mark Meckes on May 17, 2011 2:46 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Okay, I’m happy that you came to the same conclusion as I did, so that I have not just been dreaming things up. Actually I think it’s very similar to Tom’s previous approach, just a prettier variant of the same thing.

There’s a typo in your last equation: the p(j) should be in the numerator, not in the denominator.

The fibers with total measure zero indeed pose no problem: since H(p(j)) vanishes for those j for which r_j = 0, we may as well leave those j out of the sum \sum_j H(p(j)).

A few more ideas:

  • This characterization of Shannon entropy probably generalizes to a reformulation of Tom’s characterization of Tsallis entropy if one changes the homogeneity condition from F(\lambda f) = \lambda F(f) to F(\lambda f) = \lambda^q F(f), and defines Tsallis entropy on non-normalized measures as

S_q(p) = \frac{1}{q-1}\left(\left(\sum_i p_i\right)^q - \sum_i p_i^q\right)

which is again an “additivity defect”. Today I’m too tired to think about the details…

  • The calculations might become prettier if we don’t write out all these logarithms, but use the shorthand notation D(x) = x \log x. More generally, we can possibly show the following:

Conjecture: Let D : \mathbb{R}_+ \rightarrow \mathbb{R}_+ be any (non-linear) function which satisfies

D(xy) = x D(y) + D(x) y

Then there is an entropy functor F_D : FinMeas \rightarrow \mathbb{R}_+ given by

F_D(f : p \rightarrow r) = S_D(p) - S_D(r), \qquad S_D(p) = D\left(\sum_i p_i\right) - \sum_i D(p_i)

This conjecture doesn’t buy us anything new when working over \mathbb{R}_+, but:

  • Most of this stuff actually makes sense over any (commutative?) rig, including this conjecture. Besides using continuity, we’re doing nothing which would be specific to the real numbers.

  • Let (a,b,c) be a three-element measure space. A nice consequence of functoriality and additivity of an entropy functor is the following equation:

F(a,b,c) = F(a,b) + F(a+b,c) = F(b,c) + F(a,b+c)

Proof: We consider the measure-preserving functions which identify the first two elements,

(a,b,c) \rightarrow (a+b,c) \rightarrow (a+b+c)

The first map is the direct sum of id_{(c)} and (a,b) \rightarrow (a+b), so its entropy is 0 + F(a,b). The entropy of the second map is F(a+b,c). So by functoriality, the sum F(a,b) + F(a+b,c) equals the entropy of the composed map, which is F(a,b,c). This proves that the first equals sign holds. The second can be shown similarly by considering the maps

(a,b,c) \rightarrow (a,b+c) \rightarrow (a+b+c)

or by using permutation invariance of entropy, which is a special case of functoriality.

This equation smells like some sort of associativity thing, doesn’t it?
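With S(p) = D(\sum_i p_i) - \sum_i D(p_i) and D(x) = x \log x, the equation is easy to verify numerically: the entropy F(a,b,c) of the map to a point is just S(a,b,c), since a one-point measure has zero entropy. A quick check (a sketch, with my own function names):

```python
import math

def D(x):
    """D(x) = x log x, extended by D(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def S(*p):
    """Extended Shannon entropy of a finite measure (p_1, ..., p_n)."""
    return D(sum(p)) - sum(D(pi) for pi in p)

a, b, c = 0.7, 1.3, 2.5
lhs = S(a, b, c)
# F(a,b,c) = F(a,b) + F(a+b,c) = F(b,c) + F(a,b+c)
assert abs(lhs - (S(a, b) + S(a + b, c))) < 1e-12
assert abs(lhs - (S(b, c) + S(a, b + c))) < 1e-12
```

Algebraically the D(a+b) terms cancel in the sum S(a,b) + S(a+b,c), leaving exactly S(a,b,c), which is the proof above in miniature.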

Posted by: Tobias Fritz on May 13, 2011 3:36 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

How do your first two bullet points interact? I.e., how must the derivation equation be modified to cater for the D corresponding to S_q?

Posted by: David Corfield on May 13, 2011 5:06 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Good question! Presumably in the case of S_q, the corresponding D_q ought to satisfy some sort of q-derivation property.

Given the above formula for S_q, we can try to set

D_q(x) = \frac{x^q - x}{q-1}

which has the right behavior for q \rightarrow 1. (I remember this formula from somewhere, probably from John’s paper.)

This definition recovers the above S_q as the additivity defect of D_q,

S_q(p) = D_q\left(\sum_i p_i\right) - \sum_i D_q(p_i) = \frac{1}{q-1}\left(\left(\sum_i p_i\right)^q - \sum_i p_i^q\right)

Now what happens when we plug a product of numbers into D_q?

D_q(xy) = \frac{x^q y^q - xy}{q-1} = x^q \frac{y^q - y}{q-1} + \frac{x^q - x}{q-1} y = x^q D_q(y) + D_q(x) y

So this should be the desired relation. It’s analogous to the product rule for the q-derivative, which would justify calling this D_q a “q-derivation”.
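Both the q-derivation rule and the degree-q homogeneity of S_q are easy to confirm numerically. A small sketch (function names are mine):

```python
def D_q(x, q):
    """D_q(x) = (x^q - x)/(q - 1), a q-analogue of x log x."""
    return (x**q - x) / (q - 1)

def S_q(p, q):
    """Tsallis entropy of a non-normalized measure p, as the additivity defect of D_q."""
    return D_q(sum(p), q) - sum(D_q(pi, q) for pi in p)

for q in (0.5, 2.0, 3.0):
    # q-derivation rule: D_q(xy) = x^q D_q(y) + D_q(x) y
    for x in (0.2, 1.0, 3.5):
        for y in (0.7, 2.0):
            assert abs(D_q(x * y, q) - (x**q * D_q(y, q) + D_q(x, q) * y)) < 1e-9
    # homogeneity of degree q: S_q(lam * p) = lam^q * S_q(p)
    p, lam = [0.3, 1.1, 0.6], 2.0
    scaled = [lam * pi for pi in p]
    assert abs(S_q(scaled, q) - lam**q * S_q(p, q)) < 1e-9
```

So the modified homogeneity condition F(\lambda f) = \lambda^q F(f) from the earlier bullet point is indeed consistent with this S_q.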

Posted by: Tobias Fritz on May 13, 2011 5:49 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

If

Most of this stuff actually makes sense over any (commutative?) rig, including this conjecture,

do you have candidates? So to find Ds appropriate for the rig \mathbb{R}_{min},

D(x y) = x D(y) + D(x) y

we’d need

D(x + y) = \min\{x + D(y), D(x) + y\}.

Posted by: David Corfield on May 14, 2011 5:17 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Hmm, for the rig \mathbb{R}_{\min} I don’t have any candidates.

In the case of \mathbb{R}_+, the function D(x) = x \log x actually takes on negative values for 0 \lt x \lt 1, so it’s not even a function from the rig to itself…

What could be a nice example though are the finite fields \mathbb{F}_p for an odd prime p. In this case, one can use the discrete logarithm \log_2, where \log_2(z) is the unique solution x to the equation 2^x = z. This discrete logarithm also satisfies \log_2(xy) = \log_2(x) + \log_2(y), which means that the definition

D(x) = x \log_2 x

also gives a (non-linear) derivation over \mathbb{F}_p, and one can define the entropy of p = (p_1, \ldots, p_n) \in \mathbb{F}_p^n in the obvious way analogous to Shannon entropy. Has this concept been studied? Could it be useful in number theory?
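A small sanity check of the discrete-logarithm part (a sketch; note that the exponents naturally live modulo p - 1, since 2^{p-1} = 1, so the additivity of \log_2 holds mod p - 1):

```python
p = 11  # 2 is a primitive root mod 11, so log_2 is defined on all of F_p \ {0}

# Build the discrete log table: log2[2^k mod p] = k, with exponents k in Z/(p-1).
log2 = {pow(2, k, p): k for k in range(p - 1)}
assert len(log2) == p - 1  # 2 really does generate the multiplicative group

# The discrete logarithm turns products into sums, modulo p - 1.
for x in range(1, p):
    for y in range(1, p):
        assert log2[(x * y) % p] == (log2[x] + log2[y]) % (p - 1)
```

Whether D(x) = x \log_2 x then satisfies the derivation law on the nose depends on how one handles this mod-(p-1) wraparound when the exponent is multiplied back into \mathbb{F}_p.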

Posted by: Tobias Fritz on May 16, 2011 11:49 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I wrote a draft of a paper describing Tobias’ characterization of Shannon entropy:

It lacks the discussion that would make it really clear and interesting. There are also some rough patches that need to be fixed. So, don’t read this yet unless you’re already convinced this is a fascinating subject… or your name begins with “To”. There will be many chances to read it later, and it’ll only get better with age.

I didn’t include anything about \mathbf{FP}, operads, Rényi entropy, or the partition function. I think it might be good to put that material in another paper or papers. I’m imagining calling these papers “Entropy as a Functor I”, “Entropy as a Functor II”, etc., or something like that. (It’s possible that having “functor” in the title is a bad idea.) I think it may actually be less work to write a few short papers than one long one, except for the business of dealing with journals. There are fewer organizational decisions to be made. Also, short papers are easier to read.

Anyway, you can see my strategy for this paper: push the heavy-duty category theory to the end.

And you can also see that I’m experimenting with 1) writing papers more quickly and 2) making the writing process more public, because it’s more fun that way.

Posted by: John Baez on May 14, 2011 4:16 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I added a short introduction to the paper, sketching some of the main ideas. Now the main result is stated 3 times in somewhat different ways. I think that’s actually a good thing. I used to think you could tell somebody something once and they’d know it, but that was very naive.

Posted by: John Baez on May 15, 2011 7:56 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

That looks great! I’m not sure how to proceed; should I edit the draft as I feel like? Or should we first discuss any changes here? Given that you’re the much more skillful expositor, I’m a bit in doubt.

For example: it seems good how the result is stated in the introduction. Maybe we can do even better if we try to avoid the word ‘morphism’?

I would like to take a look at Faddeev’s original paper, but haven’t been able to find it online. Does anyone know whether it is available in digital form?

And it seems that he published his characterization of entropy first in Russian, and the one we currently cite is a German translation. So, we should probably (also) cite the original version, right?

Posted by: Tobias Fritz on May 16, 2011 11:08 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

link to Faddeev’s paper

Posted by: Maxim Budaev on May 16, 2011 11:50 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks, Maxim! Actually I meant the German translation, but I’ll have a go at trying to understand this one ;)

Posted by: Tobias Fritz on May 16, 2011 12:06 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Okay — thanks to some serious work by Tom, Tobias and myself, the paper has been massively updated to include a better introduction, and also a characterization of Tsallis entropy as well as Shannon entropy:

I’m just trying this new title on for size, not insisting we use it. The point is supposed to be this: instead of focusing on assigning an entropy to an object (say, a finite set with a probability measure), we’re focusing on assigning an entropy to a morphism (a measure-preserving map). That’s a very category-theoretic move to make! But instead of slugging our readers in the gut with category theory right away, I’m trying to start by explaining the idea in something like plain English. And the idea seems to be ‘information loss’.

In other words: if you apply a function to an element of a set, you may lose some information about which element you had, because that function may not be one-to-one. Making this quantitative and imposing 4 simple axioms leads to Shannon entropy! Tweaking one of the axioms gives Tsallis entropy.

Tom emailed me a longish list of suggestions, and I’m stuck here in Narita Airport so I just implemented most of those. The paper is getting pretty nice. David Corfield may be pleased or at least intrigued at how Tsallis won out over Rényi… at least in this paper.

Digressing quite a bit:

I found it amusing that a while back, when David mentioned the “Tsallis/Rényi debate”, Tom responded by saying “I don’t think a debate is what we need.” This is not the only time in the last fortnight when one of the mathematicians here has reacted negatively to David’s use of the word “debate”. I think we’ve talked about this cultural difference before. Philosophers tend to think debating is a good way to hammer out differences and find the truth, while mathematicians pretty quickly say “let’s stop bickering and prove some theorems!”

Posted by: John Baez on May 17, 2011 6:46 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I’m not sure that people are understanding what I mean by ‘debate’. It’s not a horribly adversarial battle between two camps, only one of which can win. It’s more of a discussion to find out whether there is partiality in one’s vision. In fact we had something like a debate here, and just as well, because had you gone with Tom’s initial position

I’d pretty much reached the point where (the exponential of) Rényi entropy seemed to me superior in every way to Tsallis entropy,

you wouldn’t have included the material about Tsallis entropy in your paper. It wasn’t a case of Tsallis ‘winning out’, but rather of finding it its proper place in the system.

Posted by: David Corfield on May 17, 2011 9:21 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

David wrote:

I’m not sure that people are understanding what I mean by ‘debate’. It’s not a horribly adversarial battle between two camps, only one of which can win. It’s more of a discussion to find out whether there is partiality in one’s vision.

The mathematicians I know never use the term ‘debate’ for the kinds of discussions we engage in. We tend to talk about ‘figuring things out’ or ‘solving problems’. When we do this, we may argue a lot, but when it works well there aren’t really fixed ‘sides’. In fact, I often find people switching sides several times in a conversation. Perhaps the reason is that we’re used to the idea that to solve a problem you have to try very hard to prove things and also, simultaneously, try very hard to disprove them.

So, even working by yourself, it goes like this: you want to prove P, so you try to prove it for a while, but if you get stuck, you try to dream up the best potential counterexample you can. If the counterexample fails, you try to analyze why it did, and thus learn more about why P is true. But if the counterexample wins, you try to see why it won, so you can change P to some better hypothesis P'. And so on: this is just the standard top layer of what can become a very complicated process of simultaneously fighting two or more sides of a big battle on many fronts.

So when I have a collaborator, it’s often convenient to split up the task of fighting this battle — but we’re both quite used to switching sides, so we often do. You’ll hear remarks like “But wait a minute: I could argue for that better than you just did!”

By contrast, I suspect that to most mathematicians the term ‘debate’ calls up images of high-school debating where the sides are fixed ahead of time and it would count as a defeat to change sides. Compared to what we do this seems like a very rigid and formulaic process, almost like a medieval joust, or trial by combat.

I know that’s not what you mean by ‘debate’, but you could probably get your point across more easily to mathematicians if you talk more like they do.

Posted by: John Baez on May 18, 2011 2:05 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Ah, I see. It’s largely a matter of terminology then.

Your description of mathematical discussion naturally bears resemblance to discussion in any academic discipline. Had you extended the subject matter from the truth of a result to the worth of particular organizations of a field, the similarities to philosophy would have been stronger, in that it is not to be expected that one reaches an end point. Of course, differences would still remain – demonstrated mathematical results are an integral part of the depiction of a field as well-organized.

Discussions about improving conceptual organization, the kind of thing Bourbaki meetings dealt with, have always interested me more.

Posted by: David Corfield on May 18, 2011 12:43 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

had you gone with Tom’s initial position […] you wouldn’t have included the material about Tsallis entropy in your paper.

I’m not sure I agree. I was going with my initial position at the point when I found this theorem on Tsallis entropy, and subsequently the variant that made it into the paper. In fact, I sent John and Tobias an email trying to persuade them that we should include this variant. (It turned out that John, at least, needed no persuasion: he had discovered it at the same time himself, but was on a plane so couldn’t communicate it.)

Posted by: Tom Leinster on May 17, 2011 11:48 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

About the title: do you think the word “functor” is so scary that it shouldn’t be mentioned?

What about “A functorial characterization of Shannon and Tsallis entropy”?

Posted by: Tobias Fritz on May 17, 2011 11:58 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

To me, the question of using functorial in the title goes back to a discussion about whether there is any mathematical topic which could most simply be explained by using category theory. (I have tried to relocate that discussion and failed.) Since the objective of this paper seems to be to explain/teach something clearly, one has to ask if a categorical approach (or a hint of one in the title) would add or detract from the focus of achieving a clear description of the ideas to a wide range of readers. Perhaps someday the use of category theory will make things easier to understand for non-specialists, but maybe not yet. John alluded to that in his comments above.

I think that category theory should be used as a basic tool to achieve simple explanations of many key ideas. However, I think it requires a very disciplined, minimalist use of categorical concepts initially. But that’s a topic to be discussed elsewhere.

Posted by: Charlie C on May 17, 2011 2:20 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Charlie wrote:

To me, the question of using functorial in the title goes back to a discussion about whether there is any mathematical topic which could most simply be explained by using category theory.

What counts as ‘simple’ depends a lot on who is deciding. As one learns more and more math, one keeps compressing one’s previous understandings into terser and more general formulations. So, an expert reading an elementary textbook will find basic things explained in a way that seems horribly inefficient and unnecessarily complicated. But the beginner hearing the expert’s explanation may find it incomprehensible or perhaps slippery and ‘overly abstract’.

Even among people who have studied math for many decades, tastes vary immensely. For me, personally, every subject in math can be understood more simply with the help of category theory. So when I’m explaining ideas to myself, it’s always simpler to use category theory.

But that’s probably not what you meant by “explain most simply”.

Since the objective of this paper seems to be to explain/teach something clearly, one has to ask if a categorical approach (or a hint of one in the title) would add or detract from the focus of achieving a clear description of the ideas to a wide range of readers.

Well, we’re writing some research papers, not papers that explain previously known stuff. We’ve figured out a bunch of new ways of thinking about entropy — ways that seem pretty hard to come up with unless you know category theory. So, one question is how much we want to talk to people who already know category theory, versus how much we want to introduce the category-theoretic ideas to a broader range of readers, versus how much we want to hide how we originally came up with our ideas, and translate them into language that’s more familiar to people who work on information theory or statistical mechanics. This is a decision we have to make.

We may want to make this decision in different ways in different papers, or even different ways in the same paper. Right now, the paper we’re finishing off uses just a smidgen of category theory until the last section: basically just the idea of “morphism between objects”. But then, in the last section, we reveal more about how we actually think about this stuff. And at present that may come as shock to some readers. We’re not really trying to teach them any category theory: we’re basically just saying “here’s another way to think about what we’re doing — see if you can handle it.” If they can’t, they can still enjoy the rest of the paper.

The funny thing is that Tom first came up with the basic idea of this first paper using the language of ‘lax points’ for operad algebras, which is fairly sophisticated. Then he figured out how to restate it using operad algebra homomorphisms, which are less sophisticated. And then Tobias figured out a way to state it that uses less category theory and never mentions operads at all — and that’s what we’re doing now. But would we ever have come up with these without having category theory in our minds?

Posted by: John Baez on May 18, 2011 2:50 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

John, I think your reply below to Tobias describes my position regarding the title of the paper better than I did.

The other point I wanted to make, although this may not be the proper forum to do so, was that I am hoping that simple categorical descriptions can be valuable as a tool for summarizing complex relationships in a way that makes them easier to remember and understand – in other words, as a teaching tool. I’m on a quixotic quest to achieve a reasonably accurate understanding of gauge theory starting with almost no relevant background. This is like learning several new languages at the same time – a little topology (including elementary homotopy theory), some group theory, locally trivial bundles and principal bundles, some differential geometry, and so on … I’m using your book as a roadmap to guide me and keep me focused. And amid the clouds of new concepts and definitions, it seems that a little category theory to pull things together might make life easier. So given my goal, you can see why I would hope that simple category theory as a basic tool, if introduced at the beginning of the road, would make things easier to understand, summarize, and remember. Are you aware of anyone having used simple category theory, introduced early on, as a fundamental teaching tool for their main topic, with the aim of simplifying the teaching and learning tasks?

Posted by: Charlie C on May 18, 2011 6:42 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Charlie wrote:

Are you aware of anyone having used simple category theory, introduced early on, as a fundamental teaching tool for their main topic, with the aim of simplifying the teaching and learning tasks?

One example relevant to your own quest springs to mind:

Chris Isham, Modern Differential Geometry for Physicists, World Scientific Press, Singapore, 1999.

Isham is one of the gurus of quantum gravity, and recently has gotten very excited about topos theory (a branch of category theory). In this introductory book on differential geometry for physicists, he uses a little — but not too much! — category theory to organize things. You might like it.

Posted by: John Baez on May 19, 2011 6:49 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Thanks, John, for the suggestion! Exciting! I think it will be a very useful text; this is the kind of overview that I find very helpful in guiding my more detailed studies. I’ll let you know if the use of category theory is helpful to me. Ordering it now!

Posted by: Charlie C on May 19, 2011 1:25 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

…and recently has gotten very excited about topos theory…

‘Recently’ in the sense of 15 or so years ago.

Posted by: David Corfield on May 19, 2011 8:37 AM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

I guess I’m getting old. Since Isham only began seriously working on topos theory after a long and prominent career thinking about quantum gravity, this switch still seems rather recent and surprising to me. I was raised on his earlier work.

The switch was also surprising because one of his great skills lay in questioning the popular approaches to quantum gravity and finding flaws in them. With his switch to topos theory, his papers took on a more ‘positive’ tone.

In February 2007 he said:

Christopher Isham, now 63, has scaled back his teaching and outreach activities. Not that he dislikes the challenge of explaining theoretical physics to 16-year olds. On the contrary: Isham has a great interest in young people. But the Imperial College professor suffers from a neurological disease, which renders him less mobile than he would like. He now walks only with sticks, and works mostly from home. Coming to grips with the failure of your physical body is a sobering condition for a physicist with a strong interest in philosophy and theology.

Like many of his colleagues, Isham wants to reconcile the two pillars of 20th-century physics: General Relativity and quantum theory. But unlike the majority of theoretical physicists, Isham believes the answer can be found in a new mathematical language called topos theory. Introduced by the German mathematician Alexander Grothendieck, topos theory – unlike regular philosophy and theology – is able to handle concepts that can be partially true, instead of just true or false. But topos theory can be frustratingly hard to explain, says Isham. “I could give a seminar on topos theory here at Imperial College, but not many people would understand it.”

[…]

“String theory and loop quantum gravity are currently the two main programs” attacking this problem, says Isham, “but personally, I think they’re both wrong. In my opinion, quantum theory needs to be changed.”

According to mathematical physicist Robbert Dijkgraaf of the University of Amsterdam, the topos approach could yield unexpected results, since it provides a fundamental way of describing quantization. “But it’s a very long shot to solve the problem of quantum gravity,” he notes.

’t Hooft is even more skeptical. “Our biggest problem is how to formulate our questions,” he says. “What exactly do you want to know, and which questions are you able to answer? Isham believes another mathematical language may help, but I don’t think so. It sounds a bit as if describing the world in German is better than in Chinese.”

But Isham is convinced that looking at the world in a new way will yield surprising new insights. “Most scientists are arrogant,” he says. “They tend to think that particles and forces is just all there is to be said about reality. But I do think there are profound mysteries in the world.”

[…]

“I can’t tell whether or not this will turn out to be useful in the end,” he acknowledges, “but what I can tell is that my recent work has been the best in my career.”

I think the situation with topos theory at Imperial College has improved a bit since 2007, and I bet he has taught a seminar or two on the subject since then.

Posted by: John Baez on May 19, 2011 12:51 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Tobias wrote:

About the title: do you think the word “functor” is so scary that it shouldn’t be mentioned?

What about “A functorial characterization of Shannon and Tsallis entropy”?

I think it depends what we’re trying to do in this paper, and which journal we’re going to submit it to.

(We should really answer the second question before rewriting the paper any further. Finishing a paper before you’ve decided where to submit it is like writing a marriage proposal before you’ve chosen a girl. Sorry, that’s a sexist and old-fashioned analogy — does anyone write marriage proposals these days? — but I can’t think of a better one.)

If we’re trying to convince the world that category theory is great for getting new insights into statistical mechanics, or something like that, then a title like yours could be good. But:

1) I think ‘A functorial characterization of entropy’ is better: titles should be short and snappy, and ‘Tsallis’ will repulse more people than it attracts. The number of people who like both Tsallis entropy and functors must be very small; it’s probably around 4.

2) If we’re going to put ‘functor’ in the title, we need to explain why in the introduction, not wait until the final mind-boggling section to reveal that FF is secretly a functor.

On the other hand, if we’re trying to convince as many people as possible that we’ve got the cutest characterization of entropy since… well, Fadeev… then we should downplay the word ‘functor’. Many people simply won’t read a paper with that word in the title.

Don’t forget, we’re writing more than one paper, so we can take different strategies in different papers. This could be the ‘easy’ paper for people who are scared of functors. I’ve been trying to make it very easy except for the last section, where (rather bizarrely) we suddenly start talking about categorified modules of topological monoids!

(I’ve been meaning to soften the transition a bit; right now the category theory comes down like an enormous crushing hammer on the unsuspecting reader.)

Posted by: John Baez on May 18, 2011 11:15 AM | Permalink | Reply to this
Read the post An Operadic Introduction to Entropy
Weblog: The n-Category Café
Excerpt: Summary of recent results on entropy, and introduction to entropy for the categorically minded.
Tracked: May 18, 2011 10:12 PM
Read the post Category-Theoretic Characterizations of Entropy III
Weblog: The n-Category Café
Excerpt: Now we're trying to generalize our characterization of entropy from the classical to the quantum case.
Tracked: June 5, 2011 9:26 AM

Re: Category-Theoretic Characterizations of Entropy II

Here’s a follow-up on turning probabilities into a Hamiltonian by taking logarithms.

In order to get an understanding of Araki’s notion of relative entropy of states on von Neumann algebras, I have been studying a bit of Tomita-Takesaki theory. And funnily enough, what crops up there is precisely John’s idea of taking logarithms of probabilities in order to obtain a Hamiltonian!

For finite-dimensional von Neumann algebras, the basic constructions of Tomita-Takesaki theory are as follows. We start with a matrix algebra $M_n = M_n(\mathbb{C})$, which we regard as both a von Neumann algebra and a Hilbert space $H$ with respect to the Hilbert-Schmidt scalar product:

$$\langle a, b \rangle = \mathrm{tr}(a^\ast b)$$

We think of the algebra $M_n$ as acting on the Hilbert space $H = M_n$ by left multiplication. As usual, we can start with a vector $\psi \in H$ and turn it into a state $\rho : M_n \to \mathbb{C}$ as follows:

$$\rho(a) = \mathrm{tr}(\psi^\ast a \psi)$$

Conversely, every state $\rho$ on $M_n$ can be represented in this form: just take $\psi$ to be the (or some) square root of the density matrix. We might regard $\psi$ as a sort of wave function.
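As a quick sanity check (my own minimal numerical sketch, not part of Tobias’s comment), one can verify with NumPy and SciPy that taking $\psi$ to be the square root of a density matrix recovers the usual expectation value $\mathrm{tr}(\rho a)$; the names `rho`, `psi`, and `a` here are just illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
n = 3

# Build a random n x n density matrix: positive semidefinite with trace 1.
x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
rho = x @ x.conj().T
rho /= np.trace(rho).real

# A random observable (Hermitian matrix).
y = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
a = (y + y.conj().T) / 2

# The "wave function": the Hermitian square root of the density matrix.
psi = sqrtm(rho)

# tr(psi* a psi) = tr(a psi psi*) = tr(a rho), by cyclicity of the trace.
lhs = np.trace(psi.conj().T @ a @ psi)
rhs = np.trace(rho @ a)
print(np.allclose(lhs, rhs))  # True
```

The equality holds because $\psi\psi^\ast$ is exactly the density matrix when $\psi$ is its Hermitian square root.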

If one then goes through the formalism, it turns out that the modular automorphisms associated to the state $\rho$ (with wave function $\psi$) are given by

$$\sigma^\rho_t(a) = (\psi\psi^\ast)^{i t} \, a \, (\psi\psi^\ast)^{-i t}$$

Anyone familiar with quantum mechanics will immediately recognize this as the usual quantum dynamics in the Heisenberg picture:

$$\sigma^H_t(a) = e^{i t H} a e^{-i t H}$$

where the Hamiltonian is in this case given by:

$$H = \log(\psi\psi^\ast)$$

which is the quantum version of what John did above in the main post!

Posted by: Tobias Fritz on June 28, 2011 7:29 PM | Permalink | Reply to this

Re: Category-Theoretic Characterizations of Entropy II

Nice! I’d known about Tomita–Takesaki theory (for example it shows up nicely in the paper by Alain Connes and Carlo Rovelli on thermodynamics and the nature of time), but I hadn’t noticed that it’s based on the same idea I’ve been so excited by lately. Thanks!

Posted by: John Baez on June 29, 2011 7:51 AM | Permalink | Reply to this
