
February 14, 2017

Functional Equations II: Shannon Entropy

Posted by Tom Leinster

In the second instalment of the functional equations course that I’m teaching, I introduced Shannon entropy. I also showed that up to a constant factor, it’s uniquely characterized by a functional equation that it satisfies: the chain rule.

Notes for the course so far are here. For a quick summary of today’s session, read on.

You can read the full story in the notes, but here I’ll state the main result as concisely as I can.

For $n \geq 0$, let $\Delta_n$ denote the set of probability distributions $\mathbf{p} = (p_1, \ldots, p_n)$ on $\{1, \ldots, n\}$. The Shannon entropy of $\mathbf{p} \in \Delta_n$ is

$$H(\mathbf{p}) = - \sum_{i \colon p_i \gt 0} p_i \log p_i.$$
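As a concrete sanity check, this definition is a one-liner in Python (a sketch using natural logarithms; any fixed base of logarithm just rescales $H$ by a constant):

```python
from math import log

def shannon_entropy(p):
    """H(p) = -sum over i with p_i > 0 of p_i * log(p_i), natural logarithm."""
    return -sum(x * log(x) for x in p if x > 0)

print(shannon_entropy([0.5, 0.5]))          # log 2 ≈ 0.693
print(shannon_entropy([1.0, 0.0]) == 0.0)   # True: a certain outcome carries no information
print(shannon_entropy([0.25] * 4))          # log 4: uniform distributions maximize entropy
```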

Now, given

$$\mathbf{w} \in \Delta_n, \,\, \mathbf{p}^1 \in \Delta_{k_1}, \ldots, \mathbf{p}^n \in \Delta_{k_n},$$

we obtain a composite distribution

$$\mathbf{w} \circ (\mathbf{p}^1, \ldots, \mathbf{p}^n) = (w_1 p^1_1, \ldots, w_1 p^1_{k_1}, \, \ldots, \, w_n p^n_1, \ldots, w_n p^n_{k_n}).$$

The chain rule for $H$ states that

$$H(\mathbf{w} \circ (\mathbf{p}^1, \ldots, \mathbf{p}^n)) = H(\mathbf{w}) + \sum_{i = 1}^n w_i H(\mathbf{p}^i).$$
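The composition and the chain rule are easy to check numerically (a sketch; `H` and `compose` are ad hoc helper names, natural logarithms assumed):

```python
from math import log

def H(p):
    # Shannon entropy with natural logarithm.
    return -sum(x * log(x) for x in p if x > 0)

def compose(w, ps):
    # Operadic composite w ∘ (p^1, ..., p^n): weight w_i scales the whole block p^i.
    return [w_i * p_ij for w_i, p in zip(w, ps) for p_ij in p]

w  = [0.5, 0.3, 0.2]
ps = [[0.5, 0.5], [1.0], [0.25, 0.25, 0.5]]

lhs = H(compose(w, ps))
rhs = H(w) + sum(w_i * H(p) for w_i, p in zip(w, ps))
print(abs(lhs - rhs) < 1e-9)  # True: the chain rule holds on this example
```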

So, $(H \colon \Delta_n \to \mathbb{R}^+)_{n \geq 1}$ is a sequence of continuous functions satisfying the chain rule. Clearly, the same is true of any nonnegative scalar multiple of $H$.

Theorem (Faddeev, 1956)   The only sequences of continuous functions $(\Delta_n \to \mathbb{R}^+)_{n \geq 1}$ satisfying the chain rule are the scalar multiples of entropy.

One interesting aspect of the proof is where the difficulty lies. Let $(I \colon \Delta_n \to \mathbb{R}^+)_{n \geq 1}$ be a sequence of continuous functions satisfying the chain rule; we have to show that $I$ is proportional to $H$. All the effort and ingenuity goes into showing that $I$ is proportional to $H$ when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant $c$ such that

$$I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n)$$

for all $n \geq 1$. But once that’s done, showing that $I(\mathbf{p}) = c H(\mathbf{p})$ in general is a pushover. The notes show you how!

Posted at February 14, 2017 11:06 PM UTC



Re: Functional Equations II: Shannon Entropy

So who is your audience for this course?

Posted by: Todd Trimble on February 15, 2017 1:28 AM

Re: Functional Equations II: Shannon Entropy

It’s pretty varied. Some staff, some PhD students, some undergraduates. Some analysts, some mathematical physicists, some topology and algebra people, some mathematical biologists, some people who I don’t know. And one baby.

Posted by: Tom Leinster on February 15, 2017 1:34 AM

Re: Functional Equations II: Shannon Entropy

By the way, here’s something that bothers me a little. The unique characterization theorem above doesn’t require symmetry. Of course, it’s true that entropy is symmetric, in the sense that

$$H(p_{\sigma(1)}, \ldots, p_{\sigma(n)}) = H(p_1, \ldots, p_n)$$

for any $\sigma \in S_n$. But unless I’ve made a mistake, this isn’t actually needed to prove the theorem.

I was a bit freaked out when I first realized this, because it seemed to me near-unthinkable that symmetry wouldn’t be a basic requirement. If you have a bijection between two finite sets and probability distributions on each that match up under the bijection, then of course their entropies should be equal. But apparently that isn’t needed in the proof. I don’t know what to make of this.

Historically, the waters are muddied by the fact that Faddeev himself did assume symmetry. But I think that’s only because rather than assuming the general form of the chain rule, he just assumed the special case

$$H(p_1, \ldots, p_{n - 1}, t p_n, (1 - t) p_n) = H(p_1, \ldots, p_n) + p_n H(t, 1 - t).$$

An easy induction shows that this special case of the chain rule, together with symmetry, proves the general case. And I believe that’s all he needs symmetry for.

Posted by: Tom Leinster on February 15, 2017 1:48 AM

Re: Functional Equations II: Shannon Entropy

Well, looking over the nice proof from your notes, it all seems to hinge on the all-important identity

$$p \circ (u_{k_1}, u_{k_2}, \ldots, u_{k_n}) = u_k$$

in the case where $p = \bigl(\frac{k_1}{k}, \frac{k_2}{k}, \ldots, \frac{k_n}{k}\bigr)$ is a rational point of the simplex. So that would seem to be the thing one needs to feel in one’s bones, and how it relates to symmetry.
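For what it’s worth, the identity can be checked mechanically with exact rational arithmetic (a sketch; `uniform` and `compose` are ad hoc helpers, not notation from the notes):

```python
from fractions import Fraction

def uniform(k):
    # The uniform distribution u_k on k elements, as exact rationals.
    return [Fraction(1, k)] * k

def compose(w, ps):
    # Operadic composition: the i-th weight scales the whole block p^i.
    return [w_i * x for w_i, p in zip(w, ps) for x in p]

ks = [2, 3, 5]
k = sum(ks)
p = [Fraction(k_i, k) for k_i in ks]   # the rational point (k_1/k, ..., k_n/k)

# Each entry of the composite is (k_i/k) * (1/k_i) = 1/k, repeated k_i times:
assert compose(p, [uniform(k_i) for k_i in ks]) == uniform(k)
print("p ∘ (u_{k_1}, ..., u_{k_n}) = u_k verified for k =", k)
```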

As for myself: I don’t really know what to make of this chain rule. The usual chain rule from calculus really says the derivative is a functor in a suitable sense – in my favorite formulation, the tangent bundle functor. But this chain rule I don’t have much feeling for, yet.

Posted by: Todd Trimble on February 15, 2017 1:27 PM

Re: Functional Equations II: Shannon Entropy

Oh, and of course speaking of derivatives, one can’t help but notice that the functional equation satisfied by $f(p) = -p \log p$, viz.

$$f(x y) = f(x) y + x f(y)$$

looks just like the product rule for derivatives

$$D(f g) = D(f) g + f D(g)$$

and so… what gives?

(From an “nPOV”, the product rule is explained by the fact that the derivative = tangent bundle functor $D = (-)^D$ preserves products. That is, the product of two functions $f, g \colon \mathbb{R} \to \mathbb{R}$ is a composite

$$\mathbb{R} \stackrel{\Delta}{\to} \mathbb{R} \times \mathbb{R} \stackrel{f \times g}{\to} \mathbb{R} \times \mathbb{R} \stackrel{\cdot}{\to} \mathbb{R}$$

and the product rule follows by combining product-preservation of $D$ with the fact that the derivative of the bilinear multiplication map $\cdot \colon (x, y) \mapsto x y$ is

$$\begin{pmatrix} y \\ x \end{pmatrix} \colon T_x(\mathbb{R}) \times T_y(\mathbb{R}) \to T_{x y}(\mathbb{R})$$

at the level of tangent spaces. Is there any chance of some analogous story for $p \mapsto -p \log p$?)

Posted by: Todd Trimble on February 15, 2017 4:55 PM

Re: Functional Equations II: Shannon Entropy

No time for a proper reply now, but there was some conversation about this derivation-like property of $x \mapsto -x \log x$ back here:

Category-theoretic characterizations of entropy II

An operadic introduction to entropy

Just search for the word “derivation” in both pages and you’ll find a bunch of thoughts. I think we did advance our understanding, but (speaking for myself) there’s plenty I don’t understand.

One quick observation: $f \colon x \mapsto -x \log x$ is, up to a constant factor, the only measurable function such that

$$f(x y) = x f(y) + f(x) y$$

for all $x, y \gt 0$. To see this, just substitute $g(x) = f(x)/x$ and use what’s known about Cauchy’s functional equation.
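Numerically, both the derivation-like identity and the effect of the substitution $g = f/x$ are easy to see (a sketch; natural logarithms, so $g(x) = -\log x$ and the second assertion is Cauchy’s equation $g(x y) = g(x) + g(y)$):

```python
from math import log, isclose

f = lambda x: -x * log(x)
g = lambda x: f(x) / x          # g(x) = -log x

x, y = 0.3, 7.2
# f satisfies the Leibniz-style identity f(xy) = x f(y) + f(x) y ...
assert isclose(f(x * y), x * f(y) + f(x) * y)
# ... which after substituting g = f/x becomes Cauchy's equation:
assert isclose(g(x * y), g(x) + g(y))
print("both identities verified at x =", x, "y =", y)
```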

Posted by: Tom Leinster on February 15, 2017 5:12 PM

Re: Functional Equations II: Shannon Entropy

As it happens, there’s a formula inverting (sort of) the (planar nonsymmetric) operad $\circ$ on interiors:

$$(\ldots w_j \ldots, p, q, \ldots v_k \ldots) = (\Sigma w, p+q, \Sigma v) \circ \Bigl(\frac{w}{\Sigma w}, \frac{(p,q)}{p+q}, \frac{v}{\Sigma v}\Bigr)$$

to which applying the chain rule gives

$$I(\ldots w_j \ldots, p, q, \ldots v_k \ldots) = I(\Sigma w, p+q, \Sigma v) + (\Sigma w) I(w/\Sigma w) + (\Sigma v) I(v/\Sigma v) + (p+q) I\Bigl(\frac{p}{p+q}, \frac{q}{p+q}\Bigr)$$

so that symmetry overall reduces to symmetry on $\Delta_1$.

And … I think that’s where Todd’s remark kicks in?

(I’m reminded of how the anti-symmetry of determinants is derived from multilinearity and the shearing rule, because $\left(\begin{array}{rr}0 & 1 \\ -1 & 0\end{array}\right) = \left(\begin{array}{rr}1 & 0 \\ 1 & 1\end{array}\right) \left(\begin{array}{rr}1 & -1 \\ 0 & 1\end{array}\right) \left(\begin{array}{rr}1 & 0 \\ 1 & 1\end{array}\right)$.)

Posted by: Jesse C. McKeown on February 15, 2017 2:58 PM

Re: Functional Equations II: Shannon Entropy

Dear keepers… typo: I transposed something there in my parenthetical.

Posted by: Jesse C. McKeown on February 15, 2017 3:00 PM

Re: Functional Equations II: Shannon Entropy

Todd, there’s always so much to respond to in your comments. Each one requires serious thought, and so many of them are gold mines!

So I’ll just respond to one particular point here, and even this will be a long comment.

As for myself: I don’t really know what to make of this chain rule. The usual chain rule from calculus really says the derivative is a functor in a suitable sense - in my favorite formulation, the tangent bundle functor. But this chain rule I don’t have much feeling for, yet.

Here’s the best I can do. A longer version of the following is in this post, but here I’ll try to say this as directly and briefly as possible. Which turns out to be “not very”.

There is a (topological, symmetric) operad $\Delta$ whose $n$-ary operations are the probability distributions on $n$ elements, with the composition described above. A $\Delta$-algebra in $Set$ is something like a convex algebra, but the law

$$\lambda x + (1 - \lambda) x = x$$

is not required to hold (and indeed can’t be expressed operadically). We can consider $\Delta$-algebras in $Cat$ instead of $Set$ — or better, in $Cat(Top)$ (topological categories). To keep the overhead down, I’ll suppress the “$Top$” bit in what follows, but it really should be there — that is, everything should be continuous.

One $\Delta$-algebra in $Cat$ is $FinProb$, the category of finite sets equipped with a probability distribution and measure-preserving transformations between them. The $\Delta$-algebra structure is given like this: if you have a probability distribution $p$ on a set $X$, and another distribution $p'$ on $X'$, then for every $\lambda \in [0, 1]$ you get a distribution $\lambda p \oplus (1 - \lambda) p'$ on $X \amalg X'$. I haven’t defined the notation, but I’m sure you know what I mean.

Whenever $C$ is a categorical $\Delta$-algebra (by which I mean a $\Delta$-algebra in $Cat$), you can talk about internal $\Delta$-algebras in $C$. This is just like the fact that whenever $M$ is a monoidal category, you can talk about internal monoids in $M$. (In one case we’re using the operad $\Delta$; in the other we’re using the terminal operad.) An internal $\Delta$-algebra in $C$ consists of an object $X$ together with a map $\lambda X \oplus (1 - \lambda) X \to X$ for each $\lambda \in [0, 1]$, satisfying some axioms.

For instance, in $FinProb$ there is an internal $\Delta$-algebra consisting of the one-element set $1$ together with the unique probability distribution on it. It’s an internal $\Delta$-algebra in a unique way.

The monoidal category $D$ of finite totally ordered sets has inside it the monoid $1$ (the one-element set). As is well known, this is the initial example of a monoidal category containing a monoid. It’s the “free monoidal category on a monoid”. So, for an arbitrary monoidal category $A$, the (strict) monoidal functors $D \to A$ are in canonical bijection with the monoids in $A$.

In exactly the same sense, $FinProb$ together with the internal $\Delta$-algebra $1$ is the free categorical $\Delta$-algebra on an internal $\Delta$-algebra. So, for an arbitrary categorical $\Delta$-algebra $A$, the structure-preserving functors $FinProb \to A$ are in canonical bijection with the $\Delta$-algebras in $A$.

(There was a white lie just then. It’s not quite $FinProb$ that has this universal property; it’s a near-identical category that I called $FP$ in that previous post. The difference lies in some delicacy involving zero probabilities. I’ll ignore it here.)

Now, the monoid $(\mathbb{R}^+, +, 0)$ is a convex algebra in the obvious way (it’s a convex set!), and when you view this monoid as a one-object category, it becomes a categorical $\Delta$-algebra. The internal $\Delta$-algebras in $\mathbb{R}^+$ are, therefore, in canonical bijection with the structure-preserving functors $FinProb \to \mathbb{R}^+$.

Now, at last, we get to the chain rule. What is a structure-preserving functor $F \colon FinProb \to \mathbb{R}^+$? It’s a way of associating to each measure-preserving map $\theta$ a real number $F(\theta)$. For instance, given any probability distribution $p$ on a finite set $X$, there is a unique map

$$!_p \colon (X, p) \to (1, u_1)$$

(where $u_1$ means the unique distribution on $1$), which gives rise to a number $F(!_p) \in \mathbb{R}^+$. So, among other things, $F$ assigns to each probability distribution a real number. Let’s write $I(p) = F(!_p)$.

But $F$ has to be a structure-preserving functor. Here “structure-preserving” refers to the $\Delta$-algebra structure. Crucial observation: any measure-preserving map $\theta \colon (X, p) \to (Y, q)$ between finite probability spaces can be written as a convex combination (over $y \in Y$) of maps to the one-point space. Since $F$ is required to preserve that convex combination, $F$ is entirely determined by $I$. In other words, $F$ is really just a way of assigning a number to each probability distribution — satisfying axioms, which we’re not finished with yet.

The one requirement on $F$ that we have left to consider is actual functoriality: preservation of composition. Rather than consider the general case, let me take just the example you mention. Take a probability distribution with rational probabilities,

$$p = \Bigl( \frac{k_1}{k}, \ldots, \frac{k_n}{k} \Bigr)$$

where $k_i \in \mathbb{Z}$ and $\sum k_i = k$. This gives rise to a composable pair of arrows in $FinProb$:

$$(k, u_k) \stackrel{\pi}{\to} (n, p) \stackrel{\psi}{\to} (1, u_1)$$

where $k$ means the set $\{1, \ldots, k\}$, where $u_k$ is the uniform distribution on $k$ elements, where $\pi$ is the map that partitions $k$ into pieces of size $k_1, \ldots, k_n$, and where $\psi$ is the only thing it could possibly be. Let’s apply $F$ to this composite $\psi \circ \pi$.

By what I just said about convex combinations, $\pi$ is a convex combination of maps into the terminal space:

$$\pi = \bigoplus_{i = 1}^n p_i \Bigl( (k_i, u_{k_i}) \to (1, u_1) \Bigr).$$

Since $F$ is structure-preserving,

$$F(\pi) = \sum_{i = 1}^n p_i I(u_{k_i}).$$

Now looking at the second map in our composable pair, $\psi$, we have $F(\psi) = I(p)$ simply by definition of $I$. Hence

$$F(\psi) + F(\pi) = I(p) + \sum_{i = 1}^n p_i I(u_{k_i}).$$

But $F \colon FinProb \to \mathbb{R}^+$ is a functor, and $+$ is composition in $\mathbb{R}^+$, so

$$F(\psi) + F(\pi) = F(\psi \circ \pi).$$

And $F(\psi \circ \pi) = I(u_k)$, again by definition of $I$. So

$$I(p) + \sum_{i = 1}^n p_i I(u_{k_i}) = I(u_k).$$

That’s exactly the chain rule! At least, it’s the chain rule for this particular example of an operadic composite,

$$u_k = p \circ (u_{k_1}, \ldots, u_{k_n}).$$

But everything I just said works equally well for any other composite.

So at last, we’ve seen how the chain rule expresses the fact that some functor preserves composition.
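With $I = H$, the identity derived from functoriality can be confirmed numerically (a sketch; natural logarithms, ad hoc helper names):

```python
from math import log

def H(p):
    # Shannon entropy with natural logarithm.
    return -sum(x * log(x) for x in p if x > 0)

ks = [1, 2, 4]                        # block sizes k_1, ..., k_n
k = sum(ks)                           # k = 7
p = [k_i / k for k_i in ks]           # p = (k_1/k, ..., k_n/k)

# I(p) + sum_i p_i I(u_{k_i}) should equal I(u_k) = log k:
lhs = H(p) + sum(p_i * H([1 / k_i] * k_i) for p_i, k_i in zip(p, ks))
rhs = H([1 / k] * k)
print(abs(lhs - rhs) < 1e-9)  # True
```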

Posted by: Tom Leinster on February 16, 2017 12:20 AM

Re: Functional Equations II: Shannon Entropy

Reading the course notes, I was a bit intrigued by the result on multiplicative functions that was tentatively attributed to Erdős, which in one version says that a monotone (= nondecreasing) multiplicative monoid map $f \colon \mathbb{N}_+ \to \mathbb{R}_+$ must be of the form $f(n) = n^\alpha$ where $\alpha \gt 0$.

First, it is correct to attribute this to Erdős; it’s in

* P. Erdős, On the distribution function of additive functions, Ann. of Math. (2) 47 (1946), 1-20. MR7,416c; Zentralblatt 61,79. (link)

Although it should be said that he proved stronger statements, in one of which the condition that $f$ be a monoid map is weakened to mere multiplicativity: recall that in number theory, a function $f$ on the positive integers is multiplicative if $f(m n) = f(m) f(n)$ whenever $m, n$ are relatively prime.

The statement that monotone multiplicative functions are power functions has been gradually simplified over the course of a number of papers (including one co-authored by Lambek). A relatively simple proof that strictly increasing multiplicative functions are power functions was given by Everett Howe, here.

But getting back to Tom’s statement: there was a somewhat contentious Mathematics StackExchange thread about this, here. Removing all the fuss and slightly rewording some details, it goes like this. Let $f \colon \mathbb{N}_+ \to \mathbb{R}_+$ be a monotone monoid map.

Step 1: We show that $f$ is either the constant function $1$ or is strictly increasing. Suppose there are $a, b \in \mathbb{N}_+$ with $a \lt b$ and $f(a) = f(b)$. Let $p$ be any prime; we can find integers $i, j$ with

$$\log_p a \lt \frac{i}{j} \lt \frac{i+1}{j} \lt \log_p b$$

so that $a^j \lt p^i \lt p^{i+1} \lt b^j$. Applying the monotone monoid map $f$ to the last set of inequalities, we get

$$f(a)^j \leq f(p)^i \leq f(p)^{i+1} \leq f(b)^j$$

but since $f(a) = f(b)$, we conclude $f(p)^i = f(p)^{i+1}$ and thus $f(p) = 1$. But the primes $p$ generate the multiplicative monoid $\mathbb{N}_+$, so $f$ is the constant function $1$.

Step 2: If a strictly increasing monoid map $f \colon \mathbb{N}_+ \to \mathbb{R}_+$ is not of the form $f(n) = n^\alpha$, then $n \mapsto \frac{\log f(n)}{\log n}$ ($n \gt 1$) is not constant, and so there are $a, b \gt 1$ with $\frac{\log f(a)}{\log a} \lt \frac{\log f(b)}{\log b}$. Abbreviate the left side to $x$ and the right side to $y$, so $f(a) = a^x$ and $f(b) = b^y$ and $\frac{x}{y} \lt 1$. Since $\frac{y}{x} \gt 1$, there are positive integers $i, j$ such that

$$\frac{\log b}{\log a} \lt \frac{i}{j} \lt \frac{y}{x} \frac{\log b}{\log a}$$

i.e.

$$j \log b \lt i \log a \qquad \text{and} \qquad i x \log a \lt j y \log b$$

i.e.

$$b^j \lt a^i \qquad \text{and} \qquad a^{i x} \lt b^{j y}.$$

Applying the strictly increasing monoid map $f$ to the first inequality, we infer

$$b^{j y} = f(b)^j \lt f(a)^i = a^{i x}$$

which contradicts the second inequality. Thus a strictly increasing monoid map must be of the form $n \mapsto n^\alpha$, with $\alpha \gt 0$.
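Here is a toy numeric instance of the Step 2 mechanism (hypothetical values: a map with $f(2) = 2$ but $f(3) = 9$, so $x = 1$, $y = 2$, cannot be a strictly increasing monoid map, because suitable integer powers collide):

```python
from math import log

# Hypothetical data: f(a) = a^x and f(b) = b^y with x < y.
a, b, x, y = 2, 3, 1, 2

# Pick i/j strictly between log b / log a and (y/x) * log b / log a:
i, j = 2, 1
assert log(b) / log(a) < i / j < (y / x) * log(b) / log(a)

# Then b^j < a^i, so strict monotonicity would force f(b)^j < f(a)^i ...
assert b**j < a**i                 # 3 < 4
# ... yet the assumed values give f(b)^j = b^(j*y) > a^(i*x) = f(a)^i.
assert b**(j * y) > a**(i * x)     # 9 > 4: contradiction with monotonicity
print("contradiction exhibited: no strictly increasing monoid map has f(2)=2, f(3)=9")
```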

Posted by: Todd Trimble on February 16, 2017 9:28 PM

Re: Functional Equations II: Shannon Entropy

Thanks, Todd. I’d never actually looked at Erdős’s paper until you linked to it. What I know about this theorem, I learned from Rényi’s paper “On measures of entropy and information” and Khinchin’s book Mathematical Foundations of Information Theory.

Rényi states the result with the stronger multiplicativity hypothesis (i.e. for all arguments, not just coprime pairs), and with the hypothesis about limits (not monotonicity). Then he writes:

This lemma was first proved by Erdős [reference to the paper you cite]. In fact Erdős proved somewhat more, namely he supposed, as is usual in number theory, the validity of (1.2) [$f(n m) = f(n) + f(m)$] only for $n$ and $m$ being relatively prime. Later the lemma was rediscovered by Fadeev. The proofs of both Erdős and Fadeev are rather complicated. In this section we give a new proof of the lemma, which is much simpler.

The proof takes about a page and a third. I worked through it a few years ago and don’t remember it being particularly hard.

The reason I put the question mark next to Erdős’s name is that although I knew it was correct to attribute to him the stronger result (with the weaker hypothesis involving coprime pairs), I wasn’t sure about the weaker result. But Rényi evidently believes that it’s due to Erdős, and I see now that that seems to be the consensus.

With the monotonicity hypothesis rather than the limit hypothesis, the result is much easier. By chance, I worked through a proof today, using Khinchin’s book. It looks very similar to the argument in your comment, although as it’s presented a bit differently, it’s not immediately obvious whether it really is the same.

Posted by: Tom Leinster on February 17, 2017 4:25 AM

Re: Functional Equations II: Shannon Entropy

We’ve had discussions a couple of years back on how the functional equation for binary entropy is intimately related to cocycle conditions. It may be interesting to know that this has also been investigated by Kontsevich in the appendix to this paper. Furthermore, Kontsevich notes that the same functional equation(s) that characterize(s) binary entropy also admit(s) a unique up-to-scalar solution within the space of functions $\mathbb{Z}/p\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$ for a prime $p$, given by $H(x) = \sum_{k=1}^{p-1} \frac{x^k}{k}$. Over $\mathbb{R}$, this would be the truncated power series for $-\log(1-x)$.
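Out of curiosity, the mod-$p$ claim is easy to brute-force for a small prime. Below I take the “fundamental equation of information theory”, $H(x) + (1-x) H\bigl(\frac{y}{1-x}\bigr) = H(y) + (1-y) H\bigl(\frac{x}{1-y}\bigr)$, together with the symmetry $H(x) = H(1-x)$, as the equations in question — that is my reading, not a quote from the paper (requires Python 3.8+ for modular inverses via `pow`):

```python
p = 5

def H(x):
    # Kontsevich's truncated logarithm: sum_{k=1}^{p-1} x^k / k over Z/pZ.
    return sum(pow(x, k, p) * pow(k, -1, p) for k in range(1, p)) % p

inv = lambda a: pow(a % p, -1, p)   # modular inverse in Z/pZ

# Symmetry: H(x) = H(1 - x) for all x.
assert all(H(x) == H((1 - x) % p) for x in range(p))

# Fundamental equation, wherever the divisions make sense (x, y != 1):
for x in range(p):
    for y in range(p):
        if x == 1 or y == 1:
            continue
        lhs = (H(x) + (1 - x) * H(y * inv(1 - x) % p)) % p
        rhs = (H(y) + (1 - y) * H(x * inv(1 - y) % p)) % p
        assert lhs == rhs

print("all checks pass for p =", p)
```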

(Hat tip to Pierre Baudot for the pointer.)

Posted by: Tobias Fritz on February 21, 2017 1:27 PM
