## February 14, 2017

### Functional Equations II: Shannon Entropy

#### Posted by Tom Leinster In the second instalment of the functional equations course that I’m teaching, I introduced Shannon entropy. I also showed that up to a constant factor, it’s uniquely characterized by a functional equation that it satisfies: the chain rule.

Notes for the course so far are here. For a quick summary of today’s session, read on.

You can read the full story in the notes, but here I’ll state the main result as concisely as I can.

For $n \geq 0$, let $\Delta_n$ denote the set of probability distributions $\mathbf{p} = (p_1, \ldots, p_n)$ on $\{1, \ldots, n\}$. The Shannon entropy of $\mathbf{p} \in \Delta_n$ is

$H(\mathbf{p}) = - \sum_{i \colon p_i \gt 0} p_i \log p_i.$

Now, given

$\mathbf{w} \in \Delta_n, \,\, \mathbf{p}^1 \in \Delta_{k_1}, \ldots, \mathbf{p}^n \in \Delta_{k_n},$

we obtain a composite distribution

$\mathbf{w} \circ (\mathbf{p}^1, \ldots, \mathbf{p}^n) = (w_1 p^1_1, \ldots, w_1 p^1_{k_1}, \, \ldots, \, w_n p^n_1, \ldots, w_n p^n_{k_n}).$

The chain rule for $H$ states that

$H(\mathbf{w} \circ (\mathbf{p}^1, \ldots, \mathbf{p}^n)) = H(\mathbf{w}) + \sum_{i = 1}^n w_i H(\mathbf{p}^i).$

So, $(H: \Delta_n \to \mathbb{R}^+)_{n \geq 1}$ is a sequence of continuous functions satisfying the chain rule. Clearly, the same is true of any nonnegative scalar multiple of $H$.

Theorem (Faddeev, 1956)   The only sequences of continuous functions $(\Delta_n \to \mathbb{R}^+)_{n \geq 1}$ satisfying the chain rule are the scalar multiples of entropy.

One interesting aspect of the proof is where the difficulty lies. Let $I: \Delta_n \to \mathbb{R}^+$ be continuous functions satisfying the chain rule; we have to show that $I$ is proportional to $H$. All the effort and ingenuity goes into showing that $I$ is proportional to $H$ when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant $c$ such that

$I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n)$

for all $n \geq 1$. But once that’s done, showing that $I(\mathbf{p}) = c H(\mathbf{p})$ is a pushover. The notes show you how!

Posted at February 14, 2017 11:06 PM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2939

### Re: Functional Equations II: Shannon Entropy

So who is your audience for this course?

Posted by: Todd Trimble on February 15, 2017 1:28 AM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

It’s pretty varied. Some staff, some PhD students, some undergraduates. Some analysts, some mathematical physicists, some topology and algebra people, some mathematical biologists, some people who I don’t know. And one baby.

Posted by: Tom Leinster on February 15, 2017 1:34 AM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

By the way, here’s something that bothers me a little. The unique characterization theorem above doesn’t require symmetry. Of course, it’s true that entropy is symmetric, in the sense that

$H(p_{\sigma(1)}, \ldots, p_{\sigma(n)}) = H(p_1, \ldots, p_n)$

for any $\sigma \in S_n$. But unless I’ve made a mistake, this isn’t actually needed to prove the theorem.

I was a bit freaked out when I first realized this, because it seemed to me near-unthinkable that symmetry wouldn’t be a basic requirement. If you have a bijection between two finite sets and probability distributions on each that match up under the bijection, then of course their entropies should be equal. But apparently that isn’t needed in the proof. I don’t know what to make of this.

Historically, the waters are muddied by the fact that Faddeev himself did assume symmetry. But I think that’s only because rather than assuming the general form of the chain rule, he just assumed the special case

$H(p_1, \ldots, p_{n - 1}, t p_n, (1 - t)p_n) = H(p_1, \ldots, p_n) + p_n H(t, 1 - t).$

An easy induction shows that this special case of the chain rule, together with symmetry, proves the general case. And I believe that’s all he needs symmetry for.

Posted by: Tom Leinster on February 15, 2017 1:48 AM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

Well, looking over the nice proof from your notes, it all seems to hinge on the all-important identity

$p \circ (u_{k_1}, u_{k_2}, \ldots, u_{k_n}) = u_k$

in the case where $p = (\frac{k_1}{k}, \frac{k_2}{k}, \ldots, \frac{k_n}{k})$ is a rational point of the simplex. So that would seem to be the thing one needs to feel in one’s bones, and how it relates to symmetry.

As for myself: I don’t really know what to make of this chain rule. The usual chain rule from calculus really says the derivative is a functor in a suitable sense – in my favorite formulation, the tangent bundle functor. But this chain rule I don’t have much feeling for, yet.

Posted by: Todd Trimble on February 15, 2017 1:27 PM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

Oh, and of course speaking of derivatives, one can’t help but notice that the functional equation satisfied by $f(p) = -p\log p$, viz.

$f(x y) = f(x) y + x f(y)$

looks just like the product rule for derivatives

$D(f g) = D(f) g + f D(g)$

and so… what gives?

(From an “nPOV”, the product rule is explained by the fact that the derivative = tangent bundle functor $D = (-)^D$ preserves products. That is, the product of two functions $f, g: \mathbb{R} \to \mathbb{R}$ is a composite

$\mathbb{R} \stackrel{\Delta}{\to} \mathbb{R} \times \mathbb{R} \stackrel{f \times g}{\to} \mathbb{R} \times \mathbb{R} \stackrel{\cdot}{\to} \mathbb{R}$

and the product rule follows by combining product-preservation of $D$ with the fact that the derivative of the bilinear multiplication map $\cdot: (x, y) \mapsto x y$ is

$\left(\array{y \\ x}\right): T_x(\mathbb{R}) \times T_y(\mathbb{R}) \to T_{x y}(\mathbb{R})$

at the level of tangent spaces. Is there any chance of some analogous story for $p \mapsto -p\log p$?)

Posted by: Todd Trimble on February 15, 2017 4:55 PM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

No time for a proper reply now, but there was some conversation about this derivation-like property of $x \mapsto -x \log x$ back here:

Category-theoretic characterizations of entropy II

Just search for the word “derivation” in both pages and you’ll find a bunch of thoughts. I think we did advance our understanding, but (speaking for myself) there’s plenty I don’t understand.

One quick observation: $f: x \mapsto -x\log x$ is, up to a constant factor, the only measurable function such that

$f(x y) = x f(y) + f(x) y$

for all $x, y \gt 0$. To see this, just substitute $g(x) = f(x)/x$ and use what’s known about Cauchy’s functional equation.

Posted by: Tom Leinster on February 15, 2017 5:12 PM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

As it happens, there’s a formula inverting (sort-of) the (planar nonsymetric)operad $\circ$ on interiors: $(...w_j..., p , q , ... v_k ... ) = (\Sigma w, p+q, \Sigma v) \circ (\frac{w}{\Sigma w},\frac{(p,q)}{p+q},\frac{v}{\Sigma v})$ to which applying the chain rule gives  I(…wj…,p,q,…vk…) = I(\Sigma w,p+q,\Sigma v) + (\Sigma w)I(w/\Sigma w) + (\Sigma v) I(v/\Sigma v) + (p+q)I(\frac{p}{p+q},\frac{q}{p+q}) $so that symmetry overall reduces to symmetry on$\Delta_1$. And ... I think that's where Todd's remark kicks in? (I'm reminded of how the *anti*-symmetry of determinants is derived from multilinearity and the shearing rule, because$$\left(\begin{array}{rr}0 & 1 \\ -1 & 0\end{array}\right) = \left(\begin{array}{rr}1 & 0 \\ 1 & 1\end{array}\right) \left(\begin{array}{rr}1 & -1 \\ 0 & 1\end{array}\right) \left(\begin{array}{rr}1 & 0 \\ 1 & 1\end{array}\right)$\$ /)

Posted by: Jesse C. McKeown on February 15, 2017 2:58 PM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

Dear keepers… typo: I transposed something there in my parenthetical.

Posted by: Jesse C. McKeown on February 15, 2017 3:00 PM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

Todd, there’s always so much to respond to in your comments. Each one requires serious thought, and so many of them are gold mines!

So I’ll just respond to one particular point here, and even this will be a long comment.

As for myself: I don’t really know what to make of this chain rule. The usual chain rule from calculus really says the derivative is a functor in a suitable sense - in my favorite formulation, the tangent bundle functor. But this chain rule I don’t have much feeling for, yet.

Here’s the best I can do. A longer version of the following is in this post, but here I’ll try to say this as directly and briefly as possible. Which turns out to be “not very”.

There is a (topological, symmetric) operad $\Delta$ whose $n$-ary operations are the probability distributions on $n$ elements, with the composition described. A $\Delta$-algebra in $Set$ is something like a convex algebra, but the law

$\lambda x + (1 - \lambda) x = x$

is not required to hold (and indeed can’t be expressed operadically). We can consider $\Delta$-algebras in $Cat$ instead of $Set$ — or better, in $Cat(Top)$ (topological categories). To keep the overhead down, I’ll suppress the “$Top$” bit in what follows, but it really should be there — that is, everything should be continuous.

One $\Delta$-algebra in $Cat$ is $FinProb$, the category of finite sets equipped with a probability distribution and measure-preserving transformations between them. The $\Delta$-algebra structure is given like this: if you have a probability distribution $p$ on a set $X$, and another distribution $p'$ on $X'$, then for every $\lambda \in [0, 1]$ you get a distribution $\lambda p \oplus (1 - \lambda) p'$ on $X \amalg X'$. I haven’t defined the notation, but I’m sure you know what I mean.

Whenever $C$ is a categorical $\Delta$-algebra (by which I mean a $\Delta$-algebra in $Cat$), you can talk about internal $\Delta$-algebras in $C$. This is just like the fact that whenever $M$ is a monoidal category, you can talk about internal monoids in $M$. (In one case we’re using the operad $\Delta$; in the other we’re using the terminal operad.) A $\Delta$-algebra in $C$ consists of an object $X$ together with a map $\lambda X \oplus (1 - \lambda) X \to X$ for each $\lambda \in [0, 1]$, satisfying some axioms.

For instance, in $FinProb$ there is an internal $\Delta$-algebra consisting of the one-element set $1$ together with the unique probability distribution on it. It’s an internal $\Delta$-algebra in a unique way.

The monoidal category $D$ of finite totally ordered sets has inside it the monoid $1$ (the one-element set). As is well known, this is the initial example of a monoidal category containing a monoid. It’s the “free monoidal category on a monoid”. So, for an arbitrary monoidal category $A$, the (strict) monoidal functors $D \to A$ are in canonical bijection with the monoids in $A$.

In exactly the same sense, $FinProb$ together with the internal $\Delta$-algebra $1$ is the free categorical $\Delta$-algebra on an internal $\Delta$-algebra. So, for an arbitrary categorical $\Delta$-algebra $A$, the structure-preserving functors $FinProb \to A$ are in canonical bijection with the $\Delta$-algebras in $A$.

(There was a white lie just then. It’s not quite FinProb that has this universal property; it’s a near-identical category that I called $FP$ in that previous post. The difference lies in some delicacy involving zero probabilities. I’ll ignore it here.)

Now, the monoid $(\mathbb{R}^+, +, 0)$ is a convex algebra in the obvious way (it’s a convex set!), and when you view this monoid as a one-object category, it becomes a categorical $\Delta$-algebra. The internal $\Delta$-algebras in $\mathbb{R}^+$ are, therefore, in canonical bijection with the structure-preserving functors $FinProb \to \mathbb{R}^+$.

Now, at last, we get to the chain rule. What is a structure-preserving functor $F \colon FinProb \to \mathbb{R}^+$? It’s a way of associating to each measure-preserving map $\theta$ a real number $F(\theta)$. For instance, given any probability distribution $p$ on a finite set $X$, there is a unique map

$!_p \colon (X, p) \to (1, u_1)$

(where $u_1$ means the unique distribution on $1$), which gives rise to a number $F(!_p) \in \mathbb{R}^+$. So, among other things, $F$ assigns to each probability distribution a real number. Let’s write $I(p) = F(!_p)$.

But $F$ has to be a structure-preserving functor. Here “structure-preserving” refers to the $\Delta$-algebra structure. Crucial observation: any measure-preserving map $\theta \colon (X, p) \to (Y, q)$ between finite probability spaces can be written as a convex combination (over $y \in Y$) of maps to the one-point space. Since $F$ is required to preserve that convex combination, $F$ is entirely determined by $I$. In other words, $F$ is really just a way of assigning a number to each probability distribution — satisfying axioms, which we’re not finished with yet.

The one requirement on $F$ that we have left to consider is actual functoriality: preservation of composition. Rather than consider the general case, let me take just the example you mention. Take a probability distribution with rational probabilities,

$p = \Bigl( \frac{k_1}{k}, \ldots, \frac{k_n}{k} \Bigr)$

where $k_i \in \mathbb{Z}$ and $\sum k_i = k$. This gives rise to a composable pair of arrows in $FinProb$:

$(k, u_k) \stackrel{\pi}{\to} (n, p) \stackrel{\psi}{\to} (1, u_1)$

where $k$ means the set $\{1, \ldots, k\}$, where $u_k$ is the uniform distribution on $k$ elements, where $\pi$ is the map that partitions $k$ into pieces of size $k_1, \ldots, k_n$, and where $\psi$ is the only thing it could possibly be. Let’s apply $F$ to this composite $\psi \circ \pi$.

By what I just said about convex combinations, $\pi$ is a convex combination of maps into the terminal space:

$\pi = \bigoplus_{i = 1}^n p_i \Bigl( (k_i, u_{k_i}) \to (1, u_1) \Bigr).$

Since $F$ is structure-preserving,

$F(\pi) = \sum_{i = 1}^n p_i I(u_{k_i}).$

Now looking at the second map in our composable pair, $\psi$, we have $F(\psi) = I(p)$ simply by definition of $I$. Hence

$F(\psi) + F(\pi) = I(p) + \sum_{i = 1}^n p_i I(u_{k_i}).$

But $F: FinProb \to \mathbb{R}^+$ is a functor, and $+$ is composition in $\mathbb{R}^+$, so

$F(\psi) + F(\pi) = F(\psi \circ \pi).$

And $F(\psi \circ \pi) = I(u_k)$, again by definition of $I$. So

$I(p) + \sum_{i = 1}^n p_i I(u_{k_i}) = I(u_k).$

That’s exactly the chain rule! At least, it’s the chain rule for this particular example of an operadic composite,

$u_k = p \circ (u_{k_1}, \ldots, u_{k_n}).$

But everything I just said works equally well for any other composite.

So at last, we’ve seen how the chain rule expresses the fact that some functor preserves composition.

Posted by: Tom Leinster on February 16, 2017 12:20 AM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy Reading the course notes, I was a bit intrigued by the result on multiplicative functions that was tentatively attributed to Erdős, that in one version says than a monotone (= nondecreasing) multiplicative monoid map $f: \mathbb{N}_+ \to \mathbb{R}_+$ must be of the form $f(n) = n^\alpha$ where $\alpha \gt 0$.

First, it is correct to attribute this to Erdős; it’s in

* P. Erdős, On the distribution function of additive functions, Ann. of Math. (2) 47 (1946), 1-20. MR7,416c; Zentralblatt 61,79. (link)

Although it should be said he proved stronger statements, one of which is where the condition on $f$ being a monoid map is weakened to mere multiplicativity: recall that in number theory, a function $f$ on the positive integers is multiplicative if $f(m n) = f(m)f(n)$ whenever $m, n$ are relatively prime.

The statement that monotone multiplicative functions are power functions has been gradually simplified over the course of a number of papers (including one co-authored by Lambek). A relatively simple proof that strictly increasing multiplicative functions are power functions was given by Everett Howe, here.

But getting back to Tom’s statement: there was a somewhat contentious Mathematics StackExchange thread about this, here. Removing all the fuss and slightly rewording some details, it goes like this. Let $f: \mathbb{N}_+ \to \mathbb{R}_+$ be a monotone monoid map.

Step 1: We show that $f$ is either the constant function $1$ or is strictly increasing. Suppose there are $a, b \in \mathbb{N}_+$ with $a \lt b$ and $f(a) = f(b)$. Let $p$ be any prime; we can find integers $i, j$ with

$\log_p a \lt \frac{i}{j} \lt \frac{i+1}{j} \lt \log_p b$

so that $a^j \lt p^i \lt p^{i+1} \lt b^j$. Applying the monotone monoid map $f$ to the last set of inequalities, we get

$f(a)^j \leq f(p)^i \leq f(p)^{i+1} \leq f(b)^j$

but since $f(a) = f(b)$, we conclude $f(p)^i = f(p)^{i+1}$ and thus $f(p) = 1$. But such $p$ generate $\mathbb{N}_+$.

Step 2: If $f: \mathbb{N}_+ \to \mathbb{R}_+$ is a strictly increasing monoid map were not of the form $f(n) = n^\alpha$, then $n \mapsto \frac{\log f(n)}{\log n}$ ($n \gt 1$) is not constant, and so there are $a, b \gt 1$ with $\frac{\log f(a)}{\log a} \lt \frac{\log f(b)}{\log b}$. Abbreviate the left side to $x$ and the right side to $y$, so $f(a) = a^x$ and $f(b) = b^y$ and $\frac{x}{y} \lt 1$. There are positive integers $i, j$ such that

$\frac{x}{y} \frac{\log b}{\log a} \lt \frac{i}{j} \lt \frac{\log b}{\log a}$

whence

$i \log a \lt j \log b \qquad and \qquad j x \log b \lt i y \log a$

or

$a^i \lt b^j \qquad and \qquad b^{j y} \lt a^{i x}.$

Applying the strictly increasing monoid map $f$ to the first inequality, we infer

$a^{i x} = f(a)^i \lt f(b)^j = b^{j y}$

which contradicts the second inequality. Thus a strictly increasing monoid map must be of the form $n \mapsto n^\alpha$, with $\alpha \gt 0$.

Posted by: Todd Trimble on February 16, 2017 9:28 PM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

Thanks, Todd. I’d never actually looked at Erdős’s paper until you linked to it. What I know about this theorem, I learned from Rényi’s paper “On measures of entropy and information” and Khinchin’s book Mathematical Foundations of Information Theory.

Rényi states the result with the stronger multiplicativity hypothesis (i.e. for all arguments, not just coprime pairs), and with the hypothesis about limits (not monotonicity). Then he writes:

This lemma was first proved by Erdős [reference to the paper you cite]. In fact Erdős proved somewhat more, namely he supposed, as is usual in number theory, the validity of (1.2) [$f(n m) = f(n) + f(m)$] only for $n$ and $m$ being relatively prime. Later the lemma was rediscovered by Fadeev. The proofs of both Erdős and Fadeev are rather complicated. In this section we give a new proof of the lemma, which is much simpler.

The proof takes about a page and a third. I worked through it a few years ago and don’t remember it being particularly hard.

The reason I put the question mark next to Erdős’s name is that although I knew it was correct to attribute to him the stronger result (with the weaker hypothesis involving coprime pairs), I wasn’t sure about the weaker result. But Rényi evidently believes that it’s due to Erdős, and I see now that that seems to be the consensus.

With the monotonicity hypothesis rather than the limit hypothesis, the result is much easier. By chance, I worked through a proof today, using Khinchin’s book. It looks very similar to the argument in your comment, although as it’s presented a bit differently, it’s not immediately obvious whether it really is the same.

Posted by: Tom Leinster on February 17, 2017 4:25 AM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

I discovered something nice yesterday.

Previously, I’d been aware of two different hypotheses on a multiplicative function $f \colon \mathbb{Z}^+ \to \mathbb{R}^+$ guaranteeing that it’s a power function:

• That it’s (non-strictly) monotone;

• That $f(n + 1) - f(n) \to 0$ as $n \to \infty$.

But yesterday I discovered that there’s a single hypothesis, weaker than both of these ones, that still implies the result:

• $\lim\inf_{n \to \infty} (f(n + 1) - f(n)) \geq 0$.

I found this in section 0.4 of the book by Aczél and Daróczy. They say it was stated without proof by Erdős in 1957, then proved independently by Kátai and by Máté, both in 1967.

When I say “multiplicative” here, I mean it in the weaker number-theoretic sense that $f(m n) = f(m)f(n)$ for coprime $m$ and $n$. Aczél and Daróczy give the proof only when $f$ is multiplicative in the stronger sense that the equation holds for all $m$ and $n$, but they hint that it’s not substantially different.

Posted by: Tom Leinster on April 20, 2017 10:42 AM | Permalink | Reply to this

### Re: Functional Equations II: Shannon Entropy

We’ve had discussions a couple of years back on how the functional equation for binary entropy is intimately related to cocycle conditions. It may be interesting to know that this has also been investigated by Kontsevich in the appendix to this paper. Furthermore, Kontsevich notes that the same functional equation(s) that characterize(s) binary entropy also has a unique up-to-scalar solution within the space of functions $\mathbb{Z}/p\mathbb{Z}\to\mathbb{Z}/p\mathbb{Z}$ for a prime $p$, given by by $H(x) = \sum_{k=1}^{p-1} \tfrac{x^k}{k}$. Over $\mathbb{R}$, this would be the truncated power series for $-log(1-x)$.

(Hat tip to Pierre Baudot for the pointer.)

Posted by: Tobias Fritz on February 21, 2017 1:27 PM | Permalink | Reply to this

Post a New Comment