Entropies vs. Means

Posted by Tom Leinster

$MathML-enabled post (click for more details).$

If you’ve been watching this blog, you can’t help but have noticed the current entropy-fest. It started on John’s blog Azimuth, generated a lengthy new page on John’s patch of the nLab, and led to first this entry at the Café, then this one.

Things have got pretty unruly. It’s good unruliness, in the same way that brainstorming is good, but in this post I want to do something to help those of us who are confused by the sheer mass of concepts, questions and results—which I suspect is all of us.

I want to describe a particular aspect of the geography of this landscape of ideas. Specifically, I’ll describe some connections between the concepts of entropy and mean.

This can be thought of as background to the project of finding categorical characterizations of entropy. There will be almost no category theory in this post.

$MathML-enabled post (click for more details).$

I’ll begin by describing the most vague and the most superficial connections between entropy and means. Then I’ll build up to a more substantial connection that appeared in the comments on the first Café post, finishing with a connection that we haven’t seen here before.

Something vague I’m interested in measures of size. This has occupied a large part of my mathematical life for the last few years. Means aren’t exactly a measure of size, but they almost are: the mean number of cameras owned by a citizen of Cameroon is the size of the set of Cameroonian cameras, divided by the size of the population. So I naturally got interested in means: see these two posts, for instance. On the other hand, entropy is also a kind of size measure, as I argued in these two posts. So the two concepts were already somewhat connected in my mind.

Something superficial All I want to say here is: look at the definitions! Just look at them!

So, I’d better give you these definitions.

Basic definitions I’ll write

$\mathbf{P}_n = \{ (p_1, \ldots, p_n) \in [0, \infty)^n | \sum p_i = 1 \}$ (which previously I’ve written as $\Delta_n$ ). For each $t \in [-\infty, \infty]$ , the power mean of order $t$ is the function

$M_t: \mathbf{P}_n \times [0, \infty)^n \to [0, \infty)$

defined for $t \neq -\infty, 0, \infty$ by

$M_t(\mathbf{p}, \mathbf{x}) = \Bigl( \sum_{i: p_i \gt 0} p_i x_i^t \Bigr)^{1/t}.$

Think of this as an average of $x_1, \ldots, x_n$ , weighted by $p_1, \ldots, p_n$ . The three exceptional values of $t$ are handled by taking limits: $M_t(\mathbf{p}, \mathbf{x}) = \begin{cases} min x_i &if t = -\infty\\ \prod x_i^{p_i} &if t = 0\\ max x_i &if t = \infty. \end{cases}$

The minimum, product and maximum are, like the sum, taken over all $i$ such that $p_i \gt 0$ . I’ll generally assume that $t \neq -\infty, 0, \infty$ ; these cases never cause trouble. So: the only definition you need to pay attention to is the one for generic $t$ .

Now for a definition of entropy… almost. Actually, I’m going to work with the closely related notion of diversity. For $q \in [-\infty, \infty]$ , the diversity of order $q$ is the map

$D_q: \mathbf{P}_n \to [0, \infty)$

defined by

$D_q(\mathbf{p}) = \Bigl( \sum_{i: p_i \gt 0} p_i^q \Bigr)^{1/(1 - q)}$

for $q \neq -\infty, 1, \infty$ , and again by taking limits in the exceptional cases:

$D_q(\mathbf{p}) = \begin{cases} min (1/p_i) &if q = -\infty \\ \prod p_i^{-p_i} &if q = 1\\ max (1/p_i) &if q = \infty \end{cases}$

where again the min, product and max are over all $i$ such that $p_i \gt 0$ .

The name ‘diversity’ originates from an ecological application. We think of $\mathbf{p} = (p_1, \ldots, p_n)$ as representing a community of $n$ species in proportions $p_1, \ldots, p_n$ , and $D_q(\mathbf{p})$ as a measure of that community’s biodiversity. Different values of the parameter $q$ represent different opinions on how much importance should be assigned to rare or common species. (Newspaper stories on biodiversity typically focus on threats to rare species, but the balance of common species is also important for the healthy functioning of an ecosystem as a whole.) Theoretical ecologists often call $D_q$ the Hill number of order $q$ .

Now, many of you know $D_q$ not as ‘diversity’, but as Rényi extropy. I’d like to advocate the name ‘diversity’.

First, diversity is a fundamental concept and deserves a simple name. It’s much more general than just something from ecology: it applies whenever you have a collection of things divided into classes.

Second, ‘Rényi extropy’ is a terribly off-putting name. It assumes you already understand entropy (itself a significant task), then that you understand Rényi entropy (whose meaning you couldn’t possibly guess since it’s named after a person), and then that you’re familiar with the half-jokey usage of ‘extropy’ to mean the exponential of entropy. In contrast, diversity is something that can be understood directly, without knowing about entropy of any kind.

An enormously important property of diversity is that it is an effective number. This means that the value it assigns to the uniform distribution on a set is the cardinality of that set:

$D_q(1/n, \ldots, 1/n) = n.$

This is what distinguishes diversity from the various other functions of $\sum p_i^q$ that get used (e.g. Rényi entropy and the entropy variously named after Havrda, Charvát, Daróczy, Patil, Taillie and Tsallis). I recently gave a little explanation of why effective numbers are so important, and I gave a different explanation (using terminology differently) in this post on entropy, diversity and cardinality.

Something superficial, continued Let me now go back to my superficial reason for thinking that means will be useful in the study of entropy and diversity. I declared: just look at the formulas! There’s an obvious resemblance. And in particular, look what happens in the definition of power mean when you put $\mathbf{x} = \mathbf{p}$ and $t = q - 1$ :

$M_{q - 1}(\mathbf{p}, \mathbf{p}) = \Bigl( \sum p_i^q \Bigr)^{1/(q - 1)} = 1/D_q(\mathbf{p}).$

This reminds me of some other things. To study quadratic forms $x \mapsto x^* A x$ , it’s really helpful to study the associated bilinear forms $(x, y) \mapsto x^* A y$ . Or, similarly, you’ll often be able to prove more about a Banach space if you know it’s a Hilbert space.

Moreover, there are reasons for thinking that something quite significant is going on in the step ‘put $\mathbf{x} = \mathbf{p}$ ’. I suspect that fundamentally, $\mathbf{x}$ is a function on $\{1, \ldots, n\}$ , but $\mathbf{p}$ is a measure. By equating them we’re really taking advantage of the finiteness of our sets. For more general sets or spaces, we might need to keep $\mathbf{p}$ and $\mathbf{x}$ separate.

Something substantial To explain this more substantial connection between diversity and means, I first need to explain how the simplices $\mathbf{P}_n$ form an operad.

If you know what an operad is, it’s enough for me to tell you that any convex subset $X$ of $\mathbb{R}^n$ is naturally a $\mathbf{P}$ -algebra via the action

$\mathbf{p}(x_1, \ldots, x_n) = \sum p_i x_i$

( $\mathbf{p} \in \mathbf{P}_n, x_i \in X$ ). That should enable you to work out what the composition in $\mathbf{P}$ must be.

If you don’t know what an operad is, all you need to know is the following. An operad structure on the sequence of sets $(\mathbf{P}_n)_{n \in \mathbb{N}}$ consists of a choice of map

$\mathbf{P}_n \times \mathbf{P}_{k_1} \times \cdots \times \mathbf{P}_{k_n} \to \mathbf{P}_{k_1 + \cdots + k_n}$

for each $n, k_1, \ldots, k_n \in \mathbb{N}$ , satisfying some axioms. The map is written

$(\mathbf{p}, \mathbf{r}_1, \ldots, \mathbf{r}_n) \mapsto \mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)$

and called composition. The particular operad structure that I have in mind has its composition defined by

$\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n) = \bigl( p_1 r_{1 1}, \ldots, p_1 r_{1 k_1},   \ldots,   p_n r_{n 1}, \ldots, p_n r_{n k_n} \bigr).$

So the composite is obtained by putting the probability distributions $\mathbf{r}_1, \ldots, \mathbf{r}_n$ side by side, weighting them by $p_1, \ldots, p_n$ respectively.

Here’s the formula for the diversity of a composite:

$D_q(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = \Bigl( \sum_{i: p_i \gt 0} p_i^q D_q(\mathbf{r}_i)^{1 - q} \Bigr)^{1/(1 - q)}.$

Notice that the diversity of a composite depends only on $\mathbf{p}$ and the diversities $D_q(\mathbf{r}_i)$ , not on the distributions $\mathbf{r}_i$ themselves. Pushing that thought, you might hope that it wouldn’t depend on $\mathbf{p}$ itself, only its diversity; but it’s not to be.

(Here I’m assuming that $q \neq -\infty, 1, \infty$ . I’ll let you work out those cases, or you can find them here. And you should take what I say about the case $q \lt 0$ with a pinch of salt; I haven’t paid much attention to it.)

Digressing briefly, this expression can be written as a mean:

$D_q(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = M_{1 - q}(\mathbf{p}, D_q(\mathbf{r}_\bullet)/\mathbf{p})$

where $D_q(\mathbf{r}_\bullet)/\mathbf{p}$ is the vector with $i$ th component $D_q(\mathbf{r}_i)/p_i$ . I call this a digression because I don’t know whether this is a useful observation. It’s a different connection between the diversity of a composite and means that I want to point out here.

To explain that connection, I need a couple more bits of terminology. The partition function of a probability distribution $\mathbf{p}$ is the function

$Z(\mathbf{p}): \mathbb{R} \to (0, \infty)$

defined by

$Z(\mathbf{p})(q) = \sum_{i: p_i \gt 0} p_i^q.$

Any probability distribution $\mathbf{p}$ belongs to a one-parameter family $\bigl(\mathbf{p}^{(q)} \bigr)_{q \in \mathbb{R}}$ of probability distributions, defined by

$p^{(q)}_i = p_i^q/Z(\mathbf{p})(q)$

where $\mathbf{p}^{(q)} = (p^{(q)}_1, \ldots, p^{(q)}_n)$ . These are sometimes called the escort distributions of $\mathbf{p}$ . (In particular, $\mathbf{p}^{(1)} = \mathbf{p}$ , so there’s something especially convenient about the case $q = 1$ .)

A small amount of elementary algebra tells us that the diversity of a composite can be re-expressed as follows:

$D_q(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D_q(\mathbf{p}) \cdot M_{1 - q}\bigl(\mathbf{p}^{(q)}, D_q(\mathbf{r}_\bullet)\bigr).$

This is the connection I’ve been building up to: the diversity of a composite expressed in terms of a power mean.

To understand this further, think of a large ecological community spread over several islands, with the special feature that no species can be found on more than one island. The distribution $\mathbf{p}$ gives the relative sizes of the total populations on the different islands, and the distribution $\mathbf{r}_i$ gives the relative abundances of the various species on the $i$ th island.

Now, the formula tells us the diversity of the composite community in terms of the diversities of the islands and their relative sizes. More exactly, it expresses it as a product of two factors: the diversity between the islands ( $D_q(\mathbf{p})$ ), and the average diversity within the islands ( $M_{1 - q}(\ldots)$ ).

Something new …where ‘new’ is in the sense of ‘new to this conversation’, not ‘new to the world’.

We’ve just seen how, for each real number $q$ , the diversity $D_q$ of a composite $\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)$ decomposes as a product of two factors. The first factor is the diversity of $\mathbf{p}$ . The second is some kind of mean of the diversities of the $\mathbf{r}_i$ s, weighted by a distribution depending on $\mathbf{p}$ .

We know this because we have a formula for $D_q$ . But what if we take the description in the previous paragraph as axiomatic? In other words, suppose that we have for each $n \in \mathbb{N}$ functions

$D: \mathbf{P}_n \to (0, \infty), \quad \hat{ }: \mathbf{P}_n \to \mathbf{P}_n,$

and some kind of ‘mean operation’ $M$ , satisfying

$D(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D(\mathbf{p}) \cdot M(\hat{\mathbf{p}}, D(\mathbf{r}_\bullet)).$

What does this tell us about $D$ , $M$ and $\hat{ }$ ? Could it even be that it forces $D = D_q$ , $M = M_{1 - q}$ and $\hat{ } = ( )^{(q)}$ for some $q$ ?

Well, it depends what you mean by ‘mean’. But that’s a subject that’s been well raked over, and there are several axiomatic characterizations of the power means out there. So let me skip that part of the question and assume immediately that $M = M_{1 - q}$ for some $q \in (0, \infty)$ .

So now we’ve decided what our mean operation is, but we still have an undetermined thing called ‘diversity’ and an undetermined operation $\hat{ }$ for turning one probability distribution into another. All we have by way of constraints is the equation above for the diversity of a composite, and perhaps we’ll also allow ourselves some further basic assumptions on diversity, such as continuity.

The theorem is that these meagre assumptions are enough to determine diversity uniquely.

Theorem (Routledge) Let $q \in (0, \infty)$ . Let

$\bigl( D: \mathbf{P}_n \to (0, \infty) \bigr)_{n \in \mathbb{N}}, \quad \bigl( \hat{ }: \mathbf{P}_n \to \mathbf{P}_n \bigr)_{n \in \mathbb{N}}$

be families of functions such that

$D$ is an effective number

$D$ is symmetric

$D$ is continuous

$D(\mathbf{p}\circ(\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D(\mathbf{p}) \cdot M_{1 - q}(\hat{\mathbf{p}}, D(\mathbf{r}_\bullet))$ for all $\mathbf{p}, \mathbf{r}_1, \ldots, \mathbf{r}_n$ .

Then $D = D_q$ and $\hat{ } = ( )^{(q)}$ .

This result appeared in

R. D. Routledge, Diversity indices: which ones are admissible? Journal of Theoretical Biology 76 (1979), 503–515.

And the moral is: diversity, hence entropy, can be uniquely characterized using means.

Postscript This theorem is closer to the basic concerns of ecology than you might imagine. When a geographical area is divided into several zones, you can ask how much of the biological diversity of the area should be attributed to variation between the zones, and how much to variation within the zones. This is very like our island scenario above, but more complicated, since the same species may be present in multiple zones.

Ecologists talk about $\alpha$ -diversity (the average within-zone diversity), $\beta$ -diversity (the diversity between the zones), and $\gamma$ -diversity (the global diversity, i.e. that of the whole community). The concept of $\beta$ -diversity can play a part in conservation decisions. For example, if the $\beta$ -diversity of our area is perceived or measured to be low, that means that some of the zones are quite similar to each other. In that case, it might not be important to conserve all of them: resources can be concentrated on just a few.

The theorem tells us something about how $\alpha$ -, $\beta$ - and $\gamma$ -diversity must be defined if simple and desirable properties are to hold. This story reached a definitive end in a quite recent paper:

Lou Jost, Partitioning diversity into independent alpha and beta components, Ecology 88 (2007), 2427–2439.

But Jost’s paper takes us beyond what we’re currently doing, so I’ll leave it there for now.

Posted at May 10, 2011 7:10 AM UTC

TrackBack URL for this Entry: https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2397

Re: Entropies vs. Means

Thank you, Tom. I have been reading the Renyi Entropy posts feeling more and more lost. But ‘Diversity’ has helped me regain orientation.

Posted by: Roger Witte on May 10, 2011 8:57 AM | Permalink | Reply to this

This is good stuff!

So do you think that the formula for $D_q(p\circ (r_1,\ldots,r_n))$ can also be formulated categorically in terms of lax points for general $q$ , like you already did in the $q=1$ case?

(I was about to link to our notes, but somehow currently these only contain a section on the partition function…?)

Posted by: Tobias Fritz on May 10, 2011 12:32 PM | Permalink | Reply to this

Tobias wrote:

I was busy editing our notes, and there are certain typos that can make everything after the typo not show up at all. I fixed that. Now someone is editing our notes and I can’t continue working on them. I guess that’s you!

I am starting to organize the material up to including ‘Shannon entropy’ in a more systematic way. I will move the section ‘Rényi entropy’ further down, because I think a self-contained story about convex algebras, partition functions and Shannon entropy is starting to emerge which does not need to involve Rényi entropy. This self-contained story may also include some of your ideas on ‘the basic inequalities of information theory’.

I hope and believe that Rényi entropy (and/or related concepts) will show up naturally in an expanded version of this story. However, about 100 times more people understand Shannon entropy than Rényi entropy. It’s about 100 times as important, too. And even without Rényi entropy, our paper will be fairly intimidating to most people thanks to the use of monads, operads, ideas from statistical mechanics, and so on. So, I’m leaning towards a first paper that leaves out Rényi ideas. We could then try to ‘Rényify’ things in a second paper, if we want.

Posted by: John Baez on May 10, 2011 1:19 PM | Permalink | Reply to this

Thanks, Tobias. To answer your question, I don’t know whether the lax point result can be generalized to arbitrary $q$ . I’ve thought about it a little bit, without seeing a way forward, but I haven’t tried to push it hard.

Posted by: Tom Leinster on May 11, 2011 12:25 AM | Permalink | Reply to this

Tom wrote:

Well, it depends what you mean by ‘mean’.

I think you’re providing Bill Clinton with some serious competition here!

Posted by: John Baez on May 10, 2011 1:28 PM | Permalink | Reply to this

Nice explanation, thank you! One question: if $\sum p_i = 1$ and each $p_i\ge 0$ , then it seems unlikely to me that any of the $p_i$ s could be greater than $1$ ; so did you really mean to write $[0,\infty)^n$ in the definition of $\mathbf{P}$ ?

Posted by: Mike Shulman on May 10, 2011 11:52 PM | Permalink | Reply to this

I did! Presumably you’re just saying it looks strange, in the same way that it would look strange to have used $[0, 7]$ instead (although it would make no difference). I suppose what I first had in mind was the expression

$\{ \mathbf{p} \in \mathbb{R}^n | p_i \geq 0, \sum p_i = 1\}.$

Then I realized that it was wasteful to say “ $\mathbf{p} \in \mathbb{R}^n$ ” and “ $p_i \geq 0$ ” separately, so I merged them into one: $\mathbf{p} \in [0, \infty)^n$ .

Posted by: Tom Leinster on May 11, 2011 12:19 AM | Permalink | Reply to this

The n-Category Café

Skip to the Main Content

May 10, 2011