### Entropies *vs.* Means

#### Posted by Tom Leinster

If you’ve been watching this blog, you can’t help but have noticed the current entropy-fest. It started on John’s blog Azimuth, generated a lengthy new page on John’s patch of the nLab, and led to first this entry at the Café, then this one.

Things have got pretty unruly. It’s *good* unruliness, in the same way
that brainstorming is good, but in this post I want to do something to help
those of us who are confused by the sheer mass of concepts, questions and
results—which I suspect is all of us.

I want to describe a particular aspect of the geography of this landscape of
ideas. Specifically, I’ll describe some connections between the concepts of
*entropy* and *mean*.

This can be thought of as background to the project of finding
*categorical* characterizations of entropy.
There will be almost no category theory in this post.

I’ll begin by describing the most vague and the most superficial connections between entropy and means. Then I’ll build up to a more substantial connection that appeared in the comments on the first Café post, finishing with a connection that we haven’t seen here before.

**Something vague** I’m interested in measures of size. This has
occupied a large part of my
mathematical life for the last few years. Means aren’t exactly a measure of
size, but they almost are: the mean number of cameras owned by a citizen of
Cameroon is the size of the set of Cameroonian cameras, divided by the size of
the population. So I naturally got interested in means: see these two posts, for instance. On the other hand,
entropy is also a kind of size measure, as I argued in these
two posts. So the two concepts were already somewhat connected in
my mind.

**Something superficial** All I want to say here is: look at the
definitions! Just look at them!

So, I’d better give you these definitions.

**Basic definitions** I’ll write

$\mathbf{P}_n = \{ (p_1, \ldots, p_n) \in [0, \infty)^n \mid \sum p_i = 1 \}$

(which previously I’ve written as $\Delta_n$). For each $t \in [-\infty, \infty]$, the **power mean** of order $t$ is the function

$M_t: \mathbf{P}_n \times [0, \infty)^n \to [0, \infty)$

defined for $t \neq -\infty, 0, \infty$ by

$M_t(\mathbf{p}, \mathbf{x}) = \Bigl( \sum_{i: p_i \gt 0} p_i x_i^t \Bigr)^{1/t}.$

Think of this as an average of $x_1, \ldots, x_n$, weighted by $p_1, \ldots, p_n$. The three exceptional values of $t$ are handled by taking limits:

$M_t(\mathbf{p}, \mathbf{x}) = \begin{cases} \min x_i &\text{if } t = -\infty\\ \prod x_i^{p_i} &\text{if } t = 0\\ \max x_i &\text{if } t = \infty. \end{cases}$

The minimum, product and maximum are, like the sum, taken over all $i$ such that $p_i \gt 0$. I’ll generally assume that $t \neq -\infty, 0, \infty$; the exceptional cases never cause trouble. So the only definition you need to pay attention to is the one for generic $t$.
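To make the definition concrete, here is a minimal Python sketch of the power mean. The function name `power_mean` and the sample vectors are illustrative choices, not anything from the original discussion; the code simply follows the formulas above, skipping indices with $p_i = 0$.

```python
import math

def power_mean(p, x, t):
    """Power mean M_t(p, x); indices with p_i == 0 are skipped, as in the text."""
    pairs = [(pi, xi) for pi, xi in zip(p, x) if pi > 0]
    if t == -math.inf:
        return min(xi for _, xi in pairs)
    if t == math.inf:
        return max(xi for _, xi in pairs)
    if t == 0:
        # Limiting case: weighted geometric mean, prod x_i^{p_i}.
        return math.prod(xi ** pi for pi, xi in pairs)
    return sum(pi * xi ** t for pi, xi in pairs) ** (1 / t)

# Hypothetical sample data; the last entry has weight 0 and is ignored.
p = [0.5, 0.25, 0.25, 0.0]
x = [1.0, 2.0, 4.0, 100.0]
print(power_mean(p, x, 1))    # weighted arithmetic mean
print(power_mean(p, x, 0))    # weighted geometric mean
print(power_mean(p, x, -1))   # weighted harmonic mean
```

With these inputs the means increase with $t$, from the minimum at $t = -\infty$ to the maximum at $t = \infty$.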

Now for a definition of entropy… almost. Actually, I’m going to work with
the closely related notion of *diversity*. For $q \in [-\infty,
\infty]$, the **diversity of order $q$** is the map

$D_q: \mathbf{P}_n \to [0, \infty)$

defined by

$D_q(\mathbf{p}) = \Bigl( \sum_{i: p_i \gt 0} p_i^q \Bigr)^{1/(1 - q)}$

for $q \neq -\infty, 1, \infty$, and again by taking limits in the exceptional cases:

$D_q(\mathbf{p}) = \begin{cases} \max (1/p_i) &\text{if } q = -\infty \\ \prod p_i^{-p_i} &\text{if } q = 1\\ \min (1/p_i) &\text{if } q = \infty \end{cases}$

where again the min, product and max are over all $i$ such that $p_i \gt 0$.
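As an illustration, the diversity of finite order can be computed directly from the formula. This is a sketch only: the name `diversity` and the sample community are assumptions, and the cases $q = \pm\infty$ are deliberately left out.

```python
import math

def diversity(p, q):
    """Diversity (Hill number) D_q(p) for finite q; only p_i > 0 contribute."""
    if math.isinf(q):
        raise ValueError("this sketch only covers finite q")
    support = [pi for pi in p if pi > 0]
    if q == 1:
        # Limiting case: prod p_i^{-p_i}, the exponential of Shannon entropy.
        return math.prod(pi ** -pi for pi in support)
    return sum(pi ** q for pi in support) ** (1 / (1 - q))

# Hypothetical community of three species.
p = [0.5, 0.25, 0.25]
print(diversity(p, 0))   # number of species present
print(diversity(p, 1))   # exponential of Shannon entropy
print(diversity(p, 2))   # reciprocal of sum of squares
```

For this community the diversity drops as $q$ grows, reflecting the increasing emphasis on the one common species.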

The name ‘diversity’ originates from an ecological application. We think
of $\mathbf{p} = (p_1, \ldots, p_n)$ as representing a community of $n$ species
in proportions $p_1, \ldots, p_n$, and $D_q(\mathbf{p})$ as a measure of
that community’s biodiversity. Different values of the parameter $q$ represent
different opinions on how much importance should be assigned to rare or common
species. (Newspaper stories on biodiversity typically focus on threats to rare
species, but the balance of common species is also important for the healthy
functioning of an ecosystem as a whole.) Theoretical ecologists often call
$D_q$ the **Hill number** of order $q$.

Now, many of you know $D_q$ not as ‘diversity’, but as Rényi extropy. I’d like to advocate the name ‘diversity’.

First, diversity is a fundamental concept and deserves a simple name. It’s much more general than just something from ecology: it applies whenever you have a collection of things divided into classes.

Second, ‘Rényi extropy’ is a terribly off-putting name. It assumes you already
understand entropy (itself a significant task), then that you understand
*Rényi* entropy (whose meaning you couldn’t possibly guess since it’s
named after a person), and then that you’re familiar with the half-jokey usage
of ‘extropy’ to mean the exponential of entropy. In contrast, diversity is
something that can be understood directly, without
knowing about entropy of any kind.

An enormously important property of diversity is that it is an **effective
number**. This means that the value it assigns to the uniform distribution
on a set is the cardinality of that set:

$D_q(1/n, \ldots, 1/n) = n.$

This is what distinguishes diversity from the various other functions of $\sum p_i^q$ that get used (e.g. Rényi entropy and the entropy variously named after Havrda, Charvát, Daróczy, Patil, Taillie and Tsallis). I recently gave a little explanation of why effective numbers are so important, and I gave a different explanation (using terminology differently) in this post on entropy, diversity and cardinality.
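As a quick check of the effective number property from the definition, take $q \neq -\infty, 1, \infty$. Each of the $n$ terms in the sum equals $n^{-q}$, so

$D_q(1/n, \ldots, 1/n) = \bigl( n \cdot n^{-q} \bigr)^{1/(1 - q)} = \bigl( n^{1 - q} \bigr)^{1/(1 - q)} = n.$

For $q = 1$ the product formula gives $\prod_{i = 1}^n (1/n)^{-1/n} = n^{n \cdot (1/n)} = n$ as well.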

**Something superficial, continued** Let me now go back to my
superficial reason for thinking that means will be useful in the study of
entropy and diversity. I declared: just look at the formulas! There’s an obvious resemblance. And in
particular,
look what happens in the definition of power mean when you put $\mathbf{x} =
\mathbf{p}$ and $t = q - 1$:

$M_{q - 1}(\mathbf{p}, \mathbf{p}) = \Bigl( \sum p_i^q \Bigr)^{1/(q - 1)} = 1/D_q(\mathbf{p}).$

This reminds me of some other things. To study quadratic forms $x \mapsto x^* A x$, it’s really helpful to study the associated bilinear forms $(x, y) \mapsto x^* A y$. Or, similarly, you’ll often be able to prove more about a Banach space if you know it’s a Hilbert space.

Moreover, there are reasons for thinking that something quite significant is going on in the step ‘put $\mathbf{x} = \mathbf{p}$’. I suspect that fundamentally, $\mathbf{x}$ is a function on $\{1, \ldots, n\}$, but $\mathbf{p}$ is a measure. By equating them we’re really taking advantage of the finiteness of our sets. For more general sets or spaces, we might need to keep $\mathbf{p}$ and $\mathbf{x}$ separate.

**Something substantial** To explain this more substantial connection between diversity and means, I first need to explain how
the simplices $\mathbf{P}_n$ form an operad.

If you know what an operad is, it’s enough for me to tell you that any convex subset $X$ of $\mathbb{R}^n$ is naturally a $\mathbf{P}$-algebra via the action

$\mathbf{p}(x_1, \ldots, x_n) = \sum p_i x_i$

($\mathbf{p} \in \mathbf{P}_n, x_i \in X$). That should enable you to work out what the composition in $\mathbf{P}$ must be.

If you don’t know what an operad is, all you need to know is the following. An operad structure on the sequence of sets $(\mathbf{P}_n)_{n \in \mathbb{N}}$ consists of a choice of map

$\mathbf{P}_n \times \mathbf{P}_{k_1} \times \cdots \times \mathbf{P}_{k_n} \to \mathbf{P}_{k_1 + \cdots + k_n}$

for each $n, k_1, \ldots, k_n \in \mathbb{N}$, satisfying some axioms. The map is written

$(\mathbf{p}, \mathbf{r}_1, \ldots, \mathbf{r}_n) \mapsto \mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)$

and called **composition**. The particular operad structure that I have in
mind has its composition defined by

$\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n) = \bigl( p_1 r_{1 1}, \ldots, p_1 r_{1 k_1}, \ldots, p_n r_{n 1}, \ldots, p_n r_{n k_n} \bigr).$

So the composite is obtained by putting the probability distributions $\mathbf{r}_1, \ldots, \mathbf{r}_n$ side by side, weighting them by $p_1, \ldots, p_n$ respectively.
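In code, the composition amounts to scaling and concatenating. The following is a small sketch, with `compose` a hypothetical name and the distributions chosen only for illustration.

```python
def compose(p, rs):
    """Composite of p with (r_1, ..., r_n): scale each r_i by p_i, then concatenate."""
    if len(p) != len(rs):
        raise ValueError("need exactly one distribution r_i per entry of p")
    return [pi * rij for pi, ri in zip(p, rs) for rij in ri]

# Hypothetical example: p in P_2, r_1 in P_2, r_2 in P_1.
p = [0.5, 0.5]
rs = [[0.5, 0.5], [1.0]]
composite = compose(p, rs)
print(composite)  # [0.25, 0.25, 0.5]
```

The result lies in $\mathbf{P}_{k_1 + \cdots + k_n}$: it has $2 + 1 = 3$ entries and they sum to $1$.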

Here’s the formula for the diversity of a composite:

$D_q(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = \Bigl( \sum_{i: p_i \gt 0} p_i^q D_q(\mathbf{r}_i)^{1 - q} \Bigr)^{1/(1 - q)}.$

Notice that the diversity of a composite depends only on $\mathbf{p}$ and the
diversities $D_q(\mathbf{r}_i)$, *not* on the distributions $\mathbf{r}_i$
themselves. Pushing that thought, you might hope that it wouldn’t depend on
$\mathbf{p}$ itself, only its diversity; but it’s not to be.

(Here I’m assuming that $q \neq -\infty, 1, \infty$. I’ll let you work out those cases, or you can find them here. And you should take what I say about the case $q \lt 0$ with a pinch of salt; I haven’t paid much attention to it.)
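For readers who want to see where the formula for the diversity of a composite comes from, here is a short derivation sketch for generic $q$. The entries of the composite are the products $p_i r_{i j}$, so

$\sum_{i, j} (p_i r_{i j})^q = \sum_i p_i^q \sum_j r_{i j}^q = \sum_i p_i^q D_q(\mathbf{r}_i)^{1 - q},$

where the sums run over the indices with $p_i \gt 0$ and $r_{i j} \gt 0$, and the last step uses $\sum_j r_{i j}^q = D_q(\mathbf{r}_i)^{1 - q}$, which is the definition of $D_q$ rearranged. Raising both sides to the power $1/(1 - q)$ gives the formula.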

Digressing briefly, this expression can be written as a mean:

$D_q(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = M_{1 - q}(\mathbf{p}, D_q(\mathbf{r}_\bullet)/\mathbf{p})$

where $D_q(\mathbf{r}_\bullet)/\mathbf{p}$ is the vector with $i$th component
$D_q(\mathbf{r}_i)/p_i$. I call this a digression because I don’t know whether this is a useful
observation. It’s a *different* connection between the diversity of a
composite and means that I want to point out here.

To explain that connection, I need a couple more bits of terminology.
The **partition function** of a probability distribution
$\mathbf{p}$ is the function

$Z(\mathbf{p}): \mathbb{R} \to (0, \infty)$

defined by

$Z(\mathbf{p})(q) = \sum_{i: p_i \gt 0} p_i^q.$

Any probability distribution $\mathbf{p}$ belongs to a one-parameter family $\bigl(\mathbf{p}^{(q)} \bigr)_{q \in \mathbb{R}}$ of probability distributions, defined by

$p^{(q)}_i = p_i^q/Z(\mathbf{p})(q)$

where $\mathbf{p}^{(q)} = (p^{(q)}_1, \ldots, p^{(q)}_n)$. These are sometimes called the **escort distributions** of $\mathbf{p}$. (In particular,
$\mathbf{p}^{(1)} = \mathbf{p}$, so there’s something especially convenient about the case $q = 1$.)
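A minimal sketch of the escort distributions in Python follows. The name `escort` is illustrative, and one assumption goes beyond the text: entries with $p_i = 0$ are kept at $0$, so that the sum over $p_i \gt 0$ in the partition function matches the normalization.

```python
def escort(p, q):
    """Escort distribution p^(q): p_i^q / Z(p)(q) where p_i > 0, and 0 elsewhere."""
    z = sum(pi ** q for pi in p if pi > 0)   # partition function Z(p)(q)
    # Assumption: zero entries stay zero (avoids 0 ** q for q <= 0).
    return [pi ** q / z if pi > 0 else 0.0 for pi in p]

# Hypothetical distribution.
p = [0.5, 0.25, 0.25]
print(escort(p, 1))   # recovers p itself
print(escort(p, 2))   # shifts weight towards the common class
print(escort(p, 0))   # uniform on the support of p
```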

A small amount of elementary algebra tells us that the diversity of a composite can be re-expressed as follows:

$D_q(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D_q(\mathbf{p}) \cdot M_{1 - q}\bigl(\mathbf{p}^{(q)}, D_q(\mathbf{r}_\bullet)\bigr).$

This is the connection I’ve been building up to: the diversity of a composite expressed in terms of a power mean.
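The elementary algebra can be sketched as follows, for generic $q$. Since $D_q(\mathbf{p}) = Z(\mathbf{p})(q)^{1/(1 - q)}$ and $p^{(q)}_i = p_i^q / Z(\mathbf{p})(q)$,

$D_q(\mathbf{p}) \cdot M_{1 - q}\bigl(\mathbf{p}^{(q)}, D_q(\mathbf{r}_\bullet)\bigr) = Z(\mathbf{p})(q)^{1/(1 - q)} \Bigl( \sum_{i: p_i \gt 0} \frac{p_i^q}{Z(\mathbf{p})(q)} D_q(\mathbf{r}_i)^{1 - q} \Bigr)^{1/(1 - q)} = \Bigl( \sum_{i: p_i \gt 0} p_i^q D_q(\mathbf{r}_i)^{1 - q} \Bigr)^{1/(1 - q)},$

because the factors of $Z(\mathbf{p})(q)$ cancel. The right-hand side is the earlier formula for the diversity of a composite.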

To understand this further, think of a large ecological community spread over several islands, with the special feature that no species can be found on more than one island. The distribution $\mathbf{p}$ gives the relative sizes of the total populations on the different islands, and the distribution $\mathbf{r}_i$ gives the relative abundances of the various species on the $i$th island.

Now, the formula tells us the diversity
of the composite community in terms of the diversities of the islands
and their
relative sizes. More exactly, it expresses it as a product of two factors: the
diversity *between* the islands ($D_q(\mathbf{p})$), and the average
diversity *within* the islands ($M_{1 - q}(\ldots)$).
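The island picture can be checked numerically. The sketch below uses made-up abundances for two islands and hypothetical helper names, and handles only generic finite $q$ (not $q = 1$). It compares the diversity of the composite community with the product of the between-island and within-island factors.

```python
def diversity(p, q):
    """D_q(p) for generic finite q (q != 1)."""
    return sum(pi ** q for pi in p if pi > 0) ** (1 / (1 - q))

def power_mean(p, x, t):
    """M_t(p, x) for generic finite t (t != 0)."""
    return sum(pi * xi ** t for pi, xi in zip(p, x) if pi > 0) ** (1 / t)

def escort(p, q):
    """Escort distribution p^(q), keeping zero entries at zero."""
    z = sum(pi ** q for pi in p if pi > 0)
    return [pi ** q / z if pi > 0 else 0.0 for pi in p]

def compose(p, rs):
    """Composite distribution: each r_i scaled by p_i, then concatenated."""
    return [pi * rij for pi, ri in zip(p, rs) for rij in ri]

def decompose(p, rs, q):
    """Return (between, within, total) diversities of order q."""
    between = diversity(p, q)
    island_diversities = [diversity(r, q) for r in rs]
    within = power_mean(escort(p, q), island_diversities, 1 - q)
    total = diversity(compose(p, rs), q)
    return between, within, total

# Made-up data: two islands, with three and two species respectively.
p = [0.6, 0.4]
rs = [[0.5, 0.3, 0.2], [0.9, 0.1]]
between, within, total = decompose(p, rs, 2)
print(between, within, between * within, total)
```

Up to floating-point rounding, the last two printed values agree, as the formula predicts.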

**Something new** …where ‘new’ is in the sense of ‘new to this
conversation’, not ‘new to the world’.

We’ve just seen how, for each real number $q$, the diversity $D_q$ of a composite $\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)$ decomposes as a product of two factors. The first factor is the diversity of $\mathbf{p}$. The second is some kind of mean of the diversities of the $\mathbf{r}_i$s, weighted by a distribution depending on $\mathbf{p}$.

We know this because we have a formula for $D_q$.
But what if we take the description in the previous paragraph as
*axiomatic*? In other words, suppose that we have for each $n \in
\mathbb{N}$ functions

$D: \mathbf{P}_n \to (0, \infty), \quad \hat{ }: \mathbf{P}_n \to \mathbf{P}_n,$

and some kind of ‘mean operation’ $M$, satisfying

$D(\mathbf{p} \circ (\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D(\mathbf{p}) \cdot M(\hat{\mathbf{p}}, D(\mathbf{r}_\bullet)).$

What does this tell us about $D$, $M$ and $\hat{ }$? Could it even be that it forces $D = D_q$, $M = M_{1 - q}$ and $\hat{ } = ( )^{(q)}$ for some $q$?

Well, it depends what you mean by ‘mean’. But that’s a subject that’s been well raked over, and there are several axiomatic characterizations of the power means out there. So let me skip that part of the question and assume immediately that $M = M_{1 - q}$ for some $q \in (0, \infty)$.

So now we’ve decided what our mean operation is, but we still have an undetermined thing called ‘diversity’ and an undetermined operation $\hat{ }$ for turning one probability distribution into another. All we have by way of constraints is the equation above for the diversity of a composite, and perhaps we’ll also allow ourselves some further basic assumptions on diversity, such as continuity.

The theorem is that these meagre assumptions are enough to determine diversity uniquely.

**Theorem (Routledge)** Let $q \in (0, \infty)$. Let

$\bigl( D: \mathbf{P}_n \to (0, \infty) \bigr)_{n \in \mathbb{N}}, \quad \bigl( \hat{ }: \mathbf{P}_n \to \mathbf{P}_n \bigr)_{n \in \mathbb{N}}$

be families of functions such that

- $D$ is an effective number
- $D$ is symmetric
- $D$ is continuous
- $D(\mathbf{p}\circ(\mathbf{r}_1, \ldots, \mathbf{r}_n)) = D(\mathbf{p}) \cdot M_{1 - q}(\hat{\mathbf{p}}, D(\mathbf{r}_\bullet))$ for all $\mathbf{p}, \mathbf{r}_1, \ldots, \mathbf{r}_n$.

Then $D = D_q$ and $\hat{ } = ( )^{(q)}$.

This result appeared in

R. D. Routledge, Diversity indices: which ones are admissible? *Journal of Theoretical Biology* 76 (1979), 503–515.

And the moral is: diversity, hence entropy, can be uniquely characterized using means.

**Postscript**
This theorem is closer to the basic concerns of ecology than you might imagine.
When a
geographical area is divided into several zones, you can ask how much of the
biological diversity of the area should be attributed to variation *between* the
zones, and how much to variation *within* the zones. This is very like
our island scenario above, but more complicated, since the same species may
be present in multiple zones.

Ecologists talk about $\alpha$-diversity (the **a**verage within-zone diversity), $\beta$-diversity (the diversity **b**etween the
zones), and $\gamma$-diversity (the **g**lobal diversity, i.e. that of
the whole community). The concept of $\beta$-diversity can play a part in
conservation decisions. For example, if the $\beta$-diversity of our
area is perceived or measured to be low, that means that some of the zones are
quite similar to each other. In that case, it might not be important to
conserve all of them: resources can be concentrated on just a few.

The theorem tells us something about how $\alpha$-, $\beta$- and $\gamma$-diversity must be defined if simple and desirable properties are to hold. This story reached a definitive end in a quite recent paper:

Lou Jost, Partitioning diversity into independent alpha and beta components, *Ecology* 88 (2007), 2427–2439.

But Jost’s paper takes us beyond what we’re currently doing, so I’ll leave it there for now.

## Re: Entropies vs. Means

Thank you, Tom. I have been reading the Renyi Entropy posts feeling more and more lost. But ‘Diversity’ has helped me regain orientation.