Entropy, Diversity and Cardinality (Part 1)
Posted by David Corfield
Guest post by Tom Leinster
This is the first of two posts about
- the difficult problem of how to quantify biodiversity
- the concept of the cardinality of a metric space.
The connection is provided by that important and subtle notion, entropy.
The ideas I’ll present depend crucially on the insights of two people. First, André Joyal explained to me the connection between cardinality and entropy. Then, Christina Cobbold told me about the connection between entropy and biodiversity, and suggested that there might be a direct link between measures of biodiversity and the cardinality of a metric space. She was more right than she knew: it turns out that the cardinality of a metric space, which I’d believed to be a new concept coming from enriched category theory, was discovered 15 years ago by ecologists!
Outline
Suppose you’re a scientist investigating the impact of human activities on the biodiversity of some particular ecosystem: say, the forests of Indonesia. To do this rigorously, you’ll need some way of quantifying biodiversity. In other words, you’ll need a way of taking a raw mass of ecological data and turning it into a single number.
Being an open-minded and well-educated scientist, you’d be happy in principle if your ‘number’ lay in some number system of a non-traditional kind (e.g. an abstract rig), but you see that there are certain advantages to sticking with the reals. For example, you can meaningfully say things like ‘the biodiversity of the Indonesian forests has fallen by 23% in 10 years’. We’ll stick to real values here.
It turns out that there are many sensible measures of biodiversity — ecologists have been debating their relative merits for years. Two of the most important aspects of an ecosystem that you might want a diversity measure to reflect are:
- Abundance: the proportions in which the species occur (e.g. 50% grass, 30% clover, 20% daisies)
- Similarity: the extent to which the species are related (e.g. an ecosystem made up of 50% snails and 50% slugs should probably be regarded as less diverse than one made up of 50% snails and 50% bats).
This first post will be about (1) only. We’ll look at a family of diversity measures taking only abundance into account, and use it to explore notions of entropy and cardinality.
The second post will be about (1) and (2) together. There, metric spaces will make their entrance.
Diversity
Diversity is a widely applicable concept: just as you can talk about diversity of species in an ecosystem, you could also talk about diversity of types of rock on a mountain, words in a novelist’s work, etc. Nevertheless, I’ll continue to discuss it in the ecological setting. This is partly because it’s what I’ve thought about most, partly because biodiversity is important, and partly because it lends itself well to vivid imagery.
So let’s imagine an ecosystem in which $n$ species occur, in proportions $p_1, \ldots, p_n$ respectively. Thus, $p_1 + \cdots + p_n = 1$. I’ll refer to a finite family of non-negative reals summing to $1$ as a finite probability space. (This is a slight abuse of terminology, but never mind.) In the simple example above where the ecosystem consisted of just grass, clover and daisies, we had $n = 3$ and $(p_1, p_2, p_3) = (0.5, 0.3, 0.2)$.
Some decision has to be made about how exactly the proportions, or ‘relative abundances’, $p_i$ are measured. It could be done according to the number of individuals of each species, or the total mass of each species (so that an ant counts for less than an antelope), or any other measure thought to be helpful. I’ll assume that this decision has been made.
Our task now is to turn the probability space $(p_1, \ldots, p_n)$ into a single real number, representing the ‘diversity’ of the system. Here are the three ways of doing it most popular among ecologists.
0. Species richness Just count the number of species present. It’s best to then subtract $1$, since an ecosystem containing only one species (e.g. a field containing only wheat) is usefully thought of as having zero diversity. (Ecosystems containing no species at all are off the scale; recall the axiom that $\sum p_i = 1$.) So with the notation above, the species richness is defined to be $n - 1$.
This measure is not only crude, but also, statistically, very sensitive to sample size. In an ecosystem where most species are rare, even a large sample may fail to detect many species and so drastically underestimate the species richness.
1. Shannon entropy Its relevance to ecology has been described as ‘tenuous’; nevertheless, it is one of the most widely-used measures of biodiversity. The Shannon entropy, or information entropy, or information diversity, of the probability space $(p_1, \ldots, p_n)$ is $-\sum_{i = 1}^n p_i \log p_i.$ We use the convention that $x \log x = 0$ when $x = 0$, since $\lim_{x \rightarrow 0} x \log x = 0$. (Alternatively, as every category theorist knows, $0^0 = 1$, so $0 \log 0 = \log 0^0 = \log 1 = 0$.)
I’ll have much more to say about entropy later. For now, let’s just record some of its basic properties.
First, Shannon entropy is always non-negative. Second, it’s zero if and only if some $p_i$ is $1$ and the rest are $0$. In other words, when the ecosystem is made up entirely of one species, diversity is zero. Third, entropy/diversity is maximized (for a fixed $n$) when all the species occur in equal abundance: $p_1 = \cdots = p_n = 1/n$. In that case, the entropy is $\log n$.
2. Simpson diversity Another simple diversity measure born in the 1940s is Simpson diversity, $1 - \sum_{i = 1}^n p_i^2.$ Like Shannon entropy, it’s always non-negative, it’s zero if and only if some $p_i$ is $1$, and it’s maximized (for fixed $n$) when $p_1 = \cdots = p_n = 1/n$. In that case, its value is $1 - 1/n$; and as we would expect of a measure of diversity, this is an increasing function of $n$.
Simpson diversity has the advantage of being quadratic, which makes it amenable to methods of multilinear algebra. It also has certain statistical advantages (such as the existence of an unbiased estimator). And as these notes point out, it’s the probability that two randomly-chosen individuals are of different species.
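For readers who like to compute, here is a quick numerical sketch of all three measures on the grass/clover/daisies example (the variable names are mine, not standard):

```python
import math

# Relative abundances for the grass/clover/daisies example: 50%, 30%, 20%.
p = [0.5, 0.3, 0.2]

species_richness = len(p) - 1                            # n - 1
shannon = -sum(pi * math.log(pi) for pi in p if pi > 0)  # -sum p_i log p_i
simpson = 1 - sum(pi**2 for pi in p)                     # 1 - sum p_i^2

print(species_richness)   # 2
print(round(shannon, 4))  # 1.0297
print(round(simpson, 2))  # 0.62
```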
I learned about diversity measures from
Carlo Ricotta, Laszlo Szeidl, Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao’s quadratic index, Theoretical Population Biology 70 (2006), 237–243
and Chapter 7 of
Russell Lande, Steinar Engen, Bernt-Erik Sæther, Stochastic Population Dynamics in Ecology and Conservation, Oxford University Press, 2003
(both courtesy of Christina Cobbold). Unfortunately there seems to be no free copy of either online, but the Wikipedia article on diversity measures gives some basic information.
Bringing them all together The crucial observation now is that all three of these useful diversity measures are members of a continuous, one-parameter family of measures.
To understand this, it can help to think in terms of ‘surprise’. How surprised would you be if you found an ant on an ant farm? Not surprised at all. How surprised would you be if you found a Yangtze river dolphin in the Yangtze? Sadly, you should be extremely surprised. A surprise function $\sigma$ assigns to each probability $p \in [0, 1]$ a degree of surprise $\sigma(p) \in [0, \infty]$; we require $\sigma$ to be decreasing (so that more probable events are less surprising) and satisfy $\sigma(1) = 0$ (so that the occurrence of an event of probability $1$ is no surprise at all).
From any surprise function, you can obtain a measure of diversity. If the surprise function is called $\sigma$, the diversity measure assigns to a probability space $(p_1, \ldots, p_n)$ the quantity $\sum_{i = 1}^n p_i \sigma(p_i)$ — the expected surprise. ‘Expected surprise’ might sound paradoxical, but it’s not. How surprised do you expect to be tomorrow? Personally, I expect to be mildly surprised but not astonished: that’s what most of my days are like. In an ecosystem containing only one species, you’ll never be at all surprised at what you find, so your expected surprise is $0$; correspondingly, $\sum p_i \sigma(p_i) = 0$. On the other hand, imagine picking individuals at random from an ecosystem containing $10$ species in equal proportions: then you’ll always be a bit surprised at what you find. (Sometimes, as here, the ‘surprise’ metaphor seems a bit strained. You can think instead of related concepts such as unpredictability, rarity, or information content.)
Now let’s define that one-parameter family of diversity measures. First define, for each $\alpha \geq 0$, a surprise function $\sigma_\alpha: [0, 1] \to [0, \infty]$ by $\sigma_\alpha(p) = \begin{cases} \frac{1}{\alpha - 1} (1 - p^{\alpha - 1}) & \text{if } \alpha \neq 1 \\ - \log p & \text{if } \alpha = 1. \end{cases}$ The reason for the second clause is that the first doesn’t make sense when $\alpha = 1$, but $\lim_{\alpha \to 1} \frac{1}{\alpha - 1} (1 - p^{\alpha - 1}) = - \log p$ (by l’Hôpital’s rule, or by evaluating $\lim_{r \to -1} \int_1^x t^r \, dt$ in two different ways). The resulting diversity measure $D_\alpha$ is given by $D_\alpha (p_1, \ldots, p_n) = \sum_{i = 1}^n p_i \sigma_\alpha(p_i) = \begin{cases} \frac{1}{\alpha - 1} \left(1 - \sum p_i^\alpha\right) & \text{if } \alpha \neq 1 \\ - \sum p_i \log p_i & \text{if } \alpha = 1. \end{cases}$ In particular,
- $D_0(p_1, \ldots, p_n) = n - 1$, the species richness
- $D_1(p_1, \ldots, p_n) = - \sum p_i \log p_i$, the Shannon entropy
- $D_2(p_1, \ldots, p_n) = 1 - \sum p_i^2$, the Simpson diversity.
The measures $D_\alpha$ have good basic properties, at least for $\alpha > 0$. We always have $D_\alpha(p_1, \ldots, p_n) \geq 0$, with equality if and only if some $p_i$ is $1$. For fixed $n$, $D_\alpha(p_1, \ldots, p_n)$ is maximized when $p_1 = \cdots = p_n = 1/n$, and in that case its value is $\sigma_\alpha(1/n) = \begin{cases} \frac{1}{\alpha - 1} \left(1 - n^{1 - \alpha}\right) & \text{if } \alpha \neq 1 \\ \log n & \text{if } \alpha = 1. \end{cases}$ These expressions aren’t very convenient. We’ll fix that later.
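The whole family fits in a few lines of code, and the three special cases above can be checked numerically (a sketch only; the function name `D` is mine):

```python
import math

def D(alpha, p):
    """Diversity of order alpha of a finite probability space p = (p_1, ..., p_n)."""
    if alpha == 1:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1 - sum(pi**alpha for pi in p if pi > 0)) / (alpha - 1)

p = [0.5, 0.3, 0.2]
assert math.isclose(D(0, p), len(p) - 1)                  # species richness n - 1
assert math.isclose(D(2, p), 1 - sum(x**2 for x in p))    # Simpson diversity
# D(1, -) really is the limit of D(alpha, -) as alpha -> 1:
assert abs(D(1 + 1e-8, p) - D(1, p)) < 1e-6
```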
This one-parameter family of measures appears to have been discovered several times over. According to the paper of Ricotta and Szeidl cited above, it was discovered in information theory —
J. Aczél, Z. Daróczy, On Measures of Information and their Characterizations, Academic Press (1975)
— then independently in ecology —
G.P. Patil, C. Taillie, Diversity as a concept and its measurement, Journal of the American Statistical Association 77 (1982), 548–567
— and then again independently in physics —
Constantino Tsallis, Possible generalization of Boltzmann–Gibbs statistics, Journal of Statistical Physics 52 (1988), 479–487.
(I haven’t looked up these sources.) Accordingly, people in different disciplines attribute it differently; physicists, for example, seem to call it Tsallis entropy.
Sometimes these diversity measures $D_\alpha$ are referred to as ‘entropy’ of degree $\alpha$. But I want to reserve the term ‘entropy’, as I’ll explain in a moment.
Entropy
There are many related quantities called ‘entropy’. The notion appears in physics, communications engineering, statistics, linguistics, dynamical systems, …, as well as ecology; it can be thought of as measuring disorder, information content, uncertainty, uniformity, diversity, ….
Stick out your arm in the $n$-Category Café and you’ll knock over the coffee of someone who knows more about entropy than I do. Witness, for instance, this learned conversation of Ben Allen and Chris Hillman; David, in his machine learning days, used exotic-sounding related concepts such as Kullback–Leibler divergence; John and Urs doubtless know all about entropy in physics. But for now, I’ll stick humbly to the example above: the Shannon entropy $-\sum p_i \log p_i$ of a finite probability space $(p_1, \ldots, p_n)$.
An important property of Shannon entropy is that it is log-like. In other words, let $A = (p_1, \ldots, p_n)$ and $B = (q_1, \ldots, q_m)$ be finite probability spaces. There is an obvious ‘product’ space, $A \times B = (p_1 q_1, \ldots, p_1 q_m, \ldots, p_n q_1, \ldots, p_n q_m),$ and the log-like property is this: writing $H$ for Shannon entropy, $H(A \times B) = H(A) + H(B).$ This is one of the things that makes $H = D_1$ more convenient than the other diversity measures $D_\alpha$ defined above: it is only for $\alpha = 1$ that this property holds. I’ll reserve the word ‘entropy’ for log-like measures.
(The proof that $H$ is log-like is a harmless exercise, depending on two things. One is that these are probability spaces: $\sum p_i = 1$. The other is that the function $f: x \mapsto -x\log x$ is a derivation: $f(x y) = f(x)y + x f(y)$.)
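The log-like property is easy to confirm numerically. Here’s a small sketch (the example spaces are mine) building the product space and checking that entropies add:

```python
import math
from itertools import product

def H(p):
    """Shannon entropy of a finite probability space."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

A = [0.5, 0.3, 0.2]
B = [0.6, 0.4]
AxB = [pi * qj for pi, qj in product(A, B)]  # the product probability space

assert math.isclose(H(AxB), H(A) + H(B))  # H is log-like
```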
Aside: a question Here’s something I’d like to understand. Given probability spaces $A$ and $B$ as above, and given $\lambda, \mu \geq 0$ such that $\lambda + \mu = 1$, there’s a new probability space $\lambda A + \mu B = (\lambda p_1, \ldots, \lambda p_n, \mu q_1, \ldots, \mu q_m)$ and we have $H(\lambda A + \mu B) = -(\lambda \log\lambda + \mu \log\mu) + \lambda H(A) + \mu H(B).$ More generally, there is a symmetric operad $C$ given by $C_n = \Delta^{n - 1} = \{ (p_1, \ldots, p_n) \in \mathbb{R}^n : p_i \geq 0, \sum p_i = 1 \}$ with composition as follows: if $A = (p_1, \ldots, p_n), A_1 = (p_{1, 1}, \ldots, p_{1, k_1}), \ldots, A_n = (p_{n, 1}, \ldots, p_{n, k_n})$ then $A \circ (A_1, \ldots, A_n) = (p_1 p_{1, 1}, \ldots, p_1 p_{1, k_1}, \ldots, p_n p_{n, 1}, \ldots, p_n p_{n, k_n}).$ (A $C$-algebra might be called a ‘convex algebra’; for instance, any convex subset of a real vector space is naturally one.) We have $H(A \circ (A_1, \ldots, A_n)) = H(A) + \sum_{i = 1}^n p_i H(A_i).$ The question is: what’s going on here, abstractly? I’m imagining restating this equation in such a way that there is no mention of the $p_i$s, and thus, perhaps, finding a good characterization of entropy in operadic terms.
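The composition identity in the aside can also be checked on a small example. A sketch (my own names and example spaces), gluing two spaces along a third:

```python
import math

def H(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def compose(A, blocks):
    """Operadic composition A o (A_1, ..., A_n): scale each block A_i by p_i."""
    return [pi * q for pi, block in zip(A, blocks) for q in block]

A = [0.5, 0.5]
A1 = [0.7, 0.3]
A2 = [0.2, 0.3, 0.5]
C = compose(A, [A1, A2])

lhs = H(C)
rhs = H(A) + sum(pi * H(Ai) for pi, Ai in zip(A, [A1, A2]))
assert math.isclose(lhs, rhs)  # H(A o (A_1, ..., A_n)) = H(A) + sum p_i H(A_i)
```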
Cardinality
André Joyal explained to me that he likes to think of entropy as something like cardinality. More precisely, the exponential of entropy is like cardinality. To test out this point of view, let’s write $|A| = e^{H(A)} = \prod_{i = 1}^n p_i^{-p_i}$ for any finite probability space $A = (p_1, \ldots, p_n)$, and call it the cardinality of $A$.
(There are slightly different conventions for Shannon entropy. Because of its applications to digital communication, some people like to take their logarithms to base $2$; others use base $e$, as I’m doing here. Cardinality is independent of this choice.)
Now let’s translate the basic properties of entropy into properties of cardinality, to see if cardinality deserves its name. Since entropy is always non-negative, we always have $|A| \geq 1$. We have $|A| = 1$ if and only if some $p_i$ is $1$ and the rest are $0$, a situation that can be interpreted as
there is effectively only one species.
For a fixed $n$, the cardinality is maximized when $p_1 = \cdots = p_n = 1/n$, and in that case $|A| = n$; this situation can be interpreted as
all $n$ species are fully present.
Finally, the log-like property of entropy translates as $|A \times B| = |A| \times |B|$ — cardinality is multiplicative.
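Joyal’s dictionary is easy to test numerically. A sketch in Python (the function name `card` is mine), checking multiplicativity and the two extreme cases:

```python
import math
from itertools import product

def card(p):
    """|A| = e^{H(A)} = prod p_i^{-p_i}, the cardinality of a finite probability space."""
    return math.prod(pi ** (-pi) for pi in p if pi > 0)

A = [0.5, 0.3, 0.2]
B = [0.6, 0.4]
AxB = [pi * qj for pi, qj in product(A, B)]

assert math.isclose(card(AxB), card(A) * card(B))  # multiplicative
assert math.isclose(card([1/4] * 4), 4)            # uniform on n points: |A| = n
assert math.isclose(card([1.0]), 1)                # one species: |A| = 1
```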
This is all very satisfactory. Our ‘cardinality’ has the properties that one might intuitively hope for. Furthermore, it corresponds very closely to a well-known and useful measure of diversity, the Shannon entropy $H = D_1$. But this is not the only useful measure of diversity: there are, at least, all the other measures $D_\alpha$ ($\alpha \geq 0$). Is there a corresponding notion of ‘$\alpha$-cardinality’ for every $\alpha$, with the same good properties?
The answer is yes, but that’s not quite obvious. It’s no use defining the $\alpha$-cardinality of $A$ to be $e^{D_\alpha(A)}$: since $D_\alpha$ is not log-like except when $\alpha = 1$, this $\alpha$-cardinality would not be multiplicative. So that wouldn’t be a useful definition.
Let’s take stock. We’ve fixed an $\alpha \geq 0$, and we’re trying to define, for each finite probability space $A = (p_1, \ldots, p_n)$, its ‘$\alpha$-cardinality’ $|A|_\alpha$. We want to do this in such a way that:
1. $|A|_\alpha$ is a function (preferably invertible) of $D_\alpha(A)$
2. $1 \leq |A|_\alpha \leq n$
3. $|A|_\alpha = 1$ if some $p_i$ is $1$
4. $|A|_\alpha = n$ if $p_1 = \cdots = p_n = 1/n$
5. $|A \times B|_\alpha = |A|_\alpha \times |B|_\alpha$.
I’ll skip some elementary steps here, but it’s not hard to see that these requirements (in fact, (1) and (4) alone) pretty much force the answer on us. It turns out that we need $D_\alpha$ and $| |_\alpha$ to be related by the equation $D_\alpha(A) = \begin{cases} \frac{1}{\alpha - 1} \left(1 - |A|_\alpha^{1 - \alpha}\right) & \text{if } \alpha \neq 1 \\ \log |A|_1 & \text{if } \alpha = 1, \end{cases}$ and a small amount of elementary algebra then gives us the definition: $|A|_\alpha = \begin{cases} \left( \sum_{i = 1}^n p_i^\alpha \right)^{\frac{1}{1 - \alpha}} & \text{if } \alpha \neq 1 \\ \prod_{i = 1}^n p_i^{-p_i} & \text{if } \alpha = 1, \end{cases}$ the $\alpha$-cardinality of the finite probability space $A = (p_1, \ldots, p_n)$.
(The 1-cardinality is just the cardinality. Here at the $n$-Category Café, we’re used to the convention that ‘1-widget’ means the same as ‘widget’.)
It’s easy to confirm that properties (1)–(5) do indeed hold for $\alpha$-cardinality, for every $\alpha \geq 0$.
Example: $\alpha = 0$ The diversity measure $D_0$ is very simple: $D_0(A) = n - 1$. (Recall that the motivation for subtracting $1$ was to make a one-species ecosystem have diversity zero.) The $0$-cardinality of $A$ is just $n$, the number of species — obviously a useful quantity too!
Example: $\alpha = 1$ This is the motivating example: $D_1$ is the Shannon entropy, and $1$-cardinality is cardinality.
Example: $\alpha = 2$ We saw that $D_2$ is Simpson diversity: $D_2(A) = 1 - \sum p_i^2$. The $2$-cardinality of $A$ is $|A|_2 = 1/\sum p_i^2,$ which is also often used as a measurement of diversity (and also sometimes called Simpson diversity).
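The three examples can be bundled into one implementation and checked directly (a sketch; `card` is my own name for $| |_\alpha$):

```python
import math

def card(alpha, p):
    """The alpha-cardinality |A|_alpha of p = (p_1, ..., p_n)."""
    if alpha == 1:
        return math.prod(pi ** (-pi) for pi in p if pi > 0)
    return sum(pi**alpha for pi in p if pi > 0) ** (1 / (1 - alpha))

p = [0.5, 0.3, 0.2]
assert math.isclose(card(0, p), 3)                         # 0-cardinality: number of species
assert math.isclose(card(2, p), 1 / sum(x**2 for x in p))  # 2-cardinality: 1 / sum p_i^2
# a uniform distribution on n points has alpha-cardinality n, for any alpha:
assert math.isclose(card(0.7, [1/5] * 5), 5)
```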
Example: $\alpha \to \infty$ It’s easy to show that for any probability space $A$, we have $\lim_{\alpha \to \infty} D_\alpha(A) = 0$. However, if we work with cardinality rather than diversity, something more interesting happens: $\lim_{\alpha\to\infty} |A|_\alpha = 1/\max_{1 \leq i \leq n} p_i = \min_{1 \leq i \leq n} (1/p_i).$ This (or its reciprocal) is sometimes called the Berger–Parker index, and might as well be written $|A|_\infty$. It has all the good properties (2)–(5) of cardinalities.
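You can watch this limit happen numerically. In the example below (my own), the dominant species has abundance $0.5$, so $|A|_\alpha$ should approach $1/0.5 = 2$:

```python
import math

def card(alpha, p):
    """The alpha-cardinality |A|_alpha of a finite probability space."""
    if alpha == 1:
        return math.prod(pi ** (-pi) for pi in p if pi > 0)
    return sum(pi**alpha for pi in p if pi > 0) ** (1 / (1 - alpha))

p = [0.5, 0.3, 0.2]
for alpha in [10, 100, 1000]:
    print(alpha, card(alpha, p))  # creeps down towards 1/max(p) = 2

assert abs(card(1000, p) - 1 / max(p)) < 0.01
```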
Back to entropy
For some purposes it’s preferable to use a measure that’s log-like (as entropy is) rather than multiplicative (as cardinality is). For example, in information theory it’s natural to count how many bits are needed to encode a message, and that’s a log-like measure.
So, for any $\alpha \geq 0$, let’s define the $\alpha$-entropy of a finite probability space $A = (p_1, \ldots, p_n)$ to be $H_\alpha(A) = \log |A|_\alpha = \begin{cases} \frac{1}{1 - \alpha} \log \left(\sum p_i^\alpha\right) & \text{if } \alpha \neq 1 \\ - \sum p_i \log p_i & \text{if } \alpha = 1. \end{cases}$ The $\alpha$-entropy $H_\alpha$ is usually called the Rényi entropy of order $\alpha$. By construction, each $H_\alpha$ is log-like, takes its minimal value $0$ when some $p_i$ is $1$, and, for a fixed $n$, takes its maximum value $\log n$ when $p_1 = \cdots = p_n = 1/n$. The $1$-entropy $H_1$ is just the Shannon entropy $H$.
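A final numerical sketch (names mine): on a uniform space all the Rényi entropies agree, with common value $\log n$, and each one is log-like.

```python
import math

def renyi(alpha, p):
    """Renyi entropy of order alpha: H_alpha(A) = log |A|_alpha."""
    if alpha == 1:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.log(sum(pi**alpha for pi in p if pi > 0)) / (1 - alpha)

# uniform on 4 points: every Renyi entropy equals log 4
u = [1/4] * 4
for alpha in [0, 0.5, 2, 5]:
    assert math.isclose(renyi(alpha, u), math.log(4))

# log-like: H_alpha(A x B) = H_alpha(A) + H_alpha(B)
A, B = [0.5, 0.5], [0.9, 0.1]
AxB = [a * b for a in A for b in B]
assert math.isclose(renyi(2, AxB), renyi(2, A) + renyi(2, B))
```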
Summary, and preview of Part 2
We’ve been discussing ecosystems, which for the purposes of this first post are simply finite sequences $(p_1, \ldots, p_n)$ of non-negative numbers summing to $1$. We’ve seen three families of measures of ecosystems, each indexed over non-negative real numbers $\alpha$:
- The diversity measures $D_\alpha$. The values $\alpha = 0, 1, 2$ correspond to the most popular diversity measures in ecology. Generally, the measure $D_\alpha$ can be interpreted as ‘expected surprise’.
- The cardinalities $| |_\alpha$. These have excellent mathematical properties, e.g. the $\alpha$-cardinality of an $n$-species ecosystem is always between $1$ and $n$, and $\alpha$-cardinality is multiplicative.
- The Rényi entropies $H_\alpha$. Again, these have excellent mathematical properties, and like Shannon entropy (the case $\alpha = 1$), they are all log-like.
The three ways of measuring are completely interchangeable: for a fixed $\alpha$, any of the three numbers $D_\alpha(A)$, $|A|_\alpha$ and $H_\alpha(A)$ can be derived from any of the others. The formulas above tell you how.
What’s next?
Well, a weak point in everything so far is the extremely crude modelling. We’ve taken the species in our ecosystem to form a mere set: two species are either equal or not, and that’s that. But when you think about biodiversity, about the variety of life, you instinctively grasp that there’s more to it: some species are quite similar, some very different. This should influence the measurement of diversity.
In the second post I’ll explain a way of building this into the model. Specifically, the collection of species in an ecosystem will be modelled as a metric space instead of a mere set. (A set can be regarded as a metric space in which every point is distance $\infty$ from every other point.) This will lead us into connections between biodiversity, entropy and the cardinality of metric spaces.
Re: Entropy, Diversity and Cardinality (Part 1)
Do you choose your surprise function so as to arrive at this expectation of mild surprise, or is it chosen independently? Imagine someone who finds life extraordinarily unsurprising or extraordinarily surprising. These would be, I think, unpleasant situations to be in, but would we not say that their surprise functions were not well-tuned?
Or perhaps we organise our lives so that for our chosen surprise function, life is as surprising as we desire. We opt for a quiet life or a hectic one, etc.
I would suspect that a bit of both occurs.
Another ‘pathology’ we might observe is someone taking an event of a certain kind to be much less probable than we consider it to be. Perhaps being amazed to find two people in a group of 26 having the same birthday.
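(For the record, such amazement really would be miscalibrated: the probability of a shared birthday in a group of 26 is just under 60%, as a two-line computation shows, assuming 365 equally likely birthdays.)

```python
# Probability that at least two of 26 people share a birthday,
# assuming 365 equally likely birthdays and no leap years:
prob_all_distinct = 1.0
for k in range(26):
    prob_all_distinct *= (365 - k) / 365

print(round(1 - prob_all_distinct, 3))  # about 0.598
```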