### Measuring Diversity

#### Posted by Tom Leinster

Christina Cobbold and I wrote a paper on measuring biological diversity:

Tom Leinster and Christina A. Cobbold, Measuring diversity: the importance of species similarity. *Ecology*, in press (doi:10.1890/10-2402.1).

As the name of the journal suggests, our paper was written for ecologists — but mathematicians should find it pretty accessible too.

While I’m at it, I’ll mention that I’m coordinating a five-week research programme on *The Mathematics of Biodiversity*
at the Centre de Recerca Matemàtica, Barcelona,
next summer. It includes a one-week exploratory conference (2–6 July 2012), to which
everyone interested is warmly welcome.

In a moment, I’ll start talking about organisms and species. But don’t be fooled: mathematically, none of this is intrinsically about biology. That’s why this post is called “Measuring diversity”, not “Measuring biological diversity”. You could apply it in many other ways, or not apply it at all, as you’ll see.

It’s an example of what Jordan Ellenberg has amusingly called applied
pure math. I think that’s a joke in *slightly* poor taste, because I don’t want
to surrender the term “applied math” to those who basically use it to mean
“applied differential equations”. Nevertheless, I suspect we’re on the same side.

Long-time patrons of the Café may remember a pair of posts in 2008 on entropy, diversity and cardinality. But those were long posts, a long time ago, and there’s a lot about them that I’d change now. So I’ll start afresh.

Imagine a ‘community’ of organisms — the
fish in a lake, the fungi in a forest, or the bacteria on your skin. We divide
them into $S$ groups, conventionally called **species**, though they needn’t
be species in the ordinary sense. (The division of organisms into species is somewhat arbitrary, which is a problem, though it’s less of a problem with the approach presented here than with many previous approaches.)

We then record two things about the community. First:

**Relative abundances.** The relative frequencies, or abundances, of the species form a probability distribution $p = (p_1, \ldots, p_S)$ on $\{1, \ldots, S\}$. Here $p_i$ is the proportion of the total population belonging to species $i$, where ‘proportion’ is measured in any way you think sensible (number of individuals, total mass, etc).

Note that we only record *relative* abundances, not *absolute* abundances. As it’s usually used, the word *diversity* denotes an intensive quantity. If nine-tenths of a forest is destroyed, it might be a terrible thing, but on the (unrealistic) assumption that all the flora and fauna in the forest are distributed homogeneously, it doesn’t actually cause a decrease in biodiversity.

The second thing we record is:

**Similarities.** The similarity between each pair of species is measured by a real number between $0$ and $1$, with $0$ denoting total dissimilarity and $1$ denoting identical species. Writing the similarity between the $i$th and $j$th species as $Z_{i j}$, this gives an $S \times S$ matrix $Z$ with entries in $[0, 1]$. Our only assumption on $Z$ is that its diagonal entries are all $1$: every species is identical to itself.

There are many approaches to measuring inter-species similarity, of which probably the most familiar is genetic, as in ‘you share 98% of your DNA with a chimpanzee’. Different measures of similarity will produce different measures of diversity.

Sometimes one has, instead of a measure of inter-species *similarity* (measured on a scale of 0 to 1), a measure of inter-species *distance* (measured on a scale of 0 to $\infty$). Distances $d_{i j}$ can be converted into similarities $Z_{i j}$ by the transformation $Z_{i j} = e^{-d_{i j}}$, or more generally by $Z_{i j} = e^{-t d_{i j}}$ for some positive scale factor $t$. That’s not the only transformation you can use, but it has some good mathematical properties.
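As a small illustration of the transformation $Z_{i j} = e^{-t d_{i j}}$, here is a minimal Python sketch. The function name `similarity_from_distance` and the distance values are made up for the example; they are not taken from the paper.

```python
import math

def similarity_from_distance(d, t=1.0):
    """Convert distances d[i][j] >= 0 into similarities exp(-t * d[i][j])."""
    if t <= 0:
        raise ValueError("scale factor t must be positive")
    return [[math.exp(-t * d_ij) for d_ij in row] for row in d]

# Hypothetical distances between three species (illustrative values only).
d = [
    [0.0, 0.5, 3.0],
    [0.5, 0.0, 3.0],
    [3.0, 3.0, 0.0],
]
Z = similarity_from_distance(d, t=1.0)
for row in Z:
    print([round(z, 3) for z in row])
```

Zero distance gives similarity $1$, so the diagonal of $Z$ is automatically all $1$s, and larger values of $t$ push the off-diagonal entries towards $0$.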

What we have to do now is take this data and turn it into a single number, measuring the diversity of the community. Actually, it’s not going to be quite as simple as that… but let’s take it one step at a time.

The similarities form an $S \times S$ matrix $Z$, and the relative abundances can be regarded as forming an $S$-dimensional column vector $p$. So, we get an $S$-dimensional column vector $Z p$, whose $i$th entry is

$(Z p)_i = \sum_j Z_{i j} p_j.$

This is the expected similarity between an individual of the $i$th species and an individual chosen at random. It therefore measures the ‘ordinariness’, or lack of distinctiveness, of that individual.
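To make the ordinariness vector concrete, here is a minimal sketch in plain Python. The helper name `ordinariness`, the matrix and the abundances are all invented for illustration.

```python
def ordinariness(Z, p):
    """Return the vector Zp, whose i-th entry is sum_j Z[i][j] * p[j]."""
    return [sum(z_ij * p_j for z_ij, p_j in zip(row, p)) for row in Z]

# Hypothetical community: species 0 and 1 are similar, species 2 is unlike both.
Z = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
p = [0.5, 0.3, 0.2]

Zp = ordinariness(Z, p)
print([round(x, 2) for x in Zp])
```

In this made-up example the third species has the smallest entry of $Z p$: it is both rare and unlike the others, so its individuals are the least ordinary.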

The average ordinariness of an individual in the community is, then,

$\sum_i p_i (Z p)_i.$

This is greatest if the community is concentrated into a few very similar
species. Economists have used the word **concentration** for quantities
like this. Now, we’re after a measure of diversity, which should be
*inversely* related to concentration. So we could define the diversity of
the community as the reciprocal of the concentration:

$1/\sum_i p_i (Z p)_i.$
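The reciprocal-of-concentration formula can be sketched in a few lines of Python. The name `quadratic_diversity` and the numbers below are assumptions chosen for the example, not data from the paper.

```python
def quadratic_diversity(Z, p):
    """Reciprocal of the average ordinariness sum_i p_i (Zp)_i."""
    S = len(p)
    concentration = sum(p[i] * Z[i][j] * p[j] for i in range(S) for j in range(S))
    return 1.0 / concentration

# Hypothetical three-species community (illustrative numbers only).
Z_similar = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
# Same community, but pretending that distinct species share nothing.
Z_unrelated = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
p = [0.5, 0.3, 0.2]

print(quadratic_diversity(Z_similar, p))    # lower: two of the species are alike
print(quadratic_diversity(Z_unrelated, p))  # higher: every species counts as distinct
```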

This turns out to be a good measure of diversity. But it’s not the only good one.

Why not? I’ll give two explanations: one mathematical, one ecological.

Mathematically, the point is that when I wrote down the formula for the
‘average’ ordinariness, I neglected the fact that there are many good notions
of average. In particular, there are the power means. For real $t \neq 0$,
the **power mean** of numbers $x_1, \ldots, x_S \geq 0$,
weighted by a probability distribution $p_1, \ldots, p_S$, is got by
transforming each $x_i$ into $x_i^t$, then forming their ordinary mean weighted by
the $p_i$s, then applying the inverse transformation. In other words, it’s

$\Bigl( \sum_i p_i x_i^t \Bigr)^{1/t}.$
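Here is a minimal Python sketch of the weighted power mean. The function name `power_mean` is made up for the example, and two things go beyond the formula above: the sketch assumes every $x_i$ with positive weight is strictly positive, and it treats order $t = 0$ as the limiting case, the weighted geometric mean.

```python
import math

def power_mean(xs, ps, t):
    """Power mean of order t of the numbers xs, weighted by probabilities ps.

    Assumes xs[i] > 0 wherever ps[i] > 0. Order t = 0 is handled as the
    limiting case (weighted geometric mean), which the formula itself omits.
    """
    pairs = [(x, p) for x, p in zip(xs, ps) if p > 0]
    if t == 0:
        return math.exp(sum(p * math.log(x) for x, p in pairs))
    return sum(p * x ** t for x, p in pairs) ** (1.0 / t)

xs = [1.0, 4.0]
ps = [0.5, 0.5]
for t in (-1, 0, 1, 2):
    print(t, round(power_mean(xs, ps, t), 3))
```

For these inputs, $t = -1$, $0$ and $1$ give the harmonic, geometric and arithmetic means (1.6, 2.0 and 2.5), and the value increases with $t$.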

We’ll apply this with $x_i = (Z p)_i$. For reasons I won’t explain, I’ll shift the indexing by putting $t = q - 1$, and I’ll restrict to $q \geq 0$. So, the average ordinariness ‘of order $q$’ is

$\Bigl( \sum_i p_i (Z p)_i^{q - 1} \Bigr)^{1/(q - 1)}.$

This is a measure of concentration. Its reciprocal is

$D_q^Z(p) = \Bigl( \sum_i p_i (Z p)_i^{q - 1} \Bigr)^{1/(1 - q)}.$

And that, by definition, is the **diversity of order $q$** of the community. The diversity measure we arrived at above was the case $q = 2$, and is called the quadratic
diversity, since it’s the reciprocal of a quadratic form.

The formula for $D_q^Z(p)$ doesn’t make sense for $q = 1$ or $q = \infty$, but you can easily make sense of it by taking limits. Doing this leads to the definitions

$D_1^Z(p) = \prod_i (Z p)_i^{-p_i}$

and

$D_\infty^Z(p) = 1/\max_i (Z p)_i.$

Technical note: in order for everything to be well-defined, you have to take the sums, the product and the max to be over only those values of $i$ for which $p_i \gt 0$ (that is, over only the species that are actually present).
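Putting the definitions together, here is a minimal Python sketch of $D_q^Z(p)$ that covers the limiting cases $q = 1$ and $q = \infty$ and skips absent species. The function name `diversity` and the example data are assumptions made for illustration; this is not the authors' code.

```python
import math

def diversity(Z, p, q):
    """Diversity of order q (0 <= q <= inf) for similarities Z and abundances p."""
    S = len(p)
    # Only species that are actually present (p_i > 0) enter the formulas.
    present = [i for i in range(S) if p[i] > 0]
    Zp = {i: sum(Z[i][j] * p[j] for j in range(S)) for i in present}
    if q == 1:
        # Limit q -> 1: the product of (Zp)_i ** (-p_i), computed via logarithms.
        return math.exp(-sum(p[i] * math.log(Zp[i]) for i in present))
    if q == math.inf:
        return 1.0 / max(Zp[i] for i in present)
    return sum(p[i] * Zp[i] ** (q - 1) for i in present) ** (1.0 / (1.0 - q))

# Hypothetical community (illustrative numbers only).
Z = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
p = [0.5, 0.3, 0.2]
for q in (0, 1, 2, math.inf):
    print(q, round(diversity(Z, p, q), 3))
```

For this example the printed values decrease as $q$ grows, and evaluating at $q = 1.000001$ gives almost the same number as the $q = 1$ branch, which is what defining $D_1^Z$ as a limit leads one to expect.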

So, we’ve got not just *one* measure of diversity, but a *one-parameter family* of
them:

$(D_q)_{q \geq 0}.$

Ecologically, this spectrum of diversity measures corresponds to a spectrum of
viewpoints on what diversity *is*. Consider two bird communities. The first looks like this:

It contains four species, one of which makes up most of the population, and three of which are quite rare. The second community looks like this:

It has only three species, but they’re evenly balanced.

Now, which community is more diverse? It’s a matter of opinion. Or, if you like, it’s a matter of how you interpret the word ‘diverse’. Usually in the mainstream press, and often in scholarly articles too, ‘biodiversity’ is used as a synonym for ‘number of species present’. On this count, the first community is more diverse. But if you’re mostly concerned with the functioning of the whole community, the role of rare species might not be particularly important: maybe your primary concern is that no species is too dominant, and on that score, the second community wins.

Varying the parameter $q$ corresponds to varying your viewpoint. Specifically, $q$ controls how *little* emphasis you place on rare
species. So the graphs of $D_q^Z(p)$ against $q$, for the two communities,
might look like this:

The purple curve represents the first community, and the blue curve represents the second. (The exact shapes of the graphs will depend on the similarity matrix $Z$.) For low values of $q$ (emphasizing rare species), the first community looks more diverse than the second. For high values of $q$ (emphasizing common species), it’s the opposite.
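To see a crossover of this kind numerically, here is a rough Python sketch. The abundances are invented (the post gives no numbers for the two bird communities), and for simplicity the sketch assumes that distinct species are totally dissimilar, so $Z$ is the identity matrix. The names `diversity` and `identity` are hypothetical helpers.

```python
import math

def diversity(Z, p, q):
    """Diversity of order q, summing only over species with p_i > 0."""
    S = len(p)
    present = [i for i in range(S) if p[i] > 0]
    Zp = {i: sum(Z[i][j] * p[j] for j in range(S)) for i in present}
    if q == 1:
        return math.exp(-sum(p[i] * math.log(Zp[i]) for i in present))
    if q == math.inf:
        return 1.0 / max(Zp[i] for i in present)
    return sum(p[i] * Zp[i] ** (q - 1) for i in present) ** (1.0 / (1.0 - q))

def identity(n):
    """Similarity matrix for n totally dissimilar species (an assumption here)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Made-up abundances: one dominant species plus three rare ones,
# versus three evenly balanced species.
first = [0.85, 0.05, 0.05, 0.05]
second = [1 / 3, 1 / 3, 1 / 3]

orders = [0, 0.25, 0.5, 1, 2, math.inf]
profile_first = [diversity(identity(4), first, q) for q in orders]
profile_second = [diversity(identity(3), second, q) for q in orders]

for q, d1, d2 in zip(orders, profile_first, profile_second):
    print(f"q = {q:<5} first: {d1:.3f}   second: {d2:.3f}")
```

With these particular numbers the first community comes out ahead at $q = 0$ and $q = 0.25$, and the second is ahead from $q = 0.5$ onwards. A different choice of abundances or of $Z$ would move the crossover point.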

It turns out that many diversity measures previously used in ecology are special cases of the ones given above. Also, these measures have excellent mathematical properties. Lots are listed in our paper. Here I’ll give just two.

**Naive model.** There’s a ‘naive’ model of an ecological community in which distinct species are always assumed to have nothing in common. This is a terribly crude assumption, and makes a community consisting of two species of slug as diverse as a community consisting of a slug and a giraffe.

Nevertheless, this is the model used by most diversity measures to date. It corresponds to taking $Z = I$. When you take $Z = I$ using our measures, the formula for diversity is:

$D_q^Z(p) = \begin{cases} \Bigl( \sum_i p_i^q \Bigr)^{1/(1 - q)} & \text{if } q \neq 1, \infty \\ \prod_i p_i^{-p_i} & \text{if } q = 1 \\ 1/\max_i p_i & \text{if } q = \infty. \end{cases}$

These are known in ecology as the Hill numbers, and in mathematics as the exponentials of the Rényi entropies. A lot is known about them. Even more is known about the case $q = 1$, which is the exponential of Shannon entropy.
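The link between the $Z = I$ case and entropy can be checked numerically. In the sketch below, the function names `hill_number`, `renyi_entropy` and `shannon_entropy` are made up for the example, natural logarithms are assumed, and the abundances are illustrative only.

```python
import math

def hill_number(p, q):
    """Diversity of order q in the naive model Z = I."""
    present = [x for x in p if x > 0]
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in present))
    if q == math.inf:
        return 1.0 / max(present)
    return sum(x ** q for x in present) ** (1.0 / (1.0 - q))

def renyi_entropy(p, q):
    """Renyi entropy of order q (q != 1, q finite), natural logarithm."""
    return math.log(sum(x ** q for x in p if x > 0)) / (1.0 - q)

def shannon_entropy(p):
    """Shannon entropy, natural logarithm."""
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.5, 0.3, 0.2]  # hypothetical abundances
print(hill_number(p, 2), math.exp(renyi_entropy(p, 2)))
print(hill_number(p, 1), math.exp(shannon_entropy(p)))
print(hill_number(p, 0))  # at q = 0 this is just the number of species present
```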

**Effective number.** Our measures are **effective numbers**, which means that a community of $S$ equally abundant, totally dissimilar species is assigned a diversity of $S$. In symbols,

$D_q^I(1/S, \ldots, 1/S) = S,$

for all $q \in [0, \infty]$.

So if someone tells you ‘this community has diversity 26.2’, and they’re using an effective number measure, that means it’s slightly more diverse than a community of 26 equally abundant, totally dissimilar species. If they come to you a year later saying that its diversity has dropped to 13.1, that means, in a directly comprehensible way, that its diversity has halved. As Mark Hill (of the Hill numbers) put it, effective numbers ‘enable us to speak naturally’.
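The effective number property is easy to check numerically for a few values of $S$ and $q$. The sketch below is a sanity check, not a proof, and the helper names `diversity` and `uniform_community` are invented for the example.

```python
import math

def diversity(Z, p, q):
    """Diversity of order q, summing only over species with p_i > 0."""
    S = len(p)
    present = [i for i in range(S) if p[i] > 0]
    Zp = {i: sum(Z[i][j] * p[j] for j in range(S)) for i in present}
    if q == 1:
        return math.exp(-sum(p[i] * math.log(Zp[i]) for i in present))
    if q == math.inf:
        return 1.0 / max(Zp[i] for i in present)
    return sum(p[i] * Zp[i] ** (q - 1) for i in present) ** (1.0 / (1.0 - q))

def uniform_community(S):
    """S equally abundant, totally dissimilar species: Z = I and p_i = 1/S."""
    Z = [[1.0 if i == j else 0.0 for j in range(S)] for i in range(S)]
    p = [1.0 / S] * S
    return Z, p

# Map (S, q) to the computed diversity, for a handful of sizes and orders.
results = {}
for S in (1, 5, 26):
    Z, p = uniform_community(S)
    for q in (0, 0.5, 1, 2, math.inf):
        results[(S, q)] = diversity(Z, p, q)
        print(S, q, round(results[(S, q)], 6))
```

Up to floating-point rounding, every printed diversity equals the corresponding $S$, whatever the value of $q$.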

As far as I’m concerned, this work links together many of my interests involving measures of size. Apart from diversity being an important mathematical concept in itself, it’s related to entropy, power means, and magnitude of metric spaces.

As far as biologists are concerned, there seem to be two main points of interest.

One is the even-handed approach to the spectrum of possible viewpoints — treating all values of $q$ democratically, rather than choosing one and claiming that it’s the ‘best’. This leads to the graphical device of drawing graphs like the one above, in order to compare and contrast communities. I’m surprised that this has generated so much enthusiasm, because these graphs (‘diversity profiles’) have been advocated for a long time, by many different authors. But they also seem to be new to many people.

The other — and the reason behind the title of our paper — is that we’ve built inter-species similarity into the model. Ours isn’t the first diversity measure to do this, but it seems to be the most general. I’d like to explain the practical impact that this can have, but I’m running out of energy now, so I’ll leave that for another day.

**Update (9 November 2011)** John kindly let me write a version of this post for Azimuth. It’s actually quite different from what I’ve written here. The major thing that it has and this post doesn’t is an illustration of how taking species similarity into account can change your judgement on which of two communities is the more diverse.

## Re: Measuring Diversity

Thanks, that’s a very clear post.

When writing that diversity is an intensive quantity, you link to the Wikipedia page on the intensive/extensive property distinction. We really should have such a page at nLab, especially as it is a favourite topic of Lawvere’s. I’ve been trying to grasp it through nForum discussions, e.g., here, and longer ago at the Café.

Does anyone here have a good category theoretic handle on it?