### Variance in a Metric Space

#### Posted by Tom Leinster

Here’s a small observation for a Sunday afternoon. When you have a random
variable in $\mathbb{R}^n$ or some other similar space, you can take its
expected value. For a random variable $X$ in an arbitrary metric space,
you can’t take the expected value — it simply doesn’t make sense. *(Edited to add: maybe that’s misleading. See the comments!)* But
you *can* take its variance, and the right definition is

$Var(X) = \tfrac{1}{2} \mathbb{E}(d(X_1, X_2)^2),$

where $X_1$ and $X_2$ are independent and distributed identically to $X$.

This observation rests on a footling calculation: that when the metric space concerned is the real line, this definition agrees with the usual one. Let’s do it. We have to show that for a real random variable $X$,

$\mathbb{E}((X_1 - X_2)^2) = 2 Var(X),$

where again $X_1$ and $X_2$ are independent and distributed identically to $X$. Expanding the left-hand side gives

$\mathbb{E}(X_1^2) + \mathbb{E}(X_2^2) - 2\mathbb{E}(X_1 X_2).$

Since $X_1$ and $X_2$ are independent, $\mathbb{E}(X_1 X_2) = \mathbb{E}(X_1) \mathbb{E}(X_2)$. Since $X_1$ and $X_2$ are each distributed identically to $X$, we have $\mathbb{E}(X_i) = \mathbb{E}(X)$ and $\mathbb{E}(X_i^2) = \mathbb{E}(X^2)$. So our expression reduces to

$2 \bigl( \mathbb{E}(X^2) - \mathbb{E}(X)^2 \bigr),$

which is twice the variance of $X$.

So, our definition of variance in an arbitrary metric space agrees with the usual definition in $\mathbb{R}$.
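As a quick sanity check, here is the calculation above done numerically for a small discrete distribution on the real line (the values and probabilities are arbitrary choices, just for illustration):

```python
import numpy as np

# A hypothetical discrete distribution on the real line.
x = np.array([-1.0, 0.5, 2.0, 3.5])   # values
p = np.array([0.1, 0.4, 0.3, 0.2])    # probabilities

# Usual variance: E(X^2) - E(X)^2.
usual_var = np.sum(p * x**2) - np.sum(p * x)**2

# Metric-space variance: (1/2) E(d(X1, X2)^2),
# with X1, X2 independent and distributed as X.
pair_var = 0.5 * np.sum(np.outer(p, p) * (x[:, None] - x[None, :])**2)

assert np.isclose(usual_var, pair_var)
```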

People also enjoy taking random variables in inner product spaces. A
random variable in an inner product space is called a **random
vector**. So, the general definition of variance in a metric space
specializes to give a definition of the variance of a random vector. What
is it?

For concreteness, let’s consider a random vector $\mathbf{X}$ in $\mathbb{R}^n$. Then $\mathbf{X}$ can be viewed as a sequence $(X^{(1)}, \ldots, X^{(n)})$ of real random variables. Using the definition of variance in a metric space, a simple calculation similar to the one above gives

$Var(\mathbf{X}) = Var(X^{(1)}) + \cdots + Var(X^{(n)}).$

This agrees with the usual definition of the variance of a random vector, which takes advantage of the fact that in a vector space, expectation *does* make sense.
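The same kind of check works for the vector case. Below is a sketch with a hypothetical distribution on four points of $\mathbb{R}^2$, verifying that the metric-space variance (with the Euclidean metric) equals the sum of the coordinatewise variances:

```python
import numpy as np

# A hypothetical distribution on four points of R^2.
pts = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 3.0], [-2.0, 0.0]])
p = np.array([0.25, 0.25, 0.3, 0.2])

# Metric-space variance: (1/2) E(||X1 - X2||^2).
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
metric_var = 0.5 * np.sum(np.outer(p, p) * sq_dists)

# Sum of the variances of the coordinate random variables.
coord_var = sum(np.sum(p * pts[:, k]**2) - np.sum(p * pts[:, k])**2
                for k in range(pts.shape[1]))

assert np.isclose(metric_var, coord_var)
```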

Now for a puzzle. More or less every finite metric space comes with a canonical probability distribution, namely, the maximum entropy distribution. How should we interpret its variance?

This needs some explanation.

First I’ll explain entropy on finite *sets*, leaving the metric for
later. For a probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$ on
a finite set, its **Shannon entropy** is defined by

$H_1(\mathbf{p}) = - \sum_{i \colon p_i \gt 0} p_i \log p_i.$

This turns out to be just one member (albeit a special one) of a whole
*family* of entropies, indexed by a parameter $q$. Indeed, for any nonnegative
real $q \neq 1$, the **Rényi entropy of order $q$** is defined by

$H_q(\mathbf{p}) = \frac{1}{1 - q} \log \sum_{i \colon p_i \gt 0} p_i^q.$

It’s a pleasant exercise to show that $H_q(\mathbf{p}) \to H_1(\mathbf{p})$ as $q \to 1$; so Shannon entropy can be thought of as Rényi entropy of order $1$. A significant fact is that for each $q \geq 0$, the uniform distribution $\mathbf{p} = (1/n, \ldots, 1/n)$ maximizes the entropy of order $q$.
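The two facts just mentioned are easy to check numerically. The sketch below (with an arbitrary test distribution) implements Rényi entropy, checks that $H_q$ is close to $H_1$ for $q$ near $1$, and spot-checks that the uniform distribution does at least as well at $q = 2$:

```python
import numpy as np

def renyi(p, q):
    """Rényi entropy of order q, with Shannon entropy at q = 1."""
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p**q)) / (1 - q)

p = np.array([0.5, 0.3, 0.2])   # an arbitrary test distribution
shannon = renyi(p, 1.0)

# H_q -> H_1 as q -> 1.
assert abs(renyi(p, 1.0001) - shannon) < 1e-3

# The uniform distribution maximizes H_q (spot check at q = 2).
uniform = np.full(3, 1 / 3)
assert renyi(uniform, 2.0) >= renyi(p, 2.0)
```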

Next, let’s bring in the metric. Take a finite metric space $A = \{a_1,
\ldots, a_n\}$, and let $Z$ be the $n \times n$ matrix whose $(i, j)$-entry
is $\exp(-d(a_i, a_j))$. The **entropy of order $1$** of a probability
distribution $\mathbf{p}$ on $A$ is

$H_1(\mathbf{p}) = - \sum_{i \colon p_i \gt 0} p_i \log (Z\mathbf{p})_i$

where $Z\mathbf{p}$ means the matrix $Z$ times the column vector
$\mathbf{p}$. Similarly, for any nonnegative real $q \neq 1$, the
**entropy of order $q$** of $\mathbf{p}$ is

$H_q(\mathbf{p}) = \frac{1}{1 - q} \log \sum_{i \colon p_i \gt 0} p_i (Z\mathbf{p})_i^{q - 1}.$

Again, $H_q(\mathbf{p}) \to H_1(\mathbf{p})$ as $q \to 1$. In the case where the distances between distinct points are all $\infty$, the matrix $Z$ is just the identity, so these entropies reduce to Shannon’s and Rényi’s.
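These definitions translate directly into code. The sketch below (with hypothetical distances) computes the order-$q$ entropy from the similarity matrix $Z$, and checks the degenerate case: with $Z$ the identity, the order-$1$ entropy is just Shannon entropy.

```python
import numpy as np

def metric_entropy(p, Z, q):
    """Entropy of order q of p on a finite metric space
    with similarity matrix Z_ij = exp(-d(a_i, a_j))."""
    Zp = Z @ p
    mask = p > 0
    if np.isclose(q, 1.0):
        return -np.sum(p[mask] * np.log(Zp[mask]))
    return np.log(np.sum(p[mask] * Zp[mask]**(q - 1))) / (1 - q)

# A three-point space with some hypothetical distances.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
Z = np.exp(-D)
p = np.array([0.2, 0.5, 0.3])

# With all distances infinite, Z is the identity and we recover Shannon entropy.
I = np.eye(3)
shannon = -np.sum(p * np.log(p))
assert np.isclose(metric_entropy(p, I, 1.0), shannon)
```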

For each $q \geq 0$, we can ask which probability distribution(s) on $A$ maximize the entropy of order $q$. In principle, the maximizing distribution could be different for different values of $q$. But the surprise is:

> There is a single distribution on $A$ maximizing the entropy of all orders simultaneously.

We already know this in the degenerate case where all distances are infinite: the maximizing distribution is uniform. But usually, it’s not uniform. For instance, if our space consists of three points, with the first two very close together and the third much further away, then the maximizing distribution will be roughly $(1/4, 1/4, 1/2)$. If you imagine the points gradually moving towards a configuration where they’re the vertices of an equilateral triangle, then as you do this, the maximizing distribution will gradually move towards $(1/3, 1/3, 1/3)$.
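The three-point example can be checked by brute force. The sketch below puts the first two points at distance $0.01$ from each other and the third at distance $10$ from both (illustrative choices), then does a crude grid search over the probability simplex for the maximizer of $H_1$; the result comes out close to $(1/4, 1/4, 1/2)$.

```python
import numpy as np

# Two points very close together, the third much further away.
D = np.array([[0.0, 0.01, 10.0],
              [0.01, 0.0, 10.0],
              [10.0, 10.0, 0.0]])
Z = np.exp(-D)

def H1(p):
    """Entropy of order 1 on the space with similarity matrix Z."""
    mask = p > 0
    return -np.sum(p[mask] * np.log((Z @ p)[mask]))

# Crude grid search over the probability simplex.
best, best_p = -np.inf, None
step = 0.005
for p1 in np.arange(step, 1.0, step):
    for p2 in np.arange(step, 1.0 - p1, step):
        p = np.array([p1, p2, 1.0 - p1 - p2])
        h = H1(p)
        if h > best:
            best, best_p = h, p

# The maximizer is roughly (1/4, 1/4, 1/2).
assert np.allclose(best_p, [0.25, 0.25, 0.5], atol=0.03)
```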

So: every finite metric space carries one or more probability distributions that maximize the entropy. Generically — I mean, for a metric space chosen at random — there will be exactly one maximizing distribution. So we can ask: what is its variance?

And here’s where I’m puzzled. I don’t know what the answer means. Here are a few examples:

Take a two-point metric space, with the points distance $\ell$ apart. Then the maximizing distribution is uniform, so a pair of points chosen at random have probability $1/2$ of being distance $0$ apart and probability $1/2$ of being distance $\ell$ apart. Recall that in general, for a distribution on a metric space, the variance is half of the expected value of the squared distance between a pair of points. So the variance of the maximizing distribution is

$\tfrac{1}{2} \bigl( \tfrac{1}{2} \cdot 0^2 + \tfrac{1}{2} \cdot \ell^2 \bigr) = \tfrac{1}{4} \ell^2.$

More generally, take $n$ points all at distance $\ell$ from each other (that is, the vertices of a regular simplex). Again, the maximizing distribution is uniform, and a similar calculation tells us that its variance is

$\frac{n - 1}{2n} \ell^2.$
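That calculation is short enough to verify directly: for the uniform distribution on $n$ points all at distance $\ell$, the pair formula gives exactly $\frac{n-1}{2n}\ell^2$ (the values of $n$ and $\ell$ below are arbitrary).

```python
import numpy as np

# Uniform distribution on n points, all pairwise distance ell apart.
n, ell = 5, 2.0
p = np.full(n, 1.0 / n)

# Squared distances: ell^2 off the diagonal, 0 on it.
sq_dists = ell**2 * (1 - np.eye(n))
var = 0.5 * np.sum(np.outer(p, p) * sq_dists)

assert np.isclose(var, (n - 1) / (2 * n) * ell**2)
```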

Now take the real interval $[-\ell, \ell]$. You may object that this is not a *finite* metric space, but nevertheless, things seem to work out. The maximizing distribution is the measure

$\frac{1}{2(1 + \ell)} \bigl( \delta_{-\ell} + \lambda + \delta_\ell \bigr)$

where $\delta_x$ means the point measure at $x$ and $\lambda$ means Lebesgue measure on $[-\ell, \ell]$. A little calculation then shows that the variance of this distribution is

$\frac{1 + \ell/3}{1 + \ell} \ell^2.$
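That "little calculation" can be spelled out: the measure is symmetric about $0$, so its mean is $0$ and the variance is just the second moment, which is $\bigl(\ell^2 + \int_{-\ell}^{\ell} x^2\, dx + \ell^2\bigr)/(2(1+\ell))$. The sketch below checks this against the displayed formula for an arbitrary value of $\ell$.

```python
import numpy as np

# Variance of the measure (delta_{-ell} + Lebesgue + delta_ell) / (2(1 + ell))
# on [-ell, ell]. By symmetry the mean is 0, so the variance is E(X^2).
ell = 3.0
second_moment = (ell**2 + 2 * ell**3 / 3 + ell**2) / (2 * (1 + ell))

assert np.isclose(second_moment, (1 + ell / 3) / (1 + ell) * ell**2)
```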

Ignoring the possibility that a metric space might have more than one maximizing distribution (and that they might have different variances), let’s write the variance of the maximizing distribution on a metric space $A$ as $Var(A)$.

It’s apparent from the last example that when you scale $A$ up by a
constant factor $t$, the maximizing distribution does *not* simply scale up
in the obvious way. If it did, then $Var(t A)$ would be equal to $t^2
Var(A)$. But again, the last example shows that this isn’t so.

Nevertheless, this suggests looking at the ratio

$\frac{Var(t A)}{t^2}.$

When $A$ is the space $S_n$ consisting of $n$ points all at distance $1$ from each other, our earlier calculation shows that this ratio is

$\frac{n - 1}{2n}.$

In particular, it’s independent of $t$. For the line segment $[-1, 1]$, the ratio is

$\frac{1 + t/3}{1 + t} = \frac{1}{3} + \frac{2/3}{1 + t}.$

This does depend on $t$, but converges to $1/3$ as $t \to \infty$. So maybe, for an arbitrary space $A$, we should look at the limit

$\rho(A) = \lim_{t \to \infty} \frac{Var(t A)}{t^2}.$

In our two examples, $\rho(S_n) = (n - 1)/2n$ and $\rho([-1, 1]) = 1/3$. More generally, we can consider the space $\ell S_n$ consisting of $n$ points distance $\ell$ apart and the interval $[-\ell, \ell]$, and

$\rho(\ell S_n) = \frac{n - 1}{2n} \ell^2, \,\,\,\,\,\,\, \rho([-\ell, \ell]) = \frac{\ell^2}{3}.$

(It’s easy to see that in general, $\rho(\ell A) = \rho(A) \ell^2$.) But
again, what do these numbers *mean*?

## Re: Variance in a Metric Space

I don’t know what the answer to your puzzle is, but maybe it helps to say that I’ve seen this notion of variance before, namely in the context of metric measure spaces and the Gromov-Wasserstein distance. So maybe there’s more literature about it that could be relevant.

So, except for the factor of $\tfrac{1}{2}$, your variance is equal to the $p$-diameter, defined e.g. in Definition 5.2 here, with $p = 2$.