Variance in a Metric Space
Posted by Tom Leinster
Here’s a small observation for a Sunday afternoon. When you have a random variable in $\mathbb{R}$ or some other similar space, you can take its expected value. For a random variable in an arbitrary metric space, you can’t take the expected value — it simply doesn’t make sense. (Edited to add: maybe that’s misleading. See the comments!) But you can take its variance, and the right definition is
$$\operatorname{Var}(X) = \frac{1}{2}\,\mathbb{E}\bigl[d(X_1, X_2)^2\bigr],$$
where $X_1$ and $X_2$ are independent and distributed identically to $X$.
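To see the definition in action, here is a small numerical sketch (Python with numpy; the function and the circle example are just illustrative, not anything standard): it estimates $\frac{1}{2}\,\mathbb{E}\bigl[d(X_1, X_2)^2\bigr]$ by sampling independent pairs.

```python
import numpy as np

def metric_variance(sample, dist, n_pairs=100_000, rng=None):
    """Estimate Var(X) = (1/2) E[ d(X1, X2)^2 ] by Monte Carlo.

    `sample(k, rng)` returns k independent draws of X (as an array);
    `dist(x, y)` applies the metric elementwise to two such arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    x1 = sample(n_pairs, rng)   # X_1: independent copies of X
    x2 = sample(n_pairs, rng)   # X_2: independent of X_1
    return 0.5 * np.mean(dist(x1, x2) ** 2)

# Illustration (a made-up example): X uniform on the unit circle in the plane,
# represented as complex numbers, with the Euclidean (chordal) metric.
circle = lambda k, rng: np.exp(1j * rng.uniform(0.0, 2 * np.pi, k))
euclid = lambda x, y: np.abs(x - y)
print(metric_variance(circle, euclid))   # ≈ 1, since E|X1 - X2|^2 = 2 here
```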
This observation rests on a footling calculation: that when the metric space concerned is the real line, this definition agrees with the usual one. Let’s do it. We have to show that for a real random variable $X$,
$$\mathbb{E}\bigl[(X_1 - X_2)^2\bigr] = 2\operatorname{Var}(X),$$
where again $X_1$ and $X_2$ are independent and distributed identically to $X$. Expanding the left-hand side gives
$$\mathbb{E}[X_1^2] - 2\,\mathbb{E}[X_1 X_2] + \mathbb{E}[X_2^2].$$
Since $X_1$ and $X_2$ are independent, $\mathbb{E}[X_1 X_2] = \mathbb{E}[X_1]\,\mathbb{E}[X_2]$. Since $X_1$ and $X_2$ are each distributed identically to $X$, we have $\mathbb{E}[X_i] = \mathbb{E}[X]$ and $\mathbb{E}[X_i^2] = \mathbb{E}[X^2]$. So our expression reduces to
$$2\bigl(\mathbb{E}[X^2] - \mathbb{E}[X]^2\bigr),$$
which is twice the variance of $X$.
So, our definition of variance in an arbitrary metric space agrees with the usual definition in $\mathbb{R}$.
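A quick numerical check of this agreement, if you want one (again just a throwaway numpy snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
# Any real random variable will do; Exponential(scale=2) has variance 4.
x1 = rng.exponential(scale=2.0, size=500_000)
x2 = rng.exponential(scale=2.0, size=500_000)   # an independent copy
print(0.5 * np.mean((x1 - x2) ** 2))            # ≈ 4
print(np.var(x1))                               # the usual variance, also ≈ 4
```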
People also enjoy taking random variables in inner product spaces. A random variable in an inner product space is called a random vector. So, the general definition of variance in a metric space specializes to give a definition of the variance of a random vector. What is it?
For concreteness, let’s consider a random vector $X$ in $\mathbb{R}^n$. Then $X$ can be viewed as a sequence $(X^{(1)}, \ldots, X^{(n)})$ of real random variables. Using the definition of variance in a metric space, a simple calculation similar to the one above gives
$$\operatorname{Var}(X) = \sum_{i=1}^{n} \operatorname{Var}\bigl(X^{(i)}\bigr).$$
This agrees with the usual definition of the variance of a random vector, which takes advantage of the fact that in a vector space, expectation does make sense.
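Concretely, with the Euclidean metric on $\mathbb{R}^n$ the metric-space variance is the sum of the componentwise variances, which is the trace of the covariance matrix. A small numerical check (the particular mean and covariance below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# A random vector in R^3 with correlated components (an arbitrary example).
mean = np.array([1.0, -2.0, 0.5])
cov = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.4],
                [0.0, 0.4, 0.7]])
x1 = rng.multivariate_normal(mean, cov, size=500_000)
x2 = rng.multivariate_normal(mean, cov, size=500_000)

# (1/2) E ||X1 - X2||^2, i.e. the metric-space variance with the Euclidean metric:
print(0.5 * np.mean(np.sum((x1 - x2) ** 2, axis=1)))   # ≈ 3.7
# Sum of the componentwise variances = trace of the covariance matrix:
print(np.trace(cov))                                    # 3.7
```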
Now for a puzzle. More or less every finite metric space comes with a canonical probability distribution, namely, the maximum entropy distribution. How should we interpret its variance?
This needs some explanation.
First I’ll explain entropy on finite sets, leaving the metric for later. For a probability distribution $p = (p_1, \ldots, p_n)$ on a finite set $\{1, \ldots, n\}$, its Shannon entropy is defined by
$$H(p) = -\sum_i p_i \log p_i.$$
This turns out to be just one member (albeit a special one) of a whole family of entropies, indexed by a parameter $q$. Indeed, for any nonnegative real $q \neq 1$, the Rényi entropy of order $q$ is defined by
$$H_q(p) = \frac{1}{1 - q} \log \sum_i p_i^q.$$
It’s a pleasant exercise to show that $H_q(p) \to H(p)$ as $q \to 1$; so Shannon entropy can be thought of as Rényi entropy of order $1$. A significant fact is that for each $q$, the uniform distribution maximizes the entropy of order $q$.
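If you want to play with these, here’s a short numpy transcription of the definitions (the function name is mine), checking the $q \to 1$ limit and the maximality of the uniform distribution at order $2$:

```python
import numpy as np

def renyi_entropy(p, q):
    """Rényi entropy of order q of a probability vector p (natural logarithm)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):                      # Shannon entropy, the q -> 1 limit
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** q)) / (1.0 - q)

p = [0.5, 0.25, 0.125, 0.125]
print(renyi_entropy(p, 1.0), renyi_entropy(p, 1.0001))       # nearly equal: H_q -> H as q -> 1
print(renyi_entropy([0.25] * 4, 2.0), renyi_entropy(p, 2.0)) # uniform beats p at order 2
```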
Next, let’s bring in the metric. Take a finite metric space $A = \{a_1, \ldots, a_n\}$, and let $Z$ be the $n \times n$ matrix whose $(i, j)$-entry is $e^{-d(a_i, a_j)}$. The entropy of order $1$ of a probability distribution $p$ on $A$ is
$$H_1(p) = -\sum_i p_i \log (Zp)_i,$$
where $Zp$ means the matrix $Z$ times the column vector $p$. Similarly, for any nonnegative real $q \neq 1$, the entropy of order $q$ of $p$ is
$$H_q(p) = \frac{1}{1 - q} \log \sum_i p_i (Zp)_i^{q-1}.$$
Again, $H_q(p) \to H_1(p)$ as $q \to 1$. In the case where the distances between distinct points are all $\infty$, the matrix $Z$ is just the identity, so these entropies reduce to Shannon’s and Rényi’s.
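And here’s a transcription of the metric versions, directly following the formulas above (again, the names are mine and this is only a sketch):

```python
import numpy as np

def metric_entropy(p, D, q):
    """Entropy of order q of p on a finite metric space with distance matrix D,
    using the similarity matrix Z with (i, j)-entry exp(-d(a_i, a_j))."""
    p = np.asarray(p, dtype=float)
    Z = np.exp(-np.asarray(D, dtype=float))
    Zp = Z @ p                                   # the column vector Zp
    s = p > 0                                    # sum only over the support of p
    if np.isclose(q, 1.0):
        return -np.sum(p[s] * np.log(Zp[s]))
    return np.log(np.sum(p[s] * Zp[s] ** (q - 1.0))) / (1.0 - q)

# With all distances between distinct points infinite, Z is the identity matrix
# and we recover the ordinary Shannon and Rényi entropies:
D = np.array([[0.0, np.inf], [np.inf, 0.0]])
p = [0.3, 0.7]
print(metric_entropy(p, D, 1.0))     # = -0.3 log 0.3 - 0.7 log 0.7 ≈ 0.611
print(metric_entropy(p, D, 2.0))     # = -log(0.3^2 + 0.7^2) ≈ 0.544
```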
For each $q$, we can ask which probability distribution(s) on $A$ maximize the entropy of order $q$. In principle, this distribution will be different for different values of $q$. But the surprise is:
There is a single distribution on $A$ maximizing the entropy of all orders $q$ simultaneously.
We already know this in the degenerate case where all distances are infinite: the maximizing distribution is uniform. But usually, it’s not uniform. For instance, if our space consists of three points, with the first two very close together and the third much further away, then the maximizing distribution will be roughly $\bigl(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2}\bigr)$. If you imagine the points gradually moving towards a configuration where they’re the vertices of an equilateral triangle, then as you do this, the maximizing distribution will gradually move towards $\bigl(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\bigr)$.
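One way to find the maximizing distribution numerically in small examples: the entropy of order $2$ is $-\log(p^{T} Z p)$, so maximizing it means minimizing $p^{T} Z p$ over the simplex, and when that minimizer is unique (as it is generically) it must be the distribution maximizing every order. A crude grid search illustrates the two three-point situations just described (the distances $0.01$ and $10$ below are my stand-ins for “very close” and “much further away”):

```python
import numpy as np

def maximizing_distribution_3pts(D, step=0.005):
    """Crude grid search for the maximum entropy distribution on a 3-point space.

    The entropy of order 2 is -log(p^T Z p), so we minimize p^T Z p over the
    simplex; generically the minimizer is unique, and is then the distribution
    maximizing the entropy of every order.
    """
    Z = np.exp(-np.asarray(D, dtype=float))
    best, best_val = None, np.inf
    for p1 in np.arange(0.0, 1.0 + step, step):
        for p2 in np.arange(0.0, 1.0 - p1 + step / 2, step):
            p3 = 1.0 - p1 - p2
            if p3 < 0:
                continue
            p = np.array([p1, p2, p3])
            val = p @ Z @ p
            if val < best_val:
                best, best_val = p, val
    return best

# Two points very close together, the third much further away:
near_far = [[0.0, 0.01, 10.0], [0.01, 0.0, 10.0], [10.0, 10.0, 0.0]]
print(maximizing_distribution_3pts(near_far))     # ≈ (1/4, 1/4, 1/2)

# An equilateral triangle of side 1:
equilateral = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
print(maximizing_distribution_3pts(equilateral))  # ≈ (1/3, 1/3, 1/3)
```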
So: every finite metric space carries one or more probability distributions that maximize the entropy. Generically — I mean, for a metric space chosen at random — there will be exactly one maximizing distribution. So we can ask: what is its variance?
And here’s where I’m puzzled. I don’t know what the answer means. Here are a few examples:
Take a two-point metric space, with the points distance $d$ apart. Then the maximizing distribution is uniform, so a pair of points chosen at random have probability $\tfrac{1}{2}$ of being distance $0$ apart and probability $\tfrac{1}{2}$ of being distance $d$ apart. Recall that in general, for a distribution on a metric space, the variance is half of the expected value of the squared distance between a pair of points. So the variance of the maximizing distribution is
$$\frac{1}{2}\Bigl(\frac{1}{2}\cdot 0^2 + \frac{1}{2}\cdot d^2\Bigr) = \frac{d^2}{4}.$$
More generally, take $n$ points all at distance $d$ from each other (that is, the vertices of a regular simplex). Again, the maximizing distribution is uniform, and a similar calculation tells us that its variance is
$$\frac{n-1}{2n}\, d^2.$$
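A one-line check of that formula, computing $\frac{1}{2}\,\mathbb{E}\bigl[d(X_1, X_2)^2\bigr]$ directly for the uniform distribution (with arbitrary sample values of $n$ and $d$):

```python
import numpy as np

n, d = 5, 3.0
D = d * (1 - np.eye(n))              # n points, all at distance d from each other
p = np.full(n, 1 / n)                # the maximizing (uniform) distribution
print(0.5 * p @ (D ** 2) @ p)        # (1/2) E[d(X1, X2)^2] = 3.6
print((n - 1) / (2 * n) * d ** 2)    # the formula above: also 3.6
```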
Now take the real interval $[0, \ell]$. You may object that this is not a finite metric space, but nevertheless, things seem to work out. The maximizing distribution is the measure
$$\frac{1}{\ell + 2}\bigl(\delta_0 + \lambda + \delta_\ell\bigr),$$
where $\delta_t$ means the point measure at $t$ and $\lambda$ means Lebesgue measure on $[0, \ell]$. A little calculation then shows that the variance of this distribution is
$$\frac{\ell^2(\ell + 6)}{12(\ell + 2)}.$$
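Here’s a Monte Carlo check of that closed form, sampling from the measure above (the value $\ell = 4$ is an arbitrary choice):

```python
import numpy as np

ell = 4.0
rng = np.random.default_rng(2)

def sample_max_entropy_interval(k):
    """Draw k points from (delta_0 + Lebesgue + delta_ell) / (ell + 2) on [0, ell]."""
    which = rng.choice(3, size=k, p=[1 / (ell + 2), ell / (ell + 2), 1 / (ell + 2)])
    x = rng.uniform(0.0, ell, size=k)
    x[which == 0] = 0.0               # atom at 0
    x[which == 2] = ell               # atom at ell
    return x

x1 = sample_max_entropy_interval(500_000)
x2 = sample_max_entropy_interval(500_000)
print(0.5 * np.mean((x1 - x2) ** 2))              # Monte Carlo estimate, ≈ 2.22
print(ell ** 2 * (ell + 6) / (12 * (ell + 2)))    # closed form: 160/72 ≈ 2.22
```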
Ignoring the possibility that a metric space might have more than one maximizing distribution (and that they might have different variances), let’s write the variance of the maximizing distribution on a metric space $A$ as $\operatorname{Var}(A)$.
It’s apparent from the last example that when you scale up $A$ by a constant factor $t$, the maximizing distribution does not simply scale up in the obvious way. If it did, then $\operatorname{Var}(tA)$ would be equal to $t^2 \operatorname{Var}(A)$. But again, the last example shows that this isn’t so.
Nevertheless, this suggests looking at the ratio
$$\frac{\operatorname{Var}(A)}{(\operatorname{diam} A)^2}.$$
When $A$ is the space consisting of $n$ points all at distance $d$ from each other, our earlier calculation shows that this ratio is
$$\frac{n-1}{2n}.$$
In particular, it’s independent of $d$. For the line segment $[0, \ell]$, the ratio is
$$\frac{\ell + 6}{12(\ell + 2)}.$$
This does depend on $\ell$, but converges to $\tfrac{1}{12}$ as $\ell \to \infty$. So maybe, for an arbitrary space $A$, we should look at the limit
$$\lim_{t \to \infty} \frac{\operatorname{Var}(tA)}{(\operatorname{diam}\, tA)^2}.$$
In our two examples, this limit is $\frac{n-1}{2n}$ and $\frac{1}{12}$ respectively. More generally, we can consider the space consisting of $n$ points distance $d$ apart and the interval $[0, \ell]$ for arbitrary $d$ and $\ell$, and the limit comes out the same: it doesn’t depend on $d$ or $\ell$ at all.
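For what it’s worth, here’s a tiny computation (using the closed form for the interval derived above) showing the ratio for the scaled-up interval creeping towards $\frac{1}{12}$:

```python
def var_max_entropy_interval(ell):
    """Variance of the maximizing distribution on [0, ell], from the closed form above."""
    return ell ** 2 * (ell + 6) / (12 * (ell + 2))

for t in [1, 10, 100, 1000, 10000]:
    ratio = var_max_entropy_interval(float(t)) / float(t) ** 2  # Var(tA) / (diam tA)^2 for A = [0, 1]
    print(t, ratio)          # 0.194..., 0.111..., ..., approaching 1/12 = 0.0833...
```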
(It’s easy to see that in general, $\operatorname{Var}(A) \leq \frac{1}{2}(\operatorname{diam} A)^2$, so this limit always lies between $0$ and $\frac{1}{2}$.) But again, what do these numbers mean?
Re: Variance in a Metric Space
I don’t know what the answer to your puzzle is, but maybe it helps to say that I’ve seen this notion of variance before, namely in the context of metric measure spaces and the Gromov-Wasserstein distance. So maybe there’s more literature about it that could be relevant.
So except for the factor of $\frac{1}{2}$, your variance is equal to the $p$-diameter, defined e.g. in Definition 5.2 here, with $p = 2$.