Variance in a Metric Space

Posted by Tom Leinster

$MathML-enabled post (click for more details).$

Here’s a small observation for a Sunday afternoon. When you have a random variable in $\mathbb{R}^n$ or some other similar space, you can take its expected value. For a random variable $X$ in an arbitrary metric space, you can’t take the expected value — it simply doesn’t make sense. (Edited to add: maybe that’s misleading. See the comments!) But you can take its variance, and the right definition is

$Var(X) = \tfrac{1}{2} \mathbb{E}(d(X_1, X_2)^2),$

where $X_1$ and $X_2$ are independent and distributed identically to $X$ .

$MathML-enabled post (click for more details).$

This observation rests on a footling calculation: that when the metric space concerned is the real line, this definition agrees with the usual one. Let’s do it. We have to show that for a real random variable $X$ ,

$\mathbb{E}((X_1 - X_2)^2) = 2 Var(X),$

where again $X_1$ and $X_2$ are independent and distributed identically to $X$ . Expanding the left-hand side gives

$\mathbb{E}(X_1^2) + \mathbb{E}(X_2^2) - 2\mathbb{E}(X_1 X_2).$

Since $X_1$ and $X_2$ are independent, $\mathbb{E}(X_1 X_2) = \mathbb{E}(X_1) \mathbb{E}(X_2)$ . Since $X_1$ and $X_2$ are each distributed identically to $X$ , we have $\mathbb{E}(X_i) = \mathbb{E}(X)$ and $\mathbb{E}(X_i^2) = \mathbb{E}(X^2)$ . So our expression reduces to

$2 \bigl( \mathbb{E}(X^2) - \mathbb{E}(X)^2 \bigr),$

which is twice the variance of $X$ .

So, our definition of variance in an arbitrary metric space agrees with the usual definition in $\mathbb{R}$ .

People also enjoy taking random variables in inner product spaces. A random variable in an inner product space is called a random vector. So, the general definition of variance in a metric space specializes to give a definition of the variance of a random vector. What is it?

For concreteness, let’s consider a random vector $\mathbf{X}$ in $\mathbb{R}^n$ . Then $\mathbf{X}$ can be viewed as a sequence $(X^{(1)}, \ldots, X^{(n)})$ of real random variables. Using the definition of variance in a metric space, a simple calculation similar to the one above gives

$Var(\mathbf{X}) = Var(X^{(1)}) + \cdots + Var(X^{(n)}).$

This agrees with the usual definition of the variance of a random vector, which takes advantage of the fact that in a vector space, expectation does make sense.

Now for a puzzle. More or less every finite metric space comes with a canonical probability distribution, namely, the maximum entropy distribution. How should we interpret its variance?

This needs some explanation.

First I’ll explain entropy on finite sets, leaving the metric for later. For a probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$ on a finite set, its Shannon entropy is defined by

$H_1(\mathbf{p}) = - \sum_{i \colon p_i \gt 0} p_i \log p_i.$

This turns out to be just one member (albeit a special one) of a whole family of entropies, indexed by a parameter $q$ . Indeed, for any nonnegative real $q \neq 1$ , the Rényi entropy of order $q$ is defined by

$H_q(\mathbf{p}) = \frac{1}{1 - q} \log \sum_{i \colon p_i \gt 0} p_i^q.$

It’s a pleasant exercise to show that $H_q(\mathbf{p}) \to H_1(\mathbf{p})$ as $q \to 1$ ; so Shannon entropy can be thought of as Rényi entropy of order $1$ . A significant fact is that for each $q \geq 0$ , the uniform distribution $\mathbf{p} = (1/n, \ldots, 1/n)$ maximizes the entropy of order $q$ .

Next, let’s bring in the metric. Take a finite metric space $A = \{a_1, \ldots, a_n\}$ , and let $Z$ be the $n \times n$ matrix whose $(i, j)$ -entry is $\exp(-d(a_i, a_j))$ . The entropy of order $1$ of a probability distribution $\mathbf{p}$ on $A$ is

$H_1(\mathbf{p}) = - \sum_{i \colon p_i \gt 0} p_i \log (Z\mathbf{p})_i$

where $Z\mathbf{p}$ means the matrix $Z$ times the column vector $\mathbf{p}$ . Similarly, for any nonnegative real $q \neq 1$ , the entropy of order $q$ of $\mathbf{p}$ is

$H_q(\mathbf{p}) = \frac{1}{1 - q} \log \sum_{i \colon p_i \gt 0} p_i (Z\mathbf{p})_i^{q - 1}.$

Again, $H_q(\mathbf{p}) \to H_1(\mathbf{p})$ as $q \to 1$ . In the case where the distances between distinct points are all $\infty$ , the matrix $Z$ is just the identity, so these entropies reduce to Shannon’s and Rényi’s.

For each $q \geq 0$ , we can ask which probability distribution(s) on $A$ maximize the entropy of order $q$ . In principle, this distribution will be different for different values of $q$ . But the surprise is:

There is a single distribution on $A$ maximizing the entropy of all orders simultaneously.

We already know this in the degenerate case where all distances are infinite: the maximizing distribution is uniform. But usually, it’s not uniform. For instance, if our space consists of three points, with the first two very close together and the third much further away, then the maximizing distribution will be roughly $(1/4, 1/4, 1/2)$ . If you imagine the points gradually moving towards a configuration where they’re the vertices of an equilateral triangle, then as you do this, the maximizing distribution will gradually move towards $(1/3, 1/3, 1/3)$ .

So: every finite metric space carries one or more probability distributions that maximize the entropy. Generically — I mean, for a metric space chosen at random — there will be exactly one maximizing distribution. So we can ask: what is its variance?

And here’s where I’m puzzled. I don’t know what the answer means. Here are a few examples:

Take a two-point metric space, with the points distance $\ell$ apart. Then the maximizing distribution is uniform, so a pair of points chosen at random have probability $1/2$ of being distance $0$ apart and probability $1/2$ of being distance $\ell$ apart. Recall that in general, for a distribution on a metric space, the variance is half of the expected value of the squared distance between a pair of points. So the variance of the maximizing distribution is

$\tfrac{1}{2} \bigl( \tfrac{1}{2} \cdot 0^2 + \tfrac{1}{2} \cdot \ell^2 \bigr) = \tfrac{1}{4} \ell^2.$
More generally, take $n$ points all at distance $\ell$ from each other (that is, the vertices of a regular simplex). Again, the maximizing distribution is uniform, and a similar calculation tells us that its variance is

$\frac{n - 1}{2n} \ell^2.$
Now take the real interval $[-\ell, \ell]$ . You may object that this is not a finite metric space, but nevertheless, things seem to work out. The maximizing distribution is the measure

$\frac{1}{2(1 + \ell)} \Bigl( \delta_{-\ell} + \lambda + \delta_\ell \Bigr)$

where $\delta_x$ means the point measure at $x$ and $\lambda$ means Lebesgue measure on $[-\ell, \ell]$ . A little calculation then shows that the variance of this distribution is

$\frac{1 + \ell/3}{1 + \ell} \ell^2.$

Ignoring the possibility that a metric space might have more than one maximizing distribution (and that they might have different variances), let’s write the variance of the maximizing distribution on a metric space $A$ as $Var(A)$ .

It’s apparent from the last example that when you scale $A$ up by a constant factor $t$ , the maximizing distribution does not simply scale up in the obvious way. If it did, then $Var(t A)$ would be equal to $t^2 Var(A)$ . But again, the last example shows that this isn’t so.

Nevertheless, this suggests looking at the ratio

$\frac{Var(t A)}{t^2}.$

When $A$ is the space $S_n$ consisting of $n$ points all at distance $1$ from each other, our earlier calculation shows that this ratio is

$\frac{n - 1}{2n}.$

In particular, it’s independent of $t$ . For the line segment $[-1, 1]$ , the ratio is

$\frac{1 + t/3}{1 + t} = \frac{1}{3} + \frac{2/3}{1 + t}.$

This does depend on $t$ , but converges to $1/3$ as $t \to \infty$ . So maybe, for an arbitrary space $A$ , we should look at the limit

$\rho(A) = \lim_{t \to \infty} \frac{Var(t A)}{t^2}.$

In our two examples, $\rho(S_n) = (n - 1)/2n$ and $\rho([-1, 1]) = 1/3$ . More generally, we can consider the space $\ell S_n$ consisting of $n$ points distance $\ell$ apart and the interval $[-\ell, \ell]$ , and

$\rho(\ell S_n) = \frac{n - 1}{2n} \ell^2, \,\,\,\,\,\,\, \rho([-\ell, \ell]) = \frac{\ell^2}{3}.$

(It’s easy to see that in general, $\rho(\ell A) = \rho(A) \ell^2$ .) But again, what do these numbers mean?

Posted at February 26, 2017 4:05 PM UTC

TrackBack URL for this Entry: https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2943

Re: Variance in a Metric Space

I don’t know what the answer to your puzzle is, but maybe it helps to say that I’ve seen this notion of variance before, namely in the context of metric measure spaces and the Gromov-Wasserstein distance. So maybe there’s more literature about it that could be relevant.

So except for the factor of $\tfrac{1}{2}$ , your variance is equal to the $p$ -diameter, defined e.g. in Definition 5.2 here, with $p=2$ .

Posted by: Tobias Fritz on February 26, 2017 5:46 PM | Permalink | Reply to this

Aha, thanks! I’ll take a look.

I guess I should have said in the post that I can’t possibly have been the first person in history to observe that to define variance, one only requires a metric!

Posted by: Tom Leinster on February 26, 2017 6:04 PM | Permalink | Reply to this

Here’s an elementary question that has nothing to do with entropy:

Given a metric space, what is the greatest possible variance among all distributions on it?

In other words, how much space is there available for a distribution to spread out?

For finite metric spaces $A = \{a_1, \ldots, a_n\}$ , I’m asking about the number

$MaxVar(A) \colon= \sup_{\mathbf{p}} \, Var(\mathbf{p}) = \sup_{\mathbf{p}} \, \tfrac{1}{2} \sum_{i, j} p_i p_j d(a_i, a_j)^2,$

where the suprema are over all probability distributions $\mathbf{p} = (p_1, \ldots, p_n)$ on $A$ . Since the space of all distributions on $A$ is compact and $Var(\mathbf{p})$ is continuous in $\mathbf{p}$ , this supremum is both finite and attained by some distribution.

For instance, given a space $A$ , we could attempt to get a distribution of greatest variance by picking two points $a, b$ a maximal distance apart, then using the distribution that flips a fair coin and chooses either $a$ or $b$ accordingly. Or we could try the uniform distribution. But if my back-of-the-envelope calculations are correct, neither of these always gives the greatest possible variance.

Is there any reason to think that the distribution that maximizes the entropy also maximizes the variance?

Posted by: Tom Leinster on February 26, 2017 6:10 PM | Permalink | Reply to this

Interesting question! Assuming that the optimal distribution is in the interior of the simplex (i.e. has full support), we can obtain the condition for optimality simply by differentiating $Var$ and using a Lagrange multiplier for the normalization. As you’ve probably noticed already, this results in the condition that $\mathbf{p}$ is optimal if and only if $\sum_j p_j d(a_i,a_j)^2$ is independent of $i$ . Does this look anything like the maximum entropy distribution? Naively, doing something with the square of the metric seems very different from doing something with its exponential…

I’m not sure how to adapt this to the case when no optimal $\mathbf{p}$ has full support. This happens e.g. when all points lie on a line.

Posted by: Tobias Fritz on February 26, 2017 11:03 PM | Permalink | Reply to this

For a random variable $X$ in an arbitrary metric space, you can’t take the expected value — it simply doesn’t make sense.

We could define it as the value of $x$ which minimises

$\mathbb{E}(d(x,X)^2).$

This definition works in $\mathbb{R}^n$ since

$\mathbb{E}((x-X)^2)=(x-\mathbb{E}(X))^2+\mathbb{E}(X^2)-\mathbb{E}(X)^2$

which is minimised at $x=\mathbb{E}(X)$ (and gives the variance at its minimum).

Clearly the minimum will always be achieved in a finite metric space, and in fact since $\mathbb{E}(d(x,X)^2)$ is a continuous function of $x$ it will also be achieved in any compact space.

We can similarly define “median” by minimising $\mathbb{E}(d(x,X))$ .

Posted by: Oscar Cunningham on February 26, 2017 9:35 PM | Permalink | Reply to this

Right, this concept seems to be known as the barycenter of a metric measure space, generalizing the Fréchet mean. In general, neither existence nor uniqueness is guaranteed. Various conditions under which it uniquely exists seem to be know, as per the first link above.

Posted by: Tobias Fritz on February 26, 2017 11:17 PM | Permalink | Reply to this

Very interesting post!

I don’t know if this can be helpful, and I’m sure that someone has noticed this before, but the continuous limit of the quantity:

$\frac{1}{2} \mathbb{E} (d(X_1,X_2)^2)$

at least if we consider only “neighboring points”, is in some sense the action functional:

$\frac{1}{2} \int |\partial \phi|^2\; dx\;,$

and “harmonic” fields are then the ones that are “maximally spread” given the constraints.

I don’t know how to interpret this further, nor how to make this more precise, though.

Posted by: Paolo Perrone on February 27, 2017 4:49 PM | Permalink | Reply to this

The n-Category Café

Skip to the Main Content

February 26, 2017