

February 26, 2017

Variance in a Metric Space

Posted by Tom Leinster

Here’s a small observation for a Sunday afternoon. When you have a random variable in $\mathbb{R}^n$ or some other similar space, you can take its expected value. For a random variable $X$ in an arbitrary metric space, you can’t take the expected value — it simply doesn’t make sense. (Edited to add: maybe that’s misleading. See the comments!) But you can take its variance, and the right definition is

$$ Var(X) = \tfrac{1}{2} \mathbb{E}(d(X_1, X_2)^2), $$

where $X_1$ and $X_2$ are independent and distributed identically to $X$.

This observation rests on a footling calculation: that when the metric space concerned is the real line, this definition agrees with the usual one. Let’s do it. We have to show that for a real random variable $X$,

$$ \mathbb{E}((X_1 - X_2)^2) = 2 Var(X), $$

where again $X_1$ and $X_2$ are independent and distributed identically to $X$. Expanding the left-hand side gives

$$ \mathbb{E}(X_1^2) + \mathbb{E}(X_2^2) - 2\mathbb{E}(X_1 X_2). $$

Since $X_1$ and $X_2$ are independent, $\mathbb{E}(X_1 X_2) = \mathbb{E}(X_1) \mathbb{E}(X_2)$. Since $X_1$ and $X_2$ are each distributed identically to $X$, we have $\mathbb{E}(X_i) = \mathbb{E}(X)$ and $\mathbb{E}(X_i^2) = \mathbb{E}(X^2)$. So our expression reduces to

$$ 2 \bigl( \mathbb{E}(X^2) - \mathbb{E}(X)^2 \bigr), $$

which is twice the variance of $X$.

So, our definition of variance in an arbitrary metric space agrees with the usual definition in $\mathbb{R}$.
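As a quick sanity check, here is a rough Monte Carlo sketch in Python (my own illustration, assuming numpy; the exponential distribution is an arbitrary choice): half the mean squared distance between two independent copies of $X$ comes out close to the usual variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10**6

# X_1, X_2 independent, identically distributed (Exp with scale 2, variance 4)
x1 = rng.exponential(scale=2.0, size=n_samples)
x2 = rng.exponential(scale=2.0, size=n_samples)

metric_variance = 0.5 * np.mean((x1 - x2) ** 2)   # (1/2) E d(X_1, X_2)^2
usual_variance = np.var(x1)                        # the ordinary variance

print(metric_variance, usual_variance)   # both approximately 4
```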

People also enjoy taking random variables in inner product spaces. A random variable in an inner product space is called a random vector. So, the general definition of variance in a metric space specializes to give a definition of the variance of a random vector. What is it?

For concreteness, let’s consider a random vector $\mathbf{X}$ in $\mathbb{R}^n$. Then $\mathbf{X}$ can be viewed as a sequence $(X^{(1)}, \ldots, X^{(n)})$ of real random variables. Using the definition of variance in a metric space, a simple calculation similar to the one above gives

$$ Var(\mathbf{X}) = Var(X^{(1)}) + \cdots + Var(X^{(n)}). $$

This agrees with the usual definition of the variance of a random vector, which takes advantage of the fact that in a vector space, expectation does make sense.
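The coordinatewise formula can be checked the same way; here is a sketch in $\mathbb{R}^3$ with the Euclidean metric (again my own illustration, with arbitrarily chosen coordinate scales).

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10**6
scales = np.array([1.0, 2.0, 3.0])   # coordinate standard deviations

# two independent copies of a random vector in R^3
X1 = rng.normal(size=(n_samples, 3)) * scales
X2 = rng.normal(size=(n_samples, 3)) * scales

metric_variance = 0.5 * np.mean(np.sum((X1 - X2) ** 2, axis=1))
sum_of_coordinate_variances = np.var(X1, axis=0).sum()

print(metric_variance, sum_of_coordinate_variances)   # both roughly 1 + 4 + 9 = 14
```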

Now for a puzzle. More or less every finite metric space comes with a canonical probability distribution, namely, the maximum entropy distribution. How should we interpret its variance?

This needs some explanation.

First I’ll explain entropy on finite sets, leaving the metric for later. For a probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$ on a finite set, its Shannon entropy is defined by

$$ H_1(\mathbf{p}) = - \sum_{i \colon p_i > 0} p_i \log p_i. $$

This turns out to be just one member (albeit a special one) of a whole family of entropies, indexed by a parameter $q$. Indeed, for any nonnegative real $q \neq 1$, the Rényi entropy of order $q$ is defined by

$$ H_q(\mathbf{p}) = \frac{1}{1 - q} \log \sum_{i \colon p_i > 0} p_i^q. $$

It’s a pleasant exercise to show that $H_q(\mathbf{p}) \to H_1(\mathbf{p})$ as $q \to 1$; so Shannon entropy can be thought of as Rényi entropy of order $1$. A significant fact is that for each $q \geq 0$, the uniform distribution $\mathbf{p} = (1/n, \ldots, 1/n)$ maximizes the entropy of order $q$.
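A minimal numerical illustration of these entropies (my own sketch, assuming numpy), checking that $H_q(\mathbf{p})$ approaches the Shannon entropy as $q \to 1$:

```python
import numpy as np

def renyi_entropy(p, q):
    """Rényi entropy of order q of a finite distribution p (Shannon entropy if q = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # the sum runs over i with p_i > 0
    if q == 1:
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** q)) / (1 - q)

p = [0.5, 0.25, 0.125, 0.125]
print(renyi_entropy(p, 1))       # Shannon entropy, = (7/4) log 2
print(renyi_entropy(p, 0.999))   # very close to the Shannon value
print(renyi_entropy(p, 2))       # order-2 (collision) entropy, smaller
```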

Next, let’s bring in the metric. Take a finite metric space $A = \{a_1, \ldots, a_n\}$, and let $Z$ be the $n \times n$ matrix whose $(i, j)$-entry is $\exp(-d(a_i, a_j))$. The entropy of order $1$ of a probability distribution $\mathbf{p}$ on $A$ is

$$ H_1(\mathbf{p}) = - \sum_{i \colon p_i > 0} p_i \log (Z\mathbf{p})_i $$

where $Z\mathbf{p}$ means the matrix $Z$ times the column vector $\mathbf{p}$. Similarly, for any nonnegative real $q \neq 1$, the entropy of order $q$ of $\mathbf{p}$ is

$$ H_q(\mathbf{p}) = \frac{1}{1 - q} \log \sum_{i \colon p_i > 0} p_i (Z\mathbf{p})_i^{q - 1}. $$

Again, $H_q(\mathbf{p}) \to H_1(\mathbf{p})$ as $q \to 1$. In the case where the distances between distinct points are all $\infty$, the matrix $Z$ is just the identity, so these entropies reduce to Shannon’s and Rényi’s.
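Here is a small sketch of these metric-space entropies (my own illustration, assuming numpy); when all off-diagonal distances are infinite, $Z$ is the identity and the ordinary Shannon/Rényi values come back.

```python
import numpy as np

def metric_entropy(p, d, q):
    """Entropy of order q of p on a finite metric space with distance matrix d."""
    p = np.asarray(p, dtype=float)
    Z = np.exp(-np.asarray(d, dtype=float))   # similarity matrix, Z_ij = exp(-d(a_i, a_j))
    Zp = Z @ p
    i = p > 0                                  # sum only over i with p_i > 0
    if q == 1:
        return -np.sum(p[i] * np.log(Zp[i]))
    return np.log(np.sum(p[i] * Zp[i] ** (q - 1))) / (1 - q)

# all distinct points at distance infinity: Z is the identity matrix
d_inf = np.array([[0.0, np.inf], [np.inf, 0.0]])
print(metric_entropy([0.5, 0.5], d_inf, 1))   # log 2, the Shannon value
print(metric_entropy([0.5, 0.5], d_inf, 2))   # log 2 again
```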

For each $q \geq 0$, we can ask which probability distribution(s) on $A$ maximize the entropy of order $q$. In principle, this distribution will be different for different values of $q$. But the surprise is:

There is a single distribution on $A$ maximizing the entropy of all orders simultaneously.

We already know this in the degenerate case where all distances are infinite: the maximizing distribution is uniform. But usually, it’s not uniform. For instance, if our space consists of three points, with the first two very close together and the third much further away, then the maximizing distribution will be roughly $(1/4, 1/4, 1/2)$. If you imagine the points gradually moving towards a configuration where they’re the vertices of an equilateral triangle, then as you do this, the maximizing distribution will gradually move towards $(1/3, 1/3, 1/3)$.
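One crude way to see this numerically (my own sketch, assuming numpy and scipy, and not how maximizing distributions are computed in general) is to maximize the order-$1$ entropy over the simplex with a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def neg_H1(p, Z):
    """Negative of the order-1 entropy, -sum_i p_i log (Zp)_i."""
    p = np.clip(p, 1e-12, None)   # guard against log(0) during optimization
    return np.sum(p * np.log(Z @ p))

# three points: the first two very close together, the third far away
d = np.array([[0.0,  0.01, 10.0],
              [0.01, 0.0,  10.0],
              [10.0, 10.0, 0.0]])
Z = np.exp(-d)

result = minimize(neg_H1, np.ones(3) / 3, args=(Z,),
                  bounds=[(0.0, 1.0)] * 3,
                  constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
print(result.x)   # roughly (0.25, 0.25, 0.5)
```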

So: every finite metric space carries one or more probability distributions that maximize the entropy. Generically — I mean, for a metric space chosen at random — there will be exactly one maximizing distribution. So we can ask: what is its variance?

And here’s where I’m puzzled. I don’t know what the answer means. Here are a few examples:

  • Take a two-point metric space, with the points distance $\ell$ apart. Then the maximizing distribution is uniform, so a pair of points chosen at random have probability $1/2$ of being distance $0$ apart and probability $1/2$ of being distance $\ell$ apart. Recall that in general, for a distribution on a metric space, the variance is half of the expected value of the squared distance between a pair of points. So the variance of the maximizing distribution is

    $$ \tfrac{1}{2} \bigl( \tfrac{1}{2} \cdot 0^2 + \tfrac{1}{2} \cdot \ell^2 \bigr) = \tfrac{1}{4} \ell^2. $$

  • More generally, take $n$ points all at distance $\ell$ from each other (that is, the vertices of a regular simplex). Again, the maximizing distribution is uniform, and a similar calculation tells us that its variance is

    $$ \frac{n - 1}{2n} \ell^2. $$

  • Now take the real interval $[-\ell, \ell]$. You may object that this is not a finite metric space, but nevertheless, things seem to work out. The maximizing distribution is the measure

    $$ \frac{1}{2(1 + \ell)} \Bigl( \delta_{-\ell} + \lambda + \delta_\ell \Bigr) $$

    where $\delta_x$ means the point measure at $x$ and $\lambda$ means Lebesgue measure on $[-\ell, \ell]$. A little calculation (checked numerically just after this list) then shows that the variance of this distribution is

    $$ \frac{1 + \ell/3}{1 + \ell} \ell^2. $$
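As a rough check of that last example (my own sketch, assuming numpy): sample from the mixture measure above and compare the empirical variance with the stated formula.

```python
import numpy as np

rng = np.random.default_rng(2)
l = 5.0
n_samples = 10**6

# The maximizing measure: mass 1/(2(1+l)) at each endpoint, and the remaining
# mass l/(1+l) spread uniformly (normalized Lebesgue measure) over [-l, l].
endpoint_mass = 1 / (2 * (1 + l))
u = rng.random(n_samples)
samples = np.where(u < endpoint_mass, -l,
           np.where(u < 2 * endpoint_mass, l,
                    rng.uniform(-l, l, size=n_samples)))

empirical_variance = np.var(samples)
formula = (1 + l / 3) / (1 + l) * l**2

print(empirical_variance, formula)   # both approximately 11.11
```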

Ignoring the possibility that a metric space might have more than one maximizing distribution (and that they might have different variances), let’s write the variance of the maximizing distribution on a metric space $A$ as $Var(A)$.

It’s apparent from the last example that when you scale $A$ up by a constant factor $t$, the maximizing distribution does not simply scale up in the obvious way. If it did, then $Var(t A)$ would be equal to $t^2 Var(A)$. But again, the last example shows that this isn’t so.

Nevertheless, this suggests looking at the ratio

$$ \frac{Var(t A)}{t^2}. $$

When $A$ is the space $S_n$ consisting of $n$ points all at distance $1$ from each other, our earlier calculation shows that this ratio is

$$ \frac{n - 1}{2n}. $$

In particular, it’s independent of $t$. For the line segment $[-1, 1]$, the ratio is

$$ \frac{1 + t/3}{1 + t} = \frac{1}{3} + \frac{2/3}{1 + t}. $$

This does depend on $t$, but converges to $1/3$ as $t \to \infty$. So maybe, for an arbitrary space $A$, we should look at the limit

$$ \rho(A) = \lim_{t \to \infty} \frac{Var(t A)}{t^2}. $$

In our two examples, $\rho(S_n) = (n - 1)/2n$ and $\rho([-1, 1]) = 1/3$. More generally, we can consider the space $\ell S_n$ consisting of $n$ points distance $\ell$ apart and the interval $[-\ell, \ell]$, and

$$ \rho(\ell S_n) = \frac{n - 1}{2n} \ell^2, \qquad \rho([-\ell, \ell]) = \frac{\ell^2}{3}. $$

(It’s easy to see that in general, $\rho(\ell A) = \rho(A) \ell^2$.) But again, what do these numbers mean?
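For what it’s worth, here is an exploratory sketch (my own, assuming numpy and scipy) that estimates $Var(t A)/t^2$ for a small three-point space on a line, reusing the same crude entropy maximization as above:

```python
import numpy as np
from scipy.optimize import minimize

def maxent_distribution(d):
    """Numerically maximize the order-1 entropy over the probability simplex."""
    n = d.shape[0]
    Z = np.exp(-d)

    def neg_H1(p):
        p = np.clip(p, 1e-12, None)       # guard against log(0)
        return np.sum(p * np.log(Z @ p))

    result = minimize(neg_H1, np.ones(n) / n,
                      bounds=[(0.0, 1.0)] * n,
                      constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
    return result.x

def variance(p, d):
    """Var = (1/2) sum_{i,j} p_i p_j d(a_i, a_j)^2."""
    return 0.5 * p @ (d ** 2) @ p

# three points on a line, at positions 0, 1 and 3
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])

for t in [1.0, 10.0, 100.0]:
    p = maxent_distribution(t * d)
    print(t, variance(p, t * d) / t**2)   # the ratio appears to settle down as t grows
```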

Posted at February 26, 2017 4:05 PM UTC


7 Comments & 0 Trackbacks

Re: Variance in a Metric Space

I don’t know what the answer to your puzzle is, but maybe it helps to say that I’ve seen this notion of variance before, namely in the context of metric measure spaces and the Gromov-Wasserstein distance. So maybe there’s more literature about it that could be relevant.

So except for the factor of $\tfrac{1}{2}$, your variance is equal to the $p$-diameter, defined e.g. in Definition 5.2 here, with $p = 2$.

Posted by: Tobias Fritz on February 26, 2017 5:46 PM | Permalink | Reply to this

Re: Variance in a Metric Space

Aha, thanks! I’ll take a look.

I guess I should have said in the post that I can’t possibly have been the first person in history to observe that to define variance, one only requires a metric!

Posted by: Tom Leinster on February 26, 2017 6:04 PM | Permalink | Reply to this

Re: Variance in a Metric Space

Here’s an elementary question that has nothing to do with entropy:

Given a metric space, what is the greatest possible variance among all distributions on it?

In other words, how much space is there available for a distribution to spread out?

For finite metric spaces $A = \{a_1, \ldots, a_n\}$, I’m asking about the number

$$ MaxVar(A) := \sup_{\mathbf{p}} \, Var(\mathbf{p}) = \sup_{\mathbf{p}} \, \tfrac{1}{2} \sum_{i, j} p_i p_j d(a_i, a_j)^2, $$

where the suprema are over all probability distributions $\mathbf{p} = (p_1, \ldots, p_n)$ on $A$. Since the space of all distributions on $A$ is compact and $Var(\mathbf{p})$ is continuous in $\mathbf{p}$, this supremum is both finite and attained by some distribution.

For instance, given a space $A$, we could attempt to get a distribution of greatest variance by picking two points $a, b$ a maximal distance apart, then using the distribution that flips a fair coin and chooses either $a$ or $b$ accordingly. Or we could try the uniform distribution. But if my back-of-the-envelope calculations are correct, neither of these always gives the greatest possible variance.
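A quick numerical experiment along these lines (a rough sketch, assuming numpy and scipy; the optimizer is local, so it may miss the true maximum):

```python
import numpy as np
from scipy.optimize import minimize

def neg_variance(p, d2):
    """Negative of Var(p) = (1/2) sum_{i,j} p_i p_j d(a_i, a_j)^2."""
    return -0.5 * p @ d2 @ p

# three points on a line, at positions 0, 1 and 3
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])

result = minimize(neg_variance, np.ones(3) / 3, args=(d ** 2,),
                  bounds=[(0.0, 1.0)] * 3,
                  constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])

print(result.x, -result.fun)                  # numerically found distribution and its variance
print(0.5 * 0.5 * 0.5 * 3.0**2 * 2)           # fair coin on the two extreme points
print(-neg_variance(np.ones(3) / 3, d ** 2))  # uniform distribution
```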

Is there any reason to think that the distribution that maximizes the entropy also maximizes the variance?

Posted by: Tom Leinster on February 26, 2017 6:10 PM | Permalink | Reply to this

Re: Variance in a Metric Space

Interesting question! Assuming that the optimal distribution is in the interior of the simplex (i.e. has full support), we can obtain the condition for optimality simply by differentiating $Var$ and using a Lagrange multiplier for the normalization. As you’ve probably noticed already, this results in the condition that $\mathbf{p}$ is optimal if and only if $\sum_j p_j d(a_i,a_j)^2$ is independent of $i$. Does this look anything like the maximum entropy distribution? Naively, doing something with the square of the metric seems very different from doing something with its exponential…

I’m not sure how to adapt this to the case when no optimal $\mathbf{p}$ has full support. This happens e.g. when all points lie on a line.

Posted by: Tobias Fritz on February 26, 2017 11:03 PM | Permalink | Reply to this

Re: Variance in a Metric Space

For a random variable $X$ in an arbitrary metric space, you can’t take the expected value — it simply doesn’t make sense.

We could define it as the value of $x$ which minimises

$$ \mathbb{E}(d(x,X)^2). $$

This definition works in $\mathbb{R}^n$ since

$$ \mathbb{E}((x - X)^2) = (x - \mathbb{E}(X))^2 + \mathbb{E}(X^2) - \mathbb{E}(X)^2 $$

which is minimised at $x = \mathbb{E}(X)$ (and gives the variance at its minimum).

Clearly the minimum will always be achieved in a finite metric space, and in fact since $\mathbb{E}(d(x,X)^2)$ is a continuous function of $x$ it will also be achieved in any compact space.

We can similarly define “median” by minimising $\mathbb{E}(d(x,X))$.
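On a finite metric space the minimising point can be found just by scanning the points; a minimal sketch (my own, assuming numpy):

```python
import numpy as np

def metric_mean_index(p, d):
    """Index of a point x_i minimising E(d(x_i, X)^2) = sum_j p_j d(x_i, x_j)^2."""
    return int(np.argmin((np.asarray(d) ** 2) @ np.asarray(p)))

def metric_median_index(p, d):
    """Index of a point x_i minimising E(d(x_i, X)) = sum_j p_j d(x_i, x_j)."""
    return int(np.argmin(np.asarray(d) @ np.asarray(p)))

# three points on a line, at positions 0, 1 and 3, with an uneven distribution
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
p = [0.2, 0.5, 0.3]

print(metric_mean_index(p, d), metric_median_index(p, d))   # the middle point wins both
```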

Posted by: Oscar Cunningham on February 26, 2017 9:35 PM | Permalink | Reply to this

Re: Variance in a Metric Space

Right, this concept seems to be known as the barycenter of a metric measure space, generalizing the Fréchet mean. In general, neither existence nor uniqueness is guaranteed. Various conditions under which it uniquely exists seem to be known, as per the first link above.

Posted by: Tobias Fritz on February 26, 2017 11:17 PM | Permalink | Reply to this

Re: Variance in a Metric Space

Very interesting post!

I don’t know if this can be helpful, and I’m sure that someone has noticed this before, but the continuous limit of the quantity:

$$ \frac{1}{2} \mathbb{E}(d(X_1,X_2)^2) $$

at least if we consider only “neighboring points”, is in some sense the action functional:

$$ \frac{1}{2} \int |\partial \phi|^2 \, dx, $$

and “harmonic” fields are then the ones that are “maximally spread” given the constraints.

I don’t know how to interpret this further, nor how to make this more precise, though.

Posted by: Paolo Perrone on February 27, 2017 4:49 PM | Permalink | Reply to this
