The Fisher Metric Will Not Be Deformed
Posted by Tom Leinster
The pillars of society are those who cannot be bribed or bought, the upright citizens of integrity, the incorruptibles. Throw at them what you will, they never bend.
In the mathematical world, the Fisher metric is one such upstanding figure.
What I mean is this. The Fisher metric can be derived from the concept of relative entropy. But relative entropy can be deformed in various ways, and you might imagine that when you deform it, the Fisher metric gets deformed too. Nope. Bastion of integrity that it is, it remains unmoved.
You don’t need to know what the Fisher metric is in order to get the point: the Fisher metric is a highly canonical concept.
Let’s start with Shannon entropy. Given a finite probability distribution $p = (p_1, \ldots, p_n)$, its Shannon entropy is defined as
$$H(p) = - \sum_{i=1}^n p_i \log p_i.$$
(I’ll assume all probabilities are nonzero, so there are no problems with things being undefined.)
This is the most important type of “entropy” for finite probability distributions: it has uniquely good properties. But it admits a couple of families of deformations that share most of those properties. One is the family of Rényi entropies, indexed by a real parameter $q$:
$$H_q(p) = \frac{1}{1 - q} \log \sum_{i=1}^n p_i^q.$$
Another is the family of entropies that I like to call the $q$-logarithmic entropies (because they’re what you get if you replace the logarithm in the definition of Shannon entropy by a $q$-logarithm), and that physicists call the Tsallis entropies (because Tsallis was about the tenth person to discover them). They’re defined by
$$S_q(p) = \frac{1}{1 - q} \Bigl( \sum_{i=1}^n p_i^q - 1 \Bigr).$$
There’s obviously a problem with the definitions of the Rényi entropy and the $q$-logarithmic entropy when $q = 1$: both involve dividing by $1 - q$, so they don’t make sense there. But both converge to the Shannon entropy as $q \to 1$, and that’s what I mean by “deformation”.
An easy way to prove this is to use l’Hôpital’s rule. And the same l’Hôpital argument shows that it’s easy to dream up new deformations of Shannon entropy (not that they’re necessarily interesting). For any function $\phi$, define a kind of “entropy of order $q$” as
$$H^\phi_q(p) = \frac{1}{1 - q} \, \phi\Bigl( \sum_{i=1}^n p_i^q \Bigr).$$
If you want to show that this converges to $H(p)$ as $q \to 1$, all you need to assume about $\phi$ is that $\phi(1) = 0$ and $\phi'(1) = 1$.
Taking $\phi = \log$ satisfies these conditions and gives Rényi entropy. The simplest function satisfying the conditions is the linear approximation to $\log$ at $1$, namely, $\phi(x) = x - 1$. And that gives $q$-logarithmic entropy.
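In case it’s useful, here is the l’Hôpital computation in sketch form. Since $\phi(1) = 0$, both the numerator and the denominator of $\phi\bigl(\sum_i p_i^q\bigr)/(1 - q)$ vanish at $q = 1$, so differentiating both with respect to $q$ gives
$$\lim_{q \to 1} H^\phi_q(p) = \lim_{q \to 1} \frac{\phi'\bigl(\sum_i p_i^q\bigr) \sum_i p_i^q \log p_i}{-1} = - \phi'(1) \sum_i p_i \log p_i = H(p),$$
using $\phi'(1) = 1$ in the last step.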
That’s entropy, defined for a single probability distribution. But there’s also relative entropy, defined for a pair of distributions on the same finite set. The formula is
$$H(p \,\|\, r) = \sum_{i=1}^n p_i \log \frac{p_i}{r_i},$$
where $p = (p_1, \ldots, p_n)$ and $r = (r_1, \ldots, r_n)$ are probability distributions on $n$ elements.
I won’t explain here why relative entropy is important. But very roughly, you can think of it as measuring the difference between $p$ and $r$. It’s always nonnegative, and it’s equal to zero just when $p = r$. However, it would be a bad idea to use the word “distance”: it’s not symmetric, and more importantly, it doesn’t satisfy the triangle inequality.
Actually, relative entropy is slightly more like a squared distance. A little calculus exercise shows that when $p$ and $r$ are close together,
$$H(p \,\|\, r) \approx \frac{1}{2} \sum_{i=1}^n \frac{(p_i - r_i)^2}{p_i}.$$
The sum here is just the Euclidean squared distance $\sum_i (p_i - r_i)^2$, but scaled by a different factor ($1/p_i$) along each coordinate axis.
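If you want to do that little calculus exercise, one route is to write $r_i = p_i + v_i$, where $\sum_i v_i = 0$, and expand the logarithm to second order:
$$H(p \,\|\, r) = - \sum_i p_i \log\Bigl(1 + \frac{v_i}{p_i}\Bigr) \approx - \sum_i p_i \Bigl( \frac{v_i}{p_i} - \frac{v_i^2}{2 p_i^2} \Bigr) = \frac{1}{2} \sum_i \frac{v_i^2}{p_i},$$
where the first-order term vanishes because $\sum_i v_i = 0$.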
But it’s still wrong to think of relative entropy as a squared distance. Its square root fails the triangle inequality. So, it’s not a metric in the sense of metric spaces.
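Here’s a quick numerical illustration of that failure, in Python. The three distributions below are simply ones chosen to make the violation visible, with natural logarithms throughout:

```python
from math import log, sqrt

def rel_entropy(p, r):
    """Relative entropy H(p || r) = sum_i p_i log(p_i / r_i)."""
    return sum(p_i * log(p_i / r_i) for p_i, r_i in zip(p, r))

# Three distributions on a two-element set: two near-degenerate ones
# and the uniform distribution sitting between them.
p = (0.99, 0.01)
m = (0.50, 0.50)
r = (0.01, 0.99)

direct = sqrt(rel_entropy(p, r))                           # about 2.12
via_m = sqrt(rel_entropy(p, m)) + sqrt(rel_entropy(m, r))  # about 2.07

# The direct route is longer than the route via m, so the square root
# of relative entropy violates the triangle inequality.
print(direct, via_m, direct <= via_m)
```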
However, you can use the square root of relative entropy as an infinitesimal metric — that is, a metric in the sense of Riemannian geometry. It’s called the Fisher metric, at least up to a constant factor that I won’t worry about here. And it makes the set of all probability distributions on $n$ elements into a Riemannian manifold.
This works as follows. The set of probability distributions on $\{1, \ldots, n\}$ is the $(n - 1)$-simplex, which I’ll write as $\Delta_n$ (and whose boundary points I’m ignoring). It’s a smooth manifold in the obvious way, and every one of its tangent spaces can naturally be identified with
$$\Bigl\{ v \in \mathbb{R}^n : \sum_{i=1}^n v_i = 0 \Bigr\}.$$
The “little calculus exercise” above tells us that when you treat relative entropy as an infinitesimal squared distance, the resulting norm $\|\cdot\|_p$ on the tangent space at $p$ is given by
$$\|v\|_p^2 = \frac{1}{2} \sum_{i=1}^n \frac{v_i^2}{p_i}.$$
Or equivalently, by the polarization identity, the resulting inner product on the tangent space at $p$ is given by
$$\langle v, w \rangle_p = \frac{1}{2} \sum_{i=1}^n \frac{v_i w_i}{p_i}.$$
And that’s the Riemannian metric on $\Delta_n$. By definition, it’s the Fisher metric.
(Well: it’s actually $1/2$ times what’s normally called the Fisher metric, but as I said, I’m not going to worry too much about constant factors in this post.)
Summary so far: We’re working on the space $\Delta_n$ of probability distributions on $n$ elements. There is a machine which takes as input anything that looks vaguely like a squared distance on $\Delta_n$, and produces as output a Riemannian metric on $\Delta_n$. When you give this machine relative entropy as its input, what it produces as output is the Fisher metric.
Now the fun starts. Just as the entropy of a single distribution can be deformed in at least a couple of ways, the relative entropy of a pair of distributions has interesting deformations. Here are two families of them. The Rényi relative entropies are given by
$$H_q(p \,\|\, r) = \frac{1}{q - 1} \log \sum_{i=1}^n p_i^q r_i^{1 - q},$$
and the $q$-logarithmic relative entropies are given by
$$S_q(p \,\|\, r) = \frac{1}{q - 1} \Bigl( \sum_{i=1}^n p_i^q r_i^{1 - q} - 1 \Bigr).$$
Again, $q$ is a real parameter here. Again, both $H_q(p \,\|\, r)$ and $S_q(p \,\|\, r)$ converge to the standard relative entropy as $q \to 1$. And again, it’s easy to write down other families of deformations in this sense: define a kind of “relative entropy of order $q$” by
$$H^\phi_q(p \,\|\, r) = \frac{1}{q - 1} \, \phi\Bigl( \sum_{i=1}^n p_i^q r_i^{1 - q} \Bigr),$$
where $\phi$ is any function satisfying the same two conditions as before: $\phi(1) = 0$ and $\phi'(1) = 1$. This generalizes both the Rényi and $q$-logarithmic relative entropies, by taking $\phi$ to be either $\log$ or $x \mapsto x - 1$.
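As a sanity check on that convergence, here is a small Python computation with a couple of distributions I made up:

```python
from math import log

def rel_entropy(p, r):
    """Standard relative entropy: sum_i p_i log(p_i / r_i)."""
    return sum(p_i * log(p_i / r_i) for p_i, r_i in zip(p, r))

def deformed_rel_entropy(p, r, q, phi):
    """The 'relative entropy of order q' built from a function phi:
    phi(sum_i p_i^q r_i^(1 - q)) / (q - 1)."""
    return phi(sum(p_i**q * r_i**(1 - q) for p_i, r_i in zip(p, r))) / (q - 1)

p = (0.2, 0.3, 0.5)
r = (0.4, 0.4, 0.2)

for q in (1.1, 1.01, 1.001):
    renyi = deformed_rel_entropy(p, r, q, log)                # phi = log
    tsallis = deformed_rel_entropy(p, r, q, lambda x: x - 1)  # phi = x - 1
    print(q, renyi, tsallis)

print(rel_entropy(p, r))  # both columns approach this value as q -> 1
```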
Let’s feed this very general kind of relative entropy into the machine. A bit of calculation shows that when $p$ and $r$ are close together,
$$H^\phi_q(p \,\|\, r) \approx \frac{q}{2} \sum_{i=1}^n \frac{(p_i - r_i)^2}{p_i}$$
for any function $\phi$ satisfying those same two conditions. The right-hand side is just what we saw before, multiplied by $q$. So, the output of the machine — the Riemannian metric on $\Delta_n$ that comes from this generalized relative entropy — is just $q$ times the Fisher metric!
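If you want to see where the factor of $q$ comes from, here is the calculation in sketch form. Write $r_i = p_i + v_i$ with $\sum_i v_i = 0$. Then to second order in $v$,
$$\sum_i p_i^q r_i^{1 - q} = \sum_i p_i \Bigl(1 + \frac{v_i}{p_i}\Bigr)^{1 - q} \approx \sum_i p_i \Bigl( 1 + (1 - q) \frac{v_i}{p_i} - \frac{q (1 - q)}{2} \frac{v_i^2}{p_i^2} \Bigr) = 1 + \frac{q (q - 1)}{2} \sum_i \frac{v_i^2}{p_i}.$$
Applying $\phi$ (which, to first order near $1$, is just $x \mapsto x - 1$) and dividing by $q - 1$ gives $\frac{q}{2} \sum_i v_i^2 / p_i$.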
So: when you deform the notion of relative entropy and feed it into the machine, the same thing always happens. No matter which deformation you put in, the machine spits out the same Riemannian metric on $\Delta_n$ (at least, up to a constant factor). It’s always the Fisher metric.
A thrill-seeker would call that result disappointing. They might have been hoping that deforming relative entropy would lead to interestingly deformed versions of the Fisher metric. But there are no such things. Try as you might, the Fisher metric simply refuses to be deformed.
Re: The Fisher Metric Will Not Be Deformed
I haven’t been at all scholarly in this post. I’ve skipped a whole bunch of calculations and differential-geometric details. I haven’t given references. I haven’t said what (if anything) is new.
Let me remedy some of those defects here. There’s a general theory of how to take a kind of faux squared distance on a manifold (e.g. relative entropy) and extract from it a Riemannian metric. It’s apparently due to Eguchi, whose work I haven’t seen yet; I’ve just read the summary in section 3.2 of Amari and Nagaoka’s book Methods of Information Geometry. The term they use for a “faux squared distance” is contrast function.
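To give a rough idea of the recipe: a contrast function $D$ vanishes to second order along the diagonal, and (if I’m reading the summary correctly) the induced Riemannian metric is, in local coordinates,
$$g_{ij}(p) = - \left. \frac{\partial^2}{\partial x_i \, \partial y_j} D(x \,\|\, y) \right|_{x = y = p}.$$
Glossing over the choice of coordinates on the simplex, taking $D$ to be relative entropy gives $g_{ij}(p) = \delta_{ij} / p_i$, which is the Fisher metric in its usual normalization (i.e. twice the metric I used in the post).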
In my post, I sketched the proof of the fact that if you start with any of the $q$-logarithmic (Tsallis) relative entropies, the resulting Riemannian metric on the simplex is just the Fisher metric, up to a constant factor. This is certainly known, and can be found in information geometry texts such as the new book by Ay, Jost, Lê and Schwachhöfer. I think they use the term “$\alpha$-divergence” for this relative entropy (where their $\alpha$ is essentially our $q$), and the “$\alpha$-connection” is also an important part of the story.
I don’t know whether the same fact for the Rényi relative entropies is widely known. Last summer, I met Nihat Ay, one of the authors of this book, at Luminy. I asked him whether he knew what Riemannian metric on $\Delta_n$ came out of the Rényi relative entropies, and he said he didn’t, but he correctly guessed that it would essentially be the Fisher metric again. So maybe it’s somehow intuitive to experts.