### The Fisher Metric Will Not Be Deformed

#### Posted by Tom Leinster

The pillars of society are those who cannot be bribed or bought, the upright citizens of integrity, the incorruptibles. Throw at them what you will, they never bend.

In the mathematical world, the Fisher metric is one such upstanding figure.

What I mean is this. The Fisher metric can be derived from the concept of relative entropy. But relative entropy can be deformed in various ways, and you might imagine that when you deform it, the Fisher metric gets deformed too. Nope. Bastion of integrity that it is, it remains unmoved.

You don’t need to know what the Fisher metric is in order to get the point: the Fisher metric is a highly canonical concept.

Let’s start with Shannon entropy. Given a finite probability distribution $p = (p_1, \ldots, p_n)$, its Shannon entropy is defined as

$H(p) = - \sum_i p_i \log p_i.$

(I’ll assume all probabilities are nonzero, so there are no problems with things being undefined.)

This is the most important type of “entropy” for finite probability distributions: it has uniquely good properties. But it admits a couple of families of deformations that share most of those properties. One is the family of Rényi entropies, indexed by a real parameter $q$:

$H_q(p) = \frac{1}{1 - q} \log \sum p_i^q.$

Another is the family of entropies that I like to call the $q$-logarithmic entropies (because they’re what you get if you replace the logarithm in the definition of Shannon entropy by a $q$-logarithm), and that physicists call the Tsallis entropies (because Tsallis was about the tenth person to discover them). They’re defined by

$S_q(p) = \frac{1}{1 - q} \biggl( \sum p_i^q - 1 \biggr).$

There’s obviously a problem with the definitions of the Rényi entropy $H_q(p)$
and the $q$-logarithmic entropy $S_q(p)$ when $q = 1$. They don’t make
sense. But both *converge* to the Shannon entropy $H(p)$ as $q \to 1$,
and that’s what I mean by “deformation”.
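Neither limit needs to be taken on faith. Here’s a quick numerical sanity check — a sketch in Python with NumPy, where the distribution $p$ is an arbitrary choice of mine:

```python
import numpy as np

def shannon(p):
    # H(p) = -sum_i p_i log p_i
    return -np.sum(p * np.log(p))

def renyi(p, q):
    # H_q(p) = (1/(1-q)) log(sum_i p_i^q)
    return np.log(np.sum(p ** q)) / (1 - q)

def tsallis(p, q):
    # S_q(p) = (1/(1-q)) (sum_i p_i^q - 1)
    return (np.sum(p ** q) - 1) / (1 - q)

p = np.array([0.5, 0.3, 0.2])  # any distribution with nonzero entries
for q in (1.1, 1.01, 1.001):
    # both deformed entropies approach shannon(p) as q -> 1
    print(q, renyi(p, q), tsallis(p, q), shannon(p))
```

As $q \to 1$, both columns converge to the Shannon entropy of $p$.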

An easy way to prove this is to use l’Hôpital’s rule. And the same l’Hôpital argument shows how easy it is to dream up new deformations of Shannon entropy (not that they’re necessarily interesting). For any function $\lambda : (0, \infty) \to \mathbb{R}$, define a kind of “entropy of order $q$” as

$\frac{1}{1 - q} \cdot \lambda \biggl( \sum p_i^q \biggr).$

If you want to show that this converges to $H(p)$ as $q \to 1$, all you need to assume about $\lambda$ is that $\lambda(1) = 0$ and $\lambda'(1) = 1$.

Taking $\lambda = \log$ satisfies these conditions and gives
Rényi entropy. The *simplest* function $\lambda$ satisfying the conditions is the
linear approximation to the function $\log$ at $1$, namely, $\lambda(x) = x
- 1$. And that gives $q$-logarithmic entropy.
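To see how little the choice of $\lambda$ matters for the limit, here’s a sketch (Python with NumPy) comparing those two standard choices with a deliberately silly one of my own, $\lambda(x) = \sin(x - 1)$, which also satisfies $\lambda(1) = 0$ and $\lambda'(1) = 1$:

```python
import numpy as np

def deformed_entropy(p, q, lam):
    # (1/(1-q)) * lambda(sum_i p_i^q), for any lambda
    # with lambda(1) = 0 and lambda'(1) = 1
    return lam(np.sum(p ** q)) / (1 - q)

p = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(p * np.log(p))

lambdas = {
    "log (Renyi)": np.log,
    "x - 1 (q-logarithmic)": lambda x: x - 1,
    "sin(x - 1) (made up)": lambda x: np.sin(x - 1),
}
for name, lam in lambdas.items():
    # each choice of lambda converges to the Shannon entropy as q -> 1
    print(name, deformed_entropy(p, 1.001, lam), shannon)
```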

That’s entropy, defined for a *single* probability distribution. But
there’s also *relative* entropy, defined for a *pair* of distributions on
the same finite set. The formula is

$H(p \| r) = \sum_i p_i \log(p_i/r_i),$

where $p$ and $r$ are probability distributions on $n$ elements.

I won’t explain here why relative entropy is important. But *very*
roughly, you can think of it as measuring the difference between $p$ and
$r$. It’s always nonnegative, and it’s equal to zero just when $p = r$.
However, it would be a bad idea to use the word “distance”: it’s not
symmetric, and more importantly, it doesn’t satisfy the triangle
inequality.

Actually, relative entropy is slightly more like a *squared* distance. A
little calculus exercise shows that when $p$ and $r$ are close together,

$H(p \| r) = \sum \frac{1}{2p_i} (p_i - r_i)^2 + o(\|p - r\|^2).$

The sum here is just the Euclidean squared distance scaled by a different factor along each coordinate axis.
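That quadratic approximation is easy to check numerically. In this Python/NumPy sketch, $r$ is a small perturbation of $p$ along a tangent direction $t$ (both chosen arbitrarily for illustration):

```python
import numpy as np

def rel_entropy(p, r):
    # H(p || r) = sum_i p_i log(p_i / r_i)
    return np.sum(p * np.log(p / r))

p = np.array([0.5, 0.3, 0.2])
t = np.array([0.01, -0.004, -0.006])  # coordinates sum to zero
eps = 1e-3
r = p + eps * t                       # a distribution close to p

exact = rel_entropy(p, r)
quadratic = np.sum((p - r) ** 2 / (2 * p))
# the two agree up to terms of higher order in eps
print(exact, quadratic)
```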

But it’s still wrong to think of relative entropy as a squared distance. Its square root fails the triangle inequality. So, it’s not a metric in the sense of metric spaces.

However, you *can* use the square root of relative entropy as an
infinitesimal metric — that is, a metric in the sense of Riemannian
geometry. It’s called the Fisher metric, at least up to a constant factor
that I won’t worry about here. And it makes the set of all probability
distributions on $\{1, \ldots, n\}$ into a Riemannian manifold.

This works as follows. The set of probability distributions on $\{1, \ldots, n\}$ is the $(n - 1)$-simplex $\Delta_n$ (whose boundary points I’m ignoring). It’s a smooth manifold in the obvious way, and every one of its tangent spaces can naturally be identified with

$T = \{ t = (t_1, \ldots, t_n) \in \mathbb{R}^n : t_1 + \cdots + t_n = 0 \}.$

The “little calculus exercise” above tells us that when you treat $H(-\|-)$ as an infinitesimal squared distance, the resulting norm on the tangent space $T$ at $p$ is given by

$\|t\|^2 = \sum_i \frac{1}{2p_i} t_i^2.$

Or equivalently, by the polarization identity, the resulting inner product on $T$ is given by

$\langle t, u \rangle = \sum_i \frac{1}{2p_i} t_i u_i.$

And that’s the Riemannian metric on $\Delta_n$. By definition, it’s the Fisher metric.

(Well: it’s actually $1/2$ times what’s normally called the Fisher metric, but as I said, I’m not going to worry too much about constant factors in this post.)

**Summary so far:** We’re working on the space $\Delta_n$ of probability distributions on $n$ elements. There is a machine which takes as input anything that
looks vaguely like a squared distance on $\Delta_n$, and produces as output a
Riemannian metric on $\Delta_n$. When you give this machine relative
entropy as its input, what it produces as output is the Fisher metric.

Now the fun starts. Just as the entropy of a single distribution can be deformed in at least a couple of ways, the relative entropy of a pair of distributions has interesting deformations. Here are two families of them. The Rényi relative entropies are given by

$H_q(p \| r) = \frac{1}{q - 1} \log \sum p_i^q r_i^{1 - q},$

and the $q$-logarithmic relative entropies are given by

$S_q(p \| r) = \frac{1}{q - 1} \biggl( \sum p_i^q r_i^{1 - q} - 1 \biggr).$

Again, $q$ is a real parameter here. Again, both $H_q(p \| r)$ and $S_q(p \| r)$ converge to the standard relative entropy $H(p \| r)$ as $q \to 1$. And again, it’s easy to write down other families of deformations in this sense: define a kind of “relative entropy of order $q$” by

$H^\lambda_q(p \| r) = \frac{1}{q - 1} \lambda \biggl( \sum p_i^q r_i^{1 - q} \biggr)$

where $\lambda$ is any function satisfying the same two conditions as before: $\lambda(1) = 0$ and $\lambda'(1) = 1$. This generalizes both the Rényi and $q$-logarithmic relative entropies, by taking $\lambda(x)$ to be either $\log x$ or $x - 1$.
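Here too the limit $q \to 1$ can be checked with a few lines of Python/NumPy (the two distributions are arbitrary choices of mine):

```python
import numpy as np

def rel_entropy(p, r):
    # H(p || r) = sum_i p_i log(p_i / r_i)
    return np.sum(p * np.log(p / r))

def deformed_rel_entropy(p, r, q, lam):
    # H^lambda_q(p || r) = (1/(q-1)) * lambda(sum_i p_i^q r_i^(1-q))
    return lam(np.sum(p ** q * r ** (1 - q))) / (q - 1)

p = np.array([0.5, 0.3, 0.2])
r = np.array([0.2, 0.5, 0.3])
for lam in (np.log, lambda x: x - 1):  # Renyi and q-logarithmic
    # both approach the standard relative entropy as q -> 1
    print(deformed_rel_entropy(p, r, 1.001, lam), rel_entropy(p, r))
```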

Let’s feed this very general kind of relative entropy into the machine. A bit of calculation shows that

$H^\lambda_q(p \| r) = q \sum_i \frac{1}{2p_i} (p_i - r_i)^2 + o(\|p - r\|^2)$

for *any* function $\lambda$ satisfying those same two conditions. The
right-hand side is just what we saw before, multiplied by $q$. So, the
output of the machine — the Riemannian metric on $\Delta_n$ that comes from this generalized entropy
— is just $q$ times the Fisher metric!

So: when you deform the notion of relative entropy and feed it into the machine, the same thing always happens. No matter which deformation you put in, the machine spits out the same Riemannian metric on $\Delta_n$ (at least, up to a constant factor). It’s always the Fisher metric.
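And the claim itself — that every deformation yields $q$ times the Fisher quadratic form — can be tested numerically too. In this Python/NumPy sketch, $q = 0.7$ and the tangent direction $t$ are arbitrary choices of mine; the ratio printed at the end tends to $q$ as $r \to p$:

```python
import numpy as np

def deformed_rel_entropy(p, r, q, lam):
    # H^lambda_q(p || r) = (1/(q-1)) * lambda(sum_i p_i^q r_i^(1-q))
    return lam(np.sum(p ** q * r ** (1 - q))) / (q - 1)

p = np.array([0.5, 0.3, 0.2])
t = np.array([0.01, -0.004, -0.006])  # tangent direction: sums to zero
r = p + 1e-3 * t                      # a distribution close to p
fisher_quad = np.sum((p - r) ** 2 / (2 * p))

q = 0.7  # an arbitrary deformation parameter
for lam in (np.log, lambda x: x - 1):  # Renyi and q-logarithmic
    # each ratio is close to q, for every admissible lambda
    print(deformed_rel_entropy(p, r, q, lam) / fisher_quad)
```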

A thrill-seeker would call that result disappointing. They might have been hoping that deforming relative entropy would lead to interestingly deformed versions of the Fisher metric. But there are no such things. Try as you might, the Fisher metric simply refuses to be deformed.

## Re: The Fisher Metric Will Not Be Deformed

I haven’t been at all scholarly in this post. I’ve skipped a whole bunch of calculations and differential-geometric details. I haven’t given references. I haven’t said what (if anything) is new.

Let me remedy some of those defects here. There’s a general theory of how to take a kind of faux squared distance on a manifold (e.g. relative entropy) and extract from it a Riemannian metric. It’s apparently due to Eguchi, whose work I haven’t seen yet; I’ve just read the summary in section 3.2 of Amari and Nagaoka’s book *Methods of Information Geometry*. The term they use for a “faux squared distance” is *contrast function*.

In my post, I sketched the proof of the fact that if you start with any of the $q$-logarithmic (Tsallis) relative entropies, the resulting Riemannian metric on the simplex is just the Fisher metric, up to a constant factor. This is certainly known, and can be found in information geometry texts such as the new book by Ay, Jost, Lê and Schwachhöfer. I think they use the term “$\alpha$-divergence” for this relative entropy (where their $\alpha$ is essentially our $q$), and the “$\alpha$-connection” is also an important part of the story.

I don’t know whether the same fact for the Rényi entropies is widely known. Last summer, I met Nihat Ay, one of the authors of this book, at Luminy. I asked him whether he knew what Riemannian metric on $\Delta_n$ came out of the Rényi entropies, and he said he didn’t, but he correctly guessed that it would essentially be the Fisher metric again. So maybe it’s somehow intuitive to experts.