### A Bit More About Entropy

#### Posted by David Corfield

There’s been a lot of interest in entropy of late around here. I thought I’d record what I’d found since it’s spread over a few posts.

Entropy, we have seen, can provide a measure of information loss under coarse-graining. From a distribution over the restaurants in a town, if for each restaurant I specify a distribution over the dishes served there, then I can generate a distribution over all instances of restaurant and dish. On the other hand, from such a distribution over all dishes in the restaurants of the town, I can coarse-grain to give a distribution over restaurants. What Tom, John and Tobias show is that a sensible positive real-valued measure of what is lost is equal to the difference between the entropies of each distribution.
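As a sanity check with made-up numbers (two hypothetical restaurants, five dishes in all), the entropy difference between the fine-grained and coarse-grained distributions is exactly the expected entropy of the per-restaurant dish distributions:

```python
import math

def H(p):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

p_rest = [0.6, 0.4]                      # distribution over restaurants
q_dish = [[0.5, 0.5], [0.2, 0.3, 0.5]]   # dish distribution within each restaurant

# Fine-grained distribution over all (restaurant, dish) instances
p_fine = [pr * qd for pr, q in zip(p_rest, q_dish) for qd in q]

# Information lost by coarse-graining back to restaurants...
loss = H(p_fine) - H(p_rest)
# ...equals the expected entropy of the dish distributions
cond = sum(pr * H(q) for pr, q in zip(p_rest, q_dish))
print(abs(loss - cond) < 1e-12)  # -> True
```

In other words, the difference of the two entropies is just the conditional entropy of dish given restaurant.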

Now the kind of measure-preserving mapping which takes a distribution over restaurants to a distribution over dishes in restaurants has been named by others a *congruent embedding by a Markov mapping*. They are part of a larger story in which entropy can be situated.

It starts with Čencov in *Statistical decision rules and optimal inference* (1982). The term *congruent embedding* comes from the way a measure-preserving map from distributions over $m$ restaurants to distributions over $n$ dishes in restaurants can be seen as an embedding of the simplex of distributions over $m$ things, $S_{m-1}$, into the simplex of distributions over $n$ things, $S_{n-1}$.

Now Čencov showed that the only metric on the manifolds $S_{k}$ for which all congruent embeddings induce an isometry is

$g_{i j} = \delta_{i j}/x_i.$

A simple calculation shows that this is equal to the Fisher information metric.
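For completeness, here is that calculation, regarding the coordinates $x_k$ of a point of $S_{n-1}$ as the parameters of the categorical family $p(k \mid \mathbf{x}) = x_k$, so that $\partial \log p(k \mid \mathbf{x})/\partial x_i = \delta_{i k}/x_k$:

$g_{i j} = \sum_k x_k \cdot \frac{\delta_{i k}}{x_k} \cdot \frac{\delta_{j k}}{x_k} = \frac{\delta_{i j}}{x_i}.$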

Campbell, in *An extended Čencov characterization of the information metric*, then showed that it is worth looking beyond $S_{n-1}$ to the full cone of measures, $\mathbb{R}^n_+$. He extended Čencov’s result to the positive cones by showing that the metrics giving rise to isometries under congruent embeddings are highly constrained, and include

$g_{i j} = \delta_{i j}\cdot |\mathbf{x}|/x_i,$

where $|\mathbf{x}| = \sum x_i$. This metric has come to be known as the Shahshahani metric for reasons you can discover from Marc Harper’s papers discussed on John’s blog here.
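As a numerical sanity check (a sketch with a made-up congruent embedding of the $2$-cone into the $5$-cone, splitting each restaurant into its dishes), the Shahshahani metric pulls back to itself along the embedding, i.e. the embedding is an isometry:

```python
import numpy as np

def shahshahani(x):
    """Shahshahani metric on the positive cone: g_ij = delta_ij * |x| / x_i."""
    return np.diag(x.sum() / x)

# Fixed dish distributions per restaurant (made-up numbers)
q = [np.array([0.5, 0.5]), np.array([0.2, 0.3, 0.5])]

def embed(x):
    """Congruent embedding of the 2-cone into the 5-cone: split each coordinate by q."""
    return np.concatenate([x[r] * q[r] for r in range(2)])

# The embedding is linear, so its Jacobian is constant
J = np.zeros((5, 2))
J[0:2, 0] = q[0]
J[2:5, 1] = q[1]

x = np.array([1.3, 0.7])                      # an unnormalized measure
pullback = J.T @ shahshahani(embed(x)) @ J    # pulled-back metric at x
print(np.allclose(pullback, shahshahani(x)))  # -> True: an isometry
```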

So, vectors in the tangent space at a point of the subspace of probability distributions, $S_{n-1}$, have the form $A = \sum a_i X_i$, where $\sum a_i = 0$ and the $X_i$ form the obvious coordinate basis.

The unit normal vector at $\mathbf{x} \in S_{n-1}$ for the Shahshahani metric is $N = \sum x_i X_i$ since

$\langle A, N \rangle = \sum a_i \cdot x_i/x_i = \sum a_i = 0,$

and

$\langle N, N \rangle = \sum x_i \cdot x_i/x_i = \sum x_i = |\mathbf{x}| = 1$ for $\mathbf{x} \in S_{n - 1}$.
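Plugging in numbers (a made-up point of $S_2$ and a made-up tangent vector) confirms that $N$ is the unit normal:

```python
import numpy as np

x = np.array([0.2, 0.3, 0.5])     # a point of the simplex S_2
G = np.diag(x.sum() / x)          # Shahshahani metric at x (here |x| = 1)

N = x                             # claimed unit normal at x
A = np.array([0.1, -0.4, 0.3])    # a tangent vector: components sum to 0

print(np.isclose(A @ G @ N, 0.0))  # -> True: N is orthogonal to the tangent space
print(np.isclose(N @ G @ N, 1.0))  # -> True: N has unit length
```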

Now another very natural quantity in this setup is the invariant vector field $U_{\mathbf{x}} = \sum (-x_i \log x_i) X_i$. I found this after a discussion with Urs on cohomology and characteristic classes. It is invariant under the multiplicative action of $\mathbb{R}_+$,

$r \cdot U_{\mathbf{x}} = \sum (-r x_i \log x_i - r x_i \log r)X_i = \sum (-r x_i \log (r x_i)) X_i = U_{r \mathbf{x}}.$

An obvious thing to try now is to take the inner product at a point $\mathbf{x} \in S_{n - 1}$ of $U$ and $N$. We find

$\langle U_{\mathbf{x}}, N_{\mathbf{x}} \rangle = \sum (-x_i \log x_i) \cdot x_i/x_i = -\sum x_i \log x_i = H(\mathbf{x}),$

the entropy of the distribution $\mathbf{x}$.
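Again with made-up numbers, the inner product of $U$ and $N$ does land on the Shannon entropy:

```python
import math

x = [0.2, 0.3, 0.5]                      # a point of the simplex

U = [-xi * math.log(xi) for xi in x]     # components of U_x
N = x                                    # components of the unit normal N_x

# Shahshahani inner product at x (|x| = 1): <A, B> = sum a_i * b_i / x_i
inner = sum(u * n / xi for u, n, xi in zip(U, N, x))
entropy = -sum(xi * math.log(xi) for xi in x)
print(abs(inner - entropy) < 1e-12)      # -> True
```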

Relative entropy then seems to arise as though you parallel transport the invariant vector $U_{\mathbf{y}}$ to $\mathbf{x}$ and then compare the projections of it and of $U_{\mathbf{x}}$ onto the unit normal vector at $\mathbf{x}$:

$D(\mathbf{x} \| \mathbf{y}) = \langle U_{\mathbf{x}} - U_{\mathbf{y}}, N_{\mathbf{x}} \rangle.$

I wonder whether, from the geometry of the situation, we can see why the Fisher–Shahshahani metric emerges as the curvature of the relative entropy $D(\mathbf{x} \| \mathbf{y})$.

## Re: A Bit More About Entropy

I don’t get your calculation showing that the vector field $U_{\mathbf{x}}$ is invariant under the multiplicative action of $\mathbb{R}_+$. Where does the term involving $r\log r$ come from? The only point (with positive coordinates) where $U_{\mathbf{x}}$ vanishes is $\mathbf{x}=(1,\ldots,1)$. However, this point is not invariant under the $\mathbb{R}_+$-action. So how is it possible that $U_{\mathbf{x}}$ is invariant?