Skip to the Main Content

Note:These pages make extensive use of the latest XHTML and CSS Standards. They ought to look great in any standards-compliant modern browser. Unfortunately, they will probably look horrible in older browsers, like Netscape 4.x and IE 4.x. Moreover, many posts use MathML, which is, currently only supported in Mozilla. My best suggestion (and you will thank me when surfing an ever-increasing number of sites on the web which have been crafted to use the new standards) is to upgrade to the latest version of your browser. If that's not possible, consider moving to the Standards-compliant and open-source Mozilla browser.

June 17, 2011

A Bit More About Entropy

Posted by David Corfield

There’s been a lot of interest in entropy of late around here. I thought I’d record what I’d found since it’s spread over a few posts.

Entropy, we have seen, can provide a measure of information loss under coarse-graining. From a distribution over the restaurants in a town, if for each restaurant I specify a distribution over the dishes served there, then I can generate a distribution over all instances of restaurant and dish. On the other hand, from such a distribution over all dishes in the restaurants of the town, I can coarse-grain to give a distribution over restaurants. What Tom, John and Tobias show is that a sensible positive real-valued measure of what is lost is equal to the difference between the entropies of each distribution.

Now the kind of measure-preserving mapping which takes a distribution over restaurants to a distribution over dishes in restaurants has been named by others a congruent embedding by a Markov mapping. They are part of a larger story in which entropy can be situated.

It starts with Čencov in Statistical decision rules and optimal inference (1982). The term congruent embedding comes from the way a measure-preserving map from distributions over mm restaurants to distributions over nn dishes in restaurants can be seen as an embedding of the simplex of distributions over mm things, S m1S_{m-1}, into the simplex of distributions over nn things, S n1S_{n-1}.

Now Čencov showed that the only metric on the manifolds S kS_{k} for which all congruent embeddings induce an isometry is

g ij=δ ij/x i. g_{i j} = \delta_{i j}/x_i.

A simple calculation shows that this is equal to the Fisher information metric.

Campbell in An extended Čencov characterization of the information metric then showed that it’s worth looking outside S n1S_{n-1} to the full cone of measures, + n\mathbb{R}^n_+. He extended Čencov’s result to the positive cones by showing that metrics giving rise to isometries under congruent embeddings are highly constrained, and include

g ij=δ ij|x|/x i, g_{i j} = \delta_{i j}\cdot |\mathbf{x}|/x_i,

where |x|=x i|\mathbf{x}| = \sum x_i. This metric has come to be known as the Shahshahani metric for reasons you can discover from Marc Harper’s papers discussed on John’s blog here.

So, vectors in the tangent plane at a point in the subspace of probability distributions, S n1S_{n-1}, have the form A=a iX iA = a_i X_i, where a i=0\sum a_i = 0, and X iX_i is the obvious basis.

The unit normal vector at xS n1\mathbf{x} \in S_{n-1} for the Shahshahani metric is N=x iX iN = \sum x_i X_i since

A,N=a ix i/x i=0, \langle A, N \rangle = \sum a_i\cdot x_i/x_i = 0,

and

N,N=x i=|x|=1,forxS n1. \langle N, N \rangle = \sum x_i = |\mathbf{x}| = 1, for \mathbf{x} \in S_{n - 1}.

Now another very natural quantity in this set up is the invariant vector field U x=(x ilogx iX i)U_{\mathbf{x}} = (-x_i log x_i X_i). I found this after a discussion with Urs on cohomology and characteristic classes. It is invariant under the multiplicative action of +\mathbb{R}_+,

rU x=(rx ilogx ix irlogr)X i=(rx ilog(rx i))X i=U rx. r \cdot U_{\mathbf{x}} = \sum (-r x_i log x_i - x_i r log r)X_i = \sum (-r x_i log (r x_i)) X_i = U_{r \mathbf{x}}.

An obvious thing to try now is to take the inner product at a point xS n1\mathbf{x} \in S_{n - 1} of UU and NN. We find

U x,N x=H(x), \langle U_{\mathbf{x}}, N_{\mathbf{x}} \rangle = H(\mathbf{x}),

the entropy of the distribution x\mathbf{x}.

Relative entropy seems to arise then as though you parallel transport the invariant vector U yU_{\mathbf{y}} to x\mathbf{x}, then compare the projections of it and U xU_{\mathbf{x}} onto the unit normal vector at x\mathbf{x}.

D(xy)=U xU y,N x. D(\mathbf{x} \| \mathbf{y}) = \langle U_{\mathbf{x}} - U_{\mathbf{y}}, N_{\mathbf{x}} \rangle.

I wonder if from the geometry of the situation we can see why the Fisher-Shahshahani metric emerges as the curvature of the relative entropy D(xy)D(\mathbf{x} \| \mathbf{y}).

Posted at June 17, 2011 5:16 PM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/2410

11 Comments & 1 Trackback

Re: A Bit More About Entropy

I don’t get your calculation showing that the vector field U XU_{\mathbf{X}} is invariant under the multiplicative action of +\mathbb{R}_+. Where does the term involving rlogrr\log r come from? The only point (with positive coordinates) where U XU_{\mathbf{X}} vanishes is at x=(1,,1)x=(1,\ldots,1). However, this point is not invariant under the +\mathbb{R}_+-action. So how is it possible that U XU_{\mathbf{X}} is invariant?

Posted by: Tobias Fritz on June 17, 2011 8:21 PM | Permalink | Reply to this

Re: A Bit More About Entropy

Where does the term involving rlogrr log r come from?

David takes the +\mathbb{R}_+-action on +\mathbb{R}_+ \ltimes \mathbb{R} induced from the cocycle

()log(): + + (-) log (-) : \mathbb{R}_+ \to \mathbb{R}_+ \ltimes \mathbb{R}

and then interprets +×\mathbb{R}_+ \times \mathbb{R} as the tangent space to +\mathbb{R}_+.

So how is it possible that U XU_X is invariant?

Invariance as in his formula. Maybe better “ +\mathbb{R}_+-equivariance”: the value of UU at a point acted on by rr is rr acting on UU at the original point. Just another way of saying that ()log()(-) log(-) is a cocycle.

Posted by: Urs Schreiber on June 17, 2011 9:05 PM | Permalink | Reply to this

Re: A Bit More About Entropy

Ah, ok! Since David talks about an action on a manifold, I thought that he really meant an action on a manifold which acts on vector fields via the ordinary derivative. If one defines the +\mathbb{R}_+-action on the tangent bundle in this standard sense, then invariance (equivariance) does not hold. This is why I was confused.

But actually, David’s +\mathbb{R}_+-action on the tangent bundle +×\mathbb{R}_+\times\mathbb{R} is given by r(x,v)(rx,rv+xrlogr) r\cdot (x,v) \mapsto (r x, r v+x r\log r) as I gather from the discussion. In nn dimensions, this formula holds for each coordinate. Then I agree, U XU_\mathbf{X} is invariant under this action. Maybe the definition of the action should be made more explicit in the post above.

Posted by: Tobias Fritz on June 17, 2011 9:50 PM | Permalink | Reply to this

Re: A Bit More About Entropy

Yes, equivariance. To motivate the action, the conversation from here might help. The cocycle equation is saying that there’s only one way (or one one-parameter family of ways) to deform a +\mathbb{R}_+-bundle, or one equivariant vector field under the action for the cocycle

r(x,b iX i)=(rx,(rb ix irlogr)X i). r \cdot (\mathbf{x}, \sum b_i X_i) = (r \mathbf{x}, \sum (r b_i - x_i r log r)X_i).

Posted by: David Corfield on June 17, 2011 9:49 PM | Permalink | Reply to this

Re: A Bit More About Entropy

I have been trying to find a moment to think a little bit about what the big picture might be that David is apparently seeing aspects of here. I am thinking that if one had a deeper insight here, then there would be a more natural way to understand these equivariance conditions. The tangent bundle analogy might be good, but I am not sure yet that it is perfect.

In fact, we know that a nice way to characterize that “vector field” instead is as a cocycle. We’d just need to bring that cohomological picture in line with the precedure of “contracting with the normal vector”, that David rightly pointed out seems to be a natural way to get the actual entropy functional out of the cocycle.

I still feel that this is all very much fishing in the dark, but that understood, I could imagine fishing in this direction here:

we might first of all think of our finite probability space XSetX \in Set as the coproduct of free monoids

X˜:= xXB. \tilde X := \coprod_{x \in X} \mathbf{B} \mathbb{N} \,.

Then a positive measure ρ\rho on XX is precisely a functor

ρ:X˜B 0. \rho : \tilde X \to \mathbf{B}\mathbb{R}_{\geq 0} \,.

Apart from making this trivial statement true, I am not exactly sure why one would want to look at a probability space as a monoid this way, but since we are fishing in the dark anyway, why not this way; maybe later we’ll see that it has something to do with implementing dynamics (diffusion?) into the picture. But I don’t know.

With this, the entropy density of any probability density ρ\rho is the “characteristic class”

ρ ()logρ ():X˜ρB 0()log()B 0. \rho_{(-)} log \rho_{(-)} : \tilde X \stackrel{\rho}{\to} \mathbf{B} \mathbb{R}_{\geq 0} \stackrel{(-)log (-)}{\to} \mathbf{B} \mathbb{R}_{\geq 0 } \ltimes \mathbb{R} \,.

If this cohomological re-formulation is at all going in the right direction, then it seems inevitable that the step to the actual entropy

S(ρ)= Xρ()logρ() S(\rho) = \int_X \rho(-) log \rho(-)

is given by some flavor of push-forward in cohomology along X*X \to *, aka fiber integration or “umkehr map”.

That would harmonize with the statement that we saw in the other thread: that the JKO theorem indicates that we should think of entropy as an action functional: all action functionals of higher Chern-Simons type are precisely such push-forwards of postcompositions with characteristic maps.

I should say that I am a bit hesitant to say this, because it might be a bit of a stretch here. We don’t really have enough datapoints to see if this fishing in the dark is producing more than old boots at the moment. But in any case, if one is fishing in the dark in the first place, one should pull once there is any vibration.

Posted by: Urs Schreiber on June 18, 2011 12:27 PM | Permalink | Reply to this

Re: A Bit More About Entropy

I see the Wasserstein distance defined in section 3 of that JKO paper arises from an optimal transport problem. Such problems were discussed in the context of profunctors between metric spaces back here. I wonder if an enriched category theoretic story could be told.

Posted by: David Corfield on June 20, 2011 11:18 AM | Permalink | Reply to this

Re: A Bit More About Entropy

Why not look at this in the exterior algebra on n…sets? This carries geometry and deRahm cohomology. This drags along differential, codifferential and laplacian.
Also the Hodge *. The * gives an inner product p wedge *q = V where V is the volume n-form
The probability map is the 1 form df of the map X -> R+.
The (product of the f(xi))V is a number times the volume form in wedge(n).

Campbell points out(p 140)that the normal on Sm-1 in wedge n-1? N = sum( 1 wedge(ihat))
when wedged with the 1 form sum ui wedge(i) with ui= f(xi)
“appears as the expected value of a random variable which takes the values u1,…um.”

The volume of the inscribed n-cube associated with the 1 form u is largest for the uniform distribution.
It is smallest near the axes. So it seems something like inverse information. The magnitude is Product ui.
I find no justification for taking -log but that it turns product into sum and now uniform is smallest.

The codifferential is the adjoint of the exterior derivative (q, cod r) = (dq, r)
Perhaps log is related to codifferential since that assigns to each real number in the top wedge a n-1 form.
The hodge star certainly could drop that to a 1 form and then inner product with df again delivers a number.
Possibly the issue is the other direction *df wedge ( (log(df)), *df) ?
I cannot remember what happens to cocycles but you have xlog(x) as a unique cocycle somewhere.
“cochains in the fundamental sense should assign ‘quantities’ to the chains of homology theory.” cocycles are kernel cod.

It seems that the stuff you talk about only lives in the 2 bottom and top levels of the exterior algebra.
So the various orderings of d,*,and maybe w(edge) df, codf DELf

The Laplace-deRham operator is given by
DEL = (cod + d)^2 = cod.d + d.cod It is symmetric and positive
http://en.wikipedia.org/wiki/Hodge_dual
so what is DEL on n-1 forms? say on *df = sum ( ihat (prod_on_ihat(fi)) )
or on the 1 form df?

It also seems pretty clear that when log shows up exp better too. This may be John’s major point about following Boltzman and Gibbs by bringing in the partition function.
If you go looking at moments of the distributions than the middle binomial layers of the exterior algebra show up.
As well as I can follow, this is what John and Tobias jumped right onto for quantum deformed probability. The probability turns into a diagonal n matrix living in the middle of a bilinear pairing of complex forms and vectors. (x | P | y) This extends to exterior forms and I am sure in a while to Clifford algebras after quantum deformation.

I see in Akant after (93) page 33 of Entropy of Operator-valued Random Variables
http://cdsweb.cern.ch/record/528497/files/0111263.pdf
“Then generating function log Z(S)is the maximum of this functional over all probability distributions. Indeed,sum(j>i)( log |ai - aj| is (up to a constant depending only on N) the log (of the volume)of the space of all hermitean matrices with spectrum a1,a2, …aN”
Earlier p 23 he gets this from an action principle involving log det diag pi.
“The factorized Schwinger{Dyson equations follow from requiring that this ac-tion be extremal under infnitesimal variations of phi.”
Finally he pulls log out of the hat at (65)
In particular we can consider the volume of of the pre-image of a point qbar in Qbar..
It is a measure of our ignorance of the microscopic variables, when qbar is the result of measuring the macroscopic ones. Any monotonic function of this volume is just as good a measure of this ignorance. The best choice is the logarithm of this volume, since then it would be additive for statistically independent systems.”

I get the impression that you already have xlog(x) involved in a unique cocycle?
above and over there http://golem.ph.utexas.edu/cgi-bin/MT-3.0/sxp-comments.fcgi?entry_id=2402;parent_id=38486
“I have an inkling that the uniqueness of xlogx as a cocycle has to do with the uniqueness of …
Now another very natural quantity in this set up is the invariant vector field U x”


Posted by: William Patton on June 19, 2011 3:16 AM | Permalink | Reply to this

Re: A Bit More About Entropy

David wrote:

I wonder if from the geometry of the situation we can see why the Fisher-Shahshahani metric emerges as the curvature of the relative entropy […]

I think you’re just using the word ‘curvature’ in a nonstandard way here, which may confuse people.

You can talk about the curvature of a Riemannian metric, or more generally the curvature of a connection. But the relative entropy is neither of those things.

The relative entropy is just a function that takes a mutually absolutely continuous pair of probability measures x,y\mathbf{x}, \mathbf{y} and gives a number which you’re calling D(xy)D(\mathbf{x} \| \mathbf{y}). And the Fisher information metric — I hope that’s the same as what you’re talking about — at y\mathbf{y} is the Hessian of D(xy)D(\mathbf{x} \| \mathbf{y}), thought of as a function of x\mathbf{x}, at the point x=y\mathbf{x} = \mathbf{y}.

I explained this here, as you probably know.

Posted by: John Baez on June 23, 2011 11:49 AM | Permalink | Reply to this

Re: A Bit More About Entropy

Oh, to work in a discipline which likes to be precise!

So when Amari speaks of his connections, that concerns the third derivative, right? I ought to work out, I suppose, which connection fits with the ‘parallel transport’ of U yU_{\mathbf{y}} mentioned in my post.

Posted by: David Corfield on June 23, 2011 2:43 PM | Permalink | Reply to this

Re: A Bit More About Entropy

Čencov also showed that there is a unique one-parameter family of affine connections called the α\alpha-connection, and these turn out to have a very nice duality property with associated dual divergences (include KL-divergence for α=±1\alpha=\pm 1. A nice introduction including some interesting examples is “Gradient systems in view of information geometry” by Akio Fujiwara and Shun-ichi Amari. Also Amari and Nagaoka’s “Methods of Information Geometry” in the translations of mathematical monographs series has a chapter on the dual structure of statistical manifolds.
Posted by: Marc Harper on August 4, 2011 3:41 AM | Permalink | Reply to this

Re: A Bit More About Entropy

In this case there is an easy way to get a handle on the curvature of the simplex with the Fisher information metric. The transformation x iy i 24x_i \mapsto \frac{{y_i}^2}{4} for each coordinate is a diffeomorphism of the simplex (of nn coordinates) onto the “positive orthant” of the radius 2 n-sphere with the (induced) Euclidean metric. The details are easy to verify, just use the fact that x 1++x n=1x_1 + \cdots + x_n = 1 and compute the metric transformation using the Jacobian (which is very easy since everything is diagonal in this case). The sectional curvatures of these spheres are all 1/41/4, or n(n1)/4n(n-1)/4 if you prefer the scalar curvature. This transformation also tells you what the geodesics are – great circles (or I guess portions of them, more precisely). [See “Evolutionary Games and Population Dynamics” by Hofbauer and Sigmund, pg 84. I’m not sure who first noticed this… maybe Ethan Akin?]
Posted by: Marc Harper on August 4, 2011 3:28 AM | Permalink | Reply to this
Read the post Cohomology in Everyday Life
Weblog: The n-Category Café
Excerpt: intuitive examples of cohomology
Tracked: June 14, 2012 12:37 PM

Post a New Comment