Skip to the Main Content

Note:These pages make extensive use of the latest XHTML and CSS Standards. They ought to look great in any standards-compliant modern browser. Unfortunately, they will probably look horrible in older browsers, like Netscape 4.x and IE 4.x. Moreover, many posts use MathML, which is, currently only supported in Mozilla. My best suggestion (and you will thank me when surfing an ever-increasing number of sites on the web which have been crafted to use the new standards) is to upgrade to the latest version of your browser. If that's not possible, consider moving to the Standards-compliant and open-source Mozilla browser.

February 13, 2007

Category Theoretic Probability Theory II

Posted by David Corfield

Encouraged by the comment of a statistician as eminent as Phil Dawid, I shall continue with what category theory has to say about probability theory. Recall that we were considering a monad on the category of measurable spaces, Meas. For a number of reasons, researchers prefer to work with Pol, the category of Polish spaces, i.e., separable metric spaces for which a complete metric exists. We have an endofunctor P, which sends a space, X, to the space of probability measures on the Borel sets of X. P(X) is equipped with the weakest topology which makes τ Xfdτ continuous for any f, a bounded, continuous, real function on X. We also need a map μ X:P(P(X))P(X):

μ X(M)(A):= P(X)τ(A)M(dτ)

Now, there are interesting developments of this idea, such as Abramsky et al.’s Nuclear and Trace Ideals in Tensored ∗-Categories, which looks to work Giry’s monad into a context even more closely resembling the category of relations, but let’s look about to see if we can make contact with some probability theory. Well, a natural kind of thing to do would be to work out the algebras for the monad. And this is just what Ernst-Erich Doberkat does in Characterizing the Eilenberg-Moore Algebras for a Monad of Stochastic Relations. What you’re looking for are measurable maps between P(X) and X, such that the ‘fibres’ are convex and closed, and such that δ x, the delta distribution on x, is in the fibre over x. And there’s another condition which requires a compact subset of P(X) to be sent to a compact subset of X.

Now, as ever, P(X) will support an algebra, μ X:P(P(X))P(X). This is the analogue of a free group being an algebra of the group monad. But just as there are many interesting groups which are not free, we should want to find algebras of Giry’s monad which are not of the μ X form. Doberkat shows that for such an algebra X must be connected, and suggests this example

h:P([0 ,1 ])τ 0 1 tτ(dt)[0 ,1 ].

I haven’t worked out the details, but I have the suspicion that this might be μ {0 ,1 }. After all, probability measures on {0 ,1 } are just binomial, parameterised by p[0 ,1 ].

The other example he gives has X a bounded, closed and convex subset of n, and probability measures being sent to their barycentre. It wouldn’t surprise me if this were the paradigmatic example of an algebra for the Giry monad. It subsumes the example above.

Finding centres of gravity also occurs in Bayesian information geometry. As shown on page 4 of Hichem Snoussi and Ali Mohammad-Djafari’s Information geometry and prior selection, given a prior distribution over probability distributions, a best estimator for the generating distribution after data has been received is expressible as the barycentre of the posterior distribution. But as we’re working in P(X), this barycentre only makes sense with respect to a geometry on it. Now, the ‘sensible’ geometries form a 1-dimensional family, although only for one member of this family, corresponding to the Kullback-Liebler divergence, is the geometry of P(X) convex, which I guess is what we need for an algebra. Perhaps this could be fixed, otherwise it’s yet another point for the KL-divergence.

Update: Doberkat has a longer article on Eilenberg-Moore algebras of the Giry monad as item 5 here. (Unfortunately, the monograph ‘Stochastic Relations: Foundations for Markov Transition Systems’ doesn’t appear to be available.) There are two monads being treated here, one which sends a Polish space to the space of all probability measures, the other to the space of all subprobability measures. The extra structure relating to these monads, is that of a (positive) convex structure. In the case of a convex structure, this intuitively captures the idea that a weighted sum of points in the space has barycentre within the space.

Doberkat’s work relates to the category of Polish spaces with continuous maps. He notes that it would be interesting to develop the theory for the general case of Borel measurable maps.

Posted at February 13, 2007 6:37 PM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1163

13 Comments & 2 Trackbacks

Re: Category Theoretic Probability Theory II

Nice.. I wish I would have read this before.. I basically created my own formulation of these types of ‘stochastic categories’ (they are really determinstic!) as I was creating ways to analytically solved infinite-dimensional problems ‘exactly’ and without numerical methods except at the final stage of evaluating.. had to use loads of measure theory, capacities, lifting maps, harmonic and complex analysis, the Weil topology, analytical number theory and some other stuff.. I absolutely hate reading this stuff abstractly if no concrete examples are given.. give me some concrete generating functions and something I can *compute*! But, now that I have actual concrete realizations of things I can go back and find out what abstract concept they implement, sort of a backwards approach but it works.

Posted by: Stephen Crowley on February 13, 2007 9:28 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Something intriguing about this monad perspective is that it fits neatly with a Bayesian outlook. Bayesians like to think in terms of distibutions over distributions and form averages. For example, ask me about a coin you’re about to toss, and, so long as I take all exchangeable sequences (permutations) to be equally likely, I can represent my belief in a distribution over p, the binomial parameter. Ask me for my probability that heads will be tossed and I’ll calculate the expectation of that distribution.

I said

Now, the ‘sensible’ geometries form a 1-dimensional family, although only for one member of this family, corresponding to the Kullback-Liebler divergence, is the geometry of P(X) convex, which I guess is what we need for an algebra. Perhaps this could be fixed, otherwise it’s yet another point for the KL-divergence.

One way out is to observe that for δ[0,1 ], where δ parameterises the family of geometries, the space of subprobability measures is δ-convex. Hence the map taking a distribution over subprobability measures to its barycentre with respect to one of these geometries constitutes an Eilenberg-Moore algebra for the subprobability monad.

Posted by: David Corfield on February 14, 2007 12:56 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

David wrote:

For example, ask me about a coin you’re about to toss, and, so long as I take all exchangeable sequences (permutations) to be equally likely, I can represent my belief in a distribution over p, the binomial parameter. Ask me for my probability that heads will be tossed and I’ll calculate the expectation of that distribution.

Indeed, if you do not begin with the apriori assumption that the coin is unbiased but only that your state of ignorance (since you have yet to flip any coins) implies that the conditional probability is 1/2, but Laplace’s rule of succession implies that if your first flip turns up heads, then you are less likely to receive another heads in the next flip, but you still can and usually do, but if you do, then you are even LESS likely to receive another, but it is still possible, in fact there is a non-zero chance that you could flip a coin and get heads each time but Laplace’s rule ensures that that usually never happens and is a law in some sense. Frequentist probabability is totally bogus based on such a revelation because it ignores the “time-dynamics of probability”. Good gamblers know this well. Now, extending these concepts to continuous-time is quite a challenge because the problem immediately becomes infinite dimensional, as in, how do you separate ‘instants’ and project expectations for the future? It can be done but it is very unintuitive. Naely, analytical number theory and analytic continuation comes in handy by using the fact that the gamma function is the real/complex generalization of factorials(combinatorics)! Now, build a sigma-algebra or delta-ring out of all this..

Posted by: Stephen Crowley on February 14, 2007 5:59 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Laplace’s rule of succession implies that if your first flip turns up heads, then you are less likely to receive another heads in the next flip

Isn’t this backwards? If your first flip is heads, then this is evidence that the coin is biased. If you believe that the coin flips are independent (the coin has no memory), then the probability of subsequent heads increases. In an extreme case, after 1000 successive coin flips, I have extremely high confidence that the next flip will be heads (it’s a trick coin).

Certainly this is how Laplace applied his rule of succession to the Sunrise Problem. (After 2 million successive sunrises, the probability of no sunrise tomorrow has shrunk to less than 1 in 2 million.)

Posted by: Toby Bartels on February 14, 2007 7:20 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Toby wrote:

Isn’t this backwards? If your first flip is heads, then this is evidence that the coin is biased. If you believe that the coin flips are independent (the coin has no memory), then the probability of subsequent heads increases. In an extreme case, after 1000 successive coin flips, I have extremely high confidence that the next flip will be heads (it’s a trick coin).

I might have it exactly backwards. I’m thinking of the conditional probability of successive coinflips given apriori no knowledge of the biasedness of the coin and starting off with a ‘prior’ of 1/2. It also depends on what you call ‘memory’, as I know of no system, even in theory, that does not have memory. I realize it is popular to call Markov processes ‘memoryless’ but in fact they have memory, otherwise an infinite set of flips would not have an “average” outcome of 1/2..and in fact they don’t, as this mathematical coin only has 2 states, 1 and 0 (or whatever symbols you want to use in place of 1 and 0.. -1 and +1 are natural choices). So, what good is a theory that gives a single probability for “the outcome” of a coinflip to be 1/2 when in fact that cannot be an outcome.

The real answer is that before event 1(1st event) say that E(1)=1/2 and E(1)=1/2.

event 1:
If you get observe 1, then the conditional probability becomes E(0)=1/3 and E(1)=2/3.

event 2:
If you then observe another 1, then the conditional probability becomes E(0)=1/4 and E(1)=3/4, but if you were to have observed 0 then the probability come back to E(0)=2/4=1/2 and E(1)=2/4=1/2

Obviously, this is a simplified model and applies only to coins in discrete time but it can be generalized to complex numbers and dependent coin-flips and continuous-time, some sort of self-avoiding “quantum” random walk.

Certainly this is how Laplace applied his rule of succession to the Sunrise Problem. (After 2 million successive sunrises, the probability of no sunrise tomorrow has shrunk to less than 1 in 2 million.)

Right, but it should track the probability as closely as possible should there suddenly be deviations from that sequence. It has to observe and ‘reason’ in some sense.

The point is, in the continuous dependent coin-flipping thought-experiment the biasedness is time-varying and must be estimated ‘optimally’ and its infinite-dimensional, think of an infinitely expanding tree structure where you confirm or falsify paths down the tree as observations arrive, this is also even “infinitely infinitely dimensional” if there is error in the observation and you cannot even be completely certain of things you ‘observed’.

In the observation with error (think about it, what other useful observations are there besides ones with error), you cannot completely exclude paths that did not occur because they all occured to some extent and are simply weighted rather than removed.

Posted by: Stephen Crowley on February 14, 2007 7:59 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Actually, those probabilties are inverse, but swap “E(0)’ and “E(1)” where it is wrong, then it is right. The point is that the ratio of heads to tails is a mean-reverting process around a mean value of 1/2 for an unbiased perfect coin. This number, 1/2 is somehow “built into” the law of chance and this is what I mean by saying that coinflips have memory, because it somehow remembers that it should be “hovering around” 1/2 in a mean-reverting fashion, and its impossible to actually attain a probability of 0 or 100%, but it can come vanishingly close to either in theory,

Posted by: Stephen Crowley on February 14, 2007 11:07 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

If you get observe 1, then the conditional probability becomes E(0)=1/3 and E(1)=2/3.

Correct! I mean that this is what Laplace’s Rule of Succession gives.

(Of course, if you have some a priori reason to believe the coin fair —or biased for that matter—, then you’ll have different expectations.)

Posted by: Toby Bartels on February 14, 2007 11:43 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Assume that you are going to put all your faith in exchangeability, e.g., that whatever you see will never cause you to think the flipping mechanism has changed, even if you see 1000 heads then 1000 tails. In this case de Finetti showed that your degree of belief is representable as a probability distribution over Binomial distributions. Think of a function, f(p), on [0 ,1 ] which integrates to 1, representing how you think the coin may be biased. E.g., if I’m dealing with a crook, I might have a spread out f. If I’ve examined the coin and flipping mechanism, I might have f peaked around 0.5.

Then as data comes in my f is updated, via Bayes’ theorem, eventually leading to a highly peaked distribution about the relative frequency of heads.

When asked for my odds for the next toss I calculate 0 1 pf(p)dp, with my latest f(p).

Now I certainly don’t want to get into a philosophy of probability debate. What I’m interested in is that one large difference in statistical learning theory is that one side (frequentists) tends to find modes and the other side (Bayesians) means. It’s curious that the approach via monads has a mechanism for taking the mean of a distribution of distributions, and thus appears to favour the Bayesian side.

Posted by: David Corfield on February 14, 2007 11:09 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Assume that you are going to put all your faith in exchangeability, e.g., that whatever you see will never cause you to think the flipping mechanism has changed, even if you see 1000 heads then 1000 tails. In this case de Finetti showed that your degree of belief is representable as a probability distribution over Binomial distributions. Think of a function, f(p), on [0,1] which integrates to 1, representing how you think the coin may be biased. E.g., if I’m dealing with a crook, I might have a spread out f. If I’ve examined the coin and flipping mechanism, I might have f peaked around 0.5.

Yes, it’s all based on exchangeable sequences. For that to be valid, you have to have some reason for assuming dependence between observations. The independence assumption seems only valid for “low frequency” or “static” applications, which I find boring. de Finetti’s classical theorem is nice for sure, but unless I’m missing something it only applies to real numbers and hence is not algebraicaly complete. The Quantum de Finetti theorem which is used in quantum measurement theory makes a lot more sense to me because of the nice properties of complex numbers.

Dealing with only real numbers leads to all sorts of weirdness if you try to take time into account, and the weirdness goes away in the complex/time domain.

Now I certainly don’t want to get into a philosophy of probability debate. What I’m interested in is that one large difference in statistical learning theory is that one side (frequentists) tends to find modes and the other side (Bayesians) means. It’s curious that the approach via monads has a mechanism for taking the mean of a distribution of distributions, and thus appears to favour the Bayesian side.

I guess I live in a camp of my own then, because I rather deal with an infinite set of complex modes, time-evolution operators on these modes, and rules for updating modes based on likelihoods and indeed the system boils down to a Hamiltonian due to the conservation of probability. In practice and out of necessity only a finite number of these modes can be calculated, but I can take to any degree of accuracy that I desire.

Posted by: Stephen Crowley on February 15, 2007 12:28 AM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

This monadic perspective is exactly how some people model probability distributions in the programming language Haskell. There’s at least one Haskell library that does this. Of course in Haskell we’re dealing with finite spaces and so the definition of μ there involves only a finite summation. But in essence it’s the same thing.

Posted by: Dan Piponi on February 17, 2007 1:35 AM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

If you’ll permit a question from a humble cognitive scientist…

In the article “The converse of a stochastic relation” Doberkat seems to be reinventing Bayesian inference in the category theoretic setting. Indeed, the first equation in that paper is (i think) just Bayes’ rule. Yet he never mentions Bayesian learning… so am I missing something here?

If not, to what extent is the Giry monad the “right” structure to describe the logic of Bayesian induction? Should we Bayesians be describing our models in the internal language of one of the categories associated with the Giry monad? If so, which and what does that language it look like?

Posted by: Noah Goodman on February 18, 2007 2:03 PM | Permalink | Reply to this

Re: Category Theoretic Probability Theory II

Yes you’re right that the first equation is Bayes’ theorem. (Paper 15 here.) However, Bayes’ theorem is not intrinsically Bayesian. It’s just a truth about probabilities.

On the other hand, the Theorem is used often by Bayesians as a means of updating their probability assignments (or ‘degrees of belief’) in situations where it makes no sense to a frequentist. For example, one uses P(he)=P(eh).P(h)/P(e) to find the support given to hypothesis h by evidence e. For many hypotheses P(h) will make no sense for the frequentist. Further, but there are provisos, if e is all you learn, the new P(h) will be the old P(he).

Doberkat doesn’t discuss what kinds of element are to be found in his Polish spaces so may be a frequentist for all I know. My point about the monad picture fitting nicely with Bayesianism is that the map μ X:PP(X)P(X) is something the Bayesian can understand directly, while for the frequentist I can’t see that it could be more than a tool for composing conditional probabilities.

Posted by: David Corfield on February 18, 2007 6:46 PM | Permalink | Reply to this
Read the post Category Theory in Machine Learning
Weblog: The n-Category Café
Excerpt: Does category theory have a future in machine learning?
Tracked: September 5, 2007 4:03 PM
Read the post Progic
Weblog: The n-Category Café
Excerpt: My colleague here in Canterbury Jon Williamson is part of an international research group, progicnet, whose aim is to find a good integration of probability theory and first-order logic. For one reason or another, some technical projects get counted...
Tracked: September 18, 2007 11:38 AM

Re: Category Theoretic Probability Theory II

Roland Friedrich and I were thinking on Giry-Lawvere monad and the variants thereof in few meetings in last two years. We never thought enough through to finish a research but we realized that there are many interesting twins of the Giry monad, for example when one considers projection valued measures instead of scalar valued and then one sees some connections to the subject of coherent states, the subject I was earlier working on in connection to quantum groups. We checked few new analytic details which are more nontrivial when the measure is valued in projective operators rather than scalars, but I do not have any complete account of this. We discussed many other measure like concepts which could benefit from this approach.

For newcomers all that business of the Giry monad is an elaborate version of thinking via characteristic functions like in the primitive case of power set monad, and also delta functions, thus the idea has to do with classifying objects on one side and with measure theoretic concepts on analysis side, and also of reproducing kernels in coherent states and evolution story business. Giry was a student of Lawvere and she wrote an elaboration of an earlier preprint of Lawvere and published it as

Giry, Michèle
A categorical approach to probability theory. Categorical aspects of topology and analysis (Ottawa, Ont., 1980),
pp. 68–85,
Lecture Notes in Math., 915, Springer.
MR83k:60007

There are some analytic incorrectnesses there I was told. Prof. Voevodsky gave some talk at IHES where he had approached some biology-motivated stochastics via a refined Giry approach, I was told by Roland who got then interested in. A priori I think the approach is not giving something deep, but it is an interesting source of analogies. There are many papers after Giry, mainly in the computer science literature. Roland drew my attention to Doberkat’s paper as one of very few which are actually good among the mentioned papers. Though good part of the paper is only a survey.

Posted by: Zoran Skoda on September 19, 2007 6:19 PM | Permalink | Reply to this

Post a New Comment