Skip to the Main Content

Note:These pages make extensive use of the latest XHTML and CSS Standards. They ought to look great in any standards-compliant modern browser. Unfortunately, they will probably look horrible in older browsers, like Netscape 4.x and IE 4.x. Moreover, many posts use MathML, which is, currently only supported in Mozilla. My best suggestion (and you will thank me when surfing an ever-increasing number of sites on the web which have been crafted to use the new standards) is to upgrade to the latest version of your browser. If that's not possible, consider moving to the Standards-compliant and open-source Mozilla browser.

August 13, 2024

Introduction to Categorical Probability

Posted by Tom Leinster

Guest post by Utku Boduroğlu, Drew McNeely, and Nico Wittrock

When is it appropriate to completely reinvent the wheel? To an outsider, that seems to happen a lot in category theory, and probability theory isn’t spared from this treatment. We’ve had a useful language to describe probability since the 17th century, and it works. Why change it up now?

This may be a tempting question when reading about categorical probability, but we might argue that this isn’t completely reinventing traditional probability from the ground up. Instead, we’re developing a set of tools that allows us to work with traditional probability in a really powerful and intuitive way, and these same tools also allow us define new kinds of uncertainty where traditional probability is limited. In this blog post, we’ll work with examples of both traditional finite probability and nondeterminism, or possibility to show how they can be unified in a single categorical language.

Probability distributions. We want to proceed with our discussion through an example, and so before we introduce everything, consider the following:

You’ve just installed a sprinkler system to your lawn! It is a very advanced piece of technology, measuring a myriad of different things to determine when to turn on the sprinklers… and you have no idea how it does this. In your effort to have an idea of when the system turns on (you pay the water bill, after all) you decided to keep track of how the weather feels and whether your sprinkler is on or not.

You make the following distinctions:

Weather = {sunny, cloudy, rainy}, Humidity = {dry, humid}, Temperature = {hot, mild, cold}, Sprinkler = {on, off}

And you have joint data across these sets for each day:

(1)Weather Humidity Temperature Sprinkler sunny humid mild off sunny dry hot on cloudy dry hot on cloudy humid mild on rainy humid cold off cloudy dry cold on sunny humid cold off \begin{array}[t]{cccc} --Weather & Humidity & Temperature & Sprinkler--\\ sunny & humid & mild & off \\ sunny & dry & hot & on \\ cloudy & dry & hot & on \\ cloudy & humid & mild & on \\ rainy & humid & cold & off \\ cloudy & dry & cold & on \\ sunny & humid & cold & off \\ \end{array}

You make an assumption that the frequency with which each weather event occurred would be an accurate estimate for how it will be in the future, and so you assemble the previous 3 months’ weather data into probability distributions.

We will be relating our definitions and examples to this theme of a lawn sprinkler system. To contextualize our examples, we provide some definitions.

A probability distribution on a finite set XX is a function p:2 X[0,1]p: 2^X\to [0, 1] assigning to each subset AXA\subset X a number p(A)p(A) such that

  • p()=0p(\emptyset) = 0,
  • p(X)=1p(X) = 1,
  • and for disjoint subsets A 1,,A kXA_1,\dots, A_k \subset X, ip(A i)=p( iA i)\sum_i p(A_i) = p(\bigcup_i A_i).

For our purposes, a simpler characterization exists from the fact that we can consider a finite set to disjointly consist of its individual points; namely we can think of a probability distribution on XX to be a function p:X[0,1]p: X\to [0, 1] such that xXp(x)=1.\sum_{x\in X} p(x) = 1.

We will also make use of the ket notation to denote a distribution/state on XX; for X{x 1,,x k}X\coloneqq \{x_1,\dots,x_k\} with the values λ ip(x i)\lambda_i \coloneqq p(x_i), the following notation also describes a distribution on XX:

(2) i=1 kλ i=1λ 1x 1+λ 2x 2++λ kx k \sum_{i=1}^k \lambda_i = 1 \leftrightsquigarrow \lambda_1\mid x_1\rangle + \lambda_2\mid x_2\rangle + \dots + \lambda_k\mid x_k\rangle

Given this notion, we can model the transition between “state spaces” XX to YY by means of a stochastic matrix, which is a matrix f:X×Y[0,1]f: X\times Y \to [0,1] such that each column sums to 1, which we denote

(3) yYf(yx)=1 \sum_{y\in Y} f(y\mid x) = 1

Following our established ket notation, we can equivalently describe the action of the channel f:XYf: X\to Y by

(4)f x:γ 1y 1+γ 2y 2++γ ny n f_x: \gamma_1 \mid y_1\rangle + \gamma_2 \mid y_2 \rangle + \dots + \gamma_n \mid y_n \rangle

with γ if(y ix)\gamma_i \coloneqq f(y_i\mid x) and f xf_x forming a probability distribution on YY.

Furthermore, given two channels f:XYf: X\to Y and g:YZg: Y\to Z, we also have a way of obtaining a composite channel gf:XZg\circ f: X\to Z, by the Chapman-Kolmogorov equation, defining the channel

(5)(gf)(zx) yYg(zy)f(yx) (g\circ f)(z\mid x) \coloneqq \sum_{y\in Y} g(z\mid y)f(y\mid x)

We interpret distributions to be channels from the singleton set to their respective sets, e.g.: p:{*}Wp: \{\ast\} \to W, q:{*}Hq: \{\ast\} \to H, r:{*}Tr: \{\ast\} \to T. Then, composing any such distribution with a channel will again yield a distribution

(6){*}pXfY \{\ast\} \stackrel{p}{\to} X \stackrel{f}{\to} Y

Consider the example scenario we described above. Suppose that you compiled the historical weather data into the following probability distribution p:{*}WHTp: \{\ast\} \to W\otimes H\otimes T (more to come about \otimes in just a second):

(7)p *:0.2s,d,h+0.3r,h,c+0.3c,h,m+0.2c,d,h p_\ast: 0.2\mid s,d,h\rangle + 0.3\mid r,h,c\rangle + 0.3\mid c,h,m\rangle + 0.2\mid c,d,h\rangle

From the table in the example, we can obtain the following channel f:WHTSf: W\otimes H\otimes T \to S (if we assume the principle of indifference, i.e., that the entries in the table all occur with equal probability–which is the case, as we have a list of observations)

(8)f (w,h,t)=δ w,h,t,onon+δ w,h,t,offoff, f_{(w,h,t)} = \delta_{w, h, t, \text{on}} \mid \text{on} \rangle + \delta_{w, h, t, \text{off}} \mid \text{off} \rangle,

where δ w,h,t,on\delta_{w, h, t, \text{on}} is 11 if the combination w,h,t,onw, h, t, \text{on} occurs in the table above, and 00 otherwise; analogously for δ w,h,t,off\delta_{w, h, t, \text{off}}.

Then, by everything we’ve established so far, we can reason about the likelihood that the sprinkler will turn on the next day by composing the state pp of the climate with the channel ff to obtain a state fpf\circ p of the sprinkler, computed

(9)fp:0.7on+0.3off f\circ p: 0.7\mid \text{on} \rangle + 0.3\mid \text{off} \rangle

All in all, along with the identity matrices, all this data assembles into the category FinStoch\mathsf{FinStoch} with

  • objects: finite sets
  • morphisms: stochastic matrices
  • where the composition is determined through the Chapman-Kolmogorov equation

This is one of the first examples of a Markov category that we will be looking at, and it will be a good baseline to observe why a Markov category is defined the way it is.

Possibility distributions. Markov categories need not only house probabilistic models of uncertainty; we’ll see that the following also forms a Markov category:

Consider a channel between two finite sets XX, YY to be an assignment f:XYf: X\to Y such that each f(x)Yf(x)\subset Y is a non-empty subset. Defining the composition to be

(10)gf(x) yf(x)g(y) g\circ f (x) \coloneqq \bigcup_{y\in f(x)} g(y)

and the identities as x{x}x \mapsto \{x\} gives us the Markov category FinSetMulti\mathsf{FinSetMulti} of possibilities!

The same data from the example can be used in a possibilistic way as well; a channel SWHTS \to W\otimes H\otimes T can map the sprinkler to all the possible states of weather/climate where the sprinkler has turned on etc. Then, a state is just a set of possible configurations, and the composed state fpf\circ p is the set of all possible configurations one can reach from the initial configurations of pp.

Something you may have noticed from the two examples of morphisms of Markov categories is that fixing an element xXx\in X yields some structure attached to YY with “desirable properties”: in the case of FinStoch\mathsf{FinStoch}, we have that each f xf_x is a probability distribution on YY – in fact, the Chapman-Kolmogorov formula further provides a way to obtain a probability distribution from a probability distribution of probability distributions. In the case of FinSetMulti\mathsf{FinSetMulti}, each f xf_x is a non-empty subset of YY, and the composition is provided through the union of a set of sets.

This is not a coincidence: we will see that for certain monads, the Kleisli category they yield turn out to be Markov categories! The monads in question will provide us descriptions of what the channels are, as well as the rule for composition.

Kleisli Categories. If you are familiar with Kleisli categories, you might have uncovered FinSetMulti\mathsf{FinSetMulti} from above as the Kleisli category of the normalized finite powerset monad. In fact, it turns out that many Markov categories of interest arise as Kleisli categories of so-called probability monads, such as the Giry monad, Radon monad, or distribution monads over semirings. Rather than explaining (technical) details of these, we want to dive into the underlying construction.

If you do not know Kleisli categories–don’t worry, we’ll try to explain it on the go, focusing on the relevant properties for our purpose. The idea is the following:

  1. Take a cartesian monoidal category D\mathsf{D}, representing deterministic processes and
  2. Introduce non-deterministic processes as a monadic effect by a probability monad T:DDT : \mathsf{D} \to \mathsf{D}. The Kleisli category D T\mathsf{D}_T of TT has the same objects as D\mathsf{D}, and morphisms D T(X,Y)D(X,TY).\mathsf{D_T}(X,Y) \coloneqq \mathsf{D}(X, TY) . We call these Kleisli maps.

Example. as already mentioned, the category FinSetMulti\mathsf{FinSetMulti} (modelling possible transitions between finite sets) is the Kleisli category of the normalized finite powerset monad P:FinSetFinSetP: \mathsf{FinSet} \to \mathsf{FinSet} which assigns to a set XPX{SX:S}.X \mapsto PX \coloneqq \{ S \subseteq X : S \neq \emptyset \}. “Normalization” here means that we exclude the empty set.

While the previous example captures possibility, the following is the most general framework for probability:

Example. The Markov cateory Stoch\mathsf{Stoch} is the Kleisli category of the Giry monad G:MeasMeasG : \mathsf{Meas} \to \mathsf{Meas} on the (cartesian monoida) category of measurable spaces, sending a measurable space XX to the space PXP X of distributions on it (a.k.a probability measures). Its Kleisli morphisms are known as Markov kernels. (Hence the name Markov categories!) By definition, these are measurable functions f:XPYf: X \to P Y, meaning that each point xXx \in X is assigned a distribution f xf_x on YY: normal distributions, uniform distribution, delta distribution (a.k.a Dirac measure), … Regard it as a stochastic process with input XX and probabilistic output YY. If the weather is sunny tomorrow, will the sprinkler switch on? Well, probably… In contrast, morphisms XYX \to Y in Meas\mathsf{Meas} (i.e. measurable functions) are deterministic, as their output are points f(x)Yf(x) \in Y being definitely determined by their input xXx \in X.

A general definition of determinism in Markov categories follows below. Before, let’s investigate the term process, by which we mean morphisms in a monoidal category: the usual composition amounts to the concatenation of processes, while the tensor product merges two subsystems into one, by running them “in parallel”.

For our category D\mathsf{D} of “deterministic processes”, this is straight forward; being cartesian monoidal means

  1. it has a terminal object II.
  2. it has products X×YX \times Y and projection pairs Xout 1X×Yout 2YX \xleftarrow{\mathrm{out}_1} X \times Y \xrightarrow{\mathrm{out}_2} Y satisfying the universal property of the product:

    Diagram of universal property of the product.

  3. it has a symmetric monoidal structure induced by 1. and 2.

Things are more complicated for the Kleisli category D TD_T: to get a tensor product, we need the monad TT to be commutative, i.e., it comes with well behaved zipper functions (in D\mathsf{D}) X,Y:TX×TYT(X×Y).\nabla_{X,Y} : TX \times TY \to T(X \times Y). To be precise, we require X,Y\nabla_{X,Y} to make T:DDT : \mathsf{D} \to \mathsf{D} a symmetric monoidal functor, such that multiplication and unit of the monad are monoidal natural transformations.

Kleisli maps fD(A,TX)f \in \mathsf{D}(A, TX) and gD(B,TY)g \in \mathsf{D}(B, TY) may then be tensored as fg:A×Bf×gTX×TY X,YT(X×Y).f \otimes g : A \times B \xrightarrow{f \times g} TX \times TY \xrightarrow{\nabla_{X,Y}} T(X \times Y).

Example. For the power set-monad P:FinSetFinSetP : \mathsf{FinSet} \to \mathsf{FinSet}, the zipper maps two subsets AXA \subseteq X and BYB \subseteq Y to X,Y(A,B)A×BX×Y\nabla_{X, Y} (A, B) \coloneqq A \times B \subseteq X \times Y. Kleisli maps f:APXf: A \to PX and g:BPYg: B \to PY hence have the tensor product fg:(a,b)f a×g bX×Y.f \otimes g: (a, b) \mapsto f_a \times g_b \subseteq X \times Y .

Example. The zipper for the Giry monad G:MeasMeasG : \mathsf{Meas} \to \mathsf{Meas} assigns the product measure μν\mu {\otimes} \nu to probability measures μ\mu, ν\nu. Tensoring two Markov kernels f:AGXf: A \to GX and g:BGYg: B \to GY yields fg:(a,b)f ag b.f \otimes g: (a, b) \mapsto f_a \otimes g_b .

In categorical terms, the induced symmetric monoidal structure on the Kleisli category D T\mathsf{D}_T is such that the Kleisli functor Kl T:DD TKl_T : \mathsf{D} \to \mathsf{D}_T is strict symmetric monoidal.

But we want more: we want the Kleisli functor to preserve the projection pairs out i\mathrm{out}_{i}, in that the following diagrams (in D T\mathsf{D}_T!) commute for del iKl T(out i)\mathrm{del}_{i} \coloneqq Kl_T (\mathrm{out}_{i}):

Rectangle with projections.

There are multiple equivalent requirements:

  • del 1\mathrm{del}_1 and del 2\mathrm{del}_2 are natural transformations;
  • TT preserves the terminal object ITII \cong TI (in D\mathsf{D});
  • II is a terminal object in D T\mathsf{D}_T, which is thus semicartesian.

Example. The power set monad P:FinSetFinSetP: \mathsf{FinSet} \to \mathsf{FinSet} preserves the terminal object {*}\{\ast\} if it is “normalized” in the above sense: PX\emptyset \notin PX.

As a consequence, D T\mathsf{D}_T has weak products: any pair of Kleisli maps f:AXf: A \to X, g:AYg: A \to Y factorizes as:

Diagram with weak products.

with the copy process of AA copy AKl T(id A,id A)D T(A,AA).\mathrm{copy}_A \coloneqq Kl_T (\langle \mathrm{id}_A, \mathrm{id}_A \rangle) \in \mathsf{D}_T (A, A \otimes A). (The lower triangles commute, as KlKl is a functor and out iid A,id A=id A\mathrm{out}_i \circ \langle \mathrm{id}_A, \mathrm{id}_A \rangle = \mathrm{id}_A.)

However, the vertical Kleisli map AXYA \to X \otimes Y in that diagram is not unique–hence the term semicartesian–as the following example shows.

In the Kleisli category of the Giry-monad, consider the uniform distributions on X={sunny,cloudy,rainy}X = \{\text{sunny}, \text{cloudy}, \text{rainy}\} and Y={on,off}Y = \{\text{on}, \text{off}\}, i.e. Markov kernels f:{*}X,*13|sunny+13|cloudy+13|rainyf : \{\ast\} \to X, \quad \ast \mapsto \frac{1}{3} |\text{sunny} \rangle + \frac{1}{3} |\text{cloudy} \rangle + \frac{1}{3} |\text{rainy} \rangleg:{*}Y,*12|on+12|off.g : \{\ast\} \to Y, \quad \ast \mapsto \frac{1}{2} |\text{on} \rangle + \frac{1}{2} |\text{off} \rangle. Then the product-diagram commutes for both of the following Markov kernels {*}X×Y\{\ast\} \to X {\times} Y: *16|sunny, on+16|sunny, off+16|cloudy, on+16|cloudy, off+16|rainy, on+16|rainy, off\ast \mapsto \frac{1}{6} | \text{sunny, on} \rangle + \frac{1}{6} | \text{sunny, off} \rangle + \frac{1}{6} | \text{cloudy, on} \rangle + \frac{1}{6} | \text{cloudy, off} \rangle + \frac{1}{6} | \text{rainy, on} \rangle + \frac{1}{6} | \text{rainy, off} \rangle*13|sunny, on+13|cloudy, off+16|rainy, on+16|rainy, off.\ast \mapsto \frac{1}{3} | \text{sunny, on} \rangle + \frac{1}{3} | \text{cloudy, off} \rangle + \frac{1}{6} | \text{rainy, on} \rangle + \frac{1}{6} | \text{rainy, off} \rangle. Which one is obtained as above via {*}copy{*}{*}fgXY\{\ast\} \xrightarrow{\mathrm{copy}} \{\ast\} \otimes \{\ast\} \xrightarrow{f \otimes g} X \otimes Y?

In probability theory, this ambiguity is known as the fact that Markov kernels h:AXYh : A \to X \otimes Y are not determined by their marginalizations del 1h:AX\mathrm{del}_1 \circ h: A \to X and del 2h:AY\mathrm{del}_2 \circ h: A \to Y. From a more category theoretic perspective, it means that the family of copy morphisms is not natural.

At first glance, this may not seem very convenient: considering something non-natural–but we want this, in order to capture uncertainty. We will give the details below (in subsection “Determinism”) and return to the big picture from above: a probability monad TT on a cartesian monoidal category D\mathsf{D} induces probabilistic/non-deterministic morphisms, which destroy the uniqueness constraint of the product and leaves us with weak products in D T\mathsf{D}_T.

In the corresponding diagram of weak products, we have already seen the rectangles on top commute if and only if D T\mathrm{D}_T is semicartesian.

Rectangles with projections

Have you noticed that the triangles at the bottom look like a counitality constraint? In fact, each (A,copy A,del A)(A, \mathrm{copy}_A, \mathrm{del}_A) is a commutative comonoid object in D T\mathsf{D}_T. This is the starting point for the general definition of Markov categories.

Triangles with projections

Markov Categories. Let’s start with a terse definition: A Markov category is a semi-Cartesian category where every object is a commutative comonoid compatible with the monoidal structure.

In more detail, a Markov category is a symmetric monoidal category (C,,I)(\mathsf{C}, \otimes, I) where each object is equipped with

  • a deletion map del X:XIdel_X : X \to I depicted as

    String diagram of deletion map.

    Even though they differ from the deletion maps from the section on Kleisli categories, they are morally the same.

  • a copy map copy X:XXXcopy_X :X \to X \otimes X depicted as

    String diagram of copy map.

such that

  • the collection of deletion maps is natural:

    String diagram of naturality of delete maps.

    Equivalently, II is required to be terminal. Hence, the deletion maps are compatible with the tensor product:

    String diagram of deletion maps of tensor products.

  • the collection of copy maps is compatible with the symmetric monoidal structure

    String diagram with copy maps of tensor products.

  • each (A,copy A,del A)(A, \mathrm{copy}_A, \mathrm{del}_A) is a commutative comonoid:

    String diagrams with axioms of commutative comonoids.

Let’s go a little bit more in-depth into why each of these axioms are required:

It is obvious why composition and identity is important to form a category. We note, however, that we want to think of constituents of a Markov category as states and channels that take states to states. So, in such a case, compositionality is important to be able to talk about “taking states to states”, where for a state pp, we wish for its “pushforward” f *(p)=fpf_\ast(p) = f\circ p to be a state as well. We also want to compose (probabilistic) systems out of smaller building blocks. From a more probability theoretic point of view, our theory should allow us to describe distributions over joint variables.

There are also reasons why we want the comultiplication structure. We can think of the copy map as a Markov kernel that takes an input xXx \in X and outputs a Dirac delta distribution on its diagonal, δ (x,x)T(XX)\delta_{(x,x)} \in T(X\otimes X). In our example, the copy morphism on our set of weather conditions forms the following stochastic matrix:

copy W= S C R (S,S) (S,C) (S,R) (C,S) (C,C) (C,R) (R,S) (R,C) (R,R) [1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1] copy_W = \array{ & \array{S & C & R} \\ \array{ (S,S) \\ (S,C) \\ (S,R) \\ (C,S) \\ (C,C) \\ (C,R) \\ (R,S) \\ (R,C) \\ (R,R) } & \left [ \array{ 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 } \right ] }

When a distribution is postcomposed with a copy, it will land on the diagonal in the joint space. So for instance, if a distribution on weather states is p W=0.2|Sunny+0.3|Cloudy+0.5|Rainyp_W = 0.2 | \mathrm{Sunny} \rangle + 0.3 | \mathrm{Cloudy} \rangle + 0.5 | \mathrm{Rainy} \rangle, then we get copy Wp W=0.2|Sunny,Sunny+0.3|Cloudy,Cloudy+0.5|Rainy,Rainy\mathrm{copy}_W \circ p_W = 0.2 | \mathrm{Sunny},\mathrm{Sunny} \rangle + 0.3 | \mathrm{Cloudy},\mathrm{Cloudy} \rangle + 0.5 | \mathrm{Rainy},\mathrm{Rainy} \rangle

Cartesian categories come equipped with diagonal maps that do something very similar to this. Paired with the projections, this makes all objects of Cartesian categories comonoids as well, and in fact all Cartesian categories are Markov categories. These don’t make for very interesting categories in terms of probability though, since all morphisms are deterministic as mentioned earlier. But if we have a probability monad on a Cartesian category, we can transport the diagonal maps into its Kleisli category, and these become precisely the copy maps. So why do we want this comultiplication structure on our objects? If we think of string diagrams as having pipes through which information flows, then it’s useful to duplicate information and run different transformations on their parallel streams for comparison.

In probability theory, we have a kind of projection known as marginalization. Here, we build this projection morphism using the delete map and densoring it with an identity. Naturality of the deletion maps corresponds to normalization of Markov kernels. More generally speaking, deleting information seems desirable in a framework for information processing (even though it’s impossible in quantum information theory). Naturality of del\mathrm{del}, i.e. terminality of II, means that deleting an output of a process deletes the whole process.

As seen above (section on Kleisli categories), this allows for weak products; this is the category theoretic description of uncertainty, in contrast to determinism of cartesian monoidal categories.

Common Markov categories. There are tons of Markov categories out there, some quite obscure, but also many whose compositional computations are used every single day. As we’ve seen, every Cartesian category such as Set\mathsf{Set} is Markov, albeit one that doesn’t contain uncertainty. But as we’ve also seen, for each affine commutative monad, whose functor can represent the collection of uncertainty distributions over an object, its Kleisli category is Markov. The classic example here is Stoch:=Kl(G)\mathsf{Stoch} := \mathsf{Kl}(G) for the Giry monad. The primary topic for our research week during the Adjoint School was actually to try to find a monad for imprecise probability distributions such as Dempster-Shafer belief functions. Our method so far is to try to find a type of Riesz representation theorem to equivocate these belief functions with functionals. Since functionals form the continuation monad, then this would be an easy way to prove that our belief functions form a monad themselves.

There are other Markov categories that aren’t quite Kleisli categories, but they come from very similar constructions such as relative monads or decorated linear maps. In our example we’ve been working with FinStoch\mathsf{FinStoch}, which has finite sets and stochastic matrices, and another common one is Gauss\mathsf{Gauss}, which contains Gaussian normal distributions on n\mathbb{R}^n and affine maps with Gaussian noise.

Additional Axioms and Properties Markov categories as we’ve built them so far form a great setting for probability, but the characters on stage have a lot more depth to them than just being stochastic kernels. Many morphisms have relationships with each other that correspond to useful notions in traditional probability.

Determinism. Looking back at Cartesian categories, there seems to be something special about them: all of their morphisms seem to be “deterministic,” in that they map a single input to a single output. This isn’t a very categorical notion though, so let’s try to find properties of Cartesian categories that encapsulate the idea that there’s no uncertainty in the morphism outputs.

One unique property that Cartesian categories have over Markov categories is that their diagonal maps are natural in a certain sense. Explicitly, if we equate the two inputs of the tensor product to form a “squaring” endofunctor :fff- \otimes - : f \mapsto f\otimes f, then the collection of diagonal maps in a Cartesian category form a natural transformation Δ:id\Delta : \mathrm{id} \rightarrow - \otimes -. The copy maps in a general Markov category do not follow the naturality square for all morphisms, which translates to the following string diagram:

Definition of Determinism

This actually makes sense as a condition for a kernel to be deterministic! If we really think about what uncertainty means, it boils down to the idea that many different outputs of a process could be possible given a single input. Say the process maps pressure to weather state, and it’s a low pressure day. You could duplicate these exact pressure conditions on the other side of town, but the weather gods might decide to bless your neighbors with rain while they leave you only with cloud cover. This would be different from copying your weather state and pasting it over your friend’s house. On the other hand, a deterministic process could be from weather to sprinkler, if it’s always guaranteed to sprinkle when the sun is out. If you and your friend have identical weather, there’s no difference between each sprinkler having its own sun sensor or a single sensor controlling both.

Here’s a concrete example with possibilistic states: Say the forecast today has p W={Cloudy,Rainy}p_W = \{\mathrm{Cloudy}, \mathrm{Rainy}\} as possibilities. If we copy this, we get copy Wp W={(Cloudy,Cloudy),(Rainy,Rainy)}\mathrm{copy}_W \circ p_W = \{(\mathrm{Cloudy}, \mathrm{Cloudy}), (\mathrm{Rainy}, \mathrm{Rainy})\} which is not equal to p Wp W={(Cloudy,Cloudy),(Cloudy,Rainy),(Rainy,Cloudy),(Rainy,Rainy)}p_W \otimes p_W = \{(\mathrm{Cloudy}, \mathrm{Cloudy}),(\mathrm{Cloudy}, \mathrm{Rainy}), (\mathrm{Rainy}, \mathrm{Cloudy}), (\mathrm{Rainy}, \mathrm{Rainy})\}. On the other hand, we could look outside and determine the weather is certainly q W={Rainy}q_W = \{\mathrm{Rainy}\}. Then copying and tensoring would both give us copy Wq W={(Rainy,Rainy)}\mathrm{copy}_W \circ q_W = \{(\mathrm{Rainy}, \mathrm{Rainy})\}.

Only Cartesian categories have all-deterministic morphisms, and so we also call them deterministic Markov categories. Further, all of the following are equivalent statemtents:

  • A Markov category is deterministic
  • Its copy map is natural
  • It is Cartesian

General Markov categories have not only deterministic morphisms, but they at least have a few. In fact, it’s not hard to prove that copies, deletes, swaps, and identities are all deterministic themselves, and that determinism is closed under composition. This means that the collection of deterministic morphisms form a wide subcategory of C\mathsf{C}, which we call C det\mathsf{C}_{\mathrm{det}}, and that category is Markov itself!

Example. The deterministic subcategory of FinSetMulti\mathsf{FinSetMulti} is (equivalent to) FinSet\mathsf{FinSet}.

Example. Similarly, FinStoch\mathsf{FinStoch} has FinSet\mathsf{FinSet} as deterministic subcategory. However, this does not translate to the infinite case: Stoch detMeas\mathsf{Stoch}_{\mathrm{det}} \ncong \mathsf{Meas}!

Conditionals. In traditional probability, we define a conditional probability as “the probability of one event given that another event is already known to have occurred.” In traditional probability, we define a conditional probability as “the probability of one event given that another event is already known to have occurred.” This is constructed from a joint probability distribution, whose values are “renormalized” to the restriction of the known event.

For example, say the forecast for today given jointly for temperature and weather, and the data is given in the table below:

p= Hot Cold Sunny .1 Cloudy .1 .2 Rainy .6 p = \array{\arrayopts{\collines{solid} \rowlines{solid}} & \mathbf{Hot} & \mathbf{Cold} \\ \mathbf{Sunny} & .1 & \\ \mathbf{Cloudy} & .1 & .2 \\ \mathbf{Rainy} & & .6 }

Now if we feel that it’s cold outside, what’s our new estimate for the chance of rain? We can calculate this by restricting our data to only the event of low pressure, and renormalizing that data to sum up again to 1. Renormalization is easily done by dividing our values by the total probability of that restriction, which is .2+.6=.8.2 + .6 = .8. So the chance of rain given that it’s cold is .6/.8=.75.6/.8 = .75.

From here, we have a general formula for calculating conditional probability in the finite case:

p(y|x)=p(y,x) xp(y,x)p(y|x) = \frac{p(y,x)}{\sum_x p(y,x)}

where the traditional notation for the conditional probability of yy given xx is given by a pipe separating them. If this looks exactly like the notation for stochastic kernels, this is no coincidence! In fact, we can calculate these quantities for all outcomes to generate a stochastic kernel from TT to WW:

p |T=[.5 0 .5 .25 0 .75] p_{|T} = \begin{bmatrix} .5 & 0 \\ .5 & .25 \\ 0 & .75 \end{bmatrix}

We give this kernel the same name as pp but with the subscript |T|T to show that we turned pp’s output into an input.

There are many different formulas for conditionals with respect to different kinds of probability, but how do we generalize this concept to cover all types of uncertainty, and put it in the language of our framework? The key insight is to recognize that at the end, we were able to build a morphism from TT to WW that used the relationships between the two variables in pp. In fact the information contained in p |Tp_{|T} gives it a special property which allows it to serve as sort of a “recovery process” for some data in pp, as shown in the diagram below.

Imagine you’ve forgotten the weather portion of today’s forecast, but you remember the predictions on what today’s temperature will be. This is represented by the marginalization p Tp_T. If you’ve calculated this conditional kernel earlier and stored it as backup, then you can simply graph this out with your remaining data to fully restore the original information! We’ll use this as the basis for our definition, but we’ll add parametrization with an input:

Definition. Given a morphism f:AXYf:A \rightarrow X\otimes Y, a conditional on XX which we call f |Xf_{|X} is any morphism, f |X:AXYf_{|X}: A\otimes X \rightarrow Y which satisfies

Definition of Conditionals

which again can act as a recovery process from XX to YY (parametrized by AA) if the original data on YY has been deleted.

Unfortunately conditional morphisms are difficult to find, are not unique, and might not even exist for a given kernel. However if they do exist, then they are unique up to a certain equivalence called almost sure equality. And there are many Markov categories which do have conditionals for every morphism (such as BorelStoch\mathsf{BorelStoch}, unlike Stoch\mathsf{Stoch}), and there are even several Markov categories for which we have closed-form solutions for conditionals.

To make string diagrams simpler, we often draw conditionals like so:

Bent Wire Notation for Conditionals

where we “bend the wire back” to signify which output has been turned into an input. We should note though, this is only graphical sugar and does not represent some kind of “cap” morphism. In fact, nontrivial compact closed Markov categories do not exist. Conditionals also cannot be built up from compositions of other morphisms, so we put a dashed box around it to signify that the contents inside are “sealed off” from associating with other morphisms on the outside. So when we draw a bunch of morphisms inside the dashed box, it means we’re taking the conditional of the morphism resulting from composition of the smaller morphisms. Even though the dash box seals the insides, luckily there are some properties of conditionals that allow us to do rewrites. Bent wire notation makes these really nice:

Rewrite Rules for Conditionals

where the gg in the bottom equation needs to be deterministic.

Conditional Independence. In traditional probability, a joint distribution is said to be independent in its variables if it satisfies p(x,y)=p(x)p(y)p(x,y) = p(x)p(y) for all xx and yy.

So for instance, the following joint state on temperature and pressure is independent p= High(.4) Low(.6) Hot(.8) .32 .48 Cold(.2) .08 .12 p = \array{\arrayopts{\collines{solid} \rowlines{solid}} & \mathbf{High}\ (.4) & \mathbf{Low}\ (.6) \\ \mathbf{Hot}\ (.8) & .32 & .48 \\ \mathbf{Cold}\ (.2) & .08 & .12 }

The marginals are shown with the labels, so you can see that each entry is the product of its marginals.

However, if we move just one speck of mass over from (High, Hot) to (Low, Hot), then it breaks independence:

p= High(.39) Low(.61) Hot(.8) .31 .49 Cold(.2) .08 .12 p = \array{\arrayopts{\collines{solid} \rowlines{solid}} & \mathbf{High}\ (.39) & \mathbf{Low}\ (.61) \\ \mathbf{Hot}\ (.8) & .31 & .49 \\ \mathbf{Cold}\ (.2) & .08 & .12 }

This traditional definition seems a little arbitrary, so what does this mean intuitively? String diagrams can help here, and further they will allow us to generalize to any Markov category. First, a very informal definition: we say that a morphism p:AXYp:A\rightarrow X\otimes Y displays independence in its outputs if its conditional doesn’t use its bent-wire input at all. We also say that its outputs are not correlated with each other, or they don’t share mutual information.

Let’s look at this more closely in string diagrams with a formal definition:

Definition. A morphism p:AXYp:A\rightarrow X\otimes Y displays XY||AX\perp Y || A, read as “XX is independent of YY given AA”, if its conditional can be calculated as Conditional Independence

This looks like the bent wire has just been snipped! But if we look back to the definition of conditionals, this encapsulates the idea that there’s no information about YY contained in XX. If the conditional is a “data recovery” morphism that reconstructs YY from the information it shares with XX, then we notice two things: one, the original ff has to be used in the conditional, which means the recovery morphism needs to store the entirety of YY’s original information to recover it. And two, the XX input wire juts gets deleted, so it doesn’t use any information from our unforgotten channels during recovery. This means that whatever you know about XX isn’t useful in reconstructing the information on YY.

If we do some string diagram manipulation, then we’ll see that this reduces to the traditional definition of independence:

Symmetric Conditional Independence

From here, we can see that this definition is actually symmetric: if XX is independent from YY, then YY is also independent from XX and we can just say that XX and YY are independent of each other. Is it possible for XX and YY to be only partially independent? What if we can factor out a component of the states that exhibits dependence, and the other components are independent? We can capture that scenario with the following: We say that a morphism with signature f:AXBYf: A \rightarrow X\otimes B\otimes Y, exhibits XY|B||AX\perp Y | B || A, read as “XX is independent of YY given input AA and output BB”, if its conditional with respect to XX and BB is equal to the conditional with its XX wire snipped, like so:

General Definition of Independence

This version of independence has several equivalent definitions, and it is also symmetric.

Conclusion. So what can we do with all these constructions? It’s neat that we now have a graphical language to describe probability, and also we have a unifying language that describes all different types of uncertainty. We’ve already done a lot of work in formulating traditional results in terms of Markov categories, which then generalizes these results to several different types of uncertainty representations. For instance, there are generalized results for the De Finetti Theorem, general constructions for Hidden Markov Models, and graphical ways to do causal inferencing. One benefit to categorifying these constructions could be to simplify the constructions of complicated statistical algorithms by making their code modular: the general categorical construction can be coded in the abstract, and kept separate from code used to build a Markov category.

One other cool thing is that, since the categorical construction is so broad, it allows you to really dive deep into discovering the “meaning” of uncertainty, and how it’s structured. For example, information flow axioms help explain what makes stochastic probability (that is, probabilities valued between 0 and 1) so special as a representation of uncertainty, as opposed to assigning other kinds of values to events.

Currently, there’s an ever-growing list of classical probabilistic notions that are being captured in this categorical language. While it seems like we’re reinventing the wheel with a new language, really what we’re trying to do is just make the wheel bigger.

Posted at August 13, 2024 2:03 PM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/3549

18 Comments & 0 Trackbacks

Re: Introduction to Categorical Probability

This has got to be the best mathematics blog on the Internet. Thank you for taking the time to produce such high-quality content.

Posted by: Max on August 14, 2024 3:08 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

I’d like to say a bit about why I’m interested in categorical probability, coming from a probability background, even if it does appear to be reinventing the wheel.

Kolmogorov’s framework for probability as measure theory on spaces of total mass one has been tremendously successful, but it does a poor job on its own of reflecting what probability theory is actually about, and how probability theorists actually work.

On the one hand, it builds in structures that we rarely think about, in particular the domains of whatever random variables we’re working with. A (real-valued) random variable is universally defined to be a measurable function X:ΩX: \Omega \to \mathbb{R} on a probability space (Ω,μ)(\Omega, \mu), but most probability papers either mention (Ω,μ)(\Omega, \mu) only in passing or never mention it at all. And elements ωΩ\omega \in \Omega or the values X(ω)X(\omega) never come up unless it’s absolutely necessary for technical reasons. If you even talk about these things, except in very particular ways, then most probabilists would agree that you aren’t really doing probability.

Maybe having these structures in the background is a necessary evil, but actually referring to them feels to me rather like messing around with Dedekind cuts when I really just want to be working with a complete ordered field.

On the other hand, and relatedly, probabilists freely (and usually silently!) replace Ω\Omega and/or XX with other spaces or functions that are more amenable to answering the questions at hand, but so that XX remains “the same” in an appropriate sense. Learning how and why one does that is a big part of learning how to do probability theory (and in particular, understanding why probability theory is different from the theory of probability spaces), but this tends to be handled in an informal way. Among other things, this can be an obstacle to more structurally-minded mathematicians learning to follow probabilists’ modes of thought.

So I would like to see an approach to probability theory that puts the focus squarely on the things probabilists care about, and provides ways of talking clearly about what probabilists are really doing when, for example, introducing couplings or auxiliary random variables.

I feel like most of the attempts I’ve seen at a categorical description of probability fall into the trap of being more of a categorical description of measure-theory-with-total-mass-one. Nevertheless, I’ve always felt that category theory had promise as a framework for talking about probability theory, because of category theory’s well-demonstrated power to focus attention on essential features, especially in the case of focusing on a map as a whole thing of interest separated from its pointwise behavior.

The theory of Markov categories that the authors of this post describe is the first attempt at a category-theoretic account of probability theory that I find compelling. As far as I’ve seen it’s still missing some key features (notably, expectations of real-valued random variables), but it does provide a good language for talking about the kinds of things I mentioned above, and it’s arguably better than more traditional approaches for talking about ideas like conditional independence, as this post demonstrates.

But a big part of what made me like this particular reinvention of the wheel in the first place is the following cute little fact, which is simpler-minded than anything in this post.

Switching to notation more consistent with this post, traditionally a random variable with values in XX is defined to be a measurable function f:ΩXf: \Omega \to X, where (Ω,μ)(\Omega, \mu) is a probability space. But then we’re not supposed to talk about either Ω\Omega or f(ω)f(\omega) unless we absolutely can’t help it! But in a Markov category, a random variable in XX is just a morphism f:IXf: I \to X, where II is the monoidal unit (the terminal object in the deterministic subcategory). We only need to refer to structure that’s already there, and there’s simply no reason to mention any of the things we’re not supposed to be thinking about!

Even better: Say for simplicity we’re working in FinStochFinStoch, with deterministic subcategory FinSetFinSet. Then II is some fixed singleton set, and a deterministic morphism f:IXf:I \to X is simply a function. That is, a deterministic random variable is the same thing as an element of XX, formalized here in the same way as in categorical approaches to set theory. A random variable in general is the same thing without the “deterministic”. So a random variable in XX is a non-deterministic — that is, random! — version of an element of XX. And that’s how working probabilists actually think about random variables, not as measurable functions.

Posted by: Mark Meckes on August 14, 2024 4:21 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

Thanks for this, Mark. I recall you telling me previously about the analogy between Markov categories and ETCS (Lawvere’s Elementary Theory of the Category of Sets). Is there more you can say about it beyond what you describe in the last couple of paragraphs of your comment?

Posted by: Tom Leinster on August 14, 2024 9:51 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

I didn’t mean to claim that there is anything like a precise mathematical analogy with ETCS. (I also don’t mean to claim that there isn’t! I’d like to give this more thought sometime.) The perspective in my last two paragraphs is the only really concrete thing I had in mind.

But I do perceive an broader philosophical similarity here. You have espoused the view that ETCS does a better job than traditional set theories of framing the axioms in terms of principles that working mathematicians use all the time. The theory of Markov categories makes an attempt to do something similar for probability theory. Like I said, in the traditional setup for probability theory, you immediately see things that you don’t actually want to be thinking about (kind of like how in ZFC everything is a set, tempting you to ask silly questions like what are the elements of π\pi), and many of the things you want to work with aren’t clearly part of the framework and need to be built on top of it (like how ZFC doesn’t include functions, so you need to define them as sets).

In contrast, though I suspect the definitions in the theory of Markov categories haven’t yet reached their final form, most if not all of the axioms used there have some clear meaning in terms of probability, which the authors of the post do a nice job of illustrating.

Posted by: Mark Meckes on August 15, 2024 4:14 AM | Permalink | Reply to this

Re: Introduction to Categorical Probability

That’s great to hear, and the analogy with ETCS is intriguing!

How do you feel about the idea of putting Markov kernels at center stage, with both probability measures and measurable maps as special cases? Is it sensible to have Markov kernels as the primary structure?

As far as I’ve seen it’s still missing some key features (notably, expectations of real-valued random variables)

Actually we already have this to a larger extent than it may seem, and in fact in several ways. So in the remainder of this comment, let me elaborate a bit on what these ways are.

First, taking an expectation is integration. And integration is actually at the very heart of Markov categories! What I’m referring to is the fact that Markov kernels compose via the Chapman-Kolmogorov formula, which says that for Markov kernels f:XYf : X \rightsquigarrow Y and g:YZg : Y \rightsquigarrow Z, the composite kernel is defined by (gf)(T|x)=g(T|y)f(dy|x).(g \circ f)(T|x) = \int g(T|y) \, f(\mathrm{d}y|x). Now if the target space is Z={0,1}Z = \{0,1\}, the kernels gg can are in bijection with the measurable maps g({1}|):Y[0,1]g(\{1\}|-) : Y \to [0,1]. If we now use a probability measure p:IXp : I \rightsquigarrow X in place of ff, then the composite is a kernel 1{0,1}1 \to \{0,1\} whose value on {1}\{1\} is precisely the expectation of the measurable function. So in this way, taking expectation values of [0,1][0,1]-valued random variables is encoded in the composition.

I don’t want to claim that this is how we should treat random variables and expectation in general. I think that this observation can be part of a more comprehensive treatment, but on its own it feels like too much of a hack and shouldn’t be the final answer.

Second, one can argue that the most fundamental kind of expectation value is not that of real-valued random variables, but the formation of the expectation value of a random measure, which is itself a measure again. It’s more fundamental because it requires less structure: taking the expectation of a random measure makes sense on any measurable space regardless in absence of any linear structure.

And also taking expectations of random measures is at the heart of categorical probability, because this is precisely what the multiplication transformation PPPP P \to P of a probability monad does. Of course, this goes back all the way to Lawvere and Giry. In the Markov categories setting, we get this whenever we have a representable Markov categories. These are those Markov categories for which every object XX, there is a ‘distribution object’ PXPX such that general morphisms AXA \rightsquigarrow X are in bijection with deterministic morphisms APXA \to PX. This is something which crucially holds in measure-theoretic probability: a Markov kernel is the same thing as a measurable map into a space of distributions, and it’s important to have both of these descriptions available and go back and forth between them. For example the identity map PXPXP X \to P X corresponds to a Markov kernel PXXPX \rightsquigarrow X which should be thought of as sampling. In any case, being representable means that there is an adjunction between general morphisms and deterministic morphisms, and we’ve shown that in this case, the Markov category is precisely the Kleisli category of the induced probability monad. So the fact that the formation of expectations is encoded both in the composition and in the monad multiplication holds not just for measure-theoretic probability, but has an abstract generalization to every representable Markov category.

Third, for a given probability monad we can also consider its Eilenberg–Moore algebras, which should be thought of as objects together with the additional structure of every probability measure having a designated barycenter (in a well-behaved manner). So for example, every compact convex set S nS \subseteq \mathbb{R}^n is an algebra of the Giry monad, because expectations of SS-valued random variables make sense and lie again in SS. Taking expectations of general (unbounded) real-valued variables is more subtle: taking S=S = \mathbb{R} doesn’t work because the relevant integrals often don’t converge. But of course we need to be able to deal with this case as well. This can be done in a well-behaved manner based on a suitable notion of partial morphism. (We develop this in a draft on the law of large numbers, which will hopefully come out at some point this fall.)

Posted by: Tobias Fritz on August 15, 2024 8:01 AM | Permalink | Reply to this

Re: Introduction to Categorical Probability

How do you feel about the idea of putting Markov kernels at center stage, with both probability measures and measurable maps as special cases? Is it sensible to have Markov kernels as the primary structure?

I think it is sensible. For one thing, Markov kernels per se are a central tool in probability once you get past the most elementary topics. And as you say, the two most basic objects* in probability — probability measures and measurable maps — are special cases. So that already makes it feel more natural to me to put Markov kernels at the center rather than need to construct them from multiple more basic ingredients.

(* I keep stumbling or hesitating when writing about this subject because I want to use the word “object” in an informal sense, as in “object of study”, and worrying about the fact that the “objects” I’m talking about show up not as objects as morphisms or diagrams in Markov categories!)

But then on top of that Markov kernels also let you immediately start talking easily about other things that are slightly less central but still extremely important, like (obviously) Markov chains, mixtures/hierarchical models, and composition of couplings.

This is not to say that I immediately recognized that putting Markov kernels at center stage was a good choice. But it didn’t take long for me to come around to that view. Another thing that helped convince me is how nicely independence can be characterized by a diagram.

Actually we already have this [expectation] to a larger extent than it may seem, and in fact in several ways.

Right, and I agree in particular with what you said here:

Second, one can argue that the most fundamental kind of expectation value is not that of real-valued random variables, but the formation of the expectation value of a random measure, which is itself a measure again. It’s more fundamental because it requires less structure: taking the expectation of a random measure makes sense on any measurable space regardless in absence of any linear structure.

But most of classical probability theory, including the big famous results like LLN, CLT, and LIL, takes place in the presence of linear structure. Its role is so central that I’d really like to see \mathbb{R}, or something that plays its role in this setting, pop out on its own from the (possibly extended) axioms of a Markov category.

I’m really looking forward to seeing your draft on LLN, which I expect will help clarify this issue for me a lot!

Posted by: Mark Meckes on August 15, 2024 2:12 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

Thanks for expanding on that analogy/comparison, Mark.

I wonder whether the resemblance would be closer if we said that Markov categories were like toposes rather than ETCS categories.

What I have in mind is that the definition of topos intentionally encompasses a wide spectrum of different examples, from tiny ones like 1\mathbf{1} and (finite sets) to great big complex ones. In contrast, there is in some weak informal sense intended to be only one ETCS category. Of course I don’t really mean this: the whole field of independence results in set theory sees to that. But most working mathematicians who refer to categories at all tend to use the phrase “the category of sets”, which reflects the common way of thinking.

An ETCS category is a nontrivial topos with three further properties:

  • A function is determined by its effect on elements. That is, if f,g:XYf, g: X \to Y and fx=gxf x = g x for all x:1Xx: 1 \to X then f=gf = g.

  • There’s a natural numbers object.

  • The axiom of choice: every surjection has a section.

These axioms are very strong, especially the first and the third. The second rules out toposes like FinSetFinSet.

Idle question: is there a useful property of Markov categories that ensures that some of the objects are infinite, thus ruling out examples like FinStoch?

Posted by: Tom Leinster on August 15, 2024 9:26 AM | Permalink | Reply to this

Re: Introduction to Categorical Probability

I wonder whether the resemblance would be closer if we said that Markov categories were like toposes rather than ETCS categories.

Maybe! Speculatively, maybe Markov categories are like toposes, and “the” category of sets is like StochStoch or something. My observation about random variables as random elements isn’t about Markov categories in general, just about StochStoch and its relatives. (Though I think it could be useful intuition in other Markov categories, too.)

But maybe toposes aren’t quite the right thing here. The Markov categories that arise as Kleisli categories for various probability monads feel like they should be seen as analogous to categories of “nice” spaces. And those aren’t typically toposes, are they? (My knowledge of toposes is next to nonexistent.)

The right way to think about this might actually be right there in the post. Markov categories are the stochastic generalization of deterministic Markov categories. I’m not sure if anyone has thought much about how broad the class of deterministic Markov categories is, or what additional axioms could force such a category to be essentially FinSetFinSet or SetSet.

Idle question: is there a useful property of Markov categories that ensures that some of the objects are infinite, thus ruling out examples like FinStoch?

I’ve wondered that too. And somewhat in the opposite direction, here’s a tangentially related idle question: In the ETCS axioms, is there some weaker replacement for the natural numbers object axiom that results in getting essentially FinSetFinSet as a minimal example?

Posted by: Mark Meckes on August 15, 2024 2:41 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

The Markov categories that arise as Kleisli categories for various probability monads feel like they should be seen as analogous to categories of “nice” spaces. And those aren’t typically toposes, are they?

I hope someone more knowledgeable than me will say more, but it’s perhaps more true than it seems that nice spaces can form a topos.

Probably you know the idea of toposes as spaces. Every topological space gives rise to a topos, namely, the topos of sheaves of sets on it. And you can recover the space from its topos (glossing over a technicality). For that reason, people often think of toposes as “generalized spaces”.

But there’s also the idea of toposes of spaces. I understand this much less well. I guess simplicial sets are the easiest example.

A classic paper on this subject is Lawvere’s Categories of spaces may not be generalized spaces as exemplified by directed graphs. Let me just quote a bit:

It has long been recognized [G1], [L] that even within geometry (that is, even apart from their algebraic/logical role) toposes come in (at least) two varieties: as spaces (possibly generalized, treated via the category of sheaves of discrete sets), or as categories of spaces (analytic [G2], topological [J], combinatorial, etc.). The success of theorems [J’] which approximate toposes by generalized spaces has perhaps obscured the role of the second class of toposes, though some explicit knowledge of it is surely necessary for a reasonable axiomatic understanding of toposes of C C^\infty spaces or of the topos of simplicial sets. Perhaps some of the confusion is due to the lack of a stabilized definition of morphism appropriate to categories of spaces in the way that “geometric morphisms” are appropriate to generalized spaces.

There are certain properties which a topos of spaces often has; a wise selection of these should serve as an axiomatic definition of the subject. While we have not achieved that goal yet, we list some important properties and show that these properties cannot be true for a “generalized space” of the localic or groupoid kind.

Mark also wrote:

In the ETCS axioms, is there some weaker replacement for the natural numbers object axiom that results in getting essentially FinSet as a minimal example?

If you skip the natural numbers axiom entirely then FinSet is the minimal example, I think.

At least, that’s true as long as you remember to say “nontrivial”. Otherwise, the category 1 with only one object and its identity map is the smallest example. Trivial means that your category is equivalent to 1. In the presence of the other ETCS axioms, nontriviality is equivalent to the existence of an empty set.

Posted by: Tom Leinster on August 15, 2024 4:47 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

Just to say that this distinction between toposes as spaces and as categories of spaces is spelled out a little at nLab: big and little toposes, often given in French: gros/petit topos.

Posted by: David Corfield on August 20, 2024 11:44 AM | Permalink | Reply to this

Re: Introduction to Categorical Probability

Okay, so maybe toposes do play a role in this story. A next question here might be how the definition of a topos compares to a deterministic (i.e., Cartesian) Markov category. (And a first step for me, if I were to get around to thinking about this, would be to actually look up the definition of a topos!)

If you skip the natural numbers axiom entirely then FinSet is the minimal example, I think.

Yes, I think that must be right. And existence of the empty set is included as axiom 3 in your paper.

Posted by: Mark Meckes on August 15, 2024 7:44 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

Tom wrote:

Idle question: is there a useful property of Markov categories that ensures that some of the objects are infinite, thus ruling out examples like FinStoch?

I’m hoping that someone else will comment more on this. So for now, let me just say that the answer is an emphatic yes, and importantly so!

Mark wrote:

Okay, so maybe toposes do play a role in this story. A next question here might be how the definition of a topos compares to a deterministic (i.e., Cartesian) Markov category.

The latter technical question has a simple answer: deterministic Markov categories are the same thing as arbitrary categories with finite products (using these products as the monoidal structure). Here’s a useful way to think about this. First, the failure of XYX \otimes Y to be a product of XX and YY in a Markov category is what probability theory is all about: a probability distribution on a product space is not uniquely determined by its marginals. But this does become true for deterministic distributions. So a ‘deterministic Markov category’ is the minimal amount of structure that’s needed to talk about deterministic distributions and deterministic Markov kernels, and like this it shouldn’t be surprising that these are exactly the categories with finite products. The right technical formulation of this is known as (a version of) Fox’s theorem.

Another point is that deterministic Markov categories are not interesting as Markov categories, for reasons similar to why discrete spaces are not interesting as topological spaces: the theory completely trivializes. Nevertheless, the deterministic morphisms in a Markov category form a deterministic Markov subcategory, and it’s often useful to understand this subcategory. As already mentioned, often the whole Markov category is the Kleisli category of a monad on this subcategory.

Now what about this subcategory being a topos? Clearly this is a much stronger condition than it merely having finite products. And it’s not hard to construct interesting examples where it is indeed a topos, such as the Kleisli category of the distribution monad on Set or of the reader monad on any topos. Both of these are examples of Markov categories of a probabilistic flavour. However, they arguably aren’t really what one actually wants for probability since they will not be as well-behaved as the Markov categories we study in the context of measure-theoretic probability.

So if we’re asking whether there is a well-behaved Markov category for measure-theoretic probability whose deterministic subcategory is a topos, then that this is still pretty much open from my POV. The flavour of this question seems similar to asking whether a well-behaved category of topological spaces can be a topos.

While I don’t have a good answer to this, I want to add the caveat that it’s not enough to have a category of measurable-like spaces and then a version of the Giry monad on it: the category and monad also need to behave in a way that we expect from probability, and this is not automatic. For example, here’s one important feature of probability: if a distribution on a product space X×YX \times Y has deterministic marginal on XX, then it is deterministic itself. This is a consequence of the nonnegativity of probability, and it’s used quite frequently (and somewhat implicitly) in measure-theoretic proofs. Now if you take the Kleisli category of the Giry monad on quasi-Borel spaces, then you get a Markov category which fails this ‘deterministic marginal property’, as we’ve shown in Prop 3.3 of our information flow axioms paper. Therefore quasi-Borel spaces aren’t a good category for the purposes of measure-theoretic probability. (Though as we hint at in the paper, this failure may itself have other uses that make it into a feature.)

Posted by: Tobias Fritz on August 19, 2024 10:10 AM | Permalink | Reply to this

Re: Introduction to Categorical Probability

deterministic Markov categories are the same thing as arbitrary categories with finite products (using these products as the monoidal structure). Here’s a useful way to think about this. First, the failure of XYX \otimes Y to be a product of XX and YY in a Markov category is what probability theory is all about: a probability distribution on a product space is not uniquely determined by its marginals.

Thanks, that’s a beautiful answer!

Posted by: Mark Meckes on August 19, 2024 4:58 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

There is also Alex Simpson’s topos of random probability sheaves that he defines in this talk:

https://www.youtube.com/watch?v=Y1RkPhwJ0M

Posted by: Madeleine Birchfield on August 15, 2024 9:31 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

I agree that we still need a canonical way to talk about expectation of (say) real random variables as well as random measures.

I am working exactly on this, using the idea of algebras over monads, which Tobias has already mentioned. I hope to be able to tell you more in the next few weeks!

Posted by: Paolo Perrone on August 20, 2024 4:00 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

Great, I look forward to hearing more!

Posted by: Mark Meckes on August 20, 2024 4:45 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

I wonder if there are any interesting relationships between the normalized finite powerset monad and the mersenne divisibility adjunction? And if there are, could that have any consequences for the structure of FinSetMulti?

Posted by: Anon on August 14, 2024 11:31 PM | Permalink | Reply to this

Re: Introduction to Categorical Probability

What’s the Mersenne divisibility adjunction? The page you link doesn’t say, and when I do a web search, I get no hits for that phrase.

Let me guess. The Wikipedia page you link points out that the function g:g: \mathbb{N} \to \mathbb{N} defined by

g(n)=2 n1 g(n) = 2^n - 1

is order-preserving with respect to divisibility, and in fact preserves gcds. So seen as a functor

g:(,)(,), g: (\mathbb{N}, \mid) \to (\mathbb{N}, \mid),

it must have a left adjoint ff, which must be given by

f(m)=gcd{n:m(2 n1)}. f(m) = gcd\{n: m \mid (2^n - 1)\}.

Is this what you meant? And is there a more explicit form for ff? I haven’t thought about it.

Obviously numbers of the form 2 n12^n - 1 have something to do with “normalized” power sets, i.e. power sets excluding the empty set. Is there more reason to suspect a link with FinSetMulti?

Posted by: Tom Leinster on August 15, 2024 10:10 AM | Permalink | Reply to this

Post a New Comment