### Where Do Probability Measures Come From?

#### Posted by Tom Leinster

*Guest post by Tom Avery*

Tom *(here Tom means me, not him — Tom)* has written several times about a piece of categorical machinery that, when given an appropriate input, churns out some well-known mathematical concepts. This machine is the process of constructing the codensity monad of a functor.

In this post, I’ll give another example of a well-known concept that arises as a codensity monad; namely probability measures. This is something that I’ve just written a paper about.

### The Giry monads

Write $\mathbf{Meas}$ for the category of measurable spaces (sets equipped with a $\sigma$-algebra of subsets) and measurable maps. I’ll also write $I$ for the unit interval $[0,1]$, equipped with the Borel $\sigma$-algebra.

Let $\Omega \in \mathbf{Meas}$. There are lots of different probability measures we can put on $\Omega$; write $G\Omega$ for the set of all of them.

Is $G\Omega$ a measurable space? Yes: An element of $G\Omega$ is a function that sends measurable subsets of $\Omega$ to numbers in $I$. Turning this around, we have, for each measurable $A \subseteq \Omega$, an evaluation map $ev_A \colon G\Omega \to I$. Let’s give $G\Omega$ the smallest $\sigma$-algebra such that all of these are measurable.

Is $G$ a functor? Yes: Given a measurable map $g \colon \Omega \to \Omega'$ and $\pi \in G\Omega$, we can define the pushforward $G g(\pi)$ of $\pi$ along $g$ by

$G g(\pi)(A') = \pi(g^{-1} A')$

for measurable $A' \subseteq \Omega'$.

Is $G$ a monad? Yes: Given $\omega \in \Omega$ we can define $\eta(\omega) \in G\Omega$ by

$\eta(\omega)(A) = \chi_A (\omega)$

where $A$ is a measurable subset of $\Omega$ and $\chi_A$ is its characteristic function. In other words $\eta(\omega)$ is the *Dirac measure* at $\omega$. Given $\rho \in G G\Omega$, let

$\mu(\rho)(A) = \int_{\G\Omega} ev_A \,\mathrm{d}\rho$

for measurable $A \subseteq \Omega$, where $\ev_A \colon G\Omega \to I$ is as above.

This is the **Giry monad** $\mathbb{G} = (G,\eta,\mu)$, first defined (unsurprisingly) by Giry in “A categorical approach to probability theory”.

A finitely additive probability measure $\pi$ is just like a probability measure, except that it is only well-behaved with respect to *finite* disjoint unions, rather than arbitrary *countable* disjoint unions. More precisely, rather than having

$\pi\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \pi(A_i)$

for disjoint $A_i$, we just have

$\pi\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \pi(A_i)$

for disjoint $A_i$.

We could repeat the definition of the Giry monad with “probability measure” replaced by “finitely additive probability measure”; doing so would give the **finitely additive Giry monad** $\mathbb{F} = (F,\eta,\mu)$. Every probability measure is a finitely additive probability measure, but not all finitely additive probability measures are probability measures. So $\mathbb{G}$ is a proper submonad of $\mathbb{F}$.

The Kleisli category of $\mathbb{G}$ is quite interesting. Its objects are just the measurable spaces, and the morphisms are a kind of non-deterministic map called a **Markov kernel** or **conditional probability distribution**. As a special case, a discrete space equipped with an endomorphism in the Kleisli category is a discrete-time Markov chain.

I’ll explain how the Giry monads arise as codensity monads, but first I’d like to mention a connection with another example of a codensity monad; namely the ultrafilter monad.

An ultrafilter $\mathcal{U}$ on a set $X$ is a set of subsets of $X$ satisfying some properties. So $\mathcal{U}$ is a subset of the powerset $\mathcal{P}X$ of $X$, and is therefore determined by its characteristic function, which takes values in $\{0,1\} \subseteq I$. In other words, an ultrafilter on $X$ can be thought of as a special function

$\mathcal{P}X \to I.$

It turns out that “special function” here means “finitely additive probability measure defined on all of $\mathcal{P}X$ and taking values in $\{0,1\}$”.

So the ultrafilter monad on $\mathbf{Set}$ (which sends a set to the set of ultrafilters on it) is a primitive version of the finitely additive Giry monad. With this in mind, and given the fact that the ultrafilter monad is the codensity monad of the inclusion of the category of finite sets into the category of sets, it is not that surprising that the Giry monads are also codensity monads. In particular, we might expect $\mathbb{F}$ to be the codensity monad of some functor involving spaces that are “finite” in some sense, and for $\mathbb{G}$ we’ll need to include some information pertaining to countable additivity.

### Integration operators

If you have a measure on a space then you can integrate functions on that space. The converse is also true: if you have a way of integrating functions on a space then you can extract a measure.

There are various ways of making this precise, the most famous of which is the Riesz-Markov-Kakutani Representation Theorem:

**Theorem.** *Let $X$ be a compact Hausdorff space. Then the space of finite, signed Borel measures on $X$ is canonically isomorphic to*

$\mathbf{NVS}(\mathbf{Top}(X,\mathbb{R}),\mathbb{R})$

*as a normed vector space, where $\mathbf{Top}$ is the category of topological spaces, and $\mathbf{NVS}$ is the category of normed vector spaces.*

Given a finite, signed Borel measure $\pi$ on $X$, the corresponding map $\mathbf{Top}(X,\mathbb{R}) \to \mathbb{R}$ sends a function to its integral with respect to $\pi$. There are various different versions of this theorem that go by the same name.

My paper contains the following more modest version, which is a correction of a claim by Sturtz.

**Proposition.** *Finitely additive probability measures on a measurable space $\Omega$ are canonically in bijection with functions $\phi \colon \mathbf{Meas}(\Omega,I) \to I$ that are*

**affine:***if $f,g \in \mathbf{Meas}(\Omega,I)$ and $r \in I$ then*

$\phi(r f + (1-r)g) = r\phi(f) + (1-r)\phi(g),$

*and*

**weakly averaging:***if $\bar{r}$ denotes the constant function with value $r$ then $\phi(\bar{r}) = r$.*

*Call such a function a finitely additive integration operator. The bijection restricts to a correspondence between (countably additive) probability measures and functions $\phi$ that additionally*

**respect limits:***if $f_n \in \mathbf{Meas}(\Omega,I)$ is a sequence of functions converging pointwise to $0$ then $\phi(f_n)$ converges to $0$.*

Call such a function an **integration operator**. The integration operator corresponding to a probability measure $\pi$ sends a function $f$ to

$\int_{\Omega}f \mathrm{d}\pi,$

which justifies the name. In the other direction, given an integration operator $\phi$, the value of the corresponding probability measure on a measurable set $A \subseteq \Omega$ is $\phi(\chi_A)$.

These bijections are measurable (with respect to a natural $\sigma$-algebra on the set of finitely additive integration operators) and natural in $\Omega$, so they define isomorphisms of endofunctors of $\mathbf{Meas}$. Hence we can transfer the monad structures across the isomorphisms, and obtain descriptions of the Giry monads in terms of integration operators.

### The Giry monads via codensity monads

So far so good. But what does this have to do with codensity monads? First let’s recall the definition of a codensity monad. I won’t go into a great deal of detail; for more information see Tom’s first post on the topic.

Let $U \colon \mathbb{C} \to \mathcal{M}$ be a functor. The codensity monad of $U$ is the right Kan extension of $U$ along itself. This consists of a functor $T^U \colon \mathcal{M} \to \mathcal{M}$ satisfying a universal property, which equips $T^U$ with a canonical monad structure. The codensity monad doesn’t always exist, but it will whenever $\mathbb{C}$ is small and $\mathcal{M}$ is complete. You can think of $T^U$ as a generalisation of the monad induced by the adjunction between $U$ and its left adjoint that makes sense when the left adjoint doesn’t exist. In particular, when the left adjoint *does* exist, the two monads coincide.

The end formula for right Kan extensions gives

$T^U m = \int_{c \in \mathbb{C}} [\mathcal{M}(m,U c),U c],$

where $[\mathcal{M}(m,U c),U c]$ denotes the $\mathcal{M}(m,U c)$ power of $U c$ in $\mathcal{M}$, i.e. the product of $\mathcal{M}(m,U c)$ (a set) copies of $U c$ (an object of $\mathcal{M}$) in $\mathcal{M}$.

It doesn’t matter too much if you’re not familiar with ends because we can give an explicit description of $T^U m$ in the case that $\mathcal{M} = \mathbf{Meas}$: The elements of $T^U\Omega$ are families $\alpha$ of functions

$\alpha_c \colon \mathbf{Meas}(\Omega, U c) \to U c$

that are natural in $c \in \mathbb{C}$. For each $c \in \mathbb{C}$ and measurable $f \colon \Omega \to U c$ we have $\ev_f \colon T^U \Omega \to I$ mapping $\alpha$ to $\alpha_c (f)$. The $\sigma$-algebra on $T^U \Omega$ is the smallest such that each of these maps is measurable.

All that’s left is to say what we should choose $\mathbb{C}$ and $U$ to be in order to get the Giry monads.

A subset $c$ of a real vector space $V$ is convex if for any $x,y \in c$ and $r \in I$ the convex combination $r x + (1-r)y$ is also in $c$, and a map $h \colon c \to c'$ between convex sets is called **affine** if it preserves convex combinations. So there’s a category of convex sets and affine maps between them. We will be interested in certain full subcategories of this.

Let $d_0$ be the (convex) set of sequences in $I$ that converge to $0$ (it is a subset of the vector space $c_0$ of all real sequences converging to $0$). Now we can define the categories of interest:

Let $\mathbb{C}$ be the category whose objects are all finite powers $I^n$ of $I$, with all affine maps between them.

Let $\mathbb{D}$ be the category whose objects are all finite powers of $I$, together with $d_0$, and all affine maps between them.

All the objects of $\mathbb{C}$ and $\mathbb{D}$ can be considered as measurable spaces (as subspaces of powers of $I$), and all the affine maps between them are then measurable, so we have (faithful but not full) inclusions $U \colon \mathbb{C} \to \mathbf{Meas}$ and $V \colon \mathbb{D} \to \mathbf{Meas}$.

**Theorem.** *The codensity monad of $U$ is the finitely additive Giry monad, and the codensity monad of $V$ is the Giry monad.*

Why should this be true? Let’s start with $U$. An element of $T^U \Omega$ is a family of functions

$\alpha_{I^n} \colon\mathbf{Meas}(\Omega,I^n) \to I^n.$

But a map into $I^n$ is determined by its composites with the projections to $I$, and these projections are affine. This means that $\alpha$ is completely determined by $\alpha_{I}$, and the other components are obtained by applying $\alpha_{I}$ separately in each coordinate. In other words, an element of $T^U \Omega$ is a special sort of function

$\mathbf{Meas}(\Omega, I) \to I.$

Look familiar? As you might guess, the functions with the above domain and codomain that define elements of $T^U \Omega$ are precisely the finitely additive integration operators.

The affine and weakly averaging properties of $\alpha_{I}$ are enforced by naturality with respect to certain affine maps. For example, the naturality square involving the affine map

$r\pi_1 + (1-r)\pi_2 \colon I^2 \to I$

(where $\pi_i$ are the projections) forces $\alpha_I$ to preserve convex combinations of the form $r f + (1-r)g$. The weakly averaging condition comes from naturality with respect to constant maps.

How is the situation different for $T^V$? As before $\alpha \in T^V \Omega$ is determined by $\alpha_I$, and $\alpha_{d_0}$ is obtained by applying $\alpha_I$ in each coordinate, thanks to naturality with respect to the projections. A measurable map $f \colon \Omega \to d_0$ is a sequence of maps $f_n \colon \Omega \to I$ converging pointwise to $0$, and

$\alpha_{d_0}(f) = (\alpha_I(f_i))_{i=1}^{\infty}.$

But $\alpha_{d_0}(f) \in d_0$, so $\alpha_I(f_i)$ must converge to $0$. So $\alpha_I$ is an integration operator!

The rest of the proof consists of checking that these assignments $\alpha \mapsto \alpha_{I}$ really do define isomorphisms of monads.

It’s natural to wonder how much you can alter the categories $\mathbb{C}$ and $\mathbb{D}$ without changing the codensity monads. Here’s a result to that effect:

**Proposition**. *The categories $\mathbb{C}$ and $\mathbb{D}$ can be replaced by the monoids of affine endomorphisms of $I^2$ and $d_0$ respectively (regarded as 1-object categories, with the evident functors to $\mathbf{Meas}$) without changing the codensity monads.*

This gives categories of convex sets that are minimal such that their inclusions into $\mathbf{Meas}$ give rise to the Giry monads. Here I mean minimal in the sense that they contain the fewest objects with all affine maps between them. They are not uniquely minimal; there are other convex sets whose monoids of affine endomorphisms also give rise to the Giry monads.

This result gives yet another characterisation of (finitely and countably) additive probability measures: a probability measure on $\Omega$ is an $\mathrm{End}(d_0)$-set morphism

$\mathbf{Meas}(\Omega,d_0) \to d_0,$

where $\mathrm{End}(d_0)$ is the monoid of affine endomorphisms of $d_0$. Similarly for finitely additive probability measures, with $d_0$ replaced by $I^2$.

What about *maximal* categories of convex sets giving rise to the Giry monads? I don’t have a definitive answer to this question, but you can at least throw in all bounded, convex subsets of Euclidean space:

**Proposition**. *Let $\mathbb{C}'$ be the category of all bounded, convex subsets of $\mathbb{R}^n$ (where $n$ varies) and affine maps. Let $\mathbb{D}'$ be $\mathbb{C}'$ but with $d_0$ adjoined. Then replacing $\mathbb{C}$ by $\mathbb{C}'$ and $\mathbb{D}$ by $\mathbb{D}'$ does not change the codensity monads.*

The definition of $\mathbb{D}'$ is a bit unsatisfying; $d_0$ feels (and literally is) tacked on. It would be nice to have a characterisation of *all* the subsets of $\mathbb{R}^{\mathbb{N}}$ (or indeed all the convex sets) that can be included in $\mathbb{D}'$. But so far I haven’t found one.

## Re: Where Do Probability Measures Come From?