
July 7, 2018

Beyond Classical Bayesian Networks

Posted by John Baez

guest post by Pablo Andres-Martinez and Sophie Raynor

In the final installment of the Applied Category Theory seminar, we discussed the 2014 paper “Theory-independent limits on correlations from generalized Bayesian networks” by Henson, Lal and Pusey.

In this post, we’ll give a short introduction to Bayesian networks, explain why quantum mechanics means that one may want to generalise them, and present the main results of the paper. That’s a lot to cover, and there won’t be a huge amount of category theory, but we hope to give the reader some intuition about the issues involved, and another example of monoidal categories used in causal theory.

Introduction

Bayesian networks are a graphical modelling tool used to show how random variables interact. A Bayesian network consists of a pair (G, P) of a directed acyclic graph (DAG) G together with a joint probability distribution P on its nodes, satisfying the Markov condition. Intuitively, the graph describes a flow of information.

The Markov condition says that the system doesn’t have memory. That is, the distribution on a given node Y depends only on the distributions on the nodes X for which there is an edge X → Y. Consider the following chain of binary events. In spring, the pollen in the air may cause someone to have an allergic reaction that may make them sneeze.

[Figure: the chain Spring → Allergic reaction → Sneezing]

In this case the Markov condition says that, given that you know someone is having an allergic reaction, whether or not it is spring is not going to influence your belief about the likelihood of them sneezing, which seems sensible.

Bayesian networks are useful

  • as an inference tool, thanks to belief propagation algorithms,

  • and because, given a Bayesian network (G, P), we can describe d-separation properties on G which enable us to discover conditional independences in P.

It is this second point that we’ll be interested in here.

Before getting into the details of the paper, let’s try to motivate this discussion by explaining its title, “Theory-independent limits on correlations from generalized Bayesian networks”, and giving a little more background to the problem it aims to solve.

Crudely put, the paper aims to generalise a method that assumes classical mechanics to one that holds in quantum and more general theories.

Classical mechanics rests on two intuitively reasonable and desirable assumptions, together called local causality:

  • Causality:

    Causality is usually treated as a physical primitive. Simply put, it is the principle that there is a (partial) ordering of events in spacetime. In order to have information flow from event A to event B, A must be in the past of B.

    Physicists often define causality in terms of a discarding principle: If we ignore the outcome of a physical process, it doesn’t matter what process has occurred. Or, put another way, the outcome of a physical process doesn’t change the initial conditions.

  • Locality:

    Locality is the assumption that, at any given instant, the values of any particle’s properties are independent of any other particle. Intuitively, it says that particles are individual entities that can be understood in isolation from any other particle.

    Physicists usually picture particles as having a private list of numbers determining their properties. The principle of locality would be violated if any of the entries of such a list were a function whose domain is another particle’s property values.

In 1935, Einstein, Podolsky and Rosen showed that quantum mechanics (then a newly born theory) predicted that a pair of particles could be prepared so that applying an action on one of them would instantaneously affect the other, no matter how distant in space they were, thus contradicting local causality. This seemed so unreasonable that the authors presented it as evidence that quantum mechanics was wrong.

But Einstein was wrong. In 1964, John S. Bell laid the foundations for an experimental test that would demonstrate that Einstein’s “spooky action at a distance” (Einstein’s own words), now known as entanglement, was indeed real. Bell’s experiment has been replicated countless times and has many variations. This video gives a detailed explanation of one of these experiments, for a non-physicist audience.

But then, if acting on a particle has an instantaneous effect on a distant point in space, one of the two principles above is violated. On the one hand, if we acted on both particles at the same time, each action being a distinct event, both would be affecting each other’s result, so it would not be possible to decide on an ordering; causality would be broken. The other option would be to reject locality: a property’s value may be given by a function, so the resulting value may instantaneously change when the distant ‘domain’ particle is altered. In that case, the particles’ information was never separated in space, as they were never truly isolated, so causality is preserved.

Since causality is integral to our understanding of the world and forms the basis of scientific reasoning, the standard interpretation of quantum mechanics is to accept non-locality.

The definition of Bayesian networks implies a discarding principle and hence there is a formal sense in which they are causal (even if, as we shall see, the correlations they model do not always reflect the temporal order). Under this interpretation, the causal theory Bayesian networks describe is classical. Precisely, they can only model probability distributions that satisfy local causality. Hence, in particular, they are not sufficient to model all physical correlations.

The goal of the paper is to develop a framework that generalises Bayesian networks and d-separation results, so that we can still use graph properties to reason about conditional dependence under any given causal theory, be it classical, quantum, or even more general. In particular, this theory will be able to handle all physically observed correlations, and all theoretically postulated correlations.

Though category theory is not mentioned explicitly, the authors achieve their goal by using the categorical framework of operational probabilistic theories (OPTs).

Bayesian networks and d-separation

Consider the situation in which we have three Boolean random variables. Alice is either sneezing or she is not, she either has a fever or she does not, and she may or may not have flu.

Now, flu can cause both sneezing and fever, that is

P(sneezing \ | \ flu) \neq P(sneezing) \quad \text{and likewise} \quad P(fever \ | \ flu) \neq P(fever)

so we could represent this graphically as

[Figure: the fork Sneezing ← Flu → Fever]

Moreover, intuitively we wouldn’t expect there to be any other edges in the above graph. Sneezing and fever, though correlated (each is more likely if Alice has flu), are not direct causes of each other. That is,

P(sneezing \ | \ fever) \neq P(sneezing) \quad \text{but} \quad P(sneezing \ | \ fever, \ flu) = P(sneezing \ | \ flu).

Bayesian networks

Let G be a directed acyclic graph, or DAG. (Here a directed graph is a presheaf on (\bullet \rightrightarrows \bullet).)

The set Pa(Y) of parents of a node Y of G contains those nodes X of G such that there is a directed edge X → Y.

So, in the example above, Pa(flu) = ∅ while Pa(fever) = Pa(sneezing) = {flu}.

To each node X of a directed graph G, we may associate a random variable, also denoted X. If V is the set of nodes of G and (x_X)_{X ∈ V} is a choice of value x_X for each node X, such that y is the chosen value for the node Y, then pa(y) will denote the Pa(Y)-tuple of values (x_X)_{X ∈ Pa(Y)}.

To define Bayesian networks, and establish the notation, let’s revise some probability basics.

Let P(x, y | z) mean P(X = x and Y = y | Z = z), the probability that X has the value x and Y has the value y, given that Z has the value z. Recall that this is given by

P(x,y \ | \ z) = \frac{P(x,y,z)}{P(z)}.

The chain rule says that, given a value x of X and sets of values \Omega, \Lambda of other random variables,

P(x, \Omega \ | \ \Lambda) = P(x \ | \ \Lambda) \, P(\Omega \ | \ x, \Lambda).

Random variables X and Y are said to be conditionally independent given Z, written X ⊥⊥ Y | Z, if for all values x of X, y of Y and z of Z

P(x,y \ | \ z) = P(x \ | \ z) \, P(y \ | \ z).

By the chain rule this is equivalent to

P(x \ | \ y,z) = P(x \ | \ z), \quad \forall x, y, z.
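Indeed, assuming P(y | z) > 0,

P(x \ | \ y,z) = \frac{P(x,y \ | \ z)}{P(y \ | \ z)} = \frac{P(x \ | \ z) \, P(y \ | \ z)}{P(y \ | \ z)} = P(x \ | \ z).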

More generally, we may replace X, Y and Z with sets of random variables. In the special case that Z is empty, X and Y are independent if and only if P(x, y) = P(x)P(y) for all x, y.

Markov condition

A joint probability distribution P on the nodes of a DAG G is said to satisfy the Markov condition if, for the random variables {X_i}_{i=1}^n on the nodes of G and any choice of values {x_i}_{i=1}^n,

P(x_1, \dots, x_n) = \prod_{i = 1}^n P(x_i \ | \ pa(x_i)).

So, for the flu, fever and sneezing example above, a distribution P satisfies the Markov condition if

P(flu, \ fever, \ sneezing) = P(fever \ | \ flu) \, P(sneezing \ | \ flu) \, P(flu).

A Bayesian network is defined as a pair (G, P) of a DAG G and a joint probability distribution P on the nodes of G that satisfies the Markov condition with respect to G. This means that each node in a Bayesian network is conditionally independent, given its parents, of its non-descendants.

In particular, given a Bayesian network (G, P) such that there is a directed edge X → Y, the Markov condition implies that

\sum_{y} P(x,y) = \sum_y P(x) \, P(y \ | \ x) = P(x) \sum_y P(y \ | \ x) = P(x)

which may be interpreted as a discard condition. (The ordering is reflected by the fact that we can’t derive P(y) from ∑_x P(x,y) = ∑_x P(x) P(y | x).)

Let’s consider some simple examples.

Fork

In the example of flu, sneezing and fever above, the graph has a fork shape. For a probability distribution P to satisfy the Markov condition for this graph we must have

P(x, y, z) = P(x \ | \ z) \, P(y \ | \ z) \, P(z), \quad \forall x, y, z.

However, in general P(x, y) ≠ P(x)P(y).

In other words, X ⊥⊥ Y | Z, though X and Y are not independent. This makes sense: we wouldn’t expect sneezing and fever to be uncorrelated, but given that we know whether or not Alice has flu, being told that she has a fever isn’t going to tell us anything about her sneezing.
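To make this concrete, here is a small numerical check; the conditional probability tables are invented for illustration, and only the fork structure matters:

```python
import itertools

# Invented CPTs for the fork X <- Z -> Y, with Z = flu, X = sneezing, Y = fever.
p_z = {True: 0.1, False: 0.9}           # P(flu)
p_x_given_z = {True: 0.8, False: 0.2}   # P(sneezing = True | flu = z)
p_y_given_z = {True: 0.7, False: 0.05}  # P(fever = True | flu = z)

def bern(p, v):  # P(V = v) for a Bernoulli variable with P(True) = p
    return p if v else 1 - p

def joint(x, y, z):  # Markov factorisation P(x, y, z) = P(x | z) P(y | z) P(z)
    return bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y) * p_z[z]

# X is conditionally independent of Y given Z: P(x, y | z) = P(x | z) P(y | z)
for x, y, z in itertools.product((True, False), repeat=3):
    pz = sum(joint(a, b, z) for a in (True, False) for b in (True, False))
    p_x_z = sum(joint(x, b, z) for b in (True, False)) / pz
    p_y_z = sum(joint(a, y, z) for a in (True, False)) / pz
    assert abs(joint(x, y, z) / pz - p_x_z * p_y_z) < 1e-12

# ... but X and Y are not independent outright:
p_xy = sum(joint(True, True, z) for z in (True, False))
p_x = sum(joint(True, b, z) for b in (True, False) for z in (True, False))
p_y = sum(joint(a, True, z) for a in (True, False) for z in (True, False))
print(p_xy, p_x * p_y)  # 0.065 vs 0.0299: sneezing and fever are correlated
```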

Collider

Reversing the arrows in the fork graph above gives a collider as in the following example.

[Figure: the collider Spring → Allergic reaction ← Other allergies]

Clearly whether or not Alice has allergies other than hayfever is independent of what season it is. So we’d expect a distribution on this graph to satisfy X ⊥⊥ Y | ∅. However, if we know that Alice is having an allergic reaction, and it happens to be spring, we will likely assume that she has some allergy, i.e. X and Y are not conditionally independent given Z.

Indeed, the Markov condition and chain rule for this graph give us X ⊥⊥ Y | ∅:

P(x, y, z) = P(x) \, P(y) \, P(z \ | \ x, \ y) = P(z \ | \ x, \ y) \, P(x \ | \ y) \, P(y), \quad \forall x, y, z,

from which we cannot derive P(x | z) P(y | z) = P(x, y | z). (However, it could still be true for some particular choice of probability distribution.)
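The same kind of numerical sanity check (again with invented numbers) exhibits this “explaining away” effect for the collider:

```python
# Invented CPTs for the collider X -> Z <- Y, with X = spring,
# Y = allergies other than hayfever, Z = allergic reaction.
p_x = {True: 0.25, False: 0.75}
p_y = {True: 0.1, False: 0.9}
p_z_given_xy = {(True, True): 0.95, (True, False): 0.5,
                (False, True): 0.6, (False, False): 0.01}

def bern(p, v):
    return p if v else 1 - p

def joint(x, y, z):  # Markov factorisation P(x, y, z) = P(x) P(y) P(z | x, y)
    return p_x[x] * p_y[y] * bern(p_z_given_xy[(x, y)], z)

# Unconditionally X and Y are independent, by construction:
p_xy = sum(joint(True, True, z) for z in (True, False))
print(p_xy, p_x[True] * p_y[True])  # both 0.025

# ... but conditioned on Z = True (a reaction is observed) they are not:
pz = sum(joint(x, y, True) for x in (True, False) for y in (True, False))
p_x_z = sum(joint(True, y, True) for y in (True, False)) / pz
p_y_z = sum(joint(x, True, True) for x in (True, False)) / pz
print(joint(True, True, True) / pz, p_x_z * p_y_z)  # 0.126... vs 0.265...
```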

Chain

Finally, let us return to the chain of correlations presented in the introduction.

Clearly the probabilities that it is spring and that Alice is sneezing are not independent, and indeed we cannot derive P(x, y) = P(x)P(y). However, observe that, by the chain rule, a Markov distribution on the chain graph must satisfy X ⊥⊥ Y | Z. If we know Alice is having an allergic reaction, whether or not she is sneezing is not going to affect our guess as to what season it is.

Crucially, in this case, knowing the season is also not going to affect whether we think Alice is sneezing. By definition, conditional independence of X and Y given Z is symmetric in X and Y. In other words, a joint distribution P on the variables X, Y, Z satisfies the Markov condition with respect to the chain graph

X \longrightarrow Z \longrightarrow Y

if and only if P satisfies the Markov condition on

Y \longrightarrow Z \longrightarrow X.
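Indeed, the two Markov factorisations agree, since P(x)P(z | x) = P(x, z) = P(z)P(x | z) and P(z)P(y | z) = P(y, z) = P(y)P(z | y):

P(x) \, P(z \ | \ x) \, P(y \ | \ z) = P(x \ | \ z) \, P(z \ | \ y) \, P(y).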

d-separation

The above observations can be generalised to statements about conditional independences in any Bayesian network. That is, if (G, P) is a Bayesian network then the structure of G is enough to derive all the conditional independences in P that are implied by the graph G (in reality there may be more that have not been included in the network!).

Given a DAG G and a set of vertices U of G, let m(U) denote the union of U with all the vertices v of G such that there is a directed edge from U to v. The set W(U) will denote the non-inclusive future of U, that is, the set of vertices v of G for which there is no directed (possibly trivial) path from v to U.

For a graph G, let X, Y, Z now denote disjoint subsets of the vertices of G (and their corresponding random variables). Set W := W(X ∪ Y ∪ Z).

Then X and Y are said to be d-separated by Z, written X ⊥ Y | Z, if there is a partition {U, V, W, Z} of the nodes of G such that

  • X ⊆ U and Y ⊆ V, and

  • m(U) ∩ m(V) ⊆ W; in other words, U and V have no direct influence on each other.

(This is lemma 19 in the paper.)
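The paper characterises d-separation via this partition condition. The more familiar path-blocking formulation is equivalent, and for small graphs it can be checked by brute force. Here is a minimal sketch of that textbook criterion (not the paper’s own algorithm), assuming a dict-of-children representation of the DAG:

```python
def descendants(g, v):
    """Strict descendants of v in the DAG g (a dict: node -> set of children)."""
    seen, stack = set(), [v]
    while stack:
        for c in g.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def undirected_paths(g, x, y):
    """All simple paths from x to y in the underlying undirected graph."""
    nbrs = {}
    for a, children in g.items():
        for b in children:
            nbrs.setdefault(a, set()).add(b)
            nbrs.setdefault(b, set()).add(a)
    def walk(path):
        if path[-1] == y:
            yield path
            return
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                yield from walk(path + [n])
    yield from walk([x])

def blocked(g, path, z):
    """A path is blocked by Z if some interior node is a chain/fork node in Z,
    or a collider that is outside Z and has no descendant in Z."""
    for i in range(1, len(path) - 1):
        a, b, c = path[i - 1], path[i], path[i + 1]
        if b in g.get(a, set()) and b in g.get(c, set()):  # collider a -> b <- c
            if b not in z and not descendants(g, b) & z:
                return True
        elif b in z:
            return True
    return False

def d_separated(g, xs, ys, z):
    """X and Y (disjoint from Z) are d-separated by Z iff every path is blocked."""
    return all(blocked(g, p, set(z))
               for x in xs for y in ys for p in undirected_paths(g, x, y))
```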

Now d-separation is really useful, since it tells us everything there is to know about the conditional independences in Bayesian networks with underlying graph G. Indeed,

Theorem 5

  • Soundness of d-separation (Verma and Pearl, 1988): If P is a Markov distribution with respect to a graph G, then for all disjoint subsets X, Y, Z of nodes of G, X ⊥ Y | Z implies that X ⊥⊥ Y | Z.

  • Completeness of d-separation (Meek, 1995): If X ⊥⊥ Y | Z for all P Markov with respect to G, then X ⊥ Y | Z.

We can combine the previous examples of fork, collider and chain graphs to get the following

[Figure: DAG combining the examples, with edges Allergic reaction → Sneezing, Flu → Sneezing and Flu → Fever]

A priori, Allergic reaction is conditionally independent of Fever. Indeed, we have the partition

[Figure: a partition {U, V, W, Z} of the nodes witnessing the d-separation]

which clearly satisfies the d-separation condition. However, if Sneezing is known then W = ∅, so Allergic reaction and Fever are no longer independent. Indeed, if we use the same sets U and V as before, then m(U) ∩ m(V) = {Sneezing}, so the condition for d-separation fails; and it fails for any possible choice of U and V. Interestingly, if Flu is also known, we again obtain conditional independence between Allergic reaction and Fever, as shown below.

[Figure: a partition witnessing d-separation when Sneezing and Flu are both known]
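Feeding this graph to the d_separated sketch above (with hypothetical node names, and taking the combined graph to be Allergic reaction → Sneezing ← Flu → Fever) reproduces all three claims:

```python
g = {"Flu": {"Fever", "Sneezing"}, "AllergicReaction": {"Sneezing"}}

print(d_separated(g, {"AllergicReaction"}, {"Fever"}, set()))         # True
print(d_separated(g, {"AllergicReaction"}, {"Fever"}, {"Sneezing"}))  # False
print(d_separated(g, {"AllergicReaction"}, {"Fever"},
                  {"Sneezing", "Flu"}))                               # True
```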

Before describing the limitations of this setup and why we may want to generalise it, it is worth observing that Theorem 5 is genuinely useful computationally: given a Bayesian network (G, P), the structure of G gives us a recipe to factor P, thereby greatly increasing the efficiency of Bayesian inference.
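To put a number on this: with all four variables in the graph above binary, storing the full joint distribution takes 2^4 − 1 = 15 independent parameters, while the Markov factorisation needs only

1 + 1 + 2 + 4 = \sum_{X} 2^{|Pa(X)|} = 8

(one parameter each for Flu and Allergic reaction, two for P(Fever | Flu), and four for P(Sneezing | Allergic reaction, Flu)), and the gap grows exponentially with the number of nodes.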

Latent variables, hidden variables, and unobservables

In the context of Bayesian networks, there are two reasons that we may wish to add variables to a probabilistic model, even if we are not entirely sure what the variables signify or how they are distributed. The first reason is statistical and the second is physical.

Consider the example of flu, fever and sneezing discussed earlier. Although our analysis told us Fever ⊥⊥ Sneezing | Flu, if we conduct an experiment we are likely to find:

P(fever \ | \ sneezing, \ flu) \neq P(fever \ | \ flu).

The problem is that the graph does not model reality exactly, only a simplification of it. After all, there are a whole bunch of things that can cause sneezing and fever. We just don’t know what they all are or how to measure them. So, to make the network work, we may add a hypothetical latent variable that bunches together all the unknown joint causes, and equip it with a distribution that makes the whole network Bayesian, so that we are still able to perform inference methods like belief propagation.

[Figure: the graph above with a latent variable added as a common cause of Sneezing and Fever]

On the other hand, we may want to add variables to a Bayesian network if we have evidence that doing so will provide a better model of reality.

For example, consider the network with just two connected nodes

[Figure: two connected nodes, Wet road → Wet grass]

Every distribution on this graph is Markov, and we would expect there to be a correlation between a road being wet and the grass next to it being wet as well, but most people would claim that there’s something missing from the picture. After all, rain could be a ‘common cause’ of the road and the grass being wet. So, it makes sense to add a third variable.

But maybe we can’t observe whether it has rained or not, only whether the grass and/or road are wet. Nonetheless, the correlation we observe suggests that they have a common cause. To deal with such cases, we could make the third variable hidden. We may not know what information is included in a hidden variable, nor its probability distribution.

All that matters is that the hidden variable helps to explain the observed correlations.

[Figure: Wet road and Wet grass with a hidden common cause, Rain]

So, latent variables are a statistical tool that ensures the Markov condition holds. Hence they are inherently classical, and can, in theory, be known. But the universe is not classical, so even if we lump whatever we want into as many classical hidden variables as we want and put them wherever we need, in some cases there will still be empirically observed correlations that do not satisfy the Markov condition.

Most famously, Bell’s experiment shows that it is possible to have distinct variables A and B that exhibit correlations that cannot be explained by any classical hidden variable, since classical variables are restricted by the principle of locality.

In other words, though A ⊥ B | Λ,

P(a \ | \ b, \ \lambda) \neq P(a \ | \ \lambda).

Implicitly, this means that a classical Λ is not enough. If we want P(a | b, λ) ≠ P(a | λ) to hold, Λ must be a non-local (non-classical) variable. Quantum mechanics implies that we can’t possibly empirically find the value of a non-local variable (for reasons similar to Heisenberg’s uncertainty principle), so non-classical variables are often called unobservables. In particular, it is meaningless to ask whether A ⊥⊥ B | Λ, as we would need to know the value of Λ in order to condition on it.

Indeed, this is the key idea behind what follows. We declare certain variables to be unobservable and then insist that conditional (in)dependence only makes sense between observable variables conditioned over observable variables.
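As a concrete illustration of correlations beyond any classical Λ, one can enumerate every deterministic local strategy in the Bell (CHSH) test and compare them with the PR box, the standard Boxworld correlation. This is a sketch with standard textbook numbers, not taken from the paper:

```python
from itertools import product

def chsh(E):  # CHSH combination of the correlators E(x, y)
    return E(0, 0) + E(0, 1) + E(1, 0) - E(1, 1)

# Local deterministic strategies: a = f(x), b = g(y). Shared classical
# randomness is a mixture of these, so it cannot exceed their maximum.
best = max(abs(chsh(lambda x, y, f=f, g=g: (-1) ** (f[x] + g[y])))
           for f in product((0, 1), repeat=2)
           for g in product((0, 1), repeat=2))
print(best)  # 2: the classical (Bell) bound

# PR box: P(a, b | x, y) = 1/2 if a XOR b = x*y, else 0.
def E_pr(x, y):
    return sum((-1) ** (a + b) * (0.5 if (a ^ b) == x * y else 0.0)
               for a in (0, 1) for b in (0, 1))
print(chsh(E_pr))  # 4.0: the non-signalling maximum; quantum theory reaches 2*sqrt(2)
```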

Generalising classical causality

The correlations observed in the Bell experiment can be explained by quantum mechanics. But thought experiments such as the one described here suggest that theoretically, correlations may exist that violate even quantum causality.

So, given that graphical models and d-separation provide such a powerful tool for causal reasoning in the classical context, how can we generalise the Markov condition and Theorem 5 to quantum, and even more general causal theories? And, if we have a theory-independent Markov condition, are there d-separation results that don’t correspond to any given causal theory?

Clearly the first step in answering these questions is to fix a definition of a causal theory.

Operational probabilistic theories

An operational theory is a symmetric monoidal category (C, ⊗, I) whose objects are known as systems or resources. Morphisms are finite sets f = {𝒞_i}_{i ∈ I} called tests, whose elements are called outcomes. Tests with a single element are called deterministic, and for each system A ∈ ob(C), the identity id_A ∈ C(A, A) is a deterministic test.

In this discussion, we’ll identify tests {𝒞_i}_i, {𝒟_j}_j in C if we may always replace one with the other without affecting the distributions in C(I, I).

Given {𝒞_i}_i ∈ C(B, C) and {𝒟_j}_j ∈ C(A, B), their composite is given by

\{ \mathcal{C}_i \circ \mathcal{D}_j \}_{i,j} \in \mathsf{C}(A, C).

First apply 𝒟, with output B, then apply 𝒞, with output C.

The monoidal composition {𝒞_i ⊗ 𝒟_j}_{i,j} ∈ C(A ⊗ C, B ⊗ D) corresponds to applying {𝒞_i}_i ∈ C(A, B) and {𝒟_j}_j ∈ C(C, D) separately on A and C.

An operational probabilistic theory or OPT is an operational theory such that every test I → I is a probability distribution.

A morphism {𝒞_i}_i ∈ C(A, I) is called an effect on A. An OPT C is called causal, or a causal theory, if, for each system A ∈ ob(C), there is a unique deterministic effect ⊤_A ∈ C(A, I), which we call the discard of A.

In particular, for a causal OPT C, uniqueness of the discard implies that, for all systems A, B ∈ ob(C),

⊤_A ⊗ ⊤_B = ⊤_{A ⊗ B}, and, given any deterministic test 𝒞 ∈ C(A, B),

\top_B \circ \mathcal{C} = \top_A.

The existence of a discard map allows a definition of causal morphisms in a causal theory. For example, as we saw in January when we discussed Kissinger and Uijlen’s paper, a test {𝒞_i}_i ∈ C(A, B) is causal if

\top_B \circ \{ \mathcal{C}_i \}_i = \top_A \in \mathsf{C}(A, I).

In other words, for a causal test, discarding the outcome is the same as not performing the test. Intuitively it is not obvious why such morphisms should be called causal. But this definition enables the formulation of a non-signalling condition that describes the conditions under which the possibility of cause-effect correlation is excluded, in particular, it implies the impossibility of time travel.

Examples

The category Mat(ℝ₊), whose objects are the natural numbers and with Mat(ℝ₊)(m, n) the set of n × m matrices with entries in ℝ₊, has the structure of a causal OPT. The causal morphisms in Mat(ℝ₊) are the stochastic maps (the matrices whose columns sum to 1). This category describes classical probability theory.
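Concretely, in Mat(ℝ₊) the discard ⊤_n is the 1 × n row of ones, and the causality equation ⊤_B ∘ 𝒞 = ⊤_A says exactly that each column of the matrix sums to 1. A quick check, with an arbitrarily chosen stochastic matrix:

```python
import numpy as np

def discard(n):
    return np.ones((1, n))  # the unique deterministic effect on the system n

C = np.array([[0.2, 0.7],
              [0.8, 0.3]])  # a stochastic map 2 -> 2: each column sums to 1

# Causality: discarding the output is the same as discarding the input.
assert np.allclose(discard(2) @ C, discard(2))

# Discards respect the monoidal (tensor) product: ⊤_m ⊗ ⊤_n = ⊤_{mn}.
assert np.allclose(np.kron(discard(2), discard(3)), discard(6))
```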

The category CPM of sets of linear operators on Hilbert spaces and completely positive maps between them is an OPT and describes quantum theory. The causal morphisms are the trace-preserving completely positive maps.

Finally, Boxworld is the theory that allows one to describe any correlation between two variables as caused by some resource of the theory in the past.

Generalised Bayesian networks

So, we’re finally ready to give the main construction and results of the paper. As mentioned before, to get a generalised d-separation result, the idea is that we will distinguish observable and unobservable variables, and simply insist that conditional independence is only defined relative to observable variables.

To this end, a generalised DAG or GDAG is a DAG G together with a partition of the nodes of G into two subsets, called observed and unobserved. We’ll represent observed nodes by triangles and unobserved nodes by circles. An edge out of an observed (unobserved) node will be called observed (unobserved) and represented by a solid (dashed) arrow.

In order to get a generalisation of Theorem 5, we still need to come up with a sensible generalisation of the Markov property which will essentially say that at an observed node that has only observed parents, the distribution must be Markov. However, if an observed node has an unobserved parent, the latter’s whole history is needed to describe the distribution.

To state this precisely, we will associate a causal theory (C, ⊗, I) to a GDAG G via an assignment of systems to edges of G and tests to nodes of G, such that the observed edges of G will ‘carry’ only the outcomes of classical tests (so will say something about conditional probability) whereas unobserved edges will carry only the output system.

Precisely, such an assignment P satisfies the generalised Markov condition (GMC), and is called a generalised Markov distribution, if

  • Each unobserved edge corresponds to a distinct system in the theory.

  • If we can’t observe what is happening at a node, we can’t condition over it: To each unobserved node and each value of its observed parents, we assign a deterministic test from the system defined by the product of its incoming (unobserved) edges to the system defined by the product of its outgoing (unobserved) edges.

  • Each observed node X is an observation test, i.e. a morphism in C(A, I) for the system A ∈ ob(C) corresponding to the product of the systems assigned to the unobserved input edges of X. Since C is a causal theory, this says that X is assigned a classical random variable, also denoted X, and that if Y is an observed node with observed parent X, the distribution at Y is conditionally dependent on the distribution at X (see here for details).

  • It therefore follows that each observed edge is assigned the trivial system I.

  • The joint probability distribution on the observed nodes of G is given by the morphism in C(I, I) that results from these assignments.

A generalised Bayesian network consists of a GDAG G together with a generalised Markov distribution P on G.

Example

Consider the following GDAG

[Figure: a GDAG with observed nodes X, Y, Z, W (triangles) and an unobserved node A (circle)]

Let’s build its OPT morphism as indicated by the generalised Markov condition.

The observed node X has no incoming edges, so it corresponds to a morphism in C(I, I), and thus we assign a probability distribution to it.

The unobserved node A depends on X, and has no unobserved inputs, so we assign a deterministic test A(x): I → A for each value x of X.

[Figure: the partial OPT diagram built so far]

The observed node Y has one incoming unobserved edge and no incoming observed edges, so we assign to it a test Y: A → I such that, for each value x of X, Y ∘ A(x) is a probability distribution.

Building up the rest of the picture gives an OPT diagram of the form

[Figure: the complete OPT diagram for the GDAG]

which is a morphism in C(I, I) that defines the joint probability distribution P(x, y, z, w). We now have all the ingredients to state Theorem 22, the generalised d-separation theorem. This is the analogue of Theorem 5 for generalised Markov distributions.
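To see the recipe in a familiar setting, here is a classical (Mat(ℝ₊)) instantiation of just the X → A → Y fragment of the diagram, with invented numbers: the states A(x) are columns, the observation test Y is a family of effects summing to the discard of A, and composing everything yields a morphism in C(I, I), i.e. a joint distribution.

```python
import numpy as np

p_x = np.array([0.3, 0.7])  # probability distribution at the observed root X

# A deterministic test A(x): I -> A for each value x of X (states of the
# 2-dimensional unobserved classical system A)
A = {0: np.array([[0.9], [0.1]]),
     1: np.array([[0.2], [0.8]])}

# Observation test Y = {Y_y}: effects A -> I that sum to the discard of A
Y = {0: np.array([[0.6, 0.25]]),
     1: np.array([[0.4, 0.75]])}
assert np.allclose(Y[0] + Y[1], np.ones((1, 2)))

# Composing the diagram gives the C(I, I) morphism, i.e. the joint P(x, y)
P = {(x, y): p_x[x] * (Y[y] @ A[x]).item() for x in (0, 1) for y in (0, 1)}
print(P, sum(P.values()))  # normalised: the values sum to 1
```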

Theorem 22

Given a GDAG G and subsets X, Y, Z of observed nodes:

  • if a probability distribution P is generalised Markov relative to G, then X ⊥ Y | Z implies X ⊥⊥ Y | Z;

  • if X ⊥⊥ Y | Z holds for all generalised Markov probability distributions on G, then X ⊥ Y | Z.

Note in particular that there is no change in the definition of d-separation: d-separation of a GDAG G is simply d-separation with respect to its underlying DAG. There is also no change in the definition of conditional independence. Now, however, we restrict to statements of conditional independence with respect to observed nodes only. This enables the generalised soundness and completeness statements of the theorem.

The proof of soundness uses uniqueness of discarding, and completeness follows since generalised Markov is a stronger condition on a distribution than classically Markov.

Classical distributions on GDAGs

Theorem 22 is all well and good. But does it really generalise the classical case? That is, can we recover Theorem 5 for all classical Bayesian networks from Theorem 22?

As a first step, Proposition 17 states that if all the nodes of a generalised Bayesian network are observed, then it is a classical Bayesian network. In fact, this follows pretty immediately from the definitions.

Moreover, it is easily checked that, given a classical Bayesian network, even if it has hidden or latent variables, it can still be expressed directly as a generalised Bayesian network with no unobserved nodes.

In fact, Theorem 22 generalises Theorem 5 in a stricter sense: the generalised Bayesian network setup together with classical causality adds nothing extra to the theory of classical Bayesian networks. If a generalised Markov distribution is classical (in which case hidden and latent variables may be represented by unobserved nodes), it can be viewed as a classical Bayesian network. More precisely, Lemma 18 says that, given any generalised Bayesian network (G, P) with underlying DAG G′ and distribution P ∈ 𝒞, we can construct a classical Bayesian network (G′, P′) such that P′ agrees with P on the observed nodes.

It is worth voicing a note of caution. The authors themselves mention in the conclusion that the construction based on GDAGs with two types of nodes is not entirely satisfactory. The problem is that, although the setups and results presented here do give a generalisation of Theorem 5, they do not, as such, provide a way of generalising Bayesian networks as they are used for probabilistic inference to non-classical settings. For example, belief propagation works through observed nodes, but there is no apparent way of generalising it for unobserved nodes.

Theory independence

More generally, given a GDAG G, we can look at the set of distributions on G that are generalised Markov with respect to a given causal theory. Of particular importance are the following.

  • The set 𝒞 of generalised Markov distributions in Mat(ℝ₊) on G.

  • The set 𝒬 of generalised Markov distributions in CPM on G.

  • The set 𝒢 of all generalised Markov distributions on G. (This is the set of generalised Markov distributions in Boxworld.)

Moreover, we can distinguish another class ℐ of distributions on G, by not restricting to d-separation of observed nodes, but considering those distributions that satisfy the observable conditional independences given by any d-separation properties of the graph. Theorem 22 implies, in particular, that 𝒢 ⊆ ℐ.

And so, since Mat(ℝ₊) embeds into CPM, we have 𝒞 ⊆ 𝒬 ⊆ 𝒢 ⊆ ℐ.

This means that one can ask for which graphs (some or all of) these inclusions are strict, and the last part of the paper explores these questions. In the original paper, a sufficient condition is given for graphs to satisfy 𝒞 ≠ ℐ, i.e. for these graphs it is guaranteed that the causal structure admits correlations that are non-local. Moreover, the authors show that their condition is necessary for small enough graphs.

Another interesting result is that there exist graphs for which 𝒢 ≠ ℐ. This means that using a theory of resources, whatever theory it may be, to explain correlations imposes constraints that are stronger than those imposed by the d-separation relations themselves.

What next?

This setup represents one direction for using category theory to generalise Bayesian networks. In our group work at the ACT workshop, we considered another generalisation of Bayesian networks, this time staying within the classical realm. Namely, building on the work of Bonchi, Gadducci, Kissinger, Sobocinski, and Zanasi, we gave a functorial Markov condition on directed graphs admitting cycles. Hopefully we’ll present this work here soon.

Posted at July 7, 2018 11:11 AM UTC
