Operads and Phylogenetic Trees
Posted by John Baez
A few years ago, after hearing Susan Holmes speak about the mathematics of phylogenetic trees, I became interested in their connection to algebraic topology. I wrote an article about it:
- John Baez, Operads and the tree of life, Azimuth, 6 July 2011.
In trying to the make the ideas precise I recruited the help of Nina Otter, who was then a graduate student at ETH Zürich. She came to Riverside and we started to work together.
Now Nina Otter is a grad student at Oxford working on mathematical biology with Heather Harrington. I visited her last summer and we made more progress… but then she realized that our paper needed another big theorem, a result relating our topology on the space of phylogenetic trees to the topology described by Susan Holmes and her coauthors here:
- Louis J. Billera, Susan P. Holmes and Karen Vogtmann, Geometry of the space of phylogenetic trees, Advances in Applied Mathematics 27 (2001), 733–767.
It took another half year to finish things up. I could never have done this myself.
But now we’re done! Here’s our paper:
- John Baez and Nina Otter, Operads and phylogenetic trees.
Let me tell you the basic idea….
Trees are important, not only in mathematics, but also biology. The most important is the ‘tree of life’ relating all organisms that have ever lived on Earth. Darwin drew this sketch of it in 1837:
He wrote about it in On the Origin of Species:
The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green and budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species. At each period of growth all the growing twigs have tried to branch out on all sides, and to overtop and kill the surrounding twigs and branches, in the same manner as species and groups of species have at all times overmastered other species in the great battle for life.
Now we know that the tree of life is not really a tree in the mathematical sense. One reason is ‘endosymbiosis’: the incorporation of one organism together with its genetic material into another, as probably happened with the mitochondria in our cells and also the plastids that hold chlorophyll in plants. Another is ‘horizontal gene transfer’: the passing of genetic material from one organism to another, which happens frequently with bacteria. So, the tree of life is really a thicket:
But a tree is often a good approximation, especially for animals and plants in the last few hundred million years. Biologists who try to infer phylogenetic trees from present-day genetic data often use simple models where:
- the genotype of each species follows a random walk, but
- species branch in two at various times.
These are called ‘Markov models’. The simplest Markov model for DNA evolution is the Jukes–Cantor model. Consider one or more pieces of DNA having a total of $n$ base pairs. We can think of this as a string of letters chosen from the set {A,T,C,G}:
…ATCGATTGAGCTCTAGCG…
As time passes, the Jukes–Cantor model says the DNA changes randomly, with each base pair having the same constant rate of randomly flipping to any other. So, we get a Markov process on the set
$X = \{ A,T,C,G \}{}^N$
However, a species can also split in two. So, given current-day genetic data from various species, biologists try to infer the most probable tree where, starting from a common ancestor, the DNA in question undergoes a random walk most of the time but branches in two at certain times.
To formalize this, we can define a concept of ‘phylogenetic tree’. Our work is based on the definition of Billera, Holmes and Vogtmann, though we use a slightly different definition, for reasons that will soon become clear. For us, a phylogenetic tree is a rooted tree with leaves labelled by numbers $1,2, \dots, n$ and edges labelled by ‘times’ or, geometrically speaking, ‘lengths’ in $[0, \infty)$. We require that:
- the length of every edge is positive, except perhaps for ‘external edges’: that is, edges incident to the leaves or root;
- there are no 1-ary vertices.
For example, here is a phylogenetic tree with 5 leaves:
where $\ell_1, \dots, \ell_6 \ge 0$ but we demand that $\ell_7 > 0$. We draw the vertices as dots. We do not count the leaves and the root as vertices, and we label the root with the number $0$. We cannot collapse edges of length zero that end at leaves, since doing so would eliminate those leaves. Also note that the embedding of the tree in the plane is irrelevant, so this counts as the same phylogenetic tree:
In applications to biology, we are often interested in trees where the total distance from the root to the leaf is the same for every leaf, since all species have evolved for the same time from their common ancestor. These are mathematically interesting as well, because then the distance between any two leaves defines an ultrametric on the set of leaves. However, more general phylogenetic trees are also interesting—and they become essential when we construct an operad whose operations are phylogenetic trees.
Let ${Phyl}_n$ be the set of phylogenetic trees with $n$ leaves. This has a natural topology, which we explain in Section 6 of our paper. For example, here is a continuous path in ${Phyl}_4$ where we only change the length of one internal edge, reducing it until it becomes zero and we can collapse it:
Phylogenetic trees reconstructed by biologists are typically binary. When a phylogenetic tree appears to have higher arity, sometimes we merely lack sufficient data to resolve a higher-arity branching into a number of binary ones. With the topology we are using on ${Phyl}_n$, binary trees form an open dense set of ${Phyl}_n$, except for ${Phyl}_1$. However, trees of higher arity are still important, because paths, paths of paths, etc. in ${Phyl}_n$ are often forced to pass through trees of higher arity.
Billera, Holmes and Vogtmann focused their attention on the set $\mathcal{T}_n$ of phylogenetic trees where lengths of the external edges—edges incident to the root and leaves—are fixed to a constant value. They endow $\mathcal{T}_n$ with a metric, which induces a topology on $\mathcal{T}_n$, and there is a homeomorphism
${Phyl}_n \cong \mathcal{T}_n \times [0,\infty)^{n+1}$
where the data in $[0,\infty)^{n+1}$ describe the lengths of the external edges in a general phylogenetic tree.
In algebraic topology, trees are often used to describe the composition of $n$-ary operations. This is formalized in the theory of operads. An ‘operad’ is an algebraic stucture where for each natural number $n=0,1,2,\dots$ we have a set $O_n$ whose elements are considered as abstract $n$-ary operations, not necessarily operating on anything yet. An element $f \in O_n$ can be depicted as a planar tree with one vertex and $n$ labelled leaves:
We can compose these operations in a tree-like way to get new operations:
and an associative law holds, making this sort of composite unambiguous:
There are various kinds of operads, but in our paper the operads are always ‘unital’, having an operation $1 \in O_1$ that acts as an identity for composition. They are also ‘symmetric’, meaning there is an action of the symmetric group $S_n$ on each set $O_n$, compatible with composition. Further, they are ‘topological’, meaning that each set $O_n$ is a topological space, with composition and permutations acting as continuous maps.
In Section 5 we prove that there is an operad ${Phyl}$, the ‘phylogenetic operad’, whose space of $n$-ary operations is ${Phyl}_n$. This raises a number of questions:
- What is the mathematical nature of this operad?
- How is it related to ‘Markov processes with branching’?
- How is it related to known operads in topology?
Briefly, the answer is that ${Phyl}$ is the coproduct of ${Com}$, the operad for commutative topological semigroups, and $[0,\infty)$, the operad having only unary operations, one for each $t \in [0,\infty)$. The first describes branching, the second describes Markov processes. Moreover, ${Phyl}$ is closely related to the Boardmann–Vogt $W$ construction applied to ${Com}$. This is a construction that Boardmann and Vogt applied to another operad in order to obtain an operad whose algebras are loop spaces.
To understand all this in more detail, first recall that the raison d’être of operads is to have ‘algebras’. The most traditional sort of algebra of an operad $O$ is a topological space $X$ on which each operation $f \in O_n$ acts as a continuous map
$\alpha(f)\colon X^n \to X$
obeying some conditions: composition, the identity, and the permutation group actions are preserved, and $\alpha(f)$ depends continuously on $f$. The idea is that the abstract operations in $O$ are realized as actual operations on the space $X$.
In this paper we instead need algebras of a linear sort. Such an algebra of $O$ is a finite-dimensional real vector space $V$ on which each operation $f \in O_n$ acts as a multilinear map
$\alpha(f)\colon V^n \to X$
obeying the same list of conditions. We can also think of $\alpha(f)$ as a linear map
$\alpha(f)\colon V^{\otimes n} \to V$
where $V^{\otimes n}$ is the $n$th tensor power of $V$.
We also need ‘coalgebras’ of operads. The point is that while ordinarily one thinks of an operation $f \in O_n$ as having $n$ inputs and one output, a phylogenetic tree is better thought of as having one input and $n$ outputs. A coalgebra of $O$ is a finite-dimensional real vector space $V$ on which operation $f \in O_n$ gives a linear map
$\alpha(f) \colon V \to V^{\otimes n}$
obeying the same conditions as an algebra, but ‘turned around’.
The main point of this paper is that the phylogenetic operad has interesting coalgebras, which correspond to how phylogenetic trees are actually used to describe branching Markov processes in biology. But to understand this, we need to start by looking at coalgebras of two operads from which the phylogenetic operad is built.
By abuse of notation, we will use $[0,\infty)$ as the name for the operad having only unary operations, one for each $t \in [0,\infty)$, with composition of operations given by addition. A coalgebra of $[0,\infty)$ is a finite-dimensional real vector space $V$ together with for each $t \in [0,\infty)$ a linear map
$\alpha(t) \colon V \to V$
such that:
- $\alpha(s+t) = \alpha(s) \alpha(t)$ for all $s,t \in [0,\infty)$
- $\alpha(0) = 1_V$
- $\alpha(t)$ depends continuously on $t$.
Analysts usually call such a thing a ‘continuous one-parameter semigroup’ of operators on $V$.
Given a finite set $X$, a ‘Markov process’ or ‘continuous-time Markov chain’ on $X$ is a continuous one-parameter semigroup of operators on $\mathbb{R}^X$ such that if $f \in \mathbb{R}^X$ is a probability distribution on $X$, so is $\alpha(t) f$ for all $t \in [0,\infty)$. Equivalently, if we think of $\alpha(t)$ as a $X \times X$ matrix of real numbers, we demand that its entries be nonnegative and each column sum to 1. Such a matrix is called ‘stochastic’. If $X$ is a set of possible sequences of base pairs, a Markov process on $X$ describes the random changes of DNA with the passage of time. We shall see that any Markov process on $X$ makes $\mathbb{R}^X$ into a coalgebra of $[0,\infty)$.
This handles the Markov process aspect of DNA evolution; what about the branching? For this we use ${Com}$, the unique operad with one $n$-ary operation for each $n$ > 0. Algebras of ${Com}$ are not-necessarily-unital commutative algebras: there is only one way to multiply $n$ elements for $n$ > 0.
For us what matters most is that coalgebras of ${Com}$ are finite-dimensional cocommutative coalgebras, not necessarily with counit. If $X$ is a finite set, there is a cocommutative coalgebra whose underlying vector space is $\mathbb{R}^X$. The unique $n$-ary operation of ${Com}$ acts as the linear map
$\displaystyle{ \Delta_n \colon \mathbb{R}^X \to \mathbb{R}^X \otimes \cdots \otimes \mathbb{R}^X \; \cong \;\mathbb{R}^{X^n} }$
where
$\Delta_n (f)(x_1, \dots, x_n) = \left\{ \begin{array}{cl} f(x) & {if } x_1 = \cdots = x_n = x \\ \\ 0 & {otherwise} \end{array} \right.$
This map describes the ‘$n$-fold duplication’ of a probability distribution $f$ on the set $X$ of possible genes. We use this map to say what takes place when a species branches:
Next, we wish to describe how to combine the operads $[0,\infty)$ and ${Com}$ to obtain the phylogenetic operad. Any pair of operads $O$ and $O'$ has a coproduct $O + O'$. The definition of coproduct gives an easy way to understand the algebras of $O + O'$. Such an algebra is simply an object that is both an algebra of $O$ and an algebra of $O'$, with no compatibility conditions imposed.
One can also give an explicit construction of $O + O'$. Briefly, the $n$-ary operations of $O + O'$ are certain equivalence classes of trees with leaves labelled $\{1,\dots, n\}$ and all nodes, except for leaves, labelled by operations in $O$ or $O'$. While this fact is surely known to experts on operads, it seems hard to find in the literature, so we prove this in Theorem 24. Cat lovers out there will be pleased to note that the proof relies on the existence of an algebraic theory whose algebras are operads.
Given this, it should come as no surprise that the operad ${Phyl}$ is the coproduct ${Com} + [0,\infty)$. In fact, we shall take this as a definition. Starting from this definition, we work backwards to show that the operations of ${Phyl}$ correspond to phylogenetic trees. We prove this in Theorem 28. The definition of coproduct determines a topology on the spaces ${Phyl}_n$, and it is a nontrivial fact that with this topology we have ${Phyl}_n \cong \mathcal{T}_n \times [0,\infty)^{n+1}$ for $n \gt 1$, where $\mathcal{T}_n$ has the topology defined by Billera, Holmes and Vogtmann. We prove this in Theorem 34.
Using the definition of the phylogenetic operad as a coproduct, it is clear that given any Markov process on a finite set $X$, the vector space $\mathbb{R}^X$ naturally becomes a coalgebra of this operad. The reason is that, as we have seen, $\mathbb{R}^X$ is automatically a coalgebra of ${Com}$, and the Markov process makes it into a coalgebra of $[0,\infty)$. Thus, by the universal property of a coproduct, it becomes a coalgebra of ${Phyl} \cong {Com} + [0,\infty)$. We prove this in Theorem 36.
Since operads arose in algebraic topology, it is interesting to consider how the phylogenetic operad connects to ideas from that subject. Boardmann and Vogt defined a construction on operads, the ‘$W$ construction’, which when applied to the operad for spaces with an associative multiplication gives an operad for loop spaces. The operad ${Phyl}$ has an interesting relation to $W({Com})$. To see this, define addition on $[0,\infty]$ in the obvious way, where
$\infty + t = t + \infty = \infty$
Then $[0,\infty]$ becomes a commutative topological monoid, so we obtain an operad with only unary operations, one for each $t \in [0,\infty]$, where composition is addition. By abuse of notation, let us call this operad $[0,\infty]$.
Boardmann and Vogt’s $W$ construction involves trees with edges having lengths in $[0,1]$, but we can equivalently use $[0,\infty]$. Leinster has observed that for any nonsymmetric topological operad $O$, Boardmann and Vogt’s operad $W(O)$ is closely related to $O + [0,\infty]$. Here we make this observation precise in the symmetric case. Operations in ${Com} + [0,\infty]$ are just like phylogenetic trees except that edges may have length $\infty$. Moreover, for any operad $O$, the operad $W(O)$ is a non-unital suboperad of $O + [0,\infty]$. An operation of $O + [0,\infty]$ lies in $W(O)$ if and only if all the external edges of the corresponding tree have length $\infty$. We prove this in Theorem 40.
Berger and Moerdijk showed that if $S_n$ acts freely on $O_n$ and $O_1$ is well-pointed, then $W(O)$ is a cofibrant replacement for $O$. This is true for $O = {Assoc}$, the operad whose algebras are topological semigroups. This cofibrancy is why Boardmann and Vogt could use $W({Assoc})$ as an operad for loop spaces. But $S_n$ does not act freely on ${Com}_n$, and $W({Com})$ is not a cofibrant replacement for ${Com}$. So, it is not an operad for infinite loop spaces.
Nonetheless, the larger operad ${Com} + [0,\infty]$, a compactification of ${Phyl} = {Com} + [0,\infty)$, is somewhat interesting. The reason is that any Markov process
$\alpha \colon [0,\infty) \to \mathrm{End}(\mathbb{R}^X)$
approaches a limit as $t \to \infty$. Indeed, $\alpha$ extends uniquely to a homomorphism from the topological monoid $[0,\infty]$ to $\mathrm{End}(\mathbb{R}^X)$. Thus, given a Markov process on a finite set $X$, the vector space $\mathbb{R}^X$ naturally becomes a coalgebra of ${Com} + [0,\infty]$. We prove this in Theorem 38.
Re: Operads and Phylogenetic Trees
This is very cool! It’s neat that “Markov processes with branching” used in genomics have a nice operadic description, and it’s tantalizing that there might be some connection to the original application of operads (iterated loop spaces).
As a step in the latter direction, we might note that if $\tilde{C}$ is a $\Sigma$-cofibrant replacement for $Com$ (that is, $S_n$ acts freely on $\tilde{C}_n$ and we have a weak equivalence $\tilde{C}\to Com$), then any $Com$-(co)algebra becomes automatically an $\tilde{C}$-(co)algebra. Therefore, the phylogenetic Markov models should also yield $(\tilde{C}+[0,\infty])$-coalgebras, while $W(\tilde{C})$ would be an operad for infinite loop spaces. Then if we could apply some natural contravariant monoidal functor from vector spaces to topological spaces, we would get an actual infinite loop space, and we could ask whether it’s anything familiar to algebraic topologists.