# Planet Musings

## October 07, 2015

### Chad Orzel — 036/366: Pine Bush

The difficult-to-spell name “Schenectady” (where Union is located) derives from a Mohawk word meaning “beyond the pines.” The pines in question are an extensive region of pine barrens between Albany and Schenectady, a small bit of which survives as the Albany Pine Bush Nature Preserve. They’ve got a nice little nature center and some trails through the pine bush, and I took The Pip down there this morning, because his day care was closed again for the last of the fall block of Jewish holidays. Here’s a shot to give you an idea of the landscape:

Plants in the Pine Bush Nature Preserve.

Unfortunately, this was as far as we got into the preserve, as the Little Dude wanted nothing to do with any kind of outdoor activity. He rampaged around the (rather small) visitor center for a while, and after lunch was very happy to run in circles in one of the local shopping malls, but he just did not want to be outside, despite the fact that it was a beautiful day today.

As someone who grew up in a rural area, I find this very frustrating, but, you know, there are worse forms of rebellion. I’m going to have to go back there by myself one of these days, to get a real look around.

Tomorrow, we’re done with Jewish holidays, and back to the regular day care routine. And while I did have a lot of fun hanging out with The Pip, I’d be lying if I said it wasn’t a relief to be handing him off to somebody else to entertain during the day…

## October 06, 2015

### Clifford Johnson — Get ready for some “movie science” chatter…

Yes, I've been hanging out with my Screen Junkies friends again, and this time I also got to meet JPL's Christina Heinlein, who you may recall was in the first of the Screen Junkies "Movie Science" episodes last year. While we were both in it, I'd not got to meet her that time since our chats with host Hal Rudnick were recorded at quite different times. This time, however, schedules meant [...]

The post Get ready for some “movie science” chatter… appeared first on Asymptotia.

### Clifford Johnson — Neutrinos!

Neutrinos!

See the press release here, and congratulations to the winners!

(Honestly, I thought that the Nobel prize for this had already been given...)

The post Neutrinos! appeared first on Asymptotia.

### Alexey Petrov — Nobel Prize in Physics 2015

So, the Nobel Prize in Physics 2015 has been announced. To the surprise of many (including the author), it was awarded jointly to Takaaki Kajita and Arthur B. McDonald “for the discovery of neutrino oscillations, which shows that neutrinos have mass.” A well-deserved Nobel Prize for a fantastic discovery.

What is this Nobel prize all about? Some years ago (circa 1997) there were a couple of “deficit” problems in physics. First, the detected number of (electron) neutrinos coming from the Sun was measured to be less than expected. This could be explained in a number of ways. First, neutrinos could oscillate: neutrinos produced as electron neutrinos in nuclear reactions in the Sun could turn into muon or tau neutrinos and thus go undetected by existing experiments, which were sensitive to electron neutrinos. This was the most exciting possibility, and it ultimately turned out to be correct! But it was by no means the only one. For example, one could say that the Standard Solar Model (SSM) predicted the fluxes wrong; after all, the flux of solar neutrinos is proportional to the core temperature raised to a very high power ($\sim T^{25}$ for $^8$B neutrinos, for example). So it is reasonable to say that the neutrino flux is not so well known because the temperature is not well measured (this might be disputed by solar physicists). Or something more exotic could happen: neutrinos could have a large magnetic moment and thus change their helicity while propagating in the Sun, turning into right-handed neutrinos that are sterile.

The solution to this is rather ingenious: measure the neutrino flux in two ways, one sensitive to neutrino flavor (using “charged current (CC) interactions”) and one insensitive to neutrino flavor (using “neutral current (NC) interactions”)! Choosing heavy water, which contains deuterium, is then ideal for this detection. This is exactly what the SNO collaboration, led by A. McDonald, did.

As it turned out, the NC flux was exactly what SSM predicted, while the CC flux was smaller. Hence the conclusion that electron neutrinos would oscillate into other types of neutrinos!

Another “deficit problem” was associated with the ratio of “atmospheric” muon and electron neutrinos. Cosmic rays hit Earth’s atmosphere and create pions that subsequently decay into muons and muon neutrinos. Muons would also eventually decay, mainly into an electron, a muon (anti)neutrino and an electron neutrino, as in the decay chain

$\displaystyle \pi^+ \rightarrow \mu^+ \nu_\mu, \qquad \mu^+ \rightarrow e^+ \nu_e \bar{\nu}_\mu$

(and similarly for the charge-conjugate chain). Counting the neutrinos in this chain, one would expect to have 2 muon-flavored neutrinos per one electron-flavored one.

This is not what the Super-K experiment (T. Kajita) saw: the ratio changed with angle. That is, the ratio of neutrino fluxes from above would differ substantially from the ratio from below (the latter describes neutrinos that went through the Earth before reaching the detector). The solution was again neutrino oscillations; this time, muon neutrinos oscillated into tau neutrinos.

The presence of neutrino oscillations implies that neutrinos have (tiny) masses, something that is not predicted by the minimal Standard Model. So one can say that this is the first indication of physics beyond the Standard Model. And this is very exciting.

I think it is interesting to note that this Nobel prize might help the situation with funding of US particle physics research (if anything can help…). It shows that physics has not ended with the discovery of the Higgs boson — and Fermilab might be on the right track to uncover other secrets of the Universe.

### Terence Tao — 275A, Notes 1: Integration and expectation

In Notes 0, we introduced the notion of a measure space ${\Omega = (\Omega, {\mathcal F}, \mu)}$, which includes as a special case the notion of a probability space. By selecting one such probability space ${(\Omega,{\mathcal F},\mu)}$ as a sample space, one obtains a model for random events and random variables, with random events ${E}$ being modeled by measurable sets ${E_\Omega}$ in ${{\mathcal F}}$, and random variables ${X}$ taking values in a measurable space ${R}$ being modeled by measurable functions ${X_\Omega: \Omega \rightarrow R}$. We then defined some basic operations on these random events and variables:

• Given events ${E,F}$, we defined the conjunction ${E \wedge F}$, the disjunction ${E \vee F}$, and the complement ${\overline{E}}$. For countable families ${E_1,E_2,\dots}$ of events, we similarly defined ${\bigwedge_{n=1}^\infty E_n}$ and ${\bigvee_{n=1}^\infty E_n}$. We also defined the empty event ${\emptyset}$ and the sure event ${\overline{\emptyset}}$, and what it meant for two events to be equal.
• Given random variables ${X_1,\dots,X_n}$ in ranges ${R_1,\dots,R_n}$ respectively, and a measurable function ${F: R_1 \times \dots \times R_n \rightarrow S}$, we defined the random variable ${F(X_1,\dots,X_n)}$ in range ${S}$. (As the special case ${n=0}$ of this, every deterministic element ${s}$ of ${S}$ was also a random variable taking values in ${S}$.) Given a relation ${P: R_1 \times \dots \times R_n \rightarrow \{\hbox{true}, \hbox{false}\}}$, we similarly defined the event ${P(X_1,\dots,X_n)}$. Conversely, given an event ${E}$, we defined the indicator random variable ${1_E}$. Finally, we defined what it meant for two random variables to be equal.
• Given an event ${E}$, we defined its probability ${{\bf P}(E)}$.

These operations obey various axioms; for instance, the boolean operations on events obey the axioms of a Boolean algebra, and the probability function ${E \mapsto {\bf P}(E)}$ obeys the Kolmogorov axioms. However, we will not focus on the axiomatic approach to probability theory here, instead basing the foundations of probability theory on the sample space models as discussed in Notes 0. (But see this previous post for a treatment of one such axiomatic approach.)

It turns out that almost all of the other operations on random events and variables we need can be constructed in terms of the above basic operations. In particular, this allows one to safely extend the sample space in probability theory whenever needed, provided one uses an extension that respects the above basic operations. We gave a simple example of such an extension in the previous notes, but now we give a more formal definition:

Definition 1 Suppose that we are using a probability space ${\Omega = (\Omega, {\mathcal F}, \mu)}$ as the model for a collection of events and random variables. An extension of this probability space is a probability space ${\Omega' = (\Omega', {\mathcal F}', \mu')}$, together with a measurable map ${\pi: \Omega' \rightarrow \Omega}$ (sometimes called the factor map) which is probability-preserving in the sense that

$\displaystyle \mu'( \pi^{-1}(E) ) = \mu(E) \ \ \ \ \ (1)$

for all ${E \in {\mathcal F}}$. (Caution: this does not imply that ${\mu(\pi(F)) = \mu'(F)}$ for all ${F \in {\mathcal F}'}$ – why not?)

An event ${E}$ which is modeled by a measurable subset ${E_\Omega}$ in the sample space ${\Omega}$, will be modeled by the measurable set ${E_{\Omega'} := \pi^{-1}(E_\Omega)}$ in the extended sample space ${\Omega'}$. Similarly, a random variable ${X}$ taking values in some range ${R}$ that is modeled by a measurable function ${X_\Omega: \Omega \rightarrow R}$ in ${\Omega}$, will be modeled instead by the measurable function ${X_{\Omega'} := X_\Omega \circ \pi}$ in ${\Omega'}$. We also allow the extension ${\Omega'}$ to model additional events and random variables that were not modeled by the original sample space ${\Omega}$ (indeed, this is one of the main reasons why we perform extensions in probability in the first place).

Thus, for instance, the sample space ${\Omega'}$ in Example 3 of the previous post is an extension of the sample space ${\Omega}$ in that example, with the factor map ${\pi: \Omega' \rightarrow \Omega}$ given by the first coordinate projection ${\pi(i,j) := i}$. One can verify that all of the basic operations on events and random variables listed above are unaffected by the above extension (with one caveat, see remark below). For instance, the conjunction ${E \wedge F}$ of two events can be defined via the original model ${\Omega}$ by the formula

$\displaystyle (E \wedge F)_\Omega := E_\Omega \cap F_\Omega$

or via the extension ${\Omega'}$ via the formula

$\displaystyle (E \wedge F)_{\Omega'} := E_{\Omega'} \cap F_{\Omega'}.$

The two definitions are consistent with each other, thanks to the obvious set-theoretic identity

$\displaystyle \pi^{-1}( E_\Omega \cap F_\Omega ) = \pi^{-1}(E_\Omega) \cap \pi^{-1}(F_\Omega).$

Similarly, the assumption (1) is precisely what is needed to ensure that the probability ${\mathop{\bf P}(E)}$ of an event remains unchanged when one replaces a sample space model with an extension. We leave the verification of preservation of the other basic operations described above under extension as exercises to the reader.

Remark 2 There is one minor exception to this general rule if we do not impose the additional requirement that the factor map ${\pi}$ is surjective. Namely, for non-surjective ${\pi}$, it can become possible that two events ${E, F}$ are unequal in the original sample space model, but become equal in the extension (and similarly for random variables), although the converse never happens (events that are equal in the original sample space always remain equal in the extension). For instance, let ${\Omega}$ be the discrete probability space ${\{a,b\}}$ with ${p_a=1}$ and ${p_b=0}$, and let ${\Omega'}$ be the discrete probability space ${\{ a'\}}$ with ${p'_{a'}=1}$, and non-surjective factor map ${\pi: \Omega' \rightarrow \Omega}$ defined by ${\pi(a') := a}$. Then the event modeled by ${\{b\}}$ in ${\Omega}$ is distinct from the empty event when viewed in ${\Omega}$, but becomes equal to that event when viewed in ${\Omega'}$. Thus we see that extending the sample space by a non-surjective factor map can identify previously distinct events together (though of course, being probability preserving, this can only happen if those two events were already almost surely equal anyway). This turns out to be fairly harmless though; while it is nice to know if two given events are equal, or if they differ by a non-null event, it is almost never useful to know that two events are unequal if they are already almost surely equal. Alternatively, one can add the additional requirement of surjectivity in the definition of an extension, which is also a fairly harmless constraint to impose (this is what I chose to do in this previous set of notes).

Roughly speaking, one can define probability theory as the study of those properties of random events and random variables that are model-independent in the sense that they are preserved by extensions. For instance, the cardinality ${|E_\Omega|}$ of the model ${E_\Omega}$ of an event ${E}$ is not a concept within the scope of probability theory, as it is not preserved by extensions: continuing Example 3 from Notes 0, the event ${E}$ that a die roll ${X}$ is even is modeled by a set ${E_\Omega = \{2,4,6\}}$ of cardinality ${3}$ in the original sample space model ${\Omega}$, but by a set ${E_{\Omega'} = \{2,4,6\} \times \{1,2,3,4,5,6\}}$ of cardinality ${18}$ in the extension. Thus it does not make sense in the context of probability theory to refer to the “cardinality of an event ${E}$”.
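The die-roll example can be checked by brute force. The following is only a sketch: it assumes the extension models two independent fair dice with the uniform measure on pairs (the measure on ${\Omega'}$ is not spelled out in this post), and the helper names `pi`, `pullback` and `prob` are ours.

```python
from fractions import Fraction
from itertools import product

# Original model: one fair die roll.
omega = [1, 2, 3, 4, 5, 6]
mu = {i: Fraction(1, 6) for i in omega}

# Extended model (assumed): pairs of rolls with the uniform measure.
omega_ext = list(product(omega, omega))
mu_ext = {w: Fraction(1, 36) for w in omega_ext}

def pi(w):
    """Factor map pi(i, j) := i from the extension back to the original space."""
    return w[0]

def pullback(event):
    """Model of an event in the extension: the preimage pi^{-1}(E)."""
    return {w for w in omega_ext if pi(w) in event}

def prob(event, measure):
    """Probability of an event, as a sum of point masses."""
    return sum(measure[w] for w in event)

even = {2, 4, 6}
large = {4, 5, 6}
even_ext = pullback(even)

# Cardinality is model-dependent; probability is not.
print(len(even), len(even_ext))
print(prob(even, mu), prob(even_ext, mu_ext))

# Conjunction commutes with the pullback.
print(pullback(even & large) == pullback(even) & pullback(large))
```

The cardinalities printed on the first line differ between the two models, while the probabilities on the second line agree, which is the sense in which probability is preserved by the extension and cardinality is not.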

On the other hand, the supremum ${\sup_n X_n}$ of a collection of random variables ${X_n}$ in the extended real line ${[-\infty,+\infty]}$ is a valid probabilistic concept. This can be seen by manually verifying that this operation is preserved under extension of the sample space, but one can also see this by defining the supremum in terms of existing basic operations. Indeed, note from Exercise 24 of Notes 0 that a random variable ${X}$ in the extended real line is completely specified by the threshold events ${(X \leq t)}$ for ${t \in {\bf R}}$; in particular, two such random variables ${X,Y}$ are equal if and only if the events ${(X \leq t)}$ and ${(Y \leq t)}$ are surely equal for all ${t}$. From the identity

$\displaystyle (\sup_n X_n \leq t) = \bigwedge_{n=1}^\infty (X_n \leq t)$

we thus see that one can completely specify ${\sup_n X_n}$ in terms of ${X_n}$ using only the basic operations provided in the above list (and in particular using the countable conjunction ${\bigwedge_{n=1}^\infty}$). Of course, the same considerations hold if one replaces supremum by infimum, limit superior, limit inferior, or (if it exists) the limit.

In this set of notes, we will define some further important operations on scalar random variables, in particular the expectation of these variables. In the sample space models, expectation corresponds to the notion of integration on a measure space. As we will need to use both expectation and integration in this course, we will thus begin by quickly reviewing the basics of integration on a measure space, although we will then translate the key results of this theory into probabilistic language.

As the finer details of the Lebesgue integral construction are not the core focus of this probability course, some of the details of this construction will be left to exercises. See also Chapter 1 of Durrett, or these previous blog notes, for a more detailed treatment.

— 1. Integration on measure spaces —

Let ${(\Omega, {\mathcal F}, \mu)}$ be a measure space, and let ${f}$ be a measurable function on ${\Omega}$, taking values either in the reals ${{\bf R}}$, the non-negative extended reals ${[0,+\infty]}$, the extended reals ${[-\infty,+\infty]}$, or the complex numbers ${{\bf C}}$. We would like to define the integral

$\displaystyle \int_\Omega f\ d\mu \ \ \ \ \ (2)$

of ${f}$ on ${\Omega}$. (One could make the integration variable explicit, e.g. by writing ${\int_\Omega f(\omega)\ d\mu(\omega)}$, but we will usually not do so here.) When integrating a reasonably nice function (e.g. a continuous function) on a reasonably nice domain (e.g. a box in ${{\bf R}^n}$), the Riemann integral that one learns about in undergraduate calculus classes suffices for this task; however, for the purposes of probability theory, we need the much more general notion of a Lebesgue integral in order to properly define (2) for the spaces ${\Omega}$ and functions ${f}$ we will need to study.

Not every measurable function can be integrated by the Lebesgue integral. There are two key classes of functions for which the integral exists and is well behaved:

• Unsigned measurable functions ${f: \Omega \rightarrow [0,+\infty]}$, that take values in the non-negative extended reals ${[0,+\infty]}$; and
• Absolutely integrable functions ${f: \Omega \rightarrow {\bf R}}$ or ${f: \Omega \rightarrow {\bf C}}$, which are scalar measurable functions whose absolute value ${|f|}$ has a finite integral: ${\int_\Omega |f|\ d\mu < \infty}$. (Sometimes we also allow absolutely integrable functions to attain an infinite value ${\infty}$, so long as they only do so on a set of measure zero.)

One could in principle extend the Lebesgue integral to slightly more general classes of functions, e.g. to sums of absolutely integrable functions and unsigned functions. However, the above two classes already suffice for most applications (and as a general rule of thumb, it is dangerous to apply the Lebesgue integral to functions that are not unsigned or absolutely integrable, unless you really know what you are doing).

We will construct the Lebesgue integral in the following four stages. First, we will define the Lebesgue integral just for unsigned simple functions – unsigned measurable functions that take on only finitely many values. Then, by a limiting procedure, we extend the Lebesgue integral to unsigned functions. After that, by decomposing a real absolutely integrable function into unsigned components, we extend the integral to real absolutely integrable functions. Finally, by taking real and imaginary parts, we extend to complex absolutely integrable functions. (This is not the only order in which one could perform this construction; for instance, in Durrett, one first constructs integration of bounded functions on finite measure support before passing to arbitrary unsigned functions.)

First consider an unsigned simple function ${f: \Omega \rightarrow [0,+\infty]}$; thus ${f}$ is measurable and takes only finitely many values. Then we can express ${f}$ as a finite linear combination (in ${[0,+\infty]}$) of indicator functions. Indeed, if we enumerate the values that ${f}$ takes as ${a_1,\dots,a_n \in [0,+\infty]}$ (avoiding repetitions) and set ${E_i := \{ \omega \in \Omega: f(\omega) = a_i \}}$ for ${i=1,\dots,n}$, then it is clear that

$\displaystyle f = \sum_{i=1}^n a_i 1_{E_i}.$

(It should be noted at this point that the operations of addition and multiplication on ${[0,+\infty]}$ are defined by setting ${+\infty + a = a + \infty = +\infty}$ for all ${a \in [0,+\infty]}$, and ${a \cdot +\infty = +\infty \cdot a = +\infty}$ for all positive ${a \in (0,+\infty]}$, but that ${0 \cdot +\infty = +\infty \cdot 0}$ is defined to equal ${0}$. To put it another way, multiplication is defined to be continuous from below, rather than from above: ${a \cdot b = \lim_{x \rightarrow a^-, y \rightarrow b^-} x \cdot y}$. One can verify that the commutative, associative, and distributive laws continue to hold on ${[0,+\infty]}$, but we caution that the cancellation laws do not hold when ${+\infty}$ is involved.)
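These conventions differ from IEEE floating-point arithmetic, where the product of zero and infinity is not a number. A minimal sketch of the arithmetic on ${[0,+\infty]}$, with hypothetical helper names `ext_add` and `ext_mul` and assuming non-negative inputs only:

```python
import math

INF = math.inf

def ext_add(a, b):
    """Addition on [0, +inf]; float addition already gives inf + a = inf."""
    return a + b

def ext_mul(a, b):
    """Multiplication on [0, +inf] with the convention 0 * (+inf) = 0."""
    if a == 0 or b == 0:
        return 0.0
    return a * b

print(ext_mul(0, INF), math.isnan(0 * INF))  # the convention vs. plain floats
print(ext_mul(2, INF))

# Cancellation fails: 1 + inf == 2 + inf even though 1 != 2.
print(ext_add(1, INF) == ext_add(2, INF))
```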

Conversely, given any coefficients ${a_1,\dots,a_n \in [0,+\infty]}$ (not necessarily distinct) and measurable sets ${E_1,\dots,E_n}$ in ${{\mathcal F}}$ (not necessarily disjoint), the sum ${\sum_{i=1}^n a_i 1_{E_i}}$ is an unsigned simple function.

A single simple function can be decomposed in multiple ways as a linear combination of indicator functions. For instance, on the real line ${{\bf R}}$, the function ${2 \times 1_{[0,1)} + 1 \times 1_{[1,3)}}$ can also be written as ${1 \times 1_{[0,1)} + 1 \times 1_{[0,3)}}$ or as ${2 \times 1_{[0,1)} + 1 \times 1_{[1,2)} + 1 \times 1_{[2,3)}}$. However, there is an invariant of all these decompositions:

Exercise 3 Suppose that an unsigned simple function ${f}$ has two representations as the linear combination of indicator functions:

$\displaystyle f = \sum_{i=1}^n a_i 1_{E_i} = \sum_{j=1}^m b_j 1_{F_j},$

where ${n,m}$ are nonnegative integers, ${a_1,\dots,a_n,b_1,\dots,b_m}$ lie in ${[0,+\infty]}$, and ${E_1,\dots,E_n,F_1,\dots,F_m}$ are measurable sets. Show that

$\displaystyle \sum_{i=1}^n a_i \mu(E_i) = \sum_{j=1}^m b_j \mu(F_j).$

(Hint: first handle the special case where the ${F_j}$ are all disjoint and non-empty, and each of the ${E_i}$ is expressible as the union of some subcollection of the ${F_j}$. Then handle the general case by considering the atoms of the finite boolean algebra generated by ${E_i}$ and ${F_j}$.)

We capture this invariant by introducing the simple integral ${\hbox{Simp} \int_{\Omega} f\ d\mu}$ of an unsigned simple function by the formula

$\displaystyle \hbox{Simp} \int_\Omega f\ d\mu := \sum_{i=1}^n a_i \mu(E_i)$

whenever ${f}$ admits a decomposition ${f = \sum_{i=1}^n a_i 1_{E_i}}$. The above exercise is then precisely the assertion that the simple integral is well-defined as an element of ${[0,+\infty]}$.
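To see the invariance concretely, here is a small sketch that evaluates ${\sum_i a_i \mu(E_i)}$ for three representations of the function equal to ${2}$ on ${[0,1)}$ and ${1}$ on ${[1,3)}$, taking ${\mu}$ to be Lebesgue measure on the real line. It assumes finite coefficients and half-open intervals only, and the function names are ours.

```python
from fractions import Fraction

def length(interval):
    """Lebesgue measure of a half-open interval [a, b)."""
    a, b = interval
    return max(Fraction(0), Fraction(b) - Fraction(a))

def simple_integral(terms):
    """Sum of a_i * m(E_i) over terms (a_i, E_i)."""
    return sum(Fraction(a) * length(E) for a, E in terms)

def evaluate(terms, x):
    """Value at x of the simple function sum_i a_i 1_{E_i}."""
    return sum(a for a, (lo, hi) in terms if lo <= x < hi)

rep1 = [(2, (0, 1)), (1, (1, 3))]
rep2 = [(1, (0, 1)), (1, (0, 3))]
rep3 = [(2, (0, 1)), (1, (1, 2)), (1, (2, 3))]

# The three representations agree as functions on a grid of sample points...
samples = [Fraction(k, 4) for k in range(-2, 15)]
same_function = all(
    evaluate(rep1, x) == evaluate(rep2, x) == evaluate(rep3, x) for x in samples
)

# ...and the three sums sum_i a_i m(E_i) coincide.
integrals = [simple_integral(r) for r in (rep1, rep2, rep3)]
print(same_function, integrals)
```

The pointwise check on a grid is of course not a proof that the representations agree everywhere; it is only meant to catch a mistyped interval.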

Exercise 4 Let ${f, g: \Omega \rightarrow [0,+\infty]}$ be unsigned simple functions, and let ${c \in [0,+\infty]}$.

• (i) (Linearity) Show that

$\displaystyle \hbox{Simp} \int_\Omega f+g\ d\mu = \hbox{Simp} \int_\Omega f\ d\mu + \hbox{Simp} \int_\Omega g\ d\mu$

and

$\displaystyle \hbox{Simp} \int_\Omega cf\ d\mu = c \hbox{Simp} \int_\Omega f\ d\mu.$

• (ii) Show that if ${f}$ and ${g}$ are equal almost everywhere, then

$\displaystyle \hbox{Simp} \int_\Omega f\ d\mu = \hbox{Simp} \int_\Omega g\ d\mu.$

• (iii) Show that ${\hbox{Simp} \int_\Omega f\ d\mu \geq 0}$, with equality if and only if ${f}$ is zero almost everywhere.
• (iv) (Monotonicity) If ${f \leq g}$ almost everywhere, show that ${\hbox{Simp} \int_\Omega f\ d\mu \leq \hbox{Simp} \int_\Omega g\ d\mu}$.
• (v) (Markov inequality) Show that ${\mu( \{ \omega: f(\omega) \geq t \} ) \leq \frac{1}{t} \hbox{Simp} \int_\Omega f\ d\mu}$ for any ${0 < t < \infty}$.

Now we extend from unsigned simple functions to more general unsigned functions. If ${f: \Omega \rightarrow [0,+\infty]}$ is an unsigned measurable function, we define the unsigned integral ${\int_\Omega f\ d\mu}$ as

$\displaystyle \int_\Omega f\ d\mu = \sup_{0 \leq g \leq f} \hbox{Simp} \int_\Omega g\ d\mu \ \ \ \ \ (3)$

where the supremum is over all unsigned simple functions such that ${0 \leq g(\omega) \leq f(\omega)}$ for all ${\omega \in \Omega}$.
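The supremum in (3) is over all simple minorants, which cannot be enumerated, but one can watch a particular increasing sequence of them. As an illustrative sketch (our choice of example, not from the notes), take ${f(x) = x^2}$ on ${[0,1]}$ with Lebesgue measure and the simple functions ${g_k := \lfloor k f \rfloor / k}$, which take the value ${j/k}$ on the set ${\{ j/k \leq f < (j+1)/k \} = [\sqrt{j/k}, \sqrt{(j+1)/k})}$.

```python
import math

def lower_simple_integral(k):
    """Simple integral of g_k = floor(k * f) / k for f(x) = x**2 on [0, 1]."""
    total = 0.0
    for j in range(k):
        left = math.sqrt(j / k)         # where f first reaches j/k
        right = math.sqrt((j + 1) / k)  # where f first reaches (j+1)/k
        total += (j / k) * (right - left)
    return total

for k in (1, 2, 4, 8, 1024):
    print(k, lower_simple_integral(k))
```

Along dyadic ${k}$ the values increase, and since ${f - 1/k \leq g_k \leq f}$ they stay within ${1/k}$ of the value ${1/3}$ that one gets from calculus.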

Many of the properties of the simple integral carry over to the unsigned integral easily:

Exercise 5 Let ${f, g: \Omega \rightarrow [0,+\infty]}$ be unsigned functions, and let ${c \in [0,+\infty]}$.

• (i) (Superadditivity) Show that

$\displaystyle \int_\Omega f+g\ d\mu \geq \int_\Omega f\ d\mu + \int_\Omega g\ d\mu$

and

$\displaystyle \int_\Omega cf\ d\mu = c \int_\Omega f\ d\mu.$

• (ii) Show that if ${f}$ and ${g}$ are equal almost everywhere, then

$\displaystyle \int_\Omega f\ d\mu = \int_\Omega g\ d\mu.$

• (iii) Show that ${\int_\Omega f\ d\mu \geq 0}$, with equality if and only if ${f}$ is zero almost everywhere.
• (iv) (Monotonicity) If ${f \leq g}$ almost everywhere, show that ${\int_\Omega f\ d\mu \leq \int_\Omega g\ d\mu}$.
• (v) (Markov inequality) Show that ${\mu( \{ \omega: f(\omega) \geq t \} ) \leq \frac{1}{t} \int_\Omega f\ d\mu}$ for any ${0 < t < \infty}$. In particular, if ${\int_\Omega f\ d\mu < \infty}$, then ${f}$ is finite almost everywhere.
• (vi) (Compatibility with simple integral) If ${f}$ is simple, show that ${\int_\Omega f\ d\mu = \hbox{Simp} \int_\Omega f\ d\mu}$.
• (vii) (Compatibility with measure) For any measurable set ${E}$, show that ${\int_\Omega 1_E\ d\mu = \mu(E)}$.

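Exercise 6 If ${\Omega}$ is a discrete probability space with probabilities ${(p_\omega)_{\omega \in \Omega}}$ and associated measure ${\mu}$, and ${f: \Omega \rightarrow [0,+\infty]}$ is a function, show that

$\displaystyle \int_\Omega f\ d\mu = \sum_{\omega \in\Omega} f(\omega) p_\omega.$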

(Note that the condition ${\sum_{\omega \in \Omega} p_\omega = 1}$ in the definition of a discrete probability space is not required to prove this identity.)
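On a finite discrete space this formula makes the unsigned integral directly computable, so properties such as the Markov inequality can be checked numerically. The space (a fair die) and the function ${f(\omega) = \omega^2}$ below are hypothetical choices for illustration.

```python
from fractions import Fraction

# Hypothetical discrete probability space: a fair six-sided die.
p = {w: Fraction(1, 6) for w in range(1, 7)}

def integral(f):
    """Unsigned integral on a discrete space: sum of f(w) * p_w."""
    return sum(f(w) * p[w] for w in p)

def measure(event):
    """Measure of the set {w : event(w)}."""
    return sum(p[w] for w in p if event(w))

def f(w):
    return w * w

total = integral(f)

# Markov inequality: mu({f >= t}) <= (1/t) * integral of f.
for t in (1, 4, 10, 20, 36, 50):
    lhs = measure(lambda w: f(w) >= t)
    print(t, lhs, total / t, lhs <= total / t)

# Compatibility with measure: the integral of an indicator is the measure.
print(integral(lambda w: 1 if w % 2 == 0 else 0) == measure(lambda w: w % 2 == 0))
```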

The observant reader will notice that the linearity property of simple functions has been weakened to superadditivity. This can be traced back to a breakdown of symmetry in the definition (3); the unsigned integral of ${f}$ is defined via approximation from below, but not from above. Indeed the opposite claim

$\displaystyle \int_\Omega f\ d\mu \stackrel{?}{=} \inf_{g \geq f} \hbox{Simp} \int_\Omega g\ d\mu \ \ \ \ \ (4)$

can fail. For a counterexample, take ${\Omega}$ to be the discrete probability space ${\{1,2,3,\dots\}}$ with probabilities ${p_n := 2^{-n}}$, and let ${f: \Omega \rightarrow [0,+\infty]}$ be the function ${f(n)=n}$. By Exercise 6 we have ${\int_\Omega f\ d\mu = \sum_{n=1}^\infty n2^{-n} = 2}$. On the other hand, any simple function ${g}$ with ${g \geq f}$ must equal ${+\infty}$ on a set of positive measure (why?) and so the right-hand side of (4) is infinite. However, one can get around this difficulty under some further assumptions on ${f}$, and thus recover full linearity for the unsigned integral:

Exercise 7 (Linearity of the unsigned integral) Let ${(\Omega, {\mathcal F}, \mu)}$ be a measure space.

• (i) Let ${f: \Omega \rightarrow [0,+\infty]}$ be an unsigned measurable function which is both bounded (i.e., there is a finite ${M}$ such that ${|f(\omega)| \leq M}$ for all ${\omega \in \Omega}$) and has finite measure support (i.e., there is a measurable set ${E}$ with ${\mu(E) < \infty}$ such that ${f(\omega)=0}$ for all ${\omega \in \Omega \backslash E}$). Show that (4) holds for this function ${f}$.
• (ii) Establish the additivity property

$\displaystyle \int_\Omega f+g\ d\mu =\int_\Omega f\ d\mu + \int_\Omega g\ d\mu$

whenever ${f, g: \Omega \rightarrow [0,+\infty]}$ are unsigned measurable functions that are bounded with finite measure support.

• (iii) Show that

$\displaystyle \int_\Omega \min(f,n)\ d\mu \rightarrow \int_\Omega f\ d\mu$

as ${n \rightarrow \infty}$ whenever ${f: \Omega \rightarrow [0,+\infty]}$ is unsigned measurable.

• (iv) Using (iii), extend (ii) to the case where ${f,g}$ are unsigned measurable functions with finite measure support, but are not necessarily bounded.
• (v) Show that

$\displaystyle \int_\Omega f 1_{f \geq 1/n}\ d\mu \rightarrow \int_\Omega f\ d\mu$

as ${n \rightarrow \infty}$ whenever ${f: \Omega \rightarrow [0,+\infty]}$ is unsigned measurable.

• (vi) Using (iii) and (v), show that (ii) holds for any unsigned measurable ${f,g}$ (which are not necessarily bounded or of finite measure support).
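Part (iii) can be watched in action on the counterexample space ${\{1,2,3,\dots\}}$ with ${p_n = 2^{-n}}$ and ${f(n) = n}$. The sketch below computes ${\int_\Omega \min(f,N)\ d\mu}$ exactly, using the fact that ${\min(f,N) = N}$ on the tail ${\{N+1, N+2, \dots\}}$, which has total mass ${2^{-N}}$; the function name is ours.

```python
from fractions import Fraction

def truncated_integral(N):
    """Integral of min(f, N) for f(n) = n on {1, 2, ...} with p_n = 2**-n."""
    # Head: the points 1..N, where min(f, N) = n.
    head = sum(Fraction(n, 2**n) for n in range(1, N + 1))
    # Tail: min(f, N) = N on {N+1, N+2, ...}, a set of mass 2**-N.
    tail = N * Fraction(1, 2**N)
    return head + tail

for N in (1, 2, 5, 10, 20):
    value = truncated_integral(N)
    print(N, value, float(value))
```

The truncated integrals increase to ${2}$, in agreement with ${\sum_{n=1}^\infty n 2^{-n} = 2}$; in fact they equal ${2 - 2^{1-N}}$ exactly.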

Next, we apply the integral to absolutely integrable functions. We call a scalar function ${f: \Omega \rightarrow {\bf R}}$ or ${f: \Omega \rightarrow {\bf C}}$ absolutely integrable if it is measurable and the unsigned integral ${\int_\Omega |f|\ d\mu}$ is finite. A real-valued absolutely integrable function ${f}$ can be expressed as the difference ${f = f_1 - f_2}$ of two unsigned absolutely integrable functions ${f_1,f_2}$; indeed, one can check that the choice ${f_1 := \max(f,0)}$ and ${f_2 := \max(-f,0)}$ works for this. Conversely, any difference ${f_1 - f_2}$ of unsigned absolutely integrable functions ${f_1,f_2}$ is absolutely integrable (this follows from the triangle inequality ${|f_1-f_2| \leq |f_1|+|f_2|}$). A single absolutely integrable function ${f}$ may be written as a difference ${f_1-f_2}$ of unsigned absolutely integrable functions in more than one way, for instance we might have

$\displaystyle f = f_1 - f_2 = g_1 - g_2$

for unsigned absolutely integrable functions ${f_1,f_2,g_1,g_2}$. But when this happens, we can rearrange to obtain

$\displaystyle f_1 + g_2 = g_1 + f_2$

and thus by linearity of the unsigned integral

$\displaystyle \int_\Omega f_1\ d\mu + \int_\Omega g_2\ d\mu = \int_\Omega g_1\ d\mu + \int_\Omega f_2\ d\mu.$

By the absolute integrability of ${f_1,f_2,g_1,g_2}$, all the integrals are finite, so we may rearrange this identity as

$\displaystyle \int_\Omega f_1\ d\mu - \int_\Omega f_2\ d\mu = \int_\Omega g_1\ d\mu - \int_\Omega g_2\ d\mu.$

This allows us to define the Lebesgue integral ${\int_\Omega f\ d\mu \in {\bf R}}$ of a real-valued absolutely integrable function ${f}$ to be the expression

$\displaystyle \int_\Omega f\ d\mu := \int_\Omega f_1\ d\mu - \int_\Omega f_2\ d\mu$

for any given decomposition ${f = f_1-f_2}$ of ${f}$ as the difference of two unsigned absolutely integrable functions. Note that if ${f}$ is both unsigned and absolutely integrable, then the unsigned integral and the Lebesgue integral of ${f}$ agree (as can be seen by using the decomposition ${f = f - 0}$), and so there is no ambiguity in using the same notation ${\int_\Omega f\ d\mu}$ to denote both integrals. (By the same token, we may now drop the modifier ${\hbox{Simp}}$ from the simple integral of a simple unsigned ${f}$, which we may now also denote by ${\int_\Omega f\ d\mu}$.)
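The independence of the decomposition is easy to test on a finite space. In this sketch the four-point uniform space, the function ${f}$, and the auxiliary unsigned function `h` are all hypothetical; the second decomposition is obtained by adding `h` to both parts of the first.

```python
from fractions import Fraction

# Hypothetical uniform four-point probability space and a signed function on it.
p = {w: Fraction(1, 4) for w in "abcd"}
f = {"a": 3, "b": -1, "c": 0, "d": -1}

def unsigned_integral(g):
    """Integral of an unsigned function on the discrete space."""
    assert all(v >= 0 for v in g.values())
    return sum(g[w] * p[w] for w in p)

# Canonical decomposition into positive and negative parts.
f1 = {w: max(f[w], 0) for w in p}
f2 = {w: max(-f[w], 0) for w in p}

# A second decomposition f = g1 - g2, shifting both parts by an unsigned h.
h = {"a": 5, "b": 0, "c": 7, "d": 1}
g1 = {w: f1[w] + h[w] for w in p}
g2 = {w: f2[w] + h[w] for w in p}

integral_via_f = unsigned_integral(f1) - unsigned_integral(f2)
integral_via_g = unsigned_integral(g1) - unsigned_integral(g2)
print(integral_via_f, integral_via_g)
```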

The Lebesgue integral also enjoys good linearity properties:

Exercise 8 Let ${f, g: \Omega \rightarrow {\bf R}}$ be real-valued absolutely integrable functions, and let ${c \in {\bf R}}$.

• (i) (Linearity) Show that ${f+g}$ and ${cf}$ are also real-valued absolutely integrable functions, with

$\displaystyle \int_\Omega f+g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu$

and

$\displaystyle \int_\Omega cf\ d\mu = c \int_\Omega f\ d\mu.$

(For the second relation, one may wish to first treat the special cases ${c>0}$ and ${c=-1}$.)

• (ii) Show that if ${f}$ and ${g}$ are equal almost everywhere, then

$\displaystyle \int_\Omega f\ d\mu = \int_\Omega g\ d\mu.$

• (iii) Show that ${\int_\Omega |f|\ d\mu \geq 0}$, with equality if and only if ${f}$ is zero almost everywhere.
• (iv) (Monotonicity) If ${f \leq g}$ almost everywhere, show that ${\int_\Omega f\ d\mu \leq\int_\Omega g\ d\mu}$.
• (v) (Markov inequality) Show that ${\mu( \{ \omega: |f(\omega)| \geq t \} ) \leq \frac{1}{t} \int_\Omega |f|\ d\mu}$ for any ${0 < t < \infty}$.

Because of part (ii) of the above exercise, we can extend the Lebesgue integral to real-valued absolutely integrable functions that are only defined and real-valued almost everywhere, rather than everywhere. In particular, we can apply the Lebesgue integral to functions that are sometimes infinite, so long as they are only infinite on a set of measure zero, and the function is absolutely integrable everywhere else.

Finally, we extend to complex-valued functions. If ${f: \Omega \rightarrow {\bf C}}$ is absolutely integrable, observe that the real and imaginary parts ${\hbox{Re} f, \hbox{Im} f: \Omega \rightarrow {\bf R}}$ are also absolutely integrable (because ${|\hbox{Re} f|, |\hbox{Im} f| \leq |f|}$). We then define the (complex) Lebesgue integral ${\int_\Omega f\ d\mu \in {\bf C}}$ of ${f}$ in terms of the real Lebesgue integral by the formula

$\displaystyle \int_\Omega f\ d\mu := \int_\Omega \hbox{Re}(f)\ d\mu + i \int_\Omega \hbox{Im}(f)\ d\mu.$

Clearly, if ${f}$ is real-valued and absolutely integrable, then the real Lebesgue integral and the complex Lebesgue integral of ${f}$ coincide, so it does not create ambiguity to use the same symbol ${\int_\Omega f\ d\mu}$ for both concepts. It is routine to extend the linearity properties of the real Lebesgue integral to its complex counterpart:

Exercise 9 Let ${f, g: \Omega \rightarrow {\bf C}}$ be complex-valued absolutely integrable functions, and let ${c \in {\bf C}}$.

• (i) (Linearity) Show that ${f+g}$ and ${cf}$ are also complex-valued absolutely integrable functions, with

$\displaystyle \int_\Omega f+g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu$

and

$\displaystyle \int_\Omega cf\ d\mu = c \int_\Omega f\ d\mu.$

(For the second relation, one may wish to first treat the special cases ${c \in {\bf R}}$ and ${c = i}$.)

• (ii) Show that if ${f}$ and ${g}$ are equal almost everywhere, then

$\displaystyle \int_\Omega f\ d\mu = \int_\Omega g\ d\mu.$

• (iii) Show that ${\int_\Omega |f|\ d\mu \geq 0}$, with equality if and only if ${f}$ is zero almost everywhere.
• (iv) (Markov inequality) Show that ${\mu( \{ \omega: |f(\omega)| \geq t \} ) \leq \frac{1}{t} \int_\Omega |f|\ d\mu}$ for any ${0 < t < \infty}$.

We record a simple, but incredibly fundamental, inequality concerning the Lebesgue integral:

Lemma 10 (Triangle inequality) If ${f: \Omega \rightarrow {\bf C}}$ is a complex-valued absolutely integrable function, then

$\displaystyle |\int_\Omega f\ d\mu| \leq \int_\Omega |f|\ d\mu.$

Proof: We have

$\displaystyle \hbox{Re} \int_\Omega f\ d\mu = \int_\Omega \hbox{Re} f\ d\mu$

$\displaystyle \leq \int_\Omega |f|\ d\mu.$

This looks weaker than what we want to prove, but we can “amplify” this inequality to the full strength triangle inequality as follows. Replacing ${f}$ by ${e^{i\theta} f}$ for any real ${\theta}$, we have

$\displaystyle \hbox{Re} e^{i\theta} \int_\Omega f\ d\mu \leq \int_\Omega |f|\ d\mu.$

Since we can choose the phase ${e^{i\theta}}$ to make the expression ${e^{i\theta} \int_\Omega f\ d\mu}$ equal to ${|\int_\Omega f\ d\mu|}$, the claim follows. $\Box$
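The amplification step can be checked numerically on a toy example. The following is a minimal sketch, assuming a hypothetical finite measure space with three points (the weights `mu` and values `f` are made up for illustration); it picks the phase ${\theta}$ that rotates ${\int_\Omega f\ d\mu}$ onto the non-negative real axis and compares the result with ${\int_\Omega |f|\ d\mu}$.

```python
import cmath

# Hypothetical finite measure space: three points with weights mu[i],
# and a complex-valued function taking the value f[i] at the i-th point.
mu = [0.5, 1.5, 2.0]
f = [1 + 2j, -3 + 0.5j, 0.25 - 1j]

integral_f = sum(m * z for m, z in zip(mu, f))           # integral of f
integral_abs_f = sum(m * abs(z) for m, z in zip(mu, f))  # integral of |f|

# Choose theta so that e^{i theta} * integral_f is real and non-negative.
theta = -cmath.phase(integral_f)
rotated = cmath.exp(1j * theta) * integral_f

print(abs(integral_f), rotated.real, integral_abs_f)
```

With this choice of ${\theta}$, the real part of the rotated integral agrees with ${|\int_\Omega f\ d\mu|}$ up to rounding, which is exactly what the last line of the proof uses.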

Finally, we observe that the Lebesgue integral extends the Riemann integral, which is particularly useful when it comes to actually computing some of these integrals:

Exercise 11 If ${f: [a,b] \rightarrow {\bf C}}$ is a Riemann integrable function on a compact interval ${[a,b]}$, show that ${f}$ is also absolutely integrable, and that the Lebesgue integral ${\int_{[a,b]} f\ dm}$ (with ${m}$ Lebesgue measure restricted to ${[a,b]}$) coincides with the Riemann integral ${\int_a^b f(x)\ dx}$. Similarly if ${f}$ is Riemann integrable on a box ${[a_1,b_1] \times \dots \times [a_n,b_n]}$.

— 2. Expectation of random variables —

We now translate the above notions of integration on measure spaces to the probabilistic setting.

A random variable ${X}$ taking values in the unsigned extended real line ${[0,+\infty]}$ is said to be simple if it takes on at most finitely many values. Equivalently, ${X}$ can be expressed as a finite unsigned linear combination

$\displaystyle X = \sum_{i=1}^n a_i 1_{E_i}$

of indicator random variables, where ${a_1,\dots,a_n \in [0,+\infty]}$ are unsigned and ${E_i}$ are events. We then define the simple expectation ${\hbox{Simp} {\bf E} X \in [0,+\infty]}$ of ${X}$ to be the quantity

$\displaystyle \hbox{Simp} {\bf E} X := \sum_{i=1}^n a_i {\bf P}(E_i),$

and one checks that this definition is independent of the choice of decomposition of ${X}$ into indicator functions. Observe that if we model the random variable ${X}$ using a probability space ${\Omega}$, then the simple expectation of ${X}$ is precisely the simple integral of the corresponding unsigned simple function ${X_\Omega}$.
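To see the independence of decomposition in a concrete case, here is a small sketch, assuming a hypothetical model given by one roll of a fair die. The two decompositions `decomp_a` and `decomp_b` are illustrative choices that represent the same simple random variable, once with overlapping events and once with disjoint level sets.

```python
from fractions import Fraction

# Hypothetical model: a fair six-sided die, each outcome with probability 1/6.
omega = range(1, 7)

def prob(event):
    return Fraction(len(event), 6)

def simple_expectation(decomp):
    # decomp is a list of pairs (a_i, E_i); returns sum_i a_i P(E_i).
    return sum(a * prob(E) for a, E in decomp)

def evaluate(decomp, w):
    # Value at outcome w of the random variable sum_i a_i 1_{E_i}.
    return sum(a for a, E in decomp if w in E)

decomp_a = [(2, {1, 2, 3, 4}), (3, {3, 4, 5, 6})]   # overlapping events
decomp_b = [(2, {1, 2}), (5, {3, 4}), (3, {5, 6})]  # disjoint level sets

print(simple_expectation(decomp_a), simple_expectation(decomp_b))
```

Both decompositions define the same function on the six outcomes, and the two sums ${\sum_i a_i {\bf P}(E_i)}$ agree.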

Next, given an arbitrary unsigned random variable ${X}$ taking values in ${[0,+\infty]}$, one defines its (unsigned) expectation ${{\bf E} X \in [0,+\infty]}$ as

$\displaystyle {\bf E} X := \sup_{0\leq Y \leq X} \hbox{Simp} {\bf E} Y$

where ${Y}$ ranges over all simple unsigned random variables such that ${0 \leq Y \leq X}$ is surely true. This extends the simple expectation (thus ${{\bf E} X = \hbox{Simp} {\bf E} X}$ for all simple unsigned ${X}$), and in terms of a probability space model ${\Omega}$, the expectation ${{\bf E} X}$ is precisely the unsigned integral of ${X_\Omega}$.

A scalar random variable ${X}$ is said to be absolutely integrable if ${{\bf E} |X| < \infty}$, thus for instance any bounded random variable is absolutely integrable. If ${X}$ is real-valued and absolutely integrable, we define its expectation by the formula

$\displaystyle {\bf E} X := {\bf E} X_1 - {\bf E} X_2$

where ${X = X_1 - X_2}$ is any representation of ${X}$ as the difference of unsigned absolutely integrable random variables ${X_1,X_2}$; one can check that this definition does not depend on the choice of representation and is thus well-defined. For complex-valued absolutely integrable ${X}$, we then define

$\displaystyle {\bf E} X := {\bf E} \hbox{Re} X + i {\bf E} \hbox{Im} X.$

In all of these cases, the expectation of ${X}$ is equal to the integral of the representation ${X_\Omega}$ of ${X}$ in any probability space model; in the case that ${\Omega}$ is given by a discrete probability model, one can check that this definition of expectation agrees with the one given in Notes 0. Using the former fact, we can translate the properties of integration already established to the probabilistic setting:

Proposition 12

• (i) (Unsigned linearity) If ${X,Y}$ are unsigned random variables, and ${c}$ is a deterministic unsigned quantity, then ${{\bf E}(X+Y) = {\bf E} X + {\bf E} Y}$ and ${{\bf E} cX = c {\bf E} X}$. (Note that these identities hold even when ${X,Y}$ are not absolutely integrable.)
• (ii) (Complex linearity) If ${X,Y}$ are absolutely integrable random variables, and ${c}$ is a deterministic complex quantity, then ${X+Y}$ and ${cX}$ are also absolutely integrable, with ${{\bf E}(X+Y) = {\bf E} X + {\bf E} Y}$ and ${{\bf E} cX = c {\bf E} X}$.
• (iii) (Compatibility with probability) If ${E}$ is an event, then ${{\bf E} 1_E = {\bf P}(E)}$. In particular, ${{\bf E} 1 = 1}$.
• (iv) (Almost sure equivalence) If ${X, Y}$ are unsigned (resp. absolutely integrable) and ${X=Y}$ almost surely, then ${{\bf E} X = {\bf E} Y}$.
• (v) If ${X}$ is unsigned or absolutely integrable, then ${{\bf E} |X| \geq 0}$, with equality if and only if ${X=0}$ almost surely.
• (vi) (Monotonicity) If ${X,Y}$ are unsigned or real-valued absolutely integrable, and ${X \leq Y}$ almost surely, then ${{\bf E} X \leq {\bf E} Y}$.
• (vii) (Markov inequality) If ${X}$ is unsigned or absolutely integrable, then ${{\bf P}(|X| \geq t) \leq \frac{1}{t} {\bf E} |X|}$ for any deterministic ${t>0}$.
• (viii) (Triangle inequality) If ${X}$ is absolutely integrable, then ${|{\bf E} X| \leq {\bf E} |X|}$.
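As a sanity check on parts (vii) and (viii), one can verify the Markov and triangle inequalities exactly on a small discrete model. The sketch below assumes a hypothetical sample space ${\{0,\dots,9\}}$ with the uniform probability measure and the made-up signed random variable ${X(\omega) = \omega^2 - 20}$; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

# Hypothetical discrete model: omega uniform on {0,...,9}, X(omega) = omega^2 - 20.
omega = range(10)
p = Fraction(1, 10)

def X(w):
    return w * w - 20

def expectation(Z):
    return sum(p * Z(w) for w in omega)

def tail_probability(t):
    # P(|X| >= t)
    return sum(p for w in omega if abs(X(w)) >= t)

mean_X = expectation(X)                        # E X
mean_abs_X = expectation(lambda w: abs(X(w)))  # E |X|

# Compare P(|X| >= t) with the Markov bound E|X| / t for a few thresholds.
for t in (1, 5, 20, 50):
    print(t, tail_probability(t), mean_abs_X / t)
```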

As before, we can use part (iv) to define expectation of scalar random variables ${X}$ that are only defined and finite almost surely, rather than surely.

Note that we have built the notion of expectation (and of related notions, such as absolute integrability) out of notions that were already probabilistic in nature, in the sense that they were unaffected if one replaced the underlying probabilistic model with an extension. Therefore, the notion of expectation is automatically probabilistic in the same sense. Because of this, we will be easily able to manipulate expectations of random variables without having to explicitly mention an underlying probability space ${\Omega}$, and so one will now see such spaces fade from view starting from this point in the course.

— 3. Exchanging limits with integrals or expectations —

When performing analysis on measure spaces, it is important to know if one can interchange a limit with an integral:

$\displaystyle \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu \stackrel{?}{=} \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.$

Similarly, in probability theory, we often wish to interchange a limit with an expectation:

$\displaystyle {\bf E} \lim_{n \rightarrow \infty} X_n \stackrel{?}{=} \lim_{n \rightarrow \infty} {\bf E} X_n.$

Of course, one needs the integrands or random variables to be either unsigned or absolutely integrable, and the limits to be well-defined to have any hope of doing this. Naively, one could hope that limits and integrals could always be exchanged when the expressions involved are well-defined, but this is unfortunately not the case. In the case of integration on, say, the real line ${{\bf R}}$ using Lebesgue measure ${m}$, we already see four key examples:

• (Moving bump example) Take ${f_n := 1_{[n,n+1]}}$. Then ${\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}$, but ${\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = 1}$.
• (Spreading bump example) Take ${f_n := \frac{1}{n} 1_{[0,n]}}$. Then ${\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}$, but ${\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = 1}$.
• (Concentrating bump example) Take ${f_n := n 1_{[0,1/n]}}$. Then ${\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}$, but ${\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = 1}$.
• (Receding infinity example) Take ${f_n := 1_{[n,\infty)}}$. Then ${\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}$, but ${\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = +\infty}$.

In all these examples, the limit of the integral exceeds the integral of the limit; by replacing ${f_n}$ with ${-f_n}$ in the first three examples (which involve absolutely integrable functions) one can also build examples where the limit of the integral is less than the integral of the limit. Most of these examples rely on the infinite measure of the real line and thus do not directly have probabilistic analogues, but the concentrating bump example involves functions that are all supported on the unit interval ${[0,1]}$ and thus also poses a problem in the probabilistic setting.
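The four examples are easy to tabulate. The following sketch encodes each ${f_n}$ as a height times the indicator of an interval (an encoding chosen here purely for illustration), so that the integral is height times length, and evaluates ${f_n}$ at a few fixed sample points for a large ${n}$.

```python
import math

# Each example maps n to (height, left, right), encoding f_n = height * 1_[left, right].
examples = {
    "moving": lambda n: (1.0, n, n + 1),
    "spreading": lambda n: (1.0 / n, 0, n),
    "concentrating": lambda n: (float(n), 0, 1.0 / n),
    "receding": lambda n: (1.0, n, math.inf),
}

def integral(bump):
    height, left, right = bump
    return height * (right - left)

def value(bump, x):
    height, left, right = bump
    return height if left <= x <= right else 0.0

n = 10**6
for name, f in examples.items():
    # Integral of f_n, followed by the values of f_n at three fixed points.
    print(name, integral(f(n)), [value(f(n), x) for x in (0.5, 3.0, 100.0)])
```

At each fixed sample point the values are already zero or tiny for large ${n}$, while the integrals stay at ${1}$ (or ${+\infty}$ for the receding infinity example).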

Nevertheless, there are three important cases in which we can relate the limit (or, in the case of Fatou’s lemma, the limit inferior) of the integral to the integral of the limit (or limit inferior). Informally, they are:

• (Fatou’s lemma) For unsigned ${f_n}$, the integral of the limit inferior cannot exceed the limit inferior of the integral. “Limits (or more precisely, limits inferior) can destroy (unsigned) mass, but cannot create it.”
• (Monotone convergence theorem) For unsigned monotone increasing ${f_n}$, the limit of the integral equals the integral of the limit.
• (Dominated convergence theorem) For ${f_n}$ that are uniformly dominated by an absolutely integrable function, the limit of the integral equals the integral of the limit.

These three results then have analogues for convergence of random variables. We will also mention a fourth useful tool in that setting, which allows one to exchange limits and expectations when one controls a higher moment. There are a few more such general results allowing limits to be exchanged with integrals or expectations, but my advice would be to work out such exchanges by hand rather than blindly cite (possibly incorrectly) an additional convergence theorem beyond the four mentioned above, as this is safer and will help strengthen one’s intuition on the situation.

We now state and prove these results more explicitly.

Lemma 13 (Fatou’s lemma) Let ${(\Omega, {\mathcal F}, \mu)}$ be a measure space, and let ${f_1, f_2, \dots: \Omega \rightarrow [0,+\infty]}$ be a sequence of unsigned measurable functions. Then

$\displaystyle \int_\Omega \liminf_{n \rightarrow \infty} f_n\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.$

An equivalent form of this lemma is that if one has

$\displaystyle \int_\Omega f_n\ d\mu \leq M$

for some ${M}$ and all sufficiently large ${n}$, then one has

$\displaystyle \int_\Omega \liminf_{n \rightarrow \infty} f_n\ d\mu \leq M$

as well. That is to say, if the original unsigned functions ${f_n}$ eventually have “mass” less than or equal to ${M}$, then the limit (inferior) ${\liminf_{n \rightarrow \infty} f_n}$ also has “mass” less than or equal to ${M}$. The limit may have substantially less mass, as the four examples above show, but it can never have more mass (asymptotically) than the functions that comprise the limit. Of course, one can replace limit inferior by limit in the left or right hand side if one knows that the relevant limit actually exists (but one cannot replace limit inferior by limit superior if one does not already have convergence, see Example 15 below). On the other hand, it is essential that the ${f_n}$ are unsigned for Fatou’s lemma to work, as can be seen by negating one of the first three key examples mentioned above.

Proof: By definition of the unsigned integral, it suffices to show that

$\displaystyle \int_\Omega g\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu$

whenever ${g}$ is an unsigned simple function with ${g \leq \liminf_{n \rightarrow \infty} f_n}$. Multiplying by ${1-\varepsilon}$, it thus suffices to show that

$\displaystyle (1-\varepsilon) \int_\Omega g\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu$

for any ${0 < \varepsilon < 1}$ and any unsigned ${g}$ as above.

We can write ${g}$ as the sum ${g = \sum_{i=1}^k a_i 1_{E_i}}$ for some strictly positive ${a_i}$ and disjoint ${E_i}$; we allow the ${a_i}$ and the measures ${\mu(E_i)}$ to be infinite. On each ${E_i}$, we have ${\liminf_{n \rightarrow \infty} f_n > (1-\varepsilon) a_i}$. Thus, if we define

$\displaystyle E_{i,N} := \{ \omega \in E_i: f_n(\omega) \geq (1-\varepsilon) a_i \hbox{ for all } n \geq N \}$

then the ${E_{i,N}}$ increase to ${E_i}$ as ${N \rightarrow \infty}$: ${\bigcup_{N=1}^\infty E_{i,N} = E_i}$. By continuity from below (Exercise 23 of Notes 0), we thus have

$\displaystyle \mu(E_{i,N}) \rightarrow \mu(E_i)$

as ${N \rightarrow \infty}$. Since

$\displaystyle f_N \geq \sum_{i=1}^k (1-\varepsilon) a_i 1_{E_{i,N}}$

we conclude upon integration that

$\displaystyle \int_\Omega f_N\ d\mu \geq \sum_{i=1}^k (1-\varepsilon) a_i \mu( E_{i,N} )$

and thus on taking limit inferior

$\displaystyle \liminf_{N \rightarrow \infty} \int_\Omega f_N\ d\mu \geq \sum_{i=1}^k (1-\varepsilon) a_i \mu( E_i).$

But the right-hand side is ${(1-\varepsilon) \int_\Omega g\ d\mu}$, and the claim follows. $\Box$

Of course, Fatou’s lemma may be phrased probabilistically:

Lemma 14 (Fatou’s lemma for random variables) Let ${X_1,X_2,\dots}$ be a sequence of unsigned random variables. Then

$\displaystyle {\bf E} \liminf_{n \rightarrow \infty} X_n \leq \liminf_{n \rightarrow \infty} {\bf E} X_n.$

As a corollary, if ${X_1,X_2,\dots}$ are unsigned and converge almost surely to a random variable ${Y}$, then

$\displaystyle {\bf E} Y \leq \liminf_{n \rightarrow \infty} {\bf E} X_n.$
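The inequality in Fatou's lemma can be strict even on a probability space. As an illustrative sketch, assume a hypothetical model consisting of a single fair coin flip ${\omega \in \{0,1\}}$, and let ${X_n}$ be the indicator of the event that ${\omega}$ has the same parity as ${n}$. The sequence is periodic in ${n}$ with period two, so the limit inferior is just the minimum over one period.

```python
from fractions import Fraction

# Hypothetical model: one fair coin flip, omega in {0, 1}.
omega = (0, 1)
prob = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def X(n, w):
    # Indicator that omega has the same parity as n.
    return 1 if w == n % 2 else 0

def expectation(Z):
    return sum(prob[w] * Z(w) for w in omega)

# X(n, w) has period 2 in n, so liminf over n equals the minimum over one period.
def liminf_X(w):
    return min(X(n, w) for n in (1, 2))

lhs = expectation(liminf_X)                                      # E liminf X_n
rhs = min(expectation(lambda w, n=n: X(n, w)) for n in (1, 2))  # liminf E X_n

print(lhs, rhs)
```

Here the left-hand side is ${0}$ while the right-hand side is ${1/2}$: the mass oscillates between the two outcomes and is lost in the limit inferior.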

Next, we establish the monotone convergence theorem.

Theorem 16 (Monotone convergence theorem) Let ${(\Omega, {\mathcal F}, \mu)}$ be a measure space, and let ${f_1, f_2, \dots: \Omega \rightarrow [0,+\infty]}$ be a sequence of unsigned measurable functions which is monotone increasing, thus ${f_n(\omega) \leq f_{n+1}(\omega)}$ for all ${n}$ and ${\omega \in \Omega}$. Then

$\displaystyle \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu = \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.$

Note that the limits exist on both sides because monotone sequences always have limits. Indeed the limit in either side is equal to the supremum. The receding infinity example shows that it is important that the functions here are monotone increasing rather than monotone decreasing. We also observe that it is enough for the ${f_n}$ to be increasing almost everywhere rather than everywhere, since one can then modify the ${f_n}$ on a set of measure zero to be increasing everywhere, which does not affect the integrals on either side of this theorem.

Proof: From Fatou’s lemma we already have

$\displaystyle \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu \leq \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.$

On the other hand, from monotonicity we see that

$\displaystyle \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu \geq \int_\Omega f_m\ d\mu$

for any natural number ${m}$, and on taking limits as ${m \rightarrow \infty}$ we obtain the claim. $\Box$

An important corollary of the monotone convergence theorem is that one can freely interchange infinite sums with integrals for unsigned functions, that is to say

$\displaystyle \int_\Omega \sum_{n=1}^\infty g_n\ d\mu = \sum_{n=1}^\infty \int_\Omega g_n\ d\mu$

for any unsigned ${g_1,g_2,\dots: \Omega \rightarrow [0,+\infty]}$ (not necessarily monotone). Indeed, to see this one simply applies the monotone convergence theorem to the partial sums ${f_N := \sum_{n=1}^N g_n}$.

We of course can translate this into the probabilistic context:

Theorem 17 (Monotone convergence theorem for random variables) Let ${0 \leq X_1 \leq X_2 \leq \dots}$ be a monotone non-decreasing sequence of unsigned random variables. Then

$\displaystyle {\bf E} \lim_{n \rightarrow \infty} X_n = \lim_{n \rightarrow \infty} {\bf E} X_n.$

Similarly, for any unsigned random variables ${Y_1,Y_2,\dots}$, we have

$\displaystyle {\bf E} \sum_{n=1}^\infty Y_n = \sum_{n=1}^\infty {\bf E} Y_n.$
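For a concrete instance of monotone convergence, suppose (as an assumption for this sketch) that ${U}$ is drawn uniformly from ${[0,1]}$ and set ${X_n := \min(U^{-1/2}, n)}$. These increase to the unbounded random variable ${U^{-1/2}}$, and a direct calculation gives ${{\bf E} X_n = 2 - \frac{1}{n}}$, which increases to ${{\bf E} U^{-1/2} = 2}$. The code below approximates ${{\bf E} X_n}$ with a midpoint rule (a numerical stand-in for the integral, not part of the theory) and compares with the closed form.

```python
def truncated_expectation(n, steps=100_000):
    # Midpoint-rule approximation of the integral of min(x^(-1/2), n) over [0, 1].
    h = 1.0 / steps
    return sum(min(((k + 0.5) * h) ** -0.5, n) for k in range(steps)) * h

levels = (1, 2, 4, 8, 16)
values = [truncated_expectation(n) for n in levels]

for n, v in zip(levels, values):
    # Numerical value next to the closed form 2 - 1/n.
    print(n, v, 2 - 1 / n)
```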

Lemma 18 (Borel-Cantelli lemma) Let ${E_1,E_2,\dots}$ be a sequence of events with ${\sum_n {\bf P}(E_n) < \infty}$. Then almost surely, at most finitely many of the events ${E_n}$ hold; that is to say, one has ${\sum_n 1_{E_n} < \infty}$ almost surely.

Proof: From the monotone convergence theorem, we have

$\displaystyle {\bf E} \sum_n 1_{E_n} = \sum_n {\bf E} 1_{E_n} = \sum_n {\bf P}(E_n) < \infty.$

By Markov’s inequality, this implies that ${\sum_n 1_{E_n}}$ is almost surely finite, as required. $\Box$
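A quick simulation illustrates the identity ${{\bf E} \sum_n 1_{E_n} = \sum_n {\bf P}(E_n)}$ used in the proof. The sketch below makes several assumptions not present in the lemma: the events are taken to be independent with ${{\bf P}(E_n) = 1/n^2}$, only the first 200 events are simulated, and the seed and trial count are arbitrary. (Independence is not needed for the lemma; it is just a convenient way to generate events.)

```python
import random

def simulate_counts(num_events=200, trials=2000, seed=0):
    # For each trial, count how many of the events E_1, ..., E_num_events occur,
    # where E_n occurs independently with probability 1/n^2.
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        counts.append(sum(rng.random() < 1.0 / n**2 for n in range(1, num_events + 1)))
    return counts

counts = simulate_counts()
mean_count = sum(counts) / len(counts)
expected = sum(1.0 / n**2 for n in range(1, 201))  # sum of P(E_n) over the simulated range

print(mean_count, expected, max(counts))
```

The empirical mean of the count should be close to the partial sum of the probabilities, and in every trial only a handful of the events occur.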

We will develop a partial converse to this lemma (the “second” Borel-Cantelli lemma) in a subsequent set of notes. For now, we give a crude converse in which we assume not only that the ${{\bf P}(E_n)}$ sum to infinity, but they are in fact uniformly bounded from below:

Exercise 19 Let ${E_1,E_2,\dots}$ be a sequence of events with ${\inf_n {\bf P}(E_n) > 0}$. Show that with positive probability, an infinite number of the ${E_n}$ hold; that is to say, ${\mathop{\bf P}( \sum_n 1_{E_n} = \infty ) > 0}$. (Hint: if ${{\bf P}(E_n) \geq \delta > 0}$ for all ${n}$, establish the lower bound ${\mathop{\bf P}( \sum_{n \leq N} 1_{E_n} \geq \delta N/2 ) \geq \delta/2}$ for all ${N}$. Alternatively, one can apply Fatou’s lemma to the random variables ${1_{\overline{E_n}}}$.)

Finally, we give the dominated convergence theorem.

Theorem 20 (Dominated convergence theorem) Let ${(\Omega, {\mathcal F}, \mu)}$ be a measure space, and let ${f_1, f_2, \dots: \Omega \rightarrow {\bf C}}$ be measurable functions which converge pointwise to some limit. Suppose that there is an unsigned absolutely integrable function ${g: \Omega \rightarrow [0,+\infty]}$ which dominates the ${f_n}$ in the sense that ${|f_n(\omega)| \leq |g(\omega)|}$ for all ${n}$ and all ${\omega}$. Then

$\displaystyle \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu = \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.$

In particular, the limit on the right-hand side exists.

Again, it will suffice for ${g}$ to dominate each ${f_n}$ almost everywhere rather than everywhere, as one can upgrade this to everywhere domination by modifying each ${f_n}$ on a set of measure zero. Similarly, pointwise convergence can be replaced with pointwise convergence almost everywhere. The domination of each ${f_n}$ by a single function ${g}$ implies that the integrals ${\int_\Omega |f_n|\ d\mu}$ are uniformly bounded in ${n}$, but this latter condition is not sufficient by itself to guarantee interchangeability of the limit and integral, as can be seen by the first three examples given at the start of this section.

Proof: By splitting into real and imaginary parts, we may assume without loss of generality that the ${f_n}$ are real-valued. As ${g}$ is absolutely integrable, it is finite almost everywhere; after modification on a set of measure zero we may assume it is finite everywhere. Let ${f}$ denote the pointwise limit of the ${f_n}$. From Fatou’s lemma applied to the unsigned functions ${g-f_n}$ and ${g+f_n}$, we have

$\displaystyle \int_\Omega g-f\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega g-f_n\ d\mu$

and

$\displaystyle \int_\Omega g+f\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega g+f_n\ d\mu$

Rearranging this (taking crucial advantage of the finite nature of the ${\int_\Omega g\ d\mu}$, and hence ${\int_\Omega f\ d\mu}$ and ${\int_\Omega f_n\ d\mu}$), we conclude that

$\displaystyle \limsup_{n \rightarrow \infty} \int_\Omega f_n\ d\mu \leq \int_\Omega f\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu$

and the claim follows. $\Box$

Remark 21 Amusingly, one can use the dominated convergence theorem to give an (extremely indirect) proof of the divergence of the harmonic series ${\sum_{n=1}^\infty \frac{1}{n}}$. For, if that series was convergent, then the function ${\sum_{n=1}^\infty \frac{1}{n} 1_{[n-1,n]}}$ would be absolutely integrable, and the spreading bump example described above would contradict the dominated convergence theorem. (Expert challenge: see if you can deconstruct the above argument enough to lower bound the rate of divergence of the harmonic series ${\sum_{n=1}^\infty \frac{1}{n}}$.)

We again translate the above theorem to the probabilistic context:

Theorem 22 (Dominated convergence theorem for random variables) Let ${X_1,X_2,\dots}$ be scalar random variables which converge almost surely to a limit ${X}$. Suppose there is an unsigned absolutely integrable random variable ${Y}$ such that ${|X_n| \leq Y}$ almost surely for each ${n}$. Then

$\displaystyle \lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X.$

As a corollary of the dominated convergence theorem for random variables we have the bounded convergence theorem: if ${X_1,X_2,\dots}$ are scalar random variables that converge almost surely to a limit ${X}$, and are almost surely bounded in magnitude by a uniform constant ${M}$, then we have

$\displaystyle \lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X.$

(In Durrett, the bounded convergence theorem is proven first, and then used to establish Fatou’s lemma and the dominated and monotone convergence theorems. The order in which one establishes these results – which are all closely related to each other – is largely a matter of personal taste.) A further corollary of the dominated convergence theorem is that one has the identity

$\displaystyle {\bf E} \sum_n Y_n = \sum_n {\bf E} Y_n$

whenever ${Y_n}$ are scalar random variables with ${\sum_n |Y_n|}$ absolutely integrable (or equivalently, that ${\sum_n {\bf E} |Y_n|}$ is finite).

Another useful variant of the dominated convergence theorem is

Theorem 23 (Convergence for random variables with bounded moment) Let ${X_1,X_2,\dots}$ be scalar random variables which converge almost surely to a limit ${X}$. Suppose there is ${\varepsilon>0}$ and ${M>0}$ such that ${{\bf E} |X_n|^{1+\varepsilon} \leq M}$ for all ${n}$. Then

$\displaystyle \lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X.$

This theorem fails for ${\varepsilon=0}$, as the concentrating bump example shows. The case ${\varepsilon=1}$ (that is to say, bounded second moment ${{\bf E} |X_n|^2}$) is already quite useful. The intuition here is that concentrating bumps are in some sense the only obstruction to interchanging limits and expectations, and these can be eliminated by hypotheses such as a bounded higher moment hypothesis or a domination hypothesis.
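To make the role of ${\varepsilon}$ concrete, consider the probabilistic version of the concentrating bump: ${Y_n := n 1_{U \leq 1/n}}$ with ${U}$ uniform on ${[0,1]}$ (the names ${Y_n}$ and ${U}$ are introduced here only for illustration). Then ${Y_n}$ equals ${n}$ with probability ${1/n}$ and ${0}$ otherwise, so ${{\bf E} |Y_n|^{1+\varepsilon} = n^\varepsilon}$. The sketch below tabulates this closed form.

```python
def concentrating_moment(n, eps):
    # E |Y_n|^(1+eps) where Y_n = n * 1_{U <= 1/n} and U is uniform on [0,1]:
    # Y_n equals n with probability 1/n and 0 otherwise.
    return (n ** (1 + eps)) * (1.0 / n)

for eps in (0.0, 0.5, 1.0):
    print(eps, [concentrating_moment(n, eps) for n in (1, 10, 100, 1000)])
```

For ${\varepsilon = 0}$ the moments stay at ${1}$ even though ${Y_n \rightarrow 0}$ almost surely, while for any ${\varepsilon > 0}$ they grow without bound, so the hypothesis of Theorem 23 rules this example out.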

Proof: By taking real and imaginary parts we may assume that the ${X_n}$ (and hence ${X}$) are real-valued. For any natural number ${m}$, let ${X_n^{[m]}}$ denote the truncation ${X_n^{[m]} := \max(\min(X_n,m),-m)}$ of ${X_n}$ to the interval ${[-m,m]}$, and similarly define ${X^{[m]} := \max(\min(X,m),-m)}$. Then ${X_n^{[m]}}$ converges pointwise to ${X^{[m]}}$, and hence by the bounded convergence theorem

$\displaystyle \lim_{n \rightarrow \infty} {\bf E} X_n^{[m]} = {\bf E} X^{[m]}.$

On the other hand, we have

$\displaystyle |X_n - X_n^{[m]}| \leq m^{-\varepsilon} |X_n|^{1+\varepsilon}$

(why?) and thus on taking expectations and using the triangle inequality

$\displaystyle {\bf E} X_n = {\bf E} X_n^{[m]} + O( m^{-\varepsilon} M )$

where we are using the asymptotic notation ${O(X)}$ to denote a quantity bounded in magnitude by ${CX}$ for an absolute constant ${C}$. Also, from Fatou’s lemma we have

$\displaystyle {\bf E} |X|^{1+\varepsilon} \leq M$

so we similarly have

$\displaystyle {\bf E} X = {\bf E} X^{[m]} + O( m^{-\varepsilon} M )$

Putting all this together, we see that

$\displaystyle \liminf_{n \rightarrow \infty} {\bf E} X_n, \limsup_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X + O( m^{-\varepsilon} M ).$

Sending ${m \rightarrow \infty}$, we obtain the claim. $\Box$
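The truncation bound ${|X_n - X_n^{[m]}| \leq m^{-\varepsilon} |X_n|^{1+\varepsilon}}$ used in the proof is a pointwise statement about real numbers: the left-hand side vanishes when ${|x| \leq m}$, and when ${|x| > m}$ it equals ${|x| - m \leq |x| \leq |x| (|x|/m)^\varepsilon}$. A brute-force numerical check over a grid (the grid and parameter ranges are arbitrary choices) is sketched below.

```python
def truncate(x, m):
    # Truncation of x to the interval [-m, m].
    return max(min(x, m), -m)

def gap(x, m, eps):
    # Right-hand side minus left-hand side of the truncation bound;
    # the bound holds at (x, m, eps) exactly when this is non-negative.
    return m ** (-eps) * abs(x) ** (1 + eps) - abs(x - truncate(x, m))

grid = [k / 4 for k in range(-200, 201)]  # x ranging over [-50, 50] in steps of 1/4
worst = min(gap(x, m, eps)
            for x in grid
            for m in (1, 2, 3, 4, 5)
            for eps in (0.1, 0.5, 1.0, 2.0))
print(worst)
```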

Remark 24 The essential point about the condition ${{\bf E} |X_n|^{1+\varepsilon}}$ was that the function ${x \mapsto x^{1+\varepsilon}}$ grew faster than linearly as ${x \rightarrow \infty}$. One could accomplish the same result with any other function with this property, e.g. a hypothesis such as ${{\bf E} |X_n| \log |X_n| \leq M}$ would also suffice. The most natural general condition to impose here is that of uniform integrability, which encompasses the hypotheses already mentioned, but we will not focus on this condition here.

Exercise 25 (Scheffé’s lemma) Let ${X_1,X_2,\dots}$ be a sequence of absolutely integrable scalar random variables that converge almost surely to another absolutely integrable scalar random variable ${X}$. Suppose also that ${{\bf E} |X_n|}$ converges to ${{\bf E} |X|}$ as ${n \rightarrow \infty}$. Show that ${{\bf E} |X-X_n|}$ converges to zero as ${n \rightarrow \infty}$. (Hint: there are several ways to prove this result, known as Scheffé’s lemma. One is to split ${X_n}$ into two components ${X_n = X_{n,1} + X_{n,2}}$, such that ${X_{n,1}}$ is dominated by ${|X|}$ but converges almost surely to ${X}$, and ${X_{n,2}}$ is such that ${|X_n| = |X_{n,1}| + |X_{n,2}|}$. Then apply the dominated convergence theorem.)

— 4. The distribution of a random variable —

We have seen that the expectation of a random variable is a special case of the more general notion of Lebesgue integration on a measure space. There is however another way to think of expectation as a special case of integration, which is particularly convenient for computing expectations. We first need the following definition.

Definition 26 Let ${X}$ be a random variable taking values in a measurable space ${R = (R, {\mathcal B})}$. The distribution of ${X}$ (also known as the law of ${X}$) is the probability measure ${\mu_X}$ on ${R}$ defined by the formula

$\displaystyle \mu_X(S) := {\bf P}( X \in S )$

for all measurable sets ${S \in {\mathcal B}}$; one easily sees from the Kolmogorov axioms that this is indeed a probability measure.

Example 27 If ${X}$ only takes on at most countably many values (and if every point in ${R}$ is measurable), then the distribution ${\mu_X}$ is the discrete measure that assigns each point ${a}$ in the range of ${X}$ a measure of ${{\bf P}(X=a)}$.

Example 28 If ${X}$ is a real random variable with cumulative distribution function ${F}$, then ${\mu_X}$ is the Lebesgue-Stieltjes measure associated to ${F}$. For instance, if ${X}$ is drawn uniformly at random from ${[0,1]}$, then ${\mu_X}$ is Lebesgue measure restricted to ${[0,1]}$. In particular, two scalar variables are equal in distribution if and only if they have the same cumulative distribution function.

Example 29 If ${X}$ and ${Y}$ are the results of two separate rolls of a fair die (as in Example 3 of Notes 0), then ${X}$ and ${Y}$ are equal in distribution, but are not equal as random variables.
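This example can be verified by direct enumeration. The sketch below assumes the standard model of two independent fair dice on the 36-point sample space with uniform probabilities; the helper `law` is an illustrative name for the function that tabulates the distribution.

```python
from fractions import Fraction
from itertools import product

# Assumed model: two independent fair dice, sample space {1,...,6}^2, uniform probability.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def X(w):
    return w[0]  # first roll

def Y(w):
    return w[1]  # second roll

def law(Z):
    # Distribution of Z as a dictionary {value: probability}.
    dist = {}
    for w in omega:
        dist[Z(w)] = dist.get(Z(w), 0) + p
    return dist

same_law = law(X) == law(Y)
prob_equal = sum(p for w in omega if X(w) == Y(w))  # P(X = Y)

print(same_law, prob_equal)
```

The two laws coincide, yet ${X = Y}$ only on the diagonal of the sample space, which has probability ${1/6}$.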

Remark 30 In the converse direction, given a probability measure ${\mu}$ on a measurable space ${(R, {\mathcal B})}$, one can always build a probability space model and a random variable ${X}$ represented by that model whose distribution is ${\mu}$. Indeed, one can perform the “tautological” construction of defining the probability space model to be ${\Omega := (R, {\mathcal B}, \mu)}$, and ${X: \Omega \rightarrow R}$ to be the identity function ${X(\omega) := \omega}$, and then one easily checks that ${\mu_X = \mu}$. Compare with Corollaries 26 and 29 of Notes 0. Furthermore, one can view this tautological model as a “base” model for random variables of distribution ${\mu}$ as follows. Suppose one has a random variable ${X}$ of distribution ${\mu}$ which is modeled by some other probability space ${\Omega' := (\Omega', {\mathcal F}', \mu')}$, thus ${X_{\Omega'}: \Omega' \rightarrow R}$ is a measurable function such that

$\displaystyle \mu(S) = {\bf P}(X \in S) = \mu'( \{ \omega' \in \Omega': X_{\Omega'}(\omega') \in S \})$

for all ${S \in {\mathcal B}}$. Then one can view the probability space ${\Omega'}$ as an extension of the tautological probability space ${\Omega = (R, {\mathcal B}, \mu)}$ using ${X_{\Omega'}}$ as the factor map.

We say that two random variables ${X,Y}$ are equal in distribution, and write ${X \stackrel{d}{=} Y}$, if they have the same law: ${\mu_X = \mu_Y}$, that is to say ${{\bf P}(X \in S) = {\bf P}(Y \in S)}$ for any measurable set ${S}$ in the range. This definition makes sense even when ${X,Y}$ are defined on different sample spaces. Roughly speaking, the distribution captures the “size” and “shape” of the random variable, but not its “location” or how it relates to other random variables.

Theorem 31 (Change of variables formula) Let ${X}$ be a random variable taking values in a measurable space ${R = (R, {\mathcal B})}$. Let ${f: R \rightarrow {\bf R}}$ or ${f: R \rightarrow {\bf C}}$ be a measurable scalar function (giving ${{\bf R}}$ or ${{\bf C}}$ the Borel ${\sigma}$-algebra of course) such that either ${f \geq 0}$, or that ${{\bf E} |f(X)| < \infty}$. Then

$\displaystyle {\bf E} f(X) = \int_R f(x)\ d\mu_X(x).$

Thus for instance, if ${X}$ is a real random variable, then

$\displaystyle {\bf E} |X| = \int_{\bf R} |x|\ d\mu_X(x),$

and more generally

$\displaystyle {\bf E} |X|^p = \int_{\bf R} |x|^p\ d\mu_X(x)$

for all ${0 < p < \infty}$; furthermore, if ${X}$ is unsigned or absolutely integrable, one has

$\displaystyle {\bf E} X = \int_{\bf R} x\ d\mu_X(x).$

The point here is that the integration is not over some unspecified sample space ${\Omega}$, but over a very explicit domain, namely the reals; we have “changed variables” to integrate over ${{\bf R}}$ instead of over ${\Omega}$, with the distribution ${\mu_X}$ representing the “Jacobian” factor that typically shows up in such change of variables formulae.
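In a discrete setting the change of variables formula amounts to regrouping a finite sum. As a hedged illustration, take the sample space of two fair dice, let ${X}$ be the sum of the rolls, and take the test function ${f(x) := (x-7)^2}$ (all of these are choices made for this sketch). The code computes ${{\bf E} f(X)}$ once by summing over the sample space and once by integrating ${f}$ against the distribution ${\mu_X}$.

```python
from fractions import Fraction
from itertools import product

# Assumed model: two independent fair dice with uniform probability on 36 outcomes.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def X(w):
    return w[0] + w[1]  # sum of the two rolls

def f(x):
    return (x - 7) ** 2

# Left-hand side: expectation computed on the sample space.
lhs = sum(p * f(X(w)) for w in omega)

# Distribution mu_X of X on {2,...,12}.
mu_X = {}
for w in omega:
    mu_X[X(w)] = mu_X.get(X(w), 0) + p

# Right-hand side: integral of f against mu_X.
rhs = sum(f(x) * mass for x, mass in mu_X.items())

print(lhs, rhs)
```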

Proof: First suppose that ${f}$ is unsigned and only takes on a finite number ${a_1,\dots,a_n}$ of values. Then

$\displaystyle f(X) = \sum_{i=1}^n a_i 1_{f(X)=a_i}$

and hence

$\displaystyle {\bf E} f(X) = \sum_{i=1}^n a_i {\bf P}(f(X)=a_i)$

$\displaystyle = \sum_{i=1}^n a_i \mu_X( \{ x: f(x) = a_i\} )$

$\displaystyle = \int_R \sum_{i=1}^n a_i 1_{f(x)=a_i}\ d\mu_X(x)$

$\displaystyle = \int_R f(x)\ d\mu_X(x)$

as required.

Next, suppose that ${f}$ is unsigned but can take on infinitely many values. We can express ${f}$ as the monotone increasing limit of functions ${f_n}$ that only take a finite number of values; for instance we can define ${f_n(x)}$ to be ${f(x)}$ rounded down to the largest multiple of ${2^{-n}}$ that is at most both ${n}$ and ${f(x)}$. By the preceding computation, we have

$\displaystyle {\bf E} f_n(X) = \int_R f_n(x)\ d\mu_X(x),$

and on taking limits as ${n \rightarrow \infty}$ using the monotone convergence theorem we obtain the claim in this case.

Now suppose that ${f}$ is real-valued with ${{\bf E} |f(X)| < \infty}$. We write ${f = f_1 - f_2}$ where ${f_1 := \max(f,0)}$ and ${f_2 := \max(-f,0)}$; then we have ${{\bf E} f_1(X), {\bf E} f_2(X) < \infty}$ and

$\displaystyle {\bf E} f_i(X) = \int_R f_i(x)\ d\mu_X(x)$

for ${i=1,2}$. Subtracting these two identities, we obtain the claim.

Finally, the case of complex-valued ${f}$ with ${{\bf E} |f(X)| < \infty}$ follows from the real-valued case by taking real and imaginary parts. $\Box$

Example 32 Let ${X}$ be drawn uniformly at random from ${[0,1]}$; then

$\displaystyle {\bf E} f(X) = \int_0^1 f(x)\ dx$

for any Riemann integrable ${f}$; thus for instance

$\displaystyle {\bf E} X^p = \frac{1}{p+1}$

for any ${0 < p < \infty}$.
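The moment formula can be checked numerically. The sketch below approximates ${\int_0^1 x^p\ dx}$ by a midpoint Riemann sum (the step count and the sample exponents are arbitrary choices) and compares the result with ${\frac{1}{p+1}}$.

```python
def uniform_moment(p, steps=100_000):
    # Midpoint-rule approximation of the integral of x^p over [0, 1].
    h = 1.0 / steps
    return sum(((k + 0.5) * h) ** p for k in range(steps)) * h

for p in (0.5, 1.0, 2.0, 3.5):
    # Numerical value next to the closed form 1 / (p + 1).
    print(p, uniform_moment(p), 1 / (p + 1))
```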

Remark 33 An alternate way to prove the change of variables formula is to observe that the formula is obviously true when one uses the tautological model ${(R, {\mathcal B}, \mu_X)}$ for ${X}$, and then the claim follows from the model-independence of expectation and the observation from Remark 30 that any other model for ${X}$ is an extension of the tautological model.

— 5. Some basic inequalities —

We record here for future reference some basic inequalities concerning expectation that we will need in the sequel. We have already seen the triangle inequality

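$\displaystyle |{\bf E}(X)| \leq {\bf E} |X| \ \ \ \ \ (5)$

for absolutely integrable ${X}$. We also have the Markov inequality

$\displaystyle {\bf P}(|X| \geq t) \leq \frac{1}{t} {\bf E} |X| \ \ \ \ \ (6)$

for arbitrary scalar ${X}$ and ${t>0}$. Applying the Markov inequality to the quantity ${|X - {\bf E} X|^2}$ (with ${t}$ replaced by ${t^2}$), we obtain the Chebyshev inequality

$\displaystyle {\bf P}(|X - {\bf E} X| \geq t) \leq \frac{1}{t^2} \hbox{Var}(X) \ \ \ \ \ (7)$

for absolutely integrable ${X}$ and ${t>0}$, where the variance ${\hbox{Var}(X)}$ of ${X}$ is defined as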

$\displaystyle \hbox{Var}(X) := {\bf E}( |X - {\bf E} X|^2 ).$
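
To see the inequalities (6) and (7) in action, here is a small exact computation for a fair six-sided die. The model (a fair die) and the helper names `expect` and `probability` are assumptions made purely for illustration.

```python
from fractions import Fraction

# Hypothetical model: a fair six-sided die, each face with probability 1/6.
outcomes = range(1, 7)
prob = Fraction(1, 6)

def expect(g):
    """Expectation of g(X) for the die roll X."""
    return sum(g(x) * prob for x in outcomes)

def probability(predicate):
    """Probability that the die roll satisfies the predicate."""
    return sum(prob for x in outcomes if predicate(x))

mean = expect(lambda x: x)
variance = expect(lambda x: (x - mean) ** 2)

for t in (1, 2, 3, 5):
    # Inequality (6): P(|X| >= t) <= E|X| / t
    assert probability(lambda x: abs(x) >= t) <= expect(abs) / t
    # Inequality (7): P(|X - EX| >= t) <= Var(X) / t^2
    assert probability(lambda x: abs(x - mean) >= t) <= variance / t ** 2

print("mean =", mean, "variance =", variance)
```

Exact rational arithmetic is used so that the comparisons are not affected by floating-point rounding.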

Next, we record

Lemma 34 (Jensen’s inequality) If ${f: {\bf R} \rightarrow {\bf R}}$ is a convex function, ${X}$ is a real random variable with ${X}$ and ${f(X)}$ both absolutely integrable, then

$\displaystyle f( {\bf E} X ) \leq {\bf E} f(X).$

Proof: Let ${x_0}$ be a real number. Being convex, the graph of ${f}$ must be supported by some line at ${(x_0,f(x_0))}$, that is to say there exists a slope ${c}$ (depending on ${x_0}$) such that ${f(x) \ge f(x_0) + c(x-x_0)}$ for all ${x \in {\bf R}}$. (If ${f}$ is differentiable at ${x_0}$, one can take ${c}$ to be the derivative of ${f}$ at ${x_0}$, but one always has a supporting line even in the non-differentiable case.) In particular

$\displaystyle f(X) \geq f(x_0) + c(X-x_0).$

Taking expectations and using linearity of expectation, we conclude

$\displaystyle {\bf E} f(X) \geq f(x_0) + c({\bf E} X - x_0 )$

and the claim follows from setting ${x_0 := {\bf E} X}$. $\Box$
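
Jensen's inequality is easy to test on a finite distribution. The following is a minimal sketch; the four-point distribution and the choice of convex functions are hypothetical, picked only for illustration.

```python
import math

# Hypothetical finite distribution: values of X with their probabilities.
values = [-2.0, 0.5, 1.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expect(g):
    """Expectation of g(X) under the distribution above."""
    return sum(p * g(x) for x, p in zip(values, probs))

mean = expect(lambda x: x)
convex_functions = {"abs": abs, "square": lambda x: x * x, "exp": math.exp}

# Jensen gap E f(X) - f(E X); the lemma says it is non-negative.
jensen_gaps = {name: expect(f) - f(mean) for name, f in convex_functions.items()}
for name, gap in jensen_gaps.items():
    print(f"{name}: E f(X) - f(E X) = {gap:.4f}")
```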

Exercise 35 (Complex Jensen inequality) Let ${f: {\bf C} \rightarrow {\bf R}}$ be a convex function (thus ${f((1-t)z+tw) \leq (1-t)f(z) + tf(w)}$ for all complex ${z,w}$ and all ${0 \leq t \leq 1}$), and let ${X}$ be a complex random variable with ${X}$ and ${f(X)}$ both absolutely integrable. Show that

$\displaystyle f( {\bf E} X ) \leq {\bf E} f(X).$

Note that the triangle inequality ${|{\bf E} X| \leq {\bf E} |X|}$ is the special case of Jensen’s inequality (or the complex Jensen’s inequality, if ${X}$ is complex-valued) corresponding to the convex function ${f(x) := |x|}$ on ${{\bf R}}$ (or ${f(z) := |z|}$ on ${{\bf C}}$). Another useful example is

$\displaystyle |{\bf E} X|^2 \leq {\bf E} |X|^2.$

As a related application of convexity, observe from the convexity of the function ${x \mapsto e^x}$ that

$\displaystyle e^{tx + (1-t)y} \leq t e^x + (1-t) e^y$

for any ${0 \leq t \leq 1}$ and ${x,y \in {\bf R}}$. This implies in particular Young’s inequality

$\displaystyle |X| |Y| \leq \frac{1}{p} |X|^p + \frac{1}{q} |Y|^q$

for any scalar ${X,Y}$ and any exponents ${1 < p,q < \infty}$ with ${\frac{1}{p} + \frac{1}{q}=1}$; note that this inequality is also trivially true if one or both of ${|X|, |Y|}$ are infinite. Taking expectations, we conclude that

$\displaystyle {\bf E} |X| |Y| \leq \frac{1}{p} {\bf E} |X|^p + \frac{1}{q} {\bf E} |Y|^q$

if ${X,Y}$ are scalar random variables and ${1 < p, q < \infty}$ are deterministic exponents with ${\frac{1}{p} + \frac{1}{q}=1}$. In particular, if ${|X|^p, |Y|^q}$ are absolutely integrable, then so is ${XY}$, and

$\displaystyle |{\bf E} X Y| \leq \frac{1}{p} {\bf E} |X|^p + \frac{1}{q} {\bf E} |Y|^q.$

We can amplify this inequality as follows. Multiplying ${X}$ by some ${0 < \lambda < \infty}$ and dividing ${Y}$ by the same ${\lambda}$, we conclude that

$\displaystyle |{\bf E} X Y| \leq \frac{\lambda^p}{p} {\bf E} |X|^p + \frac{\lambda^{-q}}{q} {\bf E} |Y|^q;$

optimising the right-hand side in ${\lambda}$, we obtain (after some algebra, and after disposing of some edge cases when ${X}$ or ${Y}$ is almost surely zero) the important Hölder inequality

$\displaystyle |{\bf E} X Y| \leq ({\bf E} |X|^p)^{1/p} ({\bf E} |Y|^q)^{1/q} \ \ \ \ \ (8)$

which we can write as

$\displaystyle |{\bf E} X Y| \leq \| X\|_p \|Y\|_q$

where we use the notation

$\displaystyle \|X\|_p := ({\bf E} |X|^p)^{1/p}$

for ${0 < p < \infty}$. Using the convention

$\displaystyle \|X\|_\infty := \inf \{ M: |X| \leq M \hbox{ a. s.} \}$

(thus ${\|X\|_\infty}$ is the essential supremum of ${X}$), we also see from the triangle inequality that the Hölder inequality applies in the boundary case when one of ${p,q}$ is allowed to be ${\infty}$ (so that the other is equal to ${1}$):

$\displaystyle |{\bf E} X Y| \leq \| X\|_1 \|Y\|_\infty, \|X\|_\infty \|Y\|_1.$

The case ${p=q=2}$ of Hölder’s inequality is the Cauchy-Schwarz inequality

$\displaystyle |{\bf E} X Y| \leq \| X\|_2 \|Y\|_2, \ \ \ \ \ (9)$

valid whenever ${X,Y}$ are square-integrable in the sense that ${\|X\|_2, \|Y\|_2}$ are finite.
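
The Hölder inequality (8) can be checked numerically on a small finite probability space. In the sketch below, the four-point space and the values of ${X}$ and ${Y}$ are hypothetical; the exponent ${p=2}$ corresponds to (9).

```python
# Hypothetical four-point probability space with two scalar random variables.
probs = [0.1, 0.2, 0.3, 0.4]
X = [3.0, -1.0, 2.0, 0.5]
Y = [-2.0, 4.0, 1.0, 1.5]

def expect(values):
    """Expectation of a random variable given by its value at each outcome."""
    return sum(p * v for p, v in zip(probs, values))

def norm(values, p):
    """The quantity (E|V|^p)^(1/p) for finite p > 0."""
    return expect([abs(v) ** p for v in values]) ** (1.0 / p)

lhs = abs(expect([x * y for x, y in zip(X, Y)]))
for p in (1.5, 2.0, 3.0, 10.0):
    q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1
    rhs = norm(X, p) * norm(Y, q)
    assert lhs <= rhs + 1e-12
    print(f"p = {p}: |E XY| = {lhs:.4f} <= {rhs:.4f}")
```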

Exercise 36 Show that the expressions ${\|X\|_p}$ are non-decreasing in ${p}$ for ${p \in (0,+\infty]}$. In particular, if ${\|X\|_p}$ is finite for some ${p}$, then it is automatically finite for all smaller values of ${p}$.

Exercise 37 For any square-integrable ${X}$, show that

$\displaystyle {\bf P}(X \neq 0) \geq \frac{ ({\bf E} |X|)^2}{{\bf E}(|X|^2)}.$

Exercise 38 If ${1 < p < \infty}$ and ${X,Y}$ are scalar random variables with ${\|X\|_p, \|Y\|_p < \infty}$, use Hölder’s inequality to establish that

$\displaystyle {\bf E} |X| |X+Y|^{p-1} \leq \|X\|_p \|X+Y\|_p^{p-1}$

and

$\displaystyle {\bf E} |Y| |X+Y|^{p-1} \leq \|Y\|_p \|X+Y\|_p^{p-1}.$

Conclude the Minkowski inequality

$\displaystyle \|X+Y\|_p \leq \|X\|_p + \|Y\|_p.$

Show that this inequality is also valid at the endpoint cases ${p=1}$ and ${p=\infty}$.

Exercise 39 If ${X}$ is non-negative and square-integrable, and ${0 \leq \theta \leq 1}$, establish the Paley-Zygmund inequality

$\displaystyle {\bf P}( X > \theta {\bf E}(X) ) \geq (1-\theta)^2 \frac{({\bf E} X)^2}{{\bf E}(|X|^2)}.$

(Hint: use the Cauchy-Schwarz inequality to upper bound ${{\bf E} X 1_{X > \theta {\bf E} X}}$ in terms of ${{\bf E} |X|^2}$ and ${{\bf P}( X > \theta {\bf E}(X) )}$.)


### Terence Tao — 275A, Notes 0: Foundations of probability theory

Starting this week, I will be teaching an introductory graduate course (Math 275A) on probability theory here at UCLA. While I find myself using probabilistic methods routinely nowadays in my research (for instance, the probabilistic concept of Shannon entropy played a crucial role in my recent paper on the Chowla and Elliott conjectures, and random multiplicative functions similarly played a central role in the paper on the Erdos discrepancy problem), this will actually be the first time I will be teaching a course on probability itself (although I did give a course on random matrix theory some years ago that presumed familiarity with graduate-level probability theory). As such, I will be relying primarily on an existing textbook, in this case Durrett’s Probability: Theory and Examples. I still need to prepare lecture notes, though, and so I thought I would continue my practice of putting my notes online, although in this particular case they will be less detailed or complete than with other courses, as they will mostly be focusing on those topics that are not already comprehensively covered in the text of Durrett. Below the fold are my first such set of notes, concerning the classical measure-theoretic foundations of probability. (I wrote on these foundations also in this previous blog post, but in that post I already assumed that the reader was familiar with measure theory and basic probability, whereas in this course not every student will have a strong background in these areas.)

Note: as this set of notes is primarily concerned with foundational issues, it will contain a large number of pedantic (and nearly trivial) formalities and philosophical points. We dwell on these technicalities in this set of notes primarily so that they are out of the way in later notes, when we work with the actual mathematics of probability, rather than on the supporting foundations of that mathematics. In particular, the excessively formal and philosophical language in this set of notes will not be replicated in later notes.

— 1. Some philosophical generalities —

By default, mathematical reasoning is understood to take place in a deterministic mathematical universe. In such a universe, any given mathematical statement ${S}$ (that is to say, a sentence with no free variables) is either true or false, with no intermediate truth value available. Similarly, any deterministic variable ${x}$ can take on only one specific value at a time.

However, for a variety of reasons, both within pure mathematics and in the applications of mathematics to other disciplines, it is often desirable to have a rigorous mathematical framework in which one can discuss non-deterministic statements and variables – that is to say, statements which are not always true or always false, but in some intermediate state, or variables that do not take one particular value or another with definite certainty, but are again in some intermediate state. In probability theory, which is by far the most widely adopted mathematical framework to formally capture the concept of non-determinism, non-deterministic statements are referred to as events, and non-deterministic variables are referred to as random variables. In the standard foundations of probability theory, as laid out by Kolmogorov, we can then model these events and random variables by introducing a sample space (which will be given the structure of a probability space) to capture all the ambient sources of randomness; events are then modeled as measurable subsets of this sample space, and random variables are modeled as measurable functions on this sample space. (We will briefly discuss a more abstract way to set up probability theory, as well as other frameworks to capture non-determinism than classical probability theory, at the end of this set of notes; however, the rest of the course will be concerned exclusively with classical probability theory using the orthodox Kolmogorov models.)

Note carefully that sample spaces (and their attendant structures) will be used to model probabilistic concepts, rather than to actually be the concepts themselves. This distinction (a mathematical analogue of the map-territory distinction in philosophy) actually is implicit in much of modern mathematics, when we make a distinction between an abstract version of a mathematical object, and a concrete representation (or model) of that object. For instance:

• In linear algebra, we distinguish between an abstract vector space ${V}$, and a concrete system of coordinates ${\phi: V \rightarrow {\bf R}^n}$ given by some basis of ${V}$.
• In group theory, we distinguish between an abstract group ${G}$, and a concrete representation of that group ${\phi: G \rightarrow \hbox{Aut}(X)}$ as isomorphisms on some space ${X}$.
• In differential geometry, we distinguish between an abstract manifold ${M}$, and a concrete atlas of coordinate systems that coordinatises that manifold.
• Though it is rarely mentioned explicitly, the abstract number systems such as ${{\bf N}, {\bf Z}, {\bf Q}, {\bf R}, {\bf C}}$ are distinguished from the concrete numeral systems (e.g. the decimal or binary systems) that are used to represent them (this distinction is particularly useful to keep in mind when faced with the infamous identity ${0.999\dots = 1}$, or when switching from one numeral representation system to another).

The distinction between abstract objects and concrete models can be fairly safely discarded if one is only going to use a single model for each abstract object, particularly if that model is “canonical” in some sense. However, one needs to keep the distinction in mind if one plans to switch between different models of a single object (e.g. to perform change of basis in linear algebra, change of coordinates in differential geometry, or base change in algebraic geometry). As it turns out, in probability theory it is often desirable to change the sample space model (for instance, one could extend the sample space by adding in new sources of randomness, or one could couple together two systems of random variables by joining their sample space models together). Because of this, we will take some care in this foundational set of notes to distinguish probabilistic concepts (such as events and random variables) from their sample space models. (But we may be more willing to conflate the two in later notes, once the foundational issues are out of the way.)

From a foundational point of view, it is often logical to begin with some axiomatic description of the abstract version of a mathematical object, and discuss the concrete representations of that object later; for instance, one could start with the axioms of an abstract group, and then later consider concrete representations of such a group by permutations, invertible linear transformations, and so forth. This approach is often employed in the more algebraic areas of mathematics. However, there are at least two other ways to present these concepts which can be preferable from a pedagogical point of view. One way is to start with the concrete representations as motivating examples, and only later give the abstract object that these representations are modeling; this is how linear algebra, for instance, is often taught at the undergraduate level, by starting first with ${{\bf R}^2}$, ${{\bf R}^3}$, and ${{\bf R}^n}$, and only later introducing the abstract vector spaces. Another way is to avoid the abstract objects altogether, and focus exclusively on concrete representations, but taking care to emphasise how these representations transform when one switches from one representation to another. For instance, in general relativity courses in undergraduate physics, it is not uncommon to see tensors presented purely through the concrete representation of coordinates indexed by multiple indices, with the transformation of such tensors under changes of variable carefully described; the abstract constructions of tensors and tensor spaces using operations such as tensor product and duality of vector spaces or vector bundles are often left to an advanced differential geometry class to set up properly.

The foundations of probability theory are usually presented (almost by default) using the last of the above three approaches; namely, one talks almost exclusively about sample space models for probabilistic concepts such as events and random variables, and only occasionally dwells on the need to extend or otherwise modify the sample space when one needs to introduce new sources of randomness (or to forget about some existing sources of randomness). However, much as in differential geometry one tends to work with manifolds without specifying any given atlas of coordinate charts, in probability one usually manipulates events and random variables without explicitly specifying any given sample space. For a student raised exclusively on concrete sample space foundations of probability, this can be a bit confusing, for instance it can give the misconception that any given random variable is somehow associated to its own unique sample space, with different random variables possibly living on different sample spaces, which often leads to nonsense when one then tries to combine those random variables together. Because of such confusions, we will try to take particular care in these notes to separate probabilistic concepts from their sample space models.

— 2. A simple class of models: discrete probability spaces —

The simplest models of probability theory are those generated by discrete probability spaces, which are adequate models for many applications (particularly in combinatorics and other areas of discrete mathematics), and which already capture much of the essence of probability theory while avoiding some of the finer measure-theoretic subtleties. We thus begin by considering discrete sample space models.

Definition 1 (Discrete probability theory) A discrete probability space ${\Omega = (\Omega, (p_\omega)_{\omega \in \Omega})}$ is an at most countable set ${\Omega}$ (whose elements ${\omega \in \Omega}$ will be referred to as outcomes), together with a non-negative real number ${p_\omega}$ assigned to each outcome ${\omega}$ such that ${\sum_{\omega \in \Omega} p_\omega = 1}$; we refer to ${p_\omega}$ as the probability of the outcome ${\omega}$. The set ${\Omega}$ itself, without the structure ${(p_\omega)_{\omega \in \Omega}}$, is often referred to as the sample space, though we will often abuse notation by using the sample space ${\Omega}$ to refer to the entire discrete probability space ${(\Omega, (p_\omega)_{\omega \in \Omega})}$.

In discrete probability theory, we choose an ambient discrete probability space ${\Omega}$ as the randomness model. We then model an event ${E}$ by a subset ${E_\Omega}$ of the sample space ${\Omega}$. The probability ${{\bf P}(E)}$ of an event ${E}$ is defined to be the quantity

$\displaystyle {\bf P}(E) := \sum_{\omega \in E_\Omega} p_\omega;$

note that this is a real number in the interval ${[0,1]}$. An event ${E}$ is surely true or is the sure event if ${E_\Omega = \Omega}$, and is surely false or is the empty event if ${E_\Omega =\emptyset}$.

We model random variables ${X}$ taking values in the range ${R}$ by functions ${X_\Omega: \Omega \rightarrow R}$ from the sample space ${\Omega}$ to the range ${R}$. Random variables taking values in ${{\bf R}}$ will be called real random variables or random real numbers; similarly, random variables taking values in ${{\bf C}}$ will be called complex random variables or random complex numbers. We refer to real and complex random variables collectively as scalar random variables.

We consider two events ${E,F}$ to be equal if they are modeled by the same set: ${E=F \iff E_\Omega = F_\Omega}$. Similarly, two random variables ${X,Y}$ taking values in a common range ${R}$ are considered to be equal if they are modeled by the same function: ${X=Y \iff X_\Omega = Y_\Omega}$. In particular, if the discrete sample space ${\Omega}$ is understood from context, we will usually abuse notation by identifying an event ${E}$ with its model ${E_\Omega}$, and similarly identify a random variable ${X}$ with its model ${X_\Omega}$.

Remark 2 One can view classical (deterministic) mathematics as the special case of discrete probability theory in which ${\Omega}$ is a singleton set (there is only one outcome ${\omega}$), and the probability assigned to the single outcome ${\omega}$ in ${\Omega}$ is ${1}$: ${p_\omega = 1}$. Then there are only two events (the surely true and surely false events), and a random variable in ${R}$ can be identified with a deterministic element of ${R}$. Thus we can view probability theory as a generalisation of deterministic mathematics.

As discussed in the preceding section, the distinction between a collection of events and random variables and their models becomes important if one ever wishes to modify the sample space, and in particular to extend the sample space to a larger space that can accommodate new sources of randomness (an operation which we will define formally later, but which for now can be thought of as an analogue to change of basis in linear algebra, coordinate change in differential geometry, or base change in algebraic geometry). This is best illustrated with a simple example.

Example 3 (Extending the sample space) Suppose one wishes to model the outcome ${X}$ of rolling a single, unbiased six-sided die using discrete probability theory. One can do this by choosing the discrete probability space ${\Omega}$ to be the six-element set ${\{1,2,3,4,5,6\}}$, with each outcome ${i \in \{1,2,3,4,5,6\}}$ given an equal probability of ${p_i := 1/6}$ of occurring; this outcome ${i}$ may be interpreted as the state in which the die roll ${X}$ ended up being equal to ${i}$. The outcome ${X}$ of rolling a die may then be identified with the identity function ${X_\Omega: \Omega \rightarrow \{1,\dots,6\}}$, defined by ${X_\Omega(i) := i}$ for ${i \in \Omega}$. If we let ${E}$ be the event that the outcome ${X}$ of rolling the die is an even number, then with this model ${\Omega}$ we have ${E_\Omega = \{2,4,6\}}$, and

$\displaystyle {\bf P}(E) = \sum_{\omega \in E_\Omega} p_\omega = 3 \times \frac{1}{6} = \frac{1}{2}.$

Now suppose that we wish to roll the die again to obtain a second random variable ${Y}$. The sample space ${\Omega = \{1,2,3,4,5,6\}}$ is inadequate for modeling both the original die roll ${X}$ and the second die roll ${Y}$. To accommodate this new source of randomness, we can then move to the larger discrete probability space ${\Omega' := \{1,\dots,6\} \times \{1,\dots,6\}}$, with each outcome ${(i,j) \in \Omega'}$ now having probability ${p'_{(i,j)} := \frac{1}{36}}$; this outcome ${(i,j)}$ can be interpreted as the state in which the die roll ${X}$ ended up being ${i}$, and the die roll ${Y}$ ended up being ${j}$. The random variable ${X}$ is now modeled by a new function ${X_{\Omega'}: \Omega' \rightarrow \{1,\dots,6\}}$ defined by ${X_{\Omega'}(i,j) := i}$ for ${(i,j) \in \Omega'}$; the random variable ${Y}$ is similarly modeled by the function ${Y_{\Omega'}: \Omega' \rightarrow \{1,\dots,6\}}$ defined by ${Y_{\Omega'}(i,j) := j}$ for ${(i,j) \in \Omega'}$. The event ${E}$ that ${X}$ is even is now modeled by the set

$\displaystyle E_{\Omega'} = \{2,4,6\} \times \{1,2,3,4,5,6\}.$

This set is distinct from the previous model ${E_\Omega}$ of ${E}$ (for instance, ${E_{\Omega'}}$ has eighteen elements, whereas ${E_\Omega}$ has just three), but the probability of ${E}$ is unchanged:

$\displaystyle {\bf P}(E) = \sum_{\omega' \in E_{\Omega'}} p'_{\omega'} = 18 \times\frac{1}{36} = \frac{1}{2}.$

One can of course also combine together the random variables ${X,Y}$ in various ways. For instance, the sum ${X+Y}$ of the two die rolls is a random variable taking values in ${\{2,\dots,12\}}$; it cannot be modeled by the sample space ${\Omega}$, but in ${\Omega'}$ it is modeled by the function

$\displaystyle (X+Y)_{\Omega'}: (i,j) \mapsto i+j.$

Similarly, the event ${X=Y}$ that the two die rolls are equal cannot be modeled by ${\Omega}$, but is modeled in ${\Omega'}$ by the set

$\displaystyle (X=Y)_{\Omega'} = \{ (i,i): i=1,\dots,6\}$

and the probability ${{\bf P}(X=Y)}$ of this event is

$\displaystyle {\bf P}(X=Y) = \sum_{\omega' \in (X=Y)_{\Omega'}} p'_{\omega'} = 6 \times \frac{1}{36} = \frac{1}{6}.$

We thus see that extending the probability space has also enlarged the space of events one can consider, as well as the random variables one can define, but that existing events and random variables continue to be interpretable in the extended model, and that probabilistic concepts such as the probability of an event remain unchanged by the extension of the model.
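
The computations of Example 3 can be reproduced directly. The sketch below stores each discrete probability space as a dictionary from outcomes to probabilities; that representation, and the names `omega`, `omega_ext`, `X_ext`, `Y_ext`, are choices made here for illustration only.

```python
from fractions import Fraction
from itertools import product

# Original model: one die roll, six equally likely outcomes.
omega = {i: Fraction(1, 6) for i in range(1, 7)}
# Extended model: two die rolls, all 36 pairs equally likely.
omega_ext = {(i, j): Fraction(1, 36) for i, j in product(range(1, 7), repeat=2)}

def probability(space, event_model):
    """Sum the outcome probabilities over the set modelling the event."""
    return sum(space[w] for w in event_model)

def X_ext(w):
    """Model of the first die roll in the extended space."""
    return w[0]

def Y_ext(w):
    """Model of the second die roll in the extended space."""
    return w[1]

E_model = {w for w in omega if w % 2 == 0}                  # "X is even" in omega
E_model_ext = {w for w in omega_ext if X_ext(w) % 2 == 0}   # same event, extended model
equal_model_ext = {w for w in omega_ext if X_ext(w) == Y_ext(w)}

print(len(E_model), probability(omega, E_model))
print(len(E_model_ext), probability(omega_ext, E_model_ext))
print(probability(omega_ext, equal_model_ext))
```

The two models of ${E}$ are different sets, but `probability` returns the same value for both, in line with the discussion above.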

The set-theoretic operations on the sample space ${\Omega}$ induce similar boolean operations on events:

• The conjunction ${E \wedge F}$ of two events ${E,F}$ is defined by intersection of their models: ${(E \wedge F)_\Omega := E_\Omega \cap F_\Omega}$.
• The disjunction ${E \vee F}$ of two events ${E,F}$ is defined by the union of their models: ${(E \vee F)_\Omega := E_\Omega \cup F_\Omega}$.
• The symmetric difference ${E \Delta F}$ of two events ${E,F}$ is defined by symmetric difference of their models: ${(E \Delta F)_\Omega := E_\Omega \Delta F_\Omega}$.
• The complement ${\overline{E}}$ of an event ${E}$ is defined by the complement of its model: ${\overline{E}_\Omega := \Omega \backslash E_\Omega}$.
• We say that one event ${E}$ is contained in or implies another event ${F}$, and write ${E \subset F}$, if we have containment of their models: ${E \subset F \iff E_\Omega \subset F_\Omega}$. We also write “${F}$ is true on ${E}$” synonymously with ${E \subset F}$.
• Two events ${E,F}$ are disjoint if their conjunction is the empty event, or equivalently if their models ${E_\Omega, F_\Omega}$ are disjoint.

Thus, for instance, the conjunction of the event that a die roll ${X}$ is even, and that it is less than ${3}$, is the event that the die roll is exactly ${2}$. As before, we will usually be in a situation in which the sample space ${\Omega}$ is clear from context, and in that case one can safely identify events with their models, and view the symbols ${\vee}$ and ${\wedge}$ as being synonymous with their set-theoretic counterparts ${\cup}$ and ${\cap}$ (this is for instance what is done in Durrett).

With these operations, the space of all events (known as the event space) thus has the structure of a boolean algebra (defined below in Definition 4). We observe that the probability ${{\bf P}}$ is finitely additive in the sense that

$\displaystyle {\bf P}( E \vee F ) = {\bf P}(E) + {\bf P}(F)$

whenever ${E,F}$ are disjoint events; by induction this implies that

$\displaystyle {\bf P}(E_1 \vee \dots \vee E_n) = {\bf P}(E_1) + \dots + {\bf P}(E_n)$

whenever ${E_1,\dots,E_n}$ are pairwise disjoint events. We have ${{\bf P}(\emptyset)=0}$ and ${{\bf P}(\overline{\emptyset})=1}$, and more generally

$\displaystyle {\bf P}(\overline{E}) = 1 - {\bf P}(E)$

for any event ${E}$. We also have monotonicity: if ${E \subset F}$, then ${{\bf P}(E) \leq {\bf P}(F)}$.
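
When events are identified with their models, the boolean operations and the basic properties of ${{\bf P}}$ can be checked with ordinary set operations. The following sketch assumes the fair-die model of Example 3; the event names are hypothetical.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))          # fair die, as in Example 3
p = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(E) as the sum of p_w over the model of E."""
    return sum(p[w] for w in event)

even = frozenset(w for w in omega if w % 2 == 0)
less_than_3 = frozenset(w for w in omega if w < 3)
odd = omega - even                      # complement of `even`

conjunction = even & less_than_3        # models the conjunction of the two events
disjunction = even | less_than_3        # models the disjunction of the two events

# Finite additivity for the disjoint events `even` and `odd`.
assert prob(even | odd) == prob(even) + prob(odd)
# Complement rule and monotonicity.
assert prob(odd) == 1 - prob(even)
assert conjunction <= even and prob(conjunction) <= prob(even)
print(sorted(conjunction), sorted(disjunction))
```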

Now we define operations on random variables. Whenever one has a function ${f: R \rightarrow S}$ from one range ${R}$ to another ${S}$, and a random variable ${X}$ taking values in ${R}$, one can define a random variable ${f(X)}$ taking values in ${S}$ by composing the relevant models:

$\displaystyle f(X)_\Omega := f \circ X_\Omega,$

thus ${f(X)_\Omega}$ maps ${\omega}$ to ${f(X_\Omega(\omega))}$ for any outcome ${\omega \in \Omega}$. Given a finite number ${X_1,\dots,X_n}$ of random variables taking values in ranges ${R_1,\dots,R_n}$, we can form the joint random variable ${(X_1,\dots,X_n)}$ taking values in the Cartesian product ${R_1 \times \dots \times R_n}$ by concatenation of the models, thus

$\displaystyle (X_1,\dots,X_n)_\Omega: \omega \mapsto ((X_1)_\Omega(\omega),\dots, (X_n)_{\Omega}(\omega)).$

Combining these two operations, given any function ${f: R_1 \times \dots \times R_n \rightarrow S}$ of ${n}$ variables in ranges ${R_1,\dots,R_n}$, and random variables ${X_1,\dots,X_n}$ taking values in ${R_1,\dots,R_n}$ respectively, we can form a random variable ${f(X_1,\dots,X_n)}$ taking values in ${S}$ by the formula

$\displaystyle f(X_1,\dots,X_n)_\Omega: \omega \mapsto f((X_1)_\Omega(\omega),\dots, (X_n)_{\Omega}(\omega)).$

Thus for instance we can add, subtract, or multiply two scalar random variables to obtain another scalar random variable.

A deterministic element ${x}$ of a range ${R}$ will (by abuse of notation) be identified with a random variable ${x}$ taking values in ${R}$, whose model in ${\Omega}$ is constant: ${x_\Omega(\omega) = x}$ for all ${\omega \in \Omega}$. Thus for instance ${37}$ is a scalar random variable.

Given a relation ${F: R_1 \times \dots \times R_n \rightarrow \{ \hbox{ true}, \hbox{ false}\}}$ on ${n}$ ranges ${R_1,\dots,R_n}$, and random variables ${X_1,\dots,X_n}$ taking values in ${R_1,\dots,R_n}$ respectively, we can define the event ${F(X_1,\dots,X_n)}$ by setting

$\displaystyle F(X_1,\dots,X_n)_\Omega :=\{ \omega \in \Omega: F((X_1)_\Omega(\omega),\dots,(X_n)_\Omega(\omega)) \hbox{ true}\}.$

Thus for instance, for two real random variables ${X,Y}$, the event ${X > Y}$ is modeled as

$\displaystyle (X>Y)_\Omega := \{ \omega \in \Omega: X_\Omega(\omega) > Y_\Omega(\omega) \}$

and the event ${X=Y}$ is modeled as

$\displaystyle (X=Y)_\Omega := \{ \omega \in \Omega: X_\Omega(\omega) = Y_\Omega(\omega) \}.$

At this point we encounter a slight notational conflict between the dual role of the equality symbol ${=}$ as a logical symbol and as a binary relation: we are interpreting ${X=Y}$ both as an external equality relation between the two random variables (which is true iff the functions ${X_\Omega}$, ${Y_\Omega}$ are identical), and as an internal event (modeled by ${(X=Y)_\Omega}$). However, it is clear that ${X=Y}$ is true in the external sense if and only if the internal event ${X=Y}$ is surely true. As such, we shall abuse notation and continue to use the equality symbol for both the internal and external concepts of equality (and use the modifier “surely” for emphasis when referring to the external usage).

It is clear that any equational identity concerning functions or operations on deterministic variables implies the same identity (in the external, or surely true, sense) for random variables. For instance, the commutativity of addition ${x+y=y+x}$ for deterministic real numbers ${x,y}$ immediately implies the commutativity of addition for random variables: ${X+Y=Y+X}$ is surely true for real random variables ${X,Y}$; similarly ${X+0=X}$ is surely true for all scalar random variables ${X}$, etc. We will freely apply the usual laws of algebra for scalar random variables without further comment.

Given an event ${E}$, we can associate the indicator random variable ${1_E}$ (also written as ${{\bf I}(E)}$ in some texts) to be the unique real random variable such that ${1_E=1}$ when ${E}$ is true and ${1_E=0}$ when ${E}$ is false, thus ${(1_E)_\Omega(\omega)}$ is equal to ${1}$ when ${\omega \in E_\Omega}$ and ${0}$ otherwise. (The indicator random variable is sometimes called the characteristic function in analysis, and sometimes denoted ${\chi_E}$ instead of ${1_E}$, but we avoid using the term “characteristic function” here, as it will have an unrelated but important meaning in probability theory.) We record the trivial but useful fact that Boolean operations on events correspond to arithmetic manipulations on their indicators. For instance, if ${E,F}$ are events, we have

$\displaystyle 1_{E \wedge F} = 1_E 1_F$

$\displaystyle 1_{\overline{E}} = 1 - 1_E$

and the inclusion-exclusion principle

$\displaystyle 1_{E \vee F} = 1_E + 1_F - 1_{E \wedge F}. \ \ \ \ \ (1)$

In particular, if the events ${E,F}$ are disjoint, then

$\displaystyle 1_{E \vee F} = 1_E + 1_F.$

Also note that ${E \subset F}$ if and only if the assertion ${1_E \leq 1_F}$ is surely true. We will use these identities and equivalences throughout the course without further comment.

Given a scalar random variable ${X}$, we can attempt to define the expectation ${{\bf E}(X)}$ through the model ${X_\Omega}$ by the formula

$\displaystyle {\bf E}(X) := \sum_{\omega \in \Omega} X_\Omega(\omega) p_\omega.$

If the discrete sample space ${\Omega}$ is finite, then this sum is always well-defined and so every scalar random variable has an expectation. If however the discrete sample space ${\Omega}$ is infinite, the expectation may not be well defined. There are however two key cases in which one has a meaningful expectation. The first is if the random variable ${X}$ is unsigned, that is to say it takes values in the non-negative reals ${[0,+\infty)}$, or more generally in the extended non-negative real line ${[0,+\infty]}$. In that case, one can interpret the expectation ${{\bf E}(X)}$ as an element of ${[0,+\infty]}$. The other case is when the random variable ${X}$ is absolutely integrable, which means that the absolute value ${|X|}$ (which is an unsigned random variable) has finite expectation: ${{\bf E} |X| < \infty}$. In that case, the series defining ${{\bf E}(X)}$ is absolutely convergent to a real or complex number (depending on whether ${X}$ was a real or complex random variable.)

We have the basic link

$\displaystyle {\bf P}(E) = {\bf E}(1_E)$

between probability and expectation, valid for any event ${E}$. We also have the obvious, but fundamentally important, property of linearity of expectation: we have

$\displaystyle {\bf E}(cX) = c {\bf E}(X)$

and

$\displaystyle {\bf E}(X+Y) = {\bf E}(X) + {\bf E}(Y)$

whenever ${c}$ is a scalar and ${X,Y}$ are scalar random variables, either under the assumption that ${c,X,Y}$ are all unsigned, or that ${X,Y}$ are absolutely integrable. Thus for instance by applying expectations to (1) we obtain the identity

$\displaystyle {\bf P}(E \vee F) = {\bf P}(E) + {\bf P}(F) - {\bf P}(E \wedge F).$
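
The passage from the indicator identity (1) to the probability identity above can be traced in a few lines. This is a sketch on the fair-die model; the events `E`, `F` and the helper names `expectation` and `indicator` are hypothetical.

```python
from fractions import Fraction

omega = range(1, 7)
p = {w: Fraction(1, 6) for w in omega}

def expectation(X):
    """E(X) as the sum over outcomes of X(w) * p_w."""
    return sum(X(w) * p[w] for w in omega)

def indicator(event_model):
    """Model of the indicator random variable of an event."""
    return lambda w: 1 if w in event_model else 0

E, F = {2, 4, 6}, {1, 2}
one_E, one_F = indicator(E), indicator(F)
one_both = indicator(E & F)
one_either = indicator(E | F)

# Identity (1) holds at every outcome, i.e. it is surely true.
assert all(one_either(w) == one_E(w) + one_F(w) - one_both(w) for w in omega)

# Taking expectations and using linearity gives the probability identity.
lhs = expectation(one_either)
rhs = expectation(one_E) + expectation(one_F) - expectation(one_both)
print(lhs, rhs)
```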

We close this section by noting that discrete probabilistic models stumble when trying to model continuous random variables, which take on an uncountable number of values. Suppose for instance one wants to model a random real number ${X}$ drawn uniformly at random from the unit interval ${[0,1]}$, which is an uncountable set. One would then expect, for any subinterval ${[a,b]}$ of ${[0,1]}$, that ${X}$ will fall into this interval with probability ${b-a}$. Setting ${a=b}$ (or, if one wishes instead, taking a limit such as ${b \rightarrow a^+}$), we conclude in particular that for any real number ${a}$ in ${[0,1]}$, that ${X}$ will equal ${a}$ with probability ${0}$. If one attempted to model this situation by a discrete probability model, we would find that each outcome ${\omega}$ of the discrete sample space ${\Omega}$ has to occur with probability ${p_\omega = 0}$ (since for each ${\omega}$, the random variable ${X}$ has only a single value ${X_\Omega(\omega)}$). But we are also requiring that the sum ${\sum_{\omega \in \Omega} p_\omega}$ is equal to ${1}$, a contradiction. In order to address this defect we must generalise from discrete models to more general probabilistic models, to which we now turn.

— 3. The Kolmogorov foundations of probability theory —

We now present the more general measure-theoretic foundation of Kolmogorov which subsumes the discrete theory, while also allowing one to model continuous random variables. It turns out that in order to perform sums, limits and integrals properly, the finite additivity property of probability needs to be amplified to countable additivity (but, as we shall see, uncountable additivity is too strong of a property to ask for).

We begin with the notion of a measurable space. (See also this previous blog post, which covers similar material from the perspective of a real analysis graduate class rather than a probability class.)

Definition 4 (Measurable space) Let ${\Omega}$ be a set. A Boolean algebra in ${\Omega}$ is a collection ${{\mathcal F}}$ of subsets of ${\Omega}$ which

• contains ${\emptyset}$ and ${\Omega}$;
• is closed under pairwise unions and intersections (thus if ${E,F \in {\mathcal F}}$, then ${E \cup F}$ and ${E \cap F}$ also lie in ${{\mathcal F}}$); and
• is closed under complements (thus if ${E \in {\mathcal F}}$, then ${\Omega \backslash E}$ also lies in ${{\mathcal F}}$).

(Note that some of these assumptions are redundant and can be dropped, thanks to de Morgan’s laws.) A ${\sigma}$-algebra in ${\Omega}$ (also known as a ${\sigma}$-field) is a Boolean algebra ${{\mathcal F}}$ in ${\Omega}$ which is also

• closed under countable unions and countable intersections (thus if ${E_1,E_2,E_3,\dots \in {\mathcal F}}$, then ${\bigcup_{n=1}^\infty E_n \in {\mathcal F}}$ and ${\bigcap_{n=1}^\infty E_n \in {\mathcal F}}$).

Again, thanks to de Morgan’s laws, one only needs to verify closure under just countable union (or just countable intersection) in order to verify that a Boolean algebra is a ${\sigma}$-algebra. A measurable space is a pair ${(\Omega, {\mathcal F})}$, where ${\Omega}$ is a set and ${{\mathcal F}}$ is a ${\sigma}$-algebra in ${\Omega}$. Elements of ${{\mathcal F}}$ are referred to as measurable sets in this measurable space.

If ${{\mathcal F}, {\mathcal F}'}$ are two ${\sigma}$-algebras in ${\Omega}$, we say that ${{\mathcal F}}$ is coarser than ${{\mathcal F}'}$ (or ${{\mathcal F}'}$ is finer than ${{\mathcal F}}$) if ${{\mathcal F} \subset {\mathcal F}'}$, thus every set that is measurable in ${(\Omega,{\mathcal F})}$ is also measurable in ${(\Omega,{\mathcal F}')}$.

Example 5 (Trivial measurable space) Given any set ${\Omega}$, the collection ${\{\emptyset, \Omega\}}$ is a ${\sigma}$-algebra; in fact it is the coarsest ${\sigma}$-algebra one can place on ${\Omega}$. We refer to ${(\Omega, \{\emptyset,\Omega\})}$ as the trivial measurable space on ${\Omega}$.

Example 6 (Discrete measurable space) At the other extreme, given any set ${\Omega}$, the power set ${2^\Omega := \{ E: E \subset \Omega\}}$ is a ${\sigma}$-algebra (and is the finest ${\sigma}$-algebra one can place on ${\Omega}$). We refer to ${(\Omega, 2^\Omega)}$ as the discrete measurable space on ${\Omega}$.

Example 7 (Atomic measurable spaces) Suppose we have a partition ${\Omega = \biguplus_{\alpha \in A} E_\alpha}$ of a set ${\Omega}$ into disjoint subsets ${E_\alpha}$ (which we will call atoms), indexed by some label set ${A}$ (which may be finite, countable, or uncountable). Such a partition defines a ${\sigma}$-algebra on ${\Omega}$, consisting of all sets of the form ${\bigcup_{\alpha \in B} E_\alpha}$ for subsets ${B}$ of ${A}$ (we allow ${B}$ to be empty); thus a set is measurable here if and only if it can be described as a union of atoms. One can easily verify that this is indeed a ${\sigma}$-algebra. The trivial and discrete measurable spaces in the preceding two examples are special cases of this atomic construction, corresponding to the trivial partition ${\Omega = \Omega}$ (in which there is just one atom ${\Omega}$) and the discrete partition ${\Omega = \biguplus_{x \in \Omega} \{x\}}$ (in which the atoms are individual points in ${\Omega}$).

Example 8 Let ${\Omega}$ be an uncountable set, and let ${{\mathcal F}}$ be the collection of sets in ${\Omega}$ which are either at most countable, or are cocountable (their complement is at most countable). Show that this is a ${\sigma}$-algebra on ${\Omega}$ which is non-atomic (i.e. it is not of the form of the preceding example).

Example 9 (Generated measurable spaces) It is easy to see that if one has a non-empty family ${({\mathcal F}_\alpha)_{\alpha \in A}}$ of ${\sigma}$-algebras on a set ${\Omega}$, then their intersection ${\bigcap_{\alpha \in A} {\mathcal F}_\alpha}$ is also a ${\sigma}$-algebra, even if ${A}$ is uncountably infinite. Because of this, whenever one has an arbitrary collection ${{\mathcal A}}$ of subsets in ${\Omega}$, one can define the ${\sigma}$-algebra ${\langle {\mathcal A} \rangle}$ generated by ${{\mathcal A}}$ to be the intersection of all the ${\sigma}$-algebras that contain ${{\mathcal A}}$ (note that there is always at least one ${\sigma}$-algebra participating in this intersection, namely the discrete ${\sigma}$-algebra). Equivalently, ${\langle {\mathcal A} \rangle}$ is the coarsest ${\sigma}$-algebra that views every set in ${{\mathcal A}}$ as being measurable. (This is a rather indirect way to describe ${\langle {\mathcal A} \rangle}$, as it does not make it easy to figure out exactly what sets lie in ${\langle {\mathcal A} \rangle}$. There is a more direct description of this ${\sigma}$-algebra, but it requires the use of the first uncountable ordinal; see Exercise 15 of these notes.) In Durrett, the notation ${\sigma({\mathcal A})}$ is used in place of ${\langle {\mathcal A}\rangle}$.
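When ${\Omega}$ is finite, the generated ${\sigma}$-algebra can be computed directly, since countable unions reduce to finite ones. The following Python sketch (with the hypothetical helper name `generate_sigma_algebra`) simply closes the generators under complements and pairwise unions until nothing new appears. It is only an illustration for finite sets, and says nothing about the transfinite construction needed in general.

```python
from itertools import combinations

def generate_sigma_algebra(omega, generators):
    """Smallest family of subsets of a finite set omega that contains the
    generators and is closed under complements and (finite) unions."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        # Close under complements.
        for E in current:
            complement = omega - E
            if complement not in family:
                family.add(complement)
                changed = True
        # Close under pairwise unions.
        for E, F in combinations(current, 2):
            union = E | F
            if union not in family:
                family.add(union)
                changed = True
    return family

# Hypothetical example: Omega = {1,2,3,4}, generated by {1} and {1,2}.
sigma = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
for E in sorted(sigma, key=lambda s: (len(s), sorted(s))):
    print(sorted(E))
```

In this example the result is the atomic ${\sigma}$-algebra with atoms ${\{1\}}$, ${\{2\}}$, ${\{3,4\}}$, so it contains ${2^3 = 8}$ sets; in particular ${\{3\}}$ is not measurable.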

Example 10 (Borel ${\sigma}$-algebra) Let ${\Omega}$ be a topological space; to avoid pathologies let us assume that ${\Omega}$ is locally compact Hausdorff and ${\sigma}$-compact, though the definition below also can be made for more general spaces. For instance, one could take ${\Omega = {\bf R}^n}$ or ${\Omega = {\bf C}^n}$ for some finite ${n}$. We define the Borel ${\sigma}$-algebra on ${\Omega}$ to be the ${\sigma}$-algebra generated by the open sets of ${\Omega}$. (Due to our topological hypotheses on ${\Omega}$, the Borel ${\sigma}$-algebra is also generated by the compact sets of ${\Omega}$.) Measurable subsets in the Borel ${\sigma}$-algebra are known as Borel sets. Thus for instance open and closed sets are Borel, and countable unions and countable intersections of Borel sets are Borel. In fact, as a rule of thumb, any subset of ${{\bf R}^n}$ or ${{\bf C}^n}$ that arises from a “non-pathological” construction (not using the axiom of choice, or from a deliberate attempt to build a non-Borel set) can be expected to be a Borel set. Nevertheless, non-Borel sets exist in abundance if one looks hard enough for them, even without the axiom of choice; see for instance Exercise 16 of this previous blog post.

The following exercise gives a useful tool (somewhat analogous to mathematical induction) to verify properties regarding measurable sets in generated ${\sigma}$-algebras, such as Borel ${\sigma}$-algebras.

Exercise 11 Let ${{\mathcal A}}$ be a collection of subsets of a set ${\Omega}$, and let ${P(E)}$ be a property of subsets ${E}$ of ${\Omega}$ (thus ${P(E)}$ is true or false for each ${E \subset \Omega}$). Assume the following axioms:

• ${P(\emptyset)}$ is true.
• ${P(E)}$ is true for all ${E \in {\mathcal A}}$.
• If ${E \subset \Omega}$ is such that ${P(E)}$ is true, then ${P(\Omega \backslash E)}$ is also true.
• If ${E_1, E_2, \dots \subset \Omega}$ are such that ${P(E_n)}$ is true for all ${n}$, then ${P(\bigcup_n E_n)}$ is true.

Show that ${P(E)}$ is true for all ${E \in \langle {\mathcal A} \rangle}$. (Hint: what can one say about ${\{ E \subset \Omega: P(E) \hbox{ true}\}}$?)

Thus, for instance, if a property of subsets of ${{\bf R}^n}$ is true for all open sets, and is closed under countable unions and complements, then it is automatically true for all Borel sets.

Example 12 (Pullback) Let ${(R, {\mathcal B})}$ be a measurable space, and let ${\phi: \Omega \rightarrow R}$ be any function from another set ${\Omega}$ to ${R}$. Then we can define the pullback ${\phi^*({\mathcal B})}$ of the ${\sigma}$-algebra ${{\mathcal B}}$ to be the collection of all subsets in ${\Omega}$ that are of the form ${\phi^{-1}(S)}$ for some ${S \in {\mathcal B}}$. This is easily verified to be a ${\sigma}$-algebra. We refer to the measurable space ${(\Omega, \phi^*({\mathcal B}))}$ as the pullback of the measurable space ${(R, {\mathcal B})}$ by ${\phi}$. Thus for instance an atomic measurable space on ${\Omega}$ generated by a partition ${\Omega = \biguplus_{\alpha \in A} E_\alpha}$ is the pullback of ${A}$ (viewed as a discrete measurable space) by the “colouring” map from ${\Omega}$ to ${A}$ that sends each element of ${E_\alpha}$ to ${\alpha}$ for all ${\alpha \in A}$.

Remark 13 In probabilistic terms, one can interpret the space ${\Omega}$ in the above construction as a sample space, and the function ${\phi}$ as some collection of “random variables” or “measurements” on that space, with ${R}$ being all the possible outcomes of these measurements. The pullback then represents all the “information” one can extract from that given set of measurements.
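To make the pullback of Example 12 concrete, here is a small Python sketch on finite sets. The sample space ${\{0,\dots,5\}}$, the parity "colouring" map, and the helper names `power_set` and `pullback` are assumptions chosen purely for illustration.

```python
from itertools import chain, combinations

def power_set(points):
    """The discrete sigma-algebra 2^R on a finite set R."""
    points = list(points)
    subsets = chain.from_iterable(
        combinations(points, k) for k in range(len(points) + 1)
    )
    return {frozenset(s) for s in subsets}

def pullback(omega, phi, sigma_algebra):
    """phi^*(B): the collection of preimages phi^{-1}(S) for S in B."""
    return {frozenset(w for w in omega if phi(w) in S) for S in sigma_algebra}

omega = range(6)             # hypothetical sample space {0, ..., 5}
colour = lambda w: w % 2     # hypothetical colouring map into A = {0, 1}
pulled_back = pullback(omega, colour, power_set({0, 1}))

for E in sorted(pulled_back, key=lambda s: (len(s), sorted(s))):
    print(sorted(E))
```

The four sets obtained (the empty set, the even outcomes, the odd outcomes, and the whole space) are exactly the unions of the atoms ${\{0,2,4\}}$ and ${\{1,3,5\}}$. This matches the interpretation above: a measurement that only reports parity cannot distinguish outcomes within an atom.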

Example 14 (Product space) Let ${(R_\alpha, {\mathcal B}_\alpha)_{\alpha \in A}}$ be a family of measurable spaces indexed by a (possibly infinite or uncountable) set ${A}$. We define the product ${(\prod_{\alpha \in A} R_\alpha, \prod_{\alpha \in A} {\mathcal B}_\alpha)}$ on the Cartesian product space ${\prod_{\alpha \in A} R_\alpha}$ by defining ${\prod_{\alpha \in A} {\mathcal B}_\alpha}$ to be the ${\sigma}$-algebra generated by the basic cylinder sets of the form

$\displaystyle \{ (x_\alpha)_{\alpha \in A} \in \prod_{\alpha \in A} R_\alpha: x_\beta \in E_\beta \}$

for ${\beta \in A}$ and ${E_\beta \in {\mathcal B}_\beta}$. For instance, given two measurable spaces ${(R_1, {\mathcal B}_1)}$ and ${(R_2, {\mathcal B}_2)}$, the product ${\sigma}$-algebra ${{\mathcal B}_1 \times {\mathcal B}_2}$ is generated by the sets ${E_1 \times R_2}$ and ${R_1 \times E_2}$ for ${E_1 \in {\mathcal B}_1, E_2 \in {\mathcal B}_2}$. (One can also show that ${{\mathcal B}_1 \times {\mathcal B}_2}$ is the ${\sigma}$-algebra generated by the products ${E_1 \times E_2}$ for ${E_1 \in {\mathcal B}_1, E_2 \in {\mathcal B}_2}$, but this observation does not extend to uncountable products of measurable spaces.)

Exercise 15 Show that ${{\bf R}^n}$ with the Borel ${\sigma}$-algebra is the product of ${n}$ copies of ${{\bf R}}$ with the Borel ${\sigma}$-algebra.

As with almost any other notion of space in mathematics, there is a natural notion of a map (or morphism) between measurable spaces.

Definition 16 A function ${\phi: \Omega \rightarrow R}$ between two measurable spaces ${(\Omega, {\mathcal F})}$, ${(R,{\mathcal B})}$ is said to be measurable if one has ${\phi^{-1}(S) \in {\mathcal F}}$ for all ${S \in {\mathcal B}}$.

Thus for instance the pullback of a measurable space ${R}$ by a map ${\phi: \Omega \rightarrow R}$ could alternatively be defined as the coarsest measurable space structure on ${\Omega}$ for which ${\phi}$ is still measurable. It is clear that the composition of measurable functions is also measurable.

Exercise 17 Show that any continuous map between topological spaces ${X, Y}$ is measurable (when one gives ${X}$ and ${Y}$ the Borel ${\sigma}$-algebras).

Exercise 18 If ${X_1: \Omega \rightarrow R_1, \dots, X_n: \Omega \rightarrow R_n}$ are measurable functions into measurable spaces ${R_1,\dots,R_n}$, show that the joint function ${(X_1,\dots,X_n): \Omega \rightarrow R_1 \times \dots \times R_n}$ into the product space ${R_1 \times \dots \times R_n}$ defined by ${(X_1,\dots,X_n): \omega \mapsto (X_1(\omega),\dots,X_n(\omega))}$ is also measurable.

As a corollary of the above exercise, we see that if ${X_1: \Omega \rightarrow R_1, \dots, X_n: \Omega \rightarrow R_n}$ are measurable, and ${F: R_1 \times \dots \times R_n \rightarrow S}$ is measurable, then ${F(X_1,\dots,X_n): \Omega \rightarrow S}$ is also measurable. In particular, if ${X_1, X_2: \Omega \rightarrow {\bf R}}$ or ${X_1,X_2: \Omega \rightarrow {\bf C}}$ are scalar measurable functions, then so are ${X_1 +X_2}$, ${X_1 - X_2}$, ${X_1 \cdot X_2}$, etc.

Next, we turn measurable spaces into measure spaces by adding a measure.

Definition 19 (Measure spaces) Let ${(\Omega, {\mathcal F})}$ be a measurable space. A finitely additive measure on this space is a map ${\mu: {\mathcal F} \rightarrow [0,+\infty]}$ obeying the following axioms:

• (Empty set) ${\mu(\emptyset)=0}$.
• (Finite additivity) If ${E, F \in {\mathcal F}}$ are disjoint, then ${\mu(E \cup F) = \mu(E) + \mu(F)}$.

A countably additive measure is a finitely additive measure ${\mu: {\mathcal F} \rightarrow [0,+\infty]}$ obeying the following additional axiom:

• (Countable additivity) If ${E_1, E_2, E_3, \dots \in {\mathcal F}}$ are disjoint, then ${\mu(\bigcup_{n=1}^\infty E_n) = \sum_{n=1}^\infty \mu(E_n)}$.

A probability measure on ${\Omega}$ is a countably additive measure ${\mu: {\mathcal F} \rightarrow [0,+\infty]}$ obeying the following additional axiom:

• (Unit total probability) ${\mu(\Omega)=1}$.

A measure space is a triplet ${\Omega = (\Omega, {\mathcal F}, \mu)}$ where ${(\Omega,{\mathcal F})}$ is a measurable space and ${\mu}$ is a measure on that space. If ${\mu}$ is furthermore a probability measure, we call ${(\Omega, {\mathcal F}, \mu)}$ a probability space.

Example 20 (Discrete probability measures) Let ${\Omega}$ be a discrete measurable space, and for each ${\omega \in \Omega}$, let ${p_\omega}$ be a non-negative real number such that ${\sum_{\omega \in \Omega} p_\omega = 1}$. (Note that this implies that there are at most countably many ${\omega}$ for which ${p_\omega > 0}$ – why?) Then one can form a probability measure ${\mu}$ on ${\Omega}$ by defining

$\displaystyle \mu(E) := \sum_{\omega \in E} p_\omega$

for all ${E \subset \Omega}$.
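For a countably infinite illustration of this example, one can take ${\Omega = \{1,2,3,\dots\}}$ with ${p_n = 2^{-n}}$. The sketch below is an assumption-laden toy: the weights, the truncation level `N`, and the helper name `mu_truncated` are all hypothetical. It computes partial sums of ${\mu(E)}$ exactly and checks additivity on the disjoint events "even" and "odd".

```python
from fractions import Fraction

def p(n):
    """Point mass p_n = 2^{-n} on Omega = {1, 2, 3, ...}; these sum to 1."""
    return Fraction(1, 2 ** n)

def mu_truncated(event, N):
    """Partial sum over n = 1..N of mu(E) = sum_{n in E} p_n,
    where the event E is given as a predicate on outcomes."""
    return sum(p(n) for n in range(1, N + 1) if event(n))

N = 60  # truncation level; the neglected tail has total mass 2^{-N}
total = mu_truncated(lambda n: True, N)
even = mu_truncated(lambda n: n % 2 == 0, N)
odd = mu_truncated(lambda n: n % 2 == 1, N)

print(float(total), float(even), float(odd))
```

The truncated values are within ${2^{-60}}$ of the true values ${\mu(\Omega) = 1}$, ${\mu(\hbox{even}) = 1/3}$ and ${\mu(\hbox{odd}) = 2/3}$, and the partial sums for the two disjoint events add up exactly to the partial sum for the whole space.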

Example 21 (Lebesgue measure) Let ${{\bf R}}$ be given the Borel ${\sigma}$-algebra. Then it turns out there is a unique measure ${m}$ on ${{\bf R}}$, known as Lebesgue measure (or more precisely, the restriction of Lebesgue measure to the Borel ${\sigma}$-algebra) such that ${m([a,b]) = b-a}$ for every closed interval ${[a,b]}$ with ${-\infty \leq a \leq b \leq \infty}$ (this is also true if one uses open intervals or half-open intervals in place of closed intervals). More generally, there is a unique measure ${m^n}$ on ${{\bf R}^n}$ for any natural number ${n}$, also known as Lebesgue measure, such that

$\displaystyle m^n( [a_1,b_1] \times \dots \times[a_n,b_n]) = (b_1-a_1) \times \dots \times (b_n-a_n)$

for all closed boxes ${[a_1,b_1] \times \dots \times[a_n,b_n]}$, that is to say products of ${n}$ closed intervals. The construction of Lebesgue measure is a little tricky; see this previous blog post for details.

We can then set up general probability theory similarly to how we set up discrete probability theory:

Definition 22 (Probability theory) In probability theory, we choose an ambient probability space ${\Omega = (\Omega,{\mathcal F},\mu)}$ as the randomness model, and refer to the set ${\Omega}$ (without the additional structures ${{\mathcal F}}$, ${\mu}$) as the sample space for that model. We then model an event ${E}$ by an element ${E_\Omega}$ of the ${\sigma}$-algebra ${{\mathcal F}}$. The probability ${{\bf P}(E)}$ of an event ${E}$ is defined to be the quantity

$\displaystyle {\bf P}(E) := \mu( E_\Omega ).$

An event ${E}$ is surely true or is the sure event if ${E_\Omega = \Omega}$, and is surely false or is the empty event if ${E_\Omega =\emptyset}$. It is almost surely true or an almost sure event if ${{\bf P}(E) = 1}$, and almost surely false or a null event if ${{\bf P}(E)=0}$.

We model random variables ${X}$ taking values in the range ${R}$ by measurable functions ${X_\Omega: \Omega \rightarrow R}$ from the sample space ${\Omega}$ to the range ${R}$. We define real, complex, and scalar random variables as in the discrete case.

As in the discrete case, we consider two events ${E,F}$ to be equal if they are modeled by the same set: ${E=F \iff E_\Omega = F_\Omega}$. Similarly, two random variables ${X,Y}$ taking values in a common range ${R}$ are considered to be equal if they are modeled by the same function: ${X=Y \iff X_\Omega = Y_\Omega}$. Again, if the sample space ${\Omega}$ is understood from context, we will usually abuse notation by identifying an event ${E}$ with its model ${E_\Omega}$, and similarly identify a random variable ${X}$ with its model ${X_\Omega}$.

As in the discrete case, set-theoretic operations on the sample space induce similar boolean operations on events. Furthermore, since the ${\sigma}$-algebra ${{\mathcal F}}$ is closed under countable unions and countable intersections, we may similarly define the countable conjunction ${\bigwedge_{n=1}^\infty E_n}$ or countable disjunction ${\bigvee_{n=1}^\infty E_n}$ of a sequence ${E_1,E_2,\dots}$ of events; however, we do not define uncountable conjunctions or disjunctions as these may not be well-defined as events.

The axioms of a probability space then yield the Kolmogorov axioms for probability:

• ${{\bf P}(\emptyset)=0}$.
• ${{\bf P}(\overline{\emptyset})=1}$.
• If ${E_1,E_2,\dots}$ are disjoint events, then ${{\bf P}(\bigvee_{n=1}^\infty E_n) = \sum_{n=1}^\infty {\bf P}(E_n)}$.

We can manipulate random variables just as in the discrete case, with the only caveat being that we have to restrict attention to measurable operations. For instance, if ${X}$ is a random variable taking values in a measurable space ${R}$, and ${f: R \rightarrow S}$ is a measurable map, then ${f(X)}$ is well defined as a random variable taking values in ${S}$. Similarly, if ${f: R_1 \times \dots \times R_n \rightarrow S}$ is a measurable map and ${X_1,\dots,X_n}$ are random variables taking values in ${R_1,\dots,R_n}$ respectively, then ${f(X_1,\dots,X_n)}$ is a random variable taking values in ${S}$. Similarly we can create events ${F(X_1,\dots,X_n)}$ out of measurable relations ${F: R_1 \times \dots \times R_n \rightarrow \{ \hbox{true}, \hbox{false}\}}$ (giving the boolean range ${\{ \hbox{true}, \hbox{false}\}}$ the discrete ${\sigma}$-algebra, of course). Finally, we continue to view deterministic elements of a space ${X}$ as a special case of a random element of ${X}$, and associate the indicator random variable ${1_E}$ to any event ${E}$ as before.

We say that two random variables ${X,Y}$ agree almost surely if the event ${X=Y}$ is almost surely true; this is an equivalence relation. In many cases we are willing to consider random variables up to almost sure equivalence. In particular, we can generalise the notion of a random variable slightly by considering random variables ${X}$ whose models ${X_\Omega: \Omega \rightarrow R}$ are only defined almost surely, i.e. their domain is not all of ${\Omega}$, but instead ${\Omega}$ with a set of measure zero removed. This is, technically, not a random variable as we have defined it, but it can be associated canonically with an equivalence class of random variables up to almost sure equivalence, and so we view such objects as random variables “up to almost sure equivalence”. Similarly, we declare two events ${E}$ and ${F}$ almost surely equivalent if their symmetric difference ${E \Delta F}$ is a null event, and will often consider events up to almost sure equivalence only.

We record some simple consequences of the measure-theoretic axioms:

Exercise 23 Let ${(\Omega, {\mathcal F}, \mu)}$ be a measure space.

1. (Monotonicity) If ${E \subset F}$ are measurable, then ${\mu(E) \leq \mu(F)}$.
2. (Subadditivity) If ${E_1,E_2,\dots}$ are measurable (not necessarily disjoint), then ${\mu(\bigcup_{n=1}^\infty E_n) \leq \sum_{n=1}^\infty \mu(E_n)}$.
3. (Continuity from below) If ${E_1 \subset E_2 \subset \dots }$ are measurable, then ${\mu(\bigcup_{n=1}^\infty E_n) = \lim_{n \rightarrow \infty} \mu(E_n)}$.
4. (Continuity from above) If ${E_1 \supset E_2 \supset \dots}$ are measurable and ${\mu(E_1)}$ is finite, then ${\mu(\bigcap_{n=1}^\infty E_n) = \lim_{n \rightarrow \infty} \mu(E_n)}$. Give a counterexample to show that the claim can fail when ${\mu(E_1)}$ is infinite.

Of course, these measure-theoretic facts immediately imply their probabilistic counterparts (and the pesky hypothesis that ${\mu(E_1)}$ is finite is automatic and can thus be dropped):

1. (Monotonicity) If ${E \subset F}$ are events, then ${{\mathbf P}(E) \leq {\mathbf P}(F)}$. (In particular, ${0 \leq {\mathbf P}(E) \leq 1}$ for any event ${E}$.)
2. (Subadditivity) If ${E_1,E_2,\dots}$ are events (not necessarily disjoint), then ${{\mathbf P}(\bigvee_{n=1}^\infty E_n) \leq \sum_{n=1}^\infty {\mathbf P}(E_n)}$.
3. (Continuity from below) If ${E_1 \subset E_2 \subset \dots }$ are events, then ${{\mathbf P}(\bigvee_{n=1}^\infty E_n) = \lim_{n \rightarrow \infty} {\mathbf P}(E_n)}$.
4. (Continuity from above) If ${E_1 \supset E_2 \supset \dots}$ are events, then ${{\mathbf P}(\bigwedge_{n=1}^\infty E_n) = \lim_{n \rightarrow \infty} {\mathbf P}(E_n)}$.

Note that if a countable sequence ${E_1, E_2, \dots}$ of events each hold almost surely, then their conjunction does as well (by applying subadditivity to the complementary events ${\overline{E_1},\overline{E_2},\dots}$). As a general rule of thumb, the notion of “almost surely” behaves like “surely” as long as one only performs an at most countable number of operations (which already suffices for a large portion of analysis, such as taking limits or performing infinite sums).
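To spell out the subadditivity step, here is a short worked derivation, using only de Morgan's laws, subadditivity, and the fact that ${{\bf P}(\overline{E}) = 1 - {\bf P}(E)}$. If ${{\bf P}(E_n) = 1}$ for every ${n}$, then

$\displaystyle {\bf P}\left( \overline{\bigwedge_{n=1}^\infty E_n} \right) = {\bf P}\left( \bigvee_{n=1}^\infty \overline{E_n} \right) \leq \sum_{n=1}^\infty {\bf P}( \overline{E_n} ) = \sum_{n=1}^\infty 0 = 0,$

and so ${{\bf P}(\bigwedge_{n=1}^\infty E_n) = 1}$. The argument breaks down for uncountable families, since subadditivity is only available for countable disjunctions.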

Exercise 24 Let ${(\Omega, {\mathcal B})}$ be a measurable space.

• If ${f: \Omega \rightarrow [-\infty,\infty]}$ is a function taking values in the extended reals ${[-\infty,\infty]}$, show that ${f}$ is measurable (giving ${[-\infty,\infty]}$ the Borel ${\sigma}$-algebra) if and only if the sets ${\{ \omega \in \Omega: f(\omega) \leq t \}}$ are measurable for all real ${t}$.
• If ${f,g: \Omega \rightarrow [-\infty,\infty]}$ are functions, show that ${f=g}$ if and only if ${\{ \omega \in \Omega: f(\omega) \leq t \} = \{ \omega \in \Omega: g(\omega) \leq t \}}$ for all reals ${t}$.
• If ${f_1,f_2,\dots: \Omega \rightarrow [-\infty,\infty]}$ are measurable, show that ${\sup_n f_n}$, ${\inf_n f_n}$, ${\limsup_{n \rightarrow \infty} f_n}$, and ${\liminf_{n \rightarrow \infty} f_n}$ are all measurable.

Remark 25 Occasionally, there is need to consider uncountable suprema or infima, e.g. ${\sup_{t \in {\bf R}} f_t}$. It is then no longer automatically the case that such an uncountable supremum or infimum of measurable functions is again measurable. However, in practice one can avoid this issue by carefully rewriting such uncountable suprema or infima in terms of countable ones. For instance, if it is known that ${f_t(\omega)}$ depends continuously on ${t}$ for each ${\omega}$, then ${\sup_{t \in {\bf R}} f_t = \sup_{t \in {\bf Q}} f_t}$, and so measurability is not an issue.

Using the above exercise, if one is given a sequence ${X_1,X_2,\dots}$ of random variables taking values in the extended real line ${[-\infty,\infty]}$, we can define the random variables ${\sup_n X_n}$, ${\inf_n X_n}$, ${\limsup_{n \rightarrow \infty} X_n}$, ${\liminf_{n \rightarrow \infty} X_n}$ which also take values in the extended real line, and which obey relations such as

$\displaystyle (\sup_n X_n > t) = \bigvee_{n=1}^\infty (X_n > t)$

for any real number ${t}$.

We now say that a sequence ${X_1,X_2,\dots}$ of random variables in the extended real line converges almost surely if one has

$\displaystyle \liminf_{n \rightarrow \infty} X_n = \limsup_{n \rightarrow \infty} X_n$

almost surely, in which case we can define the limit ${\lim_{n \rightarrow \infty} X_n}$ (up to almost sure equivalence) as

$\displaystyle \lim_{n \rightarrow \infty} X_n = \liminf_{n \rightarrow \infty} X_n = \limsup_{n \rightarrow \infty} X_n.$

This corresponds closely to the concept of almost everywhere convergence in measure theory, which is a slightly weaker notion than pointwise convergence which allows for bad behaviour on a set of measure zero. (See this previous blog post for more discussion on different notions of convergence of measurable functions.)

We will defer the general construction of expectation of a random variable to the next set of notes, where we review the notion of integration on a measure space. For now, we quickly review the basic construction of continuous scalar random variables.

Exercise 26 Let ${\mu}$ be a probability measure on the real line ${{\bf R}}$ (with the Borel ${\sigma}$-algebra). Define the Stieltjes measure function ${F: {\bf R} \rightarrow [0,1]}$ associated to ${\mu}$ by the formula

$\displaystyle F( t ) := \mu( (-\infty,t] ) = \mu( \{ x \in {\bf R}: x \leq t \} ).$

Establish the following properties of ${F}$:

• (i) ${F}$ is non-decreasing.
• (ii) ${\lim_{t \rightarrow -\infty} F(t) = 0}$ and ${\lim_{t \rightarrow +\infty} F(t) = 1}$.
• (iii) ${F}$ is right-continuous, thus ${F(t) = \lim_{s \rightarrow t^+} F(s)}$ for all ${t \in {\bf R}}$.

There is a somewhat difficult converse to this exercise: if ${F}$ is a function obeying the above three properties, then there is a unique probability measure ${\mu}$ on ${{\bf R}}$ (the Lebesgue-Stieltjes measure associated to ${F}$) for which ${F}$ is the Stieltjes measure function. See Section 3 of this previous post for details. As a consequence of this, we have

Corollary 27 (Construction of a single continuous random variable) Let ${F: {\bf R} \rightarrow [0,1]}$ be a function obeying the properties (i)-(iii) of the above exercise. Then, by using a suitable probability space model, we can construct a real random variable ${X}$ with the property that

$\displaystyle F(t) = {\bf P}( X \leq t)$

for all ${t \in {\bf R}}$.

Indeed, we can take the probability space to be ${{\bf R}}$ with the Borel ${\sigma}$-algebra and the Lebesgue-Stieltjes measure associated to ${F}$. This corollary is not fully satisfactory, because often we may already have chosen a probability space to model some other random variables, and the probability space provided by this corollary may be completely unrelated to the one used. We can resolve these issues with product measures and other joinings, but this will be deferred to a later set of notes.

Define the cumulative distribution function ${F: {\bf R} \rightarrow [0,1]}$ of a real random variable ${X}$ to be the function

$\displaystyle F(t) := {\bf P}(X \leq t).$

Thus we see that cumulative distribution functions obey the properties (i)-(iii) above, and conversely any function with those properties is the cumulative distribution function of some real random variable. We say that two real random variables (possibly on different sample spaces) agree in distribution if they have the same cumulative distribution function. One can therefore define a real random variable, up to agreement in distribution, by specifying the cumulative distribution function. See Durrett for some standard real distributions (uniform, normal, geometric, etc.) that one can define in this fashion.

Exercise 28 Let ${X}$ be a real random variable with cumulative distribution function ${F}$. For any real number ${t}$, show that

$\displaystyle {\bf P}(X < t) = \lim_{s \rightarrow t^-} F(s)$

and

$\displaystyle {\bf P}(X = t) = F(t) - \lim_{s \rightarrow t^-} F(s).$

In particular, one has ${{\bf P}(X=t)=0}$ for all ${t}$ if and only if ${F}$ is continuous.
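The jump formula can be illustrated numerically on a distribution with one atom. In the Python sketch below, the CDF `F` (mass ${1/2}$ at ${0}$ plus mass ${1/2}$ spread uniformly over ${[0,1]}$), the step size `h`, and the helper names are all hypothetical choices. The left limit is approximated crudely by evaluating ${F}$ slightly to the left of ${t}$, so this is a sanity check rather than a proof.

```python
def F(t):
    """Hypothetical CDF: mass 1/2 at t = 0, plus mass 1/2 uniform on [0, 1]."""
    if t < 0:
        return 0.0
    if t < 1:
        return 0.5 + 0.5 * t
    return 1.0

def left_limit(F, t, h=1e-9):
    """Crude numerical stand-in for lim_{s -> t^-} F(s).
    Assumes F has no jump strictly inside (t - h, t)."""
    return F(t - h)

def point_mass(F, t):
    """Approximates P(X = t) = F(t) - lim_{s -> t^-} F(s)."""
    return F(t) - left_limit(F, t)

for t in (0.0, 0.5, 1.0):
    print(t, point_mass(F, t))
```

Only the discontinuity at ${t = 0}$ produces a non-negligible value (namely ${1/2}$); at points where ${F}$ is continuous, the computed value is of the order of the step size `h`.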

Note in particular that this illustrates the distinction between almost sure and sure events: if ${X}$ has a continuous cumulative distribution function, and ${t}$ is a real number, then ${X=t}$ is almost surely false, but it does not have to be surely false. (Indeed, if one takes the sample space to be ${{\bf R}}$ and ${X_\Omega}$ to be the identity function, then ${X=t}$ will not be surely false.) On the other hand, the fact that ${X}$ is equal to some real number is of course surely true. The reason these statements are consistent with each other is that there are uncountably many real numbers ${t}$. (Countable additivity tells us that a countable disjunction of null events is still null, but says nothing about uncountable disjunctions.)

Exercise 29 (Skorokhod representation of scalar variables) Let ${U}$ be a uniform random variable taking values in ${[0,1]}$ (thus ${U}$ has cumulative distribution function ${F_U(t) := \min(\max(t,0),1)}$), and let ${F: {\bf R} \rightarrow [0,1]}$ be another cumulative distribution function. Show that the random variables

$\displaystyle X^- := \sup \{ y \in {\bf R}: F(y) < U \}$

and

$\displaystyle X^+ := \inf \{ y \in {\bf R}: F(y) \geq U \}$

are indeed random variables (that is to say, they are measurable in any given model ${\Omega}$), and have cumulative distribution function ${F}$. (This construction is attributed to Skorokhod, but it should not be confused with the Skorokhod representation theorem. It provides a quick way to generate a single scalar variable, but unfortunately it is difficult to modify this construction to generate multiple scalar variables, especially if they are somehow coupled to each other.)
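In computational practice, this construction is commonly known as inverse transform sampling. The following Python sketch approximates ${X^+ = \inf \{ y: F(y) \geq U \}}$ by bisection and compares the empirical distribution of the samples against ${F}$. The exponential CDF `F_exp`, the sample size, the seed, and the helper names `quantile`, `uniform_open` and `empirical_cdf` are assumptions made for illustration; the bisection only uses the fact that ${F}$ is non-decreasing with limits ${0}$ and ${1}$.

```python
import math
import random

def quantile(F, u, lo=-1.0, hi=1.0, tol=1e-9):
    """Approximate X^+ = inf{ y : F(y) >= u } by bisection, for 0 < u < 1."""
    while F(lo) >= u:   # move left until F(lo) < u
        lo *= 2
    while F(hi) < u:    # move right until F(hi) >= u
        hi *= 2
    # Invariant: F(lo) < u <= F(hi), so the infimum lies in (lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) >= u:
            hi = mid
        else:
            lo = mid
    return hi

def F_exp(y):
    """Hypothetical target CDF: the exponential distribution with rate 1."""
    return 1.0 - math.exp(-y) if y > 0 else 0.0

rng = random.Random(0)

def uniform_open():
    """Uniform sample in (0, 1); the null event U = 0 is rejected."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

samples = [quantile(F_exp, uniform_open()) for _ in range(20000)]

def empirical_cdf(samples, t):
    """Fraction of samples that are <= t."""
    return sum(1 for x in samples if x <= t) / len(samples)

max_err = max(abs(empirical_cdf(samples, t) - F_exp(t)) for t in (0.5, 1.0, 2.0))
print(max_err)
```

For this particular ${F}$ there is a closed form, ${X^+ = -\log(1-U)}$, so bisection is not needed; it is used here only to mirror the general definition.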

There is a multidimensional analogue of the above theory, which is almost identical, except that the monotonicity property has to be strengthened:

Exercise 30 Let ${\mu}$ be a probability measure on ${{\bf R}^n}$ (with the Borel ${\sigma}$-algebra). Define the Stieltjes measure function ${F: {\bf R}^n \rightarrow [0,1]}$ associated to ${\mu}$ by the formula

$\displaystyle F( t_1,\dots,t_n ) := \mu( (-\infty,t_1] \times \dots \times (-\infty,t_n] )$

$\displaystyle = \mu( \{ (x_1,\dots,x_n) \in {\bf R}^n: x_i \leq t_i \forall i=1,\dots,n \} ).$

Establish the following properties of ${F}$:

• (i) ${F}$ is non-decreasing: ${F(t_1,\dots,t_n) \leq F(t'_1,\dots,t'_n)}$ whenever ${t_i \leq t'_i}$ for all ${i}$.
• (ii) ${\lim_{t_1,\dots,t_n \rightarrow -\infty} F(t) = 0}$ and ${\lim_{t_1,\dots,t_n \rightarrow +\infty} F(t) = 1}$.
• (iii) ${F}$ is right-continuous, thus ${F(t_1,\dots,t_n) = \lim_{(s_1,\dots,s_n) \rightarrow (t_1,\dots,t_n)^+} F(s)}$ for all ${(t_1,\dots,t_n) \in {\bf R}^n}$, where the ${+}$ superscript denotes that we restrict each ${s_i}$ to be greater than or equal to ${t_i}$.
• (iv) One has

$\displaystyle \sum_{(\omega_1,\dots,\omega_n) \in \{0,1\}^n} (-1)^{\omega_1+\dots+\omega_n+n} F( t_{1,\omega_1}, \dots, t_{n,\omega_n} ) \geq 0$

whenever ${t_{i,0} \leq t_{i,1}}$ are real numbers for ${i=1,\dots,n}$. (Hint: try to express the measure of a box ${(t_{1,0},t_{1,1}] \times \dots \times (t_{n,0},t_{n,1}]}$ with respect to ${\mu}$ in terms of the Stieltjes measure function ${F}$.)
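For ${n=2}$, the sum in (iv) reads ${F(t_{1,1},t_{2,1}) - F(t_{1,0},t_{2,1}) - F(t_{1,1},t_{2,0}) + F(t_{1,0},t_{2,0})}$. The Python sketch below evaluates the alternating sum for general ${n}$, under two hypothetical choices of ${F}$: the joint distribution function `F_uniform` of two independent uniform variables on ${[0,1]}$, and a function `G` that obeys (i)-(iii) but violates (iv). The helper name `box_sum` is likewise an assumption.

```python
from itertools import product

def box_sum(F, lower, upper):
    """Alternating sum from property (iv): sum over omega in {0,1}^n of
    (-1)^(omega_1 + ... + omega_n + n) * F(t_{1,omega_1}, ..., t_{n,omega_n}),
    with t_{i,0} = lower[i] and t_{i,1} = upper[i]."""
    n = len(lower)
    total = 0.0
    for omega in product((0, 1), repeat=n):
        corner = [upper[i] if omega[i] == 1 else lower[i] for i in range(n)]
        sign = (-1) ** (sum(omega) + n)
        total += sign * F(*corner)
    return total

def clamp(t):
    return min(max(t, 0.0), 1.0)

def F_uniform(t1, t2):
    """Hypothetical example: two independent uniform variables on [0, 1]."""
    return clamp(t1) * clamp(t2)

def G(t1, t2):
    """Obeys (i)-(iii) but not (iv), so it is not a Stieltjes measure function."""
    return 1.0 if t1 + t2 >= 0 else 0.0

print(box_sum(F_uniform, (0.2, 0.1), (0.5, 0.4)))
print(box_sum(G, (-1.0, -1.0), (1.0, 1.0)))
```

The first value agrees (up to rounding) with the area ${0.3 \times 0.3 = 0.09}$ of the box, while the second is negative, which shows that property (iv) does not follow from (i)-(iii) when ${n \geq 2}$.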

Again, there is a difficult converse to this exercise: if ${F}$ is a function obeying the above four properties, then there is a unique probability measure ${\mu}$ on ${{\bf R}^n}$ for which ${F}$ is the Stieltjes measure function. See Durrett for details; one can also modify the arguments in this previous post. In particular, we have

Corollary 31 (Construction of several continuous random variables) Let ${F: {\bf R}^n \rightarrow [0,1]}$ be a function obeying the properties (i)-(iv) of the above exercise. Then, by using a suitable probability space model, we can construct real random variables ${X_1,\dots,X_n}$ with the property that

$\displaystyle F(t_1,\dots,t_n) = {\bf P}( X_1 \leq t_1 \wedge \dots \wedge X_n \leq t_n)$

for all ${t_1,\dots,t_n \in {\bf R}}$.

Again, this corollary is not completely satisfactory because the probability space produced by it (which one can take to be ${{\bf R}^n}$ with the Borel ${\sigma}$-algebra and the Lebesgue-Stieltjes measure associated to ${F}$) may not be the probability space one wants to use; we will return to this point later.

— 4. Variants of the standard foundations (optional) —

We have focused on the orthodox foundations of probability theory in which we model events and random variables through probability spaces. In this section, we briefly discuss some alternate ways to set up the foundations, as well as alternatives to probability theory itself. (Actually, many of the basic objects and concepts in mathematics have multiple such foundations; see for instance this blog post exploring the many different ways to define the notion of a group.) We mention them here in order to exclude them from discussion in subsequent notes, which will be focused almost exclusively on orthodox probability.

One approach to the foundations of probability is to view the event space as an abstract ${\sigma}$-algebra ${{\mathcal E}}$ – a collection of abstract objects with operations such as ${\wedge}$ and ${\vee}$ (and ${\bigwedge_{n=1}^\infty}$ and ${\bigvee_{n=1}^\infty}$) that obey a number of axioms; see this previous post for a formal definition. The probability map ${E \mapsto {\bf P}(E)}$ can then be viewed as an abstract probability measure on ${{\mathcal E}}$, that is to say a map from ${{\mathcal E}}$ to ${[0,1]}$ that obeys the Kolmogorov axioms. Random variables ${X}$ taking values in a measurable space ${(R, {\mathcal B})}$ can be identified with their pullback map ${X^*: {\mathcal B} \rightarrow {\mathcal E}}$, which is the morphism of (abstract) ${\sigma}$-algebras that sends a measurable set ${S \in {\mathcal B}}$ to the event ${X \in S}$ in ${{\mathcal E}}$; with some care one can then redefine all the operations in previous sections (e.g. applying a measurable map ${f: R \rightarrow S}$ to a random variable ${X}$ taking values in ${R}$ to obtain a random variable ${f(X)}$ taking values in ${S}$) in terms of this pullback map, allowing one to define random variables satisfactorily in this abstract setting. The probability space models discussed above can then be viewed as representations of abstract probability spaces by concrete ones. It turns out that (up to null events) any abstract probability space can be represented by a concrete one, a result known as the Loomis-Sikorski theorem; see this previous post for details.

Another, related, approach is to start not with the event space, but with the space of scalar random variables, and more specifically with the space ${L^\infty}$ of almost surely bounded scalar random variables ${X}$ (thus, there is a deterministic scalar ${C}$ such that ${|X| \leq C}$ almost surely). It turns out that this space has the structure of a commutative tracial (abstract) von Neumann algebra. Conversely, starting from a commutative tracial von Neumann algebra one can form an abstract probability space (using the idempotent elements of the algebra as the events), and thus represent this algebra (up to null events) by a concrete probability space. This particular choice of probabilistic foundations is particularly convenient when one wishes to generalise classical probability to noncommutative probability, as this is simply a matter of dropping the axiom that the von Neumann algebra is commutative. This leads in particular to the subjects of quantum probability and free probability, which are generalisations of classical probability that are beyond the scope of this course (but see this blog post for an introduction to the latter, and this previous post for an abstract algebraic description of a probability space).

It is also possible to model continuous probability via a nonstandard version of discrete probability (or even finite probability), which removes some of the technicalities of measure theory at the cost of replacing them with the formalism of nonstandard analysis instead. This approach was pioneered by Ed Nelson, but will not be discussed further here. (See also these previous posts on the Loeb measure construction, which is a closely related way to combine the power of measure theory with the conveniences of nonstandard analysis.)

One can generalise the traditional, countably additive, form of probability by replacing countable additivity with finite additivity, but then one loses much of the ability to take limits or infinite sums, which reduces the amount of analysis one can perform in this setting. Still, finite additivity is good enough for many applications, particularly in discrete mathematics. An even broader generalisation is that of qualitative probability, in which events that are neither almost surely true nor almost surely false are not assigned any specific numerical probability between ${0}$ and ${1}$, but are simply assigned a symbol such as ${I}$ to indicate their indeterminate status; see this previous blog post for this generalisation, which can for instance be used to view the concept of a “generic point” in algebraic geometry or metric space topology in probabilistic terms.

There have been multiple attempts to move more radically beyond the paradigm of probability theory and its relatives as discussed above, in order to more accurately capture mathematically the concept of non-determinism. One family of approaches is based on replacing deterministic logic by some sort of probabilistic logic; another is based on allowing several parameters in one’s model to be unknown (as opposed to being probabilistic random variables), leading to the area of uncertainty quantification. These topics are well beyond the scope of this course.

Filed under: 275A - probability theory, math.CA, math.PR Tagged: foundations

### Doug Natelson — Table-top electron microscopes

A quick question in the hopes that some people in this blog's readership have direct experience:  Anyone work with a table-top scanning electron microscope (SEM) that they really like?  Any rough idea of the cost and operations challenges (e.g., having to replace tungsten filaments all the time)?  I was chatting with someone about educational opportunities that something like these would present, and I was curious about the numbers without wanting to email vendors or anything that formal.  Thanks for any information.

(Note: It would really be fun to try to develop a really low-budget SEM - the electron microscopy version of this.  On the one hand, you could imagine microfabricated field emitters and the amazing cheapness of CCDs could help.  However, the need for good vacuum and some means of beam focusing and steering makes this much more difficult.  Clearly a good undergrad design project....)

### Backreaction — Repost in celebration of the 2015 Nobel Prize in Physics: Neutrino masses and angles

It was just announced that this year's Nobel Prize in physics goes to Takaaki Kajita from the Super-Kamiokande Collaboration and Arthur B. McDonald from the Sudbury Neutrino Observatory (SNO) Collaboration “for the discovery of neutrino oscillations, which shows that neutrinos have mass.” On this occasion, I am reposting a brief summary of the evidence for neutrino masses that I wrote in 2007.

Neutrinos come in three known flavors. These flavors correspond to the three charged leptons, the electron, the muon and the tau. The neutrino flavors can change during the neutrino's travel, and one flavor can be converted into another. This happens periodically. The neutrino flavor oscillations have a certain wavelength, and an amplitude which sets the probability of the change to happen. The amplitude is usually quantified in a mixing angle θ. In this, sin²(2θ) = 1, or θ = π/4, corresponds to maximal mixing, which means one flavor changes completely into another, and then back.

This neutrino mixing happens when the mass-eigenstates of the Hamiltonian are not the same as the flavor eigenstates. The wavelength λ of the oscillation turns out to depend (in the relativistic limit) on the difference in the squared masses Δm² (not the square of the difference!) and the neutrino's energy E as λ = 4E/Δm². The larger the energy of the neutrinos the larger the wavelength. For a source with a spectrum of different energies around some mean value, one has a superposition of various wavelengths. On distances larger than the typical oscillation length corresponding to the mean energy, this will average out the oscillation.

The plot below from the KamLAND Collaboration shows an example of an experiment to test neutrino flavor conversion. The KamLAND neutrino sources are several Japanese nuclear reactors that emit electron anti-neutrinos with a very well known energy and power spectrum, which has a mean value of a few MeV. The average distance to the reactors is ~180 km. The plot shows the ratio of the observed electron anti-neutrinos to the expected number without oscillations. The KamLAND result is the red dot. The other data points were earlier experiments in other locations that did not find a drop. The dotted line is the best fit to this data.

[Figure: KamLAND Collaboration]

One sees however that there is some kind of redundancy in this fit, since one can shift around the wavelength and stay within the error bars. These reactor data however are only one of the measurements of neutrino oscillations that have been made during the last decades. There are a lot of other experiments that have measured deficits in the expected solar and atmospheric neutrino flux. Especially important in this regard was the SNO data that confirmed that indeed not only were there fewer solar electron neutrinos than expected, but that they actually showed up in the detector with a different flavor, and the KamLAND analysis of the energy spectrum that clearly favors oscillation over decay.

The plot below depicts all the currently available data for electron neutrino oscillations, which places the mass-squared difference around 8×10⁻⁵ eV², and θ at about 33.9° (i.e. the mixing is with high confidence not maximal).

[Figure: Hitoshi Murayama, see here for references on the used data]

The lines on the top indicate excluded regions from earlier experiments, the filled regions are allowed values. You see the KamLAND 95% CL area in red, and SNO in brown. The remaining island in the overlap is pretty much constrained by now. Given that neutrinos are such elusive particles, and this mass scale is incredibly tiny, I am always impressed by the precision of these experiments!
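
To get a feel for these numbers, here is a rough sketch of the electron anti-neutrino survival probability at the KamLAND baseline. It assumes the standard two-flavor vacuum formula, P = 1 − sin²(2θ) sin²(1.267 Δm² L/E) with Δm² in eV², L in metres and E in MeV. That formula and its numerical factor are not taken from this post, and its oscillation length differs from the wavelength expression quoted above by convention-dependent factors of π. The function names are made up for illustration; the inputs are the approximate values mentioned in the text.

```python
import math

# Assumed standard two-flavor vacuum formula (not taken from the post):
#   P_survive = 1 - sin^2(2 theta) * sin^2(1.267 * dm2[eV^2] * L[m] / E[MeV])
PHASE_FACTOR = 1.267

def survival_probability(L_km, E_MeV, dm2_eV2, theta_deg):
    """Probability that an electron (anti-)neutrino is still one after L_km."""
    phase = PHASE_FACTOR * dm2_eV2 * (L_km * 1000.0) / E_MeV
    amplitude = math.sin(2.0 * math.radians(theta_deg)) ** 2
    return 1.0 - amplitude * math.sin(phase) ** 2

def oscillation_length_km(E_MeV, dm2_eV2):
    """Distance after which the probability returns to its starting value."""
    return math.pi * E_MeV / (PHASE_FACTOR * dm2_eV2) / 1000.0

DM2 = 8e-5          # eV^2, approximate value quoted in the text
THETA = 33.9        # degrees, approximate value quoted in the text
BASELINE_KM = 180.0 # average reactor distance quoted in the text

for E in (2.0, 4.0, 6.0):  # a few MeV, as for reactor anti-neutrinos
    L_osc = oscillation_length_km(E, DM2)
    P = survival_probability(BASELINE_KM, E, DM2, THETA)
    print(f"E = {E} MeV: oscillation length {L_osc:.0f} km, survival {P:.2f}")
```

The oscillation length comes out comparable to the 180 km baseline for energies of a few MeV, which is why a reactor experiment at that distance is sensitive to this mass scale. Because the result varies strongly with energy, a real analysis has to average over the reactor spectrum.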

To fit the oscillations between all three known neutrino flavors, one needs three mixing angles and two mass-squared differences (the overall mass scale factors out and does not enter, so neutrino oscillations are not sensitive to the total neutrino masses). All the presently available data has allowed us to tightly constrain the mixing angles and mass-squared differences. The only outsider (that was thus excluded from the global fits) is famously LSND (see also the above plot), so MiniBooNE was designed to check on their results. For more info on MiniBooNE, see Heather Ray's excellent post at CV.

This post originally appeared in December 2007 as part of our advent calendar A Plottl A Day.

### Tommaso Dorigo — Nobel Prize To Neutrino Oscillations

The winners of the 2015 Nobel Prize in Physics are:

• Takaaki Kajita (Super-Kamiokande)
• Arthur McDonald (Sudbury Neutrino Observatory - SNO)
“for the discovery of neutrino oscillations, which shows that neutrinos have mass”

### n-Category Café — Configurations of Lines and Models of Lie Algebras (Part 2)

To get deeper into this paper:

we should think about the 24-cell, the $\mathrm{D}_4$ Dynkin diagram, and the Lie algebra $\mathfrak{so}(8)$ that it describes. After all, it’s this stuff that underlies the octonions, which in turn underlie the exceptional Lie algebras, which are the main subject of Manivel’s paper.

Remember that the root lattice of $\mathfrak{so}(2n)$ is called the $\mathrm{D}_n$ lattice: it consists of vectors in $\mathbb{R}^n$ with integer entries that sum to an even number.

I’m interested in $\mathfrak{so}(8)$ so I’m interested in the $\mathrm{D}_4$ root lattice: 4-tuples of integers that sum to an even number! If you like the Hurwitz integral quaternions these are essentially the same thing multiplied by a factor of 2.

The shortest nonzero vectors in the $\mathrm{D}_4$ lattice are the roots. There are 16 like this:

$(\pm 1, \pm 1, \pm 1, \pm 1)$

and 8 like this:

$(\pm 2, 0, 0, 0) \; \text{and permutations}$

The roots form the vertices of a regular polytope in 4 dimensions. It’s called the 24-cell because it not only has 24 vertices, it also has 24 octahedral faces: it’s self-dual.

One thing we’re seeing here is that we can take the vertices of a 4-dimensional cube, namely $(\pm 1, \pm 1, \pm 1, \pm 1)$:

and the vertices of a 4-dimensional orthoplex, namely $(\pm 2, 0, 0 , 0)$ and permutations:

and together they form the vertices of a 24-cell:

(In case you forgot, an orthoplex or cross-polytope is the $n$-dimensional analogue of an octahedron, a regular polytope with $2n$ vertices.)

However, we can go further! If you take every other vertex of an $n$-dimensional cube — that is, start with one vertex and then all its second nearest neighbors and their second nearest neighbors and so on — you get the vertices of something called the $n$-dimensional demicube. In 3 dimensions, a demicube is a regular tetrahedron, so you can fit two tetrahedra in a cube like this:

But in 4 dimensions, a demicube is an orthoplex! This is easy to see: if we take these points

$(\pm 1, \pm 1, \pm 1, \pm 1)$

and keep only those with an even number of minus signs, we get

$\pm (1,1,1,1)$

$\pm (1,1,-1,-1)$

$\pm (1,-1,1,-1)$

$\pm (1,-1,-1,1)$

which are the 8 vertices of an orthoplex.

So, we can take the vertices of a 24-cell and partition them into three 8-element sets, each being the vertices of an orthoplex!
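
The partition described above is easy to check by brute force. The sketch below simply enumerates the 24 vectors listed earlier and tests each 8-element set; the helper `is_orthoplex` is a hypothetical name, and its criterion (8 vectors of equal length, any two of which are equal, opposite, or orthogonal) is my own shorthand for "4 pairs of opposite points along 4 orthogonal axes".

```python
from itertools import product

# The 16 cube vertices (+-1, +-1, +-1, +-1).
cube = list(product((1, -1), repeat=4))

# The 8 vectors (+-2, 0, 0, 0) and permutations.
axes = []
for i in range(4):
    for s in (2, -2):
        v = [0, 0, 0, 0]
        v[i] = s
        axes.append(tuple(v))

# Split the cube vertices by the parity of the number of minus signs.
even = [v for v in cube if v.count(-1) % 2 == 0]
odd = [v for v in cube if v.count(-1) % 2 == 1]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_orthoplex(vs):
    """8 equal-length vectors, pairwise equal, opposite, or orthogonal."""
    if len(set(vs)) != 8:
        return False
    norms = {dot(v, v) for v in vs}
    if len(norms) != 1:
        return False
    n2 = norms.pop()
    return all(dot(u, v) in (n2, -n2, 0) for u in vs for v in vs)

for name, vs in (("axes", axes), ("even", even), ("odd", odd)):
    print(name, is_orthoplex(vs))
print("total vertices:", len(set(axes + even + odd)))
```

In 4 dimensions there is room for only 4 mutually orthogonal directions, so 8 distinct vectors passing this test must consist of exactly 4 opposite pairs.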

And this has a nice representation-theoretic significance:

• the vertices of the first orthoplex correspond to the weights of the 8-dimensional vector representation of $\mathfrak{so}(8)$;

• the vertices of the second correspond to the weights of the 8-dimensional left-handed spinor representation of $\mathfrak{so}(8)$;

• the vertices of the third correspond to the weights of the 8-dimensional right-handed spinor representation of $\mathfrak{so}(8).$

Here I’m being a bit sneaky, identifying the weight lattice of $\mathrm{D}_4$ with the root lattice. In fact it’s twice as dense, but it ‘looks the same’: it’s the same up to a rescaling and rotation. The weight lattice contains a 24-cell whose vertices are the weights of the vector, left- and right-handed spinor representations. But the root lattice, which is what I really had been talking about, contains a larger 24-cell whose vertices are the roots — that is, the nonzero weights of the adjoint representation of $\mathfrak{so}(8)$ on itself.

Now, you may wonder which is the ‘first’ orthoplex, which is the ‘second’ one and which is the ‘third’, but it doesn’t matter much since there’s a symmetry between them! This is triality. The group $\mathrm{S}_3$ acts on $\mathfrak{so}(8)$ as outer automorphisms, and thus on the weight lattice and root lattice. So, it acts as symmetries of the 24-cell — and it acts by permuting the 3 orthoplexes!

But here’s the cool thing that Manivel focuses on. Suppose we take the 24-cell whose vertices are the roots of $\mathrm{D}_4$. Any orthoplex in here consists of 4 pairs of opposite roots along 4 orthogonal axes. So, it gives a root system of type $\mathrm{A}_1 \times \mathrm{A}_1 \times \mathrm{A}_1 \times \mathrm{A}_1$. In other words, it picks out a copy of $\mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C})$ in $\mathfrak{so}(8,\mathbb{C})$.

This Lie subalgebra

$\mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \; \subset \; \mathfrak{so}(8,\mathbb{C})$

acts on $\mathfrak{so}(8,\mathbb{C})$ by the adjoint action. It acts on itself, and it acts on the rest of $\mathfrak{so}(8,\mathbb{C})$ via the representation

$\mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \mathbb{C}^2$

where each copy of $\mathfrak{sl}(2,\mathbb{C})$ acts on just one factor of $\mathbb{C}^2$ (in the obvious way). The weights of this ‘rest of’ $\mathfrak{so}(8,\mathbb{C})$ form the vertices of a 4-dimensional cube.

So, we’re getting a nice vector space isomorphism

$\mathfrak{so}(8,\mathbb{C}) \;\; \cong \;\; \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C})\; \; \oplus \;\; \mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \mathbb{C}^2$

which is based on how the vertices of a 24-cell can be partitioned into the vertices of an orthoplex and a cube. But in fact we’re getting 3 such isomorphisms, related by triality!

If we call the 4 copies of $\mathbb{C}^2$ here $V_1, V_2, V_3, V_4$, we can write

$\mathfrak{so}(8,\mathbb{C}) \;\; \cong \;\; \bigoplus_{i = 1}^4 \mathfrak{sl}(V_i) \; \oplus \; \bigotimes_{i = 1}^4 V_i$

Also, we can write the three 8-dimensional irreducible representations of $\mathfrak{so}(8,\mathbb{C})$ as

$(V_1 \otimes V_2) \oplus (V_3 \otimes V_4)$

$(V_1 \otimes V_3) \oplus (V_2 \otimes V_4)$

$(V_1 \otimes V_4) \oplus (V_2 \otimes V_3)$

Note that they come from the three ways of partitioning a 4-element set into two 2-element sets.
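
A quick dimension count is a useful sanity check on these decompositions. It uses only $\dim \mathfrak{sl}(2,\mathbb{C}) = 3$, $\dim \mathbb{C}^2 = 2$ and $\dim \mathfrak{so}(8,\mathbb{C}) = \tfrac{8 \cdot 7}{2} = 28$:

$\dim \Big( \bigoplus_{i = 1}^4 \mathfrak{sl}(V_i) \; \oplus \; \bigotimes_{i = 1}^4 V_i \Big) \; = \; 4 \cdot 3 + 2^4 \; = \; 12 + 16 \; = \; 28$

$\dim \big( (V_1 \otimes V_2) \oplus (V_3 \otimes V_4) \big) \; = \; 2 \cdot 2 + 2 \cdot 2 \; = \; 8$

The same count gives 8 for the other two representations. The 12 here splits as 8 roots (the vertices of the orthoplex) plus the 4-dimensional Cartan subalgebra, and the 16 matches the vertices of the 4-dimensional cube.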

Manivel calls this description of $\mathfrak{so}(8,\mathbb{C})$ the four-ality description, but I prefer to speak of tetrality. How is tetrality related to triality? The Dynkin diagram of $\mathrm{D}_4$ has $S_3$ symmetry, which yields triality:

But the corresponding extended or affine Dynkin diagram $\widetilde{\mathrm{D}}_4$ has $S_4$ symmetry, which yields tetrality:

One reason the affine Dynkin diagram is important is that maximal-rank Lie subalgebras of a Lie algebra with Dynkin diagram $X$ are obtained by deleting a dot from the corresponding affine Dynkin diagram $\widetilde{X}$. If we delete the middle dot of $\widetilde{\mathrm{D}}_4$ we get a disconnected diagram with 4 separate dots, which is the diagram for $\mathrm{A}_1 \times \mathrm{A}_1 \times \mathrm{A}_1 \times \mathrm{A}_1$. This is why $\mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C}) \oplus \mathfrak{sl}(2,\mathbb{C})$ shows up as a maximal rank Lie subalgebra of $\mathfrak{so}(8,\mathbb{C})$.

The relation between tetrality and triality, I believe, is that there’s a homomorphism $S_4 \to S_3$ coming from the three ways of partitioning a 4-element set into two 2-element sets.
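
This homomorphism is small enough to check by brute force. The sketch below, with hypothetical names throughout, lets each permutation of $\{0,1,2,3\}$ act on the three ways of splitting that set into two pairs. It then confirms that the resulting map is a homomorphism onto $S_3$ with a kernel of order 4 (the Klein four-group).

```python
from itertools import permutations

# The three ways to split {0, 1, 2, 3} into two unordered pairs.
PAIRINGS = [
    frozenset({frozenset({0, 1}), frozenset({2, 3})}),
    frozenset({frozenset({0, 2}), frozenset({1, 3})}),
    frozenset({frozenset({0, 3}), frozenset({1, 2})}),
]

def induced(perm):
    """Permutation of the 3 pairings induced by perm, where perm[i] is the image of i."""
    out = []
    for pairing in PAIRINGS:
        image = frozenset(frozenset(perm[i] for i in block) for block in pairing)
        out.append(PAIRINGS.index(image))
    return tuple(out)

def compose(p, q):
    """Composition (p after q) of permutations stored as tuples."""
    return tuple(p[q[i]] for i in range(len(q)))

S4 = list(permutations(range(4)))
phi = {g: induced(g) for g in S4}

image = set(phi.values())
kernel = [g for g in S4 if phi[g] == (0, 1, 2)]
is_hom = all(phi[compose(g, h)] == compose(phi[g], phi[h]) for g in S4 for h in S4)

print("homomorphism:", is_hom)
print("image size:", len(image))
print("kernel:", kernel)
```

The kernel consists of the identity and the three double transpositions, so $S_4$ modulo the Klein four-group is $S_3$.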

This is a good place to stop, though we’re really just getting started.

### Clifford Johnson — Metals are Shiny

"Metals are shiny." That's one of my favourite punchlines to end a class on electromagnetism with, and that's what I did today. I just love bringing up a bit of everyday physics as a striking consequence of two hours worth of development on the board, and this is a good one for that. I hope the class enjoyed it as much as I did! (Basically, as you can't see in the snapshot of my notes in the photo, those expressions are results of a computation of the [...] Click to continue reading this post

The post Metals are Shiny appeared first on Asymptotia.

### Clifford Johnson — Thomas and Fermi

The other day the Thomas-Fermi model (and its enhancements by Dirac and others) wandered across my desk (and one of my virtual blackboards as you can see in the picture) for a while. Putting aside why it showed up (perhaps I will say later on, but I cannot now), it was fun to delve for a while into some of these early attempts in quantum mechanics to try to understand approximation methods for treating fairly complicated quantum systems (like atoms of various sizes). The basic model showed up in 1927, just a year after Schrodinger's [...] Click to continue reading this post

The post Thomas and Fermi appeared first on Asymptotia.

### Chad Orzel — 035/366: Trike Rack

Another fall day, another holiday closing at the JCC. I was home with The Pip for most of the day, which was the usual mix of fun, exhausting, and puzzling. For example, while I offered several times to go out to a playground before lunch, he refused. But then insisted that we walk to the store to buy… something. I got this picture with my phone:

The Pip’s bike in the rack at the Co-Op.

Because it amused me to see a bike rack with just a little red tricycle in it.

We did go to a couple of playgrounds later, and I shot some video that I’ll use for physics-y stuff at some point. But this is as good a photo-of-the-day as anything else I got.

Just one more Jewish holiday to get through, tomorrow, then it’s smooth sailing for a few weeks at least…

## October 05, 2015

### Secret Blogging Seminar — Postdoc position at ANU

We’ve just put up an ad for a new 2 year postdoctoral position at the ANU, to work with myself and Tony Licata. We’re looking for someone who’s interested in operator algebras, quantum topology, and/or representation theory, to collaborate with us on Australian Research Council funded projects.

The ad hasn’t yet been crossposted to MathJobs, but hopefully it will eventually appear there! In any case, applications need to be made through the ANU website. You need to submit a CV, 3 references, and a document addressing the selection criteria. Let me know if you have any questions about the application process, the job, or Canberra!

### n-Category Café — Configurations of Lines and Models of Lie Algebras (Part 1)

I’m really enjoying this article, so I’d like to talk about it here at the n-Category Café:

It’s a bit intense, so it may take a series of posts, but let me just get started…

I started reading this paper because I wanted to finally understand the famous “27 lines on a cubic surface” and how they’re related to the smallest nontrivial representations of $\mathrm{E}_6$, which are 27-dimensional. Actually $\mathrm{E}_6$ has two nontrivial representations of this dimension, which are not isomorphic: one is the exceptional Jordan algebra, and one is its dual! The exceptional Jordan algebra consists of $3 \times 3$ self-adjoint octonionic matrices, so it has dimension

$8+8+8+3 = 27$

The determinant of such a matrix turns out to be well-defined despite the noncommutativity and nonassociativity of the octonions. $\mathrm{E}_6$ is the group of linear transformations of the exceptional Jordan algebra that preserves the determinant.

As you might expect, all this stuff going on in dimension 27 is just the tip of an iceberg—and Manivel explores quite a large chunk of that iceberg. But today let me just touch on the tip.

For starters, the Cayley–Salmon theorem says that every smooth cubic surface in $\mathbb{C}\mathrm{P}^3$ has exactly 27 lines on it.

(This is apparently one of those cases where mathematicians shared credit with the meal that inspired their work, like the Fermi–Pasta–Ulam problem.)

I can’t visualize those 27 lines in general. But Clebsch gave an example of a smooth real cubic surface where all the lines actually lie in $\mathbb{R}\mathrm{P}^3$, so you can see them. This is called the Clebsch surface, or Klein’s icosahedral cubic surface, because Klein also worked on it. It looks like this:

and the 27 lines look like this:

Please click on the pictures to see who created them! Here’s a model of it — one of those nice old plaster models you see in old universities:

This model is in Göttingen, photographed by Oliver Zauzig.

I would enjoy diving down the rabbit hole here and learning everything about this particular cubic surface, but I’ll resist for now! I’ll just say a few things:

First, the Clebsch surface can be described very nicely as a surface in $\mathbb{R}\mathrm{P}^4$ using the homogeneous equations

$x_0+x_1+x_2+x_3+x_4 = 0$

$x_0^3+x_1^3+x_2^3+x_3^3+x_4^3 = 0$

but then you can eliminate one variable and think of it as a surface in $\mathbb{R}\mathrm{P}^3$ given by the equation

$x_1^3+x_2^3+x_3^3+x_4^3 = (x_1+x_2+x_3+x_4)^3$

Second, the lines are actually defined over the golden field $\mathbb{Q}[\sqrt{5}]$. This may have something to do with why it’s called ‘Klein’s icosahedral cubic surface’ — I’ll avoid looking into that right now, but it may eventually be important, because there are nice relations between some exceptional Lie algebras and the golden field.

Third, you can see some points where three lines intersect: these are called Eckardt points and there are 10 of them.
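
Some of this can be checked by hand, or with a few lines of code. In the $\mathbb{R}\mathrm{P}^4$ description, any point of the form $(a, -a, b, -b, 0)$, up to permuting coordinates, satisfies both equations. That gives 15 lines with rational coefficients. The claim that these are 15 of the 27 lines, with the remaining 12 needing $\sqrt{5}$, is a standard fact that I am adding here and is not stated above. The sketch below uses hypothetical helper names. It verifies that these 15 lines lie on the surface and that each of the 10 points of the form $(1,-1,0,0,0)$ lies on exactly three of them.

```python
from itertools import combinations

def pairings(four):
    """The three ways to split four indices into two unordered pairs."""
    a, b, c, d = four
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def unit_diff(i, j, n=5):
    """The vector e_i - e_j in R^n."""
    v = [0] * n
    v[i], v[j] = 1, -1
    return tuple(v)

def on_clebsch(x):
    """Both homogeneous equations of the Clebsch surface in RP^4."""
    return sum(x) == 0 and sum(t ** 3 for t in x) == 0

# Each line is spanned by e_i - e_j and e_k - e_l, with the fifth coordinate zero.
lines = []
for zero in range(5):
    rest = [i for i in range(5) if i != zero]
    for (i, j), (k, l) in pairings(rest):
        lines.append((unit_diff(i, j), unit_diff(k, l)))

def point_on_line(a, b, line):
    u, v = line
    return tuple(a * ui + b * vi for ui, vi in zip(u, v))

all_on_surface = all(
    on_clebsch(point_on_line(a, b, line))
    for line in lines
    for a in range(-3, 4)
    for b in range(-3, 4)
)

def lines_through(p):
    """Count the lines above that have p (up to sign) as a spanning vector."""
    neg = tuple(-t for t in p)
    return sum(1 for u, v in lines if p in (u, v) or neg in (u, v))

candidates = [unit_diff(i, j) for i, j in combinations(range(5), 2)]
counts = [lines_through(p) for p in candidates]

print(len(lines), all_on_surface, counts)
```

This only exhibits 10 points where three lines meet. It does not show there are no others, and it does not touch the 12 lines involving the golden ratio.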

Anyway, you may be wondering why there are 27 lines on a smooth cubic surface. The best argument I’ve seen so far, in terms of maximum friendliness, minimum jargon, and maximum total insight conveyed, is here:

I can’t say I fully understand it, since it’s fairly involved, but I still recommend it to anyone who knows a reasonable amount of algebraic geometry.

Anyway, I don’t think one needs to fully understand this to start wondering what $\mathrm{E}_6$ has to do with it. Here’s some of what Manivel has to say:

The configuration of the 27 lines on a smooth cubic surface in $\mathbb{C}\mathrm{P}^3$ has been thoroughly investigated by the classical algebraic geometers. It has been known for a long time that the automorphism group of this configuration can be identified with the Weyl group of the root system of type $\mathrm{E}_6$, of order 51,840. Moreover, the minimal representation $J$ of the simply connected complex Lie group of type $\mathrm{E}_6$ has dimension 27.

Here the letter $J$ means ‘exceptional Jordan algebra’.

This is a minuscule representation, meaning that the weight spaces are lines and that the Weyl group $W(\mathrm{E}_6)$ acts transitively on the weights. In fact one can recover the lines configuration of the cubic surface by defining two weights to be incident if they are not orthogonal with respect to the unique (up to scale) invariant scalar product. Conversely, one can recover the action of the Lie group $\mathrm{E}_6$ on $J$ from the line configuration.

I hope to come back to this and keep digging deeper. Right now I don’t even understand how the 27 weight spaces in $J$, which are 1-dimensional subspaces in a 27-dimensional space, are connected to 27 lines in some surface. But there are some other things in Manivel’s paper that I understand and like a lot.

### Backreaction — When string theorists are out of luck, will Loop Quantum Gravity come to rescue?

Tl;dr: I don’t think they want rescuing.

String theorists and researchers working on loop quantum gravity (LQG) like to each point out how their own attempt to quantize gravity is better than the others’. In the end though, they’re both trying to achieve the same thing – consistently combining quantum field theory with gravity – and it is hard to pin down just exactly what makes strings and loops incompatible. Other than egos that is.

The obvious difference used to be that LQG works only in 4 dimensions, whereas string theory works only in 10 dimensions, and LQG doesn’t allow for supersymmetry, which is a consequence of quantizing strings. However, several years ago the LQG framework was extended to higher dimensions, and it can now also include supergravity, so that objection is gone.

Then there’s the issue with Lorentz-invariance, which is respected in string theory, but its fate in LQG has been the subject of much debate. Recently though, some researchers working on LQG have argued that Lorentz-invariance, used as a constraint, leads to requirements on the particle interactions, which then have to become similar to some limits found in string theory. This should come as no surprise to string theorists who have been claiming for decades that there is one and only one way to combine all the known particle interactions...

Two doesn’t make a trend, but I have a third, which is a recent paper that appeared on the arxiv:
Bodendorfer argues in his paper that loop quantization might be useful for calculations in supergravity and thus relevant for the AdS/CFT duality.

This duality relates certain types of gauge theories – similar to those used in the standard model – with string theories. In the last decade, the duality has become exceedingly popular because it provides an alternative to calculations which are difficult or impossible in the gauge theory. The duality is normally used only in the limit where one has classical (super)gravity (λ to ∞) and an infinite number of color charges (Nc to ∞). This limit is reasonably well understood. Most string theorists however believe in the full conjecture, which is that the duality remains valid for all values of these parameters. The problem is though, if one does not work in this limit, it is darned hard to calculate anything.

A string theorist, they joke, is someone who thinks three is infinitely large. Being able to deal with a finite number of color charges is relevant for applications because the strong nuclear force has 3 colors only. If one keeps the size of the space-time fixed relative to the string length (which corresponds to fixed λ), a finite Nc however means taking into account string effects, and since the string coupling gs ~ λ/Nc goes to infinity with λ when Nc remains finite, this is a badly understood limit.

In his paper, Bodendorfer looks at the limit of finite Nc and λ to infinity. It’s a clever limit in that it gets rid of the string excitations, and instead moves the problem of small color charges into the realm of super-quantum gravity. Loop quantum gravity is by design a non-perturbative quantization, so it seems ideally suited to investigate this parameter range where string theorists don’t know what to do. But it’s also a strange limit in that I don’t see how to get back the perturbative limit and classical gravity once one has pushed gs to infinity. (If you have more insight than me, please leave a comment.)

In any case, the connection Bodendorfer makes in his paper is that the limit of Nc to ∞ can also be obtained in LQG by a suitable scaling of the spin network. In LQG one works with a graph that has a representation label, l. The graph describes space-time and this label enters the spectrum of the area operator, so that the average quantum of area increases with this label. When one keeps the network fixed, the limit of large l then blows up the area quanta and thus the whole space, which corresponds to the limit of Nc to infinity.

So far, so good. If LQG could now be used to calculate certain observables on the gravity side, then one could further employ the duality to obtain the corresponding observables in the gauge theory. The key question is though whether the loop-quantization actually reproduces the same limit that one would obtain in string theory. I am highly skeptical that this is indeed the case. Suppose it was. This would mean that LQG, like string theory, must have a dual description as a gauge theory still outside the classical limit in which they both agree (they had better). The supersymmetric version of LQG used here has the matter content of supergravity. But it is missing all the framework that in string theory eventually gives rise to branes (stacks thereof) and compactifications, which seem so essential to obtain the duality to begin with.

And then there is the problem that in LQG it isn’t well understood how to get back classical gravity in the continuum limit, which Bodendorfer kind of assumes to be the case. If that doesn’t work, then we don’t even know whether in the classical limit the two descriptions actually agree.

Despite my skepticism, I think this is an important contribution. Lacking experimental guidance, the only way we can find out which theory of quantum gravity is the correct description of nature is to demonstrate that there is only one way to quantize gravity that reproduces General Relativity and the Standard Model in the suitable limits while being UV-finite. Studying how the known approaches do or don’t relate to each other is a step to understanding whether one has any option in the quantization, or whether we do indeed already have enough data to uniquely identify the sought-after theory.

Summary: It’s good someone is thinking about this. Even better this someone isn’t me. For a theory that has only one parameter, it seems to have a lot of parameters.

### Alexey Petrov — Nobel week 2015

So, once again, the Nobel week is upon us. And one of the topics of conversation for the “water cooler chat” in physics departments around the world is speculation on who (besides the infamous Hungarian “physicist”; sorry for the insider joke, I can elaborate on that if asked) will get the Nobel Prize in physics this year. What is your prediction?

With the invention of various metrics for “measuring scientific performance” one can make educated guesses, and even put predictions on an industrial footing: see the Thomson Reuters predictions based on the number of citations (they did get the Englert-Higgs prize right, but are almost always off). Or even try your luck with on-line betting (sorry, no link here; I don’t encourage this). So there is a variety of ways to get you interested.

My predictions for 2015: Vera Rubin for Dark Matter or Deborah Jin for fermionic condensates. But you must remember that my record is no better than that of Thomson Reuters.

### Chad Orzel — 034/366: Dinner

A big chunk of Sunday was lost to a wretched cold– despite a two-hour afternoon nap, I was asleep by 10pm– but I did get the camera out for a bit while doing some late-season grilling:

Pork chops on the grill.

Not the most amazing photo, I know (though it does take deliberate work to get those cross-hatched grill marks…), but it pairs well with this:

Thermal camera image of pork chops on the grill.

Which is the same basic scene in the infrared. Because I have a thermal camera, so why not?

### John Preskill — “Experimenting” with women-in-STEM stereotypes

When signing up for physics grad school, I didn’t expect to be interviewed by a comedienne on a spoof science show about women in STEM.

Last May, I received an email entitled “Amy Poehler’s Smart Girls.” The actress, I read, had co-founded the Smart Girls organization to promote confidence and creativity in preteens and teens. Smart Girls was creating a webseries hosted by Megan Amram, author of Science…for Her! The book parodies women’s magazines and ridicules stereotypes of women as unsuited for science.

Megan would host the webseries, “Experimenting with Megan,” in character as an airhead. She planned to interview “kick-ass lady scientists/professors/doctors” in a parody of a talk show. Would I, the email asked, participate?

I’m such a straitlaced fogey, I never say “kick-ass.” I’m such a workaholic, I don’t watch webshows. I’ve not seen Parks and Recreation, the TV series that starred Amy Poehler and for which Megan wrote. The Hollywood bug hasn’t bitten me, though I live 30 minutes from Studio City.

But I found myself in a studio the next month. Young men and women typed on laptops and chattered in the airy, bright waiting lounge. Beyond a doorway lay the set, enclosed by fabric-covered walls that prevented sounds from echoing. Script-filled binders passed from hand to hand, while makeup artists, cameramen, and gophers scurried about.

Being interviewed on “Experimenting with Megan.”

Disney’s Mouseketeers couldn’t have exuded more enthusiasm or friendliness than the “Experimenting” team. “Can I bring you a bottle of water?” team members kept asking me and each other. “Would you like a chair?” The other women who interviewed that day—two biologist postdocs—welcomed me into their powwow. Each of us, we learned, is outnumbered by men at work. None of us wears a lab coat, despite stereotypes of scientists as white-coated. Each pours herself into her work: One postdoc was editing a grant proposal while off-set.

I watched one interview, in which Megan asked why biologists study fruit flies instead of “cuter” test subjects. Then I stepped on-set beside her. I perched on an armchair that almost swallowed my 5’ 3.5” self.* Textbooks, chemistry flasks, and high-heeled pumps stood on the bookshelves behind Megan.

The room quieted. A clapperboard clapped: “Take one.” Megan thanked me for coming, then launched into questions.

Megan hadn’t warned me what she’d ask. We began with “Do you like me?” and “What is the ‘information’ [in ‘quantum information theory’], and do you ever say, ‘Too much information’?” Each question rode hot on the heels of the last. The barrage reminded me of interviews for not-necessarily-scientific scholarships. Advice offered by one scholarship-committee member, the year before I came to Caltech, came to mind: Let loose. Act like an athlete tearing down the field, the opposing team’s colors at the edges of your vision. Savor the challenge.

I savored it. I’d received instructions to play the straight man, answering Megan’s absurdity with science. To “Too much information?” I parried that we can never know enough. When I mentioned that quantum mechanics describes electrons, Megan asked about the electricity she feels upon seeing Chris Hemsworth. (I hadn’t heard of Chris Hemsworth. After watching the interview online, a friend reported that she’d enjoyed the reference to Thor. “What reference to Thor?” I asked. Hemsworth, she informed me, plays the title character.) I dodged Chris Hemsworth; caught “electricity”; and stretched to superconductors, quantum devices whose charges can flow forever.

Academic seminars conclude with question-and-answer sessions. If only those Q&As zinged with as much freshness and flexibility as Megan’s.

The “Experimenting” approach to stereotype-blasting diverges from mine. High-heeled pumps, I mentioned, decorated the set. The “Experimenting” team was parodying the stereotype of women as shoe-crazed. “Look at this stereotype!” the set shouts. “Isn’t it ridiculous?”

As a woman who detests high heels and shoe shopping, I prefer to starve the stereotype of justification. I’ve preferred reading to shopping since before middle school, when classmates began frequenting malls. I feel more comfortable demonstrating, through silence, how little shoes interest me. I’d rather offer no reason for anyone to associate me with shoes.**

I scarcely believe that I appear just after a “sexy science” tagline and a hot-or-not quiz. Before my interview on her quantum episode, Megan discussed the relationship between atoms and Adams. Three guests helped her, three Hollywood personalities named “Adam.”*** Megan held up cartoons of atoms, and photos of Adams, and asked her guests to rate their hotness. I couldn’t have played Megan’s role, couldn’t imagine myself in her (high-heeled) shoes.

But I respect the “Experimenting” style. Megan’s character serves as a foil for the interviewee I watched. Megan’s ridiculousness underscored the postdoc’s professionalism and expertise.

According to online enthusiasm, “Experimenting” humor resonates with many viewers. So diverse is the community that needs introducing to STEM, diverse senses of humor have roles to play. So deep run STEM’s social challenges, multiple angles need attacking.

Just as diverse perspectives can benefit women-in-STEM efforts, so can diverse perspectives benefit STEM. Which is why STEM needs women, Adams, shoe-lovers, shoe-haters…and experimentation.

With gratitude to the “Experimenting” team for the opportunity to contribute to its cause. The live-action interview appears here (beginning at 2:42), and a follow-up personality quiz appears here.

*If you’re 5′ 3.5″, every half-inch matters.

**Except when I blog about how little I wish to associate with shoes.

***Megan introduced her guests as “Adam Shankman, Adam Pally, and an intern that we made legally change his name to Adam to be on the show.” The “intern” is Adam Rymer, president of Legendary Digital Networks. Legendary owns Amy Poehler’s Smart Girls.

### Doug Natelson — Annual Nobel speculation

It's getting to be that time of year again.  The 2015 Nobel Prize in Physics will be announced this coming Tuesday morning (EU/US time).  Based on past patterns, it looks like it could well be an astro prize.  Dark matter/galaxy rotation curves anyone?  Or extrasolar planets?  (I still like Aharonov + Berry for geometric phases, perhaps with Thouless as well.  However, it's unlikely that condensed matter will come around this year.)

On Wednesday, the chemistry prize will be awarded.  There, I have no clue.  Curious Wavefunction has a great write-up that you should read, though.

Speculate away!

## October 04, 2015

### Secret Blogging Seminar — Michigan Math In Action

Those of you who are interested in college math instruction may be interested in a no-longer-so-new blog “Michigan Math In Action”, which a number of our faculty started last year. (I was involved in the sense of telling people “blogs are fun!”, but haven’t written anything for them yet.) It mostly features thoughtful pieces on teaching calculus and similar courses.

Recently, Gavin Larose put up a lengthy footnoted post on the effort that goes into running our “Gateway testing” center, and the benefits we get from it. This is a room designed for proctoring computerized tests of basic skills, and we use it for things like routine differentiation or putting matrices into reduced row echelon form, which we want every student to know but which are a waste of class time. Check it out!

### Chad Orzel — 033/366: Ommmm…..

Saturdays are the busiest days around Chateau Steelypips, with both SteelyKid and The Pip having soccer in the morning (in two different places), then lunch, then some sort of activity for the afternoon. Yesterday, this was a party for one of SteelyKid’s friends. And last night, there was a Movie Night at SteelyKid’s BFF’s house, where the kids all watched a cartoon and the adults hung around reveling in adult conversation.

All of which means that I managed to go the entire day without taking any photos. Not even a cell-phone snapshot (I’m assistant coaching SteelyKid’s soccer team, but the other coach wasn’t there this week, so I was on my own, which doesn’t allow for photography).

So, here’s a shot of The Pip from Friday night:

The Pip “meditating.”

Since our trip to DC back in July, the kids have been nuts for the Cartoon Network show “Teen Titans Go.” One of the characters, Raven, uses magic as her power, and spends a bunch of time floating in the air and meditating. The Pip finds this fascinating, and will periodically declare that he’s meditating, striking this pose. Usually with a scrunched-up intense face that doesn’t suggest contemplation of the infinite, but this shot was in a sort of transitional moment, and actually looks vaguely meditative.

And it’s cute, so works as a “sorry for no new photos” offering…

## October 03, 2015

### Georg von Hippel — Fundamental Parameters from Lattice QCD, Last Days

The last few days of our scientific programme were quite busy for me, since I had agreed to give the summary talk on the final day. I therefore did not get around to blogging, and will keep this much-delayed summary rather short.

On Wednesday, we had a talk by Michele Della Morte on non-perturbatively matched HQET on the lattice and its use to extract the b quark mass, and a talk by Jeremy Green on the lattice measurement of the nucleon strange electromagnetic form factors (which are purely disconnected quantities).

On Thursday, Sara Collins gave a review of heavy-light hadron spectra and decays, and Mike Creutz presented arguments for why the question of whether the up-quark is massless is scheme dependent (because the sum and difference of the light quark masses are protected by symmetries, but will in general renormalize differently).

On Friday, I gave the summary of the programme. The main themes that I identified were the question of how to estimate systematic errors, and how to treat them in averaging procedures, the issues of isospin breaking and scale setting ambiguities as major obstacles on the way to sub-percent overall precision, and the need for improved communication between the "producers" and "consumers" of lattice results. In the closing discussion, the point was raised that for groups like CKMfitter and UTfit the correlations between different lattice quantities are very important, and that lattice collaborations should provide the covariance matrices of the final results for different observables that they publish wherever possible.
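To illustrate why the covariance matrices matter to downstream fitters, here is a minimal sketch of first-order error propagation for the ratio of two correlated observables. The central values, errors, and correlation coefficient are hypothetical numbers chosen for illustration, not lattice results.

```python
import numpy as np

def propagate(gradient, cov):
    """First-order variance of f(x, y): grad^T C grad."""
    g = np.asarray(gradient, dtype=float)
    return float(g @ cov @ g)

# Hypothetical central values and errors for two observables x and y.
x, y = 1.20, 1.00
sx, sy = 0.02, 0.02
rho = 0.8  # assumed correlation coefficient between x and y

cov_full = np.array([[sx**2, rho * sx * sy],
                     [rho * sx * sy, sy**2]])
cov_diag = np.diag([sx**2, sy**2])  # what a user must assume if no covariance is published

# Ratio r = x / y has gradient (1/y, -x/y^2).
grad = [1.0 / y, -x / y**2]
err_with = np.sqrt(propagate(grad, cov_full))
err_without = np.sqrt(propagate(grad, cov_diag))

print(f"error on x/y with correlation:    {err_with:.4f}")
print(f"error on x/y ignoring correlation: {err_without:.4f}")
```

With these made-up inputs, ignoring the off-diagonal term overestimates the error on the ratio by roughly a factor of two; with a negative correlation it would be underestimated instead.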

### Clifford Johnson — Benedict

I call this part of the garden Benedict, for obvious reasons... right?

The post Benedict appeared first on Asymptotia.

### David Hogg — #AstroHackWeek 2015, day 3

Andy Mueller (NYU) started the day with an introduction to machine learning and the scikit-learn package (of which he is a developer). His tutorial was beautiful, as it followed a fully interactive and operational Jupyter notebook. There was an audible gasp in the room when the astronomers saw the magic of dimensionality reduction in the form of the t-SNE algorithm. It is truly magic! There was much debate in the room about whether it was useful for anything other than blowing your mind (i.e., visualization).

In the afternoon, Baron and I refactored our galaxy projection/deprojection code in preparation for inferring the galaxy itself. This refactor was non-trivial (and we learned a lot about the Jupyter notebook!), but it was done by the end of the day. We discussed next steps for the galaxy inference, which we hope to start tomorrow!

### David Hogg — #AstroHackWeek 2015, day 2

The second day of #AstroHackWeek started with Juliana Freire (NYU) talking to us about databases and data management. She also talked about map-reduce and its various cousins. Freire is the Executive Director of the Moore-Sloan Data Science Environment at NYU, and so also our boss in this endeavor (AstroHackWeek is supported by the M-S DSE, and also Github and the LSST project). Many good questions came up in Freire's discussion about the special needs of astronomers when it comes to database choice and customization. Freire opined that one of the reasons that the SDSS databases were so successful is that we had Jim Gray (Microsoft; deceased) and a team working full time on making them awesome. I agree!

In the afternoon, Dalya Baron (TAU) and I made fake data for our galaxy-deprojection project. These data were images with finite point-spread function and Gaussian noise. We then showed that by likelihood optimization we can (very easily, I am surprised to say) infer the Euler angles of the projection (and other nuisance parameters, like shifts, re-scalings, and background level). We also showed that if we have two kinds of three-dimensional galaxies making the two-dimensional images, we can fairly confidently decide which three-d galaxy made which two-d image. This is important for deprojecting heterogeneous collections. I ended the day very stoked about all this!
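A toy version of this kind of inference can be sketched as follows. This is not the hack-week code: it assumes a single axisymmetric Gaussian "galaxy", a Gaussian PSF, a single unknown inclination angle in place of three Euler angles, and a brute-force scan in place of a real optimizer. All parameter values are made up.

```python
import numpy as np

def disk_image(theta, xs, ys, a=2.0, c=0.5, psf=0.3, flux=100.0):
    """Image of an axisymmetric 3-D Gaussian tilted by theta about the x-axis.

    a and c are the in-plane and vertical widths; a Gaussian PSF of width
    psf adds in quadrature to the projected widths.
    """
    var_x = a**2 + psf**2
    var_y = (a * np.cos(theta))**2 + (c * np.sin(theta))**2 + psf**2
    X, Y = np.meshgrid(xs, ys)
    norm = flux / (2.0 * np.pi * np.sqrt(var_x * var_y))
    return norm * np.exp(-0.5 * (X**2 / var_x + Y**2 / var_y))

def neg_log_likelihood(theta, data, xs, ys, noise):
    """Gaussian noise model: chi-squared / 2, up to a theta-independent constant."""
    resid = data - disk_image(theta, xs, ys)
    return 0.5 * np.sum(resid**2) / noise**2

# Fake data: model image plus independent Gaussian pixel noise.
rng = np.random.default_rng(42)
xs = ys = np.linspace(-8.0, 8.0, 65)
theta_true, noise = 0.9, 0.2
data = disk_image(theta_true, xs, ys) + rng.normal(0.0, noise, (65, 65))

# Brute-force scan; restricting to [0, pi/2] avoids the theta -> pi - theta degeneracy.
thetas = np.linspace(0.0, np.pi / 2, 181)
nll = np.array([neg_log_likelihood(t, data, xs, ys, noise) for t in thetas])
theta_hat = thetas[np.argmin(nll)]
print(f"true {theta_true:.3f}, recovered {theta_hat:.3f}")
```

The nuisance parameters mentioned above (shifts, re-scalings, background level) would enter as extra arguments of the model image and extra dimensions of the optimization.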

## October 02, 2015

### Chad Orzel — 032/366: Symmetry

Took the camera along this afternoon when I took Emmy for a walk (Kate and I are going to see The Martian tonight, so Emmy got an early dinner and stroll), and took pictures of a bunch of random stuff in a little park near our house. Including these two pictures pasted together into one:

Two tree trunks in a park near Chateau Steelypips.

These two trees aren’t right next to each other, but I didn’t do much more than turn 90 degrees to get from one to the other. If I were attempting to pass myself off as an artiste, I would explain that this juxtaposition here is a political allegory: the tree on the right is encrusted with nasty old fungus, the tree on the left is ringed with a climbing green vine, but they’re both being killed by the stuff attached to them.

Send me my giant paycheck, Art World…

Really, though, I just liked the color and texture contrasts of these, which is why I took the two individual shots. And when I had them up side-by-side to try to choose one for the photo of the day, they looked good together, so I said “What the hell, I’ll combine them in GIMP.”

### Tommaso Dorigo — Thank You Guido

It is with great sadness that I heard (reading it here first) about the passing away of Guido Altarelli, a very distinguished Italian theoretical physicist. Altarelli is best known for the formulas that bear his name, the Altarelli-Parisi equations (also known as DGLAP since it was realized that due recognition for the equations had to be given also to Dokshitzer, Gribov, and Lipatov). But Altarelli was a star physicist who made important contributions to Standard Model physics in a number of ways.

### David Hogg — #AstroHackWeek 2015, day 1

We kicked off AstroHackWeek 2015 today, with a huge crowd (some 60-ish people) from all over the world and in all different astrophysics fields (and a range of career stages!). Kyle Barbary (UCB) started the day with an introduction to Python for data analysts and Daniela Huppenkothen (NYU, and the principal organizer of the event) followed with some ideas for exploratory data analysis. They set up Jupyter notebooks for interactive tutorials and the crowd followed along. At some point in the morning, Huppenkothen shocked the crowd by letting them know that admissions to AstroHackWeek had been done (in part) with a random number generator!

In the afternoon, the hacking started: People split up into groups to work on problems brought to the meeting by the participants. Dalya Baron (TAU) and I teamed up to write code to build and project mixtures of Gaussians in preparation for an experiment in classifying and determining the projections (Euler angles) of galaxies. This is the project that Leslie Greengard and I have been discussing at the interface of molecular microscopy and astrophysics; that is, anything we figure out here could be used in many other contexts. By the end of the day we had working mixture-of-Gaussian code and could generate fake images.
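A minimal sketch of building and projecting a mixture of Gaussians might look like the following. It is an illustration under stated assumptions, not the code written at the hack week: the z-y-z Euler convention, the function names, and the two-component "galaxy" are all hypothetical choices. The key fact used is that rotating a 3-D Gaussian and integrating along the line of sight gives a 2-D Gaussian whose covariance is the upper-left block of the rotated covariance.

```python
import numpy as np

def euler_zyz(alpha, beta, gamma):
    """Rotation matrix for z-y-z Euler angles (one common convention)."""
    def rz(t):
        c, s = np.cos(t), np.sin(t)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    def ry(t):
        c, s = np.cos(t), np.sin(t)
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return rz(alpha) @ ry(beta) @ rz(gamma)

def project_mixture(weights, means, covs, rot):
    """Rotate a 3-D Gaussian mixture, then integrate along the line of sight (z)."""
    means2d = (means @ rot.T)[:, :2]
    covs2d = np.array([(rot @ c @ rot.T)[:2, :2] for c in covs])
    return weights, means2d, covs2d

def render(weights, means2d, covs2d, grid):
    """Evaluate the projected mixture at an (N, 2) array of pixel centres."""
    image = np.zeros(len(grid))
    for w, mu, cov in zip(weights, means2d, covs2d):
        d = grid - mu
        inv = np.linalg.inv(cov)
        norm = w / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
        image += norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, inv, d))
    return image

# Hypothetical two-component "galaxy": a thin disk plus a small off-centre blob.
weights = np.array([0.7, 0.3])
means = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, -0.5]])
covs = np.array([np.diag([1.0, 1.0, 0.04]), np.diag([0.09, 0.09, 0.09])])

rot = euler_zyz(0.3, 1.1, -0.4)
xs = np.linspace(-6.0, 6.0, 121)
pixel_area = (xs[1] - xs[0]) ** 2
X, Y = np.meshgrid(xs, xs)
grid = np.column_stack([X.ravel(), Y.ravel()])
image = render(*project_mixture(weights, means, covs, rot), grid)
print("total flux on the grid:", image.sum() * pixel_area)
```

Because the projection is analytic, no numerical line-of-sight integration is needed, which is one reason mixtures of Gaussians are convenient for this kind of experiment.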

### Backreaction — Book Review: “A Beautiful Question” by Frank Wilczek

A Beautiful Question: Finding Nature's Deep Design
By Frank Wilczek
Penguin Press (July 14, 2015)

My four-year-old daughter recently discovered that equilateral triangles combine into larger equilateral triangles. When I caught a distracted glimpse of her artwork, I thought she had drawn the baryon decuplet, an often used diagram to depict relations between particles composed of three quarks.

The baryon decuplet doesn’t come easy to us, but the beauty of symmetry does, and how amazing that physicists have found it tightly woven into the fabric of nature itself: Both the standard model of particle physics and General Relativity, our currently most fundamental theories, are in essence mathematically precise implementations of symmetry requirements. But next to being instrumental for the accurate description of nature, the appeal of symmetries is a human universal that resonates in art and design throughout cultures. For the physicist, it is impossible not to note the link, not to see the equations behind the art. It may be a curse or it may be a blessing.

For Frank Wilczek it clearly is a blessing. In his most recent book “A Beautiful Question,” he recounts the success of symmetries in physics, and goes on to answer his question of whether “the world embodies beautiful ideas” with a clear “Yes.”

Lara’s decuplet.

Wilczek starts from the discovery of basic mathematical relationships like Pythagoras’ theorem (not shying away from explaining how to prove it!) and proceeds through the history of physics along selected milestones such as musical harmonies, the nature of light and the basics of optics, Newtonian gravity and its extension to General Relativity, quantum mechanics, and ultimately the standard model of particle physics. He briefly touches on condensed matter physics, graphene in particular, and has an interesting digression about the human eye’s limited ability to decode visual information (yes, the shrimp again).

In the last chapters of the book, Wilczek goes into quite some detail about the particle content of the standard model, and in just which way it seems to be not as beautiful as one may have hoped. He introduces the reader to extended theories, grand unification and supersymmetry, invented to remedy the supposed shortcomings of the standard model. The reader who is not familiar with the quantum numbers used to classify elementary particles will likely find this chapter somewhat demanding. But whether or not one makes the effort to follow the details, Wilczek gets his message across clearly: Striving for beauty in natural law has been a useful guide, and he expects it to remain one, even though he is careful to note that relying on beauty has on various occasions led to plainly wrong theories, such as the attempt to explain planetary orbits with the Platonic solids, or the idea of developing a theory of atoms based on the mathematics of knots.

“A Beautiful Question” is a skillfully written reflection, or “meditation” as Wilczek puts it. It is well structured and accompanied by many figures, including two inserts with color prints. The book also contains an extensive glossary, recommendations for further reading, and a timeline of the discoveries mentioned in the text.

My husband’s decuplet.

The content of the book is unique in the genre. David Goldberg’s book “The Universe in the Rearview Mirror: How Hidden Symmetries Shape Reality,” for example, also discusses the role of symmetries in fundamental physics, but Wilczek gives more space to the connection between aesthetics in art and science. “A Beautiful Question” picks up and expands on the theme of Steven Weinberg’s 1992 book “Dreams of a Final Theory” that also expounded the relevance of beauty in the development of physical theories. More than 20 years have passed, but the dream is still as elusive today as it was back then.

For all his elaboration on the beauty of symmetry though, Wilczek’s book falls short of spelling out the main conundrum physicists face today: We have no reason to be confident that the laws of nature which we have yet to discover will conform to the human sense of beauty. Neither does he spend many words on aspects of beauty beyond symmetry; Wilczek only briefly touches on fractals, and never goes into the rich appeal of chaos and complexity.

My mother used to say that “symmetry is the art of the dumb,” which is maybe a somewhat too harsh criticism of the standard model, but seeing that reliance on beauty has not helped us within the last 20 years, maybe it is time to consider that the beauty of the answers might not reveal itself as effortlessly as does the tiling of the plane to a four-year-old. Maybe the inevitable subjectivity in our sense of aesthetic appeal that has served us well so far is about to turn from a blessing to a curse, misleading us as to where the answers lie.

Wilczek’s book contains something for every reader, whether that is the physicist interested in learning how a Nobel Prize winner thinks of the connection between ideas and reality, or the layman wanting to know more about the structure of fundamental law. “A Beautiful Question” reminds us of the many ways that science connects to the arts, and invites us to marvel at the success our species has had in unraveling the mysteries of nature.

[An edited version of this review appeared in the October issue of Physics Today.]

### Backreaction — Service Announcement: Backreaction now on facebook!

Over the years the discussion of my blogposts has shifted over to facebook. To follow this trend and to make it easier for you to engage, I have now set up a facebook page for this blog. Just "like" the page to get the newest blogposts and other links that I post :)

## October 01, 2015

### Tommaso Dorigo — Researchers' Night 2015

Last Friday I was invited by the University of Padova to talk about particle physics to the general public, on the occasion of the "Researchers' Night", a yearly event organized by the European Commission which takes place throughout Europe - in 280 cities this year. Of course I gladly accepted the invitation, although it caused some trouble for my travel schedule (I was in Austria for lectures until Friday morning, and you don't want to see me driving when I am in a hurry, especially on a 500km route).

## September 30, 2015

### Doug Natelson — DOE Experimental Condensed Matter Physics PI Meeting 2015 - Day 3

Things I learned from the last (half)day of the DOE PI meeting:
• "vortex explosion" would be a good name for a 1980s metal band.
• Pulsed high fields make possible some really amazing measurements in both high-$T_{\mathrm{C}}$ materials and more exotic things like SmB6.
• Looking at structural defects (twinning) and magnetic structural issues (spin orientation domain walls) can give insights into complicated issues in pnictide superconductors.
• Excitons can be a nice system for looking at coherence phenomena ordinarily seen in cold atom systems.  See here and here.  Theory proposes that you could play with these at room temperature with the right material system.
• Thermal gradients can drive spin currents even in insulating paramagnets, and these can be measured with techniques that could be performed down to small length scales.
• Very general symmetry considerations when discussing competing ordered states (superconductivity, charge density wave order, spin density wave order) can lead to testable predictions.
• Hybrid, monocrystalline nanoparticles combining metals and semiconductors are very pretty and can let you drive physical processes based on the properties of both material systems.

### n-Category Café — An exact square from a Reedy category

I first learned about exact squares from a blog post written by Mike Shulman on the $n$-Category Café.

Today I want to describe a family of exact squares, which are also homotopy exact, that I had not encountered previously. These make a brief appearance in a new preprint, A necessary and sufficient condition for induced model structures, by Kathryn Hess, Magdalena Kedziorek, Brooke Shipley, and myself.

Proposition. If $R$ is any (generalized) Reedy category, with $R^+ \subset R$ the direct subcategory of degree-increasing morphisms and $R^- \subset R$ the inverse subcategory of degree-decreasing morphisms, then the pullback square: $\array{ iso(R) & \to & R^- \\ \downarrow & \swArrow id & \downarrow \\ R^+ & \to & R}$ is (homotopy) exact.

In summary, a Reedy category $(R,R^+,R^-)$ gives rise to a canonical exact square, which I’ll call the Reedy exact square.

#### Exact squares and Kan extensions

Let’s recall the definition. Consider a square of functors inhabited by a natural transformation $\array{A & \overset{f}{\to} & B\\ ^u\downarrow & \swArrow\alpha & \downarrow^v\\ C& \underset{g}{\to} & D}$ For any category $M$, precomposition defines a square $\array{M^A & \overset{f^\ast}{\leftarrow} & M^B\\ ^{u^\ast}\uparrow & \swArrow \alpha^\ast & \uparrow^{v^\ast}\\ M^C& \underset{g^\ast}{\leftarrow} & M^D}$ Supposing there exist left Kan extensions $u_! \dashv u^\ast$ and $v_! \dashv v^\ast$ and right Kan extensions $f^\ast \dashv f_\ast$ and $g^\ast \dashv g_\ast$, the mates of $\alpha^\ast$ define canonical Beck-Chevalley transformations: $u_! f^\ast \Rightarrow g^\ast v_!\quad \text{and} \quad v^\ast g_\ast \Rightarrow f_\ast u^\ast.$ Note if either of the Beck-Chevalley transformations is an isomorphism, the other one is too by the (contravariant) correspondence between natural transformations between a pair of left adjoints and natural transformations between the corresponding right adjoints.

Definition. $\array{A & \overset{f}{\to} & B\\ ^u\downarrow & \swArrow\alpha & \downarrow^v\\ C& \underset{g}{\to} & D}$ is an exact square if, for any $M$ admitting pointwise Kan extensions, the Beck-Chevalley transformations are isomorphisms.

Comma squares provide key examples, in which case the Beck-Chevalley isomorphisms recover the limit and colimit formulas for pointwise Kan extensions.

The notion of homotopy exact square is obtained by replacing $M$ by some sort of homotopical category, the adjoints by derived functors, and “isomorphism” by “equivalence.”

#### The proof

In the preprint we give a direct proof that these Reedy squares are exact by computing the Kan extensions, but exactness follows more immediately from the following characterization theorem, stated using comma categories. The natural transformation $\alpha \colon v f \Rightarrow g u$ induces a functor $B \downarrow f \times_A u \downarrow C \to v \downarrow g$ over $C \times B$ defined on objects by sending a pair $b \to f(a), u(a) \to c$ to the composite morphism $v(b) \to v f(a) \to g u(a) \to g(c)$. Fixing a pair of objects $b$ in $B$ and $c$ in $C$, this pulls back to define a functor $b \downarrow f \times_A u \downarrow c \to vb \downarrow gc.$

Theorem. A square $\array{A & \overset{f}{\to} & B\\ ^u\downarrow & \swArrow\alpha & \downarrow^v\\ C& \underset{g}{\to} & D}$ is exact if and only if each fiber of $b \downarrow f \times_A u \downarrow c \to v b \downarrow g c$ is non-empty and connected.

See the nLab for a proof. Similarly, the square is homotopy exact if and only if each fiber of this functor has a contractible nerve.

In the case of a Reedy square $\array{ iso(R) & \to & R^- \\ \downarrow & \swArrow id & \downarrow \\ R^+ & \to & R}$ these fibers are precisely the categories of Reedy factorizations of a fixed morphism. For an ordinary Reedy category $R$, Reedy factorizations are unique, and so the fibers are terminal categories. For a generalized Reedy category, Reedy factorizations are unique up to unique isomorphism, so the fibers are contractible groupoids.
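As a concrete sanity check on the uniqueness claim, here is a small brute-force sketch for one familiar ordinary Reedy category, the simplex category $\Delta$ (chosen here as an illustration; it is not discussed above). There $R^-$ consists of the order-preserving surjections and $R^+$ of the order-preserving injections, and the script counts, for every monotone map between small ordinals, the ways to write it as a surjection followed by an injection. The function names are hypothetical.

```python
from itertools import product

def monotone_maps(m, n):
    """All order-preserving maps {0..m} -> {0..n}, encoded as tuples of images."""
    return [f for f in product(range(n + 1), repeat=m + 1)
            if all(f[i] <= f[i + 1] for i in range(m))]

def is_surjective(f, k):
    return set(f) == set(range(k + 1))

def is_injective(f):
    return len(set(f)) == len(f)

def reedy_factorizations(f, m, n):
    """Pairs (s, d): s a surjection [m] -> [k], d an injection [k] -> [n], d after s = f."""
    found = []
    for k in range(min(m, n) + 1):
        for s in monotone_maps(m, k):
            if not is_surjective(s, k):
                continue
            for d in monotone_maps(k, n):
                if is_injective(d) and tuple(d[i] for i in s) == f:
                    found.append((s, d))
    return found

# Number of factorizations of each monotone map [m] -> [n] for m, n <= 3.
counts = {len(reedy_factorizations(f, m, n))
          for m in range(4) for n in range(4)
          for f in monotone_maps(m, n)}
print(counts)
```

Every map should have exactly one such factorization (through its image), which is the statement that the corresponding fiber is a terminal category.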

#### Reedy diagrams as bialgebras

For any category $M$, the objects in the lower right-hand corner of the square $\array{ M^{iso(R)} & \leftarrow & M^{R^-} \\ \uparrow & \swArrow id & \uparrow \\ M^{R^+} & \leftarrow & M^R}$ are Reedy diagrams in $M$, and the functors restrict to various subdiagrams. Because the indexing categories all have the same objects, if $M$ is bicomplete each of these restriction functors is both monadic and comonadic. If we think of $M^{R^-}$ as being comonadic over $M^{iso(R)}$ and $M^{R^+}$ as being monadic over $M^{iso(R)}$, then the Beck-Chevalley isomorphism exhibits $M^R$ as the category of bialgebras for the monad induced by the direct subcategory $R^+$ and the comonad induced by the inverse subcategory $R^-$.

There is a homotopy-theoretic interpretation of this, which I’ll describe in the case where $R$ is a strict Reedy category (so that $iso(R)=ob(R)$), though it works in the generalized context as well. If $M$ is a model category, then $M^{iso(R)}$ inherits a model structure, with everything defined objectwise. The Reedy model structure on $M^{R^-}$ coincides with the injective model structure, which has cofibrations and weak equivalences created by the restriction functor $M^{R^-} \to M^{iso(R)}$; we might say this model structure is “left-induced”. Dually, the Reedy model structure on $M^{R^+}$ coincides with the projective model structure, which has fibrations and weak equivalences created by $M^{R^+} \to M^{iso(R)}$; this is “right-induced”.

The Reedy model structure on $M^R$ then has two interpretations: it is right-induced along the monadic restriction functor $M^R \to M^{R^-}$ and it is left-induced along the comonadic restriction functor $M^R \to M^{R^+}$. The paper A necessary and sufficient condition for induced model structures describes a general technique for inducing model structures on categories of bialgebras, which reproduces the Reedy model structure in this special case.

### Resonaances — Weekend plot: minimum BS conjecture

This weekend plot completes my last week's post:

It shows the phase diagram for models of natural electroweak symmetry breaking. These models can be characterized by 2 quantum numbers:

• B [Baroqueness], describing how complicated the model is relative to the standard model;
• S [Specialness], describing the fine-tuning needed to achieve electroweak symmetry breaking with the observed Higgs boson mass.

To allow for a fair comparison, in all models the cut-off scale is fixed to Λ=10 TeV. The standard model (SM) has, by definition,  B=1, while S≈(Λ/mZ)^2≈10^4.  The principle of naturalness postulates that S should be much smaller, S ≲ 10.  This requires introducing new hypothetical particles and interactions, therefore inevitably increasing B.

The most popular approach to reducing S is by introducing supersymmetry.  The minimal supersymmetric standard model (MSSM) does not make fine-tuning better than 10^3 in the bulk of its parameter space. To improve on that, one needs to introduce large A-terms (aMSSM), or R-parity breaking interactions (RPV), or an additional scalar (NMSSM).  Another way to decrease S is achieved in models where the Higgs arises as a composite Goldstone boson of new strong interactions. Unfortunately, in all of those models, S cannot be smaller than 10^2 due to phenomenological constraints from colliders. To suppress S even further, one has to resort to the so-called neutral naturalness, where new particles beyond the standard model are not charged under the SU(3) color group. The twin Higgs - the simplest model of neutral naturalness - can achieve S≈10 at the cost of introducing a whole parallel mirror world.

The parametrization proposed here leads to a striking observation. While one can increase B indefinitely (many examples have been proposed in the literature), for a given S there seems to be a minimum value of B below which no models exist. In fact, the conjecture is that the product B*S is bounded from below:
BS ≳ 10^4.
One robust prediction of the minimum BS conjecture is the existence of a very complicated (B=10^4) yet to be discovered model with no fine-tuning at all.  The take-home message is that one should always try to minimize BS, even if for fundamental reasons it cannot be avoided completely ;)

### David Hogg — Singer, Sigworth

Who walked in to the SCDA today but Amit Singer (Princeton), a mathematician who works on Cryo-EM and related problems! That was good for Greengard and me; we bombarded him with questions about how the system works, physically. Of course he knew, and showed us a beautiful set of notes from Fred Sigworth (Yale) explaining the physics of the situation. It is beautifully situated at the interface of geometric and physical optics, which will make the whole thing fun.

### David Hogg — tomography, detailed balance, exoplanet non-parametrics

After I started to become convinced that we might be able to make progress on the Cryo-EM problem, Greengard described to me an easier (in some ways) problem: cryo-electron tomography. This is similar to Cryo-EM, but the experimenter controls the angles! In principle this should make it easier (and it does), but according to mathematical standards the problem is still ill-posed: The device can't be turned to all possible angles, or not even enough to fill out the tomographic information needed for general reconstructions. Of course this doesn't faze me!

Useful conversations with Foreman-Mackey included two interesting subjects. One is that even if you have K different MCMC moves, each of which satisfies detailed balance, it is not guaranteed that a deterministic sequence of them will satisfy detailed balance! That blew me away for a few minutes but then started to make sense. Ish. According to Foreman-Mackey, there is a connection between this issue and the point that a product of symmetric matrices will not necessarily be symmetric!
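The point about deterministic sequences of moves can be checked on a toy chain. The following is a minimal sketch, not anything from the conversation itself: the three-state target `pi` and the two swap proposals are made-up illustrative choices. Detailed balance says the matrix with entries `pi[i] * P[i][j]` is symmetric, and the sketch shows that two Metropolis kernels with this property can compose into one without it, even though the target stays stationary.

```python
def metropolis_kernel(pi, proposal):
    """Metropolis transition matrix for target pi and a symmetric proposal matrix."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = proposal[i][j] * min(1.0, pi[j] / pi[i])
        P[i][i] = 1.0 - sum(P[i])  # rejected or unproposed moves stay put
    return P


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def satisfies_detailed_balance(pi, P, tol=1e-12):
    """True if pi[i] * P[i][j] == pi[j] * P[j][i] for all pairs (i, j)."""
    n = len(pi)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < tol
               for i in range(n) for j in range(n))


def is_stationary(pi, P, tol=1e-12):
    """True if pi is left unchanged by one step of P."""
    n = len(pi)
    return all(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) < tol
               for j in range(n))


# Hypothetical target distribution and two symmetric proposals (assumptions).
pi = [0.5, 0.3, 0.2]
swap_01 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # only proposes 0 <-> 1
swap_12 = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]  # only proposes 1 <-> 2

P_a = metropolis_kernel(pi, swap_01)
P_b = metropolis_kernel(pi, swap_12)
P_ab = matmul(P_a, P_b)  # move a followed by move b

print("move a reversible:  ", satisfies_detailed_balance(pi, P_a))
print("move b reversible:  ", satisfies_detailed_balance(pi, P_b))
print("a then b reversible:", satisfies_detailed_balance(pi, P_ab))
print("a then b stationary:", is_stationary(pi, P_ab))
```

In this toy case the composite can go from state 0 to state 2 in one sweep but never from 2 to 0, which is what breaks reversibility; the sampler is still valid because stationarity survives.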

The other interesting subject with Foreman-Mackey was on exoplanet system modeling. We want to explore some non-parametrics: That is, instead of considering the exoplanet population as a mixture of one-planet, two-planet, three-planet (and so on) systems, model it just with K-planet systems, where K is very large (or infinite). This model would require having a significant amount of the planet-mass pdf at very low masses (or sizes). Not only might this have many technical advantages, it also accords with our intuitions: After all, I think we (almost) all think that if you have one or two major planets, you probably have many, many minor ones. Like way many.

## September 29, 2015

### Doug Natelson — DOE Experimental Condensed Matter Physics PI Meeting 2015 - Day 2

Among the things I learned at the second day of the meeting:

• In relatively wide quantum wells, and high fields, you can enter the quantum Hall insulating state.  Using microwave measurements, you can see signatures of phase transitions within the insulating state - there are different flavors of insulator in there.  See here.
• As I'd alluded to a while ago, you can make "artificial" quantum systems with graphene-like energetic properties (for example).
• In 2d hole gasses at the interface between Ge and overlying SiGe, you can get really huge anisotropy of the electrical resistivity in magnetic fields, with the "hard" axis along the direction of the in-plane magnetic field.
• In single-layer thick InN quantum wells with GaN above and below, you can have a situation where there is basically zero magnetoresistance.  That's really weird.
• In clever tunneling spectroscopy experiments (technique here) on 2d hole gasses, you can see sharp inelastic features that look like inelastic excitation of phonons.
• Tunneling measurements through individual magnetic nanoparticles can show spin-orbit-coupling-induced level spacings, and cranking up the voltage bias can permit spin processes that are otherwise blockaded.  See here.
• Niobium islands on a gold film are a great tunable system for studying the motion of vortices in superconductors, and even though the field is a mature one, new and surprising insights come out when you have a clean, controlled system and measurement techniques.
• Scanning Josephson microscopy (requiring a superconducting STM tip, a superconducting sample, and great temperature and positional control) is going to be very powerful for examining the superconducting order parameter on atomic scales.
• In magnetoelectric systems (e.g., ferroelectrics coupled to magnetic materials), combinations of nonlinear optics and electronic measurements are required to unravel which of the various possible mechanisms (charge vs strain mediated) generates the magnetoelectric coupling.
• Strongly coupling light in a cavity with Rydberg atoms should be a great technique for generating many body physics for photons (e.g., the equivalent of quantum Hall).
• Carbon nanotube devices can be great systems for looking at quantum phase transitions and quantum critical scaling, in certain cases.
• Controlling vortex pinning and creep is hugely important in practical superconductors.  Arrays of ferromagnetic particles as in artificial spin ice systems can control and manipulate vortices.  Thermal fluctuations in high temperature superconductors could end up limiting performance badly, even if the transition temperature is at room temperature or more, and the situation is worse if the material is more anisotropic in terms of effective mass.
• "Oxides are like people; it is their defects that make them interesting."

### Sean Carroll — Core Theory T-Shirts

Way back when, for purposes of giving a talk, I made a figure that displayed the world of everyday experience in one equation. The label reflects the fact that the laws of physics underlying everyday life are completely understood.

So now there are T-shirts. (See below to purchase your own.)

It’s a good equation, representing the Feynman path-integral formulation of an amplitude for going from one field configuration to another one, in the effective field theory consisting of Einstein’s general theory of relativity plus the Standard Model of particle physics. It even made it onto an extremely cool guitar.

I’m not quite up to doing a comprehensive post explaining every term in detail, but here’s the general idea. Our everyday world is well-described by an effective field theory. So the fundamental stuff of the world is a set of quantum fields that interact with each other. Feynman figured out that you could calculate the transition between two configurations of such fields by integrating over every possible trajectory between them — that’s what this equation represents. The thing being integrated is the exponential of the action for this theory — as mentioned, general relativity plus the Standard Model. The GR part integrates over the metric, which characterizes the geometry of spacetime; the matter fields are a bunch of fermions, the quarks and leptons; the non-gravitational forces are gauge fields (photon, gluons, W and Z bosons); and of course the Higgs field breaks symmetry and gives mass to those fermions that deserve it. If none of that makes sense — maybe I’ll do it more carefully some other time.

Gravity is usually thought to be the odd force out when it comes to quantum mechanics, but that’s only if you really want a description of gravity that is valid everywhere, even at (for example) the Big Bang. But if you only want a theory that makes sense when gravity is weak, like here on Earth, there’s no problem at all. The little notation k < Λ at the bottom of the integral indicates that we only integrate over low-frequency (long-wavelength, low-energy) vibrations in the relevant fields. (That's what gives away that this is an "effective" theory.) In that case there's no trouble including gravity. The fact that gravity is readily included in the EFT of everyday life has long been emphasized by Frank Wilczek. As discussed in his latest book, A Beautiful Question, he therefore advocates lumping GR together with the Standard Model and calling it The Core Theory.

I couldn’t agree more, so I adopted the same nomenclature for my own upcoming book, The Big Picture. There’s a whole chapter (more, really) in there about the Core Theory. After finishing those chapters, I rewarded myself by doing something I’ve been meaning to do for a long time — put the equation on a T-shirt, which you see above.

I’ve had T-shirts made before, with pretty grim results as far as quality is concerned. I knew this one would be especially tricky, what with all those tiny symbols. But I tried out Design-A-Shirt, and the result seems pretty impressively good.

So I’m happy to let anyone who might be interested go ahead and purchase shirts for themselves and their loved ones. Here are the links for light/dark and men’s/women’s versions. I don’t actually make any money off of this — you’re just buying a T-shirt from Design-A-Shirt. They’re a little pricey, but that’s what you get for the quality. I believe you can even edit colors and all that — feel free to give it a whirl and report back with your experiences.

### Doug Natelson — DOE Experimental Condensed Matter Physics PI meeting 2015 - Day 1

Things I learned at today's session of the DOE ECMP PI meeting:
• In the right not-too-thick, not-too-thin layers of the 3d topological insulator Bi1.5Sb0.5Te1.7Se1.3 (a cousin of Bi2Se3 that actually is reasonably insulating in the bulk), it is possible to use top and bottom gates to control the surface states on the upper and lower faces, independently.  See here.
• In playing with suspended structures of different stackings of a few layers of graphene, you can get some dramatic effects, like the appearance of large, sharp energy gaps.  See here.
• While carriers in graphene act in some ways like massless particles because their band energy depends linearly on their crystal momentum (like photon energy depends linearly on photon momentum in electrodynamics), they have a "dynamical" effective mass, $$m^* = \hbar (\pi n_{2d})^{1/2}/v_{\mathrm{F}}$$, related to how the electronic states respond to an electric bias.
• PdCoO2 is a weird layered metal that can be made amazingly clean, so that its residual resistivity can be as small as 8 n$$\Omega$$-cm.  That's about 200 times smaller than the room temperature resistivity of gold or copper.
• By looking at how anisotropic the electrical resistivity is as a function of direction in the plane of layered materials, and how that anisotropy can vary with applied strain, you can define a "nematic susceptibility".  That susceptibility implies the existence of fluctuations in the anisotropy of the electronic properties (nematic fluctuations).  Those fluctuations seem to diverge at the structural phase transition in the iron pnictide superconductors.  See here.   Generically, these kinds of fluctuations seem to boost the transition temperature of superconductors.
• YPtBi is a really bizarre material - nonmetallic temperature dependence, high resistivity, small carrier density, yet superconducts.
• Skyrmions (see here) can be nucleated in controlled ways in the right material systems.  Using the spin Hall effect, they can be pushed around.  They can also be moved by thermally driven spin currents, and interestingly skyrmions tend to flow from the cold side of a sample to the hot side.
• It's possible to pump angular momentum from an insulating ferromagnet, through an insulating antiferromagnet (NiO), and into a metal.  See here.
• The APS Conferences for Undergraduate Women in Physics have been a big hit, using attendance as a metric.  Extrapolating, in a couple of years it looks like nearly all of the undergraduate women majoring in physics in the US will likely be attending one of these.
• Making truly nanoscale clusters out of some materials (e.g., Co2Si, Mn5Si3) can turn them from weak ferromagnets or antiferromagnets in the bulk into strong ferromagnets in nanoparticle form.   See here.
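
To get a feel for the size of the graphene "dynamical" effective mass quoted in the list above, here is a rough numerical sketch of $$m^* = \hbar (\pi n_{2d})^{1/2}/v_{\mathrm{F}}$$. The carrier density and the Fermi velocity of about 10^6 m/s are illustrative assumptions, not numbers from the meeting.

```python
import math

HBAR = 1.054571817e-34        # reduced Planck constant, J s
M_ELECTRON = 9.1093837015e-31  # free electron mass, kg


def graphene_dynamical_mass(n_2d_per_m2, v_fermi=1.0e6):
    """m* = hbar * sqrt(pi * n_2d) / v_F, in kg.

    v_fermi defaults to ~1e6 m/s, a commonly quoted value (assumption).
    """
    return HBAR * math.sqrt(math.pi * n_2d_per_m2) / v_fermi


# Assumed example density: 1e12 carriers per cm^2, converted to per m^2.
n_2d = 1e12 * 1e4
m_star = graphene_dynamical_mass(n_2d)
ratio = m_star / M_ELECTRON
print(f"m* = {m_star:.3e} kg = {ratio:.4f} electron masses")
```

Under these assumptions the mass comes out at a few percent of the free electron mass, and it scales as the square root of the density.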

## September 28, 2015

### Sean Carroll — The Big Picture

Once again I have not really been the world’s most conscientious blogger, have I? Sometimes other responsibilities have to take precedence — such as looming book deadlines. And I’m working on a new book, and that deadline is definitely looming!

And here it is. The title is The Big Picture: On the Origins of Life, Meaning, and the Universe Itself. It’s scheduled to be published on May 17, 2016; you can pre-order it at Amazon and elsewhere right now.

An alternative subtitle was What Is, and What Matters. It’s a cheerfully grandiose (I’m supposed to say “ambitious”) attempt to connect our everyday lives to the underlying laws of nature. That’s a lot of ground to cover: I need to explain (what I take to be) the right way to think about the fundamental nature of reality, what the laws of physics actually are, sketch some cosmology and connect to the arrow of time, explore why there is something rather than nothing, show how interesting complex structures can arise in an undirected universe, talk about the meaning of consciousness and how it can be purely physical, and finally try to understand meaning and morality in a universe devoid of transcendent purpose. I’m getting tired just thinking about it.

From another perspective, the book is an explication of, and argument for, naturalism — and in particular, a flavor I label Poetic Naturalism. The “Poetic” simply means that there are many ways of talking about the world, and any one that is both (1) useful, and (2) compatible with the underlying fundamental reality, deserves a place at the table. Some of those ways of talking will simply be emergent descriptions of physics and higher levels, but some will also be matters of judgment and meaning.

As of right now the book is organized into seven parts, each with several short chapters. All that is subject to change, of course. But this will give you the general idea.

* Part One: Being and Stories

How we think about the fundamental nature of reality. Poetic Naturalism: there is only one world, but there are many ways of talking about it. Suggestions of naturalism: the world moves by itself, time progresses by moments rather than toward a goal. What really exists.

* Part Two: Knowledge and Belief

Telling different stories about the same underlying truth. Acquiring and updating reliable beliefs. Knowledge of our actual world is never perfect. Constructing consistent planets of belief, guarding against our biases.

* Part Three: Time and Cosmos

The structure and development of our universe. Time’s arrow and cosmic history. The emergence of memories, causes, and reasons. Why is there a universe at all, and is it best explained by something outside itself?

* Part Four: Essence and Possibility

Drawing the boundary between known and unknown. The quantum nature of deep reality: observation, entanglement, uncertainty. Vibrating fields and the Core Theory underlying everyday life. What we can say with confidence about life and the soul.

* Part Five: Complexity and Evolution

Why complex structures naturally arise as the universe moves from order to disorder. Self-organization and incremental progress. The origin of life, and its physical purpose. The anthropic principle, environmental selection, and our role in the universe.

* Part Six: Thinking and Feeling

The mind, the brain, and the body. What consciousness is, and how it might have come to be. Contemplating other times and possible worlds. The emergence of inner experiences from non-conscious matter. How free will is compatible with physics.

* Part Seven: Caring and Mattering

Why we can’t derive ought from is, even if “is” is all there is. And why we nevertheless care about ourselves and others, and why that matters. Constructing meaning and morality in our universe. Confronting the finitude of life, deciding what stories we want to tell along the way.

Hope that whets the appetite a bit. Now back to work with me.

### Backreaction — No, Loop Quantum Gravity has not been shown to violate the Holographic Principle

 Didn't fly.
Tl;dr: The claim in the paper is just wrong. Read on if you want to know why it matters.

Several people asked me for comments on a recent paper that appeared on the arxiv, “Violation of the Holographic Principle in the Loop Quantum Gravity” by Ozan Sargın and Mir Faizal. We have met Mir Faizal before; he is the one who explained that the LHC would make contact with parallel universes [spoiler alert: it won’t]. Now, I have recently decided to adopt a strict diet of intellectual veganism: I’ll refuse to read anything produced by making science suffer. So I wouldn’t normally have touched the paper, not even with a fork. But since you asked, I gave it a look.

The claim in the paper is that Loop Quantum Gravity (LQG), the most popular approach to quantum gravity after string theory, must be wrong because it violates the Holographic Principle. The Holographic Principle requires that the number of different states inside a volume is bounded by the surface of the volume. That sounds like a rather innocuous and academic constraint, but once you start thinking about it it’s totally mindboggling.

All our intuition tells us that the number of different states in a volume is bounded by the volume, not the surface. Try stuffing the Legos back into your kid’s toy box, and you will think it’s the volume that bounds what you can cram inside. But the Holographic Principle says that this is only approximately so. If you would try to pack more and more, smaller and smaller Legos into the box, you would eventually fail to get anything more inside. And if you would measure what bounds the success of your stuffing of the tiniest Legos, it would be the surface area of the box. In more detail, the number of different states has to be less than a quarter of the surface area measured in Planck units. That’s a huge number and so far off our daily experience that we never notice this limit. What we notice in practice is only the bound by the volume.

The Holographic Principle is a consequence of black hole physics, which does not depend on the details of quantizing gravity, and it is therefore generally expected that the entropy bound must be obeyed by all approaches to quantum gravity.

Physicists have tried, of course, to see whether they can find a way to violate this bound. You can consider various types of systems, pack them as tightly as possible, and then calculate the number of degrees of freedom. In this, it is essential that you take into account quantum behavior, because it’s the uncertainty principle that ultimately prevents arbitrarily tight packing. In all known cases however, it was found that the system will collapse to a black hole before the bound is saturated. And black holes themselves saturate the bound. So whatever physicists tried, they only confirmed that the bound holds indeed. With every such thought-experiment, and with every failure of violating the entropy bound, they have grown more convinced that the holographic principle captures a deep truth about nature.

The only known exceptions that violate the holographic entropy bound are the super-entropic monster-states constructed by Hsu and collaborators. These states however are pathological in that not only will they inevitably go on to collapse to a black hole, they also must have come out of a white hole in the past. They are thus mathematically possible, but not physically realistic. (Aside: That the states come out of a white hole and vanish into a black hole also means you can’t create these super-entropic configurations by throwing in stuff from infinity, which should come as a relief to anybody who believes in the AdS/CFT correspondence.)

So if Loop Quantum Gravity would violate the Holographic Principle that would be a pretty big deal, making the theory inconsistent with all that’s known about black hole physics!

In the paper, the authors redo the calculation for the entropy of a particular quantum system. With the usual quantization, this system obeys the holographic principle. With the quantization technique from Loop Quantum Gravity, the authors get an additional term but the system still obeys the holographic entropy bound, since the additional term is subdominant to the first. They conclude “We have demonstrated that the holographic principle is violated due to the effects coming from LQG.” It’s a plain non-sequitur.

I suspect that the authors mistook the maximum entropy of the quantum system under consideration, previously calculated by ‘t Hooft, for the holographic bound. This is strange because in the introduction they have the correct definition for the holographic bound. Besides this, the claim that in LQG it should be more difficult to obey the holographic bound is highly implausible to begin with. LQG is a discretization approach. It reduces the number of states, it doesn’t increase them. Clearly, if you go down to the discretization scale, the number of states should drop to zero. This makes me think that not only did the authors misinterpret the result, they probably also got the sign of the additional term wrong.

(To prevent confusion, please note that in the paper they calculated corrections to the entropy of the matter, not corrections to the black hole entropy, which would go onto the other side of the equation.)

You might come away with the impression that we have here two unfortunate researchers who were confused about some terminology, and I’m being an ass for highlighting their mistakes. And you would be right, of course, they were confused, and I’m an ass. But let me add that after having read the paper I did contact the authors and explained that their statement that LQG violates the Holographic Principle is wrong and does not follow from their calculation. After some back and forth, they agreed with me, but refused to change anything about their paper, claiming that it’s a matter of phrasing and in their opinion it’s all okay even though it might confuse some people. And so I am posting this explanation here because then it will show up as an arxiv trackback. Just to avoid that it confuses some people.

In summary: Loop Quantum Gravity is alive and well. If you feed me papers in the future, could you please take into account my dietary preferences?

## September 27, 2015

### Tommaso Dorigo — One Dollar On 5.3 TeV

This is just a short post to mention one thing I recently learned from a colleague - the ATLAS experiment also seems to have collected a 5.3 TeV dijet event, as CMS recently did (the way the communication took place indicates that this is public information; if it is not, might you ATLAS folks let me know, so that I'll remove this short posting?). If any reader here from ATLAS can point me to the event display I would be grateful. These events are spectacular to look at: the CMS 5 TeV dijet event display was posted here a month ago if you'd like to have a look.

## September 26, 2015

### Resonaances — Weekend Plot: celebration of a femtobarn

The LHC run-2 has reached the psychologically important point where the amount of integrated luminosity exceeds one inverse femtobarn. To celebrate this event, here is a plot showing the ratio of the number of hypothetical resonances produced so far in run-2 and in run-1 collisions as a function of the resonance mass:
In run-1 at 8 TeV, ATLAS and CMS collected around 20 fb^-1. For 13 TeV collisions the amount of data is currently 1/20 of that; however, the hypothetical cross section for producing hypothetical TeV scale particles is much larger. For heavy enough particles the gain in cross section is larger than a factor of 20, which means that run-2 now probes a previously unexplored parameter space (this simplistic argument ignores the fact that backgrounds are also larger at 13 TeV, but it's approximately correct at very high masses where backgrounds are small). Currently, the turning point is about 2.7 TeV for resonances produced, at the fundamental level, in quark-antiquark collisions, and even below that for those produced in gluon-gluon collisions. The current plan is to continue the physics run till early November which, at this pace, should give us around 3 fb^-1 to brood upon during the winter break. This means that the 2015 run will stop short of sorting out the existence of the 2 TeV di-boson resonance indicated by run-1 data. Unless, of course, the physics run is extended at the expense of heavy-ion collisions scheduled for November ;)
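
The break-even argument can be spelled out with the event-count relation N = σ × L: run-2 has produced more of a given resonance than run-1 once the cross-section gain between 8 and 13 TeV exceeds the luminosity ratio. A minimal sketch follows; the function names are hypothetical, the luminosities are the round numbers quoted in the post, and any `xsec_gain` passed in is a made-up input rather than a real cross-section calculation.

```python
def run2_to_run1_yield(xsec_gain, lumi_run1_fb=20.0, lumi_run2_fb=1.0):
    """Ratio N_run2 / N_run1 = (sigma_13 / sigma_8) * (L_run2 / L_run1).

    xsec_gain is sigma_13 / sigma_8 for the resonance in question.
    """
    return xsec_gain * lumi_run2_fb / lumi_run1_fb


def breakeven_gain(lumi_run1_fb=20.0, lumi_run2_fb=1.0):
    """Cross-section gain at which run-2 matches the run-1 yield."""
    return lumi_run1_fb / lumi_run2_fb


print("break-even gain today:       ", breakeven_gain())
print("break-even gain with 3 fb^-1:", breakeven_gain(lumi_run2_fb=3.0))
print("yield ratio for a gain of 50:", run2_to_run1_yield(50.0))
```

As the luminosity accumulates, the required gain drops, so the turning point moves to lower resonance masses.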

### Jordan Ellenberg — Pila on a “modular Fermat equation”

I like this paper by Pila that just went up on the arXiv, which shows the way that you can get Diophantine consequences from the rapid progress being made in theorems of Andre-Oort type.  (I also want to blog about Tsimerman + Zhang + Yuan on “average Colmez” and Andre-Oort, maybe later!)

Pila shows that if N and M are sufficiently large primes, you can’t have elliptic curves E_1/Q and E_2/Q such that E_1 has an N-isogenous curve E_1 -> E’_1, E_2 has an M-isogenous curve E_2 -> E’_2, and j(E’_1) + j(E’_2) = 1.  (It seems to me the proof uses little about this particular algebraic relation and would work just as well for any f(j(E’_1),j(E’_2)) whose vanishing didn’t cut out a modular curve in X(1) x X(1).)  (This is “Fermat-like” in that it asserts finiteness of rational points on a natural countable family of high-genus curves; a more precise analogy is explained in the paper.)

How this works, loosely:  suppose you have such an (E_1, E_2).  A theorem of Kühne guarantees that E_1 and E_2 are not both CM (I didn’t know this!).  It follows (WLOG assume N > M) that the N-isogenies of E_1 are defined over a field of degree at least N^a for some small a (Pila uses more precise bounds coming from a recent paper of Najman).  So the Galois conjugates of (E’_1, E’_2) give you a whole bunch of algebraic points (E”_1, E”_2) with j(E”_1) + j(E”_2) = 1.

So what?  Rational curves have lots of low-height algebraic points.  But here’s the thing.  These isogenous choices of (E’_1, E’_2) aren’t just any algebraic points on X(1) x X(1); they represent pairs of elliptic curves drawn from a *fixed pair of isogeny classes*.  Let H be the hyperbolic plane as usual, and write (z,w) for a point on H x H corresponding to (E’_1, E’_2).  Then the other choices (E”_1, E”_2) correspond to points (gz,hw) with g,h in GL_2(Q).  GL_2(Q), not GL_2(R)!  That’s what we get from working in a fixed isogeny class.  And these points satisfy

j(gz) + j(hw) = 1.

To sum up:  you have a whole bunch of rational points (g,h) on GL_2 x GL_2.  These points are pretty low height (for this Pila gestures at a paper of his with Habegger.)  And they lie on the surface j(gz) + j(hw) = 1.  But this surface is a totally non-algebraic thing, because remember, j is a transcendental function on H!  So (Pila’s version of) the Ax-Lindemann theorem (correction from comments:  the Pila-Wilkie theorem) generates a contradiction; a transcendental curve can’t have too many low-height rational points.

## September 25, 2015

### Terence Tao — Entropy and rare events

Let ${X}$ and ${Y}$ be two random variables taking values in the same (discrete) range ${R}$, and let ${E}$ be some subset of ${R}$, which we think of as the set of “bad” outcomes for either ${X}$ or ${Y}$. If ${X}$ and ${Y}$ have the same probability distribution, then clearly

$\displaystyle {\bf P}( X \in E ) = {\bf P}( Y \in E ).$

In particular, if it is rare for ${Y}$ to lie in ${E}$, then it is also rare for ${X}$ to lie in ${E}$.

If ${X}$ and ${Y}$ do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance ${\delta(X,Y)}$ between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

$\displaystyle {\bf P}(Y \in E) - \delta(X,Y) \leq {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \delta(X,Y) \ \ \ \ \ (1)$

for any ${E \subset R}$. In particular, if it is rare for ${Y}$ to lie in ${E}$, and ${X,Y}$ are close in total variation, then it is also rare for ${X}$ to lie in ${E}$.

A basic inequality in information theory is Pinsker’s inequality

$\displaystyle \delta(X,Y) \leq \sqrt{\frac{1}{2} D_{KL}(X||Y)}$

where the Kullback-Leibler divergence ${D_{KL}(X||Y)}$ is defined by the formula

$\displaystyle D_{KL}(X||Y) = \sum_{x \in R} {\bf P}( X=x ) \log \frac{{\bf P}(X=x)}{{\bf P}(Y=x)}.$

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that ${D_{KL}(X||Y)}$ is non-negative (Gibbs’ inequality), and vanishes if and only if ${X}$, ${Y}$ have the same distribution; thus one can think of ${D_{KL}(X||Y)}$ as a measure of how close the distributions of ${X}$ and ${Y}$ are to each other, although one should caution that this is not a symmetric notion of distance, as ${D_{KL}(X||Y) \neq D_{KL}(Y||X)}$ in general. Inserting Pinsker’s inequality into (1), we see for instance that

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \sqrt{\frac{1}{2} D_{KL}(X||Y)}.$

Thus, if ${X}$ is close to ${Y}$ in the Kullback-Leibler sense, and it is rare for ${Y}$ to lie in ${E}$, then it is rare for ${X}$ to lie in ${E}$ as well.
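
As a quick sanity check of the two inequalities above, here is a small numerical sketch (not part of the original argument) that draws random pairs of distributions on a small range and confirms both Pinsker's inequality and the resulting bound on ${{\bf P}(X \in E)}$. The random sampling scheme, the range sizes, and the choice of ${E}$ as the first few outcomes are arbitrary illustrative assumptions; natural logarithms are used throughout.

```python
import math
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q), natural logarithm."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def random_distribution(n, rng):
    """A strictly positive random probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
worst_slack = float("inf")
for _ in range(1000):
    n = rng.randint(2, 8)
    p = random_distribution(n, rng)  # law of X
    q = random_distribution(n, rng)  # law of Y
    pinsker = math.sqrt(0.5 * max(0.0, kl_divergence(p, q)))
    k = rng.randint(1, n)            # E = the first k outcomes
    p_E, q_E = sum(p[:k]), sum(q[:k])
    slack_tv = pinsker - total_variation(p, q)  # Pinsker's inequality
    slack_event = q_E + pinsker - p_E           # bound on P(X in E)
    worst_slack = min(worst_slack, slack_tv, slack_event)

print("smallest slack over all trials:", worst_slack)
```

A non-negative smallest slack means neither inequality was violated in any trial.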

We can specialise this inequality to the case when ${Y}$ is a uniform random variable ${U}$ on a finite range ${R}$ of some cardinality ${N}$, in which case the Kullback-Leibler divergence ${D_{KL}(X||U)}$ simplifies to

$\displaystyle D_{KL}(X||U) = \log N - {\bf H}(X)$

where

$\displaystyle {\bf H}(X) := \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}$

is the Shannon entropy of ${X}$. Again, a routine application of Jensen’s inequality shows that ${{\bf H}(X) \leq \log N}$, with equality if and only if ${X}$ is uniformly distributed on ${R}$. The above inequality then becomes

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(U \in E) + \sqrt{\frac{1}{2}(\log N - {\bf H}(X))}. \ \ \ \ \ (2)$

Thus, if ${E}$ is a small fraction of ${R}$ (so that it is rare for ${U}$ to lie in ${E}$), and the entropy of ${X}$ is very close to the maximum possible value of ${\log N}$, then it is rare for ${X}$ to lie in ${E}$ also.

The inequality (2) is only useful when the entropy ${{\bf H}(X)}$ is close to ${\log N}$ in the sense that ${{\bf H}(X) = \log N - O(1)}$, otherwise the bound is worse than the trivial bound of ${{\bf P}(X \in E) \leq 1}$. In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy ${{\bf H}(X)}$ was allowed to be smaller than ${\log N - O(1)}$. More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality) Let ${X}$ be a random variable taking values in a finite range ${R}$ of cardinality ${N}$, let ${U}$ be a uniformly distributed random variable in ${R}$, and let ${E}$ be a subset of ${R}$. Then

$\displaystyle {\bf P}(X \in E) \leq \frac{(\log N - {\bf H}(X)) + \log 2}{\log 1/{\bf P}(U \in E)}.$

Proof: Consider the conditional entropy ${{\bf H}(X | 1_{X \in E} )}$. On the one hand, we have

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf H}(X, 1_{X \in E}) - {\bf H}(1_{X \in E} )$

$\displaystyle = {\bf H}(X) - {\bf H}(1_{X \in E})$

$\displaystyle \geq {\bf H}(X) - \log 2$

by Jensen’s inequality. On the other hand, one has

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf P}(X \in E) {\bf H}(X | X \in E )$

$\displaystyle + (1-{\bf P}(X \in E)) {\bf H}(X | X \not \in E)$

$\displaystyle \leq {\bf P}(X \in E) \log |E| + (1-{\bf P}(X \in E)) \log N$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{N}{|E|}$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{1}{{\bf P}(U \in E)},$

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. $\Box$
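As a numerical sanity check of Lemma 1, here is a minimal Python sketch that draws random distributions on a finite range and compares ${{\bf P}(X \in E)}$ with the right-hand side of the lemma. The function names, the skewed weights, and the parameter choices are illustrative assumptions, not taken from the paper.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (natural logarithm) of a probability vector."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

def lemma1_bound(probs, E):
    """Right-hand side of Lemma 1 for a distribution on range(N) and a subset E."""
    N = len(probs)
    u_mass = len(E) / N                       # P(U in E) for uniform U
    return (math.log(N) - entropy(probs) + math.log(2)) / math.log(1.0 / u_mass)

def check_lemma1(N=64, size_E=4, trials=1000, seed=0):
    """Return the largest value of P(X in E) - bound observed over random trials."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        weights = [rng.random() ** 3 for _ in range(N)]   # skewed random weights
        total = sum(weights)
        probs = [w / total for w in weights]
        E = rng.sample(range(N), size_E)
        p_in_E = sum(probs[i] for i in E)
        worst = max(worst, p_in_E - lemma1_bound(probs, E))
    return worst

if __name__ == "__main__":
    # A non-positive value means the bound held in every trial.
    print(check_lemma1())
```

The check is only a spot test on random inputs; the proof above is what guarantees the inequality in general.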

Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

$\displaystyle {\bf P}(X \in E) \leq \frac{D(X||Y) + \log 2}{\log 1/{\bf P}(Y \in E)}$

for arbitrary random variables ${X,Y}$ taking values in the same discrete range ${R}$, which follows from the data processing inequality

$\displaystyle D( f(X)||f(Y)) \leq D(X|| Y)$

for arbitrary functions ${f}$, applied to the indicator function ${f = 1_E}$. Indeed one has

$\displaystyle D( 1_E(X) || 1_E(Y) ) = {\bf P}(X \in E) \log \frac{{\bf P}(X \in E)}{{\bf P}(Y \in E)}$

$\displaystyle + {\bf P}(X \not \in E) \log \frac{{\bf P}(X \not \in E)}{{\bf P}(Y \not \in E)}$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - h( {\bf P}(X \in E) )$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - \log 2$

where ${h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}}$ is the entropy function.
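The two inequalities in Remark 2 can be checked on a small example. The sketch below (with made-up distributions `p`, `q` on six points and a made-up set `E`) computes the Kullback-Leibler divergence before and after applying the indicator ${1_E}$, and the resulting bound on ${{\bf P}(X \in E)}$.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pushforward(p, E):
    """Distribution of the indicator 1_E(X), as [P(X not in E), P(X in E)]."""
    p_E = sum(p[i] for i in E)
    return [1.0 - p_E, p_E]

def general_bound(p, q, E):
    """Right-hand side (D(X||Y) + log 2) / log(1 / P(Y in E)) from Remark 2."""
    q_E = sum(q[i] for i in E)
    return (kl(p, q) + math.log(2)) / math.log(1.0 / q_E)

# Hypothetical example data: X concentrates on E = {0, 1}, Y mostly avoids it.
p = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
q = [0.05, 0.05, 0.1, 0.2, 0.3, 0.3]
E = [0, 1]

if __name__ == "__main__":
    print("D(1_E(X)||1_E(Y)) =", kl(pushforward(p, E), pushforward(q, E)))
    print("D(X||Y)           =", kl(p, q))
    print("P(X in E)         =", sum(p[i] for i in E))
    print("bound             =", general_bound(p, q, E))
```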

Thus, for instance, if one has

$\displaystyle {\bf H}(X) \geq \log N - o(K)$

and

$\displaystyle {\bf P}(U \in E) \leq \exp( - K )$

for some ${K}$ much larger than ${1}$ (so that ${1/K = o(1)}$), then

$\displaystyle {\bf P}(X \in E) = o(1).$

More informally: if the entropy of ${X}$ is somewhat close to the maximum possible value of ${\log N}$, and it is exponentially rare for a uniform variable to lie in ${E}$, then it is still somewhat rare for ${X}$ to lie in ${E}$. The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable ${X}$ which is uniformly distributed inside a small set ${E}$ with some probability ${p}$ and uniformly distributed outside of ${E}$ with probability ${1-p}$, for some parameter ${0 \leq p \leq 1}$.
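To see the near-sharpness concretely, one can evaluate both (2) and Lemma 1 on the two-level random variable just described. The following sketch assumes the illustrative values ${N = 2^{40}}$, ${|E| = 1}$ and ${p = 0.3}$; the helper names are hypothetical.

```python
import math

def h(u):
    """Binary entropy function in nats."""
    return sum(v * math.log(1.0 / v) for v in (u, 1.0 - u) if v > 0)

def two_level_entropy(N, size_E, p):
    """Entropy of X that is uniform on E with probability p and uniform off E otherwise."""
    return h(p) + p * math.log(size_E) + (1.0 - p) * math.log(N - size_E)

def bounds(N, size_E, p):
    """Return (right-hand side of (2), right-hand side of Lemma 1) for this X."""
    deficit = math.log(N) - two_level_entropy(N, size_E, p)
    u_mass = size_E / N
    pinsker = u_mass + math.sqrt(0.5 * deficit)
    lemma1 = (deficit + math.log(2)) / math.log(1.0 / u_mass)
    return pinsker, lemma1

pinsker, lemma1 = bounds(N=2**40, size_E=1, p=0.3)

if __name__ == "__main__":
    print("true P(X in E):", 0.3)
    print("bound from (2):", pinsker)
    print("bound from Lemma 1:", lemma1)
```

With these parameters the entropy deficit is far larger than ${O(1)}$, so (2) gives a bound exceeding ${1}$, while the bound from Lemma 1 stays within about ${0.01}$ of the true probability.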

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

$\displaystyle F(U) \approx {\bf E} F(U)$

with exponentially high probability, where ${U}$ is a uniformly distributed random variable and ${F}$ is some reasonable function of ${U}$. Combining this with the above lemma, we can then obtain approximations of the form

$\displaystyle F(X) \approx {\bf E} F(U) \ \ \ \ \ (3)$

$\displaystyle \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+j+p) 1_{p|n+j}$

$\displaystyle \approx \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+j+p)}{p}$

for “most” choices of ${n}$ and a suitable choice of ${H}$ (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as ${\sum_{n \leq x} \frac{\lambda(n)\lambda(n+1)}{n}}$ through the multiplicativity of ${\lambda}$, while the right-hand side, being a linear correlation involving two parameters ${j,p}$ rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.
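The role of multiplicativity can be made concrete: since ${\lambda(pm) = -\lambda(m)}$, writing ${n = pm}$ gives the exact identity ${\sum_{n \leq x} \frac{\lambda(n)\lambda(n+p) 1_{p|n}}{n} = \frac{1}{p} \sum_{m \leq x/p} \frac{\lambda(m)\lambda(m+1)}{m}}$. The sketch below verifies this numerically; the sieve and the function names are illustrative choices.

```python
def liouville_table(limit):
    """Return a list lam with lam[n] = Liouville lambda(n) for 1 <= n <= limit."""
    spf = list(range(limit + 1))               # smallest prime factor of each n
    for i in range(2, int(limit ** 0.5) + 1):
        if spf[i] == i:                        # i is prime
            for m in range(i * i, limit + 1, i):
                if spf[m] == m:
                    spf[m] = i
    lam = [0] * (limit + 1)
    lam[1] = 1
    for n in range(2, limit + 1):
        lam[n] = -lam[n // spf[n]]             # complete multiplicativity, lambda(p) = -1
    return lam

def divisible_sum(lam, x, p):
    """Sum over n <= x with p | n of lambda(n) lambda(n+p) / n."""
    return sum(lam[n] * lam[n + p] / n for n in range(p, x + 1, p))

def rescaled_sum(lam, x, p):
    """(1/p) times the sum over m <= x/p of lambda(m) lambda(m+1) / m."""
    return sum(lam[m] * lam[m + 1] / m for m in range(1, x // p + 1)) / p

if __name__ == "__main__":
    x = 2000
    lam = liouville_table(x + 11)              # table must reach x + p
    for p in (2, 3, 5, 7, 11):
        print(p, divisible_sum(lam, x, p), rescaled_sum(lam, x, p))
```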


### Terence Tao — The logarithmically averaged Chowla and Elliott conjectures for two-point correlations; the Erdos discrepancy problem

I’ve just uploaded two related papers to the arXiv:

This pair of papers is an outgrowth of these two recent blog posts and the ensuing discussion. In the first paper, we establish the following logarithmically averaged version of the Chowla conjecture (in the case ${k=2}$ of two-point correlations (or “pair correlations”)):

$\displaystyle \sum_{x/\omega(x) < n \leq x} \frac{\lambda(a_1 n + b_1) \lambda(a_2 n+b_2)}{n} = o( \log \omega(x) ) \ \ \ \ \ (1)$

as ${x \rightarrow \infty}$.

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n} = o(\log x). \ \ \ \ \ (2)$

$\displaystyle \sum_{n \leq x} \lambda(n) \lambda(n+1) = o(x) \ \ \ \ \ (3)$

which is a strictly stronger estimate than (2), and remains open.

The arguments also extend to other completely multiplicative functions than the Liouville function. In particular, one obtains a slightly averaged version of the non-asymptotic Elliott conjecture that was shown in the previous blog post to imply a positive solution to the Erdos discrepancy problem. The averaged version of the conjecture established in this paper is slightly weaker than the one assumed in the previous blog post, but it turns out that the arguments there can be modified without much difficulty to accept this averaged Elliott conjecture as input. In particular, we obtain an unconditional solution to the Erdos discrepancy problem as a consequence; this is detailed in the second paper listed above. In fact we can also handle the vector-valued version of the Erdos discrepancy problem, in which the sequence ${f(1), f(2), \dots}$ takes values in the unit sphere of an arbitrary Hilbert space, rather than in ${\{-1,+1\}}$.

Estimates such as (2) or (3) are known to be subject to the “parity problem” (discussed numerous times previously on this blog), which roughly speaking means that they cannot be proven solely using “linear” estimates on functions such as the von Mangoldt function. However, it is known that the parity problem can be circumvented using “bilinear” estimates, and this is basically what is done here.

We now describe in informal terms the proof of Theorem 1, focusing on the model case (2) for simplicity. Suppose for contradiction that the left-hand side of (2) was large and (say) positive. Using the multiplicativity ${\lambda(pn) = -\lambda(n)}$, we conclude that

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+p) 1_{p|n}}{n}$

is also large and positive for all primes ${p}$ that are not too large; note here how the logarithmic averaging allows us to leave the constraint ${n \leq x}$ unchanged. Summing in ${p}$, we conclude that

$\displaystyle \sum_{n \leq x} \frac{ \sum_{p \in {\mathcal P}} \lambda(n) \lambda(n+p) 1_{p|n}}{n}$

is large and positive for any given set ${{\mathcal P}}$ of medium-sized primes. By a standard averaging argument, this implies that

$\displaystyle \frac{1}{H} \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+p+j) 1_{p|n+j} \ \ \ \ \ (4)$

is large for many choices of ${n}$, where ${H}$ is a medium-sized parameter at our disposal to choose, and we take ${{\mathcal P}}$ to be some set of primes that are somewhat smaller than ${H}$. (A similar approach was taken in this recent paper of Matomaki, Radziwill, and myself to study sign patterns of the Möbius function.) To obtain the required contradiction, one thus wants to demonstrate significant cancellation in the expression (4). As in that paper, we view ${n}$ as a random variable, in which case (4) is essentially a bilinear sum of the random sequence ${(\lambda(n+1),\dots,\lambda(n+H))}$ along a random graph ${G_{n,H}}$ on ${\{1,\dots,H\}}$, in which two vertices ${j, j+p}$ are connected if they differ by a prime ${p}$ in ${{\mathcal P}}$ that divides ${n+j}$. A key difficulty in controlling this sum is that for randomly chosen ${n}$, the sequence ${(\lambda(n+1),\dots,\lambda(n+H))}$ and the graph ${G_{n,H}}$ need not be independent. To get around this obstacle we introduce a new argument which we call the “entropy decrement argument” (in analogy with the “density increment argument” and “energy increment argument” that appear in the literature surrounding Szemerédi’s theorem on arithmetic progressions, and also reminiscent of the “entropy compression argument” of Moser and Tardos, discussed in this previous post). This argument, which is a simple consequence of the Shannon entropy inequalities, can be viewed as a quantitative version of the standard subadditivity argument that establishes the existence of Kolmogorov-Sinai entropy in topological dynamical systems; it allows one to select a scale parameter ${H}$ (in some suitable range ${[H_-,H_+]}$) for which the sequence ${(\lambda(n+1),\dots,\lambda(n+H))}$ and the graph ${G_{n,H}}$ exhibit some weak independence properties (or more precisely, the mutual information between the two random variables is small).

Informally, the entropy decrement argument goes like this: if the sequence ${(\lambda(n+1),\dots,\lambda(n+H))}$ has significant mutual information with ${G_{n,H}}$, then the entropy of the sequence ${(\lambda(n+1),\dots,\lambda(n+H'))}$ for ${H' > H}$ will grow a little slower than linearly, due to the fact that the graph ${G_{n,H}}$ has zero entropy (knowledge of ${G_{n,H}}$ more or less completely determines the shifts ${G_{n+kH,H}}$ of the graph); this can be formalised using the classical Shannon inequalities for entropy (and specifically, the non-negativity of conditional mutual information). But the entropy cannot drop below zero, so by increasing ${H}$ as necessary, at some point one must reach a metastable region (cf. the finite convergence principle discussed in this previous blog post), within which very little mutual information can be shared between the sequence ${(\lambda(n+1),\dots,\lambda(n+H))}$ and the graph ${G_{n,H}}$. Curiously, for the application it is not enough to have a purely qualitative version of this argument; one needs a quantitative bound (which gains a factor of a bit more than ${\log H}$ on the trivial bound for mutual information), and this is surprisingly delicate (it ultimately comes down to the fact that the series ${\sum_{j \geq 2} \frac{1}{j \log j \log\log j}}$ diverges, which is only barely true).
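To get a feel for how slowly that series diverges, one can compare its partial sums with the antiderivative ${\log\log\log x}$ of ${\frac{1}{x \log x \log\log x}}$. The sketch below is only an illustration of this integral comparison; the cutoffs ${10^3}$ and ${10^5}$ are arbitrary, and the sum is started well past ${j=2}$ because ${\log\log j}$ is only positive for ${j \geq 3}$.

```python
import math

def partial_sum(start, stop):
    """Sum of 1/(j log j log log j) for start <= j <= stop (requires start >= 3)."""
    return sum(
        1.0 / (j * math.log(j) * math.log(math.log(j)))
        for j in range(start, stop + 1)
    )

def logloglog(x):
    """Antiderivative log log log x of 1/(x log x log log x)."""
    return math.log(math.log(math.log(x)))

if __name__ == "__main__":
    block = partial_sum(1001, 10**5)
    predicted = logloglog(10**5) - logloglog(10**3)
    # Multiplying the cutoff by 100 adds only a fraction of 1 to the sum.
    print(block, predicted)
```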

Once one locates a scale ${H}$ with the low mutual information property, one can use standard concentration of measure results such as the Hoeffding inequality to approximate (4) by the significantly simpler expression

$\displaystyle \frac{1}{H} \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+p+j)}{p}. \ \ \ \ \ (5)$
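As an illustration of the kind of exponential tail bound involved, the sketch below compares the exact tail of a sum of ${n}$ independent uniform signs with Hoeffding's bound ${2 \exp(-t^2/2n)}$. This is the textbook special case of the inequality, not the specific application in the paper, and the function names are illustrative.

```python
import math

def exact_tail(n, t):
    """P(|S_n| >= t) for S_n a sum of n independent uniform +-1 signs."""
    # S_n = 2*k - n, where k ~ Binomial(n, 1/2) counts the +1 signs.
    favourable = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= t)
    return favourable / 2 ** n

def hoeffding_bound(n, t):
    """Hoeffding's bound 2 exp(-t^2 / (2n)) for the same tail probability."""
    return 2.0 * math.exp(-t * t / (2.0 * n))

if __name__ == "__main__":
    n = 200
    for t in (20, 40, 60, 80):
        print(t, exact_tail(n, t), hoeffding_bound(n, t))
```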

The important thing here is that Hoeffding’s inequality gives exponentially strong bounds on the failure probability, which is needed to counteract the logarithms that are inevitably present whenever trying to use entropy inequalities. The expression (5) can then be controlled in turn by an application of the Hardy-Littlewood circle method and a non-trivial estimate

$\displaystyle \sup_\alpha \frac{1}{X} \int_X^{2X} |\frac{1}{H} \sum_{x \leq n \leq x+H} \lambda(n) e(\alpha n)|\ dx = o(1) \ \ \ \ \ (6)$

When one uses this method to study more general sums such as

$\displaystyle \sum_{n \leq x} \frac{g_1(n) g_2(n+1)}{n},$

one ends up having to consider expressions such as

$\displaystyle \frac{1}{H} \sum_{j=1}^H \sum_{p \in {\mathcal P}} c_p \frac{g_1(n+j) g_2(n+p+j)}{p}.$

where ${c_p}$ is the coefficient ${c_p := \overline{g_1}(p) \overline{g_2}(p)}$. When attacking this sum with the circle method, one soon finds oneself in the situation of wanting to locate the large Fourier coefficients of the exponential sum

$\displaystyle S(\alpha) := \sum_{p \in {\mathcal P}} \frac{c_p}{p} e^{2\pi i \alpha p}.$

In many cases (such as in the application to the Erdös discrepancy problem), the coefficient ${c_p}$ is identically ${1}$, and one can understand this sum satisfactorily using the classical results of Vinogradov: basically, ${S(\alpha)}$ is large when ${\alpha}$ lies in a “major arc” and is small when it lies in a “minor arc”. For more general functions ${g_1,g_2}$, the coefficients ${c_p}$ are more or less arbitrary; the large values of ${S(\alpha)}$ are no longer confined to the major arc case. Fortunately, even in this general situation one can use a restriction theorem for the primes established some time ago by Ben Green and myself to show that there are still only a bounded number of possible locations ${\alpha}$ (up to the uncertainty mandated by the Heisenberg uncertainty principle) where ${S(\alpha)}$ is large, and we can still conclude by using (6). (Actually, as recently pointed out to me by Ben, one does not need the full strength of our result; one only needs the ${L^4}$ restriction theorem for the primes, which can be proven fairly directly using Plancherel’s theorem and some sieve theory.)
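The major arc behaviour in the case ${c_p = 1}$ is easy to observe numerically. In the sketch below, the range ${[100,1000]}$ for ${{\mathcal P}}$ and the sample frequencies are arbitrary illustrative choices; at ${\alpha = 1/2}$ and ${\alpha = 1/3}$ the sum is large for elementary reasons (all the primes in range are odd and coprime to ${3}$).

```python
import cmath
import math

def primes_between(lo, hi):
    """Primes p with lo <= p <= hi, via a simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (hi + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(hi ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, hi + 1, i)))
    return [p for p in range(lo, hi + 1) if sieve[p]]

def S(alpha, primes):
    """Exponential sum sum_p (1/p) e(alpha p), with all coefficients c_p equal to 1."""
    return sum(cmath.exp(2j * math.pi * alpha * p) / p for p in primes)

P = primes_between(100, 1000)
S0 = S(0.0, P).real                      # trivial upper bound for |S(alpha)|
golden = (math.sqrt(5) - 1) / 2          # a badly approximable frequency

if __name__ == "__main__":
    for name, alpha in (("0", 0.0), ("1/2", 0.5), ("1/3", 1 / 3), ("golden", golden)):
        print(name, abs(S(alpha, P)) / S0)
```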

It is tempting to also use the method to attack higher order cases of the (logarithmically) averaged Chowla conjecture; for instance, one could try to prove the estimate

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+1) \lambda(n+2)}{n} = o(\log x).$

The above arguments reduce matters to obtaining some non-trivial cancellation for sums of the form

$\displaystyle \frac{1}{H} \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+p+j) \lambda(n+2p+j)}{p}.$

A little bit of “higher order Fourier analysis” (as was done for very similar sums in the ergodic theory context by Frantzikinakis-Host-Kra and Wooley-Ziegler) lets one control this sort of sum if one can establish a bound of the form

$\displaystyle \frac{1}{X} \int_X^{2X} \sup_\alpha |\frac{1}{H} \sum_{x \leq n \leq x+H} \lambda(n) e(\alpha n)|\ dx = o(1) \ \ \ \ \ (7)$

where ${X}$ goes to infinity and ${H}$ is a very slowly growing function of ${X}$. This looks very similar to (6), but the fact that the supremum is now inside the integral makes the problem much more difficult. However it looks worth attacking (7) further, as this estimate looks like it should have many nice applications (beyond just the ${k=3}$ case of the logarithmically averaged Chowla or Elliott conjectures, which is already interesting).

For higher ${k}$ than ${k=3}$, the same line of analysis requires one to replace the linear phase ${e(\alpha n)}$ by more complicated phases, such as quadratic phases ${e(\alpha n^2 + \beta n)}$ or even ${k-2}$-step nilsequences. Given that (7) is already beyond the reach of current literature, these even more complicated expressions are also unavailable at present, but one can imagine that they will eventually become tractable, in which case we would obtain an averaged form of the Chowla conjecture for all ${k}$, which would have a number of consequences (such as a logarithmically averaged version of Sarnak’s conjecture, as per this blog post).

It would of course be very nice to remove the logarithmic averaging, and be able to establish bounds such as (3). I did attempt to do so, but I do not see a way to use the entropy decrement argument in a manner that does not require some sort of averaging of logarithmic type, as it requires one to pick a scale ${H}$ that one cannot specify in advance, which is not a problem for logarithmic averages (which are quite stable with respect to dilations) but is problematic for ordinary averages. But perhaps the problem can be circumvented by some clever modification of the argument. One possible approach would be to start exploiting multiplicativity at products of primes, and not just individual primes, to try to keep the scale fixed, but this makes the concentration of measure part of the argument much more complicated as one loses some independence properties (coming from the Chinese remainder theorem) which allowed one to conclude just from the Hoeffding inequality.


### Terence Tao — The Erdos discrepancy problem via the Elliott conjecture

The Chowla conjecture asserts that all non-trivial correlations of the Liouville function are asymptotically negligible; for instance, it asserts that

$\displaystyle \sum_{n \leq X} \lambda(n) \lambda(n+h) = o(X)$

as ${X \rightarrow \infty}$ for any fixed natural number ${h}$. This conjecture remains open, though there are a number of partial results (e.g. these two previous results of Matomaki, Radziwill, and myself).
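For readers who want to experiment, here is a minimal sketch that tabulates the Liouville function and evaluates the normalised correlation ${\frac{1}{X} \sum_{n \leq X} \lambda(n) \lambda(n+h)}$ for a few shifts. The cutoff ${X = 10^5}$ is an arbitrary choice, and numerics of this kind are of course no evidence for the asymptotic statement.

```python
def liouville_table(limit):
    """Return lam with lam[n] = (-1)^Omega(n) for 1 <= n <= limit (lam[0] is unused)."""
    omega = [0] * (limit + 1)          # Omega(n): prime factors counted with multiplicity
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if is_prime[p]:
            for m in range(2 * p, limit + 1, p):
                is_prime[m] = False
            q = p
            while q <= limit:          # each prime power q = p^k dividing m adds one
                for m in range(q, limit + 1, q):
                    omega[m] += 1
                q *= p
    return [0] + [(-1) ** omega[n] for n in range(1, limit + 1)]

def correlation(lam, X, h):
    """Normalised sum (1/X) * sum_{n <= X} lambda(n) lambda(n+h)."""
    return sum(lam[n] * lam[n + h] for n in range(1, X + 1)) / X

if __name__ == "__main__":
    X = 10**5
    lam = liouville_table(X + 5)
    for h in range(1, 6):
        print(h, correlation(lam, X, h))
```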

A natural generalisation of Chowla’s conjecture was proposed by Elliott. For simplicity we will only consider Elliott’s conjecture for the pair correlations

$\displaystyle \sum_{n \leq X} g(n) \overline{g}(n+h).$

For such correlations, the conjecture was that one had

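$\displaystyle \sum_{n \leq X} g(n) \overline{g}(n+h) = o(X) \ \ \ \ \ (1)$

as ${X \rightarrow \infty}$, provided that the completely multiplicative function ${g}$ obeys the condition

$\displaystyle \sum_p \hbox{Re} \frac{1 - g(p) \overline{\chi(p)} p^{-it}}{p} = +\infty \ \ \ \ \ (2)$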

for any Dirichlet character ${\chi}$ and any real number ${t}$. In the language of “pretentious number theory”, as developed by Granville and Soundararajan, the hypothesis (2) asserts that the completely multiplicative function ${g}$ does not “pretend” to be like the completely multiplicative function ${n \mapsto \chi(n) n^{it}}$ for any character ${\chi}$ and real number ${t}$. A condition of this form is necessary; for instance, if ${g(n)}$ is precisely equal to ${\chi(n) n^{it}}$ and ${\chi}$ has period ${q}$, then ${g(n) \overline{g}(n+q)}$ is equal to ${1_{(n,q)=1} + o(1)}$ as ${n \rightarrow \infty}$ and (1) clearly fails. The prime number theorem in arithmetic progressions implies that the Liouville function obeys (2), and so the Elliott conjecture contains the Chowla conjecture as a special case.
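The truncated sums ${\sum_{p \leq X} \hbox{Re} \frac{1 - g(p) \overline{\chi(p)} p^{-it}}{p}}$ are easy to compute for simple examples. The sketch below contrasts the Liouville function against the trivial character (where the sum is ${\sum_{p \leq X} 2/p}$ and grows without bound) with a function that agrees with the character of period ${3}$ at every prime except ${3}$ (where the sum stays equal to ${1/3}$). All function names are illustrative, and ${g}$ is only specified at primes since that is all the sum uses.

```python
import cmath
import math

def primes_up_to(X):
    """All primes p <= X, via a simple sieve of Eratosthenes."""
    sieve = [True] * (X + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(X ** 0.5) + 1):
        if sieve[i]:
            for m in range(i * i, X + 1, i):
                sieve[m] = False
    return [p for p in range(2, X + 1) if sieve[p]]

def chi3(n):
    """Non-principal Dirichlet character of period 3."""
    return (0, 1, -1)[n % 3]

def trivial(n):
    """Trivial character, identically 1."""
    return 1

def liouville_at_prime(p):
    return -1

def chi3_tilde_at_prime(p):
    """Agrees with chi3 at primes other than 3, and equals +1 at 3."""
    return 1 if p == 3 else chi3(p)

def pretentious_sum(g, chi, t, X):
    """sum_{p <= X} Re(1 - g(p) conj(chi(p)) p^{-it}) / p."""
    total = 0.0
    for p in primes_up_to(X):
        twist = g(p) * complex(chi(p)).conjugate() * cmath.exp(-1j * t * math.log(p))
        total += (1.0 - twist.real) / p
    return total

if __name__ == "__main__":
    for X in (10**2, 10**3, 10**4):
        print(X,
              pretentious_sum(liouville_at_prime, trivial, 0.0, X),
              pretentious_sum(chi3_tilde_at_prime, chi3, 0.0, X))
```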

As it turns out, Elliott’s conjecture is false as stated, with the counterexample ${g}$ having the property that ${g}$ “pretends” locally to be the function ${n \mapsto n^{it_j}}$ for ${n}$ in various intervals ${[1, X_j]}$, where ${X_j}$ and ${t_j}$ go to infinity in a certain prescribed sense. See this paper of Matomaki, Radziwill, and myself for details. However, we view this as a technicality, and continue to believe that certain “repaired” versions of Elliott’s conjecture still hold. For instance, our counterexample does not apply when ${g}$ is restricted to be real-valued rather than complex, and we believe that Elliott’s conjecture is valid in this setting. Returning to the complex-valued case, we still expect the asymptotic (1) provided that the condition (2) is replaced by the stronger condition

$\displaystyle \inf_{|t| \leq X} |\sum_{p \leq X} \hbox{Re} \frac{1 - g(p) \overline{\chi(p)} p^{-it}}{p}| \rightarrow +\infty$

as ${X \rightarrow +\infty}$ for all fixed Dirichlet characters ${\chi}$. In our paper we supported this claim by establishing a certain “averaged” version of this conjecture; see that paper for further details. (See also this recent paper of Frantzikinakis and Host which establishes a different averaged version of this conjecture.)

One can make a stronger “non-asymptotic” version of this corrected Elliott conjecture, in which the ${X}$ parameter does not go to infinity, or equivalently that the function ${g}$ is permitted to depend on ${X}$:

$\displaystyle \inf_{|t| \leq AX} |\sum_{p \leq X} \hbox{Re} \frac{1 - g(p) \overline{\chi(p)} p^{-it}}{p}| \geq A$

for all Dirichlet characters ${\chi}$ of period at most ${A}$. Then one has

$\displaystyle |\sum_{n \leq X} g(n) \overline{g(n+h)}| \leq \varepsilon X$

for all natural numbers ${1 \leq h \leq 1/\varepsilon}$.

Meanwhile, we have the following conjecture that is the focus of the Polymath5 project:

Conjecture 2 (Erdös discrepancy conjecture) For any function ${f: {\bf N} \rightarrow \{-1,+1\}}$, the discrepancy

$\displaystyle \sup_{n,d \in {\bf N}} |\sum_{j=1}^n f(jd)|$

is infinite.
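For finite sequences the discrepancy can be computed directly. The brute-force sketch below (function names are illustrative) confirms the classical small case: a sequence of signs with all homogeneous progression sums bounded by ${1}$ in magnitude can have length ${11}$ but not ${12}$.

```python
from itertools import product

def discrepancy(f):
    """Max of |f(d) + f(2d) + ... + f(nd)| over n, d with n*d <= len(f); f[0] is f(1)."""
    L = len(f)
    best = 0
    for d in range(1, L + 1):
        partial = 0
        for m in range(d, L + 1, d):
            partial += f[m - 1]
            best = max(best, abs(partial))
    return best

def longest_with_discrepancy_at_most(C, max_len):
    """Largest L <= max_len admitting a +-1 sequence of length L with discrepancy <= C."""
    longest = 0
    for L in range(1, max_len + 1):
        if any(discrepancy(f) <= C for f in product((-1, 1), repeat=L)):
            longest = L
        else:
            break   # prefixes of good sequences are good, so longer lengths fail too
    return longest

if __name__ == "__main__":
    print(longest_with_discrepancy_at_most(1, 13))   # prints 11
```

Exhaustive search of this kind is only feasible for very small bounds, which is one reason the general conjecture needs a different approach.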

It is instructive to compute some near-counterexamples to Conjecture 2 that illustrate the difficulty of the Erdös discrepancy problem. The first near-counterexample is that of a non-principal Dirichlet character ${f(n) = \chi(n)}$ that takes values in ${\{-1,0,+1\}}$ rather than ${\{-1,+1\}}$. For this function, one has from the complete multiplicativity of ${\chi}$ that

$\displaystyle |\sum_{j=1}^n f(jd)| = |\sum_{j=1}^n \chi(j) \chi(d)|$

$\displaystyle \leq |\sum_{j=1}^n \chi(j)|.$

If ${q}$ denotes the period of ${\chi}$, then ${\chi}$ has mean zero on every interval of length ${q}$, and thus

$\displaystyle |\sum_{j=1}^n f(jd)| \leq |\sum_{j=1}^n \chi(j)| \leq q.$

Thus ${\chi}$ has bounded discrepancy.
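A quick check of this bounded discrepancy for the character of period ${3}$ (the cutoff ${10^4}$ is arbitrary, and the helper names are illustrative):

```python
def chi3(n):
    """Non-principal Dirichlet character of period 3, with values in {-1, 0, +1}."""
    return (0, 1, -1)[n % 3]

def max_hap_sum(f, limit):
    """Largest |f(d) + f(2d) + ... + f(nd)| over all n, d with n*d <= limit."""
    best = 0
    for d in range(1, limit + 1):
        partial = 0
        for m in range(d, limit + 1, d):
            partial += f(m)
            best = max(best, abs(partial))
    return best

if __name__ == "__main__":
    print(max_hap_sum(chi3, 10**4))   # prints 1
```

In this case the sums never exceed ${1}$ in magnitude, comfortably within the bound ${q = 3}$ from the argument above.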

Of course, this is not a true counterexample to Conjecture 2 because ${\chi}$ can take the value ${0}$. Let us now consider the following variant example, which is the simplest member of a family of examples studied by Borwein, Choi, and Coons. Let ${\chi = \chi_3}$ be the non-principal Dirichlet character of period ${3}$ (thus ${\chi(n)}$ equals ${+1}$ when ${n=1 \hbox{ mod } 3}$, ${-1}$ when ${n = 2 \hbox{ mod } 3}$, and ${0}$ when ${n = 0 \hbox{ mod } 3}$), and define the completely multiplicative function ${f = \tilde \chi: {\bf N} \rightarrow \{-1,+1\}}$ by setting ${\tilde \chi(p) := \chi(p)}$ when ${p \neq 3}$ and ${\tilde \chi(3) = +1}$. This is about the simplest modification one can make to the previous near-counterexample to eliminate the zeroes. Now consider the sum

$\displaystyle \sum_{j=1}^n \tilde \chi(j)$

with ${n := 1 + 3 + 3^2 + \dots + 3^k}$ for some large ${k}$. Writing ${j = 3^a m}$ with ${m}$ coprime to ${3}$ and ${a}$ at most ${k}$, we can write this sum as

$\displaystyle \sum_{a=0}^k \sum_{1 \leq m \leq n/3^a} \tilde \chi(3^a m).$

Now observe that ${\tilde \chi(3^a m) = \tilde \chi(3)^a \tilde \chi(m) = \chi(m)}$. The function ${\chi}$ has mean zero on every interval of length three, and ${\lfloor n/3^a\rfloor}$ is equal to ${1}$ mod ${3}$, and thus

$\displaystyle \sum_{1 \leq m \leq n/3^a} \tilde \chi(3^a m) = 1$

for every ${a=0,\dots,k}$, and thus

$\displaystyle \sum_{j=1}^n \tilde \chi(j) = k+1 \gg \log n.$

Thus ${\tilde \chi}$ has unbounded discrepancy, but only barely so (it grows logarithmically in ${n}$). These examples suggest that the main “enemy” to proving Conjecture 2 comes from completely multiplicative functions ${f}$ that somehow “pretend” to be like a Dirichlet character but do not vanish at the zeroes of that character. (Indeed, the special case of Conjecture 2 when ${f}$ is completely multiplicative is already open, and appears to be an important subcase.)
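The computation for ${\tilde \chi}$ can be confirmed directly: at ${n = 1 + 3 + \dots + 3^k}$ the partial sum equals ${k+1}$. A minimal sketch, with illustrative function names:

```python
def chi3_tilde(n):
    """Completely multiplicative +-1 function: chi_3 at primes other than 3, and +1 at 3."""
    while n % 3 == 0:          # strip powers of 3, each contributing a factor +1
        n //= 3
    return 1 if n % 3 == 1 else -1

def partial_sum(n):
    """Sum of chi3_tilde(j) for 1 <= j <= n."""
    return sum(chi3_tilde(j) for j in range(1, n + 1))

def special_n(k):
    """The integer 1 + 3 + 3^2 + ... + 3^k."""
    return (3 ** (k + 1) - 1) // 2

if __name__ == "__main__":
    for k in range(9):
        print(k, special_n(k), partial_sum(special_n(k)))   # third column is k + 1
```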

All of these conjectures remain open. However, I would like to record in this blog post the following striking connection, illustrating the power of the Elliott conjecture (particularly in its nonasymptotic formulation):

Theorem 3 (Elliott conjecture implies unbounded discrepancy) Conjecture 1 implies Conjecture 2.

The argument relies heavily on two observations that were previously made in connection with the Polymath5 project. The first is a Fourier-analytic reduction that replaces the Erdos Discrepancy Problem with an averaged version for completely multiplicative functions ${g}$. An application of Cauchy-Schwarz then shows that any counterexample to that version will violate the conclusion of Conjecture 1, so if one assumes that conjecture then ${g}$ must pretend to be like a function of the form ${n \mapsto \chi(n) n^{it}}$. One then uses (a generalisation of) a second argument from Polymath5 to rule out this case, basically by reducing matters to a more complicated version of the Borwein-Choi-Coons analysis. Details are provided below the fold.

There is some hope that the Chowla and Elliott conjectures can be attacked, as the parity barrier which is so impervious to attack for the twin prime conjecture seems to be more permeable in this setting. (For instance, in my previous post I raised a possible approach, based on establishing expander properties of a certain random graph, which seems to get around the parity problem, in principle at least.)

(Update, Sep 25: fixed some treatment of error terms, following a suggestion of Andrew Granville.)

— 1. Fourier reduction —

We will prove Theorem 3 by contradiction, assuming that there is a function ${f}$ with bounded discrepancy and then concluding a violation of the Elliott conjecture.

The function ${f: {\bf N} \rightarrow \{-1,+1\}}$ need not have any multiplicativity properties, but by using an argument from Polymath5 we can extract a random completely multiplicative function which also has good discrepancy properties (albeit in a probabilistic sense only):

Proposition 4 (Fourier reduction) Suppose that ${f: {\bf N} \rightarrow \{-1,+1\}}$ is a function such that

$\displaystyle \sup_{n,d \in {\bf N}} |\sum_{j=1}^n f(jd)| < +\infty. \ \ \ \ \ (3)$

Then there exists a random completely multiplicative function ${g}$ of magnitude ${1}$ such that

$\displaystyle {\bf E} |\sum_{j=1}^n g(j)|^2 \ll 1$

uniformly for all natural numbers ${n}$ (we allow implied constants to depend on ${f}$).

Proof: The space of completely multiplicative functions ${g}$ of magnitude ${1}$ can be identified with the infinite product ${(S^1)^{\bf Z}}$ since ${g}$ is determined by its values ${g(p) \in S^1}$ at the primes. In particular, this space is compact metrisable in the product topology. The functions ${g \mapsto |\sum_{j=1}^n g(j)|^2}$ are continuous in this topology for all ${n}$. By vague compactness of probability measures on compact metrisable spaces (Prokhorov’s theorem), it thus suffices to construct, for each ${X \geq 1}$, a random completely multiplicative function ${g}$ of magnitude ${1}$ such that

$\displaystyle {\bf E} |\sum_{j=1}^n g(j)|^2 \ll 1$

for all ${n \leq X}$, where the implied constant is uniform in ${n}$ and ${X}$.

By hypothesis, we have

$\displaystyle |\sum_{j=1}^n f(jd)| \ll 1 \ \ \ \ \ (4)$

for all ${n,d}$ (the implied constant can depend on ${f}$ but is otherwise absolute). Let ${X \geq 1}$, and let ${p_1,\dots,p_r}$ be the primes up to ${X}$. Let ${M \geq X}$ be a natural number that we assume to be sufficiently large depending on ${X}$. Define a function ${F: ({\bf Z}/M{\bf Z})^r \rightarrow \{-1,+1\}}$ by the formula

$\displaystyle F( a_1,\dots,a_r ) := f( p_1^{a_1} \dots p_r^{a_r} )$

for ${a_1,\dots,a_r \in \{0,\dots,M-1\}}$. We also define the function ${\pi: [1,X] \rightarrow ({\bf Z}/M{\bf Z})^r}$ by setting ${\pi( p_1^{a_1} \dots p_r^{a_r} ) := (a_1,\dots,a_r)}$ whenever ${p_1^{a_1} \dots p_r^{a_r}}$ is in ${[1,X]}$ (this is well defined for ${M \geq X}$). Applying (4) for ${n \leq X}$ and ${d}$ of the form ${p_1^{a_1} \dots p_r^{a_r}}$ with ${1 \leq a_i \leq M - X}$, we conclude that

$\displaystyle |\sum_{j=1}^n F( x + \pi(j) )| \ll 1$

for all ${n \leq X}$ and all but ${O_X( M^{r-1} )}$ of the ${M^r}$ elements ${x = (a_1,\dots,a_r)}$ of ${({\bf Z}/M{\bf Z})^r}$. For the exceptional elements, we have the trivial bound

$\displaystyle |\sum_{j=1}^n F( x + \pi(j) )| \leq X.$

Square-summing in ${x}$, we conclude (if ${M}$ is sufficiently large depending on ${X}$) that

$\displaystyle {\bf E}_{x \in ({\bf Z}/M{\bf Z})^r} |\sum_{j=1}^n F( x + \pi(j) )|^2 \ll 1 \ \ \ \ \ (5)$

By Fourier expansion, we can write

$\displaystyle F(x) = \sum_{\xi \in ({\bf Z}/M{\bf Z})^r} \hat F(\xi) e( \frac{x \cdot \xi}{M} )$

where ${(a_1,\dots,a_r) \cdot (\xi_1,\dots,\xi_r) := a_1 \xi_1 + \dots + a_r \xi_r}$, ${e(\theta) := e^{2\pi i \theta}}$, and

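$\displaystyle \hat F(\xi) := {\bf E}_{x \in ({\bf Z}/M{\bf Z})^r} F(x) e( -\frac{x \cdot \xi}{M} ).$

Inserting this expansion into the left-hand side of (5) and using the orthogonality of the characters ${x \mapsto e( \frac{x \cdot \xi}{M} )}$, we can rewrite that left-hand side as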

$\displaystyle \sum_{\xi \in ({\bf Z}/M{\bf Z})^r} |\hat F(\xi)|^2 |\sum_{j=1}^n e( \frac{\pi(j) \cdot \xi}{M} )|^2.$

On the other hand, from the Plancherel identity we have

$\displaystyle \sum_{\xi \in ({\bf Z}/M{\bf Z})^r} |\hat F(\xi)|^2 = 1$

and so we can interpret ${|\hat F(\xi)|^2}$ as the probability distribution of a random frequency ${\xi = (\xi_1,\dots,\xi_r) \in ({\bf Z}/M{\bf Z})^r}$. The estimate (5) now takes the form

$\displaystyle {\bf E} |\sum_{j=1}^n e( \frac{\pi(j) \cdot \xi}{M} )|^2 \ll 1$

for all ${n \leq X}$. If we then define the completely multiplicative function ${g}$ by setting ${g(p_j) := e( \xi_j / M )}$ for ${j=1,\dots,r}$, and ${g(p) := 1}$ for all other primes, we obtain

$\displaystyle {\bf E} |\sum_{j=1}^n g(j)|^2 \ll 1$

for all ${n \leq X}$, as desired. $\Box$
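The Fourier-analytic step in this proof can be tested on a toy case. The sketch below takes ${r = 2}$ (primes ${2, 3}$), ${X = 4}$ and ${M = 5}$, uses an arbitrary sign-valued function `F` as a hypothetical stand-in for the one built from ${f}$, and checks both the Plancherel identity and the agreement between the averaged square sum over ${x}$ and its frequency-side expression.

```python
import cmath
import math
from itertools import product

M = 5
# pi(j) = exponent vector of j with respect to the primes (2, 3), for j in [1, 4].
PI = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (2, 0)}
GROUP = list(product(range(M), repeat=2))      # the group (Z/MZ)^2

def F(x):
    """An arbitrary +-1 valued test function on (Z/MZ)^2 (illustrative only)."""
    a, b = x
    return 1 if (a * a + 3 * a * b + 2 * b + 1) % M < 3 else -1

def e(theta):
    return cmath.exp(2j * math.pi * theta)

def F_hat(xi):
    """Fourier coefficient E_x F(x) e(-x.xi/M)."""
    return sum(F(x) * e(-(x[0] * xi[0] + x[1] * xi[1]) / M) for x in GROUP) / len(GROUP)

def physical_side(n):
    """E_x |sum_{j <= n} F(x + pi(j))|^2."""
    total = 0.0
    for x in GROUP:
        s = sum(F(((x[0] + PI[j][0]) % M, (x[1] + PI[j][1]) % M)) for j in range(1, n + 1))
        total += s * s
    return total / len(GROUP)

def frequency_side(n):
    """sum_xi |F_hat(xi)|^2 |sum_{j <= n} e(pi(j).xi/M)|^2."""
    total = 0.0
    for xi in GROUP:
        phases = sum(e((PI[j][0] * xi[0] + PI[j][1] * xi[1]) / M) for j in range(1, n + 1))
        total += abs(F_hat(xi)) ** 2 * abs(phases) ** 2
    return total

if __name__ == "__main__":
    print(sum(abs(F_hat(xi)) ** 2 for xi in GROUP))   # Plancherel: total mass 1
    for n in range(1, 5):
        print(n, physical_side(n), frequency_side(n))
```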

Remark 5 A similar reduction applies if the original function ${f}$ took values in the unit sphere of a complex Hilbert space, rather than in ${\{-1,+1\}}$. Conversely, the random ${g}$ constructed above can be viewed as an element of a unit sphere of a suitable Hilbert space, so the conclusion of Proposition 4 is in fact logically equivalent to failure of the Hilbert space version of the Erdös discrepancy conjecture.

Remark 6 From linearity of expectation, we see from Proposition 4 that for any natural number ${N}$, we have

$\displaystyle {\bf E} \frac{1}{N} \sum_{n=1}^N |\sum_{j=1}^n g(j)|^2 \ll 1$

and hence for each ${N}$ we conclude that there exists a deterministic completely multiplicative function of unit magnitude such that

$\displaystyle \frac{1}{N} \sum_{n=1}^N |\sum_{j=1}^n g(j)|^2 \ll 1.$

This was the original formulation of the Fourier reduction in Polymath5; however, the fact that ${g}$ varied with ${N}$ made this formulation inconvenient for our argument.

— 2. Applying the Elliott conjecture —

Suppose for contradiction that Conjecture 1 holds but that there exists a function ${f: {\bf N} \rightarrow \{-1,+1\}}$ of bounded discrepancy in the sense of (3). By Proposition 4, we may thus find a random completely multiplicative function ${g}$ of magnitude ${1}$ such that

$\displaystyle {\bf E} |\sum_{j=1}^n g(j)|^2 \ll 1 \ \ \ \ \ (6)$

for all ${n}$.

We now use Elliott’s conjecture as a sort of “inverse theorem” (in the spirit of the inverse sumset theorem of Freiman, and the inverse theorems for the Gowers uniformity norms) to force ${g}$ to pretend to behave like a modulated character quite often.

$\displaystyle \sum_{p \leq X} \hbox{Re} \frac{1 - g(p) \overline{\chi(p)} p^{-it}}{p} \ll_\varepsilon 1. \ \ \ \ \ (7)$

$\displaystyle {\bf E} {\bf E}_{n \in [1,X]} |\sum_{j=1}^n g(j)|^2 \ll 1$

so from Markov’s inequality we see with probability ${1-O(\varepsilon)}$ that

$\displaystyle {\bf E}_{n \in [1,X]} |\sum_{j=1}^n g(j)|^2 \ll_\varepsilon 1.$

Let us condition to this event. Shifting ${n}$ by ${H}$ we conclude (for ${X}$ large enough) that

$\displaystyle {\bf E}_{n \in [1,X]} |\sum_{j=1}^{n+H} g(j)|^2 \ll_\varepsilon 1$

and hence by the triangle inequality

$\displaystyle {\bf E}_{n \in [1,X]} |\sum_{j=n+1}^{n+H} g(j)|^2 \ll_\varepsilon 1,$

which we rewrite as

$\displaystyle {\bf E}_{n \in [1,X]} |\sum_{h=1}^{H} g(n+h)|^2 \ll_\varepsilon 1.$

We can square the left-hand side out as

$\displaystyle \sum_{h_1, h_2 \in [1,H]} {\bf E}_{n \in [1,X]} g(n+h_1) \overline{g(n+h_2)}.$

The diagonal terms ${h_1 = h_2}$ contribute ${H}$ to this expression. Thus, for ${H=O_\varepsilon(1)}$ sufficiently large depending on ${\varepsilon}$, we can apply the triangle inequality and pigeonhole principle to find distinct ${h_1,h_2 \in [1,H]}$ such that

$\displaystyle |{\bf E}_{n \in [1,X]} g(n+h_1) \overline{g(n+h_2)}| \gg_\varepsilon 1.$

By symmetry we can take ${h_2 > h_1}$. Setting ${h := h_2 - h_1}$, we conclude (for ${X}$ large enough) that

$\displaystyle |{\bf E}_{n \in [1,X]} g(n) \overline{g(n+h)}| \gg_\varepsilon 1.$

Applying Conjecture 1 in the contrapositive, we obtain the claim. $\Box$

The conclusion (7) asserts that in some sense, ${g}$ “pretends” to be like the function ${n \mapsto \chi(n)n^{it}}$; as it has magnitude one, it should resemble the function ${\tilde \chi}$ discussed in the introduction. The remaining task is to find some generalisation of the argument that shows that ${\tilde \chi}$ had (logarithmically) large discrepancy to show that ${g}$ likewise fails to obey (6).

— 3. Ruling out correlation with modulated characters —

We now use (a generalisation of) this Polymath5 argument. Let ${g}$ be the random completely multiplicative function provided by Proposition 4. We will need the following parameters:

• A sufficiently small quantity ${0 < \varepsilon < 1/2}$.
• A natural number ${H \geq 1}$ that is sufficiently large depending on ${\varepsilon}$.
• A quantity ${0 < \delta < 1/2}$ that is sufficiently small depending on ${\varepsilon,H}$.
• A natural number ${k \geq 1}$ that is sufficiently large depending on ${\varepsilon,H,\delta}$.
• A real number ${X \geq 1}$ that is sufficiently large depending on ${\varepsilon,H,\delta,k}$.

By Proposition 7, we see with probability ${1-O(\varepsilon)}$ that there exists a Dirichlet character ${\chi}$ of period ${q=O_\varepsilon(1)}$ and a real number ${t = O_\varepsilon(X)}$ such that

$\displaystyle \sum_{p \leq X} \hbox{Re} \frac{1 - g(p) \overline{\chi(p)} p^{-it}}{p} \ll_\varepsilon 1. \ \ \ \ \ (8)$

By reducing ${\chi}$ if necessary we may assume that ${\chi}$ is primitive.

It will be convenient to cut down the size of ${t}$. We claim that, with probability ${1-O(\varepsilon)}$, one has

$\displaystyle t = O_{\varepsilon}(X^\delta). \ \ \ \ \ (9)$

Proof: By Proposition 7 with ${X}$ replaced by ${X^\delta}$, we see that with probability ${1-O(\varepsilon)}$, one can find a Dirichlet character ${\chi'}$ of period ${q' = O_\varepsilon(1)}$ and a real number ${t' = O_\varepsilon( X^\delta )}$ such that

$\displaystyle \sum_{p \leq X^\delta} \hbox{Re} \frac{1 - g(p) \overline{\chi'(p)} p^{-it'}}{p} \ll_\varepsilon 1.$

We may assume that ${|t'-t| \geq X^\delta}$, since we are done otherwise. Applying the pretentious triangle inequality (see Lemma 3.1 of this paper of Granville and Soundararajan), we conclude that

$\displaystyle \sum_{p \leq X^\delta} \hbox{Re} \frac{1 - \chi(p) \overline{\chi'(p)} p^{-i(t'-t)}}{p} \ll_\varepsilon 1.$

However, from the Vinogradov-Korobov zero-free region for ${L(\cdot,\chi \overline{\chi'})}$ (see this previous blog post) it is not difficult to show that

$\displaystyle \sum_{p \leq X^\delta} \hbox{Re} \frac{1 - \chi(p) \overline{\chi'(p)} p^{-i(t'-t)}}{p} \gg \log\log X$

if ${X}$ is sufficiently large depending on ${\varepsilon,\delta}$, a contradiction. The claim follows. $\Box$

Let us now condition to the probability ${1-O(\varepsilon)}$ event that ${\chi}$, ${t}$ exist obeying (8) and the bound (9).

The bound (8) asserts that ${g}$ “pretends” to be like the completely multiplicative function ${n \mapsto \chi(n) n^{it}}$. We can formalise this by making the factorisation

$\displaystyle g(n) := \tilde \chi(n) n^{it} h(n) \ \ \ \ \ (10)$

where ${\tilde \chi}$ is the completely multiplicative function of magnitude ${1}$ defined by setting ${\tilde \chi(p) := \chi(p)}$ for ${p \not | q}$ and ${\tilde \chi(p) := g(p) p^{-it}}$ for ${p|q}$, and ${h}$ is the completely multiplicative function of magnitude ${1}$ defined by setting ${h(p) := g(p) \overline{\chi(p)} p^{-it}}$ for ${p \not | q}$, and ${h(p) = 1}$ for ${p|q}$. The function ${\tilde \chi}$ should be compared with the function of the same name studied in the introduction.

The bound (8) then becomes

$\displaystyle |\sum_{p \leq X} \hbox{Re} \frac{1-h(p)}{p}| \ll_\varepsilon 1. \ \ \ \ \ (11)$

We now perform some manipulations to remove the ${n^{it}}$ and ${h}$ factors from ${g}$ and isolate the ${\tilde \chi}$ factor, which is more tractable to compute with; then we will perform more computations to arrive at an expression just involving ${\chi}$ which we will be able to evaluate fairly easily.

From (6) and the triangle inequality we have

$\displaystyle {\bf E} {\bf E}_{H' \in [H/2,H]} |\sum_{m=1}^{H'} g(n+m)|^2 \ll 1$

for all ${n}$ (even after conditioning to the ${1-O(\varepsilon)}$ event). The ${{\bf E}_{H' \in [H/2,H]}}$ averaging will not be used until much later in the argument, and the reader may wish to ignore it for now.

By (10), the above estimate can be written as

$\displaystyle {\bf E} {\bf E}_{H' \in [H/2,H]} |\sum_{m=1}^{H'} \tilde \chi(n+m) (n+m)^{it} h(n+m)|^2 \ll 1.$

For ${n \geq X^{2\delta}}$ we can use (9) to conclude that ${(n+m)^{it} = n^{it} + O_{\varepsilon,H,\delta}(X^{-\delta})}$. The contribution of the error term is negligible, thus

$\displaystyle {\bf E} {\bf E}_{H' \in [H/2,H]} |\sum_{m=1}^{H'}\tilde \chi(n+m) n^{it} h(n+m)|^2 \ll 1$

for all ${n \geq X^{2\delta}}$. We can factor out the ${n^{it}}$ to obtain

$\displaystyle {\bf E} {\bf E}_{H' \in [H/2,H]} |\sum_{m=1}^{H'} \tilde \chi(n+m) h(n+m)|^2 \ll 1.$
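(The approximation of ${(n+m)^{it}}$ by ${n^{it}}$ used above can be checked in one line; this is a sketch whose only inputs beyond (9) are the elementary bounds ${|e^{i\theta}-1| \leq |\theta|}$ and ${\log(1+x) \leq x}$. For ${n \geq X^{2\delta}}$ and ${1 \leq m \leq H}$ one has

$\displaystyle |(n+m)^{it} - n^{it}| = |e^{it \log(1+m/n)} - 1| \leq |t| \log(1+\frac{m}{n}) \leq \frac{|t| H}{n} \ll_\varepsilon \frac{X^\delta H}{X^{2\delta}} = H X^{-\delta},$

which is the error term that was discarded.)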

For ${n < X^{2\delta}}$ we can crudely bound the left-hand side by ${H^2}$. If ${\delta}$ is sufficiently small, we can then sum weighted by ${\frac{1}{n^{1+1/\log X}}}$ and conclude that

$\displaystyle {\bf E} {\bf E}_{H' \in [H/2,H]} \frac{1}{\log X} \sum_n \frac{1}{n^{1+1/\log X}} |\sum_{m=1}^{H'} \tilde \chi(n+m) h(n+m)|^2 \ll 1.$

(The zeta function type weight of ${\frac{1}{n^{1+1/\log X}}}$ will be convenient later in the argument when one has to perform some multiplicative number theory, as the relevant sums can be computed quite directly and easily using Euler products.) Thus, with probability ${1-O(\varepsilon)}$, one has

$\displaystyle {\bf E}_{H' \in [H/2,H]} \sum_n \frac{1}{n^{1+1/\log X}} |\sum_{m=1}^{H'} \tilde \chi(n+m) h(n+m)|^2 \ll_\varepsilon \log X.$

We condition to this event. We have successfully eliminated the role of ${n^{it}}$; we now work to eliminate ${h}$. Call a residue class ${a \hbox{ mod } q^k}$ bad if ${a+m}$ is divisible by ${p^k}$ for some ${p|q}$ and ${1 \leq m \leq H}$, and good otherwise. We restrict ${n}$ to good residue classes, thus

$\displaystyle {\bf E}_{H' \in [H/2,H]}\sum_{a \in [1,q^k], \hbox{ good}} \sum_{n = a \hbox{ mod } q^k} \frac{1}{n^{1+1/\log X}} |\sum_{m=1}^{H'}\tilde \chi(n+m) h(n+m)|^2$

$\displaystyle \ll_\varepsilon \log X.$

By Cauchy-Schwarz, we conclude that

$\displaystyle {\bf E}_{H' \in [H/2,H]} \sum_{a \in [1,q^k], \hbox{ good}}$

$\displaystyle |\sum_{n = a \hbox{ mod } q^k} \frac{1}{n^{1+1/\log X}} \sum_{m=1}^{H'} \tilde \chi(n+m) h(n+m)|^2$

$\displaystyle \ll_\varepsilon \frac{1}{q^k} \log^2 X.$

Now we claim that for ${n}$ in a good residue class ${a \hbox{ mod } q^k}$, the quantity ${\tilde \chi(n+m)}$ does not depend on ${n}$. Indeed, by hypothesis, ${(n+m,q^k) = (a+m,q^k)}$ is not divisible by ${p^k}$ for any ${p|q}$ and is thus a factor of ${q^{k-1}}$, and ${\frac{n+m}{(n+m,q^k)} = \frac{n+m}{(a+m,q^k)}}$ is coprime to ${q}$. We then factor

$\displaystyle \tilde \chi(n+m)= \tilde \chi((n+m,q^k)) \tilde \chi( \frac{n+m}{(n+m,q^k)} )$

$\displaystyle = \tilde \chi((a+m,q^k)) \chi( \frac{n+m}{(a+m,q^k)} )$

$\displaystyle = \tilde \chi((a+m,q^k)) \chi( \frac{a+m}{(a+m,q^k)} )$

where in the last line we use the periodicity of ${\chi}$. Thus we have ${\tilde \chi(n+m) = \tilde \chi(a+m)}$, and so

$\displaystyle {\bf E}_{H' \in [H/2,H]} \sum_{a \in [1,q^k], \hbox{ good}}$

$\displaystyle |\sum_{m=1}^{H'} \tilde \chi(a+m) \sum_{n = a \hbox{ mod } q^k} \frac{1}{n^{1+1/\log X}} h(n+m)|^2$

$\displaystyle \ll_\varepsilon \frac{1}{q^k} \log^2 X.$
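The claim that ${\tilde \chi(n+m) = \tilde \chi(a+m)}$ on good residue classes can be sanity-checked numerically. The sketch below is only an illustration under assumed toy parameters that do not come from the argument: ${q=3}$, ${\chi}$ the non-principal character mod ${3}$, ${\tilde \chi(3) = -1}$ as an arbitrary unit value, and small ${k}$ and ${H}$. All function names are hypothetical.

```python
def chi3(n):
    """Non-principal Dirichlet character mod 3 (toy choice of chi)."""
    return (0, 1, -1)[n % 3]


def tilde_chi(n, value_at_3=-1):
    """Completely multiplicative: equals chi3 at primes other than 3,
    and takes the (assumed, arbitrary) unit value value_at_3 at the prime 3."""
    out = 1
    while n % 3 == 0:
        n //= 3
        out *= value_at_3
    return out * chi3(n)


def good_classes(q, k, H, primes):
    """Residues a in [1, q^k] with p^k not dividing a+m for all p | q, 1 <= m <= H."""
    mod = q ** k
    return [a for a in range(1, mod + 1)
            if all((a + m) % p ** k != 0
                   for p in primes for m in range(1, H + 1))]


def check_independence(q=3, k=3, H=4, lifts=50):
    """True if tilde_chi(n+m) == tilde_chi(a+m) for every tested n = a mod q^k, a good."""
    mod = q ** k
    for a in good_classes(q, k, H, primes=[3]):
        for m in range(1, H + 1):
            for j in range(lifts):
                n = a + j * mod
                if tilde_chi(n + m) != tilde_chi(a + m):
                    return False
    return True
```

On a bad class the identity can fail: with these toy choices, `tilde_chi(27)` and `tilde_chi(54)` differ, which is why the bad classes have to be set aside.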

Shifting ${n}$ by ${m}$ we see that

$\displaystyle \sum_{n = a \hbox{ mod } q^k} \frac{1}{n^{1+1/\log X}} h(n+m)$

$\displaystyle = \sum_{n = a+m \hbox{ mod } q^k} \frac{1}{n^{1+1/\log X}} h(n) + O_H(1)$

and thus (for ${X}$ large enough)

$\displaystyle {\bf E}_{H' \in [H/2,H]} \sum_{a \in [1,q^k], \hbox{ good}} |\sum_{m=1}^{H'} \tilde \chi(a+m) \sum_{n = a+m \hbox{ mod } q^k} \frac{h(n)}{n^{1+1/\log X}}|^2 \ \ \ \ \ (12)$

$\displaystyle \ll_\varepsilon \frac{1}{q^k} \log^2 X.$

Now, we perform some multiplicative number theory to understand the innermost sum. From taking Euler products we have

$\displaystyle \sum_n \frac{h(n)}{n^{1+1/\log X}} = {\mathfrak S}$

for ${{\mathfrak S} := \prod_p (1 - \frac{h(p)}{p^{1+1/\log X}})^{-1}}$; from (11) and Mertens’ theorem one can easily verify that

$\displaystyle \log X \ll_\varepsilon |{\mathfrak S}| \ll_\varepsilon \log X. \ \ \ \ \ (13)$

More generally, for any Dirichlet character ${\chi_1}$ we have

$\displaystyle \sum_n \frac{\chi_1(n) h(n)}{n^{1+1/\log X}} = \prod_p (1 - \frac{h(p) \chi_1(p)}{p^{1+1/\log X}})^{-1}.$

Since

$\displaystyle \prod_p (1 - \frac{\chi_1(p)}{p^{1+1/\log X}})^{-1} = L(1+\frac{1}{\log X}, \chi_1) \ll_{q,k} 1$

we have

$\displaystyle \sum_n \frac{\chi_1(n) h(n)}{n^{1+1/\log X}} \ll_{q,k} \exp( \sum_p \frac{|1-h(p)|}{p^{1+1/\log X}})$

which after using (11), Cauchy-Schwarz (using ${|1-h(p)| \ll (1-\hbox{Re} h(p))^{1/2}}$) and Mertens’ theorem gives

$\displaystyle \sum_n \frac{\chi_1(n) h(n)}{n^{1+1/\log X}} \ll_{q,k} \exp(O_\varepsilon(\sqrt{\log\log X}))$

for any non-principal character ${\chi_1}$ of period dividing ${q^k}$; for a principal character ${\chi_0}$ of period ${r}$ dividing ${q^k}$ we have

$\displaystyle \sum_n \frac{\chi_0(n) h(n)}{n^{1+1/\log X}} = \prod_{p \not | r} (1 - \frac{h(p)}{p^{1+1/\log X}})^{-1}$

$\displaystyle = \frac{\phi(r)}{r} {\mathfrak S} + O_{q,k}(1)$

since ${h(p)=1}$ and hence ${\frac{h(p)}{p^{1+1/\log X}} = \frac{1}{p} + O(\frac{1}{\log X})}$ for all ${p|r}$, where we have also used (13). By expansion into Dirichlet characters we conclude that

$\displaystyle \sum_{n = b \hbox{ mod } r} \frac{h(n)}{n^{1+1/\log X}} = \frac{1}{r} {\mathfrak S} + O_{q,k}(\exp(O_\varepsilon(\sqrt{\log\log X})))$

for all ${r|q^k}$ and primitive residue classes ${b \hbox{ mod } r}$. For non-primitive residue classes ${b \hbox{ mod } r}$, we write ${r = (b,r) r'}$ and ${b = (b,r) b'}$. The previous arguments then give

$\displaystyle \sum_{n = b' \hbox{ mod } r'} \frac{h(n)}{n^{1+1/\log X}} = \frac{1}{r'} {\mathfrak S} + O_{q,k}(\exp(O_\varepsilon(\sqrt{\log\log X})))$

which since ${h((b,r))=1}$ gives (again using (13))

$\displaystyle \sum_{n = b \hbox{ mod } r} \frac{h(n)}{n^{1+1/\log X}} = \frac{1}{r} {\mathfrak S} + O_{q,k}(\exp(O_\varepsilon(\sqrt{\log\log X})))$

for all ${b \hbox{ mod } r}$ (not necessarily primitive). Inserting this back into (12) we see that

$\displaystyle {\bf E}_{H' \in [H/2,H]}\sum_{a \in [1,q^k] \hbox{ good}} | \sum_{m=1}^{H'}\tilde \chi(a+m) (\frac{1}{q^k} {\mathfrak S} + O_{q,k}(\exp(O_\varepsilon(\sqrt{\log\log X})))) |^2 \ll_\varepsilon \frac{1}{q^k} \log^2 X$

and thus by (13) we conclude (for ${X}$ large enough) that

$\displaystyle \frac{1}{q^k} \sum_{a \in [1,q^k] \hbox{ good}} {\bf E}_{H' \in [H/2,H]} |\sum_{m=1}^{H'} \tilde \chi(a+m)|^2 \ll_\varepsilon 1. \ \ \ \ \ (14)$

We have now eliminated both ${t}$ and ${h}$. The remaining task is to establish some lower bound on the discrepancy of ${\tilde \chi}$ that will contradict (14). As mentioned above, this will be a more complicated variant of the Borwein-Choi-Coons analysis in the introduction. The square in (14) will be helpful in dealing with the fact that we don’t have good control on the ${\tilde \chi(p)}$ for ${p|q}$ (as we shall see, the squaring introduces two terms of this type that end up cancelling each other).

We expand (14) to obtain

$\displaystyle {\bf E}_{H' \in [H/2,H]} \sum_{m_1,m_2 \in [1,H']} \sum_{a \in [1,q^k], \hbox{ good}} \tilde \chi(a+m_1) \overline{\tilde \chi(a+m_2)} \ll_\varepsilon q^k.$

Write ${d_1 := (a+m_1,q^k)}$ and ${d_2 := (a+m_2,q^k)}$, thus ${d_1,d_2 | q^{k-1}}$ and for ${i=1,2}$ we have

$\displaystyle \tilde \chi(a+m_i) = \tilde \chi(d_i) \chi( \frac{a+m_i}{d_i} ).$

We thus have

$\displaystyle \sum_{d_1,d_2|q^{k-1}} \tilde \chi(d_1) \overline{\tilde \chi}(d_2) {\bf E}_{H' \in [H/2,H]} \sum_{m_1,m_2 \in [1,H']}$

$\displaystyle \sum_{a \in [1,q^k], \hbox{ good}: (a+m_1,q^k)=d_1, (a+m_2,q^k)=d_2}$

$\displaystyle \chi(\frac{a+m_1}{d_1}) \overline{\chi}(\frac{a+m_2}{d_2}) \ll_\varepsilon q^k.$

We reinstate the bad ${a}$. The number of such ${a}$ is at most ${O(2^{-k} q^k )}$, so their total contribution here is ${O_H(2^{-k} q^k)}$ which is negligible, thus

$\displaystyle \sum_{d_1,d_2|q^{k-1}} \tilde \chi(d_1) \overline{\tilde \chi}(d_2) {\bf E}_{H' \in [H/2,H]} \ \ \ \ \ (15)$

$\displaystyle \sum_{m_1,m_2 \in [1,H']} \sum_{a \in [1,q^k]: (a+m_1,q^k)=d_1, (a+m_2,q^k)=d_2}$

$\displaystyle \chi(\frac{a+m_1}{d_1}) \overline{\chi}(\frac{a+m_2}{d_2}) \ll_\varepsilon q^k.$
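(The count of bad ${a}$ is a union bound; in this sketch ${\omega(q)}$, the number of distinct primes dividing ${q}$, is notation introduced only for this remark. For each prime ${p|q}$ and each ${1 \leq m \leq H}$, exactly ${q^k/p^k}$ of the ${a \in [1,q^k]}$ have ${p^k | a+m}$, so

$\displaystyle \# \{ a \in [1,q^k]: a \hbox{ bad} \} \leq \sum_{p|q} \sum_{m=1}^H \frac{q^k}{p^k} \leq H \omega(q) 2^{-k} q^k,$

where the dependence on ${H}$ and ${q}$ is absorbed into the implied constants.)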

For ${d_1 \geq q^{k/10}}$ or ${d_2 \geq q^{k/10}}$, the inner sum is ${O_H(q^{-k/10} q^k )}$, which by the divisor bound gives a negligible contribution. Thus we may restrict to ${d_1,d_2 < q^{k/10}}$. Note that as ${\chi}$ is already restricted to numbers coprime to ${q}$, and ${d_1,d_2}$ divide ${q^{k-1}}$, we may replace the constraints ${(a+m_i,q^k)=d_i}$ with ${d_i|a+m_i}$ for ${i=1,2}$.

We consider the contribution of an off-diagonal term ${d_1 \neq d_2}$ for a fixed choice of ${m_1,m_2}$. To handle these terms we expand the non-principal character ${\chi(n)}$ as a linear combination of ${e( \xi n / q )}$ for ${\xi \in ({\bf Z}/q{\bf Z})^\times}$ with Fourier coefficients ${O(1)}$. Thus we can expand out

$\displaystyle \sum_{a \in [1,q^k]: d_1|a+m_1; d_2|a+m_2} \chi(\frac{a+m_1}{d_1}) \overline{\chi}(\frac{a+m_2}{d_2})$

as a linear combination of ${O(1)}$ expressions of the form

$\displaystyle \sum_{a \in [1,q^k]: d_1|a+m_1; d_2|a+m_2} e(\frac{\xi_1}{q} \frac{a+m_1}{d_1} - \frac{\xi_2}{q} \frac{a+m_2}{d_2})$

with ${\xi_1,\xi_2 \in ({\bf Z}/q{\bf Z})^\times}$ and coefficients of size ${O(1)}$.

The constraints ${ d_1|a+m_1; d_2|a+m_2}$ are either inconsistent, or constrain ${a}$ to a single residue class ${a = b \hbox{ mod } [d_1,d_2]}$. Writing ${a = b + n [d_1,d_2]}$, we have

$\displaystyle e(\frac{\xi_1}{q} \frac{a+m_1}{d_1} - \frac{\xi_2}{q} \frac{a+m_2}{d_2}) = e( \theta + \frac{n}{q} ( \xi_1 \frac{[d_1,d_2]}{d_1} - \xi_2 \frac{[d_1,d_2]}{d_2} ) )$

for some phase ${\theta}$ that can depend on ${b,m_1,m_2,d_1,d_2,q,\xi_1,\xi_2}$ but is independent of ${n}$. If ${d_1 \neq d_2}$, then at least one of the two quantities ${\frac{[d_1,d_2]}{d_1}}$ and ${\frac{[d_1,d_2]}{d_2}}$ is divisible by a prime ${p|q}$ that does not divide the other quantity. Therefore ${\xi_1 \frac{[d_1,d_2]}{d_1} - \xi_2 \frac{[d_1,d_2]}{d_2}}$ cannot be divisible by ${p}$, and thus by ${q}$. We can then sum the geometric series in ${n}$ (or ${a}$) and conclude that

$\displaystyle \sum_{a \in [1,q^k]: d_1|a+m_1; d_2|a+m_2} e(\frac{\xi_1}{q} \frac{a+m_1}{d_1} - \frac{\xi_2}{q} \frac{a+m_2}{d_2}) \ll [d_1,d_2] \ll q^{k/5}$

and so by the divisor bound the off-diagonal terms ${d_1 \neq d_2}$ contribute at most ${O_H( q^{k/5 + o(k)} )}$ to (15). For ${k}$ large, this is negligible, and thus we only need to consider the diagonal contribution ${d_1 = d_2 < q^{k/10}}$. Here the ${ \tilde \chi(d_1) \overline{\tilde \chi}(d_2)}$ terms helpfully cancel, and we obtain

$\displaystyle \sum_{d|q^{k-1}: d < q^{k/10}} {\bf E}_{H' \in [H/2,H]} \sum_{m_1,m_2 \in [1,H']} \sum_{a \in [1,q^k]: d | a+m_1, a+m_2}$

$\displaystyle \chi(\frac{a+m_1}{d}) \overline{\chi}(\frac{a+m_2}{d}) \ll_\varepsilon q^k.$

We have now eliminated ${\tilde \chi}$, leaving only the Dirichlet character ${\chi}$ which is much easier to work with. We gather terms and write the left-hand side as

$\displaystyle \sum_{d|q^{k-1}: d < q^{k/10}} {\bf E}_{H' \in [H/2,H]} \sum_{a \in [1,q^k]} |\sum_{m \in [1,H']: d|a+m} \chi(\frac{a+m}{d})|^2 \ll_\varepsilon q^k.$

The summand in ${d}$ is now non-negative. We can thus throw away all the ${d}$ except those of the form ${d = q^i}$ with ${q^i < \sqrt{H}}$, to conclude that

$\displaystyle \sum_{i: q^i < \sqrt{H}} {\bf E}_{H' \in [H/2,H]} \sum_{a \in [1,q^k]} |\sum_{m \in [1,H']: q^i|a+m} \chi(\frac{a+m}{q^i})|^2 \ll_\varepsilon q^k.$

It is now that we finally take advantage of the averaging ${{\bf E}_{H' \in [H/2,H]}}$ to simplify the ${m}$ summation. Observe from the triangle inequality that for any ${H' \in [H/2, 3H/4]}$ and ${a \in [1,q^k]}$ one has

$\displaystyle |\sum_{H' < m \leq H' + q^i: q^i|a+m} \chi(\frac{a+m}{q^i})|^2$

$\displaystyle \ll |\sum_{m \in [1,H']: q^i|a+m} \chi(\frac{a+m}{q^i})|^2 + |\sum_{m \in [1,H'+q^i]: q^i|a+m} \chi(\frac{a+m}{q^i})|^2;$

summing over ${i,H',a}$ we conclude that

$\displaystyle \sum_{i: q^i < \sqrt{H}} {\bf E}_{H' \in [H/2,3H/4]} \sum_{a \in [1,q^k]} |\sum_{H' < m \leq H'+q^i: q^i|a+m} \chi(\frac{a+m}{q^i})|^2 \ll_\varepsilon q^k.$

In particular, by the pigeonhole principle there exists ${H' \in [H/2,3H/4]}$ such that

$\displaystyle \sum_{i: q^i < \sqrt{H}} \sum_{a \in [1,q^k]} |\sum_{H' < m \leq H'+q^i: q^i|a+m} \chi(\frac{a+m}{q^i})|^2 \ll_\varepsilon q^k.$

Shifting ${a}$ by ${H'}$ and discarding some terms, we conclude that

$\displaystyle \sum_{i: q^i < \sqrt{H}} \sum_{a \in [1,q^k/2]} |\sum_{1 \leq m \leq q^i: q^i|a+m} \chi(\frac{a+m}{q^i})|^2 \ll_\varepsilon q^k.$

Observe that for a fixed ${a}$ there is exactly one ${m}$ in the inner sum, and ${\frac{a+m}{q^i} = \lfloor \frac{a}{q^i} \rfloor + 1}$. Thus we have

$\displaystyle \sum_{i: q^i < \sqrt{H}} \sum_{a \in [1,q^k/2]} |\chi(\lfloor \frac{a}{q^i}\rfloor + 1)|^2 \ll_\varepsilon q^k.$
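The observation about the inner sum is elementary, and a brute-force check makes it concrete. This is a minimal sketch with assumed toy values (${q=3}$, ${i \leq 3}$, ${a \leq 500}$) and hypothetical function names.

```python
def solutions(a, q, i):
    """All m in [1, q^i] such that q^i divides a + m."""
    step = q ** i
    return [m for m in range(1, step + 1) if (a + m) % step == 0]


def check_unique_m(q=3, max_i=3, max_a=500):
    """True if, for every tested a and i, there is exactly one m in [1, q^i]
    with q^i | a + m, and (a + m) / q^i equals floor(a / q^i) + 1."""
    for i in range(max_i + 1):
        step = q ** i
        for a in range(1, max_a + 1):
            ms = solutions(a, q, i)
            if len(ms) != 1:
                return False
            if (a + ms[0]) // step != a // step + 1:
                return False
    return True
```

The unique ${m}$ is the distance from ${a}$ up to the next multiple of ${q^i}$, which is why the quotient is ${\lfloor a/q^i \rfloor + 1}$.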

Making the change of variables ${b := \lfloor \frac{a}{q^i} \rfloor + 1}$, we thus have

$\displaystyle \sum_{i: q^i < \sqrt{H}} q^i \sum_{b \in [1,q^{k-i}/4]} |\chi(b)|^2 \ll_\varepsilon q^k.$

But ${b \mapsto |\chi(b)|^2}$ is periodic of period ${q}$ with mean ${\gg 1}$, thus

$\displaystyle \sum_{b \in [1,q^{k-i}/4]} |\chi(b)|^2 \gg q^{k-i}$

and thus

$\displaystyle \sum_{i: q^i < \sqrt{H}} 1 \ll_\varepsilon 1,$

which leads to a contradiction for ${H}$ large enough (note the logarithmic growth in ${H}$ here, matching the logarithmic growth in the Borwein-Choi-Coons analysis). The claim follows.
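For a feel for the logarithmic growth, here is a small numerical sketch of the Borwein-Choi-Coons example. The function from the introduction is not reproduced in this section, so the code assumes the standard version: the completely multiplicative function that agrees with the non-principal character mod ${3}$ at primes other than ${3}$ and equals ${+1}$ at ${3}$. Under that assumption its partial sum up to ${n}$ equals the number of digits ${1}$ in the base ${3}$ expansion of ${n}$, which the code verifies. All names are hypothetical.

```python
def chi3(n):
    """Non-principal Dirichlet character mod 3."""
    return (0, 1, -1)[n % 3]


def tilde_chi3(n):
    """Assumed Borwein-Choi-Coons function: chi3 of the 3-free part of n,
    so that tilde_chi3(3) = +1 and the function is completely multiplicative."""
    while n % 3 == 0:
        n //= 3
    return chi3(n)


def ones_in_base3(n):
    """Number of digits equal to 1 in the ternary expansion of n."""
    count = 0
    while n:
        count += (n % 3 == 1)
        n //= 3
    return count


def partial_sums(N):
    """List whose entry n-1 is tilde_chi3(1) + ... + tilde_chi3(n)."""
    sums, running = [], 0
    for j in range(1, N + 1):
        running += tilde_chi3(j)
        sums.append(running)
    return sums
```

The partial sums are never negative and first reach the value ${k}$ at ${n = 1 + 3 + \dots + 3^{k-1}}$, so the discrepancy grows like ${\log_3 n}$.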

Filed under: expository, math.NT, polymath Tagged: Elliott conjecture, multiplicative functions, polymath5

### John Preskill — Bits, Bears, and Beyond in Banff: Part Deux

You might remember that about one month ago, Nicole blogged about the conference Beyond i.i.d. in information theory and told us about bits, bears, and beyond in Banff. I was very pleased that Nicole did so, because this conference has become one of my favorites in recent years (ok, it’s my favorite). You can look at her post to see what is meant by “Beyond i.i.d.” The focus of the conference includes cutting-edge topics in quantum Shannon theory, and the conference still has a nice “small world” feel to it (for example, the most recent edition and the first one featured a music session from participants). Here is a picture of some of us having a fun time:

Will Matthews, Felix Leditzky, me, and Nilanjana Datta (facing away) singing “Jamaica Farewell”.

The Beyond i.i.d. series has shaped a lot of the research going on in this area and has certainly affected my own research directions. The first Beyond i.i.d. was held in Cambridge, UK in January 2013, organized by Nilanjana Datta and Renato Renner. It had a clever logo, featuring cyclists of various sorts biking one after another, the first few looking the same and the ones behind them breaking out of the i.i.d. pattern:

It was also at the Cambridge edition that the famous entropy zoo first appeared, which has now been significantly updated, based on recent progress in the area. The next Beyond i.i.d. happened in Singapore in May 2014, organized by Marco Tomamichel, Vincent Tan, and Stephanie Wehner. (Stephanie was a recent “quantum woman” for her work on a loophole-free Bell experiment.)

The tradition continued this past summer in beautiful Banff, Canada. I hope that it goes on for a long time. At least I have next year’s to look forward to, which will be in beachy Barcelona in the summertime, (as of now) planned to be just one week before Alexander Holevo presents the Shannon lecture in Barcelona at the ISIT 2016 conference (by the way, this is the first time that a quantum information theorist has won the prestigious Shannon award).

So why am I blabbing on and on about the Beyond i.i.d. conference if Nicole already wrote a great summary of the Banff edition this past summer? Well, she didn’t have room in her blog post to cover one of my favorite topics that was discussed at my favorite conference, so she graciously invited me to discuss it here.

The driving question of my new favorite topic is “What is the right notion of a quantum Markov chain?” The past year or so has seen some remarkable progress in this direction. To motivate it, let’s go back to bears, and specifically bear attacks (as featured in Nicole’s post). In Banff, the locals there told us that they had never heard of a bear attacking a group of four or more people who hike together. But let’s suppose that Alice, Bob, and Eve ignore this advice and head out together for a hike in the mountains. Also, in a different group are 50 of Alice’s sisters, but the park rangers are focusing their binoculars on the group of three (Alice, Bob, and Eve), observing their movements, because they are concerned that a group of three will run into trouble.

In the distance, there is a clever bear observing the movements of Alice, Bob, and Eve, and he notices some curious things. If he looks at Alice and Bob’s movements alone, they appear to take each step randomly, but for the most part together. That is, their steps appear correlated. He records their movements for some time and estimates a probability distribution $p(a,b)$ that characterizes their movements. However, if he considers the movements of Alice, Bob, and Eve all together, he realizes that Alice and Bob are really taking their steps based on what Eve does, who in turn is taking her steps completely at random. So at this point the bear surmises that Eve is the mediator of the correlations observed between Alice and Bob’s movements, and when he writes down an estimate for the probability distribution $p(a,b,e)$ characterizing all three of their movements, he notices that it factors as $p(a,b,e) = p(a|e) p(b|e) p(e)$. That is, the bear sees that the distribution forms a Markov chain.

“What is an important property of such a Markov chain?” asks the bear. Well, neglecting Alice’s movements (summing over the $a$ variable), the probability distribution reduces to $p(b|e) p(e)$, because $p(a|e)$ is a conditional probability distribution. A characteristic of a Markov probability distribution is that one could reproduce the original distribution $p(a,b,e)$ simply by acting on the $e$ variable of $p(b|e) p(e)$ with the conditional probability distribution $p(a|e)$. So the bear realizes that it would be possible for Alice to be lost and subsequently replaced by Eve calling in one of Alice’s sisters, such that nobody else would notice anything different from before; it would appear as if the movements of all three were unchanged once this replacement occurs. Salivating at his realization, the bear takes Eve briefly aside without any of the others noticing. The bear explains that he will not eat Eve and will instead eat Alice if Eve can call in one of Alice’s sisters and direct her movements to be chosen according to the distribution $p(a|e)$. Eve, realizing that her options are limited (ok, ok, maybe there are other options…), makes a deal with the bear. So the bear promptly eats Alice, and Eve draws in one of Alice’s sisters, whom Eve then directs to walk according to the distribution $p(a|e)$. This process repeats, going on and on, and all the while, the park rangers, focusing exclusively on the movements of the party of three, don’t think anything of what’s going on, because they observe that the joint distribution $p(a,b,e)$ describing the movements of “Alice,” Bob, and Eve never seems to change (let’s assume that the actions of the bear and Eve are very fast :) ). So the bear is very satisfied after eating Alice and some of her sisters, and Eve is pleased not to be eaten, at the same time never having cared too much for Alice or any of her sisters.
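The bear’s trick can be sketched in a few lines of code. The numbers below are hypothetical toy probabilities, not anything from the story; the point is only that summing out $a$ and then re-applying $p(a|e)$ to Eve’s variable returns the original $p(a,b,e)$.

```python
from itertools import product

# Hypothetical toy numbers: each walker takes step 0 or step 1.
p_e = {0: 0.5, 1: 0.5}  # Eve steps completely at random
p_a_given_e = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_a_given_e[e][a]
p_b_given_e = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # p_b_given_e[e][b]


def joint():
    """Markov-chain distribution p(a,b,e) = p(a|e) p(b|e) p(e)."""
    return {(a, b, e): p_a_given_e[e][a] * p_b_given_e[e][b] * p_e[e]
            for a, b, e in product((0, 1), repeat=3)}


def lose_alice(p_abe):
    """Sum over a, leaving p(b,e) = p(b|e) p(e)."""
    p_be = {}
    for (a, b, e), prob in p_abe.items():
        p_be[(b, e)] = p_be.get((b, e), 0.0) + prob
    return p_be


def recover_alice(p_be):
    """Eve's recovery: act on the e variable alone with p(a|e)."""
    return {(a, b, e): p_a_given_e[e][a] * prob
            for (b, e), prob in p_be.items() for a in (0, 1)}


def max_difference(p, q):
    """Largest pointwise gap between two distributions on the same outcomes."""
    return max(abs(p[key] - q[key]) for key in p)
```

Calling `max_difference(joint(), recover_alice(lose_alice(joint())))` gives a value at the level of floating-point rounding, so the park rangers see no change.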

A natural question arises: “What could Alice and Bob do to prevent this terrible situation from arising, in which Alice and so many of her sisters get eaten without the park rangers noticing anything?” Well, Alice and Bob could attempt to coordinate their movements independently of Eve’s. Even better, before heading out on a hike, they could make sure to have brought along several entangled pairs of particles (and perhaps some bear spray). If Alice and Bob choose their movements according to the outcomes of measurements of the entangled pairs, then it would be impossible for Alice to be eaten and the park rangers not to notice. That is, the distribution describing their movements could never be described by a Markov chain distribution of the form $p(a|e) p(b|e) p(e)$. Thus, in such a scenario, as soon as the bear attacks Alice and then Eve replaces her with one of her sisters, the park rangers would immediately notice something different about the movements of the party of three and then figure out what is going on. So at least Alice could save her sisters…

What is the lesson here? A similar scenario is faced in quantum key distribution. Eve and other attackers (such as a bear) might try to steal what is there in Alice’s system and then replace it with something else, in an attempt to go undetected. If the situation is described by classical physics, this would be possible if Eve had access to a “hidden variable” that dictates the actions of Alice and Bob. But according to Bell’s theorem or the monogamy of entanglement, it is impossible for a “hidden variable” strategy to mimic the outcomes of measurements performed on sufficiently entangled particles.

Since we never have perfectly entangled particles or ones whose distributions exactly factor as Markov chains, it would be ideal to quantify, for a given three-party quantum state of Alice, Bob, and Eve, how well one could recover from the loss of the Alice system by Eve performing a recovery channel on her system alone. This would help us to better understand approximate cases that we expect to appear in realistic scenarios. At the same time, we could have a clearer understanding of what constitutes an approximate quantum Markov chain.

Now due to recent results of Fawzi and Renner, we know that this quantification of quantum non-Markovianity is possible by using the conditional quantum mutual information (CQMI), a fundamental measure of information in quantum information theory. We already knew that the CQMI is non-negative when evaluated for any three-party quantum state, due to the strong subadditivity inequality, but now we can say more than that: If the CQMI is small, then Eve can recover well from the loss of Alice, implying that the reduced state of Alice and Bob’s system could not have been too entangled in the first place. Relatedly, if Eve cannot recover well from the loss of Alice, then the CQMI cannot be small. The CQMI is the quantity underlying the squashed entanglement measure, which in turn plays a fundamental role in characterizing the performance of realistic quantum key distribution systems.
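To make the role of the conditional mutual information concrete, here is a sketch of its classical analogue, $I(A;B|E) = \sum p(a,b,e) \log_2 \frac{p(a,b,e) p(e)}{p(a,e) p(b,e)}$. This is not the quantum CQMI, which is built from von Neumann entropies, and the two example distributions are hypothetical. It vanishes when Eve mediates all correlations and is positive when Alice and Bob share something Eve knows nothing about.

```python
from itertools import product
from math import log2


def marginal(p, keep):
    """Marginal of p over the coordinates listed in keep (a tuple of indices)."""
    out = {}
    for key, prob in p.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out


def conditional_mutual_information(p_abe):
    """Classical I(A;B|E) in bits for a distribution on triples (a, b, e)."""
    p_ae = marginal(p_abe, (0, 2))
    p_be = marginal(p_abe, (1, 2))
    p_e = marginal(p_abe, (2,))
    total = 0.0
    for (a, b, e), prob in p_abe.items():
        if prob > 0:
            total += prob * log2(prob * p_e[(e,)] / (p_ae[(a, e)] * p_be[(b, e)]))
    return total


def markov_example(flip=0.1):
    """A and B each copy Eve's fair coin, with independent noise."""
    p = {}
    for a, b, e in product((0, 1), repeat=3):
        pa = 1 - flip if a == e else flip
        pb = 1 - flip if b == e else flip
        p[(a, b, e)] = 0.5 * pa * pb
    return p


def shared_coin_example():
    """A and B share a fair coin that is independent of Eve's fair coin."""
    return {(a, a, e): 0.25 for a in (0, 1) for e in (0, 1)}
```

For `markov_example()` the value is zero up to rounding, while `shared_coin_example()` gives one full bit: no action on Eve’s variable alone could restore a lost Alice there.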

Since the original results of Fawzi and Renner appeared on the arXiv, this topic has seen much activity in “quantum information land.” Here are some papers related to this topic, which have appeared in the past year or so (apologies if I missed your paper!):

Some of the papers are admittedly those of myself and my collaborators, but hey!, please forgive me, I’ve been excited about the topic. We now know simpler proofs of the original Fawzi-Renner results and extensions of them that apply to the quantum relative entropy as well. Since the quantum relative entropy is such a fundamental quantity in quantum information theory, some of the above papers provide sweeping ramifications for many foundational statements in quantum information theory, including entanglement theory, quantum distinguishability, the Holevo bound, quantum discord, multipartite information measures, etc. Beyond i.i.d. had a day of talks dedicated to the topic, and I think we will continue seeing further developments in this area.

## September 23, 2015

### Scott Aaronson — Six announcements

1. I did a podcast interview with Julia Galef for her series “Rationally Speaking.”  See also here for the transcript (which I read rather than having to listen to myself stutter).  The interview is all about Aumann’s Theorem, and whether rational people can agree to disagree.  It covers a lot of the same ground as my recent post on the same topic, except with less technical detail about agreement theory and more … well, agreement.  At Julia’s suggestion, we’re planning to do a follow-up podcast about the particular intractability of online disagreements.  I feel confident that we’ll solve that problem once and for all.  (Update: Also check out this YouTube video, where Julia offers additional thoughts about what we discussed.)
2. When Julia asked me to recommend a book at the end of the interview, I picked probably my favorite contemporary novel: The Mind-Body Problem by Rebecca Newberger Goldstein.  Embarrassingly, I hadn’t realized that Rebecca had already been on Julia’s show twice as a guest!  Anyway, one of the thrills of my life over the last year has been to get to know Rebecca a little, as well as her husband, who’s some guy named Steve Pinker.  Like, they both live right here in Boston!  You can talk to them!  I was especially pleased two weeks ago to learn that Rebecca won the National Humanities Medal—as I told Julia, Rebecca Goldstein getting a medal at the White House is the sort of thing I imagine happening in my ideal fantasy world, making it a pleasant surprise that it happened in this one.  Huge congratulations to Rebecca!
3. The NSA has released probably its most explicit public statement so far about its plans to move to quantum-resistant cryptography.  For more see Bruce Schneier’s Crypto-Gram.  Hat tip for this item goes to reader Ole Aamot, one of the only people I’ve ever encountered whose name alphabetically precedes mine.
4. Last Tuesday, I got to hear Ayaan Hirsi Ali speak at MIT about her new book, Heretic, and then spend almost an hour talking to students who had come to argue with her.  I found her clear, articulate, and courageous (as I guess one has to be in her line of work, even with armed cops on either side of the lecture hall).  After the shameful decision of Brandeis in caving in to pressure and cancelling Hirsi Ali’s commencement speech, I thought it spoke well of MIT that they let her speak at all.  The bar shouldn’t be that low, but it is.
5. From far away on the political spectrum, I also heard Noam Chomsky talk last week (my first time hearing him live), about the current state of linguistics.  Much of the talk, it struck me, could have been given in the 1950s with essentially zero change (and I suspect Chomsky would agree), though a few parts of it were newer, such as the speculation that human languages have many of the features they do in order to minimize the amount of computation that the speaker needs to perform.  The talk was full of declarations that there had been no useful work whatsoever on various questions (e.g., about the evolutionary function of language), that they were total mysteries and would perhaps remain total mysteries forever.
6. Many of you have surely heard by now that Terry Tao solved the Erdős Discrepancy Problem, by showing that for every infinite sequence of heads and tails and every positive integer C, there’s a positive integer k such that, if you look at the subsequence formed by every kth flip, there comes a point where the heads outnumber tails or vice versa by at least C.  This resolves a problem that’s been open for more than 80 years.  For more details, see this post by Timothy Gowers.  Notably, Tao’s proof builds, in part, on a recent Polymath collaborative online effort.  It was a big deal last year when Konev and Lisitsa used a SAT-solver to prove that there’s always a subsequence with discrepancy at least 3; Tao’s result now improves on that bound by ∞.
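For readers who want to play with the statement in item 6, here is a minimal sketch (function names are my own, and heads/tails are encoded as +1/-1). It computes the discrepancy of a finite sequence, and does a tiny backtracking search showing that discrepancy 1 can be kept up for only 11 flips. This is nothing like the Konev-Lisitsa SAT computation, and the search is only feasible for C = 1.

```python
def discrepancy(seq):
    """Max of |x_d + x_{2d} + ... + x_{jd}| over all steps d and lengths j that fit.
    seq is a list of +1/-1 values, with seq[0] playing the role of x_1."""
    n = len(seq)
    best = 0
    for d in range(1, n + 1):
        partial = 0
        for idx in range(d, n + 1, d):
            partial += seq[idx - 1]
            best = max(best, abs(partial))
    return best


def extends_ok(seq, C):
    """Check only the sums that end at the newest term, i.e. steps d dividing len(seq)."""
    n = len(seq)
    for d in range(1, n + 1):
        if n % d == 0:
            if abs(sum(seq[i - 1] for i in range(d, n + 1, d))) > C:
                return False
    return True


def longest_sequence(C):
    """Length of the longest +1/-1 sequence with discrepancy at most C.
    Exhaustive backtracking: only call this with C = 1."""
    best = 0
    stack = [[]]
    while stack:
        seq = stack.pop()
        best = max(best, len(seq))
        for x in (1, -1):
            candidate = seq + [x]
            if extends_ok(candidate, C):
                stack.append(candidate)
    return best
```

For example, `discrepancy([1, -1, -1, 1, -1, 1, 1, -1, -1, 1, 1])` is 1, and whichever value is appended as a twelfth flip, the discrepancy rises to 2.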

## September 21, 2015

### Tommaso Dorigo — Statistics Lectures For Physicists In Traunkirchen

The challenge of providing Ph.D. students in Physics with an overview of statistical methods and concepts useful for data analysis in just three hours of lectures is definitely a serious one, so I decided to take it on when I was invited to the "Indian Summer School" in the pleasant lakeside town of Traunkirchen, Austria.

### Scott Aaronson — Bell inequality violation finally done right

A few weeks ago, Hensen et al., of the Delft University of Technology and Barcelona, Spain, put out a paper reporting the first experiment that violates the Bell inequality in a way that closes off the two main loopholes simultaneously: the locality and detection loopholes.  Well, at least with ~96% confidence.  This is big news, not only because of the result itself, but because of the advances in experimental technique needed to achieve it.  Last Friday, two renowned experimentalists—Chris Monroe of U. of Maryland and Jungsang Kim of Duke—visited MIT, and in addition to talking about their own exciting ion-trap work, they did a huge amount to help me understand the new Bell test experiment.  So OK, let me try to explain this.

While some people like to make it more complicated, the Bell inequality is the following statement. Alice and Bob are cooperating with each other to win a certain game (the “CHSH game”) with the highest possible probability. They can agree on a strategy and share information and particles in advance, but then they can’t communicate once the game starts. Alice gets a uniform random bit x, and Bob gets a uniform random bit y (independent of x).  Their goal is to output bits, a and b respectively, such that a XOR b = x AND y: in other words, such that a and b are different if and only if x and y are both 1.  The Bell inequality says that, in any universe that satisfies the property of local realism, no matter which strategy they use, Alice and Bob can win the game at most 75% of the time (for example, by always outputting a=b=0).
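The 75% bound is easy to check by exhaustion. The sketch below enumerates all 16 deterministic strategies, where Alice's answer depends only on x and Bob's only on y. Shared randomness just mixes these strategies, so it cannot beat the best of them. The function names are mine.

```python
from fractions import Fraction
from itertools import product

def win_probability(a, b):
    """a[x] is Alice's output on input x; b[y] is Bob's output on input y."""
    wins = sum((a[x] ^ b[y]) == (x & y) for x, y in product((0, 1), repeat=2))
    return Fraction(wins, 4)              # the four input pairs are equally likely

def best_classical():
    strategies = list(product((0, 1), repeat=2))   # all functions {0,1} -> {0,1}
    return max(win_probability(a, b) for a in strategies for b in strategies)

print(best_classical())
```

The maximum comes out to 3/4, and the all-zeros strategy from the text is one of the strategies that attains it.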

What does local realism mean?  It means that, after she receives her input x, any experiment Alice can perform in her lab has a definite result that might depend on x, on the state of her lab, and on whatever information she pre-shared with Bob, but at any rate, not on Bob’s input y.  If you like: a=a(x,w) is a function of x and of the information w available before the game started, but is not a function of y.  Likewise, b=b(y,w) is a function of y and w, but not of x.  Perhaps the best way to explain local realism is that it’s the thing you believe in, if you believe all the physicists babbling about “quantum entanglement” just missed something completely obvious.  Clearly, at the moment two “entangled” particles are created, but before they separate, one of them flips a tiny coin and then says to the other, “listen, if anyone asks, I’ll be spinning up and you’ll be spinning down.”  Then the naïve, doofus physicists measure one particle, find it spinning down, and wonder how the other particle instantly “knows” to be spinning up—oooh, spooky! mysterious!  Anyway, if that’s how you think it has to work, then you believe in local realism, and you must predict that Alice and Bob can win the CHSH game with probability at most 3/4.

What Bell observed in 1964 is that, even though quantum mechanics doesn’t let Alice send a signal to Bob (or vice versa) faster than the speed of light, it still makes a prediction about the CHSH game that conflicts with local realism.  (And thus, quantum mechanics exhibits what one might not have realized beforehand was even a logical possibility: it doesn’t allow communication faster than light, but simulating the predictions of quantum mechanics in a classical universe would require faster-than-light communication.)  In particular, if Alice and Bob share entangled qubits, say $$\frac{\left| 00 \right\rangle + \left| 11 \right\rangle}{\sqrt{2}},$$ then there’s a simple protocol that lets them violate the Bell inequality, winning the CHSH game ~85% of the time (with probability (1+1/√2)/2 > 3/4).  Starting in the 1970s, people did experiments that vindicated the prediction of quantum mechanics, and falsified local realism—or so the story goes.
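For the curious, here is a small numerical sketch of one such protocol. The measurement angles are an assumption on my part (a standard textbook choice, not something spelled out above): Alice measures in a real basis rotated by 0 or π/4 depending on x, and Bob in a basis rotated by π/8 or -π/8 depending on y.

```python
import math

# Amplitudes of the shared state (|00> + |11>)/sqrt(2).
BELL = {(0, 0): 1 / math.sqrt(2), (1, 1): 1 / math.sqrt(2)}

def basis(theta):
    """Orthonormal real basis rotated by theta; entry 0 or 1 is the outcome."""
    return [(math.cos(theta), math.sin(theta)),
            (-math.sin(theta), math.cos(theta))]

def outcome_probability(theta_a, theta_b, a, b):
    """Probability that Alice sees a and Bob sees b on the shared Bell pair."""
    va, vb = basis(theta_a)[a], basis(theta_b)[b]
    amplitude = sum(va[i] * vb[j] * c for (i, j), c in BELL.items())
    return amplitude ** 2

ALICE_ANGLE = {0: 0.0, 1: math.pi / 4}          # assumed angles, indexed by x
BOB_ANGLE = {0: math.pi / 8, 1: -math.pi / 8}   # assumed angles, indexed by y

def quantum_win_probability():
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            for a in (0, 1):
                for b in (0, 1):
                    if (a ^ b) == (x & y):
                        total += 0.25 * outcome_probability(
                            ALICE_ANGLE[x], BOB_ANGLE[y], a, b)
    return total

print(quantum_win_probability(), (1 + 1 / math.sqrt(2)) / 2)
```

With these angles each of the four input pairs is won with probability cos²(π/8), which is where the (1+1/√2)/2 figure comes from.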

The violation of the Bell inequality has a schizophrenic status in physics.  To many of the physicists I know, Nature’s violating the Bell inequality is so trivial and obvious that it’s barely even worth doing the experiment: if people had just understood and believed Bohr and Heisenberg back in 1925, there would’ve been no need for this whole tiresome discussion.  To others, however, the Bell inequality violation remains so unacceptable that some way must be found around it—from casting doubt on the experiments that have been done, to overthrowing basic presuppositions of science (e.g., our own “freedom” to generate random bits x and y to send to Alice and Bob respectively).

For several decades, there was a relatively conservative way out for local realist diehards, and that was to point to “loopholes”: imperfections in the existing experiments which meant that local realism was still theoretically compatible with the results, at least if one was willing to assume a sufficiently strange conspiracy.

Fine, you interject, but surely no one literally believed these little experimental imperfections would be the thing that would rescue local realism?  Not so fast.  Right here, on this blog, I’ve had people point to the loopholes as a reason to accept local realism and reject the reality of quantum entanglement.  See, for example, the numerous comments by Teresa Mendes in my Whether Or Not God Plays Dice, I Do post.  Arguing with Mendes back in 2012, I predicted that the two main loopholes would both be closed in a single experiment—and not merely eventually, but in, like, a decade.  I was wrong: achieving this milestone took only a few years.

Before going further, let’s understand what the two main loopholes are (or rather, were).

The locality loophole arises because the measuring process takes time and Alice and Bob are not infinitely far apart.  Thus, suppose that, the instant Alice starts measuring her particle, a secret signal starts flying toward Bob’s particle at the speed of light, revealing her choice of measurement setting (i.e., the value of x).  Likewise, the instant Bob starts measuring his particle, his doing so sends a secret signal flying toward Alice’s particle, revealing the value of y.  By the time the measurements are finished, a few microseconds later, there’s been plenty of time for the two particles to coordinate their responses to the measurements, despite being “classical under the hood.”

Meanwhile, the detection loophole arises because in practice, measurements of entangled particles—especially of photons—don’t always succeed in finding the particles, let alone ascertaining their properties.  So one needs to select those runs of the experiment where Alice and Bob both find the particles, and discard all the “bad” runs where they don’t.  This by itself wouldn’t be a problem, if not for the fact that the very same measurement that reveals whether the particles are there, is also the one that “counts” (i.e., where Alice and Bob feed x and y and get out a and b)!

To someone with a conspiratorial mind, this opens up the possibility that the measurement’s success or failure is somehow correlated with its result, in a way that could violate the Bell inequality despite there being no real entanglement.  To illustrate, suppose that at the instant they’re created, one entangled particle says to the other: “listen, if Alice measures me in the x=0 basis, I’ll give the a=1 result.  If Bob measures you in the y=1 basis, you give the b=1 result.  In any other case, we’ll just evade detection and count this run as a loss.”  In such a case, Alice and Bob will win the game with certainty, whenever it gets played at all—but that’s only because of the particles’ freedom to choose which rounds will count.  Indeed, by randomly varying their “acceptable” x and y values from one round to the next, the particles can even make it look like x and y have no effect on the probability of a round’s succeeding.
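Here is a toy simulation of that conspiracy, as a sketch only. In each round the two "particles" secretly agree on one acceptable setting each, plus answers that win for that pair of settings, and each particle hides whenever its own measurer picks the other setting. Note that each particle's behavior depends only on its own input, so nothing nonlocal is going on. The variable names and the uniform choice of hidden plan are my assumptions.

```python
import random

def simulate(rounds, seed=0):
    rng = random.Random(seed)
    played = wins = 0
    for _ in range(rounds):
        # Hidden plan fixed when the pair is created.
        x_ok, y_ok = rng.randint(0, 1), rng.randint(0, 1)
        a = rng.randint(0, 1)
        b = a ^ (x_ok & y_ok)             # answers that win on inputs (x_ok, y_ok)
        # Freely chosen measurement settings.
        x, y = rng.randint(0, 1), rng.randint(0, 1)
        if x != x_ok or y != y_ok:
            continue                      # a particle "evades detection"; run discarded
        played += 1
        wins += (a ^ b) == (x & y)
    return played, wins

played, wins = simulate(10_000)
print(played, wins)
```

Only about a quarter of the rounds get counted, but every counted round is a win, which is far better than either the classical 75% or the quantum ~85%.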

Until a month ago, the state-of-the-art was that there were experiments that closed the locality loophole, and other experiments that closed the detection loophole, but there was no single experiment that closed both of them.

To close the locality loophole, “all you need” is a fast enough measurement on photons that are far enough apart.  That way, even if the vast Einsteinian conspiracy is trying to send signals between Alice’s and Bob’s particles at the speed of light, to coordinate the answers classically, the whole experiment will be done before the signals can possibly have reached their destinations.  Admittedly, as Nicolas Gisin once pointed out to me, there’s a philosophical difficulty in defining what we mean by the experiment being “done.”  To some purists, a Bell experiment might only be “done” once the results (i.e., the values of a and b) are registered in human experimenters’ brains!  And given the slowness of human reaction times, this might imply that a real Bell experiment ought to be carried out with astronauts on faraway space stations, or with Alice on the moon and Bob on earth (which, OK, would be cool).  If we’re being reasonable, however, we can grant that the experiment is “done” once a and b are safely recorded in classical, macroscopic computer memories—in which case, given the speed of modern computer memories, separating Alice and Bob by half a kilometer can be enough.  And indeed, experiments starting in 1998 (see for example here) have done exactly that; the current record, unless I’m mistaken, is 18 kilometers.  (Update: I was mistaken; it’s 144 kilometers.)  Alas, since these experiments used hard-to-measure photons, they were still open to the detection loophole.
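To get a feel for the time budgets involved, here is a back-of-the-envelope sketch that converts the separations mentioned above into light-travel times. It assumes a straight-line signal at the vacuum speed of light, and the helper name is mine.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def light_travel_time_us(distance_m):
    """Microseconds for a light-speed signal to cover distance_m metres."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S * 1e6

for distance_m in (500, 18_000, 144_000):
    print(distance_m, "m:", round(light_travel_time_us(distance_m), 2), "microseconds")
```

Half a kilometer buys under two microseconds, so the choice of setting, the measurement, and the recording of the result all have to fit inside that window.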

To close the detection loophole, the simplest approach is to use entangled qubits that (unlike photons) are slow and heavy and can be measured with success probability approaching 1.  That’s exactly what various groups did starting in 2001 (see for example here), with trapped ions, superconducting qubits, and other systems.  Alas, given current technology, these sorts of qubits are virtually impossible to move miles apart from each other without decohering them.  So the experiments used qubits that were close together, leaving the locality loophole wide open.

So the problem boils down to: how do you create long-lasting, reliably-measurable entanglement between particles that are very far apart (e.g., in separate labs)?  There are three basic ideas in Hensen et al.’s solution to this problem.

The first idea is to use a hybrid system.  Ultimately, Hensen et al. create entanglement between electron spins in nitrogen vacancy centers in diamond (one of the hottest—or coolest?—experimental quantum information platforms today), in two labs that are about a mile away from each other.  To get these faraway electron spins to talk to each other, they make them communicate via photons.  If you stimulate an electron, it’ll sometimes emit a photon with which it’s entangled.  Very occasionally, the two electrons you care about will even emit photons at the same time.  In those cases, by routing those photons into optical fibers and then measuring the photons, it’s possible to entangle the electrons.

Wait, what?  How does measuring the photons entangle the electrons from whence they came?  This brings us to the second idea, entanglement swapping.  The latter is a famous procedure to create entanglement between two particles A and B that have never interacted, by “merely” entangling A with another particle A’, entangling B with another particle B’, and then performing an entangled measurement on A’ and B’ and conditioning on its result.  To illustrate, consider the state

$$\frac{\left| 00 \right\rangle + \left| 11 \right\rangle}{\sqrt{2}} \otimes \frac{\left| 00 \right\rangle + \left| 11 \right\rangle}{\sqrt{2}}$$

and now imagine that we project the first and third qubits onto the state $$\frac{\left| 00 \right\rangle + \left| 11 \right\rangle}{\sqrt{2}}.$$

If the measurement succeeds, you can check that we’ll be left with the state $$\frac{\left| 00 \right\rangle + \left| 11 \right\rangle}{\sqrt{2}}$$ in the second and fourth qubits, even though those qubits were not entangled before.
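If you'd rather have a computer do the checking, here is a minimal sketch with plain dictionaries of amplitudes. It labels the four qubits q1 to q4 in the order they appear in the tensor product, and all names are mine.

```python
import math
from itertools import product

S = 1 / math.sqrt(2)
BELL = {(0, 0): S, (1, 1): S}             # amplitudes of (|00> + |11>)/sqrt(2)

def two_bell_pairs():
    """Qubits 1,2 form one Bell pair and qubits 3,4 form another."""
    return {(q1, q2, q3, q4): BELL.get((q1, q2), 0.0) * BELL.get((q3, q4), 0.0)
            for q1, q2, q3, q4 in product((0, 1), repeat=4)}

def project_first_and_third(state):
    """Project qubits 1 and 3 onto the Bell state; return (probability, state of 2,4)."""
    out = {}
    for q2, q4 in product((0, 1), repeat=2):
        out[(q2, q4)] = sum(BELL.get((q1, q3), 0.0) * state[(q1, q2, q3, q4)]
                            for q1, q3 in product((0, 1), repeat=2))
    prob = sum(v * v for v in out.values())
    return prob, {k: v / math.sqrt(prob) for k, v in out.items()}

prob, remaining = project_first_and_third(two_bell_pairs())
print(prob, remaining)
```

In this idealized calculation the projection succeeds with probability 1/4, and the second and fourth qubits are left with amplitude 1/√2 on each of 00 and 11, which is the Bell state again.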

So to recap: these two electron spins, in labs a mile away from each other, both have some probability of producing a photon.  The photons, if produced, are routed to a third site, where if they’re both there, then an entangled measurement on both of them (and a conditioning on the results of that measurement) has some nonzero probability of causing the original electron spins to become entangled.

But there’s a problem: if you’ve been paying attention, all we’ve done is cause the electron spins to become entangled with some tiny, nonzero probability (something like 6.4×10⁻⁹ in the actual experiment).  So then, why is this any improvement over the previous experiments, which just directly measured faraway entangled photons, and also had some small but nonzero probability of detecting them?

This leads to the third idea.  The new setup is an improvement because, whenever the photon measurement succeeds, we know that the electron spins are there and that they’re entangled, without having to measure the electron spins to tell us that.  In other words, we’ve decoupled the measurement that tells us whether we succeeded in creating an entangled pair, from the measurement that uses the entangled pair to violate the Bell inequality.  And because of that decoupling, we can now just condition on the runs of the experiment where the entangled pair was there, without worrying that that will open up the detection loophole, biasing the results via some bizarre correlated conspiracy.  It’s as if the whole experiment were simply switched off, except for those rare lucky occasions when an entangled spin pair gets created (with its creation heralded by the photons).  On those rare occasions, Alice and Bob swing into action, measuring their respective spins within the brief window of time—about 4 microseconds—allowed by the locality loophole, seeking an additional morsel of evidence that entanglement is real.  (Well, actually, Alice and Bob swing into action regardless; they only find out later whether this was one of the runs that “counted.”)

So, those are the main ideas (as well as I understand them); then there’s lots of engineering.  In their setup, Hensen et al. were able to create just a few heralded entangled pairs per hour.  This allowed them to produce 245 CHSH games for Alice and Bob to play, and to reject the hypothesis of local realism at ~96% confidence.  Jungsang Kim explained to me that existing technologies could have produced many more events per hour, and hence, in a similar amount of time, “particle physics” (5σ or more) rather than “psychology” (2σ) levels of confidence that local realism is false.  But in this type of experiment, everything is a tradeoff.  Building not one but two labs for manipulating NV centers in diamond is extremely onerous, and Hensen et al. did what they had to do to get a significant result.
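To see how a couple of hundred games translates into a confidence level, here is a rough sketch: if local realism caps the win probability at 75%, the chance of winning at least k of n independent games is a binomial tail. The win count below is a hypothetical placeholder of mine, not a number from this post, and the independence assumption is a simplification of whatever the real analysis has to do.

```python
from math import comb

def binomial_tail(n, k, p):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

games = 245
wins = 196          # hypothetical win count, for illustration only
print(binomial_tail(games, wins, 0.75))
```

A tail probability of a few percent corresponds to the "psychology" regime, and the only way to push it toward 5σ at a fixed win rate is to play many more games.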

The basic idea here, of using photons to entangle longer-lasting qubits, is useful for more than pulverizing local realism.  In particular, the idea is a major part of current proposals for how to build a scalable ion-trap quantum computer.  Because of cross-talk, you can’t feasibly put more than 10 or so ions in the same trap while keeping all of them coherent and controllable.  So the current ideas for scaling up involve having lots of separate traps—but in that case, one will sometimes need to perform a Controlled-NOT, or some other 2-qubit gate, between a qubit in one trap and a qubit in another.  This can be achieved using the Gottesman-Chuang technique of gate teleportation, provided you have reliable entanglement between the traps.  But how do you create such entanglement?  Aha: the current idea is to entangle the ions by using photons as intermediaries, very similar in spirit to what Hensen et al. do.

At a more fundamental level, will this experiment finally convince everyone that local realism is dead, and that quantum mechanics might indeed be the operating system of reality?  Alas, I predict that those who confidently predicted that a loophole-free Bell test could never be done, will simply find some new way to wiggle out, without admitting the slightest problem for their previous view.  This prediction, you might say, is based on a different kind of realism.

## September 20, 2015

### Tim Gowers — EDP28 — problem solved by Terence Tao!

I imagine most people reading this will already have heard that Terence Tao has solved the Erdős discrepancy problem. He has blogged about the solution in two posts, a first that shows how to reduce the problem to the Elliott conjecture in number theory, and a second that shows (i) that an averaged form of the conjecture is sufficient and (ii) that he can prove the averaged form. Two preprints covering (i) and (ii) are here and here: the one covering (i) has been submitted to Discrete Analysis.

This post is therefore the final post of the polymath5 project. I refer you to Terry’s posts for the mathematics. I will just make a few comments about what all this says about polymath projects in general.

After the success of the first polymath project, which found a purely combinatorial proof of the density Hales-Jewett theorem, there was an appetite to try something similar. However, the subsequent experience made it look as though the first project had been rather lucky, and not necessarily a good indication of what the polymath approach will typically achieve. I started polymath2, about a Banach-space problem, which never really got off the ground. Gil Kalai started polymath3, on the polynomial Hirsch conjecture, but the problem was not solved. Terence Tao started polymath4, about finding a deterministic algorithm to output a prime between $n$ and $2n$, which did not find such an algorithm but did prove some partial results that were interesting enough to publish in an AMS journal called Mathematics of Computation. I started polymath5, with the aim of solving the Erdős discrepancy problem (after this problem was chosen by a vote from a shortlist that I drew up), and although we had some interesting ideas, we did not solve the problem. The most obviously successful polymath project was polymath8, which aimed to bring down the size of the gap in Zhang’s prime-gaps result, but it could be argued that success for that project was guaranteed in advance: it was obvious that the gap could be reduced, and the only question was how far.

Actually, that last argument is not very convincing, since a lot more came out of polymath8 than just a tightening up of the individual steps of Zhang’s argument. But I want to concentrate on polymath5. I have always felt that that project, despite not solving the problem, was a distinct success, because by the end of it I, and I was not alone, understood the problem far better and in a very different way. So when I discussed the polymath approach with people, I described its virtues as follows: a polymath discussion tends to go at lightning speed through all the preliminary stages of solving a difficult problem — trying out ideas, reformulating, asking interesting variants of the question, finding potentially useful reductions, and so on. With some problems, once you’ve done all that, the problem is softened up and you can go on and solve it. With others, the difficulties that remain are still substantial, but at least you understand far better what they are.

In the light of what has now happened, the second case seems like a very accurate description of the polymath5 project, since Terence Tao used ideas from that project in an essential way, but also recent breakthroughs in number theory by Kaisa Matomäki and Maksim Radziwiłł that led on to work by those authors and Terry himself that led on to the averaged form of the Elliott conjecture that Terry has just proved. Thus, if the proof of the Erdős discrepancy problem in some sense requires these ideas, then there was no way we could possibly have hoped to solve the problem back in 2010, when polymath5 was running, but what we did achieve was to create a sort of penumbra around the problem, which had the effect that when these remarkable results in number theory became available, the application to the Erdős discrepancy problem was significantly easier to spot, at least for Terence Tao …

I’ll remark here that the approach to the problem that excited me most when we were thinking about it was a use of duality to reduce the problem to an existential statement: you “just” have to find a function with certain properties and you are done. Unfortunately, finding such a function proved to be extremely hard. Terry’s work proves abstractly that such a function exists, but doesn’t tell us how to construct it. So I’m left feeling that perhaps I was a bit too wedded to that duality approach, though I also think that it would still be very nice if someone managed to make it work.

There are a couple of other questions that are interesting to think about. The first is whether polymath5 really did play a significant role in the discovery of the solution. Terry refers to the work of polymath5, but one of the key polymath5 steps he uses was contributed by him, so perhaps he could have just done the whole thing on his own.

At the very least I would say that polymath5 got him interested in the problem, and took him quickly through the stage I talked about above of looking at it from many different angles. Also, the Fourier reduction argument that Terry found was a sort of response to observations and speculations that had taken place in the earlier discussion, so it seems likely that in some sense polymath5 played a role in provoking Terry to have the thoughts he did. My own experience of polymath projects is that they often provoke me to have thoughts I wouldn’t have had otherwise, even if the relationship between those thoughts and what other people have written is very hard to pin down — it can be a bit like those moments where someone says A, and then you think of B, which appears to have nothing to do with A, but then you manage to reconstruct your daydreamy thought processes to see that A made you think of C, which made you think of D, which made you think of B.

Another question is what should happen to polymath projects that don’t result in a solution of the problem that they are trying to solve, but do have useful ideas. Shouldn’t there come a time when the project “closes” and the participants (and others) are free to think about the problem individually? I feel strongly that there should, since otherwise there is a danger that a polymath project could actually delay progress on a problem by discouraging research on it. With polymath5 I tried to signal such a “closure” by writing a survey article that was partly about the work of polymath5. And Terry has now written up his work as an individual author, but been careful to say which ingredients of his proof were part of the polymath5 discussion and which were new. That seems to me to be exactly how things should work, but perhaps the lesson for the future is that the closing of a polymath project should be done more explicitly — up to now several of them have just quietly died. I had at one time intended to do rather more than what I did in the survey article, and write up, on behalf of polymath5 and published under the polymath name, a proper paper that would contain the main ideas discovered by polymath5 with full proofs. That would have been a better way of closing the project and would have led to a cleaner situation — Terry could have referred to that paper just as anyone refers to a mathematical paper. But while I regret not getting round to that, I don’t regret it too much, because I also quite like the idea that polymath5’s ideas are freely available on the internet but not in the form of a traditional journal article. (I still think that on balance it would have been better to write up the ideas though.)

Another lesson for the future is that it would be great to have some more polymath projects. We now know that Polymath5 has accelerated the solution of a famous open problem. I think we should be encouraged by this and try to do the same for several other famous open problems, but this time with the idea that as soon as the discussion stalls, the project will be declared to be finished. Gil Kalai has said on his blog that he plans to start a new project: I hope it will happen soon. And at some point when I feel slightly less busy, I would like to start one too, on another notorious problem with an elementary statement. It would be interesting to see whether a large group of people thinking together could find anything new to say about, for example, Frankl’s union-closed conjecture, or the asymptotics of Ramsey numbers, or the cap-set problem, or …

### n-Category Café — The Free Modular Lattice on 3 Generators

The set of subspaces of a vector space, or submodules of some module of a ring, is a lattice. It’s typically not a distributive lattice. But it’s always modular, meaning that the distributive law

$a \vee (b \wedge c) = (a \vee b) \wedge (a \vee c)$

holds when $a \le b$ or $a \le c$. Another way to say it is that a lattice is modular iff whenever you’ve got $a \le a'$, then the existence of an element $x$ with

$a \wedge x = a' \wedge x \; \mathrm{and} \; a \vee x = a' \vee x$

is enough to imply $a = a'$. Yet another way to say it is that there’s an order-preserving map from the interval $[a \wedge b,b]$ to the interval $[a,a \vee b]$ that sends any element $x$ to $x \vee a$, with an order-preserving inverse that sends $y$ to $y \wedge b$:

Dedekind studied modular lattices near the end of the nineteenth century, and in 1900 he published a paper showing that the free modular lattice on 3 generators has 28 elements.
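To make the modular law concrete, here is a small sketch that tests it by brute force on the two standard five-element examples: the "diamond" $M_3$ (modular but not distributive) and the "pentagon" $N_5$ (not modular). Neither example appears above; they are textbook lattices I am bringing in for illustration, and the function names are mine.

```python
from itertools import product

def order_relation(elements, covers):
    """Reflexive-transitive closure of the given (lower, upper) covering pairs."""
    leq = {(x, x) for x in elements} | set(covers)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(leq), repeat=2):
            if b == c and (a, d) not in leq:
                leq.add((a, d))
                changed = True
    return leq

def join(x, y, elements, leq):
    uppers = [u for u in elements if (x, u) in leq and (y, u) in leq]
    return next(u for u in uppers if all((u, v) in leq for v in uppers))

def meet(x, y, elements, leq):
    lowers = [d for d in elements if (d, x) in leq and (d, y) in leq]
    return next(d for d in lowers if all((v, d) in leq for v in lowers))

def distributive_law_holds(a, b, c, elements, leq):
    lhs = join(a, meet(b, c, elements, leq), elements, leq)
    rhs = meet(join(a, b, elements, leq), join(a, c, elements, leq), elements, leq)
    return lhs == rhs

def is_distributive(elements, leq):
    return all(distributive_law_holds(a, b, c, elements, leq)
               for a, b, c in product(elements, repeat=3))

def is_modular(elements, leq):
    # Only require the distributive law when a <= b or a <= c.
    return all(distributive_law_holds(a, b, c, elements, leq)
               for a, b, c in product(elements, repeat=3)
               if (a, b) in leq or (a, c) in leq)

ELEMENTS = ["0", "a", "b", "c", "1"]
M3 = order_relation(ELEMENTS, [("0", "a"), ("0", "b"), ("0", "c"),
                               ("a", "1"), ("b", "1"), ("c", "1")])
N5 = order_relation(ELEMENTS, [("0", "a"), ("a", "b"), ("b", "1"),
                               ("0", "c"), ("c", "1")])

print(is_modular(ELEMENTS, M3), is_distributive(ELEMENTS, M3))
print(is_modular(ELEMENTS, N5))
```

The lattice of subspaces of a plane already contains copies of $M_3$ (take three distinct lines through the origin), which is one quick way to see why subspace lattices are modular but typically not distributive.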

One reason this is interesting is that the free modular lattice on 4 or more generators is infinite. But the other interesting thing is that the free modular lattice on 3 generators has intimate relations with 8-dimensional space. I have some questions about this stuff.

One thing Dedekind did is concretely exhibit the free modular lattice on 3 generators as a sublattice of the lattice of subspaces of $\mathbb{R}^8$. If we pick a basis of this vector space and call it $e_1, \dots, e_8$, he looked at the subspaces

$X = \langle e_2, e_4, e_5, e_8 \rangle , \quad Y = \langle e_2, e_3, e_6, e_7 \rangle, \quad Z = \langle e_1, e_4, e_6, e_7 + e_8 \rangle$

By repeatedly taking intersections and sums (the joins in the lattice of subspaces, since a union of subspaces is generally not a subspace), he built 28 subspaces starting from these three.
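Here is a sketch of how one might reproduce this count by machine. To keep everything finite and exact it swaps $\mathbb{R}$ for the two-element field, so each subspace is just a set of 8-bit masks; that swap is my assumption, made on the grounds that the three subspaces are spanned by 0/1 combinations of basis vectors. The join of two subspaces is computed as their sum, and all names are mine.

```python
def span(vectors):
    """All GF(2) linear combinations of the given bitmask vectors."""
    space = {0}
    for v in vectors:
        if v not in space:
            space |= {u ^ v for u in space}   # adjoin v: the span doubles in size
    return frozenset(space)

def e(i):
    """Basis vector e_i as a bitmask, for i = 1..8."""
    return 1 << (i - 1)

X = span([e(2), e(4), e(5), e(8)])
Y = span([e(2), e(3), e(6), e(7)])
Z = span([e(1), e(4), e(6), e(7) ^ e(8)])

def generated_lattice(generators):
    """Close a family of subspaces under sum and intersection."""
    found = set(generators)
    frontier = list(generators)
    while frontier:
        a = frontier.pop()
        for b in list(found):
            for c in (span(a | b), a & b):    # sum (join) and intersection (meet)
                if c not in found:
                    found.add(c)
                    frontier.append(c)
    return found

lattice = generated_lattice([X, Y, Z])
print(len(lattice))
```

If the encoding is right, the count should agree with the 28 subspaces described above. Note that the zero subspace and the whole space both appear here on their own, as $X \cap Y \cap Z$ and $X + Y + Z$.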

This proves the free modular lattice on 3 generators has at least 28 elements. In fact it has exactly 28 elements. I think Dedekind showed this by working out the free modular lattice ‘by hand’ and noting that it, too, has 28 elements. It looks like this:

This picture makes it a bit hard to see the $S_3$ symmetry of the lattice, but if you look you can see it. (Can someone please draw a nice 3d picture that makes the symmetry manifest?)

If you look carefully here, as Hugh Thomas did, you will see 30 elements! That’s because the person who drew this picture, like me, defines a lattice to be a poset with least upper bounds and greatest lower bounds for all finite subsets. Dedekind defined it to be a poset with least upper bounds and greatest lower bounds for all nonempty finite subsets. In other words, Dedekind’s kind of lattice has operations $\vee$ and $\wedge$, while mine also has a top and bottom element. So, Dedekind’s ‘free modular lattice on 3 generators’ did not include the top and bottom element of the picture here. So, it had just 28 elements.

Now, there’s something funny about how 8-dimensional space and the number 28 are showing up here. After all, the dimension of $\mathrm{SO}(8)$ is 28. This could be just a coincidence, but maybe not. Let me explain why.

The 3 subspace problem asks us to classify triples of subspaces of a finite-dimensional vector space $V$, up to invertible linear transformations of $V$. There are finitely many possibilities, unlike the situation for the 4 subspace problem. One way to see this is to note that 3 subspaces $X, Y, Z \subseteq V$ give a representation of the $D_4$ quiver, which is this little category here:

This fact is trivial: a representation of the $D_4$ quiver is just 3 linear maps $X \to V$, $Y \to V$, $Z \to V$, and here we are taking those to be inclusions. The nontrivial part is that indecomposable representations of any Dynkin quiver correspond in a natural one-to-one way with positive roots of the corresponding Lie algebra. The Lie algebra corresponding to $D_4$ is $\mathfrak{so}(8)$, the Lie algebra of the group of rotations in 8 dimensions. This Lie algebra has 12 positive roots. So, the $D_4$ quiver has 12 indecomposable representations. The representation coming from any triple of subspaces $X, Y, Z \subseteq V$ must be a direct sum of these indecomposable representations, so we can classify the possibilities and solve the 3 subspace problem!
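The root count is easy to verify by machine. The sketch below uses the standard coordinate model of the $D_n$ root system, namely the vectors $\pm e_i \pm e_j$ with $i \lt j$, and calls a root positive when its first nonzero coordinate is positive. Both conventions are standard facts rather than anything stated above, and the names are mine.

```python
from itertools import combinations, product

def d_n_roots(n):
    """Roots of D_n in coordinates: all vectors +-e_i +- e_j with i < j."""
    roots = []
    for i, j in combinations(range(n), 2):
        for si, sj in product((1, -1), repeat=2):
            v = [0] * n
            v[i], v[j] = si, sj
            roots.append(tuple(v))
    return roots

def is_positive(root):
    """Positive means the first nonzero coordinate is positive."""
    return next(c for c in root if c != 0) > 0

roots = d_n_roots(4)
positive = [r for r in roots if is_positive(r)]
rank = 4                      # dimension of the Cartan subalgebra of so(8)
dim_so8 = 8 * 7 // 2          # antisymmetric 8x8 matrices

print(len(roots), len(positive), len(roots) + rank, dim_so8)
```

So there are 24 roots, 12 of them positive, and adding the rank recovers the dimension 28 of $\mathfrak{so}(8)$.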

What’s going on here? On the one hand, as Dedekind showed, the free modular lattice on 3 generators shows up as a lattice of subspaces generated by 3 subspaces of $\mathbb{R}^8$. On the other hand, the 3 subspace problem is closely connected to classifying representations of the $D_4$ quiver, whose corresponding Lie algebra happens to be $\mathfrak{so}(8)$. But what’s the relation between these two facts, if any?

Another way to put the question is this: what’s the relation between the 12 indecomposable representations of the $D_4$ quiver and the 28 elements of the free modular lattice on 3 generators? Or, more numerologically speaking: what relationship between the numbers 12 and 28 is at work in this business?

Here’s one somewhat wacky guess. The Lie algebra $\mathfrak{so}(8)$ has 12 positive roots, and its Cartan subalgebra has dimension 4. As usual, the Lie algebra is spanned by root vectors for the positive roots, root vectors for an equal number of negative roots, and the Cartan subalgebra, so we get

$28 = 12 + 12 + 4$

But I don’t really see how this is connected to anything I’d said previously. In particular, I don’t see why 24 of the 28 elements of the lattice of subspaces generated by

$X = \langle e_2, e_4, e_5, e_8 \rangle , \; Y = \langle e_2, e_3, e_6, e_7 \rangle, \; Z = \langle e_1, e_4, e_6, e_7 + e_8 \rangle$

should be related to roots of $D_4$.

I think a more sane, non-numerological approach to this network of issues is to take the $D_4$ quiver representation corresponding to Dedekind’s choice of $X , Y, Z \subseteq \mathbb{R}^8$, decompose it into indecomposables, and see which positive roots those correspond to. I may try my hand at that in the comments, but I’m really looking for some help here.

## September 17, 2015

### Scott Aaronson — Ask Me Anything: Diversity Edition

With the fall semester imminent, and by popular request, I figured I’d do another Ask Me Anything (see here for the previous editions).  This one has a special focus: I’m looking for questions from readers who consider themselves members of groups that have historically been underrepresented in the Shtetl-Optimized comments section.  Besides the “obvious”—e.g., women and underrepresented ethnic groups—other examples might include children, traditionally religious people, jocks, liberal-arts majors… (but any group that includes John Sidles is probably not an example).  If I left out your group, please go ahead and bring it to my and your fellow readers’ attention!

My overriding ideal in life—what is to me as Communism was to Lenin, as Frosted Flakes are to Tony the Tiger—is people of every background coming together to discover and debate universal truths that transcend their backgrounds.  So few things have ever stung me more than accusations of being a closed-minded ivory-tower elitist white male nerd etc. etc.  Anyway, to anyone who’s ever felt excluded here for whatever reason, I hope this AMA will be taken as a small token of goodwill.

Similar rules apply as to my previous AMAs:

• Only one question per person.
• No multi-part questions, or questions that require me to read a document or watch a video and then comment on it.
• Questions need not have anything to do with your underrepresented group (though they could). Math, science, futurology, academic career advice, etc. are all fine.  But please be courteous; anything gratuitously nosy or hostile will be left in the moderation queue.
• I’ll stop taking further questions most likely after 24 hours (I’ll post a warning before closing the thread).

Update (Sep. 6): For anyone from the Boston area, or planning to visit it, I have an important piece of advice.  Do not ever, under any circumstances, attempt to visit Walden Pond, and tell everyone you know to stay away.  After we spent 40 minutes driving there with a toddler, the warden literally screamed at us to go away, that the park was at capacity. It wasn’t an issue of parking: even if we’d parked elsewhere, we just couldn’t go.  Exceptions were made for the people in front of us, but not for us, the ones with the 2-year-old who’d been promised her weekend outing would be to meet her best friend at Walden Pond.  It’s strangely fitting that what for Thoreau was a place of quiet contemplation, is today purely a site of overcrowding and frustration.

Another Update: OK, no new questions please, only comments on existing questions! I’ll deal with the backlog later today. Thanks to everyone who contributed.

### Sesh Nadathur — 10 tips for making postdoc applications (Part 1)

Around this time of year in the academic cycle, thousands of graduate students around the world will be starting to apply for a limited supply of short-term postdoctoral research positions, or 'postdocs'. They will not only be competing against each other, but also against slightly more senior colleagues applying for their second or possibly third or fourth postdocs.

The lucky minority who are successful — and, as Richie Benaud once said about cricket captaincy, it is a matter of 90% luck and 10% skill, but don't try it without the 10% — will probably need to move their entire life and family to a new city, country or continent. The entire application cycle can last two or three months — or much longer for those who are not successful in the first round — and is by far the most stressful part of an early academic career.

What I'd like to do here is to provide some unsolicited advice on how best to approach the application process, which I hope will be of help to people starting out on it. This advice mostly consists of a collection of things that I wish people had told me when I was starting out myself, plus things that people did tell me, but that for whatever reason I didn't understand or appreciate.

My own application experience has been in the overlapping fields of cosmology, astrophysics and high-energy particle physics, and most of my advice is written with these fields in mind. Some points are likely to be more generally useful, but I don't promise anything!

I'm also not going to claim to know much about what types of things hiring professors or committees actually look for — in fact, I strongly suspect that there are very few useful generalizations that can be made which cover all types of jobs and departments. So I won't tell you what to wear for an interview, or what font to use in your CV. Instead I'll try to focus on things that might help make the application process a bit less stressful for you, the applicant, giving you a better chance of coming out the other side still happy, sane, and excited about science.

With that preamble out of the way, here are the first 5 of my tips for applying for postdocs! The next 5 follow in part 2 of this post.

### 1. Start early

At least in the high-energy and astro fields, the way the postdoc job market works means that for the vast majority of jobs starting in September or October of a given year, the application deadlines fall around September to October of the previous year. Sometimes — particularly for positions at European universities — the deadlines may be a month or two later. However, for most available positions job offers are made around Christmas or early in the new year, and the number of positions still advertised after about February is small to start with and decreases fast with each additional month.

This means if you want to start a postdoc in 2016, you should already have started preparing your application materials. If not, it's not too late, but start immediately!

Applying for research jobs is a very different type of activity to doing research, is not as interesting, requires learning a different set of skills, and therefore can be quite daunting. This makes it all too easy to procrastinate and put it off! In my first application cycle, I came up with a whole lot of excuses and didn't get around to seriously applying for anything until at least December, which is way too late.

### 2. Consider other options

This sounds a bit harsh, but I think it is vital. My point is not that getting into academia is a bad career move, necessarily. But don't get into it out of inertia. I've met a few people who, far too many months into the application cycle, with their funding due to run out, and despite scores of rejections, continue the desperate search for a postdoc position somewhere, anywhere, simply because they cannot imagine what else they might do.

Don't be that person. There are lots of cool things you can do even if you don't get a postdoc. There are many other interesting and fulfilling careers out there, which will provide greater security, won't require constant upheaval, and will almost certainly pay better. Many of them still require the kinds of skills we've spent so many years learning — problem solving, tricky mathematics, cool bits of coding, data analysis — but most projects outside academia will be shorter and less nebulous, success will be more quantifiable and the benefits of success may well be more tangible.

If you have no idea what kinds of jobs you could do outside academia, find out now. Get in touch with previous graduate students from your department who went that way, find out what they are doing and how they got there. The AstroBetter website provides a great collection of career profiles which may provide inspiration.

If after examining the alternatives you decide you'd still prefer that postdoc, great. But at least when you apply you won't be doing it purely out of inertia, and you'll have the reassurance that if you don't get it, there are other cool things you could do instead. And I'm pretty sure this will help your peace of mind during the weeks or months you spend re-drafting those research statements!

 Image credit: Jorge Cham.

### 3. Apply everywhere

It's not uncommon in physics for some postdoc ads to attract 100 qualified applicants or more per available position, and the number of advertised positions isn't that large. So apply for as many as you can! It's not a great idea to decide where to apply based on 'extraneous' reasons — e.g., you only want to live in California or Finland or some such.

Particularly if you're starting within Europe, you will probably have to move to a new country, and probably another new country after that. So if you have a strong aversion to moving countries, I'd suggest going back to point 2 above.

On the other hand, you can and should take a more positive view: living somewhere new, learning a new language and discovering a new culture and cuisine can be tremendous fun! Even remote places you've never heard of, or places you think you might not like, can provide you with some of the best memories of your life. Just as a small example, before I moved to Helsinki, my mental image of Finland was composed of endless dark, depressing winter. After two years here this image has been converted instead to one of summers of endless sunshine and beautiful days spent at the beach! (Disclaimer: of course Finland is also dark, cold and miserable sometimes. Especially November.)

So for every advertised position, unless you are absolutely 100% certain that you would rather quit academia than move there for a few years — don't think about it, just apply. For the others, think about it and then apply anyway. If you get offered the job you'll always be able to say no later.

### 4. Don't apply everywhere

However, life is short. Every day you spend drafting a statement telling people what great research you would do if they hired you is a day spent not doing research, or indeed anything else. If you apply for upwards of 50 different postdoc jobs (not an uncommon number!), all that time adds up.

So don't waste it. Read the job advertisement carefully, and assess your chances realistically. There's not much to be gained from applying to departments which are not a good academic fit for you.

When I first applied for postdocs several years ago, I would read an advert that said something like "members of the faculty in Department X have interests in, among other things, string theory, lattice QCD, high-temperature phase transitions, multiloop scattering amplitudes, collider phenomenology, BSM physics, and cosmology," and I'd focus on those two words "and cosmology". So despite knowing that "cosmology" is a very broad term that can mean different things to different people, and despite not being qualified to work on string theory, lattice QCD, high-temperature phase transitions etc., I'd send off my application talking about analysis of CMB data, galaxy redshift surveys and so on, optimistically reasoning that "they said they were interested in cosmology!" And then I'd never hear back from them.

Nowadays, my rule of thumb would be this: look through the list of faculty, and if it doesn't contain at least one or two people whose recent papers you have read carefully (not just skimmed the abstract!) because they intersected closely with your own work, don't bother applying. If you don't know them, they almost certainly won't know you. And if they don't know you or your work, your application probably won't even make it past the first round of sorting — faced with potentially hundreds of applicants, they won't even get around to reading your carefully crafted research statement or your glowing references.

Being selective in where you apply will save you a heap of time, allow you to produce better applications for the places which really do fit your profile, and most importantly leave you feeling a lot less jaded and disillusioned at the end of the process.

### 5. Choose your recommendations well

Almost all postdoc adverts ask for three letters of recommendation in addition to research plans and CVs. These letters will probably play a crucial part in the success of your application. Indeed for a lot of PhD students applying for their first postdoc, the decision to hire is based almost entirely on the recommendation letters - there's not much of an existing track record by this stage, after all.

So it's important to choose well when asking senior people to write these recommendations for you. As a graduate student, your thesis advisor has to be one of them. It helps if one of the others is from a different university to yours. If possible, all three should be people you have worked, or are working, closely with, e.g. coauthors. But if this is not possible, one of the three could also be a well-known person in the field who knows your work and can comment on its merit and significance in the literature.

Having said that, there are several other factors that go into choosing who to get recommendations from. Some professors are much better at supporting and promoting their students and postdocs in the job market than others. You'll notice these people at conferences and seminars: in their talks they will go out of their way to praise and give credit to the students who obtained the results they are presenting, whereas others might not bother. These people will likely write more helpful recommendations; they also generally provide excellent career advice, and may well help your application in other, less obvious, ways. They are the ideal mentors, and all other things being equal, their students typically fare much better at getting that first and all-important step on the postdoc ladder. Of course ideally your thesis advisor will be such a person, but if not, find someone in your department who is and ask them for help.

Somewhat unfortunately, I'm convinced that how well your referees themselves are known in the department to which you are applying is almost as important as how much they praise you. If neither you nor any of your referees have links — previous collaborations, research visits, invitations to give seminars — with members of the advertising department, I think the chances of your application receiving the fullest consideration are unfortunately much smaller. (I realise this is a cynical view and having never been on a hiring committee myself I have no more than anecdotal evidence in support of it. But I do see which postdocs get hired where.) So choose wisely.

It is also a good idea to talk frankly to your professors/advisor beforehand. Explain where you are planning to apply, what they are looking for, and what aspects of your research skills you would like their letters to emphasise. Get their advice, but also provide your own input. You don't want to end up with a research statement saying you're interested in working in field A, while your recommendations only talk about your contributions in field B.

---

That's it for part 1 of this lot of unsolicited advice. Part 2 is available here!

### Sesh Nadathur — 10 tips for making postdoc applications (Part 2)

This post is part 2 of a series with unsolicited advice for postdoc applicants. Part 1, which includes a description of the motivation behind the posts and tips 1 through 5, can be found here.

----

### 6. Promote yourself

This sounds sort of obvious, but for cultural reasons may come easier to some people than to others. I don't mean to suggest you should be boastful or oversell yourself in your CV and research statement. But be aware that people reading through hundreds of applications will not have time to read between the lines to discover your unstated accomplishments — so present the information that supports your application as clearly and as matter-of-factly as possible.

As a very junior member of the academic hierarchy, it is quite likely that nobody in the hiring department has heard of you or read any of your papers, no matter how good they were. They are far more likely to recognise your name if you happen to have taken some steps to make yourself known to them, for instance by having arranged a research visit, given a talk at their local journal club or seminar series, initiated a collaboration on topics of mutual interest, or simply introduced yourself and your research at some recent conference.

Organisers of seminar series and journal clubs are generally more than happy to have volunteers help fill some speaking slots — put anxieties to one side and just email them to ask! And if they do have time for you, make sure you give a good talk.

### 7. Know what type of postdoc you're applying for

There are, roughly speaking, three different categories of postdoc positions in high-energy and astro (and probably more generally in all fields of physics, if not in all sciences).

The first category is fellowships. These are positions which provide funding to the successful candidate to pursue a largely independent line of research. They therefore require you to propose a detailed and interesting research plan. They may sometimes be tied to a particular institution, but often they provide an external pot of money that you will be bringing to the department you go to. They are also highly prestigious. Examples applicable to a cosmology context are Hubble and Einstein fellowships in the US, CITA national fellowships in Canada, Royal Society and Royal Astronomical Society fellowships in the UK, Humboldt fellowships in Germany, Marie Skłodowska-Curie fellowships across Europe, and many others.

The second category is roughly a research assistant type of position. Here you are hired by some senior person who has won a grant for their research proposal, or has some other source of funding out of which to pay your salary. You are expected to work on their research project, in a pretty closely defined role.

The third category is a sort of mix of the above two, which I'll call a semi-independent postdoc. This is where the postdoc funding comes from someone's grant, but they do not specify a particular research programme at the outset, giving you a large degree of independence in what work you actually want to do.

 Image credit: Jorge Cham.

When you apply, it is imperative that you know which of these three types of positions the people hiring you have in mind. There is no point trying to sell a detailed independent research plan — no matter how exciting — to someone who is only interested in whether you have the specific skills and experience to do what they tell you to. Equally, if what they want is evidence that you will drive research in your own directions, an application that lists your technical skills but doesn't present a coherent plan of what you will do with them is no good.

Unfortunately postdoc ads don't use these terms, so it is generally not clear whether the position is of the second or third type. If in doubt, contact the department and find out what they want.

Also, even when the distinction may be clear, many people still produce the same type of application for all jobs they apply for in one cycle. I know of an instance of a highly successful young scientist who managed to win not one but two prestigious individual fellowship grants worth hundreds of thousands of Euros each, and yet did not get a single offer from any of the non-fellowship positions they applied to! So tailor your applications to the situation.

### 8. Apply for fellowships

Applications for the more prestigious fellowships require more work — a lot more work — than other postdoc applications. They will require you to produce a proper research proposal, which will need to include a clear and inspiring outline — of anything between one and twenty pages in length — of what you intend to do with the fellowship. This will take a long time to think of and longer to write. They will probably require you to work on the application in collaboration with your host department, and they may have a million other specifically-sized hoops for you to jump through.

Nevertheless, you should make a serious attempt to apply for them, for the following reasons.

• Any kind of successful academic career requires you to write lots of such proposals, so you might as well start practising now.
• Writing a proposal forces you to prepare a serious plan of what research you want to do over the next few years, which will help clarify a lot of things in your mind, including how much you actually want to stay in the field (see point 2 again!).
• You will often write the proposal in collaboration with your potential host department. This makes it far more likely that they will think favourably of you should any other opportunities arise there later! For instance, I know of many cases where applicants for Marie Curie fellowships have ended up with positions in the department of their choice even though the fellowship application itself was ultimately unsuccessful.
• Major fellowship programmes are more likely to have the resources and the procedures in place to thoroughly evaluate each proposal, reducing the unfortunate random element I'll talk about below. Many will provide individual feedback and assessments, which will help you if you reapply next year.
• Counter-intuitively, success rates may be significantly higher for major fellowships than for standard postdoc jobs! Last year, the success rate for Hubble fellowships was 5%, for Einstein fellowships 6%, and for Marie Curie fellowships (over all fields of physics) almost 18%. All of these numbers compare quite well with those for standard postdoc positions! Granted, this is at least partly due to self-selection by applicants who don't think they can prepare a good enough proposal in the first place, but it is still something to bear in mind.
• Obviously, a successful fellowship application counts for a lot more in advancing your career than a standard postdoc.

### 9. Recognize the randomness

Potential employers are faced with a very large number of applications for each postdoc vacancy; a ratio of 100:1 is not uncommon. Even with the best of intentions, it is just not possible to give each application equal careful consideration, so some basic pre-filtering is inevitable.

Unfortunately for you, each department will have its own criteria for pre-filtering, and you do not know what those criteria are. Some will filter on recommendation letters, some on number of publications, some on number of citations. (As a PhD student I was advised by a well-meaning faculty member at a leading UK university that although they found my research very interesting, I did not yet meet their cutoff of X citations for hiring postdocs.) Others may deduce your field of interest only from existing publications rather than your research statement — this is particularly hard on recent PhDs who may be trying to broaden their horizons beyond their advisor's influence.

Beyond this, it's doubtful that two people in different departments will have the same opinion of a given application anyway. They're only human, and their assessments will always be coloured by their own research interests, their plans for the future of the department, their different personal relationships with the writers of your recommendation letters, maybe even what they had for breakfast that morning.

You can't control any of this. Your job is to produce as good and complete an application as possible (remember to send everything they ask for!), to apply to lots of suitable places, and then to learn not to fret.

### 10. Don't tie your self-esteem to the outcome

You will get rejections. Many of them. Even worse, there will be many places who don't even bother to let you know you were rejected. You will sometimes get a rejection at the exact same time as someone else you know gets an offer, possibly for the same position. (Things are made worse if you read the postdoc rumour mills regularly.)

It's pretty hard to prevent these rejections from affecting you. It's all too easy to see them as a judgement of your scientific worth, or to develop a form of imposter syndrome. Don't do this! Read point 9 again.

I'd also highly recommend reading this post by Renée Hlozek, which deals with many of the same issues. (Renée is one of the rising stars of cosmology, with a new faculty position after a very prestigious postdoc fellowship, but she too got multiple rejections the first time she applied. So it does happen to the best too, though people rarely tell you that.)

-------

That's it for my 10 tips on applying for (physics) postdocs. They were written primarily as advice I would have liked to have given my former self at the time I was completing my PhD.

There's plenty of other advice available elsewhere on the web, some of it good and some not so good. I personally felt that far too much of it concerned how to choose the best of multiple offers, which is both a bit pointless (if you've got so many offers you'll be fine either way) and really quite far removed from the experience of the vast majority of applicants. I hope some people find this a little more useful.

## September 16, 2015

### Resonaances — What can we learn from LHC Higgs combination

Recently, ATLAS and CMS released the first combination of their Higgs results. Of course, one should not expect any big news here: combination of two datasets that agree very well with the Standard Model predictions has to agree very well with the Standard Model predictions...  However, it is interesting to ask what the new results change at the quantitative level concerning our constraints on Higgs boson couplings to matter.

First, experiments quote the overall signal strength μ, which measures how many Higgs events were detected at the LHC in all possible production and decay channels compared to the expectations in the Standard Model. The latter, by definition, is μ=1. Now, if you had been too impatient to wait for the official combination, you could have made a naive one using the previous ATLAS (μ=1.18±0.14) and CMS (μ=1±0.14) results. Assuming the errors are Gaussian and uncorrelated, one would obtain this way the combined μ=1.09±0.10. Instead, the true number is (drum roll)
So, the official and naive numbers are practically the same.  This result puts important constraints on certain models of new physics. One important corollary is that the Higgs boson branching fraction to invisible (or any undetected exotic) decays is limited as  Br(h → invisible) ≤ 13% at 95% confidence level, assuming the Higgs production is not affected by new physics.
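
The naive number quoted above can be reproduced in a few lines. This is only a sketch of the standard inverse-variance weighted average, under the same assumption of Gaussian, uncorrelated errors; the function name `naive_combination` is made up for illustration, and this is not the procedure used for the official combination.

```python
import math

def naive_combination(measurements):
    """Inverse-variance weighted average of (value, sigma) pairs.

    Hypothetical helper: valid only if the errors are Gaussian
    and uncorrelated, as assumed for the naive combination.
    """
    weights = [1.0 / sigma**2 for _, sigma in measurements]
    total = sum(weights)
    mean = sum(w * value for w, (value, _) in zip(weights, measurements)) / total
    return mean, math.sqrt(1.0 / total)

# Previous ATLAS and CMS signal strengths quoted in the text
mu, err = naive_combination([(1.18, 0.14), (1.00, 0.14)])
print(f"mu = {mu:.2f} +/- {err:.2f}")  # prints "mu = 1.09 +/- 0.10"
```

With equal errors the weights are equal, so the central value is the plain mean and the error shrinks by a factor of √2.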

From the fact that, for the overall signal strength, the naive and official combinations coincide, one should not conclude that the work ATLAS and CMS have done together is useless. As one can see above, the statistical and systematic errors are comparable for that measurement, therefore a naive combination is not guaranteed to work. It happens in this particular case that the multiple nuisance parameters considered in the analysis pull essentially in random directions. But it could well have been different. Indeed, the more one enters into details, the more the impact of the official combination becomes relevant. For the signal strength measured in particular final states of the Higgs decay the differences are more pronounced:
One can see that the naive combination somewhat underestimates the errors. Moreover, for the WW final state the central value is shifted by half a sigma (this is mainly because, in this channel, the individual ATLAS and CMS measurements that go into the combination seem to be different than the previously published ones). The difference is even more clearly visible for 2-dimensional fits, where the Higgs production cross sections via gluon fusion (ggf) and vector boson fusion (vbf) are treated as free parameters. This plot compares the regions preferred at 68% confidence level by the official and naive combinations:
There is a significant shift of the WW and also of the ττ ellipse. All in all, the LHC Higgs combination brings no revolution, but it allows one to obtain more precise and more reliable constraints on some new physics models.  The more detailed information is released, the more useful the combined results become.

### John Preskill — Quantum Shorts 2015: A “flash fiction” competition

A blog on everything quantum is the perfect place to announce the launch of the 2015 Quantum Shorts competition. The contest encourages readers to create quantum-themed “flash fiction”: a short story of no more than 1000 words that is inspired by quantum physics. Scientific American, the longest continuously published magazine in the U.S., Nature, the world’s leading multidisciplinary science journal, and Tor Books, the leading science fiction and fantasy publisher, are media partners for the contest run by the Centre for Quantum Technologies at the National University of Singapore. Entries can be submitted now through 11:59:59 PM ET on December 1, 2015 at http://shorts.quantumlah.org.

“Quantum physics seems to inspire creative minds, so we can’t wait to see what this year’s contest will bring,” says Scientific American Editor in Chief and competition judge Mariette DiChristina.

A panel of judges will select the winners and runners-up in two categories: Open and Youth. The public will also vote and decide the People’s Choice Prize from entries shortlisted across both categories. Winners will receive a trophy, a cash prize and a one-year digital subscription to ScientificAmerican.com. The winner of the Open category will also be featured on ScientificAmerican.com.

The quantum world offers lots of scope for enthralling characters and mind-blowing plot twists, according to Artur Ekert, director of the Centre for Quantum Technologies and co-inventor of quantum cryptography. “A writer has plenty to play with when science allows things to be in two places – or even two universes – at once,” he says. “The result might be funny, tense or even confusing. But it certainly won’t be boring.” Artur is one of the Open category judges.

Another judge is Colin Sullivan, editor of Futures, Nature’s own science-themed fiction strand. “Science fiction is a powerful and innovative genre,” Colin says. “We are excited to see what kinds of stories quantum physics can inspire.”

The 2015 Quantum Shorts contest is also supported by scientific partners around the world. The Institute for Quantum Information and Matter is proud to sponsor this competition, along with our friends at the Centre for Engineered Quantum Systems, an Australian Research Council Centre of Excellence, the Institute for Quantum Computing at the University of Waterloo, and the Joint Quantum Institute, a research partnership between the University of Maryland and the National Institute of Standards and Technology.

Submissions to Quantum Shorts 2015 are limited to 1000 words and can be entered into the Quantum Shorts competition via the website at http://shorts.quantumlah.org, which also features a full set of rules and guidelines.

## September 15, 2015

### John Preskill — Toward physical realizations of thermodynamic resource theories

The thank-you slide of my presentation remained onscreen, and the question-and-answer session had begun. I was presenting a seminar about thermodynamic resource theories (TRTs), models developed by quantum-information theorists for small-scale exchanges of heat and work. The audience consisted of condensed-matter physicists who studied graphene and photonic crystals. I was beginning to regret my topic’s abstractness.

The question-asker pointed at a listener.

“This is an experimentalist,” he continued, “your arch-nemesis. What implications does your theory have for his lab? Does it have any? Why should he care?”

I could have answered better. I apologized that quantum-information theorists, reared on the rarefied air of Dirac bras and kets, had developed TRTs. I recalled the baby steps with which science sometimes migrates from theory to experiment. I could have advocated for bounding, with idealizations, efficiencies achievable in labs. I should have invoked the connections being developed with fluctuation results, statistical mechanical theorems that have withstood experimental tests.

The crowd looked unconvinced, but I scored one point: The experimentalist was not my arch-nemesis.

“My new friend,” I corrected the questioner.

His question has burned in my mind for two years. Experiments have inspired, but not guided, TRTs. TRTs have yet to drive experiments. Can we strengthen the connection between TRTs and the natural world? If so, what tools must resource theorists develop to predict outcomes of experiments? If not, are resource theorists doing physics?

A Q&A more successful than mine.

I explore answers to these questions in a paper released today. Ian Durham and Dean Rickles were kind enough to request a contribution for a book of conference proceedings. The conference, “Information and Interaction: Eddington, Wheeler, and the Limits of Knowledge” took place at the University of Cambridge (including a graveyard thereof), thanks to FQXi (the Foundational Questions Institute).

“Proceedings are a great opportunity to get something off your chest,” John said.

That seminar Q&A had sat on my chest, like a pet cat who half-smothers you while you’re sleeping, for two years. Theorists often justify TRTs with experiments.* Experimentalists, an argument goes, are probing limits of physics. Conventional statistical mechanics describes these regimes poorly. To understand these experiments, and to apply them to technologies, we must explore TRTs.

Does that argument not merit testing? If experimentalists observe the extremes predicted with TRTs, then the justifications for, and the timeliness of, TRT research will grow.

Something to get off your chest. Like the contents of a conference-proceedings paper, according to my advisor.

You’ve read the paper’s introduction, the first eight paragraphs of this blog post. (Who wouldn’t want to begin a paper with a mortifying anecdote?) Later in the paper, I introduce TRTs and their role in one-shot statistical mechanics, the analysis of work, heat, and entropies on small scales. I discuss whether TRTs can be realized and whether physicists should care. I identify eleven opportunities for shifting TRTs toward experiments. Three opportunities concern what merits realizing and how, in principle, we can realize it. Six adjustments to TRTs could improve TRTs’ realism. Two more-out-there opportunities, though less critical to realizations, could diversify the platforms with which we might realize TRTs.

One opportunity is the physical realization of thermal embezzlement. TRTs, like thermodynamic laws, dictate how systems can and cannot evolve. Suppose that a state $R$ cannot transform into a state $S$: $R \not\mapsto S$. An ancilla $C$, called a catalyst, might facilitate the transformation: $R + C \mapsto S + C$. Catalysts act like engines used to extract work from a pair of heat baths.

Engines degrade, so a realistic transformation might yield $S + \tilde{C}$, wherein $\tilde{C}$ resembles $C$. For certain definitions of “resembles,”** TRTs imply, one can extract arbitrary amounts of work by negligibly degrading $C$. Detecting the degradation—the work extraction’s cost—is difficult. Extracting arbitrary amounts of work at a difficult-to-detect cost contradicts the spirit of thermodynamic law.

The spirit, not the letter. Embezzlement seems physically realizable, in principle. Detecting embezzlement could push experimentalists’ abilities to distinguish between close-together states $C$ and $\tilde{C}$. I hope that that challenge, and the chance to violate the spirit of thermodynamic law, attracts researchers. Alternatively, theorists could redefine “resembles” so that $C$ doesn’t rub the law the wrong way.

The paper’s broadness evokes a caveat of Arthur Eddington’s. In 1927, Eddington presented Gifford Lectures entitled The Nature of the Physical World. Being a physicist, he admitted, “I have much to fear from the expert philosophical critic.” Specializing in TRTs, I have much to fear from the expert experimental critic. The paper is intended to point out, and to initiate responses to, the lack of physical realizations of TRTs. Some concerns are practical; some, philosophical. I expect and hope that the discussion will continue…preferably with more cooperation and charity than during that Q&A.

If you want to continue the discussion, drop me a line.

*So do theorists-in-training. I have.

**A definition that involves the trace distance.
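
For reference, the trace distance mentioned in this footnote has a standard definition, sketched here for two density operators. The symbols $\rho$, $\sigma$ and the threshold $\epsilon$ are notation assumed for illustration, not taken from the paper, and the paper's exact criterion for "resembles" may differ in detail.

$$D(\rho, \sigma) = \frac{1}{2} \mathrm{Tr}\,\lvert \rho - \sigma \rvert = \frac{1}{2} \mathrm{Tr} \sqrt{(\rho - \sigma)^\dagger (\rho - \sigma)}$$

Under that assumption, "$\tilde{C}$ resembles $C$" would read $D(C, \tilde{C}) \leq \epsilon$ for some small $\epsilon > 0$.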