Planet Musings

December 11, 2019

Jordan EllenbergThe real diehard

Yesterday a guy saw my Orioles hat and said, “Wow, you’re a real diehard.”

He was wearing a Hartford Whalers hat.

n-Category Café Random Permutations (Part 8)

Last time we saw a strong connection between random permutations and Poisson distributions. Thinking about this pulled me into questions about partitions and the moments of Poisson distributions. These may be a bit of a distraction — but maybe not, since every permutation gives a partition, namely the partition into cycles.

Here’s a good puzzle to start with:

Puzzle 11. Suppose raindrops are falling on your head, randomly and independently, at an average rate of one per minute. What’s the average of the cube of the number of raindrops that fall on your head in one minute?

To get started we need a bit of probability theory. Suppose random events are happening at an average rate of one per minute, in a nice independent way so that the history so far doesn’t affect the rate of new events now. Let P(k,t)P(k,t) be the probability that kk events have happened between time 00 and time tt. Then:

ddtP(k,t)=P(k1,t)P(k,t) \frac{d}{d t} P(k,t) = P(k-1,t) - P(k,t)

since if k1k-1 events have happened an extra event will bring the total to kk, while if kk events have already happened an extra one will make the total stop being kk.

Now, it takes a bit of work to figure out P(k,t)P(k,t) given this equation and the obvious initial conditions

P(k,0)={1 if k=0 0 if k>0P(k,0) = \left\{\begin{array}{ccl} 1 &\mathrm{if}& k = 0 \\ 0 &\mathrm{if}& k \gt 0 \end{array}\right.

But if someone tells you the answer, you just need elementary calculus to check that it’s right:

P(k,t)=t kk!e t P(k,t) = \frac{t^k}{k!} e^{-t}

So, let’s be happy someone has already figured out the answer!

Back to our raindrop problem. Here the time tt is 1 minute, so the probability of kk raindrops landing on your head in this time is

1k!e 1 \frac{1}{k!} e^{-1}

Sanity check: these probabilities sum to 1.

We’re trying to figure out the expected value of the cube of the number of raindrops that land on your head in a minute. We now know this is

1e k=0 k 3k! \frac{1}{e} \sum_{k=0}^\infty \frac{k^3}{k!}

So this is the sum we need to do. But being mathematicians, let’s generalize. Let’s figure out the expected value of the nnth power of the number of raindrops that land on your head in a minute:

1e k=0 k nk! \frac{1}{e} \sum_{k=0}^\infty \frac{k^n}{k!}

The sum here looks like a mutant version of

k=0 n kk!=e n \sum_{k=0}^\infty \frac{n^k}{k!} = e^n

But I mention this only so you don’t get confused and take a wrong turn!

To compute

1e k=0 k nk! \frac{1}{e} \sum_{k=0}^\infty \frac{k^n}{k!}

it’s nice to express the powers k nk^n in terms of ‘falling powers’:

k n̲=k(k1)(k2)(kn+1) k^{\underline{n}} = k (k-1) (k-2) \cdots (k - n + 1)

The power k nk^n counts the number of functions from nn to kk, where now I’m identifying a natural number with a set having that number of elements. Similarly, the falling power k n̲k^{\underline{n}} counts the number of injections from nn to kk. Why? With an injection, the first element of nn has kk choices of where to go, but the next one has only k1k-1, and so on.

There’s a nice formula expressing ordinary powers as linear combinations of falling powers:

k n= i=0 k{n i}k i̲ k^n = \sum_{i = 0}^k \left\{ \begin{array}{c} n \\ i \end{array} \right\} \; k^{\underline{i}}

Here {n i}\left\{ \begin{array}{c} n \\ i \end{array} \right\} is just the number of partitions of nn into ii parts.

This formula has a wonderfully slick proof. Any function f:nkf \colon n \to k factors as a surjection followed by an injection:

nfimfk n \stackrel{f}{\to} \mathrm{im} f \hookrightarrow k

and the surjection gives a partition of nn into ii parts, where ii is the number of points in the image imf\mathrm{im} f. Conversely a partition of nn into ii parts together with an injection iki \hookrightarrow k determines a function f:nkf \colon n \to k. So, the number of functions f:nkf \colon n \to k is the sum over 0in0 \le i \le n of the number of partitions of nn into ii parts times the number of injections iki \hookrightarrow k. And that’s all the formula says!

So, the expected value of the nnth power of the number of raindrops that fall on your head is

1e k=0 k nk!=1e k=0 i=0 k{n i}k i̲k! \frac{1}{e} \sum_{k=0}^\infty \frac{k^n}{k!} \; = \; \frac{1}{e} \sum_{k=0}^\infty \sum_{i = 0}^k \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, \frac{k^{\underline{i}}}{k!}

But in fact we can simplify this! Since there are no injections iki \hookrightarrow k when i>ki \gt k, the falling power vanishes in this case, so our sum is just

1e i,k=0 {n i}k i̲k! \frac{1}{e} \sum_{i,k=0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, \frac{k^{\underline{i}}}{k!}

And now we can do the sum over kk first. Since

k=0 k i̲k!= k=0 k(k1)(ki+1)k(k1)1= k=i 1(ki)!= j=0 1j!=e \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} \; = \; \sum_{k = 0}^\infty \frac{k(k-1)\cdots (k-i+1)}{k(k-1) \cdots 1} \; = \; \sum_{k = i}^\infty \frac{1}{(k-i)!} \; = \; \sum_{j = 0}^\infty \frac{1}{j!} \; = \; e

we get

1e i,k=0 {n i}k i̲k!= i=0 {n i} \frac{1}{e} \sum_{i,k=0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, \frac{k^{\underline{i}}}{k!} \; = \; \sum_{i = 0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\}

which is the total number of partitions of an nn-element set!

In our original question we were interested in n=3n = 3. There are 5 partitions of a 3-element set:

{{1,2,3}}, \{\{1,2,3\}\}, {{1,2},{3}},{{2,3},{1}},{{3,1},{2}}, \{\{1,2\}, \{3\}\}, \; \{\{2,3\}, \{1\}\}, \; \{\{3,1\}, \{2\}\}, {{1},{2},{3}} \{\{1\}, \{2\}, \{3\}\}

So, the expected value of the cube of the number of raindrops that land on your head in a minute is 5.

Now that we’re done, let’s reflect on what happened, and introduce more jargon so you can read more about the concepts involved.

The function

P(k,t)=t kk!e t P(k,t) = \frac{t^k}{k!} e^{-t}

is called a Poisson distribution. It’s the answer whenever you want to know the probability of a given number kk of events occurring in some amount of time tt if these events occur with a constant average rate, set here to 11 for convenience, independently of the time since the last event.

The parameter tt is the mean of the Poisson distribution. We worked out the nnth moment of the Poisson distribution with mean 11. In other words, we worked out the expected value of k nk^n in this case:

k=0 k nP(k,1)=1e k=0 k nk! \sum_{k = 0}^\infty k^n P(k,1) \; = \; \frac{1}{e} \sum_{k = 0}^\infty \frac{k^n}{k!}

We got the number of partitions of an nn-element set, which is called the Bell number B nB_n, after the author of a fantasy novel about a planet where all mathematics is done by men.

This result is called Dobiński’s formula:

1e k=0 k nk!=B n \frac{1}{e} \sum_{k = 0}^\infty \frac{k^n}{k!} = B_n

Dobiński published a paper about it in 1877, but it seems to me that he only proved some special cases:

The number of partitions of an nn-element set into ii parts, {n i}\left\{ \begin{array}{c} n \\ i \end{array} \right\} , is called a Stirling number of the second kind. Given the definitions it’s obvious that

i=0 {n i}=B n \sum_{i = 0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\} = B_n

(We can stop the sum at i=ni = n without changing its value.)

The identity

k n= i=0 k{n i}k i̲ k^n = \sum_{i = 0}^k \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, k^{\underline{i}}

is more interesting, since it connects Stirling numbers of the second kind to falling powers, which are also known as falling factorials, falling sequential products, and various other things. The proof I gave comes from here:

Category theorists will say it relies on the uniqueness of the epi-mono factorization of a function between finite sets. Rota doesn’t use these terms, but he talks about the ‘kernel’ of a function, which is his name for the epi in this factorization.

In this paper Rota goes ahead and proves Dobiński’s formula, but he does not explicitly discuss its connection to the Poisson distribution. Someone on Wikipedia translated his argument into the language of probability, which I like. Unfortunately the Wikipedia article merely reduces Dobiński’s formula to the identity

1e k=0 k i̲k!=1 \frac{1}{e} \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} = 1

and stops there, remarking that the kkth factorial moment of the Poisson distribution of mean 11 equals 11. Factorial moments are like ordinary moments but with the power k ik^i replaced by the falling power k i̲k^{\underline{i}}.

I was frustrated to see the argument fizzle out before it ended. Then I saw why

k=0 k i̲k!=e \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} = e

It follows from some facts about species and groupoid cardinality.

If you have a species, namely a functor FF from the groupoid of finite sets to the category of sets, you can think of it as describing a type of structure you can put on finite sets. For each natural number kk, F(k)F(k) is the set of structures of this type that you can put on a kk-element set. The species has a generating function

f(x)= k=0 |F(k)|k!x k f(x) = \sum_{k = 0}^\infty \frac{|F(k)|}{k!} x^k

and the number f(1)f(1) has a nice meaning if the sum involved is well-defined. The category of elements F\int F has finite sets equipped with the given structure as objects and bijections preserving the structure as morphisms. This category F\int F is a groupoid, and the number f(1)f(1) is just the groupoid cardinality of F\int F.

Now fix a finite set ii. There’s a species FF that assigns to each finite set kk the set of all injections iki \hookrightarrow k. The generating function of this species is

f(x)= k=0 k i̲k!x k f(x) = \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} x^k


f(1)= k=0 k i̲k! f(1) = \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!}

is the cardinality of the groupoid F\int F whose objects are finite sets equipped with an inclusion of the set ii. But this groupoid is equivalent to the groupoid of finite sets and bijections! Why? Because the morphisms in F\int F are bijections that preserve the inclusion of the set ii. Thus, points in the image of this inclusion have no ‘freedom of movement’, and our morphism boils down to an arbitrary bijection between the remaining points.

Since F\int F is equivalent to the groupoid of finite sets, it must have the same cardinality, namely

k=0 1k!=e \sum_{k = 0}^\infty \frac{1}{k!} = e

So, we get

k=0 k i̲k!=e \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} = e

Later, Akiva Weinberger pointed out to me that this identity can be shown by a simple reindexing:

k=0 k i̲k!= k=i 1(ki)!= j=0 1j!=e \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} \; = \; \sum_{k = i}^\infty \frac{1}{(k-i)!} \; = \; \sum_{j = 0}^\infty \frac{1}{j!} \; = \; e

This is precisely a decategorified version of my observation that F\int F is equivalent to the groupoid of finite sets.

By the way, Dobiński’s formula, written this way:

k=0 k nk!=eB n \sum_{k = 0}^\infty \frac{k^n}{k!} = e B_n

says the cardinality of the groupoid of finite sets equipped with a function from a fixed finite set is ee times the number of partitions of that set. After a while this starts to seem obvious, if you keep thinking about epi-mono factorizations.

So groupoid cardinality establishes a nice connection between Poisson distributions, partitions, and injections. I doubt I’ve gotten to the bottom of it, since I haven’t really seen why the combinatorics is connected to Poisson processes. I’m also not sure how much all this will help me with permutations. But it was fun.

December 10, 2019

Terence TaoMaryam Mirzakhani New Frontiers Prize

Just a short post to announce that nominations are now open for the Maryam Mirzakhani New Frontiers Prize, which is a newly announced annual $50,000 award from the Breakthrough Prize Foundation presented to early-career, women mathematicians who have completed their PhDs within the past two years, and recognizes outstanding research achievement.  (I will be serving on the prize committee.)  Nominations for this (and other breakthrough prizes) can be made at this page.

Jordan Ellenberg"Hamilton"

We saw the last show of the touring company’s visit to Madison. The kids have played the record hundreds of times so I know the songs very well. But there’s a lot you get from seeing the songs realized by actors in physical space.

  • I had imagined King George as a character in the plot interacting with the rest of the cast; but in the show, he’s a kind of god/chorus floating above the action, seeing certain things clearly that the people in the thick of it can’t. So his famous line, “I will kill your friends and family to remind you of my love,” comes off in person as less menacing, more cosmic. Neil Haskell played the role very, very, very mincy, which I think was a mistake, but it got laughs.
  • On the other hand, I hadn’t grasped from the songs how big a role George Washington plays. It’s set up very nicely, with the relation between Hamilton and the two Schuyler sisters presented as a shadow of the much more robust and fully felt love triangle between Hamilton, Burr, and Washington.
  • The biggest thing I hadn’t understood from the record was the show’s gentle insistence, built up slowly and unavoidably over the whole of the night, that the winner of a duel is the one who gets shot.

December 09, 2019

n-Category Café A Visual Telling of Joyal's Proof Of Cayley's Formula

Cayley’s formula says how many trees there are on a given set of vertices. For a set with nn elements, there are n n2n^{n - 2} of them. In 1981, André Joyal published a wonderful new proof of this fact, which, although completely elementary and explicit, seems to have been one of the seeds for his theory of combinatorial species.

All I’m going to do in this post is explain Joyal’s proof, with lots of pictures. In a later post, I’ll explain the sense in which Cayley’s formula is the set-theoretic analogue of the linear-algebraic fact I wrote about last time: that for a random linear operator on a finite vector space VV, the probability that it’s nilpotent is 1/#V1/\#V. And I’ll also give a new (?) proof of that fact, imitating Joyal’s. But that’s for another day.

Joyal’s proof was published here:

André Joyal, Une théorie combinatoire des séries formelles. Advances in Mathematics 42 (1981), 1–82.

(I try not to link to Elsevier, as I’m permanently furious with them for draining universities of an obscene amount of our precious funds. But this is a free download. To balance my karma, let me remind you that many paywalled papers can be found on Sci-Hub for free. The most effective way to use Sci-Hub is to put the DOI of the paper you’re seeking into the search box; the text search function doesn’t seem to work very well.)

Doron Zeilberger stated Joyal’s argument concisely in one of his infamous opinions (number 108):

Going back to “beauty-bare”, nothing in set theory rivals the beauty of André Joyal’s lovely proof of Arthur Cayley’s theorem that the number of labeled trees, T nT_n, on nn vertices equals n n2n^{n-2}. It is so beautiful that I can say it in words.

Let’s prove instead n 2T n=n nn^2 T_n = n^n .

The left side counts doubly-rooted trees, which are labeled trees with a directed path (possibly of one vertex) of labeled vertices with some trees (possibly trivial one-vertex ones) hanging from them. On the other hand the right side counts the number of functions, ff, from {1,,n}\{1,\ldots,n\} into {1,,n}\{1,\ldots,n\}. If you draw the directed graph of such a function joining vertex ii to vertex f(i)f(i), you would get a collection of cycles with some trees (possibly trivial one-vertex ones) hanging from them. So the left-side counts “lines of numbers” with hanging trees, and the right side counts “collection of cycles” with hanging trees. But “lines of numbers” are permutations in one-line notation, and “collection of cycles” are permutations in cycle-notation. QED!

If that’s crystal clear to you, great! You can stop reading here. But I wanted to state the argument in a more leisurely way, with pictures.

Let XX be a finite set. A tree on XX is an undirected graph with vertex-set XX, such that there’s a unique non-backtracking path between any two vertices. Here’s a typical tree on a 22-element set XX:

Step one

Category theorists and algebraists often use “tree” to mean rooted tree: one vertex is distinguished as special and conventionally placed at the bottom. But in the sense I’ll use the word here, trees don’t come with a distinguished vertex.

Cayley’s formula says that the number of trees on XX is n n2n^{n - 2}, where nn is the cardinality of XX. Joyal’s proof goes like this.

An equivalent statement is that the number of bipointed trees is n nn^n, where “bipointed” means that the tree comes equipped with an ordered pair of distinguished vertices (which are allowed to be the same). Here’s a bipointed tree on our 22-element set XX, with the first distinguished vertex towards the left in brown and the second towards the right in pink (both circled):

Step two

Joyal used the fabulous word vertebrate to mean bipointed tree, for reasons we’ll see in a moment. Since n nn^n is the number of functions from XX to itself, Cayley’s formula says:

the number of vertebrates on XX is equal to the number of endofunctions of XX.

That’s what we’ll show.

First note that by definition of tree, for any vertebrate on XX there’s a unique non-backtracking path from the first distinguished vertex to the second, here marked in blue:

Step three

You can look at the tree as the skeleton of some fantastical creature, with the blue path as the backbone, and the vertices on it as vertebrae, with the first (brown) distinguished vertex at the head end and the second (pink) one at the tail end. Hanging off each vertebra is a rooted tree, the root being that vertebra. If you let your imagination roam free, you can picture each tree as a rather ornate limb. The most similar real-life creature I could come up with is the amazing leafy seadragon (whose existence I think I learned of from David Roberts):

Leafy seadragon

Anyway, the blue path from the first distinguished vertex to the second determines a total order on the set of vertebrae. Once you’ve got that order, you might as well erase the edges between them, as they’re determined by the order. So a vertebrate amounts to a diagram like this:

Step four

So far everything has been canonical: no arbitrary choices. The set of vertebrates on XX is in canonical one-to-one correspondence with diagrams of this type. But now we’re going to break that pattern, using an elementary observation:

the number of total orders on a finite set SS is equal to the number of permutations of SS.

Of course, both are (#S)!(\# S)!. There’s no canonical bijection between orders on and permutations of an abstract set SS — for if there were, which order would correspond to the identity permutation? But that’s OK: all we need here is that there’s some bijection.

So, let’s arbitrarily choose, for each SXS \subseteq X, a bijection between orders on SS and permutations of SS. This done, we can replace the ordering of our five-element subset with a permutation of it. Which particular permutation it is depends on that arbitrary choice of bijection, but let’s say for sake of argument that it’s this one:

Step five

Hanging off each vertebra (yellow vertex) is a rooted tree. It makes no difference if we adorn its edges with arrows directed towards the root:

Step six

(There’s no choice involved in doing this.) Now what we’ve got is a directed graph with XX as its vertex-set, and with the property that each vertex is the source (tail) of exactly one arrow. This is nothing but a function f:XXf: X \to X! We seem to have also chosen a distinguished set of vertices, the yellow ones. But actually, they’re determined by ff: they’re exactly the periodic points. So we lose no information if we throw away the colouring:

Step seven

So what we have is precisely an endofunction of XX.

We’ve now shown that the vertebrates (bipointed trees) on XX are in bijection with the endofunctions of XX. Since there are n nn^n such functions, where n=#Xn = \# X, it follows that there are n nn^n vertebrates on XX, and therefore n n2n^{n - 2} unrooted trees on XX. And that’s Joyal’s proof of Cayley’s theorem.

Of course, the argument I’ve given is pictorial and not entirely rigorous. I didn’t even define “tree” rigorously. Probably the best way to make it rigorous is via species, the concept that Joyal pioneered in that same paper.

I’m not going to give a full introduction to species here, but let me just say briefly how it works. A species is a functor from the category of finite sets and bijections to the category of sets. For example, there’s a species OrdOrd that assigns to each finite set the set of total orders on it. There are several ways of putting two species together to make a third, and in particular, species can be “composed”. We’ve essentially shown that

Tree **OrdTree *, Tree_{\ast\ast} \cong Ord \circ Tree_{\ast},

where Tree **Tree_{\ast\ast} is the species of vertebrates (bipointed trees), Tree *Tree_{\ast} is the species of rooted (pointed) trees, and \circ is composition. This isomorphism expresses the fact explained in the first few pictures above: that a vertebrate is the same thing as a collection of rooted trees, with a total order on them (1–5 in the example shown). On the other hand, we also have

EndPermTree *, End \cong Perm \circ Tree_{\ast},

where EndEnd is the species of endofunctions and PermPerm is the species of permutations. This isomorphism expresses the fact explained in the last few pictures above: that an endofunction is the same thing as a collection of rooted trees, with a permutation of them.

Now OrdOrd and PermPerm are not isomorphic species; that is, as functors they are not naturally isomorphic. So we’re not going to conclude that Tree **Tree_{\ast\ast} and EndEnd are isomorphic species either. However, OrdOrd and PermPerm are pointwise isomorphic, in the sense that Ord(X)Ord(X) and Perm(X)Perm(X) have the same number of elements for each finite set XX (namely, (#X)!(\# X)!). It follows that Tree **Tree_{\ast\ast} and EndEnd are pointwise isomorphic too. In other words: on any finite set, there are the same number of vertebrates and endofunctions. That’s Cayley’s theorem.

December 08, 2019

Doug NatelsonBrief items

Here are some tidbits that came across my eyeballs this past week:

  • I just ran into this article from early in 2019.  It touches on my discussion about liquids, and is a great example of a recurring theme in condensed matter physics.  The authors look at the vibrational excitations of liquid droplets on surfaces.  As happens over and over in physics, the imposition of boundary conditions on the liquid motion (e.g., wetting conditions on the surface and approximately incompressible liquid with a certain surface tension) leads to quantization of the allowed vibrations.  Discrete frequencies/mode shapes/energies are picked out due to those constraints, leading to a "periodic table" of droplet vibrations.  (This one looks moderately like atomic states, because spherical harmonics show up in the mode description, as they do when looking at atomic orbitals.)
  • Another article from the past, this one from 2014 in the IEEE Spectrum.  It talks about how we arrived at the modern form for Maxwell's equations.  Definitely a good read for those interested in the history of physics.  Maxwell's theory was developing in parallel with what became vector calculus, and Maxwell's original description (like Faraday's intuition) was very mechanistic rather than abstract.
  • Along those lines, this preprint came out recently promoting a graphical pedagogical approach to vector calculus.  The spirit at work here is that Feynman's graphical diagrammatic methods were a great way to teach people perturbative quantum field theory, and do perhaps a diagrammatic scheme for vector calc could be good.  I'm a bit of a skeptic - I found the approach by Purcell to be very physical and intuitive, and this doesn't look simpler to me.
  • This preprint about twisted bilayer graphene and the relationship between superconductivity and strongly insulating states caught my eye, and I need to read it carefully.  The short version:  While phase diagrams showing superconductivity and insulating states as a function of carrier density make it tempting to think that SC evolves out of the insulating states via doping (as likely in the cuprates), the situation may be more complicated.

n-Category Café Random Permutations (Part 7)

Last time I computed the expected number of cycles of length kk in a random permutation of an nn-element set. The answer is wonderfully simple: 1/k1/k for all k=1,,nk = 1, \dots, n.

As Mark Meckes pointed out, this fact is just a shadow of a more wonderful fact. Let C kC_k be the number of cycles of length kk in a random permutation of an nn-element set, thought of as a random variable. Then as nn \to \infty, the distribution of C kC_k approaches a Poisson distribution with mean 1/k1/k.

But this is just a shadow of an even more wonderful fact!

First, if we fix some number b1b \ge 1, and look at all the random variables C 1,,C bC_1, \dots, C_b, they converge to independent Poisson distributions as nn \to \infty.

In fact we can even let bb \to \infty as nn \to \infty, so long as nn outstrips bb, in the following sense: b/n0b/n \to 0. The random variables C 1,,C bC_1, \dots , C_b will still converge to independent Poisson distributions.

These facts are made precise and proved here:

In the proof they drop another little bombshell. Suppose jkj \ne k. Then the random variables C jC_j and C kC_k become uncorrelated as soon as nn gets big enough for a permutation of an nn-element set to have both a jj-cycle and a kk-cycle. That is, their expected values obey

E(C jC k)=E(C j)E(C k) E(C_j C_k) = E(C_j) E(C_k)

whenever nj+kn \ge j + k. Of course when n<j+kn \lt j + k it’s impossible to have both a cycle of length jj and one of length kk, so in that case we have

E(C jC k)=0 E(C_j C_k) = 0

It’s exciting to me how the expected value E(C jC k)E(C_j C_k) pops suddenly from 00 to E(C j)E(C k)=1/jkE(C_j) E(C_k) = 1/jk as soon as our set gets big enough to allow both a jj-cycle and a kk-cycle. It says that in some sense the presence of a large cycle in a random permutation does not discourage the existence of other large cycles… unless it absolutely forbids it!

This is a higher-level version of a phenomenon we’ve already seen: E(C k)E(C_k) jumps from 00 to 1/k1/k as soon as nkn \ge k.

More generally, Arratia and Tavaré show that

E(C j 1C j m)=E(C j 1)E(C j m) E(C_{j_1} \, \cdots \, C_{j_m}) = E(C_{j_1}) \, \cdots \,E(C_{j_m})

whenever nj 1++j mn \ge j_1 + \cdots + j_m. But on the other hand, clearly

E(C j 1C j m)=0 E(C_{j_1} \, \cdots \, C_{j_m}) = 0

whenever n<j 1++j mn \lt j_1 + \cdots + j_m, at least if the numbers j ij_i are distinct. After all, it’s impossible for a permutation of a finite set to have cycles of different lengths that sum to more than the size of that set.

I’m trying to understand the nature of random permutations. These results help a lot. But instead of describing how they’re proved, I want to conclude with a picture that has also helped me a lot:

One can argue that Flajolet and Sedgewick should have used circumferences rather than areas to indicate cycle lengths: after all, the circumference is a kind of ‘cycle length’. That would have made the large cycles seem even larger compared to the puny ones.

Of course, experts on the visual display of quantitative information have a lot of opinions on these issues. But never mind! Let’s extract some morals from this chart.

We can see that a random permutation of a large set has rather few cycles on average. To be precise, the expected number of cycles in a random permutation of an nn-element set is lnn\sim \ln n.

We can also see that the lengths of the cycles range dramatically from tiny to large. To be precise, suppose 1mmn1 \le m \le m' \le n. Then since the expected number of cycles of length kk is 1/k1/k, the expected number of cycles of length between mm and mm' is close to

lnmlnm \ln m' \; - \; \ln m

Thus we can expect as many cycles of length between 11 and 1010 as between 1010 and 100100, or between 100100 and 10001000, etc.

So, I’m now imagining a random permutation of a large set as a way of chopping up a mountain into rocks with a wild range of sizes: some sand grains, some pebbles, some fist-sized rocks and some enormous boulders. Moreover, as long as two different sizes don’t add up to more than the size of our mountain, the number of rocks of those two sizes are uncorrelated.

I want to keep improving this intuition. For example, if jkj \ne k and nj+kn \ge j + k, are the random variables C jC_j and C kC_k independent, or just uncorrelated?

Tommaso DorigoArt And Science In Venice

Last month the Museum of Natural History of Venice hosted, in the last room of the exhibit called "room of the cetaceans" (where a large skeleton of a whale hangs from the ceiling), an exhibit of artwork produced by high-school students from the Venice area. The event, which belongs to the "Art and Science across Italy" project, was the culminating point of a series of lectures on particle physics, on science in art, and related topics which involved the students and INFN personnel from the Padova section.

read more

December 07, 2019

David Hoggmy definition of an adversarial attack

Based on conversations with Soledad Villar, Teresa Huang, Zach Martin, Greg Scanlon, and Eva Wang (all NYU), I worked today on establishing criteria for a successful adversarial attack against a regression in the natural sciences (like astronomy). The idea is you add a small, irrelevant amount u to your data x and it changes the labels y by an unexpectedly large amount. Or, to be more specific:

  • The L2 norm (u.u) of the vector u should be equal to a small number Q
  • The vector u should be orthogonal to your expectation v of the gradient of the function dy/dx
  • The change in the inferred labels at x+u relative to x should be much larger than you would get for the same-length move in the v direction!
The first criterion is that the change is small. The second is that it is irrelevant. The third is that it produces a big change in the regression's output. One issue is that you can only execute this when you have v, or an expectation for dy/dx independent of your regression model. That's true in some contexts (like spectroscopic parameter estimation) but not others.

David Hoggproducts of Gaussians, Joker

As my loyal reader knows, I love that products of Gaussians are themselves Gaussians! The result is that there are many factorizations of a Gaussian into many different Gaussian products. As my loyal reader also knows, Adrian Price-Whelan (Flatiron) and I found a bug in our code The Joker which fits radial-velocity data with Keplerian orbital models; this bug is related to the fundamental factorization of Gaussians that underlies the method. Today Price-Whelan showed me results from the fixed code, and we discussed them (and the priors we are using in our marginalization), along with the paper we are writing about the factorization. Yes, people, this is my MO: When you have a big bug—or really a big think-o or conceptual error—don't just fix it, write a paper about it! That's the origin of my paper on the K-correction. We are also contemplating writing a note about how you can constrain time-domain signals with periods longer than the interval over which you are observing them!

Jordan EllenbergThe Americans can retire at 42

On the front page of my New York Times today, this capsule summary:

The French can retire at 62. Or 52. Sometimes 42. President Emmanuel Macron calls the tangle unsustainable. A million protesters disagree.

In the actual article, we learn that the retirement age of 42 applies to one group of workers; dancers in the national ballet. I find it very annoying when an article is teased with a number presented as normal when it’s actually extremely atypical. You could write the same teaser about the United States, having New York City firefighters in mind. But you would be misleading your audience even though the claim would be, I suppose, technically correct.

David Hoggastronomical adversaries: we are go!

The NYU Center for Data Science has a big masters program, and as a part of that, students do capstone research projects with industry and academic partners. I am one of the latter! So Soledad Villar (NYU) and I have been advising two groups of capstone students in projects. One of these groups (Teresa Ningyuan Huang, Zach Martin, Greg Scanlon, and Eva Shuyu Wang) has been working on adversarial attacks against regressions in astronomy. The work is new in part because it brings the idea of attacks to the natural science domain, and because attacks haven't been really defined for regression contexts. Today we decided that this work is ready (enough) to publish. So we are going to try to finish and submit something for a conference deadline this week!

David Hoggdisk and black holes

Today was the last day of the visit by Christina Eilers (MIT). We decided that we have a clear scope for our Milky Way disk paper, and we have a full outline and figures, so it was a success. But we don't have a good picture of our black-hole and quasar projects, which include finding a data-driven model that generates quasar properties given a black-hole mass (and possibly many latent parameters). The data are messy! And we don't know what to believe. To be continued.

David HoggLFC calibration accuracy; stars form in lines?

Lily Zhao (Yale) and I got to the point that we can calibrate one laser-frequency-comb exposure on EXPRES with a calibration based on all the LFC exposures. We find that we predict the lines in each exposure to a mean (over all lines) of about 3 cm/s! If this holds up to cross-validation, I am happy!

In Stars & Exoplanets Meeting at Flatiron, many fun things happened. Zhao showed us temperature variations observed spectroscopically in Sun-like stars: Are these due to spots? And Marina Kounkel (WWU) showed us incredible visualizations of young stars, showing that they are very clustered in kinematics and age. She believes that they form in lines, and then the lines break up. Her interactive plots were amazing!

December 06, 2019

Matt von HippelLife Cycle of an Academic Scientist

So you want to do science for a living. Some scientists work for companies, developing new products. Some work for governments. But if you want to do “pure science”, science just to learn about the world, then you’ll likely work at a university, as part of what we call academia.

The first step towards academia is graduate school. In the US, this means getting a PhD.

(Master’s degrees, at least in the US, have a different purpose. Most are “terminal Master’s”, designed to be your last degree. With a terminal Master’s, you can be a technician in a lab, but you won’t get farther down this path. In the US you don’t need a Master’s before you apply for a PhD program, and having one is usually a waste of time: PhD programs will make you re-take most of the same classes.)

Once you have a PhD, it’s time to get a job! Often, your first job after graduate school is a postdoc. Postdocs are short-term jobs, usually one to three years long. Some people are lucky enough to go to the next stage quickly, others have more postdoc jobs first. These jobs will take you all over the world, everywhere people with your specialty work. Sometimes these jobs involve teaching, but more often you just do scientific research.

In the US system, If everything goes well, eventually you get a tenure-track job. These jobs involve both teaching and research. You get to train PhD students, hire postdocs, and in general start acting like a proper professor. This stage lasts around seven years, while the university evaluates you. If they decide you’re not worth it then typically you’ll have to leave to apply for another job in another university. If they like you though, you get tenure.

Tenure is the first time as an academic scientist that you aren’t on a short-term contract. Your job is more permanent than most, you have extra protection from being fired that most people don’t. While you can’t just let everything slide, you have freedom to make more of your own decisions.

A tenured job can last until retirement, when you become an emeritus professor. Emeritus professors are retired but still do some of the work they did as professors. They’re paid out of their pension instead of a university salary, but they still sometimes teach or do research, and they usually still have an office. The university can hire someone new, and the cycle continues.

This isn’t the only path scientists take. Some work in a national lab instead. These don’t usually involve teaching duties, and the path to a permanent job is a bit different. Some get teaching jobs instead of research professorships. These teaching jobs are usually not permanent, instead universities are hiring more and more adjunct faculty who have to string together temporary contracts to make a precarious living.

I’ve mostly focused on the US system here. Europe is a bit different: Master’s degrees are a real part of the system, tenure-track doesn’t really exist, and adjunct faculty don’t always either. Some particular countries, like Germany, have their own quite complicated systems, other countries fall in between.

n-Category Café A Couple of Talks in Vienna

I’m in Vienna to give a couple of talks, one yesterday and one later this afternoon. Draft slides of both talks are on this page.

The first sees me take up again the work of Michael Friedman, in particular his Dynamics of Reason. I see that I was lamenting the difficulty of getting my original article published back in 2007. (But then I was still a couple of months away from finding out that I had a permanent job after a very long search, and so evidently fed up.)

In any case, I think the story I’m telling is improving with age - we now appear to have modal homotopy type theory and M-theory in harmony. But irrespective of whether Urs Schreiber’s Hypothesis H bears fruit, there’s still an important story to tell about mathematics modifying its framework.

The second talk introduces ideas from my forthcoming Modal homotopy type theory: The prospect of a new logic for philosophy.

December 05, 2019

n-Category Café Position at U. C. Riverside

The Department of Mathematics at the University of California, Riverside invites application to a faculty position at the full professor level in the general area of pure mathematics. The successful candidate is expected to fill the F. Burton Jones Chair in Pure Mathematics. The search seeks candidates with national and international recognition, an outstanding research track record and outlook, a strong record in mentoring and teaching, and leadership in promoting diversity and serving the professional community.

A PhD in mathematics is required. It is expected that at UCR, the candidate will lead a rigorous research program, will develop and teach graduate and undergraduate courses, will advise and mentor graduate and undergraduate students, will vigorously promote diversity, and will engage in both campus and professional service activities.

A start-up package will be provided. The F. Burton Jones endowment fund will be made available to support the candidate’s research and mentoring activities. This position will start between July 1, 2020 and July 1, 2021.

Interested individuals should submit a cover letter, a full curriculum vita, a statement of research, a statement of teaching, official teaching evaluations and a statement on contribution to promoting diversity. In addition, applicants should provide contact names and addresses for four or five references on the candidate’s research, and one reference on the candidate’s teaching and mentoring. All application materials must be submitted through AP Recruit online at:

Review of applications will commence on May 1, 2020, and proceed until position is filled. For full consideration, applicants should submit their complete applications by April 30, 2020.

The University of California is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status, or any other characteristic protected by law.

UCR is a world-class research university with an exceptionally diverse undergraduate student body. Its mission is explicitly linked to providing routes to educational success for underrepresented and first-generation college students. A commitment to this mission is a preferred qualification.

Advancement through the Professor ranks at the University of California is through a series of structured, merit-based evaluations, occurring every 3-5 years, each of which includes substantial peer input.


Document requirements:

Cover Letter

Curriculum Vitae - Your most recently updated C.V.

Statement of Research

Statement of Teaching

Official Teaching Evaluation

Statement of Past and/or Planned Future Contributions to Advancing Diversity and Inclusive Excellence - In a “Statement of Contributions to Diversity,” we ask applicants to describe their past and potential future contributions to promoting a diverse, equitable, and inclusive environment, which is a key requirement of the role of every academic member at UCR. There are numerous ways to contribute, and a commitment to this part of our mission can be reflected through research, teaching, supervision, mentoring, community engagement, service, and any of the other varied activities that are a part of an academic career.

Reference requirements

5-6 required (contact information only)

December 04, 2019

Terence TaoAn uncountable Moore-Schmidt theorem

Asgar Jamneshan and I have just uploaded to the arXiv our paper “An uncountable Moore-Schmidt theorem“. This paper revisits a classical theorem of Moore and Schmidt in measurable cohomology of measure-preserving systems. To state the theorem, let {X = (X,{\mathcal X},\mu)} be a probability space, and {\mathrm{Aut}(X, {\mathcal X}, \mu)} be the group of measure-preserving automorphisms of this space, that is to say the invertible bimeasurable maps {T: X \rightarrow X} that preserve the measure {\mu}: {T_* \mu = \mu}. To avoid some ambiguity later in this post when we introduce abstract analogues of measure theory, we will refer to measurable maps as concrete measurable maps, and measurable spaces as concrete measurable spaces. (One could also call {X = (X,{\mathcal X}, \mu)} a concrete probability space, but we will not need to do so here as we will not be working explicitly with abstract probability spaces.)

Let {\Gamma = (\Gamma,\cdot)} be a discrete group. A (concrete) measure-preserving action of {\Gamma} on {X} is a group homomorphism {\gamma \mapsto T^\gamma} from {\Gamma} to {\mathrm{Aut}(X, {\mathcal X}, \mu)}, thus {T^1} is the identity map and {T^{\gamma_1} \circ T^{\gamma_2} = T^{\gamma_1 \gamma_2}} for all {\gamma_1,\gamma_2 \in \Gamma}. A large portion of ergodic theory is concerned with the study of such measure-preserving actions, especially in the classical case when {\Gamma} is the integers (with the additive group law).

Let {K = (K,+)} be a compact Hausdorff abelian group, which we can endow with the Borel {\sigma}-algebra {{\mathcal B}(K)}. A (concrete measurable) {K}cocycle is a collection {\rho = (\rho_\gamma)_{\gamma \in \Gamma}} of concrete measurable maps {\rho_\gamma: X \rightarrow K} obeying the cocycle equation

\displaystyle \rho_{\gamma_1 \gamma_2}(x) = \rho_{\gamma_1} \circ T^{\gamma_2}(x) + \rho_{\gamma_2}(x) \ \ \ \ \ (1)

for {\mu}-almost every {x \in X}. (Here we are glossing over a measure-theoretic subtlety that we will return to later in this post – see if you can spot it before then!) Cocycles arise naturally in the theory of group extensions of dynamical systems; in particular (and ignoring the aforementioned subtlety), each cocycle induces a measure-preserving action {\gamma \mapsto S^\gamma} on {X \times K} (which we endow with the product of {\mu} with Haar probability measure on {K}), defined by

\displaystyle S^\gamma( x, k ) := (T^\gamma x, k + \rho_\gamma(x) ).

This connection with group extensions was the original motivation for our study of measurable cohomology, but is not the focus of the current paper.

A special case of a {K}-valued cocycle is a (concrete measurable) {K}-valued coboundary, in which {\rho_\gamma} for each {\gamma \in \Gamma} takes the special form

\displaystyle \rho_\gamma(x) = F \circ T^\gamma(x) - F(x)

for {\mu}-almost every {x \in X}, where {F: X \rightarrow K} is some measurable function; note that (ignoring the aforementioned subtlety), every function of this form is automatically a concrete measurable {K}-valued cocycle. One of the first basic questions in measurable cohomology is to try to characterize which {K}-valued cocycles are in fact {K}-valued coboundaries. This is a difficult question in general. However, there is a general result of Moore and Schmidt that at least allows one to reduce to the model case when {K} is the unit circle {\mathbf{T} = {\bf R}/{\bf Z}}, by taking advantage of the Pontryagin dual group {\hat K} of characters {\hat k: K \rightarrow \mathbf{T}}, that is to say the collection of continuous homomorphisms {\hat k: k \mapsto \langle \hat k, k \rangle} to the unit circle. More precisely, we have

Theorem 1 (Countable Moore-Schmidt theorem) Let {\Gamma} be a discrete group acting in a concrete measure-preserving fashion on a probability space {X}. Let {K} be a compact Hausdorff abelian group. Assume the following additional hypotheses:

Then a {K}-valued concrete measurable cocycle {\rho = (\rho_\gamma)_{\gamma \in \Gamma}} is a concrete coboundary if and only if for each character {\hat k \in \hat K}, the {\mathbf{T}}-valued cocycles {\langle \hat k, \rho \rangle = ( \langle \hat k, \rho_\gamma \rangle )_{\gamma \in \Gamma}} are concrete coboundaries.

The hypotheses (i), (ii), (iii) are saying in some sense that the data {\Gamma, X, K} are not too “large”; in all three cases they are saying in some sense that the data are only “countably complicated”. For instance, (iii) is equivalent to {K} being second countable, and (ii) is equivalent to {X} being modeled by a complete separable metric space. It is because of this restriction that we refer to this result as a “countable” Moore-Schmidt theorem. This theorem is a useful tool in several other applications, such as the Host-Kra structure theorem for ergodic systems; I hope to return to these subsequent applications in a future post.

Let us very briefly sketch the main ideas of the proof of Theorem 1. Ignore for now issues of measurability, and pretend that something that holds almost everywhere in fact holds everywhere. The hard direction is to show that if each {\langle \hat k, \rho \rangle} is a coboundary, then so is {\rho}. By hypothesis, we then have an equation of the form

\displaystyle \langle \hat k, \rho_\gamma(x) \rangle = \alpha_{\hat k} \circ T^\gamma(x) - \alpha_{\hat k}(x) \ \ \ \ \ (2)

for all {\hat k, \gamma, x} and some functions {\alpha_{\hat k}: X \rightarrow {\mathbf T}}, and our task is then to produce a function {F: X \rightarrow K} for which

\displaystyle \rho_\gamma(x) = F \circ T^\gamma(x) - F(x)

for all {\gamma,x}.

Comparing the two equations, the task would be easy if we could find an {F: X \rightarrow K} for which

\displaystyle \langle \hat k, F(x) \rangle = \alpha_{\hat k}(x) \ \ \ \ \ (3)

for all {\hat k, x}. However there is an obstruction to this: the left-hand side of (3) is additive in {\hat k}, so the right-hand side would have to be also in order to obtain such a representation. In other words, for this strategy to work, one would have to first establish the identity

\displaystyle \alpha_{\hat k_1 + \hat k_2}(x) - \alpha_{\hat k_1}(x) - \alpha_{\hat k_2}(x) = 0 \ \ \ \ \ (4)

for all {\hat k_1, \hat k_2, x}. On the other hand, the good news is that if we somehow manage to obtain the equation, then we can obtain a function {F} obeying (3), thanks to Pontryagin duality, which gives a one-to-one correspondence between {K} and the homomorphisms of the (discrete) group {\hat K} to {\mathbf{T}}.

Now, it turns out that one cannot derive the equation (4) directly from the given information (2). However, the left-hand side of (2) is additive in {\hat k}, so the right-hand side must be also. Manipulating this fact, we eventually arrive at

\displaystyle (\alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2}) \circ T^\gamma(x) = (\alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2})(x).

In other words, we don’t get to show that the left-hand side of (4) vanishes, but we do at least get to show that it is {\Gamma}-invariant. Now let us assume for sake of argument that the action of {\Gamma} is ergodic, which (ignoring issues about sets of measure zero) basically asserts that the only {\Gamma}-invariant functions are constant. So now we get a weaker version of (4), namely

\displaystyle \alpha_{\hat k_1 + \hat k_2}(x) - \alpha_{\hat k_1}(x) - \alpha_{\hat k_2}(x) = c_{\hat k_1, \hat k_2} \ \ \ \ \ (5)

for some constants {c_{\hat k_1, \hat k_2} \in \mathbf{T}}.

Now we need to eliminate the constants. This can be done by the following group-theoretic projection. Let {L^0({\bf X} \rightarrow {\bf T})} denote the space of concrete measurable maps {\alpha} from {{\bf X}} to {{\bf T}}, up to almost everywhere equivalence; this is an abelian group where the various terms in (5) naturally live. Inside this group we have the subgroup {{\bf T}} of constant functions (up to almost everywhere equivalence); this is where the right-hand side of (5) lives. Because {{\bf T}} is a divisible group, there is an application of Zorn’s lemma (a good exercise for those who are not acquainted with these things) to show that there exists a retraction {w: L^0({\bf X} \rightarrow {\bf T}) \rightarrow {\bf T}}, that is to say a group homomorphism that is the identity on the subgroup {{\bf T}}. We can use this retraction, or more precisely the complement {\alpha \mapsto \alpha - w(\alpha)}, to eliminate the constant in (5). Indeed, if we set

\displaystyle \tilde \alpha_{\hat k}(x) := \alpha_{\hat k}(x) - w(\alpha_{\hat k})

then from (5) we see that

\displaystyle \tilde \alpha_{\hat k_1 + \hat k_2}(x) - \tilde \alpha_{\hat k_1}(x) - \tilde \alpha_{\hat k_2}(x) = 0

while from (2) one has

\displaystyle \langle \hat k, \rho_\gamma(x) \rangle = \tilde \alpha_{\hat k} \circ T^\gamma(x) - \tilde \alpha_{\hat k}(x)

and now the previous strategy works with {\alpha_{\hat k}} replaced by {\tilde \alpha_{\hat k}}. This concludes the sketch of proof of Theorem 1.

In making the above argument rigorous, the hypotheses (i)-(iii) are used in several places. For instance, to reduce to the ergodic case one relies on the ergodic decomposition, which requires the hypothesis (ii). Also, most of the above equations only hold outside of a set of measure zero, and the hypothesis (i) and the hypothesis (iii) (which is equivalent to {\hat K} being at most countable) to avoid the problem that an uncountable union of sets of measure zero could have positive measure (or fail to be measurable at all).

My co-author Asgar Jamneshan and I are working on a long-term project to extend many results in ergodic theory (such as the aforementioned Host-Kra structure theorem) to “uncountable” settings in which hypotheses analogous to (i)-(iii) are omitted; thus we wish to consider actions on uncountable groups, on spaces that are not standard Borel, and cocycles taking values in groups that are not metrisable. Such uncountable contexts naturally arise when trying to apply ergodic theory techniques to combinatorial problems (such as the inverse conjecture for the Gowers norms), as one often relies on the ultraproduct construction (or something similar) to generate an ergodic theory translation of these problems, and these constructions usually give “uncountable” objects rather than “countable” ones. (For instance, the ultraproduct of finite groups is a hyperfinite group, which is usually uncountable.). This paper marks the first step in this project by extending the Moore-Schmidt theorem to the uncountable setting.

If one simply drops the hypotheses (i)-(iii) and tries to prove the Moore-Schmidt theorem, several serious difficulties arise. We have already mentioned the loss of the ergodic decomposition and the possibility that one has to control an uncountable union of null sets. But there is in fact a more basic problem when one deletes (iii): the addition operation {+: K \times K \rightarrow K}, while still continuous, can fail to be measurable as a map from {(K \times K, {\mathcal B}(K) \otimes {\mathcal B}(K))} to {(K, {\mathcal B}(K))}! Thus for instance the sum of two measurable functions {F: X \rightarrow K} need not remain measurable, which makes even the very definition of a measurable cocycle or measurable coboundary problematic (or at least unnatural). This phenomenon is known as the Nedoma pathology. A standard example arises when {K} is the uncountable torus {{\mathbf T}^{{\bf R}}}, endowed with the product topology. Crucially, the Borel {\sigma}-algebra {{\mathcal B}(K)} generated by this uncountable product is not the product {{\mathcal B}(\mathbf{T})^{\otimes {\bf R}}} of the factor Borel {\sigma}-algebras (the discrepancy ultimately arises from the fact that topologies permit uncountable unions, but {\sigma}-algebras do not); relating to this, the product {\sigma}-algebra {{\mathcal B}(K) \otimes {\mathcal B}(K)} is not the same as the Borel {\sigma}-algebra {{\mathcal B}(K \times K)}, but is instead a strict sub-algebra. If the group operations on {K} were measurable, then the diagonal set

\displaystyle K^\Delta := \{ (k,k') \in K \times K: k = k' \} = \{ (k,k') \in K \times K: k - k' = 0 \}

would be measurable in {{\mathcal B}(K) \otimes {\mathcal B}(K)}. But it is an easy exercise in manipulation of {\sigma}-algebras to show that if {(X, {\mathcal X}), (Y, {\mathcal Y})} are any two measurable spaces and {E \subset X \times Y} is measurable in {{\mathcal X} \otimes {\mathcal Y}}, then the fibres {E_x := \{ y \in Y: (x,y) \in E \}} of {E} are contained in some countably generated subalgebra of {{\mathcal Y}}. Thus if {K^\Delta} were {{\mathcal B}(K) \otimes {\mathcal B}(K)}-measurable, then all the points of {K} would lie in a single countably generated {\sigma}-algebra. But the cardinality of such an algebra is at most {2^{\alpha_0}} while the cardinality of {K} is {2^{2^{\alpha_0}}}, and Cantor’s theorem then gives a contradiction.

To resolve this problem, we give {K} a coarser {\sigma}-algebra than the Borel {\sigma}-algebra, which we call the reduced {\sigma}-algebra {{\mathcal B}^\otimes(K)}, thus coarsening the measurable space structure on {K = (K,{\mathcal B}(K))} to a new measurable space {K_\otimes := (K, {\mathcal B}^\otimes(K))}. In the case of compact Hausdorff abelian groups, {{\mathcal B}^{\otimes}(K)} can be defined as the {\sigma}-algebra generated by the characters {\hat k: K \rightarrow {\mathbf T}}; for more general compact abelian groups, one can define {{\mathcal B}^{\otimes}(K)} as the {\sigma}-algebra generated by all continuous maps into metric spaces. This {\sigma}-algebra is equal to {{\mathcal B}(K)} when {K} is metrisable but can be smaller for other {K}. With this measurable structure, {K_\otimes} becomes a measurable group; it seems that once one leaves the metrisable world that {K_\otimes} is a superior (or at least equally good) space to work with than {K} for analysis, as it avoids the Nedoma pathology. (For instance, from Plancherel’s theorem, we see that if {m_K} is the Haar probability measure on {K}, then {L^2(K,m_K) = L^2(K_\otimes,m_K)} (thus, every {K}-measurable set is equivalent modulo {m_K}-null sets to a {K_\otimes}-measurable set), so there is no damage to Plancherel caused by passing to the reduced {\sigma}-algebra.

Passing to the reduced {\sigma}-algebra {K_\otimes} fixes the most severe problems with an uncountable Moore-Schmidt theorem, but one is still faced with an issue of having to potentially take an uncountable union of null sets. To avoid this sort of problem, we pass to the framework of abstract measure theory, in which we remove explicit mention of “points” and can easily delete all null sets at a very early stage of the formalism. In this setup, the category of concrete measurable spaces is replaced with the larger category of abstract measurable spaces, which we formally define as the opposite category of the category of {\sigma}-algebras (with Boolean algebra homomorphisms). Thus, we define an abstract measurable space to be an object of the form {{\mathcal X}^{\mathrm{op}}}, where {{\mathcal X}} is an (abstract) {\sigma}-algebra and {\mathrm{op}} is a formal placeholder symbol that signifies use of the opposite category, and an abstract measurable map {T: {\mathcal X}^{\mathrm{op}} \rightarrow {\mathcal Y}^{\mathrm{op}}} is an object of the form {(T^*)^{\mathrm{op}}}, where {T^*: {\mathcal Y} \rightarrow {\mathcal X}} is a Boolean algebra homomorphism and {\mathrm{op}} is again used as a formal placeholder; we call {T^*} the pullback map associated to {T}.  [UPDATE: It turns out that this definition of a measurable map led to technical issues.  In a forthcoming revision of the paper we also impose the requirement that the abstract measurable map be \sigma-complete (i.e., it respects countable joins).] The composition {S \circ T: {\mathcal X}^{\mathrm{op}} \rightarrow {\mathcal Z}^{\mathrm{op}}} of two abstract measurable maps {T: {\mathcal X}^{\mathrm{op}} \rightarrow {\mathcal Y}^{\mathrm{op}}}, {S: {\mathcal Y}^{\mathrm{op}} \rightarrow {\mathcal Z}^{\mathrm{op}}} is defined by the formula {S \circ T := (T^* \circ S^*)^{\mathrm{op}}}, or equivalently {(S \circ T)^* = T^* \circ S^*}.

Every concrete measurable space {X = (X,{\mathcal X})} can be identified with an abstract counterpart {{\mathcal X}^{op}}, and similarly every concrete measurable map {T: X \rightarrow Y} can be identified with an abstract counterpart {(T^*)^{op}}, where {T^*: {\mathcal Y} \rightarrow {\mathcal X}} is the pullback map {T^* E := T^{-1}(E)}. Thus the category of concrete measurable spaces can be viewed as a subcategory of the category of abstract measurable spaces. The advantage of working in the abstract setting is that it gives us access to more spaces that could not be directly defined in the concrete setting. Most importantly for us, we have a new abstract space, the opposite measure algebra {X_\mu} of {X}, defined as {({\bf X}/{\bf N})^*} where {{\bf N}} is the ideal of null sets in {{\bf X}}. Informally, {X_\mu} is the space {X} with all the null sets removed; there is a canonical abstract embedding map {\iota: X_\mu \rightarrow X}, which allows one to convert any concrete measurable map {f: X \rightarrow Y} into an abstract one {[f]: X_\mu \rightarrow Y}. One can then define the notion of an abstract action, abstract cocycle, and abstract coboundary by replacing every occurrence of the category of concrete measurable spaces with their abstract counterparts, and replacing {X} with the opposite measure algebra {X_\mu}; see the paper for details. Our main theorem is then

Theorem 2 (Uncountable Moore-Schmidt theorem) Let {\Gamma} be a discrete group acting abstractly on a {\sigma}-finite measure space {X}. Let {K} be a compact Hausdorff abelian group. Then a {K_\otimes}-valued abstract measurable cocycle {\rho = (\rho_\gamma)_{\gamma \in \Gamma}} is an abstract coboundary if and only if for each character {\hat k \in \hat K}, the {\mathbf{T}}-valued cocycles {\langle \hat k, \rho \rangle = ( \langle \hat k, \rho_\gamma \rangle )_{\gamma \in \Gamma}} are abstract coboundaries.

With the abstract formalism, the proof of the uncountable Moore-Schmidt theorem is almost identical to the countable one (in fact we were able to make some simplifications, such as avoiding the use of the ergodic decomposition). A key tool is what we call a “conditional Pontryagin duality” theorem, which asserts that if one has an abstract measurable map {\alpha_{\hat k}: X_\mu \rightarrow {\bf T}} for each {\hat k \in K} obeying the identity { \alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2} = 0} for all {\hat k_1,\hat k_2 \in \hat K}, then there is an abstract measurable map {F: X_\mu \rightarrow K_\otimes} such that {\alpha_{\hat k} = \langle \hat k, F \rangle} for all {\hat k \in \hat K}. This is derived from the usual Pontryagin duality and some other tools, most notably the completeness of the {\sigma}-algebra of {X_\mu}, and the Sikorski extension theorem.

We feel that it is natural to stay within the abstract measure theory formalism whenever dealing with uncountable situations. However, it is still an interesting question as to when one can guarantee that the abstract objects constructed in this formalism are representable by concrete analogues. The basic questions in this regard are:

  • (i) Suppose one has an abstract measurable map {f: X_\mu \rightarrow Y} into a concrete measurable space. Does there exist a representation of {f} by a concrete measurable map {\tilde f: X \rightarrow Y}? Is it unique up to almost everywhere equivalence?
  • (ii) Suppose one has a concrete cocycle that is an abstract coboundary. When can it be represented by a concrete coboundary?

For (i) the answer is somewhat interesting (as I learned after posing this MathOverflow question):

  • If {Y} does not separate points, or is not compact metrisable or Polish, there can be counterexamples to uniqueness. If {Y} is not compact or Polish, there can be counterexamples to existence.
  • If {Y} is a compact metric space or a Polish space, then one always has existence and uniqueness.
  • If {Y} is a compact Hausdorff abelian group, one always has existence.
  • If {X} is a complete measure space, then one always has existence (from a theorem of Maharam).
  • If {X} is the unit interval with the Borel {\sigma}-algebra and Lebesgue measure, then one has existence for all compact Hausdorff {Y} assuming the continuum hypothesis (from a theorem of von Neumann) but existence can fail under other extensions of ZFC (from a theorem of Shelah, using the method of forcing).
  • For more general {X}, existence for all compact Hausdorff {Y} is equivalent to the existence of a lifting from the {\sigma}-algebra {\mathcal{X}/\mathcal{N}} to {\mathcal{X}} (or, in the language of abstract measurable spaces, the existence of an abstract retraction from {X} to {X_\mu}).
  • It is a long-standing open question (posed for instance by Fremlin) whether it is relatively consistent with ZFC that existence holds whenever {Y} is compact Hausdorff.

Our understanding of (ii) is much less complete:

  • If {K} is metrisable, the answer is “always” (which among other things establishes the countable Moore-Schmidt theorem as a corollary of the uncountable one).
  • If {\Gamma} is at most countable and {X} is a complete measure space, then the answer is again “always”.

In view of the answers to (i), I would not be surprised if the full answer to (ii) was also sensitive to axioms of set theory. However, such set theoretic issues seem to be almost completely avoided if one sticks with the abstract formalism throughout; they only arise when trying to pass back and forth between the abstract and concrete categories.

Terence TaoEigenvectors from Eigenvalues: a survey of a basic identity in linear algebra

Peter Denton, Stephen Parke, Xining Zhang, and I have just uploaded to the arXiv a completely rewritten version of our previous paper, now titled “Eigenvectors from Eigenvalues: a survey of a basic identity in linear algebra“. This paper is now a survey of the various literature surrounding the following basic identity in linear algebra, which we propose to call the eigenvector-eigenvalue identity:

Theorem 1 (Eigenvector-eigenvalue identity) Let {A} be an {n \times n} Hermitian matrix, with eigenvalues {\lambda_1(A),\dots,\lambda_n(A)}. Let {v_i} be a unit eigenvector corresponding to the eigenvalue {\lambda_i(A)}, and let {v_{i,j}} be the {j^{th}} component of {v_i}. Then

\displaystyle |v_{i,j}|^2 \prod_{k=1; k \neq i}^n (\lambda_i(A) - \lambda_k(A)) = \prod_{k=1}^{n-1} (\lambda_i(A) - \lambda_k(M_j))

where {M_j} is the {n-1 \times n-1} Hermitian matrix formed by deleting the {j^{th}} row and column from {A}.

When we posted the first version of this paper, we were unaware of previous appearances of this identity in the literature; a related identity had been used by Erdos-Schlein-Yau and by myself and Van Vu for applications to random matrix theory, but to our knowledge this specific identity appeared to be new. Even two months after our preprint first appeared on the arXiv in August, we had only learned of one other place in the literature where the identity showed up (by Forrester and Zhang, who also cite an earlier paper of Baryshnikov).

The situation changed rather dramatically with the publication of a popular science article in Quanta on this identity in November, which gave this result significantly more exposure. Within a few weeks we became informed (through private communication, online discussion, and exploration of the citation tree around the references we were alerted to) of over three dozen places where the identity, or some other closely related identity, had previously appeared in the literature, in such areas as numerical linear algebra, various aspects of graph theory (graph reconstruction, chemical graph theory, and walks on graphs), inverse eigenvalue problems, random matrix theory, and neutrino physics. As a consequence, we have decided to completely rewrite our article in order to collate this crowdsourced information, and survey the history of this identity, all the known proofs (we collect seven distinct ways to prove the identity (or generalisations thereof)), and all the applications of it that we are currently aware of. The citation graph of the literature that this ad hoc crowdsourcing effort produced is only very weakly connected, which we found surprising:

The earliest explicit appearance of the eigenvector-eigenvalue identity we are now aware of is in a 1966 paper of Thompson, although this paper is only cited (directly or indirectly) by a fraction of the known literature, and also there is a precursor identity of Löwner from 1934 that can be shown to imply the identity as a limiting case. At the end of the paper we speculate on some possible reasons why this identity only achieved a modest amount of recognition and dissemination prior to the November 2019 Quanta article.

December 03, 2019

Jordan EllenbergShocking the carrots

It takes a long time to soften carrots adequately by sauteing them, so a lot of recipes ask you to boil the carrots first or even mix butter and water and cook the carrots in the resulting lipidous slurry. This ends up tasting OK but the carrots never really taste sauteed to me, they taste boiled! I want that sear.

So this week I tried something new, borrowing a technique I learned a long time ago for perfect sauteed asparagus. Put your butter in the pan, melt it, get those carrots sauteing in there. Put in some salt and whatever other seasoning you want at whatever time suits that seasoning. (Dill is traditional, I used nutmeg this week and it was great) Saute the carrots until they’re nicely browned. At this point they will not be cooked enough. Eat one, it’ll taste nice on the outside but still be crunchy and part-raw.

So now it’s time to shock the carrots. Fill a small drinking glass half-full with water. So maybe a quarter cup, I dunno. Throw the water in the hot pan and immediately, as the sizzle kicks in and the steam begins to rise, slam the lid on. It should sound sort of like a high hat when you crash and then right away mute. Turn the heat down and let the carrots steam in there for about six minutes. When you open it, the water should be gone but if it’s not I would just take the carrots out with a slotted spoon. Result: fully tender carrots that taste sauteed, not boiled.

Doug NatelsonWhat is a liquid?

I wrote recently about phases of matter (and longer ago here).  The phase that tends to get short shrift in the physics curriculum is the liquid, and this is actually a symptom indicating that liquids are not simple things. 

We talk a lot about gases, and they tend to be simple in large part because they are low density systems - the constituents spend the overwhelming majority of their time far apart (compared to the size of the constituents), and therefore tend to interact with each other only very weakly.  We can even look in the ideal limit of infinitesimal particle size and zero interactions, so that the only energy in the problem is the kinetic energy of the particles, and derive the Ideal Gas Law.  

There is no such thing as an Ideal Liquid Law.  That tells you something about the complexity of these systems right there.

A classical liquid is a phase of matter in which the constituent particles have a typical interparticle distance comparable to the particle size, and therefore interact strongly, with both a "hard core repulsion" so that the particles are basically impenetrable, and usually through some kind of short-ranged attraction, either from van der Waals forces and/or longer-ranged/stronger interactions.  The kinetic energy of the particles is sufficiently large that they don't bond rigidly to each other and therefore move past and around each other continuously.  However, the density is so high that you can't even do very well by only worrying about pairs of interacting particles - you have to keep track of three-body, four-body, etc. interactions somehow.    

The very complexity of these strongly interacting collections of particles leads to the emergence of some simplicity at larger scales.  Because the particles are cheek-by-jowl and impenetrable, liquids are about as incompressible as solids.  The lack of tight bonding and enough kinetic energy to keep everyone moving means that, on average and on scales large compared to the particle size, liquids are homogeneous (uniform properties in space) and isotropic (uniform properties in all directions).  When pushed up against solid walls by gravity or other forces, liquids take on the shapes of their containers.  (If the typical kinetic energy per particle can't overcome the steric interactions with the local environment, then particles can get jammed.  Jammed systems act like "rigid" solids.)

Because of the constant interparticle collisions, energy and momentum get passed along readily within liquids, leading to good thermal conduction (the transport of kinetic energy of the particles via microscopic, untraceable amounts we call heat) and viscosity (the transfer of transverse momentum between adjacent rough layers of particles just due to collisions - the fluid analog of friction).  The lack of rigid bonding interactions means that liquids can't resist shear; layers of particles slide past each other.  This means that liquids, like gases, don't have transverse sound waves.   The flow of particles is best described by hydrodynamics, a continuum approach that makes sense on scales much bigger than the particles.   

Quantum liquids are those for which the quantum statistics of the constituents are important to the macroscopic properties.  Liquid helium is one such example.  Physicists have also adopted the term "liquid" to mean any strongly interacting, comparatively incompressible, flow-able system, such as the electrons in a metal ("Fermi liquid").  

Liquids are another example emergence that is deep, profound, and so ubiquitous that people tend to look right past it.  "Liquidity" is a set of properties so well-defined that a small child can tell you whether something is a liquid by looking at a video of it; those properties emerge largely independent of the microscopic details of the constituents and their interactions (water molecules with hydrogen bonds; octane molecules with van der Waals attraction; very hot silica molecules in flowing lava); and none of those properties are obvious if one starts with, say, the Standard Model of particle physics.  

December 02, 2019

Scott AaronsonTwo updates

  1. Two weeks ago, I blogged about the claim of Nathan Keller and Ohad Klein to have proven the Aaronson-Ambainis Conjecture. Alas, Keller and Klein tell me that they’ve now withdrawn their preprint (though it may take another day for that to show up on the arXiv), because of what looks for now like a fatal flaw, in Lemma 5.3, discovered by Paata Ivanishvili. (My own embarrassment over having missed this flaw is slightly mitigated by most of the experts in discrete Fourier analysis having missed it as well!) Keller and Klein are now working to fix the flaw, and I wholeheartedly wish them success.
  2. In unrelated news, I was saddened to read that Virgil Griffith—cryptocurrency researcher, former Integrated Information Theory researcher, and onetime contributor to Shtetl-Optimized—was arrested at LAX for having traveled to North Korea to teach the DPRK about cryptocurrency, against the admonitions of the US State Department. I didn’t know Virgil well, but I did meet him in person at least once, and I liked his essays for this blog about how, after spending years studying IIT under Giulio Tononi himself, he became disillusioned with many aspects of it and evolved to a position not far from mine (though not identical either).
    Personally, I despise the North Korean regime for the obvious reasons—I regard it as not merely evil, but cartoonishly so—and I’m mystified by Virgil’s apparently sincere belief that he could bring peace between the North and South by traveling to North Korea to give a lecture about blockchain. Yet, however world-historically naïve he may have been, his intentions appear to have been good. More pointedly—and here I’m asking not in a legal sense but in a human one—if giving aid and comfort to the DPRK is treasonous, then isn’t the current occupant of the Oval Office a million times guiltier of that particular treason (to say nothing of others)? It’s like, what does “treason” even mean anymore? In any case, I hope some plea deal or other arrangement can be worked out that won’t end Virgil’s productive career.

Secret Blogging SeminarRead Izabella Laba on diversity statements

There has been a dispute running through mathematical twitter about diversity statements in academic hiring. Prompted by that, Izabella Laba has just written an excellent post, which affirms the importance of diversity as a goal, but lays out the many tricky issues with diversity statements. It makes a lot of points I would like to make, and raises others that I hadn’t thought of but should have. In case there is someone reading this blog but not reading Professor Laba’s, go check it out.

Sadly, this blog is still dead. I have a list of things I would like to write on it, but I have no idea when or if I will find the time.

Scott AaronsonThe Aaronson-Ambainis Conjecture (2008-2019)

Update: Sadly, Nathan Keller and Ohad Klein tell me that they’ve retracted their preprint, because of what currently looks like a fatal flaw in Lemma 5.3, uncovered by Paata Ivanishvili. I wish them the best of luck in fixing the problem. In the meantime, I’m leaving up this post for “historical” reasons.

Around 1999, one of the first things I ever did in quantum computing theory was to work on a problem that Fortnow and Rogers suggested in a paper: is it possible to separate P from BQP relative to a random oracle? (That is, without first needing to separate P from PSPACE or whatever in the real world?) Or to the contrary: suppose that a quantum algorithm Q makes T queries to a Boolean input string X. Is there then a classical simulation algorithm that makes poly(T) queries to X, and that approximates Q’s acceptance probability for most values of X? Such a classical simulation, were it possible, would still be consistent with the existence of quantum algorithms like Simon’s and Shor’s, which are able to achieve exponential (and even greater) speedups in the black-box setting. It would simply demonstrate the importance, for Simon’s and Shor’s algorithms, of global structure that makes the string X extremely non-random: for example, encoding a periodic function (in the case of Shor’s algorithm), or encoding a function that hides a secret string s (in the case of Simon’s). It would underscore that superpolynomial quantum speedups depend on structure.

I never managed to solve this problem. Around 2008, though, I noticed that a solution would follow from a perhaps-not-obviously-related conjecture, about influences in low-degree polynomials. Namely, let p:Rn→R be a degree-d real polynomial in n variables, and suppose p(x)∈[0,1] for all x∈{0,1}n. Define the variance of p to be
and define the influence of the ith variable to be
Here the expectations are over strings in {0,1}n, and xi means x with its ith bit flipped between 0 and 1. Then the conjecture is this: there must be some variable i such that Infi(p) ≥ poly(Var(p)/d) (in other words, that “explains” a non-negligible fraction of the variance of the entire polynomial).

Why would this conjecture imply the statement about quantum algorithms? Basically, because of the seminal result of Beals et al. from 1998: that if a quantum algorithm makes T queries to a Boolean input X, then its acceptance probability can be written as a real polynomial over the bits of X, of degree at most 2T. Given that result, if you wanted to classically simulate a quantum algorithm Q on most inputs—and if you only cared about query complexity, not computation time—you’d simply need to do the following:
(1) Find the polynomial p that represents Q’s acceptance probability.
(2) Find a variable i that explains at least a 1/poly(T) fraction of the total remaining variance in p, and query that i.
(3) Keep repeating step (2), until p has been restricted to a polynomial with not much variance left—i.e., to nearly a constant function p(X)=c. Whenever that happens, halt and output the constant c.
The key is that by hypothesis, this algorithm will halt, with high probability over X, after only poly(T) steps.

Anyway, around the same time, Andris Ambainis had a major break on a different problem that I’d told him about: namely, whether randomized and quantum query complexities are polynomially related for all partial functions with permutation symmetry (like the collision and the element distinctness functions). Andris and I decided to write up the two directions jointly. The result was our 2011 paper entitled The Need for Structure in Quantum Speedups.

Of the two contributions in the “Need for Structure” paper, the one about random oracles and influences in low-degree polynomials was clearly the weaker and less satisfying one. As the reviewers pointed out, that part of the paper didn’t solve anything: it just reduced one unsolved problem to a new, slightly different problem that was also unsolved. Nevertheless, that part of the paper acquired a life of its own over the ensuing decade, as the world’s experts in analysis of Boolean functions and polynomials began referring to the “Aaronson-Ambainis Conjecture.” Ryan O’Donnell, Guy Kindler, and many others had a stab. I even got Terry Tao to spend an hour or two on the problem when I visited UCLA.

Now, at long last, Nathan Keller and Ohad Klein say they’ve proven the Aaronson-Ambainis Conjecture, in a preprint whose title is a riff on ours: “Quantum Speedups Need Structure.”

Their paper hasn’t yet been peer-reviewed, and I haven’t yet carefully studied it, but I could and should: at 19 pages, it looks very approachable and clear, if not as radically short as (say) Huang’s proof of the Sensitivity Conjecture. Keller and Klein’s argument subsumes all the earlier results that I knew would need to be subsumed, and involves all the concepts (like a real analogue of block sensitivity) that I knew would need to be involved somehow.

My plan had been as follows:
(1) Read their paper in detail. Understand every step of their proof.
(2) Write a blog post that reflects my detailed understanding.

Unfortunately, this plan did not sufficiently grapple with the fact that I now have two kids. It got snagged for a week at step (1). So I’m now executing an alternative plan, which is to jump immediately to the blog post.

Assuming Keller and Klein’s result holds up—as I expect it will—by combining it with the observations in my and Andris’s paper, one immediately gets an explanation for why no one has managed to separate P from BQP relative to a random oracle, but only relative to non-random oracles. This complements the work of Kahn, Saks, and Smyth, who around 2000 gave a precisely analogous explanation for the difficulty of separating P from NP∩coNP relative to a random oracle.

Unfortunately, the polynomial blowup is quite enormous: from a quantum algorithm making T queries, Keller and Klein apparently get a classical algorithm making O(T18) queries. But such things can almost always be massively improved.

Feel free to use the comments to ask any questions about this result or its broader context. I’ll either do my best to answer from the limited amount I know, or else I’ll pass the questions along to Nathan and Ohad themselves. Maybe, at some point, I’ll even be forced to understand the new proof.

Congratulations to Nathan and Ohad!

Update (Nov. 20): Tonight I finally did what I should’ve done two weeks ago, and worked through the paper from start to finish. Modulo some facts about noise operators, hypercontractivity, etc. that I took on faith, I now have a reasonable (albeit imperfect) understanding of the proof. It’s great!

In case it’s helpful to anybody, here’s my one-paragraph summary of how it works. First, you hit your bounded degree-d function f with a random restriction to attenuate its higher-degree Fourier coefficients (reminiscent of Linial-Mansour-Nisan).  Next, in that attenuated function, you find a small “coalition” of influential variables—by which we mean, a set of variables for which there’s some assignment that substantially biases f.  You keep iterating—finding influential coalitions in subfunctions on n/4, n/8, etc. variables. All the while, you keep track of the norm of the vector of all the block-sensitivities of all the inputs (the authors don’t clearly explain this in the intro, but they reveal it near the end). Every time you find another influential coalition, that norm goes down by a little, but by approximation theory, it can only go down O(d2) times until it hits rock bottom and your function is nearly constant. By the end, you’ll have approximated f itself by a decision tree of depth poly(d, 1/ε, log(n)).  Finally, you get rid of the log(n) term by using the fact that f essentially depended on at most exp(O(d)) variables anyway.

Anyway, I’m not sure how helpful it is to write more: the paper itself is about 95% as clear as it could possibly be, and even where it isn’t, you’d probably need to read it first (and, uh, know something about influences, block sensitivity, random restrictions, etc.) before any further clarifying remarks would be of use. But happy to discuss more in the comments, if anyone else is reading it.

November 29, 2019

Matt von HippelThe Teaching Heuristic for Non-Empirical Science

Science is by definition empirical. We discover how the world works not by sitting and thinking, but by going out and observing the world. But sometimes, all the observing we can do can’t possibly answer a question. In those situations, we might need “non-empirical science”.

The blog Slate Star Codex had a series of posts on this topic recently. He hangs out with a crowd that supports the many-worlds interpretation of quantum mechanics: the idea that quantum events are not truly random, but instead that all outcomes happen, the universe metaphorically splitting into different possible worlds. These metaphorical universes can’t be observed, so no empirical test can tell the difference between this and other interpretations of quantum mechanics: if we could ever know the difference, it would have to be for “non-empirical” reasons.

What reasons are those? Slate Star Codex teases out a few possible intuitions. He points out that we reject theories that have “unnecessary” ideas. He imagines a world where chemists believe that mixing an acid and a base also causes a distant star to go supernova, and a creationist world where paleontologists believe fossils are placed by the devil. In both cases, there might be no observable difference between their theories and ours, but because their theories have “extra pieces” (the distant star, the devil), we reject them for non-empirical reasons. Slate Star Codex asks if this supports many-worlds: without the extra assumption that quantum events randomly choose one outcome, isn’t quantum mechanics simpler?

I agree with some of this. Science really does use non-empirical reasoning. Without it, there’s no reason not to treat the world as a black box, a series of experiments with no mechanism behind it. But while we do reject theories with unnecessary ideas, that isn’t our only standard. We also need our theories to teach us about the world.

Ultimately, we trust science because it allows us to do things. If we understand the world, we can interact with it: we can build technology, design new experiments, and propose new theories. With this in mind, we can judge scientific theories by how well they help us do these things. A good scientific theory is one that gives us more power to interact with the world. It can do this by making correct predictions, but it can also do this by explaining things, making it easier for us to reason about them. Beyond empiricism, we can judge science by how well it teaches us.

This gives us an objection to the “supernova theory” of Slate Star Codex’s imagined chemists: it’s much more confusing to teach. To teach chemistry in that world you also have to teach the entire life cycle of stars, a subject that students won’t use in any other part of the course. The creationists’ “devil theory” of paleontology has the same problem: if their theory really makes the right predictions they’d have to teach students everything our paleontologists do: every era of geologic history, every theory of dinosaur evolution, plus an extra course in devil psychology. They end up with a mix that only makes it harder to understand the subject.

Many-worlds may seem simpler than other interpretations of quantum mechanics, but that doesn’t make it more useful, or easier to teach. You still need to teach students how to predict the results of experiments, and those results will still be random. If you teach them many-worlds, you need to add more discussion much earlier on, advanced topics like self-localizing uncertainty and decoherence. You need a quite extensive set of ideas, many of which won’t be used again, to justify rules another interpretation could have introduced much more simply. This would be fine if those ideas made additional predictions, but they don’t: like every interpretation of quantum mechanics, you end up doing the same experiments and building the same technology in the end.

I’m not saying I know many-worlds is false, or that I know another interpretation is true. All I’m saying is that, when physicists criticize many-worlds, they’re not just blindly insisting on empiricism. They’re rejecting many-worlds, in part, because all it does is make their work harder. And that, more than elegance or simplicity, is how we judge theories.

Sean CarrollThanksgiving

This year we give thanks for space. (We’ve previously given thanks for the Standard Model Lagrangian, Hubble’s Law, the Spin-Statistics Theorem, conservation of momentum, effective field theory, the error bar, gauge symmetry, Landauer’s Principle, the Fourier Transform, Riemannian Geometry, the speed of light, the Jarzynski equality, and the moons of Jupiter.)

Even when we restrict to essentially scientific contexts, “space” can have a number of meanings. In a tangible sense, it can mean outer space — the final frontier, that place we could go away from the Earth, where the stars and other planets are located. In a much more abstract setting, mathematicians use “space” to mean some kind of set with additional structure, like Hilbert space or the space of all maps between two manifolds. Here we’re aiming in between, using “space” to mean the three-dimensional manifold in which physical objects are located, at least as far as our observable universe is concerned.

That last clause reminds us that there are some complications here. The three dimensions we see of space might not be all there are; extra dimensions could be hidden from us by being curled up into tiny balls (or generalizations thereof) that are too small to see, or if the known particles and forces are confined to a three-dimensional brane embedded in a larger universe. On the other side, we have intimations that quantum theories of gravity imply the holographic principle, according to which an N-dimensional universe can be thought of as arising as a projection of (N-1)-dimensions worth of information. And much less speculatively, Einstein and Minkowski taught us long ago that three-dimensional space is better thought of as part of four-dimensional spacetime.

Let’s put all of that aside. Our everyday world is accurately modeled as stuff, distributed through three-dimensional space, evolving with time. That’s something to be thankful for! But we can also wonder why it is the case.

I don’t mean “Why is space three-dimensional?”, although there is that. I mean why is there something called “space” at all? I recently gave an informal seminar on this at Columbia, and I talk about it a bit in Something Deeply Hidden, and it’s related in spirit to a question Michael Nielsen recently asked on Twitter, “Why does F=ma?

Space is probably emergent rather than fundamental, and the ultimate answer to why it exists is probably going to involve quantum mechanics, and perhaps quantum gravity in particular. The right question is “Why does the wave function of the universe admit a description as a set of branching semi-classical worlds, each of which feature objects evolving in three-dimensional space?” We’re working on that!

But rather than answer it, for the purposes of thankfulness I just want to point out that it’s not obvious that space as we know it had to exist, even if classical mechanics had been the correct theory of the world.

Newton himself thought of space as absolute and fundamental. His ontology, as the philosophers would put it, featured objects located in space, evolving with time. Each object has a trajectory, which is its position in space at each moment of time. Quantities like “velocity” and “acceleration” are important, but they’re not fundamental — they are derived from spatial position, as the first and second derivatives with respect to time, respectively.

But that’s not the only way to do classical mechanics, and in some sense it’s not the most basic and powerful way. An alternative formulation is provided by Hamiltonian mechanics, where the fundamental variable isn’t “position,” but the combination of “position and momentum,” which together describe the phase space of a system. The state of a system at any one time is given by a point in phase space. There is a function of phase space cleverly called the Hamiltonian H(x,p), from which the dynamical equations of the system can be derived.

Via Wikipedia.

That might seem a little weird, and students tend to be somewhat puzzled by the underlying idea of Hamiltonian mechanics when they are first exposed to it. Momentum, we are initially taught in our physics courses, is just the mass times the velocity. So it seems like a derived quantity, not a fundamental one. How can Hamiltonian mechanics put momentum on an equal footing to position, if one is derived from the other?

The answer is that in the Hamiltonian approach, momentum is not defined to be mass times velocity. It ends up being equal to that by virtue of an equation of motion, at least if the Hamiltonian takes the right form. But in principle it’s an independent variable.

That’s a subtle distinction! Hamiltonian mechanics says that at any one moment a system is described by two quantities, its position and its momentum. No time derivatives or trajectories are involved; position and momentum are completely different things. Then there are two equations telling us how the position and the momentum change with time. The derivative of the position is the velocity, and one equation sets it equal to the momentum divided by the mass, just as in Newtonian mechanics. The other equation sets the derivative of the momentum equal to the force. Combining the two, we again find that force equals mass times acceleration (derivative of velocity).

So from the Hamiltonian perspective, positions and momenta are on a pretty equal footing. Why then, in the real world, do we seem to “live in position space”? Why don’t we live in momentum space?

As far as I know, no complete and rigorous answer to these questions has ever been given. But we do have some clues, and the basic principle is understood, even if some details remain to be ironed out.

That principle is this: we can divide the world into subsystems that interact with each other under appropriate circumstances. And those circumstances come down to “when they are nearby in space.” In other words, interactions are local in space. They are not local in momentum. Two billiard balls can bump into each other when they arrive at the same location, but nothing special happens when they have the same momentum or anything like that.

Ultimately this can be traced to the fact that the Hamiltonian of the real world is not some arbitrary function of positions and momenta; it’s a very specific kind of function. The ultimate expression of this kind of locality is field theory — space is suffused with fields, and what happens to a field at one point in space only directly depends on the other fields at precisely the same point in space, nowhere else. And that’s embedded in the Hamiltonian of the universe, in particular in the fact that the Hamiltonian can be written as an integral over three-dimensional space of a local function, called the “Hamiltonian density.”

H = \int \mathcal{H}(\phi, \pi) \, d^3x,

where φ is the field (which here acts as a “coordinate”) and π is its corresponding momentum.

This represents progress on the “Why is there space?” question. The answer is “Because space is the set of variables with respect to which interactions are local.” Which raises another question, of course: why are interactions local with respect to anything? Why do the fundamental degrees of freedom of nature arrange themselves into this kind of very specific structure, rather than some other one?

We have some guesses there, too. One of my favorite recent papers is “Locality From the Spectrum,” by Jordan Cotler, Geoffrey Penington, and Daniel Ranard. By “the spectrum” they mean the set of energy eigenvalues of a quantum Hamiltonian — i.e. the possible energies that states of definite energy can have in a theory. The game they play is to divide up the Hilbert space of quantum states into subsystems, and then ask whether a certain list of energies is compatible with “local” interactions between those subsystems. The answers are that most Hamiltonians aren’t compatible with locality in any sense, and for those where locality is possible, the division into local subsystems is essentially unique. So locality might just be a consequence of certain properties of the quantum Hamiltonian that governs the universe.

Fine, but why that Hamiltonian? Who knows? This is above our pay grade right now, though it’s fun to speculate. Meanwhile, let’s be thankful that the fundamental laws of physics allow us to describe our everyday world as a collection of stuff distributed through space. If they didn’t, how would we ever find our keys?

November 27, 2019

Scott AaronsonGuest post by Greg Kuperberg: Archimedes’ other principle and quantum supremacy

Scott’s Introduction: Happy Thanksgiving! Please join me in giving thanks for the beautiful post below, by friend-of-the-blog Greg Kuperberg, which tells a mathematical story that stretches from the 200s BC all the way to Google’s quantum supremacy result last month.

Archimedes’ other principle and quantum supremacy

by Greg Kuperberg

Note: UC Davis is hiring in CS theory! Scott offered me free ad space if I wrote a guest post, so here we are. The position is in all areas of CS theory, including QC theory although the search is not limited to that.

In this post, I wear the hat of a pure mathematician in a box provided by Archimedes. I thus set aside what everyone else thinks is important about Google’s 53-qubit quantum supremacy experiment, that it is a dramatic milestone in quantum computing technology. That’s only news about the physical world, whose significance pales in comparison to the Platonic world of mathematical objects. In my happy world, I like quantum supremacy as a demonstration of a beautiful coincidence in mathematics that has been known for more than 2000 years in a special case. The single-qubit case was discovered by Archimedes, duh. As Scott mentions in Quantum Computing Since Democritus, Bill Wootters stated the general result in a 1990 paper, but Wootters credits a 1974 paper by the Czech physicist Stanislav Sýkora. I learned of it in the substantially more general context of symplectic geometric that mathematicians developed independently between Sýkora’s prescient paper and Wootters’ more widely known citation. Much as I would like to clobber you with highly abstract mathematics, I will wait for some other time.

Suppose that you pick a pure state \(|\psi\rangle\) in the Hilbert space \(\mathbb{C}^d\) of a \(d\)-dimensional qudit, and then make many copies and fully measure each one, so that you sample many times from some distribution \(\mu\) on the \(d\) outcomes. You can think of such a distribution \(\mu\) as a classical randomized state on a digit of size \(d\). The set of all randomized states on a \(d\)-digit makes a \((d-1)\)-dimensional simplex \(\Delta^{d-1}\) in the orthant \(\mathbb{R}_{\ge 0}^d\). The coincidence is that if \(|\psi\rangle\) is uniformly random in the unit sphere in \(\mathbb{C}^d\), then \(\mu\) is uniformly random in \(\Delta^{d-1}\). I will call it the Born map, since it expresses the Born rule of quantum mechanics that amplitudes yield probabilities. Here is a diagram of the Born map of a qutrit, except with the laughable simplification of the 5-sphere in \(\mathbb{C}^3\) drawn as a 2-sphere.

If you pretend to be a bad probability student, then you might not be surprised by this coincidence, because you might suppose that all probability distributions are uniform other than treacherous exceptions to your intuition. However, the principle is certainly not true for a “rebit” (a qubit with real amplitudes) or for higher-dimensional “redits.” With real amplitudes, the probability density goes to infinity at the sides of the simplex \(\Delta^{d-1}\) and is even more favored at the corners. It also doesn’t work for mixed qudit states; the projection then favors the middle of \(\Delta^{d-1}\).

Archimedes’ theorem

The theorem of Archimedes is that a natural projection from the unit sphere to a circumscribing vertical cylinder preserves area. The projection is the second one that you might think of: Project radially from a vertical axis rather than radially in all three directions. Since Archimedes was a remarkably prescient mathematician, he was looking ahead to the Bloch sphere of pure qubit states \(|\psi\rangle\langle\psi|\) written in density operator form. If you measure \(|\psi\rangle\langle\psi|\) in the computational basis, you get a randomized bit state \(\mu\) somewhere on the interval from guaranteed 0 to guaranteed 1.

This transformation from a quantum state to a classical randomized state is a linear projection to a vertical axis. It is the same as Archimedes’ projection, except without the angle information. It doesn’t preserve dimension, but it does preserve measure (area or length, whatever) up to a factor of \(2\pi\). In particular, it takes a uniformly random \(|\psi\rangle\langle\psi|\) to a uniformly random \(\mu\).

Archimedes’ projection is also known as the Lambert cylindrical map of the world. This is the map that squishes Greenland along with the top of North America and Eurasia to give them proportionate area.

(I forgive Lambert if he didn’t give prior credit to Archimedes. There was no Internet back then to easily find out who did what first.) Here is a calculus-based proof of Archimedes’ theorem: In spherical coordinates, imagine an annular strip on the sphere at a polar angle of \(\theta\). (The polar angle is the angle from vertical in spherical coordinates, as depicted in red in the Bloch sphere diagram.) The strip has a radius of \(\sin\theta\), which makes it shorter than its unit radius friend on the cylinder. But it’s also tilted from vertical by an angle of \(\frac{\pi}2-\theta\), which makes it wider by a factor of \(1/(\sin \theta)\) than the height of its projection onto the \(z\) axis. The two factors exactly cancel out, making the area of the strip exactly proportional to the length of its projection onto the \(z\) axis. This is a coincidence which is special to the 2-sphere in 3 dimensions. As a corollary, we get that the surface area of a unit sphere is \(4\pi\), the same as an open cylinder with radius 1 and height 2. If you want to step through this in even more detail, Scott mentioned to me an action video which is vastly spiffier than anything that I could ever make.

The projection of the Bloch sphere onto an interval also shows what goes wrong if you try this with a rebit. The pure rebit states — again expressed in density operator form \(|\psi\rangle\langle\psi|\) are a great circle in the Bloch sphere. If you linearly project a circle onto an interval, then the length of the circle is clearly bunched up at the ends of the interval and the projected measure on the interval is not uniform.

Sýkora’s generalization

It is a neat coincidence that the Born map of a qubit preserves measure, but a proof that relies on Archimedes’ theorem seems to depend on the special geometry of the Bloch sphere of a qubit. That the higher-dimensional Born map also preserves measure is downright eerie. Scott challenged me to write an intuitive explanation. I remembered two different (but similar) proofs, neither of which is original to me. Scott and I disagree as to which proof is nicer.

As a first step of the first proof, it is easy to show that the Born map \(p = |z|^2\) for a single amplitude \(z\) preserves measure as a function from the complex plane \(\mathbb{C}\) to the ray \(\mathbb{R}_{\ge 0}\). The region in the complex numbers \(\mathbb{C}\) where the length of \(z\) is between \(a\) and \(b\), or \(a \le |z| \le b\), is \(\pi(b^2 – a^2)\). The corresponding interval for the probability is \(a^2 \le p \le b^2\), which thus has length \(b^2-a^2\). That’s all, we’ve proved it! More precisely, the area of any circularly symmetric region in \(\mathbb{C}\) is \(\pi\) times the length of its projection onto \(\mathbb{R}_{\ge 0}\).

The second step is to show the same thing for the Born map from the \(d\)-qudit Hilbert space \(\mathbb{C}^d\) to the \(d\)-digit orthant \(\mathbb{R}_{\ge 0}^d\), again without unit normalization. It’s also measure-preserving, up to a factor of \(\pi^d\) this time, because it’s the same thing in each coordinate separately. To be precise, the volume ratio holds for any region in \(\mathbb{C}^d\) that is invariant under separately rotating each of the \(d\) phases. (Because you can approximate any such region with a union of products of thin annuli.)

The third and final step is the paint principle for comparing surface areas. If you paint the hoods of two cars with the same thin layer of paint and you used the same volume of paint for each one, then you can conclude that the two car hoods have nearly same area. In our case, the Born map takes the region \[ 1 \le |z_0|^2 + |z_1|^2 + \cdots + |z_{d-1}|^2 \le 1+\epsilon \] in \(\mathbb{C}^d\) to the region \[ 1 \le p_0 + p_1 + \cdots + p_{d-1} \le 1+\epsilon \] in the orthant \(\mathbb{R}_{\ge 0}^d\). The former is the unit sphere \(S^{2d-1}\) in \(\mathbb{C}^d\) painted to a thickness of roughly \(\epsilon/2\). The latter is the probability simplex \(\Delta^{n-1}\) painted to a thickness of exactly \(\epsilon\). Taking the limit \(\epsilon \to 0\), the Born map from \(S^{2d-1}\) to \(\Delta^{n-1}\) preserves measure up to a factor of \(2\pi^n\).

You might wonder “why” this argument works even if you accept that it does work. The key is that the exponent 2 appears in two different ways: as the dimension of the complex numbers, and as the exponent used to set probabilities and define spheres. If we try the same argument with real amplitudes, then the volume between “spheres” of radius \(a\) and \(b\) is just \(2(b-a)\), which does not match the length \(b^2-a^2\). The Born map for a single real amplitude is the parabola \(p = x^2\), which clearly distorts length since it is not linear. The higher-dimensional real Born map similarly distorts volumes, whether or not you restrict to unit-length states.

If you’re a bitter-ender who still wants Archimedes’ theorem for real amplitudes, then you might consider the variant formula \(p = |x|\) to obtain a probability \(p\) from a “quantum amplitude” \(x\). Then the “Born” map does preserve measure, but for the trivial reason that \(x = \pm p\) is not really a quantum amplitude, it is a probability with a vestigial sign. Also the unit “sphere” in \(\mathbb{R}^d\) is not really a sphere in this theory, it is a hyperoctahedron. The only “unitary” operators that preserve the unit hyperoctahedron are signed permutation matrices. You can only use them for reversible classical computing or symbolic dynamics; they don’t have the strength of true quantum computing or quantum mechanics.

The fact that the Born map preserves measure also yields a bonus calculation of the volume of the unit ball in \(2d\) real dimensions, if we interpret that as \(d\) complex dimensions. The ball \[ |z_0|^2 + |z_1|^2 + \cdots + |z_{d-1}|^2 \le 1 \] in \(\mathbb{C}^d\) is sent to a different simplex \[ p_0 + p_1 + \cdots + p_{d-1} \le 1 \] in \(\mathbb{R}_{\ge 0}^d\). If we recall that the volume of a \(d\)-dimensional pyramid is \(\frac1d\) times base times height and calculate by induction on \(d\), we get that this simplex has volume \(\frac1{d!}\). Thus, the volume of the \(2d\)-dimensional unit ball is \(\frac{\pi^d}{d!}\).

You might ask whether the volume of a \(d\)-dimensional unit ball is always \(\frac{\pi^{d/2}}{(d/2)!}\) for both \(d\) even and odd. The answer is yes if we interpret factorials using the gamma function formula \(x! = \Gamma(x+1)\) and look up that \(\frac12! = \Gamma(\frac32) = \frac{\sqrt{\pi}}2\). The gamma function was discovered by Euler as a solution to the question of defining fractional factorials, but the notation \(\Gamma(x)\) and the cumbersome shift by 1 is due to Legendre. Although Wikipedia says that no one knows why Legendre defined it this way, I wonder if his goal was to do what the Catholic church later did for itself in 1978: It put a Pole at the origin.

(Scott wanted to censor this joke. In response, I express my loyalty to my nation of birth by quoting the opening of the Polish national anthem: “Poland has not yet died, so long as we still live!” I thought at first that Stanislav Sýkora is Polish since Stanisław and Sikora are both common Polish names, but his name has Czech spelling and he is Czech. Well, the Czechs are cool too.)

Sýkora’s 1974 proof of the generalized Archimedes’ theorem is different from this one. He calculates multivariate moments of the space of unit states \(S^{2d-1} \subseteq \mathbb{C}^d\), and confirms that they match the moments in the probability simplex \(\Delta^{d-1}\). There are inevitably various proofs of this result, and I will give another one.

Another proof, and quantum supremacy

There is a well-known and very useful algorithm to generate a random point on the unit sphere in either \(\mathbb{R}^d\) or \(\mathbb{C}^d\), and a similar algorithm to generate a random point in a simplex. In the former algorithm, we make each real coordinate \(x\) into an independent Gaussian random variable with density proportional to \(e^{-x^2}\;dx\), and then rescale the result to unit length. Since the exponents combine as \[ e^{-x_0^2}e^{-x_1^2}\cdots e^{-x_{d-1}^2} = e^{-(x_0^2 + x_1^2 + \cdots + x_{d-1}^2)}, \] we learn that the total exponent is spherically symmetric. Therefore after rescaling, the result is a uniformly random point on the unit sphere \(S^{d-1} \subseteq \mathbb{R}^d\). Similarly, the other algorithm generates a point in the orthant \(\mathbb{R}_{\ge 0}^d\) by making each real coordinate \(p \ge 0\) an independent random variable with exponential distribution \(e^{-p}\;dp\). This time we rescale the vector until its sum is 1. This algorithm likewise produces a uniformly random point in the simplex \(\Delta^{d-1} \subseteq \mathbb{R}_{\ge 0}^d\) because the total exponent of the product \[ e^{-p_0}e^{-p_1}\cdots e^{-p_{d-1}} = e^{-(p_0 + p_1 + \cdots + p_{d-1})} \] only depends on the sum of the coordinates. Wootters describes both of these algorithms in his 1990 paper, but instead of relating them to give his own proof of the generalized Archimedes’ theorem, he cites Sýkora.

The gist of the proof is that the Born map takes the Gaussian algorithm to the exponential algorithm. Explicitly, the Gaussian probability density for a single complex amplitude \[ z = x+iy = re^{i\theta} \] can be converted from Cartesian to polar coordinate as follows: \[ \frac{e^{-|z|^2}\;dx\;dy}{\pi} = \frac{e^{-r^2}r\;dr\;d\theta}{\pi}. \] I have included the factor of \(r\) that is naturally present in an area integral in polar coordinates. We will need it, and it is another way to see that the theorem relies on the fact that the complex numbers are two-dimensional. To complete the proof, we substitute \(p = r^2\) and remember that \(dp = 2r\;dr\), and then integrate over \(\theta\) (trivially, since the integrand does not depend on \(\theta\)). The density simplifies to \(e^{-p}\;dp\), which is exactly the exponential distribution for a real variable \(p \ge 0\). Since the Born map takes the Gaussian algorithm to the exponential algorithm, and since each algorithm produces a uniformly random point, the Born map must preserve uniform measure. (Scott likes this proof better because it is algorithmic, and because it is probabilistic.)

Now about quantum supremacy. You might think that a random chosen quantum circuit on \(n\) qubits produces a nearly uniformly random quantum state \(|\psi\rangle\) in their joint Hilbert space, but it’s not quite not that simple. When \(n=53\), or otherwise as \(n \to \infty\), a manageable random circuit is not nearly creative enough to either reach or approximate most of the unit states in the colossal Hilbert space of dimension \(d = 2^n\). The state \(|\psi\rangle\) that you get from (say) a polynomial-sized circuit resembles a fully random state in various statistical and computational respects, both proven and conjectured. As a result, if you measure the qubits in the computational basis, you get a randomized state on \(n\) bits that resembles a uniformly random point in \(\Delta^{2^n-1}\).

If you choose \(d\) probabilities, and if each one is an independent exponential random variable, then the law of large numbers says that the sum (which you use for rescaling) is close to \(d\) when \(d\) is large. When \(d\) is really big like \(2^{53}\), a histogram of the probabilities of the bit strings of a supremacy experiment looks like an exponential curve \(f(p) \propto e^{-pd}\). In a sense, the statistical distribution of the bit strings is almost the same almost every time, independent of which random quantum circuit you choose to generate them. The catch is that the position of any given bit string does depend on the circuit and is highly scrambled. I picture it in my mind like this:

It is thought to be computationally intractable to calculate where each bit string lands on this exponential curve, or even where just one of them does. (The exponential curve is attenuated by noise in the actual experiment, but it’s the same principle.) That is one reason that random quantum circuits are supreme.

November 26, 2019

Doug NatelsonGeneral relativity (!) and band structure

Today we had a seminar at Rice by Qian Niu of the University of Texas, and it was a really nice, pedagogical look at this paper (arxiv version here).   Here's the basic idea.

As I wrote about here, in a crystalline solid the periodic lattice means that single-particle electronic states look like Bloch waves, labeled by some wavevector \(\mathbf{k}\), of the form \(u_{\mathbf{k}}(\mathbf{r}) \exp(i \mathbf{k}\cdot \mathbf{r})\) where \(u_{\mathbf{k}}\) is periodic in space like the lattice.  It is possible to write down semiclassical equations of motion of some wavepacket that starts centered around some spatial position \(\mathbf{r}\) and some (crystal) momentum \(\hbar \mathbf{k}\).  These equations tell you that the momentum of the wavepacket changes with time as due to the external forces (looking a lot like the Lorentz force law), and the position of the wavepacket has a group velocity, plus an additional "anomalous" velocity related to the Berry phase (which has to do with the variation of \(u_{\mathbf{k}}\) over the allowed  values of \(\mathbf{k}\)).

The paper asks the question, what are the semiclassical equations of motion for a wavepacket if the lattice is actually distorted a bit as a function of position in real space.  That is, imagine a strain gradient, or some lattice deformation.  In that case, the wavepacket can propagate through regions where the lattice is varying spatially on very long scales while still being basically periodic on shorter scales still long compared to the Fermi wavelength.

It turns out that the right way to tackle this is with the tools of differential geometry, the same tools used in general relativity.  In GR, when worrying how the coordinates of a particle change as it moves along, there is the ordinary velocity, and then there are other changes in the components of the velocity vector because the actual geometry of spacetime (the coordinate system) is varying with position.  You need to describe this with a "covariant derivative", and that involves Christoffel symbols.  In this way, gravity isn't a force - it's freely falling particles propagating as "straight" as they can, but the actual geometry of spacetime makes their trajectory look curved based on our choice of coordinates.

For the semiclassical motion problem in a distorted lattice, something similar happens.  You have to worry about how the wavepacket evolves both because of the local equations of motion, and because the wavepacket is propagating into a new region of the lattice where the \(u_{\mathbf{k}}\) functions are different because the actual lattice is different (and that also affects the Berry phase anomalous velocity piece).   Local rotations of the lattice can lead to an affective Coriolis force on the wavepacket; local strain gradients can lead to effective accelerations of the wavepacket.

(For more fun, you can have temporal periodicity as well.  That means you don't just have Bloch functions in 3d, you have Bloch-Floquet functions in 3+1d, and that's where I fell behind.)

Bottom line:  The math of general relativity is an elegant way to look at semiclassical carrier dynamics in real materials.  I knew that undergrad GR course would come in handy....

November 25, 2019

Terence Tao254A, Notes 9 – second moment and entropy methods

In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.

The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

\displaystyle  \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)

of the prime number theorem (where we take the limit {x \rightarrow \infty}). One can interpret this estimate probabilistically as

\displaystyle  {\mathbb E} \mu(\mathbf{n}) = o(1) \ \ \ \ \ (2)

where {\mathbf{n} = \mathbf{n}_{\leq x}} is a random variable drawn uniformly from the natural numbers up to {x}, and {{\mathbb E}} denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

\displaystyle  \sum_{n \leq x} \mu(n) \mu(n+1).

After dividing by {x}, this is essentially

\displaystyle  {\mathbb E} \mu(\mathbf{n}) \mu(\mathbf{n}+1).

With probabilistic intuition, one may expect the random variables {\mu(\mathbf{n}), \mu(\mathbf{n}+1)} to be approximately independent (there is no obvious relationship between the number of prime factors of {\mathbf{n}}, and of {\mathbf{n}+1}), and so the above average would be expected to be approximately equal to

\displaystyle  ({\mathbb E} \mu(\mathbf{n})) ({\mathbb E} \mu(\mathbf{n}+1))

which by (2) is equal to {o(1)}. Thus we are led to the prediction

\displaystyle  \sum_{n \leq x} \mu(n) \mu(n+1) = o(x). \ \ \ \ \ (3)

The asymptotic (3) is widely believed (it is a special case of the Chowla conjecture, which we will discuss in later notes; while there has been recent progress towards establishing it rigorously, it remains open for now.

How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “{L^2}” measure of independence known as covariance, and the “{L \log L}” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of {L^2} type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of {L \log L} methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as {L^2} methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

\displaystyle  \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = o(\log x) \ \ \ \ \ (4)

of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound {\sum_{n \leq x} \frac{\mu(n)}{n} = o(\log x)}, but the latter bound is much easier to establish than the former).

As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them on average, rather than uniformly. For instance, given two random variables {\mathbf{X}} and {\mathbf{Y}} of number-theoretic origin (such as the random variables {\mu(\mathbf{n})} and {\mu(\mathbf{n}+1)} mentioned previously), it can often be extremely difficult to determine the extent to which {\mathbf{X},\mathbf{Y}} behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if {\mathbf{Y}_1,\dots,\mathbf{Y}_k} are a large collection of “independent” random variables, and {\mathbf{X}} is a further random variable that is “not too large” in some sense, then {\mathbf{X}} must necessarily be nearly independent (or conditionally independent) to many of the {\mathbf{Y}_i}, even if one cannot pinpoint precisely which of the {\mathbf{Y}_i} the variable {\mathbf{X}} is independent with. In the case of the second moment method, this allows us to compute correlations such as {{\mathbb E} {\mathbf X} \mathbf{Y}_i} for “most” {i}. The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to say non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities {{\mathbb E} F( {\mathbf X}, \mathbf{Y}_i )} for “most” {i} thanks to tools such as the Pinsker inequality.

— 1. Second moment methods —

In this section we discuss probabilistic techniques of an “{L^2}” nature. We fix a probability space {(\Omega, {\mathcal F}, \mathop{\bf P})} to model all of random variables; thus for instance we shall model a complex random variable {\mathbf{X}} in these notes by a measurable function {\mathbf{X}: \Omega \rightarrow {\bf C}}. (Strictly speaking, there is a subtle distinction one can maintain between a random variable and its various measure-theoretic models, which becomes relevant if one later decides to modify the probability space {\Omega}, but this distinction will not be so important in these notes and so we shall ignore it. See this previous set of notes for more discussion.)

We will focus here on the space {L^2 = L^2(\Omega, {\mathcal F}, \mathop{\bf P})} of complex random variables {\mathbf{X}} (that is to say, measurable maps {\mathbf{X}: \Omega \rightarrow {\bf C}}) whose second moment

\displaystyle  {\mathbb E}(|\mathbf{X}|^2)

of {\mathbf{X}} is finite. In many number-theoretic applications the finiteness of the second moment will be automatic because {\mathbf{X}} will only take finitely many values. As is well known, the space {L^2} has the structure of a complex Hilbert space, with inner product

\displaystyle  \langle \mathbf{X}, \mathbf{Y} \rangle_{L^2} := {\mathbb E}(\mathbf{X} \overline{\mathbf{Y}})

and norm

\displaystyle  \| \mathbf{X} \|_{L^2} := ({\mathbb E} (|\mathbf{X}|^2))^{1/2}

for {\mathbf{X},\mathbf{Y} \in L^2}. By slight abuse of notation, the complex numbers {{\bf C}} can be viewed as a subset of {L^2}, by viewing any given complex number {z} as a constant (deterministic) random variable. Then {{\bf C}} is a one-dimensional subspace of {L^2}, spanned by the unit vector {1 \in L^2}. Given a random variable {\mathbf{X}} to {L^2}, the projection of {\mathbf{X}} to {{\bf C}} is then the mean

\displaystyle  {\mathbb E} \mathbf{X} =\langle \mathbf{X}, 1 \rangle,

and we obtain an orthogonal splitting {\mathbf{X} = ({\mathbb E} \mathbf{X}) + (\mathbf{X} - {\mathbb E} \mathbf{X})} of any {\mathbf{X} \in L^2} into its mean {{\mathbb E} \mathbf{X}} and its mean zero part {\mathbf{X} - {\mathbb E} \mathbf{X}}. By Pythagoras’ theorem, we then have

\displaystyle  {\mathbb E}(|\mathbf{X}|^2) = {\mathbb E}(|\mathbf{X} - {\mathbb E} \mathbf{X}|^2) + {\mathbb E} |{\mathbb E} \mathbf{X}|^2.

The first quantity on the right-hand side is the square of the distance from {\mathbf{X}} to {{\bf C}}, and this non-negative quantity is known as the variance

\displaystyle  \mathrm{Var} \mathbf{X} := {\mathbb E}(|\mathbf{X} - {\mathbb E} \mathbf{X}|^2) = {\mathbb E}(|\mathbf{X}|^2) - |{\mathbb E} \mathbf{X}|^2. \ \ \ \ \ (5)

The square root {(\mathrm{Var} \mathbf{X})^{1/2} = \| \mathbf{X} - {\mathbb E} \mathbf{X} \|_{L^2}} of the variance is known as the standard deviation. The variance controls the distribution of the random variable {\mathbf{X}} through Chebyshev’s inequality

\displaystyle  {\mathbb P}( |\mathbf{X} - {\mathbb E} \mathbf{X}| \geq \lambda ) \leq \frac{\mathrm{Var}(\mathbf{X})}{\lambda^2} \ \ \ \ \ (6)

for any {\lambda > 0}, which is immediate from observing the inequality {1_{|\mathbf{X} - {\mathbb E} \mathbf{X}| \geq \lambda} \leq \frac{|\mathbf{X} - {\mathbb E} \mathbf{X}|^2}{\lambda^2}} and then taking expectations of both sides. Roughly speaking, this inequality asserts that {\mathbf{X}} typically deviates from its mean {{\mathbb E} \mathbf{X}} by no more than a bounded multiple of the standard deviation {(\mathrm{Var} \mathbf{X})^{1/2}}.

A slight generalisation of Chebyshev’s inequality that can be convenient to use is

\displaystyle  {\mathbb P}( |\mathbf{X} - \mu| \geq \lambda ) \leq \frac{\mathrm{Var}(\mathbf{X}) + |{\mathbb E} \mathbf{X} - \mu|^2}{\lambda^2} \ \ \ \ \ (7)

for any {\lambda > 0} and any complex number {\mu} (which typically will be a simplified approximation to the mean {{\mathbb E} X}), which is proven similarly to (6) but noting (from (5)) that {{\mathbb E} |\mathrm{X}-\mu|^2 = \mathrm{Var}(\mathbf{X}) + |{\mathbb E} \mathbf{X} - \mu|^2}.

Informally, (6) is an assertion that a square-integrable random variable {\mathbf{X}} will concentrate around its mean {{\mathbb E} \mathbf{X}} if its variance is not too large. See these previous notes for more discussion of the concentration of measure phenomenon. One can often obtain stronger concentration of measure than what is provided by Chebyshev’s inequality if one is able to calculate higher moments than the second moment, such as the fourth moment {{\mathbb E} |\mathbf{X}|^4} or exponential moments {{\mathbb E} e^{c\mathbf{X}}}, but we will not pursue this direction in this set of notes.

Clearly the variance is homogeneous of order two, thus

\displaystyle  \mathrm{Var}(c\mathbf{X}) = |c|^2 \mathrm{Var}(\mathbf{X})

for any {\mathbf{X} \in L^2} and {c \in {\bf C}}. In particular, the variance is not always additive: the claim {\mathrm{Var}(\mathbf{X}+\mathbf{Y}) = \mathrm{Var}(\mathbf{X}) + \mathrm{Var}(\mathbf{Y})} fails in particular when {\mathbf{X}=\mathbf{Y}} is not almost surely zero. However, there is an important substitute for this formula. Given two random variables {\mathbf{X},\mathbf{Y} \in L^2}, the inner product of the corresponding mean zero parts {\mathbf{X} - {\mathbb E} \mathbf{X}, \mathbf{Y} - {\mathbb E} \mathbf{Y}} is a complex number known as the covariance:

\displaystyle  \mathrm{Cov}(\mathbf{X},\mathbf{Y}) := \langle \mathbf{X}-{\mathbb E} \mathbf{X}, \mathbf{Y} - {\mathbb E} \mathbf{Y} \rangle_{L^2} = {\mathbb E}( (\mathbf{X}-{\mathbb E} \mathbf{X}) \overline{(\mathbf{Y} - {\mathbb E} \mathbf{Y})} ).

As {{\mathbb E} \mathbf{X}, {\mathbb E} \mathbf{Y}} are orthogonal to {\mathbf{X}-{\mathbb E} \mathbf{X}, \mathbf{Y} - {\mathbb E} \mathbf{Y}}, it is not difficult to obtain the alternate formula

\displaystyle  \mathrm{Cov}(\mathbf{X},\mathbf{Y}) := \langle \mathbf{X}, \mathbf{Y} \rangle_{L^2} - \langle {\mathbb E} \mathbf{X}, {\mathbb E} \mathbf{Y} \rangle = {\mathbb E}(\mathbf{X} \overline{\mathbf{Y}}) - ({\mathbb E} \mathbf{X}) \overline{{\mathbb E} \mathbf{Y}} \ \ \ \ \ (8)

for the covariance.

The covariance is then a positive semi-definite inner product on {L^2} (it basically arises from the Hilbert space structure of the space {{\bf C}^\perp} of mean zero variables), and {\mathrm{Cov}(\mathbf{X},\mathbf{X}) = \mathrm{Var}(\mathbf{X})}. From the Cauchy-Schwarz inequality we have

\displaystyle  |\mathrm{Cov}(\mathbf{X},\mathbf{Y})| \leq \mathrm{Var}(\mathbf{X})^{1/2} \mathrm{Var}(\mathbf{Y})^{1/2}.

If {\mathbf{X}, \mathbf{Y}} have non-zero variance (that is, they are not almost surely constant), then the ratio

\displaystyle  \frac{\mathrm{Cov}(\mathbf{X},\mathbf{Y})}{\mathrm{Var}(\mathbf{X})^{1/2} \mathrm{Var}(\mathbf{Y})^{1/2}}

is then known as the correlation between {\mathbf{X}} and {\mathbf{Y}}, and is a complex number of magnitude at most {1}; for real-valued {\mathbf{X}, \mathbf{Y}} that are not almost surely constant, the correlation is instead a real number between {-1} and {1}. At one extreme, a correlation of magnitude {1} occurs if and only if {\mathbf{X} - {\mathbb E} \mathbf{X}} is a scalar multiple of {\mathbf{Y} - {\mathbb E} \mathbf{Y}}. At the other extreme, a correlation of zero is an indication (though not a guarantee) of independence. Recall that two random variables {\mathbf{X}, \mathbf{Y}} are independent if one has

\displaystyle  {\mathbb P}( \mathbf{X} \in E, \mathbf{Y} \in F ) = {\mathbb P}( \mathbf{X} \in E ) {\mathbb P}( \mathbf{Y} \in F )

for all (Borel) measurable {E,F}. In particular, setting {E = [0,x]}, {F = [0,y]} for {x,y \geq 0} and integrating using Fubini’s theorem, we conclude that

\displaystyle  {\mathbb E} \max(\mathbf{X},0) \max(\mathbf{Y},0) = ({\mathbb E} \max(\mathbf{X},0)) ({\mathbb E} \max(\mathbf{Y},0));

similarly with {\max(\mathbf{X},0)} replaced by {-\min(\mathbf{X},0)}, and similarly for {\mathbf{Y}}. In particular we have

\displaystyle  {\mathbb E} \mathbf{X} \mathbf{Y} = ({\mathbb E} \mathbf{X}) ({\mathbb E} \mathbf{Y})

and thus from (8) we thus see that independent random variables have zero covariance (and zero correlation, when they are not almost surely constant). On the other hand, the converse fails:

Exercise 1 Provide an example of two random variables {\mathbf{X}, \mathbf{Y} \in L^2} which are not independent, but which have zero correlation or covariance with each other. (There are many ways to produce some examples. One comes from exploiting various systems of orthogonal functions, such as sines and cosines. Another comes from working with random variables taking only a small number of values, such as {\{-1,0,1\}}.

From the cosine rule we have

\displaystyle  \mathrm{Var}(\mathbf{X}+\mathbf{Y}) = \mathrm{Var}(\mathbf{X}) + \mathrm{Var}(\mathbf{Y}) + 2 \mathrm{Re} \mathrm{Cov}( \mathbf{X}, \mathbf{Y} ) \ \ \ \ \ (9)

and more generally

\displaystyle  \mathrm{Var}(\mathbf{X}_1+ \dots + \mathbf{X}_n) = \sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i) + 2 \sum_{1 \leq i < j \leq n} \mathrm{Re} \mathrm{Cov}( \mathbf{X}_i, \mathbf{X}_j ) \ \ \ \ \ (10)

for any finite collection of random variables {\mathbf{X}_1, \dots, \mathbf{X}_n \in L^2}. These identities combine well with Chebyshev-type inequalities such as (6), (7), and this leads to a very common instance of the second moment method in action. For instance, we can use it to understand the distribution of the number of prime factors of a random number {\mathbf{n}} that fall within a given set {{\mathcal P}}. Given any set {A} of natural numbers, define the logarithmic size {l(A)} to be the quantity

\displaystyle  l(A) := \sum_{n \in A} \frac{1}{n}.

Thus for instance Euler’s theorem asserts that the primes have infinite logarithmic size.

Lemma 2 (Turan-Kubilius inequality, special case) Let {I \subset {\bf R}} be an interval of length at least {1}, and let {\mathbf{n}} be an integer drawn uniformly at random from this interval, thus

\displaystyle  {\mathbb P}( \mathbf{n} \in E ) = \frac{\#(E \cap I)}{\#({\bf Z} \cap I)}

for all {E \subset {\bf Z}}. Let {{\mathcal P}} be a finite collection of primes, all of which have size at most {P}. Then the random variable {\mathbf{\omega} := \# \{ p \in {\mathcal P}: p | \mathbf{n} \}} has mean

\displaystyle  {\mathbb E} \mathbf{\omega} = l(\mathcal{P}) + O( \frac{P}{|I|} )

and variance

\displaystyle  \mathrm{Var}( \mathbf{\omega} ) \ll l(\mathcal{P}) + \frac{P^2}{|I|}.

In particular,

\displaystyle  {\mathbb E} |\mathbf{\omega} - l(\mathcal{P})|^2 \ll l(\mathcal{P}) + \frac{P^2}{|I|}

and from (7) we have

\displaystyle  {\mathbb P}( |\mathbf{\omega} - l({\mathcal P}) | \geq \lambda ) \ll \frac{l({\mathcal P}) + \frac{P^2}{|I|} }{\lambda^2}

for any {\lambda > 0}.

Proof: For any natural number {d}, we have

\displaystyle  \sum_{n \in I} 1_{d|n} = \sum_{n \in I/d} = \frac{1}{d} |I| + O(1)

and hence

\displaystyle  {\mathbb P}( d | \mathbf{n}) = \frac{\frac{1}{d} |I| + O(1)}{|I| + O(1)} = \frac{1}{d} + O( \frac{1}{|I|} ). \ \ \ \ \ (11)

We now write {\mathbf{\omega} = \sum_{p \in {\mathcal P}} 1_{p|{\mathbf n}}}. From (11) we see that each indicator random variable {1_{p|{\mathbf n}}}, {p \in {\mathcal P}} has mean {\frac{1}{p} + O( \frac{1}{|I|})} and variance {O( \frac{1}{p} + \frac{1}{|I|})}; similarly, for any two distinct {p_1, p_2 \in {\mathcal P}}, we see from (11), (8) the indicators {1_{p_1|{\mathbf n}}}, {1_{p_2|{\mathbf n}}} have covariance

\displaystyle  \mathrm{Cov}(1_{p_1|{\mathbf n}}, 1_{p_2|{\mathbf n}}) = \frac{1}{p_1 p_2} + O( \frac{1}{|I|} ) - ( \frac{1}{p_1} + O( \frac{1}{|I|} ) ) ( \frac{1}{p_2} + O( \frac{1}{|I|} ) ) = O( \frac{1}{|I|} )

and the claim now follows from (10). \Box

The exponents of {P} in the error terms here are not optimal; but in practice, we apply this inequality when {|I|} is much larger than any given power of {P}, so factors such as {P^3/|I|} will be negligible. Informally speaking, the above lemma asserts that a typical number in a large interval {I} will have roughly {l({\mathcal P})} prime factors in a given finite set {{\mathcal P}} of primes, as long as the logarithmic size {l({\mathcal P})} is large.

If we apply the above lemma to {I = [1,x]} for some large {x}, and {{\mathcal P}} equal to the primes up to (say) {P = x^{0.1}}, we have {l({\mathcal P}) = \log\log x + O(1)}, and hence

\displaystyle  {\mathbb E} |\mathbf{\omega} - \log\log x |^2 \ll \log \log x.

Since {\omega(\mathbf{n}) = \mathbf{\omega} + O(1)}, we recover the main result

\displaystyle  \frac{1}{x} \sum_{n \leq x} |\omega(x) - \log\log x|^2 \ll \log\log x

of Section 5 of Notes 1 (indeed this is essentially the same argument as in that section, dressed up in probabilistic language). In particular, we recover the Hardy-Ramanujan law that a proportion {1-o(1)} of the natural numbers in {[1,x]} have {(1+o(1)) \log\log x} prime factors.

Exercise 3 (Turan-Kubilius inequality, general case) Let {f: {\bf N} \rightarrow {\bf C}} be an additive function (which means that {f(n+m) = f(n) + f(m)} whenever {n,m} are coprime. Show that

\displaystyle  \frac{1}{x} \sum_{n \leq x} |f(n) - A|^2 \ll \sum_{p \leq x} \sum_{j \leq \log x/\log p} |f(p^j)|^2 / p^j


\displaystyle  A := \sum_{p \leq x} \sum_{j \leq \log x/\log p} |f(p^j)| p^{-j} (1-\frac{1}{p}).

(Hint: one may first want to work with the special case when {f(p^j)} vanishes whenever {p^j \geq x^{0.1}} so that the second moment method can be profitably applied, and then figure out how to address the contributions of prime powers larger than {x^{0.1}}.)

Exercise 4 (Turan-Kubilius inequality, logarithmic version) Let {2 \leq y \leq x} with {x/y \geq 2}, and let {{\mathcal P}} be a collection of primes of size less than {P} with {P \leq y^{0.1}}. Show that

\displaystyle  \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{|\sum_{p \in {\mathcal P}} 1_{p|n} - l({\mathcal P})|^2}{n} \ll l({\mathcal P}) + 1 \ \ \ \ \ (12)

Exercise 5 (Paley-Zygmund inequality) Let {\mathbf{X} \in L^2} be non-negative with positive mean. Show that

\displaystyle  {\mathbb P}( \mathbf{X} > \lambda {\mathbb E} \mathbf{X}) \geq (1-\lambda)^2 \frac{ ({\mathbb E} \mathbf{X})^2 }{ {\mathbb E} (\mathbf{X}^2) }.

This inequality can sometimes give slightly sharper results than the Chebyshev inequality when using the second moment method.

Now we give a useful lemma that quantifies a heuristic mentioned in the introduction, namely that if several random variables {\mathbf{Y}_1,\dots,\mathbf{Y}_n} do not correlate with each other, then it is not possible for any further random variable {\mathbf{X}} to correlate with many of them simultaneously. We first state an abstract Hilbert space version.

Lemma 6 (Bessel type inequality, Hilbert space version) If {u, v_1,\dots,v_n} are elements of a Hilbert space {H}, and {w_1,\dots,w_n} are positive reals, then

\displaystyle  \left( \sum_{i=1}^n w_i |\langle u, v_i \rangle_{H}|^2\right)^{1/2} \leq \|u \|_{H} \left( \sup_{1 \leq i \leq n} \sum_{j=1}^n w_j |\langle v_i, v_j\rangle_{H}| \right)^{1/2}. \ \ \ \ \ (13)

Proof: We use the duality method. Namely, we can write the left-hand side of (13) as

\displaystyle  \sum_{i=1}^n w_i \langle u, v_i \rangle_{H} \overline{c_i} = \langle u, \sum_{i=1}^n w_i c_i v_i \rangle_{H}

for some complex numbers {c_i} with {\sum_{i=1}^n w_i |c_i|^2=1} (just take {c_i} to be {\langle u, v_i \rangle_{H}} normalised by the left-hand side of (14), or zero if that left-hand side vanishes. By Cauchy-Schwarz, it then suffices to establish the dual inequality

\displaystyle  \|\sum_{i=1}^n w_i c_i v_i \|_{H}^2 \leq \sup_{1 \leq i \leq n} \sum_{j=1}^n w_j |\langle v_i, v_j\rangle_{H}|.

The left-hand side can be written as

\displaystyle  \sum_{i=1}^n \sum_{j=1}^n w_i w_j c_i \overline{c_j} \langle v_i, v_j \rangle.

Using the arithmetic mean-geometric mean inequality {c_i \overline{c_j} \leq \frac{1}{2} |c_i|^2 + \frac{1}{2} |c_j|^2} and symmetry, this may be bounded by

\displaystyle  \sum_{i=1}^n w_i |c_i|^2 \sum_{j=1}^n w_j |\langle v_i, v_j \rangle|.

Since {\sum_{i=1}^n w_i |c_i|^2 =1}, the claim follows. \Box

Corollary 7 (Bessel type inequality, probabilistic version) If {\mathbf{X}, \mathbf{Y}_1,\dots,\mathbf{Y}_n \in L^2}, and {w_1,\dots,w_n} are positive reals, then

\displaystyle  \sum_{i=1}^n w_i |\mathrm{Cov}( \mathbf{X}, \mathbf{Y}_i )|^2 \ \ \ \ \ (14)

\displaystyle  \leq \mathrm{Var}(\mathbf{X}) \left( \sup_{1 \leq i \leq n} \sum_{j=1}^n w_j |\mathrm{Cov}(\mathbf{Y}_i, \mathbf{Y}_j)| \right).

Proof: By subtracting the mean from each of {\mathbf{X}, \mathbf{Y}_i} we may assume that these random variables have mean zero. The claim now follows from Lemma 6. \Box

To get a feel for this inequality, suppose for sake of discussion that {\mathbf{X}} and {\mathbf{Y}_i} all have unit variance and {w_i = 1}, but that the {\mathbf{Y}_i} are pairwise uncorrelated. Then the right-hand side is equal to {1}, and the left-hand side is the sum of squares of the correlations between {\mathbf{X}} and each of the {\mathbf{Y}_i}. Any individual correlation is then still permitted to be as large as {1}, but it is not possible for multiple correlations to be this large simultaneously. This is geometrically intuitive if one views the random variables {\mathbf{X}, \mathbf{Y}_i} as vectors in a Hilbert space (and correlation as a rough proxy for the angle between such vectors). This lemma also shares many commonalities with the large sieve inequality, discussed in this set of notes.

One basic number-theoretic application of this inequality is the following sampling inequality of Elliott, that lets one approximate a sum {\sum_{n \in I} f(n)} of an arithmetic function by its values {\sum_{n \in I} f(n) p 1_{p|n}} on multiples of primes {p}:

Exercise 8 (Elliott’s inequality) Let {I \subset {\bf R}} be an interval of length at least {1}. Show that for any function {f: {\bf Z} \rightarrow {\bf C}}, one has

\displaystyle  \sum_{p \leq |I|^{0.1}} \frac{1}{p} \left| \frac{1}{|I|} \sum_{n \in I} f(n) - \frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n} \right|^2 \ll \frac{1}{|I|} \sum_{n \in I} |f(n)|^2.

(Hint: Apply Corollary 7 with {X := f({\mathbf n})}, {Y_p := p 1_{p|\mathbf{n}}}, and {w_p := 1/p}, where {\mathbf{n}} is the uniform variable from Lemma 2.) Conclude in particular that for every {\varepsilon > 0}, one has

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n) = \frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n} + O\left( \varepsilon \left(\frac{1}{|I|} \sum_{n \in I} |f(n)|^2\right)^{1/2} \right)

for all primes {p \leq |I|} outside of a set of exceptional primes of logarithmic size {O(\varepsilon^{-2})}.

Informally, the point of this inequality is that an arbitrary arithmetic function {f} may exhibit correlation with the indicator function {1_{p|n}} of the multiples of {p} for some primes {p}, but cannot exhibit significant correlation with all of these indicators simultaneously, because these indicators are not very correlated to each other. We note however that this inequality only gains a tiny bit over trivial bounds, because the set of primes up to {x} only has logarithmic size {O(\log\log x)} by Mertens’ theorems; thus, any asymptotics that are obtained using this inequality will typically have error terms that only improve upon the trivial bound by factors such as {\log\log x}.

Exercise 9 (Elliott’s inequality, logarithmic form) Let {2 \leq y \leq x} with {x/y \geq 2}. Show that for any function {f: {\bf Z} \rightarrow {\bf C}}, one has

\displaystyle  \sum_{p \leq y^{0.1}} \frac{1}{p} \left| \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{f(n)}{n} - \frac{1}{\log(x/y)} \sum_{y/p \leq n \leq x/p} \frac{f(pn)}{n} \right|^2

\displaystyle  \ll \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{|f(n)|^2}{n}

and thus for every {\varepsilon>0}, one has

\displaystyle  \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{f(n)}{n} = \frac{1}{\log(x/y)} \sum_{y/p \leq n \leq x/p} \frac{f(pn)}{n}

\displaystyle  + O\left( \varepsilon \left( \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{|f(n)|^2}{n} \right)^{1/2} \right)

for all primes {p \leq y} outside of an exceptional set of primes of logarithmic size {O(\varepsilon^{-2})}.

Exercise 10 Use Exercise (9) and a duality argument to provide an alternate proof of Exercise 4. (Hint: express the left-hand side of (12) as a correlation between {\sum_{p \in {\mathcal P}} 1_{p|n} - l({\mathcal P})} and some suitably {L^2}-normalised arithmetic function {f(n)}.)

As a quick application of Elliott’s inequality, let us establish a weak version of the prime number theorem:

Proposition 11 (Weak prime number theorem) For any {\varepsilon>0} we have

\displaystyle  \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} = O(\varepsilon)

whenever {y, x/y} are sufficiently large depending on {\varepsilon}.

This estimate is weaker than what one can obtain by existing methods, such as Exercise 56 of Notes 1. However in the next section we will refine this argument to recover the full prime number theorem.

Proof: Fix {\varepsilon}, and suppose that {y, x/y} are sufficiently large. From Exercise 9 one has

\displaystyle  \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} = \frac{1}{\log(x/y)} \sum_{y/p \leq n \leq x/p} \frac{\mu(pn)}{n} + O(\varepsilon)

for all primes {p \leq y} outside of an exceptional set of logarithmic size {O(\varepsilon^{-2})}. If we restrict attention to primes {p \leq (x/y)^\varepsilon} then one sees from the integral test that one can replace the sum {\sum_{y/p \leq n \leq x/p}} by {\sum_{y \leq n \leq x}} and only incur an additional error of {O(\varepsilon)}. If we furthermore restrict to primes {p} larger than {\varepsilon}, then the contribution of those {n} that are divisible by {p} is also {O(\varepsilon)}. For {n} not divisible by {p}, one has {\mu(pn) = -\mu(n)}. Putting all this together, we conclude that

\displaystyle  \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} = - \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} + O(\varepsilon)

for all primes {1/\varepsilon \leq p \leq (x/y)^\varepsilon, y} outside of an exceptional set of logarithmic size {O(\varepsilon^{-2})}. In particular, for {y, x/y} large enough this statement is true for at least one such {p}. The claim then follows. \Box

As another application of Elliott’s inequality, we present a criterion for orthogonality between multiplicative functions and other sequences, first discovered by Katai (with related results also introduced earlier by Daboussi and Delange), and rediscovered by Bourgain, Sarnak, and Ziegler:

Proposition 12 (Daboussi-Delange-Katai-Bourgain-Sarnak-Ziegler criterion) Let {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function with {|f(n)| \leq 1} for all {n}, and let {g: {\bf N} \rightarrow {\bf C}} be another bounded function. Suppose that one has

\displaystyle  \frac{1}{x} \sum_{n \leq x} g(pn) \overline{g(qn)} = o(1)

as {x \rightarrow \infty} for any two distinct primes {p,q}. Then one has

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \overline{g(n)} = o(1)

as {x \rightarrow \infty}.

Proof: Suppose the claim fails, then there exists {\varepsilon>0} (which we can assume to be small) and arbitrarily large {x} such that

\displaystyle  |\frac{1}{x} \sum_{n \leq x} f(n) \overline{g(n)}| \geq \varepsilon.

By Exercise 8, this implies that

\displaystyle  |\frac{1}{x} \sum_{n \leq x} f(n) \overline{g(n)} p 1_{p|n}| \gg \varepsilon \ \ \ \ \ (15)

for all primes {p \leq x} outside of an exceptional set of logarithmic size {O(\varepsilon^{-2})}. Call such primes “good primes”. In particular, by the pigeonhole principle, and assuming {x} large enough, there exists a dyadic range {[P, (1+\varepsilon^2) P]} with {\varepsilon^{-10} \ll P \ll_\varepsilon 1} which contains {\gg \varepsilon^2 \frac{P}{\log P}} good primes.

Fix a good prime {p} in {[P, (1+\varepsilon^2)P]}. From (15) we have

\displaystyle  |\frac{1}{x/P} \sum_{n \leq x/p} f(pn) \overline{g(pn)}| \gg \varepsilon.

We can replace the range {n \leq x/p} by {n \leq x/P} with negligible error. We also have {f(pn) = f(p) f(n)} except when {n} is a multiple of {p}, but this latter case only contributes {O(1/p)} which is also negligible compared to the right-hand side. We conclude that

\displaystyle  |\frac{1}{x/P} \sum_{n \leq x/P} f(n) \overline{g(pn)}| \gg \varepsilon

for every good prime. On the other hand, from Lemma 6 we have

\displaystyle  \sum_p |\frac{1}{x/P} \sum_{n \leq x/P} f(n) \overline{g(pn)}|^2 \ll \sup_p \sum_q |\frac{1}{x/P} \sum_{n \leq x/P} g(pn) \overline{g(qn)}|

where {p,q} range over the good primes in {[P,(1+\varepsilon) P]}. The left-hand side is then {\gg \varepsilon^3 \frac{P}{\log P}}, and by hypothesis the right-hand side is {O(1)} for {x} large enough. As {P \geq \varepsilon^{-10}} and {\varepsilon} is small, this gives the desired contradiction \Box

Exercise 13 (Daboussi-Delange theorem) Let {\alpha} be irrational, and let {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function with {|f(n)| \leq 1} for all {n}. Show that

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) e(-\alpha n) = o(1) \ \ \ \ \ (16)

as {x \rightarrow \infty}. If instead {\alpha} is rational, show that there exists {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function with {|f(n)| \leq 1} for which the statement (16) fails. (Hint: use Dirichlet characters and Plancherel’s theorem for finite abelian groups.)

— 2. An elementary proof of the prime number theorem —

Define the Mertens function

\displaystyle  M(x) := \sum_{n \leq x} \mu(n).

As shown in Theorem 58 of Notes 1, the prime number theorem is equivalent to the bound

\displaystyle  \frac{M(x)}{x}=o(1) \ \ \ \ \ (17)

as {x \rightarrow \infty}. We now give a recent proof of this theorem, due to Redmond McNamara (personal communication), that relies primarily on Elliott’s inequality and the Selberg symmetry formula; it is a relative of the standard elementary proof of this theorem due to Erdös and Selberg. In order to keep the exposition simple, we will not arrange the argument in a fashion that optimises the decay rate {o(1)} (in any event, there are other proofs of the prime number theorem that give significantly stronger bounds).

Firstly we see that Elliott’s inequality gives the following weaker version of (17):

Lemma 14 (Oscillation for Mertens’ function) If {x \geq 2} and {\varepsilon > 0}, then we have

\displaystyle  \frac{M(x)}{x} = -\frac{M(x/p)}{x/p} + O(\varepsilon)

for all primes {p \leq x} outside of an exceptional set of primes of logarithmic size {O(\varepsilon^{-2})}.

Proof: We may assume {\varepsilon \leq 1} as the claim is trivial otherwise. From Exercise 8 applied to {I=[1,x]} and {f=\mu}, we have

\displaystyle  \frac{M(x)}{x} = \frac{1}{x/p} \sum_{n \leq x/p} \mu(pn) + O(\varepsilon)

for all {p \leq x} outside of an exceptional set of primes of logarithmic size {O(\varepsilon^2)}. Since {\mu(pn)=-\mu(n)} for {n} not divisible by {p}, the right-hand side can be written as

\displaystyle  -\frac{M(x/p)}{x/p} + O(\varepsilon) + O(1/p).

Since {1/p \leq \varepsilon} outside of an exceptional set of logarithmic size {O(\varepsilon^{-2})}, the claim follows. \Box

Informally, this lemma asserts that {\frac{M(x)}{x} \approx -\frac{M(x/p)}{x/p}} for most primes {p}, which morally implies that {\frac{M(x)}{x} \approx \frac{M(x/p_1p_2)}{x/p_1 p_2}} for most primes {p_1,p_2}. If we can then locate suitable primes {p,p_1,p_2} with {p \approx p_1 p_2}, thus should then lead to {\frac{M(x)}{x} \approx -\frac{M(x)}{x}}, which should then yield the prime number theorem {\frac{M(x)}{x} = o(1)}. The manipulations below are intended to make this argument rigorous.

It will be convenient to work with a logarithmically averaged version of this claim.

Corollary 15 (Logarithmically averaged oscillation) If {\varepsilon>0} and {x} is sufficiently large depending on {\varepsilon}, then

\displaystyle  \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x \ \ \ \ \ (18)

for all {p \leq \log x} outside of an exceptional set of logarithmic size {O(\varepsilon^{-3})}.

Proof: For each {\log x \leq t \leq x}, we have from the previous lemma that

\displaystyle  |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \leq \varepsilon

for all {p \leq \log x} outside of an exceptional set {{\mathcal P}_t} of logarithmic size {O(\varepsilon^{-2})}. We then have

\displaystyle  \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x + \int_{\log x}^x 1_{p \in {\mathcal P}_t} \frac{dt}{t}

so it suffices by Markov’s inequality to show that

\displaystyle  \sum_{p \leq \log x} \frac{1}{p} \int_{\log x}^x 1_{p \in {\mathcal P}_t} \frac{dt}{t} \ll \varepsilon^{-2} \log x.

But by Fubini’s theorem, the left-hand side may be bounded by

\displaystyle  \int_{\log x}^x \varepsilon^{-2} \frac{dt}{t}

and the claim follows. \Box

Let {\varepsilon>0} be sufficiently small, and let {x} be sufficiently large depending on {\varepsilon}. Call a prime {p \leq \log x} good if the bound (18) holds and bad otherwise, thus all primes {p \leq \log x} outside of an exceptional set of bad primes of logarithmic size {O(\varepsilon^{-3})} are good. Now we observe that we can make {M(x)/x} small as long as we can make two good primes multiply to be close to a third:

Proposition 16 Suppose there are good primes {p_1, p_2, p \leq (\log x)^{0.1}} with {p = (1+O(\varepsilon)) p_1 p_2}. Then {\frac{M(x)}{x} = O(\varepsilon)}.

Proof: By definition of good prime, we have the bounds

\displaystyle  \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p_1)}{t/p_1}| \frac{dt}{t} \ll \varepsilon \log x \ \ \ \ \ (19)

\displaystyle  \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p_2)}{t/p_2}| \frac{dt}{t} \ll \varepsilon \log x \ \ \ \ \ (20)

\displaystyle  \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x. \ \ \ \ \ (21)

We rescale (20) by {p_1} to conclude that

\displaystyle  \int_{\log x/p_1}^{x/p_1} |\frac{M(t/p_1)}{t/p_1} + \frac{M(t/p_1p_2)}{t/p_1p_2}| \frac{dt}{t} \ll \varepsilon \log x.

We can replace the integration range here from {[\log x/p_1, x/p_1]} to {[\log x, x]} with an error of {O(\varepsilon \log x)} if {x} is large enough. Also, since {p = (1+O(\varepsilon)) p_1 p_2}, we have {\frac{M(t/p_1p_2)}{t/p_1p_2} = \frac{M(t/p)}{t/p}}. Thus we have

\displaystyle  \int_{\log x}^{x} |\frac{M(t/p_1)}{t/p_1} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x.

Combining this with (19), (21) and the triangle inequality (writing {\frac{M(t)}{t}} as a linear combination of {\frac{M(t)}{t} + \frac{M(t/p_1)}{t/p_1}}, {\frac{M(t)}{t} + \frac{M(t/p)}{t/p}}, and {\frac{M(t/p_1)}{t/p_1} + \frac{M(t/p)}{t/p}}) we conclude that

\displaystyle  \int_{\log x}^x |\frac{M(t)}{t}| \frac{dt}{t} \ll \varepsilon \log x. \ \ \ \ \ (22)

This is an averaged version of the claim we need. To remove the averaging, we use the identity {\mu L = - \mu * \Lambda} (see equation (63) of Notes 1) to conclude that

\displaystyle  \sum_{n \leq x} \mu(n) \log n = - \sum_{m \leq x} \Lambda(m) M(x/m).

From the triangle inequality one has

\displaystyle  |M(x/m)| \ll \varepsilon^{-1} \int_{x/m}^{(1+\varepsilon)x/m} \frac{M(t)}{t}\ dt + \varepsilon \frac{x}{m}

and hence by Mertens’ theorem

\displaystyle  \sum_{n \leq x} \mu(n) \log n \ll \varepsilon^{-1} \sum_{m \leq x} \Lambda(m) \int_{x/m}^{(1+\varepsilon)x/m} \frac{|M(t)|}{t}\ dt + \varepsilon x \log x.

From the Brun-Titchmarsh inequality (Corollary 61 of Notes 1) we have

\displaystyle  \sum_m \Lambda(m) 1_{x/m \leq t \leq (1+\varepsilon)x/m} \ll \varepsilon \frac{x}{t}

and so from the previous estimate and Fubini’s theorem one has

\displaystyle  \sum_{n \leq x} \mu(n) \log n \ll \int_{1}^{(1+\varepsilon)x} \frac{|M(t)|}{t} \frac{x}{t}\ dt + \varepsilon x \log x

and hence by (22) (using trivial bounds to handle the region outside of {[\log x, x]})

\displaystyle  \sum_{n \leq x} \mu(n) \log n \ll \varepsilon x \log x


\displaystyle \sum_{n \leq x} \mu(n) \log(x/n) \ll \sum_{n \leq x} \log(x/n) \ll x,

we conclude (for {x} large enough) that

\displaystyle  \sum_{n \leq x} \mu(n) \log x \ll \varepsilon x \log x

and the claim follows. \Box

To finish the proof of the prime number theorem, it thus suffices to locate, for {x} sufficiently large, three good primes {p_1,p_2,p} with {p = (1+O(\varepsilon)) p_1 p_2}. If we already had the prime number theorem, or even the weaker form that every interval of the form {[x,e^\varepsilon x]} contained {\gg \frac{\varepsilon x}{\log x}} primes for {x} large enough, then this would be quite easy: pick a large natural number {K} (depending on {\varepsilon}, but independent of {x}), so that the primes up to {e^{\varepsilon K}} has logarithmic size {\asymp \log K} (so that only {O(\frac{\varepsilon^{-3}}{\log K})} of them are bad, as measured by logarithmic size), and let {\mathbf{k}_1, \mathbf{k}_2} be random numbers and drawn uniformly from (say) {\{K+1,\dots,2K\}}. From the prime number theorem, for each {K \leq k \leq 4K}, the interval {I_k := [e^{\varepsilon k},e^{\varepsilon(k+1)}]} contains {\gg \frac{e^{\varepsilon k}}{k} } primes. In particular, {I_{\mathrm{k}_1}} contains {\gg \frac{e^{\varepsilon \mathbf{k}_1}}{\mathbf{k}_1} } primes, but the expected number of bad primes in this interval is {O( \frac{\varepsilon^{-3}}{\log K} \frac{e^{\varepsilon k}}{\mathbf{k}_1}}. Thus by Markov’s inequality there would be at least a {90\%} chance (say) of having at least one good prime {\mathbf{p_1}} in {I_{\mathbf{k}_1}}; similarly there is a {90\%} chance of having a good prime {\mathbf{p_2}} in {I_{\mathbf{k}_1}}, and a {90\%} chance of having a good prime {\mathbf{p}} in {I_{\mathbf{k}_1+\mathbf{k}_2}}. Thus (as an application of the probabilistic method), there exist (deterministic) good primes {p_1,p_2,p} with the required properties.

Of course, using the prime number theorem here to prove the prime number theorem would be circular. However, we can still locate a good triple {p_1,p_2,p} of primes using the Selberg symmetry formula

\displaystyle  \sum_{n \leq x} \Lambda_2(n) = (2 + o(1)) x \log x \ \ \ \ \ (23)

as {x \rightarrow \infty}, where {\Lambda_2} is the second von Mangoldt function

\displaystyle  \Lambda_2(n) = \Lambda(n) \log n + \sum_{d|n} \Lambda(d) \Lambda(\frac{n}{d});

see Proposition 60 of Notes 1. We can strip away the contribution of the primes:

Exercise 17 Show that

\displaystyle  \sum_{p \leq x} \log^2 x + \sum_{p_1,p_2: p_1 p_2 \leq x} \log p_1 \log p_2 = (2+o(1)) x \log x

as {x \rightarrow \infty}.

In particular, on evaluating this at {x = e^{\varepsilon k}, e^{\varepsilon(k+1)}} and subtracting, we have

\displaystyle  \sum_{p \in I_k} \varepsilon^2 k^2 + \sum_{p_1,p_2: p_1 p_2 \in I_k} \log p_1 \log p_2 \asymp \varepsilon^2 k e^{\varepsilon k}

whenever {k} is sufficiently large depending on {\varepsilon}. In particular, for any such {k}, one either has

\displaystyle  \sum_{p \in I_k} \frac{1}{p} \gg \frac{1}{k} \ \ \ \ \ (24)


\displaystyle  \sum_{p_1 p_2 \in I_k} \frac{\log p_1 \log p_2}{p_1 p_2} \gg \varepsilon^2 k \ \ \ \ \ (25)

(or both). Informally, the Selberg symmetry formula shows that the interval {I_k} contains either a lot of primes, or a lot of semiprimes. The factor of {\log p_1 \log p_2} is slightly annoying, so we now remove it. Consider the contribution of those primes {p_1} to (25) with {p_1 \leq e^{\varepsilon^3 k}}. This is bounded by

\displaystyle  \sum_{p_1 \leq e^{\varepsilon^3 k}} \frac{\log p_1}{p_1} \sum_{p_2 = (1+O(\varepsilon)) e^{\varepsilon k} / p_1} \frac{\varepsilon k}{p_2}

which we can bound crudely using the Chebyshev bound by

\displaystyle  \sum_{p_1 \leq e^{\varepsilon^3 k}} \frac{\log p_1}{p_1}

which by Mertens theorem is {O(\varepsilon^3 k)}. Thus the contribution of this case can be safely removed from (25). Similarly for those cases when {p_2 \leq e^{\varepsilon^3 k}}. For the remaining cases we bound {\log p_1 \log p_2 \leq \varepsilon^2 k^2}. We conclude that for any sufficiently large {k}, either (24) or

\displaystyle  \sum_{p_1 p_2 \in I_k: p_1,p_2 \geq e^{\varepsilon^3 k}} \frac{1}{p_1 p_2} \gg \frac{e^{\varepsilon k}}{k}. \ \ \ \ \ (26)

holds (or both).

In order to find primes {p_1,p_2,p} with {p} close to {p_1 p_2}, it would be very convenient if we could find a {k} for which (24) and (26) both hold. We can’t quite do this directly, but due to the “connected” nature of the set of scales {k}, but we can do the next best thing:

Proposition 18 Suppose {k_0} is sufficiently large depending on {\varepsilon}. Then there exists {k,k' \in [k_0,k_0+\varepsilon^{-2}]} with {|k-k'| \leq 1} such that

\displaystyle  \sum_{p \in I_k} \frac{1}{p_1} \geq \frac{\varepsilon}{k} \ \ \ \ \ (27)


\displaystyle  \sum_{p_1 p_2 \in I_{k'}: p_1,p_2 \geq e^{\varepsilon^3 k'}} \frac{1}{p_1 p_2} \geq \varepsilon \frac{1}{k'}. \ \ \ \ \ (28)

Proof: We know that every {k} in {[k_0,k_0+\varepsilon^{-2}]} obeys at least one of (27), (28). Our task is to produce an adjacent pair of {k,k'}, one of which obeys (27) and the other obeys (28). Suppose for contradiction that no such pair exists, then whenever {k} fails to obey (27), then any adjacent {k} must also fail to do so, and similarly for (28). Thus either (27) will fail to hold for all {k \in [k_0,k_0+\varepsilon^{-2}]}, or (28) will fail to hold for all such {k}. If (27) fails for all {k \in [k_0,k_0+\varepsilon^{-2}]}, then on summing we have

\displaystyle  \sum_{e^{\varepsilon k_0} \leq p \leq e^{\varepsilon^{-1} e^{\varepsilon k_0}}} \frac{1}{p} \ll \frac{1}{\varepsilon k_0}

which contradicts Mertens’ theorem if {\varepsilon} is large enough because the left-hand side is {\gg \frac{1}{\varepsilon^2 k_0}}. Similarly, if (28) fails for all {k \in [k_0,k_0+\varepsilon^{-2}]}, then

\displaystyle  \sum_{e^{\varepsilon k_0} \leq p_1 p_2 \leq e^{\varepsilon^{-1}} e^{\varepsilon k_0}: p_1,p_2 \geq e^{2\varepsilon^3 k_0}} \frac{1}{p_1 p_2} \leq \frac{1}{\varepsilon k_0}

and again Mertens’ theorem can be used to lower bound the left-hand side by {\gg \frac{1}{\varepsilon^2 k_0}} (in fact one can even gain an additional factor of {\log \frac{1}{\varepsilon}} if one works things through carefully) and obtain a contradiction. \Box

The above proposition does indeed provide a triple of primes {p_1,p_2,p} with {p = (1+O(\varepsilon)) p_1 p_2}. If {k_0} is sufficiently large depending on {\varepsilon} and less than (say) {\log\log x}, so that {p_1,p_2,p \leq \log^{0.1} x}, this would give us what we need as long as one of the triples consisted only of good primes. The only way this can fail is if either

\displaystyle  \sum_{p \in I_k: p \hbox{ bad}} \frac{1}{p} \gg \varepsilon \frac{1}{k}

for some {k \in [k_0,k_0+\varepsilon^{-2}]}, or if

\displaystyle  \sum_{p_1 p_2 \in I_{k'}: p_1,p_2 \geq e^{\varepsilon^3 k'}; p_1 \hbox{ bad}} \frac{1}{p_1 p_2} \gg \frac{\varepsilon}{k'}

for some {k' \in [k_0,k_0+\varepsilon^{-2}]}. In the first case, we can sum to conclude that

\displaystyle  \sum_{e^{\varepsilon k_0} \leq p \leq e^{\varepsilon k_0 + 2\varepsilon^{-2}}: p \hbox{ bad}} \frac{1}{p} \gg \frac{\varepsilon}{k_0}, \ \ \ \ \ (29)

and in the second case we have

\displaystyle  \sum_{e^{\varepsilon^3 k_0} \leq p_1 \leq e^{\varepsilon k_0 + 2\varepsilon^{-2}}; p_1 \hbox{ bad}} \sum_{e^{\varepsilon k_0}/p_1 \leq p_2 \leq e^{\varepsilon(k_0+1)}/p_1} \frac{1}{p_1 p_2} \gg \frac{\varepsilon}{k_0}

and hence by Chebyshev bounds

\displaystyle  \sum_{e^{\varepsilon^3 k_0} \leq p_1 \leq e^{\varepsilon k_0 + 2\varepsilon^{-2}}; p_1 \hbox{ bad}} \frac{1}{p_1} \gg \varepsilon^2. \ \ \ \ \ (30)

Since the total set of bad primes up to {\log x} has logarithmic size {O(\varepsilon^{-2})}, we conclude from the pigeonhole principle (and the divergence of the harmonic series {\sum_{k_0=1}^\infty \frac{1}{k_0}}) that for any {C_\varepsilon} depending only on {\varepsilon}, and any {x} large enough, there exists {C_\varepsilon \leq k_0 \leq \log\log x} such that neither of (29) and (30) hold. Indeed the set of {C_\varepsilon \leq k_0 \leq \log\log x} obeying (29) has logarithmic size {O_\varepsilon(1)}, and similarly for (30). Choosing a {k_0} that avoids both of these scenarios, we then find a good {p \in I_k} and good {p_1,p_2} with {p_1 p_2 \in I_{k'}}, so that {p_1 p_2 = (1+O(\varepsilon)) p}, and then by Proposition 16 we conclude that {M(x)/x = O(\varepsilon)} for all sufficiently large {x}. Sending {\varepsilon} to zero, we obtain the prime number theorem.

— 3. Entropy methods —

In the previous section we explored the consequences of the second moment method, which applies to square-integrable random variables {\mathbf{X} \in L^2} taking values in the real or complex numbers. Now we explore entropy methods, which now apply to random variables which take a finite number of values (equipped with the discrete sigma-algebra), but whose range need not be numerical in nature. (One could extend entropy methods to slightly larger classes of random variables, such as ones that attain a countable number of values, but for our applications finitely-valued random variables will suffice.)

The fundamental notion here is that of the Shannon entropy of a random variable. If {\mathbf{X}} takes values in a finite set {R}, its Shannon entropy {\mathbb{H}(\mathbf{X})} (or entropy for short) is defined by the formula

\displaystyle  \mathbb{H}(\mathbf{X}) := \sum_{x \in R} {\mathbb P}(\mathbf{X}=x) \log \frac{1}{{\mathbb P}(\mathbf{X}=x)}

where {x} ranges over all the possible values of {\mathbf{X}}, and we adopt the convention {0 \log \frac{1}{0} = 0}, so that values that are almost surely not attained by {\mathbf{X}} do not influence the entropy. We choose here to use the natural logarithm {\log} to normalise our entropy (in which case a unit of entropy is known as a “nat“); in the information theory literature it is also common to use the base two logarithm {\log_2} to measure entropy (in which case a unit of entropy is known as a “bit“, which is equal to {\log 2} nats). However, the precise choice of normalisation will not be important in our discussion.

It is clear that if two random variables {\mathbf{X}, \mathbf{Y}} have the same probability distribution, then they have the same entropy. Also, the precise choice of range set {R} is not terribly important: if {\mathbf{X}} takes values in {R}, and {f: R \rightarrow R'} is an injection, then it is clear that {\mathbf{X}} and {f(\mathbf{X})} have the same entropy:

\displaystyle  \mathbb{H}(f(\mathbf{X})) = \mathbb{H}(\mathbf{X}). \ \ \ \ \ (31)

This is in sharp contrast to moment-based statistics such as the mean or variance, which can be radically changed by applying some injective transformation to the range values.

Informally, the entropy {\mathbb{H}(\mathbf{X})} informally measures how “spread out” or “disordered” the distribution of {\mathbf{X}} is, behaving like a logarithm of the size of the “essential support” of such a variable; from an information-theoretic viewpoint, it measures the amount of “information” one learns when one is told the value of {\mathbf{X}}. Here are some basic properties of Shannon entropy that help support this intuition:

Exercise 19 (Basic properties of Shannon entropy) Let {\mathbf{X}} be a random variable taking values in a finite set {R}.

One can view Shannon entropy as a generalisation of the notion of cardinality of a finite set (or equivalently, cardinality of finite sets can be viewed as a special case of Shannon entropy); see this previous blog post for an elaboration of this point.

The concept of Shannon entropy becomes significantly more powerful when combined with that of conditioning. Recall that a random variable {\mathbf{X}} taking values in a range set {R} can be modeled by a measurable map {\mathbf{X}: \Omega \rightarrow R} from a probability space {(\Omega,{\mathcal F}, \mathop{\bf P})} to the range {R}. If {E \in {\mathcal F}} is an event in {\Omega} of positive probability, we can then condition {\mathbf{X}: \Omega \rightarrow R} to the event {E} to form a new random variable {(\mathbf{X}|E): E \rightarrow R} on the conditioned probability space {(E, {\mathcal F}|_E, \mathop{\bf P}|E)}, where

\displaystyle  {\mathcal F}|_E := \{ E \cap F: F \in {\mathcal F} \}

is the restriction of the {\sigma}-algebra {{\mathcal F}} to {E},

\displaystyle  {\mathbb P}(F|E) := \frac{{\mathbb P}( E \cap F )}{{\mathbb P}(E)}

is the conditional probability measure on {E}, and {\mathbf{X}|E:E \rightarrow R} is the restriction of {\mathbf{X}: \Omega \rightarrow R} to {E}. This random variable {(\mathbf{X}|E)} lives on a different probability space than {\mathbf{X}} itself, so it does not make sense to directly combine these variables (thus for instance one cannot form the sum {\mathbf{X} + (\mathbf{X}|E)} even when both random variables are real or complex valued); however, one can still form the Shannon entropy {\mathbb{H}( \mathbf{X}|E )} of the conditioned random variable {(\mathbf{X}|E)}, which is given by the same formula

\displaystyle  \mathbb{H}(\mathbf{X}|E) := \sum_{x \in R} {\mathbb P}(\mathbf{X}=x|E) \log \frac{1}{{\mathbb P}(\mathbf{X}=x|E)}.

Given another random variable {\mathbf{Y}: \Omega \rightarrow S} taking values in another finite set {S}, we can then define the conditional Shannon entropy {\mathbb{H}(\mathbf{X}|\mathbf{Y})} to be the expected entropy of the level sets {\mathbb{H}(\mathbf{X}|\mathbf{Y}=y)}, thus

\displaystyle  \mathbb{H}(\mathbf{X}|\mathbf{Y}) := \sum_{y \in S} {\mathbb P}( \mathbf{Y} = y ) \mathbb{H}(\mathbf{X}|\mathbf{Y}=y)

with the convention that the summand here vanishes when {{\mathbb P}( \mathbf{Y} = y ) = 0}. From the law of total probability we have

\displaystyle  {\mathbb P}( \mathbf{X}=x ) = \sum_{y \in S} {\mathbb P}( \mathbf{Y} = y ) {\mathbb P}(\mathbf{X}=x|\mathbf{Y}=y)

for any {x \in R}, and hence by Jensen’s inequality

\displaystyle  {\mathbb P}( \mathbf{X}=x ) \log \frac{1}{{\mathbb P}( \mathbf{X}=x )} \leq \sum_{y \in S} {\mathbb P}( \mathbf{Y} = y ) {\mathbb P}(\mathbf{X}=x|\mathbf{Y}=y) \log \frac{1}{{\mathbb P}(\mathbf{X}=x|\mathbf{Y}=y)}

for any {x \in R}; summing we obtain the Shannon entropy inequality

\displaystyle  \mathbb{H}(\mathbf{X}|\mathbf{Y}) \leq \mathbb{H}(\mathbf{X}). \ \ \ \ \ (33)

Informally, this inequality asserts that the new information content of {\mathbf{X}} can be decreased, but not increased, if one is first told some additional information {\mathbf{Y}}.

This inequality (33) can be rewritten in several ways:

Exercise 20 Let {\mathbf{X}: \Omega \rightarrow R}, {\mathbf{Y}: \Omega \rightarrow S} be random variables taking values in finite sets {R,S} respectively.

From the above exercise we see that the mutual information {\mathbb{I}( \mathbf{X}: \mathbf{Y} )} is a measure of dependence between {\mathbf{X}} and {\mathbf{Y}}, much as correlation or covariance was in the previous sections. There is however one key difference: whereas a zero correlation or covariance is a consequence but not a guarantee of independence, zero mutual information is logically equivalent to independence, and is thus a stronger property. To put it another way, zero correlation or covariance allows one to calculate the average {\mathbb{E}( \mathbf{X} \mathbf{Y} )} in terms of individual averages of {\mathbf{X}, \mathbf{Y}}, but zero mutual information is stronger because it allows one to calculate the more general averages {\mathbb{E}( f(\mathbf{X}) g(\mathbf{Y}) )} in terms of individual averages of {\mathbf{X}, \mathbf{Y}}, for arbitrary functions {f,g} taking values into the complex numbers. This increased power of the mutual information statistic will allow us to estimate various averages of interest in analytic number theory in ways that do not seem amenable to second moment methods.

The subadditivity property formula can be conditioned to any event {E} occuring with positive probability (replacing the random variables {\mathbf{X}, \mathbf{Y}} by their conditioned counterparts {({\mathbf X}|E), (\mathbf{Y}|E)}), yielding the inequality

\displaystyle  \mathbb{H}(\mathbf{X}, \mathbf{Y}|E) \leq \mathbb{H}(\mathbf{X}|E) + \mathbb{H}(\mathbf{Y}|E).

Applying this inequality to the level events {\mathbf{Z}=z} of some auxiliary random variable {\mathbf{Z}: \Omega \rightarrow T} taking values in another finite set {T}, multiplying by {\mathbf{P}( \mathbf{Z}=z)}, and summing, we conclude the inequality

\displaystyle  \mathbb{H}(\mathbf{X}, \mathbf{Y}|{\mathbf Z}) \leq \mathbb{H}(\mathbf{X}|\mathbf{Z}) + \mathbb{H}(\mathbf{Y}|\mathbf{Z}).

In other words, the conditional mutual information

\displaystyle  \mathbb{I}( \mathbf{X}: \mathbf{Y}|{\mathbf Z} ) := \mathbb{H}(\mathbf{X}|\mathbf{Z}) + \mathbb{H}(\mathbf{Y}|\mathbf{Z})-\mathbb{H}(\mathbf{X}, \mathbf{Y}|{\mathbf Z})

between {\mathbf{X}} and {\mathbf{Y}} conditioning on {{\mathbf Z}} is always non-negative:

\displaystyle  \mathbb{I}( \mathbf{X}: \mathbf{Y}|{\mathbf Z} ) \geq 0. \ \ \ \ \ (36)

One has conditional analogues of the above exercise:

Exercise 21 Let {\mathbf{X}: \Omega \rightarrow R}, {\mathbf{Y}: \Omega \rightarrow S}, {\mathbf{Z}: \Omega \rightarrow T} be random variables taking values in finite sets {R,S,T} respectively.

We now give a key motivating application of the Shannon entropy inequalities. Suppose one has a sequence {\mathbf{X}_1, \mathbf{X}_2, \dots} of random variables, all taking values in a finite set {R}, which are stationary in the sense that the tuples {(\mathbf{X}_1,\dots,\mathbf{X}_n)} and {(\mathbf{X}_2,\dots,\mathbf{X}_{n+1})} have the same distribution for every {n}. In particular we will have

\displaystyle  \mathbb{H}( \mathbf{X}_{n+1} | \mathbf{X}_2,\dots,\mathbf{X}_n ) = \mathbb{H}( \mathbf{X}_{n} | \mathbf{X}_1,\dots,\mathbf{X}_{n-1} )

and hence by (39)

\displaystyle  \mathbb{H}( \mathbf{X}_{n+1} | \mathbf{X}_1,\dots,\mathbf{X}_n ) \leq \mathbb{H}( \mathbf{X}_{n} | \mathbf{X}_1,\dots,\mathbf{X}_{n-1} ).

If we write {h_n := \mathbb{H}(\mathbf{X}_1,\dots,\mathbf{X}_n)}, we conclude from (34) that we have the concavity property

\displaystyle  h_{n+1} - h_n \leq h_n - h_{n-1}.

In particular we have {h_{n+1} - h_n \leq h_i - h_{i-1}} for any {1 \leq i \leq n}, which on summing and telescoping series (noting that {h_0=0}) gives

\displaystyle  n (h_{n+1} - h_n) \leq h_n

and hence we have the entropy monotonicity

\displaystyle  \frac{h_{n+1}}{n+1} \leq \frac{h_n}{n}. \ \ \ \ \ (41)

In particular, the limit {\lim_{n \rightarrow\infty} \frac{h_n}{n}} exists. This quantity is known as the Kolmogorov-Sinai entropy of the stationary process {\mathbf{X}_1, \mathbf{X}_2, \dots}; it is an important statistic in the theory of dynamical systems, and roughly speaking measures the amount of entropy produced by this process as a function of a discrete time vairable {n}. We will not directly need the Kolmogorov-Sinai entropy in our notes, but a variant of the entropy monotonicity formula (41) will be important shortly.

In our application we will be dealing with processes that are only asymptotically stationary rather than stationary. To control this we recall the notion of the total variation distance {d_{TV}(\mathbf{X}, \mathbf{Y})} between two random variables {\mathbf{X}, \mathbf{Y}} taking values in the same finite space {R}, defined by

\displaystyle  d_{TV}(\mathbf{X}, \mathbf{Y}) := \sum_{r \in R} | {\mathbb P}( \mathbf{X} = r ) - {\mathbb P}( \mathbf{Y} = r ) |.

There is an essentially equivalent notion of this distance which is also often in use:

Exercise 22 If two random variables {\mathbf{X}, \mathbf{Y}} take values in the same finite space {R}, establish the inequalities

\displaystyle  \frac{1}{2} d_{TV}(\mathbf{X}, \mathbf{Y}) \leq \sup_{E \subset R} |{\mathbb P}( \mathbf{X} = E ) - {\mathbb P}( \mathbf{Y} = E )| \leq d_{TV}(\mathbf{X}, \mathbf{Y})

and for any {f: R \rightarrow {\bf C}}, establish the inequality

\displaystyle  |{\mathbb E} f(\mathbf{X}) - {\mathbb E} f(\mathbf{Y})| \leq d_{TV}(\mathbf{X}, \mathbf{Y}) \sup_{r \in R} |f(r)|. \ \ \ \ \ (42)

Shannon entropy is continuous in total variation distance as long as we keep the range finite. More quantitatively, we have

Lemma 23 If two random variables {\mathbf{X}, \mathbf{Y}} take values in the same finite space {R}, then

\displaystyle  \mathbb{H}(\mathbf{X}) = \mathbb{H}(\mathbf{Y}) + O( d_{TV}(\mathbf{X}, \mathbf{Y}) \log(\# R (1 +\frac{1}{d_{TV}(\mathbf{X}, \mathbf{Y})})) )

with the convention that the error term vanishes when {d_{TV}(\mathbf{X}, \mathbf{Y}) = 0}.

Proof: Set {\varepsilon := d_{TV}(\mathbf{X}, \mathbf{Y})}. The claim is trivial when {\varepsilon=0} (since then {\mathbf{X}, \mathbf{Y}} have the same distribution) and when {\varepsilon \geq 1/2} (from (32)), so let us assume {0 < \varepsilon < \frac{1}{2}}, and our task is to show that

\displaystyle  \mathbb{H}(\mathbf{X}) = \mathbb{H}(\mathbf{Y}) + O( \varepsilon \log \# R + \varepsilon \log \frac{1}{\varepsilon} ).

If we write {h(p) := p \log \frac{1}{p}}, {p_r := {\mathbb P}( \mathbf{X} = r )}, and {q_r := {\mathbb P}( \mathbf{Y} = r )}, then

\displaystyle  \mathbb{H}(\mathbf{X}) = \mathbb{H}(\mathbf{Y}) + \sum_{r \in R} h(q_r) - h(p_r).

By dividing into the cases {|p_r-q_r| \leq \frac{1}{2} q_r} and {|p_r-q_r| > \frac{1}{2} q_r} we see that

\displaystyle  h(q_r) - h(p_r) \ll |p_r - q_r| + h( |p_r - q_r| )

since {\varepsilon = \sum_{r \in R} |p_r - q_r|}, it thus suffices to show that

\displaystyle  \sum_{r \in R} h( |p_r - q_r| ) \ll \varepsilon \log \# R + \varepsilon \log \frac{1}{\varepsilon}.

But from Jensen’s inequality (32) one has

\displaystyle  \sum_{r \in R} h( \frac{|p_r - q_r|}{\varepsilon} ) \leq \log \# R;

since {h( |p_r - q_r| ) = \varepsilon h( \frac{|p_r - q_r|}{\varepsilon} ) + |p_r-q_r| \log \frac{1}{\varepsilon}}, the claim follows. \Box

In the converse direction, if a random variable has entropy close to the maximum {\log \# R}, then one can control the total variation:

Lemma 24 (Special case of Pinsker inequality) If {\mathbf{X}} takes values in a finite set {R}, and {\mathbf{U}} is a uniformly distributed random variable on {R}, then

\displaystyle  d_{TV}( \mathbf{X}, \mathbf{U} ) \ll \left( \mathbb{H}( \mathbf{U} ) - \mathbb{H}( \mathbf{X} ) \right)^{1/2}.

Of course, we have {\mathbb{H}(\mathbf{U}) = \log \# R}, so we may also write the above inequality as

\displaystyle  d_{TV}( \mathbf{X}, \mathbf{U} ) \ll \left( \log \# R - \mathbb{H}( \mathbf{X} ) \right)^{1/2}.

The optimal value of the implied constant here is known to equal {\sqrt{2}}, but we will not use this sharp version of the inequality here.

Proof: If we write {p_r := {\mathbb P}( \mathbf{X} = r )} and {q_r := {\mathbb P}( \mathbf{U} = r ) = 1/\# R}, and {h(p) := p \log \frac{1}{p}}, then we can rewrite the claimed inequality as

\displaystyle  \sum_{r \in R} |p_r - q_r| \ll \left( \sum_{r \in R} h(q_r) - h(p_r) \right)^{1/2}.

Observe that the function {h} is concave, and in fact {h''(p) = - \frac{1}{p}} for all {0 < p \leq 1}. From this and Taylor expansion with remainder we may write

\displaystyle  h(q_r) - h(p_r) = h'(q_r) (q_r - p_r) + \frac{1}{2} \frac{|p_r - q_r|^2}{s_r}

for some {s_r} between {p_r} and {q_r}. Since {h'(q_r)} is independent of {r}, and {\sum_{r \in R} q_r = \sum_{r \in R} p_r = 1}, we thus have on summing in {r}

\displaystyle  \sum_{r \in R} h(q_r) - h(p_r) = \frac{1}{2} \sum_{r \in R} \frac{|p_r - q_r|^2}{s_r}.

By Cauchy-Schwarz we then have

\displaystyle  \sum_{r \in R} |p_r - q_r| \ll \left( \sum_{r \in R} h(q_r) - h(p_r) \right)^{1/2} \left( \sum_{r \in R} s_r \right)^{1/2}.

Since {s_r = O( p_r + q_r)} and {\sum_{r \in R} q_r = \sum_{r \in R} p_r = 1}, the claim follows. \Box

The above lemma does not hold when the comparison variable {\mathbf{U}} is not assumed to be uniform; in particular, two non-uniform random variables can have precisely the same entropy but yet have different distributions, so that their total variation distance is positive. There is a more general variant, known as the Pinsker inequality, which we will not use in these notes:

Exercise 25 (Pinsker inequality) If {\mathbf{X}, \mathbf{Y}} take values in a finite set {R}, define the Kullback-Leibler divergence of {\mathbf{X}} relative to {\mathbf{Y}} by the formula

\displaystyle  d_{KL}( \mathbf{X} || \mathbf{Y} ) := \sum_{r \in R} \mathbf{P}( \mathbf{X} = r ) \log \frac{\mathbf{P}( \mathbf{X} = r )}{\mathbf{P}( \mathbf{Y} = r )}

(with the convention that the summand vanishes when {\mathbf{P}( \mathbf{X} = r )} vanishes).

  • (i) Establish the Gibbs inequality {0 \leq d_{KL}(\mathbf{X} || \mathbf{Y}) \leq +\infty}.
  • (ii) Establish the Pinsker inequality

    \displaystyle  d_{TV}( \mathbf{X}, \mathbf{Y} ) \ll d_{KL}( \mathbf{X} || \mathbf{Y} )^{1/2}.

    In particular, {d_{KL}( \mathbf{X} || \mathbf{Y} )} vanishes if and only if {\mathbf{X}, \mathbf{Y}} have identical distribution. Show that this implies Lemma 24 as a special case.

  • (iii) Give an example to show that the Kullback-Liebler divergence need not be symmetric, thus there exist {\mathbf{X},\mathbf{Y}} such that {d_{KL}( \mathbf{X} || \mathbf{Y} ) \neq d_{KL}( \mathbf{Y} || \mathbf{X} )}.
  • (iv) If {\mathbf{X}_1, \mathbf{X}_2} are random variables taking values in finite sets {R_1,R_2}, and {\mathbf{X}'_1, \mathbf{X}'_2} are independent random variables taking values in {R_1,R_2} respectively with each {\mathbf{X}'_i} having the same distribution of {\mathbf{X}_i}, show that

    \displaystyle  \mathbb{I}( X_1 : X_2 ) = d_{KL}( (X_1,X_2) || (X'_1,X'_2) ).

In our applications we will need a relative version of Lemma 24:

Corollary 26 (Relative Pinsker inequality) If {\mathbf{X}} takes values in a finite set {R}, {\mathbf{Z}} takes values in a finite set {T}, and {\mathbf{U}} is a uniformly distributed random variable on {R} that is independent of {\mathbf{Z}}, then

\displaystyle  d_{TV}( (\mathbf{X},\mathbf{Z}), (\mathbf{U},\mathbf{Z}) ) \ll \left( \mathbb{H}( \mathbf{U} ) - \mathbb{H}( \mathbf{X} | \mathbf{Z} ) \right)^{1/2}.

Proof: From direct calculation we have the identity

\displaystyle  d_{TV}( (\mathbf{X},\mathbf{Z}), (\mathbf{U},\mathbf{Z}) ) = \sum_{t \in T} d_{TV}( (\mathbf{X}|\mathbf{Z}=t), (\mathbf{U}|\mathbf{Z}=t) ) {\mathbb P}(\mathbf{Z} = t ).

As {\mathbf{U}} is independent of {\mathbf{Z}}, {(\mathbf{U}|\mathbf{Z}=t)} is uniformly distributed on {R}. From Lemma 24 we conclude

\displaystyle  d_{TV}( (\mathbf{X}|\mathbf{Z}=t), (\mathbf{U}|\mathbf{Z}=t) ) \ll \left( \mathbb{H}( \mathbf{U} ) - \mathbb{H}( \mathbf{X} | \mathbf{Z}=t ) \right)^{1/2}.

Inserting this bound and using the Cauchy-Schwarz inequality, we obtain the claim. \Box

Now we are ready to apply the above machinery to give a key inequality that is analogous to Elliott’s inequality. Inequalities of this type first appeared in one of my papers, introducing what I called the “entropy decrement argument”; the following arrangement of the inequality and proof is due to Redmond McNamara (personal communication).

Theorem 27 (Entropy decrement inequality) Let {\mathbf{n}} be a random variable taking values in a finite set of integers, which obeys the approximate stationarity

\displaystyle  d_{TV}(\mathbf{n}, \mathbf{n}+1) \leq \delta \ \ \ \ \ (43)

for some {0 < \delta \leq \frac{1}{2}}. Let {p_1,\dots,p_m} be a collection of distinct primes less than some threshold {P}, and let {1 = \ell_0 \leq \ell_1 \leq \dots \leq \ell_m \leq P} be natural numbers that are also bounded by {P}. Let {f: {\bf Z} \rightarrow R} be a function taking values in a finite set {R}. For {i=1,\dots,m}, let {\mathbf{X}_i} denote the {R^{\ell_i}}-valued random variable

\displaystyle  \mathbf{X}_i := (f(\mathbf{n}+1), \dots, f(\mathbf{n}+\ell_i)) \ \ \ \ \ (44)

and let {\mathbf{Y}_i} denote the {{\bf Z}/p_i{\bf Z}}-valued random variable

\displaystyle  \mathbf{Y}_i := \mathbf{n} \hbox{ mod } p_i.

Also, let {\mathbf{U}_i} be a random variable drawn uniformly from {{\bf Z}/p_i{\bf Z}}, independently of {\mathbf{n}}. Then

\displaystyle  \sum_{i=1}^m \frac{d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 }{\ell_i} \ll \log \# R + \delta \exp(O(P)) \log \frac{\# R}{\delta}. \ \ \ \ \ (45)

The {\exp(O(P))} factor (arising from an invocation of the Chinese remainder theorem in the proof) unfortunately restricts the usefulness of this theorem to the regime in which all the primes involved are of “sub-logarithmic size”, but once one is in that regime, the second term on the right-hand side of (45) tends to be negligible in practice. Informally, this theorem asserts that for most small primes {p_i \leq P}, the random variables {\mathbf{X}_i} and {\mathbf{Y}_i} behave as if they are independent of each other.

Proof: We can assume {\# R \geq 2}, as the claim is trivial for {\# R = 1} (the {\mathbf{X}_i} all have zero entropy). For {i=0,\dots,n}, we introduce the {\prod_{j \leq i} {\bf Z}/p_i{\bf Z}}-valued random variable

\displaystyle  \mathbf{Y}_{\leq i} := (\mathbf{Y}_j)_{j \leq i}.

The idea is to exploit some monotonicity properties of the quantity {\frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i} )}{\ell_i}}, in analogy with (41). By telescoping series we have

\displaystyle  \sum_{i=1}^n \frac{\mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} )}{\ell_{i-1}} - \frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i} )}{\ell_i} \leq \mathbb{H}( \mathbf{X}_0 )

\displaystyle  \leq \log \# R

where we extend (44) to the {i=0} case. From (38) we have

\displaystyle  \mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i} ) = \mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} ) - \mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )

and thus

\displaystyle  \sum_{i=1}^n \frac{\mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i} \ \ \ \ \ (46)

\displaystyle  \leq \log \# R - \sum_{i=1}^n \frac{\mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} )}{\ell_{i-1}} - \frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i}.

Now we lower bound the summand on the right-hand side. From multiple applications of the conditional chain rule (37) we have

\displaystyle  \mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} ) = \sum_{k=1}^{\ell_i} c_{k,i} \ \ \ \ \ (47)


\displaystyle  \mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} ) = \sum_{k=1}^{\ell_{i-1}} c_{k,i} \ \ \ \ \ (48)


\displaystyle  c_{k,i} := \mathbb{H}( f(\mathbf{n}+k) | \mathbf{Y}_{\leq i-1}, f(\mathbf{n}+1),\dots,f(\mathbf{n}+k-1) ).

We now use the approximate stationarity of {\mathbf{n}} to derive an approximate monotonicity property for {c_{k,i}}. If {k>1}, then from (39) we have

\displaystyle  c_{k,i} \leq \mathbb{H}( f(\mathbf{n}+k) | \mathbf{Y}_{\leq i-1}, f(\mathbf{n}+2),\dots,f(\mathbf{n}+k-1) ).

Write {\mathbf{n}' := \mathbf{n} + 1} and

\displaystyle  \mathbf{Y}'_{\leq i-1} := (\mathbf{n}' \hbox{ mod } p_j)_{j \leq i-1}.

Note that {\mathbf{Y}'_{\leq i-1}} is a deterministic function of {\mathbf{Y}_{\leq i-1}} and vice versa. Thus we can replace {\mathbf{Y}_{\leq i-1}} by {\mathbf{Y}'_{\leq i-1}} in the above formula, and conclude that

\displaystyle  c_{k,i} \leq \mathbb{H}( f(\mathbf{n}'+k-1) | \mathbf{Y}'_{\leq i-1}, f(\mathbf{n}'+1),\dots,f(\mathbf{n}'+k-2) ).

The tuple {(f(\mathbf{n}'_1),\dots,f(\mathbf{n}'+k-1), \mathbf{Y}'_{\leq i-1})} takes values in a set of cardinality {(O(\# R))^P} thanks to the Chebyshev bounds. Hence by two applications of Lemma 23, (43) we have

\displaystyle  c_{k,i} \leq \mathbb{H}( f(\mathbf{n}+k-1) | \mathbf{Y}_{\leq i-1}, f(\mathbf{n}+1),\dots,f(\mathbf{n}+k-2) ) + O( \delta \log \frac{1}{\delta} + \delta P \log \# R ).

The first term on the right-hand side is {c_{k-1,i}}. Worsening the error term slightly, we conclude that

\displaystyle  c_{k,i} \leq c_{k-1,i} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} )

and hence

\displaystyle  c_{k,i} \leq c_{j,i} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} )

for any {1 \leq j \leq m}. In particular

\displaystyle  \sum_{k=\ell_{i-1}+1}^{\ell_i} c_{k,i} \leq \frac{\ell_i - \ell_{i-1}}{\ell_{i-1}} \sum_{k=1}^{\ell_{i-1}} c_{k,i} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} )

which by (47), (48) rearranges to

\displaystyle  \frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i} \leq \frac{\mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} )}{\ell_{i-1}} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} ).

From (46) we conclude that

\displaystyle  \sum_{i=1}^m \frac{\mathbb{I}( \mathbf{X}_{i-1} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i} \ll \log \# R + \delta P^{O(1)} \log \frac{\# R}{\delta}.

Meanwhile, from Corollary 26, (39), (38) we have

\displaystyle  d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 \ll \mathbb{H}( \mathbf{U}_i ) - \mathbb{H}( \mathbf{Y}_i | \mathbf{X}_i )

\displaystyle \ll \mathbb{H}( \mathbf{U}_i ) - \mathbb{H}( \mathbf{Y}_i | \mathbf{X}_i, \mathbf{Y}_{\leq i-1} )

\displaystyle = \mathbb{H}( \mathbf{U}_i ) - \mathbb{H}( \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} ) + \mathbb{I}( \mathbf{X}_{i-1} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )

\displaystyle = \mathbb{H}( \mathbf{U}_i ) + \mathbb{H}(\mathbf{Y}_{\leq i-1}) - \mathbb{H}( \mathbf{Y}_{\leq i} ) + \mathbb{I}( \mathbf{X}_{i-1} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} ).

The probability distribution of {\mathbf{Y}_{\leq i}} is a function on {\prod_{j \leq i} {\bf Z}/p_j{\bf Z}}, which by the Chinese remainder theorem we can identify with a cyclic group {{\bf Z}/Q_i{\bf Z}} where {Q_i := \prod_{j \leq i} p_j}. From (43) we see that the value of this distribution at adjacent values {n,n+1} of this cyclic group varies by {O(\delta)}, hence the total variation distance between this random variable and the uniform distribution on {\prod_{j \leq i} {\bf Z}/p_i{\bf Z}} is {O( \delta Q_i^{O(1)} ) = O( \delta \exp(O(P)) )} by Chebyshev bounds. By Lemma 23 we then have

\displaystyle  \mathbb{H}( \mathbf{Y}_{\leq i} ) = \log Q_i + O( \delta \exp(O(P)) \log \frac{1}{\delta} )

and thus

\displaystyle  d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 \leq \mathbb{I}( \mathbf{X}_{i-1} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} ) + O(\delta \exp(O(P)) \log \frac{1}{\delta} ).

The claim follows. \Box

We now compare this result to Elliott’s inequality. If one tries to address precisely the same question that Elliott’s inequality does – namely, to try to compare a sum {\frac{1}{|I|} \sum_{n \in I} f(n)} with sampled subsums {\frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n}} – then the results are quantitatively much weaker:

Corollary 28 (Weak Elliott inequality) Let {I \subset {\bf R}} be an interval of length at least {1}. Let {f:{\bf Z} \rightarrow {\bf C}} be a function with {|f(n)| \leq 1} for all {n}, and let {0 < \varepsilon < \frac{1}{2}}. Then one has

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n) = \frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n} + O(\varepsilon)

for all primes {p \leq \log |I|} outside of an exceptional set of primes of logarithmic size {O( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}.

Comparing this with Exercise 8 we see that we cover a much smaller range of primes {p}; also the size of the exceptional set is slightly worse. This version of Elliot’s inequality is however still strong enough to recover a proof of the prime number theorem as in the previous section.

Proof: We can assume that {\varepsilon} is small, as the claim is trivial for {\varepsilon} comparable to {1}. We can also assume that

\displaystyle  \log\log\log |I| \geq \varepsilon^{-2} \log \frac{1}{\varepsilon} \ \ \ \ \ (49)

since the claim is also trivial otherwise (just make all primes up to {\log |I|} exceptional, then use Mertens’ theorem). As a consequence of this, any quantity involving {|I|} in the denominator will end up being completely negligible in practice. We can also restrict attention to primes less than (say) {\log^{1/2} |I|}, since the remaining primes between {\log |I|} and {\log^{1/2} |I|} have logarithmic size {O(1)}.

By rounding the real and imaginary parts of {f} to the nearest multiple of {\varepsilon}, we may assume that {f} takes values in some finite set {R} of complex numbers of size {O(1)} with cardinality {O(\varepsilon^{-2})}. Let {\mathbf{n}} be drawn uniformly at random from {I}. Then (43) holds with {\delta = O(1/|I|)}, and from Theorem 27 with {\ell_i = p_i} and {P = \log^{1/2} |I|} (which makes the second term of the right-hand side of (45) negligible) we have

\displaystyle  \sum_{i=1}^m \frac{d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 }{p_i} \ll \log \frac{1}{\varepsilon}

where {p_1,\dots,p_m} are the primes up to {\log^{1/2} |I|}, arranged in increasing order. By Markov’s inequality, we thus have

\displaystyle  d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) ) \ll \varepsilon

for {p_i} outside of a set of primes of logarithmic size {O( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}.

Let {i} be as above. Now let {F_i: R^{p_i} \times {\bf Z}/p_i{\bf Z} \rightarrow {\bf C}} be the function

\displaystyle  F_i( (x_1,\dots,x_{p_i}), a ) := \sum_{j=1}^{p_i} x_j 1_{j = -a \hbox{ mod } p_i},

that is to say {F_i( (x_1,\dots,x_{p_i}), a )} picks out the unique component {x_j} of the tuple {(x_1,\dots,x_{p_i})} in which {j+a} is divisible by {p_i}. This function is bounded by {O(1)}, and then by (42) we have

\displaystyle  {\mathbb E} F_i( \mathbf{X}_i, \mathbf{Y}_i ) = {\mathbb E} F_i( \mathbf{X}_i, \mathbf{U}_i ) + O(\varepsilon).

The left-hand side is equal to

\displaystyle  \frac{1}{|I|} \sum_{n \in I} \sum_{j=1}^{p_i} f(n+j) 1_{j = -n \hbox{ mod } p_i} + O(\varepsilon)

which on switching the summations and using the large nature of {|I|} can be rewritten as

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n) p_i 1_{p_i|n} + O(\varepsilon).

Meanwhile, the left-hand side is equal to

\displaystyle  \frac{1}{|I|} \sum_{n \in I} \frac{1}{p_i} \sum_{j=1}^{p_i} f(n+j) + O(\varepsilon)

which again by switching the summations becomes

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n) + O(\varepsilon).

The claim follows. \Box

In the above argument we applied (42) with a very specific choice of function {F_i}. The power of Theorem 27 lies in the ability to select many other such functions {F_i}, leading to estimates that do not seem to be obtainable purely from the second moment method. In particular we have the following generalisation of the previous estimate:

Proposition 29 (Weak Elliott inequality for multiple correlations) Let {I \subset {\bf R}} be an interval of length at least {1}. Let {f:{\bf Z} \rightarrow {\bf C}} be a function with {|f(n)| \leq 1} for all {n}, and let {0 < \varepsilon < \frac{1}{2}}. Let {h_1,\dots,h_k} be integers. Then one has

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n+h_1 p) \dots f(n+h_k p)

\displaystyle = \frac{1}{|I|} \sum_{n \in I} f(n + h_1 p) \dots f(n+h_k p) p 1_{p|n} + O_{k,h_1,\dots,h_k}(\varepsilon)

for all primes {p \leq \log |I|} outside of an exceptional set of primes of logarithmic size {O_{k,h_1,\dots,h_k}( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}.

Proof: We allow all implied constants to depend on {k,h_1,\dots,h_k}. As before we can assume that {\varepsilon} is sufficiently small (depending on {k,h_1,\dots,h_k}), that {f} takes values in a set {R} of bounded complex numbers of cardinality {O(\varepsilon^{-2})}, and that {|I|} is large in the sense of (49), and restrict attention to primes up to {P := \log^{1/2} |I|}. By shifting the {h_j} and using the large nature of {|I|} we can assume that the {h_j} are all non-negative, taking values in {\{0,\dots,H-1\}} for some {H=O(1)}. We now apply Theorem 27 with {\ell_i = H p_i} and conclude as before that

\displaystyle  d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) ) \ll \varepsilon

for {p_i} outside of a set of primes of logarithmic size {O( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}.

Let {i} be as above. Let {F_i: R^{Hp_i} \times {\bf Z}/p_i{\bf Z} \rightarrow {\bf C}} be the function

\displaystyle  F_i( (x_1,\dots,x_{Hp_i}), a ) := \sum_{j=1}^{p_i} x_{j+h_1 p_i} \dots x_{j+h_k p_i} 1_{j = -a \hbox{ mod } p_i}.

This function is still bounded by {O(1)}, so by (42) as before we have

\displaystyle  {\mathbb E} F_i( \mathbf{X}_i, \mathbf{Y}_i ) = {\mathbb E} F_i( \mathbf{X}_i, \mathbf{U}_i ) + O(\varepsilon).

The left-hand side is equal to

\displaystyle  \frac{1}{|I|} \sum_{n \in I} \sum_{j=1}^{p_i} f(n+j+h_1 p_i) \dots f(n+j+h_k p_i) \dots 1_{j = -n \hbox{ mod } p_i} + O(\varepsilon)

which on switching the summations and using the large nature of {|I|} can be rewritten as

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n+h_1 p_i) \dots f(n+h_k p_i) p_i 1_{p_i|n} + O(\varepsilon).

Meanwhile, the left-hand side is equal to

\displaystyle  \frac{1}{|I|} \sum_{n \in I} \frac{1}{p_i} \sum_{j=1}^{p_i} f(n+j+h_1 p_i) \dots f(n+j+h_k p_i) + O(\varepsilon)

which again by switching the summations becomes

\displaystyle  \frac{1}{|I|} \sum_{n \in I} f(n+h_1 p_i) \dots f(n+h_k p_i) + O(\varepsilon).

The claim follows. \Box

There is a logarithmically averaged version of the above proposition:

Exercise 30 (Weak Elliott inequality for logarithmically averaged multiple correlations) Let {2 \leq y \leq x} with {x/y \geq 2}, let {f: {\bf Z} \rightarrow {\bf C}} be a function bounded in magnitude by {1}, let {0 < \varepsilon < \frac{1}{2}}, and let {h_1,\dots,h_k} be integers. Show that

\displaystyle  \frac{1}{\log x/y} \sum_{y \leq n \leq x} f(n+h_1 p) \dots f(n+h_k p)

\displaystyle = \frac{1}{\log x/y} \sum_{y \leq n \leq x} f(n + h_1 p) \dots f(n+h_k p) p 1_{p|n} + O_{k,h_1,\dots,h_k}(\varepsilon)

for all primes {p \leq \log y} outside of an exceptional set of primes of logarithmic size {O_{k,h_1,\dots,h_k}( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}.

When one specialises to multiplicative functions, this lets us dilate shifts in multiple correlations by primes:

Exercise 31 Let {2 \leq N \leq y \leq x} with {x/y \geq N}, let {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function bounded in magnitude by {1}, let {0 < \varepsilon < \frac{1}{2}}, and let {h_1,\dots,h_k} be nonnegative integers. Show that

\displaystyle  f(p)^k \frac{1}{\log x/y} \sum_{y \leq n \leq x} \frac{f(n+h_1) \dots f(n+h_k)}{n} = \frac{1}{\log x/y} \sum_{y \leq n \leq x} \frac{f(n + h_1 p) \dots f(n+h_k p)}{n} + O_{k,h_1,\dots,h_k}(\varepsilon)

for all primes {p \leq \log N} outside of an exceptional set of primes of logarithmic size {O_{k,h_1,\dots,h_k}( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}.

For instance, setting {f} to be the Möbius function, {h_1=0}, {h_2=0}, and {y = \log x} (say), we see that

\displaystyle  \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+p)}{n} + O(\varepsilon)

for all primes {p \leq \log\log x} outside of an exceptional set of primes of logarithmic size {\varepsilon^{-2} \log \frac{1}{\varepsilon}}. In particular, for {x} large enough, one can obtain bounds of the form

\displaystyle  \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} =\frac{1}{(\# {\mathcal P}) \log x} \sum_{n \leq x} \sum_{p \in {\mathcal P}} \frac{\mu(n) \mu(n+p)}{n} + O(\varepsilon)

for various moderately large sets of primes {p}. It turns out that these double sums on the right-hand side can be estimated by methods which we will cover in later series of notes. Among other things, this allows us to establish estimates such as

\displaystyle  \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = o(1)

as {n \rightarrow \infty}, which to date have only been established using these entropy methods (in conjunction with the methods discussed in later notes). This is progress towards an open problem in analytic number theory known as Chowla’s conjecture, which we will also discuss in later notes.

Matt StrasslerHas a New Force of Nature Been Discovered?

There have been dramatic articles in the news media suggesting that a Nobel Prize has essentially already been awarded for the amazing discovery of a “fifth force.” I thought I’d better throw some cold water on that fire; it’s fine for it to smoulder, but we shouldn’t let it overheat.

There could certainly be as-yet unknown forces waiting to be discovered — dozens of them, perhaps.   So far, there are four well-studied forces: gravity, electricity/magnetism, the strong nuclear force, and the weak nuclear force.  Moreover, scientists are already fully confident there is a fifth force, predicted but not yet measured, that is generated by the Higgs field. So the current story would really be about a sixth force.

Roughly speaking, any new force comes with at least one new particle.  That’s because

  • every force arises from a type of field (for instance, the electric force comes from the electromagnetic field, and the predicted Higgs force comes from the Higgs field)
  • and ripples in that type of field are a type of particle (for instance, a minimal ripple in the electromagnetic field is a photon — a particle of light — and a minimal ripple in the Higgs field is the particle known as the Higgs boson.)

The current excitement, such as it is, arises because someone claims to have evidence for a new particle, whose properties would imply a previously unknown force exists in nature.  The force itself has not been looked for, much less discovered.

The new particle, if it really exists, would have a rest mass about 34 times larger than that of an electron — about 1/50th of a proton’s rest mass. In technical terms that means its E=mc² energy is about 17 million electron volts (MeV), and that’s why physicists are referring to it as the X17.  But the question is whether the two experiments that find evidence for it are correct.

In the first experiment, whose results appeared in 2015, an experimental team mainly based in Debrecen, Hungary studied large numbers of nuclei of beryllium-8 atoms, which had been raised to an “excited state” (that is, with more energy than usual).  An excited nucleus inevitably disintegrates, and the experimenters studied the debris.  On rare occasions they observed electrons and positrons [a.k.a. anti-electrons], and these behaved in a surprising way, as though they were produced in the decay of a previously unknown particle.

In the newly reported experiment, whose results just appeared, the same team observed  the disintegration of excited nuclei of helium.  They again found evidence for what they hope is the X17, and therefore claim confirmation of their original experiments on beryllium.

When two qualitatively different experiments claim the same thing, they are less likely to be wrong, because it’s not likely that any mistakes in the two experiments would create fake evidence of the same type.  On the face of it, it does seem unlikely that both measurements, carried out on two different nuclei, could fake an X17 particle.

However, we should remain cautious, because both experiments were carried out by the same scientists. They, of course, are hoping for their Nobel Prize (which, if their experiments are correct, they will surely win) and it’s possible they could suffer from unconscious bias. It’s very common for individual scientists to see what they want to see; scientists are human, and hidden biases can lead even the best scientists astray.  Only collectively, through the process of checking, reproducing, and using each other’s work, do scientists create trustworthy knowledge.

So it is prudent to await efforts by other groups of experimenters to search for this proposed X17 particle.  If the X17 is observed by other experiments, then we’ll become confident that it’s real. But we probably won’t know until then.  I don’t currently know whether the wait will be months or a few years.

Why I am so skeptical? There are two distinct reasons.

First, there’s a conceptual, mathematical issue. It’s not easy to construct reasonable equations that allow the X17 to co-exist with all of the known types of elementary particles. That it has a smaller mass than a proton is not a problem per se.  But the X17 needs to have some unique and odd properties in order to (1)  be seen in these experiments, yet (2) not be seen in certain other previous experiments, some of which were explicitly looking for something similar.   To make equations that are consistent with these properties requires some complicated and not entirely plausible trickery.  Is it impossible? No.  But a number of the methods that scientists suggested were flawed, and the ones that remain are, to my eye, a bit contrived.

Of course, physics is an experimental science, and what theorists like me think doesn’t, in the end, matter.  If the experiments are confirmed, theorists will accept the facts and try to understand why something that seems so strange might be true.  But we’ve learned an enormous amount from mathematical thinking about nature in the last century — for instance, it was math that told us that the Higgs particle couldn’t be heavier than 1000 protons, and it was on the basis of that `advice’ that the Large Hadron Collider was built to look for it (and it found it, in 2012.) Similar math led to the discoveries of the W and Z particles roughly where they were expected. So when the math tells you the X17 story doesn’t look good, it’s not reason enough for giving up, but it is reason for some pessimism.

Second, there are many cautionary tales in experimental physics. For instance, back in 2003 there were claims of evidence of a particle called a pentaquark with a rest mass about 1.6 times a proton’s mass — an exotic particle, made from quarks and gluons, that’s both like and unlike a proton.  Its existence was confirmed by multiple experimental groups!  Others, however, didn’t see it. It took several years for the community to come to the conclusion that this pentaquark, which looked quite promising initially, did not in fact exist.

The point is that mistakes do get made in particle hunts, sometimes in multiple experiments, and it can take some time to track them down. It’s far too early to talk about Nobel Prizes.

[Note that the Higgs boson’s discovery was accepted more quickly than most.  It was discovered simultaneously by two distinct experiments using two methods each, and confirmed by additional methods and in larger data sets soon thereafter.  Furthermore,  there were already straightforward equations that happily accommodated it, so it was much more plausible than the X17.] 

And just for fun, here’s a third reason I’m skeptical. It has to do with the number 17. I mean, come on, guys, seriously — 17 million electron volts? This just isn’t auspicious.  Back when I was a student, in the late 1980s and early 90s, there was a set of experiments, by a well-regarded experimentalist, which showed considerable evidence for an additional neutrino with a E=mc² energy of 17 thousand electron volts. Other experiments tried to find it, but couldn’t. Yet no one could find a mistake in the experimenter’s apparatus or technique, and he had good arguments that the competing experiments had their own problems. Well, after several years, the original experimenter discovered that there was a piece of his equipment which unexpectedly could absorb about 17 keV of energy, faking a neutrino signal. It was a very subtle problem, and most people didn’t fault him since no one else had thought of it either. But that was the end of the 17 keV neutrino, and with it went hundreds of research papers by both experimental and theoretical physicists, along with one scientist’s dreams of a place in history.

In short, history is cruel to most scientists who claim important discoveries, and teaches us to be skeptical and patient. If there is a fifth sixth force, we’ll know within a few years. Don’t expect to be sure anytime soon. The knowledge cycle in science runs much, much slower than the twittery news cycle, and that’s no accident; if you want to avoid serious errors that could confuse you for a long time to come, don’t rush to judgment.

John PreskillThe paper that begged for a theme song

A year ago, the “I’m a little teapot” song kept playing in my head.

I was finishing a collaboration with David Limmer, a theoretical chemist at the University of California Berkeley. David studies quantum and classical systems far from equilibrium, including how these systems exchange energy and information with their environments. Example systems include photoisomers.

A photoisomer is a molecular switch. These switches appear across nature and technologies. We have photoisomers in our eyes, and experimentalists have used photoisomers to boost solar-fuel storage. A photoisomer has two functional groups, or collections of bonded atoms, attached to a central axis. 


Your average-Joe photoisomer spends much of its life in equilibrium, exchanging heat with room-temperature surroundings. The molecule has the shape above, called the cis configuration. Imagine shining a laser or sunlight on the photoisomer. The molecule can absorb a photon, or particle of light, gaining energy. The energized switch has the opportunity to switch: One chemical group can rotate downward. The molecule will occupy its trans configuration.


The molecule now has more energy than it had while equilibrium, albeit less energy than it had right after absorbing the photon. The molecule can remain in this condition for a decent amount of time. (Experts: The molecule occupies a metastable state.) That is, the molecule can store sunlight. For that reason, experimentalists at Harvard and MIT attached photoisomers to graphene nanotubules, improving the nanotubules’ storage of solar fuel.

Teapot 1

With what probability does a photoisomer switch upon absorbing a photon? This question has resisted easy answering, because photoisomers prove difficult to model: They’re small, quantum, and far from equilibrium. People have progressed by making assumptions, but such assumptions can lack justifications or violate physical principles. David wanted to derive a simple, general bound—of the sort in which thermodynamicists specialize—on a photoisomer’s switching probability.

He had a hunch as to how he could derive such a bound. I’ve blogged, many times, about thermodynamic resource theories. Thermodynamic resource theories are simple models, developed in quantum information theory, for exchanges of heat, particles, information, and more. These models involve few assumptions: the conservation of energy, quantum theory, and, to some extent, the existence of a large environment (Markovianity). With such a model, David suspected, he might derive his bound.

Teapot 2

I knew nothing about photoisomers when I met David, but I knew about thermodynamic resource theories. I’d contributed to their development, to the theorems that have piled up in the resource-theory corner of quantum information theory. Then, the corner had given me claustrophobia. Those theorems felt so formal, abstract, and idealized. Formal, abstract theory has drawn me ever since I started studying physics in college. But did resource theories model physical reality? Could they impact science beyond our corner of quantum information theory? Did resource theories matter?

I called for connecting thermodynamic resource theories to physical reality four years ago, in a paper that begins with an embarrassing story about me. Resource theorists began designing experiments whose results should agree with our theorems. Theorists also tried to improve the accuracy with which resource theories model experimentalists’ limitations. See David’s and my paper for a list of these achievements. They delighted me, as a step toward the broadening of resource theories’ usefulness. 

Like any first step, this step pointed toward opportunities. Experiments designed to test our theorems essentially test quantum mechanics. Scientists have tested quantum mechanics for decades; we needn’t test it much more. Such experimental proposals can push experimentalists to hone their abilities, but I hoped that the community could accomplish more. We should be able to apply resource theories to answer questions cultivated in other fields, such as condensed matter and chemistry. We should be useful to scientists outside our corner of quantum information.

Teapot 3

David’s idea lit me up like photons on a solar-fuel-storage device. He taught me about photoisomers, I taught him about resource theories, and we derived his bound. Our proof relies on the “second laws of thermodynamics.” These abstract resource-theory results generalize the second law of thermodynamics, which helps us understand why time flows in only one direction. We checked our bound against numerical simulations (experts: of Lindbladian evolution). Our bound is fairly tight if the photoisomer has a low probability of absorbing a photon, as in the Harvard-MIT experiment. 

Experts: We also quantified the photoisomer’s coherences relative to the energy eigenbasis. Coherences can’t boost the switching probability, we concluded. But, en route to this conclusion, we found that the molecule is a natural realization of a quantum clock. Our quantum-clock modeling extends to general dissipative Landau-Zener transitions, prevalent across condensed matter and chemistry.

Teapot 4

As I worked on our paper one day, a jingle unfolded in my head. I recognized the tune first: “I’m a little teapot.” I hadn’t sung that much since kindergarten, I realized. Lyrics suggested themselves: 

I’m a little isomer
with two hands.
Here is my cis pose;
here is my trans.

Stand me in the sunlight;
watch me spin.
I’ll keep solar
energy in!

The song lodged itself in my head for weeks. But if you have to pay an earworm to collaborate with David, do.

November 22, 2019

Doug NatelsonRecent results on the arxiv

Here are a few interesting results I stumbled upon recently:
  • This preprint has become this Science paper that was published this week.  The authors take a cuprate superconductor (YBCO) and use reactive ion etching to pattern an array of holes in the film.  Depending on how long they etch, they can kill global superconductivity but leave the system such that it still behaves as a funny kind of metal (resistance decreasing with decreasing temperature), with some residual resistance at low temperatures.  The Hall effect in this metallic state produces no signal - a sign that the is balance between particle-like and hole-like carriers (particle-hole symmetry).  For magnetic field perpendicular to the film, they also see magnetoresistance with features periodic in flux through one cell of the pattern, with a periodicity that indicates the charge carriers have charge 2e.  This is an example of a "Bose metal".  Neat!  (The question about whether there are pairs without superconductivity touches on our own recent work.)
  • This preprint was recently revised (and thus caught my eye in the arxiv updates).  In it, the authors are using machine learning to try to find new superconductors.  The results seem encouraging.  I do wonder if one could do a more physics-motivated machine learning approach (that is, something with an internal structure to the classification scheme and the actual weighting procedure) to look at this and other related problems (like identifying which compounds might be growable via which growth techniques).
  • This preprint is not a condensed matter topic, but has gotten a lot of attention.  The authors look at a particular nuclear transition in 4He, and find a peculiar angular distribution for the electron-positron pairs that come out.  The reason this is of particular interest is that this paper by the same investigators looking at a nuclear transition in 8Be three years ago found something very similar.  If one assumes that there is a previously unobserved boson (a dark matter candidate perhaps) of some sort with a mass of around 17 MeV that couples in there, that could explain both results.  Intriguing, but it would be great if these observations were confirmed independently by a different group.  

Matt von HippelQCD and Reductionism: Stranger Than You’d Think

Earlier this year, I made a list of topics I wanted to understand. The most abstract and technical of them was something called “Wilsonian effective field theory”. I still don’t understand Wilsonian effective field theory. But while thinking about it, I noticed something that seemed weird. It’s something I think many physicists already understand, but that hasn’t really sunk in with the public yet.

There’s an old problem in particle physics, described in many different ways over the years. Take our theories and try to calculate some reasonable number (say, the angle an electron turns in a magnetic field), and instead of that reasonable number we get infinity. We fix this problem with a process called renormalization that hides that infinity away, changing the “normalization” of some constant like a mass or a charge. While renormalization first seemed like a shady trick, physicists eventually understood it better. First, we thought of it as a way to work around our ignorance, that the true final theory would have no infinities at all. Later, physicists instead thought about renormalization in terms of scaling.

Imagine looking at the world on a camera screen. You can zoom in, or zoom out. The further you zoom out, the more details you’ll miss: they’re just too small to be visible on your screen. You can guess what they might be, but your picture will be different depending on how you zoom.

In particle physics, many of our theories are like that camera. They come with a choice of “zoom setting”, a minimum scale where they still effectively tell the whole story. We call theories like these effective field theories. Some physicists argue that these are all we can ever have: since our experiments are never perfect, there will always be a scale so small we have no evidence about it.

In general, theories can be quite different at different scales. Some theories, though, are especially nice: they look almost the same as we zoom in to smaller scales. The only things that change are the mass of different particles, and their charges.


One theory like this is Quantum Chromodynamics (or QCD), the theory of quarks and gluons. Zoom in, and the theory looks pretty much the same, with one crucial change: the force between particles get weaker. There’s a number, called the “coupling constant“, that describes how strong a force is, think of it as sort of like an electric charge. As you zoom in to quarks and gluons, you find you can still describe them with QCD, just with a smaller coupling constant. If you could zoom “all the way in”, the constant (and thus the force between particles) would be zero.

This makes QCD a rare kind of theory: one that could be complete to any scale. No matter how far you zoom in, QCD still “makes sense”. It never gives contradictions or nonsense results. That doesn’t mean it’s actually true: it interacts with other forces, like gravity, that don’t have complete theories, so it probably isn’t complete either. But if we didn’t have gravity or electricity or magnetism, if all we had were quarks and gluons, then QCD could have been the final theory that described them.

And this starts feeling a little weird, when you think about reductionism.

Philosophers define reductionism in many different ways. I won’t be that sophisticated. Instead, I’ll suggest the following naive definition: Reductionism is the claim that theories on larger scales reduce to theories on smaller scales.

Here “reduce to” is intentionally a bit vague. It might mean “are caused by” or “can be derived from” or “are explained by”. I’m gesturing at the sort of thing people mean when they say that biology reduces to chemistry, or chemistry to physics.

What happens when we think about QCD, with this intuition?

QCD on larger scales does indeed reduce to QCD on smaller scales. If you want to ask why QCD on some scale has some coupling constant, you can explain it by looking at the (smaller) QCD coupling constant on a smaller scale. If you have equations for QCD on a smaller scale, you can derive the right equations for a larger scale. In some sense, everything you observe in your larger-scale theory of QCD is caused by what happens in your smaller-scale theory of QCD.

But this isn’t quite the reductionism you’re used to. When we say biology reduces to chemistry, or chemistry reduces to physics, we’re thinking of just a few layers: one specific theory reduces to another specific theory. Here, we have an infinite number of layers, every point on the scale from large to small, each one explained by the next.

Maybe you think you can get out of this, by saying that everything should reduce to the smallest scale. But remember, the smaller the scale the smaller our “coupling constant”, and the weaker the forces between particles. At “the smallest scale”, the coupling constant is zero, and there is no force. It’s only when you put your hand on the zoom nob and start turning that the force starts to exist.

It’s reductionism, perhaps, but not as we know it.

Now that I understand this a bit better, I get some of the objections to my post about naturalness a while back. I was being too naive about this kind of thing, as some of the commenters (particularly Jacques Distler) noted. I believe there’s a way to rephrase the argument so that it still works, but I’d have to think harder about how.

I also get why I was uneasy about Sabine Hossenfelder’s FQXi essay on reductionism. She considered a more complicated case, where the chain from large to small scale could be broken, a more elaborate variant of a problem in Quantum Electrodynamics. But if I’m right here, then it’s not clear that scaling in effective field theories is even the right way to think about this. When you have an infinite series of theories that reduce to other theories, you’re pretty far removed from what most people mean by reductionism.

Finally, this is the clearest reason I can find why you can’t do science without an observer. The “zoom” is just a choice we scientists make, an arbitrary scale describing our ignorance. But without it, there’s no way to describe QCD. The notion of scale is an inherent and inextricable part of the theory, and it doesn’t have to mean our theory is incomplete.

Experts, please chime in here if I’m wrong on the physics here. As I mentioned at the beginning, I still don’t think I understand Wilsonian effective field theory. If I’m right though, this seems genuinely weird, and something more of the public should appreciate.

Tommaso DorigoThe 17 MeV Anomaly That Would Not Die

Is there a fifth force of Nature, beyond the four we know about ? This question has been around ever since it was understood that 
1 - electric charges attract and repel, and influence one another, due to the action of the electromagnetic force;

2 - hadronic matter is held together by the strong force;

3 - quarks transmute into other quarks due to the action of the weak force (and leptons do that too);

4 - bodies carrying mass feel attracted to one another, although very weakly, by the gravitational force.

read more

November 21, 2019

John PreskillOn the Coattails of Quantum Supremacy

Most readers have by now heard that Google has “achieved” quantum “supremacy”. Notice the only word not in quotes is “quantum”, because unlike previous proposals that have also made some waves, quantumness is mostly not under review here. (Well, neither really are the other two words, but that story has already been covered quite eloquently by John, Scott, and Toby.) The Google team has managed to engineer a device that, although noisy, can do the right thing a large-enough fraction of the time for people to be able to “quantify its quantumness”.

However, the Google device, while less so than previous incarnations, is still noisy. Future devices like it will continue to be noisy. Noise is what makes quantum computers so darn difficult to build; it is what destroys the fragile quantum superpositions that we are trying so hard to protect (remember, unlike a classical computer, we are not protecting things we actually observe, but their superposition).

Protecting quantum information is like taking your home-schooled date (who has lived their entire life in a bunker) to the prom for the first time. It is a fun and necessary part of a healthy relationship to spend time in public, but the price you pay is the possibility that your date will hit it off with someone else. This will leave you abandoned, dancing alone to Taylor Swift’s “You Belong With Me” while crying into your (spiked?) punch.

When the environment corrupts your quantum date.

The high school sweetheart/would-be dance partner in the above provocative example is the quantum superposition — the resource we need for a working quantum computer. You want it all to yourself, but your adversary — the environment — wants it too. No matter how much you try to protect it, you’ll have to observe it eventually (after all, you want to know the answer to your computation). And when you do (take your date out onto the crowded dance floor), you run the risk of the environment collapsing the information before you do, leaving you with nothing.

Protecting quantum information is also like (modern!) medicine. The fussy patient is the quantum information, stored in delicate superposition, while quantumists are the doctors aiming to prevent the patient from getting sick (or “corrupted”). If our patient incurs say “quasiparticle poisoning”, we first diagnose the patient’s syndromes, and, based on this diagnosis, apply procedures like “lattice surgery” and “state injection” to help our patient successfully recover.

The medical analogy to QEC, noticed first by Daniel Litinski. All terms are actually used in papers. Cartoon by Jose Gonzalez.

Error correction with qubits

Error correction sounds hard, and it should! Not to fear: plenty of very smart people have thought hard about this problem, and have come up with a plan — to redundantly encode the quantum superposition in a way that allows protection from errors caused by noise. Such quantum error-correction is an expansion of the techniques we currently use to protect classical bits in your phone and computer, but now the aim is to protect, not the definitive bit states 0 or 1, but their quantum superpositions. Things are even harder now, as the protection machinery has to do its magic without disturbing the superposition itself (after all, we want our quantum calculation to run to its conclusion and hack your bank).

For example, consider a qubit — the fundamental quantum unit represented by two shelves (which, e.g., could be the ground and excited states of an atom, the absence or presence of a photon in a box, or the zeroth and first quanta of a really cold LC circuit). This qubit can be in any quantum superposition of the two shelves, described by 2 probability amplitudes, one corresponding to each shelf. Observing this qubit will collapse its state onto either one of the shelves, changing the values of the 2 amplitudes. Since the resource we use for our computation is precisely this superposition, we definitely do not want to observe this qubit during our computation. However, we are not the only ones looking: the environment (other people at the prom: the trapping potential of our atom, the jiggling atoms of our metal box, nearby circuit elements) is also observing this system, thereby potentially manipulating the stored quantum state without our knowledge and ruining our computation.

Now consider 50 such qubits. Such a space allows for a superposition with 2^{50} different amplitudes (instead of just 2^1 for the case of a single qubit). We are once again plagued by noise coming from the environment. But what if we now, less ambitiously, want to store only one qubit’s worth of information in this 50-qubit system? Now there is room to play with! A clever choice of how to do this (a.k.a. the encoding) helps protect from the bad environment. 

The entire prospect of building a bona-fide quantum computer rests on this extra overhead or quantum redundancy of using a larger system to encode a smaller one. It sounds daunting at first: if we need 50 physical qubits for each robust logical qubit, then we’d need “I-love-you-3000” physical qubits for 60 logical ones? Yes, this is a fact we all have to live with. But granted we can scale up our devices to that many qubits, there is no fundamental obstacle that prevents us from then using error correction to make next-level computers.

To what extent do we need to protect our quantum superposition from the environment? It would be too ambitious to protect it from a meteor shower. Or a power outage (although that would be quite useful here in California). So what then can we protect against?

Our working answer is local noise — noise that affects only a few qubits that are located near each other in the device. We can never be truly certain if this type of noise is all that our quantum computers will encounter. However, our belief that this is the noise we should focus on is grounded in solid physical principles — that nature respects locality, that affecting things far away from you is harder than making an impact nearby. (So far Google has not reported otherwise, although much more work needs to be done to verify this intuition.)

The harmonic oscillator

In what other ways can we embed our two-shelf qubit into a larger space? Instead of scaling up using many physical qubits, we can utilize a fact that we have so far swept under the rug: in any physical system, our two shelves are already part of an entire bookcase! Atoms have more than one excited state, there can be more than one photon in a box, and there can be more than one quantum in a cold LC circuit. Why don’t we use some of that higher-energy space for our redundant encoding?

The noise in our bookcase will certainly be different, since the structure of the space, and therefore the notion of locality, is different. How to cope with this? The good news is that such a space — the space of the harmonic oscillator — also has a(t least one) natural notion of locality!

Whatever the incarnation, the oscillator has associated with it a position and momentum (different jargon for these quantities may be used, depending on the context, but you can just think of a child on a swing, just quantized). Anyone who knows the joke about Heisenberg getting pulled over, will know that these two quantities cannot be set simultaneously.

Cartoon by Jose Gonzalez.

Nevertheless, local errors can be thought of as small shifts in position or momentum, while nonlocal errors are ones that suddenly shift our bewildered swinging quantized child from one side of the swing to the other.

Armed with a local noise model, we can extend our know-how from multi-qubit land to the oscillator. One of the first such oscillator codes were developed by Gottesman, Kitaev, and Preskill (GKP). Proposed in 2001, GKP encodings posed a difficult engineering challenge: some believed that GKP states could never be realized, that they “did not exist”. In the past few years however, GKP states have been realized nearly simultaneously in two experimental platforms. (Food for thought for the non-believers!)

Parallel to GKP codes, another promising oscillator encoding using cat states is also being developed. This encoding has historically been far easier to create experimentally. It is so far the only experimental procedure achieving the break-even point, at which the actively protected logical information has the same lifetime as the system’s best unprotected degree of freedom.

Can we mix and match all of these different systems? Why yes! While Google is currently trying to build the surface code out of qubits, using oscillators (instead of qubits) for the surface code and encoding said oscillators either in GKP (see related IBM post) [1,2,3] or cat [4,5] codes is something people are seriously considering. There is even more overhead, but the extra information one gets from the correction procedure might make for a more fault-tolerant machine. With all of these different options being explored, it’s an exciting time to be into quantum!


It turns out there are still other systems we can consider, although because they are sufficiently more “out there” at the moment, I should first say “bear with me!” as I explain. Forget about atoms, photons in a box, and really cold LC circuits. Instead, consider a rigid 3-dimensional object whose center of mass has been pinned in such a way that the object can rotate any way it wants. Now, “quantize” it! In other words, consider the possibility of having quantum superpositions of different orientations of this object. Just like superpositions of a dead and alive cat, of a photon and no photon, the object can be in quantum superposition of oriented up, sideways, and down, for example. Superpositions of all possible orientations then make up our new configuration space (read: playground), and we are lucky that it too inherits many of the properties we know and love from its multi-qubit and oscillator cousins.

Examples of rigid bodies include airplanes (which can roll, pitch and yaw, even while “fixed” on a particular trajectory vector) and robot arms (which can rotate about multiple joints). Given that we’re not quantizing those (yet?), what rigid body should we have in mind as a serious candidate? Well, in parallel to the impressive engineering successes of the multi-qubit and oscillator paradigms, physicists and chemists have made substantial progress in trapping and cooling molecules. If a trapped molecule is cold enough, it’s vibrational and electronic states can be neglected, and its rotational states form exactly the rigid body we are interested in. Such rotational states, as far as we can tell, are not in the realm of Avengers-style science fiction.

Superpositions of molecular orientations don’t violate the Deutsch proposition.

The idea to use molecules for quantum computing dates all the way back to a 2001 paper by Dave DeMille, but in a recent paper by Jacob Covey, John Preskill, and myself, we propose a framework of how to utilize the large space of molecular orientations to protect against (you guessed it!) a type of local noise. In the second part of the story, called “Quantum Error Correction with Molecules“, I will cover a particular concept that is not only useful for a proper error-correcting code (classical and quantum), but also one that is quite fun to try and understand. The concept is based on a certain kind of tiling, called Voronoi tiles or Thiessen polygons, which can be used to tile anything from your bathroom floor to the space of molecular orientations. Stay tuned!

John BaezApplied Category Theory Meeting at UCR (Part 3)


We had a special session on applied category theory here at UCR:

Applied category theory, Fall Western Sectional Meeting of the AMS, 9–10 November 2019, U.C. Riverside.

I was bowled over by the large number of cool ideas. I’ll have to blog about some of them. A bunch of people stayed for a few days afterwards, and we had lots of great conversations.

The biggest news was that Brendan Fong and David Spivak definitely want to set up an applied category theory in the San Francisco Bay Area, which they’re calling the Topos Institute. They are now in the process of raising funds for this institute! I plan to be involved, so I’ll be saying more about this later.

But back to the talks. We didn’t make videos, but here are the slides. Click on talk titles to see abstracts of the talks. For a multi-author talk, the person whose name is in boldface is the one who gave the talk. You also might enjoy comparing the 2017 talks.

Saturday November 9, 2019

8:00 a.m.
Fibrations as generalized lens categoriestalk slides.
David I. Spivak, Massachusetts Institute of Technology

9:00 a.m.
Supplying bells and whistles in symmetric monoidal categoriestalk slides.
Brendan Fong, Massachusetts Institute of Technology
David I. Spivak, Massachusetts Institute of Technology

9:30 a.m.
Right adjoints to operadic restriction functorstalk slides.
Philip Hackney, University of Louisiana at Lafayette
Gabriel C. Drummond-Cole, IBS Center for Geometry and Physics

10:00 a.m.
Duality of relationstalk slides.
Alexander Kurz, Chapman University

10:30 a.m.
A synthetic approach to stochastic maps, conditional independence, and theorems on sufficient statisticstalk slides.
Tobias Fritz, Perimeter Institute for Theoretical Physics

3:00 p.m.
Constructing symmetric monoidal bicategories functoriallytalk slides.
Michael Shulman, University of San Diego
Linde Wester Hansen, University of Oxford

3:30 p.m.
Structured cospanstalk slides.
Kenny Courser, University of California, Riverside
John C. Baez, University of California, Riverside

4:00 p.m.
Generalized Petri netstalk slides.
Jade Master, University of California, Riverside

4:30 p.m.
Formal composition of hybrid systemstalk slides and website.

Paul Gustafson, Wright State University
Jared Culbertson, Air Force Research Laboratory
Dan Koditschek, University of Pennsylvania
Peter Stiller, Texas A&M University

5:00 p.m.
Strings for cartesian bicategoriestalk slides.
M. Andrew Moshier, Chapman University

5:30 p.m.
Defining and programming generic compositions in symmetric monoidal categoriestalk slides.
Dmitry Vagner, Los Angeles, CA

Sunday November 10, 2019

8:00 a.m.
Mathematics for second quantum revolutiontalk slides.
Zhenghan Wang, UCSB and Microsoft Station Q

9:00 a.m.
A compositional and statistical approach to natural languagetalk slides.
Tai-Danae Bradley, CUNY Graduate Center

9:30 a.m.
Exploring invariant structure in neural activity with applied topology and category theorytalk slides.
Brad Theilman, UC San Diego
Krista Perks, UC San Diego
Timothy Q Gentner, UC San Diego

10:00 a.m.
Of monks, lawyers and villages: new insights in social network science — talk cancelled due to illness.
Nina Otter, Mathematics Department, UCLA
Mason A. Porter, Mathematics Department, UCLA

10:30 a.m.
Functorial cluster embeddingtalk slides.

Steve Huntsman, BAE Systems FAST Labs

2:00 p.m.
Quantitative equational logictalk slides.
Prakash Panangaden, School of Computer Science, McGill University
Radu Mardare, Strathclyde University
Gordon D. Plotkin, University of Edinburgh

3:00 p.m.
Brakes: an example of applied category theorytalk slides in PDF and Powerpoint.
Eswaran Subrahmanian, Carnegie Mellon University / National Institute of Standards and Technology

3:30 p.m.
Intuitive robotic programming using string diagramstalk slides.
Blake S. Pollard, National Institute of Standards and Technology

4:00 p.m.
Metrics on functor categoriestalk slides.
Vin de Silva, Department of Mathematics, Pomona College

4:30 p.m.
Hausdorff and Wasserstein metrics on graphs and other structured datatalk slides.
Evan Patterson, Stanford University

John PreskillQuantum Error Correction with Molecules

In the previous blog post (titled, “On the Coattails of Quantum Supremacy“) we started with Google and ended up with molecules! I also mentioned a recent paper by John Preskill, Jake Covey, and myself (see also this videoed talk) where we assume that, somewhere in the (near?) future, experimentalists will be able to construct quantum superpositions of several orientations of molecules or other rigid bodies. Next, I’d like to cover a few more details on how to construct error-correcting codes for anything from classical bits in your phone to those future quantum computers, molecular or otherwise.

Classical error correction: the basics

Error correction is concerned with the design of an encoding that allows for protection against noise. Let’s say we want to protect one classical bit, which is in either “0” or “1”. If the bit is say in “0”, and the environment (say, the strong magnetic field from a magnet you forgot was laying next to your hard drive) flipped it to “1” without our knowledge, an error would result (e.g., making your phone think you swiped right!)

Now let’s encode our single logical bit into three physical bits, whose 2^3=8 possible states are represented by the eight corners of the cube below. Let’s encode the logical bit as “0” —> 000 and “1” —> 111, corresponding to the corners of the cube marked by the black and white ball, respectively. For our (local) noise model, we assume that flips of only one of the three physical bits are more likely to occur than flips of two or three at the same time.

Error correction is, like many Hollywood movies, an origin story. If, say, the first bit flips in our above code, the 000 state is mapped to 100, and 111 is mapped to 011. Since we have assumed that the most likely error is a flip of one of the bits, we know upon observing that 100 must have come from the clean 000, and 011 from 111. Thus, in either case of the logical bit being “0” or “1”, we can recover the information by simply observing which state the majority of the bits are in. The same things happen when the second or third bits flip. In all three cases, the logical “0” state is mapped to one of its three neighboring points (above, in blue) while the logical “1” is mapped to its own three points, which, crucially, are distinct from the neighbors of “0”. The set of points \{000,100,010,001\} that are closer to 000 than to 111 is called a Voronoi tile.

Now, let’s adapt these ideas to molecules. Consider the rotational states of a dumb-bell molecule consisting of two different atoms. (Let’s assume that we have frozen this molecule to the point that the vibration of the inter-atomic bond is limited, essentially creating a fixed distance between the two atoms.) This molecule can orient itself in any direction, and each such orientation can be represented as a point \mathbf{v} on the surface of a sphere. Now let us encode a classical bit using the north and south poles of this sphere (represented in the picture below as a black and a white ball, respectively). The north pole of the sphere corresponds to the molecule being parallel to the z-axis, while the south pole corresponds to the molecule being anti-parallel.

This time, the noise consists of small shifts in the molecule’s orientation. Clearly, if such shifts are small, the molecule just wiggles a bit around the z-axis. Such wiggles still allow us to infer that the molecule is (mostly) parallel and anti-parallel to the axis, as long as they do not rotate the molecule all the way past the equator. Upon such correctable rotations, the logical “0” state — the north pole — is mapped to a point in the northern hemisphere, while logical “1” — the south pole — is mapped to a point in the southern hemisphere. The northern hemisphere forms a Voronoi tile of the logical “0” state (blue in the picture), which, along with the corresponding tile of the logical “1” state (the southern hemisphere), tiles the entire sphere.

Quantum error correction

To upgrade these ideas to the quantum realm, recall that this time we have to protect superpositions. This means that, in addition to shifting our quantum logical state to other states as before, noise can also affect the terms in the superposition itself. Namely, if, say, the superposition is equal — with an amplitude of +1/\sqrt{2} in “0” and +1/\sqrt{2} in “1” — noise can change the relative sign of the superposition and map one of the amplitudes to -1/\sqrt{2}. We didn’t have to worry about such sign errors before, because our classical information would always be the definite state of “0” or “1”. Now, there are two effects of noise to worry about, so our task has become twice as hard!

Not to worry though. In order to protect against both sources of noise, all we need to do is effectively stagger the above constructions. Now we will need to design a logical “0” state which is itself a superposition of different points, with each point separated from all of the points that are superimposed to make the logical “1” state.

Diatomic molecules: For the diatomic molecule example, consider superpositions of all four corners of two antipodal tetrahedra for the two respective logical states.


The logical “0” state for the quantum code is now itself a quantum superposition of orientations of our diatomic molecule corresponding to the four black points on the sphere to the left (the sphere to the right is a top-down view). Similarly, the logical “1” quantum state is a superposition of all orientations corresponding to the white points.

Each orientation (black or white point) present in our logical states rotates under fluctuations in the position of the molecule. However, the entire set of orientations for say logical “0” — the tetrahedron — rotates rigidly under such rotations. Therefore, the region from which we can successfully recover after rotations is fully determined by the Voronoi tile of any one of the corners of the tetrahedron. (Above, we plot the tile for the point at the north pole.) This cell is clearly smaller than the one for classical north-south-pole encoding we used before. However, the tetrahedral code now provides some protection against phase errors — the other type of noise that we need to worry about if we are to protect quantum information. This is an example of the trade-off we must make in order to protect against both types of noise; a licensed quantum mechanic has to live with such trade-offs every day.

Oscillators: Another example of a quantum encoding is the GKP encoding in the phase space of the harmonic oscillator. Here, we have at our disposal the entire two-dimensional plane indexing different values of position and momentum. In this case, we can use a checkerboard approach, superimposing all points at the centers of the black squares for the logical “0” state, and similarly all points at the centers of the white squares for the logical “1”. The region depicting correctable momentum and position shifts is then the Voronoi cell of the point at the origin: if a shift takes our central black point to somewhere inside the blue square, we know (most likely) where that point came from! In solid state circles, the blue square is none other than the primitive or unit cell of the lattice consisting of points making up both of the logical states.

Asymmetric molecules (a.k.a. rigid rotors): Now let’s briefly return to molecules. Above, we considered diatomic molecules that had a symmetry axis, i.e., that were left unchanged under rotations about the axis that connects the two atoms. There are of course more general molecules out there, including ones that are completely asymmetric under any possible (proper) 3D rotation (see figure below for an example).

mol-f0 - blog

BONUS: There is a subtle mistake relating to the geometry of the rotation group in the labeling of this figure. Let me know if you can find it in the comments!

All of the orientations of the asymmetric molecule, and more generally a rigid body, can no longer be parameterized by the sphere. They can be parameterized by the 3D rotation group \mathsf{SO}(3): each orientation of an asymmetric molecule is labeled by the 3D rotation necessary to obtain said orientation from a reference state. Such rotations, and in turn the orientations themselves, are parameterized by an axis \mathbf{v} (around which to rotate) and an angle \omega (by which one rotates). The rotation group \mathsf{SO}(3) luckily can still be viewed by humans on a sheet of paper. Namely, \mathsf{SO}(3) can be thought of as a ball of radius \pi with opposite points identified. The direction of each vector \omega\mathbf{v} lying inside the ball corresponds to the axis of rotation, while the length corresponds to the angle. This may take some time to digest, but it’s not crucial to the story.

So far we’ve looked at codes defined on cubes of bits, spheres, and phase-space lattices. Turns out that even \mathsf{SO}(3) can house similar encodings! In other words, \mathsf{SO}(3) can also be cut up into different Voronoi tiles, which in turn can be staggered to create logical “0” and “1” states consisting of different molecular orientations. There are many ways to pick such states, corresponding to various subgroups of \mathsf{SO}(3). Below, we sketch two sets of black/white points, along with the Voronoi tile corresponding to the rotations that are corrected by each encoding.

Voronoi tiles of the black point at the center of the ball representing the 3D rotation group, for two different molecular codes. This and the Voronoi cells corresponding to the other points tile together to make up the entire ball. 3D printing all of these tiles would make for cool puzzles!

In closing…

Achieving supremacy was a big first step towards making quantum computing a practical and universal tool. However, the largest obstacles still await, namely handling superposition-poisoning noise coming from the ever-curious environment. As quantum technologies advance, other possible routes for error correction are by encoding qubits in harmonic oscillators and molecules, alongside the “traditional” approach of using arrays of physical qubits. Oscillator and molecular qubits possess their own mechanisms for error correction, and could prove useful (granted that the large high-energy space required for the procedures to work can be accessed and controlled). Even though molecular qubits are not yet mature enough to be used in quantum computers, we have at least outlined a blueprint for how some of the required pieces can be built. We are by no means done however: besides an engineering barrier, we need to further develop how to run robust computations on these exotic spaces.

Author’s note: I’d like to acknowledge Jose Gonzalez for helping me immensely with the writing of this post, as well as for drawing the comic panels in the previous post. The figures above were made possible by Mathematica 12.

November 20, 2019

Jordan EllenbergKhot,Minzer,Safra on approximate sections of sheaves

Subhash Khot is giving a talk at Current Developments in Math this year and his talk has the intriguing phrase “Grassmann graph” in it so I thought I’d look up what it is and what he and his collaborators prove about it, and indeed it’s interesting! I’m just going to jot down something I learned from “Pseudorandom Sets in Grassmann Graph have Near-Perfect Expansion,” by Khot, Dor Minzer, and Muli Safra, in a way that makes it sound like something algebraic geometers might be interested in, which, indeed, I think they might be!

Suppose you have a sheaf F on a space, and the space has a covering U_1, .. U_N. The sheaf axiom says that if we have a family of sections s_i of F(U_i) such that s_i and s_j agree on U_i \cap U_j for all i,j, then there is actually a global section s in F(X) which restricts to each s_i.

What if we only have an approximate section? That is: what if we have a family of s_i such that: if I select i and j uniformly at random, the probability that s_i and s_j agree on U_i \cap U_j is bounded below by some p > 0. Call such a family a “p-section.” (You should take the view that this is really a family of problems with X changing and N growing, so that when I say p > 0 the content is that p is bounded away from some 0 uniformly in X,N.)

The question is then: Is an approximate section approximately a section?

(This is meant to recall the principle from additive number theory that an approximate subgroup is approximately a subgroup, as in e.g. Freiman-Rusza.)

That is: if s_1, .. s_N from a p-section, is there some actual section s in F(X) such that, for i chosen uniformly at random,

\mathbf{Pr}(s | U_i) = s_i > p' > 0

for some p’ depending only on p?

The case which turns out to be relevant to complexity theory is the Grassmann graph, which we can describe as follows: X is a k-dimensional vector space over F_2 and the U_i are the l-dimensional vector spaces for some integer l. But we do something slightly weird (which is what makes it the Grassmann graph, not the Grassmann simplicial complex) and declare that the only nonempty intersections are those where U_i \cap U_j has dimension l-1. The sheaf is the one whose sections on U_i are the linear functions from U_i to F_2.

Speculation 1.7 in the linked paper is that an approximate section is approximately a section. This turns out not to be true! Because there are large sets of U_i whose intersection with the rest of X is smaller than you might expect. This makes sense: if X is a space which is connected but which is “almost a disjoint union of X_1 and X_2,” i.e. X_1 \cup X_2 = X and $\latex X_1 \cap X_2$ involves very few of the U_i, then by choosing a section of F(X_1) and a section of F(X_2) independently you can get an approximate section which is unlikely to be approximated by any actual global section.

But the good news is that, in the case at hand, that ends up being the only problem. Khot-Minzer-Safra classify the “approximately disconnected” chunks of X (they are collections of l-dimensional subspaces containing a fixed subspace of small dimension and contained in a fixed subspace of small codimension) and show that any approximate section of F is approximated by a section on some such chunk; this is all that is needed to prove the “2-to-2 games conjecture” in complexity theory, which is their ultimate goal.

So I found all this quite striking! Do questions about approximate global sections being approximated by global sections appear elsewhere? (The question as phrased here is already a bit weird from an algebraic geometry point of view, since it seems to require that you have or impose a probability measure on your set of open patches, but maybe that’s natural in some cases?)

November 18, 2019

John PreskillBas|ket>ball: A Game for Young Students Learning Quantum Computing

It is no secret that quantum computing has recently become one of the trendiest topics within the physics community, gaining financial support and good press at an ever increasing pace. The new technology not only promises huge advances in information processing, but it also – in theory – has the potential to crack the encryption that currently protects sensitive information inside governments and businesses around the world. Consequently, quantum research has extended beyond academic groups and has entered the technical industry, creating new job opportunities for both experimentalists and theorists. However, in order for this technology to become a reality, we need qualified engineers and scientists that can fill these positions.

Increasing the number of individuals with an interest in this field starts with educating our youth. While it does not take particularly advanced mathematics to explain the basics of quantum computing, there are still involved topics such as quantum superposition, unitary evolution, and projective measurement that can be difficult to conceptualize. In order to explain these topics at a middle and high school level to encourage more students to enter this area, we decided to design an educational game called Bas|ket>ball, which allows students to directly engage with quantum computing concepts outside of the classroom while being physically active. 

After playing the game with students in our local Quantum Information High (QIHigh) Program at Stevens Institute of Technology, we realized that the game is a fun learning tool worth sharing with the broader physics community. Here, we describe a non-gender specific activity that can be used to effectively teach the basics of quantum computing at a high school level. 

QI-High Program Group at Stevens Institute of Technology, Hoboken, NJ
From the left: Instructor Sean Karg, Grace Conlin, Alex White, Emma Dezenzo, Instructor Kaitlin Gili, Program Advisor Dr. Rupak Chatterjee

Quantum Basketball is something that helps you understand a very confusing topic, especially for a ninth grader! In the QI-High Program at Stevens, I was approached with a challenge of learning about quantum computing, and while I was hesitant at first, my mentors made the topic so much more understandable by relating it to a sport that I love!

Grace Conlin, Freshman Student from High Tech High School

The Rules of Bas|ket>ball

The game can have up to 10 student players and only requires one basketball. Each player acts as a quantum bit (qubit) in a quantum register and is initialized to the |0> position. During each turn, a player will perform one of the allowed quantum gates depending on their position on the court. A diagram of the court positions is displayed at the bottom. 

There are four options of quantum gates from which players can choose to move around the court: 

  1. X Gate – This single qubit gate will take a player from the |0> to the |1> position, and vice versa.
  2. Hadamard Gate – This single qubit gate will take a player from the |0> to the (|0> + |1>) / \sqrt{2} position and the |1> to the (|0> – |1>) / \sqrt{2} position, and vice versa. 
  3. Control-Not Gate – This two-qubit gate allows one player to control another only if they are in the |1> position, or in superposition between |0> and |1>. The player in the |1> position can move a player back and forth between the |0> and |1> positions. The player in the superposition can choose to entangle with a player in the |0> position.
  4. Z Measurement – The player takes a shot. The player measures a 1 if he/she makes the shot and measures a 0 if he/she misses. Once the player shoots, he/she has to return back to the |0> position no matter what was measured. 

The first player to measure ten 1’s (make ten shots) wins! In order to make the game more interesting, the following additional rules are put in place: 

  1. Each player has one SWAP gate that can be utilized per game to switch positions with any other player, including players that are entangled. This is an effective way to replace yourself with someone in an entangled state. 
  2. Up to five players can be entangled at any given time. The only way to break the entanglement is to make a Z Measurement by taking a shot. If one of the entangled players makes a shot, each player entangled with that player receives a point value equal to the number of individuals they are entangled with (including themselves). If the player misses, the entanglement is broken and no points are awarded. Either way, all players go back to the |0> position. 

Example Bas|ket>ball Match

For example, let’s say that we have three student players. Each will start at the red marker, behind the basketball hoop. One by one, each student will choose from the list of gate operations above. If the first player chooses an X-gate, he/she will physically move to the blue marker and be in a better position to make a measurement (take a shot) during the next turn. If the second player chooses to perform a Hadamard gate, he/she will move to the green marker. Each of the students will continue to move around the court and score points by making measurements until 10 points are reached. 

However, things can get more interesting when players start to form alliances by entangling. If player 1 is at the green marker and player 3 is at the red marker, then player 1 can perform a C-Not gate on player 3 to become entangled with them. Now if either player takes a shot and scores a point, both players will be awarded 2 points (1 x the number of players entangled). 

Diagram of basketball court that describes the positions on the court that correspond to the respective quantum states around the hoop.

We believe that simple games such as this, along with Quantum TiqTaqToe and Quantum Chess, will attract more young students to pursue degrees in physics and computer science, and eventually specialize in quantum information and computing fields. Not only is this important for the overall progression of the field, but also to encourage more diversity and inclusion in STEM.

November 15, 2019

Matt von HippelGuest Post: On the Real Inhomogeneous Universe and the Weirdness of ‘Dark Energy’

A few weeks ago, I mentioned a paper by a colleague of mine, Mohamed Rameez, that generated some discussion. Since I wasn’t up for commenting on the paper’s scientific content, I thought it would be good to give Rameez a chance to explain it in his own words, in a guest post. Here’s what he has to say:

In an earlier post, 4gravitons had contemplated the question of ‘when to trust the contrarians’, in the context of our about-to-be-published paper in which we argue that accounting for the effects of the bulk flow in the local Universe, there is no evidence for any isotropic cosmic acceleration, which would be required to claim some sort of ‘dark energy’.

In the following I would like to emphasize that this is a reasonable view, and not a contrarian one. To do so I will examine the bulk flow of the local Universe and the historical evolution of what appears to be somewhat dodgy supernova data. I will present a trivial solution (from data) to the claimed ‘Hubble tension’.  I will then discuss inhomogeneous cosmology, and the 2011 Nobel prize in Physics. I will proceed to make predictions that can be falsified with future data. I will conclude with some questions that should be frequently asked.

Disclaimer: The views expressed here are not necessarily shared by my collaborators. 

The bulk flow of the local Universe:

The largest anisotropy in the Cosmic Microwave Background is the dipole, believed to be caused by our motion with respect to the ‘rest frame’ of the CMB with a velocity of ~369 km s^-1. Under this view, all matter in the local Universe appear to be flowing. At least out to ~300 Mpc, this flow continues to be directionally coherent, to within ~40 degrees of the CMB dipole, and the scale at which the average relative motion between matter and radiation converges to zero has so far not been found.

This is one of the most widely accepted results in modern cosmology, to the extent that SN1a data come pre ‘corrected’ for it.

Such a flow has covariant consequences under general relativity and this is what we set out to test.

Supernova data, directions in the sky and dodgyness:

Both Riess et al 1998 and Perlmutter et al 1999 used samples of supernovae down to redshifts of 0.01, in which almost all SNe at redshifts below 0.1 were in the direction of the flow.

Subsequently in Astier et al 2006, Kowalsky et al 2008, Amanullah et al 2010 and Suzuki et al 2011, it is reported that a process of outlier rejection was adopted in which data points >3\sigma from the Hubble diagram were discarded. This was done using a highly questionable statistical method that involves adjusting an intrinsic dispersion term \sigma_{\textrm{int}} by hand until a \chi^2/\textrm{ndof} of 1 is obtained to the assumed \LambdaCDM model. The number of outliers rejected is however far in excess of 0.3% – which is the 3\sigma expectation. As the sky coverage became less skewed, supernovae with redshift less than ~0.023 were excluded for being outside the Hubble flow. While the Hubble diagram so far had been inferred from heliocentric redshifts and magnitudes, with the introduction of SDSS supernovae that happened to be in the direction opposite to the flow, peculiar velocity ‘corrections’ were adopted in the JLA catalogue and supernovae down to extremely low redshifts were reintroduced. While the early claims of a cosmological constant were stated as ‘high redshift supernovae were found to be dimmer (15% in flux) than the low redshift supernovae (compared to what would be expected in a \Lambda=0 universe)’, it is worth noting that the peculiar velocity corrections change the redshifts and fluxes of low redshift supernovae by up to ~20 %.

When it was observed that even with this ‘corrected’ sample of 740 SNe, any evidence for isotropic acceleration using a principled Maximum Likelihood Estimator is less than 3\sigma , it was claimed that by adding 12 additional parameters (to the 10 parameter model) to allow for redshift and sample dependence of the light curve fitting parameters, the evidence was greater than 4\sigma .

As we discuss in Colin et al. 2019, these corrections also appear to be arbitrary, and betray an ignorance of the fundamentals of both basic statistical analysis and relativity. With the Pantheon compilation, heliocentric observables were no longer public and these peculiar velocity corrections initially extended far beyond the range of any known flow model of the Local Universe. When this bug was eventually fixed, both the heliocentric redshifts and magnitudes of the SDSS SNe that filled in the ‘redshift desert’ between low and high redshift SNe were found to be alarmingly discrepant. The authors have so far not offered any clarification of these discrepancies.

Thus it seems to me that the latest generation of ‘publicly available’ supernova data are not aiding either open science or progress in cosmology.

A trivial solution to the ‘Hubble tension’?

The apparent tension between the Hubble parameter as inferred from the Cosmic Microwave Background and low redshift tracers has been widely discussed, and recent studies suggest that redshift errors as low as 0.0001 can have a significant impact. Redshift discrepancies as big as 0.1 have been reported. The shifts reported between JLA and Pantheon appear to be sufficient to lower the Hubble parameter from ~73 km s^-1 Mpc^-1 to ~68 km s^-1 Mpc^-1.

On General Relativity, cosmology, metric expansion and inhomogeneities:

In the maximally symmetric Friedmann-Lemaitre-Robertson-Walker solution to general relativity, there is only one meaningful global notion of distance and it expands at the same rate everywhere. However, the late time Universe has structure on all scales, and one may only hope for statistical (not exact) homogeneity. The Universe is expected to be lumpy. A background FLRW metric is not expected to exist and quantities analogous to the Hubble and deceleration parameters will vary across the sky.  Peculiar velocities may be more precisely thought of as variations in the expansion rate of the Universe. At what rate does a real Universe with structure expand? The problems of defining a meaningful average notion of volume, its dynamical evolution, and connecting it to observations are all conceptually open.

On the 2011 Nobel Prize in Physics:

The Fitting Problem in cosmology was written in 1987. In the context of this work and the significant theoretical difficulties involved in inferring fundamental physics from the real Universe, any claims of having measured a cosmological constant from directionally skewed, sparse samples of intrinsically scattered observations should have been taken with a grain of salt.  By honouring this claim with a Nobel Prize, the Swedish Academy may have induced runaway prestige bias in favour of some of the least principled analyses in science, strengthening the confirmation bias that seems prevalent in cosmology.

This has resulted in the generation of a large body of misleading literature, while normalizing the practice of ‘massaging’ scientific data. In her recent video about gravitational waves, Sabine Hossenfelder says “We should not hand out Nobel Prizes if we don’t know how the predictions were fitted to the data”. What about when the data was fitted (in 1998-1999) using a method that has been discredited in 1989 to a toy model that has been cautioned against in 1987, leading to a ‘discovery’ of profound significance to fundamental physics?

A prediction with future cosmological data:

With the advent of high statistics cosmological data in the future, such as from the Large Synoptic Survey Telescope, I predict that the Hubble and deceleration parameters inferred from supernovae in hemispheres towards and away from the CMB dipole will be found to be different in a statistically significant (>5\sigma ) way. Depending upon the criterion for selection and blind analyses of data that can be agreed upon, I would be willing to bet a substantial amount of money on this prediction.

Concluding : on the amusing sociology of ‘Dark Energy’ and manufactured concordance:

Of the two authors of the well-known cosmology textbook ‘The Early Universe’, Edward Kolb writes these interesting papers questioning dark energy while Michael Turner is credited with coining the term ‘Dark Energy’.  Reasonable scientific perspectives have to be presented as ‘Dark Energy without dark energy’. Papers questioning the need to invoke such a mysterious content that makes up ‘68% of the Universe’ are quickly targeted by inane articles by non-experts or perhaps well-meant but still misleading YouTube videos. Much of this is nothing more than a spectacle.

In summary, while the theoretical debate about whether what has been observed as Dark Energy is the effect of inhomogeneities is ongoing, observers appear to have been actively using the most inhomogeneous feature of the local Universe through opaque corrections to data, to continue claiming that this ‘dark energy’ exists.

It is heartening to see that recent works lean toward a breaking of this manufactured concordance and speak of a crisis for cosmology.

Questions that should be frequently asked:

Q. Is there a Hubble frame in the late time Universe?

A. The Hubble frame is a property of the FLRW exact solution, and in the late time Universe in which galaxies and clusters have peculiar motions with respect to each other, an equivalent notion does not exist. While popular inference treats the frame in which the CMB dipole vanishes as the Hubble frame, the scale at which the bulk flow of the local Universe converges to that frame has never been found. We are tilted observers.

Q. I am about to perform blinded analyses on new cosmological data. Should I correct all my redshifts towards the CMB rest frame?

A. No. Correcting all your redshifts towards a frame that has never been found is a good way to end up with ‘dark energy’. It is worth noting that while the CMB dipole has been known since 1994, supernova data have been corrected towards the CMB rest frame only after 2010, for what appear to be independent reasons.

Q. Can I combine new data with existing Supernova data?

A. No. The current generation of publicly available supernova data suffer from the natural biases that are to be expected when data are compiled incrementally through a human mediated process. It would be better to start fresh with a new sample.

Q. Is ‘dark energy’ fundamental or new physics?

A. Given that general relativity is a 100+ year old theory and significant difficulties exist in describing the late time Universe with it, it is unnecessary to invoke new fundamental physics when confronting any apparent acceleration of the real Universe. All signs suggest that what has been ascribed to dark energy are the result of a community that is hell bent on repeating what Einstein supposedly called his greatest mistake.

Digging deeper:

The inquisitive reader may explore the resources on inhomogeneous cosmology, as well as the works of George Ellis, Thomas Buchert and David Wiltshire.

Tommaso DorigoVenice Flooded Again Today, While Precious Manuscripts Get Wasted

Last November 12 the city of Venice was flooded by the second-highest tide in recorded history. The sea level, pushed by 60 mph SE winds and intense rainfalls, surged to +187 cm above average, a mere 7cm less than the disastrous event of November 4 1966, which put the city and its surroundings to their knees.

read more

November 14, 2019

Peter Rohde Australian PM Scott Morrison’s 2-minute video challenge goes viral

Several days ago NZ PM Jacinda Ardern released her now-viral 2-minute policy achievement challenge.

I have prepared a response on behalf of Australia.

November 12, 2019

Tommaso DorigoAndrea Giammanco: The Top Quark As A New Tool For Nuclear Physics

[I extorted the text below from my friend and colleague Andrea Giammanco, who has been one of the driving forces behind a new scientific result at the crossroads of nuclear and particle physics - TD]

A new result by the CMS collaboration [CMS-PAS-HIN-19-001,] demonstrates for the first time that top quarks can be produced in nucleus-nucleus collisions.

read more

Scott AaronsonAnnual recruitment post

Just like I did last year, and the year before, I’m putting up a post to let y’all know about opportunities in our growing Quantum Information Center at UT Austin.

I’m proud to report that we’re building something pretty good here. This fall Shyam Shankar joined our Electrical and Computer Engineering (ECE) faculty to do experimental superconducting qubits, while (as I blogged in the summer) the quantum complexity theorist John Wright will join me on the CS faculty in Fall 2020. Meanwhile, Drew Potter, an expert on topological qubits, rejoined our physics faculty after a brief leave. Our weekly quantum information group meeting now regularly attracts around 30 participants—from the turnout, you wouldn’t know it’s not MIT or Caltech or Waterloo. My own group now has five postdocs and six PhD students—as well as some amazing undergrads striving to meet the bar set by Ewin Tang. Course offerings in quantum information currently include Brian La Cour’s Freshman Research Initiative, my own undergrad Intro to Quantum Information Science honors class, and graduate classes on quantum complexity theory, experimental realizations of QC, and topological matter (with more to come). We’ll also be starting an undergraduate Quantum Information Science concentration next fall.

So without further ado:

(1) If you’re interested in pursuing a PhD focused on quantum computing and information (and/or classical theoretical computer science) at UT Austin: please apply! If you want to work with me or John Wright on quantum algorithms and complexity, apply to CS (I can also supervise physics students in rare cases). Also apply to CS, of course, if you want to work with our other CS theory faculty: David Zuckerman, Dana Moshkovitz, Adam Klivans, Anna Gal, Eric Price, Brent Waters, Vijaya Ramachandran, or Greg Plaxton. If you want to work with Drew Potter on nonabelian anyons or suchlike, or with Allan MacDonald, Linda Reichl, Elaine Li, or others on many-body quantum theory, apply to physics. If you want to work with Shyam Shankar on superconducting qubits, apply to ECE. Note that the deadline for CS and physics is December 1, while the deadline for ECE is December 15.

You don’t need to ask me whether I’m on the lookout for great students: I always am! If you say on your application that you want to work with me, I’ll be sure to see it. Emailing individual faculty members is not how it works and won’t help. Admissions are extremely competitive, so I strongly encourage you to apply broadly to maximize your options.

(2) If you’re interested in a postdoc in my group, I’ll have approximately two openings starting in Fall 2020. To apply, just send me an email by January 1, 2020 with the following info:
– Your CV
– 2 or 3 of your best papers (links or PDF attachments)
– The names of two recommenders (who should email me their letters separately)

(3) If you’re on the faculty job market in quantum computing and information—well, please give me a heads-up if you’re potentially interested in Austin! Our CS, physics, and ECE departments are all open to considering additional candidates in quantum information, both junior and senior. I can’t take credit for this—it surely has to do with developments beyond my control, both at UT and beyond—but I’m happy to relay that, in the three years since I arrived in Texas, the appetite for strengthening UT’s presence in quantum information has undergone jaw-dropping growth at every level of the university.

Also, Austin-Bergstrom International Airport now has direct flights to London, Frankfurt, and (soon) Amsterdam and Paris.

Hook ’em Hadamards!

Doug NatelsonAdvice on proposal writing

Many many people have written about how to write scientific grant proposals, and much of that advice is already online.   Rather than duplicate that work, and recognizing that sometimes different people need to hear advice in particular language, I want to link to some examples.

  • Here (pdf) is some advice straight from the National Science Foundation about how to write a compelling proposal.  It's older (2004) and a bit out of date, but the main points are foundational.
  • This is a very good collection of advice that has been updated (2015) to reflect current practice about NSF.  
  • Here are lecture notes from a course at Illinois that touched on this as well, generalizing beyond the NSF.
Fundamentally, sponsored academic research is an odd thing.  You are trying to convince an agency or foundation with finite (often very finite) resources that allocating some of their precious support to you will be a good thing.  Limiting the conversation to the often ill-named "basic research" (see here and the book therein for a discussion of "basic" vs "applied"), this means work where the primary products of the research are (i) fundamental advances in our understanding of some system or phenomena; (ii) personnel trained in scientific/engineering/research knowledge and skills; (iii) scholarly publications (and/or patents for more technology-focused topics) that report the results, with the intent of propagating the work to the community and having an impact. 

This last one has taken a pre-eminent position of importance because it's something that can be readily counted and measured.  There is a rough rule that many program officers in NSF and DOE will tell you; averaging over their programs, they get roughly one high impact paper per $100K total cost.  They would like more, of course. 

Talk with program officers before writing and submitting - know the audience.  Program officers (including foundational ones) tend to take real pride in their portfolios.  Everyone likes funding successful, high-impact, exciting, trend-setting work.  Still, particular program officers have areas of emphasis, in part so that there is not duplication of effort or support within an agency or across agencies.  (This is especially true in areas like high energy theory, where if you've got DOE funding, you essentially can't get NSF support, and vice versa.)  You will be wasting your time if you submit to the wrong program or pitch your idea to the wrong reviewing audience.   NSF takes a strong line that their research directions are broadly set by the researchers themselves, via their deep peer review process (mail-in reviews, in-person or virtual panel discussions) and workshops that define programmatic goals.  DOE likewise has workshops to help define major challenges and open questions, though my sense is that the department takes a more active role in delineating priorities.   The DOD is more goal-directed, with program officers having a great deal of sway on topics of interest, and the prospect that such research may transition closer to technology-readiness.  Foundations are idiosyncratic, but a common refrain is that they prefer to fund topics that are not already supported by federal agencies.

Think it through, and think like a referee.  When coming up with an idea, do your best to consider in some detail how you would actually pull this off.  How could you tell if it works?  What would the implications be of success?  What are the likely challenges and barriers?  If some step doesn't go as planned, is it a show-stopper, or are their other ways to go?  As an experimentalist:  Do you have the tools you need to do this?  How big a signal are you trying to detect?   Remember, referees are frequently asked to evaluate strengths and weaknesses of technical approach.  Better to have this in mind while at an early stage of the process.

Clearly state the problem, and explain the proposal's organization.  Reviewers might be asked to read several proposals in a short timeframe.  It seems like a good idea to say up front, in brief (like in a page or so):  What is the problem?  What are the open scientific/engineering questions you are specifically addressing?  What is your technical approach?  What will the results mean?  Then, explain the organization of the proposal (e.g., section 2 gives a more detailed introduction to the problem and open questions; section 3 explains the technical approach, including a timeline of proposed work; etc.).  This lets readers know where to find things. 

I'll confess:  I got this organizational approach by emulating the structure of an excellent proposal that I reviewed a number of years ago.  It was really terrific - clear; pedagogical, so that a non-expert in that precise area could understand the issues and ideas; very cleanly written; easy-to-read figures, including diagrams that really showed how the ideas would work.   Reviewing proposals is very helpful in improving your own.  Very quickly you will get a sense of what you think makes a good or bad proposal.  NSF is probably the most open to getting new investigators involved in the reviewing process. 

Don't wait until the last minute.  You know that classmate of yours from undergrad days, the one who used to brag about how they waited until the night before to blitz through a 20 page writing assignment?  Amazingly, some of these people end up as successful academics.  I genuinely don't know how they do it, because these days research funding is so competitive and proposals are detailed and complicated.  There are many little formatting details that agencies enforce now.  You don't want to get to an hour before the deadline and realize that all of your bibliographic references are missing a URL field.   People really do read sections like data management plans and postdoctoral mentoring plans - you can't half-ass them.   Also, while it is unlikely to sink a really good proposal, it definitely comes across badly to referees if there are missing or mislabeled references, figures, etc. 

I could write more, and probably will amend this down the line, but work calls and this is at least a start.

November 11, 2019

Tommaso DorigoA Seminar On Mendeleev

This year marks the 150th anniversary of the Periodic Table of Elements, whose inventor is Dmitrii Mendeleev, a Russian physicist who is famous for that achievement but who actually gave enormous contributions to Physics in a number of different areas of experimental research. It is also well known, but actually a misconception, that Mendeleev "invented" the correct recipe for the Russian national drink, vodka. In fact, he studied the mixture of water and alcohol in detail, discovering several of its interesting properties, but vodka was appreciated before him, as it did after.

read more

November 08, 2019

John BaezWhy Is Category Theory a Trending Topic?

I wrote something for the Spanish newspaper El País, which has a column on mathematics called “Café y Teoremas”. Ágata Timón helped me a lot with writing this, and she also translated it into Spanish:

• John Baez, Qué es la teoría de categorías y cómo se ha convertido en tendencia, El País, 8 November 2019.

Here’s the English-language version I wrote. It’s for a general audience so don’t expect hard-core math!

Why has “category theory” become a trending topic?

Recently, various scientific media have been paying attention to a branch of mathematics called “category theory” that has become pretty popular inside the mathematical community in recent years. Some mathematicians are even starting to complain on Twitter that more people are tweeting about category theory than their own specialties. But what is this branch of mathematics, and why is it becoming so fashionable?

Category theory was invented in 1945 as a general technique to transform problems in one field of pure mathematics into problems in another field, where they could be solved. For example, we know that at any moment there must be a location on the surface of the Earth there where the wind velocity is zero. This is a marvelous result—but to prove this result, we must translate it into a fact about algebra, and a bit of category theory is very helpful here. More difficult results often require more category theory. The proof of Fermat’s Last Theorem, for example, builds on a vast amount of 20th-century mathematics, in which category theory plays a crucial role.

Category theory is sometimes called “the mathematics of mathematics”, since it stands above many other fields of mathematics, connecting and uniting them. Unfortunately even mathematicians have a limited tolerance for this high level of abstraction. So, for a long time many mathematicians called category theory “abstract nonsense”—using it reluctantly when it was necessary for their work, but not really loving it.

On the other hand, other mathematicians embraced the beauty and power of category theory. Thus, its influence has gradually been spreading. Since the 1990s, it has been infiltrating computer science: for example, new programming languages like Haskell and Scala use ideas from this subject. But now we are starting to see people apply category theory to chemistry, electrical engineering, and even the design of brakes in cars! “Applied category theory”, once an oxymoron, is becoming a real subject.

To understand this we need a little taste of the ideas. A category consists of a set of “objects” together with “morphisms”—some kind of processes, or paths—going between these objects. For example, we could take the objects to be cities, and the morphisms to be routes from one city to another. The key requirement is that if we have a morphism from an object x to an object y and a morphism from y to an object z, we can “compose” them and get a morphism from x to z. For example, if you have a way to drive from Madrid to Seville and a way to drive from Seville to Faro, that gives a way to drive from Madrid to Faro. Thus there is a category of cities and routes between them.

In mathematics, this focus on morphisms represented a radical shift of viewpoint. Starting around 1900, logicians tried to build the whole of mathematics on solid foundations. This turned out to be a difficult and elusive task, but their best attempt at the time involved “set theory”. A set is simply a collection of elements. In set theory as commonly practiced by mathematicians, these elements are also just sets. In this worldview, everything is just a set. It is a static worldview, as if we had objects but no morphisms. On the other hand, category theory builds on set theory by emphasizing morphisms—ways of transforming things—as equal partners to things themselves. It is not incompatible with set theory, but it offers new ways of thinking.

The idea of a category is simple. Exploiting it is harder. A loose group of researchers are starting to apply category theory to subjects beyond pure mathematics. The key step is to focus a bit less on things and a bit more on morphisms, which are ways to go between things, or ways to transform one thing into another. This is attitude is well suited to computer programming: a program is a way to transform input data into output data, and composing programs is the easiest way to build complicated programs from simpler ones. But personally, I am most excited by applications to engineering and the natural sciences, because these are newer and more surprising.

I was very pleased when two of my students got internships at the engineering firm Siemens, applying category theory to industrial processes. The first, Blake Pollard, now has a postdoctoral position at the National Institute of Standards and Technology in the USA. Among other things, he has used a programming method based on category theory to help design a “smart grid”—an electrical power network that is flexible enough to handle the ever-changing power generated by thousands of homes equipped with solar panels.

Rumors say that soon there may even be an institute of applied category theory, connecting mathematicians to programmers and businesses who need this way of thinking. It is too early to tell if this is the beginning of a trend, but my friends and colleagues on Twitter are very excited.

Matt von HippelAcademia Has Changed Less Than You’d Think

I recently read a biography of James Franck. Many of you won’t recognize the name, but physicists might remember the Franck-Hertz experiment from a lab class. Franck and Hertz performed a decisive test of Bohr’s model of the atom, ushering in the quantum age and receiving the 1925 Nobel Prize. After fleeing Germany when Hitler took power, Franck worked on the Manhattan project and co-authored the Franck Report urging the US not to use nuclear bombs on Japan. He settled at the University of Chicago, which named an institute after him.*

You can find all that on his Wikipedia page. The page also mentions his marriage later in life to Hertha Sponer. Her Wikipedia page talks about her work in spectroscopy, about how she was among the first women to receive a PhD in Germany and the first on the physics faculty at Duke University, and that she remained a professor there until 1966, when she was 70.

Neither Wikipedia page talks about two-body problems, or teaching loads, or pensions.

That’s why I was surprised when the biography covered Franck’s later life. Until Franck died, he and Sponer would travel back and forth, he visiting her at Duke and she visiting him in Chicago. According to the biography, this wasn’t exactly by choice: they both would have preferred to live together in the same city. Somehow though, despite his Nobel Prize and her scientific accomplishments, they never could. The biography talks about how the university kept her teaching class after class, so she struggled to find time for research. It talks about what happened as the couple got older, as their health made it harder and harder to travel back and forth, and they realized that without access to their German pensions they would not be able to support themselves in retirement. The biography gives the impression that Sponer taught till 70 not out of dedication but because she had no alternative.

When we think about the heroes of the past, we imagine them battling foes with historic weight: sexism, antisemitism, Nazi-ism. We don’t hear about their more everyday battles, with academic two-body problems and stingy universities. From this, we can get the impression that the dysfunctions of modern academia are new. But while the problems have grown, we aren’t the first academics with underpaid, overworked teaching faculty, nor the first to struggle to live where we want and love who we want. These are struggles academics have faced for a long, long time.

*Full disclosure: Franck was also my great-great-grandfather, hence I may find his story more interesting than most.

Scott AaronsonThe morality of quantum computing

This morning a humanities teacher named Richard Horan, having read my NYT op-ed on quantum supremacy, emailed me the following question about it:

Is this pursuit [of scalable quantum computation] just an arms race? A race to see who can achieve it first? To what end? Will this achievement yield advances in medical science and human quality of life, or will it threaten us even more than we are threatened presently by our technologies? You seem rather sanguine about its possible development and uses. But how close does the hand on that doomsday clock move to midnight once we “can harness an exponential number of amplitudes for computation”?

I thought this question might possibly be of some broader interest, so here’s my response (with some light edits).

Dear Richard,

A radio interviewer asked me a similar question a couple weeks ago—whether there’s an ethical dimension to quantum computing research.  I replied that there’s an ethical dimension to everything that humans do.

A quantum computer is not like a nuclear weapon: it’s not going to directly kill anybody (unless the dilution refrigerator falls on them or something?).  It’s true that a full, scalable QC, if and when it’s achieved, will give a temporary advantage to people who want to break certain cryptographic codes.  The morality of that, of course, could strongly depend on whether the codebreakers are working for the “good guys” (like the Allies during WWII) or the “bad guys” (like, perhaps, Trump or Vladimir Putin or Xi Jinping).

But in any case, there’s already a push to switch to new cryptographic codes that already exist and that we think are quantum-resistant.  An actual scalable QC on the horizon would of course massively accelerate that push.  And once people make the switch, we expect that security on the Internet will be more-or-less back where it started.

Meanwhile, the big upside potential from QCs is that they’ll provide an unprecedented ability to simulate physics and chemistry at the molecular level.  That could at least potentially help with designing new medicines, as well as new batteries and solar cells and carbon capture technologies—all things that the world desperately needs right now.

Also, the theory developed around QC has already led to many new and profound insights about physics and computation.  Some of us regard that as an inherent good, in the same way that art and music and literature are.

Now, one could argue that the climate crisis, or various other crises that our civilization faces, are so desperate that instead of working to build QCs, we should all just abandon our normal work and directly confront the crises, as (for example) Greta Thunberg is doing.  On some days I share that position.  But of course, were the position upheld, it would have implications not just for quantum computing researchers but for almost everyone on earth—including humanities teachers like yourself.


November 06, 2019

Cylindrical OnionPatatrack Hackathons 7th edition. Working hard having fun

Guest Post by Felice Pantaleo - CERN

Patatrack Hackathon's group, 7th edition. Picture credit: Felice Pantaleo

November 04, 2019

Robert HellingOn Nuclear Fusion (in German)

Florian Freistetter has posed the challenge in his blog to write a generally accessible text on why nuclear fusion works. Here is my attempt (according to the rules in German):

Gleich und gleich gesellt sich gern

Wassertropfen auf einer Oberfläche, Fettaugen in der Suppe, Bläschen in der Limo (oder im Körper eines Tauchers, siehe mein anderes Blog) und eben Atomkerne, diese Phänomene haben gemeinsam, dass die Oberflächenspannung eine entscheidende Rolle spielt.

Allen ist es gemeinsam, dass es eine Substanz (Wasser, Fett, Gas, Nukleonen - also Protonen und Neutronen, die Bestandteile des Atomkerns) gibt, die es am liebsten hat, wenn sie mit sich selbst umgeben ist und nicht mit der Umgebung (Luft, Suppe - hauptsächlich Wasser, Limo oder Vakuum). In all diesen Beispielen kann sich die Substanz besser arrangieren, wenn sie von ihresgleichen umgeben ist. Eine Grenzfläche hingegen kostet Energie, die Grenzflächenenergie.

In guter Näherung ist dieser Energiekosten proportional zur Fläche dieser Grenzfläche. Wenn es schon eine Grenzfläche geben muss, ist es am günstigsten, diese möglichst klein zu halten. Da die Substanzmenge und damit ihr Volumen jeweils unveränderlich ist, stellt sich eine runde Form (Kreisscheibe oder Kugel, je nach dem ob wir es mit etwas zweidimensionalem wie Fettaugen oder dreidimensionalen wie Bläschen zu tun haben) ein, die für eben dieses Volumen die kleinste Oberfläche hat (im zweidimensionalen Fall ist entsprechend die "Oberfläche" die Randlänge, während das "Volumen" der Flächeninhalt ist).

Was passiert aber, wenn zwei Tropfen, Fettaugen, Bläschen oder Atomkerne zusammenkommen? Wenn sie sich vereinigen, ist das vereinigte Volumen so groß wie die beiden Volumina vorher zusammen. Die Oberfläche ist jedoch kleiner als die Summe der Oberflächen vorher. Daher ist weniger Grenzflächenenergie nötig, der Rest an Energie wird frei. Bei der Suppe ist das so wenig, dass man es normalerweise eben nur daran merkt, dass sich die Fettaugen zu immer größeren vereinigen, beim Atomkern ist es aber so viel (einige Megaelektronenvolt pro Kern), dass man damit ein Fusionskraftwerk oder einen Stern betreiben kann.

Die freiwerdende Energie kommt also daher, dass weniger Nukleonen eine offene Flanke zum Vakuum haben, für sie ist es günstiger direkt nebeneinander zu liegen.

Soweit das qualitative Bild. Wir können es aber auch leicht quantitativ machen: Das Tröpfchen, das aus der Vereinigung zweier kleinerer entstanden ist, muss das doppelte Volumen der Ausgangströpfchen haben. Da aber das Volumen eines dreidimensionalen Körpers mit der dritten Potenz seines Durchmessers wächst, hat das große Tröpfchen nicht den doppelten Durchmesser der kleinen Tröpfchen, sondern ist nur um den Faktor $2^{1/3}$, also um die dritte Wurzel aus 2, etwa 1,26 größer. 

Die Oberfläche wächst hingegen quadratisch mit dem Durchmesser, ist also um den Faktor $2^{2/3}$, also etwa 1,59 größer. Am Anfang hatten wir jedoch zwei Tröpfchen, also auch zweimal die Oberfläche, am Ende nur noch 1,59 mal die Oberfläche eines kleinen Tröpfchens. Wir haben also die Grenzflächenenergie im Umfang von 0,41 klein-Tröpfchenoberflächen gewonnen.

Daher werden sich mit der Zeit immer mehr Tröpfchen zu wenigen großen Tropfen vereinigen, da letztere weniger Oberfläche zur Luft in der Summe haben.

Genau das gleiche ist es bei Atomkernen. Auch diese verkleinern durch Zusammenkommen die Gesamtoberfläche zum Vakuum und bei dieser Vereinigung oder Fusion wird die entsprechende Oberflächenenergie frei.

Allerdings gibt es bei Atomkernen noch weitere energetische Beiträge, die vor allem bei großen Kernen mit vielen Nukleonen wichtig werden und dafür sorgen, dass zu große Kerne zwar eine kleinere Oberfläche als die Summe der möglichen Bruchstücke haben, aber trotzdem energetisch ungünstiger sind, so dass eine energetisch günstigste Kerngröße gibt (dies ist, wenn ich mich richtig an mein Studium erinnere, der Kern des Elements Eisen).

Da ist zunächst das "Pauli-Verbot", das verhindert, dass zwei Nukleonen in genau dem gleichen Zustand im Atomkern sind. Sie müssen sich in mindestens einem Aspekt unterscheiden. Dies kann zB ihr Drehimpuls (Spin) sein oder aber ihr "Isospin", also ob sie ein Proton oder ein Neutron sind. Wenn sie aber in all diesen Aspekten übereinstimmen, müssen sie wenigstens verschiedene Energieniveaus im Kern einnehmen. Kommen weitere Nukleonen hinzu, sind die untersten Energieniveaus schon besetzt und sie müssen ein höheres einnehmen (was eben diese Energie kostet).  

Innerhalb des Kerns können sich aber Neutronen und Protonen und zurück ineinander umwandeln (dies ist der beta-Zerfall), kommt also etwa ein Neutron hinzu und müsste ein hohes Neutronen-Energieniveu besetzten, kann es sich, wenn ein günstigeres Protonen-Niveau noch frei ist, in ein Proton umwandeln (es sendet dazu ein Elektron und ein Antineutrino aus, damit auch mit der Ladung alles stimmt). Hier gibt es einen Energiebeitrag der jeweils einzeln von der Protonen- und der Neutronenzahl ist und teurer wird, je größer der Kern ist.

Ein weiterer Effekt ist, dass eben die Protonen elektrisch geladen ist und die anderen Protonen abstößt. Dies benötigt auch Energie in der Gesamtenergiebilanz eines Atomkerns, die proportional zum Quadrat der Protonenzahl ist (also ungünstig für zu große Kerne ist).

Wenn man all dies zusammenzählt, sieht man, dass man bei kleinen Kernen erstmal sehr viel Grenzflächen-Energie gewinnt, wenn man diese zu einem größeren vereinigt. Ab einer mittleren Kerngr öße fangen dann die anderen Effekte an zu überwiegen und zu große Kerne sind auch wieder nicht günstig, weswegen man auf durch Kernspaltung, also die Auftrennen solcher zu großen Kerne wieder Energie gewinnen kann.

November 01, 2019

Tim GowersAdvances in Combinatorics fully launched

It’s taken longer than we originally intended, but I am very happy to report that Advances in Combinatorics, a new arXiv overlay journal that is run along similar lines to Discrete Analysis, has just published its first five articles, with plenty more in the pipeline.

I am excited by the business model of the journal, which is that its very small running costs (like Discrete Analysis, it uses the Scholastica platform, which charges $10 per submission, as well as a fixed annual charge of $250, and there are a few other costs such as archiving articles with CLOCKSS and having DOIs) are being met, for the next five years in the first instance, by the library of Queens University in Kingston Ontario, who are also providing us with very useful administrative support. My dream would be for other libraries to have the foresight to support similar ventures, since the potential for savings if the stranglehold of the big commercial publishers is broken is huge. I do not underestimate the obstacles, but for a long time I have felt that what we are waiting for is a tipping point, and even quite a small venture can in principle have a significant effect on when that tipping point occurs.

The aim of Advances in Combinatorics is to be a top specialist combinatorics journal. Information about the editorial board, the submission process, and of course the articles themselves, can all be found on the website. Like Discrete Analysis, the journal has Editorial Introductions to each article, with the aim of making the webpage informative and fun to browse. We will be grateful for any feedback, and even more grateful for support in the form of excellent combinatorics papers while we are still getting established.

A final remark is that although I have reached the limit of the number of journals of this type that I can be closely involved in, I would be delighted to hear from anybody who thinks that they can put together a high-quality editorial board in an area that does not have a suitable journal for people who want to publish good papers without supporting the big commercial publishers. I know people who can offer advice about suitable platforms, funding, and similar matters. It would be great if free-to-read free-to-publish journals could cover all of mathematics, but we are still a long way from that at the moment.

October 28, 2019

John PreskillThe quantum steampunker by Massachusetts Bay

Every spring, a portal opens between Waltham, Massachusetts and another universe. 

The other universe has a Watch City dual to Waltham, known for its watch factories. The cities throw a festival to which explorers, inventors, and tourists flock. Top hats, goggles, leather vests, bustles, and lace-up boots dot the crowds. You can find pet octopodes, human-machine hybrids, and devices for bending space and time. Steam powers everything.

Watch City

Watch City Steampunk Festival

So I learned thanks to Maxim Olshanyi, a professor of physics at the University of Massachusetts Boston. He hosted my colloquium, “Quantum steampunk: Quantum information meets thermodynamics,” earlier this month. Maxim, I discovered, has more steampunk experience than I. He digs up century-old designs for radios, builds the radios, and improves upon the designs. He exhibits his creations at the Watch City Steampunk Festival.

Maxim photo

Maxim Olshanyi

I never would have guessed that Maxim moonlights with steampunkers. But his hobby makes sense: Maxim has transformed our understanding of quantum integrability.

Integrability is to thermalization as Watch City is to Waltham. A bowl of baked beans thermalizes when taken outside in Boston in October: Heat dissipates into the air. After half-an-hour, large-scale properties bear little imprint of their initial conditions: The beans could have begun at 112ºF or 99º or 120º. Either way, the beans have cooled.

Integrable systems avoid thermalizing; more of their late-time properties reflect early times. Why? We can understand through an example, an integrable system whose particles don’t interact with each other (whose particles are noninteracting fermions). The dynamics conserve the particles’ momenta. Consider growing the system by adding particles. The number of conserved quantities grows as the system size. The conserved quantities retain memories of the initial conditions.

Imagine preparing an integrable system, analogously to preparing a bowl of baked beans, and letting it sit for a long time. Will the system equilibrate, or settle down to, a state predictable with a simple rule? We might expect not. Obeying the same simple rule would cause different integrable systems to come to resemble each other. Integrable systems seem unlikely to homogenize, since each system retains much information about its initial conditions.

Boston baked beans

Maxim and collaborators exploded this expectation. Integrable systems do relax to simple equilibrium states, which the physicists called the generalized Gibbs ensemble (GGE). Josiah Willard Gibbs cofounded statistical mechanics during the 1800s. He predicted the state to which nonintegrable systems, like baked beans in autumnal Boston, equilibrate. Gibbs’s theory governs classical systems, like baked beans, as does the GGE theory. But also quantum systems equilibrate to the GGE, and Gibbs’s conclusions translate into quantum theory with few adjustments. So I’ll explain in quantum terms.

Consider quantum baked beans that exchange heat with a temperature-T environment. Let \hat{H} denote the system’s Hamiltonian, which basically represents the beans’ energy. The beans equilibrate to a quantum Gibbs state, e^{ - \hat{H} / ( k_{\rm B} T ) } / Z. The k_{\rm B} denotes Boltzmann’s constant, a fundamental constant of nature. The partition function Z enables the quantum state to obey probability theory (normalizes the state).

Maxim and friends modeled their generalized Gibbs ensemble on the Gibbs state. Let \hat{I}_m denote a quantum integrable system’s m^{\rm th} conserved quantity. This system equilibrates to e^{ - \sum_m \lambda_m \hat{I}_m } / Z_{\rm GGE}. The Z_{\rm GGE} normalizes the state. The intensive parameters \lambda_m’s serve analogously to temperature and depend on the conserved quantities’ values. Maxim and friends predicted this state using information theory formalized by Ed Jaynes. Inventing the GGE, they unlocked a slew of predictions about integrable quantum systems. 


A radio built by Maxim. According to him, “The invention was to replace a diode with a diode bridge, in a crystal radio, thus gaining a factor of two in the output power.”

I define quantum steampunk as the intersection of quantum theory, especially quantum information theory, with thermodynamics, and the application of this intersection across science. Maxim has used information theory to cofound a branch of quantum statistical mechanics. Little wonder that he exhibits homemade radios at the Watch City Steampunk Festival. He also holds a license to drive steam engines and used to have my postdoc position. I appreciate having older cousins to look up to. Here’s hoping that I become half the quantum steampunker that I found by Massachusetts Bay.

With thanks to Maxim and the rest of the University of Massachusetts Boston Department of Physics for their hospitality.

The next Watch City Steampunk Festival takes place on May 9, 2020. Contact me if you’d attend a quantum-steampunk meetup!

October 23, 2019

Terence TaoAlmost all Collatz orbits attain almost bounded values

I’ve just uploaded to the arXiv my paper “Almost all Collatz orbits attain almost bounded values“, submitted to the proceedings of the Forum of Mathematics, Pi. In this paper I returned to the topic of the notorious Collatz conjecture (also known as the {3x+1} conjecture), which I previously discussed in this blog post. This conjecture can be phrased as follows. Let {{\bf N}+1 = \{1,2,\dots\}} denote the positive integers (with {{\bf N} =\{0,1,2,\dots\}} the natural numbers), and let {\mathrm{Col}: {\bf N}+1 \rightarrow {\bf N}+1} be the map defined by setting {\mathrm{Col}(N)} equal to {3N+1} when {N} is odd and {N/2} when {N} is even. Let {\mathrm{Col}_{\min}(N) := \inf_{n \in {\bf N}} \mathrm{Col}^n(N)} be the minimal element of the Collatz orbit {N, \mathrm{Col}(N), \mathrm{Col}^2(N),\dots}. Then we have

Conjecture 1 (Collatz conjecture) One has {\mathrm{Col}_{\min}(N)=1} for all {N \in {\bf N}+1}.

Establishing the conjecture for all {N} remains out of reach of current techniques (for instance, as discussed in the previous blog post, it is basically at least as difficult as Baker’s theorem, all known proofs of which are quite difficult). However, the situation is more promising if one is willing to settle for results which only hold for “most” {N} in some sense. For instance, it is a result of Krasikov and Lagarias that

\displaystyle  \{ N \leq x: \mathrm{Col}_{\min}(N) = 1 \} \gg x^{0.84}

for all sufficiently large {x}. In another direction, it was shown by Terras that for almost all {N} (in the sense of natural density), one has {\mathrm{Col}_{\min}(N) < N}. This was then improved by Allouche to {\mathrm{Col}_{\min}(N)  0.869}, and extended later by Korec to cover all {\theta > \frac{\log 3}{\log 4} \approx 0.7924}. In this paper we obtain the following further improvement (at the cost of weakening natural density to logarithmic density):

Theorem 2 Let {f: {\bf N}+1 \rightarrow {\bf R}} be any function with {\lim_{N \rightarrow \infty} f(N) = +\infty}. Then we have {\mathrm{Col}_{\min}(N) < f(N)} for almost all {N} (in the sense of logarithmic density).

Thus for instance one has {\mathrm{Col}_{\min}(N) < \log\log\log\log N} for almost all {N} (in the sense of logarithmic density).

The difficulty here is one usually only expects to establish “local-in-time” results that control the evolution {\mathrm{Col}^n(N)} for times {n} that only get as large as a small multiple {c \log N} of {\log N}; the aforementioned results of Terras, Allouche, and Korec, for instance, are of this type. However, to get {\mathrm{Col}^n(N)} all the way down to {f(N)} one needs something more like an “(almost) global-in-time” result, where the evolution remains under control for so long that the orbit has nearly reached the bounded state {N=O(1)}.

However, as observed by Bourgain in the context of nonlinear Schrödinger equations, one can iterate “almost sure local wellposedness” type results (which give local control for almost all initial data from a given distribution) into “almost sure (almost) global wellposedness” type results if one is fortunate enough to draw one’s data from an invariant measure for the dynamics. To illustrate the idea, let us take Korec’s aforementioned result that if {\theta > \frac{\log 3}{\log 4}} one picks at random an integer {N} from a large interval {[1,x]}, then in most cases, the orbit of {N} will eventually move into the interval {[1,x^{\theta}]}. Similarly, if one picks an integer {M} at random from {[1,x^\theta]}, then in most cases, the orbit of {M} will eventually move into {[1,x^{\theta^2}]}. It is then tempting to concatenate the two statements and conclude that for most {N} in {[1,x]}, the orbit will eventually move {[1,x^{\theta^2}]}. Unfortunately, this argument does not quite work, because by the time the orbit from a randomly drawn {N \in [1,x]} reaches {[1,x^\theta]}, the distribution of the final value is unlikely to be close to being uniformly distributed on {[1,x^\theta]}, and in particular could potentially concentrate almost entirely in the exceptional set of {M \in [1,x^\theta]} that do not make it into {[1,x^{\theta^2}]}. The point here is the uniform measure on {[1,x]} is not transported by Collatz dynamics to anything resembling the uniform measure on {[1,x^\theta]}.

So, one now needs to locate a measure which has better invariance properties under the Collatz dynamics. It turns out to be technically convenient to work with a standard acceleration of the Collatz map known as the Syracuse map {\mathrm{Syr}: 2{\bf N}+1 \rightarrow 2{\bf N}+1}, defined on the odd numbers {2{\bf N}+1 = \{1,3,5,\dots\}} by setting {\mathrm{Syr}(N) = (3N+1)/2^a}, where {2^a} is the largest power of {2} that divides {3N+1}. (The advantage of using the Syracuse map over the Collatz map is that it performs precisely one multiplication of {3} at each iteration step, which makes the map better behaved when performing “{3}-adic” analysis.)

When viewed {3}-adically, we soon see that iterations of the Syracuse map become somewhat irregular. Most obviously, {\mathrm{Syr}(N)} is never divisible by {3}. A little less obviously, {\mathrm{Syr}(N)} is twice as likely to equal {2} mod {3} as it is to equal {1} mod {3}. This is because for a randomly chosen odd {\mathbf{N}}, the number of times {\mathbf{a}} that {2} divides {3\mathbf{N}+1} can be seen to have a geometric distribution of mean {2} – it equals any given value {a \in{\bf N}+1} with probability {2^{-a}}. Such a geometric random variable is twice as likely to be odd as to be even, which is what gives the above irregularity. There are similar irregularities modulo higher powers of {3}. For instance, one can compute that for large random odd {\mathbf{N}}, {\mathrm{Syr}^2(\mathbf{N}) \hbox{ mod } 9} will take the residue classes {0,1,2,3,4,5,6,7,8 \hbox{ mod } 9} with probabilities

\displaystyle  0, \frac{8}{63}, \frac{16}{63}, 0, \frac{11}{63}, \frac{4}{63}, 0, \frac{2}{63}, \frac{22}{63}

respectively. More generally, for any {n}, {\mathrm{Syr}^n(N) \hbox{ mod } 3^n} will be distributed according to the law of a random variable {\mathbf{Syrac}({\bf Z}/3^n{\bf Z})} on {{\bf Z}/3^n{\bf Z}} that we call a Syracuse random variable, and can be described explicitly as

\displaystyle  \mathbf{Syrac}({\bf Z}/3^n{\bf Z}) = 2^{-\mathbf{a}_1} + 3^1 2^{-\mathbf{a}_1-\mathbf{a}_2} + \dots + 3^{n-1} 2^{-\mathbf{a}_1-\dots-\mathbf{a}_n} \hbox{ mod } 3^n, \ \ \ \ \ (1)

where {\mathbf{a}_1,\dots,\mathbf{a}_n} are iid copies of a geometric random variable of mean {2}.

In view of this, any proposed “invariant” (or approximately invariant) measure (or family of measures) for the Syracuse dynamics should take this {3}-adic irregularity of distribution into account. It turns out that one can use the Syracuse random variables {\mathbf{Syrac}({\bf Z}/3^n{\bf Z})} to construct such a measure, but only if these random variables stabilise in the limit {n \rightarrow \infty} in a certain total variation sense. More precisely, in the paper we establish the estimate

\displaystyle  \sum_{Y \in {\bf Z}/3^n{\bf Z}} | \mathbb{P}( \mathbf{Syrac}({\bf Z}/3^n{\bf Z})=Y) - 3^{m-n} \mathbb{P}( \mathbf{Syrac}({\bf Z}/3^m{\bf Z})=Y \hbox{ mod } 3^m)| \ \ \ \ \ (2)

\displaystyle  \ll_A m^{-A}

for any {1 \leq m \leq n} and any {A > 0}. This type of stabilisation is plausible from entropy heuristics – the tuple {(\mathbf{a}_1,\dots,\mathbf{a}_n)} of geometric random variables that generates {\mathbf{Syrac}({\bf Z}/3^n{\bf Z})} has Shannon entropy {n \log 4}, which is significantly larger than the total entropy {n \log 3} of the uniform distribution on {{\bf Z}/3^n{\bf Z}}, so we expect a lot of “mixing” and “collision” to occur when converting the tuple {(\mathbf{a}_1,\dots,\mathbf{a}_n)} to {\mathbf{Syrac}({\bf Z}/3^n{\bf Z})}; these heuristics can be supported by numerics (which I was able to work out up to about {n=10} before running into memory and CPU issues), but it turns out to be surprisingly delicate to make this precise.

A first hint of how to proceed comes from the elementary number theory observation (easily proven by induction) that the rational numbers

\displaystyle  2^{-a_1} + 3^1 2^{-a_1-a_2} + \dots + 3^{n-1} 2^{-a_1-\dots-a_n}

are all distinct as {(a_1,\dots,a_n)} vary over tuples in {({\bf N}+1)^n}. Unfortunately, the process of reducing mod {3^n} creates a lot of collisions (as must happen from the pigeonhole principle); however, by a simple “Lefschetz principle” type argument one can at least show that the reductions

\displaystyle  2^{-a_1} + 3^1 2^{-a_1-a_2} + \dots + 3^{m-1} 2^{-a_1-\dots-a_m} \hbox{ mod } 3^n \ \ \ \ \ (3)

are mostly distinct for “typical” {a_1,\dots,a_m} (as drawn using the geometric distribution) as long as {m} is a bit smaller than {\frac{\log 3}{\log 4} n} (basically because the rational number appearing in (3) then typically takes a form like {M/2^{2m}} with {M} an integer between {0} and {3^n}). This analysis of the component (3) of (1) is already enough to get quite a bit of spreading on { \mathbf{Syrac}({\bf Z}/3^n{\bf Z})} (roughly speaking, when the argument is optimised, it shows that this random variable cannot concentrate in any subset of {{\bf Z}/3^n{\bf Z}} of density less than {n^{-C}} for some large absolute constant {C>0}). To get from this to a stabilisation property (2) we have to exploit the mixing effects of the remaining portion of (1) that does not come from (3). After some standard Fourier-analytic manipulations, matters then boil down to obtaining non-trivial decay of the characteristic function of {\mathbf{Syrac}({\bf Z}/3^n{\bf Z})}, and more precisely in showing that

\displaystyle  \mathbb{E} e^{-2\pi i \xi \mathbf{Syrac}({\bf Z}/3^n{\bf Z}) / 3^n} \ll_A n^{-A} \ \ \ \ \ (4)

for any {A > 0} and any {\xi \in {\bf Z}/3^n{\bf Z}} that is not divisible by {3}.

If the random variable (1) was the sum of independent terms, one could express this characteristic function as something like a Riesz product, which would be straightforward to estimate well. Unfortunately, the terms in (1) are loosely coupled together, and so the characteristic factor does not immediately factor into a Riesz product. However, if one groups adjacent terms in (1) together, one can rewrite it (assuming {n} is even for sake of discussion) as

\displaystyle  (2^{\mathbf{a}_2} + 3) 2^{-\mathbf{b}_1} + (2^{\mathbf{a}_4}+3) 3^2 2^{-\mathbf{b}_1-\mathbf{b}_2} + \dots

\displaystyle  + (2^{\mathbf{a}_n}+3) 3^{n-2} 2^{-\mathbf{b}_1-\dots-\mathbf{b}_{n/2}} \hbox{ mod } 3^n

where {\mathbf{b}_j := \mathbf{a}_{2j-1} + \mathbf{a}_{2j}}. The point here is that after conditioning on the {\mathbf{b}_1,\dots,\mathbf{b}_{n/2}} to be fixed, the random variables {\mathbf{a}_2, \mathbf{a}_4,\dots,\mathbf{a}_n} remain independent (though the distribution of each {\mathbf{a}_{2j}} depends on the value that we conditioned {\mathbf{b}_j} to), and so the above expression is a conditional sum of independent random variables. This lets one express the characeteristic function of (1) as an averaged Riesz product. One can use this to establish the bound (4) as long as one can show that the expression

\displaystyle  \frac{\xi 3^{2j-2} (2^{-\mathbf{b}_1-\dots-\mathbf{b}_j+1} \mod 3^n)}{3^n}

is not close to an integer for a moderately large number ({\gg A \log n}, to be precise) of indices {j = 1,\dots,n/2}. (Actually, for technical reasons we have to also restrict to those {j} for which {\mathbf{b}_j=3}, but let us ignore this detail here.) To put it another way, if we let {B} denote the set of pairs {(j,l)} for which

\displaystyle  \frac{\xi 3^{2j-2} (2^{-l+1} \mod 3^n)}{3^n} \in [-\varepsilon,\varepsilon] + {\bf Z},

we have to show that (with overwhelming probability) the random walk

\displaystyle (1,\mathbf{b}_1), (2, \mathbf{b}_1 + \mathbf{b}_2), \dots, (n/2, \mathbf{b}_1+\dots+\mathbf{b}_{n/2})

(which we view as a two-dimensional renewal process) contains at least a few points lying outside of {B}.

A little bit of elementary number theory and combinatorics allows one to describe the set {B} as the union of “triangles” with a certain non-zero separation between them. If the triangles were all fairly small, then one expects the renewal process to visit at least one point outside of {B} after passing through any given such triangle, and it then becomes relatively easy to then show that the renewal process usually has the required number of points outside of {B}. The most difficult case is when the renewal process passes through a particularly large triangle in {B}. However, it turns out that large triangles enjoy particularly good separation properties, and in particular afer passing through a large triangle one is likely to only encounter nothing but small triangles for a while. After making these heuristics more precise, one is finally able to get enough points on the renewal process outside of {B} that one can finish the proof of (4), and thus Theorem 2.

John BaezDiversity Workshop at UCR

We’re having a workshop to promote diversity in math here at UCR:

Riverside Mathematics Workshop for Excellence and Diversity, Friday 8 November 2019, U. C. Riverside. Organized by John Baez, Weitao Chen, Edray Goins, Ami Radunskaya, and Fred Wilhelm.

If you want to come, please register here.

It’s happening right before the applied category theory meeting, so I hope some of you can make both… especially since the category theorist Eugenia Cheng will be giving a talk!

Three talks will take place in Skye Hall—home of the math department—starting at 1 pm. After this we’ll have refreshments and an hour for students to talk to the speakers. Starting at 6 pm there will be a reception across the road at the UCR Alumni Center, with food and a panel discussion on the challenges we face in promoting diversity at U.C. Riverside.

All the talks will be in Skye 284:

• 1:00–1:50 p.m. Abba Gumel, Arizona State University.

Some models for enhancing diversity and capacity-building in STEM education in under-represented minority communities.

STEM (science, technology, engineering and mathematics) education is undoubtedly the necessary bedrock for the development and sustenance of the vitally-needed knowledge-based economy that fuels and sustains the development of modern nations. Central to STEM education are, of course, the mathematical science … which are the rock-solid foundation of all the natural and engineering sciences. Hence, it is vital that all diverse populations are not left behind in the quest to build and sustain capacity in the mathematical sciences. This talk focuses on discussion around a number of pedagogic and mentorship models that have been (and are being) used to help increase diversity and capacity-building in STEM education in general, and in the mathematical sciences in particular, in under-represented minority populations. Some examples from Africa, Canada and the U.S. will be presented.

• 2:00–2:50. Marissa Loving, Georgia Tech.

Where do I belong? Creating space in the math community.

I will tell the story of my mathematical journey with a focus on my time in grad school. I will be blunt about the ups and downs I have experienced and touch on some of the barriers (both structural and internalized) I have encountered. I will also discuss some of the programs and spaces I have helped create in my quest to make the mathematics community into a place where folks from historically under-represented groups (particularly women of color) can feel safe, seen, and free to devote their energy to their work. If you have ever felt like you don’t belong or worried that you have made others feel that way, this talk is for you.

• 3:00–3:50 p.m. Eugenia Cheng, School of the Art Institute of Chicago.

Inclusion–exclusion in mathematics and beyond: who stays in, who falls out, why it happens, and what we could do about it.

The question of why women and minorities are under-represented in mathematics is complex and there are no simple answers, only many contributing factors. I will focus on character traits, and argue that if we focus on this rather than gender we can have a more productive and less divisive conversation. To try and focus on characters rather than genders I will introduce gender-neutral character adjectives “ingressive” and “congressive” as a new dimension to shift our focus away from masculine and feminine. I will share my experience of teaching congressive abstract mathematics to art students, in a congressive way, and the possible effects this could have for everyone in mathematics, not just women. Moreover I will show that abstract mathematics is applicable to working towards a more inclusive, congressive society in this politically divisive era. This goes against the assumption that abstract math can only be taught to high level undergraduates and graduate students, and the accusation that it is removed from real life.

• 4:00–4:30 p.m. Refreshments in Skye 284.

• 4:30–5:30 p.m. Conversations Between Speakers & Students, Not Faculty, in Skye 284.

• 6:00–6:45 p.m. Reception with Food at the Alumni Center.

• 6:45 – 7:45 p.m. Panel Discussion at Alumni Center with Alissa Crans, Jose Gonzalez and Paige Helms, moderated by Edray Goins.

October 17, 2019

Mark Chu-CarrollDashi-Braised Brisket with Onions

I’m a nice jewish boy, so I grew up eating a lot of brisket. Brisket’s an interesting piece of meat. By almost any reasonable standard, it’s an absolutely godawful cut of beef. It’s ridiculously tough. We’re not talking just a little bit chewy here: you can cook a hunk of brisket for four hours, still have something that’s inedible, because your teeth can’t break it down. It’s got a huge layer of fat on top – but the meat itself is completely lean – so if you cook it long enough to be chewable, it can be dry as a bone.

But my ancestors were peasants. They couldn’t afford to eat beef normally, and when special occasions rolled around, the only beef they could afford was the stuff that no one else wanted. So they got briskets.

If you get interested in foods, though, you learn that many of the best foods in the world started off with some poor peasant who wanted to make something delicious, but couldn’t afford expensive ingredients! Brisket is a perfect example. Cook it for a good long time, or in a pressure cooker, with lots of liquid, and lots of seasoning, and it’s one of the most flavorful pieces of the entire animal. Brisket is really delicious, once you manage to break down the structure that makes it so tough. These days, it’s become super trendy, and everyone loves brisket!

Anyway, like I said, I grew up eating jewish brisket. But then I married a Chinese woman, and in our family, we always try to blend traditions as much as we can. In particular, because we’re both food people, I’m constantly trying to take things from my tradition, and blend some of her tradition into it. So I wanted to find a way of blending some chinese flavors into my brisket. What I wound up with is more japanese than chinese, but it works. The smoky flavor of the dashi is perfect for the sweet meatiness of the brisket, and the onions slowly cook and sweeten, and you end up with something that is distinctly similar to the traditional jewish onion-braised-brisket, but also very distinctly different.


  1. 1 brisket.
  2. 4 large onions.
  3. 4 packets of shredded bonito flakes from an asian market.
  4. 4 large squares of konbu (japanese dried kelp)
  5. 1 cup soy sauce.
  6. 1 cup apple cider.
  7. Random root vegetables that you like. I tend to go with carrots and daikon radish, cut into 1 inch chunks.


  1. First, make some dashi:
    1. Put about 2 quarts of water into a pot on your stove, and bring to a boil.
    2. Lower to a simmer, and then add the konbu, and simmer for 30 minutes.
    3. Turn off the heat, add the bonito, and then let it sit for 10 minutes.
    4. Strain out all of the kelp and bonito, and you’ve got dashi!.
  2. Slice all of the onions into strips.
  3. Cut the brisket into sections that will fit into an instant pot or other pressure cooker.
  4. Fill the instant pot by laying a layer of onions, followed by a piece of brisket, followed by a layer of onions until all of the meat is covered in onions.
  5. Take your dashi, add the apple cider, and add soy sauce until it tastes too salty. That’s just right (Remember, your brisket is completely unsalted!) Pour it over the brisket and onions.
  6. Fill in any gaps around the brisket and onions with your root vegetables.
  7. Cook in the instant pot for one hour, and then let it slowly depressurize.
  8. Meanwhile, preheat your oven to 275.
  9. Transfer the brisket from the instant pot to a large casserole or dutch oven. Cover with the onions. Taste the sauce – it should be quite a bit less salty. If it isn’t salty enough, add a bit more sauce sauce; if it tastes sour, add a bit more apply cider.
  10. Cook in the oven for about 1 hour, until the top has browned; then turn the brisket over, and let it cook for another hour until the other side is brown.
  11. Slice into thick slices. (It should be falling apart, so that you can’t cut it thin!).
  12. Strain the fat off of the broth, and cook with a bit of cornstarch to thicken into a gravy.
  13. Eat.

October 14, 2019

John BaezClimate Technology Primer (Part 2)

Here’s the second of a series of blog articles:

• Adam Marblestone, Climate technology primer (2/3): CO2 removal.

The first covered the basics of climate science as related to global warming. This one moves on to consider technologies for removing carbon dioxide from the air.

I hope you keep the following warning in mind as you read on:

I’m focused here on trying to understand the narrowly technical aspects, not on the political aspects, despite those being crucial. This is meant to be a review of the technical literature, not a political statement. I worried that writing a blog purely on the topic of technological intervention in the climate, without attempting or claiming to do justice to the social issues raised, would implicitly suggest that I advocate a narrowly technocratic or unilateral approach, which is not my intention. By focusing on technology, I don’t mean to detract from the importance of the social and policy aspects.

The technological issues are worth studying on their own, since they constrain what’s possible. For example: to draw down as much CO2 as human civilization is emitting now, with trees their peak growth phase and their carbon stored permanently, could be done by covering the whole USA with such trees.

October 11, 2019

Noncommutative GeometrySir Michael Atiyah, a Knight Mathematician. A Tribute to Michael Atiyah, an Inspiration and a Friend. By Alain Connes and Joseph Kouneiher

Sir Michael Atiyah was considered one of the world’s foremost mathematicians. He is best known for his work in algebraic topology and the codevelopment of a branch of mathematics called topological 𝐾-theory, together with the Atiyah–Singer index theorem, for which he received the Fields Medal (1966). He also received the Abel Prize (2004) along with Isadore M. Singer for their discovery and

Peter Rohde New paper: Photonic quantum error correction of linear optics using W-state encoding

With my PhD student Madhav Krishnan Vijayan, and old PhD colleague Austin Lund.

Full paper available at


Error-detection and correction are necessary prerequisites for any scalable quantum computing architecture. Given the inevitability of unwanted physical noise in quantum systems and the propensity for errors to spread as computations proceed, computational outcomes can become substantially corrupted. This observation applies regardless of the choice of physical implementation. In the context of photonic quantum information processing, there has recently been much interest in passive linear optics quantum computing, which includes boson-sampling, as this model eliminates the highly-challenging requirements for feed-forward via fast, active control. That is, these systems are passive by definition. In usual scenarios, error detection and correction techniques are inherently active, making them incompatible with this model, arousing suspicion that physical error processes may be an insurmountable obstacle. Here we explore a photonic error-detection technique, based on W-state encoding of photonic qubits, which is entirely passive, based on post-selection, and compatible with these near-term photonic architectures of interest. We show that this W-state redundant encoding techniques enables the suppression of dephasing noise on photonic qubits via simple fan-out style operations, implemented by optical Fourier transform networks, which can be readily realised today. The protocol effectively maps dephasing noise into heralding failures, with zero failure probability in the ideal no-noise limit.

October 10, 2019

John BaezFoundations of Math and Physics One Century After Hilbert

I wrote a review of this book with chapters by Penrose, Witten, Connes, Atiyah, Smolin and others:

• John Baez, review of Foundations of Mathematics and Physics One Century After Hilbert: New Perspectives, edited by Joseph Kouneiher, Notices of the American Mathematical Society 66 no. 11 (November 2019), 1690–1692.

It gave me a chance to say a bit—just a tiny bit—about the current state of fundamental physics and the foundations of mathematics.