Yesterday a guy saw my Orioles hat and said, “Wow, you’re a real diehard.”
He was wearing a Hartford Whalers hat.
Last time we saw a strong connection between random permutations and Poisson distributions. Thinking about this pulled me into questions about partitions and the moments of Poisson distributions. These may be a bit of a distraction — but maybe not, since every permutation gives a partition, namely the partition into cycles.
Here’s a good puzzle to start with:
Puzzle 11. Suppose raindrops are falling on your head, randomly and independently, at an average rate of one per minute. What’s the average of the cube of the number of raindrops that fall on your head in one minute?
To get started we need a bit of probability theory. Suppose random events are happening at an average rate of one per minute, in a nice independent way so that the history so far doesn’t affect the rate of new events now. Let $P(k,t)$ be the probability that $k$ events have happened between time $0$ and time $t$. Then:
$\frac{d}{d t} P(k,t) = P(k-1,t) - P(k,t)$
since if $k-1$ events have happened an extra event will bring the total to $k$, while if $k$ events have already happened an extra one will make the total stop being $k$.
Now, it takes a bit of work to figure out $P(k,t)$ given this equation and the obvious initial conditions
$P(k,0) = \left\{\begin{array}{ccl} 1 &\mathrm{if}& k = 0 \\ 0 &\mathrm{if}& k \gt 0 \end{array}\right.$
But if someone tells you the answer, you just need elementary calculus to check that it’s right:
$P(k,t) = \frac{t^k}{k!} e^{-t}$
So, let’s be happy someone has already figured out the answer!
Back to our raindrop problem. Here the time $t$ is 1 minute, so the probability of $k$ raindrops landing on your head in this time is
$\frac{1}{k!} e^{-1}$
Sanity check: these probabilities sum to 1.
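Both facts are easy to confirm numerically. Here is a quick Python sketch (my own; the cutoff at 50 terms and the finite-difference step are arbitrary choices):

```python
import math

def P(k, t):
    """Poisson probability of k events by time t, at unit rate."""
    return t**k / math.factorial(k) * math.exp(-t)

# Sanity check: the probabilities P(k, 1) sum to 1 (the tail beyond k = 50 is negligible).
total = sum(P(k, 1.0) for k in range(50))
print(abs(total - 1.0) < 1e-12)  # True

# Check the equation d/dt P(k,t) = P(k-1,t) - P(k,t) by a central difference at t = 1.
k, t, h = 3, 1.0, 1e-6
lhs = (P(k, t + h) - P(k, t - h)) / (2 * h)
rhs = P(k - 1, t) - P(k, t)
print(abs(lhs - rhs) < 1e-8)  # True
```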
We’re trying to figure out the expected value of the cube of the number of raindrops that land on your head in a minute. We now know this is
$\frac{1}{e} \sum_{k=0}^\infty \frac{k^3}{k!}$
So this is the sum we need to do. But being mathematicians, let’s generalize. Let’s figure out the expected value of the $n$th power of the number of raindrops that land on your head in a minute:
$\frac{1}{e} \sum_{k=0}^\infty \frac{k^n}{k!}$
The sum here looks like a mutant version of
$\sum_{k=0}^\infty \frac{n^k}{k!} = e^n$
But I mention this only so you don’t get confused and take a wrong turn!
To compute
$\frac{1}{e} \sum_{k=0}^\infty \frac{k^n}{k!}$
it’s nice to express the powers $k^n$ in terms of ‘falling powers’:
$k^{\underline{n}} = k (k-1) (k-2) \cdots (k - n + 1)$
The power $k^n$ counts the number of functions from $n$ to $k$, where now I’m identifying a natural number with a set having that number of elements. Similarly, the falling power $k^{\underline{n}}$ counts the number of injections from $n$ to $k$. Why? With an injection, the first element of $n$ has $k$ choices of where to go, but the next one has only $k-1$, and so on.
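This counting interpretation is easy to verify by brute force; a small Python check (function names are mine):

```python
from itertools import permutations

def falling(k, n):
    """Falling power: k (k-1) ... (k - n + 1)."""
    result = 1
    for j in range(n):
        result *= k - j
    return result

def count_injections(n, k):
    """Count injections from an n-element set into a k-element set:
    an injection is an ordered choice of n distinct targets out of k."""
    return sum(1 for _ in permutations(range(k), n))

for n, k in [(2, 5), (3, 4), (4, 4), (5, 3)]:
    assert count_injections(n, k) == falling(k, n)
print("falling powers count injections")
```

Note that the case $n > k$ works too: there are no injections, and the falling power contains the factor $0$.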
There’s a nice formula expressing ordinary powers as linear combinations of falling powers:
$k^n = \sum_{i = 0}^k \left\{ \begin{array}{c} n \\ i \end{array} \right\} \; k^{\underline{i}}$
Here $\left\{ \begin{array}{c} n \\ i \end{array} \right\}$ is the number of partitions of an $n$-element set into $i$ nonempty parts.
This formula has a wonderfully slick proof. Any function $f \colon n \to k$ factors as a surjection followed by an injection:
$n \stackrel{f}{\to} \mathrm{im} f \hookrightarrow k$
and the surjection gives a partition of $n$ into $i$ parts, where $i$ is the number of points in the image $\mathrm{im} f$. Conversely a partition of $n$ into $i$ parts together with an injection $i \hookrightarrow k$ determines a function $f \colon n \to k$. So, the number of functions $f \colon n \to k$ is the sum over $0 \le i \le n$ of the number of partitions of $n$ into $i$ parts times the number of injections $i \hookrightarrow k$. And that’s all the formula says!
So, the expected value of the $n$th power of the number of raindrops that fall on your head is
$\frac{1}{e} \sum_{k=0}^\infty \frac{k^n}{k!} \; = \; \frac{1}{e} \sum_{k=0}^\infty \sum_{i = 0}^k \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, \frac{k^{\underline{i}}}{k!}$
But in fact we can simplify this! Since there are no injections $i \hookrightarrow k$ when $i \gt k$, the falling power vanishes in this case, so our sum is just
$\frac{1}{e} \sum_{i,k=0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, \frac{k^{\underline{i}}}{k!}$
And now we can do the sum over $k$ first. Since
$\sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} \; = \; \sum_{k = 0}^\infty \frac{k(k-1)\cdots (k-i+1)}{k(k-1) \cdots 1} \; = \; \sum_{k = i}^\infty \frac{1}{(k-i)!} \; = \; \sum_{j = 0}^\infty \frac{1}{j!} \; = \; e$
we get
$\frac{1}{e} \sum_{i,k=0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, \frac{k^{\underline{i}}}{k!} \; = \; \sum_{i = 0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\}$
which is the total number of partitions of an $n$-element set!
In our original question we were interested in $n = 3$. There are 5 partitions of a 3-element set:
$\{\{1,2,3\}\},$ $\{\{1,2\}, \{3\}\}, \; \{\{2,3\}, \{1\}\}, \; \{\{3,1\}, \{2\}\},$ $\{\{1\}, \{2\}, \{3\}\}$
So, the expected value of the cube of the number of raindrops that land on your head in a minute is 5.
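For the skeptical, the partial sums converge fast enough that this is easy to confirm numerically (a Python sketch; the cutoff at 120 terms is arbitrary, and the list of partition counts $1, 1, 2, 5, 15, 52, 203$ for $n = 0, \dots, 6$ is standard):

```python
import math

def poisson_moment(n, terms=120):
    """Partial sum of (1/e) * sum_k k^n / k!: the n-th moment at mean 1."""
    return math.exp(-1) * sum(k**n / math.factorial(k) for k in range(terms))

print(round(poisson_moment(3)))  # 5, the answer to the raindrop puzzle

# More generally, the n-th moment is the number of partitions of an n-element set:
partition_counts = [1, 1, 2, 5, 15, 52, 203]  # n = 0, ..., 6
for n, b in enumerate(partition_counts):
    assert abs(poisson_moment(n) - b) < 1e-9
```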
Now that we’re done, let’s reflect on what happened, and introduce more jargon so you can read more about the concepts involved.
The function
$P(k,t) = \frac{t^k}{k!} e^{-t}$
is called a Poisson distribution. It’s the answer whenever you want to know the probability of a given number $k$ of events occurring in some amount of time $t$ if these events occur with a constant average rate, set here to $1$ for convenience, independently of the time since the last event.
The parameter $t$ is the mean of the Poisson distribution. We worked out the $n$th moment of the Poisson distribution with mean $1$. In other words, we worked out the expected value of $k^n$ in this case:
$\sum_{k = 0}^\infty k^n P(k,1) \; = \; \frac{1}{e} \sum_{k = 0}^\infty \frac{k^n}{k!}$
We got the number of partitions of an $n$-element set, which is called the Bell number $B_n$, after the author of a fantasy novel about a planet where all mathematics is done by men.
This result is called Dobiński’s formula:
$\frac{1}{e} \sum_{k = 0}^\infty \frac{k^n}{k!} = B_n$
Dobiński published a paper about it in 1877, but it seems to me that he only proved some special cases.
The number of partitions of an $n$-element set into $i$ parts, $\left\{ \begin{array}{c} n \\ i \end{array} \right\}$, is called a Stirling number of the second kind. Given the definitions it’s obvious that
$\sum_{i = 0}^\infty \left\{ \begin{array}{c} n \\ i \end{array} \right\} = B_n$
(We can stop the sum at $i = n$ without changing its value.)
The identity
$k^n = \sum_{i = 0}^k \left\{ \begin{array}{c} n \\ i \end{array} \right\} \, k^{\underline{i}}$
is more interesting, since it connects Stirling numbers of the second kind to falling powers, which are also known as falling factorials, falling sequential products, and various other things. The proof I gave comes from a paper of Gian-Carlo Rota.
Category theorists will say it relies on the uniqueness of the epi-mono factorization of a function between finite sets. Rota doesn’t use these terms, but he talks about the ‘kernel’ of a function, which is his name for the epi in this factorization.
In this paper Rota goes ahead and proves Dobiński’s formula, but he does not explicitly discuss its connection to the Poisson distribution. Someone on Wikipedia translated his argument into the language of probability, which I like. Unfortunately the Wikipedia article merely reduces Dobiński’s formula to the identity
$\frac{1}{e} \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} = 1$
and stops there, remarking that the $i$th factorial moment of the Poisson distribution of mean $1$ equals $1$. Factorial moments are like ordinary moments but with the power $k^i$ replaced by the falling power $k^{\underline{i}}$.
I was frustrated to see the argument fizzle out before it ended. Then I saw why
$\sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} = e$
It follows from some facts about species and groupoid cardinality.
If you have a species, namely a functor $F$ from the groupoid of finite sets to the category of sets, you can think of it as describing a type of structure you can put on finite sets. For each natural number $k$, $F(k)$ is the set of structures of this type that you can put on a $k$-element set. The species has a generating function
$f(x) = \sum_{k = 0}^\infty \frac{|F(k)|}{k!} x^k$
and the number $f(1)$ has a nice meaning if the sum involved is well-defined. The category of elements $\int F$ has finite sets equipped with the given structure as objects and bijections preserving the structure as morphisms. This category $\int F$ is a groupoid, and the number $f(1)$ is just the groupoid cardinality of $\int F$.
Now fix a finite set $i$. There’s a species $F$ that assigns to each finite set $k$ the set of all injections $i \hookrightarrow k$. The generating function of this species is
$f(x) = \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} x^k$
Thus,
$f(1) = \sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!}$
is the cardinality of the groupoid $\int F$ whose objects are finite sets equipped with an inclusion of the set $i$. But this groupoid is equivalent to the groupoid of finite sets and bijections! Why? Because the morphisms in $\int F$ are bijections that preserve the inclusion of the set $i$. Thus, points in the image of this inclusion have no ‘freedom of movement’, and our morphism boils down to an arbitrary bijection between the remaining points.
Since $\int F$ is equivalent to the groupoid of finite sets, it must have the same cardinality, namely
$\sum_{k = 0}^\infty \frac{1}{k!} = e$
So, we get
$\sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} = e$
Later, Akiva Weinberger pointed out to me that this identity can be shown by a simple reindexing:
$\sum_{k = 0}^\infty \frac{k^{\underline{i}}}{k!} \; = \; \sum_{k = i}^\infty \frac{1}{(k-i)!} \; = \; \sum_{j = 0}^\infty \frac{1}{j!} \; = \; e$
This is precisely a decategorified version of my observation that $\int F$ is equivalent to the groupoid of finite sets.
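The reindexed sum is also easy to confirm numerically for various $i$ (a quick sketch; the cutoff at 60 terms is arbitrary):

```python
import math

def falling(k, i):
    """Falling power: k (k-1) ... (k - i + 1)."""
    result = 1
    for j in range(i):
        result *= k - j
    return result

# For each i, the sum of falling(k, i) / k! over k should equal e.
for i in range(6):
    s = sum(falling(k, i) / math.factorial(k) for k in range(60))
    assert abs(s - math.e) < 1e-12
print("each sum equals e")
```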
By the way, Dobiński’s formula, written this way:
$\sum_{k = 0}^\infty \frac{k^n}{k!} = e B_n$
says the cardinality of the groupoid of finite sets equipped with a function from a fixed finite set is $e$ times the number of partitions of that set. After a while this starts to seem obvious, if you keep thinking about epi-mono factorizations.
So groupoid cardinality establishes a nice connection between Poisson distributions, partitions, and injections. I doubt I’ve gotten to the bottom of it, since I haven’t really seen why the combinatorics is connected to Poisson processes. I’m also not sure how much all this will help me with permutations. But it was fun.
Just a short post to announce that nominations are now open for the Maryam Mirzakhani New Frontiers Prize, a newly announced annual $50,000 award from the Breakthrough Prize Foundation. It is presented to early-career women mathematicians who have completed their PhDs within the past two years, in recognition of outstanding research achievement. (I will be serving on the prize committee.) Nominations for this (and the other Breakthrough Prizes) can be made at this page.
We saw the last show of the touring company’s visit to Madison. The kids have played the record hundreds of times so I know the songs very well. But there’s a lot you get from seeing the songs realized by actors in physical space.
Cayley’s formula says how many trees there are on a given set of vertices. For a set with $n$ elements, there are $n^{n - 2}$ of them. In 1981, André Joyal published a wonderful new proof of this fact, which, although completely elementary and explicit, seems to have been one of the seeds for his theory of combinatorial species.
All I’m going to do in this post is explain Joyal’s proof, with lots of pictures. In a later post, I’ll explain the sense in which Cayley’s formula is the set-theoretic analogue of the linear-algebraic fact I wrote about last time: that for a random linear operator on a finite vector space $V$, the probability that it’s nilpotent is $1/\#V$. And I’ll also give a new (?) proof of that fact, imitating Joyal’s. But that’s for another day.
Joyal’s proof was published here:
André Joyal, Une théorie combinatoire des séries formelles. Advances in Mathematics 42 (1981), 1–82.
(I try not to link to Elsevier, as I’m permanently furious with them for draining universities of an obscene amount of our precious funds. But this is a free download. To balance my karma, let me remind you that many paywalled papers can be found on Sci-Hub for free. The most effective way to use Sci-Hub is to put the DOI of the paper you’re seeking into the search box; the text search function doesn’t seem to work very well.)
Doron Zeilberger stated Joyal’s argument concisely in one of his infamous opinions (number 108):
Going back to “beauty-bare”, nothing in set theory rivals the beauty of André Joyal’s lovely proof of Arthur Cayley’s theorem that the number of labeled trees, $T_n$, on $n$ vertices equals $n^{n-2}$. It is so beautiful that I can say it in words.
Let’s prove instead $n^2 T_n = n^n$.
The left side counts doubly-rooted trees, which are labeled trees with a directed path (possibly of one vertex) of labeled vertices with some trees (possibly trivial one-vertex ones) hanging from them. On the other hand the right side counts the number of functions, $f$, from $\{1,\ldots,n\}$ into $\{1,\ldots,n\}$. If you draw the directed graph of such a function joining vertex $i$ to vertex $f(i)$, you would get a collection of cycles with some trees (possibly trivial one-vertex ones) hanging from them. So the left side counts “lines of numbers” with hanging trees, and the right side counts “collections of cycles” with hanging trees. But “lines of numbers” are permutations in one-line notation, and “collections of cycles” are permutations in cycle notation. QED!
If that’s crystal clear to you, great! You can stop reading here. But I wanted to state the argument in a more leisurely way, with pictures.
Let $X$ be a finite set. A tree on $X$ is an undirected graph with vertex-set $X$, such that there’s a unique non-backtracking path between any two vertices. Here’s a typical tree on a 22-element set $X$:
Category theorists and algebraists often use “tree” to mean rooted tree: one vertex is distinguished as special and conventionally placed at the bottom. But in the sense I’ll use the word here, trees don’t come with a distinguished vertex.
Cayley’s formula says that the number of trees on $X$ is $n^{n - 2}$, where $n$ is the cardinality of $X$. Joyal’s proof goes like this.
An equivalent statement is that the number of bipointed trees is $n^n$, where “bipointed” means that the tree comes equipped with an ordered pair of distinguished vertices (which are allowed to be the same). Here’s a bipointed tree on our 22-element set $X$, with the first distinguished vertex towards the left in brown and the second towards the right in pink (both circled):
Joyal used the fabulous word vertebrate to mean bipointed tree, for reasons we’ll see in a moment. Since $n^n$ is the number of functions from $X$ to itself, Cayley’s formula says:
the number of vertebrates on $X$ is equal to the number of endofunctions of $X$.
That’s what we’ll show.
First note that by definition of tree, for any vertebrate on $X$ there’s a unique non-backtracking path from the first distinguished vertex to the second, here marked in blue:
You can look at the tree as the skeleton of some fantastical creature, with the blue path as the backbone, and the vertices on it as vertebrae, with the first (brown) distinguished vertex at the head end and the second (pink) one at the tail end. Hanging off each vertebra is a rooted tree, the root being that vertebra. If you let your imagination roam free, you can picture each tree as a rather ornate limb. The most similar real-life creature I could come up with is the amazing leafy seadragon (whose existence I think I learned of from David Roberts):
Anyway, the blue path from the first distinguished vertex to the second determines a total order on the set of vertebrae. Once you’ve got that order, you might as well erase the edges between them, as they’re determined by the order. So a vertebrate amounts to a diagram like this:
So far everything has been canonical: no arbitrary choices. The set of vertebrates on $X$ is in canonical one-to-one correspondence with diagrams of this type. But now we’re going to break that pattern, using an elementary observation:
the number of total orders on a finite set $S$ is equal to the number of permutations of $S$.
Of course, both are $(\# S)!$. There’s no canonical bijection between orders on and permutations of an abstract set $S$ — for if there were, which order would correspond to the identity permutation? But that’s OK: all we need here is that there’s some bijection.
So, let’s arbitrarily choose, for each $S \subseteq X$, a bijection between orders on $S$ and permutations of $S$. This done, we can replace the ordering of our five-element subset with a permutation of it. Which particular permutation it is depends on that arbitrary choice of bijection, but let’s say for sake of argument that it’s this one:
Hanging off each vertebra (yellow vertex) is a rooted tree. It makes no difference if we adorn its edges with arrows directed towards the root:
(There’s no choice involved in doing this.) Now what we’ve got is a directed graph with $X$ as its vertex-set, and with the property that each vertex is the source (tail) of exactly one arrow. This is nothing but a function $f: X \to X$! We seem to have also chosen a distinguished set of vertices, the yellow ones. But actually, they’re determined by $f$: they’re exactly the periodic points. So we lose no information if we throw away the colouring:
So what we have is precisely an endofunction of $X$.
We’ve now shown that the vertebrates (bipointed trees) on $X$ are in bijection with the endofunctions of $X$. Since there are $n^n$ such functions, where $n = \# X$, it follows that there are $n^n$ vertebrates on $X$, and therefore $n^{n - 2}$ unrooted trees on $X$. And that’s Joyal’s proof of Cayley’s theorem.
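For small $n$ the whole theorem can be checked by brute force: enumerate every $(n-1)$-edge graph on the vertex set and count the ones that are trees. A Python sketch (names are mine; this is feasible up to about $n = 6$):

```python
from itertools import combinations

def is_tree(n, edges):
    """A graph on n vertices is a tree iff it has n-1 edges and no cycle."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))  # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # adding this edge would close a cycle
        parent[ra] = rb
    return True

# Count labeled trees on n vertices by brute force and compare with n^(n-2).
for n in range(2, 7):
    all_edges = list(combinations(range(n), 2))
    count = sum(1 for es in combinations(all_edges, n - 1) if is_tree(n, es))
    assert count == n ** (n - 2)
print("Cayley's formula checked for n = 2..6")
```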
Of course, the argument I’ve given is pictorial and not entirely rigorous. I didn’t even define “tree” rigorously. Probably the best way to make it rigorous is via species, the concept that Joyal pioneered in that same paper.
I’m not going to give a full introduction to species here, but let me just say briefly how it works. A species is a functor from the category of finite sets and bijections to the category of sets. For example, there’s a species $Ord$ that assigns to each finite set the set of total orders on it. There are several ways of putting two species together to make a third, and in particular, species can be “composed”. We’ve essentially shown that
$Tree_{\ast\ast} \cong Ord \circ Tree_{\ast},$
where $Tree_{\ast\ast}$ is the species of vertebrates (bipointed trees), $Tree_{\ast}$ is the species of rooted (pointed) trees, and $\circ$ is composition. This isomorphism expresses the fact explained in the first few pictures above: that a vertebrate is the same thing as a collection of rooted trees, with a total order on them (1–5 in the example shown). On the other hand, we also have
$End \cong Perm \circ Tree_{\ast},$
where $End$ is the species of endofunctions and $Perm$ is the species of permutations. This isomorphism expresses the fact explained in the last few pictures above: that an endofunction is the same thing as a collection of rooted trees, with a permutation of them.
Now $Ord$ and $Perm$ are not isomorphic species; that is, as functors they are not naturally isomorphic. So we’re not going to conclude that $Tree_{\ast\ast}$ and $End$ are isomorphic species either. However, $Ord$ and $Perm$ are pointwise isomorphic, in the sense that $Ord(X)$ and $Perm(X)$ have the same number of elements for each finite set $X$ (namely, $(\# X)!$). It follows that $Tree_{\ast\ast}$ and $End$ are pointwise isomorphic too. In other words: on any finite set, there are the same number of vertebrates and endofunctions. That’s Cayley’s theorem.
Last time I computed the expected number of cycles of length $k$ in a random permutation of an $n$-element set. The answer is wonderfully simple: $1/k$ for all $k = 1, \dots, n$.
As Mark Meckes pointed out, this fact is just a shadow of a more wonderful fact. Let $C_k$ be the number of cycles of length $k$ in a random permutation of an $n$-element set, thought of as a random variable. Then as $n \to \infty$, the distribution of $C_k$ approaches a Poisson distribution with mean $1/k$.
But this is just a shadow of an even more wonderful fact!
First, if we fix some number $b \ge 1$, and look at all the random variables $C_1, \dots, C_b$, they converge to independent Poisson distributions as $n \to \infty$.
In fact we can even let $b \to \infty$ as $n \to \infty$, so long as $n$ outstrips $b$, in the following sense: $b/n \to 0$. The random variables $C_1, \dots , C_b$ will still converge to independent Poisson distributions.
These facts are made precise and proved by Arratia and Tavaré.
In the proof they drop another little bombshell. Suppose $j \ne k$. Then the random variables $C_j$ and $C_k$ become uncorrelated as soon as $n$ gets big enough for a permutation of an $n$-element set to have both a $j$-cycle and a $k$-cycle. That is, their expected values obey
$E(C_j C_k) = E(C_j) E(C_k)$
whenever $n \ge j + k$. Of course when $n \lt j + k$ it’s impossible to have both a cycle of length $j$ and one of length $k$, so in that case we have
$E(C_j C_k) = 0$
It’s exciting to me how the expected value $E(C_j C_k)$ pops suddenly from $0$ to $E(C_j) E(C_k) = 1/jk$ as soon as our set gets big enough to allow both a $j$-cycle and a $k$-cycle. It says that in some sense the presence of a large cycle in a random permutation does not discourage the existence of other large cycles… unless it absolutely forbids it!
This is a higher-level version of a phenomenon we’ve already seen: $E(C_k)$ jumps from $0$ to $1/k$ as soon as $n \ge k$.
More generally, Arratia and Tavaré show that
$E(C_{j_1} \, \cdots \, C_{j_m}) = E(C_{j_1}) \, \cdots \,E(C_{j_m})$
whenever $n \ge j_1 + \cdots + j_m$. But on the other hand, clearly
$E(C_{j_1} \, \cdots \, C_{j_m}) = 0$
whenever $n \lt j_1 + \cdots + j_m$, at least if the numbers $j_i$ are distinct. After all, it’s impossible for a permutation of a finite set to have cycles of different lengths that sum to more than the size of that set.
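These expectations are easy to probe by simulation. A Monte Carlo sketch in Python (the values of $n$, the trial count, and the choice $j = 2$, $k = 3$ are all arbitrary):

```python
import random
from collections import Counter

def cycle_counts(perm):
    """Counter mapping cycle length -> number of cycles of that length."""
    seen = [False] * len(perm)
    counts = Counter()
    for start in range(len(perm)):
        if not seen[start]:
            length, x = 0, start
            while not seen[x]:
                seen[x] = True
                x = perm[x]
                length += 1
            counts[length] += 1
    return counts

random.seed(0)
n, trials = 30, 50000
totals = Counter()   # running sums of C_k over all trials
prod_23 = 0          # running sum of C_2 * C_3
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)  # uniformly random permutation
    c = cycle_counts(perm)
    totals.update(c)
    prod_23 += c[2] * c[3]

print(totals[2] / trials)  # ≈ 1/2 = E(C_2)
print(totals[3] / trials)  # ≈ 1/3 = E(C_3)
print(prod_23 / trials)    # ≈ 1/6 = E(C_2)E(C_3), since n >= 2 + 3
```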
I’m trying to understand the nature of random permutations. These results help a lot. But instead of describing how they’re proved, I want to conclude with a picture, due to Flajolet and Sedgewick, that has also helped me a lot: it shows the cycles of a random permutation of a large set, drawn as discs whose areas indicate the cycle lengths.
One can argue that Flajolet and Sedgewick should have used circumferences rather than areas to indicate cycle lengths: after all, the circumference is a kind of ‘cycle length’. That would have made the large cycles seem even larger compared to the puny ones.
Of course, experts on the visual display of quantitative information have a lot of opinions on these issues. But never mind! Let’s extract some morals from this chart.
We can see that a random permutation of a large set has rather few cycles on average. To be precise, the expected number of cycles in a random permutation of an $n$-element set is $\sim \ln n$.
We can also see that the lengths of the cycles range dramatically from tiny to large. To be precise, suppose $1 \le m \le m' \le n$. Then since the expected number of cycles of length $k$ is $1/k$, the expected number of cycles of length between $m$ and $m'$ is close to
$\ln m' \; - \; \ln m$
Thus we can expect as many cycles of length between $1$ and $10$ as between $10$ and $100$, or between $100$ and $1000$, etc.
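Relatedly, summing $1/k$ over all cycle lengths $k \le n$ shows that the expected total number of cycles is exactly the harmonic number $H_n = 1 + 1/2 + \cdots + 1/n$, which is $\sim \ln n$. For small $n$ this can be confirmed exactly by enumerating every permutation (a sketch):

```python
from itertools import permutations
from fractions import Fraction

def num_cycles(perm):
    """Number of cycles of a permutation given in one-line notation."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = perm[x]
    return cycles

# The average number of cycles over all permutations of an n-element set
# equals the harmonic number H_n = 1 + 1/2 + ... + 1/n.
for n in range(1, 7):
    perms = list(permutations(range(n)))
    avg = Fraction(sum(num_cycles(p) for p in perms), len(perms))
    assert avg == sum(Fraction(1, k) for k in range(1, n + 1))
print("average cycle count is H_n for n = 1..6")
```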
So, I’m now imagining a random permutation of a large set as a way of chopping up a mountain into rocks with a wild range of sizes: some sand grains, some pebbles, some fist-sized rocks and some enormous boulders. Moreover, as long as two different sizes don’t add up to more than the size of our mountain, the number of rocks of those two sizes are uncorrelated.
I want to keep improving this intuition. For example, if $j \ne k$ and $n \ge j + k$, are the random variables $C_j$ and $C_k$ independent, or just uncorrelated?
Based on conversations with Soledad Villar, Teresa Huang, Zach Martin, Greg Scanlon, and Eva Wang (all NYU), I worked today on establishing criteria for a successful adversarial attack against a regression in the natural sciences (like astronomy). The idea is that you add a small, irrelevant amount $u$ to your data $x$ and it changes the labels $y$ by an unexpectedly large amount.
As my loyal reader knows, I love that products of Gaussians are themselves Gaussians! The result is that there are many factorizations of a Gaussian into many different Gaussian products. As my loyal reader also knows, Adrian Price-Whelan (Flatiron) and I found a bug in our code The Joker which fits radial-velocity data with Keplerian orbital models; this bug is related to the fundamental factorization of Gaussians that underlies the method. Today Price-Whelan showed me results from the fixed code, and we discussed them (and the priors we are using in our marginalization), along with the paper we are writing about the factorization. Yes, people, this is my MO: When you have a big bug—or really a big think-o or conceptual error—don't just fix it, write a paper about it! That's the origin of my paper on the K-correction. We are also contemplating writing a note about how you can constrain time-domain signals with periods longer than the interval over which you are observing them!
On the front page of my New York Times today, this capsule summary:
The French can retire at 62. Or 52. Sometimes 42. President Emmanuel Macron calls the tangle unsustainable. A million protesters disagree.
In the actual article, we learn that the retirement age of 42 applies to one group of workers: dancers in the national ballet. I find it very annoying when an article is teased with a number presented as normal when it’s actually extremely atypical. You could write the same teaser about the United States, having New York City firefighters in mind. But you would be misleading your audience even though the claim would be, I suppose, technically correct.
The NYU Center for Data Science has a big masters program, and as a part of that, students do capstone research projects with industry and academic partners. I am one of the latter! So Soledad Villar (NYU) and I have been advising two groups of capstone students in projects. One of these groups (Teresa Ningyuan Huang, Zach Martin, Greg Scanlon, and Eva Shuyu Wang) has been working on adversarial attacks against regressions in astronomy. The work is new in part because it brings the idea of attacks to the natural science domain, and because attacks haven't been really defined for regression contexts. Today we decided that this work is ready (enough) to publish. So we are going to try to finish and submit something for a conference deadline this week!
Today was the last day of the visit by Christina Eilers (MIT). We decided that we have a clear scope for our Milky Way disk paper, and we have a full outline and figures, so it was a success. But we don't have a good picture of our black-hole and quasar projects, which include finding a data-driven model that generates quasar properties given a black-hole mass (and possibly many latent parameters). The data are messy! And we don't know what to believe. To be continued.
Lily Zhao (Yale) and I got to the point that we can calibrate one laser-frequency-comb exposure on EXPRES with a calibration based on all the LFC exposures. We find that we predict the lines in each exposure to a mean (over all lines) of about 3 cm/s! If this holds up to cross-validation, I am happy!
In Stars & Exoplanets Meeting at Flatiron, many fun things happened. Zhao showed us temperature variations observed spectroscopically in Sun-like stars: Are these due to spots? And Marina Kounkel (WWU) showed us incredible visualizations of young stars, showing that they are very clustered in kinematics and age. She believes that they form in lines, and then the lines break up. Her interactive plots were amazing!
So you want to do science for a living. Some scientists work for companies, developing new products. Some work for governments. But if you want to do “pure science”, science just to learn about the world, then you’ll likely work at a university, as part of what we call academia.
The first step towards academia is graduate school. In the US, this means getting a PhD.
(Master’s degrees, at least in the US, have a different purpose. Most are “terminal Master’s”, designed to be your last degree. With a terminal Master’s, you can be a technician in a lab, but you won’t get farther down this path. In the US you don’t need a Master’s before you apply for a PhD program, and having one is usually a waste of time: PhD programs will make you re-take most of the same classes.)
Once you have a PhD, it’s time to get a job! Often, your first job after graduate school is a postdoc. Postdocs are short-term jobs, usually one to three years long. Some people are lucky enough to move on to the next stage quickly; others take several postdoc jobs first. These jobs will take you all over the world, everywhere people with your specialty work. Sometimes these jobs involve teaching, but more often you just do scientific research.
In the US system, if everything goes well, eventually you get a tenure-track job. These jobs involve both teaching and research. You get to train PhD students, hire postdocs, and in general start acting like a proper professor. This stage lasts around seven years, while the university evaluates you. If they decide you’re not worth it, then typically you’ll have to leave and apply for another job at another university. If they like you, though, you get tenure.
Tenure is the first time as an academic scientist that you aren’t on a short-term contract. Your job is more permanent than most: you have extra protection from being fired that most people don’t. While you can’t just let everything slide, you have the freedom to make more of your own decisions.
A tenured job can last until retirement, when you become an emeritus professor. Emeritus professors are retired but still do some of the work they did as professors. They’re paid out of their pension instead of a university salary, but they still sometimes teach or do research, and they usually still have an office. The university can hire someone new, and the cycle continues.
This isn’t the only path scientists take. Some work in a national lab instead. These jobs don’t usually involve teaching duties, and the path to a permanent position is a bit different. Some get teaching jobs instead of research professorships. These teaching jobs are usually not permanent; instead, universities are hiring more and more adjunct faculty, who have to string together temporary contracts to make a precarious living.
I’ve mostly focused on the US system here. Europe is a bit different: Master’s degrees are a real part of the system, tenure-track doesn’t really exist, and adjunct faculty don’t always either. Some countries, like Germany, have their own quite complicated systems; others fall somewhere in between.
I’m in Vienna to give a couple of talks, one yesterday and one later this afternoon. Draft slides of both talks are on this page.
The first sees me take up again the work of Michael Friedman, in particular his Dynamics of Reason. I see that I was lamenting the difficulty of getting my original article published back in 2007. (But then I was still a couple of months away from finding out that I had a permanent job after a very long search, and so evidently fed up.)
In any case, I think the story I’m telling is improving with age - we now appear to have modal homotopy type theory and M-theory in harmony. But irrespective of whether Urs Schreiber’s Hypothesis H bears fruit, there’s still an important story to tell about mathematics modifying its framework.
The second talk introduces ideas from my forthcoming Modal homotopy type theory: The prospect of a new logic for philosophy.
The Department of Mathematics at the University of California, Riverside invites applications for a faculty position at the full professor level in the general area of pure mathematics. The successful candidate is expected to fill the F. Burton Jones Chair in Pure Mathematics. The search seeks candidates with national and international recognition, an outstanding research track record and outlook, a strong record in mentoring and teaching, and leadership in promoting diversity and serving the professional community.
A PhD in mathematics is required. It is expected that at UCR, the candidate will lead a rigorous research program, will develop and teach graduate and undergraduate courses, will advise and mentor graduate and undergraduate students, will vigorously promote diversity, and will engage in both campus and professional service activities.
A start-up package will be provided. The F. Burton Jones endowment fund will be made available to support the candidate’s research and mentoring activities. This position will start between July 1, 2020 and July 1, 2021.
Interested individuals should submit a cover letter, a full curriculum vitae, a statement of research, a statement of teaching, official teaching evaluations, and a statement on contributions to promoting diversity. In addition, applicants should provide contact names and addresses for four or five references on the candidate’s research, and one reference on the candidate’s teaching and mentoring. All application materials must be submitted through AP Recruit online at: https://aprecruit.ucr.edu/apply/JPF01199.
Review of applications will commence on May 1, 2020, and proceed until the position is filled. For full consideration, applicants should submit their complete applications by April 30, 2020.
The University of California is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status, or any other characteristic protected by law.
UCR is a world-class research university with an exceptionally diverse undergraduate student body. Its mission is explicitly linked to providing routes to educational success for underrepresented and first-generation college students. A commitment to this mission is a preferred qualification.
Advancement through the Professor ranks at the University of California is through a series of structured, merit-based evaluations, occurring every 3-5 years, each of which includes substantial peer input.
Document requirements:
Cover Letter
Curriculum Vitae - Your most recently updated C.V.
Statement of Research
Statement of Teaching
Official Teaching Evaluation
Statement of Past and/or Planned Future Contributions to Advancing Diversity and Inclusive Excellence - In a “Statement of Contributions to Diversity,” we ask applicants to describe their past and potential future contributions to promoting a diverse, equitable, and inclusive environment, which is a key requirement of the role of every academic member at UCR. There are numerous ways to contribute, and a commitment to this part of our mission can be reflected through research, teaching, supervision, mentoring, community engagement, service, and any of the other varied activities that are a part of an academic career.
Reference requirements
5-6 required (contact information only)
Asgar Jamneshan and I have just uploaded to the arXiv our paper “An uncountable Moore-Schmidt theorem“. This paper revisits a classical theorem of Moore and Schmidt in measurable cohomology of measure-preserving systems. To state the theorem, let $X = (X, {\mathcal X}, \mu)$ be a probability space, and let $\mathrm{Aut}(X, {\mathcal X}, \mu)$ be the group of measure-preserving automorphisms of this space, that is to say the invertible bimeasurable maps $T: X \to X$ that preserve the measure $\mu$: $T_* \mu = \mu$. To avoid some ambiguity later in this post when we introduce abstract analogues of measure theory, we will refer to measurable maps as concrete measurable maps, and measurable spaces as concrete measurable spaces. (One could also call $X = (X, {\mathcal X}, \mu)$ a concrete probability space, but we will not need to do so here as we will not be working explicitly with abstract probability spaces.)
Let $\Gamma$ be a discrete group. A (concrete) measure-preserving action of $\Gamma$ on $X$ is a group homomorphism $\gamma \mapsto T^\gamma$ from $\Gamma$ to $\mathrm{Aut}(X, {\mathcal X}, \mu)$, thus $T^1$ is the identity map and $T^{\gamma_1} \circ T^{\gamma_2} = T^{\gamma_1 \gamma_2}$ for all $\gamma_1, \gamma_2 \in \Gamma$. A large portion of ergodic theory is concerned with the study of such measure-preserving actions, especially in the classical case when $\Gamma$ is the integers $\mathbb{Z}$ (with the additive group law).
Let $K = (K, +)$ be a compact Hausdorff abelian group, which we can endow with the Borel $\sigma$-algebra ${\mathcal B}(K)$. A (concrete measurable) $K$-valued cocycle is a collection $\rho = (\rho_\gamma)_{\gamma \in \Gamma}$ of concrete measurable maps $\rho_\gamma: X \to K$ obeying the cocycle equation

$\rho_{\gamma_1 \gamma_2}(x) = \rho_{\gamma_1}(T^{\gamma_2} x) + \rho_{\gamma_2}(x)$

for $\mu$-almost every $x \in X$. (Here we are glossing over a measure-theoretic subtlety that we will return to later in this post – see if you can spot it before then!) Cocycles arise naturally in the theory of group extensions of dynamical systems; in particular (and ignoring the aforementioned subtlety), each cocycle $\rho$ induces a measure-preserving action $\gamma \mapsto S^\gamma$ on $X \times K$ (which we endow with the product of $\mu$ with Haar probability measure on $K$), defined by

$S^\gamma(x, k) := (T^\gamma x, k + \rho_\gamma(x))$
This connection with group extensions was the original motivation for our study of measurable cohomology, but is not the focus of the current paper.
A special case of a $K$-valued cocycle is a (concrete measurable) $K$-valued coboundary, in which $\rho_\gamma$ for each $\gamma \in \Gamma$ takes the special form

$\rho_\gamma(x) = F(T^\gamma x) - F(x)$

for $\mu$-almost every $x \in X$, where $F: X \to K$ is some measurable function; note that (ignoring the aforementioned subtlety), every function of this form is automatically a concrete measurable $K$-valued cocycle. One of the first basic questions in measurable cohomology is to try to characterize which $K$-valued cocycles are in fact $K$-valued coboundaries. This is a difficult question in general. However, there is a general result of Moore and Schmidt that at least allows one to reduce to the model case when $K$ is the unit circle $\mathbb{T} = \mathbb{R}/\mathbb{Z}$, by taking advantage of the Pontryagin dual group $\hat K$ of characters $\hat k: K \to \mathbb{T}$, that is to say the collection of continuous homomorphisms to the unit circle. More precisely, we have
Theorem 1 (Countable Moore-Schmidt theorem) Let $\Gamma$ be a discrete group acting in a concrete measure-preserving fashion on a probability space $(X, \mu)$. Let $K$ be a compact Hausdorff abelian group. Assume the following additional hypotheses:
- (i) $\Gamma$ is at most countable.
- (ii) $X$ is a standard Borel space.
- (iii) $K$ is metrisable.
Then a $K$-valued concrete measurable cocycle $\rho = (\rho_\gamma)_{\gamma \in \Gamma}$ is a concrete coboundary if and only if for each character $\hat k \in \hat K$, the $\mathbb{T}$-valued cocycles $\hat k \circ \rho = (\hat k \circ \rho_\gamma)_{\gamma \in \Gamma}$ are concrete coboundaries.
The hypotheses (i), (ii), (iii) are saying in some sense that the data $\Gamma, X, K$ are not too “large”; in all three cases they are saying in some sense that the data are only “countably complicated”. For instance, (iii) is equivalent to $K$ being second countable, and (ii) is equivalent to $X$ being modeled by a complete separable metric space. It is because of this restriction that we refer to this result as a “countable” Moore-Schmidt theorem. This theorem is a useful tool in several other applications, such as the Host-Kra structure theorem for ergodic systems; I hope to return to these subsequent applications in a future post.
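As a toy illustration of the cocycle and coboundary concepts above, here is a quick numerical check in a hypothetical finite model of my own (not from the paper): the space is the cyclic group Z/N with the shift action of Z, the target group is the circle R/Z represented by exact rationals mod 1, and we verify that every coboundary satisfies the cocycle equation.

```python
from fractions import Fraction

# Toy finite model (my own illustration): X = Z/N with the Z-action
# T^g(x) = x + g mod N, and K = R/Z represented as exact Fractions mod 1.
# Given any F: X -> K, the coboundary rho_g(x) = F(T^g x) - F(x) should
# satisfy the cocycle equation
#   rho_{g1+g2}(x) = rho_{g1}(T^{g2} x) + rho_{g2}(x)   (mod 1).

N = 7
F = {x: Fraction(x * x, 11) % 1 for x in range(N)}  # an arbitrary map X -> R/Z

def T(g, x):            # the action of g in Z on X = Z/N
    return (x + g) % N

def rho(g, x):          # the coboundary associated to F
    return (F[T(g, x)] - F[x]) % 1

for g1 in range(-3, 4):
    for g2 in range(-3, 4):
        for x in range(N):
            lhs = rho(g1 + g2, x)
            rhs = (rho(g1, T(g2, x)) + rho(g2, x)) % 1
            assert lhs == rhs
print("cocycle equation verified")
```

The check is exact (no floating-point tolerance needed) because Fractions mod 1 model R/Z without rounding.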
Let us very briefly sketch the main ideas of the proof of Theorem 1. Ignore for now issues of measurability, and pretend that something that holds almost everywhere in fact holds everywhere. The hard direction is to show that if each $\hat k \circ \rho$ is a coboundary, then so is $\rho$. By hypothesis, we then have an equation of the form

$\hat k \circ \rho_\gamma = \alpha_{\hat k} \circ T^\gamma - \alpha_{\hat k} \ \ \ \ (2)$
for all $\hat k \in \hat K$, $\gamma \in \Gamma$ and some functions $\alpha_{\hat k}: X \to \mathbb{T}$, and our task is then to produce a function $F: X \to K$ for which

$\rho_\gamma = F \circ T^\gamma - F \ \ \ \ (1)$
for all $\gamma \in \Gamma$.
Comparing the two equations, the task would be easy if we could find an $F$ for which

$\hat k \circ F = \alpha_{\hat k} \ \ \ \ (3)$
for all $\hat k \in \hat K$. However there is an obstruction to this: the left-hand side of (3) is additive in $\hat k$, so the right-hand side would have to be also in order to obtain such a representation. In other words, for this strategy to work, one would have to first establish the identity

$\alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2} = 0 \ \ \ \ (4)$
for all $\hat k_1, \hat k_2 \in \hat K$. On the other hand, the good news is that if we somehow manage to obtain the equation (4), then we can obtain a function $F$ obeying (3), thanks to Pontryagin duality, which gives a one-to-one correspondence between $K$ and the homomorphisms of the (discrete) group $\hat K$ to $\mathbb{T}$.
Now, it turns out that one cannot derive the equation (4) directly from the given information (2). However, the left-hand side of (2) is additive in $\hat k$, so the right-hand side must be also. Manipulating this fact, we eventually arrive at

$(\alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2}) \circ T^\gamma = \alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2}$

for all $\gamma \in \Gamma$ and $\hat k_1, \hat k_2 \in \hat K$.
In other words, we don’t get to show that the left-hand side of (4) vanishes, but we do at least get to show that it is $\Gamma$-invariant. Now let us assume for sake of argument that the action of $\Gamma$ is ergodic, which (ignoring issues about sets of measure zero) basically asserts that the only $\Gamma$-invariant functions are constant. So now we get a weaker version of (4), namely

$\alpha_{\hat k_1 + \hat k_2} - \alpha_{\hat k_1} - \alpha_{\hat k_2} = c_{\hat k_1, \hat k_2} \ \ \ \ (5)$
for some constants $c_{\hat k_1, \hat k_2} \in \mathbb{T}$.
Now we need to eliminate the constants. This can be done by the following group-theoretic projection. Let $L^0(X \to \mathbb{T})$ denote the space of concrete measurable maps $\alpha: X \to \mathbb{T}$, up to almost everywhere equivalence; this is an abelian group where the various terms in (5) naturally live. Inside this group we have the subgroup $\mathbb{T}$ of constant functions (up to almost everywhere equivalence); this is where the right-hand side of (5) lives. Because $\mathbb{T}$ is a divisible group, there is an application of Zorn’s lemma (a good exercise for those who are not acquainted with these things) to show that there exists a retraction $w: L^0(X \to \mathbb{T}) \to \mathbb{T}$, that is to say a group homomorphism that is the identity on the subgroup $\mathbb{T}$. We can use this retraction, or more precisely the complement $\alpha \mapsto \alpha - w(\alpha)$, to eliminate the constant in (5). Indeed, if we set

$\tilde \alpha_{\hat k} := \alpha_{\hat k} - w(\alpha_{\hat k})$
then from (5) we see that

$\tilde \alpha_{\hat k_1 + \hat k_2} - \tilde \alpha_{\hat k_1} - \tilde \alpha_{\hat k_2} = 0$
while from (2) one has

$\hat k \circ \rho_\gamma = \tilde \alpha_{\hat k} \circ T^\gamma - \tilde \alpha_{\hat k}$
and now the previous strategy works with $\alpha_{\hat k}$ replaced by $\tilde \alpha_{\hat k}$. This concludes the sketch of proof of Theorem 1.
In making the above argument rigorous, the hypotheses (i)-(iii) are used in several places. For instance, to reduce to the ergodic case one relies on the ergodic decomposition, which requires the hypothesis (ii). Also, most of the above equations only hold outside of a set of measure zero, and the hypotheses (i) and (iii) (the latter being equivalent to $\hat K$ being at most countable) are then needed to avoid the problem that an uncountable union of sets of measure zero could have positive measure (or fail to be measurable at all).
My co-author Asgar Jamneshan and I are working on a long-term project to extend many results in ergodic theory (such as the aforementioned Host-Kra structure theorem) to “uncountable” settings in which hypotheses analogous to (i)-(iii) are omitted; thus we wish to consider actions of uncountable groups, on spaces that are not standard Borel, and cocycles taking values in groups that are not metrisable. Such uncountable contexts naturally arise when trying to apply ergodic theory techniques to combinatorial problems (such as the inverse conjecture for the Gowers norms), as one often relies on the ultraproduct construction (or something similar) to generate an ergodic theory translation of these problems, and these constructions usually give “uncountable” objects rather than “countable” ones. (For instance, the ultraproduct of finite groups is a hyperfinite group, which is usually uncountable.) This paper marks the first step in this project by extending the Moore-Schmidt theorem to the uncountable setting.
If one simply drops the hypotheses (i)-(iii) and tries to prove the Moore-Schmidt theorem, several serious difficulties arise. We have already mentioned the loss of the ergodic decomposition and the possibility that one has to control an uncountable union of null sets. But there is in fact a more basic problem when one deletes (iii): the addition operation $+: K \times K \to K$, while still continuous, can fail to be measurable as a map from $(K \times K, {\mathcal B}(K) \otimes {\mathcal B}(K))$ to $(K, {\mathcal B}(K))$! Thus for instance the sum of two measurable functions $f, g: X \to K$ need not remain measurable, which makes even the very definition of a measurable cocycle or measurable coboundary problematic (or at least unnatural). This phenomenon is known as the Nedoma pathology. A standard example arises when $K$ is the uncountable torus $\mathbb{T}^{\mathbb{R}}$, endowed with the product topology. Crucially, the Borel $\sigma$-algebra ${\mathcal B}(K)$ generated by this uncountable product is not the product of the factor Borel $\sigma$-algebras ${\mathcal B}(\mathbb{T})$ (the discrepancy ultimately arises from the fact that topologies permit uncountable unions, but $\sigma$-algebras do not); relating to this, the product $\sigma$-algebra ${\mathcal B}(K) \otimes {\mathcal B}(K)$ is not the same as the Borel $\sigma$-algebra ${\mathcal B}(K \times K)$, but is instead a strict sub-algebra. If the group operations on $K$ were measurable, then the diagonal set

$K^\Delta := \{ (k, k') \in K \times K: k = k' \}$
would be measurable in ${\mathcal B}(K) \otimes {\mathcal B}(K)$. But it is an easy exercise in manipulation of $\sigma$-algebras to show that if $(X, {\mathcal X})$ and $(Y, {\mathcal Y})$ are any two measurable spaces and $E$ is measurable in ${\mathcal X} \otimes {\mathcal Y}$, then the fibres $E_x := \{ y \in Y: (x, y) \in E \}$ of $E$ are contained in some countably generated subalgebra of ${\mathcal Y}$. Thus if $K^\Delta$ were ${\mathcal B}(K) \otimes {\mathcal B}(K)$-measurable, then all the points of $K$ would lie in a single countably generated $\sigma$-algebra. But the cardinality of such an algebra is at most $2^{\aleph_0}$ while the cardinality of $K$ is $2^{2^{\aleph_0}}$, and Cantor’s theorem then gives a contradiction.
To resolve this problem, we give $K$ a coarser $\sigma$-algebra than the Borel $\sigma$-algebra, which we call the reduced $\sigma$-algebra, thus coarsening the measurable space structure on $K$ to a new measurable space. In the case of compact Hausdorff abelian groups, the reduced $\sigma$-algebra can be defined as the $\sigma$-algebra generated by the characters $\hat k: K \to \mathbb{T}$; for more general compact abelian groups, one can define it as the $\sigma$-algebra generated by all continuous maps into metric spaces. This $\sigma$-algebra is equal to the Borel $\sigma$-algebra ${\mathcal B}(K)$ when $K$ is metrisable but can be smaller for other $K$. With this measurable structure, $K$ becomes a measurable group; it seems that once one leaves the metrisable world the reduced $\sigma$-algebra is a superior (or at least equally good) structure to work with than the Borel $\sigma$-algebra for analysis, as it avoids the Nedoma pathology. (For instance, from Plancherel’s theorem, we see that if $\mu_K$ is the Haar probability measure on $K$, then every ${\mathcal B}(K)$-measurable set is equivalent modulo $\mu_K$-null sets to a set measurable in the reduced $\sigma$-algebra, so there is no damage to Plancherel caused by passing to the reduced $\sigma$-algebra.)
Passing to the reduced $\sigma$-algebra fixes the most severe problems with an uncountable Moore-Schmidt theorem, but one is still faced with an issue of having to potentially take an uncountable union of null sets. To avoid this sort of problem, we pass to the framework of abstract measure theory, in which we remove explicit mention of “points” and can easily delete all null sets at a very early stage of the formalism. In this setup, the category of concrete measurable spaces is replaced with the larger category of abstract measurable spaces, which we formally define as the opposite category of the category of $\sigma$-algebras (with Boolean algebra homomorphisms). Thus, we define an abstract measurable space to be an object of the form ${\mathcal X}^{\mathrm{op}}$, where ${\mathcal X}$ is an (abstract) $\sigma$-algebra and $\mathrm{op}$ is a formal placeholder symbol that signifies use of the opposite category, and an abstract measurable map $T: {\mathcal X}^{\mathrm{op}} \to {\mathcal Y}^{\mathrm{op}}$ is an object of the form $T = (T^*)^{\mathrm{op}}$, where $T^*: {\mathcal Y} \to {\mathcal X}$ is a Boolean algebra homomorphism and $\mathrm{op}$ is again used as a formal placeholder; we call $T^*$ the pullback map associated to $T$. [UPDATE: It turns out that this definition of a measurable map led to technical issues. In a forthcoming revision of the paper we also impose the requirement that the abstract measurable map be $\sigma$-complete (i.e., it respects countable joins).] The composition $S \circ T: {\mathcal X}^{\mathrm{op}} \to {\mathcal Z}^{\mathrm{op}}$ of two abstract measurable maps $T: {\mathcal X}^{\mathrm{op}} \to {\mathcal Y}^{\mathrm{op}}$, $S: {\mathcal Y}^{\mathrm{op}} \to {\mathcal Z}^{\mathrm{op}}$ is defined by the formula $S \circ T := (T^* \circ S^*)^{\mathrm{op}}$, or equivalently $(S \circ T)^* = T^* \circ S^*$.
Every concrete measurable space $(X, {\mathcal X})$ can be identified with an abstract counterpart ${\mathcal X}^{\mathrm{op}}$, and similarly every concrete measurable map $T: X \to Y$ between concrete measurable spaces $(X, {\mathcal X})$, $(Y, {\mathcal Y})$ can be identified with an abstract counterpart $(T^*)^{\mathrm{op}}$, where $T^*: {\mathcal Y} \to {\mathcal X}$ is the pullback map $T^* E := T^{-1}(E)$. Thus the category of concrete measurable spaces can be viewed as a subcategory of the category of abstract measurable spaces. The advantage of working in the abstract setting is that it gives us access to more spaces that could not be directly defined in the concrete setting. Most importantly for us, we have a new abstract space, the opposite measure algebra $X_\mu$ of $X$, defined as $({\mathcal X}/{\mathcal N})^{\mathrm{op}}$ where ${\mathcal N}$ is the ideal of null sets in ${\mathcal X}$. Informally, $X_\mu$ is the space $X$ with all the null sets removed; there is a canonical abstract embedding map $\iota: ({\mathcal X}/{\mathcal N})^{\mathrm{op}} \to {\mathcal X}^{\mathrm{op}}$, which allows one to convert any concrete measurable map $f: X \to Y$ into an abstract one from $X_\mu$ to ${\mathcal Y}^{\mathrm{op}}$. One can then define the notion of an abstract action, abstract cocycle, and abstract coboundary by replacing every occurrence of the category of concrete measurable spaces with their abstract counterparts, and replacing $X$ with the opposite measure algebra $X_\mu$; see the paper for details. Our main theorem is then
Theorem 2 (Uncountable Moore-Schmidt theorem) Let $\Gamma$ be a discrete group acting abstractly on a $\sigma$-finite measure space $(X, \mu)$. Let $K$ be a compact Hausdorff abelian group. Then a $K$-valued abstract measurable cocycle $\rho = (\rho_\gamma)_{\gamma \in \Gamma}$ is an abstract coboundary if and only if for each character $\hat k \in \hat K$, the $\mathbb{T}$-valued cocycles $\hat k \circ \rho = (\hat k \circ \rho_\gamma)_{\gamma \in \Gamma}$ are abstract coboundaries.
With the abstract formalism, the proof of the uncountable Moore-Schmidt theorem is almost identical to the countable one (in fact we were able to make some simplifications, such as avoiding the use of the ergodic decomposition). A key tool is what we call a “conditional Pontryagin duality” theorem, which asserts that if one has an abstract measurable map $\alpha_{\hat k}$ from the opposite measure algebra of $X$ to $\mathbb{T}$ for each $\hat k \in \hat K$ obeying the identity $\alpha_{\hat k_1 + \hat k_2} = \alpha_{\hat k_1} + \alpha_{\hat k_2}$ for all $\hat k_1, \hat k_2 \in \hat K$, then there is an abstract measurable map $F$ from that measure algebra to $K$ such that $\hat k \circ F = \alpha_{\hat k}$ for all $\hat k \in \hat K$. This is derived from the usual Pontryagin duality and some other tools, most notably the completeness of the measure algebra of $\mu$, and the Sikorski extension theorem.
We feel that it is natural to stay within the abstract measure theory formalism whenever dealing with uncountable situations. However, it is still an interesting question as to when one can guarantee that the abstract objects constructed in this formalism are representable by concrete analogues. The basic questions in this regard are:
For (i) the answer is somewhat interesting (as I learned after posing this MathOverflow question):
Our understanding of (ii) is much less complete:
In view of the answers to (i), I would not be surprised if the full answer to (ii) was also sensitive to axioms of set theory. However, such set theoretic issues seem to be almost completely avoided if one sticks with the abstract formalism throughout; they only arise when trying to pass back and forth between the abstract and concrete categories.
Peter Denton, Stephen Parke, Xining Zhang, and I have just uploaded to the arXiv a completely rewritten version of our previous paper, now titled “Eigenvectors from Eigenvalues: a survey of a basic identity in linear algebra“. This paper is now a survey of the various literature surrounding the following basic identity in linear algebra, which we propose to call the eigenvector-eigenvalue identity:
Theorem 1 (Eigenvector-eigenvalue identity) Let $A$ be an $n \times n$ Hermitian matrix, with eigenvalues $\lambda_1(A), \dots, \lambda_n(A)$. Let $v_i$ be a unit eigenvector corresponding to the eigenvalue $\lambda_i(A)$, and let $v_{i,j}$ be the $j^{th}$ component of $v_i$. Then

$|v_{i,j}|^2 \prod_{k=1; k \neq i}^{n} (\lambda_i(A) - \lambda_k(A)) = \prod_{k=1}^{n-1} (\lambda_i(A) - \lambda_k(M_j))$

where $M_j$ is the $(n-1) \times (n-1)$ Hermitian matrix formed by deleting the $j^{th}$ row and column from $A$.
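The identity is easy to check numerically. Here is a hand-rolled sanity check for a real symmetric 2×2 matrix (my own example, not code from the paper); for n = 2 the minors M_1 and M_2 are scalars, so everything can be done in closed form with the standard library.

```python
import math

# Sanity check of the eigenvector-eigenvalue identity for a real symmetric
# 2x2 matrix A = [[a, b], [b, c]] (a hand-rolled example, not the paper's
# code). For n = 2 the minors M_1 = [c] and M_2 = [a] are 1x1, so each has
# a single eigenvalue.

a, b, c = 2.0, 1.0, 5.0
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
eigs = [(a + c - disc) / 2, (a + c + disc) / 2]   # lambda_1(A) <= lambda_2(A)

for i, lam in enumerate(eigs):
    # a unit eigenvector for lam is (b, lam - a) normalised (valid since b != 0)
    v = (b, lam - a)
    norm = math.hypot(*v)
    v = (v[0] / norm, v[1] / norm)
    other = eigs[1 - i]
    for j, minor_eig in enumerate([c, a]):        # eigenvalue of M_1, then M_2
        lhs = v[j] ** 2 * (lam - other)           # |v_{i,j}|^2 * prod_{k != i}(lam_i - lam_k)
        rhs = lam - minor_eig                     # product over eigenvalues of M_j
        assert abs(lhs - rhs) < 1e-9
print("identity verified for the 2x2 example")
```

The same check generalises to any n with an eigensolver (e.g. `numpy.linalg.eigh`), comparing the two products for every pair (i, j).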
When we posted the first version of this paper, we were unaware of previous appearances of this identity in the literature; a related identity had been used by Erdos-Schlein-Yau and by myself and Van Vu for applications to random matrix theory, but to our knowledge this specific identity appeared to be new. Even two months after our preprint first appeared on the arXiv in August, we had only learned of one other place in the literature where the identity showed up (by Forrester and Zhang, who also cite an earlier paper of Baryshnikov).
The situation changed rather dramatically with the publication of a popular science article in Quanta on this identity in November, which gave this result significantly more exposure. Within a few weeks we became informed (through private communication, online discussion, and exploration of the citation tree around the references we were alerted to) of over three dozen places where the identity, or some other closely related identity, had previously appeared in the literature, in such areas as numerical linear algebra, various aspects of graph theory (graph reconstruction, chemical graph theory, and walks on graphs), inverse eigenvalue problems, random matrix theory, and neutrino physics. As a consequence, we have decided to completely rewrite our article in order to collate this crowdsourced information, and survey the history of this identity, all the known proofs (we collect seven distinct ways to prove the identity (or generalisations thereof)), and all the applications of it that we are currently aware of. The citation graph of the literature that this ad hoc crowdsourcing effort produced is only very weakly connected, which we found surprising.
The earliest explicit appearance of the eigenvector-eigenvalue identity we are now aware of is in a 1966 paper of Thompson, although this paper is only cited (directly or indirectly) by a fraction of the known literature, and also there is a precursor identity of Löwner from 1934 that can be shown to imply the identity as a limiting case. At the end of the paper we speculate on some possible reasons why this identity only achieved a modest amount of recognition and dissemination prior to the November 2019 Quanta article.
It takes a long time to soften carrots adequately by sauteing them, so a lot of recipes ask you to boil the carrots first or even mix butter and water and cook the carrots in the resulting lipidous slurry. This ends up tasting OK but the carrots never really taste sauteed to me, they taste boiled! I want that sear.
So this week I tried something new, borrowing a technique I learned a long time ago for perfect sauteed asparagus. Put your butter in the pan, melt it, get those carrots sauteing in there. Put in some salt and whatever other seasoning you want at whatever time suits that seasoning. (Dill is traditional; I used nutmeg this week and it was great.) Saute the carrots until they’re nicely browned. At this point they will not be cooked enough. Eat one, it’ll taste nice on the outside but still be crunchy and part-raw.
So now it’s time to shock the carrots. Fill a small drinking glass half-full with water. So maybe a quarter cup, I dunno. Throw the water in the hot pan and immediately, as the sizzle kicks in and the steam begins to rise, slam the lid on. It should sound sort of like a hi-hat when you crash it and then right away mute it. Turn the heat down and let the carrots steam in there for about six minutes. When you open it, the water should be gone, but if it’s not I would just take the carrots out with a slotted spoon. Result: fully tender carrots that taste sauteed, not boiled.
There has been a dispute running through mathematical twitter about diversity statements in academic hiring. Prompted by that, Izabella Laba has just written an excellent post, which affirms the importance of diversity as a goal, but lays out the many tricky issues with diversity statements. It makes a lot of points I would like to make, and raises others that I hadn’t thought of but should have. In case there is someone reading this blog but not reading Professor Laba’s, go check it out.
Sadly, this blog is still dead. I have a list of things I would like to write on it, but I have no idea when or if I will find the time.
Update: Sadly, Nathan Keller and Ohad Klein tell me that they’ve retracted their preprint, because of what currently looks like a fatal flaw in Lemma 5.3, uncovered by Paata Ivanishvili. I wish them the best of luck in fixing the problem. In the meantime, I’m leaving up this post for “historical” reasons.
Around 1999, one of the first things I ever did in quantum computing theory was to work on a problem that Fortnow and Rogers suggested in a paper: is it possible to separate P from BQP relative to a random oracle? (That is, without first needing to separate P from PSPACE or whatever in the real world?) Or to the contrary: suppose that a quantum algorithm Q makes T queries to a Boolean input string X. Is there then a classical simulation algorithm that makes poly(T) queries to X, and that approximates Q’s acceptance probability for most values of X? Such a classical simulation, were it possible, would still be consistent with the existence of quantum algorithms like Simon’s and Shor’s, which are able to achieve exponential (and even greater) speedups in the black-box setting. It would simply demonstrate the importance, for Simon’s and Shor’s algorithms, of global structure that makes the string X extremely non-random: for example, encoding a periodic function (in the case of Shor’s algorithm), or encoding a function that hides a secret string s (in the case of Simon’s). It would underscore that superpolynomial quantum speedups depend on structure.
I never managed to solve this problem. Around 2008, though, I noticed that a solution would follow from a perhaps-not-obviously-related conjecture, about influences in low-degree polynomials. Namely, let p:R^{n}→R be a degree-d real polynomial in n variables, and suppose p(x)∈[0,1] for all x∈{0,1}^{n}. Define the variance of p to be
Var(p):=E_{x,y}[|p(x)-p(y)|],
and define the influence of the i^{th} variable to be
Inf_{i}(p):=E_{x}[|p(x)-p(x^{i})|].
Here the expectations are over strings in {0,1}^{n}, and x^{i} means x with its i^{th} bit flipped between 0 and 1. Then the conjecture is this: there must be some variable i such that Inf_{i}(p) ≥ poly(Var(p)/d) (in other words, one that “explains” a non-negligible fraction of the variance of the entire polynomial).
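To make the definitions concrete, here's a brute-force computation of Var(p) and Inf_i(p) for a small example polynomial (my own illustration, not from the paper); everything is an exact expectation over the 2^n points of the Boolean cube.

```python
from itertools import product

# Brute-force check of the definitions above on a small example: p is the
# multilinear "AND of 3 bits" polynomial x1*x2*x3, which maps the cube into
# [0,1] and has degree d = 3. (My own illustration, not the paper's code.)

n, d = 3, 3

def p(x):
    return x[0] * x[1] * x[2]

cube = list(product([0, 1], repeat=n))

def var(p):     # Var(p) := E_{x,y} [ |p(x) - p(y)| ]
    return sum(abs(p(x) - p(y)) for x in cube for y in cube) / len(cube) ** 2

def inf(p, i):  # Inf_i(p) := E_x [ |p(x) - p(x^i)| ]
    def flip(x, i):
        return x[:i] + (1 - x[i],) + x[i + 1:]
    return sum(abs(p(x) - p(flip(x, i))) for x in cube) / len(cube)

print(var(p))                          # 2 * (1/8) * (7/8) = 0.21875
print([inf(p, i) for i in range(n)])   # [0.25, 0.25, 0.25]
```

For AND, flipping x_i changes p exactly when the other two bits are 1, so each influence is 1/4, comfortably above poly(Var(p)/d) for this toy case.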
Why would this conjecture imply the statement about quantum algorithms? Basically, because of the seminal result of Beals et al. from 1998: that if a quantum algorithm makes T queries to a Boolean input X, then its acceptance probability can be written as a real polynomial over the bits of X, of degree at most 2T. Given that result, if you wanted to classically simulate a quantum algorithm Q on most inputs—and if you only cared about query complexity, not computation time—you’d simply need to do the following:
(1) Find the polynomial p that represents Q’s acceptance probability.
(2) Find a variable i that explains at least a 1/poly(T) fraction of the total remaining variance in p, and query that i.
(3) Keep repeating step (2), until p has been restricted to a polynomial with not much variance left—i.e., to nearly a constant function p(X)=c. Whenever that happens, halt and output the constant c.
The key is that by hypothesis, this algorithm will halt, with high probability over X, after only poly(T) steps.
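Steps (1)-(3) can be sketched in code. The following is an illustrative toy of my own (not from any paper): it brute-forces variances and influences over the Boolean cube, so it is exponential in n and only meant to show the control flow of the greedy query strategy.

```python
from itertools import product

# Illustrative sketch (my own toy code) of steps (1)-(3): greedily query the
# most influential remaining variable of p until what is left of p is nearly
# constant, then output that constant. Here p plays the role of the degree-2T
# polynomial representing Q's acceptance probability.

def simulate(p, n, X, eps=1e-3):
    fixed = {}                                  # variables queried so far
    def cube():                                 # assignments consistent with queries
        free = [i for i in range(n) if i not in fixed]
        for bits in product([0, 1], repeat=len(free)):
            x = dict(fixed); x.update(zip(free, bits)); yield x
    while True:
        pts = [p([x[i] for i in range(n)]) for x in cube()]
        var = sum(abs(a - b) for a in pts for b in pts) / len(pts) ** 2
        if var < eps or len(fixed) == n:        # p is now nearly constant
            return sum(pts) / len(pts)
        def influence(i):                       # Inf_i of the restricted polynomial
            tot = 0
            for x in cube():
                y = dict(x); y[i] = 1 - y[i]
                tot += abs(p([x[j] for j in range(n)]) - p([y[j] for j in range(n)]))
            return tot / len(pts)
        i = max((i for i in range(n) if i not in fixed), key=influence)
        fixed[i] = X[i]                         # one classical query to X

# e.g. p = x0*x1, the acceptance probability of a toy "AND" algorithm:
print(simulate(lambda x: x[0] * x[1], 2, X=[1, 1]))   # -> 1.0
```

The conjecture is exactly what's needed to guarantee that the influential variable in the second step always exists, so that the loop terminates after poly(T) queries.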
Anyway, around the same time, Andris Ambainis had a major breakthrough on a different problem that I’d told him about: namely, whether randomized and quantum query complexities are polynomially related for all partial functions with permutation symmetry (like the collision and the element distinctness functions). Andris and I decided to write up the two directions jointly. The result was our 2011 paper entitled The Need for Structure in Quantum Speedups.
Of the two contributions in the “Need for Structure” paper, the one about random oracles and influences in low-degree polynomials was clearly the weaker and less satisfying one. As the reviewers pointed out, that part of the paper didn’t solve anything: it just reduced one unsolved problem to a new, slightly different problem that was also unsolved. Nevertheless, that part of the paper acquired a life of its own over the ensuing decade, as the world’s experts in analysis of Boolean functions and polynomials began referring to the “Aaronson-Ambainis Conjecture.” Ryan O’Donnell, Guy Kindler, and many others had a stab. I even got Terry Tao to spend an hour or two on the problem when I visited UCLA.
Now, at long last, Nathan Keller and Ohad Klein say they’ve proven the Aaronson-Ambainis Conjecture, in a preprint whose title is a riff on ours: “Quantum Speedups Need Structure.”
Their paper hasn’t yet been peer-reviewed, and I haven’t yet carefully studied it, but I could and should: at 19 pages, it looks very approachable and clear, if not as radically short as (say) Huang’s proof of the Sensitivity Conjecture. Keller and Klein’s argument subsumes all the earlier results that I knew would need to be subsumed, and involves all the concepts (like a real analogue of block sensitivity) that I knew would need to be involved somehow.
My plan had been as follows:
(1) Read their paper in detail. Understand every step of their proof.
(2) Write a blog post that reflects my detailed understanding.
Unfortunately, this plan did not sufficiently grapple with the fact that I now have two kids. It got snagged for a week at step (1). So I’m now executing an alternative plan, which is to jump immediately to the blog post.
Assuming Keller and Klein’s result holds up—as I expect it will—by combining it with the observations in my and Andris’s paper, one immediately gets an explanation for why no one has managed to separate P from BQP relative to a random oracle, but only relative to non-random oracles. This complements the work of Kahn, Saks, and Smyth, who around 2000 gave a precisely analogous explanation for the difficulty of separating P from NP∩coNP relative to a random oracle.
Unfortunately, the polynomial blowup is quite enormous: from a quantum algorithm making T queries, Keller and Klein apparently get a classical algorithm making O(T^{18}) queries. But such things can almost always be massively improved.
Feel free to use the comments to ask any questions about this result or its broader context. I’ll either do my best to answer from the limited amount I know, or else I’ll pass the questions along to Nathan and Ohad themselves. Maybe, at some point, I’ll even be forced to understand the new proof.
Congratulations to Nathan and Ohad!
Update (Nov. 20): Tonight I finally did what I should’ve done two weeks ago, and worked through the paper from start to finish. Modulo some facts about noise operators, hypercontractivity, etc. that I took on faith, I now have a reasonable (albeit imperfect) understanding of the proof. It’s great!
In case it’s helpful to anybody, here’s my one-paragraph summary of how it works. First, you hit your bounded degree-d function f with a random restriction to attenuate its higher-degree Fourier coefficients (reminiscent of Linial-Mansour-Nisan). Next, in that attenuated function, you find a small “coalition” of influential variables—by which we mean, a set of variables for which there’s some assignment that substantially biases f. You keep iterating—finding influential coalitions in subfunctions on n/4, n/8, etc. variables. All the while, you keep track of the norm of the vector of all the block-sensitivities of all the inputs (the authors don’t clearly explain this in the intro, but they reveal it near the end). Every time you find another influential coalition, that norm goes down by a little, but by approximation theory, it can only go down O(d^{2}) times until it hits rock bottom and your function is nearly constant. By the end, you’ll have approximated f itself by a decision tree of depth poly(d, 1/ε, log(n)). Finally, you get rid of the log(n) term by using the fact that f essentially depended on at most exp(O(d)) variables anyway.
Anyway, I’m not sure how helpful it is to write more: the paper itself is about 95% as clear as it could possibly be, and even where it isn’t, you’d probably need to read it first (and, uh, know something about influences, block sensitivity, random restrictions, etc.) before any further clarifying remarks would be of use. But happy to discuss more in the comments, if anyone else is reading it.
Science is by definition empirical. We discover how the world works not by sitting and thinking, but by going out and observing the world. But sometimes, all the observing we can do can’t possibly answer a question. In those situations, we might need “non-empirical science”.
The blog Slate Star Codex had a series of posts on this topic recently. Its author hangs out with a crowd that supports the many-worlds interpretation of quantum mechanics: the idea that quantum events are not truly random, but instead that all outcomes happen, the universe metaphorically splitting into different possible worlds. These metaphorical universes can’t be observed, so no empirical test can tell the difference between this and other interpretations of quantum mechanics: if we could ever know the difference, it would have to be for “non-empirical” reasons.
What reasons are those? Slate Star Codex teases out a few possible intuitions. He points out that we reject theories that have “unnecessary” ideas. He imagines a world where chemists believe that mixing an acid and a base also causes a distant star to go supernova, and a creationist world where paleontologists believe fossils are placed by the devil. In both cases, there might be no observable difference between their theories and ours, but because their theories have “extra pieces” (the distant star, the devil), we reject them for non-empirical reasons. Slate Star Codex asks if this supports many-worlds: without the extra assumption that quantum events randomly choose one outcome, isn’t quantum mechanics simpler?
I agree with some of this. Science really does use non-empirical reasoning. Without it, there’s no reason not to treat the world as a black box, a series of experiments with no mechanism behind it. But while we do reject theories with unnecessary ideas, that isn’t our only standard. We also need our theories to teach us about the world.
Ultimately, we trust science because it allows us to do things. If we understand the world, we can interact with it: we can build technology, design new experiments, and propose new theories. With this in mind, we can judge scientific theories by how well they help us do these things. A good scientific theory is one that gives us more power to interact with the world. It can do this by making correct predictions, but it can also do this by explaining things, making it easier for us to reason about them. Beyond empiricism, we can judge science by how well it teaches us.
This gives us an objection to the “supernova theory” of Slate Star Codex’s imagined chemists: it’s much more confusing to teach. To teach chemistry in that world you also have to teach the entire life cycle of stars, a subject that students won’t use in any other part of the course. The creationists’ “devil theory” of paleontology has the same problem: if their theory really makes the right predictions they’d have to teach students everything our paleontologists do: every era of geologic history, every theory of dinosaur evolution, plus an extra course in devil psychology. They end up with a mix that only makes it harder to understand the subject.
Many-worlds may seem simpler than other interpretations of quantum mechanics, but that doesn’t make it more useful, or easier to teach. You still need to teach students how to predict the results of experiments, and those results will still be random. If you teach them many-worlds, you need to add more discussion much earlier on, advanced topics like self-localizing uncertainty and decoherence. You need a quite extensive set of ideas, many of which won’t be used again, to justify rules another interpretation could have introduced much more simply. This would be fine if those ideas made additional predictions, but they don’t: like every interpretation of quantum mechanics, you end up doing the same experiments and building the same technology in the end.
I’m not saying I know many-worlds is false, or that I know another interpretation is true. All I’m saying is that, when physicists criticize many-worlds, they’re not just blindly insisting on empiricism. They’re rejecting many-worlds, in part, because all it does is make their work harder. And that, more than elegance or simplicity, is how we judge theories.
This year we give thanks for space. (We’ve previously given thanks for the Standard Model Lagrangian, Hubble’s Law, the Spin-Statistics Theorem, conservation of momentum, effective field theory, the error bar, gauge symmetry, Landauer’s Principle, the Fourier Transform, Riemannian Geometry, the speed of light, the Jarzynski equality, and the moons of Jupiter.)
Even when we restrict to essentially scientific contexts, “space” can have a number of meanings. In a tangible sense, it can mean outer space — the final frontier, the place beyond the Earth where the stars and other planets are located. In a much more abstract setting, mathematicians use “space” to mean some kind of set with additional structure, like Hilbert space or the space of all maps between two manifolds. Here we’re aiming in between, using “space” to mean the three-dimensional manifold in which physical objects are located, at least as far as our observable universe is concerned.
That last clause reminds us that there are some complications here. The three dimensions we see of space might not be all there are; extra dimensions could be hidden from us by being curled up into tiny balls (or generalizations thereof) that are too small to see, or because the known particles and forces are confined to a three-dimensional brane embedded in a larger universe. On the other side, we have intimations that quantum theories of gravity imply the holographic principle, according to which an N-dimensional universe can be thought of as arising as a projection of (N−1) dimensions’ worth of information. And much less speculatively, Einstein and Minkowski taught us long ago that three-dimensional space is better thought of as part of four-dimensional spacetime.
Let’s put all of that aside. Our everyday world is accurately modeled as stuff, distributed through three-dimensional space, evolving with time. That’s something to be thankful for! But we can also wonder why it is the case.
I don’t mean “Why is space three-dimensional?”, although there is that. I mean why is there something called “space” at all? I recently gave an informal seminar on this at Columbia, and I talk about it a bit in Something Deeply Hidden, and it’s related in spirit to a question Michael Nielsen recently asked on Twitter, “Why does F=ma?”
Space is probably emergent rather than fundamental, and the ultimate answer to why it exists is probably going to involve quantum mechanics, and perhaps quantum gravity in particular. The right question is “Why does the wave function of the universe admit a description as a set of branching semi-classical worlds, each of which feature objects evolving in three-dimensional space?” We’re working on that!
But rather than answer it, for the purposes of thankfulness I just want to point out that it’s not obvious that space as we know it had to exist, even if classical mechanics had been the correct theory of the world.
Newton himself thought of space as absolute and fundamental. His ontology, as the philosophers would put it, featured objects located in space, evolving with time. Each object has a trajectory, which is its position in space at each moment of time. Quantities like “velocity” and “acceleration” are important, but they’re not fundamental — they are derived from spatial position, as the first and second derivatives with respect to time, respectively.
But that’s not the only way to do classical mechanics, and in some sense it’s not the most basic and powerful way. An alternative formulation is provided by Hamiltonian mechanics, where the fundamental variable isn’t “position,” but the combination of “position and momentum,” which together describe the phase space of a system. The state of a system at any one time is given by a point in phase space. There is a function of phase space cleverly called the Hamiltonian H(x,p), from which the dynamical equations of the system can be derived.
That might seem a little weird, and students tend to be somewhat puzzled by the underlying idea of Hamiltonian mechanics when they are first exposed to it. Momentum, we are initially taught in our physics courses, is just the mass times the velocity. So it seems like a derived quantity, not a fundamental one. How can Hamiltonian mechanics put momentum on an equal footing with position, if one is derived from the other?
The answer is that in the Hamiltonian approach, momentum is not defined to be mass times velocity. It ends up being equal to that by virtue of an equation of motion, at least if the Hamiltonian takes the right form. But in principle it’s an independent variable.
That’s a subtle distinction! Hamiltonian mechanics says that at any one moment a system is described by two quantities, its position and its momentum. No time derivatives or trajectories are involved; position and momentum are completely different things. Then there are two equations telling us how the position and the momentum change with time. The derivative of the position is the velocity, and one equation sets it equal to the momentum divided by the mass, just as in Newtonian mechanics. The other equation sets the derivative of the momentum equal to the force. Combining the two, we again find that force equals mass times acceleration (derivative of velocity).
So from the Hamiltonian perspective, positions and momenta are on a pretty equal footing. Why then, in the real world, do we seem to “live in position space”? Why don’t we live in momentum space?
As far as I know, no complete and rigorous answer to these questions has ever been given. But we do have some clues, and the basic principle is understood, even if some details remain to be ironed out.
That principle is this: we can divide the world into subsystems that interact with each other under appropriate circumstances. And those circumstances come down to “when they are nearby in space.” In other words, interactions are local in space. They are not local in momentum. Two billiard balls can bump into each other when they arrive at the same location, but nothing special happens when they have the same momentum or anything like that.
Ultimately this can be traced to the fact that the Hamiltonian of the real world is not some arbitrary function of positions and momenta; it’s a very specific kind of function. The ultimate expression of this kind of locality is field theory — space is suffused with fields, and what happens to a field at one point in space only directly depends on the other fields at precisely the same point in space, nowhere else. And that’s embedded in the Hamiltonian of the universe, in particular in the fact that the Hamiltonian can be written as an integral over three-dimensional space of a local function, called the “Hamiltonian density.”
$H = \int \mathcal{H}(\phi(x), \pi(x)) \, d^3x$

where φ is the field (which here acts as a “coordinate”) and π is its corresponding momentum.
This represents progress on the “Why is there space?” question. The answer is “Because space is the set of variables with respect to which interactions are local.” Which raises another question, of course: why are interactions local with respect to anything? Why do the fundamental degrees of freedom of nature arrange themselves into this kind of very specific structure, rather than some other one?
We have some guesses there, too. One of my favorite recent papers is “Locality From the Spectrum,” by Jordan Cotler, Geoffrey Penington, and Daniel Ranard. By “the spectrum” they mean the set of energy eigenvalues of a quantum Hamiltonian — i.e. the possible energies that states of definite energy can have in a theory. The game they play is to divide up the Hilbert space of quantum states into subsystems, and then ask whether a certain list of energies is compatible with “local” interactions between those subsystems. The answers are that most Hamiltonians aren’t compatible with locality in any sense, and for those where locality is possible, the division into local subsystems is essentially unique. So locality might just be a consequence of certain properties of the quantum Hamiltonian that governs the universe.
Fine, but why that Hamiltonian? Who knows? This is above our pay grade right now, though it’s fun to speculate. Meanwhile, let’s be thankful that the fundamental laws of physics allow us to describe our everyday world as a collection of stuff distributed through space. If they didn’t, how would we ever find our keys?
Scott’s Introduction: Happy Thanksgiving! Please join me in giving thanks for the beautiful post below, by friend-of-the-blog Greg Kuperberg, which tells a mathematical story that stretches from the 200s BC all the way to Google’s quantum supremacy result last month.
by Greg Kuperberg
Note: UC Davis is hiring in CS theory! Scott offered me free ad space if I wrote a guest post, so here we are. The position is in all areas of CS theory, including QC theory although the search is not limited to that.
In this post, I wear the hat of a pure mathematician in a box provided by Archimedes. I thus set aside what everyone else thinks is important about Google’s 53-qubit quantum supremacy experiment, that it is a dramatic milestone in quantum computing technology. That’s only news about the physical world, whose significance pales in comparison to the Platonic world of mathematical objects. In my happy world, I like quantum supremacy as a demonstration of a beautiful coincidence in mathematics that has been known for more than 2000 years in a special case. The single-qubit case was discovered by Archimedes, duh. As Scott mentions in Quantum Computing Since Democritus, Bill Wootters stated the general result in a 1990 paper, but Wootters credits a 1974 paper by the Czech physicist Stanislav Sýkora. I learned of it in the substantially more general context of symplectic geometry that mathematicians developed independently between Sýkora’s prescient paper and Wootters’ more widely known citation. Much as I would like to clobber you with highly abstract mathematics, I will wait for some other time.
Suppose that you pick a pure state \(|\psi\rangle\) in the Hilbert space \(\mathbb{C}^d\) of a \(d\)-dimensional qudit, and then make many copies and fully measure each one, so that you sample many times from some distribution \(\mu\) on the \(d\) outcomes. You can think of such a distribution \(\mu\) as a classical randomized state on a digit of size \(d\). The set of all randomized states on a \(d\)-digit makes a \((d-1)\)-dimensional simplex \(\Delta^{d-1}\) in the orthant \(\mathbb{R}_{\ge 0}^d\). The coincidence is that if \(|\psi\rangle\) is uniformly random in the unit sphere in \(\mathbb{C}^d\), then \(\mu\) is uniformly random in \(\Delta^{d-1}\). I will call it the Born map, since it expresses the Born rule of quantum mechanics that amplitudes yield probabilities. Here is a diagram of the Born map of a qutrit, except with the laughable simplification of the 5-sphere in \(\mathbb{C}^3\) drawn as a 2-sphere.
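The coincidence is easy to check by Monte Carlo. Here is a sketch of my own (sample size and seed chosen arbitrarily) for a qutrit: if the Born map sends a uniformly random unit vector in \(\mathbb{C}^3\) to a uniformly random point of \(\Delta^2\), then each coordinate \(p_i\) should have mean 1/3 and satisfy \(P(p_i \le t) = 1-(1-t)^2\), so in particular \(P(p_i \le 1/2) = 3/4\).

```python
import random

# Born map for a qutrit: uniformly random unit vector in C^3 -> squared
# magnitudes. If the coincidence holds, the result is uniform on the
# 2-simplex, so each coordinate has mean 1/3 and P(p <= 1/2) = 3/4.
def random_born_point(d=3):
    amps = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(d)]
    norm_sq = sum(abs(z) ** 2 for z in amps)  # normalize to a unit state
    return [abs(z) ** 2 / norm_sq for z in amps]

random.seed(0)
samples = [random_born_point()[0] for _ in range(200_000)]
mean = sum(samples) / len(samples)
frac = sum(1 for q in samples if q <= 0.5) / len(samples)
print(round(mean, 2), round(frac, 2))  # ≈ 0.33 0.75
```

(The complex-Gaussians-then-normalize recipe for a uniformly random unit vector is the standard one; it reappears later in the post.)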
If you pretend to be a bad probability student, then you might not be surprised by this coincidence, because you might suppose that all probability distributions are uniform other than treacherous exceptions to your intuition. However, the principle is certainly not true for a “rebit” (a qubit with real amplitudes) or for higher-dimensional “redits.” With real amplitudes, the probability density goes to infinity at the sides of the simplex \(\Delta^{d-1}\) and is even more favored at the corners. It also doesn’t work for mixed qudit states; the projection then favors the middle of \(\Delta^{d-1}\).
The theorem of Archimedes is that a natural projection from the unit sphere to a circumscribing vertical cylinder preserves area. The projection is the second one that you might think of: Project radially from a vertical axis rather than radially in all three directions. Since Archimedes was a remarkably prescient mathematician, he was looking ahead to the Bloch sphere of pure qubit states \(|\psi\rangle\langle\psi|\) written in density operator form. If you measure \(|\psi\rangle\langle\psi|\) in the computational basis, you get a randomized bit state \(\mu\) somewhere on the interval from guaranteed 0 to guaranteed 1.
This transformation from a quantum state to a classical randomized state is a linear projection to a vertical axis. It is the same as Archimedes’ projection, except without the angle information. It doesn’t preserve dimension, but it does preserve measure (area or length, whatever) up to a factor of \(2\pi\). In particular, it takes a uniformly random \(|\psi\rangle\langle\psi|\) to a uniformly random \(\mu\).
Archimedes’ projection is also known as the Lambert cylindrical map of the world. This is the map that squishes Greenland along with the top of North America and Eurasia to give them proportionate area.
(I forgive Lambert if he didn’t give prior credit to Archimedes. There was no Internet back then to easily find out who did what first.) Here is a calculus-based proof of Archimedes’ theorem: In spherical coordinates, imagine an annular strip on the sphere at a polar angle of \(\theta\). (The polar angle is the angle from vertical in spherical coordinates, as depicted in red in the Bloch sphere diagram.) The strip has a radius of \(\sin\theta\), which makes it shorter than its unit radius friend on the cylinder. But it’s also tilted from vertical by an angle of \(\frac{\pi}2-\theta\), which makes it wider by a factor of \(1/(\sin \theta)\) than the height of its projection onto the \(z\) axis. The two factors exactly cancel out, making the area of the strip exactly proportional to the length of its projection onto the \(z\) axis. This is a coincidence which is special to the 2-sphere in 3 dimensions. As a corollary, we get that the surface area of a unit sphere is \(4\pi\), the same as an open cylinder with radius 1 and height 2. If you want to step through this in even more detail, Scott mentioned to me an action video which is vastly spiffier than anything that I could ever make.
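Archimedes’ theorem also has an equivalent probabilistic form that is easy to test numerically (a sketch of my own, not from the post): the z-coordinate of a uniformly random point on the unit sphere is uniform on [−1, 1], precisely because the radial projection onto the circumscribing cylinder preserves area.

```python
import random

# Archimedes' theorem, Monte Carlo version: the z-coordinate of a
# uniformly random point on the unit 2-sphere is uniform on [-1, 1].
def random_sphere_z():
    x, y, z = (random.gauss(0, 1) for _ in range(3))  # isotropic direction
    return z / (x * x + y * y + z * z) ** 0.5

random.seed(1)
zs = [random_sphere_z() for _ in range(200_000)]
mean = sum(zs) / len(zs)                          # uniform on [-1,1] -> mean 0
frac = sum(1 for z in zs if z <= 0.5) / len(zs)   # uniform on [-1,1] -> 3/4
print(round(mean, 2), round(frac, 2))  # ≈ 0.0 0.75
```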
The projection of the Bloch sphere onto an interval also shows what goes wrong if you try this with a rebit. The pure rebit states — again expressed in density operator form \(|\psi\rangle\langle\psi|\) — are a great circle in the Bloch sphere. If you linearly project a circle onto an interval, then the length of the circle is clearly bunched up at the ends of the interval and the projected measure on the interval is not uniform.
It is a neat coincidence that the Born map of a qubit preserves measure, but a proof that relies on Archimedes’ theorem seems to depend on the special geometry of the Bloch sphere of a qubit. That the higher-dimensional Born map also preserves measure is downright eerie. Scott challenged me to write an intuitive explanation. I remembered two different (but similar) proofs, neither of which is original to me. Scott and I disagree as to which proof is nicer.
As a first step of the first proof, it is easy to show that the Born map \(p = |z|^2\) for a single amplitude \(z\) preserves measure as a function from the complex plane \(\mathbb{C}\) to the ray \(\mathbb{R}_{\ge 0}\). The region in the complex numbers \(\mathbb{C}\) where the length of \(z\) is between \(a\) and \(b\), i.e. \(a \le |z| \le b\), is an annulus with area \(\pi(b^2 - a^2)\). The corresponding interval for the probability is \(a^2 \le p \le b^2\), which thus has length \(b^2-a^2\). That’s all, we’ve proved it! More precisely, the area of any circularly symmetric region in \(\mathbb{C}\) is \(\pi\) times the length of its projection onto \(\mathbb{R}_{\ge 0}\).
The second step is to show the same thing for the Born map from the qudit Hilbert space \(\mathbb{C}^d\) to the \(d\)-digit orthant \(\mathbb{R}_{\ge 0}^d\), again without unit normalization. It’s also measure-preserving, up to a factor of \(\pi^d\) this time, because it’s the same thing in each coordinate separately. To be precise, the volume ratio holds for any region in \(\mathbb{C}^d\) that is invariant under separately rotating each of the \(d\) phases. (Because you can approximate any such region with a union of products of thin annuli.)
The third and final step is the paint principle for comparing surface areas. If you paint the hoods of two cars with the same thin layer of paint and you used the same volume of paint for each one, then you can conclude that the two car hoods have nearly the same area. In our case, the Born map takes the region \[ 1 \le |z_0|^2 + |z_1|^2 + \cdots + |z_{d-1}|^2 \le 1+\epsilon \] in \(\mathbb{C}^d\) to the region \[ 1 \le p_0 + p_1 + \cdots + p_{d-1} \le 1+\epsilon \] in the orthant \(\mathbb{R}_{\ge 0}^d\). The former is the unit sphere \(S^{2d-1}\) in \(\mathbb{C}^d\) painted to a thickness of roughly \(\epsilon/2\). The latter is the probability simplex \(\Delta^{d-1}\) painted to a thickness of exactly \(\epsilon\). Taking the limit \(\epsilon \to 0\), the Born map from \(S^{2d-1}\) to \(\Delta^{d-1}\) preserves measure up to a factor of \(2\pi^d\).
You might wonder “why” this argument works even if you accept that it does work. The key is that the exponent 2 appears in two different ways: as the dimension of the complex numbers, and as the exponent used to set probabilities and define spheres. If we try the same argument with real amplitudes, then the volume between “spheres” of radius \(a\) and \(b\) is just \(2(b-a)\), which does not match the length \(b^2-a^2\). The Born map for a single real amplitude is the parabola \(p = x^2\), which clearly distorts length since it is not linear. The higher-dimensional real Born map similarly distorts volumes, whether or not you restrict to unit-length states.
If you’re a bitter-ender who still wants Archimedes’ theorem for real amplitudes, then you might consider the variant formula \(p = |x|\) to obtain a probability \(p\) from a “quantum amplitude” \(x\). Then the “Born” map does preserve measure, but for the trivial reason that \(x = \pm p\) is not really a quantum amplitude, it is a probability with a vestigial sign. Also the unit “sphere” in \(\mathbb{R}^d\) is not really a sphere in this theory, it is a hyperoctahedron. The only “unitary” operators that preserve the unit hyperoctahedron are signed permutation matrices. You can only use them for reversible classical computing or symbolic dynamics; they don’t have the strength of true quantum computing or quantum mechanics.
The fact that the Born map preserves measure also yields a bonus calculation of the volume of the unit ball in \(2d\) real dimensions, if we interpret that as \(d\) complex dimensions. The ball \[ |z_0|^2 + |z_1|^2 + \cdots + |z_{d-1}|^2 \le 1 \] in \(\mathbb{C}^d\) is sent to a different simplex \[ p_0 + p_1 + \cdots + p_{d-1} \le 1 \] in \(\mathbb{R}_{\ge 0}^d\). If we recall that the volume of a \(d\)-dimensional pyramid is \(\frac1d\) times base times height and calculate by induction on \(d\), we get that this simplex has volume \(\frac1{d!}\). Thus, the volume of the \(2d\)-dimensional unit ball is \(\frac{\pi^d}{d!}\).
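The volume formula is simple to sanity-check by Monte Carlo. This is my own illustration (sample count chosen arbitrarily): for \(d = 2\) complex dimensions, the unit ball in \(\mathbb{R}^4\) should have volume \(\pi^2/2! = \pi^2/2 \approx 4.93\).

```python
import math, random

# Monte Carlo sanity check: the unit ball in R^4 (= C^2) should have
# volume pi^2/2! = pi^2/2. Sample points in the cube [-1,1]^4
# (volume 2^4 = 16) and count the fraction that land inside the ball.
random.seed(2)
n, inside = 400_000, 0
for _ in range(n):
    if sum(random.uniform(-1, 1) ** 2 for _ in range(4)) <= 1:
        inside += 1
volume = 16 * inside / n
print(volume, math.pi ** 2 / 2)  # both close to 4.93
```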
You might ask whether the volume of a \(d\)-dimensional unit ball is always \(\frac{\pi^{d/2}}{(d/2)!}\) for both \(d\) even and odd. The answer is yes if we interpret factorials using the gamma function formula \(x! = \Gamma(x+1)\) and look up that \(\frac12! = \Gamma(\frac32) = \frac{\sqrt{\pi}}2\). The gamma function was discovered by Euler as a solution to the question of defining fractional factorials, but the notation \(\Gamma(x)\) and the cumbersome shift by 1 is due to Legendre. Although Wikipedia says that no one knows why Legendre defined it this way, I wonder if his goal was to do what the Catholic church later did for itself in 1978: It put a Pole at the origin.
(Scott wanted to censor this joke. In response, I express my loyalty to my nation of birth by quoting the opening of the Polish national anthem: “Poland has not yet died, so long as we still live!” I thought at first that Stanislav Sýkora is Polish since Stanisław and Sikora are both common Polish names, but his name has Czech spelling and he is Czech. Well, the Czechs are cool too.)
Sýkora’s 1974 proof of the generalized Archimedes’ theorem is different from this one. He calculates multivariate moments of the space of unit states \(S^{2d-1} \subseteq \mathbb{C}^d\), and confirms that they match the moments in the probability simplex \(\Delta^{d-1}\). There are inevitably various proofs of this result, and I will give another one.
There is a well-known and very useful algorithm to generate a random point on the unit sphere in either \(\mathbb{R}^d\) or \(\mathbb{C}^d\), and a similar algorithm to generate a random point in a simplex. In the former algorithm, we make each real coordinate \(x\) into an independent Gaussian random variable with density proportional to \(e^{-x^2}\;dx\), and then rescale the result to unit length. Since the exponents combine as \[ e^{-x_0^2}e^{-x_1^2}\cdots e^{-x_{d-1}^2} = e^{-(x_0^2 + x_1^2 + \cdots + x_{d-1}^2)}, \] we learn that the total exponent is spherically symmetric. Therefore after rescaling, the result is a uniformly random point on the unit sphere \(S^{d-1} \subseteq \mathbb{R}^d\). Similarly, the other algorithm generates a point in the orthant \(\mathbb{R}_{\ge 0}^d\) by making each real coordinate \(p \ge 0\) an independent random variable with exponential distribution \(e^{-p}\;dp\). This time we rescale the vector until its sum is 1. This algorithm likewise produces a uniformly random point in the simplex \(\Delta^{d-1} \subseteq \mathbb{R}_{\ge 0}^d\) because the total exponent of the product \[ e^{-p_0}e^{-p_1}\cdots e^{-p_{d-1}} = e^{-(p_0 + p_1 + \cdots + p_{d-1})} \] only depends on the sum of the coordinates. Wootters describes both of these algorithms in his 1990 paper, but instead of relating them to give his own proof of the generalized Archimedes’ theorem, he cites Sýkora.
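Both algorithms are a few lines each. Here is a sketch in Python (standard library only; the function names are my own):

```python
import random

# (1) Uniform point on the unit sphere S^(d-1) in R^d:
#     i.i.d. Gaussians, rescaled to unit length.
def uniform_on_sphere(d):
    x = [random.gauss(0, 1) for _ in range(d)]
    r = sum(v * v for v in x) ** 0.5
    return [v / r for v in x]

# (2) Uniform point in the simplex Delta^(d-1) of randomized states:
#     i.i.d. exponentials, rescaled to unit sum.
def uniform_in_simplex(d):
    p = [random.expovariate(1.0) for _ in range(d)]
    s = sum(p)
    return [v / s for v in p]

random.seed(3)
point = uniform_on_sphere(5)
probs = uniform_in_simplex(5)
print(round(sum(v * v for v in point), 6))  # 1.0 (unit length)
print(round(sum(probs), 6))                 # 1.0 (sums to one)
print(all(v >= 0 for v in probs))           # True
```

Both recipes work for the same reason spelled out above: the joint density depends only on the norm (respectively, the sum), so rescaling lands uniformly on the sphere (respectively, the simplex).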
The gist of the proof is that the Born map takes the Gaussian algorithm to the exponential algorithm. Explicitly, the Gaussian probability density for a single complex amplitude \[ z = x+iy = re^{i\theta} \] can be converted from Cartesian to polar coordinates as follows: \[ \frac{e^{-|z|^2}\;dx\;dy}{\pi} = \frac{e^{-r^2}r\;dr\;d\theta}{\pi}. \] I have included the factor of \(r\) that is naturally present in an area integral in polar coordinates. We will need it, and it is another way to see that the theorem relies on the fact that the complex numbers are two-dimensional. To complete the proof, we substitute \(p = r^2\) and remember that \(dp = 2r\;dr\), and then integrate over \(\theta\) (trivially, since the integrand does not depend on \(\theta\)). The density simplifies to \(e^{-p}\;dp\), which is exactly the exponential distribution for a real variable \(p \ge 0\). Since the Born map takes the Gaussian algorithm to the exponential algorithm, and since each algorithm produces a uniformly random point, the Born map must preserve uniform measure. (Scott likes this proof better because it is algorithmic, and because it is probabilistic.)
Now about quantum supremacy. You might think that a randomly chosen quantum circuit on \(n\) qubits produces a nearly uniformly random quantum state \(|\psi\rangle\) in their joint Hilbert space, but it’s not quite that simple. When \(n=53\), or otherwise as \(n \to \infty\), a manageable random circuit is not nearly creative enough to either reach or approximate most of the unit states in the colossal Hilbert space of dimension \(d = 2^n\). The state \(|\psi\rangle\) that you get from (say) a polynomial-sized circuit resembles a fully random state in various statistical and computational respects, both proven and conjectured. As a result, if you measure the qubits in the computational basis, you get a randomized state on \(n\) bits that resembles a uniformly random point in \(\Delta^{2^n-1}\).
If you choose \(d\) probabilities, and if each one is an independent exponential random variable, then the law of large numbers says that the sum (which you use for rescaling) is close to \(d\) when \(d\) is large. When \(d\) is really big like \(2^{53}\), a histogram of the probabilities of the bit strings of a supremacy experiment looks like an exponential curve \(f(p) \propto e^{-pd}\). In a sense, the statistical distribution of the bit strings is almost the same almost every time, independent of which random quantum circuit you choose to generate them. The catch is that the position of any given bit string does depend on the circuit and is highly scrambled. I picture it in my mind like this:
It is thought to be computationally intractable to calculate where each bit string lands on this exponential curve, or even where just one of them does. (The exponential curve is attenuated by noise in the actual experiment, but it’s the same principle.) That is one reason that random quantum circuits are supreme.
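A toy version of that exponential histogram is easy to generate. This is my own sketch, with 10 qubits standing in for 53, and with a Haar-random state (built from normalized complex Gaussians) standing in for a random circuit’s output state: the rescaled probabilities \(u = dp\) should approximately follow the \(e^{-u}\) law, so for example about a \(1/e\) fraction of them should exceed 1.

```python
import math, random

# Haar-random state on 10 qubits (d = 1024) via normalized complex
# Gaussians; the rescaled outcome probabilities u = d*p should follow
# the exponential law f(u) = e^(-u).
random.seed(4)
d = 1024
amps = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(d)]
norm_sq = sum(abs(z) ** 2 for z in amps)
scaled = [d * abs(z) ** 2 / norm_sq for z in amps]

mean = sum(scaled) / d                       # exactly 1 by construction
tail = sum(1 for u in scaled if u > 1) / d   # should be near 1/e
print(round(mean, 6))
print(tail, 1 / math.e)  # tail should be within a few percent of 0.368
```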
In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.
The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

\[ \sum_{n \leq x} \lambda(n) = o(x) \ \ \ \ \ (1)\]

of the prime number theorem (where we take the limit \(x \rightarrow \infty\)). One can interpret this estimate probabilistically as

\[ {\mathbb E} \lambda(\mathbf{n}) = o(1) \ \ \ \ \ (2)\]
where \(\mathbf{n}\) is a random variable drawn uniformly from the natural numbers up to \(x\), and \({\mathbb E}\) denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

\[ \sum_{n \leq x} \lambda(n) \lambda(n+1). \]
After dividing by \(x\), this is essentially

\[ {\mathbb E} \lambda(\mathbf{n}) \lambda(\mathbf{n}+1). \]
With probabilistic intuition, one may expect the random variables \(\lambda(\mathbf{n})\) and \(\lambda(\mathbf{n}+1)\) to be approximately independent (there is no obvious relationship between the number of prime factors of \(\mathbf{n}\), and of \(\mathbf{n}+1\)), and so the above average would be expected to be approximately equal to

\[ {\mathbb E} \lambda(\mathbf{n}) \, {\mathbb E} \lambda(\mathbf{n}+1), \]
which by (2) is equal to \(o(1)\). Thus we are led to the prediction

\[ \sum_{n \leq x} \lambda(n) \lambda(n+1) = o(x). \ \ \ \ \ (3)\]
The asymptotic (3) is widely believed (it is a special case of the Chowla conjecture, which we will discuss in later notes); while there has been recent progress towards establishing it rigorously, it remains open for now.
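The prediction can be probed numerically. Here is a sketch of my own (standard library only, with \(x = 10^5\) chosen arbitrarily), taking the sums in question to be correlations of the Liouville function \(\lambda(n) = (-1)^{\Omega(n)}\), as the reference to the Chowla conjecture suggests: both the average of \(\lambda(n)\) and the average of \(\lambda(n)\lambda(n+1)\) up to \(x\) come out close to zero.

```python
# Numerical probe of the Chowla-type prediction for the Liouville
# function lambda(n) = (-1)^Omega(n), computed via a smallest-prime-factor
# sieve up to x = 10^5. (Illustration only.)
x = 100_000
spf = list(range(x + 2))                 # smallest prime factor of each n
for p in range(2, int((x + 1) ** 0.5) + 1):
    if spf[p] == p:                      # p is prime
        for m in range(p * p, x + 2, p):
            if spf[m] == m:
                spf[m] = p

lam = [0] * (x + 2)                      # lam[n] = lambda(n)
lam[1] = 1
for n in range(2, x + 2):
    lam[n] = -lam[n // spf[n]]           # removing one prime factor flips sign

avg_single = sum(lam[1:x + 1]) / x                               # ~ E lambda(n)
avg_pair = sum(lam[n] * lam[n + 1] for n in range(1, x + 1)) / x
print(abs(avg_single) < 0.05, abs(avg_pair) < 0.05)  # True True
```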
How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “\(L^2\)” measure of independence known as covariance, and the “\(L \log L\)” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of \(L^2\) type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of \(L \log L\) methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as \(L^2\) methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

\[ \sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n} = o(\log x) \ \ \ \ \ (4)\]
of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound , but the latter bound is much easier to establish than the former).
As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them on average, rather than uniformly. For instance, given two random variables and of number-theoretic origin (such as the random variables and mentioned previously), it can often be extremely difficult to determine the extent to which behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy-based tools, it is often possible to assert results of the following flavour: if are a large collection of “independent” random variables, and is a further random variable that is “not too large” in some sense, then must necessarily be nearly independent (or conditionally independent) of many of the , even if one cannot pinpoint precisely which of the the variable is independent with. In the case of the second moment method, this allows us to compute correlations such as for “most” . The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to make non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities for “most” thanks to tools such as the Pinsker inequality.
— 1. Second moment methods —
In this section we discuss probabilistic techniques of an “” nature. We fix a probability space to model all of our random variables; thus for instance we shall model a complex random variable in these notes by a measurable function . (Strictly speaking, there is a subtle distinction one can maintain between a random variable and its various measure-theoretic models, which becomes relevant if one later decides to modify the probability space , but this distinction will not be so important in these notes and so we shall ignore it. See this previous set of notes for more discussion.)
We will focus here on the space of complex random variables (that is to say, measurable maps ) whose second moment
of is finite. In many number-theoretic applications the finiteness of the second moment will be automatic because will only take finitely many values. As is well known, the space has the structure of a complex Hilbert space, with inner product
and norm
for . By slight abuse of notation, the complex numbers can be viewed as a subset of , by viewing any given complex number as a constant (deterministic) random variable. Then is a one-dimensional subspace of , spanned by the unit vector . Given a random variable to , the projection of to is then the mean
and we obtain an orthogonal splitting of any into its mean and its mean zero part . By Pythagoras’ theorem, we then have
The first quantity on the right-hand side is the square of the distance from to , and this non-negative quantity is known as the variance
The square root of the variance is known as the standard deviation. The variance controls the distribution of the random variable through Chebyshev’s inequality
A slight generalisation of Chebyshev’s inequality that can be convenient to use is
for any and any complex number (which typically will be a simplified approximation to the mean ), which is proven similarly to (6) but noting (from (5)) that .
Informally, (6) is an assertion that a square-integrable random variable will concentrate around its mean if its variance is not too large. See these previous notes for more discussion of the concentration of measure phenomenon. One can often obtain stronger concentration of measure than what is provided by Chebyshev’s inequality if one is able to calculate higher moments than the second moment, such as the fourth moment or exponential moments , but we will not pursue this direction in this set of notes.
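Since Chebyshev's inequality is the workhorse of this whole section, a quick numerical sanity check may be useful. The following Python sketch (all parameters here are my own illustrative choices, not from the notes) compares the empirical tail probability of a sum of coin flips against the bound $\mathrm{Var}(\mathbf{X})/\lambda^2$.

```python
import random

# Empirical check of Chebyshev's inequality P(|X - EX| >= lam) <= Var(X) / lam^2
# for the sum of 100 fair coin flips (mean ~ 50, variance ~ 25).  All parameters
# are illustrative choices, not taken from the notes.
random.seed(0)
samples = [sum(random.randint(0, 1) for _ in range(100)) for _ in range(10000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

lam = 10.0
empirical = sum(1 for x in samples if abs(x - mean) >= lam) / len(samples)
chebyshev_bound = var / lam ** 2

print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")
print(f"P(|X - EX| >= {lam}) ~ {empirical:.4f}, Chebyshev bound {chebyshev_bound:.4f}")
```

As expected, the empirical tail probability (here roughly the probability of deviating by two standard deviations) is well below the Chebyshev bound, which is far from sharp for such concentrated variables.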
Clearly the variance is homogeneous of order two, thus
for any and . In particular, the variance is not always additive: the claim fails in particular when is not almost surely zero. However, there is an important substitute for this formula. Given two random variables , the inner product of the corresponding mean zero parts is a complex number known as the covariance:
As are orthogonal to , it is not difficult to obtain the alternate formula
The covariance is then a positive semi-definite inner product on (it basically arises from the Hilbert space structure of the space of mean zero variables), and . From the Cauchy-Schwarz inequality we have
If have non-zero variance (that is, they are not almost surely constant), then the ratio
is then known as the correlation between and , and is a complex number of magnitude at most ; for real-valued that are not almost surely constant, the correlation is instead a real number between and . At one extreme, a correlation of magnitude occurs if and only if is a scalar multiple of . At the other extreme, a correlation of zero is an indication (though not a guarantee) of independence. Recall that two random variables are independent if one has
for all (Borel) measurable . In particular, setting , for and integrating using Fubini’s theorem, we conclude that
similarly with replaced by , and similarly for . In particular we have
and thus from (8) we see that independent random variables have zero covariance (and zero correlation, when they are not almost surely constant). On the other hand, the converse fails:
Exercise 1 Provide an example of two random variables which are not independent, but which have zero correlation or covariance with each other. (There are many ways to produce some examples. One comes from exploiting various systems of orthogonal functions, such as sines and cosines. Another comes from working with random variables taking only a small number of values, such as .)
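One standard small-valued construction (not necessarily the solution intended by the exercise) is the pair $\mathbf{X}$ uniform on $\{-1,0,1\}$ and $\mathbf{Y} = \mathbf{X}^2$; the following sketch verifies exactly, in rational arithmetic, that the covariance vanishes while independence fails.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2: zero covariance but clearly dependent.
outcomes = [-1, 0, 1]
p = Fraction(1, 3)  # uniform probability of each outcome

EX = sum(p * x for x in outcomes)
EY = sum(p * x * x for x in outcomes)
EXY = sum(p * x * (x * x) for x in outcomes)  # E[X * Y] with Y = X^2
cov = EXY - EX * EY
print("Cov(X, Y) =", cov)

# Dependence: Y = 0 exactly when X = 0, so the joint probability of
# {X = 0, Y = 0} is 1/3, not the product 1/3 * 1/3 demanded by independence.
p_joint = p
p_prod = p * p
print(p_joint, "!=", p_prod)
```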
for any finite collection of random variables . These identities combine well with Chebyshev-type inequalities such as (6), (7), and this leads to a very common instance of the second moment method in action. For instance, we can use it to understand the distribution of the number of prime factors of a random number that fall within a given set . Given any set of natural numbers, define the logarithmic size to be the quantity
Thus for instance Euler’s theorem asserts that the primes have infinite logarithmic size.
Lemma 2 (Turan-Kubilius inequality, special case) Let be an interval of length at least , and let be an integer drawn uniformly at random from this interval, thus
for all . Let be a finite collection of primes, all of which have size at most . Then the random variable has mean
and variance
In particular,
for any .
Proof: For any natural number , we have
We now write . From (11) we see that each indicator random variable , has mean and variance ; similarly, for any two distinct , we see from (11) and (8) that the indicators , have covariance
and the claim now follows from (10).
The exponents of in the error terms here are not optimal; but in practice, we apply this inequality when is much larger than any given power of , so factors such as will be negligible. Informally speaking, the above lemma asserts that a typical number in a large interval will have roughly prime factors in a given finite set of primes, as long as the logarithmic size is large.
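The lemma is easy to probe empirically. The sketch below (with an arbitrary small set of primes and a modest interval, chosen for speed rather than taken from the text) samples integers uniformly from an interval and compares the number of prime factors lying in $P$ with the logarithmic size $L(P) = \sum_{p \in P} 1/p$.

```python
import random

# Empirical version of the Turan-Kubilius phenomenon: for n drawn uniformly
# from a long interval, the number of prime factors of n lying in P has mean
# close to L(P) = sum_{p in P} 1/p and variance O(L(P)).  The primes and
# interval below are arbitrary illustrative choices.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
L = sum(1.0 / p for p in primes)

random.seed(1)
x = 10 ** 7
samples = [random.randint(x, 2 * x) for _ in range(20000)]
counts = [sum(1 for p in primes if n % p == 0) for n in samples]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)

print(f"L(P) = {L:.3f}, empirical mean = {mean:.3f}, empirical variance = {var:.3f}")
```

The empirical variance comes out somewhat below $L(P)$, consistent with the heuristic that the indicators of divisibility by distinct primes behave almost independently.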
If we apply the above lemma to for some large , and equal to the primes up to (say) , we have , and hence
Since , we recover the main result
of Section 5 of Notes 1 (indeed this is essentially the same argument as in that section, dressed up in probabilistic language). In particular, we recover the Hardy-Ramanujan law that a proportion of the natural numbers in have prime factors.
Exercise 3 (Turan-Kubilius inequality, general case) Let be an additive function (which means that whenever are coprime). Show that
where
(Hint: one may first want to work with the special case when vanishes whenever so that the second moment method can be profitably applied, and then figure out how to address the contributions of prime powers larger than .)
Exercise 4 (Turan-Kubilius inequality, logarithmic version) Let with , and let be a collection of primes of size less than with . Show that
Exercise 5 (Paley-Zygmund inequality) Let be non-negative with positive mean. Show that
This inequality can sometimes give slightly sharper results than the Chebyshev inequality when using the second moment method.
Now we give a useful lemma that quantifies a heuristic mentioned in the introduction, namely that if several random variables do not correlate with each other, then it is not possible for any further random variable to correlate with many of them simultaneously. We first state an abstract Hilbert space version.
Lemma 6 (Bessel type inequality, Hilbert space version) If are elements of a Hilbert space , and are positive reals, then
Proof: We use the duality method. Namely, we can write the left-hand side of (13) as
for some complex numbers with (just take to be normalised by the left-hand side of (14), or zero if that left-hand side vanishes). By Cauchy-Schwarz, it then suffices to establish the dual inequality
The left-hand side can be written as
Using the arithmetic mean-geometric mean inequality and symmetry, this may be bounded by
Since , the claim follows.
Corollary 7 (Bessel type inequality, probabilistic version) If , and are positive reals, then
Proof: By subtracting the mean from each of we may assume that these random variables have mean zero. The claim now follows from Lemma 6.
To get a feel for this inequality, suppose for sake of discussion that and all have unit variance and , but that the are pairwise uncorrelated. Then the right-hand side is equal to , and the left-hand side is the sum of squares of the correlations between and each of the . Any individual correlation is then still permitted to be as large as , but it is not possible for multiple correlations to be this large simultaneously. This is geometrically intuitive if one views the random variables as vectors in a Hilbert space (and correlation as a rough proxy for the angle between such vectors). This lemma also shares many commonalities with the large sieve inequality, discussed in this set of notes.
One basic number-theoretic application of this inequality is the following sampling inequality of Elliott, that lets one approximate a sum of an arithmetic function by its values on multiples of primes :
Exercise 8 (Elliott’s inequality) Let be an interval of length at least . Show that for any function , one has
(Hint: Apply Corollary 7 with , , and , where is the uniform variable from Lemma 2.) Conclude in particular that for every , one has
for all primes outside of a set of exceptional primes of logarithmic size .
Informally, the point of this inequality is that an arbitrary arithmetic function may exhibit correlation with the indicator function of the multiples of for some primes , but cannot exhibit significant correlation with all of these indicators simultaneously, because these indicators are not very correlated to each other. We note however that this inequality only gains a tiny bit over trivial bounds, because the set of primes up to only has logarithmic size by Mertens’ theorems; thus, any asymptotics that are obtained using this inequality will typically have error terms that only improve upon the trivial bound by factors such as .
Exercise 9 (Elliott’s inequality, logarithmic form) Let with . Show that for any function , one has
and thus for every , one has
for all primes outside of an exceptional set of primes of logarithmic size .
Exercise 10 Use Exercise 9 and a duality argument to provide an alternate proof of Exercise 4. (Hint: express the left-hand side of (12) as a correlation between and some suitably -normalised arithmetic function .)
As a quick application of Elliott’s inequality, let us establish a weak version of the prime number theorem:
Proposition 11 (Weak prime number theorem) For any we have
whenever are sufficiently large depending on .
This estimate is weaker than what one can obtain by existing methods, such as Exercise 56 of Notes 1. However in the next section we will refine this argument to recover the full prime number theorem.
Proof: Fix , and suppose that are sufficiently large. From Exercise 9 one has
for all primes outside of an exceptional set of logarithmic size . If we restrict attention to primes then one sees from the integral test that one can replace the sum by and only incur an additional error of . If we furthermore restrict to primes larger than , then the contribution of those that are divisible by is also . For not divisible by , one has . Putting all this together, we conclude that
for all primes outside of an exceptional set of logarithmic size . In particular, for large enough this statement is true for at least one such . The claim then follows.
As another application of Elliott’s inequality, we present a criterion for orthogonality between multiplicative functions and other sequences, first discovered by Katai (with related results also introduced earlier by Daboussi and Delange), and rediscovered by Bourgain, Sarnak, and Ziegler:
Proposition 12 (Daboussi-Delange-Katai-Bourgain-Sarnak-Ziegler criterion) Let be a multiplicative function with for all , and let be another bounded function. Suppose that one has
as for any two distinct primes . Then one has
as .
Proof: Suppose the claim fails, then there exists (which we can assume to be small) and arbitrarily large such that
By Exercise 8, this implies that
Fix a good prime in . From (15) we have
We can replace the range by with negligible error. We also have except when is a multiple of , but this latter case only contributes which is also negligible compared to the right-hand side. We conclude that
for every good prime. On the other hand, from Lemma 6 we have
where range over the good primes in . The left-hand side is then , and by hypothesis the right-hand side is for large enough. As and is small, this gives the desired contradiction.
Exercise 13 (Daboussi-Delange theorem) Let be irrational, and let be a multiplicative function with for all . Show that
as . If instead is rational, show that there exists a multiplicative function with for which the statement (16) fails. (Hint: use Dirichlet characters and Plancherel’s theorem for finite abelian groups.)
— 2. An elementary proof of the prime number theorem —
Define the Mertens function
As shown in Theorem 58 of Notes 1, the prime number theorem is equivalent to the bound
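The Mertens function $M(x) = \sum_{n \leq x} \mu(n)$ is easy to compute by sieving the Möbius function, and the smallness of $M(x)/x$ is already visible at modest heights; the following sketch is purely illustrative.

```python
def mobius_sieve(limit):
    """Compute the Mobius function mu(n) for 1 <= n <= limit by sieving."""
    mu = [1] * (limit + 1)
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        for m in range(p, limit + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] *= -1          # contributes one prime factor p
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0            # n divisible by p^2 is not squarefree
    return mu

# M(x) = sum_{n <= x} mu(n); the ratio M(x)/x is already quite small at x = 10^5.
x = 10 ** 5
mu = mobius_sieve(x)
M = sum(mu[1 : x + 1])
print(f"M({x}) = {M}, M({x})/{x} = {M / x:.5f}")
```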
Firstly we see that Elliott’s inequality gives the following weaker version of (17):
Lemma 14 (Oscillation for Mertens’ function) If and , then we have
for all primes outside of an exceptional set of primes of logarithmic size .
Proof: We may assume as the claim is trivial otherwise. From Exercise 8 applied to and , we have
for all outside of an exceptional set of primes of logarithmic size . Since for not divisible by , the right-hand side can be written as
Since outside of an exceptional set of logarithmic size , the claim follows.
Informally, this lemma asserts that for most primes , which morally implies that for most primes . If we can then locate suitable primes with , this should then lead to , which would then yield the prime number theorem . The manipulations below are intended to make this argument rigorous.
It will be convenient to work with a logarithmically averaged version of this claim.
Corollary 15 (Logarithmically averaged oscillation) If and is sufficiently large depending on , then
Proof: For each , we have from the previous lemma that
for all outside of an exceptional set of logarithmic size . We then have
so it suffices by Markov’s inequality to show that
But by Fubini’s theorem, the left-hand side may be bounded by
and the claim follows.
Let be sufficiently small, and let be sufficiently large depending on . Call a prime good if the bound (18) holds and bad otherwise, thus all primes outside of an exceptional set of bad primes of logarithmic size are good. Now we observe that we can make small as long as we can make two good primes multiply to be close to a third:
Proposition 16 Suppose there are good primes with . Then .
Proof: By definition of good prime, we have the bounds
We rescale (20) by to conclude that
We can replace the integration range here from to with an error of if is large enough. Also, since , we have . Thus we have
Combining this with (19), (21) and the triangle inequality (writing as a linear combination of , , and ) we conclude that
This is an averaged version of the claim we need. To remove the averaging, we use the identity (see equation (63) of Notes 1) to conclude that
From the triangle inequality one has
and hence by Mertens’ theorem
From the Brun-Titchmarsh inequality (Corollary 61 of Notes 1) we have
and so from the previous estimate and Fubini’s theorem one has
and hence by (22) (using trivial bounds to handle the region outside of )
Since
we conclude (for large enough) that
and the claim follows.
To finish the proof of the prime number theorem, it thus suffices to locate, for sufficiently large, three good primes with . If we already had the prime number theorem, or even the weaker form that every interval of the form contained primes for large enough, then this would be quite easy: pick a large natural number (depending on , but independent of ), so that the primes up to have logarithmic size (so that only of them are bad, as measured by logarithmic size), and let be random numbers and drawn uniformly from (say) . From the prime number theorem, for each , the interval contains primes. In particular, contains primes, but the expected number of bad primes in this interval is . Thus by Markov’s inequality there would be at least a chance (say) of having at least one good prime in ; similarly there would be a chance of having a good prime in , and a chance of having a good prime in . Thus (as an application of the probabilistic method), there exist (deterministic) good primes with the required properties.
Of course, using the prime number theorem here to prove the prime number theorem would be circular. However, we can still locate a good triple of primes using the Selberg symmetry formula
as , where is the second von Mangoldt function
see Proposition 60 of Notes 1. We can strip away the contribution of the primes:
Exercise 17 Show that
as .
In particular, on evaluating this at and subtracting, we have
whenever is sufficiently large depending on . In particular, for any such , one either has
(or both). Informally, the Selberg symmetry formula shows that the interval contains either a lot of primes, or a lot of semiprimes. The factor of is slightly annoying, so we now remove it. Consider the contribution of those primes to (25) with . This is bounded by
which we can bound crudely using the Chebyshev bound by
which by Mertens’ theorem is . Thus the contribution of this case can be safely removed from (25). Similarly for those cases when . For the remaining cases we bound . We conclude that for any sufficiently large , either (24) or
In order to find primes with close to , it would be very convenient if we could find a for which (24) and (26) both hold. We can’t quite do this directly, but due to the “connected” nature of the set of scales , we can do the next best thing:
Proposition 18 Suppose is sufficiently large depending on . Then there exists with such that
Proof: We know that every in obeys at least one of (27), (28). Our task is to produce an adjacent pair of , one of which obeys (27) and the other obeys (28). Suppose for contradiction that no such pair exists, then whenever fails to obey (27), then any adjacent must also fail to do so, and similarly for (28). Thus either (27) will fail to hold for all , or (28) will fail to hold for all such . If (27) fails for all , then on summing we have
which contradicts Mertens’ theorem if is large enough because the left-hand side is . Similarly, if (28) fails for all , then
and again Mertens’ theorem can be used to lower bound the left-hand side by (in fact one can even gain an additional factor of if one works things through carefully) and obtain a contradiction.
The above proposition does indeed provide a triple of primes with . If is sufficiently large depending on and less than (say) , so that , this would give us what we need as long as one of the triples consisted only of good primes. The only way this can fail is if either
for some , or if
for some . In the first case, we can sum to conclude that
and in the second case we have
Since the total set of bad primes up to has logarithmic size , we conclude from the pigeonhole principle (and the divergence of the harmonic series ) that for any depending only on , and any large enough, there exists such that neither (29) nor (30) holds. Indeed the set of obeying (29) has logarithmic size , and similarly for (30). Choosing a that avoids both of these scenarios, we then find a good and good with , so that , and then by Proposition 16 we conclude that for all sufficiently large . Sending to zero, we obtain the prime number theorem.
— 3. Entropy methods —
In the previous section we explored the consequences of the second moment method, which applies to square-integrable random variables taking values in the real or complex numbers. Now we explore entropy methods, which apply instead to random variables that take only finitely many values (equipped with the discrete sigma-algebra), but whose range need not be numerical in nature. (One could extend entropy methods to slightly larger classes of random variables, such as ones that attain a countable number of values, but for our applications finitely-valued random variables will suffice.)
The fundamental notion here is that of the Shannon entropy of a random variable. If takes values in a finite set , its Shannon entropy (or entropy for short) is defined by the formula
where ranges over all the possible values of , and we adopt the convention , so that values that are almost surely not attained by do not influence the entropy. We choose here to use the natural logarithm to normalise our entropy (in which case a unit of entropy is known as a “nat“); in the information theory literature it is also common to use the base two logarithm to measure entropy (in which case a unit of entropy is known as a “bit“, which is equal to nats). However, the precise choice of normalisation will not be important in our discussion.
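In code, the definition reads as follows; this minimal sketch uses natural logarithms, so entropy is measured in nats as in the text.

```python
from math import log

def shannon_entropy(dist):
    """H(X) = -sum_x P(X = x) log P(X = x), in nats, with the convention 0 log 0 = 0."""
    return -sum(p * log(p) for p in dist if p > 0)

# The uniform distribution on a 4-element set attains the maximal entropy
# log 4; any other distribution on the same set has strictly smaller entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(f"H(uniform) = {shannon_entropy(uniform):.4f} nats (log 4 ~ {log(4):.4f})")
print(f"H(skewed)  = {shannon_entropy(skewed):.4f} nats")
```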
It is clear that if two random variables have the same probability distribution, then they have the same entropy. Also, the precise choice of range set is not terribly important: if takes values in , and is an injection, then it is clear that and have the same entropy:
Informally, the entropy measures how “spread out” or “disordered” the distribution of is, behaving like a logarithm of the size of the “essential support” of such a variable; from an information-theoretic viewpoint, it measures the amount of “information” one learns when one is told the value of . Here are some basic properties of Shannon entropy that help support this intuition:
Exercise 19 (Basic properties of Shannon entropy) Let be a random variable taking values in a finite set .
- (i) Show that , with equality if and only if is almost surely deterministic (that is to say, it is almost surely equal to a constant ).
- (ii) Show that
with equality if and only if is uniformly distributed on . (Hint: use Jensen’s inequality and the convexity of the map on .)
- (iii) (Shannon-McMillan-Breiman theorem) Let be a natural number, and let be independent copies of . As , show that there is a subset of cardinality with the properties that
and
uniformly for all . (The proof of this theorem will require Stirling’s formula, which you may assume here as a black box; see also this previous blog post.) Informally, we thus see that a large tuple of independent samples of approximately behaves like a uniform distribution on values.
One can view Shannon entropy as a generalisation of the notion of cardinality of a finite set (or equivalently, cardinality of finite sets can be viewed as a special case of Shannon entropy); see this previous blog post for an elaboration of this point.
The concept of Shannon entropy becomes significantly more powerful when combined with that of conditioning. Recall that a random variable taking values in a range set can be modeled by a measurable map from a probability space to the range . If is an event in of positive probability, we can then condition to the event to form a new random variable on the conditioned probability space , where
is the restriction of the -algebra to ,
is the conditional probability measure on , and is the restriction of to . This random variable lives on a different probability space than itself, so it does not make sense to directly combine these variables (thus for instance one cannot form the sum even when both random variables are real or complex valued); however, one can still form the Shannon entropy of the conditioned random variable , which is given by the same formula
Given another random variable taking values in another finite set , we can then define the conditional Shannon entropy to be the expected entropy of the level sets , thus
with the convention that the summand here vanishes when . From the law of total probability we have
for any , and hence by Jensen’s inequality
for any ; summing we obtain the Shannon entropy inequality
This inequality (33) can be rewritten in several ways:
Exercise 20 Let , be random variables taking values in finite sets respectively.
- (i) Establish the chain rule
where is the joint random variable . In particular, (33) can be expressed as a subadditivity formula
- (ii) If is a function of , in the sense that for some (deterministic) function , show that .
- (iii) Define the mutual information by the formula
Establish the inequalities
with the first inequality holding with equality if and only if are independent, and the latter inequalities holding if and only if is a function of (or vice versa).
From the above exercise we see that the mutual information is a measure of dependence between and , much as correlation or covariance was in the previous sections. There is however one key difference: whereas a zero correlation or covariance is a consequence but not a guarantee of independence, zero mutual information is logically equivalent to independence, and is thus a stronger property. To put it another way, zero correlation or covariance allows one to calculate the average in terms of individual averages of , but zero mutual information is stronger because it allows one to calculate the more general averages in terms of individual averages of , for arbitrary functions taking values in the complex numbers. This increased power of the mutual information statistic will allow us to estimate various averages of interest in analytic number theory in ways that do not seem amenable to second moment methods.
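The chain-rule expression $I(\mathbf{X} : \mathbf{Y}) = H(\mathbf{X}) + H(\mathbf{Y}) - H(\mathbf{X},\mathbf{Y})$ makes the mutual information easy to compute for finitely-valued variables. The sketch below contrasts an independent pair with the classical zero-covariance-but-dependent pair ($\mathbf{X}$ uniform on $\{-1,0,1\}$, $\mathbf{Y} = \mathbf{X}^2$); the mutual information detects the dependence that the covariance misses.

```python
from math import log

def H(probs):
    """Shannon entropy in nats of a finite collection of probabilities."""
    return -sum(p * log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X : Y) = H(X) + H(Y) - H(X, Y), from a joint law {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return H(px.values()) + H(py.values()) - H(joint.values())

# An independent pair has zero mutual information...
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# ...while X uniform on {-1, 0, 1} and Y = X^2 (which has zero covariance) does not.
dep = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

print(f"I(indep) = {mutual_information(indep):.6f}")
print(f"I(dep)   = {mutual_information(dep):.6f}")
```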
The subadditivity formula can be conditioned to any event occurring with positive probability (replacing the random variables by their conditioned counterparts ), yielding the inequality
Applying this inequality to the level events of some auxiliary random variable taking values in another finite set , multiplying by , and summing, we conclude the inequality
In other words, the conditional mutual information
between and conditioning on is always non-negative:
One has conditional analogues of the above exercise:
Exercise 21 Let , , be random variables taking values in finite sets respectively.
- (i) Establish the conditional chain rule
In particular, (36) is equivalent to the inequality
- (ii) Show that equality holds in (36) if and only if are conditionally independent relative to , which means that
for any , , .
- (iii) Show that , with equality if and only if is almost surely a deterministic function of .
- (iv) Show the data processing inequality
for any functions , , and more generally that
- (v) If is an injective function, show that
However, if is not assumed to be injective, show by means of examples that there is no order relation between the left and right-hand side of (40) (in other words, show that either side may be greater than the other). Thus, increasing or decreasing the amount of information that is known may influence the mutual information between two remaining random variables in either direction.
- (vi) If is a function of , and also a function of (thus for some and ), and a further random variable is a function jointly of (thus for some ), establish the submodularity inequality
We now give a key motivating application of the Shannon entropy inequalities. Suppose one has a sequence of random variables, all taking values in a finite set , which are stationary in the sense that the tuples and have the same distribution for every . In particular we will have
and hence by (39)
If we write , we conclude from (34) that we have the concavity property
In particular we have for any , which on summing and telescoping series (noting that ) gives
and hence we have the entropy monotonicity
In particular, the limit exists. This quantity is known as the Kolmogorov-Sinai entropy of the stationary process ; it is an important statistic in the theory of dynamical systems, and roughly speaking measures the amount of entropy produced by this process as a function of a discrete time variable . We will not directly need the Kolmogorov-Sinai entropy in our notes, but a variant of the entropy monotonicity formula (41) will be important shortly.
In our application we will be dealing with processes that are only asymptotically stationary rather than stationary. To control this we recall the notion of the total variation distance between two random variables taking values in the same finite space , defined by
There is an essentially equivalent notion of this distance which is also often in use:
Exercise 22 If two random variables take values in the same finite space , establish the inequalities
and for any , establish the inequality
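With the normalisation $d_{TV}(\mathbf{X},\mathbf{Y}) = \frac{1}{2}\sum_a |\mathbb{P}(\mathbf{X}=a) - \mathbb{P}(\mathbf{Y}=a)|$ (one common convention; it may differ from the text's by a constant factor, as the exercise above suggests), the distance is immediate to compute:

```python
def tv_distance(p, q):
    """Total variation distance (1/2) sum_a |P(X = a) - P(Y = a)| between
    two distributions on a finite set, given as dicts {value: probability}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

uniform = {a: 0.25 for a in "abcd"}
skewed = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
print(tv_distance(uniform, skewed))  # half the l^1 distance between the two laws
```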
Shannon entropy is continuous in total variation distance as long as we keep the range finite. More quantitatively, we have
Lemma 23 If two random variables take values in the same finite space , then
with the convention that the error term vanishes when .
Proof: Set . The claim is trivial when (since then have the same distribution) and when (from (32)), so let us assume , and our task is to show that
If we write , , and , then
By dividing into the cases and we see that
since , it thus suffices to show that
But from Jensen’s inequality (32) one has
since , the claim follows.
In the converse direction, if a random variable has entropy close to the maximum , then one can control the total variation:
Lemma 24 (Special case of Pinsker inequality) If takes values in a finite set , and is a uniformly distributed random variable on , then
Of course, we have , so we may also write the above inequality as
The optimal value of the implied constant here is known to equal , but we will not use this sharp version of the inequality here.
Proof: If we write and , and , then we can rewrite the claimed inequality as
Observe that the function is concave, and in fact for all . From this and Taylor expansion with remainder we may write
for some between and . Since is independent of , and , we thus have on summing in
By Cauchy-Schwarz we then have
Since and , the claim follows.
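One can also probe Lemma 24 numerically. The sketch below checks the bound $d_{TV}(\mathbf{X},\mathbf{U})^2 \leq \frac{1}{2}(\log|A| - H(\mathbf{X}))$ on a few distributions; the constant $\frac{1}{2}$ is the sharp Pinsker constant, which I take as an assumption here rather than from the text.

```python
from math import log, sqrt

def entropy(p):
    """Shannon entropy in nats of a list of probabilities."""
    return -sum(x * log(x) for x in p if x > 0)

def tv_to_uniform(p):
    """Total variation distance from p to the uniform law on len(p) values."""
    n = len(p)
    return 0.5 * sum(abs(x - 1.0 / n) for x in p)

# Check d_TV(X, U)^2 <= (log|A| - H(X)) / 2 on some sample distributions.
checks = []
for p in ([0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.97, 0.01, 0.01, 0.01]):
    gap = log(len(p)) - entropy(p)
    d = tv_to_uniform(p)
    print(f"d_TV = {d:.3f}, sqrt(gap/2) = {sqrt(gap / 2):.3f}")
    checks.append(d * d <= gap / 2 + 1e-12)
```

In each case the entropy gap controls the square of the total variation distance, with near-equality only when the distribution is close to uniform.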
The above lemma does not hold when the comparison variable is not assumed to be uniform; in particular, two non-uniform random variables can have precisely the same entropy yet have different distributions, so that their total variation distance is positive. There is a more general variant, known as the Pinsker inequality, which we will not use in these notes:
Exercise 25 (Pinsker inequality) If $X, Y$ take values in a finite set $A$, define the Kullback-Leibler divergence of $X$ relative to $Y$ by the formula
$D_{KL}(X \| Y) := \sum_{a \in A} \mathbf{P}(X = a) \log \frac{\mathbf{P}(X = a)}{\mathbf{P}(Y = a)}$
(with the convention that the summand vanishes when $\mathbf{P}(X = a)$ vanishes).
- (i) Establish the Gibbs inequality $D_{KL}(X \| Y) \geq 0$.
- (ii) Establish the Pinsker inequality
$d_{TV}(X, Y) \ll D_{KL}(X \| Y)^{1/2}.$
In particular, $D_{KL}(X \| Y)$ vanishes if and only if $X, Y$ have identical distributions. Show that this implies Lemma 24 as a special case.
- (iii) Give an example to show that the Kullback-Leibler divergence need not be symmetric, thus there exist $X, Y$ such that $D_{KL}(X \| Y) \neq D_{KL}(Y \| X)$.
- (iv) If $X_1, \dots, X_n$ are random variables taking values in finite sets $A_1, \dots, A_n$, and $Y_1, \dots, Y_n$ are independent random variables taking values in $A_1, \dots, A_n$ respectively with each $Y_i$ having the same distribution as $X_i$, show that
$D_{KL}\big( (X_1, \dots, X_n) \,\big\|\, (Y_1, \dots, Y_n) \big) = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n).$
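The divergence in this exercise is straightforward to experiment with. The sketch below verifies the Gibbs inequality and a Pinsker-type bound on random data, and exhibits the asymmetry of (iii) on a concrete pair of distributions:

```python
import numpy as np

rng = np.random.default_rng(3)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; the summand is
    taken to vanish wherever p vanishes."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

n = 5
p = rng.random(n); p /= p.sum()
q = rng.random(n); q /= q.sum()

# (i) Gibbs inequality: D(p || q) >= 0, with equality when p = q.
assert kl(p, q) >= 0 and kl(p, p) < 1e-12

# (ii) Pinsker: d_TV(p, q)^2 <= D(p || q) / 2 (sup-over-events TV).
d = np.maximum(p - q, 0.0).sum()
assert d**2 <= kl(p, q) / 2 + 1e-12

# (iii) Asymmetry on a concrete pair of distributions.
a = np.array([0.9, 0.1])
b = np.array([0.5, 0.5])
assert abs(kl(a, b) - kl(b, a)) > 0.1
```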
In our applications we will need a relative version of Lemma 24:
Corollary 26 (Relative Pinsker inequality) If $X$ takes values in a finite set $A$, $Z$ takes values in a finite set $B$, and $U$ is a uniformly distributed random variable on $A$ that is independent of $Z$, then
$\sum_{b \in B} \mathbf{P}(Z = b)\, d_{TV}\big( (X | Z = b),\, U \big) \ll \big( \log |A| - H(X | Z) \big)^{1/2}.$
Proof: From direct calculation we have the identity
$H(X | Z) = \sum_{b \in B} \mathbf{P}(Z = b)\, H(X | Z = b).$
As $U$ is independent of $Z$, $U$ remains uniformly distributed on $A$ after conditioning on any event $Z = b$. From Lemma 24 we conclude
$d_{TV}\big( (X | Z = b),\, U \big) \ll \big( \log |A| - H(X | Z = b) \big)^{1/2}.$
Inserting this bound and using the Cauchy-Schwarz inequality, we obtain the claim.
Now we are ready to apply the above machinery to give a key inequality that is analogous to Elliott’s inequality. Inequalities of this type first appeared in one of my papers, introducing what I called the “entropy decrement argument”; the following arrangement of the inequality and proof is due to Redmond McNamara (personal communication).
Theorem 27 (Entropy decrement inequality) Let $\mathbf{n}$ be a random variable taking values in a finite set of integers, which obeys the approximate stationarity condition (43) for some $\delta > 0$. Let $p_1 < \dots < p_k$ be a collection of distinct primes less than some threshold $x$, and let $m_1, \dots, m_k$ be natural numbers that are also bounded by $x$. Let $f$ be a function taking values in a finite set $\Sigma$. For each $1 \leq j \leq k$, let $Y_j$ denote the $\Sigma^{m_j}$-valued random variable
$Y_j := \big( f(\mathbf{n} + i p_j) \big)_{i=1}^{m_j},$
and let $Y$ denote the $\Sigma^{m_1 + \dots + m_k}$-valued random variable $Y := (Y_1, \dots, Y_k)$. Also, let $\mathbf{j}$ be a random variable drawn uniformly from $\{1, \dots, k\}$, independently of $\mathbf{n}$. Then
The factor arising from an invocation of the Chinese remainder theorem in the proof unfortunately restricts the usefulness of this theorem to the regime in which all the primes involved are of “sub-logarithmic size”, but once one is in that regime, the second term on the right-hand side of (45) tends to be negligible in practice. Informally, this theorem asserts that for most small primes, the two random variables involved behave as if they are independent of each other.
Proof: We can assume , as the claim is trivial for (the all have zero entropy). For , we introduce the -valued random variable
The idea is to exploit some monotonicity properties of the quantity , in analogy with (41). By telescoping series we have
where we extend (44) to the case. From (38) we have
Now we lower bound the summand on the right-hand side. From multiple applications of the conditional chain rule (37) we have
We now use the approximate stationarity of to derive an approximate monotonicity property for . If , then from (39) we have
Write and
Note that is a deterministic function of and vice versa. Thus we can replace by in the above formula, and conclude that
The tuple takes values in a set of cardinality $e^{O(x)}$ thanks to the Chebyshev bounds. Hence by two applications of Lemma 23, (43) we have
The first term on the right-hand side is . Worsening the error term slightly, we conclude that
and hence
for any . In particular
which by (47), (48) rearranges to
From (46) we conclude that
Meanwhile, from Corollary 26, (39), (38) we have
The probability distribution of this random variable is a function on $\prod_{j=1}^{k} \mathbf{Z}/p_j\mathbf{Z}$, which by the Chinese remainder theorem we can identify with a cyclic group $\mathbf{Z}/Q\mathbf{Z}$ where $Q := p_1 \cdots p_k$. From (43) we see that the value of this distribution at adjacent values of this cyclic group varies by $O(\delta)$, hence the total variation distance between this random variable and the uniform distribution on $\mathbf{Z}/Q\mathbf{Z}$ is $O(Q^2 \delta)$, which is $O(e^{O(x)} \delta)$ by the Chebyshev bounds. By Lemma 23 we then have
and thus
The claim follows.
We now compare this result to Elliott’s inequality. If one tries to address precisely the same question that Elliott’s inequality does – namely, to try to compare a sum with sampled subsums – then the results are quantitatively much weaker:
Corollary 28 (Weak Elliott inequality) Let $I$ be an interval of length at least $x$. Let $f$ be a function with $|f(n)| \leq 1$ for all $n \in I$, and let $\varepsilon > 0$. Then one has
for all primes outside of an exceptional set of primes of logarithmic size .
Comparing this with Exercise 8 we see that we cover a much smaller range of primes; also the size of the exceptional set is slightly worse. This version of Elliott's inequality is however still strong enough to recover a proof of the prime number theorem as in the previous section.
Proof: We can assume that $\varepsilon$ is small, as the claim is trivial for $\varepsilon$ comparable to $1$. We can also assume that
By rounding the real and imaginary parts of $f$ to the nearest multiple of $\varepsilon$, we may assume that $f$ takes values in a finite set $\Sigma$ of complex numbers of size $O(1)$ with cardinality $O(\varepsilon^{-2})$. Let $\mathbf{n}$ be drawn uniformly at random from $I$. Then (43) holds, and from Theorem 27 with appropriate choices of the parameters (which make the second term of the right-hand side of (45) negligible) we have
where $p_1 < p_2 < \dots < p_k$ are the primes up to the threshold in question, arranged in increasing order. By Markov's inequality, we thus have
for outside of a set of primes of logarithmic size .
Let be as above. Now let be the function
that is to say picks out the unique component of the tuple in which is divisible by . This function is bounded by , and then by (42) we have
The left-hand side is equal to
which on switching the summations and using the large nature of $|I|$ can be rewritten as
Meanwhile, the right-hand side is equal to
which again by switching the summations becomes
The claim follows.
In the above argument we applied (42) with a very specific choice of function $F$. The power of Theorem 27 lies in the ability to select many other such functions $F$, leading to estimates that do not seem to be obtainable purely from the second moment method. In particular we have the following generalisation of the previous estimate:
Proposition 29 (Weak Elliott inequality for multiple correlations) Let $I$ be an interval of length at least $x$. Let $f$ be a function with $|f(n)| \leq 1$ for all $n \in I$, and let $\varepsilon > 0$. Let $h_1, \dots, h_l$ be integers. Then one has
for all primes outside of an exceptional set of primes of logarithmic size .
Proof: We allow all implied constants to depend on $l$. As before we can assume that $\varepsilon$ is sufficiently small (depending on $l$), that $f$ takes values in a finite set $\Sigma$ of bounded complex numbers of cardinality $O(\varepsilon^{-2})$, and that $|I|$ is large in the sense of (49), and restrict attention to primes up to the threshold appearing there. By shifting the $h_i$ and using the large nature of $|I|$ we can assume that the $h_i$ are all non-negative, taking values in $\{0, \dots, H\}$ for some $H = O(1)$. We now apply Theorem 27 with suitable parameters and conclude as before that
for outside of a set of primes of logarithmic size .
Let be as above. Let be the function
This function is still bounded by $1$, so by (42) as before we have
The left-hand side is equal to
which on switching the summations and using the large nature of $|I|$ can be rewritten as
Meanwhile, the right-hand side is equal to
which again by switching the summations becomes
The claim follows.
There is a logarithmically averaged version of the above proposition:
Exercise 30 (Weak Elliott inequality for logarithmically averaged multiple correlations) Let $x$ be sufficiently large, let $f$ be a function bounded in magnitude by $1$, let $\varepsilon > 0$, and let $h_1, \dots, h_l$ be integers. Show that
for all primes outside of an exceptional set of primes of logarithmic size .
When one specialises to multiplicative functions, this lets us dilate shifts in multiple correlations by primes:
Exercise 31 Let $x$ be sufficiently large, let $f$ be a multiplicative function bounded in magnitude by $1$, let $\varepsilon > 0$, and let $h_1, \dots, h_l$ be nonnegative integers. Show that
for all primes outside of an exceptional set of primes of logarithmic size .
For instance, setting $f$ to be the Möbius function $\mu$ and making suitable choices of the remaining parameters, we see that
for all primes outside of an exceptional set of primes of logarithmic size . In particular, for large enough, one can obtain bounds of the form
for various moderately large sets of primes . It turns out that these double sums on the right-hand side can be estimated by methods which we will cover in later series of notes. Among other things, this allows us to establish estimates such as
as , which to date have only been established using these entropy methods (in conjunction with the methods discussed in later notes). This is progress towards an open problem in analytic number theory known as Chowla’s conjecture, which we will also discuss in later notes.
There have been dramatic articles in the news media suggesting that a Nobel Prize has essentially already been awarded for the amazing discovery of a “fifth force.” I thought I’d better throw some cold water on that fire; it’s fine for it to smoulder, but we shouldn’t let it overheat.
There could certainly be as-yet unknown forces waiting to be discovered — dozens of them, perhaps. So far, there are four well-studied forces: gravity, electricity/magnetism, the strong nuclear force, and the weak nuclear force. Moreover, scientists are already fully confident there is a fifth force, predicted but not yet measured, that is generated by the Higgs field. So the current story would really be about a sixth force.
Roughly speaking, any new force comes with at least one new particle. That's because, in quantum physics, forces are carried by particles: the electromagnetic force is carried by photons, and the strong nuclear force by gluons. A previously unknown force would likewise come with a previously unknown particle to carry it.
The current excitement, such as it is, arises because someone claims to have evidence for a new particle, whose properties would imply a previously unknown force exists in nature. The force itself has not been looked for, much less discovered.
The new particle, if it really exists, would have a rest mass about 34 times larger than that of an electron — about 1/50th of a proton’s rest mass. In technical terms that means its E=mc² energy is about 17 million electron volts (MeV), and that’s why physicists are referring to it as the X17. But the question is whether the two experiments that find evidence for it are correct.
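The quoted figures follow from back-of-envelope arithmetic with standard particle masses (the numerical inputs below are standard textbook values, not numbers taken from this post):

```python
# Rest energies in MeV (standard values, used here as rough inputs).
m_electron = 0.511   # electron
m_proton = 938.3     # proton
m_x17 = 17.0         # claimed X17

ratio_electron = m_x17 / m_electron   # about 33, i.e. "about 34 times"
ratio_proton = m_proton / m_x17       # about 55, i.e. roughly "1/50th"
print(ratio_electron, ratio_proton)
```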
In the first experiment, whose results appeared in 2015, an experimental team mainly based in Debrecen, Hungary, studied large numbers of nuclei of beryllium-8 atoms, which had been raised to an “excited state” (that is, with more energy than usual). An excited nucleus inevitably disintegrates, and the experimenters studied the debris. On rare occasions they observed electrons and positrons [a.k.a. anti-electrons], and these behaved in a surprising way, as though they were produced in the decay of a previously unknown particle.
In the newly reported experiment, whose results just appeared, the same team observed the disintegration of excited nuclei of helium. They again found evidence for what they hope is the X17, and therefore claim confirmation of their original experiments on beryllium.
When two qualitatively different experiments claim the same thing, they are less likely to be wrong, because it’s not likely that any mistakes in the two experiments would create fake evidence of the same type. On the face of it, it does seem unlikely that both measurements, carried out on two different nuclei, could fake an X17 particle.
However, we should remain cautious, because both experiments were carried out by the same scientists. They, of course, are hoping for their Nobel Prize (which, if their experiments are correct, they will surely win) and it’s possible they could suffer from unconscious bias. It’s very common for individual scientists to see what they want to see; scientists are human, and hidden biases can lead even the best scientists astray. Only collectively, through the process of checking, reproducing, and using each other’s work, do scientists create trustworthy knowledge.
So it is prudent to await efforts by other groups of experimenters to search for this proposed X17 particle. If the X17 is observed by other experiments, then we’ll become confident that it’s real. But we probably won’t know until then. I don’t currently know whether the wait will be months or a few years.
Why am I so skeptical? There are two distinct reasons.
First, there’s a conceptual, mathematical issue. It’s not easy to construct reasonable equations that allow the X17 to co-exist with all of the known types of elementary particles. That it has a smaller mass than a proton is not a problem per se. But the X17 needs to have some unique and odd properties in order to (1) be seen in these experiments, yet (2) not be seen in certain other previous experiments, some of which were explicitly looking for something similar. To make equations that are consistent with these properties requires some complicated and not entirely plausible trickery. Is it impossible? No. But a number of the methods that scientists suggested were flawed, and the ones that remain are, to my eye, a bit contrived.
Of course, physics is an experimental science, and what theorists like me think doesn’t, in the end, matter. If the experiments are confirmed, theorists will accept the facts and try to understand why something that seems so strange might be true. But we’ve learned an enormous amount from mathematical thinking about nature in the last century — for instance, it was math that told us that the Higgs particle couldn’t be heavier than 1000 protons, and it was on the basis of that ‘advice’ that the Large Hadron Collider was built to look for it (and it found it, in 2012). Similar math led to the discoveries of the W and Z particles roughly where they were expected. So when the math tells you the X17 story doesn’t look good, it’s not reason enough for giving up, but it is reason for some pessimism.
Second, there are many cautionary tales in experimental physics. For instance, back in 2003 there were claims of evidence of a particle called a pentaquark with a rest mass about 1.6 times a proton’s mass — an exotic particle, made from quarks and gluons, that’s both like and unlike a proton. Its existence was confirmed by multiple experimental groups! Others, however, didn’t see it. It took several years for the community to come to the conclusion that this pentaquark, which looked quite promising initially, did not in fact exist.
The point is that mistakes do get made in particle hunts, sometimes in multiple experiments, and it can take some time to track them down. It’s far too early to talk about Nobel Prizes.
[Note that the Higgs boson’s discovery was accepted more quickly than most. It was discovered simultaneously by two distinct experiments using two methods each, and confirmed by additional methods and in larger data sets soon thereafter. Furthermore, there were already straightforward equations that happily accommodated it, so it was much more plausible than the X17.]
And just for fun, here’s a third reason I’m skeptical. It has to do with the number 17. I mean, come on, guys, seriously — 17 million electron volts? This just isn’t auspicious. Back when I was a student, in the late 1980s and early 90s, there was a set of experiments, by a well-regarded experimentalist, which showed considerable evidence for an additional neutrino with a E=mc² energy of 17 thousand electron volts. Other experiments tried to find it, but couldn’t. Yet no one could find a mistake in the experimenter’s apparatus or technique, and he had good arguments that the competing experiments had their own problems. Well, after several years, the original experimenter discovered that there was a piece of his equipment which unexpectedly could absorb about 17 keV of energy, faking a neutrino signal. It was a very subtle problem, and most people didn’t fault him since no one else had thought of it either. But that was the end of the 17 keV neutrino, and with it went hundreds of research papers by both experimental and theoretical physicists, along with one scientist’s dreams of a place in history.
In short, history is cruel to most scientists who claim important discoveries, and teaches us to be skeptical and patient. If there is a fifth (or rather, sixth) force, we’ll know within a few years. Don’t expect to be sure anytime soon. The knowledge cycle in science runs much, much slower than the twittery news cycle, and that’s no accident; if you want to avoid serious errors that could confuse you for a long time to come, don’t rush to judgment.
A year ago, the “I’m a little teapot” song kept playing in my head.
I was finishing a collaboration with David Limmer, a theoretical chemist at the University of California Berkeley. David studies quantum and classical systems far from equilibrium, including how these systems exchange energy and information with their environments. Example systems include photoisomers.
A photoisomer is a molecular switch. These switches appear across nature and technologies. We have photoisomers in our eyes, and experimentalists have used photoisomers to boost solar-fuel storage. A photoisomer has two functional groups, or collections of bonded atoms, attached to a central axis.
Your average-Joe photoisomer spends much of its life in equilibrium, exchanging heat with room-temperature surroundings. The molecule has the shape above, called the cis configuration. Imagine shining a laser or sunlight on the photoisomer. The molecule can absorb a photon, or particle of light, gaining energy. The energized switch has the opportunity to switch: One chemical group can rotate downward. The molecule will occupy its trans configuration.
The molecule now has more energy than it had while in equilibrium, albeit less energy than it had right after absorbing the photon. The molecule can remain in this condition for a decent amount of time. (Experts: The molecule occupies a metastable state.) That is, the molecule can store sunlight. For that reason, experimentalists at Harvard and MIT attached photoisomers to graphene nanotubules, improving the nanotubules’ storage of solar fuel.
With what probability does a photoisomer switch upon absorbing a photon? This question has resisted easy answering, because photoisomers prove difficult to model: They’re small, quantum, and far from equilibrium. People have progressed by making assumptions, but such assumptions can lack justifications or violate physical principles. David wanted to derive a simple, general bound—of the sort in which thermodynamicists specialize—on a photoisomer’s switching probability.
He had a hunch as to how he could derive such a bound. I’ve blogged, many times, about thermodynamic resource theories. Thermodynamic resource theories are simple models, developed in quantum information theory, for exchanges of heat, particles, information, and more. These models involve few assumptions: the conservation of energy, quantum theory, and, to some extent, the existence of a large environment (Markovianity). With such a model, David suspected, he might derive his bound.
I knew nothing about photoisomers when I met David, but I knew about thermodynamic resource theories. I’d contributed to their development, to the theorems that have piled up in the resource-theory corner of quantum information theory. Then, the corner had given me claustrophobia. Those theorems felt so formal, abstract, and idealized. Formal, abstract theory has drawn me ever since I started studying physics in college. But did resource theories model physical reality? Could they impact science beyond our corner of quantum information theory? Did resource theories matter?
I called for connecting thermodynamic resource theories to physical reality four years ago, in a paper that begins with an embarrassing story about me. Resource theorists began designing experiments whose results should agree with our theorems. Theorists also tried to improve the accuracy with which resource theories model experimentalists’ limitations. See David’s and my paper for a list of these achievements. They delighted me, as a step toward the broadening of resource theories’ usefulness.
Like any first step, this step pointed toward opportunities. Experiments designed to test our theorems essentially test quantum mechanics. Scientists have tested quantum mechanics for decades; we needn’t test it much more. Such experimental proposals can push experimentalists to hone their abilities, but I hoped that the community could accomplish more. We should be able to apply resource theories to answer questions cultivated in other fields, such as condensed matter and chemistry. We should be useful to scientists outside our corner of quantum information.
David’s idea lit me up like photons on a solar-fuel-storage device. He taught me about photoisomers, I taught him about resource theories, and we derived his bound. Our proof relies on the “second laws of thermodynamics.” These abstract resource-theory results generalize the second law of thermodynamics, which helps us understand why time flows in only one direction. We checked our bound against numerical simulations (experts: of Lindbladian evolution). Our bound is fairly tight if the photoisomer has a low probability of absorbing a photon, as in the Harvard-MIT experiment.
Experts: We also quantified the photoisomer’s coherences relative to the energy eigenbasis. Coherences can’t boost the switching probability, we concluded. But, en route to this conclusion, we found that the molecule is a natural realization of a quantum clock. Our quantum-clock modeling extends to general dissipative Landau-Zener transitions, prevalent across condensed matter and chemistry.
As I worked on our paper one day, a jingle unfolded in my head. I recognized the tune first: “I’m a little teapot.” I hadn’t sung that much since kindergarten, I realized. Lyrics suggested themselves:
I’m a little isomer
with two hands.
Here is my cis pose;
here is my trans.
Stand me in the sunlight;
watch me spin.
I’ll keep solar
energy in!
The song lodged itself in my head for weeks. But if you have to pay an earworm to collaborate with David, do.
Earlier this year, I made a list of topics I wanted to understand. The most abstract and technical of them was something called “Wilsonian effective field theory”. I still don’t understand Wilsonian effective field theory. But while thinking about it, I noticed something that seemed weird. It’s something I think many physicists already understand, but that hasn’t really sunk in with the public yet.
There’s an old problem in particle physics, described in many different ways over the years. Take our theories and try to calculate some reasonable number (say, the angle an electron turns in a magnetic field), and instead of that reasonable number we get infinity. We fix this problem with a process called renormalization that hides that infinity away, changing the “normalization” of some constant like a mass or a charge. While renormalization first seemed like a shady trick, physicists eventually understood it better. First, we thought of it as a way to work around our ignorance, that the true final theory would have no infinities at all. Later, physicists instead thought about renormalization in terms of scaling.
Imagine looking at the world on a camera screen. You can zoom in, or zoom out. The further you zoom out, the more details you’ll miss: they’re just too small to be visible on your screen. You can guess what they might be, but your picture will be different depending on how you zoom.
In particle physics, many of our theories are like that camera. They come with a choice of “zoom setting”, a minimum scale where they still effectively tell the whole story. We call theories like these effective field theories. Some physicists argue that these are all we can ever have: since our experiments are never perfect, there will always be a scale so small we have no evidence about it.
In general, theories can be quite different at different scales. Some theories, though, are especially nice: they look almost the same as we zoom in to smaller scales. The only things that change are the mass of different particles, and their charges.
One theory like this is Quantum Chromodynamics (or QCD), the theory of quarks and gluons. Zoom in, and the theory looks pretty much the same, with one crucial change: the force between particles gets weaker. There’s a number, called the “coupling constant”, that describes how strong a force is; think of it as sort of like an electric charge. As you zoom in to quarks and gluons, you find you can still describe them with QCD, just with a smaller coupling constant. If you could zoom “all the way in”, the constant (and thus the force between particles) would be zero.
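The shrinking of the coupling as one zooms in can be illustrated with the standard one-loop running formula for the QCD coupling. This is a textbook formula, not something derived in this post, and the reference value and flavour number below are rough assumptions:

```python
import math

def alpha_s(Q, alpha_mz=0.118, mz=91.19, nf=5):
    """One-loop running of the QCD coupling constant.
    Q and mz are energies in GeV; alpha_s(M_Z) = 0.118 and a fixed
    flavour number nf = 5 are rough illustrative inputs."""
    b0 = (33 - 2 * nf) / (12 * math.pi)
    return alpha_mz / (1 + b0 * alpha_mz * math.log(Q**2 / mz**2))

# "Zooming in" corresponds to probing higher energies Q: the coupling,
# and hence the force between quarks and gluons, shrinks.
for Q in [10.0, 91.19, 1000.0, 1e6]:
    print(Q, alpha_s(Q))
```

As $Q \to \infty$ the logarithm in the denominator grows without bound, so the coupling drifts toward zero, which is the "zoomed all the way in" limit described above.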
This makes QCD a rare kind of theory: one that could be complete to any scale. No matter how far you zoom in, QCD still “makes sense”. It never gives contradictions or nonsense results. That doesn’t mean it’s actually true: it interacts with other forces, like gravity, that don’t have complete theories, so it probably isn’t complete either. But if we didn’t have gravity or electricity or magnetism, if all we had were quarks and gluons, then QCD could have been the final theory that described them.
And this starts feeling a little weird, when you think about reductionism.
Philosophers define reductionism in many different ways. I won’t be that sophisticated. Instead, I’ll suggest the following naive definition: Reductionism is the claim that theories on larger scales reduce to theories on smaller scales.
Here “reduce to” is intentionally a bit vague. It might mean “are caused by” or “can be derived from” or “are explained by”. I’m gesturing at the sort of thing people mean when they say that biology reduces to chemistry, or chemistry to physics.
What happens when we think about QCD, with this intuition?
QCD on larger scales does indeed reduce to QCD on smaller scales. If you want to ask why QCD on some scale has some coupling constant, you can explain it by looking at the (smaller) QCD coupling constant on a smaller scale. If you have equations for QCD on a smaller scale, you can derive the right equations for a larger scale. In some sense, everything you observe in your larger-scale theory of QCD is caused by what happens in your smaller-scale theory of QCD.
But this isn’t quite the reductionism you’re used to. When we say biology reduces to chemistry, or chemistry reduces to physics, we’re thinking of just a few layers: one specific theory reduces to another specific theory. Here, we have an infinite number of layers, every point on the scale from large to small, each one explained by the next.
Maybe you think you can get out of this, by saying that everything should reduce to the smallest scale. But remember, the smaller the scale the smaller our “coupling constant”, and the weaker the forces between particles. At “the smallest scale”, the coupling constant is zero, and there is no force. It’s only when you put your hand on the zoom knob and start turning that the force starts to exist.
It’s reductionism, perhaps, but not as we know it.
Now that I understand this a bit better, I get some of the objections to my post about naturalness a while back. I was being too naive about this kind of thing, as some of the commenters (particularly Jacques Distler) noted. I believe there’s a way to rephrase the argument so that it still works, but I’d have to think harder about how.
I also get why I was uneasy about Sabine Hossenfelder’s FQXi essay on reductionism. She considered a more complicated case, where the chain from large to small scale could be broken, a more elaborate variant of a problem in Quantum Electrodynamics. But if I’m right here, then it’s not clear that scaling in effective field theories is even the right way to think about this. When you have an infinite series of theories that reduce to other theories, you’re pretty far removed from what most people mean by reductionism.
Finally, this is the clearest reason I can find why you can’t do science without an observer. The “zoom” is just a choice we scientists make, an arbitrary scale describing our ignorance. But without it, there’s no way to describe QCD. The notion of scale is an inherent and inextricable part of the theory, and it doesn’t have to mean our theory is incomplete.
Experts, please chime in if I’m wrong on the physics here. As I mentioned at the beginning, I still don’t think I understand Wilsonian effective field theory. If I’m right though, this seems genuinely weird, and something more of the public should appreciate.
Most readers have by now heard that Google has “achieved” quantum “supremacy”. Notice the only word not in quotes is “quantum”, because unlike previous proposals that have also made some waves, quantumness is mostly not under review here. (Well, neither really are the other two words, but that story has already been covered quite eloquently by John, Scott, and Toby.) The Google team has managed to engineer a device that, although noisy, can do the right thing a large-enough fraction of the time for people to be able to “quantify its quantumness”.
However, the Google device, while less so than previous incarnations, is still noisy. Future devices like it will continue to be noisy. Noise is what makes quantum computers so darn difficult to build; it is what destroys the fragile quantum superpositions that we are trying so hard to protect (remember, unlike a classical computer, we are not protecting things we actually observe, but their superposition).
Protecting quantum information is like taking your home-schooled date (who has lived their entire life in a bunker) to the prom for the first time. It is a fun and necessary part of a healthy relationship to spend time in public, but the price you pay is the possibility that your date will hit it off with someone else. This will leave you abandoned, dancing alone to Taylor Swift’s “You Belong With Me” while crying into your (spiked?) punch.
The high school sweetheart/would-be dance partner in the above provocative example is the quantum superposition — the resource we need for a working quantum computer. You want it all to yourself, but your adversary — the environment — wants it too. No matter how much you try to protect it, you’ll have to observe it eventually (after all, you want to know the answer to your computation). And when you do (take your date out onto the crowded dance floor), you run the risk of the environment collapsing the information before you do, leaving you with nothing.
Protecting quantum information is also like (modern!) medicine. The fussy patient is the quantum information, stored in delicate superposition, while quantumists are the doctors aiming to prevent the patient from getting sick (or “corrupted”). If our patient incurs, say, “quasiparticle poisoning”, we first diagnose the patient’s syndromes and, based on this diagnosis, apply procedures like “lattice surgery” and “state injection” to help our patient successfully recover.
Error correction with qubits
Error correction sounds hard, and it should! Not to fear: plenty of very smart people have thought hard about this problem, and have come up with a plan — to redundantly encode the quantum superposition in a way that allows protection from errors caused by noise. Such quantum error-correction is an expansion of the techniques we currently use to protect classical bits in your phone and computer, but now the aim is to protect, not the definitive bit states 0 or 1, but their quantum superpositions. Things are even harder now, as the protection machinery has to do its magic without disturbing the superposition itself (after all, we want our quantum calculation to run to its conclusion and hack your bank).
For example, consider a qubit — the fundamental quantum unit represented by two shelves (which, e.g., could be the ground and excited states of an atom, the absence or presence of a photon in a box, or the zeroth and first quanta of a really cold LC circuit). This qubit can be in any quantum superposition of the two shelves, described by 2 probability amplitudes, one corresponding to each shelf. Observing this qubit will collapse its state onto either one of the shelves, changing the values of the 2 amplitudes. Since the resource we use for our computation is precisely this superposition, we definitely do not want to observe this qubit during our computation. However, we are not the only ones looking: the environment (other people at the prom: the trapping potential of our atom, the jiggling atoms of our metal box, nearby circuit elements) is also observing this system, thereby potentially manipulating the stored quantum state without our knowledge and ruining our computation.
Now consider 50 such qubits. Such a space allows for a superposition with 2^50 different amplitudes (instead of just 2 for the case of a single qubit). We are once again plagued by noise coming from the environment. But what if we now, less ambitiously, want to store only one qubit’s worth of information in this 50-qubit system? Now there is room to play with! A clever choice of how to do this (a.k.a. the encoding) helps protect from the bad environment.
The entire prospect of building a bona-fide quantum computer rests on this extra overhead or quantum redundancy of using a larger system to encode a smaller one. It sounds daunting at first: if we need 50 physical qubits for each robust logical qubit, then we’d need “I-love-you-3000” physical qubits for 60 logical ones? Yes, this is a fact we all have to live with. But granted we can scale up our devices to that many qubits, there is no fundamental obstacle that prevents us from then using error correction to make next-level computers.
To what extent do we need to protect our quantum superposition from the environment? It would be too ambitious to protect it from a meteor shower. Or a power outage (although that would be quite useful here in California). So what then can we protect against?
Our working answer is local noise — noise that affects only a few qubits that are located near each other in the device. We can never be truly certain if this type of noise is all that our quantum computers will encounter. However, our belief that this is the noise we should focus on is grounded in solid physical principles — that nature respects locality, that affecting things far away from you is harder than making an impact nearby. (So far Google has not reported otherwise, although much more work needs to be done to verify this intuition.)
The harmonic oscillator
In what other ways can we embed our two-shelf qubit into a larger space? Instead of scaling up using many physical qubits, we can utilize a fact that we have so far swept under the rug: in any physical system, our two shelves are already part of an entire bookcase! Atoms have more than one excited state, there can be more than one photon in a box, and there can be more than one quantum in a cold LC circuit. Why don’t we use some of that higher-energy space for our redundant encoding?
The noise in our bookcase will certainly be different, since the structure of the space, and therefore the notion of locality, is different. How to cope with this? The good news is that such a space — the space of the harmonic oscillator — also has a(t least one) natural notion of locality!
Whatever the incarnation, the oscillator has associated with it a position and momentum (different jargon for these quantities may be used, depending on the context, but you can just think of a child on a swing, just quantized). Anyone who knows the joke about Heisenberg getting pulled over will know that these two quantities cannot be set simultaneously.
Nevertheless, local errors can be thought of as small shifts in position or momentum, while nonlocal errors are ones that suddenly shift our bewildered swinging quantized child from one side of the swing to the other.
Armed with a local noise model, we can extend our know-how from multi-qubit land to the oscillator. One of the first such oscillator codes was developed by Gottesman, Kitaev, and Preskill (GKP). Proposed in 2001, GKP encodings posed a difficult engineering challenge: some believed that GKP states could never be realized, that they “did not exist”. In the past few years however, GKP states have been realized nearly simultaneously in two experimental platforms. (Food for thought for the non-believers!)
Parallel to GKP codes, another promising oscillator encoding using cat states is also being developed. This encoding has historically been far easier to create experimentally. It is so far the only experimental procedure achieving the break-even point, at which the actively protected logical information has the same lifetime as the system’s best unprotected degree of freedom.
Can we mix and match all of these different systems? Why yes! While Google is currently trying to build the surface code out of qubits, using oscillators (instead of qubits) for the surface code and encoding said oscillators either in GKP (see related IBM post) [1,2,3] or cat [4,5] codes is something people are seriously considering. There is even more overhead, but the extra information one gets from the correction procedure might make for a more fault-tolerant machine. With all of these different options being explored, it’s an exciting time to be into quantum!
Molecules?
It turns out there are still other systems we can consider, although because they are sufficiently more “out there” at the moment, I should first say “bear with me!” as I explain. Forget about atoms, photons in a box, and really cold LC circuits. Instead, consider a rigid 3-dimensional object whose center of mass has been pinned in such a way that the object can rotate any way it wants. Now, “quantize” it! In other words, consider the possibility of having quantum superpositions of different orientations of this object. Just like superpositions of a dead and alive cat, of a photon and no photon, the object can be in quantum superposition of oriented up, sideways, and down, for example. Superpositions of all possible orientations then make up our new configuration space (read: playground), and we are lucky that it too inherits many of the properties we know and love from its multi-qubit and oscillator cousins.
Examples of rigid bodies include airplanes (which can roll, pitch and yaw, even while “fixed” on a particular trajectory vector) and robot arms (which can rotate about multiple joints). Given that we’re not quantizing those (yet?), what rigid body should we have in mind as a serious candidate? Well, in parallel to the impressive engineering successes of the multi-qubit and oscillator paradigms, physicists and chemists have made substantial progress in trapping and cooling molecules. If a trapped molecule is cold enough, its vibrational and electronic states can be neglected, and its rotational states form exactly the rigid body we are interested in. Such rotational states, as far as we can tell, are not in the realm of Avengers-style science fiction.
The idea to use molecules for quantum computing dates all the way back to a 2001 paper by Dave DeMille, but in a recent paper by Jacob Covey, John Preskill, and myself, we propose a framework for how to utilize the large space of molecular orientations to protect against (you guessed it!) a type of local noise. In the second part of the story, called “Quantum Error Correction with Molecules”, I will cover a particular concept that is not only useful for a proper error-correcting code (classical and quantum), but also one that is quite fun to try and understand. The concept is based on a certain kind of tiling, called Voronoi tiles or Thiessen polygons, which can be used to tile anything from your bathroom floor to the space of molecular orientations. Stay tuned!
We had a special session on applied category theory here at UCR:
• Applied category theory, Fall Western Sectional Meeting of the AMS, 9–10 November 2019, U.C. Riverside.
I was bowled over by the large number of cool ideas. I’ll have to blog about some of them. A bunch of people stayed for a few days afterwards, and we had lots of great conversations.
The biggest news was that Brendan Fong and David Spivak definitely want to set up an applied category theory institute in the San Francisco Bay Area, which they’re calling the Topos Institute. They are now in the process of raising funds for this institute! I plan to be involved, so I’ll be saying more about this later.
But back to the talks. We didn’t make videos, but here are the slides. Click on talk titles to see abstracts of the talks. For a multi-author talk, the person whose name is in boldface is the one who gave the talk. You also might enjoy comparing the 2017 talks.
• 8:00 a.m.
Fibrations as generalized lens categories — talk slides.
David I. Spivak, Massachusetts Institute of Technology
• 9:00 a.m.
Supplying bells and whistles in symmetric monoidal categories — talk slides.
Brendan Fong, Massachusetts Institute of Technology
David I. Spivak, Massachusetts Institute of Technology
• 9:30 a.m.
Right adjoints to operadic restriction functors — talk slides.
Philip Hackney, University of Louisiana at Lafayette
Gabriel C. Drummond-Cole, IBS Center for Geometry and Physics
• 10:00 a.m.
Duality of relations — talk slides.
Alexander Kurz, Chapman University
• 10:30 a.m.
A synthetic approach to stochastic maps, conditional independence, and theorems on sufficient statistics — talk slides.
Tobias Fritz, Perimeter Institute for Theoretical Physics
• 3:00 p.m.
Constructing symmetric monoidal bicategories functorially — talk slides.
Michael Shulman, University of San Diego
Linde Wester Hansen, University of Oxford
• 3:30 p.m.
Structured cospans — talk slides.
Kenny Courser, University of California, Riverside
John C. Baez, University of California, Riverside
• 4:00 p.m.
Generalized Petri nets — talk slides.
Jade Master, University of California, Riverside
• 4:30 p.m.
Formal composition of hybrid systems — talk slides and website.
Paul Gustafson, Wright State University
Jared Culbertson, Air Force Research Laboratory
Dan Koditschek, University of Pennsylvania
Peter Stiller, Texas A&M University
• 5:00 p.m.
Strings for cartesian bicategories — talk slides.
M. Andrew Moshier, Chapman University
• 5:30 p.m.
Defining and programming generic compositions in symmetric monoidal categories — talk slides.
Dmitry Vagner, Los Angeles, CA
• 8:00 a.m.
Mathematics for second quantum revolution — talk slides.
Zhenghan Wang, UCSB and Microsoft Station Q
• 9:00 a.m.
A compositional and statistical approach to natural language — talk slides.
Tai-Danae Bradley, CUNY Graduate Center
• 9:30 a.m.
Exploring invariant structure in neural activity with applied topology and category theory — talk slides.
Brad Theilman, UC San Diego
Krista Perks, UC San Diego
Timothy Q Gentner, UC San Diego
• 10:00 a.m.
Of monks, lawyers and villages: new insights in social network science — talk cancelled due to illness.
Nina Otter, Mathematics Department, UCLA
Mason A. Porter, Mathematics Department, UCLA
• 10:30 a.m.
Functorial cluster embedding — talk slides.
Steve Huntsman, BAE Systems FAST Labs
• 2:00 p.m.
Quantitative equational logic — talk slides.
Prakash Panangaden, School of Computer Science, McGill University
Radu Mardare, Strathclyde University
Gordon D. Plotkin, University of Edinburgh
• 3:00 p.m.
Brakes: an example of applied category theory — talk slides in PDF and Powerpoint.
Eswaran Subrahmanian, Carnegie Mellon University / National Institute of Standards and Technology
• 3:30 p.m.
Intuitive robotic programming using string diagrams — talk slides.
Blake S. Pollard, National Institute of Standards and Technology
• 4:00 p.m.
Metrics on functor categories — talk slides.
Vin de Silva, Department of Mathematics, Pomona College
• 4:30 p.m.
Hausdorff and Wasserstein metrics on graphs and other structured data — talk slides.
Evan Patterson, Stanford University
In the previous blog post (titled, “On the Coattails of Quantum Supremacy“) we started with Google and ended up with molecules! I also mentioned a recent paper by John Preskill, Jake Covey, and myself (see also this videoed talk) where we assume that, somewhere in the (near?) future, experimentalists will be able to construct quantum superpositions of several orientations of molecules or other rigid bodies. Next, I’d like to cover a few more details on how to construct error-correcting codes for anything from classical bits in your phone to those future quantum computers, molecular or otherwise.
Error correction is concerned with the design of an encoding that allows for protection against noise. Let’s say we want to protect one classical bit, which is in either “0” or “1”. If the bit is say in “0”, and the environment (say, the strong magnetic field from a magnet you forgot was lying next to your hard drive) flipped it to “1” without our knowledge, an error would result (e.g., making your phone think you swiped right!)
Now let’s encode our single logical bit into three physical bits, whose possible states are represented by the eight corners of the cube below. Let’s encode the logical bit as “0” —> 000 and “1” —> 111, corresponding to the corners of the cube marked by the black and white ball, respectively. For our (local) noise model, we assume that flips of only one of the three physical bits are more likely to occur than flips of two or three at the same time.
Error correction is, like many Hollywood movies, an origin story. If, say, the first bit flips in our above code, the 000 state is mapped to 100, and 111 is mapped to 011. Since we have assumed that the most likely error is a flip of one of the bits, we know upon observing that 100 must have come from the clean 000, and 011 from 111. Thus, in either case of the logical bit being “0” or “1”, we can recover the information by simply observing which state the majority of the bits are in. The same thing happens when the second or third bit flips. In all three cases, the logical “0” state is mapped to one of its three neighboring points (above, in blue) while the logical “1” is mapped to its own three points, which, crucially, are distinct from the neighbors of “0”. The set of points that are closer to 000 than to 111 is called a Voronoi tile.
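The majority-vote recovery just described can be sketched in a few lines (a toy illustration; the function names are my own):

```python
# Three-bit repetition code: "0" -> 000, "1" -> 111. A single bit-flip
# leaves the codeword inside the Voronoi tile of its origin, so decoding
# is a majority vote over the three physical bits.

def encode(logical_bit):
    return [logical_bit] * 3

def decode(physical_bits):
    # The Voronoi tile of 000 is exactly the set of strings with majority 0.
    return 1 if sum(physical_bits) >= 2 else 0

# Any single flip is corrected:
for flip in range(3):
    noisy = encode(0)
    noisy[flip] ^= 1
    assert decode(noisy) == 0

print(decode([1, 0, 1]))  # 1: 101 must have come from the clean 111
```

Two simultaneous flips, which we assumed to be unlikely, would push the codeword across the boundary into the wrong tile; that is the sense in which the code protects only against the dominant, local noise.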
Now, let’s adapt these ideas to molecules. Consider the rotational states of a dumb-bell molecule consisting of two different atoms. (Let’s assume that we have frozen this molecule to the point that the vibration of the inter-atomic bond is limited, essentially creating a fixed distance between the two atoms.) This molecule can orient itself in any direction, and each such orientation can be represented as a point on the surface of a sphere. Now let us encode a classical bit using the north and south poles of this sphere (represented in the picture below as a black and a white ball, respectively). The north pole of the sphere corresponds to the molecule being parallel to the z-axis, while the south pole corresponds to the molecule being anti-parallel.
This time, the noise consists of small shifts in the molecule’s orientation. Clearly, if such shifts are small, the molecule just wiggles a bit around the z-axis. Such wiggles still allow us to infer that the molecule is (mostly) parallel and anti-parallel to the axis, as long as they do not rotate the molecule all the way past the equator. Upon such correctable rotations, the logical “0” state — the north pole — is mapped to a point in the northern hemisphere, while logical “1” — the south pole — is mapped to a point in the southern hemisphere. The northern hemisphere forms a Voronoi tile of the logical “0” state (blue in the picture), which, along with the corresponding tile of the logical “1” state (the southern hemisphere), tiles the entire sphere.
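In code, the recovery rule for this spherical encoding is just a sign check on the z-component of the orientation vector (a sketch; the orientation tuples below are hypothetical data):

```python
# North/south-pole encoding of one classical bit in a diatomic molecule's
# orientation, taken as a unit vector on the sphere. The two Voronoi tiles
# are the hemispheres, so recovery reduces to the sign of the z-component.

def decode_orientation(v):
    x, y, z = v
    return 0 if z > 0 else 1  # northern hemisphere -> logical "0"

print(decode_orientation((0.10, -0.20, 0.97)))  # small wiggle near north: 0
print(decode_orientation((0.30, 0.10, -0.95)))  # southern hemisphere: 1
```

A wiggle that carries the molecule past the equator lands in the wrong tile and causes a logical error, mirroring the two-flip failure of the three-bit code.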
Quantum error correction
To upgrade these ideas to the quantum realm, recall that this time we have to protect superpositions. This means that, in addition to shifting our quantum logical state to other states as before, noise can also affect the terms in the superposition itself. Namely, if, say, the superposition is equal — with an amplitude of 1/√2 in “0” and 1/√2 in “1” — noise can change the relative sign of the superposition and map one of the amplitudes to −1/√2. We didn’t have to worry about such sign errors before, because our classical information would always be the definite state of “0” or “1”. Now, there are two effects of noise to worry about, so our task has become twice as hard!
Not to worry though. In order to protect against both sources of noise, all we need to do is effectively stagger the above constructions. Now we will need to design a logical “0” state which is itself a superposition of different points, with each point separated from all of the points that are superimposed to make the logical “1” state.
Diatomic molecules: For the diatomic molecule example, consider superpositions of all four corners of two antipodal tetrahedra for the two respective logical states.
Each orientation (black or white point) present in our logical states rotates under fluctuations in the orientation of the molecule. However, the entire set of orientations for, say, logical “0” — the tetrahedron — rotates rigidly under such rotations. Therefore, the region from which we can successfully recover after rotations is fully determined by the Voronoi tile of any one of the corners of the tetrahedron. (Above, we plot the tile for the point at the north pole.) This cell is clearly smaller than the one for the classical north-south-pole encoding we used before. However, the tetrahedral code now provides some protection against phase errors — the other type of noise that we need to worry about if we are to protect quantum information. This is an example of the trade-off we must make in order to protect against both types of noise; a licensed quantum mechanic has to live with such trade-offs every day.
Oscillators: Another example of a quantum encoding is the GKP encoding in the phase space of the harmonic oscillator. Here, we have at our disposal the entire two-dimensional plane indexing different values of position and momentum. In this case, we can use a checkerboard approach, superimposing all points at the centers of the black squares for the logical “0” state, and similarly all points at the centers of the white squares for the logical “1”. The region depicting correctable momentum and position shifts is then the Voronoi cell of the point at the origin: if a shift takes our central black point to somewhere inside the blue square, we know (most likely) where that point came from! In solid state circles, the blue square is none other than the primitive or unit cell of the lattice consisting of points making up both of the logical states.
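The GKP recovery step can be sketched as rounding to the nearest lattice point; here is a minimal illustration (unit lattice spacing is my own simplifying assumption, and real GKP correction acts on quantum states, not on classical pairs of numbers):

```python
# GKP-style recovery sketch: logical points sit on a lattice in the
# position-momentum plane, and a small shift is corrected by snapping
# back to the nearest lattice point, i.e. the center of the Voronoi
# (unit) cell that the shifted state landed in.

def correct_shift(position, momentum, spacing=1.0):
    return (round(position / spacing) * spacing,
            round(momentum / spacing) * spacing)

# A shift smaller than half a cell in each direction is corrected exactly:
print(correct_shift(0.3, -0.4))  # (0.0, 0.0): back to the origin
```

A shift larger than half the cell gets snapped to the wrong lattice point, which is the phase-space analogue of rotating the molecule past the equator.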
Asymmetric molecules (a.k.a. rigid rotors): Now let’s briefly return to molecules. Above, we considered diatomic molecules that had a symmetry axis, i.e., that were left unchanged under rotations about the axis that connects the two atoms. There are of course more general molecules out there, including ones that are completely asymmetric under any possible (proper) 3D rotation (see figure below for an example).
All of the orientations of the asymmetric molecule, and more generally a rigid body, can no longer be parameterized by the sphere. They can be parameterized by the 3D rotation group SO(3): each orientation of an asymmetric molecule is labeled by the 3D rotation necessary to obtain said orientation from a reference state. Such rotations, and in turn the orientations themselves, are parameterized by an axis (around which to rotate) and an angle (by which one rotates). The rotation group SO(3) luckily can still be viewed by humans on a sheet of paper. Namely, SO(3) can be thought of as a ball of radius π with opposite points on its boundary identified. The direction of each vector lying inside the ball corresponds to the axis of rotation, while the length corresponds to the angle. This may take some time to digest, but it’s not crucial to the story.
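The ball picture is easy to play with numerically; a sketch (function name mine) mapping an axis-angle rotation to its point in the ball:

```python
import math

# "Ball of radius pi" picture of the rotation group: a rotation by `angle`
# about unit vector `axis` becomes the point angle * axis, after wrapping
# the angle into [0, pi]. (A rotation by theta about n is the same rotation
# as one by 2*pi - theta about -n, which is why the wrap is allowed.)

def ball_point(axis, angle):
    angle = angle % (2 * math.pi)
    if angle > math.pi:                      # flip to the equivalent short rotation
        angle = 2 * math.pi - angle
        axis = tuple(-a for a in axis)
    return tuple(angle * a for a in axis)

p = ball_point((0.0, 0.0, 1.0), 3 * math.pi / 2)
print(p)  # equivalent to a rotation by pi/2 about the -z axis
assert math.dist(p, (0, 0, 0)) <= math.pi   # every rotation lands inside the ball
```

Antipodal boundary points (angle exactly π about opposite axes) describe the same rotation, which is the identification mentioned above.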
So far we’ve looked at codes defined on cubes of bits, spheres, and phase-space lattices. Turns out that even SO(3) can house similar encodings! In other words, SO(3) can also be cut up into different Voronoi tiles, which in turn can be staggered to create logical “0” and “1” states consisting of different molecular orientations. There are many ways to pick such states, corresponding to various subgroups of SO(3). Below, we sketch two sets of black/white points, along with the Voronoi tile corresponding to the rotations that are corrected by each encoding.
In closing…
Achieving supremacy was a big first step towards making quantum computing a practical and universal tool. However, the largest obstacles still await, namely handling superposition-poisoning noise coming from the ever-curious environment. As quantum technologies advance, other possible routes for error correction are to encode qubits in harmonic oscillators and molecules, alongside the “traditional” approach of using arrays of physical qubits. Oscillator and molecular qubits possess their own mechanisms for error correction, and could prove useful (granted that the large high-energy space required for the procedures to work can be accessed and controlled). Even though molecular qubits are not yet mature enough to be used in quantum computers, we have at least outlined a blueprint for how some of the required pieces can be built. We are by no means done, however: besides an engineering barrier, we need to further develop how to run robust computations on these exotic spaces.
Author’s note: I’d like to acknowledge Jose Gonzalez for helping me immensely with the writing of this post, as well as for drawing the comic panels in the previous post. The figures above were made possible by Mathematica 12.
Subhash Khot is giving a talk at Current Developments in Math this year, and his talk has the intriguing phrase “Grassmann graph” in it, so I thought I’d look up what it is and what he and his collaborators prove about it. Indeed, it’s interesting! I’m just going to jot down something I learned from “Pseudorandom Sets in Grassmann Graph have Near-Perfect Expansion,” by Khot, Dor Minzer, and Muli Safra, in a way that makes it sound like something algebraic geometers might be interested in, which, indeed, I think they might be!
Suppose you have a sheaf F on a space X, and the space has a covering U_1, …, U_N. The sheaf axiom says that if we have a family of sections s_i of F(U_i) such that s_i and s_j agree on U_i ∩ U_j for all i, j, then there is actually a global section s in F(X) which restricts to each s_i.
What if we only have an approximate section? That is: what if we have a family of s_i such that: if I select i and j uniformly at random, the probability that s_i and s_j agree on U_i ∩ U_j is bounded below by some p > 0. Call such a family a “p-section.” (You should take the view that this is really a family of problems with X changing and N growing, so that when I say p > 0 the content is that p is bounded away from 0 uniformly in X, N.)
The question is then: Is an approximate section approximately a section?
(This is meant to recall the principle from additive number theory that an approximate subgroup is approximately a subgroup, as in e.g. Freiman-Rusza.)
That is: if s_1, …, s_N form a p-section, is there some actual section s in F(X) such that, for i chosen uniformly at random, s agrees with s_i on U_i with probability at least p’, for some p’ depending only on p?
The case which turns out to be relevant to complexity theory is the Grassmann graph, which we can describe as follows: X is a k-dimensional vector space over F_2 and the U_i are the l-dimensional subspaces for some integer l. But we do something slightly weird (which is what makes it the Grassmann graph, not the Grassmann simplicial complex) and declare that the only nonempty intersections are those where U_i ∩ U_j has dimension l-1. The sheaf is the one whose sections on U_i are the linear functions from U_i to F_2.
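For concreteness, the Grassmann graph is small enough to enumerate by brute force at toy size. A sketch with parameters k = 4, l = 2, chosen by me purely so the enumeration stays tiny:

```python
from itertools import combinations, product

# Grassmann graph for F_2^4 with l = 2: vertices are the 2-dimensional
# subspaces, and two subspaces are adjacent when their intersection has
# dimension l - 1 = 1.

def span(u, v):
    # All F_2 linear combinations a*u + b*v of two vectors u, v.
    return frozenset(tuple((a * ui + b * vi) % 2 for ui, vi in zip(u, v))
                     for a in (0, 1) for b in (0, 1))

vectors = [v for v in product((0, 1), repeat=4) if any(v)]
# Any two distinct nonzero vectors over F_2 are independent, so each pair
# spans a plane; distinct pairs can span the same plane, hence the set.
subspaces = {span(u, v) for u, v in combinations(vectors, 2)}

def adjacent(U, W):
    # dim(U ∩ W) = 1 exactly when the intersection has 2^1 = 2 vectors.
    return U != W and len(U & W) == 2

print(len(subspaces))  # 35 planes in F_2^4
degrees = {sum(adjacent(U, W) for W in subspaces) for U in subspaces}
print(degrees)         # a single value: the graph is regular
```

Each plane turns out to have the same number of neighbors, a first hint of the strong symmetry that makes the expansion analysis in the paper possible.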
Speculation 1.7 in the linked paper is that an approximate section is approximately a section. This turns out not to be true! Because there are large sets of U_i whose intersection with the rest of X is smaller than you might expect. This makes sense: if X is a space which is connected but which is “almost a disjoint union of X_1 and X_2,” i.e. X = X_1 ∪ X_2 and X_1 ∩ X_2 involves very few of the U_i, then by choosing a section of F(X_1) and a section of F(X_2) independently you can get an approximate section which is unlikely to be approximated by any actual global section.
But the good news is that, in the case at hand, that ends up being the only problem. Khot-Minzer-Safra classify the “approximately disconnected” chunks of X (they are collections of l-dimensional subspaces containing a fixed subspace of small dimension and contained in a fixed subspace of small codimension) and show that any approximate section of F is approximated by a section on some such chunk; this is all that is needed to prove the “2-to-2 games conjecture” in complexity theory, which is their ultimate goal.
So I found all this quite striking! Do questions about approximate global sections being approximated by global sections appear elsewhere? (The question as phrased here is already a bit weird from an algebraic geometry point of view, since it seems to require that you have or impose a probability measure on your set of open patches, but maybe that’s natural in some cases?)
It is no secret that quantum computing has recently become one of the trendiest topics within the physics community, gaining financial support and good press at an ever increasing pace. The new technology not only promises huge advances in information processing, but it also – in theory – has the potential to crack the encryption that currently protects sensitive information inside governments and businesses around the world. Consequently, quantum research has extended beyond academic groups and has entered the technical industry, creating new job opportunities for both experimentalists and theorists. However, in order for this technology to become a reality, we need qualified engineers and scientists that can fill these positions.
Increasing the number of individuals with an interest in this field starts with educating our youth. While it does not take particularly advanced mathematics to explain the basics of quantum computing, there are still involved topics such as quantum superposition, unitary evolution, and projective measurement that can be difficult to conceptualize. In order to explain these topics at a middle and high school level to encourage more students to enter this area, we decided to design an educational game called Bas|ket>ball, which allows students to directly engage with quantum computing concepts outside of the classroom while being physically active.
After playing the game with students in our local Quantum Information High (QIHigh) Program at Stevens Institute of Technology, we realized that the game is a fun learning tool worth sharing with the broader physics community. Here, we describe a non-gender specific activity that can be used to effectively teach the basics of quantum computing at a high school level.
Quantum Basketball is something that helps you understand a very confusing topic, especially for a ninth grader! In the QI-High Program at Stevens, I was approached with a challenge of learning about quantum computing, and while I was hesitant at first, my mentors made the topic so much more understandable by relating it to a sport that I love!
Grace Conlin, Freshman Student from High Tech High School
The game can have up to 10 student players and only requires one basketball. Each player acts as a quantum bit (qubit) in a quantum register and is initialized to the |0> position. During each turn, a player will perform one of the allowed quantum gates depending on their position on the court. A diagram of the court positions is displayed at the bottom.
There are four options of quantum gates from which players can choose to move around the court:
The first player to measure ten 1’s (make ten shots) wins! In order to make the game more interesting, the following additional rules are put in place:
For example, let’s say that we have three student players. Each will start at the red marker, behind the basketball hoop. One by one, each student will choose from the list of gate operations above. If the first player chooses an X-gate, he/she will physically move to the blue marker and be in a better position to make a measurement (take a shot) during the next turn. If the second player chooses to perform a Hadamard gate, he/she will move to the green marker. Each of the students will continue to move around the court and score points by making measurements until 10 points are reached.
However, things can get more interesting when players start to form alliances by entangling. If player 1 is at the green marker and player 3 is at the red marker, then player 1 can perform a C-Not gate on player 3 to become entangled with them. Now if either player takes a shot and scores a point, both players will be awarded 2 points (1 x the number of players entangled).
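The bookkeeping behind this walk-through can be sketched in a few lines (the state representation, function names, and the restriction to the two gates named in the example are my own choices; the scoring rule is the "1 × number of entangled players" rule above):

```python
import math

# A player's "position" on the court stands for a qubit state with two
# amplitudes (shelf |0> and shelf |1>).

def x_gate(state):
    # X swaps the two shelves: |0> <-> |1>.
    a, b = state
    return (b, a)

def hadamard(state):
    # Hadamard mixes the shelves into an equal superposition (from |0>).
    a, b = state
    s = 1 / math.sqrt(2)
    return (s * (a + b), s * (a - b))

def award_points(entangled_players):
    # If any player in an entangled alliance scores, each member is
    # awarded 1 x the number of entangled players.
    return {p: len(entangled_players) for p in entangled_players}

print(x_gate((1, 0)))              # (0, 1): the player moves to the |1> marker
print(award_points(["p1", "p3"]))  # {'p1': 2, 'p3': 2}
```

Running `hadamard((1, 0))` gives equal amplitudes on both shelves, matching the second player's move to the green marker.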
We believe that simple games such as this, along with Quantum TiqTaqToe and Quantum Chess, will attract more young students to pursue degrees in physics and computer science, and eventually specialize in quantum information and computing fields. Not only is this important for the overall progression of the field, but also to encourage more diversity and inclusion in STEM.
A few weeks ago, I mentioned a paper by a colleague of mine, Mohamed Rameez, that generated some discussion. Since I wasn’t up for commenting on the paper’s scientific content, I thought it would be good to give Rameez a chance to explain it in his own words, in a guest post. Here’s what he has to say:
In an earlier post, 4gravitons had contemplated the question of ‘when to trust the contrarians’, in the context of our about-to-be-published paper in which we argue that, after accounting for the effects of the bulk flow in the local Universe, there is no evidence for any isotropic cosmic acceleration, which would be required to claim some sort of ‘dark energy’.
In the following I would like to emphasize that this is a reasonable view, and not a contrarian one. To do so I will examine the bulk flow of the local Universe and the historical evolution of what appears to be somewhat dodgy supernova data. I will present a trivial solution (from data) to the claimed ‘Hubble tension’. I will then discuss inhomogeneous cosmology, and the 2011 Nobel prize in Physics. I will proceed to make predictions that can be falsified with future data. I will conclude with some questions that should be frequently asked.
Disclaimer: The views expressed here are not necessarily shared by my collaborators.
The bulk flow of the local Universe:
The largest anisotropy in the Cosmic Microwave Background is the dipole, believed to be caused by our motion with respect to the ‘rest frame’ of the CMB with a velocity of ~369 km s^-1. Under this view, all matter in the local Universe appears to be flowing. At least out to ~300 Mpc, this flow continues to be directionally coherent, to within ~40 degrees of the CMB dipole, and the scale at which the average relative motion between matter and radiation converges to zero has so far not been found.
This is one of the most widely accepted results in modern cosmology, to the extent that SN1a data come pre-‘corrected’ for it.
Such a flow has covariant consequences under general relativity and this is what we set out to test.
Supernova data, directions in the sky and dodgyness:
Both Riess et al 1998 and Perlmutter et al 1999 used samples of supernovae down to redshifts of 0.01, in which almost all SNe at redshifts below 0.1 were in the direction of the flow.
Subsequently, in Astier et al 2006, Kowalsky et al 2008, Amanullah et al 2010 and Suzuki et al 2011, it is reported that a process of outlier rejection was adopted in which data points >3σ from the Hubble diagram were discarded. This was done using a highly questionable statistical method that involves adjusting an intrinsic dispersion term by hand until a χ² per degree of freedom of 1 is obtained for the assumed ΛCDM model. The number of outliers rejected is however far in excess of 0.3% – which is the 3σ expectation. As the sky coverage became less skewed, supernovae with redshift less than ~0.023 were excluded for being outside the Hubble flow. While the Hubble diagram so far had been inferred from heliocentric redshifts and magnitudes, with the introduction of SDSS supernovae that happened to be in the direction opposite to the flow, peculiar velocity ‘corrections’ were adopted in the JLA catalogue and supernovae down to extremely low redshifts were reintroduced. While the early claims of a cosmological constant were stated as ‘high redshift supernovae were found to be dimmer (15% in flux) than the low redshift supernovae (compared to what would be expected in a universe without a cosmological constant)’, it is worth noting that the peculiar velocity corrections change the redshifts and fluxes of low redshift supernovae by up to ~20%.
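As a quick sanity check on the 0.3% figure: for Gaussian scatter, the expected fraction of points lying beyond 3 standard deviations follows from the complementary error function (a one-liner; numbers only, no supernova data involved):

```python
import math

# Fraction of normally distributed points expected more than 3 standard
# deviations from the mean: erfc(3 / sqrt(2)) = 2 * (1 - Phi(3)).
tail_fraction = math.erfc(3 / math.sqrt(2))
print(round(100 * tail_fraction, 2), "%")  # about 0.27 %
```

Rejecting noticeably more than ~0.3% of the sample at the 3σ level is therefore inconsistent with purely Gaussian scatter about the fitted model.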
When it was observed that even with this ‘corrected’ sample of 740 SNe, any evidence for isotropic acceleration using a principled Maximum Likelihood Estimator is less than 3σ, it was claimed that by adding 12 additional parameters (to the 10 parameter model) to allow for redshift and sample dependence of the light curve fitting parameters, the evidence was greater than 4σ.
As we discuss in Colin et al. 2019, these corrections also appear to be arbitrary, and betray an ignorance of the fundamentals of both basic statistical analysis and relativity. With the Pantheon compilation, heliocentric observables were no longer public and these peculiar velocity corrections initially extended far beyond the range of any known flow model of the Local Universe. When this bug was eventually fixed, both the heliocentric redshifts and magnitudes of the SDSS SNe that filled in the ‘redshift desert’ between low and high redshift SNe were found to be alarmingly discrepant. The authors have so far not offered any clarification of these discrepancies.
Thus it seems to me that the latest generation of ‘publicly available’ supernova data are not aiding either open science or progress in cosmology.
A trivial solution to the ‘Hubble tension’?
The apparent tension between the Hubble parameter as inferred from the Cosmic Microwave Background and low redshift tracers has been widely discussed, and recent studies suggest that redshift errors as low as 0.0001 can have a significant impact. Redshift discrepancies as big as 0.1 have been reported. The shifts reported between JLA and Pantheon appear to be sufficient to lower the Hubble parameter from ~73 km s^-1 Mpc^-1 to ~68 km s^-1 Mpc^-1.
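To get a feel for the size of shift involved, here is a toy back-of-the-envelope sketch of my own (illustrative numbers, not values taken from JLA or Pantheon): at low redshift the Hubble law reads H0 ≈ cz/d, so with distances held fixed, a uniform fractional shift in the redshifts moves the inferred H0 by the same fraction. Lowering H0 from ~73 to ~68 km s^-1 Mpc^-1 then corresponds to a redshift shift of only ~0.003 at z = 0.05, well within the size of the discrepancies reported above.

```python
# Toy illustration, not the actual supernova analysis: the low-redshift
# Hubble law H0 = c * z / d, with the distance d held fixed, shows how a
# systematic redshift shift propagates into the inferred Hubble parameter.
C_KM_S = 299792.458  # speed of light in km/s

def h0_from(z, d_mpc):
    """Inferred Hubble parameter (km/s/Mpc) from a redshift and a distance."""
    return C_KM_S * z / d_mpc

z_true = 0.05                          # a typical low supernova redshift
d = C_KM_S * z_true / 73.0             # distance chosen so that H0 = 73
h0_original = h0_from(z_true, d)       # recovers 73 by construction
h0_shifted = h0_from(z_true * 68.0 / 73.0, d)  # ~7% downward shift in z
delta_z = z_true * (1.0 - 68.0 / 73.0)         # the shift needed: ~0.0034
```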
On General Relativity, cosmology, metric expansion and inhomogeneities:
In the maximally symmetric Friedmann-Lemaitre-Robertson-Walker solution to general relativity, there is only one meaningful global notion of distance and it expands at the same rate everywhere. However, the late time Universe has structure on all scales, and one may only hope for statistical (not exact) homogeneity. The Universe is expected to be lumpy. A background FLRW metric is not expected to exist and quantities analogous to the Hubble and deceleration parameters will vary across the sky. Peculiar velocities may be more precisely thought of as variations in the expansion rate of the Universe. At what rate does a real Universe with structure expand? The problems of defining a meaningful average notion of volume, its dynamical evolution, and connecting it to observations are all conceptually open.
On the 2011 Nobel Prize in Physics:
Ellis and Stoeger’s paper ‘The fitting problem in cosmology’ was written in 1987. In the context of this work and the significant theoretical difficulties involved in inferring fundamental physics from the real Universe, any claims of having measured a cosmological constant from directionally skewed, sparse samples of intrinsically scattered observations should have been taken with a grain of salt. By honouring this claim with a Nobel Prize, the Swedish Academy may have induced runaway prestige bias in favour of some of the least principled analyses in science, strengthening the confirmation bias that seems prevalent in cosmology.
This has resulted in the generation of a large body of misleading literature, while normalizing the practice of ‘massaging’ scientific data. In her recent video about gravitational waves, Sabine Hossenfelder says “We should not hand out Nobel Prizes if we don’t know how the predictions were fitted to the data”. What about when the data were fitted (in 1998-1999) using a method that had been discredited in 1989, to a toy model that had been cautioned against in 1987, leading to a ‘discovery’ of profound significance to fundamental physics?
A prediction with future cosmological data:
With the advent of high statistics cosmological data in the future, such as from the Large Synoptic Survey Telescope, I predict that the Hubble and deceleration parameters inferred from supernovae in hemispheres towards and away from the CMB dipole will be found to be different in a statistically significant (>5σ) way. Depending upon the criterion for selection and blind analyses of data that can be agreed upon, I would be willing to bet a substantial amount of money on this prediction.
Concluding: on the amusing sociology of ‘Dark Energy’ and manufactured concordance:
Of the two authors of the well-known cosmology textbook ‘The Early Universe’, Edward Kolb writes these interesting papers questioning dark energy while Michael Turner is credited with coining the term ‘Dark Energy’. Reasonable scientific perspectives have to be presented as ‘Dark Energy without dark energy’. Papers questioning the need to invoke such a mysterious content that makes up ‘68% of the Universe’ are quickly targeted by inane articles by non-experts or perhaps well-meant but still misleading YouTube videos. Much of this is nothing more than a spectacle.
In summary, while the theoretical debate about whether what has been observed as Dark Energy is an effect of inhomogeneities is ongoing, observers appear to have been actively using the most inhomogeneous feature of the local Universe, through opaque corrections to data, to continue claiming that this ‘dark energy’ exists.
It is heartening to see that recent works lean toward a breaking of this manufactured concordance and speak of a crisis for cosmology.
Questions that should be frequently asked:
Q. Is there a Hubble frame in the late time Universe?
A. The Hubble frame is a property of the FLRW exact solution, and in the late time Universe in which galaxies and clusters have peculiar motions with respect to each other, an equivalent notion does not exist. While popular inference treats the frame in which the CMB dipole vanishes as the Hubble frame, the scale at which the bulk flow of the local Universe converges to that frame has never been found. We are tilted observers.
Q. I am about to perform blinded analyses on new cosmological data. Should I correct all my redshifts towards the CMB rest frame?
A. No. Correcting all your redshifts towards a frame that has never been found is a good way to end up with ‘dark energy’. It is worth noting that while the CMB dipole has been known since 1994, supernova data have been corrected towards the CMB rest frame only after 2010, for what appear to be independent reasons.
Q. Can I combine new data with existing Supernova data?
A. No. The current generation of publicly available supernova data suffer from the natural biases that are to be expected when data are compiled incrementally through a human mediated process. It would be better to start fresh with a new sample.
Q. Is ‘dark energy’ fundamental or new physics?
A. Given that general relativity is a 100+ year old theory and significant difficulties exist in describing the late time Universe with it, it is unnecessary to invoke new fundamental physics when confronting any apparent acceleration of the real Universe. All signs suggest that what has been ascribed to dark energy is the result of a community that is hell bent on repeating what Einstein supposedly called his greatest mistake.
Digging deeper:
The inquisitive reader may explore the resources on inhomogeneous cosmology, as well as the works of George Ellis, Thomas Buchert and David Wiltshire.
Several days ago NZ PM Jacinda Ardern released her now-viral 2-minute policy achievement challenge.
I have prepared a response on behalf of Australia.
Just like I did last year, and the year before, I’m putting up a post to let y’all know about opportunities in our growing Quantum Information Center at UT Austin.
I’m proud to report that we’re building something pretty good here. This fall Shyam Shankar joined our Electrical and Computer Engineering (ECE) faculty to do experimental superconducting qubits, while (as I blogged in the summer) the quantum complexity theorist John Wright will join me on the CS faculty in Fall 2020. Meanwhile, Drew Potter, an expert on topological qubits, rejoined our physics faculty after a brief leave. Our weekly quantum information group meeting now regularly attracts around 30 participants—from the turnout, you wouldn’t know it’s not MIT or Caltech or Waterloo. My own group now has five postdocs and six PhD students—as well as some amazing undergrads striving to meet the bar set by Ewin Tang. Course offerings in quantum information currently include Brian La Cour’s Freshman Research Initiative, my own undergrad Intro to Quantum Information Science honors class, and graduate classes on quantum complexity theory, experimental realizations of QC, and topological matter (with more to come). We’ll also be starting an undergraduate Quantum Information Science concentration next fall.
So without further ado:
(1) If you’re interested in pursuing a PhD focused on quantum computing and information (and/or classical theoretical computer science) at UT Austin: please apply! If you want to work with me or John Wright on quantum algorithms and complexity, apply to CS (I can also supervise physics students in rare cases). Also apply to CS, of course, if you want to work with our other CS theory faculty: David Zuckerman, Dana Moshkovitz, Adam Klivans, Anna Gal, Eric Price, Brent Waters, Vijaya Ramachandran, or Greg Plaxton. If you want to work with Drew Potter on nonabelian anyons or suchlike, or with Allan MacDonald, Linda Reichl, Elaine Li, or others on many-body quantum theory, apply to physics. If you want to work with Shyam Shankar on superconducting qubits, apply to ECE. Note that the deadline for CS and physics is December 1, while the deadline for ECE is December 15.
You don’t need to ask me whether I’m on the lookout for great students: I always am! If you say on your application that you want to work with me, I’ll be sure to see it. Emailing individual faculty members is not how it works and won’t help. Admissions are extremely competitive, so I strongly encourage you to apply broadly to maximize your options.
(2) If you’re interested in a postdoc in my group, I’ll have approximately two openings starting in Fall 2020. To apply, just send me an email by January 1, 2020 with the following info:
– Your CV
– 2 or 3 of your best papers (links or PDF attachments)
– The names of two recommenders (who should email me their letters separately)
(3) If you’re on the faculty job market in quantum computing and information—well, please give me a heads-up if you’re potentially interested in Austin! Our CS, physics, and ECE departments are all open to considering additional candidates in quantum information, both junior and senior. I can’t take credit for this—it surely has to do with developments beyond my control, both at UT and beyond—but I’m happy to relay that, in the three years since I arrived in Texas, the appetite for strengthening UT’s presence in quantum information has undergone jaw-dropping growth at every level of the university.
Also, Austin-Bergstrom International Airport now has direct flights to London, Frankfurt, and (soon) Amsterdam and Paris.
Hook ’em Hadamards!
I wrote something for the Spanish newspaper El País, which has a column on mathematics called “Café y Teoremas”. Ágata Timón helped me a lot with writing this, and she also translated it into Spanish:
• John Baez, Qué es la teoría de categorías y cómo se ha convertido en tendencia, El País, 8 November 2019.
Here’s the English-language version I wrote. It’s for a general audience so don’t expect hard-core math!
Recently, various scientific media have been paying attention to a branch of mathematics called “category theory” that has become pretty popular inside the mathematical community in recent years. Some mathematicians are even starting to complain on Twitter that more people are tweeting about category theory than their own specialties. But what is this branch of mathematics, and why is it becoming so fashionable?
Category theory was invented in 1945 as a general technique to transform problems in one field of pure mathematics into problems in another field, where they could be solved. For example, we know that at any moment there must be a location on the surface of the Earth where the wind velocity is zero. This is a marvelous result—but to prove this result, we must translate it into a fact about algebra, and a bit of category theory is very helpful here. More difficult results often require more category theory. The proof of Fermat’s Last Theorem, for example, builds on a vast amount of 20th-century mathematics, in which category theory plays a crucial role.
Category theory is sometimes called “the mathematics of mathematics”, since it stands above many other fields of mathematics, connecting and uniting them. Unfortunately even mathematicians have a limited tolerance for this high level of abstraction. So, for a long time many mathematicians called category theory “abstract nonsense”—using it reluctantly when it was necessary for their work, but not really loving it.
On the other hand, other mathematicians embraced the beauty and power of category theory. Thus, its influence has gradually been spreading. Since the 1990s, it has been infiltrating computer science: for example, new programming languages like Haskell and Scala use ideas from this subject. But now we are starting to see people apply category theory to chemistry, electrical engineering, and even the design of brakes in cars! “Applied category theory”, once an oxymoron, is becoming a real subject.
To understand this we need a little taste of the ideas. A category consists of a set of “objects” together with “morphisms”—some kind of processes, or paths—going between these objects. For example, we could take the objects to be cities, and the morphisms to be routes from one city to another. The key requirement is that if we have a morphism from an object x to an object y and a morphism from y to an object z, we can “compose” them and get a morphism from x to z. For example, if you have a way to drive from Madrid to Seville and a way to drive from Seville to Faro, that gives a way to drive from Madrid to Faro. Thus there is a category of cities and routes between them.
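The cities-and-routes example can be written down in a few lines of code. This is my own toy sketch, not something from the article: a morphism is a route (a list of cities from source to target), and composition concatenates two routes whenever the endpoints match.

```python
# Toy category of cities and routes: objects are city names, morphisms are
# routes (lists of cities), and composition glues matching routes together.

def compose(f, g):
    """Compose a route f from x to y with a route g from y to z."""
    if f[-1] != g[0]:
        raise ValueError("routes do not match end to end")
    return f + g[1:]       # drop g's first city to avoid repeating y

def identity(city):
    """The do-nothing route at a city (the identity morphism)."""
    return [city]

madrid_seville = ["Madrid", "Cordoba", "Seville"]
seville_faro = ["Seville", "Huelva", "Faro"]
madrid_faro = compose(madrid_seville, seville_faro)
```

One can check that this composition is associative and that identity routes compose trivially, which are exactly the axioms a category must satisfy.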
In mathematics, this focus on morphisms represented a radical shift of viewpoint. Starting around 1900, logicians tried to build the whole of mathematics on solid foundations. This turned out to be a difficult and elusive task, but their best attempt at the time involved “set theory”. A set is simply a collection of elements. In set theory as commonly practiced by mathematicians, these elements are also just sets. In this worldview, everything is just a set. It is a static worldview, as if we had objects but no morphisms. On the other hand, category theory builds on set theory by emphasizing morphisms—ways of transforming things—as equal partners to things themselves. It is not incompatible with set theory, but it offers new ways of thinking.
The idea of a category is simple. Exploiting it is harder. A loose group of researchers are starting to apply category theory to subjects beyond pure mathematics. The key step is to focus a bit less on things and a bit more on morphisms, which are ways to go between things, or ways to transform one thing into another. This attitude is well suited to computer programming: a program is a way to transform input data into output data, and composing programs is the easiest way to build complicated programs from simpler ones. But personally, I am most excited by applications to engineering and the natural sciences, because these are newer and more surprising.
I was very pleased when two of my students got internships at the engineering firm Siemens, applying category theory to industrial processes. The first, Blake Pollard, now has a postdoctoral position at the National Institute of Standards and Technology in the USA. Among other things, he has used a programming method based on category theory to help design a “smart grid”—an electrical power network that is flexible enough to handle the ever-changing power generated by thousands of homes equipped with solar panels.
Rumors say that soon there may even be an institute of applied category theory, connecting mathematicians to programmers and businesses who need this way of thinking. It is too early to tell if this is the beginning of a trend, but my friends and colleagues on Twitter are very excited.
I recently read a biography of James Franck. Many of you won’t recognize the name, but physicists might remember the Franck-Hertz experiment from a lab class. Franck and Hertz performed a decisive test of Bohr’s model of the atom, ushering in the quantum age and receiving the 1925 Nobel Prize. After fleeing Germany when Hitler took power, Franck worked on the Manhattan Project and co-authored the Franck Report urging the US not to use nuclear bombs on Japan. He settled at the University of Chicago, which named an institute after him.*
You can find all that on his Wikipedia page. The page also mentions his marriage later in life to Hertha Sponer. Her Wikipedia page talks about her work in spectroscopy, about how she was among the first women to receive a PhD in Germany and the first on the physics faculty at Duke University, and that she remained a professor there until 1966, when she was 70.
Neither Wikipedia page talks about two-body problems, or teaching loads, or pensions.
That’s why I was surprised when the biography covered Franck’s later life. Until Franck died, he and Sponer would travel back and forth, he visiting her at Duke and she visiting him in Chicago. According to the biography, this wasn’t exactly by choice: they both would have preferred to live together in the same city. Somehow though, despite his Nobel Prize and her scientific accomplishments, they never could. The biography talks about how the university kept her teaching class after class, so she struggled to find time for research. It talks about what happened as the couple got older, as their health made it harder and harder to travel back and forth, and they realized that without access to their German pensions they would not be able to support themselves in retirement. The biography gives the impression that Sponer taught till 70 not out of dedication but because she had no alternative.
When we think about the heroes of the past, we imagine them battling foes with historic weight: sexism, antisemitism, Nazi-ism. We don’t hear about their more everyday battles, with academic two-body problems and stingy universities. From this, we can get the impression that the dysfunctions of modern academia are new. But while the problems have grown, we aren’t the first academics with underpaid, overworked teaching faculty, nor the first to struggle to live where we want and love who we want. These are struggles academics have faced for a long, long time.
*Full disclosure: Franck was also my great-great-grandfather, hence I may find his story more interesting than most.
This morning a humanities teacher named Richard Horan, having read my NYT op-ed on quantum supremacy, emailed me the following question about it:
Is this pursuit [of scalable quantum computation] just an arms race? A race to see who can achieve it first? To what end? Will this achievement yield advances in medical science and human quality of life, or will it threaten us even more than we are threatened presently by our technologies? You seem rather sanguine about its possible development and uses. But how close does the hand on that doomsday clock move to midnight once we “can harness an exponential number of amplitudes for computation”?
I thought this question might possibly be of some broader interest, so here’s my response (with some light edits).
Dear Richard,
A radio interviewer asked me a similar question a couple weeks ago—whether there’s an ethical dimension to quantum computing research. I replied that there’s an ethical dimension to everything that humans do.
A quantum computer is not like a nuclear weapon: it’s not going to directly kill anybody (unless the dilution refrigerator falls on them or something?). It’s true that a full, scalable QC, if and when it’s achieved, will give a temporary advantage to people who want to break certain cryptographic codes. The morality of that, of course, could strongly depend on whether the codebreakers are working for the “good guys” (like the Allies during WWII) or the “bad guys” (like, perhaps, Trump or Vladimir Putin or Xi Jinping).
But in any case, there’s already a push to switch to new cryptographic codes that already exist and that we think are quantum-resistant. An actual scalable QC on the horizon would of course massively accelerate that push. And once people make the switch, we expect that security on the Internet will be more-or-less back where it started.
Meanwhile, the big upside potential from QCs is that they’ll provide an unprecedented ability to simulate physics and chemistry at the molecular level. That could at least potentially help with designing new medicines, as well as new batteries and solar cells and carbon capture technologies—all things that the world desperately needs right now.
Also, the theory developed around QC has already led to many new and profound insights about physics and computation. Some of us regard that as an inherent good, in the same way that art and music and literature are.
Now, one could argue that the climate crisis, or various other crises that our civilization faces, are so desperate that instead of working to build QCs, we should all just abandon our normal work and directly confront the crises, as (for example) Greta Thunberg is doing. On some days I share that position. But of course, were the position upheld, it would have implications not just for quantum computing researchers but for almost everyone on earth—including humanities teachers like yourself.
Best,
Scott
It’s taken longer than we originally intended, but I am very happy to report that Advances in Combinatorics, a new arXiv overlay journal that is run along similar lines to Discrete Analysis, has just published its first five articles, with plenty more in the pipeline.
I am excited by the business model of the journal, which is that its very small running costs (like Discrete Analysis, it uses the Scholastica platform, which charges $10 per submission, as well as a fixed annual charge of $250, and there are a few other costs such as archiving articles with CLOCKSS and having DOIs) are being met, for the next five years in the first instance, by the library of Queen’s University in Kingston, Ontario, who are also providing us with very useful administrative support. My dream would be for other libraries to have the foresight to support similar ventures, since the potential for savings if the stranglehold of the big commercial publishers is broken is huge. I do not underestimate the obstacles, but for a long time I have felt that what we are waiting for is a tipping point, and even quite a small venture can in principle have a significant effect on when that tipping point occurs.
The aim of Advances in Combinatorics is to be a top specialist combinatorics journal. Information about the editorial board, the submission process, and of course the articles themselves, can all be found on the website. Like Discrete Analysis, the journal has Editorial Introductions to each article, with the aim of making the webpage informative and fun to browse. We will be grateful for any feedback, and even more grateful for support in the form of excellent combinatorics papers while we are still getting established.
A final remark is that although I have reached the limit of the number of journals of this type that I can be closely involved in, I would be delighted to hear from anybody who thinks that they can put together a high-quality editorial board in an area that does not have a suitable journal for people who want to publish good papers without supporting the big commercial publishers. I know people who can offer advice about suitable platforms, funding, and similar matters. It would be great if free-to-read free-to-publish journals could cover all of mathematics, but we are still a long way from that at the moment.
Every spring, a portal opens between Waltham, Massachusetts and another universe.
The other universe has a Watch City dual to Waltham, known for its watch factories. The cities throw a festival to which explorers, inventors, and tourists flock. Top hats, goggles, leather vests, bustles, and lace-up boots dot the crowds. You can find pet octopodes, human-machine hybrids, and devices for bending space and time. Steam powers everything.
So I learned thanks to Maxim Olshanyi, a professor of physics at the University of Massachusetts Boston. He hosted my colloquium, “Quantum steampunk: Quantum information meets thermodynamics,” earlier this month. Maxim, I discovered, has more steampunk experience than I. He digs up century-old designs for radios, builds the radios, and improves upon the designs. He exhibits his creations at the Watch City Steampunk Festival.
I never would have guessed that Maxim moonlights with steampunkers. But his hobby makes sense: Maxim has transformed our understanding of quantum integrability.
Integrability is to thermalization as Watch City is to Waltham. A bowl of baked beans thermalizes when taken outside in Boston in October: Heat dissipates into the air. After half-an-hour, large-scale properties bear little imprint of their initial conditions: The beans could have begun at 112ºF or 99º or 120º. Either way, the beans have cooled.
Integrable systems avoid thermalizing; more of their late-time properties reflect early times. Why? We can understand through an example, an integrable system whose particles don’t interact with each other (whose particles are noninteracting fermions). The dynamics conserve the particles’ momenta. Consider growing the system by adding particles. The number of conserved quantities grows as the system size. The conserved quantities retain memories of the initial conditions.
Imagine preparing an integrable system, analogously to preparing a bowl of baked beans, and letting it sit for a long time. Will the system equilibrate, or settle down to, a state predictable with a simple rule? We might expect not. Obeying the same simple rule would cause different integrable systems to come to resemble each other. Integrable systems seem unlikely to homogenize, since each system retains much information about its initial conditions.
Maxim and collaborators exploded this expectation. Integrable systems do relax to simple equilibrium states, which the physicists called the generalized Gibbs ensemble (GGE). Josiah Willard Gibbs cofounded statistical mechanics during the 1800s. He predicted the state to which nonintegrable systems, like baked beans in autumnal Boston, equilibrate. Gibbs’s theory governs classical systems, like baked beans, as does the GGE theory. But quantum systems also equilibrate to the GGE, and Gibbs’s conclusions translate into quantum theory with few adjustments. So I’ll explain in quantum terms.
Consider quantum baked beans that exchange heat with a temperature-$T$ environment. Let $\hat{H}$ denote the system’s Hamiltonian, which basically represents the beans’ energy. The beans equilibrate to a quantum Gibbs state, $e^{-\hat{H}/(k_{\rm B}T)}/Z$. The $k_{\rm B}$ denotes Boltzmann’s constant, a fundamental constant of nature. The partition function $Z$ enables the quantum state to obey probability theory (normalizes the state).
Maxim and friends modeled their generalized Gibbs ensemble on the Gibbs state. Let $\hat{I}_m$ denote the $m$th conserved quantity of a quantum integrable system. This system equilibrates to $e^{-\sum_m \lambda_m \hat{I}_m}/Z_{\rm GGE}$. The $Z_{\rm GGE}$ normalizes the state. The intensive parameters $\lambda_m$’s serve analogously to temperature and depend on the conserved quantities’ values. Maxim and friends predicted this state using information theory formalized by Ed Jaynes. Inventing the GGE, they unlocked a slew of predictions about integrable quantum systems.
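A minimal numerical sketch may make the contrast concrete. The model below is my own toy example, not one from the papers: two noninteracting fermionic modes whose occupations n1 and n2 are the conserved quantities, with k_B set to 1. The Gibbs state weights a configuration by exp(-E/T); the GGE assigns one Lagrange multiplier per conserved quantity and reduces to the Gibbs state when each multiplier equals e_i/T.

```python
import math
from itertools import product

# Toy model: two noninteracting fermionic modes with energies e1, e2 (k_B = 1).
# Each mode's occupation (0 or 1) is conserved, so the GGE has two multipliers.
e1, e2, T = 1.0, 2.0, 0.5

def gge_probs(lam1, lam2):
    """GGE probabilities over the four occupation configurations (n1, n2)."""
    weights = {(n1, n2): math.exp(-lam1 * n1 - lam2 * n2)
               for n1, n2 in product((0, 1), repeat=2)}
    z = sum(weights.values())   # partition function: normalizes the state
    return {cfg: w / z for cfg, w in weights.items()}

# Choosing lambda_i = e_i / T recovers the ordinary Gibbs state exp(-E/T)/Z.
gibbs = gge_probs(e1 / T, e2 / T)
mean_n1 = sum(n1 * p for (n1, _), p in gibbs.items())
fermi_n1 = 1.0 / (math.exp(e1 / T) + 1.0)  # Fermi-Dirac occupation of mode 1
```

With other choices of the multipliers, the GGE can hold the two occupations at values no single temperature could produce, which is how it remembers more of the initial conditions than a Gibbs state does.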
I define quantum steampunk as the intersection of quantum theory, especially quantum information theory, with thermodynamics, and the application of this intersection across science. Maxim has used information theory to cofound a branch of quantum statistical mechanics. Little wonder that he exhibits homemade radios at the Watch City Steampunk Festival. He also holds a license to drive steam engines and used to have my postdoc position. I appreciate having older cousins to look up to. Here’s hoping that I become half the quantum steampunker that I found by Massachusetts Bay.
With thanks to Maxim and the rest of the University of Massachusetts Boston Department of Physics for their hospitality.
The next Watch City Steampunk Festival takes place on May 9, 2020. Contact me if you’d attend a quantum-steampunk meetup!
I’ve just uploaded to the arXiv my paper “Almost all Collatz orbits attain almost bounded values”, submitted to the proceedings of the Forum of Mathematics, Pi. In this paper I returned to the topic of the notorious Collatz conjecture (also known as the $3x+1$ conjecture), which I previously discussed in this blog post. This conjecture can be phrased as follows. Let ${\mathbb N}+1 = \{1,2,3,\dots\}$ denote the positive integers (with ${\mathbb N} = \{0,1,2,\dots\}$ the natural numbers), and let ${\rm Col}: {\mathbb N}+1 \to {\mathbb N}+1$ be the map defined by setting ${\rm Col}(N)$ equal to $3N+1$ when $N$ is odd and $N/2$ when $N$ is even. Let ${\rm Col}_{\min}(N) := \inf_{n \in {\mathbb N}} {\rm Col}^n(N)$ be the minimal element of the Collatz orbit $N, {\rm Col}(N), {\rm Col}^2(N), \dots$. Then we have
Conjecture 1 (Collatz conjecture) One has ${\rm Col}_{\min}(N)=1$ for all $N \in {\mathbb N}+1$.
Establishing the conjecture for all $N$ remains out of reach of current techniques (for instance, as discussed in the previous blog post, it is basically at least as difficult as Baker’s theorem, all known proofs of which are quite difficult). However, the situation is more promising if one is willing to settle for results which only hold for “most” $N$ in some sense. For instance, it is a result of Krasikov and Lagarias that

$\displaystyle \# \{ N \leq x: {\rm Col}_{\min}(N) = 1 \} \gg x^{0.84}$

for all sufficiently large $x$. In another direction, it was shown by Terras that for almost all $N$ (in the sense of natural density), one has ${\rm Col}_{\min}(N) < N$. This was then improved by Allouche to ${\rm Col}_{\min}(N) < N^\theta$ for any fixed $\theta > 0.869$, and extended later by Korec to cover all $\theta > \frac{\ln 3}{\ln 4} \approx 0.7924$. In this paper we obtain the following further improvement (at the cost of weakening natural density to logarithmic density):
Theorem 2 Let $f: {\mathbb N}+1 \to {\mathbb R}$ be any function with $\lim_{N \to \infty} f(N) = +\infty$. Then we have ${\rm Col}_{\min}(N) < f(N)$ for almost all $N \in {\mathbb N}+1$ (in the sense of logarithmic density).
Thus for instance one has ${\rm Col}_{\min}(N) < \log\log\log\log N$ for almost all $N$ (in the sense of logarithmic density).
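As a sanity check on the definitions, here is a throwaway script of my own (not from the paper) that computes the Collatz map and the orbit minimum directly. The conjecture has been verified numerically far beyond the range sampled here, so every minimum comes out as 1.

```python
# Col applies 3N+1 to odd N and N/2 to even N; Col_min(N) is the least
# element of the orbit N, Col(N), Col^2(N), ...

def col(n):
    return 3 * n + 1 if n % 2 else n // 2

def col_min(n):
    """Minimal element of the Collatz orbit of n.  We iterate until the
    orbit hits 1, which is known to happen for every n in this range."""
    least = n
    while n != 1:
        n = col(n)
        least = min(least, n)
    return least

minima = [col_min(n) for n in range(1, 10001)]
```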
The difficulty here is that one usually only expects to establish “local-in-time” results that control the evolution ${\rm Col}^n(N)$ for times $n$ that only get as large as a small multiple of $\log N$; the aforementioned results of Terras, Allouche, and Korec, for instance, are of this type. However, to get all the way down to $f(N)$ one needs something more like an “(almost) global-in-time” result, where the evolution remains under control for so long that the orbit has nearly reached the bounded state ${\rm Col}_{\min}(N) = O(1)$.
However, as observed by Bourgain in the context of nonlinear Schrödinger equations, one can iterate “almost sure local wellposedness” type results (which give local control for almost all initial data from a given distribution) into “almost sure (almost) global wellposedness” type results if one is fortunate enough to draw one’s data from an invariant measure for the dynamics. To illustrate the idea, let us take Korec’s aforementioned result that if one picks at random an integer $N$ from a large interval $[1,x]$, then in most cases, the orbit of $N$ will eventually move into the interval $[1, x^{0.7924}]$. Similarly, if one picks an integer $M$ at random from $[1, x^{0.7924}]$, then in most cases, the orbit of $M$ will eventually move into $[1, x^{0.7924^2}]$. It is then tempting to concatenate the two statements and conclude that for most $N$ in $[1,x]$, the orbit will eventually move into $[1, x^{0.7924^2}]$. Unfortunately, this argument does not quite work, because by the time the orbit from a randomly drawn $N \in [1,x]$ reaches $[1, x^{0.7924}]$, the distribution of the final value is unlikely to be close to being uniformly distributed on $[1, x^{0.7924}]$, and in particular could potentially concentrate almost entirely in the exceptional set of $M \in [1, x^{0.7924}]$ that do not make it into $[1, x^{0.7924^2}]$. The point here is that the uniform measure on $[1,x]$ is not transported by Collatz dynamics to anything resembling the uniform measure on $[1, x^{0.7924}]$.
So, one now needs to locate a measure which has better invariance properties under the Collatz dynamics. It turns out to be technically convenient to work with a standard acceleration of the Collatz map known as the Syracuse map , defined on the odd numbers by setting , where is the largest power of that divides . (The advantage of using the Syracuse map over the Collatz map is that it performs precisely one multiplication of at each iteration step, which makes the map better behaved when performing “-adic” analysis.)
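The Syracuse map is easy to implement directly (a minimal sketch; `syracuse` is my own name for the function):

```python
def syracuse(n):
    """Syracuse map on odd n: divide 3n+1 by the largest power of 2 dividing it.
    The result is again odd."""
    assert n % 2 == 1
    m = 3 * n + 1
    while m % 2 == 0:
        m //= 2
    return m

# Syr(n) is odd and never divisible by 3, since 3n+1 is 1 mod 3
assert all(syracuse(n) % 2 == 1 and syracuse(n) % 3 != 0
           for n in range(1, 2000, 2))
```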
When viewed -adically, we soon see that iterations of the Syracuse map become somewhat irregular. Most obviously, is never divisible by . A little less obviously, is twice as likely to equal mod as it is to equal mod . This is because for a randomly chosen odd , the number of times that divides can be seen to have a geometric distribution of mean – it equals any given value with probability . Such a geometric random variable is twice as likely to be odd as to be even, which is what gives the above irregularity. There are similar irregularities modulo higher powers of . For instance, one can compute that for large random odd , will take the residue classes with probabilities
respectively. More generally, for any , will be distributed according to the law of a random variable on that we call a Syracuse random variable, and can be described explicitly as
where are iid copies of a geometric random variable of mean .
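One can derive an explicit form for such a sampler by iterating the Syracuse map n times and dropping the term divisible by 3^n. A sketch based on that derivation (function names are my own):

```python
import random

def sample_geometric_mean2(rng):
    """Geometric random variable with P(a = k) = 2^{-k}, k = 1, 2, ...; mean 2."""
    a = 1
    while rng.random() < 0.5:
        a += 1
    return a

def sample_syracuse_rv(n, rng):
    """Sample a Syracuse random variable on Z/3^n Z:
    2^{-(a_1+...+a_n)} (3^{n-1} + 3^{n-2} 2^{a_1} + ... + 2^{a_1+...+a_{n-1}}) mod 3^n
    with a_1, ..., a_n iid geometric of mean 2 (my derived form, obtained by
    iterating Syr on an odd N and reducing mod 3^n)."""
    a = [sample_geometric_mean2(rng) for _ in range(n)]
    mod = 3 ** n
    total = sum(3 ** (n - 1 - j) * pow(2, sum(a[:j]), mod) for j in range(n))
    return total * pow(2, -sum(a), mod) % mod  # pow(.., -s, mod): Python 3.8+

rng = random.Random(1)
samples = [sample_syracuse_rv(2, rng) for _ in range(30000)]
# the samples avoid multiples of 3, as they must
assert all(s % 3 != 0 for s in samples)
```

As a spot check: with n = 2 and (a_1, a_2) = (1, 1) the formula gives (3 + 2)·4^{-1} = 8 mod 9, which agrees with Syr²(7) = 17 ≡ 8 (mod 9).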
In view of this, any proposed “invariant” (or approximately invariant) measure (or family of measures) for the Syracuse dynamics should take this -adic irregularity of distribution into account. It turns out that one can use the Syracuse random variables to construct such a measure, but only if these random variables stabilise in the limit in a certain total variation sense. More precisely, in the paper we establish the estimate
for any and any . This type of stabilisation is plausible from entropy heuristics – the tuple of geometric random variables that generates has Shannon entropy , which is significantly larger than the total entropy of the uniform distribution on , so we expect a lot of “mixing” and “collision” to occur when converting the tuple to ; these heuristics can be supported by numerics (which I was able to work out up to about before running into memory and CPU issues), but it turns out to be surprisingly delicate to make this precise.
A first hint of how to proceed comes from the elementary number theory observation (easily proven by induction) that the rational numbers
are all distinct as vary over tuples in . Unfortunately, the process of reducing mod creates a lot of collisions (as must happen from the pigeonhole principle); however, by a simple “Lefschetz principle” type argument one can at least show that the reductions
are mostly distinct for “typical” (as drawn using the geometric distribution) as long as is a bit smaller than (basically because the rational number appearing in (3) then typically takes a form like with an integer between and ). This analysis of the component (3) of (1) is already enough to get quite a bit of spreading on (roughly speaking, when the argument is optimised, it shows that this random variable cannot concentrate in any subset of of density less than for some large absolute constant ). To get from this to a stabilisation property (2) we have to exploit the mixing effects of the remaining portion of (1) that does not come from (3). After some standard Fourier-analytic manipulations, matters then boil down to obtaining non-trivial decay of the characteristic function of , and more precisely to showing that
for any and any that is not divisible by .
If the random variable (1) were the sum of independent terms, one could express this characteristic function as something like a Riesz product, which would be straightforward to estimate well. Unfortunately, the terms in (1) are loosely coupled together, and so the characteristic function does not immediately factor into a Riesz product. However, if one groups adjacent terms in (1) together, one can rewrite it (assuming is even for the sake of discussion) as
where . The point here is that after conditioning on the to be fixed, the random variables remain independent (though the distribution of each depends on the value that we conditioned to), and so the above expression is a conditional sum of independent random variables. This lets one express the characteristic function of (1) as an averaged Riesz product. One can use this to establish the bound (4) as long as one can show that the expression
is not close to an integer for a moderately large number (, to be precise) of indices . (Actually, for technical reasons we have to also restrict to those for which , but let us ignore this detail here.) To put it another way, if we let denote the set of pairs for which
we have to show that (with overwhelming probability) the random walk
(which we view as a two-dimensional renewal process) contains at least a few points lying outside of .
A little bit of elementary number theory and combinatorics allows one to describe the set as the union of “triangles” with a certain non-zero separation between them. If the triangles were all fairly small, then one expects the renewal process to visit at least one point outside of after passing through any given such triangle, and it then becomes relatively easy to show that the renewal process usually has the required number of points outside of . The most difficult case is when the renewal process passes through a particularly large triangle in . However, it turns out that large triangles enjoy particularly good separation properties, and in particular after passing through a large triangle one is likely to encounter nothing but small triangles for a while. After making these heuristics more precise, one is finally able to get enough points on the renewal process outside of that one can finish the proof of (4), and thus Theorem 2.
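The Riesz-product manipulation above rests on a standard fact: for independent random variables, the characteristic function of the sum factors into the product of the individual characteristic functions. A toy numerical check (the two small distributions below are hypothetical stand-ins, not the actual Syracuse terms):

```python
import cmath
import itertools

def char_fn(dist, t):
    """Characteristic function E[e^{i t X}] of a finite distribution {value: prob}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

# two small independent distributions (illustrative only)
X = {0: 0.5, 1: 0.25, 2: 0.25}
Y = {0: 0.5, 3: 0.5}

# distribution of the independent sum X + Y, by convolution
S = {}
for (x, px), (y, py) in itertools.product(X.items(), Y.items()):
    S[x + y] = S.get(x + y, 0) + px * py

t = 0.7
lhs = char_fn(S, t)                    # E e^{it(X+Y)}
rhs = char_fn(X, t) * char_fn(Y, t)    # product of the factors
assert abs(lhs - rhs) < 1e-12
```

In the paper the summands are only conditionally independent, which is why one ends up with an averaged product rather than a single one.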
We’re having a workshop to promote diversity in math here at UCR:
• Riverside Mathematics Workshop for Excellence and Diversity, Friday 8 November 2019, U. C. Riverside. Organized by John Baez, Weitao Chen, Edray Goins, Ami Radunskaya, and Fred Wilhelm.
If you want to come, please register here.
It’s happening right before the applied category theory meeting, so I hope some of you can make both… especially since the category theorist Eugenia Cheng will be giving a talk!
Three talks will take place in Skye Hall—home of the math department—starting at 1 pm. After this we’ll have refreshments and an hour for students to talk to the speakers. Starting at 6 pm there will be a reception across the road at the UCR Alumni Center, with food and a panel discussion on the challenges we face in promoting diversity at U.C. Riverside.
All the talks will be in Skye 284:
• 1:00–1:50 p.m. Abba Gumel, Arizona State University.
Some models for enhancing diversity and capacity-building in STEM education in under-represented minority communities.
STEM (science, technology, engineering and mathematics) education is undoubtedly the necessary bedrock for the development and sustenance of the vitally-needed knowledge-based economy that fuels and sustains the development of modern nations. Central to STEM education are, of course, the mathematical sciences … which are the rock-solid foundation of all the natural and engineering sciences. Hence, it is vital that all diverse populations are not left behind in the quest to build and sustain capacity in the mathematical sciences. This talk focuses on discussion around a number of pedagogic and mentorship models that have been (and are being) used to help increase diversity and capacity-building in STEM education in general, and in the mathematical sciences in particular, in under-represented minority populations. Some examples from Africa, Canada and the U.S. will be presented.
• 2:00–2:50 p.m. Marissa Loving, Georgia Tech.
Where do I belong? Creating space in the math community.
I will tell the story of my mathematical journey with a focus on my time in grad school. I will be blunt about the ups and downs I have experienced and touch on some of the barriers (both structural and internalized) I have encountered. I will also discuss some of the programs and spaces I have helped create in my quest to make the mathematics community into a place where folks from historically under-represented groups (particularly women of color) can feel safe, seen, and free to devote their energy to their work. If you have ever felt like you don’t belong or worried that you have made others feel that way, this talk is for you.
• 3:00–3:50 p.m. Eugenia Cheng, School of the Art Institute of Chicago.
Inclusion–exclusion in mathematics and beyond: who stays in, who falls out, why it happens, and what we could do about it.
The question of why women and minorities are under-represented in mathematics is complex and there are no simple answers, only many contributing factors. I will focus on character traits, and argue that if we focus on this rather than gender we can have a more productive and less divisive conversation. To try and focus on characters rather than genders I will introduce gender-neutral character adjectives “ingressive” and “congressive” as a new dimension to shift our focus away from masculine and feminine. I will share my experience of teaching congressive abstract mathematics to art students, in a congressive way, and the possible effects this could have for everyone in mathematics, not just women. Moreover I will show that abstract mathematics is applicable to working towards a more inclusive, congressive society in this politically divisive era. This goes against the assumption that abstract math can only be taught to high level undergraduates and graduate students, and the accusation that it is removed from real life.
• 4:00–4:30 p.m. Refreshments in Skye 284.
• 4:30–5:30 p.m. Conversations Between Speakers & Students, Not Faculty, in Skye 284.
• 6:00–6:45 p.m. Reception with Food at the Alumni Center.
• 6:45–7:45 p.m. Panel Discussion at Alumni Center with Alissa Crans, Jose Gonzalez and Paige Helms, moderated by Edray Goins.
I’m a nice Jewish boy, so I grew up eating a lot of brisket. Brisket’s an interesting piece of meat. By almost any reasonable standard, it’s an absolutely godawful cut of beef. It’s ridiculously tough. We’re not talking just a little bit chewy here: you can cook a hunk of brisket for four hours and still have something that’s inedible, because your teeth can’t break it down. It’s got a huge layer of fat on top – but the meat itself is completely lean – so if you cook it long enough to be chewable, it can be dry as a bone.
But my ancestors were peasants. They couldn’t afford to eat beef normally, and when special occasions rolled around, the only beef they could afford was the stuff that no one else wanted. So they got briskets.
If you get interested in foods, though, you learn that many of the best foods in the world started off with some poor peasant who wanted to make something delicious, but couldn’t afford expensive ingredients! Brisket is a perfect example. Cook it for a good long time, or in a pressure cooker, with lots of liquid, and lots of seasoning, and it’s one of the most flavorful pieces of the entire animal. Brisket is really delicious, once you manage to break down the structure that makes it so tough. These days, it’s become super trendy, and everyone loves brisket!
Anyway, like I said, I grew up eating Jewish brisket. But then I married a Chinese woman, and in our family, we always try to blend traditions as much as we can. In particular, because we’re both food people, I’m constantly trying to take things from my tradition and blend some of her tradition into it. So I wanted to find a way of blending some Chinese flavors into my brisket. What I wound up with is more Japanese than Chinese, but it works. The smoky flavor of the dashi is perfect for the sweet meatiness of the brisket, and the onions slowly cook and sweeten, and you end up with something that is recognizably similar to the traditional Jewish onion-braised brisket, but also very distinctly different.
Ingredients:
Instructions
Here’s the second of a series of blog articles:
• Adam Marblestone, Climate technology primer (2/3): CO_{2} removal.
The first covered the basics of climate science as related to global warming. This one moves on to consider technologies for removing carbon dioxide from the air.
I hope you keep the following warning in mind as you read on:
I’m focused here on trying to understand the narrowly technical aspects, not on the political aspects, despite those being crucial. This is meant to be a review of the technical literature, not a political statement. I worried that writing a blog purely on the topic of technological intervention in the climate, without attempting or claiming to do justice to the social issues raised, would implicitly suggest that I advocate a narrowly technocratic or unilateral approach, which is not my intention. By focusing on technology, I don’t mean to detract from the importance of the social and policy aspects.
The technological issues are worth studying on their own, since they constrain what’s possible. For example: drawing down as much CO_{2} as human civilization is emitting now, with trees at their peak growth phase and their carbon stored permanently, would require covering the whole USA with such trees.
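A rough sanity check of that scale claim (the emission and uptake figures below are my own ballpark assumptions, not numbers from the post):

```python
# Back-of-envelope: how much land would peak-growth forest need to absorb
# current global CO2 emissions? All figures are rough assumptions.
global_emissions_t_per_yr = 40e9   # ~40 GtCO2/yr emitted by civilization
peak_uptake_t_per_ha_yr = 40       # optimistic peak forest uptake, tCO2/ha/yr
usa_area_ha = 9.8e6 * 100          # ~9.8 million km^2, at 100 ha per km^2

needed_ha = global_emissions_t_per_yr / peak_uptake_t_per_ha_yr
print(needed_ha / usa_area_ha)     # ratio of required land to USA area
```

With these (optimistic) numbers the required area comes out within a factor of two of the USA's total area, consistent with the claim.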
With my PhD student Madhav Krishnan Vijayan and old PhD colleague Austin Lund.
Full paper available at https://arxiv.org/abs/1910.03093
Error-detection and correction are necessary prerequisites for any scalable quantum computing architecture. Given the inevitability of unwanted physical noise in quantum systems and the propensity for errors to spread as computations proceed, computational outcomes can become substantially corrupted. This observation applies regardless of the choice of physical implementation. In the context of photonic quantum information processing, there has recently been much interest in passive linear optics quantum computing, which includes boson-sampling, as this model eliminates the highly-challenging requirements for feed-forward via fast, active control. That is, these systems are passive by definition. In usual scenarios, error detection and correction techniques are inherently active, making them incompatible with this model, arousing suspicion that physical error processes may be an insurmountable obstacle. Here we explore a photonic error-detection technique, based on W-state encoding of photonic qubits, which is entirely passive, based on post-selection, and compatible with these near-term photonic architectures of interest. We show that this W-state redundant encoding technique enables the suppression of dephasing noise on photonic qubits via simple fan-out style operations, implemented by optical Fourier transform networks, which can be readily realised today. The protocol effectively maps dephasing noise into heralding failures, with zero failure probability in the ideal no-noise limit.
I wrote a review of this book with chapters by Penrose, Witten, Connes, Atiyah, Smolin and others:
• John Baez, review of Foundations of Mathematics and Physics One Century After Hilbert: New Perspectives, edited by Joseph Kouneiher, Notices of the American Mathematical Society 66 no. 11 (November 2019), 1690–1692.
It gave me a chance to say a bit—just a tiny bit—about the current state of fundamental physics and the foundations of mathematics.