Planet Musings

May 30, 2017

David Hoggmeasuring the velocity of a star

Yesterday and today I wrote code. This is a much rarer activity than I would like! I wrote code to test different methods for measuring the centroid of an absorption line in a stellar spectrum, with applications to extreme precision radial-velocity experiments. After some crazy starts and stops, I was able to strongly confirm my strong expectation: Cross-correlation with a realistic template is far better for measuring radial velocities than cross-correlation with a bad template (especially a binary mask). I am working out the full complement of experiments I want to do. I am convinced that there is a (very boring) paper to be written.

Scott AaronsonThe Social Justice Warriors are right

As you might know, I haven’t been exactly the world’s most consistent fan of the Social Justice movement, nor has it been the most consistent fan of me.

I cringe when I read about yet another conservative college lecture shut down by mob violence; or student protesters demanding the firing of a professor for trying gently to argue and reason with them; or an editor forced from his position for writing a (progressive) defense of “cultural appropriation”—a practice that I take to have been ubiquitous for all of recorded history, and without which there wouldn’t be any culture at all.  I cringe not only because I know that I was in the crosshairs once before and could easily be again, but also because, it seems to me, the Social Justice scalp-hunters are so astoundingly oblivious to the misdirection of their energies, to the power of their message for losing elections and neutering the progressive cause, to the massive gift their every absurdity provides to the world’s Fox Newses and Breitbarts and Trumps.

Yet there’s at least one issue where it seems to me that the Social Justice Warriors are 100% right, and their opponents 100% wrong. This is the moral imperative to take down every monument to Confederate “war heroes,” and to rename every street and school and college named after individuals whose primary contribution to the world was to defend chattel slavery.  As a now-Southerner, I have a greater personal stake here than I did before: UT Austin just recently removed its statue of Jefferson Davis, while keeping up its statue of Robert E. Lee.  My kids will likely attend what until very recently was called Robert E. Lee Elementary—this summer renamed Russell Lee Elementary.  (My suggestion, that the school be called T. D. Lee Parity Violation Elementary, was sadly never considered.)

So I was gratified that last week, New Orleans finally took down its monuments to slavers.  Mayor Mitch Landrieu’s speech, setting out the reasons for the removal, is worth reading.

I used to have little patience for “merely symbolic” issues: would that offensive statues and flags were the worst problems!  But it now seems to me that the fight over Confederate symbols is just a thinly-veiled proxy for the biggest moral question that’s faced the United States through its history, and also the most urgent question facing it in 2017.  Namely: Did the Union actually win the Civil War? Were the anti-Enlightenment forces—the slavers, the worshippers of blood and land and race and hierarchy—truly defeated? Do those forces acknowledge the finality and the rightness of their defeat?

For those who say that, sure, slavery was bad and all, but we need to keep statues to slavers up so as not to “erase history,” we need only change the example. Would we similarly defend statues of Hitler, Himmler, and Goebbels, looming over Berlin in heroic poses?  Yes, let Germans reflect somberly and often on this aspect of their heritage—but not by hoisting a swastika over City Hall.

For those who say the Civil War wasn’t “really” about slavery, I reply: this is the canonical example of a “Mount Stupid” belief, the sort of thing you can say only if you’ve learned enough to be wrong but not enough to be unwrong.  In 1861, the Confederate ringleaders themselves loudly proclaimed to future generations that, indeed, their desire to preserve slavery was their overriding reason to secede. Here’s CSA Vice-President Alexander Stephens, in his famous Cornerstone Speech:

Our new government is founded upon exactly the opposite ideas; its foundations are laid, its cornerstone rests, upon the great truth that the negro is not equal to the white man; that slavery, subordination to the superior race, is his natural and normal condition. This, our new government, is the first, in the history of the world, based upon this great physical, philosophical, and moral truth.

Here’s Texas’ Declaration of Secession:

We hold as undeniable truths that the governments of the various States, and of the confederacy itself, were established exclusively by the white race, for themselves and their posterity; that the African race had no agency in their establishment; that they were rightfully held and regarded as an inferior and dependent race, and in that condition only could their existence in this country be rendered beneficial or tolerable. That in this free government all white men are and of right ought to be entitled to equal civil and political rights; that the servitude of the African race, as existing in these States, is mutually beneficial to both bond and free, and is abundantly authorized and justified by the experience of mankind, and the revealed will of the Almighty Creator, as recognized by all Christian nations; while the destruction of the existing relations between the two races, as advocated by our sectional enemies, would bring inevitable calamities upon both and desolation upon the fifteen slave-holding states.

It was only when defeat looked inevitable that the slavers started changing their story, claiming that their real grievance was never about slavery per se, but only “states’ rights” (states’ right to do what, exactly?). So again, why should take the slavers’ rationalizations any more seriously than we take the postwar epiphanies of jailed Nazis that actually, they’d never felt any personal animus toward Jews, that the Final Solution was just the world’s biggest bureaucratic mishap?  Of course there’s a difference: when the Allies occupied Germany, they insisted on thorough de-Nazification.  They didn’t suffer streets to be named after Hitler. And today, incredibly, fascism and white nationalism are greater threats here in the US than they are in Germany.  One reads about the historic irony of some American Jews, who are eligible for German citizenship because of grandparents expelled from there, now seeking to move there because they’re terrified about Trump.

By contrast, after a brief Reconstruction, the United States lost its will to continue de-Confederatizing the South.  The leaders were left free to write book after book whitewashing their cause, even to hold political office again.  And probably not by coincidence, we then got nearly a hundred years of Jim Crow—and still today, a half-century after the civil rights movement, southern governors and legislatures that do everything in their power to disenfranchise black voters.

For those who ask: but wasn’t Robert E. Lee a great general who was admired by millions? Didn’t he fight bravely for a cause he believed in?  Maybe it’s just me, but I’m allergic to granting undue respect to history’s villains just because they managed to amass power and get others to go along with them.  I remember reading once in some magazine that, yes, Genghis Khan might have raped thousands and murdered millions, but since DNA tests suggest that ~1% of humanity is now descended from him, we should also celebrate Khan’s positive contribution to “peopling the world.” Likewise, Hegel and Marx and Freud and Heidegger might have been wrong in nearly everything they said, sometimes with horrific consequences, but their ideas still need to be studied reverently, because of the number of other intellectuals who took them seriously.  As I reject those special pleas, so I reject the analogous ones for Jefferson Davis, Alexander Stephens, and Robert E. Lee, who as far as I can tell, should all (along with the rest of the Confederate leadership) have been sentenced for treason.

This has nothing to do with judging the past by standards of the present. By all means, build statues to Washington and Jefferson even though they held slaves, to Lincoln even though he called blacks inferior even while he freed them, to Churchill even though he fought the independence of India.  But don’t look for moral complexity where there isn’t any.  Don’t celebrate people who were terrible even for their own time, whose public life was devoted entirely to what we now know to be evil.

And if, after the last Confederate general comes down, the public spaces are too empty, fill them with monuments to Alan Turing, Marian Rejewski, Bertrand Russell, Hypatia of Alexandria, Emmy Noether, Lise Meitner, Mark Twain, Srinivasa Ramanujan, Frederick Douglass, Vasili Arkhipov, Stanislav Petrov, Raoul Wallenberg, even the inventors of saltwater taffy or Gatorade or the intermittent windshield wiper.  There are, I think, enough people who added value to the world to fill every city square and street sign.

May 29, 2017

Tommaso DorigoBoosted H-->bb Decays Seen By CMS !

The CMS collaboration at the CERN Large Hadron Collider has pulled off an extremely neat new measurement of the Higgs boson production rate - one which, for some reasons, is extraordinary in its own right.

Despite being the decay mode with the highest probability (two thirds of Higgs bosons die that way), the H->bb process is among the most elusive to put in evidence in LHC data, because b-quarks are quite commonplace there. 

read more

May 28, 2017

David Hoggwhat is math? interpolation of imaging

The research highlight of the day was a long call with Dustin Lang (Toronto) to discuss about interpolation, centroiding, and (crazily) lexicographic ordering. The latter is part of a project I want to do to understand how to search in a controlled way for useful statistics or informative anomalies in cosmological data. He found it amusing that my request of mathematicians for a lexicographic ordering of statistical operations was met with the reaction “that's not math, that's philosophy”.

On centroiding and interpolation: It looks like Lang is finding (perhaps not surprisingly) that standard interpolators (the much-used approximations to sinc-interpolation) in astronomy very slightly distort the point-spread function in imaging, and that distortion is a function of sub-pixel shift. He is working on making better interpolators, but both he and I are concerned about reinventing wheels. Some of the things he is worried about will affect spectroscopy as well as imaging, and, since EPRV projects are trying to do things at the 1/1000 pixel level, it might really, really matter.

May 27, 2017

David Hoggchemical correlates of planet system architecture

At Stars group meeting, Jo Bovy (Toronto) demonstrated to us that the red-giant branch in Gaia DR1 TGAS is populated about how you would expect from a simple star-formation history and stellar evolution tracks. This was surprising to me: The red clump is extremely prominent. This project involved building an approximate selection function for TGAS, which he has done, and released open-source!

John Brewer (Yale) showed relationships he has found between planet-system architectures and stellar chemical abundances. He cleverly designed complete samples of different kinds of planetary systems to make comparisons on a reasonable basis. He doesn't have a causal explanation or causal inference of what he is finding. But there are some very strong covariances of chemical-abundance ratios with system architectures. This makes me more excited than ever to come up with some kind of general description or parameterization of a bound few-body system that is good for inference.

I spent the afternoon at CIS 303, a middle school in the Bronx, as part of their Career and College Day. This is an opportunity for middle schoolers to discuss with people with a huge range of careers and backgrounds what they do and how they got there. So much fun. I also ran into Michael Blanton (NYU) at the event!

Tim GowersIntransitive dice IV: first problem more or less solved?

I hope, but am not yet sure, that this post is a counterexample to Betteridge’s law of headlines. To back up that hope, let me sketch an argument that has arisen from the discussion so far, which appears to get us close to showing that if A,B and C are three n-sided dice chosen independently at random from the sequences model (I will recap some of these definitions in a moment), then the probability that A beats C given that A beats B and B beats C is 1/2+o(1). Loosely speaking, the information that A beats B and B beats C tells you nothing about how likely A is to beat C. Having given the argument, I will try to isolate a statement that looks plausible, not impossible to prove, and sufficient to finish off the argument.

Basic definitions

An nsided die in the sequence model is a sequence (a_1,\dots,a_n) of elements of [n]=\{1,2,\dots,n\} such that \sum_ia_i=n(n+1)/2, or equivalently such that the average value of a_i is (n+1)/2, which is of course the average value of a random element of [n]. A random n-sided die in this model is simply an n-sided die chosen uniformly at random from the set of all such dice.

Given n-sided dice A=(a_1,\dots,a_n) and B=(b_1,\dots,b_n), we say that A beats B if


If the two sets above have equal size, then we say that A ties with B.

A first reduction

When looking at this problem, it is natural to think about the following directed graph: the vertex set is the set of all n-sided dice and we put an arrow from A to B if A beats B.

We believe (and even believe we can prove) that ties are rare. Assuming that to be the case, then the conjecture above is equivalent to the statement that if A,B,C are three vertices chosen independently at random in this graph, then the probability that ABC is a directed cycle is what you expect for a random tournament, namely 1/8.

One can also make a more general conjecture, namely that the entire (almost) tournament is quasirandom in a sense defined by Chung and Graham, which turns out to be equivalent to the statement that for almost all pairs (A,B) of dice, the four possible pairs of truth values for the pair of statements

(A beats C, B beats C)

each occur with probability approximately 1/4. If this is true, then given k random dice A_1,\dots,A_k, all the 2^{\binom k2} possibilities for which A_i beat which A_j have probability approximately 2^{-\binom k2}. This would imply, for example, that if A_1,\dots,A_k are independent random n-sided dice, then the probability that A_1 beats A_k given that A_i beats A_j for all other pairs (i,j) with i<j is still 1/2+o(1).

Several of us have done computer experiments to test these conjectures, and it looks as though the first one is true and the second one false. A further reason to be suspicious of the stronger conjecture is that a natural approach to prove it appears to be morally equivalent to a relationship between the correlations of certain random variables that doesn’t seem to have any heuristic justification or to fit with experimental evidence. So although we don’t have a disproof of the stronger conjecture (I think it would be very interesting to find one), it doesn’t seem like a good idea to spend a lot of effort trying to prove it, unless we can somehow explain away the evidence that appears to be stacking up against it.

The first conjecture turns out to be equivalent to a statement that doesn’t mention transitivity. The very quick proof I’ll give here was supplied by Luke Pebody. Suppose we have a tournament (that is, a complete graph with each edge directed in one of the two possible directions) and write d_+(x) for the out-degree of a vertex x (that is, the number of y such that there is an arrow from x to y) and d_-(x) for the in-degree. Then let us count the number of ordered triples (x,y,z) such that x\to y\to z. Any directed triangle xyz in the tournament will give rise to three such triples, namely (x,y,z), (y,z,x) and (z,x,y). And any other triangle will give rise to just one: for example, if x\to y, x\to z and y\to z we get just the triple (x,y,z). So the number of ordered triples (x,y,z) such that x\to y and y\to z is \binom n3 plus twice the number of directed triangles. Note that \binom n3 is approximately n^3/6.

But the number of these ordered triples is also \sum_yd_+(y)d_-(y). If almost all in-degrees and almost all out-degrees are roughly n/2, then this is approximately n^3/4, which means that the number of directed triangles is approximately (n^3/4-n^3/6)/2\approx\binom n3/4. That is, in this case, the probability that three dice form an intransitive triple is approximately 1/4, as we are hoping from the conjecture. If on the other hand several in-degrees fail to be roughly n/2, then \sum_yd_+(y)d_-(y) is substantially lower than n^3/4 and we get a noticeably smaller proportion of intransitive triples.

Thus, the weaker conjecture is equivalent to the statement that almost every die beats approximately half the other dice.

Why should a “typical” die beat approximately half the other dice?

The answer to this is fairly simple, heuristically at least. Let A be an arbitrary die. For j=1,2,\dots,n define g_A(j) to be the number of i with a_i\leq j and define h_A(j) to be g_A(j)-j. Then

\sum_jg_A(j)=\sum_j\sum_i\mathbf 1_{[a_i\leq j]}=\sum_i(n-a_j+1)


from which it follows that \sum_jh_A(j)=0.

We also have that if B is another die, then

\sum_jh_A(b_j)=\sum_j(\sum_i\mathbf 1_{a_i\leq b_j}-b_j)=\sum_{i,j}\mathbf 1_{[a_i\leq b_j]}-n(n+1)/2.

If we make the simplifying assumption that a_i=b_j sufficiently infrequently to make no real difference to what is going on (which is not problematic, as a slightly more complicated but still fairly simple function can be used instead of h_A to avoid this problem), then we find that to a reasonable approximation B beats A if and only if \sum_jh_A(j) is positive.

So what we would like to prove is that if b_1,\dots,b_n are chosen independently at random from [n], then

\mathbb P\Bigl[\sum_jh_A(b_j)>0\Bigm|\sum_jb_j=n(n+1)/2\Bigr]\approx 1/2.

We are therefore led to consider the random variable


where now B=(b_1,\dots,b_n) is chosen uniformly at random from [n]^n without any condition on the sum. To write this in a more transparent way, let (U,V) be the random variable (h_A(x), x-(n+1)/2), where x is chosen uniformly at random from [n]. Then (X,Y) is a sum of n independent copies of (U,V). What we are interested in is the distribution we obtain when we condition the random variable (X,Y) on Y=0.

This should mean that we are in an excellent position, since under appropriate conditions, a lot is known about sums of independent random variables, and it looks very much as though those conditions are satisfied by (U,V), at least when A is “typical”. Indeed, what we would expect, by the central limit theorem, is that (X,Y) will approximate a bivariate normal distribution with mean 0 (since both U and V have mean zero). But a bivariate normal distribution is centrally symmetric, so we expect the distribution of [X|Y=0] to be approximately centrally symmetric, which would imply what we wanted above, since that is equivalent to the statement that \mathbb P[X>0|Y=0]\approx 1/2.

Why are we not immediately done?

How can we make the above argument rigorous? The central limit theorem on its own is not enough, for two reasons. The first is that it does not give us information about the speed of convergence to a normal distribution, whereas we need a sum of n copies of (U,V) to be close to normal. The second is that the notion of “close to normal” is not precise enough for our purposes: it will allow us to approximate the probability of an event such as [X\geq x\wedge Y\geq y] but not of a “probability zero” event such as [X>0\wedge Y=0].

The first of these difficulties is not too worrying, since plenty of work has been done on the speed of convergence in the central limit theorem. In particular, there is a famous theorem of Berry and Esseen that is often used when this kind of information is needed.

However, the Berry-Esseen theorem still suffers from the second drawback. To get round that one needs to turn to more precise results still, known as local central limit theorems, often abbreviated to LCLTs. With a local central limit theorem, one can even talk about the probability that (X,Y) takes a specific value after a specific number of steps. Roughly speaking, it says (in its 2-dimensional version) that if (U,V) is a random variable of mean zero taking values in \mathbb Z^2 and if (U,V) satisfies suitable moment conditions and is not supported in a proper sublattice of \mathbb Z^2, then writing (X,Y) for a sum of n copies of (U,V), we have that the probability that (X,Y) takes a particular value differs from the “expected” probability (given by a suitable Gaussian formula) by O(n^{-2}). (I’m not 100% sure I’ve got that right: the theorem in question is Theorem 2.1.1 from this book.)

That looks very close to what we want, but it still falls short. The problem is that the implied constant depends on the random variable (U,V). A simple proof of this is that if (U,V) is not supported in a sublattice but very nearly is — for example, if the probability that it takes a value outside the sublattice is 2^{-2^n} — then one will have to add together an extremely large number of copies of (U,V) before the sum ceases to be concentrated in the sublattice.

So the situation we appear to be in is the following. We have more precise information about the random variable (U,V) than is assumed in the LCLT in the reference above, and we want to use that to obtain an explicit constant in the theorem.

It could be that out there in the literature is exactly the result we need, which would be nice, but it also seems possible that we will have to prove an appropriate version of the LCLT for ourselves. I’d prefer the first, but the second wouldn’t be too disappointing, as the problem is quite appealing and even has something of an additive-combinatorial flavour (since it is about describing an iterated convolution of a subset of \mathbb Z^2 under appropriate assumptions).

What properties can we establish about the random variable (U,V)?

I said above, with no justification, that we have more precise information about the random variable (U,V). Let me now try to give the justification.

First of all, we know everything we could possibly want to know about V: it is the uniform distribution on [n] - (n+1)/2. (In particular, if n is odd, then it is the uniform distribution on the set of integers in [-(n-1)/2, (n-1)/2].)

How about the distribution of U? That question is equivalent to asking about the values taken by h_A(j), and their multiplicities. There is quite a lot one can say about those. For example, I claim that with high probability (if A is a random n-sided die) h_A(j) is never bigger than C\sqrt{n\log n}. That is because if we choose a fully random sequence (a_1,\dots,a_n)\in[n]^n, then the expected number of i such that a_i\leq j is j, and the probability that this number differs from j by more than t\sqrt n is \exp(-ct^2), by standard probabilistic estimates, so if we set t=C\sqrt{\log n}, then this is at most n^{-cC}, which we can make a lot smaller than n^{-1} by choosing C to be, say, 10c^{-1}. (I think c can be taken to be 1/8 if you want me to be more explicit.) Since the probability that \sum_ia_i=n(n+1)/2 is proportional to 1/n, it follows that this conclusion continues to hold even after we condition on that event.

Another simple observation is that the values taken by U are not contained in a sublattice (assuming, that is, that h_A is ever non-zero). That is simply because h_A(j+1)\geq h_A(j)-1 and h_A averages zero.

A third simple observation is that with probability 1-o(1) h_A will take a value of at least \sqrt n/\log n at least somewhere. I’ll sketch a proof of this. Let k be around \log n and let m_1<\dots<m_k be evenly spaced in [n], staying away from the end points 1 and n. Let A be a purely random sequence in [n]^n. Then the standard deviation of h_A(m_1) is around \sqrt{n/\log n}, so the probability that it is less than \sqrt n/\log n is around (\log n)^{-1/2}. The same is true of the conditional probability that h_A(m_i) is less than \sqrt n/\log n conditioned on the value of h_A(m_{i-1}) (the worst case being when this value is 0). So the probability that this happens for every i is at most (\log n)^{-\log n/2}. This is much smaller than n^{-1}, so the conclusion remains valid when we condition on the sum of the a_i being n(n+1)/2. So the claim follows. Note that because of the previous simple observation, it follows that h_A must be at least c\sqrt{n/\log n} in magnitude at least c\sqrt{n/\log n} times, so up to log factors we get that \sum_jh_A(j)^2 is at least n^{3/2}. With a bit more effort, it should be possible to push this up to something more like n^2, since one would expect that h_A(j) would have rough order of magnitude \sqrt n for a positive fraction of the j. Maybe this would be a good subproblem to think about, and ideally not too difficult.

How about the joint distribution (U,V)? It seems highly likely that for typical A this will not be concentrated in a lattice, and that elementary arguments such as the above can be used to prove this. But let me indicate the kind of situation that we would have to prove is not typical. Suppose that n=15 and A=(1,1,1,1,1,8,8,8,8,8,15,15,15,15,15). Then as j runs from 1 to 15 the values taken by g_A are (5,5,5,5,5,5,5,10,10,10,10,10,10,10,15) and the values taken by h_A are (4,3,2,1,0,-1,-2,2,1,0,-1,-2,-3,-4,0). For this example, all the points (h_A(x),x) live in the lattice of points (u,v) such that u+v is a multiple of 5.

This wouldn’t necessarily be a disaster for us actually, since the LCLT can be restricted to a sublattice and if after conditioning on Y=0 we happen to have that X is always a multiple of 5, that isn’t a problem if we still have the central symmetry. But it would probably be nicer to prove that it is an atypical occurrence, so that we don’t have to worry about (U,V) living inside a sublattice (or even being concentrated in one).

My guess is that if we were to pursue these kinds of thoughts, we would end up being able to prove a statement that would say something like that (U,V) takes a pretty representative sample of values with V being between -(n-1)/2 and (n-1)/2 and U being in a range of width around \sqrt n. I would expect, for example, that if we add three or four independent copies of (U,V), then we will have a distribution that is similar in character to the uniform distribution on a rectangle of width of order of magnitude n and height of order of magnitude \sqrt n. And if that’s true, then adding n of them should give us something very close to normal (in an appropriate discrete sense of the word “normal”).

What next?

There are two obvious tasks here. One is to try to prove as much as we can about the random variable (U,V). The other is to try to prove a suitable LCLT that is strong enough to give us that the probability that X>0 given that Y=0 is approximately 1/2, under suitable assumptions about (U,V). And then we have to hope that what we achieve for the first is sufficient for the second.

It’s possible that the second task can be achieved by simply going through one of the existing proofs of the LCLT and being more careful about the details. But if that’s the case, then we should spend some time trying to find out whether anyone has done it already, since there wouldn’t be much point in duplicating that work. I hope I’ve set out what we want clearly enough for any probabilist who might stumble upon this blog post to be able to point us in the right direction if indeed the result we want is out there somewhere.

May 26, 2017

BackreactionCan we probe the quantization of the black hole horizon with gravitational waves?

Tl;dr: Yes, but the testable cases aren’t the most plausible ones. It’s the year 2017, but we still don’t know how space and time get along with quantum mechanics. The best clue so far comes from Stephen Hawking and Jacob Bekenstein. They made one of the most surprising finds that theoretical physics saw in the 20th century: Black holes have entropy. It was a surprise because entropy is a

Tommaso DorigoPhysics-Inspired Artwork In Venice 2: Symmetries

As I explained in the previous post of this series, students in high schools of the Venice area have been asked to produce artistic works inspired by LHC physics research, and in particular the Higgs boson.

read more

Geraint F. LewisFalling into a black hole: Just what do you see?

Everyone loves black holes. Immense gravity, a one-way space-time membrane, the possibility of links to other universes. All lovely stuff.

A little trawl of the internets reveals an awful lot of web pages discussing black holes, and discussions about spaghettification, firewalls, lost information, and many other things. Actually, a lot of the stuff out there on the web is nonsense, hand-waving, partly informed guesswork. And one of the questions that gets asked is "What would you see looking out into the universe?"

Some (incorrectly) say that you would never cross the event horizon, a significant mis-understanding of the coordinates of relativity. Other (incorrectly) conclude from this that you actually see the entire future history of the universe play out in front of your eyes.

What we have to remember, of course, is that relativity is a mathematical theory, and instead of hand waving, we can use mathematics to work out what we will see. And that's what I did.

I won't go through the details here, but it is based upon correctly calculating redshifts in relativity and conservation laws embodied in Killing vectors. But the result is an equation, an equation that looks like this

Here,  rs is the radius from which you start to fall, re is the radius at which the photon was emitted, and ro is the radius at which you receive the photon. On the left-hand-side is the ratio of the frequencies of the photon at the time of observation compared to emission. If this is bigger than one, then the photon is observed to have more energy than emitted, and the photon is blueshifted. If it is less than one, then it has less energy, and the photon is redshifted. Oh, and m is the mass of the black hole.

One can throw this lovely equation into python and plot it up. What do you get.

So, falling from a radius of 2.1m, we get
And falling from 3m
And from 5m
And 10m
and finally at 50m

In each of these, each line is a photon starting from different differences.

The key conclusion is that within the event horizon (r=2m) photons are generally seen to be redshifted, irrespective of where you start falling from. In fact in the last moment before you meet your ultimate end in the central singularity, the energy of the observed photon goes to zero and the outside universe is infinitely reshifted and vanishes from view.

How cool is that?

David Hoggmentoring and noise

Today I was only able to spend a small amount of time at a really valuable (and nicely structured) mentoring workshop run by Saavik Ford and CUNY faculty. The rest of the day I sprinted on my Exoplanet Research Program proposal, in which I am writing about terms in the extreme precision radial-velocity noise budget!

May 25, 2017

David Hoggquasi-periodic solar flares; TESS

In the morning, Daniela Huppenkothen (NYU) and I discussed Solar flares and other time-variable stellar phenomena with Chris Ick (NYU). He is going to help us take a more principled probabilistic approach to the question of whether flares contain quasi-periodic oscillations. He is headed off to learn about Gaussian Processes.

Armin Rest (STScI) was around today; he discussed image differencing with Dun Wang (NYU). After their discussions, we decided to make Wang's code easily installable, and get Rest to install and run it. Rest wants to have various image-differencing or transient-discovery pipelines running on the TESS data in real time (or as real as possible), and this could form the core of that. Excited!

Chad OrzelLong-Overdue Photo-a-Day Wrap-Up

During my sabbatical last year, I decided to try to do a photo-a-day project, taking and sharing at least one good picture a day of something or another. The strict photo-a-day format fell victim to my general busy-ness and disorganization, but I did eventually complete the whole thing. In the final post of the series, I said I might have some wrap-up thoughts in a few days; that was several months ago.

Every time I need to clear some disk space, though, I’m reminded that I left that hanging, so I finally did something in the direction of a summative post, by uploading all the edited pictures from the project to a Google Photos album. There are way more than the promised 366 photos in here, because there were sometimes multiple images on a single day, and some others are photos I took and never used for anything.

Screen shot of the Google Photos album of my photo-a-day pictures.

(That photo of Emmy in the first row still makes me a little sad…)

So, looking back on this, what are my thoughts?

I think it was definitely worthwhile, especially in the earlier part when I had more time to play around and think about what pictures to take. I learned a lot about the camera and what sort of shots work well, so I take somewhat fewer total pictures to get a handful of good ones these days. (Obviously, for things like SteelyKid’s softball games, I’m still doing a lot of burst-mode shooting in hopes of getting a really good moment. For landscape-ish things, though, I feel like I have a better sense of what will work, and don’t waste as many shots trying to get stuff that never turns out right.

It’s also interesting to look back at that whole collection of stuff, and remember what a busy year that was. We had a bunch of out-of-town trips, including a Disney cruise with the kids and a trip to Rome for a friend’s wedding. (There are many more Rome pictures on Google Photos, in other albums– I only pulled a few for the 366 project…)

Excluding the cute-kid photos (of which there are many) and the science-of-photography experiments, there’s a reasonably consistent aesthetic to the kinds of pictures I liked enough to include in this. I’ve gotten used to shooting mostly with the 22mm “pancake” lens, which has a large aperture and a large-ish field of view. It’s a fixed length, which means no zoom, but the speed factor mostly makes up for it. I’ve been switching back and forth between that and the telephoto lens lately, because SteelyKid’s playing softball again, and screwing around with the camera keeps me from getting too bored at the games.

There were a couple of things I had hoped to do, but never really found the time, mostly involving processing. I was going to spend a day or so taking photos with an eye toward making them black and white; there are a couple of B&W shots in the album, mostly where the lighting conditions made it difficult to get a decent color image. I thought it might be fun to try to do that deliberately; still do, really, but it’s hard to find time.

I also wanted to get some time to play around more with GIMP (my image-processing program of choice, for quite a few years now) and try more processing than just cropping and color-correcting. In particular, I’ve never done much with selectively processing parts of images, or trying to edit stuff out of photos. It’d be nice to know how to do those things, because people who know what they’re doing get some cool effects that way, but again, it’s hard to find the time.

Anyway, Looking at that whole collection, I’m pleased to see that the number of real duds is pretty small. And it was a lot of fun to spend time walking around specifically looking at things as potential photographs; I need to do more of that.

And that is pretty much all I have to say about that…

n-Category Café A Type Theory for Synthetic ∞-Categories

One of the observations that launched homotopy type theory is that the rule of identity-elimination in Martin-Löf’s identity types automatically generates the structure of an \infty-groupoid. In this way, homotopy type theory can be viewed as a “synthetic theory of \infty-groupoids.”

It is natural to ask whether there is a similar directed type theory that describes a “synthetic theory of (,1)(\infty,1)-categories” (or even higher categories). Interpreting types directly as (higher) categories runs into various problems, such as the fact that not all maps between categories are exponentiable (so that not all \prod-types exist), and that there are numerous different kinds of “fibrations” given the various possible functorialities and dimensions of categories appearing as fibers. The 2-dimensional directed type theory of Licata and Harper has semantics in 1-categories, with a syntax that distinguishes between co- and contra-variant dependencies; but since the 1-categorical structure is “put in by hand”, it’s not especially synthetic and doesn’t generalize well to higher categories.

An alternative approach was independently suggested by Mike and by Joyal, motivated by the model of homotopy type theory in the category of Reedy fibrant simplicial spaces, which contains as a full subcategory the \infty-cosmos of complete Segal spaces, which we call Rezk spaces. It is not possible to model ordinary homotopy type theory directly in the Rezk model structure, which is not right proper, but we can model it in the Reedy model structure and then identify internally some “types with composition,” which correspond to Segal spaces, and “types with composition and univalence,” which correspond to the Rezk spaces.

Almost five years later, we are finally developing this approach in more detail. In a new paper now available on the arXiv, Mike and I give definitions of Segal and Rezk types motivated by these semantics, and demonstrate that these simple definitions suffice to develop the synthetic theory of (,1)(\infty,1)-categories. So far this includes functors, natural transformations, co- and contravariant type families with discrete fibers (\infty-groupoids), the Yoneda lemma (including a “dependent” Yoneda lemma that looks like “directed identity-elimination”), and the theory of coherent adjunctions.

Cofibrations and extension types

One of the reasons this took so long to happen is that it required a technical innovation to become feasible. To develop the synthetic theory of Segal and Rezk types, we need to detect the semantic structure of the simplicial spaces model internally, and it seems that the best way to do this is to axiomatize the presence of a strict interval 22 (a totally ordered set with distinct top and bottom elements). This is the geometric theory of which simplicial sets are the classifying topos (and of which simplicial spaces are the classifying (,1)(\infty,1)-topos). We can then define an arrow in a type AA to be a function 2A2\to A.

However, often we want to talk about arrows with specified source and target. We can of course define the type hom A(x,y)\hom_A(x,y) of such arrows to be f:2A(f(0)=x)×(f(1)=y)\sum_{f:2\to A} (f(0)=x)\times (f(1)=y), but since we are in homotopy type theory, the equalities f0=xf0=x and f1=yf1=y are data, i.e. homotopical paths, that have to be carried around everywhere. When we start talking about 2-simplices and 3-simplices with specified boundaries as well, the complexity becomes unmanageable.

The innovation that solves this problem is to introduce a notion of cofibration in type theory, with a corresponding type of extensions. If i:ABi:A\to B is a cofibration and X:B𝒰X:B\to \mathcal{U} is a type family dependent on BB, while f: a:AX(i(a))f:\prod_{a:A} X(i(a)) is a section of XX over ii, then we introduce an extension type b:BX(b) f i\langle \prod_{b:B} X(b) \mid^i_f\rangle consisting of “those dependent functions g: b:BX(b)g:\prod_{b:B} X(b) such that g(i(a))f(a)g(i(a)) \equiv f(a) — note the strict judgmental equality! — for any a:Aa:A”. This is modeled semantically by a “Leibniz” or “pullback-corner” map. In particular, we can define hom A(x,y)= t:2A [x,y] 0,1\hom_A(x,y) = \langle \prod_{t:2} A \mid^{0,1}_{[x,y]} \rangle, the type of functions f:2Af:2\to A such that f(0)xf(0)\equiv x and f(1)yf(1) \equiv y strictly, and so on for higher simplices.

General extension types along cofibrations were first considered by Mike and Peter Lumsdaine for a different purpose. In addition to the pullback-corner semantics, they are inspired by the path-types of cubical type theory, which replace the inductively specified identity types of ordinary homotopy type theory with a similar sort of restricted function-type out of the cubical interval. Our paper introduces a general notion of “type theory with shapes” and extension types that includes the basic setup of cubical type theory as well as our simplicial type theory, along with potential generalizations to Joyal’s “disks” for a synthetic theory of (,n)(\infty,n)-categories.

Simplices in the theory of a strict interval

In simplicial type theory, the cofibrations are the “inclusions of shapes” generated by the coherent theory of a strict interval, which is axiomatized by the interval 22, top and bottom elements 0,1:20,1 : 2, and an inequality relation \le satisfying the strict interval axioms.

Simplices can then be defined as

Δ n:={t 1,,t nt nt n1t 2t 1} \Delta^n := \{ \langle t_1,\ldots, t_n\rangle \mid t_n \leq t_{n-1} \cdots t_2 \leq t_1 \}

Note that the 1-simplex Δ 1\Delta^1 agrees with the interval 22.

Boundaries, e.g. of the 2-simplex, can be defined similarly Δ 2:={t 1,t 2:2×2(0t 2t 1)(t 2t 1)(t 2t 11)} \partial\Delta^2 :=\{&#10216;t_1,t_2&#10217;: 2 \times 2 \mid (0 \equiv t_2 \leq t_1) \vee (t_2 \equiv t_1) \vee (t_2 \leq t_1 \equiv 1)\} making the inclusion of the boundary of a 2-simplex into a cofibration.

Segal types

For any type AA with terms x,y:Ax,y : A define

hom A(x,y):=2A [x,y] Δ 1 hom_A(x,y) := \langle 2 \to A \mid^{\partial\Delta^1}_{ [x,y]} \rangle

That is, a term f:hom A(x,y)f : hom_A(x,y), which we call an arrow from xx to yy in AA, is a function f:2Af: 2 \to A so that f(0)xf(0) \equiv x and f(1)yf(1) \equiv y. For f:hom A(x,y)f : hom_A(x,y), g:hom A(y,z)g : hom_A(y,z), and h:hom A(x,z)h : hom_A(x,z), a similar extension type

hom A(x,y,z,f,g,h):=Δ 2A [x,y,z,f,g,h] Δ 2 hom_A(x,y,z,f,g,h) := \langle \Delta^2 \to A \mid^{\partial\Delta^2}_{[x,y,z,f,g,h]}\rangle

has terms that we interpret as witnesses that hh is the composite of ff and gg. We define a Segal type to be a type in which any composable pair of arrows admits a unique (composite, witness) pair. In homotopy type theory, this may be expressed by saying that AA is Segal if and only if for all f:hom A(x,y)f : hom_A(x,y) and g:hom A(y,z)g : hom_A(y,z) the type

h:hom A(x,z)hom A(x,y,z,f,g,h) \sum_{h : hom_A(x,z)} hom_A(x,y,z,f,g,h)

is contractible. A contractible type is in particular inhabited, and an inhabitant in this case defines a term gf:hom A(x,z)g \circ f : hom_A(x,z) that we refer to as the composite of ff and gg, together with a 2-simplex witness comp(g,f):hom A(x,y,z,f,g,gf)comp(g,f) : hom_A(x,y,z,f,g,g \circ f).

Somewhat surprisingly, this single contractibility condition characterizing Segal types in fact ensures coherent categorical structure at all dimensions. The reason is that if AA is Segal, then the type XAX \to A is also Segal for any type or shape XX. For instance, applying this result in the case X=2X=2 allows us to prove that the composition operation in any Segal type is associative. In an appendix we prove a conjecture of Joyal that in the simplical spaces model this condition really does characterize exactly the Segal spaces, as usually defined.

Discrete types

An example of a Segal type is a discrete type, which is one for which the map

idtoarr: x,y:A(x= Ay)hom A(x,y) idtoarr : \prod_{x,y: A} (x=_A y) \to hom_A(x,y)

defined by identity elimination by sending the reflexivity term to the identity arrow, is an equivalence. In a discrete type, the \infty-groupoid structure encoded by the identity types and equivalent to the (,1)(\infty,1)-category structure encoded by the hom types. More precisely, a type AA is discrete if and only if it is Segal, as well as Rezk-complete (in the sense to be defined later on), and moreover “every arrow is an isomorphism”.

The dependent Yoneda lemma

If AA and BB are Segal types, then any function f:ABf:A\to B is automatically a “functor”, since by composition it preserves 2-simplices and hence witnesses of composition. However, not every type family C:A𝒰C:A\to \mathcal{U} is necessarily functorial; in particular, the universe 𝒰\mathcal{U} is not Segal — its hom-types hom 𝒰(X,Y)\hom_{\mathcal{U}}(X,Y) consist intuitively of “spans and higher spans”. We say that C:A𝒰C:A\to \mathcal{U} is covariant if for any f:hom A(x,y)f:\hom_A(x,y) and u:C(x)u:C(x), the type

v:C(y) t:2C(f(t)) [u,v] Δ 1 \sum_{v:C(y)} \langle \prod_{t:2} C(f(t)) \mid^{\partial\Delta^1}_{[u,v]}\rangle

of “liftings of ff starting at uu” is contractible. An inhabitant of this type consists of a point f *(u):C(y)f_\ast(u):C(y), which we call the (covariant) transport of uu along ff, along with a witness trans(f,u)trans(f,u). As with Segal types, this single contractibility condition suffices to ensure that this action is coherently functorial. It also ensures that the fibers C(x)C(x) are discrete, and that the total space x:AC(x)\sum_{x:A} C(x) is Segal.

In particular, for any Segal type AA and any a:Aa:A, the hom-functor hom A(a,):A𝒰\hom_A(a,-) :A \to \mathcal{U} is covariant. The Yoneda lemma says that for any covariant C:A𝒰C:A\to \mathcal{U}, evaluation at (a,id a)(a,id_a) defines an equivalence

( x:Ahom A(a,x)C(x))C(a) \Big(\prod_{x:A} \hom_A(a,x) \to C(x)\Big) \simeq C(a)

The usual proof of the Yoneda lemma applies, except that it’s simpler since we don’t need to check naturality or functoriality; in the “synthetic” world all of that comes for free.

More generally, we have a dependent Yoneda lemma, which says that for any covariant C:( x:Ahom A(a,x))𝒰C : \Big(\sum_{x:A} \hom_A(a,x)\Big) \to \mathcal{U}, we have a similar equivalence

( x:A f:hom A(a,x)C(x,f))C(a,id a). \Big(\prod_{x:A} \prod_{f:\hom_A(a,x)} C(x,f)\Big) \simeq C(a,id_a).

This should be compared with the universal property of identity-elimination (path induction) in ordinary homotopy type theory, which says that for any type family C:( x:A(a=x))𝒰C : \Big(\sum_{x:A} (a=x)\Big) \to \mathcal{U}, evaluation at (a,refl a)(a,refl_a) defines an equivalence

( x:A f:a=xC(x,f))C(a,refl a). \Big(\prod_{x:A} \prod_{f:a=x} C(x,f)\Big) \simeq C(a,refl_a).

In other words, the dependent Yoneda lemma really is a “directed” generalization of path induction.

Rezk types

When is an arrow f:hom A(x,y)f : hom_A(x,y) in a Segal type an isomorphism? Classically, ff is an isomorphism just when it has a two-sided inverse, but in homotopy type theory more care is needed, for the same reason that we have to be careful when defining what it means for a function to be an equivalence: we want the notion of “being an isomorphism” to be a mere proposition. We could use analogues of any of the equivalent notions of equivalence in Chapter 4 of the HoTT Book, but the simplest is the following:

isiso(f):=( g:hom A(y,x)gf=id x)×( h:hom A(y,x)fh=id y) isiso(f) := \left(\sum_{g : hom_A(y,x)} g \circ f = id_x\right) \times \left(\sum_{h : hom_A(y,x)} f \circ h = id_y \right)

An element of this type consists of a left inverse and a right inverse together with witnesses that the respective composites with ff define identities. It is easy to prove that g=hg = h, so that ff is an isomorphism if and only if it admits a two-sided inverse, but the point is that any pair of terms in the type isiso(f)isiso(f) are equal (i.e., isiso(f)isiso(f) is a mere proposition), which would not be the case for the more naive definition.

The type of isomorphisms from xx to yy in AA is then defined to be

(x Ay):= f:hom A(x,y)isiso(f). (x \cong_A y) := \sum_{f : \hom_A(x,y)} isiso(f).

Identity arrows are in particular isomorphisms, so by identity-elimination there is a map

x,y:A(x= Ay)(x Ay) \prod_{x,y: A} (x =_A y) \to (x \cong_A y)

and we say that a Segal type AA is Rezk complete if this map is an equivalence, in which case AA is a Rezk type.

Coherent adjunctions

Similarly, it is somewhat delicate to define homotopy correct types of adjunction data that are contractible when they are inhabited. In the final section to our paper, we compare transposing adjunctions, by which we mean functors f:ABf : A \to B and u:BAu : B \to A (i.e. functions between Segal types) together with a fiberwise equivalence

a:A,b:Bhom A(a,ub)hom B(fa,b) \prod_{a :A, b: B} \hom_A(a,u b) \simeq \hom_B(f a,b)

with various notions of diagrammatic adjunctions, specified in terms of units and counits and higher coherence data.

The simplest of these, which we refer to as a quasi-diagrammatic adjunction is specified by a pair of functors as above, natural transformations η:Id Auf\eta: Id_A \to u f and ϵ:fuId B\epsilon : f u \to Id_B (a “natural transformation” is just an arrow in a function-type between Segal types), and witnesses α\alpha and β\beta to both of the triangle identities. The incoherence of this type of data has been observed in bicategory theory (it is not cofibrant as a 2-category) and in (,1)(\infty,1)-catgory theory (as a subcomputad of the free homotopy coherent adjunction it is not parental). One homotopically correct type of adjunction data is a half-adjoint diagrammatic adjunction, which has additionally a witness that fα:ϵfuϵfηuϵf \alpha : \epsilon \circ f u\epsilon \circ f\eta u \to \epsilon and βu:ϵϵfufηu\beta u: \epsilon \circ \epsilon f u \circ f \eta u commute with the naturality isomorphism for ϵ\epsilon.

We prove that given Segal types AA and BB and functors f:ABf : A \to B and u:BAu : B \to A, the type of half-adjoint diagrammatic adjunctions between them is equivalent to the type of transposing adjunctions. More precisely, if in the notion of transposing adjunction we interpret “equivalence” as a “half-adjoint equivalence”, i.e. a pair of maps ϕ\phi and ψ\psi with homotopies ϕψ=1\phi \psi = 1 and ψϕ=1\psi \phi = 1 and a witness to one of the triangle identities for an adjoint equivalence (this is another of the coherent notions of equivalence from the HoTT Book), then these data correspond exactly under the Yoneda lemma to those of a half-adjoint diagrammatic adjunction.

This suggests that similar correspondences for other kinds of coherent equivalences. For instance, if we interpret transposing adjunctions using the “bi-invertibility” notion of coherent equivalence (specification of a separate left and right inverse, as we used above to define isomorphisms in a Segal type), we obtain upon Yoneda-fication a new notion of coherent diagrammatic adjunction, consisting of a unit η\eta and two counits ϵ,ϵ\epsilon,\epsilon', together with witnesses that η,ϵ\eta,\epsilon satisfy one triangle identity and η,ϵ\eta,\epsilon' satisfy the other triangle identity.

Finally, if the types AA and BB are not just Segal but Rezk, we can show that adjoints are literally unique, not just “unique up to isomorphism”. That is, given a functor u:BAu:B\to A between Rezk types, the “type of left adjoints to uu” is a mere proposition.

May 24, 2017

Doug NatelsonHot electrons and a connection to thermoelectricity

The two recent posts about the Seebeck effect and hot electrons give some context so that I can talk about a paper we published last month.

We started out playing around with metal nanowires, and measuring the open-circuit voltage (that is, hook up a volt meter across the device, which nominally doesn't allow current to flow) across those wires as a function of where we illuminated them with a near-IR laser.  Because the metal absorbs some of the light, that laser spot acts like a local heat source (though figuring out the temperature profile requires some modeling of the heat transfer processes).   As mentioned here, particles tend to diffuse from hot locations to cold locations; in an open circuit, a voltage builds up to balance out this tendency, because in the steady state no net current flows in an open circuit; and in a metal, the way electron motion and scattering depend on the energy of the electrons gives you the magnitude and sign of this process.   If the metal is sufficiently nanoscale that boundary scattering matters, you end up with a thermoelectric response that depends on the metal geometry.  The end result is shown in the left portion of the figure.  If you illuminate the center of the metal wire, you measure no net voltage - you shouldn't, because the whole system is symmetric.  The junction where the wire fans out to a bigger pad acts like a thermocouple because of that boundary scattering, and if you illuminate it you get a net thermoelectric voltage (sign depends on how you pick ground and which end you're illuminating).   Bottom line:  Illumination heats the electrons a bit (say a few Kelvin), and you get a thermoelectric voltage because of that, to offset the tendency of the electrons to diffuse due to the temperature gradient.  In this system, the size of the effect is small - microvolts at our illumination conditions.

Now we can take that same nanowire, and break it to make a tunnel junction somewhere in there - a gap between the two electrodes where the electrons are able to "tunnel" across from one side to the other.  When we illuminate the tunnel junction, we now see open-circuit photovoltages that are much larger, and very localized to the gap region.  So, what is going on here?  The physics is related, but not true thermoelectricity (which assumes that it always makes sense to define temperature everywhere).   What we believe is happening is something that was discussed theoretically here, and was reported in molecule-containing junctions here.   As I said when talking about hot electrons, when light gets absorbed, it is possible to kick electrons way up in energy.  Usually that energy gets dissipated by being spread among other electrons very quickly.  However, if hot electrons encounter the tunnel junction before they've lost most of that energy, they have a higher likelihood of getting across the tunnel junction, because quantum tunneling is energy-dependent.  Producing more hot electrons on one side of the junction than the other will drive a tunneling current.  We still have an open circuit, though, so some voltage has to build up so that the net current in the steady state adds up to zero.  Bottom line:  Illumination here can drive a "hot" electron tunneling current, and you get a photovoltage to offset that process.  This isn't strictly a thermoelectric effect because the electrons aren't thermally distributed - it's the short-lived high energy tail that matters most.

It's fun to think about ways to try to better understand and maximize such effects, perhaps for applications in photodetection or other technologies....

Tim GowersIntransitive dice III

I now feel more optimistic about the prospects for this project. I don’t know whether we’ll solve the problem, but I think there’s a chance. But it seems that there is after all enough appetite to make it an “official” Polymath project. Perhaps we could also have an understanding that the pace of the project will be a little slower than it has been for most other projects. I myself have various other mathematical projects on the boil, so can’t spend too much time on this one, but quite like the idea of giving it an occasional go when the mood takes me, and trying to make slow but steady progress. So I’ve created a polymath13 category, into which this post now fits. I’ve also retrospectively changed the category for the previous two posts. I don’t think we’ve got to the stage where a wiki will be particularly useful, but I don’t rule that out at some point in the future.

In this post I want to expand on part of the previous one, to try to understand better what would need to be true for the quasirandomness assertion to be true. I’ll repeat a few simple definitions and simple facts needed to make the post more self-contained.

By an nsided die I mean a sequence in [n]^n (where [n] is shorthand for \{1,2,\dots,n\}) that adds up to n(n+1)/2. Given an n-sided die A=(a_1,\dots,a_n) and j\in[n], I define g_A(j) to be the number of i such that a_i\leq j and h_A(j) to be g_A(j)-j.

We can write h_A(j) as \sum_i\mathbf 1_{[a_i\leq j]}-j. Therefore, if C=(c_1,\dots,c_n) is another die, or even just an arbitrary sequence in [n]^n, we have that
\sum_jh_A(c_j)=\sum_j(\sum_i\mathbf 1_{[a_i\leq c_j]}-c_j)=\sum_{i,j}\mathbf 1_{[a_i\leq c_j]} - \sum_jc_j.
If \sum_jc_j=n(n+1)/2 and no a_i is equal to any c_j, then the sign of this sum therefore tells us whether A beats C. For most A, we don’t expect many ties, so the sign of the sum is a reasonable, but not perfect, proxy for which of the two dice wins. (With a slightly more complicated function we can avoid the problem of ties: I shall stick with the simpler one for ease of exposition, but would expect that if proofs could be got to work, then we would switch to the more complicated functions.)

This motivates the following question. Let A and B be two random dice. Is it the case that with high probability the remaining dice C are split into four sets of roughly equal size according to the signs of h_A(C) and h_B(C)? I expect the answer to this question to be the same as the answer to the original transitivity question, but I haven’t checked as carefully as I should that my cavalier approach to ties isn’t problematic.

I propose the following way of tackling this question. We fix A and B and then choose a purely random sequence C=(c_1,\dots,c_n) (that is, with no constraint on the sum) and look at the 3D random variable
Each coordinate separately is a sum of independent random variables with mean zero, so provided not too many of the h_A or h_B are zero, which for random A and B is a reasonable assumption, we should get something that approximates a trivariate normal distribution.

Therefore, we should expect that when we condition on \sum_j(c_j-(n+1)/2) being zero, we will get something that approximates a bivariate normal distribution. Although that may not be completely straightforward to prove rigorously, tools such as the Berry-Esseen theorem ought to be helpful, and I’d be surprised if this was impossibly hard. But for now I’m aiming at a heuristic argument, so I want simply to assume it.

What we want is for the signs of the first two coordinates to be approximately independent, which I think is equivalent to saying (assuming normality) that the first two coordinates themselves are approximately independent.

However, what makes the question interesting is that the first two coordinates are definitely not independent without the conditioning: the random variables \sum_jh_A(c_j) and \sum_jh_B(c_j) are typically quite strongly correlated. (There are good reasons to expect this to be the case, and I’ve tested it computationally too.) Also, we expect correlations between these variables and \sum_j(c_j-(n+1)/2). So what we are asking for is that all these correlations should disappear when we condition appropriately. More geometrically, there is a certain ellipsoid, and we want its intersection with a certain plane to be a circle.

The main aim of this post is to make the last paragraph more precise. That is, I want to take three standard normal random variables X, Y and Z that are not independent, and understand precisely the circumstances that guarantee that X and Y become independent when we condition on Z.

The joint distribution of (X,Y,Z) is determined by the matrix of correlations. Let this matrix be split up as \begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\\ \end{pmatrix}, where \Sigma_{11} is the 2\times 2 covariance matrix of (X,Y), \Sigma_{12} is a 2\times 1 matrix, \Sigma_{21} is a 1\times 2 matrix and \Sigma_{22} is the matrix (1). A general result about conditioning joint normal distributions on a subset of the variables tells us, if I understand the result correctly, that the covariance matrix of (X,Y) when we condition on the value of Z is \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. (I got this from Wikipedia. It seems to be quite tricky to prove, so I hope it really can be used as a black box.) So in our case if we have a covariance matrix \begin{pmatrix}1&\lambda&\mu\\ \lambda&1&\nu\\ \mu&\nu&1\\ \end{pmatrix} then the covariance matrix of (X,Y) conditioned on Z should be \begin{pmatrix}1-\mu^2&\lambda-\mu\nu\\ \lambda-\mu\nu&1-\nu^2\\ \end{pmatrix}.

That looks dimensionally odd because I normalized the random variables to have variance 1. If instead I had started with the more general covariance matrix \begin{pmatrix}a&\lambda&\mu\\ \lambda&b&\nu\\ \mu&\nu&c\\ \end{pmatrix} I would have ended up with \begin{pmatrix}a-c^{-1}\mu^2&\lambda-c^{-1}\mu\nu\\ \lambda-c^{-1}\mu\nu&b-c^{-1}\nu^2\\ \end{pmatrix}.

So after the conditioning, if we want X and Y to become independent, we appear to want \lambda c to equal \mu\nu. That is, we want
\langle X,Z\rangle\langle Y,Z\rangle=\langle X,Y\rangle\langle Z,Z\rangle
where I am using angle brackets for covariances.

If we divide each variable by its standard deviation, that gives us that the correlation between X and Y should be the product of the correlation between X and Z and the correlation between Y and Z.

I wrote some code to test this, and it seemed not to be the case, anything like, but I am not confident that I didn’t make careless mistakes in the code. (However, my correlations were reasonable numbers in the range [-1,1], so any mistakes there might have been didn’t jump out at me. I might just rewrite the code from scratch without looking at the old version.)

One final remark I’d like to make is that if you feel there is something familiar about the expression \langle x,z\rangle\langle y,z\rangle-\langle x,y\rangle\langle z,z\rangle, then you are not completely wrong. The formula for the vector triple product is

x\wedge(y\wedge z)=\langle x,z\rangle y - \langle x,y\rangle z.

Therefore, the expression can be condensed to \langle x\wedge(y\wedge z),z\rangle. Now this is the scalar triple product of the three vectors x, y\wedge z, and z. For this to be zero, we need x to lie in the plane generated by y\wedge z and z. Note that y\wedge z is orthogonal to both y and z. So if P is the orthogonal projection to the subspace generated by z, we want x-Px to be orthogonal to y. Actually, that can be read out of the original formula too, since it is \langle\langle x,z\rangle z - \langle z,z\rangle x,y\rangle. A nicer way of thinking of it (because more symmetrical) is that we want the orthogonal projections of x and y to the subspace orthogonal to z to be orthogonal. To check that, assuming (WLOG) that \|z\|_2=1,
\langle x-\langle x,z\rangle z,y-\langle y,z\rangle z\rangle=\langle x,y\rangle-\langle x,z\rangle\langle y,z\rangle.

So what I’d like to see done (but I’m certainly not saying it’s the only thing worth doing) is the following.

1. Test experimentally whether for a random pair (A,B) of n-sided dice we find that the correlations of the random variables X=\sum_jh_A(j), Y=\sum_jh_B(j) and Z=\sum_j(c_j-(n+1)/2) really do appear to satisfy the relationship

corr(X,Z).corr(Y,Z)= corr(X,Y).

Here the c_j are chosen randomly without any conditioning on their sum. My experiment seemed to indicate not, but I’m hoping I made a mistake.

2. If they do satisfy that relationship, then we can start to think about why.

3. If they do not satisfy it, then we can start to think about why not. In particular, which of the heuristic assumptions used to suggest that they should satisfy that relationship is wrong — or is it my understanding of multivariate normals that is faulty?

If we manage to prove that they typically do satisfy that relationship, at least approximately, then we can think about whether various distributions become sufficiently normal sufficiently quickly for that to imply that intransitivity occurs with probability 1/4.

Scott AaronsonYet more errors in papers

Following up on my posts PostBQP Postscripts and More Wrong Things I Said In Papers, it felt like time for another post in which I publicly flog myself for mistakes in my research papers.  [Warning: The rest of this post is kinda, sorta technical.  Read at your own risk.]

(1) In my 2006 paper “Oracles are subtle but not malicious,” I claimed to show that if PP is contained in BQP/qpoly, then the counting hierarchy collapses to QMA (Theorem 5).  But on further reflection, I only know how to show a collapse of the counting hierarchy under the stronger assumption that PP is in BQP/poly.  If PP is in BQP/qpoly, then certainly P#P=PP=QMA, but I don’t know how to collapse any further levels of the counting hierarchy.  The issue is this: in QMA, we can indeed nondeterministically guess an (amplified) quantum advice state for a BQP/qpoly algorithm.  We can then verify that the advice state works to solve PP problems, by using (for example) the interactive protocol for the permanent, or some other #P-complete problem.  But having done that, how do we then unravel the higher levels of the counting hierarchy?  For example, how do we simulate PPPP in PPBQP=PP?  We don’t have any mechanism to pass the quantum advice up to the oracle PP machine, since queries to a PP oracle are by definition classical strings.  We could try to use tools from my later paper with Andy Drucker, passing a classical description of the quantum advice up to the oracle and then using the description to reconstruct the advice for ourselves.  But doing so just doesn’t seem to give us a complexity class that’s low for PP, which is what would be needed to unravel the counting hierarchy.  I still think this result might be recoverable, but a new idea is needed.

(2) In my 2008 algebrization paper with Avi Wigderson, one of the most surprising things we showed was a general connection between communication complexity lower bounds and algebraic query complexity lower bounds.  Specifically, given a Boolean oracle A:{0,1}n→{0,1}, let ~A be a low-degree extension of A over a finite field F (that is, ~A(x)=A(x) whenever x∈{0,1}n).  Then suppose we have an algorithm that’s able to learn some property of A, by making k black-box queries to ~A.  We observed that, in such a case, if Alice is given the top half of the truth table of A, and Bob is given the bottom half of the truth table, then there’s necessarily a communication protocol by which Alice and Bob can learn the same property of A, by exchanging at most O(kn log|F|) bits.  This connection is extremely model-independent: a randomized query algorithm gives rise to a randomized communication protocol, a quantum query algorithm gives rise to a quantum communication protocol, etc. etc.  The upshot is that, if you want to lower-bound the number of queries that an algorithm needs to make to the algebraic extension oracle ~A, in order to learn something about A, then it suffices to prove a suitable communication complexity lower bound.  And the latter, unlike algebraic query complexity, is a well-studied subject with countless results that one can take off the shelf.  We illustrated how one could use this connection to prove, for example, that there exists an oracle A such that NPA ⊄ BQP~A, for any low-degree extension ~A of A—a separation that we didn’t and don’t know how to prove any other way. Likewise, there exists an oracle B such that BQPB ⊄ BPP~B for any low-degree extension ~B of B.

The trouble is, our “proof sketches” for these separations (in Theorem 5.11) are inadequate, even for “sketches.”  They can often be fixed, but only by appealing to special properties of the communication complexity separations in question, properties that don’t necessarily hold for an arbitrary communication separation between two arbitrary models.

The issue is this: while it’s true, as we claimed, that a communication complexity lower bound implies an algebraic query complexity lower bound, it’s not true in general that a communication complexity upper bound implies an algebraic query complexity upper bound!  So, from a communication separation between models C and D, we certainly obtain a query complexity problem that’s not in D~A, but then the problem might not be in CA.  What tripped us up was that, in the cases we had in mind (e.g. Disjointness), it’s obvious that the query problem is in CA.  In other cases, however, such as Raz’s separation between quantum and randomized communication complexity, it probably isn’t even true.  In the latter case, to recover the desired conclusion about algebraic query complexity (namely, the existence of an oracle B such that BQPB ⊄ BPP~B), what seems to be needed is to start from a later quantum vs. classical communication complexity separation due to Klartag and Regev, and then convert their communication problem into a query problem using a recent approach by myself and Shalev Ben-David (see Section 4).  Unfortunately, my and Shalev’s approach only tells us nonconstructively that there exists a query problem with the desired separation, with no upper bound on the gate complexity of the quantum algorithm.  So strictly speaking, I still don’t know how to get a separation between the relativized complexity classes BQPB and BPP~B defined in terms of Turing machines.

In any case, I of course should have realized this issue with the algebrization paper the moment Shalev and I encountered the same issue when writing our later paper.  Let me acknowledge Shalev, as well as Robin Kothari, for helping to spur my realization of the issue.

In case it wasn’t clear, the mistakes I’ve detailed here have no effect on the main results of the papers in question (e.g., the existence of an oracle relative to which PP has linear-sized circuits; the existence and pervasiveness of the algebrization barrier).  The effect is entirely on various “bonus” results—results that, because they’re bonus, were gone over much less carefully by authors and reviewers alike.

Nevertheless, I’ve always felt like in science, the louder you are about your own mistakes, the better.  Hence this post.

May 23, 2017

Andrew JaffeNot-quite hacked

This week, the New York Times, The Wall Street Journal and Twitter, along with several other news organizations, have all announced that they were attacked by (most likely) Chinese hackers.

I am not quite happy to join their ranks: for the last few months, the traffic on this blog has been vastly dominated by attempts to get into the various back-end scripts that run this site, either by direct password hacks or just denial-of-service attacks. In fact, I only noticed it because the hackers exceeded my bandwidth allowance by a factor of a few (and costing me a few hundred bucks in over-usage charged by my host in the process, unfortunately).

I’ve since attempted to block the attacks by denying access to the IP addresses which have been the most active (mostly from domains that look like, for what it’s worth). So, my apologies if any of this results in any problems for anyone else trying to access the blog.

Andrew JaffeJSONfeed

More technical stuff, but I’m trying to re-train myself to actually write on this blog, so here goes…

For no good reason other than it was easy, I have added a JSONfeed to this blog. It can be found at, and accessed from the bottom of the right-hand sidebar if you’re actually reading this at

What does this mean? JSONfeed is an idea for a sort-of successor to something called RSS, which may stand for really simple syndication, a format for encapsulating the contents of a blog like this one so it can be indexed, consumed, and read in a variety of ways without explicitly going to my web page. RSS was created by developer, writer, and all around web-and-software guru Dave Winer, who also arguably invented — and was certainly part of the creation of — blogs and podcasting. Five or ten years ago, so-called RSS readers were starting to become a common way to consume news online. NetNewsWire was my old favourite on the Mac, although its original versions by Brent Simmons were much better than the current incarnation by a different software company; I now use something called Reeder. But the most famous one was Google Reader, which Google discontinued in 2013, thereby killing off most of the RSS-reader ecosystem.

But RSS is not dead: RSS readers still exist, and it is still used to store and transfer information between web pages. Perhaps most importantly, it is the format behind subscriptions to podcasts, whether you get them through Apple or Android or almost anyone else.

But RSS is kind of clunky, because it’s built on something called XML, an ugly but readable format for structuring information in files (HTML, used for the web, with all of its < and > “tags”, is a close cousin). Nowadays, people use a simpler family of formats called JSON for many of the same purposes as XML, but it is quite a bit easier for humans to read and write, and (not coincidentally) quite a bit easier to create computer programs to read and write.

So, finally, two more web-and-software developers/gurus, Brent Simmons and Manton Reece realised they could use JSON for the same purposes as RSS. Simmons is behind NewNewsWire and Reece’s most recent project is an “indie microblogging” platform (think Twitter without the giant company behind it), so they both have an interest in these things. And because JSON is so comparatively easy to use, there is already code that I could easily add to this blog so it would have its own JSONfeed. So I did it.

So it’s easy to create a JSONfeed. What there isn’t — so far — are any newsreaders like NetNewsWire or Reeder that can ingest them. (In fact, Maxime Vaillancourt apparently wrote a web-based reader in about an hour, but it may already be overloaded…). Still, looking forward to seeing what happens.

Andrew JaffeSDSS 1416+13B

J1416 It’s not that often that I can find a reason to write about both astrophysics and music — my obsessions, vocations and avocations — at the same time. But the recent release of Scott Walker’s (certainly weird, possibly wonderful) new record Bish Bosch has given me just such an excuse: Track 4 is a 21-minute opus of sorts, entitled “SDSS1416+13B (Zercon, A Flagpole Sitter)”. The title seems a random collection of letters, numbers and words, but that’s not what it is: SDSS1416+13B is the (very slightly mangled) identification of an object in the Sloan Digital Sky Survey (SDSS) catalog — 1416+13B means that it is located at Right Ascension 14h16m and Declination 13° (actually, its full name is SDSS J141624.08+134826.7 which gives the location more precisely) and “B” denotes that it’s actually the second of two objects (the other one is unsurprisingly called “A”).

In fact it’s a pretty interesting object: it was actually discovered not by SDSS alone, but by cross-matching with another survey, the UK Infrared Deep Sky Survey (UKIDSS) and looking at the images by eye. It turns out that the two components are a binary system made up of two brown dwarfs — objects that aren’t massive enough to burn hydrogen via nuclear fusion, but are more massive than even the heaviest planets, often big enough to form at the centre of their own stellar systems, and heavy enough have some nuclear reactions in their core. In fact, the UKIDSS survey has been one of the best ways to find such comparatively cool objects; my colleagues Daniel Mortlock and Steve Warren found one of the coolest known brown dwarfs in UKIDSS in 2007, using techniques very similar to those they also used to find the most distant quasar yet known, recounted by Daniel in a guest-post here. Like that object, SDSS1416+13B is one of the coolest such objects ever found.

What does all this have to do with Scott Walker? I have no idea. Since he started singing as a member of the Walker Brothers in the 60s — and even more so since his 70s solo records, Walker has been known for his classical-sounding baritone, though with his mannered, massive vibrato, he always sounds a bit like a rocker’s caricature of a classical singer. I’ve always thought it was more force of personality than actual skill that drew people — especially here in the UK — to him.

His latest, Bish Bosch, the third in a purported trilogy of records he’s made since resurfacing in the mid-1990s, veers between mannered art-songs and rock’n’roll, silences punctuated with electric guitars, fart-sounds and trumpets.

The song “SDSS1416” itself is an (I assume intentionally funny?) screed, alternating sophomoric insults (my favourite is “don’t go to a mind reader, go to a palmist; I know you’ve got a palm”) with recitations of Roman numerals and, finally, the only link to observations of a brown dwarf I can find, “Infrared, infrared/ I could drop/ into the darkness.” Your guess is as good as mine. It’s compelling, but I can’t tell if that’s as an epic or a train wreck.

Andrew JaffeScience as metaphor

In further pop-culture crossover news, I was pleased to see this paragraph in John Keane’s review of Alan Ryan’s “On Politics” in this weekend’s Financial Times:

Ryan sees this period [the 1940s] as the point of triumph of liberal democracy against its Fascist and Stalinist opponents. Closer attention shows this decade was instead a moment of what physicists call dark energy: the universe of meaning of democracy underwent a dramatic expansion, in defiance of the cosmic gravity of contemporary events. The ideal of monitory democracy was born.

Not a bad metaphor. Nice to see that the author, a professor of Politics from Sydney, is paying attention to the stuff that really matters.

Andrew JaffeQuantum debrief

A week ago, I finished my first time teaching our second-year course in quantum mechanics. After a bit of a taster in the first year, the class concentrates on the famous Schrödinger equation, which describes the properties of a particle under the influence of an external force. The simplest version of the equation is just Schrodinger This relates the so-called wave function, ψ, to what we know about the external forces governing its motion, encoded in the Hamiltonian operator, Ĥ. The wave function gives the probability (technically, the probability amplitude) for getting a particular result for any measurement: its position, its velocity, its energy, etc. (See also this excellent public work by our department’s artist-in-residence.)

Over the course of the term, the class builds up the machinery to predict the properties of the hydrogen atom, which is the canonical real-world system for which we need quantum mechanics to make predictions. This is certainly a sensible endpoint for the 30 lectures.

But it did somehow seem like a very old-fashioned way to teach the course. Even back in the 1980s when I first took a university quantum mechanics class, we learned things in a way more closely related to the way quantum mechanics is used by practicing physicists: the mathematical details of Hilbert spaces, path integrals, and Dirac Notation.

Today, an up-to-date quantum course would likely start from the perspective of quantum information, distilling quantum mechanics down to its simplest constituents: qbits, systems with just two possible states (instead of the infinite possibilities usually described by the wave function). The interactions become less important, superseded by the information carried by those states.

Really, it should be thought of as a full year-long course, and indeed much of the good stuff comes in the second term when the students take “Applications of Quantum Mechanics” in which they study those atoms in greater depth, learn about fermions and bosons and ultimately understand the structure of the periodic table of elements. Later on, they can take courses in the mathematical foundations of quantum mechanics, and, yes, on quantum information, quantum field theory and on the application of quantum physics to much bigger objects in “solid-state physics”.

Despite these structural questions, I was pretty pleased with the course overall: the entire two-hundred-plus students take it at the beginning of their second year, thirty lectures, ten ungraded problem sheets and seven in-class problems called “classworks”. Still to come: a short test right after New Year’s and the final exam in June. Because it was my first time giving these lectures, and because it’s such an integral part of our teaching, I stuck to to the same notes and problems as my recent predecessors (so many, many thanks to my colleagues Paul Dauncey and Danny Segal).

Once the students got over my funny foreign accent, bad board handwriting, and worse jokes, I think I was able to get across both the mathematics, the physical principles and, eventually, the underlying weirdness, of quantum physics. I kept to the standard Copenhagen Interpretation of quantum physics, in which we think of the aforementioned wavefunction as a real, physical thing, which evolves under that Schrödinger equation — except when we decide to make a measurement, at which point it undergoes what we call collapse, randomly and seemingly against causality: this was Einstein’s “spooky action at a distance” which seemed to indicate nature playing dice with our Universe, in contrast to the purely deterministic physics of Newton and Einstein’s own relativity. No one is satisfied with Copenhagen, although a more coherent replacement has yet to be found (I won’t enumerate the possibilities here, except to say that I find the proliferating multiverse of Everett’s Many-Worlds interpretation ontologically extravagant, and Chris FuchsQuantum Bayesianism compelling but incomplete).

I am looking forward to getting this year’s SOLE results to find out for sure, but I think the students learned something, or at least enjoyed trying to, although the applause at the end of each lecture seemed somewhat tinged with British irony.

Terence TaoQuantitative continuity estimates

Suppose {F: X \rightarrow Y} is a continuous (but nonlinear) map from one normed vector space {X} to another {Y}. The continuity means, roughly speaking, that if {x_0, x \in X} are such that {\|x-x_0\|_X} is small, then {\|F(x)-F(x_0)\|_Y} is also small (though the precise notion of “smallness” may depend on {x} or {x_0}, particularly if {F} is not known to be uniformly continuous). If {F} is known to be differentiable (in, say, the Frechét sense), then we in fact have a linear bound of the form

\displaystyle  \|F(x)-F(x_0)\|_Y \leq C(x_0) \|x-x_0\|_X

for some {C(x_0)} depending on {x_0}, if {\|x-x_0\|_X} is small enough; one can of course make {C(x_0)} independent of {x_0} (and drop the smallness condition) if {F} is known instead to be Lipschitz continuous.

In many applications in analysis, one would like more explicit and quantitative bounds that estimate quantities like {\|F(x)-F(x_0)\|_Y} in terms of quantities like {\|x-x_0\|_X}. There are a number of ways to do this. First of all, there is of course the trivial estimate arising from the triangle inequality:

\displaystyle  \|F(x)-F(x_0)\|_Y \leq \|F(x)\|_Y + \|F(x_0)\|_Y. \ \ \ \ \ (1)

This estimate is usually not very good when {x} and {x_0} are close together. However, when {x} and {x_0} are far apart, this estimate can be more or less sharp. For instance, if the magnitude of {F} varies so much from {x_0} to {x} that {\|F(x)\|_Y} is more than (say) twice that of {\|F(x_0)\|_Y}, or vice versa, then (1) is sharp up to a multiplicative constant. Also, if {F} is oscillatory in nature, and the distance between {x} and {x_0} exceeds the “wavelength” of the oscillation of {F} at {x_0} (or at {x}), then one also typically expects (1) to be close to sharp. Conversely, if {F} does not vary much in magnitude from {x_0} to {x}, and the distance between {x} and {x_0} is less than the wavelength of any oscillation present in {F}, one expects to be able to improve upon (1).

When {F} is relatively simple in form, one can sometimes proceed simply by substituting {x = x_0 + h}. For instance, if {F: R \rightarrow R} is the squaring function {F(x) = x^2} in a commutative ring {R}, one has

\displaystyle  F(x_0+h) = (x_0+h)^2 = x_0^2 + 2x_0 h+ h^2

and thus

\displaystyle  F(x_0+h) - F(x_0) = 2x_0 h + h^2

or in terms of the original variables {x, x_0} one has

\displaystyle  F(x) - F(x_0) = 2 x_0 (x-x_0) + (x-x_0)^2.

If the ring {R} is not commutative, one has to modify this to

\displaystyle  F(x) - F(x_0) = x_0 (x-x_0) + (x-x_0) x_0 + (x-x_0)^2.

Thus, for instance, if {A, B} are {n \times n} matrices and {\| \|_{op}} denotes the operator norm, one sees from the triangle inequality and the sub-multiplicativity {\| AB\|_{op} \leq \| A \|_{op} \|B\|_{op}} of operator norm that

\displaystyle  \| A^2 - B^2 \|_{op} \leq \| A - B \|_{op} ( 2 \|B\|_{op} + \|A - B \|_{op} ). \ \ \ \ \ (2)

If {F(x)} involves {x} (or various components of {x}) in several places, one can sometimes get a good estimate by “swapping” {x} with {x_0} at each of the places in turn, using a telescoping series. For instance, if we again use the squaring function {F(x) = x^2 = x x} in a non-commutative ring, we have

\displaystyle  F(x) - F(x_0) = x x - x_0 x_0

\displaystyle  = (x x - x_0 x) + (x_0 x - x_0 x_0)

\displaystyle  = (x-x_0) x + x_0 (x-x_0)

which for instance leads to a slight improvement of (2):

\displaystyle  \| A^2 - B^2 \|_{op} \leq \| A - B \|_{op} ( \| A\|_{op} + \|B\|_{op} ).

More generally, for any natural number {n}, one has the identity

\displaystyle  x^n - x_0^n = (x-x_0) (x^{n-1} + x^{n-2} x_0 + \dots + x x_0^{n-2} + x_0^{n-1}) \ \ \ \ \ (3)

in a commutative ring, while in a non-commutative ring one must modify this to

\displaystyle  x^n - x_0^n = \sum_{i=0}^{n-1} x_0^i (x-x_0) x^{n-1-i},

and for matrices one has

\displaystyle  \| A^n - B^n \|_{op} \leq \| A-B\|_{op} ( \|A\|_{op}^{n-1} + \| A\|_{op}^{n-2} \| B\|_{op} + \dots + \|B\|_{op}^{n-1} ).

Exercise 1 If {U} and {V} are unitary {n \times n} matrices, show that the commutator {[U,V] := U V U^{-1} V^{-1}} obeys the inequality

\displaystyle  \| [U,V] - I \|_{op} \leq 2 \| U - I \|_{op} \| V - I \|_{op}.

(Hint: first control {\| UV - VU \|_{op}}.)

Now suppose (for simplicity) that {F: {\bf R}^d \rightarrow {\bf R}^{d'}} is a map between Euclidean spaces. If {F} is continuously differentiable, then one can use the fundamental theorem of calculus to write

\displaystyle  F(x) - F(x_0) = \int_0^1 \frac{d}{dt} F( \gamma(t) )\ dt

where {\gamma: [0,1] \rightarrow Y} is any continuously differentiable path from {x_0} to {x}. For instance, if one uses the straight line path {\gamma(t) := (1-t) x_0 + tx}, one has

\displaystyle  F(x) - F(x_0) = \int_0^1 ((x-x_0) \cdot \nabla F)( (1-t) x_0 + t x )\ dt.

In the one-dimensional case {d=1}, this simplifies to

\displaystyle  F(x) - F(x_0) = (x-x_0) \int_0^1 F'( (1-t) x_0 + t x )\ dt. \ \ \ \ \ (4)

Among other things, this immediately implies the factor theorem for {C^k} functions: if {F} is a {C^k({\bf R})} function for some {k \geq 1} that vanishes at some point {x_0}, then {F(x)} factors as the product of {x-x_0} and some {C^{k-1}} function {G}. Another basic consequence is that if {\nabla F} is uniformly bounded in magnitude by some constant {C}, then {F} is Lipschitz continuous with the same constant {C}.

Applying (4) to the power function {x \mapsto x^n}, we obtain the identity

\displaystyle  x^n - x_0^n = n (x-x_0) \int_0^1 ((1-t) x_0 + t x)^{n-1}\ dt \ \ \ \ \ (5)

which can be compared with (3). Indeed, for {x_0} and {x} close to {1}, one can use logarithms and Taylor expansion to arrive at the approximation {((1-t) x_0 + t x)^{n-1} \approx x_0^{(1-t) (n-1)} x^{t(n-1)}}, so (3) behaves a little like a Riemann sum approximation to (5).

Exercise 2 For each {i=1,\dots,n}, let {X^{(1)}_i} and {X^{(0)}_i} be random variables taking values in a measurable space {R_i}, and let {F: R_1 \times \dots \times R_n \rightarrow {\bf R}^m} be a bounded measurable function.

  • (i) (Lindeberg exchange identity) Show that

    \displaystyle  \mathop{\bf E} F(X^{(1)}_1,\dots,X^{(1)}_n) - \mathop{\bf E} F(X^{(0)}_1,\dots,X^{(0)}_n)

    \displaystyle = \sum_{i=1}^n \mathop{\bf E} F( X^{(1)}_1,\dots, X^{(1)}_{i-1}, X^{(1)}_i, X^{(0)}_{i+1}, \dots, X^{(0)}_n)

    \displaystyle - \mathop{\bf E} F( X^{(1)}_1,\dots, X^{(1)}_{i-1}, X^{(0)}_i, X^{(0)}_{i+1}, \dots, X^{(0)}_n).

  • (ii) (Knowles-Yin exchange identity) Show that

    \displaystyle  \mathop{\bf E} F(X^{(1)}_1,\dots,X^{(1)}_n) - \mathop{\bf E} F(X^{(0)}_1,\dots,X^{(0)}_n)

    \displaystyle = \int_0^1 \sum_{i=1}^n \mathop{\bf E} F( X^{(t)}_1,\dots, X^{(t)}_{i-1}, X^{(1)}_i, X^{(t)}_{i+1}, \dots, X^{(t)}_n)

    \displaystyle - \mathop{\bf E} F( X^{(t)}_1,\dots, X^{(t)}_{i-1}, X^{(0)}_i, X^{(t)}_{i+1}, \dots, X^{(t)}_n)\ dt,

    where {X^{(t)}_i = 1_{I_i \leq t} X^{(0)}_i + 1_{I_i > t} X^{(1)}_i} is a mixture of {X^{(0)}_i} and {X^{(1)}_i}, with {I_1,\dots,I_n} uniformly drawn from {[0,1]} independently of each other and of the {X^{(0)}_1,\dots,X^{(0)}_n, X^{(1)}_0,\dots,X^{(1)}_n}.

  • (iii) Discuss the relationship between the identities in parts (i), (ii) with the identities (3), (5).

(The identity in (i) is the starting point for the Lindeberg exchange method in probability theory, discussed for instance in this previous post. The identity in (ii) can also be used in the Lindeberg exchange method; the terms in the right-hand side are slightly more symmetric in the indices {1,\dots,n}, which can be a technical advantage in some applications; see this paper of Knowles and Yin for an instance of this.)

Exercise 3 If {F: {\bf R}^d \rightarrow {\bf R}^{d'}} is continuously {k} times differentiable, establish Taylor’s theorem with remainder

\displaystyle  F(x) = \sum_{j=0}^{k-1} \frac{1}{j!} (((x-x_0) \cdot \nabla)^j F)( x_0 )

\displaystyle + \int_0^1 \frac{(1-t)^{k-1}}{(k-1)!} (((x-x_0) \cdot \nabla)^k F)((1-t) x_0 + t x)\ dt.

If {\nabla^k F} is bounded, conclude that

\displaystyle  |F(x) - \sum_{j=0}^{k-1} \frac{1}{j!} (((x-x_0) \cdot \nabla)^j F)( x_0 )|

\displaystyle \leq \frac{|x-x_0|^k}{k!} \sup_{y \in {\bf R}^d} |\nabla^k F(y)|.

For real scalar functions {F: {\bf R}^d \rightarrow {\bf R}}, the average value of the continuous real-valued function {(x - x_0) \cdot \nabla F((1-t) x_0 + t x)} must be attained at some point {t} in the interval {[0,1]}. We thus conclude the mean-value theorem

\displaystyle  F(x) - F(x_0) = ((x - x_0) \cdot \nabla F)((1-t) x_0 + t x)

for some {t \in [0,1]} (that can depend on {x}, {x_0}, and {F}). This can for instance give a second proof of fact that continuously differentiable functions {F} with bounded derivative are Lipschitz continuous. However it is worth stressing that the mean-value theorem is only available for real scalar functions; it is false for instance for complex scalar functions. A basic counterexample is given by the function {e(x) := e^{2\pi i x}}; there is no {t \in [0,1]} for which {e(1) - e(0) = e'(t)}. On the other hand, as {e'} has magnitude {2\pi}, we still know from (4) that {e} is Lipschitz of constant {2\pi}, and when combined with (1) we obtain the basic bounds

\displaystyle  |e(x) - e(y)| \leq \min( 2, 2\pi |x-y| )

which are already very useful for many applications.

Exercise 4 Let {H_0, V} be {n \times n} matrices, and let {t} be a non-negative real.

  • (i) Establish the Duhamel formula

    \displaystyle  e^{t(H_0+V)} = e^{tH_0} + \int_0^t e^{(t-s) H_0} V e^{s (H_0+V)}\ ds

    \displaystyle  = e^{tH_0} + \int_0^t e^{(t-s) (H_0+V)} V e^{s H_0}\ ds

    where {e^A} denotes the matrix exponential of {A}. (Hint: Differentiate {e^{(t-s) H_0} e^{s (H_0+V)}} or {e^{(t-s) (H_0+V)} e^{s H_0}} in {s}.)

  • (ii) Establish the iterated Duhamel formula

    \displaystyle  e^{t(H_0+V)} = e^{tH_0} + \sum_{j=1}^k \int_{0 \leq t_1 \leq \dots \leq t_j \leq t}

    \displaystyle e^{(t-t_j) H_0} V e^{(t_j-t_{j-1}) H_0} V \dots e^{(t_2-t_1) H_0} V e^{t_1 H_0}\ dt_1 \dots dt_j

    \displaystyle  + \int_{0 \leq t_1 \leq \dots \leq t_{k+1} \leq t}

    \displaystyle  e^{(t-t_{k+1}) (H_0+V)} V e^{(t_{k+1}-t_k) H_0} V \dots e^{(t_2-t_1) H_0} V e^{t_1 H_0}\ dt_1 \dots dt_{k+1}

    for any {k \geq 0}.

  • (iii) Establish the infinitely iterated Duhamel formula

    \displaystyle  e^{t(H_0+V)} = e^{tH_0} + \sum_{j=1}^\infty \int_{0 \leq t_1 \leq \dots \leq t_j \leq t}

    \displaystyle e^{(t-t_j) H_0} V e^{(t_j-t_{j-1}) H_0} V \dots e^{(t_2-t_1) H_0} V e^{t_1 H_0}\ dt_1 \dots dt_j.

  • (iv) If {H(t)} is an {n \times n} matrix depending in a continuously differentiable fashion on {t}, establish the variation formula

    \displaystyle  \frac{d}{dt} e^{H(t)} = (F(\mathrm{ad}(H(t))) H'(t)) e^{H(t)}

    where {\mathrm{ad}(H)} is the adjoint representation {\mathrm{ad}(H)(V) = HV - VH} applied to {H}, and {F} is the function

    \displaystyle  F(z) := \int_0^1 e^{sz}\ ds

    (thus {F(z) = \frac{e^z-1}{z}} for non-zero {z}), with {F(\mathrm{ad}(H(t)))} defined using functional calculus.

We remark that further manipulation of (iv) of the above exercise using the fundamental theorem of calculus eventually leads to the Baker-Campbell-Hausdorff-Dynkin formula, as discussed in this previous blog post.

Exercise 5 Let {A, B} be positive definite {n \times n} matrices, and let {Y} be an {n \times n} matrix. Show that there is a unique solution {X} to the Sylvester equation

\displaystyle  AX + X B = Y

which is given by the formula

\displaystyle  X = \int_0^\infty e^{-tA} Y e^{-tB}\ dt.

In the above examples we had applied the fundamental theorem of calculus along linear curves {\gamma(t) = (1-t) x_0 + t x}. However, it is sometimes better to use other curves. For instance, the circular arc {\gamma(t) = \cos(\pi t/2) x_0 + \sin(\pi t/2) x} can be useful, particularly if {x_0} and {x} are “orthogonal” or “independent” in some sense; a good example of this is the proof by Maurey and Pisier of the gaussian concentration inequality, given in Theorem 8 of this previous blog post. In a similar vein, if one wishes to compare a scalar random variable {X} of mean zero and variance one with a Gaussian random variable {G} of mean zero and variance one, it can be useful to introduce the intermediate random variables {\gamma(t) := (1-t)^{1/2} X + t^{1/2} G} (where {X} and {G} are independent); note that these variables have mean zero and variance one, and after coupling them together appropriately they evolve by the Ornstein-Uhlenbeck process, which has many useful properties. For instance, one can use these ideas to establish monotonicity formulae for entropy; see e.g. this paper of Courtade for an example of this and further references. More generally, one can exploit curves {\gamma} that flow according to some geometrically natural ODE or PDE; several examples of this occur famously in Perelman’s proof of the Poincaré conjecture via Ricci flow, discussed for instance in this previous set of lecture notes.

In some cases, it is difficult to compute {F(x)-F(x_0)} or the derivative {\nabla F} directly, but one can instead proceed by implicit differentiation, or some variant thereof. Consider for instance the matrix inversion map {F(A) := A^{-1}} (defined on the open dense subset of {n \times n} matrices consisting of invertible matrices). If one wants to compute {F(B)-F(A)} for {B} close to {A}, one can write temporarily write {F(B) - F(A) = E}, thus

\displaystyle  B^{-1} - A^{-1} = E.

Multiplying both sides on the left by {B} to eliminate the {B^{-1}} term, and on the right by {A} to eliminate the {A^{-1}} term, one obtains

\displaystyle  A - B = B E A

and thus on reversing these steps we arrive at the basic identity

\displaystyle  B^{-1} - A^{-1} = B^{-1} (A - B) A^{-1}. \ \ \ \ \ (6)

For instance, if {H_0, V} are {n \times n} matrices, and we consider the resolvents

\displaystyle  R_0(z) := (H_0 - z I)^{-1}; \quad R_V(z) := (H_0 + V - zI)^{-1}

then we have the resolvent identity

\displaystyle  R_V(z) - R_0(z) = - R_V(z) V R_0(z) \ \ \ \ \ (7)

as long as {z} does not lie in the spectrum of {H_0} or {H_0+V} (for instance, if {H_0}, {V} are self-adjoint then one can take {z} to be any strictly complex number). One can iterate this identity to obtain

\displaystyle  R_V(z) = \sum_{j=0}^k (-R_0(z) V)^j R_0(z) + (-R_V(z) V) (-R_0(z) V)^k R_0(z)

for any natural number {k}; in particular, if {R_0(z) V} has operator norm less than one, one has the Neumann series

\displaystyle  R_V(z) = \sum_{j=0}^\infty (-R_0(z) V)^j R_0(z).

Similarly, if {A(t)} is a family of invertible matrices that depends in a continuously differentiable fashion on a time variable {t}, then by implicitly differentiating the identity

\displaystyle  A(t) A(t)^{-1} = I

in {t} using the product rule, we obtain

\displaystyle  (\frac{d}{dt} A(t)) A(t)^{-1} + A(t) \frac{d}{dt} A(t)^{-1} = 0

and hence

\displaystyle  \frac{d}{dt} A(t)^{-1} = - A(t)^{-1} (\frac{d}{dt} A(t)) A(t)^{-1}

(this identity may also be easily derived from (6)). One can then use the fundamental theorem of calculus to obtain variants of (6), for instance by using the curve {\gamma(t) = (1-t) A + tB} we arrive at

\displaystyle  B^{-1} - A^{-1} = \int_0^1 ((1-t)A + tB)^{-1} (A-B) ((1-t)A + tB)^{-1}\ dt

assuming that the curve stays entirely within the set of invertible matrices. While this identity may seem more complicated than (6), it is more symmetric, which conveys some advantages. For instance, using this identity it is easy to see that if {A, B} are positive definite with {A>B} in the sense of positive definite matrices (that is, {A-B} is positive definite), then {B^{-1} > A^{-1}}. (Try to prove this using (6) instead!)

Exercise 6 If {A} is an invertible {n \times n} matrix and {u, v} are {n \times 1} vectors, establish the Sherman-Morrison formula

\displaystyle  (A + t uv^T)^{-1} = A^{-1} - \frac{t}{1 + t v^T A^{-1} u} A^{-1} uv^T A^{-1}

whenever {t} is a scalar such that {1 + t v^T A^{-1} u} is non-zero. (See also this previous blog post for more discussion of these sorts of identities.)

One can use the Cauchy integral formula to extend these identities to other functions of matrices. For instance, if {F: {\bf C} \rightarrow {\bf C}} is an entire function, and {\gamma} is a counterclockwise contour that goes around the spectrum of both {H_0} and {H_0+V}, then we have

\displaystyle  F(H_0+V) = \frac{-1}{2\pi i} \int_\gamma F(z) R_V(z)\ dz

and similarly

\displaystyle  F(H_0) = \frac{-1}{2\pi i} \int_\gamma F(z) R_0(z)\ dz

and hence by (7) one has

\displaystyle  F(H_0+V) - F(H_0) = \frac{1}{2\pi i} \int_\gamma F(z) R_V(z) V F_0(z)\ dz;

similarly, if {H(t)} depends on {t} in a continuously differentiable fashion, then

\displaystyle  \frac{d}{dt} F(H(t)) = \frac{1}{2\pi i} \int_\gamma F(z) (H(t) - zI)^{-1} H'(t) (z) (H(t)-zI)^{-1}\ dz

as long as {\gamma} goes around the spectrum of {H(t)}.

Exercise 7 If {H(t)} is an {n \times n} matrix depending continuously differentiably on {t}, and {F: {\bf C} \rightarrow {\bf C}} is an entire function, establish the tracial chain rule

\displaystyle  \frac{d}{dt} \hbox{tr} F(H(t)) = \hbox{tr}(F'(H(t)) H'(t)).

In a similar vein, given that the logarithm function is the antiderivative of the reciprocal, one can express the matrix logarithm {\log A} of a positive definite matrix by the fundamental theorem of calculus identity

\displaystyle  \log A = \int_0^\infty (I + sI)^{-1} - (A + sI)^{-1}\ ds

(with the constant term {(I+tI)^{-1}} needed to prevent a logarithmic divergence in the integral). Differentiating, we see that if {A(t)} is a family of positive definite matrices depending continuously on {t}, that

\displaystyle  \frac{d}{dt} \log A(t) = \int_0^\infty (A(t) + sI)^{-1} A'(t) (A(t)+sI)^{-1}\ dt.

This can be used for instance to show that {\log} is a monotone increasing function, in the sense that {\log A> \log B} whenever {A > B > 0} in the sense of positive definite matrices. One can of course integrate this formula to obtain some formulae for the difference {\log A - \log B} of the logarithm of two positive definite matrices {A,B}.

To compare the square root {A^{1/2} - B^{1/2}} of two positive definite matrices {A,B} is trickier; there are multiple ways to proceed. One approach is to use contour integration as before (but one has to take some care to avoid branch cuts of the square root). Another to express the square root in terms of exponentials via the formula

\displaystyle  A^{1/2} = \frac{1}{\Gamma(-1/2)} \int_0^\infty (e^{-tA} - I) t^{-1/2} \frac{dt}{t}

where {\Gamma} is the gamma function; this formula can be verified by first diagonalising {A} to reduce to the scalar case and using the definition of the Gamma function. Then one has

\displaystyle  A^{1/2} - B^{1/2} = \frac{1}{\Gamma(-1/2)} \int_0^\infty (e^{-tA} - e^{-tB}) t^{-1/2} \frac{dt}{t}

and one can use some of the previous identities to control {e^{-tA} - e^{-tB}}. This is pretty messy though. A third way to proceed is via implicit differentiation. If for instance {A(t)} is a family of positive definite matrices depending continuously differentiably on {t}, we can differentiate the identity

\displaystyle  A(t)^{1/2} A(t)^{1/2} = A(t)

to obtain

\displaystyle  A(t)^{1/2} \frac{d}{dt} A(t)^{1/2} + (\frac{d}{dt} A(t)^{1/2}) A(t)^{1/2} = \frac{d}{dt} A(t).

This can for instance be solved using Exercise 5 to obtain

\displaystyle  \frac{d}{dt} A(t)^{1/2} = \int_0^\infty e^{-sA(t)^{1/2}} A'(t) e^{-sA(t)^{1/2}}\ ds

and this can in turn be integrated to obtain a formula for {A^{1/2} - B^{1/2}}. This is again a rather messy formula, but it does at least demonstrate that the square root is a monotone increasing function on positive definite matrices: {A > B > 0} implies {A^{1/2} > B^{1/2} > 0}.

Several of the above identities for matrices can be (carefully) extended to operators on Hilbert spaces provided that they are sufficiently well behaved (in particular, if they have a good functional calculus, and if various spectral hypotheses are obeyed). We will not attempt to do so here, however.

Filed under: expository, math.CA, math.RA Tagged: estimates, matrix identities, stability

Tommaso DorigoPhysics-Inspired Artwork In Venice 1: Sub-Lime

This is the first of a series of posts that will publish the results of artistic work by high-school students of three schools in Venice, who participate in a contest and exposition connected to the initiative "Art and Science across Italy", an initiative of the network CREATIONS, funded by the Horizon 2020 programme

read more

May 22, 2017

Dave BaconQuantum Advantage

Update (22 May 2017): This Scirate thread seems to have touched a nerve. Since this was previously buried in the comments here, it’s worth promoting to the top of the post. I think that “quantum computational supremacy” addresses the concern. Basically, we use “quantum” as an adjective for our peer group, which makes the analogy to “white” too strong. Adding “computational” emphasizes that it is the computation, not the people, that are supreme.

I’ve had quite a few conversations lately about a comment I left on Scirate. The paper at that link, “Quantum advantage with shallow circuits” by Sergey Bravyi, David Gosset, Robert Koenig, shows a provable separation between analogous classes of quantum and classical circuits, even when the quantum circuit is restricted to nearest-neighbor gates on a 2D grid. This is a fantastic result! My comment, however, wasn’t regarding the result, but rather the title of the paper. I’m just happy that they called it a “quantum advantage” instead of using that other term…

The term “quantum supremacy” is the fashionable name for the quantum experiments attempting to beat classical computers at some given task, not necessarily a useful one. According to current usage, the term (strangely) only applies to computational problems. The theoretical and experimental work towards demonstrating this is wonderful. But the term itself, as any native English speaker can tell you, has the unfortunate feature that it immediately calls to mind “white supremacy”. Indeed, one can even quantify this using a Google ngram search for *_ADJ supremacy over all books in Google’s corpus between 1900 and 2008:

None of these terms has a particularly good connotation, but white supremacy (the worst on the list) is an order of magnitude more common than the others and has, on net, been growing since the 30s. For almost every native speaker that I’ve talked to, and quite a few non-native speakers as well, the taint of this is hard to escape. (For speakers of German or French, this word is a bit like “Vormachtstellung” or “collaboration” respectively.)

The humor surrounding this term has always been in bad taste — talking about “quantum supremacists” and jokes about disavowing their support — but it was perhaps tolerable before the US election in November. Given that there are several viable alternatives, for example “quantum advantage” or even “quantum superiority”, can we please agree as a community to abandon this awful term?

This isn’t about being PC. And I’m not trying to shame any of the people that have used this term. It’s just a poor word choice, and we don’t have to be stuck with it. Connotations of words matter: you don’t say someone is “scrawny” if you mean they are thin, even though my thesaurus lists these words as synonyms. Given the readily available alternatives, the only case I can think of for “supremacy” at this point is inertia, which is a rather poor argument.

So please, say it with me now: quantum advantage.

Update: Ashley Montanaro points out that “advantage” should potentially be reserved for a slight advantage. I maintain that “superiority” is still a good choice, and I also offer “dominance” as another alternative. Martin Schwarz suggests some variation of “breaking the X barrier”, which has a nice feel to it. 

Scott AaronsonUnsong of unsongs

On Wednesday, Scott Alexander finally completed his sprawling serial novel Unsong, after a year and a half of weekly updates—incredibly, in his spare time while also working as a full-term resident in psychiatry, and also regularly updating Slate Star Codex, which I consider to be the world’s best blog.  I was honored to attend a party in Austin (mirroring parties in San Francisco, Boston, Tel Aviv, and elsewhere) to celebrate Alexander’s release of the last chapter—depending on your definition, possibly the first “fan event” I’ve ever attended.

Like many other nerds I’ve met, I’d been following Unsong almost since the beginning—with its mix of Talmudic erudition, CS humor, puns, and even a shout-out to Quantum Computing Since Democritus (which shows up as Ben Aharon’s Gematria Since Adam), how could I not be?  I now count Unsong as one of my favorite works of fiction, and Scott Alexander alongside Rebecca Newberger Goldstein among my favorite contemporary novelists.  The goal of this post is simply to prod readers of my blog who don’t yet know Unsong: if you’ve ever liked anything here on Shtetl-Optimized, then I predict you’ll like Unsong, and probably more.


Though not trivial to summarize, Unsong is about a world where the ideas of religion and mysticism—all of them, more or less, although with a special focus on kabbalistic Judaism—turn out to be true.  In 1968, the Apollo 8 mission leads not to an orbit of the Moon, as planned, but instead to cracking an invisible crystal sphere that had surrounded the Earth for millennia.  Down through the crack rush angels, devils, and other supernatural forces.  Life on Earth becomes increasingly strange: on the one hand, many technologies stop working; on the other, people can now gain magical powers by speaking various names of God.  A worldwide industry arises to discover new names of God by brute-force search through sequences of syllables.  And a powerful agency, the eponymous UNSONG (United Nations Subcommittee on Names of God), is formed to enforce kabbalistic copyright law, hunting down and punishing anyone who speaks divine names without paying licensing fees to the theonomic corporations.

As the story progresses, we learn that eons ago, there was an epic battle in Heaven between Good and Evil, and Evil had the upper hand.  But just as all seemed lost, an autistic angel named Uriel reprogrammed the universe to run on math and science rather than on God’s love, as a last-ditch strategy to prevent Satan’s forces from invading the sublunary realm.  Molecular biology, the clockwork regularity of physical laws, false evidence for a huge and mindless cosmos—all these were retconned into the world’s underpinnings.  Uriel did still need to be occasionally involved, but less as a loving god than as an overworked sysadmin: for example, he descended to Mount Sinai to warn humans never to boil goats in their mothers’ milk, because he discovered that doing so (like the other proscribed activities in the Torah, Uriel’s readme file) triggered bugs in the patchwork of code that was holding the universe together.  Now that the sky has cracked, Uriel is forced to issue increasingly desperate patches, and even those will only buy a few decades until his math-and-science-based world stops working entirely, with Satan again triumphant.

Anyway, that’s a tiny part of the setup.  Through 72 chapters and 22 interludes, there’s world-building and philosophical debates and long kabbalistic digressions.  There are battle sequences (the most striking involves the Lubavitcher Rebbe riding atop a divinely-animated Statue of Liberty like a golem).  There’s wordplay and inside jokes—holy of holies are there those—including, notoriously, a sequence of cringe-inducing puns involving whales.  But in this story, wordplay isn’t just there for the hell of it: Scott Alexander has built an entire fictional universe that runs on wordplay—one where battles between the great masters, the equivalent of the light-saber fights in Star Wars, are conducted by rearranging letters in the sky to give them new meanings.  Scott A. famously claims he’s bad at math (though if you read anything he’s written on statistics or logic puzzles, it’s clear he undersells himself).  One could read Unsong as Alexander’s book-length answer to the question: what could it mean for the world to be law-governed but not mathematical?  What if the Book of Nature were written in English, or Hebrew, or other human languages, and if the Newtons and Einsteins were those who were most adept with words?

I should confess that for me, the experience of reading Unsong was colored by the knowledge that, in his years of brilliant and prolific writing, lighting up the blogosphere like a comet, the greatest risk Scott Alexander ever took (by his own account) was to defend me.  It’s like, imagine that in Elizabethan England, you were placed in the stocks and jeered at by thousands for advocating some unpopular loser cause—like, I dunno, anti-cat-burning or something.  And imagine that, when it counted, your most eloquent supporter was a then-obscure poet from Stratford-upon-Avon.  You’d be grateful to the poet, of course; you might even become a regular reader of his work, even if it wasn’t good.  But if the same poet went on to write Hamlet or Macbeth?  It might almost be enough for you to volunteer to be scorned and pilloried all over again, just for the honor of having the Bard divert a rivulet of his creative rapids to protesting on your behalf.

Yes, a tiny part of me had a self-absorbed child’s reaction to Unsong: “could Amanda Marcotte have written this?  could Arthur Chu?  who better to have in your camp: the ideologues du jour of Twitter and Metafilter, and RationalWiki?  Or a lone creative genius, someone who can conjure whole worlds into being, as though graced himself with the Shem haMephorash of which he writes?”  Then of course I’d catch myself, and think: no, if you want to be in Scott Alexander’s camp, then the only way to do it is to be in nobody’s camp.  If two years ago it was morally justified to defend me, then the reasons why have nothing to do with the literary gifts of any of my defenders.  And conversely, the least we can do for Unsong is to judge it by what’s on the page, rather than as a soldier in some army fielded by the Gray Tribe.

So in that spirit, let me explain some of what’s wrong with Unsong.  That it’s a first novel sometimes shows.  It’s brilliant on world-building and arguments and historical tidbits and jokes, epic on puns, and uneven on character and narrative flow.  The story jumps around spasmodically in time, so much so that I needed a timeline to keep track of what was happening.  Subplots that are still open beget additional subplots ad headacheum, like a string of unmatched left-parentheses.  Even more disorienting, the novel changes its mind partway through about its narrative core.  Initially, the reader is given a clear sense that this is going to be a story about a young Bay Area kabbalist named Aaron Smith-Teller, his not-quite-girlfriend Ana, and their struggle for supernatural fair-use rights.  Soon, though, Aaron and Ana become almost side characters, their battle against UNSONG just one subplot among many, as the focus shifts to the decades-long war between the Comet King, a messianic figure come to rescue humanity, and Thamiel, the Prince of Hell.  For the Comet King, even saving the earth from impending doom is too paltry a goal to hold his interest much.  As a strict utilitarian and fan of Peter Singer, the Comet King’s singleminded passion is destroying Hell itself, and thereby rescuing the billions of souls who are trapped there for eternity.

Anyway, unlike the Comet King, and unlike a certain other Scott A., I have merely human powers to marshal my time.  I also have two kids and a stack of unwritten papers.  So let me end this post now.  If the post causes just one person to read Unsong who otherwise wouldn’t have, it will be as if I’ve nerdified the entire world.

John PreskillThe power of information

Sara Imari Walker studies ants. Her entomologist colleague Gabriele Valentini cultivates ant swarms. Gabriele coaxes a swarm from its nest, hides the nest, and offers two alternative nests. Gabriele observe the ants’ responses, then analyzes their data with Sara.

Sara doesn’t usually study ants. She trained in physics, information theory, and astrobiology. (Astrobiology is the study of life; life’s origins; and conditions amenable to life, on Earth and anywhere else life may exist.) Sara analyzes how information reaches, propagates through, and manifests in the swarm.

Some ants inspect one nest; some, the other. Few ants encounter both choices. Yet most of the ants choose simultaneously. (How does Gabriele know when an ant chooses? Decided ants carry other ants toward the chosen nest. Undecided ants don’t.)

Gabriele and Sara plotted each ant’s status (decided or undecided) at each instant. All the ants’ lines start in the “undecided” region, high up in the graph. Most lines drop to the “decided” region together. Physicists call such dramatic, large-scale changes in many-particle systems “phase transitions.” The swarm transitions from the “undecided” phase to the “decided,” as moisture transitions from vapor to downpour.

Sara presentation

Sara versus the ants

Look from afar, and you’ll see evidence of a hive mind: The lines clump and slump together. Look more closely, and you’ll find lags between ants’ decisions. Gabriele and Sara grouped the ants according to their behaviors. Sara explained the grouping at a workshop this spring.

The green lines, she said, are undecided ants.

My stomach dropped like Gabriele and Sara’s ant lines.

People call data “cold” and “hard.” Critics lambast scientists for not appealing to emotions. Politicians weave anecdotes into their numbers, to convince audiences to care.

But when Sara spoke, I looked at her green lines and thought, “That’s me.”

I’ve blogged about my indecisiveness. Postdoc Ning Bao and I formulated a quantum voting scheme in which voters can superpose—form quantum combinations of—options. Usually, when John Preskill polls our research group, I abstain from voting. Politics, and questions like “Does building a quantum computer require only engineering or also science?”,1 have many facets. I want to view such questions from many angles, to pace around the questions as around a sculpture, to hear other onlookers, to test my impressions on them, and to cogitate before choosing.2 However many perspectives I’ve gathered, I’m missing others worth seeing. I commiserated with the green-line ants.


I first met Sara in the building behind the statue. Sara earned her PhD in Dartmouth College’s physics department, with Professor Marcelo Gleiser.

Sara presented about ants at a workshop hosted by the Beyond Center for Fundamental Concepts in Science at Arizona State University (ASU). The organizers, Paul Davies of Beyond and Andrew Briggs of Oxford, entitled the workshop “The Power of Information.” Participants represented information theory, thermodynamics and statistical mechanics, biology, and philosophy.

Paul and Andrew posed questions to guide us: What status does information have? Is information “a real thing” “out there in the world”? Or is information only a mental construct? What roles can information play in causation?

We paced around these questions as around a Chinese viewing stone. We sat on a bench in front of those questions, stared, debated, and cogitated. We taught each other about ants, artificial atoms, nanoscale machines, and models for information processing.


Chinese viewing stone in Yuyuan Garden in Shanghai

I wonder if I’ll acquire opinions about Paul and Andrew’s questions. Maybe I’ll meander from “undecided” to “decided” over a career. Maybe I’ll phase-transition like Sara’s ants. Maybe I’ll remain near the top of her diagram, a green holdout.

I know little about information’s power. But Sara’s plot revealed one power of information: Information can move us—from homeless to belonging, from ambivalent to decided, from a plot’s top to its bottom, from passive listener to finding yourself in a green curve.


With thanks to Sara Imari Walker, Paul Davies, Andrew Briggs, Katherine Smith, and the Beyond Center for their hospitality and thoughts.


1By “only engineering,” I mean not “merely engineering” pejoratively, but “engineering and no other discipline.”

2I feel compelled to perform these activities before choosing. I try to. Psychological experiments, however, suggest that I might decide before realizing that I’ve decided.

May 21, 2017

Tommaso DorigoEurope In My Region 2017

The European Commission has launched a couple of weeks ago the campaign "Europe in my region 2017", an initiative aimed at getting the general public informed on the projects funded by the European Community in their area of residence or activity. There are open day events scheduled a bit everywhere, a blog contest, a photo contest, and other initiatives of interest.

read more

Jordan EllenbergWhat is the median length of homeownership?

Well, it’s longer than it used to be, per Conor Dougherty in the New York Times:

The median length of time people have owned their homes rose to 8.7 years in 2016, more than double what it had been 10 years earlier.

The accompanying chart shows that “median length of homeownership” used to hover at  just under 4 years.  That startled me!  Doesn’t 4 years seem like a pretty short length of time to own a house?

When I thought about this a little more, I realized I had no idea what this meant.  What is the “median length of homeownership” in 2017?  Does it mean you go around asking each owner-occupant how long they’ve lived in their house, and take the median of those numbers?  Probably not:  when people were asked that in 2008, the median answer was 10 years, and whatever the Times was measuring was about 3.7 years in 2008.

Does it mean you look at all house sales in 2017, subtract the time since last sale, and take the median of those numbers?

Suppose half of all houses changed hands every year, and the other half changed hands every thirty years.  Are the lengths of ownership we’re medianning half “one year” and half “30 years”, or “30/31 1 year” and 1/31 “30 years”?

There are about 75 million owner-occupied housing units in the US and 4-6 million homes sold per year, so the mean number of sales per unit per year is certainly way less than 1/4; of course, there’s no reason this mean should be close to the median of, well, whatever we’re taking the median of.

Basically I have no idea what’s being measured.  The Times doesn’t link to the Moody’s Analytics study it’s citing, and Dougherty says that study’s not public.  I did some Googling for “median length of homeownership” and as far as I can tell this isn’t a standard term of art with a consensus definition.

As papers run more data-heavy pieces I’d love to see a norm develop that there should be some way for the interested reader to figure out exactly what the numbers in the piece refer to.  Doesn’t even have to be in the main text.  Could be a linked sidebar.  I know not everybody cares about this stuff.  But I do!




Chad OrzelKids and Construction Update

The big development of this week is that construction started on the Great Chateau Steelypips Renovation of 2017. We’re extending one part of the back of the house about ten feet to gain a bedroom on the second floor, and gut-renovating the kitchen, dining room, and mud room. This is a massive undertaking, with a massive price tag (more than we paid for the house in 2002), and is going to be massively disruptive later this summer. Right now, it’s very exciting, because we have an excavator and giant piles of dirt in our back yard:

The Pip looking at the excavator used to dig the foundation for our new addition.

SteelyKid is Queen of the Dirt Pile.

In the first week of work, they’ve dug the hole, poured the concrete footer, built the reinforced cinderblock wall, and sealed and insulated what will be the crawl space under the new addition. Next week, they should start framing the exterior. The part where they rip out the entire kitchen won’t be for a good while yet, but I’m already kind of dreading it…

While out exploring the yard, I also got a good Kate-and-kids shot:

Kate and the sillyheads.

(When I saw this pose, I immediately said, “We are but poor, lost circus performers…” which is probably the only time in her life Kate has been compared to Andre the Giant… )

And to close this out, a couple of action shots, which also indicate the importance of the background in photography. This shot of Kate pushing The Pip on a tire swing, for example:

Kate pushing The Pip on a tire swing.

There’s a lot I like about this, but it would be immeasurably improved without the bored teenager checking his phone in the background. There are other shots from this sequence that would be even better, but Bored Teen Boy is even more disruptive. Alas, my photo-editing skills don’t really extend to clipping this dude out of the photo.

On the other hand, there’s this shot of SteelyKid running to home during a softball game:

In this case, the reactions of the other people in the background kind of make the shot. I have another sequence of these, without those people, and while they’re great for showing motion and the intensity of her run, they’re lacking an indefinable something that this one has thanks in part to Young Rick Astley in the background, there.

So, you know, with candid action photos, sometimes you win or lose depending on random people who have nothing to do with the shot you’re trying to get. Not the most profound lesson, maybe, but that’s what I’ve got for you this week.

May 20, 2017

Geraint F. LewisThe Chick Peck Problem

So, my plans for my blog through 2017 have not quite gone to plan, but things have been horrendously busy, and it seems like the rest of the year is likely to continue this way.

But I did get a chance to do some recreational mathematics, spurred on my a story in the news. It's to do with a problem presented at the 2017 Raytheon MATHCOUNTS® National Competition and reported in the New York Times. Here's the question as presented in the press:

Kudos to 13 year old Texan, Luke Robitialle, who got this right.

With a little thought, you should be able to realise that the answer is 25. For any particular chick, there are four potential out comes, each with equal probability. Either the chick is
  • pecked from the left
  • pecked from the right
  • pecked from left and right
  • not pecked at all
Only one of these options results in the chick being unpecked, and so the expected number of chicks unpecked in a circle of 100 is one quarter of this number, or 25.

ABC journalist and presenter extraordinaire, Leigh Sales, tweeted 
But it's the kind of maths that makes me ask more questions. 

While 25 is the expected number of unpecked chicks, what is the distribution of unpecked chicks? What I mean by this is that they peck left or right at random, there might be 24 unpecked chicks for one group of a hundred chicks, 25 for the next, and 23 for the next. So, the question is, given a large number of 100 chick experiments, what's the distribution of unpecked chicks?

I tried to find an analytic solution, but my brain is old, and so I resorted to good old numerical methods. Generate lots of experiments on a computer, and see what the outcomes look like. But there is an issue that we have to think about, namely the question of how many different configurations of chicks pecking left and right can have?

Well, left and right are two options, and for 100 chicks, the number of possible left and right configurations is 

That's a lot of possibilities! How are we going to sample these? 

Well, if we treat a 0 as "chick pecks to the left", and 1 as "check pecks to the right", then if we choose a random integer between 0 and 2100-1,  and represent it as a binary number, then that will be a random sampling of the pecking order (pecking order, get it!) As an example, all chicks peck to the left would be 100 0s in binary, where as all the chicks peck to the right would be 100 1s in binary. 

Let's try a randomly drawn integer in the range. We get (in base 10) 333483444300232384702347234. In binary this is

So, the first bunch of chicks peck to the left, then we have a mix of right to left pecks. 
But how many of these chicks are unpecked (remembering what the original question)? Well, for any particular chick, it will be unpecked if the chick to its left pecks to the left, and the chick to its right pecks to the right. So, we're looking for sequences of '001' and '011', with the middle digit representing the chick we are interested in. 
So, we can chick this into a little python code (had to learn it, all the cool kids are using it these days) and this is what I have
There is a little extra in there to account for the fact that the chicks are sitting in a circle, but as you can see, the code is quite compact.
OK. Let's run for the 100 chicks in the question. What do we get?
Yay! The unpecks peak at 25, but there is a nice distribution (which, I am sure, must have an analytic solution somewhere. 
But given the simplicity of the code, I can easily change the number of chicks. What about 10 chicks in circle?
Hmmm. Interesting. What about 1000 chicks?
And 33 chicks?
Most likely number of unpecked chicks is 8, but again, a nice distribution. 
Now, you might be sitting there wondering why the heck I am doing this? Well, firstly, because it is fun! And interesting! It's a question and it is fun to find the answer. 
Secondly it is not obvious what the distribution would be, and how complex it would be to derive, or even if it exists, and so a numerical approach allows us to find an answer. 
Finally, I can easily generalize this to questions like "what if the left pecks are more likely than right pecks by a factor of two, what would the distribution be like?" It would just take a couple of lines of code and I would have an answer. 
And if you can't see how such curiosity led examinations are integral to science, then you don't know what science is.

May 19, 2017

Doug NatelsonWhat are "hot" electrons?

In basic chemistry or introductory quantum mechanics, you learn about the idea of energy levels for electrons.  If you throw a bunch of electrons into some system, you also learn about the ground state, the lowest energy state of the whole system, where the electrons fill up* the levels from the bottom up, in accord with the Pauli principle.   In statistical physics, there are often a whole lot of energy levels and a whole lot of electrons (like \(10^{22}\) per cc), so we have to talk about distribution functions, and how many electrons are in the levels with energies between \(E\) and \(E + dE\).   In thermal equilibrium (meaning our system of interest is free to exchange energy in the form of heat with some large reservoir described by a well-defined temperature \(T\)), the distribution of electrons as a function of energy is given by the Fermi-Dirac distribution.

So, what are "hot" electrons?  If we have a system driven out of equilibrium, it's possible to have the electrons arranged in a non-thermal (non-FD distribution!) way.  Two examples are of particular interest at the nanoscale.  In a transistor, say, or other nanoelectronic device, it is possible to apply a voltage across the system so that \(eV >> k_{\mathrm{B}}T\) and inject charge carriers at energies well above the thermally distributed population.  Often electron-electron scattering on the 10-100 fs timescale redistributes the energy across the electrons, restoring a thermal distribution at some higher effective temperature (and on longer timescales, that energy cascades down into the vibrations of the lattice).  Electrons in a metal like Au at the top of the distribution are typically moving at speeds of \(\sim 10^{6}\) m/s (!!), so that means that near where the current is injected, on distance scales like 10-100 nm, there can be "hot" electrons well above the FD distribution.  

The other key way to generate "hot" electrons is by optical absorption.  A visible photon (perhaps a green one with an energy \(\hbar \omega\) of 2 eV) can be absorbed by a metal or a semiconductor, and this can excite an electron at an energy \(\hbar \omega\) above the top of the FD distribution.  Often, on the 10-100 fs timescale, as above, that energy gets redistributed among many electrons, and then later into the lattice.  That's heating by optical absorption.  In recent years, there has been an enormous amount of interest in trying to capture and use those hot electrons or their energy before there is a chance for that energy go become converted to heat.  See here, for instance, for thoughts about solar energy harvesting, or here for a discussion of hot electron photochemistry.  Nanoscale systems are of great interest in this field for several reasons, including the essential fact that hot electrons generated in them can access the system surface or boundary in the crucial timespan before energy relaxation.

(Talking about this and thermoelectricity now sets the stage so I can talk about our recent paper in an upcoming post.)

*Really, the whole many-body electron wavefunction has to be antisymmetric under the exchange of any two electrons, so it's wrong to talk as if one particular electron is sitting in one particular state, but let's ignore that for now.  Also, in general, the energy levels of the many-electron system actually depend on the number and arrangement of the electrons in the system (correlation effects!), but let's ignore that, too.

Tim GowersIntransitive dice II

I’m not getting the feeling that this intransitive-dice problem is taking off as a Polymath project. However, I myself like the problem enough to want to think about it some more. So here’s a post with some observations and with a few suggested subproblems that shouldn’t be hard to solve and that should shed light on the main problem. If the rate of comments by people other than me doesn’t pick up, then I think I’ll simply conclude that there wasn’t sufficient interest to run the project. However, if I do that, I have a back-up plan, which is to switch to a more traditional collaboration — that is, done privately with a small number of people. The one non-traditional aspect of it will be that the people who join the collaboration will select themselves by emailing me and asking to be part of it. And if the problem gets solved, it will be a normal multi-author paper. (There’s potentially a small problem if someone asks to join in with the collaboration and then contributes very little to it, but we can try to work out some sort of “deal” in advance.)

But I haven’t got to that point yet: let me see whether a second public post generates any more reaction.

I’ll start by collecting a few thoughts that have already been made in comments. And I’ll start that with some definitions. First of all, I’m going to change the definition of a die. This is because it probably makes sense to try to prove rigorous results for the simplest model for which they are true, and random multisets are a little bit frightening. But I am told that experiments suggest that the conjectured phenomenon occurs for the following model as well. We define an n-sided die to be a sequence A=(a_1,\dots,a_n) of integers between 1 and n such that \sum_ia_i=n(n+1)/2. A random n-sided die is just one of those chosen uniformly from the set of all of them. We say that A beats B if
\sum_{i,j}\mathbf 1_{[a_i>b_j]}>\sum_{i,j}\mathbf 1_{[a_i<b_j]}.
That is, A beats B if the probability, when you roll the two dice, that A shows a higher number than B is greater than the probability that B shows a higher number than A. If the two probabilities are equal then we say that A ties with B.

The main two conjectures are that the probability that two dice tie with each other tends to zero as n tends to infinity and that the “beats” relation is pretty well random. This has a precise meaning, but one manifestation of this randomness is that if you choose three dice A, B and C uniformly at random and are given that A beats B and B beats C, then the probability that A beats C is, for large n, approximately 1/2. In other words, transitivity doesn’t happen any more often than it does for a random tournament. (Recall that a tournament is a complete graph in which every edge is directed.)

Now let me define a function that helps one think about dice. Given a die A, define a function f_A on the set [n]=\{1,2,\dots,n\} by
f_A(j)=\sum_i(\mathbf 1_{[a_i>j]}-\mathbf 1_{[a_i<j]})=|\{i:a_i>j\}|-|\{i:a_i<j\}|.
Then it follows immediately from the definitions that A beats B if \sum_jf_A(b_j)>0, which is equivalent to the statement that \sum_jf_B(a_j)<0.

If the “beats” tournament is quasirandom, then we would expect that for almost every pair of dice A,B the remaining dice are split into four parts of roughly equal sizes, according to whether they beat A and whether they beat B. So for a typical pair of dice A,B we would like to show that \sum_jf_A(c_j)>0 for roughly half of all dice C, and \sum_jf_B(c_j)>0 for roughly half of all dice C, and that these two events have almost no correlation.

It is critical here that the sums should be fixed. Otherwise, if we are told that A beats B, the most likely explanation is that the sum of A is a bit bigger than the sum of B, and then A is significantly more likely to beat C than B is.

Note that for every die A we have
\sum_jf_A(j)=\sum_{i,j}(\mathbf 1_{[a_i>j]}-\mathbf 1_{[a_i<j]})
That is, every die ties with the die (1,2,\dots,n).

Now let me modify the functions f_A to make them a bit easier to think about, though not quite as directly related to the “beats” relation (though everything can be suitably translated). Define g_A(j) to be |\{i:a_i\leq j\}| and h_A(j) to be g_A(j)-j. Note that f_A(j)=(n-g_A(j))-g_A(j-1) which would normally be approximately equal to n-2g_A(j).

We are therefore interested in sums such as \sum_jg_A(c_j). I would therefore like to get a picture of what a typical sequence (g_A(1),\dots,g_A(n)) looks like. I’m pretty sure that g_A(j) has mean j. I also think it is distributed approximately normally around j. But I would also like to know about how g_A(j_1) and g_A(j_2) correlate, since this will help us get some idea of the variance of \sum_jg_A(c_j), which, if everything in sight is roughly normal, will pin down the distribution. I’d also like to know about the covariance of \sum_jg_A(c_j) and \sum_jg_B(c_j), or similar quantities anyway, but I don’t want to walk before I can fly.

Anyhow, I had the good fortune to see Persi Diaconis a couple of days ago, and he assured me that the kind of thing I wanted to understand had been studied thoroughly by probabilists and comes under the name “constrained limit theorems”. I’ve subsequently Googled that phrase and found some fairly old papers written in the typical uncompromising style and level of generality of their day, which leaves me thinking that it may be simpler to work a few things out from scratch. The main purpose of this post is to set out some exercises that have that as their goal.

What is the average of g_A(j)?

Suppose, then, that we have a random n-sided die A. Let’s begin by asking for a proper proof that the mean of g_A(j) is j. It clearly is if we choose a purely random n-tuple of elements of [n], but what happens if we constrain the average to be (n+1)/2?

I don’t see an easy proof. In fact, I’m not sure it’s true, and here’s why. The average will always be j if and only if the probability that a_1\leq j is always equal to j/n, and that is true if and only if a_1 is uniformly distributed. (The distributions of the a_i are of course identical, but — equally of course — not independent.) But do we expect a_1 to be uniformly distributed? No we don’t: if a_1=(n+1)/2 that will surely make it easier for the global average to be (n+1)/2 than if a_1=n.

However, I would be surprised if it were not at least approximately true. Here is how I would suggest proving it. (I stress that I am not claiming that this is an unknown result, or something that would detain a professional probabilist for more than two minutes — that is why I used the word “exercise” above. But I hope these questions will be useful exercises.)

The basic problem we want to solve is this: if a_1,\dots,a_n are chosen independently and uniformly from [n], then what is the conditional probability that a_1=j given that the average of the a_i is exactly (n+1)/2?

It’s not the aim of this post to give solutions, but I will at least say why I think that the problems aren’t too hard. In this case, we can use Bayes’s theorem. Using well-known estimates for sums of independent random variables, we can give good approximations to the probability that the sum is n(n+1)/2 and of the probability of that given that a_1=j (which is just the probability that the sum of the remaining n-1 a_is is n(n+1)/2-j\ ). We also know that the probability that a_1=j is 1/n. So we have all the information we need. I haven’t done the calculation, but my guess is that the tendency for a_1 to be closer to the middle than to the extremes is not very pronounced.

In fact, here’s a rough argument for that. If we choose a_i uniformly from [n], then the variance is about n^2/12. So the variance of the sum of the a_i (in the fully independent case) is about n^3/12, so the standard deviation is proportional to n^{3/2}. But if that’s the case, then the probability that the sum equals n(n+1)/2+t is roughly constant for t=O(n).

I think it should be possible to use similar reasoning to prove that if m=o(\sqrt n), then a_1,\dots,a_m are approximately independent. (Of course, this would apply to any m of the a_i, if correct.)

How is g_A(j) distributed?

What is the probability that j+s of the a_i are at most j? Again, it seems to me that Bayes’s theorem and facts about sums of independent random variables are enough for this. We want the probability of the above event given that \sum_ia_i=n(n+1)/2. By Bayes’s theorem, we can work this out if we know the probability that \sum_ia_i=n(n+1)/2 given that g_A(j)=j+s, together with the probability that \sum_ia_i=n(n+1)/2 and the probability that g_A(j)=j+s, in both cases when A is chosen fully independently. The last two calculations are simple. The first one isn’t 100% simple, but it doesn’t look too bad. We have a sum of j+s random variables that are uniform on [j] and n-j-s that are uniform on \{j+1,\dots,n\} and we want to know how likely it is that they add up to n(n+1)/2. We could do this by conditioning on the possible values of the two sums, which then leaves us with sums of independent variables, and adding up all the results. It looks to me as though that calculation shouldn’t be too unpleasant. What I would recommend is to do the calculation on the assumption that the distributions are normal (in a suitable discrete sense) with whatever mean and variance they have to have, since that will yield an answer that is almost certainly correct. A rigorous proof can come later, and shouldn’t be too much harder.

The answer I expect and hope for is that g_A(j) is approximately normally distributed with mean j and a variance that would come out of the calculations.

What is the joint distribution of g_A(j_1) and g_A(j_2)?

This can in principle be done by exactly the same technique, except that now things get one step nastier because we have to condition on the sum of the a_i that are at most j_1, the sum of the a_i that are between j_1+1 and j_2, and the sum of the rest. So we end up with a double sum of products of three probabilities at the end instead of a single sum of products of two probabilities. The reason I haven’t done this is that I am quite busy with other things and the calculation will need a strong stomach. I’d be very happy if someone else did it. But if not, I will attempt it at some point over the next … well, I don’t want to commit myself too strongly, but perhaps the next week or two. At this stage I’m just interested in the heuristic approach — assume that probabilities one knows are roughly normal are in fact given by an exact formula of the form Ae^{-\lambda (x-\mu)^2}.

For some experimental evidence about this, see a comment by Ian on the previous post, which links to some nice visualizations. Ian, if you’re reading this, it would take you about another minute, I’d have thought, to choose a few random dice A and plot the graphs h_A. It would be interesting to see such plots to get an idea of what a typical one looks like: roughly how often does it change sign, for example?

What is a heuristic argument for the “A beats B” tournament being quasirandom?

I have much less to say here — in particular, I don’t have a satisfactory answer. But I haven’t spent serious time on it, and I think it should be possible to get one.

One slight simplification is that we don’t have to think too hard about whether A beats B when we are thinking about the three dice A, B and C. As I commented above, the tournament will be quasirandom (I think I’m right in saying) if for almost every A and B the events “A beats C” and “B beats C” have probability roughly 1/2 each and are hardly correlated.

A good starting point would be the first part. Is it true that almost every die beats approximately half the other dice? This question was also recommended by Bogdan Grechuk in a comment on the previous post. He suggested, as a preliminary question, the question of finding a good sufficient condition on a die for this to be the case.

That I think is approachable too. Let’s fix some function g_A without worrying too much about whether it comes from a die (but I have no objection to assuming that it is non-decreasing and that g_A(n)=n, should that be helpful). Under what conditions can we be confident that the sum \sum_jg_A(c_j) is greater than n(n+1)/2 with probability roughly 1/2, where C=(c_1,\dots,c_n) is a random die?

Assuming it’s correct that each c_j is roughly uniform, \sum_jg_A(c_j) is going to average \sum_jg_A(j), which if A is a die will be close to n(n+1)/2. But we need to know rather more than that in order to obtain the probability in question.

But I think the Bayes approach may still work. We’d like to nail down the distribution of \sum_jg_A(c_j) given that \sum_jc_j=n(n+1)/2. So we can look at \mathbb P[\sum_jg_A(c_j)=n(n+1)/2+t|\sum_jc_j=n(n+1)/2], where now the c_i are chosen uniformly and independently. Calling that \mathbb P[X|Y], we find that it’s going to be fairly easy to estimate the probabilities of X and Y. However, it doesn’t seem to be notably easier to calculate \mathbb P[Y|X] than it is to calculate \mathbb P[X|Y]. But we have made at least one huge gain, which is that now the c_j are independent, so I’d be very surprised if people don’t know how to estimate this probability. Indeed, the probability we really want to know is \mathbb P[X\wedge Y]. From that all else should follow. And I think that what we’d like is a nice condition on g_A that would tell us that the two events are approximately independent.

I’d better stop here, but I hope I will have persuaded at least some people that there’s some reasonably low-hanging fruit around, at least for the time being.

Tim GowersA potential new Polymath project: intransitive dice

A few days ago I received an email from Brian Conrey, who has a suggestion for a possible Polymath project. The problem he wants to see solved is a little different from the problems in most previous projects, in that it is not a well known and long-standing question of acknowledged importance. However, that is not an argument against trying it out, since it is still far from clear what kinds of problems are best suited to the polymathematical approach, and it would be good to get more data. And this problem has other qualities that could make it very suitable indeed. First of all, it is quite new — it arises from a paper published last year, though it appeared on arXiv in 2013 — so we do not yet have a clear idea of how difficult it is, which should give us hope that it may turn out to be doable. Secondly, and more importantly, it is a very attractive question.

Suppose you have a pair of dice D_1,D_2 with different numbers painted on their sides. Let us say that D_1 beats D_2 if, thinking of them as random variables, the probability that D_1>D_2 is greater than the probability that D_2>D_1. (Here, the rolls are of course independent, and each face on each die comes up with equal probability.) It is a famous fact in elementary probability that this relation is not transitive. That is, you can have three dice D_1,D_2,D_3 such that D_1 beats D_2, D_2 beats D_3, and D_3 beats D_1.

Brian Conrey, James Gabbard, Katie Grant, Andrew Liu and Kent E. Morrison became curious about this phenomenon and asked the kind of question that comes naturally to an experienced mathematician: to what extent is intransitivity “abnormal”? The way they made the question precise is also one that comes naturally to an experienced mathematician: they looked at n-sided dice for large n and asked about limiting probabilities. (To give another example where one might do something like this, suppose one asked “How hard is Sudoku?” Well, any Sudoku puzzle can be solved in constant time by brute force, but if one generalizes the question to arbitrarily large Sudoku boards, then one can prove that the puzzle is NP-hard to solve, which gives a genuine insight into the usual situation with a 9\times 9 board.)

Let us see how they formulate the question. The “usual” n-sided die can be thought of as a random variable that takes values in the set \{1,2,\dots,n\}, each with equal probability. A general n-sided die is one where different probability distributions on \mathbb N are allowed. There is some choice about which ones to go for, but Conrey et al go for the following natural conditions.

  1. For each integer m, the probability p_m that it occurs is a multiple of 1/n.
  2. If p_m>0, then 1\leq m\leq n.
  3. The expectation \sum_mmp_m is the same as it is for the usual die — namely (n+1)/2.

Equivalently, an n-sided die is a multiset of size n with elements in \{1,2,\dots,n\} and sum n(n+1)/2. For example, (2,2,3,3,5,6) and (1,2,3,3,6,6) are six-sided dice.

If we have two n-sided dice A and B represented in this way as (a_1,a_2,\dots,a_n) and (b_1,b_2,\dots,b_n), then A beats B if the number of pairs (a_i,b_j) with a_i>b_j exceeds the number of pairs with a_i<b_j.

The question can be formulated a little over-precisely as follows.

Question. Let A, B and C be three n-sided dice chosen uniformly at random. What is the probability that A beats C if you are given that A beats B and B beats C?

I say “over-precisely” because there isn’t a serious hope of finding an exact formula for this conditional probability. However, it is certainly reasonable to ask about the limiting behaviour as n tends to infinity.

It’s important to be clear what “uniformly at random” means in the question above. The authors consider two n-sided dice to be the same if the probability distributions are the same, so in the sequence representation a random die is a random non-decreasing sequence of integers from \{1,2,\dots,n\} that add up to n(n+1)/2 — the important word there being “non-decreasing”. Another way of saying this is that, as indicated above, the distribution is uniform over multisets (with the usual notion of equality) rather than sequences.

What makes the question particularly nice is that there is strong evidence for what the answer ought to be, and the apparent answer is, at least initially, quite surprising. The authors make the following conjecture.

Conjecture. Let A, B and C be three n-sided dice chosen uniformly at random. Then the probability that A beats C if you are given that A beats B and B beats C tends to 1/2 as n tends to infinity.

This is saying that if you know that A beats B and that B beats C, you basically have no information about whether A beats C.

They back up this conjecture with some experimental evidence. When n=6, there turn out to be 4417 triples of dice (A,B,C) such that A beats B and B beats C. For 930 of these triples, A and C were tied, for 1756, A beat C, and for 1731, C beat A.

It seems obvious that as n tends to infinity, the probability that two random n-sided dice are tied tends to zero. Somewhat surprisingly, that is not known, and is also conjectured in the paper. It might make a good first target.

The reason these problems are hard is at least in part that the uniform distribution over non-decreasing sequences of length n with entries in \{1,\dots,n\} that add up to n(n+1)/2 is hard to understand. In the light of that, it is tempting to formulate the original question — just how abnormal is transitivity? — using a different, more tractable distribution. However, experimental evidence presented by the authors in their paper indicates that the problem is quite sensitive to the distribution one chooses, so it is not completely obvious that a good reformulation of this kind exists. But it might still be worth thinking about.

Assuming that the conjecture is true, I would imagine that the heuristic reason for its being true is that for large n, two random dice will typically be “close” in the sense that although one beats the other, it does not do so by very much, and therefore we do not get significant information about what it looks like just from knowing that it beats the other one.

That sounds a bit vague, so let me give an analogy. Suppose we choose random unit vectors u,v,w in \mathbb R^2 and are given the additional information that \langle u,v\rangle\geq 0 and \langle v,w\rangle\geq 0. What is the probability that \langle u,w\rangle\geq 0? This is a simple exercise, and, unless I’ve messed up, the answer is 3/4. That is, knowing that in some sense u is close to v and v is close to w makes it more likely that u is close to w.

But now let’s choose our random vectors from \mathbb R^n. The picture changes significantly. For fixed u, the concentration of measure phenomenon tells us that for almost all v the inner product \langle u,v\rangle is close to zero, so we can think of u as the North Pole and the unit sphere as being almost all contained in a thin strip around the equator. And if v happens to be just in the northern hemisphere — well, it could just as easily have landed in the southern hemisphere. After a change of basis, we can assume that u=e_1 and v is very close to e_2. So when we choose a third vector w, we are asking whether the sign of its second coordinate is correlated with the sign of its first. And the answer is no — or rather, yes but only very weakly.

One can pursue that thought and show that the graph where one joins u to v if \langle u,v\rangle\geq 0 is, for large n, quasirandom, which means, roughly speaking, that it has several equivalent properties that are shared by almost all random graphs. (For a more detailed description, Googling “quasirandom graphs” produces lots of hits.)

For the problem of Conrey et al, the combinatorial object being examined is not a graph but a tournament: that is, a complete graph with orientations on each of its edges. (The vertices are dice, and we draw an arrow from D_1 to D_2 if D_1 beats D_2. Strictly speaking this is not a tournament, because of ties, but I am assuming that ties are rare enough for this to make no significant difference to the discussion that follows.) It is natural to speculate that the main conjecture is a consequence of a much more general statement, namely that this tournament is quasirandom in some suitable sense. In their paper, the authors do indeed make this speculation (it appears there as Conjecture 4).

It turns out that there is a theory of quasirandom tournaments, due to Fan Chung and Ron Graham. Chung and Graham showed that a number of properties that a tournament can have are asymptotically equivalent. It is possible that one of the properties they identified could be of use in proving the conjecture described in the previous paragraph, which, in the light of the Chung-Graham paper, is exactly saying that the tournament is quasirandom. I had hoped that there might be an analogue for tournaments of the spectral characterization of quasirandom graphs (which says that a graph is quasirandom if its second largest eigenvalue is small), since that could give a significantly new angle on the problem, but there is no such characterization in Chung and Graham’s list of properties. Perhaps it is worth looking for something of this kind.

Here, once again, is a link to the paper where the conjectures about dice are set out, and more detail is given. If there is enough appetite for a Polymath project on this problem, I am happy to host it on this blog. All I mean by this is that I am happy for the posts and comments to appear here — at this stage I am not sure what level of involvement I would expect to have with the project itself, but I shall certainly follow the discussion to start with and I hope I’ll be able to make useful contributions.

BackreactionCan we use gravitational waves to rule out extra dimensions – and string theory with it?

Gravitational Waves, Computer simulation. Credits: Henze, NASA Tl;dr: Probably not. Last week I learned from New Scientist that “Gravitational waves could show hints of extra dimensions.” The article is about a paper which recently appeared on the arxiv: Signatures of extra dimensions in gravitational waves David Andriot and Gustavo Lucena Gómez arXiv:1704.07392 [hep-th] The claim in this

May 18, 2017

Jordan EllenbergFox-Neuwirth-Fuks cells, quantum shuffle algebras, and Malle’s conjecture for function fields

I’ve gotten behind on blogging about preprints!  Let me tell you about a new one I’m really happy with, joint with TriThang Tran and Craig Westerland, which we posted a few months ago.

Malle’s conjecture concerns the number of number fields with fixed Galois group and bounded discriminant, a question I’ve been interested in for many years now.  We recall how it goes.

Let K be a global field — that is, a number field or the function field of a curve over a finite field.  Any degree-n extension L/K (here L could be a field or just an etale algebra over K — hold that thought) gives you a homomorphism from Gal(K) to S_n, whose image we call, in a slight abuse of notation, the Galois group of L/K.

Let G be a transitive subgroup of S_n, and let N(G,K,X) be the number of degree-n extensions of K whose Galois group is G and whose discriminant has norm at most X.  Every permutation g in G has an index, which is just n – the number of orbits of g.  So the permutations of index 1 are the transpositions, those of index 2 are the three-cycles and the double-flips, etc.  We denote by a(G) the reciprocal of the minimal index of any element of G.  In particular, a(G) is at most 1, and is equal to 1 if and only if G contains a transposition.

(Wait, doesn’t a transitive subgroup of S_n with a transposition have to be the whole group?  No, that’s only for primitive permutation groups.  D_4 is a thing!)

Malle’s conjecture says that, for every \epsilon > 0, there are constants c,c_\epsilon such that

c X^{a(G)} < N(G,K,X) < c_\epsilon X^{a(G)+\epsilon}

We don’t know much about this.  It’s easy for G = S_2.  A theorem of Davenport-Heilbronn (K=Q) and Datskovsky-Wright (general case) proves it for G = S_3.  Results of Bhargava handle S_4 and S_5, Wright proved it for abelian G.  I kind of think this new theorem of Alex Smith implies for K=Q and every dihedral G of 2-power order?  Anyway:  we don’t know much.  S_6?  No idea.  The best upper bounds for general n are still the ones I proved with Venkatesh a long time ago, and are very much weaker than what Malle predicts.

Malle’s conjecture fans will point out that this is only the weak form of Malle’s conjecture; the strong form doesn’t settle for an unspecified X^\epsilon, but specifies an asymptotic X^a (log X)^b.   This conjecture has the slight defect that it’s wrong sometimes; my student Seyfi Turkelli wrote a nice paper which I think resolves this problem, but the revised version of the conjecture is a bit messy to state.

Anyway, here’s the new theorem:

Theorem (E-Tran-Westerland):  Let G be a transitive subgroup of S_n.  Then for all q sufficiently large relative to G, there is a constant c_\epsilon such that

N(G,\mathbf{F}_q(t),X) < c_\epsilon X^{a(G)+\epsilon}

for all X>0.

In other words:

The upper bound in the weak Malle conjecture is true for rational function fields.

A few comments.

  1.  We are still trying to fix the mistake in our 2012 paper about stable cohomology of Hurwitz spaces.  Craig and I discussed what seemed like a promising strategy for this in the summer of 2015.  It didn’t work.  That result is still unproved.  But the strategy developed into this paper, which proves a different and in some respects stronger theorem!  So … keep trying to fix your mistakes, I guess?  There might be payoffs you don’t expect.
  2. We can actually bound that X^\epsilon is actually a power of log, but not the one predicted by Malle.
  3. Is there any chance of getting the strong Malle conjecture?  No, and I’ll explain why.  Consider the case G=S_4.  Then a(G) = 1, and in this case the strong Malle’s conjecture predicts N(S_4,K,X) is on order X, not just X^{1+eps}.   But our method doesn’t really distinguish between quartic fields and other kinds of quartic etale algebras.  So it’s going to count all algebras L_1 x L_2, where L_1 and L_2 are quadratic fields with discriminants X_1 and X_2 respectively, with X_1 X_2 < X.  We already know there’s approximately one quadratic field per discriminant, on average, so the number of such algebras is about the number of pairs (X_1, X_2) with X_1 X_2 < X, which is about X log X.  So there’s no way around it:  our method is only going to touch weak Malle.  Note, by the way, that for quartic extensions, the strong Malle conjecture was proved by Bhargava, and he observes the same phenomenon:

    …inherent in the zeta function is a sum over all etale extensions” of Q, including the “reducible” extensions that correspond to direct sums of quadratic extensions. These reducible quartic extensions far outnumber the irreducible ones; indeed, the number of reducible quartic extensions of absolute discriminant at most X is asymptotic to X log X, while we show that the number of quartic field extensions of absolute discriminant at most X is only O(X).

  4.  I think there is, on the other hand, a chance of getting rid of the “q sufficiently large relative to G” condition and proving something for a fixed F_q(t) and all finite groups G.


OK, so how did we prove this?

Here’s a sketch.  The starting point, as always, is that the G-extensions of F_q(t) are in bijection with G-covers of P^1/F_q, which are in turn in bijection with the F_q-rational points of a moduli space of G-covers, the Hurwitz space.  In fact there is a whole sequence of Hurwitz spaces Hur_1, Hur_2, … where Hur_n is an n-dimensional variety parametrizing G-covers with n branch points.

Then the game is to control the cohomology of these Hurwitz spaces and thereby control its number of F_q-rational points via the Grothendieck-Lefschetz fixed point theorem.

A lot of our work with Akshay has been in proving stable cohomology for these spaces; that the cohomology group H^i(Hur_n) is eventually constant as n grows with i fixed.

But in this paper we do something softer!  We show that the Betti numbers h^i(Hur_n) grow at most polynomially in n.   This fact plus Grothendieck-Lefschetz allows us to show

|Hur_n(\mathbf{F}_q)| \leq n^d q^n

for some d.  The discriminant of these covers turns out to be on order q^{n/a}.  So we end up with

N(G,\mathbf{F}_q(t),X) \leq (\log X)^{ad} X^a

which is what we wanted.

To get control of these numbers, we make use of a combinatorial description of the Hurwitz spaces.  Let V be the vector space of dimension |G|-1 freely spanned by the nontrivial elements of G.  Then it turns out V is not just any vector space; it’s a braided vector space, which means there’s a map

\tau:  V \otimes V \rightarrow V \otimes V

satisfying a certain relation; oh, let’s not write it down, I’ll just tell you what tau is in this case.  V tensor V is spanned by pairs of elements (g,h), and we have

\tau(g,h) = h, h^{-1} g h.

A braided vector space V is so-called because its nth tensor power V_n picks up an action of the n-strand braid group Br_n.  (Note that Br_2 is Z so on V_2 we just have an invertible endomorphism, namely \tau.)   In this case, the braid group acts on n-tuples

(g_1, .. g_n).

by the usual Hurwitz moves, where “pull strand i past strand i+1” sends the above tuple to

(g_1, \ldots g_{i+1}, g_{i+1}^{-1} g_i g_{i+1}, \ldots, g_n).

It turns out that the Hurwitz space is a K(pi,1) corresponding to the action of the braid group on this finite set; in particular,

H^i(Hur_n) = H^i(Br_n, V_n)

and so our problem of bounding Betti numbers comes down to a combinatorial problem in group cohomology.

Let W_n be the coinvariant space of V_n under S_n; you can think of this as being spanned by the orbits of the braid group on (G-0)^n.  Then concatenation of tuples gives you a ring

R = \oplus_n W_n

And as mentioned above, the Hurwitz space is the finite cover of configuration space

W_n = H_0(Hur_n).

This much Craig, Akshay and I already understood, and indeed the driver of our first paper was an argument in the homological algebra of R-modules.

But in this paper we invoke much deeper algebra, from a world that was totally new to me.  You see, we should have known that you never want to crush a group action down to its orbits, or a braid group representation down to its coinvariants.  That is morally wrong!  As a modern categorical type of person you want to keep track of the whole shebang.

Now things are gonna get vague because it’s a long paper!  But here goes.  It turns out that to any braided vector space V you can attach a quantum shuffle algebra A(V).  The ring R keeps track of H_0 of Hurwitz space, but it turns out that the cohomology of A(V) keeps track of all the cohomology of Hurwitz space:

H^i(Hur_n) = H^i(Br_n, V_n) = Ext^{n-i,n}_{A(V)}(k,k).

(OK, OK, I forgot to mention that A(V) is graded so this Ext is bigraded, and that’s not actually exactly A(V) but a twisted version of it, etc. etc.)

There’s still a way to go from here:  for instance, we actually mostly don’t use A(V) itself but a natural subalgebra called the Nichols algebra, which is good enough for our purposes.  But maybe the main idea is this.  Because we’re not actually trying to compute the dimensions of these Ext groups, only bound them above, we can get away with some really soft arguments.  In particular, we don’t ever actually compute cohomology; we basically write down a minimal resolution of A(V) and then show that the graded pieces of the terms have polynomially growing ranks.  The cohomology groups are subquotients of the terms, and that’s where we get our upper bounds!

OK I can’t resist one more weird comment.  Did you notice that in our description of A(V) we never really used the group law on G?  We only used that you could conjugate one element of G by another.  So we’re not actually using that G is a group, only that it’s a quandle.  The whole setup of our paper works for any quandle!  I don’t know if it has any interesting topological meaning, though.  Also — is the adjectival form for quandle “quandelic” or “quandular”?









Doug NatelsonMore coming, soon.

I will be posting more soon.  I'm in the midst of finally shifting my group webpage to a more modern design.  In the meantime, if there are requests for particular topics, please put them in the comments and I'll see what I can do.

Update:  Victory.  After a battle with weird permissions issues associated with the way Rice does webhosting, it's up here:

Still a few things that should be updated and cleaned up (including my personal homepage), but the major work is done.

John BaezThe Dodecahedron, the Icosahedron and E8

Here you can see the slides of a talk I’m giving:

The dodecahedron, the icosahedron and E8, Annual General Meeting of the Hong Kong Mathematical Society, Hong Kong University of Science and Technology.

It’ll take place on 10:50 am Saturday May 20th in Lecture Theatre G. You can see the program for the whole meeting here.

The slides are in the form of webpages, and you can see references and some other information tucked away at the bottom of each page.

In preparing this talk I learned more about the geometric McKay correspondence, which is a correspondence between the simply-laced Dynkin diagrams (also known as ADE Dynkin diagrams) and the finite subgroups of \mathrm{SU}(2).

There are different ways to get your hands on this correspondence, but the geometric way is to resolve the singularity in \mathbb{C}^2/\Gamma where \Gamma \subset \mathrm{SU}(2) is such a finite subgroup. The variety \mathbb{C}^2/\Gamma has a singularity at the origin–or more precisely, the point coming from the origin in \mathbb{C}^2. To make singularities go away, we ‘resolve’ them. And when you take the ‘minimal resolution’ of this variety (a concept I explain here), you get a smooth variety S with a map

\pi \colon S \to \mathbb{C}^2/\Gamma

which is one-to-one except at the origin. The points that map to the origin lie on a bunch of Riemann spheres. There’s one of these spheres for each dot in some Dynkin diagram—and two of these spheres intersect iff their two dots are connected by an edge!

In particular, if \Gamma is the double cover of the rotational symmetry group of the dodecahedron, the Dynkin diagram we get this way is E_8:

The basic reason \mathrm{E}_8 is connected to the icosahedron is that the icosahedral group is generated by rotations of orders 2, 3 and 5 while the \mathrm{E}_8 Dynkin diagram has ‘legs’ of length 2, 3, and 5 if you count right:

In general, whenever you have a triple of natural numbers a,b,c obeying

\displaystyle{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} > 1}

you get a finite subgroup of \mathrm{SU}(2) that contains rotations of orders a,b,c, and a simply-laced Dynkin diagram with legs of length a,b,c. The three most exciting cases are:

(a,b,c) = (2,3,3): the tetrahedron, and E_6,

(a,b,c) = (2,3,4): the octahedron, and E_7,

(a,b,c) = (2,3,5): the icosahedron, and E_8.

But the puzzle is this: why does resolving the singular variety \mathbb{C}^2/\Gamma gives a smooth variety with a bunch of copies of the Riemann sphere \mathbb{C}\mathrm{P}^1 sitting over the singular point at the origin, with these copies intersecting in a pattern given by a Dynkin diagram?

It turns out the best explanation is in here:

• Klaus Lamotke, Regular Solids and Isolated Singularities, Vieweg & Sohn, Braunschweig, 1986.

In a nutshell, you need to start by blowing up \mathbb{C}^2 at the origin, getting a space X containing a copy of \mathbb{C}\mathrm{P}^1 on which \Gamma acts. The space X/\Gamma has further singularities coming from the rotations of orders a, b and c in \Gamma. When you resolve these, you get more copies of \mathbb{C}\mathrm{P}^1, which intersect in the pattern given by a Dynkin diagram with legs of length a,b and c.

I would like to understand this better, and more vividly. I want a really clear understanding of the minimal resolution S. For this I should keep rereading Lamotke’s book, and doing more calculations.

I do, however, have a nice vivid picture of the singular space \mathbb{C}^2/\Gamma. For that, read my talk! I’m hoping this will lead, someday, to an equally appealing picture of its minimal resolution.

May 17, 2017

Tommaso DorigoGuest Post: Dorigo’s Anomaly! And The Social Psychology Of Professional Discourse In Physics, By Alex Durig

Dr. Alex Durig (see picture) is a professional freelance writer, with a PhD in social psychology from Indiana University (1992). He has authored seven books in his specialization of perception and logic. He claims to have experienced great frustration resolving his experience of perception and logic when it comes to physics, but he says he no longer feels crazy, ever since Anomaly! was published. So I am offering this space to him to hear what he has to say about that...


On Dorigo's Anomaly! and the Social Psychology of Professional Discourse in Physics, by Alex Durig

read more

May 16, 2017

Backreaction“Not a Toy” - New Video about Symmetry Breaking

Here is the third and last of the music videos I produced together with Apostolos Vasilidis and Timo Alho, sponsored by FQXi. The first two are here and here. In this video, I am be-singing a virtual particle pair that tries to separate, and quite literally reflect on the inevitable imperfection of reality. The lyrics of this song went through an estimated ten thousand iterations until we

BackreactionA Philosopher Tries to Understand the Black Hole Information Paradox

Is the black hole information loss paradox really a paradox? Tim Maudlin, a philosopher from NYU and occasional reader of this blog, doesn’t think so. Today, he has a paper on the arXiv in which he complains that the so-called paradox isn’t and physicists don’t understand what they are talking about. (Information) Paradox Lost Tim Maudlin arXiv:1705.03541 [physics.hist-ph] So is the paradox a

May 15, 2017

Terence TaoGeneralisations of the limit functional

Suppose one has a bounded sequence {(a_n)_{n=1}^\infty = (a_1, a_2, \dots)} of real numbers. What kinds of limits can one form from this sequence?

Of course, we have the usual notion of limit {\lim_{n \rightarrow \infty} a_n}, which in this post I will refer to as the classical limit to distinguish from the other limits discussed in this post. The classical limit, if it exists, is the unique real number {L} such that for every {\varepsilon>0}, one has {|a_n-L| \leq \varepsilon} for all sufficiently large {n}. We say that a sequence is (classically) convergent if its classical limit exists. The classical limit obeys many useful limit laws when applied to classically convergent sequences. Firstly, it is linear: if {(a_n)_{n=1}^\infty} and {(b_n)_{n=1}^\infty} are classically convergent sequences, then {(a_n+b_n)_{n=1}^\infty} is also classically convergent with

\displaystyle  \lim_{n \rightarrow \infty} (a_n + b_n) = (\lim_{n \rightarrow \infty} a_n) + (\lim_{n \rightarrow \infty} b_n) \ \ \ \ \ (1)

and similarly for any scalar {c}, {(ca_n)_{n=1}^\infty} is classically convergent with

\displaystyle  \lim_{n \rightarrow \infty} (ca_n) = c \lim_{n \rightarrow \infty} a_n. \ \ \ \ \ (2)

It is also an algebra homomorphism: {(a_n b_n)_{n=1}^\infty} is also classically convergent with

\displaystyle  \lim_{n \rightarrow \infty} (a_n b_n) = (\lim_{n \rightarrow \infty} a_n) (\lim_{n \rightarrow \infty} b_n). \ \ \ \ \ (3)

We also have shift invariance: if {(a_n)_{n=1}^\infty} is classically convergent, then so is {(a_{n+1})_{n=1}^\infty} with

\displaystyle  \lim_{n \rightarrow \infty} a_{n+1} = \lim_{n \rightarrow \infty} a_n \ \ \ \ \ (4)

and more generally in fact for any injection {\phi: {\bf N} \rightarrow {\bf N}}, {(a_{\phi(n)})_{n=1}^\infty} is classically convergent with

\displaystyle  \lim_{n \rightarrow \infty} a_{\phi(n)} = \lim_{n \rightarrow \infty} a_n. \ \ \ \ \ (5)

The classical limit of a sequence is unchanged if one modifies any finite number of elements of the sequence. Finally, we have boundedness: for any classically convergent sequence {(a_n)_{n=1}^\infty}, one has

\displaystyle  \inf_n a_n \leq \lim_{n \rightarrow \infty} a_n \leq \sup_n a_n. \ \ \ \ \ (6)

One can in fact show without much difficulty that these laws uniquely determine the classical limit functional on convergent sequences.

One would like to extend the classical limit notion to more general bounded sequences; however, when doing so one must give up one or more of the desirable limit laws that were listed above. Consider for instance the sequence {a_n = (-1)^n}. On the one hand, one has {a_n^2 = 1} for all {n}, so if one wishes to retain the homomorphism property (3), any “limit” of this sequence {a_n} would have to necessarily square to {1}, that is to say it must equal {+1} or {-1}. On the other hand, if one wished to retain the shift invariance property (4) as well as the homogeneity property (2), any “limit” of this sequence would have to equal its own negation and thus be zero.

Nevertheless there are a number of useful generalisations and variants of the classical limit concept for non-convergent sequences that obey a significant portion of the above limit laws. For instance, we have the limit superior

\displaystyle  \limsup_{n \rightarrow \infty} a_n := \inf_N \sup_{n \geq N} a_n

and limit inferior

\displaystyle  \liminf_{n \rightarrow \infty} a_n := \sup_N \inf_{n \geq N} a_n

which are well-defined real numbers for any bounded sequence {(a_n)_{n=1}^\infty}; they agree with the classical limit when the sequence is convergent, but disagree otherwise. They enjoy the shift-invariance property (4), and the boundedness property (6), but do not in general obey the homomorphism property (3) or the linearity property (1); indeed, we only have the subadditivity property

\displaystyle  \limsup_{n \rightarrow \infty} (a_n + b_n) \leq (\limsup_{n \rightarrow \infty} a_n) + (\limsup_{n \rightarrow \infty} b_n)

for the limit superior, and the superadditivity property

\displaystyle  \liminf_{n \rightarrow \infty} (a_n + b_n) \geq (\liminf_{n \rightarrow \infty} a_n) + (\liminf_{n \rightarrow \infty} b_n)

for the limit inferior. The homogeneity property (2) is only obeyed by the limits superior and inferior for non-negative {c}; for negative {c}, one must have the limit inferior on one side of (2) and the limit superior on the other, thus for instance

\displaystyle  \limsup_{n \rightarrow \infty} (-a_n) = - \liminf_{n \rightarrow \infty} a_n.

The limit superior and limit inferior are examples of limit points of the sequence, which can for instance be defined as points that are limits of at least one subsequence of the original sequence. Indeed, the limit superior is always the largest limit point of the sequence, and the limit inferior is always the smallest limit point. However, limit points can be highly non-unique (indeed they are unique if and only if the sequence is classically convergent), and so it is difficult to sensibly interpret most of the usual limit laws in this setting, with the exception of the homogeneity property (2) and the boundedness property (6) that are easy to state for limit points.

Another notion of limit are the Césaro limits

\displaystyle  \mathrm{C}\!-\!\lim_{n \rightarrow \infty} a_n := \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N a_n;

if this limit exists, we say that the sequence is Césaro convergent. If the sequence {(a_n)_{n=1}^\infty} already has a classical limit, then it also has a Césaro limit that agrees with the classical limit; but there are additional sequences that have a Césaro limit but not a classical one. For instance, the non-classically convergent sequence {a_n= (-1)^n} discussed above is Césaro convergent, with a Césaro limit of {0}. However, there are still bounded sequences that do not have Césaro limit, such as {a_n := \sin( \log n )} (exercise!). The Césaro limit is linear, bounded, and shift invariant, but not an algebra homomorphism and also does not obey the rearrangement property (5).

Using the Hahn-Banach theorem, one can extend the classical limit functional to generalised limit functionals {\mathop{\widetilde \lim}_{n \rightarrow \infty} a_n}, defined to be bounded linear functionals from the space {\ell^\infty({\bf N})} of bounded real sequences to the real numbers {{\bf R}} that extend the classical limit functional (defined on the space {c_0({\bf N}) + {\bf R}} of convergent sequences) without any increase in the operator norm. (In some of my past writings I made the slight error of referring to these generalised limit functionals as Banach limits, though as discussed below, the latter actually refers to a subclass of generalised limit functionals.) It is not difficult to see that such generalised limit functionals will range between the limit inferior and limit superior. In fact, for any specific sequence {(a_n)_{n=1}^\infty} and any number {L} lying in the closed interval {[\liminf_{n \rightarrow \infty} a_n, \limsup_{n \rightarrow \infty} a_n]}, there exists at least one generalised limit functional {\mathop{\widetilde \lim}_{n \rightarrow \infty}} that takes the value {L} when applied to {a_n}; for instance, for any number {\theta} in {[-1,1]}, there exists a generalised limit functional that assigns that number {\theta} as the “limit” of the sequence {a_n = (-1)^n}. This claim can be seen by first designing such a limit functional on the vector space spanned by the convergent sequences and by {(a_n)_{n=1}^\infty}, and then appealing to the Hahn-Banach theorem to extend to all sequences. This observation also gives a necessary and sufficient criterion for a bounded sequence {(a_n)_{n=1}^\infty} to classically converge to a limit {L}, namely that all generalised limits of this sequence must equal {L}.

Because of the reliance on the Hahn-Banach theorem, the existence of generalised limits requires the axiom of choice (or some weakened version thereof); there are presumably models of set theory without the axiom of choice in which no generalised limits exist, but I do not know of an explicit reference for this.

Generalised limits can obey the shift-invariance property (4) or the algebra homomorphism property (2), but as the above analysis of the sequence {a_n = (-1)^n} shows, they cannot do both. Generalised limits that obey the shift-invariance property (4) are known as Banach limits; one can for instance construct them by applying the Hahn-Banach theorem to the Césaro limit functional; alternatively, if {\mathop{\widetilde \lim}} is any generalised limit, then the Césaro-type functional {(a_n)_{n=1}^\infty \mapsto \mathop{\widetilde \lim}_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N a_n} will be a Banach limit. The existence of Banach limits can be viewed as a demonstration of the amenability of the natural numbers (or integers); see this previous blog post for further discussion.

Generalised limits that obey the algebra homomorphism property (2) are known as ultrafilter limits. If one is given a generalised limit functional {p\!-\!\lim_{n \rightarrow \infty}} that obeys (2), then for any subset {A} of the natural numbers {{\bf N}}, the generalised limit {p\!-\!\lim_{n \rightarrow \infty} 1_A(n)} must equal its own square (since {1_A(n)^2 = 1_A(n)}) and is thus either {0} or {1}. If one defines {p \subset 2^{2^{\bf N}}} to be the collection of all subsets {A} of {{\bf N}} for which {p\!-\!\lim_{n \rightarrow \infty} 1_A(n) = 1}, one can verify that {p} obeys the axioms of a non-principal ultrafilter. Conversely, if {p} is a non-principal ultrafilter, one can define the associated generalised limit {p\!-\!\lim_{n \rightarrow \infty} a_n} of any bounded sequence {(a_n)_{n=1}^\infty} to be the unique real number {L} such that the sets {\{ n \in {\bf N}: |a_n - L| \leq \varepsilon \}} lie in {p} for all {\varepsilon>0}; one can check that this does indeed give a well-defined generalised limit that obeys (2). Non-principal ultrafilters can be constructed using Zorn’s lemma. In fact, they do not quite need the full strength of the axiom of choice; see the Wikipedia article on the ultrafilter lemma for examples.

We have previously noted that generalised limits of a sequence can converge to any point between the limit inferior and limit superior. The same is not true if one restricts to Banach limits or ultrafilter limits. For instance, by the arguments already given, the only possible Banach limit for the sequence {a_n = (-1)^n} is zero. Meanwhile, an ultrafilter limit must converge to a limit point of the original sequence, but conversely every limit point can be attained by at least one ultrafilter limit; we leave these assertions as an exercise to the interested reader. In particular, a bounded sequence converges classically to a limit {L} if and only if all ultrafilter limits converge to {L}.

There is no generalisation of the classical limit functional to any space that includes non-classically convergent sequences that obeys the subsequence property (5), since any non-classically-convergent sequence will have one subsequence that converges to the limit superior, and another subsequence that converges to the limit inferior, and one of these will have to violate (5) since the limit superior and limit inferior are distinct. So the above limit notions come close to the best generalisations of limit that one can use in practice.

We summarise the above discussion in the following table:

Limit Always defined Linear Shift-invariant Homomorphism Constructive
Classical No Yes Yes Yes Yes
Superior Yes No Yes No Yes
Inferior Yes No Yes No Yes
Césaro No Yes Yes No Yes
Generalised Yes Yes Depends Depends No
Banach Yes Yes Yes No No
Ultrafilter Yes Yes No Yes No

Filed under: 245B - Real analysis, expository, math.CA Tagged: Banach limit, convergence, ultrafilter

May 13, 2017

Chad OrzelKid Growth Update

At SteelyKid’s softball game today, the Pip provided an ideal cute-kid photo to use as a springboard to some SCIENCE! Or at least, a graph… Anyway, here’s the Little Dude showing off how tall he’s gotten:

The Pip under Kate's coat.

The Pip under Kate’s coat.

OK, really he’s hiding under Kate’s raincoat (after two beautiful sunny days in a row, we’re back to dreary rain today), but as a side effect of that process he’s demonstrating that he’s pretty tall. But exactly how tall?

We’re having an addition put on the back of Chateau Steelypips, and our kitchen gutted and rebuilt, so we’ve been moving a lot of stuff around to get ready. During this process, it occurred to me that this might threaten the door frame where we’ve been marking the kids’ heights at irregular intervals. So, I copied down all the values, and like any self-respecting nerd with a table of numbers, I turned them into a scatter plot:

Height measurements for SteelyKid and The Pip.

Height measurements for SteelyKid and The Pip.

There’s a good deal of scatter in these, because their idea of what “stand up straight” means is… flexible. It’s interesting to see that they’re on basically the same trend– he’s more or less exactly the same height she was at the same age, though she usually seems taller. We tend to forget how big he is, both because SteelyKid is always around, and also because when I see him with his kindergarten class he doesn’t seem that large compared to other kids. That’s mostly because his birthday is in November, so he has a lot of classmates who are nearly a full year older than him, and one or two kids with birthdays in the same part of the year are a full year older, as they waited a year to start kindergarten.

I attempted to add a line to this for median height among the US population, but that didn’t work very well, because getting the data synched up was a little awkward. I may give it another run later; in general, though, they’re a bit above the median height. I have no idea how that happened, he said, struggling to remain deadpan.

I was a little surprised by how linear that graph looks– a linear fit to SteelyKid’s height has an R2 of 0.988. Taken literally, this would lead you to conclude that she’ll pass me in height a few months shy of her 18th birthday. Of course, this fit also has her height at birth as 30 inches rather than the actual 20, so maybe we shouldn’t run too far with the linear extrapolation… The Pip’s graph hints at a bit of curve, mostly because of that first point when he was really little; a linear fit to his height (which is still pretty good– R2 = 0.96) would have him passing me at 15.

Anyway, the real action of the morning was a softball game, where SteelyKid demonstrated fielding prowess beyond her usual idle twirling and announcing “I’m BORED!!!” (when she’s in the outfield):


(That’s her after stopping a ground ball, getting up and starting to throw to first…)

And since The Pip should get a photo where you can actually see his face, I present the author photo from his forthcoming novel:

The Pip, posing for his author photo.

The Pip, posing for his author photo.

May 12, 2017

n-Category Café Unboxing Algebraic Theories of Generalised Arities

Guest post by José Siqueira

We began our journey in the second Kan Extension Seminar with a discussion of the classical concept of Lawvere theory , facilitated by Evangelia. Together with the concept of a model, this technology allows one to encapsulate the behaviour of algebraic structures defined by collections of nn-ary operations subject to axioms (such as the ever-popular groups and rings) in a functorial setting, with the added flexibility of easily transferring such structures to arbitrary underlying categories 𝒞\mathcal{C} with finite products (rather than sticking with Set\mathbf{Set}), naturally leading to important notions such as that of a Lie group.

Throughout the seminar, many features of Lawvere theories and connections to other concepts were unearthed and natural questions were addressed — notably for today’s post, we have established a correspondence between Lawvere theories and finitary monads in Set\mathbf{Set} and discussed the notion of operad, how things go in the enriched context and what changes if you tweak the definitions to allow for more general kinds of limit. We now conclude this iteration of the seminar by bringing to the table “Monads with arities and their associated theories”, by Clemens Berger, Paul-André Melliès and Mark Weber, which answers the (perhaps last) definitional “what-if”: what goes on if you allow for operations of more general arities.

At this point I would like to thank Alexander Campbell, Brendan Fong and Emily Riehl for the amazing organization and support of this seminar, as well as my fellow colleagues, whose posts, presentations and comments drafted a more user-friendly map to traverse this subject.

Allowing general arities

Recall that a Lawvere theory can be defined as a pair (I,L)(I,L), where LL is a small category with finite coproducts and I: 0LI: \aleph_0 \to L is an identity-on-objects finite-coproduct preserving functor. To this data we associate a nerve functor ν 0:SetPSh( 0)\nu_{\aleph_0}: \mathbf{Set} \to PSh(\aleph_0), which takes a set XX to its 0\aleph_0-nerve ν 0(X): 0 opSet\nu_{\aleph_0}(X): \aleph_0^{op} \to \mathbf{Set}, the presheaf Set(i 0(),X)\mathbf{Set}(i_{\aleph_0}(-), X) — the 0\aleph_0-nerve of a set XX thus takes a finite cardinal nn to X nX^n, up to isomorphism. It is easy to check ν 0\nu_{\aleph_0} is faithful, but it is also full, with αν 0(α 1)\alpha\cong \nu_{\aleph_0}(\alpha_1) for each natural transformation α:ν 0(X)ν 0(X)\alpha: \nu_{\aleph_0}(X) \to \nu_{\aleph_0}(X'), seeing α 1\alpha_1 as a function XXX \to X'. This allows us to regard sets as presheaves over the small category 0\aleph_0, and as ν 0(X)([n])=Set([n],X)X n\nu_{\aleph_0}(X)([n])=\mathbf{Set}([n],X)\cong X^n, the 0\aleph_0-nerves can be used to encode all possible nn-ary operations on sets. To capture this behaviour of 0\aleph_0, we are inclined to make the following definition:

Definition. Let 𝒞\mathcal{C} be a category and 𝒜\mathcal{A} be a full small subcategory of 𝒞\mathcal{C}. We say 𝒜\mathcal{A} is a dense generator of 𝒞\mathcal{C} if its associated nerve functor ν 𝒜:𝒞PSh(𝒜)\nu_{\mathcal{A}}: \mathcal{C} \to PSh(\mathcal{A}) is fully faithful, where ν 𝒜(X)=𝒞(ı 𝒜(),X)\nu_{\mathcal{A}}(X)= \mathcal{C}(\imath_{\mathcal{A}}(-), X) for each X𝒞X \in \mathcal{C}.

The idea is that we can replace Set\mathbf{Set} and 0\aleph_0 in the original definition of Lawvere theory by a category 𝒞\mathcal{C} with a dense generator 𝒜\mathcal{A}. This allows us to have operations with arities more diverse than simply finite cardinals, while still retaining “good behaviour” — if we think about the dense generator as giving the “allowed arities”, we end up being able to extend all the previous concepts and make the following analogies:

Lawvere theoryL TheoryΘwith arity𝒜 Model ofL op Θ-model Finitary monad Monad with arity𝒜 Globular (Batanin) operad Homogenous globular theory Symmetric operad Γ-homogeneous theory     \array {\arrayopts{ \colalign{left left} \rowlines{solid} } \\ \text{Lawvere theory}\, L &\text{Theory}\, \Theta \, \text{with arity} \, \mathcal{A}\\ \text{Model of}\, L^{op}& \Theta\text{-model}\\ \text{Finitary monad}&\text{Monad with arity}\, \mathcal{A}\\ \text{Globular (Batanin) operad}&\text{Homogenous globular theory}\\ \text{Symmetric operad}&\Gamma\text{-homogeneous theory}\\ &nbsp;& &nbsp; } We’ll now discuss each generalised concept and important/useful properties.

If (I,L)(I,L) is a Lawvere theory, the restriction functor I *:PSh(L)PSh( 0)I^{\ast}: PSh(L) \to PSh(\aleph_0) induces a monad I *I !I^{\ast} I_!, where I !I_! is left Kan extension along II. This monad preserves the essential image of the nerve functor ν 0\nu_{\aleph_0}, and in fact this condition reduces to preservation of coproducts by II (refer to 3.5 in the paper for further details). If MM is a model of L opL^{op} on Set\mathbf{Set} in the usual sense (i.e M:L opSetM: L^{op} \to \mathbf{Set} preserves finite products) we can see that its restriction along II is isomorphic to the 0\aleph_0-nerve of MI([1])MI([1]) by arguing that

(I *M)[n]=MI[n]=M( nI[1]) inL=M( nI[1]) inL op nMI[1]MI[1] nν 0(MI[1])[n], (I^{\ast} M)[n] = MI[n] = M \underbrace{(\coprod_n I[1])}_{\text{in} \, L} = M\underbrace{(\prod_n I[1])}_{\text{in}\, L^{op}}\cong \prod_n MI[1] \cong MI[1]^n \cong \nu_{\aleph_0}(MI[1])[n],

and so we may want to define:

Definition. Let 𝒞\mathcal{C} be a category with a dense generator 𝒜\mathcal{A}. A theory with arities 𝒜\mathcal{A} on 𝒞\mathcal{C} is a pair (Θ,j)(\Theta,j), where j:𝒜Θj: \mathcal{A} \to \Theta is a bijective-on-objects functor such that the induced monad j *j !j^{\ast}j_! on PSh(𝒜)PSh(\mathcal{A}) preserves the essential image of the associated nerve functor ν 𝒜\nu_{\mathcal{A}}. A Θ\Theta-model is a presheaf on Θ\Theta whose restriction along jj is isomorphic to some 𝒜\mathcal{A}-nerve.

Again, for 𝒜= 0\mathcal{A}=\aleph_0, this requirement on models says a Θ\Theta-model MM restricts to powers of some object: I *M()=MI()X ||I^\ast M(-)=MI(-) \cong X^{|-|} for some set XX, the outcome we wanted for models of Lawvere theories.

A morphism of models is still just a natural transformation between them as presheaves and a morphism of theories (Θ 1,j 1)(Θ 2,j 2)(\Theta_1, j_1) \to (\Theta_2, j_2) is a functor θ:Θ 1Θ 2\theta: \Theta_1 \to \Theta_2 that intertwines with the arity functors, i.e j 2=θj 1j_2=\theta j_1. We’ll write Mod(Θ)Mod(\Theta) for the full subcategory of PSh(Θ)PSh(\Theta) consisting of the models of Θ\Theta and Th(𝒞,𝒜)Th(\mathcal{C}, \mathcal{A}) for the category of theories with arities 𝒜\mathcal{A} on 𝒞\mathcal{C}. We aim to prove a result that establishes an equivalence between Th(𝒞,𝒜)Th(\mathcal{C},\mathcal{A}) and some category of monads, to mirror the situation between Lawvere theories and finitary monads on Set\mathbf{Set}.

Dense generators and nerves

Having a dense generator is desirable because we can then mimic the following situation:

Recall that if 𝒟\mathcal{D} is small and F:𝒟SetF:\mathcal{D} \to \mathbf{Set} is a functor, then we can form a diagram of shape (*F) op({\ast}\downarrow F)^{op} over [𝒟,Set][\mathcal{D}, \mathbf{Set}] by composing the (opposite) of the natural projection functor (*F)𝒟({\ast}\downarrow F) \to \mathcal{D} and the Yoneda embedding. We may then consider the cocone

μ=(μ (d,x)=μ x:𝒟(d,)F(d,x)(*F) op), \mu=(\mu_{(d,x)}=\mu_x: \mathcal{D}(d,-) \to F \mid (d,x) \in ({\ast} \downarrow F)^{op}),

where μ x\mu_x is the natural transformation corresponding to xF(d)x \in F(d) via the Yoneda lemma, and find out it is actually a colimit, canonically expressing FF as a colimit of representable functors — if you are so inclined, you might want to look at this as the coend identity

F()= d𝒟F(d)×𝒟(,d) F(-)= \int^{d \in \mathcal{D}} F(d) \times \mathcal{D}(-,d)

when FF is a presheaf on 𝒟\mathcal{D}. Likewise for XX an object of a category 𝒞\mathcal{C} with dense generator 𝒜\mathcal{A}, there is an associated diagram a X:𝒜/X𝒞a_X: \mathcal{A}/X \to \mathcal{C}, which comes equipped with an obvious natural transformation to the constant functor on XX, whose (AfX)(A \xrightarrow{f} X)-component is simply ff itself — this is called the 𝒜\mathcal{A}-cocone over XX, and it is just the cocone of vertex XX under the diagram a Xa_X of shape 𝒜/X\mathcal{A}/X in 𝒞\mathcal{C} whose legs consist of all morphisms AXA \to X with A𝒜A \in \mathcal{A}. Note that if 𝒜\mathcal{A} is small (as is the case), then this diagram is small and, if 𝒞=PSh(𝒜)\mathcal{C}=PSh(\mathcal{A}), the slice category 𝒜/X\mathcal{A}/X reduces to the category of elements of the presheaf XX and this construction gives the Yoneda cocone under XX. One can show that

Proposition. A small full subcategory 𝒜\mathcal{A} of 𝒞\mathcal{C} is a dense generator precisely when the 𝒜\mathcal{A}-cocones are actually colimit-cocones in 𝒞\mathcal{C}.

This canonically makes every object XX of 𝒞\mathcal{C} a colimit of objects in 𝒜\mathcal{A} , and in view of this result it makes sense to define:

Definition. Let 𝒞\mathcal{C} be a category with a dense generator 𝒜\mathcal{A}. A monad TT on 𝒞\mathcal{C} is a monad with arities 𝒜\mathcal{A} when ν 𝒜T\nu_{\mathcal{A}}T takes the 𝒜\mathcal{A}-cocones of 𝒞\mathcal{C} to colimit-cocones in PSh(𝒜)PSh(\mathcal{A}).

That is, the monad has arities 𝒜\mathcal{A} whenever scrambling the nerve functor by first applying TT does not undermine its capacity of turning 𝒜\mathcal{A}-cocones into colimits, which in turns preserves the status of 𝒜\mathcal{A} as a dense generator, morally speaking — the Nerve Theorem makes this statement more precise:

The Nerve Theorem. Let 𝒞\mathcal{C} be a category with a dense generator 𝒜\mathcal{A}. For any monad TT with arities 𝒜\mathcal{A}, the full subcategory Θ T\Theta_T spanned by the free TT-algebras on objects of 𝒜\mathcal{A} is a dense generator of the Eilenberg-Moore category 𝒞 T\mathcal{C}^T. The essential image of the associated nerve functor is spanned by those presheaves whose restriction along j Tj_T belongs to the essential image of the ν 𝒜\nu_{\mathcal{A}}, where j T:𝒜Θ Tj_T: \mathcal{A} \to \Theta_T is obtained by restricting the free TT-algebra functor.

The proof given relies on an equivalent characterization for monads with arities, namely that a monad TT on a category 𝒞\mathcal{C} with arities 𝒜\mathcal{A} is a monad with arities 𝒜\mathcal{A} if and only if the “generalised lifting (pseudocommutative) diagram”

𝒞 T ν T PSh(Θ T) U j T * 𝒞 ν 𝒜 PSh(𝒜) . \begin{matrix} \mathcal{C}^T & \overset{\nu_T}{\longrightarrow} & PSh(\Theta_T) \\ {}_{U}\downarrow && \downarrow_{j_T^{\ast}} \\ \mathcal{C} & \underset{\nu_{\mathcal{A}}}{\longrightarrow} & PSh(\mathcal{A}) \\ \end{matrix}.

is an exact adjoint square, meaning the mate (j T) !ν 𝒜ν TF(j_T)_!\nu_{\mathcal{A}} \Rightarrow \nu_T F of the invertible 22-cell implicit in the above square is also invertible, where FF is the free TT-algebra functor. Note j T *j_T^{\ast} is monadic, so this diagram indeed gives some sort of lifting of the nerve functor on 𝒞\mathcal{C} to the level of monad algebras.

We can build on this result a little bit. Let α\alpha be a regular cardinal (at this point you might want to check David’s discussion on finite presentability).

Definition. A category 𝒞\mathcal{C} is α\alpha-accessible if it has α\alpha-filtered colimits and a dense generator 𝒜\mathcal{A} comprised only of α\alpha-presentable objects such that 𝒜/X\mathcal{A}/X is α\alpha-filtered for each object XX of 𝒞\mathcal{C}. If in addition the category is cocomplete, we say it is locally α\alpha-presentable.

If 𝒞\mathcal{C} is α\alpha-accessible, there is a god-given choice of dense generator — we take 𝒜\mathcal{A} to be a skeleton of the full subcategory 𝒞(α)\mathcal{C}(\alpha) spanned by the α\alpha-presentable objects of 𝒞\mathcal{C}. As all objects in 𝒜\mathcal{A} are α\alpha-presentable, the associated nerve functor preserves α\alpha-filtered colimits and so any monad TT preserving α\alpha-filtered colimits is a monad with arities 𝒜\mathcal{A}. The essential image of ν 𝒜\nu_{\mathcal{A}} is spanned by the α\alpha-flat presheaves on 𝒜\mathcal{A} (meaning presheaves whose categories of elements are α\alpha-filtered). As a consequence, any given object in an α\alpha-accessible category is canonically an α\alpha-filtered colimit of α\alpha-presentable objects and we can prove:

Theorem (Gabriel-Ulmer, Adámek-Rosický). If a monad TT on an α\alpha-accessible category 𝒞\mathcal{C} preserves α\alpha-filtered colimits, then its category of algebras 𝒞 T\mathcal{C}^T is α\alpha-accessible, with a dense generator Θ T\Theta_T spanned by the free TT-algebras on (a skeleton 𝒜\mathcal{A} of) the α\alpha-presentable objects C(α)C(\alpha). Moreover, this category of algebras is equivalent to the full subcategory of PSh(Θ T)PSh(\Theta_T) spanned by those presheaves whose restriction along j Tj_T is α\alpha-flat.

Proof. We know TT is a monad with arities 𝒜\mathcal{A}. That Θ T\Theta_T is a dense generator as stated follows from its definition and the Nerve Theorem. Now, 𝒞 T\mathcal{C}^T has α\alpha-filtered colimits since 𝒞\mathcal{C} has and TT preserves them. As the forgetful functor U:𝒞 T𝒞U: \mathcal{C}^T \to \mathcal{C} preserves α\alpha-filtered colimits (a monadic functor creates all colimits 𝒞\mathcal{C} has and TT preserves), it follows that the free algebra functor preserves α\alpha-presentability: 𝒞(FA,)𝒞(A,U())\mathcal{C}(FA,-) \cong \mathcal{C}(A, U(-)) preserves α\alpha-filtered colimits whenever AA is α\alpha-presentable, and so objects of Θ T\Theta_T are α\alpha-presentable. One can then check each 𝒜/X\mathcal{A}/X is α\alpha-filtered. \square

Note that this theorem says, for α= 0\alpha=\aleph_0, that if a monad on sets is finitary, then its category of algebras (i.e models for the associated classical Lawvere theory) is accessible, with a dense generator given by all the free TT-algebras on finite sets: this is because a finitely-presentable (i.e 0\aleph_0-presentable) set is precisely the same as a finite set. As a consequence, the typical “algebraic gadgets” are canonically a colimit of free ones on finitely many generators.

Theories and monads (with arities) are equivalent

If TT is a monad with arities 𝒜\mathcal{A}, then (Θ T,j T)(\Theta_T, j_T) is a theory with arities 𝒜\mathcal{A}. The Nerve Theorem then guarantees that ν T:𝒞 TPSh(Θ T)\nu_T: \mathcal{C}^T \to PSh(\Theta_T) induces an equivalence of categories between Θ T\Theta_T-models and TT-algebras, since its essential image is, by definition, the category of Θ T\Theta_T-models and the functor is fully faithful. This gives us hope that the situation with Lawvere theories and finitary monads can be extended, and this is indeed the case: the assignment T(Θ T,j T)T \mapsto (\Theta_T, j_T) extends to a functor Mnd(𝒞,𝒜)Th(𝒞,𝒜)\mathbf{Mnd}(\mathcal{C}, \mathcal{A}) \to \mathbf{Th}(\mathcal{C}, \mathcal{A}), which forms an equivalence of categories together with the functor Th(𝒞,𝒜)Mnd(𝒞,𝒜)\mathbf{Th}(\mathcal{C}, \mathcal{A}) \to \mathbf{Mnd}(\mathcal{C}, \mathcal{A}) that takes a theory (Θ,j)(\Theta, j) to the monad ρ 𝒜Tν 𝒜\rho_{\mathcal{A}}T\nu_{\mathcal{A}} with arities 𝒜\mathcal{A}, where ρ 𝒜\rho_{\mathcal{A}} is a choice of right adjoint to ν 𝒜:𝒞EssIm(ν 𝒜)\nu_{\mathcal{A}}: \mathcal{C} \to EssIm(\nu_{\mathcal{A}}). When 𝒞=Set\mathcal{C}=\mathbf{Set} and 𝒜= 0\mathcal{A}=\aleph_0, we recover the Lawvere theory/finitary monad equivalence.

Relation to operads and examples

Certain kinds of theories with arities are equivalent to operads. Namely, there is a notion of homogeneous globular theory that corresponds to globular (Batanin) operads. Similarly, there is a notion of Γ\Gamma-homogeneous theory that corresponds to symmetric operads. The remainder of the paper brings other equivalent definitions for monad with arities and builds a couple of examples, such as the free groupoid monad, which is a monad with arities given by (finite, connected) acyclic graphs. A notable example is that dagger categories arise as models of a theory on involutive graphs with non-trivial arities.

May 09, 2017

Jordan EllenbergI guess Caffe 608 was in trouble

Eight years after I wondered whether the arthouse cinema / cafe in Hilldale could really make a go of it, Sundance 608 is getting bought out by AMC.  I have really come to like this weird little sort-of-arthouse and hope it doesn’t change too much under new management.  It’s a sign of my age, I guess, that I still think of “movie at the mall” as an entertainment option I want to exist.  It’s my Lindy Hop, my vaudeville, my Show of Shows.

Doug NatelsonBrief items

Some interesting items of note:

  • Gil Refael at Cal Tech has a discussion going on the Institute for Quantum Information and Matter blog about the content of "modern physics" undergraduate courses.  The dilemma as usual is how to get exciting, genuinely modern physics developments into an already-packed undergrad curriculum.  
  • The variety and quality of 3d printed materials continues to grow and impress.  Last month a team of folks from Karlsruhe demonstrated very nice printing of (after some processing) fused silica.  Then last week I ran across this little toy.  I want one.  (Actually, I want to know how much they cost without getting on their sales engineer call list.)  We very recently acquired one of these at Rice for our shared equipment facility, thanks to generous support of the NSF MRI program.   There are reasons to be skeptical that additive manufacturing will scale in such a way as to have enormous impact, but it sure is cool and making impressive progress.
  • There is a news release about our latest paper that has been picked up by a few places, including the NSF's electronic newsletter.  I'll write more about that very soon.
  • The NSF and the SRC are having a joint program in "SemiSynBio", trying to work at the interface of semiconductor devices and synthetic biology to do information processing and storage.  That's some far out stuff for the SRC - they're usually pretty conservative.
  • Don Lincoln has won the AIP's 2017 Gemant Award for his work presenting science to the public - congratulations!  You have likely seen his videos put out by Fermilab - they're frequently featured on ZapperZ's blog

May 08, 2017

Scott AaronsonMy broken blog

I wanted to let people know I’m well-aware that Shtetl-Optimized has been suffering from the following problems lately:

  • Commenters are presented with the logins (handle, email address, and URL) of random other commenters, rather than with their own default login data.  In particular, this means that email addresses are leaking, and that when you comment, you should not (for the time being) enter your real email address if that’s information that you’d wanted to share only with me.  Another thing it means is that, when I try to comment, I’m not logged in as “Scott,” so even I have to enter my login data manually every time I comment.
  • Comments (including my own comments!) take about an hour to show up after I’ve approved them.
  • New blog posts also take a while to show up.

Since all three of these problems started happening around the same time, I assume they’re related.  But I don’t even know where to start in trying to solve them (Googling for “WordPress” plus descriptions of these bugs was unhelpful).  Would anyone like to help out?  If you earn my trust, I’ll even temporarily give you administrative privileges on this blog so you can poke around yourself.

Thanks so much, and hope to return to your regularly scheduled programming shortly…

May 07, 2017

Chad OrzelKid Reading Update

For a long time now, I’ve had a Sunday routine with the kids, where we go to the Schenectady Greenmarket and then to the Open Door (which is right next to the outdoor market, and a couple blocks from the indoor location), then to lunch at Panera, and usually grocery shopping. We have a standing deal that they get one book per week free, but if they want toys or a second book, it comes out of their allowance.

SteelyKid very consistently finds something to read, but The Pip was recently in a mode where he only wanted to read Pokemon books, and the Open Door doesn’t carry those, so he would sulk. The last couple of weeks, though, he’s gotten into random other books again, and started grabbing things from the kids-book section. This is, of course, very gratifying to a bookish parent, which is also why there’s the standing book-a-week deal…

This week, SteelyKid got and devoured a new graphic novel, 5 Worlds: The Sand Warrior (I handed it to her to look at, and she just sat down and started reading. After ten minutes or so, I had to decree that we weren’t just reading the entire book there in the store…). The Pip got a book about a tech-obsessed cave boy.

In the car on the way to lunch, they were so engrossed in their books that the quiet was actually a little alarming. So I got selfie-mode photos of the two of them, to document their reading:

SteelyKid (r) and The Pip in the car, reading.

SteelyKid (r) and The Pip in the car, reading.

SteelyKid actually corrected me to say that she wasn’t reading her book in this, but doing kenken puzzles on her tablet. Which is another quiet activity I’m very happy to encourage…

So, there’s your cute-kid content for the week. They were even more entertaining on Saturday, when both of them had friends over for the afternoon, but the photos I got from that necessarily include other peoples’ kids, and I try to avoid posting those.

Jordan EllenbergIs academia wrong for you?

Good article by Daniel McCormack in Chronicle of Higher Education on underpublicized aspects of academic life.

For instance:

These iterative failures are, at a very deep level, the essence of creating new knowledge, and are therefore inseparable from the job. If you can’t imagine going to bed at the end of nearly every day with a nagging feeling that you could have done better, academe is not for you.

The academic workplace is a really unusual one.  For instance, it’s one of the few places to work where you’re nobody’s boss and nobody’s your boss.  It really suits some people — I’m one.  But lots of other people feel otherwise: it’s too slow, too lacking in immediate feedback, too content with the way “it’s always been done.”  And a lot of those people have great things to contribute to mathematics and don’t fit in the system we’ve set up to pay people to do math.

Also, this:

So while the ideal career path leads from graduate school to a tenure-track position, the one you will more likely find yourself on leads from graduate school to a series of short-term positions that will require you to move — often.

is less true in math than in many other areas, but still kind of true.  And it works badly not just for people who temperamentally hate moving, but for people who want to have kids and have a limited childbearing window.

McCormack is right:  without catastrophizing, we should always be trying to give our Ph.D. students as real a picture as possible of what academic life is like, and not just the advisor’s life with tenure at an R1 university.  Lots of people will still happily sign up.  But other people will think more seriously about other great ways to do mathematics.




Terence TaoNew bounds for Szemerédi’s theorem, III: A polylogarithmic bound for r_4(N)

Ben Green and I have (finally!) uploaded to the arXiv our paper “New bounds for Szemerédi’s theorem, III: A polylogarithmic bound for {r_4(N)}“, submitted to Mathematika. This is the sequel to two previous papers (and an erratum to the former paper), concerning quantitative versions of Szemerédi’s theorem in the case of length four progressions. This sequel has been delayed for over a decade for a number of reasons, but we have finally managed to write the arguments up to our satisfaction and submit it (to a special issue of Mathematika honouring the work of Klaus Roth).

For any natural number {N}, define {r_4(N)} to be the largest cardinality of a subset {A} of {[N] = \{1,\dots,N\}} which does not contain any non-trivial arithmetic progressions {a, a+r, a+2r, a+3r} of length four (where “non-trivial” means that {r} is non-zero). Trivially we have {r_4(N) \leq N}. In 1969, Szemerédi showed that {r_4(N) = o(N)}. However, the decay rate that could be theoretically extracted from this argument (and from several subsequent proofs of this bound, including one by Roth) were quite poor. The first significant quantitative bound on this quantity was by Gowers, who showed that {r_4(N) \ll N (\log \log N)^{-c}} for some absolute constant {c>0}. In the second paper in the above-mentioned series, we managed to improve this bound to {r_4(N) \ll N \exp( - c \sqrt{\log \log N})}. In this paper, we improve the bound further to {r_4(N) \ll N (\log N)^{-c}}, which seems to be the limit of the methods. (We remark that if we could take {c} to be larger than one, this would imply the length four case of a well known conjecture of Erdös that any set of natural numbers whose sum of reciprocals diverges would contain arbitrarily long arithmetic progressions. Thanks to the work of Sanders and of Bloom, the corresponding case of the conjecture for length three conjectures is nearly settled, as it is known that for the analogous bound on {r_3(N)} one can take any {c} less than one.)

Most of the previous work on bounding {r_4(N)} relied in some form or another on the density increment argument introduced by Roth back in 1953; roughly speaking, the idea is to show that if a dense subset {A} of {[N]} fails to contain arithmetic progressions of length four, one seeks to then locate a long subprogression of {[N]} in which {A} has increased density. This was the basic method for instance underlying our previous bound {r_4(N) \ll N \exp( - c \sqrt{\log \log N})}, as well as a finite field analogue of the bound {r_4(N) \ll N (\log N)^{-c}}; however we encountered significant technical difficulties for several years in extending this argument to obtain the result of the current paper. Our method is instead based on “energy increment arguments”, and more specifically on establishing quantitative version of a Khintchine-type recurrence theorem, similar to the qualitative recurrence theorems established (in the ergodic theory context) by Bergelson-Host-Kra, and (in the current combinatorial context) by Ben Green and myself.

One way to phrase the latter recurrence theorem is as follows. Suppose that {A \subset [N]} has density {\delta}. Then one would expect a “randomly” selected arithmetic progression {{\bf a}, {\bf a}+{\bf r}, {\bf a}+2{\bf r}, {\bf a}+3{\bf r}} in {[N]} (using the convention that random variables will be in boldface) to be contained in {A} with probability about {\delta^4}. This is not true in general, however it was shown by Ben and myself that for any {\eta>0}, there was a set of shifts {r \in [-N,N]} of cardinality {\gg_{\delta,\eta} N}, such that for any such {r} one had

\displaystyle {\bf P}( {\bf a}, {\bf a}+r, {\bf a}+2r, {\bf a}+3r \in A ) \geq \delta^4 - \eta

if {{\bf a}} was chosen uniformly at random from {[N]}. This easily implies that {r_4(N) = o(N)}, but does not give a particularly good bound on the decay rate, because the implied constant in the cardinality lower bound {\gg_{\delta,\eta} N} is quite poor (in fact of tower-exponential type, due to the use of regularity lemmas!), and so one has to take {N} to be extremely large compared to {\delta,\eta} to avoid the possibility that the set of shifts in the above theorem consists only of the trivial shift {r=0}.

We do not know how to improve the lower bound on the set of shifts to the point where it can give bounds that are competitive with those in this paper. However, we can obtain better quantitative results if we permit ourselves to couple together the two parameters {{\bf a}} and {{\bf r}} of the length four progression. Namely, with {A}, {\delta}, {\eta} as above, we are able to show that there exist random variables {{\bf a}, {\bf r}}, not necessarily independent, such that

\displaystyle {\bf P}( {\bf a}, {\bf a}+{\bf r}, {\bf a}+2{\bf r}, {\bf a}+3{\bf r} \in A ) \geq \delta^4 - \eta \ \ \ \ \ (1)

and such that we have the non-degeneracy bound

\displaystyle {\bf P}( {\bf r} = 0 ) \ll \exp( - \eta^{-O(1)} ) / N.

This then easily implies the main theorem.

The energy increment method is then deployed to locate a good pair {({\bf a}, {\bf r})} of random variables that will obey the above bounds. One can get some intuition on how to proceed here by considering some model cases. Firstly one can consider a “globally quadratically structured” case in which the indicator function {1_A} “behaves like” a globally quadratic function such as {F( \alpha n^2 )}, for some irrational {\alpha} and some smooth periodic function {F: {\bf R}/{\bf Z} \rightarrow {\bf R}} of mean {\delta}. If one then takes {{\bf a}, {\bf r}} to be uniformly distributed in {[N]} and {[-\varepsilon N, \varepsilon N]} respectively for some small {\varepsilon>0}, with no coupling between the two variables, then the left-hand side of (1) is approximately of the form

\displaystyle \int_{(x,y,z,w) \in ({\bf R}/{\bf Z})^4: x-3y+3z-w = 0} F(x) F(y) F(z) F(w) \ \ \ \ \ (2)

where the integral is with respect to the probability Haar measure, and the constraint {x-3y+3z-w=0} ultimately arises from the algebraic constraint

\displaystyle \alpha {\bf a}^2 - 3 \alpha ({\bf a}+{\bf r})^2 + 3 \alpha ({\bf a}+2{\bf r})^2 - \alpha ({\bf a}+3{\bf r})^2 = 0.

However, an application of the Cauchy-Schwarz inequality and Fubini’s theorem shows that the integral in (2) is at least {(\int_{{\bf R}/{\bf Z}} F)^4}, which (morally at least) gives (1) in this case.

Due to the nature of the energy increment argument, it also becomes necessary to consider “locally quadratically structured” cases, in which {[N]} is partitioned into some number of structured pieces {B_c} (think of these as arithmetic progressions, or as “Bohr sets), and on each piece {B_c}, {1_A} behaves like a locally quadratic function such as {F_c( \alpha_c n^2 )}, where {\alpha_c} now varies with {c}, and the mean of {F_c} will be approximately {\delta} on the average after averaging in {c} (weighted by the size of the pieces {B_c}). Now one should select {{\bf a}} and {{\bf r}} in the following coupled manner: first one chooses {{\bf a}} uniformly from {[N]}, then one defines {{\bf c}} to be the label {c} such that {{\bf a} \in B_c}, and then selects {{\bf r}} uniformly from a set {B_{c,\varepsilon}} which is related to {B_c} in much the same way that {[-\varepsilon N, \varepsilon N]} is related to {[N]}. If one does this correctly, the analogue of (2) becomes

\displaystyle {\bf E} \int_{(x,y,z,w) \in ({\bf R}/{\bf Z})^4: x-3y+3z-w = 0} F_{\mathbf c}(x) F_{\mathbf c}(y) F_{\mathbf c}(z) F_{\mathbf c}(w),

and one can again use Cauchy-Schwarz and Fubini’s theorem to conclude.

The general case proceeds, very roughly, by an iterative argument. At each stage of the iteration, one has some sort of quadratic model of {1_A} which involves a decomposition of {[N]} into structured pieces {B_c}, and a quadratic approximation to {1_A} on each piece. If this approximation is accurate enough (or more precisely, if a certain (averaged) local Gowers uniformity norm {U^3} of the error is small enough) to model the count in (1) (for random variables {{\bf a}, {\bf r}} determined by the above partition of {[N]} into pieces {B_c}), and if the frequencies (such as {\alpha_c}) involved in the quadratic approximation are “high rank” or “linearly independent over the rationals” in a suitably quantitative sense, then some version of the above arguments can be made to work. If there are some unwanted linear dependencies in the frequencies, we can do some linear algebra to eliminate one of the frequencies (using some geometry of numbers to keep the quantitative bounds under control) and continue the iteration. If instead the approximation is too inaccurate, then the error will be large in a certain averaged local Gowers uniformity norm {U^3}. A significant fraction of the paper is then devoted to establishing a quantitative inverse theorem for that norm that concludes (with good bounds) that the error must then locally correlate with locally quadratic phases, which can be used to refine the quadratic approximation to {1_A} in a manner that significantly increases its “energy” (basically an {L^2} norm). Such energy increments cannot continue indefinitely, and when they terminate we obtain the desired claim.

There are existing inverse theorems for {U^3} type norms in the literature, going back to the pioneering work of Gowers mentioned previously, and relying on arithmetic combinatorics tools such as Freiman’s theorem and the Balog-Szemerédi-Gowers lemma, which are good for analysing the “{1\%}-structured homomorphisms” that arise in Gowers’ argument. However, when we applied these methods to the local Gowers norms we obtained inferior quantitative results that were not strong enough for our application. Instead, we use arguments from a different paper of Gowers in which he tackled Szemerédi’s theorem for arbitrary length progressions. This method produces “{99\%}-structured homomorphisms” associated to any function with large Gowers uniformity norm; however the catch is that such homomorphisms are initially supported only on a sparse unstructured set, rather than a structured set such as a Bohr set. To proceed further, one first has to locate inside the sparse unstructured set a sparse pseudorandom subset of a Bohr set, and then use “error-correction” type methods (such as “majority-vote” based algorithms) to locally upgrade this {99\%}-structured homomorphism on pseudorandom subsets of Bohr sets to a {100\%}-structured homomorphism on the entirety of a Bohr set. It is then possible to use some “approximate cohomology” tools to “integrate” these homomorphisms (and discern a key “local symmetry” property of these homomorphisms) to locate the desired local quadratic structure (in much the same fashion that a {1}-form on {{\bf R}^n} that varies linearly with the coordinates can be integrated to be the derivative of a quadratic function if we know that the {1}-form is closed). These portions of the paper are unfortunately rather technical, but broadly follow the methods already used in previous literature.

Filed under: math.CO, paper Tagged: arithmetic progressions, Ben Green, Gowers uniformity norm, Szemeredi's theorem

May 06, 2017

Terence TaoNotes on nilcharacters and their symbols

A sequence {a: {\bf Z} \rightarrow {\bf C}} of complex numbers is said to be quasiperiodic if it is of the form

\displaystyle a(n) = F( \alpha_1 n \hbox{ mod } 1, \dots, \alpha_k n \hbox{ mod } 1)

for some real numbers {\alpha_1,\dots,\alpha_k} and continuous function {F: ({\bf R}/{\bf Z})^k \rightarrow {\bf C}}. For instance, linear phases such as {n \mapsto e(\alpha n + \beta)} (where {e(\theta) := e^{2\pi i \theta}}) are examples of quasiperiodic sequences; the top order coefficient {\alpha} (modulo {1}) can be viewed as a “frequency” of the integers, and an element of the Pontryagin dual {\hat {\bf Z} \equiv {\bf R}/{\bf Z}} of the integers. Any periodic sequence is also quasiperiodic (taking {k=1} and {\alpha_1} to be the reciprocal of the period). A sequence is said to be almost periodic if it is the uniform limit of quasiperiodic sequences. For instance any Fourier series of the form

\displaystyle a(n) = \sum_{j=1}^\infty c_j e(\alpha_j n)

with {\alpha_1,\alpha_2,\dots} real numbers and {c_1,c_2,\dots} an absolutely summable sequence of complex coefficients, will be almost periodic.

These sequences arise in various “complexity one” problems in arithmetic combinatorics and ergodic theory. For instance, if {(X, \mu, T)} is a measure-preserving system – a probability space {(X,\mu)} equipped with a measure-preserving shift, and {f_1,f_2 \in L^\infty(X)} are bounded measurable functions, then the correlation sequence

\displaystyle a(n) := \int_X f_1(x) f_2(T^n x)\ d\mu(x)

can be shown to be an almost periodic sequence, plus an error term {b_n} which is “null” in the sense that it has vanishing uniform density:

\displaystyle \sup_N \frac{1}{M} \sum_{n=N+1}^{N+M} |b_n| \rightarrow 0 \hbox{ as } M \rightarrow \infty.

This can be established in a number of ways, for instance by writing {a(n)} as the Fourier coefficients of the spectral measure of the shift {T} with respect to the functions {f_1,f_2}, and then decomposing that measure into pure point and continuous components.

In the last two decades or so, it has become clear that there are natural higher order versions of these concepts, in which linear polynomials such as {\alpha n + \beta} are replaced with higher degree counterparts. The most obvious candidates for these counterparts would be the polynomials {\alpha_d n^d + \dots + \alpha_0}, but this turns out to not be a complete set of higher degree objects needed for the theory. Instead, the higher order versions of quasiperiodic and almost periodic sequences are now known as basic nilsequences and nilsequences respectively, while the higher order version of a linear phase is a nilcharacter; each nilcharacter then has a symbol that is a higher order generalisation of the concept of a frequency (and the collection of all symbols forms a group that can be viewed as a higher order version of the Pontryagin dual of {{\bf Z}}). The theory of these objects is spread out in the literature across a number of papers; in particular, the theory of nilcharacters is mostly developed in Appendix E of this 116-page paper of Ben Green, Tamar Ziegler, and myself, and is furthermore written using nonstandard analysis and treating the more general setting of higher dimensional sequences. I therefore decided to rewrite some of that material in this blog post, in the simpler context of the qualitative asymptotic theory of one-dimensional nilsequences and nilcharacters rather than the quantitative single-scale theory that is needed for combinatorial applications (and which necessitated the use of nonstandard analysis in the previous paper).

For technical reasons (having to do with the non-trivial topological structure on nilmanifolds), it will be convenient to work with vector-valued sequences, that take values in a finite-dimensional complex vector space {{\bf C}^m} rather than in {{\bf C}}. By doing so, the space of sequences is now, technically, no longer a ring, as the operations of addition and multiplication on vector-valued sequences become ill-defined. However, we can still take complex conjugates of a sequence, and add sequences taking values in the same vector space {{\bf C}^m}, and for sequences taking values in different vector spaces {{\bf C}^m}, {{\bf C}^{m'}}, we may utilise the tensor product {\otimes: {\bf C}^m \times {\bf C}^{m'} \rightarrow {\bf C}^{mm'}}, which we will normalise by defining

\displaystyle (z_1,\dots,z_m) \otimes (w_1,\dots,w_{m'}) = (z_1 w_1, \dots, z_1 w_{m'}, \dots, z_m w_1, \dots, z_m w_{m'} ).

This product is associative and bilinear, and also commutative up to permutation of the indices. It also interacts well with the Hermitian norm

\displaystyle \| (z_1,\dots,z_m) \| := \sqrt{|z_1|^2 + \dots + |z_m|^2}

since we have {\|z \otimes w\| = \|z\| \|w\|}.

The traditional definition of a basic nilsequence (as defined for instance by Bergelson, Host, and Kra) is as follows:

Definition 1 (Basic nilsequence, first definition) A nilmanifold of step at most {d} is a quotient {G/\Gamma}, where {G} is a connected, simply connected nilpotent Lie group of step at most {d} (thus, all {d+1}-fold commutators vanish) and {\Gamma} is a discrete cocompact lattice in {G}. A basic nilsequence of degree at most {d} is a sequence of the form {n \mapsto F(g^n g_0 \Gamma)}, where {g_0 \Gamma \in G/\Gamma}, {g \in G}, and {F: G/\Gamma \rightarrow {\bf C}^m} is a continuous function.

For instance, it is not difficult using this definition to show that a sequence is a basic nilsequence of degree at most {1} if and only if it is quasiperiodic. The requirement that {G} be simply connected can be easily removed if desired by passing to a universal cover, but it is technically convenient to assume it (among other things, it allows for a well-defined logarithm map that obeys the Baker-Campbell-Hausdorff formula). When one wishes to perform a more quantitative analysis of nilsequences (particularly when working on a “single scale”. sich as on a single long interval {\{ N+1, \dots, N+M\}}), it is common to impose additional regularity conditions on the function {F}, such as Lipschitz continuity or smoothness, but ordinary continuity will suffice for the qualitative discussion in this blog post.

Nowadays, and particularly when one needs to understand the “single-scale” equidistribution properties of nilsequences, it is more convenient (as is for instance done in this ICM paper of Green) to use an alternate definition of a nilsequence as follows.

Definition 2 Let {d \geq 0}. A filtered group of degree at most {d} is a group {G} together with a sequence {G_\bullet = (G_0,G_1,G_2,\dots)} of subgroups {G \geq G_0 \geq G_1 \geq \dots} with {G_{d+1}=\{\hbox{id}\}} and {[G_i,G_j] \subset G_{i+j}} for {i,j \geq 0}. A polynomial sequence {g: {\bf Z} \rightarrow G} into a filtered group is a function such that {\partial_{h_i} \dots \partial_{h_1} g(n) \in G_i} for all {i \geq 0} and {n,h_1,\dots,h_i \in{\bf Z}}, where {\partial_h g(n) := g(n+h) g(n)^{-1}} is the difference operator. A filtered nilmanifold of degree at most {s} is a quotient {G/\Gamma}, where {G} is a filtered group of degree at most {s} such that {G} and all of the subgroups {G_i} are connected, simply connected nilpotent filtered Lie group, and {\Gamma} is a discrete cocompact subgroup of {G} such that {\Gamma_i := \Gamma \cap G_i} is a discrete cocompact subgroup of {G_i}. A basic nilsequence of degree at most {d} is a sequence of the form {n \mapsto F(g(n))}, where {g: {\bf Z} \rightarrow G} is a polynomial sequence, {G/\Gamma} is a filtered nilmanifold of degree at most {d}, and {F: G \rightarrow {\bf C}^m} is a continuous function which is {\Gamma}-automorphic, in the sense that {F(g \gamma) = F(g)} for all {g \in G} and {\gamma \in \Gamma}.

One can easily identify a {\Gamma}-automorphic function on {G} with a function on {G/\Gamma}, but there are some (very minor) advantages to working on the group {G} instead of the quotient {G/\Gamma}, as it becomes slightly easier to modify the automorphy group {\Gamma} when needed. (But because the action of {\Gamma} on {G} is free, one can pass from {\Gamma}-automorphic functions on {G} to functions on {G/\Gamma} with very little difficulty.) The main reason to work with polynomial sequences {n \mapsto g(n)} rather than geometric progressions {n \mapsto g^n g_0 \Gamma} is that they form a group, a fact essentially established by by Lazard and Leibman; see Corollary B.4 of this paper of Green, Ziegler, and myself for a proof in the filtered group setting.

It is easy to see that any sequence that is a basic nilsequence of degree at most {d} in the sense of the first definition, is also a basic nilsequence of degree at most {d} in the second definition, since a nilmanifold of degree at most {d} can be filtered using the lower central series, and any linear sequence {n \mapsto g^n g_0} will be a polynomial sequence with respect to that filtration. The converse implication is a little trickier, but still not too hard to show: see Appendix C of this paper of Ben Green, Tamar Ziegler, and myself. There are two key examples of basic nilsequences to keep in mind. The first are the polynomially quasiperiodic sequences

\displaystyle a(n) = F( P_1(n), \dots, P_k(n) ),

where {P_1,\dots,P_k: {\bf Z} \rightarrow {\bf R}} are polynomials of degree at most {d}, and {F: {\bf R}^k \rightarrow {\bf C}^m} is a {{\bf Z}^k}-automorphic (i.e., {{\bf Z}^k}-periodic) continuous function. The map {P: {\bf Z} \rightarrow {\bf R}^k} defined by {P(n) := (P_1(n),\dots,P_k(n))} is a polynomial map of degree at most {d}, if one filters {{\bf R}^k} by defining {({\bf R}^k)_i} to equal {{\bf R}^k} when {i \leq d}, and {\{0\}} for {i > d}. The torus {{\bf R}^k/{\bf Z}^k} then becomes a filtered nilmanifold of degree at most {d}, and {a(n)} is thus a basic nilsequence of degree at most {d} as per the second definition. It is also possible explicitly describe {a_n} as a basic nilsequence of degree at most {d} as per the first definition, for instance (in the {k=1} case) by taking {G} to be the space of upper triangular unipotent {d+1 \times d+1} real matrices, and {\Gamma} the subgroup with integer coefficients; we leave the details to the interested reader.

The other key example is a sequence of the form

\displaystyle a(n) = F( \alpha n, \{ \alpha n \} \beta n )

where {\alpha,\beta} are real numbers, {\{ \alpha n \} = \alpha n - \lfloor \alpha n \rfloor} denotes the fractional part of {\alpha n}, and and {F: {\bf R}^2 \rightarrow {\bf C}^m} is a {{\bf Z}^2}-automorphic continuous function that vanishes in a neighbourhood of {{\bf Z} \times {\bf R}}. To describe this as a nilsequence, we use the nilpotent connected, simply connected degree {2}, Heisenberg group

\displaystyle G := \begin{pmatrix} 1 & {\bf R} & {\bf R} \\ 0 & 1 & {\bf R} \\ 0 & 0 & 1 \end{pmatrix}

with the lower central series filtration {G_0=G_1=G}, {G_2= [G,G] = \begin{pmatrix} 1 &0 & {\bf R} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}, and {G_i = \{ \mathrm{id} \}} for {i > 2}, {\Gamma} to be the discrete compact subgroup

\displaystyle \Gamma := \begin{pmatrix} 1 & {\bf Z} & {\bf Z} \\ 0 & 1 & {\bf Z} \\ 0 & 0 & 1 \end{pmatrix},

{g: {\bf Z} \rightarrow G} to be the polynomial sequence

\displaystyle g(n) := \begin{pmatrix} 1 & \beta n & \alpha \beta n^2 \\ 0 & 1 & \alpha n \\ 0 & 0 & 1 \end{pmatrix}

and {\tilde F: G \rightarrow {\bf C}^m} to be the {\Gamma}-automorphic function

\displaystyle \tilde F( \begin{pmatrix} 1 & x & y \\ 0 & 1 & z \\ 0 & 0 & 1 \end{pmatrix} ) = F( \{ z \}, y - \lfloor z \rfloor x );

one easily verifies that this function is indeed {\Gamma}-automorphic, and it is continuous thanks to the vanishing properties of {F}. Also we have {a(n) = \tilde F(g(n))}, so {a} is a basic nilsequence of degree at most {2}. One can concoct similar examples with {\{ \alpha n \} \beta n} replaced by other “bracket polynomials” of {n}; for instance

\displaystyle a(n) = F( \alpha n, \{ \alpha n - \frac{1}{2} \} \beta n )

will be a basic nilsequence if {F} now vanishes in a neighbourhood of {(\frac{1}{2}+{\bf Z}) \times {\bf R}} rather than {{\bf Z} \times {\bf R}}. See this paper of Bergelson and Leibman for more discussion of bracket polynomials (also known as generalised polynomials) and their relationship to nilsequences.

A nilsequence of degree at most {d} is defined to be a sequence that is the uniform limit of basic nilsequences of degree at most {d}. Thus for instance a sequence is a nilsequence of degree at most {1} if and only if it is almost periodic, while a sequence is a nilsequence of degree at most {0} if and only if it is constant. Such objects arise in higher order recurrence: for instance, if {h_0,\dots,h_d} are integers, {(X,\mu,T)} is a measure-preserving system, and {f_0,\dots,f_d \in L^\infty(X)}, then it was shown by Leibman that the sequence

\displaystyle n \mapsto \int_X f_0(T^{h_0 n} x) \dots f_d(T^{h_d n} x)\ d\mu(x)

is equal to a nilsequence of degree at most {d}, plus a null sequence. (The special case when the measure-preserving system was ergodic and {h_i = i} for {i=0,\dots,d} was previously established by Bergelson, Host, and Kra.) Nilsequences also arise in the inverse theory of the Gowers uniformity norms, as discussed for instance in this previous post.

It is easy to see that a sequence {a: {\bf Z} \rightarrow {\bf C}^m} is a basic nilsequence of degree at most {d} if and only if each of its {m} components are. The scalar basic nilsequences {a: {\bf Z} \rightarrow {\bf C}} of degree {d} are easily seen to form a {*}-algebra (that is to say, they are a complex vector space closed under pointwise multiplication and complex conjugation), which implies similarly that vector-valued basic nilsequences {a: {\bf Z} \rightarrow {\bf C}^m} of degree at most {d} form a complex vector space closed under complex conjugation for each {m}, and that the tensor product of any two basic nilsequences of degree at most {d} is another basic nilsequence of degree at most {d}. Similarly with “basic nilsequence” replaced by “nilsequence” throughout.

Now we turn to the notion of a nilcharacter, as defined in this paper of Ben Green, Tamar Ziegler, and myself:

Definition 3 (Nilcharacters) Let {d \geq 1}. A sub-nilcharacter of degree {d} is a basic nilsequence {\chi: n \mapsto F(g(n))} of degree at most {d}, such that {F} obeys the additional modulation property

\displaystyle F( g_d g ) = e( \xi \cdot g_d ) F(g) \ \ \ \ \ (1)

for all {g \in G} and {g_d \in G_d}, where {\xi: G_d \rightarrow {\bf R}} is a continuous homomorphism {g_d \mapsto \xi \cdot g_d}. (Note from (1) and {\Gamma}-automorphy that unless {F} vanishes identically, {\xi} must map {\Gamma_d} to {{\bf Z}}, thus without loss of generality one can view {\xi} as an element of the Pontryagial dual of the torus {G_d / \Gamma_d}.) If in addition one has {\|F(g)\|=1} for all {g \in G}, we call {\chi} a nilcharacter of degree {d \geq 1}.

In the degree one case {d=1}, the only sub-nilcharacters are of the form {\chi(n) = e(\alpha n)} for some vector {c \in {\bf C}^m} and {\alpha \in {\bf R}}, and this is a nilcharacter if {c} is a unit vector. Similarly, in higher degree, any sequence of the form {\chi(n) = c e(P(n))}, where {c \in {\bf C}^m} is a vector and {P: {\bf Z} \rightarrow {\bf R}} is a polynomial of degree at most {d}, is a sub-nilcharacter of degree {d}, and a character if {c} is a unit vector. A nilsequence of degree at most {d-1} is automatically a sub-nilcharacter of degree {d}, and a nilcharacter if it is of magnitude {1}. A further example of a nilcharacter is provided by the two-dimensional sequence {\chi: {\bf Z} \rightarrow {\bf C}^2} defined by

\displaystyle \chi(n) := ( F_0( \alpha n ) e( \{ \alpha n \} \beta n ), F_{1/2}( \alpha n ) e( \{ \alpha n - \frac{1}{2} \} \beta n ) ) \ \ \ \ \ (2)

where {F_0, F_{1/2}: {\bf R} \rightarrow {\bf C}} are continuous, {{\bf Z}}-automorphic functions that vanish on a neighbourhood of {{\bf Z}} and {\frac{1}{2}+{\bf Z}} respectively, and which form a partition of unity in the sense that

\displaystyle |F_0(x)|^2 + |F_{1/2}(x)|^2 = 1

for all {x \in {\bf R}}. Note that one needs both {F_0} and {F_{1/2}} to be not identically zero in order for all these conditions to be satisfied; it turns out (for topological reasons) that there is no scalar nilcharacter that is “equivalent” to this nilcharacter in a sense to be defined shortly. In some literature, one works exclusively with sub-nilcharacters rather than nilcharacters, however the former space contains zero-divisors, which is a little annoying technically. Nevertheless, both nilcharacters and sub-nilcharacters generate the same set of “symbols” as we shall see later.

We claim that every degree {d} sub-nilcharacter {f: {\bf Z} \rightarrow {\bf C}^m} can be expressed in the form {f = c \chi}, where {\chi: {\bf Z} \rightarrow {\bf C}^{m'}} is a degree {d} nilcharacter, and {c: {\bf C}^{m'} \rightarrow {\bf C}^m} is a linear transformation. Indeed, by scaling we may assume {f(n) = F(g(n))} where {|F| < 1} uniformly. Using partitions of unity, one can find further functions {F_1,\dots,F_m} also obeying (1) for the same character {\xi} such that {|F_1|^2 + \dots + |F_m|^2} is non-zero; by dividing out the {F_1,\dots,F_m} by the square root of this quantity, and then multiplying by {\sqrt{1-|F|^2}}, we may assume that

\displaystyle |F|^2 + |F_1|^2 + \dots + |F_m|^2 = 1,

and then

\displaystyle \chi(n) := (F(g(n)), F_1(g(n)), \dots, F_m(g(n)))

becomes a degree {d} nilcharacter that contains {f(n)} amongst its components, giving the claim.

As we shall show below, nilsequences can be approximated uniformly by linear combinations of nilcharacters, in much the same way that quasiperiodic or almost periodic sequences can be approximated uniformly by linear combinations of linear phases. In particular, nilcharacters can be used as “obstructions to uniformity” in the sense of the inverse theory of the Gowers uniformity norms.

The space of degree {d} nilcharacters forms a semigroup under tensor product, with the constant sequence {1} as the identity. One can upgrade this semigroup to an abelian group by quotienting nilcharacters out by equivalence:

Definition 4 Let {d \geq 1}. We say that two degree {d} nilcharacters {\chi: {\bf Z} \rightarrow {\bf C}^m}, {\chi': {\bf Z} \rightarrow {\bf C}^{m'}} are equivalent if {\chi \otimes \overline{\chi'}: {\bf Z} \rightarrow {\bf C}^{mm'}} is equal (as a sequence) to a basic nilsequence of degree at most {d-1}. (We will later show that this is indeed an equivalence relation.) The equivalence class {[\chi]_{\mathrm{Symb}^d({\bf Z})}} of such a nilcharacter will be called the symbol of that nilcharacter (in analogy to the symbol of a differential or pseudodifferential operator), and the collection of such symbols will be denoted {\mathrm{Symb}^d({\bf Z})}.

As we shall see below the fold, {\mathrm{Symb}^d({\bf Z})} has the structure of an abelian group, and enjoys some nice “symbol calculus” properties; also, one can view symbols as precisely describing the obstruction to equidistribution for nilsequences. For {d=1}, the group is isomorphic to the Ponytragin dual {\hat {\bf Z} = {\bf R}/{\bf Z}} of the integers, and {\mathrm{Symb}^d({\bf Z})} for {d > 1} should be viewed as higher order generalisations of this Pontryagin dual. In principle, this group can be explicitly described for all {d}, but the theory rapidly gets complicated as {d} increases (much as the classification of nilpotent Lie groups or Lie algebras of step {d} rapidly gets complicated even for medium-sized {d} such as {d=3} or {d=4}). We will give an explicit description of the {d=2} case here. There is however one nice (and non-trivial) feature of {\mathrm{Symb}^d({\bf Z})} for {d \geq 2} – it is not just an abelian group, but is in fact a vector space over the rationals {{\bf Q}}!

— 1. Properties of nilcharacters —

Much of the material here is an adaptation of material from Appendix E of my paper with Green and Ziegler. We focus primarily on nilcharacters, and discuss the theory of sub-nilcharacters in Remark 9.

We begin with the verification that equivalence of nilcharacters is indeed an equivalence relation. Symmetry is obvious (since the conjugate of a basic nilsequence is again a basic nilsequence). Now we turn to reflexivity. Let {\chi(n) = F( g(n) )} be a degree {d} nilcharacter. Then {\chi \otimes \overline{\chi}} can be written as

\displaystyle \chi \otimes \overline{\chi}(n) = F \otimes \overline{F}( g(n) ).

From (1) we see that the function {F \otimes \overline{F}: G \rightarrow {\bf C}^{m^2}} is {G_d}-invariant, and thus descends upon quotienting by the closed central subgroup {G_d} of {G} to a function on {G/G_d} (which is {\Gamma G_d/G_d} invariant). As {G/G_d} is a filtered group of degree at most {d-1}, and {G/\Gamma G_d} is similarly a filtered nilmanifold of degree at most {d-1}, this lets us describe {\chi \otimes \overline{\chi}} as a nilsequence of degree {\leq d-1}, giving the claim. (It is instructive to calculate {\chi \otimes \overline{\chi}} explicitly for the example (2) to see the quasiperiodic structure.) Finally, we show transitivity. If {\chi_1} is equivalent to {\chi_2}, and {\chi_2} is equivalent to {\chi_3}, then {\chi_1 \otimes \overline{\chi_2}} and {\chi_2 \otimes \overline{\chi_3}} are both nilsequences of degree {\leq d-1}, so their tensor product {\chi_1 \otimes \overline{\chi_2} \otimes \chi_2 \otimes \overline{\chi_3}} is also. On the other hand, {\chi_2} has magnitude {1}, so {\overline{\chi_2} \otimes \chi_2} has trace one. Contracting, we conclude that {\chi_1 \otimes \overline{\chi_3}} is a nilsequence of degree {\leq d-1}, establishing transitivity.

Remark 5 A variant of the above arguments shows that two degree {d} nilcharacters {\chi: {\bf Z} \rightarrow {\bf C}^m}, {\chi': {\bf Z} \rightarrow {\bf C}^{m'}} are equivalent if and only if {\chi} can be expressed as {c (\chi' \otimes \psi)} for some nilsequence {\psi: {\bf Z} \rightarrow {\bf C}^{m''}} of degree at most {d-1}, and some linear transformation {c: {\bf C}^{m'm''} \rightarrow {\bf C}^m}.

Now we impose a group structure on {\mathrm{Symb}^d({\bf Z})} by taking

\displaystyle 0 := [1]_{\mathrm{Symb}^d({\bf Z})}

to be the identity element,

\displaystyle - [\chi]_{\mathrm{Symb}^d({\bf Z})} := [\overline{\chi}]_{\mathrm{Symb}^d({\bf Z})}

to be the negation operation, and

\displaystyle [\chi]_{\mathrm{Symb}^d({\bf Z})} + [\chi']_{\mathrm{Symb}^d({\bf Z})} := [\chi \otimes \chi']_{\mathrm{Symb}^d({\bf Z})}

to be the group operation, for any degree {d} nilcharacters {\chi,\chi'}.

It is clear that negation is well-defined: if {\chi} is equivalent to {\chi'}, then {\overline{\chi}} is equivalent to {\overline{\chi'}}. Similarly, the group operation is well-defined; for instance, if {\chi'} is equivalent to {\chi''}, then {\chi \otimes \chi'} is equivalent to {\chi \otimes \chi''}, since {(\chi \otimes \chi') \otimes \overline{(\chi \otimes \chi'')}} is a rearrangement of the tensor product of {\chi \otimes \overline{\chi}} and {\chi' \otimes \overline{\chi''}}, both of which are known to be nilsequences of degree at most {d-1}. It is also routine to check that the group operation is commutative and associative, that {0} is indeed the group identity, and that the negation operation is the inverse of the addition operation.

Now we show that any nilsequence {f: {\bf Z} \rightarrow {\bf C}^m} of degree at most {d} may be approximated uniformly by finite linear combinations

\displaystyle \sum_{j=1}^k c_j \chi_j

of degree {d} nilsequences {\chi_j: {\bf Z} \rightarrow {\bf C}^{m_j}}, where {c_j: {\bf C}^{m_j} \rightarrow {\bf C}^m} are linear transformations. Clearly it suffices to show this for a basic nilsequence {f(n) = F(g(n))}. By the Stone-Weierstrass theorem, the continuous function {F} can be approximated uniformly by smooth functions, so we may assume without loss of generality that {F} is smooth. Quotienting out by the discrete group {\Gamma_d}, {F} now descends to a function on {G/\Gamma_d}, which has a central free action of the torus {G_d/\Gamma_d}. Performing a (uniformly convergent) Fourier decomposition with respect to this group action (or, if one prefers, decomposing into isotypic components of this action), we may assume that {F} obeys the equivariance property (1) for some character {\xi} from {G_d/\Gamma_d} to {{\bf R}/{\bf Z}}, which makes it a sub-nilcharacter, and the claim then follows.

In the case {d=1}, we have already noted that every nilcharacter takes the form {\chi(n) = c e(\alpha n)} for some real number {\alpha} and unit vector {c}, and it is easy to see that two such nilcharacters {c e(\alpha n)}, {c' e(\alpha' n)} are equivalent if and only if {\alpha} and {\alpha'} differ by an integer. As such, one readily verifies that {\mathrm{Symb}^1({\bf Z})} is isomorphic to {{\bf R}/{\bf Z}}, with the symbol of {c e(\alpha n)} being identified with {\alpha \hbox{ mod } 1}. Similarly, if {d > 1} and one restricts attention to nilcharacters of the form {\chi(n) = c e(\alpha_d n^d + \dots + \alpha_0)}, then the nilcharacters are equivalent if and only if the {\alpha_d} differ by a rational, and so {\mathrm{Symb}^d({\bf Z})} contains a copy of {{\bf R}/{\bf Q}}, with the symbol of {c e(\alpha_d n^d + \dots + \alpha_0)} being identified with {\alpha_d \hbox{ mod } {\bf Q}}. (Among other things, this suggests that there is no particularly good (e.g. Hausdorff) topology to place on {\mathrm{Symb}^d({\bf Z})} for {d>1}, in contrast to the Pontryagin dual which has a natural topology (the compact-open topology).

Nilcharacters behave well with translation and dilation:

Lemma 6 (Affine symmetry) Let {d \geq 1}, and let {\chi: {\bf Z} \rightarrow {\bf C}^m} be a nilcharacter.

  • For any integer {h}, the shift {n \mapsto \chi(n+h)} is equivalent to {\chi}: {[\chi(\cdot+h)]_{\mathrm{Symb}^d({\bf Z})} = [\chi]_{\mathrm{Symb}^d({\bf Z})}}.
  • For any integer {q}, the dilation {n \mapsto \chi(qn)} is equivalent to {\chi^{\otimes q^d}} (the tensor product of {q^d} copies of {\chi}: {[\chi(q\cdot)]_{\mathrm{Symb}^d({\bf Z})} = q^d[\chi]_{\mathrm{Symb}^d({\bf Z})}}.

The reader may check that this lemma is consistent with the identification of the symbol of {c e(\alpha_d n^d + \dots + \alpha_0)} with {\alpha_d \hbox{ mod } {\bf Q}}.

Proof: For the first claim, it suffices to represent {\chi(\cdot+h) \otimes \overline{\chi}(\cdot)} as a nilsequence of degree at most {d-1}. If {\chi(n) = F(g(n))}, then this sequence can be expanded as

\displaystyle n \mapsto F(g(n+h)) \otimes \overline{F}(g(n)).

The sequence {n \mapsto (g(n+h), g(n))} can be verified to be a polynomial sequence in {G^\Box}, which is defined to be the filtered group {G^\Box = G_0 \times_{G_1} G_0} with subgroups {G^\Box_i := G_i \times_{G_{i+1}} G_i}, where {G_i \times_{G_{i+1}} G_i} is the collection of pairs {(g_i, g'_i) \in G_i \times G_i} such that {g_i \in g'_i G_{i+1}}. From (1) we see that the function {(g,g') \mapsto F(g) \otimes \overline{F}(g')} is invariant with respect to the central action of {G^\Box_d} (which is the diagonally embedded copy of {G_d} in {G_d \times G_d}). If we then define {\overline{G^\Box}} to be the quotient of {G^\Box} by {G^\Box_d}, this is now a filtered group of degree at most {d-1}, and the polynomial sequence {n \mapsto (g(n+h),g(n))} projects to a polynomial sequence {\overline{g^\Box}} in {\overline{G^\Box}}, with a representation of the form

\displaystyle F(g(n+h)) \otimes \overline{F}(g(n)) = \overline{F^\Box}( \overline{g^\Box}(n))

for some continuous function {\overline{F^\Box}: \overline{G^\Box} \rightarrow {\bf C}^{m^2}} which is automorphic with respect to the projection of {G^\Box} to {\overline{G^\Box}}. This gives the first claim.

The second claim is proven similarly, with the role of {G_i \times_{G_{i+1}} G_i} being played by the collection of pairs {(g_i, g'_i) \in G_i \times G_i} such that {g_i \in (g'_i)^{q^i} G_{i+1}}; we leave the details to the interested reader (or see Lemma E.8(v) of my paper with Green and Ziegler). \Box

Now we connect the symbol with equidistribution.

Proposition 7 Let {d \geq 1}, and let {\chi: {\bf R} \rightarrow {\bf C}^m} be a nilcharacter. Then {[\chi]_{\mathrm{Symb}^d({\bf Z})}} vanishes if and only if there exists a nilsequence {f} of degree at most {d-1} such that

\displaystyle \limsup_{N \rightarrow \infty} |\sum_{n=1}^N \chi(n) \otimes f(n)| > 0.

Actually, the limits of sequences such as {\lim_{N \rightarrow \infty} \sum_{n=1}^N \chi(n) \otimes f(n)} always exist, and are in fact equal to {\lim_{N \rightarrow \infty} \sum_{n=M_N+1}^{M_N+N} \chi(n) \otimes f(n)} for an arbitrary sequence {M_N}; this follows from the equidistribution theorem of Leibman.

Proof: If {[\chi]_{\mathrm{Symb}^d({\bf Z})}} vanishes, then {\chi} is a nilsequence of degree {d-1}, and {\chi \otimes \overline{\chi}} has trace {!1}. The forward implication then follows by taking {f = \overline{\chi}}.

Now we establish the converse implication. As {f} is a sub-nilcharacter of vanishing symbol, we may write it as a linear transform of a nilcharacter {\chi'} of vanishing symbol. By replacing {\chi} with {\chi \otimes \chi'} (which does not affect the symbol), we may thus assume without loss of generality that {f} is the identity, that is to say we may assume without loss of generality that

\displaystyle \limsup_{N \rightarrow \infty} |\sum_{n=1}^N \chi(n)| > 0.

When {d=1} this immediately shows that {\chi} is constant from Fourier analysis, so assume {d>1}. Write {\chi(n) = F(g(n))}. Now we appeal to the factorisation theorem for nilsequences (see Corollary 1.12 of this paper of myself and Ben Green to factorise

\displaystyle g(n) = \varepsilon g'(n) \gamma(n)

where {\varepsilon \in G} is a constant, {\gamma} is a polynomial sequence which is rational in the sense that there is a natural number {m} such that {\gamma(n)^m \in\Gamma} for all {n \in {\bf Z}} (which, among other things, implies that {\gamma(n) \Gamma'} is periodic in {n} for any subgroup {\Gamma'} commensurate to {\Gamma}), and {g'} is another periodic sequence which takes values in a filtered subgroup {G'} of {G} such that {G' = G'_0} and {G'_i} is a connected, simply connected Lie group containing {\Gamma \cap G'_i} as a cocompact subgroup for each {i}, such that {g'(n) \Gamma'} is totally equidistributed (that is, equidistributed along any arithmetic progression) in {G'/\Gamma'} for any {\Gamma'} commensurate to {\Gamma \cap G'}.

By the triangle inequality, we can find an arithmetic progression in which {\gamma(n) = \gamma} is constant and {\chi} has non-zero mean. On this progression we have {\chi(n) = \tilde F(g'(n))}, where {\tilde F: G' \rightarrow {\bf C}^m} is the function {\tilde F(g) := F(\varepsilon g \gamma)}. This function is automorphic with respect to the group {\gamma^{-1} \Gamma \gamma \cap G'}, which is commensurate with {\Gamma \cap G'} as {\gamma} is rational; by equidistribution, we conclude that {\tilde F} has non-zero mean on {G'}. But {\tilde F} obeys the equivariance property (1) for {g_d} in {G'_d}, and hence {\xi} must annihilate all of {G'_d}. Quotienting out by {G'_d}, we can thus represent {\chi} on this progression by a nilsequence of degree at most {d-1}. On other progressions of the same spacing, the constant {\gamma} changes, but {\xi} still annihilates {G'_d}, so {\chi} is also a nilsequence of degree at most {d-1} on all other progressions. Since the indicator function of an arithmetic progression is also a nilsequence of degree at most {1}, and hence at most {d-1}, {\chi} itself is a nilsequence of degree at most {d-1}, as required. \Box

Remark 8 One can almost obtain this result by using the arguments of Bogdanov and Viola, but the assertion that {[\chi]_{\mathrm{Symb}^d({\bf Z})}} vanishes becomes replaced with the weaker assertion that {[\chi]_{\mathrm{Symb}^d({\bf Z})}} can be approximated to arbitrary accuracy (in some normalised {\ell^1} sense) by nilsequences of degree at most {d-1}. In the finite field setting one can use an “error correction” argument to recover an analogue of the above proposition by the Bogdanov-Viola argument; see this paper of Ben Green and myself. But we do not know how to do this in the nilsequence setting without using the same sort of machinery used in the above proof of the proposition.

Remark 9 Now we can discuss the relationship of sub-nilcharacters with nilcharacters. We have already observed that any degree {d} sub-nilcharacter {f} is of the form {c \psi}, where {\psi} is a degree {d} nilcharacter and {c} is a linear transformation; conversely, any sequence of the form {c \psi} will be a degree {d} sub-nilcharacter. If a sub-nilcharacter {f} that is not identically zero has two representations {f = c \psi = c' \psi'}, then {|f|^2 = \tilde c( \psi \otimes \overline{\psi'})} for some linear functional {\tilde c}. The mean of {|f|^2} is then non-zero by the equidistribution theory of nilmanifolds, hence by the above proposition, {[\psi \otimes \overline{\psi'}]_{\mathrm{Symb}^d({\bf Z})}} vanishes, thus {\psi} and {\psi'} have the same symbol. Because of this, it becomes meaningful to talk about the symbol {[f]_{\mathrm{Symb}^d({\bf Z})}} of a sub-nilcharacter as long as it is not identically zero. This symbol can then easily be verified to be a homomorphism in the sense that {[f \otimes f']_{\mathrm{Symb}^d({\bf Z})} = [f]_{\mathrm{Symb}^d({\bf Z})} + [f']_{\mathrm{Symb}^d({\bf Z})}} whenever {f,f'} are degree {d} sub-nilcharacters with {f \otimes f'} not identically zero. Thus one has a reasonably satisfactory theory of symbols for sub-nilcharacters as well as nilcharacters, although the presence of zero divisors does create some additional annoying technical issues, which is the main reason we focus instead on nilcharacters in this post.

For similar reasons, we can also assign a symbol in {\mathrm{Symb}^d({\bf Z})} to any bracket polynomial of degree at most {d}; we omit the details.

Now we discuss the structure of {\mathrm{Symb}^2({\bf Z})}. Amongst the elements of this group, we have the symbols {[\alpha]} for {\alpha \in {\bf R}}, defined as the symbol of the nilcharacter {n \mapsto e(\alpha n^2)}. These symbols obey the laws

\displaystyle [\alpha] + [\beta] = [\alpha+\beta]


\displaystyle [a/q] = 0

for any rational {a/q} (since {n \mapsto e(an^2/q)} is periodic and thus a nilsequence of degree at most {1}). In addition, we have the symbols {[\alpha \ast \beta]}, defined for {\alpha,\beta \in {\bf R}} to be the symbol of the nilcharacter (2). It is clear from definition that these symbols are linear in {\beta}:

\displaystyle [\alpha \ast \beta] + [\alpha \ast \beta'] = [\alpha \ast (\beta+\beta')]. \ \ \ \ \ (3)

Also, if one of {\alpha} or {\beta} is rational, then the nilcharacter collapses to a nilsequence of degree at most one, and thus

\displaystyle [a/q * \alpha] = [\alpha * a/q] = 0

for any rational {a/q} and real {\alpha}. Furthermore, we also have the identity

\displaystyle [\alpha * \beta] + [\beta * \alpha] = [\alpha \beta].

To see this, observe that {e( (\alpha n - \{ \alpha n \})(\beta n - \{\beta n \})) = 1} since the argument of {e()} is the product of two integers. We can rewrite this as

\displaystyle e( \{ \alpha n \} \beta n ) e( \{ \beta n \} \alpha n ) = e( \alpha \beta n ).

The sub-nilcharacter {F( \alpha n ) e( \{ \alpha n \} \beta n )}, where {F: {\bf R} \rightarrow {\bf C}} is {{\bf Z}}-automorphic and vanishing on a neighbourhood of {{\bf Z}}, has symbol {[\alpha*\beta]}, and similarly {F( \beta n ) e( \{ \beta n \} \alpha n )} has symbol {[\beta*\alpha]}. One can arrange matters so that the product is not identically zero, and by the above identity the symbol will be {[\alpha \beta]}, giving the claim. From this and (3) we see that {\alpha \ast \beta} is also linear in {\alpha}:

\displaystyle [\alpha \ast \beta] + [\alpha' \ast \beta] = [(\alpha + \alpha') \ast \beta].

Finally, noting that {F(\alpha n / 2) e( (\alpha n - \{ \alpha n \})^2 / 2)} is a nilsequence of degree at most {1} if {F} vanishes on a neighbourhood of {{\bf Z}/2}, one can show that

\displaystyle [\alpha * \alpha] = [\alpha^2 / 2].

These turn out to be a generating set of identities for {\mathrm{Symb}^2({\bf Z})}:

Proposition 10 {\mathrm{Symb}^2({\bf Z})} is generated by the symbols {[\alpha]} and {[\alpha * \beta]} for {\alpha,\beta \in {\bf R}}, subject to the identities listed above.

Proof: The fact that any degree {2} nilcharacter can be expressed as a finite integer linear combination of the symbols {[\alpha]} and {[\alpha * \beta]} follows from the calculations in Section 12 of this paper of Green and myself. Now we need to show that a finite integer linear combination of these symbols only vanishes when such vanishing can be deduced from the above relations. Given such a combination, we can use express all real numbers {\alpha,\beta} involved as integer combinations of a linearly independent basis {1, \xi_1,\dots,\xi_m} over the rationals, and after many applications of the above identities we may then place the linear combination in the normal form

\displaystyle \sum_{1 \leq i < j \leq m} a_{ij} [\xi_i * \xi_j] + [\eta] \ \ \ \ \ (4)

for some integers {a_{ij}} and real {\eta}. It will then suffice to show that such linear combinations can only vanish when all the {a_{ij}} are zero and {\eta} is a rational.

The claim is clear when the {a_{ij}} all vanish, so suppose for instance that {a_{12}} does not vanish. If one takes {G} to be the free nilpotent Lie group of degree {2} generated by {m} generators {e_1,\dots,e_m}, and {\Gamma} to be the discrete subgroup generated by the same generators, then one can construct a sub-nilsequence {f} with the symbol (4) by the formula

\displaystyle f(n) := e(\eta n^2) F( e_m^{\xi_m n} \dots e_1^{\xi_1 n} )

where {F:G \rightarrow {\bf C}} is a continuous {\Gamma}-automorphic function such that

\displaystyle F( [e_j, e_i]^t g ) = e(a_{ij} t) F(g) \ \ \ \ \ (5)

for all {1 \leq i < j \leq n}, {g \in G}, and {t \in {\bf R}}.

If the symbol (4) vanishes, then this sub-nilsequence must correlate with a linear phase, thus

\displaystyle n \mapsto e(\eta n^2 + \gamma n) F( e_m^{\xi_m n} \dots e_1^{\xi_1 n} )

must have non-zero mean. Suppose first that {\eta} is irrational. The sequence {n \mapsto (\eta n^2 + \gamma_n, e_m^{\xi_m n} \dots e_1^{\xi_1 n} )} is a polynomial sequence into {{\bf R} \times G}, whose quotient onto the abelianisation {{\bf R} \times {\bf R}^m} is easily seen to be equidistributed modulo {{\bf Z} \times {\bf Z}^m}. Applying the equidistribution results , this implies that the polynomial sequence into {{\bf R} \times G} is equidistributed modulo {{\bf Z} \times \Gamma}. This implies that the {{\bf Z} \times \Gamma}-automorphic function

\displaystyle (x, g) \mapsto e(x) F(g)

on {{\bf R} \times G} has mean zero, which is absurd thanks to the {e(x)} component. Similarly if {\gamma} is irrational and independent (over {{\bf Q}}) of the {1,\xi_1,\dots,\xi_m}. The only remaining case is when {\xi} is rational and {\gamma} is a linear combination over {{\bf Q}} of the {1,\xi_1,\dots,\xi_m}. By dividing the {\xi_i} by a suitable natural number (and multiplying the {a_{i,j}} appropriately), we may assume that {\gamma} is the sum of a rational and an integer combination of the {\xi_1,\dots,\xi_m}, and then by passing to a suitable arithmetic progression we can absorb the {e(\eta n^2 + \gamma n)} term into the {F} term. The above equidistribution analysis now implies that {F} has non-zero mean on {G}, hich is absurd thanks to (5) applied to the nonzero coefficient {a_{12}}. \Box

Remark 11 We can now define an anti-symmetric form {\wedge} from {{\bf R}/{\bf Q} \rightarrow {\bf R}/{\bf Q}} to {\mathrm{Symb}^2({\bf Z})} by defining

\displaystyle (\alpha+{\bf Q}) \wedge (\beta+{\bf Q}) := [\alpha * \beta] - [\alpha \beta / 2]

for any real {\alpha,\beta}; one can check using the above identities that this is indeed a well-defined anti-symmetric form. The above proposition then gives an isomorphism

\displaystyle \mathrm{Symb}^2({\bf Z}) \equiv \bigwedge^2 {\bf R}/{\bf Q} \oplus {\bf R}/{\bf Q}

where we view {{\bf R}/{\bf Q}} as a vector space over {{\bf Q}}. In particular {\mathrm{Symb}^2({\bf Z})} is a vector space over {{\bf Q}}.

In principle, one could extend the above calculations to higher degrees, and give similarly explicit descriptions of {\mathrm{Symb}^d({\bf Z})} for higher {d}. The claculations appear to become rather complicated, however. Nevertheless, we can at least establish the following:

Theorem 12 If {d \geq 2}, then the abelian group {\mathrm{Symb}^d({\bf Z})} can be given the structure of a vector space over {{\bf Q}}.

Proof: In order to give a suitable action of {{\bf Q}} on {\mathrm{Symb}^d({\bf Z})}, one needs to show two things:

  • (divisible group) For every degree {d} nilcharacter {\chi} and natural number {q}, there exists a degree {d} nilcharacter {\chi'} such that {(\chi')^{\otimes q}} is equivalent to {\chi}.
  • (torsion-free) If {\chi} is a degree {d} nilcharacter and {q} is a natural number such that {\chi^{\otimes q}} has vanishing symbol, then {\chi} also has vanishing symbol.

We begin with divisibility. It suffices by Lemma 6 to write {\chi(n)} as {\chi''(q n)} for some degree {d} nilcharacter {\chi''}, since one can then take {\chi'} equal to {(\chi'')^{\otimes q^{d-1}}}. This in turn is achievable if we can extend the polynomial sequence {n \mapsto g(n)} from {{\bf Z}} to {G} associated to {\chi} to a continuous polynomial sequence {t \mapsto g(t)} from {{\bf R}} to {G}. But any polynomial sequence can be written in the form {g(n) = g_0 g_1^n \dots g_d^{n^d}} for some {g_i \in G_i}, and one simply takes {g(t) = g_0 g_1^t \dots g_d^{t^d}} for {t \in {\bf R}} (using the logarithm and exponential maps on the simply connected nilpotent groups {G_i} to define the operation of raising to real powers). This establishes divisibility; we note that this also holds when {d=1}.

Now we establish that {\mathrm{Symb}^d({\bf Z})} is torsion-free. If {\chi^{\otimes q}} has vanishing symbol, then by Lemma 6, {n \mapsto \chi(qn)} is a nilsequence of degree at most {d-1}, and hence correlates with a nilcharacter {\chi'} of degree {d-1}. By the divisibility argument, we can write {\chi'(n) = \chi''(qn)} for another nilcharacter {\chi''} of degree {d-1}. Thus {\chi} correlates with {\chi''} on an arithmetic progression of spacing {q}, and hence by Fourier expansion (and the hypothesis {d \geq 1}) we see that {\chi} correlates with a nilcharacter of degree at most {d-1} on the entire integers, giving the claim. \Box

Filed under: expository, math.CA, math.DS Tagged: additive combinatorics, equidistribution, nilcharacter, nilmanifolds, nilsequences

n-Category Café A Discussion on Notions of Lawvere Theories

Guest post by Daniel Cicala

The Kan Extension Seminar II continues with a discussion of the paper Notions of Lawvere Theory by Stephen Lack and Jirí Rosický.

In his landmark thesis, William Lawvere introduced a method to the study of universal algebra that was vastly more abstract than those previously used. This method actually turns certain mathematical stuff, structure, and properties into a mathematical object! This is achieved with a Lawvere theory: a bijective-on-objects product preserving functor T: 0 opLT \colon \aleph^{\text{op}}_0 \to \mathbf{L} where 0\aleph_0 is a skeleton of the category FinSet\mathbf{FinSet} and L\mathbf{L} is a category with finite products. The analogy between algebraic gadgets and Lawvere theories reads as: stuff, structure, and properties correspond respectively to 1, morphisms, and commuting diagrams.

To get an actual instance, or a model, of an algebraic gadget from a Lawvere theory, we take a product preserving functor m:TSetm \colon \mathbf{T} \to \mathbf{Set}. A model picks out a set m(1)m(1) and nn-ary operations m(f):m(1) nm(1)m(f) \colon m(1)^n \to m(1) for every TT-morphism f:n1f \colon n \to 1.

To read more about classical Lawvere theories, you can read Evangelia Aleiferi’s discussion of Hyland and Power’s paper on the topic.

With this elegant perspective on universal algebra, we do what mathematicians are wont to do: generalize it. However, there is much to consider undertaking such a project. Firstly, what elements of the theory ought to be generalized? Lack and Rosický provide a clear answer to this question. They generalize along the following three tracks:

  • consider a class of limits besides finite products,

  • replace the base category Set\mathbf{Set} with some other suitable category, and

  • enrich everything.

Another important consideration is to determine exactly how far to generalize. Why not just go as far as possible? Here are two reasons. First, there are a number of results in this paper that stand up to further generalization if one doesn’t care about constructibility. A second limiting factor of generalization is that one should ensure that central properties still hold. In Notions of Lawvere Theory, the properties lifted from classical Lawvere theories are

  • the correspondence between Lawvere theories and monads,

  • that algebraic functors have left adjoints, and

  • models form reflective subcategories of certain functor categories.

Before starting the discussion of the paper, I would like to take a moment to thank Alexander, Brendan and Emily for running this seminar. I have truly learned a lot and have enjoyed wonderful conversations with everyone involved.

Replacing finite limits

To find a suitable class of limits to replace finite products, we require the concept of presentability. The best entry point is to learn about local finite presentability, which David Myers has discussed here. With a little modification to the ideas there, we define a notion of local strong finite presentability and local Φ\Phi-presentability for a class of limits Φ\Phi.

We begin with sifted colimits, which are those Set\mathbf{Set}-valued colimits that commute with finite products. Note the familiarity of this definition with the commutativity property of filtered colimits. Of course, filtered colimits are also sifted. Another example is a reflexive pair, that is a category with shape


Anyway, we now look at the strongly finitely presentable objects in a category C\mathbf{C}. These are those objects xx whose representable C(x,):CSet\mathbf{C}(x,-) \colon \mathbf{C} \to \mathbf{Set} preserves sifted colimits. Denote the full subcategory of these by C sfp\mathbf{C}_{\text{sfp}}. Some simple examples include Set sfp\mathbf{Set}_{\text{sfp}}, which consists of the finite sets, and Ab sfp\mathbf{Ab}_{\text{sfp}}, which has the free and finitely generated Abelian groups. Also, given a category C\mathbf{C} of models for a Lawvere theory, C sfp\mathbf{C}_{\text{sfp}} is exactly those finitely presentable objects that are regular projective. A category C\mathbf{C} is locally strongly finitely presentable if it is cocomplete, C sfp\mathbf{C}_{\text{sfp}} is small, and any C\mathbf{C}-object is a sifted colimit of a diagram in C sfp\mathbf{C}_{\text{sfp}}. There is also a nice characterization (Theorem 3.1 in the paper) that states C\mathbf{C} is locally strongly finitely presentable if and only if C sfp\mathbf{C}_{\text{sfp}} has finite coproducts and we can identity C\mathbf{C} with the category of finite product-preserving functors C sfp opSet\mathbf{C}^{\text{op}}_{\text{sfp}} \to \mathbf{Set}. One of the most important results of Notions of Lawvere Theory, was in expanding the theory to encompass sifted (weighted) colimits. More on this later.

We can play this game with any class of limits Φ\Phi. Before defining PhiPhi-presentability, here is a bit of jargon.

Definition. A functor is Φ\Phi-flat if its colimit commutes with Φ\Phi-limits.

We call an object xx of a category C\mathbf{C} Φ\Phi-presentable if C(x,):CSet\mathbf{C}(x,-) \colon \mathbf{C} \to \mathbf{Set} preserves Φ\Phi-flat colimits. Given the full subcategory C Φ\mathbf{C}_\Phi of Φ\Phi-presentable objects, we call C\mathbf{C} locally Φ\Phi-presentable if it is cocomplete, C Φ\mathbf{C}_{\Phi} is small, and any C\mathbf{C}-object is a Φ\Phi-flat colimit of a diagram in C Φ\mathbf{C}_{\Phi}. Fortunately, we retain the characterization of C\mathbf{C} being locally Φ\Phi-presentable if and only if C Φ\mathbf{C}_{\Phi} has Φ\Phi-colimits and C\mathbf{C} is equivalent to the category Φ\Phi-Cts(C Φ op,Set)\mathbf{Cts}(\mathbf{C}^{\text{op}}_\Phi, \mathbf{Set}) of Φ\Phi-continuous functors C Φ opSet\mathbf{C}^{\text{op}}_\Phi \to \mathbf{Set}. Important results in Notions of Lawvere Theory use the assumption of Φ\Phi-presentability.

Let’s come back to Lawvere theories. From this point on, we fix a symmetric monoidal closed category 𝒱\mathcal{V} that is both complete and cocomplete. Also, Φ\Phi will refer to a class of weights over 𝒱\mathcal{V}. Our first task will be to determine what class of limits can replace finite products in the classical case. To this end, we take the following assumption.

Axiom A. Φ\Phi-continuous weights are Φ\Phi-flat.

This axiom is an analogy with how filtered colimits commute with finite limits in Set\mathbf{Set}. But for what classes of limits Φ\Phi does this hold?

To answer this question, we fix a sound doctrine 𝔻\mathbb{D}. Very roughly, a sound doctrine is a collection of small categories whose limits behave nicely with respect to certain colimits. After putting some small assumptions on the underlying category 𝒱 0\mathcal{V}_0 which we’ll sweep under the rug, define 𝒱 𝔻\mathcal{V}_\mathbb{D} to be the full sub 𝒱\mathcal{V}-category consisting of those objects xx such that [x,]:𝒱𝒱[x,-] \colon \mathcal{V} \to \mathcal{V} preserves 𝔻\mathbb{D}-flat colimits. Let Φ\Phi be the class of limits ‘built from’ conical 𝔻\mathbb{D}-limits and 𝒱 𝔻\mathcal{V}_\mathbb{D}-powers in the sense that we take ϕΦ\phi \in \Phi if

  • any 𝒱\mathcal{V}-category with conical 𝔻\mathbb{D} limits and 𝒱 𝔻\mathcal{V}_\mathbb{D}-powers also admits ϕ\phi-weighted limits, and

  • any 𝒱\mathcal{V}-functor conical 𝔻\mathbb{D} limits and 𝒱 𝔻\mathcal{V}_\mathbb{D}-powers also preserves ϕ\phi-weighted limits.

The fancy way of saying this is that Φ\Phi is the saturation class of conical 𝔻\mathbb{D}-limits and 𝒱 𝔻\mathcal{V}_\mathbb{D}-powers. It’s easy enough to see that Φ\Phi contains the conical 𝔻\mathbb{D}-limits and 𝒱\mathcal{V}-powers.

Having constructed a class of limits Φ\Phi from a sound doctrine 𝔻\mathbb{D}, we use the following theorem to imply that Φ\Phi satisfies the axiom above.

Theorem. Let 𝒦\mathcal{K} be a small V\mathbf{V}-category with Φ\Phi-weighted limits and F:𝒦𝒱F \colon \mathcal{K} \to \mathcal{V} be a V\mathbf{V}-functor. The following are equivalent:

  • FF is 𝔻\mathbb{D}-continuous;

  • FF is Φ\Phi-flat

  • FF is Φ\Phi-continuous.

In particular, the first item allows us to construct Φ\Phi using sound limits and the equivalence between the second and third item is precisely the axiom of interest. Here are some examples.

Example. Let 𝔻\mathbb{D} be the collection of all finite categories. We will also take 𝒱 0\mathcal{V}_0 be locally finitely presentable with the additional requirement that the monoidal unit II is finitely presentable as is the tensor product of two finitely presentable objects. Examples of such a 𝒱\mathcal{V} are categories of sets, abelian groups, modules over a commutative ring, chain complexes, categories, groupoids, and simplicial sets. Then Φ\Phi, as constructed from 𝔻\mathbb{D} above gives a good notion of 𝒱\mathcal{V}-enriched finite limits.

Example. A second example, and one of the main contributions of Notions of Lawvere Theory is when 𝔻\mathbb{D} is the class of all finite, discrete categories. Here, we take our 𝒱\mathcal{V} as in the first example, though we do not require the monoidal unit to be strongly finitely presentable. We do this because, by requiring the monoidal unit to be strongly finitely presentable, we lose the example where 𝒱\mathcal{V} is the category of directed graphs, which happens to be a key example, particularly to realizing categories as an model of a Lawvere theory. In this case, the induced class Φ\Phi gives an enriched version of strongly finite limits as discussed above. This Φ\Phi generalizes finite products in the sense that they coincide when 𝒱\mathcal{V} is Set\mathbf{Set}.

Correspondence between Lawvere theories and monads

Now that we’ve gotten our hands on some suitable limits, let’s see how we can obtain the classical correspondence between Lawvere theories and monads. Naturally, we’ll be assuming axiom A. In addition, we fix a 𝒱\mathcal{V}-category 𝒦\mathcal{K} that satisfies the following.

Axiom B1. 𝒦\mathcal{K} is locally Φ\Phi-presentable.

This axiom implies, as in our discussions above, that 𝒦Φ\mathcal{K} \cong \Phi-Cts(𝒦 Φ op,𝒱)\mathbf{Cts}(\mathcal{K}^{\text{op}}_\Phi,\mathcal{V}). This is not particularly restrictive, as presheaf 𝒱\mathcal{V}-categories are locally Φ\Phi-presentable.

Now, define a Lawvere Φ\Phi-theory on 𝒦\mathcal{K} to be a bijective-on-objects V\mathbf{V}-functor g:𝒦 Φ opg\colon \mathcal{K}^{\text{op}}_{\Phi} \to \mathcal{L} that preserves Φ\Phi-limits. A striking difference between a Lawvere Φ\Phi-theory and the classical notion is that the former does not require \mathcal{L} to have the limits under consideration. This makes defining the models of a Lawvere Φ\Phi-theory a subtler issue than in the classical case. Instead of defining a model to be a Φ\Phi-continuous functor as one might expect, we instead use the pullback square


To understand what a model looks like, use the intuition for a pullback in the category Set\mathbf{Set} and the fact that 𝒦\mathcal{K} is equivalent to Φ\Phi-Cts(𝒦 Φ op,𝒱)\mathbf{Cts}(\mathcal{K}^{\text{op}}_\Phi,\mathcal{V}). So a model will be a 𝒱\mathcal{V}-functor 𝒱\mathcal{L} \to \mathcal{V} whose restriction along gg is Φ\Phi-continuous.

The other major player in this section is the category of Φ\Phi-flat monads Mnd Φ(𝒦)\mathbf{Mnd}_{\Phi}(\mathcal{K}). We claim that there is an equivalence between Law Φ(𝒦)\mathbf{Law}_{\Phi}(\mathcal{K}) and Mnd Φ(𝒦)\mathbf{Mnd}_{\Phi}(\mathcal{K}). To verify this, we construct a pair of functors between Law Φ(𝒦)\mathbf{Law}_{\Phi}(\mathcal{K}) and Mnd Φ(𝒦)\mathbf{Mnd}_{\Phi}(\mathcal{K}). The first under consideration: mnd:Law Φ(𝒦)Mnd Φ(𝒦){mnd} \colon \mathbf{Law}_{\Phi}(\mathcal{K}) \to \mathbf{Mnd}_{\Phi}(\mathcal{K}). We define this with the help of the following proposition.

Proposition. The functor uu from the above pullback diagram is monadic via a Φ\Phi-flat monad tt.

Hence, a Φ\Phi-theory \mathcal{L} gives a monadic functor u:Mod()𝒦u \colon \mathbf{Mod}(\mathcal{L}) \to \mathcal{K} that yields a monad tt on 𝒦\mathcal{K}. Moreover, this monad preserves all the limits required to be an object in Mnd Φ(𝒦)\mathbf{Mnd}_\Phi (\mathcal{K}). So, define mnd()=t{mnd} (\mathcal{L}) = t.

Next, we define a functor th:Mnd Φ(𝒦)Law Φ(𝒦){th} \colon \mathbf{Mnd}_{\Phi}(\mathcal{K}) \to \mathbf{Law}_{\Phi}(\mathcal{K}). Consider a monad tt in Mnd Φ(𝒦)\mathbf{Mnd}_{\Phi}(\mathcal{K}). As per usual, tt factors through the Eilenberg-Moore category 𝒦𝒦 t\mathcal{K} \to \mathcal{K}^t which we precompose with the inclusion 𝒦 Φ𝒦\mathcal{K}_\Phi \to \mathcal{K} giving f:𝒦 Φ𝒦 mf \colon \mathcal{K}_\Phi \to \mathcal{K}^m. Now defining a 𝒱\mathcal{V}-category 𝒢\mathcal{G} that has objects from 𝒦 Φ\mathcal{K}_\Phi and 𝒢(x,y)=𝒦 m(fx,fy)\mathcal{G}(x,y) = \mathcal{K}^m(fx,fy), we factor ff


where \ell is bijective-on-objects and rr is full and faithful. This factorization is unique up to unique isomorphism. Define th(t)=𝒢 op\text{th} (t) = \mathcal{G}^{\text{op}}.

At this point, we have functors mnd:Law Φ(𝒦)Mnd Φ(𝒦) and th:Mnd Φ(𝒦)Law Φ(𝒦) \text{mnd} \colon \mathbf{Law}_{\Phi}(\mathcal{K}) \to \mathbf{Mnd}_{\Phi}(\mathcal{K}) \quad \text{ and } \quad \text{th} \colon \mathbf{Mnd}_{\Phi}(\mathcal{K}) \to \mathbf{Law}_{\Phi}(\mathcal{K}) so let’s turn our attention to showing that these are mutual weak inverses. The first step is to show that the category of algebras 𝒦 t\mathcal{K}^t for a given monad tt is the category of models Mod(th(t))\mathbf{Mod}(\text{th} (t)).

Theorem 6.6. The 𝒱\mathcal{V}-functor 𝒦 t(r,):𝒦 t[th(t) op,𝒱],x𝒦 m(r,x) \mathcal{K}^t(r-,-) \colon \mathcal{K}^t \to [\text{th} (t)^{\text{op}},\mathcal{V}], \quad x \mapsto \mathcal{K}^m(r-,x) restricts to an isomorphism of 𝒱\mathcal{V}-categories 𝒦 tMod(th(t))\mathcal{K}^t \cong \mathbf{Mod}(\text{th} (t)).

This theorem gives us that mndthid\text{mnd} \circ \text{th} \cong \text{id}. The next theorem gives us the other direction.

Theorem 6.7. There is an isomorphism thmndid\text{th} \circ \text{mnd} \cong \text{id}.

Let’s sketch the proof. Let g:𝒦 Φ opg \colon \mathcal{K}^{\text{op}}_{\Phi} \to \mathcal{L} be a Lawvere Φ\Phi-theory. If we denote mnd(𝒯)\text{mnd} (\mathcal{T}) by tt, we get thmnd(𝒯)=th(t)=𝒢 op\text{th} \circ \text{mnd} (\mathcal{T}) = \text{th} (t) = \mathcal{G}^{\text{op}} via the factorization


where \ell is bijective-on-objects and rr is fully faithful. It remains to show that 𝒯=𝒢 op\mathcal{T} = \mathcal{G}^{\text{op}}.
Let’s compute the image of an 𝒦 Φ\mathcal{K}_\Phi-object xx in Mod Φ(th(t))\mathbf{Mod}_{\Phi} (\text{th} (t)). For this, recall that we have 𝒦Φ\mathcal{K} \simeq \Phi-Cts(𝒦 Φ op,𝒱)\mathbf{Cts}(\mathcal{K}^{\text{op}}_{\Phi},\mathcal{V}) by assumption. Embedding xx into Φ\Phi-Cats(𝒦 Φ,𝒱)\mathbf{Cats}(\mathcal{K}_{\Phi},\mathcal{V}) gives us 𝒦 Φ(,x):𝒦 Φ op𝒱. \mathcal{K}_\Phi(-,x) \colon \mathcal{K}^{\text{op}}_{\Phi} \to \mathcal{V}. This, in turn, is mapped to the left Kan extension Lan g(𝒦 Φ(,x)):𝒯𝒱 \text{Lan}_g (\mathcal{K}_\Phi(-,x)) \colon \mathcal{T} \to \mathcal{V} along g:𝒦 Φ op𝒯g \colon \mathcal{K}^{\text{op}}_{\Phi} \to \mathcal{T} (the Lawvere Φ\Phi theory we began with). Here, we can compute that Lan g(𝒦 Φ(,x))\text{Lan}_g (\mathcal{K}_\Phi(-,x)) is 𝒯(,gx)\mathcal{T}(-,gx) meaning the factorization above is


Therefore, 𝒯=𝒢 op\mathcal{T} = \mathcal{G}^{\text{op}} as desired.

Many-sorted theories

Moving from single-sorted to many-sorted theories, we will take a different assumption on our 𝒱\mathcal{V}-category 𝒦\mathcal{K}.

Axiom B2. 𝒦\mathcal{K} is a 𝒱\mathcal{V}-category with Φ\Phi-limits and such that the Yoneda inclusion 𝒦[𝒦 op,𝒱]\mathcal{K} \to [\mathcal{K}^{\text{op}} , \mathcal{V}] has a Φ\Phi-continuous left adjoint.

This requirement on 𝒦\mathcal{K} is not overly restrictive as it holds for all presheaf 𝒱\mathcal{V}-categories and all Grothendieck topoi when 𝒱\mathcal{V} is Set\mathbf{Set}. The nice thing about this assumption is that we can compute all colimits and Φ\Phi-limits in 𝒦\mathcal{K} by passing to [𝒦 op,𝒱][\mathcal{K}^{\text{op}} , \mathcal{V}], where they commute, then reflecting back.

Generalizing Lawvere theories here is a bit simpler than in the previous section. Indeed, call any small 𝒱\mathcal{V}-category \mathcal{L} with Φ\Phi-limits a Φ\Phi-theory. Notice that we no longer have a bijective-on-objects functor involved in the definition. That functor forced the single-sortedness. With the functor no longer constraining the structure, we have the possibility for many sorts. Also, a Φ\Phi-theory does have all Φ\Phi-limits here, unlike in the single-sorted case. This allows for a much simpler definition of a model. Indeed, the category of models for a Φ\Phi-theory LL is the full subcategory Φ\Phi-Cts(,𝒦)\mathbf{Cts}(\mathcal{L}, \mathcal{K}) of [,𝒦][\mathcal{L}, \mathcal{K}].

Presently, we are interested in generalizing two important properties of Lawvere theory to Φ\Phi-theories. The first is that algebraic functors have left adjoints. The second is the reflectiveness of models.

Algebraic functors have left adjoints. A morphism of Φ\Phi-theories is a Φ\Phi-continuous 𝒱\mathcal{V}-functor g:g \colon \mathcal{L} \to \mathcal{L}\prime. Any such morphism induces a pullback 𝒱\mathcal{V}-functor g *:Φg^{\ast} \colon \Phi-Cts(,𝒦)Φ\mathbf{Cts}(\mathcal{L}\prime, \mathcal{K}) \to \Phi-Cts(,𝒦)\mathbf{Cts}(\mathcal{L}, \mathcal{K}) between model 𝒱\mathcal{V}-categories. We call such functors Φ\Phi-algebraic. And yes, these do have left adjoints just as in the context of classical Lawvere theories.

Theorem. Let \mathcal{L} and \mathcal{L}\prime be Φ\Phi-theories and g:g \colon \mathcal{L} \to \mathcal{L}\prime a 𝒱\mathcal{V}-functor between them. Given a model m:𝒦m \colon \mathcal{L} \to \mathcal{K}, then Lan gm:𝒦\text{Lan}_g m \colon \mathcal{L}\prime \to \mathcal{K} is a model.

What is happening here? Of course, pulling back by gg gives a way to turn models of \mathcal{L}\prime into models of \mathcal{L} – this is the algebraic functor g *g^{\ast}. But the left Kan extension along gg gives a way to turn a model mm of \mathcal{L} into a model of \mathcal{L}\prime as depicted in the diagram


This theorem says that process gives a functor g *:Φg_{\ast} \colon \Phi-Cts(,𝒦)Φ\mathbf{Cts}(\mathcal{L}, \mathcal{K}) \to \Phi-Cts(,𝒦)\mathbf{Cts}(\mathcal{L}\prime, \mathcal{K}) given by mLan gmm \mapsto \text{Lan}_g m.

We can prove this theorem for 𝒦=𝒱\mathcal{K} = \mathcal{V} without requiring axiom B2. This axiom is used to extend this result to a 𝒱\mathcal{V}-category 𝒦\mathcal{K}. The existence of the left adjoint \ell to the Yoneda embedding yy of 𝒦\mathcal{K} gives a factorization Lan gm=Lan gym\text{Lan}_g m = \ell \text{Lan}_g y m. The proof then reduces to showing that Lan gym\text{Lan}_g y m is Φ\Phi-continuous, since we are already assuming that \ell is. But because the codomain of Lan gym\text{Lan}_g y m is [𝒦 op,𝒱][\mathcal{K}^{\text{op}},\mathcal{V}], we can rest on the fact that we have proven the result for 𝒦=𝒱\mathcal{K} = \mathcal{V}. Limits are taken pointwise, after all. Actually, the left adjoint to g *g^{\ast} holds more generally, but our assumptions on 𝒦\mathcal{K} allow us to explicitly compute the left adjoint with left Kan extensions.

Reflexiveness of models. Having discussed left adjoints of algebraic functors, we now move on to show that categories of models Φ\Phi-Cts(𝒯,𝒱)\mathbf{Cts}(\mathcal{T},\mathcal{V}) are reflexive in [𝒯,𝒱][\mathcal{T}, \mathcal{V}]. Consider the free-forgetful (ordinary) adjunction


between 𝒱\mathcal{V}-categories and those 𝒱\mathcal{V}-categories with Φ\Phi-limits and functors preserving them. Given a 𝒱\mathcal{V}-category \mathcal{L} in the image of UU. Note that \mathcal{L} is a Φ\Phi-theory. It follows from this adjunction that Φ\Phi-Cts(ℱℒ,𝒱)\mathbf{Cts}(\mathcal{FL},\mathcal{V}) is equivalent to the category [,𝒱][\mathcal{L},\mathcal{V}]. Moreover, since \mathcal{L} has Φ\Phi-limits, the inclusion ℱℒ\mathcal{L} \hookrightarrow \mathcal{FL} has a right adjoint RR inducing an algebraic functor R *:ΦR^{\ast} \colon \Phi-Cts(,𝒱)Φ\mathbf{Cts}(\mathcal{L},\mathcal{V}) \to \Phi-Cts(ℱℒ,𝒱)[,𝒱]\mathbf{Cts}(\mathcal{FL},\mathcal{V}) \simeq [\mathcal{L},\mathcal{V}]. But we just showed that algebraic functors have left adjoints, giving us the following theorem.

Theorem. Φ\Phi-Cts(,𝒱)\mathbf{Cts}(\mathcal{L},\mathcal{V}) is reflective in [,𝒱][\mathcal{L},\mathcal{V}].

As promised, in the two general contexts corresponding to the axioms B1 and B2, we have the Lawvere theory-monad correspondence, that algebraic functors have left adjoints, and that categories of models are reflective.

An example

After all of that abstract nonsense, let’s get our feet back on the ground. Here is an example of manifesting a category with a chosen terminal object as a generalized Lawvere theory. This comes courtesy of Nishizawa and Power.

Let 0\mathbf{0} denote the empty category, 1\mathbf{1} the terminal category, and 2\mathbf{2} the category {ab}\{a \to b\} with two objects and a single arrow between them. We will also take 𝒦=𝒱=Cat\mathcal{K} = \mathcal{V} = \mathbf{Cat}. The class of limits Φ\Phi here is the finite Cat\mathbf{Cat}-powers. We define a Lawvere Φ\Phi-theory \mathcal{L} to be the Cat\mathbf{Cat}-category where we formally add to Cat fp op\mathbf{Cat}^{\text{op}}_{\text{fp}} (the opposite full subcategory on the finitely presentable objects) two arrows: τ:01\tau \colon \mathbf{0} \to \mathbf{1} and σ:12\sigma \colon \mathbf{1} \to \mathbf{2}. We then close up under finite Cat\mathbf{Cat}-powers and modulo the commutative diagrams

diagram8 \quad \quad diagram9 \quad \quad diagram10

Now \mathcal{L} is the Lawvere Φ\Phi-theory for a category with a chosen terminal object. A model of \mathcal{L} is a Φ\Phi-continuous Cat\mathbf{Cat}-functor Cat\mathcal{L} \to \mathbf{Cat}. This means that if MM is a model, it must preserve powers and so the following diagrams must commute:

diagram11 \quad \quad diagram12 \quad \quad diagram13

Here dom\text{dom} and cod\text{cod} choose the domain and codomain, and Δ\Delta the diagonal functor. Note that the commutativity of these diagrams witnesses the preservation of raising the terminal category to the first three diagrams given. Let’s parse these diagrams out.

The category we get from the model MM is M1M1 and the distinguished terminal object tt is chosen by MτM\tau. The first two diagrams provide a morphism xtx \to t for every object xx in M1M1. The third diagram gives the identity map on tt. The uniqueness of maps into tt follows from the functorality of MσM\sigma and cod\text{cod}.

Conversely, given a category C\mathbf{C} with a chosen terminal object tt, define a model M:CatM \colon \mathcal{L} \to \mathbf{Cat} by 1C\mathbf{1} \to \mathbf{C} and 1 xC x\mathbf{1}^x \to \mathbf{C}^x for \mathcal{L}-objects xx. Also, let MτM\tau choose tt and let MσM \sigma send each \mathcal{L}-object xx to the !:xt! \colon x \to t.

BackreactionAway Note

I'm in Munich next week, playing with the philosophers. Be good while I'm away, back soon. (Below, the girls, missing a few teeth.)

John BaezPhosphorus Sulfides

I think of sulfur and phosphorus as clever chameleons of the periodic table: both come in many different forms, called allotropes. There’s white phosphorus, red phosphorus, violet phosphorus and black phosphorus:

and there are about two dozen allotropes of sulfur, with a phase diagram like this:

So I should have guessed that sulfur and phosphorus combine to make many different compounds. But I never thought about this until yesterday!

I’m a great fan of diamonds, not for their monetary value but for the math of their crystal structure:

In a diamond the carbon atoms do not form a lattice in the strict mathematical sense (which is more restrictive than the sense of this word in crystallography). The reason is that there aren’t translational symmetries carrying any atom to any other. Instead, there are two lattices of atoms, shown as red and blue in this picture by Greg Egan. Each atom has 4 nearest neighbors arranged at the vertices of a regular tetrahedron; the tetrahedra centered at the blue atoms are ‘right-side up’, while those centered at the red atoms are ‘upside down’.

Having thought about this a lot, I was happy to read about adamantane. It’s a compound with 10 carbons and 16 hydrogens. There are 4 carbons at the vertices of a regular tetrahedron, and 6 along the edges—but the edges bend out in such a way that the carbons form a tiny piece of a diamond crystal:

or more abstractly, focusing on the carbons and their bonds:

Yesterday I learned that phosphorus decasulfide, P4S10, follows the same pattern:

The angles deviate slightly from the value of

\arccos (-1/3) \approx 109.4712^\circ

that we’d have in a fragment of a mathematically ideal diamond crystal, but that’s to be expected.

It turns out there are lots of other phosphorus sulfides! Here are some of them:

Puzzle 1. Why do each of these compounds have exactly 4 phosphorus atoms?

I don’t know the answer! I can’t believe it’s impossible to form phosphorus–sulfur compounds with some other number of phosphorus atoms, but the Wikipedia article containing this chart says

All known molecular phosphorus sulfides contain a tetrahedral array of four phosphorus atoms. P4S2 is also known but is unstable above −30 °C.

All these phosphorus sulfides contain at most 10 sulfur atoms. If we remove one sulfur from phosphorus decasulfide we can get this:

This is the ‘alpha form’ of P4S9. There’s also a beta form, shown in the chart above.

Some of the phosphorus sulfides have pleasing symmetries, like the
alpha form of P4S4:

or the epsilon form of P4S6:

Others look awkward. The alpha form of P4S5 is an ungainly beast:

They all seem to have a few things in common:

• There are 4 phosphorus atoms.

• Each phosphorus atom is connected to 3 or 4 atoms, at most one of which is phosphorus.

• Each sulfur atom is connected to 1 or 2 atoms, which must all be phosphorus.

The pictures seem pretty consistent about showing a ‘double bond’ when a sulfur atom is connected to just 1 phosphorus. However, they don’t show a double bond when a phosphorus atom is connected to just 3 sulfurs.

Puzzle 2. Can you draw molecules obeying the 3 rules listed above that aren’t on the chart?

Of all the phosphorus sulfides, P4S10 is not only the biggest and most symmetrical, it’s also the most widely used. Humans make thousands of tons of the stuff! It’s used for producing organic sulfur compounds.

People also make P4S3: it’s used in strike-anywhere matches. This molecule is not on the chart I showed you, and it also violates one of the rules I made up:

Somewhat confusingly, P4S10 is not only called phosphorus decasulfide: it’s also called phosphorus pentasulfide. Similarly, P4S3 is called phosphorus sesquisulfide. Since the prefix ‘sesqui-’ means ‘one and a half’, there seems to be some kind of division by 2 going on here.

May 05, 2017

Scott AaronsonThis Week’s BS

There are two pieces of BosonSampling-related news that people have asked me about this week.

First, a group in Shanghai, led by Chaoyang Lu and Jianwei Pan, has reported in Nature Photonics that they can do BosonSampling with a coincidence rate that’s higher than in previous experiments by a factor of several thousand.  This, in particular, lets them do BosonSampling with 5 photons.  Now, 5 might not sound like that much, especially since the group in Bristol previously did 6-photon BosonSampling.  But to make their experiment work, the Bristol group needed to start its photons in the initial state |3,3〉: that is, two modes with 3 photons each.  This gives rise to matrices with repeated rows, whose permanents are much easier to calculate than the permanents of arbitrary matrices.  By contrast, the Shangai group starts its photons in the “true BosonSampling initial state” |1,1,1,1,1〉: that is, five modes with 1 photon each.  That’s the kind of initial state we ultimately want.

The second piece of news is that on Monday, a group at Bristol—overlapping with the group we mentioned before—submitted a preprint to the arXiv with the provocative title “No imminent quantum supremacy by boson sampling.”  In this paper, they give numerical evidence that BosonSampling, with n photons and m modes, can be approximately simulated by a classical computer in “merely” about n2n time (that is, the time needed to calculate a single n×n permanent), as opposed to the roughly mn time that one would need if one had to calculate permanents corresponding to all the possible outcomes of the experiment.  As a consequence of that, they argue that achieving quantum supremacy via BosonSampling would probably require at least ~50 photons—which would in turn require a “step change” in technology, as they put it.

I completely agree with the Bristol group’s view of the asymptotics.  In fact, Alex Arkhipov and I ourselves repeatedly told experimentalists, in our papers and talks about BosonSampling (the question came up often…), that the classical complexity of the problem should only be taken to scale like 2n, rather than like mn.  Despite not having a general proof that the problem could actually be solved in ~2n time in the worst case, we said that for two main reasons:

  1. Even under the most optimistic assumptions, our hardness reductions, from Gaussian permanent estimation and so forth, only yielded ~2n hardness, not ~mn hardness.  (Hardness reductions giving us important clues about the real world?  Whuda thunk??)
  2. If our BosonSampling matrix is Haar-random—or otherwise not too skewed to produce outcomes with huge probabilities—then it’s not hard to see that we can do approximate BosonSampling in O(n2n) time classically, by using rejection sampling.

Indeed, Alex and I insisted on these points despite some pushback from experimentalists, who were understandably hoping that they could get to quantum supremacy just by upping m, the number of modes, without needing to do anything heroic with n, the number of photons!  So I’m happy to see that a more careful analysis supports the guess that Alex and I made.

On the other hand, what does this mean for the number of photons needed for “quantum supremacy”: is it 20? 30? 50?  I confess that that sort of question interests me much less, since it all depends on the details of how you define the comparison (are we comparing against ENIAC? a laptop? a server farm? how many cores? etc etc).  As I’ve often said, my real hope with quantum supremacy is to see a quantum advantage that’s so overwhelming—so duh-obvious to the naked eye—that we don’t have to squint or argue about the definitions.

Doug NatelsonWhat is thermoelectricity?

I noticed I'd never written up anything about thermoelectricity, and while the wikipedia entry is rather good, it couldn't hurt to have another take on the concept.   Thermoelectricity is the mutual interaction of the flow of heat and the flow of charge - this includes creating a voltage gradient by applying a temperature gradient (the Seebeck Effect) and driving a heating or cooling thermal flow by pushing an electrical current (the Peltier Effect).  Recently there have been new generalizations, like using a temperature gradient to drive a net accumulation of electronic spin (the spin Seebeck effect).

First, the basic physics.  To grossly oversimplify, all other things being equal, particles tend to diffuse from hot locations to cold locations.  (This is not entirely obvious in generality, at least not to me, from our definitions of temperature or chemical potential, and clearly in some situations there are still research questions about this.  There is certainly a hand-waving argument that hotter particles, be they molecules in a gas or electrons in a solid, tend to have higher kinetic energies, and therefore tend to diffuse more rapidly.  That's basically the argument made here.)

Let's take a bar of a conductor and force there to be a temperature gradient across it.  The mobile charge carriers will tend to diffuse away from the hot end.  Moreover, there will be a net flux of lattice vibrations (phonons) away from the hot end.  Those phonons can also tend to scatter charge carriers - an effect called phonon drag.   For an isolated bar, though, there can't be any net current, so a voltage gradient develops such that the drift current balances out the diffusion tendency.  This is the Seebeck effect, and the Seebeck coefficient is the constant of proportionality between the temperature gradient and the voltage gradient.   If you hook up two materials with different (known) Seebeck coefficients as shown, you make a thermocouple and can use the thermoelectric voltage generated as thermometer.

Ignoring the phonon drag bit, the Seebeck coefficient depends on particular material properties - the sign of the charge carriers (thermoelectric measurements are one way to tell if your system is conducting via electrons or holes, leading to some dramatic effects in quantum dots), and the energy dependence of their conductivity (which has wrapped up in it the band structure of the material and extrinsic factors like the mean free path for scattering off impurities and boundaries).

Because of this dependence on extrinsic factors, it is possible to manipulate the Seebeck coefficient through nanoscale structuring or alteration of materials.  Using boundary scattering as a tuning parameter for the mean free path is enough to let you make thermocouples just by controlling the geometry of a single metal.  This has been pointed out here and here, and in our own group we have seen those effects here.   Hopefully I'll have time to write more on this later....

(By the way, as I write this, Amazon is having some kind of sale on my book, at $19 below publisher list price.  No idea why or how long that will last, but I thought I'd point it out.  I'll delete this text when that expires.)

Jordan EllenbergIntersectionality as nonlinearity

I wonder if the idea of intersectionality would be better-understood in STEMmy circles if we called it “nonlinearity” instead.  Put that way, e.g.

“the condition of being queer and disabled isn’t the sum of the condition of being queer and the condition of being disabled, or even some linear combination of those, it’s just its own thing, which draws input from each of those conditions in some more complicated way and which has features of its own particular to the intersection”

it’s something I think most mathematicians would find extremely natural and intuitive.

May 04, 2017

Chad OrzelThe Inescapable Vagueness of Academic Hiring

Inside Higher Ed ran a piece yesterday from a Ph.D. student pleading for more useful data about job searching:

What we need are professional studies, not just anecdotal advice columns, about how hiring committees separate the frogs from the tadpoles. What was the average publication count of tenure-track hires by discipline? How did two Ph.D. graduates with the same references (a controlled variable) fare on the job market, and why? What percentage of tenure-track hires began with national conference interviews? These are testable unknowns, not divine mysteries.

From the age-old Jobtracks project (ended in 2001, archived here) to those 21st-century methods such as the American Historical Association’s jobs-to-Ph.D.s ratio report, many studies have examined the employment trajectory of Ph.D. students. Few, however, have cross-referenced the arc of tenure-track success with the qualifications of the students on the market. Instead, only two types of applicant data are typically deemed significant enough to gather in these and other job reports: (1) the prestige and affluence of their alma mater, and (2) their age, race or gender.

Thank you so much for the detailed information about all the things in my application that I can’t improve.

I have a lot of sympathy for this. As a profession, academia is awful about even collecting data on critical topics, let alone sharing it with people who might make use of it. This can be maddening when it comes time to apply for jobs, or to seek tenure or promotion. The criteria for hiring and promotion seem to be shrouded in mystery, and you can never seem to get a straight answer to simple questions.

Some of this is, like every irritating thing in modern life, the fault of lawyers. Criteria are vague because giving more specific values opens the door to lawsuits from people who meet one stated criterion, but are manifestly unsuitable for some other reason– someone who has more publications than the stated threshold, say, but 75% of those are in vanity journals. If you try to get a little bit specific, you quickly find yourself having to plug stupid loopholes, with wildly multiplying criteria, until you need a degree in contract law to read the job ads.

(And if you don’t believe there would be lawsuits over that kind of nonsense, well, I’d like to request political asylum in your world. I could tell you stories about idiotic things that people have threatened to sue over. That is, I could, if it wouldn’t be wildly inappropriate for me to do so, so instead I’ll just darkly mutter that there are stories to be told…)

The bigger problem, though, is that most of the stats that could readily be reported are noisy garbage, subject to wild misinterpretation. As nicely demonstrated by the paragraphs where the author tries to highlight real data of the type he’d like to see:

However, secondary data analysis of other studies that have been generously made public can reveal clues that the job reports don’t care to. For example, a 2016 study that measured the publications and impact of STEM Ph.D. students happened to simultaneously measure the average number of their publications while in grad school, cross-referenced to their later hireability. The average number of publications for each future tenure-track hire was 0.5, a surprisingly low number that would likely be higher in humanities disciplines.

Another study from 2014 measured similar survey data with similar results, but it added that publishing rates among graduate students have been steadily increasing over time, while hiring rates have been steadily decreasing. That study placed the average number of publications around 0.8. It is clear that standards are shifting, but how much? And how do those standards vary by field?

Those numbers both registered as wildly implausible to me, so I looked up the papers, and they’re both misrepresented here. The first study, showing half a publication per tenure-track hire is restricted to Portugal, which the authors describe as “a developing higher education system” that is “still characterized by poorly qualified academics– only 68% of academics in public universities hold a Ph.D.” The number from the second study is a flat-out misquote– the 0.8 publications per student number is for publications by students who did not publish with their advisor as a co-author. Those who did co-author papers with their advisor, as is common practice in STEM fields, had 3.99 articles pre-Ph.D.; averaging them all together you get 1.88 publications before the doctorate for all the students considered in the paper, more than twice the figure quoted.

Even if you crunch the numbers correctly, though, these statistics are not all that meaningful, because acceptable publication rates vary wildly between subfields, even within a single discipline. I had a total of 5 publications before I finished my Ph.D. (though one of those was a theory paper to which my only contribution was delivering a floppy disk full of experimental data to the theorists upstairs), which is pretty respectable for an experimentalist in my field. I came into an established lab, though, and just picked up on an experiment that was already working. I know of grad students who were involved in lab construction projects who ended up with only one or two publications, but who were well-regarded scientists in the field because of the scale of the work they did building up an experiment that would go on to churn out lots of papers. In a very different field, on the other hand, a former student who went into theory had eight papers pre-Ph.D., which is more in line with expectations in his field.

And the noisiness of the data only gets worse. People working in fields like nuclear and particle physics, where papers are published by large collaborations, can end up with their names on dozens of papers, because everybody in the collaboration gets listed on every paper from that collaboration. (I don’t think grad students are generally included as full members in this sense, though junior faculty often are.) You can, of course, attempt to divide things up by subfield, but as always, the more finely you subdivide the pool, the less reliable the data gets, from simple statistics.

I’m not going to say there aren’t hiring committees doing simple paper-counting as an early weed-out step, but it’s not remotely a good metric for making final decisions.

In the end, the question comes back to a point raised earlier in the piece, namely what factors the candidate can control. Hireability is not a matter of hitting some threshold number of publications, it’s about making the most of the opportunities available to you, and being able to make a persuasive case that you have done so, and will be able to do so in the future, as a faculty member. What counts as “enough” publications to get a degree or go on the job market isn’t a bright-line rule, it’s a judgement call, and making a reasonable decision about when you have “enough” is part of the transition from student to scholar.

The factors a job applicant can control are making that central decision– what constitutes “enough,” in consultation with the advisor(s)– and making the case that the decision was a good one. When we’re doing a search and trying to separate “frogs from tadpoles,” we’re not just counting papers, we’re reading statements, and (speaking for myself) a clear and convincing research statement putting past work in context and showing how it carries through to future work carries a lot more weight than an extra paper or two on the CV.

That is, I realize, maddeningly vague. I’ve been through the process, and been maddened by it. But as comforting as it might be to have more stats about the process, that comfort would be an illusion, because most of the readily available measures are junk.

(And, of course, everything comes back to the central problem, namely the vast gulf between the number of qualified candidates and the number of available jobs…)

May 02, 2017

John PreskillModern Physics Education?

Being the physics department executive officer (on top of being a quantum physicist) makes me think a lot about our physics college program. It is exciting. We start with mechanics, and then go to electromagnetism (E&M) and relativity, then to quantum and statistical mechanics, and then to advanced mathematical methods, analytical mechanics and more E&M. The dessert is usually field theory, astrophysics and advanced lab. You can take some advanced courses, introducing condensed matter, quantum computation, particle theory, AMO, general relativity, nuclear physics, etc. By the time we are done with college, we definitely feel like we know a lot.

But in the end of all that, what do we know about modern physics? Certainly we all took a class called ‘modern physics’. Or should I say ‘”modern” physics’? Because, I’m guessing, the modern physics class heavily featured the Stern-Gerlach experiment (1922) and mentions of De-Broglie, Bohr, and Dirac quite often. Don’t get me wrong: great physics, and essential. But modern?

So what would be modern physics? What should we teach that does not predate 1960? By far the biggest development in my neck of the woods is easy access to computing power. Even I can run simulations for a Schroedinger equation (SE) with hundreds of sites and constantly driven. Even I can diagonalize a gigantic matrix that corresponds to a Mott-Hubbard model of 15 or maybe even 20 particles. What’s more, new approximate algorithms capture the many-body quantum dynamics, and ground states of chains with 100s of sites. These are the DMRG (density matrix renormalization group) and MPS (matrix product states) (see for a review of DMRG, and for a review of MPS, both by the inspiring Uli Schollwoeck).

Should we teach that? Isn’t it complicated? Yes and no. Respectively – not simultaneously. We should absolutely teach it. And no – it is really not complicated. That’s the point – it is simpler than Schroedinger’s equation! How do we teach it? I am not sure yet, but certainly there is a junior level time slot for computational quantum mechanics somewhere.

What else? Once we think about it, the flood gates open. Condensed matter just gave us a whole new paradigm for semi-conductors: topological insulators. Definitely need to teach that – and it is pure 21st century! Tough? Not at all, just solving SE on a lattice. Not tough? Well, maybe not trivial, but is it any tougher than finding the orbitals of Hydrogen? (at the risk of giving you nightmares, remember Laguerre polynomials? Oh – right – you won’t get any nightmares, because, most likely, you don’t remember!)

With that let me take a shot at the standard way that quantum mechanics is taught. Roughly a quantum class goes like this: wave-matter duality; SE; free particle; box; harmonic oscillator, spin, angular momentum, hydrogen atom. This is a good program for atomic physics, and possibly field theory. But by and large, this is the quantum mechanics of vacuum. What about quantum mechanics of matter? Is Feynman path integral really more important than electron waves in solids? All physics is beautiful. But can’t Feynman wait while we teach tight binding models?

And I’ll stop here, before I get started on hand-on labs, as well as the fragmented nature of our programs.

Question to you all out there: Suppose we go and modernize (no quotes) our physics program. What should we add? What should we take away? And we all agree – all physics is Beautiful! I’m sure I have my blind spots, so please comment!

John BaezDiamondoids

I have a new favorite molecule: adamantane. As you probably know, someone is said to be ‘adamant’ if they are unshakeable, immovable, inflexible, unwavering, uncompromising, resolute, resolved, determined, firm, rigid, or steadfast. But ‘adamant’ is also a legendary mineral, and the etymology is the same as that for ‘diamond’.

The molecule adamantane, shown above, features 10 carbon atoms arranged just like a small portion of a diamond crystal! It’s a bit easier to see this if you ignore the 16 hydrogen atoms and focus on the carbon atoms and bonds between those:

It’s a somewhat strange shape.

Puzzle 1. Give a clear, elegant description of this shape.

Puzzle 2. What is its symmetry group? This is really two questions: I’m asking about the symmetry group of this shape as an abstract graph, but also the symmetry group of this graph as embedded in 3d Euclidean space, counting both rotations and reflections.

Puzzle 3. How many ‘kinds’ of carbon atoms does adamantane have? In other words, when we let the symmetry group of this graph act on the set of vertices, how many orbits are there? (Again this is really two questions, depending on which symmetry group we use.)

Puzzle 4. How many kinds of bonds between carbon atoms does adamantane have? In other words, when we let the symmetry group of this graph act on the set of edges, how many orbits are there? (Again, this is really two questions.)

You can see the relation between adamantane and a diamond if you look carefully at a diamond crystal, as shown in this image by H. K. D. H. Bhadeshia:

or this one by Greg Egan:

Even with these pictures at hand, I find it a bit tough to see the adamantane pattern lurking in the diamond! Look again:

Adamantane has an interesting history. The possibility of its existence was first suggested by a chemist named Decker at a conference in 1924. Decker called this molecule ‘decaterpene’, and registered surprise that nobody had made it yet. After some failed attempts, it was first synthesized by the Croatian-Swiss chemist Vladimir Prelog in 1941. He later won the Nobel prize for his work on stereochemistry.

However, long before it was synthesized, adamantane was isolated from petroleum by the Czech chemists Landa, Machacek and Mzourek! They did it in 1932. They only managed to make a few milligrams of the stuff, but we now know that petroleum naturally contains between .0001% and 0.03% adamantane!

Adamantane can be crystallized:

but ironically, the crystals are rather soft. It’s all that hydrogen. It’s also amusing that adamantane has an odor: supposedly it smells like camphor!

Adamantane is just the simplest of the molecules called diamondoids.
These are a few:

1 is adamantane.

2 is called diamantane.

3 is called triamantane.

4 is called isotetramantane, and it comes in two mirror-image forms.

Here are some better pictures of diamantane:

People have done lots of chemical reactions with diamondoids. Here are some things they’ve done with the next one, pentamantane:

Many different diamondoids occur naturally in petroleum. Though the carbon in diamonds is not biological in origin, the carbon in diamondoids found in petroleum is. This was shown by studying ratios of carbon isotopes.

Eric Drexler has proposed using diamondoids for nanotechnology, but he’s talking about larger molecules than those shown here.

For more fun along these lines, try:

Diamonds and triamonds, Azimuth, 11 April 2016.

April 30, 2017

Noncommutative GeometryGreat Thanks!

Let me express my heartfelt thanks to the organizers of the Shanghai 2017 NCG-event as well as to each participant. Your presence and talks were the most wonderful gift showing so many lively facets of mathematics and physics displaying the vitality of NCG! The whole three weeks were a pleasure due to the wonderful hospitality of our hosts, Xiaoman Chen, Guoliang Yu and Yi-Jun Yao. It is