Planet Musings

March 04, 2015

Doug Natelson: March Meeting, days 1 and 2

I am sufficiently buried in work that it's been difficult to come up with my annual March Meeting blog reports.  Here is a very brief list of some cool things I've seen:

  • Jen Dionne from Stanford showed a very neat combination of tomography and cathodoluminescence, using a TEM with tilt capability to map out the plasmon modes of individual asymmetric "nanocup" particles (polystyrene core, gold off-center shell).
  • Shilei Zhang presented what looks to me like a very clever idea, a "magnetic abacus" memory, which uses the spin Hall effect for readout and spin transfer torque to flip bits.
  • I've seen a couple of talks about using interesting planar structures for optical purposes.  Harry Atwater spoke about using plasmons in graphene to make tunable resonant elements for, e.g., photodetection and modified emissivity (tuning black body radiation!).  My former Bell Labs department head Federico Capasso spoke about using designer dielectric resonator arrays to make "metasurface" optical elements (basically optical phased arrays) to do wild things like achromatic beam steering.
  • Chris Adami had possibly the most ambitious title, "The Evolutionary Path Toward Sentient Robots".  Spoiler:  we are far from having to worry about this.
  • Michael Coey spoke about magnetism at interfaces, including a weird result in CeO2 nanoparticles that appears to have its origins in giant orbital paramagnetism.
  • There was a neat talk by Ricardo Ruiz from HGST about the amazing nanofabrication required for future hard disk storage.  Patterned media (with 10 nm half-pitch of individual magnetic islands) looks like it's on the way.
There were a number of other very nice talks.  It's pretty clear that I could have spent the whole meeting so far at "beyond graphene" 2d materials talks and things to do with spin-orbit coupling.  A bit more tomorrow.

n-Category Café: Lebesgue's Universal Covering Problem

Lebesgue’s universal covering problem is famously difficult, and a century old. So I’m happy to report some progress:

• John Baez, Karine Bagdasaryan and Philip Gibbs, Lebesgue’s universal covering problem.

But we’d like you to check our work! It will help if you’re good at programming. As far as the math goes, it’s just high-school geometry… carried to a fanatical level of intensity.

Here’s the story:

A subset of the plane has diameter 1 if the distance between any two points in this set is \le 1. You know what a circle of diameter 1 looks like. But an equilateral triangle with edges of length 1 also has diameter 1:

After all, two points in this triangle are farthest apart when they’re at two corners.

Note that this triangle doesn’t fit inside a circle of diameter 1:
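
This is quick to check numerically (a little sketch of my own, not from the paper): the corners of a unit equilateral triangle realize its diameter, and its circumradius 1/√3 exceeds 1/2, so no circle of diameter 1 can contain it.

```python
import math

# Unit equilateral triangle: its corners are pairwise at distance 1.
corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

def diameter(points):
    """Largest pairwise distance in a finite point set."""
    return max(math.dist(p, q) for p in points for q in points)

print(diameter(corners))  # ~1.0, so the triangle has diameter 1

# The smallest enclosing circle of the triangle is its circumcircle,
# with radius 1/sqrt(3) ~ 0.577 > 1/2: too big for a diameter-1 circle.
circumradius = 1 / math.sqrt(3)
print(circumradius > 0.5)  # True
```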

There are lots of sets of diameter 1, so it’s interesting to look for a set that can contain them all.

In 1914, the famous mathematician Henri Lebesgue sent a letter to a pal named Pál. And in this letter he challenged Pál to find the convex set with smallest possible area such that every set of diameter 1 fits inside.

More precisely, he defined a universal covering to be a convex subset of the plane that can cover a translated, reflected and/or rotated version of every subset of the plane with diameter 1. And his challenge was to find the universal covering with the least area.

Pál worked on this problem, and 6 years later he published a paper on it. He found a very nice universal covering: a regular hexagon in which one can inscribe a circle of diameter 1. This has area

\frac{\sqrt{3}}{2} = 0.86602540 \dots

But he also found a universal covering with less area, by removing two triangles from this hexagon—for example, the triangles C1C2C3 and E1E2E3 here:

Our paper explains why you can remove these triangles, assuming the hexagon was a universal covering in the first place. The resulting universal covering has area

2 - \frac{2}{\sqrt{3}} = 0.84529946 \dots
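
Both areas are easy to verify numerically (my own check, not part of the paper): a regular hexagon with inradius r has area 2√3 r², and here the inscribed circle has diameter 1, so r = 1/2.

```python
import math

# Regular hexagon circumscribing a circle of diameter 1 (inradius 1/2):
# area = 2 * sqrt(3) * r^2 with r = 1/2, i.e. sqrt(3)/2.
hexagon_area = 2 * math.sqrt(3) * 0.5 ** 2
print(f"{hexagon_area:.8f}")  # 0.86602540

# Removing Pal's two corner triangles leaves area 2 - 2/sqrt(3).
reduced_area = 2 - 2 / math.sqrt(3)
print(f"{reduced_area:.8f}")  # 0.84529946
```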

In 1936, Sprague went on to prove that more area could be removed from another corner of Pál’s original hexagon, giving a universal covering of area

0.8441377708435 \dots

In 1992, Hansen took these reductions even further by removing two more pieces from Pál’s hexagon. Each piece is a thin sliver bounded by two straight lines and an arc. The first piece is tiny. The second is downright microscopic!

Hansen claimed the areas of these regions were 4 \cdot 10^{-11} and 6 \cdot 10^{-18}. However, our paper redoes his calculation and shows that the second number is seriously wrong. The actual areas are 3.7507 \cdot 10^{-11} and 8.4460 \cdot 10^{-21}.

Philip Gibbs has created a Java applet illustrating Hansen’s universal cover. I urge you to take a look! You can zoom in and see the regions he removed:

• Philip Gibbs, Lebesgue’s universal covering problem.

I find that my laptop, a Windows machine, makes it hard to view Java applets because they’re a security risk. I promise this one is safe! To be able to view it, I had to go to the “Search programs and files” window, find the “Configure Java” program, go to “Security”, and add

to the “Exception Site List”. It’s easy once you know what to do.

And it’s worth it, because only the ability to zoom lets you get a sense of the puny slivers that Hansen removed! One is the region XE2T here, and the other is T′C3V:

You can use this picture to help you find these regions in Philip Gibbs’ applet. But this picture is not to scale! In fact the smaller region, T′C3V, has length 3.7 \cdot 10^{-7} and maximum width 1.4 \cdot 10^{-14}, tapering down to a very sharp point.

That’s only a few atoms wide if you draw the whole hexagon on paper! And it’s about 30 million times longer than it is wide. This is the sort of thing you can only draw with the help of a computer.

Anyway, Hansen’s best universal covering had an area of

0.844137708416 \dots

This tiny improvement over Sprague’s work led Klee and Wagon to write:

it does seem safe to guess that progress on [this problem], which has been painfully slow in the past, may be even more painfully slow in the future.

However, our new universal covering removes about a million times more area than Hansen’s larger region: a whopping 2.233 \cdot 10^{-5}. So, we get a universal covering with area

0.844115376859 \dots

The key is to slightly rotate the dodecagon shown in the above pictures, and then use the ideas of Pál and Sprague.

There’s a lot of room between our number and the best lower bound on this problem, due to Brass and Sharifi:

0.832

So, one way or another, we can expect a lot of progress now that computers are being brought to bear. Philip Gibbs has a heuristic computer calculation pointing toward a value of

0.84408

so perhaps that’s what we should shoot for.

Read our paper for the details! If you want to check our work, we’ll be glad to answer lots of detailed questions. We want to rotate the dodecagon by an amount that minimizes the area of the universal covering we get, so we use a program to compute the area for many choices of rotation angle:

• Philip Gibbs, Java program.

The program is not very long—please study it or write your own, in your own favorite language! The output is here:

• Philip Gibbs, Java program output.

and as explained at the end of our paper, the best rotation angle is about 1.3^\circ.
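
The real area computation lives in Gibbs's Java program; the optimization over rotation angles is just a one-dimensional scan. Here is a minimal sketch of that scan in Python, where `covering_area` is a hypothetical stand-in of my own for the actual geometric computation:

```python
def covering_area(theta_deg):
    """Hypothetical stand-in for the area computation in Philip
    Gibbs's Java program: a smooth function with a minimum near
    1.3 degrees, for illustration only."""
    return 0.844115376859 + 1e-6 * (theta_deg - 1.3) ** 2

# Grid-scan rotation angles from 0 to 3 degrees and keep the best one.
angles = [i * 0.01 for i in range(0, 301)]
best = min(angles, key=covering_area)
print(round(best, 2))  # 1.3
```

A finer scan (or a golden-section search) around the best grid point would pin the angle down further, which is essentially what the program's output shows.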

March 03, 2015

John Baez: Visual Insight

I have another blog, called Visual Insight. Over here, our focus is on applying science to help save the planet. Over there, I try to make the beauty of pure mathematics visible to the naked eye.

I’m always looking for great images, so if you know about one, please tell me about it! If not, you may still enjoy taking a look.

Here are three of my favorite images from that blog, and a bit about the people who created them.

I suspect that these images, and many more on Visual Insight, are all just different glimpses of the same big structure. I have a rough idea what that structure is. Sometimes I dream of a computer program that would let you tour the whole thing. Unfortunately, a lot of it lives in more than 3 dimensions.

Less ambitiously, I sometimes dream of teaming up with lots of mathematicians and creating a gorgeous coffee-table book about this stuff.


Schmidt arrangement of the Eisenstein integers


Schmidt Arrangement of the Eisenstein Integers - Katherine Stange

This picture drawn by Katherine Stange shows what happens when we apply fractional linear transformations

z \mapsto \frac{a z + b}{c z + d}

to the real line sitting in the complex plane, where a,b,c,d are Eisenstein integers: that is, complex numbers of the form

m + n \, \frac{-1 + \sqrt{-3}}{2}

where m,n are integers. The result is a complicated set of circles and lines called the ‘Schmidt arrangement’ of the Eisenstein integers. For more details go here.
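
To make the construction concrete, here is a small sketch of my own (not Stange's code): a fractional linear transformation with Eisenstein-integer coefficients sends the real line to a circle, which we can recover from three sample points. The particular coefficients are a hypothetical choice for illustration.

```python
import cmath

# omega generates the Eisenstein integers m + n*omega.
omega = (-1 + cmath.sqrt(-3)) / 2

def mobius(a, b, c, d, z):
    """Fractional linear transformation z -> (a z + b) / (c z + d)."""
    return (a * z + b) / (c * z + d)

def circle_through(z1, z2, z3):
    """Center and radius of the circle through three non-collinear
    complex points."""
    w = (z3 - z1) / (z2 - z1)
    center = (z2 - z1) * (w - abs(w) ** 2) / (2j * w.imag) + z1
    return center, abs(z1 - center)

# Hypothetical Eisenstein-integer coefficients with a*d - b*c != 0,
# chosen so the pole -d/c lies off the real axis.
a, b, c, d = 1, 0, 1, omega

# The image of the real line is then a circle: fit it through three
# sample points, then check that other real points land on it too.
center, radius = circle_through(*(mobius(a, b, c, d, t) for t in (1.0, 2.0, -1.0)))
for t in (0.0, 0.5, 3.0, -2.5, 10.0):
    assert abs(abs(mobius(a, b, c, d, t) - center) - radius) < 1e-9
print(center, radius)
```

The Schmidt arrangement is what you get by overlaying the image circles for all such coefficient choices.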

Katherine Stange did her Ph.D. with Joseph H. Silverman, an expert on elliptic curves at Brown University. Now she is an assistant professor at the University of Colorado, Boulder. She works on arithmetic geometry, elliptic curves, algebraic and integer sequences, cryptography, arithmetic dynamics, Apollonian circle packings, and game theory.


{7,3,3} honeycomb

This is the {7,3,3} honeycomb as drawn by Danny Calegari. The {7,3,3} honeycomb is built of regular heptagons in 3-dimensional hyperbolic space. It’s made of infinite sheets of regular heptagons in which 3 heptagons meet at each vertex. 3 such sheets meet at each edge of each heptagon, explaining the second ‘3’ in the symbol {7,3,3}.

The 3-dimensional regions bounded by these sheets are unbounded: they go off to infinity. They show up as holes here. In this image, hyperbolic space has been compressed down to an open ball using the so-called Poincaré ball model. For more details, go here.

Danny Calegari did his Ph.D. work with Andrew Casson and William Thurston on foliations of three-dimensional manifolds. Now he’s a professor at the University of Chicago, and he works on these and related topics, especially geometric group theory.


{7,3,3} honeycomb meets the plane at infinity

This picture, by Roice Nelson, is another view of the {7,3,3} honeycomb. It shows the ‘boundary’ of this honeycomb—that is, the set of points on the surface of the Poincaré ball that are limits of points in the {7,3,3} honeycomb.

Roice Nelson used stereographic projection to draw part of the surface of the Poincaré ball as a plane. The circles here are holes, not contained in the boundary of the {7,3,3} honeycomb. There are infinitely many holes, and the actual boundary, the region left over, is a fractal with area zero. The white region on the outside of the picture is yet another hole. For more details, and a different version of this picture, go here.

Roice Nelson is a software developer for a flight data analysis company. There’s a good chance the data recorded on the airplane from your last flight moved through one of his systems! He enjoys motorcycling and recreational mathematics, he has a blog with lots of articles about geometry, and he makes plastic models of interesting geometrical objects using a 3d printer.

Chad Orzel: Announcing the Schrödinger Sessions: Science for Science Fiction

A few years back, I became aware of Mike Brotherton’s Launch Pad Astronomy Workshop, and said “somebody should do this for quantum physics.” At the time, I wasn’t in a position to do that, but in the interim, the APS Outreach program launched the Public Outreach and Informing the Public Grant program, providing smallish grants for new public outreach efforts. So, because I apparently don’t have enough on my plate as it is, I floated the idea with Steve Rolston at Maryland (my immediate supervisor when I was a grad student), who liked it, and we put together a proposal with their Director of Outreach, Emily Edwards. We didn’t get funded last year, but the problems were easily fixed, and this year’s proposal was funded. Woo-hoo!

So, we’re very pleased to announce that this summer we’ll be holding “The Schrödinger Sessions: Science for Science Fiction,” a workshop at the Joint Quantum Institute (a combined initiative of the University of Maryland, College Park and NIST in Gaithersburg) to provide a three-day “crash course” in quantum physics for science fiction writers. The workshop will run from Thursday, July 30 through Saturday, August 1, 2015, on the Maryland campus in College Park, with housing, breakfast, and lunch included. There’s a fake schedule up on that web page, which we’ll fill in once we get JQI scientists signed up, but it gives the basic idea: three days of lectures and discussions with scientists, and visits to JQI’s labs.

The web page is a little sketchy, because we were using a pre-existing template to speed things up, but that’s why I have a blog: to provide much more information. Which we might as well do in semi-traditional Q&A format:

This sounds cool, but what does this have to do with public outreach? The idea is to bring in science fiction writers, and show them some of the latest and greatest in quantum physics, with the goal of inspiring and informing new stories using quantum ideas and quantum technology.

We know that science fiction stories reach and inspire their audience to learn more about science, and even make careers in science– things like this astronaut’s tribute to Leonard Nimoy are a dramatic reminder of the inspirational effect of science fiction. Our hope is that the writers who come to the workshop will learn new and amazing things to include in their fiction, and through that work, they’ll reach a wider audience than we could hope to bring in person to JQI.

But why quantum physics? Well, because we think quantum physics is awesome. And because quantum physics is essential for all sorts of modern technology– you can’t have computers without Schrödinger cats, after all. And most of all because the sort of things they study at JQI– quantum information, quantum teleportation, quantum computing– could have a revolutionary impact on the technology of the future.

Isn’t quantum too small and weird to make good stories, though? Hardly. Quantum physics has figured prominently in stories like Robert Charles Wilson’s “Divided by Infinity”, and Ted Chiang’s “Story of Your Life” (SPOILERS), and Hannu Rajaniemi just completed the trilogy that starts with The Quantum Thief, which you can tell from the title is full of quantum ideas.

The weirdness of quantum physics is a bit off-putting, but then that’s the point of the workshop: to bring in writers to learn more about quantum physics, and see how it works in practice. The hope is that this will make writers who come to the workshop more comfortable with the subject, and thus more likely to write stories with a quantum component.

OK, but why Maryland? Well, because the Joint Quantum Institute is one of the world’s leading centers for research in quantum mechanics and its applications. Just check out their collection of news stories about JQI research to get a sense of the range and impact of their work. If you want to see quantum physicists at work, it’s one of the very best places in the world to go.

Yeah, but isn’t the weather awful hot in July and August? Look, you can’t have everything, OK?

OK, let’s get to practical stuff. When you say “writers,” you mean people who do short stories and novels? No, we’re defining “writer” as broadly as we can. We’d love to have people who write for television, or movies, or video games, or online media. Really, anybody who makes up stories about stuff that hasn’t really happened is welcome, regardless of the medium in which that work appears.

How many of these writers are you looking for? The budget in the proposal called for 15, though that depends a bit on how much money we need for food and housing; if more people than we expected are willing to share rooms, we might be able to take one or two more.

So there’s going to be an application process? Yes. I mean, we’d love to have a huge number of people, but we have logistical constraints to deal with. We’ll take applications online starting later this week (my other major task for today is to put together the application web form), continuing for a couple of weeks, and hope to make decisions around April 1, so attendees will have plenty of time to make travel plans.

Speaking of travel, what’s included in this package? We plan to provide housing for attendees in the dorms on Maryland’s campus, and breakfast, lunch, and coffee/snack breaks will be included. We left dinners open, in case people want to explore the DC area a little (great restaurants there, that’s one of the things I miss from grad school…), but might look at doing one group dinner with a fun talk of some sort. The schedule is still being sorted out.

There is a possibility that a limited amount of funding might be available for travel support, but again, it depends on a bunch of other factors that affect the overall budget.

And what will the selection criteria be? Well, the ultimate goal of the workshop is public outreach, so we’ll be trying to invite participants whose work will be able to reach as broad an audience as possible. That means we’ll be looking for a mix of established and up-and-coming writers, and as much diversity as we can manage in terms of audience, subgenre, media, etc. I can’t really be any more specific than that, though.

What if I’m busy on those days, or just can’t afford it this year? Will this happen again? Can’t you at least let us get through one of these before asking that?

If it goes well, we’d certainly be open to that possibility, but it’ll depend on a lot of factors, mostly involving money, but also level of interest, success of the workshop, etc.


And that is the big news I’ve been sitting on for a while now. I’m pretty excited about this, and hope it will be a great program. If you know anybody who might be interested in this, please point them in our direction.

Georg von Hippel: QNP 2015, Day One

Hello from Valparaíso, where I continue this year's hectic conference circuit at the 7th International Conference on Quarks and Nuclear Physics (QNP 2015). Except for some minor inconveniences and misunderstandings, the long trip to Valparaíso (via Madrid and Santiago de Chile) went quite smoothly, and so far, I have found Chile a country of bright sunlight and extraordinarily helpful and friendly people.

The first speaker of the conference was Emanuele Nocera, who reviewed nucleon and nuclear parton distributions. The study of parton distributions becomes necessary because hadrons are really composed not simply of valence quarks, as the quark model would have it, but of an indefinite number of (sea) quarks, antiquarks and gluons, any of which can contribute to the overall momentum and spin of the hadron. In an operator product expansion framework, hadronic scattering amplitudes can then be factorised into Wilson coefficients containing short-distance (perturbative) physics and parton distribution functions containing long-distance (non-perturbative) physics. The evolution of the parton distribution functions (PDFs) with the momentum scale is given by the DGLAP equations containing the perturbatively accessible splitting functions. The PDFs are subject to a number of theoretical constraints, of which the sum rules for the total hadronic momentum and valence quark content are the most prominent. For nuclei, one can assume that a similar factorisation as for hadrons still holds, and that the nuclear PDFs are linear combinations of nucleon PDFs modified by multiplication with a binding factor; however, nuclei exhibit correlations between nucleons, which are not well-described in such an approach. Combining all available data from different sources, global fits to PDFs can be performed using either a standard χ2 fit with a suitable model, or a neural network description. There are far more and better data on nucleon than nuclear PDFs, and for nucleons the amount and quality of the data also differs between unpolarised and polarised PDFs, which are needed to elucidate the "proton spin puzzle".

Next was the first lattice talk of the meeting, given by Huey-Wen Lin, who gave a review of the progress in lattice studies of nucleon structure. I think Huey-Wen gave a very nice example by comparing the computational and algorithmic progress with that in videogames (I'm not an expert there, but I think the examples shown were screenshots of Nethack versus some modern first-person shooter), and went on to explain the importance of controlling all systematic errors, in particular excited-state effects, before reviewing recent results on the tensor, scalar and axial charges and the electromagnetic form factors of the nucleon. As an outlook towards the current frontier, she presented the inclusion of disconnected diagrams and a new idea of obtaining PDFs from the lattice more directly rather than through their moments.

The next speaker was Robert D. McKeown with a review of JLab's Nuclear Science Programme. The CEBAF accelerator has been upgraded to 12 GeV, and a number of experiments (GlueX to search for gluonic excitations, MOLLER to study parity violation in Møller scattering, and SoLID to study SIDIS and PVDIS) are ready to be launched. A number of the planned experiments will be active in areas that I know are also under investigation by experimental colleagues in Mainz, such as a search for the "dark photon" and a study of the running of the Weinberg angle. Longer-term plans at JLab include the design of an electron-ion collider.

After a rather nice lunch, Tomofumi Nagae spoke about the hadron physics programme at J-PARC. In spite of major setbacks by the big earthquake and a later radiation accident, progress is being made. A search for the Θ+ pentaquark did not find a signal (which I personally do not find surprising, since the whole pentaquark episode is probably of more immediate long-term interest to historians and sociologists of science than to particle physicists), but could not completely exclude all of the discovery claims.

This was followed by a talk by Jonathan Miller of the MINERνA collaboration presenting their programme of probing nuclei with neutrinos. Major complications include the limited knowledge of the incoming neutrino flux and the fact that final-state interactions on the nuclear side may lead to one process mimicking another one, making the modelling in event generators a key ingredient of understanding the data.

Next was a talk about short-range correlations in nuclei by Or Hen. Nucleons subject to short-range correlations must have high relative momenta, but a low center-of-mass momentum. The experimental studies are based on kicking a proton out of a nucleus with an electron, such that both the momentum transfer (from the incoming and outgoing electron) and the final momentum of the proton are known, and looking for a nucleon with a momentum close to minus the difference between those two (which must be the initial momentum of the knocked-out proton) coming out. The astonishing result is that at high momenta, neutron-proton pairs dominate (meaning that protons, being the minority, have a much larger chance of having high momenta) and are linked by a tensor force. Similar results are known from other two-component Fermi systems, such as ultracold atomic gases (which are of course many, many orders of magnitude less dense than nuclei).

After the coffee break, Heinz Clement spoke about dibaryons, specifically about the recently discovered d*(2380) resonance, which, taking all experimental results into account, may be interpreted as a ΔΔ bound state.

The last talk of the day was by André Walker-Loud, who reviewed the study of nucleon-nucleon interactions and nuclear structure on the lattice, starting with a very nice review of the motivations behind such studies, namely the facts that big-bang nucleosynthesis is very strongly dependent on the deuterium binding energy and the proton-neutron mass difference, and this fine-tuning problem needs to be understood from first principles. Besides, currently the best chance for discovering BSM physics seems once more to lie with low-energy high-precision experiments, and dark matter searches require good knowledge of nuclear structure to control their systematics. Scattering phase shifts are being studied through the Lüscher formula. Current state-of-the-art studies of bound multi-hadron systems are related to dibaryons, in particular the question of the existence of the H-dibaryon at the physical pion mass (note that the dineutron, certainly unbound in the real world, becomes bound at heavy enough pion masses), and three- and four-nucleon systems are beginning to become treatable, although the signal-to-noise problem gets worse as more baryons are added to a correlation function, and the number of contractions grows rapidly. Going beyond masses and binding energies, the new California Lattice Collaboration (CalLat) has preliminary results for hadronic parity violation in the two-nucleon system, albeit at a pion mass of 800 MeV.

Clifford Johnson: dublab at LAIH

[Photo: Mark McNeill at the LAIH luncheon, 27 February 2015.] Mark ("Frosty") McNeill gave us a great overview of the work of the dublab collective at last Friday's LAIH luncheon. As I said in my introduction:
... dublab shows up as part of the DNA of many of the most engaging live events around the City (at MOCA, LACMA, Barnsdall, the Hammer, the Getty, the Natural History Museum, the Hollywood Bowl… and so on), and dublab is available in its core form as a radio project any time you like if you want to listen online. [...] dublab is a "non-profit web radio collective devoted to the growth of  positive music, arts and culture."
Frosty is a co-founder of dublab, and he told us a bit about its history, its activities, and its wonderful new project called "Sound Share LA", which will be launching soon: They are creating a multimedia archive of Los Angeles based [...]

Tommaso Dorigo: Recent Results From Super-Kamiokande

(The XVIth edition of "Neutrino Telescopes" is going on in Venice this week. The writeup below is from a talk by M. Nakahata at the morning session today. For more on the conference and the results shown and discussed there, see the conference blog.)


March 02, 2015

John Preskill: Always look on the bright side…of CPTP maps.

Once upon a time, I worked with a postdoc who shaped my views of mathematical physics, research, and life. Each week, I’d email him a PDF of the calculations and insights I’d accrued. He’d respond along the lines of, “Thanks so much for your notes. They look great! I think they’re mostly correct; there are just a few details that might need fixing.” My postdoc would point out the “details” over espresso, at a café table by a window. “Are you familiar with…?” he’d begin, and pull out of his back pocket some bit of math I’d never heard of. My calculations appeared to crumble like biscotti.

Some of the math involved CPTP maps. “CPTP” stands for a phrase little more enlightening than the acronym: “completely positive trace-preserving”. CPTP maps represent processes undergone by quantum systems. Imagine preparing some system—an electron, a photon, a superconductor, etc.—in a state I’ll call “\rho“. Imagine turning on a magnetic field, or coupling one electron to another, or letting the superconductor sit untouched. A CPTP map, labeled as \mathcal{E}, represents every such evolution.

“Trace-preserving” means the following: Imagine that, instead of switching on the magnetic field, you measured some property of \rho. If your measurement device (your photodetector, spectrometer, etc.) worked perfectly, you’d read out one of several possible numbers. Let p_i denote the probability that you read out the i^{\rm{th}} possible number. Because your device outputs some number, the probabilities sum to one: \sum_i p_i = 1.  We say that \rho “has trace one.” But you don’t measure \rho; you switch on the magnetic field. \rho undergoes the process \mathcal{E}, becoming a quantum state \mathcal{E}(\rho). Imagine that, after the process ended, you measured a property of \mathcal{E}(\rho). If your measurement device worked perfectly, you’d read out one of several possible numbers. Let q_a denote the probability that you read out the a^{\rm{th}} possible number. The probabilities sum to one: \sum_a q_a = 1. \mathcal{E}(\rho) “has trace one”, so the map \mathcal{E} is “trace-preserving”.

Now that we understand trace preservation, we can understand positivity. The probabilities p_i are positive (actually, nonnegative) because they lie between zero and one. Since the p_i characterize a crucial aspect of \rho, we call \rho “positive” (though we should call \rho “nonnegative”). \mathcal{E} turns the positive \rho into the positive \mathcal{E}(\rho). Since \mathcal{E} maps positive objects to positive objects, we call \mathcal{E} “positive”. \mathcal{E} also satisfies a stronger condition, so we call such maps “completely positive.”**
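
A small numerical sketch of both properties (my own illustration, not from the post), using the qubit depolarizing channel written in Kraus form as the example CPTP map:

```python
import numpy as np

# Pauli matrices.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def depolarizing(rho, p):
    """E(rho) in Kraus form: with probability p the state is hit by a
    uniformly random Pauli. The Kraus operators satisfy sum K^dag K = I,
    which is what guarantees trace preservation."""
    kraus = [np.sqrt(1 - 3 * p / 4) * I2] + [np.sqrt(p / 4) * P for P in (X, Y, Z)]
    return sum(K @ rho @ K.conj().T for K in kraus)

# A valid qubit state: positive semidefinite, trace one.
rho = np.array([[0.7, 0.2 - 0.1j], [0.2 + 0.1j, 0.3]])
out = depolarizing(rho, 0.3)

print(np.isclose(np.trace(out).real, 1.0))        # True: trace-preserving
print(np.all(np.linalg.eigvalsh(out) >= -1e-12))  # True: output is positive
```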

So I called my postdoc. “It’s almost right,” he’d repeat, nudging aside his espresso and pulling out a pencil. We’d patch the holes in my calculations. We might rewrite my conclusions, strengthen my assumptions, or prove another lemma. Always, we salvaged cargo. Always, I learned.

I no longer email weekly updates to a postdoc. But I apply what I learned at that café table, about entanglement and monotones and complete positivity. “It’s almost right,” I tell myself when a hole yawns in my calculations and a week’s work appears to fly out the window. “I have to fix a few details.”

Am I certain? No. But I remain positive.

*Experts: “Trace-preserving” means \rm{Tr}(\rho) =1 \Rightarrow \rm{Tr}(\mathcal{E}(\rho)) = 1.

**Experts: Suppose that \rho is defined on a Hilbert space \mathcal{H} and that \mathcal{E}(\rho) is defined on a Hilbert space \mathcal{H}'. “\mathcal{E} is positive” means \rho \geq 0 \Rightarrow \mathcal{E}(\rho) \geq 0.

To understand what “completely positive” means, imagine that our quantum system interacts with an environment. For example, suppose the system consists of photons in a box. If the box leaks, the photons interact with the electromagnetic field outside the box. Suppose the system-and-environment composite begins in a state \sigma_{AB} defined on a Hilbert space \mathcal{H}_{AB}. The channel \mathcal{E} acts on the system’s share of the state. Let \mathcal{I} denote the identity operation that maps every possible environment state to itself. Suppose that \mathcal{E} changes the system’s state while \mathcal{I} preserves the environment’s state. The system-and-environment composite ends up in the state (\mathcal{E} \otimes \mathcal{I})(\sigma_{AB}). This state is positive, so we call \mathcal{E} “completely positive”: \sigma_{AB} \geq 0 \Rightarrow (\mathcal{E} \otimes \mathcal{I})(\sigma_{AB}) \geq 0.
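
The word “completely” earns its keep: a textbook counterexample (my addition, not from the post) is the transpose map, which is positive on single states but fails once you let the system share entanglement with an environment. A NumPy sketch:

```python
import numpy as np

# Transpose is positive: it preserves the eigenvalues of any state.
rho = np.array([[0.7, 0.2 - 0.1j], [0.2 + 0.1j, 0.3]])
print(np.all(np.linalg.eigvalsh(rho.T.conj().T.T) >= 0))  # True

# But (transpose x identity) applied to an entangled two-qubit Bell
# state |00>+|11> (normalized) yields a negative eigenvalue, so the
# transpose is not *completely* positive.
bell = np.zeros((4, 4))
for i in (0, 1):
    for j in (0, 1):
        bell[2 * i + i, 2 * j + j] = 0.5   # |ii><jj| entries

# Partial transpose on the first qubit: swap its row/column indices.
pt = bell.reshape(2, 2, 2, 2).transpose(2, 1, 0, 3).reshape(4, 4)
print(np.linalg.eigvalsh(pt).min() < 0)  # True
```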

Terence Tao: 254A, Notes 4: Some sieve theory

Many problems in non-multiplicative prime number theory can be recast as sieving problems. Consider for instance the problem of counting the number {N(x)} of pairs of twin primes {p,p+2} contained in {[x/2,x]} for some large {x}; note that the claim that {N(x) > 0} for arbitrarily large {x} is equivalent to the twin prime conjecture. One can obtain this count by any of the following variants of the sieve of Eratosthenes:

  1. Let {A} be the set of natural numbers in {[x/2,x-2]}. For each prime {p \leq \sqrt{x}}, let {E_p} be the union of the residue classes {0\ (p)} and {-2\ (p)}. Then {N(x)} is the cardinality of the sifted set {A \backslash \bigcup_{p \leq \sqrt{x}} E_p}.
  2. Let {A} be the set of primes in {[x/2,x-2]}. For each prime {p \leq \sqrt{x}}, let {E_p} be the residue class {-2\ (p)}. Then {N(x)} is the cardinality of the sifted set {A \backslash \bigcup_{p \leq \sqrt{x}} E_p}.
  3. Let {A} be the set of primes in {[x/2+2,x]}. For each prime {p \leq \sqrt{x}}, let {E_p} be the residue class {2\ (p)}. Then {N(x)} is the cardinality of the sifted set {A \backslash \bigcup_{p \leq \sqrt{x}} E_p}.
  4. Let {A} be the set {\{ n(n+2): x/2 \leq n \leq x-2 \}}. For each prime {p \leq \sqrt{x}}, let {E_p} be the residue class {0\ (p)}. Then {N(x)} is the cardinality of the sifted set {A \backslash \bigcup_{p \leq \sqrt{x}} E_p}.

Exercise 1 Develop similar sifting formulations of the other three Landau problems.
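The four formulations above are equivalent, and for small {x} one can confirm this directly. Here is a minimal brute-force sketch (not part of the notes) for {x = 1000}:

```python
from math import isqrt

def is_prime(n):
    if n < 2:
        return False
    return all(n % q for q in range(2, isqrt(n) + 1))

x = 1000
small_primes = [p for p in range(2, isqrt(x) + 1) if is_prime(p)]  # p <= sqrt(x)

# Direct count of twin prime pairs p, p+2 contained in [x/2, x].
direct = sum(1 for p in range(x // 2, x - 1)
             if is_prime(p) and is_prime(p + 2))

# Variant 1: sift the integers in [x/2, x-2] by the classes 0 and -2 mod p.
v1 = sum(1 for n in range(x // 2, x - 1)
         if all(n % p != 0 and (n + 2) % p != 0 for p in small_primes))

# Variant 2: sift the primes in [x/2, x-2] by the class -2 mod p.
v2 = sum(1 for n in range(x // 2, x - 1)
         if is_prime(n) and all((n + 2) % p != 0 for p in small_primes))

# Variant 3: sift the primes in [x/2+2, x] by the class 2 mod p.
v3 = sum(1 for n in range(x // 2 + 2, x + 1)
         if is_prime(n) and all((n - 2) % p != 0 for p in small_primes))

# Variant 4: sift the products n(n+2) by the class 0 mod p.
v4 = sum(1 for n in range(x // 2, x - 1)
         if all((n * (n + 2)) % p != 0 for p in small_primes))
```

All four sifted counts agree with the direct count, because an integer in {(x/\sqrt{x}, x]} with no prime factor up to {\sqrt{x}} is necessarily prime.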

In view of these sieving interpretations of number-theoretic problems, it becomes natural to try to estimate the size of sifted sets {A \backslash \bigcup_{p | P} E_p} for various finite sets {A} of integers, and subsets {E_p} of integers indexed by primes {p} dividing some squarefree natural number {P} (which, in the above examples, would be the product of all primes up to {\sqrt{x}}). As we see in the above examples, the sets {E_p} in applications are typically the union of one or more residue classes modulo {p}, but we will work at a more abstract level of generality here by treating {E_p} as more or less arbitrary sets of integers, without caring too much about the arithmetic structure of such sets.

It turns out to be conceptually more natural to replace sets by functions, and to consider the more general task of estimating sifted sums

\displaystyle  \sum_{n \in {\bf Z}} a_n 1_{n \not \in \bigcup_{p | P} E_p} \ \ \ \ \ (1)

for some finitely supported sequence {(a_n)_{n \in {\bf Z}}} of non-negative numbers; the previous combinatorial sifting problem then corresponds to the indicator function case {a_n=1_{n \in A}}. (One could also use other index sets here than the integers {{\bf Z}} if desired; for much of sieve theory the index set and its subsets {E_p} are treated as abstract sets, so the exact arithmetic structure of these sets is not of primary importance.)

Continuing with twin primes as a running example, we thus have the following sample sieving problem:

Problem 2 (Sieving problem for twin primes) Let {x, z \geq 1}, and let {\pi_2(x,z)} denote the number of natural numbers {n \leq x} which avoid the residue classes {0, -2\ (p)} for all primes {p < z}. In other words, we have

\displaystyle  \pi_2(x,z) := \sum_{n \in {\bf Z}} a_n 1_{n \not \in \bigcup_{p | P(z)} E_p}

where {a_n := 1_{n \in [1,x]}}, {P(z) := \prod_{p < z} p} is the product of all the primes strictly less than {z} (we omit {z} itself for minor technical reasons), and {E_p} is the union of the residue classes {0, -2\ (p)}. Obtain upper and lower bounds on {\pi_2(x,z)} which are as strong as possible in the asymptotic regime where {x} goes to infinity and the sifting level {z} grows with {x} (ideally we would like {z} to grow as fast as {\sqrt{x}}).

From the preceding discussion we know that the number of twin prime pairs {p,p+2} in {(x/2,x]} is equal to {\pi_2(x-2,\sqrt{x}) - \pi_2(x/2,\sqrt{x})}, if {x} is not a perfect square; one also easily sees that the number of twin prime pairs in {[1,x]} is at least {\pi_2(x-2,\sqrt{x})}, again if {x} is not a perfect square. Thus we see that a sufficiently good answer to Problem 2 would resolve the twin prime conjecture, particularly if we can get the sifting level {z} to be as large as {\sqrt{x}}.

We return now to the general problem of estimating (1). We may expand

\displaystyle  1_{n \not \in \bigcup_{p | P} E_p} = \prod_{p | P} (1 - 1_{E_p}(n)) \ \ \ \ \ (2)

\displaystyle  = \sum_{k=0}^\infty (-1)^k \sum_{p_1 \dots p_k|P: p_1 < \dots < p_k} 1_{E_{p_1}} \dots 1_{E_{p_k}}(n)

\displaystyle  = \sum_{d|P} \mu(d) 1_{E_d}(n)

where {E_d := \bigcap_{p|d} E_p} (with the convention that {E_1={\bf Z}}). We thus arrive at the Legendre sieve identity

\displaystyle  \sum_{n \in {\bf Z}} a_n 1_{n \not \in \bigcup_{p | P} E_p} = \sum_{d|P} \mu(d) \sum_{n \in E_d} a_n. \ \ \ \ \ (3)

Specialising to the case of an indicator function {a_n=1_{n \in A}}, we recover the inclusion-exclusion formula

\displaystyle  |A \backslash \bigcup_{p|P} E_p| = \sum_{d|P} \mu(d) |A \cap E_d|.
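For a concrete instance of this inclusion-exclusion formula, take {A = \{1,\dots,200\}}, {P = 210}, and {E_p} the multiples of {p}; a short check:

```python
from itertools import combinations

primes = [2, 3, 5, 7]          # P = 210
A = range(1, 201)

# Left-hand side: elements of A avoiding every E_p (i.e. coprime to 210).
lhs = sum(1 for n in A if all(n % p != 0 for p in primes))

# Right-hand side: sum of mu(d) * |A ∩ E_d| over the 16 squarefree divisors
# d | 210, where E_d is the set of multiples of d and mu(d) = (-1)^{#prime factors}.
rhs = 0
for k in range(len(primes) + 1):
    for combo in combinations(primes, k):
        d = 1
        for p in combo:
            d *= p
        rhs += (-1) ** k * sum(1 for n in A if n % d == 0)
```

The two sides agree exactly, as the Legendre sieve identity predicts.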

Such exact sieving formulae are already satisfactory for controlling sifted sets or sifted sums when the amount of sieving is relatively small compared to the size of {A}. For instance, let us return to the running example in Problem 2 for some {x,z \geq 1}. Observe that each {E_p} in this example consists of {\omega(p)} residue classes modulo {p}, where {\omega(p)} is defined to equal {1} when {p=2} and {2} when {p} is odd. By the Chinese remainder theorem, this implies that for each {d|P(z)}, {E_d} consists of {\prod_{p|d} \omega(p)} residue classes modulo {d}. Using the basic bound

\displaystyle  \sum_{n \leq x: n = a\ (q)} 1 = \frac{x}{q} + O(1) \ \ \ \ \ (4)

for any {x > 0} and any residue class {a\ (q)}, we conclude that

\displaystyle  \sum_{n \in E_d} a_n = g(d) x + O( \prod_{p|d} \omega(p) ) \ \ \ \ \ (5)

for any {d|P(z)}, where {g} is the multiplicative function

\displaystyle  g(d) := \prod_{p|d: p|P(z)} \frac{\omega(p)}{p}.

Since {\omega(p) \leq 2} and there are at most {\pi(z)} primes dividing {P(z)}, we may crudely bound {\prod_{p|d} \omega(p) \leq 2^{\pi(z)}}, thus

\displaystyle  \sum_{n \in E_d} a_n = g(d) x + O( 2^{\pi(z)} ). \ \ \ \ \ (6)

Also, the number of divisors of {P(z)} is at most {2^{\pi(z)}}. From the Legendre sieve (3), we thus conclude that

\displaystyle  \pi_2(x,z) = (\sum_{d|P(z)} \mu(d) g(d) x) + O( 4^{\pi(z)} ).

We can factorise the main term to obtain

\displaystyle  \pi_2(x,z) = x \prod_{p < z} (1-\frac{\omega(p)}{p}) + O( 4^{\pi(z)} ).

This is compatible with the heuristic

\displaystyle  \pi_2(x,z) \approx x \prod_{p < z} (1-\frac{\omega(p)}{p}) \ \ \ \ \ (7)

coming from the equidistribution of residues principle (Section 3 of Supplement 4), bearing in mind (from the modified Cramér model, see Section 1 of Supplement 4) that we expect this heuristic to become inaccurate when {z} becomes very large. We can simplify the right-hand side of (7) by recalling the twin prime constant

\displaystyle  \Pi_2 := \prod_{p>2} (1 - \frac{1}{(p-1)^2}) = 0.6601618\dots

(see equation (7) from Supplement 4); note that

\displaystyle  \prod_p (1-\frac{1}{p})^{-2} (1-\frac{\omega(p)}{p}) = 2 \Pi_2

so from Mertens’ third theorem (Theorem 42 from Notes 1) one has

\displaystyle  \prod_{p < z} (1-\frac{\omega(p)}{p}) = (2\Pi_2+o(1)) \frac{1}{(e^\gamma \log z)^2} \ \ \ \ \ (8)

as {z \rightarrow \infty}. Bounding {4^{\pi(z)}} crudely by {\exp(o(z))}, we conclude in particular that

\displaystyle  \pi_2(x,z) = (2\Pi_2 +o(1)) \frac{x}{(e^\gamma \log z)^2}

when {x,z \rightarrow \infty} with {z = O(\log x)}. This is somewhat encouraging for the purposes of getting a sufficiently good answer to Problem 2 to resolve the twin prime conjecture, but note that {z} is currently far too small: one needs to get {z} as large as {\sqrt{x}} before one is counting twin primes, and currently {z} can only get as large as {\log x}.
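For small sifting levels the heuristic (7) is already quite accurate numerically. Here is a rough sketch (with an illustrative choice {x = 3 \times 10^5}, {z = 20}, and a loose 5% tolerance) comparing the exact sifted count against the main term:

```python
from math import isqrt

def primes_below(z):
    return [p for p in range(2, z)
            if all(p % q for q in range(2, isqrt(p) + 1))]

x, z = 300_000, 20
ps = primes_below(z)            # 2, 3, 5, 7, 11, 13, 17, 19

# Exact pi_2(x, z): count n <= x avoiding the classes 0 and -2 mod p for all p < z.
exact = sum(1 for n in range(1, x + 1)
            if all(n % p != 0 and (n + 2) % p != 0 for p in ps))

# Heuristic main term x * prod_{p < z} (1 - omega(p)/p),
# with omega(2) = 1 and omega(p) = 2 for odd p.
main = float(x)
for p in ps:
    main *= 1 - (1 if p == 2 else 2) / p
```

The relative discrepancy here is far smaller than the crude {O(4^{\pi(z)})} bound would suggest; the difficulty discussed next is that this control degrades rapidly as {z} grows.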

The problem is that the number of terms in the Legendre sieve (3) basically grows exponentially in {z}, and so the error terms in (4) accumulate to an unacceptable extent once {z} is significantly larger than {\log x}. An alternative way to phrase this problem is that the estimate (4) is only expected to be truly useful in the regime {q=o(x)}; on the other hand, the moduli {d} appearing in (3) can be as large as {P}, which grows exponentially in {z} by the prime number theorem.

To resolve this problem, it is thus natural to try to truncate the Legendre sieve, in such a way that one only uses information about the sums {\sum_{n \in E_d} a_n} for a relatively small number of divisors {d} of {P}, such as those {d} which are below a certain threshold {D}. This leads to the following general sieving problem:

Problem 3 (General sieving problem) Let {P} be a squarefree natural number, and let {{\mathcal D}} be a set of divisors of {P}. For each prime {p} dividing {P}, let {E_p} be a set of integers, and define {E_d := \bigcap_{p|d} E_p} for all {d|P} (with the convention that {E_1={\bf Z}}). Suppose that {(a_n)_{n \in {\bf Z}}} is an (unknown) finitely supported sequence of non-negative reals, whose sums

\displaystyle  X_d := \sum_{n \in E_d} a_n \ \ \ \ \ (9)

are known for all {d \in {\mathcal D}}. What are the best upper and lower bounds one can conclude on the quantity (1)?

Here is a simple example of this type of problem (corresponding to the case {P = 6}, {{\mathcal D} = \{1, 2, 3\}}, {X_1 = 100}, {X_2 = 60}, and {X_3 = 10}):

Exercise 4 Let {(a_n)_{n \in {\bf Z}}} be a finitely supported sequence of non-negative reals such that {\sum_{n \in {\bf Z}} a_n = 100}, {\sum_{n \in {\bf Z}: 2|n} a_n = 60}, and {\sum_{n \in {\bf Z}: 3|n} a_n = 10}. Show that

\displaystyle  30 \leq \sum_{n \in {\bf Z}: (n,6)=1} a_n \leq 40

and give counterexamples to show that these bounds cannot be improved in general, even when {a_n} is an indicator function sequence.
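One can sanity-check the bounds in Exercise 4 numerically (this is a check, not a proof): the constraints only see {n} modulo {6}, so the mass can be grouped into four residue “atoms”, and the objective is linear in the one free parameter, so its extremes occur at the endpoints:

```python
# Atoms mod 6: c = mass on (n,6)=1, e = even but not divisible by 3,
# t = divisible by 3 but odd, s = divisible by 6.
# Constraints: c + e + t + s = 100,  e + s = 60,  t + s = 10.
# Objective c = 30 + s is linear in s, so scanning integer s suffices.
feasible_c = []
for s in range(0, 101):
    e, t = 60 - s, 10 - s
    c = 100 - e - t - s
    if min(c, e, t) >= 0:
        feasible_c.append(c)

low, high = min(feasible_c), max(feasible_c)
```

The enumeration recovers the stated range {[30, 40]} for {\sum_{(n,6)=1} a_n}.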

Problem 3 is an example of a linear programming problem. By using linear programming duality (as encapsulated by results such as the Hahn-Banach theorem, the separating hyperplane theorem, or the Farkas lemma), we can rephrase the above problem in terms of upper and lower bound sieves:

Theorem 5 (Dual sieve problem) Let {P, {\mathcal D}, E_p, E_d, X_d} be as in Problem 3. We assume that Problem 3 is feasible, in the sense that there exists at least one finitely supported sequence {(a_n)_{n \in {\bf Z}}} of non-negative reals obeying the constraints in that problem. Define an (normalised) upper bound sieve to be a function {\nu^+: {\bf Z} \rightarrow {\bf R}} of the form

\displaystyle  \nu^+ = \sum_{d \in {\mathcal D}} \lambda^+_d 1_{E_d}

for some coefficients {\lambda^+_d \in {\bf R}}, and obeying the pointwise lower bound

\displaystyle  \nu^+(n) \geq 1_{n \not \in\bigcup_{p|P} E_p}(n) \ \ \ \ \ (10)

for all {n \in {\bf Z}} (in particular {\nu^+} is non-negative). Similarly, define a (normalised) lower bound sieve to be a function {\nu^-: {\bf Z} \rightarrow {\bf R}} of the form

\displaystyle  \nu^-(n) = \sum_{d \in {\mathcal D}} \lambda^-_d 1_{E_d}

for some coefficients {\lambda^-_d \in {\bf R}}, and obeying the pointwise upper bound

\displaystyle  \nu^-(n) \leq 1_{n \not \in\bigcup_{p|P} E_p}(n)

for all {n \in {\bf Z}}. Thus for instance {1} and {0} are (trivially) upper bound sieves and lower bound sieves respectively.

  • (i) The supremal value of the quantity (1), subject to the constraints in Problem 3, is equal to the infimal value of the quantity {\sum_{d \in {\mathcal D}} \lambda^+_d X_d}, as {\nu^+ = \sum_{d \in {\mathcal D}} \lambda^+_d 1_{E_d}} ranges over all upper bound sieves.
  • (ii) The infimal value of the quantity (1), subject to the constraints in Problem 3, is equal to the supremal value of the quantity {\sum_{d \in {\mathcal D}} \lambda^-_d X_d}, as {\nu^- = \sum_{d \in {\mathcal D}} \lambda^-_d 1_{E_d}} ranges over all lower bound sieves.

Proof: We prove part (i) only, and leave part (ii) as an exercise. Let {A} be the supremal value of the quantity (1) given the constraints in Problem 3, and let {B} be the infimal value of {\sum_{d \in {\mathcal D}} \lambda^+_d X_d}. We need to show that {A=B}.

We first establish the easy inequality {A \leq B}. If the sequence {a_n} obeys the constraints in Problem 3, and {\nu^+ = \sum_{d \in {\mathcal D}} \lambda^+_d 1_{E_d}} is an upper bound sieve, then

\displaystyle  \sum_n \nu^+(n) a_n = \sum_{d \in {\mathcal D}} \lambda^+_d X_d

and hence (by the non-negativity of {\nu^+} and {a_n})

\displaystyle  \sum_{n \not \in \bigcup_{p|P} E_p} a_n \leq \sum_{d \in {\mathcal D}} \lambda^+_d X_d;

taking suprema in {(a_n)_{n \in {\bf Z}}} and infima in {\nu^+} we conclude that {A \leq B}.

Now suppose for contradiction that {A<B}, thus {A < C < B} for some real number {C}. We will argue using the hyperplane separation theorem; one can also proceed using one of the other duality results mentioned above. (See this previous blog post for some discussion of the connections between these various forms of linear duality.) Consider the affine functional

\displaystyle  \rho_0: (a_n)_{n \in{\bf Z}} \mapsto C - \sum_{n \not \in \bigcup_{p|P} E_p} a_n.

on the vector space of finitely supported sequences {(a_n)_{n \in {\bf Z}}} of reals. On the one hand, since {C > A}, this functional is positive for every sequence {(a_n)_{n \in{\bf Z}}} obeying the constraints in Problem 3. Next, let {K} be the space of affine functionals {\rho} of the form

\displaystyle  \rho: (a_n)_{n \in {\bf Z}} \mapsto -\sum_{d \in {\mathcal D}} \lambda^+_d ( \sum_{n \in E_d} a_n - X_d ) + \sum_n a_n \nu(n) + X

for some real numbers {\lambda^+_d \in {\bf R}}, some non-negative function {\nu: {\bf Z} \rightarrow {\bf R}^+} which is a finite linear combination of the {1_{E_d}} for {d|P}, and some non-negative {X}. This is a closed convex cone in a finite-dimensional vector space {V}; note also that {\rho_0} lies in {V}. Suppose first that {\rho_0 \in K}, thus we have a representation of the form

\displaystyle C - \sum_{n \not \in \bigcup_{p|P} E_p} a_n = -\sum_{d \in {\mathcal D}} \lambda^+_d ( \sum_{n \in E_d} a_n - X_d ) + \sum_n a_n \nu(n) + X

for any finitely supported sequence {(a_n)_{n \in {\bf Z}}}. Comparing coefficients, we conclude that

\displaystyle  \sum_{d \in {\mathcal D}} \lambda^+_d 1_{E_d}(n) \geq 1_{n \not \in \bigcup_{p|P} E_p}

for any {n} (i.e., {\sum_{d \in {\mathcal D}} \lambda^+_d 1_{E_d}} is an upper bound sieve), and also

\displaystyle  C \geq \sum_{d \in {\mathcal D}} \lambda^+_d X_d,

and thus {C \geq B}, a contradiction. Thus {\rho_0} lies outside of {K}. But then by the hyperplane separation theorem, we can find an affine functional {\iota: V \rightarrow {\bf R}} on {V} that is non-negative on {K} and negative on {\rho_0}. By duality, such an affine functional takes the form {\iota: \rho \mapsto \rho((b_n)_{n \in {\bf Z}}) + c} for some finitely supported sequence {(b_n)_{n \in {\bf Z}}} and {c \in {\bf R}} (indeed, {(b_n)_{n \in {\bf Z}}} can be supported on a finite set consisting of a single representative for each atom of the finite {\sigma}-algebra generated by the {E_p}). Since {\iota} is non-negative on the cone {K}, we see (on testing against multiples of the functionals {(a_n)_{n \in {\bf Z}} \mapsto \sum_{n \in E_d} a_n - X_d} or {(a_n)_{n \in {\bf Z}} \mapsto a_n}) that the {b_n} and {c} are non-negative, and that {\sum_{n \in E_d} b_n - X_d = 0} for all {d \in {\mathcal D}}; thus {(b_n)_{n \in {\bf Z}}} is feasible for Problem 3. Since {\iota} is negative on {\rho_0}, we see that

\displaystyle  \sum_{n \not \in \bigcup_{p|P} E_p} b_n \geq C

and thus {A \geq C}, giving the desired contradiction. \Box

Exercise 6 Prove part (ii) of the above theorem.

Exercise 7 Show that the infima and suprema in the above theorem are actually attained (so one can replace “infimal” and “supremal” by “minimal” and “maximal” if desired).

Exercise 8 What are the optimal upper and lower bound sieves for Exercise 4?

In the case when {{\mathcal D}} consists of all the divisors of {P}, we see that the Legendre sieve {\sum_{d|P} \mu(d) 1_{E_d}} is both the optimal upper bound sieve and the optimal lower bound sieve, regardless of what the quantities {X_d} are. However, in most cases of interest, {{\mathcal D}} will only be some strict subset of the divisors of {P}, and there will be a gap between the optimal upper and lower bounds.

Observe that a sequence {(\lambda^+_d)_{d \in {\mathcal D}}} of real numbers will form an upper bound sieve {\sum_d \lambda^+_d 1_{E_d}} if one has the inequalities

\displaystyle  \lambda^+_1 \geq 1


\displaystyle  \sum_{d|n} \lambda^+_d \geq 0

for all {n|P}; we will refer to such sequences as upper bound sieve coefficients. (Conversely, if the sets {E_p} are in “general position” in the sense that every set of the form {\bigcap_{p|n} E_p \backslash \bigcup_{p|P; p\not | n} E_p} for {n|P} is non-empty, we see that every upper bound sieve arises from a sequence of upper bound sieve coefficients.) Similarly, a sequence {(\lambda^-_d)_{d \in {\mathcal D}}} of real numbers will form a lower bound sieve {\sum_d \lambda^-_d 1_{E_d}} if one has the inequalities

\displaystyle  \lambda^-_1 \leq 1


\displaystyle  \sum_{d|n} \lambda^-_d \leq 0

for all {n|P} with {n>1}; we will refer to such sequences as lower bound sieve coefficients.
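As a quick illustration of these conditions, the Legendre coefficients {\lambda_d = \mu(d)} satisfy both sets of inequalities simultaneously (consistent with the identity {\sum_{d|n} \mu(d) = 1_{n=1}}); a small check for {P = 30}:

```python
from itertools import combinations

primes = [2, 3, 5]                       # P = 30, squarefree
mu = {}                                  # d -> mu(d) for the 8 divisors of P
for k in range(len(primes) + 1):
    for combo in combinations(primes, k):
        d = 1
        for p in combo:
            d *= p
        mu[d] = (-1) ** k

def divisor_sum(n):
    # sum of lambda_d over divisors d of n (all d here divide the squarefree P)
    return sum(m for d, m in mu.items() if n % d == 0)

# Upper bound conditions: lambda_1 >= 1 and divisor sums >= 0 for all n | P.
upper_ok = mu[1] >= 1 and all(divisor_sum(n) >= 0 for n in mu)
# Lower bound conditions: lambda_1 <= 1 and divisor sums <= 0 for n | P, n > 1.
lower_ok = mu[1] <= 1 and all(divisor_sum(n) <= 0 for n in mu if n > 1)
```

Both checks pass, since every divisor sum equals {1_{n=1}}.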

Exercise 9 (Brun pure sieve) Let {P} be a squarefree number, and {k} a non-negative integer. Show that the sequence {(\lambda_d)_{d | P}} defined by

\displaystyle  \lambda_d := 1_{\omega(d) \leq k} \mu(d),

where {\omega(d)} is the number of prime factors of {d}, is a sequence of upper bound sieve coefficients for even {k}, and a sequence of lower bound sieve coefficients for odd {k}. Deduce the Bonferroni inequalities

\displaystyle  \sum_{n \in {\bf Z}} a_n 1_{n \not \in \bigcup_{p | P} E_p} \leq \sum_{d|P: \omega(d) \leq k} \mu(d) X_d \ \ \ \ \ (11)

when {k} is even, and

\displaystyle  \sum_{n \in {\bf Z}} a_n 1_{n \not \in \bigcup_{p | P} E_p} \geq \sum_{d|P: \omega(d) \leq k} \mu(d) X_d \ \ \ \ \ (12)

when {k} is odd, whenever one is in the situation of Problem 3 (and {{\mathcal D}} contains all {d|P} with {\omega(d) \leq k}). The resulting upper and lower bound sieves are sometimes known as Brun pure sieves. The Legendre sieve can be viewed as the limiting case when {k \geq \omega(P)}.
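The sign conditions behind Exercise 9 can be verified mechanically for small {P}; a sketch for {P = 2310}:

```python
from itertools import combinations

primes = [2, 3, 5, 7, 11]    # P = 2310
divs = []                    # (d, omega(d), mu(d)) for all divisors of P
for r in range(len(primes) + 1):
    for combo in combinations(primes, r):
        d = 1
        for p in combo:
            d *= p
        divs.append((d, r, (-1) ** r))

def brun_ok(k):
    """Check the sieve-coefficient inequalities for lambda_d = 1_{omega(d)<=k} mu(d)."""
    lam = {d: (mu if r <= k else 0) for d, r, mu in divs}
    sums = {n: sum(lam[d] for d, _, _ in divs if n % d == 0) for n, _, _ in divs}
    if k % 2 == 0:   # even k: upper bound coefficients, lambda_1 >= 1, sums >= 0
        return lam[1] >= 1 and all(s >= 0 for s in sums.values())
    else:            # odd k: lower bound coefficients, lambda_1 <= 1, sums <= 0 (n > 1)
        return lam[1] <= 1 and all(s <= 0 for n, s in sums.items() if n > 1)
```

The underlying reason the check succeeds is the partial binomial identity {\sum_{j \leq k} (-1)^j \binom{m}{j} = (-1)^k \binom{m-1}{k}}, which has the sign {(-1)^k}.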

In many applications the sums {X_d} in (9) take the form

\displaystyle  \sum_{n \in E_d} a_n = g(d) X + r_d \ \ \ \ \ (13)

for some quantity {X} independent of {d}, some multiplicative function {g} with {0 \leq g(p) \leq 1}, and some remainder term {r_d} whose effect is expected to be negligible on average if {d} is restricted to be small, e.g. less than a threshold {D}; note for instance that (5) is of this form if {D \leq x^{1-\varepsilon}} for some fixed {\varepsilon>0} (note from the divisor bound, Lemma 23 of Notes 1, that {\prod_{p|d} \omega(p) \ll x^{o(1)}} if {d \ll x^{O(1)}}). We are thus led to the following idealisation of the sieving problem, in which the remainder terms {r_d} are ignored:

Problem 10 (Idealised sieving) Let {z, D \geq 1} (we refer to {z} as the sifting level and {D} as the level of distribution), let {g} be a multiplicative function with {0 \leq g(p) \leq 1}, and let {{\mathcal D} := \{ d|P(z): d \leq D \}}. How small can one make the quantity

\displaystyle  \sum_{d \in {\mathcal D}} \lambda^+_d g(d) \ \ \ \ \ (14)

for a sequence {(\lambda^+_d)_{d \in {\mathcal D}}} of upper bound sieve coefficients, and how large can one make the quantity

\displaystyle  \sum_{d \in {\mathcal D}} \lambda^-_d g(d) \ \ \ \ \ (15)

for a sequence {(\lambda^-_d)_{d \in {\mathcal D}}} of lower bound sieve coefficients?

Thus, for instance, the trivial upper bound sieve {\lambda^+_d := 1_{d=1}} and the trivial lower bound sieve {\lambda^-_d := 0} show that (14) can equal {1} and (15) can equal {0}. Of course, one hopes to do better than these trivial bounds in many situations; usually one can improve the upper bound quite substantially, but improving the lower bound is significantly more difficult, particularly when {z} is large compared with {D}.

If the remainder terms {r_d} in (13) are indeed negligible on average for {d \leq D}, then one expects the upper and lower bounds in Problem 3 to essentially be the optimal bounds in (14) and (15) respectively, multiplied by the normalisation factor {X}. Thus Problem 10 serves as a good model problem for Problem 3, in which all the arithmetic content of the original sieving problem has been abstracted into two parameters {z,D} and a multiplicative function {g}. In many applications, {g(p)} will be approximately {\kappa/p} on the average for some fixed {\kappa>0}, known as the sieve dimension; for instance, in the twin prime sieving problem discussed above, the sieve dimension is {2}. The larger one makes the level of distribution {D} compared to {z}, the more choices one has for the upper and lower bound sieves; it is thus of interest to obtain equidistribution estimates such as (13) for {d} as large as possible. When the sequence {a_n} is of arithmetic origin (for instance, if it is the von Mangoldt function {\Lambda}), then estimates such as the Bombieri-Vinogradov theorem, Theorem 17 from Notes 3, turn out to be particularly useful in this regard; in other contexts, the required equidistribution estimates might come from other sources, such as homogeneous dynamics, or the theory of expander graphs (the latter arises in the recent theory of the affine sieve, discussed in this previous blog post). However, the sieve-theoretic tools developed in this post are not particularly sensitive to how a certain level of distribution is attained, and are generally content to use sieve axioms such as (13) as “black boxes”.

In some applications one needs to modify Problem 10 in various technical ways (e.g. in altering the product {P(z)}, the set {{\mathcal D}}, or the definition of an upper or lower sieve coefficient sequence), but to simplify the exposition we will focus on the above problem without such alterations.

As the exercise below (or the heuristic (7)) suggests, the “natural” size of (14) and (15) is given by the quantity {V(z) := \prod_{p < z} (1 - g(p))} (so that the natural size for Problem 3 is {V(z) X}):

Exercise 11 Let {z,D,g} be as in Problem 10, and set {V(z) := \prod_{p < z} (1 - g(p))}.

  • (i) Show that the quantity (14) is always at least {V(z)} when {(\lambda^+_d)_{d \in {\mathcal D}}} is a sequence of upper bound sieve coefficients. Similarly, show that the quantity (15) is always at most {V(z)} when {(\lambda^-_d)_{d \in {\mathcal D}}} is a sequence of lower bound sieve coefficients. (Hint: compute the expected value of {\sum_{d|n} \lambda^\pm_d} when {n} is a random factor of {P(z)} chosen according to a certain probability distribution depending on {g}.)
  • (ii) Show that (14) and (15) can both attain the value of {V(z)} when {D \geq P(z)}. (Hint: translate the Legendre sieve to this setting.)

The problem of finding good sequences of upper and lower bound sieve coefficients in order to solve problems such as Problem 10 is one of the core objectives of sieve theory, and has been intensively studied. This is more of an optimisation problem rather than a genuinely number theoretic problem; however, the optimisation problem is sufficiently complicated that it has not been solved exactly or even asymptotically, except in a few special cases. (It can be reduced to an optimisation problem involving multilinear integrals of certain unknown functions of several variables, but this problem is rather difficult to analyse further; see these lecture notes of Selberg for further discussion.) But while we do not yet have a definitive solution to this problem in general, we do have a number of good general-purpose upper and lower bound sieve coefficients that give fairly good values for (14), (15), often coming within a constant factor of the idealised value {V(z)}, and which work well for sifting levels {z} as large as a small power of the level of distribution {D}. Unfortunately, we also know of an important limitation to the sieve, known as the parity problem, that prevents one from taking {z} as large as {D^{1/2}} while still obtaining non-trivial lower bounds; as a consequence, sieve theory is not able, on its own, to sift out primes for such purposes as establishing the twin prime conjecture. However, it is still possible to use these sieves, in conjunction with additional tools, to produce various types of primes or prime patterns in some cases; examples of this include the theorem of Ben Green and myself in which an upper bound sieve is used to demonstrate the existence of primes in arbitrarily long arithmetic progressions, or the more recent theorem of Zhang in which (among other things) an upper bound sieve was used to demonstrate the existence of infinitely many pairs of primes whose difference was bounded.
In such arguments, the upper bound sieve was used not so much to count the primes or prime patterns directly, but to serve instead as a sort of “container” to efficiently envelop such prime patterns; when used in such a manner, the upper bound sieves are sometimes known as enveloping sieves. If the original sequence was supported on primes, then the enveloping sieve can be viewed as a “smoothed out indicator function” that is concentrated on almost primes, which in this context refers to numbers with no small prime factors.

In a somewhat different direction, it can be possible in some cases to break the parity barrier by assuming additional equidistribution axioms on the sequence {a_n} than just (13), in particular controlling certain bilinear sums involving {a_{nm}} rather than just linear sums of the {a_n}. This approach was in particular pursued by Friedlander and Iwaniec, leading to their theorem that there are infinitely many primes of the form {n^2+m^4}.

The study of sieves is an immense topic; see for instance the recent 527-page text by Friedlander and Iwaniec. We will limit attention to two sieves which give good general-purpose results, if not necessarily the most optimal ones:

  • (i) The beta sieve (or Rosser-Iwaniec sieve), which is a modification of the classical combinatorial sieve of Brun. (A collection of sieve coefficients {\lambda_d^{\pm}} is called combinatorial if its coefficients lie in {\{-1,0,+1\}}.) The beta sieve is a family of upper and lower bound combinatorial sieves, and are particularly useful for efficiently sieving out all primes up to a parameter {z = x^{1/u}} from a set of integers of size {x}, in the regime where {u} is moderately large, leading to what is sometimes known as the fundamental lemma of sieve theory.
  • (ii) The Selberg upper bound sieve, which is a general-purpose sieve that can serve both as an upper bound sieve for classical sieving problems, as well as an enveloping sieve for sets such as the primes. (One can also convert the Selberg upper bound sieve into a lower bound sieve in a number of ways, but we will only touch upon this briefly.) A key advantage of the Selberg sieve is that, due to the “quadratic” nature of the sieve, the difficult optimisation problem in Problem 10 is replaced with a much more tractable quadratic optimisation problem, which can often be solved exactly.

Remark 12 It is possible to compose two sieves together, for instance by using the observation that the product of two upper bound sieves is again an upper bound sieve, or that the product of an upper bound sieve and a lower bound sieve is a lower bound sieve. Such a composition of sieves is useful in some applications, for instance if one wants to apply the fundamental lemma as a “preliminary sieve” to sieve out small primes, but then use a more precise sieve like the Selberg sieve to sieve out medium primes. We will see an example of this in later notes, when we discuss the linear beta-sieve.

We will also briefly present the (arithmetic) large sieve, which gives a rather different approach to Problem 3 in the case that each {E_p} consists of some number (typically a large number) of residue classes modulo {p}, and is powered by the (analytic) large sieve inequality of the preceding section. As an application of these methods, we will utilise the Selberg upper bound sieve as an enveloping sieve to establish Zhang’s theorem on bounded gaps between primes. Finally, we give an informal discussion of the parity barrier which gives some heuristic limitations on what sieve theory is able to accomplish with regards to counting prime patterns such as twin primes.

These notes are only an introduction to the vast topic of sieve theory; more detailed discussion can be found in the Friedlander-Iwaniec text, in these lecture notes of Selberg, and in many further texts.

— 1. Combinatorial sieving and the fundamental lemma of sieve theory —

We begin with a discussion of combinatorial upper and lower bound sieves, in which the sieve coefficients {\lambda^\pm_d} take values in {\{-1,0,+1\}}. These sieves, first introduced by Brun, can be viewed as truncations of the Legendre sieve (which corresponds to the choice of coefficients {\lambda^\pm_d = \mu(d)} for all {d|P(z)}). The Legendre sieve, and more generally the Brun pure sieves in Exercise 9, give basic examples of combinatorial sieves.

We loosely follow the discussion of Friedlander and Iwaniec. The combinatorial sieves may be motivated by observing the Buchstab identity

\displaystyle  1_{n \not \in \bigcup_{p < z} E_p} = 1 - \sum_{p_1 < z} 1_{E_{p_1}}(n) 1_{n \not \in \bigcup_{p < p_1} E_p} \ \ \ \ \ (16)

for any {n \in {\bf Z}} and any collection of sets {E_p \subset {\bf Z}} for {p < z}. This identity reflects the basic fact that if {n} does lie in {\bigcup_{p < z} E_p}, then there is a unique prime {p_1 < z} such that {n} lies in {E_{p_1}}, but does not lie in {E_p} for any {p < p_1}.
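The Buchstab identity (16) can be verified pointwise by machine; a short check, taking {E_p} to be the multiples of {p} for concreteness:

```python
z = 30
primes = [p for p in range(2, z) if all(p % q for q in range(2, p))]

# Verify (16) for each n:
#   1_{n avoids E_p for all p < z}
#     == 1 - sum over p1 < z of 1_{E_{p1}}(n) * 1_{n avoids E_p for all p < p1},
# where E_p is taken to be the set of multiples of p.
identity_holds = all(
    (1 if all(n % p for p in primes) else 0)
    == 1 - sum((0 if n % p1 else 1)
               * (1 if all(n % p for p in primes if p < p1) else 0)
               for p1 in primes)
    for n in range(1, 10_001))
```

Exactly one summand fires, at the least prime {p_1 < z} dividing {n}, which is the uniqueness fact noted above.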

There is an analogous identity for the function {g}:

Exercise 13 For any arithmetic function {g}, show that

\displaystyle  \prod_{p<z} (1-g(p)) = 1 - \sum_{p_1 < z} g(p_1) \prod_{p<p_1} (1-g(p)).

How is this identity related to (16)?
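The identity in Exercise 13 telescopes, and one can confirm it exactly with rational arithmetic; a sketch taking {g(p) = 1/p} for concreteness (any choice of {g} works):

```python
from fractions import Fraction

z = 50
primes = [p for p in range(2, z) if all(p % q for q in range(2, p))]
g = {p: Fraction(1, p) for p in primes}   # illustrative choice of g

def V(w):
    """Return prod_{p < w} (1 - g(p)), exactly."""
    out = Fraction(1)
    for p in primes:
        if p < w:
            out *= 1 - g[p]
    return out

lhs = V(z)
rhs = 1 - sum(g[p1] * V(p1) for p1 in primes)
```

Since {g(p_1) V(p_1) = V(p_1) - V(p_1)(1 - g(p_1))}, the sum telescopes to {1 - V(z)}, which is why the two sides agree exactly.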

The Buchstab identity (16) allows one to convert upper bound sieves into lower bound sieves and vice versa. Indeed, if {\sum_{d|P(p_1)} \lambda^{-,p_1}_d 1_{E_d}} is a family of lower bound sieves for each {p_1 < z}, then (16) tells us that

\displaystyle  1 - \sum_{p_1 < z} \sum_{d|P(p_1)} \lambda^{-,p_1}_d 1_{E_{p_1 d}}

is an upper bound sieve; similarly, if {\sum_{d|P(p_1)} \lambda^{+,p_1}_d 1_{E_d}} is a family of upper bound sieves for each {p_1 < z}, then

\displaystyle  1 - \sum_{p_1 < z} \sum_{d|P(p_1)} \lambda^{+,p_1}_d 1_{E_{p_1 d}}

is a lower bound sieve. In terms of sieve coefficients, we see that if {(\lambda^{-,p_1}_d)_{d | P(p_1)}} are lower bound sieve coefficients for each {p_1 < z}, then

\displaystyle  \left( 1_{d=1} - 1_{d > 1} \lambda^{-,p_*(d)}_{d/p_*(d)} \right)_{d|P(z)}

is a sequence of upper bound sieve coefficients (where {p_*(d)} denotes the largest prime factor of {d}). Similarly, if {(\lambda^{+,p_1}_d)_{d | P(p_1)}} are upper bound sieve coefficients for each {p_1 < z}, then

\displaystyle  \left( 1_{d=1} - 1_{d > 1} \lambda^{+,p_*(d)}_{d/p_*(d)} \right)_{d|P(z)}

is a sequence of lower bound sieve coefficients.

One can iterate this procedure to produce an alternating sequence of upper and lower bound sieves. If one iterates out the Buchstab formula completely, one ends up back at the Legendre sieve. However, one can truncate this iteration by using the trivial lower bound sieve of {0} for some of the primes {p_1}. For instance, suppose one seeks an upper bound for {1_{n \not \in \bigcup_{p < z} E_p}}. Applying (16), we retain only some of the summands, say those {p_1} which obey some predicate {A_1(p_1)} to be chosen later. For the remaining summands, we use the trivial lower bound sieve of {0}, giving

\displaystyle  1_{n \not \in \bigcup_{p < z} E_p} \leq 1 - \sum_{p_1 < z: A_1(p_1)} 1_{E_{p_1}}(n) 1_{n \not \in \bigcup_{p < p_1} E_p}.

For the surviving summands, we apply (16) again. With the sign change, the trivial lower bound sieve is not applicable, so we do not discard any further summands and arrive at

\displaystyle  1_{n \not \in \bigcup_{p < z} E_p} \leq 1 - \sum_{p_1 < z: A_1(p_1)} 1_{E_{p_1}}(n)

\displaystyle  + \sum_{p_2 < p_1 < z: A_1(p_1)} 1_{E_{p_1}}(n) 1_{n \not \in \bigcup_{p < p_2} E_p}.

For the summands in the second sum on the right, we apply (16) once again. We once again have a favourable sign and can use the trivial lower bound sieve of {0} to discard some of the resulting summands, keeping only those triples {p_1,p_2,p_3} that obey some predicate {A_3(p_1,p_2,p_3)}, giving

\displaystyle  1_{n \not \in \bigcup_{p < z} E_p} \leq 1 - \sum_{p_1 < z: A_1(p_1)} 1_{E_{p_1}}(n)

\displaystyle  + \sum_{p_2 < p_1 < z: A_1(p_1)} 1_{E_{p_1 p_2}}(n)

\displaystyle  - \sum_{p_3 < p_2 < p_1 < z: A_1(p_1) \wedge A_3(p_1,p_2,p_3)} 1_{E_{p_1 p_2 p_3}}(n) 1_{n \not \in \bigcup_{p < p_3} E_p}.

Since one cannot have an infinite descent of primes, this process will eventually terminate to produce an upper bound sieve. A similar argument (but with the signs reversed) gives a lower bound sieve. We formalise this discussion as follows:

Proposition 14 (General combinatorial sieve) Let {z > 0}. For each natural number {r}, let {A_r(p_1,\dots,p_r)} be a predicate pertaining to a sequence {z > p_1 > \dots > p_r} of {r} decreasing primes, thus {A_r(p_1,\dots,p_r)} is either true or false for a given choice of such primes. Let {{\mathcal D}_+} (resp. {{\mathcal D}_-}) denote the set of {d|P(z)} which, when factored as {d = p_1 \dots p_m} for {z > p_1 > \dots > p_m}, are such that {A_r(p_1,\dots,p_r)} holds for all odd (resp. even) {1 \leq r \leq m}. Thus

\displaystyle  {\mathcal D}_+ := \{1\} \cup \{ p_1: p_1 < z; A_1(p_1) \} \cup \{ p_1 p_2: p_2 < p_1 < z; A_1(p_1)\}

\displaystyle  \cup \{p_1 p_2 p_3: p_3 < p_2 < p_1 < z; A_1(p_1) \wedge A_3(p_1,p_2,p_3) \}

\displaystyle  \cup \{p_1 p_2 p_3 p_4: p_4 < p_3 < p_2 < p_1 < z; A_1(p_1) \wedge A_3(p_1,p_2,p_3) \}

\displaystyle  \cup \dots


\displaystyle  {\mathcal D}_- := \{1\} \cup \{ p_1: p_1 < z \} \cup \{ p_1 p_2: p_2 < p_1 < z; A_2(p_1,p_2)\}

\displaystyle  \cup \{p_1 p_2 p_3: p_3 < p_2 < p_1 < z; A_2(p_1,p_2) \}

\displaystyle  \cup \{p_1 p_2 p_3 p_4: p_4 < p_3 < p_2 < p_1 < z; A_2(p_1,p_2) \wedge A_4(p_1,p_2,p_3,p_4) \}

\displaystyle  \cup \dots.

Then {\lambda^\pm_d := \mu(d) 1_{{\mathcal D}_\pm}(d)} are upper and lower bound sieve coefficients respectively.

Note that the Legendre sieve is the special case of the combinatorial sieve when all the predicates {A_r(p_1,\dots,p_r)} are true for all choices of inputs {p_1,\dots,p_r} (i.e. one does not ever utilise the trivial lower bound sieve); more generally, the Brun pure sieve with parameter {k} corresponds to the case when {A_r(p_1,\dots,p_r)} is always true for {r \leq k} and always false for {r>k}. The predicates {A_r} for odd {r} are only used for the upper bound sieve and not the lower bound sieve; conversely, the predicates {A_r} for even {r} are only used for the lower bound sieve and not the upper bound sieve.

Exercise 15 Prove Proposition 14.

Exercise 16 Interpret the sieves in Exercise 8 as combinatorial sieves. What are the predicates {A_1, A_2} in this case?
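The alternation between upper and lower bounds in this construction is easy to check numerically in the Brun pure sieve special case, taking {E_p} to be the multiples of {p} in {[1,N]}. The code below is an illustrative sketch (helper names are ours, not from the text):

```python
from itertools import combinations

def sifted_count(N, primes):
    """Count n in [1, N] divisible by none of the given primes (direct check)."""
    return sum(1 for n in range(1, N + 1) if all(n % p for p in primes))

def brun_truncation(N, primes, k):
    """Truncated Legendre sum: sum of mu(d) * floor(N/d) over squarefree d
    built from at most k of the given primes (the Brun pure sieve)."""
    total = 0
    for r in range(k + 1):
        for combo in combinations(primes, r):
            d = 1
            for p in combo:
                d *= p
            total += (-1) ** r * (N // d)
    return total

N, primes = 10_000, [2, 3, 5, 7, 11, 13]
exact = sifted_count(N, primes)
for k in range(len(primes) + 1):
    approx = brun_truncation(N, primes, k)
    # even truncations give upper bound sieves, odd ones lower bound sieves
    assert approx >= exact if k % 2 == 0 else approx <= exact
# the full (untruncated) sum recovers the Legendre sieve exactly
assert brun_truncation(N, primes, len(primes)) == exact
```

Truncating after an even number of inclusion-exclusion levels overcounts the sifted set, and truncating after an odd number undercounts it, which is exactly the sign pattern exploited above.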

We now use the above proposition to attack Problem 10 for a given choice of parameters {z} and {D}; we will focus on the regime when

\displaystyle  z = D^{1/s} \ \ \ \ \ (19)

for some {s \geq 1}, thus we are sifting out primes up to a small power of the sieve level {D}. To do this, one needs to select the predicates {A_r(p_1,\dots,p_r)} in such a fashion that the sets {{\mathcal D}_\pm} only consist of divisors {d|P(z)} that are less than or equal to {D}. There are a number of ways to achieve this, but experience has shown that a good general-purpose choice in this regard is to define {A_r(p_1,\dots,p_r)} to be the predicate

\displaystyle  p_1 \dots p_{r-1} p_r^{\beta+1} \leq D \ \ \ \ \ (20)

for some parameter

\displaystyle  1 \leq \beta \leq s \ \ \ \ \ (21)

to be chosen later. The combinatorial sieve with this choice of predicate is known as the beta sieve or Rosser-Iwaniec sieve at sieve level {D}. Observe that if {p_1 \dots p_r} lies in {{\mathcal D}_\pm} for some {r \geq 2} and {z > p_1 > \dots > p_r}, then at least one of {p_1 \dots p_{r-1} p_r^{\beta+1} \leq D} or {p_1 \dots p_{r-2} p_{r-1}^{\beta+1} \leq D} will hold, and hence {p_1 \dots p_r \leq D} since {\beta \geq 1}. If {r \leq 1}, we have the same conclusion thanks to (19) since {s \geq 1}. Thus under the hypotheses (21), (19), the beta sieves are indeed upper and lower bound sieves at level {D}.
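This level bound can be confirmed mechanically. The following sketch (our own illustrative code, not from the text) enumerates the upper-bound beta-sieve support {{\mathcal D}_+} defined by the predicate (20) and checks that every element is at most {D}:

```python
def primes_below(z):
    """All primes p < z by trial division (fine for small z)."""
    return [p for p in range(2, z) if all(p % q for q in range(2, int(p**0.5) + 1))]

def beta_sieve_support(z, D, beta):
    """Enumerate the upper-bound beta-sieve support D_+: products
    d = p_1 ... p_m of decreasing primes below z such that the predicate
    (20), p_1 ... p_{r-1} p_r^(beta+1) <= D, holds at every odd r <= m."""
    primes = sorted(primes_below(z), reverse=True)  # build decreasing chains
    out = []
    def extend(i0, d, r):
        out.append(d)
        for i in range(i0, len(primes)):
            p = primes[i]
            # impose (20) only at odd depth r + 1; even depths are unconstrained
            if (r + 1) % 2 == 1 and d * p ** (beta + 1) > D:
                continue
            extend(i + 1, d * p, r + 1)
    extend(0, 1, 0)
    return out

D, s, beta = 10**6, 3.0, 2          # beta <= s, as required by (21)
z = int(round(D ** (1 / s)))        # z = D^{1/s}
support = beta_sieve_support(z, D, beta)
# as claimed in the text, every element of D_+ is at most the sieve level D
assert all(d <= D for d in support)
```

The even-depth elements are unconstrained by (20), yet still land below {D}, because the predicate imposed one step earlier already forces this, exactly as in the argument above.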

Now we estimate the quantities (14), (15) for the beta sieve coefficients {\lambda^\pm_d := \mu(d) 1_{{\mathcal D}_\pm}(d)} for some multiplicative function {g} with {0 \leq g(p) \leq 1} for all {p}. In order to actually compute asymptotics, we assume the Mertens-like axiom

\displaystyle  V(w) \ll_\kappa \left(\frac{\log z}{\log w}\right)^\kappa V(z) \ \ \ \ \ (22)

whenever {2 \leq w \leq z}, where {V(z) := \prod_{p < z} (1 - g(p))} and {\kappa>0} is fixed; we refer to {\kappa} as the sieve dimension. From Mertens’ theorems, we see that these axioms will in particular be obeyed if

\displaystyle  g(p) \leq \frac{\kappa}{p} + O_\kappa(\frac{1}{p^2})

and


\displaystyle  g(p) \leq 1-c

for all primes {p} and some fixed {c>0} depending only on {\kappa}.

Let {g} be as above. From (17), (18), (22) we see that the quantities (14), (15) are both of the form

\displaystyle  V(z) \left( 1 + O_\kappa \left( \sum_r \sum_{d \in {\mathcal E}_r} g(d) \left( \frac{\log z}{\log p_*(d)} \right)^\kappa \right)\right). \ \ \ \ \ (23)

We now estimate the error term; our estimates will be somewhat crude in order to simplify the calculations, but this will already suffice for many applications. First observe that if {z > p_1 > \dots > p_r} then

\displaystyle  p_1 \dots p_{r-1} p_r^{\beta+1} \leq z^{r+\beta} = D^{(r+\beta)/s}.

We conclude that the condition {A_r(p_1,\dots,p_r)} is automatically satisfied when {r+\beta \leq s}, and so the {r} summand in (23) can be restricted to the range {r > s - \beta}.

Next, note that if {d = p_1 \dots p_r \in {\mathcal E}_r} for some {z > p_1 > \dots > p_r}, then from (20) we have

\displaystyle  p_1 \dots p_{r'-1} p_{r'}^{\beta+1} \leq D

for all {1 \leq r' < r} with the same parity as {r}, which implies (somewhat crudely) that

\displaystyle  p_1 \dots p_{r'-1} p_{r'}^{\beta} \leq D

for all {1 \leq r' < r}, regardless of parity (note that the {r'=1} case {p_1^\beta \leq D} follows from (19) and (21), since {p_1 \leq z = D^{1/s}} and {\beta \leq s}). We rearrange this inequality as

\displaystyle  \frac{D}{p_1 \dots p_{r'}} \geq \left( \frac{D}{p_1 \dots p_{r'-1}}\right)^{\frac{\beta-1}{\beta}}

which iterates to give

\displaystyle  \frac{D}{p_1 \dots p_{r-1}} \geq D^{(\frac{\beta-1}{\beta})^{r-1}}.

On the other hand, from the definition of {{\mathcal E}_r} we have

\displaystyle  p_1 \dots p_{r-1} p_r^{\beta+1} > D

which leads to a lower bound on {p_r = p_*(d)}:

\displaystyle  p_r > D^{(\frac{\beta-1}{\beta})^{r-1} / (\beta+1)}

\displaystyle  > D^{(\frac{\beta-1}{\beta})^{r} / \beta}

\displaystyle  > z^{(\frac{\beta-1}{\beta})^{r}}

so in particular

\displaystyle  \frac{\log z}{\log p_*(d)} < \left(\frac{\beta}{\beta-1}\right)^r.

We can thus estimate (23) somewhat crudely as

\displaystyle  V(z) \left( 1 + O_\kappa \left( \sum_{r \geq s-\beta} \sum_{z > p_1 > \dots > p_r > z^{(\frac{\beta-1}{\beta})^{r}}} g(p_1) \dots g(p_r) \left(\frac{\beta}{\beta-1}\right)^{\kappa r} \right)\right).

For each fixed {r}, we can estimate

\displaystyle  \sum_{z > p_1 > \dots > p_r > z^{(\frac{\beta-1}{\beta})^{r}}} g(p_1) \dots g(p_r) \leq \frac{1}{r!} (\sum_{z > p \geq z^{(\frac{\beta-1}{\beta})^{r}}} g(p))^r

\displaystyle  \leq \frac{1}{r!} (\log \prod_{z > p \geq z^{(\frac{\beta-1}{\beta})^{r}}} (1-g(p))^{-1})^r

\displaystyle  = \frac{1}{r!} ( \log( V(z^{(\frac{\beta-1}{\beta})^{r}}) / V(z) ) )^r

\displaystyle  \ll \frac{1}{r!} ( \kappa r \log \frac{\beta}{\beta-1} + O(1) )^r

\displaystyle  \ll ( \kappa e \log \frac{\beta}{\beta-1} + O(1/r) )^r

thanks to (22) and the crude bound {\frac{r^r}{r!} \leq e^r} coming from the Taylor expansion of {e^r}. If we choose {\beta} large enough so that

\displaystyle  \kappa e (\log \frac{\beta}{\beta-1}) \left(\frac{\beta}{\beta-1}\right)^{\kappa} \leq \frac{1}{e} \ \ \ \ \ (24)

(note that the left-hand side decays like {\kappa e/\beta} as {\beta \rightarrow \infty}, so such a {\beta} exists), we can thus estimate (23) by

\displaystyle  V(z) \left( 1 + O_\kappa \left( \sum_{r \geq s-\beta} (\frac{1}{e} + O(1/r))^r \right)\right)

which sums to

\displaystyle  V(z) ( 1 + O_{\kappa,\beta}( e^{-s} ) ).

We conclude

Lemma 17 (Fundamental lemma of sieve theory) Let {\kappa > 0}. If {z = D^{1/s}} for some {D,s \geq 1}, then there exist combinatorial upper and lower sieve coefficients {(\lambda^\pm_d)_{d \in {\mathcal D}}} supported in {{\mathcal D} := \{ d | P(z): d \leq D \}} such that

\displaystyle  \sum_{d \in {\mathcal D}} \lambda^\pm_d g(d) = V(z) ( 1 + O_{\kappa}( e^{-s} ) ) \ \ \ \ \ (25)

for any multiplicative function {g} with {0 \leq g(p) \leq 1} for all primes {p}, obeying the bounds (22).

Informally, the fundamental lemma shows that one can get exponentially close to the Legendre sieve limit of {V(z)} as long as the sieve level {D} is a large power of the sifting range {z}.

Proof: Without loss of generality we may assume that {s} is sufficiently large depending on {\kappa}, since for small {s} we may simply reduce {z} (for the purposes of the upper bound sieve) or use the trivial lower bound sieve {0}. But then we may find {\beta = \beta(\kappa)} obeying (24) and less than {s}, and the claim follows. \Box

Exercise 18 Improve the {O_\kappa( e^{-s} )} error in (25) to {O_{\kappa,\varepsilon}( e^{-(1-\varepsilon) s \log s} )} for any {\varepsilon>0}. (It is possible to show that this error term is essentially best possible, at least in the linear dimension case {\kappa = 1}, by using asymptotics for rough numbers (Exercise 28 of Supplement 4), but we will not do so here.)
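As a numerical illustration of the bracketing property underlying (25), one can compare the beta-sieve sums {\sum_d \lambda^\pm_d g(d)} with the Legendre value {V(z)} for the toy density {g(p) = 1/p}. The sketch below (with illustrative parameter choices) is ours, not part of the argument:

```python
def primes_below(z):
    """All primes p < z by trial division."""
    return [p for p in range(2, z) if all(p % q for q in range(2, int(p**0.5) + 1))]

def beta_sieve_sum(z, D, beta, upper):
    """Sum of mu(d) g(d) over the beta-sieve support, with g(p) = 1/p.
    The predicate (20) is imposed at odd depths for the upper bound sieve
    and at even depths for the lower bound sieve."""
    primes = sorted(primes_below(z), reverse=True)
    parity = 1 if upper else 0
    total = 0.0
    def extend(i0, d, r):
        nonlocal total
        total += (-1) ** r / d           # mu(d) g(d) for g(d) = 1/d
        for i in range(i0, len(primes)):
            p = primes[i]
            if (r + 1) % 2 == parity and d * p ** (beta + 1) > D:
                continue                 # discard via the trivial sieve 0
            extend(i + 1, d * p, r + 1)
    extend(0, 1, 0)
    return total

z, beta = 30, 2
D = z ** 3                               # sieve level D = z^s with s = 3
V = 1.0
for p in primes_below(z):
    V *= 1 - 1 / p                       # the Legendre-sieve value V(z)
upper = beta_sieve_sum(z, D, beta, upper=True)
lower = beta_sieve_sum(z, D, beta, upper=False)
# the truncated sieves bracket V(z), as (25) predicts
assert lower <= V <= upper
```

For these parameters the truncation genuinely discards terms, so the two sums bracket {V(z)} with a gap of size consistent with the {O_{\kappa}(e^{-s})} error.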

The fundamental lemma is often used as a preliminary sifting device to remove small primes, in preparation for a more sophisticated sieve to deal with medium primes. But one can of course use it directly. Indeed, thanks to Theorem 5, we now have an adequate answer to Problem 3 in the regime where the sieve dimension {\kappa} is fixed and the sifting level is small compared to the level of distribution:

Corollary 19 (Fundamental lemma, applied directly) Let {\kappa > 0}, and let {z = D^{1/s}} for some {s, D \geq 1}. Suppose that {g} is a multiplicative function obeying the bounds (22) for all {2 \leq w \leq z}, where {V(w) := \prod_{p<w} (1-g(p))}. For each {p \leq z}, let {E_p} be a set of integers, and let {(a_n)_{n \in {\bf Z}}} be a finitely supported sequence of non-negative reals such that

\displaystyle  \sum_{n \in E_d} a_n = X g(d) + r_d

for all square-free {d \leq D}, where {X > 0} and {r_d \in {\bf R}}, and {E_d := \bigcap_{p|d} E_p}. Then

\displaystyle  \sum_{n \not \in \bigcup_{p \leq z} E_p} a_n = (1 + O_\kappa(e^{-s})) X V(z) + O( \sum_{d\leq D: \mu^2(d)=1} |r_d| ).

We now specialise to the twin prime problem in Problem 2. Here, {X = x}, and {g(p)=\omega(p)/p} where {\omega(p)=2} for odd {p}, with {\omega(2)=1}. In particular we may take {\kappa = 2}. We set {z := x^{1/u}} for some fixed {u > 1}, and {D = x^{1-o(1)}} for some quantity {o(1)} going sufficiently slowly to zero, so {s = (1-o(1)) u}. From (8) we have

\displaystyle  V(z) = (1+o(1)) \frac{u^2}{e^{2\gamma}} \frac{2\Pi_2}{\log^2 x}

and from (6) and the divisor bound we have

\displaystyle  |r_d| \leq \tau(d) \ll x^{o(1)}

for any {d \leq D}, as {x \rightarrow \infty}. From the fundamental lemma, we conclude that

\displaystyle  \pi_2(x, x^{1/u}) = \frac{u^2}{e^{2\gamma}} (1 + O( e^{-u} ) + o(1)) \frac{2\Pi_2 x}{\log^2 x} + O( x^{o(1)} D ).

If we let the {o(1)} term in {D = x^{1-o(1)}} decay sufficiently slowly, then the final error term in the above expression can be absorbed into the second error term, and so

\displaystyle  \pi_2(x, x^{1/u}) = \frac{u^2}{e^{2\gamma}} (1 + O( e^{-u} ) + o(1)) \frac{2\Pi_2 x}{\log^2 x}. \ \ \ \ \ (26)

Setting {u=2}, we conclude in particular that

\displaystyle  \pi_2(x, x^{1/2}) \ll \frac{x}{\log^2 x}

which implies that there are {O( \frac{x}{\log^2 x} )} pairs of twin primes in {[x/2,x]}, since every such pair has no prime factor less than {x^{1/2}} and is thus counted by {\pi_2(x,x^{1/2})}. This leads to a famous theorem of Brun:

Exercise 20 (Brun’s theorem) Show that the sum {\sum_{p: p+2 \in {\mathcal P}} \frac{1}{p}} is convergent.

Alternatively, if one applies (26) with {u} a sufficiently large natural number, we see that

\displaystyle  \pi_2(x,x^{1/u}) \gg \frac{x}{\log^2 x}

for all sufficiently large {x}. This implies that there are {\gg \frac{x}{\log^2 x}} numbers {n} in {[1,x]} with the property that {n, n+2} are both not divisible by any prime less than or equal to {x^{1/u}}. In particular (if {n \leq x-2}), {n} and {n+2} both have at most {u} prime factors. We conclude a version of the twin prime conjecture for almost primes:

Corollary 21 There is an absolute constant {u} such that there are infinitely many natural numbers {n} such that {n, n+2} are each the product of at most {u} primes.

Exercise 22 By going through the beta sieve estimates carefully, show that one can take {u=20} in the above corollary. (One can do substantially better if one uses more sophisticated sieve estimates, and of course Chen’s theorem tells us that {u} can be taken as low as {2}; we will review the proof of Chen’s theorem in subsequent notes.)
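The almost-prime twins of Corollary 21 are also easy to exhibit numerically. The following sketch (illustrative parameters and names, not from the text) counts {n \leq x-2} with {n, n+2} both free of prime factors up to {x^{1/u}}:

```python
from math import log

def primes_upto(n):
    return [p for p in range(2, n + 1)
            if all(p % q for q in range(2, int(p**0.5) + 1))]

def sifted_twin_count(x, u):
    """Count n <= x - 2 with n and n + 2 both free of prime factors
    <= x^(1/u); each such n (and n + 2) is then a product of fewer than
    u primes, i.e. an almost prime."""
    small = primes_upto(int(x ** (1.0 / u)))
    rough = lambda n: all(n % p for p in small)
    return sum(1 for n in range(2, x - 1) if rough(n) and rough(n + 2))

x, u = 20000, 4
c = sifted_twin_count(x, u)
# the count has the order of magnitude x / log^2 x predicted by (26)
assert 0 < c < 100 * x / log(x) ** 2
```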

In the above corollary, both elements of the pair {n,n+2} are permitted to be almost prime rather than prime. It is possible to use the fundamental lemma to improve upon this by forcing {n} (for instance) to be prime. To do this, we run the twin prime sieve differently; rather than start with all integers and sieve out two residue classes modulo each prime {p}, we instead start with the primes and sieve out one residue class modulo {p}. More precisely, we consider the quantity

\displaystyle  \sum_{n \not \in \bigcup_{p \leq z} E_p} f(n)

where {E_p} is the residue class {-2\ (p)} and

\displaystyle  f(n) := \Lambda(n) 1_{[1,x-2]}(n)

(say). Observe that for even {d}, the sum {\sum_{n \in E_d} f(n)} is at most {O( \log^{O(1)} x)}. For odd {d}, we expect from the prime number theorem in arithmetic progressions that

\displaystyle  \sum_{n \in E_d} f(n) = \frac{1}{\phi(d)} \frac{x}{2} + r_d

for some small {r_d}. Thus it is natural to apply Corollary 19 with {X := \frac{x}{2}} and {g(d) := 1_{(d,2)=1} \frac{1}{\phi(d)}}, so that {\kappa = 1}. From the Bombieri-Vinogradov theorem (see Theorem 17 of Notes 3), we see that

\displaystyle  \sum_{d \leq x^{1/2-\varepsilon}} |r_d| \ll_{A,\varepsilon} x \log^{-A} x

for any fixed {\varepsilon, A> 0}. We thus take {D := x^{1/2-\varepsilon}} for some small fixed {\varepsilon>0} and {z := x^{1/u}}, and apply Corollary 19 to conclude that

\displaystyle  \sum_{n \not \in \bigcup_{p \leq z} E_p} f(n) = (1 + O( e^{-(1/2-\varepsilon) u})) \frac{x}{2} \prod_{2 < p < z}( 1 - \frac{1}{\phi(p)})

\displaystyle  + O_{A,\varepsilon}( x \log^{-A} x ).

For {u} a sufficiently large fixed constant, we conclude from Mertens’ theorem that

\displaystyle  \sum_{n \not \in \bigcup_{p \leq z} E_p} f(n) \gg \frac{x}{\log x}

for {x} large enough. The contribution for those {n} for which {n} is a power of a prime, rather than a prime, is easily seen to be negligible, and we conclude that there are {\gg \frac{x}{\log^2 x}} primes {p} less than {x-2} such that {p+2} has no factors less than {x^{1/u}}, and thus is the product of at most {u} primes. We thus have the following improvement of Corollary 21:

Proposition 23 There is an absolute constant {u} such that there are infinitely many primes {p} such that {p+2} is the product of at most {u} primes.

More generally, the fundamental lemma (combined with equidistribution results such as the Bombieri-Vinogradov theorem) lets one prove analogues of various unsolved conjectures about primes, in which one either (a) is content with upper bounds of the right order of magnitude, or (b) is willing to replace primes with a suitable notion of almost prime. We give some examples in the exercises below.

Exercise 24 (Approximations to the prime {k}-tuples conjecture) Let {(h_1,\dots,h_k)} be an admissible {k}-tuple of integers for some {k \geq 1}, thus the {h_i} are all distinct, and for any {p}, the number {\omega(p)} of residue classes occupied by {h_1,\dots,h_k\ (p)} is never equal to {p}. Let {{\mathfrak S}} denote the singular series

\displaystyle  {\mathfrak S} := \prod_p (1-\frac{1}{p})^{-k} (1-\frac{\omega(p)}{p}). \ \ \ \ \ (27)

Exercise 25 (Approximations to the even Goldbach conjecture) If {N} is an even natural number, define the singular series

\displaystyle  {\mathfrak S}(N) := 2 \Pi_2 \prod_{p>2: p|N} \frac{p-1}{p-2}.

  • (i) Show that the number of ways to write {N} as the sum of two primes {N=p_1+p_2} is {O( {\mathfrak S}(N) \frac{N}{\log^2 N} )} if {N} is sufficiently large. (This should be compared with the prediction of {(1+o(1)) {\mathfrak S}(N) \frac{N}{\log^2 N}} as {N \rightarrow \infty}, see Exercise 12 of Supplement 4.)
  • (ii) Show that if {u} is a sufficiently large natural number, then every sufficiently large even number {N} is expressible as the sum of two natural numbers {n_1,n_2}, each of which are the product of at most {u} prime factors.
  • (iii) Strengthen (ii) to include the requirement that {n_1} is prime.

Exercise 26 (Approximations to Legendre’s conjecture)

  • (i) Show that for any {x,y \geq 2}, the number of primes in the interval {[x,x+y]} is at most {O( \frac{y}{\log y} )}.
  • (ii) Show that if {u} is a sufficiently large natural number, then for all sufficiently large {x}, there is at least one natural number {n} between {x^2} and {(x+1)^2} that is the product of at most {u} primes.

Exercise 27 (Approximations to Landau’s fourth problem) Define the singular series

\displaystyle  {\mathfrak S} := \prod_p \frac{1-\omega(p)/p}{1-1/p}

where {\omega(p)} is the number of residue classes {a\ (p)} such that {a^2 + 1 =0\ (p)}.

  • (i) Show that the number of primes of the form {n^2+1} with {n \leq x} is {O( {\mathfrak S} \frac{x}{\log x} )} for sufficiently large {x}. (This should be compared with the prediction {(1+o(1)) {\mathfrak S} \frac{x}{\log x}}, see Exercise 15 of Supplement 4.)
  • (ii) Show that if {u} is a sufficiently large natural number, then there are an infinite number of numbers of the form {n^2+1} that are the product of at most {u} primes.

— 2. The Selberg upper bound sieve —

We now turn to the Selberg upper bound sieve, which is a simple but remarkably useful general-purpose upper bound sieve. We again follow the discussion of Friedlander and Iwaniec.

The idea of the sieve comes from the following trivial observation: if {P} is a squarefree number, {E_p} is a set of integers for each {p|P}, and {(\rho_d)_{d|P}} are arbitrary real numbers with {\rho_1 = 1}, then the function

\displaystyle  \left( \sum_{d|P} \rho_d 1_{E_d} \right)^2

is an upper bound sieve, since it is clearly non-negative and equals {1} outside of {\bigcup_{p|P} E_p}. Equivalently, the sequence

\displaystyle  \lambda^+_d := \sum_{d_1,d_2: [d_1,d_2] = d} \rho_{d_1} \rho_{d_2} \ \ \ \ \ (28)

for {d|P} is a sequence of upper bound sieve coefficients, where {[d_1,d_2]} is the least common multiple of {d_1,d_2}. If we set {D=R^2} and assume that the {\rho_d} are only non-zero for {d \leq R}, then this sequence of sieve coefficients will only be non-zero on {{\mathcal D} := \{ d|P: d \leq D \}}. We will refer to sieves (and sieve coefficients) constructed by this method as Selberg upper bound sieves (or Selberg upper bound sieve coefficients); they are also referred to as {\rho^2} sieves in the literature.
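The pointwise inequality that makes (28) an upper bound sieve can be checked directly in a toy model where {E_p} is the set of multiples of {p}: apart from {\rho_1 = 1}, the weights below are arbitrary, as in the text (the code itself is an illustrative sketch):

```python
from math import gcd

def selberg_upper_coeffs(rho):
    """lambda^+_d = sum over d1, d2 with [d1, d2] = d of rho_{d1} rho_{d2},
    as in (28); rho is a dict of weights on squarefree d with rho[1] = 1."""
    lam = {}
    for d1 in rho:
        for d2 in rho:
            l = d1 * d2 // gcd(d1, d2)   # lcm of squarefree d1, d2
            lam[l] = lam.get(l, 0.0) + rho[d1] * rho[d2]
    return lam

# toy check: with E_p = multiples of p, the sieve (sum_d rho_d 1_{E_d})^2
# is non-negative everywhere and equals 1 on the sifted integers
P_primes = [2, 3, 5]
rho = {1: 1.0, 2: -0.8, 3: -0.6, 5: -0.4, 6: 0.3, 10: 0.2, 15: 0.1}
lam = selberg_upper_coeffs(rho)
for n in range(1, 500):
    value = sum(c for d, c in lam.items() if n % d == 0)
    if all(n % p for p in P_primes):     # n survives the sieve
        assert abs(value - 1.0) < 1e-9
    else:
        assert value >= -1e-9
```

On integers surviving the sieve only the {d=1} term contributes, so the sieve equals exactly {1} there; elsewhere it is a square and hence non-negative.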

A key advantage of the Selberg sieve is that the coefficients {\rho_d} for {1 < d \leq R} are completely unconstrained within the real numbers, and the sieve depends in a quadratic fashion on these coefficients, so the problem of optimising the Selberg sieve for a given application usually reduces to an unconstrained quadratic optimisation problem, which can often be solved exactly in many cases. Of course, the optimal Selberg sieve need not be the globally optimal sieve amongst all upper bound sieves; but in practice the Selberg sieve tends to give quite good results even if it might not be the globally optimal choice.

Selberg sieves can be used for many purposes, but we begin with the classical sieve problem posed in Problem 10. To avoid some degeneracies we will assume that

\displaystyle  0 < g(p) < 1 \ \ \ \ \ (29)

for all {p|P}. Restricting to upper bound sieve coefficients of the Selberg form, the quantity (14) becomes

\displaystyle  \sum_{d_1,d_2|P} \rho_{d_1} \rho_{d_2} g( [d_1,d_2] ). \ \ \ \ \ (30)

This is a quadratic form in the coefficients {\rho_d}; our task is to minimise this amongst all choices of coefficients with {\rho_1=1}, with {\rho_d} vanishing for {d>R}.

Observe that any {d_1,d_2|P} can be expressed as {d_1 = a_1 b}, {d_2 = a_2 b} with {a_1 a_2 b | P}, in which case {[d_1,d_2] = a_1 a_2 b}. By multiplicativity, we may thus write (30) as

\displaystyle  \sum_{a_1 a_2 b|P} \rho_{a_1 b} \rho_{a_2 b} g(b) g(a_1) g(a_2)

which we can rearrange as

\displaystyle  \sum_{b|P} g(b) \sum_{a_1,a_2 | P/b: (a_1,a_2)=1} \rho_{a_1b} g(a_1) \rho_{a_2 b} g(a_2).

The inner sum almost factorises as a square, but we have the coprimality constraint {(a_1,a_2)=1} to deal with. The standard trick for dealing with this constraint is Möbius inversion:

\displaystyle  1_{(a_1,a_2)=1} = \sum_{d |(a_1,a_2)} \mu(d)

\displaystyle  = \sum_{d|a_1,a_2} \mu(d);

inserting this and writing {a_1 = d c_1}, {a_2 = d c_2}, we can now write (30) as

\displaystyle  \sum_{b|P} g(b) \sum_{d|P/b} \mu(d) g(d)^2 \sum_{c_1,c_2 | P/db} \rho_{dbc_1} g(c_1) \rho_{dbc_2} g(c_2)

which now does factor as a sum of squares:

\displaystyle  \sum_{b|P} g(b) \sum_{d|P/b} \mu(d) g(d)^2 \left( \sum_{c | P/db} \rho_{dbc} g(c)\right)^2.

Writing {e = bcd}, this becomes

\displaystyle  \sum_{bd|P} \frac{\mu(d)}{g(b)} \left( \sum_{bd | e | P} \rho_e g(e)\right)^2.

Writing {bd = m}, this becomes

\displaystyle  \sum_{m|P} h(m) y_m^2 \ \ \ \ \ (31)


where

\displaystyle  y_m := \frac{\mu(m)}{h(m)} \sum_{m | e | P} \rho_e g(e)

and {h} is defined on factors of {P} by the Dirichlet convolution

\displaystyle  \frac{1}{h} = \mu * \frac{1}{g}.

Thus {h} is the multiplicative function with

\displaystyle  h(p) := \frac{g(p)}{1-g(p)} \ \ \ \ \ (32)

for all {p|P}; we extend {h} by zero outside of the factors of {P}. In particular {h} is non-negative.

The coefficients {(y_m)_{m|P}} are a transform of the original coefficients {(\rho_e)_{e|P}}. For instance, if {P} was replaced by a single prime {p}, then the coefficients {(y_1,y_p)} would be related to the coefficients {(\rho_1,\rho_p)} by the transform

\displaystyle  y_1 = \rho_1 + g(p) \rho_p; \quad y_p = -\frac{g(p)}{h(p)} \rho_p

which is inverted as

\displaystyle  \rho_1 = y_1 + h(p) y_p; \quad \rho_p = - \frac{1}{g(p)} h(p) y_p.

More generally, we have the following inversion formula:

Exercise 28 (Möbius-type inversion formula) Verify that

\displaystyle  \rho_e = \frac{\mu(e)}{g(e)} \sum_{e|m|P} h(m) y_m \ \ \ \ \ (33)

for any {e|P}.

From this formula, we see in particular that {y_m} is supported on {m \leq R} if and only if {\rho_e} is supported on {e \leq R}. The constraint {\rho_1=1} can now be written in terms of the {y_m} coefficients as

\displaystyle  \sum_{m|P} h(m) y_m = 1. \ \ \ \ \ (34)

Our task is now to minimise the quadratic form (31) amongst all coefficients {y_m} supported on {m \leq R} obeying (34). The Cauchy-Schwarz inequality gives the lower bound

\displaystyle  \sum_{m|P} h(m) y_m^2 \geq J^{-1}

where {J := \sum_{m|P: m \leq R} h(m)}, with equality precisely when {y_m = 1/J} for {m \leq R}. Inserting this back into (33), we arrive at the optimised Selberg sieve coefficients

\displaystyle  \rho_e = \frac{1}{J} \frac{\mu(e)}{g(e)} \sum_{e|m|P: m \leq R} h(m) \ \ \ \ \ (35)

and the quantity (14) is equal to {\frac{1}{J}}.

We have a pleasant size bound on the coefficients {\rho_e}:

Lemma 29 We have {|\rho_e| \leq 1} for all {e|P}.

Proof: Writing {m = ek}, we have

\displaystyle  |\rho_e| = \frac{1}{J g(e)} \sum_{k| P/e: k \leq R/e} h(e) h(k)


and

\displaystyle  J \geq \sum_{k|P/e: k \leq R/e} \sum_{d|e} h(d) h(k)

\displaystyle  = \sum_{k \leq R/e} h(k) \frac{h(e)}{g(e)}

since one can compute that {\frac{h}{g} = h * 1}. The claim follows. \Box
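The optimised weights (35), the bound of Lemma 29, and the value {1/J} of the minimised quadratic form can all be verified numerically on a small squarefree {P}; the sketch below uses the toy density {g(p) = 1/p} (all helper names are ours):

```python
from math import gcd

def mult_eval(f, d, primes):
    """Evaluate a multiplicative function (given on primes) at a squarefree d."""
    v = 1.0
    for p in primes:
        if d % p == 0:
            v *= f(p)
    return v

def optimized_selberg(primes, g, R):
    """The optimised Selberg weights (35): rho_e = mu(e)/(g(e) J) times the
    sum of h(m) over multiples m of e with m | P and m <= R, where
    h(p) = g(p)/(1 - g(p)) and J = sum_{m | P, m <= R} h(m)."""
    divisors = [1]
    for p in primes:
        divisors += [d * p for d in divisors]
    h = lambda p: g(p) / (1 - g(p))
    J = sum(mult_eval(h, m, primes) for m in divisors if m <= R)
    mu = lambda d: (-1) ** sum(1 for p in primes if d % p == 0)
    rho = {e: mu(e) / (mult_eval(g, e, primes) * J)
              * sum(mult_eval(h, m, primes)
                    for m in divisors if m % e == 0 and m <= R)
           for e in divisors if e <= R}
    return rho, J

primes = [2, 3, 5, 7]
g = lambda p: 1.0 / p
rho, J = optimized_selberg(primes, g, R=10)
assert abs(rho[1] - 1.0) < 1e-9                       # normalisation rho_1 = 1
assert all(abs(v) <= 1 + 1e-9 for v in rho.values())  # Lemma 29
# the minimised quadratic form (30) equals 1/J, matching the text
form = sum(rho[d1] * rho[d2]
           * mult_eval(g, d1 * d2 // gcd(d1, d2), primes)
           for d1 in rho for d2 in rho)
assert abs(form - 1 / J) < 1e-9
```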

The upper bound sieve coefficients {\lambda^+_d} in (28) are then bounded, using Lemma 29, by

\displaystyle  |\lambda^+_d| \leq \sum_{[d_1,d_2]=d} 1 = \tau_3(d)

where {\tau_3(d) = \sum_{d_1 d_2 d_3 = d} 1} is the third divisor function. Applying Theorem 5, we thus have a general upper bound for Problem 3:

Theorem 30 (Selberg sieve upper bound) Let {P} be a squarefree natural number. For each prime {p} dividing {P}, let {E_p} be a set of integers. Let {g} be a multiplicative function with {0 < g(p) < 1} for all {p}, and define the multiplicative function {h} by (32), extending {h} by zero outside of the factors of {P}. Let {D = R^2} for some {R \geq 1}. Suppose that {(a_n)_{n \in {\bf Z}}} is a finitely supported sequence of non-negative reals, such that

\displaystyle  \sum_{n \in E_d} a_n = g(d) X + r_d \ \ \ \ \ (36)

for all {d \leq D} and some reals {X, r_d}. Then

\displaystyle  \sum_{n \not \in\bigcup_{p|P} E_p} a_n \leq \frac{X}{\sum_{d \leq R} h(d)} + \sum_{d \leq D} \tau_3(d) |r_d|.

Remark 31 To compare this bound against the “expected” value of {X \prod_{p|P} (1-g(p))}, observe from Euler products that

\displaystyle \prod_{p|P} (1-g(p)) = \frac{1}{\sum_d h(d)}.
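The identity of Remark 31 amounts to {\sum_{d|P} h(d) = \prod_{p|P} (1 + h(p)) = \prod_{p|P} (1-g(p))^{-1}}, which the following short check confirms for a toy density (illustrative code, not from the text):

```python
def divisors_of_P(primes):
    """All squarefree divisors of P = product of the given primes."""
    divs = [1]
    for p in primes:
        divs += [d * p for d in divs]
    return divs

def check_remark_31(primes, g):
    """Compare prod_{p|P} (1 - g(p)) with 1 / sum_{d|P} h(d), where
    h(p) = g(p) / (1 - g(p)) and h is multiplicative."""
    lhs = 1.0
    for p in primes:
        lhs *= 1 - g(p)
    total = 0.0
    for d in divisors_of_P(primes):
        v = 1.0
        for p in primes:
            if d % p == 0:
                v *= g(p) / (1 - g(p))
        total += v
    return lhs, 1.0 / total

lhs, rhs = check_remark_31([2, 3, 5, 7, 11, 13], lambda p: 1.0 / p)
assert abs(lhs - rhs) < 1e-12
```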

The quantity {\sum_{d \leq R} h(d)} can often be computed in practice using standard methods for controlling sums of multiplicative functions, such as Theorem 27 of Notes 1. We illustrate this with the following sample application of the Selberg sieve:

Theorem 32 (Sieving an interval) Let {k, C} be fixed natural numbers, and let {x \geq 2}. For each prime number {p \leq \sqrt{x}}, let {E_p} be the union of {\omega(p)} residue classes modulo {p}, where {\omega(p) = k} for all {p \geq C}, and {\omega(p) < p} for all {p}. Then for any {\varepsilon > 0}, one has

\displaystyle  |\{ n: n \leq x \} \backslash \bigcup_{p \leq \sqrt{x}} E_p| \leq (2^k k! + \varepsilon) {\mathfrak S} \frac{x}{\log^k x} \ \ \ \ \ (37)

whenever {x} is sufficiently large depending on {k,C,\varepsilon}, where {{\mathfrak S}} is the singular series

\displaystyle  {\mathfrak S} := \prod_p (1-\frac{1}{p})^{-k} (1 - \frac{\omega(p)}{p}). \ \ \ \ \ (38)

Proof: We apply Theorem 30 with {a_n := 1_{1 \leq n \leq x}}, {X := x}, {P := \prod_{p \leq \sqrt{x}} p}, and {g(p) := \frac{\omega(p)}{p}}. We set {D := x^{1-\varepsilon}} and {R := D^{1/2}}. For any {d|P}, {E_d} is the union of {\prod_{p|d} \omega(p)} residue classes modulo {d}, and thus by (36) we have

\displaystyle  |r_d| \leq \prod_{p|d} \omega(p)

and in particular

\displaystyle  |r_d| \ll_{k,C} \tau_k(d).

From Theorem 30, we thus can upper bound the left-hand side of (37) by

\displaystyle  \frac{x}{\sum_{d \leq R} h(d)} + O_{k,C}( \sum_{d \leq D} \tau_3(d) \tau_k(d) ).

By the divisor bound and the choice of {D} we have

\displaystyle  \sum_{d \leq D} \tau_3(d) \tau_k(d) \ll_{k,C,\varepsilon} \frac{x}{\log^{k+1} x}

so it will suffice to show (after adjusting {\varepsilon}) that

\displaystyle  \sum_{d \leq R} h(d) = \frac{1}{(2 + O(\varepsilon))^k k! {\mathfrak S}} \log^k x + O_{k,C,\varepsilon}( \log^{k-1} x ).

But from Theorem 27(ii) of Notes 1 (with {g(n)} replaced by {n h(n)}), the left-hand side is

\displaystyle  \frac{1}{k!} {\mathfrak H} \log^k R + O_{k,C,\varepsilon}( \log^{k-1} R )


where

\displaystyle  {\mathfrak H} = \prod_p (1 - \frac{1}{p})^k (1 + h(p)).

Since {h(p) = \frac{g(p)}{1-g(p)} = \frac{\omega(p)}{p-\omega(p)}}, we see that {{\mathfrak H} = {\mathfrak S}^{-1}}, and the claim follows (after adjusting {\varepsilon} appropriately). \Box

The upper bound in (37) is off by a factor of about {2^k k!} from what one expects to be the truth. The following exercises give some quick applications of Theorem 32.

Exercise 33 (Brun-Titchmarsh inequality) If {\varepsilon > 0}, and {y} is sufficiently large depending on {\varepsilon}, establish the Brun-Titchmarsh inequality

\displaystyle  \pi(x+y) - \pi(x) \leq \frac{(2+\varepsilon)y}{\log y}

for all {x > 1}, where {\pi(x)} is the number of primes less than {x}. More generally, if {\varepsilon > 0}, {a\ (q)} is a primitive residue class, and {y} is sufficiently large depending on {\varepsilon,q}, establish the Brun-Titchmarsh inequality

\displaystyle  \pi(x+y; a\ (q)) - \pi(x; a\ (q)) \leq \frac{(2+\varepsilon)y}{\phi(q) \log(y/q)}

for all {x>1}, where {\pi(x; a\ (q))} is the number of primes less than {x} that lie in {a\ (q)}. Comparing this with the prime number theorem in arithmetic progressions, we see that the Brun-Titchmarsh inequality is off from the truth by a factor of two when {x = O(y)}; however, the Brun-Titchmarsh inequality is applicable even when {x} is extremely large compared with {y}, whereas the prime number theorem gives no non-trivial bound in this regime. The {\varepsilon} can in fact be deleted in these inequalities, a result of Montgomery and Vaughan (using a refinement of the large sieve, discussed below).
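As a single empirical data point (not a proof, and with arbitrary illustrative parameters), one can spot-check the Brun-Titchmarsh shape for an interval far from the origin:

```python
from math import log

def prime_count_in(lo, hi):
    """Number of primes in (lo, hi] by trial division (modest ranges only)."""
    return sum(1 for n in range(max(lo, 1) + 1, hi + 1)
               if n > 1 and all(n % q for q in range(2, int(n**0.5) + 1)))

# the interval (x, x+y] contains at most about (2 + eps) y / log y primes,
# even though x is considerably larger than y
x, y, eps = 10**5, 10**4, 0.3
count = prime_count_in(x, x + y)
assert 0 < count <= (2 + eps) * y / log(y)
```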

Exercise 34 (Upper bound for {k}-tuple conjecture) Let {(h_1,\dots,h_k)} be an admissible {k}-tuple (as in Exercise 24), and let {{\mathfrak S}} be the singular series (27). Show that

\displaystyle  |\{ n \leq x: n+h_1,\dots,n+h_k \in {\mathcal P}\}| \leq (2^k k! + o(1)) {\mathfrak S} \frac{x}{\log^k x} \ \ \ \ \ (39)

as {x \rightarrow \infty}. (This should be compared with the prime tuples conjecture, which predicts

\displaystyle  |\{ n \leq x: n+h_1,\dots,n+h_k \in {\mathcal P}\}| = (1 + o(1)) {\mathfrak S} \frac{x}{\log^k x}

as {x \rightarrow \infty}.)

As with the fundamental lemma, one can use the Selberg sieve together with the Bombieri-Vinogradov theorem to sift out a set of primes, rather than a set of intervals:

Exercise 35 (Sieving primes) Let {k, C} be fixed natural numbers, and let {x \geq 2}. For each prime number {p \leq \sqrt{x}}, let {E_p} be the union of {\omega(p)} primitive residue classes modulo {p}, where {\omega(p) = k-1} for all {p \geq C}, and {\omega(p) < p-1} for all {p}. Then for any {\varepsilon > 0}, one has

\displaystyle  |\{ p \in {\mathcal P}: p \leq x \} \backslash \bigcup_{p \leq \sqrt{x}} E_p| \leq (4^{k-1} (k-1)! + \varepsilon \ \ \ \ \ (40)

\displaystyle  + O_{k,C,\varepsilon}( \frac{1}{\log x} )) {\mathfrak S} \frac{x}{\log^k x}

where {{\mathfrak S}} is the singular series

\displaystyle  {\mathfrak S} := \prod_p (1-\frac{1}{p})^{1-k} (1 - \frac{\omega(p)}{p-1}).

(You will need Exercise 23 of Notes 3 to control the {\tau_3} terms arising from the remainder term.) If one assumes the Elliott-Halberstam conjecture (see Exercise 22 of Notes 3), show that the factor of {4^{k-1}} here can be lowered to {2^{k-1}}.

Exercise 36 Show that the quantity {2^k k!} in (39) may be replaced by {4^{k-1} (k-1)!}, which is a superior bound for {k \leq 3}. (It remains an open question to improve upon the bound {\min( 2^k k!, 4^{k-1} (k-1)! )} without assuming any further unproven conjectures.) Assuming the Elliott-Halberstam conjecture, show that one can replace {2^k k!} instead by {2^{k-1} (k-1)!}.

We have approached the analysis of the Selberg sieve as a “discrete” optimisation problem, in which one is trying to optimise the parameters {(\rho_d)_{d|P; d \leq R}} or {(y_m)_{m|P; m \leq R}} that are indexed by the discrete set {\{ d|P: d \leq R \}}. It is however also instructive to consider a “continuous” version of the optimisation problem, in which the coefficients {\rho_d} or {y_m} are described by a continuous function of {d} or {m}; this is a slightly more restrictive class of sieves, but turns out to lead to numerical bounds that come very close to the discrete optimum, and allow for the use of methods from calculus of variations to come into play when using the Selberg sieve in more sophisticated ways. (See these lecture notes of Selberg for further discussion of how the continuous problem serves as a good approximation of the discrete one.) We give a basic illustration of this by presenting an alternate proof of (a very slightly weaker form of) Theorem 32 in which we do not use the choice (35) for the sieve weights {\rho_e}, but instead choose weights of the form

\displaystyle  \rho_d := \mu(d) f( \frac{\log d}{\log R} ) \ \ \ \ \ (41)

for some smooth compactly supported function {f: {\bf R} \rightarrow {\bf R}} that vanishes on {[1,+\infty)}, and which equals {1} at the origin; compare this smoothly truncated version of the Möbius function with the more abrupt truncations of the Möbius function that arise in combinatorial sieving. Such cases of the Selberg sieve were studied by several authors, most notably Goldston, Pintz, and Yildirim (although they made {f} a polynomial on {[0,1]}, rather than a smooth function). As in the previous proof of Theorem 32, we take {g(p) := \frac{\omega(p)}{p}}, {D := x^{1-\varepsilon}}, {R := D^{1/2}}, {P := \prod_{p \leq \sqrt{x}} p}, {X=x}, and {a_n := 1_{1 \leq n \leq x}}. The quantity (30) (or (14)) is then equal to

\displaystyle  \sum_{d_1,d_2|P} f( \frac{\log d_1}{\log R} ) f( \frac{\log d_2}{\log R} ) \mu(d_1) \mu(d_2) g( [d_1,d_2] ).

We can remove the constraints {d_1, d_2|P}, since the summands vanish outside of this range. To estimate this sum we use an argument related to that used to prove Proposition 9 from Notes 2. We first use Fourier inversion, applied to the function {u \mapsto e^u f(u)}, to obtain a representation of the form

\displaystyle  e^u f(u) = \int_{\bf R} F(t) e^{-itu}\ dt \ \ \ \ \ (42)

for some smooth, rapidly decreasing function {F: {\bf R} \rightarrow {\bf R}}, thus

\displaystyle  |F(t)| \ll_{f,A} (1+|t|)^{-A} \ \ \ \ \ (43)

for any {A>0}. We then have

\displaystyle  f( \frac{\log d_1}{\log R} ) = \int_{\bf R} \frac{F(t_1)}{d_1^{\frac{1+it_1}{\log R}}} dt_1


\displaystyle  f( \frac{\log d_2}{\log R} ) = \int_{\bf R} \frac{F(t_2)}{d_2^{\frac{1+it_2}{\log R}}} dt_2

and so by Fubini’s theorem (using the divisor bound {g(d) = O( d^{-1+o(1)} )}) we may write (30) as

\displaystyle  \int_{\bf R} \int_{\bf R} F(t_1) F(t_2) \sum_{d_1,d_2} \frac{\mu(d_1) \mu(d_2) g([d_1,d_2])}{d_1^{\frac{1+it_1}{\log R}} d_2^{\frac{1+it_2}{\log R}}}\ dt_1 dt_2,

and then by Euler products and the definition of {g} we may factorise this as

\displaystyle  \int_{\bf R} \int_{\bf R} F(t_1) F(t_2) G( t_1, t_2)\ dt_1 dt_2 \ \ \ \ \ (44)


where

\displaystyle  G(t_1,t_2) := \prod_{p} ( 1 - \frac{\omega(p)}{p^{1+\frac{1+it_1}{\log R}}} - \frac{\omega(p)}{p^{1+\frac{1+it_2}{\log R}}} + \frac{\omega(p)}{p^{1+\frac{1+it_1+1+it_2}{\log R}}} ).

We have the crude upper bound

\displaystyle  |G(t_1,t_2)| \ll_{k,C} \prod_p \exp( O_k( \frac{1}{p^{1+1/\log R}} ) )

which by the asymptotic {\sum_p \frac{1}{p^s} = \log \frac{1}{s-1} + O(1)} for {s > 1} (see equation (22) of Notes 1) implies that

\displaystyle  |G(t_1,t_2)| \ll_{k,C} \log^{O_k(1)} R.

From this and the rapid decrease of {F}, we see that the contribution of the cases {|t_1| \geq \sqrt{\log R}} or {|t_2| \geq \sqrt{\log R}} (say) to (44) is {O_{k,C,f,A}( \log^{-A} R )} for any {A>0}. Thus we can write (44) as

\displaystyle  \int_{|t_1|, |t_2| \leq \sqrt{\log R}} F(t_1) F(t_2) G( t_1, t_2)\ dt_1 dt_2 + O_{k,C,f,A}(\log^{-A} R).

Since {\zeta(s) = \prod_p (1 - \frac{1}{p^s})^{-1}}, we can factor

\displaystyle  G(t_1,t_2) = \frac{\zeta(1 + \frac{1+it_1+1+it_2}{\log R})^k}{\zeta(1+\frac{1+it_1}{\log R})^k \zeta(1+\frac{1+it_2}{\log R})^k} \prod_p E_p(t_1,t_2)

where {E_p(t_1,t_2)} is the local Euler factor

\displaystyle  \frac{( 1 - \frac{\omega(p)}{p^{1+\frac{1+it_1}{\log R}}} - \frac{\omega(p)}{1+p^{\frac{1+it_2}{\log R}}} + \frac{\omega(p)}{p^{1+\frac{1+it_1+1+it_2}{\log R}}} ) (1 - \frac{1}{p^{1+\frac{1+it_1+1+it_2}{\log R}}})^k}{(1-\frac{1}{p^{1+\frac{1+it_1}{\log R}}})^k (1-\frac{1}{p^{1+\frac{1+it_2}{\log R}}})^k}.

The point of this factorisation is that, thanks to a routine but somewhat tedious Taylor expansion, one has the asymptotics

\displaystyle  E_p(t_1,t_2) = 1 + O_{k,C}(\frac{1}{p^{3/2}})

for any complex {t_1,t_2} with imaginary part less than {\frac{1}{2} \log R} in magnitude, and hence (by the Cauchy integral formula)

\displaystyle  \partial_{t_1} E_p(t_1,t_2), \partial_{t_2} E_p(t_1,t_2) = O_{k,C}( \frac{1}{\log R} \frac{1}{p^{3/2}})

for any complex {t_1,t_2} with imaginary part less than {\frac{1}{4} \log R} in magnitude. By the fundamental theorem of calculus, we thus have

\displaystyle  E_p(t_1,t_2) = E_p(i,i) + O_{k,C}( \frac{1}{\sqrt{\log R}} \frac{1}{p^{3/2}} )

when {|t_1|, |t_2| \leq \sqrt{\log R}} are real. Also, observe from (38) that

\displaystyle  \prod_p E_p(i,i) = {\mathfrak S}

and hence

\displaystyle  \prod_p E_p(t_1,t_2) = {\mathfrak S} + O_{k,C}( \frac{1}{\sqrt{\log R}} ).

From this and the standard asymptotic {\zeta(s) = \frac{1}{s-1} + O(1)} for {s} close to {1}, we have

\displaystyle  G(t_1,t_2) = {\mathfrak S} \log^{-k} R \frac{(1+it_1)^k (1+it_2)^k}{(1+it_1+1+it_2)^k} + O_{k,C}( (1+|t_1|+|t_2|)^{O(1)} \log^{-k-1/2} R )

and thus we can write (44) as

\displaystyle  {\mathfrak S} \log^{-k} R \int_{|t_1|, |t_2| \leq \sqrt{\log R}} F(t_1) F(t_2) \frac{(1+it_1)^k (1+it_2)^k}{(1+it_1+1+it_2)^k} \ dt_1 dt_2

\displaystyle  + O_{k,C,f}( \log^{-k-1/2} R ).

From the rapid decrease of {F}, we may now remove the constraints on {t_1,t_2} and write this as

\displaystyle  {\mathfrak S} \log^{-k} R \int_{\bf R} \int_{\bf R} F(t_1) F(t_2) \frac{(1+it_1)^k (1+it_2)^k}{(1+it_1+1+it_2)^k} \ dt_1 dt_2

\displaystyle + O_{k,C,f}( \log^{-k-1/2} R ).

To handle the weight {\frac{(1+it_1)^k (1+it_2)^k}{(1+it_1+1+it_2)^k}}, we observe from dividing (42) by {e^u} and then differentiating {k} times that

\displaystyle  f^{(k)}(u) = (-1)^k \int_{\bf R} (1+it)^k F(t) e^{-(1+it)u}\ dt

and hence by Fubini’s theorem

\displaystyle  f^{(k)}(u)^2 = \int_{\bf R} \int_{\bf R} (1+it_1)^k (1+it_2)^k F(t_1) F(t_2) e^{-(1+it_1+1+it_2)u}\ dt_1 dt_2.

Using the Gamma function identity {\int_0^\infty e^{-su} u^{k-1}\ du = \frac{(k-1)!}{s^k}} for any {s} with {\hbox{Re}(s)>0}, we conclude from another application of Fubini’s theorem that

\displaystyle  \int_0^\infty f^{(k)}(u)^2 u^{k-1}\ du

\displaystyle = (k-1)! \int_{\bf R} \int_{\bf R} \frac{(1+it_1)^k (1+it_2)^k}{(1+it_1+1+it_2)^k} F(t_1) F(t_2)\ dt_1 dt_2.

Note from the support of {f} that we can restrict the {u} integral to {[0,1]}. Putting this all together, we see that with the choice of weights (41), the expression (14) takes the form

\displaystyle  \frac{1}{(k-1)!} {\mathfrak S} \log^{-k} R \int_0^1 f^{(k)}(u)^2 u^{k-1}\ du + O_{k,C,f}( \log^{-k-1/2} R ).

Recall that we are requiring {f(0)} to equal {1}, which after {k} applications of the fundamental theorem of calculus (or integration by parts) is equivalent to the constraint

\displaystyle  \int_0^1 f^{(k)}(u) \frac{u^{k-1}}{(k-1)!}\ du = (-1)^k.

Since {\int_0^1 u^{k-1}\ du = \frac{1}{k}}, we conclude from Cauchy-Schwarz that

\displaystyle  \int_0^1 f^{(k)}(u)^2 u^{k-1}\ du \geq k ((k-1)!)^2,

with equality occurring when {f^{(k)}(u) = (-1)^k k!}. Equality is not quite attainable with {f} smooth, but by a mollification one can select a smooth {f} supported on (say) {[-1,1]} with {f(0)=1} and

\displaystyle  \int_0^1 f^{(k)}(u)^2 u^{k-1}\ du \leq (1+\varepsilon) k ((k-1)!)^2

for any {\varepsilon > 0}. The expression (14) now takes the form

\displaystyle  (1+\varepsilon) k! {\mathfrak S} \log^{-k} R + O_{k,C,f}( \log^{-k-1/2} R ).

Meanwhile, from (41) we have {\rho_d = O_f(1)} and thus {\lambda^+_d \ll_f \tau_3(d)} much as before. We can then apply Theorem 5 as before to recover Theorem 32.
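As a quick numerical sanity check of the Cauchy-Schwarz step above, one can take the extremal profile {f(u) = (1-u)^k} on {[0,1]} (which satisfies {f(0)=1} and has constant {k}-th derivative {(-1)^k k!} there, although its extension by zero beyond {u=1} is only {C^{k-1}} rather than smooth, whence the mollification mentioned above) and verify that {\int_0^1 f^{(k)}(u)^2 u^{k-1}\ du} equals the lower bound {k((k-1)!)^2}. A minimal sketch using only the Python standard library; the midpoint-rule integrator is purely illustrative:

```python
import math

def weighted_energy(k, n=100000):
    # numerically evaluate \int_0^1 f^{(k)}(u)^2 u^{k-1} du for the
    # extremal profile f(u) = (1-u)^k, whose k-th derivative on [0,1]
    # is the constant (-1)^k * k!
    c = math.factorial(k) ** 2
    h = 1.0 / n
    return c * h * sum(((i + 0.5) * h) ** (k - 1) for i in range(n))

for k in range(1, 6):
    bound = k * math.factorial(k - 1) ** 2   # Cauchy-Schwarz lower bound
    assert abs(weighted_energy(k) - bound) / bound < 1e-3
```

(The agreement is exact up to quadrature error, since the integral evaluates in closed form to {(k!)^2/k = k((k-1)!)^2}.)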

Remark 37 The Selberg sieve is most naturally used as an upper bound sieve. However, it is possible to modify the Selberg sieve to produce a lower bound sieve in a number of ways. One way is to simply insert the Selberg sieve into the Buchstab identity (16); see for instance this paper of Ankeny and Onishi for an investigation of this approach. Another method is to multiply the Selberg upper bound sieve by a lower bound sieve {\Lambda_-}; somewhat surprisingly, even a very crude lower bound sieve such as the {k=1} Bonferroni sieve {\Lambda_- = 1 - \sum_{p|P} 1_{E_p}} can give reasonably good results for this purpose. See Friedlander and Iwaniec for further analysis of this lower bound sieve (sometimes known as the {\Lambda^2 \Lambda_-} sieve).

— 3. A multidimensional Selberg sieve, and bounded gaps between primes —

We now discuss a multidimensional variant of the Selberg sieve that establishes a recent result of Maynard (obtained independently by myself), giving further partial progress towards the Hardy-Littlewood prime tuples conjecture. For sake of exposition we shall omit some details; see this previous blog post for a more careful treatment; see also this survey of Granville.

More precisely, we sketch the proof of

Theorem 38 (Maynard’s theorem) Let {m} be a natural number, and let {k} be sufficiently large depending on {m}. Then for any admissible {k}-tuple {(h_1,\dots,h_k)}, there exist infinitely many natural numbers {n} such that at least {m+1} of the numbers {n+h_1,\dots,n+h_k} are prime.

Exercise 39 Show that Maynard’s theorem implies that the quantity {H_{m} := \lim \inf_{n \rightarrow \infty} p_{n+m}-p_n} is finite for each natural number {m}, where {p_n} denotes the {n^{th}} prime. (For the most recent bounds on {H_m}, see this Polymath paper.) In particular, the {m=1} case of Theorem 38 yields “bounded gaps between primes”, in that there is a finite {H} for which {p_{n+1}-p_n \leq H} for infinitely many {n}.

Even the {m=1} case of this theorem was only proven in 2013 by Zhang, building upon earlier work of Goldston, Pintz, and Yildirim. Zhang’s original value of {H} in the “bounded gaps” application was {H = 7 \times 10^7}; this has since been lowered several times, with the current record being {H=246}, due to the Polymath project. The original arguments of Zhang and Goldston-Pintz-Yildirim used a “one-dimensional” Selberg sieve, with coefficients given by (41), and required quite strong equidistribution hypotheses on the primes (stronger than what the Bombieri-Vinogradov theorem provides). However, the argument of Maynard relies on a more general and flexible “multi-dimensional” Selberg sieve, which has better numerical performance, and as such does not need any equidistribution result beyond the Bombieri-Vinogradov theorem. This sieve has since been used for a number of further applications in analytic number theory; see this web page for a list of papers in this direction.

Let {m, k, h_1,\dots,h_k} be as in Theorem 38. Suppose that, for sufficiently large {x}, we can find a non-negative function {\nu: {\bf N} \rightarrow {\bf R}^+} supported on {[x/2,x]} with the property that

\displaystyle  \sum_{i=1}^k \sum_n \nu(n) 1_{n+h_i \in {\mathcal P}} > m \sum_n \nu(n). \ \ \ \ \ (45)

Then by the pigeonhole principle, there exists at least one {n \in [x/2,x]} such that {\sum_{i=1}^k 1_{n+h_i \in {\mathcal P}} > m}, that is to say at least {m+1} of the numbers {n+h_1,\dots,n+h_k} are prime. Letting {x} go to infinity (keeping {m,k,h_1,\dots,h_k} fixed), we will obtain Theorem 38.

It remains to establish the existence of a function {\nu} with the required properties. Prior to the work of Goldston, Pintz, and Yildirim, it was only known (using tools such as the fundamental lemma of sieve theory, or off-the-shelf Selberg sieves) how to construct sieve weights {\nu} obeying (45) with {m} replaced by a small constant less than {1}. The earlier arguments of Zhang and Goldston-Pintz-Yildirim chose a weight {\nu} which was essentially of the Selberg sieve form

\displaystyle  \nu(n) := 1_{[x/2,x]}(n) \left( \sum_{d | (n+h_1) \dots (n+h_k)} \mu(d) f( \frac{\log d}{\log R} ) \right)^2

(compare with (41)), where {R = x^{\theta/2}} for some fixed {0 < \theta < 1}, and {f} was a smooth fixed compactly supported function. Omitting many calculations, their conclusion was that one could prove the {m=1} case of Theorem 38 once one had some sort of distribution result on the primes at level {\theta} (as defined in Exercise 22 of Notes 3) with {1/2 < \theta < 1}. Unfortunately, this just falls short of what the Bombieri-Vinogradov theorem provides, which is a level of distribution {\theta} for any {0 < \theta < 1/2}. One of the key contributions of Zhang was to obtain a partial distribution result at a value of {\theta} slightly above {1/2}, which when combined with (a modification of) the previous work of Goldston, Pintz, and Yildirim, was able to yield the {m=1} case of Theorem 38.

The approach of Maynard and myself was a little different, based on multidimensional Selberg sieves such as

\displaystyle  \nu(n) := 1_{[x/2,x]}(n) \times

\displaystyle  \left( \sum_{d_i | n+h_i \forall i=1,\dots,k} \mu(d_1) \dots \mu(d_k) f( \frac{\log d_1}{\log R}, \dots, \frac{\log d_k}{\log R} ) \right)^2

for some multidimensional smooth function {f: [0,+\infty)^k \rightarrow {\bf R}} supported on the simplex {\{ (t_1,\dots,t_k): t_1+\dots +t_k \leq 1 \}}; the one-dimensional sieve considered by Zhang and Goldston-Pintz-Yildirim is then essentially the case when {f(t_1,\dots,t_k)} is a function just of {t_1+\dots+t_k} on this simplex. Actually, to avoid some (very minor) issues concerning common factors between the {n+h_i}, and also to essentially eliminate the role of the singular series, it is convenient to work with a slight modification

\displaystyle  \nu(n) := 1_{n=b\ (W)} 1_{[x/2,x]}(n) \times \ \ \ \ \ (46)

\displaystyle  \left( \sum_{d_i | n+h_i \forall i=1,\dots,k} \mu(d_1) \dots \mu(d_k) f( \frac{\log d_1}{\log R}, \dots, \frac{\log d_k}{\log R} ) \right)^2

of the above sieve, where {W = \prod_{p \leq w} p} for some {w} slowly growing to infinity with {x} (e.g. {w = \log\log\log x} will do), and {b\ (W)} is a residue class such that {b+h_i\ (W)} is primitive for all {i=1,\dots,k} (such a {b} exists thanks to the admissibility of {(h_1,\dots,h_k)} and the Chinese remainder theorem). Note that for {w} large enough, the {n+h_i} have no common factors, and so the {d_1,\dots,d_k} are automatically coprime to each other and to {W}. (In Maynard’s paper, the coefficients {f( \frac{\log d_1}{\log R}, \dots, \frac{\log d_k}{\log R} )} were replaced by a sequence {\rho_{d_1,\dots,d_k}} which was not directly arising from a smooth function {f}, but instead the analogue of the {y_{m_1,\dots,m_k}} coefficients from the preceding section were chosen to be of the form {F( \frac{\log d_1}{\log R}, \dots, \frac{\log d_k}{\log R} )} for some suitable function {F}. However, the differences between the two approaches are fairly minor.)

It is possible to estimate the right-hand side of (45) using a modification of the arguments of the previous section:

Exercise 40 Let {k, h_1,\dots,h_k, b, W, \theta, R, f, \nu} be as in the above discussion. Assume that {0 < \theta < 1}. If {w} goes to infinity at a sufficiently slow rate as {x \rightarrow \infty}, establish the asymptotic

\displaystyle  \sum_n \nu(n) = \left( \frac{W}{\phi(W)} \right)^k \frac{x}{2\log^k R} (I(f) + o(1))

as {x \rightarrow \infty}, where

\displaystyle  I(f) := \int_{[0,+\infty)^k} f_{1,\dots,k}(t_1,\dots,t_k)^2\ dt_1 \dots dt_k

and {f_{1,\dots,k}} denotes the mixed partial derivative {\frac{\partial^k f}{\partial t_1 \dots \partial t_k}} in each of the {k} variables {t_1,\dots,t_k}.

Now we consider a term on the left-hand side, say {\sum_n \nu(n) 1_{n+h_k \hbox{ prime}}}. When {n+h_k} is prime, the quantity {d_k} in (46) is necessarily {1}, so (46) simplifies to a {k-1}-dimensional version of itself:

\displaystyle  \nu(n) = 1_{n=b\ (W)} 1_{[x/2,x]}(n) \times

\displaystyle \left( \sum_{d_i | n+h_i \forall i=1,\dots,k-1} \mu(d_1) \dots \mu(d_{k-1}) f( \frac{\log d_1}{\log R}, \dots, \frac{\log d_{k-1}}{\log R}, 0 ) \right)^2.

If {\theta < \frac{1}{2}}, one can control the effect of the weight {1_{n+h_k \hbox{ prime}}} using the Bombieri-Vinogradov theorem, and after some moderately complicated calculations one obtains

Exercise 41 Let {k, h_1,\dots,h_k, b, W, \theta, R, f, \nu} be as in the above discussion. Assume that {0 < \theta < 1/2}. If {w} goes to infinity at a sufficiently slow rate as {x \rightarrow \infty}, establish the asymptotic

\displaystyle  \sum_n \nu(n) 1_{n+h_k \in {\mathcal P}} = \left( \frac{W}{\phi(W)} \right)^k \frac{x}{2\log^{k-1} R \log x} (J_k(f) + o(1))

as {x \rightarrow \infty}, where

\displaystyle  J_k(f) := \int_{[0,+\infty)^{k-1}} f_{1,\dots,k-1}(t_1,\dots,t_{k-1},0)^2\ dt_1 \dots dt_{k-1}

and {f_{1,\dots,k-1}} denotes the mixed partial derivative {\frac{\partial^{k-1} f}{\partial t_1 \dots \partial t_{k-1}}} in the {k-1} variables {t_1,\dots,t_{k-1}}. Similarly for permutations of the {1,\dots,k} indices.

In view of these two calculations, one can establish Maynard’s theorem as soon as one can find {k} and {f} for which

\displaystyle  \sum_{i=1}^k J_i(f) > \frac{m}{\theta/2} I(f)

where {J_i} is defined similarly to {J_k} but with the role of the {i} and {k} indices swapped. This is now a problem purely in calculus of variations rather than number theory. It is thus natural to take {\theta} close to {1/2} (though, for the qualitative analysis performed here, actually any non-zero choice of {\theta} independent of {k} will suffice). Our task is now to find {k,f} with

\displaystyle  \sum_{i=1}^k J_i(f) > 4 m I(f).

The optimal choice of {f} for a given {k} is not known for any {k>2}, but one can generate a “good enough” choice of {f} by the following set of calculations. Firstly, we restrict to the case when {f(t_1,\dots,t_k)} is symmetric in the {t_1,\dots,t_k}, so that all the {J_i} are equal and our objective is now to satisfy the inequality

\displaystyle  kJ_k(f) > 4 m I(f).

Next, we make the change of variables {F := f_{1,\dots,k}}, thus {F} is smooth on {[0,+\infty)^k} and supported on the simplex {\{t_1+\dots+t_k \leq 1\}} but is otherwise arbitrary on this simplex. From the fundamental theorem of calculus, the desired inequality now becomes

\displaystyle  k \int_{[0,+\infty)^{k-1}} (\int_0^\infty F(t_1,\dots,t_k)\ dt_k)^2 dt_1 \dots dt_{k-1}

\displaystyle > 4m \int_{[0,+\infty)^k} F(t_1,\dots,t_k)^2\ dt_1 \dots dt_k.

We select a function {F} of the truncated product form

\displaystyle  F(t_1,\dots,t_k) = k^{1/2} g(k t_1) \dots k^{1/2} g(k t_k) 1_{t_1 + \dots +t_k < 1}

where {g: [0,+\infty) \rightarrow {\bf R}^+} is a smooth non-negative function normalised so that

\displaystyle  \int_0^\infty g(t)^2\ dt = 1 \ \ \ \ \ (47)

so that

\displaystyle  \int_{[0,+\infty)^k} F(t_1,\dots,t_k)^2\ dt_1 \dots dt_k \leq 1

and our objective (after a rescaling) is now to get

\displaystyle  \int_{[0,+\infty)^{k-1}} \left(\int_0^\infty g(t_k) 1_{t_1+\dots+t_k \leq k}\ dt_k\right)^2 g(t_1)^2 dt_1 \dots g(t_{k-1})^2 dt_{k-1}

\displaystyle  > 4m.

The emergence of the function {g} as a parameter in the multidimensional sieve is a feature that is not present in previous sieves (at least not without sharply curtailing the support of {g}), and is the main reason why this sieve performs better than previous sieves for this problem.

We interpret the above optimisation problem probabilistically. Let {X_1,X_2,\dots} be independent random variables taking values in {[0,+\infty)}, each with probability density function {g(t)^2\ dt}. Our task is now to obtain the inequality

\displaystyle  \mathop{\bf E} \left(\int_0^{\max(k-X_1-\dots-X_{k-1},0)} g(t_k)\ dt_k\right)^2 > 4m.

Suppose that we can locate a fixed function {g} (independent of {k}) obeying (47), with

\displaystyle  \mathop{\bf E} X_i = \int_0^\infty t g(t)^2\ dt < 1 \ \ \ \ \ (48)

and

\displaystyle  \int_0^\infty g(t)\ dt = \infty. \ \ \ \ \ (49)

Then from the law of large numbers, {k-X_1-\dots-X_{k-1}} goes to infinity almost surely as {k \rightarrow \infty}, and so from monotone convergence one has

\displaystyle  \lim_{k \rightarrow \infty} \mathop{\bf E} (\int_0^{\max(k-X_1-\dots-X_{k-1},0)} g(t_k)\ dt_k)^2 = \infty

and the claim follows from (49) if {k} is sufficiently large depending on {m}.

We are thus reduced to the easy one-dimensional problem of producing a smooth function {g: [0,+\infty) \rightarrow {\bf R}} obeying the constraints (47), (48), (49). However, one can verify that the choice

\displaystyle  g(t) := \frac{1}{(1+t) \log(2+t)}

(barely) obeys (49) with {\int_0^\infty g(t)^2\ dt} and {\int_0^\infty t g(t)^2\ dt} both finite, and the claim follows by a routine rescaling.
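One can probe this choice of {g} numerically: the tails of {\int g(t)^2\ dt} and {\int t g(t)^2\ dt} beyond a large cutoff are small (reflecting convergence of (47) and (48) after rescaling), while the tail of {\int g(t)\ dt} keeps growing like {\log\log T}, reflecting the divergence (49). A minimal standard-library sketch; the midpoint-rule integrator and the cutoffs {10^3, 10^6} are illustrative choices only:

```python
import math

def g(t):
    # the candidate function g(t) = 1 / ((1+t) log(2+t))
    return 1.0 / ((1 + t) * math.log(2 + t))

def integrate(h, a, b, n=200000):
    # simple midpoint rule, adequate for these slowly varying integrands
    w = (b - a) / n
    return w * sum(h(a + (i + 0.5) * w) for i in range(n))

# tails over [10^3, 10^6]: the first two are small (convergent integrals)...
tail_g2 = integrate(lambda t: g(t) ** 2, 1e3, 1e6)
tail_tg2 = integrate(lambda t: t * g(t) ** 2, 1e3, 1e6)
# ...while the tail of \int g is of size log log 10^6 - log log 10^3,
# consistent with the (barely) divergent integral (49)
tail_g = integrate(g, 1e3, 1e6)
print(tail_g2, tail_tg2, tail_g)
```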

Exercise 42 By working through a more quantitative version of the above argument, establish Theorem 38 with {m \gg \log k} as {k \rightarrow \infty}. (Hint: One can replace the use of the law of large numbers with Chebyshev’s inequality.) It is not currently known how to obtain this theorem with any choice of {m} that grows faster than logarithmically in {k}; the current record is {m = (\frac{1}{4} + \frac{7}{600}) \log k + O(1)}, due to the Polymath project. It is known, though, that bounds on the order of {\log k} are the limit of the Selberg sieve method, and one either needs new sieves, or new techniques beyond sieve theory, to increase {m} beyond this rate; see the Polymath paper for further discussion.

One can use the multidimensional sieve to study the classical sieving problem of counting prime {k}-tuples, but unfortunately the numerology it gives is precisely the same as the numerology given by previous sieves:

Exercise 43 Let {(h_1,\dots,h_k)} be an admissible tuple, and let {w} go to infinity sufficiently slowly as {x \rightarrow \infty}.

  • (i) Use Exercise 40, followed by an optimisation of the function {f} (or {F}) to establish the bound

    \displaystyle  |\{x/2 \leq n \leq x: n = b\ (W); n+h_1,\dots,n+h_k \in {\mathcal P}\}|

    \displaystyle \leq (2^k k! + o(1)) \left( \frac{W}{\phi(W)}\right)^k \frac{x}{2\log^k x}

    as {x \rightarrow \infty}, uniformly for all {b\ (W)} with {b+h_1, \dots, b+h_k\ (W)} primitive. Use this to give an alternate proof of (39).

  • (ii) By using Exercise 41 instead of Exercise 40, repeat (i) but with {2^k k!} replaced by {4^{k-1} (k-1)!}.

— 4. The large sieve —

In the general discussion of sieving problems such as Problem 3 in previous sections, there was no structure imposed on the sets {E_p} beyond the fact that one could control the sums (9). However, as we have seen in practice, {E_p} is typically the union of some number {\omega(p)} of residue classes modulo {p}. It is possible to exploit this structure using Fourier analysis, leading to an approach to sieving known as the large sieve, and relying crucially on the analytic large sieve inequality from Notes 3. The large sieve tends to give results similar to that of the Selberg sieve; indeed, the two sieves are in some sense “dual” to each other and are powered by the same sort of {L^2}-based techniques (such as the Cauchy-Schwarz inequality). However, the large sieve methods can offer sharper control on error terms than the Selberg sieve if carried out carefully. As the name suggests, the large sieve is particularly useful in the regime when the number of residue classes modulo {p} being sieved out is large, for instance as large as a constant multiple of {p}. (When {\omega(p)} is very close to {p}, there is another sieve that gives even sharper results, namely the larger sieve of Gallagher, which we do not discuss further here.)

To apply the analytic large sieve to sieving problems, we need to exploit an “uncertainty principle” of Montgomery (discussed further in this previous blog post), that shows that functions which avoid certain residue classes in the physical domain are necessarily spread out in the frequency domain. More precisely, we have

Theorem 44 (Montgomery uncertainty principle) Let {f: {\bf Z} \rightarrow {\bf C}} be a finitely supported function, and let {\hat f: {\bf R}/{\bf Z} \rightarrow {\bf C}} be the Fourier transform {\hat f(\theta) := \sum_n f(n) e(n\theta)}, where {e(x) := e^{2\pi i x}}. Suppose that for each prime {p}, there are {\omega(p)} residue classes modulo {p} on which {f} vanishes identically for some {\omega(p) < p}. Then for any {\theta \in {\bf R}/{\bf Z}} and any squarefree natural number {q}, one has

\displaystyle  \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\hat f(\theta + \frac{a}{q})|^2 \geq h(q) |\hat f(\theta)|^2, \ \ \ \ \ (50)

where {h} is the multiplicative function

\displaystyle  h(q) := \mu^2(q) \prod_{p|q} \frac{\omega(p)}{p-\omega(p)}. \ \ \ \ \ (51)

Proof: Observe from the Chinese remainder theorem that if the inequality (50) holds for two coprime values {q=q_1, q=q_2} of {q}, then it also holds for {q=q_1q_2} (as every fraction {\frac{a}{q_1 q_2}} in {{\bf R}/{\bf Z}} with {(a,q_1q_2)=1} can be uniquely written as {\frac{a_1}{q_1} + \frac{a_2}{q_2}} with {(a_1,q_1)=(a_2,q_2)=1}). Thus it suffices to verify the claim when {q} is prime. By multiplying {f(n)} by {e(n\theta)} we can also reduce to the case {\theta=0}, thus our task is now to show that

\displaystyle  \sum_{a \in ({\bf Z}/p{\bf Z}) \backslash \{0\}} |\hat f(\frac{a}{p})|^2 \geq \frac{\omega(p)}{p-\omega(p)} |\hat f(0)|^2

or equivalently

\displaystyle  \sum_{a \in {\bf Z}/p{\bf Z}} |\hat f(\frac{a}{p})|^2 \geq \frac{p}{p-\omega(p)} |\hat f(0)|^2.

By the Plancherel identity in {{\bf Z}/p{\bf Z}}, the left-hand side can be written as

\displaystyle  p \sum_{b \in {\bf Z}/p{\bf Z}} |\sum_{n = b\ (p)} f(n)|^2

and the right-hand side is

\displaystyle  \frac{p}{p-\omega(p)} |\sum_{b \in {\bf Z}/p{\bf Z}} \sum_{n = b\ (p)} f(n)|^2.

Since the sum {\sum_{n = b\ (p)} f(n)} is only non-vanishing for at most {p-\omega(p)} choices of {b}, the claim follows from the Cauchy-Schwarz inequality. \Box
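To see the inequality (50) in action, here is a small numerical experiment (Python standard library only): we sieve three residue classes modulo {7} out of {[1,200]} and compare the two sides of (50) with {q=7} and {\theta=0}. The particular choices of {p}, removed classes, and support length are arbitrary; note that when the surviving residue classes are almost equally occupied, the Cauchy-Schwarz step in the proof is nearly sharp and the two sides come out close together.

```python
import cmath

def fhat(f, theta):
    # \hat f(\theta) = \sum_n f(n) e(n \theta), with e(x) = e^{2 pi i x}
    return sum(c * cmath.exp(2j * cmath.pi * n * theta) for n, c in f.items())

p, removed = 7, {0, 2, 5}            # sieve out omega(p) = 3 classes mod 7
N = 200
f = {n: 1.0 for n in range(1, N + 1) if n % p not in removed}

lhs = sum(abs(fhat(f, a / p)) ** 2 for a in range(1, p))
rhs = len(removed) / (p - len(removed)) * abs(fhat(f, 0)) ** 2
print(lhs, rhs)
assert lhs >= rhs                    # inequality (50) with q = p, theta = 0
```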

We then obtain the following variant of Theorem 32 or Theorem 30:

Theorem 45 (Arithmetic large sieve) Let {Q \geq 1}. For each prime number {p \leq Q}, let {E_p} be the union of {\omega(p)} residue classes modulo {p}, where {\omega(p) < p} for all {p}. Then for any {M \in {\bf Z}} and natural number {N \geq 1}, we have

\displaystyle  |\{ n \in {\bf Z}: M < n \leq M+N\} \backslash \bigcup_{p \leq Q} E_p| \ll \frac{N+Q^2}{J} \ \ \ \ \ (52)

where

\displaystyle  J := \sum_{q \leq Q} h(q)

and {h} is given by (51).

Proof: Let {f} be the indicator function of the set in (52). From Theorem 44 with {\theta=0}, and {q} summed up to {Q}, one has

\displaystyle  \sum_{q \leq Q} \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\hat f(\frac{a}{q})|^2 \geq J |\hat f(0)|^2.

Now observe from the basic identity

\displaystyle  \frac{a_1}{q_1} - \frac{a_2}{q_2} = \frac{a_1 q_2 - a_2 q_1}{q_1 q_2}

that the Farey sequence {\{ \frac{a}{q}: q \leq Q; a \in ({\bf Z}/q{\bf Z})^\times\}} is {\frac{1}{Q^2}}-separated in {{\bf R}/{\bf Z}}. We may thus invoke the analytic large sieve inequality (see Proposition 6 of Notes 3) to conclude that

\displaystyle  \sum_{q \leq Q} \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\hat f(\frac{a}{q})|^2 \ll (N + Q^2) \sum_n |f(n)|^2.

But from the construction of {f}, the left-hand side of (52) is equal to both {\sum_n |f(n)|^2} and {\hat f(0)}, and the claim follows. \Box

Note that if one uses the sharp form of the large sieve inequality (Remark 7 from Notes 3) then one can sharpen (52) to

\displaystyle  |\{ n \in {\bf Z}: M < n \leq M+N\} \backslash \bigcup_{p \leq Q} E_p| \leq \frac{N+Q^2}{J}.

Setting {Q = N^{1/2-\varepsilon}}, one can then recover sharper versions of Theorem 32 and its consequences. For instance, by working along these lines (and using a very sharp version of the analytic large sieve inequality that exploits the variable spacing of the set of fractions {\frac{a}{q}}), Montgomery and Vaughan were able to completely delete the error term in the Brun-Titchmarsh inequality (Exercise 33).
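As a concrete illustration of Theorem 45 in the sharp form {(N+Q^2)/J} noted above, the following standard-library sketch sieves {[1,N]} by the single class {0\ (p)} for every prime {p \leq Q}, so that {\omega(p)=1} and the survivors are {1}, the primes in {(Q,N]}, and a handful of {Q}-rough composites; the parameters {N} and {Q} are arbitrary illustrative choices.

```python
from math import isqrt

def primes_upto(n):
    # standard sieve of Eratosthenes
    s = bytearray([1]) * (n + 1)
    s[0:2] = b"\x00\x00"
    for p in range(2, isqrt(n) + 1):
        if s[p]:
            s[p * p::p] = bytearray(len(range(p * p, n + 1, p)))
    return [i for i in range(2, n + 1) if s[i]]

N, Q = 10 ** 5, 300          # illustrative choices, with Q ~ N^{1/2}
ps = primes_upto(Q)

def h(q):
    # h(q) = mu^2(q) prod_{p|q} omega(p)/(p - omega(p)) with omega(p) = 1,
    # i.e. h(q) = mu^2(q)/phi(q)
    val = 1.0
    for p in ps:
        if p * p > q:
            break
        if q % p == 0:
            q //= p
            if q % p == 0:
                return 0.0   # q is not squarefree
            val /= p - 1
    if q > 1:
        val /= q - 1
    return val

J = sum(h(q) for q in range(1, Q + 1))
bound = (N + Q * Q) / J      # the sharp large sieve bound (N + Q^2)/J

# count the actual survivors of the sieving
survivors = sum(1 for n in range(1, N + 1) if all(n % p for p in ps))
print(survivors, bound)
assert survivors <= bound
```

(Here {J} is roughly {\log Q + O(1)}, so the bound is comparable to the Brun-Titchmarsh-type estimates discussed earlier.)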

Remark 46 It is a general phenomenon that large sieve methods tend to give the same main term as the Selberg sieve; see Section 9.5 of Friedlander-Iwaniec for further discussion.

— 5. The parity problem —

We have seen that sieve theoretic techniques are reasonably good at obtaining upper bounds to various sieving problems, and can also provide lower bounds if the sifting limit is low. However, these methods have proven to be frustratingly ineffective at providing non-trivial lower bounds for the original problems that motivated the development of sieve theory in the first place, namely counting patterns of primes such as twin primes. While we do not fully understand the exact limitations of what sieve theory can and cannot do, we do have a convincing heuristic barrier, known as the parity barrier or parity problem and first raised explicitly by Selberg, which explains why sieves have so much difficulty in lower bounding (or efficiently upper bounding) counts involving primes; in particular, it explains why sieve-theoretic methods have thus far been unable to fully resolve any of the four Landau problems. The barrier is not completely rigorous, as it relies on the Möbius pseudorandomness principle (or Liouville pseudorandomness principle) discussed in Section 2 of Supplement 4. As such, our discussion in this section will be somewhat informal in nature; for instance, we will use vague symbols such as {\approx} without defining their meaning precisely.

Consider the general sieving problem in Problem 3, in which one tries to control the quantity {\sum_{n \not \in \bigcup_{p|P} E_p} a_n} via knowledge of the sums {\sum_{n \in E_d} a_n}, together with the hypothesis that the {a_n} are non-negative. In most situations, one expects this problem to be underdetermined; in particular, there will probably exist two different finitely supported sequences {(a_n)_{n \in {\bf Z}}}, {(b_n)_{n \in {\bf Z}}} of non-negative reals which are essentially indistinguishable with regards to the inputs of this sieve problem, in the sense that

\displaystyle  \sum_{n \in E_d} a_n \approx \sum_{n \in E_d} b_n \approx X_d

for all {d \in {\mathcal D}}, but which have very different outputs for this sieve problem, in that

\displaystyle  \sum_{n \not \in \bigcup_{p|P} E_p} a_n \not \approx \sum_{n \not \in \bigcup_{p|P} E_p} b_n .

Then any upper bound for Problem 3 with this choice of {E_d, X_d, {\mathcal D}, P} must exceed (at least approximately) the greater of {\sum_{n \not \in \bigcup_{p|P} E_p} a_n} and {\sum_{n \not \in \bigcup_{p|P} E_p} b_n}; similarly, any lower bound for this problem must be (approximately) less than the lesser of {\sum_{n \not \in \bigcup_{p|P} E_p} a_n} and {\sum_{n \not \in \bigcup_{p|P} E_p} b_n}.

As a special case of this observation, we have

Proposition 47 (Abstract parity barrier) (Informal) Let {E_d, X_d, {\mathcal D}, P} be as in Problem 3. Let {(a_n)_{n \in {\bf Z}}} be a finitely supported sequence of non-negative reals such that

\displaystyle  \sum_{n \in E_d} a_n \approx X_d

for all {d \in {\mathcal D}}. Suppose that we have an additional “parity function” {\omega: {\bf Z} \rightarrow \{-1,0,+1\}} with the property that

\displaystyle  \omega(n) = -1

for all {n \not \in \bigcup_{p|P} E_p} in the support of {a_n}, but such that

\displaystyle  \sum_{n \in E_d} a_n \omega(n) \approx 0

for all {d \in {\mathcal D}}. Then any upper bound for Problem 3 cannot be significantly smaller than {2 \sum_{n \not \in \bigcup_{p|P} E_p} a_n}, and any lower bound for Problem 3 cannot be significantly larger than zero. In particular, one does not expect to be able to show using such sieve methods that the support of {a_n} contains any appreciable number of elements outside of {\bigcup_{p|P} E_p}.

Proof: (Non-rigorous!) For the statement about the upper bound, apply the previous discussion to the sequence {b_n := a_n (1 - \omega(n))}. For the statement about the lower bound, apply the previous discussion to the sequence {b_n := a_n (1+\omega(n))}. \Box

Informally, the proposition asserts that if the sifted set {{\bf Z} \backslash \bigcup_{p|P} E_p} is sensitive to the parity function {\omega}, but the original pre-sifted sequence {a_n} and the sets {E_d} are not, then traditional sieve methods cannot give any non-trivial lower bounds, and any upper bounds must be off from the truth by a factor of at least two.

In practice, one can heuristically obey the hypotheses of this proposition if one takes {\omega(n)} to be some variant of the Liouville function {\lambda(n)}, which counts the parity of the number of prime factors of {n}, in which case the parity problem asserts (roughly speaking) that the sifting problem must allow both numbers with an even number of prime factors and numbers with an odd number of prime factors to survive if one is to have any hope of good lower bounds. For instance:

Corollary 48 (Parity barrier for twin primes) (Very informal) One cannot use traditional sieve methods to resolve the twin prime conjecture.

Proof: (Very non-rigorous! Also assumes Liouville pseudorandomness) As mentioned in the introduction, there are a number of different ways to try to approach the twin prime conjecture by sieve methods. Consider for instance the approach in which one starts with the primes {A} in {[x/2+2,x]} and removes the residue class {E_p = -2\ (p)} for each {p \leq \sqrt{x}}. The quantity {\lambda(n+2)} will then be {-1} on the sifted set {A \backslash \bigcup_{p|P} E_p} (since {n+2} will be prime on that set), while the Liouville pseudorandomness heuristic predicts that

\displaystyle  \sum_{n \in E_d} 1_A(n) \lambda(n+2) = \sum_{x/2+2 \leq n \leq x: n = -2\ (d)} 1_{\mathcal P}(n) \lambda(n+2)

is negligible for all {d \leq x} (say). The claim then follows from Proposition 47. \Box
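As a concrete sanity check of this setup (my own illustration, not from the notes, with the arbitrary choice {x = 2000}): after sifting out the classes {-2\ (p)} for {p \leq \sqrt{x}}, every survivor {n} has {n+2} free of small prime factors and hence prime, so {\lambda(n+2)} is identically {-1} on the sifted set.

```python
def primes_up_to(m):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (m + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(m ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [n for n in range(2, m + 1) if sieve[n]]

def liouville(n):
    """Liouville function lambda(n) = (-1)^(number of prime factors
    of n, counted with multiplicity)."""
    count, d = 0, 2
    while d * d <= n:
        while n % d == 0:
            n //= d
            count += 1
        d += 1
    if n > 1:  # leftover prime factor
        count += 1
    return (-1) ** count

x = 2000
small_primes = primes_up_to(int(x ** 0.5))           # the primes p <= sqrt(x)
prime_set = set(primes_up_to(x + 2))
A = [n for n in primes_up_to(x) if n >= x // 2 + 2]  # the starting set A

# sift out the residue classes n = -2 (mod p), i.e. p | n+2, for p <= sqrt(x)
survivors = [n for n in A if all((n + 2) % p != 0 for p in small_primes)]

# on the sifted set, n+2 has no prime factor <= sqrt(x), hence is prime,
# so lambda(n+2) is identically -1 there
print(all(liouville(n + 2) == -1 for n in survivors), len(survivors))
```

The survivors are precisely the lower members of twin prime pairs in {[x/2+2,x]}, which is exactly the configuration the parity obstruction says sieve methods cannot count from below.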

Exercise 49 (Informal) Develop similar parity obstructions for the other sieve-theoretic approaches to the twin prime conjecture. In the case when one is sieving out the integers by two residue classes {0, -2\ (p)} for each prime {p}, argue that any sieve-theoretic bound must in fact be off from the truth by a factor of at least four. (Hint: experiment with weights such as {(1 - \lambda(n)) (1-\lambda(n+2))}.)

Exercise 50 (Informal) Argue why any sieve-theoretic upper bound on prime {k}-tuples that is based on equidistribution estimates for the natural numbers and primes must be off from the truth by a factor of at least {2^{k-1}}. (This should be compared with the known upper bounds of {2^k k!} and {4^{k-1} (k-1)!} discussed in previous sections. It is not clear at present which of these bounds is closest to the true limit of sieve theory.)

The above argument is far from watertight, and there have been successful attempts to circumvent the parity barrier. For instance:

  • (i) The parity barrier does not prohibit the possibility of producing a very small number of prime (or near-prime) patterns, even if one cannot hope to get lower bounds that are of the “right” order of magnitude. For instance, it was shown unconditionally by Goldston, Graham, Pintz, and Yildirim that there exist infinitely many pairs of consecutive numbers {n,n+1}, each of which is the product of exactly four primes. Cramér type random models (see Supplement 4) predict the number of such pairs up to {x} to be approximately {\frac{x}{\log^2 x}} (ignoring some factors of {\log\log x}), and the parity barrier prevents one from getting lower bounds of this magnitude; but the construction of Goldston, Graham, Pintz, and Yildirim only produces a quite sparse set of pairs (about {\frac{x}{\log^3 x}} or so) and so evades the parity barrier. Using a very different class of arguments, it is also possible to produce analogues of patterns like twin primes in the function field setting (in which the role of the primes is played by irreducible polynomials over a finite field) by using very sparse sequences of polynomials for which irreducibility can be tested easily; see for instance this paper of Hall for the case of twin irreducible polynomials.
  • (ii) The parity barrier may disappear if one is somehow in a situation in which the Möbius or Liouville pseudorandomness principles break down. An extreme example of this is when one assumes the existence of one (or many) Siegel zeroes, which imply extremely high correlation between the Möbius or Liouville functions and a Dirichlet character, in strong violation of the pseudorandomness principle. A striking example of this case occurs in this paper of Heath-Brown, which demonstrates the truth of the twin prime conjecture if one assumes the existence of an infinite sequence of Siegel zeroes.
  • (iii) Traditional sieve theoretic methods only assume that one can control “linear” or “Type I” sums of the form {\sum_{n \in E_d} a_n}. However, in some cases it is also possible to control “bilinear” or “Type II” sums such as {\sum_n \sum_m \alpha_n \beta_m a_{nm}} for various coefficients {\alpha_n, \beta_m} of controlled size and support but uncontrolled phase. Note that multiplicative parity functions such as {\lambda} will be distinguishable with such sums, since {\sum_n \sum_m \alpha_n \beta_m \lambda(nm)} will not exhibit any cancellation if {\alpha_n, \beta_m} both oscillate with the sign of {\lambda(n), \lambda(m)} respectively; as such, the parity barrier is no longer in effect. In such situations it can be possible to go beyond the limits of traditional sieve theory, for instance by using decompositions similar to the Vaughan identity employed in Notes 3. Of course, one still has to actually establish the required bilinear sum asymptotics, and this typically requires some deep and non-trivial input outside of sieve theory. A model example here is the work of Friedlander and Iwaniec establishing the infinitude of primes of the form {a^2+b^4}. Here, the relevant bilinear input is obtained using the algebraic structure of the Gaussian integers and some non-trivial results on lattice counting problems.
  • (iv) Sieve theory is focused on “one-dimensional” sums, such as {\sum_{n \not \in \bigcup_{p|P} E_p} a_n}, or perhaps {\sum_{n \leq x} \Lambda(n) \Lambda(n+2)}. The presence of the parity problem relies on the ability to modify weights such as {a_n} or {\Lambda(n)} in such sums in such a fashion that the value of the sum dramatically changes. If one instead is trying to control a multi-dimensional sum, such as {\sum_{n,r \leq x} \Lambda(n) \Lambda(n+r) \Lambda(n+2r) \Lambda(n+3r)} (which roughly speaking counts arithmetic progressions of primes of length four), then there is no easy way to modify any of the factors of such a sum in the fashion required for the parity barrier to be in effect. For such multi-dimensional sums one can then hope to use additional estimates to produce non-trivial lower bounds. For instance, with Ben Green, we used Selberg sieves combined with the additional tool of Szemerédi’s theorem to obtain lower bounds for multi-dimensional sums similar to the one above, proving in particular that the primes contain arbitrarily long arithmetic progressions.

It is one of the major problems in analytic prime number theory to continue to find more ways to defeat the parity barrier, so that we can produce prime patterns in many more situations than what is currently possible. However, this has proven to be a surprisingly challenging task, and one which typically requires the injection of new ideas and methods external to sieve theory.

Remark 51 As mentioned previously, it is not known in general if the parity problem represents the only obstruction to sieve methods. However, under the additional hypothesis of the Elliott-Halberstam conjecture, and when focusing on “binary” problems such as the twin prime conjecture, there is a more detailed analysis of the parity problem due to Bombieri which indicates that, in some sense, the parity problem is indeed the only information not captured by sieve-theoretic methods, in that the asymptotic distribution of primes or almost primes in pairs {n,n+2} is completely determined by a single parameter {0 \leq \delta \leq 2} that represents the ratio between the actual number of twin primes and the expected number of twin primes; the parity problem prevents us from obtaining any further information about {\delta} other than that it lies between {0} and {2}, but once {\delta} is known, essentially all other asymptotics regarding the prime factorisation of {n} and {n+2} will also be known. See Chapter 16 of Friedlander-Iwaniec for further discussion.

Remark 52 For a slightly different way to formalise the parity problem, see this previous blog post. Among other things, it is shown in that post that the parity problem prevents one from proving Theorem 38 with any choice of {m,k} with {m > \lceil \frac{k}{2} \rceil} using purely sieve-theoretic methods, in particular ruling out a solution to the twin prime conjecture by sieve methods.

Filed under: 254A - analytic prime number theory, math.NT Tagged: fundamental lemma of sieve theory, parity problem, Selberg sieve, sieve theory

Terence Tao254A, Supplement 7: Normalised limit profiles of the log-magnitude of the Riemann zeta function (optional)

A major topic of interest of analytic number theory is the asymptotic behaviour of the Riemann zeta function {\zeta} in the critical strip {\{ \sigma+it: 0 < \sigma < 1; t \in {\bf R} \}} in the limit {t \rightarrow +\infty}. For the purposes of this set of notes, it is a little simpler technically to work with the log-magnitude {\log |\zeta|: {\bf C} \rightarrow [-\infty,+\infty]} of the zeta function. (In principle, one can reconstruct a branch of {\log \zeta}, and hence {\zeta} itself, from {\log |\zeta|} using the Cauchy-Riemann equations, or tools such as the Borel-Carathéodory theorem, see Exercise 40 of Supplement 2.)

One has the classical estimate

\displaystyle  \zeta(\sigma+it) = O( t^{O(1)} )

when {\sigma = O(1)} and {t \geq 10} (say), so that

\displaystyle  \log |\zeta(\sigma+it)| \leq O( \log t ). \ \ \ \ \ (1)

(See e.g. Exercise 37 from Supplement 3.) In view of this, let us define the normalised log-magnitudes {F_T: {\bf C} \rightarrow [-\infty,+\infty]} for any {T \geq 10} by the formula

\displaystyle  F_T( \sigma + it ) := \frac{1}{\log T} \log |\zeta( \sigma + i(T + t) )|;

informally, this is a normalised window into {\log |\zeta|} near {iT}. One can rephrase several assertions about the zeta function in terms of the asymptotic behaviour of {F_T}. For instance:

  • (i) The bound (1) implies that {F_T} is asymptotically locally bounded from above in the limit {T \rightarrow \infty}, thus for any compact set {K \subset {\bf C}} we have {F_T(\sigma+it) \leq O_K(1)} for {\sigma+it \in K} and {T} sufficiently large. In fact the implied constant depends only on the projection of {K} to the real axis.
  • (ii) For {\sigma > 1}, we have the bounds

    \displaystyle  |\zeta(\sigma+it)|, \frac{1}{|\zeta(\sigma+it)|} \leq \zeta(\sigma)

    which implies that {F_T} converges locally uniformly as {T \rightarrow +\infty} to zero in the region {\{ \sigma+it: \sigma > 1, t \in {\bf R} \}}.

  • (iii) The functional equation, together with the symmetry {\zeta(\sigma-it) = \overline{\zeta(\sigma+it)}}, implies that

    \displaystyle  |\zeta(\sigma+it)| = 2^\sigma \pi^{\sigma-1} |\sin \frac{\pi(\sigma+it)}{2}| |\Gamma(1-\sigma-it)| |\zeta(1-\sigma+it)|

    which by Exercise 17 of Supplement 3 shows that

    \displaystyle  F_T( \sigma+it ) = \frac{1}{2}-\sigma + F_T(1-\sigma+it) + o(1)

    as {T \rightarrow \infty}, locally uniformly in {\sigma+it}. In particular, when combined with the previous item, we see that {F_T(\sigma+it)} converges locally uniformly as {T \rightarrow +\infty} to {\frac{1}{2}-\sigma} in the region {\{ \sigma+it: \sigma < 0, t \in {\bf R}\}}.

  • (iv) From Jensen’s formula (Theorem 16 of Supplement 2) we see that {\log|\zeta|} is a subharmonic function, and thus {F_T} is subharmonic as well. In particular we have the mean value inequality

    \displaystyle  F_T( z_0 ) \leq \frac{1}{\pi r^2} \int_{z: |z-z_0| \leq r} F_T(z)

    for any disk {\{ z: |z-z_0| \leq r \}}, where the integral is with respect to area measure. From this and (ii) we conclude that

    \displaystyle  \int_{z: |z-z_0| \leq r} F_T(z) \geq O_{z_0,r}(1)

    for any disk with {\hbox{Re}(z_0)>1} and sufficiently large {T}; combining this with (i) we conclude that {F_T} is asymptotically locally bounded in {L^1} in the limit {T \rightarrow \infty}, thus for any compact set {K \subset {\bf C}} we have {\int_K |F_T| \ll_K 1} for sufficiently large {T}.
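As an aside, the definition of {F_T}, together with item (ii), can be illustrated with a short pure-Python computation in the region {\sigma > 1}, where the Dirichlet series for {\zeta} converges absolutely and can simply be truncated (my own sketch, not from the notes; the function names are invented):

```python
import cmath
import math

def zeta_re_gt1(s, N=20000):
    """Truncated Dirichlet series sum_{n<=N} n^{-s} for zeta(s);
    valid only for Re(s) > 1, with tail error O(N^{1-Re(s)})."""
    return sum(cmath.exp(-s * math.log(n)) for n in range(1, N + 1))

def F(T, sigma, t):
    """Normalised log-magnitude F_T(sigma + i t), computed here
    only for sigma > 1 via the truncated series."""
    z = zeta_re_gt1(complex(sigma, T + t))
    return math.log(abs(z)) / math.log(T)

# sanity check of the truncation: zeta(2) = pi^2/6
print(abs(zeta_re_gt1(complex(2, 0)) - math.pi ** 2 / 6))

# for sigma > 1, |zeta| is pinned between 1/zeta(sigma) and zeta(sigma),
# so F_T = O(1/log T) there, consistent with item (ii) above
print(F(1000.0, 2.0, 0.0))
```

Evaluating {F_T} inside the critical strip would of course require an analytic continuation such as the Riemann-Siegel formula rather than this naive truncation.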

From (iv) and the usual Arzela-Ascoli diagonalisation argument, we see that the {F_T} are asymptotically compact in the topology of distributions: given any sequence {T_n} tending to {+\infty}, one can extract a subsequence such that the {F_{T_n}} converge in the sense of distributions. Let us then define a normalised limit profile of {\log|\zeta|} to be a distributional limit {F} of a sequence of {F_{T_n}}; they are analogous to limiting profiles in PDE, and also to the more recent introduction of “graphons” in the theory of graph limits. Then by taking limits in (i)-(iv) we can say a lot about such normalised limit profiles {F} (up to almost everywhere equivalence, which is an issue we will address shortly):

  • (i) {F} is bounded from above in the critical strip {\{ \sigma+it: 0 \leq \sigma \leq 1 \}}.
  • (ii) {F} vanishes on {\{ \sigma+it: \sigma \geq 1\}}.
  • (iii) We have the functional equation {F(\sigma+it) = \frac{1}{2}-\sigma + F(1-\sigma+it)} for all {\sigma+it}. In particular {F(\sigma+it) = \frac{1}{2}-\sigma} for {\sigma<0}.
  • (iv) {F} is subharmonic.

Unfortunately, (i)-(iv) fail to characterise {F} completely. For instance, one could have {F(\sigma+it) = f(\sigma)} for any convex function {f(\sigma)} of {\sigma} that equals {0} for {\sigma \geq 1}, equals {\frac{1}{2}-\sigma} for {\sigma \leq 0}, and obeys the functional equation {f(\sigma) = \frac{1}{2}-\sigma+f(1-\sigma)}, and this would be consistent with (i)-(iv). One can also perturb such examples in a region where {f} is strictly convex to create further examples of functions obeying (i)-(iv). Note from subharmonicity that the function {\sigma \mapsto \sup_t F(\sigma+it)} is always going to be convex in {\sigma}; this can be seen as a limiting case of the Hadamard three-lines theorem (Exercise 41 of Supplement 2).
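One can verify numerically (my own sanity check, not part of the notes) that the extreme profile {f(\sigma) = \max(0, \frac{1}{2}-\sigma)} has the required boundary behaviour and satisfies the functional equation {f(\sigma) = \frac{1}{2}-\sigma+f(1-\sigma)}:

```python
def f(sigma):
    """The extreme limit profile max(0, 1/2 - sigma)."""
    return max(0.0, 0.5 - sigma)

# f vanishes for sigma >= 1, equals 1/2 - sigma for sigma <= 0,
# and satisfies f(sigma) = 1/2 - sigma + f(1 - sigma) for all sigma
for k in range(-20, 21):
    sigma = k / 10.0
    assert abs(f(sigma) - (0.5 - sigma + f(1.0 - sigma))) < 1e-12
print("functional equation verified on [-2, 2]")
```

The identity holds exactly because {\max(0,u) - \max(0,-u) = u} with {u = \frac{1}{2}-\sigma}.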

We pause to address one minor technicality. We have defined {F} as a distributional limit, and as such it is a priori only defined up to almost everywhere equivalence. However, due to subharmonicity, there is a unique upper semi-continuous representative of {F} (taking values in {[-\infty,+\infty)}), defined by the formula

\displaystyle  F(z_0) = \lim_{r \rightarrow 0^+} \frac{1}{\pi r^2} \int_{B(z_0,r)} F(z)\ dz

for any {z_0 \in {\bf C}} (note from subharmonicity that the expression in the limit is monotone nonincreasing as {r \rightarrow 0}, and is also continuous in {z_0}). We will now view this upper semi-continuous representative of {F} as the canonical representative of {F}, so that {F} is now defined everywhere, rather than up to almost everywhere equivalence.

By a classical theorem of Riesz, a function {F} is subharmonic if and only if the distribution {\Delta F} is a non-negative measure, where {\Delta := \frac{\partial^2}{\partial \sigma^2} + \frac{\partial^2}{\partial t^2}} is the Laplacian in the {\sigma,t} coordinates. Jensen’s formula (or Green’s theorem), when interpreted distributionally, tells us that

\displaystyle  \Delta \log |\zeta| = 2\pi \sum_\rho \delta_\rho

away from the real axis, where {\rho} ranges over the non-trivial zeroes of {\zeta}. Thus, if {F} is a normalised limit profile for {\log |\zeta|} that is the distributional limit of {F_{T_n}}, then we have

\displaystyle  \Delta F = \nu

where {\nu} is a non-negative measure which is the limit in the vague topology of the measures

\displaystyle  \nu_{T_n} := \frac{2\pi}{\log T_n} \sum_\rho \delta_{\rho - iT_n}.

Thus {\nu} is a normalised limit profile of the zeroes of the Riemann zeta function.

Using this machinery, we can recover many classical theorems about the Riemann zeta function by “soft” arguments that do not require extensive calculation. Here are some examples:

Theorem 1 The Riemann hypothesis implies the Lindelöf hypothesis.

Proof: It suffices to show that any limiting profile {F} (arising as the limit of some {F_{T_n}}) vanishes on the critical line {\{1/2+it: t \in {\bf R}\}}. But if the Riemann hypothesis holds, then the measures {\nu_{T_n}} are supported on the critical line {\{1/2+it: t \in {\bf R}\}}, so the normalised limit profile {\nu} is also supported on this line. This implies that {F} is harmonic outside of the critical line. By (ii) and unique continuation for harmonic functions, this implies that {F} vanishes on the half-space {\{ \sigma+it: \sigma \geq \frac{1}{2} \}} (and equals {\frac{1}{2}-\sigma} on the complementary half-space, by (iii)), giving the claim. \Box

In fact, we have the following sharper statement:

Theorem 2 (Backlund) The Lindelöf hypothesis is equivalent to the assertion that for any fixed {\sigma_0 > \frac{1}{2}}, the number of zeroes in the region {\{ \sigma+it: \sigma > \sigma_0, T \leq t \leq T+1 \}} is {o(\log T)} as {T \rightarrow \infty}.

Proof: If the latter claim holds, then for any {T_n \rightarrow \infty}, the measures {\nu_{T_n}} assign a mass of {o(1)} to any region of the form {\{ \sigma+it: \sigma > \sigma_0; t_0 \leq t \leq t_0+1 \}} as {n \rightarrow \infty} for any fixed {\sigma_0>\frac{1}{2}} and {t_0 \in {\bf R}}. Thus the normalised limiting profile measure {\nu} is supported on the critical line, and we can repeat the previous argument.

Conversely, suppose the claim fails. Then we can find a sequence {T_n} and {\sigma_0>\frac{1}{2}} such that {\nu_{T_n}} assigns a mass of {\gg 1} to the region {\{ \sigma+it: \sigma > \sigma_0; 0\leq t \leq 1 \}}. Extracting a normalised limiting profile, we conclude that the normalised limiting profile measure {\nu} is non-trivial somewhere to the right of the critical line, so the associated subharmonic function {F} is not harmonic everywhere to the right of the critical line. From the maximum principle and (ii) this implies that {F} has to be positive somewhere on the critical line, but this contradicts the Lindelöf hypothesis. (One has to take a bit of care in the last step since {F_{T_n}} only converges to {F} in the sense of distributions, but it turns out that the subharmonicity of all the functions involved gives enough regularity to justify the argument; we omit the details here.) \Box

Theorem 3 (Littlewood) Assume the Lindelöf hypothesis. Then for any fixed {\alpha>0}, the number of zeroes in the region {\{ \sigma+it: T \leq t \leq T+\alpha \}} is {(\frac{\alpha}{2\pi}+o(1)) \log T} as {T \rightarrow +\infty}.

Proof: By the previous arguments, the only possible normalised limiting profile for {\log |\zeta|} is {\max( 0, \frac{1}{2}-\sigma )}. Taking distributional Laplacians, we see that the only possible normalised limiting profile for the zeroes is Lebesgue measure on the critical line. Thus, {\nu_T( \{\sigma+it: 0 \leq t \leq \alpha \} )} can only converge to {\alpha} as {T \rightarrow +\infty}, and the claim follows. \Box
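As a numerical aside (my own check, not in the notes), the density {\frac{1}{2\pi} \log T} of zeroes per unit height can be compared against the main term of the classical Riemann-von Mangoldt formula {N(T) = \frac{T}{2\pi} \log \frac{T}{2\pi e} + \frac{7}{8} + O(\log T)}:

```python
import math

def N_main(T):
    """Main term of the Riemann-von Mangoldt formula for N(T), the
    number of nontrivial zeros rho with 0 < Im(rho) <= T."""
    return (T / (2 * math.pi)) * math.log(T / (2 * math.pi * math.e)) + 7.0 / 8

T, alpha = 1.0e12, 1.0
in_window = N_main(T + alpha) - N_main(T)          # zeros with T < Im(rho) <= T+alpha
predicted = (alpha / (2 * math.pi)) * math.log(T)  # (alpha/2pi) log T
print(in_window, predicted, in_window / predicted)
```

The ratio tends to {1} as {T \rightarrow \infty}; the discrepancy at finite height is of relative size {\frac{\log 2\pi}{\log T}}.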

Even without the Lindelöf hypothesis, we have the following result:

Theorem 4 (Titchmarsh) For any fixed {\alpha>0}, there are {\gg_\alpha \log T} zeroes in the region {\{ \sigma+it: T \leq t \leq T+\alpha \}} for sufficiently large {T}.

Among other things, this theorem recovers a classical result of Littlewood that the gaps between the imaginary parts of the zeroes go to zero, even without assuming unproven conjectures such as the Riemann or Lindelöf hypotheses.

Proof: Suppose for contradiction that this were not the case. Then we can find {\alpha > 0} and a sequence {T_n \rightarrow \infty} such that {\{ \sigma+it: T_n \leq t \leq T_n+\alpha \}} contains {o(\log T_n)} zeroes. Passing to a subsequence to extract a limit profile, we conclude that the normalised limit profile measure {\nu} assigns no mass to the horizontal strip {\{ \sigma+it: 0 \leq t \leq \alpha \}}. Thus the associated subharmonic function {F} is actually harmonic on this strip. But by (ii) and unique continuation this forces {F} to vanish on this strip, contradicting the functional equation (iii). \Box

Exercise 5 Use limiting profiles to obtain the matching upper bound of {O_\alpha(\log T)} for the number of zeroes in {\{ \sigma+it: T \leq t \leq T+\alpha \}} for sufficiently large {T}.

Remark 6 One can remove the need to take limiting profiles in the above arguments if one can come up with quantitative (or “hard”) substitutes for qualitative (or “soft”) results such as the unique continuation property for harmonic functions. This would also allow one to replace the qualitative decay rates {o(1)} with more quantitative decay rates such as {1/\log \log T} or {1/\log\log\log T}. Indeed, the classical proofs of the above theorems come with quantitative bounds that are typically of this form (see e.g. the text of Titchmarsh for details).

Exercise 7 Let {S(T)} denote the quantity {S(T) := \frac{1}{\pi} \hbox{arg} \zeta(\frac{1}{2}+iT)}, where the branch of the argument is taken by using a line segment connecting {\frac{1}{2}+iT} to (say) {2+iT}, and then to {2}. If we have a sequence {T_n \rightarrow \infty} producing normalised limit profiles {F, \nu} for {\log|\zeta|} and the zeroes respectively, show that {t \mapsto \frac{1}{\log T_n} S(T_n + t)} converges in the sense of distributions to the function {t \mapsto \frac{1}{\pi} \int_{1/2}^1 \frac{\partial F}{\partial t}(\sigma+it)\ d\sigma}, or equivalently

\displaystyle  t \mapsto \frac{1}{2\pi} \frac{\partial}{\partial t} \int_0^1 F(\sigma+it)\ d\sigma.

Conclude in particular that if the Lindelöf hypothesis holds, then {S(T) = o(\log T)} as {T \rightarrow \infty}.

A little bit more about the normalised limit profiles {F} are known unconditionally, beyond (i)-(iv). For instance, from Exercise 3 of Notes 5 we have {\zeta(1/2 + it ) = O( t^{1/6+o(1)} )} as {t \rightarrow +\infty}, which implies that any normalised limit profile {F} for {\log|\zeta|} is bounded by {1/6} on the critical line, beating the bound of {1/4} coming from convexity and (ii), (iii), and then convexity can be used to further bound {F} away from the critical line also. Some further small improvements of this type are known (coming from various methods for estimating exponential sums), though they fall well short of determining {F} completely at our current level of understanding. Of course, given that we believe the Riemann hypothesis (and hence the Lindelöf hypothesis) to be true, the only actual limit profile that should exist is {\max(0,\frac{1}{2}-\sigma)} (in fact this assertion is equivalent to the Lindelöf hypothesis, by the arguments above).

Better control on limiting profiles is available if we do not insist on controlling {\zeta} for all values of the height parameter {T}, but only for most such values, thanks to the existence of several mean value theorems for the zeta function, as discussed in Notes 6; we discuss this below the fold.

— 1. Limiting profiles outside of exceptional sets —

In order to avoid excessive extraction of subsequences and discarding of exceptional sets, we now move away from the standard sequential notion of a limit, and instead work with the less popular, but equally valid notion of an ultrafilter limit. Recall that an ultrafilter {p} on a set {X} is a collection of subsets of {X} (which we will call the “{p}-large” sets) which are the sets of full measure with regard to some finitely additive {\{0,1\}}-valued probability measure on {X} (with the power set Boolean algebra {2^X}). We call a subset of {X} {p}-small if it is not {p}-large. Given a function {f: X \rightarrow Y} into a topological space {Y} and a point {y \in Y}, we say that {f} converges to {y} along {p} if {f^{-1}(U)} is {p}-large for every neighbourhood {U} of {y}, and then we call {y} a {p}-limit of {f}.

Exercise 8 Let {f: X \rightarrow Y} be a function into a topological space {Y}, and let {p} be an ultrafilter on {X}.

  • (i) If {Y} is compact, show that {f} has at least one {p}-limit.
  • (ii) If {Y} is Hausdorff, show that {f} has at most one {p}-limit.
  • (iii) Conversely, if {Y} fails to be compact (resp. Hausdorff), show that there exists a function {f: X \rightarrow Y} and an ultrafilter {p} on {X} such that {f} has no {p}-limit (resp. more than one {p}-limit).

In particular, given an ultrafilter {p} on the non-negative reals {{\bf R}^+ = [0,+\infty)}, which is non-principal in the sense that all compact subsets of {{\bf R}^+} are {p}-small, there exists a unique normalised limiting profile {F} that is the limit of {T \mapsto F_T} along {p}, and similarly for {\nu}. Because the distributional topology is second countable, such limiting profiles are also limiting profiles of sequences {T_n \rightarrow \infty} as in the previous discussion, and so we retain all existing properties of limit profiles such as (i)-(iv). However, in the ultrafilter formalism we can now easily avoid various “small” exceptional sets of {T}, in addition to the compact sets that have already been excluded. For instance, let us call an ultrafilter {p} generic if every Lebesgue measurable subset {A} of {{\bf R}} of zero upper density (thus {A \cap [0,T]} has Lebesgue measure {o(T)} as {T \rightarrow \infty}) is {p}-small. The existence of generic ultrafilters follows easily from Zorn’s lemma. Define a generic limit profile to be a limit profile that arises from a generic ultrafilter; informally, these capture the possible behaviour of the zeta function outside of a set of heights {T} of zero density. To see how these profiles are better than arbitrary limit profiles, we recall from Exercise 2 of Notes 6 that

\displaystyle  \sum_j |\sum_{n=1}^N a_n n^{-it_j}|^2 \ll (\sum_{n=1}^N |a_n|^2) (T+N)

if the {t_j} are {1}-separated elements of {[0,T]} and {a_n} are arbitrary complex coefficients. If we set {T=N}, we can conclude (among other things) that for any constant {C}, one has

\displaystyle  \sup_{t': |t-t'| \leq C} |\sum_{n=1}^T a_n n^{-it'}| \ll T^{o(1)} (\sum_{n=1}^T |a_n|^2)^{1/2}

for all {t \in [0,T]} outside of a set of measure {o(T)} (informally: “square root cancellation occurs generically”). Using this, one can for instance show that

\displaystyle  \sup_{t': |t-t'| \leq C} |\zeta(\frac{1}{2}+it')| \ll T^{o(1)}

for all {t \in [0,T]} outside of a set of measure {o(T)}, which implies that any generic limit profile {F} vanishes on the critical line, and thus must be {\max(0, \frac{1}{2}-\sigma)}; that is to say, the Lindelöf hypothesis is true “generically”.
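The “square root cancellation occurs generically” heuristic is easy to see numerically (my own toy experiment; the length {N} and the sample heights {t} are arbitrary choices):

```python
import cmath
import math
import random

random.seed(0)
N = 2000
a = [random.choice((-1.0, 1.0)) for _ in range(N)]   # arbitrary +-1 coefficients

def dirichlet_poly(t):
    """sum_{n<=N} a_n n^{-it} = sum_{n<=N} a_n e^{-i t log n}"""
    return sum(an * cmath.exp(complex(0.0, -t * math.log(n)))
               for an, n in zip(a, range(1, N + 1)))

l2 = math.sqrt(sum(an * an for an in a))   # (sum_n |a_n|^2)^{1/2} = sqrt(N)
ratios = [abs(dirichlet_poly(t)) / l2 for t in (137.0, 1234.5, 9999.25)]
print(ratios)   # each ratio is O(1): the sum shows square-root cancellation
```

At a typical height the polynomial has size comparable to {(\sum_n |a_n|^2)^{1/2} = \sqrt{N}}, far below the trivial bound {\sum_n |a_n| = N}.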

One can profitably explore the regime between arbitrary non-principal ultrafilters and generic ultrafilters by introducing the intermediate notion of an {\alpha}-generic ultrafilter for any {0 < \alpha < 1}, defined as an ultrafilter {p} with the property that any Lebesgue measurable subset {A} of {{\bf R}} of “dimension at most {\alpha}” in the sense that {A \cap [0,T]} has measure {O( T^{\alpha+o(1)})}, is {p}-small. One can then interpret many existing mean value theorems on the zeta function (or on other Dirichlet series) as controlling the {\alpha}-generic limit profiles of {\log|\zeta|}, or more generally of the log-magnitude of various Dirichlet series (e.g. {\sum_{n \leq T^\theta} \frac{\mu(n)}{n^s}} for various exponents {\theta}). For instance, the previous argument shows that

\displaystyle  \sup_{t': |t-t'| \leq C} |\zeta(\frac{1}{2}+it')| \ll T^{(1-\alpha)/2+o(1)}

for all {t \in [0,T]} outside of a set of measure {T^{\alpha+o(1)}}, which implies that any {\alpha}-generic limit profile {F} is bounded above by {\frac{1-\alpha}{2}} on the critical line. One can also recast much of the arguments in Notes 6 in this language (defining limit profiles for various Dirichlet polynomials, and using such profiles and zero-detecting polynomials to establish {\alpha}-generic zero-free regions), although this is mostly just a change of notation and does not seem to yield any major simplifications to these arguments.

Filed under: 254A - analytic prime number theory, math.NT Tagged: limit profiles, Riemann zeta function, ultrafilters

Sean CarrollGuest Post: An Interview with Jamie Bock of BICEP2

If you’re reading this you probably know about the BICEP2 experiment, a radio telescope at the South Pole that measured a particular polarization signal known as “B-modes” in the cosmic microwave background radiation. Cosmologists were very excited at the prospect that the B-modes were the imprint of gravitational waves originating from a period of inflation in the primordial universe; now, with more data from the Planck satellite, it seems plausible that the signal is mostly due to dust in our own galaxy. The measurements that the team reported were completely on-target, but our interpretation of them has changed — we’re still looking for direct evidence for or against inflation.

Here I’m very happy to publish an interview that was carried out with Jamie Bock, a professor of physics at Caltech and a senior research scientist at JPL, who is one of the leaders of the BICEP2 collaboration. It’s a unique look inside the workings of an incredibly challenging scientific effort.

New Results from BICEP2: An Interview with Jamie Bock

What do the new data from Planck tell you? What do you know now?

A scientific race has been under way for more than a decade among a dozen or so experiments trying to measure B-mode polarization, a telltale signature of gravitational waves produced from the time of inflation. Last March, BICEP2 reported a B-mode polarization signal, a twisty polarization pattern measured in a small patch of sky. The amplitude of the signal we measured was surprisingly large, exceeding what we expected for galactic emission. This implied we were seeing a large gravitational wave signal from inflation.

We ruled out galactic synchrotron emission, which comes from electrons spiraling in the magnetic field of the galaxy, using low-frequency data from the WMAP [Wilkinson Microwave Anisotropy Probe] satellite. But there were no data available on polarized galactic dust emission, and we had to use models. These models weren’t starting from zero; they were built on well-known maps of unpolarized dust emission, and, by and large, they predicted that polarized dust emission was a minor constituent of the total signal.

Obviously, the answer here is of great importance for cosmology, and we have always wanted a direct test of galactic emission using data in the same piece of sky so that we can test how much of the BICEP2 signal is cosmological, representing gravitational waves from inflation, and how much is from galactic dust. We did exactly that with galactic synchrotron emission from WMAP because the data were public. But with galactic dust emission, we were stuck, so we initiated a collaboration with the Planck satellite team to estimate and subtract polarized dust emission. Planck has the world’s best data on polarized emission from galactic dust, measured over the entire sky in multiple spectral bands. However, the polarized dust maps were only recently released.

On the other side, BICEP2 gives us the highest-sensitivity data available at 150 GHz to measure the CMB. Interestingly, the two measurements are stronger in combination. We get a big boost in sensitivity by putting them together. Also, the detectors for both projects were designed, built, and tested at Caltech and JPL, so I had a personal interest in seeing that these projects worked together. I’m glad to say the teams worked efficiently and harmoniously together.

What we found is that when we subtract the galaxy, we just see noise; no signal from the CMB is detectable. Formally we can say at least 40 percent of the total BICEP2 signal is dust and less than 60 percent is from inflation.

How do these new data shape your next steps in exploring the earliest moments of the universe?

It is the best we can do right now, but unfortunately the result with Planck is not a very strong test of a possible gravitational wave signal. This is because the process of subtracting galactic emission effectively adds more noise into the analysis, and that noise limits our conclusions. While the inflationary signal is less than 60 percent of the total, that is not terribly informative, leaving many open questions. For example, it is quite possible that the noise prevents us from seeing part of the signal that is cosmological. It is also possible that all of the BICEP2 signal comes from the galaxy. Unfortunately, we cannot say more because the data are simply not precise enough. Our ability to measure polarized galactic dust emission in particular is frustratingly limited.

However, there is good news to report. In this analysis, we added new data obtained in 2012–13 from the Keck Array, an instrument with five telescopes and the successor to BICEP2 (see Fig. 1). These data are at the same frequency band as BICEP2—150 GHz—so while they don’t help subtract the galaxy, they do increase the total sensitivity. The Keck Array clearly detects the same signal detected by BICEP2. In fact, every test we can do shows the two are quite consistent, which demonstrates that we are doing these difficult measurements correctly (see Fig. 2). The BICEP2/Keck maps are also the best ever made, with enough sensitivity to detect signals that are a tiny fraction of the total.

Figure 2: A power spectrum of the B-mode polarization signal that plots the strength of the signal as a function of angular frequency. The data show a signal significantly above what is expected for a universe without gravitational waves, given by the red line. The excess peaks at angular scales of about 2 degrees. The independent measurements of BICEP2 and Keck Array shown in red and blue are consistent within the errors, and their combination is shown in black. Note the sets of points are slightly shifted along the x-axis to avoid overlaps.

In addition, Planck’s measurements over the whole sky show the polarized dust is fairly well behaved. For example, the polarized dust has nearly the same spectrum across the sky, so there is every reason to expect we can measure and remove dust cleanly.

To better subtract the galaxy, we need better data. We aren’t going to get more data from Planck because the mission has finished. The best way is to measure the dust ourselves by adding new spectral bands to our own instruments. We are well along in this process already. We added a second band to the Keck Array last year at 95 GHz and a third band this year at 220 GHz. We just installed the new BICEP3 instrument at 95 GHz at the South Pole (see Fig. 3). BICEP3 is a single telescope that will soon be as powerful as all five Keck Array telescopes put together. At 95 GHz, Keck and BICEP3 should surpass BICEP2’s 150 GHz sensitivity by the end of this year, and the two will be a very powerful combination indeed. If we switch the Keck Array entirely over to 220 GHz starting next year, we can get a third band to a similar depth.

Figure 3: BICEP3 installed and carrying out calibration measurements off a reflective mirror placed above the receiver. The instrument is housed within a conical reflective ground shield to minimize the brightness contrast between the warm earth and cold space. This picture was taken at the beginning of the winter season, with no physical access to the station for the next 8 months, when BICEP3 will conduct astronomical observations (Credit: Sam Harrison)

Finally, this January the SPIDER balloon experiment, which is also searching the CMB for evidence of inflation, completed its first flight, outfitted with comparable sensitivity at 95 and 150 GHz. Because SPIDER floats above the atmosphere (see Fig. 4), we can also measure the sky on larger spatial scales. This all adds up to make the coming years very exciting.

Figure 4: View of the earth and the edge of space, taken from an optical camera on the SPIDER gondola at float altitude shortly after launch. Clearly visible below is Ross Island, with volcanos Mt. Erebus and Mt. Terror and the McMurdo Antarctic base, the Royal Society mountain range to the left, and the edge of the Ross permanent ice shelf. (Credit: SPIDER team).

Why did you make the decision last March to release results? In retrospect, do you regret it?

We knew at the time that any news of a B-mode signal would cause a great stir. We started working on the BICEP2 data in 2010, and our standard for putting out the paper was that we were certain the measurements themselves were correct. It is important to point out that, throughout this episode, our measurements basically have not changed. As I said earlier, the initial BICEP2 measurement agrees with new data from the Keck Array, and both show the same signal. For all we know, the B-mode polarization signal measured by BICEP2 may contain a significant cosmological component—that’s what we need to find out.

The question really is, should we have waited until better data were available on galactic dust? Personally, I think we did the right thing. The field needed to be able to react to our data and test the results independently, as we did in our collaboration with Planck. This process hasn’t ended; it will continue with new data. Also, the searches for inflationary gravitational waves are influenced by these findings, and it is clear that all of the experiments in the field need to focus more resources on measuring the galaxy.

How confident are you that you will ultimately find conclusive evidence for primordial gravitational waves and the signature of cosmic inflation?

I don’t have an opinion about whether or not we will find a gravitational wave signal—that is why we are doing the measurement! But any result is so significant for cosmology that it has to be thoroughly tested by multiple groups. I am confident that the measurements we have made to date are robust, and the new data we need to subtract the galaxy more accurately are starting to pour forth. The immediate path forward is clear: we know how to make these measurements at 150 GHz, and we are already applying the same process to the new frequencies. Doing the measurements ourselves also means they are uniform so we understand all of the errors, which, in the end, are just as important.

What will it mean for our understanding of the universe if you don’t find the signal?

The goal of this program is to learn how inflation happened. Inflation requires matter-energy with an unusual repulsive property in order to rapidly expand the universe. The physics are almost certainly new and exotic, at energies too high to be accessed with terrestrial particle accelerators. CMB measurements are one of the few ways to get at the inflationary physics, and we need to squeeze them for all they are worth. A gravitational wave signal is very interesting because it tells us about the physical process behind inflation. A detection of the polarization signal at a high level means that certain models of inflation, perhaps along the lines of the models first developed, are a good explanation.

But here again is the real point: we also learn more about inflation if we can rule out polarization from gravitational waves. No detection at 5 percent or less of the total BICEP2 signal means that inflation is likely more complicated, perhaps involving multiple fields, although there are certainly other possibilities. Either way is a win, and we’ll find out more about what caused the birth of the universe 13.8 billion years ago.

Our team dedicated itself to the pursuit of inflationary polarization 15 years ago fully expecting a long and difficult journey. It is exciting, after all this work, to be at this stage where the polarization data are breaking into new ground, providing more information about gravitational waves than we have ever had before. The BICEP2 signal was a surprise, and its ultimate resolution is still a work in progress. The data we need to address these questions about inflation are within sight, and whatever the answers are, they are going to be interesting, so stay tuned.

Tommaso DorigoFrancis Halzen On Cosmogenic Neutrinos

During the first afternoon session of the XVI Neutrino Telescopes conference (here is the conference blog, which contains a report of most of the lectures and posters as they are presented) Francis Halzen gave a very nice account of the discovery of cosmogenic neutrinos by the IceCube experiment, and its implications. Below I offer a writeup - apologizing to Halzen if I misinterpreted anything.

Chad OrzelCelebrities and Attention Police

While I’m running unrelated articles head-on into each other, two other things that caught my eye recently were Sabine Hossenfelder’s thoughts on scientific celebrities (taking off from Lawrence Krauss’s defense of same) and Megan Garber’s piece on “attention policing”, spinning off that silliness about a badly exposed photo of a dress that took the Internet by storm.

Like Sabine, I’m generally in favor of the idea of science celebrities, though as someone whose books are found on shelves between Lawrence Krauss’s and Neil deGrasse Tyson’s, there’s no small amount of self-interest in that. But I think it’s generally good to direct more public attention to science-y things, by whatever means necessary. Which is why I spent an afternoon a month or so back with a cameraman filming me putting footballs in a freezer.

I do share the concern, though, about the attention-steering effects of celebrity. Sabine describes this within the context of science as a hypothetical involving Neil deGrasse Tyson publicly mentioning one of her papers. I think that hypothetical is already a reality outside of science, though, with the vagaries of celebrity meaning that the public image of science is largely drawn from a handful of photogenic fields with charismatic proponents at the expense of a much larger range of science with more direct impact on daily life.

The APS March Meeting takes place this week, and will be the largest physics meeting of the year. It probably won’t generate publicity in proportion to its attendance, though, because it’s largely focused on condensed matter physics, and that’s just not as sexy as astrophysics or particle physics. And that can be pretty frustrating for people in March Meeting research fields.

(This is not entirely the fault of the celebrities themselves– Cosmos did make an effort to include some stuff from other fields of science, and Brian Cox has expanded his tv-presenter empire to include biological topics. The problem, though, is that celebrity isn’t transferable, and so doing this necessarily means having people who attained fame via work in one field out in the media talking about fields that can be pretty far removed from their areas of expertise. Which can be even more frustrating, particularly when the new topics aren’t handled well.)

At the same time, though, “you’re insufficiently interested in the stuff I consider important” is the battle cry of the humorless scold, as described in Garber’s piece at the Atlantic (same link as above, to save you scrolling back up). Which is why I spend a lot of time on Twitter scrolling past stuff that I think is pointless, biting back snide replies. There’s no accounting for taste, and all that, and you’re not going to get people to stop caring about whatever weird shit they’ve decided to care about by telling them that it’s unimportant trash. All you’re going to do is piss them off.

So, as irritating as it can be to see celebrity-driven attention going disproportionately to a handful of fields, it’s also important not to get too worked up about that and start yelling at people, because that doesn’t really do any good. (Yes, I’m aware of the irony of saying this in a post where I just finished complaining that people are insufficiently interested in things I consider important. You’re very clever, now shut up.)

All you can really do is work with the system as much as you can to promote what you like, and when something catches on, run with it as far as you can. Which is why, while I have no interest in reading any of the umpteen pieces on the science of color perception that were rushed out in the wake of the silly dress business, I applaud the scientists who produced them. Ride that horse until it collapses under you, because you might not get another chance.

So, you know, as with so many other things, there’s a needle to thread here. Celebrity is probably a net positive, but you need to be cautious about its attention-directing effects. At the same time, though, getting too caught up in the misdirection of attention by celebrity communicators is a short fast road to frustration and the writing of crankily contrarian books.

Jordan EllenbergMichael Harris on Elster on Montaigne on Diagoras on Abraham Wald

Michael Harris — who is now blogging! — points out that Montaigne very crisply got to the point I make in How Not To Be Wrong about survivorship bias, Abraham Wald, and the missing bullet holes:

Here, for example, is how Montaigne explains the errors in reasoning that lead people to believe in the accuracy of divinations: “That explains the reply made by Diagoras, surnamed the Atheist, when he was in Samothrace: he was shown many vows and votive portraits from those who have survived shipwrecks and was then asked, ‘You, there, who think that the gods are indifferent to human affairs, what have you to say about so many men saved by their grace?’— ‘It is like this’, he replied, ‘there are no portraits here of those who stayed and drowned—and they are more numerous!’ ”

The quote is from Jon Elster, Reason and Rationality, p.26.

Tommaso DorigoNeutrino Physics: Poster Excerpts from Neutel XVI

The XVI edition of "Neutrino Telescopes" is about to start in Venice today. In the meantime, I have started to publish in the conference blog a few excerpts of the posters that compete for the "best poster award" at the conference this week. You might be interested to check them out:

March 01, 2015

Terence Tao254A, Notes 3: The large sieve and the Bombieri-Vinogradov theorem

A fundamental and recurring problem in analytic number theory is to demonstrate the presence of cancellation in an oscillating sum, a typical example of which might be a correlation

\displaystyle  \sum_{n} f(n) \overline{g(n)} \ \ \ \ \ (1)

between two arithmetic functions {f: {\bf N} \rightarrow {\bf C}} and {g: {\bf N} \rightarrow {\bf C}}, which to avoid technicalities we will assume to be finitely supported (or that the {n} variable is localised to a finite range, such as {\{ n: n \leq x \}}). A key example to keep in mind for the purposes of this set of notes is the twisted von Mangoldt summatory function

\displaystyle  \sum_{n \leq x} \Lambda(n) \overline{\chi(n)} \ \ \ \ \ (2)

that measures the correlation between the primes and a Dirichlet character {\chi}. One can get a “trivial” bound on such sums from the triangle inequality

\displaystyle  |\sum_{n} f(n) \overline{g(n)}| \leq \sum_{n} |f(n)| |g(n)|;

for instance, from the triangle inequality and the prime number theorem we have

\displaystyle  |\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}| \leq x + o(x) \ \ \ \ \ (3)

as {x \rightarrow \infty}. But the triangle inequality is insensitive to the phase oscillations of the summands, and often we expect (e.g. from the probabilistic heuristics from Supplement 4) to be able to improve upon the trivial triangle inequality bound by a substantial amount; in the best case scenario, one typically expects a “square root cancellation” that gains a factor that is roughly the square root of the number of summands. (For instance, for Dirichlet characters {\chi} of conductor {O(x^{O(1)})}, it is expected from probabilistic heuristics that the left-hand side of (3) should in fact be {O_\varepsilon(x^{1/2+\varepsilon})} for any {\varepsilon>0}.)
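The square-root cancellation heuristic is easy to see empirically. Here is a minimal sketch (an illustration, not part of the notes) that sums random {\pm 1} signs as a stand-in for the oscillating summands; the typical size of the sum is on the order of the square root of the number of terms, far below the trivial triangle-inequality bound:

```python
import math
import random

# Illustrative sketch (not from the notes) of square-root cancellation:
# a sum of N random +-1 signs is typically of size about sqrt(N),
# far below the trivial triangle-inequality bound of N.
random.seed(0)
N, trials = 10**4, 100

sizes = [abs(sum(random.choice((-1, 1)) for _ in range(N)))
         for _ in range(trials)]
avg = sum(sizes) / trials

print(f"trivial bound:  {N}")
print(f"typical |sum|:  {avg:.0f}   (sqrt(N) = {math.sqrt(N):.0f})")
```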

It has proven surprisingly difficult, however, to establish significant cancellation in many of the sums of interest in analytic number theory, particularly if the sums do not have a strong amount of algebraic structure (e.g. multiplicative structure) which allow for the deployment of specialised techniques (such as multiplicative number theory techniques). In fact, we are forced to rely (to an embarrassingly large extent) on (many variations of) a single basic tool to capture at least some cancellation, namely the Cauchy-Schwarz inequality. In fact, in many cases the classical case

\displaystyle  |\sum_n f(n) \overline{g(n)}| \leq (\sum_n |f(n)|^2)^{1/2} (\sum_n |g(n)|^2)^{1/2}, \ \ \ \ \ (4)

considered by Cauchy, where at least one of {f, g: {\bf N} \rightarrow {\bf C}} is finitely supported, suffices for applications. Roughly speaking, the Cauchy-Schwarz inequality replaces the task of estimating a cross-correlation between two different functions {f,g} with that of measuring self-correlations between {f} and itself, or {g} and itself, which are usually easier to compute (albeit at the cost of capturing less cancellation). Note that the Cauchy-Schwarz inequality requires almost no hypotheses on the functions {f} or {g}, making it a very widely applicable tool.

There is however some skill required to decide exactly how to deploy the Cauchy-Schwarz inequality (and in particular, how to select {f} and {g}); if applied blindly, one loses all cancellation and can even end up with a worse estimate than the trivial bound. For instance, if one tries to bound (2) directly by applying Cauchy-Schwarz with the functions {\Lambda} and {\chi}, one obtains the bound

\displaystyle  |\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}| \leq (\sum_{n \leq x} \Lambda(n)^2)^{1/2} (\sum_{n \leq x} |\chi(n)|^2)^{1/2}.

The right-hand side may be bounded by {\ll x \log^{1/2} x}, but this is worse than the trivial bound (3) by a logarithmic factor. This can be “blamed” on the fact that {\Lambda} and {\chi} are concentrated on rather different sets ({\Lambda} is concentrated on primes, while {\chi} is more or less uniformly distributed amongst the natural numbers); but even if one corrects for this (e.g. by weighting Cauchy-Schwarz with some suitable “sieve weight” that is more concentrated on primes), one still does not do any better than (3). Indeed, the Cauchy-Schwarz inequality suffers from the same key weakness as the triangle inequality: it is insensitive to the phase oscillation of the factors {f, g}.

While the Cauchy-Schwarz inequality can be poor at estimating a single correlation such as (1), its power improves when considering an average (or sum, or square sum) of multiple correlations. In this set of notes, we will focus on one such situation of this type, namely that of trying to estimate a square sum

\displaystyle  (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \ \ \ \ \ (5)

that measures the correlations of a single function {f: {\bf N} \rightarrow {\bf C}} with multiple other functions {g_j: {\bf N} \rightarrow {\bf C}}. One should think of the situation in which {f} is a “complicated” function, such as the von Mangoldt function {\Lambda}, but the {g_j} are relatively “simple” functions, such as Dirichlet characters. In the case when the {g_j} are orthonormal functions, we of course have the classical Bessel inequality:

Lemma 1 (Bessel inequality) Let {g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}} be finitely supported functions obeying the orthonormality relationship

\displaystyle  \sum_n g_j(n) \overline{g_{j'}(n)} = 1_{j=j'}

for all {1 \leq j,j' \leq J}. Then for any function {f: {\bf N} \rightarrow {\bf C}}, we have

\displaystyle  (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \leq (\sum_n |f(n)|^2)^{1/2}.

For sake of comparison, if one were to apply the Cauchy-Schwarz inequality (4) separately to each summand in (5), one would obtain the bound of {J^{1/2} (\sum_n |f(n)|^2)^{1/2}}, which is significantly inferior to the Bessel bound when {J} is large. Geometrically, what is going on is this: the Cauchy-Schwarz inequality (4) is only close to sharp when {f} and {g} are close to parallel in the Hilbert space {\ell^2({\bf N})}. But if {g_1,\dots,g_J} are orthonormal, then it is not possible for any other vector {f} to be simultaneously close to parallel to too many of these orthonormal vectors, and so the inner products of {f} with most of the {g_j} should be small. (See this previous blog post for more discussion of this principle.) One can view the Bessel inequality as formalising a repulsion principle: if {f} correlates too much with some of the {g_j}, then it does not have enough “energy” to have large correlation with the rest of the {g_j}.
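This comparison can be checked numerically. In the sketch below (an illustration; the choice of normalized additive characters {n \mapsto e(jn/N)/\sqrt{N}} as the orthonormal family is mine, not the notes'), the Bessel bound holds while the term-by-term Cauchy-Schwarz bound is weaker by the factor {J^{1/2}}:

```python
import cmath
import math

# A small numerical check (illustrative, not from the notes) of the Bessel
# inequality.  The g_k below are normalized additive characters, which are
# orthonormal, so the square-sum of correlations with f is bounded by the
# l^2 norm of f alone; per-term Cauchy-Schwarz only gives sqrt(J) * ||f||_2.
N, J = 64, 16

def g(k, n):
    return cmath.exp(2j * math.pi * k * n / N) / math.sqrt(N)

f = [(-1) ** n / (n + 1) for n in range(N)]          # an arbitrary test vector

corr = [sum(f[n] * g(k, n).conjugate() for n in range(N)) for k in range(J)]
lhs = math.sqrt(sum(abs(c) ** 2 for c in corr))
f_norm = math.sqrt(sum(abs(x) ** 2 for x in f))

print(f"Bessel bound:  {lhs:.4f} <= {f_norm:.4f}")
print(f"per-term C-S:  {lhs:.4f} <= {math.sqrt(J) * f_norm:.4f}")
```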

In analytic number theory applications, it is useful to generalise the Bessel inequality to the situation in which the {g_j} are not necessarily orthonormal. This can be accomplished via the Cauchy-Schwarz inequality:

Proposition 2 (Generalised Bessel inequality) Let {g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}} be finitely supported functions, and let {\nu: {\bf N} \rightarrow {\bf R}^+} be a non-negative function. Let {f: {\bf N} \rightarrow {\bf C}} be such that {f} vanishes whenever {\nu} vanishes. Then we have

\displaystyle  (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2} \ \ \ \ \ (6)

\displaystyle  \times ( \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)} )^{1/2}

for some sequence {c_1,\dots,c_J} of complex numbers with {\sum_{j=1}^J |c_j|^2 = 1}, with the convention that {|f(n)|^2/\nu(n)} vanishes whenever {f(n), \nu(n)} both vanish.

Note by relabeling that we may replace the domain {{\bf N}} here by any other at most countable set, such as the integers {{\bf Z}}. (Indeed, one can give an analogue of this lemma on arbitrary measure spaces, but we will not do so here.) This result first appears in this paper of Boas.

Proof: We use the method of duality to replace the role of the function {f} by a dual sequence {c_1,\dots,c_J}. By the converse to Cauchy-Schwarz, we may write the left-hand side of (6) as

\displaystyle  \sum_{j=1}^J \overline{c_j} \sum_{n} f(n) \overline{g_j(n)}

for some complex numbers {c_1,\dots,c_J} with {\sum_{j=1}^J |c_j|^2 = 1}. Indeed, if all of the {\sum_{n} f(n) \overline{g_j(n)}} vanish, we can set the {c_j} arbitrarily, otherwise we set {(c_1,\dots,c_J)} to be the unit vector formed by dividing {(\sum_{n} f(n) \overline{g_j(n)})_{j=1}^J} by its length. We can then rearrange this expression as

\displaystyle  \sum_n f(n) \overline{\sum_{j=1}^J c_j g_j(n)}.

Applying Cauchy-Schwarz (dividing the first factor by {\nu(n)^{1/2}} and multiplying the second by {\nu(n)^{1/2}}, after first removing those {n} for which {\nu(n)} vanish), this is bounded by

\displaystyle  (\sum_n |f(n)|^2 / \nu(n))^{1/2} (\sum_n \nu(n) |\sum_{j=1}^J c_j g_j(n)|^2)^{1/2},

and the claim follows by expanding out the second factor. \Box

Observe that Lemma 1 is a special case of Proposition 2 when {\nu=1} and the {g_j} are orthonormal. In general, one can expect Proposition 2 to be useful when the {g_j} are almost orthogonal relative to {\nu}, in that the correlations {\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}} tend to be small when {j,j'} are distinct. In that case, one can hope for the diagonal term {j=j'} in the right-hand side of (6) to dominate, in which case one can obtain estimates of comparable strength to the classical Bessel inequality. The flexibility to choose different weights {\nu} in the above proposition has some technical advantages; for instance, if {f} is concentrated in a sparse set (such as the primes), it is sometimes useful to tailor {\nu} to a comparable set (e.g. the almost primes) in order not to lose too much in the first factor {\sum_n |f(n)|^2 / \nu(n)}. Also, it can be useful to choose a fairly “smooth” weight {\nu}, in order to make the weighted correlations {\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}} small.
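Proposition 2, together with the duality choice of the {c_j} from its proof, can be sanity-checked numerically. A sketch with randomly generated data (the setup is my own, not from the notes):

```python
import numpy as np

# Illustrative numerical check (not from the notes) of Proposition 2:
# with c the unit vector proportional to the correlation vector (the
# duality step of the proof), the square-sum of correlations is bounded
# by the nu^{-1}-weighted norm of f times the weighted Gram form of the g_j.
rng = np.random.default_rng(0)
N, J = 30, 4

G = rng.normal(size=(J, N)) + 1j * rng.normal(size=(J, N))  # rows are g_j
f = rng.normal(size=N) + 1j * rng.normal(size=N)
nu = rng.uniform(0.5, 2.0, size=N)                          # positive weight

r = G.conj() @ f                  # r_j = sum_n f(n) conj(g_j(n))
lhs = np.linalg.norm(r)           # left-hand side of (6)
c = r / lhs                       # dual unit vector, sum_j |c_j|^2 = 1

gram = np.einsum('n,jn,kn->jk', nu, G, G.conj())  # sum_n nu g_j conj(g_j')
quad = np.einsum('j,k,jk->', c, c.conj(), gram)   # c_j conj(c_j') Gram form
rhs = np.sqrt(np.sum(np.abs(f) ** 2 / nu)) * np.sqrt(quad.real)

print(lhs, rhs)
```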

Remark 3 In harmonic analysis, the use of tools such as Proposition 2 is known as the method of almost orthogonality, or the {TT^*} method. The explanation for the latter name is as follows. For sake of exposition, suppose that {\nu} is never zero (or we remove all {n} from the domain for which {\nu(n)} vanishes). Given a family of finitely supported functions {g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}}, consider the linear operator {T: \ell^2(\nu^{-1}) \rightarrow \ell^2(\{1,\dots,J\})} defined by the formula

\displaystyle  T f := ( \sum_{n} f(n) \overline{g_j(n)} )_{j=1}^J.

This is a bounded linear operator, and the left-hand side of (6) is nothing other than the {\ell^2(\{1,\dots,J\})} norm of {Tf}. Without any further information on the function {f} other than its {\ell^2(\nu^{-1})} norm {(\sum_n |f(n)|^2 / \nu(n))^{1/2}}, the best estimate one can obtain on (6) here is clearly

\displaystyle  (\sum_n |f(n)|^2 / \nu(n))^{1/2} \times \|T\|_{op},

where {\|T\|_{op}} denotes the operator norm of {T}.

The adjoint {T^*: \ell^2(\{1,\dots,J\}) \rightarrow \ell^2(\nu^{-1})} is easily computed to be

\displaystyle  T^* (c_j)_{j=1}^J := (\sum_{j=1}^J c_j \nu(n) g_j(n) )_{n \in {\bf N}}.

The composition {TT^*: \ell^2(\{1,\dots,J\}) \rightarrow \ell^2(\{1,\dots,J\})} of {T} and its adjoint is then given by

\displaystyle  TT^* (c_j)_{j=1}^J := (\sum_{j'=1}^J c_{j'} \sum_n \nu(n) g_{j'}(n) \overline{g_j(n)} )_{j=1}^J.

From the spectral theorem (or singular value decomposition), one sees that the operator norms of {T} and {TT^*} are related by the identity

\displaystyle  \|T\|_{op} = \|TT^*\|_{op}^{1/2},

and as {TT^*} is a self-adjoint, positive semi-definite operator, the operator norm {\|TT^*\|_{op}} is also the supremum of the quantity

\displaystyle  \langle TT^* (c_j)_{j=1}^J, (c_j)_{j=1}^J \rangle_{\ell^2(\{1,\dots,J\})} = \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}

where {(c_j)_{j=1}^J} ranges over unit vectors in {\ell^2(\{1,\dots,J\})}. Putting these facts together, we obtain Proposition 2; furthermore, we see from this analysis that the bound here is essentially optimal if the only information one is allowed to use about {f} is its {\ell^2(\nu^{-1})} norm.

For further discussion of almost orthogonality methods from a harmonic analysis perspective, see Chapter VII of this text of Stein.
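The identity {\|T\|_{op} = \|TT^*\|_{op}^{1/2}} behind the method's name holds for any finite matrix, and is easy to confirm numerically; the sketch below (illustrative, with a random complex matrix standing in for {T}) also checks the quadratic-form characterization of {\|TT^*\|_{op}}:

```python
import numpy as np

# Quick numerical illustration (setup assumed, not from the text) of the
# identity ||T||_op = ||TT^*||_op^{1/2}, for a random complex J x N matrix.
rng = np.random.default_rng(1)
J, N = 5, 40
T = rng.normal(size=(J, N)) + 1j * rng.normal(size=(J, N))

TTstar = T @ T.conj().T                 # self-adjoint, positive semi-definite
op_T = np.linalg.norm(T, 2)             # spectral (operator) norm
op_TTstar = np.linalg.norm(TTstar, 2)

# ||TT^*||_op is also the supremum of the quadratic form <TT^* c, c> over
# unit vectors c, i.e. the largest eigenvalue of TT^*.
top_eig = np.linalg.eigvalsh(TTstar).max()

print(op_T, np.sqrt(op_TTstar), np.sqrt(top_eig))
```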

Exercise 4 Under the same hypotheses as Proposition 2, show that

\displaystyle  \sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}| \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2}

\displaystyle  \times ( \sum_{j=1}^J \sum_{j'=1}^J |\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}| )^{1/2}

as well as the variant inequality

\displaystyle  |\sum_{j=1}^J \sum_{n} f(n) \overline{g_j(n)}| \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2}

\displaystyle  \times | \sum_{j=1}^J \sum_{j'=1}^J \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}|^{1/2}.

Proposition 2 has many applications in analytic number theory; for instance, we will use it in later notes to control the large value of Dirichlet series such as the Riemann zeta function. One of the key benefits is that it largely eliminates the need to consider further correlations of the function {f} (other than its self-correlation {\sum_n |f(n)|^2 / \nu(n)} relative to {\nu^{-1}}, which is usually fairly easy to compute or estimate as {\nu} is usually chosen to be relatively simple); this is particularly useful if {f} is a function which is significantly more complicated to analyse than the functions {g_j}. Of course, the tradeoff for this is that one now has to deal with the coefficients {c_j}, which if anything are even less understood than {f}, since literally the only thing we know about these coefficients is their square sum {\sum_{j=1}^J |c_j|^2}. However, as long as there is enough almost orthogonality between the {g_j}, one can estimate the {c_j} by fairly crude estimates (e.g. triangle inequality or Cauchy-Schwarz) and still get reasonably good estimates.

In this set of notes, we will use Proposition 2 to prove some versions of the large sieve inequality, which controls a square-sum of correlations

\displaystyle  \sum_n f(n) e( -\xi_j n )

of an arbitrary finitely supported function {f: {\bf Z} \rightarrow {\bf C}} with various additive characters {n \mapsto e( \xi_j n)} (where {e(x) := e^{2\pi i x}}), or alternatively a square-sum of correlations

\displaystyle  \sum_n f(n) \overline{\chi_j(n)}

of {f} with various primitive Dirichlet characters {\chi_j}; it turns out that one can prove a (slightly sub-optimal) version of this inequality quite quickly from Proposition 2 if one first prepares the sum by inserting a smooth cutoff with well-behaved Fourier transform. The large sieve inequality has many applications (as the name suggests, it has particular utility within sieve theory). For the purposes of this set of notes, though, the main application we will need it for is the Bombieri-Vinogradov theorem, which in a very rough sense gives a prime number theorem in arithmetic progressions, which, “on average”, is of strength comparable to the results provided by the Generalised Riemann Hypothesis (GRH), but has the great advantage of being unconditional (it does not require any unproven hypotheses such as GRH); it can be viewed as a significant extension of the Siegel-Walfisz theorem from Notes 2. As we shall see in later notes, the Bombieri-Vinogradov theorem is a very useful ingredient in sieve-theoretic problems involving the primes.

There is however one additional important trick, beyond the large sieve, which we will need in order to establish the Bombieri-Vinogradov theorem. As it turns out, after some basic manipulations (and the deployment of some multiplicative number theory, and specifically the Siegel-Walfisz theorem), the task of proving the Bombieri-Vinogradov theorem is reduced to that of getting a good estimate on sums that are roughly of the form

\displaystyle  \sum_{j=1}^J |\sum_n \Lambda(n) \overline{\chi_j}(n)| \ \ \ \ \ (7)

for some primitive Dirichlet characters {\chi_j}. This looks like the type of sum that can be controlled by the large sieve (or by Proposition 2), except that this is an ordinary sum rather than a square sum (i.e., an {\ell^1} norm rather than an {\ell^2} norm). One could of course try to control such a sum in terms of the associated square-sum through the Cauchy-Schwarz inequality, but this turns out to be very wasteful (it loses a factor of about {J^{1/2}}). Instead, one should try to exploit the special structure of the von Mangoldt function {\Lambda}, in particular the fact that it can be expressed as a Dirichlet convolution {\alpha * \beta} of two further arithmetic sequences {\alpha,\beta} (or as a finite linear combination of such Dirichlet convolutions). The reason for introducing this convolution structure is through the basic identity

\displaystyle  (\sum_n \alpha*\beta(n) \overline{\chi_j}(n)) = (\sum_n \alpha(n) \overline{\chi_j}(n)) (\sum_n \beta(n) \overline{\chi_j}(n)) \ \ \ \ \ (8)

for any finitely supported sequences {\alpha,\beta: {\bf N} \rightarrow {\bf C}}, as can be easily seen by multiplying everything out and using the completely multiplicative nature of {\chi_j}. (This is the multiplicative analogue of the well-known relationship {\widehat{f*g}(\xi) = \hat f(\xi) \hat g(\xi)} between ordinary convolution and Fourier coefficients.) This factorisation, together with yet another application of the Cauchy-Schwarz inequality, lets one control (7) by square-sums of the sort that can be handled by the large sieve inequality.
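As a quick numerical sanity check (not needed for the argument), the identity (8) can be verified directly for a completely multiplicative function; the sketch below uses the real primitive character of modulus {4} and two arbitrary finitely supported integer sequences.

```python
# Check identity (8): for a completely multiplicative chi,
#   sum_n (alpha*beta)(n) conj(chi)(n)
#     = (sum_m alpha(m) conj(chi)(m)) * (sum_k beta(k) conj(chi)(k)).
# Here chi is the (real-valued) primitive character mod 4, so conj(chi) = chi.
import random

def chi(n):
    return 0 if n % 2 == 0 else (1 if n % 4 == 1 else -1)

random.seed(0)
alpha = {n: random.randint(-5, 5) for n in range(1, 51)}
beta = {n: random.randint(-5, 5) for n in range(1, 51)}

# Dirichlet convolution: (alpha*beta)(n) = sum_{mk = n} alpha(m) beta(k)
conv = {}
for m, am in alpha.items():
    for k, bk in beta.items():
        conv[m * k] = conv.get(m * k, 0) + am * bk

lhs = sum(c * chi(n) for n, c in conv.items())
rhs = sum(a * chi(m) for m, a in alpha.items()) * \
      sum(b * chi(k) for k, b in beta.items())
assert lhs == rhs
```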

As we have seen in Notes 1, the von Mangoldt function {\Lambda} does indeed admit several factorisations of Dirichlet convolution type, such as the factorisation {\Lambda = \mu * L}. One can try directly inserting this factorisation into the above strategy; it almost works, but there turns out to be a problem when considering the contribution of the portion of {\mu} or {L} that is supported at very small natural numbers, as the large sieve loses any gain over the trivial bound in such settings. Because of this, there is a need for a more sophisticated decomposition of {\Lambda} into Dirichlet convolutions {\alpha * \beta} which are non-degenerate in the sense that {\alpha,\beta} are supported away from small values. (As a non-example, the trivial factorisation {\Lambda = \Lambda * \delta} would be a totally inappropriate factorisation for this purpose.) Fortunately, it turns out that through some elementary combinatorial manipulations, some satisfactory decompositions of this type are available, such as the Vaughan identity and the Heath-Brown identity. By using one of these identities we will be able to complete the proof of the Bombieri-Vinogradov theorem. (These identities are also useful for other applications in which one wishes to control correlations between the von Mangoldt function {\Lambda} and some other sequence; we will see some examples of this in later notes.)

For further reading on these topics, including a significantly larger number of examples of the large sieve inequality, see Chapters 7 and 17 of Iwaniec and Kowalski.

Remark 5 We caution that the presentation given in this set of notes is highly ahistorical; we are using modern streamlined proofs of results that were first obtained by more complicated arguments.

— 1. The large sieve inequality —

We begin with a (slightly weakened) form of the large sieve inequality for additive characters, also known as the analytic large sieve inequality, first extracted explicitly by Davenport and Halberstam from previous work on the large sieve, and then refined further by many authors (see Remark 7 below).

Proposition 6 (Analytic large sieve inequality) Let {f: {\bf Z} \rightarrow {\bf C}} be a function supported on an interval {[M,M+N]} for some {M \in {\bf R}} and {N > 0}, and let {\xi_1,\dots,\xi_J \in {\bf R}/{\bf Z}} be {\delta}-separated (thus {\|\xi_i - \xi_j\|_{{\bf R}/{\bf Z}} \ge \delta} for all {1 \leq i < j \leq J} and some {\delta>0}, where {\|\xi\|_{{\bf R}/{\bf Z}}} denotes the distance from {\xi} to the nearest integer). Then

\displaystyle  \sum_{j=1}^J |\sum_n f(n) e( - \xi_j n )|^2 \ll (N + \frac{1}{\delta}) \sum_n |f(n)|^2. \ \ \ \ \ (9)

One can view this proposition as a variant of the Plancherel identity

\displaystyle  \sum_{j=1}^N |\sum_{n=1}^N f(n) e( - jn / N )|^2 = N \sum_{n=1}^N |f(n)|^2

associated to the Fourier transform on a cyclic group {{\bf Z}/N{\bf Z}}. This identity also shows that apart from the implied constant, the bound (9) is essentially best possible.

Proof: By increasing {N} (if {N < 1/\delta}) or decreasing {\delta} (if {1/\delta < N}), we can reduce to the case {N = \frac{1}{\delta}} without making the hypotheses any stronger. Thus the {\xi_j} are now {1/N}-separated, and our task is now to show that

\displaystyle  \sum_{j=1}^J |\sum_n f(n) e( - \xi_j n )|^2 \ll N \sum_n |f(n)|^2.

We now wish to apply Proposition 2 (with {{\bf N}} relabeled by {{\bf Z}}), but we need to choose a suitable weight function {\nu: {\bf Z} \rightarrow {\bf R}^+}. It turns out that it is advantageous for technical reasons to select a weight that has good Fourier-analytic properties.

Fix a smooth non-negative function {\psi: {\bf R} \rightarrow {\bf R}} supported on {[-1/4,1/4]} and not identically zero; we allow implied constants to depend on {\psi}. (Actually, the smoothness of {\psi} is not absolutely necessary for this argument; one could take {\psi = 1_{[-1/4,1/4]}} below if desired.) Consider the weight {\nu: {\bf R} \rightarrow {\bf R}^+} defined by

\displaystyle  \nu(n) := |\int_{\bf R} \psi( x ) e( x \frac{n-M}{N} )\ dx|^2.

Observe that {\nu \gg 1} for all {n \in [M,M+N]}. Thus, by Proposition 2 with this choice of {\nu}, we reduce to showing that

\displaystyle  \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) e( (\xi_j - \xi_{j'}) n ) \ll N \ \ \ \ \ (10)

whenever {c_1,\dots,c_J} are complex numbers with {\sum_{j=1}^J |c_j|^2 = 1}.

We first consider the diagonal contribution {j=j'}. We can write {\nu} in terms of the Fourier transform {\hat \psi(\xi) := \int_{\bf R} \psi(x) e^{i x\xi}\ dx} of {\psi} as

\displaystyle  \nu(t) = |\hat \psi( 2\pi (t-M) / N )|^2,

and hence by Exercise 28 of Supplement 2, we have the bounds

\displaystyle  \nu(n) \ll \frac{1}{(1 + |n-M|/N)^2} \ \ \ \ \ (11)

(say) and thus

\displaystyle  \sum_n \nu(n) \ll N.

Thus the diagonal contribution {j=j'} of (10) is acceptable.

Now we consider an off-diagonal term {\sum_n \nu(n) e( (\xi_j - \xi_{j'}) n )} with {j \neq j'}. By the Poisson summation formula (Theorem 34 from Supplement 2), we can rewrite this as

\displaystyle  \sum_m \hat \nu( 2\pi (m + \xi_j - \xi_{j'}) ).

Now from the Fourier inversion formula, the function {\hat \psi} has Fourier transform supported on {[-1/4,1/4]}, and so {\nu} has Fourier transform supported in {[-\frac{\pi}{N}, \frac{\pi}{N}]}. Since {\xi_j - \xi_{j'}} is at least {1/N} from the nearest integer, we conclude that {\hat \nu( 2\pi (m + \xi_j - \xi_{j'}) )=0} for all integers {m}, and hence all the off-diagonal terms in fact vanish! The claim follows. \Box

Remark 7 If {M,N} are integers, one can in fact obtain the sharper bound

\displaystyle  \sum_{j=1}^J |\sum_n f(n) e( - \xi_j n )|^2 \leq (N + \frac{1}{\delta}) \sum_n |f(n)|^2,

a result of Montgomery and Vaughan (with {N} replaced by {N+1}, although it was observed subsequently by Paul Cohen that the additional {+1} term could be deleted by an amplification trick similar to those discussed in this previous post). See this survey of Montgomery for these results and for an account of the (somewhat complicated) evolution of the large sieve, starting with the pioneering work of Linnik. However, in our applications the cruder form of the analytic large sieve inequality given by Proposition 6 will suffice.
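The sharp inequality in Remark 7 is easy to spot-check numerically on random data (this of course tests, rather than proves, the bound); all parameters below are arbitrary illustrative choices.

```python
# Spot check: sum_j |sum_n f(n) e(-xi_j n)|^2 <= (N + 1/delta) sum_n |f(n)|^2
# for f supported on the integers of [M, M+N] and delta-separated xi_j in R/Z.
import cmath, random

def e(x):  # e(x) := exp(2 pi i x)
    return cmath.exp(2j * cmath.pi * x)

random.seed(1)
M, N = 7, 40
f = {n: complex(random.uniform(-1, 1), random.uniform(-1, 1))
     for n in range(M, M + N + 1)}

delta = 0.05
# frequencies 0, 0.07, ..., 0.91: spacing 0.07, wrap-around gap 0.09,
# so these points are delta-separated on R/Z
xis = [0.07 * j for j in range(14)]

lhs = sum(abs(sum(fn * e(-xi * n) for n, fn in f.items())) ** 2 for xi in xis)
rhs = (N + 1 / delta) * sum(abs(fn) ** 2 for fn in f.values())
assert lhs <= rhs
```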

Exercise 8 Let {f: {\bf N} \rightarrow {\bf C}} be a function supported on an interval {[M,M+N]}. Show that for any {A \geq 1}, one has

\displaystyle  |\sum_n f(n) e( - \xi n)| \leq A (\sum_n |f(n)|^2)^{1/2}

for all {\xi \in {\bf R}/{\bf Z}} outside of the union of {O( \frac{N}{A^2} )} intervals (or arcs) of length {1/N}. In particular, we have

\displaystyle  |\sum_n f(n) e( - \xi n)| \ll \sqrt{N} (\sum_n |f(n)|^2)^{1/2}

for all {\xi \in {\bf R}/{\bf Z}}, and the significantly superior estimate

\displaystyle  |\sum_n f(n) e( - \xi n)| \ll (\sum_n |f(n)|^2)^{1/2}

for most {\xi \in {\bf R}/{\bf Z}} (outside of at most (say) {N/2} intervals of length {1/N}).
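One can also see the "most {\xi}" phenomenon of Exercise 8 empirically: for a random sequence, the proportion of sampled frequencies at which the exponential sum exceeds {A} times the {\ell^2} norm stays within the {O(1/A^2)} density predicted by the exceptional-interval count. The parameters in this sketch are arbitrary.

```python
# Empirical illustration of Exercise 8: |S(xi)| = |sum_n f(n) e(-xi n)| can
# exceed A * ||f||_2 only on a small proportion of frequencies (at most ~1/A^2
# by the L^2 mean value; we allow a factor of 2 for grid/sampling effects).
import cmath, random

random.seed(4)
N = 200
f = [random.uniform(-1, 1) for _ in range(N + 1)]  # supported on [0, N]
norm2 = sum(v * v for v in f) ** 0.5

def S(xi):
    return abs(sum(v * cmath.exp(-2j * cmath.pi * xi * n)
                   for n, v in enumerate(f)))

A = 3
grid = [k / 1000 for k in range(1000)]
bad = sum(1 for xi in grid if S(xi) > A * norm2)
assert bad / len(grid) <= 2 / A ** 2
```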

Exercise 9 (Continuous large sieve inequality) Let {[M,M+N]} be an interval for some {M \in {\bf R}} and {N > 0}, and let {\xi_1,\dots,\xi_J \in {\bf R}} be {\delta}-separated.

  • (i) For any complex numbers {c_1,\dots,c_J}, show that

    \displaystyle  \int_M^{M+N} |\sum_{j=1}^J c_j e( \xi_j t )|^2\ dt \ll (N + \frac{1}{\delta}) \sum_{j=1}^J |c_j|^2.

    (Hint: replace the restriction of {t} to {[M,M+N]} with the weight {\nu(t)} used to prove Proposition 6.)

  • (ii) For any continuous {f: [M,M+N] \rightarrow {\bf C}}, show that

    \displaystyle  \sum_{j=1}^J |\int_M^{M+N} f(t) e( - \xi_j t )\ dt|^2 \ll (N + \frac{1}{\delta}) \int_M^{M+N} |f(t)|^2\ dt.

Now we establish a variant of the analytic large sieve inequality, involving Dirichlet characters, due to Bombieri and Davenport.

Proposition 10 (Large sieve inequality for characters) Let {f: {\bf Z} \rightarrow {\bf C}} be a function supported on an interval {[M,M+N]} for some {M \in {\bf R}} and {N > 0}, and let {Q \geq 1}. Then

\displaystyle  \sum_{q \leq Q} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \ll (N + Q^2) \sum_n |f(n)|^2, \ \ \ \ \ (12)

where {\sum^*_{\chi\ (q)}} denotes the sum over all primitive Dirichlet characters of modulus {q}.

Proof: By increasing {N} (if {N < Q^2}) or increasing {Q} (if {N > Q^2}), we may assume that {N=Q^2}, so {N \geq 1} and our task is now to show that

\displaystyle  \sum_{q \leq \sqrt{N}} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \ll N \sum_n |f(n)|^2.

Let {\nu} be the weight from Proposition 6. By Proposition 2 (and some relabeling), using the functions {(\frac{q}{\phi(q)})^{1/2} \chi}, it suffices to show that

\displaystyle  \sum_{q,q' \leq \sqrt{N}} \left(\frac{q}{\phi(q)}\right)^{1/2} \left(\frac{q'}{\phi(q')}\right)^{1/2} \sum^*_{\chi\ (q)} \sum^*_{\chi'\ (q')} c_\chi \overline{c_{\chi'}} \sum_n \nu(n) \chi(n) \overline{\chi'(n)} \ \ \ \ \ (13)

\displaystyle  \ll N

whenever {c_\chi} are complex numbers with

\displaystyle  \sum_{q \leq \sqrt{N}} \sum^*_{\chi\ (q)} |c_\chi|^2 = 1. \ \ \ \ \ (14)

We first establish that the diagonal contribution {(q,\chi)=(q',\chi')} to (13) is acceptable, that is to say that

\displaystyle  \sum_{q \leq \sqrt{N}} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |c_\chi|^2 \sum_n \nu(n) |\chi(n)|^2 \ll N.

The function {|\chi(n)|^2} is the principal character of modulus {q}; it is periodic with period {q} and has mean value {\phi(q)/q}. In particular, since {q \leq \sqrt{N} \leq N}, we see that {\sum_{M' \leq n \leq M'+N} |\chi(n)|^2 \ll \frac{\phi(q)}{q} N} for any interval {[M',M'+N]} of length {N}. From (11) and a partition into intervals of length {N}, we see that

\displaystyle  \sum_n \nu(n) |\chi(n)|^2 \ll \frac{\phi(q)}{q} N

and the claim then follows from (14).

Now consider an off-diagonal term {\sum_n \nu(n) \chi(n) \overline{\chi'(n)}} with {\chi \neq \chi'}. As {\chi,\chi'} are primitive, this implies that {\chi \overline{\chi'}} is a non-principal character and thus has mean zero. Let {r} be the modulus of this character; by Fourier expansion we may write {\chi \overline{\chi'}} as a linear combination of the additive characters {n \mapsto e( k n / r )} for {1 \leq k < r}. (We can obtain explicit coefficients for this expansion by invoking Lemma 48 of Supplement 3, but we will not need those coefficients here.) Thus {\sum_n \nu(n) \chi(n) \overline{\chi'(n)}} is a linear combination of the quantities {\sum_n \nu(n) e( kn/r)} for {1 \leq k < r}. But the modulus {r} is the least common multiple of the moduli of {\chi,\chi'}, so in particular {r \leq N}, while as observed in the proof of Proposition 6, we have {\sum_n \nu(n) e( \theta n ) = 0} whenever {\|\theta\|_{{\bf R}/{\bf Z}} \geq 1/(2N)}. So the off-diagonal terms all vanish, and the claim follows. \Box

One can also derive Proposition 10 from Proposition 6:

Exercise 11 Let {f: {\bf Z} \rightarrow {\bf C}} be a finitely supported sequence.

  • (i) For any natural number {q}, establish the identity

    \displaystyle  \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\sum_n f(n) e(an/q)|^2

    \displaystyle = \frac{1}{\phi(q)} \sum_{\chi\ (q)} |\sum_n f(n) \sum_{a \in {\bf Z}/q{\bf Z}} \chi(a) e(an/q)|^2

    where {({\bf Z}/q{\bf Z})^\times} is the set of congruence classes {a\ (q)} coprime to {q}, and {\sum_{\chi\ (q)}} is the sum over all characters (not necessarily primitive) of modulus {q}.

  • (ii) For any natural number {q}, establish the inequality

    \displaystyle  \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \leq \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\sum_n f(n) e(an/q)|^2

    and use this and Proposition 6 to derive Proposition 10. (Hint: use Lemma 48 from Supplement 3.)

Remark 12 By combining the arguments in the above exercise with the results in Remark 7, one can sharpen Proposition 10 to

\displaystyle  \sum_{q \leq Q} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \leq (N + Q^2) \sum_n |f(n)|^2,

that is to say one can delete the implied constant. See this paper of Montgomery and Vaughan for some further refinements of this inequality.
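The sharp form of (12) in Remark 12 can likewise be spot-checked numerically. For simplicity the sketch below only includes the modulus {q=1} and odd prime moduli (for a prime {p}, every non-principal character is primitive and can be built from a primitive root); dropping the remaining moduli only decreases the left-hand side, so the inequality being tested is still a consequence of Remark 12.

```python
# Spot check of sum_{q<=Q} (q/phi(q)) sum*_chi |sum_n f(n) conj(chi(n))|^2
#   <= (N + Q^2) sum_n |f(n)|^2,
# restricted to q = 1 and the odd primes q <= Q.
import cmath, random

def e(x):
    return cmath.exp(2j * cmath.pi * x)

def primitive_chars(p):  # the p-2 primitive characters mod an odd prime p
    g = next(h for h in range(2, p)
             if len({pow(h, j, p) for j in range(1, p)}) == p - 1)
    dlog = {pow(g, j, p): j for j in range(p - 1)}  # discrete log base g
    return [{n: (e(k * dlog[n] / (p - 1)) if n else 0) for n in range(p)}
            for k in range(1, p - 1)]  # k = 0 would be the principal character

random.seed(2)
M, N, Q = 5, 60, 7
f = {n: complex(random.uniform(-1, 1), random.uniform(-1, 1))
     for n in range(M, M + N + 1)}

lhs = abs(sum(f.values())) ** 2  # q = 1: the trivial (primitive) character
for p in [3, 5, 7]:
    for chi in primitive_chars(p):
        S = sum(fn * chi[n % p].conjugate() for n, fn in f.items())
        lhs += p / (p - 1) * abs(S) ** 2
rhs = (N + Q ** 2) * sum(abs(fn) ** 2 for fn in f.values())
assert lhs <= rhs
```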

— 2. The Barban-Davenport-Halberstam theorem —

We now apply the large sieve inequality for characters to obtain an analogous inequality for arithmetic progressions, due independently to Barban, and to Davenport and Halberstam; we state a slightly weakened form of that theorem here. For any finitely supported arithmetic function {f: {\bf N} \rightarrow {\bf C}} and any primitive residue class {a\ (q)}, we introduce the discrepancy

\displaystyle  \Delta(f; a\ (q)) := \sum_{n: n = a\ (q)} f(n) - \frac{1}{\phi(q)} \sum_{n: (n,q)=1} f(n).

This quantity measures the extent to which {f} is well distributed among the primitive residue classes modulo {q}. From multiplicative Fourier inversion (see Theorem 69 from Notes 1) we have the identity

\displaystyle  \Delta(f; a\ (q)) = \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} \chi(a) \sum_n f(n) \overline{\chi(n)} \ \ \ \ \ (15)

where the sum is over non-principal characters {\chi} of modulus {q}.
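The expansion (15) is straightforward to verify numerically; the following sketch checks it for the modulus {q=5} (with characters generated from the primitive root {2}) and an arbitrary finitely supported {f}.

```python
# Check identity (15) for q = 5:
#   Delta(f; a (q)) = (1/phi(q)) sum_{chi != chi_0} chi(a) sum_n f(n) conj(chi(n))
import cmath, random

def e(x):
    return cmath.exp(2j * cmath.pi * x)

q = 5
dlog = {pow(2, j, q): j for j in range(q - 1)}  # discrete log base 2: {1:0, 2:1, 4:2, 3:3}
chars = [{n: (e(k * dlog[n] / (q - 1)) if n else 0) for n in range(q)}
         for k in range(q - 1)]  # k = 0 gives the principal character chi_0

random.seed(3)
f = {n: random.randint(-9, 9) for n in range(1, 100)}

for a in [1, 2, 3, 4]:
    direct = sum(v for n, v in f.items() if n % q == a) \
        - sum(v for n, v in f.items() if n % q != 0) / (q - 1)
    via_chars = sum(chi[a] * sum(v * chi[n % q].conjugate()
                                 for n, v in f.items())
                    for chi in chars[1:]) / (q - 1)  # skip chi_0
    assert abs(direct - via_chars) < 1e-6
```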

Theorem 13 (Barban-Davenport-Halberstam) Let {x > 2}, and let {f: {\bf N} \rightarrow {\bf C}} be a function supported on {[1,x]} with the property that

\displaystyle  \sum_n |f(n)|^2 \ll x \log^{O(1)} x \ \ \ \ \ (16)

and obeying the Siegel-Walfisz property

\displaystyle  \Delta(f 1_{(\cdot,s)=1}; a\ (r)) \ll_A x \log^{-A} x \ \ \ \ \ (17)

for any fixed {A>0}, any primitive residue class {a\ (r)}, and any {1 \leq s \leq x}. Then one has

\displaystyle  \sum_{q \leq Q} \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(f; a\ (q))|^2 \ll_A x^2 \log^{-A} x \ \ \ \ \ (18)

for any {A > 0}, provided that {Q \leq x \log^{-B} x} for some sufficiently large {B = B(A)} depending only on {A}.

Informally, (18) is asserting that

\displaystyle  \Delta(f; a\ (q)) \ll_A \frac{x}{\phi(q)} \log^{-A} x

for “most” primitive residue classes {a\ (q)} with {q} much smaller than {x}; in most applications, the trivial bounds on {\Delta(f; a\ (q))} are of the type {O( \frac{x}{\phi(q)} \log^{O(1)} x )}, so this represents a savings of an arbitrary power of a logarithm on the average. Note that a direct application of (17) only gives (18) for {Q} of size {\log^{O(1)} x}; it is the large sieve which allows for the significant enlargement of {Q}.

Proof: Let {x, f, Q, A} be as above, with {Q \leq x \log^{-B} x} for some large {B} to be chosen later. From (15) and the Plancherel identity one has

\displaystyle  \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(f; a\ (q))|^2 = \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n f(n) \overline{\chi(n)}|^2

so our task is to show that

\displaystyle  \sum_{q \leq Q} \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n f(n) \overline{\chi(n)}|^2 \ll_A x^2 \log^{-A} x. \ \ \ \ \ (19)

We cannot apply the large sieve inequality yet, because the characters {\chi} here are not necessarily primitive. But we may express any non-principal character {\chi(n)} as {\tilde \chi(n) 1_{(n,s)=1}} for some primitive character {\tilde \chi} of conductor {r>1}, where {r,s,t} are natural numbers with {rst = q}. In particular, {r,s,t \leq Q} and {\frac{1}{\phi(q)} \leq \frac{1}{\phi(s)} \frac{1}{\phi(r)} \frac{1}{\phi(t)}}. Thus we may (somewhat crudely) upper bound the left-hand side of (19) by

\displaystyle  \sum_{t \leq Q} \frac{1}{\phi(t)} \sum_{s \leq Q} \frac{1}{\phi(s)} \sum_{1 < r \leq Q} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2.

From Theorem 27 of Notes 1 we have {\sum_{s \leq Q} \frac{1}{\phi(s)} \ll \log x} (and similarly for the sum over {t}), so we may bound the above by

\displaystyle  (\log x)^2 \sup_{s \leq Q} \sum_{1 < r \leq Q} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2.

By dyadic decomposition (and adjusting {A} slightly), it thus suffices to show that

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2 \ll_A x^2 \log^{-A} x \ \ \ \ \ (20)

for any {1 \leq R \leq Q} and {s \leq Q}.

From Proposition 10 and (16), we may bound the left-hand side of (20) by {\frac{1}{R} (x + R^2) x \log^{O(1)} x}. If {R \geq \log^B x} and {R \leq Q \leq x \log^{-B} x}, then we obtain (20) if {B} is sufficiently large depending on {A}. The only remaining case to consider is when {R < \log^B x}. But from the Siegel-Walfisz hypothesis (17) we easily see that

\displaystyle  \sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)} \ll_{A'} R x \log^{-A'} x

for any {A' > 0} and any primitive character {\tilde \chi} of conductor {R < r \leq 2R}. Since the total number of primitive characters appearing in (20) is {O(R^2) = O( \log^{2B} x)}, the claim follows by taking {A'} large enough. \Box

One can specialise this to the von Mangoldt function:

Exercise 14 Use the Barban-Davenport-Halberstam theorem and the Siegel-Walfisz theorem (Exercise 64 from Notes 2) to conclude that

\displaystyle  \sum_{q \leq Q} \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))|^2 \ll_A x^2 \log^{-A} x \ \ \ \ \ (21)

for any {A > 0}, provided that {Q \leq x \log^{-B} x} for some sufficiently large {B = B(A)} depending only on {A}. Obtain a similar claim with {\Lambda} replaced by the Möbius function.

Remark 15 Recall that the implied constants in the Siegel-Walfisz theorem depended on {A} in an ineffective fashion. As such, the implied constants in (21) also depend ineffectively on {A}. However, if one replaces Siegel’s theorem by an effective substitute such as Tatuzawa’s theorem (see Theorem 62 of Notes 2) or the Landau-Page theorem (Theorem 53 of Notes 2), one can obtain an effective version of the Siegel-Walfisz theorem for all moduli {q} that are not multiples of a single exceptional modulus {q_*}. One can then obtain an effective version of (21) if one restricts to moduli {q} that are not multiples of {q_*}. Similarly for the Bombieri-Vinogradov theorem in the next section. Such variants of the Barban-Davenport-Halberstam theorem or Bombieri-Vinogradov theorem can be used as a substitute in some applications to remove any ineffective dependence of constants, at the cost of making the argument slightly more convoluted.

— 3. The Bombieri-Vinogradov theorem —

The Barban-Davenport-Halberstam theorem controls the discrepancy {\Delta(f; a\ (q))} after averaging in both the modulus {q} and the residue class {a}. For many problems in sieve theory, it turns out to be more important to control the discrepancy with an averaging only in the modulus {q}, with the residue class {a} being allowed to vary in {q} in the “worst-case” fashion. Specifically, one often wishes to control expressions of the form

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(f; a\ (q))| \ \ \ \ \ (22)

for some finitely supported {f: {\bf N} \rightarrow {\bf C}} and {Q>1}. This expression is difficult to control for arbitrary {f}, but it turns out that one can obtain a good bound if {f} is expressible as a Dirichlet convolution {f = \alpha*\beta} for some suitably “non-degenerate” sequences {\alpha,\beta}. More precisely, we have the following general form of the Bombieri-Vinogradov theorem, first articulated by Motohashi:

Theorem 16 (General Bombieri-Vinogradov theorem) Let {x > 2}, let {M,N \geq 1} be such that {MN \ll x}, and let {\alpha,\beta: {\bf N} \rightarrow {\bf C}} be arithmetic functions supported on {[1,M]}, {[1,N]} respectively, with

\displaystyle  \sum_m |\alpha(m)|^2 \ll M \log^{O(1)} x \ \ \ \ \ (23)


\displaystyle  \sum_n |\beta(n)|^2 \ll N \log^{O(1)} x. \ \ \ \ \ (24)

Suppose that {\beta} obeys the Siegel-Walfisz property

\displaystyle  \Delta(\beta 1_{(\cdot,s)=1}; a\ (r)) \ll_A N \log^{-A} x \ \ \ \ \ (25)

for all {A > 0}, all primitive residue classes {a\ (r)}, and all {1 \leq s \leq x}. Then one has

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\alpha * \beta; a\ (q))| \ll_{A} x \log^{-A} x \ \ \ \ \ (26)

for any {A > 0}, provided that {Q \leq x^{1/2} \log^{-B} x} and {M,N \geq \log^B x} for some sufficiently large {B = B(A)} depending on {A}.

Proof: We adapt the arguments of the previous section. From (15) and the triangle inequality, we have

\displaystyle  \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\alpha*\beta; a\ (q))| \leq \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n \alpha*\beta(n) \overline{\chi(n)}|

and so we can upper bound the left-hand side of (26) by

\displaystyle  \sum_{q \leq Q} \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n \alpha*\beta(n) \overline{\chi(n)}|.

As in the previous section, we may reduce {\chi} to primitive characters and bound this expression by

\displaystyle  \ll (\log x)^2 \sup_{s \leq Q} \sum_{1 < r \leq Q} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n \alpha*\beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|.

By dyadic decomposition (and adjusting {A} slightly), it thus suffices to show that

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n \alpha*\beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}| \ll_A x \log^{-A} x \ \ \ \ \ (27)

for all {1 \leq R \leq Q} and {s \leq Q}, and any {A>1}, assuming {Q \leq x^{1/2} \log^{-B} x} and {M,N \geq \log^B x} with {B} sufficiently large depending on {A}.

We cannot yet easily apply the large sieve inequality, because the character sums here are not squared. But we now crucially exploit the Dirichlet convolution structure using the identity (8), to factor {\sum_n \alpha*\beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}} as the product of {\sum_m \alpha(m) 1_{(m,s)=1} \overline{\tilde \chi(m)}} and {\sum_n \beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}}. From the Cauchy-Schwarz inequality, we may thus bound (27) by the geometric mean of

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_m \alpha(m) 1_{(m,s)=1} \overline{\tilde \chi(m)}|^2 \ \ \ \ \ (28)


\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n \beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2. \ \ \ \ \ (29)

Now we have the all-important square needed in the large sieve inequality. From (23), (24) and Proposition 10, we may bound (28) by

\displaystyle  \ll \frac{1}{R} (M + R^2) M \log^{O(1)} x \ \ \ \ \ (30)

and (29) by

\displaystyle  \ll \frac{1}{R} (N + R^2) N \log^{O(1)} x

and so (27) is bounded by

\displaystyle  \ll ( \frac{MN}{R} + M N^{1/2} + M^{1/2} N + R (MN)^{1/2} ) \log^{O(1)} x.

Since {MN \leq x}, we can write this as

\displaystyle  \ll ( \frac{1}{R} + \frac{1}{N^{1/2}} + \frac{1}{M^{1/2}} + \frac{R}{x^{1/2}} ) x \log^{O(1)} x.

Since {N,M \geq \log^B x} and {R \leq Q \leq x^{1/2} \log^{-B} x}, we obtain (27) if {R \geq \log^B x} and {B} is sufficiently large depending on {A}. The only remaining case to handle is when {R \leq \log^B x}. In this case, we can use the Siegel-Walfisz hypothesis (25) as in the previous section to bound (29) by {O_{A'}( N^2 \log^{-A'} x)} for any {A'>0}. Meanwhile, from (30), (28) is bounded by {O( M^2 \log^{O(B+1)} x )}. By taking {A'} sufficiently large, we conclude (27) in this case also. \Box

In analogy with Exercise 14, we would like to apply this general result to specific arithmetic functions, such as the von Mangoldt function {\Lambda} or the Möbius function {\mu}, and in particular to prove the following famous result of Bombieri and of A. I. Vinogradov (not to be confused with the better-known number theorist I. M. Vinogradov):

Theorem 17 (Bombieri-Vinogradov theorem) Let {x \geq 2}. Then one has

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_{A} x \log^{-A} x \ \ \ \ \ (31)

for any {A > 0}, provided that {Q \leq x^{1/2} \log^{-B} x} for some sufficiently large {B = B(A)} depending on {A}.

Informally speaking, the Bombieri-Vinogradov theorem asserts that for “almost all” moduli {q} that are significantly less than {x^{1/2}}, one has

\displaystyle  |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_A \frac{x}{\phi(q)} \log^{-A} x

for all primitive residue classes {a\ (q)} to this modulus. This should be compared with the Generalised Riemann Hypothesis (GRH), which gives the bound

\displaystyle  |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll x^{1/2} \log^2 x

for all {q \leq x^{1/2}}; see Exercise 48 of Notes 2. Thus one can view the Bombieri-Vinogradov theorem as an assertion that the GRH holds (with a slightly weaker error term) “on average”, at least insofar as the impact of GRH on the prime number theorem in arithmetic progressions is concerned.

The initial arguments of Bombieri and Vinogradov were somewhat complicated, in particular involving the explicit formula for {L}-functions (Exercise 45 of Notes 2); the modern proof of the Bombieri-Vinogradov theorem avoids this and proceeds instead through Theorem 16 (or a close cousin thereof). Note that this theorem generalises the Siegel-Walfisz theorem (Exercise 64 of Notes 2), which is equivalent to the special case of Theorem 17 when {Q = \log^{O(1)} x}.

The obvious thing to try when proving Theorem 17 using Theorem 16 is to use one of the basic factorisations of such functions into Dirichlet convolutions, e.g. {\Lambda = \mu * L}, and then to decompose that convolution into pieces {\alpha*\beta} of the form required in Theorem 16; we will refer to such convolutions as Type II convolutions, loosely following the terminology of Vaughan. However, one runs into a problem coming from the components of the factors {\mu, L} supported at small numbers (of size {n = O(\log^{O(1)} x)}), as the parameters {M,N} associated to those components cannot obey the conditions {MN \ll x}, {M, N \geq \log^B x}. Indeed, observe that any Type II convolution {\alpha * \beta} will necessarily vanish at primes of size comparable to {x}, and so one cannot possibly represent functions such as {\Lambda} or {\mu} purely in terms of such Type II convolutions.

However, it turns out that we can still decompose functions such as {\Lambda,\mu} into two types of convolutions: not just the Type II convolutions considered above, but also a further class of Type I convolutions {\alpha * \beta}, in which one of the factors, say {\beta}, is very slowly varying (or “smooth”) and supported on a very long interval, e.g. {\beta = 1_{[N,2N]}} for some large {N}; in such convolutions, {\alpha} is permitted to be concentrated arbitrarily close to {n=1}, and in particular a Type I convolution can be non-zero at primes comparable to {x}. It turns out that bounding the discrepancy of Type I convolutions is relatively easy, and this leads to a proof of Theorem 17.

We turn to the details. There are a number of decompositions of {\Lambda} or {\mu} that one could use to accomplish the desired task. One popular choice of decomposition is the Vaughan identity, which may be compared with the decompositions appearing in the Dirichlet hyperbola method (see Section 3 of Notes 1):

Lemma 18 (Vaughan identity) For any {U,V > 1}, one has

\displaystyle  \Lambda = \Lambda_{\leq V} + \mu_{\leq U} * L - \mu_{\leq U} * \Lambda_{\leq V} * 1 + \mu_{>U} * \Lambda_{>V} * 1 \ \ \ \ \ (32)

where {\Lambda_{\leq V}(n) :=\Lambda(n) 1_{n \leq V}}, {\Lambda_{> V}(n) :=\Lambda(n) 1_{n > V}}, {\mu_{\leq U}(n) :=\mu(n) 1_{n \leq U}}, and {\mu_{> U}(n) :=\mu(n) 1_{n > U}}.

In this decomposition, {U} and {V} are typically two small powers of {x} (e.g. {U=V=x^{1/5}}), although the exact choice of {U,V} is often not of critical importance. The terms {\mu_{\leq U} * L} and {\mu_{\leq U} * \Lambda_{\leq V} * 1} are “Type I” convolutions, while the term {\mu_{>U} * \Lambda_{>V} * 1} should be considered a “Type II” convolution. The term {\Lambda_{\leq V}} is a lower order error that is usually disposed of quite quickly. The Vaughan identity is already strong enough for many applications, but in some more advanced applications (particularly those in which one exploits the structure of triple or higher convolutions) it becomes convenient to use more sophisticated identities such as the Heath-Brown identity, which we will discuss in later notes.

Proof: Since {\mu = \mu_{\leq U} + \mu_{> U}} and {\Lambda = \Lambda_{\leq V} + \Lambda_{>V}}, we have

\displaystyle  \mu * \Lambda = \mu * \Lambda_{\leq V} + \mu_{\leq U} * \Lambda - \mu_{\leq U} * \Lambda_{\leq V} + \mu_{>U}* \Lambda_{>V}.

Convolving both sides by {1} and using the identities {\mu*1 =\delta} and {\Lambda*1=L}, we obtain the claim. \Box
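The Vaughan identity can also be confirmed numerically by brute force; the sketch below checks (32) termwise for all {n} up to {200}, with the arbitrary choice {U = V = 10}.

```python
# Brute-force check of the Vaughan identity (32) with U = V = 10:
#   Lambda = Lambda_{<=V} + mu_{<=U}*L - mu_{<=U}*Lambda_{<=V}*1
#                                      + mu_{>U}*Lambda_{>V}*1
import math

X, U, V = 200, 10, 10

def factor(n):  # prime factorisation as a dict {p: exponent}
    fs, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            fs[d] = fs.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        fs[n] = fs.get(n, 0) + 1
    return fs

def Lambda(n):  # von Mangoldt function
    fs = factor(n)
    return math.log(next(iter(fs))) if len(fs) == 1 else 0.0

def mu(n):  # Moebius function
    fs = factor(n)
    return 0 if any(c > 1 for c in fs.values()) else (-1) ** len(fs)

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

for n in range(2, X + 1):
    t1 = Lambda(n) if n <= V else 0.0
    t2 = sum(mu(d) * math.log(n // d) for d in divisors(n) if d <= U)
    t3 = sum(mu(d) * Lambda(m) for d in divisors(n) if d <= U
             for m in divisors(n // d) if m <= V)
    t4 = sum(mu(d) * Lambda(m) for d in divisors(n) if d > U
             for m in divisors(n // d) if m > V)
    assert abs(Lambda(n) - (t1 + t2 - t3 + t4)) < 1e-9
```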

Armed with this identity and Theorem 16, we may now finish off the proof of the Bombieri-Vinogradov theorem. We may assume that {x} is large (depending on {A}) as the claim is trivial otherwise. We apply the Vaughan identity (32) with {U=V=x^{1/5}} (actually for our argument below, any choice of {U,V} with {U,V \geq \log^B x} and {UV \leq x^{1/2} \log^{-B} x} would have sufficed). By the triangle inequality, it now suffices to establish (31) with {\Lambda} replaced by {\Lambda_{\leq V}}, {\mu_{\leq U} * L}, {\mu_{\leq U} * \Lambda_{\leq V}*1}, and {\mu_{>U}*\Lambda_{>V}*1}.

The term {\Lambda_{\leq V}} is easily disposed of: from the triangle inequality (and crudely bounding {\Lambda_{\leq V}} by {\log x 1_{\leq V}}) we see that

\displaystyle  \Delta( \Lambda_{\leq V} 1_{[1,x]}; a\ (q)) \ll (\frac{V}{\phi(q)} + 1) \log x

and the claim follows since {\sum_{q \leq Q} \frac{1}{\phi(q)} \ll \log x}, {V = x^{1/5}}, and {Q \leq x^{1/2}}.

Next, we deal with the Type II convolution {\mu_{>U} *\Lambda_{>V}*1}. The presence of the {1_{[1,x]}} cutoff is slightly annoying (it prevents one from directly applying Proposition 2), but we will deal with this by using the following finer-than-dyadic decomposition trick, originally due to Fouvry and to Fouvry-Iwaniec. We may replace {\mu_{>U} * 1} by {(\mu_{>U}*1) 1_{(U,x]}}, since the portion of {\mu_{>U}*1} on {(x,\infty)} has no contribution. We may similarly replace {\Lambda_{>V}} by {\Lambda 1_{(V,x]}}. Next, we set {\lambda := 1 + \log^{-A-100} x}, and decompose {(\mu_{>U}*1) 1_{(U,x]}} into {O(\log^{A+101} x)} components {\alpha}, each of which is supported in an interval {[M, \lambda M]} and bounded in magnitude by {\tau} for some {M \geq U}. We similarly decompose {\Lambda 1_{(V,x]}} into {O(\log^{A+101} x)} components {\beta} supported in an interval {[N, \lambda N]} and bounded in magnitude by {\log x} for some {N \geq V}. Thus {(\mu_{>U} * \Lambda_{>V} * 1) 1_{[1,x]}} can be decomposed into {O( \log^{2A+202} x )} terms of the form {(\alpha*\beta)1_{[1,x]}} for various {\alpha,\beta} that are components of {\mu_{>U}*1} and {\Lambda_{>V}} respectively.

If {MN > x} then {(\alpha*\beta)1_{[1,x]}} vanishes, so we may assume that {MN \leq x}. By construction we also have {M,N \geq x^{1/5}}, so in particular {M,N \geq \log^B x} if {B} depends only on {A} (recall we are assuming {x} to be large). If {MN < \lambda^{-2} x}, then {(\alpha*\beta)1_{[1,x]} = \alpha*\beta}. The bounds (23), (24) are clear (bounding {\mu_{>U}*1} in magnitude by {\tau}), and from the Siegel-Walfisz theorem we see that the {\beta} components obey the hypothesis (25). Thus by applying Theorem 16 (with {A} replaced by {3A+202}) we see that the total contribution of all the {\alpha*\beta} terms with {MN < \lambda^{-2} x} is acceptable.

It remains to control the total contribution of the {\alpha*\beta} terms with {\lambda^{-2} x \leq MN \leq x}. Note that for each {\alpha} there are only {O(1)} choices of {\beta} that are in this range, so there are only {O( \log^{A+101} x )} such terms {\alpha*\beta} to deal with. We then crudely bound

\displaystyle  \sup_{a \in ({\bf Z}/q{\bf Z})^\times} \Delta( (\alpha*\beta) 1_{[1,x]}; a\ (q) ) \ll \sup_{a \in ({\bf Z}/q{\bf Z})^\times} \sum_{n = a\ (q)} |\alpha| * |\beta|(n)

\displaystyle  \ll (\log x) \sup_{a \in ({\bf Z}/q{\bf Z})^\times} \sum_{M \leq m \leq \lambda M, N \leq n \leq \lambda N: mn = a\ (q)} \tau(m).

Since {MN \geq \lambda^{-2} x} and {q \leq x^{1/2}}, one has either {M \gg q} or {N \gg q}. If {N \gg q}, we observe that for each fixed {m} in the above sum, there are {O( \frac{N}{q} \log^{-A-100} x )} choices of {n} that contribute, so the double sum is {O( \frac{MN}{q} \log^{-2A-199} x ) = O( \frac{x}{q} \log^{-2A-199} x )} (using the mean value bounds for {\tau}). Thus we see that the total contribution of this case is at most

\displaystyle  \ll \log^{A+101} x \times \log x \times \log^{-2A-199} x \times \sum_{q \leq Q} \frac{x}{q}

which is acceptable.

Now we consider the case when {M \gg q}. For fixed {n} in the above sum, the sum in {m} can be bounded by {\log^{65} x \times \frac{M}{q} \log^{-A-100} x}, thanks to Corollary 26 below (taking, say, {c_1 =1/2} and {c_2 = 5/6}). The total contribution of this sum is then

\displaystyle  \ll \log^{A+101} x \times \log x \times \log^{65} x \times \log^{-A-100} x \times \log^{-A-100} x \times \sum_{q \leq Q} \frac{x}{q}

which is also acceptable.

Now let us consider a Type I term {\mu_{\leq U}*L}. From the triangle inequality we have

\displaystyle  |\Delta((\mu_{\leq U} * L) 1_{[1,x]}; a\ (q))| \leq \sum_{d: (d,q)=1} |\mu_{\leq U}(d)| |\Delta( L 1_{[1,x/d]}; a/d\ (q) )|.

We exploit the monotonicity of {L} via the following simple fact:

Exercise 19 Let {f: [y,x] \rightarrow {\bf R}} be a monotone function. Show that

\displaystyle  |\Delta( f 1_{[y,x]}; a\ (q) )| \ll |f(y)| + |f(x)|

for all primitive residue classes {a\ (q)}. (Hint: use Lemma 2 from Notes 1 and a change of variables.)

Applying this exercise, we see that the contribution of this term to (31) is {O( \sum_{q \leq Q} \sum_d |\mu_{\leq U}(d)| \log x ) = O( Q U \log x )}, which is acceptable since {Q \leq x^{1/2}} and {U = x^{1/5}}.

A similar argument for the Type I term {\mu_{\leq U} * \Lambda_{\leq V} * 1} (with {\mu_{\leq U} * \Lambda_{\leq V}} and {1} replacing {\mu_{\leq U}} and {L}) gives a contribution to (31) of

\displaystyle  \ll \sum_{q \leq Q} \sum_d |\mu_{\leq U} * \Lambda_{\leq V}|(d)

\displaystyle  \ll Q \left(\sum_d |\mu_{\leq U}|(d)\right) \left(\sum_m \Lambda_{\leq V}(m)\right)

\displaystyle  \ll Q U V

which is also acceptable since {Q \leq x^{1/2}} and {U=V=x^{1/5}}. This concludes the proof of Theorem 17.

Exercise 20 Strengthen the Bombieri-Vinogradov theorem by showing that

\displaystyle  \sum_{q \leq Q} \sup_{y \leq x} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,y]}; a\ (q))| \ll_{A} x \log^{-A} x

for all {x \geq 2} and {A>0}, if {Q \leq x^{1/2} \log^{-B} x} for some sufficiently large {B} depending on {A}. (Hint: at present {y} ranges over an uncountable number of values, but if one can round {y} to the nearest multiple of (say) {x \log^{-A-10} x} then there are only {O(\log^{A+10} x)} values of {y} in the supremum that need to be considered. Then use the original Bombieri-Vinogradov theorem as a black box.)

Exercise 21 For any {U, V > 1}, establish the Vaughan-type identity

\displaystyle  \mu = \mu_{\leq V} + \mu_{\leq U} - \mu_{\leq U} * \mu_{\leq V} * 1 + \mu_{>U} * \mu_{>V} * 1

and use this to show that the Bombieri-Vinogradov theorem continues to hold when {\Lambda} is replaced by {\mu}.
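As a sanity check on Exercise 21, the identity can be verified numerically by brute force on an initial segment (the range {N} and cutoffs {U,V} below are arbitrary illustrative choices, and the helper names are ours):

```python
# Brute-force check of the Vaughan-type identity
#   mu = mu_{<=V} + mu_{<=U} - mu_{<=U} * mu_{<=V} * 1 + mu_{>U} * mu_{>V} * 1
# on 1 <= n <= N, for illustrative cutoffs U, V.

N, U, V = 500, 10, 20

def mobius(n):
    """Moebius function via trial division."""
    result, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0  # square factor
            result = -result
        d += 1
    return result if n == 1 else -result

mu = [0] + [mobius(n) for n in range(1, N + 1)]

def dirichlet(f, g):
    """Dirichlet convolution of two sequences indexed 0..N."""
    h = [0] * (N + 1)
    for a in range(1, N + 1):
        for b in range(1, N // a + 1):
            h[a * b] += f[a] * g[b]
    return h

one = [0] + [1] * N
mu_le_U = [0] + [mu[n] if n <= U else 0 for n in range(1, N + 1)]
mu_le_V = [0] + [mu[n] if n <= V else 0 for n in range(1, N + 1)]
mu_gt_U = [mu[n] - mu_le_U[n] for n in range(N + 1)]
mu_gt_V = [mu[n] - mu_le_V[n] for n in range(N + 1)]

lhs = mu
term3 = dirichlet(dirichlet(mu_le_U, mu_le_V), one)
term4 = dirichlet(dirichlet(mu_gt_U, mu_gt_V), one)
rhs = [mu_le_V[n] + mu_le_U[n] - term3[n] + term4[n] for n in range(N + 1)]
assert lhs[1:] == rhs[1:]
```

The check mirrors the algebraic proof: expanding {\mu_{>U} * \mu_{>V} * 1 = (\mu - \mu_{\leq U})*(\mu - \mu_{\leq V})*1} and using {\mu * 1 = \delta} collapses three of the four cross terms.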

Exercise 22 Let us say that {0 < \theta < 1} is a level of distribution for the von Mangoldt function {\Lambda} if one has

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_{A,\theta} x \log^{-A} x

whenever {A>0} and {Q \leq x^\theta}; thus, for instance, the Bombieri-Vinogradov theorem implies that every {0 < \theta < 1/2} is a level of distribution for {\Lambda}. Use the Cramér random model (see Section 1 of Supplement 4) to predict that every {0 < \theta < 1} is a level of distribution for {\Lambda}; this claim is known as the Elliott-Halberstam conjecture, and would have a number of consequences in sieve theory. Unfortunately, no level of distribution above {1/2} (or even at {1/2}) is currently known; however, there are weaker versions of the Elliott-Halberstam conjecture known with levels above {1/2} which do have some interesting number-theoretic consequences, and we will return to this point in later notes. For now, we will just remark that {1/2} appears to be the limit of what one can do by using the large sieve and Dirichlet character methods in this set of notes, and all the advances beyond {1/2} have had to rely on other tools (such as exponential sum estimates), although the Cauchy-Schwarz inequality remains an indispensable tool in all of these results.
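The discrepancies {\Delta(\Lambda 1_{[1,x]}; a\ (q))} appearing in this definition are easy to compute directly for small parameters. The following sketch (with an illustrative {x = 10^5} and a few small moduli; the helper names are ours) confirms that the discrepancy is tiny compared to the main term {x/\phi(q)}:

```python
import math

def von_mangoldt_sieve(x):
    """Lambda(n) for 0 <= n <= x, via a smallest-prime-factor sieve."""
    spf = list(range(x + 1))
    for p in range(2, int(x ** 0.5) + 1):
        if spf[p] == p:  # p is prime
            for m in range(p * p, x + 1, p):
                if spf[m] == m:
                    spf[m] = p
    Lam = [0.0] * (x + 1)
    for n in range(2, x + 1):
        p, m = spf[n], n
        while m % p == 0:
            m //= p
        if m == 1:  # n = p^k is a prime power
            Lam[n] = math.log(p)
    return Lam

x = 10 ** 5
Lam = von_mangoldt_sieve(x)

def discrepancy(q, a):
    """Delta(Lambda 1_[1,x]; a (q)): class sum minus the coprime average."""
    in_class = sum(Lam[n] for n in range(1, x + 1) if n % q == a)
    coprime = sum(Lam[n] for n in range(1, x + 1) if math.gcd(n, q) == 1)
    phi_q = sum(1 for b in range(1, q + 1) if math.gcd(b, q) == 1)
    return in_class - coprime / phi_q

# the discrepancy is far below the main term x/phi(q) (at least x/4 here)
for q, a in [(3, 1), (3, 2), (4, 1), (5, 2)]:
    assert abs(discrepancy(q, a)) < 0.05 * x
```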

Exercise 23 Strengthen the Bombieri-Vinogradov theorem by showing that

\displaystyle  \sum_{q \leq Q} \tau(q)^C \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_{A,C} x \log^{-A} x

for all {x \geq 2}, {C \geq 1}, and {A>0}, if {Q \leq x^{1/2} \log^{-B} x} for some sufficiently large {B} depending on {A,C}. (Hint: use the Cauchy-Schwarz inequality and the trivial bound {|\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll \frac{x}{q} \log x}, together with elementary estimates on such sums as {\sum_{q \leq Q} \frac{\tau(q)^{2C}}{q}}.)
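For instance, in the case {C=1}, the elementary estimate in the hint can be checked numerically; the crude bound {\sum_{q \leq Q} \tau(q)^2/q \leq (\log Q + 1)^4} asserted below follows from {\tau(q)^2 \leq \tau_4(q)} and {\sum_{q \leq Q} \tau_4(q)/q \leq (\sum_{m \leq Q} 1/m)^4}:

```python
import math

Q = 10 ** 4

# tau(q) for all q <= Q via a divisor-count sieve
tau = [0] * (Q + 1)
for d in range(1, Q + 1):
    for m in range(d, Q + 1, d):
        tau[m] += 1

s = sum(tau[q] ** 2 / q for q in range(1, Q + 1))

# upper bound: tau(q)^2 <= tau_4(q), and sum tau_4(q)/q <= H_Q^4 <= (log Q + 1)^4
assert s <= (math.log(Q) + 1) ** 4
# lower bound: the sum dominates the harmonic sum, hence log Q
assert s >= math.log(Q)
```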

Exercise 24 Show that the Bombieri-Vinogradov theorem continues to hold if the von Mangoldt function {\Lambda} is replaced by the indicator function {{\mathcal P}} of the primes.

— 4. Appendix: the divisor function in arithmetic progressions —

We begin with a lemma of Landreau that controls the divisor function by a short divisor sum.

Lemma 25 For any {\theta > 0} one has

\displaystyle  \tau(n) \leq 2^{2/\theta} \sum_{d|n: d \leq n^\theta} \tau(d)^{2/\theta}

for all natural numbers {n}.

Proof: We write {n = qm}, where {q} is the product of all the prime factors of {n} that are greater than or equal to {n^{\theta/2}}, and {m} is the product of all the prime factors of {n} that are less than {n^{\theta/2}}, counting multiplicity of course. By a greedy algorithm, we can repeatedly pull out factors of {m} of size between {n^{\theta/2}} and {n^\theta} until the remaining portion of {m} falls below {n^\theta}, yielding a factorisation of the form {m = n_1 \dots n_r} where {n_1,\dots,n_{r-1}} lie between {n^{\theta/2}} and {n^\theta}, and {n_r} is at most {n^\theta}. The lower bounds on {n_1,\dots,n_{r-1}} imply that {r-1 \leq 2/\theta}. By the trivial inequality {\tau(ab) \leq \tau(a) \tau(b)} we have

\displaystyle  \tau(n) \leq \tau(q) \tau(m) \leq 2^{2/\theta} \tau(n_1) \ldots \tau(n_r)

and thus by the pigeonhole principle one has

\displaystyle  \tau(n) \leq 2^{2/\theta} \tau(n_j)^r

for some {j}. Since {n_j} is a factor of {n} that is at most {n^\theta}, this gives the claim as long as {r \leq 2/\theta}. If {r > 2/\theta}, then as each of the {n_1,\dots,n_{r-1}} are at least {n^{\theta/2}}, we see that {r-1 \leq 2/\theta} and {n_{r-1} n_r} cannot exceed {n^\theta}. If we now use the inequality

\displaystyle  \tau(n) \leq 2^{2/\theta} \tau(n_1) \dots \tau(n_{r-2}) \tau(n_{r-1} n_r)

and repeat the pigeonholing argument, we again obtain the claim. \Box
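Lemma 25 is easy to test by brute force; here is a sketch that verifies the inequality with the illustrative choice {\theta = 1/2} (so {2/\theta = 4} and the constant is {2^4 = 16}) over a small range of {n}:

```python
def tau(n):
    """Number of divisors of n, by trial factorisation."""
    count, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            e = 0
            while n % d == 0:
                n //= d
                e += 1
            count *= e + 1
        d += 1
    return count * (2 if n > 1 else 1)

def landreau_rhs(n, theta):
    """2^{2/theta} * sum over divisors d <= n^theta of tau(d)^{2/theta}."""
    bound = int(n ** theta + 1e-9)
    return 2 ** (2 / theta) * sum(
        tau(d) ** (2 / theta) for d in range(1, bound + 1) if n % d == 0
    )

theta = 0.5
for n in range(1, 3001):
    assert tau(n) <= landreau_rhs(n, theta)
```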

Corollary 26 Let {0 < c_1 < c_2 < 1}, let {x \geq 1}, let {x^{c_2} \leq y \leq x}, and let {a\ (q)} be a primitive residue class with {q \leq x^{c_1}}. Then

\displaystyle  \sum_{x \leq n\leq x+y: n=a\ (q)} \tau(n) \ll_{c_1,c_2} \frac{y}{q} \log^{2^{\frac{2}{c_2-c_1}}+1} x. \ \ \ \ \ (33)

One can lower the exponent of the logarithm here to {\log x} (consistent with the heuristic that the average value of {\tau(n)} for {n=O(x)} is {O(\log x)}), but this requires additional arguments; see for instance this previous blog post. In contrast, the divisor bound {\tau(n)=O(n^{o(1)})} only gives an upper bound of {x^{o(1)} \frac{y}{q}} here.
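The heuristic that {\tau(n)} averages about {\log x} for {n = O(x)}, even along a primitive residue class in a short interval, can also be checked numerically; the parameters below ({x = 10^5}, {y = x^{4/5}}, {q=7}, {a=3}) are illustrative choices consistent with the hypotheses of Corollary 26:

```python
import math

def tau(n):
    """Number of divisors of n, by trial factorisation."""
    count, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            e = 0
            while n % d == 0:
                n //= d
                e += 1
            count *= e + 1
        d += 1
    return count * (2 if n > 1 else 1)

# short interval [x, x+y] with y = x^{4/5}, progression a (q) with q <= x^{1/5}
x, q, a = 10 ** 5, 7, 3
y = int(x ** 0.8)

values = [tau(n) for n in range(x, x + y + 1) if n % q == a]
average = sum(values) / len(values)

# the average divisor value near x is comparable to log x
assert 0.5 * math.log(x) < average < 3 * math.log(x)
```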

Proof: We allow implied constants to depend on {c_1,c_2}. Set {\theta := c_2-c_1}. Using Lemma 25 we have

\displaystyle  \tau(n) \ll \sum_{d|n: d \leq (x+y)^\theta} \tau(d)^{2/\theta}

for all {n \leq x+y}, and so the left-hand side of (33) is bounded by

\displaystyle  \ll \sum_{d \leq (x+y)^\theta} \tau(d)^{2/\theta} \sum_{x \leq n\leq x+y: n=a\ (q); d|n} 1.

The conditions {n=a\ (q)}, {d|n} constrain {n} to either the empty set, or an arithmetic progression of modulus {qd \leq x^{c_1} (x+y)^\theta \ll y} (note that if {d} shares a common factor with {q} then the set is empty, since {a\ (q)} is primitive), so the inner sum is {O(\frac{y}{qd})}. On the other hand, from Theorem 27 (or Exercise 29) of Notes 1 we have

\displaystyle  \sum_{d \leq (x+y)^\theta} \frac{\tau(d)^{2/\theta}}{d} \ll \log^{2^{2/\theta} + 1} x,

and the claim follows. \Box

Exercise 27 Under the same assumptions as in Corollary 26, establish the bound

\displaystyle  \sum_{x \leq n\leq x+y: n=a\ (q)} \tau^k(n) \ll_{c_1,c_2,k} \frac{y}{q} \log^{O_{c_1,c_2,k}(1)} x

for any {k \geq 1}.

Filed under: 254A - analytic prime number theory, math.NT Tagged: almost orthogonality, Bombieri-Vinogradov theorem, Dirichlet character, large sieve, Vaughan identity

Chad OrzelOwnership of the Means of Adjudication

Back on Thursday when I was waiting to be annoyed by a speech, one of the ways I passed time was reading stuff on my phone, which included this Grantland piece about Charles Barkley and “advanced stats”. In it, Bryan Curtis makes the argument that while Barkley’s recent comments disparaging statistical tools seem at first like just the same old innumeracy, it’s really a question of ownership.

But Barkley was firing a shot in a second war. Let’s call it Moneyball II. This clash doesn’t pit a blogger versus a newspaperman in a debate over the value of PER. It pits media versus athletes in a battle over who gets to tell the story of basketball. “I viewed Charles Barkley’s comments as being completely about media criticism, not about how a team is run,” said Craig Calcaterra, who blogs at HardballTalk. “If Barkley were still playing and a coach came to him and said, ‘Here’s something we discovered in our analytics department,’ I’m sure he’d be receptive to it. But he doesn’t want to hear someone in the media second-guessing his authority about basketball.”

This general theme is echoed through a lot of the sillier pseudo-controversies surrounding sports these days. Curtis briefly mentions Kevin Durant turning hostile, but doesn’t specifically mention Marshawn Lynch (probably because he’s framing the story in terms of the NBA, not the NFL). Lynch is probably the best example, though, because that lets you see that both sides can be really petty– Curtis wrote about Lynch in the run-up to the Super Bowl, when he was getting blasted for refusing to talk to the media, and hits most of the high points. Lynch gets flack because reporters feel he’s violating the unspoken agreement inherent in sports media: athletes smile and answer dumb questions, and reporters provide free advertising for the teams and the league.

But in another sense, this is just the same argument about ownership, seen from the other side. Ex-players like Barkley are trying to preserve their historical privilege as “expert” commentators based on having played the game, while statisticians are pushing the primacy of data. NFL reporters are demanding their traditional right to shape the narrative around the game, while Lynch and a handful of others refuse to play along. In both cases, the people whose traditional prerogatives are being threatened are getting bent out of shape over it.

I mention this in the context of Thursday’s annoying speech, because the idea of a conflict over who gets to tell the story resonated in an odd way with a lot of my reactions to that speech. In particular, the contrast between the very traditional “high culture” stuff held up in the speech (and, for that matter, the fact that we always get classical music performances at these things) and the joking mention of particularly inane Kanye West lyrics seemed like a really stark example of drawing an arbitrary line between culture with enduring value, and culture that elites should point at and laugh.

It’s not a perfect analogy, of course, because there are plenty of folks in the collection of disciplines dubbed “the humanities” who take pop culture as their area of study, and mine that for some useful insight. I wonder, though, if a lot of the anxiety about a “crisis in the humanities” isn’t really this same kind of anxiety about who gets to tell the story. Or, rather, who gets to decide what stories are worth telling.

And, of course, there’s an even more direct parallel with the ever-popular topic of Scientists vs. Journalists. That conflict is very directly and obviously about who gets to tell the story of science, with scientists in the Charles Barkley role of claiming special expertise in deciding what’s worth talking about. You don’t find a lot of Marshawn Lynches in this one, but there are more than a few Kevin Durants ready to declare publicly that writers “don’t know shit” about their topics.

So, anyway, that’s your Information Supercollider moment for the week, in which the odd mix of stuff I read via social media bounces around making weird connections. Not sure how well this holds up, but it’s a thing I’ve been toying with, and might as well be a blog post…

ResonaancesWeekend Plot: Bs mixing phase update

Today's featured plot was released last week by the LHCb collaboration:

It shows the CP violating phase in Bs meson mixing, denoted as φs,  versus the difference of the decay widths between the two Bs meson eigenstates. The interest in φs comes from the fact that it's  one of the precious observables that 1) is allowed by the symmetries of the Standard Model, 2) is severely suppressed due to the CKM structure of flavor violation in the Standard Model. Such observables are a great place to look for new physics (other observables in this family include Bs/Bd→μμ, K→πνν, ...). New particles, even too heavy to be produced directly at the LHC, could produce measurable contributions to φs as long as they don't respect the Standard Model flavor structure. For example, a new force carrier with a mass as large as 100-1000 TeV and order 1 flavor- and CP-violating coupling to b and s quarks would be visible given the current experimental precision. Similarly, loops of supersymmetric particles with 10 TeV masses could show up, again if the flavor structure in the superpartner sector is not aligned with that in the  Standard Model.

The phase φs can be measured in certain decays of neutral Bs mesons where the process involves an interference of direct decays and decays through oscillation into the anti-Bs meson. Several years ago measurements at Tevatron's D0 and CDF experiments suggested a large new physics contribution. The mild excess has gone away since, like many other such hints.  The latest value quoted by LHCb is φs = - 0.010 ± 0.040, which combines earlier measurements of the Bs → J/ψ π+ π- and  Bs → Ds+ Ds- decays with  the brand new measurement of the Bs → J/ψ K+ K- decay. The experimental precision is already comparable to the Standard Model prediction of φs = - 0.036. Further progress is still possible, as the Standard Model prediction can be computed to a few percent accuracy.  But the room for new physics here is getting tighter and tighter.

Jordan EllenbergIdle question: are Kakeya sets winning?

Jayadev Athreya was here last week and reminded me about this notion of “winning sets,” which I learned about from Howie Masur — originally, one of the many contributions of Wolfgang Schmidt.

Here’s a paper by Curt McMullen introducing a somewhat stronger notion, “absolute winning.”

Anyway:  a winning set (or an absolute winning set) in R^n is “big” in some sense.  In particular, it has to have full Hausdorff dimension, but it doesn’t have to have positive measure.

Kakeya sets (subsets of R^n containing a unit line segment in every direction) can have measure zero, by the Besicovitch construction, and are conjectured (when n=2, known) to have Hausdorff dimension n.  So should we expect these sets to be winning?  Are Besicovitch sets winning?

I have no reason to need to know.  I just think these refined classifications of sets which are measure 0 yet still “large” are very interesting.  And for all I know, maybe there are sets where the easiest way to prove they have full Hausdorff dimension is to prove they’re winning!



Scott AaronsonHow can we fight online shaming campaigns?

Longtime friend and colleague Boaz Barak sent me a fascinating New York Times Magazine article that profiles people who lost their jobs or otherwise had their lives ruined, because of a single remark that then got amplified a trillionfold in importance by social media.  (The author, Jon Ronson, also has a forthcoming book on the topic.)  The article opens with Justine Sacco: a woman who, about to board a flight to Cape Town, tweeted “Going to Africa.  Hope I don’t get AIDS.  Just kidding.  I’m white!”

To the few friends who read Sacco’s Twitter feed, it would’ve been obvious that she was trying to mock the belief of many well-off white people that they live in a bubble, insulated from the problems of the Third World; she wasn’t actually mocking black Africans who suffer from AIDS.  In a just world, maybe Sacco deserved someone to take her aside and quietly explain that her tweet might be read the wrong way, that she should be more careful next time.  Instead, by the time she landed in Cape Town, she learned that she’d become the #1 worldwide Twitter trend and a global symbol of racism.  She lost her career, she lost her entire previous life, and tens of thousands of people expressed glee about it.  The article rather heartbreakingly describes Sacco’s attempts to start over.

There are many more stories like the above.  Some I’d already heard about: the father of three who lost his job after he whispered a silly joke involving “dongles” to the person next to him at a conference, whereupon Adria Richards, a woman in front of him, snapped his photo and posted it to social media, to make an example of him as a sexist pig.  (Afterwards, a counter-reaction formed, which successfully got Richards fired from her job: justice??)  Other stories I hadn’t heard.

Reading this article made it clear to me just how easily I got off, in my own recent brush with the online shaming-mobs.  Yes, I made the ‘mistake’ of writing too openly about my experiences as a nerdy male teenager, and the impact that one specific aspect of feminist thought (not all of feminism!) had had on me.  Within the context of the conversation that a few nerdy men and women were having on this blog, my opening up led to exactly the results I was hoping for: readers thoughtfully sharing their own experiences, a meaningful exchange of ideas, even (dare I say it?) glimmers of understanding and empathy.

Alas, once the comment was wrested from its original setting into the clickbait bazaar, the story became “MIT professor explains: the real oppression is having to learn to talk to women” (the title of Amanda Marcotte’s hit-piece, something even some in Marcotte’s ideological camp called sickeningly cruel).  My photo was on the front page of Salon, next to the headline “The plight of the bitter nerd.”  I was subjected to hostile psychoanalysis not once but twice on ‘Dr. Nerdlove,’ a nerd-bashing site whose very name drips with irony, rather like the ‘Democratic People’s Republic of Korea.’  There were tweets and blog comments that urged MIT to fire me, that compared me to a mass-murderer, and that “deduced” (from first principles!) all the ways in which my parents screwed up in raising me and my female students cower in fear of me.   And yes, when you Google me, this affair now more-or-less overshadows everything else I’ve done in my life.

But then … there were also hundreds of men and women who rose to my defense, and they were heavily concentrated among the people I most admire and respect.  My supporters ranged from the actual female students who took my classes or worked with me or who I encouraged in their careers, from whom there was only kindness, not a single negative word; to the shy nerds who thanked me for being one of the only people to acknowledge their reality; to the lesbians and bisexual women who told me my experience also resonated with them; to the female friends and colleagues who sent me notes urging me to ignore the nonsense.  In the end, not only have I not lost any friends over this, I’ve gained new ones, and I’ve learned new sides of the friends I had.

Oh, and I didn’t get any death threats: I guess that’s good!  (Once in my life I did get death threats—graphic, explicit threats, about which I had to contact the police—but it was because I refused to publicize someone’s P=NP proof.)

Since I was away from campus when this blew up, I did feel some fear about the professional backlash that would await me on my return.  Would my office be vandalized?  Would activist groups be protesting my classes?  Would MIT police be there to escort me from campus?

Well, you want to know what happened instead?  Students and colleagues have stopped me in the hall, or come by my office, just to say they support me.  My class has record enrollment this term.  I was invited to participate in MIT’s Diversity Summit, since the organizers felt it would mean a lot to the students to see someone there who had opened up about diversity issues in STEM in such a powerful way.  (I regretfully had to decline, since the summit conflicted with a trip to Stanford.)  And an MIT graduate women’s reading group invited me for a dinner discussion (at my suggestion, Laurie Penny participated as well).  Imagine that: not only are MIT’s women’s groups not picketing me, they’re inviting me over for dinner!  Is there any better answer to the claim, urged on me by some of my overzealous supporters, that the bile of Amanda Marcotte represents all of feminism these days?

Speaking of which, I met Laurie Penny for coffee last month, and she and I quickly hit it off.  We’ve even agreed to write a joint blog post about our advice for shy nerds.  (In my What I Believe post, I had promised a post of advice for shy female nerds—but at Laurie’s urging, we’re broadening the focus to shy nerds of both sexes.)  Even though Laurie’s essay is the thing that brought me to the attention of the Twitter-mobs (which wasn’t Laurie’s intent!), and even though I disagreed with several points in her essay, I knew on reading it that Laurie was someone I’d enjoy talking to.  Unlike so much writing by online social justice activists, which tends to be encrusted with the specialized technical terms of that field—you know, terms like “asshat,” “shitlord,” “douchecanoe,” and “precious feefees of entitled white dudes”—Laurie’s prose shone with humanity and vulnerability: her own, which she freely shared, and mine, which she generously acknowledged.

Overall, the response to my comment has never made me happier or more grateful to be part of the STEM community (I never liked the bureaucratic acronym “STEM,” but fine, I’ll own it).  To many outsiders, we STEM nerds are a sorry lot: we’re “sperglords” (yes, slurs are fine, as long as they’re directed against the right targets!) who might be competent in certain narrow domains, but who lack empathy and emotional depth, and are basically narcissistic children.  Yet somehow when the chips were down, it’s my fellow STEM nerds, and people who hang out with STEM nerds a lot, who showed me far more empathy and compassion than many of the “normals” did.  So if STEM nerds are psychologically broken, then I say: may I surround myself, for the rest of my life, with men and women who are psychologically broken like I am.  May I raise Lily, and any future children I have, to be as psychologically broken as they can be.  And may I stay as far as possible from anyone who’s too well-adjusted.

I reserve my ultimate gratitude for the many women in STEM, friends and strangers alike, who sent me messages of support these past two months.  I’m not ashamed to say it: witnessing how so many STEM women stood up for me has made me want to stand up for them, even more than I did before.  If they’re not called on often enough in class, I’ll call on them more.  If they’re subtly discouraged from careers in science, I’ll blatantly encourage them back.  If they’re sexually harassed, I’ll confront their harassers myself (well, if asked to).  I will listen to them, and I will try to improve.

Is it selfish that I want to help female STEM nerds partly because they helped me?  Here’s the thing: one of my deepest moral beliefs is in the obligation to fight for those among the disadvantaged who don’t despise you, and who wouldn’t gladly rid the planet of everyone like you if they could.  (As I’ve written before, on issue after issue, this belief makes me a left-winger by American standards, and a right-winger by academic ones.)  In the present context, I’d say I have a massive moral obligation toward female STEM nerds and toward Laurie Penny’s version of feminism, and none at all toward Marcotte’s version.

All this is just to say that I’m unbelievably lucky—privileged (!)—to have had so many at MIT and elsewhere willing to stand up for me, and to have reached a stage in life where I’m strong enough to say what I think and to weather anything the Internet says back.  What worries me is that others, more vulnerable, didn’t and won’t have it as easy when the Twitter hate-machine turns its barrel on them.  So in the rest of this post, I’d like to discuss the problem of what to do about social-media shaming campaigns that aim to, and do, destroy the lives of individuals.  I’m convinced that this is a phenomenon that’s only going to get more and more common: something sprung on us faster than our social norms have evolved to deal with it.  And it would be nice if we could solve it without having to wait for a few high-profile suicides.

But first, let me address a few obvious questions about why this problem is even a problem at all.

Isn’t social shaming as old as society itself—and permanent records of the shaming as old as print media?

Yes, but there’s also something fundamentally new about the problem of the Twitter-mobs.  Before, it would take someone—say, a newspaper editor—to make a conscious decision to the effect, “this comment is worth destroying someone’s life over.”  Today, there might be such an individual, but it’s also possible for lives to be destroyed in a decentralized, distributed fashion, with thousands of Twitterers collaborating to push a non-story past the point of no return.  And among the people who “break” the story, not one has to intend to ruin the victim’s life, or accept responsibility for it afterward: after all, each one made the story only ε bigger than it already was.  (Incidentally, this is one reason why I haven’t gotten a Twitter account: while it has many worthwhile uses, it’s also a medium that might as well have been designed for mobs, for ganging up, for status-seeking among allies stripped of rational arguments.  It’s like the world’s biggest high school.)

Don’t some targets of online shaming campaigns, y’know, deserve it?

Of course!  Some are genuine racists or misogynists or homophobes, who once would’ve been able to inflict hatred their entire lives without consequence, and were only brought down thanks to social media.  The trouble is, the participants in online shaming campaigns will always think they’re meting out righteous justice, whether they are or aren’t.  But there’s an excellent reason why we’ve learned in modern societies not to avenge even the worst crimes via lynch mobs.  There’s a reason why we have trials and lawyers and the opportunity for the accused to show their innocence.

Some might say that no safeguards are possible or necessary here, since we’re not talking about state violence, just individuals exercising their free speech right to vilify someone, demand their firing, that sort of thing.  Yet in today’s world, trial-by-Internet can be more consequential than the old kind of trial: would you rather spend a year in jail, but then be free to move to another town where no one knew about it, or have your Google search results tarnished with lurid accusations (let’s say, that you molested children) for the rest of your life—to have that forever prevent you from getting a job or a relationship, and have no way to correct the record?  With trial by Twitter, there’s no presumption of innocence, no requirement to prove that any other party was harmed, just the law of the schoolyard.

Whether shaming is justified in a particular case is a complicated question, but for whatever it’s worth, here are a few of the questions I would ask:

  • Did the person express a wish for anyone (or any group of people) to come to harm, or for anyone’s rights to be infringed?
  • Did the person express glee or mockery about anyone else’s suffering?
  • Did the person perpetrate a grievous factual falsehood—like, something one could prove was a falsehood in a court of law?
  • Did the person violate anyone else’s confidence?
  • How much does the speaker’s identity matter?  If it had been a man rather than a woman (or vice versa) saying parallel things, would we have taken equal offense?
  • Does the comment have what obscenity law calls “redeeming social value”?  E.g., does it express an unusual viewpoint, or lead to an interesting discussion?

Of course, even in those cases where shaming campaigns are justified, they’ll sometimes be unproductive and ill-advised.

Aren’t society’s most powerful fair targets for public criticism, even mocking or vicious criticism?

Of course.  Few would claim, for example, that we have an ethical obligation to ease up on Todd Akin over his “legitimate rape” remarks, since all the rage might give Akin an anxiety attack.  Completely apart from the (de)merits of the remarks, we accept that, when you become (let’s say) an elected official, a CEO, or a university president, part of the bargain is that you no longer get to complain if people organize to express their hatred of you.

But what’s striking about the cases in the NYT article is that it’s not public figures being gleefully destroyed: just ordinary people who in most cases, made one ill-advised joke or tweet, no worse than countless things you or I have probably said in private among friends.  The social justice warriors try to justify what would otherwise look like bullying by shifting attention away from individuals: sure, Justine Sacco might be a decent person, but she stands for the entire category of upper-middle-class, entitled white women, a powerful structural force against whom the underclass is engaged in a righteous struggle.  Like in a war, the enemy must be fought by any means necessary, even if it means picking off one hapless enemy foot-soldier to make an example to the rest.  And anyway, why do you care more about this one professional white woman, than about the millions of victims of racism?  Is it because you’re a racist yourself?

I find this line of thinking repugnant.  For it perverts worthy struggles for social equality into something callous and inhuman, and thereby undermines the struggles themselves.  It seems to me to have roughly the same relation to real human rights activism as the Inquisition did to the ethical teachings of Jesus.  It’s also repugnant because of its massive chilling effect: watching a few shaming campaigns is enough to make even the most well-intentioned writer want to hide behind a pseudonym, or only offer those ideas and experiences that are sure to win approval.  And the chilling effect is not some accidental byproduct; it’s the goal.  This negates what, for me, is a large part of the promise of the Internet: that if people from all walks of life can just communicate openly, everything made common knowledge, nothing whispered or secondhand, then all the well-intentioned people will eventually come to understand each other.

If I’m right that online shaming of decent people is a real problem that’s only going to get worse, what’s the solution?  Let’s examine five possibilities.

(1) Libel law.  For generations, libel has been recognized as one of the rare types of speech that even a liberal, democratic society can legitimately censor (along with fraud, incitement to imminent violence, national secrets, child porn, and a few others).  That libel is illegal reflects a realistic understanding of the importance of reputation: if, for example, CNN falsely reports that you raped your children, then it doesn’t really matter if MSNBC later corrects the record; your life as you knew it is done.

The trouble is, it’s not clear how to apply libel law in the age of social media.  In the cases we’re talking about, an innocent person’s life gets ruined because of the collective effect of thousands of people piling on to make nasty comments, and it’s neither possible nor desirable to prosecute all of them.  Furthermore, in many cases the problem is not that the shamers said anything untrue: rather, it’s that they “merely” took something true and spitefully misunderstood it, or blew it wildly, viciously, astronomically out of proportion.  I don’t see any legal remedies here.

(2) “Shame the shamers.”  Some people will say the only answer is to hit the shamers with their own weapons.  If an overzealous activist gets an innocent jokester fired from his job, shame the activist until she’s fired from her job.  If vigilantes post the jokester’s home address on the Internet with crosshairs overlaid, find the vigilantes’ home addresses and post those.  It probably won’t surprise many people that I’m not a fan of this solution.  For it only exacerbates the real problem: that of mob justice overwhelming reasoned debate.  The most I can say in favor of vigilantism is this: you probably don’t get to complain about online shaming, if what you’re being shamed for is itself a shaming campaign that you prosecuted against a specific person.

(In a decade writing this blog, I can think of exactly one case where I engaged in what might be called a shaming campaign: namely, against the Bell’s inequality denier Joy Christian.  Christian had provoked me over six years, not merely by being forehead-bangingly wrong about Bell’s theorem, but by insulting me and others when we tried to reason with him, and by demanding prize money from me because he had ‘proved’ that quantum computing was a fraud.  Despite that, I still regret the shaming aspects of my Joy Christian posts, and will strive not to repeat them.)

(3) Technological solutions.  We could try to change the functioning of the Internet, to make it harder to use it to ruin people’s lives.  This, more-or-less, is what the European Court of Justice was going for, with its much-discussed recent ruling upholding a “right to be forgotten” (more precisely, a right for individuals to petition for embarrassing information about them to be de-listed from search engines).  Alas, I fear that the Streisand effect, the Internet’s eternal memory, and the existence of different countries with different legal systems will forever make a mockery of all such technological solutions.  But, OK, given that Google is constantly tweaking its ranking algorithms anyway, maybe it could give less weight to cruel attacks against non-public-figures?  Or more weight (or even special placement) to sites explaining how the individual was cleared of the accusations?  There might be scope for such things, but I have the strong feeling that they should be done, if at all, on a voluntary basis.

(4) Self-censorship.  We could simply train people not to express any views online that might jeopardize their lives or careers, or at any rate, not to express those views under their real names.  Many people I’ve talked to seem to favor this solution, but I can’t get behind it.  For it effectively cedes to the most militant activists the right to decide what is or isn’t acceptable online discourse.  It tells them that they can use social shame as a weapon to get what they want.  When women are ridiculed for sharing stories of anorexia or being sexually assaulted or being discouraged from careers in science, it’s reprehensible to say that the solution is to teach those women to shut up about it.  I not only agree with that but go further: privacy is sometimes important, but is also an overrated value.  The respect that one rational person affords another for openly sharing the truth (or his or her understanding of the truth), in a spirit of sympathy and goodwill, is a higher value than privacy.  And the Internet’s ability to foster that respect (sometimes!) is worth defending.

(5) Standing up.  And so we come to the only solution that I can wholeheartedly stand behind.  This is for people who abhor shaming campaigns to speak out, loudly, for those who are unfairly shamed.

At the nadir of my own Twitter episode, when it felt like my life was finished and it was time to throw in the towel, the psychiatrist Scott Alexander wrote a 10,000-word essay in my defense, which also ranged controversially into numerous other issues.  In a comment on his girlfriend Ozy’s blog, Alexander now says that he regrets aspects of Untitled (then again, it was already tagged “Things I Will Regret Writing” when he posted it!).  In particular, he now feels that the piece was too broad in its critique of feminism.  However, he then explains as follows what motivated him to write it:

Scott Aaronson is one of the nicest and most decent people in the world, who does nothing but try to expand human knowledge and support and mentor other people working on the same in a bunch of incredible ways. After a lot of prompting he exposed his deepest personal insecurities, something I as a psychiatrist have to really respect. Amanda Marcotte tried to use that to make mincemeat of him, casually, as if destroying him was barely worth her time. She did it on a site where she gets more pageviews than he ever will, among people who don’t know him, and probably stained his reputation among nonphysicists permanently. I know I have weird moral intuitions, but this is about as close to pure evil punching pure good in the face just because it can as I’ve ever seen in my life. It made me physically ill, and I mentioned in the comments of the post that I lost a couple pounds pacing back and forth and shaking and not sleeping after I read it. That was the place I was writing from. And it was part of what seemed to me to be an obvious trend, and although “feminists vs. nerds” is a really crude way of framing it, I couldn’t think of a better one in that mental state and I couldn’t let it pass.

I had three reactions on reading this.  First, if there is a Scott in this discussion who’s “pure good,” then it’s not I.  Second, maybe the ultimate solution to the problem of online shaming mobs is to make a thousand copies of Alexander, and give each one a laptop with an Internet connection.  But third, as long as we have only one of him, the rest of us have a lot of work cut out for us.  I know, without having to ask, that the only real way I can thank Alexander for coming to my defense, is to use this blog to defend other people (anywhere on the ideological spectrum) who are attacked online for sharing in a spirit of honesty and goodwill.  So if you encounter such a person, let me know—I’d much prefer that to letting me know about the latest attempt to solve NP-complete problems in polynomial time with some analog contraption.

Unrelated Update: Since I started this post with Boaz Barak, let me also point to his recent blog post on why theoretical computer scientists care so much about asymptotics, despite understanding full well that the constants can overwhelm them in practice.  Boaz articulates something that I’ve tried to say many times, but he’s crisper and more eloquent.

Update (Feb. 27): Since a couple people asked, I explain here what I see as the basic problems with the “Dr. Nerdlove” site.

Update (Feb. 28): In the middle of this affair, perhaps the one thing that depressed me the most was Salon‘s “Plight of the bitter nerd” headline. Random idiots on the Internet were one thing, but how could a “serious,” “respectable” magazine lend its legitimacy to such casual meanness? I’ve now figured out the answer: I used to read Salon sometimes in the late 90s and early 2000s, but not since then, and I simply hadn’t appreciated how far the magazine had descended into clickbait trash. There’s an amusing fake Salon Twitter account that skewers the magazine with made-up headlines (“Ten signs your cat might be racist” / “Nerd supremacism: should we have affirmative action to get cool people into engineering?”), mixed with actual Salon headlines, in such a way that it would be difficult to tell many of them apart were they not marked. (Indeed, someone should write a web app where you get quizzed to see how well you can distinguish them.) “The plight of the bitter nerd” is offered there as one of the real headlines that’s indistinguishable from the parodies.

February 28, 2015

David HoggIBM Watson

I spent my day today at the IBM T. J. Watson Research Center in Yorktown Heights, NY, hosted by Siyuan Lu (IBM). I had great discussions with the Physical Analytics team, and got in some quality time with Bruce Elmegreen (IBM), with whom I overlap on inferences about the initial mass function. I spoke about exoplanet search and population inference in my talk. The highlight of the trip was a visit to the Watson group, where I watched them talk to Watson, but we also looked into the data center, which contains the Watson that won on Jeopardy!. We made some plans to teach Watson some things about the known exoplanets; he is an expert in dealing with structured data.

David HoggVicki Kaspi

Vicki Kaspi (McGill) gave the Physics Colloquium talk today. She compared the fastest-known millisecond pulsar (which her group discovered) to the fastest commercial blenders in spin period. The pulsar wins, but it wins far more in surface speed: The surface of a millisecond pulsar is moving at a significant fraction (like 0.1) of the speed of light! She talked about the uses of pulsars for precision measurement and testing of general relativity. It is just incredible that nature delivers us these clocks! I got interested during the talk in the spin constraints on the equation of state: We often see constraints on equation of state from mass measurements, but there must be equally compelling limits from the spin: If you are spinning such that your surface is moving at or even near the sound speed in the material, I think (or I have an intuition) that everything goes to hell fast.
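Kaspi's comparison is easy to sanity-check. A minimal sketch, assuming (numbers not in the post above) a spin frequency of roughly 716 Hz for the fastest-known millisecond pulsar and a typical neutron-star radius of about 10 km:

```python
import math

# Assumed figures, not from the post: ~716 Hz spin frequency for the
# fastest-known millisecond pulsar, and a neutron-star radius of ~10 km.
spin_hz = 716.0
radius_m = 10e3
c = 2.998e8  # speed of light in m/s

# Equatorial surface speed: circumference times rotation frequency.
surface_speed = 2 * math.pi * radius_m * spin_hz
print(f"surface speed = {surface_speed:.2e} m/s = {surface_speed / c:.2f} c")
```

With these assumed numbers the equator moves at roughly 0.15 c, consistent with the "significant fraction (like 0.1) of the speed of light" quoted above.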

David Hoggstellar modes, reading minds

At group meeting today Angus talked about her attempts to reproduce the asteroseismological measurements in the literature from Kepler short-cadence data. I think there is something missing, because we don't observe all the modes as clearly as they do. Our real goal is not just to reproduce the results of course; we discussed our advantages over existing methods: We have a more realistic generative model of the data; we can do multiple frequencies simultaneously; we can handle not just non-uniform time sampling but also non-uniform exposure times (which matters for high frequencies), and we can take in extremely non-trivial noise models (including ones that do detrending). I am sure we have a project and a paper, but we don't understand our best scope yet.

Just before lunch, Kyunghyun Cho (Montreal) gave a Computer Science colloquium about deep learning and translation. His system is structured as an encoder, a "meaning representation", and then a decoder. All three components are interesting, but the representation in the middle is a model system for holding semantic or linguistic structure. Very interesting! He has good performance. But the most interesting things in his talk were about other kinds of problems that can be cast as machine translation: Translating images into captions, for example, or translating brain images into sentences that the subject is currently reading! Cho's implication was that mind reading is just around the corner...

John BaezScholz’s Star

100,000 years ago, some of my ancestors came out of Africa and arrived in the Middle East. 50,000 years ago, some of them reached Asia. But between those dates, about 70,000 years ago, two stars passed through the outer reaches of the Solar System, where icy comets float in dark space!

One was a tiny red dwarf called Scholz’s star. It’s only 90 times as heavy as Jupiter. Right now it’s 20 light years from us, so faint that it was discovered only in 2013, by Ralf-Dieter Scholz—an expert on nearby stars, high-velocity stars, and dwarf stars.

The other was a brown dwarf: a star so small that it doesn’t produce energy by fusion. This one is only 65 times the mass of Jupiter, and it orbits its companion at a distance of 80 AU.

(An AU, or astronomical unit, is the distance between the Earth and the Sun.)

A team of scientists has just computed that while some of my ancestors were making their way to Asia, these stars passed about 0.8 light years from our Sun. That’s not very close. But it’s close enough to penetrate the large cloud of comets surrounding the Sun: the Oort cloud.

They say this event didn’t affect the comets very much. But if it shook some comets loose from the Oort cloud, they would take about 2 million years to get here! So, they won’t arrive for a long time.
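The 2-million-year figure can be roughly reproduced with Kepler's third law. A back-of-envelope sketch, assuming (a number not in the post) that a dislodged comet starts nearly at rest at about 50,000 AU, a typical outer Oort cloud distance; it then falls along a degenerate ellipse whose semi-major axis is half the starting distance, arriving after half of that orbit's period:

```python
# Back-of-envelope infall time for an Oort-cloud comet.
# Assumed, not from the post: starting distance ~50,000 AU, near rest.
start_au = 50_000.0

# A body dropped from rest follows a degenerate ellipse whose
# semi-major axis is half the starting distance.
a_au = start_au / 2

# Kepler's third law in solar units: P[years] = a[AU] ** 1.5
period_yr = a_au ** 1.5

# Falling in takes half of one full orbit.
infall_yr = period_yr / 2
print(f"infall time = {infall_yr:.1e} years")  # roughly 2e6 years
```

The answer comes out near 2 million years, matching the figure quoted above.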

At its closest approach, Scholz’s star would have had an apparent magnitude of about 11.4. This is a bit too faint to see, even with binoculars. So, don’t look for it in myths and legends!

As usual, the paper that made this discovery is expensive in journals but free on the arXiv:

• Eric E. Mamajek, Scott A. Barenfeld, Valentin D. Ivanov, Alexei Y. Kniazev, Petri Vaisanen, Yuri Beletsky, Henri M. J. Boffin, The closest known flyby of a star to the Solar System.

It must be tough being a scientist named ‘Boffin’, especially in England! Here’s a nice account of how the discovery was made:

• University of Rochester, A close call of 0.8 light years, 16 February 2015.

The brown dwarf companion to Scholz’s star is a ‘class T’ star. What does that mean? It’s pretty interesting. Let’s look at an example just 7 light years from Earth!

Brown dwarfs


Thanks to some great new telescopes, astronomers have been learning about weather on brown dwarfs! It may look like this artist’s picture. (It may not.)

Luhman 16 is a pair of brown dwarfs orbiting each other just 7 light years from us. The smaller one, Luhman 16B, is half covered by huge clouds. These clouds are hot—1200 °C—so they’re probably made of sand, iron or salts. Some of them have been seen to disappear! Why? Maybe ‘rain’ is carrying this stuff further down into the star, where it melts.

So, we’re learning more about something cool: the ‘L/T transition’.

Brown dwarfs can’t fuse ordinary hydrogen, but a lot of them fuse the isotope of hydrogen called deuterium that people use in H-bombs—at least until this runs out. The atmosphere of a hot brown dwarf is similar to that of a sunspot: it contains molecular hydrogen, carbon monoxide and water vapor. This is called a class M brown dwarf.

But as they run out of fuel, they cool down. The cooler class L brown dwarfs have clouds! But the even cooler class T brown dwarfs do not. Why not?

This is the mystery we may be starting to understand: the clouds may rain down, with material moving deeper into the star! Luhman 16B is right near the L/T transition, and we seem to be watching how the clouds can disappear as a brown dwarf cools. (Its larger companion, Luhman 16A, is firmly in class L.)

Finally, as brown dwarfs cool below 300 °C, astronomers expect that ice clouds start to form: first water ice, and eventually ammonia ice. These are the class Y brown dwarfs. Wouldn’t that be neat to see? A star with icy clouds!

Could there be life on some of these stars?

Caroline Morley regularly blogs about astronomy. If you want to know more about weather on Luhman 16B, try this:

• Caroline Morley, Swirling, patchy clouds on a teenage brown dwarf, 28 February 2012.

She doesn’t like how people call brown dwarfs “failed stars”. I agree! It’s like calling a horse a “failed giraffe”.

For more, try:

• Brown dwarfs, Scholarpedia.

BackreactionAre pop star scientists bad for science?

[Image Source: Asia Tech Hub]

In January, Lawrence Krauss wrote a very nice feature article for the Bulletin of the Atomic Scientists, titled “Scientists as celebrities: Bad for science or good for society?” In his essay, he reflects on the rise to popularity of Einstein, Sagan, Feynman, Hawking, and deGrasse Tyson.

Krauss, not so surprisingly, concludes that scientific achievement is neither necessary nor sufficient for popularity, and that society benefits from scientists’ voices in public debate. He does not however address the other part of the question that his essay’s title raises: Is scientific celebrity bad for science?

I have to admit that people who idolize public figures just weird me out. It isn’t only that I am generally suspicious of groups of any kinds and avoid crowds like the plague, but that there is something creepy about fans trying to outfan each other by insisting their stars are infallible. It’s one thing to follow the lives of popular figures, be happy for them and worry about them. It’s another thing to elevate their quotes to unearthly wisdom and preach their opinion like supernatural law.

Years ago, I unknowingly found myself in a group of Feynman fans who were just comparing notes about the subject of their adoration. In my attempt to join the discussion I happily informed them that I didn’t like Feynman’s books, didn’t like, in fact, his whole writing style. The resulting outrage over my blasphemy literally had me back out of the room.

Sorry, have I insulted your hero?

An even more illustrative case is that of Michael Hale making a rather innocent joke about a photo of Neil deGrasse Tyson on twitter, and in reply getting shot down with insults. You can find some (very explicit) examples in the writeup of his story “How I Became Thousands of Nerds' Worst Enemy by Tweeting a Photo.” After blowing up on twitter, his photo ended up on the facebook page “I Fucking Love Science.” The best thing about the ensuing facebook thread is the frustration of several people who apparently weren’t able to turn off notifications of new comments. The post has been shared more than 50,000 times, and Michael Hale now roasts in nerd hell somewhere between Darth Vader and Sauron.

Does this seem like scientific celebrity is beneficial to balanced argumentation? Is fandom ever supportive of rational discourse?

I partly suspect that Krauss, like many people of his age and social status, doesn’t fully realize the side-effects that social media attention brings, the trolls in the blogosphere’s endless comment sections and the anonymous insults in the dark corners of forum threads. I agree with Krauss that it’s good that scientists voice their opinions in public. I’m not sure that celebrity is a good way to encourage people to think on their own. Neither, for that matter, are facebook pages with expletives in the title.

Be that as it may, pop star scientists serve, as Steve Fuller bluntly put it, as “marketing”:
“The upshot is that science needs to devote an increased amount of its own resources to what might be called pro-marketing.”
Agreed. And for that reason, I am in favor of scientific celebrity, even though I doubt that idolization can ever bring insight. But let us turn now to the question of what ill effects celebrity can have on science.

Many of those who become scientists report getting their inspiration from popular science books, shows, or movies. Celebrities clearly play a big role in this pull. One may worry that the resulting interest in science is then very focused on a few areas that are the popular topics of the day. However, I don’t see this worry having much to do with reality. What seems to happen instead is that young people, once their interest is sparked, explore the details by themselves and find a niche that they fit in. So I think that science benefits from popular science and its voices by inspiring young people to go into science.

The remaining worry that I can see is that scientific pop stars affect the interests of those already active in science. My colleagues always outright dismiss the possibility that their scientific opinion is affected by anything or anybody. It’s a nice demonstration of what psychologists call the “bias blind spot”. It is well documented that humans pay more attention to information that they receive repeatedly and in particular if it comes from trusted sources. This was once a good way to extract relevant information in a group of 300 fighting for survival. But in the age of instant connectivity and information overflow, it means that our interests are easy to play.

If you don’t know what I mean, imagine that deGrasse Tyson had just explained he read my recent paper and thinks it’s totally awesome. What would happen? Well, first of all, all my colleagues would instantly hate me and proclaim that my paper is nonsense without even having read it. Then however, a substantial amount of them would go and actually read it. Some of them would attempt to find flaws in it, and some would go and write follow-up papers. Why? Because the papal utterance would get repeated all over the place, they’d take it to lunch, they’d discuss it with their colleagues, they’d ask others for opinion. And the more they discuss it, the more it becomes interesting. That’s how the human brain works. In the end, I’d have what the vast majority of papers never gets: attention.

That’s a worry you can have about scientific celebrity, but to be honest it’s a rather contrived worry. That’s because pop star scientists rarely if ever comment on research that isn’t already very well established. So the bottom line is that while it could be bad for science, I don’t think scientific celebrity actually is bad for science, or at least I can’t see how.

The above-mentioned problem of skewing scientific opinions by selectively drawing attention to some works, though, is a real problem with the popular science media, which doesn’t shy away from commenting on research that is still far from being established. The better outlets, in an attempt to prove their credibility, stick preferably to papers by those who are already well known and decorate their articles with quotes from more well-known people. The result is a rich-get-richer trend. On the very opposite side, there’s a lot of trash media that seem to randomly hype nonsense papers in the hope of catching readers with fat headlines. This mostly benefits scientists who shamelessly oversell their results. The vast majority of serious high-quality research, in pretty much any area, goes largely unnoticed by the public. That, in my eyes, is a real problem which is bad for science.

My best advice if you want to know what physicists really talk about is to follow the physics societies or their blogs or journals respectively. I find they are reliable and trustworthy information sources, and usually very balanced because they’re financed by membership fees, not click rates. Your first reaction will almost certainly be that their news is boring and that progress seems incremental. I hate to spell it out, but that’s how science really is.

Tommaso DorigoMiscellanea

This week I was traveling in Belgium so my blogging activities have been scarce. Back home, I will resume with serious articles soon (with the XVI Neutrino Telescopes conference next week, there will be a lot to report on!). In the meantime, here's a list of short news you might care about as an observer of progress in particle physics research and related topics.


Geraint F. LewisShooting relativistic fish in a rational barrel

I need to take a breather from grant writing, which is consuming almost every waking hour in between all of the other things that I still need to do. So see this post as a cathartic exercise.

What makes a scientist? Is it the qualification? What you do day-to-day? The association and societies to which you belong? I think a unique definition may be impossible as there is a continuum of properties of scientists. This makes it a little tricky for the lay-person to identify "real science" from "fringe science" (but, in all honesty, the distinction between these two is often not particularly clear cut).

One thing that science (and many other fields) does is hold meetings, conferences and workshops to discuss the latest results. Some people seem to spend their lives flitting between exotic locations essentially presenting the same talk to almost the same audience, but all scientists probably attend a conference or two per year.

In one of my own fields, namely cosmology, there are lots of conferences per year. But accompanying these there is another set of conferences going on, also on cosmology and often including discussions of gravity, particle physics, and the power of electricity in the Universe. At these meetings, the words "rational" and "logical" are bandied about, and it is clear that the people attending think that the great mass of astronomers and physicists have gotten it all wrong, are deluded, are colluding to keep the truth from the public for some bizarre agenda - some sort of worship of Einstein and "mathemagics" (I snorted with laughter when I heard this).

If I am being paid to lie to the public, I would like to point out that my cheque has not arrived and unless it does shortly I will go to the papers with a "tell all"!!

These are not a new phenomenon, but they were often in the shadows. Now, of course, with the internet, anyone can see these conferences in action in plenty of youtube clips and lectures.

Is there any use for such videos? I think so: for the student of physics, they present an excellent opportunity to test one's knowledge by identifying just where the presenters stray off the path.

A brief search of youtube will turn up talks that point out that black holes cannot exist because

$$T_{\mu\nu} = 0$$

is the starting point for the derivation of the Schwarzschild solution.

Now, if you are not really familiar with the mathematics of relativity, this might look quite convincing. The key point is this equation, Einstein's field equation:

$$G_{\mu\nu} = \frac{8 \pi G}{c^4} T_{\mu\nu}$$

Roughly speaking, this says that space-time geometry (left-hand side) is related to the matter and energy density (right-hand side), and you calculate the Schwarzschild geometry for a black hole by setting the right-hand side equal to zero.

Now, with the right-hand side equal to zero, that means there is no energy and mass, and the conclusion in the video says that there is no source, no thing to produce the bending of space-time and hence the effects of gravity. So, have the physicists been pulling the wool over everyone's eyes for almost 100 years?

Now, a university level student may not have done relativity yet, but it should be simple to see the flaw in this argument. And, to do this, we can use the wonderful world of classical mechanics.

In classical physics, where gravity is a force and we deal with potentials, we have a similar equation to the relativistic equation above. It's known as Poisson's equation:

$$\nabla^2 \Phi = 4 \pi G \rho$$

The left-hand side is built from derivatives of the gravitational potential, whereas the right-hand side is some constants (including Newton's gravitational constant G) and the density, given by ρ.

I think everyone is happy with this equation. Now, one thing you calculate early on in gravitational physics is that the gravitational potential outside of a massive spherical object of mass M is given by

$$\Phi = -\frac{G M}{r}$$

(a simple V is sometimes written instead of Φ; they are the same thing). Note that we are talking about the potential outside of the spherical body. So, if we plug this potential into Poisson's equation, does it give us a mass distribution which is spherical?

Now, Poisson's equation can look a little intimidating, but let's recast the potential in Cartesian coordinates. Then it looks like this:

$$\Phi = -\frac{G M}{\sqrt{x^2 + y^2 + z^2}}$$

Ugh! Does that make it any easier? Yes: let's simply plug it into Wolfram Alpha to do the hard work. The derivatives have an x-part, y-part and z-part; here's the x-part:

$$\frac{\partial^2 \Phi}{\partial x^2} = \frac{G M \, (y^2 + z^2 - 2 x^2)}{(x^2 + y^2 + z^2)^{5/2}}$$

Again, if you are a mathphobe, this is not much better, but let's add the y- and z-parts.

After all that, the result is zero! Zilch! Nothing! This must mean that Poisson's equation for this potential is

$$\nabla^2 \Phi = 4 \pi G \rho = 0$$
So, the density is equal to zero. Where's the mass that produces the gravitational field? This is the same as the apparent problem with relativity. What Poisson's equation tells us is that the derivatives of the potential AT A POINT are related to the density AT THAT POINT!

Now, remember these are derivatives, and so the potential can have a whole bunch of shapes at that point, as long as the derivatives still hold. One of these, of course, is there being no mass there and so no gravitational potential at all, but any vacuum, with no mass, will obey the Poisson = 0 equation, including the potential outside of any body (the one used in this example relied on a spherical source).
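If you'd rather not trust Wolfram Alpha, the same check takes a few lines of symbolic algebra. A minimal sketch using Python's sympy (my addition, not part of the original post), computing the Laplacian of the point-mass potential away from the origin:

```python
import sympy as sp

x, y, z, G, M = sp.symbols('x y z G M', positive=True)

# Point-mass gravitational potential in Cartesian coordinates.
phi = -G * M / sp.sqrt(x**2 + y**2 + z**2)

# Laplacian: the sum of the three second partial derivatives.
laplacian = sp.diff(phi, x, 2) + sp.diff(phi, y, 2) + sp.diff(phi, z, 2)

# Away from the origin this simplifies to exactly zero, so Poisson's
# equation gives zero density at every point outside the source.
print(sp.simplify(laplacian))  # → 0
```

The three second derivatives cancel term by term, exactly as the hand computation above found.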

So, the relativistic version is that the properties of the space-time curvature AT A POINT are related to the mass and energy AT THAT POINT. A flat space-time is produced when there is no mass and energy anywhere, and so has G = 0 (a vanishing Einstein tensor), but so does any point in a vacuum; that does not mean that the space-time at that point is not curved (and so that there is no gravity).

Anyway, I got that off my chest, and my Discovery Project submitted, but now it's time to get on with a LIEF application! 

February 27, 2015

Jordan EllenbergErotische Flugblaetter

I was working in Memorial Library yesterday. Whenever I’m over there, I like to pull a book off the shelf and look at it.  (E.G.) I feel I have some kind of duty to the books — there are so many which will never be taken off the shelf again!

Anyway, there has never been an easier choice than Flugblatt-Propaganda Im 2.Weltkrieg:  Erotische Flugblätter.  How was I not supposed to look at that!  And I was richly rewarded.  The Nazi propagandists knew their business; the leaflets are written in perfect colloquial English, assuring American troops that the US government is purposely prolonging the war to keep unemployment low at home, that their kids and wives are pleading for them to come home alive by any means necessary (especially:  surrendering and riding out the rest of the war in a comfy German POW camp, with movies, sports, and the same food the German soldiers get) and, most of all, that their girlfriends back home, tired of waiting, are taking up with draft-dodgers and war-profiteers (especially the ruthless “Sam Levy.”)  UK troops got their own version:  their girlfriends weren’t making time with shifty Jews, but with US soldiers, who were “training” in England while the British men died at the front.

Some highlights:

[Photos of three of the leaflets.]

n-Category Café Concepts of Sameness (Part 4)

This time I’d like to think about three different approaches to ‘defining equality’, or more generally, introducing equality in formal systems of mathematics.

These will be taken from old-fashioned logic — before computer science, category theory or homotopy theory started exerting their influence. Eventually I want to compare these to more modern treatments.

If you know other interesting ‘old-fashioned’ approaches to equality, please tell me!

The equals sign is surprisingly new. It was never used by the ancient Babylonians, Egyptians or Greeks. It seems to originate in 1557, in Robert Recorde’s book The Whetstone of Witte. If so, we actually know what the first equation looked like:

As you can see, the equals sign was much longer back then! He used parallel lines “because no two things can be more equal.”

Formalizing the concept of equality has raised many questions. Bertrand Russell published The Principles of Mathematics [R] in 1903. Not to be confused with the Principia Mathematica, this is where he introduced Russell’s paradox. In it, he wrote:

identity, an objector may urge, cannot be anything at all: two terms plainly are not identical, and one term cannot be, for what is it identical with?

In his Tractatus, Wittgenstein [W] voiced a similar concern:

Roughly speaking: to say of two things that they are identical is nonsense, and to say of one thing that it is identical with itself is to say nothing.

These may seem like silly objections, since equations obviously do something useful. The question is: precisely what?

Instead of tackling that head-on, I’ll start by recalling three related approaches to equality in the pre-categorical mathematical literature.

The indiscernibility of identicals

The principle of indiscernibility of identicals says that equal things have the same properties. We can formulate it as an axiom in second-order logic, where we’re allowed to quantify over predicates $P$:

\forall x \forall y \; [x = y \;\implies\; \forall P \, [P(x) \;\iff\; P(y)] ]

We can also formulate it as an axiom schema in first-order logic, where it’s sometimes called substitution for formulas. This is sometimes written as follows:

For any variables $x, y$ and any formula $\phi$, if $\phi'$ is obtained by replacing any number of free occurrences of $x$ in $\phi$ with $y$, such that these remain free occurrences of $y$, then

x = y \;\implies\; [\phi \;\implies\; \phi' ]

I think we can replace this with the prettier

x = y \;\implies\; [\phi \;\iff\; \phi' ]

without changing the strength of the schema. Right?

We cannot derive reflexivity, symmetry and transitivity of equality from the indiscernibility of identicals. So, this principle does not capture all our usual ideas about equality. However, as shown last time, we can derive symmetry and transitivity from this principle together with reflexivity. This uses an interesting form of argument where we take “being equal to $z$” as one of the predicates (or formulas) to which we apply the principle. There’s something curiously self-referential about this. It’s not illegitimate, but it’s curious.
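For readers who enjoy seeing this spelled out, here is a small sketch (my own illustration, not from the post) of the self-referential predicate trick in Lean 4, where `Eq.subst` plays the role of the substitution schema:

```lean
-- Deriving symmetry and transitivity of equality from reflexivity plus
-- substitution, using "being equal to z" as the predicate.

-- Symmetry: substitute along h : x = y in the predicate (fun z => z = x);
-- it holds at x by reflexivity, hence at y, giving y = x.
theorem symm_from_subst {α : Type} {x y : α} (h : x = y) : y = x :=
  Eq.subst (motive := fun z => z = x) h rfl

-- Transitivity: substitute along h₂ : y = z in the predicate (fun w => x = w);
-- it holds at y by h₁, hence at z, giving x = z.
theorem trans_from_subst {α : Type} {x y z : α}
    (h₁ : x = y) (h₂ : y = z) : x = z :=
  Eq.subst (motive := fun w => x = w) h₂ h₁
```

Note how each proof picks a predicate that itself mentions equality — exactly the curious self-reference described above.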

The identity of indiscernibles

Leibniz [L] is often credited with formulating a converse principle, the identity of indiscernibles. This says that things with all the same properties are equal. Again we can write it as a second-order axiom:

\forall x \forall y \; [ \forall P \, [ P(x) \;\iff\; P(y)] \;\implies\; x = y ]

or a first-order axiom schema.

We can go further if we take the indiscernibility of identicals and identity of indiscernibles together as a package:

\forall x \forall y \; [ \forall P \, [ P(x) \;\iff\; P(y)] \;\iff\; x = y ]

This is often called the Leibniz law. It says an entity is determined by the collection of predicates that hold of that entity. Entities don’t have mysterious ‘essences’ that determine their individuality: they are completely known by their properties, so if two entities have all the same properties they must be the same.

This principle does imply reflexivity, symmetry and transitivity of equality. They follow from the corresponding properties of $\iff$ in a satisfying way. Of course, if we were wondering why equality has these three properties, we are now led to wonder the same thing about the biconditional $\iff$. But this counts as progress: it’s a step toward ‘logicizing’ mathematics, or at least connecting $=$ firmly to $\iff$.
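As a small formal sketch (again mine, not from the post), one can define ‘Leibniz equality’ in Lean 4 and watch reflexivity, symmetry and transitivity fall out of the corresponding properties of the biconditional:

```lean
-- 'Leibniz equality': x equals y when every predicate holds of x iff of y.
def leibnizEq {α : Type} (x y : α) : Prop :=
  ∀ P : α → Prop, P x ↔ P y

-- The three laws are inherited directly from Iff.rfl, Iff.symm, Iff.trans.
theorem leibnizEq_refl {α : Type} (x : α) : leibnizEq x x :=
  fun _ => Iff.rfl

theorem leibnizEq_symm {α : Type} {x y : α}
    (h : leibnizEq x y) : leibnizEq y x :=
  fun P => (h P).symm

theorem leibnizEq_trans {α : Type} {x y z : α}
    (h₁ : leibnizEq x y) (h₂ : leibnizEq y z) : leibnizEq x z :=
  fun P => (h₁ P).trans (h₂ P)
```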

Apparently Russell and Whitehead used a second-order version of the Leibniz law to define equality in the Principia Mathematica [RW], while Kalish and Montague [KL] present it as a first-order schema. I don’t know the whole history of such attempts.

When you actually look to see where Leibniz formulated this principle, it’s a bit surprising. He formulated it in the contrapositive form, he described it as a ‘paradox’, and most surprisingly, it’s embedded as a brief remark in a passage that would be hair-curling for many contemporary rationalists. It’s in his Discourse on Metaphysics, a treatise written in 1686:

Thus Alexander the Great’s kinghood is an abstraction from the subject, and so is not determinate enough to pick out an individual, and doesn’t involve the other qualities of Alexander or everything that the notion of that prince includes; whereas God, who sees the individual notion or ‘thisness’ of Alexander, sees in it at the same time the basis and the reason for all the predicates that can truly be said to belong to him, such as for example that he would conquer Darius and Porus, even to the extent of knowing a priori (and not by experience) whether he died a natural death or by poison — which we can know only from history. Furthermore, if we bear in mind the interconnectedness of things, we can say that Alexander’s soul contains for all time traces of everything that did and signs of everything that will happen to him — and even marks of everything that happens in the universe, although it is only God who can recognise them all.

Several considerable paradoxes follow from this, amongst others that it is never true that two substances are entirely alike, differing only in being two rather than one. It also follows that a substance cannot begin except by creation, nor come to an end except by annihilation; and because one substance can’t be destroyed by being split up, or brought into existence by the assembling of parts, in the natural course of events the number of substances remains the same, although substances are often transformed. Moreover, each substance is like a whole world, and like a mirror of God, or indeed of the whole universe, which each substance expresses in its own fashion — rather as the same town looks different according to the position from which it is viewed. In a way, then, the universe is multiplied as many times as there are substances, and in the same way the glory of God is magnified by so many quite different representations of his work.

(Emphasis mine — you have to look closely to find the principle of identity of indiscernibles, because it goes by so quickly!)

There have been a number of objections to the Leibniz law over the years. I want to mention one that might best be handled using some category theory. In 1952, Max Black [B] claimed that in a symmetrical universe with empty space containing only two symmetrical spheres of the same size, the two spheres are two distinct objects even though they have all their properties in common.

As Black admits, this problem only shows up in a ‘relational’ theory of geometry, where we can’t say that the spheres have different positions — e.g., one centered at the point $(x,y,z)$, the other centered at $(-x,-y,-z)$ — but only speak of their position relative to one another. This sort of theory is certainly possible, and it seems to be important in physics. But I believe it can be adequately formulated only with the help of some category theory. In the situation described by Black, I think we should say the spheres are not equal but isomorphic.

As widely noted, general relativity also pushes for a relational approach to geometry. Gauge theory, also, raises the issue of whether indistinguishable physical situations should be treated as equal or merely isomorphic. I believe the mathematics points us strongly in the latter direction.

A related issue shows up in quantum mechanics, where electrons are considered indistinguishable (in a certain sense), yet there can be a number of electrons in a box — not just one.

But I will discuss such issues later.


In traditional set theory we try to use sets as a substitute for predicates, saying $x \in S$ as a substitute for $P(x)$. This lets us keep our logic first-order and quantify over sets — often in a universe where everything is a set — as a substitute for quantifying over predicates. Of course there’s a glitch: Russell’s paradox shows we get in trouble if we try to treat every predicate as defining a set! Nonetheless it is a powerful strategy.

If we apply this strategy to reformulate the Leibniz law in a universe where everything is a set, we obtain:

\forall S \forall T \; [ S = T \;\iff\; \forall R \, [ S \in R \;\iff\; T \in R]]

While this is true in Zermelo-Fraenkel set theory, it is not taken as an axiom. Instead, people turn the idea around and use the axiom of extensionality:

\forall S \forall T \; [ S = T \;\iff\; \forall R \, [ R \in S \;\iff\; R \in T]]

Instead of saying two sets are equal if they’re in all the same sets, this says two sets are equal if all the same sets are in them. This leads to a view that treats the ‘contents’ of an entity as its defining feature, rather than the predicates that hold of it.
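As a toy illustration (mine, not from the post): Python’s immutable sets already behave extensionally, in that two `frozenset`s compare equal exactly when they have the same members, regardless of how they were built.

```python
# Hereditarily finite sets modeled as nested frozensets; equality is
# determined by membership alone, mirroring the axiom of extensionality.
empty = frozenset()
one = frozenset({empty})               # the set {∅}
also_one = frozenset({frozenset()})    # the same set, built independently
two = frozenset({empty, one})          # the set {∅, {∅}}

print(one == also_one)  # True: same members, hence "equal"
print(one == two)       # False: different contents
```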

We could, in fact, send this idea back to second-order logic and say that predicates are equal if and only if they hold for the same entities:

\forall P \forall Q \; [\forall x \, [P(x) \;\iff\; Q(x)] \;\iff\; P = Q ]

as a kind of ‘dual’ of the Leibniz law:

\forall x \forall y \; [ \forall P \, [ P(x) \;\iff\; P(y)] \;\iff\; x = y ]

I don’t know if this has been remarked on in the foundational literature, but it’s a close relative of a phenomenon that occurs in other forms of duality. For example, continuous real-valued functions $F, G$ on a topological space obey

\forall F \forall G \; [\forall x \, [F(x) = G(x)] \;\iff\; F = G ]

but if the space is nice enough, continuous functions ‘separate points’, which means we also have

\forall x \forall y \; [ \forall F \, [ F(x) = F(y)] \;\iff\; x = y ]


Chad OrzelThis Is Not What I Want As a Defense of “The Humanities”

Yesterday was Founders Day at Union, celebrating the 220th anniversary of the granting of a charter for the college. The name of the event always carries a sort of British-boarding-school air for me, and never fails to earworm me with a very particular rugby song, but really it’s just one of those formal-procession-and-big-speaker events that provide local color for academia.

This year’s event started, as always, with a classical music performance– a song by Aaron Copland, this time, so we’ve at least caught up to the 20th Century. (I’m not sure I want to live long enough to see a Bob Dylan number performed at one of these…) The main point, though, was the talk by Laura Skandera Trombley, president of Pitzer College, on The Enduring Value of the Humanities.

Working where I do, I’ve heard a lot of these sorts of talks, but I still don’t really know what I want from a defense of “the humanities.” I’m pretty sure, though, that this wasn’t it.

There was a lot to not like, starting with the traditional cherry-picking of statistics to show that there’s a crisis in “the humanities”– quoting the Huffington Post on the 50% decline in the number of students majoring in the humanities. Of course, as has been noted nearly as many times as that statistic has been thrown out, it’s garbage. The apparent big decline comes from careful selection of a starting point at the peak of a giant bubble in “humanities” enrollments, inflated by Baby Boomers desperate to stay out of Vietnam.

More than that, though, there’s a bunch of baiting and switching going on here. The case for the value of “the humanities” basically boils down to “You like art, don’t you? Wouldn’t it suck if we didn’t have art?” But, you know, to the extent that there’s a genuine crisis going on, it’s not because anyone’s threatening to stop producing art. Times have never been better for the production of art– in fact, the real crisis facing people who make art is that there’s too damn much of it, driving prices down and making it increasingly difficult to make a living making art.

But when we talk about “the humanities” in an academic context, we’re not talking about people who make art– only a tiny fraction of people in “humanities” departments are engaged in that. To the extent that “the humanities” are under threat in academia, what’s threatened isn’t the production of art, but comfortable faculty positions in which people are paid to talk about art. Which is a very different thing. The production of art is doing just fine, it’s the dissection of art that needs defending. But we didn’t get that.

(To be fair, there’s an exact parallel to this tactic in the sciences. See, for example, this Daily Beast piece which could be snarkily summarized as “Why should we spend $10 billion on the Large Hadron Collider? Well, you like radio, don’t you?” I don’t like that version of it any more than I like this one.)

There’s also a little sleight-of-hand when it comes to the selection of examples. The two most detailed examples given are the works of Aristotle, and a quote from a T.S. Eliot poem used at the opening of a TV show. But again, this isn’t really what “the humanities” are these days– they’re just safe and lazy signifiers that everybody will agree are Important in a sort of abstract sense. But if you were to suggest that every student at the college needs to read Aristotle and Eliot, there would be a revolt among the faculty (not without justification, though that’s a separate culture war).

Even the obligatory list of dropped names of great works ends up having problems:

More than ever we seek ways to feel connected to one another, and in the end it doesn’t matter if it’s the beauty of Strauss’ flowing “An der schönen blauen Donau,” or Bill T. Jones’ exploration of survival through dance in Still/Here, or Auden’s incomparable “Lullaby,” “Lay your head my darling, human on my faithless arm,” or Maxine Hong Kingston’s anguished admission in Woman Warrior, “You must not tell anyone what I am about to tell you. In China your father had a sister who killed herself. She jumped into the family well,” or our poet-bard, Kanye West’s love song to Kim, “Bound to fall in love, bound to fall in love (uh-huh honey)”; these are all expressions and interpretations of life and they tie us to those who came before as well as to our contemporaries.

On the page, that looks better than it sounded live. In person, the Kanye West reference was really grating, as it was delivered in a very showy deadpan manner, to deliberately highlight the vapidity of those lyrics, and make clear their inclusion was a joke. Because nothing is funnier than old white people making fun of rap.

And in a way, that’s sort of telling, because while the times have never been better for the production of art, the only appearance of art in one of the many modern, vital modes being produced today was brought in as a sneering joke. The art that was sincerely held up as having enduring value– even the opening song– was mostly drawn from fields that are on life support, propped up almost exclusively by the elite academic consensus that these are Important.

And in a way, that’s the biggest problem I have with this whole genre of speeches in defense of “the humanities” and academic disciplines in general: they are fundamentally elitist. These speeches aren’t for the students who are ostensibly the purpose of the institution, they’re to flatter the vanity of the faculty and wealthy alumni, and pat them on the back for their essential role in deciding what has value. Which is why the examples cited are always these ancient pressed-under-glass things. Everyone will agree that Aristotle and Eliot are Important, but the really active topics in “the humanities” are multicultural, and deal with critical theory and area studies and identity politics and intersectionality. But those don’t get talked about, because those topics upset people.

Even the obligatory pseudo-economic case is fundamentally kind of elite. The speech included the requisite shout-outs to “critical thinking” and the contractually mandated list of famous people with degrees in a “humanities” discipline. But that’s hugely problematic in a lot of ways, starting with the fact that it’s an argument based on “black swans”– telling students to major in philosophy because it worked for George Soros isn’t all that much different from telling people to buy lottery tickets because some lady in Arkansas hit the PowerBall jackpot.

More than that, though, the whole argument founded on the development of “critical thinking skills” is ultimately a sort of negative argument. It’s a familiar one in physics, because we’re one of the less obviously applied undergrad science majors, and I’ve used versions of it myself in talking to parents who ask what their kids might do after graduation. “You learn to think broadly about a wide range of problems, so you can go off and work in lots of other fields,” we say, but what we really mean is “Go ahead and major in our subject because you enjoy it; it won’t screw up your chances of getting a good job any more than any other major.” And that holds true for the argument applied to “the humanities.”

And, you know, that’s an easy case to make when you’re speaking at an elite private college like Union, because it’s probably true that the precise choice of major doesn’t make a great deal of difference for our students. We don’t quite have the cachet of Harvard or Williams, but we’re at the low end of the upper tier of elite colleges, and the name on the diploma will open enough doors in enough fields that our students will be able to get jobs, albeit not without some effort.

But move down the academic ladder a bit, and I’m not sure that argument works quite as well. A “humanities” degree from Union will carry a good deal more weight than a “humanities” degree from Directional State University. Those students are probably right to give more weight to immediately marketable and relevant credentials; as, for that matter, are many Union students who come from underprivileged backgrounds. Particularly in what remains a sort of dismal economic climate.

So, you know, a lot of stuff that bugged me packed into one short speech. I’m still not sure what I really want to see as a defense of the value of “the humanities,” but this very definitely was not it.

Chad OrzelIn Which I Am Outwitted by a Six-Year-Old

SteelyKid has developed a habit of not answering questions; whether she’s genuinely zoning out or just not acknowledging adults, it’s not clear. (She’s going to be a real joy when she’s a teenager, I can tell…) In retaliation, I’ve started giving imaginary answers for her, which generally snaps her out of it, but I’ve been waiting to see what the next step was.

Which was taken last night: in the car on the way to taekwondo sparring class, I asked “What are you guys doing in art class these days?” Silence.

“Hey, [SteelyKid]? What are you doing in art these days?”


“Oh, rattlesnake painting? That sounds pretty cool.”


“So, is that painting on rattlesnakes, or with rattlesnakes?”

“Well, it would have to be a dead rattlesnake.” (Finally, a response!)

“I guess. Though I suppose if it were asleep, you could paint on it. You might not want to be around when it woke up, though.”

“Hmmm… OK, here’s what I would do. I would get the snake, and put it to sleep. Then I’d give it to a museum, and they’d keep it until it grew enough to shed its skin. Then they’d give me the skin, and let the snake go.”


“And then I’d paint on the skin– on one side of the skin. Then I’d take a piece of paper, and press the skin onto the paper, and the paint would go off on the paper and look just like the snake. And then all I’d have to do is draw the head, and color it in.”

“Yeah, I guess that would work. Very clever.”

So, once again, I have lost a battle of wits to a first-grader. Happily, this got her out of the not-answering-questions mode, and she chattered happily about what she’s really doing in art class these days (a project involving a picture of a snowman that sounds a little Calvin and Hobbes), pop music, and various other things. I’m going to have to think up some new absurd activities for future car rides, though, if she’s going to go and raise the bar on me like this.

Georg von HippelBack from Mumbai

On Saturday, my last day in Mumbai, a group of colleagues rented a car with a driver to take a trip to Sanjay Gandhi National Park and visit the Kanheri caves, a Buddhist site consisting of a large number of rather simple monastic cells and some worship and assembly halls with ornate reliefs and inscriptions, all carved out of solid rock (some of the cell entrances seem to have been restored using steel-reinforced concrete, though).

On the way back, we stopped at Mani Bhavan, where Mahatma Gandhi lived from 1917 to 1934, and which is now a museum dedicated to his life and legacy.

In the night, I flew back to Frankfurt, where the temperature was much lower than in Mumbai; in fact, on Monday there was snow.

Terence Tao254A, Supplement 6: A cheap version of the theorems of Halasz and Matomaki-Radziwill (optional)

In analytic number theory, it is a well-known phenomenon that for many arithmetic functions {f: {\bf N} \rightarrow {\bf C}} of interest in number theory, it is significantly easier to estimate logarithmic sums such as

\displaystyle  \sum_{n \leq x} \frac{f(n)}{n}

than it is to estimate summatory functions such as

\displaystyle  \sum_{n \leq x} f(n).

(Here we are normalising {f} to be roughly constant in size, e.g. {f(n) = O( n^{o(1)} )} as {n \rightarrow \infty}.) For instance, when {f} is the von Mangoldt function {\Lambda}, the logarithmic sums {\sum_{n \leq x} \frac{\Lambda(n)}{n}} can be adequately estimated by Mertens’ theorem, which can be easily proven by elementary means (see Notes 1); but a satisfactory estimate on the summatory function {\sum_{n \leq x} \Lambda(n)} requires the prime number theorem, which is substantially harder to prove (see Notes 2). (From a complex-analytic or Fourier-analytic viewpoint, the problem is that the logarithmic sums {\sum_{n \leq x} \frac{f(n)}{n}} can usually be controlled just from knowledge of the Dirichlet series {\sum_n \frac{f(n)}{n^s}} for {s} near {1}; but the summatory functions require control of the Dirichlet series {\sum_n \frac{f(n)}{n^s}} for {s} on or near a large portion of the line {\{ 1+it: t \in {\bf R} \}}. See Notes 2 for further discussion.)
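The contrast is easy to see numerically. The following quick sketch (mine, not from the post) computes both kinds of sums for the von Mangoldt function: the logarithmic sum, which Mertens’ theorem says is {\log x + O(1)}, and the summatory function, whose asymptotic {\sim x} is the prime number theorem.

```python
# Compare the logarithmic sum and summatory function of von Mangoldt.
import math

def von_mangoldt(n):
    """Lambda(n) = log p if n is a power of a prime p, and 0 otherwise."""
    if n < 2:
        return 0.0
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            m = n
            while m % p == 0:
                m //= p
            return math.log(p) if m == 1 else 0.0
    return math.log(n)  # no divisor up to sqrt(n): n is prime

x = 10000
log_sum = sum(von_mangoldt(n) / n for n in range(1, x + 1))
summatory = sum(von_mangoldt(n) for n in range(1, x + 1))
print(log_sum - math.log(x))  # stays bounded as x grows (Mertens)
print(summatory / x)          # tends to 1 (prime number theorem)
```

Of course the computation cannot distinguish an easy theorem from a hard one; the point is just that both estimates are visible in the data.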

Viewed conversely, whenever one has a difficult estimate on a summatory function such as {\sum_{n \leq x} f(n)}, one can look to see if there is a “cheaper” version of that estimate that only controls the logarithmic sums {\sum_{n \leq x} \frac{f(n)}{n}}, which is easier to prove than the original, more “expensive” estimate. In this post, we shall do this for two theorems, a classical theorem of Halasz on mean values of multiplicative functions on long intervals, and a much more recent result of Matomaki and Radziwill on mean values of multiplicative functions in short intervals. The two are related; the former theorem is an ingredient in the latter (though in the special case of the Matomaki-Radziwill theorem considered here, we will not need Halasz’s theorem directly, instead using a key tool in the proof of that theorem).

We begin with Halasz’s theorem. Here is a version of this theorem, due to Montgomery and to Tenenbaum:

Theorem 1 (Halasz-Montgomery-Tenenbaum) Let {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function with {|f(n)| \leq 1} for all {n}. Let {x \geq 3} and {T \geq 1}, and set

\displaystyle  M := \min_{|t| \leq T} \sum_{p \leq x} \frac{1 - \hbox{Re}( f(p) p^{-it} )}{p}.

Then one has

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll (1+M) e^{-M} + \frac{1}{\sqrt{T}}.

Informally, this theorem asserts that {\sum_{n \leq x} f(n)} is small compared with {x}, unless {f} “pretends” to be like the character {p \mapsto p^{it}} on primes for some small {t}. (This is the starting point of the “pretentious” approach of Granville and Soundararajan to analytic number theory, as developed for instance here.) We now give a “cheap” version of this theorem which is significantly weaker (because it settles for controlling logarithmic sums rather than summatory functions, requires {f} to be completely multiplicative instead of multiplicative, and only gives qualitative decay rather than quantitative estimates), but easier to prove:

Theorem 2 (Cheap Halasz) Let {x} be a parameter going to infinity, and let {f: {\bf N} \rightarrow {\bf C}} be a completely multiplicative function (possibly depending on {x}) such that {|f(n)| \leq 1} for all {n}. Suppose that

\displaystyle  \frac{1}{\log x} \sum_{p \leq x} \frac{(1 - \hbox{Re}( f(p) )) \log p}{p} \rightarrow \infty. \ \ \ \ \ (1)

Then

\displaystyle  \frac{1}{\log x} \sum_{n \leq x} \frac{f(n)}{n} = o(1). \ \ \ \ \ (2)

Note that now that we are content with estimating logarithmic sums, we no longer need to preclude the possibility that {f(p)} pretends to be like {p^{it}}; see Exercise 11 of Notes 1 for a related observation.

To prove this theorem, we first need a special case of the Turan-Kubilius inequality.

Lemma 3 (Turan-Kubilius) Let {x} be a parameter going to infinity, and let {1 < P < x} be a quantity depending on {x} such that {P = x^{o(1)}} and {P \rightarrow \infty} as {x \rightarrow \infty}. Then

\displaystyle  \sum_{n \leq x} \frac{ | \frac{1}{\log \log P} \sum_{p \leq P: p|n} 1 - 1 |}{n} = o( \log x ).

Informally, this lemma is asserting that

\displaystyle  \sum_{p \leq P: p|n} 1 \approx \log \log P

for most large numbers {n}. Another way of writing this heuristically is in terms of Dirichlet convolutions:

\displaystyle  1 \approx 1 * \frac{1}{\log\log P} 1_{{\mathcal P} \cap [1,P]}.

This type of estimate was previously discussed as a tool to establish a criterion of Katai and Bourgain-Sarnak-Ziegler for Möbius orthogonality estimates in this previous blog post. See also Section 5 of Notes 1 for some similar computations.
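The heuristic above is easy to test numerically. This sketch (my own, not from the post) compares the logarithmically weighted average of {\sum_{p \leq P: p|n} 1} against {\log\log P}:

```python
# Average number of distinct prime factors p <= P of n, with weights 1/n.
import math

def omega_P(n, P):
    """Count distinct primes p <= P dividing n, by trial division."""
    count, m, p = 0, n, 2
    while p * p <= m and p <= P:
        if m % p == 0:
            count += 1
            while m % p == 0:
                m //= p
        p += 1
    if 1 < m <= P:  # leftover prime factor, if small enough
        count += 1
    return count

x, P = 30000, 100
weighted = sum(omega_P(n, P) / n for n in range(1, x + 1))
total = sum(1.0 / n for n in range(1, x + 1))
print(weighted / total)       # logarithmic average of omega_P
print(math.log(math.log(P)))  # ~1.53; the two should be close
```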

Proof: By Cauchy-Schwarz it suffices to show that

\displaystyle  \sum_{n \leq x} \frac{ | \frac{1}{\log \log P} \sum_{p \leq P: p|n} 1 - 1 |^2}{n} = o( \log x ).

Expanding out the square, it suffices to show that

\displaystyle  \sum_{n \leq x} \frac{ (\frac{1}{\log \log P} \sum_{p \leq P: p|n} 1)^j}{n} = \log x + o( \log x )

for {j=0,1,2}.

We just show the {j=2} case, as the {j=0,1} cases are similar (and easier). We rearrange the left-hand side as

\displaystyle  \frac{1}{(\log\log P)^2} \sum_{p_1, p_2 \leq P} \sum_{n \leq x: p_1,p_2|n} \frac{1}{n}.

We can estimate the inner sum as {(1+o(1)) \frac{1}{[p_1,p_2]} \log x}. But a routine application of Mertens’ theorem (handling the diagonal case when {p_1=p_2} separately) shows that

\displaystyle  \sum_{p_1, p_2 \leq P} \frac{1}{[p_1,p_2]} = (1+o(1)) (\log\log P)^2

and the claim follows. \Box

Remark 4 As an alternative to the Turan-Kubilius inequality, one can use the Ramaré identity

\displaystyle  \sum_{p \leq P: p|n} \frac{1}{\# \{ p' \leq P: p'|n\}} = 1 - 1_{(p,n)=1 \hbox{ for all } p \leq P}

(see e.g. Section 17.3 of Friedlander-Iwaniec). This identity turns out to give superior quantitative results than the Turan-Kubilius inequality in applications; see the paper of Matomaki and Radziwill for an instance of this.

We now prove Theorem 2. Let {Q} denote the left-hand side of (2); by the triangle inequality we have {Q=O(1)}. By Lemma 3 (for some {P = x^{o(1)}} to be chosen later) and the triangle inequality we have

\displaystyle  \sum_{n \leq x} \frac{\frac{1}{\log P} \sum_{d \leq P: d|n} \Lambda(d) f(n)}{n} = Q \log x + o( \log x ).

We rearrange the left-hand side as

\displaystyle  \frac{1}{\log P} \sum_{d \leq P} \frac{\Lambda(d) f(d)}{d} \sum_{m \leq x/d} \frac{f(m)}{m}.

We now replace the constraint {m \leq x/d} by {m \leq x}. The error incurred in doing so is

\displaystyle  O( \frac{1}{\log P} \sum_{d \leq P} \frac{\Lambda(d)}{d} \sum_{x/P \leq m \leq x} \frac{1}{m} )

which by Mertens’ theorem is {O(\log P) = o( \log x )}. Thus we have

\displaystyle  \frac{1}{\log P} \sum_{d \leq P} \frac{\Lambda(d) f(d)}{d} \sum_{m \leq x} \frac{f(m)}{m} = Q \log x + o( \log x ).

But by definition of {Q}, we have {\sum_{m \leq x} \frac{f(m)}{m} = Q \log x}, thus

\displaystyle  [1 - \frac{1}{\log P} \sum_{d \leq P} \frac{\Lambda(d) f(d)}{d}] Q = o(1).

From Mertens’ theorem, the expression in brackets can be rewritten as

\displaystyle  \frac{1}{\log P} \sum_{d \leq P} \frac{\Lambda(d) (1 - f(d))}{d} + o(1)

and so the real part of this expression is at least

\displaystyle  \frac{1}{\log P} \sum_{p \leq P} \frac{(1 - \hbox{Re} f(p)) \log p}{p} + o(1).

Note from Mertens’ theorem and the hypothesis on {f} that

\displaystyle  \frac{1}{\log x^\varepsilon} \sum_{p \leq x^\varepsilon} \frac{(1 - \hbox{Re} f(p)) \log p}{p} \rightarrow \infty

for any fixed {\varepsilon}, and hence by diagonalisation that

\displaystyle  \frac{1}{\log P} \sum_{p \leq P} \frac{(1 - \hbox{Re} f(p)) \log p}{p} \rightarrow \infty

if {P=x^{o(1)}} for a sufficiently slowly decaying {o(1)}. The claim follows.

Exercise 5 (Granville-Koukoulopoulos-Matomaki)

  • (i) If {g} is a completely multiplicative function with {g(p) \in \{0,1\}} for all primes {p}, show that

    \displaystyle  (e^{-\gamma}-o(1)) \prod_{p \leq x} (1 - \frac{g(p)}{p})^{-1} \leq \sum_{n \leq x} \frac{g(n)}{n} \leq \prod_{p \leq x} (1 - \frac{g(p)}{p})^{-1}.

    as {x \rightarrow \infty}. (Hint: for the upper bound, expand out the Euler product. For the lower bound, show that {\sum_{n \leq x} \frac{g(n)}{n} \times \sum_{n \leq x} \frac{h(n)}{n} \ge \sum_{n \leq x} \frac{1}{n}}, where {h} is the completely multiplicative function with {h(p) = 1-g(p)} for all primes {p}.)

  • (ii) If {g} is multiplicative and takes values in {[0,1]}, show that

    \displaystyle  \sum_{n \leq x} \frac{g(n)}{n} \asymp \prod_{p \leq x} (1 - \frac{g(p)}{p})^{-1}

    \displaystyle  \asymp \exp( \sum_{p \leq x} \frac{g(p)}{p} )

    for all {x \geq 1}.

Now we turn to a very recent result of Matomaki and Radziwill on mean values of multiplicative functions in short intervals. For sake of illustration we specialise their results to the simpler case of the Liouville function {\lambda}, although their arguments actually work (with some additional effort) for arbitrary multiplicative functions of magnitude at most {1} that are real-valued (or more generally, stay far from complex characters {p \mapsto p^{it}}). Furthermore, we give a qualitative form of their estimates rather than a quantitative one:

Theorem 6 (Matomaki-Radziwill, special case) Let {X} be a parameter going to infinity, and let {2 \leq h \leq X} be a quantity going to infinity as {X \rightarrow \infty}. Then for all but {o(X)} of the integers {x \in [X,2X]}, one has

\displaystyle  \sum_{x \leq n \leq x+h} \lambda(n) = o( h ).

Equivalently, one has

\displaystyle  \sum_{X \leq x \leq 2X} |\sum_{x \leq n \leq x+h} \lambda(n)|^2 = o( h^2 X ). \ \ \ \ \ (3)

A simple sieving argument (see Exercise 18 of Supplement 4) shows that one can replace {\lambda} by the Möbius function {\mu} and obtain the same conclusion. See this recent note of Matomaki and Radziwill for a simple proof of their (quantitative) main theorem in this special case.
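For a rough numerical feel (my own sketch, not from the paper), one can sample short intervals and watch the averages of the Liouville function come out far smaller than the trivial bound of {1}, as Theorem 6 predicts:

```python
# Averages of the Liouville function over random short intervals [x, x+h].
import random

def liouville(n):
    """lambda(n) = (-1)^Omega(n), Omega counting prime factors with multiplicity."""
    count, m, p = 0, n, 2
    while p * p <= m:
        while m % p == 0:
            count += 1
            m //= p
        p += 1
    if m > 1:
        count += 1
    return -1 if count % 2 else 1

X, h = 10**5, 500
random.seed(0)
avgs = []
for _ in range(40):
    x = random.randrange(X, 2 * X)
    avgs.append(sum(liouville(n) for n in range(x, x + h)) / h)
mean_sq = sum(a * a for a in avgs) / len(avgs)
print(mean_sq)  # far smaller than the trivial bound of 1
```

Needless to say, such experiments say nothing about the proof; the hard part of the theorem is that the smallness holds for almost all intervals, for {h} growing arbitrarily slowly.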

Of course, (3) improves upon the trivial bound of {O( h^2 X )}. Prior to this paper, such estimates were only known (using arguments similar to those in Section 3 of Notes 6) for {h \geq X^{1/6+\varepsilon}} unconditionally, or for {h \geq \log^A X} for some sufficiently large {A} if one assumed the Riemann hypothesis. This theorem also represents some progress towards Chowla’s conjecture (discussed in Supplement 4) that

\displaystyle  \sum_{n \leq x} \lambda(n+h_1) \dots \lambda(n+h_k) = o( x )

as {x \rightarrow \infty} for any fixed distinct {h_1,\dots,h_k}; indeed, it implies that this conjecture holds if one performs a small amount of averaging in the {h_1,\dots,h_k}.

Below the fold, we give a “cheap” version of the Matomaki-Radziwill argument. More precisely, we establish

Theorem 7 (Cheap Matomaki-Radziwill) Let {X} be a parameter going to infinity, and let {1 \leq T \leq X}. Then

\displaystyle  \int_X^{X^A} \left|\sum_{x \leq n \leq e^{1/T} x} \frac{\lambda(n)}{n}\right|^2\frac{dx}{x} = o\left( \frac{\log X}{T^2} \right), \ \ \ \ \ (4)

for any fixed {A>1}.

Note that (4) improves upon the trivial bound of {O( \frac{\log X}{T^2} )}. Again, one can replace {\lambda} with {\mu} if desired. Due to the cheapness of Theorem 7, the proof will require few ingredients; the deepest input is the improved zero-free region for the Riemann zeta function due to Vinogradov and Korobov. Other than that, the main tools are the Turan-Kubilius result established above, and some Fourier (or complex) analysis.

— 1. Proof of theorem —

We now prove Theorem 7. We first observe that it will suffice to show that

\displaystyle  \int_0^\infty \varphi( \frac{\log x}{\log X} ) \left|\sum_n \eta( T( \log x - \log n ) ) \frac{\lambda(n)}{n}\right|^2\frac{dx}{x} = o\left( \frac{\log X}{T^2} \right)

for any smooth {\eta, \varphi: {\bf R} \rightarrow {\bf R}} supported on (say) {[-2,2]} and {(1,+\infty)} respectively, as the claim follows by taking {\eta} and {\varphi} to be approximations to {1_{[-1,0]}} and {1_{[1,A]}} respectively and using the triangle inequality to control the error.

We will need a quantity {P = X^{o(1)}} that goes to infinity reasonably fast; for instance, {P := \exp( \log^{0.99} X )} will suffice. By Lemma 3 and the triangle inequality, we can replace {\lambda(n)} in (4) by {\frac{1}{\log \log P} \sum_{p \leq P: p|n} \lambda(n)} while only incurring an acceptable error. Thus our task is now to show that

\displaystyle  \int_0^\infty \varphi( \frac{\log x}{\log X} ) \left|\sum_n \eta( T(\log x - \log n)) \frac{\sum_{p \leq P: p|n} \lambda(n)}{n}\right|^2\frac{dx}{x}

\displaystyle  = o\left( \frac{\log X}{T^2} (\log\log P)^2 \right).

I will (perhaps idiosyncratically) adopt a Fourier-analytic point of view here, rather than a more traditional complex-analytic point of view (for instance, we will use Fourier transforms as a substitute for Dirichlet series). To bring the Fourier perspective to the forefront, we make the change of variables {x = e^u} and {n = pm}, and note that {\varphi( \frac{\log x}{\log X} ) = \varphi( \frac{\log m}{\log X} ) + o(1)}, to rearrange the previous claim as

\displaystyle  \int_{\bf R} |\sum_{p \leq P} \frac{1}{p} F( u - \log p )|^2\ du = o( \log X (\log\log P)^2 ).

where
\displaystyle  F(y) := T \sum_m \varphi( \frac{\log m}{\log X} ) \eta( T( y - \log m) ) \frac{\lambda(m)}{m}. \ \ \ \ \ (5)

Introducing the normalised discrete measure

\displaystyle  \mu := \frac{1}{\log\log P} \sum_{p \leq P} \frac{1}{p} \delta_{\log p},

it thus suffices to show that

\displaystyle  \| F * \mu \|_{L^2({\bf R})}^2 = o( \log X )

where {*} now denotes ordinary (Fourier) convolution rather than Dirichlet convolution.

From Mertens’ theorem we see that {\mu} has total mass {O(1)}; also, from the triangle inequality (and the hypothesis {T \leq X}) we see that {F} is supported on {[(1-o(1)) \log X, (2+o(1))\log X]} and obeys the pointwise bound of {O(1)}; also, the derivative of {F} is bounded by {O(T)}. Thus we see that the trivial bound on {\| F * \mu \|_{L^2({\bf R})}^2} is {O( \log X)} by Young’s inequality. To improve upon this, we use Fourier analysis. By Plancherel’s theorem, we have

\displaystyle  \| F * \mu \|_{L^2({\bf R})}^2 = \int_{\bf R} |\hat F(\xi)|^2 |\hat \mu(\xi)|^2\ d\xi

where {\hat F, \hat \mu} are the Fourier transforms

\displaystyle  \hat F(\xi) := \int_{\bf R} F(x) e^{-2\pi i x \xi}\ dx


and
\displaystyle  \hat \mu(\xi) := \int_{\bf R} e^{-2\pi i x \xi}\ d\mu(x).

From Plancherel’s theorem and the derivative bound on {F} we have

\displaystyle  \int_{\bf R} |\hat F(\xi)|^2 \ll \log X


and
\displaystyle  \int_{\bf R} |\xi|^2 |\hat F(\xi)|^2 \ll T^2 \log X \leq X^2 \log X

so the contribution of those {\xi} with {\hat \mu(\xi)=o(1)} or {X/|\xi| = o(1)} is acceptable. Also, from the definition of {F} we have

\displaystyle  \hat F(\xi) = \hat \eta( \xi / T ) \sum_m \varphi( \frac{\log m}{\log X} ) \frac{\lambda(m)}{m^{1 + 2\pi i \xi}}

and so from the prime number theorem we have {\hat F(\xi) = o(1)} when {\xi = O(1)}; since {\hat \mu(\xi) = O(1)}, we see that the contribution of the region {|\xi| = O(1)} is also acceptable. It thus suffices to show that

\displaystyle  \hat \mu(\xi) = o(1)

whenever {\xi = O(X)} and {1/|\xi|=o(1)}. But by definition of {\mu}, we may expand {\hat \mu(\xi)} as

\displaystyle  \frac{1}{\log\log P} \sum_{p \leq P} \frac{1}{p^{1 + 2\pi i \xi}}

so by smoothed dyadic decomposition (and by choosing {P = X^{o(1)}} with {o(1)} decaying sufficiently slowly) it suffices to show that

\displaystyle  \sum_p \varphi( \frac{\log p}{\log Q} ) \frac{\log p}{p^{1 + \frac{1}{\log Q} + 2\pi i \xi}} = o( \log Q )

whenever {Q = X^{o(1)}} for some sufficiently slowly decaying {o(1)}. We replace the summation over primes with a von Mangoldt function weight to rewrite this as

\displaystyle  \sum_n \varphi( \frac{\log n}{\log Q} ) \frac{\Lambda(n)}{n^{1 + \frac{1}{\log Q} + 2\pi i \xi}} = o( \log Q ).

Performing a Fourier expansion of the smooth function {\varphi}, it thus suffices to show the Dirichlet series bound

\displaystyle  -\frac{\zeta'}{\zeta}(\sigma+it) = \sum_n \frac{\Lambda(n)}{n^{\sigma+it}} = o( \log |t| )

as {|t| \rightarrow \infty} and {\sigma > 1} (we use the crude bound {-\frac{\zeta'}{\zeta}(\sigma+it) \ll \frac{1}{\sigma-1}} to deal with the {t=O(1)} contribution). But this follows from the Vinogradov-Korobov bounds (who in fact get a bound of {O( \log^{2/3}(|t|) \log\log^{1/3}(|t|) )} as {|t| \rightarrow \infty}); see Exercise 43 of Notes 2 combined with Exercise 4(i) of Notes 5.

Remark 8 If one were working with a more general completely multiplicative function {f} than the Liouville function {\lambda}, then one would have to use a duality argument to control the large values of {\hat \mu} (which could occur at a couple more locations than {\xi = O(1)}), and use some version of Halasz’s theorem to also obtain some non-trivial bounds on {F} at those large values (this would require some hypothesis that {f} does not pretend to be like any of the characters {p \mapsto p^{it}} with {t = O(X)}). These new ingredients are in a similar spirit to the “log-free density theorem” from Theorem 6 of Notes 7. See the Matomaki-Radziwill paper for details (in the non-cheap case).

Filed under: 254A - analytic prime number theory, math.NT Tagged: Halasz's theorem, Kaisa Matomaki, Maksym Radziwill, multiplicative number theory, Turan-Kubilius inequality

February 26, 2015

Mark Chu-CarrollPropositions as Proofsets: Unwinding the confusion

My type theory post about the different interpretations of a proposition caused a furor in the comments. Understanding what caused all of the confusion is going to be important as we continue to move forward into type theory.

The root problem is really interesting, once you see what’s going on. We’re taking a statement that, on the face of it, isn’t about sets. Then we’re applying a set-based interpretation of it, and looking at the subset relation. That’s all good. The problem is that when we start looking at a set-based interpretation, we’re doing what we would do in classical set theory – but that’s a different thing from what we’re doing here. In effect, we’re changing the statement.

For almost all of us, math is something that we learned from the perspective of axiomatic set theory and first order predicate logic. So that’s the default interpretation that we put on anything mathematical. When we talk about a proposition as a set, we’re programmed to think of it in that classical way: for any set S, there’s a logical predicate P_s such that by definition, \forall x: x \in S \Leftrightarrow P_s(x). When you see P \Rightarrow Q in a set-theory context, what you think is something like \forall x: x \in P \Rightarrow x \in Q. Under that interpretation, the idea that P \supset Q is equivalent to P \rightarrow Q is absolutely ridiculous. If you follow the logic, implication must be the reverse of the subset relation!

The catch, though, is that we’re not talking about set theory, and the statement P \Rightarrow Q that we’re looking at is emphatically not \forall x : P(x) \Rightarrow Q(x). And that, right there, is the root of the problem.

P \rightarrow Q always means P \rightarrow Q – it doesn’t matter whether we’re doing set theory or type theory or whatever else. But when we talk about the interpretation of P as a set, the set we mean in the world of type theory is a different set from the one we’d mean in classical set theory.

Superset doesn’t suddenly mean subset. Implication doesn’t start working backwards! And yet, I’m still trying to tell you that I really meant it when I said that superset meant implication! How can that possibly make sense?

In type theory, we’re trying to take a very different look at math. In particular, we’re building everything up on a constructive/computational framework. So we’re necessarily going to look at some different interpretations of things – we’re going to look at things in ways that just don’t make sense in the world of classical set theory/FOPL. We’re not going to contradict set theory – but we’re going to look at things very differently.

For example, the kind of statement we’re talking about here is a complete, closed, logical proposition, not a predicate, nor a set. The proposition P is a statement like “‘hello’ has five letters”.

When we look at a logical proposition P, one of the type theoretic interpretations of it is as a set of facts: P can be viewed as the set of all facts that can be proven true using P. In type theory land, this makes perfect sense: if I’ve got a proof of P, then I’ve got a proof of everything that P can prove. P isn’t a statement about the items in P’s proof-set. P is a logical statement about something, and the elements of the proof-set of P are the things that the statement P can prove.

With that in mind, what does P \Rightarrow Q mean in type theory? It means that everything provable using Q is provable using nothing but P.

(It’s really important to note here that there are no quantifiers in that statement. Again, we are not saying \forall x: P(x) \Rightarrow Q(x). P and Q are atomic propositions – not open quantified statements.)

If you are following the interpretation that says that P is the set of facts that are provable using the proposition P, then if P \Rightarrow Q, that means that everything that’s in Q must also be in P. In fact, it means pretty much exactly the same thing as classical superset. Q is a set of facts provable by the statement Q. The statement Q is provable using the statement P – which means that everything in the provable set of Q must, by definition, be in the provable set of P.

The converse doesn’t hold. There can be things provable by P (and thus in the proof-set of P) which are not provable using Q. So taken as sets of facts provable by logical propositions, P \supset Q!
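Here’s a tiny computational toy (my own illustration, with invented names, not from the post) that makes the proof-set reading concrete: identify each proposition with the set of facts derivable from it, and implication shows up as exactly the superset relation.

```python
# Toy "propositions as proof-sets": each proposition is identified with the
# set of facts it can prove, given some Horn-style derivation rules.

def proof_set(prop, rules):
    """Close {prop} under rules of the form (premise, conclusion)."""
    derived = {prop}
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return frozenset(derived)

RULES = [("P", "Q"), ("Q", "R")]  # P proves Q, and Q proves R

P = proof_set("P", RULES)  # everything provable from P: {P, Q, R}
Q = proof_set("Q", RULES)  # everything provable from Q: {Q, R}

# P => Q appears as a superset relation on proof-sets...
assert P >= Q
# ...and the converse fails: Q does not prove P.
assert not Q >= P
```

The `>=` here is Python’s superset test on frozensets, so the asymmetry of implication is literally the asymmetry of the superset relation in this toy model.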

Again, that seems like it’s the opposite of what we’d expect. But the trick is to recognize the meaning of the statements we’re working with, and that despite a surface resemblance, they’re not the same thing that we’re used to. Type theory isn’t saying that the set theoretic statements are wrong; nor is set theory saying that type theory is wrong.

The catch is simple: we’re trying to inject a kind of quantification into the statement P \Rightarrow Q which isn’t there; and then we’re using our interpretation of that quantified statement to say something different.

But there’s an interpretation of statements in type theory which is entirely valid, yet which trips up our intuition: our training has taught us to take it, and expand it into an entirely different statement. We create blanks that aren’t there, fill them in, and by doing so, convert it into something that it isn’t, and confuse ourselves.

Doug NatelsonBrief items + the March APS Meeting

This has been an absurdly busy period for the last few weeks; hence my lower rate of posting.  I hope that this will resolve itself relatively soon, but you never know.  I am going to the first three days of the March APS meeting, and will try to blog about what I see and learn there, as I have in past years.

In the meantime, a handful of items that have cropped up:
  • If you go to the APS meeting, you can swing by the Cambridge University Press table, and pre-order my nano textbook for a mere $64.  It's more than 600 pages with color figures - that's a pretty good deal.  They will have a couple of bound proof copies, so you can see what it looks like, to a good approximation.  If you teach a senior undergrad or first-year grad sequence on this stuff and think you might have an interest in trying this out as a text, please drop me an email and I can see about getting you a copy.  (My editor tells me that the best way to boost readership of the book is to get a decent number of [hopefully positive] reviews on Amazon....)
  • On a related note, you should really swing by the Cambridge table to order yourself a copy of the 19-years-in-the-making third edition of Horowitz and Hill's Art of Electronics.  I haven't seen it yet, but I have every reason to think that it's going to be absolutely fantastic.  Seriously, from the experimental physics side, this is a huge deal.
  • This is a fun video, showing a "motor" made from an alkaline battery, a couple of metal-coated rare-earth magnets, and a coil of uninsulated wire.  It's not that crazy to see broadly how it works (think inhomogeneous fields from a finite solenoid + large magnetic moment), but it's cool nonetheless.
  • Here's an article (pdf) that's very much worth reading about the importance of government funding of basic research.  It was favorably referenced here by that (sarcasm mode = on) notorious socialist organization (/sarcasm), the American Enterprise Institute.

n-Category Café Introduction to Synthetic Mathematics (part 1)

John is writing about “concepts of sameness” for Elaine Landry’s book Category Theory for the Working Philosopher, and has been posting some of his thoughts and drafts. I’m writing for the same book about homotopy type theory / univalent foundations; but since HoTT/UF will also make a guest appearance in John’s and David Corfield’s chapters, and one aspect of it (univalence) is central to Steve Awodey’s chapter, I had to decide what aspect of it to emphasize in my chapter.

My current plan is to focus on HoTT/UF as a synthetic theory of \infty-groupoids. But in order to say what that even means, I felt that I needed to start with a brief introduction about the phrase “synthetic theory”, which may not be familiar. Right now, my current draft of that “introduction” is more than half the allotted length of my chapter; so clearly it’ll need to be trimmed! But I thought I would go ahead and post some parts of it in its current form; so here goes.

In general, mathematical theories can be classified as analytic or synthetic. An analytic theory is one that analyzes, or breaks down, its objects of study, revealing them as put together out of simpler things, just as complex molecules are put together out of protons, neutrons, and electrons. For example, analytic geometry analyzes the plane geometry of points, lines, etc. in terms of real numbers: points are ordered pairs of real numbers, lines are sets of points, etc. Mathematically, the basic objects of an analytic theory are defined in terms of those of some other theory.

By contrast, a synthetic theory is one that synthesizes, or puts together, a conception of its basic objects based on their expected relationships and behavior. For example, synthetic geometry is more like the geometry of Euclid: points and lines are essentially undefined terms, given meaning by the axioms that specify what we can do with them (e.g. two points determine a unique line). (Although Euclid himself attempted to define “point” and “line”, modern mathematicians generally consider this a mistake, and regard Euclid’s “definitions” (like “a point is that which has no part”) as fairly meaningless.) Mathematically, a synthetic theory is a formal system governed by rules or axioms. Synthetic mathematics can be regarded as analogous to foundational physics, where a concept like the electromagnetic field is not “put together” out of anything simpler: it just is, and behaves in a certain way.

The distinction between analytic and synthetic dates back at least to Hilbert, who used the words “genetic” and “axiomatic” respectively. At one level, we can say that modern mathematics is characterized by a rich interplay between analytic and synthetic — although most mathematicians would speak instead of definitions and examples. For instance, a modern geometer might define “a geometry” to satisfy Euclid’s axioms, and then work synthetically with those axioms; but she would also construct examples of such “geometries” analytically, such as with ordered pairs of real numbers. This approach was pioneered by Hilbert himself, who emphasized in particular that constructing an analytic example (or model) proves the consistency of the synthetic theory.

However, at a deeper level, almost all of modern mathematics is analytic, because it is all analyzed into set theory. Our modern geometer would not actually state her axioms the way that Euclid did; she would instead define a geometry to be a set P of points together with a set L of lines and a subset of P\times L representing the “incidence” relation, etc. From this perspective, the only truly undefined term in mathematics is “set”, and the only truly synthetic theory is Zermelo–Fraenkel set theory (ZFC).
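To see this analytic style in miniature, here is a toy version of exactly that kind of definition (a sketch of my own; the 3×3 grid and all names are invented for illustration): points are coordinate pairs, lines are sets of collinear points, and incidence is just set membership.

```python
from itertools import combinations

# A tiny "analytic geometry": points are coordinate pairs from a 3x3 grid,
# lines are maximal sets of collinear grid points, incidence is membership.
points = {(x, y) for x in range(3) for y in range(3)}

def line_through(p, q):
    """All grid points collinear with the distinct points p and q."""
    (x1, y1), (x2, y2) = p, q
    return frozenset(
        (x, y) for (x, y) in points
        if (x - x1) * (y2 - y1) == (y - y1) * (x2 - x1)
    )

lines = {line_through(p, q) for p, q in combinations(points, 2)}
incidence = {(p, L) for p in points for L in lines if p in L}

# A Euclid-style axiom, verified analytically rather than postulated:
# two distinct points lie on exactly one common line.
for p, q in combinations(points, 2):
    assert sum(1 for L in lines if p in L and q in L) == 1
```

In a synthetic treatment, by contrast, “point”, “line”, and “lies on” would be primitives, and the final assertion would be an axiom rather than something checked about a model.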

This use of set theory as the common foundation for mathematics is, of course, of 20th century vintage, and overall it has been a tremendous step forwards. Practically, it provides a common language and a powerful basic toolset for all mathematicians. Foundationally, it ensures that all of mathematics is consistent relative to set theory. (Hilbert’s dream of an absolute consistency proof is generally considered to have been demolished by Gödel’s incompleteness theorem.) And philosophically, it supplies a consistent ontology for mathematics, and a context in which to ask metamathematical questions.

However, ZFC is not the only theory that can be used in this way. While not every synthetic theory is rich enough to allow all of mathematics to be encoded in it, set theory is by no means unique in possessing such richness. One possible variation is to use a different sort of set theory like ETCS, in which the elements of a set are “featureless points” that are merely distinguished from each other, rather than labeled individually by the elaborate hierarchical membership structures of ZFC. Either sort of “set” suffices just as well for foundational purposes, and moreover each can be interpreted into the other.

However, we are now concerned with more radical possibilities. A paradigmatic example is topology. In modern “analytic topology”, a “space” is defined to be a set of points equipped with a collection of subsets called open, which describe how the points vary continuously into each other. (Most analytic topologists, being unaware of synthetic topology, would call their subject simply “topology.”) By contrast, in synthetic topology we postulate instead an axiomatic theory, on the same ontological level as ZFC, whose basic objects are spaces rather than sets.

Of course, by saying that the basic objects “are” spaces we do not mean that they are sets equipped with open subsets. Instead we mean that “space” is an undefined word, and the rules of the theory cause these “spaces” to behave more or less like we expect spaces to behave. In particular, synthetic spaces have open subsets (or, more accurately, open subspaces), but they are not defined by specifying a set together with a collection of open subsets.

It turns out that synthetic topology, like synthetic set theory (ZFC), is rich enough to encode all of mathematics. There is one trivial sense in which this is true: among all analytic spaces we find the subclass of indiscrete ones, in which the only open subsets are the empty set and the whole space. A notion of “indiscrete space” can also be defined in synthetic topology, and the collection of such spaces forms a universe of ETCS-like sets (we’ll come back to these in later installments). Thus we could use them to encode mathematics, entirely ignoring the rest of the synthetic theory of spaces. (The same could be said about the discrete spaces, in which every subset is open; but these are harder (though not impossible) to define and work with synthetically. The relation between the discrete and indiscrete spaces, and how they sit inside the synthetic theory of spaces, is central to the synthetic theory of cohesion, which I believe David is going to mention in his chapter about the philosophy of geometry.)

However, a less boring approach is to construct the objects of mathematics directly as spaces. How does this work? It turns out that the basic constructions on sets that we use to build (say) the set of real numbers have close analogues that act on spaces. Thus, in synthetic topology we can use these constructions to build the space of real numbers directly. If our system of synthetic topology is set up well, then the resulting space will behave like the analytic space of real numbers (the one that is defined by first constructing the mere set of real numbers and then equipping it with the unions of open intervals as its topology).

The next question is, why would we want to do mathematics this way? There are a lot of reasons, but right now I believe they can be classified into three sorts: modularity, philosophy, and pragmatism. (If you can think of other reasons that I’m forgetting, please mention them in the comments!)

By “modularity” I mean the same thing as does a programmer: even if we believe that spaces are ultimately built analytically out of sets, it is often useful to isolate their fundamental properties and work with those abstractly. One advantage of this is generality. For instance, any theorem proven in Euclid’s “neutral geometry” (i.e. without using the parallel postulate) is true not only in the model of ordered pairs of real numbers, but also in the various non-Euclidean geometries. Similarly, a theorem proven in synthetic topology may be true not only about ordinary topological spaces, but also about other variant theories such as topological sheaves, smooth spaces, etc. As always in mathematics, if we state only the assumptions we need, our theorems become more general.

Even if we only care about one model of our synthetic theory, modularity can still make our lives easier, because a synthetic theory can formally encapsulate common lemmas or styles of argument that in an analytic theory we would have to be constantly proving by hand. For example, just as every object in synthetic topology is “topological”, every “function” between them automatically preserves this topology (is “continuous”). Thus, in synthetic topology every function \mathbb{R}\to \mathbb{R} is automatically continuous; all proofs of continuity have been “packaged up” into the single proof that analytic topology is a model of synthetic topology. (We can still speak about discontinuous functions too, if we want to; we just have to re-topologize \mathbb{R} indiscretely first. Thus, synthetic topology reverses the situation of analytic topology: discontinuous functions are harder to talk about than continuous ones.)

By contrast to the argument from modularity, an argument from philosophy is a claim that the basic objects of mathematics really are, or really should be, those of some particular synthetic theory. Nowadays it is hard to find mathematicians who hold such opinions (except with respect to set theory), but historically we can find them taking part in the great foundational debates of the early 20th century. It is admittedly dangerous to make any precise claims in modern mathematical language about the beliefs of mathematicians 100 years ago, but I think it is justified to say that in hindsight, one of the points of contention in the great foundational debates was which synthetic theory should be used as the foundation for mathematics, or in other words what kind of thing the basic objects of mathematics should be. Of course, this was not visible to the participants, among other reasons because many of them used the same words (such as “set”) for the basic objects of their theories. (Another reason is that among the points at issue was the very idea that a foundation of mathematics should be built on precisely defined rules or axioms, which today most mathematicians take for granted.) But from a modern perspective, we can see that (for instance) Brouwer’s intuitionism is actually a form of synthetic topology, while Markov’s constructive recursive mathematics is a form of “synthetic computability theory”.

In these cases, the motivation for choosing such synthetic theories was clearly largely philosophical. The Russian constructivists designed their theory the way they did because they believed that everything should be computable. Similarly, Brouwer’s intuitionism can be said to be motivated by a philosophical belief that everything in mathematics should be continuous.

(I wish I could write more about the latter, because it’s really interesting. The main thing that makes Brouwerian intuitionism non-classical is choice sequences: infinite sequences in which each element can be “freely chosen” by a “creating subject” rather than being supplied by a rule. The concrete conclusion Brouwer drew from this is that any operation on such sequences must be calculable, at least in stages, using only finite initial segments, since we can’t ask the creating subject to make an infinite number of choices all at once. But this means exactly that any such operation must be continuous with respect to a suitable topology on the space of sequences. It also connects nicely with the idea of open sets as “observations” or “verifiable statements” that was mentioned in another thread. However, from the perspective of my chapter for the book, the purpose of this introduction is to lay the groundwork for discussing HoTT/UF as a synthetic theory of \infty-groupoids, and Brouwerian intuitionism would be a substantial digression.)

Finally, there are arguments from pragmatism. Whereas the modularist believes that the basic objects of mathematics are actually sets, and the philosophist believes that they are actually spaces (or whatever), the pragmatist says that they could be anything: there’s no need to commit to a single choice. Why do we do mathematics, anyway? One reason is because we find it interesting or beautiful. But all synthetic theories may be equally interesting and beautiful (at least to someone), so we may as well study them as long as we enjoy it.

Another reason we study mathematics is because it has some application outside of itself, e.g. to theories of the physical world. Now it may happen that all the mathematical objects that arise in some application happen to be (say) spaces. (This is arguably true of fundamental physics. Similarly, in applications to computer science, all objects that arise may happen to be computable.) In this case, why not just base our application on a synthetic theory that is good enough for the purpose, thereby gaining many of the advantages of modularity, but without caring about how or whether our theory can be modeled in set theory?

It is interesting to consider applying this perspective to other application domains. For instance, we also speak of sets outside of a purely mathematical framework, to describe collections of physical objects and mental acts of categorization; could we use spaces in the same way? Might collections of objects and thoughts automatically come with a topological structure by virtue of how they are constructed, like the real numbers do? I think this also starts to seem quite natural when we imagine topology in terms of “observations” or “verifiable statements”. Again, saying any more about that in my chapter would be a substantial digression; but I’d be interested to hear any thoughts about it in the comments here!

Clifford JohnsonCeiba Speciosa

Saw all the fluffy stuff on the ground. Took me a while to "cotton on" and look up: a silk floss tree (Ceiba speciosa; click for larger view). -cvj Click to continue reading this post

February 25, 2015

ResonaancesPersistent trouble with bees

No, I still have nothing to say about colony collapse disorder... this blog will stick to physics for at least 2 more years. This is an update on the anomalies in B decays reported by the LHCbee experiment. The two most important ones are:

  1. The  3.7 sigma deviation from standard model predictions in the differential distribution of the B➝K*μ+μ- decay products.
  2.  The 2.6 sigma violation of lepton flavor universality in B+→K+l+l- decays. 

The first anomaly is statistically more significant. However, the theoretical error of the standard model prediction is not trivial to estimate and the significance of the anomaly is subject to fierce discussions. Estimates in the literature range from 4.5 sigma to 1 sigma, depending on what is assumed about QCD uncertainties. For this reason, the second anomaly made this story much more intriguing.  In that case, LHCb measures the ratio of the decay with muons and with electrons:  B+→K+μ+μ- vs B+→K+e+e-. This observable is theoretically clean, as large QCD uncertainties cancel in the ratio. Of course, 2.6 sigma significance is not too impressive; LHCb once had a bigger anomaly (remember CP violation in D meson decays?)  that is now long gone. But it's fair to say that the two anomalies together are marginally interesting.

One nice thing is that both anomalies can be explained at the same time by a simple modification of the standard model. Namely, one needs to add the 4-fermion coupling between a b-quark, an s-quark, and two muons:
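(The operator itself appears only as an image in the original post; in standard notation it is presumably of the form \frac{1}{\Lambda^2} (\bar b_L \gamma_\mu s_L)(\bar \mu \gamma^\mu \mu), up to conventions,)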

with Λ of order 30 TeV. Just this one extra coupling greatly improves a fit to the data, though other similar couplings could be simultaneously present. The 4-fermion operators can be an effective description of new heavy particles coupled to quarks and leptons.  For example, a leptoquark (scalar particle with a non-zero color charge and lepton number) or a Z'  (neutral U(1) vector boson) with mass in a few TeV range have been proposed. These are of course simple models created ad-hoc. Attempts to put these particles in a bigger picture of physics beyond  the standard model have not been very convincing so far, which may be one reason why the anomalies are viewed a bit skeptically. The flip side is that, if the anomalies turn out to be real, this will point to unexpected symmetry structures around the corner.

Another nice element of this story is that it will be possible to acquire additional relevant information in the near future. The first anomaly is based on just 1 fb-1 of LHCb data, and it will be updated to full 3 fb-1 some time this year. Furthermore, there are literally dozens of other B decays where the 4-fermion operators responsible for the anomalies could  also show up. In fact, there may already be some hints that this is happening. In the table borrowed from this paper we can see that there are several other  2-sigmish anomalies in B-decays that may possibly have the same origin. More data and measurements in  more decay channels should clarify the picture. In particular, violation of lepton flavor universality may come together with lepton flavor violation.  Observation of decays forbidden in the standard model, such as B→Keμ or  B→Kμτ, would be a spectacular and unequivocal signal of new physics.

n-Category Café Concepts of Sameness (Part 3)

Now I’d like to switch to pondering different approaches to equality. (Eventually I’ll have put all these pieces together into a coherent essay, but not yet.)

We tend to think of x = x as a fundamental property of equality, perhaps the most fundamental of all. But what is it actually used for? I don’t really know. I sometimes joke that equations of the form x = x are the only really true ones — since any other equation says that different things are equal — but they’re also completely useless.

But maybe I’m wrong. Maybe equations of the form x = x are useful in some way. I can imagine one coming in handy at the end of a proof by contradiction where you show some assumptions imply x ≠ x. But I don’t remember ever doing such a proof… and I have trouble imagining that you ever need to use a proof of this style.

If you’ve used the equation x = x in your own work, please let me know.

To explain my question a bit more precisely, it will help to choose a specific formalism: first-order classical logic with equality. We can get the rules for this system by taking first-order classical logic with function symbols and adding a binary predicate “=” together with three axiom schemas:

1. Reflexivity: for each variable x,

x = x

2. Substitution for functions: for any variables x, y and any function symbol f,

x = y ⇒ f(…, x, …) = f(…, y, …)

3. Substitution for formulas: For any variables x, y and any formula φ, if φ′ is obtained by replacing any number of free occurrences of x in φ with y, such that these remain free occurrences of y, then

x = y ⇒ (φ ⇒ φ′)

Where did symmetry and transitivity of equality go? They can actually be derived!

For transitivity, use ‘substitution for formulas’ and take φ to be x = z, so that φ′ is y = z. Then we get

x = y ⇒ (x = z ⇒ y = z)

This is almost transitivity. From this we can derive

(x = y & x = z) ⇒ y = z

and from this we can derive the usual statement of transitivity

(x = y & y = z) ⇒ x = z

by choosing different names of variables and using symmetry of equality.

But how do we get symmetry? We can derive this using reflexivity and substitution for formulas. Take φ to be x = x and take φ′ to be the result of substituting the first instance of x with y: that is, y = x. Then we get

x = y ⇒ (x = x ⇒ y = x)

Using x = x, we can derive

x = y ⇒ y = x

This is the only time I remember using x = x to derive something! So maybe this equation is good for something. But if proving symmetry and transitivity of equality is the only thing it’s good for, I’m not very impressed. I would have been happy to take both of these as axioms, if necessary. After all, people often do.
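For what it’s worth, both derivations can be replayed in a proof assistant. Here is a sketch in Lean 4 (the theorem names and setup are mine, not from the post): `Eq.subst` plays the role of ‘substitution for formulas’, and `rfl` supplies the single use of reflexivity.

```lean
-- Symmetry from reflexivity + substitution, as in the text:
-- take φ := (x = x) and substitute the first x by y to get φ' := (y = x).
theorem symm_via_subst {α : Type} {x y : α} (h : x = y) : y = x :=
  Eq.subst (motive := fun z => z = x) h rfl

-- Transitivity needs substitution only, with no appeal to reflexivity:
-- take φ := (x = z) and substitute x by y to get φ' := (y = z).
theorem trans_via_subst {α : Type} {x y z : α} (h₁ : x = y) (h₂ : x = z) :
    y = z :=
  Eq.subst (motive := fun w => w = z) h₁ h₂
```

Note that `trans_via_subst` never uses `rfl`, matching the observation that transitivity needs only substitution, while the symmetry proof is exactly where reflexivity earns its keep.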

So, just to get the conversation started, I’ll conjecture that reflexivity of equality is completely useless if we include symmetry of equality in our axioms. Namely:

Conjecture. Any theorem in classical first-order logic with equality that does not include a subformula of the form x = x for any variable x can also be derived from a variant where we drop reflexivity, keep substitution for functions and substitution for formulas, and add this axiom schema:

1’. Symmetry: for any variables x and y,

x = y ⇒ y = x

Proof theorists: can you show this is true, or find a counterexample? We’ve seen that we can get transitivity from this setup, and then I don’t really see how it hurts to omit x = x. I may be forgetting something, though!

John PreskillCelebrating Theoretical Physics at Caltech’s Burke Institute

Editor’s Note: Yesterday and today, Caltech is celebrating the inauguration of the Walter Burke Institute for Theoretical Physics. John Preskill made the following remarks at a dinner last night honoring the board of the Sherman Fairchild Foundation.

This is an exciting night for me and all of us at Caltech. Tonight we celebrate physics. Especially theoretical physics. And in particular the Walter Burke Institute for Theoretical Physics.

Some of our dinner guests are theoretical physicists. Why do we do what we do?

I don’t have to convince this crowd that physics has a profound impact on society. You all know that. We’re celebrating this year the 100th anniversary of general relativity, which transformed how we think about space and time. It may be less well known that two years later Einstein laid the foundations of laser science. Einstein was a genius for sure, but I don’t think he envisioned in 1917 that we would use his discoveries to play movies in our houses, or print documents, or repair our vision. Or see an awesome light show at Disneyland.

And where did this phone in my pocket come from? Well, the story of the integrated circuit is fascinating, prominently involving Sherman Fairchild, and other good friends of Caltech like Arnold Beckman and Gordon Moore. But when you dig a little deeper, at the heart of the story are two theorists, Bill Shockley and John Bardeen, with an exceptionally clear understanding of how electrons move through semiconductors. Which led to transistors, and integrated circuits, and this phone. And we all know it doesn’t stop here. When the computers take over the world, you’ll know who to blame.

Incidentally, while Shockley was a Caltech grad (BS class of 1932), John Bardeen, one of the great theoretical physicists of the 20th century, grew up in Wisconsin and studied physics and electrical engineering at the University of Wisconsin at Madison. I suppose that in the 1920s Wisconsin had no pressing need for physicists, but think of the return on the investment the state of Wisconsin made in the education of John Bardeen.1

So, physics is a great investment, of incalculable value to society. But … that’s not why I do it. I suppose few physicists choose to do physics for that reason. So why do we do it? Yes, we like it, we’re good at it, but there is a stronger pull than just that. We honestly think there is no more engaging intellectual adventure than struggling to understand Nature at the deepest level. This requires attitude. Maybe you’ve heard that theoretical physicists have a reputation for arrogance. Okay, it’s true, we are arrogant, we have to be. But it is not that we overestimate our own prowess, our ability to understand the world. In fact, the opposite is often true. Physics works, it’s successful, and this often surprises us; we wind up being shocked again and again by the “unreasonable effectiveness of mathematics in the natural sciences.” It’s hard to believe that the equations you write down on a piece of paper can really describe the world. But they do.

And to display my own arrogance, I’ll tell you more about myself. This occasion has given me cause to reflect on my own 30+ years on the Caltech faculty, and what I’ve learned about doing theoretical physics successfully. And I’ll tell you just three principles, which have been important for me, and may be relevant to the future of the Burke Institute. I’m not saying these are universal principles – we’re all different and we all contribute in different ways, but these are principles that have been important for me.

My first principle is: We learn by teaching.

Why do physics at universities, at institutions of higher learning? Well, not all great physics is done at universities. Excellent physics is done at industrial laboratories and at our national laboratories. But the great engine of discovery in the physical sciences is still our universities, and US universities like Caltech in particular. Granted, US preeminence in science is not what it once was — it is a great national asset to be cherished and protected — but world changing discoveries are still flowing from Caltech and other great universities.

Why? Well, when I contemplate my own career, I realize I could never have accomplished what I have as a research scientist if I were not also a teacher. And it’s not just because the students and postdocs have all the great ideas. No, it’s more interesting than that. Most of what I know about physics, most of what I really understand, I learned by teaching it to others. When I first came to Caltech 30 years ago I taught advanced elementary particle physics, and I’m still reaping the return from what I learned those first few years. Later I got interested in black holes, and most of what I know about that I learned by teaching general relativity at Caltech. And when I became interested in quantum computing, a really new subject for me, I learned all about it by teaching it.2

Part of what makes teaching so valuable for the teacher is that we’re forced to simplify, to strip down a field of knowledge to what is really indispensable, a tremendously useful exercise. Feynman liked to say that if you really understand something you should be able to explain it in a lecture for the freshman. Okay, he meant the Caltech freshman. They’re smart, but they don’t know all the sophisticated tools we use in our everyday work. Whether you can explain the core idea without all the peripheral technical machinery is a great test of understanding.

And of course it’s not just the teachers, but also the students and the postdocs who benefit from the teaching. They learn things faster than we do and often we’re just providing some gentle steering; the effect is to amplify greatly what we could do on our own. All the more so when they leave Caltech and go elsewhere to change the world, as they so often do, like those who are returning tonight for this Symposium. We’re proud of you!

My second principle is: The two-trick pony has a leg up.

I’m a firm believer that advances are often made when different ideas collide and a synthesis occurs. I learned this early, when as a student I was fascinated by two topics in physics, elementary particles and cosmology. Nowadays everyone recognizes that particle physics and cosmology are closely related, because when the universe was very young it was also very hot, and particles were colliding at very high energies. But back in the 1970s, the connection was less widely appreciated. By knowing something about cosmology and about particle physics, by being a two-trick pony, I was able to think through what happens as the universe cools, which turned out to be my ticket to becoming a Caltech professor.

It takes a community to produce two-trick ponies. I learned cosmology from one set of colleagues and particle physics from another set of colleagues. I didn’t know either subject as well as the real experts. But I was a two-trick pony, so I had a leg up. I’ve tried to be a two-trick pony ever since.

Another great example of a two-trick pony is my Caltech colleague Alexei Kitaev. Alexei studied condensed matter physics, but he also became intensely interested in computer science, and learned all about that. Back in the 1990s, perhaps no one else in the world combined so deep an understanding of both condensed matter physics and computer science, and that led Alexei to many novel insights. Perhaps most remarkably, he connected ideas about error-correcting codes, which protect information from damage, with ideas about novel quantum phases of matter, leading to radical new suggestions about how to operate a quantum computer using exotic particles we call anyons. These ideas had an invigorating impact on experimental physics and may someday have a transformative effect on technology. (We don’t know that yet; it’s still way too early to tell.) Alexei could produce an idea like that because he was a two-trick pony.3

Which brings me to my third principle: Nature is subtle.

Yes, mathematics is unreasonably effective. Yes, we can succeed at formulating laws of Nature with amazing explanatory power. But it’s a struggle. Nature does not give up her secrets so readily. Things are often different than they seem on the surface, and we’re easily fooled. Nature is subtle.4

Perhaps there is no greater illustration of Nature’s subtlety than what we call the holographic principle. This principle says that, in a sense, all the information that is stored in this room, or any room, is really encoded entirely and with perfect accuracy on the boundary of the room, on its walls, ceiling and floor. Things just don’t seem that way, and if we underestimate the subtlety of Nature we’ll conclude that it can’t possibly be true. But unless our current ideas about the quantum theory of gravity are on the wrong track, it really is true. It’s just that the holographic encoding of information on the boundary of the room is extremely complex and we don’t really understand in detail how to decode it. At least not yet.

This holographic principle, arguably the deepest idea about physics to emerge in my lifetime, is still mysterious. How can we make progress toward understanding it well enough to explain it to freshmen? Well, I think we need more two-trick ponies. Except maybe in this case we’ll need ponies who can do three tricks or even more. Explaining how spacetime might emerge from some more fundamental notion is one of the hardest problems we face in physics, and it’s not going to yield easily. We’ll need to combine ideas from gravitational physics, information science, and condensed matter physics to make real progress, and maybe completely new ideas as well. Some of our former Sherman Fairchild Prize Fellows are leading the way at bringing these ideas together, people like Guifre Vidal, who is here tonight, and Patrick Hayden, who very much wanted to be here.5 We’re very proud of what they and others have accomplished.

Bringing ideas together is what the Walter Burke Institute for Theoretical Physics is all about. I’m not talking about only the holographic principle, which is just one example, but all the great challenges of theoretical physics, which will require ingenuity and synthesis of great ideas if we hope to make real progress. We need a community of people coming from different backgrounds, with enough intellectual common ground to produce a new generation of two-trick ponies.

Finally, it seems to me that an occasion as important as the inauguration of the Burke Institute should be celebrated in verse. And so …

Who studies spacetime stress and strain
And excitations on a brane,
Where particles go back in time,
And physicists engage in rhyme?

Whose speedy code blows up a star
(Though it won’t quite blow up so far),
Where anyons, which braid and roam
Annihilate when they get home?

Who makes math and physics blend
Inside black holes where time may end?
Where do they do all this work?
The Institute of Walter Burke!

We’re very grateful to the Burke family and to the Sherman Fairchild Foundation. And we’re confident that your generosity will make great things happen!


  1. I was reminded of this when I read about a recent proposal by the current governor of Wisconsin. 
  2. And by the way, I put my lecture notes online, and thousands of people still download them and read them. So even before MOOCs – massive open online courses – the Internet was greatly expanding the impact of our teaching. Handwritten versions of my old particle theory and relativity notes are also online here
  3. Okay, I admit it’s not quite that simple. At that same time I was also very interested in both error correction and in anyons, without imagining any connection between the two. It helps to be a genius. But a genius who is also a two-trick pony can be especially awesome. 
  4. We made that the tagline of IQIM. 
  5. Patrick can’t be here for a happy reason, because today he and his wife Mary Race welcomed a new baby girl, Caroline Eleanor Hayden, their first child. The Burke Institute is not the only good thing being inaugurated today. 

Terence Tao254A, Notes 7: Linnik’s theorem on primes in arithmetic progressions

In the previous set of notes, we saw how zero-density theorems for the Riemann zeta function, when combined with the zero-free region of Vinogradov and Korobov, could be used to obtain prime number theorems in short intervals. It turns out that a more sophisticated version of this type of argument also works to obtain prime number theorems in arithmetic progressions, in particular establishing the celebrated theorem of Linnik:

Theorem 1 (Linnik’s theorem) Let {a\ (q)} be a primitive residue class. Then {a\ (q)} contains a prime {p} with {p \ll q^{O(1)}}.

In fact it is known that one can find a prime {p} with {p \ll q^{5}}, a result of Xylouris. For sake of comparison, recall from Exercise 65 of Notes 2 that the Siegel-Walfisz theorem gives this theorem with a bound of {p \ll \exp( q^{o(1)} )}, and from Exercise 48 of Notes 2 one can obtain a bound of the form {p \ll \phi(q)^2 \log^2 q} if one assumes the generalised Riemann hypothesis. The probabilistic random models from Supplement 4 suggest that one should in fact be able to take {p \ll q^{1+o(1)}}.
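Linnik's theorem is easy to probe numerically for small moduli. The sketch below (my own illustration, not from the notes) computes the least prime in each primitive residue class mod q and compares the worst case against q², which is well below the proven exponent but in line with the conjectured bound p ≪ q^{1+o(1)}:

```python
import math

def is_prime(n: int) -> bool:
    """Trial division; adequate for the small n probed here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def least_prime(a: int, q: int) -> int:
    """Least prime p with p ≡ a (mod q), for a primitive class gcd(a, q) = 1."""
    if math.gcd(a, q) != 1:
        raise ValueError("residue class must be primitive")
    p = a if a > 1 else a + q          # skip p = 1, which is not prime
    while not is_prime(p):
        p += q
    return p

# Worst case over all primitive classes mod q, compared against q^2:
q = 97
worst = max(least_prime(a, q) for a in range(1, q) if math.gcd(a, q) == 1)
print(q, worst, q * q)
```

For instance, the least prime ≡ 1 (mod 9) is 19, far smaller than 9⁵; numerics like this are what motivate the belief that the true exponent is close to 1.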

We will not aim to obtain the optimal exponents for Linnik’s theorem here, and follow the treatment in Chapter 18 of Iwaniec and Kowalski. We will in fact establish the following more quantitative result (a special case of a more powerful theorem of Gallagher), which splits into two cases, depending on whether there is an exceptional zero or not:

Theorem 2 (Quantitative Linnik theorem) Let {a\ (q)} be a primitive residue class for some {q \geq 2}. For any {x > 1}, let