Planet Musings

December 01, 2015

Backreaction: Hawking radiation is not produced at the black hole horizon.

Stephen Hawking’s “Brief History of Time” was one of the first popular science books I read, and I hated it. I hated it because I didn’t understand it. My frustration with this book is a big part of the reason I’m a physicist today – at least I know who to blame.

I don’t hate the book any more – admittedly Hawking did a remarkable job of sparking public interest in the fundamental questions raised by black hole physics. But every once in a while I still want to punch the damned book. Not because I didn’t understand it, but because it convinced so many other people they did understand it.

In his book, Hawking painted a neat picture for black hole evaporation that is now widely used. According to this picture, black holes evaporate because pairs of virtual particles nearby the horizon are ripped apart by tidal forces. One of the particles gets caught behind the horizon and falls in, the other escapes. The result is a steady emission of particles from the black hole horizon. It’s simple, it’s intuitive, and it’s wrong.

Hawking’s is an illustrative picture, but nothing more than that. In reality – you will not be surprised to hear – the situation is more complicated.

The pairs of particles – to the extent that it makes sense to speak of particles at all – are not sharply localized. They are instead blurred out over a distance comparable to the black hole radius. The pairs do not start out as points, but as diffuse clouds smeared all around the black hole, and they only begin to separate when the escapee has retreated from the horizon a distance comparable to the black hole’s radius. This simple image that Hawking provided for the non-specialist is not backed up by the mathematics. It contains an element of the truth, but take it too seriously and it becomes highly misleading.

That this image isn’t accurate is not a new insight – it’s been known since the late 1970s that Hawking radiation is not produced in the immediate vicinity of the horizon. Already in Birrell and Davies’ textbook it is clearly spelled out that taking the particles from the far vicinity of the black hole and tracing them back to the horizon – thereby increasing (“blueshifting”) their frequency – does not deliver the accurate description in the horizon area. The two parts of the Hawking-pairs blur into each other in the horizon area, and to meaningfully speak of particles one should instead use a different, local, notion of particles. Better even, one should stick to calculating actually observable quantities like the stress-energy tensor.

That the particle pairs are not created in the immediate vicinity of the horizon was necessary to solve a conundrum that bothered physicists back then. The temperature of the black hole radiation is very small, but that is the temperature measured far away from the black hole. For this radiation to have been able to escape to such a distance, it must have started out with an enormous energy close to the black hole horizon. But if such an enormous energy were located there, then an infalling observer should notice it and burn to ashes. This, however, would violate the equivalence principle, according to which the infalling observer shouldn’t notice anything unusual upon crossing the horizon.
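
Just to put a number on how small "very small" is: the standard formula for the asymptotic Hawking temperature, T = ħc³/(8πGMk_B), is not quoted in the post but is textbook material. Here is a minimal sketch evaluating it for a solar-mass black hole (the constants and the choice of mass are my own illustrative inputs, not taken from the post):

```python
import math

# SI constants (CODATA values, rounded)
hbar  = 1.054571817e-34   # reduced Planck constant, J s
c     = 2.99792458e8      # speed of light, m/s
G     = 6.67430e-11       # Newton's constant, m^3 kg^-1 s^-2
k_B   = 1.380649e-23      # Boltzmann constant, J/K
M_sun = 1.989e30          # solar mass, kg

def hawking_temperature(M):
    """Asymptotic temperature of Hawking radiation from a Schwarzschild black hole of mass M."""
    return hbar * c**3 / (8 * math.pi * G * M * k_B)

# ~6e-8 K for a solar-mass black hole: the radiation measured far away is extremely cold.
print(f"T_H = {hawking_temperature(M_sun):.2e} K")
```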

This problem is resolved by taking into account that tracing back the outgoing radiation to the horizon does not give a physically meaningful result. If one instead calculates the stress-energy in the vicinity of the horizon, one finds that it is small and remains small even upon horizon crossing. It is so small that an observer would only be able to tell the difference to flat space on distances comparable to the black hole radius (which is also the curvature scale). Everything fits nicely, and no disagreement with the equivalence principle comes about.

[I know this sounds very similar to the firewall problem that has been discussed more recently but it’s a different issue. The firewall problem comes about because if one requires the outgoing particles to carry information, then the correlation with the ingoing particles gets destroyed. This prevents a suitable cancellation in the near-horizon area. Again however one can criticize this conclusion by complaining that in the original “firewall paper” the stress-energy wasn’t calculated. I don’t think this is the origin of the problem, but other people do.]

The actual reason that black holes emit particles, the one that is backed up by mathematics, is that different observers have different notions of particles.

We are used to a particle either being there or not being there, but this is only the case so long as we move relative to each other at constant velocity. If an observer is accelerated, his definition of what a particle is changes. What looks like empty space for an observer at constant velocity suddenly seems to contain particles for an accelerated observer. This effect, named after Bill Unruh – who discovered it almost simultaneously with Hawking’s finding that black holes emit radiation – is exceedingly tiny for accelerations we experience in daily life, thus we never notice it.
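
To see just how tiny the effect is for everyday accelerations, one can plug numbers into the standard Unruh temperature formula T = ħa/(2πck_B). The formula and the constants below are standard background rather than anything stated in the post; this is a rough sketch only:

```python
import math

hbar = 1.054571817e-34   # reduced Planck constant, J s
c    = 2.99792458e8      # speed of light, m/s
k_B  = 1.380649e-23      # Boltzmann constant, J/K

def unruh_temperature(a):
    """Temperature of the thermal bath seen by a uniformly accelerated observer."""
    return hbar * a / (2 * math.pi * c * k_B)

# For a = 1 g this is of order 4e-20 K, which is why we never notice it in daily life.
print(f"T_Unruh(1 g) = {unruh_temperature(9.81):.1e} K")
```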

The Unruh effect is very closely related to the Hawking effect by which black holes evaporate. Matter that collapses to a black hole creates a dynamical space-time that gives rise to an acceleration between observers in the far past and in the far future. The result is that the space-time around the collapsing matter, which did not contain particles before the black hole was formed, contains thermal radiation in the late stages of collapse. The Hawking radiation that is emitted from the black hole at late times is the very same state that, initially, was just the vacuum surrounding the collapsing matter – the two sets of observers simply attribute different particle content to it.

That, really, is the origin of particle emission from black holes: what is a “particle” depends on the observer. Not quite as simple, but dramatically more accurate.

The image provided by Hawking with the virtual particle pairs close by the horizon has been so stunningly successful that now even some physicists believe it is what really happens. The knowledge that blueshifting the radiation from infinity back to the horizon gives a grossly wrong stress-energy seems to have gotten buried in the literature. Unfortunately, misunderstanding the relation between the flux of Hawking-particles in the far distance and in the vicinity of the black hole leads one to erroneously conclude that the flux is much larger than it is. Getting this relation wrong is for example the reason why Mersini-Houghton came to falsely conclude that black holes don’t exist.

It seems about time someone reminds the community of this. And here comes Steve Giddings.

Steve Giddings is the nonlocal hero of George Musser’s new book “Spooky Action at a Distance.” For the past two decades or so he’s been on a mission to convince his colleagues that nonlocality is necessary to resolve the black hole information loss problem. I spent a year in Santa Barbara a few doors down the corridor from Steve, but I liked his papers better when he was still pursuing the idea that black hole remnants keep the information. Be that as it may, Steve knows black holes inside and out, and he has a new note on the arXiv that discusses the question of where Hawking radiation originates.

In his paper, Steve collects the existing arguments why we know the pairs of the Hawking radiation are not created in the vicinity of the horizon, and he adds some new arguments. He estimates the effective area from which Hawking-radiation is emitted and finds it to be a sphere with a radius considerably larger than the black hole. He also estimates the width of wave-packets of Hawking radiation and shows that it is much larger than the separation of the wave-packet’s center from the horizon. This nicely fits with some earlier work of his that demonstrated that the partner particles do not separate from each other until after they have left the vicinity of the black hole.

All this supports the conclusion that Hawking particles are not created in the near vicinity of the horizon, but instead come from a region surrounding the black hole with a few times the black hole’s radius.

Steve’s paper has an amusing acknowledgement in which he thanks Don Marolf for confirming that some of their colleagues indeed believe that Hawking radiation is created close by the horizon. I can understand this. When I first noticed this misunderstanding I also couldn’t quite believe it. I kept pointing towards Birrell-Davies but nobody was listening. In the end I almost thought I was the one who got it wrong. So, I for sure am very glad about Steve’s paper because now, rather than citing a 40 year old textbook, I can just cite his paper.

If Hawking’s book taught me one thing, it’s that sticky visual metaphors can be a curse as much as they can be a blessing.

Chad Orzel: TEDxAlbany Talk This Thursday, 12/3

I’ve been a little bad about self-promoting here of late, but I should definitely plug this: I’m speaking at the TEDxAlbany event this Thursday, December 3rd; I’m scheduled first, at 9:40 am. The title is “The Exotic Physics of an Ordinary Morning”:

You might think that the bizarre predictions of quantum mechanics and relativity– particles that are also waves, cats that are both alive and dead, clocks that run at different rates depending on how you’re moving– only come into play in physics laboratories or near black holes. In fact, though, even the strangest features of modern physics are essential for everything around us. The mundane process of getting up and getting ready for work relies on surprisingly exotic physics; understanding how this plays out adds an element of wonder to even the most ordinary morning.

(This is slightly inaccurate, as I was originally planning to get some relativity into this, but that ended up being awkward. So the final version is all quantum.)

I believe they’ve streamed the talks in past years, though I don’t have any information on that at the moment. They have video of all the past speakers, so I assume the same will be true this year; once it’s up, you can be sure I’ll point to it.

Anyway, that’s the Big Thing I’m stressing out about this week… If I get useful information on how to stream the video, I’ll share it; if you’re in the Albany area, and free that day, they may still have tickets if you want to check it out live…

Doug Natelson: Various items - solids, explanations, education reform, and hiring for impact

I'm behind in a lot of writing of various flavors right now, but here are some items of interest:

  • Vassily Lubchenko across town at the University of Houston has written a mammoth review article for Adv. Phys. about how to think about glasses, the glass transition, and crystals.  It includes a real discussion of mechanical rigidity as a comparatively universal property - basically "why are solids solid?".  
  • Randall Munroe of xkcd fame has come out with another book, Thing Explainer, in which he tackles a huge array of difficult science and technology ideas and concepts using only the 1000 most common English words.  For a sample, he has an article in this style in The New Yorker in honor of the 100th anniversary of general relativity.
  • There was an editorial in the Washington Post on Sunday talking about how to stem the ever-rising costs of US university education.  This is a real problem, though I'm concerned that some of the authors' suggestions don't connect to the real world (e.g., if you want online courses to function in a serious, high quality way, that still requires skilled labor, and that labor isn't free).
  • Much university hiring is incremental, and therefore doesn't "move the needle" much in terms of departmental rankings, reputation, or resources.  There are rare exceptions.  Four years ago the University of Chicago launched their Institute for Molecular Engineering, with the intent of creating something like 35 new faculty lines over 15 years.  Now Princeton has announced that they are going to hire 10 new faculty lines in computer science.  That will increase the size of that one department from 32 to 42 tenure/tenure-track faculty.   Wow.

Chad Orzel: 091/366: Turkey Talk

Not an incredibly artistic or innovative bit of photography, here, but I spent a good chunk of the day taking The Pip for his annual physical, so the only pictures I took were of SteelyKid’s schoolwork:

SteelyKid’s second-grade assignment to imagine and illustrate a conversation with a turkey.

This assignment asked her to draw herself asking a question that a turkey would then answer, and then the turkey asking her one in return. Her writing is more enthusiastic than accurate as far as spelling is concerned, but she has a good imagination. In case you can’t make it out (spelling corrected):

SteelyKid: Are turkeys fast runners?
Turkey: No, they are too fat and heavy so they do not run fast.

Turkey: What happens when you get full?
SteelyKid: We get sick and throw up on the ground.

And there’s this week’s bit of insight into the second-grade mindset…

David Hogg: deprojecting Gaussians

In a day crushed by health issues, I worked from my bed! I wrote up a new baby problem for my cryo-EM and deprojecting-galaxies projects: Can you infer the variance tensor for a 3-d Gaussian density blob if you only get to observe noisy 2-d projections, and you don't know any of the projection angles or offsets? Obviously the answer is yes, but the real question is how close to fully principled inference can you get, tractably, and for realistic data sets? I wrote the problem statement and half the solution (in text form); if I am confined to my bed for the rest of the week, maybe I will get to write the code too.

I also had a conversation with Marina Spivak (SCDA) about likelihoods, posteriors, marginalization, and optimization for cryo-EM, and a conversation with Ness about chemical tagging with The Cannon.

Steinn Sigurðsson: A New Kepler Orrery

Ethan Kruse has updated the Kepler Orrery, just in time for Extreme Solar Systems III now under way.


November 30, 2015

Scott Aaronson: Ordinary Words Will Do

Izabella Laba, a noted mathematician at the University of British Columbia, recently posted some tweets that used me as a bad, cautionary example for how “STEM faculty should be less contemptuous of social sciences.”  Here was the offending comment of mine, from the epic Walter Lewin thread last fall:

[W]hy not dispense with the empirically-empty notion of “privilege,” and just talk directly about the actual well-being of actual people, or groups of people?  If men are doing horrific things to women—for example, lashing them for driving cars, like in Saudi Arabia—then surely we can just say so in plain language.  Stipulating that the torturers are “exercising their male privilege” with every lash adds nothing to anyone’s understanding of the evil.  It’s bad writing.  More broadly, it seems to me that the entire apparatus of “privilege,” “delegitimation,” etc. etc. can simply be tossed overboard, to rust on the ocean floor alongside dialectical materialism and other theoretical superstructures that were once pompously insisted upon as preconditions of enlightened social discourse.  This isn’t quantum field theory.  Ordinary words will do.

Prof. Laba derisively commented:

Might as well ask you to explain calculus without using fancy words like “derivative” or “continuous.”  Simple number arithmetic will do.

Prof. Laba’s tweets were favorited by Jordan Ellenberg, a mathematician who wrote the excellent popular book How Not to Be Wrong.  (Ellenberg had also criticized me last year for my strange, naïve idea that human relations can be thought about using logic.)

Given my respect for the critics, I guess I’m honor-bound to respond.

For the record, I tend not to think about the social sciences—or for that matter, the natural sciences—as monolithic entities at all.  I admire any honest attempt to discover the truth about anything.  And not being a postmodern relativist, I believe there are deep truths worth discovering in history, psychology, economics, linguistics, possibly even sociology.  Reading the books of Steven Pinker underscored for me how much is actually understood nowadays about human nature—much of it only figured out within the last half-century.  Likewise, reading the voluminous profundities of Scott Alexander taught me that even in psychiatry, there are truths (and even a few definite cures) to be had for those who seek.

I also believe that the social sciences are harder—way harder—than math or physics or CS.  They’re harder because of the tenuousness of the correlations, because of the complexity of each individual human brain (let alone 7 billion of them on the same planet), but most of all, because politics and ideology and the scientist’s own biases place such powerful thumbs on the scale.  This makes it all the more impressive when a social scientist, like (say) Stanley Milgram or Judith Rich Harris or Napoleon Chagnon, teaches the world something important and new.

I will confess to contempt for anything that I regard as pompous obscurantism—for self-referential systems of jargon whose main purposes are to bar outsiders, to mask a lack of actual understanding, and to confer power on certain favored groups.  And I regard the need to be alert to such systems, to nip them in the bud before they grow into Lysenkoism, as in some sense the problem of intellectual life.  Which brings me to the most fundamental asymmetry between the hard and soft sciences.  Namely, the very fact that it’s so much harder to nurture new truths to maturity in the social sciences than it is in math or physics, means that in the former, the jargon-weeds have an easier time filling the void—and we know they’ve done it again and again, even in the post-Enlightenment West.

Time for a thought experiment.  Suppose you showed up at a university anytime between, let’s say, 1910 and 1970, and went from department to department asking (in so many words): what are you excited about this century?  Where are your new continents, what’s the future of your field?  Who should I read to learn about that future?

In physics, the consensus answer would’ve been something like: Planck, Einstein, Bohr, Schrödinger, Dirac.

In psychology, it would’ve been: Freud and Jung (with another faction for B. F. Skinner).

In politics and social sciences, over an enormous swath of academia (including in the West), it would’ve been: Marx, Engels, Trotsky, Lenin.

With hindsight, we now know that the physics advice would’ve been absolute perfection, the psychology and politics advice an unmitigated disaster.  Yes, physicists today know more than Einstein, can even correct him on some points, but the continents he revealed to us actually existed—indeed, have only become more important since Einstein’s time.

But Marx and Freud?  You would’ve done better to leave the campus, and ask a random person on the street what she or he thought about economics and psychology.  In high school, I remember cringing through a unit on the 1920s, when we learned about how “two European professors upset a war-weary civilization’s established certainties—with Einstein overturning received wisdom about space and time, and Freud doing just the same for the world of the mind.”  It was never thought important to add that Einstein’s theories turned out to be true while Freud’s turned out to be false.  Still, at least Freud’s ideas led “only” to decades of bad psychology and hundreds of innocent people sent to jail because of testimony procured through hypnosis, rather than to tens of millions of dead, as with the other social-scientific theory that reigned supreme among 20th-century academics.

Marx and Freud built impressive intellectual edifices—sufficiently impressive for a large fraction of intellectuals to have accepted those men as gurus on par with Darwin and Einstein for almost a century.  Yet on nearly every topic they wrote about, we now know that Marx and Freud couldn’t have been any more catastrophically wrong.  Moreover, their wrongness was knowable at the time—and was known to many, though the ones who knew were typically the ones who the intellectual leaders sneered at, as deluded reactionaries.

Which raises a question: suppose that, in the 1920s, I’d taken the social experts’ advice to study Marx and Freud, didn’t understand much of what they said (and found nonsensical much of what I did understand), and eventually rejected them as pretentious charlatans.  Then why wouldn’t I have been just like Prof. Laba’s ignorant rube, who dismisses calculus because he doesn’t understand technical terms like “continuous” and “derivative”?

On reflection, I don’t think that the two cases are comparable at all.

The hard sciences need technical vocabularies for a simple reason: because they’re about things that normal people don’t spend their hours obsessively worrying about.  Yes, I’d have a hard time understanding organic chemists or differential geometers, but largely for the same reasons I’d have a hard time understanding football fans or pirates.  It’s not just that I don’t understand the arguments; it’s that the arguments are about a world that’s alien to me (and that, to be honest, I don’t care about as much as I do my world).

Suppose, by contrast, that you’re writing about the topics everyone spends their time obsessively worrying about: politics, society, the human mind, the relations between the races and sexes.  In other words, suppose you’re writing about precisely the topics for which the ordinary English language has been honed over centuries—for which Shakespeare and Twain and Dr. King and so many others deployed the language to such spectacular effect.  In that case, what excuse could you possibly have to write in academese, to pepper your prose with undefined in-group neologisms?

Well, let’s be charitable; maybe you have a reason.  For example, maybe you’re doing a complicated meta-analysis of psychology papers, so you need to talk about r-values and kurtosis and heteroskedasticity.  Or maybe you’re putting people in an fMRI machine while you ask them questions, so you need to talk about the temporal resolution in the anterior cingulate cortex.  Or maybe you’re analyzing sibling rivalries using game theory, so you need Nash equilibria.  Or you’re picking apart sentences using Chomskyan formal grammar.  In all these cases, armchair language doesn’t suffice because you’re not just sitting in your armchair: you’re using a new tool to examine the everyday from a different perspective.  For present purposes, you might as well be doing algebraic geometry.

The Freudians and Marxists would, of course, claim that they’re doing the exact same thing.  Yes, they’d say, you thought you had the words to discuss your own mind or the power structure of society, but really you didn’t, because you lacked the revolutionary theoretical framework that we now possess.  (Trotsky’s writings are suffused with this brand of arrogance in nearly every sentence: for example, when he ridicules the bourgeois liberals who whine about “human rights violations” in the early USSR, yet who are too dense to phrase their objections within the framework of dialectical materialism.)

I submit that, even without the hindsight of 2015, there would’ve been excellent reasons to be skeptical of these claims.  Has it ever happened, you might ask yourself, that someone sat in their study and mused about the same human questions that occupied Plato and Shakespeare and Hume, in the same human way they did, and then came up with a new, scientific conclusion that was as rigorous and secure as relativity or evolution?

Let me know if I missed something, but I can’t think of a single example.  Sure, it seems to me, there have been geniuses of human nature, who enlarged our vision without any recourse to the quantitative methods of science.  But even those geniuses “only” contributed melodies for other geniuses to answer in counterpoint, rather than stones for everyone who came later to build upon.  Also, the geniuses usually wrote well.

Am I claiming that progress is impossible in the social realm?  Not at all.  The emancipation of slaves, the end of dueling and blasphemy laws and the divine right of kings, women’s suffrage and participation in the workforce, gay marriage—all these strike me as crystal-clear examples of moral progress, as advances that will still be considered progress a thousand years from now, if there’s anyone around then to discuss such things.  Evolutionary psychology, heuristics and biases, reciprocal altruism, and countless other developments likewise strike me as intellectual progress within the sciences of human nature.  But none of these advances needed recondite language!  Ordinary words sufficed for Thomas Paine and Frederick Douglass and John Stuart Mill, as they sufficed for Robert Axelrod and for Kahneman and Tversky.  So forgive me for thinking that whatever is true and important in the social world today, should likewise be defensible to every smart person in ordinary words, and that this represents a genuine difference between the social sciences and physics.

Which brings us to the central point that Prof. Laba disputed in that comment of mine.  I believe there are countless moral heroes in our time, as well as social scientists who struggle heroically to get the right answers.  But as far as I can tell, the people who build complex intellectual edifices around words like “privilege” and “delegitimation” and “entitlement” and “marginalized” are very much the same sort of people who, a few generations ago, built similar edifices around “bourgeoisie” and “dialectical” and “false consciousness.”  In both cases, there’s an impressive body of theory that’s held up as the equivalent in its domain of relativity, quantum mechanics, and Darwinism, with any skeptics denounced as science-deniers.  In both cases, enlightened liberals are tempted to side with the theorists, since the theorists believe in so many of the same causes that the enlightened liberals believe in, and hate so many of the same people who the enlightened liberals hate.  But in both cases, the theorists’ language seems to alternate between incomprehensible word-salad and fervid, often profanity-laced denunciations, skipping entirely over calm clarity.  And in both cases, the only thing that the impressive theoretical edifice ever seems to get used for, is to prove over and over that certain favored groups should get more power while disfavored ones should get less.

So I’m led to the view that, if you want to rouse people’s anger about injustice or their pity about preventable suffering, or end arbitrary discrimination codified into law, or give individuals more freedom to pursue their own happiness, or come up with a new insight about human nature, or simply describe the human realities that you see around you—for all these purposes, the words that sufficed for every previous generation’s great humanists will also suffice for you.

On the other hand, to restrict freedom and invent new forms of discrimination—and to do it in the name of equality and justice—that takes theory.  You’ll need a sophisticated framework, for example, to prove that even if two adults both insist they’re consenting to a relationship, really they might not be, because of power structures in the wider society that your superior insight lets you see.  You’ll need advanced discourse to assure you that, even though your gut reaction might be horror at (say) someone who misspoke once and then had their life gleefully destroyed on social media, your gut is not to be trusted, because it’s poisoned by the same imperialist, patriarchal biases as everything else—and because what looks like a cruel lynching needs to be understood in a broader social context (did the victim belong to a dominant group, or to a marginalized one?).  Finally, you’ll need oodles of theory (bring out the Marcuse) to explain why the neoliberal fanaticism about “free speech” and “tolerance” and “due process” and “the presumption of innocence” is too abstract and simplistic—for those concepts, too, fail to distinguish between a marginalized group that deserves society’s protection and a dominant group that doesn’t.

So I concede to Prof. Laba that the complicated discourse of privilege, hegemony, etc. serves a definite purpose for the people who wield it, just as much as the complicated discourse of quantum field theory serves a purpose for physicists.  It’s just that the purposes of the privilege-warriors aren’t my purposes.  For my purposes—which include fighting injustice, advancing every social and natural science as quickly as possible, and helping all well-meaning women and men see each other’s common humanity—I said last year and I say again that ordinary words will do.

Update (Oct. 26): Izabella Laba has written a response to this post, for which I’m extremely grateful. Her reply reveals that she and I have a great deal of common ground, and also a few clear areas of disagreement (e.g., what’s wrong with Steven Pinker?). But my most important objection is simply that, the first time I loaded her blog, the text went directly over the rock image in the background, making it impossible to read without highlighting it.

Scott Aaronson: Talk, be merry, and be rational

Yesterday I wrote a statement on behalf of a Scott Alexander SlateStarCodex/rationalist meetup, which happened last night at MIT (in the same room where I teach my graduate class), and which I’d really wanted to attend but couldn’t.  I figured I’d share the statement here:

I had been looking forward to attending tonight’s MIT SlateStarCodex meetup as I hardly ever look forward to anything. Alas, I’m now stuck in Chicago, with my flight cancelled due to snow, and with all flights for the next day booked up. But instead of continuing to be depressed about it, I’ve decided to be happy that this meetup is even happening at all—that there’s a community of people who can read, let’s say, a hypothetical debate moderator questioning Ben Carson about what it’s like to be a severed half-brain, and simply be amused, instead of silently trying to figure out who benefits from the post and which tribe the writer belongs to. (And yes, I know: the answer is the gray tribe.) And you can find this community anywhere—even in Cambridge, Massachusetts! Look, I spend a lot of time online, just getting more and more upset reading social justice debates that are full of people calling each other douchebags without even being able to state anything in the same galactic supercluster as the other side’s case. And then what gives me hope for humanity is to click over to the slatestarcodex tab, and to see all the hundreds of comments (way more than my blog gets) by people who disagree with each other but who all basically get it, who all have minds that don’t make me despair. And to realize that, when Scott Alexander calls an SSC meetup, he can fill a room just about anywhere … well, at least anywhere I would visit. So talk, be merry, and be rational.

I’m now back in town, and told by people who attended the meetup that it was crowded, disorganized, and great.  And now I’m off to Harvard, to attend the other Scott A.’s talk “How To Ruin A Perfectly Good Randomized Controlled Trial.”

Update (Nov. 24) Scott Alexander’s talk at Harvard last night was one of the finest talks I’ve ever attended. He was introduced to rapturous applause as simply “the best blogger on the Internet,” and as finally an important speaker, in a talk series that had previously wasted everyone’s time with the likes of Steven Pinker and Peter Singer. (Scott demurred that his most notable accomplishment in life was giving the talk at Harvard that he was just now giving.) The actual content, as Scott warned from the outset, was “just” a small subset of a basic statistics course, but Scott brought each point alive with numerous recent examples, from psychiatry, pharmacology, and social sciences, where bad statistics or misinterpretations of statistics were accepted by nearly everyone and used to set policy. (E.g., Alcoholics Anonymous groups that claimed an “over 95%” success rate, because the people who relapsed were kicked out partway through and not counted toward the total.) Most impressively, Scott leapt immediately into the meat, ended after 20 minutes, and then spent the next two hours just taking questions. Scott is publicity-shy, but I hope for others’ sake that video of the talk will eventually make its way online.

Then, after the talk, I had the honor of meeting two fellow Boston-area rationalist bloggers, Kate Donovan and Jesse Galef. Yes, I said “fellow”: for almost a decade, I’ve considered myself on the fringes of the “rationalist movement.” I’d hang out a lot with skeptic/effective-altruist/transhumanist/LessWrong/OvercomingBias people (who are increasingly now SlateStarCodex people), read their blogs, listen and respond to their arguments, answer their CS theory questions. But I was always vaguely uncomfortable identifying myself with any group that even seemed to define itself by how rational it was compared to everyone else (even if the rationalists constantly qualified their self-designation with “aspiring”!). Also, my rationalist friends seemed overly interested in questions like how to prevent malevolent AIs from taking over the world, which I tend to think we lack the tools to make much progress on right now (though, like with many other remote possibilities, I’m happy for some people to work on them and see if they find anything interesting).

So, what changed? Well, in the debates about social justice, public shaming, etc. that have swept across the Internet these past few years, it seems to me that my rationalist friends have proven themselves able to weigh opposing arguments, examine their own shortcomings, resist groupthink and hysteria from both sides, and attack ideas rather than people, in a way that the wider society—and most depressingly to me, the “enlightened, liberal” part of society—has often failed. In a real-world test (“real-world,” in this context, meaning social media…), the rationalists have walked the walk and rationaled the rational, and thus they’ve given me no choice but to stand up and be counted as one of them.

Have a great Thanksgiving, those of you in the US!

Another Update: Dana, Lily, and I had the honor of having Scott Alexander over for dinner tonight. I found this genius of human nature, who took so much flak last year for defending me, to be completely uninterested in discussing anything related to social justice or online shaming. Instead, his gaze was fixed on the eternal: he just wanted to grill me all evening about physics and math and epistemology. Having recently read this Nature News article by Ron Cowen, he kept asking me things like: “you say that in quantum gravity, spacetime itself is supposed to dissolve into some sort of network of qubits. Well then, how does each qubit know which other qubits it’s supposed to be connected to? Are there additional qubits to specify the connectivity pattern? If so, then doesn’t that cause an infinite regress?” I handwaved something about AdS/CFT, where a dynamic spacetime is supposed to emerge from an ordinary quantum theory on a fixed background specified in advance. But I added that, in some sense, he had rediscovered the whole problem of quantum gravity that’s confused everyone for almost a century: if quantum mechanics presupposes a causal structure on the qubits or whatever other objects it talks about, then how do you write down a quantum theory of the causal structures themselves?

I’m sure there’s a lesson in here somewhere about what I should spend my time on.

Terence Tao: 275A, Notes 4: The central limit theorem

Let {X_1,X_2,\dots} be iid copies of an absolutely integrable real scalar random variable {X}, and form the partial sums {S_n := X_1 + \dots + X_n}. As we saw in the last set of notes, the law of large numbers ensures that the empirical averages {S_n/n} converge (both in probability and almost surely) to a deterministic limit, namely the mean {\mu= {\bf E} X} of the reference variable {X}. Furthermore, under some additional moment hypotheses on the underlying variable {X}, we can obtain square root cancellation for the fluctuation {\frac{S_n}{n} - \mu} of the empirical average from the mean. To simplify the calculations, let us first restrict to the case {\mu=0, \sigma^2=1} of mean zero and variance one, thus

\displaystyle  {\bf E} X = 0


\displaystyle  {\bf Var}(X) = {\bf E} X^2 = 1.

Then, as computed in previous notes, the normalised fluctuation {S_n/\sqrt{n}} also has mean zero and variance one:

\displaystyle  {\bf E} \frac{S_n}{\sqrt{n}} = 0

\displaystyle  {\bf Var}(\frac{S_n}{\sqrt{n}}) = {\bf E} (\frac{S_n}{\sqrt{n}})^2 = 1.

This and Chebyshev’s inequality already indicate that the “typical” size of {S_n} is {O(\sqrt{n})}, thus for instance {\frac{S_n}{\sqrt{n} \omega(n)}} goes to zero in probability for any {\omega(n)} that goes to infinity as {n \rightarrow \infty}. If we also have a finite fourth moment {{\bf E} |X|^4 < \infty}, then the calculations of the previous notes also give a fourth moment estimate

\displaystyle  {\bf E} (\frac{S_n}{\sqrt{n}})^4 = 3 + O( \frac{{\bf E} |X|^4}{n} ).

From this and the Paley-Zygmund inequality (Exercise 42 of Notes 1) we also get some lower bound for {\frac{S_n}{\sqrt{n}}} of the form

\displaystyle  {\bf P}( |\frac{S_n}{\sqrt{n}}| \geq \varepsilon ) \geq \varepsilon

for some absolute constant {\varepsilon>0} and for {n} sufficiently large; this indicates in particular that {\frac{S_n \omega(n)}{\sqrt{n}}} does not converge in any reasonable sense to something finite for any {\omega(n)} that goes to infinity.
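
These moment computations are easy to check numerically. The sketch below uses iid Rademacher ±1 variables (my own illustrative choice of {X} with mean zero, variance one, and finite fourth moment) together with the shortcut {S_n = 2 \mathrm{Binomial}(n,1/2) - n}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100_000, 1_000_000

# iid Rademacher +-1 steps: E X = 0, Var X = 1, E|X|^4 = 1 < infinity.
# For +-1 steps, S_n = 2*Binomial(n, 1/2) - n, so we never store n*trials samples.
S_n = 2.0 * rng.binomial(n, 0.5, size=trials) - n
Z = S_n / np.sqrt(n)                     # the normalised fluctuation S_n / sqrt(n)

print("mean        ", Z.mean())          # ~ 0
print("variance    ", Z.var())           # ~ 1
print("4th moment  ", (Z**4).mean())     # ~ 3 + O(E|X|^4 / n), i.e. close to 3
```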

The question remains as to what happens to the ratio {S_n/\sqrt{n}} itself, without multiplying or dividing by any factor {\omega(n)}. A first guess would be that these ratios converge in probability or almost surely, but this is unfortunately not the case:

Proposition 1 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable real scalar random variable {X} with mean zero, variance one, and finite fourth moment, and write {S_n := X_1 + \dots + X_n}. Then the random variables {S_n/\sqrt{n}} do not converge in probability or almost surely to any limit, and neither does any subsequence of these random variables.

Proof: Suppose for contradiction that some sequence {S_{n_j}/\sqrt{n_j}} converged in probability or almost surely to a limit {Y}. By passing to a further subsequence we may assume that the convergence is in the almost sure sense. Since all of the {S_{n_j}/\sqrt{n_j}} have mean zero, variance one, and bounded fourth moment, Theorem 24 of Notes 1 implies that the limit {Y} also has mean zero and variance one. On the other hand, {Y} is a tail random variable (modifying any finite number of the {X_i} changes {S_{n_j}} only by a bounded amount, which is negligible after dividing by {\sqrt{n_j}}) and is thus almost surely constant by the Kolmogorov zero-one law from Notes 3. Since constants have variance zero, we obtain the required contradiction. \Box
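
The failure of convergence is also visible in simulation. The sketch below (again with ±1 steps, an arbitrary choice) checks that {S_{4n}/\sqrt{4n} - S_n/\sqrt{n}} has variance one for every {n}, so the sequence does not even become Cauchy in probability:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50_000, 200_000

# +-1 steps via the binomial shortcut; T = S_{4n} - S_n is independent of S_n.
S_n = 2.0 * rng.binomial(n, 0.5, size=trials) - n
T   = 2.0 * rng.binomial(3 * n, 0.5, size=trials) - 3 * n
diff = (S_n + T) / np.sqrt(4 * n) - S_n / np.sqrt(n)

print("Var(S_{4n}/sqrt(4n) - S_n/sqrt(n)) =", diff.var())                   # ~ 1, for every n
print("P(|difference| > 0.5)              =", (np.abs(diff) > 0.5).mean())  # ~ 0.6, does not -> 0
```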

Nevertheless there is an important limit for the ratio {S_n/\sqrt{n}}, which requires one to replace the notions of convergence in probability or almost sure convergence by the weaker concept of convergence in distribution.

Definition 2 (Vague convergence and convergence in distribution) Let {R} be a locally compact Hausdorff topological space with the Borel {\sigma}-algebra. A sequence of finite measures {\mu_n} on {R} is said to converge vaguely to another finite measure {\mu} if one has

\displaystyle  \int_R G(x)\ d\mu_n(x) \rightarrow \int_R G(x)\ d\mu(x)

as {n \rightarrow \infty} for all continuous compactly supported functions {G: R \rightarrow {\bf R}}. (Vague convergence is also known as weak convergence, although strictly speaking the terminology weak-* convergence would be more accurate.) A sequence of random variables {X_n} taking values in {R} is said to converge in distribution (or converge weakly or converge in law) to another random variable {X} if the distributions {\mu_{X_n}} converge vaguely to the distribution {\mu_X}, or equivalently if

\displaystyle  {\bf E}G(X_n) \rightarrow {\bf E} G(X)

as {n \rightarrow \infty} for all continuous compactly supported functions {G: R \rightarrow {\bf R}}.

One could in principle try to extend this definition beyond the locally compact Hausdorff setting, but certain pathologies can occur when doing so (e.g. failure of the Riesz representation theorem), and we will never need to consider vague convergence in spaces that are not locally compact Hausdorff, so we restrict to this setting for simplicity.

Note that the notion of convergence in distribution depends only on the distribution of the random variables involved. One consequence of this is that convergence in distribution does not produce unique limits: if {X_n} converges in distribution to {X}, and {Y} has the same distribution as {X}, then {X_n} also converges in distribution to {Y}. However, limits are unique up to equivalence in distribution (this is a consequence of the Riesz representation theorem, discussed for instance in this blog post). As a consequence of the insensitivity of convergence in distribution to equivalence in distribution, we may also legitimately talk about convergence of distribution of a sequence of random variables {X_n} to another random variable {X} even when all the random variables {X_1,X_2,\dots} and {X} involved are being modeled by different probability spaces (e.g. each {X_n} is modeled by {\Omega_n}, and {X} is modeled by {\Omega}, with no coupling presumed between these spaces). This is in contrast to the stronger notions of convergence in probability or almost sure convergence, which require all the random variables to be modeled by a common probability space. Also, by an abuse of notation, we can say that a sequence {X_n} of random variables converges in distribution to a probability measure {\mu}, when {\mu_{X_n}} converges vaguely to {\mu}. Thus we can talk about a sequence of random variables converging in distribution to a uniform distribution, a gaussian distribution, etc..

From the dominated convergence theorem (available for both convergence in probability and almost sure convergence) we see that convergence in probability or almost sure convergence implies convergence in distribution. The converse is not true, due to the insensitivity of convergence in distribution to equivalence in distribution; for instance, if {X_1,X_2,\dots} are iid copies of a non-deterministic scalar random variable {X}, then the {X_n} trivially converge in distribution to {X}, but will not converge in probability or almost surely (as one can see from the zero-one law). However, there are some partial converses that relate convergence in distribution to convergence in probability; see Exercise 10 below.

Remark 3 The notion of convergence in distribution is somewhat similar to the notion of convergence in the sense of distributions that arises in distribution theory (discussed for instance in this previous blog post), however strictly speaking the two notions of convergence are distinct and should not be confused with each other, despite the very similar names.

The notion of convergence in distribution simplifies in the case of real scalar random variables:

Proposition 4 Let {X_1,X_2,\dots} be a sequence of scalar random variables, and let {X} be another scalar random variable. Then the following are equivalent:

  • (i) {X_n} converges in distribution to {X}.
  • (ii) {F_{X_n}(t)} converges to {F_X(t)} for each continuity point {t} of {F_X} (i.e. for all real numbers {t \in {\bf R}} at which {F_X} is continuous). Here {F_X(t) := {\bf P}(X \leq t)} is the cumulative distribution function of {X}.

Proof: First suppose that {X_n} converges in distribution to {X}, and {F_X} is continuous at {t}. For any {\varepsilon > 0}, one can find a {\delta} such that

\displaystyle  F_X(t) - \varepsilon \leq F_X(t') \leq F_X(t) + \varepsilon

for every {t' \in [t-\delta,t+\delta]}. One can also find an {N} larger than {|t|+\delta} such that {F_X(-N) \leq \varepsilon} and {F_X(N) \geq 1-\varepsilon}. Thus

\displaystyle  {\bf P} (|X| \geq N ) = O(\varepsilon)


\displaystyle  {\bf P} (|X - t| \leq \delta ) = O(\varepsilon).

Let {G: {\bf R} \rightarrow [0,1]} be a continuous function supported on {[-2N, t]} that equals {1} on {[-N, t-\delta]}. Then by the above discussion we have

\displaystyle  {\bf E} G(X) = F_X(t) + O(\varepsilon)

and hence

\displaystyle  {\bf E} G(X_n) = F_X(t) + O(\varepsilon)

for large enough {n}. In particular

\displaystyle  {\bf P}( X_n \leq t ) \geq F_X(t) - O(\varepsilon).

A similar argument, replacing {G} with a continuous function supported on {[t,2N]} that equals {1} on {[t+\delta,N]} gives

\displaystyle  {\bf P}( X_n > t ) \geq 1 - F_X(t) - O(\varepsilon)

for {n} large enough. Putting the two estimates together gives

\displaystyle  F_{X_n}(t) = F_X(t) + O(\varepsilon)

for {n} large enough; sending {\varepsilon \rightarrow 0}, we obtain the claim.

Conversely, suppose that {F_{X_n}(t)} converges to {F_X(t)} at every continuity point {t} of {F_X}. Let {G: {\bf R} \rightarrow {\bf R}} be a continuous compactly supported function; then it is uniformly continuous. As {F_X} is monotone increasing, it can only have countably many points of discontinuity. From these two facts one can find, for any {\varepsilon>0}, a simple function {G_\varepsilon(t) = \sum_{i=1}^m c_i 1_{(t_i,t_{i+1}]}} for some {t_1 < \dots < t_{m+1}} that are points of continuity of {F_X}, and real numbers {c_i}, such that {|G(t) - G_\varepsilon(t)| \leq \varepsilon} for all {t}. Thus

\displaystyle  {\bf E} G(X_n) = {\bf E} G_\varepsilon(X_n) + O(\varepsilon)

\displaystyle  = \sum_{i=1}^m c_i(F_{X_n}(t_{i+1}) - F_{X_n}(t_i)) + O(\varepsilon).

Similarly for {X_n} replaced by {X}. Subtracting and taking limit superior, we conclude that

\displaystyle  \limsup_{n \rightarrow \infty} |{\bf E} G(X_n) - {\bf E} G(X)| = O(\varepsilon),

and on sending {\varepsilon \rightarrow 0}, we obtain that {X_n} converges in distribution to {X} as claimed. \Box

The restriction to continuity points {t} of {F_X} is necessary. Consider for instance the deterministic random variables {X_n = 1/n}: then {X_n} converges almost surely (and hence in distribution) to {0}, but {F_{X_n}(0) = 0} does not converge to {F_X(0)=1}.

Example 5 For any natural number {n}, let {X_n} be a discrete random variable drawn uniformly from the finite set {\{0/n, 1/n, \dots, (n-1)/n\}}, and let {X} be the continuous random variable drawn uniformly from {[0,1]}. Then {X_n} converges in distribution to {X}. Thus we see that a continuous random variable can emerge as the limit of discrete random variables.

Example 6 For any natural number {n}, let {X_n} be a continuous random variable drawn uniformly from {[0,1/n]}, then {X_n} converges in distribution to the deterministic real number {0}. Thus we see that discrete (or even deterministic) random variables can emerge as the limit of continuous random variables.
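
Examples 5 and 6 can be checked directly against criterion (ii) of Proposition 4; the sketch below (grid sizes and test points are arbitrary choices) evaluates the exact cumulative distribution functions rather than sampling:

```python
import numpy as np

def F_discrete_uniform(t, n):
    """CDF of X_n uniform on {0/n, 1/n, ..., (n-1)/n} (Example 5)."""
    # counts the grid points k/n with k/n <= t, for k = 0, ..., n-1
    return np.clip(np.floor(n * np.asarray(t)) + 1, 0, n) / n

ts = np.array([0.1, 0.25, 0.5, 0.9])     # continuity points of the limit F_X(t) = t on [0,1]
for n in (10, 100, 10_000):
    print(n, F_discrete_uniform(ts, n))  # -> converges to ts as n grows

# Example 6: X_n uniform on [0, 1/n].  At any continuity point t > 0 of the limiting CDF,
# F_{X_n}(t) = min(n*t, 1) -> 1 = F_0(t), while at the discontinuity t = 0 one has
# F_{X_n}(0) = 0, which does not converge to F_0(0) = 1 (cf. the remark before Example 5).
```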

Exercise 7 (Portmanteau theorem) Show that the properties (i) and (ii) in Proposition 4 are also equivalent to the following three statements:

  • (iii) One has {\limsup_{n \rightarrow \infty} {\bf P}( X_n \in K ) \leq {\bf P}(X \in K)} for all closed sets {K \subset {\bf R}}.
  • (iv) One has {\liminf_{n \rightarrow \infty} {\bf P}( X_n \in U ) \geq {\bf P}(X \in U)} for all open sets {U \subset {\bf R}}.
  • (v) For any Borel set {E \subset {\bf R}} whose topological boundary {\partial E} is such that {{\bf P}(X \in \partial E) = 0}, one has {\lim_{n \rightarrow \infty} {\bf P}(X_n \in E) = {\bf P}(X \in E)}.

(Note: to prove this theorem, you may wish to invoke Urysohn’s lemma. To deduce (iii) from (i), you may wish to start with the case of compact {K}.)

We can now state the famous central limit theorem:

Theorem 8 (Central limit theorem) Let {X_1,X_2,\dots} be iid copies of a scalar random variable {X} of finite mean {\mu := {\bf E} X} and finite non-zero variance {\sigma^2 := {\bf Var}(X)}. Let {S_n := X_1 + \dots + X_n}. Then the random variables {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} converge in distribution to a random variable with the standard normal distribution {N(0,1)} (that is to say, a random variable with probability density function {x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}}). Thus, by abuse of notation

\displaystyle  \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \rightarrow N(0,1).

In the normalised case {\mu=0, \sigma^2=1} when {X} has mean zero and unit variance, this simplifies to

\displaystyle  \frac{S_n}{\sqrt{n}} \rightarrow N(0,1).

Using Proposition 4 (and the fact that the cumulative distribution function associated to {N(0,1)} is continuous), the central limit theorem is equivalent to asserting that

\displaystyle  {\bf P}( \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq t ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx

as {n \rightarrow \infty} for any {t \in {\bf R}}, or equivalently that

\displaystyle  {\bf P}( a \leq \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx.

Informally, one can think of the central limit theorem as asserting that {S_n} approximately behaves like it has distribution {N( n \mu, n \sigma^2 )} for large {n}, where {N(\mu,\sigma^2)} is the normal distribution with mean {\mu} and variance {\sigma^2}, that is to say the distribution with probability density function {x \mapsto \frac{1}{\sqrt{2\pi} \sigma} e^{-(x-\mu)^2/2\sigma^2}}. The integrals {\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx} can be written in terms of the error function {\hbox{erf}} as {\frac{1}{2} + \frac{1}{2} \hbox{erf}(t/\sqrt{2})}.
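
As a sanity check, one can compare the empirical distribution of the normalised sums against the limiting normal CDF. The sketch below uses Exp(1) summands (an arbitrary skewed choice with {\mu = \sigma = 1}) and the erf expression from the preceding paragraph; the sample sizes are illustrative only:

```python
import numpy as np
from math import erf, sqrt

def Phi(t):
    """Standard normal CDF, 1/2 + 1/2 erf(t/sqrt(2))."""
    return 0.5 + 0.5 * erf(t / sqrt(2))

rng = np.random.default_rng(2)
n, trials = 500, 20_000
mu, sigma = 1.0, 1.0                         # Exp(1) has mean 1 and variance 1

X = rng.exponential(scale=1.0, size=(trials, n))
Z = sqrt(n) / sigma * (X.mean(axis=1) - mu)  # sqrt(n)/sigma * (S_n/n - mu)

for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, round((Z <= t).mean(), 3), round(Phi(t), 3))   # empirical vs. limiting CDF, close for large n
```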

The central limit theorem is a basic example of the universality phenomenon in probability – many statistics involving a large system of many independent (or weakly dependent) variables (such as the normalised sums {\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-\mu)}) end up having a universal asymptotic limit (in this case, the normal distribution), regardless of the precise makeup of the underlying random variable {X} that comprised that system. Indeed, the universality of the normal distribution is such that it arises in many other contexts than the fluctuation of iid random variables; the central limit theorem is merely the first place in probability theory where it makes a prominent appearance.

We will give several proofs of the central limit theorem in these notes; each of these proofs has its advantages and disadvantages, and each can be extended to prove many further results beyond the central limit theorem. We first give Lindeberg’s proof of the central limit theorem, based on exchanging (or swapping) each component {X_1,\dots,X_n} of the sum {S_n} in turn. This proof gives an accessible explanation as to why there should be a universal limit for the central limit theorem; one then computes directly with gaussians to verify that it is the normal distribution which is the universal limit. Our second proof is the most popular one taught in probability texts, namely the Fourier-analytic proof based around the concept of the characteristic function {t \mapsto {\bf E} e^{itX}} of a real random variable {X}. Thanks to the powerful identities and other results of Fourier analysis, this gives a quite short and direct proof of the central limit theorem, although the arguments may seem rather magical to readers who are not already familiar with Fourier methods. Finally, we give a proof based on the moment method, in the spirit of the arguments in the previous notes; this argument is more combinatorial, but is straightforward and is particularly robust, in particular being well equipped to handle some dependencies between components; we will illustrate this by proving the Erdos-Kac law in number theory by this method. Some further discussion of the central limit theorem (including some further proofs, such as one based on Stein’s method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and large deviation inequalities, will be discussed in the next (and final) set of notes.

The following exercise illustrates the power of the central limit theorem, by establishing combinatorial estimates which would otherwise require the use of Stirling’s formula to establish.

Exercise 9 (De Moivre-Laplace theorem) Let {X} be a Bernoulli random variable, taking values in {\{0,1\}} with {{\bf P}(X=0)={\bf P}(X=1)=1/2}, thus {X} has mean {1/2} and variance {1/4}. Let {X_1,X_2,\dots} be iid copies of {X}, and write {S_n := X_1+\dots+X_n}.

The above special case of the central limit theorem was first established by de Moivre and Laplace.
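
For the Bernoulli case the comparison can even be done with exact binomial probabilities, no sampling required. A sketch (the value n = 1000 and the test points are arbitrary):

```python
from math import comb, erf, sqrt

def Phi(t):
    return 0.5 + 0.5 * erf(t / sqrt(2))

def binom_cdf(k, n):
    """P(S_n <= k) for S_n ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, j) for j in range(k + 1)) / 2**n

n = 1000
mu, sigma = n / 2, sqrt(n) / 2               # mean n/2, standard deviation sqrt(n)/2
for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    k = int(mu + t * sigma)
    # De Moivre-Laplace: the exact binomial CDF approaches Phi(t) as n -> infinity
    # (adding the classical continuity correction of 0.5 to k tightens the match).
    print(t, round(binom_cdf(k, n), 4), round(Phi(t), 4))
```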

We close this section with some basic facts about convergence of distribution that will be useful in the sequel.

Exercise 10 Let {X_1,X_2,\dots}, {Y_1,Y_2,\dots} be sequences of real random variables, and let {X,Y} be further real random variables.

For future reference we also mention (but will not prove) Prokhorov’s theorem that gives a partial converse to part (iii) of the above exercise:

Theorem 11 (Prokhorov’s theorem) Let {X_1,X_2,\dots} be a sequence of real random variables which is tight (that is, for every {\varepsilon>0} there exists {K>0} such that {{\bf P}(|X_n| \geq K) < \varepsilon} for all sufficiently large {n}). Then there exists a subsequence {X_{n_j}} which converges in distribution to some random variable {X} (which may possibly be modeled by a different probability space model than the {X_1,X_2,\dots}.)

The proof of this theorem relies on the Riesz representation theorem, and is beyond the scope of this course; but see for instance Exercise 29 of this previous blog post. (See also the closely related Helly selection theorem, covered in Exercise 30 of the same post.)

— 1. The Lindeberg approach to the central limit theorem —

We now give the Lindeberg argument establishing the central limit theorem. The proof splits into two unrelated components. The first component is to establish the central limit theorem for a single choice of underlying random variable {X}. The second component is to show that the limiting distribution of {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} is universal in the sense that it does not depend on the choice of underlying random variable. Putting the two components together gives Theorem 8.

We begin with the first component of the argument. One could use the Bernoulli distribution from Exercise 9 as the choice of underlying random variable, but a simpler choice of distribution (in the sense that no appeal to Stirling’s formula is required) is the normal distribution {N(\mu,\sigma^2)} itself. The key computation is:

Lemma 12 (Sum of independent Gaussians) Let {X_1, X_2} be independent real random variables with normal distributions {N(\mu_1,\sigma_1^2)}, {N(\mu_2,\sigma_2^2)} respectively for some {\mu_1,\mu_2 \in {\bf R}} and {\sigma_1,\sigma_2 > 0}. Then {X_1+X_2} has the normal distribution {N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)}.

This is of course consistent with the additivity of mean and variance for independent random variables, given that random variables with the distribution {N(\mu,\sigma^2)} have mean {\mu} and variance {\sigma^2}.

Proof: By subtracting {\mu_1} and {\mu_2} from {X_1,X_2} respectively, we may normalise {\mu_1=\mu_2=0}; by dividing through by {\sqrt{\sigma_1^2+\sigma_2^2}} we may also normalise {\sigma_1^2+\sigma_2^2=1}. Thus

\displaystyle  {\bf P}( X_1 \in E_1 ) = \frac{1}{\sqrt{2\pi} \sigma_1} \int_{E_1} e^{-x_1^2 / 2\sigma_1^2}\ dx_1


\displaystyle  {\bf P}( X_2 \in E_2 ) = \frac{1}{\sqrt{2\pi} \sigma_2} \int_{E_2} e^{-x_2^2 / 2\sigma_2^2}\ dx_2

for any Borel sets {E_1,E_2 \subset {\bf R}}. As {X_1,X_2} are independent, this implies that

\displaystyle  {\bf P}( (X_1,X_2) \in E ) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_E e^{-x_1^2/2\sigma_1^2 - x_2^2/2\sigma_2^2}\ dx_1 dx_2

for any Borel set {E \subset {\bf R}^2} (this follows from the uniqueness of product measure, or equivalently one can use the monotone class lemma starting from the case when {E} is a finite boolean combination of product sets {E_1 \times E_2}). In particular, we have

\displaystyle  {\bf P}( X_1 + X_2 \leq t ) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_{{\bf R}^2} 1_{x_1+x_2 \leq t} e^{-x_1^2/2\sigma_1^2 - x_2^2/2\sigma_2^2}\ dx_1 dx_2

for any {t \in {\bf R}}. Making the change of variables {x = x_1 + x_2} (and using the Fubini-Tonelli theorem as necessary) we can write the right-hand side as

\displaystyle  \frac{1}{2\pi \sigma_1 \sigma_2} \int_{-\infty}^t \int_{\bf R} e^{-x_1^2/2\sigma_1^2 - (x-x_1)^2/2\sigma_2^2}\ dx_1 dx. \ \ \ \ \ (2)

We can complete the square using {\sigma_1^2+\sigma_2^2=1} to write (after some routine algebra)

\displaystyle  e^{-x_1^2/2\sigma_1^2 - (x-x_1)^2/2\sigma_2^2} = e^{-x^2 / 2} e^{-(x_1 - x \sigma_1^2)^2 / 2 \sigma_1^2 \sigma_2^2}

so on using the identity {\frac{1}{\sqrt{2\pi} \sigma} \int_{\bf R} e^{-(x_1-\mu)^2/2\sigma^2}\ dx_1 = 1} for any {\mu \in {\bf R}} and {\sigma>0}, we can write (2) as

\displaystyle  \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2 / 2}\ dx

and so {X_1+X_2} has the cumulative distribution function of {N(0, 1)}, giving the claim. \Box

In the next section we give an alternate proof of the above lemma using the machinery of characteristic functions. A more geometric argument can be given as follows. With the same normalisations as in the above proof, we can write {\sigma_1 = \cos \theta} and {\sigma_2 = \sin \theta} for some {0 < \theta < \pi/2}. Then we can write {X_1 = \cos \theta Y_1} and {X_2 = \sin \theta Y_2} where {Y_1,Y_2} are iid copies of {N(0,1)}. But the joint probability density function {(x_1,x_2) \mapsto \frac{1}{2\pi} e^{-(x_1^2+x_2^2)/2}} of {(Y_1,Y_2)} is rotation invariant, so {X_1+X_2 = \cos \theta Y_1 + \sin \theta Y_2} has the same distribution as {Y_1}, and the claim follows.

From the above lemma (and induction) we see that if {X_1,X_2,\dots} are iid copies of a random variable with the normal distribution {N(\mu,\sigma^2)} of mean {\mu} and variance {\sigma^2}, then {S_n = X_1 + \dots + X_n} has distribution {N(n \mu, n \sigma^2)}, and hence {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} has distribution exactly equal to {N(0,1)}. Thus the central limit theorem is clearly true in this case.
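
As an informal aside (not needed for the formal development), one can verify Lemma 12 numerically; the minimal sketch below assumes the Python libraries numpy and scipy are available, and all parameter choices are arbitrary.

```python
# A minimal Monte Carlo check of Lemma 12: samples of X_1 + X_2 should match
# the normal distribution N(mu_1 + mu_2, sigma_1^2 + sigma_2^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, s1 = 0.3, 1.2     # illustrative parameters (arbitrary)
mu2, s2 = -1.0, 0.7

x1 = rng.normal(mu1, s1, size=200_000)
x2 = rng.normal(mu2, s2, size=200_000)
total = x1 + x2

# Kolmogorov-Smirnov distance between the empirical law of X_1 + X_2 and the
# predicted normal law; it should be small (of order 1/sqrt(sample size)).
predicted = stats.norm(loc=mu1 + mu2, scale=np.hypot(s1, s2))
ks = stats.kstest(total, predicted.cdf)
print(f"sample mean {total.mean():.3f} (predicted {mu1 + mu2:.3f}), "
      f"sample var {total.var():.3f} (predicted {s1**2 + s2**2:.3f}), "
      f"KS statistic {ks.statistic:.4f}")
```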

Exercise 13 (Probabilistic interpretation of convolution) Let {f, g: {\bf R} \rightarrow [0,+\infty]} be measurable functions with {\int_{\bf R} f(x)\ dx = \int_{\bf R} g(x)\ dx = 1}. Define the convolution {f*g} of {f} and {g} to be

\displaystyle  f*g(x) := \int_{\bf R} f(y) g(x-y)\ dy.

Show that if {X, Y} are independent real random variables with probability density functions {f, g} respectively, then {X+Y} has probability density function {f*g}.
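
Similarly, here is a rough numerical sketch of this exercise, again assuming numpy is available; the exponential and uniform densities below are arbitrary illustrative choices.

```python
# Numerical sketch of Exercise 13: the density of X + Y is the convolution f*g.
# Here f is the Exp(1) density and g is the Uniform[0,2] density (arbitrary choices).
import numpy as np

rng = np.random.default_rng(1)
dx = 0.01
grid = np.arange(-1.0, 12.0, dx)

f = np.where(grid >= 0, np.exp(-grid), 0.0)         # Exp(1) density
g = np.where((grid >= 0) & (grid <= 2), 0.5, 0.0)   # Uniform[0,2] density

# Riemann-sum approximation of the convolution integral (f*g)(x).
conv = np.convolve(f, g) * dx
conv_grid = 2 * grid[0] + dx * np.arange(conv.size)

# Compare against a histogram of Monte Carlo samples of X + Y.
samples = rng.exponential(1.0, 500_000) + rng.uniform(0.0, 2.0, 500_000)
hist, edges = np.histogram(samples, bins=100, range=(0.0, 6.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - f*g| on [0,6]:",
      np.max(np.abs(hist - np.interp(centers, conv_grid, conv))))
```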

Now we turn to the general case of the central limit theorem. By subtracting {\mu} from {X} (and from each of the {X_i}) we may normalise {\mu = 0}; by dividing {X} (and each of the {X_i}) by {\sigma} we may also normalise {\sigma=1}. Thus {{\bf E} X = 0, {\bf E} X^2 = 1}, and our task is to show that

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N)

as {n \rightarrow \infty}, for any continuous compactly supported function {G: {\bf R} \rightarrow {\bf R}}, where {N} is a random variable distributed according to {N(0,1)} (possibly modeled by a different probability space than the original {X_1,X_2,\dots}). Since any continuous compactly supported function can be approximated uniformly by smooth compactly supported functions (as can be seen from the Weierstrass or Stone-Weierstrass theorems), it suffices to show this for smooth compactly supported {G}.

Let {Y_1,Y_2,\dots} be iid copies of {N}; by extending the probability space used to model {X_1,X_2,\dots} (using Proposition 26 from Notes 2), we can model the {Y_1,Y_2,\dots} and {X_1,X_2,\dots} by a common model, in such a way that the combined collection {(X_n,Y_n)_{n=1}^\infty} of random variables are jointly independent. As we have already proved the central limit theorem in the normally distributed case, we already have

\displaystyle  {\bf E} G( \frac{Y_1+\dots+Y_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N) \ \ \ \ \ (3)

as {n \rightarrow \infty} (indeed we even have equality here). So it suffices to show that

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{Y_1+\dots+Y_n}{\sqrt{n}} ) \rightarrow 0 \ \ \ \ \ (4)

as {n \rightarrow \infty}.

We first establish this claim under the additional simplifying assumption of a finite third moment: {{\bf E} |X|^3 < \infty}. Rather than swap all of the {X_1,\dots,X_n} with all of the {Y_1,\dots,Y_n}, let us just swap the final {X_n} to a {Y_n}, that is to say let us consider the expression

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{X_1+\dots+X_{n-1}+Y_n}{\sqrt{n}} ).

Writing {Z := \frac{X_1 + \dots + X_{n-1}}{\sqrt{n}}}, we can write this as

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} X_n ) - {\bf E} G( Z + \frac{1}{\sqrt{n}} Y_n ).

To compute this expression we use Taylor expansion. As {G} is smooth and compactly supported, the first three derivatives of {G} are bounded, leading to the Taylor approximation

\displaystyle  G( Z + \frac{1}{\sqrt{n}} X_n ) = G(Z) + \frac{1}{\sqrt{n}} G'(Z) X_n + \frac{1}{2n} G''(Z) X_n^2 + O( \frac{1}{n^{3/2}} |X_n|^3 )

where the implied constant depends on {G}. Taking expectations, we conclude that

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} X_n ) = {\bf E} G(Z) + \frac{1}{\sqrt{n}} {\bf E} G'(Z) X_n + \frac{1}{2n} {\bf E} G''(Z) X_n^2

\displaystyle + O( \frac{1}{n^{3/2}} {\bf E} |X|^3 ).

Now for a key point: as the random variable {Z} only depends on {X_1,\dots,X_{n-1}}, it is independent of {X_n}, and so we can decouple the expectations to obtain

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} X_n ) = {\bf E} G(Z) + \frac{1}{\sqrt{n}} {\bf E} G'(Z) {\bf E} X_n + \frac{1}{2n} {\bf E} G''(Z) {\bf E} X_n^2

\displaystyle + O( \frac{1}{n^{3/2}} {\bf E} |X|^3 ).

The same considerations apply after swapping {X_n} with {Y_n} (which also has a bounded third moment):

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} Y_n ) = {\bf E} G(Z) + \frac{1}{\sqrt{n}} {\bf E} G'(Z) {\bf E} Y_n + \frac{1}{2n} {\bf E} G''(Z) {\bf E} Y_n^2

\displaystyle + O( \frac{1}{n^{3/2}} ).

But by hypothesis, {X_n} and {Y_n} have matching moments to second order: {{\bf E} X_n = {\bf E} Y_n = 0} and {{\bf E} X_n^2 = {\bf E} Y_n^2 = 1}. Thus on subtraction we have

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{X_1+\dots+X_{n-1}+Y_n}{\sqrt{n}} )

\displaystyle = O( \frac{1}{n^{3/2}} (1 + {\bf E} |X|^3) ).

A similar argument (permuting the indices, and replacing some of the {X_i} with {Y_i}) gives

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_i+Y_{i+1}+\dots+Y_n}{\sqrt{n}} )

\displaystyle - {\bf E} G( \frac{X_1+\dots+X_{i-1}+Y_i+\dots+Y_n}{\sqrt{n}} )

\displaystyle  = O( \frac{1}{n^{3/2}} (1 + {\bf E} |X|^3) )

for all {1 \leq i \leq n}. Summing the telescoping series, we conclude that

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{Y_1+\dots+Y_n}{\sqrt{n}} ) = O( \frac{1}{n^{1/2}} (1 + {\bf E} |X|^3))

which gives (4). Note how it was important to Taylor expand to at least third order to obtain a total error bound that went to zero, which explains why it is the first two moments {{\bf E} X, {\bf E} X^2} of {X} (or equivalently, the mean and variance) that play such a decisive role in the central limit theorem.
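
Before removing the third moment hypothesis, we note (as an informal aside) that the {O(n^{-1/2})} decay of the exchange argument is easy to observe numerically. The sketch below assumes numpy is available, uses a rapidly decreasing rather than compactly supported test function purely for convenience, and picks an arbitrary skewed law for {X}.

```python
# A rough numerical illustration of the Lindeberg exchange bound: for a smooth,
# rapidly decreasing test function G, |E G(S_n/sqrt(n)) - E G(T_n/sqrt(n))|
# should decay roughly like n^{-1/2}, where S_n uses a skewed mean-zero,
# unit-variance law and T_n uses gaussians.
import numpy as np

rng = np.random.default_rng(2)
G = lambda x: np.exp(-x ** 2)   # smooth test function with bounded derivatives

def expectation_gap(n, trials=200_000, batch=20_000):
    sum_x, sum_y = 0.0, 0.0
    for _ in range(trials // batch):
        x = rng.exponential(1.0, (batch, n)) - 1.0   # mean zero, unit variance, skewed (arbitrary choice)
        y = rng.standard_normal((batch, n))          # gaussian comparison variables
        sum_x += G(x.sum(axis=1) / np.sqrt(n)).sum()
        sum_y += G(y.sum(axis=1) / np.sqrt(n)).sum()
    return abs(sum_x - sum_y) / trials

for n in (9, 36, 144):
    print(n, expectation_gap(n))
```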

Now we remove the hypothesis of finite third moment. As in the previous set of notes, we use the truncation method, taking advantage of the “room” inherent in the {\frac{1}{n^{1/2}}} factor in the error term of the above analysis. For technical reasons we have to modify the usual truncation slightly to preserve the mean zero condition. Let {\varepsilon > 0}. We split {X_i = X_{i,\leq} + X_{i,>}}, where {X_{i,\leq} := X_i 1_{|X_i| \leq \varepsilon \sqrt{n}} - \mu_n} and {X_{i,>} := X_i 1_{|X_i| > \varepsilon \sqrt{n}} + \mu_n} with {\mu_n := {\bf E} X 1_{|X| \leq \varepsilon \sqrt{n}}}; we split {X = X_{\leq} + X_{>}} respectively (we are suppressing the dependence of {X_{i,\leq}, X_{i,>}, X_{\leq}, X_{>}} on {\varepsilon} and {n} to simplify the notation). The random variables {X_\leq} (and {X_{i,\leq}}) are chosen to have mean zero, but their variance is not quite one. However, from dominated convergence, the quantity {\mu_n := {\bf E} X 1_{|X| \leq \varepsilon \sqrt{n}}} converges to {0}, and the variance {\sigma_n^2 := {\bf Var}(X_{\leq}) = {\bf E} |X|^2 1_{|X| \leq \varepsilon \sqrt{n}} - \mu_n^2} converges to {1}, as {n \rightarrow \infty}.

Let {Y'_1,\dots,Y'_n} be iid copies of {N(0,\sigma_n^2)}. The previous arguments then give

\displaystyle  {\bf E} G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) - {\bf E} G( \frac{Y'_1+\dots+Y'_n}{\sqrt{n}} ) = O( \frac{1}{n^{1/2}} (1 + {\bf E} |X_\leq|^3))

for {n} large enough. Since {|X_\leq| \leq \varepsilon \sqrt{n} + |\mu_n| = O(\varepsilon \sqrt{n})} for {n} large enough, and {{\bf E} |X_\leq|^2 = \sigma_n^2 = O(1)}, we can bound

\displaystyle  {\bf E} |X_\leq|^3 = O( \varepsilon \sqrt{n} )

and thus

\displaystyle  {\bf E} G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) - {\bf E} G( \frac{Y'_1+\dots+Y'_n}{\sqrt{n}} ) = O( \varepsilon )

for large enough {n}. By Lemma 12, {\frac{Y'_1+\dots+Y'_n}{\sqrt{n}}} is distributed as {N(0,\sigma_n^2)}, and hence

\displaystyle  {\bf E} G( \frac{Y'_1+\dots+Y'_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N)

as {n \rightarrow \infty}. We conclude that

\displaystyle  {\bf E} G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) = {\bf E} G(N) + O( \varepsilon ).

Next, we consider the error term

\displaystyle  \frac{X_{1,>}+\dots+X_{n,>}}{\sqrt{n}}.

The variable {X_>} has mean zero, and by dominated convergence, the variance of {X_>} goes to zero as {n \rightarrow \infty}. The above error term is also mean zero and has the same variance as {X_>}. In particular, from Cauchy-Schwarz we have

\displaystyle  {\bf E} |\frac{X_{1,>}+\dots+X_{n,>}}{\sqrt{n}}| \rightarrow 0.

As {G} is smooth and compactly supported, it is Lipschitz continuous, and hence

\displaystyle  G( \frac{X_{1}+\dots+X_{n}}{\sqrt{n}} ) = G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) + O( |\frac{X_{1,>}+\dots+X_{n,>}}{\sqrt{n}}| ).

Taking expectations, and then combining all these estimates, we conclude that

\displaystyle  {\bf E} G( \frac{X_{1}+\dots+X_{n}}{\sqrt{n}} ) = {\bf E} G(N) + O( \varepsilon )

for {n} sufficiently large; since {\varepsilon > 0} was arbitrary, this gives the desired convergence {{\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N)}. This concludes the proof of Theorem 8.

The above argument can be generalised to a central limit theorem for certain triangular arrays, known as the Lindeberg central limit theorem:

Exercise 14 (Lindeberg central limit theorem) Let {k_n} be a sequence of natural numbers going to infinity in {n}. For each natural number {n}, let {X_{1,n},\dots,X_{k_n,n}} be jointly independent real random variables of mean zero and finite variance. (We do not require the random variables {(X_{1,n},\dots,X_{k_n,n})} to be jointly independent in {n}, or even to be modeled by a common probability space.) Let {\sigma_n} be defined by

\displaystyle  \sigma_n^2 := \sum_{i=1}^{k_n} {\bf Var}(X_{i,n})

and assume that {\sigma_n>0} for all {n}.

  • (i) If one assumes the Lindeberg condition that

    \displaystyle  \frac{1}{\sigma_n^2} \sum_{i=1}^{k_n} {\bf E}( |X_{i,n}|^2 1_{|X_{i,n}| > \varepsilon \sigma_n} ) \rightarrow 0

    as {n \rightarrow \infty} for any {\varepsilon > 0}, then show that the random variables {\frac{X_{1,n}+\dots+X_{k_n,n}}{\sigma_n}} converge in distribution to a random variable with the normal distribution {N(0,1)}.

  • (ii) Show that the Lindeberg condition implies the Feller condition

    \displaystyle  \frac{1}{\sigma_n^2} \max_{1 \leq i \leq k_n} {\bf E} |X_{i,n}|^2 \rightarrow 0

    as {n \rightarrow \infty}.

Note that Theorem 8 (after normalising to the mean zero case {\mu=0}) corresponds to the special case {k_n = n} and {X_{i,n} := X_i} (or, if one wishes, {X_{i,n} := X_i/\sqrt{n}}) of the Lindeberg central limit theorem. It was shown by Feller that, in the situation of the above exercise, if the Feller condition holds, then the Lindeberg condition is necessary as well as sufficient for {\frac{X_{1,n}+\dots+X_{k_n,n}}{\sigma_n}} to converge in distribution to a random variable with the normal distribution {N(0,1)}; the combined result is sometimes known as the Lindeberg-Feller theorem.

Exercise 15 (Weak Berry-Esséen theorem) Let {X_1,\dots,X_n} be iid copies of a real random variable {X} of mean zero, unit variance, and finite third moment.

  • (i) Show that

    \displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}}) = {\bf E} G(N) + O( n^{-1/2} (\sup_{x \in {\bf R}} |G'''(x)|) {\bf E} |X|^3 )

    whenever {G} is three times continuously differentiable and compactly supported, with {N} distributed as {N(0,1)} and the implied constant in the {O()} notation absolute.

  • (ii) Show that

    \displaystyle  {\bf P} ( \frac{X_1+\dots+X_n}{\sqrt{n}} \leq t ) = {\bf P}( N \leq t) + O( ( n^{-1/2} {\bf E} |X|^3 )^{1/4} )

    for any {t \in {\bf R}}, with the implied constant absolute.

We will strengthen the conclusion of this theorem in Theorem 37 below.

Remark 16 The Lindeberg exchange method explains why the limiting distribution of statistics such as {\frac{X_1+\dots+X_n}{\sqrt{n}}} depends primarily on the first two moments of the component random variables {X_i}, if there is a suitable amount of independence between the {X_i}. It turns out that there is an analogous application of the Lindeberg method in random matrix theory, which (very roughly speaking) asserts that appropriate statistics of random matrices such as {(X_{ij})_{1 \leq i, j \leq n}} depend primarily on the first four moments of the matrix components {X_{ij}}, if there is a suitable amount of independence between the {X_{ij}}. See for instance this survey article of Van Vu and myself for more discussion of this. The Lindeberg method also suggests that the more moments of {X} one assumes to match with the Gaussian variable {N}, the faster the rate of convergence (because one can use higher order Taylor expansions).

We now use the Lindeberg central limit theorem to obtain the converse direction of the Kolmogorov three-series theorem (Exercise 29 of Notes 3).

Exercise 17 (Kolmogorov three-series theorem, converse direction) Let {X_1,X_2,\dots} be a sequence of jointly independent real random variables, with the property that the series {\sum_{n=1}^\infty X_n} is almost surely convergent (i.e., the partial sums are almost surely convergent), and let {A>0}.

  • (i) Show that {\sum_{n=1}^\infty {\bf P}( |X_n| > A )} is finite. (Hint: argue by contradiction and use the second Borel-Cantelli lemma.)
  • (ii) Show that {\sum_{n=1}^\infty {\bf Var}( X_n 1_{|X_n| \leq A} )} is finite. (Hint: first use (i) and the Borel-Cantelli lemma to reduce to the case where {|X_n| \leq A} almost surely. If {\sum_{n=1}^\infty \hbox{Var}(X_n)} is infinite, use Exercise 14 to show that {\sum_{n=1}^N (X_n-{\bf E}X_n) / \sqrt{\sum_{n=1}^N {\bf Var}(X_n)}} converges in distribution to a standard normal distribution as {N \rightarrow \infty}, and use this to contradict the almost sure convergence of {\sum_{n=1}^\infty X_n}.)
  • (iii) Show that the series {\sum_{n=1}^\infty {\bf E} X_n 1_{|X_n| \leq A}} is convergent. (Hint: reduce as before to the case where {|X_n| \leq A} almost surely, and apply the forward direction of the three-series theorem to {\sum_n X_n - {\bf E} X_n}.)

— 2. The Fourier-analytic approach to the central limit theorem —

Let us now give the standard Fourier-analytic proof of the central limit theorem. Given any real random variable {X}, we introduce the characteristic function {\varphi_X: {\bf R} \rightarrow {\bf C}}, defined by the formula

\displaystyle  \varphi_X(t) := {\bf E} e^{itX}. \ \ \ \ \ (5)

Equivalently, {\varphi_X} is the Fourier transform of the probability measure {\mu_X}. One should caution that the term “characteristic function” has several other unrelated meanings in mathematics; particularly confusingly, in real analysis “characteristic function” is used to denote what in probability one would call an “indicator function”. Note that no moment hypotheses are required to define the characteristic function, because the random variable {e^{itX}} is bounded even when {X} is not absolutely integrable.

Example 18 The signed Bernoulli distribution, which takes the values {+1} and {-1} with probabilities of {1/2} each, has characteristic function {\phi(t) = \cos(t)}.

Most of the standard random variables in probability have characteristic functions that are quite simple and explicit. For the purposes of proving the central limit theorem, the most important such explicit form of the characteristic function is of the normal distribution:

Exercise 19 Show that the normal distribution {N(\mu,\sigma^2)} has characteristic function {\phi(t) = e^{it\mu} e^{-\sigma^2 t^2/2}}.

We record the explicit characteristic functions of some other standard distributions:

Exercise 20 Let {\lambda > 0}, and let {X} be a Poisson random variable with intensity {\lambda}, thus {X} takes values in the non-negative integers with {{\bf P}(X=k) = e^{-\lambda} \frac{\lambda^k}{k!}}. Show that {\varphi_X(t) = \exp( \lambda(e^{it}-1) )} for all {t \in {\bf R}}.

Exercise 21 Let {X} be uniformly distributed in some interval {[a,b]}. Show that {\varphi_X(t) = \frac{e^{itb}-e^{ita}}{it(b-a)}} for all non-zero {t}.

Exercise 22 Let {x_0 \in {\bf R}} and {\gamma > 0}, and let {X} be a Cauchy random variable with parameters {x_0,\gamma}, which means that {X} is a real random variable with probability density function {\frac{\gamma}{\pi ((x-x_0)^2+\gamma^2)}}. Show that {\varphi_X(t) = e^{ix_0 t} e^{-\gamma |t|}} for all {t \in {\bf R}}.
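
As an informal check (assuming numpy is available; the sample sizes, parameters and evaluation points below are arbitrary), the empirical averages of {e^{itX}} can be compared against the closed forms of Exercises 19-22:

```python
# Monte Carlo check of the closed-form characteristic functions in Exercises 19-22
# (a sketch only; sample sizes, parameters and evaluation points are arbitrary).
import numpy as np

rng = np.random.default_rng(3)
t = np.array([0.5, 1.0, 2.0])
N = 400_000

def empirical_cf(samples):
    # Empirical average of e^{itX} over the sample, for each value of t.
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)

checks = {
    "normal N(1, 4)":   (rng.normal(1.0, 2.0, N),
                         np.exp(1j * t) * np.exp(-4.0 * t ** 2 / 2)),
    "Poisson(3)":       (rng.poisson(3.0, N),
                         np.exp(3.0 * (np.exp(1j * t) - 1))),
    "uniform on [2,5]": (rng.uniform(2.0, 5.0, N),
                         (np.exp(1j * t * 5) - np.exp(1j * t * 2)) / (1j * t * 3)),
    "Cauchy(0, 1)":     (rng.standard_cauchy(N),
                         np.exp(-np.abs(t))),
}
for name, (samples, exact) in checks.items():
    print(name, "max error:", np.max(np.abs(empirical_cf(samples) - exact)))
```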

The characteristic function is clearly bounded in magnitude by {1}, and equals {1} at the origin. By the dominated convergence theorem, {\varphi_X} is continuous in {t}.

Exercise 23 (Riemann-Lebesgue lemma) Show that if {X} is a real random variable that has an absolutely integrable probability density function {f}, then {\varphi_X(t) \rightarrow 0} as {|t| \rightarrow \infty}. (Hint: first show the claim when {f} is a finite linear combination of indicator functions of intervals, then for the general case show that {f} can be approximated in {L^1} norm by such finite linear combinations.) Note from Example 18 that the claim can fail if {X} does not have a probability density function. (In Fourier analysis, this fact is known as the Riemann-Lebesgue lemma.)

Exercise 24 Show that the characteristic function {\varphi_X} of a real random variable {X} is in fact uniformly continuous on its domain.

Let {X} be a real random variable. If we Taylor expand {e^{itX}} and formally interchange the series and expectation, we arrive at the heuristic identity

\displaystyle  \varphi_X(t) = \sum_{k=0}^\infty \frac{(it)^k}{k!} {\bf E} X^k \ \ \ \ \ (6)

which thus interprets the characteristic function of a real random variable {X} as a kind of generating function for the moments. One rigorous version of this identity is as follows.

Exercise 25 (Taylor expansion of characteristic function) Let {X} be a real random variable with finite {k^{th}} moment for some {k \geq 1}. Show that {\varphi_X} is {k} times continuously differentiable, with

\displaystyle  \frac{d^j}{dt^j} \varphi_X(t) = i^j {\bf E} X^j e^{itX}

for all {0 \leq j \leq k}. Conclude in particular the partial Taylor expansion

\displaystyle  \varphi_X(t) = \sum_{j=0}^k \frac{(it)^j}{j!} {\bf E} X^j + o( |t|^k )

where {o(|t|^k)} denotes a quantity of the form {c(t) |t|^k}, where {c(t)} goes to zero as {t \rightarrow 0}.

Exercise 26 Let {X} be a real random variable, and assume that it is subgaussian in the sense that there exist constants {C, c > 0} such that

\displaystyle  {\bf P}( |X| \geq t ) \leq C e^{-ct^2}

for all {t > 0}. (Thus for instance a bounded random variable is subgaussian, as is any gaussian random variable.) Rigorously establish (6) in this case, and show that the series converges locally uniformly in {t}.

Note that the characteristic function depends only on the distribution of {X}: if {X} and {Y} are equal in distribution, then {\varphi_X = \varphi_Y}. The converse statement is true also: if {\varphi_X=\varphi_Y}, then {X} and {Y} are equal in distribution. This follows from a more general (and useful) fact, known as Lévy’s continuity theorem.

Theorem 27 (Lévy continuity theorem, special case) Let {X_n} be a sequence of real random variables, and let {X} be an additional real random variable. Then the following statements are equivalent:

  • (i) {\varphi_{X_n}} converges pointwise to {\varphi_X}.
  • (ii) {X_n} converges in distribution to {X}.

Proof: The implication of (i) from (ii) is immediate from (5) and Exercise 10(viii).

Now suppose that (i) holds, and we wish to show that (ii) holds. We need to show that

\displaystyle  {\bf E} G(X_n) \rightarrow {\bf E} G(X)

whenever {G: {\bf R} \rightarrow {\bf R}} is a continuous, compactly supported function. As in the Lindeberg argument, it suffices to prove this when {G} is smooth and compactly supported, in particular {G} is a Schwartz function (infinitely differentiable, with all derivatives rapidly decreasing). But then we have the Fourier inversion formula

\displaystyle  G(X_n) = \int_{{\bf R}} \hat G(t) e^{it X_n}\ dt

where the Fourier transform

\displaystyle  \hat G(t) := \frac{1}{2\pi} \int_{{\bf R}} G(x) e^{-itx}\ dx

is Schwartz, and is in particular absolutely integrable (see e.g. these lecture notes of mine). From the Fubini-Tonelli theorem, we thus have

\displaystyle  {\bf E} G(X_n) = \int_{{\bf R}} \hat G(t) \varphi_{X_n}(t)\ dt \ \ \ \ \ (7)

and similarly for {X}. The claim now follows from the Lebesgue dominated convergence theorem. \Box

Remark 28 Setting {X_n := Y} for all {n}, we see in particular the previous claim that {\varphi_X = \varphi_Y} if and only if {X}, {Y} have the same distribution. It is instructive to use the above proof as a guide to prove this claim directly.

There is one subtlety with the Lévy continuity theorem: it is possible for a sequence {\varphi_{X_n}} of characteristic functions to converge pointwise, but for the limit to not be the characteristic function of any random variable, in which case {X_n} will not converge in distribution. For instance, if {X_n \equiv N(0,n)}, then {\varphi_{X_n}(t) = e^{-nt^2/2}} converges pointwise to {1_{t=0}} as {n \rightarrow \infty}, but this is clearly not the characteristic function of any random variable (as characteristic functions are continuous). However, this lack of continuity is the only obstruction:

Exercise 29 (Lévy’s continuity theorem, full version) Let {X_n} be a sequence of real valued random variables. Suppose that {\varphi_{X_n}} converges pointwise to a limit {\varphi}. Show that the following are equivalent:

  • (i) {\varphi} is continuous at {0}.
  • (ii) {X_n} is a tight sequence (as in Exercise 10(iii)).
  • (iii) {\varphi} is the characteristic function of a real valued random variable {X} (possibly after extending the sample space).
  • (iv) {X_n} converges in distribution to some real valued random variable {X} (possibly after extending the sample space).

Hint: To get from (ii) to the other conclusions, use Theorem 11 and Theorem 27. To get back to (ii) from (i), use (7) for a suitable Schwartz function {G}. The other implications are easy once Theorem 27 is in hand.

Remark 30 Lévy’s continuity theorem is very similar in spirit to Weyl’s criterion in equidistribution theory.

Exercise 31 (Esséen concentration inequality) Let {X} be a random variable taking values in {{\bf R}}. Then for any {r>0}, {\varepsilon > 0}, show that

\displaystyle  \sup_{x_0 \in {\bf R}} {\bf P}( |X - x_0| \leq r ) \leq C_{\varepsilon} r \int_{t \in {\bf R}: |t| \leq \varepsilon/r} |\varphi_X(t)|\ dt \ \ \ \ \ (8)

for some constant {C_{\varepsilon}} depending only on {\varepsilon}. (Hint: Use (7) for a suitable Schwartz function {G}.) The left-hand side of (8) (as well as higher dimensional analogues of this quantity) is known as the small ball probability of {X} at radius {r}.

In Fourier analysis, we learn that the Fourier transform is a particularly well-suited tool for studying convolutions. The probability theory analogue of this fact is that characteristic functions are a particularly well-suited tool for studying sums of independent random variables. More precisely, we have

Exercise 32 (Fourier identities) Let {X, Y} be independent real random variables. Then

\displaystyle  \varphi_{X+Y}(t) = \varphi_X(t) \varphi_Y(t) \ \ \ \ \ (9)

for all {t \in {\bf R}}. Also, for any real scalar {c}, one has

\displaystyle  \varphi_{cX}(t) = \varphi_X(ct),

and more generally, for a random variable {X} taking values in {{\bf R}^d} (with the characteristic function defined as in the vector-valued setting discussed later in this section) and any linear transformation {T: {\bf R}^d \rightarrow {\bf R}^d}, one has

\displaystyle  \varphi_{TX}(t) = \varphi_X(T^*t)

for all {t \in {\bf R}^d}, where {T^*} is the adjoint of {T}.

Remark 33 Note that this identity (9), combined with Exercise 19 and Remark 28, gives a quick alternate proof of Lemma 12.

In particular, we have the simple relationship

\displaystyle  \varphi_{S_n/\sqrt{n}}(t) = \varphi_X( t / \sqrt{n} )^n \ \ \ \ \ (10)

that describes the characteristic function of {S_n/\sqrt{n}} in terms of that of {X}.

We now have enough machinery to give a quick proof of the central limit theorem:

Proof: (Fourier proof of Theorem 8) We may normalise {X} to have mean zero and variance {1}. By Exercise 25, we thus have

\displaystyle  \varphi_X(t) = 1 - t^2/2 + o(|t|^2)

for sufficiently small {t}, or equivalently

\displaystyle  \varphi_X(t) = \exp( - t^2/2 + o(|t|^2) )

for sufficiently small {t}. Applying (10), we conclude that

\displaystyle  \varphi_{S_n/\sqrt{n}}(t) \rightarrow \exp( - t^2/2 )

as {n \rightarrow \infty} for any fixed {t}. But by Exercise 19, {\exp(-t^2/2)} is the characteristic function of the normal distribution {N(0,1)}. The claim now follows from the Lévy continuity theorem. \Box
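
As a concrete illustration of this convergence (an informal aside, assuming numpy is available; the evaluation points and values of {n} are arbitrary), one can evaluate (10) for the signed Bernoulli variable of Example 18:

```python
# Concrete instance of the limit used in the Fourier proof: for the signed
# Bernoulli variable of Example 18 (characteristic function cos t), one has
# phi_{S_n/sqrt(n)}(t) = cos(t/sqrt(n))^n -> exp(-t^2/2).
import numpy as np

t = np.array([0.5, 1.0, 2.0, 4.0])   # arbitrary evaluation points
for n in (10, 100, 1000, 10000):
    print(n, np.max(np.abs(np.cos(t / np.sqrt(n)) ** n - np.exp(-t ** 2 / 2))))
```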

The above machinery extends without difficulty to vector-valued random variables {\vec X = (X_1,\dots,X_d)} taking values in {{\bf R}^d}. The analogue of the characteristic function is then the function {\varphi_{\vec X}: {\bf R}^d \rightarrow {\bf C}} defined by

\displaystyle  \varphi_{\vec X}(t_1,\dots,t_d) := {\bf E} e^{i (t_1 X_1 + \dots + t_d X_d)}.

We leave the routine extension of the above results and proofs to the higher dimensional case to the reader. Most interesting is what happens to the central limit theorem:

Exercise 34 (Vector-valued central limit theorem) Let {\vec X=(X_1,\ldots,X_d)} be a random variable taking values in {{\bf R}^d} with finite second moment. Define the covariance matrix {\Sigma(\vec X)} to be the {d \times d} matrix {\Sigma} whose {ij^{th}} entry is the covariance {{\bf E} (X_i - {\bf E}(X_i)) (X_j - {\bf E}(X_j))}.

  • Show that the covariance matrix is positive semi-definite real symmetric.
  • Conversely, given any positive definite real symmetric {d \times d} matrix {\Sigma} and {\mu \in {\bf R}^d}, show that the multivariate normal distribution {N(\mu,\Sigma)_{{\bf R}^d}}, given by the absolutely continuous measure

    \displaystyle  \frac{1}{((2\pi)^d \det \Sigma)^{1/2}} e^{- (x-\mu) \cdot \Sigma^{-1} (x-\mu) / 2}\ dx,

    has mean {\mu} and covariance matrix {\Sigma}, and has a characteristic function given by

    \displaystyle  F(t) = e^{i \mu \cdot t} e^{- t \cdot \Sigma t / 2}.

    How would one define the normal distribution {N(\mu,\Sigma)_{{\bf R}^d}} if {\Sigma} degenerated to be merely positive semi-definite instead of positive definite?

  • If {\vec S_n := \vec X_1+\ldots + \vec X_n} is the sum of {n} iid copies of {\vec X}, and {\mu := {\bf E} \vec X}, show that {\frac{\vec S_n - n \mu}{\sqrt{n}}} converges in distribution to {N(0,\Sigma(\vec X))_{{\bf R}^d}}. (For this exercise, you may assume without proof that the Lévy continuity theorem extends to {{\bf R}^d}.)
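
As a quick sanity check of this exercise (an informal aside, assuming numpy is available; the two-dimensional law used is an arbitrary illustrative choice), one can confirm empirically that the normalised sums carry the covariance matrix {\Sigma(\vec X)}:

```python
# Monte Carlo sketch of the vector-valued central limit theorem: the normalised
# sums (S_n - n*mu)/sqrt(n) should have covariance close to Sigma(X). The 2D law
# X = (U, U^2) with U uniform on [0,1] is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(6)
n, trials = 500, 20_000

u = rng.uniform(0.0, 1.0, size=(trials, n))
sums = np.stack([u.sum(axis=1), (u ** 2).sum(axis=1)], axis=1)   # componentwise sums
mu = np.array([0.5, 1.0 / 3.0])                                  # exact mean of (U, U^2)
z = (sums - n * mu) / np.sqrt(n)

sigma = np.array([[1 / 12, 1 / 12],    # Var(U), Cov(U, U^2)
                  [1 / 12, 4 / 45]])   # Cov(U, U^2), Var(U^2)
print("empirical covariance of the normalised sums:\n", np.cov(z.T))
print("covariance matrix Sigma(X):\n", sigma)
```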

Exercise 35 (Complex central limit theorem) Let {X} be a complex random variable of mean {\mu \in {\bf C}}, whose real and imaginary parts have variance {\sigma^2/2} and covariance {0}. Let {X_1,\ldots,X_n \equiv X} be iid copies of {X}. Show that as {n \rightarrow \infty}, the normalised sums {\frac{\sqrt{n}}{\sigma} (\frac{X_1+\dots+X_n}{n} - \mu)} converge in distribution to the standard complex gaussian {N(0,1)_{\bf C}}, defined as the probability measure on {{\bf C}} that assigns to each Borel set {S \subset {\bf C}} the mass

\displaystyle  \frac{1}{\pi} \int_{S} e^{-|z|^2}\ dz

where {dz} is Lebesgue measure on {{\bf C}} (identified with {{\bf R}^2} in the usual fashion).

Exercise 36 Use characteristic functions and the truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14).

A more sophisticated version of the Fourier-analytic method gives a more quantitative form of the central limit theorem, namely the Berry-Esséen theorem.

Theorem 37 (Berry-Esséen theorem) Let {X} have mean zero and unit variance. Let {S_n := X_1+\ldots+X_n}, where {X_1,\ldots,X_n} are iid copies of {X}. Then we have

\displaystyle  {\bf P}( \frac{S_n}{\sqrt{n}} < a ) = {\bf P}( N < a ) + O( \frac{1}{\sqrt{n}} ({\bf E} |X|^3)) \ \ \ \ \ (11)

uniformly for all {a \in {\bf R}}, where {N} has the distribution of {N(0,1)}, and the implied constant is absolute.

Proof: (Optional) Write {\varepsilon := {\bf E} |X|^3/\sqrt{n}}; our task is to show that

\displaystyle  {\bf P}( \frac{S_n}{\sqrt{n}} < a ) = {\bf P}( N < a ) + O( \varepsilon )  \ \ \ \ \ (12)

for all {a}. We may of course assume that {\varepsilon < 1}, as the claim is trivial otherwise. (In particular, {X} has finite third moment.)

Let {c > 0} be a small absolute constant to be chosen later. Let {\eta: {\bf R} \rightarrow {\bf R}} be a non-negative Schwartz function with total mass {1} whose Fourier transform is supported in {[-c,c]}; such an {\eta} can be constructed by taking the inverse Fourier transform of a smooth function supported on {[-c/2,c/2]} and equal to one at the origin, multiplying that inverse transform by its complex conjugate to make it non-negative, and then normalising the total mass to equal one.

Let {\psi: {\bf R} \rightarrow {\bf R}} be the smoothed out version of {1_{(-\infty,0]}}, defined as

\displaystyle  \psi(x) := \int_{\bf R} 1_{(-\infty,0]}(x - \varepsilon y) \eta(y)\ dy

\displaystyle  = \int_{x/\varepsilon}^\infty \eta(y)\ dy.

Observe that {\psi} is decreasing from {1} to {0}. As {\eta} is rapidly decreasing and has total mass one, we also have the bound

\displaystyle  \psi(x) = 1_{(-\infty,0]}(x) + O_\eta( (1+|x|/\varepsilon)^{-100} ) \ \ \ \ \ (13)

(say) for any {x}, where subscript indicates that the implied constant depends on {\eta}.

We claim that it suffices to show that

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a) = {\bf E} \psi(N - a ) + O_\eta(\varepsilon) \ \ \ \ \ (14)

for every {a}, where the subscript means that the implied constant depends on {\eta}. Indeed, suppose that (14) held. Replacing {a} by {a + C \varepsilon} and {a - C\varepsilon} for some large absolute constant {C} and subtracting, we have

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a - C \varepsilon) - {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a + C \varepsilon) = {\bf E} \psi(N - a - C \varepsilon) - {\bf E} \psi(N - a + C \varepsilon) + O_\eta(\varepsilon)

for any {a}. From (13) we see that the function {t \mapsto \psi(t - a - C \varepsilon) - \psi(t - a + C \varepsilon )} is bounded by {O_{\eta,C}( (1 + |t-a|/\varepsilon)^{-100} )}, and hence by the bounded probability density of {N}

\displaystyle  {\bf E} \psi(N - a - C \varepsilon) - {\bf E} \psi(N - a + C \varepsilon) = O_{\eta,C}(\varepsilon).

Also, {\psi(t - a - C \varepsilon) - \psi(t - a + C \varepsilon )} is non-negative, and for {C} large enough, it is bounded from below by (say) {1/2} when {t - a \in [0,\varepsilon]}. We conclude (after choosing {C} appropriately) that

\displaystyle  {\bf P} ( \frac{S_n}{\sqrt{n}} \in [a,a+\varepsilon] ) = O_\eta(\varepsilon) \ \ \ \ \ (15)

for all real {a}. This implies that

\displaystyle  {\bf E} (1 + |\frac{S_n}{\sqrt{n}} - a|/\varepsilon)^{-100} = O_\eta(\varepsilon),

as can be seen by covering the real line by intervals {[a+j\varepsilon, a+(j+1)\varepsilon)} and applying (15) to each such interval. From (13) we conclude that

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a) = {\bf E} 1_{(-\infty,0]}(\frac{S_n}{\sqrt{n}} - a) + O_\eta(\varepsilon).

A similar argument gives

\displaystyle  {\bf E} \psi(N - a) = {\bf E} 1_{(-\infty,0]}(N - a) + O_\eta(\varepsilon),

and (12) then follows from (14).

It remains to establish (14). We write

\displaystyle  \psi(x) = \lim_{M \rightarrow \infty} \int_{x/\varepsilon}^{x/\varepsilon + M} \eta(y)\ dy

(with the expression in the limit being uniformly bounded) and

\displaystyle  \eta(y) = \int_{\bf R} \hat \eta(t) e^{ity}\ dt

to conclude (after applying Fubini’s theorem) that

\displaystyle  \psi(x) = \lim_{M \rightarrow \infty} \int_{-c}^c \hat \eta(t) e^{itx/\varepsilon} \frac{e^{itM}-1}{it}\ dt

using the compact support of {\hat \eta}. Hence

\displaystyle  \psi(\frac{S_n}{\sqrt{n}} - a) - \psi(N - a) = \lim_{M \rightarrow \infty} \int_{-c}^c \hat \eta(t) (e^{it(\frac{S_n}{\sqrt{n}} - a)/\varepsilon} - e^{it(N - a)/\varepsilon}) \frac{e^{itM}-1}{it}\ dt.

By the fundamental theorem of calculus we have {e^{it(\frac{S_n}{\sqrt{n}} - a)/\varepsilon} - e^{it(N - a)/\varepsilon} = O( |t| (|\frac{S_n}{\sqrt{n}}| + |N|) / \varepsilon )}. This factor of {|t|} cancels the similar factor in the denominator, so that (for fixed {\varepsilon}) the expression inside the limit is dominated by an absolutely integrable random variable. Thus by the dominated convergence theorem

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a) - {\bf E} \psi(N - a) = \lim_{M \rightarrow \infty} \int_{-c}^c \hat \eta(t) e^{-ita/\varepsilon} (\varphi_{S_n/\sqrt{n}}(t/\varepsilon)-\varphi_N(t/\varepsilon)) \frac{e^{itM}-1}{it}\ dt,

so it suffices to show that

\displaystyle  \int_{-c}^c \hat \eta(t) e^{-ita/\varepsilon} (\varphi_{S_n/\sqrt{n}}(t/\varepsilon) - \varphi_N(t/\varepsilon)) \frac{e^{itM}-1}{it}\ dt = O_\eta(\varepsilon)

uniformly in {M}. Applying the triangle inequality and the compact support of {\hat \eta}, it suffices to show that

\displaystyle  \int_{|t| \leq c} |\varphi_{S_n/\sqrt{n}}(t/\varepsilon) - \varphi_{N}(t/\varepsilon)| \frac{dt}{|t|} = O(\varepsilon). \ \ \ \ \ (16)

From Taylor expansion we have

\displaystyle  e^{itX} = 1 + it X - \frac{t^2}{2} X^2 + O( |t|^3 |X|^3 )

for any {t}; taking expectations and using the definition of {\varepsilon} we have

\displaystyle  \varphi_X(t) = 1 - t^2/2 + O( \varepsilon \sqrt{n} |t|^3 )

and in particular

\displaystyle  \varphi_X(t) = \exp( - t^2/2 + O( \varepsilon \sqrt{n} |t|^3 ) )

if {|t| \leq c / (\varepsilon \sqrt{n})} and {c} is small enough. Applying (10), we conclude that

\displaystyle  \varphi_{S_n/\sqrt{n}}(t) = \exp(-t^2/2 + O( \varepsilon |t|^3 ))

if {|t| \leq c/\varepsilon}. Meanwhile, from Exercise 19 we have {\varphi_N(t) = \exp(-t^2/2)}. Elementary calculus then gives us

\displaystyle  |\varphi_{S_n/\sqrt{n}}(t) - \varphi_N(t)| \leq O( \varepsilon |t|^3 \exp( - t^2/4 ) )

(say) if {c} is small enough. Inserting this bound into (16) we obtain the claim. \Box

Exercise 38 Show that the error terms in Theorem 37 are sharp (up to constants) when {X} is a signed Bernoulli random variable, or more precisely when {X} takes values in {\{-1,+1\}} with probability {1/2} for each.
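
An informal numerical companion to this exercise (assuming numpy and scipy are available; the values of {n} are arbitrary): for the signed Bernoulli variable one can tabulate the {\sqrt{n}}-scaled Kolmogorov distance and observe that it neither blows up nor decays to zero.

```python
# Suggestive empirical check that the Berry-Esseen error is of order 1/sqrt(n)
# and no better for the signed Bernoulli variable: the scaled Kolmogorov distance
# sqrt(n) * sup_a |P(S_n/sqrt(n) < a) - P(N < a)| should stay bounded away from
# zero and infinity as n grows.
import numpy as np
from scipy import stats

for n in (25, 100, 400, 1600):            # arbitrary sample sizes
    k = np.arange(n + 1)
    pmf = stats.binom.pmf(k, n, 0.5)      # S_n = 2*Binomial(n,1/2) - n
    support = (2 * k - n) / np.sqrt(n)    # atoms of S_n/sqrt(n)
    cdf = np.cumsum(pmf)
    gauss = stats.norm.cdf(support)
    # The supremum over a is attained at the jumps; test both sides of each jump.
    dist = np.max(np.maximum(np.abs(cdf - gauss), np.abs(cdf - pmf - gauss)))
    print(n, np.sqrt(n) * dist)
```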

Exercise 39 Let {X_n} be a sequence of real random variables which converge in distribution to a real random variable {X}, and let {Y_n} be a sequence of real random variables which converge in distribution to a real random variable {Y}. Suppose that, for each {n}, {X_n} and {Y_n} are independent, and suppose also that {X} and {Y} are independent. Show that {X_n+Y_n} converges in distribution to {X+Y}. (Hint: use the Lévy continuity theorem.)

The following exercise shows that the central limit theorem fails when the variance is infinite.

Exercise 40 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable random variable {X} of mean zero.

  • (i) In this part we assume that {X} is symmetric, which means that {X} and {-X} have the same distribution. Show that for any {t > 0} and {M > 0},

    \displaystyle  {\bf P}( X_1 + \dots + X_n \geq t )

    \displaystyle \geq \frac{1}{2} {\bf P}( X_1 1_{|X_1| \leq M} + \dots + X_n 1_{|X_n| \leq M} \geq t ).

    (Hint: relate both sides of this inequality to the probability of the event that {X_1 1_{|X_1| \leq M} + \dots + X_n 1_{|X_n| \leq M} \geq t} and {X_1 1_{|X_1| > M} + \dots + X_n 1_{|X_n| > M} \geq 0}, using the symmetry of the situation.)

  • (ii) If {X} is symmetric and {\frac{X_1+\dots+X_n}{\sqrt{n}}} converges in distribution to a real random variable {Y}, show that {X} has finite variance. (Hint: if this is not the case, then {X 1_{|X| \leq M}} will have arbitrarily large variance as {M} increases. On the other hand, {{\bf P}( Y \geq u )} can be made arbitrarily small by taking {u} large enough. For a large threshold {M}, apply (i) (with {t = u \sqrt{n}}) to obtain a contradiction.)
  • (iii) Generalise (ii) by removing the hypothesis that {X} is symmetric. (Hint: apply the symmetrisation trick of replacing {X} by {X-X'}, where {X'} is an independent copy of {X}, and use the previous exercise. One may need to utilise some truncation arguments to show that {X-X'} has infinite variance whenever {X} has infinite variance.)

Exercise 41

  • (i) If {X} is a real random variable of mean zero and variance {\sigma^2}, and {t} is a real number, show that

    \displaystyle  \varphi_X(t) = 1 + O( \sigma^2 t^2 )

    and that

    \displaystyle  \varphi_X(t) = 1 - \frac{1}{2} \sigma^2 t^2 + O( t^2 {\bf E} \min( |X|^2, |t| |X|^3 ) ).

    (Hint: first establish the Taylor bounds {e^{itX} = 1 + itX + O(t^2 |X|^2)} and {e^{itX} = 1 + itX - \frac{1}{2} t^2 X^2 + O( |t|^3 |X|^3 )}.)

  • (ii) Establish the pointwise inequality

    \displaystyle  |z_1 \dots z_n - z'_1 \dots z'_n| \leq \sum_{i=1}^n |z_i - z'_i|

    whenever {z_1,\dots,z_n,z'_1,\dots,z'_n} are complex numbers in the disk {\{ z \in {\bf C}: |z| \leq 1 \}}.

  • (iii) Suppose that for each {n}, {X_{1,n},\dots,X_{n,n}} are jointly independent real random variables of mean zero and finite variance, obeying the uniform bound

    \displaystyle  |X_{i,n}| \leq \varepsilon_n \sqrt{n}

    for all {i=1,\dots,n} and some {\varepsilon_n} going to zero as {n \rightarrow \infty}, and also obeying the variance bound

    \displaystyle  \frac{1}{n} \sum_{i=1}^n {\bf Var}( X_{i,n} ) \rightarrow \sigma^2

    as {n \rightarrow \infty} for some {0 < \sigma < \infty}. If {S_n := X_{1,n} + \dots + X_{n,n}}, use (i) and (ii) to show that

    \displaystyle  \varphi_{S_n/\sqrt{n}}(t) \rightarrow e^{-\sigma^2 t^2/2}

    as {n \rightarrow \infty} for any given real {t}.

  • (iv) Use (iii) and a truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14). (Note: one has to address the issue that truncating a random variable may alter its mean slightly.)

— 3. The moment method —

The above Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the “standard” proof given in probability textbooks. However, it relies quite heavily on the Fourier-analytic identities in Exercise 32, which in turn are extremely dependent on both the commutative nature of the situation (as it uses the identity {e^{A+B}=e^A e^B}) and on the independence of the situation (as it uses identities of the form {{\bf E}(e^A e^B) = ({\bf E} e^A)({\bf E} e^B)}). As one or both of these factors can be absent when trying to generalise this theorem, it is also important to look for non-Fourier based methods to prove results such as the central limit theorem. These methods often lead to proofs that are lengthier and more technical than the Fourier proofs, but also tend to be more robust.

The most elementary (but still remarkably effective) method available in this regard is the moment method, which we have already used in previous notes. In principle, this method is equivalent to the Fourier method, through the identity (6); but in practice, the moment method proofs tend to look somewhat different than the Fourier-analytic ones, and it is often more apparent how to modify them to non-independent or non-commutative settings.

We first need an analogue of the Lévy continuity theorem. Here we encounter a technical issue: whereas the Fourier phases {x \mapsto e^{itx}} were bounded, the moment functions {x \mapsto x^k} become unbounded at infinity. However, one can deal with this issue as long as one has sufficient decay:

Exercise 42 (Subgaussian random variables) Let {X} be a real random variable. Show that the following statements are equivalent:

  • (i) There exist {C,c>0} such that {{\bf P}(|X| \geq t) \leq C e^{-ct^2}} for all {t>0}.
  • (ii) There exists {C'>0} such that {{\bf E} |X|^k \leq (C'k)^{k/2}} for all {k \geq 1}.
  • (iii) There exist {C'',c''>0} such that {{\bf E} e^{tX} \leq C'' e^{c'' t^2}} for all {t \in {\bf R}}.

Furthermore, show that if (i) holds for some {C,c}, then (ii) holds for {C'} depending only on {C,c}, and similarly for any of the other implications. Variables obeying (i), (ii), or (iii) are called subgaussian. The function {t \mapsto {\bf E} e^{tX}} is known as the moment generating function of {X}; it is of course closely related to the characteristic function of {X}.

Exercise 43 Use the truncation method to show that in order to prove the central limit theorem (Theorem 8), it suffices to do so in the case when the underlying random variable {X} is bounded (and in particular subgaussian).

Once we restrict to the subgaussian case, we have an analogue of the Lévy continuity theorem:

Theorem 44 (Moment continuity theorem) Let {X_n} be a sequence of real random variables, and let {X} be a subgaussian random variable. Suppose that for every {k=0,1,2,\dots}, {{\bf E} X_n^k} converges pointwise to {{\bf E} X^k}. Then {X_n} converges in distribution to {X}.

Proof: Let {k = 0,1,2,\dots}. By Exercise 42, we have

\displaystyle  {\bf E} |X|^{k+1} \leq (C (k+1))^{(k+1)/2}

for some {C} independent of {k}. By the Cauchy-Schwarz inequality we also have {{\bf E} |X_n|^{k+1} \leq ({\bf E} X_n^{2k+2})^{1/2}}, so from the hypothesis that the moments of {X_n} converge to those of {X}, we conclude that

\displaystyle  {\bf E} |X_n|^{k+1} \leq 2 (2C (k+1))^{(k+1)/2}

when {n} is sufficiently large depending on {k}. From Taylor’s theorem with remainder (and Stirling’s formula (1)) we conclude

\displaystyle  \varphi_{X_n}(t) = \sum_{j=0}^k \frac{(it)^j}{j!} {\bf E} X_n^j + O( (C(k+1))^{-(k+1)/2} |t|^{k+1} )

uniformly in {t}, for {n} sufficiently large. Similarly for {X}. Taking limits as {n \rightarrow \infty} using the hypothesis that {{\bf E} X_n^j \rightarrow {\bf E} X^j} for each {j}, we see that

\displaystyle  \limsup_{n \rightarrow \infty} |\varphi_{X_n}(t) - \varphi_X(t)| = O( (C(k+1))^{-(k+1)/2} |t|^{k+1} ).

Then letting {k \rightarrow \infty}, keeping {t} fixed, we see that {\varphi_{X_n}(t)} converges pointwise to {\varphi_X(t)} for each {t}, and the claim now follows from the Lévy continuity theorem. \Box

Remark 45 One corollary of Theorem 44 is that the distribution of a subgaussian random variable is uniquely determined by its moments (actually, this could already be deduced from Exercise 26 and Remark 28). This uniqueness can fail for distributions with more slowly decaying tails, for much the same reason that a smooth function is not determined by its derivatives at one point if that function is not analytic.

The Fourier inversion formula provides an easy way to recover the distribution from the characteristic function. Recovering a distribution from its moments is more difficult, and sometimes requires tools such as analytic continuation; this problem is known as the inverse moment problem and will not be discussed here.

Exercise 46 (Converse direction of moment continuity theorem) Let {X_n} be a sequence of uniformly subgaussian random variables (thus there exist {C,c>0} such that {{\bf P}(|X_n| \geq t) \leq C e^{-ct^2}} for all {t>0} and all {n}), and suppose {X_n} converges in distribution to a limit {X}. Show that for any {k=0,1,2,\dots}, {{\bf E} X_n^k} converges pointwise to {{\bf E} X^k}.

We now give the moment method proof of the central limit theorem. As discussed above we may assume without loss of generality that {X} is bounded (and in particular subgaussian); we may also normalise {X} to have mean zero and unit variance. By Theorem 44, it suffices to show that

\displaystyle  {\bf E} S_n^k / n^{k/2} \rightarrow {\bf E} N^k

for all {k=0,1,2,\ldots}, where {N \equiv N(0,1)_{\bf R}} is a standard gaussian variable. Here and in the rest of this argument we abbreviate {Z_n := S_n / \sqrt{n}}, so that our task is to show that {{\bf E} Z_n^k \rightarrow {\bf E} N^k}.

The moments {{\bf E} N^k} were already computed in Exercise 36 of Notes 1. So now we need to compute {{\bf E} S_n^k / n^{k/2}}. Using linearity of expectation, we can expand this as

\displaystyle  {\bf E} S_n^k / n^{k/2} = n^{-k/2} \sum_{1 \leq i_1,\ldots,i_k \leq n} {\bf E} X_{i_1} \ldots X_{i_k}.

To understand this expression, let us first look at some small values of {k}.

  • For {k=0}, this expression is trivially {1}.
  • For {k=1}, this expression is trivially {0}, thanks to the mean zero hypothesis on {X}.
  • For {k=2}, we can split this expression into the diagonal and off-diagonal components:

    \displaystyle  n^{-1} \sum_{1 \leq i \leq n} {\bf E} X_i^2 + n^{-1} \sum_{1 \leq i < j \leq n} {\bf E} 2 X_i X_j.

    Each summand in the first sum is {1}, as {X} has unit variance. Each summand in the second sum is {0}, as the {X_i} have mean zero and are independent. So the second moment {{\bf E} Z_n^2} is {1}.

  • For {k=3}, we have a similar expansion

    \displaystyle  n^{-3/2} \sum_{1 \leq i \leq n} {\bf E} X_i^3 + n^{-3/2} \sum_{1 \leq i < j \leq n} {\bf E}( 3 X_i^2 X_j + 3 X_i X_j^2 )

    \displaystyle  + n^{-3/2} \sum_{1 \leq i < j < k \leq n} {\bf E} 6 X_i X_j X_k.

    The summands in the latter two sums vanish because of the (joint) independence and mean zero hypotheses. The summands in the first sum need not vanish, but are {O(1)}, so the first term is {O(n^{-1/2})}, which is asymptotically negligible, so the third moment {{\bf E} Z_n^3} goes to {0}.

  • For {k=4}, the expansion becomes quite complicated:

    \displaystyle  n^{-2} \sum_{1 \leq i \leq n} {\bf E} X_i^4 + n^{-2} \sum_{1 \leq i < j \leq n} {\bf E}( 4 X_i^3 X_j + 6 X_i^2 X_j^2 + 4 X_i X_j^3 )

    \displaystyle  + n^{-2} \sum_{1 \leq i < j < k \leq n} {\bf E}( 12 X_i^2 X_j X_k + 12 X_i X_j^2 X_k + 12 X_i X_j X_k^2 )

    \displaystyle  + n^{-2} \sum_{1 \leq i < j < k < l \leq n} {\bf E} 24 X_i X_j X_k X_l.

    Again, most terms vanish, except for the first sum, which is {O( n^{-1} )} and is asymptotically negligible, and the sum {n^{-2} \sum_{1 \leq i < j \leq n} {\bf E} 6 X_i^2 X_j^2}, which by the independence and unit variance assumptions works out to {n^{-2} 6 \binom{n}{2} = 3 + o(1)}. Thus the fourth moment {{\bf E} Z_n^4} goes to {3} (as it should).

Now we tackle the general case. Writing the distinct values of the indices {i_1,\ldots,i_k} in increasing order as {j_1 < \ldots < j_m} for some {1 \leq m \leq k}, with each {j_r} occurring with multiplicity {a_r \geq 1}, and using elementary enumerative combinatorics, we see that {{\bf E} S_n^k / n^{k/2}} is the sum of all terms of the form

\displaystyle  n^{-k/2} \sum_{1 \leq j_1 < \ldots < j_m \leq n} c_{k,a_1,\ldots,a_m} {\bf E} X_{j_1}^{a_1} \ldots X_{j_m}^{a_m} \ \ \ \ \ (17)

where {1 \leq m \leq k}, {a_1,\ldots,a_m} are positive integers adding up to {k}, and {c_{k,a_1,\ldots,a_m}} is the multinomial coefficient

\displaystyle  c_{k,a_1,\ldots,a_m} := \frac{k!}{a_1! \ldots a_m!}.

The total number of such terms depends only on {k} (in fact, it is {2^{k-1}} (exercise!), though we will not need this fact).

As we already saw from the small {k} examples, most of the terms vanish, and many of the other terms are negligible in the limit {n \rightarrow \infty}. Indeed, if any of the {a_r} are equal to {1}, then every summand in (17) vanishes, by joint independence and the mean zero hypothesis. Thus, we may restrict attention to those expressions (17) for which all the {a_r} are at least {2}. Since the {a_r} sum up to {k}, we conclude that {m} is at most {k/2}.

On the other hand, the total number of summands in (17) is clearly at most {n^m} (in fact it is {\binom{n}{m}}), and the summands are bounded (for fixed {k}) since {X} is bounded. Thus, if {m} is strictly less than {k/2}, then the expression in (17) is {O( n^{m-k/2} )} and goes to zero as {n \rightarrow \infty}. So, asymptotically, the only terms (17) which are still relevant are those for which {m} is equal to {k/2}. This already shows that {{\bf E} Z_n^k} goes to zero when {k} is odd. When {k} is even, the only surviving term in the limit is now when {m=k/2} and {a_1=\ldots=a_m = 2}. But then by independence and unit variance, the expectation in (17) is {1}, and so this term is equal to

\displaystyle  n^{-k/2} \binom{n}{m} c_{k,2,\ldots,2} = \frac{1}{(k/2)!} \frac{k!}{2^{k/2}} + o(1),

and the main term is happily equal to the moment {{\bf E} N^k} as computed in Exercise 36 of Notes 1. (One could also appeal to Lemma 12 here, specialising to the case when {X} is normally distributed, to explain this coincidence.) This concludes the proof of the central limit theorem.
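
As a numerical footnote to the moment computation (an informal aside, assuming numpy is available; the choice of {X} as a random sign is an arbitrary bounded, normalised example), one can compare empirical moments of {Z_n} with the gaussian moments:

```python
# Monte Carlo check of the moment computation: for the bounded variable X = +-1
# (mean zero, unit variance), E (S_n/sqrt(n))^k should approach 0 for odd k and
# k! / (2^{k/2} (k/2)!) for even k.
import math
import numpy as np

rng = np.random.default_rng(4)
n, trials = 2000, 1_000_000            # arbitrary sizes
# The sum of n independent random signs, generated via a binomial count.
z = (2.0 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)

for k in range(1, 9):
    gaussian = 0.0 if k % 2 else math.factorial(k) / (2 ** (k // 2) * math.factorial(k // 2))
    print(k, round(np.mean(z ** k), 3), gaussian)
```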

Exercise 47 (Chernoff bound) Let {X_1,\dots,X_n} be iid copies of a real random variable {X} of mean zero and unit variance, which is subgaussian in the sense of Exercise 42. Write {S_n := X_1+\dots+X_n}.

  • (i) Show that there exists {c''>0} such that {{\bf E} e^{tX} \leq e^{c'' t^2}} for all {t \in {\bf R}}. Conclude that {{\bf E} e^{tS_n/\sqrt{n}} \leq e^{c'' t^2}} for all {t \in {\bf R}}. (Hint: the first claim follows directly from Exercise 42 when {|t| \geq 1}; for {|t| < 1}, use the Taylor approximation {e^{tX} = 1 + tX + O( t^2 (e^{2X} + e^{-2X}) )}.)
  • (ii) Conclude the Chernoff bound

    \displaystyle  {\bf P}( |\frac{S_n}{\sqrt{n}}| \geq \lambda ) \leq C e^{-c \lambda^2}

    for some {C,c>0}, all {\lambda>0}, and all {n \geq 1}.
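
For an informal numerical illustration of this bound (assuming numpy is available; the random-sign law, the value of {n}, and the benchmark constant are arbitrary choices), one can compare empirical tails against a gaussian-type benchmark:

```python
# Quick simulation of Chernoff-type decay: for X = +-1 (bounded, mean zero,
# unit variance), the tail P(|S_n/sqrt(n)| >= lambda) is compared against the
# gaussian-type benchmark 2 exp(-lambda^2/2).
import numpy as np

rng = np.random.default_rng(5)
n, trials = 400, 2_000_000             # arbitrary sizes
z = (2.0 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)

for lam in (1.0, 2.0, 3.0, 4.0):
    print(lam, np.mean(np.abs(z) >= lam), 2 * np.exp(-lam ** 2 / 2))
```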

Exercise 48 (Erdös-Kac theorem) For any natural number {x \geq 100}, let {n} be a natural number drawn uniformly at random from the natural numbers {\{1,\dots,x\}}, and let {\omega(n)} denote the number of distinct prime factors of {n}.

  • (i) Show that for any {k=0,1,2,\dots}, one has

    \displaystyle {\bf E} (\frac{\omega(n)-\log\log x}{\sqrt{\log\log x}})^k \rightarrow 0

    as {x \rightarrow \infty} if {k} is odd, and

    \displaystyle {\bf E} (\frac{\omega(n)-\log\log x}{\sqrt{\log\log x}})^k \rightarrow \frac{k!}{2^{k/2} (k/2)!}

    as {x \rightarrow \infty} if {k} is even. (Hint: adapt the arguments in Exercise 16 of Notes 3, estimating {\omega(n)-\log\log x} by {\sum_{p \leq x^{1/10k}} 1_{p|n} - \frac{1}{p}}, using Mertens’ theorem and induction on {k} to deal with lower order errors, and treating the random variables {1_{p|n} - \frac{1}{p}} as being approximately independent and approximately of mean zero.)

  • (ii) Establish the Erdös-Kac theorem

    \displaystyle  \frac{1}{x} | \{ n \leq x: a \leq \frac{\omega(n)-\log\log x}{\sqrt{\log\log x}} \leq b \}| \rightarrow \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\ dt

    as {x \rightarrow \infty} for any fixed {a < b}.

Informally, the Erdös-Kac theorem asserts that {\omega(n)} behaves like {N( \log\log n, \log\log n)} for “random” {n}. Note that this refines the Hardy-Ramanujan theorem (Exercise 16 of Notes 3).
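
As an informal aside (assuming numpy and scipy are available; the cutoff {x} is an arbitrary modest choice), the Erdös-Kac statistic is easy to simulate with a sieve; note that {\log\log x} grows so slowly that the agreement with the normal distribution is only rough at computationally accessible values of {x}.

```python
# A crude numerical illustration of the Erdos-Kac theorem: compute omega(n) for
# all n <= x with a sieve and compare the standardised statistic against the
# standard normal. (x is modest and arbitrary, so the agreement is only rough.)
import numpy as np
from scipy import stats

x = 10 ** 6
omega = np.zeros(x + 1, dtype=np.int32)
for p in range(2, x + 1):
    if omega[p] == 0:          # p has no smaller prime factor, so p is prime
        omega[p::p] += 1       # p divides exactly the multiples of p

loglog = np.log(np.log(x))
z = (omega[1:] - loglog) / np.sqrt(loglog)   # standardised omega(n), n = 1..x

for a, b in [(-1.0, 1.0), (0.0, 2.0)]:
    print((a, b), np.mean((z >= a) & (z <= b)), stats.norm.cdf(b) - stats.norm.cdf(a))
```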


Terence Tao275A, Notes 5: Variants of the central limit theorem

In the previous set of notes we established the central limit theorem, which we formulate here as follows:

Theorem 1 (Central limit theorem) Let {X_1,X_2,X_3,\dots} be iid copies of a real random variable {X} of mean {\mu} and variance {0 < \sigma^2 < \infty}, and write {S_n := X_1 + \dots + X_n}. Then, for any fixed {a < b}, we have

\displaystyle  {\bf P}( a \leq \frac{S_n - n \mu}{\sqrt{n} \sigma} \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\ dt \ \ \ \ \ (1)

as {n \rightarrow \infty}.

This is however not the end of the matter; there are many variants, refinements, and generalisations of the central limit theorem, and the purpose of this set of notes is to present a small sample of these variants.

First of all, the above theorem does not quantify the rate of convergence in (1). We have already addressed this issue to some extent with the Berry-Esséen theorem, which roughly speaking gives a convergence rate of {O(1/\sqrt{n})} uniformly in {a,b} if we assume that {X} has finite third moment. However there are still some quantitative versions of (1) which are not addressed by the Berry-Esséen theorem. For instance one may be interested in bounding the large deviation probabilities

\displaystyle  {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) \ \ \ \ \ (2)

in the setting where {\lambda} grows with {n}. The central limit theorem (1) suggests that this probability should be bounded by something like {O( e^{-\lambda^2/2})}; however, this theorem only kicks in when {n} is very large compared with {\lambda}. For instance, if one uses the Berry-Esséen theorem, one would need {n} as large as {e^{\lambda^2}} or so to reach the desired bound of {O( e^{-\lambda^2/2})}, even under the assumption of finite third moment. Basically, the issue is that convergence-in-distribution results, such as the central limit theorem, only really control the typical behaviour of statistics in {\frac{S_n-n \mu}{\sqrt{n} \sigma}}; they are much less effective at controlling the very rare outlier events in which the statistic strays far from its typical behaviour. Fortunately, there are large deviation inequalities (or concentration of measure inequalities) that do provide exponential type bounds for quantities such as (2), which are valid for both small and large values of {n}. A basic example of this is the Chernoff bound that made an appearance in Exercise 47 of Notes 4; here we give some further basic inequalities of this type, including versions of the Bennett and Hoeffding inequalities.

In the other direction, we can also look at the fine scale behaviour of the sums {S_n} by trying to control probabilities such as

\displaystyle  {\bf P}( a \leq S_n \leq a+h ) \ \ \ \ \ (3)

where {h} is now bounded (but {a} can grow with {n}). The central limit theorem predicts that this quantity should be roughly {\frac{h}{\sqrt{2\pi n} \sigma} e^{-(a-n\mu)^2 / 2n \sigma^2}}, but even if one is able to invoke the Berry-Esséen theorem, one cannot quite see this main term because it is dominated by the error term {O(1/n^{1/2})} in Berry-Esséen. There is good reason for this: if for instance {X} takes integer values, then {S_n} also takes integer values, and {{\bf P}( a \leq S_n \leq a+h )} can vanish when {h} is less than {1} and {a} is slightly larger than an integer. However, this turns out to essentially be the only obstruction; if {X} does not lie in a lattice such as {{\bf Z}}, then we can establish a local limit theorem controlling (3), and when {X} does take values in a lattice like {{\bf Z}}, there is a discrete local limit theorem that controls probabilities such as {{\bf P}(S_n = m)}. Both of these limit theorems will be proven by the Fourier-analytic method used in the previous set of notes.

We also discuss other limit theorems in which the limiting distribution is something other than the normal distribution. Perhaps the most common example of these theorems is the Poisson limit theorems, in which one sums a large number of indicator variables (or approximate indicator variables), each of which is rarely non-zero, but which collectively add up to a random variable of medium-sized mean. In this case, it turns out that the limiting distribution should be a Poisson random variable; this again is an easy application of the Fourier method. Finally, we briefly discuss limit theorems for other stable laws than the normal distribution, which are suitable for summing random variables of infinite variance, such as the Cauchy distribution.

Finally, we mention a very important class of generalisations to the CLT (and to the variants of the CLT discussed in this post), in which the hypothesis of joint independence between the variables {X_1,\dots,X_n} is relaxed, for instance one could assume only that the {X_1,\dots,X_n} form a martingale. Many (though not all) of the proofs of the CLT extend to these more general settings, and this turns out to be important for many applications in which one does not expect joint independence. However, we will not discuss these generalisations in this course, as they are better suited for subsequent courses in this series when the theory of martingales, conditional expectation, and related tools are developed.

— 1. Large deviation inequalities —

We now look at some upper bounds for the large deviation probability (2). To get some intuition as to what kinds of bounds one can expect, we first consider some examples. First suppose that {X \equiv N(0,1)} has the standard normal distribution, then {\mu=0}, {\sigma^2=1}, and {S_n} has the distribution of {N(0,n)}, so that {\frac{S_n - n \mu}{\sqrt{n} \sigma}} has the distribution of {N(0,1)}. We thus have

\displaystyle  {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) = \frac{2}{\sqrt{2\pi}} \int_\lambda^\infty e^{-t^2/2}\ dt

which on using the inequality {e^{-t^2/2} \leq e^{-\lambda^2/2} e^{-\lambda (t-\lambda)}} leads to the bound

\displaystyle  {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) \leq \lambda^{-1} e^{-\lambda^2/2}. \ \ \ \ \ (4)
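
As a quick sanity check of (4) (an informal aside, assuming numpy and scipy are available; the values of {\lambda} are arbitrary), one can tabulate the exact gaussian tail probability against the right-hand side:

```python
# Quick check of the tail bound (4) against the exact gaussian tail probability.
import numpy as np
from scipy import stats

for lam in (1.0, 2.0, 3.0, 5.0):                # arbitrary evaluation points
    exact = 2 * stats.norm.sf(lam)              # P(|N(0,1)| >= lambda)
    bound = np.exp(-lam ** 2 / 2) / lam         # right-hand side of (4)
    print(lam, exact, bound)
```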

Next, we consider the example when {X} is a Bernoulli random variable drawn uniformly at random from {\{0,1\}}. Then {\mu=1/2}, {\sigma^2=1/4}, and {S_n} has the standard binomial distribution on {\{0,\dots,n\}}, thus {{\bf P}(S_n=i) = \binom{n}{i}/2^n}. By symmetry, we then have

\displaystyle  {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) = 2 \sum_{i \geq \frac{n}{2} + \frac{\lambda \sqrt{n}}{2}} \binom{n}{i} / 2^n.

We recall Stirling’s formula, which we write crudely as

\displaystyle  n! = n^n e^{-n + o(n)} = \exp( n \log n - n + o(n) )

as {n \rightarrow \infty}, where {o(n)} denotes a quantity with {o(n)/n \rightarrow 0} as {n \rightarrow \infty} (and similarly for other uses of the {o()} notation in the sequel). If {i = \alpha n} and {\alpha} is bounded away from zero and one, we then have the asymptotic

\displaystyle  \binom{n}{i} = \exp( h(\alpha) n + o(n) )

where {h(\alpha)} is the entropy function

\displaystyle  h(\alpha) := \alpha \log \frac{1}{\alpha} + (1-\alpha) \log \frac{1}{1-\alpha}

(compare with Exercise 17 of Notes 3). One can check that {h(\alpha)} is decreasing for {1/2 \leq \alpha \leq 1}, and so one can compute that

\displaystyle  {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \theta \sqrt{n} ) = \exp( (h(\frac{1+\theta}{2}) - \log 2 + o(1)) n )

as {n \rightarrow \infty} for any fixed {0 < \theta < 1}. To compare this with (4), observe from Taylor expansion that

\displaystyle  h(\frac{1+\theta}{2}) = \log 2 - \frac{\theta^2}{2} + o(\theta^2)

as {\theta \rightarrow 0}.
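One can check these asymptotics numerically with a short Python sketch (standard library only; the particular choices of {\theta} and {n} below are arbitrary and purely illustrative):

import math

def h(alpha):
    # entropy function h(alpha) = alpha log(1/alpha) + (1-alpha) log(1/(1-alpha))
    return -alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)

def binom_tail(n, theta):
    # exact one-sided tail P( S_n >= n/2 + theta*n/2 ) for S_n ~ Binomial(n, 1/2);
    # the factor 2 coming from symmetry does not affect the exponential rate
    cutoff = math.ceil(n / 2 + theta * n / 2)
    return sum(math.comb(n, i) for i in range(cutoff, n + 1)) / 2 ** n

theta = 0.3
for n in (50, 200, 800):
    exact_rate = math.log(binom_tail(n, theta)) / n
    entropy_rate = h((1 + theta) / 2) - math.log(2)   # prediction from Stirling's formula
    gaussian_rate = -theta ** 2 / 2                   # small-theta Taylor approximation
    print(n, exact_rate, entropy_rate, gaussian_rate)

As {n} grows, the empirical rate approaches the entropy prediction; the discrepancy is the {o(1)} term coming from the polynomial factors in Stirling's formula.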

Finally, consider the example where {X} takes values in {\{0,1\}} with {{\bf P}(X=0)=1-p} and {{\bf P}(X=1)=p} for some small {0 < p < 1/2}, thus {\mu = p} and {\sigma^2 = p - p^2 \approx p}. We have {{\bf P}(S_n = n) = p^n}, and hence

\displaystyle  {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) = p^n = \exp( - n \log \frac{1}{p} )

where

\displaystyle  \lambda := \sqrt{n} \frac{\sqrt{1-p}}{\sqrt{p}} \approx \sqrt{n} / \sqrt{p}.

Here, we see that the large deviation probability {\exp( - n \log \frac{1}{p} )} is somewhat larger than the gaussian prediction of {\exp( - c \lambda^2 )}. Instead, the exponent {n \log \frac{1}{p}} is approximately related to {\lambda} and {\sigma} by the formula

\displaystyle  n \log \frac{1}{p} \approx \lambda \sqrt{n} \sigma \log \frac{\lambda}{\sqrt{n} \sigma}.

We now give a general large deviations inequality that is consistent with the above examples.

Proposition 2 (Cheap Bennett inequality) Let {a > 0}, and let {X_1,\dots,X_n} be independent random variables, each of which takes values in an interval of length at most {a}. Write {S_n := X_1+\dots+X_n}, and write {\mu^{(n)}} for the mean of {S_n}. Let {\sigma^{(n)} > 0} be such that {S_n} has variance at most {(\sigma^{(n)})^2}. Then for any {\lambda > 0}, we have

\displaystyle  {\bf P}( |\frac{S_n - \mu^{(n)}}{\sigma^{(n)}}| \geq \lambda ) \leq 2 \exp( - c \min( \lambda^2, \frac{\lambda \sigma^{(n)}}{a} \log \frac{a \lambda}{\sigma^{(n)}} ) ) \ \ \ \ \ (5)

for some absolute constant {c>0}.

There is a more precise form of this inequality known as Bennett’s inequality, but we will not prove it here.

The first term in the minimum dominates when {\lambda \ll \frac{\sigma^{(n)}}{a}}, and the second term dominates when {\lambda \gg \frac{\sigma^{(n)}}{a}}. Sometimes it is convenient to weaken the estimate by discarding the logarithmic factor, leading to

\displaystyle  {\bf P}( |\frac{S_n - \mu^{(n)}}{\sigma^{(n)}}| \geq \lambda ) \leq 2 \exp( - c \min( \lambda^2, \frac{\lambda \sigma^{(n)}}{a} ) )

(possibly with a slightly different choice of {c}); thus we have Gaussian type large deviation estimates for {\lambda} as large as {\sigma^{(n)}/a}, and (slightly better than) exponential decay after that.

In the case when {X_1,\dots,X_n} are iid copies of a random variable {X} of mean {\mu} and variance {\sigma^2} taking values in an interval of length {1}, we have {\mu^{(n)} = n \mu} and {(\sigma^{(n)})^2 \leq n \sigma^2}, and the above inequality simplifies slightly to

\displaystyle  {\bf P}( |\frac{S_n - n\mu}{\sqrt{n} \sigma}| \geq \lambda ) \leq 2 \exp( - c \min( \lambda^2, \lambda \sqrt{n} \sigma \log \frac{\lambda}{\sqrt{n} \sigma} ) ).

Proof: We first begin with some quick reductions. Firstly, by dividing the {X_i} (and {S_n}, {\mu^{(n)}}, and {\sigma^{(n)}}) by {a}, we may normalise {a=1}; by subtracting the mean from each of the {X_i}, we may assume that the {X_i} have mean zero, so that {\mu^{(n)}=0} as well. We also write {\sigma_i^2} for the variance of each {X_i}, so that {(\sigma^{(n)})^2 \geq \sum_{i=1}^n \sigma_i^2}. Our task is to show that

\displaystyle  {\bf P}( |S_n| \geq \lambda \sigma^{(n)} ) \leq 2 \exp( - c \min( \lambda^2, \lambda \sigma^{(n)} \log \frac{\lambda}{\sigma^{(n)}} ) )

for all {\lambda > 0}. We will just prove the upper tail bound

\displaystyle  {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - c \min( \lambda^2, \lambda \sigma^{(n)} \log \frac{\lambda}{\sigma^{(n)}} ) );

the lower tail bound then follows by replacing all {X_i} with their negations {-X_i}, and the claim then follows by summing the two estimates.

We use the “exponential moment method”, previously seen in proving the Chernoff bound (Exercise 47 of Notes 4), in which one uses the exponential moment generating function {t \mapsto {\bf E} e^{tS_n}} of {S_n}. On the one hand, from Markov’s inequality one has

\displaystyle  {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - t \lambda \sigma^{(n)}) {\bf E} e^{t S_n}

for any real parameter {t}. On the other hand, from the joint independence of the {X_i} one has

\displaystyle  {\bf E} e^{t S_n} = \prod_{i=1}^n {\bf E} e^{tX_i}.

Since the {X_i} take values in an interval of length at most {1} and have mean zero, we have {X_i = O(1)} and so {tX_i = O(t)}. This leads to the Taylor expansion

\displaystyle  e^{tX_i} = 1 + t X_i + O(e^{O(t)} t^2 X_i^2 )

so on taking expectations we have

\displaystyle  {\bf E} e^{t X_i} = 1 + O( e^{O(t)} t^2 \sigma_i^2 )

\displaystyle  = \exp( O( e^{O(t)} t^2 \sigma_i^2 ) ).

Putting all this together, we conclude that

\displaystyle  {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - t \lambda \sigma^{(n)} + O(e^{O(t)} t^2 (\sigma^{(n)})^2) ).

If {\lambda \leq 10 \sigma^{(n)}} (say), one can then set {t} to be a small multiple of {\lambda / \sigma^{(n)}} to obtain a bound of the form

\displaystyle  {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - c \lambda^2 ).

If instead {\lambda > 10 \sigma^{(n)}}, one can set {t} to be a small multiple of {\log \frac{\lambda}{\sigma^{(n)}}} to obtain a bound of the form

\displaystyle  {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - c \lambda \sigma^{(n)} \log \frac{\lambda}{\sigma^{(n)}} ).

In either case, the claim follows. \Box
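The exponential moment method used here is also easy to carry out numerically. The following Python sketch (with Bernoulli(1/2) summands as an arbitrary concrete test case) minimises the quantity {e^{-t\lambda \sigma^{(n)}} ({\bf E} e^{t(X-\mu)})^n} over a crude grid of values of {t}, and compares the resulting bound with the exact upper tail:

import math

n, p = 200, 0.5
mu, var = p, p * (1 - p)
sigma_n = math.sqrt(n * var)     # take sigma^{(n)} to be the exact standard deviation of S_n

def mgf_centered(t):
    # E exp( t (X - mu) ) for X ~ Bernoulli(p)
    return (1 - p) * math.exp(-t * mu) + p * math.exp(t * (1 - mu))

def exact_upper_tail(lam):
    # exact P( S_n - n*mu >= lam * sigma^{(n)} )
    cutoff = math.ceil(n * mu + lam * sigma_n)
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(cutoff, n + 1))

def exponential_moment_bound(lam):
    # inf over t > 0 of exp(-t*lam*sigma^{(n)}) * (E exp(t(X-mu)))^n, via a grid search
    return min(math.exp(-t * lam * sigma_n) * mgf_centered(t) ** n
               for t in (0.01 * k for k in range(1, 500)))

for lam in (1.0, 2.0, 3.0):
    print(lam, exact_upper_tail(lam), exponential_moment_bound(lam))

In each case the optimised bound lies above the exact tail probability, as it must, while decaying at a comparable exponential rate.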

The following variant of the above proposition is also useful, in which we get a simpler bound at the expense of worsening the quantity {\sigma^{(n)}} slightly:

Proposition 3 (Cheap Hoeffding inequality) Let {X_1,\dots,X_n} be independent random variables, with each {X_i} taking values in an interval {[a_i,b_i]} with {b_i > a_i}. Write {S_n := X_1+\dots+X_n}, and write {\mu^{(n)}} for the mean of {S_n}, and write

\displaystyle  (\sigma^{(n)})^2 := \sum_{i=1}^n (b_i-a_i)^2.

Then for any {\lambda > 0}, we have

\displaystyle  {\bf P}( |\frac{S_n - \mu^{(n)}}{\sigma^{(n)}}| \geq \lambda ) \leq 2 \exp( - c \lambda^2 ) \ \ \ \ \ (6)

for some absolute constant {c>0}.

In fact one can take {c=2}, a fact known as Hoeffding’s inequality; see Exercise 6 below for a special case of this.

Proof: We again normalise the {X_i} to have mean zero, so that {\mu^{(n)}=0}. We then have {|X_i| \leq b_i-a_i} for each {i}, so by Taylor expansion we have for any real {t} that

\displaystyle  e^{tX_i} =1 + tX_i + O( t^2 (b_i-a_i)^2 \exp( O( t (b_i-a_i) ) ) )

and thus

\displaystyle  {\bf E} e^{tX_i} = 1 + O( t^2 (b_i-a_i)^2 \exp( O( t (b_i-a_i) ) ) )

\displaystyle  = \exp( O( t^2 (b_i-a_i)^2 ) ).

Multiplying in {i}, we then have

\displaystyle  {\bf E} e^{tS_n} = \exp( O( t^2 ( \sigma^{(n)})^2 ) )

and one can now repeat the previous arguments (but without the {e^{O(t)}} factor to deal with). \Box
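For a quick sanity check of the {c=2} constant mentioned above, one can simulate random signs. The following Python sketch (assuming numpy is available, and using the fact that the number of {+1} signs among {n} independent fair signs is binomial) compares the empirical two-sided tail of {\varepsilon_1+\dots+\varepsilon_n} with the bound {2 e^{-\lambda^2/2}} that Hoeffding’s inequality gives in this special case:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 400, 1_000_000
# S_n = eps_1 + ... + eps_n = 2 * (number of +1 signs) - n
sums = 2 * rng.binomial(n, 0.5, size=trials) - n

for lam in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(sums) >= lam * np.sqrt(n))
    hoeffding = 2 * np.exp(-lam ** 2 / 2)   # Hoeffding bound with the sharp constant c = 2
    print(lam, empirical, hoeffding)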

Remark 4 In the above examples, the underlying random variable {X} was assumed to either be restricted to an interval, or to be subgaussian. This type of hypothesis is necessary if one wishes to have estimates on (2) that are similarly subgaussian. For instance, suppose {X} has a zeta distribution

\displaystyle  {\bf P}(X=m) = \frac{1}{\zeta(s)} \frac{1}{m^s}

for some {s>1} and all natural numbers {m}, where {\zeta(s) := \sum_{m=1}^\infty \frac{1}{m^s}}. One can check that this distribution has finite mean and variance for {s > 3}. On the other hand, since we trivially have {S_n \geq X_1}, we have the crude lower bound

\displaystyle  {\bf P}(S_n \geq m) \geq {\bf P}(X = m) = \frac{1}{\zeta(s)} \frac{1}{m^s}

which shows that in this case the expression (2) only decays at a polynomial rate in {\lambda} rather than an exponential or subgaussian rate.

Exercise 5 (Khintchine inequality) Let {\varepsilon_1,\dots,\varepsilon_n} be iid copies of a Bernoulli random variable {\varepsilon} drawn uniformly from {\{-1,+1\}}.

  • (i) For any non-negative reals {a_1,\dots,a_n} and any {0 < p < \infty}, show that

    \displaystyle  {\bf E} |\sum_{i=1}^n \varepsilon_i a_i|^p \leq C_p (\sum_{i=1}^n |a_i|^2)^{p/2}

    for some constant {C_p} depending only on {p}. When {p=2}, show that one can take {C_2=1} and equality holds.

  • (ii) With the hypotheses in (i), obtain the matching lower bound

    \displaystyle  {\bf E} |\sum_{i=1}^n \varepsilon_i a_i|^p \geq c_p (\sum_{i=1}^n |a_i|^2)^{p/2}

    for some {c_p>0} depending only on {p}. (Hint: use (i) and Hölder’s inequality.)

  • (iii) For any {0 < p < \infty} and any functions {f_1,\dots,f_n \in L^p(X)} on a measure space {X = (X, {\mathcal X},\mu)}, show that

    \displaystyle  {\bf E} \| \sum_{i=1}^n \varepsilon_i f_i \|_{L^p(X)}^p \leq C_p \| (\sum_{i=1}^n |f_i|^2)^{1/2} \|_{L^p(X)}^p


    \displaystyle  {\bf E} \| \sum_{i=1}^n \varepsilon_i f_i \|_{L^p(X)}^p \geq c_p \| (\sum_{i=1}^n |f_i|^2)^{1/2} \|_{L^p(X)}^p

    with the same constants {c_p, C_p} as in (i), (ii). When {p=2}, show that one can take {c_2 = C_2=1} and equality holds.

  • (iv) (Marcinkiewicz-Zygmund theorem) The Khintchine inequality is very useful in real analysis; we give one example here. Let {X, Y} be measure spaces, let {1 < p < \infty}, and suppose {T: L^p(X) \rightarrow L^p(Y)} is a linear operator obeying the bound

    \displaystyle  \|Tf\|_{L^p(Y)} \leq A \|f\|_{L^p(X)}

    for all {f \in L^p(X)} and some finite {A}. Show that for any finite sequence {f_1,\dots,f_n \in L^p(X)}, one has the bound

    \displaystyle  \| (\sum_{i=1}^n |Tf_i|^2)^{1/2} \|_{L^p(Y)} \leq C'_p A \| (\sum_{i=1}^n |f_i|^2)^{1/2} \|_{L^p(X)}

    for some constant {C'_p} depending only on {p}. (Hint: test {T} against a random sum {\sum_{i=1}^n \varepsilon_i f_i}.)

  • (v) By using gaussian sums in place of random signs, show that one can take the constant {C'_p} in (iv) to be one. (For simplicity, let us take the functions in {L^p(X), L^p(Y)} to be real valued.)

In this set of notes we have not focused on getting explicit constants in the large deviation inequalities, but it is not too difficult to do so with a little extra work. We give just one example here:

Exercise 6 Let {\varepsilon_1,\dots,\varepsilon_n} be iid copies of a Bernoulli random variable {\varepsilon} drawn uniformly from {\{-1,+1\}}. Show that

\displaystyle  {\bf P}( |\varepsilon_1 + \dots + \varepsilon_n| \geq \lambda \sqrt{n} ) \leq 2 e^{-\lambda^2/2}

for any {\lambda > 0}. (Hint: use the exponential moment method, together with the elementary inequality {\cosh(t) \leq e^{t^2/2}}.)

There are many further large deviation inequalities beyond the ones presented here. For instance, the Azuma-Hoeffding inequality gives Hoeffding-type bounds when the random variables {X_1,\dots,X_n} are not assumed to be jointly independent, but are instead required to form a martingale. Concentration of measure inequalities such as McDiarmid’s inequality handle the situation in which the sum {S_n = X_1 + \dots + X_n} is replaced by a more nonlinear function {F(X_1,\dots,X_n)} of the input variables {X_1,\dots,X_n}. There are also a number of refinements of the Chernoff estimate from the previous notes that are collectively referred to as “Chernoff bounds”. The Bernstein inequalities handle situations in which the underlying random variable {X} is not bounded, but enjoys good moment bounds. See this previous blog post for these inequalities and some further discussion. Last, but certainly not least, there is an extensively developed theory of large deviations which is focused on the precise exponent in the exponential decay rate for tail probabilities such as (2) when {\lambda} is very large (of the order of {\sqrt{n}}); there is also a complementary theory of moderate deviations that gives precise estimates in the regime where {\lambda} is much larger than one, but much less than {\sqrt{n}}, for which we generally expect gaussian behaviour rather than exponentially decaying bounds. These topics are beyond the scope of this course.

— 2. Local limit theorems —

Let {X_1,X_2,\dots} be iid copies of a random variable {X} of mean {\mu} and variance {\sigma^2}, and write {S_n := X_1 + \dots + X_n}. On the one hand, the central limit theorem tells us that {S_n} should behave like the normal distribution {N(n\mu, n\sigma^2)}, which has probability density function {x \mapsto \frac{1}{\sqrt{2\pi n} \sigma} e^{-(x-n\mu)^2 / 2n \sigma^2}}. On the other hand, if {X} is discrete, then {S_n} must also be discrete. For instance, if {X} takes values in the integers {{\bf Z}}, then {S_n} takes values in the integers as well. In this case, we would expect (much as we expect a Riemann sum to approximate an integral) the probability distribution of {S_n} to behave like the probability density function predicted by the central limit theorem, thus we expect

\displaystyle  {\bf P}( S_n = m) \approx \frac{1}{\sqrt{2\pi n} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2} \ \ \ \ \ (7)

for integer {m}. This is not a direct consequence of the central limit theorem (which does not distinguish between continuous or discrete random variables {X}), and in any case is not true in some cases: if {X} is restricted to an infinite subprogression {a + q {\bf Z}} of {{\bf Z}} for some {q>1} and integer {a}, then {S_n} is similarly restricted to the infinite subprogression {na + q{\bf Z}}, so that (7) totally fails when {m} is outside of {na+q{\bf Z}} (and when {m} does lie in {na + q{\bf Z}}, one would now expect the left-hand side of (7) to be about {q} times larger than the right-hand side, to keep the total probability close to {1}). However, this turns out to be the only obstruction:

Theorem 7 (Discrete local limit theorem) Let {X_1,X_2,\dots} be iid copies of an integer-valued random variable {X} of mean {\mu} and variance {\sigma^2}. Suppose furthermore that there is no infinite subprogression {a+q{\bf Z}} of {{\bf Z}} with {q>1} for which {X} takes values almost surely in {a+q{\bf Z}}. Then one has

\displaystyle  {\bf P}( S_n = m) = \frac{1}{\sqrt{2\pi n} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2} + o(1/n^{1/2})

for all {n \geq 1} and all integers {m}, where the error term {o(1/n^{1/2})} is uniform in {m}. In other words, we have

\displaystyle  \sup_{m \in {\bf Z}} n^{1/2} | {\bf P}( S_n = m) - \frac{1}{\sqrt{2\pi n} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2}| \rightarrow 0

as {n \rightarrow \infty}.

Note for comparison that the Berry-Esséen theorem (writing {{\bf P}( S_n = m)} as, say, {{\bf P}( m-1/2 \leq S_n \leq m+1/2 )}) would give (assuming finite third moment) an error term of {O(1/n^{1/2})} instead of {o(1/n^{1/2})}, which would overwhelm the main term which is also of size {O(1/n^{1/2})}.
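To see Theorem 7 in action, one can compute the distribution of {S_n} exactly by repeated convolution and compare it with the gaussian prediction. The following Python sketch does this for {X} uniform on {\{0,1,2\}} (an arbitrary illustrative choice; this {X} is not supported on any subprogression {a+q{\bf Z}} with {q>1}):

import numpy as np

# X uniform on {0, 1, 2}: mean 1, variance 2/3
pmf_X = np.array([1/3, 1/3, 1/3])
mu, var = 1.0, 2/3

n = 400
pmf = np.array([1.0])                 # distribution of S_0
for _ in range(n):
    pmf = np.convolve(pmf, pmf_X)     # exact pmf of S_n, supported on {0, 1, ..., 2n}

m = np.arange(len(pmf))
gauss = np.exp(-(m - n * mu) ** 2 / (2 * n * var)) / np.sqrt(2 * np.pi * n * var)
print("sup over m of n^(1/2) |P(S_n = m) - gaussian|:", np.sqrt(n) * np.max(np.abs(pmf - gauss)))

The printed quantity is small, and becomes smaller still if {n} is increased, in accordance with the theorem.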

Proof: Unlike previous arguments, we do not have the luxury here of using an affine change of variables to normalise to mean zero and variance one, as this would disrupt the hypothesis that {X} takes values in {{\bf Z}}.

Fix {n} and {m}. Since {S_n} and {m} are integers, we have the Fourier identity

\displaystyle  1_{S_n = m} = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{itS_n} e^{-itm}\ dt,

which upon taking expectations and using Fubini’s theorem gives

\displaystyle  {\bf P}(S_n = m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \varphi_{S_n}(t) e^{-itm}\ dt

where {\varphi_{S_n}(t) := {\bf E} e^{itS_n} } is the characteristic function of {S_n}. Expanding {S_n = X_1 + \dots + X_n} and noting that {X_1,\dots,X_n} are iid copies of {X}, we have

\displaystyle  \varphi_{S_n}(t) = \varphi_X(t)^n = e^{in \mu t} \varphi_{X-\mu}(t)^n.

It will be convenient to make the change of variables {t = x/\sqrt{n}}, to obtain

\displaystyle  \sqrt{n} {\bf P}(S_n = m) = \frac{1}{2\pi} \int_{-\pi \sqrt{n}}^{\pi \sqrt{n}} \varphi_{X-\mu}(x/\sqrt{n})^n e^{i\mu x \sqrt{n}} e^{-ixm/\sqrt{n}}\ dx.

As in the Fourier-analytic proof of the central limit theorem, we have

\displaystyle  \varphi_{X-\mu}(t) = 1 - \frac{\sigma^2}{2} t^2 + o(t^2) \ \ \ \ \ (8)

as {t \rightarrow 0}, so by Taylor expansion we have

\displaystyle  \varphi_{X-\mu}(x/\sqrt{n})^n \rightarrow e^{-\sigma^2 x^2 / 2} \ \ \ \ \ (9)

as {n \rightarrow \infty} for any fixed {x}. This suggests (but does not yet prove) that

\displaystyle  \sqrt{n} {\bf P}(S_n = m) \rightarrow \frac{1}{2\pi} \int_{\bf R} e^{i \mu x \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx.

A standard Fourier-analytic calculation gives

\displaystyle  \frac{1}{2\pi} \int_{\bf R} e^{i \mu x \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx = \frac{1}{\sqrt{2\pi} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2}

so it will now suffice to establish that

\displaystyle  \int_{-\pi \sqrt{n}}^{\pi \sqrt{n}} e^{i \mu x \sqrt{n}} \varphi_{X-\mu}(x/\sqrt{n})^n e^{-ixm/\sqrt{n}}\ dx

\displaystyle  = \int_{\bf R} e^{i \mu x \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx + o(1)

uniformly in {m}. From dominated convergence we have

\displaystyle  \int_{|x| \geq \pi \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx = o(1)

so by the triangle inequality, it suffices to show that

\displaystyle  \int_{-\pi \sqrt{n}}^{\pi \sqrt{n}} |\varphi_{X-\mu}(x/\sqrt{n})^n - e^{-\sigma^2 x^2/2}| dx = o(1).

This will follow from (9) and the dominated convergence theorem, as soon as we can dominate the integrands {|\varphi_{X-\mu}(x/\sqrt{n})^n - e^{-\sigma^2 x^2/2}| 1_{|x| \leq \pi \sqrt{n}}} by an absolutely integrable function.

From (8), there is an {\varepsilon > 0} such that

\displaystyle  |\varphi_{X-\mu}(t)| \leq 1 - \frac{\sigma^2}{4} t^2 \leq \exp( - \sigma^2 t^2 / 4)

for all {|t| \leq \varepsilon}, and hence

\displaystyle  |\varphi_{X-\mu}(x/\sqrt{n})^n| \leq \exp( - \sigma^2 x^2 / 4 )

for {|x| \leq \varepsilon \sqrt{n}}. This gives the required domination in the region {|x| \leq \varepsilon \sqrt{n}}, so it remains to handle the region {\varepsilon \sqrt{n} < |x| \leq \pi \sqrt{n}}.

From the triangle inequality we have {|\varphi_{X-\mu}(t)| \leq 1} for all {t}. Actually we have the stronger bound {|\varphi_{X-\mu}(t)| < 1} for {0 < |t| \leq \pi}. Indeed, if {|\varphi_{X-\mu}(t)| = 1} for some such {t}, this would imply that {e^{it(X-\mu)}} is a deterministic constant, which means that {t(X-\mu)} takes values in {a + 2\pi {\bf Z}} for some real {a}, which implies that {X} takes values in {\mu + \frac{a}{t} + \frac{2\pi}{t} {\bf Z}}; since {X} also takes values in {{\bf Z}}, this would place {X} either in a singleton set or in an infinite subprogression of {{\bf Z}} (depending on whether {\frac{2\pi}{t}} and {\mu + \frac{a}{t}} are rational), a contradiction. As {\varphi_{X-\mu}} is continuous and the region {\varepsilon \leq |t| \leq \pi} is compact, there exists {0<c<1} such that {|\varphi_{X-\mu}(t)| \leq c} for all {\varepsilon \leq |t| \leq \pi}. This allows us to dominate {|\varphi_{X-\mu}(x/\sqrt{n})^n - e^{-\sigma^2 x^2/2}|} by {c^n + e^{-\sigma^2 \varepsilon^2 n / 2}} for {\varepsilon \sqrt{n} \leq |x| \leq \pi \sqrt{n}}, which is in turn bounded by {\exp( - \delta x^2 )} for some {\delta > 0} independent of {n}, giving the required domination. \Box

Of course, if the random variable {X} in Theorem 7 did take values almost surely in some subprogression {a + q{\bf Z}}, then either {X} is almost surely constant, or there is a minimal progression {a+q{\bf Z}} with this property (since {q} must divide the difference of any two integers that {X} attains with positive probability). One can then make the affine change of variables {X = a + q X'} (modifying {\mu} and {\sigma^2} appropriately) and apply the above theorem to obtain a similar local limit theorem, which we will not write here. For instance, if {X} is the uniform distribution on {\{-1,1\}}, then this argument gives

\displaystyle  {\bf P}( S_n = m) = \frac{2}{\sqrt{2\pi n}} e^{-m^2 / 2n} + o(1/n^{1/2})

when {m} is an integer of the same parity as {n} (of course, {{\bf P}(S_n=m)} will vanish otherwise). A further affine change of variables handles the case when {X} is not integer valued, but takes values in some other lattice {a+q{\bf Z}}, where {q} and {a} are now real-valued.
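One can check the displayed formula directly in this example, since {{\bf P}(S_n = m) = \binom{n}{(n+m)/2}/2^n} when {m} has the same parity as {n}; a short Python sketch (standard library only, with an arbitrary choice of {n}):

import math

n = 500
for m in (0, 10, 30, 50):            # even values, matching the parity of n
    exact = math.comb(n, (n + m) // 2) / 2 ** n
    approx = 2 / math.sqrt(2 * math.pi * n) * math.exp(-m ** 2 / (2 * n))
    print(m, exact, approx)

The two columns agree closely, and the agreement improves for larger {n}.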

We can complement these local limit theorems with the following result that handles the non-lattice case:

Theorem 8 (Continuous local limit theorem) Let {X_1,X_2,\dots} be iid copies of a real-valued random variable {X} of mean {\mu} and variance {\sigma^2}. Suppose furthermore that there is no infinite progression {a+q{\bf Z}} with {a,q} real for which {X} takes values almost surely in {a+q{\bf Z}}. Then for any {h \geq 1}, one has

\displaystyle  {\bf P}( a \leq S_n \leq a+h ) = \frac{h}{\sqrt{2\pi n} \sigma} e^{-(a-n\mu)^2 / 2n \sigma^2} + o(1/n^{1/2})

for all {n \geq 1} and all {a \in {\bf R}}, where the error term {o(1/n^{1/2})} is uniform in {a} (but may depend on {h}).

Equivalently, if we let {N_n} be a normal random variable with mean {n\mu} and variance {n\sigma^2}, then

\displaystyle  {\bf P}( a \leq S_n \leq a+h ) = {\bf P}( a \leq N_n \leq a+h ) + o(1/n^{1/2})

uniformly in {a}. Again, this can be compared with the Berry-Esséen theorem, which (assuming finite third moment) has an error term of {O(1/n^{1/2})} which is uniform in both {a} and {h}.

Proof: Unlike the discrete case, we have the luxury here of normalising {\mu=0} and {\sigma^2=1}, and we shall now do so.

We first observe that it will suffice to show that

\displaystyle  {\bf E} G(S_n-a) = {\bf E} G(N_n-a) + o(1/n^{1/2}), \ \ \ \ \ (10)

whenever {G: {\bf R} \rightarrow {\bf C}} is a Schwartz function whose Fourier transform

\displaystyle  \hat G(t) := \frac{1}{2\pi} \int_{\bf R} G(x) e^{-itx}\ dx

is compactly supported, and where the {o(1/n^{1/2})} error can depend on {G} but is uniform in {a}. Indeed, if this bound (10) holds, then (after replacing {G} by {|G|^2} to make it positive on some interval, and then rescaling) we obtain a bound of the form

\displaystyle  {\bf P}( a \leq S_n \leq a+h ) \leq O(h / n^{1/2} ) + o(1/n^{1/2}) \ \ \ \ \ (11)

for any {h>0} and {a \in {\bf R}}, where the {o(1/n^{1/2})} term can depend on {h} but not on {a}. Then, by convolving {1_{[0,h]}} by an approximation to the identity of some width {h'} much smaller than {h} with compactly supported Fourier transform, applying (10) to the resulting function, and using (11) to control the error between that function and {1_{[0,h]}}, we see that

\displaystyle  {\bf P}( a \leq S_n \leq a+h ) = {\bf P}( a \leq N_n \leq a+h ) + O(h'/n^{1/2}) + o(1/n^{1/2})

uniformly in {a}, where the {o(1/n^{1/2})} term can depend on {h} and {h'} but is uniform in {a}. Letting {h'} tend slowly to zero, we obtain the claim.

It remains to establish (10). We adapt the argument from the discrete case. By the Fourier inversion formula we may write

\displaystyle  G(x) = \int_{\bf R} \hat G(t) e^{itx}\ dt.

By Fubini’s theorem as before, we thus have

\displaystyle  {\bf E} G(S_n-a) = \int_{\bf R} \hat G(t) \varphi_{S_n}(t) e^{-ita}\ dt

and similarly

\displaystyle  {\bf E} G(N_n-a) = \int_{\bf R} \hat G(t) \varphi_{N_n}(t) e^{-ita}\ dt

so it suffices by the triangle inequality and the boundedness and compact support of {\hat G} to show that

\displaystyle  \int_{|t| \leq A} |\hat G(t)| |\varphi_{S_n}(t) - \varphi_{N_n}(t)|\ dt = o(1/n^{1/2})

for any fixed {A} (where the {o(1/n^{1/2})} term can depend on {A}). We have {\varphi_{S_n}(t) = \varphi_X(t)^n} and {\varphi_{N_n}(t) = e^{-n t^2 / 2}}, so by making the change of variables {x = \sqrt{n} t}, we now need to show that

\displaystyle  \int_{|x| \leq A \sqrt{n}} |\hat G(x/\sqrt{n})| |\varphi_X(x/\sqrt{n})^n - e^{-x^2/2}|\ dx \rightarrow 0

as {n \rightarrow \infty}. But this follows from the argument used to handle the discrete case. \Box
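As a numerical illustration of Theorem 8 (taking the summands to be Exponential(1) variables, a convenient non-lattice choice made purely for illustration, so that {S_n} has an exact Gamma distribution), the following Python sketch estimates {{\bf P}( a \leq S_n \leq a+h )} by Monte Carlo and compares it with {h} times the gaussian density:

import numpy as np

rng = np.random.default_rng(1)
n, trials, h = 100, 1_000_000, 1.0
mu, var = 1.0, 1.0                  # X ~ Exponential(1) has mean 1 and variance 1

# S_n is the sum of n iid Exponential(1) variables, i.e. a Gamma(n, 1) variable
S = rng.gamma(n, 1.0, size=trials)

for a in (90.0, 100.0, 110.0):
    empirical = np.mean((S >= a) & (S <= a + h))
    prediction = h / np.sqrt(2 * np.pi * n * var) * np.exp(-(a - n * mu) ** 2 / (2 * n * var))
    print(a, empirical, prediction)

The agreement is only approximate at this value of {n} (the skewness of the exponential distribution contributes a lower order correction), but improves as {n} increases.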

— 3. The Poisson central limit theorem —

The central limit theorem (after normalising the random variables to have mean zero) studies the fluctuations of sums {\frac{S_n}{\sqrt{n}} = \frac{X_1}{\sqrt{n}} + \dots + \frac{X_n}{\sqrt{n}}} where each individual term {\frac{X_i}{\sqrt{n}}} is quite small (typically of size {O(1/\sqrt{n})}). Now we consider a variant situation, in which one considers a sum {S_n = X_1 + \dots + X_n} of random variables which are usually zero, but occasionally equal to a larger value such as {1}. (This situation arises in many real-life situations when compiling aggregate statistics on rare events, e.g. the number of car crashes in a short period of time.) In these cases, one can get a different distribution than the gaussian distribution, namely a Poisson distribution with some intensity {\lambda} – that is to say, a random variable {X} taking values in the non-negative integers {0,1,2,\dots} with probability distribution

\displaystyle  {\bf P}(X = n) = \frac{\lambda^n}{n!} e^{-\lambda}.

One can check that this distribution has mean {\lambda} and variance {\lambda}.

Theorem 9 (Poisson central limit theorem) Let {(X_{j,n})_{1 \leq j \leq n}} be a triangular array of real random variables, where for each {n}, the variables {X_{1,n},\dots,X_{n,n}} are jointly independent. Assume furthermore that

  • (i) ({X_{j,n}} mostly {0,1}) One has {\sum_{j=1}^n {\bf P}( X_{j,n} \not \in \{0,1\} ) \rightarrow 0} as {n \rightarrow \infty}.
  • (ii) ({X_{j,n}} rarely {1}) One has {\sup_{1 \leq j \leq n} {\bf P}(X_{j,n} = 1 ) \rightarrow 0} as {n \rightarrow \infty}.
  • (iii) (Convergent expectation) One has {\sum_{j=1}^n {\bf P}(X_{j,n} = 1 ) \rightarrow \lambda} as {n \rightarrow \infty} for some {0 < \lambda < \infty}.

Then the random variables {S_n := X_{1,n} + \dots + X_{n,n}} converge in distribution to a Poisson random variable of intensity {\lambda}.

Proof: From hypothesis (i) and the union bound we see that for each {n}, we have that all of the {X_{1,n},\dots,X_{n,n}} lie in {\{0,1\}} with probability {1-o(1)} as {n \rightarrow \infty}. Thus, if we replace each {X_{j,n}} by the restriction {X_{j,n} 1_{X_{j,n} \in \{0,1\}}}, the random variable {S_n} is only modified on an event of probability {o(1)}, which does not affect distributional limits (Slutsky’s theorem). Thus, we may assume without loss of generality that the {X_{j,n}} take values in {\{0,1\}}.

By Exercise 20 of Notes 4, a Poisson random variable of intensity {\lambda} has characteristic function {t \mapsto \exp( \lambda(e^{it}-1) )}. Applying the Lévy convergence theorem (Theorem 27 of Notes 4), we conclude that it suffices to show that

\displaystyle  \varphi_{S_n}(t) \rightarrow \exp( \lambda(e^{it}-1) )

as {n \rightarrow \infty} for any fixed {t}.

Fix {t}. By the independence of the {X_{j,n}}, we may write

\displaystyle  \varphi_{S_n}(t) = \prod_{j=1}^n \varphi_{X_{j,n}}(t).

Since {X_{j,n}} only takes on the values {0} and {1}, we can write

\displaystyle  \varphi_{X_{j,n}}(t) = 1 - p_{j,n} + p_{j,n} e^{it}

where {p_{j,n} := {\bf P}( X_{j,n} = 1 )}. By hypothesis (ii), we have {p_{j,n} = o(1)}, so by using a branch of the complex logarithm that is analytic near {1}, we can write

\displaystyle  \varphi_{S_n}(t) = \exp( \sum_{j=1}^n \log( 1 - p_{j,n} + p_{j,n} e^{it} ) ).

By Taylor expansion we have

\displaystyle  \log( 1 - p_{j,n} + p_{j,n} e^{it} ) = p_{j,n} (e^{it}-1) + O( p_{j,n}^2 )

and hence by (ii), (iii)

\displaystyle  \sum_{j=1}^n \log( 1 - p_{j,n} + p_{j,n} e^{it} ) = \lambda (e^{it}-1) + o(1)

as {n \rightarrow \infty}, and the claim follows. \Box

Exercise 10 Establish the conclusion of Theorem 9 directly from explicit computation of the probabilities {{\bf P}(S_n = k)} in the case when each {X_{j,n}} takes values in {0,1} with {{\bf P}(X_{j,n}=1) = \lambda/n} for some fixed {\lambda > 0}.
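The computation in Exercise 10 can also be previewed numerically. The following Python sketch compares the Binomial({n}, {\lambda/n}) probabilities with the Poisson({\lambda}) probabilities for a moderate value of {n} (the values {\lambda = 3} and {n = 500} are arbitrary illustrative choices):

import math

lam, n = 3.0, 500
p = lam / n
for k in range(8):
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)     # P(S_n = k)
    poisson = lam ** k / math.factorial(k) * math.exp(-lam)   # Poisson(lam) probability
    print(k, binom, poisson)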

The Poisson central limit theorem can be viewed as a degenerate limit of the central limit theorem, as seen by the next two exercises.

Exercise 11 Suppose we replace the hypothesis (iii) in Theorem 9 with the alternative hypothesis that the quantities {\lambda_n := \sum_{j=1}^n {\bf P}(X_{j,n} = 1)} go to infinity as {n \rightarrow \infty}, while leaving hypotheses (i) and (ii) unchanged. Show that {(S_n - \lambda_n) /\sqrt{\lambda_n}} converges in distribution to the normal distribution {N(0,1)}.

Exercise 12 For each {\lambda>0}, let {P_\lambda} be a Poisson random variable with intensity {\lambda}. Show that as {\lambda \rightarrow \infty}, the random variables {(P_\lambda - \lambda)/\sqrt{\lambda}} converge in distribution to the normal distribution {N(0,1)}. Discuss how this is consistent with Theorem 9 and the previous exercise.

— 4. Stable laws —

Let {X} be a real random variable. We say that {X} has a stable law or a stable distribution if for any positive reals {a, b}, there exists a positive real {c} and a real {d} such that {a X_1 + b X_2 \equiv c X + d} whenever {X_1,X_2} are iid copies of {X}. In terms of the characteristic function {\varphi_X} of {X}, we see that {X} has a stable law if for any positive reals {a,b}, there exist a positive real {c} and a real {d} for which we have the functional equation

\displaystyle  \phi_X(at) \phi_X(bt) = \phi_X(ct) e^{idt}

for all real {t}.

For instance, a normally distributed variable {X = N(\mu,\sigma^2)} is stable thanks to Lemma 12 of Notes 4; one can also see this from the characteristic function {\phi_X(t) = e^{i \mu t - \sigma^2 t^2 / 2}}. A Cauchy distribution {X}, with probability density {\frac{1}{\pi} \frac{\gamma}{(x-x_0)^2 + \gamma^2}} can also be seen to be stable, as is most easily seen from the characteristic function {\phi_X(t) = e^{ix_0 t - \gamma |t|}}. As a more degenerate example, any deterministic random variable {X = x_0} is stable. It is possible (though somewhat tedious) to completely classify all the stable distributions, see for instance the Wikipedia entry on these laws for the full classification.

If {X} is stable, and {X_1, X_2, \dots} are iid copies of {X}, then by iterating the stable law hypothesis we see that the sums {S_n := X_1 + \dots + X_n} are all equal in distribution to some affine rescaling {a_n X + b_n} of {X}. For instance, we have {X_1 + X_2 \equiv cX + d} for some {c,d}, and a routine induction then shows that

\displaystyle  X_1 + \dots + X_{2^k} \equiv c^k X + d \frac{c^k - 2^k}{c-2}

for all natural numbers {k} (with the understanding that {\frac{c^k-2^k}{c-2} = k 2^{k-1}} when {c=2}). In particular, the random variables {\frac{S_n - b_n}{a_n}} all have the same distribution as {X}.
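As a concrete check of this iteration, take {X = N(1,1)} (an arbitrary illustrative choice). Here {X_1 + X_2 \equiv N(2,2) \equiv \sqrt{2} X + (2-\sqrt{2})}, so {c = \sqrt{2}} and {d = 2-\sqrt{2}}, and the {k=2} case of the displayed identity predicts {X_1 + \dots + X_4 \equiv c^2 X + d \frac{c^2-4}{c-2} = 2X+2}, which is indeed the {N(4,4)} distribution of the sum of four iid copies of {N(1,1)}. A short Monte Carlo comparison in Python (assuming numpy is available):

import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
samples = rng.normal(1.0, 1.0, size=(N, 4))        # four iid copies of N(1,1)
lhs = samples.sum(axis=1)                          # X_1 + X_2 + X_3 + X_4
c, d = np.sqrt(2.0), 2.0 - np.sqrt(2.0)
rhs = c ** 2 * rng.normal(1.0, 1.0, size=N) + d * (c ** 2 - 4) / (c - 2)   # equals 2*X + 2
print(lhs.mean(), rhs.mean())                      # both close to 4
print(lhs.var(), rhs.var())                        # both close to 4
print(np.quantile(lhs, [0.1, 0.5, 0.9]))
print(np.quantile(rhs, [0.1, 0.5, 0.9]))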

More generally, given two real random variables {X} and {Y}, we say that {X} is in the basin of attraction for {Y} if, whenever {X_1,X_2,\dots} are iid copies of {X} and {S_n := X_1 + \dots + X_n}, there exist constants {a_n > 0} and {b_n} such that {\frac{S_n - b_n}{a_n}} converges in distribution to {Y}. Thus, any stable law is in its own basin of attraction, while the central limit theorem asserts that any random variable of finite variance is in the basin of attraction of a normal distribution. One can check that every random variable lies in the basin of attraction of a deterministic random variable such as {0}, simply by letting {a_n} go to infinity rapidly enough. To avoid this degenerate case, we now restrict to laws that are non-degenerate, in the sense that they are not almost surely constant. Then we have the following useful technical lemma:

Proposition 13 (Convergence of types) Let {X_n} be a sequence of real random variables converging in distribution to a non-degenerate limit {X}. Let {a_n>0} and {b_n} be real numbers such that {Y_n = a_n X_n + b_n} converges in distribution to a non-degenerate limit {Y}. Then {a_n} and {b_n} converge to some finite limits {a,b} respectively, and {aX + b \equiv Y}.

Proof: Suppose first that {a_n} goes to zero. The sequence {X_n} converges in distribution, hence is tight, hence {Y_n - b_n = a_n X_n} converges in probability to zero. In particular, if {Y'_n} is an independent copy of {Y_n}, then {Y_n - Y'_n} converges in probability to zero; but {Y_n - Y'_n} also converges in distribution to {Y-Y'} where {Y'} is an independent copy of {Y}, and {Y-Y'} is not almost surely zero since {Y} is non-degenerate. This is a contradiction. Similarly if {a_n} has any subsequence that goes to zero. We conclude that {a_n} is bounded away from zero. Rewriting {Y_n = a_n X_n + b_n} as {X_n = a_n^{-1} Y_n - a_n^{-1} b_n} and reversing the roles of {X_n} and {Y_n}, we conclude also that {a_n^{-1}} is bounded away from zero, thus {a_n} is bounded.

Since {X_n} is tight and {a_n} is bounded, {a_n X_n} is tight; since {Y_n} is also tight, this implies that {Y_n - a_n X_n = b_n} is tight, that is to say {b_n} is bounded.

Let {(a,b)} be a limit point of the {(a_n,b_n)}. By Slutsky’s theorem, a subsequence of the {Y_n = a_n X_n + b_n} then converges in distribution to {aX + b}, thus {Y \equiv aX+b}. If the limit point {(a,b)} is unique then we are done, so suppose there are two limit points {(a,b)}, {(a',b')}. Thus {aX+b \equiv a'X+b'}, which on rearranging gives {X \equiv cX + d} for some {c>0} and real {d} with {(c,d) \neq (1,0)}.

If {c=1} then on iteration we have {X \equiv X + dn} for any natural number {n}, which clearly leads to a contradiction as {n \rightarrow \infty} since {d \neq 0}. If {c < 1} then iteration gives {X \equiv c^n X + d \frac{c^n-1}{c-1}} for any natural number {n}, which on passing to the limit in distribution as {n \rightarrow \infty} shows that {X} is almost surely equal to the constant {\frac{d}{1-c}}, again contradicting non-degeneracy. If {c>1} then we rewrite {X \equiv cX + d} as {X \equiv c^{-1} X - c^{-1} d} and again obtain a contradiction, and the claim follows. \Box

One can use this proposition to verify that basins of attraction of genuinely distinct laws are disjoint:

Exercise 14 Let {Y} and {Y'} be non-degenerate real random variables. Suppose that a random variable {X} lies in the basin of attraction of both {Y} and {Y'}. Show that there exist {a>0} and real {b} such that {Y' \equiv aY+b}.

If {X} lies in the basin of attraction for a non-degenerate law {Y}, then {\frac{S_n - b_n}{a_n}} converges in distribution to {Y}; since {S_{2n}} is equal in distribution to the sum of two iid copies of {S_n}, we see that {\frac{S_{2n}-2b_n}{a_n}} converges in distribution to the sum {Y_1+Y_2} of two iid copies of {Y}. On the other hand, {\frac{S_{2n}-b_{2n}}{a_{2n}}} converges in distribution to {Y}. Using Proposition 13 we conclude that {Y_1+Y_2 \equiv cY + d} for some {c>0} and real {d}. One can go further and conclude that {Y} in fact has a stable law; see the following exercise. Thus stable laws are the only laws that have a non-empty basin of attraction.

Exercise 15 Let {X} lie in the basin of attraction for a non-degenerate law {Y}.

  • (i) Show that for any iid copies {Y_1,\dots,Y_k} of {Y}, there exists a unique {c_k>0} and {d_k \in {\bf R}} such that {Y_1 + \dots + Y_k \equiv c_k Y + d_k}. Also show that {c_k Y_1 + c_l Y_2 \equiv c_{k+l} Y + d_{k+l} - d_k - d_l} for all natural numbers {k,l}.
  • (ii) Show that the {c_k} are strictly increasing, with {c_{kl} = c_k c_l} for all natural numbers {k,l}. (Hint: study the absolute value {|\phi_Y(t)|} of the characteristic function, using the non-degeneracy of {Y} to ensure that this absolute value is usually strictly less than one.) Also show that {d_k l + c_k d_l = d_{kl}} for all natural numbers {k,l}.
  • (iii) Show that there exists {\alpha > 0} such that {c_k = k^\alpha} for all {k}. (Hint: first show that {\frac{\log c_k}{\log k}} is a Cauchy sequence in {k}.)
  • (iv) If {\alpha = 1}, and {Y_1,Y_2} are iid copies of {Y}, show that {\frac{k}{k+l} Y_1 + \frac{l}{k+l} Y_2 \equiv Y + \theta_{k,l}} for all natural numbers {k,l} and some bounded real {\theta_{k,l}}. Then show that {Y} has a stable law in this case.
  • (v) If {\alpha \neq 1}, show that {d_k = \mu ( k - k^\alpha )} for some real {\mu} and all {k}. Then show that {Y} has a stable law in this case.

Exercise 16 (Classification of stable laws) Let {Y} be a non-degenerate stable law; then {Y} lies in its own basin of attraction, and one can then define {c_k, d_k, \alpha} as in the preceding exercise.

  • (i) If {\alpha \neq 1}, and {\mu} is as in part (v) of the preceding exercise, show that {\phi_Y(t)^k = \phi_Y(k^\alpha t) e^{i \mu (k - k^\alpha) t}} for all {t} and {k}. Then show that

    \displaystyle  \phi_Y(t) = \exp( i t \mu - |ct|^{1/\alpha} (1 - i \beta \hbox{sgn}(t) ) )

    for some real {c, \beta}. (One can use the identity {F_Y(-t) = \overline{F_Y(t)}} to restrict attention to the case of positive {t}.)

  • (ii) Now suppose {\alpha = 1}. Show that {d_{k+l} = d_k + d_l + O(k+l)} for all {k,l} (where the implied constant in the {O()} notation is allowed to depend on {Y}). Conclude that {d_k = O(k \log k)} for all {k}.
  • (iii) We continue to assume {\alpha = 1}. Show that {d_k = -\beta k \log k} for some real number {\beta}. (Hint: first show this when {k} is a power of a fixed natural number {k_0 > 1} (with {\beta} possibly depending on {k_0}). Then use the estimates from part (ii) to show that {\beta} does not actually depend on {k_0}. (One may need to invoke the Dirichlet approximation theorem to show that for any given {k_0,k_1 > 1}, one can find a power of {k_0} that is somewhat close to a power of {k_1}.))
  • (iv) We continue to assume {\alpha=1}. Show that {\phi_Y(t)^k = \phi_Y(kt) e^{-i \beta k t \log k}} for all {t} and {k}. Then show that

    \displaystyle  \phi_Y(t) = \exp( it \mu - |ct| (1 - i \beta \hbox{sgn}(t) \log |t| ) )

    for all {t} and some real {c, \mu}.

It is also possible to determine which choices of parameters {\mu,c,\alpha,\beta} are actually achievable by some random variable {Y}, but we will not do so here.

It is possible to associate a central limit theorem to each stable law, which precisely determines their basin of attraction. We will not do this in full generality, but just illustrate the situation for the Cauchy distribution.

Exercise 17 Let {X} be a real random variable which is symmetric (that is, {X} has the same distribution as {-X}) and obeys the distribution identity

\displaystyle  {\mathbf P}( |X| \geq x ) = L(x) \frac{2}{\pi x}

for all {x>0}, where {L: (0,+\infty) \rightarrow (0,+\infty)} is a function which converges to {1} as {x \rightarrow \infty} (in particular, {L} is slowly varying in the sense that {L(cx)/L(x) \rightarrow 1} as {x \rightarrow \infty} for all {c>0}).

  • (i) Show that

    \displaystyle  {\bf E}( e^{itX} ) = 1 - |t| + o(|t|)

    as {t \rightarrow 0}, where {o(|t|)} denotes a quantity such that {o(|t|)/|t| \rightarrow 0} as {t \rightarrow 0}. (You may need to establish the identity {\int_0^\infty \frac{1 - \cos(x)}{x^2}\ dx = \frac{\pi}{2}}, which can be done by contour integration.)

  • (ii) Let {X_1,X_2,\dots} be iid copies of {X}. Show that {\frac{X_1+\dots+X_n}{n}} converges in distribution to a copy of the standard Cauchy distribution (i.e., to a random variable with probability density function {x \mapsto \frac{1}{\pi} \frac{1}{1+x^2}}).
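A quick way to visualise part (ii) is to take {X} to be the standard Cauchy distribution itself (which has the stated tail behaviour, with {L(x) \rightarrow 1}) and compare quantiles of {\frac{X_1+\dots+X_n}{n}} with those of a single standard Cauchy variable; a Python sketch assuming numpy is available:

import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 20_000
means = rng.standard_cauchy(size=(trials, n)).mean(axis=1)   # samples of (X_1+...+X_n)/n
single = rng.standard_cauchy(size=trials)                    # samples of a single Cauchy variable
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(means, qs))
print(np.quantile(single, qs))
print(np.tan(np.pi * (np.array(qs) - 0.5)))   # exact standard Cauchy quantiles, for reference

All three rows should be close to each other (indeed, for the Cauchy distribution the empirical mean of {n} samples has exactly the same distribution as a single sample).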

Filed under: 275A - probability theory, math.PR, Uncategorized Tagged: central limit theorem, large deviation inequality, local limit theorems, stable laws

Tommaso DorigoCERN And LIP Openings For Graduate Students In Physics - Good $$$

Have you recently obtained a Master's degree in a scientific discipline? Are you fascinated by particle physics? Do you have an interest in Machine Learning developments, artificial intelligence, and all that? Or are you just well versed in Statistical Analysis? Do you want to be paid twice as much as I am for attending a PhD? If the above applies to you, you are certainly advised to read on.


Chad Orzel090/366: Kids These Days

We had our usual tv time on Sunday morning at my parents’, with the kids alternating picking what they watched. When it came around to SteelyKid’s turn, she opted for MythBusters, which wasn’t available on demand, but she has several episodes on her tablet. Of course, if SteelyKid was going to watch video on her tablet, then The Pip had to watch video on his, which led to this shot:

SteelyKid and The Pip watching video on their tablets.

And also blissful, blessed quiet, a gigantic improvement over the sounds of “Paw Patrol,” which was The Pip’s cartoon of choice on this trip. And, well, ugh.

Anyway, that’s what’s up with the kids these days and their newfangled personal multimedia devices. And music that’s just noise– NOISE, I tell you…

John PreskillDiscourse in Delft

A camel strolled past, yards from our window in the Applied-Sciences Building.

I hadn’t expected to see camels at TU Delft, aka the Delft University of Technology, in Holland. I breathed, “Oh!” and turned to watch until the camel followed its turbaned leader out of sight. Nelly Ng, the PhD student with whom I was talking, followed my gaze and laughed.

Nelly works in Stephanie Wehner’s research group. Stephanie—a quantum cryptographer, information theorist, thermodynamicist, and former Caltech postdoc—was kind enough to host me for half August. I arrived at the same time as TU Delft’s first-year undergrads. My visit coincided with their orientation. The orientation involved coffee hours, team-building exercises, and clogging the cafeteria whenever the Wehner group wanted lunch.

And, as far as I could tell, a camel.

Not even a camel could unseat Nelly’s and my conversation. Nelly, postdoc Mischa Woods, and Stephanie are the Wehner-group members who study quantum and small-scale thermodynamics. I study quantum and small-scale thermodynamics, as Quantum Frontiers stalwarts might have tired of hearing. The four of us exchanged perspectives on our field.

Mischa knew more than Nelly and I about clocks; Nelly knew more about catalysis; and I knew more about fluctuation relations. We’d read different papers. We’d proved different theorems. We explained the same phenomena differently. Nelly and I—with Mischa and Stephanie, when they could join us—questioned and answered each other almost perpetually, those two weeks.

We talked in our offices, over lunch, in the group discussion room, and over tea at TU Delft’s Quantum Café. We emailed. We talked while walking. We talked while waiting for Stephanie to arrive so that she could talk with us.


The site of many a tête-à-tête.

The copiousness of the conversation drained me. I’m an introvert, formerly “the quiet kid” in elementary school. Early some mornings in Delft, I barricaded myself in the visitors’ office. Late some nights, I retreated to my hotel room or to a canal bank. I’d exhausted my supply of communication; I had no more words for anyone. Which troubled me, because I had to finish a paper. But I regret not one discussion, for three reasons.

First, we relished our chats. We laughed together, poked fun at ourselves, commiserated about calculations, and confided about what we didn’t understand.

We helped each other understand, second. As I listened to Mischa or as I revised notes about a meeting, a camel would stroll past a window in my understanding. I’d see what I hadn’t seen before. Mischa might be explaining which quantum states represent high-quality clocks. Nelly might be explaining how a quantum state ξ can enable a state ρ to transform into a state σ. I’d breathe, “Oh!” and watch the mental camel follow my interlocutor through my comprehension.

Nelly’s, Mischa’s, and Stephanie’s names appear in the acknowledgements of the paper I’d worried about finishing. The paper benefited from their explanations and feedback.

Third, I left Delft with more friends than I’d had upon arriving. Nelly, Mischa, and I grew to know each other, to trust each other, to enjoy each other’s company. At the end of my first week, Nelly invited Mischa and me to her apartment for dinner. She provided pasta; I brought apples; and Mischa brought a sweet granola-and-seed mixture. We tasted and enjoyed more than we would have separately.


Dinner with Nelly and Mischa.

I’ve written about how Facebook has enhanced my understanding of, and participation in, science. Research involves communication. Communication can challenge us, especially many of us drawn to science. Let’s shoulder past the barrier. Interlocutors point out camels—and hot-air balloons, and lemmas and theorems, and other sources of information and delight—that I wouldn’t spot alone.

With gratitude to Stephanie, Nelly, Mischa, the rest of the Wehner group (with whom I enjoyed talking), QuTech and TU Delft.

During my visit, Stephanie and Delft colleagues unveiled the “first loophole-free Bell test.” Their paper sent shockwaves (AKA camels) throughout the quantum community. Scott Aaronson explains the experiment here.

Terence Tao275A, Notes 3: The weak and strong law of large numbers

One of the major activities in probability theory is studying the various statistics that can be produced from a complex system with many components. One of the simplest possible systems one can consider is a finite sequence {X_1,\dots,X_n} or an infinite sequence {X_1,X_2,\dots} of jointly independent scalar random variables, with the case when the {X_i} are also identically distributed (i.e. the {X_i} are iid) being a model case of particular interest. (In some cases one may consider a triangular array {(X_{n,i})_{1 \leq i \leq n}} of scalar random variables, rather than a finite or infinite sequence.) There are many statistics of such sequences that one can study, but one of the most basic such statistics are the partial sums

\displaystyle  S_n := X_1 + \dots + X_n.

The first fundamental result about these sums is the law of large numbers (or LLN for short), which comes in two formulations, weak (WLLN) and strong (SLLN). To state these laws, we first must define the notion of convergence in probability.

Definition 1 Let {X_n} be a sequence of random variables taking values in a separable metric space {R = (R,d)} (e.g. the {X_n} could be scalar random variables, taking values in {{\bf R}} or {{\bf C}}), and let {X} be another random variable taking values in {R}. We say that {X_n} converges in probability to {X} if, for every radius {\varepsilon > 0}, one has {{\bf P}( d(X_n,X) > \varepsilon ) \rightarrow 0} as {n \rightarrow \infty}. Thus, if {X_n, X} are scalar, we have {X_n} converging to {X} in probability if {{\bf P}( |X_n-X| > \varepsilon ) \rightarrow 0} as {n \rightarrow \infty} for any given {\varepsilon > 0}.

The measure-theoretic analogue of convergence in probability is convergence in measure.

It is instructive to compare the notion of convergence in probability with almost sure convergence. It is easy to see that {X_n} converges almost surely to {X} if and only if, for every radius {\varepsilon > 0}, one has {{\bf P}( \bigvee_{n \geq N} (d(X_n,X)>\varepsilon) ) \rightarrow 0} as {N \rightarrow \infty}; thus, roughly speaking, convergence in probability is good for controlling how a single random variable {X_n} is close to its putative limiting value {X}, while almost sure convergence is good for controlling how the entire tail {(X_n)_{n \geq N}} of a sequence of random variables is close to its putative limit {X}.

We have the following easy relationships between convergence in probability and almost sure convergence:

Exercise 2 Let {X_n} be a sequence of scalar random variables, and let {X} be another scalar random variable.

  • (i) If {X_n \rightarrow X} almost surely, show that {X_n \rightarrow X} in probability. Give a counterexample to show that the converse does not necessarily hold.
  • (ii) Suppose that {\sum_n {\bf P}( |X_n-X| > \varepsilon ) < \infty} for all {\varepsilon > 0}. Show that {X_n \rightarrow X} almost surely. Give a counterexample to show that the converse does not necessarily hold.
  • (iii) If {X_n \rightarrow X} in probability, show that there is a subsequence {X_{n_j}} of the {X_n} such that {X_{n_j} \rightarrow X} almost surely.
  • (iv) If {X_n,X} are absolutely integrable and {{\bf E} |X_n-X| \rightarrow 0} as {n \rightarrow \infty}, show that {X_n \rightarrow X} in probability. Give a counterexample to show that the converse does not necessarily hold.
  • (v) (Urysohn subsequence principle) Suppose that every subsequence {X_{n_j}} of {X_n} has a further subsequence {X_{n_{j_k}}} that converges to {X} in probability. Show that {X_n} also converges to {X} in probability.
  • (vi) Does the Urysohn subsequence principle still hold if “in probability” is replaced with “almost surely” throughout?
  • (vii) If {X_n} converges in probability to {X}, and {F: {\bf R} \rightarrow {\bf R}} or {F: {\bf C} \rightarrow {\bf C}} is continuous, show that {F(X_n)} converges in probability to {F(X)}. More generally, if for each {i=1,\dots,k}, {X^{(i)}_n} is a sequence of scalar random variables that converge in probability to {X^{(i)}}, and {F: {\bf R}^k \rightarrow {\bf R}} or {F: {\bf C}^k \rightarrow {\bf C}} is continuous, show that {F(X^{(1)}_n,\dots,X^{(k)}_n)} converges in probability to {F(X^{(1)},\dots,X^{(k)})}. (Thus, for instance, if {X_n} and {Y_n} converge in probability to {X} and {Y} respectively, then {X_n + Y_n} and {X_n Y_n} converge in probability to {X+Y} and {XY} respectively.)
  • (viii) (Fatou’s lemma for convergence in probability) If {X_n} are non-negative and converge in probability to {X}, show that {{\bf E} X \leq \liminf_{n \rightarrow \infty} {\bf E} X_n}.
  • (ix) (Dominated convergence in probability) If {X_n} converge in probability to {X}, and one almost surely has {|X_n| \leq Y} for all {n} and some absolutely integrable {Y}, show that {{\bf E} X_n} converges to {{\bf E} X}.

Exercise 3 Let {X_1,X_2,\dots} be a sequence of scalar random variables converging in probability to another random variable {X}.

  • (i) Suppose that there is a random variable {Y} which is independent of {X_i} for each individual {i}. Show that {Y} is also independent of {X}.
  • (ii) Suppose that the {X_1,X_2,\dots} are jointly independent. Show that {X} is almost surely constant (i.e. there is a deterministic scalar {c} such that {X=c} almost surely).

We can now state the weak and strong law of large numbers, in the model case of iid random variables.

Theorem 4 (Law of large numbers, model case) Let {X_1, X_2, \dots} be an iid sequence of copies of an absolutely integrable random variable {X} (thus the {X_i} are independent and all have the same distribution as {X}). Write {\mu := {\bf E} X}, and for each natural number {n}, let {S_n} denote the random variable {S_n := X_1 + \dots + X_n}.

  • (i) (Weak law of large numbers) The random variables {S_n/n} converge in probability to {\mu}.
  • (ii) (Strong law of large numbers) The random variables {S_n/n} converge almost surely to {\mu}.

Informally: if {X_1,\dots,X_n} are iid with mean {\mu}, then {X_1 + \dots + X_n \approx \mu n} for {n} large. Clearly the strong law of large numbers implies the weak law, but the weak law is easier to prove (and has somewhat better quantitative estimates). There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, or when the random variable {X} is not absolutely integrable, or if one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants below the fold.
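For instance, here is a toy Python simulation of the law of large numbers for rolls of a fair six-sided die (so that {\mu = 3.5}); the sample sizes below are arbitrary:

import random

random.seed(0)
mu = 3.5                                   # mean of a fair six-sided die roll
for n in (10, 100, 10_000, 1_000_000):
    s = sum(random.randint(1, 6) for _ in range(n))
    print(n, s / n, abs(s / n - mu))

The empirical means {S_n/n} settle down near {\mu} as {n} grows, as the law of large numbers predicts.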

It is instructive to compare the law of large numbers with what one can obtain from the Kolmogorov zero-one law, discussed in Notes 2. Observe that if the {X_n} are real-valued, then the limit superior {\limsup_{n \rightarrow \infty} S_n/n} and {\liminf_{n \rightarrow \infty} S_n/n} are tail random variables in the sense that they are not affected if one changes finitely many of the {X_n}; in particular, events such as {\limsup_{n \rightarrow \infty} S_n/n > t} are tail events for any {t \in {\bf R}}. From this and the zero-one law we see that there must exist deterministic quantities {-\infty \leq \mu_- \leq \mu_+ \leq +\infty} such that {\limsup_{n \rightarrow \infty} S_n/n = \mu_+} and {\liminf_{n \rightarrow \infty} S_n/n = \mu_-} almost surely. The strong law of large numbers can then be viewed as the assertion that {\mu_- = \mu_+ = \mu} when {X} is absolutely integrable. On the other hand, the zero-one law argument does not require absolute integrability (and one can replace the denominator {n} by other functions of {n} that go to infinity as {n \rightarrow \infty}).

The law of large numbers asserts, roughly speaking, that the theoretical expectation {\mu} of a random variable {X} can be approximated by taking a large number of independent samples {X_1,\dots,X_n} of {X} and then forming the empirical mean {S_n/n = \frac{X_1+\dots+X_n}{n}}. This ability to approximate the theoretical statistics of a probability distribution through empirical data is one of the basic starting points for mathematical statistics, though this is not the focus of the course here. The tendency of statistics such as {S_n/n} to cluster closely around their mean value {\mu} is the simplest instance of the concentration of measure phenomenon, which is of tremendous significance not only within probability, but also in applications of probability to disciplines such as statistics, theoretical computer science, combinatorics, random matrix theory and high dimensional geometry. We will not discuss these topics much in this course, but see this previous blog post for some further discussion.

There are several ways to prove the law of large numbers (in both forms). One basic strategy is to use the moment method – controlling statistics such as {S_n/n} by computing moments such as the mean {{\bf E} S_n/n}, variance {{\bf E} |S_n/n - {\bf E} S_n/n|^2}, or higher moments such as {{\bf E} |S_n/n - {\bf E} S_n/n|^k} for {k = 4, 6, \dots}. The joint independence of the {X_i} make such moments fairly easy to compute, requiring only some elementary combinatorics. A direct application of the moment method typically requires one to make a finite moment assumption such as {{\bf E} |X|^k < \infty}, but as we shall see, one can reduce fairly easily to this case by a truncation argument.

For the strong law of large numbers, one can also use methods relating to the theory of martingales, such as stopping time arguments and maximal inequalities; we present some classical arguments of Kolmogorov in this regard.

— 1. The moment method —

We begin by using the moment method to establish both the strong and weak law of large numbers for sums of iid random variables, under additional moment hypotheses.

We first make a very simple observation: in order to prove the weak or strong law of large numbers for complex variables, it suffices to do so for real variables, as the complex case follows from the real case after taking real and imaginary parts. Thus we shall restrict attention henceforth to real random variables, in order to avoid some unnecessary complications involving complex conjugation.

Let {X_1,X_2,X_3,\dots} be a sequence of iid copies of a scalar random variable {X}, and define the partial sums {S_n := X_1 + \dots + X_n}. Suppose that {X} is absolutely integrable, with expectation (or mean) {{\bf E} X = \mu}. Then we can use linearity of expectation to compute the expectation (or first moment) of {S_n}:

\displaystyle  {\bf E} S_n = {\bf E} X_1 + \dots + {\bf E} X_n

\displaystyle  = n \mu.

In particular, the expectation of {S_n/n} is {\mu}. This looks consistent with the strong and weak law of large numbers, but does not immediately imply these laws. However, thanks to Markov’s inequality, we do at least get the following very weak bound

\displaystyle  {\bf P}( S_n \geq \lambda \mu n ) \leq \frac{1}{\lambda} \ \ \ \ \ (1)

for any {\lambda > 0}, in the case that {X} is unsigned and absolutely integrable. Thus, in the unsigned case at least, we see that {S_n/n} usually doesn’t get much larger than the mean {\mu}. We will refer to (1) as a first moment bound on {S_n}, as it was obtained primarily through a computation of the first moment {{\bf E} S_n} of {S_n}.

Now we turn to second moment bounds on {S_n}, obtained through computations of second moments such as {{\bf E} |S_n|^2} or {{\bf Var}(S_n) = {\bf E} |S_n - {\bf E} S_n|^2}. It will be convenient to normalise the mean {\mu} to equal zero, by replacing each {X_i} with {X_i - \mu} (and {X} by {X - \mu}), so that {S_n} gets replaced by {S_n - \mu n} (and {S_n/n} by {S_n/n-\mu}). With this normalisation, we see that to prove the strong or weak law of large numbers, it suffices to do so in the mean zero case {\mu=0}. (On the other hand, if {X} is unsigned, then normalising {X} in this fashion will almost certainly destroy the unsigned property, so it is not always desirable to perform this normalisation.)

Suppose that {X} has finite second moment (i.e. {{\bf E} |X|^2 < \infty}, that is to say {X} is square-integrable) and has been normalised to have mean zero. We write the variance {{\bf Var}(X) = {\bf E} |X|^2} as {\sigma^2}. The first moment calculation then shows that {S_n} has mean zero. Now we compute the variance of {S_n}, which in the mean zero case is simply {{\bf E} |S_n|^2}; note from the triangle inequality that this quantity is finite. By linearity of expectation, we have

\displaystyle  {\bf Var}(S_n) = {\bf E} |X_1 + \dots + X_n|^2

\displaystyle  = \sum_{1 \leq i,j \leq n} {\bf E} X_i X_j.

(All expressions here are absolutely integrable thanks to the Cauchy-Schwarz inequality.) If {i=j}, then the term {{\bf E} X_i X_j} is equal to {{\bf E} |X|^2 = \sigma^2}. If {i \neq j}, then by hypothesis {X_i} and {X_j} are independent and mean zero, and thus

\displaystyle {\bf E} X_i X_j = ({\bf E} X_i) ({\bf E} X_j) = 0.

Putting all this together, we obtain

\displaystyle  {\bf Var}(S_n) = n \sigma^2

or equivalently

\displaystyle  {\bf Var}(S_n/n) = \frac{1}{n} \sigma^2.

This bound was established in the mean zero case, but it is clear that it also holds in general, since subtracting a constant from a random variable does not affect its variance. Thus we see that while {S_n/n} has the same mean {\mu} as {X}, it has a much smaller variance: {\sigma^2/n} in place of {\sigma^2}. This is the first demonstration of the concentration of measure effect that comes from combining many independent random variables {X_1,\dots,X_n}. (At the opposite extreme to the independent case, suppose we took {X_1,\dots,X_n} to all be exactly the same random variable: {X_1 = \dots = X_n = X}. Then {S_n/n = X} has exactly the same mean and variance as {X}. Decorrelating the {X_1,\dots,X_n} does not affect the mean of {S_n}, but produces significant cancellations that reduce the variance.)

If we insert this variance bound into Chebyshev’s inequality, we obtain the bound

\displaystyle  {\bf P}( |\frac{S_n}{n}-\mu| \geq \varepsilon ) \leq \frac{1}{n} \frac{\sigma^2}{\varepsilon^2} \ \ \ \ \ (2)

for any natural number {n} and {\varepsilon > 0}, whenever {X} has mean {\mu} and a finite variance {\sigma^2 < \infty}. The right-hand side goes to zero as {n \rightarrow \infty} for fixed {\varepsilon}, so we have in fact established the weak law of large numbers in the case that {X} has finite variance.

Note that (2) implies that

\displaystyle  \frac{\sqrt{n}}{\omega(n)} |\frac{S_n}{n}-\mu|

converges to zero in probability whenever {\omega(n)} is a function of {n} that goes to infinity as {n \rightarrow \infty}. Thus for instance

\displaystyle  \frac{\sqrt{n}}{\log n} |\frac{S_n}{n}-\mu| \rightarrow 0

in probability. Informally, this means that {S_n} typically does not stray by much more than {\sqrt{n}} from its mean {n\mu}, or equivalently that {S_n/n} typically stays within about {1/\sqrt{n}} of {\mu}. This intuition will be reinforced in the next set of notes when we study the central limit theorem and related results such as the Chernoff inequality. (It is also supported by the law of the iterated logarithm, which we will probably not be able to get to in this set of notes.)
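
(As a quick numerical aside, not part of the original notes: the following Python sketch, which assumes numpy is available and uses arbitrary sample sizes, illustrates the {\sigma^2/n} scaling of the variance of {S_n/n} for uniform random variables.)

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0 / 12.0                      # variance of the Uniform(0,1) distribution
for n in [10, 100, 1000, 10000]:
    # 1000 independent experiments, each forming the empirical mean of n samples
    means = rng.random((1000, n)).mean(axis=1)
    print(n, means.var(), sigma2 / n)    # empirical Var(S_n/n) versus sigma^2/n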

One can hope to use (2) and the Borel-Cantelli lemma (Exercise 2(ii)) to also obtain the strong law of large numbers in the second moment case, but unfortunately the quantities {\frac{1}{n} \frac{\sigma^2}{\varepsilon^2}} are not summable in {n}. To resolve this issue, we will go to higher moments than the second moment. One could calculate third moments such as {{\bf E} S_n^3}, but this turns out to not convey too much information (unless {X_n} is unsigned) because of the signed nature of {S_n^3}; the expression {{\bf E} |S_n|^3} would in principle convey more usable information, but is difficult to compute as {|S_n|^3} is not a polynomial combination of the {X_i}. Instead, we move on to the fourth moment. Again, we normalise {X} to have mean {\mu=0}, and now assume a finite fourth moment {{\bf E} |X|^4 < \infty} (which, by the Hölder or Jensen inequalities, implies that all lower moments such as {{\bf E} |X|^2} are finite). We again use {\sigma^2 = {\bf E} |X|^2} to denote the variance of {X}. We can expand

\displaystyle  {\bf E} |S_n|^4 = {\bf E} |X_1 + \dots + X_n|^4

\displaystyle  = \sum_{1 \leq i,j,k,l \leq n} {\bf E} X_i X_j X_k X_l.

Note that all expectations here are absolutely integrable by Hölder’s inequality and the hypothesis of finite fourth moment. The correlation {{\bf E} X_i X_j X_k X_l} looks complicated, but fortunately it simplifies greatly in most cases. Suppose for instance that {i} is distinct from {j,k,l}, then {X_i} is independent of {(X_j,X_k,X_l)} (even if some of the {j,k,l} are equal to each other) and so

\displaystyle  {\bf E} X_i X_j X_k X_l = ({\bf E} X_i) ({\bf E} X_j X_k X_l) = 0

since {{\bf E} X_i = {\bf E} X = 0}. Similarly for permutations. This leaves only a few quadruples {(i,j,k,l)} for which {{\bf E} X_i X_j X_k X_l} could be non-zero: the three cases {i=j \neq k=l}, {i = k \neq j = l}, {i=l \neq j=k} where each of the indices {i,j,k,l} is paired up with exactly one other index; and the diagonal case {i=j=k=l}. If for instance {i=j \neq k=l}, then

\displaystyle  {\bf E} X_i X_j X_k X_l = {\bf E} X_i^2 X_k^2

\displaystyle  = ({\bf E} X_i^2) ({\bf E} X_k^2)

\displaystyle  = \sigma^2 \times \sigma^2.

Similarly for the cases {i=k \neq j=l} and {i=l \neq j=k}, which gives a total contribution of {3 n(n-1) \sigma^4} to {{\bf E} |S_n|^4}. Finally, when {i=j=k=l}, then {{\bf E} X_i X_j X_k X_l = {\bf E} X^4}, and there are {n} contributions of this form to {{\bf E} |S_n|^4}. We conclude that

\displaystyle  {\bf E} |S_n|^4 = 3n(n-1) \sigma^4 + n {\bf E} |X|^4

and hence by Markov’s inequality

\displaystyle  {\bf P} ( |\frac{S_n}{n}| > \varepsilon ) \leq \frac{3n(n-1) \sigma^4 + n {\bf E} |X|^4}{\varepsilon^4 n^4}

for any {\varepsilon>0}. If we remove the normalisation {\mu=0}, we conclude that

\displaystyle  {\bf P} ( |\frac{S_n}{n}-\mu| > \varepsilon ) \leq \frac{3n(n-1) \sigma^4 + n {\bf E} |X-\mu|^4}{\varepsilon^4 n^4}

The right-hand side decays like {O(1/n^2)}, which is now summable in {n} (in contrast to (2)). Thus we may now apply the Borel-Cantelli lemma and conclude the strong law of large numbers in the case when one has bounded fourth moment {{\bf E} |X|^4 < \infty}.
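
(Again as an aside not in the original notes, one can sanity-check the fourth moment identity numerically; the sketch below, assuming numpy, uses unbiased signs, for which {\sigma^2 = {\bf E}|X|^4 = 1} and the identity predicts {3n(n-1)+n}.)

import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 200000
X = rng.choice([-1.0, 1.0], size=(trials, n))   # mean zero, sigma^2 = 1, E|X|^4 = 1
S = X.sum(axis=1)
print((S ** 4).mean(), 3 * n * (n - 1) + n)     # empirical E|S_n|^4 versus 3n(n-1)sigma^4 + n E|X|^4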

One can of course continue to compute higher and higher moments of {S_n} (assuming suitable finite moment hypotheses on {X}), though as one can already see from the fourth moment calculation, the computations become increasingly combinatorial in nature. We will pursue this analysis more in the next set of notes, when we discuss the central limit theorem. For now, we turn to some applications and variants of the moment method (many of which are taken from Durrett’s book).

We begin with two quick applications of the weak law of large numbers to topics outside of probability. We first give an explicit version of the Weierstrass approximation theorem, which asserts that continuous functions on (say) the unit interval {[0,1]} can be approximated by polynomials.

Proposition 5 (Approximation by Bernstein polynomials) Let {f: [0,1] \rightarrow {\bf R}} be a continuous function. Then the Bernstein polynomials

\displaystyle  f_n(x) := \sum_{i=0}^n \binom{n}{i} x^i (1-x)^{n-i} f(\frac{i}{n})

converge uniformly to {f} as {n \rightarrow \infty}.

Proof: We first establish the pointwise bound {f_n(p) \rightarrow f(p)} for each {0 \leq p \leq 1}. Fix {0 \leq p \leq 1}, and let {X_1,X_2,X_3,\dots} be iid Bernoulli variables (i.e. variables taking values in {\{0,1\}}) with each {X_i} equal to {1} with probability {p}. The mean of the {X_i} is clearly {p}, and the variance is bounded crudely by {1} (in fact it is {p(1-p)}), so if we set {S_n := X_1 + \dots + X_n} then by the weak law of large numbers for random variables of finite second moment, we see that {S_n/n} converges in probability to {p} (note that {S_n/n} is clearly dominated by {1}). By the dominated convergence theorem in probability, we conclude that {{\bf E} f(S_n/n)} converges to {f(p)}. But from direct computation we see that {S_n/n} takes values in {\{0, 1/n, \dots, n/n\}}, with each {i/n} being attained with probability {\binom{n}{i} p^i (1-p)^{n-i}} (i.e. {S_n} has a binomial distribution), and so from the definition of the Bernstein polynomials {f_n} we see that {{\bf E} f(S_n/n) = f_n(p)}. This concludes the pointwise convergence claim.

To establish the uniform convergence, we use the proof of the weak law of large numbers, rather than the statement of that law, to get the desired uniformity in the parameter {p}. For a given {p}, we see from (2) that

\displaystyle  {\bf P}( |\frac{S_n}{n}-p| \geq \delta ) \leq \frac{1}{n} \frac{p(1-p)}{\delta^2}

for any {\delta > 0}. On the other hand, as {f} is continuous on {[0,1]}, it is uniformly continuous, and so for any {\varepsilon>0} there exists a {\delta>0} such that {|f(x)-f(y)| < \varepsilon} whenever {x,y \in [0,1]} with {|x-y| < \delta}. For such an {\varepsilon} and {\delta}, we conclude that

\displaystyle  {\bf P}( |f(\frac{S_n}{n})-f(p)| \geq \varepsilon ) \leq \frac{1}{n} \frac{p(1-p)}{\delta^2}.

On the other hand, being continuous on {[0,1]}, {f} must be bounded in magnitude by some bound {M}, so that {|f(\frac{S_n}{n})-f(p)| \leq 2M}. This leads to the upper bound

\displaystyle  {\bf E} |f(\frac{S_n}{n})-f(p)| \leq \varepsilon + \frac{2M}{n} \frac{p(1-p)}{\delta^2}

and thus by the triangle inequality and the identity {{\bf E} f(S_n/n) = f_n(p)}

\displaystyle  |f_n(p) - f(p)| \leq \varepsilon + \frac{2M}{n} \frac{p(1-p)}{\delta^2}.

Since {p(1-p)} is bounded by (say) {1}, and {\varepsilon} can be made arbitrarily small, we conclude that {f_n} converges uniformly to {f} as required. \Box
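
(A small computational illustration, not part of the notes: the following Python sketch, using only the standard library, evaluates the Bernstein polynomials of a test function and reports the sup-norm error on a grid; the test function and grid are arbitrary choices.)

from math import comb

def bernstein(f, n, x):
    # f_n(x) = sum_{i=0}^n C(n,i) x^i (1-x)^(n-i) f(i/n)
    return sum(comb(n, i) * x**i * (1 - x)**(n - i) * f(i / n)
               for i in range(n + 1))

f = lambda x: abs(x - 0.5)                       # continuous but not smooth
grid = [k / 200 for k in range(201)]
for n in [10, 50, 200]:
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(n, err)                                # sup-norm error on the grid shrinks as n grows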

The other application of the weak law of large numbers is to the geometry of high-dimensional cubes, giving the rather unintuitive conclusion that most of the volume of the high-dimensional cube is contained in a thin annulus.

Proposition 6 Let {\varepsilon > 0}. Then, for sufficiently large {n}, a proportion of at least {1-\varepsilon} of the cube {[-1,1]^n} (by {n}-dimensional Lebesgue measure) is contained in the annulus {\{ x \in {\bf R}^n: (1-\varepsilon) \sqrt{n/3} \leq |x| \leq (1+\varepsilon) \sqrt{n/3} \}}.

This proposition already indicates that high-dimensional geometry can behave in a manner quite differently from what one might naively expect from low-dimensional geometric intuition; one needs to develop a rather distinct high-dimensional geometric intuition before one can accurately make predictions in large dimensions.

Proof: Let {X_1,X_2,\dots} be iid random variables drawn uniformly from {[-1,1]}. Then the random vector {(X_1,\dots,X_n)} is uniformly distributed on the cube {[-1,1]^n}. The variables {X_1^2,X_2^2,\dots} are also iid, and (by the change of variables formula) have mean

\displaystyle  \int_{-1}^1 x^2\ \frac{dx}{2} = \frac{1}{3}.

Hence, by the weak law of large numbers, the quantity {\frac{X_1^2+\dots+X_n^2}{n}} converges in probability to {\frac{1}{3}}, so in particular the probability

\displaystyle  {\bf P}( |\sqrt{X_1^2+\dots+X_n^2} - \sqrt{n/3}| > \varepsilon \sqrt{n/3})

goes to zero as {n} goes to infinity. But this quantity is precisely the proportion of {[-1,1]^n} that lies outside the annulus {\{ x \in {\bf R}^n: (1-\varepsilon) \sqrt{n/3} \leq |x| \leq (1+\varepsilon) \sqrt{n/3} \}}, and the claim follows. \Box
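
(As a numerical aside, not in the notes and assuming numpy: the sketch below samples points uniformly from the cube and measures the fraction landing in the annulus; the dimensions and sample counts are arbitrary.)

import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
for n in [10, 100, 1000]:
    pts = rng.uniform(-1.0, 1.0, size=(5000, n))          # 5000 uniform points in [-1,1]^n
    r = np.linalg.norm(pts, axis=1)
    lo, hi = (1 - eps) * np.sqrt(n / 3), (1 + eps) * np.sqrt(n / 3)
    print(n, np.mean((r >= lo) & (r <= hi)))              # fraction inside the annulus tends to 1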

The first and second moment method are very general, and apply to sums {S_n = X_1 + \dots + X_n} of random variables {X_1,\dots,X_n} that do not need to be identically distributed, or even independent (although the bounds can get weaker and more complicated if one strays too far from these hypotheses). For instance, it is clear from linearity of expectation that {S_n} has mean

\displaystyle  {\bf E} S_n = {\bf E} X_1 + \dots + {\bf E} X_n \ \ \ \ \ (3)

(assuming of course that {X_1,\dots,X_n} are absolutely integrable) and variance

\displaystyle  {\bf Var}(S_n) = {\bf Var}(X_1) + \dots + {\bf Var}(X_n) + \sum_{1 \leq i,j \leq n: i \neq j} \hbox{Cov}(X_i,X_j)

(assuming now that {X_1,\dots,X_n} are square-integrable). (For the latter claim, it is convenient, as before, to first normalise each of the {X_i} to have mean zero.) If the {X_1,\dots,X_n} are pairwise independent in addition to being square-integrable, then all the covariances vanish, and we obtain additivity of the variance:

\displaystyle  {\bf Var}(S_n) = {\bf Var}(X_1) + \dots + {\bf Var}(X_n). \ \ \ \ \ (4)

Remark 7 Viewing the variance as the square of the standard deviation, the identity (4) can be interpreted as a rigorous instantiation of the following informal principle of square root cancellation: if one has a sum {S_n = X_1 + \dots + X_n} of random (or pseudorandom) variables that “oscillate” in the sense that their mean is either zero or close to zero, and each {X_i} has an expected magnitude of about {a_i} (in the sense that a statistic such as the standard deviation of {X_i} is comparable to {a_i}), and the {X_i} “behave independently”, then the sum {S_n} is expected to have a magnitude of about {(\sum_{i=1}^n |a_i|^2)^{1/2}}. Thus for instance a sum of {n} unbiased signs {X_i \in \{-1,+1\}} would be expected to have magnitude about {\sqrt{n}} if the {X_i} do not exhibit strong correlations with each other. This principle turns out to be remarkably broadly applicable (at least as a heuristic, if not as a rigorous argument), even in situations for which no randomness is evident (e.g. in considering the type of exponential sums that occur in analytic number theory). We will see some further instantiations of this principle in later notes.
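
(A quick illustration of the square root cancellation heuristic, not in the original notes, assuming numpy and arbitrary trial counts.)

import numpy as np

rng = np.random.default_rng(0)
for n in [100, 1000, 10000]:
    signs = rng.choice([-1.0, 1.0], size=(500, n))   # 500 independent sums of n unbiased signs
    S = signs.sum(axis=1)
    print(n, np.sqrt((S ** 2).mean()), np.sqrt(n))   # root-mean-square size of S_n versus sqrt(n)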

These identities, together with Chebyshev’s inequality, already gives some useful control on many statistics, including some which are not obviously of the form of a sum of independent random variables. A classic example of this is the coupon collector problem, which we formulate as follows. Let {N} be a natural number, and let {Y_1, Y_2, \dots} be an infinite sequence of “coupons”, which are iid and uniformly distributed from the finite set {\{1,\dots,N\}}. Let {T_N} denote the first time at which one has collected all {N} different types of coupons, thus {T_N} is the first natural number for which the set {\{Y_1,\dots,Y_{T_N}\}} attains the maximal cardinality of {N} (or {T_N=\infty} if no such natural number exists, though it is easy to see that this is a null event, indeed note from the strong law of large numbers that almost surely one will collect infinitely many of each coupon over time). The question is then to describe the behaviour of {T_N} as {N} gets large.

At first glance, {T_N} does not seem to be easily describable as the sum of many independent random variables. However, if one looks at it the right way, one can see such a structure emerge (and much of the art of probability is in finding useful and different ways of thinking of the same random variable). Namely, for each {i=0,\dots,N}, let {T_{i,N}} denote the first time one has collected {i} coupons out of {N}, thus {T_{i,N}} is the first non-negative integer such that {\{Y_1,\dots,Y_{T_{i,N}}\}} has cardinality {i}, with

\displaystyle  0 = T_{0,N} < T_{1,N} < \dots < T_{N,N} = T_N.

If we then write {X_i := T_{i,N} - T_{i-1,N}} for {i=1,\dots,N}, then the {X_i} take values in the natural numbers, and we have the telescoping sum

\displaystyle  T_N = X_1 + X_2 + \dots + X_N.

Remarkably, the random variables {X_1,\dots,X_N} have a simple structure:

Proposition 8 The random variables {X_1,\dots,X_N} are jointly independent, and each {X_i} has a geometric distribution with parameter {\frac{N-i+1}{N}}, in the sense that

\displaystyle  {\bf P}(X_i = j) = (1 - \frac{N-i+1}{N})^{j-1} \frac{N-i+1}{N}

for {j=1,2,\dots} and {i=1,\dots,N}.

The joint independence of the {X_1,\dots,X_N} reflects the “Markovian” or “memoryless” nature of a certain process relating to the coupon collector problem, and can be easily established once one has understood the concept of conditional expectation, but the further exploration of these concepts will have to be deferred to the course after this one (which I will not be teaching or writing notes for). But as the coupon collecting problem is so simple, we shall proceed instead by direct computation.

Proof: It suffices to show that

\displaystyle  {\bf P}(X_1 = j_1 \wedge \dots \wedge X_N = j_N) = \prod_{i=1}^N (1 - \frac{N-i+1}{N})^{j_i-1} \frac{N-i+1}{N}

\displaystyle  = N^{-(j_1+\dots+j_N)} \prod_{i=1}^N (N-i+1) (i-1)^{j_i-1}

for any choice of natural numbers {j_1,\dots,j_N}. In order for the event {X_1 = j_1 \wedge \dots \wedge X_N = j_N} to hold, the first coupon {Y_1} can be arbitrary, but the coupons {Y_2,\dots,Y_{j_1}} have to be equal to {Y_1}; then {Y_{j_1+1}} must be one of the remaining {N-1} elements of {\{1,\dots,N\}} not equal to {Y_1}, and {Y_{j_1+2},\dots,Y_{j_1+j_2}} must be from the two-element set {\{Y_1,Y_{j_1+1}\}}; and so on and so forth up to {Y_{j_1+\dots+j_N}}. The claim then follows from a routine application of elementary combinatorics to count all the possible values for the tuple {(Y_1,\dots,Y_{j_1+\dots+j_N})} of the above form and dividing by the total number {N^{j_1+\dots+j_N}} of such tuples. \Box

Exercise 9 Show that if {X} is a geometric distribution with parameter {p} for some {0 < p \leq 1} (thus {{\bf P}(X=j) = (1-p)^{j-1} p} for all {j}) then {X} has mean {\frac{1}{p}} and variance {\frac{1-p}{p^2}}.

From the above proposition and exercise as well as (3), (4) we see that

\displaystyle  {\bf E} T_N = \sum_{i=1}^N \frac{N}{N-i+1}


\displaystyle  {\bf Var} T_N = \sum_{i=1}^N (\frac{N}{N-i+1})^2 \frac{i-1}{N}.

From the integral test (and crudely bounding {(i-1)/N = O(1)}) one can thus obtain the bounds

\displaystyle  {\bf E} T_N = N \log N + O(N)


\displaystyle  {\bf Var} T_N = O(N^2)

where we use the usual asymptotic notation of denoting by {O(X)} any quantity bounded in magnitude by a constant multiple {CX} of {X}. From Chebyshev’s inequality we thus see that

\displaystyle  {\bf P}( |T_N - N \log N| \geq \lambda N ) = O( \lambda^{-2} )

for any {\lambda > 0} (note the bound is trivial unless {\lambda} is large). This implies in particular that {T_N/(N \log N)} converges to {1} in probability as {N \rightarrow \infty} (assuming that our underlying probability space can model a separate coupon collector problem for each choice of {N}). Thus, roughly speaking, we see that one expects to take about {N \log N} units of time to collect all {N} coupons.
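
(A simulation sketch, not part of the notes, using only the Python standard library; it compares the empirical average of {T_N} with {N \log N} for a few arbitrary values of {N}.)

import random
from math import log

rng = random.Random(0)

def coupon_time(N):
    # draw uniform coupons from {0,...,N-1} until every type has appeared
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        t += 1
    return t

for N in [10, 100, 1000]:
    avg = sum(coupon_time(N) for _ in range(200)) / 200
    print(N, avg, N * log(N))                     # empirical E T_N versus N log N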

Another application of the weak and strong law of large numbers (even with the moment hypotheses currently imposed on these laws) is a converse to the Borel-Cantelli lemma in the jointly independent case:

Exercise 10 (Second Borel-Cantelli lemma) Let {E_1,E_2,\dots} be a sequence of jointly independent events. If {\sum_{n=1}^\infty {\bf P}(E_n) = \infty}, show that almost surely an infinite number of the {E_n} hold simultaneously. (Hint: compute the mean and variance of {S_n= \sum_{i=1}^n 1_{E_i}}. One can also compute the fourth moment if desired, but it is not necessary to do so for this result.)

One application of the second Borel-Cantelli lemma has the colourful name of the “infinite monkey theorem“:

Exercise 11 (Infinite monkey theorem) Let {X_1,X_2,\dots} be iid random variables drawn uniformly from a finite alphabet {A}. Show that almost surely, every finite word {a_1 \dots a_k} of letters {a_1,\dots,a_k} in the alphabet appears infinitely often in the string {X_1 X_2 X_3 \dots}.

In the usual formulation of the weak or strong law of large numbers, we draw the sums {S_n} from a single infinite sequence {X_1,X_2,\dots} of iid random variables. One can generalise the situation slightly by working instead with sums from rows of a triangular array, which are jointly independent within rows but not necessarily across rows:

Exercise 12 (Triangular arrays) Let {(X_{i,n})_{i,n \in {\bf N}: i \leq n}} be a triangular array of scalar random variables {X_{i,n}}, such that for each {n}, the row {X_{1,n},\dots,X_{n,n}} is a collection of independent random variables. For each {n}, we form the partial sums {S_n = X_{1,n} + \dots + X_{n,n}}.

  • (i) (Weak law) If all the {X_{i,n}} have mean {\mu} and {\sup_{i,n} {\bf E} |X_{i,n}|^2 < \infty}, show that {S_n/n} converges in probability to {\mu}.
  • (ii) (Strong law) If all the {X_{i,n}} have mean {\mu} and {\sup_{i,n} {\bf E} |X_{i,n}|^4 < \infty}, show that {S_n/n} converges almost surely to {\mu}.

Note that the weak and strong law of large numbers established previously corresponds to the case {X_{i,n} = X_i} when the triangular array collapses to a single sequence {X_1,X_2,\dots} of iid random variables.

We now illustrate the use of the moment method and the law of large numbers in two important examples of random structures, namely random graphs and random matrices.

Exercise 13 For a natural number {n} and a parameter {0 \leq p \leq 1}, define an Erdös-Renyi graph on {n} vertices with parameter {p} to be a random graph {(V,E)} on a (deterministic) vertex set {V} of {n} vertices (thus {(V,E)} is a random variable taking values in the discrete space of all {2^{\binom{n}{2}}} possible graphs one can place on {V}) such that the events {\{i,j\} \in E} for unordered pairs {\{i,j\}} in {V} are jointly independent and each occur with probability {p}.

For each {n}, let {(V_n,E_n)} be an Erdös-Renyi graph on {n} vertices with parameter {1/2} (we do not require the graphs to be independent of each other).

  • (i) If {|E_n|} is the number of edges in {(V_n,E_n)}, show that {|E_n|/\binom{n}{2}} converges almost surely to {1/2}. (Hint: use Exercise 12.)
  • (ii) If {|T_n|} is the number of triangles in {(V_n,E_n)} (i.e. the set of unordered triples {\{i,j,k\}} in {V_n} such that {\{i,j\}, \{i,k\}, \{j,k\} \in E_n}), show that {|T_n|/\binom{n}{3}} converges in probability to {1/8}. (Note: there is not quite enough joint independence here to directly apply the law of large numbers, however the second moment method still works nicely.)
  • (iii) Show in fact that {|T_n|/\binom{n}{3}} converges almost surely to {1/8}. (Note: in contrast with the situation with the strong law of large numbers, the fourth moment does not need to be computed here.)

Exercise 14 For each {n}, let {A_n = (a_{ij,n})_{1 \leq i,j \leq n}} be a random {n \times n} matrix (i.e. a random variable taking values in the space {{\bf R}^{n \times n}} or {{\bf C}^{n \times n}} of {n \times n} matrices) such that the entries {a_{ij,n}} of {A_n} are jointly independent in {i,j} and take values in {\{-1,+1\}} with a probability of {1/2} each. (Such matrices are known as random sign matrices.) We do not assume any independence for the sequence {A_1,A_2,\dots}.

  • (i) Show that the random variables {\hbox{tr} A_n A_n^* / n^2} are deterministically equal to {1}, where {A_n^*} denotes the adjoint (which, in this case, is also the transpose) of {A_n} and {\hbox{tr}} denotes the trace (sum of the diagonal entries) of a matrix.
  • (ii) Show that for any natural number {k}, the quantities {{\bf E} \hbox{tr} (A_n A_n^*)^k / n^{k+1}} are bounded uniformly in {n} (i.e. they are bounded by a quantity {C_k} that can depend on {k} but not on {n}). (You may wish to first work with simple cases like {k=2} or {k=3} to gain intuition.)
  • (iii) If {\|A_n\|_{op}} denotes the operator norm of {A_n}, and {\varepsilon > 0}, show that {\|A_n\|_{op} / n^{1/2+\varepsilon}} converges almost surely to zero, and that {\|A_n\|_{op} / n^{1/2-\varepsilon}} diverges almost surely to infinity. (Hint: use the spectral theorem to relate {\|A_n\|_{op}} with the quantities {\hbox{tr} (A_n A_n^*)^k}.)

One can obtain much sharper information on quantities such as the operator norm of a random matrix; see this previous blog post for further discussion.

Exercise 15 The Cramér random model for the primes is a random subset {{\mathcal P}} of the natural numbers with {1 \not \in {\mathcal P}}, {2 \in {\mathcal P}}, and the events {n \in {\mathcal P}} for {n=3,4,\dots} being jointly independent with {{\bf P}(n \in {\mathcal P}) = \frac{1}{\log n}} (the restriction to {n \geq 3} is to ensure that {\frac{1}{\log n}} is less than {1}). It is a simple, yet reasonably convincing, probabilistic model for the primes {\{2,3,5,7,\dots\}}, which can be used to provide heuristic confirmations for many conjectures in analytic number theory. (It can be refined to give what are believed to be more accurate predictions; see this previous blog post for further discussion.)

  • (i) (Probabilistic prime number theorem) Prove that almost surely, the quantity {\frac{1}{x/\log x} |\{ n \leq x: n \in {\mathcal P}\}|} converges to one as {x \rightarrow \infty}.
  • (ii) (Probabilistic Riemann hypothesis) Show that if {\varepsilon > 0}, then the quantity

    \displaystyle  \frac{1}{x^{1/2+\varepsilon}} ( |\{ n \leq x: n \in {\mathcal P} \}| - \int_2^x \frac{dt}{\log t} )

    converges almost surely to zero as {x \rightarrow \infty}.

  • (iii) (Probabilistic twin prime conjecture) Show that almost surely, there are an infinite number of elements {p} of {{\mathcal P}} such that {p+2} also lies in {{\mathcal P}}.
  • (iv) (Probabilistic Goldbach conjecture) Show that almost surely, all but finitely many natural numbers {n} are expressible as the sum of two elements of {{\mathcal P}}.

Probabilistic methods are not only useful for getting heuristic predictions about the primes; they can also give rigorous results about the primes. We give one basic example, namely a probabilistic proof (due to Turán) of a theorem of Hardy and Ramanujan, which roughly speaking asserts that a typical large number {n} has about {\log\log n} distinct prime factors.

Exercise 16 (Hardy-Ramanujan theorem) Let {x \geq 100} be a natural number (so in particular {\log \log x \geq 1}), and let {n} be a natural number drawn uniformly at random from {1} to {x}. Assume Mertens’ theorem

\displaystyle  \sum_{p \leq x} \frac{1}{p} =\log\log x + O(1)

for all {x \geq 100}, where the sum is over primes up to {x}.

  • (i) Show that the random variable {\sum_{p \leq x^{1/10}} 1_{p|n}} (where {1_{p|n}} is {1} when {p} divides {n} and {0} otherwise, and the sum is over primes up to {x^{1/10}}) has mean {\log\log x + O(1)} and variance {O( \log\log x)}. (Hint: compute (up to reasonable error) the means, variances and covariances of the random variables {1_{p|n}}.)
  • (ii) If {\omega(n)} denotes the number of distinct prime factors of {n}, show that {\frac{\omega(n)}{\log\log n}} converges to {1} in probability as {x \rightarrow \infty}. (Hint: first show that {\omega(n) = \sum_{p \leq x^{1/10}} 1_{p|n} + O(1)}.) More precisely, show that

    \displaystyle  \frac{\omega(n) - \log\log n}{g(n) \sqrt{\log\log n}}

    converges in probability to zero, whenever {g: {\bf N} \rightarrow {\bf R}} is any function such that {g(n)} goes to infinity as {n \rightarrow \infty}.

Exercise 17 (Shannon entropy) Let {A} be a finite non-empty set of some cardinality {|A|}, and let {X} be a random variable taking values in {A}. Define the Shannon entropy {{\bf H}(X)} to be the quantity

\displaystyle  {\bf H}(X) := \sum_{x \in A} {\bf P}(X = x) \log \frac{1}{{\bf P}(X=x)}

with the convention that {0 \log \frac{1}{0}=0}. (In some texts, the logarithm to base {2} is used instead of the natural logarithm {\log}.)

  • (i) Show that {0 \leq {\bf H}(X) \leq \log |A|}. (Hint: use Jensen’s inequality.) Determine when the equality {{\bf H}(X) = \log |A|} holds.
  • (ii) Let {\varepsilon > 0} and {n} be a natural number. Let {X_1,\dots,X_n} be {n} iid copies of {X}, thus {\vec X := (X_1,\dots,X_n)} is a random variable taking values in {A^n}, and the distribution {\mu_{\vec X}} is a probability measure on {A^n}. Let {\Omega \subset A^n} denote the set

    \displaystyle  \Omega := \{ \vec x \in A^n: \exp(-(1+\varepsilon) n {\bf H}(X)) \leq \mu_{\vec X}(\{\vec x\})

    \displaystyle \leq \exp(-(1-\varepsilon) n {\bf H}(X)) \}.

    Show that if {n} is sufficiently large, then

    \displaystyle  {\bf P}( \vec X \in \Omega) \geq 1-\varepsilon


    \displaystyle  \exp((1-2\varepsilon) n {\bf H}(X)) \leq |\Omega|

    \displaystyle \leq \exp((1+2\varepsilon) n {\bf H}(X)).

    (Hint: use the weak law of large numbers to understand the number of times each element {x} of {A} occurs in {\vec X}.)

Thus, roughly speaking, while {\vec X} in principle takes values in all of {A^n}, in practice it is concentrated in a set of size about {\exp( n {\bf H}(X) )}, and is roughly uniformly distributed on that set. This is the beginning of the microstate interpretation of Shannon entropy, but we will not develop the theory of Shannon entropy further in this course.
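
(A numerical aside, not in the notes and assuming numpy: the sketch below samples words {\vec X} from a small arbitrary three-letter distribution and estimates the probability of landing in the set {\Omega}, using the fact that membership in {\Omega} amounts to the log-probability per letter of the sampled word lying within {\varepsilon {\bf H}(X)} of {{\bf H}(X)}.)

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])                  # arbitrary distribution on a 3-letter alphabet
H = -(p * np.log(p)).sum()                     # Shannon entropy (natural logarithm)
eps, n, trials = 0.1, 2000, 2000
words = rng.choice(len(p), size=(trials, n), p=p)
logmu = np.log(p)[words].sum(axis=1)           # log of mu_X(word) for each sampled word
inside = np.abs(-logmu / n - H) <= eps * H
print(H, inside.mean())                        # fraction of sampled words lying in Omega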

— 2. Truncation —

The weak and strong laws of large numbers have been proven under additional moment assumptions (of finite second moment and finite fourth moment respectively). To remove these assumptions, we use the simple but effective truncation method, decomposing a general scalar random variable {X} into a truncated component such as {X 1_{|X| \leq M}} for some suitable threshold {M}, and a tail {X 1_{|X|>M}}. The main term {X 1_{|X| \leq M}} is bounded and thus has all moments finite. The tail {X 1_{|X| > M}} will not have much better moment properties than the original variable {X}, but one can still hope to make it “small” in various ways. There is a tradeoff regarding the selection of the truncation parameter {M}: if {M} is too large, then the truncated component {X 1_{|X| \leq M}} has poor estimates, but if {M} is too small then the tail {X 1_{|X|>M}} causes trouble.

Let’s first see how this works with the weak law of large numbers. As before, we assume we have an iid sequence {X_1,X_2,\dots} of copies of a real absolutely integrable random variable {X}, with no additional finite moment hypotheses. We write {S_n = X_1 + \dots + X_n}. At present, we cannot take variances of the {X_i} or {S_n}, and so the second moment method is not directly available. But we now perform a truncation; it turns out that a good choice of threshold here is {n}, thus we write {X_i = X_{i,\leq n} + X_{i, > n}} where {X_{i, \leq n} := X_i 1_{|X_i| \leq n}} and {X_{i,>n} := X_i 1_{|X_i| > n}}, and then similarly decompose {S_n = S_{n,\leq} + S_{n,>}} where

\displaystyle  S_{n,\leq} := X_{1,\leq n} + \dots + X_{n,\leq n}


\displaystyle  S_{n,>} := X_{1,>n} + \dots + X_{n,>n}.

Write {\mu := {\bf E} X}. We wish to show that

\displaystyle  {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \rightarrow 0 \ \ \ \ \ (5)

as {n \rightarrow \infty} for any given {\varepsilon > 0} (not depending on {n}). By the triangle inequality, we can split

\displaystyle  {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \leq {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) + {\bf P}( S_{n,>} \neq 0 ).

We begin by studying {S_{n,\leq}/n}.

The random variables {X_{1,\leq n},\dots,X_{n,\leq n}} are iid with mean {\mu_{\leq n} := {\bf E} X 1_{|X| \leq n}} and variance at most

\displaystyle  {\bf Var}(X 1_{|X| \leq n}) \leq {\bf E} (X 1_{|X| \leq n})^2

\displaystyle  = {\bf E} |X|^2 1_{|X| \leq n}.

Thus, {S_{n,\leq}/n} has mean {\mu_{\leq n}} and variance at most {\frac{1}{n} {\bf E} |X|^2 1_{|X| \leq n}}. By dominated convergence, {\mu_{\leq n} \rightarrow \mu} as {n \rightarrow \infty}, so for sufficiently large {n} we can bound

\displaystyle  {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) \leq {\bf P}( |S_{n,\leq}/n - \mu_{\leq n}| \geq \varepsilon/2 )

and hence by the Chebyshev inequality

\displaystyle  {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) \leq \frac{1}{(\varepsilon/2)^2} {\bf E} \frac{|X|^2}{n} 1_{|X| \leq n}.

Observe that {\frac{|X|^2}{n} 1_{|X| \leq n}} is bounded surely by the absolutely integrable {|X|}, and goes to zero as {n \rightarrow \infty}, so by dominated convergence we conclude that

\displaystyle  {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) \rightarrow 0

as {n \rightarrow \infty} (keeping {\varepsilon} fixed).

To handle {S_{n,>}}, we observe that each {X_{i,>n}} is only non-zero with probability {{\bf P}(|X| > n)}, and hence by subadditivity

\displaystyle  {\bf P}( |S_{n,>}| > 0 ) \leq n {\bf P}(|X|>n)

\displaystyle  = {\bf E} n 1_{|X|>n}.

By dominated convergence again, {{\bf E} n 1_{|X|>n} \rightarrow 0} as {n \rightarrow \infty}, and thus

\displaystyle  {\bf P}( S_{n,>} \neq 0 ) \rightarrow 0.

Putting all this together, we conclude (5) as required. This concludes the proof of the weak law of large numbers (in the iid case) for arbitrary absolutely integrable {X}. For future reference, we observe that the above arguments give the bound

\displaystyle  {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \leq \frac{1}{(\varepsilon/2)^2} {\bf E} \frac{|X|^2}{n} 1_{|X| \leq n} + {\bf E} n 1_{|X|>n} \ \ \ \ \ (6)

whenever {n} is sufficiently large depending on {\varepsilon}.

Due to the reliance on the dominated convergence theorem, the above argument does not provide any uniform rate of decay in (5). Indeed there is no such uniform rate. Consider for instance the sum {S_n = X_1 + \dots + X_n} where {X_1,\dots,X_n} are iid random variables that equal {n} with probability {1/n} and {0} with probability {1-1/n}. Then the {X_i} are unsigned and all have mean {1}, but {S_n} vanishes with probability {(1-1/n)^n}, which converges to {1/e} as {n \rightarrow \infty}. Thus we see that the probability that {S_n/n} deviates from its mean value of {1} by at least {1} is bounded away from zero. This is not inconsistent with the weak law of large numbers, because the underlying random variable {X} depends on {n} in this example. However, it rules out an estimate of the form

\displaystyle  {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \leq c_{\varepsilon,M}(n)

that holds uniformly whenever {X} obeys the bound {{\bf E} |X| \leq M}, and {c_{\varepsilon,M}(n)} is a quantity that goes to zero as {n \rightarrow \infty} for a fixed choice of {\varepsilon,M}. (Contrast this with (2), which does provide such a uniform bound if one also assumes a bound on the second moment {{\bf E} |X|^2}.)
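
(One can see the counterexample concretely in a quick simulation, not from the notes, assuming numpy: the sketch estimates {{\bf P}(S_n = 0)} for this example and compares it with {1/e}.)

import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    hits = rng.random((5000, n)) < 1.0 / n             # X_i equals n exactly when the entry is True
    print(n, np.mean(~hits.any(axis=1)), np.exp(-1))   # P(S_n = 0) stays near 1/e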

One can ask what happens to the {S_n} when the underlying random variable {X} is not absolutely integrable. In the unsigned case, we have

Exercise 18 Let {X_1,X_2,\dots} be iid copies of an unsigned random variable {X} with infinite mean, and write {S_n := X_1 + \dots + X_n}. Show that {S_n/n} diverges to infinity in probability, in the sense that {{\bf P}( S_n/n \geq M ) \rightarrow 1} as {n \rightarrow \infty} for any fixed {M < \infty}.

The above exercise shows that {S_n} grows faster than {n} (in probability, at least) when {X} is unsigned with infinite mean, but this does not completely settle the question of the precise rate at which {S_n} does grow. We will not answer this question in full generality here, but content ourselves with analysing a classic example of the unsigned infinite mean setting, namely the Saint Petersburg paradox. The paradox can be formulated as follows. Suppose we have a lottery whose payout {X} takes values in the powers of two {2,2^2,2^3,\dots} with

\displaystyle  {\bf P}( X = 2^i ) = 2^{-i}

for {i=1,2,\dots}. The question is to work out what is the “fair” or “breakeven” price to pay for a lottery ticket. If one plays this lottery {n} times, the total payout is {S_n = X_1 + \dots + X_n} where {X_1,X_2,\dots} are independent copies of {X}, so the question boils down to asking what one expects the value of {S_n/n} to be. If {X} were absolutely integrable, the strong or weak law of large numbers would indicate that {{\bf E} X} is the fair price to pay, but in this case we have

\displaystyle  {\bf E} X = \sum_{i=1}^\infty 2^i \times 2^{-i} = \sum_{i=1}^\infty 1 = \infty.

This suggests, paradoxically, that any finite price for this lottery, no matter how high, would be a bargain!

To clarify this paradox, we need to get a better understanding of the random variable {S_n/n}. For a given {n}, we let {M} be a truncation parameter to be chosen later, and split {X_i = X_{i,\leq M} + X_{i,>M}} where {X_{i,\leq M} := X_i 1_{X_i \leq M}} and {X_{i,>M} := X_i 1_{X_i>M}} as before (we no longer need the absolute value signs here as all random variables are unsigned). Since the {X_i} take values in powers of two, we may as well also set {M=2^m} to be a power of two. We split {S_n = S_{n,\leq} + S_{n,>}} where

\displaystyle  S_{n,\leq} = X_{1,\leq M} + \dots + X_{n,\leq M}


\displaystyle  S_{n,>} = X_{1,> M} + \dots + X_{n,>M}.

The random variable {X 1_{X \leq M}} can be computed to have mean

\displaystyle  {\bf E} X 1_{X \leq M} = \sum_{i=1}^m 2^i \times 2^{-i} = m

and we can upper bound the variance by

\displaystyle  {\bf Var}(X 1_{X \leq M}) \leq {\bf E} (X 1_{X \leq M})^2

\displaystyle  = \sum_{i=1}^m 2^{2i} 2^{-i}

\displaystyle  \leq 2^{m+1}

and hence {S_{n,\leq}/n} has mean {m} and variance at most {2^{m+1}/n}. By Chebyshev’s inequality, we thus have

\displaystyle  {\bf P}( |S_{n,\leq}/n - m| \geq \lambda ) \leq \frac{2^{m+1}}{n \lambda^2}

for any {\lambda > 0}.

Now we turn to {S_{n,>}}. We cannot use the first or second moment methods here because the {X_{i,>M}} are not absolutely integrable. However, we can instead use the following “zeroth moment method” argument. Observe that the random variable {X 1_{X>M}} is only nonzero with probability {2^{-m}} (that is to say, the “zeroth moment” {{\bf E} (X 1_{X>M})^0} is {2^{-m}}, using the convention {0^0=0}). Thus {S_{n,>}} is nonzero with probability at most {n 2^{-m}}. We conclude that

\displaystyle  {\bf P}( |S_{n}/n - m| \geq \lambda ) \leq \frac{2^{m+1}}{n \lambda^2} + n 2^{-m}

This bound is valid for any natural number {m} and any {\lambda > 0}. Of course for this bound to be useful, we want to select parameters so that the right-hand side is somewhat small. If we pick for instance {m} to be the integer part of {\log_2 n + \frac{1}{2} \log_2 \log_2 n}, and {\lambda} to be {\sqrt{\log_2 n}}, we see that

\displaystyle  {\bf P}( |S_n/n - \log_2 n - \frac{1}{2}\log_2 \log_2 n| \geq \sqrt{\log_2 n} ) = O( \frac{1}{\log^{1/2} n} )

which (for large {n}) implies that

\displaystyle  {\bf P}( |\frac{S_n}{n \log_2 n} - 1| \geq \frac{2}{\sqrt{\log_2 n}} ) = O( \frac{1}{\log^{1/2} n} ).

In particular, we see that {S_n / (n \log_2 n)} converges in probability to one. This suggests that the fair price to pay for the Saint Petersburg lottery is a function of the number {n} of tickets one wishes to play, and should be approximately {\log_2 n} when {n} is large. In particular, the lottery is indeed worth playing at any finite ticket price {M}, but one needs to buy about {2^M} tickets before one can expect to break even!
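
(A simulation aside, not in the notes, assuming numpy: the sketch below draws payouts via geometrically distributed exponents, so that {{\bf P}(X = 2^i) = 2^{-i}}, and compares {S_n/(n \log_2 n)} with {1}; individual runs still fluctuate noticeably, consistent with the lack of a strong law discussed below.)

import numpy as np

rng = np.random.default_rng(0)
for n in [10**3, 10**5, 10**7]:
    i = rng.geometric(0.5, size=n)            # P(i = k) = 2^{-k} for k = 1, 2, ...
    S = (2.0 ** i).sum()                      # total payout S_n
    print(n, S / (n * np.log2(n)))            # typically close to 1 for large n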

In contrast to the absolutely integrable case, in which the weak law can be upgraded to the strong law, there is no strong law for the Saint Petersburg paradox:

Exercise 19 With the notation as in the above analysis of the Saint Petersburg paradox, show that {\frac{S_n}{n \log_2 n}} is almost surely unbounded. (Hint: it suffices to show that {X_n / n \log_2 n} is almost surely unbounded. For this, use the second Borel-Cantelli lemma.)

The following exercise can be viewed as a continuous analogue of the Saint Petersburg paradox.

Exercise 20 A real random variable {X} is said to have a standard Cauchy distribution if it has the probability density function {x \mapsto \frac{1}{\pi} \frac{1}{1+x^2}}.

  • (i) Verify that standard Cauchy distributions exist (this boils down to checking that the integral of the probability density function is {1}).
  • (ii) Show that a real random variable with the standard Cauchy distribution is not absolutely integrable.
  • (iii) If {X_1,X_2,\dots} are iid copies of a random variable {X} with the standard Cauchy distribution, show that {\frac{|X_1|+\dots+|X_n|}{n \log n}} converges in probability to {\frac{2}{\pi}} but is almost surely unbounded.

Exercise 21 (Weak law of large numbers for triangular arrays) Let {(X_{i,n})_{i,n \in {\bf N}: 1 \leq i \leq n}} be a triangular array of random variables, with the variables {X_{1,n},\dots,X_{n,n}} jointly independent for each {n}. Let {M_n} be a sequence going to infinity, and write {X_{i,n,\leq} := X_{i,n} 1_{|X_{i,n}| \leq M_n}} and {\mu_n := \sum_{i=1}^n {\bf E} X_{i,n,\leq}}. Assume that

\displaystyle  \sum_{i=1}^n {\bf P}( |X_{i,n}| > M_n ) \rightarrow 0


\displaystyle  \frac{1}{M_n^2} \sum_{i=1}^n {\bf E} |X_{i,n,\leq}|^2 \rightarrow 0

as {n \rightarrow \infty}. Show that

\displaystyle  \frac{X_{1,n}+\dots+X_{n,n} - \mu_n}{M_n} \rightarrow 0

in probability.

Now we turn to establishing the strong law of large numbers in full generality. A first attempt would be to apply the Borel-Cantelli lemma to the bound (6). However, the bounds this gives on quantities such as {{\bf P}( |S_n/n - \mu| > \varepsilon )} decay far too slowly to be summable in {n}, in large part due to the reliance on the dominated convergence theorem. To get around this we follow some arguments of Etemadi. We first need to make a few preliminary reductions, aimed at “sparsifying” the set of times {n} that one needs to control. It is here that we will genuinely use the fact that the averages {S_n} are being drawn from a single sequence {X_1,X_2,\dots} of random variables, rather than from a triangular array.

We turn to the details. In previous arguments it was convenient to normalise the underlying random variable {X} to have mean zero. Here we will use a different reduction, namely to the case when {X} is unsigned; the strong law for real absolutely integrable {X} clearly follows from the unsigned case by expressing {X} as the difference of two unsigned absolutely integrable variables {\max(X,0)} and {\max(-X,0)} (and splitting {X_n} similarly).

Henceforth we assume {X} (and hence the {X_n}) to be unsigned. Crucially, this now implies that the partial sums {S_n} are monotone: {S_n \leq S_{n+1}}. While this does not quite imply any monotonicity on the sequence {S_n/n}, it does make it significantly easier to show that it converges. The key point is as follows.

Lemma 22 Let {0 \leq S_1 \leq S_2 \leq \dots} be an increasing sequence, and let {\mu} be a real number. Suppose that for any {\delta > 0}, the sequence {S_{n_{j,\delta}}/n_{j,\delta}} converges to {\mu} as {j \rightarrow \infty}, where {n_{j,\delta} := \lfloor (1+\delta)^j \rfloor}. Then {S_n/n} converges to {\mu}.

Proof: Let {0 < \delta < 1/2}. For any {n}, let {j} be the index such that {n_{j,\delta} \leq n < n_{j+1,\delta}}. Then we have

\displaystyle  S_{n_{j,\delta}} \leq S_n \leq S_{n_{j+1,\delta}}

and (for sufficiently large {n})

\displaystyle  (1+2\delta) n_{j,\delta} \geq n \geq \frac{1}{1+2\delta} n_{j+1,\delta}

and thus

\displaystyle  \frac{1}{1+2\delta} \frac{S_{n_{j,\delta}}}{n_{j,\delta}} \leq \frac{S_n}{n} \leq (1+2\delta) \frac{S_{n_{j+1,\delta}}}{n_{j+1,\delta}}.

Taking limit inferior and superior, we conclude that

\displaystyle  \frac{1}{1+2\delta} \mu \leq \liminf_{n \rightarrow \infty} \frac{S_n}{n} \leq \limsup_{n \rightarrow \infty} \frac{S_n}{n} \leq (1+2\delta) \mu,

and then sending {\delta \rightarrow 0} we obtain the claim. \Box

An inspection of the above argument shows that we only need to verify the hypothesis for a countable sequence of {\delta} (e.g. {\delta=1/m} for natural number {m}). Thus, to show that {S_n/n} converges to {\mu} almost surely, it suffices to show that for any {\delta>0}, one has {S_{n_{j,\delta}}/n_{j,\delta} \rightarrow \mu} almost surely as {j \rightarrow \infty}.

Fix {\delta > 0}. The point is that the “lacunary” sequence {n_{j,\delta}} is much sparser than the sequence of natural numbers {n}, and one now will lose a lot less from the Borel-Cantelli argument. Indeed, for any {\varepsilon > 0}, we can apply (6) to conclude that

\displaystyle  {\bf P}( |S_{n_{j,\delta}}/n_{j,\delta} - \mu| \geq \varepsilon ) \leq \frac{1}{(\varepsilon/2)^2} {\bf E} \frac{|X|^2}{n_{j,\delta}} 1_{|X| \leq n_{j,\delta}} + {\bf E} n_{j,\delta} 1_{|X|>n_{j,\delta}}

whenever {j} is sufficiently large depending on {\delta,\varepsilon}. Thus, by the Borel-Cantelli lemma, it will suffice to show that the sums

\displaystyle  \sum_{j=1}^\infty {\bf E} \frac{|X|^2}{n_{j,\delta}} 1_{|X| \leq n_{j,\delta}}


\displaystyle  \sum_{j=1}^\infty {\bf E} n_{j,\delta} 1_{|X|>n_{j,\delta}}

are finite. Using the monotone convergence theorem to interchange the sum and expectation, it thus suffices to show the pointwise estimates

\displaystyle  \sum_{j=1}^\infty \frac{|X|^2}{n_{j,\delta}} 1_{|X| \leq n_{j,\delta}} \leq C_\delta |X|


\displaystyle  \sum_{j=1}^\infty n_{j,\delta} 1_{|X|>n_{j,\delta}} \leq C_\delta |X|

for some {C_\delta} depending only on {\delta}. But this follows from the geometric series formula (the first sum is over the elements of the sequence {n_{j,\delta}} that are greater than or equal to {|X|}, while the latter is over those that are less than {|X|}). This proves the strong law of large numbers for arbitrary absolutely integrable iid {X_1,X_2,\dots}.

We remark that by carefully inspecting the above proof of the strong law of large numbers, we see that the hypothesis of joint independence of the {X_n} can be relaxed to pairwise independence.
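
(A numerical aside, not part of the notes, assuming numpy: the sketch below tracks running averages of iid Pareto-type variables with tail exponent {3/2}, which are absolutely integrable but have infinite variance, so the second and fourth moment arguments do not apply directly, yet the averages still settle down to the mean as the strong law predicts.)

import numpy as np

rng = np.random.default_rng(0)
alpha = 1.5                                   # tail exponent: finite mean, infinite variance
mu = alpha / (alpha - 1)                      # mean of X = (1-U)^{-1/alpha}, U uniform on [0,1)
X = (1.0 - rng.random(10**7)) ** (-1.0 / alpha)
S = np.cumsum(X)
for n in [10**3, 10**5, 10**7]:
    print(n, S[n - 1] / n, mu)                # running averages S_n/n approach mu = 3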

The next exercise shows how one can use the strong law of large numbers to approximate the cumulative distribution function of a random variable {X} by an empirical cumulative distribution function.

Exercise 23 Let {X_1,X_2,\dots} be iid copies of a real random variable {X}.

  • (i) Show that for every real number {t}, one has almost surely that

    \displaystyle  \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \}| \rightarrow {\bf P}( X \leq t )


    \displaystyle  \frac{1}{n} | \{ 1 \leq i \leq n: X_i < t \}| \rightarrow {\bf P}( X < t )

    as {n \rightarrow \infty}.

  • (ii) Establish the Glivenko-Cantelli theorem: almost surely, one has

    \displaystyle  \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \}| \rightarrow {\bf P}( X \leq t )

    uniformly in {t} as {n \rightarrow \infty}. (Hint: For any natural number {m}, let {f_m(x)} denote the largest integer multiple of {1/m} less than {x}. Show first that {f_m( \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \}| )} is within {O(1/m)} of {f_m( {\bf P}( X \leq t ) )} for all {t} when {n} is sufficiently large.)

Exercise 24 (Lack of strong law for triangular arrays) Let {X} be a random variable taking values in the natural numbers with {{\bf P}(X=n) = \frac{1}{\zeta(3)} \frac{1}{n^3}}, where {\zeta(3) := \sum_{n=1}^\infty \frac{1}{n^3}} (this is an example of a zeta distribution).

  • (i) Show that {X} is absolutely integrable.
  • (ii) Let {(X_{i,n})_{i,n \in {\bf N}: 1 \leq i \leq n}} be jointly independent copies of {X}. Show that the random variables {\frac{X_{1,n}+\dots+X_{n,n}}{n}} are almost surely unbounded. (Hint: for any constant {A}, show that {\frac{X_{1,n}+\dots+X_{n,n}}{n} > A} occurs with probability at least {\varepsilon/n} for some {\varepsilon > 0} depending on {A}. Then use the second Borel-Cantelli lemma.)

— 3. The Kolmogorov maximal inequality —

Let {X_1,X_2,\dots} be a sequence of jointly independent square-integrable real random variables of mean zero; we do not assume the {X_i} to be identically distributed. As usual, we form the sums {S_n := X_1 + \dots + X_n}, then {S_n} has mean zero and variance

\displaystyle  {\bf Var}(S_n) = {\bf E} S_n^2 = {\bf Var}(X_1)+\dots+{\bf Var}(X_n). \ \ \ \ \ (7)

From Chebyshev’s inequality, we thus have

\displaystyle  {\bf P}( |S_n| \geq t ) \leq \frac{{\bf Var}(X_1)+\dots+{\bf Var}(X_n)}{t^2}

for any {0 < t < \infty} and natural number {n}. Perhaps surprisingly, we have the following improvement to this bound, known as the Kolmogorov maximal inequality:

Theorem 25 (Kolmogorov maximal inequality) With the notation and hypotheses as above, we have

\displaystyle  {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq t ) \leq \frac{{\bf Var}(X_1)+\dots+{\bf Var}(X_n)}{t^2}

Proof: For each {i}, let {E_i} be the event that {|S_i| \geq t}, but that {|S_j| < t} for all {1 \leq j < i}. It is clear that the event {\sup_{1 \leq i \leq n} |S_i| \geq t} is the disjunction of the disjoint events {E_1,\dots,E_n}, thus

\displaystyle  {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq t ) = \sum_{i=1}^n {\bf P}(E_i).

On the event {E_i}, we have {1 \leq \frac{1}{t^2} S_i^2}, and hence

\displaystyle  {\bf P}(E_i) \leq \frac{1}{t^2} {\bf E} S_i^2 1_{E_i}.

We will shortly prove the inequality

\displaystyle  {\bf E} S_i^2 1_{E_i} \leq {\bf E} S_n^2 1_{E_i} \ \ \ \ \ (8)

for all {1 \leq i \leq n}. Assuming this inequality for the moment, we can put together all the above estimates, using the disjointness of the {E_i}, to conclude that

\displaystyle  {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq t ) \leq \frac{1}{t^2} {\bf E} S_n^2

and the claim follows from (7).

It remains to prove (8). Since

\displaystyle  S_n^2 = S_i^2 + (S_n - S_i)^2 + 2 (S_n-S_i) S_i

\displaystyle  \geq S_i^2 + 2 (S_n-S_i) S_i

we have

\displaystyle  {\bf E} S_n^2 1_{E_i} \geq {\bf E} S_i^2 1_{E_i} + 2 {\bf E} S_i 1_{E_i} (S_n-S_i).

But note that the random variable {S_i 1_{E_i}} is completely determined by {X_1,\dots,X_i}, while {S_n-S_i} is completely determined by the {X_{i+1},\dots,X_n}. Thus {S_i 1_{E_i}} and {S_n-S_i} are independent. Since {S_n-S_i} also has mean zero, we have

\displaystyle  {\bf E} S_i 1_{E_i} (S_n-S_i) = 0

and the claim (8) follows. \Box

An inspection of the above proof reveals that the key ingredient is the lack of correlation between past and future: a variable such as {S_i 1_{E_i}}, which is determined by the portion of the sequence {X_1,X_2,\dots} to the “past” (and present) of {i}, is uncorrelated with a variable such as {S_n-S_i} that depends only on the “future” of {i}. One can formalise such a lack of correlation through the concept of a martingale, which will be covered in later courses in this sequence but which is beyond the scope of these notes. The use of the first time {i} at which {|S_i|} exceeds or attains the threshold {t} is a simple example of a stopping time, which will be a heavily used concept in the theory of martingales (and is also used extensively in harmonic analysis, which is also greatly interested in establishing maximal inequalities).
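
(A quick empirical comparison, not in the notes, assuming numpy: the sketch below simulates random sign walks and compares the probability that the running maximum of {|S_i|} exceeds a threshold with the Kolmogorov bound; the parameters are arbitrary.)

import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 1000, 5000, 50.0
X = rng.choice([-1.0, 1.0], size=(trials, n))          # mean zero steps with variance 1
M = np.abs(np.cumsum(X, axis=1)).max(axis=1)           # sup_{1 <= i <= n} |S_i| for each trial
print(np.mean(M >= t), n / t ** 2)                     # empirical probability versus the bound n/t^2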

The Kolmogorov maximal inequality gives the following variant of the strong law of large numbers.

Theorem 26 (Convergence of random series) Let {X_1,X_2,\dots} be jointly independent square-integrable real random variables of mean zero with

\displaystyle  \sum_{i=1}^\infty {\bf Var}(X_i) < \infty.

Then the series {\sum_{i=1}^\infty X_i} is almost surely convergent (i.e., the partial sums converge almost surely).

Proof: From the Kolmogorov maximal inequality and continuity from below we have

\displaystyle  {\bf P}( \sup_i |S_i| \geq t ) \leq \frac{\sum_{i=1}^\infty {\bf Var}(X_i)}{t^2}. \ \ \ \ \ (9)

This is already enough to show that the partial sums {S_n = \sum_{i=1}^n X_i} are almost surely bounded in {n}, but this isn't quite enough to establish convergence of the series. To finish the job, we apply (9) with {X_1,X_2,\dots} replaced by the shifted sequence {X_{n+1}, X_{n+2}, \dots} for a natural number {n} to conclude that

\displaystyle  {\bf P}( \sup_{i > n} |S_i - S_n| \geq t ) \leq \frac{\sum_{i=n+1}^\infty {\bf Var}(X_i)}{t^2}.

Sending {n} to infinity using continuity from above, we conclude that

\displaystyle  {\bf P}( \sup_{i > n} |S_i - S_n| \geq t \hbox{ for all } n ) = 0

for all {t>0}; applying this for all rational {t>0}, we conclude that {S_1,S_2,\dots} is almost surely a Cauchy sequence, and the claim follows. \Box
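
(An illustrative aside, not from the notes, assuming numpy: the sketch below forms the partial sums of the random series {\sum_i \epsilon_i / i} with unbiased signs {\epsilon_i}, whose variances {1/i^2} are summable, so Theorem 26 applies; one can watch the partial sums settling near a random limit.)

import numpy as np

rng = np.random.default_rng(0)
N = 10**6
signs = rng.choice([-1.0, 1.0], size=N)          # X_i = sign_i / i has mean zero, Var(X_i) = 1/i^2
partial = np.cumsum(signs / np.arange(1, N + 1))
for n in [10**3, 10**5, 10**6]:
    print(n, partial[n - 1])                     # partial sums S_n stabilise near a random limit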

We can use this result together with some elementary manipulation of sums to give the following alternative proof of the strong law of large numbers.

Theorem 27 (Strong law of large numbers) Let {X_1,X_2,\dots} be iid copies of an absolutely integrable variable {X} of mean {{\bf E} X = \mu}, and let {S_n := X_1+\dots+X_n}. Then {S_n/n} converges almost surely to {\mu}.

Proof: We may normalise {X} to be real valued with {\mu=0}. Note that

\displaystyle  \sum_{n=1}^\infty {\bf P}( |X_n| > n ) = \sum_{n=1}^\infty {\bf P}(|X| > n)

\displaystyle  = {\bf E} \sum_{n=1}^\infty 1_{|X| > n}

\displaystyle  \leq {\bf E} |X|

\displaystyle  < \infty

and hence by the Borel-Cantelli lemma we almost surely have {|X_n| \leq n} for all but finitely many {n}. Thus if we write {X'_n := X_n 1_{|X_n| \leq n}} and {S'_n := X'_1+\dots+X'_n}, then the difference between {S_n/n} and {S'_n/n} almost surely goes to zero as {n \rightarrow \infty}. Thus it will suffice to show that {S'_n/n} goes to zero almost surely.

The random variables {X'_1,X'_2,\dots} are still jointly independent, but are not quite mean zero. However, the normalised random variables {Y_n := \frac{X'_n - {\bf E} X'_n}{n}} are of mean zero and

\displaystyle  \sum_{i=1}^\infty {\bf Var}(Y_i) = \sum_{i=1}^\infty {\bf Var}(X'_i) / i^2

\displaystyle  \leq \sum_{i=1}^\infty {\bf E} |X'_i|^2 / i^2

\displaystyle  = \sum_{i=1}^\infty {\bf E} |X|^2 1_{|X| \leq i} / i^2

\displaystyle  = {\bf E} |X|^2 \sum_{i \geq |X|} \frac{1}{i^2}

\displaystyle  = {\bf E} O( |X| )

\displaystyle  < \infty

so by Theorem 26 we see that the sum {\sum_{i=1}^\infty Y_i} is almost surely convergent.

Write {T_n := Y_1+\dots+Y_n}, thus the sequence {T_n} is almost surely a Cauchy sequence. From the identity

\displaystyle  \frac{1}{n} \sum_{i=1}^n i Y_i = \frac{1}{n} \sum_{i=0}^{n-1} (T_n - T_i)

we conclude that the sequence {\frac{1}{n} \sum_{i=1}^n i Y_i} almost surely converges to zero, that is to say

\displaystyle  \frac{S'_n}{n} - \frac{\sum_{i=1}^n {\bf E} X'_i}{n} \rightarrow 0

almost surely. On the other hand, we have

\displaystyle  {\bf E} X'_i = {\bf E} X 1_{|X| \leq i}

\displaystyle  = - {\bf E} X 1_{|X| > i}

and hence by dominated convergence

\displaystyle  {\bf E} X'_i \rightarrow 0

as {i \rightarrow \infty}, which implies that

\displaystyle  \frac{\sum_{i=1}^n {\bf E} X'_i}{n} \rightarrow 0

as {n \rightarrow \infty}. By the triangle inequality, we conclude that {S'_n/n \rightarrow 0} almost surely, as required. \Box

Exercise 28 (Kronecker lemma) Let {\sum_{n=1}^\infty Y_n} be a convergent series of real numbers {Y_n}, and let {0 < b_1 \leq b_2 \leq \dots} be a sequence tending to infinity. Show that {\frac{1}{b_n} \sum_{i=1}^n b_i Y_i} converges to zero as {n \rightarrow \infty}. (This is known as Kronecker’s lemma; the special case {b_i=i} was implicitly used in the above argument.)

Exercise 29 (Kolmogorov three-series theorem, one direction) Let {X_1,X_2,\dots} be a sequence of jointly independent real random variables, and let {A>0}. Suppose that the two series {\sum_{n=1}^\infty {\bf P}(|X_n| > A)} and {\sum_{n=1}^\infty {\bf Var}( X_n 1_{|X_n| \leq A} )} are absolutely convergent, and the third series {\sum_{n=1}^\infty {\bf E} X_n 1_{|X_n| \leq A}} is convergent. Show that the series {\sum_{n=1}^\infty X_n} is almost surely convergent. (The converse claim is also true, and will be discussed in later notes; the two claims are known collectively as the Kolmogorov three-series theorem.)

One advantage that the maximal inequality approach to the strong law of large numbers has over the moment method approach is that it tends to offer superior bounds on the (almost sure) rate of convergence. We content ourselves with just one example of this:

Exercise 30 (Cheap law of iterated logarithm) Let {X_1,X_2,\dots} be a sequence of jointly independent real random variables of mean zero and bounded variance (thus {\sup_n {\bf E} X_n^2 < \infty}). Write {S_n := X_1+\dots+X_n}. Show that {\frac{S_n}{n^{1/2} \log^{1/2+\varepsilon} n}} converges almost surely to zero as {n \rightarrow \infty} for any given {\varepsilon>0}. (Hint: use Theorem 26 and the Kronecker lemma for a suitable weighted sum of the {X_n}.) There is a more precise version of this fact known as the law of the iterated logarithm, which is beyond the scope of these notes.

The exercises below will be moved to a more appropriate location later, but are currently placed here in order to not disrupt existing numbering.

Exercise 31 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable random variable {X} with mean {\mu}. Show that the averages {\frac{S_n}{n} = \frac{X_1+\dots+X_n}{n}} converge in {L^1} to {\mu}, that is to say that

\displaystyle  {\bf E} |\frac{S_n}{n}-\mu| \rightarrow 0

as {n \rightarrow \infty}.

Exercise 32 A scalar random variable {X} is said to be in weak {L^1} if one has

\displaystyle  \sup_{t>0} t {\bf P}(|X| \geq t) < \infty.

Thus Markov’s inequality tells us that every absolutely integrable random variable is in weak {L^1}, but the converse is not true (e.g. random variables with the Cauchy distribution are weak {L^1} but not absolutely integrable). Show that if {X_1,X_2,\dots} are iid copies of an unsigned weak {L^1} random variable, then there exist quantities {a_n \rightarrow \infty} such that {S_n/a_n} converges in probability to {1}, where {S_n = X_1 + \dots + X_n}. (Thus: there is a weak law of large numbers for weak {L^1} random variables, and a strong law for strong {L^1} (i.e. absolutely integrable) random variables.)

Filed under: 275A - probability theory, math.PR Tagged: law of large numbers, moment method, truncation

November 29, 2015

David HoggThe Cannon and detailed abundances

[I am on vacation this week; that didn't stop me from doing a tiny bit of research.]

I did a bit of writing for the project of taking The Cannon into compressed-sensing territory, while Andy Casey (Cambridge) structures the code so we are ready to work on the problem when he is here in NYC in a couple of weeks. I tried to work out the most conservative possible train–validate–test framework for training and validation, consistent with some ideas from Foreman-Mackey. I also tried to understand what figures we will make to demonstrate that we are getting better or more informative abundances than other approaches.

Hans-Walter called to discuss the behavior of The Cannon when we try to do large numbers of chemical abundance labels. The code finds that its best model for one element will make use of lines from other elements. Why? He pointed out (correctly) that The Cannon does its best to predict abundances. It is not strictly just measuring the abundances. It is doing its best to predict, and the best prediction will both measure the element directly, and also include useful indirect information. So we have to decide what our goals are, and whether to restrict the model.

Terence TaoA cheap version of Halasz’s inequality

A basic estimate in multiplicative number theory (particularly if one is using the Granville-Soundararajan “pretentious” approach to this subject) is the following inequality of Halasz (formulated here in a quantitative form introduced by Montgomery and Tenenbaum).

Theorem 1 (Halasz inequality) Let {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function bounded in magnitude by {1}, and suppose that {x \geq 3}, {T \geq 1}, and { M \geq 0} are such that

\displaystyle  \sum_{p \leq x} \frac{1 - \hbox{Re}(f(p) p^{-it})}{p} \geq M \ \ \ \ \ (1)

for all real numbers {t} with {|t| \leq T}. Then

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll (1+M) e^{-M} + \frac{1}{\sqrt{T}}.

As a qualitative corollary, we conclude (by standard compactness arguments) that if

\displaystyle  \sum_{p} \frac{1 - \hbox{Re}(f(p) p^{-it})}{p} = +\infty

for all real {t}, then

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) = o(1) \ \ \ \ \ (2)

as {x \rightarrow \infty}. In a more recent paper of Granville and Soundararajan, the sharper bound

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll (1+M) e^{-M} + \frac{1}{T} + \frac{\log\log x}{\log x}

is obtained (with a more precise description of the {(1+M) e^{-M}} term).

The usual proofs of Halasz’s theorem are somewhat lengthy (though there has been a recent simplification, in forthcoming work of Granville, Harper, and Soundararajan). Below the fold I would like to give a relatively short proof of the following “cheap” version of the inequality, which has slightly weaker quantitative bounds, but still suffices to give qualitative conclusions such as (2).

Theorem 2 (Cheap Halasz inequality) Let {f: {\bf N} \rightarrow {\bf C}} be a multiplicative function bounded in magnitude by {1}. Let {T \geq 1} and {M \geq 0}, and suppose that {x} is sufficiently large depending on {T,M}. If (1) holds for all {|t| \leq T}, then

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll (1+M) e^{-M/2} + \frac{1}{T}.

The non-optimal exponent {1/2} can probably be improved a bit by being more careful with the exponents, but I did not try to optimise it here. A similar bound appears in the first paper of Halasz on this topic.

The idea of the argument is to split {f} as a Dirichlet convolution {f = f_1 * f_2 * f_3} where {f_1,f_2,f_3} are the portions of {f} coming from “small”, “medium”, and “large” primes respectively (with the dividing line between the three types of primes being given by various powers of {x}). Using a Perron-type formula, one can express this convolution in terms of the product of the Dirichlet series of {f_1,f_2,f_3} respectively at various complex numbers {1+it} with {|t| \leq T}. One can use {L^2} based estimates to control the Dirichlet series of {f_2,f_3}, while using the hypothesis (1) one can get {L^\infty} estimates on the Dirichlet series of {f_1}. (This is similar to the Fourier-analytic approach to ternary additive problems, such as Vinogradov’s theorem on representing large odd numbers as the sum of three primes.) This idea was inspired by a similar device used in the work of Granville, Harper, and Soundararajan. A variant of this argument also appears in unpublished work of Adam Harper.
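
To make the splitting {f = f_1*f_2*f_3} concrete, here is a small Python sketch that carries it out for the Liouville function (an arbitrary illustrative choice of completely multiplicative {f} bounded by {1}) with toy numerical cutoffs standing in for {x^{\varepsilon_2}} and {x^{\varepsilon_1}}, and verifies that the Dirichlet convolution of the three pieces recovers {f}.

    # Toy illustration of the splitting f = f1 * f2 * f3 (Dirichlet convolution),
    # with f the Liouville function and small cutoffs in place of x^{eps_2}, x^{eps_1}.
    X = 2000
    z1, z2 = 7, 40    # "small" primes are <= z1, "medium" primes lie in (z1, z2]

    # Smallest prime factor table, for factorising every n <= X.
    spf = list(range(X + 1))
    for p in range(2, int(X ** 0.5) + 1):
        if spf[p] == p:
            for m in range(p * p, X + 1, p):
                if spf[m] == m:
                    spf[m] = p

    def prime_factors(n):
        fs = []
        while n > 1:
            fs.append(spf[n])
            n //= spf[n]
        return fs

    def liouville(n):
        return (-1) ** len(prime_factors(n))

    def restrict(lo, hi):
        # f restricted to integers all of whose prime factors lie in (lo, hi]
        g = [0] * (X + 1)
        for n in range(1, X + 1):
            if all(lo < p <= hi for p in prime_factors(n)):
                g[n] = liouville(n)
        return g

    def dirichlet(g, h):
        out = [0] * (X + 1)
        for a in range(1, X + 1):
            if g[a]:
                for b in range(1, X // a + 1):
                    out[a * b] += g[a] * h[b]
        return out

    f1, f2, f3 = restrict(0, z1), restrict(z1, z2), restrict(z2, X)
    conv = dirichlet(dirichlet(f1, f2), f3)
    print(all(conv[n] == liouville(n) for n in range(1, X + 1)))   # True

The check succeeds because each {n} factors uniquely into its small, medium and large parts, and that canonical factorisation is the only term surviving in the convolution.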

I thank Andrew Granville for helpful comments which led to significant simplifications of the argument.

— 1. Basic estimates —

We need the following basic tools from analytic number theory. We begin with a variant of the classical Perron formula.

Proposition 3 (Perron type formula) Let {f: {\bf N} \rightarrow {\bf C}} be an arithmetic function bounded in magnitude by {1}, and let {x, T \geq 1}. Assume that the Dirichlet series {F(s) := \sum_{n=1}^\infty \frac{f(n)}{n^s}} is absolutely convergent for {\hbox{Re}(s) \geq 1}. Then

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll \int_{-T}^T |F(1+it)| \frac{dt}{1+|t|} + \frac{1}{T} + \frac{T}{x}.

Proof: By telescoping series (and treating the contribution of {n \ll T} trivially), it suffices to show that

\displaystyle  \sum_{x/e \leq n \leq x} f(n) \ll x \int_{-T}^T |F(1+it)| \frac{dt}{1+|t|} + \frac{x}{T}

whenever {x \geq T}.

The left-hand side can be written as

\displaystyle  x \sum_n \frac{f(n)}{n} g( \log x - \log n ) \ \ \ \ \ (3)

where {g(u) := e^{-u} 1_{[0,1]}(u)}. We now introduce the mollified version

\displaystyle  \tilde g(u) := \int_{\bf R} g(u + \frac{v}{T}) \Psi(v)\ dv

of {g}, where

\displaystyle  \Psi(v) := \frac{1}{2\pi} \int_{\bf R} \psi(t) e^{itv}\ dt

and {\psi: {\bf R} \rightarrow {\bf R}} is a fixed smooth function supported on {[-1,1]} that equals {1} at the origin. Basic Fourier analysis then tells us that {\Psi} is a Schwartz function with total mass one. This gives the crude bound

\displaystyle  \tilde g(u) \ll 1

for any {u}. For {u \leq -\frac{1}{T}} or {u \geq 1 + \frac{1}{T}}, we use the bound {\Psi(v) \ll \frac{1}{(1+|v|)^{10}}} (say) to arrive at the bound

\displaystyle  \tilde g(u) \ll (T \hbox{dist}(u, \{0,1\}))^{-9};

for {\frac{1}{T} \leq u \leq 1 - \frac{1}{T}} we again use {\Psi(v) \ll \frac{1}{(1+|v|)^{10}}} and write

\displaystyle  \tilde g(u) - g(u) = \int_{\bf R} (g(u + \frac{v}{T}) - g(u)) \Psi(v)\ dv

and use the Lipschitz bound {g(u+\frac{v}{T}) - g(u) \ll \frac{v}{T}} for {u + \frac{v}{T} \in [0,1]} to obtain

\displaystyle  \tilde g(u) - g(u) \ll (T \hbox{dist}(u, \{0,1\}))^{-9}

for such {u}. Putting all these bounds together, we see that

\displaystyle  \tilde g(u) = g(u) + O( \frac{1}{(1 + T |u|)^9} ) + O( \frac{1}{(1 + T |u-1|)^9} )

for all {u}. In particular, we can write (3) as

\displaystyle  x \sum_n \frac{f(n)}{n} \tilde g(\log x - \log n)

\displaystyle + O( \sum_n \frac{x}{n} \frac{1}{(1 + T |\log x - \log n|)^9} + \frac{1}{(1 + T |\log x/e - \log n|)^9} ).

The expression {\frac{x}{n} \frac{1}{(1 + T |\log x - \log n|)^9} + \frac{1}{(1 + T |\log x/e - \log n|)^9}} is bounded by {O( \frac{x}{n} \frac{1}{T \log^9 x} )} when {n \leq x/10}, is bounded by {O( \frac{x}{n} \frac{1}{T \log^9 n} )} when {n \geq 10x}, is bounded by {O(1)} when {n = (1+O(1/T)) x} or {n = (1+O(1/T)) x/e}, and is bounded by {(\frac{T}{x} |n-x|)^{-9} + (\frac{T}{x} |n-x/e|)^{-9}} otherwise. From these bounds, a routine calculation (using the hypothesis {x \geq T}) shows that

\displaystyle  \sum_n \frac{x}{n} \frac{1}{(1 + T |\log x - \log n|)^9} + \frac{1}{(1 + T |\log x/e - \log n|)^9} \ll \frac{x}{T}

and so it remains to show that

\displaystyle  \sum_n \frac{f(n)}{n} \tilde g(\log x - \log n) \ll \int_{-T}^T |F(1+it)| \frac{dt}{1+|t|}.

To establish this, we first express {\tilde g} in Fourier form: from the definitions of {\tilde g} and {\Psi}, a change of variables and an interchange of integrals give
\displaystyle  \tilde g(u) = T \int_{\bf R} g(v) \Psi( T(v-u) )\ dv

\displaystyle  = \frac{T}{2\pi} \int_{\bf R} \psi(t) G(Tt) e^{-iTtu} dt

\displaystyle  = \frac{1}{2\pi} \int_{\bf R} \psi(t/T) G(t) e^{-itu}\ dt

where
\displaystyle  G(t) := \int_{\bf R} g(v) e^{itv}\ dv

we see from the triangle inequality and the support of {\psi(t/T)} that

\displaystyle  \sum_n \frac{f(n)}{n} \tilde g(\log x - \log n) \ll \int_{|t| \leq T} |G(t)| |\sum_n \frac{f(n)}{n} e^{-it(\log x- \log n)}|\ dt

\displaystyle  \ll \int_{|t| \leq T} |G(t)| |F(1+it)|\ dt.

But from integration by parts we see that {G(t) \ll \frac{1}{1+|t|}}, and the claim follows. \Box

Next, we recall a standard {L^2} mean value estimate for Dirichlet series:

Proposition 4 ({L^2} mean value estimate) Let {f: {\bf N} \rightarrow {\bf C}} be an arithmetic function, and let {T \geq 1}. Assume that the Dirichlet series {F(s) := \sum_{n=1}^\infty \frac{f(n)}{n^s}} is absolutely convergent for {\hbox{Re}(s) \geq 1}. Then

\displaystyle  \int_{|t| \leq T} |F(1+it)|^2\ dt \ll T \sum_{n_1,n_2: \log n_2 = \log n_1 + O(1/T)} \frac{|f(n_1)| |f(n_2)|}{n_1 n_2}.

Proof: This follows from Lemma 7.1 of Iwaniec-Kowalski; for the convenience of the reader we reproduce the short proof here. Introducing the normalised sinc function {\hbox{sinc}(x) := \frac{\sin(\pi x)}{\pi x}}, we have

\displaystyle  \int_{|t| \leq T} |F(1+it)|^2\ dt \ll \int_{\bf R} |F(1+it)|^2 \hbox{sinc}^2( t / 2T )\ dt

\displaystyle  = \sum_{n_1,n_2} f(n_1) \overline{f(n_2)} \int_{\bf R} \frac{1}{n_1^{1+it} n_2^{1-it}} \hbox{sinc}^2(t/2T)\ dt

\displaystyle  \ll \sum_{n_1,n_2} \frac{|f(n_1)| |f(n_2)|}{n_1 n_2} |\int_{\bf R} \hbox{sinc}^2(t/2T) e^{it (\log n_1 - \log n_2)}\ dt|.

But a standard Fourier-analytic computation shows that {\int_{\bf R} \hbox{sinc}^2(t/2T) e^{it (\log n_1 - \log n_2)}\ dt} vanishes unless {\log n_2 = \log n_1 + O(1/T)}, in which case the integral is {O(T)}, and the claim follows. \Box

Now we recall a basic sieve estimate:

Proposition 5 (Sieve bound) Let {x \geq 1}, let {I} be an interval of length {x}, and let {{\mathcal P}} be a set of primes up to {x}. If we remove one residue class mod {p} from {I} for every {p \in {\mathcal P}}, the number of remaining natural numbers in {I} is at most {\ll |I| \prod_{p \in {\mathcal P}} (1 - \frac{1}{p})}.

Proof: This follows for instance from the fundamental lemma of sieve theory (see e.g. Corollary 19 of this blog post). (One can also use the Selberg sieve or the large sieve.) \Box

Finally, we record a standard estimate on the number of smooth numbers:

Proposition 6 Let {u \geq 1} and {\varepsilon>0}, and suppose that {x} is sufficiently large depending on {u,\varepsilon}. Then the number of natural numbers in {[1,x]} which have no prime factor larger than {x^{1/u}} is at most {O( u^{-(1-\varepsilon)u} x )}.

Proof: See Corollary 1.3 of this paper of Hildebrand and Tenenbaum. (The result also follows from the more classical work of Dickman.) \Box
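
As a numerical sanity check on the {u^{-(1-\varepsilon)u}} scaling, the short Python sketch below counts {x^{1/u}}-smooth numbers up to {x} with a simple sieve and compares the observed density with {u^{-u}}; the values of {x} and {u} are arbitrary illustrative choices.

    # Count y-smooth numbers up to x (with y = x^{1/u}) by recording the largest
    # prime factor of each integer, and compare the density with u^{-u}.
    x = 10**5

    lpf = list(range(x + 1))          # will hold the largest prime factor of each n
    for p in range(2, x + 1):
        if lpf[p] == p:               # p is prime
            for m in range(2 * p, x + 1, p):
                lpf[m] = p            # later (larger) primes overwrite earlier ones

    for u in (2.0, 3.0, 4.0):
        y = x ** (1.0 / u)
        density = sum(1 for n in range(1, x + 1) if lpf[n] <= y) / x
        print(u, round(density, 4), round(u ** -u, 4))
    # The observed densities decay super-exponentially in u, consistent with the
    # proposition; the precise limiting density is Dickman's rho function (for
    # instance rho(2) = 1 - log 2, approximately 0.307).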

— 2. Proof of theorem —

By increasing {M} as necessary we may assume that {M \geq 10} (say). Let {0 < \varepsilon_2 < \varepsilon_1 \leq 1/2} be small parameters (depending on {M}) to be optimised later; we assume {x} to be sufficiently large depending on {\varepsilon_1,\varepsilon_2,T}. Call a prime {p} small if {p \leq x^{\varepsilon_2}}, medium if {x^{\varepsilon_2} < p \leq x^{\varepsilon_1}}, and large if {x^{\varepsilon_1} < p \leq x}. Observe that for any {n \leq x} we can factorise {f} as a Dirichlet convolution

\displaystyle  f(n) = f_1 * f_2 * f_3(n)

where
  • {f_1} is the restriction of {f} to those natural numbers {n} whose prime factors are all small;
  • {f_2} is the restriction of {f} to those natural numbers {n} whose prime factors are all medium;
  • {f_3} is the restriction of {f} to those natural numbers {n} whose prime factors are all large.

We thus have

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) = \frac{1}{x} \sum_{n \leq x} f_1*f_2*f_3(n). \ \ \ \ \ (4)

It is convenient to remove the Dirac function {\delta(n) := 1_{n=1}} from {f_2,f_3}, so we write

\displaystyle  f_2 = \delta + f'_2; \quad f_3 = \delta + f'_3

and split

\displaystyle  f_1*f_2*f_3 = f_1*f_2 + f_1*f'_3 + f_1*f'_2*f'_3.

Note that {f_1*f_2} is the restriction of {f} to those numbers {n \leq x} whose prime factors are all small or medium. By Proposition 6, the number of such {n} can certainly be bounded by {O(e^{-1/\varepsilon_1} x)} if {x} is sufficiently large. Thus the contribution of this term to (4) is {O( e^{-1/\varepsilon_1} )}.

Similarly, {f_1 * f'_3} is the restriction of {f} to those numbers {n \leq x} which contain at least one large prime factor, but no medium prime factors. By Proposition 5 the number of such {n} is bounded by {O( \frac{\varepsilon_2}{\varepsilon_1} x )} if {x} is sufficiently large. Thus the contribution of this term to (4) is {O( \frac{\varepsilon_2}{\varepsilon_1} )}, and hence

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll |\frac{1}{x} \sum_{n \leq x} f_1*f'_2*f'_3(n)| + e^{-1/\varepsilon_1} + \frac{\varepsilon_2}{\varepsilon_1}.

Note that {f_1*f'_2*f'_3} is only supported on numbers whose prime factors do not exceed {x}, so the Dirichlet series of {f_1*f'_2*f'_3} is absolutely convergent for {\hbox{Re}(s) \geq 1} and is equal to {F_1(s) F'_2(s) F'_3(s)}, where {F_1,F'_2,F'_3} are the Dirichlet series of {f_1,f'_2,f'_3} respectively. Since {f_1*f'_2*f'_3} is bounded in magnitude by {1} (being a restriction of {f}), we may apply Proposition 3 and conclude (for {x} large enough, and discarding the {\frac{1}{1+|t|}} denominator) that

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll \int_{|t| \leq T} |F_1(1+it)| |F'_2(1+it)| |F'_3(1+it)|\ dt

\displaystyle + \frac{1}{T} + e^{-1/\varepsilon_1} + \frac{\varepsilon_2}{\varepsilon_1}.

We now record some {L^2} estimates:

Lemma 7 For sufficiently large {x}, we have

\displaystyle  \int_{|t| \leq T} |F'_2(1+it)|^2\ dt \ll \frac{\varepsilon_1}{\varepsilon_2^2 \log x}


\displaystyle  \int_{|t| \leq T} |F'_3(1+it)|^2\ dt \ll \frac{1}{\varepsilon_1^2 \log x}.

Proof: We just prove the former inequality, as the latter is similar. By Proposition 4, we have

\displaystyle  \int_{|t| \leq T} |F'_2(1+it)|^2\ dt \ll T \sum_{n_1,n_2: \log n_1 = \log n_2 + O(1/T)} \frac{|f'_2(n_1)| |f'_2(n_2)|}{n_1 n_2}.

The term {f'_2(n_1)} vanishes unless {n_1 \geq x^{\varepsilon_2}}, and we have {n_2 = (1+O(1/T)) n_1}, so we can bound the right-hand side by

\displaystyle  \ll T \sum_{n_1 \geq x^{\varepsilon_2}} \frac{|f'_2(n_1)|}{n_1}\sum_{n_2 = (1+O(1/T)) n_1} |f'_2(n_2)|.

The inner summand is bounded by {1} and supported on those {n_2} that are not divisible by any small primes. From Proposition 5 and Mertens’ theorem we conclude that

\displaystyle  \sum_{n_2 = (1+O(1/T)) n_1} |f'_2(n_2)| \ll \frac{1}{T \varepsilon_2 \log x} n_1

and thus

\displaystyle  \int_{|t| \leq T} |F'_2(1+it)|^2\ dt \ll \frac{1}{\varepsilon_2 \log x} \sum_{n_1} \frac{|f'_2(n_1)|}{n_1}

\displaystyle  \ll \frac{1}{\varepsilon_2 \log x} \prod_{x^{\varepsilon_2} < p \leq x^{\varepsilon_1}} (1-\frac{1}{p})^{-1}

\displaystyle  \ll \frac{\varepsilon_1}{\varepsilon^2_2 \log x}

as desired. \Box

We also have an {L^\infty} estimate:

Lemma 8 For sufficiently large {x}, we have

\displaystyle  |F_1(1+it)| \ll e^{-M} \log x

for all {|t| \leq T}.

Proof: From Euler products, Mertens’ theorem, and (1) we have

\displaystyle  F_1(1+it) = \prod_{p \leq x^{\varepsilon_2}} \sum_{j=0}^\infty \frac{f(p^j)}{p^{j(1+it)}}

\displaystyle  \ll \prod_{p \leq x^{\varepsilon_2}} |1 + \frac{f(p) p^{-it}}{p}|

\displaystyle  \ll \exp( \sum_{p \leq x^{\varepsilon_2}} \hbox{Re} \frac{f(p) p^{-it}}{p} )

\displaystyle  \ll \varepsilon_2 \log x \exp( - \sum_{p \leq x^{\varepsilon_2}} \frac{1 - \hbox{Re} f(p) p^{-it}}{p} )

\displaystyle  \ll \varepsilon_2 \log x \exp( -M + \log \frac{1}{\varepsilon_2} )

\displaystyle  \ll e^{-M} \log x

as desired. \Box

Applying Hölder’s inequality, we conclude that

\displaystyle  \frac{1}{x} \sum_{n \leq x} f(n) \ll \frac{1}{\varepsilon_1^{1/2} \varepsilon_2} e^{-M} + e^{-1/\varepsilon_1} + \frac{\varepsilon_2}{\varepsilon_1}.

Setting {\varepsilon_1 := 1/M} and {\varepsilon_2 := e^{-M/2}} we obtain the claim.

Filed under: expository, math.NT Tagged: Halasz's theorem, pretentious multiplicative functions

Chad Orzel089/366: Seasonal

Saturday was still warm, but grey and rainy, so we needed indoor activities. We took the kids down to the Roberson Museum to see their annual Christmas display, with lots of trees donated and decorated by local organizations, toys and games from the 50’s and 60’s, and a giant model train display. And the “International Forest” of trees decorated in the style of various countries.

The Polish display at the Roberson “International Forest” of Christmas trees.

This is the Polish display; SteelyKid was duly impressed to learn that the word for the heraldic bird at the top of that flag is the same as her surname.

Anyway, it was all seasonally appropriate and stuff, then we went home, and after dinner watched the “Frosty the Snowman” cartoon. But we snapped the TV off before “Frosty Returns” could start, because we’re not monsters.

Today, we’ll be driving back up to Niskayuna, and back to a semi-normal routine.

November 28, 2015

John PreskillBTZ black holes for #BlackHoleFriday

Yesterday was a special day. And no I’m not referring to #BlackFriday — but rather to #BlackHoleFriday. I just learned that NASA spawned this social media campaign three years ago. The timing of this year’s Black Hole Friday is particularly special because we are exactly 100 years + 2 days after Einstein published his field equations of general relativity (GR). When Einstein introduced his equations he only had an exact solution describing “flat space.” These equations are notoriously difficult to solve so their introduction sent out a call-to-arms to mathematically-minded-physicists and physically-minded-mathematicians who scrambled to find new solutions.

If I had to guess, Karl Schwarzschild probably wasn’t sleeping much exactly a century ago. Not only was he deployed to the Russian Front as a soldier in the German Army, but a little more than one month after Einstein introduced his equations, Schwarzschild was the first to find another solution. His solution describes the curvature of spacetime outside of a spherically symmetric mass. It has the incredible property that if the spherical mass is compact enough then spacetime will be so strongly curved that nothing will be able to escape (at least from the perspective of GR; we believe that there are corrections to this when you add quantum mechanics to the mix.) Schwarzschild’s solution took black holes from the realm of clever thought experiments to the status of being a testable prediction about how Nature behaves.

It’s worth mentioning that between 1916 and 1918 Reissner and Nordstrom generalized Schwarzschild’s solution to one which also has electric charge. Kerr found a solution in 1963 which describes a spinning black hole, and this was generalized by Newman et al in 1965 to a solution which includes both spin (angular momentum) and electric charge. These solutions are symmetric about their spin axis. It is also worth mentioning that we can write sensible equations which describe small perturbations around these solutions.

And that’s pretty much all that we’ve got in terms of exact solutions which are physically relevant to the 3+1 dimensional spacetime that we live in (it takes three spatial coordinates to specify a meeting location and another +1 to specify the time.) This is the setting that’s closest to our everyday experiences and these solutions are the jumping off points for trying to understand the role that black holes play in astrophysics. As I already mentioned, studying GR using pen and paper is quite challenging. But one exciting direction in the study of astrophysical black holes comes from recent progress in the field of numerical relativity, which discretizes the calculations and then uses supercomputers to study approximate time dynamics.

Artist’s rendition of dust+gas in an “accretion disk” orbiting a spinning black hole. Friction in the accretion disk generates temperatures oftentimes exceeding 10M degrees C (2000 times the temperature of the Sun.) This high temperature region emits x-rays and other detectable EM radiation. The image also shows a jet of plasma. The mechanism for this plasma jet is not yet well understood. Studying processes like this requires all of the tools that we have available to us: from numerical relativity; to cutting edge space observatories like NuSTAR; to LIGO in the immediate future *hopefully.* Image credit: NASA/Caltech-JPL

I don’t expect many of you to be experts in the history outlined above. And I expect even fewer of you to know that Einstein’s equations still make sense in any number of dimensions. In this context, I want to briefly introduce a 2+1 dimensional solution called the BTZ black hole and outline why it has been astonishingly important since it was introduced 23 years ago by Bañados, Teitelboim and Zanelli (their paper has been cited over 2k times, which is a tremendous number for theoretical physics.)

There are many different viewpoints which yield the BTZ black hole and this is one of them. This is a  time=0 slice of the BTZ black hole obtained by gluing together special curves (geodesics) related to each other by a translation symmetry. The BTZ black hole is a solution of Einstein’s equations in 2+1d which has two asymptotic regions which are causally separated from each other by an event horizon. The arrows leading to “quantum states” come into play when you use the BTZ black hole as a toy model for thinking about quantum gravity.

One of the most striking implications of Einstein’s theory of general relativity is that our universe is described by a curved geometry which we call spacetime. Einstein’s equations describe the dynamical interplay between the curvature of spacetime and the distribution of energy+matter. This may be counterintuitive, but there are many solutions even when there is no matter or energy in the spacetime. We call these vacuum solutions. Vacuum solutions can have positive, negative or zero “curvature.”

As 2d surfaces: the sphere is positively curved; a saddle has negative curvature; and a plane has zero curvature.

It came as a great surprise when BTZ showed in 1992 that there is a vacuum solution in 2+1d which has many of the same properties as the more physical 3+1d black holes mentioned above. But most excitingly — and something that I can’t imagine BTZ could have anticipated — is that their solution has become the toy model of all toy models for trying to understand “quantum gravity.”

GR in 2+1d has many convenient properties. Two beautiful things that happen in 2+1d are that:

  • There are no gravitational waves. Technically, this is because the Riemann tensor is fully determined by the Ricci tensor — the number of degrees of freedom in this system is exactly equal to the number of constraints given by Einstein’s equations. This makes GR in 2+1d something called a “topological field theory” which is much easier to quantize than its full blown gauge theory cousin in 3+1d.
  • The maximally symmetric vacuum solution with negative curvature, which we call Anti de-Sitter space, has a beautiful symmetry. This manifold is exactly equal to the “group manifold” SL(2,R). This enables us to translate many challenging analytical questions into simple algebraic computations. In particular, it enables us to find a huge category of solutions which we call multiboundary wormholes, with BTZ being the most famous example.
Some "multiboundary wormhole" pictures that I made.

Some “multiboundary wormhole” pictures that I made. The left shows the constant time=0 slice for a few different solutions and what you are left with after gluing according to the equations on the right. These are solutions to GR in 2+1d.

These properties make 2+1d GR particularly useful as a sandbox for making progress towards a theory of quantum gravity. As examples of what this might entail:

  • Classically, a particle is in one definite location. In quantum mechanics, a particle can be in a superposition of places. In quantum gravity, can spacetime be in a superposition of geometries? How does this work?
  • When you go from classical physics to quantum physics, tunneling becomes a thing. Can the same thing happen with quantum gravity? Where we tunnel from one spacetime geometry to another? What controls the transition amplitudes?
  • The holographic principle is an incredibly important idea in modern theoretical physics. It stems from the fact that the entropy of a black hole is proportional to the area of its event horizon — whereas the entropy of a glass of water is proportional to the volume of water inside the glass. We believe that this reduction in dimensionality is wildly significant.

A few years after the holographic principle was introduced in the early 1990’s by Gerard ‘t Hooft and Lenny Susskind, Juan Maldacena came up with a concrete manifestation which is now called the AdS/CFT correspondence. Maldacena’s paper has been cited over 14k times, making it one of the most cited theoretical physics papers of all time. However, despite having a “correspondence”, it’s still very hard to translate questions back and forth between the “gravity and quantum sides” in practice. The BTZ black hole is the gravity solution where this correspondence is best understood. Its quantum dual is a state called the thermofield double, which is given by: |\Psi_{CFT}\rangle = \frac{1}{\sqrt{Z}} \sum_{n=1}^{\infty} e^{-\beta E_n/2} |n\rangle_1 \otimes |n \rangle_2 . This describes a quantum state which lives on two circles (see my BTZ picture above.) There is entanglement between the two circles. If an experimentalist only had access to one of the circles and were asked to figure out what state they have, their best guess would be a “thermal state”: a state that has been exposed to a heat-bath for too long and has lost all of its initial quantum coherence.
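
To see concretely why tracing out one circle of the thermofield double leaves a thermal state, here is a small NumPy sketch; the finite dimension, the random toy Hamiltonian and the inverse temperature are all arbitrary illustrative choices standing in for the actual CFT.

    import numpy as np

    rng = np.random.default_rng(0)
    d, beta = 6, 1.3                 # toy Hilbert space dimension and inverse temperature

    # Random Hermitian "Hamiltonian" and its spectral decomposition.
    A = rng.normal(size=(d, d))
    H = (A + A.T) / 2
    E, V = np.linalg.eigh(H)         # eigenvalues E_n, eigenvectors |n>

    # Thermofield double |Psi> = Z^{-1/2} sum_n e^{-beta E_n / 2} |n>_1 |n>_2,
    # stored as the d x d matrix of its coefficients in the original basis.
    w = np.exp(-beta * E / 2)
    Z = np.sum(w ** 2)
    psi = (V * w) @ V.T / np.sqrt(Z)

    # Reduced density matrix of side 1, rho_1 = Tr_2 |Psi><Psi|, versus the Gibbs state.
    rho1 = psi @ psi.T
    gibbs = (V * np.exp(-beta * E)) @ V.T / Z
    print(np.allclose(rho1, gibbs))  # True: the reduced state is exactly thermal

Everything an observer confined to one circle can measure is encoded in rho1, which is why their best guess is precisely the thermal state described above.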

It is in this sense that the BTZ black hole has been hugely important. It’s also evidence of how mysterious Einstein’s equations still remain, even to this day. We still don’t have exact solutions for many settings of interest, like for two black holes merging in 3+1d. It was only in 1992 that BTZ came up with their solution–77 years after Einstein formulated his theory! Judging by historical precedence, exactly solvable toy models are profoundly useful and BTZ has already proven to be an important signpost as we continue on our quest to understand quantum gravity. There’s already broad awareness that astrophysical black holes are fascinating objects. In this post I hope I conveyed a bit of the excitement surrounding how black holes are useful in a different setting — in aiding our understanding of quantum gravity. And all of this is in the spirit of #BlackHoleFriday, of course.

Terence Tao275A, Notes 2: Product measures and independence

In the previous set of notes, we constructed the measure-theoretic notion of the Lebesgue integral, and used this to set up the probabilistic notion of expectation on a rigorous footing. In this set of notes, we will similarly construct the measure-theoretic concept of a product measure (restricting to the case of probability measures to avoid unnecessary technicalities), and use this to set up the probabilistic notion of independence on a rigorous footing. (To quote Durrett: “measure theory ends and probability theory begins with the definition of independence.”) We will be able to take virtually any collection of random variables (or probability distributions) and couple them together to be independent via the product measure construction, though for infinite products there is the slight technicality (a requirement of the Kolmogorov extension theorem) that the random variables need to range in standard Borel spaces. This is not the only way to couple together such random variables, but it is the simplest and the easiest to compute with in practice, as we shall see in the next few sets of notes.

— 1. Product measures —

It is intuitively obvious that Lebesgue measure {m^2} on {{\bf R}^2} ought to be related to Lebesgue measure {m} on {{\bf R}} by the relationship

\displaystyle  m^2( E_1 \times E_2 ) = m(E_1) m(E_2) \ \ \ \ \ (1)

for any Borel sets {E_1, E_2 \subset {\bf R}}. This is in fact true (see Exercise 4 below), and is part of a more general phenomenon, which we phrase here in the case of probability measures:

Theorem 1 (Product of two probability spaces) Let {(\Omega_1, {\mathcal F}_1, \mu_1)} and {(\Omega_2, {\mathcal F}_2, \mu_2)} be probability spaces. Then there is a unique probability measure {\mu_1 \times \mu_2} on {(\Omega_1 \times \Omega_2, {\mathcal F}_1 \times {\mathcal F}_2)} with the property that

\displaystyle  \mu_1 \times \mu_2( E_1 \times E_2 ) = \mu_1(E_1) \mu_2(E_2) \ \ \ \ \ (2)

for all {E_1 \in {\mathcal F}_1, E_2 \in {\mathcal F}_2}. Furthermore, we have the following two facts:

  • (Tonelli theorem) If {f: \Omega_1 \times \Omega_2 \rightarrow [0,+\infty]} is measurable, then for each {\omega_1 \in \Omega_1}, the function {\omega_2 \mapsto f(\omega_1,\omega_2)} is measurable on {\Omega_2}, and the function {\omega_1 \mapsto \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)} is measurable on {\Omega_1}. Similarly, for each {\omega_2 \in \Omega_2}, the function {\omega_1 \mapsto f(\omega_1,\omega_2)} is measurable on {\Omega_1} and {\omega_2 \mapsto \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)} is measurable on {\Omega_2}. Finally, we have

    \displaystyle  \int_{\Omega_1 \times \Omega_2} f\ d(\mu_1 \times \mu_2) = \int_{\Omega_1} (\int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)) d\mu_1(\omega_1)

    \displaystyle  = \int_{\Omega_2} (\int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)) d\mu_2(\omega_2).

  • (Fubini theorem) If {f: \Omega_1 \times \Omega_2 \rightarrow {\bf C}} is absolutely integrable, then for {\mu_1}-almost every {\omega_1 \in \Omega_1}, the function {\omega_2 \mapsto f(\omega_1,\omega_2)} is absolutely integrable on {\Omega_2}, and the function {\omega_1 \mapsto \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)} is absolutely integrable on {\Omega_1}. Similarly, for {\mu_2}-almost every {\omega_2 \in \Omega_2}, the function {\omega_1 \mapsto f(\omega_1,\omega_2)} is absolutely integrable on {\Omega_1} and {\omega_2 \mapsto \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)} is absolutely integrable on {\Omega_2}. Finally, we have

    \displaystyle  \int_{\Omega_1 \times \Omega_2} f\ d(\mu_1 \times \mu_2) = \int_{\Omega_1} (\int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)) d\mu_1(\omega_1)

    \displaystyle  = \int_{\Omega_2} (\int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)) d\mu_2(\omega_2).

The Fubini and Tonelli theorems are often used together (so much so that one may refer to them as a single theorem, the Fubini-Tonelli theorem, often also just referred to as Fubini’s theorem in the literature). For instance, given an absolutely integrable function {f_1: \Omega_1 \rightarrow {\bf C}} and an absolutely integrable function {f_2: \Omega_2 \rightarrow {\bf C}}, the Tonelli theorem tells us that the tensor product {f_1 \otimes f_2: \Omega_1 \times \Omega_2 \rightarrow {\bf C}} defined by

\displaystyle  (f_1 \otimes f_2)(\omega_1, \omega_2) := f_1(\omega_1) f_2(\omega_2)

for {\omega_1 \in \Omega_1,\omega_2 \in \Omega_2}, is absolutely integrable and one has the factorisation

\displaystyle  \int_{\Omega_1 \times \Omega_2} f_1 \otimes f_2\ d(\mu_1 \times \mu_2) = (\int_{\Omega_1} f_1\ d\mu_1) (\int_{\Omega_2} f_2\ d\mu_2). \ \ \ \ \ (3)
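
The factorisation (3) is easy to test numerically; in the Python sketch below both factors are the uniform probability measure on {[0,1]} and {f_1,f_2} are two arbitrary bounded test functions (all of these are illustrative choices).

    import numpy as np

    rng = np.random.default_rng(0)
    f1 = lambda x: np.cos(3 * x)
    f2 = lambda y: y ** 2 + 1

    # Monte Carlo check of (3) for the uniform measures on [0,1]: compare
    # E[f1(omega_1) f2(omega_2)] under the product measure with E[f1] * E[f2].
    n = 10**6
    w1, w2 = rng.random(n), rng.random(n)
    lhs = np.mean(f1(w1) * f2(w2))
    rhs = np.mean(f1(rng.random(n))) * np.mean(f2(rng.random(n)))
    print(lhs, rhs)   # the two values agree up to Monte Carlo error of size about n^{-1/2}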

Our proof of Theorem 1 will be based on the monotone class lemma that allows one to conveniently generate a {\sigma}-algebra from a Boolean algebra. (In Durrett, the closely related {\pi-\lambda} theorem is used in place of the monotone class lemma.) Define a monotone class in a set {\Omega} to be a collection {{\mathcal F}} of subsets of {\Omega} with the following two closure properties:

  • If {E_1 \subset E_2 \subset \ldots} are a countable increasing sequence of sets in {{\mathcal F}}, then {\bigcup_{n=1}^\infty E_n \in {\mathcal F}}.
  • If {E_1 \supset E_2 \supset \ldots} are a countable decreasing sequence of sets in {{\mathcal F}}, then {\bigcap_{n=1}^\infty E_n \in {\mathcal F}}.

Thus for instance any {\sigma}-algebra is a monotone class, but not conversely. Nevertheless, there is a key way in which monotone classes “behave like” {\sigma}-algebras:

Lemma 2 (Monotone class lemma) Let {{\mathcal A}} be a Boolean algebra on {\Omega}. Then {\langle {\mathcal A} \rangle} is the smallest monotone class that contains {{\mathcal A}}.

Proof: Let {{\mathcal F}} be the intersection of all the monotone classes that contain {{\mathcal A}}. Since {\langle {\mathcal A} \rangle} is clearly one such class, {{\mathcal F}} is a subset of {\langle {\mathcal A} \rangle}. Our task is then to show that {{\mathcal F}} contains {\langle {\mathcal A} \rangle}.

It is also clear that {{\mathcal F}} is a monotone class that contains {{\mathcal A}}. By replacing all the elements of {{\mathcal F}} with their complements, we see that {{\mathcal F}} is necessarily closed under complements.

For any {E \in {\mathcal A}}, consider the set {{\mathcal C}_E} of all sets {F \in {\mathcal F}} such that {F \backslash E}, {E \backslash F}, {F \cap E}, and {\Omega \backslash (E \cup F)} all lie in {{\mathcal F}}. It is clear that {{\mathcal C}_E} contains {{\mathcal A}}; since {{\mathcal F}} is a monotone class, we see that {{\mathcal C}_E} is also. By definition of {{\mathcal F}}, we conclude that {{\mathcal C}_E = {\mathcal F}} for all {E \in {\mathcal A}}.

Next, let {{\mathcal D}} be the set of all {E \in {\mathcal F}} such that {F \backslash E}, {E \backslash F}, {F \cap E}, and {\Omega \backslash (E \cup F)} all lie in {{\mathcal F}} for all {F \in {\mathcal F}}. By the previous discussion, we see that {{\mathcal D}} contains {{\mathcal A}}. One also easily verifies that {{\mathcal D}} is a monotone class. By definition of {{\mathcal F}}, we conclude that {{\mathcal D} = {\mathcal F}}. Since {{\mathcal F}} is also closed under complements, this implies that {{\mathcal F}} is closed with respect to finite unions. Since this class also contains {{\mathcal A}}, which contains {\emptyset}, we conclude that {{\mathcal F}} is a Boolean algebra. Since {{\mathcal F}} is also closed under increasing countable unions, we conclude that it is closed under arbitrary countable unions, and is thus a {\sigma}-algebra. As it contains {{\mathcal A}}, it must also contain {\langle {\mathcal A} \rangle}. \Box

We now begin the proof of Theorem 1. We begin with the uniqueness claim. Suppose that we have two measures {\nu, \nu'} on {\Omega_1 \times \Omega_2} that are product measures of {\mu_1} and {\mu_2} in the sense that

\displaystyle  \nu(E_1 \times E_2) = \nu'(E_1 \times E_2) = \mu_1(E_1) \times \mu_2(E_2) \ \ \ \ \ (4)

for all {E_1 \in {\mathcal F}_1} and {E_2 \in {\mathcal F}_2}. If we then set {{\mathcal F}} to be the collection of all {E \in {\mathcal F}_1 \times {\mathcal F}_2} such that {\nu(E) = \nu'(E)}, then {{\mathcal F}} contains all sets of the form {E_1 \times E_2} with {E_1 \in {\mathcal F}_1} and {E_2 \in {\mathcal F}_2}. In fact {{\mathcal F}} contains the collection {{\mathcal A}} of all sets that are “elementary” in the sense that they are of the form {\bigcup_{i=1}^n E_{1,i} \times E_{2,i}} for finite {n} and {E_{1,i} \in {\mathcal F}_1, E_{2,i} \in {\mathcal F}_2} for {i=1,\dots,n}, since such sets can be easily decomposed into a finite union of disjoint products {E'_{1,i} \times E'_{2,i}}, at which point the claim follows from (4) and finite additivity. But {{\mathcal A}} is a Boolean algebra that generates {{\mathcal F}_1 \times {\mathcal F}_2} as a {\sigma}-algebra, and from continuity from above and below we see that {{\mathcal F}} is a monotone class. By the monotone class lemma, we conclude that {{\mathcal F}} is all of {{\mathcal F}_1 \times {\mathcal F}_2}, and hence {\nu=\nu'}. This gives uniqueness.

Now we prove existence. We first claim that for any measurable set {E \in {\mathcal F}_1 \times {\mathcal F}_2} and any {\omega_1 \in \Omega_1}, the slice {E_{\omega_1} := \{ \omega_2 \in \Omega_2: (\omega_1, \omega_2) \in E \}} is measurable in {{\mathcal F}_2}. Indeed, the claim is obvious for sets {E} that are “elementary” in the sense that they belong to the Boolean algebra {{\mathcal A}} defined previously, and the collection of all {E} for which the claim holds is a monotone class, so the claim follows from the monotone class lemma. A similar argument (relying on monotone or dominated convergence) shows that the function

\displaystyle  \omega_1 \mapsto \mu_{\Omega_2}(E_{\omega_1}) = \int_{\Omega_2} 1_E( \omega_1, \omega_2)\ d\mu_2(\omega_2)

is measurable in {\Omega_1} for all {E \in {\mathcal F}_1 \times {\mathcal F}_2}. Thus, for any {E \in {\mathcal F}_1 \times {\mathcal F}_2}, we can define the quantity {(\mu_1 \times \mu_2)(E)} by

\displaystyle (\mu_1 \times \mu_2)(E) := \int_{\Omega_1} \mu_{\Omega_2}(E_{\omega_1})\ d\mu_1(\omega_1)

\displaystyle  = \int_{\Omega_1}(\int_{\Omega_2} 1_E( \omega_1, \omega_2)\ d\mu_2(\omega_2))\ d\mu_1(\omega_1).

A routine application of the monotone convergence theorem verifies that {\mu_1 \times \mu_2} is a countably additive measure; one easily checks that (2) holds for all {E_1 \in {\mathcal F}_1, E_2 \in {\mathcal F}_2}, and in particular {\mu_1 \times \mu_2} is a probability measure.
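
The defining formula above (integrating the slice measures {\mu_{\Omega_2}(E_{\omega_1})} in {\omega_1}) can be tried out numerically; in the Python sketch below both factors are the uniform measure on {[0,1]} and {E} is the quarter disk, an arbitrary illustrative choice whose product measure should be {\pi/4}.

    import numpy as np

    # Numerical version of the defining formula for (mu_1 x mu_2)(E): integrate the
    # slice measures mu_2(E_{omega_1}) over omega_1.  Here mu_1 = mu_2 = uniform on
    # [0,1] and E = {(x, y): x^2 + y^2 <= 1}, whose slices have measure sqrt(1 - x^2).
    N = 10**5
    x = (np.arange(N) + 0.5) / N                     # midpoint quadrature in omega_1
    slice_measure = np.sqrt(np.clip(1 - x**2, 0, None))
    print(slice_measure.mean(), np.pi / 4)           # agree to quadrature accuracy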

By construction, we see that the identity

\displaystyle  \int_{\Omega_1 \times \Omega_2} f\ d(\mu_1 \times \mu_2) = \int_{\Omega_1} (\int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2))\ d\mu_1(\omega_1)

holds (with all functions integrated being measurable) whenever {f} is an indicator function {f=1_E} with {E \in {\mathcal F}_1 \times {\mathcal F}_2}. By linearity of integration, the same identity holds (again with all functions measurable) when {f: \Omega_1 \times \Omega_2 \rightarrow [0,+\infty]} is an unsigned simple function. Since any unsigned measurable function {f} can be expressed as the monotone non-decreasing limit of unsigned simple functions {f_n} (for instance, one can round {f} down to the largest multiple of {2^{-n}} that is less than {n} and {f}), the above identity also holds for unsigned measurable {f} by the monotone convergence theorem. Applying this fact to the absolute value {|f|} of an absolutely integrable function {f: \Omega_1 \times \Omega_2 \rightarrow {\bf C}}, we conclude for such functions that

\displaystyle  \int_{\Omega_1} (\int_{\Omega_2} |f|(\omega_1,\omega_2)\ d\mu_2(\omega_2))\ d\mu_1(\omega_1) < \infty

which by Markov’s inequality implies that

\displaystyle  \int_{\Omega_2} |f|(\omega_1,\omega_2)\ d\mu_2(\omega_2) < \infty

for {\mu_1}-almost every {\omega_1 \in \Omega_1}. In other words, the function {\omega_2 \mapsto f(\omega_1,\omega_2)} is absolutely integrable on {\Omega_2} for {\mu_1}-almost every {\omega_1 \in \Omega_1}. By monotonicity we conclude that

\displaystyle  \int_{\Omega_1} |\int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)|\ d\mu_1(\omega_1) < \infty

and hence the function {\omega_1 \mapsto \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)} is absolutely integrable. Hence it makes sense to ask whether the identity

\displaystyle  \int_{\Omega_1 \times \Omega_2} f\ d(\mu_1 \times \mu_2) = \int_{\Omega_1} \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)\ d\mu_1(\omega_1)

holds for absolutely integrable {f}, as both sides are well-defined. We have already established this claim when {f} is unsigned and absolutely integrable; by subtraction this implies the claim for real-valued absolutely integrable {f}, and by taking real and imaginary parts we obtain the claim for complex-valued absolutely integrable {f}.

We may reverse the roles of {\Omega_1} and {\Omega_2}, and define {\mu_1 \times \mu_2} instead by the formula

\displaystyle (\mu_1 \times \mu_2)(E) = \int_{\Omega_2}(\int_{\Omega_1} 1_E( \omega_1, \omega_2)\ d\mu_1(\omega_1))\ d\mu_2(\omega_2).

By the previously proved uniqueness of product measure, we see that this defines the same product measure {\mu_1 \times \mu_2} as previously. Repeating the previous arguments we obtain all the above claims with the roles of {\Omega_1} and {\Omega_2} reversed. This gives all the claims required for Theorem 1.

One can extend the product construction easily to finite products:

Exercise 3 (Finite products) Show that for any finite collection {(\Omega_i, {\mathcal F}_i, \mu_i)_{i \in A}} of probability spaces, there exists a unique probability measure {\prod_{i \in A} \mu_i} on {(\prod_{i \in A} \Omega_i, \prod_{i \in A} {\mathcal F}_i)} such that

\displaystyle  (\prod_{i \in A}\mu_i)(\prod_{i \in A} E_i) = \prod_{i \in A} \mu_i(E_i)

whenever {E_i \in {\mathcal F}_i} for {i \in A}. Furthermore, show that

\displaystyle  \prod_{i \in A}\mu_i = (\prod_{i \in A_1}\mu_i) \times (\prod_{i \in A_2}\mu_i)

for any partition {A = A_1 \uplus A_2} (after making the obvious identification between {\prod_{i \in A} \Omega_i} and {(\prod_{i \in A_1} \Omega_i) \times (\prod_{i \in A_2} \Omega_i)}). Thus for instance one has the associativity property

\displaystyle  \mu_1 \times \mu_2 \times \mu_3 = (\mu_1 \times \mu_2) \times \mu_3 = \mu_1 \times (\mu_2 \times \mu_3)

for any probability spaces {(\Omega_i, {\mathcal F}_i, \mu_i)} for {i=1,\dots,3}.

By writing {\prod_{i \in A} \mu_i} as products of pairs of probability spaces in many different ways, one can obtain a higher-dimensional analogue of the Fubini and Tonelli theorems; we leave the precise statement of such a theorem to the interested reader.

It is important to be aware that the Fubini theorem identity

\displaystyle  \int_{\Omega_1} (\int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)) d\mu_1(\omega_1) = \int_{\Omega_2} (\int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)) d\mu_2(\omega_2) \ \ \ \ \ (5)

for measurable functions {f: \Omega_1 \times \Omega_2 \rightarrow {\bf C}} that are not unsigned is usually only justified when {f} is absolutely integrable on {\Omega_1 \times \Omega_2}, or equivalently (by the Tonelli theorem) when the function {\omega_1 \mapsto \int_{\Omega_2} |f(\omega_1,\omega_2)|\ d\mu_2(\omega_2)} is absolutely integrable on {\Omega_1} (or when {\omega_2 \mapsto \int_{\Omega_1} |f(\omega_1,\omega_2)|\ d\mu_1(\omega_1)} is absolutely integrable on {\Omega_2}). Without this joint absolute integrability (and without any unsigned property on {f}), the identity (5) can fail even if both sides are well-defined. For instance, let {\Omega_1 = \Omega_2} be the unit interval {[0,1]}, and let {\mu_1 = \mu_2} be the uniform probability measure on this interval, and set

\displaystyle  f(\omega_1,\omega_2) := \sum_{n=1}^\infty 2^n 1_{[2^{-n}, 2^{-n+1})}(\omega_1)

\displaystyle  \times (2^n 1_{[2^{-n}, 2^{-n+1})}(\omega_2) - 2^{n+1} 1_{[2^{-n-1}, 2^{-n})}(\omega_2)).

One can check that both sides of (5) are well-defined, but that the left-hand side is {0} and the right-hand side is {1}. Of course, this function is neither unsigned nor jointly absolutely integrable, so this counterexample does not violate either of the Fubini or Tonelli theorems. Thus one should take care to only interchange integrals when the integrands are known to be either unsigned or jointly absolutely integrable, or if one has another way to rigorously justify the exchange of integrals.
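
Since the integrand is constant on dyadic blocks, the two iterated integrals in this example can be evaluated exactly; the short Python sketch below does the bookkeeping with exact rational arithmetic (blocks beyond the truncation level N contribute nothing further to either iterated integral, since the corresponding inner integrals vanish).

    from fractions import Fraction as F

    # Exact evaluation of the two iterated integrals in the counterexample above.
    # On the dyadic blocks I_n = [2^{-n}, 2^{-n+1}) the integrand is constant in
    # each variable separately, so the inner integrals have closed forms.
    N = 30

    def inner_over_omega2(n):
        # integral over omega_2 of f(omega_1, .) for omega_1 in I_n
        return F(2) ** n * (F(2) ** n * F(1, 2) ** n - F(2) ** (n + 1) * F(1, 2) ** (n + 1))

    def inner_over_omega1(m):
        # integral over omega_1 of f(., omega_2) for omega_2 in I_m
        pos = F(2) ** m * F(2) ** m * F(1, 2) ** m                                   # term n = m
        neg = F(2) ** (m - 1) * F(2) ** m * F(1, 2) ** (m - 1) if m >= 2 else 0      # term n = m - 1
        return pos - neg

    lhs = sum(inner_over_omega2(n) * F(1, 2) ** n for n in range(1, N + 1))
    rhs = sum(inner_over_omega1(m) * F(1, 2) ** m for m in range(1, N + 1))
    print(lhs, rhs)    # prints 0 and 1, matching the text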

The above theory extends from probability spaces to finite measure spaces, and more generally to measure spaces that are {\sigma}-finite, that is to say they are expressible as the countable union of sets of finite measure. (With a bit of care, some portions of product measure theory are even extendible to non-sigma-finite settings, though I urge caution in applying these results blindly in that case.) We will not give the details of these generalisations here, but content ourselves with one example:

Exercise 4 Establish (4) for all Borel sets {E_1,E_2 \subset {\bf R}}. (Hint: {{\bf R}} can be viewed as the disjoint union of a countable sequence of sets of measure {1}.)

Remark 5 When doing real analysis (as opposed to probability), it is convenient to complete the Borel {\sigma}-algebra {{\mathcal B}[{\bf R}^n]} on spaces such as {{\bf R}^n}, to form the larger Lebesgue {\sigma}-algebra {{\mathcal L}[{\bf R}^n]}, defined as the collection of all subsets {E} in {{\bf R}^n} that differ from a Borel set {F} in {{\bf R}^n} by a sub-null set, in the sense that {E \Delta F \subset G} for some Borel subset {G} of {{\bf R}^n} of zero Lebesgue measure. There are analogues of the Fubini and Tonelli theorems for such complete {\sigma}-algebras; see this previous lecture notes for details. However one should be cautioned that the product {{\mathcal L}[{\bf R}^{n_1}] \times {\mathcal L}[{\bf R}^{n_2}]} of Lebesgue {\sigma}-algebras is not the Lebesgue {\sigma}-algebra {{\mathcal L}[{\bf R}^{n_1+n_2}]}, but is instead an intermediate {\sigma}-algebra between {{\mathcal B}[{\bf R}^{n_1+n_2}]} and {{\mathcal L}[{\bf R}^{n_1+n_2}]}, which causes some additional small complications. For instance, if {f: {\bf R}^{n_1+n_2} \rightarrow {\bf C}} is Lebesgue measurable, then the functions {x_2 \mapsto f(x_1,x_2)} can only be found to be Lebesgue measurable on {{\bf R}^{n_2}} for almost every {x_1 \in {\bf R}^{n_1}}, rather than for all {x_1 \in {\bf R}^{n_1}}. We will not dwell on these subtleties further here, as we will rarely have any need to complete the {\sigma}-algebras used in probability theory.

It is also important in probability theory applications to form the product of an infinite number of probability spaces {(\Omega_i, {\mathcal F}_i, \mu_i)} for {i \in A}, where {A} can be infinite or even uncountable. Recall from Notes 0 that the product {\sigma}-algebra {{\mathcal F}_A := \prod_{i \in A} {\mathcal F}_i} on {\Omega_A := \prod_{i \in A} \Omega_i} is defined to be the {\sigma}-algebra generated by the sets {\pi_j^{-1}(E_j)} for {j \in A} and {E_j \in {\mathcal F}_j}, where {\pi_j: \Omega_A \rightarrow \Omega_j} is the usual coordinate projection. Equivalently, if we define an elementary set to be a subset of {\Omega_A} of the form {\pi_B^{-1}(E_B)}, where {B} is a finite subset of {A}, {\pi_B: \Omega_A \rightarrow \Omega_B} is the obvious projection map to {\Omega_B := \prod_{i \in B} \Omega_i}, and {E_B} is a measurable set in {{\mathcal F}_B := \prod_{i \in B} {\mathcal F}_i}, then {{\mathcal F}_A} can be defined as the {\sigma}-algebra generated by the collection {{\mathcal A}} of elementary sets. (Elementary sets are the measure-theoretic analogue of cylinder sets in point set topology.) For future reference we note the useful fact that {{\mathcal A}} is a Boolean algebra.

We define a product measure {\mu_A = \prod_{i \in A} \mu_i} to be a probability measure on the measurable space {(\Omega_A, {\mathcal F}_A)} which extends all of the finite products in the sense that

\displaystyle  \mu_A( \pi_B^{-1}(E_B) ) = \mu_B(E_B)

for all finite subsets {B} of {A} and all {E_B} in {{\mathcal F}_B}, where {\mu_B := \prod_{i \in B} \mu_i}. If this product measure exists, it is unique:

Exercise 6 Show that for any collection of probability spaces {(\Omega_i, {\mathcal F}_i, \mu_i)} for {i \in A}, there is at most one product measure {\mu_A}. (Hint: adapt the uniqueness argument in Theorem 1 that used the monotone class lemma.)

Exercise 7 Let {\mu_1,\dots,\mu_n} be probability measures on {{\bf R}}, and let {F_1,\dots,F_n: {\bf R} \rightarrow [0,1]} be their Stieltjes measure functions. Show that {\mu_1 \times \dots \times \mu_n} is the unique probability measure on {{\bf R}^n} whose Stieltjes measure function is the tensor product {(t_1,\dots,t_n) \mapsto F_1(t_1) \dots F_n(t_n)} of {F_1,\dots,F_n}.

In the case of finite {A}, the finite product constructed in Exercise 3 is clearly the unique product. But for infinite {A}, the construction of product measure is a more nontrivial issue. We can generalise the problem as follows:

Problem 8 (Extension problem) Let {(\Omega_i, {\mathcal F}_i)_{i \in A}} be a collection of measurable spaces. For each finite {B \subset A}, let {\mu_B} be a probability measure on {(\Omega_B, {\mathcal F}_B)} obeying the compatibility condition

\displaystyle  \mu_B( \pi_{B \rightarrow C}^{-1}(E_C) ) = \mu_C(E_C) \ \ \ \ \ (6)

for all finite {C \subset B \subset A} and {E_C \in {\mathcal F}_C}, where {\pi_{B \rightarrow C}: \Omega_B \rightarrow \Omega_C} is the obvious restriction. Can one then define a probability measure {\mu_A} on {(\Omega_A, {\mathcal F}_A)} such that

\displaystyle  \mu_A( \pi_{B}^{-1}(E_B) ) = \mu_B(E_B) \ \ \ \ \ (7)

for all finite {B \subset A} and {E_B \in {\mathcal F}_B}?

Note that the compatibility condition (6) is clearly necessary if one is to find a measure {\mu_A} obeying (7).

Again, one has uniqueness:

Exercise 9 Show that for any {(\Omega_i, {\mathcal F}_i)_{i \in A}} and {\mu_B} for finite {B \subset A} as in the above extension problem, there is at most one probability measure {\mu_A} with the stated properties.

The extension problem is trivial for finite {A}, but for infinite {A} there are unfortunately examples where the probability measure {\mu_A} fails to exist. However, there is one key case in which we can build the extension, thanks to the Kolmogorov extension theorem. Call a measurable space {(\Omega,{\mathcal F})} standard Borel if it is isomorphic as a measurable space to a Borel subset of the unit interval {[0,1]} with the Borel {\sigma}-algebra, that is to say there is a bijection {f} from {\Omega} to a Borel subset {E} of {[0,1]} such that {f: \Omega \rightarrow E} and {f^{-1}: E \rightarrow \Omega} are both measurable. (In Durrett, such spaces are called nice spaces.) Note that one can easily replace {[0,1]} by other standard spaces such as {{\bf R}} if desired, since these spaces are isomorphic as measurable spaces (why?).

Theorem 10 (Kolmogorov extension theorem) Let the situation be as in Problem 8. If all the measurable spaces {(\Omega_i,{\mathcal F}_i)} are standard Borel, then there exists probability measure {\mu_A} solving the extension problem (which is then unique, thanks to Exercise 9).

The proof of this theorem is lengthy and is deferred to the next (optional) section. Specialising to the product case, we conclude

Corollary 11 Let {(\Omega_i, {\mathcal F}_i, \mu_i)_{i \in A}} be a collection of probability spaces with {(\Omega_i, {\mathcal F}_i)} standard Borel. Then there exists a product measure {\prod_{i \in A} \mu_i} (which is then unique, thanks to Exercise 6).

Of course, to use this theorem we would like to have a large supply of standard Borel spaces. Here is one tool that often suffices:

Lemma 12 Let {(X,d)} be a complete separable metric space, and let {\Omega} be a Borel subset of {X}. Then {\Omega} (with the Borel {\sigma}-algebra) is standard Borel.

Proof: Let us call two topological spaces Borel isomorphic if their corresponding Borel structures are isomorphic as measurable spaces. Using the binary expansion, we see that {[0,1]} is Borel isomorphic to {\{0,1\}^{\bf N}} (the countable number of points that have two binary expansions can be easily permuted to obtain a genuine isomorphism). Similarly {[0,1]^{\bf N}} is Borel isomorphic to {\{0,1\}^{{\bf N} \times {\bf N}}}. Since {{\bf N} \times {\bf N}} is in bijection with {{\bf N}}, we conclude that {[0,1]^{\bf N}} is Borel isomorphic to {[0,1]}. Thus it will suffice to show that every complete separable metric space {(X,d)} is Borel isomorphic to a Borel subset of {[0,1]^{\bf N}}. But if we let {q_1,q_2,\dots} be a countable dense subset in {X}, the map

\displaystyle  x \mapsto (\frac{d(x,q_i)}{1+d(x,q_i)})_{i \in {\bf N}}

can easily be seen to be a homeomorphism between {X} and a subset of {[0,1]^{\bf N}}, which is completely metrisable and hence Borel (in fact it is a {G_\delta} set – the countable intersection of open sets – why?). The claim follows. \Box

Exercise 13 (Kolmogorov extension theorem, alternate form) For each natural number {n}, let {\mu_n} be a probability measure on {{\bf R}^n} with the property that

\displaystyle  \mu_{n+1}( B \times {\bf R} ) = \mu_n(B)

for {n \geq 1} and any box {B = [a_1,b_1] \times \dots \times [a_n,b_n]} in {{\bf R}^n}, where we identify {{\bf R}^n \times {\bf R}} with {{\bf R}^{n+1}} in the usual manner. Show that there exists a unique probability measure {\mu_{\bf N}} on {{\bf R}^{\bf N}} (with the product {\sigma}-algebra, or equivalently the Borel {\sigma}-algebra on the product topology) such that

\displaystyle  \mu_{\bf N}( \{ (\omega_i)_{i \in {\bf N}}: (\omega_1,\dots,\omega_n) \in E \} ) = \mu_n(E)

for all {n \geq 1} and Borel sets {E \subset {\bf R}^n}.

— 2. Proof of the Kolmogorov extension theorem (optional) —

We now prove Theorem 10. By the definition of a standard Borel space, we may assume without loss of generality that each {\Omega_i} is a Borel subset of {[0,1]} with the Borel {\sigma}-algebra, and then by extending each {\Omega_i} to {[0,1]} we may in fact assume without loss of generality that each {\Omega_i} is simply {[0,1]} with the Borel {\sigma}-algebra. Thus each {\mu_B} for finite {B \subset A} is a probability measure on the cube {[0,1]^B}.

We will exploit the regularity properties of such measures:

Exercise 14 Let {B} be a finite set, and let {\mu_B} be a probability measure on {[0,1]^B} (with the Borel {\sigma}-algebra). For any Borel set {E} in {[0,1]^B}, establish the inner regularity property

\displaystyle  \mu_B(E) = \sup_{K \subset E, K \hbox{ compact}} \mu_B(K)

and the outer regularity property

\displaystyle  \mu_B(E) = \inf_{U \supset E, U \hbox{ open}} \mu_B(U).

Hint: use the monotone class lemma.

Another way of stating the above exercise is that finite Borel measures on the cube are automatically Radon measures. In fact there is nothing particularly special about the unit cube {[0,1]^B} here; the claim holds for any compact separable metric space. Radon measures are often used in real analysis (see e.g. these lecture notes) but we will not develop their theory further here.

Observe that one can define the elementary measure {\mu_0(E)} of any elementary set {E = \pi_B^{-1}(E_B)} in {[0,1]^A} by defining

\displaystyle  \mu_0( \pi_B^{-1}(E_B) ) := \mu_B( E_B )

for any finite {B \subset A} and any Borel {E_B \subset [0,1]^B}. This definition is well-defined thanks to the compatibility hypothesis (6). From the finite additivity of the {\mu_B} it is easy to see that {\mu_0} is a finitely additive probability measure on the Boolean algebra {{\mathcal A}} of elementary sets.

We would like to extend {\mu_0} to a countably additive probability measure on {{\mathcal F}_A}. The standard approach to do this is via the Carathéodory extension theorem in measure theory (or the closely related Hahn-Kolmogorov theorem); this approach is presented in these previous lecture notes, and a similar approach is taken in Durrett. Here, we will try to avoid developing the Carathéodory extension theorem, and instead take a more direct approach similar to the direct construction of Lebesgue measure, given for instance in these previous lecture notes.

Given any subset {E \subset [0,1]^A} (not necessarily Borel), we define its outer measure {\mu^*(E)} to be the quantity

\displaystyle  \mu^*(E) := \inf \{ \sum_{i=1}^\infty \mu_0(E_i): E_i \hbox{ open elementary cover of } E \},

where we say that {E_1,E_2,\dots} is an open elementary cover of {E} if each {E_i} is an open elementary set, and {E \subset \bigcup_{i=1}^\infty E_i}. Some properties of this outer measure are easily established:

Exercise 15

  • (i) Show that {\mu^*(\emptyset) = 0}.
  • (ii) (Monotonicity) Show that if {E \subset F \subset [0,1]^A} then {\mu^*(E) \leq \mu^*(F)}.
  • (iii) (Countable subadditivity) For any countable sequence {E_1,E_2,\dots} of subsets of {[0,1]^A}, show that {\mu^*( \bigcup_{i=1}^\infty E_i) \leq \sum_{i=1}^\infty \mu^*(E_i)}. In particular (from part (i)) we have the finite subadditivity {\mu^*(E \cup F) \leq \mu^*(E) + \mu^*(F)} for all {E,F \subset [0,1]^A}.
  • (iv) (Elementary sets) If {E} is an elementary set, show that {\mu^*(E) = \mu_0(E)}. (Hint: first establish the claim when {E} is compact, relying heavily on the regularity properties of the {\mu_B} provided by Exercise 14, then extend to the general case by further heavy reliance on regularity.) In particular, we have {\mu^*([0,1]^A) = 1}.
  • (v) (Approximation) Show that if {E \in {\mathcal F}_A}, then for any {\varepsilon > 0} there exists an elementary set {E_\varepsilon} such that {\mu^*(E \Delta E_\varepsilon) < \varepsilon}. (Hint: use the monotone class lemma. When dealing with an increasing sequence of measurable sets {E_n} obeying the required property, approximate these sets by an increasing sequence of elementary sets {E'_n}, and use the finite additivity of elementary measure and the fact that bounded monotone sequences converge.)

From part (v) of the above exercise, we see that every {E \in {\mathcal F}_A} can be viewed as a “limit” of a sequence {E_n} of elementary sets such that {\mu^*(E \Delta E_{n}) < 1/n}. From parts (iii), (iv) we see that the sequence {\mu_0(E_n)} is a Cauchy sequence and thus converges to a limit, which we denote {\mu(E)}; one can check from further applications of (iii), (iv) that this quantity does not depend on the specific choice of {E_n}. (Indeed, from subadditivity we see that {\mu(E) = \mu^*(E)}.) From this definition we see that {\mu} extends {\mu_0} (thus {\mu(E) = \mu_0(E)} for any elementary set {E}), and from the above exercise one checks that {\mu} is countably additive. Thus {\mu} is a probability measure with the desired properties, and the proof of the Kolmogorov extension theorem is complete.

— 3. Independence —

Using the notion of product measure, we can now quickly define the notion of independence:

Definition 16 A collection {(X_i)_{i \in A}} of random variables {X_i} (each of which takes values in some measurable space {R_i}) is said to be jointly independent if the distribution of {(X_i)_{i \in A}} is the product of the distributions of the {X_i}. Or equivalently (after expanding all the definitions), we have

\displaystyle  {\bf P}( \bigwedge_{i \in B} (X_i \in S_i) ) = \prod_{i \in B} {\bf P}(X_i \in S_i)

for all finite {B \subset A} and all measurable subsets {S_i} of {R_i}. We say that two random variables {X,Y} are independent (or that {X} is independent of {Y}) if the pair {(X,Y)} is jointly independent.

It is worth reiterating that unless otherwise specified, all random variables under consideration are being modeled by a single probability space. The notion of independence between random variables does not make sense if the random variables are only being modeled by separate probability spaces; they have to be coupled together into a single probability space before independence becomes a meaningful notion.

Independence is a non-trivial notion only when one has two or more random variables; by chasing through the definitions we see that any collection of zero or one variables is automatically jointly independent.

Example 17 If we let {(X,Y)} be drawn uniformly from a product {E \times F} of two Borel sets {E,F} in {{\bf R}} of positive finite Lebesgue measure, then {X} and {Y} are independent. However, if {(X,Y)} is drawn uniformly from another shape (e.g. a parallelogram), then one usually does not expect to have independence.
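
One can get a quick numerical feel for this example with a small Monte Carlo sketch (assuming Python with numpy is available; the particular sets and the shear below are arbitrary illustrative choices, not anything mandated by the example):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10**6

    # (X,Y) uniform on the product set [0,1] x [0,1]: the product formula holds.
    x, y = rng.random(n), rng.random(n)
    print(np.mean((x <= 0.75) & (y <= 0.5)),          # ~ 0.375
          np.mean(x <= 0.75) * np.mean(y <= 0.5))     # ~ 0.375

    # (X,Y) uniform on a parallelogram (a shear of the square, which preserves area):
    u, v = rng.random(n), rng.random(n)
    xp, yp = u + 0.5 * v, v
    print(np.mean((xp <= 0.75) & (yp <= 0.5)),        # ~ 0.3125
          np.mean(xp <= 0.75) * np.mean(yp <= 0.5))   # ~ 0.25, so not independent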

As a special case of the above definition, a finite family {X_1,\dots,X_n} of random variables taking values in {R_1,\dots,R_n} is jointly independent if one has

\displaystyle  {\bf P}( \bigwedge_{i=1}^n (X_i\in S_i) ) = \prod_{i=1}^n {\bf P}( X_i \in S_i )

for all measurable {S_i} in {R_i} for {i=1,\dots,n}.

Suppose that {(X_i)_{i \in A}} is a family of jointly independent random variables, with each {X_i} taking values in {R_i}. From Exercise 3 we see that

\displaystyle  {\bf P}( \bigwedge_{j=1}^J (X_{A_j} \in S_j) ) = \prod_{j=1}^J {\bf P}( X_{A_j} \in S_j )

whenever {A_1,\dots,A_J} are disjoint finite subsets of {A}, {X_{A_j}} is the tuple {(X_i)_{i \in A_j}}, and {S_j} is a measurable subset of {\prod_{i \in A_j} R_i}. In particular, we see that the tuples {X_{A_1},\dots,X_{A_J}} are also jointly independent. This implies in turn that {F_1(X_{A_1}),\dots,F_J(X_{A_J})} are jointly independent for any measurable functions {F_j: \prod_{i \in A_j} R_i \rightarrow Y_j}. Thus, for instance, if {X_1,X_2,X_3,X_4} are jointly independent random variables taking values in {R_1,R_2,R_3,R_4} respectively, then {F(X_1,X_2)} and {G(X_3)} are independent for any measurable {F: R_1 \times R_2 \rightarrow Y} and {G: R_3 \rightarrow Y'}. In particular, if two scalar random variables {X,Y} are jointly independent of a third random variable {Z} (i.e. the triple {X,Y,Z} are jointly independent), then combinations such as {X+Y} or {XY} are also independent of {Z}.
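
Here is a tiny exact sanity check of this grouping principle (a sketch in Python; the choice of {F} and {G} as an XOR and a coordinate map is purely illustrative), using three independent fair bits and enumerating all eight outcomes:

    from itertools import product
    from fractions import Fraction

    outcomes = list(product([0, 1], repeat=3))   # the 8 equally likely values of (X1,X2,X3)
    p = Fraction(1, len(outcomes))

    def prob(event):
        # exact probability of an event given as a predicate on (x1, x2, x3)
        return sum(p for w in outcomes if event(*w))

    # F(X1,X2) = X1 XOR X2 should be independent of G(X3) = X3.
    for a in (0, 1):
        for b in (0, 1):
            joint = prob(lambda x1, x2, x3: (x1 ^ x2) == a and x3 == b)
            split = prob(lambda x1, x2, x3: (x1 ^ x2) == a) * prob(lambda x1, x2, x3: x3 == b)
            assert joint == split   # exact equality, with no sampling error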

We remark that there is a quantitative version of the above facts used in information theory, known as the data processing inequality, but this is beyond the scope of this course.

If {X} and {Y} are scalar random variables, then from the Fubini and Tonelli theorems we see that

\displaystyle  {\bf E} XY = ({\bf E} X) ({\bf E} Y) \ \ \ \ \ (8)

if {X} and {Y} are either both unsigned, or both absolutely integrable. We caution however that the converse is not true: just because two random variables {X,Y} happen to obey (8) does not necessarily mean that they are independent; instead, we say merely that they are uncorrelated, which is a weaker statement.

More generally, if {X} and {Y} are random variables taking values in ranges {R, R'} respectively, then

\displaystyle  {\bf E} F(X) G(Y) = ({\bf E} F(X)) ({\bf E} G(Y))

for any scalar functions {F,G} on {R,R'} respectively, provided that {F(X)} and {G(Y)} are either both unsigned, or both absolutely integrable. This property of {X} and {Y} is in fact equivalent to independence (as can be seen by specialising to those {F,G} that take values in {\{0,1\}}): thus for instance independence of two unsigned random variables {X,Y} entails not only (8), but also {{\bf E} X^2 Y = ({\bf E} X^2) ({\bf E} Y)}, {{\bf E} e^X e^Y = ({\bf E} e^X) ({\bf E} e^Y)}, etc. Similar statements hold for the joint independence of larger numbers of random variables. It is this ability to easily decouple expectations of independent random variables that makes independent variables particularly easy to compute with in probability.
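
As a concrete (and exact) instance of this decoupling, one can enumerate two independent fair dice and verify {{\bf E} X^2 Y = ({\bf E} X^2)({\bf E} Y)} directly; the following Python sketch (the dice are just a convenient stand-in for a pair of independent unsigned random variables) does this with exact rational arithmetic:

    from fractions import Fraction

    faces = range(1, 7)
    p = Fraction(1, 36)    # the 36 outcomes (x, y) of two independent dice are equally likely

    E_X2Y = sum(p * x**2 * y for x in faces for y in faces)
    E_X2 = sum(Fraction(1, 6) * x**2 for x in faces)
    E_Y = sum(Fraction(1, 6) * y for y in faces)
    print(E_X2Y, E_X2 * E_Y, E_X2Y == E_X2 * E_Y)   # the two sides agree exactly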

Exercise 18 Show that a random variable {X} is independent of itself (i.e. {X} and {X} are independent) if and only if {X} is almost surely equal to a constant.

Exercise 19 Show that a constant (deterministic) random variable is independent of any other random variable.

Exercise 20 Let {X_1,\dots,X_n} be discrete random variables (i.e. they take values in at most countable spaces {R_1,\dots,R_n} equipped with the discrete sigma-algebra). Show that {X_1,\dots,X_n} are jointly independent if and only if one has

\displaystyle  {\bf P}( \bigwedge_{i=1}^n (X_i=x_i) ) = \prod_{i=1}^n {\bf P}(X_i = x_i)

for all {x_1 \in R_1,\dots, x_n \in R_n}.

Exercise 21 Let {X_1,\dots,X_n} be real scalar random variables. Show that {X_1,\dots,X_n} are jointly independent if and only if one has

\displaystyle  {\bf P}( \bigwedge_{i=1}^n (X_i \leq t_i) ) = \prod_{i=1}^n {\bf P}( X_i \leq t_i )

for all {t_1,\dots,t_n \in {\bf R}}.

The following exercise demonstrates that probabilistic independence is analogous to linear independence:

Exercise 22 Let {V} be a finite-dimensional vector space over a finite field {F}, and let {X} be a random variable drawn uniformly at random from {V}. Let {\langle, \rangle: V \times V \rightarrow F} be a non-degenerate bilinear form on {V}, and let {v_1,\dots,v_n} be non-zero vectors in {V}. Show that the random variables {\langle X, v_1 \rangle, \dots, \langle X, v_n \rangle} are jointly independent if and only if the vectors {v_1,\dots,v_n} are linearly independent.

Exercise 23 Give an example of three random variables {X,Y,Z} which are pairwise independent (that is, any two of {X,Y,Z} are independent of each other), but not jointly independent. (Hint: one can use the preceding exercise.)

Another analogy is with orthogonality:

Exercise 24 Let {X} be a random variable taking values in {{\bf R}^n} with the Gaussian distribution, in the sense that

\displaystyle  \mathop{\bf P}( X \in S ) = \int_S \frac{1}{(2\pi)^{n/2}} e^{-|x|^2/2}\ dx

(where {|x|} denotes the Euclidean norm on {{\bf R}^n}), and let {v_1,\dots,v_m} be vectors in {{\bf R}^n}. Show that the random variables {X \cdot v_1, \dots, X \cdot v_m} (with {\cdot} denoting the Euclidean inner product) are jointly independent if and only if the {v_1,\dots,v_m} are pairwise orthogonal.

We say that a family of events {(E_i)_{i \in A}} are jointly independent if their indicator random variables {(1_{E_i})_{i \in A}} are jointly independent. Undoing the definitions, this is equivalent to requiring that

\displaystyle  {\bf P}( \bigwedge_{i \in A_1} E_{i} \wedge \bigwedge_{j \in A_2} \overline{E_{j}}) = \prod_{i \in A_1} {\bf P}(E_i) \prod_{j \in A_2} {\bf P}(\overline{E_j})

for all disjoint finite subsets {A_1, A_2} of {A}. This condition is complicated, but simplifies in the case of just two events:

Exercise 25

  • (i) Show that two events {E,F} are independent if and only if {{\bf P}(E \wedge F) = {\bf P}(E) {\bf P}(F)}.
  • (ii) If {E,F,G} are events, show that the condition {{\bf P}(E \wedge F \wedge G) = {\bf P}(E) {\bf P}(F) {\bf P}(G)} is necessary, but not sufficient, to ensure that {E,F,G} are jointly independent.
  • (iii) Give an example of three events {E,F,G} that are pairwise independent, but not jointly independent.

Because of the product measure construction, it is easy to insert independent sources of randomness into an existing randomness model by extending that model, thus giving a more useful version of Corollaries 27 and 31 of Notes 0:

Proposition 26 Suppose one has a collection of events and random variables modeled by some probability space {\Omega}, and let {\nu} be a probability measure on a measurable space {R = (R,{\mathcal B})}. Then there exists an extension {\Omega'} of the probability space {\Omega}, and a random variable {X} modeled by {\Omega'} taking values in {R}, such that {X} has distribution {\nu} and is independent of all random variables that were previously modeled by {\Omega}.

More generally, given a finite collection {(\nu_i)_{i \in A}} of probability measures {\nu_i} on measurable spaces {R_i}, there exists an extension {\Omega'} of {\Omega} and random variables {X_i} modeled by {\Omega'} taking values in {R_i} for each {i \in A}, such that each {X_i} has distribution {\nu_i} and {(X_i)_{i \in A}} and {Y} are jointly independent for any random variable {Y} that was previously modeled by {\Omega}.

If the {R_i} are all standard Borel spaces, then one can also take {A} to be infinite (even if {A} is uncountable).

Proof: For the first part, we define the extension {\Omega'} to be the product of {\Omega} with the probability space {(R,{\mathcal B},\nu)}, with factor map {\pi: \Omega \times R \rightarrow \Omega} defined by {\pi(\omega,x) := \omega}, and with {X} modeled by {X_{\Omega'}(\omega,x) := x}. It is then routine to verify all the claimed properties. The other parts of the proposition are proven similarly, using Proposition 11 for the final part. \Box

Using this proposition, for instance, one can start with a given random variable {X} and create an independent copy {Y} of that variable, which has the same distribution as {X} but is independent of {X}, by extending the probability model. Indeed one can create any finite number of independent copies, or even an infinite number if {X} takes values in a standard Borel space (in particular, one can do this if {X} is a scalar random variable). A finite or infinite sequence {X_1, X_2, \dots} of random variables that are jointly independent and all have the same distribution is said to be an independent and identically distributed (or iid for short) sequence of random variables. The above proposition allows us to easily generate such sequences by extending the sample space as necessary.

Exercise 27 Let {\epsilon_1, \epsilon_2, \dots \in \{0,1\}} be random variables that are independent and identically distributed copies of the Bernoulli random variable with expectation {1/2}, that is to say the {\epsilon_1,\epsilon_2,\dots} are jointly independent with {{\bf P}( \epsilon_i = 1 ) = {\bf P}(\epsilon_i = 0 ) = 1/2} for all {i}.

  • (i) Show that the random variable {\sum_{n=1}^\infty 2^{-n} \epsilon_n} is uniformly distributed on the unit interval {[0,1]}.
  • (ii) Show that the random variable {\sum_{n=1}^\infty 2 \times 3^{-n} \epsilon_n} has the distribution of Cantor measure (constructed for instance in Example 1.2.4 of Durrett).

Note that part (i) of this exercise provides a means to construct Lebesgue measure on the unit interval {[0,1]} (although, when one unpacks the construction, it is actually not too different from the standard construction, as given for instance in this previous set of notes).
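
For those who like to see such statements numerically, here is a short simulation sketch of part (i) (in Python with numpy, truncating the binary expansion at 53 bits, which is of course an approximation and no substitute for the proof requested in the exercise):

    import numpy as np

    rng = np.random.default_rng(1)
    n_samples, n_bits = 10**5, 53
    eps = rng.integers(0, 2, size=(n_samples, n_bits))    # iid fair bits epsilon_n
    weights = 0.5 ** np.arange(1, n_bits + 1)             # the weights 2^{-n}
    x = eps @ weights                                     # sum_n 2^{-n} epsilon_n (truncated)

    for t in (0.1, 0.25, 0.5, 0.9):
        print(t, np.mean(x <= t))    # the empirical CDF at t should be close to t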

Given two square integrable real random variables {X, Y}, the covariance {\hbox{Cov}(X,Y)} between the two is defined by the formula

\displaystyle  \hbox{Cov}(X,Y) := {\bf E}( (X - {\bf E} X) (Y - {\bf E} Y) ).

The covariance is well-defined thanks to the Cauchy-Schwarz inequality, and it is not difficult to see that one has the alternate formula

\displaystyle  \hbox{Cov}(X,Y) = {\bf E}(X Y) - ({\bf E} X) ({\bf E} Y)

for the covariance. Note that the variance is a special case of the covariance: {\hbox{Var}(X) = \hbox{Cov}(X,X)}.

From construction we see that if {X,Y} are independent square integrable variables, then the covariance {\hbox{Cov}(X,Y)} vanishes. The converse is not true:

Exercise 28 Give an example of two square-integrable real random variables {X,Y} which have vanishing covariance {\hbox{Cov}(X,Y)}, but are not independent.

However, there is one key case in which the converse does hold, namely that of gaussian random vectors.

Exercise 29 A random vector {(X_1,\dots,X_n)} taking values in {{\bf R}^n} is said to be a gaussian random vector if there exists {\mu = (\mu_1,\dots,\mu_n) \in {\bf R}^n} and an {n \times n} positive definite real symmetric matrix {\Sigma := (\sigma_{ij})_{1 \leq i,j \leq n}} such that

\displaystyle  \mathop{\bf P}( (X_1,\dots,X_n) \in S) = \frac{1}{(2\pi)^{n/2} (\det \Sigma)^{1/2}} \int_S e^{-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)}\ dx

for all Borel sets {S \subset {\bf R}^n} (where we identify elements of {{\bf R}^n} with column vectors). The distribution of {(X_1,\dots,X_n)} is called a multivariate normal distribution.

  • (i) If {(X_1,\dots,X_n)} is a gaussian random vector with the indicated parameters {\mu, \Sigma}, show that {{\bf E} X_i = \mu_i} and {\hbox{Cov}(X_i,X_j) = \sigma_{ij}} for {1 \leq i,j \leq n}. In particular {\hbox{Var}(X_i) = \sigma_{ii}}. Thus we see that the parameters {\mu, \Sigma} of a gaussian random vector can be recovered from its mean and covariances.
  • (ii) If {(X_1,\dots,X_n)} is a gaussian random vector and {1 \leq i, j \leq n}, show that {X_i} and {X_j} are independent if and only if the covariance {\hbox{Cov}(X_i,X_j)} vanishes. Furthermore, show that {(X_1,\dots,X_n)} are jointly independent if and only if all the covariances {\hbox{Cov}(X_i,X_j)} for {1 \leq i < j \leq n} vanish. In particular, for gaussian random vectors, joint independence is equivalent to pairwise independence. (Contrast this with Exercise 23.)
  • (iii) Give an example of two real random variables {X,Y}, each of which is gaussian, and for which {\hbox{Cov}(X,Y)=0}, but such that {X} and {Y} are not independent. (Hint: take {Y} to be the product of {X} with a random sign; a numerical illustration of this construction is sketched just after this exercise.) Why does this not contradict (ii)?
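
Here is a numerical illustration (a sketch in Python with numpy) of the construction suggested in the hint of (iii): {X} standard gaussian and {Y = sX} with {s} an independent random sign. Both marginals are gaussian and the covariance vanishes, yet {|Y| = |X|} always, so {X} and {Y} are certainly not independent; the point is that {(X,Y)} is not a gaussian random vector, so (ii) does not apply:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10**6
    x = rng.normal(size=n)
    s = rng.choice([-1.0, 1.0], size=n)   # random signs, independent of x
    y = s * x                             # gaussian marginal, but |y| = |x| always

    print(np.mean(x * y) - x.mean() * y.mean())   # covariance is ~ 0
    print(np.mean((np.abs(x) > 1) & (np.abs(y) > 1)),        # ~ 0.32
          np.mean(np.abs(x) > 1) * np.mean(np.abs(y) > 1))   # ~ 0.10, so not independent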

We have discussed independence of random variables, and independence of events. It is also possible to define a notion of independence of {\sigma}-algebras. More precisely, define a {\sigma}-algebra of events to be a collection {{\mathcal F}} of events that contains the empty event, is closed under Boolean operations (in particular, under complements {E \mapsto \overline{E}}) and under countable conjunctions and countable disjunctions. Each such {\sigma}-algebra, when using a probability space model {\Omega}, is modeled by a {\sigma}-algebra {{\mathcal F}_\Omega} of measurable sets in {\Omega}, which behaves under an extension {\pi: \Omega' \rightarrow \Omega} in the obvious pullback fashion:

\displaystyle  {\mathcal F}_{\Omega'} = \{ \pi^{-1}(E): E \in {\mathcal F}_\Omega \}.

A random variable {X} taking values in some range {R} is said to be measurable with respect to a {\sigma}-algebra of events {{\mathcal F}} if the event {X \in S} lies in {{\mathcal F}} for every measurable subset {S} of {R}; in terms of a probabilistic model {\Omega}, {X} is measurable with respect to {{\mathcal F}} if and only if