Planet Musings

February 17, 2019

John Baez: Climeworks

This article describes some recent work on ‘direct air capture’ of carbon dioxide—essentially, sucking it out of the air:

• Jon Gertner, The tiny Swiss company that thinks it can help stop climate change, New York Times Magazine, 12 February 2019.

There’s a Swiss company called Climeworks that’s built machines that do this—shown in the picture above. So far they are using these machines for purposes other than reducing atmospheric CO2 concentrations: namely, making carbonated water for soft drinks, and getting greenhouses to have lots of carbon dioxide in the air, for tastier vegetables. And they’re just experimental, not economically viable yet:

The company is not turning a profit. To build and install the 18 units at Hinwil, hand-assembled in a second-floor workshop in Zurich, cost between $3 million and $4 million, which is the primary reason it costs the firm between $500 and $600 to remove a metric ton of CO₂ from the air. Even as the company has attracted about $50 million in private investments and grants, it faces the same daunting task that confronted Carl Bosch a century ago: How much can it bring costs down? And how fast can it scale up?

If they ever make it in these markets, greenhouses and carbonation might want 6 megatonnes of CO₂ annually. This is nothing compared to the 37 gigatonnes of CO₂ that we put into the atmosphere in 2018. In principle the technology Climeworks is using could be massively scaled up. After all, Napoleon III used aluminum silverware, back when aluminum was more precious than gold… and only later did the technology for making aluminum improve to the point where the metal gained a mass market.
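For scale, here is a quick back-of-envelope calculation with the figures quoted in this post (Python used purely as a calculator):

```python
# Figures quoted above: ~6 Mt/yr of potential demand from greenhouses
# and carbonation, 37 Gt of CO2 emitted in 2018, and $500-600 per
# tonne of capture (midpoint used below).
market_demand = 6e6          # tonnes CO2 per year
annual_emissions = 37e9      # tonnes CO2 emitted in 2018
cost_per_tonne = 550.0       # USD, midpoint of the quoted $500-600

share = market_demand / annual_emissions
total_cost = cost_per_tonne * annual_emissions

print(f"market demand is {share:.4%} of annual emissions")
print(f"capturing 2018's emissions at current cost: ${total_cost/1e12:.1f} trillion")
```

So the soda-and-greenhouse market is under two hundredths of a percent of annual emissions, and capturing everything at today's cost would run to roughly twenty trillion dollars a year.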

But can Climeworks’ technology actually be scaled up? Some are dubious:

M.I.T.’s Howard Herzog, for instance, an engineer who has spent years looking at the potential for these machines, told me that he thinks the costs will remain between $600 and $1,000 per metric ton. Some of Herzog’s reasons for skepticism are highly technical and relate to the physics of separating gases. Some are more easily grasped. He points out that because direct-air-capture machines have to move tremendous amounts of air through a filter or solution to glean a ton of CO₂ — the gas, for all its global impact, makes up only about 0.04 percent of our atmosphere — the process necessitates large expenditures for energy and big equipment. What he has likewise observed, in analyzing similar industries that separate gases, suggests that translating spreadsheet projections for capturing CO₂ into real-world applications will reveal hidden costs. “I think there has been a lot of hype about this, and it’s not going to revolutionize anything,” he told me, adding that he thinks other negative-emissions technologies will prove cheaper. “At best it’s going to be a bit player.”

What actually is the technology Climeworks is using? And what other technologies are available for sucking carbon dioxide out of the air—or out of the exhaust from fossil-fuel-burning power plants, or out of water?
I’ll have a lot more to say about the latter question in future articles. As for Climeworks, they describe their technology rather briefly here:

• Climeworks, Our technology.

They write:

Our plants capture atmospheric carbon with a filter. Air is drawn into the plant and the CO2 within the air is chemically bound to the filter.

Once the filter is saturated with CO2 it is heated (using mainly low-grade heat as an energy source) to around 100 °C (212 °F). The CO2 is then released from the filter and collected as concentrated CO2 gas to supply to customers or for negative emissions technologies.

CO2-free air is released back into the atmosphere. This continuous cycle is then ready to start again. The filter is reused many times and lasts for several thousand cycles.

What is the filter material?

The filter material is made of porous granulates modified with amines, which bind the CO2 in conjunction with the moisture in the air. This bond is dissolved at temperatures of 100 °C.

So, it seems their technology is an example of ‘amine gas treating’:

• Wikipedia, Amine gas treating.

In future posts I’ll talk a bit more about amine gas treating, but also other methods for absorbing carbon dioxide from air or from solution in water. Maybe you can help me figure out what’s the best method!

February 16, 2019

David Hogg: candidate Williamson

Today Marc Williamson (NYU) passed (beautifully, I might say) his PhD Candidacy exam. He is working on the progenitors of core-collapse supernovae, making inferences from post-peak-brightness spectroscopy. He has a number of absolutely excellent results. One is (duh!) that the supernovae types seem to form a continuum, which makes perfect sense, given that we think they come from a continuous process of envelope loss. Another is that the best time to type a supernova with spectroscopy is 10-15 days after maximum light. That's new! His work is based on the kind of machine-learning I love: Linear models and linear support vector machines. I love them because they are convex, (relatively) interpretable, and easy to visualize and check.
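As a toy illustration of that appeal (synthetic data, with an ordinary least-squares classifier standing in for a linear SVM; this is not Williamson's data or pipeline): the fit is one convex problem with a unique solution, and the learned weight vector can be read off directly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for two supernova "types" in a 5-feature space.
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)),
               rng.normal(2.0, 1.0, (50, 5))])
y = np.array([-1.0] * 50 + [1.0] * 50)

# Least-squares linear classifier: convex, closed-form, and the weight
# vector w is directly interpretable (one weight per feature).
Xb = np.hstack([X, np.ones((100, 1))])     # append a bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
accuracy = (np.sign(Xb @ w) == y).mean()
print(accuracy)   # well-separated classes -> high training accuracy
```

Nothing here is hidden inside an optimizer: checking the model means looking at six numbers.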

One amusing idea that came up is that if the stripped supernova types were not in a continuum, but really distinct types, then it might get really hard to explain. Like really hard. So I proposed that it could be a technosignature! That's a NASA neologism, but you can guess what it means. I discussed this more late in the day with Soledad Villar (NYU) and Adrian Price-Whelan (NYU), and together we came up with ideas about wisdom signatures and foolishness signatures. See twitter for more.

Also with Villar I worked out a very simple toy problem to think about GANs: Have the data be two-d vectors drawn from a trivial distribution (like a 2-d Gaussian) and have the generator take a one-d gaussian draw and transform it into fake data. We were able to make a strong prediction about how the transform from the one-d to the two-d should look in the generator.
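That toy problem might be set up like the following sketch (my own illustration of the setup described, with an arbitrary linear generator; not Hogg and Villar's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: draws from a trivial 2-d Gaussian.
real = rng.standard_normal((1000, 2))

# Generator: a 1-d Gaussian draw pushed through a linear map into 2-d.
# (The weights here are arbitrary placeholders.)
A = np.array([[1.0], [0.5]])          # hypothetical 2x1 generator matrix
z = rng.standard_normal((1000, 1))    # latent draws
fake = z @ A.T                        # fake 2-d data

# With a 1-d latent space and a linear generator, the fakes are confined
# to a line: their 2x2 covariance is rank one, while the real data's
# covariance is full rank. A working generator must therefore transform
# the 1-d input nonlinearly to fill out the 2-d target distribution.
print(np.linalg.matrix_rank(np.cov(fake.T)),
      np.linalg.matrix_rank(np.cov(real.T)))
```

The rank gap is the simplest version of the prediction about how the 1-d-to-2-d transform must look.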

February 15, 2019

Matt von Hippel: Valentine’s Day Physics Poem 2019

It’s that time of year again! Time for me to dig in to my files and bring you yet another of my old physics poems.

Plagued with Divergences

“The whole scheme of local field theory is plagued with divergences”

Is divergence ever really unexpected?

If you asked a computer, what would it tell you?

You’d hear a whirring first, lungs and heart of the machine beating faster and faster.

And you’d dismiss it.
You knew this wasn’t going to be an easy interaction.
It doesn’t mean you’re going to diverge.

And perhaps it would try to warn you, write it there on the page.
It might even notice, its built-in instincts telling you, by the book,
“This will diverge.”

But instincts lie, and builders cheat.
And it doesn’t mean you’re going to diverge.

Now, you do everything the slow way,
You need a different answer.
Dismiss your instincts and force yourself through
Piece by piece.

And now, you can’t stop hearing the whir
The machine’s beating heart
Even when it should be at rest

And step by step, it tries to minimize its errors
And step by step, the errors grow

And exhausted, in the end, you see splashed across the screen
Something bigger than it should ever have been.

But sometimes things feel big and strange.
That’s just the way of the big wide world.
And it doesn’t mean you’re going to diverge.

You could have seen the signs,
Power-counted, seen what could overwhelm.
And you could have regulated, with an epsilon of flexibility.

But this one, this time, was supposed to be
Needed to be
Physical Truth
And truth doesn’t diverge

So you keep going,
Wheezing breath and painstaking calculation,
And every little thing blowing up

It’s not like there’s a better way to live.

Jacques Distler: Brotli

I finally got around to enabling Brotli compression on Golem. Reading the manual, I came across the BrotliAlterETag directive:

Description: How the outgoing ETag header should be modified during compression
Syntax: BrotliAlterETag AddSuffix|NoChange|Remove

with the descriptions of the three options:

AddSuffix: Append the compression method onto the end of the ETag, causing compressed and uncompressed representations to have unique ETags. In another dynamic compression module, mod_deflate, this has been the default since 2.4.0. This setting prevents serving “HTTP Not Modified (304)” responses to conditional requests for compressed content.

NoChange: Don’t change the ETag on a compressed response. In another dynamic compression module, mod_deflate, this was the default prior to 2.4.0. This setting does not satisfy the HTTP/1.1 property that all representations of the same resource have unique ETags.

Remove: Remove the ETag header from compressed responses. This prevents some conditional requests from being possible, but avoids the shortcomings of the preceding options.

Sure enough, it turns out that ETags+compression have been completely broken in Apache 2.4.x. Two methods for saving bandwidth, and delivering pages faster, cancel each other out and chew up more bandwidth than if one or the other were disabled.

To unpack this a little further, the first time your browser requests a page, Apache computes a hash of the page and sends that along as a header in the response

etag: "38f7-56d65f4a2fcc0"

When your browser requests the page again, it sends an

If-None-Match: "38f7-56d65f4a2fcc0"

header in the request. If that matches the hash of the page, Apache sends a “HTTP Not Modified (304)” response, telling your browser the page is unchanged from the last time it requested it.

If the page is compressed, using mod_deflate, then the header Apache sends is slightly different

etag: "38f7-56d65f4a2fcc0-gzip"

So, when your browser sends its request with an

If-None-Match: "38f7-56d65f4a2fcc0-gzip"

header, Apache compares “38f7-56d65f4a2fcc0-gzip” with the hash of the page, concludes that they don’t match, and sends the whole page again (thus wasting all the bandwidth you originally saved by sending the page compressed).

This is completely brain-dead. And, even though the problem has been around for years, the Apache folks don’t seem to have gotten around to fixing it. Instead, they just replicated the problem in mod_brotli (with a “-br” suffix replacing “-gzip”).
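Here is a toy model of that broken comparison (my sketch, not Apache's actual logic):

```python
# The server's conditional-request check, reduced to its essence:
# compare the If-None-Match header against the hash-derived ETag.
def serves_304(page_hash: str, if_none_match: str) -> bool:
    return if_none_match == f'"{page_hash}"'

# The browser faithfully echoes back the *suffixed* ETag it was given,
# so the naive string comparison never matches and the full page is
# re-sent (a 200 instead of a 304).
print(serves_304("38f7-56d65f4a2fcc0", '"38f7-56d65f4a2fcc0-gzip"'))  # False
print(serves_304("38f7-56d65f4a2fcc0", '"38f7-56d65f4a2fcc0"'))      # True
```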

The solution is drop-dead simple. Add the line

RequestHeader edit "If-None-Match" '^"((.*)-(gzip|br))"$' '"$1", "$2"'

to your Apache configuration file. This gives Apache two ETags to compare with: the one with the suffix and the original unmodified one. The latter will match the hash of the file and Apache will return a “HTTP Not Modified (304)” as expected.
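You can sanity-check what that directive does by running the same regex in Python (a sketch of the header rewrite, not Apache internals):

```python
import re

# Same pattern as the RequestHeader edit line above: group 1 is the
# suffixed ETag, group 2 the original unsuffixed one.
pattern = re.compile(r'^"((.*)-(gzip|br))"$')

def rewrite_if_none_match(value: str) -> str:
    m = pattern.match(value)
    if m is None:
        return value                      # no compression suffix: untouched
    return f'"{m.group(1)}", "{m.group(2)}"'

print(rewrite_if_none_match('"38f7-56d65f4a2fcc0-gzip"'))
# "38f7-56d65f4a2fcc0-gzip", "38f7-56d65f4a2fcc0"
```

The rewritten header now carries both ETags, so whichever one Apache compares against the page hash, the conditional request succeeds.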

Why Apache didn’t just implement this in their code is beyond me.

Backreaction: Dark Matter – Or What?

Yesterday I gave a colloq about my work with Tobias Mistele on superfluid dark matter. Since several people asked for the slides, I have uploaded them to slideshare. You can also find the pdf here. I previously wrote about our research here and here. All my papers are openly available on the arXiv.

February 14, 2019

Backreaction: When gravity breaks down

Einstein’s theory of general relativity is more than a hundred years old, but it still gives physicists headaches. Not only are Einstein’s equations hideously difficult to solve, they also clash with physicists’ other most-cherished achievement, quantum theory. Problem is, particles have quantum properties. They can, for example, be in two places at once. These particles also…

David Hogg: truly theoretical work on growth of structure

The highlight of my day was a great NYU CCPP Brown-Bag talk by Mikhail Ivanov (NYU) about the one-point pdf of dark-matter density in the Universe, using a modified spherical-collapse model, based on things in this paper. It turns out that you can do a very good job of predicting counts-in-cells or equivalent one-point functions for the dark-matter density by considering the relationship between the linear theory and a non-linearity related to the calculable non-linearity you can work out in spherical collapse. More specifically, his approach is to expand the perturbations in the neighborhood of a point into a monopole term and a sum of radial functions times spherical harmonics. The monopole term acts like spherical collapse and the higher harmonics lead to a multiplicative correction. The whole framework depends on some mathematical properties of gravitational collapse that Ivanov can't prove but seem to be true in simulations. The theory is non-perturbative in the sense that it goes well into non-linear scales, and does well. That's some impressive theory, and it was a beautiful talk.

David Hogg: How does the disk work?

Jason Hunt (Toronto) was in town today to discuss things dynamical. We discussed various things. I described to him the MySpace project in which Price-Whelan (Princeton) and I are trying to make a data-driven classification of kinematic structures in the thin disk. He described a project in which he is trying to build consistent dynamical models of these structures. He finds that there is no trivial explanation of all the visible structure; probably multiple things are at work. But his models do look very similar to the data qualitatively, so it sure is promising.

John Preskill: Why care about physics that doesn’t care about us?

A polar vortex had descended on Chicago.

I was preparing to fly in, scheduled to present a seminar at the University of Chicago. My boyfriend warned, from Massachusetts, that the wind chill effectively lowered the temperature to -50 degrees F. I’d last encountered -50 degrees F in the short story “To Build a Fire,” by Jack London. Spoiler alert: The protagonist fails to build a fire and freezes to death.

The story exemplifies naturalism, according to my 11th-grade English class. The naturalist movement infiltrated American literature and art during the late 19th century. Naturalists portrayed nature as harsh and indifferent: The winter doesn’t care if Jack London’s protagonist dies.

The protagonist lingered in my mind as my plane took off. I was flying into a polar vortex for physics, the study of nature. Physics doesn’t care about me. How can I care so much about physics? How can humans generally?

Peeling apart that question, I found more layers than I’d packed for protection against the polar vortex.


Intellectualism formed the parka of the answer: You can’t hug space, time, information, energy, and the nature of reality. You can’t smile at them and watch them smile back. But their abstractness doesn’t block me from engaging with them; it attracts me. Ideas attract me; their purity does. Physics consists partially of a framework of ideas—of mathematical models, of theorems and examples, of the hypotheses and plots and revisions that underlie a theory.

The framework of physics needs construction. Some people compose songs; some build businesses; others bake soufflés; I used to do arts and crafts. Many humans create—envision, shape, mold, and coordinate—with many different materials. Theoretical physics overflows with materials and with opportunities to create. As humans love to create, we can love physics. Theoretical-physics materials consist of ideas, which might sound less suited to construction than paint does. But painters glob mixtures of water, resin, acrylic, and pigment onto woven fabric. Why shouldn’t ideas appeal as much as resin does? I build worlds in my head for a living. Doesn’t that sound romantic?

Painters derive joy from painting; dancers derive joy from moving; physics offers outlets for many skills. Doing physics, I use math. I learn history: What paradoxes about quantum theory did Albert Einstein pose to Niels Bohr? I write papers and blog posts, and I present seminars and colloquia. I’ve moonlighted as a chemist, studied biology, dipped into computer science, and sought to improve engineering. Beyond these disciplines, physics requires uniquely physical skills: the identification of questions about the natural world, the translation of those questions into math, and the translation of mathematical results into statements about the natural world. In college, I hated having to choose a major because I wanted to study everything. Physics lets me.


My attraction to physics worried me in college. Jim Yong Kim became Dartmouth’s president in my junior year. Jim, who left to helm the World Bank, specializes in global health. He insisted that “the world’s troubles are your troubles,” quoting former Dartmouth president John Sloan Dickey. I was developing a specialization in quantum information theory. I wasn’t trying to contain ebola, mitigate droughts, or eradicate Alzheimer’s disease. Should I not have been trying to save the world?

I could help save the world, a mentor said, through theoretical physics.1 Society needs a few people to develop art, a few to write music, a few to curate history, and a few to study the nature of the universe. Such outliers help keep us human, and the reinforcement of humanity helps save the world. You may indulge in physics, my mentor said, because physics affords the opportunity to do good. If I appreciate that opportunity, how can I not appreciate physics?

The opportunity to do good has endeared physics to me more as I’ve advanced. The more I advance, the fewer women I see. According to the American Physical Society (APS), in 2017, women received about 21% of the physics Bachelor’s degrees awarded in the U.S. Women received about 18% of the doctorates. In 2010, women numbered 8% of the full professors in U.S. departments that offered Bachelor’s or higher degrees in physics. The APS is conducting studies, coordinating workshops, and offering grants to improve the gender ratio. Departments, teachers, and mentors are helping. They have my gratitude. Yet they can accomplish only so much, especially since many are men. They can encourage women to change the gender ratio; they can’t change the ratio directly. Only women can, and few women are undertaking the task. Physics affords an opportunity to do good—to improve a field’s climate, to mentor, and to combat stereotypes—that few people can tackle. For that opportunity, I’m grateful to physics.


Physics lifts us beyond the Platonic realm of ideas in two other ways. At Caltech, I once ate lunch with Charlie Marcus. Marcus is a Microsoft researcher and a professor of physics at the University of Copenhagen’s Niels Bohr Institute. His lab is developing topological quantum computers, in which calculations manifest as braids. Why, I asked, does quantum computing deserve a large chunk of Marcus’s life?

Two reasons, he replied. First, quantum computing straddles the border between foundational physics and applications. Quantum science satisfies the intellect but doesn’t tether us to esoterica. Our science could impact technology, industry and society. Second, the people. Quantum computing has a community steeped in congeniality.

Marcus’s response delighted me: His reasons for caring about quantum computing coincided with two of mine. Reason two has expanded, in my mind, to opportunities for engagement with people. Abstractions attract me partially because intellectualism runs in my family. I grew up surrounded by readers, encouraged to ask questions. Physics enables me to participate in a family tradition and to extend that tradition to the cosmos. My parents now ask me the questions—about black holes and about whether I’m staying warm in Chicago.

Beyond family, physics enables me to engage with you. This blog has connected me to undergraduates, artists, authors, computer programmers, science teachers, and museum directors across the world. Scientific outreach inspires reading, research, art, and the joy of learning. I love those outcomes, participating in them, and engaging with you.


Why fly into a polar vortex for the study of nature—why care about physics that can’t care about us? In my case, primarily because of the ideas, the abstraction, and the chances to create and learn. Partially for the chance to help save the world through humanness, outreach, and a gender balance. Partially for the chance to impact technology, and partially to connect with people: Physics can strengthen ties to family and can introduce you to individuals across the globe. And physics can—heck, tomorrow is February 14th—lead you to someone who cares enough to track Chicago’s weather from Cambridge.


1I’m grateful that Jim Kim, too, encouraged me to pursue theoretical physics.

February 13, 2019

John Baez: Exploring New Technologies

I’ve got some good news! I’ve been hired by Bryan Johnson to help evaluate and explain the potential of various technologies to address the problem of climate change.

Johnson is an entrepreneur who sold his company Braintree for $800M and started the OS Fund in 2014, seeding it with $100M to invest in the hard sciences so that we can move closer towards becoming proficient system administrators of our planet: engineering atoms, molecules, organisms and complex systems. The fund has invested in many companies working on synthetic biology, genetics, new materials, and so on. Here are some writeups he’s done on these companies.

As part of my research I’ll be blogging about some new technologies, asking questions and hoping experts can help me out. Stay tuned!

Terence Tao: 255B, Notes 2: Onsager’s conjecture

We consider the incompressible Euler equations on the (Eulerian) torus {\mathbf{T}_E := ({\bf R}/{\bf Z})^d}, which we write in divergence form as

\displaystyle  \partial_t u^i + \partial_j(u^j u^i) = - \eta^{ij} \partial_j p \ \ \ \ \ (1)

\displaystyle  \partial_i u^i = 0, \ \ \ \ \ (2)

where {\eta^{ij}} is the (inverse) Euclidean metric. Here we use the summation conventions for indices such as {i,j,l} (reserving the symbol {k} for other purposes), and are retaining the convention from Notes 1 of denoting vector fields using superscripted indices rather than subscripted indices, as we will eventually need to change variables to Lagrangian coordinates at some point. In principle, much of the discussion in this set of notes (particularly regarding the positive direction of Onsager’s conjecture) could also be modified to treat non-periodic solutions that decay at infinity if desired, but some non-trivial technical issues do arise in non-periodic settings for the negative direction.

As noted previously, the kinetic energy

\displaystyle  \frac{1}{2} \int_{\mathbf{T}_E} |u(t,x)|^2\ dx = \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} u^i(t,x) u^j(t,x)\ dx

is formally conserved by the flow, where {\eta_{ij}} is the Euclidean metric. Indeed, if one assumes that {u,p} are continuously differentiable in both space and time on {[0,T] \times \mathbf{T}_E}, then one can multiply the equation (1) by {u^l} and contract against {\eta_{il}} to obtain

\displaystyle  \eta_{il} u^l \partial_t u^i + \eta_{il} u^l \partial_j (u^j u^i) = - \eta_{il} u^l \eta^{ij} \partial_j p = - u^j \partial_j p

which rearranges using (2) and the product rule to

\displaystyle  \partial_t (\frac{1}{2} \eta_{ij} u^i u^j) + \partial_j( \frac{1}{2} \eta_{il} u^i u^j u^l ) + \partial_j (u^j p) = 0

and then if one integrates this identity on {[0,T] \times \mathbf{T}_E} and uses Stokes’ theorem, one obtains the required energy conservation law

\displaystyle  \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} u^i(T,x) u^j(T,x)\ dx = \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} u^i(0,x) u^j(0,x)\ dx. \ \ \ \ \ (3)
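To spell out the rearrangement invoked above (a detail added here for convenience): since {\partial_j u^j = 0}, the product rule gives

\displaystyle  \eta_{il} u^l \partial_j (u^j u^i) = \eta_{il} u^l u^j \partial_j u^i = \partial_j( \frac{1}{2} \eta_{il} u^i u^j u^l )

and similarly {u^j \partial_j p = \partial_j (u^j p)}, so the flux terms are total spatial derivatives that integrate to zero over the torus by Stokes’ theorem, leaving only the time-boundary terms in (3).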

It is then natural to ask whether the energy conservation law continues to hold for lower regularity solutions, in particular weak solutions that only obey (1), (2) in a distributional sense. The above argument no longer works as stated, because {u^i} is not a test function and so one cannot immediately integrate (1) against {u^i}. And indeed, as we shall soon see, it is now known that once the regularity of {u} is low enough, energy can “escape to frequency infinity”, leading to failure of the energy conservation law, a phenomenon known in physics as anomalous energy dissipation.

But what is the precise level of regularity needed in order for this anomalous energy dissipation to occur? To make this question precise, we need a quantitative notion of regularity. One such measure is given by the Hölder space {C^{0,\alpha}(\mathbf{T}_E \rightarrow {\bf R})} for {0 < \alpha < 1}, defined as the space of continuous functions {f: \mathbf{T}_E \rightarrow {\bf R}} whose norm

\displaystyle  \| f \|_{C^{0,\alpha}(\mathbf{T}_E \rightarrow {\bf R})} := \sup_{x \in \mathbf{T}_E} |f(x)| + \sup_{x,y \in \mathbf{T}_E: x \neq y} \frac{|f(x)-f(y)|}{|x-y|^\alpha}

is finite. The space {C^{0,\alpha}} lies between the space {C^0} of continuous functions and the space {C^1} of continuously differentiable functions, and informally describes a space of functions that is “{\alpha} times differentiable” in some sense. The above derivation of the energy conservation law involved the integral

\displaystyle  \int_{\mathbf{T}_E} \eta_{il} u^l \partial_j (u^j u^i)\ dx

that roughly speaking measures the fluctuation in energy. Informally, if we could take the derivative in this integrand and somehow “integrate by parts” to split the derivative “equally” amongst the three factors, one would morally arrive at an expression that resembles

\displaystyle  \int_{\mathbf{T}} \nabla^{1/3} u \nabla^{1/3} u \nabla^{1/3} u\ dx

which suggests that the integral can be made sense of for {u \in C^0_t C^{0,\alpha}_x} once {\alpha > 1/3}. More precisely, one can make the following conjecture.

Conjecture 1 (Onsager’s conjecture) Let {0 < \alpha < 1} and {d \geq 2}, and let {0 < T < \infty}.

  • (i) If {\alpha > 1/3}, then any weak solution {u \in C^0_t C^{0,\alpha}([0,T] \times \mathbf{T} \rightarrow {\bf R})} to the Euler equations (in the Leray form {\partial_t u + \partial_j {\mathbb P} (u^j u) = u_0(x) \delta_0(t)}) obeys the energy conservation law (3).
  • (ii) If {\alpha \leq 1/3}, then there exist weak solutions {u \in C^0_t C^{0,\alpha}([0,T] \times \mathbf{T} \rightarrow {\bf R})} to the Euler equations (in Leray form) which do not obey energy conservation.

This conjecture was originally arrived at by Onsager by a somewhat different heuristic derivation; see Remark 7. The numerology is also compatible with that arising from the Kolmogorov theory of turbulence (discussed in this previous post), but we will not discuss this interesting connection further here.

The positive part (i) of the Onsager conjecture was established by Constantin, E, and Titi, building upon earlier partial results by Eyink; the proof is a relatively straightforward application of Littlewood-Paley theory, and they were also able to work in larger function spaces than {C^0_t C^{0,\alpha}_x} (using {L^3_x}-based Besov spaces instead of Hölder spaces, see Exercise 3 below). The negative part (ii) is harder. Discontinuous weak solutions to the Euler equations that did not conserve energy were first constructed by Scheffer, with an alternate construction later given by Shnirelman. De Lellis and Szekelyhidi noticed the resemblance of this problem to that of the Nash-Kuiper theorem in the isometric embedding problem, and began adapting the convex integration technique used in that theorem to construct weak solutions of the Euler equations. This began a long series of papers in which increasingly regular weak solutions that failed to conserve energy were constructed, culminating in a recent paper of Isett establishing part (ii) of the Onsager conjecture in the non-endpoint case {\alpha < 1/3} in three and higher dimensions {d \geq 3}; the endpoint {\alpha = 1/3} remains open. (In two dimensions it may be the case that the positive results extend to a larger range than Onsager’s conjecture predicts; see this paper of Cheskidov, Lopes Filho, Nussenzveig Lopes, and Shvydkoy for more discussion.) Further work continues into several variations of the Onsager conjecture, in which one looks at other differential equations, other function spaces, or other criteria for bad behavior than breakdown of energy conservation. See this recent survey of de Lellis and Szekelyhidi for more discussion.

In these notes we will first establish (i), then discuss the convex integration method in the original context of the Nash-Kuiper embedding theorem. Before tackling the Onsager conjecture (ii) directly, we discuss a related construction of high-dimensional weak solutions in the Sobolev space {L^2_t H^s_x} for {s} close to {1/2}, which is slightly easier to establish, though still rather intricate. Finally, we discuss the modifications of that construction needed to establish (ii), though we shall stop short of a full proof of that part of the conjecture.

We thank Phil Isett for some comments and corrections.

— 1. Energy conservation for sufficiently regular weak solutions —

We now prove the positive part (i) of Onsager’s conjecture, which turns out to be a straightforward application of Littlewood-Paley theory. We need the following relation between Hölder spaces and Littlewood-Paley projections:

Exercise 2 Let {u \in C^{0,\alpha}(\mathbf{T}_E \rightarrow {\bf R})} for some {0 < \alpha < 1} and {d \geq 1}. Establish the bounds

\displaystyle  \| u \|_{C^{0,\alpha}(\mathbf{T}_E \rightarrow {\bf R})} \sim_{\alpha,d} \| P_{\leq 1} u \|_{C^0(\mathbf{T}_E \rightarrow {\bf R})} + \sup_{N > 1} N^\alpha \| P_N u \|_{C^0(\mathbf{T}_E \rightarrow {\bf R})}.
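For numerical intuition, here is a crude 1-d analogue of {P_{\leq N}} built from a sharp Fourier cutoff (an assumption standing in for the smooth Littlewood-Paley multiplier used in these notes):

```python
import numpy as np

def project_leq(u, N):
    """Sharp Fourier truncation on the 1-d torus: zero out all modes
    with |k| > N. (A sketch standing in for the smooth P_{<=N}.)"""
    uhat = np.fft.fft(u)
    k = np.fft.fftfreq(len(u), d=1.0 / len(u))   # integer frequencies
    uhat[np.abs(k) > N] = 0.0
    return np.fft.ifft(uhat).real

x = np.linspace(0.0, 1.0, 256, endpoint=False)
u = np.sin(2 * np.pi * x) + 0.3 * np.sin(2 * np.pi * 40 * x)

low = project_leq(u, 4)   # keeps only the |k| <= 4 content
print(np.allclose(low, np.sin(2 * np.pi * x)))   # True: mode 40 removed
```

The projection strips the high-frequency mode exactly, which is the mechanism behind bounds like the one in Exercise 2: each dyadic block isolates a single range of frequencies.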

Let {u \in C^0_t C^{0,\alpha}([0,T] \times \mathbf{T}_E \rightarrow {\bf R})} be a weak solution to the Euler equations for some {1/3 < \alpha < 1}, thus

\displaystyle  \partial_t u + \partial_j {\mathbb P} (u^j u) = u_0(x) \delta_0(t). \ \ \ \ \ (4)

To show (3), it will suffice from dominated convergence to show that

\displaystyle  \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} P_{\leq N} u^i(T,x) P_{\leq N} u^j(T,x)\ dx

\displaystyle = \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} P_{\leq N} u^i(0,x) P_{\leq N} u^j(0,x)\ dx + o(1)

as {N \rightarrow \infty}. Applying {P_{\leq N}} to (4), we have

\displaystyle  \partial_t P_{\leq N} u + \partial_j P_{\leq N} {\mathbb P} (u^j u) = P_{\leq N} u_0(x) \delta_0(t).

From Bernstein’s inequality we conclude that {\partial_t P_{\leq N} u \in C^0_t C^\infty_x( \mathbf{T}_E \rightarrow {\bf R}^d)}, and thus {P_{\leq N} u \in C^1_t C^\infty_x( \mathbf{T}_E \rightarrow {\bf R}^d)}. Thus {P_{\leq N} u} solves the PDE

\displaystyle  \partial_t P_{\leq N} u + \partial_j P_{\leq N} {\mathbb P} (u^j u) = 0

\displaystyle  P_{\leq N} u(0,x) = P_{\leq N} u_0(x)

in the classical sense. We can then apply the fundamental theorem of calculus to write

\displaystyle  \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} P_{\leq N} u^i(T,x) P_{\leq N} u^j(T,x)\ dx

\displaystyle = \frac{1}{2} \int_{\mathbf{T}_E} \eta_{ij} P_{\leq N} u^i(0,x) P_{\leq N} u^j(0,x)\ dx

\displaystyle  + \int_0^T \int_{\mathbf{T}} P_{\leq N} u \cdot \partial_t P_{\leq N} u\ dx dt

and so it will suffice to show that

\displaystyle \int_0^T \int_{\mathbf{T}_E} P_{\leq N} u \cdot \partial_j P_{\leq N} {\mathbb P} (u^j u)\ dx dt = o(1).

We can integrate by parts to place the Leray projection {{\mathbb P}} onto the divergence-free factor {u}, at which point it may be removed. Moving the derivative {\partial_j} over there as well, we now reduce to showing that

\displaystyle \int_0^T \int_{\mathbf{T}_E} \partial_j P_{\leq N} u \cdot P_{\leq N} (u^j u)\ dx dt = o(1).

On the other hand, the expression {\partial_j P_{\leq N} u \cdot P_{\leq N} u^j P_{\leq N} u} is a total derivative (as {u} is divergence-free), and thus has vanishing integral. Thus it remains to show that

\displaystyle \int_0^T \int_{\mathbf{T}_E} \partial_j P_{\leq N} u \cdot [ P_{\leq N} (u^j u) - P_{\leq N} u^j P_{\leq N} u] dx dt = o(1).

From Bernstein’s inequality, Exercise 2, and the triangle inequality one has for any time {t} that

\displaystyle  \| \partial_j P_{\leq N} u(t) \|_{L^\infty_x} \lesssim_d \| P_{\leq 1} u(t) \|_{L^\infty_x} + \sum_{1 < N' \leq N} N' \| P_{N'} u(t) \|_{L^\infty_x}

\displaystyle  \lesssim_{d,\alpha} \| u(t) \|_{C^{0,\alpha}_x} + \sum_{1 < N' \leq N} (N')^{1-\alpha} \| u(t) \|_{C^{0,\alpha}_x}

\displaystyle  \lesssim_{d,\alpha} N^{1-\alpha} \| u \|_{C^0_t C^{0,\alpha}_x};

as { \| u \|_{C^0_t C^{0,\alpha}_x}} is finite, it thus suffices to establish the pointwise bound

\displaystyle  P_{\leq N} (u^j u) - P_{\leq N} u^j P_{\leq N} u = o( N^{\alpha-1} ).

We split the left-hand side into the sum of

\displaystyle  P_{\leq N} (P_{\leq N/4} u^j u) - P_{\leq N/4} u^j P_{\leq N} u \ \ \ \ \ (5)

\displaystyle  P_{\leq N} (P_{> N/4} u^j P_{\leq N/4} u) - P_{\leq N} P_{> N/4} u^j P_{\leq N/4} u \ \ \ \ \ (6)


\displaystyle  P_{\leq N} (P_{> N/4} u^j P_{> N/4} u) - P_{\leq N} P_{> N/4} u^j P_{\leq N} P_{> N/4} u \ \ \ \ \ (7)

where we use the fact that {P_{\leq N} P_{\leq N/4} = P_{\leq N/4}}.
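As a quick sanity check on this splitting, the identity expressing {P_{\leq N}(u^j u) - P_{\leq N} u^j P_{\leq N} u} as the sum of (5), (6), (7) can be verified numerically with sharp Fourier cutoffs on a one-dimensional torus. The discretization below (grid size, band limits, function names) is an illustrative stand-in for the projections in the text, not part of the argument:

```python
import numpy as np

def P_le(u, N):
    """Sharp projection onto Fourier modes |k| <= N on the unit torus."""
    u_hat = np.fft.fft(u)
    k = np.fft.fftfreq(len(u), d=1/len(u))   # integer frequencies
    u_hat[np.abs(k) > N] = 0
    return np.real(np.fft.ifft(u_hat))

rng = np.random.default_rng(0)
M = 256
x = np.arange(M) / M
# band-limited random real fields standing in for u^j and u
f = sum(rng.normal() * np.cos(2*np.pi*k*x + rng.uniform(0, 2*np.pi)) / (1 + k)**2
        for k in range(1, 40))
g = sum(rng.normal() * np.cos(2*np.pi*k*x + rng.uniform(0, 2*np.pi)) / (1 + k)**2
        for k in range(1, 40))

N = 32
lhs = P_le(f*g, N) - P_le(f, N) * P_le(g, N)      # the commutator to be bounded
f_lo, f_hi = P_le(f, N//4), f - P_le(f, N//4)     # P_{<=N/4} and P_{>N/4} pieces
g_lo, g_hi = P_le(g, N//4), g - P_le(g, N//4)
term5 = P_le(f_lo*g, N) - f_lo * P_le(g, N)                 # analogue of (5)
term6 = P_le(f_hi*g_lo, N) - P_le(f_hi, N) * g_lo           # analogue of (6)
term7 = P_le(f_hi*g_hi, N) - P_le(f_hi, N) * P_le(g_hi, N)  # analogue of (7)
assert np.allclose(lhs, term5 + term6 + term7)
```

The identity is exact here because the sharp cutoffs obey {P_{\leq N} P_{\leq N/4} = P_{\leq N/4}}, which is the only property of the projections used in the splitting.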

To treat (7), we use Exercise 2 to conclude that

\displaystyle  \| P_{> N/4} u(t) \|_{L^\infty_x} \lesssim_{d,\alpha} N^{-\alpha} \| u \|_{C^0_t C^{0,\alpha}_x}

and so the quantity (7) is {O_{\alpha,d,u}(N^{-2\alpha})}, which is acceptable since {\alpha > 1/3}. Now we turn to (5). This is a commutator of the form

\displaystyle  P_{\leq N} (f u) - f P_{\leq N} u

where {f := P_{\leq N/4} u^j}. Observe that this commutator would vanish if {u} were replaced by {P_{\leq N/4} u}, thus we may write this commutator as

\displaystyle  P_{\leq N} (f g) - f P_{\leq N} g

where {g := P_{> N/4} u}. If we write

\displaystyle  P_{\leq N} g(x) = \int_{{\bf R}^d} \psi(y) g( x + y/N )\ dy

for a suitable Schwartz function {\psi} of total mass one, we have

\displaystyle  P_{\leq N} (f g)(x) - f P_{\leq N} g(x)

\displaystyle = \int_{{\bf R}^d} \psi(y) g( x + y/N ) (f(x+y/N) - f(x))\ dy;

writing {f(x+y/N) - f(x) = O( \frac{|y|}{N} \|\nabla f \|_{C^0})}, we thus have the bound

\displaystyle  P_{\leq N} (f g) - f P_{\leq N} g = O_d( N^{-1} \| g \|_{C^0} \|\nabla f \|_{C^0} ).

But from Bernstein’s inequality and Exercise 2 we have

\displaystyle  \|\nabla f \|_{C^0} \lesssim_{d,\alpha} N^{1-\alpha} \| u \|_{C^0_t C^{0,\alpha}_x}


\displaystyle  \|g \|_{C^0} \lesssim_{d,\alpha} N^{-\alpha} \| u \|_{C^0_t C^{0,\alpha}_x}

and so we see that (5) is also of size {O_{\alpha,d,u}(N^{-2\alpha})}, which is acceptable since {\alpha > 1/3}. A similar argument gives (6), and the claim follows.

As shown by Constantin, E, and Titi, the Hölder regularity in the above result can be relaxed to Besov regularity, at least in non-endpoint cases:

Exercise 3 Let {\alpha > 1/3}. Define the Besov space {B^3_{\alpha,\infty}(\mathbf{T}_E \rightarrow {\bf R}^d)} to be the space of functions {u \in L^3(\mathbf{T}_E \rightarrow {\bf R}^d)} such that the Besov space norm

\displaystyle  \| u \|_{B^3_{\alpha,\infty}(\mathbf{T}_E \rightarrow {\bf R}^d)} := \| P_{\leq 1} u \|_{L^3(\mathbf{T}_E \rightarrow {\bf R}^d)} + \sup_{N>1} N^{\alpha} \| P_N u \|_{L^3(\mathbf{T}_E \rightarrow {\bf R}^d)}

is finite.

Show that if {u \in L^3_t B^3_{\alpha,\infty}([0,T] \times \mathbf{T}_E \rightarrow {\bf R}^d)} is a weak solution to the Euler equations, the energy {t \mapsto \frac{1}{2} \int_{\mathbf{T}_E} |u(t,x)|^2\ dx} is conserved in time.

The endpoint case {\alpha=1/3} of the above exercise is still open; however energy conservation in the slightly smaller space {B^3_{1/3,c({\bf N})}} is known thanks to the work of Cheskidov, Constantin, Friedlander, and Shvydkoy (see also this paper of Isett for further discussion, and this paper of Isett and Oh for an alternate argument that also works on Riemannian manifolds).

As observed by Isett (see also the recent paper of Colombo and de Rosa), the above arguments also give some partial information about the energy in the low regularity regime:

Exercise 4 Let {u \in C^0_t C^{0,\alpha}([0,T] \times \mathbf{T}_E \rightarrow {\bf R}^d)} be a weak solution to the Euler equations for {0 < \alpha < 1}.

  • (i) If {\alpha=1/3}, show that the energy {t \mapsto \frac{1}{2} \int_{\mathbf{T}_E} |u(t,x)|^2\ dx} is a {C^1} function of time. (Hint: express the energy as the uniform limit of {t \mapsto \frac{1}{2} \int_{\mathbf{T}_E} |P_{\leq N} u(t,x)|^2\ dx}.)
  • (ii) If {0 < \alpha < 1/3}, show that the energy {t \mapsto \frac{1}{2} \int_{\mathbf{T}_E} |u(t,x)|^2\ dx} is a {C^{0,\frac{2\alpha}{1-\alpha}}} function of time.

Exercise 5 Let {u \in C^0_t C^{0,\alpha}([0,T] \times \mathbf{T}_E \rightarrow {\bf R}^d)} be a weak solution to the Navier-Stokes equations for some {1/3 < \alpha < 1} with initial data {u_0}. Establish the energy identity

\displaystyle  \frac{1}{2} \int_{\mathbf{T}_E} |u(T,x)|^2\ dx + \nu \int_0^T \int_{\mathbf{T}_E} |\nabla u(t,x)|^2\ dx dt = \frac{1}{2} \int_{\mathbf{T}_E} |u_0(x)|^2\ dx.

Remark 6 An alternate heuristic derivation of the {\alpha = 1/3} threshold for the Onsager conjecture is as follows. If {u \in C^0_t C^{0,\alpha}}, then from Exercise 2 we see that the portion of {u} that fluctuates at frequency {N} has amplitude {A} at most {O(N^{-\alpha})}; in particular, the amount of energy at frequencies {\sim N} is at most {O(N^{-2\alpha})}. On the other hand, by the heuristics in Remark 11 of 254A Notes 3, the time {T} needed for the portion of the solution at frequency {N} to evolve to a higher frequency scale such as {2N} is of order {T \sim \frac{1}{AN} \gtrsim N^{\alpha-1}}. Thus the rate of energy flux at frequency {N} should be {O( N^{-2\alpha}/T ) = O(N^{1-3\alpha})}. For {\alpha>1/3}, the energy flux goes to zero as {N \rightarrow \infty}, and so energy cannot escape to frequency infinity in finite time.

Remark 7 Yet another alternate heuristic derivation of the {\alpha = 1/3} threshold arises by considering the dynamics of individual Fourier coefficients. Using a Fourier expansion

\displaystyle  u(t,x) = \sum_{k \in {\bf Z}^d} \hat u(t,k) e^{2\pi i k \cdot x},

the Euler equations may be written as

\displaystyle  \partial_t \hat u^i(t,k) + 2\pi i k_j \sum_{k = k' + k''} \hat u^i(t,k') \hat u^j(t,k'') = - 2\pi i k_i \hat p(t,k)

\displaystyle  k_i \hat u^i(t,k) = 0.

In particular, the energy {|\hat u(t,k)|^2} at a single Fourier mode {k \in {\bf Z}^d} evolves according to the equation

\displaystyle  \partial_t |\hat u(t,k)|^2 = - 4 \mathrm{Re}( \pi i k_j \sum_{k = k' + k''} \eta_{il} \hat u^i(t,-k)\hat u^l(t,k') \hat u^j(t,k'') ). \ \ \ \ \ (8)

If {u \in C^0_t C^{0,\alpha}}, then we have {P_N u = O( N^{-\alpha})} for any {N > 1}, hence by Plancherel’s theorem

\displaystyle  \sum_{k \in {\bf Z}^d: |k| \sim N} |\hat u(t,k)|^2 \lesssim N^{-2\alpha}

which suggests that (up to logarithmic factors) one would expect {\hat u(t,k)} to be of magnitude about {N^{-\alpha-d/2}}. Onsager posited that for typically “turbulent” or “chaotic” flows, the main contributions to (8) come when {k',k''} have magnitude roughly comparable to that of {k}, and that the various summands should not be correlated strongly to each other. For {k \sim N}, one might expect about {N^d} significant terms in the sum, which according to standard “square root cancellation heuristics” (cf. the central limit theorem) suggests that the sum is about as large as {N^{-d/2}} times the main term. Thus the total flux of energy in or out of a single mode {k} would be expected to be of size {O( N^{d/2} \cdot N \cdot N^{-3(\alpha+d/2)} ) = O( N^{1-3\alpha} N^{-d} )}, and so the total flux in or out of the frequency range {|k| \sim N} (which consists of {\sim N^d} modes {k}) should be about {O(N^{1-3\alpha})}. As such, for {\alpha>1/3} the energy flux should decay in {N} and so there is no escape of energy to frequency infinity, whereas for {\alpha < 1/3} such an escape should be possible. Related heuristics can also support Kolmogorov’s 1941 model of the distribution of energy in the vanishing viscosity limit; see this blog post for more discussion. On the other hand, while Onsager did discuss the dynamics of individual Fourier coefficients in his paper, it appears that he arrived at the {1/3} threshold by a more physical space based approach, a rigorous version of which was eventually established by Duchon and Robert; see this survey of Eyink and Sreenivasan for more discussion.
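The square root cancellation heuristic invoked above can be illustrated with a quick Monte Carlo experiment; this only demonstrates the generic size of random-phase sums (cf. the central limit theorem), not anything about actual Euler flows:

```python
import numpy as np

# a sum of M terms with uncorrelated random phases has typical size ~ sqrt(M), not M
rng = np.random.default_rng(1)
ratios = {}
for M in [100, 1000, 10000]:
    phases = rng.uniform(0, 2*np.pi, size=(500, M))  # 500 trials of M random phases
    sums = np.abs(np.exp(1j*phases).sum(axis=1))
    ratios[M] = sums.mean() / np.sqrt(M)
# for large M the mean of |sum| is sqrt(pi M / 4) ~ 0.886 sqrt(M)
assert all(0.75 < r < 1.0 for r in ratios.values())
```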

— 2. The (local) isometric embedding problem —

Before we develop the convex integration method for fluid equations, we first discuss the simpler (and historically earlier) instance of this technique for the isometric embedding problem for Riemannian manifolds. To avoid some topological technicalities that are not the focus of the present set of notes, we only consider the local problem of embedding a small neighbourhood {U} of the origin {0} in {{\bf R}^d} into Euclidean space {{\bf R}^n}.

Let {U} be an open neighbourhood of {0} in {{\bf R}^d}. A (smooth) Riemannian metric on {U}, when expressed in coordinates, is a family {g = (g_{ij})_{1 \leq i,j \leq d}} of smooth maps {g_{ij}: U \rightarrow {\bf R}} for {i,j=1,\dots,d}, such that for each point {x \in U}, the matrix {(g_{ij}(x))_{1 \leq i,j \leq d}} is symmetric and positive definite. Any such metric {g} gives {U} the structure of an (incomplete) Riemannian manifold {(U, g)}. An isometric embedding of this manifold into a Euclidean space {{\bf R}^n} is a map {\Phi: U \rightarrow {\bf R}^n} which is continuously differentiable, injective, and obeys the equation

\displaystyle  \langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} = g_{ij} \ \ \ \ \ (9)

pointwise on {U}, where {\langle, \rangle_{{\bf R}^n}} is the usual inner product (or dot product) on {{\bf R}^n}. In the differential geometry language from Notes 1, we are looking for an injective map {\Phi} such that the Euclidean metric {\eta} on {{\bf R}^n} pulls back to {g} via {\Phi}: {\Phi^* \eta = g}.

The isometric embedding problem asks, given a Riemannian manifold such as {(U,g)}, whether there is an isometric embedding from this manifold to a Euclidean space {{\bf R}^n}; for simplicity we only discuss the simpler local isometric embedding problem of constructing an isometric immersion of {(U', g)} into {{\bf R}^n} for some sufficiently small neighbourhood {U' \subset U} of the origin. In particular for the local problem we do not need to worry about injectivity since (9) ensures that the derivative map {D\Phi} is injective at the origin, and hence {\Phi} is injective near the origin by the inverse function theorem (indeed it is an immersion near the origin).

It is a celebrated theorem of Nash (discussed in this previous blog post) that the isometric embedding problem is possible in the smooth category if the dimension {n} is large enough. For sake of discussion we just present the local version:

Theorem 8 (Nash embedding theorem) Suppose that {n} is sufficiently large depending on {d}. Then for any smooth metric {g} on a neighbourhood {U} of the origin, there is a smooth local isometric embedding {\Phi: U' \rightarrow {\bf R}^n} on some smaller neighbourhood {U'} of the origin.

The optimal value of {n} depending on {d} is not completely known, but it grows roughly quadratically in {d}. Indeed, in this paper of Günther it is shown that one can take

\displaystyle  n = \frac{d(d+3)}{2} + 5.
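For concreteness, one can tabulate Günther’s bound against the {\frac{d(d+1)}{2}} count of independent metric components that appears in Proposition 9 below (illustrative arithmetic only):

```python
# Günther's sufficient embedding dimension versus the d(d+1)/2 metric component count
def gunther_dim(d):
    return d*(d + 3)//2 + 5

def metric_components(d):
    return d*(d + 1)//2

for d in range(1, 6):
    print(d, metric_components(d), gunther_dim(d))

# the gap is exactly d + 5, so both bounds grow quadratically
# with the same leading coefficient 1/2
assert all(gunther_dim(d) - metric_components(d) == d + 5 for d in range(1, 100))
```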

In the other direction, one cannot take {n} below {\frac{d(d+1)}{2}}:

Proposition 9 Suppose that {n < \frac{d(d+1)}{2}}. Then there exists a smooth Riemannian metric {g} on an open neighbourhood {U} of the origin in {{\bf R}^d} such that there is no smooth isometric embedding {\Phi} from any smaller neighbourhood {U'} of the origin to {{\bf R}^n}.

Proof: Informally, the reason for this is that the given field {g_{ij}} has {\frac{d(d+1)}{2}} degrees of freedom (which is the number of independent fields after accounting for the symmetry {g_{ij}=g_{ji}}), but there are only {n} degrees of freedom for the unknown {\Phi}. To make this rigorous, we perform a Taylor expansion of both {g_{ij}} and {\Phi} around the origin up to some large order {N}, valid for a sufficiently small neighbourhood {U'}:

\displaystyle  g_{ij}(x) = \sum_{r_1,\dots,r_d \geq 0: r_1 + \dots + r_d \leq N} g_{ij, r_1,\dots, r_d} x_1^{r_1} \dots x_d^{r_d} + O( |x|^{N+1})

\displaystyle  \Phi(x) = \sum_{r_1,\dots,r_d \geq 0: r_1 + \dots + r_d \leq N+1} \Phi_{r_1,\dots, r_d} x_1^{r_1} \dots x_d^{r_d} + O( |x|^{N+2}).

Equating coefficients, we see that the coefficients

\displaystyle  (g_{ij,r_1,\dots,r_d})_{1 \leq i \leq j \leq d; r_1+\dots+r_d \leq N} \in {\bf R}^{\frac{d(d+1)}{2} \binom{N+d}{d}} \ \ \ \ \ (10)

are a polynomial function of the coefficients

\displaystyle  (\Phi_{r_1,\dots, r_d})_{r_1+\dots+r_d \leq N+1} \in {\bf R}^{n \binom{N+1+d}{d}};

this polynomial can be written down explicitly if desired, but its precise form will not be relevant for the argument. Observe that the space of possible coefficients (10) contains an open ball, as can be seen by considering arbitrary small perturbations {g_{ij}} of the Euclidean metric {\delta_{ij}} on {U} (here it is important to restrict to {i \leq j} in order to avoid the symmetry constraint {g_{ji} = g_{ij}}; also, the positive definiteness of the metric will be automatic as long as one restricts to sufficiently small perturbations). A polynomial map whose image contains an open ball must have domain dimension at least as large as that of its target; comparing dimensions, we conclude that if every smooth metric {g} had a smooth isometric embedding {\Phi}, one must have the inequality

\displaystyle  n \binom{N+1+d}{d} \geq \frac{d(d+1)}{2} \binom{N+d}{d}.

Dividing by {N^{d}} and sending {N \rightarrow \infty}, we conclude that {n \geq \frac{d(d+1)}{2}}. Taking contrapositives, the claim follows. \Box
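A small computation confirms the monomial counting behind this dimension comparison: the number of monomials of degree at most {N} in {d} variables is {\binom{N+d}{d}}, and the ratio of the counts on the {\Phi} side (degree {\leq N+1}) and the {g} side (degree {\leq N}) tends to {1} as {N \rightarrow \infty}, so only the comparison of {n} with {\frac{d(d+1)}{2}} survives the limit:

```python
from math import comb

def num_monomials(d, N):
    # monomials x_1^{r_1} ... x_d^{r_d} of total degree at most N
    return comb(N + d, d)

assert num_monomials(2, 2) == 6  # 1, x1, x2, x1^2, x1*x2, x2^2

d = 3
for N in [10, 100, 1000]:
    # ratio of Phi-side to g-side coefficient counts; tends to 1 as N grows
    print(N, num_monomials(d, N + 1) / num_monomials(d, N))

assert abs(num_monomials(d, 10**6 + 1) / num_monomials(d, 10**6) - 1) < 1e-5
```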

Remark 10 If one replaces “smooth” with “analytic”, one can reverse the arguments here using the Cauchy–Kovalevskaya theorem and show that any analytic metric on {{\bf R}^d} can be locally analytically embedded into {{\bf R}^{\frac{d(d+1)}{2}}}; this is a classical result of Cartan and Janet.

Apart from the slight gap in dimensions, this would seem to settle the question of when a {d}-dimensional metric may be locally isometrically embedded in {{\bf R}^n}. However, all of the above arguments required the immersion map {\Phi} to be smooth (i.e., {C^\infty}), whereas the definition of an isometric embedding only requires {C^1} regularity.

It is a remarkable (and somewhat counter-intuitive) result of Nash and Kuiper that if one only requires the embedding to be in {C^1}, then one can embed into a much lower dimensional space:

Theorem 11 (Nash-Kuiper embedding theorem) Let {n \geq d+1}. Then for any smooth metric {g} on a neighbourhood {U} of the origin, there is a {C^1} local isometric embedding {\Phi: U' \rightarrow {\bf R}^n} on some smaller neighbourhood {U'} of the origin.

Nash originally proved this theorem with the slightly weaker condition {n \geq d+2}; Kuiper then obtained the optimal condition {n \geq d+1}. The case {n=d} fails due to curvature obstructions; for instance, if the Riemannian metric {g} has positive scalar curvature, then small Riemannian balls of radius {r} will have (Riemannian) volume slightly less than their Euclidean counterparts, whereas any {C^1} embedding into {{\bf R}^d} will preserve both Riemannian length and volume, preventing such an isometric embedding from existing.

Remark 12 One striking illustration of the distinction between the {C^1} and smooth categories comes when considering isometric embeddings of the round sphere {S^2} (with the usual metric) into Euclidean space {{\bf R}^3}. It is a classical result (see e.g., Spivak’s book) that the only {C^2} isometric embeddings of {S^2} in {{\bf R}^3} are the obvious ones coming from composing the inclusion map {\iota: S^2 \rightarrow {\bf R}^3} with an isometry of the Euclidean space; however, the Nash-Kuiper construction allows one to create a {C^1} embedding of {S^2} into an arbitrarily small ball! Thus the {C^1} embedding problem lacks the “rigidity” of the {C^2} embedding problem. This is an instance of a more general principle that nonlinear differential equations such as (9) can become much less rigid when one weakens the regularity hypotheses demanded on the solution.

To prove this theorem we work with a relaxation of the isometric embedding problem. We say that {(\Phi,R)} is a short isometric embedding on {U'} if {\Phi: U' \rightarrow {\bf R}^n}, {R: U' \rightarrow {\bf R}^{d^2}} solve the equation

\displaystyle  \langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} + R_{ij} = g_{ij}  \ \ \ \ \ (11)

on {U'} with the matrix {R(x) = (R_{ij}(x))_{1 \leq i,j \leq d}} symmetric and positive definite for all {x \in U'}. With the additional unknown field {R_{ij}}, it is much easier to solve the short isometric embedding problem than the true problem. For instance:

Proposition 13 Let {n \geq d}, and let {g} be a smooth Riemannian metric on a neighbourhood {U} of the origin in {{\bf R}^d}. Then there is at least one short isometric embedding {(\Phi, R)} on some neighbourhood of the origin.

Proof: Set {\Phi(x) := \varepsilon \iota(x)} and {R_{ij} := g_{ij} - \varepsilon^2 \eta_{ij}} for a sufficiently small {\varepsilon>0}, where {\iota: {\bf R}^d \rightarrow {\bf R}^n} is the standard embedding, and {\eta} the Euclidean metric on {U}; since {\langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} = \varepsilon^2 \eta_{ij}}, this will be a short isometric embedding on some neighbourhood of the origin. \Box
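Here is a minimal numerical check of this construction for a sample metric on a neighbourhood of the origin in {{\bf R}^2}; the metric {g} below is a hypothetical example, and the point is that {\langle \partial_i \Phi, \partial_j \Phi \rangle = \varepsilon^2 \delta_{ij}} leaves a positive definite remainder {R}:

```python
import numpy as np

# a sample smooth metric near the origin in R^2 (hypothetical choice):
# a small perturbation of the Euclidean metric, positive definite near 0
def g(x):
    return np.eye(2) + 0.1*np.array([[x[0], x[1]], [x[1], -x[0]]])

eps = 0.1
for x in [np.array([0.0, 0.0]), np.array([0.05, -0.03])]:
    # Phi(x) = eps * iota(x) has <d_i Phi, d_j Phi> = eps^2 * delta_ij,
    # so the remainder R = g - eps^2 * eta must absorb the rest
    R = g(x) - eps**2 * np.eye(2)
    # (Phi, R) is a short isometric embedding iff R is symmetric positive definite
    assert np.allclose(R, R.T)
    assert np.all(np.linalg.eigvalsh(R) > 0)
```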

To create a true isometric embedding {\Phi}, we will first construct a sequence {(\Phi^{(N)},R^{(N)})} of short embeddings with {R^{(N)}} converging to zero in a suitable sense, and then pass to a limit. The key observation is that, since the positive semi-definite matrices lie in the convex cone spanned by the rank one matrices, one can add a high frequency perturbation to the first component {\Phi} of a short embedding {(\Phi,R)} to largely erase the error term {R}, replacing it instead with a much higher frequency error.

We now prove Theorem 11. The key iterative step is

Theorem 14 (Iterative step) Let {n \geq d+1}, let {B} be a closed ball in {{\bf R}^d}, let {g} be a smooth Riemannian metric on {B}, and let {(\Phi,R)} be a short isometric embedding on {B}. Then for any {\varepsilon > 0}, one can find a short isometric embedding {(\Phi',R')} solving (11) on {B} with

\displaystyle  \| R' \|_{C^0(B \rightarrow {\bf R}^{d^2})} \lesssim_{d,n} \varepsilon

\displaystyle  \| \Phi' - \Phi \|_{C^0(B \rightarrow {\bf R}^{n})} \lesssim_{d,n} \varepsilon

\displaystyle  \| \Phi' - \Phi \|_{C^1(B \rightarrow {\bf R}^{n})} \lesssim_{d,n} \| R \|_{C^0(B \rightarrow {\bf R}^{d^2})}^{1/2} + \varepsilon.

Suppose for the moment that we had Theorem 14. Starting with the short isometric embedding {(\Phi^{(0)}, R^{(0)})} on a ball {B} provided by Proposition 13, we can iteratively apply the above theorem to obtain a sequence of short isometric embeddings {(\Phi^{(m)}, R^{(m)})} on {B} with

\displaystyle  \| R^{(m)} \|_{C^0(B \rightarrow {\bf R}^{d^2})} \lesssim_{d,n} 2^{-m}

\displaystyle  \| \Phi^{(m)} - \Phi^{(m-1)} \|_{C^0(B \rightarrow {\bf R}^{n})} \lesssim_{d,n} 2^{-m}

\displaystyle  \| \Phi^{(m)} - \Phi^{(m-1)} \|_{C^1(B \rightarrow {\bf R}^{n})} \lesssim_{d,n} \| R^{(m-1)} \|_{C^0(B \rightarrow {\bf R}^{d^2})}^{1/2} + 2^{-m}.

for {m \geq 1}. From this we see that {R^{(m)}} converges uniformly to zero, while {\Phi^{(m)}} converges in {C^1} norm to a {C^1} limit {\Phi}, which then solves (9) on {B}, giving Theorem 11. (Indeed, this shows that the space of {C^1} isometric embeddings is dense in the space of {C^1} short maps in the {C^0} topology.)

We prove Theorem 14 through a sequence of reductions. Firstly, we can rearrange it slightly:

Theorem 15 (Iterative step, again) Let {n \geq d+1}, let {B} be a closed ball in {{\bf R}^d}, let {\Phi: B \rightarrow {\bf R}^n} be a smooth immersion, and let {R = (R_{ij})_{1 \leq i,j \leq d}} be a smooth Riemannian metric on {B}. Then there exists a sequence {\Phi^{(N)}: B \rightarrow {\bf R}^n} of smooth immersions for {N=1,2,\dots} obeying the bounds

\displaystyle  \langle \partial_i \Phi^{(N)}, \partial_j \Phi^{(N)} \rangle_{{\bf R}^n} = \langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} + R_{ij} + o(1)

\displaystyle  \Phi^{(N)} = \Phi + o(1)

\displaystyle  \nabla \Phi^{(N)} = \nabla \Phi + O_{d,n}( \| R \|_{C^0(B \rightarrow {\bf R}^{d^2})}^{1/2} ) + o(1) \ \ \ \ \ (12)

uniformly on {B} for {i,j=1,\dots,d}, where {o(1)} denotes a quantity that goes to zero as {N \rightarrow \infty} (for fixed choices of {n,d,B,\Phi,R}).

Let us see how Theorem 15 implies Theorem 14. Let the notation and hypotheses be as in Theorem 14. We may assume {\varepsilon>0} to be small. Applying Theorem 15 with {R} replaced by {R - \varepsilon g} (which will be positive definite for {\varepsilon} small enough), we obtain a sequence {\Phi^{(N)}: B \rightarrow {\bf R}^n} of smooth immersions obeying the estimates

\displaystyle  \| \langle \partial_i \Phi^{(N)}, \partial_j \Phi^{(N)} \rangle_{{\bf R}^n} - \langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} - R_{ij} + \varepsilon g_{ij} \|_{C^0( B \rightarrow {\bf R} )} = o(1) \ \ \ \ \ (13)

\displaystyle  \| \Phi^{(N)} - \Phi \|_{C^0(B \rightarrow {\bf R}^n)} = o(1)

\displaystyle  \| \Phi^{(N)} - \Phi \|_{C^1(B \rightarrow {\bf R}^n)} \lesssim_{d,n} \| R \|_{C^0(B \rightarrow {\bf R}^{d^2})}^{1/2} + O(\varepsilon) + o(1).

If we set

\displaystyle  R^{(N)}_{ij} := \langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} + R_{ij} - \langle \partial_i \Phi^{(N)}, \partial_j \Phi^{(N)} \rangle_{{\bf R}^n}

then {R^{(N)}} is smooth, symmetric, and (from (13)) will be positive definite for {N} large enough. By construction, we thus have {(\Phi^{(N)}, R^{(N)})} solving (11) for such {N}, and Theorem 14 follows.

To prove Theorem 15, it is convenient to break up the metric {R} into more “primitive” pieces that are rank one matrices:

Lemma 16 (Rank one decomposition) Let {B} be a closed ball in {{\bf R}^d}, and let {R = (R_{ij})_{1 \leq i,j \leq d}} be a smooth Riemannian metric on {B}. Then there exists a finite collection {v^1,\dots,v^K} of unit vectors {v^k = (v^k_i)_{1 \leq i \leq d}} in {{\bf R}^d}, and smooth functions {a^1,\dots,a^K: B \rightarrow {\bf R}}, such that

\displaystyle  R_{ij}(x) = \sum_{k=1}^K a^k(x)^2 v^k_i v^k_j

for all {x \in B}. Furthermore, for each {x \in B}, at most {O_d(1)} of the {a^k(x)} are non-zero.

Remark 17 Strictly speaking, the unit vectors {v^k} should belong to the dual space {({\bf R}^d)^*} of {{\bf R}^d} rather than {{\bf R}^d} itself, in order to have the index {i} appear as subscripts instead of superscripts. A similar consideration occurs for the frequency vectors {k = (k_i)_{1 \leq i \leq d}} from Remark 7. However, we will not bother to distinguish between {{\bf R}^d} and {({\bf R}^d)^*} here (since they are identified using the Euclidean metric).

Proof: Fix a point {x_0} in {B}. Then the matrix {R = (R_{ij}(x_0))_{1 \leq i,j \leq d}} is symmetric and positive definite; one can thus write {R = \sum_{m=1}^d \lambda_m (u^m)^T u^m}, where {u^1,\dots,u^d} is an orthonormal basis of row eigenvectors of {R} (so that each {(u^m)^T u^m} is a rank one matrix) and {\lambda_m>0} are the eigenvalues (we suppress for now the dependence of these objects on {x_0}). Using the parallelogram identity

\displaystyle  (u^m)^T (u^m) + (u^{m'})^T (u^{m'})

\displaystyle = \frac{1}{2} (u^m + u^{m'})^T (u^m + u^{m'}) + \frac{1}{2} (u^m - u^{m'})^T (u^m - u^{m'})

we can then write

\displaystyle  R(x_0) = \sum_{k=1}^{2d^2} (a^k)^2 (v^k)^T v^k \ \ \ \ \ (14)

for some positive real numbers {a^k>0}, where {v^k} are the {2d^2} unit vectors of the form {\frac{1}{\sqrt{2}} (u^m \pm u^{m'})} for {1 \leq m,m' \leq d}, enumerated in an arbitrary order. From the further parallelogram identity

\displaystyle  (u^m)^T (u^{m'}) + (u^{m'})^T u^m

\displaystyle = \frac{1}{2} (u^m + u^{m'})^T (u^m + u^{m'}) - \frac{1}{2} (u^m - u^{m'})^T (u^m - u^{m'})

we see that every sufficiently small symmetric perturbation of {R(x_0)} also has a representation of the form (14) with slightly different coefficients {a^k} that depend smoothly on the perturbation. As {R} is smooth, we thus see that for {x} sufficiently close to {x_0} we have the decomposition

\displaystyle  R(x) = \sum_{k=1}^{2d^2} a^k(x)^2 (v^k)^T (v^k)

for some positive quantities {a^k(x)} varying smoothly with {x}. This gives the lemma in a small ball {B(x_0,r_0)} centred at {x_0}; the claim then follows by covering {B} by a finite number of balls of the form {B(x_0,r_0/4)} (say), covering these balls by balls {B(y_\alpha,\delta)} of a fixed radius {\delta} smaller than all the {r_0/4} in the finite cover, in such a way that any point lies in at most {O_d(1)} of the balls {B(y_\alpha,2\delta)}, constructing a smooth partition of unity {1 = \sum_\alpha \psi_\alpha(x)^2} adapted to the {B(y_\alpha,2\delta)}, multiplying each of the decompositions of {R(x)} previously obtained on {B(y_\alpha,2\delta)} (which each lie in one of the {B(x_0,r_0)}) by {\psi_\alpha(x)^2}, and summing to obtain the required decomposition on {B}. \Box
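The pointwise step of this construction can be sketched in code: the snippet below decomposes a symmetric matrix close to the identity into positively weighted rank one pieces along the fixed unit directions {e_m} and {(e_m \pm e_{m'})/\sqrt{2}}, in the spirit of the parallelogram identities above. The particular splitting via the base weight {c} is an illustrative device, not taken from the proof:

```python
import numpy as np

def rank_one_decomposition(R, c=None):
    """Write a symmetric matrix R close to the identity as sum_k a_k^2 v_k v_k^T
    with fixed unit vectors v_k, via the parallelogram-identity splitting."""
    d = R.shape[0]
    if c is None:
        # base weight; it must dominate the off-diagonal entries for positivity
        c = 2*np.abs(R - np.diag(np.diag(R))).max() + 1e-9
    terms = []
    for m in range(d):
        for mp in range(m + 1, d):
            # w_+ w_+^T - w_- w_-^T = e_m e_{m'}^T + e_{m'} e_m^T
            wp = (np.eye(d)[m] + np.eye(d)[mp]) / np.sqrt(2)
            wm = (np.eye(d)[m] - np.eye(d)[mp]) / np.sqrt(2)
            terms.append((c + R[m, mp], wp))
            terms.append((c - R[m, mp], wm))
    used = sum((a*np.outer(v, v) for a, v in terms), np.zeros((d, d)))
    for m in range(d):
        # leftover diagonal weight; positive as long as R is close to the identity
        terms.append((R[m, m] - used[m, m], np.eye(d)[m]))
    assert all(a > 0 for a, _ in terms)  # these play the role of the a^k(x)^2
    return terms

R = np.array([[1.0, 0.1, -0.05],
              [0.1, 1.1, 0.02],
              [-0.05, 0.02, 0.9]])
terms = rank_one_decomposition(R)
recon = sum(a*np.outer(v, v) for a, v in terms)
assert np.allclose(recon, R)
```

The full lemma then patches such pointwise decompositions together with a partition of unity, as in the proof above.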

Remark 18 Informally, Lemma 16 lets one synthesize a metric {R_{ij}} as a “convex integral” of rank one pieces, so that if the problem at hand has the freedom to “move” in the direction of each of these rank one pieces, then it also has the freedom to move in the direction {R_{ij}}, at least if one is working in low enough regularities that one can afford to rapidly change direction from one rank one perturbation to another. This convex integration technique was formalised by Gromov in his interpretation of the Nash-Kuiper method as part of his “{h}-principle”, which we will not discuss further here.

One can now deduce Theorem 15 from

Theorem 19 (Iterative step, rank one version) Let {n \geq d+1}, let {B} be a closed ball in {{\bf R}^d}, let {\Phi: B \rightarrow {\bf R}^n} be a smooth immersion, let {v = (v_i)_{i=1,\dots,d} \in {\bf R}^d} be a unit vector, and let {a: B \rightarrow {\bf R}} be smooth. Then there exists a sequence {\Phi^{(N)}: B \rightarrow {\bf R}^n} of smooth immersions for {N=1,2,\dots} obeying the bounds

\displaystyle  \langle \partial_i \Phi^{(N)}, \partial_j \Phi^{(N)} \rangle_{{\bf R}^n} = \langle \partial_i \Phi, \partial_j \Phi \rangle_{{\bf R}^n} + a^2 v_i v_j + o(1) \ \ \ \ \ (15)

\displaystyle  \Phi^{(N)} = \Phi + o(1)

\displaystyle  \nabla \Phi^{(N)} = \nabla \Phi + O_{d,n}(\| a \|_{C^0(B \rightarrow {\bf R})}) + o(1)

uniformly on {B} for {i,j=1,\dots,d}. Furthermore, the support of {\Phi^{(N)}-\Phi} is contained in the support of {a}.

Indeed, suppose that Theorem 19 holds, and we are in the situation of Theorem 15. We apply Lemma 16 to obtain the decomposition

\displaystyle  R_{ij}(x) = \sum_{k=1}^K a^k(x)^2 v^k_i v^k_j

with the stated properties. On taking traces we see that

\displaystyle \|a^k\|_{C^0(B \rightarrow {\bf R})} \lesssim_d \| R \|_{C^0(B \rightarrow {\bf R}^{d^2})}^{1/2}

for all {k}. Applying Theorem 19 {K} times (and diagonalising the sequences as necessary), we obtain sequences {\Phi^{(N,k)}: B \rightarrow {\bf R}^n} of smooth immersions for {k=0,1,\dots,K} such that {\Phi^{(N,0)} = \Phi} and one has

\displaystyle  \langle \partial_i \Phi^{(N,k)}, \partial_j \Phi^{(N,k)} \rangle_{{\bf R}^n} = \langle \partial_i \Phi^{(N,k-1)}, \partial_j \Phi^{(N,k-1)} \rangle_{{\bf R}^n} + (a^k)^2 v^k_i v^k_j + o(1)

\displaystyle  \Phi^{(N,k)} = \Phi^{(N,k-1)} + o(1)

\displaystyle  \nabla \Phi^{(N,k)} = \nabla \Phi^{(N,k-1)} + O_{d,n}(\| R \|_{C^0(B \rightarrow {\bf R}^{d^2})}^{1/2}) + o(1),

and such that the support of {\Phi^{(N,k)}-\Phi^{(N,k-1)}} is contained in that of {a^k}. The claim then follows from the triangle inequality, noting that the implied constant in (12) will not depend on {K} because of the bounded overlap in the supports of the {\Phi^{(N,k)}-\Phi^{(N,k-1)}}.

It remains to prove Theorem 19. We note that the requirement that {\Phi^{(N)}} be an immersion will be automatic from (15) for {N} large enough, since {\Phi} was already an immersion, making the matrix {(\langle \partial_i \Phi(x), \partial_j \Phi(x) \rangle_{{\bf R}^n} )_{1 \leq i,j \leq d}} positive definite uniformly for {x \in B}; this property is unaffected by the addition of the {o(1)} perturbation and the positive semi-definite rank one matrix {(a^2 v_i v_j)_{1 \leq i,j \leq d}}.

Writing {\Phi^{(N)} = \Phi + \Psi^{(N)}}, it will suffice to find a sequence of smooth maps {\Psi^{(N)}: B \rightarrow {\bf R}^n} supported in the support of {a} and obeying the approximate difference equation

\displaystyle  \langle \partial_i \Psi^{(N)}, \partial_j \Phi \rangle_{{\bf R}^n} + \langle \partial_i \Phi, \partial_j \Psi^{(N)} \rangle_{{\bf R}^n} + \langle \partial_i \Psi^{(N)}, \partial_j \Psi^{(N)} \rangle_{{\bf R}^n} = a^2 v_i v_j + o(1) \ \ \ \ \ (16)

and the bounds

\displaystyle  \Psi^{(N)} = o(1) \ \ \ \ \ (17)

\displaystyle  \nabla \Psi^{(N)} = O_{d,n}(\| a \|_{C^0(B \rightarrow {\bf R})}) + o(1) \ \ \ \ \ (18)

uniformly on {B}.

To locate these functions {\Psi^{(N)}}, we use the method of slow and fast variables. First we observe by applying a rotation that we may assume without loss of generality that {v} is the unit vector {e_1}, thus {v_i = 1_{i=1}}. We then use the ansatz

\displaystyle  \Psi^{(N)}(x) = \frac{1}{N} \mathbf{\Psi}( x, N x \hbox{ mod } {\bf Z}^d )

where {\mathbf{\Psi}: B \times \mathbf{T}_F \rightarrow {\bf R}^n} is a smooth function independent of {N} to be chosen later; thus {\Psi^{(N)}} is a function both of the “slow” variable {x \in B} and the “fast” variable {y := Nx \hbox{ mod } {\bf Z}^d} taking values in the “fast torus” {\mathbf{T}_F := ({\bf R}/{\bf Z})^d}. (We adopt the convention here of using boldface symbols to denote functions of both the fast and slow variables. The fast torus is isomorphic to the Eulerian torus {\mathbf{T}_E} from the introduction, but we denote them by slightly different notation as they play different roles.) Thus {\Psi^{(N)}} is a low amplitude but highly oscillating perturbation to {\Phi}. The fast variable oscillation means that {\Psi^{(N)}} will not be bounded in regularity norms higher than {C^1} (and so this ansatz is not available for use in the smooth embedding problem), but because we only wish to control {C^0} and {C^1} type quantities, we will still be able to get adequate bounds for the purposes of {C^1} embedding. Now that we have twice as many variables, the problem becomes more “underdetermined”, and we can arrive at a simpler PDE by decoupling the roles of the various variables (in particular, we will often work with PDE where the derivatives of the main terms are in the fast variables, but the coefficients only depend on the slow variables, and are thus effectively constant coefficient with respect to the fast variables).
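The division of labour between slow and fast derivatives can be checked symbolically on a simple one-dimensional model of this ansatz; the choice {\mathbf{\Psi}(x,y) = a(x) \sin(2\pi y)} below is purely illustrative:

```python
import sympy as sp

x, y, N = sp.symbols('x y N', positive=True)
a = sp.Function('a')
# one-dimensional model of the ansatz: bold-Psi(x, y) = a(x) sin(2 pi y)
Psi = a(x) * sp.sin(2*sp.pi*y)
ansatz = Psi.subs(y, N*x) / N                 # Psi^{(N)}(x) = (1/N) Psi(x, Nx)
deriv = sp.diff(ansatz, x)                    # full spatial derivative
fast_part = sp.diff(Psi, y).subs(y, N*x)      # fast derivative, size O(1)
slow_part = sp.diff(Psi, x).subs(y, N*x) / N  # slow derivative, suppressed by 1/N
assert sp.simplify(deriv - fast_part - slow_part) == 0
```

The fast derivative enters at unit size while the slow derivative carries the factor {1/N}, which is what allows terms with at least one power of {1/N} to be absorbed into the {o(1)} error below.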

Remark 20 Informally, one should think of functions {f: B \times \mathbf{T}_F \rightarrow {\bf R}^m} that are independent of the fast variable {y} as being of “low frequency”, and conversely functions that have mean zero in the fast variable {y} (thus {\int_{{\mathbf T}_F} f(x,y)\ dy = 0} for all {x}) as being of “high frequency”. Thus for instance any smooth function on {B \times \mathbf{T}_F} can be uniquely decomposed into a “low frequency” component and a “high frequency” component, with the two components orthogonal to each other. In later sections we will start inverting “fast derivatives” {N \nabla_y} on “high frequency” functions, which will effectively gain important factors of {N^{-1}} in the analysis. See also the table below for the dictionary between ordinary physical coordinates and fast-slow coordinates.

Physical coordinates | Fast-slow coordinates
Position {x \in U} | Slow variable {x \in U}
Fast variable {Nx \hbox{ mod } {\bf Z}^d} | Fast variable {y \in {\bf T}_F}
Function {f( x, Nx \hbox{ mod } {\bf Z}^d)} | Function {\mathbf{f}(x,y)}
{\partial_{x^i}} | {\partial_{x^i} + N \partial_{y^i}}
Low-frequency function {f(x)} | Function {f(x)} independent of {y}
High-frequency function {f(x)} | Function {\mathbf{f}(x,y)} mean zero in {y}
N/A | Slow derivative {\partial_{x^i}}
N/A | Fast derivative {\partial_{y^i}}

If we expand out using the chain rule, using {\partial_{x_i}} and {\partial_{y_i}} to denote partial differentiation in the coordinates of the slow and fast variables respectively, and noting that all terms with at least one power of {1/N} can be absorbed into the {o(1)} error, we see that we will be done as long as we can construct {\mathbf{\Psi}} to obey the bounds

\displaystyle  \partial_{y_i} \mathbf{\Psi} = O_{d,n}(\| a \|_{C^0(B \rightarrow {\bf R})}) \ \ \ \ \ (19)

and solve the exact equation

\displaystyle  \langle \partial_{y_i} \mathbf{\Psi}, \partial_{x_j} \Phi \rangle_{{\bf R}^n} + \langle \partial_{x_i} \Phi, \partial_{y_j} \mathbf{\Psi} \rangle_{{\bf R}^n} + \langle \partial_{y_i} \mathbf{\Psi}, \partial_{y_j} \mathbf{\Psi} \rangle_{{\bf R}^n} = a^2 1_{i=j=1} \ \ \ \ \ (20)

where {\Phi, a} are viewed as functions of the slow variable {x} only. The original approach of Nash to solve this equation was to use a function {\mathbf{\Psi}} that was orthogonal to the entire gradient of {\Phi}, thus

\displaystyle  \langle \partial_{x_i} \Phi, \mathbf{\Psi} \rangle_{{\bf R}^n} = 0 \ \ \ \ \ (21)

for {i=1,\dots,d}. Taking derivatives in {y_j} one would conclude that

\displaystyle  \langle \partial_{x_i} \Phi, \partial_{y_j} \mathbf{\Psi} \rangle_{{\bf R}^n} = 0

and similarly

\displaystyle  \langle \partial_{y_i} \mathbf{\Psi}, \partial_{x_j} \Phi \rangle_{{\bf R}^n} = 0,

and one now just had to solve the equation

\displaystyle  \langle \partial_{y_i} \mathbf{\Psi}, \partial_{y_j} \mathbf{\Psi} \rangle_{{\bf R}^n} = a^2 1_{i=j=1}. \ \ \ \ \ (22)

For this, Nash used a “spiral” construction

\displaystyle  \mathbf{\Psi}(x,y) = \frac{a(x)}{2\pi} ( u(x) \cos(2\pi y_1) + v(x) \sin(2\pi y_1) )

where {u,v: B \rightarrow {\bf R}^n} were unit vectors varying smoothly with respect to the slow variable; this obeys (22) and (19), and would also obey (21) if the vectors {u(x)} and {v(x)} were both always orthogonal to the entire gradient of {\Phi}. This is not possible in {n=d+1} (as {{\bf R}^n} cannot then support {d+2} linearly independent vectors), but there is no obstruction for {n \geq d+2}:

Lemma 21 (Constructing an orthogonal frame) Let {\Phi: B \rightarrow {\bf R}^n} be an immersion. If {n \geq d+2}, then there exist smooth vector fields {u,v: B \rightarrow {\bf R}^n} such that at every point {x}, {u(x), v(x)} are unit vectors orthogonal to each other and to {\partial_{x_i} \Phi(x)} for {i=1,\dots,d}.

Proof: Applying the Gram-Schmidt process to the linearly independent vectors {\partial_{x_i} \Phi(x)} for {i=1,\dots,d}, we can find an orthonormal system of vectors {w_1(x),\dots,w_d(x)}, depending smoothly on {x \in B}, whose span is the same as the span of the {\partial_{x_i} \Phi(x)}. Our task is now to find smooth functions {u,v: B \rightarrow {\bf R}^n} solving the system of equations

\displaystyle  \langle u, u \rangle_{{\bf R}^n} = \langle v,v \rangle_{{\bf R}^n} = 1 \ \ \ \ \ (23)

\displaystyle  \langle u, v \rangle_{{\bf R}^n} = 0 \ \ \ \ \ (24)

\displaystyle  \langle u, w_i \rangle_{{\bf R}^n} = \langle v, w_i \rangle_{{\bf R}^n} = 0 \ \ \ \ \ (25)

on {B}.

For {n \geq d+2} this is possible at the origin {x=0} from the Gram-Schmidt process. Now we extend in the {e_1} direction to the line segment {\{ x_1 e_1: |x_1| \leq 1 \}}. To do this we evolve the fields {u,v} by the parallel transport ODE

\displaystyle  \partial_{x_1} u := - \langle u, \partial_{x_1} w_i \rangle w_i

\displaystyle  \partial_{x_1} v := - \langle v, \partial_{x_1} w_i \rangle w_i

on this line segment. From the Picard existence and uniqueness theorem we can uniquely extend {u,v} smoothly to this segment with the specified initial data at {0}, and a simple calculation using Gronwall’s inequality shows that the system of equations (23), (24), (25) is preserved by this evolution. Then, one can extend to the disk {\{ x_1 e_1 + x_2 e_2: x_1^2+x_2^2 \leq 1 \}} by using the previous extension to the segment as initial data and solving the parallel transport ODE

\displaystyle  \partial_{x_2} u := - \langle u, \partial_{x_2} w_i \rangle w_i.

\displaystyle  \partial_{x_2} v := - \langle v, \partial_{x_2} w_i \rangle w_i.

Iterating this procedure we obtain the claim. \Box
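The conservation of the orthonormality conditions (23), (24), (25) along the parallel transport ODE can also be illustrated numerically. The sketch below takes {d=1}, {n=3}, and a toy frame {w_1(x) = (\cos x, \sin x, 0)} (a hypothetical choice, not arising from any particular immersion {\Phi}), and checks that the transported vectors stay orthonormal to the frame up to solver precision:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy frame in R^3 with d = 1 (an illustrative choice of w_1, not tied
# to any particular immersion Phi)
def w(x):
    return np.array([np.cos(x), np.sin(x), 0.0])

def dw(x):
    return np.array([-np.sin(x), np.cos(x), 0.0])

# parallel transport ODE: u' = -<u, w'> w,  v' = -<v, w'> w
def rhs(x, state):
    u, v = state[:3], state[3:]
    return np.concatenate([-np.dot(u, dw(x)) * w(x),
                           -np.dot(v, dw(x)) * w(x)])

u0 = np.array([0.0, 0.0, 1.0])   # unit, orthogonal to w(0) = e_1
v0 = np.array([0.0, 1.0, 0.0])   # unit, orthogonal to u0 and w(0)
sol = solve_ivp(rhs, (0.0, 1.0), np.concatenate([u0, v0]),
                rtol=1e-10, atol=1e-12)
u1, v1 = sol.y[:3, -1], sol.y[3:, -1]

# the Gronwall argument: (23), (24), (25) are preserved along the flow
assert abs(np.dot(u1, u1) - 1) < 1e-6 and abs(np.dot(v1, v1) - 1) < 1e-6
assert abs(np.dot(u1, v1)) < 1e-6
assert abs(np.dot(u1, w(1.0))) < 1e-6 and abs(np.dot(v1, w(1.0))) < 1e-6
```

(In this toy case {u} happens to stay constant, while {v(x) = (-\sin x, \cos x, 0)} rotates with the frame.)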

This concludes Nash’s proof of Theorem 11 when {n \geq d+2}. Now suppose that {n=d+1}. In this case we cannot locate two unit vector fields {u,v} orthogonal to each other and to the entire gradient of {\Phi}; however, we may still obtain one such vector field {u: B \rightarrow {\bf R}^n} by repeating the above arguments. By Gram-Schmidt, we can then locate a smooth unit vector field {v: B \rightarrow {\bf R}^n} which is orthogonal to {u} and to {\partial_i \Phi} for {i=2,\dots,d}, but for which the quantity {c := 2\langle v, \partial_1 \Phi \rangle_{{\bf R}^n}} is positive. If we use the “Kuiper corrugation” ansatz

\displaystyle  \mathbf{\Psi}(x,y) = u(x) f(x, y_1) + v(x) g(x, y_1)

for some smooth functions {f,g: B \times {\bf R}/{\bf Z} \rightarrow {\bf R}}, one is reduced to locating such functions {f,g} that obey the bounds

\displaystyle  \partial_{y_1} f, \partial_{y_1} g = O( a )

and the ODE

\displaystyle  (\partial_{y_1} f)^2 + (\partial_{y_1} g)^2 + c \partial_{y_1} g = a^2.

This can be done by an explicit construction:

Exercise 22 (One-dimensional corrugation) For any positive {c>0} and any {a \geq 0}, show that there exist smooth functions {f,g: {\bf R}/{\bf Z} \rightarrow {\bf R}} solving the ODE

\displaystyle  (f')^2 + (g')^2 + c g' = a^2

and which vary smoothly with {a,c} (even at the endpoint {a=0}), and obey the bounds

\displaystyle  f', g' = O(a).

(Hint: one can renormalise {c=1}. The problem is basically to locate a periodic function {t \mapsto (X_a(t), Y_a(t))} mapping {{\bf R}/{\bf Z}} to the circle {\{ (x,y): x^2 + y^2 + y = a^2 \}} of mean zero and Lipschitz norm {O(a)} that varies smoothly with {a}. Choose {X_a(t) = a^2 X(t)} for some smooth and small {X} that is even and compactly supported in {(0,1/2) \cup (1/2,1)} with mean zero on each interval, and then choose {Y_a} to be odd.)

This exercise supplies the required functions {f,g: B \times {\bf R}/{\bf Z} \rightarrow {\bf R}}, completing Kuiper’s proof of Theorem 11 when {n \geq d+1}.

Remark 23 For the sake of discussion let us restrict attention to the surface case {d=2}. For the local isometric embedding problem, we have seen that we have rigidity at regularities at or above {C^2}, but lack of rigidity at {C^1}. The precise threshold at which rigidity occurs is not completely known at present: a result of Borisov (also reproven here) gives rigidity at the {C^{1,\alpha}} level for {\alpha > 2/3}, while a result of de Lellis, Inauen, and Szekelyhidi (building upon a series of previous results) establishes non-rigidity when {\alpha < 1/5}. For recent results in higher dimensions, see this paper of Cao and Szekelyhidi.

— 3. Low regularity weak solutions to Navier-Stokes in high dimensions —

We now turn to constructing solutions (or near-solutions) to the Euler and Navier-Stokes equations. For minor technical reasons it is convenient to work with solutions that are periodic in both space and time, and normalised to have zero mean at every time (although the latter restriction is not essential for our arguments, since one can always reduce to this case after a Galilean transformation as in 254A Notes 1). Accordingly, let {\Omega} denote the periodic spacetime

\displaystyle  \Omega := ({\bf R}/{\bf Z}) \times \mathbf{T}_E,

and let {X^\infty} denote the space of smooth periodic functions {u: \Omega \rightarrow {\bf R}^d} that have mean zero and are divergence-free at every time {t \in {\bf R}/{\bf Z}}, thus

\displaystyle  \int_{\mathbf{T}_E} u(t,x)\ dx = 0

and

\displaystyle  \partial_i u^i = 0.

We use {L^p_{t,x}} as an abbreviation for {L^p_t L^p_x(\Omega \rightarrow {\bf R}^m)} for various vector spaces {{\bf R}^m} (the choice of which will be clear from context).

Let {\nu \geq 0} (for now, our discussion will apply both to the Navier-Stokes equations {\nu>0} and the Euler equations {\nu=0}). Smooth solutions to the Navier-Stokes equations then take the form

\displaystyle  \partial_t u^i + \partial_j (u^i u^j) = \nu \Delta u^i - \eta^{ij} \partial_j p

for some {u \in X^\infty} and smooth {p: \Omega \rightarrow {\bf R}}. Here of course {\Delta = \eta^{ij} \partial_i \partial_j} denotes the spatial Laplacian.

Much as we replaced the equation (10) in the previous section with (11), we will consider the relaxed version

\displaystyle  \partial_t u^i + \partial_j( u^i u^j ) = \nu \Delta u^i + \partial_j R^{ij} - \eta^{ij} \partial_j p \ \ \ \ \ (26)

\displaystyle  \partial_i u^i = 0 \ \ \ \ \ (27)

\displaystyle  R^{ij} = R^{ji} \ \ \ \ \ (28)

of the Navier-Stokes equations, where we have now introduced an additional field {R: \Omega \rightarrow {\bf R}^{d^2}}, known as the Reynolds stress (cf. the Cauchy stress tensor from 254A Notes 0). If {u: \Omega \rightarrow {\bf R}^d}, {p: \Omega \rightarrow {\bf R}}, {R: \Omega \rightarrow {\bf R}^{d^2}} are smooth solutions to (26), (27), (28), with {u} having mean zero at every time, then we call {(u,p,R)} a Navier-Stokes-Reynolds flow (or Euler-Reynolds flow, if {\nu=0}). Note that if {R=0} then we recover a solution to the true Navier-Stokes equations. Thus, heuristically, the smaller {R} is, the closer {u} and {p} should become to a solution to the true Navier-Stokes equations. (The Reynolds stress tensor {R^{ij}} here is a rank {(2,0)} tensor, as opposed to the rank {(0,2)} tensor {R_{ij}} used in the previous section to measure the failure of isometric embedding, but this will not be a particularly significant distinction.)

Note that if {(u,p,R)} is a Navier-Stokes-Reynolds flow, and {v: \Omega \rightarrow {\bf R}^d}, {q: \Omega \rightarrow {\bf R}}, {S: \Omega \rightarrow {\bf R}^{d^2}} are smooth functions, then {(u+v, p+q, S)} will also be a Navier-Stokes-Reynolds flow if and only if {v} has mean zero at every time, and {(v,q,S)} obeys the difference equation

\displaystyle  (\partial_t - \nu \Delta) v^i + \partial_j( u^i v^j + u^j v^i + v^i v^j + R^{ij} + q \eta^{ij} - S^{ij} ) = 0 \ \ \ \ \ (29)

\displaystyle  \partial_i v^i = 0 \ \ \ \ \ (30)

\displaystyle  S^{ij} = S^{ji}. \ \ \ \ \ (31)

When this occurs, we say that {(v,q,S)} is a difference Navier-Stokes-Reynolds flow at {(u,p,R)}.
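The derivation of (29) from (26) is a routine expansion of the quadratic term {\partial_j((u^i+v^i)(u^j+v^j))}. As a sanity check, here is a symbolic verification in two spatial dimensions (the computation is identical in any dimension; the field names below are of course just placeholders):

```python
import sympy as sp

t, x1, x2, nu = sp.symbols('t x1 x2 nu')
X = (x1, x2)

def fields(name, n):
    return [sp.Function(f'{name}{i}')(t, x1, x2) for i in range(n)]

u, v = fields('u', 2), fields('v', 2)
p, q = sp.Function('p')(t, x1, x2), sp.Function('q')(t, x1, x2)
Rf = [[sp.Function(f'R{i}{j}')(t, x1, x2) for j in range(2)] for i in range(2)]
Sf = [[sp.Function(f'S{i}{j}')(t, x1, x2) for j in range(2)] for i in range(2)]

def residual(w, pr, T, i):
    # residual of (26): d_t w^i + d_j(w^i w^j) - nu Lap w^i - d_j T^{ij} + d_i pr
    expr = sp.diff(w[i], t) + sp.diff(pr, X[i])
    expr -= nu * sum(sp.diff(w[i], xj, 2) for xj in X)
    expr += sum(sp.diff(w[i] * w[j], X[j]) - sp.diff(T[i][j], X[j])
                for j in range(2))
    return expr

for i in range(2):
    uv = [u[k] + v[k] for k in range(2)]
    diff_eq = residual(uv, p + q, Sf, i) - residual(u, p, Rf, i)
    # left-hand side of (29)
    target = sp.diff(v[i], t) - nu * sum(sp.diff(v[i], xj, 2) for xj in X)
    target += sum(sp.diff(u[i]*v[j] + u[j]*v[i] + v[i]*v[j]
                          + Rf[i][j] + q * sp.eye(2)[i, j] - Sf[i][j], X[j])
                  for j in range(2))
    assert sp.expand(diff_eq - target) == 0
```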

It will thus be of interest to find, for a given {(u,p,R)}, difference Navier-Stokes-Reynolds flows {(v,q,S)} at {(u,p,R)} with {S} small, as one could hopefully iterate this procedure and take a limit to construct weak solutions to the true Euler equations. The main strategy here will be to choose a highly oscillating (and divergence-free) correction velocity field {v^i} such that {v^i v^j} approximates {-R^{ij} - q \eta^{ij}} up to an error which is also highly oscillating (and somewhat divergence-free). The effect of this error can then eventually be absorbed efficiently into the new Reynolds stress tensor {S^{ij}}. Of course, one also has to manage the other terms {\partial_t v^i}, {\nu \Delta v^i}, {u^i v^j}, {u^j v^i} appearing in (29). In high dimensions it turns out that these terms can be made very small in {L^1_{t,x}} norm, and can thus be easily disposed of. In three dimensions the situation is considerably more delicate, particularly with regards to the {\nu \Delta v^i} and {u^i v^j} terms; in particular, the transport term {u^i v^j} is best handled by using a local version of Lagrangian coordinates. We will discuss these subtleties in later sections.

To execute the above strategy, it will be convenient to have an even more flexible notion of solution, in which {v} is no longer required to be perfectly divergence-free and mean zero, and is also allowed to be slightly inaccurate in solving (29). We say that {(v,q,S,f,F)} is an approximate difference Navier-Stokes-Reynolds flow at {(u,p,R)} if {v, F: \Omega \rightarrow {\bf R}^d}, {q, f: \Omega \rightarrow {\bf R}}, {S: \Omega \rightarrow {\bf R}^{d^2}} are smooth functions obeying the system

\displaystyle  (\partial_t - \nu \Delta) v^i + \partial_j( u^i v^j + u^j v^i + v^i v^j + R^{ij} + q \eta^{ij} - S^{ij} ) = F^i \ \ \ \ \ (32)

\displaystyle  \partial_i v^i = f \ \ \ \ \ (33)

\displaystyle  S^{ij} = S^{ji}. \ \ \ \ \ (34)

If the error terms {f,F}, as well as the mean of {v}, are all small, one can correct an approximate difference Navier-Stokes-Reynolds flow {(v,q,S,f,F)} to a true difference Navier-Stokes-Reynolds flow {(v',q',S')} with only small adjustments:

Exercise 24 (Removing the error terms) Let {(u,p,R)} be a Navier-Stokes-Reynolds flow, and let {(v,q,S,f,F)} be an approximate difference Navier-Stokes-Reynolds flow at {(u,p,R)}. Show that {(v',q',S')} is a difference Navier-Stokes-Reynolds flow at {(u,p,R)}, where

\displaystyle  (v')^i := v^i - c^i - \Delta^{-1} \eta^{ij} \partial_j f

\displaystyle  q' := q + \Delta^{-1} \partial_j \tilde F^j

\displaystyle  (S')^{ij} = S^{ij} - u^i ((v')^j-v^j) - u^j ((v')^i-v^i)

\displaystyle  - ((v')^i - v^i) v^j - v^i ((v')^j-v^j) - ((v')^i-v^i) ((v')^j - v^j)

\displaystyle  + \Delta^{-1} \eta^{jl} \partial_l \tilde F^i + \Delta^{-1} \eta^{il} \partial_l \tilde F^j

\displaystyle  \tilde F^i := F^i - (\partial_t - \nu \Delta) (v^i - (v')^i)

and {c: {\bf R}/{\bf Z} \rightarrow {\bf R}^d} is the mean of {v}, thus

\displaystyle c^i(t) := \int_{\mathbf{T}_E} v^i(t,x)\ dx.

(Hint: one will need at some point to show that {\tilde F^i} has mean zero in space at every time; this can be achieved by integrating (32) in space.)

Because of this exercise we will be able to tolerate the error terms {f,F} if they (and the mean {c}) are sufficiently small.

As a simple corollary of Exercise 24, we have the following analogue of Proposition 13:

Proposition 25 Let {u \in X^\infty}. Then there exist smooth fields {p: \Omega \rightarrow {\bf R}}, {R: \Omega \rightarrow {\bf R}^{d^2}} such that {(u,p,R)} is a Navier-Stokes-Reynolds flow. Furthermore, if {u} is supported in {I \times \mathbf{T}} for some compact time interval {I}, then {p,R} can be chosen to also be supported in this region.

Proof: Clearly {(u, 0, 0, 0, F)} is an approximate difference Navier-Stokes-Reynolds flow at {(0,0,0)}, where

\displaystyle  F^i := (\partial_t - \nu \Delta) u^i + \partial_j (u^i u^j).

Applying Exercise 24, we can construct a difference Navier-Stokes-Reynolds flow {(u,p,R)} at {(0,0,0)}, which then verifies the claimed properties. \Box

Now, we show that, in sufficiently high dimension, a Navier-Stokes-Reynolds flow {(u, p, R)} can be approximated (in an {L^1} sense) as the limit of Navier-Stokes-Reynolds flows {(u^{(N)}, p^{(N)}, R^{(N)})}, with the Reynolds stress {R^{(N)}} going to zero.

Proposition 26 (Weak improvement of Navier-Stokes-Reynolds flows) Let {\varepsilon>0}, and let {d} be sufficiently large depending on {\varepsilon}. Let {U = (u,p,R)} be a Navier-Stokes-Reynolds flow. Then for sufficiently large {N}, there exists a Navier-Stokes-Reynolds flow {\tilde U = (\tilde u, \tilde p, \tilde R)} obeying the estimates

\displaystyle  \| \partial_t^j \nabla_x^k \tilde u \|_{L^2_{t,x}} \lesssim_{U,\varepsilon,d,j,k} N^{k} \ \ \ \ \ (35)

\displaystyle  \| \tilde R \|_{L^1_{t,x}} \lesssim_{U,\varepsilon,d} N^{-1+\varepsilon}

for all {j,k \geq 0}, and such that

\displaystyle  \|\tilde u - u \|_{L^2_{t,x}} \lesssim_{\varepsilon,d} \| R \|_{L^1_{t,x}}^{1/2}. \ \ \ \ \ (36)

\displaystyle  \|\tilde u - u \|_{L^1_{t,x}} \lesssim_{U,\varepsilon,d} N^{-10}.

Furthermore, if {U} is supported in {I \times \mathbf{T}_E} for some interval {I \subset {\bf R}/{\bf Z}}, then one can arrange for {\tilde U} to be supported on {I' \times \mathbf{T}_E} for any interval {I'} containing {I} in its interior (at the cost of allowing the implied constants in the above to depend also on {I,I'}).

This proposition can be viewed as an analogue of Theorem 14. For an application at the end of this section it is important that the implied constant in (36) is uniform in the choice of initial flow {U}. The estimate (35) can be viewed as asserting that the new velocity field {\tilde u} is oscillating at frequencies {O(N)}, at least in an {L^2_{t,x}} sense. In the next section, we obtain a stronger version of this proposition with more quantitative estimates that can be iterated to obtain higher regularity weak solutions.

To simplify the notation we adopt the following conventions. Given an {n}-dimensional vector {(D_1,\dots,D_n)} of differential operators, we use {(D_1,\dots,D_n)^m} to denote the {n^m}-tuple of differential operators {D_{i_1} \dots D_{i_m}} with {i_1,\dots,i_m \in \{1,\dots,n\}}. We use {(D_1,\dots,D_n)^{\leq m}} to denote the {\sum_{0 \leq m' \leq m} n^{m'}}-tuple formed by concatenating {(D_1,\dots,D_n)^{m'}} for {0 \leq m' \leq m}. Thus for instance the estimate (35) can be abbreviated as

\displaystyle  \| (\partial_t, N^{-1} \nabla_x)^{\leq m} \tilde u \|_{L^2_{t,x}} \lesssim_{U,\varepsilon,d,m} 1

for all {m \geq 0}. Informally, one should read the above estimate as asserting that {\tilde u} is bounded in {L^2_{t,x}} with norm {O_{U,\varepsilon,d}(1)}, and oscillates with frequency {O(1)} in time and {O(N)} in space (or equivalently, with a temporal wavelength of {\gtrsim 1} and a spatial wavelength of {\gtrsim 1/N}).

Proof: We can assume that {R} is non-zero, since if {R=0} we can just take {\tilde U = U}. We may assume that {U} is supported in {I \times \mathbf{T}_E} for some interval {I \subset {\bf R}/{\bf Z}} (which may be all of {{\bf R}/{\bf Z}}), and let {I'} be an interval containing {I} in its interior. To abbreviate notation, we allow all implied constants to depend on {\varepsilon,d,I,I'}.

Assume {N} is sufficiently large. Using the ansatz

\displaystyle  (\tilde u, \tilde p, \tilde R) = (u + v, p + q, \tilde R),

and the triangle inequality, it suffices to construct a difference Navier-Stokes-Reynolds flow {(v, q, \tilde R)} at {U} supported on {I' \times \mathbf{T}} and obeying the bounds

\displaystyle  \| (\partial_t, N^{-1} \nabla_x)^{\leq m} v \|_{L^2_{t,x}} \lesssim_{U,m} 1

\displaystyle  \| \tilde R \|_{L^1_{t,x}} \lesssim_{U} N^{-1+\varepsilon}

\displaystyle  \| v \|_{L^2_{t,x}} \lesssim \| R \|_{L^1_{t,x}}^{1/2}

\displaystyle  \| v \|_{L^1_{t,x}} \lesssim_{U} N^{-10}

for all {m \geq 0}.

It will in fact suffice to construct an approximate difference Navier-Stokes-Reynolds flow {(v, q, \tilde R, f, F)} at {U} supported on {I' \times \mathbf{T}} and obeying the bounds

\displaystyle  \| (\partial_t, N^{-1} \nabla_x)^{\leq m} v \|_{L^2_{t,x}} \lesssim_{U,m} 1 \ \ \ \ \ (37)

\displaystyle  \| \tilde R\|_{L^1_{t,x}} \lesssim_{U} N^{-1+\varepsilon} \ \ \ \ \ (38)

\displaystyle  \| (\partial_t, N^{-1} \nabla_x)^{\leq m} v \|_{L^1_{t,x}} \lesssim_{U,m} N^{-20} \ \ \ \ \ (39)

\displaystyle  \| (\partial_t, N^{-1} \nabla_x)^{\leq m} f \|_{L^2_{t,x}} \lesssim_{U,m} N^{-20} \ \ \ \ \ (40)

\displaystyle  \| F \|_{L^1_{t,x}} \lesssim_{U} N^{-20} \ \ \ \ \ (41)

\displaystyle  \| v \|_{L^2_{t,x}} \lesssim \| R \|_{L^1_{t,x}}^{1/2} \ \ \ \ \ (42)

for {m \geq 0}, since an application of Exercise 24 and some simple estimation will then give a difference Navier-Stokes-Reynolds flow {(v', q', R')} obeying the desired estimates (using in particular the fact that {\Delta^{-1} \nabla_x} is bounded on {L^1_{t,x}} and {L^2_{t,x}}, as can be seen from Littlewood-Paley theory; also note that (39) can be used to ensure that the mean of {v} is very small).

To construct this approximate solution, we again use the method of fast and slow variables. Set {N_1 := \lfloor N^{1-\varepsilon} \rfloor}, and introduce the fast-slow spacetime {\mathbf{\Omega} := \Omega \times \mathbf{T}_F = {\bf R}/{\bf Z} \times \mathbf{T}_E \times \mathbf{T}_F}, which we coordinatise as {(t,x,y)}; we use {\partial_{x^i}} to denote partial differentiation in the coordinates of the slow variable {x \in \mathbf{T}_E}, and {\partial_{y^i}} to denote partial differentiation in the coordinates of the fast variable {y \in \mathbf{T}_F}. We also use {L^p_{t,x,y}} as shorthand for {L^p_t L^p_x L^p_y(\mathbf{\Omega})}. Define an approximate fast-slow solution to the difference Navier-Stokes-Reynolds equation at {(u,p,R)} (at the frequency scale {N_1}) to be a tuple {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} of smooth functions {\mathbf{v}, \mathbf{F}: \mathbf{\Omega} \rightarrow {\bf R}^d}, {\mathbf{q}, \mathbf{f}: \mathbf{\Omega} \rightarrow {\bf R}}, {\mathbf{R}: \mathbf{\Omega} \rightarrow {\bf R}^{d^2}} that obey the system of equations

\displaystyle  (\partial_t - \nu \Delta_x - \nu N_1^2 \Delta_y) \mathbf{v}^i

\displaystyle  + (\partial_{x^j} + N_1 \partial_{y^j}) ( u^i \mathbf{v}^j + u^j \mathbf{v}^i + \mathbf{v}^i \mathbf{v}^j + R^{ij} + \mathbf{q} \eta^{ij} - \mathbf{R}^{ij} ) = \mathbf{F}^i \ \ \ \ \ (43)

\displaystyle  (\partial_{x^i} + N_1 \partial_{y^i}) \mathbf{v}^i = \mathbf{f} \ \ \ \ \ (44)

\displaystyle  \mathbf{R}^{ij} = \mathbf{R}^{ji}. \ \ \ \ \ (45)

Here we think of {u} as a “low-frequency” function (in the sense of Remark 20) that only depends on {t} and the slow variable {x}, but not on the fast variable {y}.

Let {D} denote the tuple {D := (\partial_t, \nabla_x, N^{-\varepsilon} \nabla_y)}. Suppose that for any sufficiently large {N}, we can construct an approximate fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} to the difference equations at {(u,p,R)} supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F} that obeys the bounds

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{L^2_{t,x,y}} \lesssim_{U,m} 1 \ \ \ \ \ (46)

\displaystyle  \| \mathbf{R} \|_{L^1_{t,x,y}} \lesssim_{U} N^{-1+\varepsilon} \ \ \ \ \ (47)

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{L^1_{t,x,y}} \lesssim_{U,m} N^{-20} \ \ \ \ \ (48)

\displaystyle  \| D^{\leq m} \mathbf{f} \|_{L^2_{t,x,y}} \lesssim_{U,m} N^{-20} \ \ \ \ \ (49)

\displaystyle  \| \mathbf{F} \|_{L^1_{t,x,y}} \lesssim_{U} N^{-20} \ \ \ \ \ (50)

\displaystyle  \| \mathbf{v} \|_{L^2_{t,x,y}} \lesssim \| R \|_{L^1_{t,x}}^{1/2} \ \ \ \ \ (51)

for all {m \geq 0}. (Informally, the presence of the derivatives {D} means that the fields involved are allowed to oscillate in time at wavelength {\gtrsim 1}, in the slow variable {x} at wavelength {\gtrsim 1}, and in the fast variable {y} at wavelength {\gtrsim N^{-\varepsilon}}.) From (46) and the choice of {N_1} we then have

\displaystyle  \| (\partial_t, N^{-1}(\nabla_x + N_1 \nabla_y))^{\leq m} \mathbf{v} \|_{L^2_{t,x,y}} \lesssim_{U,m} 1

for all {m \geq 0}, and similarly for (48), (49). (Note here that there was considerable room in the estimates with regards to regularity in the {x} variable; this room will be exploited more in the next section.) For any shift {\theta \in \mathbf{T}_F}, we see from the chain rule that {(v^\theta, q^\theta, R^\theta, f^\theta, F^\theta)} is an approximate difference Navier-Stokes-Reynolds flow at {U} supported on {I' \times \mathbf{T}_E}, where

\displaystyle  v^\theta(t,x) := \mathbf{v}( t, x, N_1 x + \theta)

\displaystyle  q^\theta(t,x) := \mathbf{q}( t, x, N_1 x + \theta)

\displaystyle  R^\theta(t,x) := \mathbf{R}( t, x, N_1 x + \theta)

\displaystyle  f^\theta(t,x) := \mathbf{f}( t, x, N_1 x + \theta)

\displaystyle  F^\theta(t,x) := \mathbf{F}( t, x, N_1 x + \theta).

Also from (46) and Fubini’s theorem we have

\displaystyle  \int_{\mathbf{T}_F} \| (\partial_t, N^{-1} \nabla_x)^{\leq m} v^\theta \|_{L^2_{t,x}}^2\ d\theta \lesssim_{U,m} 1 \ \ \ \ \ (52)

and similarly

\displaystyle  \int_{\mathbf{T}_F} \| R^\theta \|_{L^1_{t,x}}\ d\theta \lesssim_{U} N^{-1+\varepsilon}

\displaystyle  \int_{\mathbf{T}_F} \| (\partial_t, N^{-1} \nabla_x)^{\leq m} v^\theta \|_{L^1_{t,x}}\ d\theta \lesssim_{U,m} N^{-20}

\displaystyle  \int_{\mathbf{T}_F} \| (\partial_t, N^{-1} \nabla_x)^{\leq m} f^\theta \|_{L^2_{t,x}}^2\ d\theta \lesssim_{U,m} N^{-40}

\displaystyle  \int_{\mathbf{T}_F} \| F^\theta \|_{L^1_{t,x}}\ d\theta \lesssim_{U} N^{-20}

\displaystyle  \int_{\mathbf{T}_F} \| v^\theta \|_{L^2_{t,x}}^2\ d\theta \lesssim \| R \|_{L^1_{t,x}}

for all {m \geq 0}. By Markov’s inequality and (52), we see that for each {m}, we have

\displaystyle \| (\partial_t, N^{-1} \nabla_x)^{\leq m} v^\theta \|_{L^2_{t,x}} \lesssim_{U,m} 1

for all {\theta} outside of an exceptional set of measure (say) {2^{-m-10}}. Similarly for the other equations above. Applying the union bound, we can then find a {\theta} such that {(v^\theta, q^\theta, R^\theta, f^\theta, F^\theta)} obeys all the required bounds (37)-(42) simultaneously for all {m}. (This is an example of the probabilistic method, originally developed in combinatorics; one can think of {\theta} probabilistically as a shift drawn uniformly at random from the torus {\mathbf{T}_F}, in order to relate the fast-slow Lebesgue norms {L^p_{t,x,y}} to the original Lebesgue norms {L^p_{t,x}}.)
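As a toy illustration of this Markov-plus-union-bound argument, one can set up stand-in “norm” functions {g_m(\theta)} (the family below is made up; only nonnegativity and bounded means matter) and confirm numerically that the exceptional sets have small total measure, leaving many good choices of {\theta}:

```python
import numpy as np

theta = np.linspace(0, 1, 200_000, endpoint=False)

# made-up nonnegative "norm" functions g_m(theta), one per estimate;
# these stand in for the integrands in (52) and its companions
def g(m, th):
    return (1 + 0.1 * np.sin(2**m * np.pi * th)) / (2 * np.sqrt(th + 1e-12))

good = np.ones_like(theta, dtype=bool)
for m in range(20):
    vals = g(m, theta)
    # Markov: the set where g_m exceeds 2^{m+10} times its mean
    # has measure at most 2^{-m-10}
    bad = vals > 2 ** (m + 10) * vals.mean()
    assert bad.mean() <= 2 ** -(m + 10)
    good &= ~bad

# union bound: total bad measure <= sum_m 2^{-m-10} < 1,
# so a common good theta survives
assert good.mean() > 0.99
```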

It remains to construct an approximate fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F} with the required bounds (46)-(51). Actually, in this high-dimensional setting we can afford to simplify the situation here by removing some of the terms (and in particular eliminating the role of the reference velocity field {u}). Define a simplified fast-slow solution at {U} to be a tuple {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} of smooth functions on {\mathbf{\Omega}} obeying the simplified equations

\displaystyle  (\partial_{x^j} + N_1 \partial_{y^j}) ( \mathbf{v}^i \mathbf{v}^j + R^{ij} + \mathbf{q} \eta^{ij} - \mathbf{R}^{ij} ) = \mathbf{F}^i \ \ \ \ \ (53)

\displaystyle  (\partial_{x^i} + N_1 \partial_{y^i}) \mathbf{v}^i = \mathbf{f} \ \ \ \ \ (54)

\displaystyle  \mathbf{R}^{ij} = \mathbf{R}^{ji}. \ \ \ \ \ (55)

If we can find a simplified fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} of smooth functions on {\mathbf{\Omega}} supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F} obeying the bounds

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{L^2_{t,x,y}} \lesssim_{U,m} 1 \ \ \ \ \ (56)

\displaystyle  \| \mathbf{R} \|_{L^1_{t,x,y}} \lesssim_{U} N^{-1+\varepsilon} \ \ \ \ \ (57)

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{L^1_{t,x,y}} \lesssim_{U,m} N^{-30} \ \ \ \ \ (58)

\displaystyle  \| D^{\leq m} \mathbf{f} \|_{L^2_{t,x,y}} \lesssim_{U,m} N^{-30} \ \ \ \ \ (59)

\displaystyle  \| \mathbf{F} \|_{L^1_{t,x,y}} \lesssim_{U} N^{-30} \ \ \ \ \ (60)

\displaystyle  \| \mathbf{v} \|_{L^2_{t,x,y}} \lesssim \| R \|_{L^1_{t,x}}^{1/2} \ \ \ \ \ (61)

for all {m \geq 0}, then {(\mathbf{v}, \mathbf{q}, \mathbf{R}', \mathbf{f}, \mathbf{F}')} will be an approximate fast-slow solution supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F} obeying the required bounds (46)-(51), where

\displaystyle  (\mathbf{R}')^{ij} := \mathbf{R}^{ij} + u^i \mathbf{v}^j + u^j \mathbf{v}^i

\displaystyle  (\mathbf{F}')^i := \mathbf{F}^i + (\partial_t - \nu \Delta_x - \nu N_1^2 \Delta_y) \mathbf{v}^i.

Now we need to construct a simplified fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F} obeying the bounds (56)-(60). We do this in stages, first finding a solution that cancels off the highest order terms {N_1 \partial_{y^j}(\mathbf{v}^i \mathbf{v}^j)} and {N_1 \partial_{y^i} \mathbf{v}^i}, and also such that {\mathbf{v}^i \mathbf{v}^j + R^{ij} + q \delta^{ij}} has mean zero in the fast variable {y} (so that it is “high frequency” in the sense of Remark 20). This still leads to fairly large values of {\mathbf{F}} and {\mathbf{f}}, but we will then apply a “divergence corrector” to almost completely eliminate {\mathbf{f}}, followed by a “stress corrector” that almost completely eliminates {\mathbf{F}}, at which point we will be done.

We turn to the details. Our preliminary construction of the velocity field {\mathbf{v}} will be a “Mikado flow”, consisting of flows along narrow tubes. (Earlier literature used other flows, such as Beltrami flows; however, Mikado flows have the advantage of being localisable to small subsets of spacetime, which is particularly useful in high dimensions.) We need the following modification of Lemma 16:

Exercise 27 Let {S} be a compact subset of the space of positive definite {d \times d} matrices {A = (A^{ij})_{1 \leq i,j \leq d}}. Show that there exist non-zero lattice vectors {e_1,\dots,e_K \in {\bf Z}^d} and smooth functions {a_1,\dots,a_K: S \rightarrow {\bf R}} for some {K \geq 0} such that

\displaystyle  A^{ij} = \sum_{k=1}^K a_k(A)^2 (e_k)^i (e_k)^j \ \ \ \ \ (62)

for all {A \in S}. (This decomposition is essentially due to de Lellis and Szekelyhidi. The subscripting and superscripting here is reversed from that in Lemma 16; this is because we are now trying to decompose a rank {(2,0)} tensor rather than a rank {(0,2)} tensor.)
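The flavour of the decomposition (62) can be seen in a toy two-dimensional special case (illustrative only, not the actual de Lellis-Szekelyhidi construction): a symmetric matrix {A} with {A^{11}, A^{22} \geq |A^{12}|} can be decomposed over the four lattice vectors {(1,0), (0,1), (1,1), (1,-1)}:

```python
import numpy as np

def toy_decomposition(A):
    """Decompose a symmetric 2x2 matrix A with A[0,0], A[1,1] >= |A[0,1]|
    as sum_k c_k e_k e_k^T with c_k >= 0, over four fixed lattice vectors."""
    b = A[0, 1]
    es = [np.array([1, 0]), np.array([0, 1]),
          np.array([1, 1]), np.array([1, -1])]
    cs = [A[0, 0] - abs(b), A[1, 1] - abs(b), max(b, 0.0), max(-b, 0.0)]
    return es, cs

# random diagonally dominant (hence positive semi-definite) example
rng = np.random.default_rng(1)
b = rng.uniform(-1, 1)
A = np.array([[abs(b) + rng.uniform(0, 1), b],
              [b, abs(b) + rng.uniform(0, 1)]])

es, cs = toy_decomposition(A)
assert min(cs) >= 0                    # the coefficients c_k = a_k(A)^2
recon = sum(c * np.outer(e, e) for c, e in zip(cs, es))
assert np.allclose(recon, A)
```

Note that these coefficients involve absolute values and so do not depend smoothly on {A}; producing smooth {a_k} on a compact set {S}, as the exercise demands, requires more care (for instance, perturbing around finitely many base points and gluing with a partition of unity).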

We would like to apply this exercise to the matrix with entries {-R^{ij} - q \eta^{ij}}. We thus need to select the pressure {q} so that this matrix is positive definite. There are many choices available for this pressure; we will take

\displaystyle  q := -( \| R \|_{L^1_{t,x}}^2 + 100 d |R|^2)^{1/2}

where {|R|} is the Frobenius norm of {R}. Then {q} is smooth and {-R^{ij} - q \eta^{ij}} is positive definite on all of the compact spacetime {\Omega} (recall that we can assume {R} to not be identically zero), and in particular ranges in a compact subset of positive definite matrices. Applying the previous exercise and composing with the function {(t,x) \mapsto -R(t,x)}, we conclude that there exist non-zero lattice vectors {e_1,\dots,e_K \in {\bf Z}^d} and smooth functions {a_1,\dots,a_K: \Omega \rightarrow {\bf R}} for some {K \geq 0} such that

\displaystyle  - R^{ij}(t,x) - q(t,x) \delta^{ij} = \sum_{k=1}^K a_k(t,x)^2 e_k^i e_k^j \ \ \ \ \ (63)

for all {(t,x) \in \Omega}. As {K, e_k, a_k} depend only on {d,R}, and {R} is a component of {U}, all norms of these quantities are bounded by {O_{U}(1)}; they are independent of {N}. Furthermore, on taking traces and integrating on {\Omega}, we obtain the estimate

\displaystyle  \sum_{k=1}^K \| a_k \|_{L^2_{t,x}}^2 |e_k|^2 \lesssim \| R \|_{L^1_{t,x}} \ \ \ \ \ (64)

(note here that the implied constant is uniform in {K}, {U}). By applying a smooth cutoff in time that equals {1} on {I} and vanishes outside of {I'}, we may assume that the {a_k} are supported in {I' \times \mathbf{T}}.
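To see why this choice of {q} works, note that the operator norm of {R(t,x)} is at most its Frobenius norm {|R|}, while {|q| \geq (100 d |R|^2)^{1/2} = 10 \sqrt{d} |R|}; so every eigenvalue of {-R^{ij} - q \eta^{ij}} is at least {|q| - |R| > 0} where {R(t,x) \neq 0}, and equals {|q| > 0} where {R(t,x) = 0}. A quick numerical confirmation, with arbitrary sample values standing in for {R(t,x)} and {\|R\|_{L^1_{t,x}}}:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
R_l1 = 0.3   # stand-in for ||R||_{L^1_{t,x}} (positive since R is not identically zero)

for _ in range(100):
    # random symmetric "Reynolds stress" value R(t,x), sometimes zero
    M = rng.normal(size=(d, d)) * rng.choice([0.0, 1.0])
    R = (M + M.T) / 2
    frob = np.linalg.norm(R)                       # Frobenius norm |R|
    q = -np.sqrt(R_l1 ** 2 + 100 * d * frob ** 2)  # the pressure choice
    eigs = np.linalg.eigvalsh(-R - q * np.eye(d))
    # |q| >= 10 sqrt(d) |R| dominates the operator norm of R,
    # so -R - q*I is positive definite
    assert eigs.min() > 0
```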

Now for each {k=1,\dots,K}, the closed subgroup {\ell_k := \{ t e_k \hbox{ mod } {\bf Z}^d: t \in {\bf R}/{\bf Z} \}} is a one-dimensional subset of {\mathbf{T}_F}, so the {N^{-\varepsilon}}-neighbourhood of this subgroup has measure {O_{d,R}(N^{-\varepsilon(d-1)})}; crucially, this will be a large negative power of {N} when {d} is very large. Let {T_k \subset \mathbf{T}_F} be a translate of this {N^{-\varepsilon}}-neighbourhood such that all the {T_1,\dots,T_K} are disjoint; this is easily accomplished by the probabilistic method for {N} large enough, translating each of the {T_k} by an independent random shift and noting that the probability of a collision goes to zero as {N \rightarrow \infty} (here we need the fact that we are in at least three dimensions).

Let {M} be a large integer (depending on {\varepsilon}) to be chosen later. For any {k=1,\dots,K}, let {\mathbf{\psi}_k: \mathbf{T}_F \rightarrow {\bf R}} be a scalar test function supported on {T_k} that is constant in the {e_k} direction, thus

\displaystyle e_k^i \partial_{y^i} \mathbf{\psi}_k(y) = 0,

and is not identically zero, which implies that the iterated Laplacian {\Delta^M_y \mathbf{\psi}_k} of {\mathbf{\psi}_k} is also not identically zero (thanks to the unique continuation property of harmonic functions). We can normalise so that

\displaystyle  |e_k|^2 \| \Delta^M_y \mathbf{\psi}_k \|_{L^2_y} = 1

and we can also arrange to have the bounds

\displaystyle  |e_k|^2 \| (N^{-\varepsilon} \nabla_y)^{m} \mathbf{\psi}_k \|_{L^2_y} \lesssim_{m,M} N^{-2M\varepsilon}

for all {m \geq 0} (basically by first constructing a version of {\mathbf{\psi}_k} on a standard cylinder {B^{d-1}(0,1) \times {\bf R}} and then applying an affine transformation to map onto {T_k}).

Let {\mathbf{w}_k: \mathbf{T}_F \rightarrow {\bf R}^d} denote the function

\displaystyle  \mathbf{w}_k^i(y) := \Delta^M_y \mathbf{\psi}_k(y) e_k^i;

intuitively this represents the velocity field of a fluid traveling along the tube {T_k}, with the presence of the Laplacian {\Delta^M} ensuring that this function is extremely well balanced (for instance, it will have mean zero, and thus “high frequency” in the sense of Remark 20). Clearly {\mathbf{w}_k} is divergence free, and one also has the steady-state Euler equation

\displaystyle  \partial_{y_j}( \mathbf{w}_k^i \mathbf{w}_k^j ) = 0 \ \ \ \ \ (65)

and the normalisation

\displaystyle  \int_{{\mathbf T}_F} \mathbf{w}_k^i \mathbf{w}_k^j = e_k^i e_k^j \ \ \ \ \ (66)


as well as the bounds

\displaystyle  \| (N^{-\varepsilon} \nabla_y)^m (N^{-2\varepsilon} \Delta_y)^{-m'} \mathbf{w}_k \|_{L^2_y} \lesssim_{m,m'} 1

for all {m \geq 0} and {0 \leq m' \leq M}. If we then set

\displaystyle \mathbf{v}^i(t,x,y) := \sum_{k=1}^K a_k(t,x) \mathbf{w}_k^i(y)

\displaystyle \mathbf{F}^i(t,x,y) := \partial_{x^j}( \sum_{k=1}^K a_k(t,x)^2 (\mathbf{w}_k^i(y) \mathbf{w}_k^j(y) - e_k^i e_k^j) )

\displaystyle \mathbf{f}(t,x,y) := \sum_{k=1}^K \partial_{x^i} a_k(t,x) \mathbf{w}_k^i(y)

then one easily checks that {(\mathbf{v}, q, 0, \mathbf{f}, \mathbf{F})} is a simplified fast-slow solution supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F}. Direct calculation using the Leibniz rule then gives the bounds

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{L^2_{t,x,y}} \lesssim_{U,m} 1 \ \ \ \ \ (67)

\displaystyle  \| D^{\leq m} \mathbf{f} \|_{L^2_{t,x,y}} \lesssim_{U,m} 1 \ \ \ \ \ (68)

\displaystyle  \| \mathbf{F} \|_{L^1_{t,x,y}} \lesssim_{U} 1 \ \ \ \ \ (69)

for all {m \geq 0}, while from (64) one has

\displaystyle  \| \mathbf{v} \|_{L^2_{t,x,y}} \lesssim \| R \|_{L^1_{t,x}}^{1/2} \ \ \ \ \ (70)

(note here that the implied constant is uniform in {U}).

This looks worse than (56)–(60). However, observe that {\mathbf{v}} is supported on the set {\Omega \times \bigcup_{k=1}^K T_k}, which has measure {O_{U,d}( N^{-\varepsilon(d-1)})}, which for {d} large enough can be taken to be (say) {O_{U,d}(N^{-100})}. Thus by Cauchy-Schwarz one has

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{L^1_{t,x,y}} \lesssim_{U,m} N^{-30} \ \ \ \ \ (71)

for all {m \geq 0}. Also, from construction (particularly (66)) we see that {\mathbf{F}} is of mean zero in the {y} variable (thus it is “high frequency” in the sense of Remark 20).
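For the record, the divergence-free property of {\mathbf{w}_k}, the steady-state equation (65), and the mean-zero property of {\mathbf{F}} each follow in a line from the constancy of {\mathbf{\psi}_k} in the {e_k} direction (which is preserved by {\Delta_y}) together with (66):

\displaystyle  \partial_{y^i} \mathbf{w}_k^i = \Delta^M_y ( e_k^i \partial_{y^i} \mathbf{\psi}_k ) = 0

\displaystyle  \partial_{y^j} ( \mathbf{w}_k^i \mathbf{w}_k^j ) = 2 e_k^i (\Delta^M_y \mathbf{\psi}_k) \Delta^M_y ( e_k^j \partial_{y^j} \mathbf{\psi}_k ) = 0

\displaystyle  \int_{\mathbf{T}_F} \mathbf{F}^i\ dy = \partial_{x^j} \sum_{k=1}^K a_k(t,x)^2 \int_{\mathbf{T}_F} ( \mathbf{w}_k^i \mathbf{w}_k^j - e_k^i e_k^j )\ dy = 0.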

We are now a bit closer to (56)–(60), but our bounds on {\mathbf{f}, \mathbf{F}} are not yet strong enough. We now apply a “divergence corrector” to make {\mathbf{f}} much smaller. Observe from construction that {\mathbf{f} = \Delta_y^M \mathbf{g}} where

\displaystyle \mathbf{g}(t,x,y) := \sum_{k=1}^K \partial_{x^i} a_k(t,x) \mathbf{\psi}_k(y) e_k^i

and {\mathbf{g}} is supported on {\Omega \times \bigcup_{k=1}^K T_k} and obeys the estimates

\displaystyle  \| D^{\leq m} \mathbf{g} \|_{L^2_{t,x,y}} \lesssim_{U,m} N^{-2M \varepsilon} \ \ \ \ \ (72)

for all {m \geq 0}. Observe that

\displaystyle  \mathbf{f} = (\partial_{x^i} + N_1 \partial_{y^i}) ( N_1^{-1} \eta_{ij} \Delta_y^{-1} \partial_{y^j} \mathbf{f} ) - N_1^{-1} \Delta_y^{-1} \eta_{ij} \partial_{x^i} \partial_{y^j} \mathbf{f}.
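In the simplest setting (one slow and one fast variable on the 2-torus, with {\eta} taken to be the identity so that the contraction is a plain product — a simplification of mine for illustration), this identity can be confirmed on the Fourier side:

```python
import numpy as np

# Spectral check of the one-step divergence-corrector identity
#   f = (d_x + N1 d_y)(N1^{-1} Lap_y^{-1} d_y f) - N1^{-1} Lap_y^{-1} d_x d_y f
# for f(x,y) on the 2-torus with mean zero in y (eta = identity, a toy case).
n, N1 = 32, 100.0
grid = 2 * np.pi * np.arange(n) / n
X, Y = np.meshgrid(grid, grid, indexing="ij")
f = np.sin(3 * X + 2 * Y) + np.cos(X) * np.sin(5 * Y)  # mean zero in y

kx = np.fft.fftfreq(n, d=1.0 / n)[:, None]   # integer wavenumbers in x
ky = np.fft.fftfreq(n, d=1.0 / n)[None, :]   # integer wavenumbers in y
fh = np.fft.fft2(f)

ky_safe = np.where(ky == 0, 1.0, ky)
inv_lap_y = np.where(ky == 0, 0.0, -1.0 / ky_safe ** 2)  # Lap_y^{-1} on mean-zero modes

g = inv_lap_y * (1j * ky) * fh / N1              # N1^{-1} Lap_y^{-1} d_y f
term1 = (1j * kx + N1 * 1j * ky) * g             # (d_x + N1 d_y) applied to g
term2 = -inv_lap_y * (1j * kx) * (1j * ky) * fh / N1
err = np.abs(np.fft.ifft2(term1 + term2) - f).max()
print(err)
```

On each Fourier mode with {k_y \neq 0} the two terms sum exactly to the original coefficient, so the error is at the level of floating-point roundoff.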

We abbreviate the differential operator {\eta_{ij} \partial_{x^i} \partial_{y^j}} as {\nabla_x \cdot \nabla_y}. Iterating the above identity {M} times, we obtain

\displaystyle  \mathbf{f} = (\partial_{x^i} + N_1 \partial_{y^i}) \mathbf{d}^i + \mathbf{f}'


where

\displaystyle  \mathbf{d}^i := \sum_{m=1}^{M} (-1)^{m-1} N_1^{-m} \Delta^{M-m}_y \eta_{ij} \partial_{y^j} (\nabla_x \cdot \nabla_y)^m \mathbf{g}


and

\displaystyle  \mathbf{f}' := (-1)^M N_1^{-M} (\nabla_x \cdot \nabla_y)^M \mathbf{g}.

In particular, {\mathbf{d}^i} is supported in {\Omega \times \bigcup_{k=1}^K T_k}. Observe that {(\mathbf{v}', 0, \mathbf{R}, \mathbf{f}', \mathbf{F})} is a simplified fast-slow solution supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F}, where

\displaystyle  (\mathbf{v}')^i := \mathbf{v}^i - \mathbf{d}^i

\displaystyle  \mathbf{R}^{ij} := - \mathbf{d}^i \mathbf{v}^j - \mathbf{v}^i \mathbf{d}^j - \mathbf{d}^i \mathbf{d}^j.

From (72) we have

\displaystyle  \| D^{\leq m} \mathbf{f}' \|_{L^2_{t,x,y}} \lesssim_{U,m} N_1^{-M} N^{-\varepsilon M}

so in particular for {M} large enough

\displaystyle  \| D^{\leq m} \mathbf{f}' \|_{L^2_{t,x,y}} \lesssim_{U,m} N^{-30}

for any {m \geq 0}. Meanwhile, another appeal to (72) yields

\displaystyle  \| D^{\leq m} \mathbf{d} \|_{L^2_{t,x,y}} \lesssim_{U,m} \sum_{m'=1}^M N_1^{-m'} N^{\varepsilon (-2m'+1)} \lesssim_U N^{-1} \ \ \ \ \ (73)

for any {m \geq 0}, and hence by (67) and the triangle inequality

\displaystyle  \| D^{\leq m} \mathbf{v}' \|_{L^2_{t,x,y}} \lesssim_{U,m} 1.

Similarly one has

\displaystyle  \| \mathbf{v}' \|_{L^2_{t,x,y}} \lesssim \| R \|_{L^1_{t,x}}^{1/2}.

Since {\mathbf{v}'} continues to be supported on the thin set {\Omega \times \bigcup_{k=1}^K T_k}, we can apply Hölder as before to conclude that

\displaystyle  \| D^{\leq m} \mathbf{v}' \|_{L^1_{t,x,y}} \lesssim_{U,m} N^{-30}

for any {m \geq 0}. Also, from (73) and Hölder we have

\displaystyle  \| \mathbf{R} \|_{L^1_{t,x,y}} \lesssim_{U} N^{-1}.

We have now achieved the bound (59); the remaining estimate that needs to be corrected for is (60). This we can do by a modification of the previous argument, where we now work to reduce the size of {\mathbf{F}} rather than {\mathbf{f}}. Observe that as {\mathbf{F}} is “high frequency” (mean zero in the {y} variable), one can write

\displaystyle  \mathbf{F}^i = (\partial_{x^j} + N_1 \partial_{y^j}) ( N_1^{-1} \Delta_y^{-1} \eta^{il} \partial_{y^l} \mathbf{F}^j + N_1^{-1} \Delta_y^{-1} \eta^{jl} \partial_{y^l} \mathbf{F}^i )

\displaystyle  - \partial_{y^i} ( \Delta_y^{-1} \partial_{y^j} \mathbf{F}^j )

\displaystyle  + N_1^{-1} (T \mathbf{F})^i

where {T} is the linear operator on smooth vector-valued functions on {\mathbf{\Omega}} of mean zero defined by the formula

\displaystyle  (T \mathbf{F})^i := -\eta^{il} \Delta_y^{-1} \partial_{x^j} \partial_{y^l} \mathbf{F}^j - \eta^{jl} \Delta_y^{-1} \partial_{x^j} \partial_{y^l} \mathbf{F}^i.

Note that {T\mathbf{F}} also has mean zero. We can thus iterate and obtain

\displaystyle  \mathbf{F}^i = (\partial_{x^j} + N_1 \partial_{y^j}) \mathbf{S}^{ij} - \partial_{y^i} \mathbf{q} + (\mathbf{F}')^i


where

\displaystyle  \mathbf{S}^{ij} := \sum_{m=1}^{M} N_1^{-m} ( \eta^{il} \Delta_y^{-1} \partial_{y^l} T^{m-1} \mathbf{F}^j + \eta^{jl} \Delta_y^{-1} \partial_{y^l} T^{m-1} \mathbf{F}^i )

and

\displaystyle  \mathbf{F}' := N_1^{-M} T^M \mathbf{F}

and {\mathbf{q}} is a smooth function whose exact form is explicit but irrelevant for our argument. We then see that {(\mathbf{v}', \mathbf{q}, \mathbf{R} + \mathbf{S}, \mathbf{f}', \mathbf{F}')} is a simplified fast-slow solution supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F}. Since {\Delta_y^{-1} \partial_{y^i}} is bounded on {L^1_y}, we see from (69) that

\displaystyle  \| \mathbf{S} \|_{L^1_{t,x,y}} \lesssim_{U,M} \sum_{m=1}^M N_1^{-m} \lesssim N^{-1+\varepsilon},


and similarly

\displaystyle  \| \mathbf{F}' \|_{L^1_{t,x,y}} \lesssim_{U,M} N_1^{-M} \lesssim N^{-30}

if {M} is large enough. Thus {(\mathbf{v}', \mathbf{q}, \mathbf{R} + \mathbf{S}, \mathbf{f}', \mathbf{F}')} obeys the required bounds (56)–(60), concluding the proof. \Box

As an application of this proposition we construct a low-regularity weak solution to high-dimensional Navier-Stokes that does not obey energy conservation. More precisely, for any {s \geq 0}, let {X^s} be the Banach space of periodic functions {u \in C^0_t H^s_x(\Omega \rightarrow {\bf R}^d)} which are divergence free, and of mean zero at every time. For {\nu \geq 0}, define a time-periodic weak {H^s} solution {u} of the Navier-Stokes (or Euler, if {\nu=0}) equations to be a function {u \in X^s} that solves the equation

\displaystyle  \partial_t u + \partial_j {\mathbb P} (u^j u) = \nu \Delta u

in the sense of distributions. (Note that one may easily define {{\mathbb P}} on {L^1_{t,x}} functions in a distributional sense, basically because the adjoint operator {{\mathbb P}^*} maps test functions to bounded functions.)

Corollary 28 (Low regularity non-trivial weak solutions) Assume that the dimension {d} is sufficiently large. Then for any {\nu \geq 0}, there exists a periodic weak {L^2} solution {u} to Navier-Stokes which equals zero at time {t=0}, but is not identically zero. In particular, periodic weak {L^2} solutions are not uniquely determined by their initial data, and do not necessarily obey the energy inequality

\displaystyle  \frac{1}{2} \int_{\mathbf{T}} |u(t,x)|^2\ dx \leq \frac{1}{2} \int_{\mathbf{T}} |u_0(x)|^2\ dx. \ \ \ \ \ (74)

Proof: Let {u^{(0)}} be an element of {X^0} that is supported on {[0.4, 0.6] \times \mathbf{T}_E} and is not identically zero (it is easy to show that such an element exists). By Proposition 25, we may then find a Navier-Stokes-Reynolds flow {(u^{(0)}, p^{(0)}, R^{(0)})} also supported on {[0.4, 0.6] \times \mathbf{T}_E}. Let {N_0} be sufficiently large. By applying Proposition 26 repeatedly (with say {\varepsilon=1/2}) and with a sufficiently rapidly increasing sequence {N_0 < N_1 < \dots}, we can find a sequence {(u^{(n)}, p^{(n)}, R^{(n)})} of Navier-Stokes-Reynolds flows supported on (say) {[0.3 + 2^{-n}/100, 0.7 - 2^{-n}/100] \times \mathbf{T}_E} obeying the bounds

\displaystyle  \| R^{(n+1)} \|_{L^1_{t,x}} \lesssim N_n^{-1/4}

\displaystyle  \|u^{(n+1)} - u^{(n)} \|_{L^2_{t,x}} \lesssim_d \| R^{(n)} \|_{L^1_{t,x}}^{1/2}

\displaystyle  \|u^{(n+1)} - u^{(n)} \|_{L^1_{t,x}} \lesssim N_n^{-9}

(say) for {n \geq 0}. For {N_n} sufficiently rapidly growing, this implies that {R^{(n)}} converges strongly in {L^1} to zero, while {u^{(n)}} converges strongly in {L^2} to some limit {u \in X^0} supported in {[0.3, 0.7] \times \mathbf{T}_E}. From the triangle inequality we have

\displaystyle  \|u - u^{(0)} \|_{L^1_{t,x}} \lesssim N_0^{-9}

(if {N_n} is sufficiently rapidly growing) and hence {u} is not identically zero if {N_0} is chosen large enough. Applying Leray projections to the Navier-Stokes-Reynolds equation we have

\displaystyle  \partial_t u^{(n)} + \partial_j {\mathbb P} ((u^{(n)})^j u^{(n)} ) = \partial_j {\mathbb P} (R^{(n)})^{\cdot j} + \nu \Delta u^{(n)}

in the sense of distributions (where {(R^{(n)})^{\cdot j}} is the vector field with components {(R^{(n)})^{ij}} for {i=1,\dots,d}); taking distributional limits as {n \rightarrow \infty}, we conclude that {u} is a periodic weak {L^2} solution to the Navier-Stokes equations, as required. \Box
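The role of the rapid growth of the {N_n} can be illustrated with some toy bookkeeping (the concrete schedule {N_{n+1} = N_n^2} below is an arbitrary choice of mine; the point is only the summability):

```python
# Numerical bookkeeping for the iteration just performed.  With the schedule
# N_{n+1} = N_n^2 standing in for "sufficiently rapidly increasing", the
# stress bound ||R^(n+1)||_{L^1} <~ N_n^{-1/4} decays doubly exponentially,
# so the L^2 increments ||u^(n+1) - u^(n)|| <~ ||R^(n)||^{1/2} form a
# convergent series: the velocities converge strongly in L^2 while the
# stresses tend to zero in L^1.
steps = 6
N = [1e2]
for _ in range(steps):
    N.append(N[-1] ** 2)               # N_{n+1} = N_n^2 (illustrative choice)

R = [1.0]                              # normalised ||R^(0)||_{L^1}
for n in range(steps):
    R.append(N[n] ** (-0.25))          # ||R^(n+1)||_{L^1} <~ N_n^{-1/4}

l2_increments = [r ** 0.5 for r in R]  # ||u^(n+1) - u^(n)||_{L^2} <~ ||R^(n)||^{1/2}
print(sum(l2_increments), R[-1])
```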

— 4. High regularity weak solutions to Navier-Stokes in high dimensions —

Now we refine the above arguments to give a higher regularity version of Corollary 28, in which we can give the weak solutions almost half a derivative of regularity in the Sobolev scale:

Theorem 29 (Non-trivial weak solutions) Let {0 < s < 1/2}, and assume that the dimension {d} is sufficiently large depending on {s}. Then for any {\nu \geq 0}, there exists a periodic weak {H^s} solution {u} which equals zero at time {t=0}, but is not identically zero. In particular, periodic weak {H^s} solutions are not uniquely determined by their initial data, and do not necessarily obey the energy inequality (74).

This result is inspired by a three-dimensional result of Buckmaster and Vicol (with a small value of {s>0}) and a higher dimensional result of Luo (taking {\alpha = 1/200}, and restricting attention to time-independent solutions). In high dimensions one can create fairly regular functions which are large in {L^2} type norms but tiny in {L^1} type norms; when using the Sobolev scale {H^s} to control the solution {u} (and {L^1} type norms to measure an associated stress tensor), this has the effect of allowing one to treat as negligible the linear terms {\partial_t u - \nu \Delta u} in (variants of) the Navier-Stokes equation, as well as interaction terms between low and high frequencies. As such, the analysis here is simpler than that required to establish the Onsager conjecture. The construction used to prove this theorem shows in fact that periodic weak {H^s} solutions are in some sense “dense” in {X^s}, but we will not attempt to quantify this fact here.

In the proof of Corollary 28, we took the frequency scales {N_n} to be extremely rapidly growing in {n}. This will no longer be good enough for proving Theorem 29, and in fact we need to take a fairly dense set of frequency scales in which {N_{n+1} = N_n^{1+\varepsilon}} for a small {\varepsilon}. In order to do so, we have to replace Proposition 26 with a more quantitative version in which the dependence of bounds on the size of the original Navier-Stokes-Reynolds flow {U} is made much more explicit.

We turn to the details. We select the following parameters:

  • A regularity {0 < s < 1/2};
  • A quantity {\varepsilon>0}, assumed to be sufficiently small depending on {s};
  • An integer {M \geq 1}, assumed to be sufficiently large depending on {s,\varepsilon}; and
  • A dimension {d}, assumed to be sufficiently large depending on {s,\varepsilon,M}.

Then we let {\nu\geq 0}. To simplify the notation we allow all implied constants to depend on {s,\varepsilon,M,d,\nu} unless otherwise specified. We recall from the previous section the notion of a Navier-Stokes-Reynolds flow {(u,p,R)}. The basic strategy is to start with a Navier-Stokes-Reynolds flow {(u,p,R)} and repeatedly adjust {u} by increasingly high frequency corrections in order to significantly reduce the size of the stress {R} (at the cost of making both of these expressions higher in frequency).

As before, we abbreviate {L^p_t L^p_x(\Omega \rightarrow {\bf R}^m)} as {L^p_{t,x}}. We write {\nabla_x} for the spatial gradient to distinguish it from the time derivative {\partial_t}.

The main iterative statement (analogous to Theorem 14) starts with a Navier-Stokes-Reynolds flow {(u_0,p_0,R_0)} oscillating at spatial scales up to some small wavelength {1/N_0}, and modifies it to a Navier-Stokes-Reynolds flow {(u_1,p_1,R_1)} oscillating at a slightly smaller wavelength {\sim 1/N_0^{1+\varepsilon}}, with a smaller Reynolds stress. It can be viewed as a more quantitative version of Proposition 26.

Theorem 30 (Iterative step) Let {N_0} be sufficiently large depending on the parameters {s,\varepsilon,M,d,\nu}. Set {s' := s + 10\varepsilon}. Suppose that one has a Navier-Stokes-Reynolds flow {(u^{(0)},p^{(0)},R^{(0)})} obeying the estimates

\displaystyle  \| (N_0^{-\varepsilon} \partial_t, N_0^{-1} \nabla_x)^{\leq M} \nabla_x u^{(0)} \|_{L^2_{t,x}} \leq A N_0^{1-s'} \ \ \ \ \ (75)

\displaystyle  \| R^{(0)} \|_{L^1_{t,x}} \leq A^2 N_0^{-2(1+\varepsilon)s' - 10\varepsilon^2}. \ \ \ \ \ (76)

for some {A \geq 1}. Set {N_1 := N_0^{1+\varepsilon}}. Then there exists a Navier-Stokes-Reynolds flow {(u^{(1)},p^{(1)},R^{(1)})} obeying the estimates

\displaystyle  \| (N_1^{-\varepsilon} \partial_t, N_1^{-1} \nabla_x)^{\leq M} \nabla_x u^{(1)} \|_{L^2_{t,x}} \leq A N_1^{1-s'}

\displaystyle  \| R^{(1)} \|_{L^1_{t,x}} \leq A^2 N_1^{-2(1+\varepsilon)s' - 10\varepsilon^2}

\displaystyle  \| (N_1^{-\varepsilon} \partial_t)^{\leq 1} (u^{(1)} - u^{(0)}) \|_{L^2_{t,x}} \lesssim A N_1^{-s'}

\displaystyle  \| u^{(1)} - u^{(0)} \|_{L^1_{t,x}} \lesssim A N_1^{-10}.

Furthermore, if {(u^{(0)},p^{(0)},R^{(0)})} is supported on {I \times \mathbf{T}_E} for some interval {I}, then one can ensure that {(u^{(1)}, p^{(1)},R^{(1)})} is supported in {I' \times \mathbf{T}_E}, where {I'} is the {N_0^{-\varepsilon^2}}-neighbourhood of {I}.

Let us assume Theorem 30 for the moment and establish Theorem 29. Let {u^{(0)} \in X^\infty} be chosen to be supported on (say) {[0.4,0.6] \times \mathbf{T}_E} and not be identically zero. By Proposition 25, we can then find a Navier-Stokes-Reynolds flow {U = (u^{(0)},p^{(0)},R^{(0)})} supported on {[0.4,0.6] \times \mathbf{T}_E}. Let {N_0 > 1} be a sufficiently large parameter, and set {A := N_0^{(1+\varepsilon)s' + 10 \varepsilon^2}}; then the hypotheses (75), (76) will be obeyed for {N_0} large enough. Set {N_{k+1} := N_k^{1+\varepsilon}} for all {k \geq 0}. By iteratively applying Theorem 30, we may find a sequence {(u^{(k)}, p^{(k)}, R^{(k)})} of Navier-Stokes-Reynolds flows, all supported on (say) {[0.3, 0.7] \times \mathbf{T}_E}, obeying the bounds

\displaystyle  \| (N_{k+1}^{-\varepsilon} \partial_t, N_{k+1}^{-1} \nabla_x)^{\leq M} \nabla_x u^{(k+1)} \|_{L^2_{t,x}} \leq A N_{k+1}^{1-s'}

\displaystyle  \| R^{(k+1)} \|_{L^1_{t,x}} \leq A^2 N_{k+1}^{-2(1+\varepsilon)s' - 10\varepsilon^2}

\displaystyle  \| (N_{k+1}^{-\varepsilon} \partial_t)^{\leq 1} (u^{(k+1)} - u^{(k)}) \|_{L^2_{t,x}} \lesssim A N_{k+1}^{-s'}

\displaystyle  \| u^{(k+1)} - u^{(k)} \|_{L^1_{t,x}} \lesssim A N_k^{-10}.

for {k \ge 0}. In particular, the {R^{(k)}} converge strongly to zero in {L^1_{t,x}}, and we have the bound

\displaystyle  \| u^{(k+1)} - u^{(k)}\|_{H^1_t H^s_x} \lesssim A N_{k+1}^{s-s'+\varepsilon}

from Plancherel’s theorem, and hence by Sobolev embedding in time

\displaystyle  \| u^{(k+1)} - u^{(k)}\|_{C^0_t H^s_x} \lesssim A N_{k+1}^{s-s'+\varepsilon}.

Thus {u^{(k)}} converges strongly in {C^0_t H^s_x} (and in particular also in {C^0_t L^p_x} for some {p>2}) to some limit {u^{(\infty)}}; as the {u^{(k)}} are all divergence-free, {u^{(\infty)}} is also. From applying Leray projections to (26) one has

\displaystyle  \partial_t u^{(k)} + \partial_j {\mathbb P} ( (u^{(k)})^j u^{(k)} ) = \nu \Delta u^{(k)} + \partial_j {\mathbb P} (R^{(k)})^{\cdot j}.

Taking weak limits we conclude that {u^{(\infty)}} is a weak solution to Navier-Stokes. Also, from construction one has

\displaystyle  \| u^{(0)} - u^{(\infty)} \|_{L^1_{t,x}} \lesssim A N_1^{-10} \lesssim N_0^{-1}

(say), and so for {N_0} large enough {u^{(\infty)}} is not identically zero. This proves Theorem 29.
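By contrast with the previous corollary, the frequency schedule here is much denser; the following sketch (all numerical parameters are illustrative choices of mine) checks that the increments nevertheless remain summable:

```python
import math

# Bookkeeping for the denser frequency schedule N_{k+1} = N_k^{1+eps} of this
# section.  With s' = s + 10*eps, the C^0_t H^s_x increments are
# ~ A N_{k+1}^{s-s'+eps} = A N_{k+1}^{-9 eps}; since log N_k grows
# geometrically, these still form a convergent series, giving strong
# convergence in C^0_t H^s_x.
eps, s, A = 0.01, 0.4, 1.0
sp = s + 10 * eps                        # s' = s + 10*eps
logN = [math.log(1e3)]                   # store log N_k to avoid overflow
for _ in range(2000):
    logN.append((1 + eps) * logN[-1])    # N_{k+1} = N_k^{1+eps}

increments = [A * math.exp((s - sp + eps) * L) for L in logN[1:]]
print(sum(increments), increments[0])
```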

It remains to establish Theorem 30. It will be convenient to introduce the intermediate frequency scales

\displaystyle  N_0 \leq \tilde N_0 \leq N'_1 \leq N_1


where

\displaystyle  \tilde N_0 := N_0^{1+\varepsilon^2}

is slightly larger than {N_0}, and

\displaystyle  N'_1 := \lfloor N_1^{1-\varepsilon^2} \rfloor

is slightly smaller than {N_1} (and constrained to be integer).

Before we begin the rigorous argument, we first give a heuristic explanation of the numerology. The initial solution {u^{(0)}} has about {M} degrees of regularity controlled at {N_0}. For technical reasons we will upgrade this to an infinite amount of regularity, at the cost of worsening the frequency bound slightly from {N_0} to {\tilde N_0}. Next, to cancel the Reynolds stress {R^{(0)}} up to a smaller error {R^{(1)}}, we will perturb {u^{(0)}} by some high frequency correction {v}, basically oscillating at spatial frequency {N_1} (and temporal frequency {N_1^\varepsilon}), so that {v^i v^j} is approximately equal to {-(R^{(0)})^{ij}} (minus a pressure term) after averaging at spatial scales {1/N'_1}. Given the size bound (76), one expects to achieve this with {v} of {L^2_{t,x}} norm about {A N_0^{-(1+\varepsilon)s' - 5 \varepsilon^2} = A N_1^{-s'} N_0^{-5\varepsilon^2}}. By exploiting the small gap between {N'_1} and {N_1}, we can make {v} concentrate on a fairly small measure set (of density something like {N_1^{-\varepsilon^2(d-1)}}), which in high dimension allows us to make linear terms such as {\partial_t v} and {\nu \Delta v} (as well as the correlation terms {\partial_j ((u^{(0)})^i v^j)} and {\partial_j (v^i (u^{(0)})^j)}) negligible in size (as measured using {L^1} type norms) when compared against quadratic terms such as {\partial_j (v^i v^j)} (cf. the proof of Proposition 26). The defect {\partial_j (v^i v^j) - \partial_j R^{ij}} will then oscillate at frequency {N'_1}, but can be selected to be of size about {A^2 \tilde N_0 (N_1^{-s'} N_0^{-5\varepsilon^2})^2} in {L^1_{t,x}} norm, because we can choose {v} to cancel off all the high-frequency (by which we mean {N'_1} or greater) contributions to this term, leaving only low frequency contributions (at frequencies {\tilde N_0} or below). Using the ellipticity of the Laplacian, we can then express this defect as {\partial_j \tilde R^{ij}} where the {L^1_{t,x}} norm of {\tilde R^{ij}} is of order

\displaystyle  A^2 (N'_1)^{-1} \tilde N_0 (N_1^{-s'} N_0^{-5\varepsilon^2})^2 = A^2 N_1^{-2s' - \varepsilon + O(\varepsilon^2)}.

When {s<1/2}, this is slightly less than {A^2 N_1^{-2(1+\varepsilon)s' - 10\varepsilon^2} }, allowing one to close the argument.
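The final comparison of exponents can be checked mechanically (the constant {10} standing in for the {O(\varepsilon^2)} term is an arbitrary choice of mine):

```python
# Exponent bookkeeping for the heuristic above.  The new stress is of size
# ~ A^2 N_1^{-2s' - eps + O(eps^2)} and must beat the target
# A^2 N_1^{-2(1+eps)s' - 10 eps^2}.  The constant c = 10 standing in for the
# O(eps^2) term is an arbitrary illustrative choice.
def margin(s, eps, c=10.0):
    sp = s + 10 * eps                          # s' = s + 10*eps
    achieved = -2 * sp - eps + c * eps ** 2    # exponent of the new stress
    target = -2 * (1 + eps) * sp - 10 * eps ** 2
    return target - achieved                   # > 0 iff the new stress is smaller

print(margin(0.45, 0.001))   # positive: the argument closes for s < 1/2
print(margin(0.50, 0.001))   # negative: the gain disappears at s = 1/2
```

Symbolically the margin is {\varepsilon(1-2s') - O(\varepsilon^2)}, which is positive precisely in the regime {s' < 1/2} with {\varepsilon} small.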

We now turn to the rigorous details. In a later part of the argument we will encounter a loss of derivatives, in that the new Navier-Stokes-Reynolds flow {(u^{(1)},p^{(1)},R^{(1)})} has lower amounts of controlled regularity (in both space and time) than the Navier-Stokes-Reynolds flow {(u^{(0)},p^{(0)},R^{(0)})} used to construct it. To counteract this loss of derivatives we need to perform an initial mollification step, which improves the amount of regularity from {M} derivatives in space and one in time to an unlimited number of derivatives in space and two derivatives in time, at the cost of worsening the estimates on {(u^{(0)},p^{(0)},R^{(0)})} slightly (basically by replacing {N_0} with {\tilde N_0}).

Proposition 31 (Mollification) Let the notation and hypotheses be as above. Then we can find a Navier-Stokes-Reynolds flow {(\tilde u^{(0)}, \tilde p^{(0)}, \tilde R^{(0)})} obeying the estimates

\displaystyle  \| (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla_x)^m \nabla_x \tilde u^{(0)} \|_{L^2_{t,x}} \lesssim_m A \tilde N_0^{1-s'} \ \ \ \ \ (77)


\displaystyle  \| (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla_x)^m \tilde R \|_{L^1_{t,x}} \lesssim_m A^2 N_0^{-9\varepsilon^2} N_1^{-2s'} \ \ \ \ \ (78)

for all {m \geq 0}, and such that

\displaystyle  \| (N_1^{-\varepsilon} \partial_t)^{\leq 1} (\tilde u^{(0)} - u^{(0)}) \|_{L^2_{t,x}} \lesssim A N_1^{-s'} \ \ \ \ \ (79)


\displaystyle  \| \tilde u^{(0)} - u^{(0)} \|_{L^1_{t,x}} \lesssim A N_1^{-10}. \ \ \ \ \ (80)

Furthermore, if {(u^{(0)},p^{(0)},R^{(0)})} is supported on {I \times \mathbf{T}_E} for some interval {I}, then one can ensure that {(\tilde u^{(0)}, \tilde p^{(0)},\tilde R^{(0)})} is supported in {I'' \times \mathbf{T}_E}, where {I''} is the {N_0^{-\varepsilon^2}/2}-neighbourhood of {I}.

We remark that this sort of mollification step is now a standard technique in any iteration scheme that involves loss of derivatives, including the Nash-Moser iteration scheme that was first used to prove Theorem 8.

Proof: Let {\phi: {\bf R} \times {\bf R}^d \rightarrow {\bf R}} be a bump function (depending only on {M}) supported on the region {\{ (s,y): |s| \leq 1/2M, |y| \leq 1/M\}} of total mass {1}, and define the averaging operator {P} on smooth functions {f: \Omega \rightarrow {\bf R}} by the formula

\displaystyle  Pf(t,x) := \int_{{\bf R} \times {\bf R}^d} \phi(s,y) f(t-\tilde N_0^{-\varepsilon} s,x- \tilde N_0^{-1} y)\ ds dy.

From the fundamental theorem of calculus we have

\displaystyle  P = I - Q

where {I} is the identity operator and

\displaystyle  Qf(t,x) := \int_0^1 \int_{{\bf R} \times {\bf R}^d} \phi(s,y)

\displaystyle  (s \tilde N_0^{-\varepsilon} \partial_t + y \cdot \tilde N_0^{-1} \nabla_x) f(t-\theta \tilde N_0^{-\varepsilon} s, x- \theta \tilde N_0^{-1} y)\ d\theta\ ds\ dy.

The operators {P} and {Q} will behave like low and high frequency Littlewood-Paley projections. (We cannot directly use these projections here because their convolution kernels are not localised in time.)

Observe that {P,Q} are convolution operators and thus commute with each other and with the partial derivatives {\partial_t, \nabla_x}. If we apply the operator {I - Q^M} to (26), (27), (28), we see that {(\tilde u^{(0)}, \tilde p^{(0)}, \tilde R^{(0)})} is Navier-Stokes-Reynolds flow, where

\displaystyle  \tilde u^{(0)} := (I-Q^M) u^{(0)}

\displaystyle  \tilde p^{(0)} := (I-Q^M) p^{(0)}

\displaystyle  \tilde R_{ij}^{(0)} := (I-Q^M) R_{ij}^{(0)}

\displaystyle + (I-Q^M)(u_i^{(0)} u_j^{(0)}) - ((I-Q^M) u_i^{(0)}) ((I-Q^M) u_j^{(0)}).

Since {Q=I-P}, {I-Q^M} is a linear combination of the operators {P, P^2, \dots, P^M}. In particular, we see that {(\tilde u^{(0)}, \tilde p^{(0)}, \tilde R^{(0)})} is supported on {I'' \times \mathbf{T}_E}.
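The fact, used repeatedly below, that {I-Q^M} is a linear combination of {P,\dots,P^M} with no identity component is just the binomial theorem applied to {Q = I-P}; a quick sanity check, representing an operator polynomial in {P} by its coefficient list:

```python
from math import comb

# I - Q^M = I - (I-P)^M expands, by the binomial theorem, into a linear
# combination of P, P^2, ..., P^M with no identity component.  Represent an
# operator polynomial in P by its coefficient list [P^0, P^1, ..., P^M].
M = 5
coeffs = [0] * (M + 1)
for k in range(M + 1):
    coeffs[k] -= comb(M, k) * (-1) ** k    # subtract the expansion of (I-P)^M
coeffs[0] += 1                             # add back the identity

print(coeffs[0])                           # no identity component survives
print(coeffs[1:])                          # the coefficients of P, ..., P^M
```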

We abbreviate {D := (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla x)}. For any {m \geq 0}, we have

\displaystyle  D^m P f(t,x) = \int_{{\bf R} \times {\bf R}^d} (\nabla_s, \nabla_y)^m \phi(s,y) f(t-\tilde N_0^{-\varepsilon} s,x- \tilde N_0^{-1} y)\ ds dy

and therefore deduce the bounds

\displaystyle  \| D^m P f \|_{L^p_{t,x}} \lesssim_{m} \| f \|_{L^p_{t,x}} \ \ \ \ \ (81)

for any {1 \leq p \leq \infty}, thanks to Young’s inequality. A similar application of Young’s inequality gives

\displaystyle  \| Q^m f \|_{L^p_{t,x}} \lesssim_{m} \| D^m f \|_{L^p_{t,x}} \ \ \ \ \ (82)

for all {m \geq 0} and {1 \leq p \leq \infty}.
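A one-dimensional caricature of these operators (with a periodic Gaussian standing in for the bump {\phi}, and illustrative numerical parameters of my choosing) shows the two estimates in action: {Q} nearly annihilates frequencies far below {\tilde N_0} while leaving much higher frequencies essentially untouched:

```python
import numpy as np

# 1-D caricature of the averaging operators P and Q = I - P from this proof.
# P is bounded on every L^p as in (81), while each factor of Q costs one
# derivative at scale 1/Ntilde as in (82): Q nearly annihilates functions
# oscillating at frequencies << Ntilde and barely affects frequencies
# >> Ntilde.  All specific numbers are illustrative.
n, Ntilde = 1024, 64.0
x = np.arange(n) / n
width = 1.0 / Ntilde

kernel = np.exp(-(np.minimum(x, 1 - x) / width) ** 2)  # periodic bump at 0
kernel /= kernel.sum()                                 # total mass 1

def P(f):
    # convolution with the bump, computed spectrally
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(kernel)))

def Q(f):
    return f - P(f)

low = np.sin(2 * np.pi * 2 * x)      # frequency 2 << Ntilde
high = np.sin(2 * np.pi * 300 * x)   # frequency 300 >> Ntilde
print(np.abs(Q(low)).max(), np.abs(Q(high)).max())
```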

From (81) and decomposing {I-Q^M} as a linear combination of {P,P^2,\dots,P^M}, we have

\displaystyle  \| D^m \nabla_x \tilde u^{(0)} \|_{L^2_{t,x}} \lesssim_{m} \| \nabla_x u^{(0)} \|_{L^2_{t,x}}

for any {m \geq 0}, and hence (77) follows from (75). In a similar spirit, from (82), (75) one has

\displaystyle  \| (N_1^{-\varepsilon} \partial_t)^{\leq 1} (\tilde u^{(0)} - u^{(0)}) \|_{L^2_{t,x}} \lesssim \| D Q^{M-1} (1-P) u^{(0)} \|_{L^2_{t,x}}

\displaystyle  \lesssim \| D^{M} u^{(0)} \|_{L^2_{t,x}}

\displaystyle  \leq \| N_0^{-\varepsilon^3 M} (N_0^{-1} \nabla_x, N_0^{-\varepsilon} \partial_t)^{M} u^{(0)} \|_{L^2_{t,x}}

\displaystyle  \lesssim_{s,\varepsilon,M,d,\nu} A N_1^{-10}

if {M} is large enough, and this gives (79), (80).

Finally we prove (78). By the triangle inequality it suffices to show that

\displaystyle  \| D^m (I-Q^M) R^{(0)} \|_{L^1_{t,x}} \lesssim_m A^2 N_0^{-9\varepsilon^2} N_1^{-2s'} \ \ \ \ \ (83)


\displaystyle  \| D^m ((I-Q^M)(u_i^{(0)} u_j^{(0)}) - ((I-Q^M) u_i^{(0)}) ((I-Q^M) u_j^{(0)})) \|_{L^1_{t,x}} \lesssim_m A^2 N_0^{-9\varepsilon^2} N_1^{-2s'} \ \ \ \ \ (84)

for any {m \geq 0}. The claim (83) follows from (81), (76), after writing {I-Q^M} as a linear combination of {P,\dots,P^M} and noting that {N_0^{-2(1+\varepsilon)s' - 10\varepsilon^2} = N_0^{-10\varepsilon^2} N_1^{-2s'}}. For (84), if we again write {I-Q^M} as a linear combination of {P,\dots,P^M} and use (81) and the Leibniz rule, one can bound the left-hand side of (84) by

\displaystyle  \lesssim_m \sum_{m_1+m_2=m} \| D^{m_1} u^{(0)} \|_{L^2_{t,x}} \| D^{m_2} u^{(0)} \|_{L^2_{t,x}}

and hence by (75) (bounding {D} by {N_0^{-\varepsilon^2} (N_0^{-1} \nabla_x, N_0^{-\varepsilon} \partial_t)}) this is bounded by

\displaystyle  \lesssim_m N_0^{-\varepsilon^2 m} A^2 N_0^{2(1-s')}.

This gives (84) when {m \geq M/2}. For {m<M/2}, we rewrite the expression

\displaystyle  (I-Q^M)(u_i^{(0)} u_j^{(0)}) - ((I-Q^M) u_i^{(0)}) ((I-Q^M) u_j^{(0)})


as

\displaystyle  -Q^M(u_i^{(0)} u_j^{(0)}) + (Q^M u_i^{(0)}) ((I-Q^M) u_j^{(0)}) + u_i^{(0)} Q^M u_j^{(0)}.

The contribution of the first term to (84) can be bounded using (82), (81) (splitting {Q^M = Q^{M-k} (I-P)^k}) by

\displaystyle  \lesssim \| D^M (u_i^{(0)} u_j^{(0)}) \|_{L^1_{t,x}}

which by the Leibniz rule, bounding {D} by {N_0^{-\varepsilon^2} (N_0^{-1} \nabla_x, N_0^{-\varepsilon} \partial_t)}, and (75) is bounded by

\displaystyle  \lesssim N_0^{-\varepsilon^2 M} A^2 N_0^{2(1-s')}

which is again an acceptable contribution to (84) since {M} is large. The other terms are treated similarly. \Box

We return to the proof of Theorem 30. We abbreviate {D_1 := (N_1^{-\varepsilon} \partial_t, N_1^{-1} \nabla_x)}. Let {\tilde U := (\tilde u^{(0)}, \tilde p^{(0)}, \tilde R^{(0)})} be the Navier-Stokes-Reynolds flow constructed by Proposition 31. By using the ansatz

\displaystyle  (u^{(1)}, p^{(1)}, R^{(1)}) = (\tilde u^{(0)} + v, \tilde p^{(0)} + q, R^{(1)} ),

and the triangle inequality, it will suffice to locate a difference Navier-Stokes-Reynolds flow {(v, q, R^{(1)})} at {\tilde U} supported on {I' \times \mathbf{T}_E}, obeying the estimates

\displaystyle  \| D_1^{\leq M} \nabla_x (\tilde u^{(0)} + v) \|_{L^2_{t,x}} \leq A N_1^{1-s'} \ \ \ \ \ (85)

\displaystyle  \| R^{(1)} \|_{L^1_{t,x}} \lesssim A^2 N_1^{-2(1+\varepsilon)s' - 11\varepsilon^2} \ \ \ \ \ (86)

\displaystyle  \| D_1^{\leq 1} v \|_{L^2_{t,x}} \lesssim A N_1^{-s'} \ \ \ \ \ (87)

\displaystyle  \| v \|_{L^1_{t,x}} \lesssim A N_1^{-10}. \ \ \ \ \ (88)

From (77) we have

\displaystyle  \| D_1^{\leq M} \nabla_x \tilde u^{(0)} \|_{L^2_{t,x}} \leq A N_1^{1-s'}/2

so by the triangle inequality we can replace (85) by

\displaystyle  \| D_1^{\leq M} \nabla_x v \|_{L^2_{t,x}} \leq A N_1^{1-s'}/2

and then (85), (87) may be replaced by the single estimate

\displaystyle  \| D_1^{\leq M+1} v \|_{L^2_{t,x}} \lesssim A N_1^{-s'-\varepsilon^2} \ \ \ \ \ (89)

(say). By using Exercise 24 as in the previous section, it then suffices to construct an approximate difference Navier-Stokes-Reynolds flow {(v, q, R^{(1)}, f, F)} to the difference equation at {\tilde U} supported in {I' \times \mathbf{T}_E} obeying the bounds (86), (89),

\displaystyle  \| D_1^{\leq M} f \|_{L^2_{t,x}} \lesssim A N_1^{-20} \ \ \ \ \ (90)


\displaystyle  \| F \|_{L^1_{t,x}} \lesssim A N_1^{-20}. \ \ \ \ \ (91)

Now, we pass to fast and slow variables. Let {\mathbf{D}} denote the tuple

\displaystyle  \mathbf{D} := (N_1^{-\varepsilon} \partial_t, N_1^{-1} \nabla_x, N_1^{-\varepsilon^2} \nabla_y);

informally, the use of {\mathbf{D}} is consistent with oscillations in time of wavelength {\gtrsim N_1^{-\varepsilon}}, in the slow variable {x} of wavelength {\gtrsim N_1^{-1}}, and in the fast variable {y} of wavelength {\gtrsim N_1^{-\varepsilon^2} \sim N'_1/N_1}.

Exercise 32 By using the method of fast and slow variables as in the previous section, show that to construct the approximate Navier-Stokes-Reynolds flow {(v, q, R^{(1)}, f, F)} at {\tilde U} obeying the bounds (86), (89), (90), (91), it suffices to locate an approximate fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} to the difference Navier-Stokes-Reynolds equation at {\tilde U} (at frequency scale {N'_1 \sim N_1^{1-\varepsilon^2}} rather than {N_1}) and supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F} that obeys the bounds

\displaystyle  \| \mathbf{D}^{\leq M+1} \mathbf{v} \|_{L^2_{t,x,y}} \lesssim A N_1^{-s'-\varepsilon^2}. \ \ \ \ \ (92)

\displaystyle  \| \mathbf{R} \|_{L^1_{t,x,y}} \lesssim A^2 N_1^{-2(1+\varepsilon)s' - 11\varepsilon^2} \ \ \ \ \ (93)

\displaystyle  \| \mathbf{v} \|_{L^1_{t,x,y}} \lesssim A N_1^{-20} \ \ \ \ \ (94)

\displaystyle  \| \mathbf{D}^{\leq M} \mathbf{f} \|_{L^2_{t,x,y}} \lesssim A N_1^{-20} \ \ \ \ \ (95)

\displaystyle  \| \mathbf{F} \|_{L^1_{t,x,y}} \lesssim A N_1^{-20}. \ \ \ \ \ (96)

As in the previous section, we can then pass to simplified fast-slow solutions:

Exercise 33 Show that to construct the approximate fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} to the difference equation at {\tilde U} obeying the estimates of the previous exercise, it will in fact suffice to locate a simplified fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} at {\tilde U} (again at frequency scale {N'_1}) supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F}, obeying the bounds (92), (93), (95), (96) and

\displaystyle  \| \mathbf{D}^{\leq 2} \mathbf{v} \|_{L^2_t L^2_x L^1_y} \lesssim A N_1^{-30}. \ \ \ \ \ (97)

(Hint: one will need the estimate

\displaystyle  \| \tilde u^{(0)} \|_{L^2_t L^2_x L^\infty_y} = \| \tilde u^{(0)} \|_{L^2_{t,x}} \lesssim A \tilde N_0^{1-s'}

from Proposition 31.)

Now we need to construct a simplified fast-slow solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F})} at {\tilde U} supported on {I' \times \mathbf{T}_E \times \mathbf{T}_F} obeying the bounds (92), (93), (95), (96), (97). As in the previous section, we do this in stages, first finding a solution that cancels off the top order terms {N'_1 \partial_{y^j}(\mathbf{v}^i \mathbf{v}^j)} and {N'_1 \partial_{y^i} \mathbf{v}^i}, and also such that {\mathbf{v}^i \mathbf{v}^j + \tilde R^{ij} + q \eta^{ij}} is “high frequency” (mean zero in {y}). Then we apply a divergence corrector to completely eliminate {\mathbf{f}}, followed by a stress corrector that almost completely eliminates {\mathbf{F}}.

As before, we need to select {q} so that {-\tilde R^{ij} + q \eta^{ij}} is positive definite. In the previous section we essentially took {q} to be a large multiple of {|\tilde R|}, but now we will need good control on the derivatives of {q}, which requires a little more care. Namely, we will need the following technical lemma:

Lemma 34 (Smooth polar-type decomposition) There exists a factorisation {\tilde R = w^2 S}, where {w: \Omega \rightarrow {\bf R}}, {S: \Omega \rightarrow {\bf R}^{d^2}} are smooth, supported on {I' \times \mathbf{T}_E}, and obey the estimates

\displaystyle  \| (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla_x)^{\leq 10M} w \|_{L^2_{t,x}} \lesssim A N_0^{-9\varepsilon^2/2} N_1^{-s'} \ \ \ \ \ (98)

\displaystyle  \| (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla_x)^{\leq 10M} S \|_{L^\infty_{t,x}} \lesssim 1. \ \ \ \ \ (99)

Proof: We may assume that {\tilde R} is not identically zero, since otherwise the claim is trivial. For brevity we write {D := (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla_x)} and {Q := A^2 N_0^{-9\varepsilon^2} N_1^{-2s'}}. From (78) we have

\displaystyle  \| D^{\leq 20M} \tilde R \|_{L^1_{t,x}} \lesssim Q. \ \ \ \ \ (100)

Let {B} denote the spacetime cylinder {B := \{ (s,y) \in {\bf R} \times {\bf R}^d: |s| \leq 1, |y| \leq 1 \}}, and let {F: \Omega \rightarrow {\bf R}^+} denote the maximal function

\displaystyle  F(t,x) := \sup_{(s,y) \in B} |D^{\leq 10M} \tilde R(t+\tilde N_0^{-\varepsilon} s,x+ \tilde N_0^{-1} y)|.

From the fundamental theorem of calculus (or Sobolev embedding) one has the pointwise estimate

\displaystyle  F(t,x) \lesssim \int_B |D^{\leq 20M} \tilde R(t+\tilde N_0^{-\varepsilon} s,x+ \tilde N_0^{-1} y)|\ ds dy,

thus by Fubini’s theorem and (100)

\displaystyle  \| F \|_{L^1_{t,x}} \lesssim Q. \ \ \ \ \ (101)
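(The pointwise bound used above is a rescaled version of the elementary fact that the supremum of a function over a unit cube is controlled by the integral of the function together with its derivatives over that cube. Here is a quick one-dimensional numerical illustration of this fact; the test function {f}, the base point and the grid resolution are illustrative choices, not part of the argument.)

```python
import math

# sup_{|s| <= 1} |f(t+s)| is bounded by the integral of |f| + |f'| over
# [t-1, t+1]: by the fundamental theorem of calculus, the value at any point
# equals an average plus an integral of the derivative.
f = lambda x: math.sin(3 * x)
fp = lambda x: 3 * math.cos(3 * x)

t, n = 0.3, 20000
h = 2.0 / n
grid = [t - 1 + i * h for i in range(n + 1)]

sup_f = max(abs(f(x)) for x in grid)
integral = sum((abs(f(x)) + abs(fp(x))) * h for x in grid[:-1])

# the sup is dominated by the integral (constant 1 suffices on a length-2 interval)
assert sup_f <= integral
```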

We do not have good control on the derivatives of {F}, so we apply a smoothing operator. Let {G: \Omega \rightarrow {\bf R}^+} denote the function

\displaystyle  G(t,x) := \int_{{\bf R} \times {\bf R}^d} F( t + \tilde N_0^{-\varepsilon} s, x+ \tilde N_0^{-1} y) \langle (s,y) \rangle^{-10d}\ ds dy

where {\langle (s,y) \rangle := (1+|s|^2+|y|^2)^{1/2}}, then by Fubini’s theorem (or Young’s inequality)

\displaystyle  \| G \|_{L^1_{t,x}} \lesssim Q. \ \ \ \ \ (102)

Also, {G} is smooth and strictly positive everywhere, and from differentiation under the integral sign and integration by parts we have

\displaystyle  D^m G(t,x) \lesssim_{m} \int_{{\bf R} \times {\bf R}^d} F( t + \tilde N_0^{-\varepsilon} s, x+ \tilde N_0^{-1} y) |\nabla_{s,y}^{\leq m} \langle (s,y) \rangle^{-10d}|\ ds dy

and hence

\displaystyle  D^m G \lesssim_{m} G \ \ \ \ \ (103)

for any {m \geq 0}. Also, from construction one has

\displaystyle  |D^{\leq 10M} \tilde R(t,x)| \lesssim \int_B F( t + \tilde N_0^{-\varepsilon} s, x + \tilde N_0^{-1} y)\ ds dy \ \ \ \ \ (104)

\displaystyle  \lesssim G(t,x).

Write {w := G^{1/2}}. From many applications of the chain rule (or the Faà di Bruno formula), we see that for any {m \geq 0}, {D^m w} is a linear combination of {O_m(1)} terms of the form

\displaystyle  (D^{m_1} G) \dots (D^{m_k} G) G^{\frac{1}{2} - k}

where {m_1,\dots,m_k \geq 1} sum up to {m} (more precisely, each component of {D^m w} is a linear combination of expressions of the above form in which one works with individual components of each factor {D^{m_i} G} rather than the full tuple {D^{m_i} G}). From (103) we thus have the pointwise estimate

\displaystyle  D^m w \lesssim_{m} G^{1/2}

for any {m}, and (98) now follows from (102). A similar argument gives

\displaystyle  D^m(G^{-1}) \lesssim_{m} G^{-1}

for any {m \geq 0}, hence if we set {S := G^{-1} \tilde R}, then by the product rule

\displaystyle  D^{\leq 10M} S \lesssim G^{-1} D^{\leq 10M} \tilde R

and (99) now follows from (104).

Strictly speaking we are not quite done because {w} is not supported in {I' \times \mathbf{T}_E}, but if one applies a smooth cutoff function in time that equals {1} on {I''} (where {\tilde R} is supported in time) and vanishes outside of {I'}, we obtain the required support property without significantly affecting the estimates. \Box
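The key trick in this proof is that convolving a nonnegative function against a kernel with polynomial decay produces a smoothed function whose derivatives are pointwise dominated by the function itself, as in (103), because the derivatives of the kernel are pointwise dominated by the kernel. A minimal one-dimensional numerical sketch of this phenomenon (the kernel exponent and the spiky profile {F} below are illustrative choices, not the ones in the text):

```python
# Smoothing a rough nonnegative F against the polynomially-decaying kernel
# K(u) = (1+u^2)^{-5} gives G = F * K with |G'| <= 5 G pointwise, since
# |K'(u)| = 10|u|(1+u^2)^{-6} <= 5 K(u) for all u (by AM-GM, 2|u| <= 1+u^2).
K = lambda u: (1 + u * u) ** -5
Kp = lambda u: -10 * u * (1 + u * u) ** -6

# F: two narrow spikes of different heights (an illustrative rough profile)
F = lambda s: (1.0 if abs(s - 2) < 0.05 else 0.0) + (3.0 if abs(s + 1) < 0.02 else 0.0)

h = 0.01
src = [-5 + j * h for j in range(1001)]           # discretised source interval
def G(t):  return sum(F(s) * K(t - s) * h for s in src)
def Gp(t): return sum(F(s) * Kp(t - s) * h for s in src)

# check the pointwise derivative bound |G'| <= 5 G on a sample of points
ratio = max(abs(Gp(t)) / G(t) for t in [-4 + 0.5 * i for i in range(17)])
assert ratio <= 5.0 + 1e-9
```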

Let {\tilde R = w^2 S} be the factorisation given by the above lemma. If we set {q := C w^2} for a sufficiently large constant {C} depending only on {s,\varepsilon,M,d}, then

\displaystyle  -\tilde R^{ij} + q \eta^{ij} = w^2 ( C \eta^{ij} - S^{ij} ).

For {C} large enough, we see from (99) that the matrix with entries {C \eta^{ij} - S^{ij}} takes values in a compact subset of the positive definite {d \times d} matrices that depends only on {s,\varepsilon,M,d}. Applying Exercise 27, we conclude that there exist non-zero lattice vectors {e_1,\dots,e_K \in {\bf Z}^d} and smooth functions {b_1,\dots,b_K: \Omega \rightarrow {\bf R}} for some {K \geq 0} such that

\displaystyle  C \eta^{ij} - S^{ij}(t,x) = \sum_{k=1}^K b_k(t,x)^2 e_k^i e_k^j

for all {(t,x) \in \Omega}, and furthermore (from (99) and the chain rule) we have the derivative estimates

\displaystyle  \| (\tilde N_0^{-1} \nabla_x, \tilde N_0^{-\varepsilon} \partial_t)^{\leq 2M} b_k \|_{L^\infty_{t,x}} \lesssim 1

for {k=1,\dots,K}. Setting {a_k := w b_k}, we thus have

\displaystyle  -\tilde R^{ij} + q \eta^{ij} = \sum_{k=1}^K a_k(t,x)^2 e_k^i e_k^j

and from the Leibniz rule and (98) we have

\displaystyle  \| (\tilde N_0^{-\varepsilon} \partial_t, \tilde N_0^{-1} \nabla_x)^{\leq 10M} a_k \|_{L^2_{t,x}} \lesssim A N_0^{-9\varepsilon^2/2} N_1^{-s'} \ \ \ \ \ (105)

for {k=1,\dots,K}.
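The decomposition of a positive definite matrix into nonnegative combinations of rank one matrices {e_k^i e_k^j} with lattice directions {e_k} (the content of Exercise 27, whose precise statement is not reproduced in this section) can be illustrated in a toy two-dimensional setting, assuming the matrix is diagonally dominant; the particular directions {(1,0), (0,1), (1,\pm 1)} below are one simple illustrative choice:

```python
# Toy 2x2 analogue of the lattice rank-one decomposition: write a diagonally
# dominant positive definite M as sum_k c_k e_k e_k^T with c_k >= 0 and
# e_k in {(1,0), (0,1), (1,1), (1,-1)}.
def decompose(M):
    """Return {e_k: c_k} with c_k >= 0 and M = sum_k c_k e_k e_k^T."""
    m11, m12, m22 = M[0][0], M[0][1], M[1][1]
    c_diag = abs(m12)  # the off-diagonal entry is carried by (1,1) or (1,-1)
    coeffs = {(1, 0): m11 - c_diag, (0, 1): m22 - c_diag,
              ((1, 1) if m12 >= 0 else (1, -1)): c_diag}
    assert all(c >= 0 for c in coeffs.values()), "M not diagonally dominant"
    return coeffs

M = [[2.0, -1.0], [-1.0, 3.0]]
coeffs = decompose(M)

# reconstruct M from the rank-one pieces and check it matches
recon = [[0.0, 0.0], [0.0, 0.0]]
for (e1, e2), c in coeffs.items():
    recon[0][0] += c * e1 * e1
    recon[0][1] += c * e1 * e2
    recon[1][1] += c * e2 * e2
assert (recon[0][0], recon[0][1], recon[1][1]) == (M[0][0], M[0][1], M[1][1])
```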

Let {T_1,\dots,T_K} be the disjoint tubes in {\mathbf{T}_F} from the previous section, with width {N_1^{-\varepsilon^2}} rather than {N^{-\varepsilon}}. Construct the functions {\mathbf{\psi}_k: \mathbf{T}_F \rightarrow {\bf R}} as in the previous section, and again set

\displaystyle  \mathbf{w}_k^i(y) := \Delta^M_y \mathbf{\psi}_k(y) e_k^i.

Then as before, each {\mathbf{w}_k} is divergence free, obeys the identities (65), (66), the identity

\displaystyle  \partial_{y^j}( \mathbf{w}_k^i \mathbf{w}_k^j ) = 0

and the normalisation

\displaystyle  \int_{\mathbf{T}_F} \mathbf{w}_k^i \mathbf{w}_k^j = e_k^i e_k^j \ \ \ \ \ (106)

and the bounds
\displaystyle  \| (N_1^{-\varepsilon^2} \nabla_y)^m (N_1^{-2\varepsilon^2} \Delta_y)^{-m'} \mathbf{w}_k \|_{L^2_y} \lesssim_{U,m} 1 \ \ \ \ \ (107)

for all {m \geq 0} and {0 \leq m' \leq M}. As in the preceding section, we then set

\displaystyle \mathbf{v}^i(t,x,y) := \sum_{k=1}^K a_k(t,x) \mathbf{w}_k^i

\displaystyle \mathbf{F}^i(t,x,y) := \partial_{x^j}( \sum_{k=1}^K a_k(t,x)^2 (\mathbf{w}_k^i(y) \mathbf{w}_k^j(y) - e_k^i e_k^j) )

\displaystyle \mathbf{f}(t,x,y) := \sum_{k=1}^K \partial_{x^i} a_k(t,x) \mathbf{w}_k^i(y)

and one easily checks that {(\mathbf{v}, q, 0, \mathbf{f}, \mathbf{F})} is a simplified fast-slow solution supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F}. Direct calculation using the Leibniz rule and (105), (107) then gives the bounds

\displaystyle  \| \mathbf{D}^{\leq 10M} \mathbf{v} \|_{L^2_{t,x,y}} \lesssim_{M} A N_0^{-9\varepsilon^2/2} N_1^{-s'} \ \ \ \ \ (108)

\displaystyle  \| \mathbf{D}^{\leq 10M-1} \mathbf{f} \|_{L^2_{t,x,y}} \lesssim_{M} A \tilde N_0 N_0^{-9\varepsilon^2/2} N_1^{-s'} \ \ \ \ \ (109)

\displaystyle  \| \mathbf{D}^{\leq 10M-1} \mathbf{F} \|_{L^1_{t,x,y}} \lesssim_{M} A^2 \tilde N_0 N_0^{-9\varepsilon^2} N_1^{-2s'}. \ \ \ \ \ (110)

As before, {\mathbf{F}} is “high frequency” (mean zero in the {y} variable). Also, {\mathbf{v}} is supported on the set {\Omega \times \bigcup_{k=1}^K T_k}, and for {d} large enough the latter set {\bigcup_{k=1}^K T_k} has measure (say) {O_{s,\varepsilon,M,d,\nu}(N_1^{-100})}. Thus by Cauchy-Schwarz (in just the {y} variable) one has

\displaystyle  \| \mathbf{D}^{\leq 10M} \mathbf{v} \|_{L^2_t L^2_x L^1_y} \lesssim A N_1^{-30}. \ \ \ \ \ (111)
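The Cauchy-Schwarz step here is just the inequality {\|g\|_{L^1} \leq |\mathrm{supp}(g)|^{1/2} \|g\|_{L^2}}: a function supported on a set of tiny measure has much smaller {L^1} norm than {L^2} norm. A discrete numerical sketch (the random profile and the 1% support fraction are illustrative choices):

```python
import math, random

# ||f||_{L^1} <= |supp f|^{1/2} ||f||_{L^2} on the unit interval, discretised:
# f is supported on the first 1% of a uniform grid.
random.seed(0)
n = 10000
h = 1.0 / n
f = [random.uniform(-1, 1) if i < n // 100 else 0.0 for i in range(n)]

l1 = sum(abs(v) * h for v in f)
l2 = math.sqrt(sum(v * v * h for v in f))
measure = (n // 100) * h                     # measure of the support

# Cauchy-Schwarz against the indicator of the support
assert l1 <= math.sqrt(measure) * l2 + 1e-12
```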

The divergence corrector can be applied without difficulty:

Exercise 35 Show that there is a simplified fast-slow solution {(\mathbf{v}', q, \mathbf{R}, \mathbf{f}', \mathbf{F})} supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F} obeying the estimates

\displaystyle  \| \mathbf{D}^{\leq 5M} \mathbf{v}' \|_{L^2_{t,x,y}} \lesssim A N_0^{-9\varepsilon^2/2} N_1^{-s'}

\displaystyle  \| \mathbf{D}^{\leq 5M} \mathbf{v}' \|_{L^2_t L^2_x L^1_y} \lesssim A N_1^{-30}

\displaystyle  \| \mathbf{D}^{\leq 5M} \mathbf{f}' \|_{L^2_{t,x,y}} \lesssim A N_1^{-30}

\displaystyle  \| \mathbf{D}^{\leq 5M} \mathbf{R} \|_{L^1_{t,x,y}} \lesssim A^2 \tilde N_0 N_1^{-1} N_0^{-9\varepsilon^2} N_1^{-2s'}.

The crucial thing here is the tiny gain {\tilde N_0 N_1^{-1}} in the final estimate, with the first factor {\tilde N_0} coming from a “slow” derivative {\nabla_x} and the second factor {N_1^{-1}} coming from essentially inverting a “fast” derivative {\nabla_y}.

Finally, we apply a stress corrector:

Exercise 36 Show that there is a simplified fast-slow solution {(\mathbf{v}', \mathbf{q}, \mathbf{R}', \mathbf{f}', \mathbf{F}')} supported in {I' \times \mathbf{T}_E \times \mathbf{T}_F} obeying the estimates

\displaystyle  \| \mathbf{F}' \|_{L^1_{t,x,y}} \lesssim A^2 N_1^{-30}

\displaystyle  \| \mathbf{D}^{\leq M} \mathbf{R}' \|_{L^1_{t,x,y}} \lesssim A^2 \tilde N_0 (N'_1)^{-1} N_0^{-9\varepsilon^2} N_1^{-2s'}.

Again, we have a crucial gain of {\tilde N_0 (N'_1)^{-1}} coming from applying a slow derivative and inverting a fast one.


Since

\displaystyle  \tilde N_0 (N'_1)^{-1} N_0^{-9\varepsilon^2} N_1^{-2s'} \lesssim N_1^{-2s' - \varepsilon + O( \varepsilon^2 )}

(with implied constant in the exponent uniform in {\varepsilon}) and {s' < s < 1/2}, we see (for {\varepsilon} small enough) that

\displaystyle \tilde N_0 (N'_1)^{-1} N_0^{-9\varepsilon^2} N_1^{-2s'} \leq N_1^{-2(1+\varepsilon)s' - 11\varepsilon^2}

and the desired estimates (92), (93), (95), (96), (97) now follow.

— 5. Constructing low regularity weak solutions to Euler —

Throughout this section, we specialise to the Euler equations ({\nu=0}) in the three-dimensional case {d=3} (although all of the arguments here extend without much modification to {d>3}). In this section we establish an analogue of Corollary 28:

Proposition 37 (Low regularity non-trivial weak solutions) There exists a periodic weak {C^0} solution {u} to Euler which equals zero at time {t=0}, but is not identically zero.

This result was first established by de Lellis and Szekelyhidi. Our approach will deviate from the one in that paper in a number of technical respects (for instance, we use Mikado flows in place of Beltrami flows, and we place more emphasis on the method of fast and slow variables). A key new feature, which was not present in the high-dimensional Sobolev-scale setting, is that the material derivative term {(\partial_t + u_j \partial_j) v_i} in the difference Euler-Reynolds equations is no longer negligible, and needs to be treated by working with an ansatz in Lagrangian coordinates (or equivalently, an ansatz transported by the flow). (This use of Lagrangian coordinates is implicit in the thesis of Isett, this paper of de Lellis and Szekelyhidi, and in the later work of Isett.)

Just as Corollary 28 was derived from Proposition 26, the above proposition may be derived from

Proposition 38 (Weak improvement of Euler-Reynolds flows) Let {U = (u,p,R)} be an Euler-Reynolds flow supported on a strict compact subinterval {I \subsetneq {\bf R}/{\bf Z}}. Let {I'} be another interval in {{\bf R}/{\bf Z}} containing {I} in its interior. Then for sufficiently large {N}, there exists an Euler-Reynolds flow {\tilde U = (\tilde u, \tilde p, \tilde R)} supported in {I' \times \mathbf{T}_E} obeying the estimates

\displaystyle  \| (N^{-1} \partial_t, N^{-1} \nabla_x)^{\leq m} \tilde u \|_{C^0_{t,x}} \lesssim_{U,m,I,I'} 1 \ \ \ \ \ (112)

\displaystyle  \| \tilde R \|_{C^0_{t,x}} \lesssim_{U,I,I'} N^{-1} \ \ \ \ \ (113)

for all {m \geq 0}, and such that

\displaystyle  \|\tilde u - u \|_{C^0_{t,x}} \lesssim_{I,I'} \| R \|_{C^0_{t,x}}^{1/2}; \ \ \ \ \ (114)

also, we have a decomposition

\displaystyle  \tilde u^i - u^i = E^i + \partial_j (E')^{ij} \ \ \ \ \ (115)

where {E^i, (E')^{ij}: \Omega \rightarrow {\bf R}} are smooth functions obeying the bounds

\displaystyle  \| E^i \|_{C^0_{t,x}}, \|(E')^{ij} \|_{C^0_{t,x}} \lesssim_{U,I,I'} N^{-1}. \ \ \ \ \ (116)

The point of the decomposition (115) is that it (together with the smallness bounds (116)) asserts that the velocity correction {\tilde u^i - u^i} is mostly “high frequency” in nature, in that its low frequency components are small. Together with (112), these bounds roughly speaking assert that only the frequency {\sim N} components of {\tilde u^i - u^i} can be large in {C^0_{t,x}} norm. Unlike in the previous estimates, it will be important for our arguments that {u} is supported in a strict subinterval {I} of {{\bf R}/{\bf Z}}, because we will not be able to extend Lagrangian coordinates periodically around the circle. In fact, the long-time instability of Lagrangian coordinates causes significant technical difficulties when one wants to construct solutions in higher regularity Hölder spaces {C^{0,\alpha}}, in particular for {\alpha} close to {1/3}; we discuss this in the next section.
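The mechanism behind (115) and (116) is that a frequency {\sim N} function of unit size is the derivative of a function of size {O(N^{-1})}, and hence is small in negative regularity norms even though it is {O(1)} in {C^0}. A one-dimensional numerical sketch, with an illustrative plane wave standing in for the high frequency correction:

```python
import math

# v = cos(Nx) has unit sup norm, but v = d/dx E' with ||E'|| = 1/N, so v is
# of size O(1/N) in a C^{-1}-type norm.
N = 100
v  = lambda x: math.cos(N * x)
Ep = lambda x: math.sin(N * x) / N          # antiderivative E', size 1/N

# check d/dx E' = v numerically at a sample point
h = 1e-6
x = 0.7321
deriv = (Ep(x + h) - Ep(x - h)) / (2 * h)
assert abs(deriv - v(x)) < 1e-4

# E' is uniformly of size 1/N on a sample of the period
assert max(abs(Ep(0.01 * i)) for i in range(629)) <= 1.0 / N
```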

Exercise 39 Deduce Proposition 37 from Proposition 38. (The decomposition (115) is needed to keep {\tilde u} close to {u} in a very weak topology – basically the {C^0_t C^{-1}_{x}} topology – but one which is still sufficient to ensure that the limiting solution constructed is not identically zero.)

We now begin the proof of Proposition 38, repeating many of the steps used to prove Proposition 26. As before we may assume that {R} is non-zero, and that {U} is supported in {I \times \mathbf{T}_E}. We can assume that {I'} is also a strict subinterval of {{\bf R}/{\bf Z}}.

Assume {N} is sufficiently large; by rounding we may assume that {N} is a natural number. Using the ansatz

\displaystyle  (\tilde u, \tilde p, \tilde R) = (u + v, p + q, \tilde R),

and the triangle inequality, it suffices to construct a difference Euler-Reynolds flow {(v, q, \tilde R)} at {U} supported on {I' \times \mathbf{T}_E} and obeying the bounds

\displaystyle  \| (N^{-1} \partial_t, N^{-1} \nabla)^{\leq k} v \|_{C^0_{t,x}} \lesssim_{U,k} 1

\displaystyle  \| \tilde R \|_{C^0_{t,x}} \lesssim_{U} N^{-1}

\displaystyle  \| v \|_{C^0_{t,x}} \lesssim \| R \|_{C^0_{t,x}}^{1/2}

for all {k \geq 0}, and for which we have a decomposition {v^i = E^i + \partial_j (E')^{ij}} obeying (116).

As before, we permit ourselves some error:

Exercise 40 Show that it suffices to construct an approximate difference Euler-Reynolds flow {(v, q, \tilde R, f, F)} at {U} supported on {I' \times \mathbf{T}_E} and obeying the bounds

\displaystyle  \| (N^{-1} \partial_t, N^{-1} \nabla_x)^{\leq m} v \|_{C^0_{t,x}} \lesssim_{U,m} 1 \ \ \ \ \ (117)

\displaystyle  \| \tilde R\|_{C^0_{t,x}} \lesssim_{U} N^{-1} \ \ \ \ \ (118)

\displaystyle  \| (N^{-1} \partial_t, N^{-1} \nabla_x)^{\leq m} f \|_{C^0_{t,x}} \lesssim_{U} N^{-20} \ \ \ \ \ (119)

\displaystyle  \| F \|_{C^0_{t,x}} \lesssim_{U} N^{-20} \ \ \ \ \ (120)

\displaystyle  \| v \|_{C^0_{t,x}} \lesssim \| R \|_{C^0_{t,x}}^{1/2} \ \ \ \ \ (121)

for {m \geq 0}, and for which we have a decomposition {v^i = E^i + \partial_j (E')^{ij}} obeying (116).

It still remains to construct the approximate difference Euler-Reynolds flow obeying the claimed estimates. By definition, {(v,q,\tilde R, f, F)} has to obey the system of equations

\displaystyle  \partial_t v^i + \partial_j( u^i v^j + u^j v^i + v^i v^j + R^{ij} + q \eta^{ij} - \tilde R^{ij} ) = F^i \ \ \ \ \ (122)

\displaystyle  \partial_i v^i = f \ \ \ \ \ (123)

\displaystyle  \tilde R^{ij} = \tilde R^{ji} \ \ \ \ \ (124)

with a decomposition

\displaystyle  v^i = E^i + \partial_j (E')^{ij}. \ \ \ \ \ (125)

As {u} is divergence-free, the first equation (122) may be rewritten as

\displaystyle  {\mathcal D}_t v^i + u^i f + 2 v^j \partial_j u^i + \partial_j( v^i v^j + R^{ij} + q \eta^{ij} - \tilde R^{ij} ) = F^i \ \ \ \ \ (126)

where {{\mathcal D}_t v} is the material Lie derivative of {v}, thus

\displaystyle  {\mathcal D}_t v^i := \partial_t v^i + {\mathcal L}_u v^i = \partial_t v^i + u^j \partial_j v^i - v^j \partial_j u^i. \ \ \ \ \ (127)

The lower order terms {u^i f + 2 v^j \partial_j u^i} in (126) will turn out to be rather harmless; the main new difficulty is dealing with the material Lie derivative term {{\mathcal D}_t v^i}. We will therefore invoke Lagrangian coordinates in order to convert the material Lie derivative {{\mathcal D}_t} into the more tractable time derivative {\partial_t} (at the cost of mildly complicating all the other terms in the system).

We introduce a “Lagrangian torus” {{\mathbf T}_L := ({\bf R}/{\bf Z})^d} that is an isomorphic copy of the Eulerian torus {{\mathbf T}_E}; as in the previous section, we parameterise this torus by {a = (a^\alpha)_{\alpha=1,\dots,d}}, and adopt the usual summation conventions for the indices {\alpha,\beta,\gamma}. Let {X: I' \times {\mathbf T}_L \rightarrow {\mathbf T}_E} be a trajectory map for {u}, that is to say a smooth map such that for every time {t \in I'}, the map {X(t): {\mathbf T}_L \rightarrow {\mathbf T}_E} is a diffeomorphism and one obeys the ODE

\displaystyle  \partial_t X(t,a) = u(t,X(t,a))

for all {(t,a) \in I' \times {\mathbf T}_L}. The existence of such a trajectory map is guaranteed by the Picard existence theorem (it is important here that {I'} is not all of the torus {{\bf R}/{\bf Z}}); see also Exercise 1 from Notes 1. From (the periodic version of) Lemma 3 of Notes 1, we can ensure that the map {X} is volume-preserving, thus

\displaystyle  \det(\nabla X) = 1.
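The volume-preserving property of trajectory maps of divergence-free fields can be checked numerically: integrating the ODE with a Runge-Kutta scheme for a sample divergence-free field and differentiating the flow map in the initial position, the Jacobian determinant stays equal to {1} up to discretisation error. (The field, final time, step sizes and base point below are all illustrative choices.)

```python
import math

# u = (sin x cos y, -cos x sin y) is divergence free (it is the perp-gradient
# of the stream function sin x sin y).
def u(p):
    x, y = p
    return (math.sin(x) * math.cos(y), -math.cos(x) * math.sin(y))

def flow(p, t=1.0, dt=0.01):
    """RK4 integration of dX/dt = u(X) from initial position p to time t."""
    x = list(p)
    for _ in range(round(t / dt)):
        k1 = u(x)
        k2 = u((x[0] + dt/2 * k1[0], x[1] + dt/2 * k1[1]))
        k3 = u((x[0] + dt/2 * k2[0], x[1] + dt/2 * k2[1]))
        k4 = u((x[0] + dt * k3[0], x[1] + dt * k3[1]))
        x = [x[i] + dt/6 * (k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(2)]
    return x

# Jacobian of the time-1 flow map at a = (0.4, 1.1) by central differences
a, h = (0.4, 1.1), 1e-4
Xa, Xa_ = flow((a[0] + h, a[1])), flow((a[0] - h, a[1]))
Xb, Xb_ = flow((a[0], a[1] + h)), flow((a[0], a[1] - h))
J = [[(Xa[i] - Xa_[i]) / (2*h) for i in (0, 1)],
     [(Xb[i] - Xb_[i]) / (2*h) for i in (0, 1)]]
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
assert abs(det - 1.0) < 1e-3     # det(grad X) = 1 up to discretisation error
```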

Recall from Notes 1 that

  • (i) Any Eulerian scalar field {f} on {I' \times {\mathbf T}_E} can be pulled back to a Lagrangian scalar field {X^* f} on {I' \times {\mathbf T}_L} by the formula

    \displaystyle  X^* f(t, a) := f(t, X(t,a));

  • (ii) Any Eulerian vector field {v^i} on {I' \times {\mathbf T}_E} can be pulled back to a Lagrangian vector field {(X^* v)^\alpha} on {I' \times {\mathbf T}_L} by the formula

    \displaystyle (X^* v)^\alpha(t, a) := (\nabla X(t,a)^{-1})^\alpha_i v^i(t, X(t,a))

    where {\nabla X(t,a)^{-1} = ((\nabla X(t,a)^{-1})^\alpha_i)_{i,\alpha=1,\dots,d}} is the inverse of the matrix {\nabla X(t,a) = (\nabla X(t,a)^i_\alpha)_{i,\alpha=1,\dots,d}}, defined by

    \displaystyle  \nabla X(t,a)^i_\alpha := \partial_\alpha X^i(t,a);


  • (iii) Any Eulerian rank {(2,0)} tensor {T^{ij}} on {I' \times {\mathbf T}_E}, can be pulled back to a Lagrangian rank {2} tensor {(X^* T)^{\alpha \beta}} on {I' \times {\mathbf T}_L} by the formula

    \displaystyle (X^* T)^{\alpha \beta}(t, a) := (\nabla X(t,a)^{-1})^\alpha_i (\nabla X(t,a)^{-1})^\beta_j T^{ij}(t, X(t,a)).

(One can pull back other tensors also, but these are the only ones we will need here.) Each of these pullback operations may be inverted by the corresponding pullback operation for the labels map {X^{-1}: (t,X(t,a)) \mapsto (t,a)} (also known as pushforward by {X}). One can compute how these pullbacks interact with divergences:

Exercise 41 (Pullback and divergence)

  • (i) If {v^i} is a smooth Eulerian vector field, show that the pullback of the divergence of {v} equals the divergence of the pullback of {v}:

    \displaystyle  X^* (\partial_i v^i) = \partial_\alpha (X^* v)^\alpha.

    (Hint: you will need to use the fact that {X} is volume-preserving. Similarly to Lemma 3 and Exercise 4 of Notes 1, one can establish this either using the laws of integration or the laws of differentiation.)

  • (ii) Show that there exist smooth functions {\tilde \Gamma^\alpha_{\beta \gamma}: I' \times {\mathbf T}_L \rightarrow {\bf R}} for {\alpha,\beta,\gamma =1,\dots,d} with the following property: for any smooth Eulerian rank {2} tensor {T^{ij}} on {I' \times {\mathbf T}_E}, with divergence {Y^i := \partial_j T^{ij}}, one has

    \displaystyle  (X^* Y)^\alpha = \partial_\beta (X^* T)^{\alpha \beta} + \tilde \Gamma^\alpha_{\beta \gamma} (X^* T)^{\gamma \beta}.

    (In fact, {\tilde \Gamma^\alpha_{\beta \gamma}} can be given explicitly as {\Gamma^\alpha_{\beta \gamma} + \Gamma^\sigma_{\sigma \beta} \delta^\alpha_\gamma}, where {\delta} is the Kronecker delta and {\Gamma^\alpha_{\beta \gamma}} is the Christoffel symbol associated with the pullback {X^* \eta} of the Euclidean metric – but we will not need this precise formula. The right-hand side may also be written (in Penrose abstract index notation) as {\nabla_\beta (X^* T)^{\alpha \beta}}, where {\nabla_\beta} is the covariant derivative associated to {X^* \eta}.)

As remarked upon in the exercise, these calculations can be streamlined using the theory of the covariant derivative in Riemannian geometry; we will not develop this theory further here, but see for instance these two blog posts.
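The identity in Exercise 41(i) is easy to test numerically for a concrete volume-preserving map, say the shear {X(a^1,a^2) = (a^1 + \sin a^2, a^2)}, whose Jacobian determinant is {1}; the Eulerian field {v} below is an arbitrary illustrative choice.

```python
import math

# Volume-preserving shear map and an arbitrary smooth Eulerian field v
def X(a):      return (a[0] + math.sin(a[1]), a[1])
def v(p):      return (math.cos(p[0]) * math.sin(p[1]), p[0] * p[0] + p[1])
def div_v(p):  return -math.sin(p[0]) * math.sin(p[1]) + 1.0

def pull(a):
    # (X^* v)^alpha = (grad X)^{-1} v(X(a)); here grad X = [[1, cos a2],[0,1]]
    w = v(X(a))
    return (w[0] - math.cos(a[1]) * w[1], w[1])

# divergence of the pullback at a, by central differences
a, h = (0.3, 0.8), 1e-5
div_pull = ((pull((a[0] + h, a[1]))[0] - pull((a[0] - h, a[1]))[0])
            + (pull((a[0], a[1] + h))[1] - pull((a[0], a[1] - h))[1])) / (2 * h)

# Exercise 41(i): divergence of pullback = pullback of divergence
assert abs(div_pull - div_v(X(a))) < 1e-6
```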

If one now applies the pullback operation {X^*} to the system (126), (123), (124), (125) (and uses Exercise 16 from Notes 1 to convert the material Lie derivative into the ordinary time derivative) one obtains the equivalent system

\displaystyle  \partial_t (X^* v)^\alpha + (X^* u)^\alpha (X^* f) + 2 (X^* v)^\beta \partial_\beta (X^* u)^\alpha + \partial_\beta \underline{T}^{\alpha \beta} + \tilde \Gamma^\alpha_{\beta \gamma} \underline{T}^{\gamma \beta}

\displaystyle = (X^* F)^\alpha

\displaystyle  \partial_\alpha (X^* v)^\alpha = X^* f

\displaystyle  (X^* \tilde R)^{\alpha \beta} = (X^* \tilde R)^{\beta \alpha}

\displaystyle  (X^* v)^\alpha = (X^* E)^\alpha + \partial_\beta (X^* E')^{\alpha \beta} + \tilde \Gamma^\alpha_{\beta \gamma} (X^* E')^{\gamma \beta}

where {\underline{T}} denotes the rank {(2,0)} tensor

\displaystyle  \underline{T}^{\alpha \beta} := (X^* v)^\alpha (X^* v)^\beta + (X^* R)^{\alpha \beta} + (X^* q) (X^* \eta)^{\alpha \beta} - (X^* \tilde R)^{\alpha \beta} .

Thus, if one introduces the Lagrangian fields

\displaystyle  \underline{v} := X^* v; \quad \underline{u} := X^* u; \quad \underline{F} := X^* F; \quad \underline{f} := X^* f;

\displaystyle  \underline{\tilde R} := X^* \tilde R; \quad \underline{R} := X^* R; \quad \underline{q} := X^* q; \quad \underline{\eta} := X^* \eta; \quad \underline{E'} := X^* E'

and also

\displaystyle  \underline{E}^\alpha := (X^* E)^\alpha + \tilde \Gamma^\alpha_{\beta \gamma} (X^* E')^{\gamma \beta}

then (from many applications of the chain rule) we see that our task has now transformed to that of obtaining a {(\underline{v},\underline{q},\underline{\tilde R}, \underline{f}, \underline{F})} supported on {I' \times \mathbf{T}_L} obeying the equations

\displaystyle  \partial_t \underline{v}^\alpha + \underline{u}^\alpha \underline{f} + 2 \underline{v}^\beta \partial_\beta \underline{u}^\alpha + \partial_\beta \underline{T}^{\alpha \beta} + \tilde \Gamma^\alpha_{\beta \gamma} \underline{T}^{\gamma \beta} = \underline{F}^\alpha \ \ \ \ \ (128)

\displaystyle  \partial_\alpha \underline{v}^\alpha = \underline{f} \ \ \ \ \ (129)

\displaystyle  \underline{\tilde R}^{\alpha \beta} = \underline{\tilde R}^{\beta \alpha} \ \ \ \ \ (130)

\displaystyle  \underline{v}^\alpha = \underline{E}^\alpha + \partial_\beta \underline{E'}^{\alpha \beta} \ \ \ \ \ (131)

where {\underline{T}} denotes the rank {(2,0)} tensor

\displaystyle  \underline{T}^{\alpha \beta} := \underline{v}^\alpha \underline{v}^\beta + \underline{R}^{\alpha \beta} + \underline{q} \underline{\eta}^{\alpha \beta} - \underline{\tilde R}^{\alpha \beta} \ \ \ \ \ (132)

and obeying the estimates

\displaystyle  \| (N^{-1} \partial_t, N^{-1} \nabla_a)^{\leq m} \underline{v} \|_{C^0_{t,a}} \lesssim_{U,m} 1 \ \ \ \ \ (133)

\displaystyle  \| \underline{\tilde R}\|_{C^0_{t,a}} \lesssim_{U} N^{-1} \ \ \ \ \ (134)

\displaystyle  \| (N^{-1} \partial_t, N^{-1} \nabla_a)^{\leq m} \underline{f} \|_{C^0_{t,a}} \lesssim_{U,m} N^{-20} \ \ \ \ \ (135)

\displaystyle  \| \underline{F} \|_{C^0_{t,a}} \lesssim_{U} N^{-20} \ \ \ \ \ (136)

\displaystyle  \| X_* \underline{v} \|_{C^0_{t,x}} \lesssim \| R \|_{C^0_{t,x}}^{1/2} \ \ \ \ \ (137)

\displaystyle  \| E \|_{C^0_{t,a}}, \|E' \|_{C^0_{t,a}} \lesssim_{U} N^{-1} \ \ \ \ \ (138)

for {m \geq 0}, where {C^0_{t,a}} denotes the supremum on the Lagrangian spacetime {I' \times \mathbf{T}_L}. (In (137) we have to move back to Eulerian coordinates because the coefficients in the pushforward {X_*} depend on {U}, and we want an estimate here uniform in {U}.)

This problem looks complicated, but the net effect of moving to the Lagrangian formulation is to arrive at a problem that is nearly identical to the Eulerian one, but in which the material Lie derivative {{\mathcal D}_t} has been replaced by the ordinary time derivative {\partial_t}, and several lower order terms with smooth variable coefficients have been added to the system.

Now that the dangerous transport term in the material Lie derivative has been eliminated, it is safe to use the method of fast-slow variables, but now on the Lagrangian torus {\mathbf{T}_L} rather than the Eulerian torus {\mathbf{T}_E}. We now parameterise the fast torus {\mathbf{T}_F} by {b = (b^\alpha)_{\alpha=1,\dots,d}} (thus we think of {\mathbf{T}_F} now as a “Lagrangian fast torus” rather than an “Eulerian fast torus”) and use the ansatz

\displaystyle  \underline{v}(t,a) := \mathbf{v}( t, a, N a \hbox{ mod } 1)

\displaystyle  \underline{q}(t,a) := \mathbf{q}( t, a, N a \hbox{ mod } 1)

\displaystyle  \underline{\tilde R}(t,a) := \mathbf{R}( t, a, N a \hbox{ mod } 1)

\displaystyle  \underline{f}(t,a) := \mathbf{f}( t, a, N a \hbox{ mod } 1)

\displaystyle  \underline{F}(t,a) := \mathbf{F}( t, a, N a \hbox{ mod } 1)

\displaystyle  \underline{E}(t,a) := \mathbf{E}( t, a, N a \hbox{ mod } 1)

\displaystyle  \underline{E'}(t,a) := \mathbf{E}'( t, a, N a \hbox{ mod } 1)

so that the equations of motion now become

\displaystyle  \partial_t \mathbf{v}^\alpha + \underline{u}^\alpha \mathbf{f} + 2 \mathbf{v}^\beta \partial_{a^\beta} \underline{u}^\alpha + (\partial_{a^\beta} + N \partial_{b^\beta}) \mathbf{T}^{\alpha \beta} + \tilde \Gamma^\alpha_{\beta \gamma} \mathbf{T}^{\gamma \beta} = \mathbf{F}^\alpha \ \ \ \ \ (139)

\displaystyle  (\partial_{a^\alpha} + N \partial_{b^\alpha}) \mathbf{v}^\alpha = \mathbf{f} \ \ \ \ \ (140)

\displaystyle  \mathbf{R}^{\alpha \beta} = \mathbf{R}^{\beta \alpha} \ \ \ \ \ (141)

\displaystyle  \mathbf{v}^\alpha = \mathbf{E}^\alpha + (\partial_{a^\beta}+N\partial_{b^\beta}) \mathbf{E'}^{\alpha \beta} \ \ \ \ \ (142)

where
\displaystyle  \mathbf{T}^{\alpha \beta} := \mathbf{v}^\alpha \mathbf{v}^\beta + \underline{R}^{\alpha \beta} + \mathbf{q} \underline{\eta}^{\alpha \beta} - \mathbf{R}^{\alpha \beta} \ \ \ \ \ (143)

where we think of {\underline{u}, \tilde \Gamma, \underline{\eta}} as “low frequency” functions of time {t} and the slow Lagrangian variable {a} only (thus they are independent of the fast Lagrangian variable {b}). Set

\displaystyle  D := (N^{-1} \partial_t, N^{-1} \nabla_a, \nabla_b).

It will now suffice to find a smooth solution {(\mathbf{v}, \mathbf{q}, \mathbf{R}, \mathbf{f}, \mathbf{F}, \mathbf{E}, \mathbf{E'})} to the above system supported on {I' \times \mathbf{T}_L \times \mathbf{T}_F} obeying the estimates

\displaystyle  \| D^{\leq m} \mathbf{v} \|_{C^0_{t,a,b}} \lesssim_{U,m} 1 \ \ \ \ \ (144)

\displaystyle  \| \mathbf{R} \|_{C^0_{t,a,b}} \lesssim_{U} N^{-1} \ \ \ \ \ (145)

\displaystyle  \| D^{\leq m} \mathbf{f} \|_{C^0_{t,a,b}} \lesssim_{U,m} N^{-20} \ \ \ \ \ (146)

\displaystyle  \| \mathbf{F} \|_{C^0_{t,a,b}} \lesssim_{U} N^{-20} \ \ \ \ \ (147)

\displaystyle  \| \mathbf{E} \|_{C^0_{t,a,b}}, \|\mathbf{E'} \|_{C^0_{t,a,b}} \lesssim_{U} N^{-1} \ \ \ \ \ (148)

and obeying the pointwise estimate

\displaystyle  \partial_\alpha X^i(t,a) \mathbf{v}^\alpha(t,a,b) = O( \| R \|_{C^0_{t,x}}^{1/2} ) \ \ \ \ \ (149)

for {(t,a,b) \in I' \times \mathbf{T}_L \times \mathbf{T}_F}.
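The way slow derivatives interact with the fast-slow ansatz, turning {\partial_{a^\alpha}} into {\partial_{a^\alpha} + N \partial_{b^\alpha}} as in (139), (140), can be sanity-checked numerically in one dimension (the profile {V} and the values of {N} and the base point are illustrative choices):

```python
import math

# Under the ansatz v(a) = V(a, N a mod 1), the chain rule gives
# v'(a) = (partial_a V + N partial_b V)(a, N a mod 1).
N = 7
V   = lambda a, b: math.sin(a) * math.cos(2 * math.pi * b)
V_a = lambda a, b: math.cos(a) * math.cos(2 * math.pi * b)
V_b = lambda a, b: -2 * math.pi * math.sin(a) * math.sin(2 * math.pi * b)

def vv(a):  # the composed fast-slow function
    return V(a, (N * a) % 1.0)

a, h = 0.37, 1e-6
numeric = (vv(a + h) - vv(a - h)) / (2 * h)          # direct derivative
exact = V_a(a, (N * a) % 1.0) + N * V_b(a, (N * a) % 1.0)  # (d_a + N d_b) V
assert abs(numeric - exact) < 1e-3
```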

If {\mathbf{v}} is chosen to be “high frequency” (mean zero in the fast variable {b}), then we can automatically obtain the estimate (148), as one may obtain the decomposition (142) with

\displaystyle  \mathbf{E'}^{\alpha \beta} := N^{-1} \eta^{\beta \gamma} \Delta_b^{-1} \partial_{b^\gamma} \mathbf{v}^\alpha


\displaystyle  \mathbf{E}^\alpha := - \partial_{a^\beta} (\mathbf{E'})^{\alpha \beta}

at which point the estimates (148) follow from (144). Thus we may drop (148) and (142) from our requirements as long as we instead require {\mathbf{v}} to be high frequency.

As in previous sections, we can relax the conditions on {\mathbf{f}} and {\mathbf{F}}:

Exercise 42

  • (i) (Stress corrector) Show that we may replace the condition (147) with the condition that

    \displaystyle  \| D^{\leq m} \mathbf{F} \|_{C^0_{t,a,b}} \lesssim_{U,m} 1

    for all {m \geq 0}, and that {\mathbf{F}} is of high frequency (mean zero in {b}). (Hint: add a corrector of size {O_U(N^{-1})} to {\mathbf{R}}, {\mathbf{q}} so that the main term {N \partial_{b^\beta} ( \mathbf{q} \underline{\eta}^{\alpha \beta} - \mathbf{R}^{\alpha \beta} )} now cancels off {\mathbf{F}}, and the other terms created by the correction are of size {O_U(N^{-1})} and of mean zero in {b}. Then iterate.)

  • (ii) (Divergence corrector) After performing the modifications in (i), show that we may replace the condition (146) with the condition that

    \displaystyle  \| D^{\leq m} \mathbf{f} \|_{C^0_{t,a,b}} \lesssim_{U,m} 1

    for all {m}, and that {\mathbf{f}} has mean zero in {b}. (Note that correcting for {\mathbf{f}} will modify {\mathbf{v}} by {O_U(N^{-1})}, but this will make a negligible contribution to (149) for {N} large enough.)

We can now repeat the Mikado flow construction from previous sections:

Exercise 43 Set {\mathbf{q} := 100 \| R \|_{C^0_{t,x}}}. Construct a smooth vector field {\mathbf{v}} supported on {I' \times \mathbf{T}_L \times \mathbf{T}_F}, with {\mathbf{v}} of mean zero in {b}, obeying the equations

\displaystyle  \partial_{b^\beta} (\mathbf{v}^\alpha \mathbf{v}^\beta) = 0

\displaystyle  \partial_{b^\beta} \mathbf{v}^\beta = 0

and such that

\displaystyle  \mathbf{v}^\alpha \mathbf{v}^\beta + \underline{R}^{\alpha \beta} + \mathbf{q} \underline{\eta}^{\alpha \beta}

is of mean zero in {b}, obeying the bounds (144) for all {m \geq 0}. Also show that for a sufficiently large absolute constant {C= O(1)} not depending on {U}, one can ensure that the matrix with entries

\displaystyle  C (\underline{R}^{\alpha \beta} + \mathbf{q} \underline{\eta}^{\alpha \beta}) - \mathbf{v}^\alpha \mathbf{v}^\beta \ \ \ \ \ (150)

is positive semi-definite at every point {(t,a,b) \in I' \times \mathbf{T}_L \times \mathbf{T}_F}.
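The flows constructed in Exercise 43 are Mikado (pipe) flows: profiles supported in narrow tubes and constant along a lattice direction {e}. The two key algebraic identities – vanishing divergence and vanishing divergence of the self-stress – can be checked numerically for a model profile (the Gaussian profile below is an illustrative stand-in; the {\Delta^M} factor and the periodisation into {\mathbf{T}_F} are omitted):

```python
import math

# A pipe flow along e = (1,0,0): w(y) = phi(y2,y3) e is constant in y1, so
# both div w and the divergence of the stress w^i w^j vanish identically.
phi = lambda y: math.exp(-10 * ((y[1] - 0.5) ** 2 + (y[2] - 0.5) ** 2))
w = lambda y: (phi(y), 0.0, 0.0)

def partial(f, y, j, h=1e-5):
    """Central finite difference of f in the j-th coordinate at y."""
    yp, ym = list(y), list(y)
    yp[j] += h
    ym[j] -= h
    return (f(yp) - f(ym)) / (2 * h)

y = (0.2, 0.45, 0.6)
div_w = sum(partial(lambda z: w(z)[j], y, j) for j in range(3))
# divergence of the stress tensor w^i w^j, one component for each i
stress_div = [sum(partial(lambda z: w(z)[i] * w(z)[j], y, j) for j in range(3))
              for i in range(3)]
assert abs(div_w) < 1e-9
assert all(abs(s) < 1e-9 for s in stress_div)
```

The same identities hold for profiles along a general lattice direction {e_k}, with {\phi} depending only on the coordinates transverse to {e_k}.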

If we now set

\displaystyle  \mathbf{R}^{\alpha \beta} := 0

\displaystyle  \mathbf{F}^\alpha := \partial_t \mathbf{v}^\alpha + \underline{u}^\alpha \mathbf{f} + 2 \mathbf{v}^\beta \partial_{a^\beta} \underline{u}^\alpha + \partial_{a^\beta}\mathbf{T}^{\alpha \beta} + \tilde \Gamma^\alpha_{\beta \gamma} \mathbf{T}^{\gamma \beta}

\displaystyle  \mathbf{f} := \partial_{a^\alpha} \mathbf{v}^\alpha

one can verify that the equations (139), (140), (141) hold, and that {\mathbf{F}, \mathbf{f}} have mean zero in {b}. Furthermore, by pushing forward (150) by {X(t)}, we conclude that the matrix with entries

\displaystyle  C (R^{ij}(t,X(t,a)) + 100 \| R \|_{C^0_{t,x}} \eta^{ij})

\displaystyle - \partial_\alpha X^i(t,a) \mathbf{v}^\alpha(t,a,b) \partial_\beta X^j(t,a) \mathbf{v}^\beta(t,a,b)

is positive semi-definite for all {(t,a,b) \in I' \times \mathbf{T}_L \times \mathbf{T}_F}; taking traces one concludes (149). Thus we have obtained all the required properties, concluding the proof of Proposition 38.

— 6. Constructing high regularity weak solutions to Euler —

We now informally discuss how to modify the arguments above to establish the negative direction (ii) of Onsager’s conjecture. The full details are rather complicated (and arranged slightly differently from the presentation here), and we refer to Isett’s original paper for details. See also a subsequent paper of Buckmaster, De Lellis, Szekelyhidi, and Vicol for a simplified argument establishing this statement (as well as some additional strengthenings of it).

Let {0 < \alpha < 1/3}, let {\varepsilon > 0} be a small quantity, and let {M} be a large integer. The main iterative step, analogous to Theorem 30, roughly speaking (ignoring some technical logarithmic factors) takes an Euler-Reynolds flow {(u^{(0)},p^{(0)},R^{(0)})} obeying estimates that look like

\displaystyle  \| (N_0^{-1} \nabla_x)^{\leq M} \nabla_x u^{(0)} \|_{C^0_{t,x}} \lesssim N_0^{1-\alpha} \ \ \ \ \ (151)


\displaystyle  \| (N_0^{-1} \nabla_x)^{\leq M} R^{(0)} \|_{C^0_{t,x}} \lesssim N_0^{-2(1+\varepsilon)\alpha}. \ \ \ \ \ (152)

for some sufficiently large {N_0>1}, and obtains a new Euler-Reynolds flow {(u^{(1)},p^{(1)},R^{(1)})} close to {(u^{(0)},p^{(0)},R^{(0)})} that obeys the estimates

\displaystyle  \| (N_1^{-1} \nabla_x)^{\leq M} \nabla_x u^{(1)} \|_{C^0_{t,x}} \lesssim N_1^{1-\alpha} \ \ \ \ \ (153)


\displaystyle  \| (N_1^{-1} \nabla_x)^{\leq M} R^{(1)} \|_{C^0_{t,x}} \lesssim N_1^{-2(1+\varepsilon)\alpha}, \ \ \ \ \ (154)

where {N_1 := N_0^{1+\varepsilon}}; see Lemma 2.1 of Isett’s original paper for a more precise statement (in a slightly different notation), which also includes various estimates on the difference {u^{(1)}-u^{(0)}} that we omit here. In contrast to previous arguments, it is useful for technical reasons to not impose time regularity in the estimates. Once this claim is formalised and proved, conclusions such as Onsager’s conjecture follow from the usual iteration arguments.

To achieve this iteration step, the first step is a mollification step analogous to Proposition 31 in which one perturbs the initial flow to obtain additional spatial regularity on the solution. Roughly speaking, this mollification allows one to replace the purely spatial differential operator {(N_0^{-1} \nabla_x)^{\leq M}} appearing in (151), (153) with {(N_0^{-1} \nabla_x)^{\leq m}} for {m} much larger than {M} (in practice there are some slight additional losses, which we will ignore here).

Now one has to solve the difference equation. We focus on the equation (126) and omit the small errors involving {f, F}. Suppose for the time being that we could magically replace the material Lie derivative {{\mathcal D}_t} here by an ordinary time derivative, thus we would be trying to construct {v} solving an equation such as

\displaystyle  \partial_t v^i + 2 v^j \partial_j u^i + \partial_j( v^i v^j + R^{ij} + q \eta^{ij} - \tilde R^{ij} ) \approx 0.

As before, we can use the method of fast and slow variables to construct a {v} of amplitude roughly {\| R \|_{C^0}^{1/2} = O( N_1^{-\alpha})}, oscillating at frequency {N_1}, such that {v^i v^j + R^{ij} + q \eta^{ij}} is high frequency (mean zero in the fast variable) and has amplitude about {N_1^{-2\alpha}}. We can also arrange matters (using something like (65)) so that the fast derivative component of {\partial_j( v^i v^j + R^{ij} + q \eta^{ij})} vanishes, leaving only a slower derivative of size about {N_0}. This makes the expression {\partial_j( v^i v^j + R^{ij} + q \eta^{ij})} of magnitude about {O(N_0 N_1^{-2\alpha})} and oscillating at frequency about {N_1}, which can be cancelled by a stress corrector in {\tilde R} of magnitude about {O(N^{-1}_1 N_0 N_1^{-2\alpha}) \approx N_1^{-2\alpha - \varepsilon}}. Such a term would be acceptable (smaller than {N_1^{-2(1+\varepsilon) \alpha}}) for {\alpha} as large as {1/2}, in the spirit of Theorem 29.

However, one also has the terms {\partial_t v^i} and {v^j \partial_j u^i}, which (in contrast to the high-dimensional Sobolev scale setting) cannot be ignored in this low-dimensional Hölder scale problem. The natural time scale of oscillation here is {O(N_0^{\alpha-1})}, coming from the usual heuristics concerning the Euler equation (see Remark 11 from 254A Notes 3). With this heuristic, {\nabla u} and {\partial_t} should both behave like {O(N_0^{1-\alpha})}, so these expressions would be expected to have amplitude {O(N_0^{1-\alpha} N_1^{-\alpha})}. They still oscillate at the high frequency {N_1}, though, and lead to a stress corrector of magnitude about {O( N_1^{-1} N_0^{1-\alpha} N_1^{-\alpha}) \approx N_1^{-2\alpha - \varepsilon + \alpha \varepsilon}}. This however remains acceptable for {\alpha} up to {1/3}, which in principle resolves Onsager’s conjecture.
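As a sanity check on this exponent bookkeeping, the two thresholds can be verified numerically. The sketch below (my own illustration, not part of the original argument) measures all exponents in powers of {N_0}, with {N_1 = N_0^{1+\varepsilon}}, and compares each corrector against the target {N_1^{-2(1+\varepsilon)\alpha}}:

```python
# Sanity check of the exponent bookkeeping (illustrative only).
# With N1 = N0**(1 + eps), we compare, in powers of N0:
#   target:  N1**(-2*(1+eps)*alpha)                    (acceptable stress size)
#   slow:    N1**(-1) * N0 * N1**(-2*alpha)            (slow-derivative corrector)
#   time:    N1**(-1) * N0**(1-alpha) * N1**(-alpha)   (time-derivative corrector)
# A corrector is acceptable when its exponent is <= the target's exponent.

def exponents(alpha, eps):
    target = -2 * (1 + eps) * alpha * (1 + eps)
    slow = -(1 + eps) + 1 - 2 * alpha * (1 + eps)
    time = -(1 + eps) + (1 - alpha) - alpha * (1 + eps)
    return target, slow, time

eps = 0.001
t, s, td = exponents(0.33, eps)  # alpha just below 1/3
assert s <= t and td <= t        # both correctors acceptable
t, s, td = exponents(0.34, eps)  # alpha just above 1/3
assert s <= t and td > t         # time-derivative terms now too large
t, s, td = exponents(0.51, eps)  # alpha above 1/2
assert s > t                     # even the slow corrector fails
```

Working out the boundary cases exactly gives {\alpha \leq 1/(2(1+\varepsilon))} for the slow corrector and {\alpha \leq 1/(3+2\varepsilon)} for the time-derivative terms, matching the {1/2} and {1/3} thresholds above as {\varepsilon \rightarrow 0}.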

Now we have to remove the “cheat” of replacing the material Lie derivative by the ordinary time derivative. As we saw in the previous section, the natural way to fix this is to work in Lagrangian coordinates. However, we encounter a new problem: if one initialises the trajectory flow map {X} to be the identity at some given time {t_0}, then it turns out that one only gets good control on the flow map and its derivatives for times {t = t_0 + O(N_0^{\alpha-1})} within the natural time scale {N_0^{\alpha-1}} of that initial time {t_0}; beyond this, the Gronwall-type arguments used to obtain bounds start to deteriorate exponentially. Because of this, one cannot rely on a “global” Lagrangian coordinate system as in the previous section. To get around this, one needs to partition the time domain {{\bf R}/{\bf Z}} into intervals {I_k} of length about {N_0^{\alpha-1}}, and construct a separate trajectory map adapted to each such interval. One can then use these “local Lagrangian coordinates” to construct local components {v_k} of the velocity perturbation {v} that obey the required properties on each such interval. This construction is essentially the content of the “convex integration lemma” in Lemma 3.3 of Isett’s paper.

However, a new problem arises when trying to “glue” these local corrections {v_k} together: two consecutive time intervals {I_k, I_{k+1}} will overlap, and their corresponding local corrections {v_k, v_{k+1}} will also overlap. This leads to some highly undesirable interaction terms between {v_k} and {v_{k+1}} (such as the fast derivative of {v_{k}^i v_{k+1}^j}) which are very difficult to make small (for instance, one cannot simply ensure that {v_k, v_{k+1}} have disjoint spatial supports as they are constructed using different local Lagrangian coordinate systems). On the other hand, if the original Reynolds stress {R} had a special structure, namely that it was only supported on every other interval {I_k} (i.e., on the {I_k} for all {k} even, or the {I_k} for all {k} odd), then these interactions no longer occur and the iteration step can proceed.

One could try to then resolve the problem by correcting the odd and even interval components of the stress {R} in separate stages (cf. how Theorem 19 can be iterated to establish Theorem 15), but this is inefficient with regards to the {\alpha} parameter, in particular this makes the argument stop well short of the optimal {1/3} threshold. To attain this threshold one needs the final ingredient of Isett’s argument, namely a “gluing approximation” (see Lemma 3.2 of Isett’s paper), in which one takes the (mollified) initial Euler-Reynolds flow {(u^{(0)},p^{(0)},R^{(0)})} and replaces it with a nearby flow {(\tilde u^{(0)},\tilde p^{(0)},\tilde R^{(0)})} in which the new Reynolds stress {\tilde R^{(0)}} is only supported on every other interval {I_k}. Combining this gluing approximation lemma with the mollification lemma and convex integration lemma gives the required iteration step. (One technical point is that this gluing has to create some useful additional regularity along the material derivative, in the spirit of Remark 38 of Notes 1, as such regularity will be needed in order to justify the convex integration step.)

To obtain this gluing approximation, one takes an {N_0^{\alpha-1}}-separated sequence of times {t_k}, and for each such time {t_k}, one solves the true Euler equations with initial data {u^{(0)}(t_k)} at {t_k} to obtain smooth Euler solutions {(u_k, p_k)} on a time interval centred at {t_k} of lifespan {\sim N_0^{\alpha-1}}, that agree with {u^{(0)}} at time {t_k}. (It is non-trivial to check that these solutions even exist on such an interval, let alone obey good estimates, but this can be done if the initial data was suitably mollified, as is consistent with the heuristics in Remark 11 from 254A Notes 3.) One can then glue these solutions together around the reference solution {(u^{(0)},p^{(0)},R^{(0)})} by defining

\displaystyle  \tilde u^{(0)} := u^{(0)} + \sum_k \psi_k (u_k - u^{(0)})

\displaystyle  \tilde p^{(0)} := p^{(0)} + \sum_k \psi_k (p_k - p^{(0)})

for a suitable partition of unity {1 = \sum_k \psi_k} in time. This gives fields {(\tilde u^{(0)}, \tilde p^{(0)})} that solve the Euler equation near each time {t_k}. To finish the proof of the gluing approximation lemma, one needs to then find a matching Reynolds stress {\tilde R^{(0)}} for the intervals at which the Euler equation is not solved exactly. Isett’s original construction of this stress was rather intricate; see Sections 7-10 of Isett’s paper for the technical details. However, with improved estimates, a simpler construction was used in a subsequent paper of Buckmaster, De Lellis, Szekelyhidi, and Vicol, leading to a simplified proof of (the non-endpoint version of) this direction of Onsager’s conjecture.

Matt StrasslerBreaking a Little New Ground at the Large Hadron Collider

Today, a small but intrepid band of theoretical particle physicists (professor Jesse Thaler of MIT, postdocs Yotam Soreq and Wei Xue of CERN, Harvard Ph.D. student Cari Cesarotti, and myself) put out a paper that is unconventional in two senses. First, we looked for new particles at the Large Hadron Collider in a way that hasn’t been done before, at least in public. And second, we looked for new particles at the Large Hadron Collider in a way that hasn’t been done before, at least in public.

And no, there’s no error in the previous paragraph.

1) We used a small amount of actual data from the CMS experiment, even though we’re not ourselves members of the CMS experiment, to do a search for a new particle. Both ATLAS and CMS, the two large multipurpose experimental detectors at the Large Hadron Collider [LHC], have made a small fraction of their proton-proton collision data public, through a website called the CERN Open Data Portal. Some experts, including my co-authors Thaler, Xue and their colleagues, have used this data (and the simulations that accompany it) to do a variety of important studies involving known particles and their properties. [Here’s a blog post by Thaler concerning Open Data and its importance from his perspective.] But our new study is the first to look for signs of a new particle in this public data. While our chances of finding anything were low, we had a larger goal: to see whether Open Data could be used for such searches. We hope our paper provides some evidence that Open Data offers a reasonable path for preserving priceless LHC data, allowing it to be used as an archive by physicists of the post-LHC era.

2) Since only a tiny fraction of CMS’s data was available to us, about 1% by some counts, how could we have done anything useful compared to what the LHC experts have already done? Well, that’s why we examined the data in a slightly unconventional way (one of several methods that I’ve advocated for many years, but that has not been used in any public study). This allowed us to explore some ground that no one had yet swept clean, and even gave us a tiny chance of an actual discovery! But the larger scientific goal, absent a discovery, was to prove the value of this unconventional strategy, in hopes that the experts at CMS and ATLAS will use it (and others like it) in future. Their chance of discovering something new, using their full data set, is vastly greater than ours ever was.

Now don’t all go rushing off to download and analyze terabytes of CMS Open Data; you’d better know what you’re getting into first. It’s worthwhile, but it’s not easy going. LHC data is extremely complicated, and until this project I’ve always been skeptical that it could be released in a form that anyone outside the experimental collaborations could use. Downloading the data and turning it into a manageable form is itself a major task. Then, while studying it, there are an enormous number of mistakes that you can make (and we made quite a few of them) and you’d better know how to make lots of cross-checks to find your mistakes (which, fortunately, we did know; we hope we found all of them!) The CMS personnel in charge of the Open Data project were enormously helpful to us, and we’re very grateful to them; but since the project is new, there were inevitable wrinkles which had to be worked around. And you’d better have some friends among the experimentalists who can give you advice when you get stuck, or point out aspects of your results that don’t look quite right. [Our thanks to them!]

All in all, this project took us two years! Well, honestly, it should have taken half that time — but it couldn’t have taken much less than that, with all we had to learn. So trying to use Open Data from an LHC experiment is not something you do in your idle free time.

Nevertheless, I feel it was worth it. At a personal level, I learned a great deal more about how experimental analyses are carried out at CMS, and by extension, at the LHC more generally. And more importantly, we were able to show what we’d hoped to show: that there are still tremendous opportunities for discovery at the LHC, through the use of (even slightly) unconventional model-independent analyses. It’s a big world to explore, and we took only a small step in the easiest direction, but perhaps our efforts will encourage others to take bigger and more challenging ones.

For those readers with greater interest in our work, I’ll put out more details in two blog posts over the next few days: one about what we looked for and how, and one about our views regarding the value of open data from the LHC, not only for our project but for the field of particle physics as a whole.

Scott AaronsonFour updates

A few weeks ago, I was at QIP’2019 in Boulder, CO. This week I was at SQuInT’2019 in Albuquerque, NM. There were lots of amazing talks—feel free to ask about them in the comments section.

There’s an interview with me at the website “GigaOm,” conducted by Byron Reese and entitled Quantum Computing: Capabilities and Limits. I didn’t proofread the transcript and it has some errors in it, but hopefully the meaning comes through. In other interview news, if you were interested in my podcast with Adam Ford in Melbourne but don’t like YouTube, Adam has helpfully prepared transcripts of the two longest segments: The Ghost in the Quantum Turing Machine and The Winding Road to Quantum Supremacy.

The New York Times ran an article entitled The Hard Part of Computer Science? Getting Into Class, about the surge in computer science majors all over the US, and the shortage of professors to teach them. The article’s go-to example of a university where this is happening is UT Austin, and there’s extensive commentary from my department chair, Don Fussell.

The STOC’2019 accepted papers list is finally out. Lots of cool stuff!

February 12, 2019

Mark Chu-CarrollThe Math of Vaccinations, Infection Rates, and Herd Immunity

Here in the US, we are, horribly, in the middle of a measles outbreak. And, as usual, anti-vaccine people are arguing that:

  • Measles isn’t really that serious;
  • Unvaccinated children have nothing to do with the outbreak; and
  • More vaccinated people are being infected than unvaccinated, which shows that vaccines don’t help.

A few years back, I wrote a post about the math of vaccines; it seems like this is a good time to update it.

When it comes to vaccines, there are two things that a lot of people don’t understand. One is herd immunity; the other is probability of infection.

Herd immunity is the fundamental concept behind vaccines.

In an ideal world, a person who’s been vaccinated against a disease would have no chance of catching it. But the real world isn’t ideal, and vaccines aren’t perfect. What a vaccine does is prime the recipient’s immune system in a way that reduces the probability that they’ll be infected.

But even if a vaccine for an illness were perfect, and everyone was vaccinated, that wouldn’t mean that it was impossible for anyone to catch the illness. There are many people whose immune systems are compromised – people with diseases like AIDS, or people with cancer receiving chemotherapy. (Or people who’ve had the measles within the previous two years!) And that’s not considering the fact that there are people who, for legitimate medical reasons, cannot be vaccinated!

So individual immunity, provided by vaccines, isn’t enough to completely eliminate the spread of a contagious illness. To prevent outbreaks, we rely on an emergent property of a vaccinated population. If enough people are immune to the disease, then even if one person gets infected with it, the disease won’t be able to spread enough to produce a significant outbreak.

We can demonstrate this with some relatively simple math.

Let’s imagine a case of an infectious disease. For illustration purposes, we’ll simplify things in a way that makes the outbreak more likely to spread than in reality. (This also makes herd immunity harder to attain than it is in reality.)

  • There’s a vaccine that’s 95% effective: out of every 100 people vaccinated against the disease, 95 are perfectly immune; the remaining 5 have no immunity at all.
  • The disease is highly contagious: out of every 100 non-immune people who are exposed to the disease, 95 will be infected.

If everyone is immunized, but one person becomes ill with the disease, how many people do they need to expose to the disease for the disease to spread?

Keeping things simple: an outbreak, by definition, is a situation where the number of infected people is steadily increasing. That can only happen if every sick person, on average, infects more than 1 other person with the illness. If that happens, then the rate of infection can grow exponentially, turning into an outbreak.

In our scheme here, only one out of 20 people is infectable – so, on average, if our infected person has enough contact with 20 people to pass an infection, then there’s a 95% chance that they’d pass the infection on to one other person. (19 of 20 are immune; the one remaining person has a 95% chance of getting infected). To get to an outbreak level – that is, a level where they’re probably going to infect more than one other person – they’d need to expose somewhere around 25 people (which would mean that each infected person, on average, could infect roughly 1.2 people). If they’re exposed to 20 other people on average, then on average, each infected person will infect roughly 0.9 other people – so the number of infected will decrease without turning into a significant outbreak.

But what will happen if just 5% of the population doesn’t get vaccinated? Then we’ve got 95% of the population getting vaccinated, with a 95% immunity rate – so roughly 90% of the population has vaccine immunity. Our pool of non-immune people has doubled. In our example scenario, if each person is exposed to 20 other people during their illness, then they will, on average, cause 1.8 people to get sick. And so we have a major outbreak on our hands!
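The arithmetic above is easy to reproduce. Here is a minimal sketch (my own illustration, using the post’s simplified numbers) of the effective reproduction number – the average number of people each sick person goes on to infect:

```python
# Average number of new infections per sick person, in the simplified model:
# contacts * (fraction of the population not immune) * (chance of infection).

def r_effective(coverage, efficacy=0.95, contacts=20, p_infect=0.95):
    immune_fraction = coverage * efficacy
    return contacts * (1 - immune_fraction) * p_infect

print(r_effective(1.00))  # 0.95  - everyone vaccinated: infections die out
print(r_effective(0.95))  # ~1.85 - 5% unvaccinated: a major outbreak
```

Any value below 1 means the infection fizzles out; any value above 1 means exponential growth.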

This illustrates the basic idea behind herd immunity. If you can successfully make a large enough portion of the population non-infectable by a disease, then the disease can’t spread through the population, even though the population contains a large number of infectable people. When the population’s immunity rate (either through vaccine, or through prior infection) gets to be high enough that an infection can no longer spread, the population is said to have herd immunity: even individuals who can’t be immunized no longer need to worry about catching it, because the population doesn’t have the capacity to spread it around in a major outbreak.

(In reality, the effectiveness of the measles vaccine really is in the 95 percent range – actually slightly higher than that; various sources estimate it somewhere between 95 and 97 percent effective! And the success rate of the vaccine isn’t binary: 95% of people will be fully immune; the remaining 5% will have a varying degree of immunity. And the infectivity of most diseases is lower than the example above. Measles (which is a highly, highly contagious disease, far more contagious than most!) is estimated to infect between 80 and 90 percent of exposed non-immune people. So if enough people are immunized, herd immunity will take hold even if more than 20 people are exposed by every sick person.)

Moving past herd immunity to my second point: there’s a paradox that some antivaccine people (including, recently, Sharyl Attkisson) use in their arguments. If you look at an outbreak of an illness that we vaccinate for, you’ll frequently find that more vaccinated people become ill than unvaccinated. And that, the antivaccine people say, shows that the vaccines don’t work, and the outbreak can’t be the fault of the unvaccinated folks.

Let’s look at the math to see the problem with that.

Let’s use the same numbers as above: 95% vaccine effectiveness, 95% contagion. In addition, let’s say that 2% of people choose to go unvaccinated.

That means that 98% of the population has been immunized, and 95% of those are immune. So roughly 93% of the population has immunity.

If each infected person has contact with 20 other people, then we can expect about 7% of those 20 to be infectable – or 1.4; and of those, 95% will become ill – or about 1.3. So on average, each sick person will infect more than one other person. That’s enough to cause a significant outbreak. Without the non-immunized people, the infection rate is less than 1 – not enough to cause an outbreak.

The non-immunized population reduced the herd immunity enough to cause an outbreak.

Within the population, how many immunized versus non-immunized people will get sick?

Out of every 100 people, there are about 5 who got vaccinated but aren’t immune. Out of that same 100 people, there are 2 (2% of 100) who didn’t get vaccinated. If every non-immune person is equally likely to become ill, then we’d expect that out of 100 cases of the disease, about 70 will be vaccinated people, and 30 unvaccinated.

The vaccinated population is much, much larger – 50 times larger! – than the unvaccinated.
Since that population is so much larger, we’d expect more vaccinated people to become ill, even though it’s the smaller unvaccinated group that broke the herd immunity!

The easiest way to see that is to take those numbers, and normalize them into probabilities – that is, figure out, within the pool of all vaccinated people, what their likelihood of getting ill after exposure is, and compare that to the likelihood of a non-vaccinated person becoming ill after exposure.

So, let’s start with the vaccinated people. Let’s say that we’re looking at a population of 10,000 people total. 98% were vaccinated; 2% were not.

  • The total pool of vaccinated people is 9800, and the total pool of unvaccinated is 200.
  • Of the 9800 who were vaccinated, 95% of them are immune, leaving 5% who are not – so
    490 infectable people.
  • Of the 200 people who weren’t vaccinated, all of them are infectable.
  • If everyone is exposed to the illness, then we would expect about 466 of the vaccinated, and 190 of the unvaccinated to become ill.

So more than twice the number of vaccinated people became ill. But:

  • The odds of a vaccinated person becoming ill are 466/9800, or about 1 out of every 21
  • The odds of an unvaccinated person becoming ill are 190/200 or 19 out of every 20 people! (Note: there was originally a typo in this line, which was corrected after it was pointed out in the comments.)
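These counts and odds are straightforward to verify. Here is a short sketch (my own illustration) of the computation for the 10,000-person example:

```python
# Population of 10,000: 98% vaccinated with a 95%-effective vaccine, 2% not.
pop = 10_000
vaccinated = int(0.98 * pop)        # 9800
unvaccinated = pop - vaccinated     # 200

p_ill = 0.95                        # chance a non-immune person becomes ill
infectable_vax = vaccinated * 0.05  # 490 vaccinated but not immune

ill_vax = infectable_vax * p_ill    # ~466 vaccinated cases
ill_unvax = unvaccinated * p_ill    # 190 unvaccinated cases

print(round(ill_vax), round(ill_unvax))  # 466 190: more vaccinated cases in total
print(round(vaccinated / ill_vax))       # 21: about 1 in 21 vaccinated fall ill
print(ill_unvax / unvaccinated)          # 0.95: 19 out of 20 unvaccinated fall ill
```

The absolute counts favour the antivaccine talking point, but the per-person odds tell the real story.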

The numbers can, if you look at them without considering the context, appear to be deceiving. The population of vaccinated people is so much larger than the population of unvaccinated that the total number of infected can give the wrong impression. But the facts are very clear: vaccination drastically reduces an individual’s chance of getting ill; and vaccinating the entire population dramatically reduces the chances of an outbreak.

The reality of vaccines is pretty simple.

  • Vaccines are highly effective.
  • The diseases that vaccines prevent are not benign.
  • Vaccines are really, really safe. None of the horror stories told by anti-vaccine people have any basis in fact. Vaccines don’t damage your immune system, they don’t cause autism, and they don’t cause cancer.
  • Not vaccinating your children (or yourself!) doesn’t just put you at risk for illness; it dramatically increases the chances of other people becoming ill. Even when more vaccinated people than unvaccinated become ill, that’s largely caused by the unvaccinated population.

In short: everyone who is healthy enough to be vaccinated should get vaccinated. If you don’t, you’re a despicable free-riding asshole who’s deliberately choosing to put not just yourself but other people at risk.

Robert HellingBohmian Rapsody

Visits to a Bohmian village

Over all of my physics life, I have been under the local influence of some Gaulish villages that have ideas about physics that are not 100% aligned with the mainstream views: When I was a student in Hamburg, I was good friends with people working on algebraic quantum field theory. Of course there were opinions that they were the only people seriously working on QFT, as they were proving theorems while others dealt only with perturbative series that are known to diverge and are thus obviously worthless. Funnily enough, they were literally sitting above the HERA tunnel, where electron-proton collisions took place that were very well described by exactly those divergent series. Still, I learned a lot from these people and would say there are few who have thought more deeply about structural properties of quantum physics. These days, I use more and more of these things in my own teaching (in particular in our Mathematical Quantum Mechanics and Mathematical Statistical Physics classes, as well as when thinking about foundations, see below) and even some other physicists are starting to use their language.

Later, as a PhD student at the Albert Einstein Institute in Potsdam, there was an accumulation point of people from the Loop Quantum Gravity community, with Thomas Thiemann and Renate Loll having long term positions and many others frequently visiting. As you probably know, a bit later, I decided (together with Giuseppe Policastro) to look into this more deeply, resulting in a series of papers that were well received at least amongst our peers and about which I am still a bit proud.

Now, I have been in Munich for over ten years. And here at the LMU math department there is a group calling themselves the Workgroup Mathematical Foundations of Physics. And let's be honest, I call them the Bohmians (and sometimes the Bohemians). And once more, most people believe that the Bohmian interpretation of quantum mechanics is just a fringe approach that is not worth wasting any time on. You will have already guessed it: I did so none the less. So here is a condensed report of what I learned and what I think should be the official opinion on this approach. This is an informal write up of a notes paper that I put on the arXiv today.

What Bohmians don't like about the usual (termed Copenhagen, lacking a better word) approach to quantum mechanics is that you are not allowed to talk about so many things, and that the observer plays such a prominent role by determining via a measurement what aspect is real and what is not. They think this is far too subjective. So rather, they want quantum mechanics to be about particles that are then allowed to follow trajectories.

"But we know this is impossible!" I hear you cry. So, let's see how this works. The key observation is that the Schrödinger equation for a Hamilton operator of the form kinetic term (possibly with magnetic field) plus potential term has a conserved current

$$j = \frac{1}{i}\left(\bar\psi\nabla\psi - (\nabla\bar\psi)\psi\right).$$

So as your probability density is $\rho=\bar\psi\psi$, you can think of that being made up of particles moving with a velocity field

$$v = j/\rho = 2\Im(\nabla \psi/\psi).$$

What this buys you is that if you have a bunch of particles that are initially distributed like the probability density and follow the flow of the velocity field, they will also later be distributed like $|\psi |^2$.
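To make the velocity field concrete, here is a tiny 1-D sketch (my own toy illustration in units $\hbar = 2m = 1$, not the entangled two-particle state of the paper): for a superposition of the two lowest particle-in-a-box modes, the Bohmian velocity vanishes while the wavefunction is real, and becomes nonzero once a relative phase develops.

```python
import cmath
import math

# Bohmian velocity v = 2*Im(psi'(x)/psi(x)) for a particle in a box on [0,1],
# in units hbar = 2m = 1, so the mode energies are E_n = (n*pi)**2.
# Toy state: equal superposition of the two lowest modes.

def psi(x, t):
    E1, E2 = math.pi ** 2, 4 * math.pi ** 2
    return (math.sin(math.pi * x) * cmath.exp(-1j * E1 * t)
            + math.sin(2 * math.pi * x) * cmath.exp(-1j * E2 * t))

def v(x, t, h=1e-6):
    dpsi = (psi(x + h, t) - psi(x - h, t)) / (2 * h)  # central difference
    return 2 * (dpsi / psi(x, t)).imag

print(v(0.3, 0.0))  # ~0: at t=0 the wavefunction is real, so nothing moves
print(v(0.3, 0.1))  # nonzero once the two modes have dephased
```

The same mechanism, in configuration space, drives the oscillation of the second particle in the two-particle example below.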

What is important is that they keep the Schrödinger equation intact. So everything that you can do with the original Schrödinger equation (i.e. everything) can be done in the Bohmian approach as well. If you set up your Hamiltonian to describe a double slit experiment, the Bohmian particles will flow nicely to the screen and arrange themselves in interference fringes (as the probability density does). So you will never come to a situation where any experimental outcome will differ from what the Copenhagen prescription predicts.

The price you have to pay, however, is that you end up with a very non-local theory: The velocity field lives in configuration space, so the velocity of every particle depends on the position of all other particles in the universe. I would say, this is already a show stopper (given what we know about quantum field theory whose raison d'être is locality) but let's ignore this aesthetic concern.

What got me into this business was the attempt to understand how set-ups like Bell's inequality and GHZ and the like work out, which are supposed to show that quantum mechanics cannot be classical (technically, that the state space cannot be described as local probability densities). The problem with those is that they are often phrased in terms of spin degrees of freedom, which have Hamiltonians that are not directly of the form above. You can use a Stern-Gerlach-type apparatus to translate the spin degree of freedom to a positional one, but at the price of a Hamiltonian that is not explicitly known, let alone one for which you can analytically solve the Schrödinger equation. So you don't see much.

But from Reinhard Werner and collaborators I learned how to set up qubit-like algebras from positional observables of free particles (at different times, so you get something non-commuting, which you need to make use of entanglement as a specific quantum resource). So here is my favourite example:

You start with two particles each following a free time evolution but confined to an interval. You set those up in a particular entangled state (stationary as it is an eigenstate of the Hamiltonian) built from the two lowest levels of the particle in the box. And then you observe for each particle if it is in the left or the right half of the interval.

From symmetry considerations (details in my paper) you can see that each particle is with the same probability on the left and the right. But they are anti-correlated when measured at the same time. But when measured at different times, the correlation oscillates like the cosine of the time difference.

From the Bohmian perspective, for the static initial state, the velocity field vanishes everywhere, nothing moves. But in order to capture the time dependent correlations, as soon as one particle has been measured, the position of the second particle has to oscillate in the box (how the measurement works in detail is not specified in the Bohmian approach since it involves other degrees of freedom and remember, everything depends on everything but somehow it has to work since you want to produce the correlations that are predicted by the Copenhagen approach).

The trajectory of the second particle depending on its initial position

This is somehow the Bohmian version of the collapse of the wave function but they would never phrase it that way.

And here is where it becomes problematic: If you could see the Bohmian particle moving you could decide if the other particle has been measured (it would oscillate) or not (it would stand still). No matter where the other particle is located. With this observation you could build a telephone that transmits information instantaneously, something that should not exist. So you have to conclude you must not be able to look at the second particle and see if it oscillates or not.

Bohmians tell you that you cannot, because all you are supposed to observe about the particles is their positions (and not their velocities). And if you try to measure the velocity by measuring the position at two instants in time, you fail, because the first observation disturbs the particle so much that it invalidates the original state.

As it turns out, you are not allowed to observe anything about the particles beyond the fact that they are distributed like $|\psi |^2$, because if you could, you could build a similar (at least statistical) telephone, as I explain in the paper. (This fact is known in the Bohmian literature, but I have found it nowhere so clearly demonstrated as in this two-particle system.)

My conclusion is that the Bohmian approach adds something (the particle positions) to the wave function, but then in the end tells you that you are not allowed to observe it, or to have any knowledge of it beyond what is already encoded in the wave function. It's like making up an invisible friend.

PS: If you haven't seen "Bohemian Rhapsody", yet, you should, even if there are good reasons to criticise the dramatisation of real events.

February 11, 2019

BackreactionA philosopher of science reviews “Lost in Math”

Jeremy Butterfield is a philosopher of science in Cambridge. I previously wrote about some of his work here, and have met him on various occasions. Butterfield recently reviewed my book “Lost in Math,” and you can now find this review online here. (I believe it was solicited for a journal named Physics in Perspective.) His is a very detailed review that focuses, unsurprisingly, on the

Doug NatelsonMore brief items

Some additional interesting links:

  • Another example of emergent universal behavior, as it is demonstrated that runners at the start of a marathon seem to collectively obey hydrodynamics, like a fluid.
  • The Voices of the Manhattan Project oral histories effort has a large number of interviews online.  It’s important for posterity that these were recorded before everyone involved is gone.
  • Maybe massive open online courses were not, in fact, the end of the traditional model of higher education.  Who could have foreseen this?
  • There are people arguing that the preprint arxiv model is a good path toward open access.  This is definitely something I like, especially compared with models in which authors pay thousands of dollars to for-profit publishers for open-access journals.

February 10, 2019

Terence TaoTwo events

Just a quick post to advertise two upcoming events sponsored by institutions I am affiliated with:

  1. The 2019 National Math Festival will be held in Washington D.C. on May 4 (together with some satellite events at other US cities).  This festival will have numerous games, events, films, and other activities, which are all free and open to the public.  (I am on the board of trustees of MSRI, which is one of the sponsors of the festival.)
  2. The Institute for Pure and Applied Mathematics (IPAM) is now accepting applications for its second Industrial Short Course for May 16-17 2019, with the topic of “Deep Learning and the Latest AI Algorithms“.  (I serve on the Scientific Advisory Board of this institute.)  This is an intensive course (in particular requiring active participation) aimed at industrial mathematicians involving both the theory and practice of deep learning and neural networks, taught by Xavier Bresson.   (Note: space is very limited, and there is also a registration fee of $2,000 for this course, which is expected to be in high demand.)

David Hoggstellar activity cycles

Ben Montet (Chicago) was visiting Flatiron today and gave a very nice talk. He gave a full review of exoplanet discovery science but focused on a few specific things. One of them was stellar activity: He has been using the NASA Kepler full-frame images (which were taken once per month, roughly) to look at precise stellar photometric variations over long timescales (because standard transit techniques filter out the long time scales, and they are hard to recover without full-frame images). He can see stellar activity cycles in many stars, and look at their relationships with things like stellar rotation periods (and hence ages) and so on. He does find relationships! The nice thing is that the NASA TESS Mission produces full-frame images every 30 minutes, so it has way more data relevant to these questions, although it doesn't observe (most of) the sky for very long. All these things are highly relevant to the things I have been thinking about for Terra Hunting Experiment and related projects, a point he made clearly in his talk.

February 09, 2019

BackreactionWhy a larger particle collider is not currently a good investment

LHC tunnel. Credits: CERN. That a larger particle collider is not currently a good investment is hardly a controversial position. While the costs per units of collision-energy have decreased over the decades thanks to better technology, the absolute cost of new machines has shot up. That the costs of larger particle colliders would at some point become economically prohibitive has been known

February 08, 2019

Matt von HippelThe Particle Physics Curse of Knowledge

There’s a debate raging right now in particle physics, about whether and how to build the next big collider. CERN’s Future Circular Collider group has been studying different options, some more expensive and some less (Peter Woit has a nice summary of these here). This year, the European particle physics community will debate these proposals, deciding whether to include them in an updated European Strategy for Particle Physics. After that, it will be up to the various countries that are members of CERN to decide whether to fund the proposal. With the costs of the more expensive options hovering around $20 billion, this has led to substantial controversy.

I’m not going to offer an opinion here one way or another. Weighing this kind of thing requires knowing the alternatives: what else the European particle physics community might lobby for in the next few years, and once they decide, what other budget priorities each individual country has. I know almost nothing about either.

Instead of an opinion, I have an observation:

Imagine that primatologists had proposed a $20 billion primate center, able to observe gorillas in greater detail than ever before. The proposal might be criticized in any number of ways: there could be much cheaper ways to accomplish the same thing, the project might fail, it might be that we simply don’t care enough about primate behavior to spend $20 billion on it.

What you wouldn’t expect is the claim that a $20 billion primate center would teach us nothing new.

It probably wouldn’t teach us “$20 billion worth of science”, whatever that means. But a center like that would be guaranteed to discover something. That’s because we don’t expect primatologists’ theories to be exact. Even if gorillas behaved roughly as primatologists expected, the center would still see new behaviors, just as a consequence of looking at a new level of detail.

To pick a physics example, consider the gravitational wave telescope LIGO. Before their 2016 observation of two black holes merging, LIGO faced substantial criticism. After their initial experiments didn’t detect anything, many physicists thought that the project was doomed to fail: that it would never be sensitive enough to detect the faint signals of gravitational waves past the messy vibrations of everyday life on Earth.

When it finally worked, though, LIGO did teach us something new. Not the existence of gravitational waves, we already knew about them. Rather, LIGO taught us new things about the kinds of black holes that exist. LIGO observed much bigger black holes than astronomers expected, a surprise big enough that it left some people skeptical. Even if it hadn’t, though, we still would almost certainly observe something new: there’s no reason to expect astronomers to perfectly predict the size of the universe’s black holes.

Particle physics is different.

I don’t want to dismiss the work that goes into collider physics (far too many people have dismissed it recently). Much, perhaps most, of the work on the LHC is dedicated not to detecting new particles, but to confirming and measuring the Standard Model. A new collider would bring heroic scientific effort. We’d learn revolutionary new things about how to build colliders, how to analyze data from colliders, and how to use the Standard Model to make predictions for colliders.

In the end, though, we expect those predictions to work. And not just to work reasonably well, but to work perfectly. While we might see something beyond the Standard Model, the default expectation is that we won’t, that after doing the experiments and analyzing the data and comparing to predictions we’ll get results that are statistically indistinguishable from an equation we can fit on a T-shirt. We’ll fix the constants on that T-shirt to an unprecedented level of precision, yes, but the form of the equation may well stay completely the same.

I don’t think there’s another field where that’s even an option. Nowhere else in all of science could we observe the world in unprecedented detail, capturing phenomena that had never been seen before…and end up perfectly matching our existing theory. There’s no other science where anyone would even expect that to happen.

That makes the argument here different from any argument we’ve faced before. It forces people to consider their deep priorities, to think not just about the best way to carry out this test or that but about what science is supposed to be for. I don’t think there are any easy answers. We’re in what may well be a genuinely new situation, and we have to figure out how to navigate it together.

Postscript: I still don’t want to give an opinion, but given that I didn’t have room for this above let me give a fragment of an opinion: Higgs triple couplings!!!

David Hoggmeasuring the Galaxy with HARPS? de-noising

Megan Bedell (Flatiron) was at Yale yesterday; people there pointed out that some of the time-variable telluric lines we see in our wobble model of the HARPS data are not telluric at all; they are in fact interstellar medium lines. That got her thinking: Could we measure our velocity with respect to the local ISM using HARPS? The answer is obviously yes, and this could have strong implications for the Milky Way rotation curve! The signal should be a dipolar pattern of RV shifts in interstellar lines as you look around the Sun in celestial coordinates. In the barycentric reference frame, of course.

I also got great news first thing this morning: The idea that Soledad Villar (NYU) and I discussed yesterday about using a generative adversarial network trained on noisy data to de-noise noisy data was a success: It works! Of course, being a mathematician, her reaction was “I think I can prove something!” Mine was: Let's start using it! Probably the mathematical reaction is the better one. If we move on this it will be my first ever real foray into deep learning.

February 07, 2019

Tommaso DorigoCMS Discovers New Excited Bc Hadron

I am very happy to report today that the CMS experiment has just proved itself an excellent spectrometer - as good as they get, I would say - by discovering two new excited B hadrons. The field of heavy-meson spectroscopy proves once again to be rich with new gems ready to be unearthed as we collect more data and dig deeper. For such discoveries to be made, collecting as many proton-proton collisions as possible is in fact the decisive factor, along with following up good ideas and leaving no stone unturned.


Terence TaoRequest for comments from the ICM Structure Committee

[This post is collectively authored by the ICM structure committee, whose membership includes myself, and is listed in full in the post below – T.]

The International Congress of Mathematicians (ICM) is widely considered to be the premier conference for mathematicians.  It is held every four years; for instance, the 2018 ICM was held in Rio de Janeiro, Brazil, and the 2022 ICM is to be held in Saint Petersburg, Russia.  The most high-profile event at the ICM is the awarding of the 10 or so prizes of the International Mathematical Union (IMU) such as the Fields Medal, and the lectures by the prize laureates; but there are also approximately twenty plenary lectures from leading experts across all mathematical disciplines, several public lectures of a less technical nature, about 180 more specialised invited lectures divided into about twenty section panels, each corresponding to a mathematical field (or range of fields), as well as various outreach and social activities, exhibits and satellite programs, and meetings of the IMU General Assembly; see for instance the program for the 2018 ICM for a sample schedule.  In addition to these official events, the ICM also provides more informal networking opportunities, in particular allowing mathematicians at all stages of career, and from all backgrounds and nationalities, to interact with each other.

For each Congress, a Program Committee (together with subcommittees for each section) is entrusted with the task of selecting who will give the lectures of the ICM (excluding the lectures by prize laureates, which are selected by separate prize committees); they also have decided how to appropriately subdivide the entire field of mathematics into sections.   Given the prestigious nature of invitations from the ICM to present a lecture, this has been an important and challenging task, but one that past Program Committees have managed to fulfill in a largely satisfactory fashion.

Nevertheless, in the last few years there has been substantial discussion regarding ways in which the process for structuring the ICM and inviting lecturers could be further improved, for instance to reflect the fact that the distribution of mathematics across various fields has evolved over time.   At the 2018 ICM General Assembly meeting in Rio de Janeiro, a resolution was adopted to create a new Structure Committee to take on some of the responsibilities previously delegated to the Program Committee, focusing specifically on the structure of the scientific program.  On the other hand, the Structure Committee is not involved with the format for prize lectures, the selection of prize laureates, or the selection of plenary and sectional lecturers; these tasks are instead the responsibilities of other committees (the local Organizing Committee, the prize committees, and the Program Committee respectively).

The first Structure Committee was constituted on 1 Jan 2019, with the following members:

As one of our first actions, we on the committee are using this blog post to solicit input from the mathematical community regarding the topics within our remit.  Among the specific questions (in no particular order) for which we seek comments are the following:

  1. Are there suggestions to change the format of the ICM that would increase its value to the mathematical community?
  2. Are there suggestions to change the format of the ICM that would encourage greater participation and interest in attending, particularly with regards to junior researchers and mathematicians from developing countries?
  3. What is the correct balance between research and exposition in the lectures?  For instance, how strongly should one emphasize the importance of good exposition when selecting plenary and sectional speakers?  Should there be “Bourbaki style” expository talks presenting work not necessarily authored by the speaker?
  4. Is the balance between plenary talks, sectional talks, and public talks at an optimal level?  There is only a finite amount of space in the calendar, so any increase in the number or length of one of these types of talks will come at the expense of another.
  5. The ICM is generally perceived to be more important to pure mathematics than to applied mathematics.  In what ways can the ICM be made more relevant and attractive to applied mathematicians, or should one not try to do so?
  6. Are there structural barriers that cause certain areas or styles of mathematics (such as applied or interdisciplinary mathematics) or certain groups of mathematicians to be under-represented at the ICM?  What, if anything, can be done to mitigate these barriers?

Of course, we do not expect these complex and difficult questions to be resolved within this blog post, and debating these and other issues would likely be a major component of our internal committee discussions.  Nevertheless, we would value constructive comments towards the above questions (or on other topics within the scope of our committee) to help inform these subsequent discussions.  We therefore welcome and invite such commentary, either as responses to this blog post, or sent privately to one of the members of our committee.  We would also be interested in having readers share their personal experiences at past congresses, and how it compares with other major conferences of this type.   (But in order to keep the discussion focused and constructive, we request that comments here refrain from discussing topics that are out of the scope of this committee, such as suggesting specific potential speakers for the next congress, which is a task instead for the 2022 ICM Program Committee.)

John BaezApplied Category Theory 2019

I hope to see you at this conference, which will occur right before the associated school meets in Oxford:

Applied Category Theory 2019, July 15-19, 2019, Oxford, UK.

Applied category theory is a topic of interest for a growing community of researchers, interested in studying systems of all sorts using category-theoretic tools. These systems are found in the natural sciences and social sciences, as well as in computer science, linguistics, and engineering. The background and experience of our members is as varied as the systems being studied. The goal of the ACT2019 Conference is to bring the majority of researchers in the field together and provide a platform for exposing the progress in the area. Both original research papers and extended abstracts of work submitted/accepted/published elsewhere will be considered.

There will be best paper award(s) and selected contributions will be awarded extended keynote slots.

The conference will include a business showcase and tutorials, and there also will be an adjoint school, the following week (see webpage).

Important dates

Submission of contributed papers: 3 May
Acceptance/Rejection notification: 7 June


Prospective speakers are invited to submit one (or more) of the following:

• Original contributions of high quality work consisting of a 5-12 page extended abstract that provides sufficient evidence of results of genuine interest and enough detail to allow the program committee to assess the merits of the work. Submissions of works in progress are encouraged but must be more substantial than a research proposal.

• Extended abstracts describing high quality work submitted/published elsewhere will also be considered, provided the work is recent and relevant to the conference. These consist of a maximum 3 page description and should include a link to a separate published paper or preprint.

The conference proceedings will be published in a dedicated Proceedings issue of the new Compositionality journal:

Only original contributions are eligible to be published in the proceedings.

Submissions should be prepared using LaTeX, and must be submitted in PDF format. Use of the Compositionality style is encouraged. Submission is done via EasyChair:

Program chairs

John Baez (U.C. Riverside)
Bob Coecke (University of Oxford)

Program committee

Bob Coecke (chair)
John Baez (chair)
Christina Vasilakopoulou
David Moore
Josh Tan
Stefano Gogioso
Brendan Fong
Steve Lack
Simona Paoli
Joachim Kock
Kathryn Hess Bellwald
Tobias Fritz
David I. Spivak
Ross Duncan
Dan Ghica
Valeria de Paiva
Jeremy Gibbons
Samuel Mimram
Aleks Kissinger
Jamie Vicary
Martha Lewis
Nick Gurski
Dusko Pavlovic
Chris Heunen
Corina Cirstea
Helle Hvid Hansen
Dan Marsden
Simon Willerton
Pawel Sobocinski
Dominic Horsman
Nina Otter
Miriam Backens

Steering committee

John Baez (U.C. Riverside)
Bob Coecke (University of Oxford)
David Spivak (M.I.T.)
Christina Vasilakopoulou (U.C. Riverside)

n-Category Café Applied Category Theory 2019

I hope to see you at this conference!


February 06, 2019

Clifford JohnsonAt the Perimeter

In case you were putting the kettle on to make tea for watching the live cast, or putting on your boots to head out to see it in person: my public talk at the Perimeter Institute has been postponed to tomorrow! It'll be just as graphic! Here's a link to the event's details.

-cvj

The post At the Perimeter appeared first on Asymptotia.

Jordan EllenbergLarge-scale Pareto-optimal topologies, or: how to describe a hexahedron

I got to meet Karen Caswelch, the CEO of Madison startup SciArtSoft last week. The company is based on tech developed by my colleague Krishnan Suresh. When I looked at one of his papers about this stuff I was happy to find there was a lovely piece of classical solid geometry hidden in it!

Here’s the deal. You want to build some component out of metal, which metal is to be contained in a solid block. So you can think of the problem as: you start with a region V in R^3, and your component is going to be some subregion W in R^3. For each choice of W there’s some measure of “compliance” which you want to minimize; maybe it’s fragility, maybe it’s flexibility, I dunno, depends on the problem. (Sidenote: I think lay English speakers would want “compliance” to refer to something you’d like to maximize, but I’m told this usage is standard in engineering.) (Subsidenote: I looked into this and now I get it — compliance literally refers to flexibility; it is the inverse of stiffness, just like in the lay sense. If you’re a doctor you want your patient to comply to their medication schedule, thus bending to outside pressure, but bending to outside pressure is precisely what you do not want your metal widget to do.)

So you want to minimize compliance, but you also want to minimize the weight of your component, which means you want vol(W) to be as small as possible. These goals are in conflict. Little lacy structures are highly compliant.

It turns out you can estimate compliance by breaking W up into a bunch of little hexahedral regions, computing compliance on each one, and summing. For reasons beyond my knowledge you definitely don’t want to restrict to chopping uniformly into cubes. So a priori you have millions and millions of differently shaped hexahedra. And part of the source of Suresh’s speedup is to gather these into approximate congruence classes so you can do a compliance computation for a whole bunch of nearly congruent hexahedra at once. And here’s where the solid geometry comes in; an old theorem of Cauchy tells you that if you know what a convex polyhedron’s 1-skeleton looks like as a graph, and you know the congruence classes of all the faces, you know the polyhedron up to rigid motion. In particular, you can just triangulate each face of the hexahedron with a diagonal, and record the congruence class by 18 numbers, which you can then record in a hash table. You sort the hashes and then you can instantly see your equivalence classes of hexahedra.

(Related: the edge lengths of a tetrahedron determine its volume but the areas of the faces don’t.)
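
The hashing step can be sketched in a few lines (a minimal illustration, not Suresh's actual code; the vertex ordering, choice of face diagonals, and tolerance are all my assumptions):

```python
import numpy as np

# Vertex ordering assumption: 0-3 form the bottom face in cyclic order,
# 4-7 the top face directly above. Each quadrilateral face is triangulated
# by one fixed diagonal, giving 12 edges + 6 diagonals = 18 lengths.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0),      # bottom face
         (4, 5), (5, 6), (6, 7), (7, 4),      # top face
         (0, 4), (1, 5), (2, 6), (3, 7)]      # vertical edges
FACE_DIAGONALS = [(0, 2), (4, 6), (0, 5), (1, 6), (2, 7), (3, 4)]

def congruence_key(vertices, tol=1e-6):
    """18 rounded lengths. By Cauchy's rigidity theorem, convex hexahedra with
    the same combinatorics and the same key are congruent, so they can share a
    single compliance computation."""
    v = np.asarray(vertices, dtype=float)
    lengths = (np.linalg.norm(v[i] - v[j]) for i, j in EDGES + FACE_DIAGONALS)
    return tuple(round(length / tol) for length in lengths)

# Nearly congruent hexahedra hash to the same key, so a dict buckets them.
cube = np.array([(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
                 (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)], dtype=float)
```

A translated or rotated copy of `cube` produces the same key, while a rescaled copy does not, which is exactly the bucketing behavior described above.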

Doug NatelsonBrief items

This is the absolute most dense time of the year in terms of administrative obligations, so posting is going to be a bit sparse.  In the meantime, here is a bit of interesting reading:

Scientific American has an interesting article about the fact that two independent means of assessing the Hubble constant (analysis of the cosmic microwave background on one hand; analysis of "standard candles" on the other) disagree well outside the estimated systematic uncertainties.

Kip Thorne posted a biographical reminiscence about John Wheeler on the arxiv.  I haven't read it yet, but it's in my queue.

Quanta Magazine has put up a very well done article about turbulence.  Good stuff.  I liked the animation.

John BaezFermat Primes and Pascal’s Triangle

If you take the entries of Pascal’s triangle mod 2 and draw black for 1 and white for 0, you get a pleasing pattern:

The 2^nth row consists of all 1’s. If you look at the triangle consisting of the first 2^n rows, and take the limit as n \to \infty, you get a fractal called the Sierpinski gasket. This can also be formed by repeatedly cutting triangular holes out of an equilateral triangle:

Something nice happens if you interpret the rows of Pascal’s triangle mod 2 as numbers written in binary:

1 = 1
11 = 3
101 = 5
1111 = 15
10001 = 17
110011 = 51
1010101 = 85
11111111 = 255
100000001 = 257

Notice that some of these rows consist of two 1’s separated by a row of 0’s. These give the famous ‘Fermat numbers‘:

11 = 3 = 2^{2^0} + 1
101 = 5 = 2^{2^1} + 1
10001 = 17 = 2^{2^2} + 1
10000001 = 257 = 2^{2^3} + 1
1000000000000001 = 65537 = 2^{2^4} + 1

The numbers listed above are all prime. Based on this evidence Fermat conjectured that all numbers of the form 2^{2^n} + 1 are prime. But Euler crushed this dream by showing that the next Fermat number, 2^{2^5} + 1, is not prime.

Indeed, even today, no other Fermat numbers are known to be prime! People have checked all of them up to 2^{2^{32}} + 1. They’ve even checked a few bigger ones, the largest being

2^{2^{3329780}} + 1

which turns out to be divisible by

193 \times 2^{3329782} + 1

Here are some much easier challenges:

Puzzle 1. Show that every row of Pascal’s triangle mod 2 corresponds to a product of distinct Fermat numbers:

1 = 1
11 = 3
101 = 5
1111 = 15 = 3 × 5
10001 = 17
110011 = 51 = 3 × 17
1010101 = 85 = 5 × 17
11111111 = 255 = 3 × 5 × 17
100000001 = 257

and so on. Also show that every product of distinct Fermat numbers corresponds to a row of Pascal’s triangle mod 2. What is the pattern?

By the way: the first row, 1, corresponds to the empty product.

Puzzle 2. Show that the product of the first n Fermat numbers is 2 less than the next Fermat number:

3 + 2 = 5
3 × 5 + 2 = 17
3 × 5 × 17 + 2 = 257
3 × 5 × 17 × 257 + 2 = 65537

and so on.
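
This identity is quick to check numerically (a throwaway sketch; the helper name `fermat` is mine):

```python
def fermat(n):
    # the nth Fermat number, 2^(2^n) + 1
    return 2 ** (2 ** n) + 1

product = 1                               # empty product, for n = 0
for n in range(8):
    assert product + 2 == fermat(n)       # F_0 * ... * F_{n-1} is 2 less than F_n
    product *= fermat(n)

assert fermat(5) % 641 == 0               # Euler's factor of F_5
```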

Now, Gauss showed that we can construct a regular n-gon using straight-edge and compass if n is a prime Fermat number. Wantzel went further and showed that if n is odd, we can construct a regular n-gon using straight-edge and compass if and only if n is a product of distinct Fermat primes.

We can construct other regular polygons from these by repeatedly bisecting the angles. And it turns out that’s all:

Gauss–Wantzel Theorem. We can construct a regular n-gon using straight-edge and compass if and only if n is a power of 2 times a product of distinct Fermat primes.

There are only 5 known Fermat primes: 3, 5, 17, 257 and 65537. So, our options for constructing regular polygons with an odd number of sides are extremely limited! There are only 2^5 = 32 options, if we include the regular 1-gon.

Puzzle 3. What is a regular 1-gon? What is a regular 2-gon?

And, as noted in The Book of Numbers by Conway and Guy, the 32 constructible regular polygons with an odd number of sides correspond to the first 32 rows of Pascal’s triangle!

1 = 1
11 = 3
101 = 5
1111 = 15 = 3 × 5
10001 = 17
110011 = 51 = 3 × 17
1010101 = 85 = 5 × 17
11111111 = 255 = 3 × 5 × 17
100000001 = 257
1100000011 = 771 = 3 × 257
10100000101 = 1285 = 5 × 257
101010010101 = 3855 = 3 × 5 × 257

and so on. Here are all 32 rows, borrowed from the Online Encyclopedia of Integer Sequences:

And here are all 32 odd numbers n for which we know that a regular n-gon is constructible by straight-edge and compass:

1, 3, 5, 15, 17, 51, 85, 255, 257, 771, 1285, 3855, 4369, 13107, 21845, 65535, 65537, 196611, 327685, 983055, 1114129, 3342387, 5570645, 16711935, 16843009, 50529027, 84215045, 252645135, 286331153, 858993459, 1431655765, 4294967295
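
As a sanity check, a few lines of Python (a sketch using the standard-library `math.comb`) confirm that the first 32 rows of Pascal’s triangle mod 2, read as binary numerals, are exactly the 32 products of distinct known Fermat primes:

```python
from math import comb, prod
from itertools import combinations

def row_value(n):
    # read row n of Pascal's triangle mod 2 as a binary numeral
    return int(''.join(str(comb(n, k) % 2) for k in range(n + 1)), 2)

fermat_primes = [3, 5, 17, 257, 65537]
subset_products = {prod(s) for r in range(6)
                   for s in combinations(fermat_primes, r)}   # all 32 subsets

rows = {row_value(n) for n in range(32)}
assert rows == subset_products    # every row is a product of distinct Fermat primes
assert max(rows) == 4294967295    # 3 * 5 * 17 * 257 * 65537
```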

So, the largest known odd n for which a regular n-gon is constructible is 4294967295. This is the product of all 5 known Fermat primes:

4294967295 = 3 × 5 × 17 × 257 × 65537

Thanks to Puzzle 2, this is 2 less than the next Fermat number:

4294967295 = 2^{2^5} - 1

We can construct a regular polygon with one more side, namely

4294967296 = 2^{2^5}

sides, because this is a power of 2. But we can’t construct a regular polygon with one more side than that, namely

4294967297 = 2^{2^5} + 1

because Euler showed this Fermat number is not prime.

So, we’ve hit the end of the road… unless someone discovers another Fermat prime.

February 05, 2019

Terence TaoWord maps close to the identity

While talking mathematics with a postdoc here at UCLA (March Boedihardjo) we came across the following matrix problem which we managed to solve, but the proof was cute and the process of discovering it was fun, so I thought I would present the problem here as a puzzle without revealing the solution for now.

The problem involves word maps on a matrix group, which for sake of discussion we will take to be the special orthogonal group SO(3) of real 3 \times 3 matrices (one of the smallest matrix groups that contains a copy of the free group, which incidentally is the key observation powering the Banach-Tarski paradox).  Given any abstract word w of two generators x,y and their inverses (i.e., an element of the free group {\bf F}_2), one can define the word map w: SO(3) \times SO(3) \to SO(3) simply by substituting a pair of matrices in SO(3) into these generators.  For instance, if one has the word w = x y x^{-2} y^2 x, then the corresponding word map w: SO(3) \times SO(3) \to SO(3) is given by

\displaystyle w(A,B) := ABA^{-2} B^2 A

for A,B \in SO(3).  Because SO(3) contains a copy of the free group, we see the word map is non-trivial (not equal to the identity) if and only if the word itself is nontrivial.

Anyway, here is the problem:

Problem. Does there exist a sequence w_1, w_2, \dots of non-trivial word maps w_n: SO(3) \times SO(3) \to SO(3) that converge uniformly to the identity map?

To put it another way, given any \varepsilon > 0, does there exist a non-trivial word w such that \|w(A,B) - 1 \| \leq \varepsilon for all A,B \in SO(3), where \|\cdot\| denotes (say) the operator norm, and 1 denotes the identity matrix in SO(3)?

As I said, I don’t want to spoil the fun of working out this problem, so I will leave it as a challenge. Readers are welcome to share their thoughts, partial solutions, or full solutions in the comments below.
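For readers who want to experiment numerically before attacking the proof, here is a small sketch (mine, with ad hoc helper functions, not part of the puzzle) that evaluates the example word map w(A,B) = A B A^{-2} B^2 A on a grid of rotation pairs and measures how far from the identity it gets; this particular word strays far from the identity on some inputs.

```python
import math

def rot_z(t):  # rotation by angle t about the z-axis
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(t):  # rotation by angle t about the x-axis
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):  # for rotation matrices, the transpose is the inverse
    return [[A[j][i] for j in range(3)] for i in range(3)]

def word_map(A, B):
    # the example word w = x y x^{-2} y^2 x, i.e. w(A,B) = A B A^{-2} B^2 A
    Ainv = transpose(A)
    M = A
    for F in (B, Ainv, Ainv, B, B, A):
        M = matmul(M, F)
    return M

def dist_to_identity(M):
    # Frobenius norm of M - I (comparable to the operator norm in 3x3)
    return math.sqrt(sum((M[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(3) for j in range(3)))

angles = [0.1 * k for k in range(63)]  # a grid covering a full circle
worst = max(dist_to_identity(word_map(rot_z(s), rot_x(t)))
            for s in angles for t in angles)
print(worst)  # well above zero: this word is not uniformly close to the identity
```

Of course, the puzzle asks whether *some* cleverly chosen long word can be made uniformly small in this sense, which a finite experiment like this cannot settle.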

n-Category Café Jacobi Manifolds

Here at the conference Foundations of Geometric Structures of Information 2019, Aïssa Wade of Penn State gave a talk about Jacobi manifolds. She got my attention with these words: “Poisson geometry is a good framework for classical mechanics, while contact geometry is the right framework for classical thermodynamics. Jacobi manifolds are a natural bridge between these.”

So what’s a Jacobi manifold?

It’s really simple: a Jacobi manifold is a smooth manifold M such that the vector space C^\infty(M) is equipped with the structure of a Lie algebra

\{\cdot, \cdot \} \colon C^\infty(M) \times C^\infty(M) \to C^\infty(M)

and the bracket is ‘local’ in the following sense:

\mathrm{supp} \{f,g\} \subseteq \mathrm{supp}\, f \cap \mathrm{supp}\, g

The most famous Jacobi manifolds are the Poisson manifolds, where the Lie bracket obeys this extra rule:

\{f,g h\} = \{f,g\} h + g \{f, h\}

For any Jacobi manifold, the bracket can be written as

\{f, g\} = (f \, dg - g \, df)(v) + (df \wedge dg)(\Pi)

for some unique vector field v and bivector field \Pi. The fields v and \Pi need to obey some identities to ensure that the bracket obeys the Jacobi identity. If we’ve got a Jacobi manifold with v = 0, so that

\{f, g\} = (df \wedge dg)(\Pi)

then our Jacobi manifold is a Poisson manifold, and \Pi is called the Poisson bivector or Poisson tensor. Conversely, any Poisson manifold has

\{f, g\} = (df \wedge dg)(\Pi)

for some bivector field \Pi.

So, generalizing from Poisson manifolds to Jacobi manifolds amounts to allowing a nonzero vector field v in our formula for the bracket.
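To see the difference concretely, here is a small symbolic check of mine (not from the talk), using SymPy. On M = \mathbb{R}, take \Pi = 0 and v = d/dx, so the bracket is \{f,g\} = f g' - g f'. It satisfies antisymmetry and the Jacobi identity, but violates the Leibniz rule, so it is a Jacobi structure that is not Poisson.

```python
from sympy import Function, symbols, expand

x = symbols('x')
f, g, h = (Function(name)(x) for name in ('f', 'g', 'h'))

def bracket(a, b):
    # Jacobi bracket on M = R with v = d/dx and Pi = 0:
    # {a, b} = (a db - b da)(v) = a b' - b a'
    return a * b.diff(x) - b * a.diff(x)

# antisymmetry and the Jacobi identity hold...
antisym = expand(bracket(f, g) + bracket(g, f))
jacobi = expand(bracket(f, bracket(g, h))
                + bracket(g, bracket(h, f))
                + bracket(h, bracket(f, g)))
print(antisym, jacobi)  # both 0

# ...but the Leibniz rule fails, so this bracket is Jacobi, not Poisson:
leibniz_defect = expand(bracket(f, g * h)
                        - (bracket(f, g) * h + g * bracket(f, h)))
print(leibniz_defect)  # nonzero in general (it equals g h f')
```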

But Aïssa favors another way of thinking about Jacobi manifolds. Apparently a Jacobi manifold structure on M gives a way of making some principal \mathbb{R}^\ast-bundle over M into a Poisson manifold, where \mathbb{R}^\ast is the multiplicative group of the reals. The Poisson structures we get this way are homogeneous of degree -1 with respect to the \mathbb{R}^\ast action. I’m not sure I’ve got all the details right here, but she stated an equivalence of categories between Jacobi manifolds and something like “manifolds equipped with a principal \mathbb{R}^\ast-bundle equipped with a Poisson bracket that’s homogeneous of degree -1”.

This viewpoint sets up the connection to contact geometry. I understand how contact geometry is the right geometry for classical thermodynamics, and I understand how Poisson geometry is the right geometry for classical mechanics… so I should be almost ready to understand how Jacobi manifolds are a nice home for both these branches of physics!

But my insight quits right around here, so if you want more, I’m afraid you’ll have to try these papers, and the references therein:

By the way, Vitagliano and Wade are doing something a lot deeper than the simple stuff I’m discussing here. They are talking about holomorphic Jacobi manifolds, and answering the question “what is the global object whose infinitesimal counterpart is a holomorphic Jacobi manifold?” To understand this question you have to first realize that just as Lie algebras are infinitesimal versions of Lie groups, Poisson manifolds can be seen as infinitesimal versions of ‘symplectic groupoids’. This is a fascinating story! We can then generalize this idea to Jacobi manifolds… and then adapt it to holomorphic Jacobi manifolds. But I’m still interested in much more basic stuff, like: what are we doing if we use a Jacobi manifold rather than a Poisson manifold as the ‘phase space’ for a classical system?

BackreactionString theory landscape predicts no new particles at the LHC

In a paper that appeared on the arXiv last week, Howard Baer and collaborators predict masses of new particles using the string theory landscape. They argue that the Large Hadron Collider should not have seen them so far, and likely will not see them in the upcoming run. Instead, it would at least take an upgrade of the LHC to higher collision energy to see any. The idea underlying their

February 04, 2019

Scott AaronsonSabineblogging

I’ve of course been following the recent public debate about whether to build a circular collider to succeed the LHC—notably including Sabine Hossenfelder’s New York Times column arguing that we shouldn’t.  (See also the responses by Jeremy Bernstein and Lisa Randall, and the discussion on Peter Woit’s blog, and Daniel Harlow’s Facebook thread, and this Vox piece by Kelsey Piper.)  Let me blog about this as a way of cracking my knuckles or tuning my violin, just getting back into blog-shape after a long hiatus for travel and family and the beginning of the semester.

Regardless of whether this opinion is widely shared among my colleagues, I like Sabine.  I’ve often found her blogging funny and insightful, and I wish more non-Lubos physicists would articulate their thoughts for the public the way she does, rather than just standing on the sidelines and criticizing the ones who do. I find it unfortunate that some of the replies to Sabine’s arguments dwelled on her competence and “standing” in physics (even if we set aside—as we should—Lubos’s misogynistic rants, whose predictability could be used to calibrate atomic clocks). It’s like this: if high-energy physics had reached a pathological state of building bigger and bigger colliders for no good reason, then we’d expect that it would take a semi-outsider to say so in public, so then it wouldn’t be a further surprise to find precisely such a person doing it.

Not for the first time, though, I find myself coming down on the opposite side as Sabine. Basically, if civilization could get its act together and find the money, I think it would be pretty awesome to build a new collider to push forward the energy frontier in our understanding of the universe.

Note that I’m not making the much stronger claim that this is the best possible use of $20 billion for science. Plausibly a thousand $20-million projects could be found that would advance our understanding of reality by more than a new collider would. But it’s also important to realize that that’s not the question at stake here. When, for example, the US Congress cancelled the Superconducting Supercollider midway through construction—partly, it’s believed, on the basis of opposition from eminent physicists in other subfields, who argued that they could do equally important science for much cheaper—none of the SSC budget, as in 0% of it, ever did end up redirected to those other subfields. In practice, then, the question of “whether a new collider is worth it” is probably best considered in absolute terms, rather than relative to other science projects.

What I found most puzzling, in Sabine’s writings on this subject, was the leap in logic from

  1. many theorists expected that superpartners, or other new particles besides the Higgs boson, had a good chance of being discovered at the LHC, based on statistical arguments about “natural” parameter values, and
  2. the basic soundness of naturalness arguments was always open to doubt, and indeed the LHC results to date offer zero support for them, and
  3. many of the same theorists now want an even bigger collider, and continue to expect new particles to be found, and haven’t sufficiently reckoned with their previous failed predictions, to …
  4. therefore we shouldn’t build the bigger collider.

How do we get from 1-3 to 4: is the idea that we should punish the errant theorists, by withholding an experiment that they want, in order to deter future wrong predictions? After step 3, it seems to me that Sabine could equally well have gone to: and therefore it’s all the more important that we do build a new collider, in order to establish all the more conclusively that there’s just an energy desert up there—and that I, Sabine, was right to emphasize that possibility, and those other theorists were wrong to downplay it!

Like, I gather that there are independently motivated scenarios where there would be only the Higgs at the LHC scale, and then new stuff at the next energy scale beyond it. And as an unqualified outsider who enjoys talking to friends in particle physics and binge-reading about it, I’d find it hard to assign the totality of those scenarios less than ~20% credence or more than ~80%—certainly if the actual experts don’t either.

And crucially, it’s not as if raising the collision energy is just one arbitrary direction in which to look for new fundamental physics, among a hundred a-priori equally promising directions. Basically, there’s raising the collision energy and then there’s everything else. By raising the energy, you’re not testing one specific idea for physics beyond Standard Model, but a hundred or a thousand ideas in one swoop.

The situation reminds me a little of the quantum computing skeptics who say: scalable QC can never work, in practice and probably even in principle; the mainstream physics community only thinks it can work because of groupthink and hype; therefore, we shouldn’t waste more funds trying to make it work. With the sole, very interesting exception of Gil Kalai, none of the skeptics ever seem to draw what strikes me as an equally logical conclusion: whoa, let’s go full speed ahead with trying to build a scalable QC, because there’s an epochal revolution in physics to be had here—once the experimenters finally see that I was right and the mainstream was wrong, and they start to unravel the reasons why!

Of course, $20 billion is a significant chunk of change, by the standards of science even if not by the standards of random government wastages (like our recent $11 billion shutdown). And ultimately, decisions do need to be made about which experiments are most interesting to pursue with limited resources. And if a future circular collider were built, and if it indeed just found a desert, I think the balance would tilt pretty strongly toward Sabine’s position—that is, toward declining to build an even bigger and more expensive collider after that. If the Patriots drearily won every Superbowl 13-3, year after year after year, eventually no one would watch anymore and the Superbowl would get cancelled (well, maybe that will happen for other reasons…).

But it’s worth remembering that—correct me if I’m wrong—so far there have been no cases in the history of particle physics of massively expanding the energy frontier and finding absolutely nothing new there (i.e., nothing that at least conveyed multiple bits of information, as the Higgs mass did). And while my opinion should count for less than a neutrino mass, just thinking it over a-priori, I keep coming back to the question: before we close the energy frontier for good, shouldn’t there have been at least one unmitigated null result, rather than zero?

February 02, 2019

Jordan Ellenberg-22F

I mentioned last week that -3 Fahrenheit doesn’t seem that cold to me anymore. Well, this week it got colder. The coldest it’s been in Wisconsin in more than two decades, the coldest temperatures I’ve ever experienced. When I walked home from the gym on Wednesday it was -22F. That, reader, is cold. You don’t notice it for the first few minutes because you still have residual heat in your body from being inside. But at -22F your fingers start to get cold and numb inside your gloves and your toes inside your boots. My walk was about ten minutes. I could have handled twenty. But probably not thirty.

February 01, 2019

Clifford JohnsonBlack Holes and Time Travel in your Everyday Life

Oh, look what I found! It is my talk "Black Holes and Time Travel in your Everyday Life", which I gave as the Klopsteg Award lecture at AAPT back in July. Someone put it on YouTube. I hope you enjoy it!

Two warnings: (1) Skip to about 6 minutes to start, to avoid all the embarrassing handshaking and awarding and stuff. (2) There's a bit of early morning slowness + jet lag in my delivery here and there, so sorry about that. :)


Abstract: [...] Click to continue reading this post

The post Black Holes and Time Travel in your Everyday Life appeared first on Asymptotia.

Clifford JohnsonBlack Market of Ideas

As a reminder, today I'll be at the natural history museum (LA) as part of the "Night of Ideas" event! I'll have a number of physics demos with me and will be at a booth/table (in the Black Market of Ideas section) talking about physics ideas underlying our energy future as a species. I'll sign some books too! Come along!

Here's a link to the event:

Click to continue reading this post

The post Black Market of Ideas appeared first on Asymptotia.

Terence Tao255B, Notes 1: The Lagrangian formulation of the Euler equations

These lecture notes are a continuation of the 254A lecture notes from the previous quarter.

We consider the Euler equations for incompressible fluid flow on a Euclidean space {{\bf R}^d}; we will label {{\bf R}^d} as the “Eulerian space” {{\bf R}^d_E} (or “Euclidean space”, or “physical space”) to distinguish it from the “Lagrangian space” {{\bf R}^d_L} (or “labels space”) that we will introduce shortly (but the reader is free to also ignore the {E} or {L} subscripts if he or she wishes). Elements of Eulerian space {{\bf R}^d_E} will be referred to by symbols such as {x}; we use {dx} to denote Lebesgue measure on {{\bf R}^d_E}, and we will use {x^1,\dots,x^d} for the {d} coordinates of {x}, and use indices such as {i,j,k} to index these coordinates (with the usual summation conventions), for instance {\partial_i} denotes partial differentiation along the {x^i} coordinate. (We use superscripts for coordinates {x^i} instead of subscripts {x_i} to be compatible with some differential geometry notation that we will use shortly; in particular, when using the summation notation, we will now be matching subscripts with superscripts for the pair of indices being summed.)

In Eulerian coordinates, the Euler equations read

\displaystyle  \partial_t u + u \cdot \nabla u = - \nabla p \ \ \ \ \ (1)

\displaystyle  \nabla \cdot u = 0

where {u: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_E} is the velocity field and {p: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}} is the pressure field. These are functions of time {t \in [0,T)} and of the spatial location variable {x \in {\bf R}^d_E}. We will refer to the coordinates {(t,x) = (t,x^1,\dots,x^d)} as Eulerian coordinates. However, if one reviews the physical derivation of the Euler equations from 254A Notes 0, before one takes the continuum limit, the fundamental unknowns were not the velocity field {u} or the pressure field {p}, but rather the trajectories {(x^{(a)}(t))_{a \in A}}, which can be thought of as a single function {x: [0,T) \times A \rightarrow {\bf R}^d_E} from the coordinates {(t,a)} (where {t} is a time and {a} is an element of the label set {A}) to {{\bf R}^d}. The relationship between the trajectories {x^{(a)}(t) = x(t,a)} and the velocity field was given by the informal relationship

\displaystyle  \partial_t x(t,a) \approx u( t, x(t,a) ). \ \ \ \ \ (2)

We will refer to the coordinates {(t,a)} as (discrete) Lagrangian coordinates for describing the fluid.

In view of this, it is natural to ask whether there is an alternate way to formulate the continuum limit of incompressible inviscid fluids, by using a continuous version {(t,a)} of the Lagrangian coordinates, rather than Eulerian coordinates. This is indeed the case. Suppose for instance one has a smooth solution {u, p} to the Euler equations on a spacetime slab {[0,T) \times {\bf R}^d_E} in Eulerian coordinates; assume furthermore that the velocity field {u} is uniformly bounded. We introduce another copy {{\bf R}^d_L} of {{\bf R}^d}, which we call Lagrangian space or labels space; we use symbols such as {a} to refer to elements of this space, {da} to denote Lebesgue measure on {{\bf R}^d_L}, and {a^1,\dots,a^d} to refer to the {d} coordinates of {a}. We use indices such as {\alpha,\beta,\gamma} to index these coordinates, thus for instance {\partial_\alpha} denotes partial differentiation along the {a^\alpha} coordinate. We will use summation conventions for both the Eulerian coordinates {i,j,k} and the Lagrangian coordinates {\alpha,\beta,\gamma}, with an index being summed if it appears as both a subscript and a superscript in the same term. While {{\bf R}^d_L} and {{\bf R}^d_E} are of course isomorphic, we will try to refrain from identifying them, except perhaps at the initial time {t=0} in order to fix the initialisation of Lagrangian coordinates.

Given a smooth and bounded velocity field {u: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_E}, define a trajectory map for this velocity to be any smooth map {X: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} that obeys the ODE

\displaystyle  \partial_t X(t,a) = u( t, X(t,a) ); \ \ \ \ \ (3)

in view of (2), this describes the trajectory (in {{\bf R}^d_E}) of a particle labeled by an element {a} of {{\bf R}^d_L}. From the Picard existence theorem and the hypothesis that {u} is smooth and bounded, such a map exists and is unique as long as one specifies the initial location {X(0,a)} assigned to each label {a}. Traditionally, one chooses the initial condition

\displaystyle  X(0,a) = a \ \ \ \ \ (4)

for {a \in {\bf R}^d_L}, so that we label each particle by its initial location at time {t=0}; we are also free to specify other initial conditions for the trajectory map if we please. Indeed, we have the freedom to “permute” the labels {a \in {\bf R}^d_L} by an arbitrary diffeomorphism: if {X: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} is a trajectory map, and {\pi: {\bf R}^d_L \rightarrow{\bf R}^d_L} is any diffeomorphism (a smooth map whose inverse exists and is also smooth), then the map {X \circ \pi: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} is also a trajectory map, albeit one with different initial conditions {X(0,a)}.
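As a concrete illustration (my own, not part of the notes): for the rigid-rotation velocity field u(x^1, x^2) = (-x^2, x^1) on {\bf R}^2, the trajectory map with the initial condition (4) is just rotation by angle t, and a numerical integration of the ODE (3) recovers it, together with volume preservation of the map X(t).

```python
import math

def u(t, x):
    # rigid rotation: a smooth, divergence-free velocity field on R^2
    return (-x[1], x[0])

def rk4_step(t, x, dt):
    # one classical Runge-Kutta step for the trajectory ODE dX/dt = u(t, X)
    k1 = u(t, x)
    k2 = u(t + dt / 2, (x[0] + dt / 2 * k1[0], x[1] + dt / 2 * k1[1]))
    k3 = u(t + dt / 2, (x[0] + dt / 2 * k2[0], x[1] + dt / 2 * k2[1]))
    k4 = u(t + dt, (x[0] + dt * k3[0], x[1] + dt * k3[1]))
    return (x[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            x[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

def X(T, a, steps=1000):
    # trajectory map with initial condition X(0, a) = a
    x, dt = a, T / steps
    for n in range(steps):
        x = rk4_step(n * dt, x, dt)
    return x

a = (1.0, 0.0)
xT = X(1.0, a)
exact = (math.cos(1.0), math.sin(1.0))  # rotation of a by angle t = 1
err = math.hypot(xT[0] - exact[0], xT[1] - exact[1])
print(err)  # tiny: RK4 reproduces the exact rotation

# finite-difference Jacobian of a -> X(1, a): det should be 1 (incompressibility)
h = 1e-5
dXda1 = [(p - q) / h for p, q in zip(X(1.0, (a[0] + h, a[1])), xT)]
dXda2 = [(p - q) / h for p, q in zip(X(1.0, (a[0], a[1] + h)), xT)]
jac_det = dXda1[0] * dXda2[1] - dXda1[1] * dXda2[0]
print(jac_det)  # approximately 1
```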

Despite the popularity of the initial condition (4), we will try to keep conceptually separate the Eulerian space {{\bf R}^d_E} from the Lagrangian space {{\bf R}^d_L}, as they play different physical roles in the interpretation of the fluid; for instance, while the Euclidean metric {d\eta^2 = dx^1 dx^1 + \dots + dx^d dx^d} is an important feature of Eulerian space {{\bf R}^d_E}, it is not a geometrically natural structure to use in Lagrangian space {{\bf R}^d_L}. We have the following more general version of Exercise 8 from 254A Notes 2:

Exercise 1 Let {u: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_E} be smooth and bounded.

  • If {X_0: {\bf R}^d_L \rightarrow {\bf R}^d_E} is a smooth map, show that there exists a unique smooth trajectory map {X: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} with initial condition {X(0,a) = X_0(a)} for all {a \in {\bf R}^d_L}.
  • Show that if {X_0} is a diffeomorphism and {t \in [0,T)}, then the map {X(t): a \mapsto X(t,a)} is also a diffeomorphism.

Remark 2 The first of the Euler equations (1) can now be written in the form

\displaystyle  \frac{d^2}{dt^2} X(t,a) = - (\nabla p)( t, X(t,a) ) \ \ \ \ \ (5)

which can be viewed as a continuous limit of Newton’s second law {m^{(a)} \frac{d^2}{dt^2} x^{(a)}(t) = F^{(a)}(t)}.

Call a diffeomorphism {Y: {\bf R}^d_L \rightarrow {\bf R}^d_E} (oriented) volume preserving if one has the equation

\displaystyle  \mathrm{det}( \nabla Y )(a) = 1 \ \ \ \ \ (6)

for all {a \in {\bf R}^d_L}, where the total differential {\nabla Y} is the {d \times d} matrix with entries {\partial_\alpha Y^i} for {\alpha = 1,\dots,d} and {i=1,\dots,d}, where {Y^1,\dots,Y^d:{\bf R}^d_L \rightarrow {\bf R}} are the components of {Y}. (If one wishes, one can also view {\nabla Y} as a linear transformation from the tangent space {T_a {\bf R}^d_L} of Lagrangian space at {a} to the tangent space {T_{Y(a)} {\bf R}^d_E} of Eulerian space at {Y(a)}.) Equivalently, {Y} is orientation preserving and one has a Jacobian-free change of variables formula

\displaystyle  \int_{{\bf R}^d_L} f( Y(a) )\ da = \int_{{\bf R}^d_E} f(x)\ dx

for all {f \in C_c({\bf R}^d_E \rightarrow {\bf R})}, which is in turn equivalent to {Y(E) \subset {\bf R}^d_E} having the same Lebesgue measure as {E} for any measurable set {E \subset {\bf R}^d_L}.

The divergence-free condition {\nabla \cdot u = 0} then can be nicely expressed in terms of volume-preserving properties of the trajectory maps {X}, in a manner which confirms the interpretation of this condition as an incompressibility condition on the fluid:

Lemma 3 Let {u: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_E} be smooth and bounded, let {X_0: {\bf R}^d_L \rightarrow {\bf R}^d_E} be a volume-preserving diffeomorphism, and let {X: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} be the trajectory map. Then the following are equivalent:

  • {\nabla \cdot u = 0} on {[0,T) \times {\bf R}^d_E}.
  • {X(t): {\bf R}^d_L \rightarrow {\bf R}^d_E} is volume-preserving for all {t \in [0,T)}.

Proof: Since {X_0} is orientation-preserving, we see from continuity that {X(t)} is also orientation-preserving. Suppose first that {X(t)} is volume-preserving; then for any {f \in C^\infty_c({\bf R}^d_E \rightarrow {\bf R})} we have the conservation law

\displaystyle  \int_{{\bf R}^d_L} f( X(t,a) )\ da = \int_{{\bf R}^d_E} f(x)\ dx

for all {t \in [0,T)}. Differentiating in time using the chain rule and (3) we conclude that

\displaystyle  \int_{{\bf R}^d_L} (u(t) \cdot \nabla f)( X(t,a)) \ da = 0

for all {t \in [0,T)}, and hence by change of variables

\displaystyle  \int_{{\bf R}^d_E} (u(t) \cdot \nabla f)(x) \ dx = 0

which by integration by parts gives

\displaystyle  \int_{{\bf R}^d_E} (\nabla \cdot u(t,x)) f(x)\ dx = 0

for all {f \in C^\infty_c({\bf R}^d_E \rightarrow {\bf R})} and {t \in [0,T)}, so {u} is divergence-free.

To prove the converse implication, it is convenient to introduce the labels map {A:[0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_L}, defined by setting {A(t): {\bf R}^d_E \rightarrow {\bf R}^d_L} to be the inverse of the diffeomorphism {X(t): {\bf R}^d_L \rightarrow {\bf R}^d_E}, thus

\displaystyle A(t, X(t,a)) = a

for all {(t,a) \in [0,T) \times {\bf R}^d_L}. By the implicit function theorem, {A} is smooth, and by differentiating the above equation in time using (3) we see that

\displaystyle  D_t A(t,x) = 0

where {D_t} is the usual material derivative

\displaystyle  D_t := \partial_t + u \cdot \nabla \ \ \ \ \ (7)

acting on functions on {[0,T) \times {\bf R}^d_E}. If {u} is divergence-free, we have from integration by parts that

\displaystyle  \partial_t \int_{{\bf R}^d_E} \phi(t,x)\ dx = \int_{{\bf R}^d_E} D_t \phi(t,x)\ dx

for any test function {\phi: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}}. In particular, for any {g \in C^\infty_c({\bf R}^d_L \rightarrow {\bf R})}, we can calculate

\displaystyle \partial_t \int_{{\bf R}^d_E} g( A(t,x) )\ dx = \int_{{\bf R}^d_E} D_t (g(A(t,x)))\ dx

\displaystyle  = \int_{{\bf R}^d_E} 0\ dx

and hence

\displaystyle  \int_{{\bf R}^d_E} g(A(t,x))\ dx = \int_{{\bf R}^d_E} g(A(0,x))\ dx

for any {t \in [0,T)}. Since {X_0} is volume-preserving, so is {A(0)}, thus

\displaystyle  \int_{{\bf R}^d_E} g \circ A(t)\ dx = \int_{{\bf R}^d_L} g\ da.

Thus {A(t)} is volume-preserving, and hence {X(t)} is also. \Box

Exercise 4 Let {M: [0,T) \rightarrow \mathrm{GL}_d({\bf R})} be a continuously differentiable map from the time interval {[0,T)} to the general linear group {\mathrm{GL}_d({\bf R})} of invertible {d \times d} matrices. Establish Jacobi’s formula

\displaystyle  \partial_t \det(M(t)) = \det(M(t)) \mathrm{tr}( M(t)^{-1} \partial_t M(t) )

and use this and (6) to give an alternate proof of Lemma 3 that does not involve any integration in space.
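Jacobi’s formula is easy to check numerically before proving it; here is a quick finite-difference sketch of mine (with an arbitrary sample matrix path, not from the notes) in the {2 \times 2} case.

```python
import math

def M(t):
    # a sample smooth path of invertible 2x2 matrices; det M(t) = 2 e^t
    return [[math.exp(t), t], [0.0, 2.0]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def mat2mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace2(A):
    return A[0][0] + A[1][1]

t, h = 0.7, 1e-6
# left side of Jacobi's formula: d/dt det M(t), by central differences
lhs = (det2(M(t + h)) - det2(M(t - h))) / (2 * h)
# right side: det M(t) * tr(M(t)^{-1} M'(t)), with M' by central differences
Mdot = [[(M(t + h)[i][j] - M(t - h)[i][j]) / (2 * h) for j in range(2)]
        for i in range(2)]
rhs = det2(M(t)) * trace2(mat2mul(inv2(M(t)), Mdot))
print(abs(lhs - rhs))  # small: the two sides agree
```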

Remark 5 One can view the use of Lagrangian coordinates as an extension of the method of characteristics. Indeed, from the chain rule we see that for any smooth function {f: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}} of Eulerian spacetime, one has

\displaystyle  \frac{d}{dt} f(t,X(t,a)) = (D_t f)(t,X(t,a))

and hence any transport equation that in Eulerian coordinates takes the form

\displaystyle  D_t f = g

for smooth functions {f,g: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}} of Eulerian spacetime is equivalent to the ODE

\displaystyle  \frac{d}{dt} F = G

where {F,G: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}} are the smooth functions of Lagrangian spacetime defined by

\displaystyle  F(t,a) := f(t,X(t,a)); \quad G(t,a) := g(t,X(t,a)).

In this set of notes we recall some basic differential geometry notation, particularly with regards to pullbacks and Lie derivatives of differential forms and other tensor fields on manifolds such as {{\bf R}^d_E} and {{\bf R}^d_L}, and explore how the Euler equations look in this notation. Our discussion will be entirely formal in nature; we will assume that all functions have enough smoothness and decay at infinity to justify the relevant calculations. (It is possible to work rigorously in Lagrangian coordinates – see for instance the work of Ebin and Marsden – but we will not do so here.) As a general rule, Lagrangian coordinates tend to be somewhat less convenient to use than Eulerian coordinates for establishing the basic analytic properties of the Euler equations, such as local existence, uniqueness, and continuous dependence on the data; however, they are quite good at clarifying the more algebraic properties of these equations, such as conservation laws and the variational nature of the equations. It may well be that in the future we will be able to use the Lagrangian formalism more effectively on the analytic side of the subject also.

Remark 6 One can also write the Navier-Stokes equations in Lagrangian coordinates, but the equations are not expressed in a favourable form in these coordinates, as the Laplacian {\Delta} appearing in the viscosity term becomes replaced with a time-varying Laplace-Beltrami operator. As such, we will not discuss the Lagrangian coordinate formulation of Navier-Stokes here.

— 1. Pullbacks and Lie derivatives —

In order to efficiently change coordinates, it is convenient to use the language of differential geometry, which is designed to be almost entirely independent of the choice of coordinates. We therefore spend some time recalling the basic concepts of differential geometry that we will need. Our presentation will be based on explicitly working in coordinates; there are of course more coordinate-free approaches to the subject (for instance setting up the machinery of vector bundles, or of derivations), but we will not adopt these approaches here.

Throughout this section, we fix a diffeomorphism {Y: {\bf R}^d_L \rightarrow {\bf R}^d_E} from Lagrangian space {{\bf R}^d_L} to Eulerian space {{\bf R}^d_E}; one can for instance take {Y = X(t)} where {X: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} is a diffeomorphic trajectory map and {t \in [0,T)} is some time. Then all the differential geometry structures on Eulerian space {{\bf R}^d_E} can be pulled back via {Y} to Lagrangian space {{\bf R}^d_L}. For instance, a physical point {x \in {\bf R}^d_E} can be pulled back to a label {Y^* x := Y^{-1}(x) \in {\bf R}^d_L}, and similarly a subset {E \subset {\bf R}^d_E} of physical space can be pulled back to a subset {Y^* E := Y^{-1}(E) \subset {\bf R}^d_L} of label space. A scalar field {f: {\bf R}^d_E \rightarrow {\bf R}} can be pulled back to a scalar field {Y^* f: {\bf R}^d_L \rightarrow {\bf R}}, defined by pre-composition:

\displaystyle  Y^* f(a) := f(Y(a)).

These operations are all compatible with each other in various ways; for instance, if {x \in {\bf R}^d_E}, {E \subset {\bf R}^d_E}, and {f: {\bf R}^d_L \rightarrow {\bf R}}, and {c \in {\bf R}} then

  • {x \in E} if and only if {Y^* x \in Y^* E}.
  • {f(x) = c} if and only if {Y^* f( Y^* x ) = c}.
  • The map {E \mapsto Y^* E} is an isomorphism of {\sigma}-algebras.
  • The map {f \mapsto Y^* f} is an algebra isomorphism.

Differential forms. The next family of structures we will pull back are that of differential forms, which we will define using coordinates. (See also my previous notes on this topic for more discussion on differential forms.) For any {k \geq 0}, a {k}-form {\omega} on {{\bf R}^d_E} will be defined as a family of functions {\omega_{i_1 \dots i_k}: {\bf R}^d_E \rightarrow {\bf R}} for {i_1,\dots,i_k \in \{1,\dots,d\}} which is totally antisymmetric with respect to permutations of the indices {i_1,\dots,i_k}, thus if one interchanges {i_j} and {i_{j'}} for any {1 \leq j < j' \leq k}, then {\omega_{i_1 \dots i_k}} flips to {-\omega_{i_1 \dots i_k}}. Thus for instance

  • A {0}-form is just a scalar field {\omega: {\bf R}^d_E \rightarrow {\bf R}};
  • A {1}-form, when viewed in coordinates, is a collection {\omega_i: {\bf R}^d_E \rightarrow {\bf R}} of {d} scalar functions;
  • A {2}-form, when viewed in coordinates, is a collection {\omega_{ij}: {\bf R}^d_E \rightarrow {\bf R}} of {d^2} scalar functions with {\omega_{ji} = -\omega_{ij}} (so in particular {\omega_{ii}=0});
  • A {3}-form, when viewed in coordinates, is a collection {\omega_{ijk}: {\bf R}^d_E \rightarrow {\bf R}} of {d^3} scalar functions with {\omega_{jik} = -\omega_{ijk}}, {\omega_{ikj} = -\omega_{ijk}}, and {\omega_{kji} = -\omega_{ijk}}.

The antisymmetry makes the component {\omega_{i_1 \dots i_k}} of a {k}-form vanish whenever two of the indices agree. In particular, if {k>d}, then the only {k}-form that exists is the zero {k}-form {0}. A {d}-form is also known as a volume form; amongst all such forms we isolate the standard volume form {d\mathrm{vol}_E}, defined by setting {(d\mathrm{vol}_E)_{\sigma(1) \dots \sigma(d)} := \mathrm{sgn}(\sigma)} for any permutation {\sigma: \{1,\dots,d\} \rightarrow \{1,\dots,d\}} (with {\mathrm{sgn}(\sigma)\in \{-1,+1\}} being the sign of the permutation), and setting all other components of {d\mathrm{vol}_E} equal to zero. For instance, in three dimensions one has {(d\mathrm{vol}_E)_{i_1 \dots i_3}} equal to {+1} when {(i_1,i_2,i_3) = (1,2,3), (2,3,1), (3,1,2)}, {-1} when {(i_1,i_2,i_3) = (1,3,2), (3,2,1), (2,1,3)}, and {0} otherwise. We use {\Omega^k({\bf R}^d_E)} to denote the space of {k}-forms on {{\bf R}^d_E}.

If {f: {\bf R}^d_E \rightarrow {\bf R}} is a scalar field and {\omega \in \Omega^k({\bf R}^d_E)}, we can define the product {f\omega} by pointwise multiplication of components:

\displaystyle  (f\omega)_{i_1 \dots i_k}(x) := f(x) \omega_{i_1 \dots i_k}(x).

More generally, given two forms {\omega \in \Omega^k({\bf R}^d_E)}, {\theta \in \Omega^l({\bf R}^d_E)}, we define the wedge product {\omega \wedge \theta \in \Omega^{k+l}({\bf R}^d_E)} to be the {k+l}-form given by the formula

\displaystyle  (\omega \wedge \theta)_{i_1 \dots i_{k+l}}(x) := \frac{1}{k! l!} \sum_{\sigma \in S_{k+l}} \mathrm{sgn}(\sigma) \omega_{i_{\sigma(1)} \dots i_{\sigma(k)}}(x) \theta_{i_{\sigma(k+1)} \dots i_{\sigma(k+l)}}(x)

where {S_{k+l}} is the symmetric group of permutations on {\{1,\dots,k+l\}}. For instance, for a scalar field {f: {\bf R}^d_E \rightarrow {\bf R}} (so {f \in \Omega^0({\bf R}^d_E)}), {f \wedge \omega = \omega \wedge f = f \omega}. Similarly, if {\theta,\eta \in \Omega^1({\bf R}^d_E)} and {\omega \in \Omega^2({\bf R}^d_E)}, we have the pointwise identities

\displaystyle  (\theta \wedge \eta)_{ij} = \theta_i \eta_j - \theta_j \eta_i

\displaystyle  (\theta \wedge \omega)_{ijk} = \theta_i \omega_{jk} - \theta_j \omega_{ik} + \theta_k \omega_{ij}

\displaystyle  (\omega \wedge \theta)_{ijk} = \omega_{ij} \theta_k - \omega_{ik} \theta_j + \omega_{jk} \theta_i.
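As a sanity check on the normalising factor {\frac{1}{k! l!}}, one can implement the wedge product formula directly and compare against the displayed identity for two {1}-forms. The Python sketch below (an illustration of ours, not part of the notes) stores a form as a dict over index tuples:

```python
from itertools import permutations, product
from math import factorial

def perm_sign(p):
    sign, p = 1, list(p)
    for i in range(len(p)):
        while p[i] != i:
            j = p[i]; p[i], p[j] = p[j], p[i]; sign = -sign
    return sign

def wedge(omega, k, theta, l, d):
    """Wedge product of a k-form and an l-form, stored as dicts over index tuples."""
    result = {}
    for idx in product(range(d), repeat=k + l):
        s = 0
        for sigma in permutations(range(k + l)):
            s += (perm_sign(sigma)
                  * omega[tuple(idx[sigma[a]] for a in range(k))]
                  * theta[tuple(idx[sigma[k + a]] for a in range(l))])
        result[idx] = s / (factorial(k) * factorial(l))
    return result

d = 3
theta = {(i,): [2.0, -1.0, 5.0][i] for i in range(d)}   # an arbitrary 1-form
eta   = {(i,): [0.5,  3.0, 7.0][i] for i in range(d)}   # another 1-form
w = wedge(theta, 1, eta, 1, d)
# check (theta ^ eta)_{ij} = theta_i eta_j - theta_j eta_i componentwise
for i in range(d):
    for j in range(d):
        assert abs(w[(i, j)] - (theta[(i,)] * eta[(j,)] - theta[(j,)] * eta[(i,)])) < 1e-12
```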

Exercise 7 Show that the wedge product is a bilinear map from {\Omega^k({\bf R}^d_E) \times \Omega^l({\bf R}^d_E)} to {\Omega^{k+l}({\bf R}^d_E)} that obeys the supercommutative property

\displaystyle  \omega \wedge \theta = (-1)^{kl} \theta \wedge \omega

for {\omega \in \Omega^k({\bf R}^d_E)} and {\theta \in \Omega^l({\bf R}^d_E)}, and the associative property

\displaystyle  (\omega \wedge \theta) \wedge \eta = \omega \wedge (\theta \wedge \eta)

for {\omega \in \Omega^k({\bf R}^d_E)}, {\theta \in \Omega^l({\bf R}^d_E)}, {\eta \in \Omega^m({\bf R}^d_E)}. (In other words, the space of formal linear combinations of forms, graded by the parity of the order of the forms, is a supercommutative algebra. Very roughly speaking, the prefix “super” means that “odd order objects anticommute with each other rather than commute”.)

If {\omega \in \Omega^k({\bf R}^d_E)} is continuously differentiable, we define the exterior derivative {d\omega \in \Omega^{k+1}({\bf R}^d_E)} in coordinates as

\displaystyle  (d\omega)_{i_1 \dots i_{k+1}} := \sum_{j=1}^{k+1} (-1)^{j-1} \partial_{i_j} \omega_{i_1 \dots i_{j-1} i_{j+1} \dots i_{k+1}}. \ \ \ \ \ (8)

It is easy to verify that this is indeed a {k+1}-form. Thus for instance:

  • If {f \in \Omega^0({\bf R}^d_E)} is a continuously differentiable scalar field, then {(df)_i = \partial_i f}.
  • If {\theta \in \Omega^1({\bf R}^d_E)} is a continuously differentiable {1}-form, then {(d\theta)_{ij} = \partial_i \theta_j - \partial_j \theta_i}.
  • If {\omega \in \Omega^2({\bf R}^d_E)} is a continuously differentiable {2}-form, then {(d\omega)_{ijk} = \partial_i \omega_{jk} - \partial_j \omega_{ik} + \partial_k \omega_{ij}}.

Exercise 8 If {\omega \in \Omega^k({\bf R}^d_E)} and {\theta \in \Omega^l({\bf R}^d_E)} are continuously differentiable, establish the antiderivation (or super-Leibniz) law

\displaystyle  d( \omega \wedge \theta ) = (d\omega) \wedge \theta + (-1)^k \omega \wedge d\theta \ \ \ \ \ (9)

and if {\omega} is twice continuously differentiable, establish the chain complex law

\displaystyle  d d \omega = 0. \ \ \ \ \ (10)
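For {0}-forms, the chain complex law (10) boils down to the symmetry of mixed partial derivatives, which can be observed numerically: symmetric central differences commute, so the following Python sketch (ours, not from the notes) sees {d(df)} vanish up to floating-point noise:

```python
from itertools import product

H = 1e-3

def partial(g, i, x, h=H):
    """Central finite-difference approximation to the i-th partial derivative."""
    xp, xm = list(x), list(x)
    xp[i] += h; xm[i] -= h
    return (g(xp) - g(xm)) / (2 * h)

def f(x):                      # a sample smooth scalar field on R^3
    return x[0] ** 2 * x[1] + x[2] * x[0]

x0 = [0.3, -0.7, 1.1]
d = 3
# (df)_i = partial_i f;  (ddf)_{ij} = partial_i (df)_j - partial_j (df)_i
for i, j in product(range(d), repeat=2):
    ddf_ij = (partial(lambda y: partial(f, j, y), i, x0)
              - partial(lambda y: partial(f, i, y), j, x0))
    assert abs(ddf_ij) < 1e-6   # d(df) = 0 up to floating-point noise
```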

Each of the coordinates {x^i}, {i=1,\dots,d} can be viewed as scalar fields {(x^1,\dots,x^d) \mapsto x^i}. In particular, the exterior derivatives {dx^i}, {i=1,\dots,d} are {1}-forms. It is easy to verify the identity

\displaystyle  \omega = \frac{1}{k!} \omega_{i_1 \dots i_k} dx^{i_1} \wedge \dots \wedge dx^{i_k}

for any {\omega \in \Omega^k({\bf R}^d_E)} with the usual summation conventions (which, in this differential geometry formalism, assert that we sum indices whenever they appear as a subscript-superscript pair). In particular the volume form {d\mathrm{vol}_E} can be written as

\displaystyle  d\mathrm{vol}_E = dx^1 \wedge \dots \wedge dx^d.

One can of course define differential forms on Lagrangian space {{\bf R}^d_L} as well, changing the indices from Roman to Greek. For instance, if {\theta \in \Omega^1({\bf R}^d_L)} is continuously differentiable, then {d\theta \in \Omega^2({\bf R}^d_L)} is given in coordinates as

\displaystyle  (d\theta)_{\alpha \beta} = \partial_\alpha \theta_\beta - \partial_\beta \theta_\alpha.

If {\omega \in \Omega^k({\bf R}^d_E)}, we define the pullback form {Y^* \omega \in \Omega^k({\bf R}^d_L)} by the formula

\displaystyle  (Y^* \omega)_{\alpha_1 \dots \alpha_k}(a) := \omega_{i_1 \dots i_k}(Y(a)) \partial_{\alpha_1} Y^{i_1}(a) \dots \partial_{\alpha_k} Y^{i_k}(a) \ \ \ \ \ (11)

with the usual summation conventions. Thus for instance

It is easy to see that pullback {Y^*} is a linear map from {\Omega^k({\bf R}^d_E)} to {\Omega^k({\bf R}^d_L)}. It also preserves the exterior algebra and exterior derivative:

Exercise 9 Let {\omega, \theta \in \Omega^k({\bf R}^d_E)}. Show that

\displaystyle  Y^* (\omega \wedge \theta) = (Y^* \omega) \wedge (Y^* \theta),

and if {\omega} is continuously differentiable, show that

\displaystyle  Y^* d \omega = d Y^* \omega.
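The identity {Y^* d\omega = d Y^* \omega} can be tested numerically for a {1}-form and a concrete map {Y}. The Python sketch below (our own illustration, using finite-difference derivatives; all names ours) compares both sides at a sample point in two dimensions:

```python
H = 1e-3

def partial(g, i, x, h=H):
    xp, xm = list(x), list(x)
    xp[i] += h; xm[i] -= h
    return (g(xp) - g(xm)) / (2 * h)

def Y(a):                                  # a sample (local) diffeomorphism R^2_L -> R^2_E
    return [a[0] + a[1] ** 2, a[0] * a[1]]

def theta(x):                              # a sample 1-form on R^2_E, as its component list
    return [x[1], x[0] ** 2]

def pull_theta(a):                         # (Y* theta)_alpha = theta_i(Y(a)) d_alpha Y^i
    return [sum(theta(Y(a))[i] * partial(lambda b: Y(b)[i], alpha, a)
                for i in range(2))
            for alpha in range(2)]

a0 = [0.4, 0.9]
# d(Y* theta)_{01} by finite differences
lhs = (partial(lambda b: pull_theta(b)[1], 0, a0)
       - partial(lambda b: pull_theta(b)[0], 1, a0))
# Y*(d theta)_{01} = (d theta)_{ij}(Y(a)) d_0 Y^i d_1 Y^j
dtheta = lambda x, i, j: (partial(lambda y: theta(y)[j], i, x)
                          - partial(lambda y: theta(y)[i], j, x))
rhs = sum(dtheta(Y(a0), i, j)
          * partial(lambda b: Y(b)[i], 0, a0)
          * partial(lambda b: Y(b)[j], 1, a0)
          for i in range(2) for j in range(2))
assert abs(lhs - rhs) < 1e-4
```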

One can integrate {k}-forms on oriented {k}-manifolds. Suppose for instance that an oriented {k}-manifold {M \subset {\bf R}^d_E} has a parameterisation {\{ \phi(a): a \in U \}}, where {U} is an open subset of {{\bf R}^k_L} and {\phi: U \rightarrow {\bf R}^d_E} is an injective immersion. Then any continuous compactly supported {k}-form {\omega \in \Omega^k({\bf R}^d_E)} can be integrated on {M} by the formula

\displaystyle  \int_M \omega := \frac{1}{k!} \int_U \omega_{i_1 \dots i_k}(\phi(a)) \partial_1 \phi^{i_1}(a) \dots \partial_k \phi^{i_k}(a)\ da

with the usual summation conventions. It can be shown that this definition is independent of the choice of parameterisation. For a more general manifold {M}, one can use a partition of unity to decompose {M} into parameterised pieces, and define {\int_M \omega} to be the sum of the integrals over these pieces; again, one can show (after some tedious calculation) that this is independent of the choices made. If {M} is all of {{\bf R}^d_E} (with the standard orientation), and {f \in C_c({\bf R}^d_E \rightarrow {\bf R})}, then we have the identity

\displaystyle  \int_{{\bf R}^d_E} f\ d\mathrm{vol}_E = \int_{{\bf R}^d_E} f(x)\ dx \ \ \ \ \ (13)

linking integration on differential forms with the Lebesgue (or Riemann) integral. We also record Stokes’ theorem

\displaystyle  \int_{\partial \Omega} \omega = \int_{\Omega} d \omega \ \ \ \ \ (14)

whenever {\Omega} is a smooth orientable {k+1}-manifold with smooth boundary {\partial \Omega}, and {\omega} is a continuous, compactly supported {k}-form. The regularity conditions on {\Omega,\omega} here can often be relaxed by the usual limiting arguments; for the purposes of this set of notes, we shall proceed formally and assume that identities such as (14) hold for all manifolds {\Omega} and forms {\omega} under consideration.

From the change of variables formula we see that pullback also respects integration on manifolds, in that

\displaystyle  \int_{Y^* \Omega} Y^* \omega = \int_\Omega \omega \ \ \ \ \ (15)

whenever {\Omega} is a smooth orientable {k}-manifold, and {\omega} a continuous compactly supported {k}-form.

Exercise 10 Establish the identity

\displaystyle  Y^* d\mathrm{vol}_E = \mathrm{det}(\nabla Y) d\mathrm{vol}_L.

Conclude in particular that {Y} is volume-preserving if and only if

\displaystyle  Y^* d\mathrm{vol}_E = d\mathrm{vol}_L.
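For a concrete map {Y} in two dimensions, the identity of Exercise 10 can be checked componentwise: the following Python sketch (ours, not part of the notes) computes the pullback of {d\mathrm{vol}_E} via the general formula (11) and compares it against {\det(\nabla Y)}:

```python
H = 1e-3

def partial(g, i, x, h=H):
    xp, xm = list(x), list(x)
    xp[i] += h; xm[i] -= h
    return (g(xp) - g(xm)) / (2 * h)

def Y(a):                         # a sample map R^2_L -> R^2_E
    return [a[0] * a[1], a[0] + a[1] ** 3]

a0 = [0.8, -0.5]
J = [[partial(lambda b: Y(b)[i], alpha, a0) for alpha in range(2)]
     for i in range(2)]           # Jacobian (nabla Y)^i_alpha
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]

vol = {(0, 1): 1.0, (1, 0): -1.0, (0, 0): 0.0, (1, 1): 0.0}
# (Y* dvol)_{01} = vol_{ij} d_0 Y^i d_1 Y^j   (formula (11))
pulled = sum(vol[(i, j)] * J[i][0] * J[j][1] for i in range(2) for j in range(2))
assert abs(pulled - det) < 1e-9   # matches det(nabla Y) (dvol_L)_{01}
```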

Vector fields. Having pulled back differential forms, we now pull back vector fields. A vector field {Z} on {{\bf R}^d_E}, when viewed in coordinates, is a collection {Z^i: {\bf R}^d_E \rightarrow {\bf R}}, {i=1,\dots,d} of scalar functions; superficially, this resembles a {1}-form {\theta \in \Omega^1({\bf R}^d_E)}, except that we use superscripts {Z^i} instead of subscripts {\theta_i} to denote the components. On the other hand, we will transform vector fields under pullback in a different manner from {1}-forms. For each {i}, a basic example of a vector field is the coordinate vector field {\frac{d}{dx^i}}, defined by setting {(\frac{d}{dx^i})^j} to equal {1} when {i=j} and {0} otherwise. Then every vector field {Z} may be written as

\displaystyle  Z = Z^i \frac{d}{dx^i}

where we multiply scalar functions against vector fields in the obvious fashion; compare this with the expansion {\theta = \theta_i dx^i} of a {1}-form {\theta \in \Omega^1({\bf R}^d_E)} into its components {\theta_i}. The space of all vector fields will be denoted {\Gamma(T {\bf R}^d_E)}. One can of course define vector fields on {{\bf R}^d_L} similarly.

The pullback {Y^* Z} of {Z} is defined to be the unique vector field {Y^* Z \in \Gamma(T {\bf R}^d_L)} such that

\displaystyle  Z^i(Y(a)) = (Y^* Z)^\alpha(a) \partial_\alpha Y^i(a) \ \ \ \ \ (16)

for all {a \in {\bf R}^d_L} (so that {Z} is the pushforward of {Y^* Z}). Equivalently, if {(\nabla Y)^{-1}} is the inverse matrix to the total differential {\nabla Y} (which we recall in coordinates is {(\nabla Y)^i_\alpha := \partial_\alpha Y^i}), so that

\displaystyle  ((\nabla Y)^{-1})^\alpha_i (\nabla Y)^i_\beta = \delta^\alpha_\beta, \quad (\nabla Y)^i_\alpha ((\nabla Y)^{-1})^\alpha_j = \delta^i_j

with {\delta} denoting the Kronecker delta, then

\displaystyle  (Y^* Z)^\alpha(a) = Z^i(Y(a)) ((\nabla Y)^{-1})^\alpha_i(a).

From the inverse function theorem one can also write

\displaystyle  ((\nabla Y)^{-1})^\alpha_i(a) = (\partial_i Y^{-1})( Y(a) ),

thus {Z} is also the pullback of {Y^* Z} by {Y^{-1}}.
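The defining property (16) can be verified numerically for a sample map and vector field: compute {Y^* Z} via the inverse Jacobian, then push it forward again and check that {Z} is recovered. A Python sketch of ours (finite-difference Jacobians, names not from the notes):

```python
H = 1e-3

def partial(g, i, x, h=H):
    xp, xm = list(x), list(x)
    xp[i] += h; xm[i] -= h
    return (g(xp) - g(xm)) / (2 * h)

def Y(a):                              # sample map R^2_L -> R^2_E
    return [a[0] + a[1] ** 2, a[0] * a[1] + a[1]]

def Z(x):                              # sample vector field on R^2_E, component list
    return [x[1], x[0] ** 2 + 1.0]

a0 = [0.6, 0.3]
J = [[partial(lambda b: Y(b)[i], al, a0) for al in range(2)] for i in range(2)]
detJ = J[0][0] * J[1][1] - J[0][1] * J[1][0]
Jinv = [[J[1][1] / detJ, -J[0][1] / detJ],     # (nabla Y)^{-1}, rows indexed by alpha
        [-J[1][0] / detJ, J[0][0] / detJ]]

Zx = Z(Y(a0))
pullZ = [sum(Zx[i] * Jinv[al][i] for i in range(2)) for al in range(2)]  # (Y* Z)^alpha
# pushing forward again recovers Z (formula (16)): Z^i(Y(a)) = (Y*Z)^alpha d_alpha Y^i
for i in range(2):
    assert abs(sum(pullZ[al] * J[i][al] for al in range(2)) - Zx[i]) < 1e-9
```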

If {\omega \in \Omega^k({\bf R}^d_E)} is a {k}-form and {Z_1,\dots,Z_k \in \Gamma(T{\bf R}^d_E)} are vector fields, one can form the scalar field {\omega(Z_1,\dots,Z_k): {\bf R}^d_E \rightarrow {\bf R}} by the formula

\displaystyle  \omega(Z_1,\dots,Z_k) := \omega_{i_1 \dots i_k} Z^{i_1}_1 \dots Z^{i_k}_k.

Thus for instance if {\omega \in \Omega^2({\bf R}^d_E)} is a {2}-form and {Z,W \in \Gamma(T{\bf R}^d_E)} are vector fields, then

\displaystyle  \omega(Z,W) = \omega_{ij} Z^i W^j.

It is clear that {\omega(Z_1,\dots,Z_k)} is a totally antisymmetric form in the {Z_1,\dots,Z_k}. If {\omega \in \Omega^k({\bf R}^d_E)} is a {k}-form for some {k \geq 1} and {Z \in \Gamma(T{\bf R}^d_E)} is a vector field, we define the contraction (or interior product) {Z \neg \omega \in \Omega^{k-1}({\bf R}^d_E)} in coordinates by the formula

\displaystyle  (Z \neg \omega)_{i_2 \dots i_k} := Z^{i_1} \omega_{i_1 \dots i_k}

or equivalently that

\displaystyle  (Z_1 \neg \omega)(Z_2,\dots,Z_k) = \omega(Z_1,\dots,Z_k)

for {Z_1,\dots,Z_k \in \Gamma(T{\bf R}^d_E)}. Thus for instance if {\omega \in \Omega^2({\bf R}^d_E)} is a {2}-form, and {Z \in \Gamma(T{\bf R}^d_E)} is a vector field, then {Z \neg \omega \in \Omega^1({\bf R}^d_E)} is the {1}-form

\displaystyle  (Z \neg \omega)_i = Z^j \omega_{ji}.

If {Z \in \Gamma(T{\bf R}^d_E)} is a vector field and {f \in \Omega^0({\bf R}^d_E)} is a continuously differentiable scalar field, then {Z \neg df = df(Z)} is just the directional derivative of {f} along the vector field {Z}:

\displaystyle  Z \neg df = Z^i \partial_i f.

The contraction {Z \neg \omega} is also denoted {\iota_Z \omega} in the literature. If one contracts a vector field {Z} against the standard volume form {d\mathrm{vol}_E}, one obtains a {d-1}-form which we will call (by slight abuse of notation) the Hodge dual {*Z} of {Z}:

\displaystyle  *Z := Z \neg d\mathrm{vol}_E.

This can easily be seen to be a bijection between vector fields and {d-1}-forms. The inverse of this operation will also be denoted by the Hodge star {*}:

\displaystyle  *(Z \neg d\mathrm{vol}_E) := Z.

In a similar spirit, the Hodge dual of a scalar field {f: {\bf R}^d_E \rightarrow {\bf R}} will be defined as the volume form

\displaystyle  *f := f d\mathrm{vol}_E

and conversely the Hodge dual of a volume form is a scalar field:

\displaystyle  *(f d\mathrm{vol}_E) = f.

More generally one can form a Hodge duality relationship between {k}-vector fields and {d-k}-forms for any {0 \leq k \leq d}, but we will not do so here as we will not have much use for the notion of a {k}-vector field for any {k>1}.

These operations behave well under pullback (if one assumes volume preservation in the case of the Hodge star):

Exercise 11

  • (i) If {\omega \in \Omega^k({\bf R}^d_E)} and {Z_1,\dots,Z_k \in \Gamma(T{\bf R}^d_E)}, show that

    \displaystyle  Y^* ( \omega(Z_1,\dots,Z_k) ) = (Y^* \omega)(Y^* Z_1, \dots, Y^* Z_k).

  • (ii) If {\omega \in \Omega^k({\bf R}^d_E)} for some {k \geq 1} and {Z \in \Gamma(T{\bf R}^d_E)}, show that

    \displaystyle  Y^*( Z \neg \omega ) = (Y^* Z) \neg (Y^* \omega).

  • (iii) If {Y} is volume-preserving, show that

    \displaystyle  Y^*( * T ) = * Y^* T

    whenever {T} is a scalar field, vector field, {d-1}-form, or {d}-form on {{\bf R}^d_E}.

Riemannian metrics. A Riemannian metric {g} on {{\bf R}^d_E}, when expressed in coordinates is a collection of scalar functions {g_{ij}: {\bf R}^d_E \rightarrow {\bf R}} such that for each point {x \in {\bf R}^d_E}, the matrix {(g_{ij}(x))_{1 \leq i,j \leq d}} is symmetric and strictly positive definite. In particular it has an inverse metric {g^{-1}}, which is a collection of scalar functions {(g^{-1})^{ij}(x) = g^{ij}(x)} such that

\displaystyle  g^{ij} g_{jk} = \delta^i_k

where {\delta} denotes the Kronecker delta; here we have abused notation (and followed the conventions of general relativity) by allowing the inverse on the metric to be omitted when expressed in coordinates (relying instead on the superscripting of the indices, as opposed to subscripting, to indicate the metric inversion). The Euclidean metric {\eta} is an example of a metric tensor, with {\eta_{ij}} equal to {1} when {i=j} and zero otherwise; the coefficients {\eta^{ij}} of the inverse Euclidean metric {\eta^{-1}} are similarly equal to {1} when {i=j} and {0} otherwise. Given two vector fields {Z,W \in \Gamma(T{\bf R}^d_E)} and a Riemannian metric {g}, we can form the scalar field {g(Z,W)} by

\displaystyle  g(Z,W) := g_{ij} Z^i W^j;

this is a symmetric bilinear form in {Z,W}.

We can define the pullback metric {Y^* g} by the formula

\displaystyle  (Y^* g)_{\alpha \beta}(a) := g_{ij}(Y(a)) \partial_\alpha Y^i(a) \partial_\beta Y^j(a); \ \ \ \ \ (17)

this is easily seen to be a Riemannian metric on {{\bf R}^d_L}, and one has the compatibility property

\displaystyle  Y^*( g(Z,W) ) = (Y^* g)(Y^* Z, Y^* W)

for all {Z,W \in \Gamma(T{\bf R}^d_E)}. It is then not difficult to check that if we pull back the inverse metric {g^{-1}} by the formula

\displaystyle  (Y^*(g^{-1}))^{\alpha \beta}(a) := g^{ij}(Y(a)) ((\nabla Y)^{-1})^\alpha_i(a) ((\nabla Y)^{-1})^\beta_j(a)

then we have the expected relationship

\displaystyle  Y^*(g^{-1}) = (Y^* g)^{-1}.

Exercise 12 If {\pi: {\bf R}^d_L \rightarrow {\bf R}^d_L} is a diffeomorphism, show that

\displaystyle  (Y \circ \pi)^* \omega = \pi^* Y^* \omega

for any {\omega \in \Omega^k({\bf R}^d_E)}, and similarly

\displaystyle  (Y \circ \pi)^* Z = \pi^* Y^* Z

for any {Z \in \Gamma(T{\bf R}^d_E)}, and

\displaystyle  (Y \circ \pi)^* g = \pi^* Y^* g

for any Riemannian metric {g}.

Exercise 13 Show that {Y: {\bf R}^d_L \rightarrow {\bf R}^d_E} is an isometry (with respect to the Euclidean metric on both {{\bf R}^d_L} and {{\bf R}^d_E}) if and only if {Y^* \eta = \eta}.

Every Riemannian metric {g} induces a musical isomorphism between vector fields on {{\bf R}^d_E} with {1}-forms: if {Z \in \Gamma(T {\bf R}^d_E)} is a vector field, the associated {1}-form {g \cdot Z \in \Omega^1({\bf R}^d_E)} (also denoted {Z^\flat_g} or simply {Z^\flat}) is defined in coordinates as

\displaystyle  (g \cdot Z)_i := g_{ij} Z^j

and similarly if {\theta \in \Omega^1({\bf R}^d_E)}, the associated vector field {g^{-1} \cdot \theta \in \Gamma(T{\bf R}^d_E)} (also denoted {\theta^\sharp_g} or {\theta^\sharp}) is defined in coordinates as

\displaystyle  (g^{-1} \cdot \theta)^i := g^{ij} \theta_j.

These operations clearly invert each other: {g^{-1} \cdot g \cdot Z = Z} and {g \cdot g^{-1} \cdot \theta = \theta}. Note that {g \cdot Z} can still be defined if {g} is not positive definite, though it might not be an isomorphism in this case. Observe the identities

\displaystyle  g(W,Z) = W \neg (g \cdot Z) = Z \neg (g \cdot W) = (g \cdot Z)(W) = (g \cdot W)(Z). \ \ \ \ \ (18)

The musical isomorphism interacts well with pullback, provided that one also pulls back the metric {g}:

Exercise 14 If {g} is a Riemannian metric, show that

\displaystyle  Y^*( g \cdot Z ) = Y^* g \cdot Y^* Z

for all {Z \in \Gamma(T {\bf R}^d_E)}, and

\displaystyle  Y^*( g^{-1} \cdot \theta ) = (Y^* g)^{-1} \cdot Y^* \theta.

for all {\theta \in \Omega^1({\bf R}^d_E)}.

We can now interpret some classical operations on vector fields in this differential geometry notation. For instance, if {Z,W \in \Gamma(T{\bf R}^d_E)} are vector fields, the dot product {Z \cdot W: {\bf R}^d_E \rightarrow {\bf R}} can be written as

\displaystyle  Z \cdot W = \eta(Z,W) = Z \neg (\eta \cdot W) = (\eta \cdot W)(Z)

and also

\displaystyle  Z \cdot W = *( (\eta \cdot Z) \wedge *W ),

and for {d=3}, the cross product {Z \times W: {\bf R}^3_E \rightarrow {\bf R}^3_E} can be written in differential geometry notation as

\displaystyle  Z \times W = *((\eta \cdot Z) \wedge (\eta \cdot W)).
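The Hodge-dual formula for the cross product can be checked directly against the classical componentwise definition; the following Python sketch (ours, not part of the notes) does so for two sample vectors:

```python
def cross(Z, W):
    """Classical cross product in R^3."""
    return [Z[1] * W[2] - Z[2] * W[1],
            Z[2] * W[0] - Z[0] * W[2],
            Z[0] * W[1] - Z[1] * W[0]]

Z, W = [1.0, -2.0, 0.5], [3.0, 0.25, -1.0]
# (eta.Z) ^ (eta.W) is the 2-form omega_{jk} = Z_j W_k - Z_k W_j  (here Z_i = Z^i)
omega = {(j, k): Z[j] * W[k] - Z[k] * W[j] for j in range(3) for k in range(3)}
# Hodge star: recover the vector c with c^i eps_{ijk} = omega_{jk}
star = [omega[(1, 2)], omega[(2, 0)], omega[(0, 1)]]
for i in range(3):
    assert abs(star[i] - cross(Z, W)[i]) < 1e-12
```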

Exercise 15 Formulate a definition for the pullback {Y^* T} of a rank {(k,l)} tensor field {T} (which in coordinates would be given by {T^{i_1 \dots i_k}_{j_1 \dots j_l}} for {i_1,\dots,i_k,j_1,\dots,j_l \in \{1,\dots,d\}}) that generalises the pullback of differential forms, vector fields, and Riemannian metrics. Argue why your definition is the natural one.

Lie derivatives. If {Z \in \Gamma(T {\bf R}^d_E)} is a continuously differentiable vector field and {\omega \in \Omega^k( {\bf R}^d_E )} is a continuously differentiable {k}-form, we define the Lie derivative {{\mathcal L}_Z \omega \in \Omega^k({\bf R}^d_E)} of {\omega} along {Z} by the Cartan formula

\displaystyle  {\mathcal L}_Z \omega := Z \neg d\omega + d(Z \neg \omega) \ \ \ \ \ (19)

with the convention that {d(Z \neg \omega)} vanishes if {\omega} is a {0}-form. Thus for instance, if {f \in \Omega^0({\bf R}^d_E)} then {{\mathcal L}_Z f = Z \neg df = Z^i \partial_i f}, while if {\theta \in \Omega^1({\bf R}^d_E)} then in coordinates

\displaystyle  ({\mathcal L}_Z \theta)_i = Z^j \partial_j \theta_i + (\partial_i Z^j) \theta_j. \ \ \ \ \ (20)

One can interpret the Lie derivative as the infinitesimal version of pullback:

Exercise 16 Let {u: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_E} be smooth and bounded (so that {u(t)} can be viewed as a smooth vector field on {{\bf R}^d_E} for each {t}), and let {X: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E} be a trajectory map. If {\omega \in \Omega^k({\bf R}^d_E)} is a smooth {k}-form, show that

\displaystyle  \partial_t (X(t)^* \omega) = X(t)^*( {\mathcal L}_{u(t)} \omega ).

More generally, if {\omega(t) \in \Omega^k({\bf R}^d_E)} is a smooth {k}-form that varies smoothly in {t}, show that

\displaystyle  \partial_t (X(t)^* \omega(t)) = X(t)^*( {\mathcal D}_t \omega(t) )

where {{\mathcal D}_t} denotes the material Lie derivative

\displaystyle  {\mathcal D}_t := \partial_t + {\mathcal L}_{u(t)}.

Note that the material Lie derivative specialises to the material derivative when applied to scalar fields. The above exercise shows that the trajectory map intertwines the ordinary time derivative {\partial_t} with the material (Lie) derivative.

Remark 17 If one lets {(t,x) \mapsto \exp(tZ) x} be the trajectory map associated to a time-independent vector field {Z} with initial condition (4) (thus {\exp(0 Z) x = x} and {\frac{d}{dt} \exp(tZ) x = Z( \exp(tZ) x)}), then the above exercise shows that {{\mathcal L}_Z \omega = \frac{d}{dt} \exp(tZ)^* \omega|_{t=0}} for any differential form {\omega}. This can be used as an alternate definition of the Lie derivative {{\mathcal L}_Z} (and has the advantage of readily extending to tensors other than differential forms, for which the Cartan formula is not available).

The Lie derivative behaves very well with respect to exterior product and exterior derivative:

Exercise 18 Let {Z \in \Gamma(T {\bf R}^d_E)} be continuously differentiable, and let {\omega \in \Omega^k({\bf R}^d_E), \alpha \in \Omega^l({\bf R}^d_E)} also be continuously differentiable. Establish the Leibniz rule

\displaystyle  {\mathcal L}_Z( \omega \wedge \alpha ) = ({\mathcal L}_Z \omega) \wedge \alpha + \omega \wedge {\mathcal L}_Z \alpha.

If {\omega} is twice continuously differentiable, also establish the commutativity

\displaystyle  {\mathcal L}_Z d \omega = d {\mathcal L}_Z \omega

of exterior derivative and Lie derivative.

Exercise 19 Let {Z \in \Gamma(T {\bf R}^d_E)} be continuously differentiable. Show that

\displaystyle  {\mathcal L}_Z d\mathrm{vol}_E = \mathrm{div}(Z) d\mathrm{vol}_E

where {\mathrm{div}(Z) := \partial_i Z^i} is the divergence of {Z}. Use this and Exercise 16 to give an alternate proof of Lemma 3.

Exercise 20 Let {Z \in \Gamma(T {\bf R}^d_E)} be continuously differentiable. For any smooth compactly supported volume form {\omega}, show that

\displaystyle  \int_{{\bf R}^d_E} {\mathcal L}_Z \omega = 0.

Conclude in particular that if {Z} is divergence-free then

\displaystyle  \int_{{\bf R}^d_E} ({\mathcal L}_Z f) d\mathrm{vol}_E = 0

for any {f\in C^\infty_c({\bf R}^d_E \rightarrow {\bf R})}.

The Lie derivative {{\mathcal L}_Z W \in \Gamma(T {\bf R}^d_E) } of a continuously differentiable vector field {W \in\Gamma(T {\bf R}^d_E)} is defined in coordinates as

\displaystyle  ({\mathcal L}_Z W)^i := Z^j \partial_j W^i - W^j \partial_j Z^i

and the Lie derivative {{\mathcal L}_Z g} of a continuously differentiable rank {(0,2)} tensor {g} is defined in coordinates as

\displaystyle  ({\mathcal L}_Z g)_{ij} := Z^k \partial_k g_{ij} + (\partial_i Z^k) g_{kj} + (\partial_j Z^k) g_{ik}.

Thus for instance the Lie derivative of the Euclidean metric {\eta_{ij}} is expressible in coordinates as

\displaystyle  ({\mathcal L}_Z \eta)_{ij} = \partial_i Z^j + \partial_j Z^i \ \ \ \ \ (21)

(compare with the deformation tensor used in Notes 0).

We have similar properties to Exercise 18:

Exercise 21 Let {Z \in \Gamma(T {\bf R}^d_E)} be continuously differentiable.

Exercise 22 If {Z \in \Gamma(T {\bf R}^d_E)} is continuously differentiable, establish the identity

\displaystyle  Y^* {\mathcal L}_Z \phi = {\mathcal L}_{Y^* Z} Y^* \phi

whenever {\phi} is a continuously differentiable differential form, vector field, or metric tensor.

Exercise 23 If {Z,W \in \Gamma(T {\bf R}^d_E)} are smooth, define the Lie bracket {[Z,W] \in \Gamma(T {\bf R}^d_E)} by the formula

\displaystyle  [Z,W] := {\mathcal L}_Z W.

Establish the anti-symmetry {[Z,W] = -[W,Z]} (so in particular {[Z,Z]=0}) and the Jacobi identity

\displaystyle  [[Z_1,Z_2],Z_3] + [[Z_2,Z_3],Z_1] + [[Z_3,Z_1],Z_2] = 0,

and also

\displaystyle  {\mathcal L}_{[Z,W]} \phi = {\mathcal L}_Z {\mathcal L}_W \phi - {\mathcal L}_W {\mathcal L}_Z \phi

whenever {Z,W,Z_1,Z_2,Z_3 \in \Gamma(T {\bf R}^d_E)} are smooth, and {\phi} is a smooth differential form, vector field, or metric tensor.
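For linear vector fields {Z(x) = Ax}, {W(x) = Bx}, the bracket is again linear with matrix {BA - AB} (from the coordinate formula {[Z,W]^i = Z^j \partial_j W^i - W^j \partial_j Z^i}), so the identities of Exercise 23 reduce to exact matrix algebra. The following Python sketch (our own illustration, not part of the notes) verifies antisymmetry and the Jacobi identity on sample integer matrices:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matsub(A, B):
    return [[A[i][j] - B[i][j] for j in range(len(A))] for i in range(len(A))]

def bracket(A, B):
    """Lie bracket of the linear vector fields Z(x)=Ax, W(x)=Bx.
    [Z,W]^i = Z^j d_j W^i - W^j d_j Z^i = ((BA - AB) x)^i, so the bracket
    is again linear, with matrix BA - AB."""
    return matsub(matmul(B, A), matmul(A, B))

A = [[0, 1, 0], [0, 0, 2], [1, 0, 0]]
B = [[1, 0, 3], [0, -1, 0], [0, 2, 0]]
C = [[2, 0, 0], [1, 1, 0], [0, 0, -1]]

# antisymmetry: [Z,W] = -[W,Z]
assert bracket(A, B) == [[-x for x in row] for row in bracket(B, A)]
# Jacobi identity: [[Z1,Z2],Z3] + [[Z2,Z3],Z1] + [[Z3,Z1],Z2] = 0
jac = [[bracket(bracket(A, B), C)[i][j]
        + bracket(bracket(B, C), A)[i][j]
        + bracket(bracket(C, A), B)[i][j] for j in range(3)] for i in range(3)]
assert jac == [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```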

Exercise 24 Formulate a definition for the Lie derivative {{\mathcal L}_Z T} of a (continuously differentiable) rank {(k,l)} tensor field {T} along a vector field {Z} that generalises the Lie derivative of differential forms, vector fields, and Riemannian metrics. Argue why your definition is the natural one.

— 2. The Euler equations in differential geometry notation —

Now we write the Euler equations (1) in the differential geometry language developed in the previous section. This will make it relatively painless to change coordinates. As in the rest of this set of notes, we work formally, assuming that all fields are smooth enough to justify the manipulations below.

The Euler equations involve a time-dependent scalar field {p(t)}, which can be viewed as an element of {\Omega^0({\bf R}^d_E)}, and a time-dependent velocity field {u(t)}, which can be viewed as an element of {\Gamma(T {\bf R}^d_E)}. The second of the Euler equations simply asserts that this vector field is divergence-free:

\displaystyle  \mathrm{div} u(t) = 0

or equivalently (by Exercise 19 and the definition of material Lie derivative {{\mathcal D}_t = \partial_t + {\mathcal L}_u})

\displaystyle  {\mathcal D}_t d\mathrm{vol}_E = 0.

For the first equation, it is convenient to work instead with the covelocity field {v(t) \in \Omega^1({\bf R}^d_E)}, formed by applying the Euclidean musical isomorphism to {u(t)}:

\displaystyle  v(t) = \eta \cdot u(t).

In coordinates, we have {v_i = \eta_{ij} u^j}, thus {v_i = u^i} for {i=1,\dots,d}. The Euler equations can then be written in coordinates as

\displaystyle  \partial_t v_i + u^j \partial_j v_i = -\partial_i p.

The left-hand side is close to the {i} component of the material Lie derivative {{\mathcal D}_t v = \partial_t v + {\mathcal L}_u v} of {v}. Indeed, from (20) we have

\displaystyle  ({\mathcal D}_t v)_i = \partial_t v_i + u^j \partial_j v_i + (\partial_i u^j) v_j

and so the first Euler equation becomes

\displaystyle  ({\mathcal D}_t v)_i = - \partial_i p + (\partial_i u^j) v_j.

Since {u^j = v_j}, we can express the right-hand side as a total derivative {- \partial_i \tilde p}, where {\tilde p} is the modified pressure

\displaystyle  \tilde p := p - \frac{1}{2} u^j v_j = p - \frac{1}{2} \eta(u,u).

We thus see that the Euler equations can be transformed to the system

\displaystyle  {\mathcal D}_t v = - d \tilde p \ \ \ \ \ (22)

\displaystyle  v = \eta \cdot u \ \ \ \ \ (23)

\displaystyle  {\mathcal D}_t d\mathrm{vol}_E = 0. \ \ \ \ \ (24)

Using the Cartan formula (19), one can also write (22) as

\displaystyle  \partial_t v + u \neg dv = - d p' \ \ \ \ \ (25)

where {p'} is another modification of the pressure:

\displaystyle  p' = \tilde p + u \neg v = p + \frac{1}{2} \eta(u,u).

In coordinates, (25) becomes

\displaystyle  \partial_t v_j + u^i (\partial_i v_j - \partial_j v_i) = - \partial_j p'. \ \ \ \ \ (26)

One advantage of the formulation (22)–(24) is that one can pull back by an arbitrary diffeomorphic change of coordinates (both time-dependent and time-independent), with the only things potentially changing being the material Lie derivative {{\mathcal D}_t}, the metric {\eta}, and the volume form {d\mathrm{vol}_E}. (Another, related, advantage is that this formulation readily suggests an extension to more general Riemannian manifolds, by replacing {\eta} with a general Riemannian metric and {d\mathrm{vol}_E} with the associated volume form, without the need to explicitly introduce other Riemannian geometry concepts such as covariant derivatives or Christoffel symbols.)

For instance, suppose {d=3}, and we wish to view the Euler equations in cylindrical coordinates {(r,\theta,z) \in [0,+\infty) \times {\bf R}/2\pi {\bf Z} \times {\bf R}}, thus pulling back under the time-independent map {Y: [0,+\infty) \times {\bf R}/2\pi {\bf Z} \times {\bf R} \rightarrow {\bf R}^3_E} defined by

\displaystyle  Y(r, \theta,z) := (r \cos \theta, r \sin \theta, z ).

Strictly speaking, this is not a diffeomorphism due to singularities at {r=0}, but we ignore this issue for now by only working away from the {z} axis {r=0}. As is well known, the metric {d\eta^2 = (dx^1)^2 + (dx^2)^2 + (dx^3)^2} pulls back under this change of coordinates {x^1 = r \cos \theta, x^2 = r \sin \theta, z = x^3} as

\displaystyle  d(Y^* \eta)^2 = dr^2 + r^2 d\theta^2 + dz^2,

thus the pullback metric {Y^* \eta} is diagonal in {r,\theta,z} coordinates with entries

\displaystyle  (Y^* \eta)_{rr} = 1; \quad (Y^* \eta)_{\theta \theta} = r^2; \quad (Y^* \eta)_{zz} = 1.
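One can recover these metric components numerically by evaluating the pullback formula (17) with a finite-difference Jacobian of the cylindrical coordinate map; a Python sketch of ours (not part of the notes):

```python
from math import cos, sin

H = 1e-5

def partial(g, i, x, h=H):
    xp, xm = list(x), list(x)
    xp[i] += h; xm[i] -= h
    return (g(xp) - g(xm)) / (2 * h)

def Y(a):                       # cylindrical coordinates (r, theta, z) -> Cartesian
    r, th, z = a
    return [r * cos(th), r * sin(th), z]

a0 = [1.7, 0.9, -0.4]           # a sample point away from the axis r = 0
J = [[partial(lambda b: Y(b)[i], al, a0) for al in range(3)] for i in range(3)]
# (Y* eta)_{alpha beta} = eta_{ij} d_alpha Y^i d_beta Y^j = sum_i J[i][al] J[i][be]
g = [[sum(J[i][al] * J[i][be] for i in range(3)) for be in range(3)]
     for al in range(3)]
expected = [[1, 0, 0], [0, a0[0] ** 2, 0], [0, 0, 1]]   # diag(1, r^2, 1)
for al in range(3):
    for be in range(3):
        assert abs(g[al][be] - expected[al][be]) < 1e-6
```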

The volume form {d\mathrm{vol}_E = dx \wedge dy \wedge dz} similarly pulls back to the familiar cylindrical coordinate volume form

\displaystyle  Y^* d\mathrm{vol}_E = r dr \wedge d \theta \wedge dz.

If (by slight abuse of notation) we write the components of {Y^* u} as {u^r, u^z, u^\theta}, and the components of {Y^* v} as {v_r, v_z, v_\theta}, then the second equation (23) in our current formulation of the Euler equations now becomes

\displaystyle  v_r = u^r; \quad v_\theta = r^2 u^\theta; \quad v_z = u^z \ \ \ \ \ (27)

and the third equation (24) is

\displaystyle  {\mathcal L}_u ( r dr \wedge d \theta \wedge dz ) = 0

which by the product rule and Exercise 19 becomes

\displaystyle  {\mathcal L}_u(r) + r \mathrm{div} u = 0

or after expanding in coordinates

\displaystyle  u^r + r (\partial_r u^r + \partial_\theta u^\theta + \partial_z u^z) = 0.

If one substitutes (27) into (26) in the {r,\theta,z} coordinates to eliminate the {v} variables, we thus see that the cylindrical coordinate form of the Euler equations is

\displaystyle  \partial_t u^r + u^\theta (\partial_\theta u^r - \partial_r(r^2 u^\theta)) + u^z (\partial_z u^r - \partial_r u^z) = - \partial_r p' \ \ \ \ \ (28)

\displaystyle  \partial_t (r^2 u^\theta) + u^r (\partial_r (r^2 u^\theta) - \partial_\theta u^r) + u^z (\partial_z (r^2 u^\theta) - \partial_\theta u^z) = - \partial_\theta p' \ \ \ \ \ (29)

\displaystyle  \partial_t u^z + u^r (\partial_r u^z - \partial_z u^r) + u^\theta (\partial_\theta u^z - \partial_z (r^2 u^\theta)) = - \partial_z p' \ \ \ \ \ (30)

\displaystyle  u^r + r (\partial_r u^r + \partial_\theta u^\theta + \partial_z u^z) = 0. \ \ \ \ \ (31)

One should compare how readily one can derive these equations using the differential geometry formalism with the more pedestrian approach using the chain rule:

Exercise 25 Starting with a smooth solution {(u,p)} to the Euler equations (1) in {{\bf R}^3_E}, and transforming to cylindrical coordinates {(r,\theta,z)}, establish the chain rule formulae

\displaystyle  u^1 = u^r \cos \theta - r u^\theta \sin \theta

\displaystyle  u^2 = u^r \sin \theta + r u^\theta \cos \theta

\displaystyle  u^3 = u^z

\displaystyle  \partial_1 = \cos \theta \partial_r - \frac{\sin \theta}{r} \partial_\theta

\displaystyle  \partial_2 = \sin \theta \partial_r + \frac{\cos \theta}{r} \partial_\theta

\displaystyle  \partial_3 = \partial_z

and use this and the identity

\displaystyle  p' := p + \frac{1}{2} (u^1 u^1 + u^2 u^2 + u^3 u^3)

to rederive the system (28)–(31) (away from the {z} axis) without using the language of differential geometry.

Exercise 26 Turkington coordinates {(x,y,\zeta) \in {\bf R} \times [0,+\infty) \times {\bf R}/2\pi {\bf Z}} are a variant of cylindrical coordinates {(r,\theta,z) \in [0,+\infty) \times {\bf R}/2\pi{\bf Z} \times {\bf R}}, defined by the formulae

\displaystyle  (x,y,\zeta) := (z,r^2/2, \theta);

the advantage of these coordinates is that the map from Cartesian coordinates {(x^1,x^2,x^3)} to Turkington coordinates {(x,y,\zeta)} is volume preserving. Show that in these coordinates, the Euler equations become

\displaystyle  \partial_t u^x + u^y (\partial_y u^x - \partial_x(\frac{u^y}{2y})) + u^\zeta (\partial_\zeta u^x - \partial_x(2y u^\zeta)) = - \partial_x p'

\displaystyle  \partial_t (\frac{u^y}{2y}) + u^x (\partial_x (\frac{u^y}{2y}) - \partial_y u^x) + u^\zeta (\partial_\zeta (\frac{u^y}{2y}) - \partial_y (2y u^\zeta)) = - \partial_y p'

\displaystyle  \partial_t (2yu^\zeta) + u^x (\partial_x (2yu^\zeta) - \partial_\zeta u^x) + u^y (\partial_y (2yu^\zeta) - \partial_\zeta (\frac{u^y}{2y})) = - \partial_\zeta p'

\displaystyle  \partial_x u^x + \partial_y u^y + \partial_\zeta u^\zeta = 0.

(These coordinates are particularly useful for studying solutions to Euler that are “axisymmetric with swirl”, in the sense that the fields {u^x, u^y, u^\zeta, p'} do not depend on the {\zeta} variable, so that all the terms involving {\partial_\zeta} vanish; one can specialise further to the case of solutions that are “axisymmetric without swirl”, in which case {u^\zeta} also vanishes.)
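The volume-preserving claim for Turkington coordinates amounts to the map from Cartesian coordinates having unit Jacobian determinant, which a quick numerical check confirms. The following Python sketch is our own illustration (it uses {\mathrm{atan2}(x^2,x^1)} as a local choice of the angle variable):

```python
from math import atan2

H = 1e-6

def partial(g, i, x, h=H):
    xp, xm = list(x), list(x)
    xp[i] += h; xm[i] -= h
    return (g(xp) - g(xm)) / (2 * h)

def T(x):
    """Cartesian (x^1, x^2, x^3) -> Turkington (x, y, zeta) = (z, r^2/2, theta)."""
    return [x[2], (x[0] ** 2 + x[1] ** 2) / 2, atan2(x[1], x[0])]

x0 = [0.9, 0.5, -1.2]             # a sample point away from the axis
J = [[partial(lambda p: T(p)[i], j, x0) for j in range(3)] for i in range(3)]
det = (J[0][0] * (J[1][1] * J[2][2] - J[1][2] * J[2][1])
       - J[0][1] * (J[1][0] * J[2][2] - J[1][2] * J[2][0])
       + J[0][2] * (J[1][0] * J[2][1] - J[1][1] * J[2][0]))
assert abs(det - 1.0) < 1e-6      # unit Jacobian: the change of variables is volume-preserving
```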

We can use the differential geometry formalism to formally verify the conservation laws of the Euler equation. We begin with conservation of energy

\displaystyle  E(t) := \frac{1}{2} \int_{{\bf R}^d_E} |u|^2\ d\mathrm{vol}_E = \frac{1}{2} \int_{{\bf R}^d_E} (g^{-1} \cdot v) \neg v\ d\mathrm{vol}_E.

Formally differentiating this in time (and noting that the form {(g^{-1} \cdot w) \neg v= g^{ij} v_i w_j} is symmetric in {v,w}) we have

\displaystyle  \partial_t E(t) = \int_{{\bf R}^d_E} (g^{-1} \cdot v) \neg \partial_t v \ d\mathrm{vol}_E = \int_{{\bf R}^d_E} u \neg \partial_t v\ d\mathrm{vol}_E.

Using (22), we can write

\displaystyle  u \neg \partial_t v = - u \neg {\mathcal L}_u v - u \neg dp'.

From the Cartan formula (19) one has {u \neg dp' = {\mathcal L}_u p'}; from Exercise 23 one has {{\mathcal L}_u u = 0}, and hence by the Leibniz rule (Exercise 18(i)) we thus can write {u \neg \partial_t v} as a total derivative:

\displaystyle  u \neg \partial_t v = {\mathcal L}_u ( - u \neg v - p' ).

From Exercise 20 we thus formally obtain the conservation law {\partial_t E = 0}.

Now suppose that {Z \in \Gamma(T {\bf R}^d_E)} is a time-independent vector field that is a Killing vector field for the Euclidean metric {\eta}, by which we mean that

\displaystyle  {\mathcal L}_Z \eta = 0.

Taking traces in (21), this implies in particular that {Z} is divergence-free, or equivalently

\displaystyle  {\mathcal L}_Z d\mathrm{vol}_E = 0.

(Geometrically, this implication arises because the volume form {d\mathrm{vol}_E} can be constructed from the Euclidean metric {\eta} (up to a choice of orientation).) Consider the formal quantity

\displaystyle  M(t) := \int_{{\bf R}^d_E} (Z \neg v)\ d\mathrm{vol}_E.

As {v} is the only time-dependent quantity here, we may formally differentiate to obtain

\displaystyle  \partial_t M(t) = \int_{{\bf R}^d_E} (Z \neg \partial_t v)\ d\mathrm{vol}_E

Using (22), the right-hand side is

\displaystyle  - \int_{{\bf R}^d_E} (Z \neg {\mathcal L}_u v) + (Z \neg dp')\ d\mathrm{vol}_E.

By Cartan’s formula, {Z \neg dp'} is a total derivative {{\mathcal L}_Z p'}, and hence this contribution to the integral formally vanishes as {Z} is divergence-free. The quantity {Z \neg {\mathcal L}_u v} can be written using the Leibniz rule as the difference of the total derivative {{\mathcal L}_u (Z \neg v)} and the quantity {{\mathcal L}_u Z \neg v}. The former quantity also gives no contribution to the integral as {u} is divergence free, thus

\displaystyle  \partial_t M(t) = \int_{{\bf R}^d_E} {\mathcal L}_u Z \neg v\ d\mathrm{vol}_E.

By Exercise 23, we have {{\mathcal L}_u Z = - {\mathcal L}_Z u = - {\mathcal L}_Z(\eta^{-1} \cdot v)}. Since {\eta} (and hence {\eta^{-1}}) is annihilated by {{\mathcal L}_Z}, and the form {(\eta^{-1} \cdot v) \neg w = \eta^{ij} v_i w_j} is symmetric in {v,w}, we can express {{\mathcal L}_Z(\eta^{-1} \cdot v) \neg v} as a total derivative

\displaystyle  {\mathcal L}_Z(\eta^{-1} \cdot v) \neg v = \frac{1}{2} {\mathcal L}_Z ( (\eta^{-1} \cdot v) \neg v ),

and so this integral also vanishes. Thus we obtain the conservation law {\partial_t M(t) = 0}. If we set the Killing vector field {Z} equal to the constant vector field {\frac{d}{dx^i}} for some {i=1,\dots,d}, we obtain conservation of the momentum components

\displaystyle  \int_{{\bf R}^d_E} u^i\ d\mathrm{vol}_E

for {i=1,\dots,d}; if we instead set the Killing vector field {Z} equal to the rotation vector field {x^i \frac{d}{dx^j} - x^j \frac{d}{dx^i}} (which one can easily verify to be Killing using (21)) we obtain conservation of the angular momentum components

\displaystyle  \int_{{\bf R}^d_E} x^i u^j - x^j u^i\ d\mathrm{vol}_E

for {i,j =1,\dots,d}. Unfortunately, this essentially exhausts the supply of Killing vector fields:

Exercise 27 Let {Z} be a smooth Killing vector field of the Euclidean metric {\eta}. Show that {Z} is a linear combination (with real coefficients) of the constant vector fields {\frac{d}{dx^i}}, {i=1,\dots,d} and the rotation vector fields {x^i \frac{d}{dx^j} - x^j \frac{d}{dx^i}}, {i,j=1,\dots,d}. (Hint: use (21) to show that all the second derivatives of components of {Z} vanish.)
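For the flat Euclidean metric, the Killing equation from (21) reduces to {\partial_i Z_j + \partial_j Z_i = 0}. A short sympy computation (an informal check, with the dilation field {x^i \frac{d}{dx^i}} thrown in as a non-example; it is not in the text) confirms that the constant and rotation fields satisfy it:

```python
import sympy as sp

x = sp.symbols('x1:4', real=True)  # coordinates on R^3

def lie_eta(Z):
    """Deformation tensor (L_Z eta)_{ij} = d_i Z_j + d_j Z_i for the flat metric;
    Z is Killing iff this vanishes identically."""
    return sp.Matrix(3, 3, lambda i, j: sp.diff(Z[j], x[i]) + sp.diff(Z[i], x[j]))

translation = sp.Matrix([1, 0, 0])       # d/dx^1
rotation = sp.Matrix([-x[1], x[0], 0])   # x^1 d/dx^2 - x^2 d/dx^1
dilation = sp.Matrix([x[0], x[1], x[2]]) # non-example: not Killing

print(lie_eta(translation).is_zero_matrix)  # True
print(lie_eta(rotation).is_zero_matrix)     # True
print(lie_eta(dilation).is_zero_matrix)     # False
```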

The vorticity {2}-form {\omega(t) \in \Omega^2( {\bf R}^d_E)} is defined as the exterior derivative of the covelocity:

\displaystyle  {}\omega := dv.

It already made an appearance in Notes 3 from the previous quarter. Taking exterior derivatives of (22) using (10) and Exercise 18 we obtain the appealingly simple vorticity equation

\displaystyle  {\mathcal D}_t \omega = 0. \ \ \ \ \ (32)

In two and three dimensions we may take the Hodge dual {*\omega} of the vorticity {2}-form to obtain either a scalar field (in dimension {d=2}) or a vector field (in dimension {d=3}), and then Exercise 18(iv) implies that

\displaystyle  {\mathcal D}_t *\omega = 0. \ \ \ \ \ (33)

In two dimensions, this gives us a lot of conservation laws, since one can apply the scalar chain rule to then formally conclude that

\displaystyle  {\mathcal D}_t F( *\omega) = 0

for any {F: {\bf R} \rightarrow {\bf R}}, which upon integration on {{\bf R}^2_E} using Exercise 20 gives the conservation law

\displaystyle  \partial_t \int_{{\bf R}^2_E} F(*\omega)\ d\mathrm{vol}_E = 0

for any such function {F}. Thus for instance the {L^p({\bf R}^2_E \rightarrow {\bf R})} norms of {*\omega} are formally conserved for every {0 < p < \infty}, and hence also for {p=\infty} by a limiting argument, recovering Proposition 24 from Notes 3 of the previous quarter.

In three dimensions there is also an interesting conservation law involving the vorticity. Observe that the wedge product {v \wedge \omega} of the covelocity and the vorticity is a {3}-form and can thus be integrated over {{\bf R}^3_E}. The helicity

\displaystyle  H(t) := \int_{{\bf R}^3_E} v(t) \wedge \omega(t) \ \ \ \ \ (34)

is a formally conserved quantity of the Euler equations. Indeed, formally differentiating and using Exercise 20 we have

\displaystyle  \partial_t H(t) = \int_{{\bf R}^3_E} {\mathcal D}_t ( v \wedge \omega).

From the Leibniz rule and (32) we have

\displaystyle  {\mathcal D}_t ( v \wedge \omega) = ({\mathcal D}_t v) \wedge \omega.

Applying (22) we can write this expression as {-d\tilde p \wedge \omega}. From (10) we have {d\omega=0}, hence this expression is also a total derivative {-d(\tilde p \omega)}. From Stokes’ theorem (14) we thus formally obtain the conservation of helicity: {H(t) = H(0)}; this was first observed by Moreau.

Exercise 28 Formally verify the conservation of momentum, angular momentum, and helicity directly from the original form (1) of the Euler equations.

Exercise 29 In even dimensions {d \geq 2}, show that the integral {\int_{{\bf R}^d_E} \bigwedge^{d/2} \omega(t)} (formed by taking the exterior product of {d/2} copies of {\omega}) is conserved by the flow, while in odd dimensions {d \geq 3}, show that the generalised helicity {\int_{{\bf R}^d_E} v(t) \wedge \bigwedge^{\frac{d-1}{2}} \omega(t)} is conserved by the flow. (This observation is due to Denis Serre, as well as unpublished work of Tartar.)

As it turns out, there are no further conservation laws for the Euler equations in Eulerian coordinates that are linear or quadratic integrals of the velocity field and its derivatives, at least in three dimensions; see this paper of Denis Serre. In particular, the Euler equations are not believed to be completely integrable. (But there are a few more conserved integrals of motion in the Lagrangian formalism; see Exercise 40.)

Exercise 30 Let {u: [0,T) \times {\bf R}^3_E \rightarrow {\bf R}^3_E} be a smooth solution to the Euler equations in three dimensions {{\bf R}^3_E}, let {*\omega} be the vorticity vector field, and let {f: [0,T) \times {\bf R}^3_E \rightarrow {\bf R}} be an arbitrary smooth scalar field. Establish Ertel’s theorem

\displaystyle  D_t( *\omega \cdot \nabla f ) = *\omega \cdot \nabla(D_t f).
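Ertel's theorem can be checked symbolically on an explicit solution. The sketch below (an informal verification, not a proof of the exercise) uses the steady shear flow {u = (0,0,g(x^1,x^2))}, which is the DiPerna-Majda flow of Exercise 39 with {f=0}, together with an arbitrary scalar field {\varphi}:

```python
import sympy as sp

t, x1, x2, x3 = sp.symbols('t x1 x2 x3', real=True)
g = sp.Function('g')(x1, x2)           # arbitrary shear profile
phi = sp.Function('phi')(t, x1, x2, x3)  # arbitrary scalar field

# Steady shear solution of Euler: u = (0, 0, g(x1, x2)).
u = sp.Matrix([0, 0, g])
# Hodge dual of the vorticity: *omega = curl u = (d2 g, -d1 g, 0).
w = sp.Matrix([sp.diff(g, x2), -sp.diff(g, x1), 0])

xs = [x1, x2, x3]
def D_t(h):  # material derivative d_t + u . nabla
    return sp.diff(h, t) + sum(u[i] * sp.diff(h, xs[i]) for i in range(3))

def w_grad(h):  # *omega . nabla
    return sum(w[i] * sp.diff(h, xs[i]) for i in range(3))

lhs = D_t(w_grad(phi))
rhs = w_grad(D_t(phi))
print(sp.simplify(lhs - rhs))  # expect 0
```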

Exercise 31 (Clebsch variables) Let {u: [0,T) \times {\bf R}^d_E \rightarrow {\bf R}^d_E} be a smooth solution to the Euler equations. Suppose that at time zero, the covelocity {v(0)} takes the form

\displaystyle  v(0) = \sum_{j=1}^k \theta_j(0) d \varphi_j(0)

for some smooth scalar fields {\theta_j(0), \varphi_j(0): {\bf R}^d_E \rightarrow {\bf R}}. Show that at all later times {t}, the covelocity can be expressed in the same form up to an exact term, {v(t) = \sum_{j=1}^k \theta_j(t) d \varphi_j(t) + dq(t)}, where the Clebsch variables {\theta_j, \varphi_j} are transported by the flow (thus {D_t \theta_j = D_t \varphi_j = 0}) and {q(t): {\bf R}^d_E \rightarrow {\bf R}} is a smooth scalar field.

— 3. Viewing the Euler equations in Lagrangian coordinates —

Throughout this section, {(u,p)} is a smooth solution to the Euler equations on {[0,T) \times {\bf R}^d_E}, and {X} is a volume-preserving trajectory map.

We pull back the Euler equations (22), (23), (24) to create a Lagrangian velocity field {\underline{u}: [0,T) \rightarrow \Gamma(T {\bf R}^d_L)}, a Lagrangian covelocity field {\underline{v}: [0,T) \rightarrow \Omega^1({\bf R}^d_L)}, a Lagrangian modified pressure field {\underline{\tilde p}: [0,T) \times {\bf R}^d_L \rightarrow {\bf R}}, and a Lagrangian vorticity field {\underline{\omega}: [0,T) \rightarrow \Omega^2({\bf R}^d_L)} by the formulae

\displaystyle  \underline{u}(t) := X(t)^* u(t)

\displaystyle  \underline{v}(t) := X(t)^* v(t)

\displaystyle  \underline{\omega}(t) := X(t)^* \omega(t) \ \ \ \ \ (35)

\displaystyle  \underline{\tilde p}(t) := X(t)^* \tilde p(t).

By Exercise 16, the Euler equations now take the form

\displaystyle  \partial_t \underline{v} = - d \underline{\tilde p} \ \ \ \ \ (36)

\displaystyle  \mathrm{div} \underline{u} = 0

\displaystyle  \underline{v} = (X^* \eta) \cdot \underline{u}

and the vorticity is given by

\displaystyle  \underline{\omega} = d \underline{v}

and obeys the vorticity equation

\displaystyle  \partial_t \underline{\omega} = 0.

We thus see that the Lagrangian vorticity {\underline{\omega}} is a pointwise conserved quantity:

\displaystyle  \underline{\omega}_{\alpha \beta}(t, a) = \underline{\omega}_{\alpha\beta}(0, a). \ \ \ \ \ (37)

This lets us solve for the Eulerian vorticity {\omega_{ij}} in terms of the trajectory map. Indeed, from (12), (35) we have

\displaystyle  \underline{\omega}_{\alpha \beta}(0,a) = \underline{\omega}_{\alpha \beta}(t,a) = \omega_{ij}(t,X(t,a)) \partial_\alpha X(t)^i(a) \partial_\beta X(t)^j(a);

applying the inverse {(\nabla X(t))^{-1}} of the linear transformation {\nabla X(t)}, we thus obtain the Cauchy vorticity formula

\displaystyle  \omega_{ij}(t,X(t,a)) = \underline{\omega}_{\alpha \beta}(0,a) (\nabla X(t)^{-1})^\alpha_i(a) (\nabla X(t)^{-1})^\beta_j(a). \ \ \ \ \ (38)

If we normalise the trajectory map by (4), then {\underline{\omega}(0) = \omega(0)}, and we thus have

\displaystyle  \omega_{ij}(t,X(t,a)) = \omega_{\alpha \beta}(0,a) (\nabla X(t)^{-1})^\alpha_i(a) (\nabla X(t)^{-1})^\beta_j(a). \ \ \ \ \ (39)

Thus for instance, we see that the support of the vorticity is transported by the flow:

\displaystyle  \mathrm{supp}(\omega(t)) = X(t)( \mathrm{supp}(\omega(0)) ).

Among other things, this shows that the volume and topology of the support of the vorticity remain constant in time. It also suggests that the Euler equations admit a number of “vortex patch” solutions in which the vorticity is compactly supported.

Exercise 32 Assume the normalisation (4).

  • (i) In the two-dimensional case {d=2}, show that the Cauchy vorticity formula simplifies to

    \displaystyle  \omega_{12}(t,X(t,a)) = \omega_{12}(0, a).

    Thus in this case, vorticity is simply transported by the flow.

  • (ii) In the three-dimensional case {d=3}, show that the Cauchy vorticity formula can be written using the Hodge dual {*\omega} of the vorticity as

    \displaystyle  *\omega^i(t, X(t,a)) = *\omega^\alpha(0,a) \partial_\alpha X^i( t, a ).

    Thus we see that the vorticity is transported and also stretched by the flow, with the stretching given by the matrix {\partial_\alpha X^i}.
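The three-dimensional Cauchy vorticity formula of part (ii) can also be verified symbolically on an explicit flow. The following sympy sketch (an informal check, assuming the normalisation (4)) uses the steady shear flow {u = (0,0,g(x^1,x^2))}, whose trajectory map is {X(t,a) = (a^1, a^2, a^3 + t g(a^1,a^2))}:

```python
import sympy as sp

t, a1, a2, a3 = sp.symbols('t a1 a2 a3', real=True)
g = sp.Function('g')

# Trajectory map of the steady shear flow u = (0, 0, g(x1, x2)).
X = sp.Matrix([a1, a2, a3 + t * g(a1, a2)])

# Hodge dual of the vorticity, *omega = (d2 g, -d1 g, 0); for this steady flow
# it is time independent, and since X^1 = a1, X^2 = a2 its value at X(t,a)
# equals its value at a.
w0 = sp.Matrix([sp.diff(g(a1, a2), a2), -sp.diff(g(a1, a2), a1), 0])
w_at_X = w0

# Right-hand side of the vector Cauchy formula: *omega^alpha(0,a) d_alpha X^i.
dX = X.jacobian([a1, a2, a3])  # dX[i, alpha] = d_alpha X^i
rhs = dX * w0

print(sp.simplify(w_at_X - rhs))  # expect the zero vector
```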

One can also phrase the conservation of vorticity in an integral form. If {S} is a two-dimensional oriented surface in {{\bf R}^3_L} that does not vary in time, then from (37) we see that the integral

\displaystyle  \int_S \underline{\omega}(t)

is formally conserved in time:

\displaystyle  \int_S \underline{\omega}(t) = \int_S \underline{\omega}(0).

Composing this with the trajectory map {X(t)} using (35), we conclude that

\displaystyle  \int_{X(t)(S)} \omega(t) = \int_{X(0)(S)} \omega(0).

Writing {\omega = dv} and using Stokes’ theorem (14), we arrive at the Kelvin circulation theorem

\displaystyle  \int_{X(t)(\partial S)} v(t) = \int_{X(0)(\partial S)} v(0).

The integral of the covelocity {v} along a loop {\gamma} is known as the circulation of the fluid along the loop; the Kelvin circulation theorem then asserts that this circulation remains constant over time as long as the loop evolves along the flow.

Exercise 33 (Cauchy invariants)

For more discussion of Cauchy’s investigation of the Cauchy invariants and vorticity formula, see this article of Frisch and Villone.

Exercise 34 (Transport of vorticity lines) Suppose we are in three dimensions {d=3}, so that the Hodge dual {* \omega} of vorticity is a vector field. A smooth curve {\gamma} (either infinite on both ends, or a closed loop) in {{\bf R}^3_E} is said to be a vortex line (or vortex ring, in the case of a closed loop) at time {t} if at every point {x} of the curve {\gamma}, the tangent to {\gamma} at {x} is parallel to the vorticity {*\omega(t,x)} at that point. Suppose that the trajectory map is normalised using (4). Show that if {\gamma} is a vortex line at time {0}, then {X(t)(\gamma)} is a vortex line at any other time {t}; thus, vortex lines (or vortex rings) flow along with the fluid.

Exercise 35 (Conservation of helicity in Lagrangian coordinates)

  • (i) In any dimension, establish the identity

    \displaystyle  \partial_t ( \underline{v}(t) \wedge \underline{\omega}(t) ) = - d( \underline{\tilde p} \underline{\omega}(t) )

    in Lagrangian spacetime.

  • (ii) Conclude that in three dimensions {d=3}, the quantity

    \displaystyle  \int_{{\bf R}^3_L} \underline{v}(t) \wedge \underline{\omega}(t)

    is formally conserved in time. Explain why this conserved quantity is the same as the helicity (34).

  • (iii) Continue assuming {d=3}. Define a vortex tube at time {t} to be a region {T \subset {\bf R}^3_E} in which, at every point {x} on the boundary {\partial T}, the vorticity vector field {*\omega(t,x)} is tangent to {\partial T}. Show that if {X(0)(T)} is a vortex tube at time {0}, then {X(t)(T)} is a vortex tube at time {t}, and the helicity {\int_{X(t)(T)} v(t) \wedge \omega(t)} on the vortex tube is formally conserved in time.
  • (iv) Let {d=3}. If the covelocity {v} can be expressed in Clebsch variables (Exercise 31) with {k=1}, show that the local helicity {\int_{X(t)(T)} v(t) \wedge \omega(t)} formally vanishes on every vortex tube {T}. This provides an obstruction to the existence of {k=1} Clebsch variables. (On the other hand, it is easy to find Clebsch variables on {{\bf R}^d} with {k=d} for an arbitrary covelocity {v}, simply by setting {\varphi_j} equal to the coordinate functions {\varphi_j(x) = x^j}.)

Exercise 36 In the three-dimensional case {d=3}, show that the material derivative {D_t} commutes with the operation {*\omega \cdot \nabla} of differentiation along the (Hodge dual of the) vorticity.

The Cauchy vorticity formula (39) can be used to obtain an integral representation for the velocity {u} in terms of the trajectory map {X}, leading to the vorticity-stream formulation of the Euler equations. Recall from 254A Notes 3 that if one takes the divergence of the (Eulerian) vorticity {\omega}, one obtains the Laplacian of the (Eulerian) covelocity {v}:

\displaystyle  \Delta v_j = \partial^i \omega_{ij},

where {\partial^i := \eta^{ik} \partial_k} are the partial derivatives raised by the Euclidean metric. For {d > 2}, we can use the fundamental solution {\frac{-1}{(d-2)|S^{d-1}|} \frac{1}{|x|^{d-2}}} of the Laplacian (see Exercise 18 of 254A Notes 1) to conclude that (formally, at least)

\displaystyle  v_j(t,x) = \frac{-1}{(d-2)|S^{d-1}|} \int_{{\bf R}^d_E} \frac{\partial^i \omega_{ij}(t,y)}{|x-y|^{d-2}}\ d\mathrm{vol}_E(y).

Integrating by parts (after first removing a small ball around {x}, and observing that the boundary terms from this ball go to zero as one shrinks the radius to zero) one obtains the Biot-Savart law

\displaystyle  v_j(t,x) = \frac{1}{|S^{d-1}|} \int_{{\bf R}^d_E} \frac{(x^i-y^i) \omega_{ij}(t,y)}{|x-y|^{d}}\ d\mathrm{vol}_E(y)

for the covelocity, or equivalently

\displaystyle  u^k(t,x) = \frac{1}{|S^{d-1}|} \int_{{\bf R}^d_E} \frac{\eta^{jk} (x^i-y^i) \omega_{ij}(t,y)}{|x-y|^{d}}\ d\mathrm{vol}_E(y)

for the velocity.
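The identity {\Delta v_j = \partial^i \omega_{ij}} underlying this derivation can be confirmed symbolically in the two-dimensional case by writing the divergence-free covelocity in terms of a stream function (a sanity check under the assumptions {d=2} and the flat metric, so that raising indices is trivial):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
psi = sp.Function('psi')(x1, x2)  # stream function: guarantees div v = 0

v = [-sp.diff(psi, x2), sp.diff(psi, x1)]        # covelocity components v_1, v_2
omega12 = sp.diff(v[1], x1) - sp.diff(v[0], x2)  # vorticity (dv)_{12}

xs = [x1, x2]
lap = lambda h: sum(sp.diff(h, xi, 2) for xi in xs)

# omega_{ij} is antisymmetric: omega_{11} = omega_{22} = 0, omega_{21} = -omega_{12}.
omega = [[0, omega12], [-omega12, 0]]
residuals = [
    sp.simplify(lap(v[j]) - sum(sp.diff(omega[i][j], xs[i]) for i in range(2)))
    for j in range(2)
]
print(residuals)  # expect [0, 0]
```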

Exercise 37 Show that this law is also valid in the two-dimensional case {d=2}.
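As an informal numerical test of the two-dimensional Biot-Savart law, one can take the Gaussian vortex {\omega(y) = e^{-|y|^2}}, for which the exact velocity field is tangential with speed {(1-e^{-r^2})/(2r)} at radius {r}. The quadrature below (midpoint rule on a truncated domain; both choices are assumptions not made in the text) should recover this approximately:

```python
import numpy as np

# Gaussian vortex omega(y) = exp(-|y|^2).
h = 0.05
s = np.arange(-6, 6, h) + h / 2        # cell centres of [-6, 6]^2
Y1, Y2 = np.meshgrid(s, s, indexing='ij')
omega = np.exp(-(Y1**2 + Y2**2))

x = np.array([1.0, 0.0])               # evaluation point, r = 1
d1, d2 = x[0] - Y1, x[1] - Y2
r2 = d1**2 + d2**2
# 2d Biot-Savart: u(x) = (1/2pi) int (x-y)^perp omega(y) / |x-y|^2 dy,
# with (x-y)^perp = (-(x2-y2), x1-y1).
u1 = (h**2 / (2 * np.pi)) * np.sum(-d2 * omega / r2)
u2 = (h**2 / (2 * np.pi)) * np.sum(d1 * omega / r2)

exact = (1 - np.exp(-1.0)) / 2         # exact tangential speed at r = 1
print(u1, u2, exact)
```

At the point {(1,0)} the exact velocity is {(0, (1-e^{-1})/2) \approx (0, 0.316)}, and the quadrature agrees with this up to discretisation error.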

Changing to Lagrangian variables, we conclude that

\displaystyle  u^k(t,X(t,a)) = \frac{1}{|S^{d-1}|} \int_{{\bf R}^d_L}

\displaystyle  \eta^{jk} \frac{(X^i(t,a)-X^i(t,b)) \omega_{ij}(t,X(t,b))}{|X(t,a)-X(t,b)|^{d}}\ d\mathrm{vol}_L(b).

Using the Cauchy vorticity formula (39) (assuming the normalisation (4)), we obtain

\displaystyle  u^k(t,X(t,a)) = \frac{1}{|S^{d-1}|} \int_{{\bf R}^d_L}

\displaystyle \frac{(X^i(t,a)-X^i(t,b)) (\nabla X(t)^{-1})^\alpha_i(b) (\nabla X(t)^{-1})^\beta_j(b)}{|X(t,a)-X(t,b)|^{d}} \omega_{\alpha \beta}(0,b) \ d\mathrm{vol}_L(b).

Combining this with (3), we obtain an integral-differential equation for the evolution of the trajectory map:

\displaystyle  \partial_t X^k(t,a) = \frac{1}{|S^{d-1}|} \int_{{\bf R}^d_L} \ \ \ \ \ (42)

\displaystyle  \eta^{jk} \frac{(X^i(t,a)-X^i(t,b)) (\nabla X(t)^{-1})^\alpha_i(b) (\nabla X(t)^{-1})^\beta_j(b)}{|X(t,a)-X(t,b)|^{d}} \omega_{\alpha \beta}(0,b) \ d\mathrm{vol}_L(b).

This is known as the vorticity-stream formulation of the Euler equations. In two and three dimensions, the formulation can be simplified using the alternate forms of the vorticity formula in Exercise 32. While the equation (42) looks complicated, it is actually well suited for Picard-type iteration arguments (of the type used in 254A Notes 1), due to the relatively small number of derivatives on the right-hand side. Indeed, it turns out that one can iterate this equation with the trajectory map placed in function spaces such as {C^0_t C^{1,\alpha}_x( [0,T) \times {\bf R}^d_L \rightarrow {\bf R}^d_E)}; see Chapter 4 of Bertozzi-Majda for details.

Remark 38 Because of the ability to solve the Euler equations in Lagrangian coordinates by an iteration method, the local well-posedness theory is slightly stronger in some respects in Lagrangian coordinates than it is in Eulerian coordinates. For instance, in this paper of Constantin, Kukavica, and Vicol it is shown that the Lagrangian coordinate Euler equations are well-posed in Gevrey spaces, while the Eulerian coordinate Euler equations are not. It also happens that the trajectory maps {X(t,a)} are real-analytic in {t} even if the initial data is merely smooth; see for instance this paper of Constantin-Vicol-Wu and the references therein. An example of this phenomenon is given in the exercise below.

Exercise 39 (DiPerna-Majda example) Let {f: {\bf R} \rightarrow {\bf R}} and {g: {\bf R}^2 \rightarrow {\bf R}} be smooth functions.

  • (i) Show that the DiPerna-Majda flow {u: {\bf R}^3 \rightarrow {\bf R}^3} defined by

    \displaystyle  u(t,x) = (f(x_2), 0, g( x_1- t f(x_2), x_2))

    solves the three-dimensional Euler equations (with zero pressure).

  • (ii) Show that the trajectory map with initial condition (4) is given by

    \displaystyle  X(t,a) = (a_1 + t f(a_2), a_2, a_3 + t g(a_1, a_2) );

    in particular the trajectory map is analytic in the time variable {t}, even though the Eulerian velocity field {u} need not be.

  • (iii) Show that the Lagrangian covelocity field {\underline{v}} is given by

    \displaystyle  \underline{v} = f(a_2) da_1 + g(a_1,a_2) da_3 + d( \frac{t}{2} f^2(a_2) + \frac{t}{2} g^2(a_1,a_2) )

    and the Lagrangian vorticity field {\underline{\omega}} is given by

    \displaystyle  \underline{\omega} = - f'(a_2) da_1 \wedge da_2 + dg(a_1,a_2) \wedge da_3.

    In particular the Lagrangian vorticity is conserved in time (as it ought to be).
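Part (i) of the exercise can be verified mechanically with a computer algebra system. The following sympy sketch (an informal check, not a substitute for the exercise) confirms that the DiPerna-Majda field is divergence-free and satisfies the zero-pressure Euler equations:

```python
import sympy as sp

t, x1, x2, x3 = sp.symbols('t x1 x2 x3', real=True)
f = sp.Function('f')
g = sp.Function('g')

# DiPerna-Majda flow (Exercise 39(i)) with zero pressure.
u = sp.Matrix([f(x2), 0, g(x1 - t * f(x2), x2)])

xs = [x1, x2, x3]
# Euler with p = 0: d_t u + (u . nabla) u = 0 and div u = 0.
material = sp.Matrix([
    sp.diff(u[i], t) + sum(u[j] * sp.diff(u[i], xs[j]) for j in range(3))
    for i in range(3)
])
div = sum(sp.diff(u[i], xs[i]) for i in range(3))

print(sp.simplify(material))  # expect the zero vector
print(sp.simplify(div))       # expect 0
```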

Exercise 40 Show that the integral

\displaystyle \int_{{\bf R}^d_L} \eta_{ij} (X^j(t,a) - a^\alpha \partial_\alpha X^j(t,a) - \frac{d+2}{2} t \partial_t X^j(t,a))

\displaystyle  \partial_t X^i(t,a)\ d\mathrm{vol}_L(a)

is formally conserved in time. (Hint: some of the terms arising from computing the derivative are more easily treated by moving to Eulerian coordinates and performing integration by parts there, rather than in Lagrangian coordinates. One can also proceed by rewriting the terms in this integral using the Eulerian covelocity {v} and the Lagrangian covelocity {\underline{v}}.) With the normalisation (4), conclude in particular that

\displaystyle  \int_{{\bf R}^d_E} x^i v_i(t,x)\ d\mathrm{vol}_E(x) = \int_{{\bf R}^d_L} a^\alpha \underline{v}_\alpha(t,a)\ d\mathrm{vol}_L(a)

\displaystyle + \frac{d+2}{2} t \int_{{\bf R}^d_E} |u(t,x)|^2\ d\mathrm{vol}_E(x).

This conservation law is related to a scaling symmetry of the Euler equations in Lagrangian coordinates, and is due to Shankar. It does not have a local expression in purely Eulerian coordinates (mainly because of the appearance of the label coordinate {a^\alpha}).

We summarise the dictionary between Eulerian and Lagrangian coordinates in the following table: