### You Could Have Invented De Bruijn Indices

#### Posted by Mike Shulman

One of the most curious things about formal treatments of type theory is the attention paid to dummy variables. The point is simple enough to explain to a calculus student: just as the integrals $\int_0^2 x^3 \, dx$ and $\int_0^2 t^3 \, dt$ are equal, so are the functions $\lambda x. x^3$ and $\lambda t. t^3$ — the name of the variable doesn’t matter. (In fact, the former statement factors through the latter.) But managing this equality (which goes by the highbrow name of “$\alpha$-equivalence”) turns out to be quite fiddly when actually implementing a typechecker, and people have developed all sorts of different-sounding ways to deal with it, with names like *De Bruijn indices*, *De Bruijn levels*, *locally named terms*, *locally nameless terms*, and so on.

In this post I want to explain what’s going on from a categorical and/or univalent perspective, and why some of these fancy-sounding things are really quite simple. To cut to the punchline, at bottom what’s going on is just different ways to choose a concrete representative of an object that’s determined only up to unique isomorphism.

Let’s consider possibly the simplest case: terms of the lambda-calculus. These are generated inductively by three operations:

- Variables are terms
- If $M$ and $N$ are terms, so is the application $M N$.
- If $M$ is a term and $x$ a variable, then the abstraction $\lambda x. M$ is a term.
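In code, this traditional presentation is just an ordinary inductive datatype, with one constructor per clause. Here is a minimal Python sketch (the class names are my own choices, not standard):

```python
from dataclasses import dataclass

# Raw lambda-terms with named variables, one class per clause.
@dataclass(frozen=True)
class Var:
    name: str            # variables are terms

@dataclass(frozen=True)
class App:
    fn: object
    arg: object          # application M N

@dataclass(frozen=True)
class Lam:
    var: str
    body: object         # abstraction λx. M, recording the bound name

# λx. y x: the use of x is bound, the use of y is free.
example = Lam("x", App(Var("y"), Var("x")))
```

Note that the derived equality here is bare syntactic equality, so `Lam("x", Var("x"))` and `Lam("t", Var("t"))` compare *unequal*: this is exactly the $\alpha$-equivalence problem.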

A usage of a variable $x$ inside a $\lambda x$ is called **bound**, while other usages of $x$ are called **free**. Thus in $\lambda x. y x$ the usage of $x$ is bound but the usage of $y$ is free. Semantically, the free variables must be given meaning in a *context* such as $y:A, z:B$, while the bound variables are “dummies” that don’t appear in the context.

I just described the traditional approach, where one first defines all the terms at once, then defines what the free variables of a term are. But at the end of the day, what we really want to have is, for each context of (perhaps typed) variables, the set of terms containing those variables free. Or more precisely, containing *at most* those variables free, since for instance $\lambda x. y x$ makes sense in the context $y:A, z:B$ even though it doesn’t actually use the variable $z$.

Thus, from a typed perspective, it would be better to define the *sets* of terms *indexed* by the set of variables in their context. These are called *well-scoped* terms. Since we are ordinary mathematicians who think structurally, of course it doesn’t matter what these variables “actually are”, so we can just suppose them to be drawn from an abstract set $V$. Now we intend to give an inductive definition of the set $Tm_V$ of “terms with free variables from $V$”, with $V$ an arbitrary set, and the first two clauses are obvious:

- For any set $V$, if $x\in V$, then $v_x \in Tm_V$.
- For any set $V$, if $M\in Tm_V$ and $N\in Tm_V$, then $M N \in Tm_V$.

The third clause is less clear, since in an abstraction $\lambda x.M$ the input term $M$ must have $x$ as a free variable, while $x$ is no longer free in $\lambda x.M$. In other words, $M$ must have *one more variable* in its context than $\lambda x.M$, so it’s natural to say:

- For any set $V$, if $M\in Tm_{V+1}$, then $\lambda M \in Tm_V$.

Here $V+1$ denotes the coproduct, or disjoint union, of $V$ with a one-element set, denoted $1$. Note that when phrased like this, we don’t need to annotate the $\lambda$ with the variable name $x$; we know what the bound variable is, it’s just the extra element of $V+1$ that comes from $1$.

The point is that, categorically speaking, $V+1$ doesn’t denote a *specific* set, but only a set *determined up to unique isomorphism*; and thus we could potentially make different choices about exactly what set to use. But of course, from a structural point of view, the choice doesn’t matter. So to start with, let’s proceed purely structurally. This means that it doesn’t make sense to ask whether the elements of $V+1$ are “the same as” any of the elements of $V$, or the element of $1$. Instead, we have two functions $inl : V\to V+1$ and $inr : 1\to V+1$, and any element of $V+1$ can be written uniquely as either $inl(x)$ for some $x\in V$ or as $inr(\star)$ for $\star$ the element of $1$.
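These clauses can be transcribed almost literally into code. In the Python sketch below (names mine), an element of $V+1$ is represented by an explicit tag: `("inr", "*")` for the new element $inr(\star)$, and `("inl", x)` for an injected old element $inl(x)$.

```python
from dataclasses import dataclass

# Well-scoped terms: a term's variables are *values* drawn from its
# context V; under a binder the body's variables come from V + 1,
# modelled literally as tagged pairs.
INR = ("inr", "*")       # the freshly bound variable, inr(⋆)

def inl(x):              # the inclusion V -> V + 1
    return ("inl", x)

@dataclass(frozen=True)
class V:
    var: object          # if x ∈ V, then v_x is a term over V

@dataclass(frozen=True)
class App:
    fn: object
    arg: object          # application, both parts over the same V

@dataclass(frozen=True)
class Lam:
    body: object         # the body is a term over V + 1; note that no
                         # variable name is needed on the binder

# The closed term λx.λy.x: its one variable is inl(inr ⋆).
k = Lam(Lam(V(inl(INR))))
```

Compare `Lam(Lam(V(INR)))`, which is the distinct closed term $\lambda x.\lambda y. y$.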

To see what sort of terms we end up with, consider for example the ordinary term “$\lambda x. \lambda y. x$”. This is a *closed* term with no free variables, so it must live in $Tm_\emptyset$. Its outer operation is an abstraction, so it must be obtained from the constructor $\lambda$ applied to some term in $Tm_{\emptyset+1}$ that corresponds to “$\lambda y.x$”. This latter term is also an abstraction, so it must also be obtained from $\lambda$ applied to some term in $Tm_{(\emptyset+1)+1}$. This last term must be a variable, i.e. an element of $(\emptyset+1)+1$, and there are two such elements: $inr(\star)$ and $inl(inr(\star))$. Since $inr(\star)$ comes from the rightmost copy of $1$, it corresponds to the most recently bound variable (which we were originally writing “$y$”); while $inl(inr(\star))$ comes from the other copy of $1$, hence corresponds to the next most recently bound variable (our “$x$”).

Thus, our term $\lambda x. \lambda y. x$ would be written $\lambda \lambda \, v_{inl(inr(\star))}$, while the similar but distinct term $\lambda x. \lambda y. y$ would be written $\lambda \lambda \, v_{inr(\star)}$. The examples at the next level of binding should make the pattern clear:

- $\lambda x. \lambda y. \lambda z. x$ is $\lambda\lambda\lambda\, v_{inl(inl(inr(\star)))}$.
- $\lambda x. \lambda y. \lambda z. y$ is $\lambda\lambda\lambda\, v_{inl(inr(\star))}$.
- $\lambda x. \lambda y. \lambda z. z$ is $\lambda\lambda\lambda\, v_{inr(\star)}$.

In general, whenever we need to use a bound variable, we “work out” along the structure of the term counting the number of $\lambda$s that we pass until we get to the one where our desired variable was bound. If we passed $n$ such $\lambda$s (not counting the one where our variable was bound), then the representation of our variable is $v_{inl^{n}(inr(\star))}$.

The number $n$ is called the **De Bruijn index** of our variable. Thus, we see that De Bruijn indices are really the *most natural* representation of bound variables from a structural point of view. Usually people just write “$n$” instead of $v_{inl^{n}(inr(\star))}$, so our first two examples would be $\lambda\lambda 1$ and $\lambda\lambda 0$. As for the free variables, when referring to an element $x\in V$ inside a term in $Tm_V$, we have to represent it by $inl^n(x)$ where $n$ is the number of $\lambda$s inside which the reference occurs; thus $\lambda y.\lambda z. x$ would be $\lambda\lambda\, v_{inl(inl(x))}$.
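This counting is easy to make executable. A hedged Python sketch (function names mine), using a tagged representation of $V+1$ in which `("inr", "*")` is $inr(\star)$ and `("inl", x)` is $inl(x)$: the numeric index of a bound variable $inl^n(inr(\star))$ is just the number $n$ of `inl` wrappers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class V:
    var: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

@dataclass(frozen=True)
class Lam:
    body: object

def var_index(v):
    """De Bruijn index of a bound variable inl^n(inr ⋆): count the inls."""
    n = 0
    while v[0] == "inl":
        n += 1
        v = v[1]
    assert v == ("inr", "*"), "free variables are out of scope in this sketch"
    return n

def to_indices(term):
    """Print a closed well-scoped term in numeric De Bruijn notation."""
    if isinstance(term, V):
        return str(var_index(term.var))
    if isinstance(term, App):
        return f"({to_indices(term.fn)} {to_indices(term.arg)})"
    return "λ" + to_indices(term.body)
```

For instance, `to_indices(Lam(Lam(V(("inl", ("inr", "*"))))))` prints `λλ1` for $\lambda x.\lambda y.x$, while the term for $\lambda x.\lambda y.y$ prints `λλ0`, matching the examples above.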

Other representations of variable binding arise by working *non*-structurally, allowing ourselves to make more specific choices of “$V+1$”. Specifically, suppose we fix a particular infinite ambient set $W$, a subset $C \subseteq P W$ of its powerset containing at least one element of each finite cardinality, and an operation $suc : C\to C$ together with a family of functions $inl: V\to suc(V)$ exhibiting each $suc(V)$ as a coproduct of $V$ with a one-element set. Then we can similarly define the family of sets of terms $Tm_V$, for all $V\in C$, by analogous clauses:

- For any $V\in C$, if $x\in V$, then $v_x \in Tm_V$.
- For any $V\in C$, if $M\in Tm_V$ and $N\in Tm_V$, then $M N \in Tm_V$.
- For any $V\in C$, if $M\in Tm_{suc(V)}$, then $\lambda M \in Tm_V$.

Some popular choices of $W$, $C$, $suc$, and $inl$ include the following.

- **De Bruijn levels**: Let $W=\mathbb{N}$, with $C$ consisting of the initial segments $[n] = \{0,1,2,\dots, n-1\}$, with $suc([n]) = [n+1]$ and $inl$ the obvious inclusion $inl(k)=k$. In a term without free variables, each variable is identified by the number of $\lambda$s we have to cross to get to its binder, counting *in* from the outside of the term rather than out from the variable usage. Thus, for instance, $\lambda x. \lambda y. x$ is $\lambda\lambda v_0$ while $\lambda x. \lambda y. y$ is $\lambda\lambda v_1$. When there are free variables, they are given an ordering and identified by their position in that ordering, with bound variables numbered after the free ones. Note that, unlike De Bruijn indices, with De Bruijn levels each variable has the same representation regardless of where it is used in the term.

- **De Bruijn indices**, again: Let $W$, $C$, and $suc$ be as before, but now define $inl : [n] \to [n+1]$ by $inl(k) = k+1$. This yields De Bruijn *indices* once more, with the free variables ordered and numbered counting out from the term, i.e. coming after all the bound ones.

- **Locally nameless**: Fix an infinite set $F$ that is disjoint from $\mathbb{N}$, let $W=F \cup \mathbb{N}$, and let $C$ consist of all finite subsets of $W$ such that $V\cap \mathbb{N}$ is an initial segment. If $V = V_F \cup [n]$, with $[n] = V\cap \mathbb{N}$, then define $suc(V) = V_F \cup [n+1]$, with $inl$ acting as the identity on $V_F$ and as $inl(k)=k+1$ on $[n]$. Considering only the terms whose free variables are a subset of $F$, we obtain the representation in which bound variables are given De Bruijn indices but free ones are named by elements of some arbitrary set $F$. One popular choice is to take $F=\mathbb{N}$ as well, with $V_F$ also required to be an initial segment; then the free variables are represented as De Bruijn *levels*.
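The relationship between the two numberings is a one-line arithmetic fact: if a variable has De Bruijn index $n$ and its usage sits under $d$ binders, its binder's level is $d - 1 - n$. A hedged Python sketch (names mine), reading levels off a structural term in which a bound variable is the tagged value `("inl", ..., ("inr", "*"))`, i.e. $inl^n(inr(\star))$:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class V:
    var: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

@dataclass(frozen=True)
class Lam:
    body: object

def count_inls(v):
    """The number of inl wrappers around a variable: its De Bruijn index."""
    n = 0
    while v[0] == "inl":
        n += 1
        v = v[1]
    return n

def to_levels(term, depth=0):
    """Print a closed term with De Bruijn *levels*: the variable bound at
    depth d (counting in from the outside, from 0) is printed as d
    everywhere, i.e. as depth - 1 - index at each usage."""
    if isinstance(term, V):
        return str(depth - 1 - count_inls(term.var))
    if isinstance(term, App):
        return f"({to_levels(term.fn, depth)} {to_levels(term.arg, depth)})"
    return "λ" + to_levels(term.body, depth + 1)
```

Now $\lambda x.\lambda y.x$ prints as `λλ0` and $\lambda x.\lambda y.y$ as `λλ1`, and each variable is printed with the same number wherever it occurs in the term.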

All of these representations are isomorphic to each other, simply because their only difference is that they make different choices about how to “materially” represent the structural operation $V+1$. (More precisely, given a bijection between a set $V\in C$ in one representation and a set $V'\in C'$ in another representation, there is an induced bijection $Tm_V \cong Tm'_{V'}$.) In particular, none of these representations need a separate notion of “$\alpha$-equivalence” to identify terms by “renaming bound variables”, because they all have a *unique* way to name bound variables.

But what about ordinary “named variables”, where we *do* have a choice about what bound variable names to use? One way to obtain these terms in a well-scoped way is with the following inductive definition:

- For any set $V$, if $x\in V$, then $v_x \in Tm_V$.
- For any set $V$, if $M\in Tm_V$ and $N\in Tm_V$, then $M N \in Tm_V$.
- For any set $V$ and element $x\notin V$, if $M\in Tm_{V\cup \{x\}}$, then $\lambda x. M \in Tm_V$.

Note that since there is a choice of $x$ in the third clause, we now have to notate the variable $x$ in the term $\lambda x.M$, as expected. This choice also means that now we are no longer defining the same family of sets, but will have to quotient by $\alpha$-equivalence to obtain something isomorphic. (We could also achieve that from the start by making this a higher inductive definition, with a fourth constructor saying that $\lambda x.M = \lambda y.M$ whenever $x,y\notin V$.)

The requirement that $x\notin V$ forbids “variable shadowing”. That is, if $x$ is already in scope at the point of a $\lambda$-abstraction, either as a free variable or as a variable bound by a containing $\lambda$, we are not allowed to bind it again: we can’t write $\lambda x. \lambda x. x$. I believe this is roughly what’s known as the **Barendregt convention** for named variables.

Another approach to named variables instead allows re-use of names, where inner binders “shadow” outer ones; thus $\lambda x. \lambda x. x$ would be $\alpha$-equivalent to $\lambda x. \lambda y. y$, whereas inside the inner $M$ of a term $\lambda x. \lambda x. M$ there is *no* way to refer to the outer $x$ any more. We could achieve something like this by simply dropping the requirement that $x\notin V$:

- For any set $V$ and element $x$, if $M\in Tm_{V\cup \{x\}}$, then $\lambda x. M \in Tm_V$.

Then if it happens that $x\in V$ already, we have $V \cup \{x\} = V$, so the inner term $M$ has no more variable names to refer to than it used to. But the fact that the cardinality of the context *sometimes* doesn’t go up by one when we pass across a binder is kind of icky. I haven’t seen anyone really try to work with well-scoped terms like this, but it seems likely to cause all sorts of problems, in particular in defining and quotienting by $\alpha$-equivalence to obtain something isomorphic to the simpler representations.
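In practice, implementations of named terms often decide $\alpha$-equivalence (and resolve shadowing) by translating to De Bruijn indices relative to the stack of enclosing binder names; looking a name up innermost-first automatically gives the shadowing behaviour just described. A hedged Python sketch (class and function names mine), for closed terms only:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

@dataclass(frozen=True)
class Lam:
    var: str
    body: object

def to_db(term, env=()):
    """Translate a named term to nested-tuple De Bruijn form; env is the
    stack of enclosing binder names, innermost first.  tuple.index finds
    the innermost match, so repeated names shadow outer ones."""
    if isinstance(term, Var):
        return ("var", env.index(term.name))
    if isinstance(term, App):
        return ("app", to_db(term.fn, env), to_db(term.arg, env))
    return ("lam", to_db(term.body, (term.var,) + env))

def alpha_eq(m, n):
    """Two closed named terms are α-equivalent iff they translate equally."""
    return to_db(m) == to_db(n)
```

In particular `alpha_eq` identifies $\lambda x.\lambda x.x$ with $\lambda x.\lambda y.y$, since both translate to $\lambda\lambda\,0$.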

Another approach to variable shadowing would be to mix names with De Bruijn indices: given a set $F$ of variable names, let $W = F \times \mathbb{N}$ and $C$ the set of finite subsets of $W$ whose fiber over each point of $F$ is an initial segment of $\mathbb{N}$. Then given $V\in C$ and $x\in F$, if the fiber of $V$ over $x$ is $\{x\} \times [n]$, let $V + x$ be $V \cup \{(x,n)\}$, and define $inl : V \to V+x$ by $inl(x,k) = (x,k+1)$ and the identity on all other fibers. Now the third clause becomes

- For any $V\in C$ and $x\in F$, if $M\in Tm_{V+x}$, then $\lambda x. M \in Tm_V$.

We still only need to decorate the binder $\lambda$ with the element $x$ of $F$ (rather than an element of $W = F\times \mathbb{N}$), and if we identify $x\in F$ with $(x,0)\in W$ we can use named variables as usual, with shadowing: the term $\lambda x. \lambda x. x$ means $\lambda x. \lambda x. v_{(x,0)}$, where $v_{(x,0)}$ refers to the innermost $\lambda x$. But shadowed variables are still available by incrementing the index: in $\lambda x. \lambda x. v_{(x,1)}$ the variable refers to the outer binder.
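As a sanity check that this mixed scheme resolves variables unambiguously, here is a small hedged Python sketch (names mine): a variable $(x,k)$, meaning “the name $x$, skipping $k$ shadowing binders”, resolves to an ordinary De Bruijn index given the stack of enclosing binder names.

```python
def resolve(env, x, k):
    """Resolve the mixed-representation variable (x, k) against env, the
    enclosing binder names innermost first: skip k binders named x and
    return the ordinary De Bruijn index of the next one, or None if
    fewer than k+1 binders named x are in scope."""
    matches = [i for i, y in enumerate(env) if y == x]
    return matches[k] if k < len(matches) else None
```

So inside $\lambda x.\lambda x$ the stack is `["x", "x"]`: `resolve(["x", "x"], "x", 0)` is `0` (the inner binder), while `resolve(["x", "x"], "x", 1)` is `1` (the outer one).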

I doubt anyone would use this representation in practice: from a formal point of view, it seems likely to have all the *drawbacks* of both named variables and De Bruijn indices. However, I think it’s a nice way to understand what “named variables” are actually doing, and how they relate to things like De Bruijn indices, De Bruijn levels, locally nameless variables, and so on. (This discussion also suggests that in principle, it should be possible to abstract over a “method of handling variables”; a few years ago I wrote down some experiments in this direction here, but never pursued it further.)

Anyway, the point I wanted to make is that all these weird-looking things are “really just” different representations of an invariant structural operation. This doesn’t make them any less important as far as implementation goes: when actually coding things in a computer, we often do need to choose particular representations. But maybe it makes them a bit less scary to a category theorist.

## Re: You Could Have Invented De Bruijn Indices

What does the notation v_x mean? Is it just notation for “the variable called x”? So in the “traditional” formulas it would just be written x?