You Could Have Invented De Bruijn Indices
Posted by Mike Shulman
One of the most curious things about formal treatments of type theory is the attention paid to dummy variables. The point is simple enough to explain to a calculus student: just as the integrals and are equal, so are the functions and — the name of the variable doesn’t matter. (In fact, the former statement factors through the latter.) But managing this equality (which goes by the highbrow name of “-equivalence”) turns out to be quite fiddly when actually implementing a typechecker, and people have developed all sorts of different-sounding ways to deal with it, with names like De Bruijn indices, De Bruijn levels, locally named terms, locally nameless terms, and so on.
In this post I want to explain what’s going on from a categorical and/or univalent perspective, and why some of these fancy-sounding things are really quite simple. To cut to the punchline, at bottom what’s going on is just different ways to choose a concrete representative of an object that’s determined only up to unique isomorphism.
Let’s consider possibly the simplest case: terms of the lambda-calculus. These are generated inductively by three operations:
- Variables are terms
- If and are terms, so is the application .
- If is a term and a variable, then the abstraction is a term.
A usage of a variable inside a is called bound, while other usages of are called free. Thus in the usage of is bound but the usage of is free. Semantically, the free variables must be given meaning in a context such as , while the bound variables are “dummies” that don’t appear in the context.
I just described the traditional approach, where one first defines all the terms at once, then defines what the free variables of a term are. But at the end of the day, what we really want to have is, for each context of (perhaps typed) variables, the set of terms containing those variables free. Or more precisely, containing at most those variables free, since for instance makes sense in the context even though it doesn’t actually use the variable .
Thus, from a typed perspective, it would be better to define the sets of terms indexed by the set of variables in their context. These are called well-scoped terms. Since we are ordinary mathematicians who think structurally, of course it doesn’t matter what these variables “actually are”, so we can just suppose them to be drawn from an abstract set . Now we intend to give an inductive definition of the set of “terms with free variables from ”, with an arbitrary set, and the first two clauses are obvious:
- For any set , if , then .
- For any set , if and , then .
The third clause is less clear, since in an abstraction the input term must have as a free variable, while is no longer free in . In other words, must have one more variable in its context than , so it’s natural to say:
- For any set , if , then .
Here denotes the coproduct, or disjoint union, of with a one-element set, denoted . Note that when phrased like this, we don’t need to annotate the with the variable name ; we know what the bound variable is, it’s just the extra element of that comes from .
The point is that, categorically speaking, doesn’t denote a specific set, but only a set determined up to unique isomorphism; and thus we could potentially make different choices about exactly what set to use. But of course, from a structural point of view, the choice doesn’t matter. So to start with, let’s proceed purely structurally. This means that it doesn’t make sense to ask whether the elements of are “the same as” any of the elements of , or the element of . Instead, we have two functions and , and any element of can be written uniquely as either for some or as for the element of .
To see what sort of terms we end up with, consider for example the ordinary term “”. This is a closed term with no free variables, so it must live in . Its outer operation is an abstraction, so it must be obtained from the constructor applied to some term in that corresponds to “”. This latter term is also an abstraction, so it must also be obtained from applied to some term in . This last term must be a variable, i.e. an element of , and there are two such elements: and . Since comes from the rightmost copy of , it corresponds to the most recently bound variable (which we were originally writing “”); while comes from the other copy of , hence corresponds to the next most recently bound variable (our “”).
Thus, our term would be written , while the similar but distinct term would be written . The examples at the next level of binding should make the pattern clear:
- is .
- is .
- is .
In general, whenever we need to use a bound variable, we “work out” along the structure of the term counting the number of s that we pass until we get to the one where our desired variable was bound. If we passed such s (not counting the one where our variable was bound), then the representation of our variable is .
The number is called the De Bruijn index of our variable. Thus, we see that De Bruijn indices are really the most natural representation of bound variables from a structural point of view. Usually people just write “” instead of , so our first two examples would be and . As for the free variables, when referring to an element inside a term in , we have to represent it by where is the number of s inside which the reference occurs; thus would be .
Other representations of variable binding arise by working non structurally, allowing ourselves to make more specific choices of “”. Specifically, suppose we fix a particular infinite ambient set , a subset of its powerset containing at least one element of each finite cardinality, and an operation together with a family of functions exhibiting each as a coproduct of with a one-element set. Then we can similarly define the family of sets of terms , for all , by analogous clauses:
- For any , if , then .
- For any , if and , then .
- For any , if , then .
Some popular choies of include the following.
Let , with consisting of the initial segments , and and the obvious inclusion, . This representation is called De Bruijn levels: in a term without free variables, each variable is identified by the number of s we have to cross to get to its binder, counting in from the outside of the term rather than out from the variable usage. Thus, for instance, is while is . When there are free variables, they are given an ordering and identified by their position in that ordering, with bound variables numbered after the free ones. Note that unlike De Bruijn indices, with De Bruijn levels each variable has the same representation regardless of where it is used in the term.
Let be as before, but now define by . This yields De Bruijn indices again, with the free variables ordered and numbered counting out from the term, i.e. coming after all the bound ones.
Fix an infinite set that is disjoint from , let , and let consist of all finite subsets of such that is an initial segment. If , with , then define , with acting as the identity on and as on . If we consider only the terms whose free variables are subset of , this yields a representation known as locally nameless, where bound variables are given De Bruijn indices but free ones are named by an element of some arbitrary set . One popular choice is to take as well, with required to be an initial segment too; then the free variables are represented as De Bruijn levels.
All of these representations are isomorphic to each other, simply because their only difference is that they make different choices about how to “materially” represent the structural operation . (More precisely, given a bijection between a set in one representation and a set in another representation, there is an induced bijection .) In particular, none of these representations need a separate notion of “-equivalence” to identify terms by “renaming bound variables”, because they all have a unique way to name bound variables.
But what about ordinary “named variables”, where we do have a choice about what bound variable names to use? One way to obtain these terms in a well-scoped way is with the following inductive definition:
- For any set , if , then .
- For any set , if and , then .
- For any set and element , if , then .
Note that since there is a choice of in the third clause, we now have to notate the variable in the term , as expected. This choice also means that now we are no longer defining the same family of sets, but will have to quotient by -equivalence to obtain something isomorphic. (We could also achieve that from the start by making this a higher inductive definition, with a fourth constructor saying that whenever .)
The requirement that forbids “variable shadowing”. That is, if is already in scope at the point of a -abstraction, either as a free variable or as a variable bound by a containing , we are not allowed to re-bind it again: we can’t write . I believe this is roughly what’s known as the Barendregt convention for named variables.
Another approach to named variables instead allows re-use of names, where inner binders “shadow” outer ones; thus would be -equivalent to , whereas inside the inner of a term there is no way to refer to the outer any more. We could achieve something like this by simply dropping the requirement that :
- For any set and element , if , then .
Then if it happens that already, we have , so the inner term has no more variable names to refer to than it used to. But the fact that the cardinality of the context sometimes doesn’t go up by one when we pass across a binder is kind of icky. I haven’t seen anyone really try to work with well-scoped terms like this, but it seems likely to cause all sorts of problems, in particular in defining and quotienting by -equivalence to obtain something isomorphic to the simpler representations.
Another approach to variable shadowing would be to mix names with De Bruijn indices: given a set of variable names, let and the set of finite subsets of whose fiber over each point of is an initial segment of . Then given and , if the fiber of over is , let be , and define by and the identity on all other fibers. Now the third clause becomes
- For any and , if , then .
Now we still only need to decorate the binder with the element of (rather than an element of ), and if we identify with we can use named variables as usual, with shadowing: the term means , where refers to the innermost . But shadowed variables are still available by incrementing the index: in the variable refers to the outer binder.
I doubt anyone would use this representation in practice: from a formal point of view, it seems likely to have all the drawbacks of both named variables and De Bruijn indices. However, I think it’s a nice way to understand what “named variables” are actually doing, and how they relate to things like De Bruijn indices, De Bruijn levels, locally nameless variables, and so on. (This discussion also suggests that in principle, it should be possible to abstract over a “method of handling variables”; a few years ago I wrote down some experiments in this direction here, but never pursued it further.)
Anyway, the point I wanted to make is that all these weird-looking things are “really just” different representations of an invariant structural operation. This doesn’t make them any less important as far as implementation goes: when actually coding things in a computer, we often do need to choose particular representations. But maybe it makes them a bit less scary to a category theorist.
Re: You Could Have Invented De Bruijn Indices
What does the notation v_x mean? Is it just notation for “the variable called x”? So in the “traditional” formulas it would just be written x?