composition.al

My first fifteen compilers

In my last post, I wrote about a few ways that people use the word “transpiler”. In this post, I’ll offer a more personal take on the topic, based on my own experience of learning compiler development.

The first compiler I ever worked on was the one I wrote in the spring of 2009 for Kent Dybvig’s graduate compilers course at Indiana University. Actually, I didn’t write just one compiler for Kent’s course that semester; I wrote fifteen compilers, one for each week of the course. The first one had an input language that was more or less just parenthesized assembly language; its target language was x86-64 assembly. Each week, we added more passes to the front of the previous week’s compiler, resulting in a new compiler with the same target language as the compiler of the previous week, but a slightly higher-level input language.1 By the end of the course, I had a compiler that compiled a substantial subset of Scheme to x86-64, structured as forty small passes. Each pass translated from its input language to a slightly lower-level language, or had the same input and output language but performed some analysis or optimization on it.

All this was made a lot easier because we used the “nanopass” approach to compiler development, supported by a nanopass framework that has since been open-sourced. The nanopass framework provides what’s more or less a domain-specific language for developing compilers that are structured as a series of small passes with well-defined input and output languages. The framework encourages you to build a compiler by first defining intermediate languages, then defining the compiler passes that will translate between them. It provides facilities for doing this in a low-overhead way.

We can think of the nanopass approach as taking the idea of parser combinator libraries, in which a parser is built up out of several smaller parsers, and extending that idea to the development of an entire compiler. With a parser combinator library, you write a parser by starting with a bunch of primitive parsers (say, that parse numbers or characters) and combining them, eventually building up the ability to parse a sophisticated language. The language we can parse gets fancier and fancier, but at every step of the way, the thing one has is a parser. Likewise, when developing a compiler, it’s useful to be able to think of the thing that you have at each stage of the process as already being a compiler; as you go along, it becomes a compiler for a language that’s increasingly different from the target language.

That’s why I say that I wrote fifteen compilers when I took Kent’s course. At the end of week one (and at the end of week two, and so on for each week), I had written a compiler! Granted, the compiler I had at the end of week one was a compiler for an input language that wasn’t very different from the output language. But it converted code in its input language to assembly code on which I could then run an assembler, producing a working executable. That was really exciting!

In my last post, I mentioned two different definitions of “transpiler” that coincide only if we assume that compilers always have a high-level input language. My experience in Kent’s course taught me to steer clear of making that assumption. To the contrary, it was useful to think of the thing I wrote in the first week of Kent’s course as being a compiler, despite it having a quite low-level input language. For one thing, it was hugely motivating to be able to say that I had a working compiler at each step of the way through the course. Some compiler-development experiences are long slogs where you write code for months without ever having a thing that produces an actual executable that you can run. In Kent’s course, on the other hand, we got that hit of gratification every week. Furthermore, thinking of each component of the complete compiler as itself being a compiler was useful because it encouraged us to structure our code in a readable, modular, and maintainable way, in much the same way that parser combinator libraries support the development of readable, modular, maintainable parsers.

If we take “compiler that translates between programming languages that operate at approximately the same level of abstraction” as the definition of “transpiler”, the single-pass compiler I wrote in the first week of Kent’s course was a transpiler. The same is true for any other individual compiler pass I wrote during the course. But in the course, we never thought twice about just calling them compilers. As my friend and mentor Sam Tobin-Hochstadt has pointed out, introducing a new word instead of just saying “compiler” creates an unnecessary divide in the compiler-writing community and prevents sharing of knowledge across that divide. As a concrete example of this happening, here’s a question asked on Stack Overflow in 2012 by someone who wanted to write a transpiler, but wasn’t sure how to proceed. They wrote:

Now the next thing i’d like to do, is convert that source code to another source code, thus transpiling it. But how does that work? I can’t find any direct tutorials, explanations about that.

There’s a wealth of tutorials, courses, books, and the like about how to write compilers. But if somebody believes that writing a transpiler isn’t fundamentally the same thing as writing a compiler, it may not occur to them to look at any of that material. They may have even come to believe that writing a compiler is a monolithic and unapproachable task that only an elite few can ever hope to accomplish, rather than something that can be broken down into a series of relatively small, well-defined, approachable steps, and so they might shy away from taking a course or reading a book about compiler development. Perhaps such a situation could be avoided if we just called every compiler a compiler, regardless of how small or big the difference in level of abstraction between its input and output languages.

The choice of words we use to talk about compilers matters to me because I don’t want anyone to be afraid of writing a compiler, or to believe that compilers have to be written in a monolithic way. I loved the pedagogical approach that Kent’s course took, because structuring my compiler as a bunch of little compilers made it much easier to write, debug, and maintain than if it had been structured monolithically. Those fifteen weeks were a lot of hard work, but they were also the most fun I’d ever had writing code. What’s more, it was because of having taken that course that I was able to get internships working on Rust a couple years later — not because of any specific skill that I learned in the course2, but because after taking the course, I believed that a compiler was something I could write and something I wanted to write. Of course, lots of compilers — including Rust at that time — are monolithically structured and hard to understand, but the point is that compilers don’t have to be that way! Kent’s course showed me that compilers can be beautiful, even though they often aren’t. It made me want to work on compilers.

If I’d thought of the compiler I wrote for Kent’s course as just a bunch of transpilers glued together, rather than as a compiler, then I might never have applied for that Rust internship, might not have learned everything I learned from working on Rust for two summers, and might not have gotten to know a lot of people whose presence in my life has helped me build a research career. When Sam says that using the word “transpiler” “separates [people] from useful knowledge and community”, he’s talking about what could have easily happened to me. And I might have ended up believing that Real Compilers™ must be structured monolithically — which would have made me worse at writing real compilers.

Thanks to Jaseem Abid, Lea Albaugh, David Albert, Michael Arntzenius, Rudi Chen, harrison clarke, Carl Douglas, Julia Evans, Jeff Fowler, Philip Guo, Laura Lindzey, Sean Martin, Andi McClure, Iain McCoy, Vaibhav Sagar, Stevie Strickland, and Sam Tobin-Hochstadt for giving feedback on drafts of this post or discussing aspects of it with me.

  1. I hardly ever see compiler courses or books structured in this back-to-front way, which I think is a shame. The “From NAND to Tetris” course seems to come close – projects 7 and 8 cover the back end of a compiler, while projects 10 and 11 cover the front end – but, even then, projects 10 and 11 go in front-to-back order, rather than back-to-front. If anyone reading this knows of courses and books that teach back-to-front compiler implementation aside from the one that Kent Dybvig taught at Indiana, I’d love to hear about them!

  2. Indeed, there was almost no overlap between the specific skills I learned in Kent’s course and the things I did working on Rust. For Kent’s course, for instance, I had to implement register allocation; I didn’t have to think about that for Rust, because it compiles to LLVM IR. Conversely, for Kent’s course, since we were compiling an S-expression-based language, the parser was incredibly simple, whereas parsing Rust is pretty involved; and for Kent’s course, because Scheme is untyped, we didn’t do any type inference or type checking, whereas a lot of my time working on Rust was spent on the parts of the compiler that did those things. Nevertheless, I wouldn’t have felt comfortable applying for the internship had I not taken the course.

Comments