
Just call it a compiler

A compiler is a computer program that takes a program as input, does some sort of transformation on it, and returns another program. Most of the time, we think of a compiler as something that translates code from a higher-level language (like, say, Rust, Racket, or JavaScript) to a lower-level language (like, say, LLVM IR or x86-64 assembly), but this doesn’t necessarily have to be the case.

Around 2012 or so, it started to become fashionable to use the word “transpiler” to mean a compiler that “translates between programming languages that operate at approximately the same level of abstraction”, as Wikipedia puts it. JavaScript, a high-level language, is a popular target language for these compilers, because every popular web browser can execute JavaScript efficiently (and is getting better at it all the time, thanks to significant investment from browser vendors). Many people want their code to run in every popular web browser, but not all of those people want to write JavaScript — hence, compilers that transform code from various source languages to JavaScript.1 I don’t have a problem with any of these compilers, but the word “transpiler” itself has bothered me for a long time.

There’s no consensus on what “transpiler” means

One problem with “transpiler” is that people seem to disagree on what it means. Here’s a longer excerpt from the Wikipedia “source-to-source compiler” page (to which “transpiler” redirects):

A source-to-source compiler, transcompiler or transpiler is a type of compiler that takes the source code of a program written in one programming language as its input and produces the equivalent source code in another programming language. A source-to-source compiler translates between programming languages that operate at approximately the same level of abstraction, while a traditional compiler translates from a higher level programming language to a lower level programming language.

Interestingly, this definition says that “transpiler” and “source-to-source compiler” are synonyms, then defines both of them as meaning a compiler that “translates between programming languages that operate at approximately the same level of abstraction”. A “source-to-source compiler”, though, sounds like it ought to be something more specific than that; it sounds like a compiler for which the input and target languages not only operate at similar levels of abstraction, but also both operate at high levels of abstraction (because a “source” language is presumably high-level). This definition coincides with “translates between programming languages that operate at approximately the same level of abstraction” only if we assume that all compilers have a high-level source language. (Later in this post, I’ll get to why that assumption grates on me.)

There are other ways of defining “transpiler” that overlap with the Wikipedia definition but don’t coincide with it. Some people seem to use “transpiler” to mean “compiler with a high-level target language”, without saying anything in particular about the level of abstraction of the source language. For instance, Emscripten — which compiles LLVM bitcode, a relatively low-level language, to JavaScript, a relatively high-level language — is often called a transpiler. In fact, Wikipedia cites Emscripten as an example of a transpiler, even though doing so contradicts the “approximately the same level of abstraction” part of Wikipedia’s own definition.

In other contexts, “transpiler” might imply “compiler with relatively human-readable output”, which is yet another overlapping-but-different definition. It would rule out Emscripten: its output is JavaScript, but readability of that output isn’t a priority. On the other hand, it would include the TypeScript and CoffeeScript compilers, which compile high-level, JavaScript-like languages to JavaScript and try to more or less preserve readability.

So, we’ve got several overlapping-but-different definitions for “transpiler”:

  • compiler that translates between languages that operate at similar levels of abstraction (the Wikipedia definition)
  • compiler with high-level languages as its input and target languages (what “source-to-source compiler” seems to mean)
  • compiler with a high-level language as its target language (examples include Emscripten, TypeScript, and CoffeeScript)
  • compiler that produces readable output in a high-level target language (examples include TypeScript and CoffeeScript, but not Emscripten)

There are most likely even more ways people define “transpiler” that I haven’t thought of. (I imagine that for some people, in some contexts, “transpiler” just means “compiler that produces JavaScript”!)

Another issue with these definitions is that “high-level” and “low-level” are relative terms, and whether any two given languages operate at similar levels of abstraction isn’t necessarily a cut-and-dried matter. There’s not some kind of total ordering on languages by level of abstraction! Language constructs that operate at different levels of abstraction can coexist in the same language, making it simultaneously “higher-level” and “lower-level” than another language whose constructs all fall somewhere in between.

It’s also worth pointing out that aside from the lack of consensus on what it means, the word “transpiler” itself is pretty silly. It’s apparently a portmanteau of “compiler” and some word that starts with “trans”, but what word? Is it “translator”? “Transformer”? Either way, the addition of the “trans-” word doesn’t really provide any more specific information than “compiler”, because a compiler is already something that translates or transforms code. We could say “source-to-source compiler” instead, but “source-to-source compiler” doesn’t seem to encompass all the definitions of “transpiler” and may not be applicable in all the contexts in which people are using “transpiler”.

All of the above explains some of why I don’t like the word “transpiler”, and it’s pretty much what I would expect a standard “stop saying ‘transpiler’, dammit!” rant to say. But if that were all I had to say, I probably wouldn’t have bothered to write this post. My reasons for not liking the word are more personal, and to explain them, I need to tell a story from my first year of grad school, way back in 2009, before anyone was saying “transpiler”.

My first compiler

The first compiler I ever worked on was the one I wrote in the spring of 2009 for Kent Dybvig’s graduate compilers course at Indiana University. Actually, I didn’t just write one compiler for Kent’s course that semester; I wrote fifteen compilers, one for each week of the course. The first one had an input language that was more or less just parenthesized assembly language; its target language was x86-64 assembly. Each week, we added more passes to the front of the previous week’s compiler, resulting in a new compiler with the same target language but a slightly higher-level input language. By the end of the course, I had a compiler that compiled a substantial subset of Scheme to x86-64, structured as forty small passes. Each pass translated from its input language to a slightly lower-level language, or had the same input and output language but performed some analysis or optimization on it.
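
To make the shape of that first week’s compiler a bit more concrete, here is a minimal sketch in Python (the course itself used Scheme, and the input form and names below are invented for illustration, not taken from the actual assignment): a single pass that flattens a tiny parenthesized-assembly-like program into x86-64 text.

    # A toy, single-pass "compiler": the input is already assembly-shaped
    # (here, just a list of (op, dst, src) tuples), so the one pass only has
    # to flatten it into x86-64 assembly text.
    def emit_x86(program):
        lines = [".globl main", "main:"]
        for op, dst, src in program:          # e.g. ("movq", "%rax", "$42")
            lines.append(f"    {op} {src}, {dst}")
        lines.append("    ret")
        return "\n".join(lines)

    print(emit_x86([("movq", "%rax", "$42")]))   # a complete, if tiny, compiler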

Building a compiler this way was made a lot easier by the “nanopass” approach to compiler development (supported by the nanopass framework, which is now open source). In some ways, the nanopass approach takes the idea of parser combinator libraries, in which a parser is built up out of several smaller parsers, and extends that idea to the development of an entire compiler. Parser combinator libraries allow one to start small and build up parsing capability gradually, but at every stage, the thing one has is a parser. Likewise, when developing a compiler, it’s useful to be able to think of the thing that you have at each stage of the process as being a compiler. This is why I don’t like the assumption that compilers need to have a high-level source language: because it’s useful to think of the thing I wrote in the first week of Kent’s course as being a compiler, albeit one with a very small difference between its input and output languages. It’s useful because it’s motivating — at every step of the way, I could say that I had a working compiler — but it’s also useful because, like parser combinators do for parsers, the nanopass approach encourages a readable, modular, and maintainable way of structuring a compiler.
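
To show the structural idea in code, here is a rough Python sketch (this is not the nanopass framework’s actual API, which is a Scheme library, and the pass names below are hypothetical placeholders): each pass is a function from a program in one intermediate language to a program in a slightly lower-level one, and a compiler is just the composition of its passes, so what you have after adding any number of passes is still a compiler.

    # Compose small passes into a compiler. Each pass maps a program in one
    # intermediate language to a program in a slightly lower-level language,
    # or returns a program in the same language after analysis/optimization.
    def compose_passes(*passes):
        def compiler(program):
            for p in passes:
                program = p(program)
            return program
        return compiler

    # Hypothetical stand-in passes, just to show the shape of the pipeline.
    def remove_complex_operands(prog): return prog   # placeholder pass
    def select_instructions(prog):     return prog   # placeholder pass
    def emit_text(prog):               return "\n".join(str(line) for line in prog)

    week_one   = compose_passes(emit_text)            # already a compiler
    week_three = compose_passes(remove_complex_operands,
                                select_instructions,
                                emit_text)             # still a compiler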

Introducing a specific word for compilers that translate between languages that are at similar levels of abstraction makes it seem as though there’s a difference in kind between those compilers and Real Compilers™ — where Real Compilers™ are, presumably, those that compile a high-level language all the way to machine code or some other significantly lower-level language. But Kent’s course showed me that no such difference in kind exists. Was the single-pass compiler I wrote in the first week of the course a transpiler? It translated between languages that were similar in level of abstraction, so it would qualify as a transpiler according to the Wikipedia definition. The same is true for any other individual compiler pass I wrote in the course. How about the compiler I had after week two, or the week after that — were those transpilers, too? After exactly how many passes did my compiler stop being a transpiler and become a Real Compiler™? Five passes? Ten? Was it after I wrote the register allocator? After I wrote the instruction selection pass? After I wrote the closure conversion pass? There’s no sensible answer to this question, because the difference between a transpiler and a compiler is at best a difference in degree, not in kind.

Introducing new terminology unnecessarily divides the community

As my friend and mentor Sam Tobin-Hochstadt has pointed out, introducing a new word instead of just saying “compiler” makes it seem like there is a difference in kind between a transpiler and a compiler, and therefore creates an unnecessary divide in the compiler-writing community and prevents sharing of knowledge across that divide. As a concrete example of this happening, here’s a question asked on Stack Overflow in 2012 by someone who wanted to write a transpiler, but wasn’t sure how to proceed. They wrote:

Now the next thing i’d like to do, is convert that source code to another source code, thus transpiling it. But how does that work? I can’t find any direct tutorials, explanations about that.

There is, of course, a wealth of tutorials, courses, books, and the like about how to write compilers. But if somebody believes that writing a transpiler isn’t fundamentally the same thing as writing a compiler, it may not occur to them to look at any of that material. In fact, they might have come to believe that writing a compiler is a monolithic and unapproachable task that only an elite few can ever hope to accomplish, rather than something that can be broken down into a series of relatively small, well-defined, approachable steps, and so they might shy away from taking a class or reading a book about compiler development. Perhaps such a situation could be avoided if we just called every compiler a compiler, regardless of how small or big the difference in level of abstraction between its input and output languages.

Compilers don’t have to be scary monoliths

The choice of words we use to talk about compilers matters to me because I don’t want anyone to be afraid of writing a compiler, or to believe that compilers have to be written in a monolithic way. I loved the pedagogical approach that Kent’s course took, because structuring my compiler as a series of many small passes made it much easier to write, debug, and maintain than if it had been structured monolithically.2 Those fifteen weeks in spring 2009 were a lot of hard work, but they were also the most fun I’d ever had writing code. Moreover, it was because of having taken that course that I was able to get an internship working on Rust a couple years later — not so much because of any specific skill that I learned in the course3, but because, after taking the course, I believed that a compiler was something I was absolutely capable of working on and understanding. (Of course, lots of compilers — including, arguably, Rust at the time — are monolithically structured and hard to understand, but the point is that compilers don’t have to be that way!)

If we go by the Wikipedia definition of “transpiler”, then a compiler structured as many small passes is just a bunch of transpilers glued together. But if I’d thought of the compiler I wrote for Kent’s class as “just a bunch of transpilers”, rather than as a compiler, then I might never have applied for that Rust internship, might not have learned everything I learned from working on Rust, and might not have gotten to know a lot of people who became part of the community that helped me start to build a research career. The knowledge and community that Sam is talking about are exactly the knowledge and community I would have missed out on. And I might have ended up believing that Real Compilers™ must be structured monolithically — which would have made me worse at writing real compilers.

  1. For these reasons, JavaScript has been called “the assembly language of the web”. However, it’s worth mentioning that many of the people who pioneered the use of JavaScript in this way have been working on WebAssembly, an emerging standard that defines, among other things, a binary format that compilers can target and that will hopefully enjoy widespread browser support. This is not to say, of course, that JavaScript is going away any time soon.

  2. It’s also worth pointing out that writing the compiler back-to-front meant that, starting from the very first week, I had a working x86-64 compiler that produced code that I could assemble and run on any old x86-64 machine, so the gratification was much faster than if we’d written it front-to-back. I hardly ever see compiler courses or books structured this way, which I think is a shame. The “From NAND to Tetris” course seems to come close – projects 7 and 8 cover the back end of a compiler, while projects 10 and 11 cover the front end – but, even then, projects 10 and 11 go in front-to-back order, rather than back-to-front. If anyone reading this knows of courses and books that teach back-to-front compiler implementation, I’d love to hear about them!

  3. Indeed, there was almost no overlap between the specific skills I learned in Kent’s course and the things I did working on Rust. For Kent’s course, for instance, I had to implement register allocation; I didn’t have to think about that for Rust, because it compiles to LLVM IR. Conversely, for Kent’s course, since we were compiling an S-expression-based language, the parser was incredibly simple, whereas parsing Rust is pretty involved; and for Kent’s course, because Scheme is untyped, we didn’t do any type inference or type checking, whereas a lot of my time working on Rust was spent on the parts of the compiler that did those things. Nevertheless, I wouldn’t have felt comfortable applying for the internship had I not taken the course.
