This is a wide-ranging talk about things I’ve worked on in the last several years — LVars, ParallelAccelerator, and neural network verification — tying them together with the theme of “finding the right high-level abstractions to express efficient computation”. People who are familiar with the “job talk” genre will probably be able to tell that this talk is mostly cribbed from my job talk from early 2018. Until now, though, there wasn’t a publicly available video of it — so, now there is!
I was at first unsure that it would be appropriate to recycle my job talk. After all, Jane Street was actually paying me (!) to come speak, so shouldn’t I prepare a new talk? When I suggested a different topic, though, Yaron and Danielle asked that I do this talk instead. I’m glad they did, because not only did it save me from having to come up with a whole new talk, I think it ended up being more interesting for the Jane Street audience than the other topic I had in mind.^{1} And now I have a video that I can share with people who are curious about what kinds of things I work on and what I want to do next.
Thanks to everyone who came out for the talk, especially those who found out about it on short notice — I didn’t advertise it myself until the day before. I ended up giving the same talk twice, first to an internal audience at Jane Street and then to the public. (I was pretty hoarse by the end of the night.) Both versions of the talk were well-attended, and I got lots of engaged and curious questions. (The Q&A for the public talk starts at about 56:50 in the video.)
Since my previous visit to Jane Street had been for a disastrous internship interview trip in 2010 that culminated in my going to the wrong airport for the trip home, it was nice to visit Jane Street and actually have things go well this time. Thanks to Yaron Minsky, Danielle Sucher, and Lauren Sposta at Jane Street for inviting and hosting me!
The other topic was “What does the CAP theorem really say?”, which will most likely end up as a lecture in my class this spring. ↩
Back in August, I wrote about how, while taking the Data 8X series of online courses^{1}, I had learned about standard units and about how the correlation coefficient of two (one-dimensional) data sets can be thought of as either the slope of the regression line when both variables are expressed in standard units, or the mean of the products of the two variables when both are expressed in standard units.
In fact, there are lots more ways to interpret the correlation coefficient, as Rodgers and Nicewander observed in their 1988 paper “Thirteen Ways to Look at the Correlation Coefficient”. The above two ways of interpreting it are number three (“Correlation as Standardized Slope of the Regression Line”) and number six (“Correlation as the Mean Cross-Product of Standardized Variables^{2}”) on Rodgers and Nicewander’s list, respectively.
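Both of those interpretations are easy to check numerically. Here’s a quick sketch in Python with NumPy — using a small made-up data set, just for illustration — that computes $r$ as way number three (the slope of the regression line when both variables are in standard units) and as way number six (the mean cross-product of the standardized variables), and confirms that the two agree:

```python
import numpy as np

# Two small made-up data sets (any paired data will do).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.9, 6.1, 8.0, 9.9])

def standard_units(v):
    """Convert a one-dimensional data set to standard units (mean 0, SD 1)."""
    return (v - np.mean(v)) / np.std(v)

x_su, y_su = standard_units(x), standard_units(y)

# Way #3: r as the slope of the regression line in standard units.
slope = np.polyfit(x_su, y_su, 1)[0]

# Way #6: r as the mean cross-product of the standardized variables.
mean_product = np.mean(x_su * y_su)

# Both agree with NumPy's built-in correlation coefficient.
assert np.isclose(slope, mean_product)
assert np.isclose(mean_product, np.corrcoef(x, y)[0, 1])
```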
But that still leaves eleven whole other ways of looking at the correlation coefficient! What about them?
I started looking through Rodgers and Nicewander’s paper, trying to figure out if I would be able to understand any of the other ways to look at the correlation coefficient. Way number eight (“Correlation as a Function of the Angle Between the Two Variable Vectors”) piqued my interest. I know what angles, functions, and vectors are! But what are “variable vectors”?
Rodgers and Nicewander write:
The standard geometric model to portray the relationship between variables is the scatterplot. In this space, observations are plotted as points in a space defined by variable axes.
That’s the kind of thing I wrote about back in August. For instance, here’s a scatter plot showing the relationship between my daughter’s height and weight, according to measurements taken during the first year of her life. There are eight data points, each corresponding to one observation — that is, one pair of height and weight measured at a particular doctor visit.
These measurements are in standard units, ranging from less than -1 (meaning more than one standard deviation below average for the data set) to near zero (meaning near average for the data set) to more than 1 (meaning more than one standard deviation above average for the data set). (If you’re not familiar with standard units, my previous post goes into detail about them.) I also have another scatter plot in centimeters and kilograms, if you’re curious.
Rodgers and Nicewander continue:
An “inside out” version of this space — usually called “person space” — can be defined by letting each axis represent an observation. This space contains two points — one for each variable — that define the endpoints of vectors in this (potentially) huge dimensional space.
…Whoooooa.
So, instead of having height and weight as axes, they want us to take each of the eight rows of our table — each observation — and make those be our axes. And the two axes we have now, height and weight, would then become points in that eight-dimensional space.
In other words, we want to take our table of data — which looks like this, where rows correspond to points and columns correspond to axes on our scatter plot —
| Date | Height (standard units) | Weight (standard units) |
|------|-------------------------|-------------------------|
| 2017-07-28 | -1.26135 | -1.3158 |
| 2017-08-07 | -1.08691 | -1.13054 |
| … | … | … |
— and turn it sideways, like this:
| Date | 2017-07-28 | 2017-08-07 | … |
|------|------------|------------|---|
| Height (standard units) | -1.26135 | -1.08691 | … |
| Weight (standard units) | -1.3158 | -1.13054 | … |
Now we have two points, one for each of height and weight, and eight axes, one for each of our eight observations.
Eight dimensions are hard to visualize, so for simplicity’s sake, let’s pare it down to just three dimensions by picking out three observations to think about. I’ll pick the first, the last, and one in the middle. Specifically, I’ll pick the observations from when my daughter was four days old, about six months old, and about a year old:
| Date | 2017-07-28 | 2018-01-26 | 2018-07-30 |
|------|------------|------------|------------|
| Height (standard units) | -1.26135 | 0.617255 | 1.63707 |
| Weight (standard units) | -1.3158 | 0.728253 | 1.41777 |
What do we get when we visualize this sideways data set as a threedimensional scatter plot? Something like this:
What’s going on here? We’re looking at points in “person space”, where, as Rodgers and Nicewander explain, each axis represents an observation. In this case, there are three observations, so we have three axes. And there are two points, as promised — one for each of height and weight.
If we look at the difference between the two points on the z-axis — that is, the axis for the 07/30/2018 observation — we can see that the darker-colored blue dot is higher up. It must represent the “height” variable, then, with coordinates (-1.26135, 0.617255, 1.63707). That means that the other, lighter-colored blue dot, with coordinates (-1.3158, 0.728253, 1.41777), must represent the “weight” variable.
I’ve also plotted vectors going from the origin to each of the two points, and these, finally, are what Rodgers and Nicewander mean by “variable vectors”!
Continuing with the paper:
If the variable vectors are based on centered variables, then the correlation has a relationship to the angle $\alpha$ between the variable vectors (Rodgers 1982): $r = \cos(\alpha)$.
Oooh. Okay, so first of all, are our variable vectors “based on centered variables”? From what Google tells me, you center a variable by subtracting the mean from each value of the variable, resulting in a variable with zero mean. The variables we’re dealing with here are in standard units, and so the mean is already zero. So, they’re already centered! Hooray.
Finding the angle between [-1.26135, 0.617255, 1.63707] and [-1.3158, 0.728253, 1.41777] and taking its cosine, we can compute $r$ to be 0.9938006245545371. Almost 1! That means that, just like last time, we have an almost perfect linear correlation.
It’s a bit different from what we got for $r$ last time, which was 0.9910523777994954. But that’s because, for the sake of visualization, we decided to only look at three of the observations. To get more accuracy, we can go back to all eight dimensions. We may not be able to visualize eight-dimensional variable vectors, but we can still measure the angle between them! Doing that, we get 0.9910523777994951, which is the same as we had last time, modulo 0.0000000000000003 worth of numerical imprecision. I’ll take it.
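The three-dimensional version of this computation is easy to reproduce. Here’s a short NumPy sketch using the two variable vectors from above:

```python
import numpy as np

# The "height" and "weight" variable vectors from the
# three-observation version of the data, in standard units.
height = np.array([-1.26135, 0.617255, 1.63707])
weight = np.array([-1.3158, 0.728253, 1.41777])

def cos_angle(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

r = cos_angle(height, weight)
print(r)  # ≈ 0.9938
```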
So, that’s way number eight of looking at the correlation coefficient — as the cosine of the angle between two variable vectors in “person space”!
Why do Rodgers and Nicewander call it “person space”? I wonder if it’s because it’s common in statistics for an observation — a row in our original table — to correspond to a single person. It seems to also sometimes be called “subject space”, “observation space”, or “vector space”. For instance, here’s a stats.SE answer that shows an example contrasting “variable space” — that is, the usual kind of scatter plot, with an axis for each variable — with “subject space”.
I had never heard any of these terms before I saw Rodgers and Nicewander’s paper, but apparently it’s not just me! A 2002 paper by Chong et al. in the Journal of Statistics Education laments that the concept of subject space (as opposed to variable space) often isn’t taught:
There are many common misconceptions regarding factor analysis. For example, students do not know that vectors representing latent factors rotate in subject space, rather than in variable space. Consequently, eigenvectors are misunderstood as regression lines, and data points representing variables are misperceived as data points depicting observations. The topic of subject space is omitted by many statistics textbooks, and indeed it is a very difficult concept to illustrate.
And the lack of uniform terminology seems to be part of the problem. Chong et al. get delightfully snarky in their discussion of this:
In addition, the only text reviewed explaining factor analysis in terms of variable space and vector space is Applied Factor Analysis in the Natural Sciences by Reyment and Joreskog (1993). No other textbook reviewed uses the terms “subject space” or “person space.” Instead vectors are presented in “Euclidean space” (Joreskog and Sorbom 1979), “Cartesian coordinate space” (Gorsuch 1983), “factor space” (Comrey and Lee 1992; Reese and Lochmüller 1998), and “n-dimensional space” (Krus 1998). The first two phrases do not adequately distinguish vector space from variable space. A scatterplot representing variable space is also a Euclidean space or a Cartesian coordinate space. The third is tautological. Stating that factors are in factor space may be compared to stating that Americans are in America.
For their part, Rodgers and Nicewander want to encourage more people to use this angle-between-variable-vectors interpretation of $r$. They write:
Visually, it is much easier to view the correlation by observing an angle than by looking at how points cluster about the regression line. In our opinion, this interpretation is by far the easiest way to “see” the size of the correlation, since one can directly observe the size of an angle between two vectors. This insideout space that allows $r$ to be represented as the cosine of an angle is relatively neglected as an interpretational tool, however.
I have mixed feelings about this. On the one hand, yeah, it’s easier to just look at one angle between two vectors in observation space (or person space, or vector space, or subject space, or whatever you want to call it) than to have to squint at a whole bunch of points in variable space. On the other hand, for most of us it probably feels pretty strange to have, say, a “July 28, 2017” axis instead of a “height” axis. Moreover, the observation space is really hard to visualize once you get past three dimensions, so it’s hard to blame people for not wanting to think about it. I can visualize lots of points, but only a few axes, so using axes to represent observations (which we may have quite a lot of) and points to represent variables (which, when dealing with bivariate correlation, we have two of) seems like a rather backwards use of my cognitive resources! Nevertheless, I’m sure there are times when this approach is handy.
Since August, I finished the final course in the Data 8X sequence and am now a proud haver of the <airquotes>Foundations of Data Science Professional Certificate</airquotes> from <airquotes>BerkeleyX</airquotes>. ↩
When Rodgers and Nicewander speak of a “variable”, they mean it in the statistician’s sense, meaning something like “feature” (like “height” or “weight”), not in the computer scientist’s sense. When I say “one-dimensional data set”, that’s a synonym for “variable”. ↩
As is typical for a graduate seminar, most of the work students did in this course consisted of reading papers, writing responses to them, and giving class presentations about them. We began with the CAP tradeoff, then spent a couple weeks exploring the zoo of distributed consistency models — from session guarantees to causal consistency (with and without convergent conflict handling) to linearizability. Once we were thoroughly tired of consistency models, we spent some time on the theory and practice of replicated data types, as well as on languages, abstractions, and verification tools for combining consistencies in distributed systems. In the last month of the course, we looked at lots more languages and frameworks for distribution, with a side trip into abstractions for configuration management at the end.
One of the hardest things about planning the readings for the course is that ten weeks really isn’t very long. I had to leave plenty of good stuff off of our reading list because we just didn’t have room for it. Still, I ended up being reasonably happy with the set of readings we chose. My students had diverse interests, and it was hard to please everyone — it seemed like every time someone really liked a particular paper, someone else disliked it just as strongly — but in the end, I think everyone got to read some papers they liked. The students’ paper preferences for presentations were also sufficiently diverse that I was able to assign everybody papers that they had actually requested. (Whether they ended up still liking those papers after they were done reading and presenting on them is another question, but at least I was able to give everyone some papers they initially thought they wanted!)
Each student presented on three papers. I think that three is a lot to ask, and it probably would have been better if each student had only had to do two presentations, which would have been possible if we’d had nine students enrolled instead of six. I also did three of the paper presentations myself, two of which I think went fine and one of which I think went poorly. The paper for that latter one was Viotti and Vukolić’s survey of notions of consistency, and since that paper — as well as some of the others on our reading list — had some sort of dependency on material in Sebastian Burckhardt’s (free) book Principles of Eventual Consistency, I decided to just devote the class to talking about the material from Burckhardt’s book, instead of about the Viotti and Vukolić paper that students had actually read. This might have been a reasonable pedagogical choice if I’d managed to connect what I talked about to the reading that students had done, but I pretty much failed to do so.^{1}
If I could do it over, there are a couple of things that I’d do differently regarding the logistics of reading responses. Students were supposed to write a short response to every paper we read. I told them that they could skip up to four of these responses with no consequences, and that if they wanted to skip one, they should email me. In retrospect, the “email me” step was totally unnecessary! I should have just told them to go ahead and skip up to four responses, no questions asked, and saved myself having to read a bunch of mail.
Another logistical hiccup had to do with the way that reading response submissions worked. Like almost everything else in the course, reading responses were public on GitHub. In order to comply with US law (specifically, FERPA) and UC policy, though, I needed to give students a way to submit their work that would protect their right to not have their homework be made publicly available and identifiable as theirs. (Students also have the right to not have the fact that they’re even taking a certain class be made public.) After talking with our staff about FERPA compliance, the solution I arrived at was to assign each student a random ID number for the term, which they used in the file naming convention for their reading response submissions. However, since students were using their own GitHub IDs to submit homework, I told them that they were welcome to create a separate GitHub account just for use in my course that didn’t reveal any personal information, and that they could opt out of having their names associated with their public blog posts. Nobody took me up on either of those suggestions, so it appears that no one minded having their name associated with their work in the course. Still, it probably would have been easier to just have the reading responses not be public, and instead have them shared with the class on some internal forum, instead of doing the random ID number thing.
In addition to reading papers, writing responses to them, and giving presentations about them, my students poured a lot of work into the course blog. Every student contributed two posts to the blog, as well as serving as an editor for two posts written by their classmates. I gave the students some suggestions for what to write about and provided feedback on the ideas they came to me with, but they were ultimately responsible for coming up with their own topics for the blog. I was delighted by the results:
Blogging ate up more of the students’ time than I had anticipated it would. I had imagined them spending about thirty hours per post, but some people spent much more time than that. Some students also didn’t like having to write two posts. My reasoning had been that one large post would feel unapproachable, while two smaller posts with distinct deadlines would break up the work into manageable chunks. The feedback I got from some students, though, was that they felt like they had to do the equivalent of two course projects, which hadn’t been my intention.
Although every post had a student editor, I ended up pretty heavily editing all twelve posts myself, working together with the students and using Google Docs to make comments and provide suggested edits. I had imagined that having student editors would take some of the editing burden off me, and also give students some practice with editing each other’s writing. Although I think this worked to some extent, I didn’t really give students much guidance on how to edit each other’s work, so some of the advice they gave each other contradicted what I would have said. I also felt that the students were just too nice to each other a lot of the time!
I think some students were happy with the amount of attention I was paying to the quality of their writing. Others may have found it irritating. It’s certainly true that if I’d had more than six students, I wouldn’t have been able to give students’ posts the amount of individual attention that I did this fall. (One option for a bigger class might be to have students write posts together in small groups.) Also, with the way the deadlines for finishing posts worked, I ended up with several posts to edit at once. A better approach might have been to assign particular weeks to students when they had to finish their posts, with no more than two per week, so that my editing work could have been better spaced out.
In the end, I think all our hard work paid off: we got some really nice reactions to the blog! This tweet from KC Sivaramakrishnan (who was the author of one of my favorite papers from the course) really made my year:
I am particularly impressed with the quality of the blog posts! Very well researched and written so clearly. I highly recommended reading all of the posts. https://t.co/XIAjl3XRuZ
— KC Sivaramakrishnan (@kc_srk) November 22, 2018
I’m by no means the first instructor to incorporate public blogging into a computer science seminar course. Two good examples I know of are the Understanding and Securing TLS and Security and Privacy of Machine Learning seminars run by David Evans at UVA. In those courses, teams of students worked together to write posts about each class meeting. (The classes met for a single, long session once a week.) I decided to do it differently: for us, the blog was a replacement for a traditional course project rather than a record of what was discussed in class. Anyway, I’m interested in talking to other people who’ve used blogging as part of teaching CS classes; let me know what worked for you and what didn’t!
We were fortunate to have a star lineup of guest speakers in the course:
For all the external speakers (that is, everyone except Peter), I opened up the talk to people outside of our class and made an effort to advertise, because I didn’t want people to have to come from far away to give a talk to only six students. My efforts here were sometimes quite successful and other times not at all successful. In the future, instead of asking speakers to discuss specific papers that we read for the course, I might ask them to talk more generally about their work (which some of the speakers went ahead and did anyway), since that would have broader appeal to people not enrolled in the course, who likely hadn’t seen the paper. Organizing things that way would also make it possible to invite people who didn’t happen to be an author of a paper we were reading, but nevertheless were doing exciting and relevant work.
Overall, I’m pleased with how the guest speakers went. I learned a lot from having them, and I’m not just talking about learning how to arrange campus parking passes for visitors, although that is indeed a useful skill to have picked up.
The way I ran this course was influenced by my own Ph.D. experience, during which I got lots of training in how to communicate my work to other people in my own narrowly focused academic subfield, but not much training in how to communicate to anyone else. I wanted to do better, so I asked my students to try to aim their blog posts at a “general technical audience”, in the hope that the blog might have some impact beyond our narrow slice of academia. I never really defined “general technical audience” to anyone’s satisfaction, though, including my own. Although people said nice things about the blog, the people saying them tended to be, well, other academics in my subfield, just at different institutions. So, although we did reach people beyond UCSC, I don’t know if I can claim that the blog succeeded at communicating with an audience beyond academia. What could we try instead? Maybe ten-minute !!Con-style talks would be worth a shot.
Having said all that, one other way in which I think the course did have a noticeable impact beyond UCSC is that a group of students at CU Boulder have created a reading group based on it! I met the student running the reading group, David Moon, back at ICFP in September, and we had a long lunch conversation. The papers he chose for the group ended up being a subset of my course’s reading list, with one more good one that we didn’t have room for (the Verdi paper!) added at the end. I’m absolutely thrilled that this reading group is happening, and I hope that it means that a few more people have the opportunity to think about this particular set of readings as a collection and consider the connections and potential connections between them.
Burckhardt’s book (which should really just be called Principles of Consistency) builds up a lot of mathematical machinery to define what he calls operation contexts, which can be thought of as a graph of events that affect the result of an operation. The concept of an operation context is necessary to define a replicated data type specification, which in turn is necessary to specify a consistency model like the ones in Viotti and Vukolić’s survey. I failed at putting all these pieces together in my lecture, but I hope that the students at least got something out of being exposed to Burckhardt’s specification framework, which has been used in a lot of follow-up work. In particular, Quelea takes some of the interesting parts of Burckhardt’s framework and turns it into a programming language, which I find extremely cool. ↩
!!Con (pronounced “bang bang con”) West is a two-day conference of ten-minute talks about the joy, excitement, and surprise of computing, and the west-coast offshoot of !!Con, the annual independent and volunteer-run conference that I co-founded in New York in 2014. !!Con is a radically eclectic computing conference, and we’re excited to be bringing a version of it to the west coast for the first time! The conference will be held at UC Santa Cruz on February 23 and 24, 2019, and our call for talk proposals is open until tomorrow. We’ve already gotten nearly a hundred talk proposals, but we want more! We want yours!
When I say “radically eclectic”, what I mean is that at !!Con, you can expect to catch talks on lossy (!) text compression, traceroute as a storytelling medium, glowing animatronic mushrooms, what happens when you store your data in kernel space, and how to program knitting machines. Or perhaps on assembling ceramic artifacts using computer vision, synthesizing video and turning it into music, building a map to aggregate real-time flood data, how to design a compelling game with just one big red button, and why the problem of how to distribute “Elmo’s World” segments onto a series of video releases is NP-complete. I always tell prospective talk submitters that if they have something to talk about that brings them joy but is just a bit too strange for a typical tech conference, it might be perfect for !!Con.
!!Con West will undoubtedly have a different flavor than past !!Cons have had — after all, instead of being in the middle of Manhattan or Brooklyn, we’ll be in the woods — but we’re hoping to preserve the essential parts of the !!Con aesthetic, while enthusiastically embracing the physical and human geography of the west coast, Santa Cruz, and UC Santa Cruz. Just like every past !!Con, we’ll anonymize talk proposals before reviewing them to help eliminate implicit bias in the review process. We’re also offering travel funding to speakers who request it, and a $256 honorarium to every speaker.
If all that sounds good to you, please consider submitting a talk proposal to be part of the inaugural !!Con West!
I guess they heard that @palvaro is lecturing today 🔥 pic.twitter.com/BzudbcvCKs
— Lindsey Kuper (@lindsey) September 28, 2018
So I wasn’t sure if I would be able to measure up to students’ high expectations for his class. But it seemed to go well! I decided I wanted to talk about resolving conflicts between replicas in distributed systems. This was jumping ahead a bit, since Peter hadn’t really started to talk about replication in the course yet, but the students were engaged and asked very good questions.
It just so happened that the day I guest-lectured was the day that a student started making videos of the class, and those videos are now up on YouTube for anyone to watch!
I started out by talking a bit about why we do replication in the first place and how conflicts between replicas arise. Then I talked about application-specific (or “content-aware”, if you like) strategies for resolving those conflicts, using the example of a replicated shopping cart. The class had already covered partial orders in the context of Lamport’s “happens-before” relation, so I was well situated to introduce a little more math: upper bounds, least upper bounds, and join-semilattices.
A lot of partial orders are join-semilattices, but some aren’t! So we talked about that, and I brought it back to distributed systems by making the informal claim that, if operations that affect replicas’ states can be thought of as elements of a join-semilattice, then we have a “natural” way of resolving conflicts between replicas.
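To give a flavor of the idea, here’s a tiny Python sketch of my own (an illustrative example, not the one from the lecture): each replica’s shopping-cart state maps items to quantities, and the join takes the pointwise maximum of quantities. Because max is commutative, associative, and idempotent, so is the merge, which means replicas converge no matter when or in what order they exchange state. (Note that this particular lattice is grow-only — removing an item from the cart would take more machinery, which is part of what makes the shopping-cart example interesting.)

```python
def join(cart_a, cart_b):
    """Least upper bound of two cart states: pointwise max of quantities."""
    items = set(cart_a) | set(cart_b)
    return {item: max(cart_a.get(item, 0), cart_b.get(item, 0))
            for item in items}

# Two replicas that have seen different updates:
replica_1 = {"milk": 2, "eggs": 12}
replica_2 = {"milk": 1, "bread": 1}

merged = join(replica_1, replica_2)
print(merged)  # {'milk': 2, 'eggs': 12, 'bread': 1} (in some order)

# The lattice laws give us convergence "for free":
assert join(replica_1, replica_2) == join(replica_2, replica_1)  # commutative
assert join(merged, merged) == merged                            # idempotent
```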
Afterward, one student delighted me by asking where they could read more about the topic!^{1} And a few days later, another student shared the beautiful sketchnotes they had made:
@tos # 128 Oct. 12th 2018 Lecture 5(6). (thread) pic.twitter.com/dmbkG8mqfO
— ✨romeo (@romeoexists) October 15, 2018
Aren’t these students amazing?! I’m stoked about teaching my own version of this course in the spring.
I suggested that they look up conflict-free replicated data types. I didn’t have the presence of mind to suggest it at the time, but this blog post from Michael Arntzenius is also good. ↩
Here’s an “official” course description:
This graduate seminar course explores the theory and practice of distributed programming from a programming-languages perspective. We will focus on programming models, language-level abstractions, and verification techniques that attempt to tame the many complexities of distributed systems: inevitable failures of the underlying hardware or network; communication latency resulting from the distance between nodes; the challenge of scaling to handle ever-larger amounts of work; and more. Most of the work in the course will consist of reading classic and recent papers from the academic literature, writing short responses to the readings, and discussing them in class. Furthermore, every participant in the course will contribute to a public group blog where we will share what we learn with a broader audience.
There are a lot of reasonable ways to approach a seminar course on languages and abstractions for distributed programming. We could spend the whole ten weeks on process calculi and still only make a small dent in the literature. Or, we could spend ten weeks on large-scale distributed data processing and still only make a small dent in the literature.
In this particular course, we will be focusing a lot of attention on consistency models and language-based approaches to specifying, implementing, and verifying them. (Of course, we will only make a small dent in the literature.)
As a grad student, I always dreaded having to do course projects. In an ideal world, these projects were supposed to dovetail nicely with one’s “real” research, or they were supposed to morph into “real” research within three months by some mysterious alchemical process involving lots of luck and suffering. In practice, they usually ended up taking time away from real research, and they always ended up being hastily implemented and shoddily written up. Anyone who tried to get the code to run a few months later would be in for a bad time.
So, this course will attempt something different. Instead of a traditional course project, each student in the class will write (and illustrate!) two posts for a public group blog aimed at a general technical audience. The goal is to create an artifact that will outlive the class and be valuable to the broader community.
Will this be less work than a traditional course project? No. A blog post requires substantial work (reading, writing, editing, programming, debugging, thinking, …), and I’m asking students to expect each post to take about thirty hours of focused work. Furthermore, every student in the class will serve as an editor for two posts other than their own. The job of the editor is to help the writer do their best work — by reading drafts, asking clarifying questions, spotting mistakes and rough spots, and giving constructive feedback. This will take another five hours or so per post.
Altogether, it’s a pretty big time commitment, but one that I hope students will find worthwhile.
So! If you’re a computer science grad student at UC Santa Cruz, check out the course overview and draft reading list, and consider signing up to take my class! And if you’re not a computer science grad student at UC Santa Cruz, but you think you want to be, then get in touch — I might be able to help with that.
Over the last few months, I’ve been working my way through Data 8X, the online version of Berkeley’s Data 8 course, after seeing Joe Hellerstein tweet about it a while back. At first, I was interested mostly for pedagogical reasons, but I can now admit to actually having learned something about data science, too. The course is organized in three parts, and I’ve finished the first two parts (check out my cheevos!) and am working my way through the third part, which focuses on prediction and machine learning.
A few days ago, I came to the part of the course that discusses the equation of the regression line, which is, well, the line used in linear regression. Given two variables $x$ and $y$, which we can visualize as a two-dimensional scatter plot of a bunch of points, the idea of linear regression is to find the straight line that best fits those points. Once we have such a line, we can use it to predict the value of $y$, given some new value of $x$. The regression line represents the linear function that minimizes the error of those predictions for all the $(x, y)$ pairs that we know about.^{1}
The course gives a delightfully simple equation for the regression line: it’s $y = r \times x$, where $r$ is the correlation coefficient of $x$ and $y$, and $x$ and $y$ are measured in standard units. I had been familiar with the concept of linear regression before taking the course, but standard units and the correlation coefficient were new to me. As it turns out, working with standard units and the correlation coefficient makes linear regression easy!
As an example, let’s use some Small Data that I have handy: the measurements of my daughter’s height (or length, if you like) and weight that were taken at doctor visits during the first year of her life. The first measurements were taken on July 28, 2017, a few days after she was born, with further measurements taken at the follow-up appointments at two weeks, one month, two months, four months, six months, nine months, and twelve months.^{2}
| Date       | Height (cm) | Weight (kg) |
|------------|-------------|-------------|
| 2017-07-28 | 53.3        | 4.204       |
| 2017-08-07 | 54.6        | 4.65        |
| 2017-08-25 | 55.9        | 5.425       |
| 2017-09-25 | 61          | 6.41        |
| 2017-11-28 | 63.5        | 7.985       |
| 2018-01-26 | 67.3        | 9.125       |
| 2018-04-27 | 71.1        | 10.39       |
| 2018-07-30 | 74.9        | 10.785      |
Leaving aside the date column for now, we can plot weight as a function of height:
Those dots look awfully close to being a straight line! It seems like linear regression might be a good choice for modeling the relationship between Sylvia’s height and weight during this time period. But before we get to that, let’s talk about standard units.
Consider a data point like, say, 61 centimeters, which was Sylvia’s height on September 25, 2017. Expressing this data point in units of centimeters is useful: for one thing, it’s not too hard for us, as humans, to imagine more or less how long that is. If we’ve been around a lot of babies, we might even know enough to say, “Wow, that’s a big two-month-old.”
In other ways, though, it’s perhaps less useful. If we just see the data point “61 centimeters” by itself, we don’t know anything about how it relates to the rest of the numbers in the height column: is it shorter, longer, or about average? We’d have to see the rest of the data set in order to answer that question. But it turns out that there is a way to represent individual data points that will let us answer such a question without having to look at the rest of the data set! That representation is standard units.^{3}
To convert a data point to standard units, you need to know three things: its value in original units (centimeters, petaflops, whatever it is you’ve got), and the mean and the standard deviation of the data set it came from. Its value in standard units is how many standard deviations above the mean it is.^{4} So, if a value is exactly average in the data set it came from, then regardless of what “average” means for that data set, when converted to standard units, it’s zero. If it’s one standard deviation above average, then in standard units, it’s one. If it’s below average, then in standard units it will be negative. You get the idea.
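To make the conversion concrete, here’s a minimal Python sketch (using NumPy; the function and variable names are my own, not anything from the Data 8 libraries):

```python
import numpy as np

def standard_units(values):
    """Convert an array of values to standard units: how many
    standard deviations above the mean each value is."""
    values = np.asarray(values, dtype=float)
    # np.std defaults to the population SD (the root mean square of
    # deviations from the mean), which matches the Data 8 definition.
    return (values - np.mean(values)) / np.std(values)

# Sylvia's heights in centimeters, from the table above.
heights_cm = [53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9]
print(np.round(standard_units(heights_cm), 5))
```

By construction, a standardized variable always has mean 0 and standard deviation 1, which is another way of seeing why “0 means average” in standard units.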
Here’s that same table again, but with the heights and weights converted to standard units.
| Date       | Height (standard units) | Weight (standard units) |
|------------|-------------------------|-------------------------|
| 2017-07-28 | -1.26135                | -1.3158                 |
| 2017-08-07 | -1.08691                | -1.13054                |
| 2017-08-25 | -0.912464               | -0.808628               |
| 2017-09-25 | -0.228116               | -0.399485               |
| 2017-11-28 | 0.107349                | 0.254728                |
| 2018-01-26 | 0.617255                | 0.728253                |
| 2018-04-27 | 1.12716                 | 1.2537                  |
| 2018-07-30 | 1.63707                 | 1.41777                 |
Height and weight are now what a statistician would call standardized variables. Sylvia’s height on September 25, 2017 was -0.228116 standard units. From that number, we can tell that her height that day was just a bit below average for this data set; in particular, it was around 0.2 standard deviations below average. We did have to look at the rest of the data set to be able to come up with the number -0.228116 in the first place, but the information we got from doing that is now implicit in the value itself. So, if you asked me how tall Sylvia was at her two-month checkup and I told you, “Oh, about negative point-two standard units,” you’d know that she was taller at other, presumably later, times.
Admittedly, that information isn’t all that interesting to have. After all, we expect most babies to get taller as time goes by. It might be more interesting to use standard units for a different data set, such as, say, the heights of a large number of two-month-old babies. Then, knowing the height of any one of those babies in standard units would tell us how its height compared to the rest of the babies in the data set. But, as we’ll see in a moment, standard units are good for more than just quickly seeing how a particular data point compares to the average.
To start with, let’s see what happens if we plot weight as a function of height, like we did before, but now with the data in standard units:
This scatter plot looks pretty familiar. In fact, the data points look exactly the same as they did above, when we were working with centimeters and kilograms! Indeed, the data hasn’t changed: all that’s changed is the axes.
We can see that the origin, $(0, 0)$, is now in the middle of the plot. Because we’re using standard units, $(0, 0)$ is the “point of averages”, or the point where both variables are at their average values. You may already know that in linear regression, the regression line for a particular data set always passes through the “point of averages” for that data set. In other words, if you plug the exact average value of $x$ into the regression equation, the $y$ you’ll get will be the exact average value of $y$. So, since 0 means “average” in standard units, when we’re working in standard units we know right away that the regression line always goes through $(0, 0)$. How convenient! That’s one reason why that $y = r \times x$ equation up there is so simple. There’s no need to specify a $y$-intercept, because when $x$ and $y$ are in standard units, the $y$-intercept of the regression line is always 0.
So, now all we need is $r$, and then we can draw the regression line. But just what is this $r$ thing?
The correlation coefficient, known as $r$, is a measure of the strength of the linear relationship between two variables. If we represent that relationship as a scatter plot, then $r$ is a measure of how close the points are to being on a straight line. A positive $r$ means there’s a direct linear relationship between the variables; the highest possible $r$ is 1, which means that all the points are on a straight line with positive slope. A negative $r$ means there’s an inverse linear relationship, with the lowest possible $r$ being $-1$, meaning that all the points are on a straight line with negative slope. An $r$ of 0 means there’s no linear relationship, which could mean that there’s no relationship at all, or that the two variables are related in some nonlinear way. Wikipedia has several examples of scatter plots with different values of $r$.
The correlation coefficient is fascinating! Two quantitative psychologists, Joseph Lee Rodgers and W. Alan Nicewander, wrote a well-known paper called “Thirteen Ways to Look at the Correlation Coefficient” that explores some of the many ways to think about $r$. For our purposes, we’re thinking of it as the slope of the regression line for the relationship between two variables in standard units. This way of interpreting $r$ happens to be number three on Rodgers and Nicewander’s list: “Correlation as Standardized Slope of the Regression Line”.
How do we compute $r$? For that, we can turn to number six on Rodgers and Nicewander’s list: “Correlation as the Mean Cross-Product of Standardized Variables”! Not only is $r$ the slope of the regression line for the relationship between two variables in standard units, it’s also the average cross-product of the values of two variables in standard units.
So, since we’ve already converted our height and weight to standard units, all we have to do to find the slope of the regression line is multiply all our paired-up values of $x$ and $y$ with each other, and then take the average of those products. And since we already know that the $y$-intercept is 0, we can draw the regression line! That’s it! No dimly remembered calculus! No ugly iterative methods!
Let’s add a new column to our table from before, where we’ll write down the product of each pair of standardized height and weight values:
| Date       | Height (standard units) | Weight (standard units) | Product of standardized height and weight |
|------------|-------------------------|-------------------------|-------------------------------------------|
| 2017-07-28 | -1.26135                | -1.3158                 | 1.65968                                   |
| 2017-08-07 | -1.08691                | -1.13054                | 1.22879                                   |
| 2017-08-25 | -0.912464               | -0.808628               | 0.737844                                  |
| 2017-09-25 | -0.228116               | -0.399485               | 0.091129                                  |
| 2017-11-28 | 0.107349                | 0.254728                | 0.0273447                                 |
| 2018-01-26 | 0.617255                | 0.728253                | 0.449518                                  |
| 2018-04-27 | 1.12716                 | 1.2537                  | 1.41312                                   |
| 2018-07-30 | 1.63707                 | 1.41777                 | 2.32099                                   |
And now we can just take the average of the numbers in that last column to get $r$. Before we do that, though, let’s try to get some intuition for why it makes any sense to multiply height and weight together when they’re in standard units, and why the average of those products would give us $r$.
Well, we said before that a positive $r$ means that there’s a direct linear relationship between our two variables, and a negative $r$ means that there’s an inverse linear relationship. Looking at our table, we can see right away that all of the numbers in the last column are positive, because in each case, we got the number either by multiplying two negative numbers or two positive numbers. That happened because each of our values in the height column falls on the same side of average — that is, on the same side of zero — as the corresponding value in the weight column.
If all our above-average heights had corresponding below-average weights, and vice versa, then the products in the last column would all be negative, and so $r$, the average of the products, would also be negative, indicating an inverse linear relationship. And if some of the above-average heights corresponded to above-average weights and some corresponded to below-average ones, and the same for the below-average heights, then we’d have a mix of positive and negative numbers in the last column, and so the average of that column would presumably be pretty close to 0 — indicating a weak linear relationship or none at all. Hopefully, this informal line of reasoning provides some intuition about why taking the average cross-product of two standardized variables gives you $r$.^{5}
So, what do we get when we take the average of the numbers in the last column? Their average turns out to be 0.9910523777994954. Wow! That’s really close to 1. Let’s plot the regression line and see what it looks like.
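Here’s a quick check of that number in Python (a sketch using NumPy; the function and variable names are my own):

```python
import numpy as np

def standard_units(values):
    # Standard units: SDs above the mean. np.std computes the
    # population SD, matching the Data 8 definition.
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

heights_cm = [53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9]
weights_kg = [4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785]

# r is the mean cross-product of the standardized variables.
r = np.mean(standard_units(heights_cm) * standard_units(weights_kg))
print(r)  # ≈ 0.991
```

As a sanity check, `np.corrcoef(heights_cm, weights_kg)[0, 1]` should give the same number.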
Since $r$ is so close to 1, the regression line is awfully close to $y = x$, a perfect linear correlation. Let’s plot that line, too.
The slope of the regression line is just a hair smaller than the slope of the line $y = x$. In fact, it’s hard to see the difference between them. Here’s a bigger version of the figure, with the ranges of the axes tweaked so that you can see that the two lines aren’t the same:
For an example of a relationship that isn’t quite as well modeled by a straight line, take a look at the Secret Bonus Content™ at the end of the accompanying notebook!
Standard units are useful for finding and reasoning about the regression line, but sooner or later, we may want to convert back to original units — centimeters and kilograms, in our case. After all, we might want to be able to ask questions like, “What does our model predict Sylvia’s weight will be when she’s 80 centimeters tall?”, and it would be inconvenient to have to convert 80 to standard units first. Also, we’d probably prefer to get the answer back in kilograms rather than standard units.
The Data 8 textbook gives the following formulas for the slope and intercept of the regression line in terms of $x$ and $y$ in original units (where $\mbox{SD}$ means “standard deviation”): $$ \mathbf{\mbox{slope of the regression line}} ~=~ r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x} $$
$$ \mathbf{\mbox{intercept of the regression line}} ~=~ \mbox{average of }y ~-~ \mbox{slope} \cdot \mbox{average of }x $$
So that would make the equation of the regression line
$$ y = (r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x}) \times x + (\mbox{average of }y ~-~ (r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x}) \cdot \mbox{average of }x) $$
which is a lot hairier than the nice $$y = r \times x$$ that you can use when working in standard units. Unfortunately, the book doesn’t actually explain how to derive the formulas for the slope and intercept. In the accompanying video at about 6m15s, Ani Adhikari shows them and then pauses for a full ten seconds before saying, “I would suggest that you don’t try to memorize this formula. Remember how the slope comes about, and then, you can, if necessary, derive the formula for the intercept, or simply — and this is our recommendation — you can look it up.” I’m not very good at following recommendations, so I ended up working it out on paper myself, and it was indeed pretty tedious, but in the end, I did manage to derive the formulas and come up with an explanation for why they make sense. I had originally planned to write about that here, but this post is pretty long already, so perhaps that’s a post for another time.
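To make the original-units formulas concrete, here’s a short NumPy sketch (my own variable names; the 80-centimeter query is just a hypothetical example) that computes the slope and intercept in centimeters and kilograms and answers the question from earlier:

```python
import numpy as np

heights_cm = np.array([53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9])
weights_kg = np.array([4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785])

r = np.corrcoef(heights_cm, weights_kg)[0, 1]

# Slope is r scaled by the ratio of the SDs; the intercept then
# falls out of the fact that the line passes through the point
# of averages.
slope = r * np.std(weights_kg) / np.std(heights_cm)           # kg per cm
intercept = np.mean(weights_kg) - slope * np.mean(heights_cm)  # kg

# Hypothetical prediction: modeled weight at a height of 80 cm.
predicted_weight = slope * 80 + intercept
print(round(predicted_weight, 2))  # roughly 12.9 kg
```

The units work out nicely: the slope is in kilograms per centimeter, so multiplying by a height in centimeters and adding the intercept (in kilograms) yields a prediction in kilograms.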
Update (January 31, 2019): Incidentally, if you’re curious about what the remaining eleven (!) ways to look at the correlation coefficient are, I now have a followup post that explores another one of them!
This is not to say that the predictions will necessarily be good, because a straight line might not be a good fit for your data. But if you must have a straight line, then the regression line is the best straight line you can have. ↩
Incidentally, these were all “well-baby” checkups, not times when she was sick. She went to the doctor when she was sick a couple of times, too, but they didn’t measure her height at those appointments, only weight, so I haven’t included them here. ↩
Standard units are also known as standard scores, or z-scores. My impression is that the terminology “standard units” isn’t as, uh, standard as “z-scores” or “standard scores”, but I like it because I think it’s useful to think of standard units as just another unit of measurement. That’s the point of view that Brian M. Scott adopts in this fantastic answer to a question about standard units and standard deviations: “You might say that the standard deviation is a yardstick, and a z-score is a measurement expressed in terms of that yardstick.” ↩
I like the Data 8 definition of standard deviation: it’s the root mean square of deviations from average. Section 14.2 of the book has a detailed explanation. ↩
If my explanation here isn’t convincing enough, Ani Adhikari’s explanation in one of the Data 8X videos might be! ↩
In the conversation that followed Justine’s original tweet, a couple of people expressed concern that taking her advice would have hurt their GPA and therefore damaged their CS Ph.D. application prospects. In fact, one person (who I won’t link to, because I don’t want to pick on them personally) wrote that they would have wanted to minor in another field, but didn’t because “anything under a 4.0 is DOA for CS grad school”.
Is it? As so often, we can turn to Mor Harchol-Balter’s advice on applying to Ph.D. programs in CS for trustworthy information. Here’s what she says:
When applying to a Ph.D. program in CS, you’d like your grades in CS and Math and Engineering classes to be about 3.5 out of 4.0, as a rough guideline. It does not help you, in my opinion, to be closer to 4.0 as opposed to 3.5. It’s a much better idea to spend your time on research than on optimizing your GPA. At CMU the mean GPA of students admitted is over 3.8 (even though we don’t use grades as a criterion), however students have also been admitted with GPAs below 3.3, since research is what matters, not grades. A GPA of 4.0 alone with no research experience will not get you into any top CS program. […] At CMU we receive hundreds of applications each year from 4.0 GPA students who have never done research. These are all put into the high risk pile and are subsequently rejected.
As further evidence that you don’t need a 4.0 to go to grad school, I offer my transcript for my undergrad degree in computer science and music. I’m not holding myself up as some sort of ideal candidate for CS Ph.D. programs — far from it! — but I did manage to get into grad school with these less-than-stellar grades.
In the end, my undergrad GPA was just a hair over 3.5 — slightly higher than that in CS, and considerably lower in math. A few things to note about my undergrad experience:
Perhaps it’s worth noting that I never took undergrad courses on computer architecture, compilers, distributed systems, AI, machine learning, robotics, statistics, networking, security, cryptography, or graphics. I filled in a few of those gaps in grad school, but not even close to all of them. Some of them I’m only now beginning to fill in, and others I’m not sure I’ll ever get to. If you evaluated me according to what Matt Might says every CS major should know, I’d get a failing grade. (Thankfully, Matt also identifies being good at failing as a predictor of success in research.)
Anyway: in grad school, as in the rest of life after undergrad, grades aren’t very important. What matters in grad school is research, and applicants who demonstrate an understanding of what does and doesn’t matter will be at a great advantage over those who don’t. I might even go so far as to say that a 4.0 undergrad GPA could be something of a liability on grad school applications, because it could make an applicant appear to be someone who pursues a high GPA for its own sake as opposed to pursuing more important things and being willing to fail at them. In any case, the possibility of getting low grades is a bad reason not to study something you’re interested in.
I’m a rising junior at [university], majoring in computer science and math.
I have recently made up my mind about going to grad school for PL and am trying to get as much experience as possible right now. I will be attending [PL-related summer school] in two weeks. I will also be TA’ing [university]’s Programming Languages course and working with a professor on a PL-related research project next term.
My question is, is there any reason why I might want not to apply for PLMW at ICFP 2018 this September, and instead, apply next year? I am guessing I can only attend once, but I am not sure about that either. What are some criteria that I should consider when choosing a specific PLMW? Can you help me make a decision?
“PLMW” refers to the Programming Languages Mentoring Workshop, a one-day workshop targeted at people who are in the beginning stages of a programming languages Ph.D. program, and at those who, like my correspondent, are looking into doing a Ph.D. in PL. The workshop is held several times a year in conjunction with POPL, PLDI, SPLASH, and ICFP, four of the biggest venues for programming languages research. PLMW has been running since 2012 (I attended the first two PLMWs when I was in grad school!), and it’s now become an established part of the PL grad school (or pre-grad-school) experience. There’s usually an all-star lineup of speakers, covering current topics in PL research as well as advice on how to navigate life as a Ph.D. student. There’s usually also a panel of “young”^{1} researchers talking about their Ph.D. experience; I got to be on one of these panels at POPL 2016.
Because the workshop is always held at a major PL conference on the day before the main conference program begins, it’s a great way for those who are new to the PL conference scene to ease into things — both in the sense that it introduces some technical material that will be useful for understanding the rest of the conference, and in the sense that it can help first-time attendees feel more connected socially. Plus, PLMW is more than a workshop — it’s also a scholarship program that will pay your way to attend the rest of the conference, including funds toward travel, lodging, and registration.
So, should you apply to attend PLMW? Yes, you should. But that brings us to my correspondent’s question: which PLMW should you go to?
First of all, you can go to more than one PLMW, but you can probably only get funding to go to one. Notice, for instance, that one of the criteria for scholarship eligibility for PLMW at ICFP ‘18 is “have not been funded by a prior PLMW”. (I got funded to go to PLMW twice when I was a grad student, but in those days, PLMW was still new, and the funding policy might have been tightened up since then.)
When should you go? If, like my correspondent, you’re in your junior year of undergrad, you’ll probably be one of the youngest people at the conference, but that’s not necessarily a problem. It’ll still be an excellent learning opportunity. If you plan to apply to grad school during your senior year, consider going to PLMW that fall (or before), so you can have your PLMW experience to draw on when you’re applying.
Another criterion to consider is the conference at which the PLMW is held: POPL, PLDI, SPLASH, or ICFP. The content of PLMW itself might not be drastically different from one to the next, but the content of the main program of the conference will be somewhat different. At ICFP, things will be more geared toward functional programming; POPL will have more theory people; PLDI will have more hardcore language implementation people; and SPLASH will have a mix of systems, PL, and applications. But don’t make too much of these differences, and please liberally fill in air quotes around words like “functional programming”, “theory”, and “systems” that appear in the previous sentence. There’s actually a lot of overlap among all of these conferences, and all of them are good. It’s normal for a PL researcher to be a regular at more than one of them.
Consider location, too. For instance, ICFP is in St. Louis this fall and co-located with Strange Loop, so if you want to go to Strange Loop (or already have plans to go), then ICFP would be a good pick for your first PLMW. (If so, then hurry — scholarship applications are due tomorrow!) On the other hand, SPLASH is being held in Boston this fall, and Boston happens to be home to a lot of great places to study PL, so if you want to (for example) visit Northeastern or MIT while you’re in town for the conference, then PLMW at SPLASH 2018 might be a good pick. If you want to wait a little longer, there will certainly be a PLMW at POPL 2019, which is being held in Portugal — although if you’re coming from the US and you have a visa/citizenship situation that would complicate re-entry into the US, be mindful of that before you make your plans.
Good luck, and I look forward to seeing you at a SIGPLAN conference soon!
I don’t like using the word “young” for these things; “junior” or “emerging” are a bit better. ↩
After defending my Ph.D. in 2014, I joined Intel Labs as a research scientist. While at Intel, I’ve gotten to work on some really cool stuff with great collaborators, but for a while now, I’ve been contemplating returning to academia, and last fall I decided to quietly start applying for a few faculty positions.
Mine was a relatively small-scale job search, and I didn’t tell too many people that this was what I was doing. I didn’t necessarily expect things to work out. But they did! I am absolutely thrilled to share that two weeks ago, I accepted an offer to join the Baskin School of Engineering at UC Santa Cruz as a tenure-track assistant professor of computer science. I officially start in July, and I’ll be teaching starting in the fall. I’ll be joining an amazing group of faculty, including Peter Alvaro, Faisal Nawab, Owen Arden, Lise Getoor, Cormac Flanagan, Carlos Maltzahn, and many more.
I’m excited to be a Banana Slug, and it seems like my daughter is pretty stoked, too.
When I was working on my application materials back last November, my friend Neel Krishnaswami advised me to make up a narrative that would connect all the work that I had done, from LVars to ParallelAccelerator to neural network verification. At first, I despaired of ever being able to come up with a coherent story to connect the different things I’d done. But the more I thought about it, the more I realized that not only did I have a story, I actually believed in the story I was telling! As I wrote back in February:
If there’s one theme that ties together all the stuff I’ve been working on or studying in the last several years, it’s that a high-level representation of programmer intent (commutative operations on lattice elements in the case of LVars; aggregate array operations in the case of ParallelAccelerator; the ReLU predicate added to the theory of linear real arithmetic in the case of Reluplex^{1}) enables optimizations or smart scheduling choices that would be difficult or impossible otherwise.
I structured my job talk around this theme: that choosing the right high-level abstractions to express a software system not only does not compromise high performance, but actually enables it.
The folk wisdom in computing is that in order to be efficient, things have to be low-level and “close to the metal”. But it turns out that choosing the right high-level domain-specific abstractions can actually be the key to unlocking efficiency, whether we’re talking about lazy SMT solving, or about high-performance DSLs. To me, this is one of the most beautiful ideas in computing.
I thought about adding some more details here about my plans for future research, but it would make this post way too long. So for now, I’ll just say: if you’re interested in doing a Ph.D. with me at UCSC to explore interesting ideas in the intersection of programming languages, distributed systems, and verification, you should get in touch!
As anyone who’s ever read this blog already knows, I’m incapable of shutting up about !!Con, the conference of lightning talks about the joy, excitement, and surprise of computing that I co-founded with a group of friends in 2014. (Our fifth annual conference in New York just wrapped up today; videos will be available soon!) !!Con has been such a success that the demand for what we’re doing is much greater than we’re able to meet. This is a great problem to have, but it means that we’re constantly having to disappoint people. There’s only so much that our small team of volunteers can do.
The long-term solution to our inability to meet demand is for there to be a lot more conferences inspired by !!Con, in a lot more places, organized by a lot more people. A new generation of conference organizers will need to step up. But in fact, that’s already happening — EnthusiastiCon, Hello, Con!, and StarCon are three examples. It’s incredibly exciting to me to see all these conferences thrive.
Somehow, though, as supply has increased, demand hasn’t lessened. If anything, the demand for !!Con has only increased as more and more people experience the magic and fun of this conference format. This year, we got nearly 300 talk proposals, the most we’ve ever had. It’s clear that we need to expand, not only to better serve our existing audience, but also to better serve those who can’t easily travel to New York for the weekend — a group that includes me, especially now that I’m a parent.
So, next year we’re expanding !!Con to the west coast! We’ll be holding our first conference outside of New York, !!Con West, at UC Santa Cruz in early 2019.
How did this come about? As part of negotiating my offer from UCSC, I asked the department to provide financial and logistical support for bringing a version of !!Con to campus, and they enthusiastically agreed to do so. (In fact, my department chair and other future colleagues already knew about !!Con, because I’d written about it in my application materials.) As I see it, !!Con isn’t just a fun side project, but actually an integral part of my outreach mission as a scientist and educator, and I can’t tell you how much it means to me that UCSC is putting their money where their mouth is and supporting that mission!
Finally, we have two goals for !!Con West. First, of course, we want to put on another great conference in the !!Con tradition. Second, though, we want to incubate a new generation of conference organizers. By the time we’re done, every member of the !!Con West organizing team will have the skills and experience to go out and launch their own conferences, if they want to. If that sounds like something you want to do, fill out this form to apply to become a !!Con West organizer. Join us!
I want to point out that I can’t take any credit for the idea of extending the theory of linear real arithmetic with a ReLU predicate – that was all Guy Katz, with the assistance of the rest of the Reluplex team at Stanford. But when I meditated on why I liked the idea (and, more broadly, the idea of creating high-performance domain-specific solvers) so much, I realized that it fit right into the rest of the story that I was trying to tell. ↩
At last year’s DSLDI, we heard from Ron Garcia on gradual typing, Nimo Ni on the Penrose system for declaratively creating mathematical diagrams, Xiangqi Li on domain-specific debugging in Racket, and lots more; then we went out for Italian food.
Ron Garcia (@rg9119 ) gives a tutorial on gradual type consistency as part of his #dsldi17 keynote talk. pic.twitter.com/TMp4xlzbF7
— Lindsey Kuper (@lindsey) October 22, 2017
Nimo Ni describes a pair of languages for creating mathematical diagrams. #dsldi17 pic.twitter.com/XXWMbEFbRl
— Lindsey Kuper (@lindsey) October 22, 2017
Xiangqi Li tells us about domain-specific debugging in Racket. #dsldi17 pic.twitter.com/hQyCzZN9Wv
— Lindsey Kuper (@lindsey) October 22, 2017
If all this sounds like a good time to you, you should join us for DSLDI 2018! I’m excited to be involved with organizing DSLDI again, particularly since Sam is in charge this year. We’re once again calling for short talk proposals, due August 17, and information about this year’s program committee should be available soon.
In addition to the letter grades that each reviewer gives to each proposal, this year I also tried to write a sentence or two for each talk I reviewed, explaining why I gave it the grade I did. As I wrote these explanations, a few themes emerged in the kinds of talks I was giving less-than-enthusiastic reviews to. I want to talk about what these were and offer advice on how to avoid them, in the hope that someone will find my advice useful when they’re submitting talk proposals to future !!Cons, or other conferences like it. (For conferences that aren’t much like !!Con, keep in mind that this advice may not do much good, and may in fact be misleading.)
!!Con talks are meant to be ten minutes long.^{1} On our talk proposal submission form, one of the things we ask prospective speakers for is a “timeline”, or a summary of how the speaker plans to use their ten minutes of stage time. The timeline helps us make sure that the speaker understands the lightning talk format, and that they’ve put some thought into making sure that their talk will fit into the allotted time.
Most other calls for talk proposals don’t ask for a timeline, so even people who are frequent conference speakers don’t always know what we’re asking for here. For whatever reason, every year we get some talk proposals that completely fail to provide any kind of reasonable timeline. Some people use the timeline as a place to continue their abstract. I’ve even seen people just copy-paste their abstract into the timeline field on the form. Other submitters seem to think that by “timeline”, we’re just asking how long the talk will be — for instance, we got a couple this year that just had “10 minutes” listed as the timeline. (We’re not asking how long your talk will be — we already know how long it’s supposed to be! We’re asking how you’ll use the time.) And even more absurd than “10 minutes” was the one we got this year that said “20 minutes”! That’s pretty clear evidence that the talk submitter doesn’t know what !!Con is and didn’t bother to read what the call for proposals was asking for.
Timelines aren’t legally binding agreements, of course. We expect that most speakers will rearrange things a little, or even a lot, between when their talk proposal is due and when the talk actually happens. The important thing is for the prospective speaker to show us that they’ve come up with a plausible-sounding plan for a talk they could give in ten minutes. Whether it bears much resemblance to the actual talk they end up giving is another question. For more advice on timelines, see my post “How to write a timeline for a !!Con talk proposal”, which has several examples.
We have a policy at !!Con that talk titles have to include at least one exclamation point.^{2} However, an exclamation point isn’t sufficient to make a title exciting. Consider the following four titles:
One of these things is not like the others. The first three are all titles of talks that were actually presented at !!Con (by Kiran Bhattaram in 2015, wilkie in 2016, and Jan Mitsuko Cash in 2017, respectively). The fourth is representative of a class of talk proposals that get rejected. I’m not trying to pick on Spark in particular here, but “Introduction to Data Analytics with Apache Spark!” sort of looks as though the submitter just took the title of a talk they’d given at another, more traditional conference, and stuck an exclamation point on the end.
Is a bad title enough to get a talk proposal rejected? Not necessarily. Titles are easy to change, and in fact, we do sometimes ask people to change the title of a talk after it’s been accepted. But since we only have room to accept about thirty submissions, a boring title can be enough to push a talk into the “reject” category. So, instead of giving us a reason to reject your otherwise amazing talk, come up with an exciting title that shows us how excited you are about giving the talk!
Another thing to keep in mind regarding titles is that at !!Con, your talk’s title really doesn’t have to serve as a standalone summary of the talk. At bigger conferences, where there are a hundred talks and lots of tracks to choose from (and where a talk is usually more than ten minutes long, so choosing any given talk is a larger commitment), attendees often scan through titles to decide which talks to attend, and so some speakers will try to pack their talk titles with descriptive keywords that they hope will attract attendees. But !!Con is a single-track conference where there’s time for attendees to read all the talk abstracts, and every talk is well-attended, so a title like “A Shot in the Dark!” works well even though it doesn’t explain what the talk will be about. In fact, titles that don’t reveal all the details are a great way to pique the audience’s curiosity.
We get a lot of talk proposals that purport to show the audience how to achieve a practical goal — get more done at work! write more maintainable code! ship a product! — or that purport to teach a specific skill or a way of thinking about programming. These kinds of practical, be-more-effective-at-work talks go over well at many conferences, but we find that most of the time, they don’t work so well at !!Con.
Why not? I think it’s because most people don’t come to !!Con with the goal of learning practical skills. Rather, they come to have fun and ignite, or reignite, their excitement and curiosity about computing. It’s not that people don’t pick up useful knowledge to improve their craft from talks at !!Con — to the contrary! — but it’s a question of how the knowledge is presented. My advice to talk submitters is to aim for whimsical rather than practical. Tell a compelling and engaging story of how a project got done despite obstacles (like the story of how Lisa Ballard and Ariel Waldman built spaceprob.es using an undocumented NASA API!), or share an interesting concept for its own sake (like locality-sensitive hashing!).
Some !!Con talks, like Bomani McClendon’s great talk last year about building giant animatronic glowing mushrooms, do include “lessons learned” and high-level takeaways, but those high-level takeaways work because they’re presented in the context of a specific project or a specific concept. One of the high-level takeaways for Bomani’s talk, for instance, had to do with how he came to see software as a medium for art, and himself as an artist, as a result of writing software for an art project. But if the title of his talk had been “Level up as a programmer by seeing software as an artistic medium!” rather than “Making Mushrooms Glow!”, it wouldn’t have been as strong of a talk proposal.
These kinds of differences in framing can be quite subtle, but the good news is that if you submitted a talk proposal of the “this talk will make you do better work!” variety and it was rejected, it might not take much effort at all to reframe it as a talk proposal that would be a great fit for !!Con. Consider revising and resubmitting it next year!
!!Con attendees and speakers come from a wide variety of backgrounds — they might design games, build smart watches, study geometry, or develop interplanetary spacecraft flight software. A given audience member may or may not know much about any of these topics. But even talks on very narrowly specialized topics will have something to offer to both specialists and nonspecialists, if they’re done well. In fact, we prefer deep and narrow talks to broad and shallow ones. (The lightning talk format helps: in our experience, just about anything can be interesting for ten minutes if it’s presented well.)
So, talk proposals on narrowly focused topics are great for !!Con. A problem arises, though, when these talk proposals also make a lot of assumptions about the audience’s background, goals, or motivations. We get a lot of talk proposals that aren’t a good fit for !!Con because they assume background that not everyone in our audience has, or goals that not everyone in our audience has. This can happen, for instance, when people take a talk they’ve previously submitted to a more narrowly focused conference and try to reuse it for !!Con. If you send us a talk proposal that you previously sent to PopularMachineLearningToolFest or JavaScriptWebFrameworkOfTheMomentConf, you’ll probably need to make some changes to the way you’re framing your talk to make it suitable for !!Con, regardless of how wonderful PopularMachineLearningTool or JavaScriptWebFrameworkOfTheMoment are.
Will you have to rewrite your existing talk proposal from scratch? Maybe not! Removing assumptions about shared background and goals from a talk proposal could be as simple as changing a sentence like “We all need the web apps we build to work well on mobile devices” to “I needed the web app I built last year to work well on mobile devices”. Instead of trying to convince your audience that they care (or ought to care) about your topic, tell them why you care! Making the talk more personal will also make it more compelling, because it becomes a talk that only you can give.
Thanks to Stephen Tu, Laura Lindzey, Michael Malis, Julia Evans, and Erty Seidohl for feedback on a draft of this post.
In practice, sometimes people run a few minutes long, and that’s okay. We don’t drag them off the stage. ↩
If you submit a killer talk proposal but you forget to put an exclamation point in your talk title, don’t worry, we won’t reject it for that reason alone! Every year we accept a few talks with no exclamation point and ask the speaker to add one as part of our standard copyediting step. Still, you should try to remember to include the exclamation point yourself – it’s one of many ways that you can convey to us that you’re excited about your topic! ↩
Part of this is just a summary of my coauthors’ very cool work on Reluplex, which they’ve been working on since well before I got involved. For me, though, this isn’t just about Reluplex or even just about neural network verification, but about a more general idea that I think Reluplex exemplifies: the idea of exploiting high-level, domain-specific abstractions for efficiency. If there’s one theme that ties together all the stuff I’ve been working on or studying in the last several years, it’s that a high-level representation of programmer intent (commutative operations on lattice elements in the case of LVars; aggregate array operations in the case of ParallelAccelerator; the ReLU predicate added to the theory of linear real arithmetic in the case of Reluplex) enables optimizations or smart scheduling choices that would be difficult or impossible otherwise.
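To make the LVars instance of that idea concrete, here’s a toy sketch in Python (the real LVars work is a Haskell library; this is just my illustration of the lattice idea): a shared variable that can only grow over time via least-upper-bound writes, so concurrent writes commute.

```python
import threading

# Toy sketch of the LVar idea (not the real LVars library, which is in
# Haskell): a shared variable whose state only grows, via least-upper-bound
# writes in the lattice of sets ordered by inclusion. Joins commute, so the
# final state is the same under every thread interleaving.
class SetLVar:
    def __init__(self):
        self._state = set()
        self._lock = threading.Lock()

    def put(self, elems):
        # least-upper-bound write: join new information into the state
        with self._lock:
            self._state |= set(elems)

    def freeze(self):
        # simplification: real LVars restrict reads (e.g., to "threshold"
        # reads) to preserve determinism while the computation is running
        with self._lock:
            return frozenset(self._state)

lv = SetLVar()
threads = [threading.Thread(target=lv.put, args=([i, i + 1],)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(lv.freeze()))  # [0, 1, 2, 3, 4, 5] under any scheduling
```

Because set union is commutative and associative, every interleaving of those five puts yields the same final state — that’s the determinism guarantee that LVars generalize.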
The team’s experience with Reluplex suggests that to make headway on hard verification problems, it won’t be enough to use off-the-shelf, general-purpose solvers, and that we need to instead develop domain-specific solvers that are suited to the specific verification task at hand. That suggests to me that we (or, you know, someone) should try to create tools that make it really easy to develop those domain-specific solvers (analogously to how Delite aimed to make it really easy to develop high-performance DSLs) — and perhaps Rosette, which I’m now learning about through my work on the CAPA program, will play a role here. It’s all connected!
In addition to making solvers better, there’s also the other side of things: designing networks that are easier for solvers to chew on in the first place. I think we’re just beginning to learn how to do this, but what I find promising is that some design choices that are desirable for other reasons — like pruning and quantizing networks to reduce storage requirements, speed up inference, or improve energy efficiency — may also make those networks more amenable to verification.
It may even be the case that the same hardware accelerator techniques that optimize inference on quantized or low-precision networks can also optimize verification of those same networks! This is honestly just pure speculation on my part, but if you’re interested in trying to figure out if it’s true, I hope you’ll come talk to me at SysML.
Baris Kasikci asked a good question on Twitter recently about the Reluplex work: “How difficult is it to produce the SMT formula for the property that you want to prove?” Part of the answer is that it depends on the property. We want properties that can be expressed in terms of “if the input to the network falls into such-and-such class of inputs, then its output will fall into so-and-so class of outputs”. For the prototype ACAS Xu aircraft collision avoidance network that Reluplex used as a case study, the input is relatively high-level, expressing things like how far apart two planes are, and the output is one of five recommendations for how much an aircraft should turn (or not turn). In that setting, it’s relatively easy to at least state the properties one wants to prove. For a network where the input is something more low-level, like pixels, it may be a lot harder to state properties of interest.
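To make the shape of such properties concrete, here’s a toy sketch — entirely made-up weights, with naive interval arithmetic standing in for what a solver like Reluplex would decide exactly — of checking an “if the input falls in this box, the output falls in that range” property:

```python
# Toy sketch of a verification property of the form "if the input falls into
# such-and-such class, the output falls into so-and-so class". The one-neuron
# "network" and its weights are made up; interval arithmetic here is just an
# illustration, not what Reluplex actually does.
def relu(x):
    return max(0.0, x)

def affine_interval(lo, hi, w, b):
    # image of the interval [lo, hi] under x -> w*x + b
    a, c = w * lo + b, w * hi + b
    return min(a, c), max(a, c)

def relu_interval(lo, hi):
    # relu is monotonic, so it maps interval endpoints to endpoints
    return relu(lo), relu(hi)

# network: y = relu(2x - 1); property: x in [0, 1] implies y in [0, 1]
lo, hi = affine_interval(0.0, 1.0, 2.0, -1.0)  # (-1.0, 1.0)
lo, hi = relu_interval(lo, hi)                 # (0.0, 1.0)
print(lo, hi)
assert 0.0 <= lo and hi <= 1.0  # the property holds over the whole input box
```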
That should give us at least a moment’s pause, since the whole point of deep learning is that it’s possible to start with relatively low-level inputs and allow the computer to figure out for itself what features are important, rather than having to do a lot of complicated feature engineering to provide input. But the mechanism by which a deep neural network accomplishes that is mostly “lots of layers”, with successive layers operating on increasingly abstract representations of the input. Julian et al. describe the prototype ACAS Xu network as “deep” more because it has a lot of layers than because the input is particularly low-level.
But let’s say that we do have a network that takes low-level inputs like pixels: can Reluplex verify anything interesting about it? Yes, it can! There are still interesting properties we can express to the solver — like, say, robustness to adversarial inputs. I’ve claimed in the past (although I suspect some of my coauthors would disagree with me) that adversarial robustness properties are actually kind of boring, because we already know ways to state those properties for arbitrary networks, and that the more interesting properties are the ones that we’ll have to work hard to even figure out how to express, much less prove. But even if we’re “just” looking at verifying adversarial robustness, it’s not like there’s any shortage of work to do, both in coming up with meaningful robustness metrics and in making solving lots more efficient.
At first, I wrote lots of posts about my dissertation work. I eventually managed to publish, graduate, and get a job, and since then, I’ve continued to use this blog to write about whatever I’m thinking or learning about: distributed computing, domain-specific languages, verification, machine learning, and lots of other things. I’ve also advertised conferences and workshops that I’ve been involved with; told debugging stories; and dispensed advice. Inevitably, the personal and the professional intersect.
This blog doesn’t have anywhere near the huge following of folks like Dan Luu or Julia Evans, but I’ve been fortunate to have picked up a few readers who enjoy it and who I hear from often. I think that part of the blog’s success comes from the fact that I make myself post regularly. I committed to writing two posts a month when I started the blog, and although that means that some of the posts end up being “filler”, I think the benefits outweigh the drawbacks. Having the end of the month looming is often the nudge that I need to finally finish and post things that otherwise would have lingered unpublished for a long time.
At least, that’s how it works sometimes. Too often, though, I let a month go by without posting and then have to backdate posts to the previous month in order to keep up. This has been a problem for some time, but after my daughter was born last July, it got much worse. My posts “What do people mean when they say ‘transpiler’?” and “My first fifteen compilers” were both dated July, but they were actually published in August and September, and I’m sure I wouldn’t have even managed that if I hadn’t already had drafts of both of them started before Sylvia was born.
My two August-dated posts went up in September. When I went back to work full-time in mid-October, the schedule slipped even further, and my September-dated posts didn’t go up until November. By the time December came around, I had six posts to finish for the year; with effort, I managed to crank them all out by January 10. And now it’s the end of the month again, and, predictably, I’m behind.
I’m proud to have made it five years at the twice-a-month pace, but it’s become clear that it’s not a sustainable pace for me anymore. So in 2018, I’m going to try changing my pace of posting to once a month, and seeing how that goes. Once a month seems like something that I can manage. I’m already feeling relieved about this decision. Instead of dreading having to backdate posts for January, I’m excited to write about my colleagues’ and my SysML submission for February. That’s the way that blogging should feel — like fun, not a burden!
Now that I’ve made this decision, I’m realizing that my old posting schedule was causing me to make bad decisions and engage in false economies: at one point, I even remember thinking, “I should agree to serve on the so-and-so program committee, because that means I can write a post about its call for papers, and that’ll take care of one of the posts I need to write this month!” (I said no to that PC invitation, but the fact that I was even tempted to say yes just to get a blog post out of it meant that there was something wrong with how much I was expecting myself to blog.)
Why not do away with this quota system altogether and just blog if and when I feel like it? Looking back on the posts that I’m happy with from over the last five years, I’m certain that a lot of them would never have been published if it hadn’t been for deadline pressure. An example is “Using the simplex algorithm for SMT solving”, which was one of those posts I cranked out in December when I was under the gun to finish a bunch of posts by the end of the year. It was a draft I’d had half-finished for a long time; I was writing about something that I had found interesting, but I wasn’t sure if I was really explaining things properly or if it would be interesting to anyone else. But the deadline meant that I couldn’t afford not to finish up and post it, and when I did, the reaction I got was quite positive! So, I really do think a bit of deadline pressure is good for me. The trick is to have a sustainable amount of deadline pressure, and not so much that I can’t keep up.
Wait, what?
Proceedings of the ACM on Programming Languages, or PACMPL, is the ACM’s new open-access journal for “research on all aspects of programming languages, from design to implementation and from mathematical formalisms to empirical studies”. So far, each issue publishes exactly the papers from one annual ACM PL conference: POPL, ICFP, and OOPSLA. (The fourth major ACM PL conference, PLDI, chose not to participate.) This means that there will be three issues of the journal each year, named “issue POPL”, “issue ICFP”, and “issue OOPSLA”.
So, what would have at one time been called the call for papers for ICFP 2018 is now officially known as the call for papers for PACMPL issue ICFP 2018. Because “PACMPL” has much less name recognition in our community than “ICFP”, I imagine a lot of people don’t realize it exists yet and are at risk of being confused by this. So, to alleviate confusion: the ICFP conference is still the ICFP conference, same as always, but now ICFP papers are being published in a journal rather than as a conference proceedings.
In fact, this was already the case for ICFP 2017, the papers from which were published in the first-ever issue of PACMPL: Volume 1, Issue ICFP. So 2018 will work more or less like 2017 did, but we’ve updated the call for papers to reflect that the papers in question will go into a journal.
Why does this matter? Who cares if papers are published in a journal or a conference? As Phil Wadler explained a while back:
Programming languages are unusual in a heavy reliance on conferences over journals. In many universities and to many national funding bodies, journal publications are the only ones that count. Other fields within computing are sorting this out by moving to journals; we should too.
The best thing about PACMPL is that it’s open access, and therefore all POPL, ICFP, and OOPSLA papers published in it are freely available. Tell your friends!
Quoting from the website:
This workshop aims to investigate the principles and practice of consistency models for large-scale, fault-tolerant, distributed shared data systems. It will bring together theoreticians and practitioners from different horizons: system development, distributed algorithms, concurrency, fault tolerance, databases, language and verification, including both academia and industry.
Relevant discussion topics include:
 Design principles, correctness conditions, and programming patterns for scalable distributed data systems.
 Techniques for weak consistency: session guarantees, causal consistency, operational transformation, conflict-free replicated data types, monotonic programming, state merge, commutativity, etc.
 Techniques for scaling and improving the performance of strongly consistent systems (e.g., Paxos-based, state machine replication, shared-log consensus, blockchain).
 How to expose consistency vs. performance and scalability tradeoffs in the programming model, and how to help developers choose.
 How to support composed operations spanning multiple objects (transactions, workflows).
 Reasoning, analysis and verification of weakly consistent application programs.
 How to strengthen the guarantees beyond consistency: fault tolerance, security, ensuring invariants, bounding metadata size, and controlling divergence.
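As a taste of the “conflict-free replicated data types, state merge, commutativity” bullet above, here’s a minimal state-based G-Counter sketch (my toy example, not drawn from any particular system):

```python
# Minimal state-based G-Counter CRDT sketch. Each replica increments only its
# own slot; merge is a pointwise max, which is commutative, associative, and
# idempotent, so replicas converge no matter how merges are ordered or
# repeated.
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1

    def merge(self, other):
        # join of two states in the lattice of vectors ordered pointwise
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)       # merge in either order...
print(a.value(), b.value())  # ...both replicas converge to 3
```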
If you’re working on any of those things, please consider submitting a two-page talk proposal by the deadline of February 7! I look forward to reading your submissions.
In March, a friend asked me if I still had the syllabus from a course we took together in 2003. She was applying for a fellowship and was trying to remember if the work she’d done in the course fourteen years ago was relevant. I did not have the syllabus, but I had kept a lot of stuff from the course, and one of the documents that I still had happened to include a URL for the course website. That URL no longer worked, but the Wayback Machine had an archived version of it, and from there I was able to find the archived syllabus for my friend and help answer her questions for the fellowship application. I was excited that I was able to put together information from two sources — my own records (on paper, in a physical file folder!), and the Wayback Machine — to track down the document my friend needed.
In October, my friend Chris tweeted something that reminded me of how, back in the early 2000s, LiveJournal Support volunteers had a custom of saying “manatee” to mean “mentee”. You could be someone’s “support manatee” as you were learning how to do support work. I remembered that this usage was common enough to have been enshrined in the LJ Support Guide at one time. The current version of the Support Guide is a shadow of its former self, and it certainly doesn’t say anything about manatees. But I went to see if the old Support Guide I remembered was on the Wayback Machine, and it was (search that page for “manatee”)! It meant a lot to me to see that something from the culture of a community I cared a lot about in 2004 had been preserved.
Finally, most recently, I was trying to find materials I developed for a course I helped teach in fall 2011. To my embarrassment and frustration, the course website had long since disappeared. I scoured my email archives and files for anything I could find about the course and came up mostly emptyhanded. Then I remembered to check the Wayback Machine. Sure enough, the labs that I developed on the LilyPond music notation system and on Markov models for text and music generation were there. So was a page documenting a project our students had done that had built on the concepts those labs introduced. I’d forgotten about that student project, and finding it again brought me joy at a time when I needed it.
So, I’m including the Internet Archive in my end-of-year charitable giving. If you, too, have stories like mine and can afford to help them out, now is a great time to do it!
Here’s an example: let’s say I want to know what the longest posts on this blog have been.^{1} If I were using, say, WordPress, I’m not sure how I would go about figuring that out. The interface might or might not expose that information; I might have to install a plugin or something, or manually copy and paste the text of my posts into a tool that will show a word count. But since my posts are just text files, it’s just a couple of quick shell commands chained together and run in the directory where all my posts live:
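The pipeline is something along these lines (a reconstruction — the exact command didn’t survive here, and the .markdown extension is an assumption based on the naming convention I use):

```shell
# Word-count every post, sort numerically by word count, and show the biggest
# ones at the end. (Reconstructed sketch; the .markdown extension is a guess.)
wc -w *.markdown 2>/dev/null | sort -n | tail
```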
From the output of that command, I can see that my longest post is “‘Experiencing computing viscerally’: my PG Podcast interview about !!Con” from August 2016, although that one hardly counts because it’s a transcript of an audio interview. The next longest post is “The LVar that wasn’t”, from December 2013, which weighed in at 4526 words. The word count includes a small amount of overhead for stuff like the Markdown front matter in each post, but it’s mostly accurate.
Let’s say that I want to spellcheck all my posts. Since I have aspell installed, that’s pretty easy, too:
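Roughly like this (my reconstruction — the extension is an assumption, and aspell check opens an interactive session for each file in turn):

```shell
# Run aspell's interactive checker over every post in turn.
# (Reconstructed sketch; the .markdown extension is a guess.)
for f in *.markdown; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  aspell check "$f"
done
```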
This launches the aspell interface, in which I can interactively correct typos that it finds in each file, or add words it doesn’t know to my local dictionary. I tried it just now on the whole blog and found misspellings of ‘heterogeneous’, ‘narrative’, ‘necessarily’, ‘difference’, and ‘which’, all of which are now fixed.
Because I use git to version all my posts, I can see the version history of any post using git diff. Let’s say that I want to see all the changes made to the post “Refactoring as a way to understand code” from a couple years back. A quick git command on that post’s file shows me the answer. It turns out that I made a small copy edit about an hour after writing the post, correcting the phrase “get a more visceral sense of what it’s doing” to “give me a more visceral sense of what it’s doing”. (I seem to be into this whole viscerality thing.) A couple of days later, I added the sentence “I can sympathize.” to the post. A year and a half later, I removed the tag “programming” and added “refactoring” instead.
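Here’s a self-contained toy demo of the pattern (the throwaway repo and file names are mine, purely for illustration); in the real blog repo, I’d just point git log -p at the post’s file:

```shell
# Toy demo: in a real checkout I'd simply run `git log -p -- <post file>`
# inside the blog's repo to see every commit's diff to that post.
# Everything below is scaffolding so the demo stands alone.
demo=$(mktemp -d) && cd "$demo"
git init -q
git config user.name demo
git config user.email demo@example.com
echo "get a more visceral sense" > post.markdown
git add post.markdown
git commit -qm "initial post"
echo "give me a more visceral sense" > post.markdown
git commit -qam "copy edit"
git log -p -- post.markdown   # prints each commit's diff to the post
```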
With only three edits after the original post, I suspect that “Refactoring as a way to understand code” is actually one of my less heavily-edited posts. How can I see which ones are most heavily edited? The git effort command from the wonderful git-extras collection can help. If I want to see all my posts that have been edited twenty times or more, I can do this:
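The invocation is roughly this (a sketch: git effort comes from the git-extras package, I believe --above is the flag that filters by commit count, and the path is an assumption — check git effort’s help if it misbehaves):

```shell
# List files with at least 20 commits, via git-extras' `git effort`.
# (Reconstructed sketch; guarded so it's a no-op where git-extras isn't
# installed. The path is a guess at Octopress's layout.)
if command -v git-effort >/dev/null 2>&1; then
  git effort --above 20 -- source/_posts/*.markdown
fi
```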
git effort lists those posts for me, first ordered by filename (which, because of the naming convention I use, is also by date) and then ordered by the number of commits. It turns out that I have four posts that have been edited twenty or more times, with “Thoughts on ‘Adversarial examples in the physical world’” being the most edited post by a pretty wide margin. Fascinating!
I use Octopress to generate this blog. When I started the blog in early 2013, Octopress was still quite popular; by 2015, it had fallen far enough out of fashion that a commenter on Hacker News remarked on my anachronistic use of it. I suppose that person would think I’m a dinosaur for still using Octopress even now, as 2017 is coming to an end. There are any number of other static site generators out there that would probably do everything I want and are probably less janky than Octopress, but I keep using it out of inertia and because it still does the job well enough. I’ll probably switch from Octopress to plain Jekyll at some point, since Octopress is just a wrapper around Jekyll, and I don’t really use most of the features Octopress adds.
The particular blogging framework I use is beside the point, though. What’s more important to me is how easy it is to work with my blog using the wider world of command-line text manipulation tools, and it’s the fact that the posts are stored as versioned text that makes that possible. That, to me, is the real power of static site generation tools — it’s not so much about what those tools themselves do as it is about everything else that the posts-as-versioned-text approach enables, none of which is particular to Octopress or Jekyll or any other static site generator.
I’m always worried that My Best Blogging Days Are Behind Me and that I Never Blog About Anything Substantial Anymore, so it’s reassuring to look back and see that my longest posts are actually pretty evenly spread over the years that this blog has existed, and in fact, three of the nine longest posts were written within the last year. (Length isn’t the same thing as substance, of course, but there’s no shell command for quantifying substance yet.) ↩
This post isn’t an introduction to the simplex algorithm itself; for that, there are many good references, such as chapter 29 of CLRS. If you don’t have a physical copy of CLRS handy, or if yours is currently in use as a doorstop, some of chapter 29 is on Google Books. Better yet, try chapter 2 of Linear Programming by Vašek Chvátal, a book that Guy Katz pointed me to and that I like a lot, and almost all of which is on Google Books.
The simplex algorithm is a standard technique for solving linear programming (LP) problems. What’s a linear programming problem? Wikipedia has an example:
Suppose that a farmer has a piece of farm land, say $L$ km^{2}, to be planted with either wheat or barley or some combination of the two. The farmer has a limited amount of fertilizer, $F$ kilograms, and pesticide, $P$ kilograms. Every square kilometer of wheat requires $F_1$ kilograms of fertilizer and $P_1$ kilograms of pesticide, while every square kilometer of barley requires $F_2$ kilograms of fertilizer and $P_2$ kilograms of pesticide. Let $S_1$ be the selling price of wheat per square kilometer, and $S_2$ be the selling price of barley. If we denote the area of land planted with wheat and barley by $x_1$ and $x_2$ respectively, then profit can be maximized by choosing optimal values for $x_1$ and $x_2$.
This is an optimization problem that can be solved by maximizing a function of $x_1$ and $x_2$ — in particular, $S_1 \cdot x_{1} + S_2 \cdot x_2$ — subject to constraints that capture what we know about the relationships between the other variables involved. The simplex algorithm is a recipe for doing that. The steps of the algorithm can be carried out by hand, but LP solver software packages such as GLPK or Gurobi also use some version of the algorithm to find an optimal solution to LP problems.
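To make this concrete, here’s a toy instance with made-up numbers, solved by brute force: an LP’s optimum lies at a vertex of the feasible region, so with only two variables we can simply enumerate the intersections of constraint boundaries. (A real solver iterates between vertices with the simplex algorithm instead of enumerating them all.)

```python
from itertools import combinations

# Toy instance of the farmer's problem, all numbers made up: L = 10 km^2 of
# land, F = 24 kg fertilizer, P = 16 kg pesticide; wheat needs F1 = 3, P1 = 1
# per km^2 and sells for S1 = 5; barley needs F2 = 2, P2 = 2 and sells for
# S2 = 4. Each constraint is a row (a1, a2, b) meaning a1*x1 + a2*x2 <= b;
# the last two rows encode x1 >= 0 and x2 >= 0.
cons = [
    (1, 1, 10),   # land:       x1 + x2 <= 10
    (3, 2, 24),   # fertilizer: 3*x1 + 2*x2 <= 24
    (1, 2, 16),   # pesticide:  x1 + 2*x2 <= 16
    (-1, 0, 0),   # x1 >= 0
    (0, -1, 0),   # x2 >= 0
]
S1, S2 = 5, 4  # selling prices

def intersect(c1, c2):
    # solve the 2x2 system formed by two constraint boundaries (Cramer's rule)
    (a, b, e), (c, d, f) = c1, c2
    det = a * d - b * c
    if det == 0:
        return None  # parallel boundaries
    return (e * d - b * f) / det, (a * f - e * c) / det

candidates = (intersect(c1, c2) for c1, c2 in combinations(cons, 2))
feasible = [p for p in candidates if p is not None
            and all(a * p[0] + b * p[1] <= rhs + 1e-9 for a, b, rhs in cons)]
best = max(feasible, key=lambda p: S1 * p[0] + S2 * p[1])
print(best, S1 * best[0] + S2 * best[1])  # (4.0, 6.0) 44.0
```

So for this instance the farmer should plant 4 km² of wheat and 6 km² of barley, for a profit of 44.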
SAT and SMT solvers, on the other hand, solve satisfiability problems: problems where, given a formula with Boolean variables in it, we need to figure out whether or not there is some assignment of values to variables that will cause the expression to evaluate to “true”. (In the case of SMT solving, the formula is something fancier than a plain old Boolean formula, but the basic idea is the same.)
Determining whether a formula is satisfiable is a decision problem, not an optimization problem. Since SMT problems are decision problems and LP problems are optimization problems, I didn’t imagine that SMT solvers and LP solvers would have much, if anything, in common. But it turns out that that’s not the case!
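For contrast with the optimization examples above, here’s the satisfiability decision problem in its most naive possible form — brute force over all assignments (real SAT/SMT solvers are vastly cleverer about pruning the search):

```python
from itertools import product

# Naive satisfiability check: try every assignment. A "formula" here is just
# a function from a tuple of booleans to a boolean; real solvers work on
# structured formulas and prune the search aggressively.
def satisfiable(formula, n_vars):
    return any(formula(v) for v in product([False, True], repeat=n_vars))

# (x or y) and (not x or not y) -- satisfiable (e.g., x=True, y=False)
print(satisfiable(lambda v: (v[0] or v[1]) and (not v[0] or not v[1]), 2))  # True
# x and not x -- unsatisfiable
print(satisfiable(lambda v: v[0] and not v[0], 1))  # False
```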
To use the simplex algorithm to solve an LP problem, you first need to find what’s called a feasible solution to the problem, in which all the constraints are satisfied. The feasible solution is sometimes called a primal feasible solution or initial feasible solution, and the process of finding it is sometimes called initialization. Once that’s done, you can use the simplex method to iterate toward an optimal solution.
Some presentations of the simplex algorithm focus on the second part of the process only. In fact, in chapter 2 of the Chvátal textbook, it’s assumed from the outset that you already have a feasible solution, and now you just want to find an optimal one. For many real-world LP problems, this is a realistic assumption to make. For instance, suppose you’re the farmer from the example problem above. Whatever you’re already doing is a feasible solution: assuming you’ve set up the problem correctly, the amounts of fertilizer, pesticide, and land you’re using already fall within the constraints, because it would be impossible for you to do otherwise (for instance, you can’t use more land than exists, or put a negative amount of fertilizer on it). Of course, what you’re currently doing is probably not optimal, which is why you need the simplex algorithm! But at least you have an initial feasible solution from which to start iterating toward an optimal one.
But what do you do when you don’t have a feasible solution to begin with? It turns out that the simplex method is of use here, too. Chapter 3 of Chvátal (on pages 39–42) covers the situation in which you don’t have a feasible solution and need to come up with one. You do this by solving what’s known as an auxiliary problem. From the original problem, we can mechanically construct an auxiliary problem that always has an initial feasible solution, and then use the simplex method to optimize the solution to the auxiliary problem. Once that’s done, if it turns out that the optimal solution to the auxiliary problem is zero, then a feasible solution to the original problem exists and can be easily obtained, and we can then go on and carry out the steps of the simplex method on that feasible solution, resulting in an optimal solution to the original problem. And if the optimal solution to the auxiliary problem turns out to be nonzero, then that means that no feasible solution to the original problem exists.
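To make the construction concrete, here’s a tiny Python sketch of why the auxiliary problem always starts out feasible. (This is my own illustrative code, not Chvátal’s presentation; the function names and the ≤-form of the constraints are assumptions I’m making for the sake of a small example.) The idea: add one artificial variable x0 to every constraint, and then the point with all original variables at zero and x0 just large enough is feasible by construction.

```python
# Sketch (illustrative, not from the textbook): for constraints
# A x <= b with x >= 0, the auxiliary problem is
#   maximize -x0  subject to  A x - x0 <= b,  x >= 0,  x0 >= 0,
# and it always has an obvious feasible starting point.

def auxiliary_start(A, b):
    """Trivially feasible point for the auxiliary problem: set every
    original variable to 0 and let x0 absorb any violated constraint."""
    n = len(A[0])
    x = [0.0] * n
    x0 = max(0.0, -min(b))  # if some b_i < 0, x0 must cover the worst one
    return x, x0

def aux_feasible(A, b, x, x0):
    """Check A x - x0 <= b componentwise (with a little float slack)."""
    lhs = [sum(aij * xj for aij, xj in zip(row, x)) - x0 for row in A]
    return x0 >= 0 and all(l <= bi + 1e-9 for l, bi in zip(lhs, b))

# A system with no obvious feasible point: x1 + x2 <= 4 and x1 >= 2
# (written as -x1 <= -2, so the all-zeros point violates it).
A = [[1.0, 1.0], [-1.0, 0.0]]
b = [4.0, -2.0]
x, x0 = auxiliary_start(A, b)
assert aux_feasible(A, b, x, x0)  # x = (0, 0) with x0 = 2 works
```

If phase one then drives x0 all the way down to zero, the x it ends at is a feasible solution to the original problem; if the optimal x0 is positive, the original problem is infeasible, exactly as described above.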
Chvátal explains that this approach is what’s known as the two-phase simplex method.^{1} The first phase is setting up and finding an optimal solution to the auxiliary problem. If doing that results in a feasible solution to the original problem, we can then go on to the second phase, which finds an optimal solution to the original problem. Both phases use the same iterative process.
So, you can use an auxiliary problem to either get to a feasible solution for the original problem, or determine that there is no feasible solution. Figuring out whether or not there’s a feasible solution sounds like a decision problem, and it is! But, interestingly, we use an optimization technique to solve it. We turn the decision problem into an optimization problem: if the optimal answer to the auxiliary problem turns out to be zero, then the answer to our decision problem is yes, and if the optimal answer to the auxiliary problem turns out to be nonzero, then the answer to the decision problem is no.
Back in May, I wrote about how for certain kinds of linear real arithmetic formulas, SMT solving and LP solving coincide, but I didn’t mention the simplex algorithm in that earlier post. I can now be more specific and say that for certain kinds of SMT formulas, determining the satisfiability of the formula coincides with carrying out just the first phase of the simplex algorithm that LP solvers use. Finding a feasible solution is the entire problem, and once we have one, we can stop. After all, the goal of SMT solving is just to determine whether a formula is satisfiable or not — we don’t care about finding an “optimal” way to satisfy it!
In fact, if we peek into the guts of GLPK, which is the LP solver on top of which the proof-of-concept implementation of Reluplex is built, we see that in its internal representation of an LP problem, there’s a field called `phase` that represents which of the two phases we’re in. Often, computer implementations of an algorithm look pretty different from descriptions of how to carry out the algorithm by hand, so I think it’s interesting to see that the notion of phases isn’t specific to by-hand applications of the simplex method, but is instead fundamental enough that at least one computer implementation makes use of it, too!
Halide is really two languages: one for expressing the algorithm you want to compute, and one for expressing how you want that computation to be scheduled. In fact, the key idea of Halide is this notion of separating algorithm from schedule.
The paper uses a two-stage image processing pipeline as a running example. Let’s say you want to blur an image by averaging each pixel with its neighboring pixels to the left and right, as well as with the pixels above and below. We could write an algorithm for this 3x3 box blur something like the following, where `in` is the original image, `bh` is the horizontally blurred image, and `bv` is the final image, now also vertically blurred:^{1}
```
bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3;
bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1)) / 3;
```
The Halide paper illustrates how, even with a simple two-stage algorithm like this, there are lots of challenging scheduling choices to be made. For instance, you could do all of the horizontal blurring before doing the vertical blurring (the “breadth-first” approach), which offers lots of parallelism. Or you could compute each pixel of the horizontal blur just before you need it for the vertical blur (the “total fusion” approach), which offers good data locality. Or you could take a “sliding window” approach that interleaves the horizontal and vertical stages in a way that avoids redundant computation. Or you could do some combination of all those approaches in search of a sweet spot. The choices are overwhelming, and that’s just for a toy two-stage pipeline! For more sophisticated pipelines, such as the one described in the paper that computes the local Laplacian filters algorithm, the scheduling problem is much harder.^{2}
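To see concretely what the breadth-first and total-fusion extremes trade off, here’s a plain-Python sketch of the two schedules for the 3x3 box blur. (This is my own illustrative code, not Halide, and clamping at the image edges is an assumption I’m making to keep it self-contained.) Both compute exactly the same pixels; they differ in whether the horizontal blur is stored in a full intermediate buffer or recomputed on demand.

```python
# Breadth-first ("compute all of bh, then all of bv"): maximal
# parallelism within each stage, at the cost of a full bh buffer.
def blur_breadth_first(img):
    h, w = len(img), len(img[0])
    bh = [[(img[y][max(x - 1, 0)] + img[y][x] + img[y][min(x + 1, w - 1)]) / 3
           for x in range(w)] for y in range(h)]
    return [[(bh[max(y - 1, 0)][x] + bh[y][x] + bh[min(y + 1, h - 1)][x]) / 3
             for x in range(w)] for y in range(h)]

# Total fusion ("compute each bh value just when bv needs it"): great
# locality and no intermediate buffer, but each bh value is recomputed
# up to three times, once per bv row that needs it.
def blur_fused(img):
    h, w = len(img), len(img[0])
    def bh(x, y):
        return (img[y][max(x - 1, 0)] + img[y][x] + img[y][min(x + 1, w - 1)]) / 3
    return [[(bh(x, max(y - 1, 0)) + bh(x, y) + bh(x, min(y + 1, h - 1))) / 3
             for x in range(w)] for y in range(h)]

img = [[float((3 * x + y) % 7) for x in range(6)] for y in range(5)]
assert blur_breadth_first(img) == blur_fused(img)  # same result either way
```

The point of Halide is exactly that this choice (and everything in between) shouldn’t require rewriting the algorithm, only the schedule.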
To me, perhaps the most interesting contribution of the PLDI ‘13 Halide paper — perhaps more interesting than the design of Halide itself — is its model of the three-way tradeoff space among parallelism, locality, and avoiding redundancy that arises when scheduling image processing pipelines.
That said, I think the design of Halide itself is also pretty awesome! The scheduling-language part of Halide allows one to express programs that fall anywhere we like in that three-dimensional space of choices. For instance, for the 3x3 box blur algorithm above, we could express a breadth-first schedule with the Halide scheduling directive `bh.compute_root()`, which would compute all of `bh` before moving on to `bv`. For the “total fusion” approach, we could instead write `bh.compute_at(bv, y)`, which would compute each value from `bh` only when it is needed for computing `bv`. But you don’t have to try each of these possibilities yourself by hand; there’s an autotuner that, given an algorithm, can come up with “high-quality” (although not necessarily optimal) schedules for it, starting from a few heuristics and using stochastic search.
There’s been lots of work on attempts to free programmers from having to schedule parallelizable tasks manually. We all wish we could just write an algorithm in a high-level way and have the computer do the work of figuring out how to parallelize it on whatever hardware resources are at hand. But decades of research haven’t yet produced a general-purpose parallelizing compiler; automatic parallelization only works in a few limited cases. The cool thing about Halide is that it makes the scheduling language explicit, instead of burying scheduling choices somewhere deep inside a too-clever-by-half compiler. If you know how you want your algorithm to be scheduled, you can write both the algorithm and the schedule yourself in Halide. If someone else wants to schedule the algorithm differently (say, to run on a different device), they can write their own schedule without touching the algorithm you wrote. And if you don’t know how you want your algorithm to be scheduled, you can just write the algorithm, let the Halide autotuner loose on it, and (after many hours, perhaps) have a pretty good schedule automatically generated for you. The advantage of separating algorithm from schedule isn’t just prettier, cleaner code; it’s the ability to efficiently explore the vast space of possible schedules.
A follow-up paper, “Distributed Halide”, appeared last year at PPoPP.^{3} This work is about extending Halide for distributed execution, making it possible to take on image processing tasks that are too big for one machine.
Running a bunch of independent tasks on a cluster for, say, processing individual frames of video is one thing, but what we’re talking about here are single images that are too big to process on one machine. Is it ever really necessary to distribute the processing of a single image? Sometimes! As the paper explains, we’re entering the age of terapixel images.^{4} For example, the Microsoft Research Terapixel project involved stitching together hundreds of 14,000^{2}-pixel images, then using a global solver to remove seams, resulting in a single terapixel-sized image of the night sky. Wikipedia has a list of other very large images, such as this 846-gigapixel panoramic photo of Kuala Lumpur. So it’s not out of the question that a single image would be too large to process on a single machine. That’s the problem that the Distributed Halide work addresses.
Distributed Halide extends Halide by adding a couple of new distributed scheduling directives to the scheduling language, and it also adds new MPI code generation to the Halide compiler. The new additions to the scheduling language are `distribute()` and `compute_rank()`. `distribute()` applies to dimensions of individual pipeline stages and allows for distributing individual stages or not. `compute_rank()` is like `compute_root()`, but for a particular MPI rank (“rank” being MPI-speak for “process ID”).
Adding distribution to the mix adds yet another dimension to the scheduling tradeoff space: communication between nodes. We can save time and energy by minimizing communication, but less communication generally means more redundant recomputation. Let’s suppose we want to run our two-stage 3x3 box blur on a really large image. The schedule
```
bh.compute_at(bv, y);
bv.distribute(y);
```
results in no communication between compute nodes, but lots of redundant recomputation at different nodes. On the other hand, the schedule
```
bh.compute_root().distribute(y);
bv.distribute(y);
```
results in lots of communication, but no redundant recomputation: all pixels of `bh` are computed only once.
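As a back-of-the-envelope way to think about this tradeoff (my own rough accounting, not a model from the paper): if the image is split into P contiguous row blocks and each block needs one halo row of `bh` from each internal neighbor, then the two schedules pay for the same 2·(P − 1) boundary rows, just in different currencies — redundant recomputation in one case, network transfer in the other.

```python
# Toy model (illustrative assumptions: P contiguous row blocks, and the
# 3x3 blur needs one halo row of bh across each internal boundary).
def halo_rows(P):
    """Boundary bh rows needed in total: one in each direction across
    each of the P - 1 internal block boundaries."""
    return 2 * (P - 1)

def cheaper_schedule(P, recompute_cost, transfer_cost):
    """Under this toy model, the same number of rows is either
    recomputed locally or shipped over the network, so the choice
    comes down to the relative per-row costs."""
    local = halo_rows(P) * recompute_cost   # redundant recomputation
    remote = halo_rows(P) * transfer_cost   # MPI communication
    return "recompute" if local <= remote else "communicate"

assert halo_rows(8) == 14
assert cheaper_schedule(8, 1.0, 5.0) == "recompute"
assert cheaper_schedule(8, 5.0, 1.0) == "communicate"
```

In the real system the costs are nothing like uniform per row, and `distribute()` interacts with every other directive in the schedule, which is part of why automatic schedule generation gets so much harder in the distributed setting.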
For most Halide programs, the optimal schedule is probably somewhere in the middle of the space of possible schedules. Increasing the dimensionality of the tradeoff space by adding distributed scheduling directives unfortunately means that automatically generating schedules gets a lot harder, and so the Distributed Halide work covers handwritten schedules only. Figuring out how to handle schedule synthesis for this larger space of possible schedules still seems to be an open question.
Something else I’m curious about is the idea of synthesizing Halide schedules from incomplete sketches of schedules. At this point, Halide auto-scheduling is all-or-nothing: you either write a schedule yourself by hand, or you give your algorithm to the autotuner to do all the work. But what if we could combine human and machine effort? Ravi Teja Mullapudi and co.’s 2016 paper on improved automatic scheduling for Halide pipelines concluded with the sentence, “We are interested in exploring interfaces for the auto-scheduler to accept partially written schedules by experts, and then fill in the missing details”, and it’s here that I wonder if sketch-based synthesis could be useful.
This isn’t necessarily what Halide syntax actually looks like; for that, see the tutorials. ↩
The local-Laplacian-filters pipeline apparently has 99 stages, not all of which are obvious to me from looking at Figure 1 in the PLDI ‘13 paper. A while back, after spending some time staring at the figure and trying to get the stages to add up to 99, I eventually went to the halide-dev mailing list and asked about it, and someone there was able to account for 80 of the alleged 99 stages. The point is, it’s a lot of stages. This example also demonstrates that despite its name, an image processing “pipeline” is not necessarily linear! It’s really a graph of interdependent stages, which further complicates the scheduling task. ↩
There are more details in first author Tyler Denniston’s 2016 master’s thesis. ↩
A terapixel is 10^{12} pixels, or 1000 gigapixels, or 1,000,000 megapixels. By comparison, an iPhone 7 can take 12-megapixel photos. ↩