!!Con (pronounced “bang bang con”) West is a two-day conference of ten-minute talks about the joy, excitement, and surprise of computing, and the west-coast offshoot of !!Con, the annual independent and volunteer-run conference that I co-founded in New York in 2014. !!Con is a radically eclectic computing conference, and we’re excited to be bringing a version of it to the west coast for the first time! The conference will be held at UC Santa Cruz on February 23 and 24, 2019, and our call for talk proposals is open until tomorrow. We’ve already gotten nearly a hundred talk proposals, but we want more! We want yours!
When I say “radically eclectic”, what I mean is that at !!Con, you can expect to catch talks on lossy (!) text compression, traceroute as a storytelling medium, glowing animatronic mushrooms, what happens when you store your data in kernel space, and how to program knitting machines. Or perhaps on assembling ceramic artifacts using computer vision, synthesizing video and turning it into music, building a map to aggregate real-time flood data, how to design a compelling game with just one big red button, and why the problem of how to distribute “Elmo’s World” segments onto a series of video releases is NP-complete. I always tell prospective talk submitters that if they have something to talk about that brings them joy but is just a bit too strange for a typical tech conference, it might be perfect for !!Con.
!!Con West will undoubtedly have a different flavor than past !!Cons have had — after all, instead of being in the middle of Manhattan or Brooklyn, we’ll be in the woods — but we’re hoping to preserve the essential parts of the !!Con aesthetic, while enthusiastically embracing the physical and human geography of the west coast, Santa Cruz, and UC Santa Cruz. Just like every past !!Con, we’ll anonymize talk proposals before reviewing them to help eliminate implicit bias in the review process. We’re also offering travel funding to speakers who request it, and a $256 honorarium to every speaker.
If all that sounds good to you, please consider submitting a talk proposal to be part of the inaugural !!Con West!
I guess they heard that @palvaro is lecturing today 🔥 pic.twitter.com/BzudbcvCKs
— Lindsey Kuper (@lindsey) September 28, 2018
So I wasn’t sure if I would be able to measure up to students’ high expectations for his class. But it seemed to go well! I decided I wanted to talk about resolving conflicts between replicas in distributed systems. This was jumping ahead a bit, since Peter hadn’t really started to talk about replication in the course yet, but the students were engaged and asked very good questions.
It just so happened that the day I guest-lectured was the day that a student started making videos of the class, and those videos are now up on YouTube for anyone to watch!
I started out by talking a bit about why we do replication in the first place and how conflicts between replicas arise. Then I talked about application-specific (or “content-aware”, if you like) strategies for resolving those conflicts, using the example of a replicated shopping cart. The class had already covered partial orders in the context of Lamport’s “happens-before” relation, so I was well situated to introduce a little more math: upper bounds, least upper bounds, and join-semilattices.
A lot of partial orders are join-semilattices, but some aren’t! So we talked about that, and I brought it back to distributed systems by making the informal claim that, if operations that affect replicas’ states can be thought of as elements of a join-semilattice, then we have a “natural” way of resolving conflicts between replicas.
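To make that claim a little more concrete, here is a minimal Python sketch of the replicated shopping cart example, where each replica’s state is a set of items and merging is set union, the join of the semilattice of sets ordered by inclusion. (The class and method names are mine, for illustration only; this isn’t code from any particular CRDT library.)

```python
# A toy replicated shopping cart. Each replica's state is a set of items,
# and the states form a join-semilattice: sets ordered by inclusion, with
# set union as the join (least upper bound).

class CartReplica:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Take the join of the two states. Union is commutative,
        # associative, and idempotent, so replicas converge to the same
        # state regardless of the order (or repetition) of merges.
        self.items |= other.items

# Two replicas receive different updates...
a, b = CartReplica(), CartReplica()
a.add("milk"); a.add("eggs")
b.add("eggs"); b.add("bread")

# ...and merging in either direction resolves the conflict the same way.
a.merge(b)
b.merge(a)
assert a.items == b.items == {"milk", "eggs", "bread"}
```

The “natural” conflict resolution here is exactly the join: neither replica’s update is lost, and no coordination is needed to decide a winner.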
Afterward, one student delighted me by asking where they could read more about the topic!^{1} And a few days later, another student shared the beautiful sketchnotes they had made:
@tos # 128 Oct. 12th 2018 Lecture 5(6). (thread) pic.twitter.com/dmbkG8mqfO
— ✨romeo (@romeoexists) October 15, 2018
Aren’t these students amazing?! I’m stoked about teaching my own version of this course in the spring.
I suggested that they look up conflict-free replicated data types. I didn’t have the presence of mind to suggest it at the time, but this blog post from Michael Arntzenius is also good. ↩
Here’s an “official” course description:
This graduate seminar course explores the theory and practice of distributed programming from a programming-languages perspective. We will focus on programming models, language-level abstractions, and verification techniques that attempt to tame the many complexities of distributed systems: inevitable failures of the underlying hardware or network; communication latency resulting from the distance between nodes; the challenge of scaling to handle ever-larger amounts of work; and more. Most of the work in the course will consist of reading classic and recent papers from the academic literature, writing short responses to the readings, and discussing them in class. Furthermore, every participant in the course will contribute to a public group blog where we will share what we learn with a broader audience.
There are a lot of reasonable ways to approach a seminar course on languages and abstractions for distributed programming. We could spend the whole ten weeks on process calculi and still only make a small dent in the literature. Or, we could spend ten weeks on large-scale distributed data processing and still only make a small dent in the literature.
In this particular course, we will be focusing a lot of attention on consistency models and language-based approaches to specifying, implementing, and verifying them. (Of course, we will only make a small dent in the literature.)
As a grad student, I always dreaded having to do course projects. In an ideal world, these projects were supposed to dovetail nicely with one’s “real” research, or they were supposed to morph into “real” research within three months by some mysterious alchemical process involving lots of luck and suffering. In practice, they usually ended up taking time away from real research, and they always ended up being hastily implemented and shoddily written up. Anyone who tried to get the code to run a few months later would be in for a bad time.
So, this course will attempt something different. Instead of a traditional course project, each student in the class will write (and illustrate!) two posts for a public group blog aimed at a general technical audience. The goal is to create an artifact that will outlive the class and be valuable to the broader community.
Will this be less work than a traditional course project? No. A blog post requires substantial work (reading, writing, editing, programming, debugging, thinking, …), and I’m asking students to expect each post to take about thirty hours of focused work. Furthermore, every student in the class will serve as an editor for two posts other than their own. The job of the editor is to help the writer do their best work — by reading drafts, asking clarifying questions, spotting mistakes and rough spots, and giving constructive feedback. This will take another five hours or so per post.
Altogether, it’s a pretty big time commitment, but one that I hope students will find worthwhile.
So! If you’re a computer science grad student at UC Santa Cruz, check out the course overview and draft reading list, and consider signing up to take my class! And if you’re not a computer science grad student at UC Santa Cruz, but you think you want to be, then get in touch — I might be able to help with that.
Over the last few months, I’ve been working my way through Data 8X, the online version of Berkeley’s Data 8 course, after seeing Joe Hellerstein tweet about it a while back. At first, I was interested mostly for pedagogical reasons, but I can now admit to actually having learned something about data science, too. The course is organized in three parts, and I’ve finished the first two parts (check out my cheevos!) and am working my way through the third part, which focuses on prediction and machine learning.
A few days ago, I came to the part of the course that discusses the equation of the regression line, which is, well, the line used in linear regression. Given two variables $x$ and $y$, which we can visualize as a two-dimensional scatter plot of a bunch of points, the idea of linear regression is to find the straight line that best fits those points. Once we have such a line, we can use it to predict the value of $y$, given some new value of $x$. The regression line represents the linear function that minimizes the error of those predictions for all the $(x, y)$ pairs that we know about.^{1}
The course gives a delightfully simple equation for the regression line: it’s $y = r \times x$, where $r$ is the correlation coefficient of $x$ and $y$, and $x$ and $y$ are measured in standard units. I had been familiar with the concept of linear regression before taking the course, but standard units and the correlation coefficient were new to me. As it turns out, working with standard units and the correlation coefficient makes linear regression easy!
As an example, let’s use some Small Data that I have handy: the measurements of my daughter’s height (or length, if you like) and weight that were taken at doctor visits during the first year of her life. The first measurements were taken on July 28, 2017, a few days after she was born, with further measurements taken at the follow-up appointments at two weeks, one month, two months, four months, six months, nine months, and twelve months.^{2}
| Date       | Height (cm) | Weight (kg) |
|------------|-------------|-------------|
| 2017-07-28 | 53.3        | 4.204       |
| 2017-08-07 | 54.6        | 4.65        |
| 2017-08-25 | 55.9        | 5.425       |
| 2017-09-25 | 61          | 6.41        |
| 2017-11-28 | 63.5        | 7.985       |
| 2018-01-26 | 67.3        | 9.125       |
| 2018-04-27 | 71.1        | 10.39       |
| 2018-07-30 | 74.9        | 10.785      |
Leaving aside the date column for now, we can plot weight as a function of height:
Those dots look awfully close to being a straight line! It seems like linear regression might be a good choice for modeling the relationship between Sylvia’s height and weight during this time period. But before we get to that, let’s talk about standard units.
Consider a data point like, say, 61 centimeters, which was Sylvia’s height on September 25, 2017. Expressing this data point in units of centimeters is useful: for one thing, it’s not too hard for us, as humans, to imagine more or less how long that is. If we’ve been around a lot of babies, we might even know enough to say, “Wow, that’s a big two-month-old.”
In other ways, though, it’s perhaps less useful. If we just see the data point “61 centimeters” by itself, we don’t know anything about how it relates to the rest of the numbers in the height column: is it shorter, longer, or about average? We’d have to see the rest of the data set in order to answer that question. But it turns out that there is a way to represent individual data points that will let us answer such a question without having to look at the rest of the data set! That representation is standard units.^{3}
To convert a data point to standard units, you need to know three things: its value in original units (centimeters, petaflops, whatever it is you’ve got), and the mean and the standard deviation of the data set it came from. Its value in standard units is how many standard deviations above the mean it is.^{4} So, if a value is exactly average in the data set it came from, then regardless of what “average” means for that data set, when converted to standard units, it’s zero. If it’s one standard deviation above average, then in standard units, it’s one. If it’s below average, then in standard units it will be negative. You get the idea.
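As a quick sketch, here’s what that conversion looks like in Python, applied to the height column from the table above. (The function name is mine; I’m also using the Data 8 convention that “standard deviation” means the population SD, i.e., the root mean square of deviations from average.)

```python
import statistics

heights = [53.3, 54.6, 55.9, 61.0, 63.5, 67.3, 71.1, 74.9]  # cm

mean = statistics.fmean(heights)
sd = statistics.pstdev(heights)  # population SD: root mean square of deviations

def standard_units(value, mean, sd):
    """How many standard deviations above the mean `value` is."""
    return (value - mean) / sd

# Sylvia's height at the two-month checkup, converted to standard units:
print(round(standard_units(61.0, mean, sd), 6))  # about -0.228116
```

A value exactly at the mean comes out as 0, and a value one SD above the mean comes out as 1, just as described above.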
Here’s that same table again, but with the heights and weights converted to standard units.
| Date       | Height (standard units) | Weight (standard units) |
|------------|-------------------------|-------------------------|
| 2017-07-28 | -1.26135                | -1.3158                 |
| 2017-08-07 | -1.08691                | -1.13054                |
| 2017-08-25 | -0.912464               | -0.808628               |
| 2017-09-25 | -0.228116               | -0.399485               |
| 2017-11-28 | 0.107349                | 0.254728                |
| 2018-01-26 | 0.617255                | 0.728253                |
| 2018-04-27 | 1.12716                 | 1.2537                  |
| 2018-07-30 | 1.63707                 | 1.41777                 |
Height and weight are now what a statistician would call standardized variables. Sylvia’s height on September 25, 2017 was -0.228116 standard units. From that number, we can tell that her height that day was just a bit below average for this data set; in particular, it was around 0.2 standard deviations below average. We did have to look at the rest of the data set to be able to come up with the number -0.228116 in the first place, but the information we got from doing that is now implicit in the value itself. So, if you asked me how tall Sylvia was at her two-month checkup and I told you, “Oh, about negative point two standard units,” you’ll know that she was taller at other, presumably later, times.
Admittedly, that information isn’t all that interesting to have. After all, we expect most babies to get taller as time goes by. It might be more interesting to use standard units for a different data set, such as, say, the heights of a large number of two-month-old babies. Then, knowing the height of any one of those babies in standard units would tell us how its height compared to the rest of the babies in the data set. But, as we’ll see in a moment, standard units are good for more than just quickly seeing how a particular data point compares to the average.
To start with, let’s see what happens if we plot weight as a function of height, like we did before, but now with the data in standard units:
This scatter plot looks pretty familiar. In fact, the data points look exactly the same as they did above, when we were working with centimeters and kilograms! Indeed, the data hasn’t changed: all that’s changed is the axes.
We can see that the origin, $(0, 0)$, is now in the middle of the plot. Because we’re using standard units, $(0, 0)$ is the “point of averages”, or the point where both variables are at their average values. You may already know that in linear regression, the regression line for a particular data set always passes through the “point of averages” for that data set. In other words, if you plug the exact average value of $x$ into the regression equation, the $y$ you’ll get will be the exact average value of $y$. So, since 0 means “average” in standard units, when we’re working in standard units we know right away that the regression line always goes through $(0, 0)$. How convenient! That’s one reason why that $y = r \times x$ equation up there is so simple. There’s no need to specify a $y$-intercept, because when $x$ and $y$ are in standard units, the $y$-intercept of the regression line is always 0.
So, now all we need is $r$, and then we can draw the regression line. But just what is this $r$ thing?
The correlation coefficient, known as $r$, is a measure of the strength of the linear relationship between two variables. If we represent that relationship as a scatter plot, then $r$ is a measure of how close the points are to being on a straight line. A positive $r$ means there’s a direct linear relationship between the variables; the highest possible $r$ is 1, which means that all the points are on a straight line with positive slope. A negative $r$ means there’s an inverse linear relationship, with the lowest possible $r$ being $-1$, meaning that all the points are on a straight line with negative slope. An $r$ of 0 means there’s no linear relationship, which could mean that there’s no relationship at all, or that the two variables are related in some nonlinear way. Wikipedia has several examples of scatter plots with different values of $r$.
The correlation coefficient is fascinating! Two quantitative psychologists, Joseph Lee Rodgers and W. Alan Nicewander, wrote a well-known paper called “Thirteen Ways to Look at the Correlation Coefficient” that explores some of the many ways to think about $r$. For our purposes, we’re thinking of it as the slope of the regression line for the relationship between two variables in standard units. This way of interpreting $r$ happens to be number three on Rodgers and Nicewander’s list: “Correlation as Standardized Slope of the Regression Line”.
How do we compute $r$? For that, we can turn to number six on Rodgers and Nicewander’s list: “Correlation as the Mean Cross-Product of Standardized Variables”! Not only is $r$ the slope of the regression line for the relationship between two variables in standard units, it’s also the average cross-product of the values of two variables in standard units.
So, since we’ve already converted our height and weight to standard units, all we have to do to find the slope of the regression line is multiply all our paired-up values of $x$ and $y$ with each other, and then take the average of those products. And since we already know that the $y$-intercept is 0, then we can draw the regression line! That’s it! No dimly remembered calculus! No ugly iterative methods!
Let’s add a new column to our table from before, where we’ll write down the product of each pair of standardized height and weight values:
| Date       | Height (standard units) | Weight (standard units) | Product of standardized height and weight |
|------------|-------------------------|-------------------------|-------------------------------------------|
| 2017-07-28 | -1.26135                | -1.3158                 | 1.65968                                   |
| 2017-08-07 | -1.08691                | -1.13054                | 1.22879                                   |
| 2017-08-25 | -0.912464               | -0.808628               | 0.737844                                  |
| 2017-09-25 | -0.228116               | -0.399485               | 0.091129                                  |
| 2017-11-28 | 0.107349                | 0.254728                | 0.0273447                                 |
| 2018-01-26 | 0.617255                | 0.728253                | 0.449518                                  |
| 2018-04-27 | 1.12716                 | 1.2537                  | 1.41312                                   |
| 2018-07-30 | 1.63707                 | 1.41777                 | 2.32099                                   |
And now we can just take the average of the numbers in that last column to get $r$. Before we do that, though, let’s try to get some intuition for why it makes any sense to multiply height and weight together when they’re in standard units, and why the average of those products would give us $r$.
Well, we said before that a positive $r$ means that there’s a direct linear relationship between our two variables, and a negative $r$ means that there’s an inverse linear relationship. Looking at our table, we can see right away that all of the numbers in the last column are positive, because in each case, we got the number either by multiplying two negative numbers or two positive numbers. That happened because each of our values in the height column falls on the same side of average — that is, on the same side of zero — as the corresponding value in the weight column.
If all our above-average heights had corresponding below-average weights, and vice versa, then the products in the last column would all be negative, and so $r$, the average of the products, would also be negative, indicating an inverse linear relationship. And if some of the above-average heights corresponded to above-average weights and some corresponded to below-average ones, and the same for the below-average heights, then we’d have a mix of positive and negative numbers in the last column, and so the average of that column would presumably be pretty close to 0 — indicating a weak linear relationship or none at all. Hopefully, this informal line of reasoning provides some intuition about why taking the average cross-product of two standardized variables gives you $r$.^{5}
So, what do we get when we take the average of the numbers in the last column? Their average turns out to be 0.9910523777994954. Wow! That’s really close to 1. Let’s plot the regression line and see what it looks like.
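The whole recipe fits in a few lines of Python. Here’s a sketch that starts from the original measurements and recovers the same $r$ (the helper name `standard_units` is mine, and I’m again using the population SD, per the Data 8 definition):

```python
import statistics

heights = [53.3, 54.6, 55.9, 61.0, 63.5, 67.3, 71.1, 74.9]         # cm
weights = [4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785]  # kg

def standard_units(xs):
    # Convert each value to "standard deviations above the mean".
    mean, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / sd for x in xs]

hs, ws = standard_units(heights), standard_units(weights)

# r is the mean cross-product of the standardized variables.
r = statistics.fmean(h * w for h, w in zip(hs, ws))
print(r)  # about 0.99105
```

In standard units, the regression line is then just `y = r * x`, with no intercept to compute.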
Since $r$ is so close to 1, the regression line is awfully close to $y = x$, a perfect linear correlation. Let’s plot that line, too.
The slope of the regression line is just a hair smaller than the slope of the line $y = x$. In fact, it’s hard to see the difference between them. Here’s a bigger version of the figure, with the ranges of the axes tweaked so that you can see that the two lines aren’t the same:
For an example of a relationship that isn’t quite as well modeled by a straight line, take a look at the Secret Bonus Content™ at the end of the accompanying notebook!
Standard units are useful for finding and reasoning about the regression line, but sooner or later, we may want to convert back to original units — centimeters and kilograms, in our case. After all, we might want to be able to ask questions like, “What does our model predict Sylvia’s weight will be when she’s 80 centimeters tall?”, and it would be inconvenient to have to convert 80 to standard units first. Also, we’d probably prefer to get the answer back in kilograms rather than standard units.
The Data 8 textbook gives the following formulas for the slope and intercept of the regression line in terms of $x$ and $y$ in original units (where $\mbox{SD}$ means “standard deviation”): $$ \mathbf{\mbox{slope of the regression line}} ~=~ r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x} $$
$$ \mathbf{\mbox{intercept of the regression line}} ~=~ \mbox{average of }y - \mbox{slope} \cdot \mbox{average of }x $$
So that would make the equation of the regression line
$$ y = (r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x}) \times x + (\mbox{average of }y - (r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x}) \cdot \mbox{average of }x) $$
which is a lot hairier than the nice $$y = r \times x$$ that you can use when working in standard units. Unfortunately, the book doesn’t actually explain how to derive the formulas for the slope and intercept. In the accompanying video at about 6m15s, Ani Adhikari shows them and then pauses for a full ten seconds before saying, “I would suggest that you don’t try to memorize this formula. Remember how the slope comes about, and then, you can, if necessary, derive the formula for the intercept, or simply — and this is our recommendation — you can look it up.” I’m not very good at following recommendations, so I ended up working it out on paper myself, and it was indeed pretty tedious, but in the end, I did manage to derive the formulas and come up with an explanation for why they make sense. I had originally planned to write about that here, but this post is pretty long already, so perhaps that’s a post for another time.
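For what it’s worth, the original-units formulas are easy to check numerically. Here’s a Python sketch that computes the slope and intercept for the height and weight data and uses them to answer the earlier 80-centimeter question (the variable and function names are mine, and the predicted value is just an illustration of using the line, not a claim about any actual baby):

```python
import statistics

heights = [53.3, 54.6, 55.9, 61.0, 63.5, 67.3, 71.1, 74.9]         # cm
weights = [4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785]  # kg

mean_h, sd_h = statistics.fmean(heights), statistics.pstdev(heights)
mean_w, sd_w = statistics.fmean(weights), statistics.pstdev(weights)

# r: the mean cross-product of the two variables in standard units.
r = statistics.fmean(
    ((h - mean_h) / sd_h) * ((w - mean_w) / sd_w)
    for h, w in zip(heights, weights)
)

slope = r * sd_w / sd_h              # r * (SD of y / SD of x)
intercept = mean_w - slope * mean_h  # average of y - slope * average of x

def predict_weight(height_cm):
    """Predicted weight (kg) at a given height (cm), per the regression line."""
    return slope * height_cm + intercept

print(predict_weight(80))  # about 12.9 kg
```

Note that plugging `mean_h` into `predict_weight` gives back exactly `mean_w`: the line passes through the point of averages, just as in the standard-units picture.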
This is not to say that the predictions will necessarily be good, because a straight line might not be a good fit for your data. But if you must have a straight line, then the regression line is the best straight line you can have. ↩
Incidentally, these were all “well-baby” checkups, not times when she was sick. She went to the doctor when she was sick a couple of times, too, but they didn’t measure her height at those appointments, only weight, so I haven’t included them here. ↩
Standard units are also known as standard scores, or z-scores. My impression is that the terminology “standard units” isn’t as, uh, standard as “z-scores” or “standard scores”, but I like it because I think it’s useful to think of standard units as just another unit of measurement. That’s the point of view that Brian M. Scott adopts in this fantastic answer to a question about standard units and standard deviations: “You might say that the standard deviation is a yardstick, and a z-score is a measurement expressed in terms of that yardstick.” ↩
I like the Data 8 definition of standard deviation: it’s the root mean square of deviations from average. Section 14.2 of the book has a detailed explanation. ↩
If my explanation here isn’t convincing enough, Ani Adhikari’s explanation in one of the Data 8X videos might be! ↩
In the conversation that followed Justine’s original tweet, a couple of people expressed concern that taking her advice would have hurt their GPA and therefore damaged their CS Ph.D. application prospects. In fact, one person (who I won’t link to, because I don’t want to pick on them personally) wrote that they would have wanted to minor in another field, but didn’t because “anything under a 4.0 is DOA for CS grad school”.
Is it? As so often, we can turn to Mor Harchol-Balter’s advice on applying to Ph.D. programs in CS for trustworthy information. Here’s what she says:
When applying to a Ph.D. program in CS, you’d like your grades in CS and Math and Engineering classes to be about 3.5 out of 4.0, as a rough guideline. It does not help you, in my opinion, to be closer to 4.0 as opposed to 3.5. It’s a much better idea to spend your time on research than on optimizing your GPA. At CMU the mean GPA of students admitted is over 3.8 (even though we don’t use grades as a criterion), however students have also been admitted with GPAs below 3.3, since research is what matters, not grades. A GPA of 4.0 alone with no research experience will not get you into any top CS program. […] At CMU we receive hundreds of applications each year from 4.0 GPA students who have never done research. These are all put into the high risk pile and are subsequently rejected.
As further evidence that you don’t need a 4.0 to go to grad school, I offer my transcript for my undergrad degree in computer science and music. I’m not holding myself up as some sort of ideal candidate for CS Ph.D. programs — far from it! — but I did manage to get into grad school with these less-than-stellar grades.
In the end, my undergrad GPA was just a hair over 3.5 — slightly higher than that in CS, and considerably lower in math. A few things to note about my undergrad experience:
Perhaps it’s worth noting that I never took undergrad courses on computer architecture, compilers, distributed systems, AI, machine learning, robotics, statistics, networking, security, cryptography, or graphics. I filled in a few of those gaps in grad school, but not even close to all of them. Some of them I’m only now beginning to fill in, and others I’m not sure I’ll ever get to. If you evaluated me according to what Matt Might says every CS major should know, I’d get a failing grade. (Thankfully, Matt also identifies being good at failing as a predictor of success in research.)
Anyway: in grad school, as in the rest of life after undergrad, grades aren’t very important. What matters in grad school is research, and applicants who demonstrate an understanding of what does and doesn’t matter will be at a great advantage over those who don’t. I might even go so far as to say that a 4.0 undergrad GPA could be something of a liability on grad school applications, because it could make an applicant appear to be someone who pursues a high GPA for its own sake as opposed to pursuing more important things and being willing to fail at them. In any case, the possibility of getting low grades is a bad reason not to study something you’re interested in.
I’m a rising junior at [university], majoring in computer science and math.
I have recently made up my mind about going to grad school for PL and am trying to get as much experience as possible right now. I will be attending [PL-related summer school] in two weeks. I will also be TA’ing [university]’s Programming Languages course and working with a professor on a PL-related research project next term.
My question is, is there any reason why I might want not to apply for PLMW at ICFP 2018 this September, and instead, apply next year? I am guessing I can only attend once, but I am not sure about that either. What are some criteria that I should consider when choosing a specific PLMW? Can you help me make a decision?
“PLMW” refers to the Programming Languages Mentoring Workshop, a one-day workshop targeted at people who are in the beginning stages of a programming languages Ph.D. program, and at those who, like my correspondent, are looking into doing a Ph.D. in PL. The workshop is held several times a year in conjunction with POPL, PLDI, SPLASH, and ICFP, four of the biggest venues for programming languages research. PLMW has been running since 2012 (I attended the first two PLMWs when I was in grad school!), and it’s now become an established part of the PL grad school (or pre-grad-school) experience. There’s usually an all-star lineup of speakers, covering current topics in PL research as well as advice on how to navigate life as a Ph.D. student. There’s usually also a panel of “young”^{1} researchers talking about their Ph.D. experience; I got to be on one of these panels at POPL 2016.
Because the workshop is always held at a major PL conference on the day before the main conference program begins, it’s a great way for those who are new to the PL conference scene to ease into things — both in the sense that it introduces some technical material that will be useful for understanding the rest of the conference, and in the sense that it can help first-time attendees feel more connected socially. Plus, PLMW is more than a workshop — it’s also a scholarship program that will pay your way to attend the rest of the conference, including funds toward travel, lodging, and registration.
So, should you apply to attend PLMW? Yes, you should. But that brings us to my correspondent’s question: which PLMW should you go to?
First of all, you can go to more than one PLMW, but you can probably only get funding to go to one. Notice, for instance, that one of the criteria for scholarship eligibility for PLMW at ICFP ‘18 is “have not been funded by a prior PLMW”. (I got funded to go to PLMW twice when I was a grad student, but in those days, PLMW was still new, and the funding policy might have been tightened up since then.)
When should you go? If, like my correspondent, it’s your junior year of undergrad, you’ll probably be one of the youngest people at the conference, but that’s not necessarily a problem. It’ll still be an excellent learning opportunity. If you plan to apply to grad school during your senior year, consider going to PLMW that fall (or before), so you can have your PLMW experience to draw on when you’re applying.
Another criterion to consider is the conference at which the PLMW is held: POPL, PLDI, SPLASH, or ICFP. The content of PLMW itself might not be drastically different from one to the next, but the content of the main program of the conference will be somewhat different. At ICFP, things will be more geared toward functional programming; POPL will have more theory people; PLDI will have more hardcore language implementation people; and SPLASH will have a mix of systems, PL, and applications. But don’t make too much of these differences, and please liberally fill in air quotes around words like “functional programming”, “theory”, and “systems” that appear in the previous sentence. There’s actually a lot of overlap among all of these conferences, and all of them are good. It’s normal for a PL researcher to be a regular at more than one of them.
Consider location, too. For instance, ICFP is in St. Louis this fall and co-located with Strange Loop, so if you want to go to Strange Loop (or already have plans to go), then ICFP would be a good pick for your first PLMW. (If so, then hurry — scholarship applications are due tomorrow!) On the other hand, SPLASH is being held in Boston this fall, and Boston happens to be home to a lot of great places to study PL, so if you want to (for example) visit Northeastern or MIT while you’re in town for the conference, then PLMW at SPLASH 2018 might be a good pick. If you want to wait a little longer, there will certainly be a PLMW at POPL 2019, which is being held in Portugal — although if you’re coming from the US and you have a visa/citizenship situation that would complicate re-entry into the US, be mindful of that before you make your plans.
Good luck, and I look forward to seeing you at a SIGPLAN conference soon!
I don’t like using the word “young” for these things; “junior” or “emerging” are a bit better. ↩
After defending my Ph.D. in 2014, I joined Intel Labs as a research scientist. While at Intel, I’ve gotten to work on some really cool stuff with great collaborators, but for a while now, I’ve been contemplating returning to academia, and last fall I decided to quietly start applying for a few faculty positions.
Mine was a relatively small-scale job search, and I didn’t tell too many people that this was what I was doing. I didn’t necessarily expect things to work out. But they did! I am absolutely thrilled to share that two weeks ago, I accepted an offer to join the Baskin School of Engineering at UC Santa Cruz as a tenure-track assistant professor of computer science. I officially start in July, and I’ll be teaching starting in the fall. I’ll be joining an amazing group of faculty, including Peter Alvaro, Faisal Nawab, Owen Arden, Lise Getoor, Cormac Flanagan, Carlos Maltzahn, and many more.
I’m excited to be a Banana Slug, and it seems like my daughter is pretty stoked, too.
When I was working on my application materials back last November, my friend Neel Krishnaswami advised me to make up a narrative that would connect all the work that I had done, from LVars to ParallelAccelerator to neural network verification. At first, I despaired of ever being able to come up with a coherent story to connect the different things I’d done. But the more I thought about it, the more I realized that not only did I have a story, I actually believed in the story I was telling! As I wrote back in February:
If there’s one theme that ties together all the stuff I’ve been working on or studying in the last several years, it’s that a high-level representation of programmer intent (commutative operations on lattice elements in the case of LVars; aggregate array operations in the case of ParallelAccelerator; the ReLU predicate added to the theory of linear real arithmetic in the case of Reluplex^{1}) enables optimizations or smart scheduling choices that would be difficult or impossible otherwise.
I structured my job talk around this theme: that choosing the right highlevel abstractions to express a software system not only does not compromise high performance, but actually enables it.
The folk wisdom in computing is that in order to be efficient, things have to be low-level and “close to the metal”. But it turns out that choosing the right high-level domain-specific abstractions can actually be the key to unlocking efficiency, whether we’re talking about lazy SMT solving, or about high-performance DSLs. To me, this is one of the most beautiful ideas in computing.
I thought about adding some more details here about my plans for future research, but it would make this post way too long. So for now, I’ll just say: if you’re interested in doing a Ph.D. with me at UCSC to explore interesting ideas in the intersection of programming languages, distributed systems, and verification, you should get in touch!
As anyone who’s ever read this blog already knows, I’m incapable of shutting up about !!Con, the conference of lightning talks about the joy, excitement, and surprise of computing that I co-founded with a group of friends in 2014. (Our fifth annual conference in New York just wrapped up today; videos will be available soon!) !!Con has been such a success that the demand for what we’re doing is much greater than we’re able to meet. This is a great problem to have, but it means that we’re constantly having to disappoint people. There’s only so much that our small team of volunteers can do.
The long-term solution to our inability to meet demand is for there to be a lot more conferences inspired by !!Con, in a lot more places, organized by a lot more people. A new generation of conference organizers will need to step up. But in fact, that’s already happening — EnthusiastiCon, Hello, Con!, and StarCon are three examples. It’s incredibly exciting to me to see all these conferences thrive.
Somehow, though, as supply has increased, demand hasn’t lessened. If anything, the demand for !!Con has only increased as more and more people experience the magic and fun of this conference format. This year, we got nearly 300 talk proposals, the most we’ve ever had. It’s clear that we need to expand, not only to better serve our existing audience, but also to better serve those who can’t easily travel to New York for the weekend — a group that includes me, especially now that I’m a parent.
So, next year we’re expanding !!Con to the west coast! We’ll be holding our first conference outside of New York, !!Con West, at UC Santa Cruz in early 2019.
How did this come about? As part of negotiating my offer from UCSC, I asked the department to provide financial and logistical support for bringing a version of !!Con to campus, and they enthusiastically agreed to do so. (In fact, my department chair and other future colleagues already knew about !!Con, because I’d written about it in my application materials.) As I see it, !!Con isn’t just a fun side project, but actually an integral part of my outreach mission as a scientist and educator, and I can’t tell you how much it means to me that UCSC is putting their money where their mouth is and supporting that mission!
Finally, we have two goals for !!Con West. First, of course, we want to put on another great conference in the !!Con tradition. Second, though, we want to incubate a new generation of conference organizers. By the time we’re done, every member of the !!Con West organizing team will have the skills and experience to go out and launch their own conferences, if they want to. If that sounds like something you want to do, fill out this form to apply to become a !!Con West organizer. Join us!
I want to point out that I can’t take any credit for the idea of extending the theory of linear real arithmetic with a ReLU predicate – that was all Guy Katz, with the assistance of the rest of the Reluplex team at Stanford. But when I meditated on why I liked the idea (and, more broadly, the idea of creating high-performance domain-specific solvers) so much, I realized that it fit right into the rest of the story that I was trying to tell. ↩
At last year’s DSLDI, we heard from Ron Garcia on gradual typing, Nimo Ni on the Penrose system for declaratively creating mathematical diagrams, Xiangqi Li on domainspecific debugging in Racket, and lots more; then we went out for Italian food.
Ron Garcia (@rg9119) gives a tutorial on gradual type consistency as part of his #dsldi17 keynote talk. pic.twitter.com/TMp4xlzbF7
— Lindsey Kuper (@lindsey) October 22, 2017
Nimo Ni describes a pair of languages for creating mathematical diagrams. #dsldi17 pic.twitter.com/XXWMbEFbRl
— Lindsey Kuper (@lindsey) October 22, 2017
Xiangqi Li tells us about domainspecific debugging in Racket. #dsldi17 pic.twitter.com/hQyCzZN9Wv
— Lindsey Kuper (@lindsey) October 22, 2017
If all this sounds like a good time to you, you should join us for DSLDI 2018! I’m excited to be involved with organizing DSLDI again, particularly since Sam is in charge this year. We’re once again calling for short talk proposals, due August 17, and information about this year’s program committee should be available soon.
In addition to the letter grades that each reviewer gives to each proposal, this year I also tried to write a sentence or two for each talk I reviewed, explaining why I gave it the grade I did. As I wrote these explanations, a few themes emerged in the kinds of talks I was giving less-than-enthusiastic reviews to. I want to talk about what these were and offer advice on how to avoid them, in the hope that someone will find my advice useful when they’re submitting talk proposals to future !!Cons, or other conferences like it. (For conferences that aren’t much like !!Con, keep in mind that this advice may not do much good, and may in fact be misleading.)
!!Con talks are meant to be ten minutes long.^{1} On our talk proposal submission form, one of the things we ask prospective speakers for is a “timeline”, or a summary of how the speaker plans to use their ten minutes of stage time. The timeline helps us make sure that the speaker understands the lightning talk format, and that they’ve put some thought into making sure that their talk will fit into the allotted time.
Most other calls for talk proposals don’t ask for a timeline, so even people who are frequent conference speakers don’t always know what we’re asking for here. For whatever reason, every year we get some talk proposals that completely fail to provide any kind of reasonable timeline. Some people use the timeline as a place to continue their abstract. I’ve even seen people just copy-paste their abstract into the timeline field on the form. Other submitters seem to think that by “timeline”, we’re just asking how long the talk will be — for instance, we got a couple this year that just had “10 minutes” listed as the timeline. (We’re not asking how long your talk will be — we already know how long it’s supposed to be! We’re asking how you’ll use the time.) And even more absurd than “10 minutes” was the one we got this year that said “20 minutes”! That’s pretty clear evidence that the talk submitter doesn’t know what !!Con is and didn’t bother to read what the call for proposals was asking for.
Timelines aren’t legally binding agreements, of course. We expect that most speakers will rearrange things a little, or even a lot, between when their talk proposal is due and when the talk actually happens. The important thing is for the prospective speaker to show us that they’ve come up with a plausible-sounding plan for a talk they could give in ten minutes. Whether it bears much resemblance to the actual talk they end up giving is another question. For more advice on timelines, see my post “How to write a timeline for a !!Con talk proposal”, which has several examples.
We have a policy at !!Con that talk titles have to include at least one exclamation point.^{2} However, an exclamation point isn’t sufficient to make a title exciting. Consider the following four titles:
One of these things is not like the others. The first three are all titles of talks that were actually presented at !!Con (by Kiran Bhattaram in 2015, wilkie in 2016, and Jan Mitsuko Cash in 2017, respectively). The fourth is representative of a class of talk proposals that get rejected. I’m not trying to pick on Spark in particular here, but “Introduction to Data Analytics with Apache Spark!” sort of looks as though the submitter just took the title of a talk they’d given at another, more traditional conference, and stuck on an exclamation point on the end.
Is a bad title enough to get a talk proposal rejected? Not necessarily. Titles are easy to change, and in fact, we do sometimes ask people to change the title of a talk after it’s been accepted. But since we only have room to accept about thirty submissions, a boring title can be enough to push a talk into the “reject” category. So, instead of giving us a reason to reject your otherwise amazing talk, come up with an exciting title that shows us how excited you are about giving the talk!
Another thing to keep in mind regarding titles is that at !!Con, your talk’s title really doesn’t have to serve as a standalone summary of the talk. At bigger conferences, where there are a hundred talks and lots of tracks to choose from (and where a talk is usually more than ten minutes long, so choosing any given talk is a larger commitment), attendees often scan through titles to decide which talks to attend, and so some speakers will try to pack their talk titles with descriptive keywords that they hope will attract attendees. But !!Con is a single-track conference where there’s time for attendees to read all the talk abstracts, and every talk is well-attended, so a title like “A Shot in the Dark!” works well even though it doesn’t explain what the talk will be about. In fact, titles that don’t reveal all the details are a great way to pique the audience’s curiosity.
We get a lot of talk proposals that purport to show the audience how to achieve a practical goal — get more done at work! write more maintainable code! ship a product! — or that purport to teach a specific skill or a way of thinking about programming. These kinds of practical, be-more-effective-at-work talks go over well at many conferences, but we find that most of the time, they don’t work so well at !!Con.
Why not? I think it’s because most people don’t come to !!Con with the goal of learning practical skills. Rather, they come to have fun and ignite, or reignite, their excitement and curiosity about computing. It’s not that people don’t pick up useful knowledge to improve their craft from talks at !!Con — to the contrary! — but it’s a question of how the knowledge is presented. My advice to talk submitters is to aim for whimsical rather than practical. Tell a compelling and engaging story of how a project got done despite obstacles (like the story of how Lisa Ballard and Ariel Waldman built spaceprob.es using an undocumented NASA API!), or share an interesting concept for its own sake (like locality-sensitive hashing!).
Some !!Con talks, like Bomani McClendon’s great talk last year about building giant animatronic glowing mushrooms, do include “lessons learned” and high-level takeaways, but those high-level takeaways work because they’re presented in the context of a specific project or a specific concept. One of the high-level takeaways for Bomani’s talk, for instance, had to do with how he came to see software as a medium for art, and himself as an artist, as a result of writing software for an art project. But if the title of his talk had been “Level up as a programmer by seeing software as an artistic medium!” rather than “Making Mushrooms Glow!”, it wouldn’t have been as strong of a talk proposal.
These kinds of differences in framing can be quite subtle, but the good news is that if you submitted a talk proposal of the “this talk will make you do better work!” variety and it was rejected, it might not take much effort at all to reframe it as a talk proposal that would be a great fit for !!Con. Consider revising and resubmitting it next year!
!!Con attendees and speakers come from a wide variety of backgrounds — they might design games, build smart watches, study geometry, or develop interplanetary spacecraft flight software. A given audience member may or may not know much about any of these topics. But even talks on very narrowly specialized topics will have something to offer to both specialists and nonspecialists, if they’re done well. In fact, we prefer deep and narrow talks to broad and shallow ones. (The lightning talk format helps: in our experience, just about anything can be interesting for ten minutes if it’s presented well.)
So, talk proposals on narrowly focused topics are great for !!Con. A problem arises, though, when these talk proposals also make a lot of assumptions about the audience’s background, goals, or motivations. We get a lot of talk proposals that aren’t a good fit for !!Con because they assume background that not everyone in our audience has, or goals that not everyone in our audience has. This can happen, for instance, when people take a talk they’ve previously submitted to a more narrowly focused conference and try to reuse it for !!Con. If you send us a talk proposal that you previously sent to PopularMachineLearningToolFest or JavaScriptWebFrameworkOfTheMomentConf, you’ll probably need to make some changes to the way you’re framing your talk to make it suitable for !!Con, regardless of how wonderful PopularMachineLearningTool or JavaScriptWebFrameworkOfTheMoment are.
Will you have to rewrite your existing talk proposal from scratch? Maybe not! Removing assumptions about shared background and goals from a talk proposal could be as simple as changing a sentence like “We all need the web apps we build to work well on mobile devices” to “I needed the web app I built last year to work well on mobile devices”. Instead of trying to convince your audience that they care (or ought to care) about your topic, tell them why you care! Making the talk more personal will also make it more compelling, because it becomes a talk that only you can give.
Thanks to Stephen Tu, Laura Lindzey, Michael Malis, Julia Evans, and Erty Seidohl for feedback on a draft of this post.
In practice, sometimes people run a few minutes long, and that’s okay. We don’t drag them off the stage. ↩
If you submit a killer talk proposal but you forget to put an exclamation point in your talk title, don’t worry, we won’t reject it for that reason alone! Every year we accept a few talks with no exclamation point and ask the speaker to add one as part of our standard copyediting step. Still, you should try to remember to include the exclamation point yourself – it’s one of many ways that you can convey to us that you’re excited about your topic! ↩
Part of this is just a summary of my coauthors’ very cool work on Reluplex, which they’ve been working on since well before I got involved. For me, though, this isn’t just about Reluplex or even just about neural network verification, but about a more general idea that I think Reluplex exemplifies: the idea of exploiting high-level domain-specific abstractions for efficiency. If there’s one theme that ties together all the stuff I’ve been working on or studying in the last several years, it’s that a high-level representation of programmer intent (commutative operations on lattice elements in the case of LVars; aggregate array operations in the case of ParallelAccelerator; the ReLU predicate added to the theory of linear real arithmetic in the case of Reluplex) enables optimizations or smart scheduling choices that would be difficult or impossible otherwise.
The team’s experience with Reluplex suggests that to make headway on hard verification problems, it won’t be enough to use off-the-shelf, general-purpose solvers, and that we need to instead develop domain-specific solvers that are suited to the specific verification task at hand. That suggests to me that we (or, you know, someone) should try to create tools that make it really easy to develop those domain-specific solvers (analogously to how Delite aimed to make it really easy to develop high-performance DSLs) — and perhaps Rosette, which I’m now learning about through my work on the CAPA program, will play a role here. It’s all connected!
In addition to making solvers better, there’s also the other side of things: designing networks that are easier for solvers to chew on in the first place. I think we’re just beginning to learn how to do this, but what I find promising is that some design choices that are desirable for other reasons — like pruning and quantizing networks to reduce storage requirements, speed up inference, or improve energy efficiency — may also make those networks more amenable to verification.
It may even be the case that the same hardware accelerator techniques that optimize inference on quantized or low-precision networks can also optimize verification of those same networks! This is honestly just pure speculation on my part, but if you’re interested in trying to figure out if it’s true, I hope you’ll come talk to me at SysML.
Baris Kasikci asked a good question on Twitter recently about the Reluplex work: “How difficult is it to produce the SMT formula for the property that you want to prove?” Part of the answer is that it depends on the property. We want properties that can be expressed in terms of “if the input to the network falls into such-and-such class of inputs, then its output will fall into so-and-so class of outputs”. For the prototype ACAS Xu aircraft collision avoidance network that Reluplex used as a case study, the input is relatively high-level, expressing things like how far apart two planes are, and the output is one of five recommendations for how much an aircraft should turn (or not turn). In that setting, it’s relatively easy to at least state the properties one wants to prove. For a network where the input is something more low-level, like pixels, it may be a lot harder to state properties of interest.
That should give us at least a moment’s pause, since the whole point of deep learning is that it’s possible to start with relatively low-level inputs and allow the computer to figure out for itself what features are important, rather than having to do a lot of complicated feature engineering to provide input. But the mechanism by which a deep neural network accomplishes that is mostly “lots of layers”, with successive layers operating on increasingly abstract representations of the input. Julian et al. describe the prototype ACAS Xu network as “deep” more because it has a lot of layers than because the input is particularly low-level.
But let’s say that we do have a network that takes low-level inputs like pixels: can Reluplex verify anything interesting about it? Yes, it can! There are still interesting properties we can express to the solver — like, say, robustness to adversarial inputs. I’ve claimed in the past (although I suspect some of my coauthors would disagree with me) that adversarial robustness properties are actually kind of boring, because we already know ways to state those properties for arbitrary networks, and that the more interesting properties are the ones that we’ll have to work hard to even figure out how to express, much less prove. But even if we’re “just” looking at verifying adversarial robustness, it’s not like there’s any shortage of work to do, both in coming up with meaningful robustness metrics and in making solving lots more efficient.
At first, I wrote lots of posts about my dissertation work. I eventually managed to publish, graduate, and get a job, and since then, I’ve continued to use this blog to write about whatever I’m thinking or learning about: distributed computing, domain-specific languages, verification, machine learning, and lots of other things. I’ve also advertised conferences and workshops that I’ve been involved with; told debugging stories; and dispensed advice. Inevitably, the personal and the professional intersect.
This blog doesn’t have anywhere near the huge following of folks like Dan Luu or Julia Evans, but I’ve been fortunate to have picked up a few readers who enjoy it and who I hear from often. I think that part of the blog’s success comes from the fact that I make myself post regularly. I committed to writing two posts a month when I started the blog, and although that means that some of the posts end up being “filler”, I think the benefits outweigh the drawbacks. Having the end of the month looming is often the nudge that I need to finally finish and post things that otherwise would have lingered unpublished for a long time.
At least, that’s how it works sometimes. Too often, though, I let a month go by without posting and then have to backdate posts to the previous month in order to keep up. This has been a problem for some time, but after my daughter was born last July, it got much worse. My posts “What do people mean when they say ‘transpiler’?” and “My first fifteen compilers” were both dated July, but they were actually published in August and September, and I’m sure I wouldn’t have even managed that if I hadn’t already had drafts of both of them started before Sylvia was born.
My two August-dated posts went up in September. When I went back to work full-time in mid-October, the schedule slipped even further, and my September-dated posts didn’t go up until November. By the time December came around, I had six posts to finish for the year; with effort, I managed to crank them all out by January 10. And now it’s the end of the month again, and, predictably, I’m behind.
I’m proud to have made it five years at the twice-a-month pace, but it’s become clear that it’s not a sustainable pace for me anymore. So in 2018, I’m going to try changing my pace of posting to once a month, and seeing how that goes. Once a month seems like something that I can manage. I’m already feeling relieved about this decision. Instead of dreading having to backdate posts for January, I’m excited to write about my colleagues’ and my SysML submission for February. That’s the way that blogging should feel — like fun, not a burden!
Now that I’ve made this decision, I’m realizing that my old posting schedule was causing me to make bad decisions and engage in false economies: at one point, I even remember thinking, “I should agree to serve on the so-and-so program committee, because that means I can write a post about its call for papers, and that’ll take care of one of the posts I need to write this month!” (I said no to that PC invitation, but the fact that I was even tempted to say yes just to get a blog post out of it meant that there was something wrong with how much I was expecting myself to blog.)
Why not do away with this quota system altogether and just blog if and when I feel like it? Looking back on the posts that I’m happy with from over the last five years, I’m certain that a lot of them would never have been published if it hadn’t been for deadline pressure. An example is “Using the simplex algorithm for SMT solving”, which was one of those posts I cranked out in December when I was under the gun to finish a bunch of posts by the end of the year. It was a draft I’d had half-finished for a long time; I was writing about something that I had found interesting, but I wasn’t sure if I was really explaining things properly or if it would be interesting to anyone else. But the deadline meant that I couldn’t afford not to finish up and post it, and when I did, the reaction I got was quite positive! So, I really do think a bit of deadline pressure is good for me. The trick is to have a sustainable amount of deadline pressure, and not so much that I can’t keep up.
Wait, what?
Proceedings of the ACM on Programming Languages, or PACMPL, is the ACM’s new open-access journal for “research on all aspects of programming languages, from design to implementation and from mathematical formalisms to empirical studies”. So far, each issue publishes exactly the papers from one annual ACM PL conference: POPL, ICFP, and OOPSLA. (The fourth major ACM PL conference, PLDI, chose not to participate.) This means that there will be three issues of the journal each year, named “issue POPL”, “issue ICFP”, and “issue OOPSLA”.
So, what would have at one time been called the call for papers for ICFP 2018 is now officially known as the call for papers for PACMPL issue ICFP 2018. Because “PACMPL” has much less name recognition in our community than “ICFP”, I imagine a lot of people don’t realize it exists yet and are at risk of being confused by this. So, to alleviate confusion: the ICFP conference is still the ICFP conference, same as always, but now ICFP papers are being published in a journal rather than as a conference proceedings.
In fact, this was already the case for ICFP 2017, the papers from which were published in the first-ever issue of PACMPL: Volume 1, Issue ICFP. So 2018 will work more or less like 2017 did, but we’ve updated the call for papers to reflect that the papers in question will go into a journal.
Why does this matter? Who cares if papers are published in a journal or a conference? As Phil Wadler explained a while back:
Programming languages are unusual in a heavy reliance on conferences over journals. In many universities and to many national funding bodies, journal publications are the only ones that count. Other fields within computing are sorting this out by moving to journals; we should too.
The best thing about PACMPL is that it’s open access, and therefore all POPL, ICFP, and OOPSLA papers published in it are freely available. Tell your friends!
Quoting from the website:
This workshop aims to investigate the principles and practice of consistency models for large-scale, fault-tolerant, distributed shared data systems. It will bring together theoreticians and practitioners from different horizons: system development, distributed algorithms, concurrency, fault tolerance, databases, language and verification, including both academia and industry.
Relevant discussion topics include:
- Design principles, correctness conditions, and programming patterns for scalable distributed data systems.
- Techniques for weak consistency: session guarantees, causal consistency, operational transformation, conflict-free replicated data types, monotonic programming, state merge, commutativity, etc.
- Techniques for scaling and improving the performance of strongly consistent systems (e.g., Paxos-based, state machine replication, shared-log consensus, blockchain).
- How to expose consistency vs. performance and scalability trade-offs in the programming model, and how to help developers choose.
- How to support composed operations spanning multiple objects (transactions, workflows).
- Reasoning, analysis and verification of weakly consistent application programs.
- How to strengthen the guarantees beyond consistency: fault tolerance, security, ensuring invariants, bounding metadata size, and controlling divergence.
If you’re working on any of those things, please consider submitting a two-page talk proposal by the deadline of February 7! I look forward to reading your submissions.
In March, a friend asked me if I still had the syllabus from a course we took together in 2003. She was applying for a fellowship and was trying to remember if the work she’d done in the course fourteen years ago was relevant. I did not have the syllabus, but I had kept a lot of stuff from the course, and one of the documents that I still had happened to include a URL for the course website. That URL no longer worked, but the Wayback Machine had an archived version of it, and from there I was able to find the archived syllabus for my friend and help answer her questions for the fellowship application. I was excited that I was able to put together information from two sources — my own records (on paper, in a physical file folder!), and the Wayback Machine — to track down the document my friend needed.
In October, my friend Chris tweeted something that reminded me of how, back in the early 2000s, LiveJournal Support volunteers had a custom of saying “manatee” to mean “mentee”. You could be someone’s “support manatee” as you were learning how to do support work. I remembered that this usage was common enough to have been enshrined in the LJ Support Guide at one time. The current version of the Support Guide is a shadow of its former self, and it certainly doesn’t say anything about manatees. But I went to see if the old Support Guide I remembered was on the Wayback Machine, and it was (search that page for “manatee”)! It meant a lot to me to see that something from the culture of a community I cared a lot about in 2004 had been preserved.
Finally, most recently, I was trying to find materials I developed for a course I helped teach in fall 2011. To my embarrassment and frustration, the course website had long since disappeared. I scoured my email archives and files for anything I could find about the course and came up mostly emptyhanded. Then I remembered to check the Wayback Machine. Sure enough, the labs that I developed on the LilyPond music notation system and on Markov models for text and music generation were there. So was a page documenting a project our students had done that had built on the concepts those labs introduced. I’d forgotten about that student project, and finding it again brought me joy at a time when I needed it.
So, I’m including the Internet Archive in my end-of-year charitable giving. If you, too, have stories like mine and can afford to help them out, now is a great time to do it!
Here’s an example: let’s say I want to know what the longest posts on this blog have been.^{1} If I were using, say, WordPress, I’m not sure how I would go about figuring that out. The interface might or might not expose that information; I might have to install a plugin or something, or manually copy and paste the text of my posts into a tool that will show a word count. But since my posts are just text files, it’s just a couple of quick shell commands chained together and run in the directory where all my posts live:
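(The original snippet didn’t survive the move to this format; a pipeline along these lines, assuming the Jekyll convention of one .markdown file per post in the current directory, would do the job:)

```shell
# (Reconstruction -- the original snippet was lost; filenames are assumed
# to follow the Jekyll convention of one .markdown file per post.)
# Word-count every post, drop wc's "total" line, and sort numerically so
# the longest posts appear at the end.
wc -w *.markdown | grep -v ' total$' | sort -n | tail
```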
From the output of that command, I can see that my longest post is “‘Experiencing computing viscerally’: my PG Podcast interview about !!Con” from August 2016, although that one hardly counts because it’s a transcript of an audio interview. The next longest post is “The LVar that wasn’t”, from December 2013, which weighed in at 4526 words. The word count includes a small amount of overhead for stuff like the Markdown front matter in each post, but it’s mostly accurate.
Let’s say that I want to spellcheck all my posts. Since I have aspell installed, that’s pretty easy, too:
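(Again, the original snippet was lost; a loop like this, over the same one-file-per-post layout, is a plausible reconstruction:)

```shell
# (Reconstruction -- the original snippet was lost; assumes one .markdown
# file per post, as above.)
# "aspell check" is interactive, so run it on each post in turn.
for f in *.markdown; do aspell check "$f"; done
```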
This launches the aspell interface, in which I can interactively correct typos that it finds in each file, or add words it doesn’t know to my local dictionary. I tried it just now on the whole blog and found misspellings of ‘heterogeneous’, ‘narrative’, ‘necessarily’, ‘difference’, and ‘which’, all of which are now fixed.
Because I use git to version all my posts, I can see a version history of any post using git diff. Let’s say that I want to see all the changes made to the post “Refactoring as a way to understand code” from a couple years back. The command
shows me the answer. It turns out that I made a small copy edit about an hour after writing the post, correcting the phrase “get a more visceral sense of what it’s doing” to “give me a more visceral sense of what it’s doing”. (I seem to be into this whole viscerality thing.) A couple of days later, I added the sentence “I can sympathize.” to the post. A year and a half later, I removed the tag “programming” and added “refactoring” instead.
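The exact invocation didn’t survive the formatting above, but in plain git the idea looks like this (the filename is a guess):

```shell
# Show the full diff history of a single post, following renames.
post_history() {
  git log -p --follow -- "$1"
}
```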
With only three edits after the original post, I suspect that “Refactoring as a way to understand code” is actually one of my less heavily-edited posts. How can I see which ones are most heavily edited? The git effort command from the wonderful git-extras collection can help. If I want to see all my posts that have been edited twenty times or more, I can do this:
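The git effort invocation and its output are missing above. If git-extras isn’t handy, a plain-git approximation that counts commits per post looks like this (a sketch; the extension is an assumption):

```shell
# Count the commits touching each post, most-edited first.
edit_counts() {
  for post in "$1"/*.markdown; do
    n=$(git log --oneline --follow -- "$post" | wc -l | tr -d ' ')
    printf '%s %s\n' "$n" "$post"
  done | sort -rn
}
```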
git effort lists those posts for me, first ordered by filename (which, because of the naming convention I use, is also by date) and then ordered by the number of commits. It turns out that I have four posts that have been edited twenty or more times, with “Thoughts on ‘Adversarial examples in the physical world’” being the most edited post by a pretty wide margin. Fascinating!
I use Octopress to generate this blog. When I started the blog in early 2013, Octopress was still quite popular; by 2015, it had fallen far enough out of fashion that a commenter on Hacker News remarked on my anachronistic use of it. I suppose that person would think I’m a dinosaur for still using Octopress even now, as 2017 is coming to an end. There are any number of other static site generators out there that would probably do everything I want and are probably less janky than Octopress, but I keep using it out of inertia and because it still does the job well enough. I’ll probably switch from Octopress to plain Jekyll at some point, since Octopress is just a wrapper around Jekyll, and I don’t really use most of the features Octopress adds.
The particular blogging framework I use is beside the point, though. What’s more important to me is how easy it is to work with my blog using the wider world of command-line text manipulation tools, and it’s the fact that the posts are stored as versioned text that makes that possible. That, to me, is the real power of static site generation tools — it’s not so much about what those tools themselves do as it is about everything else that the posts-as-versioned-text approach enables, none of which is particular to Octopress or Jekyll or any other static site generator.
I’m always worried that My Best Blogging Days Are Behind Me and that I Never Blog About Anything Substantial Anymore, so it’s reassuring to look back and see that my longest posts are actually pretty evenly spread over the years that this blog has existed, and in fact, three of the nine longest posts were written within the last year. (Length isn’t the same thing as substance, of course, but there’s no shell command for quantifying substance yet.)
This post isn’t an introduction to the simplex algorithm itself; for that, there are many good references, such as chapter 29 of CLRS. If you don’t have a physical copy of CLRS handy, or if yours is currently in use as a doorstop, some of chapter 29 is on Google Books. Better yet, try chapter 2 of Linear Programming by Vašek Chvátal, a book that Guy Katz pointed me to and that I like a lot, and almost all of which is on Google Books.
The simplex algorithm is a standard technique for solving linear programming (LP) problems. What’s a linear programming problem? Wikipedia has an example:
Suppose that a farmer has a piece of farm land, say $L$ km^{2}, to be planted with either wheat or barley or some combination of the two. The farmer has a limited amount of fertilizer, $F$ kilograms, and pesticide, $P$ kilograms. Every square kilometer of wheat requires $F_1$ kilograms of fertilizer and $P_1$ kilograms of pesticide, while every square kilometer of barley requires $F_2$ kilograms of fertilizer and $P_2$ kilograms of pesticide. Let $S_1$ be the selling price of wheat per square kilometer, and $S_2$ be the selling price of barley. If we denote the area of land planted with wheat and barley by $x_1$ and $x_2$ respectively, then profit can be maximized by choosing optimal values for $x_1$ and $x_2$.
This is an optimization problem that can be solved by maximizing a function of $x_1$ and $x_2$ — in particular, $S_1 \cdot x_{1} + S_2 \cdot x_2$ — subject to constraints that capture what we know about the relationships between the other variables involved. The simplex algorithm is a recipe for doing that. The steps of the algorithm can be carried out by hand, but LP solver software packages such as GLPK or Gurobi also use some version of the algorithm to find an optimal solution to LP problems.
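To make this concrete, here’s a small Python sketch that solves a two-variable LP like the farmer’s by brute-force vertex enumeration. It is not the simplex algorithm, but it finds the same optimal vertex on tiny problems, and all the numbers below are made up for illustration:

```python
from itertools import combinations

def solve_2var_lp(c, A, b):
    """Maximize c[0]*x1 + c[1]*x2 subject to A x <= b and x1, x2 >= 0.
    Brute force: the optimum of a bounded LP lies at a vertex, so try
    every intersection of two constraint lines and keep the best
    feasible one. Returns None if the problem is infeasible."""
    # Treat x1 >= 0 and x2 >= 0 as two more constraint lines.
    lines = [(a1, a2, rhs) for (a1, a2), rhs in zip(A, b)]
    lines += [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]

    def feasible(x1, x2):
        return all(a1*x1 + a2*x2 <= rhs + 1e-9 for a1, a2, rhs in lines)

    best = None
    for (a1, a2, r1), (b1, b2, r2) in combinations(lines, 2):
        det = a1*b2 - a2*b1
        if abs(det) < 1e-12:
            continue  # parallel constraint lines: no vertex here
        x1 = (r1*b2 - r2*a2) / det
        x2 = (a1*r2 - b1*r1) / det
        if feasible(x1, x2):
            value = c[0]*x1 + c[1]*x2
            if best is None or value > best[0]:
                best = (value, x1, x2)
    return best

# Hypothetical farmer: 4 km^2 of land, 100 kg of fertilizer, 70 kg of
# pesticide; wheat needs 30 kg fertilizer and 10 kg pesticide per km^2,
# barley needs 20 kg of each; wheat sells for 5 per km^2, barley for 4.
print(solve_2var_lp(c=(5.0, 4.0),
                    A=[(1.0, 1.0), (30.0, 20.0), (10.0, 20.0)],
                    b=[4.0, 100.0, 70.0]))  # → (18.0, 2.0, 2.0)
```

Returning None when no vertex is feasible is the decision-problem half of the story: deciding feasibility at all is exactly what comes up again below.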
SAT and SMT solvers, on the other hand, solve satisfiability problems: problems where, given a formula with Boolean variables in it, we need to figure out whether or not there is some assignment of values to variables that will cause the expression to evaluate to “true”. (In the case of SMT solving, the formula is something fancier than a plain old Boolean formula, but the basic idea is the same.)
Determining whether a formula is satisfiable is a decision problem, not an optimization problem. Since SMT problems are decision problems and LP problems are optimization problems, I didn’t imagine that SMT solvers and LP solvers would have much, if anything, in common. But it turns out that that’s not the case!
To use the simplex algorithm to solve an LP problem, you first need to find what’s called a feasible solution to the problem, in which all the constraints are satisfied. The feasible solution is sometimes called a primal feasible solution or initial feasible solution, and the process of finding it is sometimes called initialization. Once that’s done, you can use the simplex method to iterate toward an optimal solution.
Some presentations of the simplex algorithm focus on the second part of the process only. In fact, in chapter 2 of the Chvátal textbook, it’s assumed from the outset that you already have a feasible solution, and now you just want to find an optimal one. For many real-world LP problems, this is a realistic assumption to make. For instance, suppose you’re the farmer from the example problem above. Whatever you’re already doing is a feasible solution: assuming you’ve set up the problem correctly, the amounts of fertilizer, pesticide, and land you’re using already fall within the constraints, because it would be impossible for you to do otherwise (for instance, you can’t use more land than exists, or put a negative amount of fertilizer on it). Of course, what you’re currently doing is probably not optimal, which is why you need the simplex algorithm! But at least you have an initial feasible solution from which to start iterating toward an optimal one.
But what do you do when you don’t have a feasible solution to begin with? It turns out that the simplex method is of use here, too. Chapter 3 of Chvátal (on pages 39–42) covers the situation in which you don’t have a feasible solution and need to come up with one. You do this by solving what’s known as an auxiliary problem. From the original problem, we can mechanically construct an auxiliary problem that always has an initial feasible solution, and then use the simplex method to optimize the solution to the auxiliary problem. Once that’s done, if it turns out that the optimal solution to the auxiliary problem is zero, then a feasible solution to the original problem exists and can be easily obtained, and we can then go on and carry out the steps of the simplex method on that feasible solution, resulting in an optimal solution to the original problem. And if the optimal solution to the auxiliary problem turns out to be nonzero, then that means that no feasible solution to the original problem exists.
Chvátal explains that this approach is what’s known as the two-phase simplex method.^{1} The first phase is setting up and finding an optimal solution to the auxiliary problem. If doing that results in a feasible solution to the original problem, we can then go on to the second phase, which finds an optimal solution to the original problem. Both phases use the same iterative process.
So, you can use an auxiliary problem to either get to a feasible solution for the original problem, or determine that there is no feasible solution. Figuring out whether or not there’s a feasible solution sounds like a decision problem, and it is! But, interestingly, we use an optimization technique to solve it. We turn the decision problem into an optimization problem: if the optimal answer to the auxiliary problem turns out to be zero, then the answer to our decision problem is yes, and if the optimal answer to the auxiliary problem turns out to be nonzero, then the answer to the decision problem is no.
Back in May, I wrote about how for certain kinds of linear real arithmetic formulas, SMT solving and LP solving coincide, but I didn’t mention the simplex algorithm in that earlier post. I can now be more specific and say that for certain kinds of SMT formulas, determining the satisfiability of the formula coincides with carrying out just the first phase of the simplex algorithm that LP solvers use. Finding a feasible solution is the entire problem, and once we have one, we can stop. After all, the goal of SMT solving is just to determine whether a formula is satisfiable or not — we don’t care about finding an “optimal” way to satisfy it!
In fact, if we peek into the guts of GLPK, which is the LP solver on top of which the proof-of-concept implementation of Reluplex is built, we see that in its internal representation of an LP problem, there’s a field called phase that represents which of the two phases we’re in. Often, computer implementations of an algorithm look pretty different from descriptions of how to carry out the algorithm by hand, so I think it’s interesting to see that the notion of phases isn’t specific to by-hand applications of the simplex method, but is instead fundamental enough that at least one computer implementation makes use of it, too!
Halide is really two languages: one for expressing the algorithm you want to compute, and one for expressing how you want that computation to be scheduled. In fact, the key idea of Halide is this notion of separating algorithm from schedule.
The paper uses a two-stage image processing pipeline as a running example. Let’s say you want to blur an image by averaging each pixel with its neighboring pixels to the left and right, as well as with the pixels above and below. We could write an algorithm for this 3x3 box blur something like the following, where in is the original image, bh is the horizontally blurred image, and bv is the final image, now also vertically blurred:^{1}
bh(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3
bv(x, y) = (bh(x, y - 1) + bh(x, y) + bh(x, y + 1)) / 3
The Halide paper illustrates how, even with a simple two-stage algorithm like this, there are lots of challenging scheduling choices to be made. For instance, you could do all of the horizontal blurring before doing the vertical blurring (the “breadth-first” approach), which offers lots of parallelism. Or you could compute each pixel of the horizontal blur just before you need it for the vertical blur (the “total fusion” approach), which offers good data locality. Or you could take a “sliding window” approach that interleaves the horizontal and vertical stages in a way that avoids redundant computation. Or you could do some combination of all those approaches in search of a sweet spot. The choices are overwhelming, and that’s just for a toy two-stage pipeline! For more sophisticated pipelines, such as the one described in the paper that computes the local Laplacian filters algorithm, the scheduling problem is much harder.^{2}
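To see the two extremes concretely, here’s a Python sketch (not Halide) of the breadth-first and totally fused strategies for the 3x3 box blur. They produce identical output but differ in intermediate storage and redundant work:

```python
def blur_breadth_first(img):
    """Compute the whole horizontal blur, then the vertical blur."""
    h, w = len(img), len(img[0])
    # Materialize every bh value first: lots of parallelism, poor locality.
    bh = [[(img[y][x-1] + img[y][x] + img[y][x+1]) / 3 for x in range(1, w-1)]
          for y in range(h)]
    return [[(bh[y-1][x] + bh[y][x] + bh[y+1][x]) / 3 for x in range(w-2)]
            for y in range(1, h-1)]

def blur_fused(img):
    """Recompute each horizontal-blur value right when it's needed:
    good locality, but each bh value is computed three times."""
    h, w = len(img), len(img[0])
    def bh(y, x):
        return (img[y][x-1] + img[y][x] + img[y][x+1]) / 3
    return [[(bh(y-1, x) + bh(y, x) + bh(y+1, x)) / 3 for x in range(1, w-1)]
            for y in range(1, h-1)]
```

Both functions do the same arithmetic in the same order per output pixel, so they agree exactly; only the schedule differs.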
To me, perhaps the most interesting contribution of the PLDI ‘13 Halide paper — perhaps more interesting than the design of Halide itself — is its model of the three-way tradeoff space among parallelism, locality, and avoiding redundancy that arises when scheduling image processing pipelines.
That said, I think the design of Halide itself is also pretty awesome! The scheduling-language part of Halide allows one to express programs that fall anywhere we like in that three-dimensional space of choices. For instance, for the 3x3 box blur algorithm above, we could express a breadth-first schedule with the Halide scheduling directive bh.compute_root(), which would compute all of bh before moving on to bv. For the “total fusion” approach, we could instead write bh.compute_at(bv, y), which would compute each value from bh only when it is needed for computing bv. But you don’t have to try each of these possibilities yourself by hand; there’s an autotuner that, given an algorithm, can come up with “high-quality” (although not necessarily optimal) schedules for it, starting from a few heuristics and using stochastic search.
There’s been lots of work on attempts to free programmers from having to schedule parallelizable tasks manually. We all wish we could just write an algorithm in a high-level way and have the computer do the work of figuring out how to parallelize it on whatever hardware resources are at hand. But decades of research haven’t yet produced a general-purpose parallelizing compiler; automatic parallelization only works in a few limited cases. The cool thing about Halide is that it makes the scheduling language explicit, instead of burying scheduling choices somewhere deep inside a too-clever-by-half compiler. If you know how you want your algorithm to be scheduled, you can write both the algorithm and the schedule yourself in Halide. If someone else wants to schedule the algorithm differently (say, to run on a different device), they can write their own schedule without touching the algorithm you wrote. And if you don’t know how you want your algorithm to be scheduled, you can just write the algorithm, let the Halide autotuner loose on it, and (after many hours, perhaps) have a pretty good schedule automatically generated for you. The advantage of separating algorithm from schedule isn’t just prettier, cleaner code; it’s the ability to efficiently explore the vast space of possible schedules.
A follow-up paper, “Distributed Halide”, appeared last year at PPoPP.^{3} This work is about extending Halide for distributed execution, making it possible to take on image processing tasks that are too big for one machine.
Running a bunch of independent tasks on a cluster for, say, processing individual frames of video is one thing, but what we’re talking about here are single images that are too big to process on one machine. Is it ever really necessary to distribute the processing of a single image? Sometimes! As the paper explains, we’re entering the age of terapixel images.^{4} For example, the Microsoft Research Terapixel project involved stitching together hundreds of 14,000^{2}-pixel images, then using a global solver to remove seams, resulting in a single terapixel-sized image of the night sky. Wikipedia has a list of other very large images, such as this 846-gigapixel panoramic photo of Kuala Lumpur. So it’s not out of the question that a single image would be too large to process on a single machine. That’s the problem that the Distributed Halide work addresses.
Distributed Halide extends Halide by adding a couple of new distributed scheduling directives to the scheduling language, and it also adds new MPI code generation to the Halide compiler. The new additions to the scheduling language are distribute() and compute_rank(). distribute() applies to dimensions of individual pipeline stages and allows for distributing individual stages or not. compute_rank() is like compute_root(), but for a particular MPI rank (“rank” being MPI-speak for “process ID”).
Adding distribution to the mix adds yet another dimension to the scheduling tradeoff space: communication between nodes. We can save time and energy by minimizing communication, but less communication generally means more redundant recomputation. Let’s suppose we want to run our two-stage 3x3 box blur on a really large image. The schedule
bh.compute_at(bv, y)
bv.distribute(y)
results in no communication between compute nodes, but lots of redundant recomputation at different nodes. On the other hand, the schedule
bh.compute_root().distribute(y)
bv.distribute(y)
results in lots of communication, but no redundant recomputation: all pixels of bh are computed only once.
For most Halide programs, the optimal schedule is probably somewhere in the middle of the space of possible schedules. Increasing the dimensionality of the tradeoff space by adding distributed scheduling directives unfortunately means that automatically generating schedules gets a lot harder, and so the Distributed Halide work covers handwritten schedules only. Figuring out how to handle schedule synthesis for this larger space of possible schedules still seems to be an open question.
Something else I’m curious about is the idea of synthesizing Halide schedules from incomplete sketches of schedules. At this point, Halide autoscheduling is all-or-nothing: you either write a schedule yourself by hand, or you give your algorithm to the autotuner to do all the work. But what if we could combine human and machine effort? Ravi Teja Mullapudi and co.’s 2016 paper on improved automatic scheduling for Halide pipelines concluded with the sentence, “We are interested in exploring interfaces for the autoscheduler to accept partially written schedules by experts, and then fill in the missing details”, and it’s here that I wonder if sketch-based synthesis could be useful.
This isn’t necessarily what Halide syntax actually looks like; for that, see the tutorials.
The local-Laplacian-filters pipeline apparently has 99 stages, not all of which are obvious to me from looking at Figure 1 in the PLDI ‘13 paper. A while back, after spending some time staring at the figure and trying to get the stages to add up to 99, I eventually went to the halide-dev mailing list and asked about it, and someone there was able to account for 80 of the alleged 99 stages. The point is, it’s a lot of stages. This example also demonstrates that despite its name, an image processing “pipeline” is not necessarily linear! It’s really a graph of interdependent stages, which further complicates the scheduling task.
There are more details in first author Tyler Denniston’s 2016 master’s thesis.
A terapixel is 10^{12} pixels, or 1000 gigapixels, or 1,000,000 megapixels. By comparison, an iPhone 7 can take 12-megapixel photos.
Over the course of the simulation, we can see circles emanating from the center of the plot, similar to how waves spread out in circles from the point where a drop of water hits a still pond. There are also waves coming from somewhere off to the left that interact with the waves coming from the center.
The simulation runs for 5000 time steps and then loops back to the beginning. There should be a frame of animation for each tenth time step, plus one initial frame — 501 frames in all. But something’s wrong — the animation looks choppy in places, as though some of the frames are missing. What gives?
I first noticed this frame-dropping behavior a long time ago, the first time I ran the code with visualization turned on. Because I was watching the visualization “live” as the code ran, though, I figured that my computer was just dropping frames because it was slow or busy with some other task. I thought the frames were there, just not being shown. It wasn’t until I made the above GIF and watched it that I suspected that something was wrong, because the GIF also looked like it was missing frames, even though it wasn’t running “live”. Hmmm.
I went back to look at the Julia code that did the plotting:
if mod(t/dt, 10) == 0
    fig = Winston.imagesc(c)
    display(fig)
end
This code uses the Winston.jl 2D plotting package to plot an image for each tenth time step using the imagesc function, then displays that image on the screen. To make the above GIF, I’d also added an additional line of code that would save each image out to a file using Winston’s savefig function, with sequentially numbered file names:
if mod(t/dt, 10) == 0
    fig = Winston.imagesc(c)
    display(fig)
    Winston.savefig(fig, @sprintf("figure%03d.png", frame))
    frame += 1
end
Since there’s a frame for every tenth time step, if we were saving the eighth frame, this code would name the image figure008.png, for instance.
Then I stitched together all the PNG files into an animation with a shell command using good ol’ convert:^{1}
convert -delay 5 -loop 0 figure*.png animation.gif
The shell globbing conveniently takes care of putting the images in the correct order as frames in our GIF: figure000.png, representing the initial state of the simulation, would be followed by figure001.png and then figure002.png, all the way up to figure500.png. All this seemed to be working correctly; the frames weren’t being assembled in the wrong order. I still didn’t understand what was causing the jerkiness in the animation. But then I checked how many frames I actually had:
ls figure*.png | wc -l
389
389? That was a lot fewer than the 501 I was supposed to have! I went back and looked at the plotting code again:
if mod(t/dt, 10) == 0
    fig = Winston.imagesc(c)
    display(fig)
    Winston.savefig(fig, @sprintf("figure%03d.png", frame))
    frame += 1
end
And then, finally, the problem jumped out at me. The line if mod(t/dt, 10) == 0 is supposed to make sure that we only plot every tenth time step of the simulation. t represents the amount of time that has passed since the beginning of the simulation, and dt represents the amount of time that goes by with each time step. In my code, as in the original Octave code from which it was ported, dt is set to 0.0001. Having dt be a variable means that we can easily change the time resolution of the simulation by making dt larger or smaller.
Why are we checking if mod(t/dt, 10) is equal to zero? As the author of the original code explains: “t/dt will return a natural number which represents the number of timesteps since the simulation has started. The modulo of this number and 10 will return the remainder of the division of these two numbers. Consequently comparing the modulo with 0 will only return true every tenth time the count variable t was incremented […].”
The problem here is that, at least after porting to Julia from Octave, t/dt can’t be counted on to return a natural number! Take, for instance, the third time step. At this time step, t is 0.0003, and so t/dt ought to be 0.0003 divided by 0.0001, which is 3. But because of floating-point inexactness, t/dt comes out to something like 2.9999999999999996, and therefore mod(t/dt, 10) also comes out to 2.9999999999999996.
The first dropped frame seems to occur on the 90th time step, when t reaches 0.009. Then, t/dt comes out to 89.99999999999999 instead of 90, and so mod(t/dt, 10) comes out to 9.999999999999986 instead of 0. From that point on, frames are dropped here and there, whenever t/dt works out to be something that isn’t an integer on a time step that happens to be a multiple of 10.
I tried logging the values of t, t/dt, and mod(t/dt, 10) at each time step as the code ran. There were some stretches of time where t/dt always came out integral, but during other stretches, it would be non-integral as often as not. Most interestingly of all, sometimes a pattern would emerge and persist for a while, disappear, and then show up again later on.
During these stretches, we get an inexact answer for every fifth value of t, whenever t has a 2 or a 7 in its fourth decimal place, and not for any other value of t. Other patterns show up elsewhere. Perhaps there’s something to be learned from studying these patterns. If you want to analyze the whole log of them, please be my guest!
Anyway, as is probably clear by now, we can fix the dropped-frames bug by rounding t/dt to the nearest integer:
if mod(round(t/dt), 10) == 0
    fig = Winston.imagesc(c)
    display(fig)
    Winston.savefig(fig, @sprintf("figure%03d.png", frame))
    frame += 1
end
With this fix, we have the correct number of frames:
ls figure*.png | wc -l
501
And our convert command produces a lovely, smooth animation:
Ah! That’s better!
I updated my previous post to use the smooth animation instead of the choppy one. For comparison, here are the old and new versions, side by side. Since the smooth version has more than a hundred more frames than the choppy version, we can see that it also takes a bit longer to run:
Is there a lesson to be learned from all this? If the lesson of the previous post was that visualizing what your code is doing can be a helpful way to find bugs, perhaps the lesson here is to trust what the visualization shows you. In this case, our animation came out choppy for a reason. The other lesson here, I suppose, is to watch out for inexact fractions. In this case, t/dt was already in the code we were porting from, but that’s no excuse!
To conclude, here’s a process screenshot for Katherine Ye. I like the gradient thing going on here.
Thanks to Cameron Finucane for a conversation that inspired this post.
Here’s something else that tripped me up at this stage: so, convert has a -loop option. I assumed that -loop 1 meant “loop” and -loop 0 meant “don’t loop”. But, no, in fact, the argument to -loop is the number of times you want your GIF to loop – except in the special case of 0, which means “loop infinitely”. So -loop 1 and -loop 0 had exactly the opposite of the behavior I expected! Oops.
One such workload — now available as an example program that comes with the ParallelAccelerator package — was an implementation of the two-dimensional wave equation, ported from an Octave implementation with the permission of the original author. The wave equation models the vibrations of a wave across a surface, such as a drum head when struck or the surface of a pond when hit by a drop of water. Using the wave equation, you can derive a formula that will tell you where each point on that surface will be on the next time step, based on that point’s position in the current and previous time steps.
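In one dimension, that update rule can be sketched like this (illustrative Python, not the ParallelAccelerator code; the coefficient r is a made-up constant):

```python
# One step of the discretized wave equation in 1-D:
#   f[i] = 2*c[i] - p[i] + r * (c[i-1] - 2*c[i] + c[i+1])
# where p, c, f hold the "past", "current", and "future" positions
# and r = (wave speed * dt / dx)^2.
def wave_step(p, c, r=0.25):
    n = len(c)
    f = [0.0] * n  # endpoints stay pinned at zero
    for i in range(1, n - 1):
        f[i] = 2*c[i] - p[i] + r * (c[i-1] - 2*c[i] + c[i+1])
    return f

# A bump in the middle starts to spread outward:
p = [0.0] * 7
c = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(wave_step(p, c))  # → [0.0, 0.0, 0.25, 1.5, 0.25, 0.0, 0.0]
```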
Here’s an animated GIF of the simulation running, created using Winston.jl and ImageMagick:
This animation shows us looking down at our simulated surface from above. The circles emanating outward from the center are waves created by the “drop of water” that’s just hit the surface. (There’s also something else coming from the left; that’s the “dynamic source” described in the original author’s article.)
The most interesting part of the code is the part that implements the wave equation, which we sped up using ParallelAccelerator’s runStencil construct, as described in section 3.3 of our paper. But this post isn’t about that — it’s about a mistake we made when porting the original Octave code to Julia!
Leaving aside ParallelAccelerator and runStencil, let’s take a look at the Julia code that was directly ported from Octave. The code uses three two-dimensional arrays, called p, c, and f for “past”, “current”, and “future” respectively, to keep track of the positions of points on the surface. Each iteration of the main loop of the simulation uses the wave equation to compute a new version of f, based on the contents of p and c.
The part of the code we’re concerned with comes at the end of the main loop. After computing the new f, the simulation moves one step into the future. It updates the contents of p to those of c, then updates the contents of c to those of f — and around we go to the beginning of the loop to compute the next f using the wave equation again. The code that does the end-of-loop updates to p and c looks like this (where s is the size of the array along each dimension):
p[1:s, 1:s] = c[1:s, 1:s]
c[1:s, 1:s] = f[1:s, 1:s]
These two lines of Julia code are identical to the Octave code they were ported from, except that Octave uses parentheses for array indexing instead of square brackets.
One of my colleagues proposed that instead of having all those nasty array indices, we ought to be able to just write
p = c
c = f
to update the two arrays at the end of the loop. Not only did the change make the code more concise, it was also faster when we benchmarked it. We reasoned that the new version of the code was faster because, instead of copying the contents of c into p and the contents of f into c every time through the loop, the p = c; c = f version merely moved pointers around.
Some months later, late in 2015, I happened to run the code with plotting turned on. We usually never ran with plotting, because we wanted to benchmark only the time it took to compute the arrays, not the time it took to visualize the results of that computation.^{1} With plotting enabled, I expected to see something like the above animation, but I was surprised by what I saw instead. After a couple of frames, the animation seemed to just…stop.
Based on that symptom, I bet some of you have already correctly diagnosed the bug, but I didn’t yet understand what was going on. At first, I thought that something was wrong with the plotting code, but I looked and didn’t see any problems there. Then I remembered the change we’d made several months previously, and I decided to change the two lines that updated p and c back to the old version that used explicit indices, just to see what would happen.
When I made that change, the visualization showed the simulation running like it should, no longer seeming to stop after a couple of steps. I didn’t know why switching back to the old version of the code had fixed the problem, but I checked in the change with a TODO for myself to “figure out what’s going on with the array assignment bug”.
The next day, I went back to figure out what was going on and busted out everyone’s favorite debugging tool: good ol’ box-and-pointer diagrams.
Let’s say that we’ve just computed f on our first time through the loop. Suppose p points at turquoise, c points at orange, and f points at purple:
It’s now time to update p and c to prepare for our next trip through the loop, where we’ll compute the next value of f. Let’s first consider the more verbose version of the code that uses explicit array indexing. In this version of the code, we update p and c by copying the contents of c into p, and then the contents of f into c:
The contents of the array originally pointed to by c have been copied over into the array pointed to by p, and the contents of the array originally pointed to by f have been copied over into the array pointed to by c. So now p points at orange, and c points at purple.
f also points at purple, but not for long: on the next trip around the loop, we compute the new contents of f — let’s say they’re pink. Then we update p and c again, as before:
Now p, c, and f point to purple, pink, and pink, respectively, and we’re ready for our next trip around the loop to recompute f again.
The above approach works, but it involves a lot of copying. Wouldn’t it be nice if we could simply swap pointers around instead of having to copy the contents of arrays? That’s what the p = c; c = f version of the code was supposed to do. Let’s see what happens when we run it.
Once again, here’s our starting state after computing f on our first time through the loop:
Next, we update p to point to where c is pointing, and we update c to point to where f is pointing:
Now we’ve got p pointing at orange, and c and f both pointing at purple. This state seems to be Observably Equivalent™ to the state we were in before, when we were copying arrays around — and we accomplished it by just moving a couple of pointers! Hooray!
But wait — what happens on the next trip through the loop?
We compute the new contents of `f`, which are pink, as before. But now when we run `p = c`, we update `p` to point to what `c` is pointing to, and that’s `f`! And running `c = f` is a no-op, since `c` already points to what `f` is pointing to. So now all three of `p`, `c`, and `f` are pointing at the same array. Now that this is the case, all future assignments of `p = c` and `c = f` will also be no-ops, and our code won’t work.
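The failure can be reproduced in a few lines of Python (a sketch with lists standing in for the Julia arrays; the values are illustrative):

```python
p = [1.0]  # turquoise
c = [2.0]  # orange
f = [3.0]  # purple

# First trip through the loop: swap pointers instead of copying
p = c  # p now points at orange
c = f  # c now points at purple -- and so does f!

# Second trip: write the new contents ("pink") into f in place
f[0] = 4.0
assert c[0] == 4.0  # c changed too, because c and f alias the same array

p = c  # p now joins them
assert p is c and c is f  # all three share one array; further swaps are no-ops
```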
Thankfully, there’s an easy solution to this problem that still lets us avoid copying arrays around every time we go through the loop. Instead of writing

```julia
p = c
c = f
```

we can introduce a temporary variable, like so:

```julia
tmp = p
p = c
c = f
f = tmp
```
When we do this, here’s what things look like at the end of the first time step:
As in the other versions we’ve seen, `p` points at orange and `c` points at purple. We have `f` pointing at turquoise, but the contents of the array `f` points to aren’t important, since they’re about to change on the next time through the loop. The important thing is that neither `p` nor `c` points to where `f` points, so updating the contents of `f` will not affect the contents of `p` or `c`, which is how we got into trouble last time.
At the end of the next time step, all our pointers rotate:
Now `p` points to purple and `c` points to pink, which is what we want, and again, neither one points to the same place as `f`, and the simulation continues to run correctly for the remaining time steps. Note that our orange and purple arrays are in the same place they were before; they didn’t have to move! Only the variables pointing to them changed.
I got rid of the array-copying code and replaced it with the pointer-swapping version that uses `tmp`. Now, not only did our code not do any unnecessary copying, but it had the added bonus of being correct!
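Here’s the same three-pointer rotation sketched in Python (lists standing in for the Julia arrays; values illustrative):

```python
p = [1.0]  # turquoise
c = [2.0]  # orange
f = [3.0]  # purple

# Rotate the pointers instead of copying array contents
tmp = p
p = c    # p points at orange
c = f    # c points at purple
f = tmp  # f recycles turquoise's array as scratch space

# Next trip: overwrite f's stale contents in place ("pink")
f[0] = 4.0
assert p[0] == 2.0 and c[0] == 3.0  # p and c are unaffected
assert p is not f and c is not f    # no aliasing this time
```

The rotation moves three pointers per time step, no matter how large the arrays are, instead of copying two full arrays.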
The issue here turned out to be an example of a classic pointer-aliasing bug: we had two variables pointing to the same place when they shouldn’t have been. But why did the bug manifest in the way it did, by making the animation appear to stop after the second frame? It seems clear that the behavior of the code was going to be wrong in some way as a result of the aliasing bug, but why was it wrong in that particular way?
Here’s something that might be a clue: after fixing the pointer-aliasing bug, if we replace the occurrence of `p` in the wave equation with `c`, then we also get the behavior where the animation appears to freeze after the second frame, even though `f[x, y]` doesn’t, in general, work out to be the same thing as `c[x, y]` for a given `x` and `y`. Why?
Finally, would it have worked to just write `f = p; p = c; c = f` instead of introducing `tmp`? I’ll leave that one as an exercise for the reader, too.
In the Julia docs, the “noteworthy differences from MATLAB” section mentions the fact that Julia arrays are assigned by reference: “After `A=B`, changing elements of `B` will modify `A` as well.” That didn’t happen with `p[1:s, 1:s] = c[1:s, 1:s]; c[1:s, 1:s] = f[1:s, 1:s]`, because expressions like `c[1:s, 1:s]` are what are known as array “slice” expressions, and slices always make copies of the data being referred to.
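Python lists make the same distinction between plain assignment and slice assignment, so the difference can be sketched there (a Python analogue, not the Julia code itself):

```python
b = [1, 2, 3]

a = b         # assignment by reference: a and b name the same array
b[0] = 99
assert a[0] == 99  # a sees the change

a2 = [0, 0, 0]
a2[:] = b[:]  # slice expressions copy, like p[1:s, 1:s] = c[1:s, 1:s]
b[0] = -1
assert a2[0] == 99  # a2 has its own copy and is unaffected
```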
In MATLAB (and Octave), arrays are not assigned by reference, and modifying the original won’t modify the copy. So it presumably would have been fine to write `p = c; c = f` in Octave, instead of the version with slices. (In fact, I don’t know why the original Octave code wasn’t just `p = c; c = f`; perhaps someone reading this can enlighten me.) As it is, the code was “accidentally correct” when first ported to Julia, because it happened to be written using slices that referred to the entire array.
In general, though, slices can address just some portion of an array — that’s the point of slices! In Julia, if you want to refer to a part of an array but you don’t want the copying overhead that comes with using slices, you can use a new feature called views; Spencer Russell has a nice blog post about that.
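For comparison, NumPy makes the opposite default choice: slicing an array produces a view rather than a copy, which is close in spirit to Julia’s views (a NumPy sketch of that behavior, assuming NumPy is available; this is not Julia code):

```python
import numpy as np

b = np.array([1, 2, 3, 4])

v = b[1:3]  # a NumPy slice is a view: no data is copied
b[1] = 42
assert v[0] == 42  # the view sees the change

c2 = b[1:3].copy()  # an explicit copy, like a Julia slice expression
b[1] = 7
assert c2[0] == 42  # the copy is unaffected
```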
Thanks to Cameron Finucane, Veit Heller, and Valentin Churavy for giving feedback on drafts of this post or discussing aspects of it with me.
In fact, the majority of the running time of the simulation went to plotting the results of the computation rather than to the computation itself, but ParallelAccelerator couldn’t do anything to speed up plotting, so we just turned off plotting and only benchmarked the part of the code that we could speed up. This, of course, was cheating. For instance, the plain Julia version of the computation ran in about 4 seconds, whereas the ParallelAccelerator version ran in about 0.3 seconds, and so we reported an order-of-magnitude speedup. But plotting took something like 20 seconds for each version, so if we’d had plotting enabled, the ParallelAccelerator version would have run in 20.3 seconds compared to 24 seconds for the plain Julia code, a measly 15% speedup.
Here are what I see as the upsides of lists like this:
They help people who are putting together a program committee, keynote speaker roster, or the like, but who are having trouble thinking of women to ask. (This was the stated purpose of Jean’s list.) Often, the same small number of highly visible women tend to get asked to serve on program committees again and again, which does a disservice to those women by overburdening them with work, while other qualified women never come up for consideration. So, lists like this are especially useful for getting the names of less-well-known women in front of decision makers, so that those names will be more likely to come to mind when it’s time to choose committee members or speakers.
It’s helpful for women in the research community who feel lonely or isolated to be able to look at a list of names and see that there are a lot of other women out there in the community who they might not have met yet. I’m excited when I look at a list of women in my field and see names I don’t recognize.
Some people have a habit of always citing the same handful of very famous names whenever they’re asked for an example of a woman working in a given field. (In PL and software engineering, those names include Grace Hopper, Barbara Liskov, and Margaret Hamilton.) When people unfamiliar with the area only hear the same famous names over and over, it can give them the false impression that those are the only women who have ever worked in the area. The existence of a list of not-necessarily-superstar women in a field as a reference can help ameliorate that problem.
The downsides I’m thinking of are more subtle than “lists give would-be harassers a list of targets”, although that’s a potential downside, too. The less well-known downsides include the following:
Lists like this tend to spread around via certain social media platforms, and therefore they might only be seen by a small subset of the community they’re supposed to serve — the subset that happens to be active on those platforms. So, they may end up merely helping the women who are already visible become more visible (perhaps even more than they’d like), while not doing much to help those who could use the visibility more.
In my experience, more visible senior and mid-career women tend not to add themselves to such lists, for a variety of reasons: they’re busy with more important things; they don’t have anything to gain personally by being on the list (they’re already getting all the PC and speaking invitations that they want); they think the list isn’t intended for them; or some other reason. Then, someone who doesn’t already know that those senior and mid-career women exist may look at the list, see that it’s mostly names of junior people, and mistakenly conclude that there are no senior or mid-career women in the community.
Related to the above points, the list may just fail to achieve any kind of traction in the community. Then, people who aren’t already part of the community may see how short the list is and think, “Wow, there are that few women? How disappointing — maybe this community has even fewer women than I’d thought!”, when in fact the list undercounts the women who are there. In that sense, a short list may be even worse than no list.
All of these downsides are mitigated by publicizing the list widely and getting it to grow. How big should we expect such a list to get? For comparison, there’s a public opt-in list of women active in machine learning with over a thousand names on it. Admittedly, that list isn’t specific to researchers, and even if it were, the ML research community is bigger than PL/SE research. Nevertheless, I feel like we easily ought to be able to accumulate, say, a hundred names of women active in PL/SE research. (There are more than that, but I’d be happy to see a hundred names as a start.)
I’d particularly love to see more senior women add their names to the list. I’ve noticed a phenomenon where someone who’s outside the research community, if asked to name women PL researchers, will list people in two categories: (a) the really famous, really senior superstars, who they know because they’re famous, and (b) early-career women, who they know because we hang out on Twitter or on the non-academic speaking circuit. A third category, mid-career women, tends to go unnoticed, and that irritates me. I believe that research-community-created lists like this have the potential to help fix that problem, but they also have the potential to exacerbate it if mid-career women’s names never get added. It’s my hope that by advertising the list in places that aren’t Twitter, I can prevent that.