Earlier this week, I gave a talk on LVars at RICON West, a distributed systems industry conference. It covered a lot of the same material as my talk at FHPC last month, but since my speaking slot at RICON was a little longer, I used the extra time to try to explain why a programming languages academic was speaking at a distributed systems industry conference, anyway. This post covers some of the same material in written form.
The short version: The mathematical framework of conflict-free replicated data types that distributed systems researchers have developed for reasoning about and enforcing the eventual consistency of distributed replicas is almost exactly the framework that we use in our own work on LVars for reasoning about and enforcing the determinism of parallel computations.
Update (December 13, 2013): A video of my RICON talk is now available. Thanks to everyone at RICON for being an amazing audience for this talk!
Replication and eventual consistency
Distributed systems typically involve replication of data across a number of physical locations. There are lots of reasons why we might want to do this. Maybe we want the system to be robust to some part of it failing; if multiple replicas of our data exist, we’re less likely to lose data if part of our system fails. Maybe we want the data to be close to who needs it, so we replicate it across different locations for better availability. Or maybe we’re doing this to have more parallelism: if different processes are going to be simultaneously operating on the same data, they’re probably going to each need their own copy. Most likely, it’s a combination of all of these reasons and more.
It would be great if this system of distributed replicas could behave indistinguishably from the model a lot of us are used to programming with, in which all our data and all our computation are on one machine. Unfortunately, this is not the case.
The CAP theorem is a famous principle of distributed systems design. There are lots of ways to formulate this, but it’s something like the following: the C in CAP stands for consistency, and a perfectly consistent system would be one in which every replica always sees the same information. The A stands for availability, which means that all information is always available to every replica — available for both reading and writing. And, as if consistency and availability weren’t lofty enough goals already, the P is for partition tolerance, which means that the system is robust to parts of the network being unable to communicate with each other. The lie-to-children version of the CAP theorem is, “Consistency, availability, and partition tolerance: pick at most two.”
In practice, that’s something of an oversimplification. For one thing, we can’t really pick consistency and availability. That’s because real systems just are going to have network partitions sometimes, and therefore just do have to be partition-tolerant. So we have to compromise on at least one of consistency or availability.
But the other reason why this “pick two of C-A-P” thing is an oversimplification is that these properties aren’t binary. We don’t have, for instance, either perfect availability or no availability at all. Instead, we have degrees of availability. Rather choosing one or the other of availability or consistency, then, we can decide how much availability we need to have, then back off on consistency accordingly. For instance, we might give up on strict consistency in favor of eventual consistency, in which replicas may not always agree, but if updates stop arriving, all replicas will eventually come to agree given enough time.
Resolving conflicts between replicas
How are we going to ensure that all replicas come to agree? In particular, if replicas differ, how do we know which one is right?
As a straw man proposal, we could vacuously satisfy the definition of eventual consistency by just setting them all to some pre-determined value. But then, of course, we’d lose all of the writes we’ve done!
As a perhaps slightly less horrible proposal, we could try to figure out which of the writes happened last, and then have the last writer win. But this isn’t such a great solution, either. Even if we had a way of perfectly synchronizing clocks between replicas, just having the last writer win might not make sense from a semantic point of view.
The people who worked on Dynamo, Amazon’s distributed key-value store, acknowledged this in their 2007 paper when they wrote about application-specific mechanisms for resolving conflicts between replicas:
The next design choice is who performs the process of conflict resolution. This can be done by the data store or the application. If conflict resolution is done by the data store, its choices are rather limited. In such cases, the data store can only use simple policies, such as “last write wins”, to resolve conflicting updates. On the other hand, since the application is aware of the data schema it can decide on the conflict resolution method that is best suited for its client’s experience. For instance, the application that maintains customer shopping carts can choose to “merge” the conflicting versions and return a single unified shopping cart.
In other words, we can take advantage of the fact that, for a particular application, we know something about the meaning of the data we’re storing, and then parameterize the data store by some pluggable, application-specific conflict resolution operation.
This notion of application-specific conflict resolution is not without its problems, especially if implemented in an ad-hoc way. However, it turns out that we don’t have to do it in an ad-hoc way, because there exists a nice mathematical framework for reasoning about such things. This is what the work on conflict-free replicated data types, or CRDTs, is all about.
We can start by defining a partial ordering on the states that an object in the data store can be in. For instance, if we represent a cart as a set of items to be purchased, then carts might be ordered by subset inclusion.
Suppose that one replica thinks I have a pair of shoes in my cart, while another thinks that I only have a book in my cart. Each of these carts is represented by a singleton set, and neither one is “greater than” the other in the ordering we’ve defined. However, we can merge the two replicas by taking their least upper bound, which is just the smallest cart state that’s greater than or equal to both of them in the ordering — in this case, it’s the cart represented by the two-element set containing both the book and the shoes. A partially ordered set in which every two elements have a least upper bound is known as a lattice1, and the least-upper-bound operation is exactly the aforementioned “pluggable, application-specific conflict resolution operation”. Moreover, if a least upper bound exists, then it’s unique, and therefore the conflict resolution operation is deterministic.
As it turns out, this mathematical framework for reasoning about and enforcing the eventual consistency of distributed replicas is almost exactly the framework that we use in our own work on LVars for reasoning about and enforcing the determinism of parallel computations.
How did I find out about this?
Back in July 2012, I was working at Mozilla Research as an intern on Rust. At the same time, my advisor Ryan and I were working furiously to try to get our first paper about LVars submitted to a conference. Around this time, a friend of Ryan’s happened to send us a link to a paper that, when I followed up on it, turned out to be the brand-new-at-the-time “Logic and Lattices for Distributed Programming” tech report by Neil Conway and his collaborators from the BOOM group at Berkeley, and was about BloomL, their new extension to the Bloom programming language. BloomL is a really cool project that combines ideas from CRDTs and monotonic logic.
It had never occurred to me to look at the distributed database literature for work related to LVars — but there I was, on the day before our paper deadline, staring at the dated-two-weeks-previously BloomL tech report and reading sentences that could have come directly from the paper that we were about to submit! Essentially, the generalization that Neil and his collaborators made from Bloom to BloomL was precisely analogous to the generalization from IVars to LVars that we were making.
At that moment, my Mozillian co-worker Niko Matsakis walked by, saw that I was reading the BloomL TR, and offered to put me in touch with a friend of his who had worked on Bloom. Niko introduced me to his friend, who introduced me to Neil and Joe Hellerstein, and they invited me to come to Berkeley in August to give a talk. This was terrifying — I was thinking, Oh, no, how can I talk to these people; I don’t know anything about distributed systems; what are vector clocks I don’t even. But, in fact, it was great! The Berkeley people turned out to be our biggest fans, and I think they were actually the first to cite our LVars work.
While we were still trying to get our first LVars paper published, I started hearing from other researchers who were working on CRDTs. Then, in June, I got an email from Sam Elliott, who was on the team that was implementing support for CRDTs in Riak, the distributed database made by Basho, the company that sponsors RICON. Sam told me that he wanted to point some co-workers to one of my blog posts about LVars, but a hosting glitch had led to my blog being offline for a few days, so he was writing to complain about the outage. It was pretty much the coolest outcome possible from having unreliable web hosting. (In fact, it suggests a strategy for finding out who’s reading your stuff: take it down for a little while without telling anyone, and see who you hear from.) It was this conversation with Sam that incited me to submit a talk proposal to RICON about LVars.
For the last piece of my Ph.D. work, I want to shamelessly steal ideas from the distributed systems community. I want to see if any lessons from CRDTs or Dynamo or Bloom or Riak can help us improve the LVars model. It would be great to have something like “distributed LVish” someday. In the short term, though, I think that the work on CRDTs can give us a more interesting menu of LVar-based data structures to choose from — for instance, they’ve come up with various tricks for implementing CRDTs that support non-monotonic operations, like sets that support removal of elements. On the other hand, I’m curious to see what it would mean to port the idea of LVar threshold reads to the setting of CRDTs and whether that would be feasible and useful.
Actually, it only has to be a join-semilattice, because we don’t care about greatest lower bounds. ↩