Thoughts on CRDTs and threshold reads

Update (November 30, 2013): I updated this post to reflect some of my current thinking on threshold reads for CRDTs and why I think high-water-mark reads aren’t necessarily the way to go.

Since getting back from RICON, I’ve been thinking about what LVar-style threshold reads would mean for CRDTs.

One way in which CRDTs differ from LVars is that some CRDTs support non-monotonic operations. For instance, OR-Sets and 2P-Sets support removal of elements, while PN-Counters can count down as well as up. Combining threshold reads with non-monotonic operations like removal of elements from a set would seem to pose a problem for determinism. If only monotonic operations are allowed, as is the case for LVars, then the answer to the question of whether or not a given threshold read will unblock is the same from run to run (although it may unblock at different times, depending on scheduling). But if the state of the data structure being queried were allowed to move both up and down in a lattice, then that would no longer be the case. For instance, if a read were thresholded on an element that was added to and later removed from a set, then it might block forever, or not, depending on whether the read ran after or before that element’s removal. Hence some schedule nondeterminism would be exposed.

As Sam Elliott recently pointed out to me, there is a sense in which a non-monotonic CRDT — say, an OR-Set — grows monotonically with respect to a lattice. The trouble is that that lattice is the lattice of version numbers. (Or vector clocks, or dotted version vectors, or some such thing — but it amounts to the same idea, I think.) We could deterministically threshold on a version number — “Unblock when you’re at or above version 5!” — but it’s not such a good idea. For one thing, it exposes an implementation detail to the user. Worse, though, it would return something that isn’t at all meaningful or interesting for them. “Congratulations! You are now at or above version 5!” Gee, thanks.

So, if we want to be able to do deterministic threshold reads from OR-Sets or other CRDTs that allow non-monotonic operations, threshold reads on version numbers or their ilk aren’t particularly appealing. Let’s consider some other possibilities.

Threshold reads as high-water marks

One possibility is to threshold directly on state that’s changing non-monotonically, but change the semantics of threshold reads to reveal even less information than they do now.

In the existing LVars model, when a threshold read returns, it’s a guarantee that the contents of the LVar being read are at or above a certain point in the lattice: “The contents of this data structure are at least x.” That’s a quite strong guarantee. But, for non-monotonic data structures, we could, instead, have the following guarantee: “There exists a schedule under which this data structure could have at one time contained x.”

That’s a much weaker property, but there are still situations where it might be useful — like, for instance, if we want to know the “high-water mark” of some sort of time series data. Changing the semantics of threshold reads in this way would allow thresholding on the state of a data structure whose states form a lattice but for which the state does not necessarily change monotonically with respect to the lattice. This doesn’t get us all the way to handling CmRDTs directly, because there’s no built-in ordering on the states of a CmRDT, so we would have to impose one. But, unlike with a CvRDT, we wouldn’t have to require monotonic change with respect to that ordering — we would only need the ordering to exist, as a prerequisite to being able to define legal threshold sets.

Thresholding on the emulating CvRDT

The purpose of high-water-mark threshold reads would be to let us deterministically threshold on state that changes non-monotonically. There’s no reason, then, to do high-water-mark threshold reads on CvRDTs, since they change monotonically anyway — you’d only need it for CmRDTs. But wait! You can always do a monotonic CvRDT emulation of a CmRDT. So why do high-water-mark threshold reads at all? Why not just do a normal threshold read on the emulating monotonic CvRDT?

Admittedly, if you can only threshold on the emulating data structure, the information you can get out of it is not necessarily very interesting. For example, suppose we used the standard trick of encoding a set that supports additions and removals as two grow-only sets, one for additions and one for removals. Then we could write a deterministic threshold read that would unblock when, say, three elements had been removed from the set. It would be conceptually easy to implement such a threshold read — we would just threshold in the usual way on the removals set and ignore the additions set — but it couldn’t tell us much about the current set contents.

However, I would argue that the information you could get from a high-water-mark threshold read on the same data structure is also not very interesting — not to mention the fact that we don’t even have any sense about how to implement high-water-mark thresholds, whereas we do know at least something about how to implement ordinary threshold reads (since we’ve actually done it for LVars!).

As long as I’m trying to extend CRDTs to support threshold reads, I think it makes more sense for me to focus on CvRDTs anyway, since they’re more directly related to LVars than CmRDTs are. So I’ll probably start by trying to extend CvRDTs with standard threshold reads, not the high-water-mark kind. After doing this, I should be able to prove some sort of consistency property about threshold reads. In fact, threshold reads on CvRDTs could be a nice way to blend strong consistency into an eventually consistent model in a pay-as-you-go fashion. (And without sacrificing write availability, only read availability!)

Once that is done, it’s a question of what kind of CvRDT emulation allows one to do the most interesting (and yet determinism-preserving) kinds of threshold reads on a specific CmRDT. There’s no free lunch, but for a particular data structure, some determinism-preserving reads are bound to be more useful than others. For instance, maybe it really is useful to know when more than n elements have been removed from a set, regardless of the set’s state otherwise; “Buffer foo.txt has shrunk a lot; auto save disabled in that buffer until next real save” comes to mind as a use case.