It seems fashionable these days for cloud storage providers to offer consistency choices to their clients. Systems from Amazon, Google, Yahoo, and (soon) Microsoft provide both strong consistency and eventual consistency. For individual read operations, a developer may choose to receive strongly consistent or eventually consistent data. In some cases, such as the Pileus system that we’ve developed at Microsoft Research, intermediate consistency choices are also offered. But “eventual” consistency may mean different things in different systems, and service providers often do not define it precisely. This note explores what exactly is meant by a consistency choice and how we might specify precise guarantees for such choices.
Developers should be able to understand a consistency guarantee without knowing any details of the underlying implementation (replication protocols, system configuration, server locations, sync frequency, mean time to failure, etc.).
The consistency definitions that I’ve been using in talks are described in terms of the writes that are visible to a read operation. For example, the choices that we provide in the Pileus system are as follows:
- Strong Consistency = Reads see all previous writes.
- Eventual Consistency = Reads see a subset of previous writes.
- Consistent Prefix = Reads see an initial sequence of writes.
- Bounded Staleness = Reads see all “old” writes.
- Monotonic Reads = Reads see an increasing subset of writes.
- Read My Writes = Reads see all writes performed by the reader.
But what exactly do these definitions mean?
I suggest that a consistency guarantee specifies the set of acceptable results that can be returned by a read operation. The size of this set determines the “strength” of the consistency; a smaller set implies stronger consistency.
Given complete knowledge of the previous writes and their order, along with an understanding of the semantics of read and write operations, a person should be able to determine the acceptable result-set. That’s still a lot for a developer to comprehend, but it’s better than needing to understand how the replicated system is implemented.
What about partially ordered writes? The consistency definitions simply talk about the set of visible writes. This set is independent of the write order (except for the Consistent Prefix guarantee, which requires an initial sequence of writes).
However, we also need to say what happens with concurrent writes. A concurrent write for a given read operation is any write whose execution overlaps the execution of the read. It seems reasonable to say that the read *may*, but is not required to, see any concurrent writes.
Then, in order to determine what value(s) may be returned for a selected subset of writes, a person needs to understand how the system performs concurrent writes. One reasonable answer is that the system may perform them in any order. In other words, the system will (at least conceptually) execute writes in a serial order that is consistent with the partial order. So, the result of a read operation could be the value produced by *any* such serial order of writes.
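To make “any serial order consistent with the partial order” concrete, here is a small brute-force sketch (the write labels `w1`–`w3` are illustrative) that enumerates the valid serial orders of a set of writes:

```python
from itertools import permutations

def serial_orders(writes, happens_before):
    """Yield every total order of `writes` consistent with the partial order.
    `happens_before` is a set of (a, b) pairs meaning a must precede b."""
    for perm in permutations(writes):
        pos = {w: i for i, w in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in happens_before):
            yield perm

# w1 happens before w2; w3 is concurrent with both.
orders = list(serial_orders(["w1", "w2", "w3"], {("w1", "w2")}))
# Three valid serial orders: w3 may fall before, between, or after w1 and w2.
```

A concurrent write like `w3` multiplies the possible outcomes, which is exactly why concurrency enlarges the set of acceptable return values.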
Thus, the overall algorithm for determining the set of acceptable results for a given consistency is as follows:
1. Given the consistency + previous writes + partial write order + session state, determine the set of acceptable write subsets.
2. For each acceptable write subset produced in step #1, along with the partial write order, determine all possible serial orders of writes.
3. For each ordered write subset produced in step #2, compute the resulting object value by applying the writes in that order.
4. Take the union of all object values computed in step #3 as the set of acceptable return values.
5. If multiple consistencies are being combined, take the intersection of their sets of acceptable return values.
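The algorithm above can be sketched in code. This is a simplified model, assuming objects are single registers (last write wins) and showing only the two extreme consistency choices for step #1; the write labels and values are illustrative:

```python
from itertools import permutations, chain, combinations

def acceptable_values(consistency, writes, happens_before):
    """Steps #1-#4: the acceptable return values for one consistency choice.
    writes: list of (id, value) pairs; happens_before: set of (id, id) pairs."""
    ids = [w for w, _ in writes]
    # Step #1: map the consistency choice to its acceptable write subsets.
    if consistency == "strong":
        subsets = [set(ids)]  # the single subset of all previous writes
    elif consistency == "eventual":
        subsets = [set(s) for s in chain.from_iterable(
            combinations(ids, n) for n in range(len(ids) + 1))]  # any subset
    else:
        raise ValueError(consistency)
    value_of = dict(writes)
    results = set()
    for subset in subsets:
        # Step #2: all serial orders of the subset consistent with the partial order.
        for perm in permutations(subset):
            pos = {w: i for i, w in enumerate(perm)}
            if not all(pos[a] < pos[b] for a, b in happens_before
                       if a in subset and b in subset):
                continue
            # Step #3: apply the writes in order (last write wins for a register).
            value = None
            for w in perm:
                value = value_of[w]
            # Step #4: union of all computed object values.
            results.add(value)
    return results

writes = [("w1", "A"), ("w2", "B")]
order = {("w1", "w2")}
strong = acceptable_values("strong", writes, order)      # only the latest value
eventual = acceptable_values("eventual", writes, order)  # any of the possibilities
# Step #5: combining consistencies intersects their acceptable values.
assert strong & eventual == strong
```

Here a strong read yields only `"B"`, while an eventual read may return `None` (no writes seen), `"A"`, or `"B"`, which illustrates how the size of the result set tracks the strength of the guarantee.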
Note that step #1 completely captures the essence of a consistency definition. It is the only step that depends on the consistency. Defining a new consistency is simply a matter of mapping the complete set of previous writes into one or more write subsets. A strong read (in the absence of concurrent writes) has a single acceptable write-set, the one containing all previously completed writes. An eventual read can see any subset of the previous writes, i.e. there could be a large number of acceptable write subsets. Intermediate consistency guarantees restrict the acceptable write subsets. With concurrent writes, all of the consistency guarantees, even strong consistency, have an increased number of acceptable write-subsets since each concurrent write may or may not be included in each subset.
Note that step #2 is not needed if writes are totally ordered. All of the other steps remain the same. Thus, understanding the consistency is a bit easier for ordered writes.
Note that step #5 should always return at least one value since the latest object value, i.e. the value that would be returned by a strongly consistent read, should be an acceptable return value for any consistency guarantee.
Importantly, note that all 5 steps can be done with no knowledge of how the storage system operates. One only needs to know the semantics of write operations, i.e. the effect of a write operation on the (unreplicated) system state, and the semantics of read operations, i.e. what a read will return given a particular system state. In any system, the semantics of reads and writes should be well documented.
Eventual consistency is commonly used in replicated cloud storage systems because it offers better performance and availability than strongly consistent replication schemes. But, at what cost? Often I hear people say that eventual consistency places too great a burden on application developers. I hear that it results in more complicated code. However, I have seen little evidence of this and little written about this in the published literature. Let’s explore some of the potential costs of eventually consistent storage systems, with a particular emphasis on how eventual consistency affects developers who store application data in such systems.
Data Access Paradigms
I see three broad patterns of how applications read and write their data: Read-Display, Read-Modify-Write, and Generate-Write. If you are aware of other widely used access patterns, please let me know.
Read-Display applications (or code segments) are those that retrieve data from a storage system and display it to their users. The data may be manipulated or transformed in some way before being displayed, but mostly is shown directly to users. Examples include social networking (e.g. Facebook, Foursquare, Twitter), photo sharing (e.g. Flickr), and news reading (e.g. New York Times online edition) applications.
Read-Modify-Write applications read one or more data objects, compute some new value for those objects based on their previous values, and then write out the new value. Examples include document editors (e.g. Word and Excel) and any applications that maintain statistics, such as web sites that record the number of visitors or page clicks.
Generate-Write applications write data objects without reading them. These are often called “blind” writes. The data being written may come from external sources like users or sensors. Examples include creating new files or data objects, appending records to a log, or depositing checks in a banking application.
Let’s consider each of these paradigms and the effect of eventual consistency on applications and application developers that use them.
For Read-Display applications, eventual consistency simply means that users may be presented with stale data since the application may read old values for data objects or the information that is displayed may be missing recently added objects. In Facebook, for example, users may not see recent comments or postings from their friends. In an e-mail application, a user may not see some messages that were recently delivered to his inbox. In a shopping cart application, users may find, when checking out, that their cart is missing some items that were previously added or may still contain items that were added but then removed.
But how does this affect the application developer? The code that he writes to read data from the storage system is the same regardless of whether that storage system provides strongly consistent or eventually consistent answers to read requests. The code to display this data is also independent of the consistency guarantees provided by the storage system, though if the application is informed that it read stale or “tentative” data then it could potentially issue a warning of some sort to its user.
Basically, Read-Display application developers ignore the consistency of the data that is read and displayed; the burden of dealing with inconsistent data is placed on the end users.
Applications with Read-Modify-Write sequences generally want to read up-to-date data before performing updates that depend on it. Reading eventually consistent data can produce inaccurate results. For example, consider a web site that maintains click counts. It reads the current count, adds one to it, and then writes out a new value. If the read operation returns stale data, then the newly written value will fail to record some clicks. That could lead to lost advertising revenue.
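The lost-update hazard can be seen in a toy simulation (the replica class and its methods are illustrative, not a real storage API): the eventually consistent read returns a stale count, so the subsequent write silently discards earlier clicks.

```python
# Hypothetical model: writes go to a primary, reads may hit a lagging replica.
class ClickStore:
    def __init__(self):
        self.primary = 0   # up-to-date click count
        self.replica = 0   # lags behind the primary until the next sync

    def eventual_read(self):
        return self.replica          # may return stale data

    def write(self, value):
        self.primary = value         # all writes go to the primary

store = ClickStore()
store.write(10)                      # 10 clicks recorded; replica not yet synced

count = store.eventual_read()        # returns 0, the stale value
store.write(count + 1)               # writes 1, discarding the 10 earlier clicks
# The primary now holds 1 instead of the correct 11: a lost update.
```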
However, if the application programmer knows that he is dealing with an eventually consistent storage system, it is hard to imagine how he could change his code to account for this. Essentially, the developer needs to select a storage system that provides stronger guarantees. Some storage systems offer a choice of consistency guarantees, possibly including guarantees that lie between strong and eventual consistency, such as timeline consistency and bounded staleness. For the example of recording click counts, a read-my-writes guarantee will likely be sufficient since all of the writes come from a single client.
Even strongly consistent reads may be inadequate if different clients are concurrently performing Read-Modify-Write sequences. In this case, race conditions, where two clients read a value and then issue conflicting writes, may lead to incorrect results. Two solutions are used in practice. One is to obtain a lock on an object before reading and writing it, perhaps indirectly through a transaction mechanism. Locking is problematic in an eventually consistent system since clients may disagree on who holds which locks. Using a central lock manager (or a distributed locking protocol) yields consistent locking, but does not alleviate the need for strongly consistent reads.
Another option is to use optimistic concurrency control, i.e. conditional writes. That is, each write operation has a precondition that checks whether the version being overwritten is the same as the version that was previously read and was used as the basis for the write; the write succeeds only if the precondition is met. If a write fails, the application must re-execute the Read-Modify-Write code sequence. Some eventually consistent storage systems allow clients to read from any replica but require write operations to be performed at a designated primary replica. In such systems, optimistic concurrency control can be effective, even for eventually consistent reads. Since all writes go to the primary, the primary holds strongly consistent data and checks each write’s precondition against this data. This prevents conflicting writes. The same code that a developer writes to prevent race conditions in a strongly consistent storage system can be used with an eventually consistent storage system. The main difference, in practice, is that conditional writes are more likely to fail if based on eventually consistent reads. Essentially, reading old data in an eventually consistent system is the same as reading up-to-date data from a strongly consistent system but waiting a long time before issuing a write that depends on this data.
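The retry structure for conditional writes is the same whether the reads are strong or eventual; a minimal single-process sketch (the `versioned_read` and `conditional_write` calls are hypothetical, standing in for whatever a real system provides) looks like this:

```python
# Sketch of optimistic concurrency control via conditional writes.
class Store:
    def __init__(self):
        self.value, self.version = 0, 0

    def versioned_read(self):
        return self.value, self.version

    def conditional_write(self, new_value, expected_version):
        # The precondition is checked at the primary against up-to-date state.
        if self.version != expected_version:
            return False             # a conflicting write got there first
        self.value, self.version = new_value, self.version + 1
        return True

def increment(store):
    while True:                      # re-execute the Read-Modify-Write on failure
        value, version = store.versioned_read()
        if store.conditional_write(value + 1, version):
            return

store = Store()
for _ in range(3):
    increment(store)
```

Stale reads only make the `conditional_write` call fail and retry more often; the application code itself is unchanged.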
Note that for eventually consistent systems in which any replica can accept write operations, i.e. “update anywhere” schemes, two replicas may accept conflicting updates even if conditional writes are used. In such systems, additional steps are needed to resolve conflicting operations after-the-fact. Conflict resolution may be a burden placed on end users, may be handled automatically by the storage system using built-in policies such as last-writer-wins, or may require developers to produce application-specific resolution code, such as Bayou’s mergeprocs or Coda’s resolvers.
In summary, Read-Modify-Write application developers need not write code to read, modify, and write data objects any differently when using an eventually consistent storage system, but they may need to write additional conflict resolution code.
Generate-Write code sequences, by definition, do not perform any reads. Thus, they cannot possibly be affected by the read consistency model, and the application code should be independent of the consistency that is provided by the storage system.
However, application developers may need to be concerned about conflicting writes. In general, application-specific integrity constraints may need to be enforced even for applications that perform blind writes. For example, concurrent attempts to create two different files with the same name or two different data objects with the same primary key should be disallowed. In a strongly consistent storage system, or an eventually consistent system where writes are performed at a single primary, the system can easily enforce integrity constraints. The first of a set of conflicting writes will succeed, and the others will be rejected. In an eventually consistent system with an “update anywhere” or “multi-master” model, programmers may need to provide conflict resolvers, as previously discussed.
Whether or not conflict resolution is required depends on how the storage system propagates and performs updates. For example, consider an application that appends log records to a file. Suppose that the system provides a special Append operation. The application does not need to read the file before appending to it; it simply needs to issue append operations. But what happens if two clients are concurrently appending to the same file and their append operations are performed at different replicas? Do these clients need to provide conflict resolution procedures in order to ensure that the replicas eventually hold identical file contents? It depends. If replicas immediately perform the append operations that they receive directly from clients and then later exchange their file contents with other replicas, then undoubtedly a resolver will be needed to resolve the contents of conflicting files. On the other hand, if replicas maintain and exchange operation logs, if they agree on the order of the operations in their replicated logs, and if they perform operations in log order (which may require rollback/redo for operations that arrive out-of-order as in Bayou), then no conflict resolution code is required for Generate-Write applications. Similar scenarios arise for storage systems that provide Increment and Decrement operations, as might be used by a banking application.
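The operation-log approach can be sketched in a few lines: if replicas exchange the append operations themselves and replay them in an agreed-upon order, they converge on identical file contents with no application-level resolver. The timestamps here are illustrative, standing in for whatever ordering mechanism (e.g. Bayou-style log agreement) the system uses.

```python
def replay(ops):
    """Apply append operations in the agreed total order (here, by timestamp)."""
    contents = []
    for _ts, record in sorted(ops):
        contents.append(record)
    return contents

# Two clients append at different replicas, producing per-replica op logs.
replica_a_log = [(1, "log entry from client A")]
replica_b_log = [(2, "log entry from client B")]

# Replicas exchange operation logs (not file contents) and replay them.
merged = replica_a_log + replica_b_log
# Both replicas reach the same contents regardless of exchange order.
assert replay(merged) == replay(list(reversed(merged)))
```

If the replicas instead exchanged whole file contents after applying appends locally, the two versions would conflict and a resolver would be needed, as described above.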
Examining three commonly used data access paradigms indicates that application code changes very little, if at all, when using an underlying storage system that provides eventual rather than strong consistency. The correctness of the program, however, may indeed be affected by the delivered read consistency. And programmers may need to devote time to understanding how eventual consistency impacts the application’s behavior and correctness. If the storage system offers a choice of consistency guarantees, then developers need to choose wisely.
Although conflict resolution is often cited as one of the inherent costs of eventual consistency, dealing with concurrent updates is required even in clients of strongly consistent storage systems. Optimistic concurrency control using conditional writes can be effective for both strongly and eventually consistent systems that use a primary update model. Only in cases of multi-master updates do developers need to write conflict resolvers (or perhaps produce code that asks users to manually resolve conflicting versions).
What do little minds and large clouds have in common? Ralph Waldo Emerson once said: “A foolish consistency is the hobgoblin of little minds.” The same could be said of large clouds. Cloud storage providers invariably replicate data for high availability, scalability, and other desirable properties. One of the key design decisions they face is what consistency to offer their customers. When one client writes some data and another client (or even the same client) attempts to read that data, what data do they get? That’s the essence of consistency, or actually, inconsistency.
Many cloud storage systems provide something known as “eventual consistency”, meaning that read operations are not guaranteed to return the latest data that was written. Eventually, all replicas will receive the latest data, and, hence, eventually, reads will return this data. But, the storage providers make no promises about how long this will take. Often, the well-known CAP theorem is used as the justification for relaxing the consistency guarantees. It says that a system cannot simultaneously provide strong consistency, high availability, and partition tolerance. Systems may choose to offer a weaker consistency model in exchange for better availability assuming that they must tolerate network partitions.
But the “P” in the CAP theorem should really stand for “performance”. The real reason that systems relax consistency is to get better performance. Eventually consistent reads, for example, can always access the closest replica, whereas strongly consistent reads must either access a restricted subset of replicas, such as a designated primary, or a group of replicas. In a geo-replicated system where replicas are located across the globe, we’ve seen a 100x difference in the performance of an eventually consistent read compared to a strongly consistent read. That’s why many cloud storage systems, including Amazon’s DynamoDB, Yahoo’s PNUTS, and Oracle’s NoSQL Database, offer clients consistency choices.
In future blog posts, I plan to explore the issues of replicated data consistency and cloud storage systems. It should be fun. Welcome!