Implementing Bounded Staleness

One of the consistency choices that Pileus supports is bounded staleness, which guarantees that the version returned by a read operation is no more than t seconds out-of-date.  In other words, if the latest version was produced more than t seconds ago, then the reader is guaranteed to receive it; if the object was updated in the last t seconds, then the reader may or may not receive the latest version, but, in any case, is guaranteed to at least receive the version that was the latest as of t seconds ago.  The first paper to mention such a staleness guarantee, as far as I know, is “Data Caching Issues in an Information Retrieval System” by Alonso, Barbara, and Garcia-Molina (ACM Transactions on Database Systems, September 1990).

The simple way to implement bounded staleness, assuming a primary update scheme, is as follows.  Each secondary replica periodically syncs with the primary and records the time of its last sync.  A client reading data can determine whether a secondary replica is sufficiently up-to-date by (1) subtracting the desired staleness bound from the current time to get an acceptable read time and (2) then checking whether the resulting acceptable read time is earlier that the replica’s last sync time.  That is, a replica can serve the read request while meeting the bounded staleness guarantee as long as the replica has synchronized directly (or indirectly) with the primary sufficiently recently.

Note that this is a pessimistic implementation since it may disallow reading from replicas that are, in fact, sufficiently up-to-date for the object being read.  For example, many objects are written once and never again updated.  Suppose that the latest version of the object being read was produced months ago, then it could be read from any replica while guaranteeing any reasonable staleness bound. Unfortunately, clients cannot assume anything about update patterns.  Thus, they make pessimistic decisions about suitable replicas based on the last sync times (or else they would need to contact the primary to discover the timestamp of the latest version, which reduces the benefit of reading from secondary replicas).

However, this implementation technique assumes (a) that the client’s clock is loosely synchronized with those of the servers, and (b) that the servers report their last sync times.  Neither of these are generally good assumptions.  Consequently, we developed a different way of enforcing bounded staleness in the Pileus system.

In Pileus, servers use logical clocks.  Thus, the timestamps that are assigned to versions and the high timestamps that are maintained and reported by servers are not directly useful for determining those servers that can meet a staleness bound.  However, here’s a client-side technique that works.

Each client maintains a mapping from real time (according to the client’s clock) to logical time (according to the primary server’s logical clock).  Specifically, whenever a client accesses the primary server to perform a Get or Put operation, the server returns its current high timestamp (as a logical clock value).  The client records its current time, taken from the local real-time clock, along with the server’s high timestamp.  To get a conservative but usable mapping, the client records its clock at the start of the Get request rather than at the receipt of the response (or it could measure the round-trip time and choose the mid-point of the time interval as in some clock synchronization algorithms).  This mapping is a little off due to the message transmission delay, but that should be a small amount relative to staleness bounds.

This mapping table need not be large since it does not need to record every interaction with the primary.  One entry per minute, for instance, is probably sufficient.  Basically, it needs to cover the range of times that are likely to be mentioned in a staleness bound and in sufficient granularity.  Older entries in the table can be discarded once their logical timestamps fall below the minimum high time of all servers (or all servers that the client is likely to access, such as those that are closer than the primary).

To determine which servers can meet a staleness bound, the client computes the minimum acceptable read time = now – bound.  The client then looks up this timestamp in its table to find the associated logical time.  It uses the next highest recorded time as an approximation.  For example, suppose that the client has entries for 8:00 and 8:05, but wants a read time of 8:03; it can use the 8:05 entry as a conservative bound.  The client then chooses from among the servers with a high timestamp that is greater than the recorded logical time.

Consider a scenario in which a client wants bounded staleness with a bound of 5 minutes and ideally would like to read from its local server.  Suppose this client knows that the local server is currently less than 5 minutes out-of-date.  This client can perform all of its reads locally.  This client can also ping the primary once per minute to retrieve its current logical high time (or piggyback such pings on regular write operations).  If the local server syncs with the primary frequently enough, say once every few minutes, and the client performs reads frequently enough, then the client would get the local server’s high time in response to a read operation and always determine that the local server was okay for its next read.  If the local server syncs too infrequently, say once every 15 minutes, then the client would eventually discover, based on its local mapping table, that the local server it too out-of-date.  At this point, it would need to switch to a different server, such as the primary.  After the local server syncs with the primary, the client would notice this since it periodically pings the local server.  The client could ten switch back to reading locally.

A key advantage of this scheme is that it assumes nothing about synchronized clocks.  Each client maintains its own mapping table based on its own local clock, which could be arbitrarily out of sync.  Such a technique would work even if servers used physical, but unsynchronized clocks.

A Pileus-on-Azure Demo

I just returned from Microsoft Research’s annual technology showcase (called “TechFest”).  For two days, with my colleague Rama Kotla, I was talking about and demoing the Pileus-on-Azure system to people from all over Microsoft.  It was fun, rewarding, and totally exhausting.

I showed a version of Pileus that runs on top of the Windows Azure Storage (WAS) key-value blob store.  In particular, we use WAS storage accounts as our basic units of storage.  Each such account replicates data on three servers within an Azure datacenter, which we call a “site”.  Our Pileus-on-Azure systems adds four main features to the basic WAS functionality.

One, we allow a WAS blob container to be replicated on any number of sites throughout the world. Specially, data can reside in any Azure region, of which there are currently at least ten spread over Europe, Asia, and the United States. One (or more) of the storage sites is designated as the primary, and all updates are performed at that site (or set of sites). Other sites are considered secondary sites. These receive updates lazily from the primary site. The intention is to allow data to be placed near where it is being accessed for clients that can tolerate relaxed forms of consistency.

For my demo, I used two Azure accounts with the primary residing in the West Europe region and the secondary in the West US region. I turned on Azure’s geo-replication. So this actually caused the data to be replicated in four datacenters. The West Europe account geo-replicated its data to the North Europe region, and the West US account geo-replicated to the East US region. This configuration resulted in each blob fully residing on 12 different servers in four different datacenters in three different countries. Perhaps, that’s a bit much, but, hey, it’s a demo.

Two, Pileus-on-Azure supports six different consistency choices: strong consistency, causal consistency, bounded staleness, read my writes, monotonic reads, and eventual consistency. I described the semantics of these in an earlier blog post (and in my recent CACM article which was also mentioned in a previous blog post and, from what I learned at TechFest, is now required reading for some Microsoft product groups). In contrast, WAS provides strong consistency by default, but recently started allowing clients read-only access to geo-replicated sites, thereby providing the option of eventual consistency reads (similar to Amazon’s DynamoDB or other commercial cloud storage systems). The reason to offer consistency choices is because of the inherent trade-offs between the guarantees offered by a consistency choice and the performance of read operations.

In the demo, I ran a client on a local machine in the Microsoft Convention Center that was performing read and write operations in accordance with the YCSB benchmark using the Pileus-on-Azure API, which essentially is identical to the API provided by the WAS client library. This client would periodically trigger synchronizations between the primary site in West Europe and the secondary in West US. The following screenshot from the demo shows the measured average response time for read operations using each of the six consistency choices:


You can see that choosing strong consistency results in reads that are seven times slower than eventual consistency reads for our given configuration. The other consistency choices fall somewhere in between. Causal consistency is less than a third the cost of strong consistency in this case, but the results are very workload dependent; note that I improved the causal consistency implementation and so these numbers are much better than ones we reported in our SOSP paper for an earlier version of Pileus. For bounded staleness, I chose a bound of 30 seconds for this demo, meaning that readers receive data that is at most 30 seconds out-of-date; even with such a tight guarantee, bounded staleness improves the read performance by a factor of two compared to strong consistency. The read my writes guarantee, which application developers often desire, is only slightly more expensive than eventual consistency. Monotonic reads behave exactly the same as eventual consistency reads since they both always access the nearby West US site; there is no reason for monotonic reads to go anywhere else unless this site becomes unavailable.

Three, Pileus-on-Azure allows client applications to express their consistency and latency desires in the form of a consistency-based service level agreement (SLA). Each such SLA consists of a sequence of acceptable consistency/latency pairs with associated utilities (a numerical value between 0 and 1). Thus, clients are not locked into a single consistency choice. A consistency-based SLA governs the execution of each read operation. The Pileus-on-Azure client library attempts to maximize the utility that is delivered to the client by selecting the site(s) to which each read operation is sent.

In the demo, the performance numbers for each of the six consistencies were generated by using six simple SLAs containing a single consistency choice and unbounded latency. More interestingly, I ran the YCSB workload using what I call the “fast or strong” SLA. In this SLA, the client prefers strong consistency provided it can receive read results in under 150 ms. If not, it will accept weaker consistency, such as bounded or eventual, for a fast response. If a fast response is not possible, then the client will wait up to 2 seconds for strongly consistency data. In other words, the client does not want a slow, stale response. Here’s what this SLA looks like in the demo:


While meeting this SLA, the client sends all of the read operations to the local West US site since the primary in Europe is not able to respond in under 150 ms. This results in an average delivered utility of just over 0.5. The client mostly receives eventually consistency data but occasionally receives data that meets the bounded staleness condition, i.e. the read occasionally returns a version of the requested blob that is known to be stale by no more than 30 seconds. The client does receive fast response times, and obtains acceptable but probably disappointing overall utility. Which brings me to the fourth part of the demo.

Four, Pileus-on-Azure includes a Configuration Service that can automatically select where data should be replicated. Each client that shares a blob container reports (1) the SLA(s) that it is using, (2) the reads and writes that it performed recently, and (3) its measured round-trip latencies to each of the potential storage sites. The configuration service collects this information from clients and uses it to generate and evaluate possible alternative configurations. It suggests what sites should store replicas of the container, what site(s) should be the primary, and how often secondary sites should synchronize with the primary. It considers constraints, such as limits on the number or locations of replicas. It also has a cost model based on Azure’s pricing scheme and considers the cost of new configurations, both the one-time cost of moving from the current to a new configuration and also the ongoing replication costs. It proposes the configuration that is expected to maximize the overall utility to the clients per unit of cost. Note that this configuration service was not described in our prior SOSP paper; this is more recent work done with an intern, Masoud Saeida Ardekani.

In the demo, I ask the configuration service to propose a new configuration. Not surprisingly, since throughout the demo there was only once client accessing the data, the configuration service makes the reasonable suggestion of moving the primary to the West US site. It also reports the expected gain in utility by changing to this new configuration and the expected cost. In this case, there is no increased cost since the number and location of replicas remains the same, with only the primary site changing:


When I click on the “Install New” button, the configuration service takes action. It first upgrades the West US site to a write-only replica so that this site receives new updates from clients. It forces a synchronization between West US and the current primary in West Europe so that the latest versions are in both places. Finally, it upgrades the West US site to be a read-write replica and downgrades the West Europe site to be a read-only secondary replica.


Now, in the demo, when I re-run the read-write-sync workload, you see that all of the consistency choices result in the same performance since they all read from the new West US primary (note that variations in the reported averages are caused solely by network and server load variations):


And, as expected, my “fast or strong” SLA delivers a utility of 1, meaning that the client always receives the top level of service:


That’s the demo! It ran for two straight days without crashing or experiencing any hiccups. Way to go Windows Azure! I can’t wait to see my Azure bill for this month.

On Leases for Primaries

I am not a fan of using leases to prevent clients from performing writes; it simply delays the writes and gives readers a false sense of strongly consistent data.  On the other hand, systems like Megastore and Spanner use leases quite effectively for leader election in Paxos.  In particular, a leader (or master or primary node) can acquire a lease from a majority and hold the lease for some period of time.  This allows the leader to know that it is the leader without continually checking with other nodes, even in the face of network partitions.

So, for example, the leader (e.g. primary replica) can serve read requests locally while maintaining strong consistency.  If the primary replica ends up in a minority partition, then it can continue to process read requests.  Though it will not be able to perform write operations since these usually involve a majority of replicas.  The replicas in the majority partition cannot do anything until the primary’s lease expires, perhaps 10 seconds.  At this point, they can then elect a new primary that accepts both read and write operations.  The old primary will realize that it is no longer the primary when it is unable to renew its lease.  This ensures that the system never ends up with two primaries *assuming* leases are managed properly.

What do we need to assume about clocks or message delivery bounds in order to manage leases?  In the simplest scheme, a node requests a lease that expires at a specific time.  In order for nodes to agree that the lease has expired, they need to agree on the current time.  This requires perfectly synchronized clocks.  When a node’s lease is about to expire, it sends a lease renewal to other nodes, and the lease is renewed upon receipt of acknowledgements from a majority.  This requires a bound on the message delays so that the node knows how far in advance to initiate the lease renewal protocol.

If clocks are not perfectly synchronized but have known bounds on the clock uncertainty (as in Google’s TrueTime service), then nodes can make conservative estimates of when a lease has expired.  Suppose the clock bound is d.  The node holding the lease should behave as though it has expired d time units before the lease expiration time, whereas other nodes should wait for d time units after the expiration.  This adds a delay of d to the recovery/reconfiguration time when the primary fails or is partitioned.

What if we have no bounds on clock drift and synchronization?  One approach is for the lease request to include a lease duration rather than an expiration time.  For example, a primary could request a least for 60 seconds (as in Spanner).  Each lease request also includes a unique sequence number.  The lease is granted when the requestor receives a majority of responses (with the given sequence number).  But for how long?  The requester records the time, according to its own clock, when it requested the lease.  The lease is good for the requested duration from this time.  In other words, the requestor behaves as though the lease was immediately granted by a majority of nodes but it took some time for the acknowledgements to reach the requestor.  Other nodes record the time at which they responded to the lease request.  They believe the lease to be valid from this time for its duration.  This guarantees that the requestor uses an expiration time that is before that used by the other nodes (assuming that clock drift is not excessive).

Message delays do not cause correctness problems, but they could cause a requestor to lose its lease because it did not renew the lease in time or because the requested duration was less than the message delay.  In practice, this should not be a problem as long as the variance in message delay is not too great.

Three Publications

I would like to bring to your attention three papers that I have recently published.

1.   “Replicated Data Consistency Explained Through Baseball” was published as an article in this month’s (i.e. the December issue of) Communications of the ACM.  I have given a number of popular talks based on this paper, which was first written a couple of years ago.  It describes the utility of various consistency choices (those presented in my last blog posting) using one simple example: maintaining and accessing the score of a baseball game.  If you don’t understand baseball, don’t fret.  It could really be any sport.  If you don’t like sports, that’s okay too, though there may be something wrong with you.  There’s also a videotape of my giving a lecture on this topic to Microsoft engineers.  Check it out at:

2.  “Consistency-based Service Level Agreements for Cloud Storage” was presented last month at the ACM Symposium on Operating Systems Principles (SOSP).  This describes the Pileus system that we’ve designed and built at MSR Silicon Valley.  This paper particularly focuses on the motivation, design, and evaluation of a scheme for incorporating consistency choices into the SLAs provided by a replicated key-value store.  The feedback thus far has been very positive; feel free to send me your thoughts.  See: or
The talk slides, and someday soon perhaps the talk video, are available on the conference web site at

3.  “Transactions with Consistency Choices on Geo-Replicated Cloud Storage” is a technical report that discusses the transaction mechanism we designed for Pileus.  We show how to provide snapshot isolation (or full serializability) for distributed transactions on a partitioned, replicated key-value store while allowing those transactions to access data with relaxed consistency.  The key trade-off is that accessing stale data, while providing better expected performance, may cause a transaction to abort.  This paper shows that it is possible to support general-purpose transactions and provide reasonable semantics in an eventually consistent system.  See:

On a more personal note, I want to mention that it was not possible for me to include a dedication on my CACM paper.  But, if I could, it would have read:

Dedicated to my father, Coach George Terry, who was a respected baseball coach for most of his career.  He taught me not only the fundamentals of baseball but also the fundamentals of life.  Through his love for baseball, he touched a great number of young men and women.

Have fun reading these papers and have a very happy holidays!

Defining Consistency Guarantees

It seems fashionable these days for cloud storage providers to offer consistency choices to their clients. Systems from Amazon, Google, Yahoo, and (soon) Microsoft provide both strong consistency and eventual consistency. For individual read operations, a developer may choose to receive strongly consistent or eventually consistent data. In some cases, such as the Pileus system that we’ve developed at Microsoft Research, intermediate consistency choices are also offered. But “eventual” consistency may mean different things in different systems, though often the service providers do not define it very precisely. This note explores what exactly is meant by a consistency choice and how we might specify precise guarantees for such choices.

Developers should be able to understand a consistency guarantee without knowing any details of the underlying implementation (replication protocols, system configuration, server locations, sync frequency, mean time to failure, etc.).

The consistency definitions that I’ve been using in talks are described in terms of the writes that are visible to a read operation. For example, the choices that we provide in the Pileus system are as follows:

  • Strong Consistency = Reads see all previous writes.
  • Eventual Consistency = Reads see subset of previous writes.
  • Consistent Prefix = Reads see an initial sequence of writes.
  • Bounded Staleness = Reads see all “old” writes.
  • Monotonic Reads = Reads see an increasing subset of writes.
  • Read My Writes = Reads see all writes performed by the reader.

But what exactly do these definitions mean?

I suggest that a consistency guarantee specifies the set of acceptable results that can be returned by a read operation. The size of this set determines the “strength” of the consistency; smaller sets implies stronger consistency.

Given complete knowledge of the previous writes and their order, along with an understanding of the semantics of read and write operations, a person should be able to determine the acceptable result-set. That’s still a lot for a developer to comprehend, but it’s better than needing to understand how the replicated system is implemented.

What about partially ordered writes? The consistency definitions simply talk about the set of visible writes. This set is independent of the write order (except for a consistent prefix).

However, we also need to say what happens with concurrent writes. A concurrent write for a given read operation is any write whose execution overlaps the execution of the read. It seems reasonable to say that the read *may*, but is not required to, see any concurrent writes.

Then, in order to determine what value(s) will/may be returned by a selected subset of writes, a person needs to understand how the system performs concurrent writes. One reasonable answer is that the system may perform them in any order. In other words, the system will (at least conceptually) execute writes in a serial order that is consistent with the partial order. So, the result of a read operation could be the value produced by *any* such serial order of writes.

Thus, the overall algorithm for determining the set of acceptable results for a given consistency is as follows:

  1. Given consistency + previous writes + partial write order + session state, determine set of acceptable write subsets.
  2. For each acceptable write subset produced in step #1, along with the partial write order, determine all possible serial orders of writes.
  3. For each ordered write subset produced in step #2, compute the resulting object value by applying the writes in that order.
  4. Take the union of all computed object values in step#3 as the set of acceptable return values.
  5. If multiple consistencies are being combined, take the intersection of their sets of acceptable return values.

Note that step #1 completely captures the essence of a consistency definition. It is the only step that depends on the consistency. Defining a new consistency is simply a matter of mapping the complete set of previous writes into one or more write subsets. A strong read (in the absence of concurrent writes) has a single acceptable write-set, the one containing all previously completed writes. An eventual read can see any subset of the previous writes, i.e. there could be a large number of acceptable write subsets. Intermediate consistency guarantees restrict the acceptable write subsets. With concurrent writes, all of the consistency guarantees, even strong consistency, have an increased number of acceptable write-subsets since each concurrent write may or may not be included in each subset.

Note that step #2 is not needed if writes are totally ordered. All of the other steps remain the same. Thus, understanding the consistency is a bit easier for ordered writes.

Note that step #5 should always return at least one value since the latest object value, i.e. the value that would be returned by a strongly consistent read, should be an acceptable return value for any consistency guarantee.

Importantly, note that all 5 steps can be done with no knowledge of how the storage system operates. One only needs to know the semantics of write operations, i.e. the effect of a write operation on the (unreplicated) system state, and the semantics of read operations, i.e. what a read will return given a particular system state. In any system, the semantics of reads and writes should be well documented.

The Impact of Eventual Consistency on Application Developers

Eventual consistency is commonly used in replicated cloud storage systems because it offers better performance and availability than strong consistency replication models.  But, at what cost?  Often I hear people say that eventual consistency places too great a burden on application developers.  I hear that it results in more complicated code.  However, I have seen little evidence of this and little written about this in the published literature.  Let’s explore some of the potential costs of eventually consistent storage systems, with a particular emphasis on how it affects developers who store application data in such systems.

Data Access Paradigms

I see three broad patterns of how applications read and write their data: Read-Display, Read-Modify-Write, and Generate-Write.  If you are aware of other widely used access patterns, please let me know.

Read-Display applications (or code segments) are those that retrieve data from a storage system and display it to their users.  The data may be manipulated or transformed in some way before being displayed, but mostly is shown directly to users.  Examples include social networking (e.g. Facebook, Foursquare, Twitter), photo sharing (e.g. Flickr), and news reading (e.g. New York Times online edition) applications.

Read-Modify-Write applications read one or more data objects, compute some new value for those objects based on their previous values, and then write out the new value.  Examples include document editors (e.g. Word and Excel) and any applications that maintain statistics, such as web sites that record the number of visitors or page clicks.

Generate-Write applications write data objects without reading them.  These are often called “blind” writes.  The data being written may come from external sources like users or sensors.  Examples include creating new files or data objects, appending records to a log, or depositing checks in a banking application.

Let’s consider each of these paradigms and the effect of eventual consistency on applications and application developers that use them.


For Read-Display applications, eventual consistency simply means that users may be presented with stale data since the application may read old values for data objects or the information that is displayed may be missing recently added objects.  In Facebook, for example, users may not see recent comments or postings from their friends.  In an e-mail application, a user may not see some messages that were recently delivered to his inbox.  In a shopping cart application, users may find, when checking out, that their cart is missing some items that were previously added or may still contain items that were added but then removed.

But how does this affect the application developer?  The code that he writes to read data from the storage system is the same regardless of whether that storage system provides strongly consistent or eventually consistent answers to read requests.  The code to display this data is also independent of the consistency guarantees provided by the storage system, though if the application is informed that it read stale or “tentative” data then it could potentially issue a warning of some sort to its user.

Basically, Read-Display application developers ignore the consistency of the data that is read and displayed; the burden of dealing with inconsistent data is placed on the end users.


Applications with Read-Modify-Write sequences, generally want to read up-to-date data before performing updates that depend on it.  Reading eventually consistent data can produce inaccurate results.  For example, consider a web site that maintains click counts.  It reads the current count, adds one to it, and then writes out a new value.  If the read operation returns stale data, then the new value written will fail to record some clicks.  That could lead to lost advertising revenue.

However, if the application programmer knows that he is dealing with an eventually consistent storage system, it is hard to imagine how he could change his code to account for this.  Essentially, the developer needs to select a storage system that provides stronger guarantees.  Some storage systems offer a choice of consistency guarantees, possibly including guarantees that lie between strong and eventual consistency, such as timeline consistency and bounded staleness.  For the example of recording click counts, a read-my-writes guarantee will likely be sufficient since all of the writes come from a single client.

Even strongly consistent reads may be inadequate if different clients are concurrently performing Read-Modify-Write sequences.  In this case, race conditions, where two clients read a value and then issue conflicting writes, may lead to incorrect results.  Two solutions are used in practice.  One is to obtain a lock on an object before reading and writing it, perhaps indirectly through a transaction mechanism.  Locking is problematic in an eventually consistent system since clients may disagree on who holds which locks.  Using a central lock manager (or a distributed locking protocol) yields consistent locking, but does not alleviate the need for strongly consistent reads.

Another option is to use optimistic concurrency control, i.e. conditional writes.  That is, each write operation has a precondition that checks whether the version being overwritten is the same as the version that was previously read and was used as the basis for the write; the write succeeds only if the precondition is met.  If a write fails, the application must re-execute the Read-Modify-Write code sequence.  Some eventually consistent storage systems allow clients to read from any replica but require write operations to be performed at a designated primary replica.  In such systems, optimistic concurrency control can be effective, even for eventually consistent reads.  Since all writes go to the primary, the primary holds strongly consistent data and checks each write’s precondition against this data.  This prevents conflicting writes.  The same code that a developer writes to prevent race conditions in a strongly consistent storage system can be used with an eventually consistent storage system.  The main difference, in practice, is that conditional writes are more likely to fail if based on eventually consistent reads.  Essentially, reading old data in an eventually consistent system is the same as reading up-to-date data from a strongly consistent system but waiting a long time before issuing a write that depends on this data.

Note that for eventually consistent systems in which any replica can accept write operations, i.e. “update anywhere” schemes, two replicas may accept conflicting updates even if conditional writes are used.  In such systems, additional steps are needed to resolve conflicting operations after-the-fact.  Conflict resolution may be a burden placed on end users, may be handled automatically by the storage system using built-in policies such as last-writer-wins, or may require developers to produce application-specific resolution code, such as Bayou’s mergeprocs or Coda’s resolvers.

In summary, Read-Modify-Write application developers need not write code to read, modify, and write data objects any differently when using an eventually consistent storage system, but they may need to write additional conflict resolution code.


Generate-Write code sequences, by definition, do not perform any reads.  Thus, they cannot possibly be affected by the read consistency model.  Thus, the application code should be independent from the consistency that is provided by the storage system.

However, application developers may need to be concerned about conflicting writes.  In general, application-specific integrity constraints may need to be enforced even for applications that perform blind writes.  For example, concurrent attempts to create two different files with the same name or two different data objects with the same primary key should be disallowed.  In a strongly consistent storage system, or an eventually consistent system where writes are performed at a single primary, the system can easily enforce integrity constraints.  The first of a set of conflicting writes will succeed, and the others will be rejected.  In an eventually consistent system with an “update anywhere” or “multi-master” model, programmers may need to provide conflict resolvers, as previously discussed.

Whether or not conflict resolution is required depends on how the storage system propagates and performs updates.  For example, consider an application that appends log records to a file.  Suppose that the system provides a special Append operation.  The application does not need to read the file before appending to it; it simply needs to issue append operations.  But what happens if two clients are concurrently appending to the same file and their append operations are performed at different replicas?  Do these clients need to provide conflict resolution procedures in order to ensure that the replicas eventually hold identical file contents?  It depends.  If replicas immediately perform the append operations that they receive directly from clients and then later exchange their file contents with other replicas, then undoubtedly a resolver will be needed to resolve the contents of conflicting files.  On the other hand, if replicas maintain and exchange operation logs, if they agree on the order of the operations in their replicated logs, and if they perform operations in log order (which may require rollback/redo for operations that arrive out-of-order as in Bayou), then no conflict resolution code is required for Generate-Write applications.  Similar scenarios arise for storage systems that provide Increment and Decrement operations, as might be used by a banking application.


Examining three commonly used data access paradigms indicates that application code changes very little, if at all, when using an underlying storage system that provides eventual rather than strong consistency.  The correctness of the program, however, may indeed be affected by the delivered read consistency.  And programmers may need to devote time to understanding how eventual consistency impacts the application’s behavior and correctness.  If the storage system offers a choice of consistency guarantees, then developers need to choose wisely.

Although conflict resolution is often cited as one of the inherent costs of eventual consistency, dealing with concurrent updates is required even in clients of strongly consistent storage systems. Optimistic concurrency control using conditional writes can be effective for both strongly and eventually consistent systems that use a primary update model.  Only in cases of multi-master updates do developers need to write conflict resolvers (or perhaps produce code that asks users to manually resolve conflicting versions).


What do little minds and large clouds have in common?  Ralph Waldo Emerson once said: “A foolish consistency is the hobgoblin of little minds.”  The same could be said of large clouds.  Cloud storage providers invariably replicate data for high availability, scalability, and other desirable properties.  One of the key design decisions they face is what consistency to offer their customers.  When one client writes some data and another client (or even the same client) attempts to read that data, what data do they get?  That’s the essence of consistency, or actually, inconsistency.

Many cloud storage systems provide something known as “eventual consistency”, meaning that read operations are not guaranteed to return the latest data that was written.  Eventually, all replicas will receive the latest data, and, hence, eventually, reads will return this data.  But, the storage providers make no promises about how long this will take.  Often, the well-known CAP theorem is used as the justification for relaxing the consistency guarantees.  It says that a system cannot simultaneously provide strong consistency, high availability, and partition tolerance.  Systems may choose to offer a weaker consistency model in exchange for better availability assuming that they must tolerate network partitions.

But the “P” in the CAP theorem should really stand for “performance”.   The real reason that systems relax consistency is to get better performance.  Eventually consistent reads, for example, can always access the closest replica, whereas strongly consistent reads must either access a restricted subset of replicas, such as a designated primary, or a group of replicas.  In a geo-replicated system where replicas are located across the globe, we’ve seen a factor of 100x difference in the performance of an eventually consistent read compared to a strongly consistent read.  That’s why many cloud storage systems, including Amazon’s DynamoDB, Yahoo’s PNUTS, and Oracle’s NoSQL Database, offer clients consistency choices.

In future blog posts, I plan to explore the issues of replicated data consistency and cloud storage systems.  It should be fun.  Welcome!


Get every new post delivered to your Inbox.