Lightweight Leases for Detecting Reconfigurations

In a replicated data storage system, how should clients deal with a possibly changing set of servers? Imagine a system, like Pileus, in which some number of servers are designated as “primary” replicas (i.e. they hold the latest data) and some are “secondary” replicas (i.e. they may hold stale data). The set of primary and secondary replicas, known as the “configuration”, can be stored in the cloud where it is accessible to all clients who need this information when deciding where to send their read and write operations. Occasionally, the configuration may change in response to failures, changing workloads, or other factors.

How do clients discover that the configuration has changed? For some operations, like reads that tolerate eventual consistency, clients can safely ignore reconfigurations. If a primary replica gets downgraded to a secondary, or a secondary replica is upgraded to a primary, that’s okay for an eventually consistent read. If a server gets taken out of service, the client will detect that and try a different server. But other operations, like writes and strongly consistent reads, require up-to-date knowledge of the current primary replica or set of primary replicas. Clients could read the current configuration before any such operations, but that would be prohibitively expensive. Instead, clients want to cache the configuration and rely on this local information being correct. This raises the problem of maintaining cache consistency.

We would like a configuration management scheme with the following properties:

  1. The master copy of the configuration resides in the cloud.
  2. Each client caches and uses its own local copy of the configuration, i.e. most client operations do not require additional cloud accesses.
  3. The complete set of clients is unknown and varies over time, i.e. contacting all clients to inform them of configuration changes is not possible.
  4. The configuration can be modified at any time, but not too frequently.
  5. Clients never violate consistency guarantees by using incorrect configuration data.

Leases are the obvious solution. In particular, each client obtains a “lease” on the current configuration to prevent it from changing (at least while the client is performing a read or write that relies on the configuration). Clients know that they can safely use their cached configuration data as long as they hold an unexpired/unbroken lease. When the reconfiguration service wishes to modify the configuration, i.e. add or remove replicas, it first breaks all of the client leases, grabs its own exclusive lock, updates the master copy in the cloud, and then releases its exclusive lock; clients then reacquire leases and fetch the new configuration.

This sounds simple enough, but how can it be made to work on top of Windows Azure Storage? That’s the practical issue that I faced when implementing Pileus-on-Azure. I considered three main options: using Azure’s leases, using ETags, or the option that I ended up choosing, which I call “lightweight leases.” Let’s look at each of these approaches.

Azure supports leases on blobs. Clients can Acquire, Renew, Change, Release, or Break a lease on each blob. So, the first option that I considered was to store the configuration in a special blob. Each client could acquire an infinite (or time-limited) lease on the configuration blob; clients would all need to use the same lease ID so that their acquire operations do not conflict. In other words, we can work around the fact that Azure does not directly support non-exclusive leases by having clients share the lease ID. The reconfiguration service would break the client lease when it wants to change the configuration, acquire the lease itself (using a unique lease ID), write a new configuration blob using this lease, and then release its lease. But how would a client know that its lease was broken? That’s a problem. Having clients renew their leases periodically is not sufficient if the reconfiguration service can break this lease at any time. Moreover, if the reconfiguration service fails after acquiring its own lease but before completing its update, the configuration blob will be left in a state where clients are unable to reacquire their own leases.

Azure also supports optimistic concurrency control on blobs using ETags. Specifically, an “ETag” is a string that is stored as a property of each blob and updated whenever the blob (or its metadata) is modified. You can think of the ETag as an update timestamp, though, for some reason, Azure also maintains separate timestamps. When reading a blob, the client receives the blob’s properties along with its contents, including the ETag for the returned version of the blob. A subsequent write or read can be performed on the condition that the ETag has not changed. ETags can be used in similar ways to leases. Clients could read the configuration blob and obtain an ETag. The reconfiguration service can simply write new data to the configuration blob, thereby producing a new ETag. Any subsequent attempts by clients to re-read the configuration using their previously obtained ETags will fail, thus informing the clients that the configuration has been updated. This is slightly better than the scheme using leases in that there are no leases to be broken and nothing left locked if a failure occurs. But it suffers from the same fundamental problem: clients need to periodically check for updated configurations, and these checks are insufficient if the configuration can be changed at any time.

Here’s the “lightweight lease” solution that we ended up using in Pileus-on-Azure. When the reconfiguration service wants to install a new configuration, it declares its intent by writing a reconfiguration-in-progress (RiP) flag to the configuration blob’s metadata. The reconfiguration service then waits T seconds before taking any action. Each client periodically reads the configuration blob’s metadata. If the RiP flag is not set, then the client knows that the configuration will not change for at least T seconds. In other words, each read operation that returns with the RiP flag unset renews the client’s configuration “lease” for a period of T seconds (from the time that the client started the read operation). Clients are then free to use their locally cached configuration for up to T seconds. If a client finds the RiP flag set, then it knows that its cached configuration is either out-of-date or will be soon. The client can then decide to wait for the new configuration or perhaps perform speculative operations using the possibly stale configuration. Meanwhile, the reconfiguration service, after having waited T seconds, knows that all clients are aware of the pending reconfiguration. Thus, it can now write the new configuration to the configuration blob and then clear the RiP flag. Clients continue to periodically read the configuration blob, and will eventually discover that the RiP flag is no longer set, which indicates that the client has reacquired the configuration lease.
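The client side of this protocol is small. Here's a minimal sketch; `read_rip_flag` and `fetch_config` are hypothetical stand-ins for reading the configuration blob's metadata and contents, and the lease period T is a placeholder value:

```python
import time

LEASE_SECONDS = 60  # T: the reconfiguration service's announced wait period


class ConfigCache:
    """Client-side 'lightweight lease' on the configuration (a sketch;
    read_rip_flag and fetch_config stand in for Azure metadata/blob reads)."""

    def __init__(self, read_rip_flag, fetch_config):
        self.read_rip_flag = read_rip_flag    # returns True if the RiP flag is set
        self.fetch_config = fetch_config      # reads the configuration blob
        self.config = None
        self.lease_expires = 0.0              # local clock time

    def refresh(self):
        started = time.monotonic()            # record time *before* the read
        if not self.read_rip_flag():
            # No reconfiguration pending: the lease is renewed for T seconds,
            # measured from when the read operation was started.
            if self.config is None:
                self.config = self.fetch_config()
            self.lease_expires = started + LEASE_SECONDS
        else:
            # Reconfiguration in progress: the cached copy is, or soon will
            # be, stale.
            self.config = None

    def current(self):
        if time.monotonic() >= self.lease_expires:
            self.refresh()
        return self.config  # None means "wait or operate speculatively"
```

While the lease is unexpired, `current()` returns the cached configuration with no cloud access at all; only the periodic metadata read costs anything.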

The only remaining concern is what happens if the reconfiguration service fails after setting the RiP flag but before clearing this flag. This leaves clients in the non-leased state indefinitely. But it is really easy to recover from this state. A client, upon deciding that it has been waiting too long, can simply clear the RiP flag itself. This will effectively abort any reconfiguration that is in progress. Suppose, for instance, that the reconfiguration service had not crashed but simply was behaving slowly. When the reconfiguration service set the RiP flag by writing new metadata, it obtained an ETag. The reconfiguration service then writes the new configuration blob conditional on this ETag. This write will fail if some client had impatiently cleared the RiP flag. The reconfiguration service could then restart the reconfiguration process. Isn’t that easy? This even works if multiple reconfigurations are happening concurrently (though starvation is possible).
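To make the recovery story concrete, here is a sketch of the reconfiguration service's side, using an in-memory stand-in for the configuration blob; the real system would use Azure's ETag conditional-write semantics rather than the hypothetical `write_if_match` shown here:

```python
import time


class PreconditionFailed(Exception):
    """Raised when a conditional write's ETag no longer matches."""


class ConfigBlob:
    """In-memory stand-in for the configuration blob with ETag-conditional
    writes (the real system uses Azure's If-Match semantics)."""

    def __init__(self, config):
        self.config, self.rip, self.etag = config, False, 0

    def set_rip(self):
        self.rip, self.etag = True, self.etag + 1
        return self.etag                      # ETag from the metadata write

    def clear_rip(self):                      # an impatient client may call this
        self.rip, self.etag = False, self.etag + 1

    def write_if_match(self, etag, new_config):
        if etag != self.etag:
            raise PreconditionFailed          # a client aborted the reconfiguration
        # Install the new configuration and clear the RiP flag.
        self.config, self.rip, self.etag = new_config, False, self.etag + 1


def reconfigure(blob, new_config, wait_seconds):
    etag = blob.set_rip()                     # declare intent
    time.sleep(wait_seconds)                  # let every client's lease expire
    try:
        blob.write_if_match(etag, new_config)
        return True
    except PreconditionFailed:
        return False                          # restart the reconfiguration
```

If a slow reconfiguration service is raced by a client that clears the RiP flag, the conditional write fails and `reconfigure` simply reports that it must start over, exactly the recovery path described above.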

This simple scheme works very well for Pileus and requires no actual leases. It does not require the reconfiguration service to know about clients. Since reconfigurations are rare, and since they do not need to be performed in a timely manner, clients can hold long-term leases. In other words, the lease period T can be on the order of minutes or even hours, and thus the overhead of periodically checking the RiP flag is small.


Implementing Causal Consistency

In my previous post, I discussed how to implement bounded staleness.  Today, I focus on techniques for implementing causal consistency.  Causal consistency appears to be popular in the academic research community, but not as popular in the commercial world.  In theory, it makes perfect sense.  But there’s a lot to grasp both in understanding the consistency guarantee and in implementing it efficiently in a replicated storage system.

What is causal consistency?

In short, Reads and Writes are partially ordered according to the happens-before relationship (as defined in Leslie Lamport’s seminal paper).  Imagine a graph in which the nodes are Read and Write operations and there is a directed edge from node A to node B if node A causally precedes node B.  A Read operation in this graph sees the effect of any Write operation for which there is a directed path from the Write’s node to the Read’s node.

I find it useful to break out three key properties:

1. Writes are causally ordered.

Example 1: A client performs Write(A) followed by Write(B).  Write(B) happened after Write(A), and hence should be ordered after it in all replicas.  “Write(A) < Write(B)” denotes that Write(A) causally precedes Write(B).

Example 2:  A client at site 1 performs Write(A), then sends a message to site 2.  After receiving this message, a client at site 2 then performs Write(B).  Again, Write(A) < Write(B).

2. Reads are causally ordered.

Example 3: A client performs Read(A) followed by Read(B).  Clearly, Read(A) < Read(B).

Example 4: A client at site 1 performs Read(A) and then sends a message to site 2.  A client at site 2 then performs Read(B).  Again, Read(A) < Read(B).

3. Reads see writes in causal order.

Example 5: A client performs Read(B) and sees the effect of Write(B) which may have been performed by a different client.  Write(B) causally precedes Read(B), i.e. Write(B) < Read(B).  Suppose that Write(A) causally precedes Write(B).  If this client now performs Read(A), it should see the effect of Write(A) since Write(A) < Write(B) < Read(B) < Read(A).

Example 6: A client at site 1 performs Read(B) and sees the effect of Write(B) which causally follows Write(A).  It then sends a message to site 2.  A client at site 2 then performs Read(A).  This client should see Write(A) since Write(A) < Write(B) < Read(B) < Read(A).

Some systems, such as COPS (from SOSP 2011) and Pileus (from my SOSP 2013 paper), redefine causal consistency in order to provide a practical implementation.  In particular, these systems assume that clients only communicate through the storage system (thereby eliminating Examples 2, 4, and 6 above).  Reads and Writes performed by the same client are ordered, of course, by the order in which they are performed (as in Examples 1 and 3).  Additionally, a Read operation is ordered after any Writes whose effects are visible to the read (as in Example 5).  Let’s call this “read/write causal consistency”.

How can a system provide read/write causal consistency?

COPS arranges replicas into clusters, where each cluster resides in a single datacenter.  Writes within a cluster are strictly ordered, resulting in linearizability (a stronger property than causal consistency).  Causal ordering between Writes is tracked by the client library in the form of dependencies.  Writes, when lazily propagated between clusters, carry their dependencies with them.  A cluster only commits a Write when all of its dependencies have already been committed in that cluster.  This ensures that replicas within each cluster remain causally consistent.  Reads are allowed to access only a single cluster.  This ensures that readers see a causally consistent state.

Pileus takes a different approach, one that does not require explicit dependency tracking and that does permit a client to read from different replicas in different datacenters.  Primary update schemes, as used in Pileus, automatically order Writes in causal order (since they are strictly ordered by when they are performed at the primary).  Each secondary replica in Pileus is causally consistent since replicas receive writes in order, i.e. in causal order.  The only complication then is ensuring causal consistency when a client performs reads at different secondary replicas over time.  For instance, a client is not allowed to read version V2 of some object from one replica and then later read version V1 from another replica.

Each client simply records all of the version timestamps for objects returned by its Read operations and all of the version timestamps produced by its Writes.  When performing a Read operation with a given key, the client computes the minimum acceptable read timestamp for this operation.  This timestamp is the max of all previous Reads by the client for any key and previous Writes for the same key.  The client can then read from any replica whose high timestamp (the maximum version timestamp of all Writes received by that replica) is greater than the minimum acceptable read timestamp; any such replica is sufficiently up-to-date.
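This client-side bookkeeping can be sketched as follows (the names and structure are my own, not Pileus's actual code; replicas are assumed to expose their high timestamps):

```python
class CausalClient:
    """Sketch of Pileus-style read/write causal consistency via timestamps."""

    def __init__(self):
        self.max_read_ts = 0     # max version timestamp over all prior Reads
        self.write_ts = {}       # key -> timestamp of this client's latest Write

    def record_read(self, version_ts):
        self.max_read_ts = max(self.max_read_ts, version_ts)

    def record_write(self, key, version_ts):
        self.write_ts[key] = max(self.write_ts.get(key, 0), version_ts)

    def min_acceptable_read_ts(self, key):
        # Max of all previous Reads (for any key) and previous Writes (for
        # this key).
        return max(self.max_read_ts, self.write_ts.get(key, 0))

    def acceptable_replicas(self, key, replica_high_ts):
        # Any replica whose high timestamp (max version timestamp of all
        # Writes it has received) reaches the floor is sufficiently up-to-date.
        floor = self.min_acceptable_read_ts(key)
        return [r for r, high in replica_high_ts.items() if high >= floor]
```

The client then simply issues the Read to whichever acceptable replica it prefers, e.g. the closest one.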

Consider the Example 5 given above.  Suppose the client performs Read(B) and it returns a version of object B with timestamp 42.  When the client then performs Read(A), it sets the minimum acceptable read timestamp to a value no less than 42.  This guarantees that Read(A) will be performed at a replica that has received all Writes with timestamps up to 42, and hence already holds the latest version of object A with a timestamp less than 42.  Since Write(A) causally preceded the Write(B) whose version was previously returned to the client, we know that it produced a version of object A with a timestamp less than 42.  Using minimum acceptable read timestamps thus ensures that the client will read this version of A (or perhaps a later version).

However, the use of minimum acceptable read timestamps, while avoiding dependency tracking, has a downside.  It leads to pessimistic assumptions about causal dependencies, and hence to overly conservative replica selection.  In this same example, suppose that Write(A) and Write(B) were not causally related.  For instance, they may have been performed by different clients with no intervening Reads.  In this case, when the client performs Read(A), it could receive any version of object A without violating causal consistency.  The client could read from any replica, including ones that hold arbitrarily stale versions.  But Pileus has no way of determining whether versions of A with timestamps less than 42 causally precede version 42 of object B (since it records no dependencies).  Essentially, Pileus pessimistically assumes that all Writes are causally related and causally ordered according to their version timestamps.

This technique produces correct behavior when there is a single primary.  When data is sharded across multiple primaries that independently assign version timestamps, problems may arise.  Consider our same Example 5 in which Write(A) causally precedes Write(B) but keys A and B are mastered at different primaries with different clocks.  It is possible for Write(B) to produce a version of B with timestamp 42 while Write(A) produced a version of A with a larger timestamp (maybe because the clock at A’s primary is faster than the one at B’s primary, or because timestamps are drawn from an update counter and A’s primary has accepted more Writes).  Now, setting the minimum acceptable timestamp to 42 does not ensure causal consistency.  The client could possibly read an old version of object A, say with timestamp 21.

To avoid this problem, the primaries need to maintain logical clocks so that all Writes are causally ordered.  That is, when a client performs a Write operation for a given key, it informs the responsible primary replica of the timestamp of the client’s latest preceding Write.  The new Write is then assigned a larger version timestamp.  This solves the correctness problem identified above in a simple way, and it does not require explicit tracking.  It still leads to pessimistic assumptions about causal dependencies.
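The timestamp assignment at a primary amounts to a Lamport-style logical clock update. Here is a hypothetical sketch, assuming each Write carries the timestamp of the client's latest preceding Write:

```python
class Primary:
    """Sketch of a primary replica that assigns version timestamps from a
    logical clock, advanced past the client's latest preceding Write."""

    def __init__(self):
        self.clock = 0

    def write(self, key, value, client_last_write_ts):
        # The new version's timestamp must exceed both this primary's clock
        # and the timestamp of the client's latest preceding Write, so all of
        # a client's Writes are causally ordered even across primaries.
        self.clock = max(self.clock, client_last_write_ts) + 1
        # ... apply the write for `key` with version timestamp self.clock ...
        return self.clock
```

With this rule, even if B's primary has a slower clock than A's, Write(B) is guaranteed a timestamp larger than Write(A)'s, so the minimum acceptable read timestamp works as intended.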

In summary, there are two ways to implement causal consistency.  One is to explicitly record causal dependencies between Reads and Writes.  This permits optimal replica selection, but takes space for storing the dependency information and leads to complicated procedures for garbage collecting this information.  The second approach, which I prefer, is to use timestamps as described above, which avoids the need for explicit dependency tracking.  But this can lead to overly pessimistic replica selection with the associated performance consequences.

Implementing Bounded Staleness

One of the consistency choices that Pileus supports is bounded staleness, which guarantees that the version returned by a read operation is no more than t seconds out-of-date.  In other words, if the latest version was produced more than t seconds ago, then the reader is guaranteed to receive it; if the object was updated in the last t seconds, then the reader may or may not receive the latest version, but, in any case, is guaranteed to at least receive the version that was the latest as of t seconds ago.  The first paper to mention such a staleness guarantee, as far as I know, is “Data Caching Issues in an Information Retrieval System” by Alonso, Barbara, and Garcia-Molina (ACM Transactions on Database Systems, September 1990).

The simple way to implement bounded staleness, assuming a primary update scheme, is as follows.  Each secondary replica periodically syncs with the primary and records the time of its last sync.  A client reading data can determine whether a secondary replica is sufficiently up-to-date by (1) subtracting the desired staleness bound from the current time to get an acceptable read time and (2) then checking whether the resulting acceptable read time is earlier than the replica’s last sync time.  That is, a replica can serve the read request while meeting the bounded staleness guarantee as long as the replica has synchronized directly (or indirectly) with the primary sufficiently recently.
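Steps (1) and (2) amount to a one-line check (a sketch; `replica_last_sync` and `now` are assumed to come from loosely synchronized clocks, the assumption this scheme depends on):

```python
import time


def can_serve(replica_last_sync, staleness_bound, now=None):
    """True if a replica may serve a read with the given staleness bound
    (in seconds), i.e. if it synced with the primary recently enough."""
    now = time.time() if now is None else now
    acceptable_read_time = now - staleness_bound       # step (1)
    return replica_last_sync >= acceptable_read_time   # step (2)
```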

Note that this is a pessimistic implementation since it may disallow reading from replicas that are, in fact, sufficiently up-to-date for the object being read.  For example, many objects are written once and never again updated.  If the latest version of the object being read was produced months ago, then it could be read from any replica while guaranteeing any reasonable staleness bound.  Unfortunately, clients cannot assume anything about update patterns.  Thus, they make pessimistic decisions about suitable replicas based on the last sync times (or else they would need to contact the primary to discover the timestamp of the latest version, which reduces the benefit of reading from secondary replicas).

However, this implementation technique assumes (a) that the client’s clock is loosely synchronized with those of the servers, and (b) that the servers report their last sync times.  Neither of these are generally good assumptions.  Consequently, we developed a different way of enforcing bounded staleness in the Pileus system.

In Pileus, servers use logical clocks.  Thus, the timestamps that are assigned to versions and the high timestamps that are maintained and reported by servers are not directly useful for determining those servers that can meet a staleness bound.  However, here’s a client-side technique that works.

Each client maintains a mapping from real time (according to the client’s clock) to logical time (according to the primary server’s logical clock).  Specifically, whenever a client accesses the primary server to perform a Get or Put operation, the server returns its current high timestamp (as a logical clock value).  The client records its current time, taken from the local real-time clock, along with the server’s high timestamp.  To get a conservative but usable mapping, the client records its clock at the start of the Get request rather than at the receipt of the response (or it could measure the round-trip time and choose the mid-point of the time interval as in some clock synchronization algorithms).  This mapping is a little off due to the message transmission delay, but that should be a small amount relative to staleness bounds.

This mapping table need not be large since it does not need to record every interaction with the primary.  One entry per minute, for instance, is probably sufficient.  Basically, it needs to cover the range of times that are likely to be mentioned in a staleness bound, and at sufficient granularity.  Older entries in the table can be discarded once their logical timestamps fall below the minimum high time of all servers (or all servers that the client is likely to access, such as those that are closer than the primary).

To determine which servers can meet a staleness bound, the client computes the minimum acceptable read time = now – bound.  The client then looks up this timestamp in its table to find the associated logical time.  It uses the next highest recorded time as an approximation.  For example, suppose that the client has entries for 8:00 and 8:05, but wants a read time of 8:03; it can use the 8:05 entry as a conservative bound.  The client then chooses from among the servers with a high timestamp that is greater than the recorded logical time.
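The mapping table and its conservative lookup might look like this (a sketch; the entry times and logical values in the test of the 8:00/8:05 example are illustrative):

```python
import bisect


class TimeMap:
    """Client-side mapping from local real time to the primary's logical
    clock, as described above (a sketch)."""

    def __init__(self):
        self.entries = []   # sorted list of (local_time, logical_time)

    def record(self, local_time, primary_high_ts):
        # Called when a Get/Put at the primary returns its high timestamp;
        # local_time should be taken at the *start* of the request so the
        # mapping errs on the conservative side.
        self.entries.append((local_time, primary_high_ts))

    def logical_floor(self, read_time):
        # Use the next highest recorded time as a conservative approximation
        # of the logical time corresponding to read_time.
        i = bisect.bisect_left(self.entries, (read_time, -1))
        if i == len(self.entries):
            return None     # no safe mapping; fall back to the primary
        return self.entries[i][1]
```

Given the staleness bound, the client computes `read_time = now - bound`, looks up `logical_floor(read_time)`, and then restricts itself to servers whose high timestamps are at least that logical value.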

Consider a scenario in which a client wants bounded staleness with a bound of 5 minutes and ideally would like to read from its local server.  Suppose this client knows that the local server is currently less than 5 minutes out-of-date.  This client can perform all of its reads locally.  This client can also ping the primary once per minute to retrieve its current logical high time (or piggyback such pings on regular write operations).  If the local server syncs with the primary frequently enough, say once every few minutes, and the client performs reads frequently enough, then the client would get the local server’s high time in response to a read operation and always determine that the local server was okay for its next read.  If the local server syncs too infrequently, say once every 15 minutes, then the client would eventually discover, based on its local mapping table, that the local server is too out-of-date.  At this point, it would need to switch to a different server, such as the primary.  After the local server syncs with the primary, the client would notice this since it periodically pings the local server.  The client could then switch back to reading locally.

A key advantage of this scheme is that it assumes nothing about synchronized clocks.  Each client maintains its own mapping table based on its own local clock, which could be arbitrarily out of sync.  Such a technique would work even if servers used physical, but unsynchronized clocks.

A Pileus-on-Azure Demo

I just returned from Microsoft Research’s annual technology showcase (called “TechFest”).  For two days, with my colleague Rama Kotla, I was talking about and demoing the Pileus-on-Azure system to people from all over Microsoft.  It was fun, rewarding, and totally exhausting.

I showed a version of Pileus that runs on top of the Windows Azure Storage (WAS) key-value blob store.  In particular, we use WAS storage accounts as our basic units of storage.  Each such account replicates data on three servers within an Azure datacenter, which we call a “site”.  Our Pileus-on-Azure system adds four main features to the basic WAS functionality.

One, we allow a WAS blob container to be replicated on any number of sites throughout the world. Specifically, data can reside in any Azure region, of which there are currently at least ten spread over Europe, Asia, and the United States. One (or more) of the storage sites is designated as the primary, and all updates are performed at that site (or set of sites). Other sites are considered secondary sites. These receive updates lazily from the primary site. The intention is to allow data to be placed near where it is being accessed for clients that can tolerate relaxed forms of consistency.

For my demo, I used two Azure accounts with the primary residing in the West Europe region and the secondary in the West US region. I turned on Azure’s geo-replication. So this actually caused the data to be replicated in four datacenters. The West Europe account geo-replicated its data to the North Europe region, and the West US account geo-replicated to the East US region. This configuration resulted in each blob fully residing on 12 different servers in four different datacenters in three different countries. Perhaps, that’s a bit much, but, hey, it’s a demo.

Two, Pileus-on-Azure supports six different consistency choices: strong consistency, causal consistency, bounded staleness, read my writes, monotonic reads, and eventual consistency. I described the semantics of these in an earlier blog post (and in my recent CACM article, which was also mentioned in a previous blog post and, from what I learned at TechFest, is now required reading for some Microsoft product groups). In contrast, WAS provides strong consistency by default, but recently started allowing clients read-only access to geo-replicated sites, thereby providing the option of eventually consistent reads (similar to Amazon’s DynamoDB and other commercial cloud storage systems). The reason to offer consistency choices is the inherent trade-off between the guarantees offered by a consistency choice and the performance of read operations.

In the demo, I ran a client on a local machine in the Microsoft Convention Center that was performing read and write operations in accordance with the YCSB benchmark using the Pileus-on-Azure API, which essentially is identical to the API provided by the WAS client library. This client would periodically trigger synchronizations between the primary site in West Europe and the secondary in West US. The following screenshot from the demo shows the measured average response time for read operations using each of the six consistency choices:


You can see that choosing strong consistency results in reads that are seven times slower than eventual consistency reads for our given configuration. The other consistency choices fall somewhere in between. Causal consistency is less than a third the cost of strong consistency in this case, but the results are very workload dependent; note that I improved the causal consistency implementation and so these numbers are much better than the ones we reported in our SOSP paper for an earlier version of Pileus. For bounded staleness, I chose a bound of 30 seconds for this demo, meaning that readers receive data that is at most 30 seconds out-of-date; even with such a tight guarantee, bounded staleness improves the read performance by a factor of two compared to strong consistency. The read my writes guarantee, which application developers often desire, is only slightly more expensive than eventual consistency. Monotonic reads behave exactly the same as eventual consistency reads since they both always access the nearby West US site; there is no reason for monotonic reads to go anywhere else unless this site becomes unavailable.

Three, Pileus-on-Azure allows client applications to express their consistency and latency desires in the form of a consistency-based service level agreement (SLA). Each such SLA consists of a sequence of acceptable consistency/latency pairs with associated utilities (a numerical value between 0 and 1). Thus, clients are not locked into a single consistency choice. A consistency-based SLA governs the execution of each read operation. The Pileus-on-Azure client library attempts to maximize the utility that is delivered to the client by selecting the site(s) to which each read operation is sent.
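One plausible encoding of such an SLA is an ordered list of (consistency, latency bound, utility) entries, most preferred first; the entries and utility values below are illustrative, not the demo's actual numbers:

```python
# A hypothetical "fast or strong"-style SLA: prefer strong consistency within
# 150 ms, accept weaker consistency for a fast answer, and only as a last
# resort wait up to 2 seconds for strong consistency.
FAST_OR_STRONG = [
    ("strong",       0.150, 1.00),
    ("bounded(30s)", 0.150, 0.70),
    ("eventual",     0.150, 0.50),
    ("strong",       2.000, 0.25),
]


def best_subsla(sla, achievable):
    """Return the first (highest-utility) subSLA the system can currently
    meet, where achievable(consistency, latency) predicts whether some
    replica can answer with that consistency within that latency."""
    for consistency, latency, utility in sla:
        if achievable(consistency, latency):
            return consistency, latency, utility
    return None
```

The client library's job on each read is, in effect, to pick the replica that maximizes the utility returned by a selection function like this one.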

In the demo, the performance numbers for each of the six consistencies were generated by using six simple SLAs, each containing a single consistency choice and unbounded latency. More interestingly, I ran the YCSB workload using what I call the “fast or strong” SLA. In this SLA, the client prefers strong consistency provided it can receive read results in under 150 ms. If not, it will accept weaker consistency, such as bounded staleness or eventual consistency, for a fast response. If a fast response is not possible, then the client will wait up to 2 seconds for strongly consistent data. In other words, the client does not want a slow, stale response. Here’s what this SLA looks like in the demo:


While meeting this SLA, the client sends all of the read operations to the local West US site since the primary in Europe is not able to respond in under 150 ms. This results in an average delivered utility of just over 0.5. The client mostly receives eventually consistent data but occasionally receives data that meets the bounded staleness condition, i.e. the read occasionally returns a version of the requested blob that is known to be stale by no more than 30 seconds. The client does receive fast response times, and obtains acceptable but probably disappointing overall utility. Which brings me to the fourth part of the demo.

Four, Pileus-on-Azure includes a Configuration Service that can automatically select where data should be replicated. Each client that shares a blob container reports (1) the SLA(s) that it is using, (2) the reads and writes that it performed recently, and (3) its measured round-trip latencies to each of the potential storage sites. The configuration service collects this information from clients and uses it to generate and evaluate possible alternative configurations. It suggests what sites should store replicas of the container, what site(s) should be the primary, and how often secondary sites should synchronize with the primary. It considers constraints, such as limits on the number or locations of replicas. It also has a cost model based on Azure’s pricing scheme and considers the cost of new configurations, both the one-time cost of moving from the current to a new configuration and also the ongoing replication costs. It proposes the configuration that is expected to maximize the overall utility to the clients per unit of cost. Note that this configuration service was not described in our prior SOSP paper; this is more recent work done with an intern, Masoud Saeida Ardekani.

In the demo, I ask the configuration service to propose a new configuration. Not surprisingly, since throughout the demo there was only one client accessing the data, the configuration service makes the reasonable suggestion of moving the primary to the West US site. It also reports the expected gain in utility from changing to this new configuration and the expected cost. In this case, there is no increased cost since the number and location of replicas remain the same, with only the primary site changing:


When I click on the “Install New” button, the configuration service takes action. It first upgrades the West US site to a write-only replica so that this site receives new updates from clients. It forces a synchronization between West US and the current primary in West Europe so that the latest versions are in both places. Finally, it upgrades the West US site to be a read-write replica and downgrades the West Europe site to be a read-only secondary replica.


Now, in the demo, when I re-run the read-write-sync workload, you see that all of the consistency choices result in the same performance since they all read from the new West US primary (note that variations in the reported averages are caused solely by network and server load variations):


And, as expected, my “fast or strong” SLA delivers a utility of 1, meaning that the client always receives the top level of service:


That’s the demo! It ran for two straight days without crashing or experiencing any hiccups. Way to go Windows Azure! I can’t wait to see my Azure bill for this month.

On Leases for Primaries

I am not a fan of using leases to prevent clients from performing writes; they simply delay the writes and give readers a false sense of strongly consistent data.  On the other hand, systems like Megastore and Spanner use leases quite effectively for leader election in Paxos.  In particular, a leader (or master or primary node) can acquire a lease from a majority and hold the lease for some period of time.  This allows the leader to know that it is the leader without continually checking with other nodes, even in the face of network partitions.

So, for example, the leader (e.g. primary replica) can serve read requests locally while maintaining strong consistency.  If the primary replica ends up in a minority partition, it can continue to process read requests, though it will not be able to perform write operations, since these usually involve a majority of replicas.  The replicas in the majority partition cannot do anything until the primary’s lease expires, perhaps 10 seconds later.  At that point, they can elect a new primary that accepts both read and write operations.  The old primary will realize that it is no longer the primary when it is unable to renew its lease.  This ensures that the system never ends up with two primaries *assuming* leases are managed properly.

What do we need to assume about clocks or message delivery bounds in order to manage leases?  In the simplest scheme, a node requests a lease that expires at a specific time.  In order for nodes to agree that the lease has expired, they need to agree on the current time.  This requires perfectly synchronized clocks.  When a node’s lease is about to expire, it sends a lease renewal to other nodes, and the lease is renewed upon receipt of acknowledgements from a majority.  This requires a bound on the message delays so that the node knows how far in advance to initiate the lease renewal protocol.
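This simplest scheme might look something like the following sketch (the class and method names are mine; it assumes perfectly synchronized clocks and a known bound on one-way message delay):

```python
class SyncClockLease:
    """Lease held until an agreed wall-clock expiration time.
    Assumes all nodes' clocks are perfectly synchronized."""

    def __init__(self, peers, max_msg_delay):
        self.peers = peers                  # the other nodes
        self.max_msg_delay = max_msg_delay  # bound on one-way delay
        self.expires_at = None

    def acquire(self, now, duration):
        """Request a lease expiring at a specific time; it is held
        once a majority of nodes (counting ourselves) acknowledge."""
        expires_at = now + duration
        acks = sum(1 for p in self.peers if p.grant(expires_at))
        if acks + 1 > (len(self.peers) + 1) // 2:
            self.expires_at = expires_at
            return True
        return False

    def holds_lease(self, now):
        # With perfectly synchronized clocks, every node evaluates
        # this expiration check identically.
        return self.expires_at is not None and now < self.expires_at

    def should_renew(self, now):
        """Start renewing early enough that a round trip completes
        before expiry; this is why a message-delay bound is needed."""
        return (self.holds_lease(now)
                and now >= self.expires_at - 2 * self.max_msg_delay)

class AlwaysGrant:
    """Trivial peer stub for illustration only."""
    def grant(self, expires_at):
        return True
```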

If clocks are not perfectly synchronized but have known bounds on the clock uncertainty (as in Google’s TrueTime service), then nodes can make conservative estimates of when a lease has expired.  Suppose the clock bound is d.  The node holding the lease should behave as though it has expired d time units before the lease expiration time, whereas other nodes should wait for d time units after the expiration.  This adds a delay of d to the recovery/reconfiguration time when the primary fails or is partitioned.
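In code, the conservative adjustment amounts to two comparisons, where d is the clock-uncertainty bound (function names are illustrative):

```python
def holder_must_stop(local_now, lease_expiry, d):
    """The lease holder stops acting as primary d time units
    before the nominal expiration time."""
    return local_now >= lease_expiry - d

def others_may_elect(local_now, lease_expiry, d):
    """Other nodes wait d time units past the nominal expiration
    before electing a new primary."""
    return local_now >= lease_expiry + d
```

Even with clocks that disagree by up to d, the holder is guaranteed to have stepped down before anyone else considers the lease expired; the price, as noted above, is the extra delay of d during recovery.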

What if we have no bounds on clock drift and synchronization?  One approach is for the lease request to include a lease duration rather than an expiration time.  For example, a primary could request a lease for 60 seconds (as in Spanner).  Each lease request also includes a unique sequence number.  The lease is granted when the requester receives a majority of responses (with the given sequence number).  But for how long?  The requester records the time, according to its own clock, when it sent the lease request.  The lease is good for the requested duration from this time.  In other words, the requester behaves as though the lease was immediately granted by a majority of nodes but it took some time for the acknowledgements to reach the requester.  Other nodes record the time at which they responded to the lease request.  They believe the lease to be valid from this time for its duration.  This guarantees that the requester uses an expiration time that is earlier than those used by the other nodes (assuming that clock drift is not excessive).
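Here is a sketch of this duration-based scheme (the names are mine, and Spanner's actual implementation differs in its details). The key point is that the requester and the grantors never compare clock values, only durations measured on their own clocks:

```python
import itertools

class Grantor:
    """A node that grants leases; `clock` returns its own local time."""
    def __init__(self, clock):
        self.clock = clock
        self.valid_until = None
    def grant(self, seq, duration):
        # The grantor believes the lease valid from the time it
        # responds, for the requested duration, on its own clock.
        self.valid_until = self.clock() + duration
        return True

class Requester:
    """Requests leases by duration rather than expiration time."""
    _seq = itertools.count(1)

    def __init__(self):
        self.expires_at = None   # on the requester's own clock

    def request(self, grantors, local_now, duration):
        seq = next(self._seq)    # unique sequence number per request
        sent_at = local_now      # remember when WE sent the request
        acks = sum(1 for g in grantors if g.grant(seq, duration))
        if acks + 1 > (len(grantors) + 1) // 2:  # majority incl. self
            # Behave as though the lease was granted immediately at
            # send time; since the send precedes every response, this
            # expiration is earlier than any grantor's in real time.
            self.expires_at = sent_at + duration
            return True
        return False
```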

Message delays do not cause correctness problems, but they could cause a requester to lose its lease because it did not renew the lease in time or because the requested duration was less than the message delay.  In practice, this should not be a problem as long as the variance in message delays is not too great.

Three Publications

I would like to bring to your attention three papers that I have recently published.

1.   “Replicated Data Consistency Explained Through Baseball” was published as an article in this month’s (i.e. the December issue of) Communications of the ACM.  I have given a number of popular talks based on this paper, which was first written a couple of years ago.  It describes the utility of various consistency choices (those presented in my last blog posting) using one simple example: maintaining and accessing the score of a baseball game.  If you don’t understand baseball, don’t fret.  It could really be any sport.  If you don’t like sports, that’s okay too, though there may be something wrong with you.  There’s also a videotape of my giving a lecture on this topic to Microsoft engineers.  Check it out at:

2.  “Consistency-based Service Level Agreements for Cloud Storage” was presented last month at the ACM Symposium on Operating Systems Principles (SOSP).  This describes the Pileus system that we’ve designed and built at MSR Silicon Valley.  This paper particularly focuses on the motivation, design, and evaluation of a scheme for incorporating consistency choices into the SLAs provided by a replicated key-value store.  The feedback thus far has been very positive; feel free to send me your thoughts.  See: or
The talk slides, and someday soon perhaps the talk video, are available on the conference web site at

3.  “Transactions with Consistency Choices on Geo-Replicated Cloud Storage” is a technical report that discusses the transaction mechanism we designed for Pileus.  We show how to provide snapshot isolation (or full serializability) for distributed transactions on a partitioned, replicated key-value store while allowing those transactions to access data with relaxed consistency.  The key trade-off is that accessing stale data, while providing better expected performance, may cause a transaction to abort.  This paper shows that it is possible to support general-purpose transactions and provide reasonable semantics in an eventually consistent system.  See:

On a more personal note, I want to mention that it was not possible for me to include a dedication in my CACM paper.  But, if I could have, it would have read:

Dedicated to my father, Coach George Terry, who was a respected baseball coach for most of his career.  He taught me not only the fundamentals of baseball but also the fundamentals of life.  Through his love for baseball, he touched a great number of young men and women.

Have fun reading these papers, and happy holidays!

Defining Consistency Guarantees

It seems fashionable these days for cloud storage providers to offer consistency choices to their clients. Systems from Amazon, Google, Yahoo, and (soon) Microsoft provide both strong consistency and eventual consistency. For individual read operations, a developer may choose to receive strongly consistent or eventually consistent data. In some cases, such as the Pileus system that we’ve developed at Microsoft Research, intermediate consistency choices are also offered. But “eventual” consistency may mean different things in different systems, and service providers often do not define it precisely. This note explores what exactly is meant by a consistency choice and how we might specify precise guarantees for such choices.

Developers should be able to understand a consistency guarantee without knowing any details of the underlying implementation (replication protocols, system configuration, server locations, sync frequency, mean time to failure, etc.).

The consistency definitions that I’ve been using in talks are described in terms of the writes that are visible to a read operation. For example, the choices that we provide in the Pileus system are as follows:

  • Strong Consistency = Reads see all previous writes.
  • Eventual Consistency = Reads see a subset of previous writes.
  • Consistent Prefix = Reads see an initial sequence of writes.
  • Bounded Staleness = Reads see all “old” writes.
  • Monotonic Reads = Reads see an increasing subset of writes.
  • Read My Writes = Reads see all writes performed by the reader.
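These definitions can be read as predicates on the set of writes visible to a read. A hedged sketch, assuming writes are identified by sequence number and totally ordered (the function names are mine; the session guarantees take their per-session state as explicit arguments):

```python
def strong(visible, all_writes):
    """Reads see all previous writes."""
    return set(visible) == set(all_writes)

def eventual(visible, all_writes):
    """Reads see some subset of previous writes."""
    return set(visible) <= set(all_writes)

def consistent_prefix(visible, all_writes):
    """Reads see an initial sequence of the write order."""
    return list(visible) == list(all_writes)[:len(visible)]

def bounded_staleness(visible, all_writes, old_writes):
    """Reads see at least all 'old' writes."""
    return set(old_writes) <= set(visible) <= set(all_writes)

def monotonic_reads(visible, all_writes, prev_visible):
    """Reads see an increasing subset across successive reads."""
    return set(prev_visible) <= set(visible) <= set(all_writes)

def read_my_writes(visible, all_writes, session_writes):
    """Reads see at least the writes performed by this reader."""
    return set(session_writes) <= set(visible) <= set(all_writes)
```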

But what exactly do these definitions mean?

I suggest that a consistency guarantee specifies the set of acceptable results that can be returned by a read operation. The size of this set determines the “strength” of the consistency; smaller sets imply stronger consistency.

Given complete knowledge of the previous writes and their order, along with an understanding of the semantics of read and write operations, a person should be able to determine the acceptable result-set. That’s still a lot for a developer to comprehend, but it’s better than needing to understand how the replicated system is implemented.

What about partially ordered writes? The consistency definitions simply talk about the set of visible writes. This set is independent of the write order (except for a consistent prefix).

However, we also need to say what happens with concurrent writes. A concurrent write for a given read operation is any write whose execution overlaps the execution of the read. It seems reasonable to say that the read *may*, but is not required to, see any concurrent writes.

Then, in order to determine what value(s) may be returned given a selected subset of writes, a person needs to understand how the system performs concurrent writes. One reasonable answer is that the system may perform them in any order. In other words, the system will (at least conceptually) execute writes in a serial order that is consistent with the partial order. So, the result of a read operation could be the value produced by *any* such serial order of writes.
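Enumerating every serial order consistent with a partial order is a small recursion. A sketch, where `precedes(a, b)` (a name I am inventing here) means that write a must be ordered before write b:

```python
def serial_orders(writes, precedes):
    """Yield every total order of `writes` consistent with the
    partial order given by precedes(a, b)."""
    if not writes:
        yield []
        return
    for w in writes:
        # w may go first only if no remaining write must precede it.
        if not any(precedes(other, w) for other in writes if other != w):
            rest = [x for x in writes if x != w]
            for order in serial_orders(rest, precedes):
                yield [w] + order
```

For example, with writes a, b, c where only a-before-c is required, the valid serial orders are abc, acb, and bac.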

Thus, the overall algorithm for determining the set of acceptable results for a given consistency is as follows:

  1. Given consistency + previous writes + partial write order + session state, determine set of acceptable write subsets.
  2. For each acceptable write subset produced in step #1, along with the partial write order, determine all possible serial orders of writes.
  3. For each ordered write subset produced in step #2, compute the resulting object value by applying the writes in that order.
  4. Take the union of all computed object values in step #3 as the set of acceptable return values.
  5. If multiple consistencies are being combined, take the intersection of their sets of acceptable return values.
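The steps above can be sketched as follows. This is an illustration under my own simplifying assumptions: step #1's mapping from consistency to acceptable write subsets is taken as an input (since it is the only consistency-specific part), and write semantics are supplied as a caller-provided apply function:

```python
def acceptable_values(acceptable_write_subsets, precedes, apply_write,
                      initial_value):
    """Steps #2-#4: for each acceptable write subset, enumerate all
    serial orders consistent with the partial order, apply the writes
    in each order, and take the union of the resulting values."""
    values = set()
    for subset in acceptable_write_subsets:               # from step #1
        for order in _serial_orders(list(subset), precedes):  # step #2
            v = initial_value
            for w in order:                               # step #3
                v = apply_write(v, w)
            values.add(v)                                 # step #4
    return values

def _serial_orders(writes, precedes):
    """All total orders of `writes` consistent with the partial order."""
    if not writes:
        yield []
        return
    for w in writes:
        if not any(precedes(o, w) for o in writes if o != w):
            rest = [x for x in writes if x != w]
            for order in _serial_orders(rest, precedes):
                yield [w] + order

def combined(value_sets):
    """Step #5: intersect the acceptable values of combined guarantees."""
    result = None
    for s in value_sets:
        result = s if result is None else result & s
    return result
```

As a usage example, consider a last-writer-wins register with two concurrent writes of "x" and "y": an eventual read may see any subset (so it may return the initial value, "x", or "y"), while a strong read must see both writes (returning "x" or "y", depending on the order the system chose).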

Note that step #1 completely captures the essence of a consistency definition. It is the only step that depends on the consistency. Defining a new consistency is simply a matter of mapping the complete set of previous writes into one or more write subsets. A strong read (in the absence of concurrent writes) has a single acceptable write subset, the one containing all previously completed writes. An eventual read can see any subset of the previous writes, i.e. there could be a large number of acceptable write subsets. Intermediate consistency guarantees restrict the acceptable write subsets. With concurrent writes, all of the consistency guarantees, even strong consistency, have an increased number of acceptable write subsets since each concurrent write may or may not be included in each subset.

Note that step #2 is not needed if writes are totally ordered. All of the other steps remain the same. Thus, understanding the consistency is a bit easier for ordered writes.

Note that step #5 should always return at least one value since the latest object value, i.e. the value that would be returned by a strongly consistent read, should be an acceptable return value for any consistency guarantee.

Importantly, note that all 5 steps can be done with no knowledge of how the storage system operates. One only needs to know the semantics of write operations, i.e. the effect of a write operation on the (unreplicated) system state, and the semantics of read operations, i.e. what a read will return given a particular system state. In any system, the semantics of reads and writes should be well documented.