Implementing Bounded Staleness

One of the consistency choices that Pileus supports is bounded staleness, which guarantees that the version returned by a read operation is no more than t seconds out-of-date.  In other words, if the latest version was produced more than t seconds ago, then the reader is guaranteed to receive it; if the object was updated in the last t seconds, then the reader may or may not receive the latest version, but, in any case, is guaranteed to at least receive the version that was the latest as of t seconds ago.  The first paper to mention such a staleness guarantee, as far as I know, is “Data Caching Issues in an Information Retrieval System” by Alonso, Barbara, and Garcia-Molina (ACM Transactions on Database Systems, September 1990).

The simple way to implement bounded staleness, assuming a primary update scheme, is as follows.  Each secondary replica periodically syncs with the primary and records the time of its last sync.  A client reading data can determine whether a secondary replica is sufficiently up-to-date by (1) subtracting the desired staleness bound from the current time to get an acceptable read time and (2) checking whether the resulting acceptable read time is earlier than the replica’s last sync time.  That is, a replica can serve the read request while meeting the bounded staleness guarantee as long as the replica has synchronized directly (or indirectly) with the primary sufficiently recently.
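
As a minimal sketch (in Python; the function name is mine, and loosely synchronized clocks are an assumption of this simple scheme, not of Pileus), the client-side check reduces to a single comparison:

```python
import time

def can_serve_read(last_sync_time: float, staleness_bound: float) -> bool:
    """Return True if a replica whose last sync with the primary occurred at
    last_sync_time (seconds since the epoch, on a clock loosely synchronized
    with the client's) can serve a read with the given staleness bound (in
    seconds)."""
    acceptable_read_time = time.time() - staleness_bound
    return last_sync_time >= acceptable_read_time
```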

Note that this is a pessimistic implementation since it may disallow reading from replicas that are, in fact, sufficiently up-to-date for the object being read.  For example, many objects are written once and never again updated.  If the latest version of the object being read was produced months ago, then it could be read from any replica while meeting any reasonable staleness bound.  Unfortunately, clients cannot assume anything about update patterns.  Thus, they make pessimistic decisions about suitable replicas based on the last sync times (or else they would need to contact the primary to discover the timestamp of the latest version, which reduces the benefit of reading from secondary replicas).

However, this implementation technique assumes (a) that the client’s clock is loosely synchronized with those of the servers, and (b) that the servers report their last sync times.  Neither of these is generally a good assumption.  Consequently, we developed a different way of enforcing bounded staleness in the Pileus system.

In Pileus, servers use logical clocks.  Thus, the timestamps that are assigned to versions and the high timestamps that are maintained and reported by servers are not directly useful for determining which servers can meet a staleness bound.  However, here’s a client-side technique that works.

Each client maintains a mapping from real time (according to the client’s clock) to logical time (according to the primary server’s logical clock).  Specifically, whenever a client accesses the primary server to perform a Get or Put operation, the server returns its current high timestamp (as a logical clock value).  The client records its current time, taken from the local real-time clock, along with the server’s high timestamp.  To get a conservative but usable mapping, the client records its clock at the start of the request rather than at the receipt of the response (or it could measure the round-trip time and choose the mid-point of the interval, as in some clock synchronization algorithms); since the primary’s logical clock only advances during the round trip, the returned high timestamp is at least as large as the primary’s logical time was when the request was sent.  This mapping is a little off due to the message transmission delay, but that should be a small amount relative to staleness bounds.
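
Here is one way the client-side bookkeeping might look (a Python sketch; the class, the field names, and the shape of the primary’s response are mine, not the Pileus API):

```python
import bisect

class TimeToLogicalMap:
    """Client-side table mapping local real time to the primary's logical
    time.  Entries are (real_time, logical_time) pairs; both fields grow
    monotonically, so the list stays sorted."""

    def __init__(self) -> None:
        self.entries: list[tuple[float, int]] = []

    def record(self, request_start_time: float, high_timestamp: int) -> None:
        # Pair the primary's returned high timestamp with the local clock
        # reading taken when the request was *sent*: the primary's logical
        # clock only advances during the round trip, so high_timestamp is
        # at least the primary's logical time as of request_start_time.
        bisect.insort(self.entries, (request_start_time, high_timestamp))

# Around any Get or Put sent to the primary (illustrative names):
#   start = time.time()
#   response = primary.get(key)
#   mapping.record(start, response.high_timestamp)
```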

This mapping table need not be large since it does not need to record every interaction with the primary.  One entry per minute, for instance, is probably sufficient.  Basically, it needs to cover the range of times that are likely to appear in a staleness bound, at sufficient granularity.  Older entries in the table can be discarded once their logical timestamps fall below the minimum high time of all servers (or all servers that the client is likely to access, such as those that are closer than the primary).
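
Continuing the sketch, recording can be rate-limited and old entries pruned as just described (min_server_high_time stands for the minimum high timestamp across the servers of interest):

```python
RECORD_INTERVAL = 60.0  # seconds; roughly one entry per minute suffices

def maybe_record(mapping: TimeToLogicalMap, start: float, high_ts: int) -> None:
    """Add an entry only if the newest one is at least RECORD_INTERVAL old."""
    if not mapping.entries or start - mapping.entries[-1][0] >= RECORD_INTERVAL:
        mapping.record(start, high_ts)

def prune(mapping: TimeToLogicalMap, min_server_high_time: int) -> None:
    """Discard entries whose logical timestamps have fallen below the minimum
    high time of all servers of interest; such entries can no longer rule
    any of those servers out."""
    mapping.entries = [e for e in mapping.entries if e[1] >= min_server_high_time]
```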

To determine which servers can meet a staleness bound, the client computes the minimum acceptable read time = now – bound.  The client then looks up this timestamp in its table to find the associated logical time.  If there is no entry for that exact time, it uses the entry with the next higher recorded time as a conservative approximation.  For example, suppose that the client has entries for 8:00 and 8:05, but wants a read time of 8:03; it can use the 8:05 entry as a conservative bound.  The client then chooses from among the servers with a high timestamp that is greater than the recorded logical time.
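
In code, the lookup and the server-selection step might look like this (continuing the same sketch; server_high_times is assumed to hold each server’s last reported high timestamp):

```python
import bisect

def logical_time_for(mapping: TimeToLogicalMap, read_time: float) -> int | None:
    """Return the logical timestamp of the entry with the smallest recorded
    real time >= read_time (the conservative choice), or None if the table
    has no entry that late."""
    i = bisect.bisect_left(mapping.entries, (read_time,))
    return mapping.entries[i][1] if i < len(mapping.entries) else None

def eligible_servers(mapping: TimeToLogicalMap,
                     server_high_times: dict[str, int],
                     staleness_bound: float, now: float) -> list[str]:
    """Return the servers whose high timestamps exceed the logical time
    associated with now - staleness_bound."""
    needed = logical_time_for(mapping, now - staleness_bound)
    if needed is None:
        return []  # no usable entry; read from the primary instead
    return [s for s, high in server_high_times.items() if high > needed]
```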

Consider a scenario in which a client wants bounded staleness with a bound of 5 minutes and ideally would like to read from its local server.  Suppose this client knows that the local server is currently less than 5 minutes out-of-date.  This client can perform all of its reads locally.  This client can also ping the primary once per minute to retrieve its current logical high time (or piggyback such pings on regular write operations).  If the local server syncs with the primary frequently enough, say once every few minutes, and the client performs reads frequently enough, then the client would get the local server’s high time in response to a read operation and always determine that the local server was okay for its next read.  If the local server syncs too infrequently, say once every 15 minutes, then the client would eventually discover, based on its local mapping table, that the local server is too out-of-date.  At this point, it would need to switch to a different server, such as the primary.  After the local server syncs with the primary, the client would notice this since it periodically pings the local server.  The client could then switch back to reading locally.
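
To make the scenario concrete, here is a hypothetical read path built on the sketches above; local and primary are stand-ins for server handles with a get() operation and a cached high_timestamp refreshed by the periodic pings, not Pileus interfaces:

```python
import time

def read_with_bound(key, mapping: TimeToLogicalMap, local, primary,
                    bound: float = 300.0):
    """Prefer the local server; fall back to the primary when the local
    server's high timestamp no longer meets the 5-minute bound."""
    needed = logical_time_for(mapping, time.time() - bound)
    if needed is not None and local.high_timestamp > needed:
        return local.get(key)       # local server is provably fresh enough
    start = time.time()
    response = primary.get(key)     # fall back to the primary...
    maybe_record(mapping, start, response.high_timestamp)  # ...refresh the map
    return response
```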

A key advantage of this scheme is that it assumes nothing about synchronized clocks.  Each client maintains its own mapping table based on its own local clock, which could be arbitrarily out of sync.  Such a technique would work even if servers used physical but unsynchronized clocks.