The quality of service (QoS) of optimistic replication systems in the face of updates is the degree to which the system presents an illusion of connectivity to a single up-to-date copy of all objects to all users. In real replicated systems, this illusion must necessarily be violated, and quantifying the user-visible effects of such violations is key to determining how well the system performs.
We will discuss and analyze several possible metrics that can be used to measure quality of service. We ignore important questions of system load and overheads in both time and space, since in general they are relatively easy to measure and characterize. We do not consider measures of availability, which are primarily of interest in pessimistic systems.
The conflict count quantifies the QoS of a replicated system by counting the number of conflicts observed during a measurement period. The conflict rate expresses the same information as a ratio of the conflict count to time or total operations. The advantages of these metrics lie in their simplicity: they are simple to express, to measure, and to understand. A system with few conflicts is desirable because it requires little human intervention and usually adheres to single-copy semantics. A system with a high conflict rate, by contrast, is troublesome to the user.
However, this metric has serious drawbacks [8]. Since conflicts are detected only during reconciliation, the frequency of reconciliation has a direct effect on the number of conflicts observed. In the limit, if reconciliation never occurs, no conflicts will be detected regardless of the degree of concurrent updates! (This effect is exacerbated by the the common optimization of compressing multiple changes to the same object into a single update to save time and space during reconciliation [3,5,6].)
A second drawback with conflict counts is the difficulty of measuring conflicts in systems with more than two replicas [1]. The measurement can depend on the pattern of both updates and reconciliations. Consider four replicas of a single object. Replicas 1 and 3 receive independent updates without any intervening communication, creating a conflict. If these two replicas now reconcile, this single conflict will be detected. However, suppose replica 1 first reconciles with replica 2, and 3 with 4. The conflicting updates now exist in two different places. If replicas 1 and 3 then reconcile, they will count the conflict, but when replicas 2 and 4 reconcile, they will erroneously count it a second time. Adding replicas and adjusting the reconciliation patterns allows almost arbitrary choice of total conflict count.
Another difficulty is how to count unresolved conflicts. If a conflict is created at a pair of replicas, but it is not resolved even after several reconciliations, it is difficult to design an algorithm that can correctly count it as only one conflict when faced with complex update propagation patterns.
Finally, because conflict counts only measure write/write effects, they ignore the important question of whether the user is accessing out-of-date information.
Stale-access counts provide a very attractive measure of quality of service. They directly reflect what is important to the user: up-to-date data. Furthermore, since they are based on a global view of the world, they are closely connected to the single-copy serializability semantics that are the ideal every replication system strives to achieve. In addition, unlike conflict counts, stale-access counts do not tend towards zero as the reconciliation interval increases.
An alternative is to measure the age of a stale access, defined as the time elapsed between the latest update to an object seen by the accessing replica, and the time of the globally last update. For example, suppose replica 1 updates an object at time t1, and this update is communicated to all other replicas through the normal reconciliation mechanism. Then suppose that replica 2 updates the object at time t2 > t1, and replica 3 accesses it at time t3 > t2, all without further reconciliation. Then replica 3's access is stale (because it has not seen the update from replica 2), and the age of the stale access is t2-t1, the amount of time by which the information is out of date.
The age of stale accesses is closely related to the user's desire to see only the most up-to-date information, and gives us additional information about the departure from the ideal semantics in a replication system. In many systems, an access to hours-old data is far less acceptable than one to information that is only milliseconds out of date.
Unfortunately, the global nature of stale-access metrics causes them to be unusable in live systems, due to problems with obtaining global time. Although solutions to the global-clock problem exist (e.g., GPS receivers), they are of limited accuracy and generally require special hardware that is not available on most machines, limiting the utility of these measures to simulations and systems with special equipment. Nevertheless, we believe that stale-access metrics are extremely useful in comparing various designs for replication systems, since they do not exhibit the unusual sensitivities of conflict counts.
Although the propagation time does not directly relate to the amount of stale data seen by the user, nor to the degree of staleness, it is reasonable to assume that a replication system with a low propagation time will provide the user with better service than one with a high time. Propagation time is somewhat easier to measure in a distributed system because there is no need for a global clock; instead each replica can record the cumulative time needed to propagate an update to the next replica.
The biggest disadvantage of propagation time is the cost of measurement: logging the time of each update, and tracking its progress, requires tremendous amounts of storage space. Also, propagation time assigns equal value to all updates and all replicas. It may not much matter whether a rarely-used object propagates quickly, or whether it takes a long time to reach an infrequently-connected replica. For this reason, it can be helpful to weight the propagation time by the object's usage, and to measure the time needed to reach some substantial fraction of sites rather than 100%.
We have discussed three classes of possible metrics. The commonly used metric, conflict count, is subject to a number of anomalies that make it easy to misuse and inaccurate in the general case. Nevertheless, we have found that the conflict count is useful both because of its simplicity and because its weaknesses are easily minimized in most real-world situations.
Stale-access metrics seem to provide the most accurate picture of the service provided to the user. However, measuring them depends on a global system view, which is available only in simulations and other artificial environments.
The final metric, propagation time, is less directly related to the quality of service perceived by the user. Nevertheless, it has utility in describing certain system characteristics, making it a useful adjunct to the traditional conflict count.
The authors are affiliated with the UCLA Computer Science Department. Dr. Popek is also affiliated with Platinum technology. Authors' e-mail addresses: {geoff, rguy, reiher, awang}@cs.ucla.edu, popek@platinum.com. This work was partially supported by DARPA contract N00174-91-C-0107.