1. Introduction
We consider the problem of consistency for erasure code based distributed storage systems. Distributed storage systems store data over a network of nodes, so that data remains available over time. In order to do so, several properties are desirable, such as high fault-tolerance and low storage overhead. Fault-tolerance refers to the system’s ability to sustain failures of some of its components, and is present across three dimensions: (1) availability (the data should remain available even in the event of failures), (2) persistence, (the data should remain available over time), and (3) consistency (irrespective of the sequence of read and write operations on the stored data by multiple processes, and of possible failures, the data should appear to every process as if it had been manipulated in a globally agreed order).
Availability and persistence are achieved through redundancy. The data is stored multiple times, so that even when a node is unavailable, the requested data may be queried from another node (achieving (1)). Since failures may be temporary or permanent, redundancy needs to be replenished via maintenance mechanisms, in order to achieve (2), that is persistence over time. However, since the data is stored redundantly, it becomes essential to ensure consistency, so that all applications accessing a given data see the same version, in particular after updates, irrespective of which storage nodes are accessed.
Both maintenance of adequate level of redundancy and consistency depend on the redundancy mechanisms chosen, which typically induce trade-offs with storage overhead. Replication has been the most common way to ensure fault-tolerance, though over the past decade, more and more storage systems have adopted erasure coding techniques instead, e.g., Reference [
1,
2,
3,
4], since they provide a good trade-off between fault-tolerance and storage overhead. Processes to maintain the amount of redundancy over time in the presence of node failures for erasure code based distributed storage systems have been profusely studied (see, e.g., Reference [
5,
6] for surveys on erasure coding techniques enabling redundancy maintenance in distributed storage systems). Designs of mechanisms that support efficient updates of coded data have also been considered; see, e.g., References [
7,
8,
9,
10].
1.1. Consistency
In the context of replication, consistency refers to a setting where read and write operations are performed on shared data (the replicas) by different processes (see the left of
Figure 1), and it informally means that, when one replica is updated by one of the processes, it should be ensured that the other copies are updated accordingly. This is achieved by fixing a set of rules the processes obey when they want to read or write the data, in exchange for which the data the processes obtain is expected to be up-to-date.
Under strict consistency (see the upper right of
Figure 1), processes ask for a read operation r(
x)
c or a write operation w(
x)
c (respectively, reading or writing the value of
x to be
c) at a given point of the so-called wall-clock time, and the execution is expected to be instantaneous and thus follow that same ordering. The wall-clock time represents an absolute global time, while, in practice, different processes in a distributed system may not be perfectly synchronized or aware of what other processes locally consider as the time. In this example,
writes w(
x)
b after
as per the wall-clock time; thus,
,
get
b on every occasion when they carry out a read operation subsequently as per wall-clock time. This corresponds to the ordering w(
x)
a, w(
x)
b, r(
x)
b, r(
x)
b, r(
x)
b, r(
x)
b.
In contrast, sequential consistency only requires for some global ordering of the operations. We quote Reference [
11] to provide the precise way it has been defined “… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” The lower right of
Figure 1 illustrates an example scenario meeting this definition. The operations shown are not strictly consistent because
reads
a from
x, even though
has in real time (wall-clock time) already issued a write
b request. Nevertheless, in this case, the following global ordering w(
x)
a, r(
x)
a, w(
x)
b, r(
x)
b, r(
x)
b, r(
x)
b specifies a legitimate sequential order of operations.
There are several other forms of consistency, including, for example causal and eventual consistencies; see, e.g., Reference [
12] (Section 7): strict and sequential consistencies are strong forms of consistency, they have the advantage of maintaining a high level of global consistency at all times, at a cost in terms of latency. In this work we are interested in designing a mechanism for achieving sequential consistency over erasure coded data, and we do so by applying a standard technique to achieve so, namely quorum systems. A quorum system is defined as a collection of subsets of nodes (called quorums), where each pair of quorums has a non-empty intersection [
13] (Def 3.4). A “vote" (or a “lock”) is attributed to every node in the system, and any application wishing to either read or write data needs to gather enough votes in order to perform its operation. Because of the intersecting property of quorums, mutual exclusion of a write operation with any other write or read operations is achieved.
1.2. Related Works
In order to achieve sequential consistency, it is necessary that multiple processes cannot carry out write operation(s) to change the value of any object, while other operations are reading or writing said value; and conversely no read operation should be carried out while a write operation is underway, i.e., mutual exclusion of any write operation from other write or read operations is needed. One way to achieve this is by means of quorum systems (we describe quorum systems in detail, later in this paper), which are subsets of nodes which intersect pair-wise. Coupled with locking mechanisms, such intersecting subsets of nodes can be used to guarantee the necessary mutual exclusion, enabling the design of protocols to enforce sequential consistency. We refer to Reference [
13] for a more detailed treatment of how quorum systems are used in storage applications.
Apart from quorum systems, another popular mechanism for consistency is the class of primary-based protocols, where each data block
x has an associated primary, which is responsible for coordinating operations on
x. For example, Reference [
9] proposed an update model assuming a primary which serializes the update executions, and, similarly, in Reference [
10], the node storing a data block enforces serialized writes, while updates are disseminated in a best effort manner to the redundant blocks, and stale reads are possible, i.e., there is no consistency guarantee.
Finally the Paxos family of protocols [
14] for solving consensus algorithms has been adapted to erasure coded data. For example, Reference [
15] applies Paxos over erasure coded data by assuming a bound on the deviation of local clocks at nodes, without leveraging the structural properties of codes. Then, Reference [
4] is an erasure code based object storage which realizes consistency using Paxos, but Paxos is used at object granularity. Coded Atomic Storage (CAS) [
16] is aimed at mimicking shared memory abstraction for erasure coded data, with an emphasis on reducing communication cost, followed up by Reference [
17], which builds upon Reference [
16] to explore how reconfiguration of the system can be carried out while maintaining atomicity.
1.3. Contributions
Quorums exist in two renditions: symmetric, when a single system of sets is used to represent both reads and writes (though the kind of “vote” or “lock” associated with the acquired quorum can be different, depending on the purpose being a read or write operation), and asymmetric, when there are two distinct set systems, representing read quorums and write quorums separately. When a process uses a quorum, possibly accessing all nodes in this quorum, this induces a load, which measures the access probability of the busiest node in the system. The goal of this paper is to study a class of symmetric quorums called grid quorums [
13] (
Section 3.2), in the context of erasure coded data. This is motivated by the knowledge that grid quorums over replicated data exhibit optimal load characteristic and are, thus, good candidates to be generalized to the context of erasure coded data.
The contributions of this work are as follows:
(i) We specify requirements for quorum systems in the context of systematic maximally distance separable (MDS) erasure coded data (in
Section 3.1). Our definition encapsulates sufficiency for the quorum system (it does not preclude the existence of different other quorum systems) to meet read/write mutual exclusion needs in the system, which in turn guarantees sequential consistency congruent to the global ordering of quorums formed.
(ii) We demonstrate (in
Section 3.3) how to realize different variations
and
of grid quorum systems for erasure coded data subject to the quorum specification referred above. The former variant
involves subsets of nodes which occupy a full row and a full column of a logical grid layout of nodes, while the latter, i.e.,
, is a variant of the former, comprising truncated (smaller) groups still meeting the mutual intersection property. We prove that, for
MDS codes, where
n is a square and
, the load of the quorums under access strategy
S are
and
The access strategy
S depends on the probabilities
of accessing, respectively, data and parity nodes. Load is a metric determined by the fraction of time a given node is used, be it for a read or write operation, and is an important metric to characterize the performance and impact of a quorum system. Intuitively, lower the load, the freer the nodes in the system are to carry out other tasks. In
Section 2.2 we provide a comprehensive definition for the load of a quorum.
(iii) In
Section 3.4, we extend our study to
B-grid quorums, a generalization of grid quorums.
B-grid quorums suitable for erasure coded data are proposed which accommodate
MDS code with
and
and, thus, a wide range of rates. Their load is also computed, giving
We then discuss the trade-offs between load and storage overhead.
(iv) In
Section 4, we demonstrate how sequential consistency is achieved using the proposed quorum systems even in the presence of various combination of faults.
3. Grid Quorums
In the following, we consider logical grid layouts decoupled from the physical data placement of data and parity or the physical configuration of the storage nodes in the distributed system.
3.1. Quorum Systems and Erasure Coded Data
The definition of a quorum system (see Definition 1) considers all nodes to play the same role, since replicated data is stored (thus, every node stores the same thing). For erasure coded data, we actually have two types of nodes: those storing the actual data blocks (nodes
) and those storing the parities (nodes
). When two quorums intersect, it could thus be a priori on either data nodes or parity nodes. We add the requirement that in every intersection of two quorums, there is at least one parity node. The reason is as follows: we consider each data block to be independent of each other, hence updates on one does not alter other data blocks, while it does affect parities, which, in a MDS code, are linear combinations of all data blocks. Thus, whenever any data block is updated, parities need to be updated correspondingly. The requirement that there is always some parity node in intersection of any two quorums is, thus, formally stated as:
When an update is carried out, the write lock is released only after every parity in the quorum acquired to carry out the write operation is already updated. The parities outside the quorum too need to be updated, but this is let to happen in the background. It can be carried out efficiently using a standard technique using differentials [
7,
8,
9,
10], which is outside the scope of this work.
The quorum mechanism further needs to ensure that at least one of the latest updated parities will be present in any subsequent quorum:
Recall that we want to achieve sequential consistency [
11] using the proposed mechanism. Here,
t indicates a logical marker corresponding to the
t-th snapshot view of the system.
accordingly refers to the system at the conclusion of the next execution of any operation.
and
accordingly refers to the quorums invoked by the operations that led the system to reach the corresponding sequentially consistent states. The above invariant should hold irrespective of whether the background update propagation process is completed, in order to ensure that a process can identify the latest data and does not inadvertently obtain stale data, thus facilitating sequential consistency.
Lemma 1. Property (5) follows from (4). Proof. Suppose quorum
has been used for an update at time
t. Then, all parities involved in
got updated, and since any quorum
intersects
in such a parity
p, property (
5) follows. □
We illustrate Lemma 1 with example instances of grid quorums on the left of
Figure 2. Later, in
Section 3.3 we will formally describe grid quorums for coded data. We will then also demonstrate that indeed grid quorums always satisfy the necessary property (
4), but for the moment, we focus on providing an example to illustrate how (
5) follows from (
4). The upper triangular part of the grid comprises parity blocks, data blocks are found strictly below the diagonal. Two particular instances of quorums—one comprising data block
and some parity blocks, and another comprising data block
and some parity blocks—are shown. These two specific instances do have one parity block (boxed for emphasis) in common, satisfying (
4). The intersection of a parity across the two write quorums would ensure that any process carrying out a new write operation will become aware of the previous update, and will be able to (and need to) incorporate that update along with its own update at all the parity nodes in its own write quorum. Note that we assume that any node which has incorporated an update also stores certain meta-information, particularly the logical time-stamp of the update [
18] to enable the identification of the latest version, and also stores the previous versions (or differentials) for a period of time, until a given version is propagated to all the other storage nodes in the system, before garbage collecting obsolete versions. We next provide a sketch of how this ensures sequential consistency (shown in the right panel of
Figure 2 for the data block
).
Without loss of generality, suppose a process
asks for a write of the block
(w(
)
a,
) and acquires a write quorum
first, while process
asks for a write of the same block
(w(
)
b) and acquires a quorum
for
subsequently. If this second write quorum
is created before any of the read quorums to process the read requests from processes
and
, then it will result in a wall-clock ordered sequence as shown in the top right quadrant of
Figure 2. As an alternate scenario, it is also possible that, when the locks for
are released after
completes w(
)
a, a different pending operation obtains a quorum before
can carry out the next write operation. For instance, the first read operation by process
may be carried out ahead of
’s write operation, leading to r(
)(
a) at
occurring before w(
)
b by
, while all the other read operations follow this second write operation, leading to the scenario shown at the bottom right quadrant of
Figure 2. Both of these scenarios satisfy sequential consistency. We used updates to the same data object
to elaborate how sequential consistency is achieved, for keeping the exposition simple. However, in general, all the write operations will account for the immediately preceding write operation (Lemma 1) on any arbitrary data object and update the parities in its quorum accordingly. The immediately next update will again transitively have access to these updates. Hence, the global ordering will be determined based on the sequence in which the quorums are acquired, yielding sequential consistency. To wrap up the current example, we emphasize that though we used a particular example to demonstrate the ideas, the arguments advanced here hold for arbitrary quorum systems satisfying (
4); thus, (
5).
We will next provide a description of grid quorums for replicated data, before delving into the the design of grid quorums for erasure coded data.
3.2. Basic Grid Quorums
We recall what are basic grid quorums used over replicas.
Definition 2 (Reference [
13] (Section 3.2))
. Suppose is an integer. Then, the n nodes are arranged into a square grid of edge length . A basic grid quorum system
consists of quorums, each formed by a full row and a full column of the grid, such that rows and columns are not repeated. In
Figure 3,
nodes are arranged to form a square grid of edge length
. An example
of quorum systems is
, where the notation
means the union of row
i and column
j. The quorums
and
are shown, respectively, in yellow and in blue.
In a basic grid quorum system, the size of each quorum is
(there are
nodes on each row and column, including one node in their intersection), and two distinct quorums intersect exactly in two nodes. When the access strategy
S is uniform, the probability of access is
. From (
2), the load of node
i for
S uniform is
Since every quorum
Q is the union of one row and one column, and every node
i is given a unique (row, column) allocation in the grid, say
i is at position
, then
i can either appear in two quorums (
and
for some
), or once (if
). Thus,
This holds for every node
i and from (
3),
, so
for
S uniform.
It is known (Reference [
13] (Theor. 3.18)) that the minimal load of a quorum system is lower bounded by
, and that (Reference [
19] (Prop. 4.8)) the minimal load of a quorum system where every quorum has the same size
s is given by
. Since
for
, this gives
.
In the basic grid construction, two quorums
and
always intersect in two points (
and
). It is possible to reduce the size of the intersection to one point, as follows [
20]. Once the row
i of
is fixed, instead of including all nodes in the
jth column, we take instead exactly one node from each row larger than
i (the case where every node in which its row is larger than
i, but is in the
jth column itself is shown in
Figure 3). The difference with the previous grid quorum is that only one out of the two intersection points (
and
) is present, and the quorums are usually smaller.
3.3. Basic Grid Quorums for Coded Data
To obtain a basic grid quorum system for coded data, the first step we propose is to choose a (logical) grid layout that distinguishes data nodes from parity nodes: a specific such layout, where the data blocks are in the lower part of the grid, below and including the diagonal, and the parity blocks are above the diagonal, is shown in
Figure 4. This layout assumes that the code maps
data symbols to
n encoded ones for
n a square, that is, the rate of the code is
. There are three natural ways to define quorums based on this layout, as shown in
Figure 4: (i) quorums are unions of row
i and column
i (on the left), (ii) quorums are unions of row
i and truncated column
i (in the middle), (iii) quorums compromise of a node on row
i together with truncated row
i and truncated column
i which include parity nodes exclusively beside a single data node. For our discussions and analysis, we will consider (i) and (iii), which are formally defined next as
and
, respectively.
Definition 3. Given an code where is an integer and , suppose that the n nodes are arranged into a square grid of edge length such that the k data symbols are placed below and on the diagonal of the square (in positions with ), and the parities are placed above. Two basic grid quorums for coded data are given by Every quorum in has size , which is larger than that of quorums in which is . But then the cardinality of is , while that of is .
Lemma 2. Both quorum system and satisfy properties (4) and (5). Proof. It is enough to prove the first property. For , given i and j (), the quorums and intersect at and , and the former or latter, respectively, contains a parity (above the diagonal) when and . For , given that it differs from only on the data blocks, the parity blocks still intersect in the same manner. □
Examples are provided in
Figure 4 for an
code with
and
. Compared with a grid quorum for replicas, there are few choices for defining this quorum system given the specificities of data and parity block layout: not any choice of pairs of (row, column) works. For example, choosing
would not contain any parity, and, while
does contain a parity, it would not intersect
on a parity. In fact, suppose
is one quorum, then one cannot find any
for any
that would intersect
at a parity (and not repeating any column or row), since the row in
comprises only data.
We next consider the load of and , for which we need to introduce an access probability.
Since we have two types of nodes, those storing parities and those storing data blocks, we consider a different access probability depending on whether ( for parities) or ( for data). When deciding to form a quorum for a given node, if this node belong to several quorums, then any quorum is equally likely to be called.
Proposition 1. Given the above setting, for any choice of and , the quorum access probability is uniform for . Consequently, the quorum system has load Proof. Given the adopted layout, row i of the grid contains i data blocks and parities while column j contains parities and data blocks. Therefore, the quorum is formed of data blocks ( accounts for the fact that row i and column i intersect on the diagonal which contains a data block, that should not be counted twice).
For a node located in position
, the access probability of a quorum
depends on whether it is invoked, or
is invoked instead. We assume both are equally likely (introducing a probability factor of
for
for the terms in the computation of
). If
,
is necessarily called. Hence, we obtain:
For sanity check,
since
and
. Since
does not depend on
, we have thus shown that
since we have
quorums and each are equally likely.
From (
2), the load of node
i for
S uniform is
since node
i belongs to at most two quorums (one, when node
i is on the diagonal). □
Recall for comparison that, for
S uniform,
, and thus
, we have the same load for grid quorums with replicas as with other MDS erasure coding strategies. Note that the lower bound
still holds for the the case where coded data is stored, since (i) requiring property (
4) is a particular case of symmetric quorums, and (ii) setting
reduces to
.
Proposition 2. Given any choice of parity access probability and data access probability , the quorum access probability for is given byConsequently, for , the quorum system has load Proof. Since
the probability of accessing
is given by
For sanity check,
, and the factor of
simplifies to
where
range from 1 to
. This sum equals to
because terms in the sum can be grouped into pairs of the form
, in which the sum is 1, and there are
such terms.
Thus, using (
2), the load of node
is
hence, for a node
storing a data block, which belongs to a single quorum, we get
and the busiest of the nodes storing data is
: indeed,
When
, the right-hand sum contains the terms
, while the left-hand sum contains
. Thus, the inequality reduces to
, which holds for
. When
, the right-hand sum contains
, while the left-hand sum contains
. Now, the above inequality reduces to
, which holds term by term.
If
is storing a parity block, then
belongs to
quorums; more precisely, it belongs to
r quorums
,
and
c quorums
,
. Thus,
We have that
,
:
Indeed,
and
because this equality is clearly true if
for any choice of
c (then the first sum on the right-hand side is the sum of the left-hand side), and we have
But,
gives
which holds. This last computation further shows that
is increasing as a function of
r; thus, the busiest node containing a parity is obtained by maximizing both
r and
c; that is,
and
. □
From the layout of the data in the grid, intuitively, one would expect the lower most parity node to be most accessed, since it will be part of all the quorums invoked to access any of the data nodes situated in the last two rows of the grid, and these rows are most numerous in terms of data nodes. The analysis above formally confirms this intuition, and provides a closed-form formula to quantify the load. Realistically, one would expect that there will be explicit access (ignoring the access of nodes because of their participation in any quorum) of data nodes much more frequently than the parity nodes, i.e., . In the extreme case, when , we will have . Then, for the access strategy for which and is uniform, .
If we relax the condition of using nodes from more than one row and one column to get a quorum, say by allowing the choice of nodes in a quorum from multiple columns, akin to Reference [
20] (of which, the variant discussed above is a special case), then other quorum systems can also be identified. This leads to a generalized construction, called
B-grid, that we discuss next.
3.4. B-Grid Quorums for Coded Data
We recall the definition of B-grid quorums for replicated data.
Definition 4 (Reference [
19] (5.2))
. Suppose that n is of the form and the n nodes are arranged in a rectangular grid of rows and c columns, where rows are grouped into b bands of r rows, where band j contains rows for . Denote the intersection of column c and band j as mini-column . A quorum in the B-grid system
consists of one mini-column in every band, and a representative element in each mini-column of one band. A B-grid quorum system comprises of multiple independent B-grid quorums. In particular, mini-columns and the one band from which representative elements are chosen, are independent. In
Figure 5, we show two quorums of a
B-grid quorum system with three bands. The same argument used to derive the load of
may be applied here, namely, since every quorum has the same size
, the minimal load is (Reference [
19] (Prop. 4.8))
. Note that, consequently, when we have a square grid, i.e.,
, the load of the
B-grid
.
Since quorums in intersect in the mini-columns, this suggests a possible adaptation in the context of erasure coded data as follows.
Definition 5. Given an code, suppose that the nodes are arranged in a rectangular grid of rows and c columns, where rows are grouped into b bands of r rows. Let be a fixed set of subsets of mini-columns, where contains some chosen mini-columns in band i, and for . Place parities in the mini-columns specified by C, and the data blocks elsewhere. A quorum in the quorum system consists of the b mini-columns and of a choice of elements in the band i such that there is exactly one element per mini-column. The index j of the quorum refers to the jth choice out of the choices to choose one element from a mini-column of r elements, for each of the columns.
In the proposed B-grid quorum for coded data, each quorum comprises data nodes, and parity nodes. Moreover, in this setup, the code rate is , so we need to assume . This means, a quorum system for a code with arbitrarily high rate can be realized using B-grid.
In
Figure 6, an example is shown with
,
, with
,
,
. In addition,
(in blue) contains the mini-columns
(indicated by vertical blue lines), and one choice for
j is represented by dotted blue lines. Similarly,
and
are shown in yellow and pink.
Lemma 3. The quorum system satisfies properties (4) and (5). Proof. It is enough to prove property (
4). A quorum
must contain an element per mini-column in band
i; therefore, it necessarily intersects the mini-columns
for
, and thus the other quorums, and this intersection happens in parities, since by construction the parities are placed in these mini-columns. □
Proposition 3. Considering the same setting as for , for any choice of and , the corresponding quorum access S is uniform. Consequently, the quorum system has load Proof. Since the load is defined for the busiest nodes, we only consider nodes that are in mini-columns. The probability of accessing a quorum in
is
A node (say in band
j) experiences the maximum load in the system when it is part of the mini-column for some quorum
; furthermore, it will be a representative element for the mini-column for the quorum
. Since there are
quorums
, the corresponding contribution to the load is
, while, the load due to the latter is
, leading to a total load of
. □
To compare the computed load with the square grid from
Section 3.3, suppose that
. The load can then be rewritten as
Then, if
, the load is indeed
, effectively reducing the system to the basic square grid performance in terms of load (as expected). However, in terms of grid layout, this reduction only works in the case where the erasure code is replication.
Next, we compare the loads for
and
:
and
which is always the case, since
. This shows that there is a cost to pay in terms of load to use erasure codes, and the relative cost is given by
The function
is illustrated in
Figure 7. The rate
is shown on the
x-axis. Small values of
c are chosen (3 and 4); then, values of
b smaller than
c are considered, yielding possible rates (shown by stars). Then, for each value of
c, several values of
r are fixed (
), these choices of parameters yield six piecewise linear functions. We observe that, when
r increases, so does the load of the coded quorum system compared to the load of the uncoded one. However, for a given
r, we also observe that increasing
c decreases the load.
From the view point of definition of load in a quorum system, our analysis indicates that the parity nodes are the busiest. Its practical implications, and the efficacy of B-grid quorum systems for erasure coded storage, need to be explored accordingly. However, we note that, for read operations, the actual work (disk I/Os) will be carried out at the nodes storing data blocks, while the parity nodes will carry out further operations beyond ‘voting’ for mutual exclusion of conflicting operations, only for write operations, where the parity values are updated. System implementation accompanied with rigorous benchmarking with realistic workloads is pending; however, since erasure coding is typically used for data that is not hot, i.e., data that is not being frequently written to, the proposed quorum mechanism looks promising for practical usage.
4. Consistency in Presence of Node Failures
Consider a write operation w()a on a data object , at a given time t. Any read or write operation involving occurring at time ought to read r()a, assuming there has been no write operation involving in the interim. Two distinct situations arise:
- (1)
Node d is available: unlike in replicated systems, there is only a single node (node d) which stores , ; thus, w()a does update and subsequent read/write operations will obtain the updated value a from . That some of the parity nodes may not have received all updates possibly involving other data objects has no bearings in obtaining the latest value of .
- (2)
Node d is unavailable: if the node storing is temporarily or permanently down, a read or write request will trigger a degraded read or write operation instead, where is obtained using other data blocks and parities, in which case, it is critical that parities involved in the degraded operation are up-to-date with respect to . This scenario also needs to take into account that (i) other nodes than d might have been updated in the interim, leading to changes in nodes storing parities, and/or (ii) other nodes may be unavailable.
From Lemmas 2 and 3, we established that sequential consistency is achieved in the first case. We next discuss the second case, namely the consequences of node failures, and demonstrate how sequential consistency is still achieved. We adopt a level of abstraction that assumes there are mechanisms to deal with locks so that quorums are eventually secured for write and read operations, ensuring the liveliness of the system, i.e., that it does not get in a state where it cannot progress. Accordingly, we focus on the issue of safety, specifically that the sequential consistency invariant is always met.
The data redundancy is achieved using an MDS erasure code, which has the property that one can reconstruct any missing codeword coefficient from any k other coefficients. When node d is unavailable, for , a degraded read will try to read from k available other nodes, at least one of them must be a parity. Parities called for degraded operations at time must be up-to-date with respect to . This requirement is guaranteed if the parities used belong to the quorum used to permit the operation w()a at time t. This ensures the parity nodes carry the information with respect to the latest value of without having to rely on the propagation of updates to other parity nodes, which runs as a background process. There are parity nodes in a basic grid quorum (and for the B-grid) involved in any quorum.
Consider the baseline case, where all the other data objects are also available. Thus, we need only one parity to recreate the content of the unavailable node
d. In this set-up a degraded operation is possible as long as one parity is available, which can tolerate
(
for B-grid) unavailable parities. We also note that this parity will be up-to-date with respect to any other
,
: indeed, if this parity is not yet up-to-date, it can first be updated before using it to reconstruct
since we assume that only a single data node, namely
d, is unavailable. In this case, using
data objects
,
,
together with an up-to-date parity from the quorum of
thus guarantees that the latest value of
is computed as
using
from (
1).
Suppose now that not only node
d is unavailable which we would like to read, but say also node
, storing the data object
is down. We will then need more than one parity for reconstructing
. Since we want to access
in node
d which was last updated at time
t, by the above discussion, we need two parities in its quorum, which are assured to be up-to-date with respect to
. We would ideally need these two parities to be up-to-date with respect to all other data nodes. If these are not up-to-date with respect to the data nodes other than
, since these other data nodes themselves are available, the parities can be updated to reflect the latest value of the available data nodes. Using two up-to-date parities
and
, we get:
so we multiply the first equation by
and the second by
and compute the difference of the two terms, which yields
for
the corresponding coefficients. Then,
is found from this equation since we know
and
for
, and only
are assumed unavailable. Specifically, we have
Notice that the above computation of only assumes that has the same value in and . Therefore, even if the data node needed for the update is down, it is actually enough to find two parity nodes which reflect the same (possibly stale) value of , for being able to perform a degraded read and recreate the latest value of correctly.
Finally, it is possible that a different data node
has been updated subsequent to the last update to
, but in the interim, the unique parity node (We consider the case where the data nodes are from different rows in the grid/B-grid layout, such that they have distinct set of parities in their quorums, with a single intersecting parity. If they have multiple common parities in the intersection of their quorums, then the considered problem does not arise, since all these parities in the intersection would carry information regarding the latest values for both
and
.) at the intersection of the quorums required to read or write
and
becomes unavailable. Again, as above, if we use any two parities from the quorum for
which reflect the same (latest or stale) value of
, and exclude
irrespective of whether it is available or not, then we can reconstruct the latest value of
in the same manner, i.e., using (
7), as the above scenario.
Note that, while we have not explicitly discussed a truncated B-grid, similar to the truncated version of basic grid, where quorums were formed comprising only single data object and the parity nodes determined based on the row the data object belonged to, one can also truncate the B-grid quorums. Doing so will not alter the fault tolerance or consistency discussed above.
5. Concluding Remarks
This study is the first of its kind, in synthesizing the concept of quorum systems with erasure coded storage systems and showing the feasibility of grid quorums in that context. It creates a stepping stone for further studies on the fault-tolerance and availability of the proposed quorum systems, including: (1) access of the quorums for repair operations and degraded read operations, (2) new quorum systems for erasure coded systems with better characteristics, in terms of practical requirements, such as load, coding rate, and fault-tolerance, and (3) taking into account the asymmetric role of data and parity nodes to possibly define quorum load in a more meaningful manner. The design of practical algorithms, considering system design issues, including background update propagation, mapping the logical layout to the physical layout, replacement of permanently down nodes, garbage collection of stale information, and the overlying file system, are numerous aspects which will also need attention once the conceptual foundations mature, to translate the ideas into a working system.
Furthermore, in the context of storage systems, non-MDS codes but with better repairability properties have been proposed [
5] and are also deployed [
1] in real-world systems, where (some) parities have certain locality properties, such that they do not depend on all the data blocks. As such, the constraint from (
4) may not be adequate when such codes are used, and further studies are needed. For example, it would be interesting to consider the joint design of codes with good repairability and efficient quorum mechanisms.