Nothing Special   »   [go: up one dir, main page]

Next Article in Journal
Sharp Guarantees and Optimal Performance for Inference in Binary and Gaussian-Mixture Models
Previous Article in Journal
Quantum Hacking on an Integrated Continuous-Variable Quantum Key Distribution System via Power Analysis
Previous Article in Special Issue
Cache-Aided General Linear Function Retrieval
You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

On Grid Quorums for Erasure Coded Data

by
Frédérique Oggier
1 and
Anwitaman Datta
2,*
1
Division of Mathematical Sciences, Nanyang Technological University, Singapore 639798, Singapore
2
School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(2), 177; https://doi.org/10.3390/e23020177
Submission received: 25 October 2020 / Revised: 25 January 2021 / Accepted: 27 January 2021 / Published: 30 January 2021
Figure 1
<p>On the left: A simplified view of a distributed storage system, where nodes store replicas of some data object <span class="html-italic">x</span>. Processes <math display="inline"><semantics> <mrow> <msub> <mi>P</mi> <mn>1</mn> </msub> <mo>,</mo> <msub> <mi>P</mi> <mn>2</mn> </msub> <mo>,</mo> <msub> <mi>P</mi> <mn>3</mn> </msub> <mo>,</mo> <msub> <mi>P</mi> <mn>4</mn> </msub> </mrow> </semantics></math> may ask to read the current value, say <span class="html-italic">c</span>, from <span class="html-italic">x</span>, represented as r(<span class="html-italic">x</span>)<span class="html-italic">c</span> or write the value <span class="html-italic">c</span> to <span class="html-italic">x</span>, represented as w(<span class="html-italic">x</span>)<span class="html-italic">c</span>. They may carry out the operations at any of the replicas. On the right, two forms of consistency are illustrated: a read operation r(<span class="html-italic">x</span>)<span class="html-italic">c</span> or a write operation w(<span class="html-italic">x</span>)<span class="html-italic">c</span> at a given point of the wall-clock time indicates that a process is asking for the corresponding operation on a replica. When it is effectively executed is inferred from the table: under <b>strict consistency</b> (illustrated in the <b>upper right quadrant</b>), the executions follow the same timeline, while under <b>sequential consistency</b> (illustrated in the <b>lower right quadrant</b>), the executions follow some global ordering but need not adhere to the wall clock time at which the operations were invoked by the processes.</p> ">
Figure 2
<p>Update sequence (for the grid-quorums on the left of the figure): at time <span class="html-italic">t</span>, <math display="inline"><semantics> <msub> <mi>x</mi> <mi>i</mi> </msub> </semantics></math> is updated using its quorum <math display="inline"><semantics> <msub> <mi>Q</mi> <mi>t</mi> </msub> </semantics></math>, in which parities are highlighted in row and column 4 (shown on the left). At time <math display="inline"><semantics> <mrow> <msup> <mi>t</mi> <mo>′</mo> </msup> <mo>&gt;</mo> <mi>t</mi> </mrow> </semantics></math>, <math display="inline"><semantics> <msub> <mi>x</mi> <mi>j</mi> </msub> </semantics></math> is updated using its quorum <math display="inline"><semantics> <msub> <mi>Q</mi> <msup> <mi>t</mi> <mo>′</mo> </msup> </msub> </semantics></math>, in which parities are highlighted in row 5 and column 5 (on the right). Since the parity node (boxed for highlighting) at coordinate <math display="inline"><semantics> <mrow> <mo>(</mo> <mn>4</mn> <mo>,</mo> <mn>5</mn> <mo>)</mo> </mrow> </semantics></math> is common, process updating <math display="inline"><semantics> <msub> <mi>x</mi> <mi>j</mi> </msub> </semantics></math> would know that <math display="inline"><semantics> <msub> <mi>x</mi> <mi>i</mi> </msub> </semantics></math> was updated, and the latest value of <math display="inline"><semantics> <msub> <mi>x</mi> <mi>i</mi> </msub> </semantics></math> will have to be taken into account during its own update, so that all the parities in <math display="inline"><semantics> <msub> <mi>Q</mi> <msup> <mi>t</mi> <mo>′</mo> </msup> </msub> </semantics></math> reflect not only the latest value of <math display="inline"><semantics> <msub> <mi>x</mi> <mi>j</mi> </msub> </semantics></math>, but also the latest value of <math display="inline"><semantics> <msub> <mi>x</mi> <mi>i</mi> </msub> </semantics></math>, irrespective of whether they had received the update information regarding <math display="inline"><semantics> <msub> <mi>x</mi> <mi>i</mi> </msub> </semantics></math> through the background process prior to the invocation of <math display="inline"><semantics> <msub> <mi>Q</mi> <msup> <mi>t</mi> <mo>′</mo> </msup> </msub> </semantics></math>. Two distinct but valid scenarios of sequential consistency are shown (on the right of the figure) based on different sequences of quorums acquired.</p> ">
Figure 3
<p>Basic grid quorums for <math display="inline"><semantics> <mrow> <mi>n</mi> <mo>=</mo> <mn>36</mn> </mrow> </semantics></math> nodes: on the left, two quorums (one in yellow and the other in blue) intersect in two points, while, on the right, a variant with smaller quorums is shown, where they intersect in one point.</p> ">
Figure 4
<p>Grid <math display="inline"><semantics> <mrow> <mo>(</mo> <mn>6</mn> <mo>×</mo> <mn>6</mn> <mo>)</mo> </mrow> </semantics></math> layout for an <math display="inline"><semantics> <mrow> <mo>(</mo> <mn>36</mn> <mo>,</mo> <mn>21</mn> <mo>)</mo> </mrow> </semantics></math> code: Data blocks are in the lower triangle (including the diagonal), while the upper triangle has the parities. On the left, an example of two intersecting quorums <math display="inline"><semantics> <mrow> <msub> <mi>Q</mi> <mn>1</mn> </msub> <mo>=</mo> <mrow> <mo>[</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> <mo>]</mo> </mrow> <mo>,</mo> <msub> <mi>Q</mi> <mn>2</mn> </msub> <mo>=</mo> <mrow> <mo>[</mo> <mn>6</mn> <mo>,</mo> <mn>6</mn> <mo>]</mo> </mrow> </mrow> </semantics></math> is shown, and they interest in two points, including a parity. In the middle, all the quorums for a variant with smaller quorums (obtained by truncating columns) are shown. On the right, each data block on row <span class="html-italic">i</span> has for quorum the union of itself, the parities on row <span class="html-italic">i</span> and the parities on column <span class="html-italic">i</span>.</p> ">
Figure 5
<p>A <span class="html-italic">B</span>-grid quorum for <math display="inline"><semantics> <mrow> <mi>n</mi> <mo>=</mo> <mn>48</mn> <mo>=</mo> <mi>c</mi> <mi>b</mi> <mi>r</mi> </mrow> </semantics></math> nodes, arranged in a grid with <math display="inline"><semantics> <mrow> <mi>c</mi> <mo>=</mo> <mn>8</mn> </mrow> </semantics></math> columns, <math display="inline"><semantics> <mrow> <mi>b</mi> <mi>r</mi> <mo>=</mo> <mn>6</mn> </mrow> </semantics></math> rows, arranged in <math display="inline"><semantics> <mrow> <mi>b</mi> <mo>=</mo> <mn>3</mn> </mrow> </semantics></math> bars each containing <math display="inline"><semantics> <mrow> <mi>r</mi> <mo>=</mo> <mn>2</mn> </mrow> </semantics></math> rows.</p> ">
Figure 6
<p>A <span class="html-italic">B</span>-grid with <math display="inline"><semantics> <mrow> <mo>(</mo> <mn>48</mn> <mo>,</mo> <mn>30</mn> <mo>)</mo> </mrow> </semantics></math> coded data.</p> ">
Figure 7
<p>On the <span class="html-italic">x</span>-axis, the rate <math display="inline"><semantics> <mrow> <mn>1</mn> <mo>−</mo> <mfrac> <mi>b</mi> <mi>c</mi> </mfrac> </mrow> </semantics></math>. On the <span class="html-italic">y</span>-axis, the relative load <math display="inline"><semantics> <mi>γ</mi> </semantics></math> of the coding based B-grid w.r.to replication as a function of <math display="inline"><semantics> <mrow> <mo>(</mo> <mi>r</mi> <mo>,</mo> <mi>c</mi> <mo>)</mo> </mrow> </semantics></math>, for <math display="inline"><semantics> <mrow> <mo>(</mo> <mi>r</mi> <mo>,</mo> <mi>c</mi> <mo>)</mo> <mo>∈</mo> <mo>{</mo> <mo>(</mo> <mn>1</mn> <mo>,</mo> <mn>3</mn> <mo>)</mo> <mo>,</mo> <mo>(</mo> <mn>1</mn> <mo>,</mo> <mn>4</mn> <mo>)</mo> <mo>,</mo> <mo>(</mo> <mn>2</mn> <mo>,</mo> <mn>3</mn> <mo>)</mo> <mo>,</mo> <mo>(</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>)</mo> <mo>,</mo> <mo>(</mo> <mn>3</mn> <mo>,</mo> <mn>3</mn> <mo>)</mo> <mo>,</mo> <mo>(</mo> <mn>3</mn> <mo>,</mo> <mn>4</mn> <mo>)</mo> <mo>}</mo> </mrow> </semantics></math>.</p> ">
Review Reports Versions Notes

Abstract

:
We consider the problem of designing grid quorum systems for maximum distance separable (MDS) erasure code based distributed storage systems. Quorums are used as a mechanism to maintain consistency in replication based storage systems, for which grid quorums have been shown to produce optimal load characteristics. This motivates the study of grid quorums in the context of erasure code based distributed storage systems. We show how grid quorums can be built for erasure coded data, investigate the load characteristics of these quorum systems, and demonstrate how sequential consistency is achieved even in the presence of storage node failures.

1. Introduction

We consider the problem of consistency for erasure code based distributed storage systems. Distributed storage systems store data over a network of nodes, so that data remains available over time. In order to do so, several properties are desirable, such as high fault-tolerance and low storage overhead. Fault-tolerance refers to the system’s ability to sustain failures of some of its components, and is present across three dimensions: (1) availability (the data should remain available even in the event of failures), (2) persistence, (the data should remain available over time), and (3) consistency (irrespective of the sequence of read and write operations on the stored data by multiple processes, and of possible failures, the data should appear to every process as if it had been manipulated in a globally agreed order).
Availability and persistence are achieved through redundancy. The data is stored multiple times, so that even when a node is unavailable, the requested data may be queried from another node (achieving (1)). Since failures may be temporary or permanent, redundancy needs to be replenished via maintenance mechanisms, in order to achieve (2), that is persistence over time. However, since the data is stored redundantly, it becomes essential to ensure consistency, so that all applications accessing a given data see the same version, in particular after updates, irrespective of which storage nodes are accessed.
Both maintenance of adequate level of redundancy and consistency depend on the redundancy mechanisms chosen, which typically induce trade-offs with storage overhead. Replication has been the most common way to ensure fault-tolerance, though over the past decade, more and more storage systems have adopted erasure coding techniques instead, e.g., Reference [1,2,3,4], since they provide a good trade-off between fault-tolerance and storage overhead. Processes to maintain the amount of redundancy over time in the presence of node failures for erasure code based distributed storage systems have been profusely studied (see, e.g., Reference [5,6] for surveys on erasure coding techniques enabling redundancy maintenance in distributed storage systems). Designs of mechanisms that support efficient updates of coded data have also been considered; see, e.g., References [7,8,9,10].

1.1. Consistency

In the context of replication, consistency refers to a setting where read and write operations are performed on shared data (the replicas) by different processes (see the left of Figure 1), and it informally means that, when one replica is updated by one of the processes, it should be ensured that the other copies are updated accordingly. This is achieved by fixing a set of rules the processes obey when they want to read or write the data, in exchange for which the data the processes obtain is expected to be up-to-date.
Under strict consistency (see the upper right of Figure 1), processes ask for a read operation r(x)c or a write operation w(x)c (respectively, reading or writing the value of x to be c) at a given point of the so-called wall-clock time, and the execution is expected to be instantaneous and thus follow that same ordering. The wall-clock time represents an absolute global time, while, in practice, different processes in a distributed system may not be perfectly synchronized or aware of what other processes locally consider as the time. In this example, P 2 writes w(x)b after P 1 as per the wall-clock time; thus, P 3 , P 4 get b on every occasion when they carry out a read operation subsequently as per wall-clock time. This corresponds to the ordering w(x)a, w(x)b, r(x)b, r(x)b, r(x)b, r(x)b.
In contrast, sequential consistency only requires for some global ordering of the operations. We quote Reference [11] to provide the precise way it has been defined “… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” The lower right of Figure 1 illustrates an example scenario meeting this definition. The operations shown are not strictly consistent because P 3 reads a from x, even though P 2 has in real time (wall-clock time) already issued a write b request. Nevertheless, in this case, the following global ordering w(x)a, r(x)a, w(x)b, r(x)b, r(x)b, r(x)b specifies a legitimate sequential order of operations.
There are several other forms of consistency, including, for example causal and eventual consistencies; see, e.g., Reference [12] (Section 7): strict and sequential consistencies are strong forms of consistency, they have the advantage of maintaining a high level of global consistency at all times, at a cost in terms of latency. In this work we are interested in designing a mechanism for achieving sequential consistency over erasure coded data, and we do so by applying a standard technique to achieve so, namely quorum systems. A quorum system is defined as a collection of subsets of nodes (called quorums), where each pair of quorums has a non-empty intersection [13] (Def 3.4). A “vote" (or a “lock”) is attributed to every node in the system, and any application wishing to either read or write data needs to gather enough votes in order to perform its operation. Because of the intersecting property of quorums, mutual exclusion of a write operation with any other write or read operations is achieved.

1.2. Related Works

In order to achieve sequential consistency, it is necessary that multiple processes cannot carry out write operation(s) to change the value of any object, while other operations are reading or writing said value; and conversely no read operation should be carried out while a write operation is underway, i.e., mutual exclusion of any write operation from other write or read operations is needed. One way to achieve this is by means of quorum systems (we describe quorum systems in detail, later in this paper), which are subsets of nodes which intersect pair-wise. Coupled with locking mechanisms, such intersecting subsets of nodes can be used to guarantee the necessary mutual exclusion, enabling the design of protocols to enforce sequential consistency. We refer to Reference [13] for a more detailed treatment of how quorum systems are used in storage applications.
Apart from quorum systems, another popular mechanism for consistency is the class of primary-based protocols, where each data block x has an associated primary, which is responsible for coordinating operations on x. For example, Reference [9] proposed an update model assuming a primary which serializes the update executions, and, similarly, in Reference [10], the node storing a data block enforces serialized writes, while updates are disseminated in a best effort manner to the redundant blocks, and stale reads are possible, i.e., there is no consistency guarantee.
Finally the Paxos family of protocols [14] for solving consensus algorithms has been adapted to erasure coded data. For example, Reference [15] applies Paxos over erasure coded data by assuming a bound on the deviation of local clocks at nodes, without leveraging the structural properties of codes. Then, Reference [4] is an erasure code based object storage which realizes consistency using Paxos, but Paxos is used at object granularity. Coded Atomic Storage (CAS) [16] is aimed at mimicking shared memory abstraction for erasure coded data, with an emphasis on reducing communication cost, followed up by Reference [17], which builds upon Reference [16] to explore how reconfiguration of the system can be carried out while maintaining atomicity.

1.3. Contributions

Quorums exist in two renditions: symmetric, when a single system of sets is used to represent both reads and writes (though the kind of “vote” or “lock” associated with the acquired quorum can be different, depending on the purpose being a read or write operation), and asymmetric, when there are two distinct set systems, representing read quorums and write quorums separately. When a process uses a quorum, possibly accessing all nodes in this quorum, this induces a load, which measures the access probability of the busiest node in the system. The goal of this paper is to study a class of symmetric quorums called grid quorums [13] (Section 3.2), in the context of erasure coded data. This is motivated by the knowledge that grid quorums over replicated data exhibit optimal load characteristic and are, thus, good candidates to be generalized to the context of erasure coded data.
The contributions of this work are as follows:
(i) We specify requirements for quorum systems in the context of systematic maximally distance separable (MDS) erasure coded data (in Section 3.1). Our definition encapsulates sufficiency for the quorum system (it does not preclude the existence of different other quorum systems) to meet read/write mutual exclusion needs in the system, which in turn guarantees sequential consistency congruent to the global ordering of quorums formed.
(ii) We demonstrate (in Section 3.3) how to realize different variations Q g r i d c o d 1 and Q g r i d c o d 2 of grid quorum systems for erasure coded data subject to the quorum specification referred above. The former variant Q g r i d c o d 1 involves subsets of nodes which occupy a full row and a full column of a logical grid layout of nodes, while the latter, i.e., Q g r i d c o d 2 , is a variant of the former, comprising truncated (smaller) groups still meeting the mutual intersection property. We prove that, for ( n , k ) MDS codes, where n is a square and k = n n + 1 2 , the load of the quorums under access strategy S are
L S ( Q g r i d c o d 1 ) = 2 n ,
and
L S ( Q g r i d c o d 2 ) = ( 2 n 1 ) P d + P p j = 1 j n 1 n n 1 n 1 + j + j = 1 j n n n n + j .
The access strategy S depends on the probabilities P d , P p of accessing, respectively, data and parity nodes. Load is a metric determined by the fraction of time a given node is used, be it for a read or write operation, and is an important metric to characterize the performance and impact of a quorum system. Intuitively, lower the load, the freer the nodes in the system are to carry out other tasks. In Section 2.2 we provide a comprehensive definition for the load of a quorum.
(iii) In Section 3.4, we extend our study to B-grid quorums, a generalization of grid quorums. B-grid quorums suitable for erasure coded data are proposed which accommodate ( n , k ) MDS code with n = c b r and k = n r b 2 and, thus, a wide range of rates. Their load is also computed, giving
L S ( Q B g r i d c o d ) = 1 b 1 + 1 r .
We then discuss the trade-offs between load and storage overhead.
(iv) In Section 4, we demonstrate how sequential consistency is achieved using the proposed quorum systems even in the presence of various combination of faults.

2. Background

2.1. Erasure Coding for Storage

A linear ( n , k ) erasure code over some finite field F q is a linear map: ( x 1 , , x k ) ( x 1 , , x k , x k + 1 , , x n ) , where x p , p = k + 1 , , n are linear combinations in x 1 , , x k , referred to as parities:
x p = i = 1 k a j i x i .
The vector ( x 1 , , x k , x k + 1 , , x n ) is called a codeword, and since the first coefficients of the codeword are x 1 , , x k , the code is said to be systematic. In a storage system, x 1 , , x k are data blocks to be stored, and, in order to provide fault tolerance, the n coefficients x 1 , , x n are stored over n nodes, say node d stores x d , d = 1 , , k for the data blocks, and node p stores x p , p = k + 1 , , n for the parity blocks. If the node i is unavailable, x i should be recoverable from the remaining n 1 blocks. In case several nodes are unavailable, the data content may or not be recoverable depending on the erasure tolerance ability of the code. Codes with the best fault tolerance with respect to k and n are called maximum distance separable (MDS) codes, and can tolerate a loss of up to n k blocks. The ratio k / n is called the rate of the code. In the context of storage, people often use the reciprocal, indicating storage overhead n / k instead.
Non-MDS erasure codes with better repairability properties have been heavily researched [5,6] and even deployed in some practical systems [1]; however, as an initial study on the topic of quorum systems for erasure coded data, we consider only MDS codes, both for keeping the study relatively simpler, as well as because they continue to be widely used [2,3].
Example 1.
The ( n , 1 ) repetition code that maps x 1 to ( x 1 , , x 1 ) is an MDS code, x 1 may be recovered as long as at least one out of its n copies is available. This means the storage system is keeping replicas of the data to be stored. The ( n , n 1 ) parity check code that maps ( x 1 , , x n 1 ) to ( x 1 , , x n 1 , x 1 + x n 1 ) is another example. Only one erasure is tolerated. The most popular class of MDS codes is called Reed-Solomon codes. They may be defined by polynomial evaluation: coefficients of the polynomial are formed from the data to be encoded, and codewords are obtained by evaluating this polynomial.

2.2. Quorum Systems

Definition 1.
Given n nodes, a quorum system Q is a set of subsets of nodes called quorums, such that every two quorums intersect, i.e., Q Q for all Q , Q Q .
We will label the set of n nodes by { 1 , , n } .
Example 2.
For example, the smallest quorum system consists of just one quorum, itself consisting of one node: Q s i n g = { { i } } , for some i in { 1 , , n } . It is called the singleton quorum Q s i n g . The majority quorum system Q m a j is defined to be all quorums of size n 2 + 1 . For example, if the set of nodes is { 1 , 2 , 3 , 4 } , then n 2 + 1 = 3 , and Q m a j = { { 1 , 2 , 3 } , { 1 , 2 , 4 } , { 1 , 3 , 4 } , { 2 , 3 , 4 } } .
Quorums are used for maintaining consistency of data that is being read and written to, by multiple processes. In order to read or write data, and, respectively, read or write locks on, a quorum of nodes must first be obtained. If the operation is a write, a quorum Q Q is formed for the write operation to be performed, and, while the operation is being processed, no other operation ought to be carried out, preventing reading of stale information, or for multiple writes to overwrite over each other. This mutual exclusion is achieved using quorums and write locks, since any other operation would need to likewise acquire another quorum of nodes, which cannot be obtained since every Q intersect with Q but write locks are treated to be exclusive. Multiple read operations can, however, be allowed, even with intersecting quorums, by associating read locks that are not mutually exclusive, distinct from write locks which are exclusive. Furthermore, whenever a write operation is involved, a subsequent read or write operation is guaranteed to know of the latest update because of the intersecting quorums and the mutual exclusion achieved through the locking process. Since no other operations are possible when a write operation is being carried out (because of the mutual exclusion achieved with write locks), and because any future process will necessarily include at least one node with the latest written value (from the intersection property of the quorums), the latest update will be visible to any future read or write accesses.
The load of a node depends on how often it is accessed. For every quorum Q Q , an access probability P S ( Q ) is defined, for S an access strategy. By definition, Q Q P S ( Q ) = 1 . Then, the load L S ( i ) of node i using the access strategy S is
L S ( i ) = Q Q i Q P S ( Q )
so that the load induced by S on the system is the load imposed by the busiest node:
L S ( Q ) = max i { 1 , , n } L S ( i ) .
The load of a quorum system Q is the minimal load, across all possible access strategies that can be used.
Example 3.
For the load of the busiest node, we have L ( Q s i n g ) = 1 , since for Q = { { i } } , for some i in { 1 , , n } , we have L S ( i ) = 1 . For L ( Q m a j ) > 1 / 2 , since, in a majority quorum system, all subsets of { 1 , , n } of size n 2 + 1 are included, which means (we present the argument for n even) that a given node i will belong to n 1 n 2 quorums out of the n n 2 + 1 quorums. Thus, a uniform access probability gives
( n 1 ) ! ( n 2 ) ! ( n 2 1 ) ! ( n 2 + 1 ) ! ( n 2 1 ) ! n ! = n 2 + 1 n = 1 2 + 1 n > 1 2 .
Continuing Example 2, with Q m a j = { { 1 , 2 , 3 } , { 1 , 2 , 4 } , { 1 , 3 , 4 } , { 2 , 3 , 4 } } , a uniform access probability means P S ( Q ) = 1 4 for every Q Q ; thus, L S ( 1 ) = P S ( { 1 , 2 , 3 } ) + P S ( { 1 , 2 , 4 } ) + P S ( { 1 , 3 , 4 } ) = 3 4 , and, similarly, L S ( i ) = 3 4 for i = 1 , 2 , 3 , 4 .

3. Grid Quorums

In the following, we consider logical grid layouts decoupled from the physical data placement of data and parity or the physical configuration of the storage nodes in the distributed system.

3.1. Quorum Systems and Erasure Coded Data

The definition of a quorum system (see Definition 1) considers all nodes to play the same role, since replicated data is stored (thus, every node stores the same thing). For erasure coded data, we actually have two types of nodes: those storing the actual data blocks (nodes 1 , , k ) and those storing the parities (nodes k + 1 , , n ). When two quorums intersect, it could thus be a priori on either data nodes or parity nodes. We add the requirement that in every intersection of two quorums, there is at least one parity node. The reason is as follows: we consider each data block to be independent of each other, hence updates on one does not alter other data blocks, while it does affect parities, which, in a MDS code, are linear combinations of all data blocks. Thus, whenever any data block is updated, parities need to be updated correspondingly. The requirement that there is always some parity node in intersection of any two quorums is, thus, formally stated as:
p Q Q   for   all   Q , Q Q   and   p { k + 1 , , n } .
When an update is carried out, the write lock is released only after every parity in the quorum acquired to carry out the write operation is already updated. The parities outside the quorum too need to be updated, but this is let to happen in the background. It can be carried out efficiently using a standard technique using differentials [7,8,9,10], which is outside the scope of this work.
The quorum mechanism further needs to ensure that at least one of the latest updated parities will be present in any subsequent quorum:
p Q t Q t + 1   for   all   Q t , Q t + 1 Q   and   p { k + 1 , , n } .
Recall that we want to achieve sequential consistency [11] using the proposed mechanism. Here, t indicates a logical marker corresponding to the t-th snapshot view of the system. t + 1 accordingly refers to the system at the conclusion of the next execution of any operation. Q t and Q t + 1 accordingly refers to the quorums invoked by the operations that led the system to reach the corresponding sequentially consistent states. The above invariant should hold irrespective of whether the background update propagation process is completed, in order to ensure that a process can identify the latest data and does not inadvertently obtain stale data, thus facilitating sequential consistency.
Lemma 1.
Property (5) follows from (4).
Proof. 
Suppose quorum Q t has been used for an update at time t. Then, all parities involved in Q t got updated, and since any quorum Q t + 1 intersects Q t in such a parity p, property (5) follows. □
We illustrate Lemma 1 with example instances of grid quorums on the left of Figure 2. Later, in Section 3.3 we will formally describe grid quorums for coded data. We will then also demonstrate that indeed grid quorums always satisfy the necessary property (4), but for the moment, we focus on providing an example to illustrate how (5) follows from (4). The upper triangular part of the grid comprises parity blocks, data blocks are found strictly below the diagonal. Two particular instances of quorums—one comprising data block x i and some parity blocks, and another comprising data block x j and some parity blocks—are shown. These two specific instances do have one parity block (boxed for emphasis) in common, satisfying (4). The intersection of a parity across the two write quorums would ensure that any process carrying out a new write operation will become aware of the previous update, and will be able to (and need to) incorporate that update along with its own update at all the parity nodes in its own write quorum. Note that we assume that any node which has incorporated an update also stores certain meta-information, particularly the logical time-stamp of the update [18] to enable the identification of the latest version, and also stores the previous versions (or differentials) for a period of time, until a given version is propagated to all the other storage nodes in the system, before garbage collecting obsolete versions. We next provide a sketch of how this ensures sequential consistency (shown in the right panel of Figure 2 for the data block x i ).
Without loss of generality, suppose a process P 1 asks for a write of the block x d (w( x d )a, d { 1 , , k } ) and acquires a write quorum Q t first, while process P 2 asks for a write of the same block x d (w( x d )b) and acquires a quorum Q t for P 2 subsequently. If this second write quorum Q t is created before any of the read quorums to process the read requests from processes P 3 and P 4 , then it will result in a wall-clock ordered sequence as shown in the top right quadrant of Figure 2. As an alternate scenario, it is also possible that, when the locks for Q t are released after P 1 completes w( x d )a, a different pending operation obtains a quorum before P 2 can carry out the next write operation. For instance, the first read operation by process P 3 may be carried out ahead of P 2 ’s write operation, leading to r( x d )(a) at P 3 occurring before w( x d )b by P 2 , while all the other read operations follow this second write operation, leading to the scenario shown at the bottom right quadrant of Figure 2. Both of these scenarios satisfy sequential consistency. We used updates to the same data object x d to elaborate how sequential consistency is achieved, for keeping the exposition simple. However, in general, all the write operations will account for the immediately preceding write operation (Lemma 1) on any arbitrary data object and update the parities in its quorum accordingly. The immediately next update will again transitively have access to these updates. Hence, the global ordering will be determined based on the sequence in which the quorums are acquired, yielding sequential consistency. To wrap up the current example, we emphasize that though we used a particular example to demonstrate the ideas, the arguments advanced here hold for arbitrary quorum systems satisfying (4); thus, (5).
We will next provide a description of grid quorums for replicated data, before delving into the the design of grid quorums for erasure coded data.

3.2. Basic Grid Quorums

We recall what are basic grid quorums used over replicas.
Definition 2
(Reference [13] (Section 3.2)). Suppose n is an integer. Then, the n nodes are arranged into a square grid of edge length n . A basic grid quorum system Q g r i d consists of n quorums, each formed by a full row and a full column of the grid, such that rows and columns are not repeated.
In Figure 3, n = 36 nodes are arranged to form a square grid of edge length n = 6 . An example Q of quorum systems is Q = { Q 1 = [ 1 , 1 ] , Q 2 = [ 2 , 3 ] , Q 3 = [ 3 , 5 ] , Q 4 = [ 4 , 2 ] , Q 5 = [ 5 , 6 ] , Q 6 = [ 6 , 4 ] } , where the notation [ i , j ] means the union of row i and column j. The quorums Q 2 and Q 3 are shown, respectively, in yellow and in blue.
In a basic grid quorum system, the size of each quorum is 2 n 1 (there are n nodes on each row and column, including one node in their intersection), and two distinct quorums intersect exactly in two nodes. When the access strategy S is uniform, the probability of access is P S ( Q ) = 1 n . From (2), the load of node i for S uniform is
L S ( i ) = Q Q i Q 1 n = | Q Q , i Q | n .
Since every quorum Q is the union of one row and one column, and every node i is given a unique (row, column) allocation in the grid, say i is at position ( i r , i c ) , then i can either appear in two quorums ( [ i r , j ] and [ l , i c ] for some j l ), or once (if [ i r , i c ] Q ). Thus,
L S ( i ) = 1 n i f [ i r , i c ] Q 2 n otherwise .
This holds for every node i and from (3), L S ( Q g r i d ) = max i { 1 , , n } L S ( i ) , so L S ( Q g r i d ) = 2 n for S uniform.
It is known (Reference [13] (Theor. 3.18)) that the minimal load of a quorum system is lower bounded by 1 n , and that (Reference [19] (Prop. 4.8)) the minimal load of a quorum system where every quorum has the same size s is given by s / n . Since s = 2 n 1 for Q g r i d , this gives L ( Q g r i d ) = min S L S ( Q g r i d ) = 2 n 1 n 2 n .
In the basic grid construction, two quorums Q i = [ i , j ] and Q l = [ l , m ] always intersect in two points ( [ i , m ] and [ l , j ] ). It is possible to reduce the size of the intersection to one point, as follows [20]. Once the row i of Q i = [ i , j ] is fixed, instead of including all nodes in the jth column, we take instead exactly one node from each row larger than i (the case where every node in which its row is larger than i, but is in the jth column itself is shown in Figure 3). The difference with the previous grid quorum is that only one out of the two intersection points ( [ i , m ] and [ l , j ] ) is present, and the quorums are usually smaller.

3.3. Basic Grid Quorums for Coded Data

To obtain a basic grid quorum system for coded data, the first step we propose is to choose a (logical) grid layout that distinguishes data nodes from parity nodes: a specific such layout, where the data blocks are in the lower part of the grid, below and including the diagonal, and the parity blocks are above the diagonal, is shown in Figure 4. This layout assumes that the code maps k = n n + 1 2 data symbols to n encoded ones for n a square, that is, the rate of the code is 1 2 ( 1 + 1 n ) . There are three natural ways to define quorums based on this layout, as shown in Figure 4: (i) quorums are unions of row i and column i (on the left), (ii) quorums are unions of row i and truncated column i (in the middle), (iii) quorums compromise of a node on row i together with truncated row i and truncated column i which include parity nodes exclusively beside a single data node. For our discussions and analysis, we will consider (i) and (iii), which are formally defined next as Q g r i d 1 and Q g r i d 2 , respectively.
Definition 3.
Given an ( n , k ) code where n is an integer and k = n n + 1 2 , suppose that the n nodes are arranged into a square grid of edge length n such that the k data symbols are placed below and on the diagonal of the square (in positions ( i , j ) with i j ), and the parities are placed above. Two basic grid quorums for coded data are given by
Q g r i d c o d 1 = { Q i = [ i , i ] , i = 1 , , n } Q g r i d c o d 2 = { Q i , j = { ( i , j ) } { ( i , c ) , c > i } { ( r , i ) , r < i } , i j , i = 1 , , n } .
Every quorum in Q g r i d c o d 2 has size 2 n 1 , which is larger than that of quorums in Q g r i d c o d 2 which is n . But then the cardinality of Q g r i d c o d 1 is n , while that of Q g r i d c o d 2 is k = n n + 1 2 .
Lemma 2.
Both quorum system Q g r i d c o d 1 and Q g r i d c o d 2 satisfy properties (4) and (5).
Proof. 
It is enough to prove the first property. For Q g r i d c o d 1 , given i and j ( i j ), the quorums Q i = [ i , i ] and Q j = [ j , j ] intersect at ( i , j ) and ( j , i ) , and the former or latter, respectively, contains a parity (above the diagonal) when i < j and j < i . For Q g r i d c o d 2 , given that it differs from Q g r i d c o d 1 only on the data blocks, the parity blocks still intersect in the same manner. □
Examples are provided in Figure 4 for an ( n , k ) code with n = 36 and k = 21 . Compared with a grid quorum for replicas, there are few choices for defining this quorum system given the specificities of data and parity block layout: not any choice of pairs of (row, column) works. For example, choosing Q 6 = [ 6 , 1 ] would not contain any parity, and, while Q 6 = [ 6 , 2 ] does contain a parity, it would not intersect Q 5 = [ 5 , 3 ] on a parity. In fact, suppose Q 6 = [ 6 , 2 ] is one quorum, then one cannot find any Q m = [ m , l ] for any m 1 that would intersect Q 6 at a parity (and not repeating any column or row), since the row in Q 6 comprises only data.
We next consider the load of Q g r i d c o d 1 and Q g r i d c o d 2 , for which we need to introduce an access probability.
Since we have two types of nodes, those storing parities and those storing data blocks, we consider a different access probability P S ( ( i , j ) ) depending on whether i > j ( P S ( ( i , j ) ) = P p for parities) or i j ( P S ( ( i , j ) ) = P d for data). When deciding to form a quorum for a given node, if this node belong to several quorums, then any quorum is equally likely to be called.
Proposition 1.
Given the above setting, for any choice of P p and P d , the quorum access probability P S ( Q ) is uniform for Q Q g r i d c o d 1 . Consequently, the quorum system Q g r i d c o d 1 has load
L S ( Q g r i d c o d 1 ) = 2 n .
Proof. 
Given the adopted layout, row i of the grid contains i data blocks and n i parities while column j contains j 1 parities and n j + 1 data blocks. Therefore, the quorum Q i = [ i , i ] is formed of ( i + n i + 1 ) 1 = ( n + 1 ) 1 = n data blocks ( 1 accounts for the fact that row i and column i intersect on the diagonal which contains a data block, that should not be counted twice).
For a node located in position ( i , j ) , the access probability of a quorum Q i Q g r i d c o d 1 depends on whether it is invoked, or Q j Q g r i d c o d 1 is invoked instead. We assume both are equally likely (introducing a probability factor of 1 2 for ( i , j ) for the terms in the computation of P S ( Q i ) ). If j = i , Q i is necessarily called. Hence, we obtain:
P S ( Q i ) = P S ( ( i , i ) ) + 1 2 j = 1 , j i n ( P S ( ( i , j ) ) + P S ( ( j , i ) ) ) = P d + 1 2 ( n 1 ) ( P p + P d ) .
For sanity check, i P S ( Q i ) = n P d + n 2 ( n 1 ) ( P p + P d ) = k P d + ( n k ) P p since k = n 2 ( n + 1 ) and n k = n 2 ( n 1 ) . Since P S ( Q i ) does not depend on Q i , we have thus shown that
P S ( Q ) = 1 n
since we have n quorums and each are equally likely.
From (2), the load of node i for S uniform is
L S ( i ) = Q Q i Q 1 n = | Q Q , i Q | n = 1 n   or   2 n ,
since node i belongs to at most two quorums (one, when node i is on the diagonal). □
Recall for comparison that, for S uniform, L S ( Q g r i d ) = 2 n 2 n 1 n , and thus L S ( Q g r i d ) = L S ( Q g r i d c o d ) , we have the same load for grid quorums with replicas as with other MDS erasure coding strategies. Note that the lower bound 1 / n still holds for the the case where coded data is stored, since (i) requiring property (4) is a particular case of symmetric quorums, and (ii) setting P d = P p reduces to P S .
Proposition 2.
Given any choice of parity access probability P p and data access probability P d , the quorum access probability P S ( Q i , j ) for Q i , j Q g r i d c o d 2 is given by
P S ( Q i , j ) = P d + P p c = i + 1 n 1 i + c + r = 1 i 1 1 i + r .
Consequently, for n 4 , the quorum system Q g r i d c o d 2 has load
L S ( Q g r i d c o d 2 ) = ( 2 n 1 ) P d + P p j = 1 j n 1 n n 1 n 1 + j + j = 1 j n n n n + j .
Proof. 
Since
Q g r i d 2 c o d = { Q i , j = { ( i , j ) } { ( i , c ) , c > i } { ( r , i ) , r < i } , i j , i = 1 , , n } ,
the probability of accessing P ( Q i j ) is given by
P S ( Q i , j ) = P d + P p c = i + 1 n 1 i + c + r = 1 i 1 1 i + r .
For sanity check, j i P S ( Q i , j ) = k P d + P p i = 1 n i c = i + 1 n 1 i + c + r = 1 i 1 1 i + r , and the factor of P p simplifies to i = 1 n i j = 1 n 1 i + j 1 2 i = i j i i + j where i , j range from 1 to n . This sum equals to n k because terms in the sum can be grouped into pairs of the form ( i i + j , j i + j ) , in which the sum is 1, and there are n k such terms.
Thus, using (2), the load of node ( i , j ) is
L S ( ( i , j ) ) = Q Q ( i , j ) Q P S ( Q ) ;
hence, for a node ( i , j ) storing a data block, which belongs to a single quorum, we get
L S ( ( i , j ) ) = P d + P p c = i + 1 n 1 i + c + r = 1 i 1 1 i + r = P d + P p j = 1 j i n 1 i + j ,
and the busiest of the nodes storing data is ( 1 , 1 ) : indeed,
L S ( ( i , j ) ) L S ( ( i + 1 , j ) ) j = 1 j i n 1 i + j j = 1 j i + 1 n 1 i + 1 + j .
When i = 1 , the right-hand sum contains the terms 1 / 3 , 1 / 4 , 1 / 5 , , 1 / ( 1 + n ) , while the left-hand sum contains 1 / 3 , 1 / 5 , , 1 / ( 2 + n ) . Thus, the inequality reduces to 1 / 4 1 / ( 2 + n ) , which holds for n 4 . When i 2 , the right-hand sum contains 1 / ( i + 1 ) , 1 / ( i + 2 ) , , 1 / ( 2 i 1 ) , 1 / ( 2 i + 1 ) , , 1 / ( i + n ) , while the left-hand sum contains 1 / ( i + 2 ) , 1 / ( i + 3 ) , , 1 / ( 2 ( i + 1 ) 1 ) , 1 / ( 2 ( i + 1 ) + 1 ) , , 1 / ( i + 1 + n ) . Now, the above inequality reduces to 1 / ( i + 1 ) + 1 / 2 ( i + 1 ) 1 / 2 i + 1 / ( i + 1 + n ) , which holds term by term.
If ( r , c ) is storing a parity block, then ( r , c ) belongs to r + c quorums; more precisely, it belongs to r quorums Q r , j , j r and c quorums Q c , j , j c . Thus,
L S ( ( r , c ) ) = j = 1 r P S ( Q r , j ) + j = 1 c P S ( Q c , j ) = r P d + r P p j = 1 j r n 1 r + j + c P d + c P p j = 1 j c n 1 c + j .
We have that L S ( ( 1 , 1 ) ) L S ( ( r , c ) ) , r c :
P d + P p j = 2 n 1 1 + j ( r + c ) P d + P p r j = 1 j r n 1 r + j + c j = 1 j c n 1 c + j .
Indeed, P d ( r + c ) P d and
j = 2 n 1 1 + j r j = 1 j r n 1 r + j + c j = 1 j c n 1 c + j
because this equality is clearly true if r = 1 for any choice of c (then the first sum on the right-hand side is the sum of the left-hand side), and we have
r j = 1 j r n 1 r + j ( r + 1 ) j = 1 j r + 1 n 1 r + 1 + j = r l = 2 l r + 2 n + 1 1 r + l + j = 1 j r + 1 n 1 r + 1 + j = r l = 1 l r n 1 r + l r r + 1 + r r + n + 1 + r 2 r r 2 r + 2 + j = 1 j r + 1 n 1 r + 1 + j .
But, j = 1 j r + 1 n 1 r + 1 + j n 1 r + n + 1 gives
r + n 1 r + n + 1 + 1 2 3 r 2 ( r + 1 ) 0 3 2 2 r + n + 1 + 3 r 2 ( r + 1 ) ,
which holds. This last computation further shows that r j = 1 j r n 1 r + j is increasing as a function of r; thus, the busiest node containing a parity is obtained by maximizing both r and c; that is, r = n 1 and c = n . □
From the layout of the data in the grid, intuitively, one would expect the lower most parity node to be most accessed, since it will be part of all the quorums invoked to access any of the data nodes situated in the last two rows of the grid, and these rows are most numerous in terms of data nodes. The analysis above formally confirms this intuition, and provides a closed-form formula to quantify the load. Realistically, one would expect that there will be explicit access (ignoring the access of nodes because of their participation in any quorum) of data nodes much more frequently than the parity nodes, i.e., P p < < P d . In the extreme case, when P p = 0 , we will have P d = 1 k = 1 n 2 n + 1 . Then, for the access strategy S for which P p = 0 and P d is uniform, L S ( Q g r i d c o d 2 ) = ( 2 n 1 ) P d = 2 n 1 n 2 n + 1 4 n > 2 n = L S ( Q g r i d c o d 1 ) = L S ( Q g r i d ) .
If we relax the condition of using nodes from more than one row and one column to get a quorum, say by allowing the choice of nodes in a quorum from multiple columns, akin to Reference [20] (of which, the variant discussed above is a special case), then other quorum systems can also be identified. This leads to a generalized construction, called B-grid, that we discuss next.

3.4. B-Grid Quorums for Coded Data

We recall the definition of B-grid quorums for replicated data.
Definition 4
(Reference [19] (5.2)). Suppose that n is of the form c b r and the n nodes are arranged in a rectangular grid of b r rows and c columns, where rows are grouped into b bands of r rows, where band j contains rows ( j 1 ) r + 1 , , j r for j = 1 , , b . Denote the intersection of column c and band j as mini-column [ [ j , c ] ] . A quorum in the B-grid system Q B g r i d consists of one mini-column in every band, and a representative element in each mini-column of one band. A B-grid quorum system comprises of multiple independent B-grid quorums. In particular, mini-columns and the one band from which representative elements are chosen, are independent.
In Figure 5, we show two quorums of a B-grid quorum system with three bands. The same argument used to derive the load of Q g r i d may be applied here, namely, since every quorum has the same size s = b r + c 1 , the minimal load is (Reference [19] (Prop. 4.8)) b r + c 1 b r c = n / c + c 1 n = 1 c + c 1 n . Note that, consequently, when we have a square grid, i.e., c = b r = n , the load of the B-grid 2 n .
Since quorums in Q B g r i d intersect in the mini-columns, this suggests a possible adaptation in the context of erasure coded data as follows.
Definition 5.
Given an ( n , k ) code, suppose that the n = c b r nodes are arranged in a rectangular grid of b r rows and c columns, where rows are grouped into b bands of r rows. Let C = { C 1 , , C b } be a fixed set of subsets of mini-columns, where C i = { [ [ i , c ( i , β ) ] ] , β { 1 , , b } } contains some chosen mini-columns in band i, and | C i | = b for i = 1 , , b . Place r b 2 parities in the mini-columns specified by C, and the k = n r b 2 data blocks elsewhere. A quorum Q i j in the quorum system Q B g r i d c o d consists of the b mini-columns [ [ 1 , c ( 1 , i ) ] ] , [ [ b , c ( b , i ) ] ] and of a choice of c 1 elements in the band i such that there is exactly one element per mini-column. The index j of the quorum refers to the jth choice out of the r ( c 1 ) choices to choose one element from a mini-column of r elements, for each of the c 1 columns.
In the proposed B-grid quorum for coded data, each quorum comprises c b data nodes, and b r + b 1 parity nodes. Moreover, in this setup, the code rate is n b 2 r b r c = b r c b 2 r b r c = 1 b c , so we need to assume c > b . This means, a quorum system for a code with arbitrarily high rate can be realized using B-grid.
In Figure 6, an example is shown with b = 3 , C = { C 1 , C 2 , C 3 } , with C 1 = { [ [ 1 , 1 ] ] , [ [ 1 , 5 ] ] , [ [ 1 , 7 ] ] } , C 2 = { [ [ 2 , 2 ] ] , [ [ 2 , 4 ] ] , [ [ 2 , 6 ] ] } , C 3 = { [ [ 3 , 3 ] ] , [ [ 3 , 5 ] ] , [ [ 3 , 7 ] ] } . In addition, Q 1 j (in blue) contains the mini-columns [ [ 1 , 5 ] ] , [ [ 2 , 2 ] ] , [ [ 3 , 7 ] ] (indicated by vertical blue lines), and one choice for j is represented by dotted blue lines. Similarly, Q 2 j and Q 3 j are shown in yellow and pink.
Lemma 3.
The quorum system Q B g r i d c o d satisfies properties (4) and (5).
Proof. 
It is enough to prove property (4). A quorum Q i j must contain an element per mini-column in band i; therefore, it necessarily intersects the mini-columns [ [ i , c ( i , β ) ] ] for β = 1 , , n , and thus the other quorums, and this intersection happens in parities, since by construction the parities are placed in these mini-columns. □
Proposition 3.
Considering the same setting as for Q g r i d c o d , for any choice of P p and P d , the corresponding quorum access S is uniform. Consequently, the quorum system Q B g r i d c o d has load
L S ( Q B g r i d c o d ) = 1 b 1 + 1 r .
Proof. 
Since the load is defined for the busiest nodes, we only consider nodes that are in mini-columns. The probability of accessing a quorum in Q B g r i d c o d is
1 b r ( c 1 ) .
A node (say in band j) experiences the maximum load in the system when it is part of the mini-column for some quorum Q i l ; furthermore, it will be a representative element for the mini-column for the quorum Q j l . Since there are b r ( c 1 ) quorums Q j l , the corresponding contribution to the load is 1 b , while, the load due to the latter is 1 r 1 b , leading to a total load of 1 b ( 1 + 1 r ) . □
To compare the computed load with the square grid from Section 3.3, suppose that b r = c = n . The load can then be rewritten as 1 b 1 + 1 r = 1 + r n . Then, if r = 1 , the load is indeed 2 n , effectively reducing the system to the basic square grid performance in terms of load (as expected). However, in terms of grid layout, this reduction only works in the case where the erasure code is replication.
Next, we compare the loads for Q B g r i d and Q B g r i d c o d :
L S ( Q B g r i d ) = 1 c + c 1 n = b r 1 n + c n L S ( Q B g r i d c o d ) = 1 b + 1 b r = 1 b + c n
and
1 b 1 b b r 1 c r b r 1 c r 1 b c 1 c r 1 ,
which is always the case, since c > b . This shows that there is a cost to pay in terms of load to use erasure codes, and the relative cost is given by
γ = L S ( Q B g r i d c o d ) L S ( Q B g r i d ) = c ( r + 1 ) b r 1 + c = r + 1 ( 1 + b c ) r 1 c + 1 + r .
The function γ is illustrated in Figure 7. The rate 1 b c is shown on the x-axis. Small values of c are chosen (3 and 4); then, values of b smaller than c are considered, yielding possible rates (shown by stars). Then, for each value of c, several values of r are fixed ( r = 1 , 2 , 3 ), these choices of parameters yield six piecewise linear functions. We observe that, when r increases, so does the load of the coded quorum system compared to the load of the uncoded one. However, for a given r, we also observe that increasing c decreases the load.
From the view point of definition of load in a quorum system, our analysis indicates that the parity nodes are the busiest. Its practical implications, and the efficacy of B-grid quorum systems for erasure coded storage, need to be explored accordingly. However, we note that, for read operations, the actual work (disk I/Os) will be carried out at the nodes storing data blocks, while the parity nodes will carry out further operations beyond ‘voting’ for mutual exclusion of conflicting operations, only for write operations, where the parity values are updated. System implementation accompanied with rigorous benchmarking with realistic workloads is pending; however, since erasure coding is typically used for data that is not hot, i.e., data that is not being frequently written to, the proposed quorum mechanism looks promising for practical usage.

4. Consistency in Presence of Node Failures

Consider a write operation w( x d )a on a data object x d , 1 d k at a given time t. Any read or write operation involving x d occurring at time t > t ought to read r( x d )a, assuming there has been no write operation involving x d in the interim. Two distinct situations arise:
(1)
Node d is available: unlike in replicated systems, there is only a single node (node d) which stores x d , 1 d k ; thus, w( x d )a does update x d and subsequent read/write operations will obtain the updated value a from x d . That some of the parity nodes may not have received all updates possibly involving other data objects has no bearings in obtaining the latest value of x d .
(2)
Node d is unavailable: if the node storing x d is temporarily or permanently down, a read or write request will trigger a degraded read or write operation instead, where x d is obtained using other data blocks and parities, in which case, it is critical that parities involved in the degraded operation are up-to-date with respect to x d . This scenario also needs to take into account that (i) other nodes than d might have been updated in the interim, leading to changes in nodes storing parities, and/or (ii) other nodes may be unavailable.
From Lemmas 2 and 3, we established that sequential consistency is achieved in the first case. We next discuss the second case, namely the consequences of node failures, and demonstrate how sequential consistency is still achieved. We adopt a level of abstraction that assumes there are mechanisms to deal with locks so that quorums are eventually secured for write and read operations, ensuring the liveliness of the system, i.e., that it does not get in a state where it cannot progress. Accordingly, we focus on the issue of safety, specifically that the sequential consistency invariant is always met.
The data redundancy is achieved using an ( n , k ) MDS erasure code, which has the property that one can reconstruct any missing codeword coefficient from any k other coefficients. When node d is unavailable, for 1 d k , a degraded read will try to read from k available other nodes, at least one of them must be a parity. Parities called for degraded operations at time t > t must be up-to-date with respect to x d . This requirement is guaranteed if the parities used belong to the quorum used to permit the operation w( x d )a at time t. This ensures the parity nodes carry the information with respect to the latest value of x d without having to rely on the propagation of updates to other parity nodes, which runs as a background process. There are n 1 parity nodes in a basic grid quorum (and b r + b 1 for the B-grid) involved in any quorum.
Consider the baseline case, where all the other data objects are also available. Thus, we need only one parity to recreate the content of the unavailable node d. In this set-up a degraded operation is possible as long as one parity is available, which can tolerate n 2 ( b r + b 2 for B-grid) unavailable parities. We also note that this parity will be up-to-date with respect to any other x d , d d : indeed, if this parity is not yet up-to-date, it can first be updated before using it to reconstruct x d since we assume that only a single data node, namely d, is unavailable. In this case, using k 1 data objects x d , d d , 1 d k together with an up-to-date parity from the quorum of x d thus guarantees that the latest value of x d is computed as
x d = a p d 1 ( x p l = 1 l d k a p l x l )
using x p = l = 1 k a p l x l from (1).
Suppose now that not only node d is unavailable which we would like to read, but say also node d , storing the data object x d is down. We will then need more than one parity for reconstructing x d . Since we want to access x d in node d which was last updated at time t, by the above discussion, we need two parities in its quorum, which are assured to be up-to-date with respect to x d . We would ideally need these two parities to be up-to-date with respect to all other data nodes. If these are not up-to-date with respect to the data nodes other than x d , since these other data nodes themselves are available, the parities can be updated to reflect the latest value of the available data nodes. Using two up-to-date parities x p and x q , we get:
x p = l = 1 l d , d k a p l x l + a p d x d + a p d x d x q = l = 1 l d , d k a q l x l + a q d x d + a q d x d ,
so we multiply the first equation by a q d and the second by a p d and compute the difference of the two terms, which yields
a q d x p a p d x q = l = 1 l d , d k α l x l + ( a q d a p d a p d a q d ) x d
for α l the corresponding coefficients. Then, x d is found from this equation since we know x p , x q and x l for l d , d , and only x d , x d are assumed unavailable. Specifically, we have
x d = ( a q d a p d a p d a q d ) 1 ( a q d x p a p d x q l = 1 l d , d k α l x l ) .
Notice that the above computation of x d only assumes that x d has the same value in x p and x q . Therefore, even if the data node x d needed for the update is down, it is actually enough to find two parity nodes which reflect the same (possibly stale) value of x d , for being able to perform a degraded read and recreate the latest value of x d correctly.
Finally, it is possible that a different data node x d has been updated subsequent to the last update to x d , but in the interim, the unique parity node (We consider the case where the data nodes are from different rows in the grid/B-grid layout, such that they have distinct set of parities in their quorums, with a single intersecting parity. If they have multiple common parities in the intersection of their quorums, then the considered problem does not arise, since all these parities in the intersection would carry information regarding the latest values for both x d and x d .) at the intersection of the quorums required to read or write x d and x d becomes unavailable. Again, as above, if we use any two parities from the quorum for x d which reflect the same (latest or stale) value of x d , and exclude x d irrespective of whether it is available or not, then we can reconstruct the latest value of x d in the same manner, i.e., using (7), as the above scenario.
Note that, while we have not explicitly discussed a truncated B-grid, similar to the truncated version of basic grid, where quorums were formed comprising only single data object and the parity nodes determined based on the row the data object belonged to, one can also truncate the B-grid quorums. Doing so will not alter the fault tolerance or consistency discussed above.

5. Concluding Remarks

This study is the first of its kind, in synthesizing the concept of quorum systems with erasure coded storage systems and showing the feasibility of grid quorums in that context. It creates a stepping stone for further studies on the fault-tolerance and availability of the proposed quorum systems, including: (1) access of the quorums for repair operations and degraded read operations, (2) new quorum systems for erasure coded systems with better characteristics, in terms of practical requirements, such as load, coding rate, and fault-tolerance, and (3) taking into account the asymmetric role of data and parity nodes to possibly define quorum load in a more meaningful manner. The design of practical algorithms, considering system design issues, including background update propagation, mapping the logical layout to the physical layout, replacement of permanently down nodes, garbage collection of stale information, and the overlying file system, are numerous aspects which will also need attention once the conceptual foundations mature, to translate the ideas into a working system.
Furthermore, in the context of storage systems, non-MDS codes but with better repairability properties have been proposed [5] and are also deployed [1] in real-world systems, where (some) parities have certain locality properties, such that they do not depend on all the data blocks. As such, the constraint from (4) may not be adequate when such codes are used, and further studies are needed. For example, it would be interesting to consider the joint design of codes with good repairability and efficient quorum mechanisms.

Author Contributions

Conceptualization, F.O. and A.D.; methodology, F.O. and A.D.; validation, F.O. and A.D.; formal analysis, F.O. and A.D.; investigation, F.O. and A.D.; resources, F.O. and A.D.; writing—original draft preparation, F.O. and A.D.; writing—review and editing, F.O. and A.D.; visualization, F.O. and A.D.; project administration, A.D.; funding acquisition, A.D. All authors have read and agreed to the published version of the manuscript.

Funding

Anwitaman Datta’s work has been supported by Singapore Ministry of Education (MOE) grant award number: 002180-00001 (Funder Reference Number: RG134/18) for the project titled ‘StorEdge: Data store along a cloud-to-thing continuum with integrity and availability’.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Huang, C.; Simitci, H.; Xu, Y.; Ogus, A.; Calder, B.; Gopalan, P.; Li, J.; Yekhanin, S. Erasure coding in windows azure storage. In Proceedings of the USENIX Annual Technical Conference (ATC), Boston, MA, USA, 13–15 June 2012. [Google Scholar]
  2. Muralidhar, S.; Lloyd, W.; Roy, S.; Hill, C.; Lin, E.; Liu, W.; Pan, S.; Shankar, S.; Sivakumar, V.; Tang, L.; et al. f4: Facebook’s Warm BLOB Storage System. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), Broomfield, CO, USA, 6–8 October 2014. [Google Scholar]
  3. Lai, C.; Jiang, S.; Yang, L.; Lin, S.; Sun, G.; Hou, Z.; Cui, C.; Cong, J. Atlas: Baidu’s key-value storage system for cloud data. In Proceedings of the Symposium on Mass Storage Systems and Technologies (MSST), Santa Clara, CA, USA, 30 May–5 June 2015. [Google Scholar]
  4. Chen, Y.L.; Mu, S.; Li, J.; Huang, C.; Li, J.; Ogus, A.; Phillips, D. Giza: Erasure coding objects across global data centers. In Proceedings of the USENIX Annual Technical Conference (ATC), Santa Clara, CA, USA, 12–14 July 2017. [Google Scholar]
  5. Oggier, F.; Datta, A. Coding techniques for repairability in networked distributed storage systems. Found. Trends Commun. Inf. Theory 2013, 9, 383–466. [Google Scholar] [CrossRef] [Green Version]
  6. Balaji, S.; Krishnan, M.N.; Vajha, M.; Ramkumar, V.; Sasidharan, B.; Kumar, P.V. Erasure coding for distributed storage: An overview. Sci. China Inf. Sci. 2018, 61, 1–45. [Google Scholar] [CrossRef] [Green Version]
  7. Esmaili, K.S.; Chiniah, A.; Datta, A. Efficient updates in cross-object erasure-coded storage systems. In Proceedings of the International Conference on Big Data, Santa Clara, CA, USA, 6–9 October 2013. [Google Scholar]
  8. Rawat, A.S.; Vishwanath, S.; Bhowmick, A.; Soljanin, E. Update efficient codes for distirbuted storage. In Proceedings of the IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 31 July–5 August 2011. [Google Scholar]
  9. Peter, K.; Reinefeld, A. Consistency and fault tolerance for erasure-coded distributed storage systems. In Proceedings of the International Workshop on Data-Intensive Distributed Computing, Delft, The Netherlands, 18–22 June 2012. [Google Scholar]
  10. Aguilera, M.K.; Janakiraman, R.; Xu, L. Using erasure codes efficiently for storage in a distributed system. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), Yokohama, Japan, 28 June–1 July 2005. [Google Scholar]
  11. Lamport, L. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Comput. 1979, 28, 690–691. [Google Scholar] [CrossRef]
  12. Tanenbaum, A.S.; Steen, M.V. Distributed Systems, Principles and Paradigms; Pearson, Prentice Hall: Upper Saddle River, NJ, USA, 2007. [Google Scholar]
  13. Vukolić, M. Quorum Systems with Appllications to Storage and Consensus; Morgan & Claypool: San Rafael, CA, USA, 2012. [Google Scholar]
  14. Lamport, L. The part-time parliament. ACM Trans. Comput. Syst. (TOCS) 1998, 16, 277–317. [Google Scholar] [CrossRef]
  15. Mu, S.; Chen, K.; Wu, Y.; Zheng, W. When paxos meets erasure code: Reduce network and storage cost in state machine replication. In Proceedings of the International Symposium on High-performance Parallel and Distributed Computing, Vancouver, BC, Canada, 23–27 June 2014. [Google Scholar]
  16. Cadambe, V.R.; Lynch, N.; Médard, M.; Musial, P. A coded shared atomic memory algorithm for message passing architectures. Distrib. Comput. 2017, 30, 49–73. [Google Scholar] [CrossRef] [Green Version]
  17. Nicolaou, N.; Cadambe, V.; Prakash, N.; Konwar, K.; Medard, M.; Lynch, N. ARES: Adaptive, Reconfigurable, Erasure Coded, Atomic Storage. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–9 July 2019. [Google Scholar]
  18. Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 1978, 21. [Google Scholar] [CrossRef]
  19. Naor, M.; Wool, A. The load, capacity, and availability of quorum systems. SIAM J. Comput. 1998, 27, 423–447. [Google Scholar] [CrossRef]
  20. Malkhi, D. Quorum systems. In Encyclopedia of Distributed Computing; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1999. [Google Scholar]
Figure 1. On the left: A simplified view of a distributed storage system, where nodes store replicas of some data object x. Processes P 1 , P 2 , P 3 , P 4 may ask to read the current value, say c, from x, represented as r(x)c or write the value c to x, represented as w(x)c. They may carry out the operations at any of the replicas. On the right, two forms of consistency are illustrated: a read operation r(x)c or a write operation w(x)c at a given point of the wall-clock time indicates that a process is asking for the corresponding operation on a replica. When it is effectively executed is inferred from the table: under strict consistency (illustrated in the upper right quadrant), the executions follow the same timeline, while under sequential consistency (illustrated in the lower right quadrant), the executions follow some global ordering but need not adhere to the wall clock time at which the operations were invoked by the processes.
Figure 1. On the left: A simplified view of a distributed storage system, where nodes store replicas of some data object x. Processes P 1 , P 2 , P 3 , P 4 may ask to read the current value, say c, from x, represented as r(x)c or write the value c to x, represented as w(x)c. They may carry out the operations at any of the replicas. On the right, two forms of consistency are illustrated: a read operation r(x)c or a write operation w(x)c at a given point of the wall-clock time indicates that a process is asking for the corresponding operation on a replica. When it is effectively executed is inferred from the table: under strict consistency (illustrated in the upper right quadrant), the executions follow the same timeline, while under sequential consistency (illustrated in the lower right quadrant), the executions follow some global ordering but need not adhere to the wall clock time at which the operations were invoked by the processes.
Entropy 23 00177 g001
Figure 2. Update sequence (for the grid-quorums on the left of the figure): at time t, x i is updated using its quorum Q t , in which parities are highlighted in row and column 4 (shown on the left). At time t > t , x j is updated using its quorum Q t , in which parities are highlighted in row 5 and column 5 (on the right). Since the parity node (boxed for highlighting) at coordinate ( 4 , 5 ) is common, process updating x j would know that x i was updated, and the latest value of x i will have to be taken into account during its own update, so that all the parities in Q t reflect not only the latest value of x j , but also the latest value of x i , irrespective of whether they had received the update information regarding x i through the background process prior to the invocation of Q t . Two distinct but valid scenarios of sequential consistency are shown (on the right of the figure) based on different sequences of quorums acquired.
Figure 2. Update sequence (for the grid-quorums on the left of the figure): at time t, x i is updated using its quorum Q t , in which parities are highlighted in row and column 4 (shown on the left). At time t > t , x j is updated using its quorum Q t , in which parities are highlighted in row 5 and column 5 (on the right). Since the parity node (boxed for highlighting) at coordinate ( 4 , 5 ) is common, process updating x j would know that x i was updated, and the latest value of x i will have to be taken into account during its own update, so that all the parities in Q t reflect not only the latest value of x j , but also the latest value of x i , irrespective of whether they had received the update information regarding x i through the background process prior to the invocation of Q t . Two distinct but valid scenarios of sequential consistency are shown (on the right of the figure) based on different sequences of quorums acquired.
Entropy 23 00177 g002
Figure 3. Basic grid quorums for n = 36 nodes: on the left, two quorums (one in yellow and the other in blue) intersect in two points, while, on the right, a variant with smaller quorums is shown, where they intersect in one point.
Figure 3. Basic grid quorums for n = 36 nodes: on the left, two quorums (one in yellow and the other in blue) intersect in two points, while, on the right, a variant with smaller quorums is shown, where they intersect in one point.
Entropy 23 00177 g003
Figure 4. Grid ( 6 × 6 ) layout for an ( 36 , 21 ) code: Data blocks are in the lower triangle (including the diagonal), while the upper triangle has the parities. On the left, an example of two intersecting quorums Q 1 = [ 1 , 1 ] , Q 2 = [ 6 , 6 ] is shown, and they interest in two points, including a parity. In the middle, all the quorums for a variant with smaller quorums (obtained by truncating columns) are shown. On the right, each data block on row i has for quorum the union of itself, the parities on row i and the parities on column i.
Figure 4. Grid ( 6 × 6 ) layout for an ( 36 , 21 ) code: Data blocks are in the lower triangle (including the diagonal), while the upper triangle has the parities. On the left, an example of two intersecting quorums Q 1 = [ 1 , 1 ] , Q 2 = [ 6 , 6 ] is shown, and they interest in two points, including a parity. In the middle, all the quorums for a variant with smaller quorums (obtained by truncating columns) are shown. On the right, each data block on row i has for quorum the union of itself, the parities on row i and the parities on column i.
Entropy 23 00177 g004
Figure 5. A B-grid quorum for n = 48 = c b r nodes, arranged in a grid with c = 8 columns, b r = 6 rows, arranged in b = 3 bars each containing r = 2 rows.
Figure 5. A B-grid quorum for n = 48 = c b r nodes, arranged in a grid with c = 8 columns, b r = 6 rows, arranged in b = 3 bars each containing r = 2 rows.
Entropy 23 00177 g005
Figure 6. A B-grid with ( 48 , 30 ) coded data.
Figure 6. A B-grid with ( 48 , 30 ) coded data.
Entropy 23 00177 g006
Figure 7. On the x-axis, the rate 1 b c . On the y-axis, the relative load γ of the coding based B-grid w.r.to replication as a function of ( r , c ) , for ( r , c ) { ( 1 , 3 ) , ( 1 , 4 ) , ( 2 , 3 ) , ( 2 , 4 ) , ( 3 , 3 ) , ( 3 , 4 ) } .
Figure 7. On the x-axis, the rate 1 b c . On the y-axis, the relative load γ of the coding based B-grid w.r.to replication as a function of ( r , c ) , for ( r , c ) { ( 1 , 3 ) , ( 1 , 4 ) , ( 2 , 3 ) , ( 2 , 4 ) , ( 3 , 3 ) , ( 3 , 4 ) } .
Entropy 23 00177 g007
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Oggier, F.; Datta, A. On Grid Quorums for Erasure Coded Data. Entropy 2021, 23, 177. https://doi.org/10.3390/e23020177

AMA Style

Oggier F, Datta A. On Grid Quorums for Erasure Coded Data. Entropy. 2021; 23(2):177. https://doi.org/10.3390/e23020177

Chicago/Turabian Style

Oggier, Frédérique, and Anwitaman Datta. 2021. "On Grid Quorums for Erasure Coded Data" Entropy 23, no. 2: 177. https://doi.org/10.3390/e23020177

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop