research-article

Public Access

Principled Schedulability Analysis for Distributed Storage Systems Using Thread Architecture Models

Authors:

Suli Yang,

Jing Liu,

Andrea Arpaci-Dusseau,

Remzi Arpaci-DusseauAuthors Info & Claims

ACM Transactions on Storage, Volume 19, Issue 2

Article No.: 17, Pages 1 - 47

https://doi.org/10.1145/3574323

Published: 06 March 2023 Publication History

All formats PDF

Abstract

In this article, we present an approach to systematically examine the schedulability of distributed storage systems, identify their scheduling problems, and enable effective scheduling in these systems. We use Thread Architecture Models (TAMs) to describe the behavior and interactions of different threads in a system, and show both how to construct TAMs for existing systems and utilize TAMs to identify critical scheduling problems. We specify three schedulability conditions that a schedulable TAM should satisfy: completeness, local enforceability, and independence; meeting these conditions enables a system to easily support different scheduling policies. We identify five common problems that prevent a system from satisfying the schedulability conditions, and show that these problems arise in existing systems such as HBase, Cassandra, MongoDB, and Riak, making it difficult or impossible to realize various scheduling disciplines. We demonstrate how to address these schedulability problems using both direct and indirect solutions, with different trade-offs. To show how to apply our approach to enable scheduling in realistic systems, we develop Tamed-HBase and Muzzled-HBase, sets of modifications to HBase that can realize the desired scheduling disciplines, including fairness and priority scheduling, even when presented with challenging workloads.

1 Introduction

The modern data center is built atop massive, scalable storage systems [6, 27, 56, 70]. For example, a typical Google cluster consists of tens of thousands of machines, with PBs of storage spread across hard disk drives (or SSDs) [70]. These expansive storage resources are managed by Colossus, a second-generation scalable file system that replaced the original Google File System [27]; many critical Google applications (e.g., Gmail, Google Drive, and Youtube), as well as generic cloud-based services, co-utilize Colossus and thus contend for cluster-wide storage resources such as disk space and I/O bandwidth [34].

As a result, a critical aspect of these storage systems is how they share resources. If, for example, requests from one application can readily drown out requests from another, then building scalable and predictable applications and services becomes challenging (if not impossible).

To address these concerns, scalable storage systems must provide correct and efficient request scheduling as a fundamental primitive. By controlling which client or application is serviced, critical features including fair sharing [2, 49, 80, 89], throughput guarantees [74, 90], low tail latency [19, 30, 66, 87, 93], and performance isolation [69, 73, 86] can be successfully realized.

Unfortunately, modern storage systems are complex, concurrent programs. Many systems are realized via an intricate series of stages, queues, and thread pools, based loosely on SEDA design principles [88]. For example, HBase [26] consists of about 500K lines of code, and involves more than 1000 interacting threads within each server when running. Understanding how to introduce scheduling control into systems is challenging even for those who develop them; a single request may flow through numerous stages across multiple machines while being serviced.

All of the open-source storage systems we examined have significant scheduling deficiencies, thus rendering them unable to achieve desired scheduling goals. As shown in Figure 1, the original HBase fails to provide weighted fairness or isolation against background workloads, yet our implementation of Muzzled-HBase successfully achieved these goals. Such scheduling deficiencies have also caused significant problems in production, including extremely low write throughput or even data loss for HBase [32], unbounded read latency for MongoDB [52, 53], and imbalance between workloads in Cassandra [10]. All of the above problems have been assigned major or higher priority by the developers, but remain unsolved due to their complexities and the amount of changes required to the systems.

Fig. 1.

To remedy this problem, and to make the creation of flexible and effective scheduling policies within complex storage systems easy, this article presents a novel approach to such schedulability analysis, which allows systematic reasoning on how well a system could support scheduling based on its thread architecture. Specifically, we define a Thread Architecture Model (TAM), which captures the behavior and interactions of different threads within a system. By revealing the resource consumption patterns and dependencies between components, a TAM effectively links the performance of a storage system to its architecture (while abstracting away implementation details). Using a TAM, various scheduling problems can be discerned, pointing toward solutions that introduce necessary scheduling controls. The system can then be transformed to provide schedulability by fixing these problems, allowing realization of various scheduling policies atop it. TAMs are also readily visualized using Thread Architecture Diagrams (TADs), and can be (nearly) automatically obtained by tracing a system of interest under different workloads.

We introduce the Thread Architecture Schedulability Conditions (TASC), which specify three conditions that make a TAM schedulable: completeness, local enforceability, and independence. TASC ensures that schedulers are placed at the right points in the system and are given the necessary information and control to realize different desired scheduling policies, thus separating the implementation difficulty of scheduling (how system architecture and design hinder effective realization of scheduling policies) from the policy difficulties (the inherent difficulty in achieving one particular policy).

However, the internal structure of systems can pose various problems and prevent a system from satisfying TASC. We use TAM to analyze the schedulability of four important and widely-used scalable storage systems: HBase/HDFS [26, 75], Cassandra [46], MongoDB [15], and Riak [43], and highlight weaknesses in the scheduling architecture of each. Our analysis centers around five common problems that cause violations of TASC and in turn lead to inadequate scheduling: a lack of local scheduling control points, unknown resource usage, hidden competition between threads, uncontrolled thread blocking, and ordering constraints among requests. Fortunately, these problems can be precisely specified using TAMs, enabling straightforward and automatic problem identification. These problems can also be visually identified using TADs, allowing system architects to readily understand where problems arise.

Two types of techniques can be used to solve these problems, and thus enable scheduling control to be readily added into existing, complex systems. The first are direct methods, which explicitly alter the existing thread architecture to eliminate a specific scheduling problem; direct techniques can sometimes be more challenging to implement, depending on the exact system architecture. The second are indirect methods, which leave the TAM unchanged but use information to alleviate the effects caused by scheduling problems; indirect approaches are easier to incorporate but more approximate in the scheduling control they provide. When transforming real systems to provide schedulability, a combination of direct and indirect methods is often needed to reach a sweet spot that balances the scheduling control needed and the engineering effort required.

To demonstrate how direct and indirect methods can be deployed to solve the problems identified by TAM and transform realistic systems to provide schedulability, we take HBase, the most complex system we studied, and compare two of its variants: Tamed-HBase and Muzzled-HBase. Tamed-HBase (short for TAM-EnableD HBase) solves all scheduling problems original HBase possesses directly to achieve a TASC-satisfying and problem-free thread architecture; however, implementing Tamed-HBase requires major restructuring of the system. Muzzled-HBase aims for easy implementation and deploys indirect methods when necessary, at the cost of performance overheads and approximate scheduling control. We show via simulation that both Tamed-HBase and Muzzled-HBase enable fair sharing under intense resource competition, provide strong tail latency guarantees with background interference, and achieve proper isolation despite variances in request amount, size, and other workload factors. We also implement Muzzled-HBase to show that TAM-guided schedulability analysis corresponds to the real world.

The rest of this article is structured as follows. We first introduce the TAM (Section 2), and then we discuss how to use TAM to perform schedulability analysis, centered around the TASC conditions and the five scheduling problems (Section 3). We use HBase/HDFS as a case study to demonstrate how to use TAM to analyze the schedulability of a realistic system, and make said system schedulable (Section 4). We then present the schedulability analysis results of other systems (Section 5). Next, we discuss the limitations of TAM and how it can be extended (Section 6). Finally, we discuss related work (Section 7) and conclude (Section 8).

2 Thread Architecture Model

Implementing new scheduling policies in existing systems is non-trivial; most modern scalable storage systems have complex structures with specific features that complicate the realization of scheduling policies. We introduce TAMs to describe these structures. The advantage of TAM is that one can perform schedulability analysis with only information specified in this model, abstracting away all the implementation details. We first give a general and intuitive description of TAM (Section 2.1) and describe its visualization using TAD (Section 2.2). We then discuss how to automatically obtain TAM for existing systems (Section 2.3) and use TAM as a blueprint to simulate system scheduling behaviors (Section 2.4). Finally, we give a formal definition of the TAM model (Section 2.5).

2.1 TAM: General Description

A storage system contains requests that flow through the data path and consume various resources. For scheduling, a control plane collects information and determines a scheduling plan to realize the system’s overall goal. Different systems may have different scheduling goals, including reducing tail latency [1, 19, 30, 66, 87, 93], ensuring fair resource allocation [2, 49, 63, 89], minimizing interference and isolating performance for high priority workloads [25, 69, 73, 86], and others, which require different scheduling plans. The scheduling plans are enforced by local schedulers at different points along the data path, as shown in Figure 2.

Fig. 2.

In modern SEDA-based distributed storage systems [26, 46, 78, 88], the data path consists of many distinct stages residing in different nodes. A stage contains threads performing similar tasks (e.g., handling RPC requests or performing I/O). A thread refers to any sequential execution (e.g., a kernel thread, a user-level thread, or a virtual process in a virtual machine). Within a stage, threads can be organized as a pool with a fixed (or maximum) number of active threads (bounded stage) or can be allocated dynamically as requests increase (on-demand stage). In certain stages, some requests may need to be served in a specific order for correctness; this is an ordering constraint.

Each bounded stage has an associated queue from which threads pick tasks; each queue is a potential scheduling point where a local scheduler can be inserted to reorder requests. The queue can be either implicit (e.g., the default FIFO queue of a Java thread pool) or explicit (with an API to allow schedulers to be plugged in, or hard-coded scheduling decisions). A local scheduler reorders only those requests that flow through the scheduling point it is attached to, but multiple local schedulers can be configured by the control plane to achieve a global scheduling goal.

A stage may pass requests to its downstream stages for processing. After a thread issues a request to downstream stages, the thread may immediately proceed to another request or block until notified that the request completed at other stages.

Resources are consumed within a stage as requests are processed; we consider CPU, I/O, network and lock¹ resources, but other resources can be readily added to our model. Instead of specifying the exact amount of resources used in each stage (which can change based on specific characteristics of workloads), we only consider whether a resource is extensively used in a stage. This simplification allows us to abstract away the details of slightly different workloads but still captures important problems related to resource usage (shown in Section 3). Extensive resource usage is interpreted as “any serious usage of resources that is worth considering during scheduling”; we discuss how we choose the threshold in Section 2.3. If a stage may or may not extensively use a resource during processing based on different workloads, then it has unknown resource usage for this resource.

All the stages and their collective behaviors, relationships, and resource consumption patterns form the thread architecture of a system.

2.2 Visualization with TAD

One advantage of TAM is that it allows straightforward visualizations using TADs. Table 1 summarizes the building blocks in TADs; Figure 3 show the TAD of HBase/HDFS [26, 75] (labels on the arrows and important workload flows are manually added to aid understanding, and are not parts of TAD). The TADs of other systems, such as MongoDB (Figure 24), Cassandra (Figure 28), and Riak (Figure 32), are shown in Section 5. TAM and TAD can be thought of as duals: TAD is the graphical representation of TAM, while TAM is the symbolic representation of TAD; one can easily transform a TAD to its underlying TAM, and vice versa.

Fig. 3.

Table 1.

We now use the (simplified) HBase/HDFS TAD in Figure 3 to illustrate how to read a TAD and identify specific features of system scheduling from it.

When HBase clients send queries to the RegionServer, the RPC Read stage reads from the network and passes the request to the RPC Handle stage (step 1). Based on the request type (Put or Get) and whether data is cached, RPC Handle may have different behavior. One may insert custom schedulers into RPC Handle (plug symbol).

If the RPC needs to read data, then RPC Handle checks if the data is local. If it is not local, then RPC Handle sends a read request to the Data Xceive stage in a Datanode and blocks (\(r_1-r_2\), where blocking is indicated by dashed \(r_2\)). If it is local, then RPC Handle directly performs short-circuited reads, consuming I/O. I/O resource usage in RPC Handle is initially unknown and thus marked with a bracket.

For operations that modify data, RPC Handle appends Write-ahead Logging (WAL) entries to a log (\(a_1\)) and blocks until the entry is persisted. LOG Append fetches WAL entries from the queue in the same order they are appended (stop symbol), and writes them to HDFS by passing data to Data Stream (\(w_1\)), which sends the data to Data Xceive (\(w_2-w_3\)). All WAL entries append to the same HDFS file, so Data Stream and Data Xceive must process them in sequence. LOG Append also sends information about WAL entries to LOG Sync (\(a_2\)), which blocks (\(w_7\)) until the write path notifies it of completion (further details omitted); it then tells RPC Handle to proceed (dashed \(a_3\)). RPC Handle may also flush changes to the MemStore cache (\(f_1\)); when the cache is full, the content is written to HDFS with the same steps as with LOG Append writes (\(w_1-w_7\)), though without the ordering constraint.

Finally, after RPC Handle finishes an RPC, it passes the result to RPC Respond and continues another RPC (step 2). In most cases, RPC Respond responds to the client, but if the connection is idle, RPC Handle bypasses RPC Respond and responds directly, consuming network resource.

HBase has more than ten complex stages exhibiting different local behaviors (e.g., bounded versus on-demand), resource usage patterns (e.g., unknown I/O demand), and interconnections (e.g., blocking and competing for the same resources across stages). All of them are compactly encoded in its TAM/TAD, enabling us to identify problematic scheduling, as we discuss later (Section 3).

2.3 Automatic Obtainment

TAM is defined with automatic procurement in mind: all information specified in TAM can be (relatively) easily obtained, allowing automation of the schedulability analysis. We now present TADalyzer, a tool we developed to auto-discover TAM/TAD for real systems using instrumentation and tracing techniques. The workflow to generate the TAM of a given system with TADalyzer consists of four steps:

(1)

Stage Naming: the user lists (and names) important stages in the system.

(2)

Stage Annotation: the user identifies thread creation code in the code base and annotates if the new thread belongs to one of the stages previously named. Figure 4 shows a sample annotation. Threads not explicitly annotated default to a special NULL stage.

Fig. 4.

(3)

Monitoring: the user deploys the system and feeds various workloads (e.g., hot-/cold-cached, local/remote) to it. TADalyzer automatically collects necessary information for later TAM generation. If the user missed some important stages in step 1, then TADalyzer would notice that some threads in the NULL stage are overly active, and alert the user with the stack trace of these threads. Based on the alert, the user identifies the missing stages and repeats step 1–3.

(4)

Generating: After enough information is collected, the user asks TADalyzer to generate the TAM; some information that cannot be obtained by TADalyzer (see Figure 5) needs to be provided manually by the user. TADalyzer also automatically plots TAD from the TAM (though the TADs shown in the article are drawn manually for compact representation).

Fig. 5.

This workflow requires the user to know the important (but not all) stages in the system. From our experience, someone unfamiliar with the code base usually misses naming some stages initially. However, TADalyzer provides enough information to point the user to the code of the missing stages to aid further annotation, and one can typically get a satisfactory TAM within a few (\(\lt\)5) iterations of the workflow. In HBase and MongoDB such annotation took about 50 lines of code, and in Cassandra about 20.

We now briefly describe how TADalyzer generates the TAM. Based on user annotation, TADalyzer monitors thread creation and termination, and builds a mapping between threads and stages. Using this mapping, it automatically discovers the following information:

Stage Type: TADalyzer tracks the number of active threads at each stage to classify bounded or on-demand stages.

Resource Consumption: Using Linux kernel tracing tools [36, 79], TADalyzer attaches hooks to relevant kernel functions (e.g., vfs_read, socket_read) to monitor the I/O and network consumed at each stage. The CPU consumption is tracked through /proc/stat; the lock resource by automatically instrumenting relevant lock operations. TADalyzer also supports user-defined resources; however, for custom resources the user is required to annotate the code to tell TADalyzer when the resource is claimed and released.

Intra-node Data Flow: TADalyzer automatically instruments standard classes that are commonly used to pass requests, such as util.AbstractQueue in Java and std::queue<T> in C++, to build data flow between stages within the same node.

Inter-node Data Flow: TADalyzer tracks how much data each thread sends and receives on different ports. By matching the IP and port information, TADalyzer builds the data flow between stages on different nodes.

Blocking Relationship: TADalyzer injects delays in a stage and determines whether other stages block by observing if the delay propagates to these stages.

Ordering Constraint/Pluggable Scheduling Point: TADalyzer cannot automatically derive such information and asks for manual annotations from the user.

Figure 5 summarizes how TADalyzer obtains TAM information; based on the information, TADalyzer generates the TAM/TAD.

When generating TAM, TADalyzer needs to determine the threshold for extensive resource usage of a stage. In a typical system there exist many “light” stages that are occasionally activated to perform bookkeeping and consume few resources (e.g., a timer thread); even for stages that are actively involved in request processing, they may perform tasks with a particular resource pattern and only very lightly use other resources. When accumulating the resource consumption in a long run, we observe that these stages use at most 1% of the concerned resource, while the stages that are actively consuming resources when processing requests typically use more than 10% (or much higher) of the resource. For example, in MongoDB the Worker stage consumes up to 95% of the total CPU time, while the Fetcher stage consumes at most 0.2%; similarly, in HBase the RPC Respond stage is responsible for 20% to 80% of the total bytes transferred through network, depending on the workload, but its I/O consumption never exceeds 1%. TADalyzer thus chooses the extensiveness threshold to be within 1% and 10% to prevent these “light” stages from unnecessarily complicating the TAM (the exact threshold is set to 5%).

Limitations: The current implementation of TADalyzer has certain limitations. First, TADalyzer cannot automatically derive if a stage provides a pluggable scheduling point or has an ordering constraint, and it relies on the user’s domain knowledge to provide such information. Second, TADalyzer relies on run-time instrumentation, and may miss some information if the workload supplied to it is not comprehensive enough to exercise all execution paths. In particular, it may not always be possible to fully utilize all threads in a given stage by adding more load, due to system bottlenecks, which may cause TADalyzer to falsely classify a stage as bounded even though it is actually on demand. Third, TADalyzer uses kernel instrumentation to track resource usages of each thread, so it only works on systems where an application thread directly corresponds to a kernel thread. For example, Riak, which makes use of the Erlang user-level processes [84], requires additional instrumentation to obtain its TAM. Finally, the automatic queue operation instrumentation may also miss some data flows. For example, HBase uses a fast-path mechanism to directly dispatch an RPC call to a thread to improve data locality, without going through any queues; these data flows can not be captured by TADalyzer.

To summarize, the TAMs generated by TADalyzer may miss stages, flows, resource usages, or classify stages incorrectly.² However, we would like to emphasize that TAM defines a clear set of obtainable information (see Section 2.5), which enables tools that automatically extract this information to construct TAM and perform schedulability analysis. TADalyzer is just one such tool we built to demonstrate the feasibility of automatic schedulability analysis; we encourage other tools to be developed that deploy different techniques (e.g., those in References [3, 12, 61]) to discover the information listed in Figure 5 and optimize the process of obtaining TAM.

2.4 TAM: A Simulation Blueprint

One important utility of TAM is that it can serve as a simulation blueprint to study system scheduling behaviors. Simulation has long been used as a cost efficient way to understand how scheduling works in realistic systems, without the cumbersome process of running, instrumenting, tracing, or even building the system of interest [39, 47, 81]. However, as simulation is ultimately an abstraction of the real world that hide irrelevant information, for a real system with complex operations and convoluted interactions, it remains challenging to discern which key information are worth capturing to produce high fidelity results when studying a particular problem.

Because TAM abstracts all information relevant to scheduling into basic building blocks (e.g., stages, resources) and relationships (e.g., the blocking and downstream relationship), and organizes these building blocks into a complete and compact scheduling model, it greatly simplifies the simulation of system scheduling behaviors: out of all the complex system operations and interactions, one only needs to assemble the building blocks based on the relationships specified in the TAM of the system. In this sense, TAM works in the same way as a Lego brick set: it divides real systems into basic bricks, and gives you specific instructions on how to assemble the bricks into a complete system. TAM thus serves as a simulation blueprint, which enables fast and efficient scheduling simulation. In addition to existing systems, one can also assemble stages to form new TAMs that reflect hypothetical system designs, and simulate how scheduling works in these thread architectures. For example, given the TAM of an existing system, the designer can tweak one specific design of the TAM, then simulate the changed TAM to understand exactly how that change could affect system scheduling behavior and performance. As we will show later (Sections 3 and 4), targeted TAM-based simulation can provide tremendous insights into a system’s scheduling problems. In particular, while abstract TAM analysis can reveal scheduling problems in general, TAM-based simulation demonstrates how the problems actually manifest given certain workloads and resource configurations.

Simulation Framework Implementation: We thus built a simulation framework to facilitate such TAM-based simulation and analysis. Based on simpy, a Python package for event-driven simulation [50], the framework provides basic building blocks such as requests, threads, stages, resources, and schedulers; it is also capable of simulating any behaviors described in a TAM, such as blocking or the downstream relationship. The framework then allows the user to assemble the stages in different ways to form thread architectures as simple or complex as she wishes. With an assembled thread architecture, one can specify the workload characteristics (e.g., request types and arrival distribution), resource configurations (e.g., CPU frequency or network bandwidth), and scheduling policies; the framework then simulates how requests flow through the stages and consume the resources, and reports detailed performance statistics, such as client throughput, latency, and resource utilization. We show the fidelity of the simulation in Section 4.2, where we compare our simulation results to the results obtained on real implementations; the comparison reveals that TAM-based simulations are excellent predictors of the real world.

2.5 TAM: Formal Definition

We now give a more formal definition of the thread architecture model, which precisely specifies the information encoded in a TAM. Such formalism is critical for both automatically constructing TAMs (Section 2.3) and for systematically identifying the scheduling problems (Section 3).

Definition 1.

A thread architecture is defined by the 3-tuple \((S, D, B)\), where

•

\(S\) is a finite set; each element \(s\in S\) is a stage, which is defined in Definition 2.

•

\(D\) is a function that maps \(S\) to \(\mathbb {P}(S)\) (the power set of \(S\)); \(D\) represents the downstream relationship. For example, \(D(s_{1})=\lbrace s_{2}, s_{3}\rbrace\) means \(s_1\) issues requests to \(s_2\) and \(s_3\).

•

\(B\) is a function that maps \(S\) to \(\mathbb {P}(S)\); \(B\) represents the blocking relationship. For example, \(B(s_{1})=\lbrace s_{2}\rbrace\) means stage \(s_1\) blocks on stage \(s_2\).

Definition 2.

A stage is defined by the 5-tuple \((n, h, r, o, q)\), where

•

\(n\) is a string representing the name of the stage.

•

\(h\) is a positive integer indicating host ID. Stages with the same \(h\) value are on the same node.

•

\(r\) is a 4-vector representing the resource usage pattern of this stage. Each component in \(r\) can take one of the three values: true, false, or unknown, indicating whether the corresponding resource (CPU, I/O, network, lock) is used extensively in this stage.

•

\(o\) is a boolean value representing whether the stage has an ordering constraint or not.

•

\(q\) represents the local scheduling type, and can take one of the three values: on_demand, pluggable or general, indicating whether the stage is on-demand, allows pluggable schedulers, or has hard-coded or implicit scheduling logic.

3 The Scheduling Conditions and Problems

System developers rarely have the luxury of constructing a new service from scratch that can provide a new scheduling policy or goal. Instead, developers often retrofit new scheduling policies into existing architectures containing a number of limitations and flaws. We have two important questions. First, which types of thread architectures most naturally enable new scheduling policies to be implemented? Second, when problems do exist, what are the problems and how can those problems be most easily remedied?

We begin by considering an ideal thread architecture, and the process of realizing a scheduling policy in such an architecture. To realize a global scheduling policy in a distributed architecture, local schedulers should first be inserted into key scheduling points of the system to reorder requests (①); the global policy is then translated into local scheduling plans to be enforced by these schedulers (②); finally, the local plans are carried out by the local schedulers (③). Such a three-step process is intuitive and general, and has been adopted to realize various scheduling policies, including fairness [73], latency guarantees [87], and isolation [90]; it is also used in generic scheduling frameworks that support diverse user-defined policies [38, 77]. However, carrying through the three steps requires certain support from the system; we thus specify the following three conditions that an ideal thread architecture should satisfy to make realizing a scheduling policy easy:

Completeness—the system provides necessary scheduling points so that a global policy can be translated into local scheduling plans enforced by schedulers inserted at these points. Without completeness, some key components of the system may be left out and cannot be scheduled in any way.

Independence—the decisions different local schedulers make are decoupled; one local scheduler needs not coordinate with other schedulers to achieve its local goal. With independence, a global scheduling plan can be easily broken into local plans to be enforced independently; otherwise, one local scheduler’s decision and behavior may prevent another scheduler from achieving its local goal, and every scheduler has to take into account what is happening elsewhere in the system, which makes scheduling much harder.

Local Enforceability—the local scheduling plans can be implemented at each scheduling point: the system provides both enough information to the local scheduler to make the scheduling decisions, and enough control to enforce the decisions. Without local enforceability, a local scheduler may lack crucial information about a request, or the ability to reorder the request, which makes enforcing the local scheduling plan unfeasible.

We call the above conditions the TASC. Unfortunately, most systems violate the ideal schedulability conditions in one or more ways, causing scheduling difficulties. In particular, we identify five common problems exhibited in modern distributed storage systems that cause violations of TASC: no scheduling, unknown resource usage, hidden contention, uncontrolled thread blocking, and ordering constraint (see Section 3.6 for a discussion on whether the five problems are exhaustive). We illustrate how each of the problems causes TASC violations and leads to scheduling difficulties (see Table 2 for a summary). TAM allows us to identify these scheduling problems without being concerned about low level implementation details.

Table 2.

Scheduling Problem	TASC Condition Violated	Specific Violation
No Scheduling	Completeness	Lack of scheduling points at resource intensive stages.
Unknown Resource Usage	Local Enforceability	Forces local schedulers to make decisions without resource usage information.
Hidden Contention	Independence	Resource allocation at one scheduling point is affected by the resource usage at another scheduling point.
Uncontrolled Thread Blocking	Independence	Forces upstream schedulers to account for downstream progress.
Ordering Constraint	Local Enforceability	The local scheduler cannot reorder requests based on its scheduling goal.

Table 2. How Each Scheduling Problem Violates the TASC Conditions

We also show how to fix an existing stage so the system is TASC-compliant. In general, a stage can satisfy a TASC condition in one of two ways: directly with a thread architecture that naturally provides the property or indirectly by having a control plane utilize information to adjust scheduler behaviors. There are different trade-offs for direct and indirect solutions.

To illustrate the schedulability conditions and scheduling problems more clearly, we begin by focusing on systems with only a single problem; in Section 4 we consider the HBase TAM in which multiple problematic stages are interconnected. For each problem, we first give a general description on how it causes TASC violations and leads to ineffective scheduling, then precisely specify the problem in TAM and TAD. We use simulation to demonstrate how problematic thread architecture hinders scheduling policy realization, and how direct and indirect solutions can solve the problem and enable effective scheduling.

Storage systems have different scheduling goals, which require different scheduling algorithms to be deployed at local schedulers. Some systems aim at reducing end-to-end latency or meeting request deadlines [1, 19, 66], and algorithms such as earliest-deadline-first (EDF) or its variants [13, 22, 37] would be useful; some systems attempt to allocate resource fairly [2, 49, 63, 89], based on proportional fairness [41], dominate resource fairness (DRF) [28] or Bottleneck Max Fairness [8]; other systems need to ensure that Quality-of-Service guarantees are met for high priority clients regardless of other co-located clients [25, 69, 73, 86], and deploy resource partitioning at different points [11], or perform priority-based scheduling [4, 48]. All of these goals, however, must be carefully balanced with the need of maintaining high resource utilization, which maximizes efficiency and reduces cost. To demonstrate the generality of our analysis, in the simulation below the local schedulers deploy different algorithms, including EDF, DRF, and priority scheduling, to achieve various scheduling goals.

Unless otherwise noted, all simulations in this section use a common configuration: two competing clients (C1 and C2) continuously issue requests; C1 has 40 threads, C2 varies; each node is simulated with a 1 GHz CPU, 100 MB/s disk, and a 1 Gbps network connection.

3.1 No Scheduling

The completeness condition dictates that each resource-intensive stage in a thread architecture provides local scheduling. With local scheduling for a stage, requests are explicitly queued and resource-intensive activities can be ordered according to system’s overall scheduling goal. In contrast, an on-demand stage with no request queue and extensive resource usage suffers the no scheduling problem (e.g., the Data Stream and Data Xceive stages in HBase, and the Req In-Out and Process stages in Riak).

TAM: A TAM \((S, D, B)\) suffers no scheduling if \(\exists s\in S\), s.t. \(s.r \ne\) [false, false, false, false] \(\wedge\) \(s.q\) = on_demand.³

TAD: A TAD suffers no scheduling if it contains stages with non-empty resource boxes but no queues.

Figure 6(a) shows a simple TAD with two stages, the second of which has no scheduling (an on-demand stage with intensive I/O). The scheduler for Req Handle (Q1) attempts to provide a latency guarantee to C1 using EDF but is unsuccessful: as C2 issues requests with more threads, the latency of C1 exceeds the deadline by as much as \(5\times\). The problem occurs because Q1 scheduling is irrelevant when the Req Handle stage is not the bottleneck: the average queue length of Q1 is zero. Meanwhile, as shown in Figure 6(a), there are many requests contending for I/O in the I/O stage, which is not managed by a scheduler.

Fig. 6.

Figure 6(b) shows an indirect solution in which the thread architecture remains the same, but control logic is added to limit the number of requests sent from the Req Handle stage to the I/O stage. To achieve this, Q1 enforces rate limiting instead EDF. Specifically, information from the I/O stage is collected and sent to Q1 to estimate the per-client latency and utilization. Based on this information, the rates of each client are adjusted using an Additional-Increase-Multiplicative-Decrease algorithm [14] to maximize the number of requests meeting deadline; these rates are pushed to Q1 to ensure latency guarantees. System (b) is able to enforce global scheduling despite the lack of a scheduling point at the I/O stage, because Q1 is not work conserving: Q1 postpones requests from C2 even when there are idle Req Handle threads and CPU resources. As a result, any work-conserving scheduling policy such as fair queuing or priority-based scheduling would have to be emulated at the control plane. Many indirect approaches that share information are possible to solve the no scheduling problem; for example, Retro [38] similarly translates different scheduling policies into rate limits.

Figure 6(c) shows how to solve the problem directly by adding a scheduling point at the I/O stage. Q2 schedules requests from C1 first as they approach the deadlines, even when there are more requests from C2 (with further deadlines) waiting to be serviced at the I/O stage. Local scheduling points enable the system to regulate I/O resource usage at the point where the resource is contended. The direct approach simply and naturally ensures latency guarantees and isolation of the two clients with no need for information sharing across stages.

3.2 Unknown Resource Usage

Each stage within a system should know its resource usage patterns. However, in some stages, requests may follow different execution paths with different resource usage, and these paths are not known until after the stage begins. For example, a thread could first check if a request is in cache, and if not, perform I/O; the requests in this stage have two execution paths with distinct resource patterns and the scheduler does not know which path a request will take (i.e., whether it hits cache) when making scheduling decisions. In such cases, the stage suffers unknown resource usage (e.g., the RPC Handle stage in HBase due to the short-circuited reads it might perform). Unknown resource usage forces schedulers to make decisions before information is available, thus violating the local enforceability condition.

TAM: A TAM \((S, D, B)\) suffers unknown resource usage if \(\exists s\in S\), \(\exists i\in \lbrace 1, 2, 3, 4\rbrace\), s.t. \(s.r[i]\) = unknown.

TAD: A TAD suffers unknown resource usage if it contains resource symbols surrounded by square brackets.

Figure 7(a) shows a single stage with unknown I/O usage (note the bracket around the I/O resource), where Q1 performs DRF [28] with equal weighting. When C2 issues a mix of cold and cached requests, Q1 schedules C2-cold and C2-cached in the same way. Even though there are idle CPU resources, Q1 cannot schedule additional C2-cached requests to utilize the CPU, because it does not know whether the request would later cause I/O, which is currently contended. Unknown resource usage thus causes low CPU utilization and low throughput of C2-cached.

Fig. 7.

System (b) solves this problem with the indirect method of speculative execution. While many approaches are feasible, the simulation speculatively executes a waiting request when the CPU is idle. If during speculation, then the request is found to require I/O, the request is aborted and put back on the queue where it is subjected to normal scheduling. For the single-stage system (b), speculative execution provides high throughput for C2-cached, but must abort some requests and waste CPU cycles. Speculative execution works best for predictable workloads and relatively simple resource usage patterns; if a stage (e.g., the Worker stage in MongoDB) may potentially perform I/O, initiate network connections, and contend for locks, predicting whether a request will use each resource becomes much harder and less accurate.

System (c) solves the unknown resource problem directly by splitting one stage into two. The Req Handle stage performs CPU-intensive cache lookups while a new stage performs I/O for requests that miss the cache. Each stage has its own scheduler. Q1 freely admits requests when there are enough CPU resources, leading to high CPU utilization and high C2-cached throughput. Meanwhile, not only does Q2 know a request needs I/O, it also knows the size and location of the I/O, enabling Q2 to make better scheduling decisions.

3.3 Hidden Contention

When multiple stages with independent schedulers compete for the same resource, they suffer from hidden contention, which impacts overall resource allocation in unexpected ways, causing violation of the independence condition (e.g., the Worker and Oplog Writer stages in MongoDB for database locks, and the Read, Mutation, and View-Mutation stages in Cassandra for CPU and I/O). The hidden contention in MongoDB is reported to cause unbounded read latencies in production [52]. Hidden contention is ubiquitous, because some contention is difficult to avoid (e.g., most stages use CPU).

TAM: A TAM \((S, D, B)\) suffers hidden contention if \(\exists s_1\in S\), \(\exists s_2\in S\), \(\exists i \in \lbrace 1, 2, 3, 4\rbrace\) s.t. \(s_1 \ne s_2\) \(\wedge\) \(s_{1}.h = s_{2}.h\) \(\wedge\)\(s_{1}.q \ne\) on_demand \(\wedge\) \(s_{2}.q \ne\) on_demand \(\wedge\)\(s_{1}.r[i]\ne\) false \(\wedge\) \(s_{2}.r[i]\ne\) false.

TAD: A TAD suffers hidden contention if it contains stages within a node boundary that have separate queues but the same resource in the resource usage boxes.

It is worth noting that even though hidden contention can happen on any resource that is contended without regulation, the TAM/TAD-based definition of hidden contention is more restricted, because in TAM resource is modeled and scheduled within a node boundary (see Section 6 for a discussion on TAM limitations); as a result, the above TAM/TAD-based definition can only capture hidden contention on local resource. Distributed locks, however, are a special type of resource. Even though a distributed lock may be acquired by remote clients, the responsibility of managing the lock (i.e., coordinating its acquisition and release) falls solely on the (master) lock server of the lock instance [9, 35]; in this sense, a distributed lock can be viewed as a local resource scheduled within the lock server, and its contention can be captured by TAM.

Figure 8(a) shows a two-stage system with the network as the source of hidden contention; one stage reads requests and the other sends replies. Both Q1 and Q2 perform fair queuing [41] with equal weighting. However, enforcing fairness at each stage does not guarantee fair sharing at the node level. When C2 increases its reply size (i.e., its network usage), it unfairly consumes up to 95% of the network and reduces throughput of C1.

Fig. 8.

One interesting aspect of this experiment is that both clients have much larger reply size than request size, so S2 (RPC Respond) consumes more than 90% of the total network bandwidth, and S1 (RPC Read) consumes less than 10%; yet fair scheduling at Q2 does not ensure 90% of the network resource is fairly shared across C1 and C2. With larger C2 reply size, Q2 needs to schedule more C1 requests than C2 to maintain fairness. However, as there is no regulation on contention between stages, S2 effectively monopolizes the network when it sends larger replies (on behalf of C2) and prevents S1 from using the network; this causes fewer requests to be completed at S1 and flow to S2, limiting the choice available to Q2. Since S1 processes C1 and C2 requests at a 1:1 ratio (local fair scheduling), Q2 never receives enough C1 requests, and is often forced to schedule C2, because there are no requests from C1 in its queue, causing C2 to dominate the network usage; hidden contention between the two stages thus leads to unfair scheduling.

System (b) solves the problem indirectly by sharing information about network usage across the two stages. In the simulated approach, if a stage uses excessive resources, it backs off by reducing its thread count, allowing the other stage to catch up. The figure shows that system (b) allows S1 to process more requests, thus giving S2 more freedom when scheduling, which leads to effective isolation of C2. Although Q2 is still sometimes forced to schedule C2 requests, it can compensate by favoring C1 later.

System (c) solves the problem directly by changing the thread architecture and combining two network-intensive stages. With one stage handling both RPC sending and replying, Q1 has full control of the network and can isolate C1 and C2 perfectly.

3.4 Uncontrolled Thread Blocking

For optimal performance, even when some requests are waiting to be serviced, each stage should allow other requests to make progress if possible; a problem occurs if there are no unblocked threads to serve these requests. A system has an uncontrolled thread blocking problem if a bounded stage may block on a downstream stage (e.g., the RPC Handle stage in HBase, and the Worker stage in MongoDB), as scenarios may occur where all threads in that stage block at one path and other requests that could have been completed cannot be scheduled. The uncontrolled thread blocking problem of HBase is reported to cause extremely low throughput or even data loss in production [32]. Uncontrolled thread blocking forces upstream schedulers to account for downstream progress, thus violates the independence condition of TASC.

TAM: A TAM \((S, D, B)\) suffers uncontrolled thread blocking if \(\exists s\in S\), s.t. \(s.q\ne\) on_demand \(\wedge\) \(B(s)\ne \emptyset\).

TAD: A TAD suffers uncontrolled thread blocking if it contains stage boxes with queues and dashed arrows pointing toward them.

Figure 9(a) shows a system with uncontrolled thread blocking at Req Handle. Requests in Req Handle have two paths: they may complete in this stage or block on the I/O stage; the schedulers perform DRF [28] with equal weighting. Initially, both C1 and C2 receive high throughput as they issue cached requests without blocking; however, when C2 switches to an I/O-intensive workload, the throughput of C1 (which is still CPU-only) suffers. The table below shows that all threads in Req Handle are blocked on I/O, leaving no threads to process C1 requests. Having more threads at the Req Handle stage (not shown here) does not help: it only leads to more blocked threads.

Fig. 9.

Figure 9(b) shows an indirect solution where the system tries to avoid blocking with a feedback loop: information about the congestion level and estimated queuing delay for each client at the downstream stage is passed upstream. Q1 avoids scheduling a request if it anticipates excessive blocking, reserving threads for useful work. Using this information-based anticipation, system (b) keeps the number of blocked threads low and provides high throughput to C1. However, such anticipation relies on collecting information from the downstream stages, and becomes harder when the downstream stage reside in different processes or even different nodes from the blocking stage, or when one stage has many downstream stages. For example, in HBase, the RPC-Handle stage can issue requests to and block on the Data Xceive stage at every node in the cluster.

In contrast, Figure 9(c) shows a system where blocking is directly eliminated by making the Req Handle stage asynchronous. No threads block; all perform useful work, leading to high throughput for C1 even in the face of workload change. Even though a blocking interface is easier to program against, an asynchronous stage design is generally more friendly to scheduling, as one stage cannot eat up all threads and starve other stages.

3.5 Ordering Constraint

Many storage systems use WAL, which requires the writes to the log to occur in sequence. When a system requires some requests at a resource-intensive stage to be served in a specific order to ensure correctness, it has the ordering constraint problem (e.g., the Data Stream and Data Xceive stage in HBase). Ordering constraint violates the local enforceability condition of TASC, because the local scheduler cannot reorder resource-intensive activities as desired, leaving the scheduling framework with fewer or no choices.

TAM: A TAM \((S, D, B)\) suffers ordering constraint if \(\exists s\in S\), \(\exists i \in \lbrace 1, 2, 3, 4\rbrace\) s.t. \(s.o\) = true \(\wedge\) \(s.r[i]\ne\) false.

TAD: A TAD suffers ordering constraint if it contains stages with stop symbols and non-empty resource boxes.

Figure 10(a) shows a two-stage system with ordering constraint on the second stage. The schedulers enforce priorities, where high priority requests are served first as long as this does not break correctness. In this system, C1 (high priority) suffers much longer latency when C2 (low priority) issues requests aggressively. The majority of this latency occurs from queuing delay in the second stage, since low priority requests must be serviced first if they enter the stage earlier.

Fig. 10.

The ordering problem can be mitigated with indirect methods that share information across stages. In system (b), the two stages coordinate to ensure that there are never more than 10 requests in the LOG Append queue. Although requests in the LOG Append stage must still be served in order, the number of requests before C1 is now bounded, causing fewer priority inversions. The figure shows that the latency of C1 increases initially, but is bounded eventually.

Figure 10(c) shows a system that eliminates the ordering constraint problem by separating requests from different clients into different streams that share no common states (e.g., each stream has its own WAL); even though requests within a stream are still serviced in order, the scheduler can choose which stream to serve and provide differentiated services on a per-stream basis. One should note that, however, this solution only works when there is no dependencies across data written by different clients. The figure shows that C1 maintains low latency despite the larger queue size at the LOG Append stage when C2 issues more requests: free from the ordering constraint, Q2 can pick the high priority requests from C1 first.

3.6 Discussion

We have identified five categories of scheduling problems. Table 2 summarizes how each problem violates the ideal scheduling conditions specified by TASC. For each category, we have given an example that highlights the problem. In some cases the example highlights a fairness problem (Sections 3.3 and 3.4); in others it highlights a latency (Sections 3.1 and 3.5) or utilization (Section 3.2) problem. The specific choice of the scheduling policy to demonstrate each problem and the particular symptom manifested are not important though, because each of these scheduling problems can manifest in different ways, causing violations in any scheduling disciplines. For example, in Section 3.1, we show how no scheduling causes excessive latencies under the EDF policy; since there are no scheduling points to prioritize requests, it could as easily cause unfairness or priority inversions. How (and whether) the scheduling problems manifest depends on the resource available, the workload, and the scheduling policy; when TAM/TAD suggests a scheduling problem, it means that there exist certain workloads/resource configurations under which the problem manifests.

Each of the five scheduling problems by itself is not very surprising; however, they are traditionally identified either by experienced developers or through intensive experiments, which is labor-intensive and may easily miss certain problems. By compactly representing the thread architecture and exposing scheduling problems, TAM can serve a useful conceptual tool that allows the system designer to easily identify and fix the problems in an existing system, or to design a largely problem-free architecture for a new system. In addition, TAD enables visual analysis, making it clear where problems arise, while the TAM-based simulation can be used to study how scheduling problems actually manifest given certain workloads and resource configurations.

The five categories of scheduling problems can be fixed with indirect and direct solutions. Direct solutions modify the problematic thread architecture directly to fix scheduling problems, and are reflected in the TAM of the modified system. Sometimes a direct solution introduces other scheduling problems that need to be fixed (e.g., adding scheduling points to a stage may introduce uncontrolled thread blocking); the altered TAM would reflect the newly introduced problems as well. Even though the direct solutions completely eliminate the problems and offer full scheduling control, they often require intrusive changes to the system architecture and programming model (e.g., making the system asynchronous or re-designing the consistency mechanisms), which may make them difficult to deploy in existing systems.

Indirect solutions keep the TAM unchanged, but add information within and across stages to help a TAM work around inherent structural flaws of its stages. The scheduling control provided by indirect solutions are only approximate. The advantage of indirect approaches is that they usually involve minimal changes to the thread architecture, lowering the engineering effort required. However, adding information flow and complex feedback logic can add overhead and increase convergence time. Furthermore, indirect solutions often rely on stable workload characteristics, so that resource demands and path usage is predictable.

Choosing a direct or indirect approach to solve a scheduling problem will be specific to each system and its code base complexities, thread architecture, scheduling goals, and the workload. Later (Section 4), we will show how to use a combination of direct and indirect solutions to fix the problems of HBase and transform the HBase TAM to provide schedulability.

Do the five categories of problems exhaustively describe how system structure could violate the TASC conditions and hinder scheduling? For now, we can only answer this question empirically. We analyzed systems with distinct architectures (thread-based versus loose SEDA versus strict SEDA) and thread behaviors (kernel-level versus user-level threads). Only these problems arise and fixing them allows us to realize various scheduling policies effectively. We leave proving the completeness of the problems to future work.

4 HBase: A Case Study

Given the TAM/TAD of a system, multiple scheduling problems may be discovered, pointing towards solutions that introduce necessary scheduling controls. By fixing these problems, the system can be transformed to satisfy TASC and provide schedulability. We now perform such analysis on a realistic storage system, the HBase/HDFS storage stack (hereinafter just HBase). We focus on HBase, as it presents the most complex architecture we see so far, is widely deployed in many production environments [26], and achieving schedulability remains difficult despite repeated attempts from industry and academia [31, 32, 38, 87, 90]. We analyze the schedulability of MongoDB [15], Cassandra [46], and Riak [43] later (Section 5).

As we discussed earlier (Section 3), both direct and indirect approaches can be used to solve the scheduling problems HBase possesses, with different trade-offs. To demonstrate the trade-offs more clearly, for each problem, we compare the direct and indirect solution in simulation. We also consider two architectures of HBase: Tamed-HBase, in which we aim to get the ideal scheduling architecture and solve all problems directly, and Muzzled-HBase, in which we consider the practicality of the solutions and prefer indirect approaches when direct approaches may be difficult to implement. We then implement Muzzled-HBase to show that our TAM-based simulation corresponds to real world, thus demonstrate the effectiveness of TAM as a simulation blueprint for scheduling behavior studies.

4.1 TAM-based Simulations

We simulate an HBase cluster with eight nodes; one node hosts the HMaster and NameNode, and the seven other nodes host RegionServers and DataNodes. Each node is simulated with a 1 GHz CPU, a 100 MB/s disk, and a 1 Gbps network connection. Using this simulation, we compare the original HBase, the direct solutions, and the indirect solutions. The solutions can be used to realize any scheduling policy; in our simulation the schedulers simply attempt to isolate C1’s performance from C2’s workload changes.

4.1.1 No Scheduling.

The Data Xceive and Data Stream stages in HBase have a non-empty resource vector and on_demand scheduling type, indicating the no scheduling problem.

Directly solving the no scheduling problem in HBase involves changing the scheduling type of Data Xceive and Data Stream from on_demand to pluggable, so it is free from no scheduling. In a real system, this corresponds to adding scheduling points to the two stages and exporting an API to allow different schedulers to be plugged into each.

Indirectly solving this problem with rate limiting is much more complicated. RPC Handle, Data Stream, and Data Xceive stages on different nodes can all issue requests to Data Xceive on one DataNode; similarly, both LOG Append and Mem Flush may issue requests to the Data Stream stage. Thus, many stages must coordinate to limit the aggregated rate of requests sent to the on-demand stages. We simulate an algorithm similar to the one described in Section 3.1, but add additional logic to allocate rates between these stages based on previous workload patterns.

Figure 11 shows that even though the original HBase TAM does not isolate C1 from C2, both the direct and indirect approach provide stable throughput to C1 despite the change of C2. The indirect approach achieves lower throughput, because it conservatively sets the rate limits and only probes for more when observing low resource utilization. An additional disadvantage of the indirect approach occurs with more nodes: each node introduces more information that must be shared with others, making the approach less scalable.

Fig. 11.

Since the direct approach is simpler and superior in this case, both Tamed-HBase and Muzzled-HBase choose the direct solution and add scheduling points to Data Xceive and Data Stream.

4.1.2 Unknown Resource Usage.

In HBase TAM, the I/O and network components of the RPC Handle resource vector take the unknown value, indicating unknown resource usage. Further code inspection reveals that the RPC Handle threads only sends responses when the network is idle, so it does not interfere with scheduling. TAM produces a false positive here, because the threads exhibited “scheduler-like” behavior (deciding whether to perform a task based on the status of the resource) without going through the schedulers, which is not captured by TAM. Short-circuited reads, which are unknown when the request is scheduled, do cause contention for I/O and lead to ineffective scheduling.

For direct solution, we remove the unknown resource usage in the RPC Handle stage by moving short-circuited reads from RPC Handle to Data Xceive. Instead of performing reads by itself, once the RPC Handle stage recognizes a short-circuited read, it directly passes the read to the local Data Xceive stage without going through network transferring; at this point, the Data Xceive scheduler has knowledge of the I/O size and locations.

Instead of moving short-circuited reads to a different stage, for ease of implementation, the indirect solution leaves the TAM unchanged but uses speculative execution to work around the problem. We tracks the workload pattern of each client; when CPU and network are idle, the indirect approach speculatively executes requests from those clients who mostly issue CPU-intensive requests and abort these requests if they need I/O.

We simulate a standalone HBase node, which ensures that all HDFS reads at the RegionServer are short-circuited, thus isolating the effect of unknown resource usage. Figure 12 shows that both the direct and indirect approach achieve additional throughput for the cached Gets of C2 compared to original HBase, without reducing the throughput of C1 or C2’s cold-cached Gets. The indirect approach achieves slightly lower throughput than direct, because it must abort some requests, but this difference decreases with faster CPUs, since the cost of a failed speculation will be lower.

Fig. 12.

Even though Tamed-HBase adopts the direct solution, Muzzled-HBase chooses the indirect approach in favor of ease of implementation.

4.1.3 Hidden Contention.

Within the same node of the HBase TAM, both the RPC Handle and Data Xceive stages have an I/O component in their resource vectors; the RPC Read, RPC Handle, RPC Respond, Data Stream, and Data Xceive stage resource vectors all share the network component; many stage resource vectors contain the CPU component. All of them lead to the hidden contention problem.

To remove hidden contention directly, we restructure the stages so that each resource is managed by one dedicated stage. In general, one cannot completely eliminate hidden contention by dividing stages based on resource usage for two reasons:

(1)

Without special hardware, network packet processing requires significant CPU [16, 20, 40], so the network stage inevitably incurs both network and CPU usage.

(2)

Lock usage typically cannot be separated to a dedicated stage: it may be pointless to obtain a lock without doing some processing and consuming other resources.

In the case of HBase, the highly contended namespace lock is obtained to perform namespace manipulation (not shown in the simplified TAD), which does not incur extensive usage on other resources, so the lock stage can be separated. The network stage in HBase does incur CPU usage; however, by moving most CPU intensive tasks to the CPU stage (e.g., serialization and checksum verification), we can reduce the hidden contention on CPU between the network stage and the CPU stage to a minimal level.

Such stage restructuring requires major rewriting of the system; to lower the engineering effort required, the indirect solution keeps the stage unchanged, but uses a controller to monitor resource usage and adjust client weights at each stage. If stage S1 is excessively using resources on behalf of client C1, then the weight of C1 is reduced across all stages so that less C1 requests are issued to S1, forcing S1 to either use fewer resources or serve other clients. This algorithm is similar to the one deployed in Retro [38].

Figure 13(a) shows that both the direct and indirect approach isolate C1 from C2’s reply size change, versus original HBase cannot provide isolation. Figure 13(b) shows when C2’s workloads suddenly changes, the indirect solution takes a long time to converge to a fair state, depending on the amount of workload imbalance, as weight adjusting takes time. The direct solution, which completely removes the hidden contention on network, converges much more quickly.

Fig. 13.

Tamed-HBase adopts the direct solution, and the restructured stages are shown in Figure 17. To avoid no scheduling, all the new stages in Tamed-HBase have pluggable scheduling points, but the blocking relationships and order constraints are inherited from the old stages to the new ones (until further fixes). To reduce the engineering effort while maximizing the scheduling control, Muzzled-HBase adopts a hybrid of the direct and indirect approach. It only combines the RPC Read and RPC Respond stage, which are mostly responsible for network consumption. Muzzled-HBase works around the remaining contention indirectly with a controller, as described above.

4.1.4 Uncontrolled Thread Blocking.

In the HBase TAM, three stages are bounded and have a non-empty blocking set: RPC Handle, Mem Flush, and Log Sync, suggesting the uncontrolled thread blocking problem. Indeed, uncontrolled thread blocking in HBase causes extremely low throughput in production [32].

The direct solution makes the corresponding stages asynchronous to eliminate blocking and enable independent scheduling at different stages. Tamed-HBase (with stages restructured to remove hidden contention) adopts this solution and make its CPU and Log Sync stage asynchronous; implementing this solution requires changing the programming model of HBase.

The indirect solution treats the threads in the RPC Handle, Mem Flush, and Log Sync stage as resources and allocates thread time between clients like CPU or I/O resources. This approach does not eliminate blocking, but prevents one client from occupying all threads and allows other clients to make progress. More upstream threads are needed with this approach to maintain the same level of utilization (as some threads are blocked and not making progress), and more threads lead to less scheduling control and worse latency guarantees. The advantage of this approach over the information-based feedback loop (Section 3.4) is that upstream scheduling decisions are made locally, avoiding any control plane involvement. Muzzled-HBase adopts this indirect solution for ease of implementation and for later solving the ordering constraint problem (Section 4.1.5).

Figure 14 shows that when C2 switches to an I/O intensive workload, both the direct and indirect solution allow C1 to utilize CPU effectively and achieve high throughput, while original HBase delivers very low throughput. The direct solution achieves the highest throughput and near perfect resource utilization; while with the indirect solution the CPU utilization and C2 throughput fluctuate as the scheduler allocates different numbers of RPC Handle threads to C1. As shown in Figure 15, smaller number of RPC Handle threads leads to lower throughput and CPU utilization. However, increasing the number of threads would lead to less scheduling control, as more requests (exceeding the CPU parallelism level) are competing in the RPC Handle stage.

Fig. 14.

Fig. 15.

4.1.5 Ordering Constraints.

In the HBase TAM, the Data Stream and Data Xceive stage have ordering constraints and resource usages, which points to the ordering constraint problem (the Log Append stage also has ordering constraint, but does not incur extensive usage on any resources, so does not lead to the ordering constraint problem).

By re-designing the consistency mechanism, the ordering constraint can be removed. A simple redesign may maintain a separate WAL for each client, thus eliminating the need to preserve request ordering across clients. This simple solution is limited in its applicability though, as it only works when there is no dependency among the data written by different clients; for a more general solution, a more complete redesign of the consistency mechanism is required. We deploy this simple direct approach in Tamed-HBase and remove the ordering constraints in the Log Append and I/O stage (after the stage restructure).

For indirect solution, we avoid changing the consistency mechanism, but schedule at RPC Handle, above the ordering-constrained stages. Note that in Muzzled-HBase, we already schedule based on RPC Handle time to solve the blocking problem. Since threads at the RPC Handle stage block until WAL writes are done, under a stable workload, blocking time is roughly proportional to the number of downstream requests, and scheduling RPC Handle time indirectly schedules the WAL writes before passing them to Data Stream and Data Xceive. Muzzled-HBase thus solves the ordering problem indirectly. However, the number of RPC Handle threads are typically larger than the I/O parallelism in the system, making this approach less effective; therefore, we compare two settings of the indirect approach, with 10 or 30 RPC Handle threads.

Figure 16 shows that unlike in original HBase, where the throughput drop and latency increase are unbounded, both the direct and indirect solutions are able to limit C2’s effect on C1 when C2 keeps increasing the number of outstanding requests. The direct solution achieves the best throughput and latency guarantees, though requiring a complete rewriting of the WAL layer of the system. The indirect solution largely ensures isolation, but at the cost of lower throughput and longer latencies. With more RPC Handle threads, isolation becomes less effective and the latency guarantee degrades, because C1 competes with more requests from C2 after they enter RPC Handle.

Fig. 16.

4.1.6 Summary.

HBase does attempt to provide scheduling, in the form of exporting a scheduling API at the RPC Handle stage; however, this effort is rather incomplete as it fails to solve any of the scheduling problems HBase possesses, thus suggesting the importance of systematic schedulability analysis. With the aid of TAM, we are able to identify and solve all of HBase’s scheduling problems, and transform HBase to provide schedulability.

Tamed-HBase directly modifies the problematic TAM to eliminate scheduling problems (except for the hidden contention on CPU, which it reduces to a low level); its TAD is shown in Figure 17. However, realizing these changes in implementation may be too difficult and risky; for example, making stages asynchronous requires changing the RPC programming model of HBase, and removing the ordering constraint is akin to re-designing the consistency mechanism.

Fig. 17.

In contrast, Muzzled-HBase minimizes the changes to the architecture; its TAD is shown in Figure 18. As we can see, even though the no scheduling problem and the hidden contention between RPC Read and RPC Respond are eliminated, the Muzzled-HBase TAD still exhibits other problems, including unknown resource usage, blocking, ordering constraint, and hidden contention among other stages. In Muzzled-HBase, various indirect approaches are used instead to mitigate the effects of these problems and achieve approximate scheduling control. These approaches minimize the changes required to HBase, thus lower the engineering effort required, but they also incur overheads (e.g., wasted CPU cycles on aborted requests or communication across stages).

Fig. 18.

Choosing the direct or indirect approaches to solve the scheduling problems of a specific system would thus depend on the architecture and implementation of the system (e.g., its consistency mechanism and RPC implementation), the overall scheduling goal, and the engineering effort and runtime overhead one can tolerate.

The HBase case study also reveals a false positive scenarios of TAM (see Section 4.1.3), which stems from TAM not capturing certain thread behaviors. We feel the simplicity of TAM is more valuable, and the occasional false positives are a worthy price to pay.

4.2 Muzzled-HBase Implementation

In this section, we demonstrate that real HBase suffers from the scheduling problems we identified, and fixing these problems leads to schedulability. The Muzzled-HBase implementation gives us experience realizing schedulability in real systems and validates that the TAM-based simulations are excellent predictors of the real world.

Our implementation closely follows the Muzzled-HBase simulation and the modifications are straightforward as we intentionally chose solutions that require minimal changes to HBase. For ease of implementation, the RPC Read and RPC Respond stages share the same scheduler, instead of being fully combined. Client identifiers are propagated across stages with requests, so each scheduler can map requests back to the originating client. On top of Muzzled-HBase, multiple scheduling policies are implemented, including FIFO, DRF and priority scheduling. In our implementation, a centralized controller collects information and coordinates local scheduler behavior; however, other mechanisms such as distributed coordination [5, 65] are also possible. For now the scheduler performs network resource scheduling only when the server’s bandwidth is the bottleneck; we anticipate incorporating global network bandwidth allocation [44] in the future. The final implementation consists of the addition of \(\sim\)4,800 lines of code to the HBase code base and \(\sim\)3,000 lines of code to HDFS.

4.2.1 TAM Validation.

To match the simulation environment, we run experiments on an 8-node cluster. Each node has two 8-core CPUs at 2.40 GHz (plus hyper-threading), 128 GB of RAM, an 480 GB SSD (to run the system) and two 1.2 TB HDD (to host the HDFS data). The nodes are connected via 10 Gbps network. One node hosts the HMaster, NameNode, and Secondary NameNode; the other seven nodes host RegionServers and DataNodes.

We re-run the simulated experiments in Section 4.1 and show side by side the simulation results and the results obtained from real implementation (note the axis scales may be different). We can see that both the problems and solutions in the implementation match the TAM-based simulations; this accuracy holds across all five categories, indicating that our TAM model captures important performance properties of systems and serves as an excellent scheduling simulation blueprint.

Figure 19 illustrates that HBase suffered the no scheduling problem and as a result, the throughput of client C1 is significantly harmed when C2 issues more requests; further, Figure 19 shows that the solution of adding scheduling points at resource-intensive stages provides performance isolation in the real world.

Fig. 19.

The unknown resource problem that exists in HBase is shown in Figure 20: when client C2 requests in-cache data, original HBase is not able to efficiently utilize the CPU. Muzzled-HBase with speculative execution dramatically improves the throughput of C2 without harming C1. We see much higher relative throughput for cached requests in implementation than in simulation, because in our cluster the CPU is much faster (16 cores at 2.4 GHz with hyper-threading) than the simulated single-core 1 GHz CPU.

Fig. 20.

Figure 21 verifies that HBase suffered hidden contention across multiple stages, which manifests when one stage consumes more resources on behalf of a particular client (i.e., more network for C2). The small difference between the implementation and simulation results for a reply size of 64 KB occurs, because in the implementation, after transferring 64 KB, the RPC Respond thread switches to another request; we did not simulate this detail. The solution suggested by our TAM simulations again works well for the implementation.

Fig. 21.

The uncontrolled thread blocking problem that exists within HBase is illustrated in Figure 22. In original HBase, when the workload of one client switches from CPU to I/O-intensive (C2 at time 60), both clients are harmed, because not enough threads are available. Muzzled-HBase, however, protects C1 from the workload change of C2. In real implementation, the throughput of both clients increases after a while, when the data C2 accesses are cached by the local file system in the page cache. The effect of the OS page cache is not captured in our simulation, but the general trend, that C1 is slowed down to the C2 throughput level, remains the same both in simulation and in implementation. When disabling the page cache (not shown here), we observe that the throughput of both C1 and C2 remains low, as suggested by the simulation.

Fig. 22.

Finally, HBase’s ordering problem is shown in Figure 23; when C2 writes more data, the throughput of C1 suffers. Again, this problem is alleviated in Muzzled-HBase by limiting the number of outstanding requests to the lower stage to 10 or 30; 30 outstanding requests leads to worse isolation than 10, as suggested by the simulation.

Fig. 23.

4.2.2 Macrobenchmarks.

The final performance of Muzzled-HBase for YCSB [17] is shown in Figure 1 (page 2). For Figure 1(a), five different clients are each given a different weight and we use DRF-based local scheduler to achieve global weighted fairness. The original HBase was unable to provide weighted fairness across clients with different priorities, instead delivering approximately equal throughput to each; Muzzled-HBase, in contrast, enforces weighted fairness as desired.

For Figure 1(b), priority scheduling is implemented atop Muzzled-HBase by always reserving a small subset of resources, including the RPC Handle threads, for foreground workloads. With original HBase, the tail latencies of the foreground workload increase significantly when different types of workloads run in background; Muzzled-HBase, however, is able to maintain stable latencies despite the interference from the background workloads.

4.3 Discussion

Schedulability can be achieved by modifying the problematic TAM to eliminate scheduling problems directly. However, as we can see in the case of HBase, changing the TAM for existing systems usually involves restructuring the system, which is labor-intensive. As another example, to fix the uncontrolled thread blocking problem directly, Google faced onerous challenges in transforming the servers in their microservices platform from a blocking, thread-per-request model to be asynchronous only [57]. To minimize the changes to the architecture or lower the engineering effort, often we are forced to keep the same TAM, but use indirect solutions to work around its inherent structural flaws and alleviate the effects of the scheduling problems. Unfortunately, these indirect solutions only provide approximate scheduling control and incur overheads, leading to worse performance and latency guarantees, longer convergence time or lower system scalability.

We thus encourage developers to take schedulability into consideration in the early phase of system design; this is especially important in a cloud-based world where users demand isolation and quality of service guarantees. By specifying the TAM of a system, potential scheduling problems can be discovered early, avoiding the painful process of retrofitting scheduling control later. Of course, schedulability may need to be balanced with other system design goals. For example, the system architects may decide that having a simple synchronous programming model is more important, and accept blocking at some stages. However, these kind of compromises should be made only after carefully weighing the trade-offs between different goals, not just due to the obliviousness of their schedulability ramification.

5 Schedulability of Other Systems

Earlier, we showed how to transform HBase to provide schedulability. Other concurrent systems can be analyzed and transformed in the same way. Here, we analyze the schedulability of MongoDB [15], Cassandra [46], and Riak [43]. Through the analysis, we can see that different system exhibits different thread architecture and scheduling problems, which teaches us valuable lessons on how specific ways to organize and build systems are associated with certain scheduling problems. For example, the run-to-completion model of MongoDB inherently suffers severe unknown resource usage problem, while in Riak scheduling-oblivious layering and abstraction introduce hard-to-fix no scheduling, unknown resource usage, and hidden contention problems.

For each system, we first introduce its thread architecture and request flows based on the TAD, then perform TAM-based analysis to identify its scheduling problems and verify the problems and solutions via simulation; finally, we discuss the lessons learned from studying the thread architecture of this particular system.

5.1 MongoDB

5.1.1 Thread Architecture.

The TAD of MongoDB is shown in Figure 24. In MongoDB, data are stored in different shards; each shard is an independent replication set, and consists of one primary and two secondary data nodes. The primary node receives all write operations and records changes to its operation log (oplog); the secondaries replicate the primary’s oplog and apply the operations to their data. Both primary and secondaries can serve read requests.

Fig. 24.

In MongoDB, the Worker stage is responsible for most request processing, while the other stages are involved mainly in replication. For each client connection, a new Worker thread is launched; however, only 128 threads are allowed to make progress at any given time in the Worker stage, making it a bounded stage. The Worker threads are responsible for executing the requests until their completion, including reading requests from the clients, calculating the optimal query plan, carrying out the query plan to retrieve indices and data (reading from disk if needed), and sending the replies back to the clients. During the processing, the Worker threads may perform I/O or obtain database locks.

When processing writes, the Worker threads also record the operations in the oplog, and replicate the oplog to the secondary nodes (step 1). Depending on the consistency level, the Worker threads may block until the writes have been replicated to enough secondary nodes. The oplog data are passed through the NetInterface stage, the Batcher stage (which groups the operations into batches), and the Oplog Writer stage in the secondary nodes (steps 2–4). The Oplog Writer thread first writes the operations to the secondary node’s oplog, then assigns batches to the Writer threads to apply to the main database (step 5). The Oplog Writer thread blocks until the Writer threads finish the batches (step 6), at which point it signals the Feedback thread (step 7), which in turn notifies the primary node (step 8).

5.1.2 TAM Analysis and Simulation.

From the MongoDB TAM, we can identify (a) the unknown resource usage problem at the Worker stage, which processes client requests until completion and may consume the I/O and database lock resources; (b) the hidden contention problem in the secondary nodes; most notably, the Worker stage and the Oplog Writer stage compete for database locks; (c) the uncontrolled thread blocking problem at the Worker stage, where the Worker threads may block until replication is completed.

We simulate an MongoDB cluster with 24 nodes that form 8 replication sets; each replication set consists of one primary node and two secondary nodes. Each node has one 1 GHz CPU and one disk with 100 MB bandwidth, and is connected via 1 Gbps network. In original MongoDB workers are scheduled to execute using a simple FIFO algorithm, we instead allow any scheduler and use DRF for our simulation.

Figure 25(a) shows the unknown resource usage problem on I/O resources. We could see that when C2 issues both cold- and hot-cache requests, original MongoDB fails to utilize CPU effectively (less then 40% utilization) and provides low throughput to C2 cached requests. If MongoDB knows beforehand if a request needs I/O and schedules the cached requests when CPU is idle, then both the CPU utilization and C2 throughput would be improved, as shown in Figure 25(b). However, as the Worker stage has a complex execution path and may perform many tasks, it is generally hard to predict the resource usage of a request (in this case, whether a request needs I/O).

Fig. 25.

Figure 26 shows the hidden contention on the database lock in the MongoDB secondary nodes. We can see that applying DRF separately at each stage does not guarantee C1 gets a fair share of the database lock time, as shown in Figure 26(a). Instead, when C2 updates more documents at the same time, which leads the Oplog Write stage to take more time to replicate the updates, it holds the lock for up to 98% of the time, and causes the throughput of C1 to drop from 35 Kops/s to less than 0.5 Kops/s. In fact, this very problem was observed in production and reported to cause unbounded delay for reads under heavy write load [52]. When using a unified scheduler that coordinate requests across stages, however, C2 only uses up to 50% of the database lock time, and C1’s throughput stabilizes as C2 updates more documents, as shown in Figure 26(b).

Fig. 26.

Figure 27 shows the uncontrolled thread blocking problem of MongoDB. When the I/O bandwidth of the secondary nodes decreases, we expect the throughput of C2 to decrease, since it relies on replication to the secondaries, but the throughput of C1, whose requests do not go through the secondary nodes at all, should remain stable. However, as we can see from Figure 27(a), the throughput of C1 suffers, because all 128 Worker threads are blocking and no threads are available to process the read requests. After changing the Worker stage to be non-blocking in simulation, the throughput of C1 is not affected, as shown in Figure 27(b).

Fig. 27.

5.1.3 Lessons.

The study of MongoDB shows how a synchronous threading architecture, where one thread executes a request until its completion, is inherently associated with the unknown resource usage problem due to the complex and highly variable execution path within a thread. MongoDB resembles such an architecture: when a Worker thread reads a request off the network, it does not know whether this request hits cache, requires replication, has to obtain excessive locks, or needs I/O within this stage; it does not even have basis to make good predictions, making it hard to schedule at this point. We expect that altering MongoDB to provide schedulability will be difficult and require substantial structural changes to conquer the complex path and resource patterns within the Worker stage. The newer versions of MongoDB (v5.0 or v6.0) have indeed improved scheduling by moving away from such synchronous threading architecture and introducing Batons, lightweight job queues within the MongoDB process that move the execution of certain tasks out of the Worker stage [55]. However, in newer versions of MongoDB, a dedicated thread is still responsible for the whole lifetime of a client request [54], so most of our discussion and analysis on its scheduling problems still applies.

5.2 Cassandra

5.2.1 Thread Architecture.

The TAD of Cassandra is shown in Figure 28. When the client sends a request to one of the Cassandra nodes, a thread in the C-ReqHandle stage reads the request in, decodes the request, and coordinates its processing. After parsing a request, the C-ReqHandle thread first determines the request type and looks up where the relevant data is stored. For local data, the request is directly submitted to the corresponding local processing stages based on its type (step \(l_1\)). For remote data, the request is passed to the Msg Out stage (step 1). The C-ReqHandle thread then blocks until the request completes, either on the local or remote node.

Fig. 28.

The Msg Out stage sends the remote requests through the network (step 2); these requests are read and de-serialized by the Msg In stage in the remote node. Once finished, the Msg In stage puts the parsed requests in the queue of different processing stages based on the request type (step 3).

Cassandra has a dozen processing stages for different types of requests: Read, Read-Repair, Mutation, View Mutation, Gossip, and so on. We omit most of them in the TAD shown here, because they all exhibit similar behaviors. These stages execute the requests by carrying out CPU-intensive activities such as cache look-ups or data compressing; they may also perform I/O if the data needed are not in cache. After completing a request, the processing stages generate a response and pass it to the Msg Out stage (step 4).

The Msg Out stage sends the response back to the Msg In stage in the node that initiated the request processing (step 5), which in turn passes the response to the Respond stage, which is responsible for executing any callbacks associated with the request completion (step 6). Finally, the C-ReqHandle thread is notified and finishes blocking (step 7); it passes the response to the C-Respond stage, which serializes the response and sends it to the client.

5.2.2 TAM Analysis and Simulation.

From the Cassandra TAM, we can identify (a) unknown resource usage in the Read, Mutation, View-Mutation and other database processing stages, since those stages may perform I/O; (b) hidden contention between many stages for CPU, I/O and network; (c) uncontrolled thread blocking in the C-ReqHandle stage as its threads block on the request completion events.

Based on the TAM, we simulate a Cassandra cluster with three nodes. Each node has one 1 GHz CPU and one 100 MB/s disk, and is connected via 1 Gbps network.

Figure 29 shows the unknown resource usage problem of Cassandra on I/O resources. We could see that when C2 issues both cold- and hot-cache requests, original Cassandra, without the exact knowledge of resource usage for each request, fails to utilize network effectively (less than 20% utilization) and provides low throughput to C2 cached requests. However, by speculatively executing requests when network and CPU are idle, Cassandra could fully utilize network and provide much higher throughput for C2.

Fig. 29.

Figure 30 shows the uncontrolled thread blocking problem of Cassandra. When C2 switches from an CPU-intensive workload to an I/O-intensive one, we expect the throughput of C1 to increase, since C1 is still CPU-bound and more CPU resources have become available. However, as we can see from Figure 30(a), the throughput of C1 declines abruptly. All C-ReqHandle threads are blocked waiting for the processing of C2 requests to complete, which takes a long time, since they require I/O; no threads are left to serve C1. When we change the C-ReqHandle stage to be non-blocking, as shown in Figure 30(b), the throughput of C1 improves as expected after C2’s workload change; the number of C-ReqHandle threads serving C2 remains low, because C-ReqHandle stage is not the bottleneck and it does not have to wait for other stages.

Fig. 30.

Figure 31(a) shows the hidden contention on network resources in Cassandra. We can see that attempting fair scheduling at each stage does not ensure fair sharing globally. Instead, as C2 increases its response size, it unfairly consumes as much as 95% of the network resource, leading to declined C1 performance. When using a unified scheduler, however, all contentions on the network are explicitly managed, since the scheduler has a global view of network usage. As a result, C2 can only use its fair share of network, and C1’s throughput is not affected by C2’s behavior, as shown in Figure 31(b).

Fig. 31.

5.2.3 Lessons.

The study of the Cassandra thread architecture teaches us that despite its initial design goal [88], a SEDA-compliant architecture does not ensure schedulability; indeed, scheduling-aware stage division plays a vital role in designing a SEDA system with high schedulability. Cassandra closely follows the standard SEDA architecture, where all activities are managed in controlled stages. However, schedulability does not automatically follow. Having too many stages with the same resource pattern leads to hidden contention and the “inability to balance reads, writes, compaction, flushing” [10]; likewise, CPU- and I/O-intensive operations in the same stage leads to unknown resource usage.

More thoughts on how to divide stages are needed to build a highly schedulable system. Instead of dividing stages based on functionalities, such as in Cassandra, where different stages may exhibit similar resource patterns and compete with each other, we recommend dividing stages based on resource usage patterns to give more resource information to the scheduler and reduce hidden contention. Cassandra is currently moving toward this direction: developers have proposed combining different processing stages into a single non-blocking stage, and moving I/O to a dedicated thread pool [10].

5.3 Riak

5.3.1 Thread Architecture.

The TAD of Riak is shown in Figure 32. Riak is a newly emerged distributed NoSQL database based on the Erlang functional programming language [23]. Built on top of the Erlang runtime system, Riak relies heavily on the light-weighted processes (referred to as threads hereinafter) and transparent IPC mechanisms provided by the Erlang runtime.

Fig. 32.

When clients issue requests to one of the Riak nodes, a new Req In-Out thread is created for each new client connection to read from/write to the connection, and to encode/decode the messages. After decoding a request, the Req In-Out thread passes the request to the Process stage (step 1), which also spawns a new thread for each new client connection. The Req In-Out thread then blocks until the response of the request is available.

Riak uses consistent hashing to divide data into partitions and places multiple partitions in one physical node. The newly created Process thread looks up which partitions the requested data reside, and sends the request to one or more of these partitions (step 2); it then blocks until the request is completed in those partitions. There is one Cmd Handle thread for each partition, which is responsible for performing I/Os. Upon completion of I/O, the Cmd Handle thread sends the response back to the issuing Process thread (step 3).

After hearing back from all the Cmd Handle threads it sends requests to, the Process thread generates the response to the client and passes it to the Req In-Out thread (step 4). The Req In-Out thread encodes the response, sends it off the network, and waits for another request from that particular client.

In the above request flow, all data transfer between threads are handled by network-transparent IPC, which provides the same interface regardless of whether the communicating threads are local or remote, and hides the potential network usage. The thread initiating the transfer thus does not know whether the transfer would require network resources or not.

5.3.2 TAM Analysis and Simulation.

From the Riak TAM, we can identify (a) the no scheduling problem at the Req In-Out and Process stage; (b) unknown resource usage at the Process and Cmd Handle stage, as communication between them may consume network resources if the threads locate on different physical nodes; (c) hidden contention on network resources across all stages, on CPU resources between the Req In-Out and Process stage, and on I/O resources between the Cmd Handle stages for different partitions within the same node.

Based on the TAM, we simulate a Riak cluster with four physical nodes. Each node has one 1 GHz CPU and one 100 MB/s disk, and hosts 4 data partitions (i.e., has four Cmd Handle stages). Nodes are connected via 800 Mbps network.

Figure 33(a) shows the no scheduling problem of Riak. We can see that original Riak fails to provide isolation: when C2 issues requests with more threads, it consumes more network resources and the throughput of C1 suffers greatly. After adding scheduling points at the previous on-demand Req In-Out and Process stages, network resources are shared fairly between C1 and C2, and the throughput of C1 is not affected by C2, as shown in Figure 33(b).

Fig. 33.

Figure 34(a) shows the unknown resource usage problem of Riak. Without the knowledge of the network resource usage, the scheduler cannot effectively utilize the I/O resource by scheduling local requests that do not require network transfers when the network is busy but I/O devices are idle, leading to lower throughput of C2. After modifying Riak (in simulation) to add separate network transfer stages to handle remote requests, which have exact knowledge of the network usage, though, requests can be scheduled based on their resource usage patterns; as a result, the I/O resource is fully utilized and the throughput of C2 improves, because more local requests are served, as shown in Figure 34(b). However, implementing such a modification is difficult in Riak; it requires modifying the Erlang runtime to expose network management to applications so that Riak can tell whether a request is remote and needs network resources.

Fig. 34.

Figure 35(a) shows the hidden contention on I/O resources among the Cmd Handle stages in Riak. In a Riak version where each Cmd Handle stage is managed by a separate DRF scheduler, as C2 issues larger I/O requests, we can see that it uses more I/O resources (up to 70%) and causes the throughput of C1 to decrease. Even though the scheduler at each stage attempts to allocate I/O resource fairly, the contention between these stages is unregulated and causes C2 to receive more resources when it issues larger requests. When a unified scheduler is used, which has a wholistic view of the I/O usage within the node, C2 always receives a fair share of the I/O resources and does not affect the performance of C1, as shown in Figure 35(b).

Fig. 35.

5.3.3 Lessons.

The study of the Riak thread architecture teaches us valuable lessons on why schedulability should be treated as a first class citizen in system design, especially for abstraction and layering. Riak also closely follows the SEDA architecture, but relies heavily on the light-weighted processes and transparent IPC provided by the Erlang runtime, which makes resource management implicit [23]. Although creating a new Erlang process may have low overhead, creating them on-demand leads to the no scheduling problem. Similarly, with transparent IPC, many stages may consume network bandwidth without knowing it, causing unknown resource usage and hidden contention. To make Riak schedulable, one must either explicitly manage the above mechanisms, bypass the Erlang runtime, or change the Erlang runtime API to allow scheduling policies to be passed from Riak to the runtime. Riak is thus a great example demonstrating how scheduling-oblivious layering and abstraction could hinder schedulability, and why schedulability should be treated as a first class citizen in system design.

5.4 Summary

Table 3 presents a summary of the systems we study and their scheduling problems. As we can see, scheduling problems are prevalent in every system we studied and can be effectively identified by TAM analysis. Some of the problems predicted by TAM have also been experienced in production environments [10, 32, 52].

Table 3.

	N	U	C	B	O
HBase [26]	\(\times\)	\(\times\)	\(\times\)	\(\times\)	\(\times\)
MongoDB [15]		\(\times\)	\(\times\)	\(\times\)
Cassandra [46]		\(\times\)	\(\times\)	\(\times\)
Riak [43]	\(\times\)	\(\times\)	\(\times\)

Table 3. Scheduling Problems Identified from TAM

\(\times\) indicates the system has the corresponding problem.

Different system architectures pose different challenges to schedulability, and specific ways to organize and build a system may be associated with a set of typical scheduling problems. For example, the traditional thread-per-request architecture suffers heavily from the unknown resource usage problem due to the complex execution path within a thread; while a SEDA architecture that has multiple stages with similar resource profiles suffers hidden contention across stages. We thus re-emphasize the need to consider schedulability and address scheduling problems at system design time, as retrofitting scheduling control into an unsuitable architecture may be quite challenging (e.g., may require breaking existing system abstractions).

6 Model Limitations

We have shown that TAM is a useful tool for schedulability analysis and delivers promising results. In this section, we discuss some of its limitations and how we can extend TAM to further help schedulability analysis.

First, TAM is best suited for describing SEDA-like systems, where each thread belongs to a specific stage. However, in other concurrency models, threads and stages may not be statically bound. For example, in a run-to-completion model, a single thread may perform multiple tasks until a request is completed, and be scheduled (possibly by yielding) before each task. In this case, a stage would be better defined as the execution between scheduling points, allowing one thread to cross multiple stages. We leave extending TAM to other concurrency models to future work.

Second, both direct and indirect methods can be adopted to solve the scheduling problems and achieve schedulability. However, even though direct solutions are explicitly shown in TAM as architecture modifications, indirect solutions, which utilize information within and across stages to adjust scheduler behaviors, are not captured in TAM. As a result, the TAM of the system that implements indirect solutions would still exhibit the same scheduling problems even though these problems have been (indirectly) mitigated. We are considering encoding information flow between stages into TAM, which would allow it to capture the scheduling effects of indirect solutions.

Third, in TAM resource is strictly intra-node, and we model resource contention and scheduling within a node boundary. This approach, while simple, does leave out certain resource that may be contended cross nodes, such as the inter-rack network bandwidth (as opposed to the server bandwidth, which TAM does model). As a result, contentions on such non-local resource and its related problems (e.g., hidden contention as discussed in Section 3.3) cannot be captured by TAM. Extending TAM to model non-local resource is one direction for future work.

Finally, even though different systems might possess the same scheduling problems, the difficulty of fixing their problems could vary vastly based on the system’s internal structure and code base. Fixing the unknown resource problem directly in HBase requires only separating the short-circuited read processing from the RPC Read stage; fixing this problem in MongoDB, however, requires a major re-structuring of the Worker stage to account for its complex execution paths. TAM is effective in identifying the problems, but does not give many indications on how difficult solving these problems would be; systematically reasoning about such difficulties is another interesting direction to extend TAM.

7 Related Work

An earlier version of this article appeared in Proceedings of the 13th USENIX Symposium on Operation Systems Design and Implementation (OSDI’18) [76]. In this extended version, we introduce the TASC schedulability conditions and discuss how the five scheduling problems violate TASC, which helps explain why these problems cause scheduling difficulties. For each scheduling problem, the prior version presents a direct solution: how to change the thread architecture to solve the problem; we extend this by comparing the direct and indirect solutions for each problem and discussing their trade-offs, demonstrated via new simulation experiments and graphs. We discuss in more detail the significance of TAM in scheduling simulation by serving as a simulation blueprint, and the limitations of TADalyzer. For other systems such as MongoDB, Cassandra, and Riak, the prior version gives a high-level overview of their thread architectures, while this extended version adds new simulation experiments and graphs to demonstrate and analyze each of their scheduling problems and solutions. Through the analysis, we also learn how specific ways to organize and build systems are associated with certain scheduling problems, which is discussed in Section 5 of this extended version.

Scheduling as a general problem has been extensively studied in computer science, manufacturing, operational research, and many other fields [64, 71, 72, 83]. Our work differs from the previous ones, as we separate the scheduling problem in distributed storage systems into two sub-problems: the meta schedulability problem and the specific scheduling problem. For a general-purpose storage system that is designed to work for various workloads and meet various performance measures, the schedulability problem is answered at the system design/build phase, and concerns whether the system offers proper scheduling support: are schedulers placed at the right points in the system and given necessary information and control? Once proper scheduling support is built in (i.e., the system provides schedulability), the user can solve her own specific scheduling problem: given her workload, which scheduling policy should she implement on top of the scheduling support provided by the system to realize a particular performance goal?

Such separation distinguishes the TAM approach from other formalization of the scheduling problems, such as queuing networks [42, 51, 58, 82] or stochastic Petri nets [18, 62, 83, 91, 92], which focus on solving specific scheduling problems. For example, traditional queuing network models encode specific scheduling plan information (e.g., the queuing discipline at each local scheduler) and workload characteristics, (e.g., the service time distribution and transition probability), and output performance measures (e.g., response time distribution) [7, 33, 82]. Stochastic Petri nets similarly require such details. One could view TAM as a queuing network skeleton, stripped of all information but that available at system design time; our schedulability analysis aims to derive properties from the limited information encoded in TAM that would hold after the TAM skeleton is augmented with various workload/queuing discipline information to form a complete queuing network. Some techniques developed in the queuing theory context, such as the heavy traffic approximation [45], may be borrowed to prove certain properties of the TAM (e.g., the exhaustiveness of the five scheduling problems); we leave that as future work.

From a more system-oriented perspective, previous work has focused on proposing scheduling plans that achieve various specific goals [66, 73, 74, 77, 87, 90]. For example, Pisces [73] discusses how to allocate local weights to match client demands and achieve global fairness; Cake [87] proposes a feedback loop to adjust local scheduler behavior to provide latency guarantees; Retro [38] supports different scheduling policies, but by translating these policies into rate limits at local schedulers. All the above works need proper scheduling support to enforce their plans. As current systems usually lack such support (Section 5), people indeed encounter the five categories of problems we have identified during the realization of their scheduling plans [49, 74, 87, 90]: Mace et al. found that the inability to predict resource usage (unknown resource usage) and the excessive blocking of threads (uncontrolled thread blocking) prevented them from achieving fairness [49]; Cake [87] had to add scheduling points to HDFS to enforce SLOs. However, in these systems the encountered problems are solved in an ad hoc manner; the solutions are often buried in implementation details or not discussed at all. A general framework that addresses the schedulability problem explicitly and systematically is thus strongly called for.

Monotasks [59] advocates an architecture in which jobs are broken into units of work that each use a single resource, and each resource is managed with a dedicated scheduler. From the TAM perspective, such an architecture eliminates the unknown resource usage and the hidden contention problem, allowing the system to provide better schedulability. The authors indeed observe that this architecture “allows MonoSpark to avoid resource contention and under utilization,” as predicted by TAM.

Our work is also similar to SEDA [88] and Flash [60] in the sense that it studies and modifies the thread structure and interactions to improve system performance. In particular, we borrowed many terminologies from SEDA. Like our work, Capriccio [85] automatically deduces a flow graph and places scheduling points at the graph nodes for thread scheduling.

8 Conclusions

With resource sharing being one of the key aspects of modern scalable storage systems, correct and flexible scheduling becomes a central goal in system design. To ensure scheduling works as desired, schedulability analysis should be an integral part of the concurrency architecture. The thread architecture model provides a systematic way of performing such analysis, thus turning the art of enabling effective scheduling into a science that is easily accessible and automatable. The software for schedulability analysis (e.g., TADalyzer and the TAM-based simulation framework) is available at http://research.cs.wisc.edu/adsl/Software/TAM.

Acknowledgments

We thank the anonymous reviewers for their insightful feedback, which improves this article significantly. We are especially grateful to the reviewer who asked for the model behind the thread architecture diagrams. We thank Michael Gleicher for his guidance in the thread architecture visualization. CloudLab [67] provided infrastructure to run our experiments.

Footnotes

We treat each lock instance as a separate resource, but are usually only interested in one or two highly contended locks in the system, e.g., the namespace lock in HDFS.

All TAM/TADs shown in the article (except MongoDB) have been validated by each system’s developers [21, 24, 29].

A more stringent definition may require each resource intensive stage to provide pluggable scheduling point to allow flexible scheduling policy realization; we opt for a looser definition here.

References

[1]

Pelikan Cache. 2020. Taming Tail Latency and Achieving Predictability. Retrieved from https://www.pelikan.io/2020/benchmark-adq.html.

Abstract

1 Introduction

2 Thread Architecture Model

2.1 TAM: General Description

2.2 Visualization with TAD

2.3 Automatic Obtainment

2.4 TAM: A Simulation Blueprint

2.5 TAM: Formal Definition

3 The Scheduling Conditions and Problems

3.1 No Scheduling

3.2 Unknown Resource Usage

3.3 Hidden Contention

3.4 Uncontrolled Thread Blocking

3.5 Ordering Constraint

3.6 Discussion

4 HBase: A Case Study

4.1 TAM-based Simulations

4.1.1 No Scheduling.

4.1.2 Unknown Resource Usage.

4.1.3 Hidden Contention.

4.1.4 Uncontrolled Thread Blocking.

4.1.5 Ordering Constraints.

4.1.6 Summary.

4.2 Muzzled-HBase Implementation

4.2.1 TAM Validation.

4.2.2 Macrobenchmarks.

4.3 Discussion

5 Schedulability of Other Systems

5.1 MongoDB

5.1.1 Thread Architecture.

5.1.2 TAM Analysis and Simulation.

5.1.3 Lessons.

5.2 Cassandra

5.2.1 Thread Architecture.

5.2.2 TAM Analysis and Simulation.

5.2.3 Lessons.

5.3 Riak

5.3.1 Thread Architecture.

5.3.2 TAM Analysis and Simulation.

5.3.3 Lessons.

5.4 Summary

6 Model Limitations

7 Related Work

8 Conclusions

Acknowledgments

Footnotes

References

Cited By

Index Terms

Recommendations

Principled schedulability analysis for distributed storage systems using thread architecture models

LLF Schedulability Analysis on Multiprocessor Platforms

FPZL Schedulability Analysis

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

HTML Format

Get Access

Login options

Full Access

Figures

Other

Share