This section provides a case study of building performance models for Spark, a distributed framework for data analytics applications. For a given workload (defined as an application-input pair), we are interested in predicting its execution time, denoted by \(y\), as a function of a Spark configuration, denoted by \(x\). The vector \(x\) is a set of values for the parameters described in Table 1.
2.1 Spark Parameter Study
Spark is a distributed, in-memory computing framework that supports varied applications in data analysis such as machine learning and graph processing [2]. Spark defines a Resilient Distributed Dataset (RDD), an abstraction of distributed memory [73], and provides programmers with composable operators to compute on it. An application defines a sequence of RDD operators and specifies dependencies between them in a directed acyclic graph. Operators are organized into stages, each with a set of parallel tasks that operate on RDD partitions. The Spark run-time system schedules tasks on executors distributed across worker machines. Executors are Java Virtual Machines that execute tasks, which are Java threads, and store the partitions of the RDD.
Applications are sensitive to the Spark configuration. Spark provides more than 100 parameters. First, resource parameters specify computational resources (i.e., CPU time and memory) for software processes distributed over worker machines. Second, workload parameters configure both thread-level (e.g., task.cpus) and data-level parallelism (e.g., default.parallelism, files.maxPartitionBytes). Third, parameters configure data compression (e.g., io.*) and serialization (e.g., kryoserializer.*) for transfers through the disk and network. Other parameters (e.g., shuffle.*) determine how data is shuffled between executors to synchronize intermediate results. Finally, memory management parameters (e.g., memory.*) configure the layout of Java virtual machines, which is important when storing large in-memory objects.
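To make the configuration vector \(x\) concrete, the sketch below sets a few parameters from each category through PySpark's SparkSession builder; the specific values are illustrative, not tuned recommendations.

```python
# Illustrative (not tuned) values for several parameter categories described above;
# keys follow Spark's spark.* configuration names.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-example")
    # resource parameters
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # parallelism parameters
    .config("spark.default.parallelism", "64")
    .config("spark.task.cpus", "1")
    # compression / serialization parameters
    .config("spark.io.compression.codec", "lz4")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # shuffle and memory-management parameters
    .config("spark.shuffle.file.buffer", "64k")
    .config("spark.memory.fraction", "0.6")
    .getOrCreate()
)
```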
Table 1 shows representative parameters of these important categories for performance tuning. The default values of these parameters are often far from optimal for varied applications, and tuning them can improve performance by up to 5\(\times\) [72]. Although these parameters are Spark-specific, many of them represent common parameters in other cluster computing frameworks [1, 5, 10, 11].
We make several observations about these parameters. First, they constitute a configuration space of approximately \(10^{25}\) points, making it difficult to design heuristic searches based on expert knowledge and intuition. Second, parameters interact in complex ways. The effect of shuffle.file.buffer depends on memory.fraction, which controls the fraction of heap space allocated for computation and temporary storage; the effect of default.parallelism depends on executor.cores, which controls per-machine parallelism for the application. Identifying and modeling such interactions manually is difficult, motivating modeling strategies that discover and account for them automatically.
Third, some important parameters represent trade-offs between resources and performance. For example, tuning driver.cores and executor.cores trades off core resources and computation performance, whereas tuning executor.memory trades off in-memory resources and data-loading performance. Spark users may have different optimization targets that weigh resources and performance differently. For example, users running Spark on private clusters may prefer maximizing performance, whereas users running on public clouds may prefer minimizing cost under a performance constraint. Building accurate performance models enables flexible optimization targets.
Finally, an application’s computational characteristics determine the parameters that affect performance most. Applications that frequently communicate data between machines are affected more by network and shuffle parameters. Applications with massively parallel computations are more sensitive to workload-parallelism and CPU-resource parameters. Applications that need to store large in-memory objects are more sensitive to the memory parameters. As a result, it is difficult to specify a single set of important parameters for all Spark applications.
2.2 Model-based Tuning
Previous research has proposed various machine learning techniques to predict software configuration performance and to tune configurations with those predictions. These techniques include tree-based regression models [17, 19, 22, 29, 72], support vector machines [47, 71], multivariate linear regression [59, 63], and neural networks [39, 46, 64]. Although these works demonstrate the potential of machine learning for accurate models, they neglect several practical issues.
First, training performance models requires a substantial amount of training data. We generate training data by sampling configurations from the space and measuring application performance under each one. This procedure is time-consuming because each application run takes minutes to hours. In this setting, reducing the number of configurations needed to train a performance model is essential. Most works neglect these costs and use simple random sampling to generate configurations for training.
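A minimal sketch of this data-collection procedure is shown below, assuming a hypothetical run_workload helper that launches the application under a given configuration and returns its execution time; the search space is a placeholder.

```python
import random

# Hypothetical search space: each parameter maps to its candidate values.
SPACE = {
    "spark.executor.cores": [1, 2, 4, 8],
    "spark.executor.memory": ["2g", "4g", "8g", "16g"],
    "spark.default.parallelism": [16, 64, 256, 1024],
    "spark.memory.fraction": [0.3, 0.5, 0.6, 0.75],
}

def sample_configuration(rng=random):
    """Draw one configuration uniformly at random from the space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def collect_training_data(num_samples, run_workload):
    """Measure `num_samples` random configurations.

    `run_workload(config) -> execution time in seconds` is assumed to launch
    the Spark application with `config` and time it.
    """
    dataset = []
    for _ in range(num_samples):
        config = sample_configuration()
        dataset.append((config, run_workload(config)))
    return dataset
```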
A few works propose more advanced techniques to improve sample efficiency. Latin-Hypercube Sampling covers the configuration space better than random sampling with the same number of sampled configurations [17, 19]. However, maximizing coverage of the configuration space does not necessarily identify the configurations most useful for reducing a model's prediction error. Adaptively selecting configurations that most improve model accuracy may help, but prior system studies are limited to specific classes of models, namely linear regression [63] and Delaunay triangulation [23], and neither class can effectively model performance from high-dimensional configurations.
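For illustration, Latin-Hypercube samples over a numeric subset of the parameter space could be drawn with SciPy's quasi-Monte Carlo module; the dimensions and bounds below are placeholders, not the study's actual search space.

```python
import numpy as np
from scipy.stats import qmc

# Placeholder: four numeric Spark parameters with lower/upper bounds, e.g.,
# executor.cores, executor.memory (GB), default.parallelism, memory.fraction.
l_bounds = [1, 2, 16, 0.3]
u_bounds = [8, 16, 1024, 0.75]

sampler = qmc.LatinHypercube(d=len(l_bounds), seed=0)
unit_samples = sampler.random(n=100)               # stratified points in [0, 1)^d
configs = qmc.scale(unit_samples, l_bounds, u_bounds)

# Each row of `configs` is one configuration to measure; the stratified design
# spreads samples evenly across every parameter's range.
print(configs[:3])
```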
Second, training requires specifying model hyperparameters, which determine the model's complexity and impact its prediction accuracy. Prior works are unable to specify hyperparameter values automatically. They rely on either manual tuning [17, 64, 72] or prior knowledge [63] to set hyperparameter values.
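For concreteness, the snippet below fixes hyperparameters by hand for two of the model classes discussed in this chapter; the values are illustrative rather than recommended settings.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Hyperparameters chosen manually: tree count and depth for the forest,
# layer sizes and regularization strength for the neural network.
forest = RandomForestRegressor(n_estimators=100, max_depth=8)
network = MLPRegressor(hidden_layer_sizes=(64, 32), alpha=1e-3, max_iter=2000)
```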
2.3 Error Decomposition in Performance Modeling
In machine learning, error decomposition provides insight into a performance model's expected generalization error. Given a workload's performance model, we can predict execution time \(\hat{y}\) and measure actual execution time \(y\) for a Spark configuration \(x\). The expected squared error \(\textrm{E}[(y-\hat{y})^{2}\,|\,x]\) can be decomposed as
\[
\textrm{E}[(y-\hat{y})^{2}\,|\,x] \;=\; \underbrace{\textrm{E}\big[(y-\textrm{E}[y\,|\,x])^{2}\,\big|\,x\big]}_{\text{noise}} \;+\; \underbrace{\textrm{E}\big[(\hat{y}-\textrm{E}[\hat{y}\,|\,x])^{2}\,\big|\,x\big]}_{\text{variance}} \;+\; \underbrace{\big(\textrm{E}[\hat{y}\,|\,x]-\textrm{E}[y\,|\,x]\big)^{2}}_{\text{bias}^{2}}.
\]
The first term is noise, the difference between measured and expected performance for a Spark configuration. The second term is variance, the difference between the model’s actual and expected prediction. The third term is bias, the difference between the expected prediction and the expected measurement. Because noise refers to the randomness inherent in the data generation procedure and is non-reducible from the perspective of performance modeling strategy, we focus on how to efficiently reduce bias and variance.
Variance accounts for prediction errors related to randomness in estimated model parameters, which in turn arises from randomness in the training procedure and data collection. First, model training often includes non-deterministic elements. Neural network training randomly initializes a model [28] and samples batch data [24]. Training a tree-based model randomly selects thresholds for splitting a node [35].
Second, randomness arises in system measurements and data collection. To generate a dataset for training, system architects sample randomly from the configuration space and measure each sample's performance. Models may over-fit to the performance of specific samples and therefore produce highly variable predictions when fitted with different training samples.
A formal definition of model variance places a distribution on the model parameters \(\theta\), conditional on the training data \(\mathbf{X}_{tr},\mathbf{Y}_{tr}\):
\[
\theta \sim p(\theta \,|\, \mathbf{X}_{tr},\mathbf{Y}_{tr}).
\]
For a given training set, we can construct such a distribution by training multiple instances of models with bootstrapped samples from the training data. For a given input \(x\), the prediction variance \(\textrm{Var}_{\theta}(y)\) is the variance of \(y\)'s estimate with respect to the variance in \(\theta\). For example, the variance in a simple linear model \(y={W}x\) is given by the variance of linear combinations:
\[
\textrm{Var}_{\theta}(y) = x^{\top}\,\textrm{Cov}(W)\,x.
\]
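A sketch of this bootstrap estimate follows, assuming a scikit-learn-style regressor and NumPy training arrays; it is an illustration of the idea, not the chapter's actual implementation.

```python
import numpy as np
from sklearn.base import clone

def prediction_variance(model, X_train, y_train, X_candidates, n_boot=20, seed=0):
    """Estimate per-input prediction variance via bootstrapped model instances."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        # Resample the training set with replacement and refit a fresh model copy.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        m = clone(model).fit(X_train[idx], y_train[idx])
        preds.append(m.predict(X_candidates))
    # Variance across the bootstrap models, one value per candidate configuration.
    return np.var(np.stack(preds), axis=0)
```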
In practice, performance models with higher prediction variance will make less accurate predictions. System architects reduce prediction variance by sampling more configurations for training.
Bias, on the other hand, describes errors that a model produces even when trained with ample data. This type of error reflects a model's inability to capture complex relationships in the data, usually due to an overly simple model structure. For example, a linear model cannot fit a cubic function regardless of training time and data. Model complexity is often determined by hyperparameters, such as regularization terms and model structure. Neural networks, regression splines, and tree models offer a large space of models defined by neural architectures, polynomial degrees, and tree depths, respectively, as shown in Table 2.
Bias versus Variance. For a given class of models, the choice of model structure and complexity trades bias against variance. A simpler model (e.g., lower-degree polynomials, fewer interaction terms, shallower neural networks) uses fewer model parameters and produces less variable predictions. But a simpler model's predictions may not become more accurate as the size of the training dataset increases; its simple structure cannot characterize the complicated interactions among configuration parameters, resulting in higher prediction bias.
In contrast, a complex model (e.g., higher-degree polynomials, more interaction terms, deeper neural networks) has more model parameters and produces better predictions when more training data becomes available. But its predictions might be unstable and sensitive to specific configurations due to over-fitting [4], which results in higher prediction variance.
Figure 1 compares errors from neural networks that predict performance for Spark's logistic regression workload. The networks differ in model structure and training dataset size. The smaller network, with layer sizes (16, 8), performs better when the number of sampled configurations is small (fewer than 1,000). But it cannot efficiently utilize more training data due to its simpler model structure and larger bias. The larger network suffers from high variance and cannot predict performance accurately when trained with small datasets. However, it gradually outperforms the smaller network as dataset size increases beyond 1,000 sampled configurations.
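The experiment behind Figure 1 can be outlined roughly as follows; the synthetic data and the larger architecture are stand-ins for the actual Spark measurements and models.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 12))                       # stand-in configurations
y = np.sin(X @ rng.uniform(size=12)) + 0.05 * rng.normal(size=5000)

X_test, y_test = X[4000:], y[4000:]
for n_train in (250, 1000, 4000):
    for layers in ((16, 8), (128, 64, 32)):            # small vs. large network
        model = MLPRegressor(hidden_layer_sizes=layers, max_iter=2000, random_state=0)
        model.fit(X[:n_train], y[:n_train])
        err = mean_squared_error(y_test, model.predict(X_test))
        print(n_train, layers, round(err, 5))
```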
Cross-validation and Generality. \(K\)-fold cross-validation [43] is a popular, successful method for evaluating a model structure's generalized prediction ability, accounting for both bias and variance. Cross-validation divides the data into \(K\) groups. For the \(k\)th group, it uses group \(k\) as test data and the remaining groups as training data. Cross-validation uses the average test score across the \(K\) groups to estimate the model structure's generalized prediction ability.
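In practice, this evaluation can be run with scikit-learn's cross_val_score; the random forest and its depth below are placeholder choices, not the models evaluated in Table 3.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_score(X, y, max_depth, k=10):
    """10-fold cross-validation score for one candidate model structure.

    X, y are measured configurations and execution times (NumPy arrays).
    The mean score across the K held-out folds estimates generalized accuracy.
    """
    model = RandomForestRegressor(n_estimators=100, max_depth=max_depth)
    return cross_val_score(model, X, y, cv=k, scoring="r2").mean()
```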
Our experiments show that cross-validation produces different evaluation results when the amount of training data changes. Table 3 shows 10-fold cross-validation scores as the model structure (for neural networks and random forests) and the amount of training data vary. When training data is relatively scarce (e.g., 50 to 150 configurations), cross-validation scores are higher with smaller models. When data is abundant, scores are higher with larger models. This insight, combined with observations from Figure 1, shows that simpler models are preferred when data is insufficient, because they better balance variance and bias.
2.4 Implications for Performance Modeling
Training performance models is expensive and requires measuring outcomes for varied configurations. For a given workload, the time to train performance models depends on the time required to evaluate a configuration and the number of configurations that must be evaluated. Reducing evaluation time is difficult and constrained by the time required to read and process the dataset. As a result, we seek to reduce the number of configurations needed to train an effective model.
For efficient data collection, the system architect should identify configurations for which model predictions are likely to be inaccurate. Although the architect has not measured the performance of an unknown configuration, he or she can estimate the model's prediction variance for that configuration. Intuitively, the performance model makes inaccurate predictions for configurations for which it has high prediction variance and thus low confidence in its predictions.
Figure 2 quantifies this intuition, comparing prediction variance and prediction error as neural networks predict the performance of the Spark logistic regression workload. Configurations with higher prediction variance are associated with higher prediction errors, and updating the model with performance data from these configurations is more likely to improve predictive accuracy. Thus, the system architect should identify configurations with high prediction variance to improve a model.
Active Learning. Active learning [58] applies this idea to collect the smallest possible dataset that achieves a target accuracy. This technique initializes performance models with a small set of uniformly sampled configurations for training. It then strategically acquires new data by measuring configurations with high prediction variance.
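A minimal active-learning loop along these lines might look like the sketch below; it reuses the bootstrap prediction_variance sketch from above and the hypothetical run_workload measurement helper, and is not the chapter's actual procedure.

```python
import numpy as np

def active_learning(model, X_pool, run_workload, n_init=20, n_rounds=10, batch=5):
    """Iteratively measure the pool configurations the model is least sure about."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_pool), size=n_init, replace=False)
    X_train = X_pool[idx]
    y_train = np.array([run_workload(x) for x in X_train])

    for _ in range(n_rounds):
        # Score every candidate by bootstrap prediction variance (defined earlier)
        # and measure the highest-variance configurations next.
        var = prediction_variance(model, X_train, y_train, X_pool)
        chosen = np.argsort(var)[-batch:]
        y_new = np.array([run_workload(x) for x in X_pool[chosen]])
        X_train = np.vstack([X_train, X_pool[chosen]])
        y_train = np.concatenate([y_train, y_new])
        # For simplicity, measured configurations remain in the candidate pool.

    return model.fit(X_train, y_train), X_train, y_train
```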
Figure 3 illustrates the benefits of active learning for three classes of machine learning models: neural networks, random forests, and regression splines. For each model class, the figure reports errors given the same number of measured configurations (i.e., training data). Active learning trains models with lower error by measuring configurations with higher prediction variance. It outperforms passive learning, which samples configurations uniformly at random. Active learning is particularly appealing when modeling distributed analytics applications, where data can be expensive to acquire, because it accurately predicts performance with fewer system profiles.
Dynamic Model Growth. Complementing data acquisition strategies that reduce variance, model growth strategies reduce bias. In statistical learning theory, the capacity and complexity of a model should grow linearly with the amount of training data in order to bound generalization error [32]. If updates to the model structure increase capacity too quickly, the model may over-fit the training data because of its high expressiveness and become sensitive to model variance. If updates increase capacity too slowly, the model may exhibit larger approximation uncertainty and bias between model predictions and true observations.
Dynamic modeling requires techniques that continuously collect data and periodically update not only model parameters but also model structure. The technique should start with a simple, minimally defined model architecture and dynamically grow the model as training data is collected.
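One simple realization, sketched below, re-selects the network architecture from a small ladder of capacities each time the training set grows; the candidate architectures and the growth threshold are heuristic placeholders, not the chapter's actual growth rule.

```python
from sklearn.neural_network import MLPRegressor

# Candidate architectures ordered from least to most expressive (illustrative).
ARCHITECTURES = [(16, 8), (64, 32), (128, 64, 32)]

def grow_model(X_train, y_train, configs_per_step=100):
    """Fit the largest architecture the current dataset size can support.

    The threshold of `configs_per_step` training points per capacity level is
    a placeholder heuristic, not a calibrated rule.
    """
    level = min(len(X_train) // configs_per_step, len(ARCHITECTURES) - 1)
    model = MLPRegressor(hidden_layer_sizes=ARCHITECTURES[level], max_iter=3000)
    return model.fit(X_train, y_train)
```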
Given our understanding of variance and bias, we propose Phronesis, a machine learning framework that accurately predicts performance with less data. It strategically collects training data with active learning and cautiously grows models to increase their capacity for accurate predictions.