CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks
Figure 1. Source code of logistic regression on Spark.
Figure 2. Internal representation of execution flow and task execution for blocks in Spark: x_y in blocks denotes dataset id (x) and partition id (y).
Figure 3. Execution time of logistic regression on 7 different caching decisions.
Figure 4. Part of KMeans's job directed acyclic graphs (DAGs) and example of DAG clustering.
Figure 5. Architecture of cost-capacity-aware caching (CCA).
Figure 6. Predicted caching benefit normalized to actual caching benefit from the cluster measured on three sizes of input data.
Figure 7. Prediction accuracy of two models according to the input data size for evaluation.
Figure 8. Comparison of three methods on sufficient cache memory.
Figure 9. Comparison of three methods on reduced cache memory.
Figure 10. Comparison between the proposed CCA and other cache memory management techniques.
Abstract
1. Introduction
2. Motivation
2.1. Application Code Analysis of In-Memory Data Analytics Framework
2.2. Execution Flow and Cached Dataset
2.3. Task-Level vs. Operator-Level Timing Metrics
2.4. Memory Pressure and Performance
3. Design Decisions
3.1. Cost Model and Caching Benefit
3.2. Spark Implementation
3.3. Caching Decision Algorithm
Algorithm 1 A recursive algorithm for DAG clustering
Input: iter—map that stores the iteration count of the corresponding dataset or cluster
root—top node in DAG
Output: clusters—a set of clustered nodes
1: function clustering(root)
2:   ▹ Recursively traverse all nodes in DAG
3:   cluster ← {root}
4:   descs ← descendants(root)
5:   while descs ≠ ∅ do
6:     desc ← descs.pop()
7:     if iter[desc] == iter[cluster] then
8:       descs ← descs ∪ descendants(desc)
9:       cluster ← cluster ∪ {desc}
10:    else
11:      clusters ← clusters ∪ {cluster}
12:      clusters ← clusters ∪ clustering(desc)
13:      cluster ← ∅
14:    end if
15:   end while
16:   return clusters
17: end function
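To make the clustering step concrete, the traversal can be sketched in Python. This is an illustration of the idea in Algorithm 1, not code from the paper: the `Node` class, the `iter_count` map, and the assumption that the DAG is tree-shaped (each node reached from a single parent) are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a dataset node in a job DAG."""
    name: str
    children: list = field(default_factory=list)

def clustering(iter_count, root):
    """Group DAG nodes into clusters of nodes that share the root's
    iteration count; a node with a different count starts its own
    recursively clustered subtree (cf. Algorithm 1)."""
    clusters = []
    cluster = [root]
    descs = list(root.children)
    while descs:
        desc = descs.pop()
        if iter_count[desc] == iter_count[root]:
            # Same iteration count: absorb the node into the current
            # cluster and keep traversing through its descendants.
            descs.extend(desc.children)
            cluster.append(desc)
        else:
            # Different iteration count: close off this branch and
            # cluster the subtree rooted at `desc` separately.
            clusters.extend(clustering(iter_count, desc))
    clusters.append(cluster)
    return clusters
```

For a KMeans-like chain A → B → C where A and B iterate 10 times but C runs once, A and B end up in one cluster and C in another, mirroring the DAG clustering example of Figure 4.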
Algorithm 2 A baseline algorithm for making a caching decision
Input: M—size of total cache memory
U—size of used cache memory
benefit—map that stores caching benefit of the corresponding cluster
clusters—a set of clustered nodes
Output: caches—a set of candidates to cache
1: function make_decision
2:   update(benefit)
3:   caches ← ∅
4:   for all cluster in clusters do
5:     if U ≤ M then
6:       caches ← caches ∪ {cluster}
7:       update(benefit)
8:     end if
9:   end for
10:  return caches
11: end function
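A minimal Python sketch of the greedy decision in Algorithm 2, again illustrative rather than the paper's implementation: the `size` map is a hypothetical stand-in for a cluster's estimated memory footprint, and benefits are treated as static here instead of being re-estimated by `update(benefit)` after each pick.

```python
def make_decision(M, U, benefit, size, clusters):
    """Greedily select clusters to cache, highest caching benefit first,
    as long as the used cache memory stays within the capacity M."""
    caches = []
    used = U
    # Visit clusters in descending order of estimated caching benefit.
    for cluster in sorted(clusters, key=lambda c: benefit[c], reverse=True):
        if used + size[cluster] <= M:
            caches.append(cluster)
            used += size[cluster]
    return caches
```

The capacity check is what makes the decision capacity-aware: a high-benefit cluster that no longer fits is skipped, and the remaining memory can still be spent on smaller, lower-benefit clusters.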
4. Evaluation Methodology
5. Results
5.1. Prediction Accuracy of Caching Benefit
5.2. Prediction Accuracy of Cost Model
5.3. Performance on Sufficient Cache
5.4. Performance on Reduced Cache
5.5. Comparison with LCS
6. Related Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- De Arriba-Pérez, F.; Caeiro-Rodríguez, M.; Santos-Gago, J.M. Collection and Processing of Data from Wrist Wearable Devices in Heterogeneous and Multiple-User Scenarios. Sensors 2016, 16, 1538.
- Apache Spark. Available online: https://spark.apache.org/ (accessed on 2 February 2021).
- Apache Tez. Available online: https://tez.apache.org/ (accessed on 25 February 2021).
- Apache Storm. Available online: http://storm.apache.org/ (accessed on 25 February 2021).
- Shinnar, A.; Cunningham, D.; Saraswat, V.; Herta, B. M3R: Increased Performance for In-memory Hadoop Jobs. Proc. VLDB Endow. 2012, 5, 1736–1747.
- Tellez, A.; Pumperla, M.; Malohlava, M. Mastering Machine Learning with Spark 2.x; Chapter: Detecting Dark Matter—The Higgs-Boson Particle; Packt Publishing Ltd.: Birmingham, UK, 2017.
- Ryza, S.; Laserson, U.; Owen, S.; Wills, J. Advanced Analytics with Spark: Patterns for Learning from Data at Scale; Chapter: Introduction to Data Analysis with Scala and Spark; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
- Apache Hadoop. Available online: https://hadoop.apache.org/ (accessed on 25 February 2021).
- Isard, M.; Budiu, M.; Yu, Y.; Birrell, A.; Fetterly, D. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, Lisbon, Portugal, 21–23 March 2007.
- Wang, L. Directed Acyclic Graph. In Encyclopedia of Systems Biology; Dubitzky, W., Wolkenhauer, O., Cho, K.H., Yokota, H., Eds.; Springer: New York, NY, USA, 2013.
- Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010.
- Amazon S3. Available online: https://aws.amazon.com/s3/ (accessed on 25 February 2021).
- Li, H.; Ghodsi, A.; Zaharia, M.; Shenker, S.; Stoica, I. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, Seattle, WA, USA, 3–5 November 2014.
- Zaharia, M.; Chowdhury, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, USA, 22 June 2010.
- Yu, Y.; Wang, W.; Zhang, J.; Ben Letaief, K. LRC: Dependency-Aware Cache Management for Data Analytics Clusters. In Proceedings of the IEEE INFOCOM 2017—IEEE Conference on Computer Communications, Atlanta, GA, USA, 1–4 May 2017.
- Geng, Y.; Shi, X.; Pei, C.; Jin, H.; Jiang, W. LCS: An Efficient Data Eviction Strategy for Spark. Int. J. Parallel Program. 2017, 45, 1285–1297.
- Xuan, P.; Luo, F.; Ge, R.; Srimani, P.K. Dynamic Management of In-memory Storage for Efficiently Integrating Compute- and Data-intensive Computing on HPC Systems. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid '17, Madrid, Spain, 14–17 May 2017.
- Jeong, M.; Park, S.; Han, H. Caching Cost Model for In-memory Data Analytics Framework. In Proceedings of the 9th International Conference on Smart Media and Applications (SMA), Jeju Island, Korea, 17–19 September 2020.
- Gerbessiotis, A.; Valiant, L. Direct Bulk-Synchronous Parallel Algorithms. J. Parallel Distrib. Comput. 1994, 22, 251–267.
- Ferguson, A.D.; Bodik, P.; Kandula, S.; Boutin, E.; Fonseca, R. Jockey: Guaranteed Job Latency in Data Parallel Clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, Bern, Switzerland, 10–13 April 2012.
- Huang, S.; Huang, J.; Dai, J.; Xie, T.; Huang, B. The HiBench Benchmark Suite: Characterization of the MapReduce-based Data Analysis. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), Long Beach, CA, USA, 1–6 March 2010.
- Intel HiBench Suite. Available online: https://github.com/Intel-bigdata/HiBench (accessed on 25 February 2021).
- Meng, X.; Bradley, J.; Yavuz, B.; Sparks, E.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.; Amde, M.; Owen, S.; et al. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. 2016, 17, 1235–1241.
- Xin, R.S.; Crankshaw, D.; Dave, A.; Gonzalez, J.E.; Franklin, M.J.; Stoica, I. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. arXiv 2014, arXiv:1402.2394.
- Apache Spark-2.1.0 Configuration. Available online: http://spark.apache.org/docs/2.1.0/configuration.html (accessed on 25 February 2021).
- Castro Fernandez, R.; Culhane, W.; Watcharapichat, P.; Weidlich, M.; Lopez Morales, V.; Pietzuch, P. Meta-Dataflows: Efficient Exploratory Dataflow Jobs. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, Houston, TX, USA, 10–15 June 2018.
- Apache Flink. Available online: https://flink.apache.org/ (accessed on 25 February 2021).
- Gottin, V.M.; Pacheco, E.; Dias, J.; Ciarlini, A.E.M.; Costa, B.; Vieira, W.; Souto, Y.M.; Pires, P.; Porto, F.; Rittmeyer, J.A.G. Automatic Caching Decision for Scientific Dataflow Execution in Apache Spark. In Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR '18, Houston, TX, USA, 15 June 2018.
- Perez, T.B.G.; Zhou, X.; Cheng, D. Reference-Distance Eviction and Prefetching for Cache Management in Spark. In Proceedings of the 47th International Conference on Parallel Processing (ICPP), Eugene, OR, USA, 13–16 August 2018.
- Xu, E.; Saxena, M.; Chiu, L. Neutrino: Revisiting Memory Caching for Iterative Data Analytics. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16), Denver, CO, USA, 20–21 June 2016.
- Xu, L.; Li, M.; Zhang, L.; Butt, A.R.; Wang, Y.; Hu, Z.Z. MEMTUNE: Dynamic Memory Management for In-Memory Data Analytic Platforms. In Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, USA, 23–27 May 2016.
- Yu, Y.; Wang, W.; Zhang, J.; Weng, Q.; Ben Letaief, K. OpuS: Fair and Efficient Cache Sharing for In-Memory Data Analytics. In Proceedings of the 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2–6 July 2018; pp. 154–164.
Notation | Meaning
---|---
n | Number of blocks in the dataset
m | Number of operators used in the stage
a | Nearest cached ancestor in the DAG
S | Stage execution time
  | ith operator
  | Dataset generated by the ith operator
  | Data block of the dataset
  | Computing time of a block
  | Total computing time of blocks on the dataset
  | Estimated computing cost of the dataset
  | Number of iterations for the dataset
  | Benefit from caching the dataset
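Reading the notation above, one simple form the caching benefit could take is that caching saves the recomputation cost on every reuse after the first. This is an illustrative sketch only, not the paper's exact model; the symbols N_i, E_i, and B_i are hypothetical stand-ins for the table's iteration count, estimated computing cost, and benefit entries.

```latex
% Illustrative sketch: if dataset D_i is needed in N_i iterations at an
% estimated computing cost E_i per computation, caching it after the
% first computation avoids the remaining N_i - 1 recomputations.
B_i = (N_i - 1) \times E_i
```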
Hardware (Node Specification) |
---|---
CPU | Intel Xeon E5-2640 v3 × 2
RAM | 128 GB
Storage | 1.5 TB NVMe SSD
Network | Mellanox MT27520 56GbE
Spark Configuration: 1 master, 2 workers; 10 executors; 50 cores
Workload | Input Size (GB) (10x) | Sufficient Cache: Provided (GB), CCA * & CAO | Sufficient Cache: Max Required (GB) | Reduced Cache: Provided (GB), CCA * | Reduced Cache: Provided (GB), CAO | Reduced Cache: Max Required (GB)
---|---|---|---|---|---|---
Alternating Least Squares (ALS) | 3.00 | 117.36 | 25.96 | 12.98 | 11.85 | 25.96
K-means clustering (KM) | 18.70 | 117.36 | 40.22 | 20.11 | 19.74 | 40.22
Support Vector Machine (SVM) | 18.63 | 117.36 | 40.97 | 20.49 | 11.17 | 40.97
Logistic Regression (LR) | 22.4 | 117.36 | 49.20 | 24.60 | 22.35 | 49.20
Bayesian Classification (Bayes) | 21.02 | 117.36 | 43.79 | 21.89 | 21.34 | 43.79
Linear Regression (LinR) | 44.8 | 117.36 | 80.20 | 40.10 | 40.10 | 80.20
Random Forest (RF) | 14.80 | 117.36 | 34.72 | 17.36 | 14.90 | 34.72
Latent Dirichlet Allocation (LDA) | 2.10 | 117.36 | 4.97 | 2.49 | 2.46 | 4.97
Principal Components Analysis (PCA) | 0.28 | 117.36 | 0.67 | 0.33 | 0.33 | 0.67
Gradient Boosting Trees (GBT) | 0.30 | 117.36 | 2.44 | 1.22 | 1.11 | 2.54
Singular Value Decomposition (SVD) | 5.00 | 117.36 | 5.00 | 2.50 | 0.00 | 5.00
PageRank (PR) | 4.00 | 117.36 | 16.89 | 8.45 | 0.00 | 16.89
NWeight (NW) | 0.70 | 117.36 | 3.02 | 1.51 | 1.38 | 3.02
TeraSort (TS) | 40.00 | 117.36 | 40.00 | 20.00 | 0.00 | 40.00
Method | Execution Flow-Aware | Operator Cost-Aware | Memory Capacity-Aware | Optimizing Execution Flow
---|---|---|---|---
MDF | O | X | X | O
S-CACHE | O | O | X | O
LRC | O | X | O | X
MRD | O | X | O | X
Neutrino | O | X | O | X
MemTune | X | X | O | X
LCS | O | O | O | X
CCA (proposed) | O | O | O | O
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Park, S.; Jeong, M.; Han, H. CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks. Sensors 2021, 21, 2321. https://doi.org/10.3390/s21072321