DOI: 10.1145/2742854.2747283
research-article

SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark

Published: 06 May 2015

Abstract

Spark has been increasingly adopted by industry in recent years for big data analysis because it provides a fault-tolerant, scalable, and easy-to-use in-memory abstraction. Moreover, the community has been actively developing a rich ecosystem around Spark, making it even more attractive. However, the literature does not yet offer a Spark-specific benchmark to guide the development and cluster deployment of Spark so that it better fits the resource demands of user applications. In this paper, we present SparkBench, a Spark-specific benchmarking suite that includes a comprehensive set of applications. SparkBench covers four main categories of applications: machine learning, graph computation, SQL queries, and streaming applications. We also characterize the resource consumption, data flow, and timing information of each application, and we evaluate the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytics platform.


Published In

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers
May 2015
413 pages
ISBN: 9781450333580
DOI: 10.1145/2742854

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. benchmarking
  2. cloud computing
  3. evaluation
  4. in memory data analytics
  5. spark

Qualifiers

  • Research-article

Conference

CF'15: Computing Frontiers Conference
May 18 - 21, 2015
Ischia, Italy

Acceptance Rates

CF '15 Paper Acceptance Rate: 33 of 96 submissions, 34%
Overall Acceptance Rate: 273 of 785 submissions, 35%
