Clash of the titans: MapReduce vs. Spark for large scale data analytics

Published: 01 September 2015

Abstract

MapReduce and Spark are two very popular open-source cluster computing frameworks for large-scale data analytics. These frameworks hide the complexity of task parallelism and fault tolerance by exposing a simple programming API to users. In this paper, we evaluate the major architectural components of the MapReduce and Spark frameworks, including the shuffle, the execution model, and caching, using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) we correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) we provide a breakdown of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark, and we attribute these differences to components that are architected differently in the two frameworks. We further expose the source of these performance differences using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for the combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. We show that MapReduce's execution model is more efficient than Spark's for shuffling data, thus making Sort run faster on MapReduce.
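The hash-based aggregation for combine, which the abstract credits for much of Spark's speedup, can be illustrated with a minimal, framework-free sketch in plain Python (the function names and sample data are illustrative and belong to neither framework's API): map output is folded into a hash table before the shuffle, so only one record per distinct key needs to cross the network.

```python
from collections import defaultdict

def map_words(lines):
    # Map phase: emit a (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def hash_combine(pairs):
    # Hash-based combine: partial sums are accumulated in an
    # in-memory hash table, so each distinct key is shuffled once
    # instead of once per occurrence.
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

lines = ["big data big compute", "big data"]
combined = hash_combine(map_words(lines))
# combined == {"big": 3, "data": 2, "compute": 1}
```

In the real frameworks this corresponds roughly to Spark's map-side aggregation and Hadoop's optional combiner; the sketch shows only the shared idea of collapsing duplicate keys before the shuffle.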




    Reviews

    Tope Omitola

    Analyzing big data has never been more important in computer science, and the platforms of choice have been either MapReduce or Spark. Using selected workloads that characterize the majority of batch and iterative analytic operations (word count, sort, k-means, linear regression, and PageRank), this paper analyzes the performance differences between these two platforms. Although MapReduce is designed for batch jobs and Spark for iterative jobs, it is noted that in the field they are used for both job types. The authors find that Spark is 2.5 to 5 times faster than MapReduce on the majority of these workloads (the only exception is sort). These results are not so surprising given the key architectural decisions made by the two platforms. The paper is replete with the configuration parameters of the experiments (hardware, software, and profilers). These parameters are useful for system administrators who want to understand a platform's behavior under different configurations. We also learn that since the majority of big data analytic workloads are central processing unit (CPU)-bound, both platforms scale with the number of CPU cores available to them. System developers can use the knowledge gleaned from this paper to improve the architecture and implementation of Spark and MapReduce, and of the applications running on both platforms. The explanations of the experimental results are very good: they further the understanding of how architecture and working assumptions affect system performance, and they also explain some of the inner workings of the platforms. For example, we learn that as the number of "reduce" tasks is increased, the execution time of the "map" stage increases. If you want to understand the pros and cons of MapReduce and Spark, and when and how to use them, this paper is a good place to start.

    Online Computing Reviews Service
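The review's closing observation, that increasing the number of reduce tasks lengthens the map stage, follows from how map output is partitioned. A minimal Python sketch of that partitioning (the function name and data are illustrative, not either framework's API): each map task must maintain one output bucket per reduce task, so its bookkeeping grows with the reducer count.

```python
def partition_map_output(pairs, num_reducers):
    # Each map task splits its output into one bucket per reduce
    # task via a hash partitioner; with R reducers every map task
    # manages R partitions, which is one reason map-stage overhead
    # grows as R is increased.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

pairs = [("a", 1), ("b", 1), ("a", 1)]
buckets = partition_map_output(pairs, 4)
# All records sharing a key land in the same bucket, so a single
# reduce task sees every value for that key.
```

This is the partitioning idea only; the actual frameworks additionally sort, spill, and serve these partitions, which is where the shuffle costs the paper compares arise.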


    Information

    Published In

    Proceedings of the VLDB Endowment  Volume 8, Issue 13
    Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
    September 2015
    144 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 September 2015
    Published in PVLDB Volume 8, Issue 13

    Qualifiers

    • Research-article

    Cited By

    • (2024) SimCost: cost-effective resource provision prediction and recommendation for spark workloads. Distributed and Parallel Databases 42:1, 73-102. DOI: 10.1007/s10619-023-07436-y. Published online: 1-Mar-2024.
    • (2023) Lifting the Fog of Uncertainties. Proceedings of the 2023 ACM Symposium on Cloud Computing, 48-64. DOI: 10.1145/3620678.3624646. Published online: 30-Oct-2023.
    • (2023) NIDS-VSB: Network Intrusion Detection System for VANET Using Spark-Based Big Data Optimization and Transfer Learning. IEEE Transactions on Consumer Electronics 70:1, 1798-1809. DOI: 10.1109/TCE.2023.3328320. Published online: 30-Oct-2023.
    • (2023) Performance models of data parallel DAG workflows for large scale data analytics. Distributed and Parallel Databases 41:3, 299-329. DOI: 10.1007/s10619-023-07425-1. Published online: 23-May-2023.
    • (2022) Scheduling Fork-Join Task Graphs with Communication Delays and Equal Processing Times. Proceedings of the 51st International Conference on Parallel Processing, 1-9. DOI: 10.1145/3545008.3545036. Published online: 29-Aug-2022.
    • (2021) Lachesis. Proceedings of the VLDB Endowment 14:8, 1262-1275. DOI: 10.14778/3457390.3457392. Published online: 21-Oct-2021.
    • (2021) A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark. ACM Transactions on Privacy and Security 25:1, 1-25. DOI: 10.1145/3484945. Published online: 23-Nov-2021.
    • (2021) Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification. Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing, 15-20. DOI: 10.1145/3481646.3481649. Published online: 13-Aug-2021.
    • (2021) ROBOTune: High-Dimensional Configuration Tuning for Cluster-Based Data Analytics. Proceedings of the 50th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3472456.3472518. Published online: 9-Aug-2021.
    • (2021) Screening hardware and volume factors in distributed machine learning algorithms on spark. Computing 103:10, 2203-2225. DOI: 10.1007/s00607-021-00965-3. Published online: 1-Oct-2021.
