Clash of the titans: MapReduce vs. Spark for large scale data analytics

Published: 01 September 2015

Abstract

MapReduce and Spark are two very popular open-source cluster computing frameworks for large-scale data analytics. These frameworks hide the complexity of task parallelism and fault tolerance by exposing a simple programming API to users. In this paper, we evaluate the major architectural components of the MapReduce and Spark frameworks, including the shuffle, the execution model, and caching, using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) we correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) we provide a breakdown of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark, and we attribute these differences to components that are architected differently in the two frameworks. We further expose the source of these performance differences using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for the combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. We show that MapReduce's execution model is more efficient than Spark's for shuffling data, thus making Sort run faster on MapReduce.
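The hash-based aggregation for combine, which the abstract credits for much of Spark's speedup, can be illustrated with a minimal, framework-free sketch in plain Python (the function names and sample data are illustrative and belong to neither framework's API): map output is folded into a hash table before the shuffle, so only one record per distinct key needs to cross the network.

```python
from collections import defaultdict

def map_words(lines):
    # Map phase: emit a (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def hash_combine(pairs):
    # Hash-based combine: partial sums are accumulated in an
    # in-memory hash table, so each distinct key is shuffled once
    # instead of once per occurrence.
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

lines = ["big data big compute", "big data"]
combined = hash_combine(map_words(lines))
# combined == {"big": 3, "data": 2, "compute": 1}
```

In the real frameworks this corresponds roughly to Spark's map-side aggregation and Hadoop's optional combiner; the sketch shows only the shared idea of collapsing duplicate keys before the shuffle.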




    Reviews

    Tope Omitola

    Analyzing big data has never been more important in computer science, and the platforms of choice have been either MapReduce or Spark. Using selected workloads that characterize the majority of batch and iterative analytic operations (word count, sort, k-means, linear regression, and PageRank), this paper analyzes the performance differences between these two platforms. Although MapReduce is designed for batch jobs and Spark for iterative jobs, it is noted that in the field they are used for both job types. The authors find that Spark is 2.5 to 5 times faster than MapReduce on the majority of these workloads (the only exception is sort). These results are not so surprising given the key architectural decisions made by the two platforms. The paper is replete with the configuration parameters of the experiments (hardware, software, and profilers). These parameters are useful for system administrators who want to understand a platform's behavior under different configurations. We also learn that since the majority of big data analytic workloads are central processing unit (CPU)-bound, both platforms scale with the number of CPU cores available to them. System developers can use the knowledge gleaned from this paper to improve the architecture and implementation of Spark and MapReduce, and of the applications running on both platforms. The explanations of the experimental results are very good: they further the understanding of how architecture and working assumptions affect system performance, and they also explain some of the inner workings of the platforms. For example, we learn that as the number of "reduce" tasks is increased, the execution time of the "map" stage increases. If you want to understand the pros and cons of MapReduce and Spark, and when and how to use them, this paper is a good place to start.

    Online Computing Reviews Service
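The review's closing observation, that increasing the number of reduce tasks lengthens the map stage, follows from how map output is partitioned. A minimal Python sketch of that partitioning (the function name and data are illustrative, not either framework's API): each map task must maintain one output bucket per reduce task, so its bookkeeping grows with the reducer count.

```python
def partition_map_output(pairs, num_reducers):
    # Each map task splits its output into one bucket per reduce
    # task via a hash partitioner; with R reducers every map task
    # manages R partitions, which is one reason map-stage overhead
    # grows as R is increased.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

pairs = [("a", 1), ("b", 1), ("a", 1)]
buckets = partition_map_output(pairs, 4)
# All records sharing a key land in the same bucket, so a single
# reduce task sees every value for that key.
```

This is the partitioning idea only; the actual frameworks additionally sort, spill, and serve these partitions, which is where the shuffle costs the paper compares arise.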


    Information

    Published In

    Proceedings of the VLDB Endowment  Volume 8, Issue 13
    Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
    September 2015
    144 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 September 2015
    Published in PVLDB Volume 8, Issue 13

    Qualifiers

    • Research-article

    Cited By

    • (2024) SimCost: cost-effective resource provision prediction and recommendation for spark workloads. Distributed and Parallel Databases 42:1, 73-102. DOI: 10.1007/s10619-023-07436-y. Published online: 1-Mar-2024.
    • (2023) Lifting the Fog of Uncertainties. Proceedings of the 2023 ACM Symposium on Cloud Computing, 48-64. DOI: 10.1145/3620678.3624646. Published online: 30-Oct-2023.
    • (2023) NIDS-VSB: Network Intrusion Detection System for VANET Using Spark-Based Big Data Optimization and Transfer Learning. IEEE Transactions on Consumer Electronics 70:1, 1798-1809. DOI: 10.1109/TCE.2023.3328320. Published online: 30-Oct-2023.
    • (2023) Performance models of data parallel DAG workflows for large scale data analytics. Distributed and Parallel Databases 41:3, 299-329. DOI: 10.1007/s10619-023-07425-1. Published online: 23-May-2023.
    • (2022) Scheduling Fork-Join Task Graphs with Communication Delays and Equal Processing Times. Proceedings of the 51st International Conference on Parallel Processing, 1-9. DOI: 10.1145/3545008.3545036. Published online: 29-Aug-2022.
    • (2021) Lachesis. Proceedings of the VLDB Endowment 14:8, 1262-1275. DOI: 10.14778/3457390.3457392. Published online: 21-Oct-2021.
    • (2021) A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark. ACM Transactions on Privacy and Security 25:1, 1-25. DOI: 10.1145/3484945. Published online: 23-Nov-2021.
    • (2021) Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification. Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing, 15-20. DOI: 10.1145/3481646.3481649. Published online: 13-Aug-2021.
    • (2021) ROBOTune: High-Dimensional Configuration Tuning for Cluster-Based Data Analytics. Proceedings of the 50th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3472456.3472518. Published online: 9-Aug-2021.
    • (2021) Screening hardware and volume factors in distributed machine learning algorithms on spark. Computing 103:10, 2203-2225. DOI: 10.1007/s00607-021-00965-3. Published online: 1-Oct-2021.
