DOI: 10.1145/1559845.1559865

A comparison of approaches to large-scale data analysis

Published: 29 June 2009

Abstract

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
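
The abstract's claim that MR's basic control flow already exists in SQL systems can be illustrated with a toy aggregation written both ways. The sketch below is not the paper's benchmark code. It runs in a single process rather than on a cluster, and the record layout, the table name `visits`, and the function names are all hypothetical choices made for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy input: (source_ip, ad_revenue) records.
RECORDS = [("10.0.0.1", 5), ("10.0.0.2", 3), ("10.0.0.1", 2)]


def map_fn(record):
    """Map step: emit one (key, value) pair per input record."""
    ip, revenue = record
    yield ip, revenue


def reduce_fn(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(records):
    """Single-process stand-in for the map, group-by-key, reduce flow."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # grouping by key ("shuffle")
    return dict(reduce_fn(key, values) for key, values in groups.items())


def run_sql(records):
    """The same aggregation as a declarative GROUP BY query (SQLite, in memory)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(run_mapreduce(RECORDS))
    print(run_sql(RECORDS))
```

Both functions return the same per-key totals. In the MR version the grouping and combining logic is written by hand, whereas in the SQL version it is stated declaratively and left to the engine.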

References

[1] Hadoop. http://hadoop.apache.org/.
[2] Hive. http://hadoop.apache.org/hive/.
[3] Vertica. http://www.vertica.com/.
[4] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical report, 1998.
[5] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008.
[6] Cisco Systems. Cisco Catalyst 3750-E Series Switches Data Sheet, June 2008.
[7] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. Under Submission, March 2009.
[8] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, pages 10--10, 2004.
[9] D. J. DeWitt and R. H. Gerber. Multiprocessor Hash-based Join Algorithms. In VLDB '85, pages 151--164, 1985.
[10] D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In VLDB '86, pages 228--237, 1986.
[11] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine. In VLDB '86, pages 209--219, 1986.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29--43, 2003.
[13] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys '07, pages 59--72, 2007.
[14] E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In SIGMOD '06, pages 706--706, 2006.
[15] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD '08, pages 1099--1110, 2008.
[16] J. Ong, D. Fogg, and M. Stonebraker. Implementation of data abstraction in the relational database system ingres. SIGMOD Rec., 14(1):1--14, 1983.
[17] D. A. Patterson. Technical Perspective: The Data Center is the Computer. Commun. ACM, 51(1):105--105, 2008.
[18] R. Rustin, editor. ACM--SIGMOD Workshop on Data Description, Access and Control, May 1974.
[19] M. Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4--9, 1986.
[20] M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. In Readings in Database Systems, pages 2--41. The MIT Press, 4th edition, 2005.
[21] D. Thomas, D. Hansson, L. Breedt, M. Clark, J. D. Davidson, J. Gehtland, and A. Schwarz. Agile Web Development with Rails. Pragmatic Bookshelf, 2006.


Information & Contributors

Information

Published In

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN:9781605585512
DOI:10.1145/1559845
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. benchmarks
  2. mapreduce
  3. parallel database

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '09: International Conference on Management of Data
June 29 - July 2, 2009
Providence, Rhode Island, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 326
  • Downloads (Last 6 weeks): 44
Reflects downloads up to 10 Nov 2024

Citations

Cited By

  • (2024) High-Performance Spatial Data Analytics: Systematic R&D for Scale-Out and Scale-Up Solutions from the Past to Now. Proceedings of the VLDB Endowment, 17:12 (4507-4520). DOI: 10.14778/3685800.3685912. Online publication date: 1-Aug-2024
  • (2024) Towards a Hierarchical Exascale Framework for Iterative Parallel Data Analysis Algorithms. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (293-296). DOI: 10.1109/PDP62718.2024.00049. Online publication date: 20-Mar-2024
  • (2024) Lossy Compression of Adjacency Matrices by Graph Filter Banks. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (9386-9390). DOI: 10.1109/ICASSP48485.2024.10448045. Online publication date: 14-Apr-2024
  • (2024) A Linear Combination-Based Method to Construct Proxy Benchmarks for Big Data Workloads. Benchmarking, Measuring, and Optimizing (120-136). DOI: 10.1007/978-981-97-0316-6_8. Online publication date: 14-Feb-2024
  • (2023) JQPro: Join Query Processing in a Distributed System for Big RDF Data Using the Hash-Merge Join Technique. Mathematics, 11:5 (1275). DOI: 10.3390/math11051275. Online publication date: 6-Mar-2023
  • (2023) Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. Proceedings of the VLDB Endowment, 16:10 (2578-2590). DOI: 10.14778/3603581.3603596. Online publication date: 1-Jun-2023
  • (2023) CoTel: Ontology-Neural Co-Enhanced Text Labeling. Proceedings of the ACM Web Conference 2023 (1897-1906). DOI: 10.1145/3543507.3583533. Online publication date: 30-Apr-2023
  • (2023) Scheduling distributed multiway spatial join queries: optimization models and algorithms. International Journal of Geographical Information Science, 37:6 (1388-1419). DOI: 10.1080/13658816.2023.2170380. Online publication date: 6-Feb-2023
  • (2023) Artificial intelligence inspired IoT-fog based framework for generating early alerts while train passengers traveling in dangerous states using surveillance videos. Multimedia Tools and Applications, 83:5 (13613-13635). DOI: 10.1007/s11042-023-16107-0. Online publication date: 7-Jul-2023
  • (2023) Pattern-Preserved Normalization Enabled User Profiling. Smart Grid and Innovative Frontiers in Telecommunications (331-341). DOI: 10.1007/978-3-031-31733-0_28. Online publication date: 26-May-2023
