research-article

Shark: SQL and rich analytics at scale

Authors:

Reynold S. Xin,

Michael J. Franklin,

Ion Stoica

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 13 - 24

https://doi.org/10.1145/2463676.2465288

Published: 22 June 2013

Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and that efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100x faster than Apache Hive, and machine learning programs more than 100x faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine and the fine-grained fault tolerance properties that such an engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to execute SQL effectively. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
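To make two ideas from the abstract concrete, the following is a minimal sketch, not Shark's actual implementation. It shows column-oriented in-memory storage (each column kept in its own packed array, so an aggregate scans only the column it needs) and recovery of a lost in-memory partition by re-running the function that produced it. The two-column schema and the names `ColumnarPartition`, `RecomputableTable`, and `evict` are hypothetical and chosen only for illustration.

```python
from array import array
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, float]  # (user_id, revenue): a hypothetical two-column schema


class ColumnarPartition:
    """One in-memory partition that stores each column in its own packed array."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.user_id = array("q")  # signed 64-bit integers
        self.revenue = array("d")  # 64-bit floats
        for uid, rev in rows:
            self.user_id.append(uid)
            self.revenue.append(rev)

    def sum_revenue(self) -> float:
        # The aggregate reads only the revenue array; user_id is never touched.
        return sum(self.revenue)


class RecomputableTable:
    """Remembers how each partition was built so a lost one can be rebuilt."""

    def __init__(self, loaders: List[Callable[[], List[Row]]]) -> None:
        self.loaders = loaders  # one deterministic loader per partition
        self.cache: Dict[int, ColumnarPartition] = {}

    def partition(self, i: int) -> ColumnarPartition:
        # Rebuild the partition from its loader if it is not in memory.
        if i not in self.cache:
            self.cache[i] = ColumnarPartition(self.loaders[i]())
        return self.cache[i]

    def evict(self, i: int) -> None:
        # Simulates losing one partition, for example when a worker fails.
        self.cache.pop(i, None)

    def total_revenue(self) -> float:
        return sum(self.partition(i).sum_revenue() for i in range(len(self.loaders)))


# Illustrative data only: two partitions built from in-line rows.
table = RecomputableTable([
    lambda: [(1, 10.0), (2, 5.5)],
    lambda: [(3, 4.5)],
])
before = table.total_revenue()
table.evict(1)                  # lose partition 1 between two queries
after = table.total_revenue()   # partition 1 is rebuilt on demand
```

In this sketch only the evicted partition is rebuilt; the other partition stays cached, which is the sense in which recovery is fine-grained here.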

Cited By

Xue M, Chen S, Lam A, Li Y, Bu Y, van Hovell H, Ma Y, Li X, Paranjpye S, Somani A, Samwel B, Ercegovac V, Krishnamurthy S, Xin R, Fan W, Mokhtar M, Li J, Shukla A, Zaharia M, Liu Z, Korlapati R, Behm A, Petropoulos M (2024). Adaptive and Robust Query Execution for Lakehouses at Scale. Proceedings of the VLDB Endowment 17:12 (3947-3959). DOI: 10.14778/3685800.3685818. Online publication date: 1-Aug-2024.
https://dl.acm.org/doi/10.14778/3685800.3685818
Lyu C, Fan Q, Guyard P, Diao Y (2024). A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning. Proceedings of the VLDB Endowment 17:11 (3565-3579). DOI: 10.14778/3681954.3682021. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3682021
Sahaya Sheela M, Farhaoui Y, Kanmani Pappa C, Ashokkumar N, Aljanabi M (2024). Enhancing Query Processing in Big Data: Scalability and Performance Optimization. Artificial Intelligence, Big Data, IOT and Block Chain in Healthcare: From Concepts to Applications (46-57). DOI: 10.1007/978-3-031-65014-7_5. Online publication date: 14-Aug-2024.
https://doi.org/10.1007/978-3-031-65014-7_5

Index Terms

Shark: SQL and rich analytics at scale
1. Information systems
  1. Data management systems
    1. Database management system engines

Information & Contributors

Information

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN: 9781450320375

DOI: 10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 248

Total Downloads: 2,837

Downloads (Last 12 months): 70

Downloads (Last 6 weeks): 8

Reflects downloads up to 23 Nov 2024

Citations

Cited By

Li Y, Jiang H, Shen Y, Fang Y, Yang X, Huang D, Zhang X, Zhang W, Zhang C, Chen P, Cui B (2023). Towards General and Efficient Online Tuning for Spark. Proceedings of the VLDB Endowment 16:12 (3570-3583). DOI: 10.14778/3611540.3611548. Online publication date: 12-Sep-2023.
https://dl.acm.org/doi/10.14778/3611540.3611548
Huang Z, Sen R, Liu J, Wu E (2023). JoinBoost: Grow Trees over Normalized Data Using Only SQL. Proceedings of the VLDB Endowment 16:11 (3071-3084). DOI: 10.14778/3611479.3611509. Online publication date: 24-Aug-2023.
https://dl.acm.org/doi/10.14778/3611479.3611509
Djebrouni Y, Rocha I, Bouchenak S, Chen L, Felber P, Marangozova V, Schiavoni V (2023). Characterizing Distributed Machine Learning Workloads on Apache Spark. Proceedings of the 24th International Middleware Conference (151-164). DOI: 10.1145/3590140.3629112. Online publication date: 27-Nov-2023.
https://dl.acm.org/doi/10.1145/3590140.3629112
Katebzadeh M, Costa P, Grot B, Fedorova A, Narayanan D, Di Luna G, Querzoni L (2023). Saba: Rethinking Datacenter Network Allocation from Application's Perspective. Proceedings of the Eighteenth European Conference on Computer Systems (623-638). DOI: 10.1145/3552326.3587450. Online publication date: 8-May-2023.
https://dl.acm.org/doi/10.1145/3552326.3587450
Tang Z, Xiao Z, Yang L, He K, Li K (2023). A Network Load Perception Based Task Scheduler for Parallel Distributed Data Processing Systems. IEEE Transactions on Cloud Computing 11:2 (1352-1364). DOI: 10.1109/TCC.2021.3132627. Online publication date: 1-Apr-2023.
https://doi.org/10.1109/TCC.2021.3132627
Tang S, He B, Yu C, Li Y, Li K (2023). A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications (Extended abstract). 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3779-3780). DOI: 10.1109/ICDE55515.2023.00316. Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00316
Nia A, Kaleibar F, Feizi F, Rahimi F, Kashfi H (2023). Unlocking the Power of Data in Telecom: Building an Effective MLOps Infrastructure for Model Deployment. 2023 7th Iranian Conference on Advances in Enterprise Architecture (ICAEA) (78-84). DOI: 10.1109/ICAEA60387.2023.10414445. Online publication date: 15-Nov-2023.
https://doi.org/10.1109/ICAEA60387.2023.10414445
