Major Technical Advancements in Apache Hive

Authors:

Ashutosh Chauhan,

Gunther Hagleitner,

Eric N. Hanson,

Jitendra Pandey,

Xiaodong Zhang

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 1235 - 1246

https://doi.org/10.1145/2588555.2595630

Published: 18 June 2014

Abstract

Apache Hive is a widely used data warehouse system for Apache Hadoop and has been adopted by many organizations for various big data analytics applications. Working closely with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. To ensure that Hive continues to meet the requirements of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts on advancing Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements in storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN: 9781450323765

DOI: 10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers: Research-article

Conference: SIGMOD/PODS'14: International Conference on Management of Data

Sponsor: SIGMOD

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
