DOI: 10.1145/2588555.2595630

Major Technical Advancements in Apache Hive

Published: 18 June 2014

Abstract

Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Working closely with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. To ensure that Hive continues to meet the requirements of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts to advance Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements in storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.


    Published In

    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN: 9781450323765
    DOI: 10.1145/2588555

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data warehouse
    2. databases
    3. hadoop
    4. hive
    5. mapreduce

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14

    Acceptance Rates

    SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
