research-article

SciHadoop: array-based query processing in Hadoop

Authors:

Kleoni Ioannidou,

Carlos Maltzahn,

Neoklis Polyzotis,

Scott BrandtAuthors Info & Claims

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 66, Pages 1 - 11

https://doi.org/10.1145/2063384.2063473

Published: 12 November 2011 Publication History

Abstract

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly-structured, array-based binary file formats resulting in limited scalability of Hadoop applications in science. We introduce Sci-Hadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. Sci-Hadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a Sci-Hadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality, and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions of IO, both locally and over the network.

References

[1]

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. HadoopDB: An architectural hybrid of mapreduce and dbms technologies for analytical workloads. In Proceedings of Very Large Data Bases (VLDB '09), Lyon, France, August 24--28 2009.

[2]

Paul G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 963--968, New York, NY, USA, 2010. ACM.

Digital Library

[3]

Joe Buck, Noah Watkins, Jeff Lefevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, and Scott A. Brandt. SciHadoop: Array-based query processing in hadoop. Technical Report UCSC-SOE-11-04, UCSC, 2011.

Digital Library

[4]

Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association.

Digital Library

[5]

Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox. Mapreduce in the clouds for science. Cloud Computing Technology and Science, IEEE International Conference on, 0:565--572, 2010.

Digital Library

[6]

HBase homepage. http://http://hbase.apache.org/.

[7]

Ming-Yee Iu and Willy Zwaenepoel. HadoopToSQL: a mapreduce query optimizer. In Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pages 251--264, 2010.

Digital Library

[8]

Leonid Libkin, Rona Machlin, and Limsoon Wong. A query language for multidimensional arrays: design, implementation, and optimization techniques. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data, SIGMOD '96, pages 228--239, New York, NY, USA, 1996. ACM.

Digital Library

[9]

S. Loebman, D. Nunley, Yong-Chul Kwon, B. Howe, M. Balazinska, and J. P. Gardner. Analyzing massive astrophysical datasets: Can pig/hadoop or a relational dbms help? In Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, pages 1--10, 31 2009-sept. 4 2009.

[10]

Rona Machlin. Index-based multidimensional array queries: safety and equivalence. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '07, pages 175--184, New York, NY, USA, 2007. ACM.

Digital Library

[11]

Samuel Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. Tag: a tiny aggregation service for ad-hoc sensor networks. In OSDI 02, OSDI '02, pages 131--146, New York, NY, USA, 2002. ACM.

Digital Library

[12]

NetCDF operator (NCO) homepage. http://nco.sourceforge.net/.

[13]

Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, and Seungryoul Maeng. Hama: An efficient matrix computation with the mapreduce framework. In Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science, IEEE CloudCom 2010. IEEE, December 2010.

Digital Library

[14]

Konstantin V. Shvachko. HDFS scalability: the limits to growth.; login, 35(2):6--16, April 2010.

[15]

Emad Soroush, Magdalena Balazinska, and Daniel Wang. Arraystore: a storage manager for complex parallel array processing. In Proceedings of the 2011 international conference on Management of data, SIGMOD '11, pages 253--264, New York, NY, USA, 2011. ACM.

Digital Library

[16]

Michael Stonebraker, Jacek Becla, David Dewitt, Kian-Tat Lim, David Maier, Oliver Ratzesberger, and Stan Zdonik. Requirements for science data bases and SciDB. In CIDR 09, 2009.

[17]

Alex van Ballegooij, Roberto Cornacchia, Arjen de Vries, and Martin Kersten. Distribution rules for array database queries. In Kim Andersen, John Debenham, and Roland Wagner, editors, Database and Expert Systems Applications, volume 3588 of Lecture Notes in Computer Science, pages 55--64. Springer Berlin/Heidelberg, 2005.

Digital Library

[18]

Daniel L. Wang, Charles S. Zender, and Stephen F. Jenks. Clustered workflow execution of retargeted data analysis scripts. In CCGRID 2008, 2008.

Digital Library

[19]

Jianwu Wang, Daniel Crawl, and Ilkay Altintas. Kepler + Hadoop: A general architecture facilitating data-intensive applications in scientific workflow systems. In SC-WORKS, 2009.

Digital Library

[20]

Hui Zhao, SiYun Ai, ZhenHua Lv, and Bo Li. Parallel accessing massive NetCDF data based on MapReduce. In Web Information Systems and Mining, Lecture Notes in Computer Science. Springer Berlin/Heidelberg, 2010.

Digital Library

Cited By

Baumann PMerticariu VMisev DPham Huu BLangella G(2024)Datacubes as enabler for advanced decision support systems in land managementLand Degradation & Development10.1002/ldr.515335:11(3579-3592)Online publication date: 26-May-2024
https://doi.org/10.1002/ldr.5153
Fegaras LKhan THasanuzzaman Noor MSultana T(2022)Scalable Tensors for Big Data Analytics2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020383(107-114)Online publication date: 17-Dec-2022
https://doi.org/10.1109/BigData55660.2022.10020383
Suetterlein JManzano JMarquez AGao G(2022)Extending an asynchronous runtime system for high throughput applicationsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.01.027163:C(214-231)Online publication date: 1-May-2022
https://dl.acm.org/doi/10.1016/j.jpdc.2022.01.027
Show More Cited By

Index Terms

Recommendations

BigBench: towards an industry standard benchmark for big data analytics
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to ...
Implementation of multi node Hadoop virtual cluster on open stack cloud environments

Nowadays computing plays a vital role in information technology and all other fields. Yes, the cloud computing is one of the biggest milestone in most leading next generation technology and booming up in IT filed and business sectors. In our day to day ...
Query optimization using column statistics in hive
IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications

Hive is a data warehousing solution on top of the Hadoop MapReduce framework that has been designed to handle large amounts of data and store them in tables like a relational database management system or a conventional data warehouse while using the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair:
Scott Lathrop
University of Chicago
,
Program Chairs:
Jim Costa
Sandia National Laboratories
,
William Kramer
National Center for Supercomputing Applications

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

SC '11

Sponsor:

SIGARCH
IEEE-CS

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Washington, Seattle

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

91
Total Citations
View Citations
730
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)2

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Baumann PMerticariu VMisev DPham Huu BLangella G(2024)Datacubes as enabler for advanced decision support systems in land managementLand Degradation & Development10.1002/ldr.515335:11(3579-3592)Online publication date: 26-May-2024
https://doi.org/10.1002/ldr.5153
Fegaras LKhan THasanuzzaman Noor MSultana T(2022)Scalable Tensors for Big Data Analytics2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020383(107-114)Online publication date: 17-Dec-2022
https://doi.org/10.1109/BigData55660.2022.10020383
Suetterlein JManzano JMarquez AGao G(2022)Extending an asynchronous runtime system for high throughput applicationsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.01.027163:C(214-231)Online publication date: 1-May-2022
https://dl.acm.org/doi/10.1016/j.jpdc.2022.01.027
Guo DGuo D(2022)Introduction of Data CenterData Center Networking10.1007/978-981-16-9368-7_1(3-24)Online publication date: 24-Feb-2022
https://doi.org/10.1007/978-981-16-9368-7_1
Hong SChoi JJeong W(2021)Distributed Interactive Visualization Using GPU-Optimized SparkIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2020.299089427:9(3670-3684)Online publication date: 1-Sep-2021
https://doi.org/10.1109/TVCG.2020.2990894
Güting RBehr TNidzwetzki J(2021)Distributed rrays: an algebra for generic distributed query processingDistributed and Parallel Databases10.1007/s10619-021-07325-2Online publication date: 5-Apr-2021
https://doi.org/10.1007/s10619-021-07325-2
Fukuda MGordon CMert USell M(2020)An Agent-Based Computational Framework for Distributed Data AnalysisComputer10.1109/MC.2019.293296453:3(16-25)Online publication date: 11-Mar-2020
https://dl.acm.org/doi/10.1109/MC.2019.2932964
Dong BTribaldos VXing XByna SAjo-Franklin JWu K(2020)DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00035(254-263)Online publication date: May-2020
https://doi.org/10.1109/IPDPS47924.2020.00035
Suetterlein JManzano JMarquez AGao G(2020)On the Marriage of Asynchronous Many Task Runtimes and Big Data: A Glance2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)10.1109/HiPC50609.2020.00037(233-242)Online publication date: Dec-2020
https://doi.org/10.1109/HiPC50609.2020.00037
Gilroy JParonyan SAcoltzi JFukuda M(2020)Agent-Navigable Dynamic Graph Construction and Visualization over Distributed Memory2020 IEEE International Conference on Big Data (Big Data)10.1109/BigData50022.2020.9378298(2957-2966)Online publication date: 10-Dec-2020
https://doi.org/10.1109/BigData50022.2020.9378298
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten