DOI: 10.1145/2063384.2063473
research-article

SciHadoop: array-based query processing in Hadoop

Published: 12 November 2011

Abstract

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions of I/O, both locally and over the network.
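To make two of the ideas in the abstract concrete, the following is a minimal Python sketch, not the SciHadoop implementation. It assumes a logical array split into hyper-rectangular partitions, a query expressed as a hyper-rectangle, and a median taken over fixed-size groups of cells. All names (`partition_array`, `prune`, `map_task`) and the corner/extent representation are hypothetical and chosen only for illustration. It shows (a) pruning partitions that do not intersect the query region, and (b) computing a holistic function (median) in the map task whenever a group of cells lies entirely inside one partition.

```python
from itertools import product
from statistics import median


def partition_array(shape, chunk):
    """Yield (corner, extent) hyper-rectangles that tile a logical array."""
    ranges = [range(0, s, c) for s, c in zip(shape, chunk)]
    for corner in product(*ranges):
        # Clip the last partition in each dimension to the array bounds.
        extent = tuple(min(c, s - o) for o, c, s in zip(corner, chunk, shape))
        yield corner, extent


def intersects(corner, extent, q_corner, q_extent):
    """True if two hyper-rectangles overlap in every dimension."""
    return all(o < qo + qe and qo < o + e
               for o, e, qo, qe in zip(corner, extent, q_corner, q_extent))


def prune(partitions, q_corner, q_extent):
    """Keep only partitions the query region actually touches."""
    return [p for p in partitions if intersects(p[0], p[1], q_corner, q_extent)]


def contains(corner, extent, g_corner, g_extent):
    """True if the group rectangle lies fully inside the partition."""
    return all(o <= go and go + ge <= o + e
               for o, e, go, ge in zip(corner, extent, g_corner, g_extent))


def map_task(partition, cells, group):
    """Group cells by output key; finish the median locally when possible.

    `cells` maps coordinates to values and is assumed to hold every cell
    of `partition`. A group whose cells are all local yields a final
    result; otherwise the raw values are passed on for a reducer.
    """
    corner, extent = partition
    groups = {}
    for coord, value in cells.items():
        key = tuple(c // g for c, g in zip(coord, group))
        groups.setdefault(key, []).append(value)
    out = []
    for key, values in sorted(groups.items()):
        g_corner = tuple(k * g for k, g in zip(key, group))
        if contains(corner, extent, g_corner, group):
            out.append(("final", key, median(values)))
        else:
            out.append(("partial", key, values))
    return out
```

In this sketch, a partitioning whose boundaries align with the group shape lets every group be finished in the map phase, which is one way to read the abstract's claim that partitioning and map-side evaluation of holistic functions work together to reduce data transfer.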



Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN:9781450307710
DOI:10.1145/2063384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data intensive
  2. map reduce
  3. query optimization
  4. scientific file-formats

Qualifiers

  • Research-article

Conference

SC '11

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 14
  • Downloads (last 6 weeks): 2
Reflects downloads up to 13 Nov 2024

Cited By

  • (2024) Datacubes as enabler for advanced decision support systems in land management. Land Degradation & Development 35:11 (3579-3592). DOI: 10.1002/ldr.5153. Online publication date: 26-May-2024
  • (2022) Scalable Tensors for Big Data Analytics. 2022 IEEE International Conference on Big Data (Big Data) (107-114). DOI: 10.1109/BigData55660.2022.10020383. Online publication date: 17-Dec-2022
  • (2022) Extending an asynchronous runtime system for high throughput applications. Journal of Parallel and Distributed Computing 163:C (214-231). DOI: 10.1016/j.jpdc.2022.01.027. Online publication date: 1-May-2022
  • (2022) Introduction of Data Center. Data Center Networking (3-24). DOI: 10.1007/978-981-16-9368-7_1. Online publication date: 24-Feb-2022
  • (2021) Distributed Interactive Visualization Using GPU-Optimized Spark. IEEE Transactions on Visualization and Computer Graphics 27:9 (3670-3684). DOI: 10.1109/TVCG.2020.2990894. Online publication date: 1-Sep-2021
  • (2021) Distributed arrays: an algebra for generic distributed query processing. Distributed and Parallel Databases. DOI: 10.1007/s10619-021-07325-2. Online publication date: 5-Apr-2021
  • (2020) An Agent-Based Computational Framework for Distributed Data Analysis. Computer 53:3 (16-25). DOI: 10.1109/MC.2019.2932964. Online publication date: 11-Mar-2020
  • (2020) DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (254-263). DOI: 10.1109/IPDPS47924.2020.00035. Online publication date: May-2020
  • (2020) On the Marriage of Asynchronous Many Task Runtimes and Big Data: A Glance. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC) (233-242). DOI: 10.1109/HiPC50609.2020.00037. Online publication date: Dec-2020
  • (2020) Agent-Navigable Dynamic Graph Construction and Visualization over Distributed Memory. 2020 IEEE International Conference on Big Data (Big Data) (2957-2966). DOI: 10.1109/BigData50022.2020.9378298. Online publication date: 10-Dec-2020
