DOI: 10.1145/2658840.2658841
research-article

DiNoDB: Efficient Large-Scale Raw Data Analytics

Published: 01 September 2014

Abstract

Modern big data workflows, found for example in machine learning use cases, often involve iterated cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data.
In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both batch and interactive analytics on raw data, hence avoiding the loading-phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries on raw files. We revisit NoDB's centralized design and scale it out to support multiple clients and data processing nodes, producing a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data.
Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, such as Shark, Hive, and HadoopDB.
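To make the idea of a positional map more concrete, the following is a minimal single-node sketch in Python. It is not DiNoDB's implementation: the function names `build_positional_map` and `read_field`, the sample data, and the simplified CSV handling (no quoted fields) are all assumptions made for illustration. It only shows the general principle that byte offsets recorded during one pass over a raw file let a later query fetch a field without re-parsing the whole file.

```python
# Illustrative sketch only: names and file layout are hypothetical,
# not taken from the DiNoDB paper or code base.

def build_positional_map(raw: bytes, columns, delimiter: bytes = b","):
    """Scan the raw data once and record, for each row, the byte offset
    and length of every field whose column index is in `columns`."""
    wanted = set(columns)
    pmap = []
    offset = 0  # byte offset of the current row within `raw`
    for line in raw.splitlines(keepends=True):
        body = line.rstrip(b"\r\n")
        row_entry = {}
        pos = 0  # byte offset of the current field within the row
        for idx, field in enumerate(body.split(delimiter)):
            if idx in wanted:
                row_entry[idx] = (offset + pos, len(field))
            pos += len(field) + len(delimiter)
        pmap.append(row_entry)
        offset += len(line)
    return pmap


def read_field(raw: bytes, pmap, row: int, column: int) -> str:
    """Fetch one field directly through the map, without tokenizing the row."""
    start, length = pmap[row][column]
    return raw[start:start + length].decode("utf-8")


# Hypothetical raw file contents (a header row followed by two data rows).
RAW = b"id,name,score\n1,alice,90\n2,bob,85\n"
PMAP = build_positional_map(RAW, columns=(1, 2))
print(read_field(RAW, PMAP, 1, 1))
```

In the system described in the abstract, this kind of metadata is produced by MapReduce batch jobs and distributed across nodes; the sketch keeps everything in memory on one machine purely to show the lookup path.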



Published In

ACM Other conferences
Data4U '14: Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)
September 2014
40 pages
ISBN: 9781450331869
DOI: 10.1145/2658840
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • VLDB Endowment: Very Large Database Endowment
  • DELL

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed database
  2. In situ query
  3. positional map file

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Data4U '14

Acceptance Rates

Data4U '14 Paper Acceptance Rate 6 of 6 submissions, 100%;
Overall Acceptance Rate 6 of 6 submissions, 100%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 4

Reflects downloads up to 24 Nov 2024

Cited By

  • (2019) Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping Study. IEEE Access 7 (10691-10717). DOI: 10.1109/ACCESS.2018.2882244. Online publication date: 2019
  • (2017) DiNoDB: An Interactive-Speed Query Engine for Ad-Hoc Queries on Temporary Data. IEEE Transactions on Big Data 3:3 (320-333). DOI: 10.1109/TBDATA.2016.2637356. Online publication date: 1-Sep-2017
  • (2017) The survey of large-scale query classification (040045). DOI: 10.1063/1.4981641. Online publication date: 2017
  • (2016) Closing the functional and Performance Gap between SQL and NoSQL. Proceedings of the 2016 International Conference on Management of Data (227-238). DOI: 10.1145/2882903.2903731. Online publication date: 26-Jun-2016
  • (2016) AQUAdexIM: highly efficient in-memory indexing and querying of astronomy time series images. Experimental Astronomy 42:3 (387-405). DOI: 10.1007/s10686-016-9515-0. Online publication date: 10-Nov-2016
  • (2015) AQUAdex. Proceedings, Part II, of the 15th International Conference on Algorithms and Architectures for Parallel Processing - Volume 9529 (92-105). DOI: 10.1007/978-3-319-27122-4_7. Online publication date: 18-Nov-2015
