
DiNoDB: Efficient Large-Scale Raw Data Analytics

Authors:

Ioannis Alagiannis,

Erietta Liarou,

Anastasia Ailamaki,

Pietro Michiardi,

Marko Vukolić

Data4U '14: Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)

Pages 1 - 6

https://doi.org/10.1145/2658840.2658841

Published: 01 September 2014

Abstract

Modern big data workflows, found for example in machine learning use cases, often involve repeated cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data.

In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both scalable batch and interactive data analytics on raw data, hence avoiding the loading-phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries on raw files. We revisit NoDB's centralized design and scale it out, supporting multiple clients and data processing nodes, to produce a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data.
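To make the metadata idea concrete, the following is a minimal single-machine sketch of a positional map and a vertical index over a raw CSV buffer. It is an illustration under stated assumptions, not DiNoDB's implementation: the abstract does not specify the layout of either structure, so the function names (`build_positional_map`, `read_field`, `build_vertical_index`), the choice of sampled columns, the single-byte delimiter, and the (key, row offset) index layout are all hypothetical.

```python
def build_positional_map(raw: bytes, delimiter: bytes = b",", sampled_columns=(0, 2)):
    """For every row, record its byte offset and the byte offsets of a few columns.

    Assumption: a single-byte delimiter and no quoted fields.
    """
    pmap = []
    row_start = 0
    for line in raw.splitlines(keepends=True):
        field_offsets = {}
        pos = 0
        for col, field in enumerate(line.rstrip(b"\r\n").split(delimiter)):
            if col in sampled_columns:
                field_offsets[col] = row_start + pos
            pos += len(field) + len(delimiter)
        pmap.append((row_start, field_offsets))
        row_start += len(line)
    return pmap


def read_field(raw: bytes, pmap, row: int, col: int, delimiter: bytes = b",") -> str:
    """Fetch one attribute by jumping to its recorded offset, without re-tokenizing the row."""
    _, field_offsets = pmap[row]
    start = field_offsets[col]
    end = start
    # Scan forward only to the end of this field.
    while end < len(raw) and raw[end:end + 1] not in (delimiter, b"\n", b"\r"):
        end += 1
    return raw[start:end].decode()


def build_vertical_index(raw: bytes, pmap, key_col: int):
    """Hypothetical vertical index: sorted (key value, row byte offset) pairs."""
    return sorted(
        (read_field(raw, pmap, row, key_col), row_start)
        for row, (row_start, _) in enumerate(pmap)
    )


raw = b"1,alice,30\n2,bob,25\n3,carol,41\n"
pmap = build_positional_map(raw)
age_of_second_row = read_field(raw, pmap, 1, 2)
index_on_id = build_vertical_index(raw, pmap, 0)
```

In this sketch the map is built in one pass over the raw bytes, which is the kind of work a batch job could do as a side effect of reading the data anyway; a later interactive query can then seek directly to an attribute instead of parsing each row from its beginning.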

Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, such as Shark, Hive, and HadoopDB.

Index Terms

DiNoDB: Efficient Large-Scale Raw Data Analytics
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Information & Contributors

Information

Published In

Data4U '14: Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)

September 2014

40 pages

ISBN:9781450331869

DOI:10.1145/2658840

Editors:
Rada Chirkova (North Carolina State University, USA),
Jun Yang (Duke University, USA)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

VLDB Endowment: Very Large Database Endowment
DELL

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

Data4U '14

Data4U '14: First International Workshop on Bringing the Value of "Big Data" to Users

September 1, 2014

Hangzhou, China

Acceptance Rates

Data4U '14 Paper Acceptance Rate 6 of 6 submissions, 100%;

Overall Acceptance Rate 6 of 6 submissions, 100%

Bibliometrics & Citations

Bibliometrics

Total Citations: 6
Total Downloads: 230
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Alvarez-Ayllon A, Palomo-Duarte M, Dodero J (2019). Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping Study. IEEE Access 7 (10691-10717). Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2018.2882244

Tian Y, Alagiannis I, Liarou E, Ailamaki A, Michiardi P, Vukolic M (2017). DiNoDB: An Interactive-Speed Query Engine for Ad-Hoc Queries on Temporary Data. IEEE Transactions on Big Data 3:3 (320-333). Online publication date: 1-Sep-2017.
https://doi.org/10.1109/TBDATA.2016.2637356

Zhou S, Cheng K, Men L (2017). The survey of large-scale query classification (040045). Online publication date: 2017.
https://doi.org/10.1063/1.4981641

Liu Z, Hammerschmidt B, McMahon D, Liu Y, Chang H, Özcan F, Koutrika G, Madden S (2016). Closing the functional and Performance Gap between SQL and NoSQL. Proceedings of the 2016 International Conference on Management of Data (227-238). Online publication date: 26-Jun-2016.
https://dl.acm.org/doi/10.1145/2882903.2903731

Hong Z, Yu C, Wang J, Xiao J, Cui C, Sun J (2016). AQUAdexIM: highly efficient in-memory indexing and querying of astronomy time series images. Experimental Astronomy 42:3 (387-405). Online publication date: 10-Nov-2016.
https://doi.org/10.1007/s10686-016-9515-0

Hong Z, Yu C, Xia R, Xiao J, Wang J, Sun J, Cui C (2015). AQUAdex. Proceedings, Part II, of the 15th International Conference on Algorithms and Architectures for Parallel Processing - Volume 9529 (92-105). Online publication date: 18-Nov-2015.
https://dl.acm.org/doi/10.1007/978-3-319-27122-4_7
