research-article

DIFF: a relational interface for large-scale data explanation

Authors:

Asvin Ananthanarayan,

Matei ZahariaAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 12, Issue 4

Pages 419 - 432

https://doi.org/10.14778/3297753.3297761

Published: 01 December 2018 Publication History

Abstract

A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root cause analysis, today's explanation engines are designed as standalone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications, and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.

References

[1]

S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases: the logical level. Addison-Wesley Longman Publishing Co., Inc., 1995.

Digital Library

[2]

R. Agarwal, R. Srikant, et al. Fast algorithms for mining association rules. In VLDB, pages 487--499, 1994.

Digital Library

[3]

M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis, D. Kumar, C. Lever, Z. Ma, J. Mason, D. Menscher, C. Seaman, N. Sullivan, K. Thomas, and Y. Zhou. Understanding the mirai botnet. In USENIX Security, 2017.

[4]

M. Armbrust et al. Spark sql: Relational data processing in spark. In SIGMOD, pages 1383--1394. ACM, 2015.

Digital Library

[5]

R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD, volume 29, pages 261--272. ACM, 2000.

Digital Library

[6]

J. Ayres et al. Sequential pattern mining using a bitmap representation. In KDD, pages 429--435. ACM, 2002.

Digital Library

[7]

S. Babu, P. Bizarro, and D. DeWitt. Proactive re-optimization. In SIGMOD, pages 107--118. ACM, 2005.

Digital Library

[8]

P. Bailis et al. Macrobase: Prioritizing attention in fast data. In SIGMOD, pages 541--556. ACM, 2017.

Digital Library

[9]

P. Bailis et al. Prioritizing attention in fast data: Principles and promise. CIDR Google Scholar, 2017.

Digital Library

[10]

E. Baralis, T. Cerquitelli, and S. Chiusano. Index support for frequent itemset mining in a relational dbms. In ICDE, pages 754--765. IEEE, 2005.

Digital Library

[11]

E. Baralis, T. Cerquitelli, and S. Chiusano. Imine: Index support for item set mining. IEEE Transactions on Knowledge and Data Engineering, 21(4):493--506, 2009.

Digital Library

[12]

R. G. Baraniuk. Compressive sensing {lecture notes}. IEEE signal processing magazine, 24(4):118--121, 2007.

[13]

Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165--1188, 2001.

[14]

M. Bittorf et al. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.

[15]

D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In ICDE, pages 443--452. IEEE, 2001.

Digital Library

[16]

S. Chambi et al. Better bitmap performance with roaring bitmaps. Software: practice and experience, 46(5):709--719, 2016.

Digital Library

[17]

S. Chambi et al. Optimizing druid with roaring bitmaps. In IDEAS, pages 77--86. ACM, 2016.

Digital Library

[18]

S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34--43. ACM, 1998.

Digital Library

[19]

L. Chen et al. Towards linear algebra over normalized data. PVLDB, 10(11):1214--1225, 2017.

Digital Library

[20]

J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56:74--80, 2013.

Digital Library

[21]

A. Deshpande et al. Adaptive query processing. Foundations and Trends in Databases, 1(1):1--140, 2007.

[22]

Z. Durumeric et al. The matter of heartbleed. In IMC, pages 475--488. ACM, 2014.

Digital Library

[23]

Z. Durumeric et al. A search engine backed by Internet-wide scanning. In SIGSAC, pages 542--553. ACM, 2015.

Digital Library

[24]

R. Fagin et al. Efficient implementation of large-scale multi-structural databases. In VLDB, pages 958--969. VLDB Endowment, 2005.

Digital Library

[25]

R. Fagin et al. Multi-structural databases. In PODS, pages 184--195. ACM, 2005.

Digital Library

[26]

W. Fang et al. Frequent itemset mining on graphics processors. In DaMoN, pages 34--42. ACM, 2009.

Digital Library

[27]

P. Fournier-Viger et al. The spmf open-source data mining library version 2. In Joint European conference on machine learning and knowledge discovery in databases, pages 36--40. Springer, 2016.

[28]

G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In ICDE, pages 209--218. IEEE, 1993.

Digital Library

[29]

J. Gray et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery, 1(1):29--53, 1997.

Digital Library

[30]

A. Greenberg et al. The cost of a cloud: research problems in data center networks. ACM SIGCOMM computer communication review, 39(1):68--73, 2008.

Digital Library

[31]

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157--1182, 2003.

Digital Library

[32]

M. A. Hall. Correlation-based feature selection of discrete and numeric class machine learning. 2000.

Digital Library

[33]

J. M. Hellerstein et al. Architecture of a database system. Foundations and Trends® in Databases, 1(2):141--259, 2007.

Digital Library

[34]

J. M. Hellerstein and M. Stonebraker. Readings in database systems. 2005.

Digital Library

[35]

S. C. Hoi et al. Online feature selection for mining big data. In BigMine, pages 93--100. ACM, 2012.

Digital Library

[36]

IDC. The digital universe of opportunities: Rich data and the increasing value of the internet of things, 2014. http://www.emc.com/leadership/digital-universe/.

[37]

I. F. Ilyas et al. Cords: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647--658. ACM, 2004.

Digital Library

[38]

Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results, volume 20. ACM, 1991.

Digital Library

[39]

N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: Debugging mapreduce job performance. PVLDB, 5(7):598--609, 2012.

Digital Library

[40]

R. Kimball and M. Ross. The data warehouse toolkit: the complete guide to dimensional modeling. John Wiley & Sons, 2011.

Digital Library

[41]

P. Konda et al. Feature selection in enterprise analytics: a demonstration using an r-based data analytics system. PVLDB, 6(12):1306--1309, 2013.

Digital Library

[42]

A. Kumar. Learning Over Joins. PhD thesis, The University of Wisconsin-Madison, 2016.

[43]

A. Kumar et al. To join or not to join?: Thinking twice about joins before feature selection. In SIGMOD, pages 19--34. ACM, 2016.

Digital Library

[44]

A. Kumar, J. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1969--1984. ACM, 2015.

Digital Library

[45]

A. Lamb et al. The vertica analytic database: C-store 7 years later. VLDB, 5(12):1790--1801, 2012.

Digital Library

[46]

J. Leskovec et al. Mining of Massive Datasets. Cambridge university press, 2014.

Digital Library

[47]

H. Li et al. Pfp: parallel fp-growth for query recommendation. In RecSys, pages 107--114. ACM, 2008.

Digital Library

[48]

J. Li et al. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2017.

Digital Library

[49]

A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases. PVLDB, 7(13):1715--1716, 2014.

Digital Library

[50]

S. Melnik et al. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010.

Digital Library

[51]

X. Meng et al. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235--1241, 2016.

Digital Library

[52]

T. Neumann and B. Radke. Adaptive optimization of very large join queries. 2018.

Digital Library

[53]

H. Q. Ngo et al. Worst-case optimal join algorithms. Journal of the ACM (JACM), 65(3):16, 2018.

Digital Library

[54]

P. O'Neil and D. Quass. Improved query performance with variant indexes. In SIGMOD, volume 26, pages 38--49. ACM, 1997.

Digital Library

[55]

A. Pagh and R. Pagh. Scalable computation of acyclic joins. In PODS, pages 225--232. ACM, 2006.

Digital Library

[56]

E. Rounds. A combined nonparametric approach to feature selection and binary decision tree design. Pattern Recognition, 12(5):313--317, 1980.

[57]

S. Roy et al. Perfaugur: Robust diagnostics for performance anomalies in cloud services. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 1167--1178. IEEE, 2015.

[58]

S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579--1590. ACM, 2014.

Digital Library

[59]

G. Rupert Jr et al. Simultaneous statistical inference. Springer Science & Business Media, 2012.

[60]

Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. bioinformatics, 23(19):2507--2517, 2007.

Digital Library

[61]

S. Schuh, X. Chen, and J. Dittrich. An experimental comparison of thirteen relational equi-joins in main memory. In SIGMOD, pages 1961--1976. ACM, 2016.

Digital Library

[62]

P. G. Selinger et al. Access path selection in a relational database management system. In Readings in Artificial Intelligence and Databases, pages 511--522. Elsevier, 1988.

Digital Library

[63]

X. Shang, K.-U. Sattler, and I. Geist. Sql based frequent pattern mining with fp-growth. In Applications of Declarative Programming and Knowledge Management, pages 32--46. Springer, 2005.

Digital Library

[64]

M. Stonebraker et al. C-store: a column-oriented dbms. In VLDB, pages 553--564. VLDB Endowment, 2005.

Digital Library

[65]

X. Wang et al. Data x-ray: A diagnostic tool for data errors. In SIGMOD, pages 1231--1245. ACM, 2015.

Digital Library

[66]

D. E. Willard. Applications of range query theory to relational data base join and selection operations. journal of computer and system sciences, 52(1):157--169, 1996.

Digital Library

[67]

E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013.

Digital Library

[68]

F. Yang et al. Druid: A real-time analytical data store. In SIGMOD, pages 157--168. ACM, 2014.

Digital Library

[69]

D. Y. Yoon, N. Niu, and B. Mozafari. Dbsherlock: A performance diagnostic tool for transactional databases. In SIGMOD, pages 1599--1614. ACM, 2016.

Digital Library

[70]

M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2. USENIX Association, 2012.

Digital Library

[71]

F. Zhang, Y. Zhang, and J. Bakos. Gpapriori: Gpu-accelerated frequent itemset mining. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 590--594. IEEE, 2011.

Digital Library

Cited By

Hao WWang ZHong LLi LKarayanni NDasbach-Prisk AMao CYang JCidon AEeckhout LSmaragdakis GLiang KSampson AKim MRossbach C(2025)Nazar: Monitoring and Adapting ML Models on Mobile DevicesProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3669940.3707246(746-761)Online publication date: 3-Feb-2025
https://dl.acm.org/doi/10.1145/3669940.3707246
Chen YHuang S(2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00060
Galhotra SKhurana U(2022)Automated relational data explanation using external semantic knowledgeProceedings of the VLDB Endowment10.14778/3554821.355484415:12(3562-3565)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554844
Show More Cited By

Recommendations

MacroBase: Prioritizing Attention in Fast Data
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

As data volumes continue to rise, manual inspection is becoming increasingly untenable. In response, we present MacroBase, a data analytics engine that prioritizes end-user attention in high-volume fast data streams. MacroBase enables efficient, ...
A formal approach to finding explanations for database queries
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research ...
Explaining query answers with explanation-ready databases

With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 12, Issue 4

December 2018

140 pages

ISSN:2150-8097

Editors:
Lei Chen
HKUST
,
Fatma Özcan
IBM Research - Almaden

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2018

Published in PVLDB Volume 12, Issue 4

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

16
Total Citations
View Citations
216
Total Downloads

Downloads (Last 12 months)28
Downloads (Last 6 weeks)5

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Hao WWang ZHong LLi LKarayanni NDasbach-Prisk AMao CYang JCidon AEeckhout LSmaragdakis GLiang KSampson AKim MRossbach C(2025)Nazar: Monitoring and Adapting ML Models on Mobile DevicesProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3669940.3707246(746-761)Online publication date: 3-Feb-2025
https://dl.acm.org/doi/10.1145/3669940.3707246
Chen YHuang S(2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00060
Galhotra SKhurana U(2022)Automated relational data explanation using external semantic knowledgeProceedings of the VLDB Endowment10.14778/3554821.355484415:12(3562-3565)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554844
Shaowang TZhao JSintos SKrishnan S(2022)Towards causal physical error discovery in video analytics systemsProceedings of the Workshop on Human-In-the-Loop Data Analytics10.1145/3546930.3547495(1-6)Online publication date: 12-Jun-2022
https://dl.acm.org/doi/10.1145/3546930.3547495
Orphanou KOtterbacher JKleanthous SBatsuren KGiunchiglia FBogina VTal AHartman AKuflik T(2022)Mitigating Bias in Algorithmic Systems—A Fish-eye ViewACM Computing Surveys10.1145/352715255:5(1-37)Online publication date: 3-Dec-2022
https://dl.acm.org/doi/10.1145/3527152
Guo PHu BHu WIves ZBonifati AEl Abbadi A(2022)Sommelier: Curating DNN Models for the MassesProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526173(1876-1890)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3526173
Yao AIves ZBonifati AEl Abbadi A(2022)Interactive Query Explanations Using Fine Grained ProvenanceProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3520251(2536-2538)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3520251
Flokas LWu WLiu YWang JVerma NWu EIves ZBonifati AEl Abbadi A(2022)Complaint-Driven Training Data Debugging at Interactive SpeedsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517849(369-383)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517849
Hu BGuo PHu WIves ZBonifati AEl Abbadi A(2022)Video-zilla: An Indexing Layer for Large-Scale Video AnalyticsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517840(1905-1919)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517840
Wu E(2022)View Composition Algebra for Ad Hoc ComparisonIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2022.315251528:6(2470-2485)Online publication date: 1-Jun-2022
https://dl.acm.org/doi/10.1109/TVCG.2022.3152515
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents