Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

DIFF: a relational interface for large-scale data explanation

Published: 01 December 2018 Publication History

Abstract

A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root cause analysis, today's explanation engines are designed as standalone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications, and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.

References

[1]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases: the logical level. Addison-Wesley Longman Publishing Co., Inc., 1995.
[2]
R. Agarwal, R. Srikant, et al. Fast algorithms for mining association rules. In VLDB, pages 487--499, 1994.
[3]
M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis, D. Kumar, C. Lever, Z. Ma, J. Mason, D. Menscher, C. Seaman, N. Sullivan, K. Thomas, and Y. Zhou. Understanding the mirai botnet. In USENIX Security, 2017.
[4]
M. Armbrust et al. Spark sql: Relational data processing in spark. In SIGMOD, pages 1383--1394. ACM, 2015.
[5]
R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD, volume 29, pages 261--272. ACM, 2000.
[6]
J. Ayres et al. Sequential pattern mining using a bitmap representation. In KDD, pages 429--435. ACM, 2002.
[7]
S. Babu, P. Bizarro, and D. DeWitt. Proactive re-optimization. In SIGMOD, pages 107--118. ACM, 2005.
[8]
P. Bailis et al. Macrobase: Prioritizing attention in fast data. In SIGMOD, pages 541--556. ACM, 2017.
[9]
P. Bailis et al. Prioritizing attention in fast data: Principles and promise. CIDR Google Scholar, 2017.
[10]
E. Baralis, T. Cerquitelli, and S. Chiusano. Index support for frequent itemset mining in a relational dbms. In ICDE, pages 754--765. IEEE, 2005.
[11]
E. Baralis, T. Cerquitelli, and S. Chiusano. Imine: Index support for item set mining. IEEE Transactions on Knowledge and Data Engineering, 21(4):493--506, 2009.
[12]
R. G. Baraniuk. Compressive sensing {lecture notes}. IEEE signal processing magazine, 24(4):118--121, 2007.
[13]
Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165--1188, 2001.
[14]
M. Bittorf et al. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.
[15]
D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In ICDE, pages 443--452. IEEE, 2001.
[16]
S. Chambi et al. Better bitmap performance with roaring bitmaps. Software: practice and experience, 46(5):709--719, 2016.
[17]
S. Chambi et al. Optimizing druid with roaring bitmaps. In IDEAS, pages 77--86. ACM, 2016.
[18]
S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34--43. ACM, 1998.
[19]
L. Chen et al. Towards linear algebra over normalized data. PVLDB, 10(11):1214--1225, 2017.
[20]
J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56:74--80, 2013.
[21]
A. Deshpande et al. Adaptive query processing. Foundations and Trends in Databases, 1(1):1--140, 2007.
[22]
Z. Durumeric et al. The matter of heartbleed. In IMC, pages 475--488. ACM, 2014.
[23]
Z. Durumeric et al. A search engine backed by Internet-wide scanning. In SIGSAC, pages 542--553. ACM, 2015.
[24]
R. Fagin et al. Efficient implementation of large-scale multi-structural databases. In VLDB, pages 958--969. VLDB Endowment, 2005.
[25]
R. Fagin et al. Multi-structural databases. In PODS, pages 184--195. ACM, 2005.
[26]
W. Fang et al. Frequent itemset mining on graphics processors. In DaMoN, pages 34--42. ACM, 2009.
[27]
P. Fournier-Viger et al. The spmf open-source data mining library version 2. In Joint European conference on machine learning and knowledge discovery in databases, pages 36--40. Springer, 2016.
[28]
G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In ICDE, pages 209--218. IEEE, 1993.
[29]
J. Gray et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery, 1(1):29--53, 1997.
[30]
A. Greenberg et al. The cost of a cloud: research problems in data center networks. ACM SIGCOMM computer communication review, 39(1):68--73, 2008.
[31]
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157--1182, 2003.
[32]
M. A. Hall. Correlation-based feature selection of discrete and numeric class machine learning. 2000.
[33]
J. M. Hellerstein et al. Architecture of a database system. Foundations and Trends® in Databases, 1(2):141--259, 2007.
[34]
J. M. Hellerstein and M. Stonebraker. Readings in database systems. 2005.
[35]
S. C. Hoi et al. Online feature selection for mining big data. In BigMine, pages 93--100. ACM, 2012.
[36]
IDC. The digital universe of opportunities: Rich data and the increasing value of the internet of things, 2014. http://www.emc.com/leadership/digital-universe/.
[37]
I. F. Ilyas et al. Cords: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647--658. ACM, 2004.
[38]
Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results, volume 20. ACM, 1991.
[39]
N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: Debugging mapreduce job performance. PVLDB, 5(7):598--609, 2012.
[40]
R. Kimball and M. Ross. The data warehouse toolkit: the complete guide to dimensional modeling. John Wiley & Sons, 2011.
[41]
P. Konda et al. Feature selection in enterprise analytics: a demonstration using an r-based data analytics system. PVLDB, 6(12):1306--1309, 2013.
[42]
A. Kumar. Learning Over Joins. PhD thesis, The University of Wisconsin-Madison, 2016.
[43]
A. Kumar et al. To join or not to join?: Thinking twice about joins before feature selection. In SIGMOD, pages 19--34. ACM, 2016.
[44]
A. Kumar, J. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1969--1984. ACM, 2015.
[45]
A. Lamb et al. The vertica analytic database: C-store 7 years later. VLDB, 5(12):1790--1801, 2012.
[46]
J. Leskovec et al. Mining of Massive Datasets. Cambridge university press, 2014.
[47]
H. Li et al. Pfp: parallel fp-growth for query recommendation. In RecSys, pages 107--114. ACM, 2008.
[48]
J. Li et al. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2017.
[49]
A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases. PVLDB, 7(13):1715--1716, 2014.
[50]
S. Melnik et al. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010.
[51]
X. Meng et al. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235--1241, 2016.
[52]
T. Neumann and B. Radke. Adaptive optimization of very large join queries. 2018.
[53]
H. Q. Ngo et al. Worst-case optimal join algorithms. Journal of the ACM (JACM), 65(3):16, 2018.
[54]
P. O'Neil and D. Quass. Improved query performance with variant indexes. In SIGMOD, volume 26, pages 38--49. ACM, 1997.
[55]
A. Pagh and R. Pagh. Scalable computation of acyclic joins. In PODS, pages 225--232. ACM, 2006.
[56]
E. Rounds. A combined nonparametric approach to feature selection and binary decision tree design. Pattern Recognition, 12(5):313--317, 1980.
[57]
S. Roy et al. Perfaugur: Robust diagnostics for performance anomalies in cloud services. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 1167--1178. IEEE, 2015.
[58]
S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579--1590. ACM, 2014.
[59]
G. Rupert Jr et al. Simultaneous statistical inference. Springer Science & Business Media, 2012.
[60]
Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. bioinformatics, 23(19):2507--2517, 2007.
[61]
S. Schuh, X. Chen, and J. Dittrich. An experimental comparison of thirteen relational equi-joins in main memory. In SIGMOD, pages 1961--1976. ACM, 2016.
[62]
P. G. Selinger et al. Access path selection in a relational database management system. In Readings in Artificial Intelligence and Databases, pages 511--522. Elsevier, 1988.
[63]
X. Shang, K.-U. Sattler, and I. Geist. Sql based frequent pattern mining with fp-growth. In Applications of Declarative Programming and Knowledge Management, pages 32--46. Springer, 2005.
[64]
M. Stonebraker et al. C-store: a column-oriented dbms. In VLDB, pages 553--564. VLDB Endowment, 2005.
[65]
X. Wang et al. Data x-ray: A diagnostic tool for data errors. In SIGMOD, pages 1231--1245. ACM, 2015.
[66]
D. E. Willard. Applications of range query theory to relational data base join and selection operations. journal of computer and system sciences, 52(1):157--169, 1996.
[67]
E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013.
[68]
F. Yang et al. Druid: A real-time analytical data store. In SIGMOD, pages 157--168. ACM, 2014.
[69]
D. Y. Yoon, N. Niu, and B. Mozafari. Dbsherlock: A performance diagnostic tool for transactional databases. In SIGMOD, pages 1599--1614. ACM, 2016.
[70]
M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2. USENIX Association, 2012.
[71]
F. Zhang, Y. Zhang, and J. Bakos. Gpapriori: Gpu-accelerated frequent itemset mining. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 590--594. IEEE, 2011.

Cited By

View all
  • (2025)Nazar: Monitoring and Adapting ML Models on Mobile DevicesProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3669940.3707246(746-761)Online publication date: 3-Feb-2025
  • (2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
  • (2022)Automated relational data explanation using external semantic knowledgeProceedings of the VLDB Endowment10.14778/3554821.355484415:12(3562-3565)Online publication date: 1-Aug-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 12, Issue 4
December 2018
140 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2018
Published in PVLDB Volume 12, Issue 4

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)28
  • Downloads (Last 6 weeks)5
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Nazar: Monitoring and Adapting ML Models on Mobile DevicesProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3669940.3707246(746-761)Online publication date: 3-Feb-2025
  • (2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
  • (2022)Automated relational data explanation using external semantic knowledgeProceedings of the VLDB Endowment10.14778/3554821.355484415:12(3562-3565)Online publication date: 1-Aug-2022
  • (2022)Towards causal physical error discovery in video analytics systemsProceedings of the Workshop on Human-In-the-Loop Data Analytics10.1145/3546930.3547495(1-6)Online publication date: 12-Jun-2022
  • (2022)Mitigating Bias in Algorithmic Systems—A Fish-eye ViewACM Computing Surveys10.1145/352715255:5(1-37)Online publication date: 3-Dec-2022
  • (2022)Sommelier: Curating DNN Models for the MassesProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526173(1876-1890)Online publication date: 10-Jun-2022
  • (2022)Interactive Query Explanations Using Fine Grained ProvenanceProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3520251(2536-2538)Online publication date: 10-Jun-2022
  • (2022)Complaint-Driven Training Data Debugging at Interactive SpeedsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517849(369-383)Online publication date: 10-Jun-2022
  • (2022)Video-zilla: An Indexing Layer for Large-Scale Video AnalyticsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517840(1905-1919)Online publication date: 10-Jun-2022
  • (2022)View Composition Algebra for Ad Hoc ComparisonIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2022.315251528:6(2470-2485)Online publication date: 1-Jun-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media