Explaining Inference Queries with Bayesian Optimization

Published: 01 July 2021

Abstract

Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model the objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation than those found by state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at https://github.com/sfu-db/BOExplain.
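
The idea of searching for a predicate whose removal changes a query result can be illustrated with a small sketch. The code below is not the BOExplain API and does not use Bayesian optimization: it substitutes plain random search as a stand-in optimizer, and the data, the `age` attribute, and every function name are hypothetical choices made only for illustration. It treats "query result after removing the tuples matched by a range predicate" as a black-box objective and looks for the predicate that minimizes it.

```python
import random

# Hypothetical toy inference data: (age, predicted_label) tuples.
# Rows with age in [30, 40) are predicted positive; this is made up for illustration.
ROWS = [(age, 1 if 30 <= age < 40 else 0) for age in range(20, 60)]


def query(rows):
    """Aggregate of interest: fraction of rows predicted positive."""
    return sum(label for _, label in rows) / len(rows) if rows else 0.0


def objective(lo, hi, rows=ROWS):
    """Black-box objective: query result after removing tuples with lo <= age <= hi."""
    kept = [row for row in rows if not (lo <= row[0] <= hi)]
    return query(kept)


def random_search(n_iter=200, seed=0, max_width=15):
    """Stand-in for Bayesian optimization: sample range predicates, keep the best one."""
    rng = random.Random(seed)
    best_predicate, best_score = None, float("inf")
    for _ in range(n_iter):
        lo = rng.randint(20, 59)
        hi = min(59, lo + rng.randint(0, max_width))
        score = objective(lo, hi)
        if score < best_score:
            best_predicate, best_score = (lo, hi), score
    return best_predicate, best_score


predicate, score = random_search()
print("baseline query result:", query(ROWS))
print("best predicate (lo, hi):", predicate, "query result after removal:", score)
```

A Bayesian optimizer would replace the uniform sampling in `random_search` with a model-guided choice of the next predicate to evaluate; the objective function itself would stay a black box in the same way.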

Cited By

  • (2023) Teaching Blue Elephants the Maths for Machine Learning. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning, 1-4. DOI: 10.1145/3595360.3595852. Online publication date: 18-Jun-2023
  • (2023) XInsight: eXplainable Data Analysis Through The Lens of Causality. Proceedings of the ACM on Management of Data 1:2, 1-27. DOI: 10.1145/3589301. Online publication date: 20-Jun-2023

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 11
July 2021
732 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article
