research-article

Fauce: fast and accurate deep ensembles with uncertainty for cardinality estimation

Authors:

Dong Li

Proceedings of the VLDB Endowment, Volume 14, Issue 11

Pages 1950 - 1963

https://doi.org/10.14778/3476249.3476254

Published: 01 July 2021

Abstract

Cardinality estimation is a fundamental problem in databases. Many deep-learning-based estimators have recently been proposed for it, with promising results. However, these estimators struggle to give accurate results for complex queries because they do not capture the real inter-column and inter-table correlations. Furthermore, none of them provides uncertainty information about its estimates. In this paper, we present Fauce, a join cardinality estimator. Fauce learns the correlations across all columns and all tables in the database, and it reports the uncertainty of each estimate. Compared with the learned estimators we studied, the results are promising: (1) Fauce is lightweight, with 10× faster inference than the state-of-the-art estimator; (2) Fauce is robust to complex queries, with estimation errors 1.3× to 6.7× smaller than those of the state-of-the-art estimator; (3) to the best of our knowledge, Fauce is the first estimator that incorporates uncertainty information for cardinality estimation into a deep learning model.
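
The abstract does not say how Fauce derives its uncertainty, so the following is only a generic sketch of one common way a deep ensemble can attach an uncertainty to a prediction. It assumes each ensemble member outputs a mean and a variance for the log-cardinality of a query, and combines them as a uniform mixture. The function name `combine_ensemble`, the log-scale target, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from statistics import fmean


def combine_ensemble(means, variances):
    """Combine per-member (mean, variance) predictions into one estimate.

    Treats the ensemble as a uniform mixture of its members. This is an
    illustrative assumption, not Fauce's documented procedure.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need at least one member and one variance per mean")

    mu = fmean(means)

    # Law of total variance for a uniform mixture:
    #   average of the members' own variances (noise each member reports)
    # + spread of the members' means around mu (disagreement between members)
    within = fmean(variances)
    between = fmean([(m - mu) ** 2 for m in means])
    return mu, within + between


# Hypothetical predictions of log10(cardinality) from three members.
member_means = [3.0, 3.2, 2.8]
member_variances = [0.01, 0.02, 0.03]

mu, var = combine_ensemble(member_means, member_variances)
print(f"estimate: 10^{mu:.2f} rows, variance in log space: {var:.4f}")
```

In a sketch like this, a large combined variance signals that the members disagree or that each is individually unsure, which is the kind of signal a query optimizer could use to treat an estimate with caution.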

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment Volume 14, Issue 11

July 2021

732 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment
