Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3448016.3457568acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

Steering Query Optimizers: A Practical Take on Big Data Workloads

Published: 18 June 2021 Publication History

Abstract

In recent years, there has been tremendous interest in research that applies machine learning to database systems. Being one of the most complex components of a DBMS, query optimizers could benefit from adaptive policies that are learned systematically from the data and the query workload. Recent research has brought up novel ideas towards a learned query optimizer, however these ideas have not been evaluated on a commercial query processor or on large scale, real-world workloads. In this paper, we take the approach used by Marcus et al. in Bao and adapt it to SCOPE, a big data system used internally at Microsoft. Along the way, we solve multiple new challenges: we define how optimizer rules affect final query plans by introducing the concept of a rule signature, we devise a pipeline computing interesting rule configurations for recurring jobs, and we define a new learning problem allowing us to apply such interesting rule configurations to previously unseen jobs. We evaluate the efficacy of the approach on production workloads that include 150K daily jobs. Our results show that alternative rule configurations can generate plans with lower costs, and this can translate to runtime latency savings of 7-30% on average and up to 90% for a non trivial subset of the workload.

Supplementary Material

MP4 File (3448016.3457568.mp4)
In recent years, there has been tremendous interest in research that applies machine learning to database systems. Being one of the most complex components of a DBMS, query optimizers could benefit from adaptive policies that are learned systematically from the data and the query workload. Recent research has brought up novel ideas towards a learned query optimizer, however these ideas have not been evaluated on a commercial query processor or on large scale, real-world workloads. In this paper, we take the approach used by Marcus et al. in Bao and adapt it to SCOPE, a big data system used internally at Microsoft. Along the way, we solve multiple new challenges: we define how optimizer rules affect final query plans by introducing the concept of a Rule Signature; we devise a pipeline computing interesting rules configuration for recurring jobs; and we define a new learning problem allowing us to apply such interesting rule configurations to previously unseen jobs. We evaluate the efficacy of the approach on production workloads that include 150 jobs. Our results show that alternative rule configurations can generate plans with lower costs, and this can translate to runtime latency savings on average of 7 30% and up to 90% for a non trivial subset of the workload.

References

[1]
Sameer Agarwal, Srikanth Kandula, Nicolas Bruno, Ming-Chuan Wu, Ion Stoica, and Jingren Zhou. 2012. Re-Optimizing Data-Parallel Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (San Jose, CA) (NSDI'12). USENIX Association, USA, 21.
[2]
Edmon Begoli, Jesús Camacho-Rodr'iguez, Julian Hyde, Michael J Mior, and Daniel Lemire. 2018. Apache calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data. 221--230.
[3]
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, Vol. 1, 2 (2008), 1265--1276.
[4]
Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215--226.
[5]
Anshuman Dutt, Chi Wang, Vivek R. Narasayya, and Surajit Chaudhuri. 2020. Efficiently Approximating Selectivity Functions using Low Overhead Regression Models. Proc. VLDB Endow., Vol. 13, 11 (2020), 2215--2228. http://www.vldb.org/pvldb/vol13/p2215-dutt.pdf
[6]
Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. 2019. Selectivity estimation for range predicates using lightweight models. Proceedings of the VLDB Endowment, Vol. 12, 9 (2019), 1044--1057.
[7]
Goetz Graefe. 1995. The cascades framework for query optimization. IEEE Data Eng. Bull., Vol. 18, 3 (1995), 19--29.
[8]
Jayant R Haritsa. 2010. The Picasso database query optimizer visualizer. Proceedings of the VLDB Endowment, Vol. 3, 1--2 (2010), 1517--1520.
[9]
Alekh Jindal, Hiren Patel, Abhishek Roy, Shi Qiao, Zhicheng Yin, Rathijit Sen, and Subru Krishnan. 2019. Peregrine: Workload Optimization for Cloud Query Engines. In Proceedings of the ACM Symposium on Cloud Computing. 416--427.
[10]
Alekh Jindal, Shi Qiao, Hiren Patel, Zhicheng Yin, Jieming Di, Malay Bag, Marc Friedman, Yifung Lin, Konstantinos Karanasos, and Sriram Rao. 2018. Computation reuse in analytics job service at microsoft. In Proceedings of the 2018 International Conference on Management of Data. 191--203.
[11]
Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned cardinalities: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677 (2018).
[12]
Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).
[13]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment, Vol. 9, 3 (2015), 204--215.
[14]
Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, and Tim Kraska. 2020. Bao: Learning to Steer Query Optimizers. arXiv preprint arXiv:2004.03814 (2020).
[15]
Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A learned query optimizer. arXiv preprint arXiv:1904.03711 (2019). %balance
[16]
Ryan Marcus and Olga Papaemmanouil. 2018. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 1--4.
[17]
Corp. Microsoft. 2020. SQL Server. https://www.microsoft.com/en-us/sql-server/ Retrieved November 23, 2020 from
[18]
Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2020. Cost-Guided Cardinality Estimation: Focus Where it Matters. In 2020 IEEE 36th International Conference on Data Engineering Workshops (ICDEW). IEEE, 154--157.
[19]
Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S Sathiya Keerthi. 2019. An Empirical Analysis of Deep Learning for Cardinality Estimation. arXiv preprint arXiv:1905.06425 (2019).
[20]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
[21]
Dipanjan (DJ) Sarkar. 2019. Categorical Data. https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63
[22]
Rathijit Sen, Alekh Jindal, Hiren Patel, and Shi Qiao. 2020. AutoToken: predicting peak parallelism for big data analytics at Microsoft. Proceedings of the VLDB Endowment, Vol. 13, 12 (2020), 3326--3339.
[23]
Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, et al. 2013. F1: A distributed SQL database that scales. (2013).
[24]
Tarique Siddiqui, Alekh Jindal, Shi Qiao, Hiren Patel, and Wangchao Le. 2020. Cost models for big data query processing: Learning, retrofitting, and our findings. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 99--113.
[25]
Florian M Waas. 2008. Beyond conventional data warehousing-massively parallel data processing with greenplum database. In International Workshop on Business Intelligence for the Real-Time Enterprise. Springer, 89--96.
[26]
Florian Wolf, Michael Brendle, Norman May, Paul R Willems, Kai-Uwe Sattler, and Michael Grossniklaus. 2018. Robustness metrics for relational query execution plans. Proceedings of the VLDB Endowment, Vol. 11, 11 (2018), 1360--1372.
[27]
Lucas Woltmann, Claudio Hartmann, Maik Thiele, Dirk Habich, and Wolfgang Lehner. 2019. Cardinality estimation with local deep learning models. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 1--8.
[28]
Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment, Vol. 12, 3 (2018), 210--222.
[29]
Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: one cardinality estimator for all tables. arXiv preprint arXiv:2006.08109 (2020).
[30]
Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. arXiv preprint arXiv:1905.04278 (2019).
[31]
Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. 2010. Spark: Cluster computing with working sets. HotCloud, Vol. 10, 10--10 (2010), 95.

Cited By

View all
  • (2024)Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management SystemsProceedings of the VLDB Endowment10.14778/3681954.368203017:11(3680-3693)Online publication date: 1-Jul-2024
  • (2024)PieBridge: Fast and Parameter-Efficient On-Device Training via Proxy NetworksProceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems10.1145/3666025.3699327(126-140)Online publication date: 4-Nov-2024
  • (2024)Learned Query Optimizer: What is New and What is NextCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654692(561-569)Online publication date: 9-Jun-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Check for updates

Badges

  • Honorable Mention

Author Tags

  1. distributed query optimization
  2. learning for query optimization

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)559
  • Downloads (Last 6 weeks)65
Reflects downloads up to 01 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management SystemsProceedings of the VLDB Endowment10.14778/3681954.368203017:11(3680-3693)Online publication date: 1-Jul-2024
  • (2024)PieBridge: Fast and Parameter-Efficient On-Device Training via Proxy NetworksProceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems10.1145/3666025.3699327(126-140)Online publication date: 4-Nov-2024
  • (2024)Learned Query Optimizer: What is New and What is NextCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654692(561-569)Online publication date: 9-Jun-2024
  • (2024)Machine Learning for Databases: Foundations, Paradigms, and Open problemsCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654686(622-629)Online publication date: 9-Jun-2024
  • (2024)Lero: applying learning-to-rank in query optimizerThe VLDB Journal10.1007/s00778-024-00850-333:5(1307-1331)Online publication date: 25-Apr-2024
  • (2023)AutoSteer: Learned Query Optimization for Any SQL DatabaseProceedings of the VLDB Endowment10.14778/3611540.361154416:12(3515-3527)Online publication date: 1-Aug-2023
  • (2023)LEON: A New Framework for ML-Aided Query OptimizationProceedings of the VLDB Endowment10.14778/3598581.359859716:9(2261-2273)Online publication date: 1-May-2023
  • (2023)GEqO: ML-Accelerated Semantic Equivalence DetectionProceedings of the ACM on Management of Data10.1145/36267101:4(1-25)Online publication date: 12-Dec-2023
  • (2023)Efficiently Computing Join Orders with Heuristic SearchProceedings of the ACM on Management of Data10.1145/35889271:1(1-26)Online publication date: 30-May-2023
  • (2023)FactorJoin: A New Cardinality Estimation Framework for Join QueriesProceedings of the ACM on Management of Data10.1145/35887211:1(1-27)Online publication date: 30-May-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media