Evaluation of Machine Learning Algorithms in Predicting the Next SQL Query from the Future

Published: 18 March 2021

Abstract

Predicting a user's next SQL query during an ongoing interaction session with the database, given the sequence of queries she has issued up to the current timestep, can enable speculative query processing and improve interactivity. While existing machine learning (ML)-based approaches use recommender systems to suggest relevant queries to a user, there has been no exhaustive study on applying temporal predictors to predict the next user-issued query.
In this work, we experimentally compare ML algorithms in predicting the immediate next query in an interaction workload, given the current user query or the sequence of queries in a user session thus far. As part of this study, we propose the adaptation of two powerful temporal predictors: (a) Recurrent Neural Networks (RNNs) and (b) a Reinforcement Learning approach called Q-Learning that uses Markov Decision Processes. We represent each query as a comprehensive set of fragment embeddings that captures not only the SQL operators, attributes, and relations but also the arithmetic comparison operators and constants that occur in the query. Our experiments on two real-world datasets show the effectiveness of temporal predictors against the baseline recommender systems in predicting the structural fragments in a query with respect to both quality and time. Besides showing that RNNs can be used to synthesize novel queries, we find that exact Q-Learning outperforms RNNs despite predicting the next query entirely from the historical query logs.
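
The Q-Learning idea can be illustrated with a minimal tabular sketch. Everything in it is an assumption made for illustration, not the paper's implementation: states are raw query strings rather than fragment embeddings, the candidate actions are the distinct queries found in the log, the reward is 1 when the predicted query equals the logged next query and 0 otherwise, and the names `train_q_table` and `predict_next` as well as the values of `alpha`, `gamma`, and `epochs` are hypothetical.

```python
from collections import defaultdict


def train_q_table(sessions, alpha=0.1, gamma=0.5, epochs=50):
    """Learn Q[state][action] by replaying logged sessions (illustrative only)."""
    # Assumption: every distinct logged query is a candidate prediction.
    candidates = sorted({query for session in sessions for query in session})
    q_table = defaultdict(lambda: defaultdict(float))
    for _ in range(epochs):
        for session in sessions:
            for current, logged_next in zip(session, session[1:]):
                # Value of the best action from the state the user moves to.
                best_future = max(q_table[logged_next].values(), default=0.0)
                for action in candidates:
                    # Assumed reward: 1 for predicting the logged next query.
                    reward = 1.0 if action == logged_next else 0.0
                    target = reward + gamma * best_future
                    # Standard Q-Learning update toward the bootstrapped target.
                    q_table[current][action] += alpha * (
                        target - q_table[current][action]
                    )
    return q_table


def predict_next(q_table, current):
    """Return the highest-valued next query, or None if no transition was logged."""
    actions = q_table.get(current)
    if not actions:
        return None
    return max(actions, key=actions.get)
```

Because such a table is built only from logged transitions, a predictor of this kind can return only queries that already appear in the history, which matches the abstract's remark that Q-Learning predicts entirely from the historical query logs.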

Information & Contributors

Information

Published In

ACM Transactions on Database Systems  Volume 46, Issue 1
March 2021
143 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3457891

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 March 2021
Accepted: 01 December 2020
Revised: 01 September 2020
Received: 01 February 2020
Published in TODS Volume 46, Issue 1

Author Tags

  1. Query prediction
  2. recommender systems
  3. recurrent neural networks
  4. schema-aware SQL embeddings

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 873
  • Downloads (Last 6 weeks): 78
Reflects downloads up to 09 Nov 2024.

Citations

Cited By

  • (2024) Sibyl: Forecasting Time-Evolving Query Workloads. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639308. Online publication date: 26-Mar-2024
  • (2024) Vertically Autoscaling Monolithic Applications with CaaSPER: Scalable Container-as-a-Service Performance Enhanced Resizing Algorithm for the Cloud. Companion of the 2024 International Conference on Management of Data (241-254). DOI: 10.1145/3626246.3653378. Online publication date: 9-Jun-2024
  • (2024) Log Replaying for Real-Time HTAP: An Adaptive Epoch-Based Two-Stage Framework. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2096-2108). DOI: 10.1109/ICDE60146.2024.00167. Online publication date: 13-May-2024
  • (2023) An Analysis of AI-based SQL Injection (SQLi) Attack Detection. 2023 Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS) (31-35). DOI: 10.1109/ICAISS58487.2023.10250505. Online publication date: 23-Aug-2023
  • (2022) Predicting the Future Actions of People in the Real World to Improve Health Management. Artificial Intelligence in Data and Big Data Processing (175-187). DOI: 10.1007/978-3-030-97610-1_15. Online publication date: 19-May-2022
