ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQL
<p>The pipeline for join order optimization in Luposdate3000 with the proposed extensions in the boxes with red background color. This figure was adapted from [<a href="#B44-BDCC-08-00071" class="html-bibr">44</a>].</p> "> Figure 2
<p>RDF graph for the example about generating queries, where we use the blue nodes and edges.</p> "> Figure 3
<p>Example of a generated query.</p> "> Figure 4
<p>Example of an SPARQL query.</p> "> Figure 5
<p>Our architecture for using machine learning for join order optimization: The blue component stores its data in an external SQL DBMS. The red components are executed within Luposdate3000. The other components are implemented in Python. This figure was adapted from [<a href="#B44-BDCC-08-00071" class="html-bibr">44</a>].</p> "> Figure 6
<p>Our reward function uses <math display="inline"><semantics> <msub> <mi>v</mi> <mrow> <mi>m</mi> <mi>i</mi> <mi>n</mi> </mrow> </msub> </semantics></math> and <math display="inline"><semantics> <msub> <mi>v</mi> <mrow> <mi>m</mi> <mi>a</mi> <mi>x</mi> </mrow> </msub> </semantics></math> from the statistics, which refer to the currently known best and worst case number of intermediate results. <math display="inline"><semantics> <msub> <mi>v</mi> <mrow> <mi>c</mi> <mi>u</mi> <mi>r</mi> <mi>r</mi> <mi>e</mi> <mi>n</mi> <mi>t</mi> </mrow> </msub> </semantics></math> may receive a <math display="inline"><semantics> <mrow> <mi>n</mi> <mi>u</mi> <mi>l</mi> <mi>l</mi> </mrow> </semantics></math> value when the DBMS runs into a timeout. <math display="inline"><semantics> <mrow> <mi>n</mi> <mi>u</mi> <mi>l</mi> <mi>l</mi> </mrow> </semantics></math> values are not considered for calculating the known worst case.</p> "> Figure 7
<p>Overview of join order optimization with our machine learning approach. This figure was adapted from [<a href="#B44-BDCC-08-00071" class="html-bibr">44</a>]. Possible actions with a gray background are invalid and masked out.</p> "> Figure 8
<p>Time needed to construct an optimized join tree with a given number of inputs to join in the Luposdate3000 DBMS. The optimization of queries was repeated until 60 s were spent for each join size, then the average time per query was calculated. The benchmarks were executed on the hardware described in <a href="#sec5-BDCC-08-00071" class="html-sec">Section 5</a>.</p> "> Figure 9
<p>SPARQL evaluation on SP<sup>2</sup>B-Dataset. The chart shows the percentage of queries that require at most twice as many intermediate results as the known best case. The magenta line surrounds the area, where 70% of the queries receive good join trees. The results of baselines are presented left from the red line and the results of our proposed approach are right from the red line.</p> "> Figure 10
<p>SPARQL evaluation on WordNet-Dataset. The chart shows the percentage of queries that require at most twice as many intermediate results as the known best case. The magenta line surrounds the area, where 70% of the queries receive good join trees. The results of baselines are presented left from the red line and the results of our proposed approach are right from the red line.</p> "> Figure 11
<p>Different number of training steps on SP<sup>2</sup>B-Dataset. The chart shows the percentage of queries that require at most twice as many intermediate results as the known best case. The magenta line surrounds the area, where 70% of the queries receive good join trees. The results of baselines are presented left from the red line and the results of our proposed approach are right from the red line.</p> "> Figure 12
<p>Different number of training steps on WordNet-Dataset. The chart shows the percentage of queries that require at most twice as many intermediate results as the known best case. The results of baselines are presented left from the red line and the results of our proposed approach are right from the red line.</p> ">
Abstract
:1. Introduction
2. Related Work
2.1. Machine Learning Approaches for Optimizing SQL Queries
2.2. Machine Learning Approaches for Optimizing SPARQL Queries
2.3. Optimizing SPARQL Queries in Open Source and Commercial Semantic Web Databases and Triple Stores
3. Luposdate3000 Join Order Optimizer
4. Our Contribution
4.1. Considerations for the Machine Learning Strategy
4.2. Generating Queries
4.3. Our Approach ReJOOSp
5. Evaluation
5.1. Environment
5.2. Evaluating Different Numbers of Triple Patterns
5.3. Evaluating Different Numbers of Training Steps
5.4. Evaluating What the Model Has Learned
5.5. Limitations
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Scheufele, W.; Moerkotte, G. On the complexity of generating optimal plans with cross products. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Tucson, AZ, USA, 12–14 May 1997; pp. 238–248. [Google Scholar]
- Allam, J.R. Evaluation of a Greedy Join-Order Optimization Approach Using the IMDB Dataset. Ph.D. Thesis, University of Magdeburg, Magdeburg, Germany, 2018. [Google Scholar]
- Lan, H.; Bao, Z.; Peng, Y. A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration. Data Sci. Eng. 2021, 6, 86–101. [Google Scholar] [CrossRef]
- Gubichev, A.; Neumann, T. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT, Proceedings of the International Conference on Extending Database Technology, Athens, Greece, 24–28 March 2014; Amer-Yahia, S., Christophides, V., Kementsietsidis, A., Garofalakis, M.N., Idreos, S., Leroy, V., Eds.; Open Proceedings: Konstanz, Germany, 2014; pp. 439–450. [Google Scholar]
- Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, New York, NY, USA, 10 June 2018. [Google Scholar] [CrossRef]
- Wang, H.; Qi, Z.; Zheng, L.; Feng, Y.; Ouyang, J.; Zhang, H.; Zhang, X.; Shen, Z.; Liu, S. April: An Automatic Graph Data Management System Based on Reinforcement Learning. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, Online, 19–23 October 2020; Volume 20, pp. 3465–3468. [Google Scholar]
- Yu, X.; Li, G.; Chai, C.; Tang, N. Reinforcement Learning with Tree-LSTM for Join Order Selection. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; pp. 1297–1308. [Google Scholar]
- Heitz, J.; Stockinger, K. Join Query Optimization with Deep Reinforcement Learning Algorithms. arXiv 2019, arXiv:1911.11689. [Google Scholar]
- Hasan, R.; Gandon, F. A Machine Learning Approach to SPARQL Query Performance Prediction. In Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Washington, DC, USA, 11–14 August 2014; pp. 266–273. [Google Scholar] [CrossRef]
- Ganapathi, A.; Kuno, H.; Dayal, U.; Wiener, J.L.; Fox, A.; Jordan, M.; Patterson, D. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 29 March–2 April 2009; pp. 592–603. [Google Scholar] [CrossRef]
- Gupta, C.; Mehta, A.; Dayal, U. PQR: Predicting Query Execution Times for Autonomous Workload Management. In Proceedings of the 2008 International Conference on Autonomic Computing, Chicago, IL, USA, 2–6 June 2008; pp. 13–22. [Google Scholar]
- Zhang, W.E.; Sheng, Q.Z.; Qin, Y.; Taylor, K.; Yao, L. Learning-based SPARQL query performance modeling and prediction. World Wide Web 2017, 21, 1015–1035. [Google Scholar] [CrossRef]
- Lu, H.; Chan, H.C.; Wei, K.K. A survey on usage of SQL. In Proceedings of the ACM SIGMOD Record, Washington, DC, USA, 25–28 May 1993; Volume 22, pp. 60–65. [Google Scholar] [CrossRef]
- Zolaktaf, Z.; Milani, M.; Pottinger, R. Facilitating SQL query composition and analysis. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 209–224. [Google Scholar]
- Paasche, S.; Groppe, S. Enhancing Data Quality and Process Optimization for Smart Manufacturing Lines in Industry 4.0 Scenarios. In Proceedings of the International Workshop on Big Data in Emergent Distributed Environments (BiDEDE’22), Philadelphia, PA, USA, 12–17 June 2022. [Google Scholar]
- Arias, M.; Fernández, J.D.; Martínez-Prieto, M.A.; de la Fuente, P. An Empirical Study of Real-World SPARQL Queries. arXiv 2011, arXiv:1103.5043. [Google Scholar]
- Warnke, B.; Rehan, M.W.; Fischer, S.; Groppe, S. Flexible data partitioning schemes for parallel merge joins in semantic web queries. In Proceedings of the Datenbanksysteme für Business, Technologie und Web (BTW), Dresden, Germany, 13–17 September 2021. [Google Scholar] [CrossRef]
- Winker, T.; Groppe, S.; Uotila, V.; Yan, Z.; Lu, J.; Franz, M.; Mauerer, W. Quantum Machine Learning: Foundation, New Techniques, and Opportunities for Database Research. In Proceedings of the ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), Washington, DC, USA, 18–23 June 2023. [Google Scholar]
- Çalıkyılmaz, U.; Groppe, S.; Groppe, J.; Winker, T.; Prestel, S.; Shagieva, F.; Arya, D.; Preis, F.; Gruenwald, L. Opportunities for Quantum Acceleration of Databases: Optimization of Queries and Transaction Schedules. Proc. VLDB Endow. 2023, 16, 2344–2353. [Google Scholar] [CrossRef]
- Leis, V.; Radke, B.; Gubichev, A.; Kemper, A.; Neumann, T. Cardinality Estimation Done Right: Index-Based Join Sampling. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, 8–11 January 2017. [Google Scholar]
- Li, F.; Wu, B.; Yi, K.; Zhao, Z. Wander Join: Online Aggregation via Random Walks. In Proceedings of the 2016 International Conference on Management of Data, New York, NY, USA, 26 June–1 July 2016. [Google Scholar] [CrossRef]
- Lipton, R.J.; Naughton, J.F.; Schneider, D.A. Practical selectivity estimation through adaptive sampling. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 23–25 May 1990. [Google Scholar] [CrossRef]
- Lipton, R.J.; Naughton, J.F.; Schneider, D.A. Practical selectivity estimation through adaptive sampling. SIGMOD Rec. 1990, 19, 1–11. [Google Scholar] [CrossRef]
- Selinger, P.G.; Astrahan, M.M.; Chamberlin, D.D.; Lorie, R.A.; Price, T.G. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, MA, USA, 30 May–1 June 1979; pp. 23–34. [Google Scholar]
- Ioannidis, Y.E. The History of Histograms (abridged). In Proceedings of the 29th International Conference on Very Large Data Bases, VLDB 2003, Berlin, Germany, 9–12 September 2003; Freytag, J.C., Lockemann, P.C., Abiteboul, S., Carey, M.J., Selinger, P.G., Heuer, A., Eds.; Morgan Kaufmann: Burlington, MA, USA, 2003; pp. 19–30. [Google Scholar]
- Ioannidis, Y.E.; Christodoulakis, S. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Trans. Database Syst. 1993, 18, 709–748. [Google Scholar] [CrossRef]
- Getoor, L.; Taskar, B.; Koller, D. Selectivity estimation using probabilistic models. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 21–24 May 2001; pp. 461–472. [Google Scholar] [CrossRef]
- Tzoumas, K.; Deshpande, A.; Jensen, C.S. Lightweight graphical models for selectivity estimation without independence assumptions. Proc. VLDB Endow. 2011, 4, 852–863. [Google Scholar] [CrossRef]
- Tzoumas, K.; Deshpande, A.; Jensen, C.S. Efficiently adapting graphical models for selectivity estimation. VLDB J. 2012, 22, 3–27. [Google Scholar] [CrossRef]
- Wang, J.; Chai, C.; Liu, J.; Li, G. FACE: A normalizing flow based cardinality estimator. Proc. VLDB Endow. 2021, 15, 72–84. [Google Scholar] [CrossRef]
- Yang, Z.; Kamsetty, A.; Luan, S.; Liang, E.; Duan, Y.; Chen, X.; Stoica, I. NeuroCard: One cardinality estimator for all tables. Proc. VLDB Endow. 2020, 14, 61–73. [Google Scholar] [CrossRef]
- Zhu, R.; Wu, Z.; Han, Y.; Zeng, K.; Pfadler, A.; Qian, Z.; Zhou, J.; Cui, B. FLAT: Fast, lightweight and accurate method for cardinality estimation. Proc. VLDB Endow. 2021, 14, 1489–1502. [Google Scholar] [CrossRef]
- Kipf, A.; Kipf, T.; Radke, B.; Leis, V.; Boncz, P.A.; Kemper, A. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research, CIDR 2019, Asilomar, CA, USA, 13–16 January 2019. [Google Scholar]
- Liu, J.; Dong, W.; Zhou, Q.; Li, D. Fauce: Fast and accurate deep ensembles with uncertainty for cardinality estimation. Proc. VLDB Endow. 2021, 14, 1950–1963. [Google Scholar] [CrossRef]
- Sun, J.; Li, G. An end-to-end learning-based cost estimator. Proc. VLDB Endow. 2019, 13, 307–319. [Google Scholar] [CrossRef]
- Negi, P.; Marcus, R.; Kipf, A.; Mao, H.; Tatbul, N.; Kraska, T.; Alizadeh, M. Flow-loss: Learning cardinality estimates that matter. Proc. VLDB Endow. 2021, 14, 2019–2032. [Google Scholar] [CrossRef]
- Atserias, A.; Grohe, M.; Marx, D. Size Bounds and Query Plans for Relational Joins. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, Philadelphia, PA, USA, 25–28 October 2008; pp. 739–748. [Google Scholar]
- Cai, W.; Balazinska, M.; Suciu, D. Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 30 June–5 July 2019. [Google Scholar] [CrossRef]
- Habich, D.; Krause, A.; Pietrzyk, J.; Faerber, C.; Lehner, W. Simplicity done right for SIMDified query processing on CPU and FPGA. In Proceedings of the 1st Workshop on Simplicity in Management of Data, SiMoD@SIGMOD 2023, Bellevue, WA, USA, 23 June 2023; Porobic, D., Wang, T., Eds.; ACM: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
- Wu, Z.; Negi, P.; Alizadeh, M.; Kraska, T.; Madden, S. FactorJoin: A New Cardinality Estimation Framework for Join Queries. Proc. ACM Manag. Data 2023, 1, 41. [Google Scholar] [CrossRef]
- Eschauzier, R.; Taelman, R.; Morren, M.; Verborgh, R. Reinforcement Learning-Based SPARQL Join Ordering Optimizer. In Proceedings of the Semantic Web: ESWC 2023 Satellite Events: Hersonissos, Crete, Greece, 28 May–1 June 2023; pp. 43–47. [Google Scholar] [CrossRef]
- Ristoski P, P.H. Rdf2vec: Rdf graph embeddings for data mining. In Proceedings of the Semantic Web–ISWC 2016: 15th International Semantic Web Conference Proceedings, Part I 15, Kobe, Japan, 17–21 October 2016. [Google Scholar]
- P Karthikeyan, M.; Krishnaveni, K.; Le, D.N. Analysis of Multi-Join Query Optimization Using ACO and Q-Learning. Int. J. Comput. Digit. Syst. 2024, 15, 1–11. [Google Scholar]
- Warnke, B.; Fischer, S.; Groppe, S. Using Machine Learning and Routing Protocols for Optimizing Distributed SPARQL Queries in Collaboration. Computers 2023, 12, 210. [Google Scholar] [CrossRef]
- Neumann, T.; Weikum, G. The RDF3X engine for scalable management of RDF data. Vldb J. VLDB 2010, 19, 91–113. [Google Scholar] [CrossRef]
- Iker, B.R.; Swami, A.N. Method for Optimizing Processing of Join Queries by Determining Optimal Processing Order and Assigning Optimal Join Methods to Each of the Join Operations. U.S. Patent US5345585A, 2 December 1991. [Google Scholar]
- Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv 2020, arXiv:2005.01643. [Google Scholar]
- Lample, G.; Chaplot, D.S. Playing FPS Games with Deep Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar] [CrossRef]
- Seaborne, A.; Harris, S. SPARQL 1.1 Query Language. Technical Report. 2013. Available online: https://www.w3.org/TR/2013/REC-sparql11-query-20130321/ (accessed on 14 June 2024).
- Bonifati, A.; Martens, W.; Timm, T. An Analytical Study of Large SPARQL Query Logs. Proc. VLDB Endow. 2017, 11, 149–161. [Google Scholar] [CrossRef]
- Hill, A.; Raffin, A.; Ernestus, M.; Gleave, A.; Kanervisto, A.; Traore, R.; Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; et al. Stable Baselines. 2018. Available online: https://github.com/hill-a/stable-baselines (accessed on 14 June 2024).
- Huang, S.; Ontañón, S. A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. In Proceedings of the International FLAIRS Conference Proceedings, Clearwater Beach, FL, USA, 14–17 May 2023; Volume 35. [Google Scholar] [CrossRef]
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
- Krishnan, S.; Yang, Z.; Goldberg, K.; Hellerstein, J.M.; Stoica, I. Learning to Optimize Join Queries With Deep Reinforcement Learning. arXiv 2018, arXiv:1808.03196. [Google Scholar]
- Schmidt, M.; Hornung, T.; Lausen, G.; Pinkel, C. SP2Bench: A SPARQL Performance Benchmark. arXiv 2008, arXiv:0806.4627. [Google Scholar]
Parameter | Value |
---|---|
Learning rate | |
Discount factor | |
Clipping parameter | |
Number of epochs when optimizing surrogate loss | 10 |
Value function coefficient for the loss calculation | |
Maximum value for the gradient clipping | |
Entropy coefficient for the loss calculation | 0 |
Trade-off between bias and variance for the Generalized Advantage Estimator |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Warnke, B.; Martens, K.; Winker, T.; Groppe, S.; Groppe, J.; Adhiyaman, P.; Srinivasan, S.; Krishnakumar, S. ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQL. Big Data Cogn. Comput. 2024, 8, 71. https://doi.org/10.3390/bdcc8070071
Warnke B, Martens K, Winker T, Groppe S, Groppe J, Adhiyaman P, Srinivasan S, Krishnakumar S. ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQL. Big Data and Cognitive Computing. 2024; 8(7):71. https://doi.org/10.3390/bdcc8070071
Chicago/Turabian StyleWarnke, Benjamin, Kevin Martens, Tobias Winker, Sven Groppe, Jinghua Groppe, Prasad Adhiyaman, Sruthi Srinivasan, and Shridevi Krishnakumar. 2024. "ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQL" Big Data and Cognitive Computing 8, no. 7: 71. https://doi.org/10.3390/bdcc8070071
APA StyleWarnke, B., Martens, K., Winker, T., Groppe, S., Groppe, J., Adhiyaman, P., Srinivasan, S., & Krishnakumar, S. (2024). ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQL. Big Data and Cognitive Computing, 8(7), 71. https://doi.org/10.3390/bdcc8070071