research-article

Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration

Authors:

X. Sean WangAuthors Info & Claims

IEEE Transactions on Knowledge and Data Engineering, Volume 36, Issue 11

Pages 6532 - 6546

https://doi.org/10.1109/TKDE.2023.3341451

Published: 12 December 2023 Publication History

Abstract

For interactive data exploration, approximate query processing (AQP) is a useful approach that usually uses samples to provide a timely response for queries by trading query accuracy. Existing AQP systems often materialize samples in the memory for reuse to speed up query processing. How to tune the samples according to the workload is one of the key problems in AQP. However, since the data exploration workload is so complex that it cannot be accurately predicted, existing sample tuning approaches cannot adapt to the changing workload very well. To address this problem, this paper proposes a deep reinforcement learning-based sample tuner, <italic>RL-STuner</italic>. When tuning samples, <italic>RL-STuner</italic> considers the workload changes from a global perspective and uses a Deep Q-learning Network (DQN) model to select an optimal sample set that has the maximum utility for the current workload. In addition, this paper proposes a set of optimization mechanisms to reduce the sample tuning cost. Experimental results on both real-world and synthetic datasets show that <italic>RL-STuner</italic> outperforms the existing sample tuning approaches and achieves 1.6×-5.2× improvements on query accuracy with a low tuning cost.

References

[1]

K. Dimitriadou, O. Papaemmanouil, and Y. Diao, “Explore-by-example: An automatic query steering framework for interactive data exploration,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2014, pp. 517–528.

[2]

P. Eichmann, E. Zgraggen, C. Binnig, and T. Kraska, “IDEBench: A benchmark for interactive data exploration,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 1555–1569.

[3]

Z. Liu and J. Heer, “The effects of interactive latency on exploratory visual analysis,” IEEE Trans. Vis. Comput. Graphics, vol. 20, no. 12, pp. 2122–2131, Dec. 2014.

[4]

R. B. Miller, “Response time in man-computer conversational transactions,” in Proc. Fall Joint Comput. Conf., 1968, pp. 267–277.

[5]

B. Mozafari, “Approximate query engines: Commercial challenges and research opportunities,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2017, pp. 521–524.

[6]

S. Chaudhuri, G. Das, and V. R. Narasayya, “A robust, optimization-based approach for approximate answering of aggregate queries,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2001, pp. 295–306.

[7]

S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, “BlinkDB: Queries with bounded errors and bounded response times on very large data,” in Proc. 8th ACM Eur. Conf. Comput. Syst., 2013, pp. 29–42.

[8]

B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang, “Sample seek: Approximating aggregates with distribution precision guarantee,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 679–694.

[9]

S. Kandula et al., “Quickr: Lazily approximating complex adhoc queries in bigdata clusters,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 631–646.

[10]

A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, and T. Kraska, “Revisiting reuse for approximate query processing,” in Proc. VLDB Endowment, vol. 10, no. 10, pp. 1142–1153, 2017.

[11]

Y. Park, B. Mozafari, J. Sorenson, and J. Wang, “VerdictDB: Universalizing approximate query processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2018, pp. 1461–1476.

[12]

T. D. Nguyen, M. Shih, S. S. Parvathaneni, B. Xu, D. Srivastava, and S. Tirthapura, “Random sampling for group-by queries,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 541–552.

[13]

M. Olma, O. Papapetrou, R. Appuswamy, and A. Ailamaki, “Taster: Self-tuning, elastic and online approximate query processing,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 482–493.

[14]

J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. M. VanBriesen, and N. S. Glance, “Cost-effective outbreak detection in networks,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, P. Berkhin, R. Caruana, and X. Wu, Eds., 2007, pp. 420–429.

[15]

V. Mnih et al., “Playing atari with deep reinforcement learning,” 2013,.

[16]

Z. Wu, Y. Jing, Z. He, C. Guo, and X. S. Wang, “POLYTOPE: A flexible sampling system for answering exploratory queries,” World Wide Web, vol. 23, no. 1, pp. 1–22, 2020.

[17]

C. J. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3/4, pp. 279–292, 1992.

Digital Library

[18]

M. A. Wiering and M. van Otterlo, Eds., Reinforcement Learn., 2012, vol. 12.

[19]

S. Lohr, Sampling: Design and Analysis. Pacific Grove, CA, USA: Brooks/Cole, 2010.

[20]

K. Li and G. Li, “Approximate query processing: What is new and where to go? - A survey on approximate query processing,” Data Sci. Eng., vol. 3, no. 4, pp. 379–397, 2018.

[21]

Y. Tao, X. Lian, D. Papadias, and M. Hadjieleftheriou, “Random sampling for continuous streams with arbitrary updates,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 96–110, Jan. 2007.

Digital Library

[22]

H. Zhang, Y. Zhang, Z. He, Y. Jing, K. Zhang, and X. S. Wang, “An agile sample maintenance approach for agile analytics,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 757–768.

[23]

Bureau of transportation statistics, Flights dataset, 2020.

[24]

TPC-H Benchmark, 1993. [Online]. Available: http://www.tpc.org/tpch/

[25]

H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.

[26]

J. M. Hellerstein, P. J. Haas, and H. J. Wang, “Online aggregation,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1997, pp. 171–182.

[27]

P. J. Haas, “Large-sample and deterministic confidence intervals for online aggregation,” in Proc. 9th Int. Conf. Sci. Statist. Database Manage., 1997, pp. 51–63.

[28]

P. J. Haas and J. M. Hellerstein, “Ripple joins for online aggregation,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1999, pp. 287–298.

[29]

S. Wu, S. Jiang, B. C. Ooi, and K. Tan, “Distributed online aggregation,” in Proc. VLDB Endowment, vol. 2, no. 1, pp. 443–454, 2009.

[30]

S. Chen, P. B. Gibbons, and S. Nath, “PR-join: A non-blocking join achieving higher early result rate with statistical guarantees,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 147–158.

[31]

K. Zeng, S. Agarwal, A. Dave, M. Armbrust, and I. Stoica, “G-OLA: Generalized on-line aggregation for interactive analysis on Big Data,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2015, pp. 913–918.

[32]

I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, “ApproxHadoop: Bringing approximations to MapReduce frameworks,” in Proc. 20th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2015, pp. 383–397.

[33]

S. Han, H. Wang, J. Wan, and J. Li, “An iterative scheme for leverage-based approximate aggregation,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 494–505.

[34]

F. Li, B. Wu, K. Yi, and Z. Zhao, “Wander join and XDB: Online aggregation via random walks,” ACM Trans. Database Syst., vol. 44, no. 1, pp. 2:1–2:41, 2019.

[35]

F. Li, B. Wu, K. Yi, and Z. Zhao, “Wander join: Online aggregation via random walks,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 615–629.

[36]

N. Sheoran, S. Chockchowwat, A. Chheda, S. Wang, R. Verma, and Y. Park, “A step toward deep online aggregation,” in Proc. ACM Manage. Data, vol. 1, no. 2, pp. 124:1–124:28, 2023.

[37]

Z. Zhao, D. Xie, and F. Li, “AB-tree: Index for concurrent random sampling and updates,” in Proc. VLDB Endowment, vol. 15, no. 9, pp. 1835–1847, 2022.

[38]

Y. Tao, “Algorithmic techniques for independent query sampling,” in Proc. 41st ACM SIGMOD-SIGACT-SIGAI Symp. Princ. Database Syst., 2022, pp. 129–138.

[39]

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy, “The aqua approximate query answering system,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1999, pp. 574–576.

[40]

V. Ganti, M. Lee, and R. Ramakrishnan, “ICICLES: Self-tuning samples for approximate query answering,” in Proc. 26th Int. Conf. Very Large Data Bases, 2000, pp. 176–187.

[41]

B. Babcock, S. Chaudhuri, and G. Das, “Dynamic sample selection for approximate query processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2003, pp. 539–550.

[42]

C. Jermaine, “Robust estimation with sampling and approximate pre-aggregation,” in Proc. 29th Int. Conf. Very Large Data Bases, 2003, pp. 886–897.

[43]

S. Chaudhuri, G. Das, and V. R. Narasayya, “Optimized stratified sampling for approximate query processing,” ACM Trans. Database Syst., vol. 32, no. 2, pp. 9–es, 2007.

[44]

L. Sidirourgos, M. L. Kersten, and P. A. Boncz, “Sciborq: Scientific data management with bounds on runtime and quality,” in Proc. Biennial Conf. Innov. Data Syst. Res., 2011, pp. 296–301.

[45]

K. Li, Y. Zhang, G. Li, W. Tao, and Y. Yan, “Bounded approximate query processing,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, pp. 2262–2276, Dec. 2019.

[46]

X. Liang, S. Sintos, Z. Shang, and S. Krishnan, “Combining aggregation and sampling (nearly) optimally for approximate query processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2021, pp. 1129–1141.

[47]

S. Acharya, P. B. Gibbons, and V. Poosala, “Congressional samples for approximate answering of group-by queries,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000, pp. 487–498.

[48]

Y. Park, A. S. Tajik, M. J. Cafarella, and B. Mozafari, “Database learning: Toward a database that becomes smarter every time,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2017, pp. 587–602.

[49]

Q. Ma and P. Triantafillou, “DBEst: Revisiting approximate query processing engines with machine learning models,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2019, pp. 1553–1570.

[50]

B. Walenz, S. Sintos, S. Roy, and J. Yang, “Learning to sample: Counting with complex queries,” in Proc. Very Large Data Base Endowment, vol. 13, no. 3, pp. 390–402, 2019.

[51]

B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, and C. Binnig, “DeepDB: Learn from data, not from queries!,” in Proc. Very Large Data Base Endowment, vol. 13, no. 7, pp. 992–1005, 2020.

[52]

A. M. Shanghooshabad, M. Kurmanji, Q. Ma, M. Shekelyan, M. Almasi, and P. Triantafillou, “PGMJoins: Random join sampling with graphical models,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2021, pp. 1610–1622.

[53]

S. Thirumuruganathan, S. Hasan, N. Koudas, and G. Das, “Approximate query processing for data exploration using deep generative models,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 1309–1320.

[54]

Q. Ma, A. M. Shanghooshabad, M. Almasi, M. Kurmanji, and P. Triantafillou, “Learned approximate query processing: Make it light, accurate and fast,” in Proc. Biennial Conf. Innov. Data Syst. Res., 2021.

[55]

J. Peng, B. Ding, J. Wang, K. Zeng, and J. Zhou, “One size does not fit all: A bandit-based sampler combination framework with theoretical guarantees,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2022, pp. 531–544.

[56]

G. Li, X. Zhou, S. Li, and B. Gao, “QTune: A query-aware database tuning system with deep reinforcement learning,” in Proc. VLDB Endowment, vol. 12, no. 12, pp. 2118–2130, 2019.

[57]

J. Zhang et al., “An end-to-end automatic cloud database tuning system using deep reinforcement learning,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2019, pp. 415–432.

[58]

X. Liang, A. J. Elmore, and S. Krishnan, “Opportunistic view materialization with deep reinforcement learning,” 2019,.

[59]

H. Yuan, G. Li, L. Feng, J. Sun, and Y. Han, “Automatic view generation with deep learning and reinforcement learning,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 1501–1512.

Index Terms

Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration
1. Theory of computation

Index terms have been assigned to the content through auto-classification.

Recommendations

Learned Optimizer for Online Approximate Query Processing in Data Exploration
In the interactive data exploration, approximate query processing (AQP) can be used to quickly return query results at the cost of accuracy. For online AQP, the sampler can be treated as an operator in the query plan. During the query optimization for AQP,...
Dynamic sample selection for approximate query processing
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data

In decision support applications, the ability to provide fast approximate answers to aggregation queries is desirable. One commonly-used technique for approximate query answering is sampling. For many aggregation queries, appropriately constructed ...
A sample advisor for approximate query processing
ADBIS'10: Proceedings of the 14th east European conference on Advances in databases and information systems

The rapid growth of current data warehouse systems makes random sampling a crucial component of modern data management systems. Although there is a large body of work on database sampling, the problem of automatic sample selection remained (almost) ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering

IEEE Transactions on Knowledge and Data Engineering Volume 36, Issue 11

Nov. 2024

1887 pages

Issue’s Table of Contents

1041-4347 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 12 December 2023

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 29 Nov 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents