Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration

Published: 12 December 2023 Publication History

Abstract

For interactive data exploration, approximate query processing (AQP) is a useful approach that usually uses samples to provide a timely response for queries by trading query accuracy. Existing AQP systems often materialize samples in the memory for reuse to speed up query processing. How to tune the samples according to the workload is one of the key problems in AQP. However, since the data exploration workload is so complex that it cannot be accurately predicted, existing sample tuning approaches cannot adapt to the changing workload very well. To address this problem, this paper proposes a deep reinforcement learning-based sample tuner, <italic>RL-STuner</italic>. When tuning samples, <italic>RL-STuner</italic> considers the workload changes from a global perspective and uses a Deep Q-learning Network (DQN) model to select an optimal sample set that has the maximum utility for the current workload. In addition, this paper proposes a set of optimization mechanisms to reduce the sample tuning cost. Experimental results on both real-world and synthetic datasets show that <italic>RL-STuner</italic> outperforms the existing sample tuning approaches and achieves 1.6&#x00D7;-5.2&#x00D7; improvements on query accuracy with a low tuning cost.

References

[1]
K. Dimitriadou, O. Papaemmanouil, and Y. Diao, “Explore-by-example: An automatic query steering framework for interactive data exploration,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2014, pp. 517–528.
[2]
P. Eichmann, E. Zgraggen, C. Binnig, and T. Kraska, “IDEBench: A benchmark for interactive data exploration,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 1555–1569.
[3]
Z. Liu and J. Heer, “The effects of interactive latency on exploratory visual analysis,” IEEE Trans. Vis. Comput. Graphics, vol. 20, no. 12, pp. 2122–2131, Dec. 2014.
[4]
R. B. Miller, “Response time in man-computer conversational transactions,” in Proc. Fall Joint Comput. Conf., 1968, pp. 267–277.
[5]
B. Mozafari, “Approximate query engines: Commercial challenges and research opportunities,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2017, pp. 521–524.
[6]
S. Chaudhuri, G. Das, and V. R. Narasayya, “A robust, optimization-based approach for approximate answering of aggregate queries,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2001, pp. 295–306.
[7]
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, “BlinkDB: Queries with bounded errors and bounded response times on very large data,” in Proc. 8th ACM Eur. Conf. Comput. Syst., 2013, pp. 29–42.
[8]
B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang, “Sample seek: Approximating aggregates with distribution precision guarantee,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 679–694.
[9]
S. Kandula et al., “Quickr: Lazily approximating complex adhoc queries in bigdata clusters,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 631–646.
[10]
A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, and T. Kraska, “Revisiting reuse for approximate query processing,” in Proc. VLDB Endowment, vol. 10, no. 10, pp. 1142–1153, 2017.
[11]
Y. Park, B. Mozafari, J. Sorenson, and J. Wang, “VerdictDB: Universalizing approximate query processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2018, pp. 1461–1476.
[12]
T. D. Nguyen, M. Shih, S. S. Parvathaneni, B. Xu, D. Srivastava, and S. Tirthapura, “Random sampling for group-by queries,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 541–552.
[13]
M. Olma, O. Papapetrou, R. Appuswamy, and A. Ailamaki, “Taster: Self-tuning, elastic and online approximate query processing,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 482–493.
[14]
J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. M. VanBriesen, and N. S. Glance, “Cost-effective outbreak detection in networks,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, P. Berkhin, R. Caruana, and X. Wu, Eds., 2007, pp. 420–429.
[15]
V. Mnih et al., “Playing atari with deep reinforcement learning,” 2013,.
[16]
Z. Wu, Y. Jing, Z. He, C. Guo, and X. S. Wang, “POLYTOPE: A flexible sampling system for answering exploratory queries,” World Wide Web, vol. 23, no. 1, pp. 1–22, 2020.
[17]
C. J. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3/4, pp. 279–292, 1992.
[18]
M. A. Wiering and M. van Otterlo, Eds., Reinforcement Learn., 2012, vol. 12.
[19]
S. Lohr, Sampling: Design and Analysis. Pacific Grove, CA, USA: Brooks/Cole, 2010.
[20]
K. Li and G. Li, “Approximate query processing: What is new and where to go? - A survey on approximate query processing,” Data Sci. Eng., vol. 3, no. 4, pp. 379–397, 2018.
[21]
Y. Tao, X. Lian, D. Papadias, and M. Hadjieleftheriou, “Random sampling for continuous streams with arbitrary updates,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 96–110, Jan. 2007.
[22]
H. Zhang, Y. Zhang, Z. He, Y. Jing, K. Zhang, and X. S. Wang, “An agile sample maintenance approach for agile analytics,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 757–768.
[23]
Bureau of transportation statistics, Flights dataset, 2020.
[24]
TPC-H Benchmark, 1993. [Online]. Available: http://www.tpc.org/tpch/
[25]
H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[26]
J. M. Hellerstein, P. J. Haas, and H. J. Wang, “Online aggregation,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1997, pp. 171–182.
[27]
P. J. Haas, “Large-sample and deterministic confidence intervals for online aggregation,” in Proc. 9th Int. Conf. Sci. Statist. Database Manage., 1997, pp. 51–63.
[28]
P. J. Haas and J. M. Hellerstein, “Ripple joins for online aggregation,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1999, pp. 287–298.
[29]
S. Wu, S. Jiang, B. C. Ooi, and K. Tan, “Distributed online aggregation,” in Proc. VLDB Endowment, vol. 2, no. 1, pp. 443–454, 2009.
[30]
S. Chen, P. B. Gibbons, and S. Nath, “PR-join: A non-blocking join achieving higher early result rate with statistical guarantees,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 147–158.
[31]
K. Zeng, S. Agarwal, A. Dave, M. Armbrust, and I. Stoica, “G-OLA: Generalized on-line aggregation for interactive analysis on Big Data,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2015, pp. 913–918.
[32]
I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, “ApproxHadoop: Bringing approximations to MapReduce frameworks,” in Proc. 20th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2015, pp. 383–397.
[33]
S. Han, H. Wang, J. Wan, and J. Li, “An iterative scheme for leverage-based approximate aggregation,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 494–505.
[34]
F. Li, B. Wu, K. Yi, and Z. Zhao, “Wander join and XDB: Online aggregation via random walks,” ACM Trans. Database Syst., vol. 44, no. 1, pp. 2:1–2:41, 2019.
[35]
F. Li, B. Wu, K. Yi, and Z. Zhao, “Wander join: Online aggregation via random walks,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 615–629.
[36]
N. Sheoran, S. Chockchowwat, A. Chheda, S. Wang, R. Verma, and Y. Park, “A step toward deep online aggregation,” in Proc. ACM Manage. Data, vol. 1, no. 2, pp. 124:1–124:28, 2023.
[37]
Z. Zhao, D. Xie, and F. Li, “AB-tree: Index for concurrent random sampling and updates,” in Proc. VLDB Endowment, vol. 15, no. 9, pp. 1835–1847, 2022.
[38]
Y. Tao, “Algorithmic techniques for independent query sampling,” in Proc. 41st ACM SIGMOD-SIGACT-SIGAI Symp. Princ. Database Syst., 2022, pp. 129–138.
[39]
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy, “The aqua approximate query answering system,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1999, pp. 574–576.
[40]
V. Ganti, M. Lee, and R. Ramakrishnan, “ICICLES: Self-tuning samples for approximate query answering,” in Proc. 26th Int. Conf. Very Large Data Bases, 2000, pp. 176–187.
[41]
B. Babcock, S. Chaudhuri, and G. Das, “Dynamic sample selection for approximate query processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2003, pp. 539–550.
[42]
C. Jermaine, “Robust estimation with sampling and approximate pre-aggregation,” in Proc. 29th Int. Conf. Very Large Data Bases, 2003, pp. 886–897.
[43]
S. Chaudhuri, G. Das, and V. R. Narasayya, “Optimized stratified sampling for approximate query processing,” ACM Trans. Database Syst., vol. 32, no. 2, pp. 9–es, 2007.
[44]
L. Sidirourgos, M. L. Kersten, and P. A. Boncz, “Sciborq: Scientific data management with bounds on runtime and quality,” in Proc. Biennial Conf. Innov. Data Syst. Res., 2011, pp. 296–301.
[45]
K. Li, Y. Zhang, G. Li, W. Tao, and Y. Yan, “Bounded approximate query processing,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, pp. 2262–2276, Dec. 2019.
[46]
X. Liang, S. Sintos, Z. Shang, and S. Krishnan, “Combining aggregation and sampling (nearly) optimally for approximate query processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2021, pp. 1129–1141.
[47]
S. Acharya, P. B. Gibbons, and V. Poosala, “Congressional samples for approximate answering of group-by queries,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000, pp. 487–498.
[48]
Y. Park, A. S. Tajik, M. J. Cafarella, and B. Mozafari, “Database learning: Toward a database that becomes smarter every time,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2017, pp. 587–602.
[49]
Q. Ma and P. Triantafillou, “DBEst: Revisiting approximate query processing engines with machine learning models,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2019, pp. 1553–1570.
[50]
B. Walenz, S. Sintos, S. Roy, and J. Yang, “Learning to sample: Counting with complex queries,” in Proc. Very Large Data Base Endowment, vol. 13, no. 3, pp. 390–402, 2019.
[51]
B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, and C. Binnig, “DeepDB: Learn from data, not from queries!,” in Proc. Very Large Data Base Endowment, vol. 13, no. 7, pp. 992–1005, 2020.
[52]
A. M. Shanghooshabad, M. Kurmanji, Q. Ma, M. Shekelyan, M. Almasi, and P. Triantafillou, “PGMJoins: Random join sampling with graphical models,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2021, pp. 1610–1622.
[53]
S. Thirumuruganathan, S. Hasan, N. Koudas, and G. Das, “Approximate query processing for data exploration using deep generative models,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 1309–1320.
[54]
Q. Ma, A. M. Shanghooshabad, M. Almasi, M. Kurmanji, and P. Triantafillou, “Learned approximate query processing: Make it light, accurate and fast,” in Proc. Biennial Conf. Innov. Data Syst. Res., 2021.
[55]
J. Peng, B. Ding, J. Wang, K. Zeng, and J. Zhou, “One size does not fit all: A bandit-based sampler combination framework with theoretical guarantees,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2022, pp. 531–544.
[56]
G. Li, X. Zhou, S. Li, and B. Gao, “QTune: A query-aware database tuning system with deep reinforcement learning,” in Proc. VLDB Endowment, vol. 12, no. 12, pp. 2118–2130, 2019.
[57]
J. Zhang et al., “An end-to-end automatic cloud database tuning system using deep reinforcement learning,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2019, pp. 415–432.
[58]
X. Liang, A. J. Elmore, and S. Krishnan, “Opportunistic view materialization with deep reinforcement learning,” 2019,.
[59]
H. Yuan, G. Li, L. Feng, J. Sun, and Y. Han, “Automatic view generation with deep learning and reinforcement learning,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 1501–1512.

Index Terms

  1. Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image IEEE Transactions on Knowledge and Data Engineering
    IEEE Transactions on Knowledge and Data Engineering  Volume 36, Issue 11
    Nov. 2024
    1887 pages

    Publisher

    IEEE Educational Activities Department

    United States

    Publication History

    Published: 12 December 2023

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 0
      Total Downloads
    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 29 Nov 2024

    Other Metrics

    Citations

    View Options

    View options

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media