Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization

Published: 02 May 2024 Publication History

Abstract

A package query returns a package---a multiset of tuples---that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as kd-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex respectively, that are highly accurate and orders of magnitude faster.

References

[1]
Abdurro'uf and et al. 2021. The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data. arXiv:2112.02026.
[2]
V. A. Alegana, P. M. Atkinson, C. Pezzulo, A. Sorichetta, D. Weiss, T. Bird, E. Erbach-Schoenberg, and A. J. Tatem. 2015. Fine resolution mapping of population age-structures for health and development applications. Journal of The Royal Society Interface 12, 105 (April 2015), 20150073.
[3]
Timo Berthold. 2009. RENS-relaxation enforced neighborhood search. Technical Report. Zuse Institute Berlin (ZIB).
[4]
Robert Bixby and Alexander Martin. 2000. Parallelizing the Dual Simplex Method. INFORMS Journal on Computing 12 (02 2000), 45--56.
[5]
Matteo Brucato, Azza Abouzied, and Alexandra Meliou. 2018. Package queries: efficient and scalable computation of high-order constraints. The VLDB Journal 27, 5 (01 Oct 2018), 693--718.
[6]
Wei Cui, Qianxi Zhang, Spyros Blanas, Jesús Camacho-Rodriguez, Brandon Haynes, Yinan Li, Ravi Ramamurthy, Peng Cheng, Rathijit Sen, and Matteo Interlandi. 2023. Query Processing on Gaming Consoles. In Proceedings of the 19th International Workshop on Data Management on New Hardware (Seattle, WA, USA) (DaMoN '23). Association for Computing Machinery, New York, NY, USA, 86--88.
[7]
Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE 5, 1 (1998), 46--55.
[8]
Raphael Finkel and Jon Bentley. 1974. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Inf. 4 (03 1974), 1--9.
[9]
Matteo Fischetti and Andrea Lodi. 2011. Heuristics in Mixed Integer Programming.
[10]
Gurobi Optimization, LLC. 2022. Gurobi Optimizer Reference Manual. https://www.gurobi.com
[11]
Boukthir Haddar, Mahdi Khemakhem, Saïd Hanafi, and Christophe Wilbaut. 2015. A hybrid heuristic for the 0--1 Knapsack Sharing Problem. Expert Systems with Applications 42, 10 (June 2015), 4653--4666.
[12]
J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100--108. http://www.jstor.org/stable/2346830
[13]
Dong He, Supun C Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes. Proc. VLDB Endow. 15, 11 (jul 2022), 2811--2825.
[14]
Frederick S. Hillier. 1967. Introduction to Operations Research. San Francisco, Holden-Day.
[15]
Q. Huangfu and J. A. J. Hall. 2018. Parallelizing the dual revised simplex method. Mathematical Programming Computation 10, 1 (01 Mar 2018), 119--142.
[16]
Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2014. Interactive Data Exploration Using Semantic Windows. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 505--516.
[17]
Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2015. Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data. Proc. VLDB Endow. 8, 10 (jun 2015), 1094--1105.
[18]
Leonard Kaufman and Peter J. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley.
[19]
Bernard Knueven, James Ostrowski, and Jean-Paul Watson. 2020. On Mixed-Integer Programming Formulations for the Unit Commitment Problem. INFORMS Journal on Computing (June 2020).
[20]
C. C. N. Kuhn, G. Calbert, I. Garanovich, and T. Weir. 2023. Integer linear programming supporting portfolio design. arXiv:2303.14364 [math.OC]
[21]
Xiaoqian Li and Kwan L. Yeung. 2019. Traffic Engineering in Segment Routing using MILP. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE.
[22]
Anh Mai, Matteo Brucato, Azza Abouzied, Peter J. Haas, and Alexandra Meliou. 2023. Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization. https://github.com/alm818/PackageQuery. arXiv:2307.02860 [cs.DB]
[23]
Rajesh Matai, Surya Singh, and Murari Lal Mittal. 2010. Traveling Salesman Problem: an Overview of Applications, Formulations, and Solution Approaches. In Traveling Salesman Problem, Donald Davendra (Ed.). IntechOpen, Rijeka, Chapter 1.
[24]
Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, Ravichandra Addanki, Tharindi Hapuarachchi, Thomas Keck, James Keeling, Pushmeet Kohli, Ira Ktena, Yujia Li, Oriol Vinyals, and Yori Zwols. 2020. Solving Mixed Integer Programs Using Neural Networks.
[25]
Deepak Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen P. Boyd, and Matei Zaharia. 2021. Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP. CoRR abs/2110.11927 (2021). arXiv:2110.11927 https://arxiv.org/abs/2110.11927
[26]
J.C. Nash. 2000. The (Dantzig) simplex method for linear programming. Computing in Science & Engineering 2, 1 (2000), 29--31.
[27]
Ravi Netravali, Vikram Nathan, James Mickens, and Hari Balakrishnan. 2018. Vesper: Measuring Time-to-Interactivity for Web Pages. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 217--231. https://www.usenix.org/conference/nsdi18/presentation/netravali-vesper
[28]
Michael J. Panik. 1996. The Dual Simplex, Primal-Dual, and Complementary Pivot Methods. Springer US, Boston, MA, 251--288.
[29]
Maurice Roux. 2015. A comparative study of divisive hierarchical clustering algorithms. CoRR abs/1506.08977 (2015). arXiv:1506.08977 http://arxiv.org/abs/1506.08977
[30]
Georg Sander and Adrian Vasiliu. 2005. Visualization and ILOG CPLEX. In Graph Drawing, János Pach (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 510--511.
[31]
TPCH [n.d.]. TPC-H Decision Support Benchmark. https://www.tpc.org/tpch/.
[32]
Vijay V. Vazirani. 2003. Approximation Algorithms. Springer Berlin Heidelberg, 108.
[33]
Laurynas Šikśnys and Torben Bach Pedersen. 2016. SolveDB: Integrating Optimization Problem Solvers Into SQL Databases. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management (Budapest, Hungary) (SSDBM '16). Association for Computing Machinery, New York, NY, USA, Article 14, 12 pages.

Cited By

View all
  • (2024)Counterfactual Explanation at Will, with Zero Privacy LeakageProceedings of the ACM on Management of Data10.1145/36549332:3(1-29)Online publication date: 30-May-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 17, Issue 5
January 2024
233 pages
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 02 May 2024
Published in PVLDB Volume 17, Issue 5

Check for updates

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)20
  • Downloads (Last 6 weeks)5
Reflects downloads up to 13 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Counterfactual Explanation at Will, with Zero Privacy LeakageProceedings of the ACM on Management of Data10.1145/36549332:3(1-29)Online publication date: 30-May-2024

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media