
Efficient construction of approximate ad-hoc ML models through materialization and reuse

Published: 01 July 2018

Abstract

Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies, and the data used to build an ML model is often aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that support a wide variety of ML models, including generalized linear models for classification as well as K-Means and Gaussian Mixture models for clustering. We propose a cost-based optimization framework that identifies appropriate ML models to combine at query time, and we conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models with superior performance, achieving speedups of several orders of magnitude on very large datasets.
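The paper's own merging algorithms are not reproduced here, but the core idea of model reuse for clustering can be illustrated with a simple sketch: treat the centroids of each materialized sub-model (e.g. one K-Means model per quarter) as points weighted by their cluster sizes, then re-cluster those weighted centroids to approximate a model over the union of the partitions. The function names and the weighted Lloyd's-iteration approach below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd's algorithm on weighted points (here: centroids of sub-models)."""
    rng = np.random.default_rng(seed)
    # initialize centers by sampling k distinct input points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each weighted point to its nearest center
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                # weighted mean of the points assigned to center j
                centers[j] = (points[mask] * w[:, None]).sum(0) / w.sum()
    return centers

def merge_kmeans_models(models, k):
    """models: list of (centroids, cluster_sizes) pairs, one per partition.

    Returns k approximate centroids for the union of all partitions,
    without revisiting the underlying raw data.
    """
    pts = np.vstack([c for c, _ in models])
    wts = np.concatenate([s for _, s in models])
    return weighted_kmeans(pts, wts, k)
```

For example, four quarterly models, each summarized by its centroids and cluster sizes, can be merged into an approximate full-year model with `merge_kmeans_models(quarterly_models, k)`. This captures only the simplest flavor of reuse; the paper additionally covers GLMs and Gaussian Mixture models and chooses which materialized models to combine via a cost-based optimizer.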


Cited By

  • (2022) Optimizing machine learning inference queries with correlative proxy models. Proceedings of the VLDB Endowment 15(10): 2032–2044, 1 Jun 2022. doi:10.14778/3547305.3547310
  • (2022) Materialization and Reuse Optimizations for Production Data Science Pipelines. Proceedings of the 2022 International Conference on Management of Data, pages 1962–1976, 10 Jun 2022. doi:10.1145/3514221.3526186
  • (2020) PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 447–462, 11 Jun 2020. doi:10.1145/3318464.3380571


    Published In

    Proceedings of the VLDB Endowment, Volume 11, Issue 11
    July 2018
    507 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment


    Qualifiers

    • Research-article
