
Efficient construction of approximate ad-hoc ML models through materialization and reuse

Published: 01 July 2018

Abstract

Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies, and the data used to build an ML model is often aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that support a wide variety of ML models, including generalized linear models for classification as well as K-Means and Gaussian Mixture models for clustering. We propose a cost-based optimization framework that identifies appropriate ML models to combine at query time, and we conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models with superior performance, achieving speedups of several orders of magnitude on very large datasets.
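The paper's own merging algorithms are not reproduced here, but the core idea of model reuse for clustering can be illustrated with a simple sketch: treat the centroids of each materialized sub-model (e.g. one K-Means model per quarter) as points weighted by their cluster sizes, then re-cluster those weighted centroids to approximate a model over the union of the partitions. The function names and the weighted Lloyd's-iteration approach below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd's algorithm on weighted points (here: centroids of sub-models)."""
    rng = np.random.default_rng(seed)
    # initialize centers by sampling k distinct input points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each weighted point to its nearest center
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                # weighted mean of the points assigned to center j
                centers[j] = (points[mask] * w[:, None]).sum(0) / w.sum()
    return centers

def merge_kmeans_models(models, k):
    """models: list of (centroids, cluster_sizes) pairs, one per partition.

    Returns k approximate centroids for the union of all partitions,
    without revisiting the underlying raw data.
    """
    pts = np.vstack([c for c, _ in models])
    wts = np.concatenate([s for _, s in models])
    return weighted_kmeans(pts, wts, k)
```

For example, four quarterly models, each summarized by its centroids and cluster sizes, can be merged into an approximate full-year model with `merge_kmeans_models(quarterly_models, k)`. This captures only the simplest flavor of reuse; the paper additionally covers GLMs and Gaussian Mixture models and chooses which materialized models to combine via a cost-based optimizer.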


Cited By

  • (2022) Optimizing machine learning inference queries with correlative proxy models. Proceedings of the VLDB Endowment 15(10): 2032–2044, 1 Jun 2022. doi:10.14778/3547305.3547310
  • (2022) Materialization and Reuse Optimizations for Production Data Science Pipelines. Proceedings of the 2022 International Conference on Management of Data, pages 1962–1976, 10 Jun 2022. doi:10.1145/3514221.3526186
  • (2020) PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 447–462, 11 Jun 2020. doi:10.1145/3318464.3380571


    Published In

    Proceedings of the VLDB Endowment, Volume 11, Issue 11
    July 2018
    507 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment


    Qualifiers

    • Research-article
