Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3196959.3196960acmconferencesArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
research-article
Public Access

In-Database Learning with Sparse Tensors

Published: 27 May 2018 Publication History

Abstract

In-database analytics is of great practical importance as it avoids the costly repeated loop data scientists have to deal with on a daily basis: select features, export the data, convert data format, train models using an external tool, reimport the parameters. It is also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This paper introduces a unified framework for training and evaluating a class of statistical learning models inside a relational database. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from relational database theory such as schema information, query structure, recent advances in query evaluation algorithms, and from linear algebra such as various tensor and matrix operations, one can formulate in-database learning problems and design efficient algorithms to solve them. The algorithms and models proposed in the paper have already been implemented and deployed in retail-planning and forecasting applications, with significant performance benefits over out-of-database solutions that require the costly data-export loop.

References

[1]
Martın Abadi et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR Vol. abs/1603.04467 (2016).
[2]
Serge Abiteboul et al. 2017. Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151). CoRR Vol. abs/1701.09007 (2017).
[3]
S. Abiteboul, R. Hull, and V. Vianu. 1995. Foundations of Databases. Addison-Wesley. showLCCN94019295
[4]
Mahmoud Abo Khamis, Hung Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2017. In-Database Learning with Sparse Tensors. CoRR Vol. abs/1703.04780 (2017).
[5]
Mahmoud Abo Khamis, Hung Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. AC/DC: In-Database Learning Thunderstruck. CoRR Vol. abs/1803.07480 (2018).
[6]
Mahmoud Abo Khamis, Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2016. Joins via Geometric Resolutions: Worst Case and Beyond. ACM Trans. Database Syst. Vol. 41, 4 (2016), 22:1--22:45.
[7]
Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. 2015. FAQ: Questions Asked Frequently. CoRR Vol. abs/1504.04044 (2015).
[8]
Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. 2016. FAQ: Questions Asked Frequently. In PODS. 13--28.
[9]
Isolde Adler. 2006. Width functions for hypertree decompositions. Ph.D. Dissertation, Albert-Ludwigs-Universit"at Freiburg. 2006.
[10]
Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. 1996. Advances in Knowledge Discovery and Data Mining. Chapter Fast Discovery of Association Rules, 307--328.
[11]
Molham Aref, Balder ten Cate, Todd J. Green, Benny Kimelfeld, Dan Olteanu, Emir Pasalic, Todd L. Veldhuizen, and Geoffrey Washburn. 2015. Design and Implementation of the LogicBlox System SIGMOD. 1371--1382.
[12]
Albert Atserias, Martin Grohe, and Dániel Marx. 2008. Size Bounds and Query Plans for Relational Joins. In FOCS. 739--748.
[13]
Nurzhan Bakibayev, Tomás Kociský, Dan Olteanu, and Jakub Závodný. 2013. Aggregation and Ordering in Factorised Databases. PVLDB Vol. 6, 14 (2013), 1990--2001.
[14]
Jonathan Barzilai and Jonathan M. Borwein. 1988. Two-point step size gradient methods. IMA J. Numer. Anal. Vol. 8, 1 (1988), 141--148.
[15]
Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas Burdick, and Shivakumar Vaithyanathan. 2014. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB Vol. 7, 7 (2014), 553--564.
[16]
Léon Bottou. 2012. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade (2nd ed). 421--436.
[17]
Jean-Francois Boulicaut and Cyrille Masson. 2005. Data Mining Query Languages. 715--726.
[18]
Timothy A. Davis and William W. Hager. 2001. Multiple-rank modifications of a sparse Cholesky factorization. SIAM J. Matrix Anal. Appl. Vol. 22, 4 (2001), 997--1013.
[19]
Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, and Prithviraj Sen. 2017. SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. In CIDR.
[20]
Rong-En Fan et al. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. Vol. 9 (2008), 1871--1874.
[21]
Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. 2012. Towards a unified architecture for in-RDBMS analytics SIGMOD. 325--336.
[22]
Roger Fletcher. 2005. On the Barzilai-Borwein method. In Optimization and control with applications. Appl. Optim., Vol. Vol. 96. 235--256.
[23]
P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. 1974. Methods for modifying matrix factorizations. Math. Comp. Vol. 28 (1974), 505--535.
[24]
Tom Goldstein, Christoph Studer, and Richard G. Baraniuk. 2014. A Field Guide to Forward-Backward Splitting with a FASTA Implementation. CoRR Vol. abs/1411.3406 (2014).
[25]
Georg Gottlob, Nicola Leone, and Francesco Scarcello. 1999. Hypertree decompositions and tractable queries. In PODS. 21--32.
[26]
Martin Grohe and Dániel Marx. 2014. Constraint Solving via Fractional Edge Covers. ACM Trans. Alg. Vol. 11, 1 (2014), 4:1--4:20.
[27]
William W. Hager. 1989. Updating the inverse of a matrix. SIAM Rev. Vol. 31, 2 (1989), 221--239.
[28]
David Harris and Sarah Harris. 2012. Digital Design and Computer Architecture (2nd ed.).
[29]
T. Hastie, R. Tibshrani, and M. J. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and generalizations. CRC Press.
[30]
Joseph M. Hellerstein et al. 2012. The MADlib Analytics Library or MAD Skills, the SQL. PVLDB Vol. 5, 12 (2012), 1700--1711.
[31]
Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, and Frederick R. Reiss. 2015. Resource Elasticity for Large-Scale Machine Learning SIGMOD. 137--152.
[32]
C. G. Khatri and C. Radhakrishna Rao. 1968. Solutions to some functional equations and their applications to characterization of probability distributions. Sankhy =Ser. A Vol. 30 (1968), 167--180.
[33]
Benny Kimelfeld and Christopher Ré. 2017. A Relational Framework for Classifier Engineering. In PODS. 5--20.
[34]
Arun Kumar, Matthias Boehm, and Jun Yang. 2017. Data Management in Machine Learning: Challenges, Techniques, and Systems SIGMOD. 1717--1722.
[35]
Arun Kumar, Jeffrey F. Naughton, and Jignesh M. Patel. 2015. Learning Generalized Linear Models Over Normalized Data SIGMOD. 1969--1984.
[36]
Arun Kumar, Jeffrey F. Naughton, Jignesh M. Patel, and Xiaojin Zhu. 2016. To Join or Not to Join?: Thinking Twice about Joins before Feature Selection SIGMOD. 19--34.
[37]
Xiangrui Meng et al. 2016. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. Vol. 17, 1 (2016), 1235--1241.
[38]
Dirk Neumann. 2015. Lightning-Fast Deep Learning on Spark Via parallel stochastic gradient updates, www.deepdist.com. (2015).
[39]
Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2012. Worst-case Optimal Join Algorithms. In PODS. 37--48.
[40]
Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2013. Skew Strikes Back: New Developments in the Theory of Join Algorithms SIGMOD Rec. 5--16.
[41]
Dan Olteanu and Jakub Závodný. 2015. Size Bounds for Factorised Representations of Query Results. ACM Trans. Database Syst. Vol. 40, 1 (2015), 2:1--2:44.
[42]
Jian Pei, Jiawei Han, and Laks VS Lakshmanan. 2001. Mining frequent itemsets with convertible constraints ICDE. 433--442.
[43]
K. B. Petersen and M. S. Pedersen. 2012. The Matrix Cookbook. (nov. 2012). http://www2.imm.dtu.dk/pubdb/p.php?3274. Version 20121115.
[44]
Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. Data Management Challenges in Production Machine Learning SIGMOD. 1723--1726.
[45]
Chengjie Qin and Florin Rusu. 2015. Speculative Approximations for Terascale Distributed Gradient Descent Optimization. In DanaC. 1:1--1:10.
[46]
Steffen Rendle. 2012. Factorization Machines with libFM. ACM Trans. Intell. Syst. Technol. Vol. 3, 3 (2012), 57:1--57:22.
[47]
Steffen Rendle. 2013. Scaling Factorization Machines to Relational Data. PVLDB Vol. 6, 5 (2013), 337--348.
[48]
Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. 2016. Learning Linear Regression Models over Factorized Joins SIGMOD. 3--18.
[49]
Todd L. Veldhuizen. 2014. Triejoin: A Simple, Worst-Case Optimal Join Algorithm ICDT. 96--106.
[50]
Matei Zaharia et al. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI. 15--28.

Cited By

View all
  • (2024)Improved Approximation Algorithms for Relational ClusteringProceedings of the ACM on Management of Data10.1145/36958312:5(1-27)Online publication date: 7-Nov-2024
  • (2024)Tight Bounds of Circuits for Sum-Product QueriesProceedings of the ACM on Management of Data10.1145/36515882:2(1-20)Online publication date: 14-May-2024
  • (2024)Non-Blocking Functional Dependency Discovery from Data StreamsInformation Sciences10.1016/j.ins.2024.121531(121531)Online publication date: Oct-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
May 2018
462 pages
ISBN:9781450347068
DOI:10.1145/3196959
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. functional aggregate queries
  2. functional dependencies
  3. in-database analytics
  4. model reparameterization
  5. tensors

Qualifiers

  • Research-article

Funding Sources

  • Margaret and Herman Sokol Faculty Award
  • NSF
  • European Union's Horizon 2020 research and innovation programme

Conference

SIGMOD/PODS '18
Sponsor:

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)150
  • Downloads (Last 6 weeks)24
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Improved Approximation Algorithms for Relational ClusteringProceedings of the ACM on Management of Data10.1145/36958312:5(1-27)Online publication date: 7-Nov-2024
  • (2024)Tight Bounds of Circuits for Sum-Product QueriesProceedings of the ACM on Management of Data10.1145/36515882:2(1-20)Online publication date: 14-May-2024
  • (2024)Non-Blocking Functional Dependency Discovery from Data StreamsInformation Sciences10.1016/j.ins.2024.121531(121531)Online publication date: Oct-2024
  • (2023)Query Evaluation under Differential PrivacyACM SIGMOD Record10.1145/3631504.363150652:3(6-17)Online publication date: 2-Nov-2023
  • (2023)Fine-Tuning Data Structures for Query ProcessingProceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3579990.3580016(149-161)Online publication date: 17-Feb-2023
  • (2022)Coresets for relational data and the applicationsProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3600302(434-448)Online publication date: 28-Nov-2022
  • (2022)ConnectorXProceedings of the VLDB Endowment10.14778/3551793.355184715:11(2994-3003)Online publication date: 1-Jul-2022
  • (2022)BABOONSProceedings of the VLDB Endowment10.14778/3551793.355184615:11(2980-2993)Online publication date: 29-Sep-2022
  • (2022)Functional collection programming with semi-ring dictionariesProceedings of the ACM on Programming Languages10.1145/35273336:OOPSLA1(1-33)Online publication date: 29-Apr-2022
  • (2022)In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data ShuffleProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526150(1286-1300)Online publication date: 10-Jun-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media