DOI: 10.1145/3299869.3319854

MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions

Published: 25 June 2019

Abstract

Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution plans or performing sparse operations. These sparsity estimates are used for cost and memory estimates, format decisions, and result allocation. Existing estimators tend to focus on matrix products only, and struggle to attain good accuracy with low estimation overhead. However, a key observation is that real-world sparse matrices commonly exhibit structural properties such as a single non-zero per row, or columns with varying sparsity. In this paper, we introduce MNC (Matrix Non-zero Count), a remarkably simple, count-based matrix synopsis that exploits these structural properties for efficient, accurate, and general sparsity estimation. We describe estimators and sketch propagation for realistic linear algebra expressions. Our experiments - on a new estimation benchmark called SparsEst - show that the MNC estimator yields good accuracy with very low overhead. This behavior makes MNC practical and broadly applicable in ML systems.
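
To make the idea of a count-based synopsis concrete, the sketch below builds per-row and per-column non-zero counts for a sparse matrix and uses them to estimate the sparsity of a matrix product. This is a minimal illustration under an independence and no-cancellation assumption, not the MNC estimators defined in the paper; the names CountSynopsis and estimate_product_sparsity are hypothetical.

```python
# Illustrative only: a count-based synopsis in the spirit of MNC, but NOT the
# paper's estimators. Assumes independently placed non-zeros and no cancellation.
import numpy as np
import scipy.sparse as sp


class CountSynopsis:
    """Per-row and per-column non-zero counts of a sparse matrix."""

    def __init__(self, A):
        csr = sp.csr_matrix(A)
        self.shape = csr.shape
        self.row_nnz = np.diff(csr.indptr)           # non-zeros per row
        self.col_nnz = np.diff(csr.tocsc().indptr)   # non-zeros per column


def estimate_product_sparsity(syn_a, syn_b):
    """Estimate the sparsity (nnz / cells) of A @ B from the two count synopses.

    For a random output cell (i, j), term k contributes a non-zero with
    probability ~ (col_nnz_A[k] / m) * (row_nnz_B[k] / n); the cell stays zero
    only if all k terms miss (independence, no cancellation).
    """
    m, k = syn_a.shape
    k2, n = syn_b.shape
    assert k == k2, "inner dimensions must match"
    p_hit = (syn_a.col_nnz / m) * (syn_b.row_nnz / n)
    return float(1.0 - np.prod(1.0 - p_hit))


# Usage: compare the estimate against the true product sparsity.
A = sp.random(1000, 500, density=0.01, random_state=7, format="csr")
B = sp.random(500, 800, density=0.02, random_state=8, format="csr")
est = estimate_product_sparsity(CountSynopsis(A), CountSynopsis(B))
true_sp = (A @ B).nnz / (1000 * 800)
print(f"estimated sparsity: {est:.4f}, true sparsity: {true_sp:.4f}")
```

Per-dimension counts of this kind can capture skew such as columns with widely varying sparsity, which a single average-density figure would miss; that is the structural intuition the abstract appeals to.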




Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN: 9781450356435
DOI: 10.1145/3299869


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. linear algebra
  2. machine learning systems
  3. sparsity estimation

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19: International Conference on Management of Data
June 30 - July 5, 2019
Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 paper acceptance rate: 88 of 430 submissions (20%)
Overall acceptance rate: 785 of 4,003 submissions (20%)

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 44
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 23 Sep 2024


Cited By

  • (2024) On Efficient Large Sparse Matrix Chain Multiplication. Proceedings of the ACM on Management of Data 2(3), 1-27. DOI: 10.1145/3654959. Online publication date: 30-May-2024
  • (2023) AWARE: Workload-aware, Redundancy-exploiting Linear Algebra. Proceedings of the ACM on Management of Data 1(1), 1-28. DOI: 10.1145/3588682. Online publication date: 30-May-2023
  • (2023) Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques. Companion of the 2023 International Conference on Management of Data, 53-59. DOI: 10.1145/3555041.3589407. Online publication date: 4-Jun-2023
  • (2023) Atrapos: Real-time Evaluation of Metapath Query Workloads. Proceedings of the ACM Web Conference 2023, 2487-2498. DOI: 10.1145/3543507.3583322. Online publication date: 30-Apr-2023
  • (2023) Givens rotations for QR decomposition, SVD and PCA over database joins. The VLDB Journal 33(4), 1013-1037. DOI: 10.1007/s00778-023-00818-9. Online publication date: 23-Nov-2023
  • (2022) Givens QR Decomposition over Relational Databases. Proceedings of the 2022 International Conference on Management of Data, 1948-1961. DOI: 10.1145/3514221.3526144. Online publication date: 10-Jun-2022
  • (2021) Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear Algebra. Proceedings of the 2021 International Conference on Management of Data, 1222-1234. DOI: 10.1145/3448016.3457317. Online publication date: 9-Jun-2021
  • (2021) HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries. Proceedings of the 2021 International Conference on Management of Data, 23-35. DOI: 10.1145/3448016.3457311. Online publication date: 9-Jun-2021
