DOI: 10.1145/3299869.3319854

MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions

Published: 25 June 2019

Abstract

Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution plans or performing sparse operations. These sparsity estimates are used for cost and memory estimates, format decisions, and result allocation. Existing estimators tend to focus on matrix products only, and struggle to attain good accuracy with low estimation overhead. However, a key observation is that real-world sparse matrices commonly exhibit structural properties such as a single non-zero per row, or columns with varying sparsity. In this paper, we introduce MNC (Matrix Non-zero Count), a remarkably simple, count-based matrix synopsis that exploits these structural properties for efficient, accurate, and general sparsity estimation. We describe estimators and sketch propagation for realistic linear algebra expressions. Our experiments - on a new estimation benchmark called SparsEst - show that the MNC estimator yields good accuracy with very low overhead. This behavior makes MNC practical and broadly applicable in ML systems.
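
To make the idea of a count-based synopsis concrete, the sketch below builds per-row and per-column non-zero counts for a sparse matrix and uses them to estimate the sparsity of a matrix product. This is a minimal illustration under an independence and no-cancellation assumption, not the MNC estimators defined in the paper; the names CountSynopsis and estimate_product_sparsity are hypothetical.

```python
# Illustrative only: a count-based synopsis in the spirit of MNC, but NOT the
# paper's estimators. Assumes independently placed non-zeros and no cancellation.
import numpy as np
import scipy.sparse as sp


class CountSynopsis:
    """Per-row and per-column non-zero counts of a sparse matrix."""

    def __init__(self, A):
        csr = sp.csr_matrix(A)
        self.shape = csr.shape
        self.row_nnz = np.diff(csr.indptr)           # non-zeros per row
        self.col_nnz = np.diff(csr.tocsc().indptr)   # non-zeros per column


def estimate_product_sparsity(syn_a, syn_b):
    """Estimate the sparsity (nnz / cells) of A @ B from the two count synopses.

    For a random output cell (i, j), term k contributes a non-zero with
    probability ~ (col_nnz_A[k] / m) * (row_nnz_B[k] / n); the cell stays zero
    only if all k terms miss (independence, no cancellation).
    """
    m, k = syn_a.shape
    k2, n = syn_b.shape
    assert k == k2, "inner dimensions must match"
    p_hit = (syn_a.col_nnz / m) * (syn_b.row_nnz / n)
    return float(1.0 - np.prod(1.0 - p_hit))


# Usage: compare the estimate against the true product sparsity.
A = sp.random(1000, 500, density=0.01, random_state=7, format="csr")
B = sp.random(500, 800, density=0.02, random_state=8, format="csr")
est = estimate_product_sparsity(CountSynopsis(A), CountSynopsis(B))
true_sp = (A @ B).nnz / (1000 * 800)
print(f"estimated sparsity: {est:.4f}, true sparsity: {true_sp:.4f}")
```

Per-dimension counts of this kind can capture skew such as columns with widely varying sparsity, which a single average-density figure would miss; that is the structural intuition the abstract appeals to.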




Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN: 9781450356435
DOI: 10.1145/3299869


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. linear algebra
  2. machine learning systems
  3. sparsity estimation

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19: International Conference on Management of Data
June 30 - July 5, 2019
Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 paper acceptance rate: 88 of 430 submissions (20%)
Overall acceptance rate: 785 of 4,003 submissions (20%)

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 44
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 23 Sep 2024


Cited By

  • (2024) On Efficient Large Sparse Matrix Chain Multiplication. Proceedings of the ACM on Management of Data 2(3), 1-27. DOI: 10.1145/3654959. Online publication date: 30-May-2024
  • (2023) AWARE: Workload-aware, Redundancy-exploiting Linear Algebra. Proceedings of the ACM on Management of Data 1(1), 1-28. DOI: 10.1145/3588682. Online publication date: 30-May-2023
  • (2023) Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques. Companion of the 2023 International Conference on Management of Data, 53-59. DOI: 10.1145/3555041.3589407. Online publication date: 4-Jun-2023
  • (2023) Atrapos: Real-time Evaluation of Metapath Query Workloads. Proceedings of the ACM Web Conference 2023, 2487-2498. DOI: 10.1145/3543507.3583322. Online publication date: 30-Apr-2023
  • (2023) Givens rotations for QR decomposition, SVD and PCA over database joins. The VLDB Journal 33(4), 1013-1037. DOI: 10.1007/s00778-023-00818-9. Online publication date: 23-Nov-2023
  • (2022) Givens QR Decomposition over Relational Databases. Proceedings of the 2022 International Conference on Management of Data, 1948-1961. DOI: 10.1145/3514221.3526144. Online publication date: 10-Jun-2022
  • (2021) Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear Algebra. Proceedings of the 2021 International Conference on Management of Data, 1222-1234. DOI: 10.1145/3448016.3457317. Online publication date: 9-Jun-2021
  • (2021) HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries. Proceedings of the 2021 International Conference on Management of Data, 23-35. DOI: 10.1145/3448016.3457311. Online publication date: 9-Jun-2021
