MatRox: modular approach for improving data locality in hierarchical (Mat)rix App(Rox)imation

Published: 19 February 2020

Abstract

Hierarchical matrix approximations have gained significant traction in the machine learning and scientific computing communities because they exploit available low-rank structure in kernel methods to compress the kernel matrix. The resulting compressed matrix, the HMatrix, reduces the computational complexity of operations such as HMatrix-matrix multiplication with tunable accuracy in an evaluation phase. Existing implementations of HMatrix evaluations do not preserve locality and often lead to unbalanced parallel execution with high synchronization. Current solutions also require the compression phase to re-execute whenever the kernel method or the required accuracy changes. MatRox is a framework that uses novel structure-analysis strategies, together with code specialization and a storage format, to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplication. Modularizing the matrix compression phase enables the reuse of computations when the input accuracy or the kernel function changes. The MatRox-generated code for matrix-matrix multiplication is 2.98X, 1.60X, and 5.98X faster than the library implementations in GOFMM, SMASH, and STRUMPACK, respectively. Additionally, reusing portions of the compression computation when the accuracy changes yields up to a 2.64X improvement over GOFMM across five changes to accuracy.
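To make the idea the abstract describes concrete, the sketch below shows a minimal one-level, HODLR-style hierarchical approximation in numpy: off-diagonal kernel blocks are compressed to low rank with a truncated SVD, and an approximate matrix-vector product is then evaluated from the compressed factors. This is an illustrative assumption-laden sketch, not MatRox's actual algorithm, storage format, or API; the helper names (gauss_kernel, compress, hmatvec), the 1-D Gaussian kernel, and the truncation rule are all invented for illustration.

```python
# Minimal sketch of hierarchical matrix approximation: compress off-diagonal
# kernel blocks to low rank, then evaluate an approximate mat-vec.
# One-level HODLR-style example; NOT MatRox's actual algorithm or API.
import numpy as np

def gauss_kernel(X, Y, h=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-|x_i - y_j|^2 / (2 h^2))."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2.0 * h * h))

def compress(block, tol):
    """Truncated SVD of an off-diagonal block; the tolerance sets the rank."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))   # accuracy knob: keep r terms
    return U[:, :r] * s[:r], Vt[:r, :]        # factors (m x r) and (r x n)

def hmatvec(parts, w):
    """Approximate K @ w from dense diagonal and low-rank off-diagonal blocks."""
    D1, D2, (U12, V12), (U21, V21) = parts
    m = D1.shape[0]
    y = np.empty(m + D2.shape[0])
    y[:m] = D1 @ w[:m] + U12 @ (V12 @ w[m:])  # low-rank product: O(rn), not O(n^2)
    y[m:] = D2 @ w[m:] + U21 @ (V21 @ w[:m])
    return y

# Compression phase: factor each off-diagonal block once. A new accuracy
# target only changes where the factors are truncated, which is the kind of
# reuse a modular compression phase makes possible.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 10.0, 512))
m = 256
K = gauss_kernel(X, X)
parts = (K[:m, :m], K[m:, m:],
         compress(K[:m, m:], 1e-6), compress(K[m:, :m], 1e-6))

# Evaluation phase: approximate mat-vec against the exact dense product.
w = rng.standard_normal(512)
err = np.linalg.norm(hmatvec(parts, w) - K @ w) / np.linalg.norm(K @ w)
print(f"relative matvec error: {err:.2e}")
```

Because the SVD factors of each block are computed once, changing the requested accuracy only re-truncates them rather than recompressing from scratch; this mirrors, in miniature, the compression-reuse the paper attributes to MatRox, whose real pipeline (structure analysis, code specialization, load-balanced task creation) is considerably more involved.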


Cited By

  • (2024) Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction. IEEE Transactions on Parallel and Distributed Systems 35(6), 1044-1055. DOI: 10.1109/TPDS.2024.3391254. Online publication date: Jun-2024.
  • (2022) Exploiting Hierarchical Parallelism and Reusability in Tensor Kernel Processing on Heterogeneous HPC Systems. 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2522-2535. DOI: 10.1109/ICDE53745.2022.00234. Online publication date: May-2022.
  • (2022) GSpTC: High-Performance Sparse Tensor Contraction on CPU-GPU Heterogeneous Systems. 2022 IEEE 24th Int. Conf. on High Performance Computing & Communications (HPCC/DSS/SmartCity/DependSys), 380-387. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00080. Online publication date: Dec-2022.

Information

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN:9781450368186
DOI:10.1145/3332466
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Publication History

Published: 19 February 2020

Qualifiers

  • Research-article

Funding Sources

  • Canada Research Chairs program
  • NSERC
  • U.S. National Science Foundation (NSF)

Conference

PPoPP '20

Acceptance Rates

  • PPoPP '20 paper acceptance rate: 28 of 121 submissions (23%)
  • Overall acceptance rate: 230 of 1,014 submissions (23%)

Article Metrics

  • Downloads (last 12 months): 30
  • Downloads (last 6 weeks): 4

Reflects downloads up to 19 Nov 2024
