DOI: 10.1145/3330345.3330366
ICS Conference Proceedings · Research article · Public Access

Efficient and effective sparse tensor reordering

Published: 26 June 2019

Abstract

This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations on it, and proposes two reordering algorithms for this problem, called BFS-MCS and Lexi-Order. BFS-MCS is a breadth-first search (BFS)-like heuristic based on the maximum cardinality search family; Lexi-Order extends doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when storing the tensor in three previously proposed sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for the HiCOO format to improve load balance. On modern multicore CPUs, Lexi-Order obtains up to 4.14× speedup on sequential HiCOO-based MTTKRP and 11.88× speedup on its parallel counterpart; the performance of COO- and CSF-based MTTKRP also improves. Both reordering methods are more effective than state-of-the-art approaches. The code is released as part of the Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI.
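For context on the kernel being accelerated, the matricized tensor times Khatri-Rao product (MTTKRP) that dominates each CPD iteration can be sketched over a COO-format tensor in a few lines. This is an illustrative Python sketch under assumed data layouts (coordinate tuples plus a parallel value list, factor matrices as lists of rows), not the ParTI! implementation:

```python
def mttkrp_coo(indices, values, factors, mode, rank):
    """MTTKRP sketch for a COO sparse tensor.

    indices: list of N-way coordinate tuples, one per nonzero
    values:  list of the matching nonzero values
    factors: list of N dense factor matrices (each a list of rows of length `rank`)
    mode:    the mode whose factor matrix is being updated
    """
    n_rows = len(factors[mode])
    out = [[0.0] * rank for _ in range(n_rows)]
    for coords, val in zip(indices, values):
        for r in range(rank):
            # Multiply the nonzero by column r of every other mode's factor row.
            prod = val
            for n, i in enumerate(coords):
                if n != mode:
                    prod *= factors[n][i][r]
            out[coords[mode]][r] += prod
    return out
```

Each nonzero touches one row of every factor matrix, so nonzeros with nearby coordinates reuse the same rows; reordering the tensor's indices so that such nonzeros become adjacent is precisely the locality the paper's BFS-MCS and Lexi-Order schemes target.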




Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019, 533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. HiCOO
  2. hierarchical coordinate
  3. reordering
  4. sparse tensor
  5. tensor decomposition


Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Article Metrics

  • Downloads (last 12 months): 258
  • Downloads (last 6 weeks): 32
Reflects downloads up to 13 Feb 2025

Cited By

  • (2024) MemFriend: Understanding Memory Performance with Spatial-Temporal Affinity. Proceedings of the International Symposium on Memory Systems, 270-284. DOI: 10.1145/3695794.3695820. Online publication date: 30 Sep 2024.
  • (2024) Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 169-181. DOI: 10.1145/3626183.3659985. Online publication date: 17 Jun 2024.
  • (2023) A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56, 1, 1-38. DOI: 10.1145/3604606. Online publication date: 28 Aug 2023.
  • (2023) A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN. ACM Transactions on Parallel Computing 10, 2, 1-23. DOI: 10.1145/3584373. Online publication date: 20 Jun 2023.
  • (2023) SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 660-678. DOI: 10.1145/3582016.3582047. Online publication date: 25 Mar 2023.
  • (2023) Code Synthesis for Sparse Tensor Format Conversion and Optimization. Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 28-40. DOI: 10.1145/3579990.3580021. Online publication date: 17 Feb 2023.
  • (2023) Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA. Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 259-269. DOI: 10.1145/3543622.3573179. Online publication date: 12 Feb 2023.
  • (2023) A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration. IEEE Transactions on Parallel and Distributed Systems 34, 8, 2419-2432. DOI: 10.1109/TPDS.2023.3288520. Online publication date: Aug 2023.
  • (2023) Dynasor: A Dynamic Memory Layout for Accelerating Sparse MTTKRP for Tensor Decomposition on Multi-core CPU. 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 23-33. DOI: 10.1109/SBAC-PAD59825.2023.00012. Online publication date: 17 Oct 2023.
  • (2023) Dynamic Tensor Linearization and Time Slicing for Efficient Factorization of Infinite Data Streams. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 402-412. DOI: 10.1109/IPDPS54959.2023.00048. Online publication date: May 2023.
