DOI: 10.1145/3330345.3330366
ICS Conference Proceedings · Research article · Public Access

Efficient and effective sparse tensor reordering

Published: 26 June 2019

Abstract

This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations on it, and proposes two reordering algorithms for this problem, called BFS-MCS and Lexi-Order. BFS-MCS is a breadth-first search (BFS)-like heuristic based on the maximum cardinality search family; Lexi-Order extends doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when storing the tensor in three previously proposed sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for the HiCOO format to improve load balance. On modern multicore CPUs, Lexi-Order obtains up to 4.14× speedup on sequential HiCOO-based MTTKRP and 11.88× speedup on its parallel counterpart; the performance of COO- and CSF-based MTTKRP also improves. Both reordering methods are more effective than state-of-the-art approaches. The code is released as part of the Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI.
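For context on the kernel being accelerated, the matricized tensor times Khatri-Rao product (MTTKRP) that dominates each CPD iteration can be sketched over a COO-format tensor in a few lines. This is an illustrative Python sketch under assumed data layouts (coordinate tuples plus a parallel value list, factor matrices as lists of rows), not the ParTI! implementation:

```python
def mttkrp_coo(indices, values, factors, mode, rank):
    """MTTKRP sketch for a COO sparse tensor.

    indices: list of N-way coordinate tuples, one per nonzero
    values:  list of the matching nonzero values
    factors: list of N dense factor matrices (each a list of rows of length `rank`)
    mode:    the mode whose factor matrix is being updated
    """
    n_rows = len(factors[mode])
    out = [[0.0] * rank for _ in range(n_rows)]
    for coords, val in zip(indices, values):
        for r in range(rank):
            # Multiply the nonzero by column r of every other mode's factor row.
            prod = val
            for n, i in enumerate(coords):
                if n != mode:
                    prod *= factors[n][i][r]
            out[coords[mode]][r] += prod
    return out
```

Each nonzero touches one row of every factor matrix, so nonzeros with nearby coordinates reuse the same rows; reordering the tensor's indices so that such nonzeros become adjacent is precisely the locality the paper's BFS-MCS and Lexi-Order schemes target.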




Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019, 533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. HiCOO
  2. hierarchical coordinate
  3. reordering
  4. sparse tensor
  5. tensor decomposition


Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Article Metrics

  • Downloads (last 12 months): 258
  • Downloads (last 6 weeks): 32
Reflects downloads up to 13 Feb 2025

Cited By

  • (2024) MemFriend: Understanding Memory Performance with Spatial-Temporal Affinity. Proceedings of the International Symposium on Memory Systems, 270-284. DOI: 10.1145/3695794.3695820. Online publication date: 30 Sep 2024.
  • (2024) Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 169-181. DOI: 10.1145/3626183.3659985. Online publication date: 17 Jun 2024.
  • (2023) A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56, 1, 1-38. DOI: 10.1145/3604606. Online publication date: 28 Aug 2023.
  • (2023) A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN. ACM Transactions on Parallel Computing 10, 2, 1-23. DOI: 10.1145/3584373. Online publication date: 20 Jun 2023.
  • (2023) SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 660-678. DOI: 10.1145/3582016.3582047. Online publication date: 25 Mar 2023.
  • (2023) Code Synthesis for Sparse Tensor Format Conversion and Optimization. Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 28-40. DOI: 10.1145/3579990.3580021. Online publication date: 17 Feb 2023.
  • (2023) Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA. Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 259-269. DOI: 10.1145/3543622.3573179. Online publication date: 12 Feb 2023.
  • (2023) A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration. IEEE Transactions on Parallel and Distributed Systems 34, 8, 2419-2432. DOI: 10.1109/TPDS.2023.3288520. Online publication date: Aug 2023.
  • (2023) Dynasor: A Dynamic Memory Layout for Accelerating Sparse MTTKRP for Tensor Decomposition on Multi-core CPU. 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 23-33. DOI: 10.1109/SBAC-PAD59825.2023.00012. Online publication date: 17 Oct 2023.
  • (2023) Dynamic Tensor Linearization and Time Slicing for Efficient Factorization of Infinite Data Streams. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 402-412. DOI: 10.1109/IPDPS54959.2023.00048. Online publication date: May 2023.
