DOI: 10.5555/3571885.3571918

Deinsum: practically I/O optimal multi-linear algebra

Published: 18 November 2022

Abstract

Multilinear algebra kernel performance on modern massively-parallel systems is determined mainly by data movement. However, deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs is a notoriously hard problem. State-of-the-art libraries rely on heuristics and often fall back to suboptimal tensor folding and BLAS calls. We present Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation, based on rigorous mathematical tools to address this problem. Our framework automatically derives data movement-optimal tiling and generates corresponding distributed schedules, further optimizing the performance of local computations by increasing their arithmetic intensity. To show the benefits of our approach, we test it on two important tensor kernel classes: Matricized Tensor Times Khatri-Rao Products and Tensor Times Matrix chains. We show performance results and scaling on the Piz Daint supercomputer, with up to 19x speedup over state-of-the-art solutions on 512 nodes.
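To make the two evaluated kernel classes concrete, the sketch below writes a mode-1 MTTKRP and a TTM chain as einsum expressions in NumPy. This is an illustrative single-node sketch only; the tensor shapes and the use of numpy.einsum are assumptions for this example and not Deinsum's own API. Deinsum consumes computations expressed in this notation and derives the distributed, tiled schedule automatically.

    import numpy as np

    # Hypothetical sizes, for illustration only.
    I, J, K, R = 8, 9, 10, 4

    X = np.random.rand(I, J, K)   # 3rd-order input tensor
    B = np.random.rand(J, R)      # factor matrix for mode 2
    C = np.random.rand(K, R)      # factor matrix for mode 3

    # Mode-1 MTTKRP: M[i, r] = sum over j, k of X[i, j, k] * B[j, r] * C[k, r]
    M = np.einsum('ijk,jr,kr->ir', X, B, C)

    # TTM chain: multiply X by a matrix along every mode (Tucker-style)
    A1 = np.random.rand(5, I)
    A2 = np.random.rand(6, J)
    A3 = np.random.rand(7, K)
    Y = np.einsum('ijk,ai,bj,ck->abc', X, A1, A2, A3)

    print(M.shape)  # (8, 4)
    print(Y.shape)  # (5, 6, 7)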

Supplementary Material

MP4 File (SC22_Presentation_Ziogas.mp4)
Presentation at SC '22




Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. automatic programming
  2. distributed computing
  3. hardware acceleration
  4. linear algebra
  5. performance analysis
  6. tensors

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
