research-article

Sparse-TPU: adapting systolic arrays for sparse matrices

Authors:

Aporva Amarnath,

Dong-Hyeon Park,

Austin Rovinski,

Ronald Dreslinski,

Trevor MudgeAuthors Info & Claims

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing

Article No.: 19, Pages 1 - 12

https://doi.org/10.1145/3392717.3392751

Published: 29 June 2020 Publication History

Abstract

While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach of first developing a packing technique to condense a sparse matrix and then propose a systolic array based system, Sparse-TPU, abbreviated to STPU, to accommodate the matrix computations for the packed denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for integer (int8) and floating point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has 12.93% area overhead and an average of 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.

References

[1]

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2016. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News 43, 3 (2016), 105--117.

Digital Library

[2]

Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Karthik Gururaj, and Glenn Reinman. 2014. Accelerator-rich architectures: Opportunities and progresses. In 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1--6.

Digital Library

[3]

Alfredo Cuzzocrea, Domenico Saccà, and Jeffrey D Ullman. 2013. Big data: a research agenda. In Proceedings of the 17th International Database Engineering & Applications Symposium. ACM, 198--203.

Digital Library

[4]

Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1.

Digital Library

[5]

Jeff Dean. 2017. Machine learning for systems and systems for machine learning. In Presentation at 2017 Conference on Neural Information Processing Systems.

[6]

Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. 2013. Predicting Parameters in Deep Learning. CoRR abs/1306.0543 (2013). arXiv:1306.0543 http://arxiv.org/abs/1306.0543

[7]

Iain S Duff, Albert Maurice Erisman, and John Ker Reid. 2017. Direct methods for sparse matrices. Oxford University Press.

[8]

Iain S. Duff, Michael A. Heroux, and Roldan Pozo. 2002. An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum. ACM Trans. Math. Softw. 28, 2 (June 2002), 239--267.

Digital Library

[9]

Adi Fuchs and David Wentzlaff. 2019. The accelerator wall: Limits of chip specialization. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1--14.

[10]

Ashish Gondimalla, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks. In Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). ACM, New York, NY, USA, 151--165.

Digital Library

[11]

J. L. Greathouse and M. Daga. 2014. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 769--780.

Digital Library

[12]

Yiwen Guo, Anbang Yao, and Yurong Chen.2016. Dynamic Network Surgery for Efficient DNNs. CoRR abs/1608.04493 (2016). arXiv:1608.04493 http://arxiv.org/abs/1608.04493

Digital Library

[13]

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. CoRR abs/1602.01528 (2016). arXiv:1602.01528 http://arxiv.org/abs/1602.01528

[14]

Song Han, Huizi Mao, and William Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.

[15]

Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).

[16]

Xin He, Guihai Yan, Wenyan Lu, Xuan Zhang, and Ke Liu. 2019. A Quantitative Exploration of Collaborative Pruning and Approximation Computing Towards Energy Efficient Neural Networks. IEEE Design & Test 37, 1 (2019), 36--45.

[17]

Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). ACM, New York, NY, USA, 319--333.

Digital Library

[18]

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). ACM, New York, NY, USA, 1--12.

Digital Library

[19]

Bo Kågström, Per Ling, and Charles Van Loan. 1998. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software (TOMS) 24, 3 (1998), 268--302.

Digital Library

[20]

H.T Kung, Bradley McDanel, and Sai Qian Zhang. 2019. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, New York, NY, USA, 13.

Digital Library

[21]

Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2012. An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs. In Proceedings of the 26th ACM international conference on Supercomputing. ACM, 377--386.

Digital Library

[22]

Nallatech. 2018. OpenCAPI enabled FPGAs---the perfect bridge to a data centricworld. https://openpowerfoundation.org/wp-content/uploads/2018/10/Allan-Cantle.Nallatech-Presentation-2018-OPF-Summit_Amsterdam-presentation.pdf. [Online].

[23]

Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun-Seok Kim, David Blaauw, Trevor Mudge, and Ronald Dreslinski. 2018. Outerspace: An outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 724--736.

[24]

Subhankar Pal, Dong-hyeon Park, Siying Feng, Paul Gao, Jielun Tan, Austin Rovinski, Shaolin Xie, Chun Zhao, Aporva Amarnath, Timothy Wesley, et al. 2019. A 7.3 M Output Non-Zeros/J Sparse Matrix-Matrix Multiplication Accelerator using Memory Reconfiguration in 40 nm. In 2019 Symposium on VLSI Technology. IEEE, C150--C151.

[25]

Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel S. Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. CoRR abs/1708.04485 (2017). arXiv:1708.04485 http://arxiv.org/abs/1708.04485

[26]

Dong-Hyeon Park, Subhankar Pal, Siying Feng, Paul Gao, Jielun Tan, Austin Rovinski, Shaolin Xie, Chun Zhao, Aporva Amarnath, Timothy Wesley, et al. 2020. A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator. IEEE Journal of Solid-State Circuits (2020).

[27]

S. F. Reddaway. 1973. DAP---a Distributed Array Processor. In Proceedings of the 1st Annual Symposium on Computer Architecture (ISCA '73). ACM, New York, NY, USA, 61--65.

Digital Library

[28]

Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review 51, 2 (2017), 405--418.

[29]

Ananda Samajdar, Yuhao Zhu, Paul N. Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN Accelerator. CoRR abs/1811.02883 (2018). arXiv:1811.02883 http://arxiv.org/abs/1811.02883

[30]

Korey Sewell, Ronald G Dreslinski, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoffrey Blake, Michael Cieslak, Reetuparna Das, Thomas F Wenisch, Dennis Sylvester, et al. 2012. Swizzle-switch networks for many-core systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 2, 2 (2012), 278--294.

[31]

Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R Dulloor, Michael J Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. Graphmat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (2015), 1214--1225.

Digital Library

[32]

Robert Tarjan and Andrew Yao. 1979. Storing a Sparse Table. Commun. ACM 22, 11 (1979), 606--611.

Digital Library

[33]

Paul Teich. 2018. Tear Apart Google's TPU 3.0 AI Coprocessor. https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/

[34]

S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1--12.

[35]

Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-Based Graph Processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 712--725.

Digital Library

Cited By

Tang MWen MYang JXue ZShen J(2025)SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From ScratchIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.343435944:2(497-511)Online publication date: Feb-2025
https://doi.org/10.1109/TCAD.2024.3434359
Cao YJiang JZhao GWang WJiang HSong Y(2025)Compression Format and Systolic Array Structure Co-design for Accelerating Sparse Matrix Multiplication in DNNsAlgorithms and Architectures for Parallel Processing10.1007/978-981-96-1545-2_8(114-133)Online publication date: 13-Feb-2025
https://doi.org/10.1007/978-981-96-1545-2_8
Liu YChen RLi SYang JLi Sda Silva B(2024)FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future OpportunitiesACM Transactions on Reconfigurable Technology and Systems10.1145/368748017:4(1-37)Online publication date: 28-Aug-2024
https://dl.acm.org/doi/10.1145/3687480
Show More Cited By

Index Terms

Sparse-TPU: adapting systolic arrays for sparse matrices
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Systolic arrays

Recommendations

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control ...
Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, the key application characteristics such as frames-per-second (FPS) ...
Hardware design of an eigensolver based on the QR method

Eigenvalue computation is essential in many fields of science and engineering; however a lot of computation resources are required for computing all the eigenvalues of medium and large matrices. One of the most famous methods that computes all the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing

June 2020

499 pages

ISBN:9781450379830

DOI:10.1145/3392717

General Chairs:
Eduard Ayguadé
Universitat Politècnica de Catalunya and Barcelona Supercomputing Center
,
Wen-mei Hwu
University of Illinois at Urbana-Champaign
,
Program Chairs:
Rosa M. Badia
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya
,
H. Peter Hofstee
IBM Austin

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2020

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

ICS '20

Sponsor:

SIGARCH

ICS '20: 2020 International Conference on Supercomputing

June 29 - July 2, 2020

Spain, Barcelona

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

58
Total Citations
View Citations
1,593
Total Downloads

Downloads (Last 12 months)330
Downloads (Last 6 weeks)29

Reflects downloads up to 14 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Tang MWen MYang JXue ZShen J(2025)SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From ScratchIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.343435944:2(497-511)Online publication date: Feb-2025
https://doi.org/10.1109/TCAD.2024.3434359
Cao YJiang JZhao GWang WJiang HSong Y(2025)Compression Format and Systolic Array Structure Co-design for Accelerating Sparse Matrix Multiplication in DNNsAlgorithms and Architectures for Parallel Processing10.1007/978-981-96-1545-2_8(114-133)Online publication date: 13-Feb-2025
https://doi.org/10.1007/978-981-96-1545-2_8
Liu YChen RLi SYang JLi Sda Silva B(2024)FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future OpportunitiesACM Transactions on Reconfigurable Technology and Systems10.1145/368748017:4(1-37)Online publication date: 28-Aug-2024
https://dl.acm.org/doi/10.1145/3687480
Zhao GCao Y(2024)Highly Efficient Load-Balanced Dataflow for SpGEMMs on Systolic ArraysProceedings of the Great Lakes Symposium on VLSI 202410.1145/3649476.3658777(272-276)Online publication date: 12-Jun-2024
https://dl.acm.org/doi/10.1145/3649476.3658777
Lu XLong BChen XHan YSun XTsafrir DMusuvathi MGupta RAbu-Ghazaleh N(2024)ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache OptimizationsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651381(71-85)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620666.3651381
Zhong KZhu ZDai GWang HYang XZhang HSi JMao QZeng SHong KZhang GYang HWang YTsafrir DMusuvathi MGupta RAbu-Ghazaleh N(2024)FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine LearningProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651336(349-366)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620666.3651336
Gao THuang LGao SWang K(2024)SSA: A Uniformly Recursive Bidirection-Sequence Systolic Sorter ArrayIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.343433235:10(1721-1734)Online publication date: Oct-2024
https://doi.org/10.1109/TPDS.2024.3434332
Wu BYu TChen KLiu W(2024)Edge-Side Fine-Grained Sparse CNN Accelerator With Efficient Dynamic Pruning SchemeIEEE Transactions on Circuits and Systems I: Regular Papers10.1109/TCSI.2023.334741771:3(1285-1298)Online publication date: Mar-2024
https://doi.org/10.1109/TCSI.2023.3347417
Ha DLiu LChou YGo SRo WTseng HAamodt T(2024)Generalizing Ray Tracing Accelerators for Tree Traversals on GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00080(1041-1057)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00080
Yang YEmer JSanchez D(2024)Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00072(931-945)Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00072
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten