Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3392717.3392751acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

Sparse-TPU: adapting systolic arrays for sparse matrices

Published: 29 June 2020 Publication History

Abstract

While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach of first developing a packing technique to condense a sparse matrix and then propose a systolic array based system, Sparse-TPU, abbreviated to STPU, to accommodate the matrix computations for the packed denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for integer (int8) and floating point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has 12.93% area overhead and an average of 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.

References

[1]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2016. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News 43, 3 (2016), 105--117.
[2]
Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Karthik Gururaj, and Glenn Reinman. 2014. Accelerator-rich architectures: Opportunities and progresses. In 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1--6.
[3]
Alfredo Cuzzocrea, Domenico Saccà, and Jeffrey D Ullman. 2013. Big data: a research agenda. In Proceedings of the 17th International Database Engineering & Applications Symposium. ACM, 198--203.
[4]
Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1.
[5]
Jeff Dean. 2017. Machine learning for systems and systems for machine learning. In Presentation at 2017 Conference on Neural Information Processing Systems.
[6]
Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. 2013. Predicting Parameters in Deep Learning. CoRR abs/1306.0543 (2013). arXiv:1306.0543 http://arxiv.org/abs/1306.0543
[7]
Iain S Duff, Albert Maurice Erisman, and John Ker Reid. 2017. Direct methods for sparse matrices. Oxford University Press.
[8]
Iain S. Duff, Michael A. Heroux, and Roldan Pozo. 2002. An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum. ACM Trans. Math. Softw. 28, 2 (June 2002), 239--267.
[9]
Adi Fuchs and David Wentzlaff. 2019. The accelerator wall: Limits of chip specialization. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1--14.
[10]
Ashish Gondimalla, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks. In Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). ACM, New York, NY, USA, 151--165.
[11]
J. L. Greathouse and M. Daga. 2014. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 769--780.
[12]
Yiwen Guo, Anbang Yao, and Yurong Chen.2016. Dynamic Network Surgery for Efficient DNNs. CoRR abs/1608.04493 (2016). arXiv:1608.04493 http://arxiv.org/abs/1608.04493
[13]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. CoRR abs/1602.01528 (2016). arXiv:1602.01528 http://arxiv.org/abs/1602.01528
[14]
Song Han, Huizi Mao, and William Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.
[15]
Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[16]
Xin He, Guihai Yan, Wenyan Lu, Xuan Zhang, and Ke Liu. 2019. A Quantitative Exploration of Collaborative Pruning and Approximation Computing Towards Energy Efficient Neural Networks. IEEE Design & Test 37, 1 (2019), 36--45.
[17]
Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). ACM, New York, NY, USA, 319--333.
[18]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). ACM, New York, NY, USA, 1--12.
[19]
Bo Kågström, Per Ling, and Charles Van Loan. 1998. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software (TOMS) 24, 3 (1998), 268--302.
[20]
H.T Kung, Bradley McDanel, and Sai Qian Zhang. 2019. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, New York, NY, USA, 13.
[21]
Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2012. An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs. In Proceedings of the 26th ACM international conference on Supercomputing. ACM, 377--386.
[22]
Nallatech. 2018. OpenCAPI enabled FPGAs---the perfect bridge to a data centricworld. https://openpowerfoundation.org/wp-content/uploads/2018/10/Allan-Cantle.Nallatech-Presentation-2018-OPF-Summit_Amsterdam-presentation.pdf. [Online].
[23]
Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun-Seok Kim, David Blaauw, Trevor Mudge, and Ronald Dreslinski. 2018. Outerspace: An outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 724--736.
[24]
Subhankar Pal, Dong-hyeon Park, Siying Feng, Paul Gao, Jielun Tan, Austin Rovinski, Shaolin Xie, Chun Zhao, Aporva Amarnath, Timothy Wesley, et al. 2019. A 7.3 M Output Non-Zeros/J Sparse Matrix-Matrix Multiplication Accelerator using Memory Reconfiguration in 40 nm. In 2019 Symposium on VLSI Technology. IEEE, C150--C151.
[25]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel S. Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. CoRR abs/1708.04485 (2017). arXiv:1708.04485 http://arxiv.org/abs/1708.04485
[26]
Dong-Hyeon Park, Subhankar Pal, Siying Feng, Paul Gao, Jielun Tan, Austin Rovinski, Shaolin Xie, Chun Zhao, Aporva Amarnath, Timothy Wesley, et al. 2020. A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator. IEEE Journal of Solid-State Circuits (2020).
[27]
S. F. Reddaway. 1973. DAP---a Distributed Array Processor. In Proceedings of the 1st Annual Symposium on Computer Architecture (ISCA '73). ACM, New York, NY, USA, 61--65.
[28]
Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review 51, 2 (2017), 405--418.
[29]
Ananda Samajdar, Yuhao Zhu, Paul N. Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN Accelerator. CoRR abs/1811.02883 (2018). arXiv:1811.02883 http://arxiv.org/abs/1811.02883
[30]
Korey Sewell, Ronald G Dreslinski, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoffrey Blake, Michael Cieslak, Reetuparna Das, Thomas F Wenisch, Dennis Sylvester, et al. 2012. Swizzle-switch networks for many-core systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 2, 2 (2012), 278--294.
[31]
Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R Dulloor, Michael J Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. Graphmat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (2015), 1214--1225.
[32]
Robert Tarjan and Andrew Yao. 1979. Storing a Sparse Table. Commun. ACM 22, 11 (1979), 606--611.
[33]
Paul Teich. 2018. Tear Apart Google's TPU 3.0 AI Coprocessor. https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
[34]
S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1--12.
[35]
Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-Based Graph Processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 712--725.

Cited By

View all
  • (2025)SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From ScratchIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.343435944:2(497-511)Online publication date: Feb-2025
  • (2025)Compression Format and Systolic Array Structure Co-design for Accelerating Sparse Matrix Multiplication in DNNsAlgorithms and Architectures for Parallel Processing10.1007/978-981-96-1545-2_8(114-133)Online publication date: 13-Feb-2025
  • (2024)FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future OpportunitiesACM Transactions on Reconfigurable Technology and Systems10.1145/368748017:4(1-37)Online publication date: 28-Aug-2024
  • Show More Cited By

Index Terms

  1. Sparse-TPU: adapting systolic arrays for sparse matrices

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing
    June 2020
    499 pages
    ISBN:9781450379830
    DOI:10.1145/3392717
    • General Chairs:
    • Eduard Ayguadé,
    • Wen-mei Hwu,
    • Program Chairs:
    • Rosa M. Badia,
    • H. Peter Hofstee
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 June 2020

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. application-specific hardware
    2. hardware accelerators
    3. hardware-software codesign
    4. sparse matrix condensing
    5. sparse matrix processing
    6. systolic array

    Qualifiers

    • Research-article

    Conference

    ICS '20
    Sponsor:
    ICS '20: 2020 International Conference on Supercomputing
    June 29 - July 2, 2020
    Spain, Barcelona

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)330
    • Downloads (Last 6 weeks)29
    Reflects downloads up to 14 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From ScratchIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.343435944:2(497-511)Online publication date: Feb-2025
    • (2025)Compression Format and Systolic Array Structure Co-design for Accelerating Sparse Matrix Multiplication in DNNsAlgorithms and Architectures for Parallel Processing10.1007/978-981-96-1545-2_8(114-133)Online publication date: 13-Feb-2025
    • (2024)FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future OpportunitiesACM Transactions on Reconfigurable Technology and Systems10.1145/368748017:4(1-37)Online publication date: 28-Aug-2024
    • (2024)Highly Efficient Load-Balanced Dataflow for SpGEMMs on Systolic ArraysProceedings of the Great Lakes Symposium on VLSI 202410.1145/3649476.3658777(272-276)Online publication date: 12-Jun-2024
    • (2024)ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache OptimizationsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651381(71-85)Online publication date: 27-Apr-2024
    • (2024)FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine LearningProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651336(349-366)Online publication date: 27-Apr-2024
    • (2024)SSA: A Uniformly Recursive Bidirection-Sequence Systolic Sorter ArrayIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.343433235:10(1721-1734)Online publication date: Oct-2024
    • (2024)Edge-Side Fine-Grained Sparse CNN Accelerator With Efficient Dynamic Pruning SchemeIEEE Transactions on Circuits and Systems I: Regular Papers10.1109/TCSI.2023.334741771:3(1285-1298)Online publication date: Mar-2024
    • (2024)Generalizing Ray Tracing Accelerators for Tree Traversals on GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00080(1041-1057)Online publication date: 2-Nov-2024
    • (2024)Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00072(931-945)Online publication date: 29-Jun-2024
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media