Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs

Research article · Open access
DOI: 10.1145/3352460.3358269
Published: 12 October 2019

Abstract

Deep neural networks have become a compelling solution for applications such as image classification, object detection, speech recognition, and machine translation. However, this success comes at the cost of excessive computation due to an over-provisioned parameter space. To improve the computational efficiency of neural networks, many pruning techniques have been proposed to reduce the number of multiply-accumulate (MAC) operations, which results in high sparsity in the networks.
Unfortunately, sparse neural networks often run slower than their dense counterparts on modern GPUs because of their poor device utilization. In particular, as sophisticated hardware primitives (e.g., Tensor Cores) have been deployed to boost the performance of dense matrix multiplication by an order of magnitude, the performance of sparse neural networks lags behind significantly.
In this work, we propose an algorithm and hardware co-design methodology to accelerate sparse neural networks. A novel pruning algorithm is devised to improve the workload balance and reduce the decoding overhead of sparse neural networks. Meanwhile, new instructions and micro-architecture optimizations are proposed for the Tensor Core to adapt it to structurally sparse neural networks. Our experimental results show that the pruning algorithm achieves a 63% performance gain while sustaining model accuracy. Furthermore, the hardware optimization provides an additional 58% performance gain with negligible area overhead.
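
To make the vector-wise scheme concrete, the sketch below shows one plausible form of such pruning, assuming each row of a weight matrix is split into fixed-length vectors and only the K largest-magnitude weights in every vector are kept. The vector length L, the keep count K, and the function name are illustrative choices, not the paper's exact configuration; the point is that a fixed per-vector nonzero budget keeps the nonzeros evenly distributed, which is what enables balanced workloads and compact, fixed-size index encoding.

```python
import numpy as np

def vectorwise_prune(weights: np.ndarray, L: int = 16, K: int = 4) -> np.ndarray:
    """Keep the K largest-magnitude weights in every length-L vector of each row.

    Every vector retains exactly K nonzeros, so the sparsity pattern is
    balanced across the matrix and each vector can be encoded with small,
    fixed-size indices (illustrative sketch, not the paper's implementation).
    """
    rows, cols = weights.shape
    assert cols % L == 0, "row length must be a multiple of the vector length"
    pruned = np.zeros_like(weights)
    for r in range(rows):
        for c in range(0, cols, L):
            vec = weights[r, c:c + L]
            keep = np.argsort(np.abs(vec))[-K:]   # indices of the K largest magnitudes
            pruned[r, c + keep] = vec[keep]       # everything else stays zero
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 64)).astype(np.float32)
    Wp = vectorwise_prune(W, L=16, K=4)
    print("sparsity:", 1.0 - np.count_nonzero(Wp) / Wp.size)   # 0.75 for L=16, K=4
```

In a full training pipeline such pruning would be interleaved with retraining to recover accuracy; the uniform per-vector nonzero count is what avoids the load imbalance and per-element index decoding that typically make unstructured sparse matrices slow on GPU matrix-multiplication hardware.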

    Published In

    MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
    October 2019, 1104 pages
    ISBN: 9781450369381
    DOI: 10.1145/3352460

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. graphics processing units
    2. neural networks
    3. pruning

