ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks

Published: 24 June 2017

Abstract

Deep Neural Networks (DNNs) have demonstrated state-of-the-art performance on a broad range of tasks involving natural language, speech, image, and video processing, and are deployed in many real-world applications. However, DNNs impose significant computational challenges owing to the complexity of the networks and the amount of data they process, both of which are projected to grow in the future. To improve the efficiency of DNNs, we propose ScaleDeep, a dense, scalable server architecture whose processing, memory, and interconnect subsystems are specialized to leverage the compute and communication characteristics of DNNs. While several DNN accelerator designs have been proposed in recent years, ScaleDeep differs in that it primarily targets DNN training, as opposed to only inference or evaluation. The key architectural features from which ScaleDeep derives its efficiency are: (i) heterogeneous processing tiles and chips to match the wide diversity in computational characteristics (FLOPs and Bytes/FLOP ratio) that manifests at different levels of granularity in DNNs, (ii) a memory hierarchy and 3-tiered interconnect topology suited to the memory access and communication patterns in DNNs, (iii) a low-overhead synchronization mechanism based on hardware data-flow trackers, and (iv) methods to map DNNs onto the proposed architecture that minimize data movement and improve core utilization through nested pipelining. We have developed a compiler to allow any DNN topology to be programmed onto ScaleDeep, and a detailed architectural simulator to estimate performance and energy. The simulator incorporates timing and power models of ScaleDeep's components based on synthesis to Intel's 14 nm technology. We evaluate an embodiment of ScaleDeep with 7032 processing tiles that operates at 600 MHz and has a peak performance of 680 TFLOPs (single precision) and 1.35 PFLOPs (half precision) at 1.4 kW. Across 11 state-of-the-art DNNs containing 0.65M-14.9M neurons and 6.8M-145.9M weights, including winners from 5 years of the ImageNet competition, ScaleDeep demonstrates a 6x-28x speedup at iso-power over state-of-the-art GPU performance.
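
As a rough illustration of the computational diversity the abstract refers to, the sketch below (not from the paper; the layer shapes and the 4-byte single-precision weights are assumptions chosen for illustration) estimates FLOPs and Bytes/FLOP for a convolutional layer versus a fully-connected layer, and back-calculates the per-tile throughput implied by the quoted peak numbers.

# Illustrative sketch, not the paper's model: FLOPs and Bytes/FLOP for a
# convolutional layer versus a fully-connected layer, the kind of diversity
# ScaleDeep's heterogeneous tiles are meant to exploit. Layer shapes and
# 4-byte (single-precision) weights are hypothetical assumptions.

def conv_layer(h, w, c_in, c_out, k, bytes_per_weight=4):
    macs = h * w * c_out * c_in * k * k      # one MAC per output pixel per filter tap
    flops = 2 * macs                         # multiply + add
    weight_bytes = c_out * c_in * k * k * bytes_per_weight
    return flops, weight_bytes / flops

def fc_layer(n_in, n_out, bytes_per_weight=4):
    macs = n_in * n_out                      # every input feeds every output
    flops = 2 * macs
    weight_bytes = n_in * n_out * bytes_per_weight
    return flops, weight_bytes / flops

conv_flops, conv_bpf = conv_layer(56, 56, 64, 128, 3)   # hypothetical conv layer
fc_flops, fc_bpf = fc_layer(4096, 4096)                 # hypothetical FC layer
print(f"conv: {conv_flops / 1e9:.2f} GFLOPs, {conv_bpf:.5f} bytes/FLOP")
print(f"fc:   {fc_flops / 1e9:.2f} GFLOPs, {fc_bpf:.5f} bytes/FLOP")

# Back-of-envelope check on the quoted peak figures: 680 TFLOPs from
# 7032 tiles at 600 MHz implies roughly 160 single-precision FLOPs
# per tile per cycle.
print(f"per-tile FLOPs/cycle: {680e12 / (7032 * 600e6):.0f}")

With these assumed shapes, the convolutional layer reuses its small filter heavily (well under 0.001 bytes of weight traffic per FLOP), whereas the fully-connected layer needs about 2 bytes of weight traffic per FLOP, which is why a single homogeneous compute tile struggles to serve both efficiently.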

Information

Published In

ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
June 2017, 736 pages
ISBN: 9781450348928
DOI: 10.1145/3079856

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. Deep Neural Networks
2. Hardware Accelerators
3. System Architecture

Acceptance Rates

ISCA '17 Paper Acceptance Rate: 54 of 322 submissions (17%)
Overall Acceptance Rate: 543 of 3,203 submissions (17%)
