research-article

MMNNN: A tree-based Multicast Mechanism for NoC-based deep Neural Network accelerators

Authors:

Qi Wang

Volume 85, Issue C

https://doi.org/10.1016/j.micpro.2021.104242

Published: 01 September 2021

Abstract

Networks-on-Chip (NoCs) are widely used in multiprocessor systems. In recent years, NoC-based Deep Neural Network (DNN) accelerators have been proposed, in which the neural computing devices are connected by a NoC. Such designs dramatically reduce the off-chip memory accesses of these platforms. However, DNN traffic contains a large number of one-to-many packet transfers, and carrying them over traditional unicast channels significantly degrades performance. We propose a multicast mechanism for NoC-based DNN accelerators, called Multicast Mechanism for NoC-based Neural Network accelerator (MMNNN). It consists of two parts. The first is a tree-based multicast routing algorithm that scales well and minimizes the number of packets in the network. The second is a router architecture for single-flit packets. The proposed router transfers a flit to multiple destinations in a single process and does not suffer from head-of-line blocking, so it offers higher throughput and lower latency than traditional wormhole router architectures. Simulation results show that the proposed multicast mechanism performs well in terms of classification latency, average packet latency, and energy consumption.
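
The abstract's point about one-to-many transfers can be illustrated with a small sketch. It is not the MMNNN routing algorithm, which the abstract does not specify. It assumes a 2D mesh with dimension-ordered (XY) routing, and the function names and the example source and destinations are hypothetical. It counts link traversals when one source sends the same data to several destinations, first as separate unicast packets and then as a single multicast packet that is replicated only where the routes diverge.

```python
def xy_path(src, dst):
    """Directed links on the XY route from src to dst (X first, then Y)."""
    x, y = src
    links = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def unicast_link_traversals(src, dsts):
    """One packet per destination: every route is paid for in full."""
    return sum(len(xy_path(src, d)) for d in dsts)


def tree_multicast_link_traversals(src, dsts):
    """One packet forwarded along the union of the XY routes.

    Routes from a single source under XY routing share prefixes and never
    re-merge, so the union of their links forms a tree and each link is
    crossed once.
    """
    tree_links = set()
    for d in dsts:
        tree_links.update(xy_path(src, d))
    return len(tree_links)


if __name__ == "__main__":
    # Hypothetical example: node (0, 0) feeds four nodes in column 3.
    src = (0, 0)
    dsts = [(3, 0), (3, 1), (3, 2), (3, 3)]
    print("unicast link traversals:  ", unicast_link_traversals(src, dsts))
    print("multicast link traversals:", tree_multicast_link_traversals(src, dsts))
```

In this example the four unicast routes have lengths 3, 4, 5 and 6, for 18 link traversals in total, while the shared tree uses 6 links. The gap widens as the number of destinations per transfer grows, which is the situation the abstract describes for DNN traffic.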

Cited By

Ouyang Y., Wang J., Sun C., Wang Q., Liang H. (2023). URMP: using reconfigurable multicast path for NoC-based deep neural network accelerators. The Journal of Supercomputing 79:13 (14827-14847). DOI: 10.1007/s11227-023-05255-7. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1007/s11227-023-05255-7

Published In

Microprocessors & Microsystems Volume 85, Issue C

Sep 2021

236 pages

ISSN:0141-9331

Copyright © 2021.

Publisher

Elsevier Science Publishers B. V.

Netherlands
