Efficient and Predictable Group Communication for Manycore NoCs

Karthik Yagna,
Onkar Patil &
Frank Mueller

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9697)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Massive manycore embedded processors with network-on-chip (NoC) architectures are becoming common. These architectures provide higher processing capability due to an abundance of cores. They provide native core-to-core communication that can be exploited via message passing to provide system scalability. Despite these advantages, manycores pose predictability challenges that can affect both performance and real-time capabilities.

In this work, we develop efficient and predictable group communication using message passing, specifically designed for large core counts in 2D mesh NoC architectures. We have implemented the most commonly used collectives so that they incur low latency and offer high timing predictability, making them suitable for balanced parallelization of scalable high-performance and embedded/real-time systems alike. Experimental results on a single-die 64-core hardware platform show that our collectives reduce communication times by up to 95% for single-packet messages and by up to 98% for longer messages. Depending on the group primitive, they outperform prior approaches either for all message sizes or only for small ones. In addition, our communication primitives have significantly lower variance than prior approaches, thereby providing more balanced parallel execution progress and better real-time predictability.
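To give a feel for why topology-aware collectives matter on a 2D mesh, the following is a toy step-count model only; it is not the algorithm or the measurements from this paper. It compares a flat broadcast, where the root sends to every other core one message at a time, with a simple two-phase row-then-column broadcast. The function names, the `(row, column)` coordinate convention, and the assumption that a core can forward to its neighbors in both directions within one step are all hypothetical choices made for illustration.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column) of a core in the mesh; hypothetical convention


def flat_broadcast_steps(rows: int, cols: int) -> int:
    """Steps for a flat broadcast: the root sends to each other core in turn."""
    return rows * cols - 1


def row_column_broadcast(rows: int, cols: int, root: Coord) -> Dict[Coord, int]:
    """Step at which each core receives the message in a two-phase broadcast.

    Phase 1: the message travels along the root's row, one hop per step.
    Phase 2: every core in that row forwards it along its own column.
    Assumption (not from the paper): a core can forward to neighbors in
    both directions during the same step.
    """
    root_r, root_c = root
    arrival: Dict[Coord, int] = {}
    for c in range(cols):
        row_step = abs(c - root_c)  # step at which column c is reached in phase 1
        for r in range(rows):
            arrival[(r, c)] = row_step + abs(r - root_r)  # plus hops down the column
    return arrival


if __name__ == "__main__":
    arrival = row_column_broadcast(8, 8, (0, 0))
    print("flat broadcast steps:", flat_broadcast_steps(8, 8))
    print("row/column broadcast steps:", max(arrival.values()))
```

Under these assumptions, an 8x8 mesh with the root in a corner needs 63 serialized sends for the flat scheme but only 14 forwarding steps for the row/column scheme. Real latencies also depend on packet size, routing, and contention, none of which this sketch models.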

This work was supported in part by NSF grants 0905181 and 1239246.

Author information

Authors and Affiliations

North Carolina State University, Raleigh, USA
Karthik Yagna, Onkar Patil & Frank Mueller

Corresponding author

Correspondence to Frank Mueller.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum, Hamburg, Germany
Julian M. Kunkel
Argonne National Laboratory, Lemont, Illinois, USA
Pavan Balaji
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra

About this paper

Cite this paper

Yagna, K., Patil, O., Mueller, F. (2016). Efficient and Predictable Group Communication for Manycore NoCs. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_20

DOI: https://doi.org/10.1007/978-3-319-41321-1_20
Published: 15 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer Science, Computer Science (R0)
