Abstract
Massive manycore embedded processors with network-on-chip (NoC) architectures are becoming common. These architectures offer higher processing capability through an abundance of cores, and their native core-to-core communication can be exploited via message passing to provide system scalability. Despite these advantages, manycores pose predictability challenges that can affect both performance and real-time capabilities.
In this work, we develop efficient and predictable group communication using message passing, specifically designed for large core counts in 2D mesh NoC architectures. We have implemented the most commonly used collectives such that they incur low latency and offer high timing predictability, making them suitable for balanced parallelization of scalable high-performance and embedded/real-time systems alike. Experimental results on a single-die 64-core hardware platform show that our collectives reduce communication times by up to 95% for single-packet messages and by up to 98% for longer messages. Depending on the group primitive, this superior performance holds either for all message sizes or only for small message sizes. In addition, our communication primitives have significantly lower variance than prior approaches, thereby providing more balanced parallel execution progress and better real-time predictability.
This work was supported in part by NSF grants 0905181 and 1239246.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Yagna, K., Patil, O., Mueller, F. (2016). Efficient and Predictable Group Communication for Manycore NoCs. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_20
DOI: https://doi.org/10.1007/978-3-319-41321-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer Science (R0)