
A smart and novel approach for managing incast and in-network congestion through adaptive routing

Published: 01 October 2024


High-Performance Computing and Datacenter systems, with their numerous endnodes, demand an efficient interconnection network to prevent performance bottlenecks. Fat-Tree topologies are preferred for their high bisection bandwidth and multiple shortest-path routes. While existing adaptive routing performs well under light load or in-network congestion, it struggles with incast congestion. This paper proposes a new technique, called Smart Congestion-Aware Adaptive Routing (SCAR), which addresses both in-network and incast congestion. SCAR limits adaptivity for flows contributing to incast congestion by routing them deterministically, while employing adaptive routing for non-congesting flows. It also resolves in-network congestion by routing traffic flows through alternative routes. Simulation experiments on large Fat-Trees, using synthetic and trace-based traffic patterns that model realistic applications, show that SCAR reacts immediately to mitigate in-network congestion and incurs a reasonable delay in incast situations, whereas other state-of-the-art solutions cannot cope with incast and in-network congestion at the same time.
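The split described in the abstract (deterministic routing for flows heading to an incast-congested destination, adaptive routing for everything else) can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Switch` class, its fields, the destination-modulo port choice, and the "least-occupied queue" adaptive policy are all assumptions made for illustration, and how SCAR actually detects incast or selects ports is not stated on this page.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Switch:
    """Toy model of one Fat-Tree switch (hypothetical, for illustration only)."""
    up_queue_len: List[int]                               # occupancy of each upward port
    incast_dests: Set[int] = field(default_factory=set)   # destinations flagged as incast-congested

    def deterministic_port(self, dest: int) -> int:
        # Assumed deterministic policy: destination modulo port count, so
        # every packet for a given destination uses the same upward port.
        return dest % len(self.up_queue_len)

    def adaptive_port(self) -> int:
        # Assumed adaptive policy: least-occupied upward port,
        # ties broken by the lowest port index.
        return min(range(len(self.up_queue_len)),
                   key=self.up_queue_len.__getitem__)

    def select_up_port(self, dest: int) -> int:
        # Limit adaptivity for flows toward an incast-congested destination,
        # so they do not spread congestion across all alternative routes.
        if dest in self.incast_dests:
            return self.deterministic_port(dest)
        # Other flows may steer around in-network congestion.
        return self.adaptive_port()


sw = Switch(up_queue_len=[5, 1, 7, 3], incast_dests={6})
incast_choice = sw.select_up_port(6)   # destination 6 is flagged: fixed port
normal_choice = sw.select_up_port(9)   # destination 9 is not: least-loaded port
```

In this toy setup, packets for the flagged destination always leave through the same upward port regardless of queue state, while packets for other destinations follow the currently shortest queue.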


Devised solutions to reduce the impact of congestion (e.g., HoL blocking).
Analyzed issues with multi-path routing during congestion, especially in Fat-Trees.
Introduced Smart Congestion-Aware Adaptive Routing (SCAR) to mitigate congestion.
Conducted extensive simulations on Fat-Tree topologies with realistic workloads.
Showed that SCAR manages congestion irrespective of traffic pattern and network size.








Published In

Future Generation Computer Systems, Volume 159, Issue C
Oct 2024
580 pages


Elsevier Science Publishers B. V.



Author Tags

  1. High-performance computing
  2. Datacenters
  3. Interconnection networks
  4. Congestion management
  5. Adaptive routing


  • Research-article

