DOI: 10.5555/3291656.3291664

Light-weight protocols for wire-speed ordering

Published: 11 November 2018

Abstract

We describe light-weight protocols for selective packet ordering in out-of-order networks that carry memory traffic. The protocols are designed for heterogeneous high-performance systems, in particular accelerated systems whose endpoints have few resources available for interfacing with the network.
The protocols preserve the semantics of a relaxed memory ordering model as adopted by highly-threaded many-core processors and accelerators.
The protocols achieve link-rate performance through the following techniques: (1) speculative connection setup, which avoids the round-trip delays found in protocols with little knowledge about endpoint resources; (2) target-side ordering, which avoids the round-trip delays found in source-side ordering mechanisms; (3) fine-grained ordering, which removes dependencies unwarranted by program code and so avoids the cumulative ordering dependencies caused by coarse-grained ordering; and (4) ordering relaxations and optimizations for producer/consumer communication patterns.
We describe two ordering protocols that provide (1) strict sequential ordering and (2) relaxed ordering for multi-packet transfers. The protocols impose no restrictions on routing, including multipath routing.
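
The abstract does not give packet formats or state machines, so the following is only a minimal sketch of the general idea behind target-side ordering, not the paper's protocol. It assumes each packet carries a per-stream sequence number, and that a relaxed multi-packet transfer consists of a known number of data packets plus one completion flag (a producer/consumer pattern). The class names, fields, and the `expected` count are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StrictOrderTarget:
    """Hypothetical target that delivers packets strictly in sequence order.

    Packets may arrive in any order (e.g. over multiple paths); the target
    buffers early arrivals and drains them once the gap is filled, so the
    source never waits for a round trip.
    """
    next_seq: int = 0                                  # next sequence number to deliver
    pending: Dict[int, str] = field(default_factory=dict)
    delivered: List[str] = field(default_factory=list)

    def receive(self, seq: int, payload: str) -> None:
        self.pending[seq] = payload
        # Drain every buffered packet that is now in order.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


@dataclass
class RelaxedOrderTarget:
    """Hypothetical target for a relaxed multi-packet transfer.

    Data packets are committed as they arrive, in any order. Only the
    completion flag is ordered: it becomes visible to the consumer after
    all `expected` data packets have been committed.
    """
    expected: int                                      # assumed known number of data packets
    committed: List[str] = field(default_factory=list)
    flag_arrived: bool = False
    flag_visible: bool = False

    def receive_data(self, payload: str) -> None:
        self.committed.append(payload)                 # no reordering of data
        self._maybe_release()

    def receive_flag(self) -> None:
        self.flag_arrived = True                       # flag may overtake data
        self._maybe_release()

    def _maybe_release(self) -> None:
        if self.flag_arrived and len(self.committed) == self.expected:
            self.flag_visible = True


if __name__ == "__main__":
    strict = StrictOrderTarget()
    for seq, payload in [(2, "c"), (0, "a"), (1, "b")]:
        strict.receive(seq, payload)
    print(strict.delivered)

    relaxed = RelaxedOrderTarget(expected=2)
    relaxed.receive_flag()                             # flag arrives first
    relaxed.receive_data("y")
    relaxed.receive_data("x")
    print(relaxed.committed, relaxed.flag_visible)
```

In this sketch the strict target holds back every out-of-order packet, while the relaxed target holds back only the flag. This illustrates why a finer-grained ordering rule leaves fewer packets waiting on one another.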



Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018
932 pages

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. memory interconnects
  2. ordering
  3. out-of-order networks
  4. protocols

Qualifiers

  • Research-article

Conference

SC18
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
