
Light-weight protocols for wire-speed ordering

Authors:

Larry Dennison

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Article No.: 6, Pages 1 - 12

Published: 11 November 2018

Abstract

We describe light-weight protocols for selective packet ordering in out-of-order networks that carry memory traffic. The protocols are designed for heterogeneous high-performance systems, in particular accelerated systems whose endpoints have few resources available for interfacing with the network.

The protocols preserve the semantics of a relaxed memory ordering model as adopted by highly-threaded many-core processors and accelerators.

The protocols achieve link-rate performance through the following techniques: (1) speculative connection setup avoids the round-trip delays found in protocols with little knowledge about endpoint resources; (2) target-side ordering avoids the round-trip delays found in source-side ordering mechanisms; (3) fine-grained ordering removes dependencies unwarranted by program code, avoiding the cumulative ordering dependencies caused by coarse-grained ordering; and (4) ordering relaxations and optimizations are provided for producer/consumer communication patterns.

We describe two ordering protocols that provide (1) strict sequential ordering and (2) relaxed ordering for multi-packet transfers. The protocols impose no restrictions on routing, including multipath routing.
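
The abstract does not spell out the mechanisms, so the following is only a minimal sketch of the general idea behind target-side, fine-grained ordering, under stated assumptions: packets carry a per-stream sequence number, and the target holds early arrivals until the missing predecessors show up. The class name, the `(source, stream)` key, and the sequence-number scheme are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class TargetReorderBuffer:
    """Hypothetical target-side reorder buffer keyed by (source, stream).

    Assumption: each packet carries a sequence number that starts at 0
    and increases by 1 within its own (source, stream) pair.
    """

    def __init__(self):
        # Next sequence number expected for each (source, stream).
        self.expected = defaultdict(int)
        # Packets that arrived early, held until their turn comes.
        self.pending = defaultdict(dict)
        # Payloads in the order they were committed at the target.
        self.committed = []

    def receive(self, source, stream, seq, payload):
        key = (source, stream)
        self.pending[key][seq] = payload
        # Commit every packet of this stream that is now in sequence.
        while self.expected[key] in self.pending[key]:
            self.committed.append(self.pending[key].pop(self.expected[key]))
            self.expected[key] += 1


buf = TargetReorderBuffer()
# Stream 0 packets arrive out of order; stream 1 is independent of stream 0.
buf.receive("A", 0, 1, "a0-second")  # held: seq 0 of stream 0 is missing
buf.receive("A", 1, 0, "a1-first")   # committed at once: no dependency on stream 0
buf.receive("A", 0, 0, "a0-first")   # releases both stream 0 packets in order
print(buf.committed)
```

In this sketch the source never waits for an acknowledgement before sending the next packet, which is the round trip that source-side ordering would incur. Keeping a separate counter per stream means a delayed packet in one stream does not stall another, which loosely mirrors the fine-grained ordering point above.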



Published In


SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

November 2018

932 pages

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press


Conference

SC18

Sponsor:

SIGHPC

SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 16, 2018

Dallas, Texas

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
