Cluster communication protocols for parallel-programming systems

Authors:

Raoul A. F. Bhoedjang,

Rutger F. H. Hofman

ACM Transactions on Computer Systems (TOCS), Volume 22, Issue 3

Pages 281 - 325

https://doi.org/10.1145/1012268.1012269

Published: 01 August 2004

Abstract

Clusters of workstations are a popular platform for high-performance computing. For many parallel applications, efficient use of a fast interconnection network is essential for good performance. Several modern System Area Networks include programmable network interfaces that can be tailored to perform protocol tasks that otherwise would need to be done by the host processors. Finding the right trade-off between protocol processing at the host and the network interface is difficult in general. In this work, we systematically evaluate the performance of different implementations of a single, user-level communication interface. The implementations make different architectural assumptions about the reliability of the network and the capabilities of the network interface. The implementations differ accordingly in their division of protocol tasks between host software, network-interface firmware, and network hardware. Also, we investigate the effects of alternative data-transfer methods and multicast implementations, and we evaluate the influence of packet size. Using microbenchmarks, parallel-programming systems, and parallel applications, we assess the performance of the different implementations at multiple levels. We use two hardware platforms with different performance characteristics to validate our conclusions. We show how moving protocol tasks to a relatively slow network interface can yield both performance advantages and disadvantages, depending on specific characteristics of the application and the underlying parallel-programming system.

Reviews

Reviewer: Thomas Rauber

This paper considers the performance of communication protocols on networks typically used for cluster systems. Based on a single, low-level communication application programming interface (API), LCI, which supports a variety of parallel programming systems, the authors investigate the effect of different implementation decisions concerning reliability, data transfer mechanism, maximum transfer unit, and multicast on performance at different levels of program abstraction. Low-level performance characteristics are captured by microbenchmarks. Runtime-system-specific benchmarks are used for the four programming systems considered: the Message Passing Interface (MPI); the distributed shared memory (DSM) system CRL (C Region Library); Orca; and the distributed-search system Multigame. Performance impact at the application level is investigated by considering eight parallel applications. The paper contains a detailed evaluation of two Myrinet-based cluster platforms, showing that there is no single best implementation variant; what is best depends on application requirements and on the communication patterns produced by the programming system. The paper summarizes the insights of the study by giving several recommendations at the end of the introduction. It describes many details of the different implementations, and following them requires basic knowledge of the network hardware and protocols. This work is written primarily for those interested in the design and realization of communication protocols for parallel programming systems.

Online Computing Reviews Service

Published In

ACM Transactions on Computer Systems Volume 22, Issue 3

August 2004

99 pages

ISSN:0734-2071

EISSN:1557-7333

DOI:10.1145/1012268

Copyright © 2004 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Cited By

Massetto F, Sato L, Li K (2012). A novel strategy for building interoperable MPI environment in heterogeneous high performance systems. The Journal of Supercomputing 60:1 (87-116). doi:10.1007/s11227-009-0272-y. Online publication date: 1-Apr-2012
https://dl.acm.org/doi/10.1007/s11227-009-0272-y
Bergman K, Small B, Bergman L (2012). Next-Generation Cluster Networks. Handbook of Computer Networks (93-109). doi:10.1002/9781118256107.ch7. Online publication date: 4-Jan-2012
https://doi.org/10.1002/9781118256107.ch7
Verstoep K, Maassen J, Bal H, Romein J (2008). Experiences with Fine-Grained Distributed Supercomputing on a 10G Testbed. Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (376-383). doi:10.1109/CCGRID.2008.71. Online publication date: 19-May-2008
https://dl.acm.org/doi/10.1109/CCGRID.2008.71
Cappello F, Bal H (2007). Toward an International "Computer Science Grid". Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (3-12). doi:10.1109/CCGRID.2007.115. Online publication date: 14-May-2007
https://dl.acm.org/doi/10.1109/CCGRID.2007.115
Taboada G, Tourino J, Doallo R (2006). Efficient Java Communication Protocols on High-speed Cluster Interconnects. Proceedings. 2006 31st IEEE Conference on Local Computer Networks (264-271). doi:10.1109/LCN.2006.322110. Online publication date: Nov-2006
https://doi.org/10.1109/LCN.2006.322110
Taboada G, Touriño J, Doallo R (2006). Non-blocking java communications support on clusters. Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface (256-265). doi:10.1007/11846802_38. Online publication date: 17-Sep-2006
https://dl.acm.org/doi/10.1007/11846802_38
Taboada G, Tourino J, Doallo R (2005). Designing Efficient Java Communications on Clusters. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 5 - Volume 06. doi:10.1109/IPDPS.2005.164. Online publication date: 4-Apr-2005
https://dl.acm.org/doi/10.1109/IPDPS.2005.164
Santos T, Frohlich A (2005). A Customizable Component for Low-Level Communication Software. Proceedings of the 19th International Symposium on High Performance Computing Systems and Applications (58-64). doi:10.1109/HPCS.2005.5. Online publication date: 15-May-2005
https://dl.acm.org/doi/10.1109/HPCS.2005.5
