Article

A Hardware Acceleration Unit for MPI Queue Processing

Authors:

Keith D. Underwood,

K. Scott Hemmert,

Arun Rodrigues,

Richard Murphy,

Ron Brightwell

IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01

Page 96.2

https://doi.org/10.1109/IPDPS.2005.30

Published: 04 April 2005

Abstract

Modern scientific applications rely heavily on the MPI Standard, so it has become critical for MPI implementations to be as capable and as fast as possible. This has led some of the fastest modern networks to offload aspects of MPI processing to an embedded processor on the network interface. This important capability has significant performance implications. Most notably, processing long queues of posted receives or unexpected messages takes substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate-length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate-length queues while adding virtually no overhead for extremely short queues.
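To make the queue-processing cost concrete, the following is a minimal software sketch of the linear search that an incoming message triggers over a posted-receive queue. It is not the associative matching structure the paper proposes. It only illustrates standard MPI matching semantics (communicator context, source, and tag, with wildcards, first match in posting order wins). The names `PostedReceive`, `match_incoming`, `buffer_id`, and the wildcard values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Illustrative stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG (values are assumed).
ANY_SOURCE = -1
ANY_TAG = -1


@dataclass
class PostedReceive:
    context: int    # communicator context id
    source: int     # sender rank, or ANY_SOURCE
    tag: int        # message tag, or ANY_TAG
    buffer_id: int  # hypothetical handle for the user's receive buffer


def matches(recv: PostedReceive, context: int, source: int, tag: int) -> bool:
    """Return True if an incoming message header satisfies a posted receive."""
    return (
        recv.context == context
        and recv.source in (ANY_SOURCE, source)
        and recv.tag in (ANY_TAG, tag)
    )


def match_incoming(
    queue: List[PostedReceive], context: int, source: int, tag: int
) -> Tuple[Optional[PostedReceive], int]:
    """Walk the queue in posting order and remove the first matching entry.

    Returns the matched entry (or None) and the number of entries examined,
    which is the cost that grows with queue length.
    """
    for i, recv in enumerate(queue):
        if matches(recv, context, source, tag):
            return queue.pop(i), i + 1
    return None, len(queue)
```

In this sketch the number of entries examined grows linearly with the position of the match, which is the per-message traversal cost the abstract says is substantially higher on embedded processors.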


Published In

IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01

April 2005

ISBN:0769523129

Publisher

IEEE Computer Society

United States
