A fast and resource-conscious MPI message queue mechanism for large-scale jobs

Published: 01 January 2014

Abstract

The Message Passing Interface (MPI) message queues have been shown to grow proportionally to the job size for many applications. Given this behaviour, and because message queues are accessed very frequently, ensuring fast queue operations at large scale is of paramount importance in the current and upcoming exascale computing eras. Scalability, however, is two-fold. With the growing processor core density per node, and the smaller amount of memory expected per core at larger scales, a queue mechanism that is blind to memory requirements poses another scalability issue even if it solves the speed-of-operation problem. In this work we propose a multidimensional queue management mechanism whose operation time and memory overhead grow sub-linearly with the job size. We show why a novel approach is justified despite the existence of well-known, fast data structures such as binary search trees. We compare our proposal with a linked-list-based approach, which is not scalable in terms of speed of operation, and with an array-based method, which is not scalable in terms of memory consumption. Our multidimensional approach yields queue operation speedups that translate into up to a 4-fold execution time improvement over the linked-list design for the applications studied in this work. It also shows a consistently lower memory footprint than the array-based design. Finally, compared to the linked-list-based queue, our design yields cache miss rate improvements that are, on average, on par with those of the array-based design.

Highlights

  • A new MPI message queue design tailored for very large-scale jobs.
  • A design based on a 4-D data container that exploits process rank decomposition.
  • The effect of job size on message queue operations is mitigated.
  • Fruitless message queue searches are optimized via early detection.
  • Scalability is provided for both execution speed and memory consumption.
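To make the rank-decomposition idea concrete, here is a minimal C sketch; it is not the authors' implementation, and the radix choice (roughly the fourth root of the job size), the names, and the example values are illustrative assumptions. It only shows how a sender rank can be split into four small digits, each of which would address one dimension of a 4-D queue container.

/* Minimal sketch (not the paper's code): decompose an MPI sender rank into
 * four base-`radix` digits so that each digit indexes one dimension of a
 * sparse 4-D message-queue container.  The radix value and all names here
 * are illustrative assumptions.  Compile with: cc sketch.c -lm */
#include <math.h>
#include <stdio.h>

typedef struct {
    unsigned d0, d1, d2, d3;        /* 4-D coordinates of a sender rank */
} rank_coord_t;

/* Split `rank` into four base-`radix` digits, least-significant first.
 * With radix close to job_size^(1/4), every dimension stays small even
 * for very large jobs, which keeps per-dimension searches short. */
static rank_coord_t decompose_rank(unsigned rank, unsigned radix)
{
    rank_coord_t c;
    c.d0 = rank % radix;  rank /= radix;
    c.d1 = rank % radix;  rank /= radix;
    c.d2 = rank % radix;  rank /= radix;
    c.d3 = rank;                    /* most-significant digit */
    return c;
}

int main(void)
{
    unsigned job_size = 1u << 20;                             /* 2^20 processes */
    unsigned radix    = (unsigned)ceil(pow(job_size, 0.25));  /* 32 in this case */
    rank_coord_t c    = decompose_rank(777215u, radix);

    printf("radix=%u, rank 777215 -> (%u, %u, %u, %u)\n",
           radix, c.d3, c.d2, c.d1, c.d0);
    return 0;
}

In a layout of this kind, a search for messages from a given sender can presumably stop as soon as it reaches a dimension whose entry is empty, which is one plausible reading of the early detection of fruitless searches mentioned in the highlights.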

Published In

Future Generation Computer Systems, Volume 30, Issue C
January 2014
307 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Exascale
  2. MPI
  3. Message queues
  4. Multidimensional searches
  5. Scalability

Qualifiers

  • Article

Cited By

  • (2019) MPI tag matching performance on ConnectX and ARM. In: Proceedings of the 26th European MPI Users' Group Meeting, pp. 1-10. DOI: 10.1145/3343211.3343224. Online publication date: 11-Sep-2019.
  • (2019) Evaluating tradeoffs between MPI message matching offload hardware capacity and performance. In: Proceedings of the 26th European MPI Users' Group Meeting, pp. 1-11. DOI: 10.1145/3343211.3343223. Online publication date: 11-Sep-2019.
  • (2018) A Dedicated Message Matching Mechanism for Collective Communications. In: Workshop Proceedings of the 47th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3229710.3229712. Online publication date: 13-Aug-2018.
  • (2018) The Case for Semi-Permanent Cache Occupancy. In: Proceedings of the 47th International Conference on Parallel Processing, pp. 1-11. DOI: 10.1145/3225058.3225130. Online publication date: 13-Aug-2018.
  • (2018) Measuring Multithreaded Message Matching Misery. In: Euro-Par 2018: Parallel Processing, pp. 480-491. DOI: 10.1007/978-3-319-96983-1_34. Online publication date: 27-Aug-2018.
  • (2014) Nonblocking epochs in MPI one-sided communication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 475-486. DOI: 10.1109/SC.2014.44. Online publication date: 16-Nov-2014.
