A fast and resource-conscious MPI message queue mechanism for large-scale jobs

Published: 01 January 2014

Abstract

The Message Passing Interface (MPI) message queues have been shown to grow proportionally to the job size for many applications. Given this behaviour, and because message queues are accessed very frequently, ensuring fast queue operations at large scale is of paramount importance in the current and upcoming exascale computing eras. Scalability, however, is two-fold. With the growing processor core density per node, and the smaller amount of memory expected per core at larger scales, a queue mechanism that is blind to memory requirements poses another scalability issue even if it solves the speed-of-operation problem. In this work we propose a multidimensional queue management mechanism whose operation time and memory overhead grow sub-linearly with the job size. We show why a novel approach is justified despite the existence of well-known, fast data structures such as binary search trees. We compare our proposal with a linked-list-based approach, which is not scalable in terms of speed of operation, and with an array-based method, which is not scalable in terms of memory consumption. Our multidimensional approach yields queue operation speedups that translate into up to a 4-fold execution time improvement over the linked-list design for the applications studied in this work. It also shows a consistently lower memory footprint than the array-based design. Finally, compared to the linked-list-based queue, our design yields cache miss rate improvements that are, on average, on par with those of the array-based design.

Highlights

  • A new MPI message queue design tailored for very large-scale jobs.
  • A design based on a 4-D data container that exploits process rank decomposition.
  • The effect of job size on message queue operations is mitigated.
  • Fruitless message queue searches are optimized via early detection.
  • Scalability is provided for both execution speed and memory consumption.
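To make the rank-decomposition idea concrete, here is a minimal C sketch; it is not the authors' implementation, and the radix choice (roughly the fourth root of the job size), the names, and the example values are illustrative assumptions. It only shows how a sender rank can be split into four small digits, each of which would address one dimension of a 4-D queue container.

/* Minimal sketch (not the paper's code): decompose an MPI sender rank into
 * four base-`radix` digits so that each digit indexes one dimension of a
 * sparse 4-D message-queue container.  The radix value and all names here
 * are illustrative assumptions.  Compile with: cc sketch.c -lm */
#include <math.h>
#include <stdio.h>

typedef struct {
    unsigned d0, d1, d2, d3;        /* 4-D coordinates of a sender rank */
} rank_coord_t;

/* Split `rank` into four base-`radix` digits, least-significant first.
 * With radix close to job_size^(1/4), every dimension stays small even
 * for very large jobs, which keeps per-dimension searches short. */
static rank_coord_t decompose_rank(unsigned rank, unsigned radix)
{
    rank_coord_t c;
    c.d0 = rank % radix;  rank /= radix;
    c.d1 = rank % radix;  rank /= radix;
    c.d2 = rank % radix;  rank /= radix;
    c.d3 = rank;                    /* most-significant digit */
    return c;
}

int main(void)
{
    unsigned job_size = 1u << 20;                             /* 2^20 processes */
    unsigned radix    = (unsigned)ceil(pow(job_size, 0.25));  /* 32 in this case */
    rank_coord_t c    = decompose_rank(777215u, radix);

    printf("radix=%u, rank 777215 -> (%u, %u, %u, %u)\n",
           radix, c.d3, c.d2, c.d1, c.d0);
    return 0;
}

In a layout of this kind, a search for messages from a given sender can presumably stop as soon as it reaches a dimension whose entry is empty, which is one plausible reading of the early detection of fruitless searches mentioned in the highlights.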

Published In

Future Generation Computer Systems, Volume 30, Issue C
January 2014
307 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Exascale
  2. MPI
  3. Message queues
  4. Multidimensional searches
  5. Scalability

Qualifiers

  • Article

Cited By

  • (2019) MPI tag matching performance on ConnectX and ARM. In: Proceedings of the 26th European MPI Users' Group Meeting, pp. 1-10. DOI: 10.1145/3343211.3343224. Online publication date: 11-Sep-2019.
  • (2019) Evaluating tradeoffs between MPI message matching offload hardware capacity and performance. In: Proceedings of the 26th European MPI Users' Group Meeting, pp. 1-11. DOI: 10.1145/3343211.3343223. Online publication date: 11-Sep-2019.
  • (2018) A Dedicated Message Matching Mechanism for Collective Communications. In: Workshop Proceedings of the 47th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3229710.3229712. Online publication date: 13-Aug-2018.
  • (2018) The Case for Semi-Permanent Cache Occupancy. In: Proceedings of the 47th International Conference on Parallel Processing, pp. 1-11. DOI: 10.1145/3225058.3225130. Online publication date: 13-Aug-2018.
  • (2018) Measuring Multithreaded Message Matching Misery. In: Euro-Par 2018: Parallel Processing, pp. 480-491. DOI: 10.1007/978-3-319-96983-1_34. Online publication date: 27-Aug-2018.
  • (2014) Nonblocking epochs in MPI one-sided communication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 475-486. DOI: 10.1109/SC.2014.44. Online publication date: 16-Nov-2014.
