rmalloc() and rpipe(): a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging

Authors:

Udayanga Wickramasinghe,

Andrew Lumsdaine

ROSS'18: Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers

Article No.: 2, Pages 1 - 9

https://doi.org/10.1145/3217189.3217191

Published: 12 June 2018

Abstract

Optimizing communication is essential for high-performance computing because synchronization bottlenecks limit the overall performance and scalability of parallel applications. Today's cutting-edge computing hardware and network interfaces such as Cray Aries/Gemini offer extremely low-latency, high-bandwidth remote memory access (RMA) operations for optimized data movement. However, for data to move efficiently between two logical processing units, software substrates must properly exploit the hardware resources of the underlying fabric. Overheads due to coarse-grained synchronization and stalls during irregular access to remote memory regions may hint at two adverse effects of resource under-utilization, in time and in space. We introduce a uGNI-based distributed remote memory allocator called "rmalloc", which expands RDMA-enabled memory utilization, and a communication substrate called "rpipe", which tries to mitigate synchronization bottlenecks. Our UNIX-inspired RMA programming model is simple to use and applies equally to higher-level applications and lower-level runtime systems for enabling efficient data movement. Our micro-benchmark results suggest that the default next-fit "rmalloc" allocator outperforms MPI-3.0 RMA by 1.5X and up to 6X in most cases, while other variants of "rmalloc" (i.e., best-fit and worst-fit) reduce external fragmentation and perform comparably to or better than the default "rmalloc" allocator for irregular RMA.
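The abstract names three placement policies (next-fit, best-fit, worst-fit) without describing how "rmalloc" implements them. As a rough illustration of how such policies differ in general, the following is a minimal sketch of a textbook free-list allocator over a single contiguous region. It is not the authors' implementation and involves no RDMA or uGNI calls; the names `Arena`, `alloc`, `release`, and `demo` are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple


class Arena:
    """Toy allocator over [0, size): keeps a sorted list of free holes.

    Hypothetical illustration only; not the rmalloc API.
    """

    def __init__(self, size: int, policy: str = "next-fit") -> None:
        self.policy = policy                      # "next-fit", "best-fit" or "worst-fit"
        self.free: List[Tuple[int, int]] = [(0, size)]  # (offset, length) holes
        self.cursor = 0                           # offset where the last allocation ended

    def _pick(self, n: int) -> Optional[int]:
        """Return the index of the hole chosen for a request of n units."""
        fits = [i for i, (_, length) in enumerate(self.free) if length >= n]
        if not fits:
            return None
        if self.policy == "best-fit":             # smallest hole that still fits
            return min(fits, key=lambda i: self.free[i][1])
        if self.policy == "worst-fit":            # largest hole available
            return max(fits, key=lambda i: self.free[i][1])
        # next-fit: first fitting hole starting at or after the cursor,
        # wrapping around to the start of the region if none is found
        later = [i for i in fits if self.free[i][0] >= self.cursor]
        return later[0] if later else fits[0]

    def alloc(self, n: int) -> Optional[int]:
        """Carve n units from the chosen hole; return its offset or None."""
        i = self._pick(n)
        if i is None:
            return None
        off, length = self.free[i]
        if length == n:
            del self.free[i]
        else:
            self.free[i] = (off + n, length - n)
        self.cursor = off + n
        return off

    def release(self, off: int, n: int) -> None:
        """Return a block to the free list and merge adjacent holes."""
        self.free.append((off, n))
        self.free.sort()
        merged: List[Tuple[int, int]] = []
        for o, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((o, length))
        self.free = merged


def demo(policy: str) -> Optional[int]:
    """Fragment a 100-unit arena, then request 20 units under the policy."""
    arena = Arena(100, policy)
    a = arena.alloc(30)
    arena.alloc(10)
    c = arena.alloc(20)
    arena.alloc(10)
    arena.release(a, 30)   # leaves a 30-unit hole at the front
    arena.release(c, 20)   # leaves a 20-unit hole in the middle
    return arena.alloc(20)
```

Calling `demo` with each policy places the same 20-unit request in a different hole: next-fit continues from where the last allocation ended, best-fit takes the exactly sized middle hole, and worst-fit takes one of the largest holes. In this toy setting, best-fit leaves the larger holes intact, which is one common way a policy can reduce external fragmentation.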

Published In

ROSS'18: Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers

June 2018

44 pages

ISBN: 9781450358644

DOI: 10.1145/3217189

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HPDC '18: The 27th International Symposium on High-Performance Parallel and Distributed Computing

June 12, 2018

Tempe, AZ, USA

Acceptance Rates

ROSS'18 Paper Acceptance Rate 5 of 7 submissions, 71%;

Overall Acceptance Rate 58 of 169 submissions, 34%
