
shmem4py: High-Performance One-Sided Communication for Python Applications

Marcin Rogowski, King Abdullah University of Science and Technology, Saudi Arabia and NVIDIA, Poland, marcin.rogowski@kaust.edu.sa
Jeff R. Hammond, NVIDIA Helsinki Oy, Finland, jeff.science@gmail.com
David E. Keyes, King Abdullah University of Science and Technology, Saudi Arabia, david.keyes@kaust.edu.sa
Lisandro Dalcin, King Abdullah University of Science and Technology, Saudi Arabia, dalcinl@gmail.com

This paper describes shmem4py, a Python wrapper for the OpenSHMEM application programming interface (API) that follows a design similar to that of the well-known mpi4py package. OpenSHMEM is a descendant of the one-sided communication library for the Cray T3D and is known for its uncompromising performance in low-latency and high-throughput use cases involving one-sided and collective communication. It is arguably one of the most efficient and portable abstractions for modern network architectures. Thanks to tight interoperability with NumPy, shmem4py provides a convenient parallel programming framework leveraging both the high-productivity NumPy feature set and the high-performance networking capabilities of OpenSHMEM. This paper discusses the design of shmem4py and its performance characteristics in a variety of communication patterns relative to lower-level languages (C) as well as MPI and mpi4py.

CCS Concepts: • Computing methodologies → Parallel computing methodologies; • Software and its engineering.

Keywords: Python, OpenSHMEM, MPI, shared memory, High Performance Computing

ACM Reference Format:
Marcin Rogowski, Jeff R. Hammond, David E. Keyes, and Lisandro Dalcin. 2023. shmem4py: High-Performance One-Sided Communication for Python Applications. In Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 2023), November 12–17, 2023, Denver, CO, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3624062.3624602

1 INTRODUCTION

Python is one of the most widely used programming languages in computing today. It owes its popularity to being easy to learn and use, as well as to its vast ecosystem of libraries. In particular, Python has become the de facto language of data science and machine learning. Packages such as NumPy [22], SciPy [51], Pandas [34], Matplotlib [26], and Project Jupyter [30] have been significant contributors to this success. While Python itself is not particularly fast due to its interpreted nature, Python applications that use libraries written in lower-level languages such as C, C++, and Fortran can achieve excellent performance in a range of scenarios. More recently, just-in-time (JIT) compilation frameworks have made it even more straightforward to achieve performance without leaving the comfort of Python syntax. Python JIT frameworks include Numba [33], JAX [5], Theano [48], and CuPy [39].

Python was conceived in the 1990s, and initial design choices in its default and most popular implementation, CPython, hindered its ability to adapt to the multicore era. CPython's global interpreter lock (GIL) prevented multithreaded applications from taking full advantage of CPUs with multiple computing cores [4]. Due to this limitation, Python has always favored a message-passing, process-based approach to parallel computing.

Since 2008, the Python standard library has provided the multiprocessing package [36] to assist developers with process-based parallelism within a single compute node. More recently, in 2011, the Python standard library added concurrent.futures [41] to make task-based parallelism even easier. Both of these solutions, however, are based on standard operating system facilities for process management and interprocess communication, and therefore are limited to execution within a single compute node.

For distributed memory parallelism across compute node boundaries, Python supports a range of models at various levels of abstraction. Towards the bottom of the stack, the standard networking interfaces available through the socket module of the Python standard library are a feasible, although very cumbersome, approach. The Python wrapper to ZeroMQ [23] is close to socket programming in spirit but offers a much richer feature set and a simpler interface. In the scientific community, MPI is the de facto standard for message passing on distributed memory architectures. The most popular Python MPI wrapper is mpi4py [12], which allows for both low-level communication of array-like data and higher-level message passing involving arbitrary Python objects. At the higher end of the spectrum, a much larger number of solutions is available. These typically follow the Single Program Multiple Data (SPMD) approach, with interprocess communication mostly hidden from the user. Among these packages, we mention Python wrappers for distributed math libraries such as Global Arrays [32] and PETSc [13], distributed-memory backends for NumPy [3, 11, 25, 37, 38], and task-based frameworks such as Dask [9, 43], Ray [35], and the mpi4py.futures package [44].

In this paper, we describe shmem4py, a framework for using the OpenSHMEM API [6] within Python applications. OpenSHMEM resembles MPI in some ways but specializes in one-sided communication and focuses on features that have native support in high-performance networks. Like mpi4py, shmem4py is a lower-level abstraction. Users are required to invoke communication and synchronization calls explicitly. shmem4py provides the backbone communication capabilities and targets the development of higher-level parallel applications and frameworks.

shmem4py intends to fulfill at least three goals:

  1. exposing the high-performance communication capability of OpenSHMEM libraries to Python applications;
  2. enabling OpenSHMEM users to do rapid prototyping in Python, before switching to lower-level languages (usually C); and
  3. providing a foundation for building a Python interface to NVSHMEM [24, 40] and related GPU-centric communication libraries.

Goal 1 is straightforward; if OpenSHMEM provides better performance than MPI for a particular communication pattern, then it is useful to make it accessible from Python. The need for goal 2 may be less obvious, but we have found that designing parallel algorithms with Python and mpi4py is much more productive than implementing them from scratch in C/C++ or Fortran. Goal 3 was one of the inspirations for starting the shmem4py effort. NVSHMEM (and its counterparts on other GPU platforms) is arguably the best abstraction for GPU-to-GPU communication. Given the strong interest in Python and multi-GPU computing for machine learning applications, direct access to this communication API would allow for fast prototyping of new communication algorithms with direct control over the low-level protocol. We hope that shmem4py will expose NVSHMEM to Python applications in the future.

2 BACKGROUND

2.1 OpenSHMEM

OpenSHMEM is the direct descendant of the SHMEM programming API [17] provided by Cray on the revolutionary T3D supercomputer [10]. SHMEM was a C library abstraction for the T3D network capability, which supported direct access to all system memory [31]. Early versions of SHMEM contained system-specific features related to cache-coherence protocols and had implementation-defined behavior, so users often wrote hardware-aware code, whether they knew it or not. The OpenSHMEM standardization effort was initiated to reconcile the differences between different implementations, eliminate ambiguity in semantics and make it easier to both use and implement OpenSHMEM across a wide range of systems. As a result of this effort, OpenSHMEM is available on a wide range of platforms, from supercomputers to laptops, and now supports the same level of portability as MPI-3, which is the basis for at least one implementation (OSHMPI) [21, 46]. OpenSHMEM is available as part of both MVAPICH2-X [29] and Open MPI [18], which support InfiniBand and other HPC interconnects, and is provided by Cray and SGI on their systems. Sandia OpenSHMEM is a high-quality reference implementation that supports OFI/libfabric [7, 19], which supports a number of HPC interconnect technologies.

2.2 Related work

There have been at least two related attempts to build OpenSHMEM support for Python. One project carries the name shmem4py but appears incomplete and, in any case, not designed similarly to mpi4py or this work [1]. PySHMEM [52] is hard to evaluate due to a lack of source code and a rather brief presentation of the API design in the associated paper; what can be observed does not appear to follow the OpenSHMEM 1.5 API closely, for better or worse, at least with respect to memory management functions.

3 IMPLEMENTATION

The latest OpenSHMEM specification defines library functions, constants, variables, and language bindings for the C and C++ programming languages. The first design decision in implementing shmem4py was choosing, from the many options available, the approach for accessing OpenSHMEM functionality from Python. shmem4py uses the C Foreign Function Interface for Python (CFFI) [42], which we consider an appropriate tool for accessing a C API with the peculiarities of OpenSHMEM. Given a C header file listing all the types, constants, and functions of the OpenSHMEM C API, CFFI automatically generates a Python extension module. This low-level extension module contains large sets of functions named after the OpenSHMEM operation and the data types they operate on. Higher-level pure Python code then builds a name string out of the operation and data type and queries the low-level CFFI extension module.

An alternative implementation of shmem4py could have used Cython. However, Cython was designed as a Python language extension, not as an automatic wrapper generator. Due to the peculiarities of the OpenSHMEM API, a Cython-based implementation of shmem4py would require large portions of code devoted to dispatching function calls based on data types. Those issues could nonetheless be addressed by code generation with a template engine, similar to that used to implement NumPy internals.

The choice of CFFI over Cython raises some concerns regarding performance. The runtime translation from data type to function name occurs in pure Python code and therefore introduces some minor latency that contributes to the overall Python function call overhead. Had we implemented shmem4py in Cython, this translation would occur in compiled C code and be significantly faster. Nevertheless, we expect such overhead to be noticeable only in latency-sensitive applications involving the communication of very small messages. The benchmarks we present in this paper include such scenarios.
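The following sketch illustrates this dispatch mechanism. It is illustrative only and not the actual shmem4py source: ffi and lib denote the handles exported by the CFFI-generated extension module (names assumed), and the data type table is abridged.

    import numpy as np
    # `ffi` and `lib` denote the handles exported by the CFFI-generated
    # low-level extension module (names assumed for this illustration).

    # Abridged mapping: NumPy dtype -> (OpenSHMEM type name, C element type).
    _TYPES = {
        np.dtype('int32'):   ('int32',  'int32_t'),
        np.dtype('int64'):   ('int64',  'int64_t'),
        np.dtype('float32'): ('float',  'float'),
        np.dtype('float64'): ('double', 'double'),
    }

    def put(target, source, pe):
        """Type-generic put(): dispatch to the typed OpenSHMEM C routine."""
        name, ctype = _TYPES[source.dtype]
        cfunc = getattr(lib, f'shmem_{name}_put')    # e.g. lib.shmem_double_put
        dest = ffi.cast(f'{ctype} *', ffi.from_buffer(target))
        src = ffi.cast(f'{ctype} *', ffi.from_buffer(source))
        cfunc(dest, src, source.size, pe)            # nelems and target PE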

Broadly speaking, shmem4py follows the same philosophy as mpi4py. Rather than hiding low-level APIs, shmem4py exposes as many features as possible to allow for maximum flexibility and to provide value to the experienced developer. Rather than building new ad hoc interfaces, shmem4py stays close to the original OpenSHMEM specification. Nonetheless, the shmem4py interface cannot be a one-to-one C-to-Python translation. The reason is twofold: the C and Python programming languages have fundamental differences, and Python users expect a feature set that fits the many language idioms and the existing software ecosystem.

OpenSHMEM C API functions are invoked with buffer pointers and sizes, while data types are encoded in each function name. shmem4py, on the other hand, is NumPy-centric. All shmem4py functions accept NumPy array objects as arguments. Memory addresses, buffer sizes, and data types are determined by introspection of the array objects in order to dispatch the appropriate low-level OpenSHMEM C routine. This approach allows for a convenient type-generic interface: the Python shmem4py functions operating on NumPy arrays translate data types at runtime to the corresponding low-level OpenSHMEM function names.

OpenSHMEM symmetric memory allocations are exposed to Python through NumPy array objects. NumPy memory management is automatic and relies on the Python garbage collector (GC), which by design executes in a non-deterministic way. In contrast, OpenSHMEM memory management is explicit and collective: allowing the Python GC to perform symmetric memory deallocations could lead to a deadlock. Therefore, shmem4py requires users to deallocate symmetric memory explicitly. The strong reliance on NumPy does not hinder interoperability: NumPy can easily wrap arbitrary memory buffers within its ndarray object type. In fact, this feature is what makes the current implementation of shmem4py possible.

As an example of how NumPy is integrated into shmem4py, we sketch a simplified implementation of the shmem.zeros() function, which mimics the interface of the numpy.zeros() function. shmem.zeros() wraps the OpenSHMEM shmem_calloc routine and instantiates a NumPy array object wrapping the symmetric memory buffer.
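A minimal version of such a sketch could look as follows (illustrative only; ffi and lib again denote the CFFI-generated handles, and error handling is reduced to the bare minimum):

    import numpy as np
    # `ffi` and `lib` again denote the CFFI-generated handles (assumed names).

    def zeros(shape, dtype=float):
        """Simplified shmem.zeros(): a zero-initialized symmetric NumPy array."""
        dtype = np.dtype(dtype)
        count = int(np.prod(shape))
        # Collective, zero-initialized symmetric allocation.
        ptr = lib.shmem_calloc(count, dtype.itemsize)
        if ptr == ffi.NULL:
            raise MemoryError('shmem_calloc() failed')
        # Wrap the symmetric buffer in an ndarray without copying; the pointer
        # must later be released explicitly (and collectively) with shmem_free().
        buf = ffi.buffer(ptr, count * dtype.itemsize)
        return np.frombuffer(buf, dtype=dtype).reshape(shape)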

The shmem4py package supports all major implementations of the OpenSHMEM specification:

  • Cray OpenSHMEMX
  • Open MPI OpenSHMEM (OSHMEM)
  • Open Source Software Solutions (OSSS) OpenSHMEM
  • OpenSHMEM Implementation on MPI v2 (OSHMPI)
  • Sandia OpenSHMEM (SOS)

4 USAGE EXAMPLE

This section contrasts a C program written against the OpenSHMEM 1.5 API, showcasing the use of the SHMEM_FCOLLECT collective, with the corresponding Python program using shmem4py, sketched below. Note the automatic, implicit handling of initialization and finalization of the OpenSHMEM runtime in the Python version, as well as the NumPy-based exposure of symmetric memory buffers.
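The listing below is a minimal sketch of such a shmem4py program, reconstructed for illustration: the helpers shmem.my_pe(), shmem.n_pes(), shmem.zeros(), shmem.fcollect(), and shmem.free() follow the shmem4py API, but the exact argument order of the collective call should be checked against the shmem4py documentation.

    from shmem4py import shmem
    import numpy as np

    mype = shmem.my_pe()               # rank of this processing element (PE)
    npes = shmem.n_pes()               # total number of PEs

    # Symmetric source and destination buffers, exposed as NumPy arrays.
    source = shmem.zeros(1, dtype='i')
    dest = shmem.zeros(npes, dtype='i')

    source[0] = mype                   # each PE contributes its own rank
    shmem.barrier_all()                # ensure 'source' is ready on every PE

    shmem.fcollect(dest, source)       # concatenate all contributions on all PEs

    assert np.all(dest == np.arange(npes, dtype='i'))

    # Symmetric memory must be released explicitly and collectively.
    shmem.free(source)
    shmem.free(dest)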

5 EXPERIMENTAL SETUP

We evaluate shmem4py in two dimensions. First, we compare the overhead of Python versus lower-level languages. We achieve this by comparing shmem4py versus existing implementations of the same communication pattern in C on a single node. For some of the benchmarks, we add an implementation in Python with critical kernels compiled with Numba to further highlight the performance difference between interpreted and compiled languages when it comes to arithmetic. In the second dimension, we compare shmem4py to mpi4py within a single node and on a supercomputer. Here, we expect similar conclusions to a comparison of OpenSHMEM to MPI in other languages, although it is useful to verify such a hypothesis.

5.1 Benchmarks

First, we use a simple PingPong test. Here, the data is communicated back and forth between two processes, using put and wait_until in the OpenSHMEM implementation or send-receive when using MPI. This benchmark is designed to compare the achievable latency and bandwidth, and the overhead of wrapping C functions in Python. Two processes are always used, and the message size varies between 8 B and 512 MiB. The benchmark only involves communication and memory copies without any computation. The benchmark is implemented in both C and Python.
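The inner loop of the OpenSHMEM variant can be sketched in shmem4py as follows. The sketch is illustrative rather than the actual benchmark code; the function names follow the shmem4py API, but the name of the comparison constant (shmem.CMP.EQ) and the exact signatures are assumptions to be checked against the shmem4py documentation.

    from shmem4py import shmem
    import numpy as np

    mype = shmem.my_pe()
    peer = 1 - mype                        # the benchmark runs on exactly two PEs
    reps, n = 100, 1024                    # repetitions and elements per message

    data = shmem.zeros(n, dtype='d')       # symmetric payload buffer
    flag = shmem.zeros(1, dtype='i')       # symmetric arrival flag
    stamp = np.zeros(1, dtype='i')         # local source for the flag value

    shmem.barrier_all()
    for i in range(1, reps + 1):
        if mype == i % 2:                  # sender in this round
            shmem.put(data, data, peer)    # deliver the payload to the peer
            shmem.fence()                  # order the payload before the flag
            stamp[0] = i
            shmem.put(flag, stamp, peer)   # signal message number i
        else:                              # receiver in this round
            shmem.wait_until(flag, shmem.CMP.EQ, i)
    shmem.barrier_all()

    shmem.free(data)
    shmem.free(flag)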

For more realistic benchmarks, we use the Parallel Research Kernels (PRK) [15, 49, 50]. This benchmark suite already supports a wide range of parallel programming models in multiple languages, including C+MPI, C+OpenSHMEM, Fortran+MPI, and Python+mpi4py. For this paper, we implemented Python+shmem4py versions of the Synch_p2p, Stencil, and Transpose benchmarks. For Synch_p2p and Stencil, we also implemented a Numba variant of the Python code.1 We briefly summarize these benchmarks in the following; additional information can be found in the literature [15, 49, 50]. The Synch_p2p and Stencil kernels use point-to-point communication. Synch_p2p has a wavefront synchronization pattern that makes it more latency-sensitive, while Stencil is a structured halo exchange that depends more on data throughput. In these benchmarks, the send-receive pattern of MPI is reimplemented in OpenSHMEM using a combination of put/fence/atomic_inc on the sending side and wait_until on the receiving side (sketched below). The Transpose kernel performs an all-to-all data exchange, which can be implemented in many ways. The design trade-offs include the amount of temporary buffering and global versus local synchronization. Implementations that use a collective all-to-all operation need more buffering, equal to the size of the input matrix, while implementations using one-sided communication need less buffering (by a factor of the number of processes). In order to extend our analysis to cover more MPI/OpenSHMEM functionality, we opted for the version based on all-to-all operations. As in [49], we use 49,152 × 49,152 point grids in all the benchmarks and a star stencil with a radius of four points for the Stencil kernel.
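The following sketch shows how such a matched send/receive pair can be emulated in shmem4py. It is illustrative only and not the actual PRK code; the helper names send/recv are ours, and the shmem.CMP.GE constant is an assumption.

    from shmem4py import shmem

    def send(src, dest, counter, pe):
        """Emulate a send: one-sided put of 'src' into the peer's symmetric
        'dest' buffer, a fence, and an atomic increment signaling arrival."""
        shmem.put(dest, src, pe)       # payload transfer
        shmem.fence()                  # payload is ordered before the signal
        shmem.atomic_inc(counter, pe)  # bump the peer's arrival counter

    def recv(counter, expected):
        """Emulate a receive: block until 'expected' messages have arrived."""
        shmem.wait_until(counter, shmem.CMP.GE, expected)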

5.2 Hardware

For single-node runs, a workstation with a single-socket, 64-core AMD Ryzen Threadripper PRO 3995WX was used. Multi-node experiments were performed on the Shaheen II supercomputer [20] hosted by King Abdullah University of Science and Technology (KAUST). It is an instance of the Cray XC40 [16] platform with the Cray Aries interconnect [2] and two Intel Xeon E5-2698v3 (“Haswell”) 16-core processors [28] per node. The scheduler assigned the nodes in an exclusive mode. However, as in any production environment, the network was shared with other users running on the supercomputer at the time.

5.3 Software

In single-node benchmarks, a workstation running Fedora 38 was used. System packages for Python 3.11.4 and MPICH 4.0.3 were installed. Multi-node experiments were performed on a Cray XC40 running Cray Linux (based on SUSE Linux Enterprise Server 15). Cray MPICH 7.7.18, Python 3.10.1 and Cray OpenSHMEMX 9.1.2 modules were used. In both scenarios, we used the initial release 1.0.0 of shmem4py [45] along with mpi4py 3.1.4, numpy 1.24.3 and numba 0.57.0, all installed using the pip package manager.

Single-node experiments were performed using different OpenSHMEM backends, with some of the results presented in Section 6.1. OSHMPI was installed from the main branch of its GitHub repository (commit 0ffb5a9) and used the MPICH 4.0.3 backend. Open MPI SHMEM was installed from the Open MPI 4.1.4 package and built with UCX 1.14.1. Open Source Software Solutions (OSSS) OpenSHMEM and Sandia OpenSHMEM were both installed from their respective GitHub repositories using the latest commits available at the time of writing (commits cbbb91d and f2f6b1e, respectively). As before, UCX 1.14.1 was used. For multi-node benchmarks, we used Cray OpenSHMEMX 9.1.2 exclusively, a high-quality, proprietary implementation of OpenSHMEM based on DMAPP [47] and available on the Cray XC40.

5.4 Reporting

Generally, we report the ∞-norm (maximum throughput or maximum FLOPS/s) over a number of repetitions of an experiment. Each experiment, however, may itself be an average time over a number of iterations. To give a concrete example, in the Parallel Research Kernels the number of iterations is a command line argument. The kernel is then executed 1 + n times, and the reported time is the average of the last n iterations. We execute such a kernel m times and report the best average time achieved over these m runs. Although this approach is slightly convoluted, our aim is to obtain the ideal performance for each programming model and to eliminate any system noise.
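The following sketch expresses this timing scheme in Python (illustrative only; the helper name best_average is ours):

    import time

    def best_average(kernel, n, m):
        """Run 'kernel' in m outer repetitions; in each, execute it 1 + n times
        and average the last n. Return the best (minimum) average time."""
        best = float('inf')
        for _ in range(m):
            kernel()                           # warm-up iteration, not timed
            t0 = time.perf_counter()
            for _ in range(n):
                kernel()
            avg = (time.perf_counter() - t0) / n
            best = min(best, avg)
        return best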

In the multi-node scenario, our reporting approach is similar. However, we increase the number of repetitions in both the inner (average) and outer (∞-norm) loops. Even though the compute nodes of a supercomputer typically exhibit less noise than a workstation, the network stack, which is a shared resource, may introduce significant noise into the time measurements. This is particularly noticeable in networks using adaptive routing [8], such as Cray Aries. To make the comparison as fair as possible, we run both the MPI and OpenSHMEM variants within the same job allocation to ensure the same node placement.

The units we use in the paper follow the IEC 60027-2 standard [27], i.e., 1 gigabyte = 1 GB = 10³ MB = 1,000 MB, while 1 gibibyte = 1 GiB = 2¹⁰ MiB = 1,024 MiB.

6 SINGLE-NODE BENCHMARKS

First, we compare the performance of four OpenSHMEM backends using a PingPong test implemented in Python. This benchmark focuses exclusively on communication performance and allows us to limit further analysis to the most performant OpenSHMEM backend. After that, we repeat the PingPong test, this time contrasting C versus Python and MPI versus OpenSHMEM. The goal of this study is to determine the overhead of wrapping C library calls in Python for both MPI and OpenSHMEM, as well as to measure the achievable bandwidth with both frameworks. Subsequently, we benchmark three applications from the Parallel Research Kernels suite [15, 49, 50]. These benchmarks include more complicated communication patterns as well as local computation. We use them to characterize performance in more realistic application scenarios.

6.1 PingPong - OpenSHMEM backends

In this experiment, we compare shmem4py using four different OpenSHMEM backends. We report the highest bandwidth observed during 100 repetitions of the benchmark. For convenience, we run the benchmarks within Podman containers. For validation purposes, we also compare to one backend running natively on the host operating system.

Figure 1
Figure 1: Single-node shmem4py PingPong benchmark using different OpenSHMEM backends running within Podman containers. OSHMPI is also run natively.

The results of the experiment are shown in Fig. 1. Sandia OpenSHMEM (SOS) underperforms for large messages. Open MPI SHMEM (OSHMEM), on the other hand, performs worse than the other backends for messages smaller than ≈ 2 MiB. OSHMPI performs almost identically whether running natively or inside a Podman container, and its performance is among the highest regardless of the message size. Similar results can be reproduced using the C implementations, i.e., the discrepancies point to differences between the OpenSHMEM backends rather than to shmem4py. Based on these results, we use OSHMPI for the remaining single-node experiments.

6.2 PingPong - C vs. Python and MPI vs. OpenSHMEM

In this experiment, we compare the PingPong maximum achievable bandwidth repeating most of the setup of Section 6.1. This time, we fix the OpenSHMEM backend to OSHMPI. We test the following variants: C+MPI, C+OpenSHMEM, Python+mpi4py, and Python+shmem4py. Again, we report the fastest of 100 iterations.

Figure 2
Figure 2: Single-node PingPong benchmark using C+MPI, C+OpenSHMEM, Python+mpi4py, and Python+shmem4py. The horizontal dashed line shows the maximum aggregate memcpy bandwidth measured using two processes.

The results are shown in Fig. 2. As in earlier work analyzing mpi4py performance, there is no significant difference between C and Python implementations for large message sizes [14]. OpenSHMEM behaves similarly. Interestingly, we notice mpi4py outperforms C+MPI for some message sizes. Our analysis points to cache effects caused by different memory allocation strategies (NumPy's use of madvise vs. C's use of plain malloc). For small messages, we see increased latency when using communication libraries through Python wrappers. The ratio is about the same for both mpi4py and shmem4py.

Overall, we note a rather large performance advantage of OpenSHMEM over MPI for large messages. For this intra-node communication case, such an outcome is expected. OpenSHMEM naturally involves a single buffer copy from the source memory space to the target shared memory segment. The MPI backend in use (MPICH), on the other hand, implements the send-receive pair with a double copy through an intermediate shared memory buffer.2 Note that an implementation based on MPI-3 Remote Memory Access (RMA) features is expected to achieve performance similar to that of OpenSHMEM.

It is also worth noting that the OpenSHMEM implementation gets very close to the practically achievable memory bandwidth. The dashed line in Fig. 2 shows the fastest aggregate memcpy bandwidth measured over 100 repetitions using two processes and large memory buffer sizes of 500 MB.

6.3 Parallel Research Kernels

In this section, we present the results obtained running three applications from the Parallel Research Kernels suite, as briefly described in Section 5.1. In an inner loop, we execute each kernel for 4 iterations and measure the average performance of the last 3 iterations. In an outer loop, each kernel runs 5 times, and we report the fastest of the inner averages.

Figure 3
Figure 3: Single-node Stencil benchmark using C, Python, and Python+Numba, each implemented with MPI and OpenSHMEM.
Figure 4
Figure 4: Single-node Synch_p2p benchmark using C, Python, and Python+Numba, each implemented with MPI and OpenSHMEM.
Figure 5
Figure 5: Single-node Transpose benchmark using C and Python, each implemented with an all-to-all operation using MPI and OpenSHMEM.

As shown in Fig. 3, 4, and 5, the trends are similar across all three kernels. Generally, kernels written in C perform the best. Pure Python implementations may be up to four orders of magnitude slower than C. This huge performance gap is typical of pure Python codes performing arithmetic operations in inner loops. Introducing Numba to JIT compile the triple nested loop containing the stencil update increases performance dramatically. This can be seen in Fig. 3: for the rightmost point, the gap between Python (388 MFLOPS/s) and C (128 GFLOPS/s) is huge; using Numba, however, immediately increases Python's performance to 79 GFLOPS/s. Similar observations can be made when analyzing the results of the Synch_p2p benchmark (Fig. 4).
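For reference, the Numba-compiled kernel is, in essence, a loop nest of the following form. This is an illustrative sketch of a star stencil of radius r, not the actual PRK code; the function name and argument layout are ours.

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def star_update(A, B, W, r):
        """Add a star stencil of radius r (weights W, shape (2r+1, 2r+1))
        applied to A into the interior of B."""
        n, m = A.shape
        for i in range(r, n - r):
            for j in range(r, m - r):
                s = 0.0
                for k in range(-r, r + 1):        # horizontal arm of the star
                    s += W[r, r + k] * A[i, j + k]
                for k in range(-r, r + 1):        # vertical arm (the center
                    if k != 0:                    # point was counted above)
                        s += W[r + k, r] * A[i + k, j]
                B[i, j] += s

The first call triggers JIT compilation; subsequent calls execute as compiled machine code, which accounts for most of the gap closed in Fig. 3.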

Overall, we do not notice large differences between the performance achieved using MPI vs. OpenSHMEM. MPI is slightly faster for kernels implemented in C, partly due to PRK tests being a good match for MPI operations. Moreover, the OpenSHMEM implementations in C are based on the MPI ones and thus may not take advantage of the different programming model OpenSHMEM favors. When using Python, either with or without Numba, the results are mixed. Synch_p2p is a perfect use case for MPI send-receive, and thus it is not surprising to see MPI winning there. With the Stencil benchmark, there is a small advantage for OpenSHMEM, which is consistent with the expectation that it has a lower overhead relative to MPI. With the Transpose benchmark, the behavior is nearly identical, which is expected since both implementations use the all-to-all collective operation.

The results shown for the Transpose benchmark in Fig. 5 are the least remarkable and should be considered a validation case. The Python and C implementations perform very similarly, and Numba did not contribute to increasing the performance (results not included). This shows that the transpose operation, i.e., the local computation, is implemented efficiently in NumPy. As NumPy's transpose is already implemented in C, there is no room for improvement from Numba's JIT compiler. MPI and OpenSHMEM also performed similarly. Both implementations use an all-to-all collective operation, and in the particular case of OSHMPI, the underlying MPI all-to-all operation is ultimately used.

7 MULTI-NODE BENCHMARKS

In this section, we present multi-node experiments performed on a Cray XC40 supercomputer using the same three Parallel Research Kernels as in Section 6. We use the OpenSHMEM backend recommended on this system, that is, the vendor-provided Cray OpenSHMEMX. We do not compare programming languages as the performance implications were already covered in our single-node experiments. Instead, we focus the analysis on comparing mpi4py and shmem4py.

All experiments are run using 32 processes per node, that is, fully subscribing all the available cores. In order to account for job placement on the supercomputer, all benchmarks are repeated in 10 SLURM jobs. Within each job, both mpi4py and shmem4py variants of the code are executed. PRK kernels are then executed 11 times, and the average time over the last 10 iterations is considered. We then report the fastest average obtained in any job for a given number of processes. The Transpose benchmark result is missing the 32,768 process data point, as it requires a matrix with a number of rows and columns divisible by the number of processes.

Figure 6
Figure 6: Multi-node Transpose, Stencil and Synch_p2p benchmarks using mpi4py and shmem4py. The unit is MB/s for the Transpose benchmark and MFLOPS/s for Stencil and Synch_p2p.

The results are presented in Fig. 6. We observe that the results vary between benchmarks. For Transpose, the OpenSHMEM implementation typically outperforms the one based on MPI. For 8,192 processes, we note the largest difference with an aggregate bandwidth of 619 GB/s (shmem4py) compared to 314 GB/s (mpi4py). This suggests that the all-to-all collective operation in Cray OpenSHMEMX outperforms the one in Cray MPICH. On the other hand, in the Synch_p2p benchmark, MPI has a large advantage over OpenSHMEM (7.06 GFLOPS/s vs. 1.64 GFLOPS/s for 32,768 processes). Arguably, MPI is not only faster, but it also provides a better communication model for the task. For the Stencil benchmark, the difference was the smallest and did not exceed 11.8%.

8 CONCLUSIONS

In this work, we introduced shmem4py - a Python wrapper for the OpenSHMEM standard specification. We briefly described the landscape of distributed memory parallelism in Python and shmem4py’s niche next to mpi4py. We then discussed the design choices and shmem4py’s implementation details. We also showed an example contrasting the use of OpenSHMEM features from both C and Python. In the second part of the paper, we analyzed the performance of shmem4py on both a workstation and a supercomputer. There, we compared the performance of different OpenSHMEM backends. Additionally, we discussed the overhead of using Python wrappers such as shmem4py and mpi4py vs. accessing the frameworks directly from C. A number of benchmarks from the PRK suite provided a baseline to compare the performance of shmem4py and mpi4py. However, the benchmarks used do not take advantage of the unique properties of OpenSHMEM, nor do they compare OpenSHMEM to MPI RMA (one-sided), because these issues are well understood outside of the Python language context. Our experiments primarily serve to validate the design, to ensure that the overheads from the Python wrapper are negligible, and that common algorithm patterns are straightforward to implement.

Overall, we obtained mixed results. We were able to identify scenarios in which OpenSHMEM was up to 2.0× faster than MPI. In other cases, MPI was able to outperform OpenSHMEM by a factor of 4.3×. In general, the performance of OpenSHMEM and MPI and their respective Python wrappers is on par, and the differences are mostly unremarkable. Nonetheless, we consider the outcome of these experiments positive and ultimately interesting for the practitioner community at large. As both frameworks perform similarly, the choice of a programming model can be made purely on the basis of what suits the application better.

ACKNOWLEDGMENTS

The research reported in this paper was funded by King Abdullah University of Science and Technology. We are thankful to the Supercomputing Laboratory at King Abdullah University of Science and Technology for the use of their computing resources.

REFERENCES

  • Collin Abidi. 2020. shmem4py. https://github.com/collinabidi/shmem4py
  • Bob Alverson, Edwin Froese, Larry Kaplan, and Duncan Roweth. 2012. Cray XC series network. Cray Inc., White Paper WP-Aries01-1112 (2012). https://www.alcf.anl.gov/files/CrayXCNetwork.pdf
  • Michael Bauer and Michael Garland. 2019. Legate NumPy: Accelerated and Distributed Array Computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 23, 23 pages. https://doi.org/10.1145/3295500.3356175
  • David Beazley. 2010. Understanding the Python GIL. In PyCON Python Conference. Atlanta, Georgia.
  • James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
  • Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. 2010. Introducing OpenSHMEM: SHMEM for the PGAS community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model. 1–3. https://doi.org/10.1145/2020373.2020375
  • Sung-Eun Choi, Howard Pritchard, James Shimek, James Swaro, Zachary Tiffany, and Ben Turrubiates. 2015. An Implementation of OFI Libfabric in Support of Multithreaded PGAS Solutions. In 2015 9th International Conference on Partitioned Global Address Space Programming Models. 59–69. https://doi.org/10.1109/PGAS.2015.14
  • Sudheer Chunduri, Kevin Harms, Scott Parker, Vitali Morozov, Samuel Oshin, Naveen Cherukuri, and Kalyan Kumaran. 2017. Run-to-Run Variability on Xeon Phi Based Cray XC Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’17). Association for Computing Machinery, New York, NY, USA, Article 52, 13 pages. https://doi.org/10.1145/3126908.3126926
  • Dask core developers. 2023. Dask: Scale the Python tools you love. https://www.dask.org/
  • Cray Research, Inc. 1993. Cray T3D System Architecture Overview. http://www.bitsavers.org/pdf/cray/HR-04033_CRAY_T3D_System_Architecture_Overview_Sep93.pdf
  • Jeff Daily and Robert R. Lewis. 2011. Using the Global Arrays toolkit to reimplement NumPy for distributed computation. In Proceedings of the 10th Python in Science Conference. https://conference.scipy.org/proceedings/scipy2011/pdfs/daily.pdf
  • Lisandro Dalcin and Yao-Lung L. Fang. 2021. mpi4py: Status update after 12 years of development. Computing in Science & Engineering 23, 4 (2021), 47–54. https://doi.org/10.1109/MCSE.2021.3083216
  • Lisandro D. Dalcin, Rodrigo R. Paz, Pablo A. Kler, and Alejandro Cosimo. 2011. Parallel distributed computing using Python. Advances in Water Resources 34, 9 (2011), 1124–1139. https://doi.org/10.1016/j.advwatres.2011.04.013
  • Lisandro Dalcín, Rodrigo Paz, Mario Storti, and Jorge D'Elía. 2008. MPI for Python: Performance improvements and MPI-2 extensions. J. Parallel and Distrib. Comput. 68, 5 (2008), 655–662. https://doi.org/10.1016/j.jpdc.2007.09.005
  • Jeff R. Hammond et al. 2019. Parallel Research Kernels. https://github.com/ParRes/Kernels
  • Greg Faanes, Abdulla Bataineh, Duncan Roweth, Tom Court, Edwin Froese, Bob Alverson, Tim Johnson, Joe Kopnick, Mike Higgins, and James Reinhard. 2012. Cray Cascade: A scalable HPC system based on a Dragonfly network. In SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 1–9. https://doi.org/10.1109/SC.2012.39
  • Karl Feind. 1995. Shared memory access (SHMEM) routines. Cray Research 53 (1995). https://cug.org/5-publications/proceedings_attendee_lists/1997CD/S95PROC/303_308.PDF
  • Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Dieter Kranzlmüller, Péter Kacsuk, and Jack Dongarra (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 97–104.
  • Paul Grun, Sean Hefty, Sayantan Sur, David Goodell, Robert D. Russell, Howard Pritchard, and Jeffrey M. Squyres. 2015. A Brief Introduction to the OpenFabrics Interfaces - A New Network API for Maximizing High Performance Application Efficiency. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. 34–39. https://doi.org/10.1109/HOTI.2015.19
  • Bilel Hadri, Samuel Kortas, Saber Feki, Rooh Khurram, and Greg Newby. 2015. Overview of the KAUST's Cray X40 System – Shaheen II. In Proceedings of the Cray User Group Meeting. Chicago, USA.
  • Jeff R. Hammond, Sayan Ghosh, and Barbara M. Chapman. 2014. Implementing OpenSHMEM using MPI-3 one-sided communication. In OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools: First Workshop, OpenSHMEM 2014, Annapolis, MD, USA, March 4-6, 2014. Proceedings 1. Springer, 44–58. https://doi.org/10.1007/978-3-319-05215-1
  • Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362. https://doi.org/10.1038/s41586-020-2649-2
  • Pieter Hintjens. 2013. ZeroMQ: messaging for many applications. O'Reilly Media, Inc.
  • Chung-Hsing Hsu, Neena Imam, Akhil Langer, Sreeram Potluri, and Chris J. Newburn. 2020. An Initial Assessment of NVSHMEM for High Performance Computing. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 1–10. https://doi.org/10.1109/IPDPSW50202.2020.00104
  • Chien-Chin Huang, Qi Chen, Zhaoguo Wang, Russell Power, Jorge Ortiz, Jinyang Li, and Zhen Xiao. 2015. Spartan: A distributed array framework with smart tiling. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). 1–15.
  • John D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
  • IEC 60027-2. 2000. Letter symbols to be used in electrical technology - Part 2: Telecommunications and electronics. Standard. International Electrotechnical Commission.
  • Intel.com. 2019. Intel® Xeon® Processor E5-2698v3. https://ark.intel.com/content/www/us/en/ark/products/81060/intel-xeon-processor-e52698-v3-40m-cache-2-30-ghz.html
  • Jithin Jose, Krishna Kandalla, Miao Luo, and Dhabaleswar K Panda. 2012. Supporting hybrid MPI and OpenSHMEM over InfiniBand: Design and performance evaluation. In 2012 41st International Conference on Parallel Processing. IEEE, 219–228. https://doi.org/10.1109/ICPP.2012.55
  • Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing. 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt (Eds.). IOS Press, 87 – 90.
  • Arvind Krishnamurthy, David E. Culler, and Katherine Yelick. 1998. Empirical Evaluation of Global Memory Support on the Cray-T3D and Cray-T3E. Technical Report. University of California at Berkeley, USA. https://apps.dtic.mil/sti/pdfs/ADA538557.pdf
  • Manojkumar Krishnan, Bruce Palmer, Abhinav Vishnu, Sriram Krishnamoorthy, Jeff Daily, and Daniel Chavarria. 2012. The Global Arrays user manual. Pacific Northwest National Laboratory, Richland, WA (2012). https://hpc.pnnl.gov/globalarrays/papers/GA-UserManual-Main.pdf
  • Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-Based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (Austin, Texas) (LLVM ’15). Association for Computing Machinery, New York, NY, USA, Article 7, 6 pages. https://doi.org/10.1145/2833157.2833162
  • Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Stéfan van der Walt and Jarrod Millman (Eds.). 56 – 61. https://doi.org/10.25080/Majora-92bf1922-00a
  • Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerging AI applications. In 13th USENIX symposium on operating systems design and implementation (OSDI 18). 561–577.
  • Jesse Noller and Richard Oudkerk. 2008. Addition of the multiprocessing package to the standard library. PEP 371. https://www.python.org/dev/peps/pep-0371/
  • NVIDIA. 2023. An Aspiring Drop-In Replacement for NumPy at Scale. https://github.com/nv-legate/cunumeric
  • NVIDIA. 2023. NVIDIA cuNumeric. https://developer.nvidia.com/cunumeric
  • Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido, and Crissman Loomis. 2017. CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS). http://learningsys.org/nips17/assets/papers/paper_16.pdf
  • Sreeram Potluri, Anshuman Goswami, Davide Rossetti, C.J. Newburn, Manjunath Gorentla Venkata, and Neena Imam. 2017. GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC). 253–262. https://doi.org/10.1109/HiPC.2017.00037
  • Brian Quinlan. 2009. futures - execute computations asynchronously. PEP 3148. https://www.python.org/dev/peps/pep-3148/
  • Armin Rigo and Maciej Fijalkowski. 2022. CFFI: C Foreign Function Interface for Python. https://cffi.readthedocs.io/
  • Matthew Rocklin et al. 2015. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference, Vol. 130. SciPy Austin, TX, 136.
  • Marcin Rogowski, Samar Aseeri, David Keyes, and Lisandro Dalcin. 2023. mpi4py.futures: MPI-Based Asynchronous Task Execution for Python. IEEE Transactions on Parallel and Distributed Systems 34, 2 (2023), 611–622. https://doi.org/10.1109/TPDS.2022.3225481
  • Marcin Rogowski, Lisandro Dalcin, Jeff R. Hammond, and David E. Keyes. 2023. shmem4py: OpenSHMEM for Python. Journal of Open Source Software 8, 87 (July 2023), 5444. https://doi.org/10.21105/joss.05444
  • Min Si, Huansong Fu, Jeff R. Hammond, and Pavan Balaji. 2022. OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations. In OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks, Stephen Poole, Oscar Hernandez, Matthew Baker, and Tony Curtis (Eds.). Springer International Publishing, Cham, 39–60.
  • Monika ten Bruggencate and Duncan Roweth. 2010. DMAPP - an API for one-sided program models on Baker systems. In Cray User Group Conference. https://cug.org/5-publications/proceedings_attendee_lists/CUG10CD/pages/1-program/final_program/CUG10_Proceedings/pages/authors/01-5Monday/03B-tenBruggencate-Paper-2.pdf
  • Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688 (May 2016). http://arxiv.org/abs/1605.02688
  • Rob F. Van der Wijngaart, Abdullah Kayi, Jeff R. Hammond, Gabriele Jost, Tom St. John, Srinivas Sridharan, Timothy G. Mattson, John Abercrombie, and Jacob Nelson. 2016. Comparing Runtime Systems with Exascale Ambitions Using the Parallel Research Kernels. In High Performance Computing, Julian M. Kunkel, Pavan Balaji, and Jack Dongarra (Eds.). Springer International Publishing, Cham, 321–339.
  • Rob F. Van der Wijngaart and Timothy G. Mattson. 2014. The Parallel Research Kernels: A tool for architecture and programming system investigation. In Proceedings of the IEEE High Performance Extreme Computing Conference. IEEE. https://doi.org/10.1109/HPEC.2014.7040972
  • Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), 261–272. https://doi.org/10.1038/s41592-019-0686-2
  • Aaron Welch, Pavel Shamis, Pengfei Hao, and Barbara Chapman. 2015. PySHMEM: A High Productivity OpenSHMEM Interface for Python. In 2015 9th International Conference on Partitioned Global Address Space Programming Models. 99–101. https://doi.org/10.1109/PGAS.2015.20

FOOTNOTE

1The implementation details of shmem4py Parallel Research Kernels can be obtained from the following GitHub pull request: https://github.com/ParRes/Kernels/pull/625.

2Other MPI backends may feature a faster single-copy mechanism based on Cross Memory Attach (CMA) capabilities of the operating system kernel.

This work is licensed under a Creative Commons Attribution International 4.0 License.

SC-W 2023, November 12–17, 2023, Denver, CO, USA

© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0785-8/23/11.
DOI: https://doi.org/10.1145/3624062.3624602