Battling Memory Requirements of Array Programming Through Streaming

Mads R. B. Kristensen, James Avery, Troels Blum, Simon Andreas Frimann Lund & Brian Vinter

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations, entirely without loops, are most efficient on small inputs but can lead to explosions in memory use. This paper presents a solution to the problem using array streaming, implemented in Bohrium, a high-performance framework for automatic parallelization. Streaming makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine's capacity, because the automatic streaming eliminates the temporary memory overhead by performing calculations in per-thread registers.

Using Bohrium, we automatically fuse, JIT-compile, and execute NumPy array operations on GPGPUs without modification to the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from streaming, yielding corresponding improvements in speed and utilization of GPGPU cores. The streaming-enabled Bohrium runs programs on input sizes far beyond those at which pure NumPy crashes by exhausting system memory.
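The temporary-array overhead described in the abstract can be illustrated with a small hand-written sketch in plain NumPy. It is not Bohrium's implementation: Bohrium performs the streaming automatically and at a much finer grain (per-thread registers), while this sketch only imitates the effect by processing fixed-size blocks. The function names and the `block` parameter are hypothetical, chosen for illustration.

```python
import numpy as np


def distance_sum_naive(x, y):
    # Pure array style: x * x, y * y, their sum, and the square root
    # each materialise a full-size temporary array before the reduction.
    return np.sqrt(x * x + y * y).sum()


def distance_sum_streamed(x, y, block=65536):
    # Manual streaming over 1-D inputs (an assumption of this sketch):
    # temporaries never hold more than `block` elements, so peak memory
    # for intermediates is independent of the input size.
    total = 0.0
    for start in range(0, x.shape[0], block):
        xb = x[start:start + block]  # slices are views, not copies
        yb = y[start:start + block]
        total += np.sqrt(xb * xb + yb * yb).sum()
    return total


rng = np.random.default_rng(0)
x = rng.random(200_000)
y = rng.random(200_000)
naive = distance_sum_naive(x, y)
streamed = distance_sum_streamed(x, y)
```

Both functions compute the same value up to floating-point rounding. The streamed version trades the single-expression form for an explicit loop, which is the rewrite that automatic streaming is meant to spare the programmer.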

Notes

1. The standard interpreter, CPython, implemented in C.
2. Available at http://www.bh107.org.
3. A reduction performs an associative binary operation on all elements along an axis. The prototypical reductions are sum and product, but any associative binary operation can be used.
4. When no GPU is available, the bytecode kernels are sent directly to the CPU backend.
5. Available at http://benchpress.readthedocs.org/.
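The reduction described in note 3 can be sketched with NumPy's `ufunc.reduce`, which applies a binary operation along a chosen axis. The array `a` is made up for illustration. Sum and product are the prototypical cases, and maximum is shown as one further associative operation.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Sum reduction along axis 0 (down the columns).
col_sums = np.add.reduce(a, axis=0)            # [5, 7, 9]

# Product reduction along axis 1 (across each row).
row_products = np.multiply.reduce(a, axis=1)   # [6, 120]

# Any associative binary operation works the same way, e.g. maximum.
row_max = np.maximum.reduce(a, axis=1)         # [3, 6]
```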

Acknowledgement

James Avery was partially supported by the Danish Council for Independent Research Sapere Aude grant “Complexity through Logic and Algebra” (COLA).

Author information

Authors and Affiliations

Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark
Mads R. B. Kristensen, Troels Blum, Simon Andreas Frimann Lund & Brian Vinter
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
James Avery

Corresponding authors

Correspondence to Mads R. B. Kristensen or James Avery.

Editor information

Editors and Affiliations

University of Delaware, Newark, Delaware, USA
Michela Taufer
Forschungszentrum Jülich, Jülich, Germany
Bernd Mohr
DKRZ, Hamburg, Germany
Julian M. Kunkel

Cite this paper

Kristensen, M.R.B., Avery, J., Blum, T., Lund, S.A.F., Vinter, B. (2016). Battling Memory Requirements of Array Programming Through Streaming. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_32

DOI: https://doi.org/10.1007/978-3-319-46079-6_32
Published: 06 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)
