Abstract
A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations completely without loops, while most efficient on small input, can lead to explosions in memory use. The present paper presents a solution to this problem using array streaming, implemented in the automatic parallelization high-performance framework Bohrium. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine capacity, since the automatic streaming eliminates the temporary memory overhead by performing calculations in per-thread registers.
Using Bohrium, we automatically fuse, JIT-compile, and execute NumPy array operations on GPGPUs without modification to the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from streaming, yielding corresponding improvements in speed and utilization of GPGPU-cores. The streaming-enabled Bohrium effortlessly runs programs on input sizes much beyond sizes that crash on pure NumPy due to exhausting system memory.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The standard interpreter, CPython, implemented in C.
- 2.
Available at http://www.bh107.org.
- 3.
A reduction performs an associative binary operation on all elements along an axis. The prototypical reductions are sum and product, but any associative binary operation can be used.
- 4.
When no GPU is available, the bytecode kernels will be send directly to the CPU backend.
- 5.
Available at http://benchpress.readthedocs.org/.
References
Auer, A.A., Baumgartner, G., Bernholdt, D.E., Bibireata, A., Choppella, V., Cociorva, D., Gao, X., Harrison, R., Krishnamoorthy, S., Krishnan, S., et al.: Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Mol. Phy. 104(2), 211–228 (2006)
Ayer, V.M., Miguez, S., Toby, B.H.: Why scientists should learn to program in python. Powder Diffr. 29, S48–S64 (2014)
Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D.S., Smith, K.: Cython: the best of both worlds. Comput. Sci. Eng. 13(2), 31–39 (2011)
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation
Blum, T., Kristensen, M.R.B., Vinter, B.: Transparent GPU execution of NumPy applications. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). IEEE (2014)
Cooke, D., Hochberg, T.: Numexpr. Fast evaluation of array expressions by using a vector-based virtual machine
Darte, A., Huard, G.: New results on array contraction [memory optimization]. In: Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 359–370. IEEE (2002)
Enkovaara, J., Romero, N.A., Shende, S., Mortensen, J.J.: Gpaw-massively parallel electronic structure calculations with python-based software. Procedia Comput. Sci. 4, 17–25 (2011)
Foord, M., Muirhead, C.: IronPython in Action. Manning Publications Co., Greenwich (2009)
Guelton, S., Brunet, P., Amini, M., Merlini, A., Corbillon, X., Raynaud, A.: Pythran: enabling static optimization of scientific python programs. Comput. Sci. Discov. 8(1), 014001 (2015)
Ihaka, R., Gentleman, R.: R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5(3), 299–314 (1996)
Jones, E., Miller, P.J.: Weaveinlining C/C++ in Python. OReilly Open Source Convention (2002)
Klckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., Fasih, A.: PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation. Parallel Comput. 38(3), 157–174 (2012)
Kristensen, M.R.B., Happe, H., Vinter, B.: GPAW optimized for Blue Gene/P using hybrid programming. In: IEEE International Symposium on Parallel Distributed Processing, IPDPS 2009, pp. 1–6 (2009)
Kristensen, M.R.B., Lund, S.A.F., Blum, T., Avery, J.: Fusion of array operations at runtime. In: Proceedings of the 25th International Conference on Parallel Architectures and Compilation Techniques (PACT 2016). ACM (2016)
Kristensen, M.R.B., Lund, S.A.F., Blum, T., Skovhede, K.: Separating NumPy API from implementation. In: 5th Workshop on Python for High Performance and Scientific Computing (PyHPC 2014) (2014)
Kristensen, M.R.B., Lund, S.A.F., Blum, T., Skovhede, K., Vinter, B.: Bohrium: unmodified NumPy code on CPU, GPU, and cluster. In: 4th Workshop on Python for High Performance and Scientific Computing (PyHPC 2013) (2013)
Kristensen, M.R.B., Lund, S.A.F., Blum, T., Skovhede, K., Vinter, B.: Bohrium: a virtual machine approach to portable parallelism. In: 2014 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 312–321. IEEE (2014)
Kristensen, M.R.B., Vinter, B.: Numerical python for scalable architectures. In: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS 2010, pp. 15:1–15:9. ACM, New York (2010)
Kristensen, M.R.B., Zheng, Y., Vinter, B.: PGAS for distributed numerical python targeting multi-core clusters. In: International Parallel and Distributed Processing Symposium, pp. 680–690 (2012)
Lam, C.-C., Cociorva, D., Baumgartner, G., Sadayappan, P.: Optimization of memory usage requirement for a class of loops implementing multi-dimensional integrals. In: Carter, L., Ferrante, J. (eds.) LCPC 1999. LNCS, vol. 1863, pp. 350–364. Springer, Heidelberg (2000). doi:10.1007/3-540-44905-1_22
Madsen, F.M., Clifton-Everest, R., Chakravarty, M.M.T., Keller, G.: Functional array streams. In: Proceedings of the 4th ACM SIGPLAN Workshop on FunctionalHigh-Performance Computing, FHPC 2015, pp. 23–34. ACM, New York (2015)
Madsen, F.M., Filinski, A.: Towards a streaming model for nested data parallelism. In: Proceedings of the 2nd ACM SIGPLAN Workshop on Functional High-Performance Computing, pp. 13–24. ACM (2013)
MATLAB. version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts (2010)
Mnih, V.: Cudamat: a cuda-based matrix class for python. Department of Computer Science, University of Toronto, Technical report, UTML TR, 4 (2009)
Munshi, A., et al.: The OpenCL specification. Khronos OpenCL Working Group 1, 11–15 (2009)
NVIDIA Corporation. NVIDIA CUDA Programming Guide 2.0 (2008)
Oliphant, T.: Numba python bytecode to llvm translator. In: Proceedings of the Python for Scientific Computing Conference (SciPy) (2012)
Pedroni, S., Rappin, N.: Jython Essentials: Rapid Scripting in Java, 1st edn. O’Reilly & Associates Inc., Sebastopol (2002)
Rickett, C.D., Choi, S.-E., Rasmussen, C.E., Sottile, M.J.: Rapid prototyping frameworks for developing scientific applications: a case study. J. Supercomput. 36(2), 123–134 (2006)
Tieleman, T.: Gnumpy: an easy way to use gpu boards in python (2010)
Van Der Walt, S., Colbert, S., Varoquaux, G.: The numpy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22–30 (2011)
van Rossum, G.: Glue it all together with python. In: Workshop on Compositional Software Architectures, Workshop Report, Monterey, California (1998)
Acknowledgement
James Avery was partially supported by the Danish Council for Independent Research Sapere Aude grant “Complexity through Logic and Algebra” (COLA).
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Kristensen, M.R.B., Avery, J., Blum, T., Lund, S.A.F., Vinter, B. (2016). Battling Memory Requirements of Array Programming Through Streaming. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science(), vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_32
Download citation
DOI: https://doi.org/10.1007/978-3-319-46079-6_32
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer ScienceComputer Science (R0)