Battling Memory Requirements of Array Programming Through Streaming

  • Conference paper
High Performance Computing (ISC High Performance 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Abstract

A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations without any loops, while the most efficient form on small inputs, can lead to explosions in memory use. This paper presents a solution to this problem using array streaming, implemented in Bohrium, a high-performance framework for automatic parallelization. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine's capacity, because automatic streaming eliminates the memory overhead of temporary arrays by performing the calculations in per-thread registers.
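The memory problem described above can be illustrated in plain NumPy. The following is a minimal sketch, not Bohrium's implementation: the function names, the chunk size, and the example computation are all assumptions chosen for illustration. The first function is written in pure array style and materializes several full-size temporary arrays; the second streams over fixed-size chunks by hand, so the temporaries stay small regardless of input size.

```python
import numpy as np

def norm_sum_array_style(x, y):
    # Pure array programming: x * x, y * y, their sum, and the sqrt
    # each materialize a full-size temporary before the final reduction.
    return np.sqrt(x * x + y * y).sum()

def norm_sum_streamed(x, y, chunk=1 << 16):
    # Hand-written streaming (hypothetical helper): process fixed-size
    # chunks so no temporary exceeds `chunk` elements, and accumulate
    # the reduction as each chunk is consumed.
    total = 0.0
    for start in range(0, x.size, chunk):
        xs = x[start:start + chunk]  # slices are views, not copies
        ys = y[start:start + chunk]
        total += np.sqrt(xs * xs + ys * ys).sum()
    return total

rng = np.random.default_rng(0)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

a = norm_sum_array_style(x, y)
b = norm_sum_streamed(x, y)
print(np.isclose(a, b))
```

The two functions compute the same value up to floating-point rounding. The point of automatic streaming, as the abstract describes it, is that the user keeps writing the first form while the runtime avoids the full-size temporaries.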

Using Bohrium, we automatically fuse, JIT-compile, and execute NumPy array operations on GPGPUs without modification to the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from streaming, yielding corresponding improvements in speed and in utilization of GPGPU cores. The streaming-enabled Bohrium runs programs on input sizes far beyond those at which pure NumPy crashes from exhausting system memory.
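As a rough sketch of what "without modification to the user programs" can look like in practice, the snippet below writes a small stencil update as pure array operations against a NumPy-compatible module. It assumes that Bohrium, when installed, can be imported as a drop-in module named `bohrium`; that import, the fallback to NumPy, and the `heat_step` function are illustrative assumptions and are not taken from the paper's benchmarks.

```python
try:
    # Assumption: Bohrium is installed and exposes a NumPy-compatible module.
    import bohrium as np
    backend = "bohrium"
except ImportError:
    # Fallback so the sketch still runs where Bohrium is unavailable.
    import numpy as np
    backend = "numpy"

def heat_step(grid):
    # One Jacobi-style stencil update written as pure array operations,
    # with no explicit Python loops over elements (hypothetical example).
    center = grid[1:-1, 1:-1]
    north = grid[:-2, 1:-1]
    south = grid[2:, 1:-1]
    west = grid[1:-1, :-2]
    east = grid[1:-1, 2:]
    return 0.2 * (center + north + south + west + east)

grid = np.ones((64, 64))
interior = heat_step(grid)
result = float(interior.sum())
print(backend, interior.shape, result)
```

The array code itself is identical under either backend; only the module behind the name `np` changes.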

Notes

  1. The standard interpreter, CPython, implemented in C.

  2. Available at http://www.bh107.org.

  3. A reduction performs an associative binary operation on all elements along an axis. The prototypical reductions are sum and product, but any associative binary operation can be used.

  4. When no GPU is available, the bytecode kernels will be sent directly to the CPU backend.

  5. Available at http://benchpress.readthedocs.org/.
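The definition of a reduction in Note 3 can be made concrete with NumPy's `reduce` method on universal functions. This is a small illustrative sketch; the array and variable names are made up for the example.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Sum reduction along axis 1 (across each row); same as a.sum(axis=1).
row_sums = np.add.reduce(a, axis=1)

# Product reduction along axis 0 (down each column); same as a.prod(axis=0).
col_products = np.multiply.reduce(a, axis=0)

# Any associative binary ufunc can be reduced, for example maximum.
col_max = np.maximum.reduce(a, axis=0)

print(row_sums)      # [ 6 15]
print(col_products)  # [ 4 10 18]
print(col_max)       # [4 5 6]
```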

Acknowledgement

James Avery was partially supported by the Danish Council for Independent Research Sapere Aude grant “Complexity through Logic and Algebra” (COLA).

Author information

Correspondence to Mads R. B. Kristensen or James Avery.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Kristensen, M.R.B., Avery, J., Blum, T., Lund, S.A.F., Vinter, B. (2016). Battling Memory Requirements of Array Programming Through Streaming. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_32

  • DOI: https://doi.org/10.1007/978-3-319-46079-6_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46078-9

  • Online ISBN: 978-3-319-46079-6

  • eBook Packages: Computer Science, Computer Science (R0)
