Array streaming for array programming

Published: 01 January 2018

Abstract

A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations, entirely without loops, while most efficient on small inputs, can cause memory use to explode. This paper presents a solution to this problem: array streaming, implemented in Bohrium, a high-performance framework for automatic parallelisation. Streaming makes it possible to use array programming directly in Python/NumPy code even when the apparent memory requirement exceeds the machine's capacity, because the automatic streaming eliminates the memory overhead of temporary arrays by performing the calculations in per-thread registers. Using Bohrium, we automatically fuse, stream, JIT-compile, and execute NumPy array operations on GPGPUs without modifying the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from streaming, with corresponding improvements in speed and in utilisation of GPGPU cores. The fusion step is implemented using the theoretical framework presented in Kristensen et al. (2016), with a cost function that maximises streaming. With streaming enabled, Bohrium runs programs on input sizes several orders of magnitude beyond those at which pure NumPy crashes from exhausting system memory.
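
To make the memory problem concrete, the following is a minimal sketch in plain Python, not Bohrium's implementation. It assumes a hypothetical workload, the sum of the Euclidean norms sqrt(x*x + y*y), and uses Python lists as stand-ins for NumPy arrays; both function names are invented for illustration. The first version mimics unfused array operations, where every step allocates a full-length temporary. The second fuses the steps into a single pass so that intermediates exist only as scalars, which is roughly the role the abstract assigns to per-thread registers.

```python
import math


def norm_sum_materialised(xs, ys):
    """Unfused style: each operation builds a full-length temporary list.

    Lists stand in for NumPy arrays here (an assumption made for
    illustration); four temporaries are allocated in total.
    """
    x2 = [x * x for x in xs]               # temporary 1
    y2 = [y * y for y in ys]               # temporary 2
    s = [a + b for a, b in zip(x2, y2)]    # temporary 3
    r = [math.sqrt(v) for v in s]          # temporary 4
    return sum(r)


def norm_sum_streamed(xs, ys):
    """Fused style: one pass over the inputs, intermediates held in scalars.

    No full-length temporary is created, so the extra memory needed
    does not grow with the input length.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        total += math.sqrt(x * x + y * y)
    return total


if __name__ == "__main__":
    xs = [3.0, 6.0, 5.0]
    ys = [4.0, 8.0, 12.0]
    print(norm_sum_materialised(xs, ys))
    print(norm_sum_streamed(xs, ys))
```

Because the streamed version accepts any iterable, its inputs can themselves be produced lazily (for example with `range`), so its peak memory use stays constant regardless of input length, whereas the materialised version needs space proportional to the input for each temporary.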

References

[1]
Auer, A.A., Baumgartner, G., Bernholdt, D.E., Bibireata, A., Choppella, V., Cociorva, D., Gao, X., Harrison, R., Krishnamoorthy, S., Krishnan, S. et al. (2006) 'Automatic code generation for many-body electronic structure methods: the tensor contraction engine', Molecular Physics, Vol. 104, No. 2, pp. 211-228.
[2]
Ayer, V.M., Miguez, S. and Toby, B.H. (2014) 'Why scientists should learn to program in python', Powder Diffraction, Vol. 29, pp. S48-S64, ISSN: 1945-7413.
[3]
Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D.S. and Smith, K. (2011) 'Cython: the best of both worlds', Computing in Science & Engineering, Vol. 13, No. 2, pp. 31-39.
[4]
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. and Bengio, Y. (2010) 'Theano: a CPU and GPU math expression compiler', in Proceedings of the Python for Scientific Computing Conference (SciPy), June, Oral Presentation.
[5]
Blum, T., Kristensen, M.R.B. and Vinter, B. (2014) 'Transparent GPU execution of NumPy applications', in Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2014 IEEE 28th International, IEEE.
[6]
Cooke, D. and Hochberg, T. (2009) Numexpr. Fast Evaluation of Array Expressions by using a Vector-based Virtual Machine.
[7]
Darte, A. and Huard, G. (2002) 'New results on array contraction [memory optimization]', in The IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2002, Proceedings, IEEE, pp. 359-370.
[8]
Enkovaara, J., Romero, N.A., Shende, S. and Mortensen, J.J. (2011) 'GPAW-massively parallel electronic structure calculations with python-based software', Procedia Computer Science, Vol. 4, pp. 17-25.
[9]
Foord, M. and Muirhead, C. (2009) IronPython in Action, Manning Publications Co., Greenwich, CT, USA, ISBN: 1933988339, 9781933988337.
[10]
Galton, F. (1889) Natural Inheritance, Number v. 42; v. 590, Macmillan and Company, London, ISBN: 9781358763694.
[11]
Guelton, S., Brunet, P., Amini, M., Merlini, A., Corbillon, X. and Raynaud, A. (2015) 'Pythran: Enabling static optimization of scientific python programs', Computational Science & Discovery, Vol. 8, No. 1, p.014001.
[12]
Ihaka, R. and Gentleman, R. (1996) 'R: a language for data analysis and graphics', Journal of Computational and Graphical Statistics, Vol. 5, No. 3, pp. 299-314.
[13]
Jones, E. and Miller, P.J. (2002) 'Weave-inlining c/c++ in python', O'Reilly Open Source Convention.
[14]
Kennedy, K. (2001) 'Fast greedy weighted fusion', International Journal of Parallel Programming, Vol. 29, No. 5, pp. 463-491.
[15]
Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P. and Fasih, A. (2012) 'PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation', Parallel Computing, Vol. 38, No. 3, pp. 157-174, ISSN: 0167-8191.
[16]
Kristensen, M.R.B., Happe, H.H. and Vinter, B. (2009) 'GPAW optimized for blue gene/P using hybrid programming', in IEEE International Symposium on Parallel Distributed Processing, 2009, IPDPS 2009, pp. 1-6.
[17]
Kristensen, M.R.B., Lund, S.A.F., Blum, T. and Avery, J. (2016) 'Fusion of parallel array operations', in Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT '16, ACM, New York, NY, USA, pp. 71-85, ISBN: 978-1-4503-4121-9.
[18]
Kristensen, M.R.B., Lund, S.A.F., Blum, T. and Skovhede, K. (2014a) 'Separating NumPy API from implementation', in 5th Workshop on Python for High Performance and Scientific Computing (PyHPC'14).
[19]
Kristensen, M.R.B., Lund, S.A.F., Blum, T., Skovhede, K. and Vinter, B. (2014b) 'Bohrium: a virtual machine approach to portable parallelism', in Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, IEEE, pp. 312-321.
[20]
Kristensen, M.R.B., Lund, S.A.F., Blum, T., Skovhede, K. and Vinter, B. (2013) 'Bohrium: unmodified NumPy code on CPU, GPU, and cluster', in 4th Workshop on Python for High Performance and Scientific Computing (PyHPC'13).
[21]
Lam, C-C., Cociorva, D., Baumgartner, G. and Sadayappan, P. (2000) 'Optimization of memory usage requirement for a class of loops implementing multi-dimensional integrals', in Carter, L. and Ferrante, J. (Eds.): Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Vol. 1863, pp. 350-364, Springer, ISBN: 978-3-540-67858-8.
[22]
Madsen, F.M. and Filinski, A. (2013) 'Towards a streaming model for nested data parallelism', in Proceedings of the 2nd ACM SIGPLAN Workshop on Functional High Performance Computing, ACM, pp. 13-24.
[23]
Madsen, F.M., Clifton-Everest, R., Chakravarty, M.M.T. and Keller, G. (2015) 'Functional array streams', in Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing, FHPC 2015, pp. 23-34, ACM, New York, NY, USA, ISBN: 978-1-4503-3807-3.
[24]
MATLAB (2010) version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts.
[25]
Mnih, V. (2009) Cudamat: A CUDA-based Matrix Class for Python, Department of Computer Science, University of Toronto, Tech. Rep. UTML TR, p.4.
[26]
Munshi, A. et al. (2009) 'The OpenCL specification', Khronos OpenCL Working Group, Vol. 1, pp. 11-15.
[27]
NVIDIA Corporation (2008) NVIDIA CUDA Programming Guide 8.0 [online] http://docs.nvidia.com/pdf/CUDA_C_Programming_Guide.pdf (accessed June 2017).
[28]
Oliphant, T. (2012) 'Numba python bytecode to llvm translator', in Proceedings of the Python for Scientific Computing Conference (SciPy).
[29]
Pedroni, S. and Rappin, N. (2002) Jython Essentials: Rapid Scripting in Java, 1st ed., O'Reilly & Associates, Inc., Sebastopol, CA, USA, ISBN 0596002475.
[30]
Rickett, C.D., Choi, S-E., Rasmussen, C.E. and Sottile, M.J. (2006) 'Rapid prototyping frameworks for developing scientific applications: a case study', The Journal of Supercomputing, Vol. 36, No. 2, pp. 123-134.
[31]
Tieleman, T. (2010) Gnumpy: an easy way to use GPU boards in Python, Department of Computer Science, University of Toronto.
[32]
Van Der Walt, S., Colbert, S.C. and Varoquaux, G. (2011) 'The numpy array: a structure for efficient numerical computation', Computing in Science & Engineering, Vol. 13, No. 2, pp. 22-30.
[33]
van Rossum, G. (1998) 'Glue it all together with python', in Workshop on Compositional Software Architectures, Workshop Report, Monterey, California.

Published In

International Journal of Computational Science and Engineering, Volume 17, Issue 3
January 2018
106 pages
ISSN: 1742-7185
EISSN: 1742-7193
Publisher

Inderscience Publishers

Geneva 15, Switzerland

Qualifiers

  • Article
