Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators

Giuseppe Tagliavini¹ (ORCID: orcid.org/0000-0002-9221-4633),
Germain Haugou²,
Andrea Marongiu¹,² &
Luca Benini¹,²

Abstract

In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution for efficiently executing highly parallel kernels. However, architectural constraints impose hard limits on main memory bandwidth and call for software techniques that optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, based mainly on graph analysis and image tiling, that accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework that implements these techniques. Its front-end is compliant with the OpenVX standard, and it is based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to massive reductions in execution time and bandwidth usage, even when the main memory bandwidth available to the accelerator is severely constrained.
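
The core idea summarized in the abstract, tiling an image so that the intermediate results of consecutive kernels stay in on-chip memory, can be illustrated with a minimal sketch. The code below is not the authors' framework and does not use the OpenVX or OpenCL APIs. The two-kernel pipeline (a 3x3 box filter followed by a threshold), the function names, and the traffic model (one unit per pixel moved to or from main memory) are all assumptions chosen for illustration.

```python
# Illustrative sketch only. The pipeline, the names and the traffic model
# are assumptions; this is not the paper's run-time or the OpenVX API.

def box3(src, h, w):
    """3x3 box filter (integer mean) with clamped borders, row-major list."""
    out = [0] * (h * w)
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                yy = min(max(y + dy, 0), h - 1)
                for dx in (-1, 0, 1):
                    xx = min(max(x + dx, 0), w - 1)
                    acc += src[yy * w + xx]
            out[y * w + x] = acc // 9
    return out


def threshold(src, t):
    """Binary threshold: 255 where the pixel exceeds t, else 0."""
    return [255 if v > t else 0 for v in src]


def run_untiled(img, h, w, t):
    """Kernel-by-kernel execution: the intermediate image lives in main memory."""
    tmp = box3(img, h, w)
    out = threshold(tmp, t)
    traffic = 4 * h * w  # read img, write tmp, read tmp, write out
    return out, traffic


def run_tiled(img, h, w, t, tile_h):
    """Stripe-by-stripe execution: the intermediate stays in a local buffer."""
    out = [0] * (h * w)
    traffic = 0
    for y0 in range(0, h, tile_h):
        y1 = min(y0 + tile_h, h)
        # The 3x3 stencil needs one halo row above and below the stripe.
        r0, r1 = max(y0 - 1, 0), min(y1 + 1, h)
        local = img[r0 * w:r1 * w]          # models a transfer into local memory
        traffic += len(local)
        blurred = box3(local, r1 - r0, w)   # intermediate, never written back
        first = (y0 - r0) * w               # skip the top halo row, if any
        core = blurred[first:first + (y1 - y0) * w]
        out[y0 * w:y1 * w] = threshold(core, t)  # models the write-back
        traffic += len(core)
    return out, traffic


if __name__ == "__main__":
    h, w = 8, 8
    img = [(x * 7 + y * 13) % 256 for y in range(h) for x in range(w)]
    ref, traffic_ref = run_untiled(img, h, w, 100)
    out, traffic_tiled = run_tiled(img, h, w, 100, 4)
    print(out == ref, traffic_ref, traffic_tiled)
```

In this toy model the tiled version produces the same output while moving fewer pixels to and from main memory, because the blurred image is never stored off-chip. The price is that the halo rows are fetched more than once, so very small tiles reduce the benefit.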

Notes

The OpenCL 2.0 standard also enables dynamic parallelism on device side, but most programming environments do not support it yet.

Author information

Authors and Affiliations

Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Bologna, Italy
Giuseppe Tagliavini, Andrea Marongiu & Luca Benini
Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland
Germain Haugou, Andrea Marongiu & Luca Benini

Corresponding author

Correspondence to Giuseppe Tagliavini.

Additional information

This work has been supported by the EU-funded research projects P-SOCRATES (g.a. 611016) and MULTITHERMAN (g.a. 291125).

About this article

Cite this article

Tagliavini, G., Haugou, G., Marongiu, A. et al. Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators. J Real-Time Image Proc 15, 73–92 (2018). https://doi.org/10.1007/s11554-015-0544-0

Received: 04 May 2015
Accepted: 26 October 2015
Published: 20 November 2015
Issue Date: June 2018
DOI: https://doi.org/10.1007/s11554-015-0544-0
