Abstract
In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution for efficiently executing highly parallel kernels. However, architectural constraints impose hard limits on the main memory bandwidth and call for software techniques that optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, aimed at accelerating the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework that implements these techniques behind a front-end compliant with the OpenVX standard. The framework is based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory, and that greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to a massive reduction in execution time and bandwidth usage, even when the main memory bandwidth available to the accelerator is severely constrained.
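The effect of tiling on off-chip traffic can be illustrated with a small model. The sketch below is not the framework described in this article: the kernels `brighten` and `threshold`, the function names, and the traffic counter are all hypothetical, and only point-wise kernels are considered. It compares a chain of two kernels executed one after the other through main memory with the same chain executed tile by tile, where the intermediate result stays in a local buffer.

```python
# Hypothetical model, not the paper's framework: count the pixels moved
# to and from off-chip memory for a chain of two point-wise kernels.

def brighten(px):
    """Illustrative kernel 1: saturating brightness increase."""
    return min(px + 10, 255)

def threshold(px):
    """Illustrative kernel 2: binary threshold."""
    return 255 if px > 128 else 0

def run_kernel_by_kernel(image):
    """Each kernel reads its input from main memory and writes its output back."""
    n = len(image) * len(image[0])
    intermediate = [[brighten(p) for p in row] for row in image]
    traffic = 2 * n            # read input image, write intermediate image
    output = [[threshold(p) for p in row] for row in intermediate]
    traffic += 2 * n           # read intermediate image, write output image
    return output, traffic

def run_tiled(image, tile_h, tile_w):
    """Both kernels run on one tile at a time; the intermediate tile stays local."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    traffic = 0
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            y1, x1 = min(y0 + tile_h, height), min(x0 + tile_w, width)
            # Transfer the input tile into the (modelled) on-chip buffer.
            tile = [image[y][x0:x1] for y in range(y0, y1)]
            traffic += (y1 - y0) * (x1 - x0)
            # The intermediate tile never leaves local memory.
            local_mid = [[brighten(p) for p in row] for row in tile]
            local_out = [[threshold(p) for p in row] for row in local_mid]
            # Transfer the output tile back to main memory.
            for dy, row in enumerate(local_out):
                output[y0 + dy][x0:x1] = row
            traffic += (y1 - y0) * (x1 - x0)
    return output, traffic

if __name__ == "__main__":
    image = [[(x * 37 + y * 11) % 256 for x in range(8)] for y in range(6)]
    reference, traffic_full = run_kernel_by_kernel(image)
    tiled, traffic_tiled = run_tiled(image, tile_h=4, tile_w=3)
    print(tiled == reference, traffic_full, traffic_tiled)
```

In this model the tiled version moves each pixel across the off-chip interface twice instead of four times. Kernels that read a neighbourhood of each pixel (stencils) would additionally need overlapping tile borders, which this sketch deliberately leaves out.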
Notes
The OpenCL 2.0 standard also enables dynamic parallelism on the device side, but most programming environments do not support it yet.
Additional information
This work has been supported by the EU-funded research projects P-SOCRATES (g.a. 611016) and MULTITHERMAN (g.a. 291125).
About this article
Cite this article
Tagliavini, G., Haugou, G., Marongiu, A. et al. Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators. J Real-Time Image Proc 15, 73–92 (2018). https://doi.org/10.1007/s11554-015-0544-0
DOI: https://doi.org/10.1007/s11554-015-0544-0