Abstract
Deep Convolutional Neural Networks (CNNs) have been successfully used to process images, videos, sounds, and more general sensor data for detecting objects, patterns, and events. In this work, we propose MEPAD, a memory-efficient parallelized direct convolution algorithm for CNNs. We compare MEPAD with several convolution implementations proposed in the literature by optimally mapping each of them onto two implementations of the same SIMD target architecture. Taking the VGG-16 and TinyYOLOv2 CNNs as use cases, we focus on optimizing the memory behavior and energy consumption of the algorithm in each layer of the CNNs and show that MEPAD achieves a reduction of up to 85% in the energy-delay product (EDP) compared to alternative approaches.
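For context, direct convolution computes each output activation straight from the input tensor and the filter weights, without first materializing the intermediate im2col matrix used by GEMM-based lowering; the memory savings that MEPAD targets stem from avoiding such intermediate buffers. The sketch below is a minimal, unoptimized direct-convolution loop nest in C for a single layer (CHW activation layout, OIHW weight layout, square kernel, unit dilation, no padding). It illustrates only the baseline computation, not the authors' parallelization or blocking scheme, and all names in it are hypothetical.

    /* Minimal direct-convolution sketch for one CNN layer.
     * This is a plain baseline, NOT the authors' MEPAD algorithm:
     * an output-stationary loop nest over output channels, output
     * positions, input channels, and the kernel window. */
    void direct_conv(const float *in,   /* C_in  x H_in  x W_in  */
                     const float *w,    /* C_out x C_in x K x K  */
                     const float *bias, /* C_out, may be NULL    */
                     float *out,        /* C_out x H_out x W_out */
                     int c_in, int h_in, int w_in,
                     int c_out, int k, int stride)
    {
        /* Output size for a valid (unpadded) convolution. */
        int h_out = (h_in - k) / stride + 1;
        int w_out = (w_in - k) / stride + 1;

        for (int oc = 0; oc < c_out; oc++)
            for (int oy = 0; oy < h_out; oy++)
                for (int ox = 0; ox < w_out; ox++) {
                    float acc = bias ? bias[oc] : 0.0f;
                    for (int ic = 0; ic < c_in; ic++)
                        for (int ky = 0; ky < k; ky++)
                            for (int kx = 0; kx < k; kx++) {
                                int iy = oy * stride + ky;
                                int ix = ox * stride + kx;
                                acc += in[(ic * h_in + iy) * w_in + ix]
                                     * w[((oc * c_in + ic) * k + ky) * k + kx];
                            }
                    out[(oc * h_out + oy) * w_out + ox] = acc;
                }
    }

A GEMM-based implementation would instead expand the input into a (C_in * K * K) x (H_out * W_out) matrix before the multiplication, which for a 3x3 kernel inflates the activation footprint roughly ninefold; this is the memory overhead that direct approaches avoid.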
Acknowledgement
This work has been supported by the Spoke 1 on Future HPC of the Italian Research Center on High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR Mission 4 - Next Generation EU.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Fiorin, L., Silvano, C. (2024). MEPAD: A Memory-Efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69765-4
Online ISBN: 978-3-031-69766-1
eBook Packages: Computer Science (R0)