Abstract
Recently, deep learning has achieved great success in artificial intelligence, and its superiority has also created new opportunities for related research in embedded systems. This paper focuses on optimizing and accelerating convolution computation, the core operation of convolutional neural networks, on BWDSP, a multi-cluster digital signal processor. By exploiting the BWDSP architecture and the characteristics of convolution computation, a suitable parallel algorithm was designed. Based on the structural features of convolutional neural network models, an automatic optimization tool for convolution computation with specific arguments is presented as well. Experimental results show that the parallel algorithm given in this paper is 9.5x faster than the GEMM-based algorithm commonly used on GPUs and 5.7x faster than the traditional vectorization optimization algorithm. A comparison was also made between the parallel algorithm and the tile-based algorithm widely adopted in systems with cache hierarchies, showing that the parallel algorithm achieves 1.55x the performance density of the latter, meaning that the work in this paper makes full use of the available computing resources.
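For context (not part of the original paper), the GEMM-based baseline that the abstract compares against typically lowers convolution to a matrix multiplication via the im2col transform: each input patch becomes a row, and the flattened kernel becomes a column. A minimal sketch, with hypothetical helper names and a single-channel valid convolution assumed:

```python
def im2col(image, k):
    """Flatten each k x k patch of a 2-D image (list of lists) into a row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows

def conv2d_gemm(image, kernel):
    """Valid 2-D convolution via im2col followed by a matrix-vector product."""
    k = len(kernel)
    flat_kernel = [kernel[di][dj] for di in range(k) for dj in range(k)]
    patches = im2col(image, k)
    # The GEMM step: each output element is a dot product of one patch
    # row with the flattened kernel.
    flat_out = [sum(p * q for p, q in zip(row, flat_kernel)) for row in patches]
    out_w = len(image[0]) - k + 1
    return [flat_out[r * out_w:(r + 1) * out_w]
            for r in range(len(flat_out) // out_w)]
```

The trade-off this illustrates is the one the paper's comparison rests on: im2col duplicates input data (each pixel appears in up to k*k patch rows), buying a regular GEMM memory-access pattern at the cost of extra bandwidth, which is exactly the kind of overhead a DSP-specific parallel algorithm can avoid.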
Acknowledgment
This work was supported in part by a grant from China Core Electronic Devices, High-end Generic Chips and Basic Software Major Projects, No. 2012ZX01034-001-001.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd
Cite this paper
Wenqi, D., Zhenhao, Y., Maohui, L., Gai, W., JiangPing, Y., Qilong, Z. (2017). Speeding Up Convolution on Multi-cluster DSP in Deep Learning Scenarios. In: Chen, G., Shen, H., Chen, M. (eds) Parallel Architecture, Algorithm and Programming. PAAP 2017. Communications in Computer and Information Science, vol 729. Springer, Singapore. https://doi.org/10.1007/978-981-10-6442-5_47
DOI: https://doi.org/10.1007/978-981-10-6442-5_47
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6441-8
Online ISBN: 978-981-10-6442-5