
Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

  • Regular Paper
  • Published:
CCF Transactions on High Performance Computing

Abstract

Specialized hardware accelerators for deep learning are widely introduced by hardware vendors because of their high performance and efficiency. However, different vendors adopt different accelerator architectures, making it challenging for the compiler tool-chain to generate and optimize high-performance code. Moreover, the tool-chains currently provided by vendors are either highly abstract, which makes them hard to optimize, or expose too many hardware-related details, which makes them inconvenient to program. In this paper, we propose a middle-layer compiler tool-chain for the Cambricon MLU-100 that fills the gap between the high-level runtime library and the low-level, operator-level SDK. Our tool-chain is built on the operator-level SDK but abstracts away its redundant initialization and allocation statements. We also expose the interfaces of the major optimization knobs that the existing runtime hides, thus enabling a considerable optimization space. We evaluate our work on several state-of-the-art neural networks, using lines of code and the number of exposed optimization knobs as evaluation metrics. We also compare performance against the state-of-the-art tool-chain TensorRT under a simple optimization strategy and find that our work has great potential for optimization. Our tool-chain guarantees the user a vast optimization space with only around \( 20\% \) of the code, hiding the redundant initialization and allocation statements from users.
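The following is a minimal, hypothetical C++ sketch of the kind of middle-layer interface the abstract describes. The names used here (Knobs, InferenceEngine, run) are illustrative assumptions, not the actual Paleozoic or Cambricon Neuware API; the sketch only shows the division of labour the paper argues for: device and buffer initialization is hidden inside the engine, while the major optimization knobs stay visible to the user.

```cpp
// Hypothetical sketch only: these types are NOT the actual Paleozoic or
// Cambricon Neuware APIs. They illustrate a middle layer that hides
// initialization/allocation boilerplate while exposing optimization knobs.
#include <cstddef>
#include <string>
#include <vector>

// Optimization knobs that the middle layer exposes instead of hiding.
struct Knobs {
  int batch_size = 1;         // inference batch size
  int model_parallelism = 1;  // accelerator cores used by one model instance
  int data_parallelism = 1;   // concurrent model instances
  bool fuse_layers = true;    // fuse adjacent operators where possible
};

class InferenceEngine {
 public:
  InferenceEngine(const std::string& model_path, const Knobs& knobs)
      : knobs_(knobs), model_path_(model_path) {
    // In a real tool-chain, this is where the hidden boilerplate would live:
    // device discovery, context creation, model parsing, operator fusion
    // according to knobs_.fuse_layers, and on-device buffer allocation.
  }

  // Runs one batch of inference. A real implementation would copy the input
  // to the device, launch the compiled network, and copy the result back;
  // this stub only returns a correctly sized placeholder so the sketch runs.
  std::vector<float> run(const std::vector<float>& input, size_t output_size) {
    (void)input;
    return std::vector<float>(output_size * knobs_.batch_size, 0.0f);
  }

 private:
  Knobs knobs_;
  std::string model_path_;
};

int main() {
  Knobs knobs;
  knobs.batch_size = 8;
  knobs.data_parallelism = 4;  // e.g. spread the batch over four cores

  InferenceEngine engine("resnet50.model", knobs);
  std::vector<float> input(8 * 3 * 224 * 224, 0.0f);   // NCHW image batch
  std::vector<float> logits = engine.run(input, 1000); // 1000-way classifier
  return logits.empty();  // 0 on success
}
```

In this style, changing the batch size or the data-parallelism degree is a one-line change for the user, whereas an operator-level SDK would typically require rewriting the allocation and dispatch boilerplate by hand.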



References

  • Albericio, J. et al.: Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In: 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18–22, 2016, pp. 1–13 (2016). https://doi.org/10.1109/ISCA.2016.11

  • Alwani, M., et al.: Fused-layer CNN accelerators. In: 49th Annual IEEE/ACM International Symposium on Microarchitecture. (2016)

  • Chen, T. et al.: TVM: an automated end-to-end optimizing compiler for deep learning. In: 13th USENIX Symposium on Operating Systems Design and Implementation. (2018)

  • Chen, T. et al.: DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In: Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, Salt Lake City, UT, USA, March 1–5, 2014. Ed. by Rajeev Balasubramonian, Al Davis, and Sarita V. Adve, pp. 269–284 (2014). https://doi.org/10.1145/2541940.2541967

  • Chen, Y. et al.: DaDianNao: a machine-learning supercomputer. In: 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 13–17, 2014, pp. 609–622 (2014). https://doi.org/10.1109/MICRO.2014.58

  • Copeland, M.: What’s the difference between deep learning training and inference? https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/. Accessed Feb. 20 (2020)

  • Cui, W. et al.: Ebird: Elastic batch for improving responsiveness and throughput of deep learning services. In: 37th IEEE International Conference on Computer Design, ICCD 2019, Abu Dhabi, United Arab Emirates, November 17–20, 2019. IEEE, pp. 497–505 (2019). https://doi.org/10.1109/ICCD46524.2019.00075

  • Dong, Z. et al.: HAWQ: Hessian aware quantization of neural networks with mixed-precision. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2. pp. 293–302 (2019). https://doi.org/10.1109/ICCV.2019.00038

  • Elango, V. et al.: Diesel: DSL for linear algebra and neural net computations on GPUs. In: Proceedings of the 2nd ACM SIGPLAN international workshop on machine learning and programming languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18–22, 2018. Ed. by Justin Gottschlich and Alvin Cheung, pp. 42–51 (2018). https://doi.org/10.1145/3211346.3211354

  • Filipovic, J., et al.: Optimizing CUDA code by kernel fusion: application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015)

  • Google. Route. Schedule. Plan. Assign. Pack. Solve. OR-Tools is fast and portable software for combinatorial optimization. https://developers.google.com/optimization. Accessed May 20 (2020)

  • Guo, C. et al.: Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration. In: CoRR abs/2002.08326 (2020). arXiv:2002.08326

  • He, K. et al.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. (2016)

  • Jain, A. et al.: Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters. In: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019, Albuquerque, NM, USA, September 23–26. pp. 1–11 (2019). https://doi.org/10.1109/CLUSTER.2019.8891042

  • Jia, Y. et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03–07, 2014. Ed. by Kien A. Hua et al. pp. 675–678 (2014). https://doi.org/10.1145/2647868.2654889

  • Jia, Z. et al.: TASO: optimizing deep learning computation with automatic generation of graph substitutions. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles. (2019)

  • Jouppi, N.P. et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. (2017)

  • Kim, J. et al.: A code generator for high-performance tensor contractions on GPUs. In: IEEE/ACM International Symposium on Code Generation and Optimization. (2019)

  • Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: Advanced in Neural Information Processing Systems. (2012)

  • Leng, J. et al.: Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems. In: IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22–26, 2020. IEEE, pp. 44–57 (2020). https://doi.org/10.1109/HPCA47549.2020.00014

  • Liu, D. et al.: PuDianNao: a polyvalent machine learning accelerator. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. (2015)

  • Marchisio, A., Hanif, M.A., Shafique, M.: CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse. In: Design, Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence, Italy, March 25-29, 2019. Ed. by Jürgen Teich and Franco Fummi. pp. 964–967 (2019). https://doi.org/10.23919/DATE.2019.8714922

  • MXNet. A flexible and efficient library for deep learning. A truly open source deep learning framework suited for flexible research prototyping and production. https://mxnet.apache.org/. Accessed Feb. 20 (2020)

  • NVIDIA Corp. Geforce RTX 2080Ti. User Guide. (2019)

  • NVIDIA Corp. NVIDIA AI INFERENCE PLATFORM. Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network’s Edge. (2018)

  • NVIDIA Corp. NVIDIA TensorRT. Programmable Inference Accelerator. (2020)

  • ONNX. Open Neural Network Exchange. The open standard for machine learning interoperability. http://onnx.ai. Accessed Feb. 20 (2020)

  • Paliwal, A. et al.: Reinforced genetic algorithm learning for optimizing computation graphs. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30 (2020)

  • Qiao, B. et al.: From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In: IEEE/ACM International Symposium on Code Generation and Optimization. (2019)

  • Qiu, Y. et al.: Adversarial Defense Through Network Profiling Based Path Extraction. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, pp. 4777–4786 (2019). https://doi.org/10.1109/CVPR.2019.00491

  • Quinton, P.: Systolic arrays: why and how? In: Parcella 1994, VI. International Workshop on Parallel Processing by Cellular Automata and Arrays, Potsdam, Germany, September 21–23, 1994. Proceedings. Ed. by Chris R. Jesshope, Vesselin Jossifov, and Wolfgang Wilhelmi. Vol. 81. Mathematical Research. pp. 39–50 (1994)

  • Ragan-Kelley. J., et al.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Conference on Programming Language Design and Implementation. (2013)

  • Sandler, M., et al.: MobileNetV2: Inverted Residuals and Linear Bottlenecks. In: Conference on Computer Vision and Pattern Recognition. (2018)

  • Shao, Y.S. et al.: Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12–16, 2019. ACM, pp. 14–27 (2019). https://doi.org/10.1145/3352460.3358302

  • Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: 3rd International Conference on Learning Representations. (2015)

  • Cambricon Technologies. Cambricon MLU100 Datasheet. Aug. (2019)

  • Cambricon Technologies. Cambricon Neuware Whitesheet. Aug. (2019)

  • Vasilache, N., et al.: Tensor comprehensions: framework-agnostic high-performance machine learning abstractions. In: CoRR 1802.04730 (2018)

  • Wang, G., Lin, Y., Yi, W.: Kernel fusion: an effective method for better power efficiency on multithreaded GPU. In: 2010 IEEE/ACM Int’l Conference on Green Computing and Communications. (2010)

  • Zhang, S. et al.: Cambricon-X: An accelerator for sparse neural networks. In: 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 15–19, 20:1–20:12 (2016). https://doi.org/10.1109/MICRO.2016.7783723

  • Zhang, W. et al.: Laius: towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters. In: Proceedings of the ACM International Conference on Supercomputing, ICS 2019, Phoenix, AZ, USA, June 26–28, 2019. Ed. by Rudolf Eigenmann, Chen Ding, and Sally A. McKee. ACM, pp. 58–68 (2019). https://doi.org/10.1145/3330345.3330351

  • Zheng, S. et al.: FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In: ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020 [ASPLOS 2020 was canceled because of COVID-19]. Ed. by James R. Larus, Luis Ceze, and Karin Strauss. pp. 859-873 (2020). https://doi.org/10.1145/3373376.3378508

  • Zhou, X. et al.: Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In: 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018, Fukuoka, Japan, October 20–24, pp. 15–28 (2018). https://doi.org/10.1109/MICRO.2018.00011

  • Zhu, M. et al.: Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, pp. 359-371 (2019). https://doi.org/10.1145/3352460.3358269

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. This work was supported by the National Key R&D Program of China (2019YFF0302600) and the National Natural Science Foundation of China (NSFC) Grants (61702328 and 61832006). Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

Author information

Correspondence to Jingwen Leng or Minyi Guo.

About this article

Cite this article

Liu, Z., Leng, J., Lu, G. et al. Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator. CCF Trans. HPC 2, 332–347 (2020). https://doi.org/10.1007/s42514-020-00044-7
