TVM: an automated end-to-end optimizing compiler for deep learning

Published: 08 October 2018
DOI: 10.5555/3291168.3291211

Abstract

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms, such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs), requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates the optimization of low-level programs for hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as an FPGA-based generic deep learning accelerator. The system is open source and in production use inside several major companies.
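For context, the operator-level optimizations described above are exposed through TVM's declarative tensor-expression API: a computation is declared once, then transformed by schedule primitives (splitting, parallelization, etc.) and compiled for a chosen back-end. The sketch below shows the general shape of that workflow using the `tvm.te` module from recent open-source TVM releases; the module layout has evolved since the paper was published, so treat this as an illustrative sketch rather than the paper's original API.

```python
# A minimal sketch of TVM's declare-then-schedule workflow, using the
# tvm.te API from recent open-source releases (module names differ from
# the 2018-era API the paper describes).
import numpy as np
import tvm
from tvm import te

n = 1024

# Declare the computation: C[i] = A[i] + B[i].
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Apply schedule primitives: split the loop and parallelize the outer part.
# The split factor is an arbitrary choice here; TVM's learning-based cost
# model searches over such parameters automatically during tuning.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=64)
s[C].parallel(outer)

# Lower and compile for a CPU back-end; other targets (e.g. "cuda")
# reuse the same declaration with a different schedule.
fadd = tvm.build(s, [A, B, C], target="llvm")

# Run the compiled kernel and check the result.
dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
fadd(a, b, c)
np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy(), rtol=1e-5)
```

The separation of the compute declaration from the schedule is what lets one algorithm description target many back-ends: only the schedule changes per device, and the cost model ranks candidate schedules instead of requiring hand-tuning.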



    Published In

OSDI'18: Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation
October 2018, 815 pages
ISBN: 978-1-931971-47-8

    Sponsors

• NetApp
• Google Inc.
• NSF
• Microsoft
• Facebook


    Publisher

USENIX Association, United States



    Cited By

• (2024) MAGPY. Proceedings of the 2024 USENIX Annual Technical Conference, pp. 683-698. DOI: 10.5555/3691992.3692034
• (2024) More is different. Proceedings of the 2024 USENIX Annual Technical Conference, pp. 285-302. DOI: 10.5555/3691992.3692009
• (2024) MonoNN. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, pp. 989-1005. DOI: 10.5555/3691938.3691991
• (2024) Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines. Proceedings of the VLDB Endowment 17(10), pp. 2631-2640. DOI: 10.14778/3675034.3675052
• (2024) Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 160-177. DOI: 10.1145/3694715.3695961
• (2024) CoolerSpace: A Language for Physically Correct and Computationally Efficient Color Programming. Proceedings of the ACM on Programming Languages 8(OOPSLA2), pp. 846-875. DOI: 10.1145/3689741
• (2024) Interactive Source-to-Source Optimizations Validated using Static Resource Analysis. Proceedings of the 13th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, pp. 26-34. DOI: 10.1145/3652588.3663320
• (2024) Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 549-561. DOI: 10.1145/3650200.3656626
• (2024) CACTUS: Dynamically Switchable Context-aware micro-Classifiers for Efficient IoT Inference. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, pp. 505-518. DOI: 10.1145/3643832.3661888
• (2024) C4CAM: A Compiler for CAM-based In-memory Accelerators. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 164-177. DOI: 10.1145/3620666.3651386
