DOI: 10.1109/CGO57630.2024.10444805

TapeFlow: Streaming Gradient Tapes in Automatic Differentiation

Published: 27 September 2024

Abstract

Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse-mode AD, an auxiliary structure, the tape, is used to transfer the intermediate values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy: it has a high reuse distance, lacks temporal locality, and inflates the working set by 2-4×.
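The page carries no code, but a minimal sketch of a conventional reverse-mode tape makes the data structure concrete: the forward pass appends one record per operation, and the reverse sweep walks the records back-to-front to accumulate gradients, which is exactly the access pattern that gives the tape its long reuse distance. The types and helpers below (TapeEntry, mul, sin_, backward) are our illustration, not TapeFlow's or Enzyme's API.

```cpp
// A minimal, self-contained sketch of a reverse-mode AD tape (illustration
// only; this is not TapeFlow's or Enzyme's implementation). The forward pass
// records each operation's operands and local partial derivatives; the
// reverse sweep replays the records back-to-front to accumulate gradients.
#include <cmath>
#include <cstdio>
#include <vector>

// One tape entry: operand indices plus the local partials of the result.
struct TapeEntry {
    int lhs, rhs;        // operand indices (-1 means "no operand")
    double dlhs, drhs;   // d(result)/d(lhs), d(result)/d(rhs)
};

std::vector<TapeEntry> tape;   // grows with every operation executed forward
std::vector<double> value;     // intermediate values, index-aligned with the tape
std::vector<double> adjoint;   // gradients, filled in by the reverse sweep

int input(double x) {                      // leaf variable: nothing to propagate
    value.push_back(x);
    tape.push_back({-1, -1, 0.0, 0.0});
    return (int)value.size() - 1;
}

int mul(int a, int b) {                    // record v = value[a] * value[b]
    value.push_back(value[a] * value[b]);
    tape.push_back({a, b, value[b], value[a]});   // d(ab)/da = b, d(ab)/db = a
    return (int)value.size() - 1;
}

int sin_(int a) {                          // record v = sin(value[a])
    value.push_back(std::sin(value[a]));
    tape.push_back({a, -1, std::cos(value[a]), 0.0});
    return (int)value.size() - 1;
}

// Reverse sweep: walk the tape backwards, pushing each adjoint onto its operands.
void backward(int output) {
    adjoint.assign(value.size(), 0.0);
    adjoint[output] = 1.0;
    for (int i = (int)tape.size() - 1; i >= 0; --i) {
        const TapeEntry& e = tape[i];
        if (e.lhs >= 0) adjoint[e.lhs] += e.dlhs * adjoint[i];
        if (e.rhs >= 0) adjoint[e.rhs] += e.drhs * adjoint[i];
    }
}

int main() {
    int x = input(2.0), y = input(3.0);
    int f = sin_(mul(x, y));               // f = sin(x * y)
    backward(f);
    std::printf("df/dx = %f, df/dy = %f\n", adjoint[x], adjoint[y]);  // y*cos(xy), x*cos(xy)
    return 0;
}
```

For f = sin(x·y) with x = 2 and y = 3 this prints df/dx = 3·cos(6) and df/dy = 2·cos(6). Note that tape and value grow with every executed operation and are read again only much later, in reverse order; that is the working-set inflation and lack of temporal locality described above.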
We introduce TapeFlow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad, which reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from DRAM by organizing it into a FIFO of tiles, so that tape operands arrive just-in-time for each layer. TapeFlow, running on the same hardware, outperforms Enzyme, the state-of-the-art AD compiler, by 1.3-2.5×, reduces on-chip SRAM usage by 5-40×, and saves 8× on-chip energy. We demonstrate TapeFlow on a wide range of algorithms written in a general-purpose language.
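To illustrate the layout change behind regions (contribution i), the hypothetical declarations below contrast a field-split struct-of-arrays tape with an array-of-structs tape in which every operand needed to process one entry sits in adjacent memory. The field and type names are assumptions for the sake of the example, not TapeFlow's actual record format.

```cpp
// Hypothetical tape layouts, for illustration only.
#include <vector>

// Struct-of-arrays: each field lives in its own array, so processing one tape
// entry during the reverse sweep touches several widely separated cache lines.
struct TapeSoA {
    std::vector<int>    lhs, rhs;     // operand indices
    std::vector<double> dlhs, drhs;   // local partial derivatives
};

// Array-of-structs ("region"-style): everything needed for one entry is
// contiguous, so consecutive entries share cache lines and gain spatial reuse.
struct TapeEntryAoS {
    int    lhs, rhs;
    double dlhs, drhs;
};
using TapeAoS = std::vector<TapeEntryAoS>;
```

Because the reverse sweep consumes the tape strictly in order, a contiguous layout also lets fixed-size tiles of entries be fetched ahead of time, which is the sequential access pattern that the FIFO-of-tiles streaming in contribution iii exploits.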




        Published In

        CGO '24: Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization
        March 2024
        486 pages
ISBN: 9798350395099

        In-Cooperation

        • IEEE CS

        Publisher

        IEEE Press


        Author Tags

        1. automatic differentiation
        2. gradients
        3. streaming algorithms
        4. back propagation

        Qualifiers

        • Research-article

        Conference

        CGO '24

        Acceptance Rates

        Overall Acceptance Rate 312 of 1,061 submissions, 29%
