DOI: 10.1109/CGO57630.2024.10444805

TapeFlow: Streaming Gradient Tapes in Automatic Differentiation

Published: 27 September 2024

Abstract

Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse-mode AD, an auxiliary structure, the tape, is used to transfer the intermediate values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy: it has a high reuse distance, lacks temporal locality, and inflates the working set by 2-4×.
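The page carries no code, but a minimal sketch of a conventional reverse-mode tape makes the data structure concrete: the forward pass appends one record per operation, and the reverse sweep walks the records back-to-front to accumulate gradients, which is exactly the access pattern that gives the tape its long reuse distance. The types and helpers below (TapeEntry, mul, sin_, backward) are our illustration, not TapeFlow's or Enzyme's API.

```cpp
// A minimal, self-contained sketch of a reverse-mode AD tape (illustration
// only; this is not TapeFlow's or Enzyme's implementation). The forward pass
// records each operation's operands and local partial derivatives; the
// reverse sweep replays the records back-to-front to accumulate gradients.
#include <cmath>
#include <cstdio>
#include <vector>

// One tape entry: operand indices plus the local partials of the result.
struct TapeEntry {
    int lhs, rhs;        // operand indices (-1 means "no operand")
    double dlhs, drhs;   // d(result)/d(lhs), d(result)/d(rhs)
};

std::vector<TapeEntry> tape;   // grows with every operation executed forward
std::vector<double> value;     // intermediate values, index-aligned with the tape
std::vector<double> adjoint;   // gradients, filled in by the reverse sweep

int input(double x) {                      // leaf variable: nothing to propagate
    value.push_back(x);
    tape.push_back({-1, -1, 0.0, 0.0});
    return (int)value.size() - 1;
}

int mul(int a, int b) {                    // record v = value[a] * value[b]
    value.push_back(value[a] * value[b]);
    tape.push_back({a, b, value[b], value[a]});   // d(ab)/da = b, d(ab)/db = a
    return (int)value.size() - 1;
}

int sin_(int a) {                          // record v = sin(value[a])
    value.push_back(std::sin(value[a]));
    tape.push_back({a, -1, std::cos(value[a]), 0.0});
    return (int)value.size() - 1;
}

// Reverse sweep: walk the tape backwards, pushing each adjoint onto its operands.
void backward(int output) {
    adjoint.assign(value.size(), 0.0);
    adjoint[output] = 1.0;
    for (int i = (int)tape.size() - 1; i >= 0; --i) {
        const TapeEntry& e = tape[i];
        if (e.lhs >= 0) adjoint[e.lhs] += e.dlhs * adjoint[i];
        if (e.rhs >= 0) adjoint[e.rhs] += e.drhs * adjoint[i];
    }
}

int main() {
    int x = input(2.0), y = input(3.0);
    int f = sin_(mul(x, y));               // f = sin(x * y)
    backward(f);
    std::printf("df/dx = %f, df/dy = %f\n", adjoint[x], adjoint[y]);  // y*cos(xy), x*cos(xy)
    return 0;
}
```

For f = sin(x·y) with x = 2 and y = 3 this prints df/dx = 3·cos(6) and df/dy = 2·cos(6). Note that tape and value grow with every executed operation and are read again only much later, in reverse order; that is the working-set inflation and lack of temporal locality described above.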
We introduce TapeFlow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad, which reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from DRAM by organizing it into a FIFO of tiles, so that tape operands arrive just-in-time for each layer. TapeFlow, running on the same hardware, outperforms Enzyme, the state-of-the-art AD compiler, by 1.3-2.5×, reduces on-chip SRAM usage by 5-40×, and saves 8× on-chip energy. We demonstrate TapeFlow on a wide range of algorithms written in a general-purpose language.
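To illustrate the layout change behind regions (contribution i), the hypothetical declarations below contrast a field-split struct-of-arrays tape with an array-of-structs tape in which every operand needed to process one entry sits in adjacent memory. The field and type names are assumptions for the sake of the example, not TapeFlow's actual record format.

```cpp
// Hypothetical tape layouts, for illustration only.
#include <vector>

// Struct-of-arrays: each field lives in its own array, so processing one tape
// entry during the reverse sweep touches several widely separated cache lines.
struct TapeSoA {
    std::vector<int>    lhs, rhs;     // operand indices
    std::vector<double> dlhs, drhs;   // local partial derivatives
};

// Array-of-structs ("region"-style): everything needed for one entry is
// contiguous, so consecutive entries share cache lines and gain spatial reuse.
struct TapeEntryAoS {
    int    lhs, rhs;
    double dlhs, drhs;
};
using TapeAoS = std::vector<TapeEntryAoS>;
```

Because the reverse sweep consumes the tape strictly in order, a contiguous layout also lets fixed-size tiles of entries be fetched ahead of time, which is the sequential access pattern that the FIFO-of-tiles streaming in contribution iii exploits.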




        Published In

        CGO '24: Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization
        March 2024
        486 pages
ISBN: 9798350395099

        In-Cooperation

        • IEEE CS

        Publisher

        IEEE Press


        Author Tags

        1. automatic differentiation
        2. gradients
        3. streaming algorithms
        4. back propagation

        Qualifiers

        • Research-article

        Conference

        CGO '24

        Acceptance Rates

        Overall Acceptance Rate 312 of 1,061 submissions, 29%
