DOI: 10.1145/3567955.3567961 · ACM Conferences · ASPLOS Conference Proceedings
Research Article · Open Access

TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators

Published: 21 December 2022

Abstract

Memory buffer allocation for on-chip memories is a major challenge in modern machine learning systems that target ML accelerators. In interactive systems such as mobile phones, it is on the critical path of launching ML-enabled applications. In data centers, it is part of complex optimization loops that run many times and are the limiting factor for the quality of compilation results.
In contrast to the traditional memory allocation problem in languages such as C++, where allocation requests arrive dynamically as the application executes, ML systems typically execute a static control flow graph that is known in advance. The task of the memory allocator is to choose buffer locations in device memory such that the total amount of used memory never exceeds the total memory available on-device. This is a high-dimensional, NP-hard optimization problem that is challenging to solve.
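To make the problem concrete, here is a minimal sketch of what a valid allocation must satisfy; the Buffer dataclass, plan_is_valid function, and all buffer sizes and live ranges are invented for illustration and are not from the paper:

```python
# A minimal sketch of the allocation problem, not TelaMalloc itself.
# Buffer names, sizes, and live ranges are hypothetical.
from dataclasses import dataclass

@dataclass
class Buffer:
    size: int   # bytes
    start: int  # first program point where the buffer is live
    end: int    # last program point where the buffer is live (inclusive)

def plan_is_valid(buffers, offsets, memory_size):
    """Check a candidate placement: every buffer must fit on-chip, and any two
    buffers whose live ranges overlap in time must occupy disjoint addresses."""
    for b, off in zip(buffers, offsets):
        if off < 0 or off + b.size > memory_size:
            return False
    for i, (a, oa) in enumerate(zip(buffers, offsets)):
        for b, ob in zip(buffers[i + 1:], offsets[i + 1:]):
            lifetimes_overlap = a.start <= b.end and b.start <= a.end
            addresses_overlap = oa < ob + b.size and ob < oa + a.size
            if lifetimes_overlap and addresses_overlap:
                return False
    return True

# Three buffers on a hypothetical 1 KiB scratchpad: the first two are live
# at the same time and must be stacked; the third can reuse freed space.
bufs = [Buffer(512, 0, 2), Buffer(256, 1, 3), Buffer(512, 3, 4)]
print(plan_is_valid(bufs, [0, 512, 0], memory_size=1024))  # True
```

Finding offsets that pass this check is equivalent to packing rectangles in the two-dimensional (time, address) plane, which is where the NP-hardness comes from.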
Today, ML frameworks approach this problem either using ad-hoc heuristics or solver-based methods. Heuristic solutions work for simple cases but fail for more complex instances of this problem. Solver-based solutions can handle these more complex instances, but are expensive and impractical in scenarios where memory allocation is on the critical path, such as on mobile devices that compile models on-the-fly. We encountered this problem in the development of Google's Pixel 6 phone, where some important models took prohibitively long to compile.
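For intuition, the following sketch shows the kind of greedy heuristic this contrasts with: place buffers one at a time at the lowest feasible offset. It reuses the hypothetical Buffer dataclass and bufs list from the sketch above; greedy_offsets is an invented name, not an allocator from any framework:

```python
# A greedy lowest-offset heuristic: fast, but a bad placement order can
# dead-end on instances that a solver would still pack successfully.
def greedy_offsets(buffers, memory_size):
    placed, result = [], []
    for b in buffers:
        # Address ranges blocked by already-placed, time-overlapping buffers.
        blocked = sorted((o, o + p.size) for p, o in placed
                         if p.start <= b.end and b.start <= p.end)
        off = 0
        for lo, hi in blocked:
            if off + b.size <= lo:
                break           # fits in the gap below this blocked range
            off = max(off, hi)  # otherwise skip past it
        if off + b.size > memory_size:
            return None         # heuristic failed; a solver might still succeed
        placed.append((b, off))
        result.append(off)
    return result

print(greedy_offsets(bufs, 1024))  # [0, 512, 0] for the example above
```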
We introduce an approach that solves this challenge by combining constraint optimization with domain-specific knowledge to achieve the best properties of both. We combine a heuristic-based search with a solver to guide its decision making. Our approach matches heuristics for simple inputs while being significantly faster than the best Integer Linear Program (ILP) solver-based approach for complex inputs. We also show how ML can be used to continuously improve the search for the long tail of workloads. Our approach is shipping in two production systems: Google's Pixel 6 phone and TPUv4. It achieves up to two orders of magnitude allocation time speed-up on real ML workloads compared to a highly-tuned production ILP approach that it replaces and enables important real-world models that could not otherwise be supported.
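As a point of reference for the solver side of this comparison, the same constraints can be expressed directly to a constraint-programming solver. The sketch below uses Google's OR-Tools CP-SAT solver and its two-dimensional no-overlap constraint over the (time, address) plane; it illustrates a pure solver-based formulation, not TelaMalloc's hybrid search, and the buffer data is again hypothetical:

```python
# A sketch of a pure CP-based formulation using OR-Tools CP-SAT.
# Time intervals are fixed by the program schedule; the address offset of
# each buffer is the decision variable.
from ortools.sat.python import cp_model

MEMORY_SIZE = 1024
buffers = [(512, 0, 2), (256, 1, 3), (512, 3, 4)]  # (size, start, end)

model = cp_model.CpModel()
time_ivs, addr_ivs, offsets = [], [], []
for i, (size, start, end) in enumerate(buffers):
    # Constant interval over the buffer's live range [start, end] inclusive.
    time_ivs.append(model.NewIntervalVar(start, end - start + 1, end + 1, f"t{i}"))
    off = model.NewIntVar(0, MEMORY_SIZE - size, f"off{i}")
    offsets.append(off)
    addr_ivs.append(model.NewIntervalVar(off, size, off + size, f"a{i}"))

# No two buffers may overlap in the (time, address) plane: buffers that are
# live simultaneously must occupy disjoint address ranges.
model.AddNoOverlap2D(time_ivs, addr_ivs)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(o) for o in offsets])
```

TelaMalloc's contribution, per the abstract, is to avoid handing the whole problem to such a solver: a heuristic search makes most placement decisions quickly, consulting the solver to guide and validate decisions on the hard cases.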


Cited By

  • (2024) gem5-NVDLA: A Simulation Framework for Compiling, Scheduling, and Architecture Evaluation on AI System-on-Chips. ACM Transactions on Design Automation of Electronic Systems 29(5), 1–20. https://doi.org/10.1145/3661997. Online publication date: 29 April 2024.
  • (2024) TinyTS: Memory-Efficient TinyML Model Compiler Framework on Microcontrollers. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 848–860. https://doi.org/10.1109/HPCA57654.2024.00070. Online publication date: 2 March 2024.
  • (2023) EagerReuse: An Efficient Memory Reuse Approach for Complex Computational Graph. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 223–229. https://doi.org/10.1109/ICPADS60453.2023.00041. Online publication date: 17 December 2023.

Information

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1
March 2023
137 pages
ISBN:9781450399159
DOI:10.1145/3567955
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CP
  2. ILP
  3. ML for Systems
  4. Machine Learning
  5. Memory Allocation

Qualifiers

  • Research-article

Conference

ASPLOS '23

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%


Article Metrics

  • Downloads (last 12 months): 2,328
  • Downloads (last 6 weeks): 222
Reflects downloads up to 18 Nov 2024.

