
Research Article · Open Access

Allo: A Programming Model for Composable Accelerator Design

Published: 20 June 2024

Abstract

Special-purpose hardware accelerators are increasingly pivotal for sustaining performance improvements in emerging applications, especially as the benefits of technology scaling continue to diminish. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures productively. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. Despite the introduction of several new accelerator design languages (ADLs) aiming to enhance or replace HLS, their advantages are most evident in relatively simple applications with a single kernel; they prove less effective for realistic hierarchical designs with multiple kernels, even when the design hierarchy is flattened. In this paper, we introduce Allo, a composable programming model for efficient spatial accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data types, from the algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner, which facilitates holistic optimizations that span function boundaries. We conduct comprehensive experiments on commonly used HLS benchmarks and several realistic deep learning models. Our evaluation shows that Allo outperforms state-of-the-art HLS tools and ADLs on all test cases in PolyBench. For the GPT2 model, the Allo-generated accelerator achieves 1.7x lower inference latency than the NVIDIA A100 GPU with 5.4x higher energy efficiency, demonstrating the capability of Allo to handle large-scale designs.
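
The decoupled, composable style described above can be made concrete with a short sketch. The algorithm below is an ordinary type-annotated Python function; all hardware customizations are applied afterwards through scheduling primitives on a separate schedule object, and a kernel's schedule is composed bottom-up into the top-level design rather than flattened into it. The primitive names (customize, reorder, buffer_at, pipeline, compose, build) follow the paper's terminology; exact import paths, signatures, and the build-target string are assumptions based on the open-source Allo release, not verbatim API.

    # Minimal sketch of Allo's decoupled customization flow (API details assumed).
    import allo
    from allo.ir.types import float32  # assumed import path

    def gemm(A: float32[32, 32], B: float32[32, 32]) -> float32[32, 32]:
        # Pure algorithm specification: no pragmas or hardware directives.
        C: float32[32, 32] = 0.0
        for i, j, k in allo.grid(32, 32, 32):
            C[i, j] += A[i, k] * B[k, j]
        return C

    def top(A: float32[32, 32], B: float32[32, 32]) -> float32[32, 32]:
        # Hierarchical design: the top-level function simply calls the kernel.
        return gemm(A, B)

    # Customize the kernel in isolation.
    s_gemm = allo.customize(gemm)
    s_gemm.reorder("k", "j")              # compute customization: loop interchange
    s_gemm.buffer_at(s_gemm.C, axis="i")  # memory customization: buffer an output row
    s_gemm.pipeline("j")                  # pipeline the new innermost loop

    # Compose the kernel's schedule into the top-level design bottom-up,
    # preserving the function hierarchy instead of inlining it.
    s_top = allo.customize(top)
    s_top.compose(s_gemm)

    mod = s_top.build(target="vhls")  # assumed target string for Vitis HLS codegen

Because composition is type-checked, a schedule written once for gemm can be reused wherever the kernel appears in the hierarchy, which is what enables the cross-function optimizations described above.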

Supplementary Material

Auxiliary Archive (pldi24main-p130-p-archive.zip)
Supplementary material for Allo


Cited By

  • (2024) Unifying Static and Dynamic Intermediate Languages for Accelerator Generators. Proceedings of the ACM on Programming Languages, 8(OOPSLA2), 2242–2267. https://doi.org/10.1145/3689790. Online publication date: 8 Oct 2024.



Published In

Proceedings of the ACM on Programming Languages, Volume 8, Issue PLDI
June 2024, 2198 pages
EISSN: 2475-1421
DOI: 10.1145/3554317

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 June 2024
Published in PACMPL Volume 8, Issue PLDI


Author Tags

  1. Hardware accelerators
  2. accelerator design language
  3. compiler optimization
  4. schedule language

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 1,006
  • Downloads (last 6 weeks): 331
Reflects downloads up to 18 Nov 2024.

