DOI: 10.1145/3600006.3613165
Research Article · Open Access

Efficient Memory Management for Large Language Model Serving with PagedAttention

Published: 23 October 2023

Abstract

High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2--4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.
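The abstract describes KV-cache blocks managed like virtual-memory pages: a request's logical token positions map through a per-request block table onto fixed-size physical blocks drawn from a shared pool, so memory is allocated on demand and blocks can be shared across sequences. The following is a minimal Python sketch of that idea under stated assumptions; the names (BlockAllocator, Sequence, BLOCK_SIZE) and the block size of 16 are illustrative, not vLLM's actual API or implementation.

```python
# Illustrative sketch of block-table-style KV cache management in the
# spirit of PagedAttention. All names here are hypothetical, not vLLM's API.

from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV block (an assumed, configurable value)


class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a bounded pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks: List[int] = list(range(num_blocks))
        self.ref_counts: Dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free_blocks:
            # In a real server this would trigger preemption or swapping.
            raise MemoryError("KV cache pool exhausted")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, block: int) -> None:
        # Share a block between sequences (e.g., a common prompt prefix).
        self.ref_counts[block] += 1

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            del self.ref_counts[block]
            self.free_blocks.append(block)


class Sequence:
    """Maps a request's logical KV positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: List[int] = []  # logical block index -> physical id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full, so
        # at most BLOCK_SIZE - 1 slots per sequence are ever left unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()


# Usage: memory grows with the sequence instead of being pre-reserved.
alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(40):                    # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
assert len(seq.block_table) == 3
seq.release()
assert len(alloc.free_blocks) == 8     # all blocks returned to the pool
```

Because blocks are fixed-size and allocated lazily, fragmentation is bounded to the tail of each sequence's last block, and the reference counts sketch how a prompt prefix could be shared across parallel samples rather than duplicated.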




        Information & Contributors

        Information

        Published In

        cover image ACM Conferences
        SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles
        October 2023
        802 pages
        ISBN:9798400702297
        DOI:10.1145/3600006
This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

• USENIX

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 23 October 2023

Acceptance Rates

SOSP '23 Paper Acceptance Rate: 43 of 232 submissions, 19%
Overall Acceptance Rate: 131 of 716 submissions, 18%


