Paella: Low-latency Model Serving with Software-defined GPU Scheduling

Authors: Kelvin K. W. Ng, Henri Maxime Demoulin, Vincent Liu

SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles

Pages 595 - 610

https://doi.org/10.1145/3600006.3613163

Published: 23 October 2023

Abstract

Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infrastructure. These systems have traditionally sat at a high level of abstraction, receiving jobs from clients through a narrow API and relying on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations in the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs. The current abstraction level also incurs system overheads that are similarly most significant when the GPU is heavily shared.

In this paper, we argue for co-designing the model compiler, local clients, and the scheduler to bypass the built-in GPU scheduler and enable software control of kernel execution order. Doing so enables the use of arbitrary scheduling algorithms and reduces system overheads throughout the critical path of inference.
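To make "software control of kernel execution order" concrete, the following is a minimal sketch in Python. It is not the scheduler described in the paper. It assumes a simplified device that runs one kernel at a time, jobs that all arrive at time zero, and kernel run times known in advance. The names `Job`, `fifo`, `shortest_remaining_first`, `run`, and `make_jobs` are hypothetical and exist only for illustration. The point is that once dispatch order is decided in software, the scheduling algorithm becomes a pluggable function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    """Hypothetical inference job: an ordered list of kernel run times (ms)."""
    name: str
    kernel_ms: List[float]
    next_kernel: int = 0  # index of the next kernel to dispatch

    def remaining_ms(self) -> float:
        return sum(self.kernel_ms[self.next_kernel:])

    def done(self) -> bool:
        return self.next_kernel >= len(self.kernel_ms)


# A policy picks which pending job gets to dispatch its next kernel.
Policy = Callable[[List[Job]], Job]


def fifo(pending: List[Job]) -> Job:
    """Serve jobs in arrival order (here: list order)."""
    return pending[0]


def shortest_remaining_first(pending: List[Job]) -> Job:
    """Serve the job with the least remaining work."""
    return min(pending, key=lambda job: job.remaining_ms())


def run(jobs: List[Job], policy: Policy) -> Dict[str, float]:
    """Dispatch one kernel at a time; return each job's completion time (ms).

    Assumption: the device executes a single kernel at a time, so the
    software-chosen order fully determines the schedule.
    """
    clock = 0.0
    finished: Dict[str, float] = {}
    pending = [job for job in jobs if not job.done()]  # skip empty jobs
    while pending:
        job = policy(pending)
        clock += job.kernel_ms[job.next_kernel]
        job.next_kernel += 1
        if job.done():
            finished[job.name] = clock
            pending.remove(job)
    return finished


def make_jobs() -> List[Job]:
    """Illustrative workload: one long job queued ahead of one short job."""
    return [Job("large", [4.0, 4.0, 4.0]), Job("small", [1.0, 1.0])]


if __name__ == "__main__":
    for policy in (fifo, shortest_remaining_first):
        times = run(make_jobs(), policy)
        mean = sum(times.values()) / len(times)
        print(f"{policy.__name__}: {times}, mean = {mean} ms")
```

In this toy workload, arrival-order dispatch makes the short job wait behind the long one, while the shortest-remaining-first policy lets it finish early at the cost of delaying the long job slightly. Swapping the policy requires no change to the dispatch loop, which is the flexibility the abstract attributes to software-defined scheduling.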

Cited By

Piao X, Kim J (2024). GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models. Proceedings of the 53rd International Conference on Parallel Processing, pages 660-668. https://dl.acm.org/doi/10.1145/3673038.3673122. Online publication date: 12-Aug-2024.

Sinha S, Dwivedi S, Azizian M (2024). Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan. 2024 ACM/IEEE 15th International Conference on Cyber-Physical Systems (ICCPS), pages 235-246. https://doi.org/10.1109/ICCPS61052.2024.00028. Online publication date: 13-May-2024.

Published In

SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles

October 2023

802 pages

ISBN: 9798400702297

DOI: 10.1145/3600006

Conference Chairs: Jason Flinn (Meta), Margo Seltzer (University of British Columbia)

General Chairs: Peter Druschel (Max Planck Institute for Software Systems (MPI-SWS)), Antoine Kaufmann (Max Planck Institute for Software Systems (MPI-SWS)), Jonathan Mace (Max Planck Institute for Software Systems (MPI-SWS) and Microsoft Research)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

In-Cooperation

USENIX

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF (National Science Foundation)

Conference

SOSP '23

Sponsor:

SIGOPS

SOSP '23: 29th Symposium on Operating Systems Principles

October 23 - 26, 2023

Koblenz, Germany

Acceptance Rates

SOSP '23 Paper Acceptance Rate 43 of 232 submissions, 19%;

Overall Acceptance Rate 131 of 716 submissions, 18%
