research-article

Reinforcement learning for resource management in multi-tenant serverless platforms

Authors: Hubertus Franke, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

EuroMLSys '22: Proceedings of the 2nd European Workshop on Machine Learning and Systems

Pages 20 - 28

https://doi.org/10.1145/3517207.3526971

Published: 05 April 2022

Abstract

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce function tail latency and improve resource utilization, recent research has focused on applying online learning algorithms such as reinforcement learning (RL) to resource management. Compared to existing heuristics-based approaches, RL-based approaches remove humans from the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases, with less than 10% degradation. In multi-tenant cases, MA-PPO also improves on S-RL by 4.4x in terms of function tail latency.
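For readers unfamiliar with PPO, the following is a minimal sketch of the standard clipped surrogate objective from Schulman et al. (2017), evaluated separately for each agent. It is not the customized MA-PPO described in the abstract, whose details are not given here. The function names, the one-agent-per-tenant layout, and the `clip_eps=0.2` default are illustrative assumptions.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of standard PPO.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio r_t
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def per_agent_objectives(batches, clip_eps=0.2):
    """Assumption: one independent agent per tenant, each with its own batch.

    batches maps a (hypothetical) agent name to a tuple
    (new_logp, old_logp, advantages).
    """
    return {name: ppo_clip_objective(*batch, clip_eps)
            for name, batch in batches.items()}

# Illustrative data only: two hypothetical tenants.
batches = {
    "tenant_a": ([0.0], [0.0], [1.0]),            # ratio 1, unclipped
    "tenant_b": ([math.log(2.0)], [0.0], [1.0]),  # ratio 2, clipped to 1.2
}
objectives = per_agent_objectives(batches)
```

In this sketch each agent maximizes its own objective on its own data, which is the simplest independent-learner arrangement; whether the paper's agents share any information is not stated in the abstract.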

Index Terms

Reinforcement learning for resource management in multi-tenant serverless platforms
1. Computing methodologies
  1. Artificial intelligence
    1. Distributed artificial intelligence
      1. Multi-agent systems
    2. Planning and scheduling
      1. Multi-agent planning
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Published In

EuroMLSys '22: Proceedings of the 2nd European Workshop on Machine Learning and Systems

April 2022

121 pages

ISBN: 9781450392549

DOI: 10.1145/3517207

Program Chairs: Eiko Yoneki (University of Cambridge), Luigi Nardi (Lund University, Stanford University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

IIDAI
National Science Foundation

Conference

EuroSys '22

Sponsor:

SIGOPS

EuroSys '22: Seventeenth European Conference on Computer Systems

April 5 - 8, 2022

Rennes, France
