DOI: 10.1145/3517207.3526971
research-article
Public Access

Reinforcement learning for resource management in multi-tenant serverless platforms

Published: 05 April 2022

Abstract

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches remove humans from the loop and avoid the painstaking hand-crafting of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases, with less than 10% degradation. In addition, in multi-tenant cases MA-PPO improves on S-RL performance (in terms of function tail latency) by 4.4x.
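
The abstract does not spell out how MA-PPO is customized, so the following is only a minimal sketch of the standard clipped surrogate objective from the original PPO paper (Schulman et al., 2017), evaluated separately for each tenant's agent. The function name, the tenant names, the batch values, and the clipping parameter of 0.2 are illustrative assumptions, not details taken from the paper.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio r_t
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical per-tenant batches: (new log-probs, old log-probs, advantages).
# Each agent evaluates the objective only on its own trajectories.
batches = {
    "tenant_a": ([0.0, math.log(1.5)], [0.0, 0.0], [1.0, 1.0]),
    "tenant_b": ([math.log(0.5)], [0.0], [-2.0]),
}
objectives = {name: ppo_clip_objective(*b) for name, b in batches.items()}
```

In this sketch each agent's update depends only on its own batch, which is one simple way to read "multi-agent"; the paper's actual design may share information between agents in ways the abstract does not describe.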


Information & Contributors

Information

Published In

EuroMLSys '22: Proceedings of the 2nd European Workshop on Machine Learning and Systems
April 2022
121 pages
ISBN: 9781450392549
DOI: 10.1145/3517207

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. function-as-a-service
  2. multi-agent
  3. reinforcement learning
  4. resource allocation
  5. serverless computing

Qualifiers

  • Research-article

Conference

EuroSys '22

Acceptance Rates

Overall Acceptance Rate 18 of 26 submissions, 69%

Article Metrics

  • Downloads (last 12 months): 421
  • Downloads (last 6 weeks): 63

Reflects downloads up to 16 Nov 2024

Cited By

  • ComboFunc: Joint Resource Combination and Container Placement for Serverless Function Scaling with Heterogeneous Container. IEEE Transactions on Parallel and Distributed Systems (2024), 1-17. DOI: 10.1109/TPDS.2024.3454071. Online publication date: 2024.
  • Demeter: Fine-grained Function Orchestration for Geo-distributed Serverless Analytics. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications (2024), 2498-2507. DOI: 10.1109/INFOCOM52122.2024.10621303. Online publication date: 20-May-2024.
  • FuncScaler: Cold-Start-Aware Holistic Autoscaling for Serverless Resource Management. 2024 IEEE International Conference on Web Services (ICWS) (2024), 1036-1047. DOI: 10.1109/ICWS62655.2024.00122. Online publication date: 7-Jul-2024.
  • Concurrent service auto-scaling for Knative resource quota-based serverless system. Future Generation Computer Systems 160:C (2024), 326-339. DOI: 10.1016/j.future.2024.06.019. Online publication date: 1-Nov-2024.
  • Optimized resource usage with hybrid auto-scaling system for knative serverless edge computing. Future Generation Computer Systems 152:C (2024), 304-316. DOI: 10.1016/j.future.2023.11.010. Online publication date: 4-Mar-2024.
  • Dynamic Regimes for Corporate Human Capital Development Used Reinforcement Learning Methods. Mathematics 11:18 (2023), 3916. DOI: 10.3390/math11183916. Online publication date: 14-Sep-2023.
  • A Multi-Agent Deep-Reinforcement Learning Approach for Application-Agnostic Microservice Scaling. 2023 IEEE Virtual Conference on Communications (VCC) (2023), 139-144. DOI: 10.1109/VCC60689.2023.10474695. Online publication date: 28-Nov-2023.
