DOI: 10.1145/3582016.3582028
research-article

Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs

Published: 25 March 2023

Abstract

Cloud providers often have resources that are not being fully utilized, and they may offer them at a lower cost to make up for the reduced availability of these resources. However, customers may be hesitant to use such offerings (such as spot VMs) as making trade-offs between cost and resource availability is not always straightforward. In this work, we propose Snape (Spot On-demand Perfect Mixture), an intelligent framework to optimize the cost and resource availability by dynamically mixing on-demand VMs with spot VMs. Through a detailed characterization based on real production traces, we verify that the eviction of spot VMs is predictable to some extent. Snape also leverages constrained reinforcement learning to adjust the mixture policy online. Experiments across different configurations show that Snape achieves 44% savings compared to using only on-demand VMs while maintaining 99.96% availability, which is 2.77% higher than using only spot VMs.
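
The cost and availability trade-off described in the abstract can be illustrated with a small sketch. This is not Snape's method: Snape adjusts the mixture online with constrained reinforcement learning and eviction prediction, whereas the code below only brute-forces a static mix for one period. Every name and number here (`MixInputs`, the prices, the eviction probability, the availability target) is a hypothetical assumption chosen for illustration, and the model assumes on-demand VMs are never evicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixInputs:
    """Hypothetical inputs; none of these values come from the paper."""
    total_vms: int               # number of VMs the workload requests
    on_demand_price: float       # price per VM-hour for on-demand VMs
    spot_price: float            # price per VM-hour for spot VMs
    spot_eviction_prob: float    # assumed chance a spot VM is evicted in the period
    target_availability: float   # required expected fraction of VMs that stay up


def expected_availability(spot_vms: int, inputs: MixInputs) -> float:
    """Expected fraction of requested VMs still running.

    Simplifying assumption: on-demand VMs are never evicted, and each spot VM
    is evicted independently with probability spot_eviction_prob.
    """
    on_demand_vms = inputs.total_vms - spot_vms
    surviving_spot = spot_vms * (1.0 - inputs.spot_eviction_prob)
    return (on_demand_vms + surviving_spot) / inputs.total_vms


def hourly_cost(spot_vms: int, inputs: MixInputs) -> float:
    """Total price per hour for a mix with the given number of spot VMs."""
    on_demand_vms = inputs.total_vms - spot_vms
    return spot_vms * inputs.spot_price + on_demand_vms * inputs.on_demand_price


def choose_spot_count(inputs: MixInputs) -> int:
    """Return the cheapest spot count that meets the availability target."""
    feasible = [
        n for n in range(inputs.total_vms + 1)
        if expected_availability(n, inputs) >= inputs.target_availability
    ]
    if not feasible:
        raise ValueError("no mix satisfies the availability target")
    return min(feasible, key=lambda n: hourly_cost(n, inputs))


if __name__ == "__main__":
    example = MixInputs(
        total_vms=100,
        on_demand_price=1.0,
        spot_price=0.3,
        spot_eviction_prob=0.08,
        target_availability=0.97,
    )
    spot = choose_spot_count(example)
    print(spot, hourly_cost(spot, example), expected_availability(spot, example))
```

Under these assumptions, a stricter availability target or a higher predicted eviction probability pushes the mix toward on-demand VMs, which is the kind of trade-off a dynamic policy has to re-evaluate as conditions change.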

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
March 2023
820 pages
ISBN: 9781450399180
DOI: 10.1145/3582016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Spot virtual machine
  2. dynamic mixture
  3. eviction prediction

Qualifiers

  • Research-article

Conference

ASPLOS '23

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Cited By

  • (2024) SpotServe: Serving Generative Large Language Models on Preemptible Instances. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1112-1127. DOI: 10.1145/3620665.3640411. Online publication date: 27-Apr-2024
  • (2024) Making Cloud Spot Instance Interruption Events Visible. Proceedings of the ACM Web Conference 2024, 2998-3009. DOI: 10.1145/3589334.3645548. Online publication date: 13-May-2024
  • (2023) Microless: Cost-Efficient Hybrid Deployment of Microservices on IaaS VMs and Serverless. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2303-2310. DOI: 10.1109/ICPADS60453.2023.00309. Online publication date: 17-Dec-2023
  • (2023) How Different are the Cloud Workloads? Characterizing Large-Scale Private and Public Cloud Workloads. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 522-530. DOI: 10.1109/DSN58367.2023.00055. Online publication date: Jun-2023
