Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs

Authors: Camille Couturier, Saravan Rajmohan, Dongmei Zhang

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

Pages 631 - 643

https://doi.org/10.1145/3582016.3582028

Published: 25 March 2023

Abstract

Cloud providers often have resources that are not being fully utilized, and they may offer them at a lower cost to make up for the reduced availability of these resources. However, customers may be hesitant to use such offerings (such as spot VMs) as making trade-offs between cost and resource availability is not always straightforward. In this work, we propose Snape (Spot On-demand Perfect Mixture), an intelligent framework to optimize the cost and resource availability by dynamically mixing on-demand VMs with spot VMs. Through a detailed characterization based on real production traces, we verify that the eviction of spot VMs is predictable to some extent. Snape also leverages constrained reinforcement learning to adjust the mixture policy online. Experiments across different configurations show that Snape achieves 44% savings compared to using only on-demand VMs while maintaining 99.96% availability, which is 2.77% higher than using only spot VMs.
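
To make the cost and availability trade-off described in the abstract concrete, here is a minimal Python sketch of a static spot/on-demand mixture. It is not Snape's method: the paper adjusts the mixture online with constrained reinforcement learning, whereas this toy model only picks the smallest on-demand count that meets an availability target. The prices, the eviction probability, the independence assumption, and all names (`Prices`, `mixture_cost`, `expected_availability`, `cheapest_mixture`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical hourly prices per VM (not taken from the paper)."""
    on_demand: float
    spot: float


def mixture_cost(total_vms: int, on_demand_vms: int, prices: Prices) -> float:
    """Hourly cost of a fleet that mixes on-demand and spot VMs."""
    spot_vms = total_vms - on_demand_vms
    return on_demand_vms * prices.on_demand + spot_vms * prices.spot


def expected_availability(total_vms: int, on_demand_vms: int,
                          spot_eviction_prob: float) -> float:
    """Expected fraction of requested VMs that stay up (total_vms > 0).

    Assumption: on-demand VMs are never evicted, and each spot VM is
    evicted independently with probability spot_eviction_prob.
    """
    spot_vms = total_vms - on_demand_vms
    expected_up = on_demand_vms + spot_vms * (1.0 - spot_eviction_prob)
    return expected_up / total_vms


def cheapest_mixture(total_vms: int, spot_eviction_prob: float,
                     target_availability: float) -> int:
    """Smallest on-demand count whose expected availability meets the target.

    Assuming spot VMs are cheaper than on-demand VMs, fewer on-demand VMs
    means lower cost, so the first feasible count is also the cheapest.
    """
    for on_demand_vms in range(total_vms + 1):
        availability = expected_availability(
            total_vms, on_demand_vms, spot_eviction_prob)
        if availability >= target_availability:
            return on_demand_vms
    return total_vms  # not reached for targets <= 1.0; kept as a safe default


if __name__ == "__main__":
    prices = Prices(on_demand=1.0, spot=0.25)  # hypothetical prices
    total = 8
    k = cheapest_mixture(total, spot_eviction_prob=0.25,
                         target_availability=0.875)
    cost = mixture_cost(total, k, prices)
    baseline = mixture_cost(total, total, prices)
    print(f"on-demand VMs: {k}, "
          f"savings vs. on-demand only: {1 - cost / baseline:.1%}")
```

With these made-up inputs the search settles on a mixture rather than either extreme. A learned policy such as the one the abstract describes would instead revise the on-demand share over time as eviction predictions change.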

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

March 2023

820 pages

ISBN: 9781450399180

DOI: 10.1145/3582016

General Chair: Tor M. Aamodt, University of British Columbia, Canada

Program Chairs: Natalie Enright Jerger, University of Toronto, Canada; Michael Swift, University of Wisconsin-Madison, USA

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

March 25 - 29, 2023

Vancouver, BC, Canada

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics

Total citations: 4
Total downloads: 881
Downloads (last 12 months): 415
Downloads (last 6 weeks): 39

Reflects downloads up to 23 Sep 2024

Cited By

Miao X, Shi C, Duan J, Xi X, Lin D, Cui B, Jia Z, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). SpotServe: Serving Generative Large Language Models on Preemptible Instances. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1112-1127. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620665.3640411

Kim K, Lee K, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). Making Cloud Spot Instance Interruption Events Visible. Proceedings of the ACM Web Conference 2024, 2998-3009. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645548

Cheng J, Zhao Y, Li Z, Chen Q, Cui W, Guo M (2023). Microless: Cost-Efficient Hybrid Deployment of Microservices on IaaS VMs and Serverless. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2303-2310. Online publication date: 17-Dec-2023.
https://doi.org/10.1109/ICPADS60453.2023.00309

Qin X, Ma M, Zhao Y, Zhang J, Du C, Liu Y, Parayil A, Bansal C, Rajmohan S, Goiri Í, Cortez E, Qin S, Lin Q, Zhang D (2023). How Different are the Cloud Workloads? Characterizing Large-Scale Private and Public Cloud Workloads. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 522-530. Online publication date: Jun-2023.
https://doi.org/10.1109/DSN58367.2023.00055
