Predictive and adaptive failure mitigation to avert production cloud VM interruptions

Authors:

Sebastien Levy,

Naga Govindaraju,

Gil Lapid Shafriri,

Murali Chintalapati

OSDI'20: Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation

Article No.: 65, Pages 1155 - 1170

Published: 04 November 2020

Abstract

When a failure occurs in a production system, the highest priority is to mitigate it quickly. Despite its importance, failure mitigation is typically reactive and ad hoc: fixed actions are taken only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose Narya, a preventive and adaptive failure mitigation service that is integrated into a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures from multi-layer system signals and then decides which mitigation action to take, with the goal of averting VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action, and it further enhances this adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. On average, it reduces VM interruptions by 26% compared to the previous static strategy.
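The abstract does not say which learning algorithm the decision engine uses, so the following is only an illustrative sketch of the general idea of adaptively exploring mitigation actions. It assumes a Beta-Bernoulli multi-armed bandit with Thompson sampling, where each arm is a mitigation action and the reward is 1 when a VM interruption is averted. The action names, success rates, and the `ThompsonSampler` class are all hypothetical and are not taken from Narya.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of actions.

    Illustrative only: the actions and rewards here are hypothetical
    and do not reproduce Narya's actual decision engine.
    """

    def __init__(self, actions, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per action, stored as [successes + 1, failures + 1].
        self.stats = {action: [1, 1] for action in actions}

    def choose(self):
        """Sample a success rate per action and pick the largest sample."""
        draws = {
            action: self.rng.betavariate(alpha, beta)
            for action, (alpha, beta) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, action, averted):
        """Record whether the chosen action averted an interruption."""
        index = 0 if averted else 1
        self.stats[action][index] += 1

    def pulls(self, action):
        """Number of times an action has been tried so far."""
        alpha, beta = self.stats[action]
        return alpha + beta - 2

    def best(self):
        """Action with the highest posterior mean success rate."""
        return max(
            self.stats,
            key=lambda a: self.stats[a][0] / sum(self.stats[a]),
        )


def simulate(true_rates, rounds, seed=0):
    """Run the sampler against hypothetical per-action success rates."""
    sampler = ThompsonSampler(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        action = sampler.choose()
        averted = env.random() < true_rates[action]
        sampler.update(action, averted)
    return sampler


if __name__ == "__main__":
    # Hypothetical probabilities that each action averts an interruption.
    rates = {"no_op": 0.2, "soft_reboot": 0.5, "live_migrate": 0.9}
    result = simulate(rates, rounds=2000)
    for name in rates:
        print(name, result.pulls(name))
    print("preferred action:", result.best())
```

Sampling from the posterior, rather than always picking the action with the best observed rate, keeps some exploration alive: an action that looked poor early on can still be retried occasionally, which matters if its effectiveness changes over time.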

Published In

OSDI'20: Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation

November 2020

1255 pages

ISBN: 978-1-939133-19-9

Program Chairs: Shan Lu (University of Chicago), Jon Howell (VMware)

Copyright © 2020.

Sponsors

ORACLE
VMware
Google Inc.
Amazon
Microsoft

Publisher

USENIX Association

United States
