Abstract
Resource efficiency and application QoS have long been major concerns for datacenter operators, yet they remain difficult to reconcile. High resource utilization increases the risk of resource contention among co-located workloads, which can leave latency-critical (LC) applications with unpredictable, and even unacceptable, performance. A large body of prior work has sought effective mechanisms that protect the QoS of LC applications while improving resource efficiency. In this paper, we propose MAGI, a resource management runtime that leverages neural networks to monitor performance interference, pinpoint its root cause, and adjust the resource shares of the offending applications so that the QoS of LC applications is preserved. MAGI is deployed in Alibaba's datacenters, where it provides on-demand resource adjustment for applications using neural networks. Experimental results show that MAGI reduces the performance degradation of an LC application by up to 87.3% when it is co-located with antagonistic applications.
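The paper's implementation is not reproduced on this page, but the feedback loop the abstract describes (monitor an LC application, let a learned model localize the contended resource, then shrink the antagonists' share of that resource) can be sketched as follows. This is a minimal illustrative sketch only: the function names, the toy stand-in for the neural network, and the resource list are all hypothetical, not MAGI's actual API.

```python
# Illustrative sketch of a MAGI-style adaptive resource management loop.
# Every name below is a hypothetical stand-in; the real system reads hardware
# counters and actuates mechanisms such as cgroups (CPU) or cache partitioning.

import random
import time

RESOURCES = ["cpu", "llc", "memory_bw"]  # assumed contended-resource candidates

def sample_metrics(app):
    """Stand-in for reading performance signals (e.g., CPI, cache misses)."""
    return {r: random.random() for r in RESOURCES}

def localize_interference(metrics):
    """Stand-in for the neural network: report the most contended resource."""
    return max(metrics, key=metrics.get)

def shrink_share(app, resource, step=0.1):
    """Stand-in for an actuator that lowers an application's resource share."""
    print(f"reducing {resource} share of {app} by {step:.0%}")

def control_loop(lc_app, best_effort_apps, qos_violated, rounds=3, interval_s=1.0):
    for _ in range(rounds):
        if qos_violated(lc_app):                     # e.g., tail latency over SLO
            culprit = localize_interference(sample_metrics(lc_app))
            for app in best_effort_apps:             # penalize co-located antagonists
                shrink_share(app, culprit)
        time.sleep(interval_s)

if __name__ == "__main__":
    # Demo with a QoS check that always fires, to exercise the loop.
    control_loop("lc-service", ["batch-job-1", "batch-job-2"], lambda app: True)
```

The key design point the abstract implies is that actuation is targeted: only the resource the model identifies as the root cause is throttled, rather than uniformly squeezing all best-effort workloads.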
Cite this article
Wang, S., Zhu, YH., Chen, SP. et al. A Case for Adaptive Resource Management in Alibaba Datacenter Using Neural Networks. J. Comput. Sci. Technol. 35, 209–220 (2020). https://doi.org/10.1007/s11390-020-9732-x