DOI: 10.1145/3445814.3446693
research-article
Public Access

Sinan: ML-based and QoS-aware resource management for cloud microservices

Published: 17 April 2021

Abstract

Cloud applications are increasingly shifting from large monolithic services to large numbers of loosely coupled, specialized microservices. Despite their advantages in facilitating development, deployment, modularity, and isolation, microservices complicate resource management, because dependencies between them introduce backpressure effects and cascading QoS violations.
We present Sinan, an online, QoS-aware, data-driven cluster manager for interactive cloud microservices. Sinan leverages a set of scalable and validated machine learning models to determine the performance impact of dependencies between microservices and to allocate appropriate resources per tier in a way that preserves the end-to-end tail latency target. We evaluate Sinan on both dedicated local clusters and large-scale deployments on Google Compute Engine (GCE), across representative end-to-end applications built with microservices, such as social networks and hotel reservation sites. We show that Sinan always meets QoS while also keeping cluster utilization high, in contrast to prior work, which either leads to unpredictable performance or sacrifices resource efficiency. Furthermore, the techniques in Sinan are explainable, meaning that cloud operators can gain insights from the ML models on how to better deploy and design their applications to reduce unpredictable performance.
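The idea of allocating resources per tier so that a predicted end-to-end tail latency stays within a target can be illustrated with a small sketch. This is not Sinan's implementation: the latency predictor below is a toy queueing-style formula standing in for a learned model, and the names `predict_tail_latency_ms`, `cheapest_allocation`, `PER_CORE_RPS`, and `BASE_MS` are hypothetical, chosen only for illustration. The search is a brute-force scan, which is only practical for a handful of tiers.

```python
from itertools import product

PER_CORE_RPS = 100.0  # hypothetical capacity of one core, in requests/s
BASE_MS = 5.0         # hypothetical per-tier service time at zero load


def predict_tail_latency_ms(cpus_per_tier, load_rps):
    """Toy stand-in for a learned model: sums a queueing-style delay per tier."""
    total = 0.0
    for cpus in cpus_per_tier:
        utilization = load_rps / (cpus * PER_CORE_RPS)
        if utilization >= 1.0:
            return float("inf")  # saturated tier: treat as a QoS violation
        total += BASE_MS / (1.0 - utilization)
    return total


def cheapest_allocation(num_tiers, load_rps, qos_ms, max_cpus=8):
    """Return the allocation with the fewest total cores predicted to meet qos_ms."""
    best = None
    for alloc in product(range(1, max_cpus + 1), repeat=num_tiers):
        if predict_tail_latency_ms(alloc, load_rps) <= qos_ms:
            if best is None or sum(alloc) < sum(best):
                best = alloc
    return best  # None when no allocation within max_cpus meets the target


if __name__ == "__main__":
    print(cheapest_allocation(num_tiers=2, load_rps=150.0, qos_ms=40.0))
```

In this sketch a saturated tier makes the whole prediction infinite, which loosely mirrors the point in the abstract that a single under-provisioned tier can cause an end-to-end violation.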



Published In

ASPLOS '21: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
April 2021
1090 pages
ISBN:9781450383172
DOI:10.1145/3445814
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud computing
  2. cluster management
  3. datacenter
  4. machine learning for system
  5. microservices
  6. quality of service
  7. resource efficiency
  8. resource management
  9. resource allocation
  10. tail latency

Qualifiers

  • Research-article

Conference

ASPLOS '21
Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

  • Downloads (Last 12 months)1,045
  • Downloads (Last 6 weeks)148
Reflects downloads up to 12 Nov 2024

Cited By

  • (2025) GenesisRM: A state-driven approach to resource management for distributed JVM web applications. Future Generation Computer Systems, 163 (107539). DOI: 10.1016/j.future.2024.107539. Online publication date: Feb-2025
  • (2024) Quality of Service (QoS)-Aware Microservices Selection Based on Local Constraints. International Journal of Computer Theory and Engineering, 16:2 (35-43). DOI: 10.7763/IJCTE.2024.V16.1352. Online publication date: 2024
  • (2024) Toward Trustworthy Learning-Enabled Systems with Concept-Based Explanations. Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (60-67). DOI: 10.1145/3696348.3696894. Online publication date: 18-Nov-2024
  • (2024) Opportunities and Challenges in Service Layer Traffic Engineering. Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (352-359). DOI: 10.1145/3696348.3696871. Online publication date: 18-Nov-2024
  • (2024) DeployFix: Dynamic Repair of Software Deployment Failures via Constraint Solving. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (2053-2064). DOI: 10.1145/3691620.3695268. Online publication date: 27-Oct-2024
  • (2024) TopFull: An Adaptive Top-Down Overload Control for SLO-Oriented Microservices. Proceedings of the ACM SIGCOMM 2024 Conference (876-890). DOI: 10.1145/3651890.3672253. Online publication date: 4-Aug-2024
  • (2024) Do Predictors for Resource Overcommitment Even Predict? Proceedings of the 4th Workshop on Machine Learning and Systems (153-160). DOI: 10.1145/3642970.3655838. Online publication date: 22-Apr-2024
  • (2024) Integrating System State into Spatio Temporal Graph Neural Network for Microservice Workload Prediction. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (5521-5531). DOI: 10.1145/3637528.3671508. Online publication date: 25-Aug-2024
  • (2024) Optimizing Resource Management for Shared Microservices: A Scalable System Design. ACM Transactions on Computer Systems, 42:1-2 (1-28). DOI: 10.1145/3631607. Online publication date: 13-Feb-2024
  • (2024) Flux: Decoupled Auto-Scaling for Heterogeneous Query Workload in Alibaba AnalyticDB. Companion of the 2024 International Conference on Management of Data (255-268). DOI: 10.1145/3626246.3653381. Online publication date: 9-Jun-2024