DOI: 10.1145/3267809.3267823
research-article

Overload Control for Scaling WeChat Microservices

Published: 11 October 2018

Editorial Notes

A corrigendum was issued for this article on January 3, 2019. It is listed under Supplementary Material.

Abstract

Effective overload control for large-scale online service systems is crucial for protecting the system backend from overload. Conventionally, overload control is designed ad hoc for each individual service. However, service-specific overload control can be detrimental to the overall system because of intricate service dependencies or flawed service implementations. Service developers usually have difficulty accurately estimating the dynamics of the actual workload during service development. Therefore, it is essential to decouple overload control from service logic. In this paper, we propose DAGOR, an overload control scheme designed for the account-oriented microservice architecture. DAGOR is service-agnostic and system-centric. It manages overload at the granularity of individual microservices: each microservice monitors its load status in real time and, when overload is detected, triggers load shedding in collaboration with its relevant services. DAGOR has been used in the WeChat backend for five years. Experimental results show that DAGOR can sustain a high service success rate even when the system is overloaded, while ensuring fairness in overload control.
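
The monitoring and collaborative shedding described in the abstract can be illustrated with a minimal sketch. This is not DAGOR's implementation: the abstract does not specify the load signal, the priority scheme, or how services coordinate. The sketch assumes mean queuing time as the load signal, a 20 ms threshold, integer priorities from 0 (highest) to 10 (lowest), and an admission level that the caller learns from responses. The names `AdmissionController` and `UpstreamClient` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdmissionController:
    """Per-service load monitor and shedder (illustrative, not DAGOR itself)."""

    overload_queue_ms: float = 20.0  # assumed threshold on mean queuing time
    max_level: int = 10              # assumed priorities: 0 (highest) .. 10 (lowest)
    level: int = 10                  # requests with priority > level are shed
    _samples: list = field(default_factory=list)

    def record_queue_time(self, ms: float) -> None:
        """Record how long one request waited before being processed."""
        self._samples.append(ms)

    def end_window(self) -> bool:
        """Close the monitoring window and adjust the admission level.

        Returns True if the window was judged overloaded.
        """
        if not self._samples:
            return False
        mean_ms = sum(self._samples) / len(self._samples)
        self._samples.clear()
        overloaded = mean_ms > self.overload_queue_ms
        if overloaded:
            self.level = max(0, self.level - 1)               # shed more
        else:
            self.level = min(self.max_level, self.level + 1)  # recover gradually
        return overloaded

    def admit(self, priority: int) -> bool:
        """Admit a request only if its priority is within the current level."""
        return priority <= self.level


@dataclass
class UpstreamClient:
    """Caller-side cache of the downstream level, so doomed requests are
    dropped before they are sent (the assumed form of collaboration)."""

    downstream_level: int = 10

    def should_send(self, priority: int) -> bool:
        return priority <= self.downstream_level

    def on_response(self, reported_level: int) -> None:
        """Update the cached level from a value carried in a response."""
        self.downstream_level = reported_level


if __name__ == "__main__":
    server = AdmissionController()
    client = UpstreamClient()
    for ms in (35.0, 40.0, 50.0):  # a window with long queuing times
        server.record_queue_time(ms)
    server.end_window()
    client.on_response(server.level)
    print(server.level, server.admit(10), client.should_send(9))
```

In this sketch the service logic never appears: the controller sees only queuing times and priorities, which is one way to read the abstract's call to decouple overload control from service logic.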

Supplementary Material

PDF File (p149-zhou-corrigendum.pdf)
Corrigendum to "Overload Control for Scaling WeChat Microservices", by Zhou, et al., SoCC '18 Proceedings of the ACM Symposium on Cloud Computing

Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018
546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. WeChat
  2. microservice architecture
  3. overload control
  4. service admission control

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '18: ACM Symposium on Cloud Computing
October 11 - 13, 2018
Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 90
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 14 Dec 2024

Cited By

  • (2024) TopFull: An Adaptive Top-Down Overload Control for SLO-Oriented Microservices. Proceedings of the ACM SIGCOMM 2024 Conference, 876-890. DOI: 10.1145/3651890.3672253. Online publication date: 4-Aug-2024
  • (2024) Bouncer: Admission Control with Response Time Objectives for Low-latency Online Data Systems. Companion of the 2024 International Conference on Management of Data, 400-413. DOI: 10.1145/3626246.3653384. Online publication date: 9-Jun-2024
  • (2024) Detecting Inconsistencies in Microservice-Based Systems: An Annotation-Assisted Scenario-Oriented Approach. IEEE Transactions on Services Computing 17:5, 2194-2209. DOI: 10.1109/TSC.2024.3399652. Online publication date: Sep-2024
  • (2024) Adaptive QoS-Aware Microservice Deployment With Excessive Loads via Intra- and Inter-Datacenter Scheduling. IEEE Transactions on Parallel and Distributed Systems 35:9, 1565-1582. DOI: 10.1109/TPDS.2024.3425931. Online publication date: Sep-2024
  • (2024) Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters. IEEE Transactions on Parallel and Distributed Systems 35:9, 1536-1550. DOI: 10.1109/TPDS.2024.3418620. Online publication date: Sep-2024
  • (2024) Zero+: Monitoring Large-Scale Cloud-Native Infrastructure Using One-Sided RDMA. IEEE/ACM Transactions on Networking 32:4, 3499-3514. DOI: 10.1109/TNET.2024.3394514. Online publication date: Aug-2024
  • (2024) MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. IEEE Transactions on Dependable and Secure Computing, 1-18. DOI: 10.1109/TDSC.2024.3363902. Online publication date: 2024
  • (2024) Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 499-510. DOI: 10.1109/ISSRE62328.2024.00054. Online publication date: 28-Oct-2024
  • (2024) Derm: SLA-aware Resource Management for Highly Dynamic Microservices. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 424-436. DOI: 10.1109/ISCA59077.2024.00039. Online publication date: 29-Jun-2024
  • (2024) LEAD: Latency-Efficient Application Deployment for Microservices Architecture. 2024 29th International Conference on Automation and Computing (ICAC), 1-6. DOI: 10.1109/ICAC61394.2024.10718812. Online publication date: 28-Aug-2024
