DOI: 10.1145/3357223.3362734

Scavenger: A Black-Box Batch Workload Resource Manager for Improving Utilization in Cloud Environments

Published: 20 November 2019

Abstract

Resource under-utilization is common in cloud data centers. Prior work has proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge that remains unaddressed is treating the tenant workloads as a black box. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demand at all times, Scavenger dynamically regulates the resource usage of batch jobs, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds using latency-sensitive tenant workloads colocated with Spark jobs in the background and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.
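
The idea of regulating batch jobs around a black-box tenant can be illustrated with a small sketch. The abstract does not describe Scavenger's actual control algorithm, so everything below is an assumption made for illustration: the class name `HeadroomThrottle`, the sliding window, the safety factor `k`, and the rule "capacity minus mean plus k standard deviations of recent tenant usage" are all hypothetical. The only property taken from the abstract is that the controller sees tenant usage from the outside and needs no profiling or prior knowledge of the workload.

```python
from collections import deque
from statistics import mean, pstdev


class HeadroomThrottle:
    """Illustrative controller (not the paper's algorithm): cap batch usage
    at whatever capacity the tenant is unlikely to need."""

    def __init__(self, capacity, window=5, k=1.0):
        self.capacity = capacity             # total resource on the host, e.g. CPU cores
        self.k = k                           # hypothetical safety factor on tenant variability
        self.samples = deque(maxlen=window)  # most recent tenant usage observations

    def observe(self, tenant_usage):
        """Record one externally measured tenant usage sample."""
        self.samples.append(tenant_usage)

    def batch_limit(self):
        """Return the batch cap: capacity minus (mean + k * stddev) of tenant usage."""
        if not self.samples:
            return 0.0  # stay conservative until the tenant has been observed
        reserve = mean(self.samples) + self.k * pstdev(self.samples)
        # Clamp so the cap never goes negative or above the host capacity.
        return max(0.0, min(self.capacity, self.capacity - reserve))
```

In use, a monitoring loop would call `observe()` with each new tenant measurement and apply `batch_limit()` to the batch containers through whatever enforcement mechanism the host provides. A steadier tenant leaves more room for batch jobs, while a bursty tenant widens the reserve and shrinks the batch cap. The same shape could be applied separately to processor, memory, and network limits.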

Published In

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
November 2019
503 pages
ISBN: 9781450369732
DOI: 10.1145/3357223
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud computing
  2. background workload
  3. resource utilization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '19
SoCC '19: ACM Symposium on Cloud Computing
November 20 - 23, 2019
Santa Cruz, CA, USA

Acceptance Rates

SoCC '19 Paper Acceptance Rate 39 of 157 submissions, 25%;
Overall Acceptance Rate 169 of 722 submissions, 23%

Cited By

  • (2024) Efficient Resource Management in Cloud Environments: A Modified Feeding Birds Algorithm for VM Consolidation. Mathematics 12:12 (1845). DOI: 10.3390/math12121845. Online publication date: 13-Jun-2024
  • (2024) KACE: Kernel-Aware Colocation for Efficient GPU Spatial Sharing. Proceedings of the 2024 ACM Symposium on Cloud Computing (460-469). DOI: 10.1145/3698038.3698555. Online publication date: 20-Nov-2024
  • (2024) TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (421-437). DOI: 10.1145/3694715.3695967. Online publication date: 4-Nov-2024
  • (2024) PREACT: Predictive Resource Allocation for Bursty Workloads in a Co-located Data Center. Proceedings of the 53rd International Conference on Parallel Processing (722-731). DOI: 10.1145/3673038.3673135. Online publication date: 12-Aug-2024
  • (2024) A Survey on Spatio-Temporal Big Data Analytics Ecosystem: Resource Management, Processing Platform, and Applications. IEEE Transactions on Big Data 10:2 (174-193). DOI: 10.1109/TBDATA.2023.3342619. Online publication date: Apr-2024
  • (2024) Running Serverless Function on Resource Fragments in Data Center. Algorithms and Architectures for Parallel Processing (443-462). DOI: 10.1007/978-981-97-0798-0_26. Online publication date: 1-Mar-2024
  • (2023) Bridging Resource Prediction and System Management: A Case Study in Cloud Systems. 2023 19th International Conference on Network and Service Management (CNSM) (1-5). DOI: 10.23919/CNSM59352.2023.10327893. Online publication date: 30-Oct-2023
  • (2023) Beacons: An End-to-End Compiler Framework for Predicting and Utilizing Dynamic Loop Characteristics. Proceedings of the ACM on Programming Languages 7:OOPSLA2 (173-203). DOI: 10.1145/3622803. Online publication date: 16-Oct-2023
  • (2023) On Optimizing Traffic Scheduling for Multi-replica Containerized Microservices. Proceedings of the 52nd International Conference on Parallel Processing (358-368). DOI: 10.1145/3605573.3605646. Online publication date: 7-Aug-2023
  • (2023) Understanding and Optimizing Workloads for Unified Resource Management in Large Cloud Platforms. Proceedings of the Eighteenth European Conference on Computer Systems (416-432). DOI: 10.1145/3552326.3587437. Online publication date: 8-May-2023
