Good Things Come to Those Who Wait: Optimizing Job Waiting in the Cloud

Authors:

Pradeep Ambati,

Prashant Shenoy

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

Pages 229 - 242

https://doi.org/10.1145/3472883.3487007

Published: 01 November 2021

Abstract

Cloud-enabled schedulers execute jobs on either fixed resources or resources acquired on demand from cloud platforms. Thus, these schedulers must define not only a scheduling policy, which selects which jobs run when fixed resources become available, but also a waiting policy, which selects which jobs wait for fixed resources when none are available rather than run on on-demand resources. As with scheduling policies, optimizing waiting policies requires a priori knowledge of job runtime. Unfortunately, prior work has shown that accurately predicting job runtime is challenging. In this paper, we show that optimizing job waiting in the cloud is possible without accurate job runtime predictions. To do so, we i) speculatively execute jobs on on-demand resources for a short time, and at a small cost, to learn more about job runtime, and ii) develop an ML model that predicts wait time from cluster state, which is more accurate and has less overhead than prior approaches that rely on job runtime predictions. We evaluate our approach on a year-long batch workload consisting of 14 million jobs, and show that it yields a cost and an average wait time within 4% and 13%, respectively, of the optimal.
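The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the policy, so the probe length, the wait threshold, the `ClusterState` fields, and the `toy_wait_predictor` function are all hypothetical, and the toy predictor merely stands in for a trained ML model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClusterState:
    """Hypothetical snapshot of the fixed cluster when a job arrives."""
    queued_jobs: int
    busy_nodes: int
    total_nodes: int


@dataclass
class Decision:
    action: str               # "finished_on_demand", "wait", or "run_on_demand"
    on_demand_seconds: float  # on-demand time consumed by the probe


def decide(
    true_runtime_s: float,
    state: ClusterState,
    predict_wait_s: Callable[[ClusterState], float],
    probe_s: float = 60.0,     # assumed probe length, not from the paper
    max_wait_s: float = 600.0, # assumed wait threshold, not from the paper
) -> Decision:
    """Probe the job on on-demand resources, then choose whether it waits.

    true_runtime_s stands in for actually running the job; a real scheduler
    would only observe whether the job finished within the probe window.
    """
    # Step i: speculative execution. Short jobs finish inside the probe,
    # so they never wait and consume little on-demand time.
    if true_runtime_s <= probe_s:
        return Decision("finished_on_demand", true_runtime_s)

    # Step ii: the job outlived the probe, so it is known to run longer than
    # probe_s. Predict the wait from cluster state, not from job runtime.
    if predict_wait_s(state) <= max_wait_s:
        return Decision("wait", probe_s)
    return Decision("run_on_demand", probe_s)


def toy_wait_predictor(state: ClusterState) -> float:
    """Placeholder for a trained model: wait grows with queue length and load."""
    utilization = state.busy_nodes / state.total_nodes
    return 30.0 * state.queued_jobs * utilization
```

In this sketch, a 30-second job finishes during the probe, while an hour-long job waits for fixed resources only when the predicted wait is at most `max_wait_s`. The point of the design is that neither step needs a runtime prediction made before the job starts.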

Supplementary Material

Presentation video: Day2_5-2.mp4 (MP4, 328.08 MB)

Cited By

N. Bashir, V. Gohil, A. Subramanya, M. Shahrad, D. Irwin, E. Olivetti, and C. Delimitrou. 2024. The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling. In Proceedings of the 2024 ACM Symposium on Cloud Computing. 542--551. https://dl.acm.org/doi/10.1145/3698038.3698542. Online publication date: 20-Nov-2024.

R. Bostandoost, A. Lechowicz, W. Hanafy, N. Bashir, P. Shenoy, and M. Hajiesmaili. 2024. LACS: Learning-Augmented Algorithms for Carbon-Aware Resource Scaling with Uncertain Demand. In Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems. 27--45. https://dl.acm.org/doi/10.1145/3632775.3661942. Online publication date: 4-Jun-2024.

Index Terms

Good Things Come to Those Who Wait: Optimizing Job Waiting in the Cloud
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing

Information & Contributors

Information

Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

November 2021

685 pages

ISBN: 9781450386388

DOI: 10.1145/3472883

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

SoCC '21

SoCC '21: ACM Symposium on Cloud Computing

November 1 - 4, 2021

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics

Total Citations: 2
Total Downloads: 556
Downloads (Last 12 months): 131
Downloads (Last 6 weeks): 15

Reflects downloads up to 16 Nov 2024
