DOI: 10.1145/3472883.3487007
research-article
Public Access

Good Things Come to Those Who Wait: Optimizing Job Waiting in the Cloud

Published: 01 November 2021

Abstract

Cloud-enabled schedulers execute jobs on either fixed resources or those acquired on demand from cloud platforms. Thus, these schedulers must define not only a scheduling policy, which selects which jobs run when fixed resources become available, but also a waiting policy, which selects which jobs wait for fixed resources when none are available, rather than run on on-demand resources. As with scheduling policies, optimizing waiting policies requires a priori knowledge of job runtime. Unfortunately, prior work has shown that accurately predicting job runtime is challenging. In this paper, we show that optimizing job waiting in the cloud is possible without accurate job runtime predictions. To do so, we i) speculatively execute jobs on on-demand resources for a short time and at a small cost to learn more about job runtime, and ii) develop an ML model to predict wait time from cluster state, which is more accurate and has less overhead than prior approaches that use job runtime predictions. We evaluate our approach on a year-long batch workload consisting of 14 million jobs, and show that it yields a cost and an average wait time within 4% and 13%, respectively, of the optimal.
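
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the decision rule, the probe length `probe_s`, the threshold `max_wait_s`, the `Decision` type, and the toy wait-time predictor are all assumptions made for illustration. The job's true runtime is passed in only so the sketch can simulate whether the job finishes during the probe; the policy itself never uses it as a prediction.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str               # "finished_on_demand", "wait_for_fixed", or "stay_on_demand"
    on_demand_seconds: float  # on-demand time consumed under this decision


def waiting_policy(
    true_runtime_s: float,
    predict_wait_s: Callable[[dict], float],
    cluster_state: dict,
    probe_s: float = 300.0,      # hypothetical speculative-execution budget
    max_wait_s: float = 3600.0,  # hypothetical longest acceptable wait
) -> Decision:
    # Step i (assumed form): run speculatively on on-demand resources.
    # A job that completes within the probe never has to wait at all.
    if true_runtime_s <= probe_s:
        return Decision("finished_on_demand", true_runtime_s)

    # Step ii (assumed form): the job outlived the probe, so estimate the
    # wait for fixed resources from cluster state, not from job runtime.
    predicted_wait = predict_wait_s(cluster_state)
    if predicted_wait <= max_wait_s:
        # Assumption: the job restarts on fixed resources, so only the
        # probe time was spent on on-demand resources.
        return Decision("wait_for_fixed", probe_s)

    # Otherwise keep running on on-demand resources until completion.
    return Decision("stay_on_demand", true_runtime_s)


def toy_wait_predictor(state: dict) -> float:
    # Stand-in for a trained model: 10 seconds of wait per queued job.
    return 10.0 * state["queued_jobs"]


if __name__ == "__main__":
    print(waiting_policy(60.0, toy_wait_predictor, {"queued_jobs": 5}))
    print(waiting_policy(7200.0, toy_wait_predictor, {"queued_jobs": 5}))
    print(waiting_policy(7200.0, toy_wait_predictor, {"queued_jobs": 1000}))
```

In a real scheduler the predictor would be a model trained on historical cluster state, and the thresholds would be tuned against the relative cost of on-demand resources and the value of lower wait time.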

Supplementary Material

MP4 File (Day2_5-2.mp4)
Presentation video


Cited By

  • (2024) The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling. Proceedings of the 2024 ACM Symposium on Cloud Computing, 542-551. DOI: 10.1145/3698038.3698542. Online publication date: 20-Nov-2024
  • (2024) LACS: Learning-Augmented Algorithms for Carbon-Aware Resource Scaling with Uncertain Demand. Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems, 27-45. DOI: 10.1145/3632775.3661942. Online publication date: 4-Jun-2024


    Published In

    SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
    November 2021
    685 pages
    ISBN:9781450386388
    DOI:10.1145/3472883
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Cloud computing
    2. cost-efficiency
    3. job scheduling

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '21: ACM Symposium on Cloud Computing
    November 1 - 4, 2021
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%


    Article Metrics

    • Downloads (Last 12 months): 131
    • Downloads (Last 6 weeks): 15
    Reflects downloads up to 16 Nov 2024
