short-paper

Speculation-aware Cluster Scheduling

Published: 16 September 2015

Abstract

Stragglers are a crucial roadblock to achieving predictable performance in today's clusters. Speculation has been widely adopted to mitigate the impact of stragglers; however, speculation mechanisms are designed and operated independently of job scheduling when, in fact, scheduling a speculative copy of a task directly affects the resources available to other jobs. In this work, based on a simple model and its analysis, we design Hopper, a job scheduler that is speculation-aware, i.e., one that integrates the tradeoffs associated with speculation into job scheduling decisions.
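
The tradeoff described in the abstract can be illustrated with a toy calculation. The sketch below is not Hopper's model or algorithm; it is a minimal illustration under stated assumptions: all copies of a task start at time zero, a task finishes when its fastest copy finishes, every copy occupies one slot, and the task durations are made-up numbers. The names `task_finish`, `job_finish`, and `slots_used` are hypothetical helpers introduced only for this example.

```python
# Toy illustration (assumed model, not the paper's): speculation shortens a
# job but consumes slots that other jobs could have used.

def task_finish(copy_durations, copies):
    """A task finishes when the fastest of its launched copies finishes."""
    return min(copy_durations[:copies])

def job_finish(tasks, copies_per_task):
    """A job finishes when its slowest task finishes (all tasks run in parallel)."""
    return max(task_finish(d, c) for d, c in zip(tasks, copies_per_task))

def slots_used(copies_per_task):
    """Each launched copy, original or speculative, occupies one slot."""
    return sum(copies_per_task)

# Hypothetical durations: tasks[i][k] is how long copy k of task i would take.
# The first copy of the third task is a straggler (40 time units).
tasks = [[10, 12], [11, 9], [40, 10]]

no_speculation = [1, 1, 1]    # one copy per task
with_speculation = [1, 1, 2]  # one extra speculative copy for the straggler

print(job_finish(tasks, no_speculation), slots_used(no_speculation))      # 40 3
print(job_finish(tasks, with_speculation), slots_used(with_speculation))  # 11 4
```

In this toy setting the speculative copy cuts the job's completion time from 40 to 11 time units, but it takes a fourth slot that would otherwise be available to another job. A speculation-aware scheduler has to weigh exactly this kind of gain against the slots it gives up.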


Published In

ACM SIGMETRICS Performance Evaluation Review  Volume 43, Issue 2
September 2015
79 pages
ISSN:0163-5999
DOI:10.1145/2825236

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 September 2015
Published in SIGMETRICS Volume 43, Issue 2

Qualifiers

  • Short-paper
