
Multi-task learning for straggler avoiding predictive job scheduling

Published: 01 January 2016

Abstract

Parallel processing frameworks (Dean and Ghemawat, 2004) accelerate jobs by breaking them into tasks that execute in parallel. However, slow-running, or straggler, tasks can run up to 8 times slower than the median task on a production cluster (Ananthanarayanan et al., 2013), leading to delayed job completion and inefficient use of resources. Existing straggler mitigation techniques are reactive: they wait until a straggler is detected and then relaunch it, which delays mitigation and wastes resources. We built Wrangler (Yadwadkar et al., 2014), a system that predicts when stragglers will occur and makes scheduling decisions to avoid such situations. To capture node and workload variability, Wrangler builds separate models for every node and workload, which requires the time-consuming collection of substantial training data. In this paper, we propose multi-task learning formulations that share information between the various models, allowing us to use less training data and to bring training time down from 4 hours to 40 minutes. Unlike naive multi-task learning formulations, ours capture the shared structure in our data, improving generalization performance on limited data. Finally, we extend these formulations with group-sparsity-inducing norms to automatically discover the similarities between tasks and improve interpretability.
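
To make the shared-structure idea concrete, the following is a minimal sketch, not the paper's exact objective, of a regularized multi-task formulation in the spirit of Evgeniou and Pontil [18]. Each per-node, per-workload model w_t = w_0 + v_t combines a component w_0 shared across all T tasks with a task-specific offset v_t, and a group-lasso penalty [43] induces the group sparsity described above (shown here, for brevity, over a partition \mathcal{G} of the coordinates of w_0; a grouping over tasks is analogous). The loss \ell, the decomposition, and the trade-off weights \lambda_0, \lambda_1, \lambda_2 are assumptions of this sketch:

    % Hedged sketch: shared component w_0, task-specific offsets v_t,
    % squared-l2 regularizers, and a group-lasso term over groups g of w_0.
    \min_{w_0,\, v_1, \dots, v_T} \;
      \sum_{t=1}^{T} \sum_{i=1}^{n_t}
        \ell\bigl( y_i^{t},\, (w_0 + v_t)^{\top} x_i^{t} \bigr)
      \;+\; \lambda_0 \lVert w_0 \rVert_2^2
      \;+\; \lambda_1 \sum_{t=1}^{T} \lVert v_t \rVert_2^2
      \;+\; \lambda_2 \sum_{g \in \mathcal{G}} \lVert (w_0)_g \rVert_2

Here \ell is a classification loss such as the hinge loss [14], n_t is the number of training examples for task t, and the unsquared group norm in the last term drives entire groups of weights to zero, which is what makes the discovered similarities interpretable.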

References

[1] Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, and Sankaran Raman. Variable sparsity kernel learning. JMLR, 12, 2011.
[2] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, 2010.
[3] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, 2013.
[4] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. GRASS: Trimming stragglers in approximation analytics. In NSDI, 2014.
[5] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 2005.
[6] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning, pages 19-53, 2011.
[7] Francis R. Bach. Consistency of the group lasso and multiple kernel learning. JMLR, 9:1179-1225, 2008.
[8] Jonathan Baxter. A model of inductive bias learning. JAIR, 12, 2000.
[9] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In NIPS, 2011.
[10] Edward Bortnikov, Ari Frank, Eshcar Hillel, and Sriram Rao. Predicting execution bottlenecks in map-reduce clusters. In HotCloud, 2012.
[11] Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In ICML, 1993.
[12] Yanpei Chen, Archana Ganapathi, Rean Griffith, and Randy Katz. The case for evaluating MapReduce performance using workload suites. In MASCOTS, 2011.
[13] Yanpei Chen, Sara Alspaugh, and Randy H. Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. PVLDB, 5(12), 2012.
[14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[15] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[16] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In ASPLOS, 2014.
[17] David L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 2006.
[18] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In KDD, 2004.
[19] Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil, and John Shawe-Taylor. Learning multiple tasks with kernel methods. JMLR, 6(4), 2005.
[20] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 9, 2008.
[21] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP, 2003.
[22] Shekhar Gupta, Christian Fritz, Bob Price, Roger Hoover, Johan de Kleer, and Cees Witteveen. ThroughputScheduler: Learning to schedule on heterogeneous Hadoop clusters. In ICAC, 2013.
[23] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[24] Laurent Jacob, Jean-Philippe Vert, and Francis R. Bach. Clustered multi-task learning: A convex formulation. In NIPS, 2009.
[25] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. lp-norm multiple kernel learning. JMLR, 12:953-997, 2011.
[26] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. SkewTune: Mitigating skew in MapReduce applications. In SIGMOD, 2012.
[27] Henry Li. Introducing Windows Azure. Apress, Berkeley, CA, USA, 2009.
[28] Shibin Parameswaran and Kilian Q. Weinberger. Large margin multi-task metric learning. In NIPS, 2010.
[29] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
[30] Ting Kei Pong, Paul Tseng, Shuiwang Ji, and Jieping Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6), 2010.
[31] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
[32] Bernardino Romera-Paredes, Hane Aung, Nadia Bianchi-Berthouze, and Massimiliano Pontil. Multilinear multitask learning. In ICML, 2013.
[33] M. Signoretto, R. Langone, M. Pontil, and J. Suykens. Graph-based regularization for multilinear multitask learning, 2014.
[34] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607-614, 2011.
[35] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, 1996.
[36] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.
[37] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.
[38] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
[39] Christian Widmer, Yasemin Altun, and Nora C. Toussaint. Multitask multiple kernel learning (MT-MKL), 2010.
[40] Kishan Wimalawarne, Masashi Sugiyama, and Ryota Tomioka. Multitask learning meets tensor factorization: Task imputation via convex optimization. In NIPS, 2014.
[41] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. JMLR, 8, 2007.
[42] Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, and Randy Katz. Wrangler: Predictable and faster jobs using fewer resources. In SoCC, pages 26:1-26:14, 2014.
[43] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49-67, 2006.
[44] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, 2008.
[45] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.

Published In

The Journal of Machine Learning Research, Volume 17, Issue 1 (January 2016), 8391 pages
ISSN: 1532-4435, EISSN: 1533-7928
Publisher: JMLR.org

    Cited By

• (2023) Impact of Redundancy on Resilience in Distributed Optimization and Learning. In Proceedings of the 24th International Conference on Distributed Computing and Networking, pages 80-89. https://doi.org/10.1145/3571306.3571393. Online publication date: 4 Jan 2023.
• (2023) Transition Waste Optimization for Coded Elastic Computing. IEEE Transactions on Information Theory, 69(7):4442-4465. https://doi.org/10.1109/TIT.2023.3247860. Online publication date: 1 Jul 2023.
• (2022) Enabling emerging edge applications through a 5G control plane intervention. In Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, pages 386-400. https://doi.org/10.1145/3555050.3569130. Online publication date: 30 Nov 2022.
• (2022) Generalized Lagrange Coded Computing: A Flexible Computation-Communication Tradeoff. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 832-837. https://doi.org/10.1109/ISIT50566.2022.9834535. Online publication date: 26 Jun 2022.
• (2021) Redundancy techniques for straggler mitigation in distributed optimization and learning. The Journal of Machine Learning Research, 20(1):2619-2665. https://doi.org/10.5555/3322706.3362013. Online publication date: 9 Mar 2021.
• (2021) Collaborative Learning Based Straggler Prevention in Large-Scale Distributed Computing Framework. Security and Communication Networks, 2021. https://doi.org/10.1155/2021/8340925. Online publication date: 1 Jan 2021.
• (2021) SmartHarvest. In Proceedings of the Sixteenth European Conference on Computer Systems, pages 1-16. https://doi.org/10.1145/3447786.3456225. Online publication date: 21 Apr 2021.
• (2020) Fundamental Limits of Approximate Gradient Coding. ACM SIGMETRICS Performance Evaluation Review, 48(1):21-22. https://doi.org/10.1145/3410048.3410061. Online publication date: 9 Jul 2020.
• (2020) Orchestrating the Development Lifecycle of Machine Learning-based IoT Applications. ACM Computing Surveys, 53(4):1-47. https://doi.org/10.1145/3398020. Online publication date: 3 Aug 2020.
• (2020) Fundamental Limits of Approximate Gradient Coding. In Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, pages 21-22. https://doi.org/10.1145/3393691.3394188. Online publication date: 8 Jun 2020.
