
Multi-task learning for straggler avoiding predictive job scheduling

Published: 01 January 2016

Abstract

Parallel processing frameworks (Dean and Ghemawat, 2004) accelerate jobs by breaking them into tasks that execute in parallel. However, slow-running, or straggler, tasks can run up to 8 times slower than the median task on a production cluster (Ananthanarayanan et al., 2013), leading to delayed job completion and inefficient use of resources. Existing straggler mitigation techniques are reactive: they wait until a straggler is detected and then relaunch it, which delays mitigation and wastes resources. We built Wrangler (Yadwadkar et al., 2014), a system that predicts when stragglers will occur and makes scheduling decisions to avoid such situations. To capture node and workload variability, Wrangler builds separate models for every node and workload, which requires the time-consuming collection of substantial training data. In this paper, we propose multi-task learning formulations that share information between the various models, allowing us to use less training data and to bring training time down from 4 hours to 40 minutes. Unlike naive multi-task learning formulations, ours capture the shared structure in our data, improving generalization performance on limited data. Finally, we extend these formulations with group-sparsity-inducing norms to automatically discover the similarities between tasks and improve interpretability.
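
To make the shared-structure idea concrete, the following is a minimal sketch, not the paper's exact objective, of a regularized multi-task formulation in the spirit of Evgeniou and Pontil [18]. Each per-node, per-workload model w_t = w_0 + v_t combines a component w_0 shared across all T tasks with a task-specific offset v_t, and a group-lasso penalty [43] induces the group sparsity described above (shown here, for brevity, over a partition \mathcal{G} of the coordinates of w_0; a grouping over tasks is analogous). The loss \ell, the decomposition, and the trade-off weights \lambda_0, \lambda_1, \lambda_2 are assumptions of this sketch:

    % Hedged sketch: shared component w_0, task-specific offsets v_t,
    % squared-l2 regularizers, and a group-lasso term over groups g of w_0.
    \min_{w_0,\, v_1, \dots, v_T} \;
      \sum_{t=1}^{T} \sum_{i=1}^{n_t}
        \ell\bigl( y_i^{t},\, (w_0 + v_t)^{\top} x_i^{t} \bigr)
      \;+\; \lambda_0 \lVert w_0 \rVert_2^2
      \;+\; \lambda_1 \sum_{t=1}^{T} \lVert v_t \rVert_2^2
      \;+\; \lambda_2 \sum_{g \in \mathcal{G}} \lVert (w_0)_g \rVert_2

Here \ell is a classification loss such as the hinge loss [14], n_t is the number of training examples for task t, and the unsquared group norm in the last term drives entire groups of weights to zero, which is what makes the discovered similarities interpretable.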

References

[1] Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, and Sankaran Raman. Variable sparsity kernel learning. JMLR, 12, 2011.
[2] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, 2010.
[3] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, 2013.
[4] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. GRASS: Trimming stragglers in approximation analytics. In NSDI, 2014.
[5] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 2005.
[6] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning, pages 19-53, 2011.
[7] Francis R. Bach. Consistency of the group lasso and multiple kernel learning. JMLR, 9:1179-1225, 2008.
[8] Jonathan Baxter. A model of inductive bias learning. JAIR, 12, 2000.
[9] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In NIPS, 2011.
[10] Edward Bortnikov, Ari Frank, Eshcar Hillel, and Sriram Rao. Predicting execution bottlenecks in map-reduce clusters. In HotCloud, 2012.
[11] Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In ICML, 1993.
[12] Yanpei Chen, Archana Ganapathi, Rean Griffith, and Randy Katz. The case for evaluating MapReduce performance using workload suites. In MASCOTS, 2011.
[13] Yanpei Chen, Sara Alspaugh, and Randy H. Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. PVLDB, 5(12), 2012.
[14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[15] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[16] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In ASPLOS, 2014.
[17] David L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 2006.
[18] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In KDD, 2004.
[19] Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil, and John Shawe-Taylor. Learning multiple tasks with kernel methods. JMLR, 6(4), 2005.
[20] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 9, 2008.
[21] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP, 2003.
[22] Shekhar Gupta, Christian Fritz, Bob Price, Roger Hoover, Johan de Kleer, and Cees Witteveen. ThroughputScheduler: Learning to schedule on heterogeneous Hadoop clusters. In ICAC, 2013.
[23] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[24] Laurent Jacob, Jean-Philippe Vert, and Francis R. Bach. Clustered multi-task learning: A convex formulation. In NIPS, 2009.
[25] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. lp-norm multiple kernel learning. JMLR, 12:953-997, 2011.
[26] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. SkewTune: Mitigating skew in MapReduce applications. In SIGMOD, 2012.
[27] Henry Li. Introducing Windows Azure. Apress, Berkeley, CA, USA, 2009.
[28] Shibin Parameswaran and Kilian Q. Weinberger. Large margin multi-task metric learning. In NIPS, 2010.
[29] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
[30] Ting Kei Pong, Paul Tseng, Shuiwang Ji, and Jieping Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6), 2010.
[31] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
[32] Bernardino Romera-Paredes, Hane Aung, Nadia Bianchi-Berthouze, and Massimiliano Pontil. Multilinear multitask learning. In ICML, 2013.
[33] M. Signoretto, R. Langone, M. Pontil, and J. Suykens. Graph-based regularization for multilinear multitask learning, 2014.
[34] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607-614, 2011.
[35] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, 1996.
[36] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.
[37] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.
[38] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
[39] Christian Widmer, Yasemin Altun, and Nora C. Toussaint. Multitask multiple kernel learning (MT-MKL), 2010.
[40] Kishan Wimalawarne, Masashi Sugiyama, and Ryota Tomioka. Multitask learning meets tensor factorization: Task imputation via convex optimization. In NIPS, 2014.
[41] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. JMLR, 8, 2007.
[42] Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, and Randy Katz. Wrangler: Predictable and faster jobs using fewer resources. In SoCC, pages 26:1-26:14, 2014.
[43] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49-67, 2006.
[44] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, 2008.
[45] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.

Published In

The Journal of Machine Learning Research, Volume 17, Issue 1 (January 2016), 8391 pages
ISSN: 1532-4435, EISSN: 1533-7928
Publisher: JMLR.org

    Cited By

• (2023) Impact of Redundancy on Resilience in Distributed Optimization and Learning. In Proceedings of the 24th International Conference on Distributed Computing and Networking, pages 80-89. https://doi.org/10.1145/3571306.3571393. Online publication date: 4 Jan 2023.
• (2023) Transition Waste Optimization for Coded Elastic Computing. IEEE Transactions on Information Theory, 69(7):4442-4465. https://doi.org/10.1109/TIT.2023.3247860. Online publication date: 1 Jul 2023.
• (2022) Enabling emerging edge applications through a 5G control plane intervention. In Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, pages 386-400. https://doi.org/10.1145/3555050.3569130. Online publication date: 30 Nov 2022.
• (2022) Generalized Lagrange Coded Computing: A Flexible Computation-Communication Tradeoff. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 832-837. https://doi.org/10.1109/ISIT50566.2022.9834535. Online publication date: 26 Jun 2022.
• (2021) Redundancy techniques for straggler mitigation in distributed optimization and learning. The Journal of Machine Learning Research, 20(1):2619-2665. https://doi.org/10.5555/3322706.3362013. Online publication date: 9 Mar 2021.
• (2021) Collaborative Learning Based Straggler Prevention in Large-Scale Distributed Computing Framework. Security and Communication Networks, 2021. https://doi.org/10.1155/2021/8340925. Online publication date: 1 Jan 2021.
• (2021) SmartHarvest. In Proceedings of the Sixteenth European Conference on Computer Systems, pages 1-16. https://doi.org/10.1145/3447786.3456225. Online publication date: 21 Apr 2021.
• (2020) Fundamental Limits of Approximate Gradient Coding. ACM SIGMETRICS Performance Evaluation Review, 48(1):21-22. https://doi.org/10.1145/3410048.3410061. Online publication date: 9 Jul 2020.
• (2020) Orchestrating the Development Lifecycle of Machine Learning-based IoT Applications. ACM Computing Surveys, 53(4):1-47. https://doi.org/10.1145/3398020. Online publication date: 3 Aug 2020.
• (2020) Fundamental Limits of Approximate Gradient Coding. In Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, pages 21-22. https://doi.org/10.1145/3393691.3394188. Online publication date: 8 Jun 2020.
