research-article

Failure prediction and localization in large scientific workflows

Authors:

Karan Vahi

WORKS '11: Proceedings of the 6th workshop on Workflows in support of large-scale science

Pages 107 - 116

https://doi.org/10.1145/2110497.2110510

Published: 14 November 2011

Abstract

Scientific workflows provide a portable representation for scientific applications' coordinated input, output, and execution management for highly parallel executions of interdependent computations, as well as support for sharing and validating the results. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. Real-time execution monitoring provides a foundation for improving the transparency and resilience of the workflows in the face of stochastic and systematic faults. Building on previous work on early detection of these failure scenarios, we describe methods for guiding remediation to stochastic errors through predictions of the impact on application performance. To complement this analysis, we also describe techniques for isolating systematic sources of failures. We evaluate our methods on a representative sample of large real-world workflows.

Index Terms

Failure prediction and localization in large scientific workflows
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Published In

WORKS '11: Proceedings of the 6th workshop on Workflows in support of large-scale science

November 2011

154 pages

ISBN: 9781450311007

DOI: 10.1145/2110497

General Chairs: Ian Taylor (Cardiff University, UK), Johan Montagnat (CNRS, France)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsor: SIGARCH

November 14, 2011

Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 30 of 54 submissions, 56%

Bibliometrics

Total Citations: 16
Total Downloads: 208
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Nov 2024

Cited By

Dongarra J, Tourancheau B, Deelman E, Mandal A, Jiang M, Sakellariou R (2019). The role of machine learning in scientific workflows. International Journal of High Performance Computing Applications 33:6 (1128-1139). DOI: 10.1177/1094342019852127. Online publication date: 1-Nov-2019
https://dl.acm.org/doi/10.1177/1094342019852127
Khaldi M, Rebbah M, Meftah B, Smail O (2019). Fault tolerance for a scientific workflow system in a Cloud computing environment. International Journal of Computers and Applications 42:7 (705-714). DOI: 10.1080/1206212X.2019.1647651. Online publication date: 30-Jul-2019
https://doi.org/10.1080/1206212X.2019.1647651
Deelman E, Peterka T, Altintas I, Carothers C, van Dam K, Moreland K, Parashar M, Ramakrishnan L, Taufer M, Vetter J (2018). The future of scientific workflows. International Journal of High Performance Computing Applications 32:1 (159-175). DOI: 10.5555/3195474.3195477. Online publication date: 1-Jan-2018
https://dl.acm.org/doi/10.5555/3195474.3195477
Singh A, Altintas I, Schram M, Tallent N (2018). Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows. 2018 IEEE International Conference on Big Data (Big Data) (3905-3914). DOI: 10.1109/BigData.2018.8622509. Online publication date: Dec-2018
https://doi.org/10.1109/BigData.2018.8622509
Deelman E, Peterka T, Altintas I, Carothers C, van Dam K, Moreland K, Parashar M, Ramakrishnan L, Taufer M, Vetter J (2017). The future of scientific workflows. The International Journal of High Performance Computing Applications 32:1 (159-175). DOI: 10.1177/1094342017704893. Online publication date: 26-Apr-2017
https://doi.org/10.1177/1094342017704893
Deelman E, Carothers C, Mandal A, Tierney B, Vetter J, Baldin I, Castillo C, Juve G, Król D, Lynch V, Mayer B, Meredith J, Proffen T, Ruth P, Ferreira da Silva R (2017). PANORAMA. International Journal of High Performance Computing Applications 31:1 (4-18). DOI: 10.1177/1094342015594515. Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1177/1094342015594515
Chen W, da Silva R, Deelman E, Fahringer T (2016). Dynamic and Fault-Tolerant Clustering for Scientific Workflows. IEEE Transactions on Cloud Computing 4:1 (49-62). DOI: 10.1109/TCC.2015.2427200. Online publication date: 1-Jan-2016
https://dl.acm.org/doi/10.1109/TCC.2015.2427200
Chen P, Plale B, Balaji P, Xu C (2015). ProvErr. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (525-534). DOI: 10.1109/CCGrid.2015.86. Online publication date: 4-May-2015
https://dl.acm.org/doi/10.1109/CCGrid.2015.86
Mattoso M, Dias J, Ocaña K, Ogasawara E, Costa F, Horta F, Silva V, de Oliveira D (2015). Dynamic steering of HPC scientific workflows. Future Generation Computer Systems 46:C (100-113). DOI: 10.1016/j.future.2014.11.017. Online publication date: 1-May-2015
https://dl.acm.org/doi/10.1016/j.future.2014.11.017
Mattoso M, Ocaña K, Horta F, Dias J, Ogasawara E, Silva V, de Oliveira D, Costa F, Araújo I, Hidders J, Missier P, Sroka J (2013). User-steering of HPC workflows. Proceedings of the 2nd ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (1-6). DOI: 10.1145/2499896.2499900. Online publication date: 23-Jun-2013
https://dl.acm.org/doi/10.1145/2499896.2499900
