DOI: 10.1145/2110497.2110510
research-article

Failure prediction and localization in large scientific workflows

Published: 14 November 2011

Abstract

Scientific workflows provide a portable representation for scientific applications' coordinated input, output, and execution management for highly parallel executions of interdependent computations, as well as support for sharing and validating the results. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. Real-time execution monitoring provides a foundation for improving the transparency and resilience of the workflows in the face of stochastic and systematic faults. Building on previous work on early detection of these failure scenarios, we describe methods for guiding remediation to stochastic errors through predictions of the impact on application performance. To complement this analysis, we also describe techniques for isolating systematic sources of failures. We evaluate our methods on a representative sample of large real-world workflows.


    Published In

    WORKS '11: Proceedings of the 6th workshop on Workflows in support of large-scale science
    November 2011
    154 pages
    ISBN:9781450311007
    DOI:10.1145/2110497
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. failure prediction
    2. fault localization
    3. scientific workflows

    Qualifiers

    • Research-article

    Conference

    SC '11
    Acceptance Rates

    Overall Acceptance Rate 30 of 54 submissions, 56%

    Article Metrics

    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 09 Nov 2024

    Cited By

    • (2019) The role of machine learning in scientific workflows. International Journal of High Performance Computing Applications 33:6 (1128-1139). DOI: 10.1177/1094342019852127. Online publication date: 1-Nov-2019
    • (2019) Fault tolerance for a scientific workflow system in a Cloud computing environment. International Journal of Computers and Applications 42:7 (705-714). DOI: 10.1080/1206212X.2019.1647651. Online publication date: 30-Jul-2019
    • (2018) The future of scientific workflows. International Journal of High Performance Computing Applications 32:1 (159-175). DOI: 10.5555/3195474.3195477. Online publication date: 1-Jan-2018
    • (2018) Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows. 2018 IEEE International Conference on Big Data (Big Data) (3905-3914). DOI: 10.1109/BigData.2018.8622509. Online publication date: Dec-2018
    • (2017) The future of scientific workflows. The International Journal of High Performance Computing Applications 32:1 (159-175). DOI: 10.1177/1094342017704893. Online publication date: 26-Apr-2017
    • (2017) PANORAMA. International Journal of High Performance Computing Applications 31:1 (4-18). DOI: 10.1177/1094342015594515. Online publication date: 1-Jan-2017
    • (2016) Dynamic and Fault-Tolerant Clustering for Scientific Workflows. IEEE Transactions on Cloud Computing 4:1 (49-62). DOI: 10.1109/TCC.2015.2427200. Online publication date: 1-Jan-2016
    • (2015) ProvErr. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (525-534). DOI: 10.1109/CCGrid.2015.86. Online publication date: 4-May-2015
    • (2015) Dynamic steering of HPC scientific workflows. Future Generation Computer Systems 46:C (100-113). DOI: 10.1016/j.future.2014.11.017. Online publication date: 1-May-2015
    • (2013) User-steering of HPC workflows. Proceedings of the 2nd ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (1-6). DOI: 10.1145/2499896.2499900. Online publication date: 23-Jun-2013
