DOI: 10.1145/3295500.3356158
Research article · Public Access

Addressing data resiliency for staging based scientific workflows

Published: 17 November 2019

Abstract

As applications move towards extreme scales, data-related challenges are becoming significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data processing have been proposed to address these challenges. Increasing scale is also expected to result in an increase in the rate of silent data corruption errors, which will impact both the correctness and performance of applications. Furthermore, this impact is amplified in the case of in-situ workflows due to the dataflow between the component applications of the workflow. While existing research has explored silent error detection at the application level, silent error detection for workflows remains an open challenge. This paper addresses silent error detection for extreme scale in-situ workflows. The presented approach leverages idle computation resources in data staging to enable timely detection of and recovery from silent data corruption, effectively reducing the propagation of corrupted data and the end-to-end workflow execution time in the presence of silent errors. As an illustration of this approach, we use a spatial outlier detection approach in staging to detect errors introduced during data transfer and storage. We also provide a CPU-GPU hybrid staging framework for error detection to achieve faster error identification. We have implemented our approach within the DataSpaces staging service, and evaluated it using both synthetic and real workflows on a Cray XK7 system (Titan) at different scales. We demonstrate that, in the presence of silent errors, enabling error detection on staged data alongside a checkpoint/restart scheme improves the total in-situ workflow execution time by up to 22% in comparison with using checkpoint/restart alone.
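The abstract names spatial outlier detection on staged data as the error-detection mechanism but does not specify it; as a rough illustration only (the function name, stencil size, and threshold below are our own assumptions, not the paper's implementation), a neighborhood-median test with a MAD-based scale can flag a single corrupted value in an otherwise smooth staged field:

```python
import numpy as np

def detect_spatial_outliers(field, window=1, threshold=8.0):
    """Flag grid points that deviate sharply from their spatial neighborhood.

    field:     2-D array of simulation data held in staging.
    window:    neighborhood radius (window=1 -> 3x3 stencil, center excluded).
    threshold: number of robust standard deviations (estimated via the median
               absolute deviation) beyond which a point is flagged as a
               suspected silent corruption.
    """
    padded = np.pad(field, window, mode="edge")
    n = 2 * window + 1
    # Collect all shifted views of the neighborhood, excluding the center.
    neigh = []
    for di in range(n):
        for dj in range(n):
            if di == window and dj == window:
                continue
            neigh.append(padded[di:di + field.shape[0], dj:dj + field.shape[1]])
    neigh = np.stack(neigh)                       # shape: (n*n - 1, H, W)
    # Deviation of each point from the median of its neighbors.
    center_dev = field - np.median(neigh, axis=0)
    # Robust global scale: MAD scaled to a std-dev equivalent for Gaussian data.
    mad = np.median(np.abs(center_dev - np.median(center_dev)))
    scale = 1.4826 * mad + 1e-12
    return np.abs(center_dev) > threshold * scale

# A smooth 64x64 field with one bit-flip-like spike injected:
x = np.linspace(0.0, 1.0, 64)
field = np.cos(2 * np.pi * x)[:, None] * np.sin(2 * np.pi * x)[None, :]
field[20, 33] += 1e3                              # simulated silent corruption
mask = detect_spatial_outliers(field)
print(np.argwhere(mask))                          # flags the injected spike at (20, 33)
```

Because a smooth field deviates only slightly from its local neighborhood median while a bit flip produces a large isolated spike, the single corrupted point stands out by orders of magnitude; the robust (median-based) statistics keep the spike itself from inflating the detection threshold. The paper's actual detector and its CPU-GPU split may differ substantially from this sketch.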


Cited By

  • (2023) Dynamic Data-Driven Application Systems for Reservoir Simulation-Based Optimization: Lessons Learned and Future Trends. Handbook of Dynamic Data Driven Applications Systems, 10.1007/978-3-031-27986-7_11, pp. 287-330. Online publication date: 6-Sep-2023.
  • (2021) Bootstrapping in-situ workflow auto-tuning via combining performance models of component applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1145/3458817.3476197, pp. 1-15. Online publication date: 14-Nov-2021.
  • (2021) BAASH. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1145/3458817.3476155, pp. 1-18. Online publication date: 14-Nov-2021.
  • (2021) RISE: Reducing I/O Contention in Staging-based Extreme-Scale In-situ Workflows. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 10.1109/Cluster48925.2021.00021, pp. 146-156. Online publication date: Sep-2021.
  • (2020) Scalable Crash Consistency for Staging-based In-situ Scientific Workflows. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 10.1109/IPDPSW50202.2020.00068, pp. 340-348. Online publication date: May-2020.

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019, 1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation: IEEE CS

Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. data staging
    2. error detection
    3. fault tolerance
    4. in-situ workflows
    5. silent data corruption


    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


    Article Metrics

    • Downloads (last 12 months): 73
    • Downloads (last 6 weeks): 13
    Reflects downloads up to 10 Nov 2024.
