DOI:
10.1145/2751504
FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
ACM 2015 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing, Portland, Oregon, USA, 15 June 2015
ISBN:
978-1-4503-3569-0
Published:
15 June 2015
Sponsors:
University of Arizona, SIGARCH

Abstract

It is our great pleasure to welcome you to The 5th Fault Tolerance for HPC at eXtreme Scale (FTXS) 2015 Workshop -- FTXS '15.

This year's FTXS workshop continues its tradition of being a top-quality venue for research and works in progress in the field of fault tolerance and resilience for extreme scale and high performance computing (HPC) systems. The program committee for FTXS 2015 continues the practice of having members from the many varied areas of dependability related to extreme scale computing while also representing academia, industry, government laboratories and government agencies. The PC is also diverse internationally with representatives from the USA, Asia, and Europe.

The call for papers attracted submissions from Asia, Europe, and the United States. The program committee reviewed 15 technical paper submissions and accepted 9, an acceptance rate of 60%.

We also encourage attendees to join the keynote and invited talk presentations. These valuable and insightful talks can and will guide us to a better understanding of the future:

  • Failures in Large-Scale Systems: Insights from the Field, Sudhanva Gurumurthi (AMD Research, Advanced Micro Devices, Inc. and the University of Virginia)

SESSION: Keynote
invited-talk
Failures in Large-Scale Systems: Insights from the Field

The use of highly scaled technologies and large component counts poses significant reliability challenges for large-scale systems. Knowledge of failures that occur in such systems is valuable for driving RAS design decisions for component and system ...

SESSION: Logging and Monitoring
research-article
A Principled Approach to HPC Event Monitoring

As high-performance computing (HPC) systems become larger and more complex, fault tolerance becomes a greater concern. At the same time, the data volume collected to help in understanding and mitigating hardware and software faults and failures also ...

research-article
LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications

This paper presents LogDiver, a tool for the analysis of application-level resiliency in extreme-scale computing systems. The tool has been implemented to handle data generated by system monitoring tools in Blue Waters, the petascale machine in ...

SESSION: Resilient Algorithms and Libraries
research-article
Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices

The hierarchical semi-separable (HSS) matrix factorization has useful characteristics for representing low-rank operators on extreme scale computing systems. To prepare for the higher error rates anticipated with future architectures, this paper ...

research-article
Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors

We propose a software-based approach using dynamic voltage overscaling to reduce the energy consumption of HPC applications. This technique aggressively lowers the supply voltage below nominal voltage, which introduces timing errors, and we use ...

short-paper
Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection

Soft errors are becoming an important issue in computing systems. Near threshold voltage (NTV), reduced circuit sizes, high performance computing (HPC), and high altitude computing all present interesting challenges in this area. Much of the existing ...

short-paper
Evolving the Message Passing Programming Model via a Fault-Tolerant, Object-oriented Transport Layer

In this position paper, we argue for improved fault-tolerance of an MPI code by introducing lightweight virtualization into the MPI interface. In particular, we outline key-value store semantics for MPI send/recv calls, thereby creating a far more ...

SESSION: Other Topics in Fault Tolerance
research-article
How Much SSD Is Useful for Resilience in Supercomputers

We consider the use of non-volatile memories in the form of burst buffers for resilience in supercomputers. Their cost and limited lifetime demand effective use and appropriate provisioning. We develop an analytic model for the behavior of workloads on ...

research-article
The Path to Exascale: Code Optimizations and Hardening Solutions Reliability

Graphics Processing Units are nowadays the most common general-purpose computing accelerators employed in High Performance Computing (HPC) systems. The performance and energy efficiency of such devices enable extremely powerful HPC systems to be built. ...

research-article
Transient Fault Resilient QR Factorization on GPUs

With their inherent capability to exploit parallelism, GPUs have become a popular platform for data-intensive scientific computing applications. This trend is expected to continue as the number of computations required by scientific applications reach ...

Contributors
  • Los Alamos National Laboratory
  • Argonne National Laboratory
  • University of Leeds

Acceptance Rates

FTXS '15 Paper Acceptance Rate: 9 of 15 submissions, 60%
Overall Acceptance Rate: 16 of 25 submissions, 64%
Year        Submitted   Accepted   Rate
FTXS '15    15          9          60%
FTXS '13    10          7          70%
Overall     25          16         64%