Toward Resilient Applications for Extreme-Scale Systems Part III of IV

Abstract

As leadership-class computing systems increase in complexity and transistor feature sizes decrease, application codes are less and less able to treat a system as a reliable digital machine. The high performance computing community has grown increasingly concerned that applications will have to manage resilience issues beyond the current practice of global checkpoint/restart, which is expensive at scale and cannot correct all types of errors. We discuss alternatives in software and numerical algorithms that can improve the resiliency of applications and manage a variety of faults anticipated in future extreme-scale computing systems.

Organizers

Keita Teranishi, Sandia National Laboratories, USA
Mark Hoemmen, Sandia National Laboratories, USA
Jaideep Ray, Sandia National Laboratories, USA
Michael A. Heroux, Sandia National Laboratories, USA

Part IV

Wednesday, February 19
MS9
10:35 AM - 12:15 PM Room: Salon F

11:50-12:10 Towards an Unified ABFT Approach for Resilient Dense Linear Algebra, Piotr Luszczek, University of Tennessee, Knoxville, USA