"Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing"

Youngbae Kim, James S. Plank, and Jack Dongarra

High Performance Computing on the Information Superhighway, HPC Asia '97, Seoul, Korea, April, 1997, pages 460-465.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/HPCA97.ps.Z.

Abstract

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple checkpointing to enable the matrix operations to tolerate a certain set of multiple processor failures by adding the capacity for multiple checkpointing processors. The results on a network of workstations have shown that this technique improves not only the reliability of the computation but also the performance of checkpointing.
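
The abstract does not spell out the checkpoint layout, so the following is only a minimal sketch of the general idea. It assumes the computing processors are partitioned into groups, and that each group has one checkpointing processor holding the element-wise sum of its group's local data. The grouping, the sum-based encoding, and the names make_checkpoints and recover are illustrative assumptions, not the paper's implementation. Under these assumptions, one failure per group can be repaired from memory alone, so several simultaneous failures are tolerated as long as they fall in different groups.

```python
from typing import List, Optional


def make_checkpoints(data: List[List[int]], group_size: int) -> List[List[int]]:
    """Return one checksum vector per group of `group_size` processors.

    Hypothetical layout: processors 0..group_size-1 form group 0, and so on.
    Each checksum is the element-wise sum of the group's local vectors.
    """
    checkpoints = []
    for start in range(0, len(data), group_size):
        group = data[start:start + group_size]
        checkpoints.append([sum(column) for column in zip(*group)])
    return checkpoints


def recover(data: List[Optional[List[int]]],
            checkpoints: List[List[int]],
            group_size: int,
            failed: int) -> List[int]:
    """Rebuild the failed processor's vector from its group's checksum.

    The lost vector equals the checksum minus the vectors of the surviving
    processors in the same group. A second failure in that group is not
    recoverable with a single checksum, so it raises an error.
    """
    group = failed // group_size
    start = group * group_size
    end = min(start + group_size, len(data))
    survivors = [data[p] for p in range(start, end) if p != failed]
    if any(vec is None for vec in survivors):
        raise ValueError("more than one failure in a group cannot be recovered")
    rebuilt = list(checkpoints[group])
    for vec in survivors:
        rebuilt = [r - x for r, x in zip(rebuilt, vec)]
    return rebuilt


# Four computing processors, two groups of two, one checksum per group.
data: List[Optional[List[int]]] = [[1, 2], [3, 4], [5, 6], [7, 8]]
original = [list(vec) for vec in data]
checkpoints = make_checkpoints(data, group_size=2)

# Simulate two simultaneous failures that land in different groups.
data[0] = None
data[3] = None
data[0] = recover(data, checkpoints, group_size=2, failed=0)
data[3] = recover(data, checkpoints, group_size=2, failed=3)
```

In this sketch, smaller groups mean more checkpointing processors, each summing fewer vectors, which is one plausible reading of why adding checkpointing processors could help both reliability and checkpointing performance. The measured results are those reported in the paper, not something this example demonstrates.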


Citation Information