``Fault Tolerant Matrix Operations Using Checksum and Reverse Computation''

Youngbae Kim, James S. Plank, and Jack Dongarra

6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/Frontiers96.ps and pub/plank/papers/Frontiers96.pdf.

Abstract

Recently, a new approach based on diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this technique, the matrix operations become resilient to any single processor failure with very low overhead. However, this technique has limitations when applied to certain matrix operations like matrix multiplication and Hessenberg reduction. Moreover, it cannot be applied to the more efficient ``right-looking'' variations of standard matrix factorizations.

In this paper, we present an alternative technique, based on checksum and reverse computation, that enables the above matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
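As a rough illustration of the checksum idea only, the following minimal sketch shows how the data of a single failed processor can be rebuilt from the surviving processors plus one checksum block. It is not the paper's implementation: the function names `make_checksum` and `recover_block`, the use of plain integer lists for each processor's block, and the simple additive checksum are all assumptions made for this example, and the reverse-computation step that returns the survivors to a state consistent with the checksum is not shown.

```python
def make_checksum(blocks):
    """Elementwise sum of the per-processor blocks (hypothetical checksum block)."""
    width = len(blocks[0])
    return [sum(b[j] for b in blocks) for j in range(width)]


def recover_block(blocks, checksum, failed):
    """Rebuild the failed processor's block from the survivors and the checksum.

    Assumes exactly one failure and that the survivors' blocks are in the
    same state they were in when the checksum was computed.
    """
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    return [checksum[j] - sum(b[j] for b in survivors)
            for j in range(len(checksum))]


# Three simulated processors, each holding one block of a distributed matrix.
blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = make_checksum(blocks)

# Simulate the loss of processor 1, then rebuild its block.
blocks[1] = None
recovered = recover_block(blocks, checksum, failed=1)
blocks[1] = recovered
```

Integers are used here so that the subtraction is exact; with floating-point data the recovered block would generally match the lost one only up to rounding error.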

Postscript of the paper

PDF of the paper


Citation Information