``Fault Tolerant Matrix Operations Using Checksum and Reverse Computation''

Youngbae Kim, James S. Plank, and Jack Dongarra

6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/Frontiers96.ps and pub/plank/papers/Frontiers96.pdf.

Abstract

Recently, a new approach based on diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this technique, the matrix operations become resilient to any single processor failure with very low overhead. However, this technique has limitations when applied to certain matrix operations like matrix multiplication and Hessenberg reduction. Moreover, it cannot be applied to the more efficient ``right-looking'' variations of standard matrix factorizations.

In this paper, we present an alternative technique, based on checksum and reverse computation, that enables the above matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
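The abstract names the checksum idea without showing it. As a rough illustration only, and not the paper's implementation, the following Python sketch shows why a checksum block can restore the data of one failed processor during matrix multiplication. The row-block layout, the three "processors", and the helper names (block_checksum, recover_block, matmul) are all assumptions made for this example. Reverse computation, which the paper pairs with the checksum, is not shown here.

```python
# Illustrative sketch only: layout and helper names are hypothetical,
# not taken from the paper.

def block_checksum(blocks):
    """Element-wise sum of equally sized blocks (each a list of rows)."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(b[i][j] for b in blocks) for j in range(cols)]
            for i in range(rows)]

def recover_block(checksum, surviving):
    """Rebuild the single missing block: checksum minus the survivors' sum.

    Assumes at least one surviving block and exactly one lost block.
    """
    partial = block_checksum(surviving)
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(checksum, partial)]

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Assumed layout: three processors each hold one 2x2 row block of A,
# and every processor holds a copy of B.
a_blocks = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
b = [[1, 0], [2, 1]]

# A fourth (checksum) processor holds the sum of the blocks of A.
a_sum = block_checksum(a_blocks)

# Every processor, including the checksum processor, multiplies its block by B.
c_blocks = [matmul(blk, b) for blk in a_blocks]
c_sum = matmul(a_sum, b)

# Simulate the loss of processor 1 and rebuild its block of C.
lost = 1
survivors = [blk for k, blk in enumerate(c_blocks) if k != lost]
recovered = recover_block(c_sum, survivors)
```

Because matrix multiplication is linear, the checksum processor's result equals the sum of the other processors' results, so the lost block is the checksum minus the surviving blocks. Integer entries are used to keep the comparison exact; with floating-point data the recovered block would match only up to rounding error.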

Postscript of the paper

PDF of the paper

Citation Information

Plain Text:

author          Y. Kim and J. S. Plank and J. Dongarra
title           Fault Tolerant Matrix Operations using Checksum and
                Reverse Computation
booktitle       6th Symposium on the Frontiers of Massively Parallel
                Computation
month           October
pages           70-77
address         Annapolis, MD
year            1996
where           http://web.eecs.utk.edu/~jplank/plank/papers/Frontiers96.html

Bibtex:

@INPROCEEDINGS{kpd:96:ftm,
        author = "Y. Kim and J. S. Plank and J. Dongarra",
        title = "Fault Tolerant Matrix Operations using Checksum and
                Reverse Computation",
        booktitle = "6th Symposium on the Frontiers of Massively Parallel
                Computation",
        month = "October",
        pages = "70-77",
        address = "Annapolis, MD",
        year = "1996",
        where = "http://web.eecs.utk.edu/~jplank/plank/papers/Frontiers96.html"
}