Youngbae Kim, James S. Plank, and Jack Dongarra
6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77.
Available via anonymous ftp to cs.utk.edu in pub/plank/papers/Frontiers96.ps and pub/plank/papers/Frontiers96.pdf.
In this paper, we present an alternative technique, based on checksum and reverse computation, that enables the above matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
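The checksum idea can be illustrated with a small sketch for matrix multiplication. This is not the paper's implementation: it is a generic checksum-encoding example that assumes exact integer arithmetic and the loss of a single row, and the helper names (`matmul`, `column_checksum`, `recover_row`) are hypothetical. The reverse-computation part of the technique is not shown.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(rows):
    """Element-wise sum of the given rows (the checksum row)."""
    return [sum(col) for col in zip(*rows)]


def recover_row(surviving_rows, checksum):
    """Rebuild one lost row as the checksum minus the sum of the survivors."""
    if surviving_rows:
        partial = column_checksum(surviving_rows)
    else:
        partial = [0] * len(checksum)
    return [c - p for c, p in zip(checksum, partial)]


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

# Append a checksum row to A; the product then carries a checksum row for C,
# because (sum of rows of A) * B equals the sum of the rows of A * B.
A_encoded = A + [column_checksum(A)]
C_encoded = matmul(A_encoded, B)
C, C_checksum = C_encoded[:-1], C_encoded[-1]

# Simulate losing row 1 of C (for example, the processor holding it fails).
lost = 1
survivors = [row for i, row in enumerate(C) if i != lost]
restored = recover_row(survivors, C_checksum)

# The restored row matches the row computed without any failure.
assert restored == matmul(A, B)[lost]
```

In this sketch the checksum row plays the role of a redundant processor: as long as only one data row is lost, it can be rebuilt from the checksum and the surviving rows without recomputing the product.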
author     Y. Kim and J. S. Plank and J. Dongarra
title      Fault Tolerant Matrix Operations using Checksum and Reverse Computation
booktitle  6th Symposium on the Frontiers of Massively Parallel Computation
month      October
pages      70-77
address    Annapolis, MD
year       1996
where      http://web.eecs.utk.edu/~jplank/plank/papers/Frontiers96.html
@INPROCEEDINGS{kpd:96:ftm,
    author = "Y. Kim and J. S. Plank and J. Dongarra",
    title = "Fault Tolerant Matrix Operations using Checksum and Reverse Computation",
    booktitle = "6th Symposium on the Frontiers of Massively Parallel Computation",
    month = "October",
    pages = "70-77",
    address = "Annapolis, MD",
    year = "1996",
    where = "http://web.eecs.utk.edu/~jplank/plank/papers/Frontiers96.html"
}