Youngbae Kim, James S. Plank, and Jack Dongarra
6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77.
Available via anonymous ftp to cs.utk.edu in pub/plank/papers/Frontiers96.ps and pub/plank/papers/Frontiers96.pdf.
In this paper, we present an alternative technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
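As a rough illustration of the checksum idea mentioned in the abstract, the sketch below keeps an extra block equal to the element-wise sum of the data blocks and uses it to rebuild one lost block. It is not the paper's implementation: the block layout, the helper names (`matmul`, `block_sum`, `recover`), and the choice of matrix multiplication as the example are all assumptions made for this example, and the reverse-computation step is not shown.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not taken from the paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def block_sum(blocks):
    """Element-wise sum of equally shaped blocks (the checksum block)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*blocks)]

def recover(checksum, survivors):
    """Rebuild one lost block as checksum minus the sum of the survivors."""
    partial = block_sum(survivors)
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(checksum, partial)]

# Three hypothetical "processors", each holding one row block of A.
a_blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]
b = [[1, 0], [2, 1]]

# A hypothetical extra processor holds the checksum of the A blocks.
a_checksum = block_sum(a_blocks)

# Every processor, including the checksum one, multiplies its block by B.
c_blocks = [matmul(blk, b) for blk in a_blocks]
c_checksum = matmul(a_checksum, b)

# Multiplication is linear, so the checksum relation still holds for C.
assert c_checksum == block_sum(c_blocks)

# Simulate losing processor 1 and rebuild its block from the rest.
lost = 1
survivors = [blk for i, blk in enumerate(c_blocks) if i != lost]
restored = recover(c_checksum, survivors)
assert restored == c_blocks[lost]
```

Integer entries are used so the equality checks are exact; with floating-point data the recovered block would match only up to rounding error.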
author Y. Kim and J. S. Plank and J. Dongarra
title Fault Tolerant Matrix Operations using Checksum and
Reverse Computation
booktitle 6th Symposium on the Frontiers of Massively Parallel
Computation
month October
pages 70-77
address Annapolis, MD
year 1996
where http://web.eecs.utk.edu/~jplank/plank/papers/Frontiers96.html
@INPROCEEDINGS{kpd:96:ftm,
author = "Y. Kim and J. S. Plank and J. Dongarra",
title = "Fault Tolerant Matrix Operations using Checksum and
Reverse Computation",
booktitle = "6th Symposium on the Frontiers of Massively Parallel
Computation",
month = "October",
pages = "70-77",
address = "Annapolis, MD",
year = "1996",
where = "http://web.eecs.utk.edu/~jplank/plank/papers/Frontiers96.html"
}