"Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing"
James S. Plank, Y. Kim, and J. Dongarra,
Journal of Parallel and Distributed Computing, 43, September, 1997, pp. 125-138.
The original JPDC submission is available via anonymous ftp to cs.utk.edu in
The precursor to this paper (published in FTCS-25) can be found
Networks of workstations (NOWs) offer a cost-effective platform
for high-performance, long-running parallel computations.
However, these computations must be able to tolerate the
changing and often faulty nature of NOW environments.
We present high-performance implementations of several
fault-tolerant algorithms for
distributed scientific computing.
The fault tolerance is based on diskless checkpointing,
a paradigm that uses processor redundancy rather than
stable storage as the fault-tolerant medium.
These algorithms are able to run on clusters of workstations
that change over time due to failure, load or availability.
As long as the cluster contains at least n processors
and failures occur one at a time,
the computation completes efficiently.
We discuss how the algorithms are tuned for
fault tolerance and present performance results on a PVM
network of Sun workstations connected by a fast, switched
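
To make the idea of diskless checkpointing concrete, here is a minimal sketch in Python. It assumes a bytewise XOR parity encoding, which is one common choice for this paradigm; the abstract itself does not say which encoding the paper's algorithms use. The function names and the sample states are hypothetical and chosen only for illustration. One extra processor holds the parity of the n application processors' in-memory states, so the state of any single failed processor can be rebuilt without stable storage.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(states: Sequence[bytes]) -> bytes:
    """Checkpoint held in memory by the extra (parity) processor."""
    return reduce(xor_bytes, states)


def recover(survivors: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost state from the surviving states and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory states of n = 3 application processors.
states = [b"block-A0", b"block-B1", b"block-C2"]
parity = parity_checkpoint(states)

# Simulate the loss of processor 1 and rebuild its state.
failed = 1
survivors = [s for i, s in enumerate(states) if i != failed]
restored = recover(survivors, parity)
assert restored == states[failed]
```

The sketch handles only one failure at a time, which matches the abstract's condition that failures occur singly: with a single parity buffer, losing two states at once leaves too little information to rebuild either.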
- Plain Text:
author J. S. Plank and Y. Kim and J. Dongarra
title Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing
journal Journal of Parallel and Distributed Computing
- BibTeX:
author = "J. S. Plank and Y. Kim and J. Dongarra",
title = "Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing",
journal = "Journal of Parallel and Distributed Computing",
volume = "43",
number = "2",
pages = "125--138",
month = "June",
year = "1997"