``Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing''
James S. Plank, Youngbae Kim, and Jack Dongarra
Journal of Parallel and Distributed Computing, 43, September, 1997, pp. 125-138.
The original JPDC submission is available via anonymous ftp to cs.utk.edu in
pub/plank/papers/ADCKP.ps.
The precursor to this paper was published in FTCS-25.
Abstract
Networks of workstations (NOWs) offer a cost-effective platform
for high-performance, long-running parallel computations.
However, these computations must be able to tolerate the
changing and often faulty nature of NOW environments.
We present high-performance implementations of several
fault-tolerant algorithms for
distributed scientific computing.
Fault tolerance is based on diskless checkpointing,
a paradigm that uses processor redundancy rather than
stable storage as the fault-tolerant medium.
These algorithms are able to run on clusters of workstations
that change over time due to failure, load, or availability.
As long as there are at least n processors in the
cluster, and failures occur singly,
the computation will complete in an efficient manner.
We discuss the details of how the algorithms are tuned for
fault tolerance and present performance results on a PVM
network of Sun workstations connected by a fast, switched
Ethernet.
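
The idea of using processor redundancy instead of stable storage can be
illustrated with a small sketch. The abstract does not say how the
redundant information is encoded, so the sketch assumes the simplest
option: one extra processor holds the bytewise XOR (parity) of the other
processors' in-memory checkpoints. The function names (xor_bytes,
encode_parity, recover) are hypothetical and chosen for illustration;
they are not taken from the paper or from PVM.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(states: List[bytes]) -> bytes:
    """Parity checkpoint kept in the memory of one extra processor
    (assumed encoding: XOR of all application processors' states)."""
    return reduce(xor_bytes, states)


def recover(states: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing state (marked None) from the
    surviving states and the parity checkpoint."""
    missing = [i for i, s in enumerate(states) if s is None]
    if len(missing) != 1:
        raise ValueError("this sketch handles exactly one failure at a time")
    survivors = [s for s in states if s is not None]
    # XOR-ing the parity with every surviving state cancels them out,
    # leaving only the lost processor's state.
    return reduce(xor_bytes, survivors, parity)


# Three application processors, each with a two-byte in-memory checkpoint.
states = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = encode_parity(states)

# Processor 1 fails; its state is rebuilt without touching disk.
after_failure = [states[0], None, states[2]]
restored = recover(after_failure, parity)
```

Because a single parity block can reconstruct only one lost state, this
sketch matches the abstract's condition that failures occur singly.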
