``Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing''

James S. Plank, Youngbae Kim, and Jack Dongarra

Journal of Parallel and Distributed Computing, 43, September, 1997, pp. 125-138.

The original JPDC submission is available via anonymous ftp to cs.utk.edu in pub/plank/papers/ADCKP.ps.

The precursor to this paper (published in FTCS-25) can be found here.


Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster and failures occur singly, the computation will complete efficiently. We discuss the details of how the algorithms are tuned for fault tolerance and present performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
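The idea of using processor redundancy instead of stable storage can be illustrated with a small sketch. The abstract does not say which encoding the implementations use, so the XOR parity scheme below is an assumption chosen for illustration, and the function names (xor_bytes, encode_checkpoint, recover) are hypothetical. Each compute processor is modeled as a byte string of equal length, an extra processor holds the bitwise parity of all of them, and a single lost state is rebuilt from the survivors plus the parity.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_checkpoint(states: list[bytes]) -> bytes:
    """Parity held by the (assumed) extra checkpoint processor."""
    return reduce(xor_bytes, states)


def recover(states: list[bytes], parity: bytes, failed: int) -> bytes:
    """Rebuild the state of one failed processor.

    Only the surviving states and the parity are read; the entry at
    index `failed` is treated as lost.
    """
    survivors = [s for i, s in enumerate(states) if i != failed]
    return reduce(xor_bytes, survivors, parity)


# Three compute processors, each with a 4-byte in-memory checkpoint.
states = [bytes([1, 2, 3, 4]), bytes([16, 32, 48, 64]), bytes([170, 187, 204, 221])]
parity = encode_checkpoint(states)

# Processor 1 fails; its state is reconstructed without touching disk.
restored = recover(states, parity, failed=1)
assert restored == states[1]
```

A single parity block tolerates only one failure at a time, which matches the abstract's condition that failures occur singly.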

