``Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing''

James S. Plank, Youngbae Kim, and Jack Dongarra

Journal of Parallel and Distributed Computing, 43(2), June, 1997, pp. 125-138.

The original JPDC submission is available via anonymous ftp to cs.utk.edu in pub/plank/papers/ADCKP.ps.

A precursor to this paper was published in FTCS-25.

Abstract

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
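
The abstract does not say how processor redundancy is encoded. As a rough illustration only, the sketch below assumes one common scheme: a spare processor holds the bitwise XOR (parity) of the other processors' checkpointed states, so any single lost state can be rebuilt from the survivors. The function names and the sample byte strings are hypothetical and are not taken from the paper.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bitwise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def take_checkpoint(states):
    """Assumed scheme: a spare processor stores the parity of all states."""
    return xor_blocks(states)

def recover(surviving_states, parity):
    """Rebuild the single missing state from the survivors and the parity."""
    return xor_blocks(surviving_states + [parity])

# Hypothetical in-memory states of three application processors.
states = [b"proc0 data", b"proc1 data", b"proc2 data"]
parity = take_checkpoint(states)

# Simulate the loss of processor 1 and rebuild its state.
lost = 1
survivors = [s for i, s in enumerate(states) if i != lost]
recovered = recover(survivors, parity)
assert recovered == states[lost]
```

A single parity block tolerates only one failure between checkpoints, which is consistent with the abstract's condition that failures occur singly.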


Citation Information

Plain Text:

author          J. S. Plank and Y. Kim and J. Dongarra
title           Fault Tolerant Matrix Operations for Networks of
                Workstations Using Diskless Checkpointing
journal         Journal of Parallel and Distributed Computing
volume          43
number          2
pages           125--138
month           June
year            1997

Bibtex:

@ARTICLE{pkd:97:ftm,
        author = "J. S. Plank and Y. Kim and J. Dongarra",
        title = "Fault Tolerant Matrix Operations for Networks of
                Workstations Using Diskless Checkpointing",
        journal = "Journal of Parallel and Distributed Computing",
        volume = "43",
        number = "2",
        pages = "125--138",
        month = "June",
        year = "1997"
}