``Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing''
James S. Plank,
Youngbae Kim,
and
Jack Dongarra
Journal of Parallel and Distributed Computing, 43(2), June 1997, pp. 125-138.
The original JPDC submission is available via anonymous ftp to cs.utk.edu in
pub/plank/papers/ADCKP.ps.
The precursor to this paper was published in FTCS-25.
Abstract
Networks of workstations (NOWs) offer a cost-effective platform
for high-performance, long-running parallel computations.
However, these computations must be able to tolerate the
changing and often faulty nature of NOW environments.
We present high-performance implementations of several
fault-tolerant algorithms for
distributed scientific computing.
The fault tolerance is based on diskless checkpointing,
a paradigm that uses processor redundancy rather than
stable storage as the fault-tolerant medium.
These algorithms are able to run on clusters of workstations
that change over time due to failure, load or availability.
As long as there are at least n processors in the
cluster, and failures occur singly,
the computation will complete in an efficient manner.
We discuss how the algorithms are tuned for
fault tolerance and present performance results on a PVM
network of Sun workstations connected by fast, switched
Ethernet.
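The idea of using processor redundancy instead of stable storage can be illustrated with a minimal sketch. The abstract does not say how the checkpoints are encoded, so this example assumes one common scheme: an extra processor holds the bytewise XOR parity of the other processors' in-memory states, and a single lost state is rebuilt from the survivors. The function names (xor_bytes, parity_checkpoint, recover) and the sample states are hypothetical and are not taken from the paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Return the bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("states must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(states):
    """Encode all processor states into one parity block.

    Assumption: the block lives in the memory of a spare processor,
    so no disk is involved.
    """
    return reduce(xor_bytes, states)


def recover(surviving_states, parity):
    """Rebuild the single lost state from the survivors and the parity block."""
    return reduce(xor_bytes, surviving_states, parity)


# Hypothetical in-memory states of three compute processors (equal length).
states = [b"state-A0", b"state-B1", b"state-C2"]
parity = parity_checkpoint(states)

# Simulate the failure of processor 1 and rebuild its state.
survivors = states[:1] + states[2:]
rebuilt = recover(survivors, parity)
```

A single parity block can rebuild only one lost state at a time, which is consistent with the abstract's condition that failures occur singly.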
Citation Information
- Plain Text:
author J. S. Plank and Y. Kim and J. Dongarra
title Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing
journal Journal of Parallel and Distributed Computing
volume 43
number 2
pages 125--138
month June
year 1997
- BibTeX:
@ARTICLE{pkd:97:ftm,
author = "J. S. Plank and Y. Kim and J. Dongarra",
title = "Fault Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing",
journal = "Journal of Parallel and Distributed Computing",
volume = "43",
number = "2",
pages = "125--138",
month = "June",
year = "1997"
}