``Experimental Assessment of Workstation Failures
and Their Impact on Checkpointing Systems''
James S. Plank,
Wael R. Elwasif.
28th International Symposium on Fault-Tolerant
Computing, Munich, June 1998, pages 48-57.
Available via anonymous ftp to cs.utk.edu in
In the past twenty years, there has been a wealth of theoretical
research on minimizing the expected running time of a program in the
presence of failures by employing checkpointing and rollback recovery.
In the same time period, there has been little experimental research
to corroborate these results. In this paper, we study the results of
three separate projects that monitor failure in workstation networks.
Our goals are twofold. The first is to see how these
results correlate with the theoretical results, and the second is to
assess their impact on strategies for checkpointing
long-running computations on workstations and networks of workstations.
A surprising result of our work is that although the base assumptions
of the theoretical research do not hold, many of the results are
- Plain Text:
author: J. S. Plank and W. R. Elwasif
title: Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems
booktitle: 28th International Symposium on Fault-Tolerant Computing
- BibTeX:
author = "J. S. Plank and W. R. Elwasif",
title = "Experimental Assessment of Workstation Failures and
Their Impact on Checkpointing Systems",
booktitle = "28th International Symposium on Fault-Tolerant Computing",
address = "Munich",
month = "June",
year = "1998",
pages = "48-57",
where = "http://web.eecs.utk.edu/~plank/plank/papers/FTCS28.html"