Low-Latency, Concurrent Checkpointing for Parallel Programs
Kai Li, Jeffrey F. Naughton, and James S. Plank
IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 8,
August 1994, pp. 874-879.
Available via anonymous ftp to cs.utk.edu in
pub/plank/papers/TPDS-94.ps.Z.
Abstract
This paper presents the results of an implementation of several
algorithms for checkpointing and restarting parallel programs
on shared-memory multiprocessors. The algorithms are compared on
three metrics: overall checkpointing time, the overhead that the
checkpointer imposes on the target program, and the length of time
for which the checkpointer interrupts the target program.
The best algorithm measured achieves its efficiency through a
variation of copy-on-write, which allows the most time-consuming
operations of the checkpoint to be overlapped with the running of the
program being checkpointed.
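
The copy-on-write idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the algorithm measured in the paper: the class CowCheckpointer, its method names, and the three-page "memory" are all hypothetical, and an explicit write() call stands in for the write-protection fault that a real system would take. The program is paused only long enough to mark pages protected. After that, the checkpointer saves pages while the program keeps running, and a page is copied only if the program writes to it before it has been saved.

```python
class CowCheckpointer:
    """Toy page-level copy-on-write checkpointer (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory      # list of bytearray "pages" owned by the program
        self.protected = set()    # pages not yet saved or copied
        self.saved_copies = {}    # pre-write copies made on a simulated fault
        self.checkpoint = {}      # page index -> bytes "written to stable storage"

    def begin(self):
        # The only step that would need the program paused:
        # mark every page protected, then let the program resume.
        self.protected = set(range(len(self.memory)))
        self.saved_copies = {}
        self.checkpoint = {}

    def write(self, page, offset, value):
        # Program write path; stands in for a write-protection fault handler.
        if page in self.protected:
            # First write since begin(): keep the old contents for the checkpoint.
            self.saved_copies[page] = bytes(self.memory[page])
            self.protected.discard(page)
        self.memory[page][offset] = value

    def save_next_page(self):
        # One step of the checkpointer, which would run alongside the program.
        # Returns False once every page has been saved.
        for page in range(len(self.memory)):
            if page in self.checkpoint:
                continue
            if page in self.saved_copies:
                # The program already modified this page; use the preserved copy.
                self.checkpoint[page] = self.saved_copies.pop(page)
            else:
                # Untouched since begin(): save it directly and unprotect it.
                self.checkpoint[page] = bytes(self.memory[page])
                self.protected.discard(page)
            return True
        return False


# Interleave program writes with checkpointer steps.
memory = [bytearray(b"aaaa"), bytearray(b"bbbb"), bytearray(b"cccc")]
snapshot_at_begin = [bytes(p) for p in memory]

ckpt = CowCheckpointer(memory)
ckpt.begin()
ckpt.save_next_page()          # checkpointer saves page 0
ckpt.write(2, 0, ord("X"))     # page 2 not yet saved: old contents are copied first
ckpt.write(0, 1, ord("Y"))     # page 0 already saved: no copy needed
while ckpt.save_next_page():   # checkpointer finishes the remaining pages
    pass

restored = [ckpt.checkpoint[i] for i in range(len(memory))]
```

In this sketch, restored holds the memory contents as of begin(), even though the program modified pages 0 and 2 while the checkpoint was in progress. The slow part (writing pages out) happens in save_next_page, outside the brief protected-marking step.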