Low-Latency, Concurrent Checkpointing for Parallel Programs

Kai Li, Jeffrey F. Naughton and James S. Plank

IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 8, August 1994, pp. 874-879.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/TPDS-94.ps.Z.

Abstract

This paper presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
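
To make the copy-on-write idea concrete, here is a minimal sketch in Python. It is not the paper's algorithm, which the abstract does not spell out. Instead it substitutes a related, widely used technique: calling fork() so that the operating system's own copy-on-write page sharing freezes a snapshot in a child process, which writes the checkpoint to disk while the parent keeps computing. The names checkpoint_concurrently, restore, and demo are hypothetical, the state is assumed to be picklable, and a POSIX system is assumed.

```python
import os
import pickle
import tempfile


def checkpoint_concurrently(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Illustrative assumption: fork() stands in for the paper's copy-on-write
    variant. The child sees memory as it was at fork time, so the parent may
    resume mutating `state` as soon as fork() returns.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: write the frozen snapshot, then exit immediately.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # publish the checkpoint atomically
            status = 0
        finally:
            os._exit(status)  # skip the parent's cleanup handlers
    return pid


def restore(path):
    """Load a previously written checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    state = {"step": 0, "data": list(range(1000))}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")

    pid = checkpoint_concurrently(state, path)

    # The parent is interrupted only for the fork itself; it continues
    # working while the child performs the slow disk write.
    state["step"] = 1
    state["data"][0] = -1

    _, wait_status = os.waitpid(pid, 0)
    if not (os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0):
        raise RuntimeError("checkpoint child failed")
    return state, restore(path)


if __name__ == "__main__":
    live, saved = demo()
    print("live step:", live["step"], "checkpointed step:", saved["step"])
```

In this sketch the target program is paused only for the duration of fork(), while the checkpoint as a whole takes as long as the child needs to finish writing. These correspond loosely to two of the metrics the abstract names: time during which the program is interrupted, and overall checkpointing time.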

Postscript of the paper