Efficient Checkpointing on MIMD Architectures

James S. Plank

Ph.D. Dissertation, Princeton University, June 1993.

Abstract

Presented here are efficient algorithms for checkpointing on MIMD architectures. These algorithms have been implemented on two representative machines: a shared-memory multiprocessor, and a message-passing multicomputer. The algorithms and implementations are evaluated according to three speed metrics: checkpoint time, overhead, and latency.

Checkpointing is important as a general means of software fault tolerance. It is also the backbone of certain program control utilities, such as job swapping, process migration, and playback debugging. We employ several techniques to minimize the invasiveness of the checkpointer on the target program. These techniques include main-memory checkpointing, copy-on-write, buffering, compression, and the elimination of bottlenecks and extra control messages.
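
As a rough illustration of the copy-on-write idea mentioned above, here is a minimal single-process sketch in Python. It is not the dissertation's implementation and does not target a MIMD machine: it assumes a Unix-like system where os.fork() gives the child a copy-on-write view of the parent's memory. The function names (checkpoint_with_fork, restore, demo) and the use of pickle as the on-disk format are hypothetical choices made only for this example.

```python
import os
import pickle
import tempfile


def checkpoint_with_fork(state, path):
    """Save `state` to `path` from a forked child and return the child's pid.

    Assumption: a Unix-like OS. After fork(), the child holds a
    copy-on-write snapshot of the parent's memory, so the parent can keep
    changing `state` while the child writes the old contents to disk.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: write the snapshot, then exit immediately
        # without running any of the parent's cleanup handlers.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            # Rename atomically so a partial checkpoint is never visible.
            os.replace(tmp, path)
            status = 0
        finally:
            os._exit(status)
    return pid


def restore(path):
    """Load a previously saved checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    state = {"step": 3, "data": [1, 2, 3]}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
    pid = checkpoint_with_fork(state, path)

    # The parent continues right away; these changes happen after the
    # fork and therefore are not part of the checkpoint.
    state["step"] = 4
    state["data"].append(4)

    # Wait for the child so the checkpoint file is complete.
    _, wait_status = os.waitpid(pid, 0)
    assert os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
    return state, restore(path)


if __name__ == "__main__":
    current, saved = demo()
    print("current:", current)
    print("restored:", saved)
```

In this sketch the running program pays only for the fork call before it resumes, while the checkpoint becomes usable only once the child has finished writing, which is one way to see why the time a checkpoint adds to the program and the time until the checkpoint is complete can differ.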

The major result of this dissertation is that we can implement efficient checkpointing on MIMD architectures, thereby enhancing the usability of such machines.

PDF of the dissertation.


(I'm keeping this around for historical reasons. Evidently, I came from an era when 700K was quite a lot to post in one file.)

PostScript in one part or 13 parts:

Part 1
Part 2
Part 3
Part 4
Part 5
Part 6
Part 7
Part 8
Part 9
Part 10
Part 11
Part 12
Part 13

(Or anonymous ftp to cs.utk.edu in pub/plank/thesis).