``Ickp --- A Consistent Checkpointer for Multicomputers''

James S. Plank and Kai Li

IEEE Parallel & Distributed Technology, 2(2), Summer 1994, pp. 62--67.

Abstract

This paper presents performance results of ickp, a program that checkpoints applications written for the Intel iPSC/860. ickp is the first implementation of general-purpose checkpointing on a multicomputer. It implements three consistent checkpointing algorithms, two optimizations, and recovery. The paper reports the results of testing ickp on many benchmark programs. These benchmarks are useful iPSC/860 programs written by other scientists, not toy programs written to test the checkpointer.

The main result of this paper is that we can sufficiently checkpoint a multicomputer of the size of the iPSC/860, thereby achieving fault tolerance and coarse-grained job swapping in an environment that previously had neither. We also draw conclusions about the nature of consistent checkpointing algorithms and about the effectiveness of two optimizations: main-memory checkpointing and checkpoint compression.

An alpha release of ickp has been made available to the Intel community.

The base version of this paper is ``Performance Results of Ickp -- A Consistent Checkpointer on the iPSC/860''. A PostScript version of that paper is available.