James S. Plank and Kai Li
Scalable High Performance Computing Conference, Knoxville, TN, May, 1994, pp. 686--693.
Available via anonymous ftp to cs.utk.edu in
The main result of this paper is that we can checkpoint a multicomputer the size of the iPSC/860 well enough to achieve fault tolerance and coarse-grained job swapping in an environment that previously had neither. We also draw conclusions about the nature of consistent checkpointing algorithms and about the effectiveness of two optimizations: main-memory checkpointing and checkpoint compression.
An alpha release of ickp has been made available to the Intel community.