``Faster Checkpointing with N+1 Parity''

James S. Plank and Kai Li

24th International Symposium on Fault-Tolerant Computing, Austin, TX, June 1994, pp. 288--297.

Available via anonymous FTP to cs.utk.edu in pub/plank/papers/FTCS24.1994.ps.Z or pub/plank/papers/FTCS24.1994.pdf.

Abstract

This paper presents a way to perform fast, incremental checkpointing of multicomputers and distributed systems by using N+1 parity. A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure. The algorithm's speed comes from a combination of N+1 parity, extra physical memory, and virtual memory hardware so that checkpoints need not be written to disk. This eliminates the most time-consuming portion of checkpointing.
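As a rough illustration of the N+1 parity idea only, the following Python sketch shows how one parity block lets the state of any single failed processor be rebuilt from the survivors. It is not the paper's algorithm: it ignores incremental checkpointing, the virtual memory hardware, and the second extra processor, and the names `xor_blocks`, `parity_of`, and `recover` are hypothetical helpers chosen for this example.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Return the bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks: list[bytes]) -> bytes:
    """Compute the parity block: the XOR of all N application blocks."""
    return reduce(xor_blocks, blocks)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing block by XORing the parity with the survivors."""
    return reduce(xor_blocks, survivors, parity)


# Hypothetical checkpointed state of N = 3 application processors.
states = [b"proc0 st", b"proc1 st", b"proc2 st"]
parity = parity_of(states)  # held by the extra (N+1-th) processor

# Simulate losing processor 1 and rebuilding its state in memory.
lost = 1
survivors = [s for i, s in enumerate(states) if i != lost]
recovered = recover(survivors, parity)
assert recovered == states[lost]
```

Because XOR is its own inverse, the same routine recovers whichever single block is missing, including the parity block itself if the checkpointing processor is the one that fails.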

The algorithm requires each application processor to allocate a fixed amount of extra memory for checkpointing. This amount may be set statically by the programmer, and need not be equal to the size of the processor's writable address space. This alleviates a major restriction of previous checkpointing algorithms using N+1 parity [plank 93].

Finally, we outline how to extend our algorithm to tolerate any m processor failures with the addition of 2m extra checkpointing processors.
