``Memory Exclusion: Optimizing the Performance of Checkpointing Systems''

James S. Plank, Yuqun Chen, Kai Li, Micah Beck and Gerry Kingsley

Software -- Practice and Experience, Volume 29, Number 2, pp. 125-142, 1999.

For a precursor to this paper, see Technical Report CS-96-335.

Abstract

Checkpointing systems are a convenient way for users to make their programs fault-tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well-known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework.

We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix-based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long-running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. We present results from these experiments, which show the improvements in both time and space overhead.


Citation Information