"Memory Exclusion: Optimizing the Performance of Checkpointing Systems"

James S. Plank, Yuqun Chen, Kai Li, Micah Beck and Gerry Kingsley

Technical Report UT-CS-96-335, University of Tennessee, August, 1996.


A revised version of this paper appears in Software -- Practice and Experience, Volume 29, Number 2, pp. 125-142, 1999. Please cite that paper in preference to this technical report.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-96-335.ps and pub/plank/papers/CS-96-335.pdf.


Abstract

Checkpointing systems are a convenient way for users to make their programs fault-tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. These optimizations have been implemented in two checkpointers: libckpt, which works on Unix-based workstations, and libNXckpt, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long-running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show that the improvements are significant. We conclude that all checkpointing systems should include primitives allowing programmers and users to gain the full benefits of memory exclusion.

Postscript of the paper

PDF of the paper


Raw Data for the paper

The raw data for the paper is here. The linked page also explains the apparent anomaly in the data in Figure 3, which is addressed in a footnote in the paper. Please see the link for more detail.

Citation Information