``An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance''

James S. Plank

Technical Report UT-CS-97-372, University of Tennessee, July 1997.

This technical report is derived from the paper ``Program Diagnostics,'' by the same author, appearing in volume 17 of the Wiley Encyclopedia of Electrical and Electronics Engineering, John G. Webster, editor, published by John Wiley & Sons, Inc., 1999. Please see http://web.eecs.utk.edu/~jplank/plank/papers/Wiley.html for a better pointer to the encyclopedia. Please also cite that paper in preference to this technical report.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-97-372.ps and pub/plank/papers/CS-97-372.pdf.

Abstract

Checkpointing is the act of saving the state of a running program so that it may be reconstructed later. It is an important basic functionality in computing systems that paves the way for powerful tools in many fields of computer science. This article provides a comprehensive overview of checkpointing in uniprocessor and parallel processing systems, including definitions, uses of checkpointing, and implementation details. Also included in this overview is a brief discussion of checkpoint consistency, which is a major concern in parallel processing systems, and a thorough discussion of issues related to the performance of checkpointing. The article is intended to give the reader a thorough grounding in checkpointing, with enough detail to implement an efficient checkpointer if desired.


Citation Information