"Program Diagnostics"

James S. Plank

Wiley Encyclopedia of Electrical and Electronics Engineering, John G. Webster, editor, John Wiley & Sons, Inc., New York, volume 17, pages 300-310.

A version of this paper appears as Technical Report UT-CS-97-372, University of Tennessee, July 1997.


Abstract

Checkpointing is the act of saving the state of a running program so that it may be reconstructed later. It is an important basic functionality in computing systems that paves the way for powerful tools in many fields of computer science. This article provides a comprehensive overview of checkpointing in uniprocessor and parallel processing systems, including definitions, uses of checkpointing, and implementation details. The overview also includes a brief discussion of checkpoint consistency, which is a major concern in parallel processing systems, and a thorough discussion of issues related to the performance of checkpointing. The article is intended to give the reader a thorough grounding in checkpointing, with enough detail to implement an efficient checkpointer if so desired.

Citation Information