``An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance''

Technical Report UT-CS-97-372, University of Tennessee, July 1997.

This technical report is derived from the paper ``Program Diagnostics,'' by the same author, which appears in volume 17 of the Wiley Encyclopedia of Electrical and Electronics Engineering (John G. Webster, editor; John Wiley & Sons, Inc., 1999). See http://web.eecs.utk.edu/~jplank/plank/papers/Wiley.html for a fuller pointer to the encyclopedia. Please cite that paper in preference to this technical report.

Available via anonymous FTP from cs.utk.edu as pub/plank/papers/CS-97-372.ps and pub/plank/papers/CS-97-372.pdf.

Abstract

Checkpointing is the act of saving the state of a running program so that it may be reconstructed later in time. It is an important basic functionality in computing systems that paves the way for powerful tools in many fields of computer science. This article provides a comprehensive overview of checkpointing in uniprocessor and parallel processing systems, including definitions, uses of checkpointing, and implementation details. Also included in this overview is a brief discussion of checkpoint consistency, which is a major concern in parallel processing systems, and a thorough discussion of issues related to the performance of checkpointing. It is intended that the reader of this article should receive a thorough grounding in checkpointing, with enough detail to implement an efficient checkpointer if so desired.
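
To make the definition above concrete, here is a minimal sketch of application-level checkpointing in Python. It is not taken from the report: the function names (save_checkpoint, load_checkpoint, run_sum), the state layout, and the fail_at parameter are all hypothetical and chosen only for illustration. The program periodically saves its loop state to a file, and a later run reconstructs that state and resumes instead of starting over.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically, so a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to stable storage before renaming
    os.replace(tmp_path, path)  # atomically swap in the new checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, interval=100, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    fail_at (hypothetical, for demonstration) simulates a crash by raising
    RuntimeError when the loop reaches that iteration.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway through, calling run_sum again with the same path picks up from the most recent checkpoint and loses at most `interval` iterations of work. This sketch saves only the state the program chooses to save; a checkpointer that captures the full state of an arbitrary running process has considerably more to record.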

Citation Information

Plain Text:

author          J. S. Plank
title           An Overview of Checkpointing in Uniprocessor and Distributed
                Systems, Focusing on Implementation and Performance
institution     University of Tennessee
number          CS-97-372
month           July
year            1997
where           http://web.eecs.utk.edu/~jplank/plank/papers/CS-97-372.html

Bibtex:

@TECHREPORT{p:97:ocu,
        author = "J. S. Plank",
        title = "An Overview of Checkpointing in Uniprocessor and Distributed
                Systems, Focusing on Implementation and Performance",
        institution = "University of Tennessee",
        number = "CS-97-372",
        month = "July",
        year = "1997",
        where = "http://web.eecs.utk.edu/~jplank/plank/papers/CS-97-372.html",
        note = "Also published as ``Program Diagnostics'', to appear in
                the Encyclopedia of Electrical and Electronics Engineering,
                John G. Webster, editor, published by John Wiley \& Sons, Inc."
}