"The Average Availability of Parallel Checkpointing Systems and Its Importance in Selecting Runtime Parameters"

James S. Plank, Michael G. Thomason.

FTCS-29: 29th International Symposium on Fault-Tolerant Computing, Madison, WI, June 1999, pp. 250-259.

An expanded journal version of this paper was published in JPDC in 2001. Please see this link for information about that paper, and please cite it in preference to this one.

Available via anonymous FTP from cs.utk.edu in pub/plank/papers/FTCS29.ps and pub/plank/papers/FTCS29.pdf.

Matlab scripts for this work are here.


Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we briefly present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.

Postscript of the paper

PDF of the paper

Citation Information