"The Average Availability of Uniprocessor Checkpointing Systems, Revisited"

James S. Plank and Michael G. Thomason.

Technical Report UT-CS-98-400, University of Tennessee, August, 1998.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-98-400.ps.Z.

Abstract

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. This paper makes three small contributions to it. First, we show how the concept of availability from reliability theory serves as a useful metric for checkpointing systems. Second, we study the average availability of uniprocessor checkpointing systems, using the libckpt checkpointer as a model, which deviates slightly from previous checkpointing models. We employ Bernoulli trials to derive an expression for the availability of such a system, and then use this expression to calculate the checkpoint interval that maximizes availability. Third, we present another derivation of the availability, based on a direct calculation of average segment uptime. For the exponential failure distribution function, the two derivations are equivalent. The latter derivation also provides a simple way to numerically approximate availability for other failure distribution functions. We conclude with examples of applying these results.
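To make the idea of an availability-maximizing checkpoint interval concrete, the following is a minimal sketch under a simplified model that is an assumption of this example and not necessarily the expression derived in the report. It assumes exponentially distributed failures with rate `failure_rate`, a fixed checkpoint overhead added to every interval, a fixed recovery time after each failure, and failures that may also strike during recovery. Under those assumptions the expected wall-clock time to finish one interval of useful work T is e^(λR) (e^(λ(T+C)) - 1) / λ, and availability is T divided by that quantity. The function and parameter names are hypothetical.

```python
import math


def availability(interval, failure_rate, ckpt_overhead, recovery_time):
    """Fraction of wall-clock time spent on useful work (assumed model).

    Assumptions (not taken from the report): exponential failures with
    rate `failure_rate`, a checkpoint overhead paid once per interval,
    and a recovery time paid after every failure.
    """
    span = interval + ckpt_overhead  # time that must pass failure-free
    expected_time = (
        math.exp(failure_rate * recovery_time)
        * math.expm1(failure_rate * span)
        / failure_rate
    )
    return interval / expected_time


def best_interval(failure_rate, ckpt_overhead, recovery_time, tol=1e-3):
    """Golden-section search for the interval that maximizes availability.

    Under the assumed model the availability curve has a single peak
    located below the mean time between failures, so the search runs
    over the range (0, 1 / failure_rate).
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = 0.0, 1.0 / failure_rate
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        fa = availability(a, failure_rate, ckpt_overhead, recovery_time)
        fb = availability(b, failure_rate, ckpt_overhead, recovery_time)
        if fa < fb:
            lo = a  # the peak lies to the right of a
        else:
            hi = b  # the peak lies to the left of b
    return (lo + hi) / 2


if __name__ == "__main__":
    # Illustrative values only: one failure per day on average,
    # 60 s checkpoint overhead, 120 s recovery time.
    rate, overhead, recovery = 1 / 86400, 60.0, 120.0
    t_opt = best_interval(rate, overhead, recovery)
    print(f"interval ~ {t_opt:.0f} s, "
          f"availability ~ {availability(t_opt, rate, overhead, recovery):.4f}")
```

For other failure distributions no such closed form is assumed here; the expected time per interval would have to be approximated numerically or by simulation, while the same search over the interval still applies.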
