## "The Average Availability of Multiprocessor Checkpointing Systems"

James S. Plank and
Michael G. Thomason.
Technical Report UT-CS-98-403, University of Tennessee, November 1998.

Available via anonymous ftp to cs.utk.edu in
pub/plank/papers/CS-98-403.ps.Z.

### Abstract

Performance prediction of checkpointing systems in the presence of
failures is a well-studied research area.
A previous technical report defined the *average availability*
as a useful metric for uniprocessor
checkpointing systems.
This report introduces a
discrete-parameter, finite-state Markov chain *M* to
compute the availability for
multiprocessor
checkpointing systems. *N* is the number of
processors in the system. Processors are interchangeable.
At any time, each individual processor
is either *nonfunctional* (failed and under repair) or
*functional* (actively working on the task
or standing by as a spare).
A specified minimum number *a*
of the *N* processors must be functional in order for the system to
work on a distributed task. The system does not use
more than *a* processors but cannot compute with fewer than *a*.
*M* is based on
assumptions of independent exponential probability distributions
for identically
distributed interoccurrence times of
failures and for identically distributed repair times. A separate
continuous-parameter Markov chain *S* is used to compute some of
the transition probabilities in *M*.
System availability is related to the speed-up obtained
with multiple processors as a measure of real-time work on
a long-running task. Finally, merging states to obtain a smaller Markov
chain and some additional computations are briefly discussed.
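
To make the failure and repair assumptions concrete, here is a minimal Python sketch of a much simpler model than the report's chains *M* and *S*. It is not the report's method and it ignores checkpointing and recovery entirely. It treats the number of functional processors as a birth-death continuous-parameter Markov chain and computes the long-run fraction of time that at least *a* of the *N* processors are functional. Two assumptions are made here that the abstract does not state: spares fail at the same rate as active processors, and every failed processor is repaired independently of the others. The function and parameter names (`stationary_distribution`, `prob_enough_processors`, `failure_rate`, `repair_rate`) are hypothetical.

```python
def stationary_distribution(n, failure_rate, repair_rate):
    """Return pi, where pi[k] is the long-run probability that exactly
    k of the n processors are functional.

    Assumed model (not taken from the report): each functional processor
    fails at rate failure_rate, and each nonfunctional processor is
    repaired at rate repair_rate, all independently.
    """
    # Detailed balance for a birth-death chain:
    #   pi[k] * (n - k) * repair_rate = pi[k + 1] * (k + 1) * failure_rate
    weights = [1.0]
    for k in range(n):
        ratio = (n - k) * repair_rate / ((k + 1) * failure_rate)
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def prob_enough_processors(n, a, failure_rate, repair_rate):
    """Long-run fraction of time that at least a processors are functional."""
    pi = stationary_distribution(n, failure_rate, repair_rate)
    return sum(pi[a:])


if __name__ == "__main__":
    # Hypothetical rates: mean time to failure 1000 h, mean repair time 10 h.
    print(prob_enough_processors(n=10, a=8, failure_rate=1 / 1000, repair_rate=1 / 10))
```

Under these assumptions the stationary distribution is binomial with success probability `repair_rate / (failure_rate + repair_rate)`. The resulting value bounds from above what a checkpointing system could achieve in this simplified setting, since time spent checkpointing, recovering, and redoing lost work is not counted.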

Matlab scripts for calculating
availability.