"The Average Availability of Multiprocessor Checkpointing Systems"

James S. Plank and Michael G. Thomason.

Technical Report UT-CS-98-403, University of Tennessee, November, 1998.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-98-403.ps.Z.


Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. A previous technical report defined the average availability as a useful metric for uniprocessor checkpointing systems. This report introduces a discrete-parameter, finite-state Markov chain M to compute the availability of multiprocessor checkpointing systems. The system has N interchangeable processors. At any time, each individual processor is either nonfunctional (failed and under repair) or functional (actively working on the task or standing by as a spare). A specified minimum number a of the N processors must be functional for the system to work on a distributed task. The system never uses more than a processors and cannot compute with fewer than a. M assumes that the times between failures are independent and identically distributed with an exponential distribution, and likewise for the repair times. A separate continuous-parameter Markov chain S is used to compute some of the transition probabilities in M. System availability is related to the speed-up obtained with multiple processors as a measure of real-time work on a long-running task. Finally, merging states to obtain a smaller Markov chain and some additional computations are briefly discussed.
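
As a rough illustration of the processor model described in the abstract, the sketch below computes the long-run fraction of time during which at least a of the N processors are functional. It is not the report's chain M or S and it does not compute the average availability, which also depends on checkpointing and recovery costs. It assumes a simple birth-death chain on the number of functional processors in which every functional processor (working or spare) fails at the same rate and every failed processor is repaired independently; the function names and the parameters failure_rate and repair_rate are hypothetical.

```python
from math import comb


def functional_count_distribution(n, failure_rate, repair_rate):
    """Steady-state distribution of the number of functional processors.

    Assumption (not from the report): state k moves to k-1 at rate
    k * failure_rate and to k+1 at rate (n - k) * repair_rate.
    Returns a list pi where pi[k] is the long-run probability that
    exactly k processors are functional.
    """
    # Detailed balance for a birth-death chain:
    # pi[k+1] / pi[k] = (n - k) * repair_rate / ((k + 1) * failure_rate)
    weights = [1.0]
    for k in range(n):
        ratio = (n - k) * repair_rate / ((k + 1) * failure_rate)
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def prob_at_least(n, a, failure_rate, repair_rate):
    """Long-run probability that at least a of n processors are functional."""
    pi = functional_count_distribution(n, failure_rate, repair_rate)
    return sum(pi[a:])


def prob_at_least_closed_form(n, a, failure_rate, repair_rate):
    """Cross-check: under the same assumptions each processor is functional
    with probability p = repair_rate / (failure_rate + repair_rate),
    independently, so the functional count is Binomial(n, p)."""
    p = repair_rate / (failure_rate + repair_rate)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 4 processors, 3 required, rates per hour.
    print(prob_at_least(4, 3, failure_rate=0.01, repair_rate=1.0))
```

Under these assumptions, adding spares (raising N while holding a fixed) increases the probability that the task can run at all, which is one ingredient in the trade-off between availability and speed-up that the abstract mentions.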

PostScript of the paper

Matlab scripts for calculating availability.