``The Average Availability of Multiprocessor
Checkpointing Systems''
James S. Plank and
Michael G. Thomason.
Technical Report UT-CS-98-403, University of Tennessee, November, 1998.
Available via anonymous ftp to cs.utk.edu in
pub/plank/papers/CS-98-403.ps.Z.
Abstract
Performance prediction of checkpointing systems in the presence of
failures is a well-studied research area.
A previous technical report defined average availability as a
useful metric for uniprocessor
checkpointing systems.
This report introduces a
discrete-parameter, finite-state Markov chain M to
compute the availability for
multiprocessor
checkpointing systems. N is the number of
processors in the system. Processors are interchangeable.
At any time, each individual processor
is either nonfunctional (failed and under repair) or
functional (actively working on the task
or standing by as a spare).
A specified minimum number a
of the N processors must be functional in order for the system to
work on a distributed task. The system does not use
more than a processors but cannot compute with fewer than a.
M assumes that the interoccurrence times of
failures and the repair times are each independent,
identically distributed, and exponential. A separate
continuous-parameter Markov chain S is used to compute some of
the transition probabilities in M.
System availability is related to the speed-up obtained
with multiple processors as a measure of real-time work on
a long-running task. Finally, merging states to obtain a smaller Markov
chain and some additional computations are briefly discussed.
Matlab scripts for calculating
availability are also provided.
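
To make the processor model in the abstract concrete, the following Python sketch computes the long-run fraction of time that at least a of the N processors are functional. It is not the report's chain M or S and is not taken from the Matlab scripts. It is a simplified birth-death chain built on assumptions that the abstract does not state: every functional processor (active or spare) fails at the same rate, and every failed processor is repaired independently at the same rate. It ignores checkpoint overhead, recovery time, and lost work, so the result should not be read as the report's availability figure. All function and parameter names are hypothetical.

```python
from math import comb

def stationary_functional_counts(n, fail_rate, repair_rate):
    """Stationary distribution of the number of functional processors.

    Assumed model (not from the report): with k processors functional,
    the chain moves k -> k-1 at rate k * fail_rate and
    k -> k+1 at rate (n - k) * repair_rate.
    """
    weights = [1.0]  # unnormalized weight of state k = 0
    for k in range(n):
        birth = (n - k) * repair_rate   # rate of k -> k+1
        death = (k + 1) * fail_rate     # rate of k+1 -> k
        # Detailed balance for a birth-death chain.
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    return [w / total for w in weights]

def prob_at_least(n, a, fail_rate, repair_rate):
    """Long-run fraction of time with at least `a` functional processors."""
    pi = stationary_functional_counts(n, fail_rate, repair_rate)
    return sum(pi[a:])

def binomial_tail(n, a, p):
    """Cross-check: under the assumptions above, the stationary count is
    Binomial(n, p) with p = repair_rate / (fail_rate + repair_rate)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

if __name__ == "__main__":
    # Illustrative numbers only: 10 processors, 8 required.
    print(prob_at_least(10, 8, fail_rate=0.01, repair_rate=1.0))
```

The product-form solution is used here because a birth-death chain satisfies detailed balance, so no linear solver is needed. A model closer to the report would need additional states for checkpointing and recovery.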