Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems

James S. Plank and Michael G. Thomason.

Journal of Parallel and Distributed Computing, Vol. 61, No. 11, November, 2001, pp. 1570-1590.

The precursor to this paper, which appeared in FTCS-29, is here.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/JPDC01.pdf and pub/plank/papers/JPDC01.ps.Z.

MATLAB scripts for this work are here.


Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.

Keywords: Checkpointing, performance prediction, parameter selection, parallel computation, Markov chain, exponential failure and repair distributions.

Citation Information