"Diskless Checkpointing"

James S. Plank, Kai Li and Michael A. Puening.

Technical Report UT-CS-97-380, University of Tennessee, December, 1997.


A revised version of this report has been accepted for publication in IEEE Transactions on Parallel and Distributed Systems, 9(10), October, 1998, pp. 972-986. See this link for complete citation information. Please cite that paper in preference to this technical report.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-97-380.ps and pub/plank/papers/CS-97-380.pdf.


Abstract

Diskless checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. By avoiding stable storage, it eliminates the performance bottleneck of traditional checkpointing on distributed systems.

In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.

Postscript of the paper

PDF of the paper


Citation Information