"CLIP -- A Checkpointing Tool for Message-Passing Parallel Programs"

Yuqun Chen, James S. Plank, and Kai Li

Scalable Input/Output, Daniel A. Reed, editor, The MIT Press, Cambridge, MA, 2004, pp. 182-200.

See the SC 1997 paper for the precursor to this article.


Checkpointing is a useful technique for rollback recovery. We present CLIP, a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. Building a working checkpointing tool for a machine as complex as the Paragon is not easy, because many issues arise that require careful design decisions. We detail what these decisions are and how they were made in CLIP. We present performance data from checkpointing several long-running parallel applications. These results show that a convenient, general-purpose checkpointing tool like CLIP can provide fault tolerance on a massively parallel multicomputer with good performance.

Citation Information