"CLIP: A Checkpointing Tool for Message-Passing Parallel Programs"

Yuqun Chen, James S. Plank, and Kai Li

SC97: High Performance Computing & Networking, San Jose, November, 1997.

HTML of the paper

Abstract

Checkpointing is a useful technique for rollback recovery. We present CLIP, a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. Creating an actual tool for checkpointing a complex machine like the Paragon is not easy, because many issues arise that require careful design decisions to be made. We detail what these decisions are, and how they were made in CLIP. We present performance data when checkpointing several long-running parallel applications. These results show that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer with good performance.


Home Page of CLIP

