``CLIP -- A Checkpointing Tool for Message-Passing Parallel Programs''
Yuqun Chen, James S. Plank, and Kai Li,
Scalable Input/Output, Daniel A. Reed, editor, The MIT Press,
Cambridge, MA, 2004, pp. 182-200.
See the SC 1997 paper for the precursor to this article.
Abstract
Checkpointing is a useful technique for rollback recovery. We
present CLIP, a user-level library that provides semi-transparent
checkpointing for parallel programs on the Intel Paragon
multicomputer. Building a practical checkpointing tool for a machine as
complex as the Paragon is difficult, because many issues arise that
require careful design decisions. We detail these decisions and how
they were made in CLIP. We present performance data from checkpointing
several long-running parallel
applications. These results show that a convenient, general-purpose
checkpointing tool like CLIP can provide fault-tolerance on a
massively parallel multicomputer with good performance.
Citation Information
- Plain Text:
.inbook cpl:04:clip
author Y. Chen and J. S. Plank and K. Li
title {CLIP}: A Checkpointing Tool for Message-Passing
Parallel Programs
booktitle Scalable Input/Output
publisher The MIT Press
address Cambridge, MA
year 2004
pages 182-200
- Bibtex:
@INBOOK{cpl:04:clip,
author = "Y. Chen and J. S. Plank and K. Li",
title = "{CLIP}: A Checkpointing Tool for Message-Passing
Parallel Programs",
booktitle = "Scalable Input/Output",
publisher = "The MIT Press",
address = "Cambridge, MA",
year = "2004",
pages = "182-200"
}