Design, Implementation, and Performance of Checkpointing in
James S. Plank
Technical Report UT-CS-99-433, University of Tennessee, November 1999.
Available via anonymous ftp to cs.utk.edu in
Submitted for publication. Up-to-date publication status will be maintained
on this page.
While a variety of checkpointing techniques and systems have been
documented for long-running programs, they are typically not
available for programmers who are not systems experts. This paper
details a project that combines three technologies (NetSolve,
Starfish, and IBP) for the seamless integration of fault tolerance
into long-running applications. We discuss the design and
implementation of this project, and present performance results
obtained on both local, high-performance networks and wide-area,
lower-performance networks.