Design, Implementation, and Performance of Checkpointing in NetSolve

Adnan Agbaria and James S. Plank

Technical Report UT-CS-99-433, University of Tennessee, November 1999.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-99-433.ps.Z and CS-99-433.pdf

Submitted for publication. Up-to-date publication status will be maintained on this page.


Abstract

While a variety of checkpointing techniques and systems have been documented for long-running programs, they are typically not available to programmers who are not systems experts. This paper details a project that combines three technologies, NetSolve, Starfish, and IBP, to integrate fault tolerance seamlessly into long-running applications. We discuss the design and implementation of this project and present performance results from executions on both local, high-performance networks and wide-area, lower-performance networks.

Compressed PostScript of the paper

PDF of the paper