Design, Implementation, and Performance of Checkpointing in NetSolve
Adnan Agbaria and James S. Plank
Technical Report UT-CS-99-433, University of Tennessee, November 1999.
Available via anonymous ftp to cs.utk.edu in
pub/plank/papers/CS-99-433.ps.Z
and CS-99-433.pdf
Submitted for publication. Up-to-date publication status will be maintained
on this page.
Abstract
While a variety of checkpointing techniques and systems have been
documented for long-running programs, they are typically not
accessible to programmers who are not systems experts. This paper
details a project that combines three technologies, NetSolve,
Starfish, and IBP, for the seamless integration of fault tolerance
into long-running applications. We discuss the design and
implementation of this project and present performance results
from runs on both local, high-performance networks and wide-area,
lower-performance networks.