Checkpointing Research At Tennessee

Checkpointing

Checkpointing is the saving of program state, usually to stable storage, so that it may be reconstructed later in time. Checkpointing provides the backbone for rollback recovery (fault-tolerance), playback debugging, process migration and job swapping. At Tennessee, our focus is on fault-tolerance and process migration. Specifically, we concentrate on the performance of checkpointing on all computational platforms, from uniprocessors to supercomputers.

Principal People

Professors

Graduate Students

Gerry Kingsley (Phd)
Youngbae Kim (Phd)
Slade Foster (Masters)
Lee Hamner (Masters)
Darryl Pace (Masters)
Michael Puening (Masters)

Research Colleagues

Kai Li (Princeton University)
Rob Netzer (Brown University)
Yuqun Chen (Princeton University)
Kevin Skadron (Princeton University)
Jian Xu (Brown University)

Funding

This material is based upon work supported by the National Science Foundation under Grant No. CCR-9409496, and by the ORAU Junior Faculty Enhancement Award.

Research Projects in Checkpointing:

Java Checkpointing

Follow the above link for the description of checkpointing Java programs in an architecture-independent fashion.

Diskless Checkpointing

The major source of overhead in all checkpointing systems is the time that it takes to write checkpoints to stable storage (i.e. disks). Diskless Checkpointing is a novel technique which uses the philosophy of RAID (Reliable Arrays of Inexpensive Disks) by employing extra processors to provide fault-tolerance instead of disks. This eliminates stable storage as the bottleneck in checkpointing, and places the burden on the network.

In algorithms for diskless checkpointing, processors make local checkpoints in memory (or local disk), and m extra checkpointing processors maintain encodings of these checkpoints so that if up to m processors in the NOW fail, their contents may be restored by the local checkpoints and checkpoint encodings of the surviving processors.

There are two current research projects based on diskless checkpointing. The first is a systems project where page-based incremental checkpointing is combined with diskless checkpointing so that processors may maintain their local checkpoints in memory with minimal extra memory requirements. See the paper ``Faster Checkpointing with N+1 Parity'' for a more complete discussion.

The second project mixes diskless checkpointing with high-performance matrix operations in ScaLAPACK. The result is a library of ScaLAPACK subroutines that are resilient to any one processor failure with very low overhead. See the paper ``Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing'' for details. We are continuing to develop this technique to add resilience to multiple processor failures, and to use different coding techniques combined with reverse execution to reduce the extra memory requirements still further.

Application of RAID Techniques for Checkpointing

RAID Techniques apply to disk-based checkpointing as well. In particular, they can improve the performance of standard disk-based coordinated checkpointers on networks of workstations. The paper ``Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques'' explores the benefits of mirroring, parity, and more complex coding in coordinated checkpointing systems.

User-Directed Checkpointing / Memory Exclusion

This is research based on the notion that a few hints by the user can result in drastic improvements in the performance of checkpointing. This is due to memory exclusion, which means checkpointing less than the complete memory image of the program because the jettisoned portions are unnecessary for a correct recovery. Memory exclusion has been employed effectively in incremental checkpointing, where pages are not checkpointed when they are clean. In other words, their values have not been altered since the previous checkpoint. However, memory exclusion has not been employed to jettison dead variables --- variables whose current values will not be used by the program following the checkpoint. With user-directed checkpointing, the user may judiciously place checkpoints to maximize memory exclusion due to clean and dead variables.

In 1994, we wrote libckpt, a library for user-directed checkpointing on Unix-based uniprocessors. Experiments have shown that in several examples, user-directed checkpoint placement and memory exclusion can reduce the overhead of checkpointing by up to 90 percent. For complete details, see the paper ``Libckpt: Transparent Checkpointing under Unix.''

For a more complete treatment of memory exclusion, please see the paper ``Memory Exclusion: Optimizing the Performance of Checkpointing Systems.''

Compiler-Assisted Checkpointing

The obvious ``next step'' for user-directed checkpointing is to employ the compiler. The reasons are clear. The user may miss potential savings due to memory exclusion, or even worse, make erroneous memory exclusion calls. We have developed data flow equations that enable the compiler to generate correct memory exclusion calls for both clean and dead variables (this includes arrays). We have corroborated their effectiveness on several example programs, and are currently implementing them in a Fortran preprocessor using Stanford's SUIF toolkit. We plan to extend this work for high-performance platforms with High-Performance Fortran. See the paper ``Compiler-Assisted Memory Exclusion for Fast Checkpointing'' for a brief description and preliminary results of this technique.

An separate but related research objective is to modify this work so that load-balancing and/or program migration can be implemented with assistance from the compiler.

Ickp -- A Consistent Checkpointer for Multicomputers

Ickp is a library that enables users of the Intel iPSC/860 to checkpoint the execution state of their programs to disk. Ickp is an important piece of work as it is the first checkpointer ever written for a multicomputer. There are three significant results to be gleaned from Ickp:

Consistent checkpointing is a reasonable strategy to provide fault-tolerance of long-running programs on multicomputers. There are many kinds of checkpointing described in the literature besides consistent checkpointing: optimistic, pessimistic, independent, etc. This result is significant as consistent checkpointing is much more simple to implement than the others.
A simple synchronous algorithm for creating consistent checkpoints performs just as well as more complex algorithms. There has been a huge amount of research on algorithms for creating consistent checkpoints. The focus of this research has been to reduce the number of messages sent during checkpointing. What ickp has shown is that the major overhead of checkpointing comes from writing checkpoints to disk, and for machines with under 1000 processors, the message overhead is negligible. Thus, a simple two-phased commit called ``Sync-and-stop'' performs equally as well as more complex algorithms like the well-known Chandy-Lamport algorithm and its variants.
Checkpoint compression is beneficial in terms of time and space overhead. This result is significant, as previous research has only shown compression to be detrimental to the speed of checkpointing. However, these results came from uniprocessor checkpointing. Compression in multicomputers is truly parallel, and the savings in I/O bandwidth offset the time to compress.

See the paper ``Ickp --- A Consistent Checkpointer for Multicomputers'' for complete details.

Fast Checkpoint Compression

As stated above, standard compression algorithms have proven unsuccessful at improving the overhead of checkpointing, because the time that it takes to compress the checkpoint is greater than the time it takes to write the original checkpoint to disk. We have developed an algorithm for performing fast compression on incremental checkpoints. The algorithm essentially compresses away any unchanged word in a dirty page. See the paper ``Compressed Differences: An Algorithm for Fast Incremental Checkpointing'' for complete details.

Reed-Solomon Coding

In RAID-like sytems, Reed-Solomon coding is cited as the only general purpose way to tolerate m device failures with the addition of just m extra devices. However, there is no good reference detailing how the systems programmer may implement this coding. The paper ``A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems'' is meant to provide such a reference. It presents a complete specification of the coding algorithm plus details on how it may be implemented. This specification assumes no prior knowledge of algebra or coding theory.

Buffered, Copy-On-Write for Low-Overhead Checkpointing

It is now well-known that adding copy-on-write to a checkpointer can significantly improve its performance. The initial research on this topic came from the following papers:

``Real-Time, Concurrent Checkpoint for Parallel Programs'', Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seattle, WA, March, 1990, pp. 79-88.

``Low-Latency, Concurrent Checkpointing for Parallel Programs'', IEEE Transactions on Parallel and Distributed Systems, 5(8), August, 1994, pp. 874--879.