Checkpointing Research at Tennessee
Checkpointing
Checkpointing is the saving of program state, usually to stable storage,
so that it may be reconstructed later in time. Checkpointing provides
the backbone for rollback recovery (fault-tolerance), playback debugging,
process migration and job swapping. At Tennessee, our focus is on
fault-tolerance and process migration. Specifically, we concentrate on
the performance of checkpointing on all computational platforms, from
uniprocessors to supercomputers.
Funding
This material is based upon work supported by the
National
Science Foundation under Grant No. CCR-9409496, and
by the ORAU Junior Faculty Enhancement Award.
Research Projects in Checkpointing:
Checkpointing Java
This project addresses the checkpointing of Java programs
in an architecture-independent fashion.
Diskless Checkpointing
The major source of overhead in all
checkpointing systems is the time that it takes to write
checkpoints to stable storage (i.e. disks).
Diskless checkpointing is a technique that applies the
philosophy of RAID (Redundant Arrays of Inexpensive Disks),
employing extra processors instead of disks to provide
fault-tolerance.
This eliminates stable storage as the bottleneck in checkpointing, and
places the burden on the network.
In algorithms for diskless checkpointing, processors make local
checkpoints in memory (or local disk), and m extra
checkpointing processors maintain encodings of these
checkpoints. If up to m processors in the network of
workstations (NOW) fail, their contents can be restored from the local
checkpoints and checkpoint encodings of the surviving processors.
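For concreteness, here is a minimal sketch (our illustration, not the
project's code) of the m = 1 case: a single checkpointing processor
holds the bitwise parity (XOR) of the application processors'
checkpoints, and any one lost checkpoint is the XOR of the parity with
the survivors.

    /* Diskless checkpointing, m = 1: the parity of the local
     * checkpoints recovers any single failed processor's state. */
    #include <stdio.h>
    #include <string.h>

    #define NPROC 4        /* application processors */
    #define CKPT  8        /* bytes per local checkpoint */

    void encode(unsigned char c[NPROC][CKPT], unsigned char parity[CKPT])
    {
        memset(parity, 0, CKPT);
        for (int p = 0; p < NPROC; p++)
            for (int i = 0; i < CKPT; i++) parity[i] ^= c[p][i];
    }

    void recover(unsigned char c[NPROC][CKPT], unsigned char parity[CKPT],
                 int failed)
    {
        memcpy(c[failed], parity, CKPT);      /* start from the parity */
        for (int p = 0; p < NPROC; p++)       /* XOR in the survivors  */
            if (p != failed)
                for (int i = 0; i < CKPT; i++) c[failed][i] ^= c[p][i];
    }

    int main(void)
    {
        unsigned char c[NPROC][CKPT] = { "proc0", "proc1", "proc2", "proc3" };
        unsigned char parity[CKPT];

        encode(c, parity);
        memset(c[2], 0, CKPT);                     /* processor 2 "fails" */
        recover(c, parity, 2);
        printf("recovered: %s\n", (char *)c[2]);   /* prints "proc2" */
        return 0;
    }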
There are two current research projects based on diskless checkpointing.
The first is a systems project
where page-based incremental checkpointing is combined
with diskless checkpointing so that processors may maintain
their local checkpoints in memory with minimal extra memory
requirements. See the paper
``Faster Checkpointing with N+1 Parity'' for a more
complete discussion.
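For a flavor of the page-based part, here is a minimal POSIX sketch
(ours, not the project's code): after each checkpoint the region is
write-protected, the first store to a page raises a fault, and the
handler marks the page dirty and unprotects it, so only dirty pages
need to appear in the next checkpoint.

    /* Page-based dirty-page tracking via mprotect(). */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 64
    static char *region;        /* the region being checkpointed */
    static long pagesize;
    static int  dirty[NPAGES];  /* one dirty bit per page */

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        char *page = (char *)
            ((uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1));
        dirty[(page - region) / pagesize] = 1;             /* mark dirty  */
        mprotect(page, pagesize, PROT_READ | PROT_WRITE);  /* retry store */
    }

    static void checkpoint_taken(void)  /* re-arm after each checkpoint */
    {
        for (int i = 0; i < NPAGES; i++) dirty[i] = 0;
        mprotect(region, NPAGES * pagesize, PROT_READ);
    }

    int main(void)
    {
        pagesize = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = { 0 };
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_fault;
        sigaction(SIGSEGV, &sa, NULL);

        checkpoint_taken();          /* pretend a checkpoint just finished */
        region[5 * pagesize] = 1;    /* dirties page 5 only */

        for (int i = 0; i < NPAGES; i++)
            if (dirty[i]) printf("page %d is dirty\n", i);
        return 0;
    }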
The second project mixes diskless checkpointing with high-performance matrix
operations in
ScaLAPACK.
The result is a library of ScaLAPACK subroutines that are resilient
to any one processor failure with very low
overhead. See the paper
``Fault
Tolerant Matrix Operations for Networks of
Workstations Using Diskless Checkpointing'' for details.
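In simplified form (our own names, m = 1), the recovery idea is the
floating-point analogue of parity: a checkpointing processor keeps the
elementwise sum of the application processors' matrix blocks, and a
failed block is recovered by subtraction, up to roundoff.

    /* Floating-point checksum, m = 1. */
    #include <stdio.h>

    #define NPROC 4
    #define BLOCK 16   /* doubles per local matrix block */

    void encode_sum(double blk[NPROC][BLOCK], double sum[BLOCK])
    {
        for (int i = 0; i < BLOCK; i++) {
            sum[i] = 0.0;
            for (int p = 0; p < NPROC; p++) sum[i] += blk[p][i];
        }
    }

    void recover_block(double blk[NPROC][BLOCK], double sum[BLOCK],
                       int failed)
    {
        for (int i = 0; i < BLOCK; i++) {
            blk[failed][i] = sum[i];          /* start from the checksum */
            for (int p = 0; p < NPROC; p++)   /* subtract the survivors  */
                if (p != failed) blk[failed][i] -= blk[p][i];
        }
    }

    int main(void)
    {
        double blk[NPROC][BLOCK], sum[BLOCK];
        for (int p = 0; p < NPROC; p++)
            for (int i = 0; i < BLOCK; i++) blk[p][i] = p + 0.5 * i;

        encode_sum(blk, sum);
        recover_block(blk, sum, 1);              /* processor 1 "fails" */
        printf("blk[1][3] = %g\n", blk[1][3]);   /* prints 2.5 */
        return 0;
    }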
We are continuing to develop this technique to add resilience to
multiple processor failures, and to use different coding techniques
combined with reverse execution to reduce the extra memory
requirements still further.
Application of RAID Techniques for Checkpointing
RAID techniques apply to disk-based checkpointing as well. In particular,
they can improve the performance of standard disk-based coordinated
checkpointers on networks of workstations. The paper
``Improving
the Performance of Coordinated Checkpointers
on Networks of Workstations using RAID Techniques''
explores the benefits of mirroring, parity, and more complex coding
in coordinated checkpointing systems.
User-Directed Checkpointing / Memory Exclusion
This is research based on the
notion that a few hints by the user can result in drastic
improvements in the performance of checkpointing.
This is due to memory exclusion, which means checkpointing less
than the complete memory image of the program because the
jettisoned portions are unnecessary for a correct recovery.
Memory exclusion has been employed effectively in incremental
checkpointing, where clean pages, i.e. pages whose values have not
been altered since the previous checkpoint, are not checkpointed.
However, memory exclusion has not been employed to jettison dead
variables --- variables whose current values will not be used
by the program following the checkpoint.
With user-directed checkpointing, the user may judiciously place checkpoints
to maximize memory exclusion due to clean and dead variables.
In 1994, we wrote
libckpt,
a library for user-directed
checkpointing on Unix-based uniprocessors.
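The following sketch conveys the style of use. The call names are
hypothetical stand-ins, stubbed here so the example runs; see the
paper below for libckpt's actual interface.

    /* Hypothetical stand-ins for user-directed checkpointing calls. */
    #include <stdio.h>

    void exclude_bytes(void *p, long n) { printf("exclude %ld bytes\n", n); }
    void include_bytes(void *p, long n) { printf("include %ld bytes\n", n); }
    void checkpoint_here(void)          { printf("checkpoint taken\n");     }

    double scratch[1000000];   /* large temporary, dead between iterations */

    int main(void)
    {
        for (int iter = 0; iter < 3; iter++) {
            /* ... fill and use scratch ... */

            /* scratch is dead here: the next iteration overwrites it
             * before reading it, so a correct recovery from this
             * checkpoint never needs its contents. */
            exclude_bytes(scratch, sizeof(scratch));
            checkpoint_here();   /* the checkpoint omits the excluded bytes */
            include_bytes(scratch, sizeof(scratch));
        }
        return 0;
    }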
Experiments have
shown that in several examples, user-directed checkpoint
placement and memory exclusion can
reduce the overhead of checkpointing by up to 90
percent. For complete details, see the paper
``Libckpt: Transparent Checkpointing under Unix.''
For a more complete treatment of memory exclusion, please see the
paper ``Memory
Exclusion: Optimizing the Performance of Checkpointing
Systems.''
Compiler-Assisted Checkpointing
The obvious ``next step'' for user-directed checkpointing is to employ the
compiler.
The reasons are clear: the user may miss potential memory exclusion
savings or, worse, make erroneous memory exclusion calls.
We have developed data flow equations that enable the compiler to
generate correct memory exclusion calls for both clean and
dead variables (this includes arrays).
We have corroborated their effectiveness on several example
programs, and are currently implementing them
in a Fortran preprocessor
using Stanford's SUIF toolkit.
We plan to extend this work for high-performance platforms
with High-Performance Fortran.
See the paper
``Compiler-Assisted Memory Exclusion for Fast Checkpointing'' for
a brief description and preliminary results of
this technique.
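As a small illustration (ours, written in C for brevity even though the
preprocessor targets Fortran), consider what the data flow analysis can
prove at a given checkpoint location:

    /* What the compiler can prove at a checkpoint placed between
     * the two loops below. */
    void example(double a[], const double b[], int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];

        /* At a checkpoint here, the analysis shows:
         *   - b is clean if it has not been written since the
         *     previous checkpoint, so it need not be re-saved;
         *   - a is dead, because every path from here writes a[i]
         *     before reading it (the loop below), so it can be
         *     excluded entirely.
         * The compiler emits the corresponding exclusion calls. */

        for (int i = 0; i < n; i++)
            a[i] = 0.0;    /* kills a: written before any read */
    }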
A separate but related research objective is to modify this work
so that load-balancing and/or program migration can be
implemented with assistance from the compiler.
Ickp -- A Consistent Checkpointer for Multicomputers
Ickp
is a library that enables users of the Intel iPSC/860 to checkpoint
the execution state of their programs to disk. Ickp is an important
piece of work as it is the first checkpointer ever written for a multicomputer.
There are three significant results to be gleaned from Ickp:
- Consistent checkpointing is a reasonable strategy for
providing fault-tolerance to long-running programs on
multicomputers.
There are many kinds of checkpointing described in the literature
besides consistent checkpointing: optimistic, pessimistic,
independent, etc.
This result is significant because consistent checkpointing is much
simpler to implement than the others.
- A simple synchronous algorithm for creating consistent
checkpoints performs just as well as more complex
algorithms.
There has been a huge amount of research on algorithms for creating
consistent checkpoints.
The focus of this research has been to reduce the number of messages
sent during checkpointing.
What ickp has shown is that the major overhead of checkpointing
comes from writing checkpoints to disk, and for machines with
under 1000 processors, the message overhead is negligible.
Thus, a simple two-phase commit called ``Sync-and-stop'' performs
as well as more complex algorithms like the
well-known Chandy-Lamport algorithm and its variants (a sketch
appears after this list).
- Checkpoint compression is beneficial in terms of time and
space overhead.
This result is significant, as previous research has only shown
compression to be detrimental to the speed of checkpointing.
However, these results came from uniprocessor checkpointing.
Compression in multicomputers is truly parallel, and the savings in
I/O bandwidth offset the time to compress.
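Here is a minimal sketch of sync-and-stop (ours), using MPI as a
stand-in for the iPSC/860's native message layer and assuming the
application has received every message sent to it before calling in,
so no channel state needs to be saved:

    /* Sync-and-stop consistent checkpointing, sketched with MPI. */
    #include <mpi.h>
    #include <stdio.h>

    void sync_and_stop(int rank, const void *state, long nbytes)
    {
        /* Phase 1: synchronize so every processor has stopped
         * computing; with the quiescence assumption above, no
         * message crosses the checkpoint line. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Phase 2: each processor writes its local state to disk. */
        char name[64];
        sprintf(name, "ckpt.%d", rank);
        FILE *f = fopen(name, "w");
        fwrite(state, 1, (size_t)nbytes, f);
        fclose(f);

        /* Phase 3: commit; no one resumes until the entire
         * consistent checkpoint is safely on disk. */
        MPI_Barrier(MPI_COMM_WORLD);
    }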
See the paper ``Ickp --- A Consistent Checkpointer for Multicomputers'' for
complete details.
Fast Checkpoint Compression
As stated above, standard compression algorithms have proven
unsuccessful at reducing the overhead of checkpointing, because the time that it
takes to compress the checkpoint is greater than the time it takes to
write the original checkpoint to disk. We have developed an algorithm
for performing fast compression on incremental checkpoints. The
algorithm essentially compresses away any unchanged word in a dirty page.
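In our own simplified form, the heart of the algorithm is a
word-by-word diff of each dirty page against a copy saved at the
previous checkpoint; only the changed words, tagged with their
offsets, are written out.

    /* Compressed differences for one dirty page. */
    #include <stdio.h>

    void write_diff(FILE *ckpt, const unsigned *old, const unsigned *cur,
                    int words_per_page, int pageno)
    {
        fwrite(&pageno, sizeof(int), 1, ckpt);
        for (int i = 0; i < words_per_page; i++) {
            if (old[i] != cur[i]) {                /* changed since last ckpt */
                fwrite(&i, sizeof(int), 1, ckpt);  /* offset within the page  */
                fwrite(&cur[i], sizeof(unsigned), 1, ckpt);
            }
        }
        int end = -1;                              /* sentinel: page done */
        fwrite(&end, sizeof(int), 1, ckpt);
    }

    int main(void)
    {
        unsigned old[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        unsigned cur[8] = { 0, 1, 9, 3, 4, 5, 6, 7 };
        FILE *f = fopen("ckpt.diff", "w");
        write_diff(f, old, cur, 8, 0);   /* emits only (offset 2, value 9) */
        fclose(f);
        return 0;
    }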
See the paper
``Compressed Differences: An Algorithm for Fast Incremental
Checkpointing'' for
complete details.
Reed-Solomon Coding
In RAID-like systems, Reed-Solomon coding is cited as the only
general-purpose way to tolerate m device failures with the addition of
just m extra devices. However, there is no good reference detailing
how the systems programmer may implement this coding. The paper
``A
Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like
Systems'' is meant to provide such a reference. It presents a
complete specification of the coding algorithm plus details on how it
may be implemented. This specification assumes no prior knowledge
of algebra or coding theory.
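To give a taste of the machinery, here is a minimal sketch (ours,
following the general log/antilog-table approach the tutorial
describes) of multiplication in the Galois field GF(2^4), the
arithmetic on which the coding rests; addition in GF(2^w) is
simply XOR.

    /* Multiplication in GF(2^4) via log/antilog tables. */
    #include <stdio.h>

    #define GF_SIZE   16     /* 2^4 elements */
    #define PRIM_POLY 0x13   /* x^4 + x + 1 */

    static int gf_exp[GF_SIZE], gf_log[GF_SIZE];

    void build_tables(void)
    {
        int x = 1;
        for (int i = 0; i < GF_SIZE - 1; i++) {
            gf_exp[i] = x;
            gf_log[x] = i;
            x <<= 1;                          /* multiply by the generator */
            if (x & GF_SIZE) x ^= PRIM_POLY;  /* reduce modulo the poly    */
        }
    }

    int gf_mult(int a, int b)
    {
        if (a == 0 || b == 0) return 0;
        return gf_exp[(gf_log[a] + gf_log[b]) % (GF_SIZE - 1)];
    }

    int main(void)
    {
        build_tables();
        printf("7 * 9 = %d in GF(16)\n", gf_mult(7, 9));   /* prints 10 */
        return 0;
    }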
Buffered, Copy-On-Write for Low-Overhead Checkpointing
It is now well-known that adding copy-on-write to a checkpointer can
significantly improve its performance. The initial research on this
topic came from the following papers:
``Real-Time, Concurrent Checkpoint for Parallel Programs'',
Second ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Seattle, WA, March 1990, pp. 79-88.
``Low-Latency, Concurrent Checkpointing for Parallel Programs'',
IEEE Transactions on Parallel and Distributed Systems, 5(8),
August 1994, pp. 874-879.
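On Unix, copy-on-write is most easily obtained with fork(), as in this
minimal sketch (ours): the child inherits a copy-on-write snapshot of
the parent's address space and writes it to disk while the parent
keeps computing, so the parent pays only for the pages it modifies
during the write.

    /* fork()-based copy-on-write checkpointing. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    double state[1 << 20];   /* the application state to checkpoint */

    void cow_checkpoint(const char *filename)
    {
        if (fork() == 0) {              /* child: a frozen snapshot */
            FILE *f = fopen(filename, "w");
            fwrite(state, sizeof(state), 1, f);
            fclose(f);
            _exit(0);
        }
        /* parent: returns immediately and continues computing */
    }

    int main(void)
    {
        state[0] = 3.14;
        cow_checkpoint("ckpt.img");
        /* ... keep computing while the child writes ... */
        wait(NULL);                     /* reap the checkpoint child */
        return 0;
    }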