"Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing"

Youngbae Kim, James S. Plank, and Jack Dongarra

High Performance Computing on the Information Superhighway, HPC Asia '97, Seoul, Korea, April, 1997, pages 460-465.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/HPCA97.ps.Z.

Abstract

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. Because this approach builds fault tolerance into the matrix operations themselves, they tolerate any single processor failure or change with low overhead. In this paper, we present a technique called multiple checkpointing, which enables the matrix operations to tolerate a certain set of multiple processor failures by adding multiple checkpointing processors. Results on a network of workstations show that this technique improves not only the reliability of the computation but also the performance of checkpointing.
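
To illustrate the general idea of diskless checkpointing with more than one checkpointing processor, here is a minimal sketch. It is not taken from the paper. It assumes that the computing processors are split into groups, that each group has one checkpointing processor holding the bytewise XOR parity of the group's data blocks, and that at most one processor fails per group. The function names (xor_blocks, checkpoint, recover) and the in-memory lists standing in for processors are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checkpoint(groups):
    """Compute one parity block per group.

    Assumption: each parity block lives in the memory of that group's
    checkpointing processor, so no disk is involved.
    """
    return [xor_blocks(group) for group in groups]


def recover(group, parity, failed):
    """Rebuild the block of the single failed processor in one group.

    The failed processor's block is ignored; it is reconstructed from the
    surviving blocks and the group's parity block.
    """
    survivors = [block for i, block in enumerate(group) if i != failed]
    return xor_blocks(survivors + [parity])


# Two groups of two processors each (assumed layout, for illustration only).
groups = [[b"\x01\x02", b"\x03\x04"], [b"\x10\x20", b"\x30\x40"]]
parities = checkpoint(groups)

# One failure in each group can be repaired independently.
rebuilt_a = recover(groups[0], parities[0], failed=1)
rebuilt_b = recover(groups[1], parities[1], failed=0)
```

Under these assumptions, one failure in every group can be repaired at the same time, whereas two failures in the same group cannot. This is one way a scheme with several checkpointing processors could tolerate "a certain set" of multiple failures rather than every possible set.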

Postscript of the paper

Citation Information

Plain Text:

author          Y. Kim and J. S. Plank and J. J. Dongarra
title           Fault Tolerant Matrix Operations for Networks of
                Workstations Using Multiple Checkpointing
booktitle       High Performance Computing on the Information
                Superhighway, HPC Asia '97
year            1997
pages           460-465
address         Seoul, Korea
month           April
where           http://web.eecs.utk.edu/~jplank/plank/papers/HPCA97.html

Bibtex:

@INPROCEEDINGS{kpd:97:mc,
        author = "Y. Kim and J. S. Plank and J. J. Dongarra",
        title = "Fault Tolerant Matrix Operations for Networks of
                Workstations Using Multiple Checkpointing",
        booktitle = "High Performance Computing on the Information
                Superhighway, HPC Asia '97",
        year = "1997",
        pages = "460-465",
        address = "Seoul, Korea",
        month = "April",
        where = "http://web.eecs.utk.edu/~jplank/plank/papers/HPCA97.html"
}