``A Failure Correction Technique for Parallel Storage Devices with Minimal Device Overhead''

James S. Plank, Joel Friedman and Kai Li

Technical Report CS-94-243, University of Tennessee, August, 1994.

See the SP&E paper that is the published extension of this work.

Available via anonymous ftp to cs.utk.edu in pub/plank/papers/CS-94-243.ps.Z.

Abstract

A common technique for providing reliability in parallel storage designs, network file systems, and diskless checkpointing systems is the N+1-Parity approach. This approach is simple in coding, but requires an excess number of additional ``checksum'' storage devices to recover more than one arbitrary device failure.

This paper presents a general method to recover from the failure of m arbitrary storage devices with the addition of exactly m checksum devices. The method is an application of Reed-Solomon codes, and can be viewed as a generalization of N+1-Parity.

This paper has two goals concerning this algorithm. First, it provides a complete specification of how to code this problem with this algorithm. To the authors' knowledge, this is the first such specification. Second, we have implemented the coding and recovery algorithm in software and shown that the method is efficient, general, and practical.

Postscript of the paper