In Search of I/O-Optimal Recovery from Disk Failures

Osama Khan, Johns Hopkins University,
Randal Burns, Johns Hopkins University,
James S. Plank, EECS Department, University of Tennessee,
Cheng Huang, Microsoft Research.

Appearing in HotStorage '11, 3rd Workshop on Hot Topics in Storage and File Systems, Portland, OR, June 2011.



We address the problem of minimizing the I/O needed to recover from disk failures in erasure-coded storage systems. The principal result is an algorithm that finds the optimal I/O recovery from an arbitrary number of disk failures for any XOR-based erasure code. We also describe a family of codes with high fault tolerance and low recovery I/O; e.g., one instance tolerates up to 11 failures and recovers a lost block in 4 I/Os. While we have determined I/O-optimal recovery for any given code, it remains an open problem to identify codes with the best recovery properties. We describe our ongoing efforts toward characterizing space overhead versus recovery I/O tradeoffs and generating codes that realize these bounds.
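To make the notion of recovery I/O concrete, the following is a minimal brute-force sketch and not the algorithm from the paper. It assumes a made-up toy XOR code (the symbol names `a0`, `p0`, `q0`, etc. and the four equations are hypothetical), treats each parity equation as a set of symbols that XOR to zero, and searches for the smallest set of surviving symbols that must be read to rebuild every lost symbol. As a simplification, it only considers equations that rebuild one lost symbol from surviving symbols alone.

```python
from itertools import combinations, product

# Hypothetical toy code (an assumption for illustration, not from the paper):
# each equation is a set of symbols whose XOR is zero.
EQUATIONS = [
    frozenset({"a0", "b0", "c0", "p0"}),  # row parity, row 0
    frozenset({"a1", "b1", "c1", "p1"}),  # row parity, row 1
    frozenset({"a0", "b1", "c1", "q0"}),  # second parity, mixes rows
    frozenset({"a1", "b0", "c0", "q1"}),  # second parity, mixes rows
]


def recovery_options(equations, lost):
    """Map each lost symbol to the read sets that can rebuild it on its own."""
    options = {symbol: set() for symbol in lost}
    for r in range(1, len(equations) + 1):
        for subset in combinations(equations, r):
            combined = frozenset()
            for eq in subset:
                # XOR-ing equations is the symmetric difference of their symbols.
                combined = combined ^ eq
            hit = combined & lost
            if len(hit) == 1:  # usable only if exactly one lost symbol remains
                (symbol,) = hit
                options[symbol].add(combined - lost)
    return options


def min_read_set(equations, lost):
    """Exhaustively pick one read set per lost symbol, minimizing the union."""
    lost = frozenset(lost)
    options = recovery_options(equations, lost)
    best = None
    for choice in product(*(options[s] for s in sorted(lost))):
        reads = frozenset().union(*choice)
        if best is None or len(reads) < len(best):
            best = reads
    return best  # None if some lost symbol has no usable equation


if __name__ == "__main__":
    lost_disk = {"a0", "a1"}
    print(len(min_read_set(EQUATIONS[:2], lost_disk)))  # row parity only
    print(len(min_read_set(EQUATIONS, lost_disk)))      # mixing both parities
```

In this toy instance, rebuilding the lost disk from row parity alone reads six symbols, while mixing one equation of each kind lets the two recoveries share reads and needs only four. The search is exponential in the number of equations, so it is suitable only for very small examples.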

Citation Information