Elmootazbellah N. Elnozahy and James S. Plank.
IEEE Transactions on Dependable and Secure Computing, 1(2), April-June, 2004, pp. 97-108.
I do not have a copy of this online. Try Mootaz (mootaz@us.ibm.com)
for reprints. If that fails, let me know and I can probably get you
a copy.
Abstract
Over the past two decades, rollback-recovery via checkpoint-restart
has been used with reasonable success for long-running applications,
such as scientific workloads that take from a few hours to a few months
to complete. Currently, several commercial systems and publicly
available libraries exist to support various flavors of
checkpointing. Programmers typically use these systems if they are
satisfactory or otherwise embed checkpointing support themselves
within the application. In this paper, we project the performance
and functionality of checkpointing algorithms and systems as we know
them today into the future. We start by surveying the current
technology roadmap and particularly how Peta-Flop capable systems may
be plausibly constructed in the next few years. We consider how
rollback-recovery as practiced today will fare when systems may have
to be constructed out of thousands of nodes. Our projections predict
that, unlike current practice, the effect of rollback-recovery may
play a more prominent role in how systems may be configured to reach
the desired performance level. System planners may have to devote
additional resources to enable rollback-recovery, and the current
practice of using "cheap commodity" systems to form large-scale
clusters may face serious obstacles. We suggest new avenues for
research to react to these trends.
Citation Information
- Plain Text:
author E. N. Elnozahy and J. S. Plank
title Checkpointing for Peta-Scale Systems: A Look into the
Future of Practical Rollback-Recovery
journal IEEE Transactions on Dependable and Secure Computing
volume 1
number 2
month April-June
year 2004
pages 97-108
- Bibtex:
@ARTICLE{ep:04:cps,
author = "E. N. Elnozahy and J. S. Plank",
title = "Checkpointing for Peta-Scale Systems: A Look into the
Future of Practical Rollback-Recovery",
journal = "IEEE Transactions on Dependable and Secure Computing",
volume = "1",
number = "2",
month = "April-June",
year = "2004",
pages = "97-108"
}