- September 8, 1997:
"Libckpt: Transparent Checkpointing
under Unix",
James S. Plank, Micah Beck, Gerry Kingsley and Kai Li,
Usenix Winter 1995 Technical Conference,
New Orleans, LA, January, 1995, pp. 213-223.
I know some of you have read this one before, but it's a good
paper to start with, so reread it.
- September 15, 1997:
"A Survey of Rollback-Recovery Protocols in Message Passing Systems",
E. N. Elnozahy, D. B. Johnson and Y. M. Wang,
Technical Report CMU-CS-96-181,
Department of Computer Science, Carnegie Mellon University, September 1996.
Yes, this is a long one, but it's an excellent survey of the field (and
the margins are big).
- September 22, 1997: More of the same (the survey from September 15, continued).
- September 29, 1997:
"Heterogeneous Process Migration by Recompilation",
M. M. Theimer and B. Hayes,
11th International Conference on Distributed Computing
Systems, 1991, pp. 18-25.
- October 6, 1997:
"Portable Checkpointing for Heterogeneous Architectures",
B. Ramkumar and V. Strumpen,
27th International Symposium on Fault-Tolerant Computing
Systems, June, 1997.
If you can't get the above, try
here.
- October 13, 1997:
"A Longitudinal Survey of Internet Host Reliability",
D. Long, A. Muir and R. Golding,
14th Symposium on Reliable Distributed Systems,
September, 1995, pp. 2-9.
Also: "Impact of Checkpoint Latency on Overhead Ratio of a
Checkpointing Scheme," N. H. Vaidya,
IEEE Transactions on Computers, 46(8), August, 1997,
pp. 942-947.
- October 20, 1997:
"Diskless Checkpointing",
James S. Plank, Kai Li and Michael A. Puening,
draft,
October, 1997.
- October 27, 1997:
"Checkpointing in CosMiC: a User-level Process Migration
Environment",
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K. P. Vo, and Y. M. Wang,
Pacific Rim International Symposium on Fault-Tolerant Systems,
December, 1997.
Also:
"Performance Analysis of Two Time-Based Coordinated
Checkpointing Protocols",
Gerard P. Kavanaugh and William H. Sanders,
Pacific Rim International Symposium on Fault-Tolerant Systems,
December, 1997.
- November 3, 1997:
"Supporting Nondeterministic Execution in Fault-Tolerant Systems",
J. Hamilton Slye and E. N. Elnozahy,
26th International Symposium
on Fault-Tolerant Computing,
June, 1996, pp. 250-259.
- November 10, 1997:
"Measurement based Statistical Performance Modeling and Prediction:
Case Study of the NPB Results", Dr. Erich Strohmaier,
Ayres 118 - 3:30.
- November 17, 1997:
Distinguished Speaker:
"Authoring Interactive Behaviors",
Dr. Brad Myers,
Hodges Library, Room 211 - 3:00.
- December 1, 1997:
"Process Migration",
D. S. Milojicic, F. Douglis, Y. Paindaveine,
R. Wheeler and S. Zhou,
submitted for publication, 1996. Available at
http://www.osf.org/~dejan/papers/m8.6.fr.ps.Z.
I've got a copy locally
here.
- December 8, 1997:
"A Checkpointing Strategy for Scalable Recovery
on Distributed Parallel Systems",
V. K. Naik, S. P. Midkiff and J. E. Moreira,
SC '97, San Jose, November, 1997.