-
1
-
-
2942516889
-
Quantifying rollback propagation in distributed checkpointing
-
Agbaria A., Attiya H., Friedman R., and Vitenberg R. Quantifying rollback propagation in distributed checkpointing. Journal of Parallel and Distributed Computing 64 3 (2004) 370-384
-
(2004)
Journal of Parallel and Distributed Computing
, vol.64
, Issue.3
, pp. 370-384
-
-
Agbaria, A.1
Attiya, H.2
Friedman, R.3
Vitenberg, R.4
-
2
-
-
0037968835
-
-
A. Agbaria, A. Freund, R. Friedman, Evaluating distributed checkpointing protocols, in: Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS'03, Providence, Rhode Island, May 2003, pp. 266-273
-
-
-
-
3
-
-
0033359224
-
-
A. Agbaria, R. Friedman, Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations, in: Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing, HPDC'99, August 1999, pp. 167-176
-
-
-
-
5
-
-
0034590510
-
-
A. Agbaria, J.S. Plank, Design, implementation, and performance of checkpointing in NetSolve, in: Proceedings of the 1st IEEE Conference on Dependable Systems and Networks, DSN'00, New York, USA, June 2000, pp. 49-54
-
-
-
-
6
-
-
0032597670
-
-
L. Alvisi, E. Elnozahy, S. Rao, S.A. Husain, A. De Mel, An analysis of communication induced checkpointing, in: Proceedings of the 29th Fault-Tolerant Computing Symposium, Madison, Wisconsin, June 1999, pp. 242-249
-
-
-
-
7
-
-
84866225421
-
-
R. Baldoni, J.M. Hélary, A. Mostefaoui, M. Raynal, A communication-induced checkpointing protocol that ensures rollback-dependency trackability, in: Proceedings of the 27th International Symposium on Fault-Tolerant Computing, June 1997, pp. 68-77
-
-
-
-
8
-
-
0032305992
-
-
R. Baldoni, F. Quaglia, B. Ciciani, A VP-accordant checkpointing protocol preventing useless checkpoints, in: Proceedings of the IEEE International Symposium on Reliable Distributed Systems, October 1998, pp. 61-67
-
-
-
-
9
-
-
40849106062
-
-
J. Brevik, D. Nurmi, R. Wolski, Quantifying machine availability in networked and desktop grid systems, Technical Report 2003-37, Department of Computer Science, University of California, Santa Barbara, November 2003
-
-
-
-
10
-
-
0021538527
-
-
D. Briatico, A. Ciuffoletti, L. Simoncini, A distributed domino-effect free recovery algorithm, in: Proceedings of the IEEE International Symposium on Reliability in Distributed Software and Database Systems, October 1984, pp. 207-215
-
-
-
-
11
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
Chandy K.M., and Lamport L. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3 1 (1985) 63-75
-
(1985)
ACM Transactions on Computer Systems
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
12
-
-
40849089107
-
-
G. Clark, T. Courtney, D. Daly, D. Deavours, S. Derisavi, J.M. Doyle, W.H. Sanders, P. Webster, The Möbius tool, in: Proceedings of the 9th International Workshop on Petri Nets and Performance Models, Aachen, Germany, September 2001, pp. 241-250
-
-
-
-
13
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
Elnozahy E.N., Alvisi L., Wang Y.M., and Johnson D.B. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34 3 (2002) 375-408
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4
-
14
-
-
40849135283
-
-
J.M. Hélary, A. Mostefaoui, M. Raynal, Communication-induced determination of consistent snapshots, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 208-217
-
-
-
-
15
-
-
0017996760
-
Time, clocks, and the ordering of events in a distributed system
-
Lamport L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21 7 (1978) 558-565
-
(1978)
Communications of the ACM
, vol.21
, Issue.7
, pp. 558-565
-
-
Lamport, L.1
-
16
-
-
0027610155
-
Measurement based evaluation of operating system fault tolerance
-
Lee I., Tang D., Iyer R.K., and Hsueh M.C. Measurement based evaluation of operating system fault tolerance. IEEE Transactions on Reliability 42 2 (1993) 238-249
-
(1993)
IEEE Transactions on Reliability
, vol.42
, Issue.2
, pp. 238-249
-
-
Lee, I.1
Tang, D.2
Iyer, R.K.3
Hsueh, M.C.4
-
17
-
-
0029723377
-
-
D. Manivannan, M. Singhal, A low-overhead recovery technique using quasi-synchronous checkpointing, in: Proceedings of the 16th International Conference on Distributed Computing Systems, May 1996, pp. 100-107
-
-
-
-
18
-
-
0033360051
-
Quasi-synchronous checkpointing: Models, characterization, and classification
-
Manivannan D., and Singhal M. Quasi-synchronous checkpointing: Models, characterization, and classification. IEEE Transactions on Parallel and Distributed Systems 10 7 (1999) 703-713
-
(1999)
IEEE Transactions on Parallel and Distributed Systems
, vol.10
, Issue.7
, pp. 703-713
-
-
Manivannan, D.1
Singhal, M.2
-
19
-
-
40849101329
-
-
S. Mishra, D. Wang, Choosing an appropriate checkpointing and rollback recovery algorithm for long-running parallel and distributed applications, in: Proceedings of the 11th ISCA International Conference on Computers and their Applications, San Francisco, CA, March 1996
-
-
-
-
21
-
-
84888898496
-
-
N. Neves, W.K. Fuchs, RENEW: A tool for fast and efficient implementation of checkpoint protocols, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 58-67
-
-
-
-
22
-
-
40849093932
-
-
D. Nurmi, J. Brevik, R. Wolski, Modeling machine availability in enterprise and wide-area distributed computing environments, Technical Report 2003-28, Department of Computer Science, University of California, Santa Barbara, 2003
-
-
-
-
24
-
-
85014175705
-
-
J. Plank, W. Elwasif, Experimental assessment of workstation failures and their impact on checkpointing systems, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 48-57
-
-
-
-
25
-
-
40849106063
-
-
J.S. Plank, Efficient checkpointing on MIMD architectures, Ph.D. Thesis, Department of Computer Science, Princeton University, January 1993
-
-
-
-
26
-
-
40849125266
-
-
J.S. Plank, An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance, Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997
-
-
-
-
27
-
-
40849101330
-
-
J.S. Plank, M. Beck, W. Elwasif, T. Moore, M. Swany, R. Wolski, The Internet Backplane Protocol: Storage in the network, in: NetStore'99: Network Storage Symposium, Internet2, Seattle, WA, October 1999, pp. 242-249
-
-
-
-
28
-
-
85084159983
-
-
J.S. Plank, M. Beck, G. Kingsley, K. Li, Libckpt: transparent checkpointing under UNIX, in: Usenix Winter 1995 Technical Conference, New Orleans, January 1995, pp. 220-232
-
-
-
-
29
-
-
85014175705
-
-
J.S. Plank, W. Elwasif, Experimental assessment of workstation failures and their impact on checkpointing systems, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, Munich, June 1998, pp. 48-57
-
-
-
-
30
-
-
0035201417
-
Processor allocation and checkpoint interval selection in cluster computing systems
-
Plank J.S., and Thomason M.G. Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing 61 11 (2001) 1570-1590
-
(2001)
Journal of Parallel and Distributed Computing
, vol.61
, Issue.11
, pp. 1570-1590
-
-
Plank, J.S.1
Thomason, M.G.2
-
31
-
-
40849084808
-
-
B. Ramamurthy, S.J. Upadhyaya, R.K. Iyer, An object-oriented testbed for the evaluation of checkpointing and recovery systems, in: Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing
-
-
-
-
32
-
-
0032597696
-
-
S. Rao, L. Alvisi, H.M. Vin, Egida: An extensible toolkit for low-overhead fault-tolerance, in: Proceedings of IEEE International Conference on Fault-Tolerant Computing, June 1999, pp. 48-55
-
-
-
-
34
-
-
40849146702
-
-
N. Vaidya, On checkpoint latency, in: Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, Newport Beach, December 1995
-
-
-
-
35
-
-
40849099715
-
-
N.H. Vaidya, Another two-level failure recovery scheme: Performance impact of checkpoint placement and checkpoint latency, Technical Report TR94-068, Department of Computer Science, Texas A&M University, 1994
-
-
-
-
36
-
-
40849107628
-
-
N.H. Vaidya, Consistent logical checkpointing, Technical Report TR94-051, Department of Computer Science, Texas A&M University, 1994
-
-
-
-
38
-
-
0031388399
-
Impact of checkpoint latency on overhead ratio of a checkpointing scheme
-
Vaidya N.H. Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Transactions on Computers 46 8 (1997) 942-947
-
(1997)
IEEE Transactions on Computers
, vol.46
, Issue.8
, pp. 942-947
-
-
Vaidya, N.H.1
-
39
-
-
0029192604
-
-
Y-M. Wang, Maximum and minimum consistent global checkpoints and their applications, in: Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, SRDS'95, September 1995, pp. 86-95
-
-
-
-
40
-
-
0031124071
-
Consistent global checkpoints that contain a given set of checkpoints
-
Wang Y.M. Consistent global checkpoints that contain a given set of checkpoints. IEEE Transactions on Computers 46 4 (1997) 456-486
-
(1997)
IEEE Transactions on Computers
, vol.46
, Issue.4
, pp. 456-486
-
-
Wang, Y.M.1
-
41
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
Young J.W. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17 9 (1974) 530-531
-
(1974)
Communications of the ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1
-
42
-
-
40849093933
-
-
A. Ziv, J. Bruck, Analysis of checkpointing schemes for multiprocessor systems, in: Proceedings of the 13th Symposium on Reliable Distributed Systems, 1994
-
-
-
-
43
-
-
33747067045
-
-
A. Ziv, J. Bruck, Efficient checkpointing over local area networks, in: Proceedings of the IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994, pp. 30-35
-
-
-