Volume 65, Issue 5, 2008, Pages 345-365

Model-based performance evaluation of distributed checkpointing protocols

Author keywords

Distributed checkpoint restart; Markov models; Performance analysis; Rollback propagation

Indexed keywords

DECISION MAKING; DISTRIBUTED COMPUTER SYSTEMS; MARKOV PROCESSES; MATHEMATICAL MODELS; PARAMETER ESTIMATION;

EID: 40849089513     PISSN: 01665316     EISSN: None     Source Type: Journal    
DOI: 10.1016/j.peva.2007.09.001     Document Type: Article
Times cited: 13

References (43)
  • 2
    • 0037968835
    • A. Agbaria, A. Freund, R. Friedman, Evaluating distributed checkpointing protocols, in: Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS'03, Providence, Rhode Island, May 2003, pp. 266-273
  • 3
    • 0033359224
    • A. Agbaria, R. Friedman, Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations, in: Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing, HPDC'99, August 1999, pp. 167-176
  • 4
    • 0036796940
    • A. Agbaria, R. Friedman, Virtual machine based heterogeneous checkpointing, Software: Practice and Experience 32 (1) (2002) 1-19
  • 5
    • 0034590510
    • A. Agbaria, J.S. Plank, Design, implementation, and performance of checkpointing in NetSolve, in: Proceedings of the 1st IEEE Conference on Dependable Systems and Networks, DSN'00, New York, USA, June 2000, pp. 49-54
  • 6
    • 0032597670
    • L. Alvisi, E. Elnozahy, S. Rao, S.A. Husain, A.D. Mel, An analysis of communication induced checkpointing, in: Proceedings of the 29th Fault-Tolerant Computing Symposium, Madison, Wisconsin, June 1999, pp. 242-249
  • 7
    • 84866225421
    • R. Baldoni, J.M. Hélary, A. Mostefaoui, M. Raynal, A communication-induced checkpointing protocol that ensures rollback-dependency trackability, in: Proceedings of the 27th International Symposium on Fault-Tolerant Computing, June 1997, pp. 68-77
  • 8
    • 0032305992
    • R. Baldoni, F. Quaglia, B. Ciciani, A VP-accordant checkpointing protocol preventing useless checkpoints, in: Proceedings of the IEEE International Symposium on Reliable Distributed Systems, October 1998, pp. 61-67
  • 9
    • 40849106062
    • J. Brevik, D. Nurmi, R. Wolski, Quantifying machine availability in networked and desktop grid systems, Technical Report 2003-37, Department of Computer Science, University of California, Santa Barbara, November 2003
  • 10
    • 0021538527
    • D. Briatico, A. Ciuffoletti, L. Simoncini, A distributed domino-effect free recovery algorithm, in: Proceedings of the IEEE International Symposium on Reliability in Distributed Software and Database Systems, October 1984, pp. 207-215
  • 11
    • 0022020346
    • K.M. Chandy, L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems 3 (1) (1985) 63-75
  • 12
    • 40849089107
    • G. Clark, T. Courtney, D. Daly, D. Deavours, S. Derisavi, J.M. Doyle, W.H. Sanders, P. Webster, The Möbius tool, in: Proceedings of the 9th International Workshop on Petri Nets and Performance Models, Aachen, Germany, September 2001, pp. 241-250
  • 13
    • 0042078549
    • E.N. Elnozahy, L. Alvisi, Y.M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys 34 (3) (2002) 375-408
  • 14
    • 40849135283
    • J.M. Hélary, A. Mostefaoui, M. Raynal, Communication-induced determination of consistent snapshots, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 208-217
  • 15
    • 0017996760
    • L. Lamport, Time, clocks and ordering of events in distributed systems, Communications of the ACM 21 (7) (1978) 558-565
  • 16
    • 0027610155
    • I. Lee, D. Tang, R.K. Iyer, M.C. Hsueh, Measurement based evaluation of operating system fault tolerance, IEEE Transactions on Reliability 42 (2) (1993) 238-249
  • 17
    • 0029723377
    • D. Manivannan, M. Singhal, A low-overhead recovery technique using quasi-synchronous checkpointing, in: Proceedings of the 16th International Conference on Distributed Computing Systems, May 1996, pp. 100-107
  • 18
  • 19
    • 40849101329
    • S. Mishra, D. Wang, Choosing an appropriate checkpointing and rollback recovery algorithm for long-running parallel and distributed applications, in: Proceedings of the 11th ISCA International Conference on Computers and their Applications, San Francisco, CA, March 1996
  • 21
    • 84888898496
    • N. Neves, W.K. Fuchs, RENEW: A tool for fast and efficient implementation of checkpoint protocols, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 58-67
  • 22
    • 40849093932
    • D. Nurmi, J. Brevik, R. Wolski, Modeling machine availability in enterprise and wide-area distributed computing environments, Technical Report 2003-28, Department of Computer Science, University of California, Santa Barbara, 2003
  • 24
    • 85014175705
    • J. Plank, W. Elwasif, Experimental assessment of workstation failures and their impact on checkpointing systems, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 48-57
  • 25
    • 40849106063
    • J.S. Plank, Efficient checkpointing on MIMD architectures, Ph.D. Thesis, Department of Computer Science, Princeton University, January 1993
  • 26
    • 40849125266
    • J.S. Plank, An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance, Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997
  • 27
    • 40849101330
    • J.S. Plank, M. Beck, W. Elwasif, T. Moore, M. Swany, R. Wolski, The internet backplane protocol: Storage in the network, in: NetStore'99: Network Storage Symposium, Internet2, Seattle, WA, October 1999, pp. 242-249
  • 28
    • 85084159983
    • J.S. Plank, M. Beck, G. Kingsley, K. Li, Libckpt: transparent checkpointing under UNIX, in: Usenix Winter 1995 Technical Conference, New Orleans, January 1995, pp. 220-232
  • 29
    • 85014175705
    • J.S. Plank, W. Elwasif, Experimental assessment of workstation failures and their impact on checkpointing systems, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, Munich, June 1998, pp. 48-57
  • 30
    • 0035201417
    • J.S. Plank, M.G. Thomason, Processor allocation and checkpoint interval selection in cluster computing systems, Journal of Parallel and Distributed Computing 61 (11) (2001) 1570-1590
  • 31
    • 40849084808
    • B. Ramamurthy, S.J. Upadhyaya, R.K. Iyer, An object-oriented testbed for the evaluation of checkpointing and recovery systems, in: Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing
  • 32
    • 0032597696
    • S. Rao, L. Alvisi, H.M. Vin, Egida: An extensible toolkit for low-overhead fault-tolerance, in: Proceedings of IEEE International Conference on Fault-Tolerant Computing, June 1999, pp. 48-55
  • 34
    • 40849146702
    • N. Vaidya, On checkpoint latency, in: Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, Newport Beach, December 1995
  • 35
    • 40849099715
    • N.H. Vaidya, Another two-level failure recovery scheme: Performance impact of checkpoint placement and checkpoint latency, Technical Report TR94-068, Department of Computer Science, Texas A&M University, 1994
  • 36
    • 40849107628
    • N.H. Vaidya, Consistent logical checkpointing, Technical Report TR94-051, Department of Computer Science, Texas A&M University, 1994
  • 38
    • 0031388399
    • N.H. Vaidya, Impact of checkpoint latency on overhead ratio of a checkpointing scheme, IEEE Transactions on Computers 46 (8) (1997) 942-947
  • 39
    • 0029192604
    • Y-M. Wang, Maximum and minimum consistent global checkpoints and their applications, in: Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, SRDS'95, September 1995, pp. 86-95
  • 40
    • 0031124071
    • Y.M. Wang, Consistent global checkpoints that contain a given set of checkpoints, IEEE Transactions on Computers 42 (4) (1997) 456-486
  • 41
    • 84976846528
    • J.S. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM 17 (9) (1974) 530-531
  • 42
    • 40849093933
    • A. Ziv, J. Bruck, Analysis of checkpointing schemes for multiprocessor systems, in: Proceedings of the 13th Symposium on Reliable Distributed Systems, 1994
  • 43
    • 33747067045
    • A. Ziv, J. Bruck, Efficient checkpointing over local area networks, in: Proceedings of the IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994, pp. 30-35


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.