Volume 65, Issue 3, 2013, Pages 1302-1326

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Author keywords

Checkpoint restart; Clusters; Fault tolerance; High Performance Computing (HPC); Performance; Reliability

Indexed keywords

CHECKPOINT/RESTART; CLUSTERS; FAULT TOLERANCE MECHANISMS; HIGH PERFORMANCE COMPUTING (HPC); HIGH PERFORMANCE COMPUTING SYSTEMS; LONG-RUNNING APPLICATIONS; PERFORMANCE; PERFORMANCE BENEFITS;

EID: 84881374819     PISSN: 09208542     EISSN: 15730484     Source Type: Journal    
DOI: 10.1007/s11227-013-0884-0     Document Type: Conference Paper
Times cited: 208

References (79)
  • 2
    • 12344287173
    • Commercial fault tolerance: A tale of two systems
    • 10.1109/TDSC.2004.4
    • Bartlett W, Spainhower L (2004) Commercial fault tolerance: a tale of two systems. IEEE Trans Dependable Secure Comput 1(1):87-96
    • (2004) IEEE Trans Dependable Secure Comput , vol.1 , Issue.1 , pp. 87-96
    • Bartlett, W.1    Spainhower, L.2
  • 4
    • 84881368496
    • [Online]
    • Blackham B (2005) [Online]. Available: http://cryopid.berlios.de/
    • (2005)
    • Blackham, B.1
  • 5
    • 0038194608
    • MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
    • Bosilca G, Bouteiller A, Cappello et al (2002) MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In: IEEE/ACM SIGARCH
    • (2002) IEEE/ACM SIGARCH
    • Bosilca, G.1    Bouteiller, A.2    Cappello3
  • 8
    • 68249127079
    • Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
    • 10.1177/1094342009106189
    • Cappello F (2009) Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int J High Perform Comput Appl 23:212-226
    • (2009) Int J High Perform Comput Appl , vol.23 , pp. 212-226
    • Cappello, F.1
  • 10
    • 84881375542
    • CFDR [Online]
    • CFDR (2012) [Online]. Available: http://cfdr.usenix.org/
    • (2012)
  • 11
    • 0022020346
    • Distributed snapshots: Determining global states of distributed systems
    • 10.1145/214451.214456
    • Chandy KM, Lamport L (1985) Distributed snapshots: determining global states of distributed systems. ACM Trans Comput Syst 3(1):63-75
    • (1985) ACM Trans Comput Syst , vol.3 , Issue.1 , pp. 63-75
    • Chandy, K.M.1    Lamport, L.2
  • 12
    • 84881374699
    • Checkpointing.org [Online]
    • Checkpointing.org (2012) Checkpointing [Online]. Available: http://checkpointing.org
    • (2012) Checkpointing
  • 18
    • 0026104130
    • Understanding fault-tolerant distributed systems
    • 10.1145/102792.102801
    • Cristian F (1991) Understanding fault-tolerant distributed systems. Commun ACM 34(2):56-88
    • (1991) Commun ACM , vol.34 , Issue.2 , pp. 56-88
    • Cristian, F.1
  • 23
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • 10.1145/568522.568525
    • Elnozahy ENM, Alvisi L, Wang YM, Johnson DB (2002) A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv 34(3):375-408
    • (2002) ACM Comput Surv , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.M.1    Alvisi, L.2    Wang, Y.M.3    Johnson, D.B.4
  • 25
    • 84881375783
    • Fault tolerance, wikipedia [Online]
    • Fault tolerance, wikipedia (2012) [Online]. Available: http://en.wikipedia.org/wiki/Fault-tolerant-system
    • (2012)
  • 26
    • 84881374465
    • Fusion-IO [Online]
    • Fusion-IO (2012) [Online]. Available: http://www.rpmgmbh.com/download/Whitepaper-Green.pdf
    • (2012)
  • 28
    • 84881368293
    • esky [Online]
    • Gibson D (2012) esky [Online]. Available: http://esky.sourceforge.net
    • (2012)
    • Gibson, D.1
  • 30
    • 0025505070
    • A census of tandem system availability between 1985 and 1990
    • 10.1109/24.58719
    • Gray J (1990) A census of tandem system availability between 1985 and 1990. IEEE Trans Reliab 39(4):409-418
    • (1990) IEEE Trans Reliab , vol.39 , Issue.4 , pp. 409-418
    • Gray, J.1
  • 31
    • 85084162186
    • World-wide web cache consistency
    • San Diego, CA Jan 1996
    • Gwertzman J, Seltzer M (1996) World-wide web cache consistency. In: Proc 1996 USENIX tech conf, San Diego, CA, Jan 1996, pp 141-152
    • (1996) Proc 1996 USENIX Tech Conf , pp. 141-152
    • Gwertzman, J.1    Seltzer, M.2
  • 33
    • 84881374755
    • InfiniBand [Online]
    • InfiniBand (2012) [Online]. Available: http://www.infinibandta.org/
    • (2012)
  • 37
    • 85013703470
    • Elsevier/Morgan Kaufmann San Diego, San Mateo 1126.68015
    • Koren I, Krishna C (2007) Fault-tolerant systems. Elsevier/Morgan Kaufmann, San Diego, San Mateo
    • (2007) Fault-tolerant Systems
    • Koren, I.1    Krishna, C.2
  • 38
    • 0017996760
    • Time, clocks, and the ordering of events in a distributed system
    • 0378.68027 10.1145/359545.359563
    • Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21:558-565
    • (1978) Commun ACM , vol.21 , pp. 558-565
    • Lamport, L.1
  • 39
    • 0025457846
    • Definition and analysis of hardware-and software-fault-tolerant architectures
    • 10.1109/2.56851
    • Laprie JC, Arlat J, Beounes C, Kanoun K (1990) Definition and analysis of hardware-and software-fault-tolerant architectures. Computer 23(7):39-51
    • (1990) Computer , vol.23 , Issue.7 , pp. 39-51
    • Laprie, J.C.1    Arlat, J.2    Beounes, C.3    Kanoun, K.4
  • 40
    • 84881371103
    • Large software state [Online]
    • Large software state (2012) [Online]. Available: http://www.safeware-eng.com/White-Papers/Software%20Safety.htm
    • (2012)
  • 41
    • 0028485392
    • Low-latency, concurrent checkpointing for parallel programs
    • 10.1109/71.298215
    • Li K, Naughton JF, Plank JS (1994) Low-latency, concurrent checkpointing for parallel programs. IEEE Trans Parallel Distrib Syst 5(8):874-879
    • (1994) IEEE Trans Parallel Distrib Syst , vol.5 , Issue.8 , pp. 874-879
    • Li, K.1    Naughton, J.F.2    Plank, J.S.3
  • 45
    • 4544296705
    • The use of triple-modular redundancy to improve computer reliability
    • 0117.12001 10.1147/rd.62.0200
    • Lyons RE, Vanderkulk W (1962) The use of triple-modular redundancy to improve computer reliability. IBM J Res Dev 6(2):200-209
    • (1962) IBM J Res Dev , vol.6 , Issue.2 , pp. 200-209
    • Lyons, R.E.1    Vanderkulk, W.2
  • 46
    • 68849090178
    • A survey and review of the current state of rollback-recovery for cluster systems
    • Maloney A, Goscinski A (2009) A survey and review of the current state of rollback-recovery for cluster systems. Concurr Comput., 1632-1666
    • (2009) Concurr Comput. , pp. 1632-1666
    • Maloney, A.1    Goscinski, A.2
  • 48
    • 0001439335
    • MPI: A message-passing interface standard
    • MPI Forum
    • MPI Forum (1994) MPI: a message-passing interface standard. Int J Supercomput Appl High Perform Comput
    • (1994) Int J Supercomput Appl High Perform Comput
  • 50
    • 0036755345
    • Architecture and dependability of large-scale Internet services
    • 10.1109/MIC.2002.1036037
    • Oppenheimer D, Patterson D (2002) Architecture and dependability of large-scale Internet services. IEEE Internet Comput 6(5):41-49
    • (2002) IEEE Internet Comput , vol.6 , Issue.5 , pp. 41-49
    • Oppenheimer, D.1    Patterson, D.2
  • 51
    • 84978437417
    • The design and implementation of Zap: A system for migrating computing environments
    • 10.1145/844128.844162
    • Osman S, Subhraveti D, Su G, Nieh J (2002) The design and implementation of Zap: a system for migrating computing environments. Oper Syst Rev 36(SI):361-376
    • (2002) Oper Syst Rev , vol.36 , pp. 361-376
    • Osman, S.1    Subhraveti, D.2    Su, G.3    Nieh, J.4
  • 53
    • 84881373144
    • PETSc [Online]
    • PETSc (2012) [Online]. Available: http://www.mcs.anl.gov/petsc/petsc-as/
    • (2012)
  • 54
    • 84881369313
    • Pinheiro E (2001) http://www.research.rutgers.edu/~edpin/epckpt/
    • (2001)
    • Pinheiro, E.1
  • 59
    • 0016522101
    • System structure for software fault tolerance
    • 10.1109/TSE.1975.6312842
    • Randell B (1975) System structure for software fault tolerance. IEEE Trans Softw Eng SE-1(2):220-232
    • (1975) IEEE Trans Softw Eng , vol.1 , Issue.2 , pp. 220-232
    • Randell, B.1
  • 63
    • 27844542760
    • The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
    • 10.1177/1094342005056139
    • Sankaran S, Squyres JM, Barrett B et al (2005) The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int J High Perform Comput Appl 19(4):479-493
    • (2005) Int J High Perform Comput Appl , vol.19 , Issue.4 , pp. 479-493
    • Sankaran, S.1    Squyres, J.M.2    Barrett, B.3
  • 64
    • 36148941068
    • Understanding failures in petascale computers
    • 012022 10.1088/1742-6596/78/1/012022
    • Schroeder B, Gibson G (2007) Understanding failures in petascale computers. J Phys Conf Ser 78(1):012022
    • (2007) J Phys Conf ser , vol.78 , Issue.1
    • Schroeder, B.1    Gibson, G.2
  • 65
    • 78149470110
    • A large-scale study of failures in high performance computing systems
    • 10.1109/TDSC.2009.4
    • Schroeder B, Gibson GA (2010) A large-scale study of failures in high performance computing systems. IEEE Trans Dependable Secure Comput 7(4):337-350
    • (2010) IEEE Trans Dependable Secure Comput , vol.7 , Issue.4 , pp. 337-350
    • Schroeder, B.1    Gibson, G.A.2
  • 66
    • 84934312471
    • Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs
    • Pittsburgh, PA
    • Schulz M, Bronevetsky G, Fernandes R, Marques D, Pingali K, Stodghill P (2004) Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Supercomputing, Pittsburgh, PA
    • (2004) Supercomputing
    • Schulz, M.1    Bronevetsky, G.2    Fernandes, R.3    Marques, D.4    Pingali, K.5    Stodghill, P.6
  • 67
    • 79952579787
    • Exascale computing technology challenges
    • LNCS 6449 Springer Berlin, Heidelberg
    • Shalf J, Dosanjh S, Morrison J (2011) Exascale computing technology challenges. In: VECPAR 2010, LNCS, vol 6449. Springer, Berlin, Heidelberg, pp 1-25
    • (2011) VECPAR 2010 , pp. 1-25
    • Shalf, J.1    Dosanjh, S.2    Morrison, J.3
  • 69
    • 0003050634
    • Cocheck: Checkpointing and process migration for MPI
    • Stellner G (1996) Cocheck: checkpointing and process migration for MPI. In: Proc IPPS
    • (1996) Proc IPPS
    • Stellner, G.1
  • 71
    • 1442319232
    • PM2: High performance communication middleware for heterogeneous network environments, in supercomputing
    • IEEE Press New York
    • Takahashi T, Sumimoto S, Hori A, Harada H, Ishikawa Y (2000) PM2: high performance communication middleware for heterogeneous network environments, in supercomputing. In: ACM/IEEE 2000 conference. IEEE Press, New York, p 16
    • (2000) ACM/IEEE 2000 Conference , pp. 16
    • Takahashi, T.1    Sumimoto, S.2    Hori, A.3    Harada, H.4    Ishikawa, Y.5
  • 72
    • 84881374986
    • Team Condor University of Wisconsin-Madison
    • Team Condor (2010) Condor version 7.5.3 manual. University of Wisconsin-Madison
    • (2010) Condor Version 7.5.3 Manual
  • 74
    • 84881374553
    • Top500 [Online]
    • Top500 (2012) [Online]. Available: http://www.top500.org
    • (2012)
  • 75
    • 85101215109
    • Application-level checkpointing techniques for parallel programs
    • Walters J, Chaudhary V (2006) Application-level checkpointing techniques for parallel programs. In: Proc of the 3rd ICDCIT conf, pp 221-234
    • (2006) Proc of the 3rd ICDCIT Conf , pp. 221-234
    • Walters, J.1    Chaudhary, V.2
  • 76
    • 0029305383
    • Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems
    • 10.1109/71.382324
    • Wang Y-M, Chung P-Y, Lin I-J, Fuchs WK (1995) Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Trans Parallel Distrib Syst 6(5):546-554
    • (1995) IEEE Trans Parallel Distrib Syst , vol.6 , Issue.5 , pp. 546-554
    • Wang, Y.-M.1    Chung, P.-Y.2    Lin, I.-J.3    Fuchs, W.K.4
  • 78
    • 84881377739
    • ckpt [Online]
    • Zandy V (2002) ckpt [Online]. Available: http://pages.cs.wisc.edu/~zandy/ckpt/
    • (2002)
    • Zandy, V.1


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.