Volume , Issue , 2008, Pages

Scalable group-based checkpoint/restart for large-scale message-passing systems

Author keywords

[No Author keywords available]

Indexed keywords

GROUP-BASED; LARGE-SCALE PARALLEL SYSTEMS; PARALLEL AND DISTRIBUTED PROCESSING;

EID: 51049086184     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/IPDPS.2008.4536302     Document Type: Conference Paper
Times cited : (26)

References (17)
  • 1
    • 51049104220
    • Lorenzo Alvisi, Sriram Rao, Syed Amir Husain, Asanka De Mel and E.N. (Mootaz) Elnozahy. An Analysis of Communication-Induced Checkpointing, in FTCS '99: Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, IEEE Computer Society, pp. 242-249, 1999
  • 3
    • 0022020346
    • Distributed Snapshots: Determining Global States of Distributed Systems
    • K. Mani Chandy and Leslie Lamport. Distributed Snapshots: Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, vol.3 no.1:63-75, 1985
    • (1985) ACM Transactions on Computer Systems , vol.3 , Issue.1 , pp. 63-75
    • Mani Chandy, K.1    Lamport, L.2
  • 4
    • 34548282622
    • Camille Coti, Thomas Hérault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez and Franck Cappello. Blocking vs Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI, in proceedings of The IEEE/ACM SC2006 Conference, 2006
  • 6
    • 0042078549
    • E.N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34 no.3:375-408, 2002
  • 7
    • 33745305678
    • Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups
    • N.P. Gopalan and K. Nagarajan. Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups, in Distributed Computing - IWDC 2005, pp. 153-158, 2005
    • (2005) Distributed Computing - IWDC 2005 , pp. 153-158
    • Gopalan, N.P.1    Nagarajan, K.2
  • 9
    • 0031338781
    • Wei-Jih Li and Jyh-Jong Tsay. Checkpointing Message-Passing Interface (MPI) Parallel Programs, in Proceedings of Pacific Rim International Symposium on Fault-Tolerant Systems, 1997, IEEE Computer Society, pp. 147-152, 1997
  • 11
    • 0029255243
    • Necessary and Sufficient Conditions for Consistent Global Snapshots
    • Robert H.B. Netzer and Jian Xu. Necessary and Sufficient Conditions for Consistent Global Snapshots, IEEE Transactions on Parallel and Distributed Systems, vol.6 no.2:165-169, 1995
    • (1995) IEEE Transactions on Parallel and Distributed Systems , vol.6 , Issue.2 , pp. 165-169
    • Netzer, R.H.B.1    Xu, J.2
  • 15
  • 17
    • 51049120414
    • NAS Parallel Benchmarks
    • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Resources/Software/npb.html


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.