2007, Pages 47-54

Group-based coordinated checkpointing for MPI: A case study on InfiniBand

Author keywords

[No Author keywords available]

Indexed keywords

CLUSTER COMPUTING; DIGITAL STORAGE; FAULT TOLERANCE; SCALABILITY

EID: 47249116207; PISSN: 01903918; EISSN: None; Source Type: Conference Proceeding
DOI: 10.1109/ICPP.2007.44; Document Type: Conference Paper
Times cited: 16

References (24)
  • 1
    • 47249161719
    • MPICH-V Project
    • MPICH-V Project. http://mpich-v.lri.fr.
  • 2
    • 47249109126
    • MPICH2
    • MPICH2. http://www-unix.mcs.anl.gov/mpi/mpich2/.
  • 3
  • 4
    • 47249159117
    • Parallel Virtual File System, Version 2
    • Parallel Virtual File System, Version 2. http://www.pvfs.org.
  • 8
    • 20444435911
    • Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
    • San Diego, CA, September
    • A. Bouteiller, P. Lemarinier, T. Hérault, G. Krawezik, and F. Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In Proceedings of Cluster 2004, San Diego, CA, September 2004.
    • (2004) Proceedings of Cluster 2004
    • Bouteiller, A.; Lemarinier, P.; Hérault, T.; Krawezik, G.; Cappello, F.
  • 9
    • 0022020346
    • Distributed Snapshots: Determining Global States of Distributed Systems
    • M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Trans. Comput. Syst., 3(1), 1985.
    • (1985) ACM Trans. Comput. Syst., vol. 3, issue 1
    • Chandy, M.; Lamport, L.
  • 11
    • 12344277946
    • The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart
    • Technical Report LBNL-54941, Berkeley Lab, 2002
    • J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Technical Report LBNL-54941, Berkeley Lab, 2002.
    • Duell, J.; Hargrove, P.; Roman, E.
  • 12
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3), 2002.
    • (2002) ACM Comput. Surv., vol. 34, issue 3
    • Elnozahy, E.N.; Alvisi, L.; Wang, Y.-M.; Johnson, D.B.
  • 15
    • 33845434226
    • Transparent incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
    • Seattle, WA, November
    • R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini. Transparent incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In ACM/IEEE SuperComputing 2005, Seattle, WA, November 2005.
    • (2005) ACM/IEEE SuperComputing 2005
    • Gioiosa, R.; Sancho, J.C.; Jiang, S.; Petrini, F.
  • 17
    • 47249109964
    • InfiniBand Trade Association
    • InfiniBand Trade Association. http://www.infinibandta.org.
  • 23
    • 34548768671
    • A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
    • C. Wang, F. Mueller, C. Engelmann, and S. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Int'l Parallel and Distributed Processing Symposium (IPDPS '07), 2007.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.