Volume E87-D, Issue 7, 2004, Pages 1820-1828

MPICH-GF: Transparent checkpointing and rollback-recovery for grid-enabled MPI processes

Author keywords

Checkpoint; Consistent recovery; Fault tolerance; Grid computing; MPI

Indexed keywords

ABSTRACTING; ALGORITHMS; COMMUNICATION SYSTEMS; COMPUTATIONAL METHODS; MIDDLEWARE; PROCESS CONTROL;

EID: 3142722523     PISSN: 09168532     EISSN: None     Source Type: Journal    
DOI: None     Document Type: Conference Paper
Times cited: 26

References (37)
  • 4
    • 0031570635
    • Application level fault tolerance in heterogeneous networks of workstations
    • A. Beguelin, E. Seligman, and P. Stephan, "Application level fault tolerance in heterogeneous networks of workstations," J. Parallel Distrib. Comput., vol.43, no.2, pp.147-155, 1997.
    • (1997) J. Parallel Distrib. Comput. , vol.43 , Issue.2 , pp. 147-155
    • Beguelin, A.1    Seligman, E.2    Stephan, P.3
  • 6
    • 0001873476
    • LAM: An open cluster environment for MPI
    • Toronto, Canada
    • G. Burns, R. Daoud, and J. Vaigl, "LAM: An open cluster environment for MPI," Proc. Supercomputing Symp., pp.379-386, Toronto, Canada, 1994.
    • (1994) Proc. Supercomputing Symp. , pp. 379-386
    • Burns, G.1    Daoud, R.2    Vaigl, J.3
  • 7
    • 0028408242
    • Monitors, messages, and clusters: The p4 parallel programming system
    • R. Butler and E.L. Lusk, "Monitors, messages, and clusters: The p4 parallel programming system," Parallel Comput., vol.20, no.4, pp.547-564, 1994.
    • (1994) Parallel Comput. , vol.20 , Issue.4 , pp. 547-564
    • Butler, R.1    Lusk, E.L.2
  • 8
    • 0022020346
    • Distributed snapshots: Determining global states of distributed systems
    • Aug.
    • K.M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Trans. Comput. Syst., vol.3, no.1, pp.63-75, Aug. 1985.
    • (1985) ACM Trans. Comput. Syst. , vol.3 , Issue.1 , pp. 63-75
    • Chandy, K.M.1    Lamport, L.2
  • 11
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol.34, no.3, pp.375-408, 2002.
    • (2002) ACM Comput. Surv. , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.1    Alvisi, L.2    Wang, Y.-M.3    Johnson, D.B.4
  • 12
    • 84940567900
    • FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
    • G.E. Fagg and J. Dongarra, "FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world," PVM/MPI 2000, pp.346-353, 2000.
    • (2000) PVM/MPI 2000 , pp. 346-353
    • Fagg, G.E.1    Dongarra, J.2
  • 16
    • 0035455653
    • The anatomy of the grid: Enabling scalable virtual organizations
    • I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: Enabling scalable virtual organizations," J. Supercomput. Appl., vol.15, no.3, 2001.
    • (2001) J. Supercomput. Appl. , vol.15 , Issue.3
    • Foster, I.1    Kesselman, C.2    Tuecke, S.3
  • 18
    • 0030243005
    • A high-performance, portable implementation of the MPI message passing interface standard
    • W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Comput., vol.22, no.6, pp.789-828, 1996.
    • (1996) Parallel Comput. , vol.22 , Issue.6 , pp. 789-828
    • Gropp, W.1    Lusk, E.2    Doss, N.3    Skjellum, A.4
  • 19
    • 0742293840
    • MPICH-G2: A grid-enabled implementation of the message passing interface
    • May
    • N.T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A grid-enabled implementation of the message passing interface," J. Parallel Distrib. Comput., vol.63, no.5, pp.551-563, May 2003.
    • (2003) J. Parallel Distrib. Comput. , vol.63 , Issue.5 , pp. 551-563
    • Karonis, N.T.1    Toonen, B.2    Foster, I.3
  • 20
    • 0023090161
    • Checkpointing and rollback recovery for distributed systems
    • R. Koo and S. Toueg, "Checkpointing and rollback recovery for distributed systems," IEEE Trans. Softw. Eng., vol.SE-13, no.1, pp.23-31, 1987.
    • (1987) IEEE Trans. Softw. Eng. , vol.SE-13 , Issue.1 , pp. 23-31
    • Koo, R.1    Toueg, S.2
  • 22
    • 0002639531
    • Supporting checkpointing and process migration outside the UNIX kernel
    • San Francisco, CA, Jan.
    • M.J. Litzkow and M. Solomon, "Supporting checkpointing and process migration outside the UNIX kernel," USENIX Conference Proc., pp.283-290, San Francisco, CA, Jan. 1992.
    • (1992) USENIX Conference Proc. , pp. 283-290
    • Litzkow, M.J.1    Solomon, M.2
  • 25
    • 3142699243
    • NAS parallel benchmarks
    • NASA Ames Research Center, "NAS parallel benchmarks," Technical Report, http://science.nas.nasa.gov/Software/NPB/, 1997.
    • (1997) Technical Report
  • 26
    • 0029255243
    • Necessary and sufficient conditions for consistent global snapshots
    • R. Netzer and J. Xu, "Necessary and sufficient conditions for consistent global snapshots," IEEE Trans. Parallel Distrib. Syst., vol.6, no.2, pp.165-169, 1995.
    • (1995) IEEE Trans. Parallel Distrib. Syst. , vol.6 , Issue.2 , pp. 165-169
    • Netzer, R.1    Xu, J.2
  • 27
    • 84888898496
    • RENEW: A tool for fast and efficient implementation of checkpoint protocols
    • N. Neves and W.K. Fuchs, "RENEW: A tool for fast and efficient implementation of checkpoint protocols," Symp. on Fault-Tolerant Computing, pp.58-67, 1998.
    • (1998) Symp. on Fault-Tolerant Computing , pp. 58-67
    • Neves, N.1    Fuchs, W.K.2
  • 28
    • 23044532594
    • Application recovery in parallel programming environment
    • G.T. Nguyen, V.D. Tran, and M. Kotocová, "Application recovery in parallel programming environment," European PVM/MPI, pp.234-242, 2002.
    • (2002) European PVM/MPI , pp. 234-242
    • Nguyen, G.T.1    Tran, V.D.2    Kotocová, M.3
  • 32
    • 0032597696
    • Egida: An extensible toolkit for low-overhead fault-tolerance
    • S. Rao, L. Alvisi, and H.M. Vin, "Egida: An extensible toolkit for low-overhead fault-tolerance," Symp. on Fault-Tolerant Computing, pp.48-55, 1999.
    • (1999) Symp. on Fault-Tolerant Computing , pp. 48-55
    • Rao, S.1    Alvisi, L.2    Vin, H.M.3
  • 35
    • 0029713612
    • CoCheck: Checkpointing and process migration for MPI
    • April
    • G. Stellner, "CoCheck: Checkpointing and process migration for MPI," Proc. Int. Parallel Processing Symp., pp.526-531, April 1996.
    • (1996) Proc. Int. Parallel Processing Symp. , pp. 526-531
    • Stellner, G.1
  • 36
    • 0032179679
    • Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability
    • J. Tsai, S.-Y. Kuo, and Y.-M. Wang, "Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability," IEEE Trans. Parallel Distrib. Syst., vol.9, no.10, pp.963-971, 1998.
    • (1998) IEEE Trans. Parallel Distrib. Syst. , vol.9 , Issue.10 , pp. 963-971
    • Tsai, J.1    Kuo, S.-Y.2    Wang, Y.-M.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.