



Volume , Issue , 2004, Pages 573-586

Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs

Author keywords

[No Author keywords available]

Indexed keywords

CHECKPOINT-AND-RESTART (CPR); GLOBAL BARRIERS; HARDWARE FAILURE;

EID: 23944521034     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: None     Document Type: Conference Paper
Times cited : (17)

References (25)
  • 2
    • 0038335808
    • Compiler-assisted checkpointing
    • Dept. of Computer Science, University of Tennessee
    • M. Beck, J. S. Plank, and G. Kingsley. Compiler-assisted checkpointing. Technical Report UT-CS-94-269, Dept. of Computer Science, University of Tennessee, 1994.
    • (1994) Technical Report , vol.UT-CS-94-269
    • Beck, M.1    Plank, J.S.2    Kingsley, G.3
  • 7
    • 84934278304
    • September 19, 2001
    • B. Carnes. The smg2000 benchmark code. Available at http://www.llnl.gov/asci/purple/benchmarks/limited/smg/, September 19, 2001.
    • The Smg2000 Benchmark Code
    • Carnes, B.1
  • 8
    • 0022020346
    • Distributed snapshots: Determining global states of distributed systems
    • M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1):63-75, 1985.
    • (1985) ACM Transactions on Computer Systems , vol.3 , Issue.1 , pp. 63-75
    • Chandy, M.1    Lamport, L.2
  • 9
    • 84860989858
    • Condor, http://www.cs.wisc.edu/condor/manual.
  • 10
    • 0026867749
    • Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output
    • May
    • E. N. Elnozahy and W. Zwaenepoel. Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers, 41(5), May 1992.
    • (1992) IEEE Transactions on Computers , vol.41 , Issue.5
    • Elnozahy, E.N.1    Zwaenepoel, W.2
  • 11
    • 0004096191
    • A survey of rollback-recovery protocols in message passing systems
    • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Oct.
    • M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Oct. 1996.
    • (1996) Technical Report , vol.CMU-CS-96-181
    • Elnozahy, M.1    Alvisi, L.2    Wang, Y.M.3    Johnson, D.B.4
  • 12
    • 84940567900
    • FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
    • Springer-Verlag
    • G. Fagg and J. J. Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In EuroPVM/MPI User's Group Meeting, pages 346-353. Springer-Verlag, 2000.
    • (2000) EuroPVM/MPI User's Group Meeting , pp. 346-353
    • Fagg, G.1    Dongarra, J.J.2
  • 13
    • 0010976041
    • Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification
    • Department of Computer Science, University of Virginia, 25
    • A. J. Ferrari, S. J. Chapin, and A. S. Grimshaw. Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification. Technical Report CS-97-05, Department of Computer Science, University of Virginia, 25, 1997.
    • (1997) Technical Report , vol.CS-97-05
    • Ferrari, A.J.1    Chapin, S.J.2    Grimshaw, A.S.3
  • 16
    • 0004215089
    • Morgan Kaufmann, San Francisco, California, first edition
    • N. Lynch. Distributed Algorithms. Morgan Kaufmann, San Francisco, California, first edition, 1996.
    • (1996) Distributed Algorithms
    • Lynch, N.1
  • 17
    • 0038335808
    • Compiler-assisted checkpointing
    • Technical Report, University of Tennessee, Dec.
    • J. P. M. Beck and G. Kingsley. Compiler-Assisted Checkpointing. Technical Report CS-94-269, University of Tennessee, Dec. 1994.
    • (1994) Technical Report , vol.CS-94-269
    • Beck, J.P.M.1    Kingsley, G.2
  • 18
    • 0003912256
    • Checkpoint and migration of UNIX processes in the condor distributed processing system
    • University of Wisconsin-Madison
    • J. B. M. Litzkow, T. Tannenbaum and M. Livny. Checkpoint and migration of UNIX processes in the condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison, 1997.
    • (1997) Technical Report , vol.1346
    • Litzkow, J.B.M.1    Tannenbaum, T.2    Livny, M.3
  • 19
    • 0347102865
    • Source-code transformations for efficient reversibility
    • College of Computing, Georgia Tech, September
    • K. Perumalla and R. Fujimoto. Source-code transformations for efficient reversibility. Technical Report GIT-CC-99-21, College of Computing, Georgia Tech, September 1999.
    • (1999) Technical Report , vol.GIT-CC-99-21
    • Perumalla, K.1    Fujimoto, R.2
  • 22
  • 24
    • 33645423303
    • A checkpoint and recovery system for the Pittsburgh Supercomputing Center Terascale Computing System
    • N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizino. A checkpoint and recovery system for the Pittsburgh Supercomputing Center Terascale Computing System. In Supercomputing, 2001. Available at http://www.psc.edu/publications/tech_reports/chkpt_rcvry/checkpoint-recovery-1.0.html.
    • (2001) Supercomputing
    • Stone, N.1    Kochmar, J.2    Reddy, R.3    Scott, J.R.4    Sommerfield, J.5    Vizino, C.6
  • 25
    • 0141682129
    • Srs - A framework for developing malleable and migratable parallel software
    • June
    • S. Vadhiyar and J. Dongarra. Srs - a framework for developing malleable and migratable parallel software. Parallel Processing Letters, 13(2):291-312, June 2003.
    • (2003) Parallel Processing Letters , vol.13 , Issue.2 , pp. 291-312
    • Vadhiyar, S.1    Dongarra, J.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.