



Volume 58, Issue 11, 2009, Pages 1512-1524

Highly scalable self-healing algorithms for high performance scientific computing

Author keywords

Diskless checkpointing; Fault tolerance; High performance computing; Message passing interface; Parallel and distributed systems; Pipeline; Self healing

Indexed keywords

COST REDUCTION; DIGITAL ARITHMETIC; FAULT TOLERANCE; FAULT TOLERANT COMPUTER SYSTEMS; MESSAGE PASSING; PIPELINES; SYSTEMS ENGINEERING

EID: 75449102762     PISSN: 00189340     EISSN: None     Source Type: Journal    
DOI: 10.1109/TC.2009.42     Document Type: Article
Times cited: 35

References (27)
  • 1
    • 84870548923
    • An overview of the BlueGene/L supercomputer
    • N.R. Adiga et al. "An Overview of the BlueGene/L Supercomputer," Proc. Supercomputing Conf. (SC '02), pp. 1-22, 2002.
    • (2002) Proc. Supercomputing Conf. (SC '02) , pp. 1-22
    • Adiga, N.R.1
  • 5
    • 33746136466
    • Condition numbers of Gaussian random matrices
    • Z. Chen and J. Dongarra, "Condition Numbers of Gaussian Random Matrices," SIAM J. Matrix Analysis and Applications, vol.27, no.3, pp. 603-620, 2005.
    • (2005) SIAM J. Matrix Analysis and Applications , vol.27 , Issue.3 , pp. 603-620
    • Chen, Z.1    Dongarra, J.2
  • 6
    • 0242658775
    • Self-adapting software for numerical linear algebra and LAPACK for clusters
    • Nov./Dec.
    • Z. Chen, J. Dongarra, P. Luszczek, and K. Roche, "Self-Adapting Software for Numerical Linear Algebra and LAPACK for Clusters," Parallel Computing, vol.29, nos. 11/12, pp. 1723-1743, Nov./Dec. 2003.
    • (2003) Parallel Computing , vol.29 , Issue.11-12 , pp. 1723-1743
    • Chen, Z.1    Dongarra, J.2    Luszczek, P.3    Roche, K.4
  • 10
    • 0000324960
    • Eigenvalues and condition numbers of random matrices
    • A. Edelman, "Eigenvalues and Condition Numbers of Random Matrices," SIAM J. Matrix Analysis and Applications, vol.9, no.4, pp. 543-560, 1988.
    • (1988) SIAM J. Matrix Analysis and Applications , vol.9 , Issue.4 , pp. 543-560
    • Edelman, A.1
  • 15
    • 0018454850
    • On the optimum checkpoint interval
    • E. Gelenbe, "On the Optimum Checkpoint Interval," J. ACM, vol.26, no.2, pp. 259-270, 1979.
    • (1979) J. ACM , vol.26 , Issue.2 , pp. 259-270
    • Gelenbe, E.1
  • 16
    • 0030243005
    • A high-performance, portable implementation of the MPI message passing interface standard
    • Sept.
    • W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, vol.22, no.6, pp. 789-828, Sept. 1996.
    • (1996) Parallel Computing , vol.22 , Issue.6 , pp. 789-828
    • Gropp, W.1    Lusk, E.2    Doss, N.3    Skjellum, A.4
  • 19
    • 0003413672
    • MPI: A message passing interface standard
    • Message Passing Interface Forum Univ. of Tennessee
    • Message Passing Interface Forum, "MPI: A Message Passing Interface Standard," Technical Report ut-cs-94-230, Univ. of Tennessee, 1994.
    • (1994) Technical Report ut-cs-94-230
  • 20
    • 0031223146
    • A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems
    • Sept.
    • J.S. Plank, "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-Like Systems," Software-Practice & Experience, vol.27, no.9, pp. 995-1012, Sept. 1997.
    • (1997) Software-Practice & Experience , vol.27 , Issue.9 , pp. 995-1012
    • Plank, J.S.1
  • 21
    • 0031570636
    • Fault-tolerant matrix operations for networks of workstations using diskless checkpointing
    • J.S. Plank, Y. Kim, and J. Dongarra, "Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing," J. Parallel and Distributed Computing, vol.43, no.2, pp. 125-138, 1997.
    • (1997) J. Parallel and Distributed Computing , vol.43 , Issue.2 , pp. 125-138
    • Plank, J.S.1    Kim, Y.2    Dongarra, J.3
  • 24
    • 0035201417
    • Processor allocation and checkpoint interval selection in cluster computing systems
    • Nov.
    • J.S. Plank and M.G. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol.61, no.11, pp. 1570-1590, Nov. 2001.
    • (2001) J. Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
    • Plank, J.S.1    Thomason, M.G.2
  • 25
    • 84864756973
    • An experimental study about diskless checkpointing
    • L.M. Silva and J.G. Silva, "An Experimental Study about Diskless Checkpointing," Proc. EUROMICRO '98 Conf., pp. 395-402, 1998.
    • (1998) Proc. EUROMICRO '98 Conf. , pp. 395-402
    • Silva, L.M.1    Silva, J.G.2
  • 26
    • 0345442370
    • A case for two-level recovery schemes
    • June
    • N.H. Vaidya, "A Case for Two-Level Recovery Schemes," IEEE Trans. Computers, vol.47, no.6, pp. 656-666, June 1998.
    • (1998) IEEE Trans. Computers , vol.47 , Issue.6 , pp. 656-666
    • Vaidya, N.H.1
  • 27
    • 84976846528
    • A first order approximation to the optimal checkpoint interval
    • J.W. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, vol.17, no.9, pp. 530-531, 1974.
    • (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.W.1


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.