



Volume 23, Issue 4, 2009, Pages 374-388

Toward exascale resilience

Author keywords

Challenge; Exascale; Fault tolerance; High performance computing; Resilience

Indexed keywords

APPLICATION LEVEL; CENTRAL PROCESSING UNITS; CHECK POINTING; CHECKPOINT/RESTART; DISTRIBUTED SYSTEMS; EXASCALE; HIGH PERFORMANCE COMPUTING SYSTEMS; HIGH-PERFORMANCE COMPUTING; LARGE SYSTEM; MEAN TIME TO FAILURE; NEW APPROACHES; PETASCALE; PROGRAMMING MODELS; RESEARCH ISSUES; SYSTEM MANAGEMENT; WHITE PAPERS;

EID: 70450206305     PISSN: 10943420     EISSN: 17412846     Source Type: Journal    
DOI: 10.1177/1094342009347767     Document Type: Article
Times cited : (268)

References (44)
  • 3
    • 0029214558
    • Designing programs that check their work
    • Blum, M. and Kannan, S. (1995). Designing programs that check their work. J. ACM 42(1): 269-291.
    • (1995) J. ACM , vol.42 , Issue.1 , pp. 269-291
    • Blum, M.1    Kannan, S.2
  • 4
    • 70450200139
    • BLCR. (Accessed: September 2, 2009)
    • BLCR. http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml (Accessed: September 2, 2009)
    • (2009)
  • 6
    • 51049083541
    • Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
    • In April
    • Chen, Z. (2008). Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments. In Proceedings of the IEEE Parallel and Distributed Processing Symposium, April, pp. 1-8.
    • (2008) Proceedings of the IEEE Parallel and Distributed Processing Symposium , pp. 1-8
    • Chen, Z.1
  • 8
    • 70450210363
    • CIFT. (Accessed: September 2, 2009)
    • CIFT. http://www.mcs.anl.gov/research/cifts/index.php (Accessed: September 2, 2009)
    • (2009)
  • 9
    • 0022020346
    • Distributed snapshots: Determining global states of distributed systems
    • Chandy, K.M. and Lamport, L. (1985). Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1): 63-75.
    • (1985) ACM Trans. Comput. Syst. , vol.3 , Issue.1 , pp. 63-75
    • Chandy, K.M.1    Lamport, L.2
  • 12
    • 70450197271
    • CSCL. (Accessed: September 2, 2009)
    • CSCL. http://www.cs.wisc.edu/condor/manual/v6.8/4_2Condor_s_Checkpoint.html (Accessed: September 2, 2009)
    • (2009)
  • 13
    • 84976834622
    • Self-stabilizing systems in spite of distributed control
    • Dijkstra, E.W. (1974). Self-stabilizing systems in spite of distributed control. Commun. ACM 17(11): 643-644.
    • (1974) Commun. ACM , vol.17 , Issue.11 , pp. 643-644
    • Dijkstra, E.W.1
  • 14
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • Elnozahy, E.N., Alvisi, L., Wang, Y.-M. and Johnson, D.B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3): 375-408.
    • (2002) ACM Comput. Surv. , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.1    Alvisi, L.2    Wang, Y.-M.3    Johnson, D.B.4
  • 15
    • 0029004440
    • Toward a theory of situation awareness in dynamic systems
    • Endsley, M.R. (1995). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1): 32-64.
    • (1995) Human Factors , vol.37 , Issue.1 , pp. 32-64
    • Endsley, M.R.1
  • 17
    • 70450197272
    • FT-MPI
    • FT-MPI. http://icl.cs.utk.edu/ftmpi/ (2009)
  • 18
    • 70450211231
    • Extending stability beyond CPU millennium: A micron-scale atomistic simulation of Kelvin-Helmholtz instability
    • In Reno
    • Glosli, J.N., Richards, D.F., Caspersen, K.J., Rudd, R.E., Gunnels, J.A. and Streitz, F.H. (2007). Extending stability beyond CPU millennium: A micron-scale atomistic simulation of Kelvin-Helmholtz instability. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Reno.
    • (2007) Proceedings of the 2007 ACM/IEEE Conference on Supercomputing
    • Glosli, J.N.1    Richards, D.F.2    Caspersen, K.J.3    Rudd, R.E.4    Gunnels, J.A.5    Streitz, F.H.6
  • 20
    • 0021439162
    • Algorithm-based fault tolerance for matrix operations
    • Huang, K. and Abraham, J. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. C-33(6): 518-528.
    • (1984) IEEE Trans. Comput. , vol.C-33 , Issue.6 , pp. 518-528
    • Huang, K.1    Abraham, J.2
  • 22
    • 70450182237
    • PERCU: A holistic method for evaluating high performance computing systems
    • Dissertation, University of California Berkeley
    • Kramer, W. (2008). PERCU: A holistic method for evaluating high performance computing systems. Dissertation, University of California Berkeley.
    • (2008)
    • Kramer, W.1
  • 23
    • 70450200137
    • LAM/MPI. (Accessed: September 2, 2009)
    • LAM/MPI. http://www.lam-mpi.org/ (Accessed: September 2, 2009)
    • (2009)
  • 25
    • 70450200135
    • Libckpt. (Accessed: September 2, 2009)
    • Libckpt. http://www.cs.utk.edu/~plank/plank/www./libckpt.html (Accessed: September 2, 2009)
    • (2009)
  • 26
    • 36949009638
    • Scalable diskless checkpointing for large parallel systems
    • Ph.D. dissertation, University of Illinois at Urbana-Champaign
    • Lu, C.D. (2005). Scalable diskless checkpointing for large parallel systems. Ph.D. dissertation, University of Illinois at Urbana-Champaign.
    • (2005)
    • Lu, C.D.1
  • 28
    • 70450211235
    • MVAPICH. (Accessed: September 2, 2009)
    • MVAPICH. http://mvapich.cse.ohio-state.edu/overview/mvapich/ (Accessed: September 2, 2009)
    • (2009)
  • 30
    • 70450211233
    • OpenMPI. (Accessed: September 2, 2009)
    • OpenMPI. http://www.open-mpi.org/ (Accessed: September 2, 2009)
    • (2009)
  • 32
    • 70450211234
    • PDSI. (Accessed: September 2, 2009)
    • PDSI. http://pdsi.nersc.gov (Accessed: September 2, 2009)
    • (2009)
  • 34
    • 0033077475
    • Memory exclusion: Optimizing the performance of checkpointing systems
    • Plank, J.S., Chen, Y., Li, K., Beck, M. and Kingsley, G. (1999). Memory exclusion: Optimizing the performance of checkpointing systems. Software Pract. Ex. 29(2): 125-142.
    • (1999) Software Pract. Ex. , vol.29 , Issue.2 , pp. 125-142
    • Plank, J.S.1    Chen, Y.2    Li, K.3    Beck, M.4    Kingsley, G.5
  • 36
    • 68249122526
    • Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems
    • Sahoo, R.K., Bae, M., Vilalta, R., Moreira, J., Ma, S. and Gupta, M. (2002). Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems. In Proceedings of IEEE/ACM Supercomputing 2002.
    • (2002) Proceedings of IEEE/ACM Supercomputing 2002
    • Sahoo, R.K.1    Bae, M.2    Vilalta, R.3    Moreira, J.4    Ma, S.5    Gupta, M.6
  • 38
    • 50649108554
    • Proactive fault tolerance in MPI Applications via task migration
    • In LNCS
    • Chakravorty, S., Mendes, C.L. and Kale, L.V. (2006). Proactive fault tolerance in MPI Applications via task migration. In Proceedings of HIPC 2006, LNCS, volume 4297, p. 485.
    • (2006) Proceedings of HIPC 2006 , vol.4297 , pp. 485
    • Chakravorty, S.1    Mendes, C.L.2    Kale, L.V.3
  • 40
    • 70450210361
    • SCR. (Accessed: September 2, 2009)
    • SCR. http://scalablecr.sourceforge.net/ (Accessed: September 2, 2009)
    • (2009)
  • 41
    • 36148941068
    • Understanding failures in petascale computers
    • Schroeder, B. and Gibson, G. (2007). Understanding failures in petascale computers. J Phys. Conf. 78: 012022.
    • Teodorescu, R., Nakano, J. and Torrellas, J. (2006). SWICH: A prototype for efficient cache-level checkpointing and rollback. IEEE Micro 26(5): 28-40.
    • (2007) J Phys. Conf. , vol.78 , pp. 012022
    • Schroeder, B.1    Gibson, G.2
  • 42
    • 0003133883
    • Probabilistic logics and the synthesis of reliable organisms from unreliable components
    • In Automata Studies, edited by C. E. Shannon and J. McCarthy. New Jersey: Princeton University Press
    • Von Neumann, J. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, edited by C. E. Shannon and J. McCarthy. New Jersey: Princeton University Press, pp. 43-98.
    • (1956) Automata Studies , pp. 43-98
    • Von Neumann, J.1


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.