Volume , Issue , 2013, Pages

ACR: Automatic checkpoint/restart for soft and hard error protection

Author keywords

Checkpoint restart; Fault tolerance; Redundancy; Silent data corruption

Indexed keywords

FAILURE ANALYSIS; FAULT TOLERANCE; RADIATION HARDENING; REDUNDANCY;

EID: 84899671615     PISSN: 21674329     EISSN: 21674337     Source Type: Conference Proceeding    
DOI: 10.1145/2503210.2503266     Document Type: Conference Paper
Times cited : (52)

References (29)
  • 1
    • 29344472607
    • Radiation-induced soft errors in advanced semiconductor technologies
    • R. C. Baumann. Radiation-induced soft errors in advanced semiconductor technologies. Device and Materials Reliability, IEEE Transactions on, 5(3):305-316, 2005.
    • (2005) Device and Materials Reliability, IEEE Transactions on , vol.5 , Issue.3 , pp. 305-316
    • Baumann, R.C.1
  • 3
    • 61449223447
    • Algorithm-based fault tolerance applied to high performance computing
    • G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. JPDC, 69(4):410-416, 2009.
    • (2009) JPDC , vol.69 , Issue.4 , pp. 410-416
    • Bosilca, G.1    Delmas, R.2    Dongarra, J.3    Langou, J.4
  • 5
    • 68249127079
    • Fault tolerance in petascale/ exascale systems: Current knowledge, challenges and research opportunities
    • F. Cappello. Fault tolerance in petascale/ exascale systems: Current knowledge, challenges and research opportunities. IJHPCA, 23(3):212-226, 2009.
    • (2009) IJHPCA , vol.23 , Issue.3 , pp. 212-226
    • Cappello, F.1
  • 6
    • 84877708941
    • Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems
    • Los Alamitos, CA, USA. IEEE Computer Society Press
    • J. Chung, I. Lee, M. Sullivan, J. H. Ryoo, D. W. Kim, D. H. Yoon, L. Kaplan, and M. Erez. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems. In Supercomputing, SC'12, pages 58:1-58:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
    • (2012) Supercomputing, SC'12 , pp. 58:1-58:11
    • Chung, J.1    Lee, I.2    Sullivan, M.3    Ryoo, J.H.4    Kim, D.W.5    Yoon, D.H.6    Kaplan, L.7    Erez, M.8
  • 7
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Comp. Syst., 22(3):303-312, 2006.
    • (2006) Future Generation Comp. Syst. , vol.22 , Issue.3 , pp. 303-312
    • Daly, J.T.1
  • 11
    • 84877705582
    • Detection and correction of silent data corruption for large-scale high-performance computing
    • Los Alamitos, CA, USA. IEEE Computer Society Press
    • D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. Detection and correction of silent data corruption for large-scale high-performance computing. In Supercomputing, SC'12, pages 78:1-78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
    • (2012) Supercomputing, SC'12 , pp. 78:1-78:12
    • Fiala, D.1    Mueller, F.2    Engelmann, C.3    Riesen, R.4    Ferreira, K.5    Brightwell, R.6
  • 12
    • 33646126514
    • A peer-to-peer framework for robust execution of message passing parallel programs
    • Springer-Verlag
    • S. Genaud, C. Rattanapoka, and U. L. Strasbourg. A peer-to-peer framework for robust execution of message passing parallel programs. In EuroPVM/MPI 2005, volume 3666 of LNCS, pages 276-284. Springer-Verlag, 2005.
    • (2005) EuroPVM/MPI 2005, Volume 3666 of LNCS , pp. 276-284
    • Genaud, S.1    Rattanapoka, C.2    Strasbourg, U.L.3
  • 13
    • 74049121711
    • Berkeley lab checkpoint/restart (blcr) for linux clusters
    • P. H. Hargrove and J. C. Duell. Berkeley lab checkpoint/restart (blcr) for linux clusters. In SciDAC, 2006.
    • (2006) SciDAC
    • Hargrove, P.H.1    Duell, J.C.2
  • 18
    • 77951205449
    • A study of dynamic meta-learning for failure prediction in large-scale systems
    • June
    • Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan. A study of dynamic meta-learning for failure prediction in large-scale systems. J. Parallel Distrib. Comput., 70(6):630-643, June 2010.
    • (2010) J. Parallel Distrib. Comput , vol.70 , Issue.6 , pp. 630-643
    • Lan, Z.1    Gu, J.2    Zheng, Z.3    Thakur, R.4    Coghlan, S.5
  • 19
    • 0035390088
    • A variational calculus approach to optimal checkpoint placement
    • Y. Ling, J. Mi, and X. Lin. A variational calculus approach to optimal checkpoint placement. Computers, IEEE Transactions on, 50(7):699-708, 2001.
    • (2001) Computers, IEEE Transactions on , vol.50 , Issue.7 , pp. 699-708
    • Ling, Y.1    Mi, J.2    Lin, X.3
  • 20
    • 84899680829
    • Lulesh
    • Lulesh. http://computation.llnl.gov/casc/ShockHydro/.
  • 23
    • 78650831692
    • Design, modeling, and evaluation of a scalable multi-level checkpointing system
    • A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC, pages 1-11, 2010.
    • (2010) SC , pp. 1-11
    • Moody, A.1    Bronevetsky, G.2    Mohror, K.3    De Supinski, B.R.4
  • 25
    • 84870713710
    • Hiding checkpoint overhead in hpc applications with a semi-blocking algorithm
    • Beijing, China, September
    • X. Ni, E. Meneses, and L. V. Kale. Hiding checkpoint overhead in hpc applications with a semi-blocking algorithm. In IEEE Cluster 12, Beijing, China, September 2012.
    • (2012) IEEE Cluster 12
    • Ni, X.1    Meneses, E.2    Kale, L.V.3
  • 29
    • 20444463494
    • FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for charm++ and mpi
    • San Diego, CA, September
    • G. Zheng, L. Shi, and L. V. Kale. FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI. In 2004 IEEE Cluster, pages 93-103, San Diego, CA, September 2004.
    • (2004) 2004 IEEE Cluster , pp. 93-103
    • Zheng, G.1    Shi, L.2    Kale, L.V.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.