Volume 57, Issue 12, 2008, Pages 1647-1660

Adaptive fault management of parallel applications for high-performance computing

Author keywords

Adaptive fault management; High performance computing; Large scale systems; Parallel applications

Indexed keywords

LARGE SCALE SYSTEMS

EID: 57049111494     PISSN: 00189340     EISSN: None     Source Type: Journal    
DOI: 10.1109/TC.2008.90     Document Type: Article
Times cited : (38)

References (59)
  • 4
    • 0042078549
    • A Survey of Rollback-Recovery Protocols in Message-Passing Systems
    • E. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, 2002.
    • (2002) ACM Computing Surveys , vol.34 , Issue.3
    • Elnozahy, E.1    Alvisi, L.2    Wang, Y.3    Johnson, D.4
  • 5
    • 9144223280
    • Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
    • Apr.-June
    • E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 2, Apr.-June 2004.
    • (2004) IEEE Trans. Dependable and Secure Computing , vol.1 , Issue.2
    • Elnozahy, E.1    Plank, J.2
  • 9
    • 77952378080
    • Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters
    • R. Sahoo, A. Oliner, I. Rish, M. Gupta, J. Moreira, and S. Ma, "Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters," Proc. ACM SIGKDD, 2003.
    • (2003) Proc. ACM SIGKDD
    • Sahoo, R.1    Oliner, A.2    Rish, I.3    Gupta, M.4    Moreira, J.5    Ma, S.6
  • 17
    • 33749061217
    • Requirements for Linux Checkpoint/Restart
    • Technical Report LBNL-49659, Berkeley Lab, May 2002
    • J. Duell, P. Hargrove, and E. Roman, "Requirements for Linux Checkpoint/Restart," Technical Report LBNL-49659, Berkeley Lab, May 2002.
    • Duell, J.1    Hargrove, P.2    Roman, E.3
  • 18
    • 27844562921
    • Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
    • E. Gabriel et al., "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," Proc. 11th European PVM/MPI Users' Group Meeting, 2004.
    • (2004) Proc. 11th European PVM/MPI Users' Group Meeting
    • Gabriel, E.1
  • 21
    • 84976846528
    • A First Order Approximation to the Optimal Checkpoint Interval
    • J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, vol. 17, no. 9, 1974.
    • (1974) Comm. ACM , vol.17 , Issue.9
    • Young, J.1
  • 23
    • 0021473687
    • On the Optimum Checkpoint Selection Problem
    • S. Toueg and O. Babaoglu, "On the Optimum Checkpoint Selection Problem," SIAM J. Computing, vol. 13, no. 3, 1984.
    • (1984) SIAM J. Computing , vol.13 , Issue.3
    • Toueg, S.1    Babaoglu, O.2
  • 27
    • 36949009638
    • Scalable Diskless Checkpointing for Large Parallel Systems
    • PhD dissertation, Univ. of Illinois at Urbana-Champaign
    • C.-D. Lu, "Scalable Diskless Checkpointing for Large Parallel Systems," PhD dissertation, Univ. of Illinois at Urbana-Champaign, 2005.
    • (2005)
    • Lu, C.-D.1
  • 29
    • 28044457320
    • Monitoring Hard Disks with SMART
    • Jan
    • B. Allen, "Monitoring Hard Disks with SMART," Linux J., Jan. 2004.
    • (2004) Linux J
    • Allen, B.1
  • 30
    • 57049084232
    • Hardware Monitoring by LM Sensors
    • Hardware Monitoring by LM Sensors, http://secure.netroedge.com/~lm78/info.html, 2007.
    • (2007)
    • Sensors, L.M.1
  • 34
    • 0002168249
    • Learning to Predict Rare Events in Event Sequences
    • G. Weiss and H. Hirsh, "Learning to Predict Rare Events in Event Sequences," Proc. ACM SIGKDD, 1998.
    • (1998) Proc. ACM SIGKDD
    • Weiss, G.1    Hirsh, H.2
  • 38
    • 21044437801
    • Overview of the Blue Gene/L System Architecture
    • A. Gara et al., "Overview of the Blue Gene/L System Architecture," IBM J. Research and Development, vol. 49, nos. 2/3, 2005.
    • (2005) IBM J. Research and Development , vol.49 , Issue.2-3
    • Gara, A.1
  • 43
    • 16244422723
    • Checkpointing and Migration of Unix Processes in the Condor Distributed Processing System
    • Feb
    • T. Tannenbaum and M. Litzkow, "Checkpointing and Migration of Unix Processes in the Condor Distributed Processing System," Dr. Dobb's J., Feb. 1995.
    • (1995) Dr. Dobb's J
    • Tannenbaum, T.1    Litzkow, M.2
  • 51
    • 0035201417
    • Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
    • J. Plank and M. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol. 61, no. 11, 2001.
    • (2001) J. Parallel and Distributed Computing , vol.61 , Issue.11
    • Plank, J.1    Thomason, M.2
  • 56
    • 84897988044
    • Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation
    • G. Bryan, T. Abel, and M. Norman, "Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation," Proc. ACM/IEEE Conf. Supercomputing (SC), 2001.
    • (2001) Proc. ACM/IEEE Conf. Supercomputing (SC)
    • Bryan, G.1    Abel, T.2    Norman, M.3
  • 57
    • 0029633168
    • Gromacs: A Message-Passing Parallel Molecular Dynamics Implementation
    • H. Berendsen, D. van der Spoel, and R. van Drunen, "Gromacs: A Message-Passing Parallel Molecular Dynamics Implementation," Computer Physics Comm., vol. 91, pp. 43-56, 1995.
    • (1995) Computer Physics Comm , vol.91 , pp. 43-56
    • Berendsen, H.1    van der Spoel, D.2    van Drunen, R.3
  • 59
    • 79952168926
    • Using Adaptive Fault Tolerance to Improve Application Robustness on the Teragrid
    • Y. Li and Z. Lan, "Using Adaptive Fault Tolerance to Improve Application Robustness on the Teragrid," Proc. Second TeraGrid Conf., 2007.
    • (2007) Proc. Second TeraGrid Conf
    • Li, Y.1    Lan, Z.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.