2010, Pages 524-533

Hybrid checkpointing for MPI jobs in HPC environments

Author keywords

Checkpoint restart; Fault tolerance; High performance computing

Indexed keywords

CHECK POINTING; CHECKPOINT/RESTART; HIGH PERFORMANCE APPLICATIONS; HIGH PERFORMANCE COMPUTING SYSTEMS; HIGH-PERFORMANCE COMPUTING; OPTIMAL BALANCE; ORDER OF MAGNITUDE

EID: 79951790076     PISSN: 15219097     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/ICPADS.2010.48     Document Type: Conference Paper
Times cited: 44

References (28)
  • 1
    • 33845420448
    • A power-aware run-time system for high-performance computing
    • C.-H. Hsu and W.-C. Feng, "A power-aware run-time system for high-performance computing," in Supercomputing, 2005.
    • (2005) Supercomputing
    • Hsu, C.-H.; Feng, W.-C.
  • 4
    • 79951775997
    • Application MTTFE vs. platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale
    • J. T. Daly, L. A. Pritchett-Sheats, and S. E. Michalak, "Application MTTFE vs. platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale," in Workshop on Resiliency in High Performance Computing, May 2008, pp. 19-22.
    • (2008) Workshop on Resiliency in High Performance Computing, pp. 19-22
    • Daly, J.T.; Pritchett-Sheats, L.A.; Michalak, S.E.
  • 7
    • 35248827046
    • A component architecture for LAM/MPI
    • J. M. Squyres and A. Lumsdaine, "A component architecture for LAM/MPI," in European PVM/MPI Users' Group Meeting, ser. Lecture Notes in Computer Science, no. 2840, Sep. 2003, pp. 379-387.
    • (2003) European PVM/MPI Users' Group Meeting, pp. 379-387
    • Squyres, J.M.; Lumsdaine, A.
  • 11
    • 0026812659
    • The design and implementation of a log-structured file system
    • M. Rosenblum and J. K. Ousterhout, "The design and implementation of a log-structured file system," ACM Trans. on Computer Systems, vol. 10, no. 1, Feb. 1992.
    • (1992) ACM Trans. on Computer Systems, vol. 10, no. 1
    • Rosenblum, M.; Ousterhout, J.K.
  • 13
    • 85014969248
    • Architectural requirements and scalability of the NAS parallel benchmarks
    • F. Wong, R. Martin, R. Arpaci-Dusseau, and D. Culler, "Architectural requirements and scalability of the NAS parallel benchmarks," in Supercomputing, 1999.
    • (1999) Supercomputing
    • Wong, F.; Martin, R.; Arpaci-Dusseau, R.; Culler, D.
  • 16
    • 0029713612
    • CoCheck: Checkpointing and process migration for MPI
    • G. Stellner, "CoCheck: Checkpointing and process migration for MPI," in Proceedings of IPPS '96, the 10th International Parallel Processing Symposium, Honolulu, HI, USA, 15-19 April 1996. Silver Spring, MD, USA: IEEE Computer Society Press, 1996, pp. 526-531.
    • (1996) Proceedings of IPPS '96, the 10th International Parallel Processing Symposium, pp. 526-531
    • Stellner, G.
  • 17
    • 0038194608
    • MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
    • G. Bosilca, A. Bouteiller, and F. Cappello, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in Supercomputing, Nov. 2002.
    • (2002) Supercomputing
    • Bosilca, G.; Bouteiller, A.; Cappello, F.
  • 20
    • 33845434226
    • Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
    • R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini, "Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers," in Supercomputing, 2005.
    • (2005) Supercomputing
    • Gioiosa, R.; Sancho, J.C.; Jiang, S.; Petrini, F.
  • 22
    • 0031224013
    • Continuous checkpointing: Joining the checkpointing with virtual memory paging
    • S.-T. Hsu and R.-C. Chang, "Continuous checkpointing: Joining the checkpointing with virtual memory paging," Softw. Pract. Exper., vol. 27, no. 9, pp. 1103-1120, 1997. (Pubitemid 127582120)
    • (1997) Software - Practice and Experience, vol. 27, no. 9, pp. 1103-1120
    • Hsu, S.-T.; Chang, R.-C.
  • 23
    • 33751065156
    • Adaptive page-level incremental checkpointing based on expected recovery time
    • S. Yi, J. Heo, Y. Cho, and J. Hong, "Adaptive page-level incremental checkpointing based on expected recovery time," in ACM Symposium on Applied Computing, 2006, pp. 1472-1476. (Pubitemid 44759028)
    • (2006) Proceedings of the ACM Symposium on Applied Computing, vol. 2, pp. 1472-1476
    • Yi, S.; Heo, J.; Cho, Y.; Hong, J.
  • 26
    • 84976846528
    • A first order approximation to the optimum checkpoint interval
    • J. W. Young, "A first order approximation to the optimum checkpoint interval," Commun. ACM, vol. 17, no. 9, pp. 530-531, 1974.
    • (1974) Commun. ACM, vol. 17, no. 9, pp. 530-531
    • Young, J.W.
  • 27
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • DOI: 10.1016/j.future.2004.11.016, PII: S0167739X04002213
    • J. T. Daly, "A higher order estimate of the optimum checkpoint interval for restart dumps," Future Gener. Comput. Syst., vol. 22, no. 3, pp. 303-312, 2006. (Pubitemid 41689812)
    • (2006) Future Generation Computer Systems, vol. 22, no. 3, pp. 303-312
    • Daly, J.T.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.