2014, Pages 1225-1234

FMI: Fault tolerant messaging interface for fast and transparent recovery

Author keywords

Checkpoint Restart; Fault tolerance; MPI

Indexed keywords

DISTRIBUTED PARAMETER NETWORKS; FAULT TOLERANCE; FILE ORGANIZATION; MESSAGE PASSING; RECOVERY; SEMANTICS; SUPERCOMPUTERS

EID: 84906689065     PISSN: 15302075     EISSN: 23321237     Source Type: Conference Proceeding    
DOI: 10.1109/IPDPS.2014.126     Document Type: Conference Paper
Times cited: 23

References (27)
  • 1
    • 36148941068
    • Understanding failures in petascale computers
    • B. Schroeder and G. A. Gibson, "Understanding Failures in Petascale Computers," Journal of Physics: Conference Series, vol. 78, no. 1, p. 012022, Jul. 2007. [Online]. Available: http://dx.doi.org/10.1088/1742-6596/78/1/012022
    • (2007) Journal of Physics: Conference Series, vol. 78, Issue 1, p. 012022
    • Schroeder, B.; Gibson, G.A.
  • 5
    • 84884918986
    • "MPI Forum." [Online]. Available: http://www.mpi-forum.org/
    • MPI Forum
  • 8
    • 85084160707
    • Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
    • B. Schroeder and G. A. Gibson, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" in Proceedings of the 5th USENIX Conference on File and Storage Technologies, ser. FAST '07. Berkeley, CA, USA: USENIX Association, 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1267903.1267904
    • (2007) Proceedings of the 5th USENIX Conference on File and Storage Technologies, ser. FAST '07
    • Schroeder, B.; Gibson, G.A.
  • 9
    • 0021392066
    • Error-correcting codes for semiconductor memory applications: A state-of-the-art review
    • C. L. Chen and M. Y. Hsiao, "Error-correcting codes for semiconductor memory applications: A state-of-the-art review," IBM J. Res. Dev., vol. 28, no. 2, pp. 124-134, Mar. 1984. [Online]. Available: http://dx.doi.org/10.1147/rd.282.0124
    • (1984) IBM J. Res. Dev., vol. 28, Issue 2, pp. 124-134
    • Chen, C.L.; Hsiao, M.Y.
  • 13
    • 0004381167
    • N. H. Vaidya, "On Checkpoint Latency," College Station, TX, USA, Tech. Rep., 1995. [Online]. Available: http://portal.acm.org/citation.cfm?id=892900
    • (1995) On Checkpoint Latency
    • Vaidya, N.H.
  • 15
    • 84879817446
    • "PMGR COLLECTIVE." [Online]. Available: http://sourceforge.net/projects/pmgrcollective/
    • PMGR Collective
  • 17
    • 0242571753
    • Slurm: Simple Linux utility for resource management
    • A. Yoo, M. Jette, and M. Grondona, "Slurm: Simple Linux utility for resource management," in Job Scheduling Strategies for Parallel Processing, ser. Lecture Notes in Computer Science, D. Feitelson, L. Rudolph, and U. Schwiegelshohn, Eds. Springer Berlin Heidelberg, 2003, vol. 2862, pp. 44-60. [Online]. Available: http://dx.doi.org/10.1007/10968987_3
    • (2003) Job Scheduling Strategies for Parallel Processing, ser. Lecture Notes in Computer Science, vol. 2862, pp. 44-60
    • Yoo, A.; Jette, M.; Grondona, M.
  • 22
    • 20444444457
    • The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
    • S. Sankaran, J. M. Squyres, B. Barrett, and A. Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in Proceedings, LACSI Symposium, Santa Fe, 2003, pp. 479-493.
    • (2003) Proceedings, LACSI Symposium, pp. 479-493
    • Sankaran, S.; Squyres, J.M.; Barrett, B.; Lumsdaine, A.
  • 24
    • 20444463494
    • FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
    • G. Zheng, L. Shi, and L. V. Kale, "FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI," in Proceedings of the 2004 IEEE International Conference on Cluster Computing, ser. CLUSTER '04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 93-103. [Online]. Available: http://portal.acm.org/citation.cfm?id=1111712
    • (2004) Proceedings of the 2004 IEEE International Conference on Cluster Computing, ser. CLUSTER '04, pp. 93-103
    • Zheng, G.; Shi, L.; Kale, L.V.
  • 26
    • 0032179680
    • Diskless checkpointing
    • J. S. Plank, K. Li, and M. A. Puening, "Diskless Checkpointing," IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 10, pp. 972-986, Oct. 1998. [Online]. Available: http://dx.doi.org/10.1109/71.730527
    • (1998) IEEE Trans. Parallel Distrib. Syst., vol. 9, Issue 10, pp. 972-986
    • Plank, J.S.; Li, K.; Puening, M.A.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.