Volume , Issue , 2010, Pages 116-125

RDMA-based job migration framework for MPI over InfiniBand

Author keywords

[No Author keywords available]

Indexed keywords

CHECK POINTING; CHECKPOINT/RESTART; COORDINATED CHECKPOINTS; HIGH PERFORMANCE COMMUNICATION; IMAGE TRANSMISSION; INFINIBAND; JOB MIGRATION; NODE FAILURE; OPEN-SOURCE; PROCESS APPLICATIONS; QUEUING DELAY; STABLE STORAGE; STORAGE AREA; STORAGE SUBSYSTEMS

EID: 78649483996     PISSN: 15525244     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/CLUSTER.2010.20     Document Type: Conference Paper
Times cited: 21

References (35)
  • 4
    • 56749178938
    • Exploring event correlation for failure prediction in coalitions of clusters
    • S. Fu and C.-Z. Xu, "Exploring event correlation for failure prediction in coalitions of clusters," in SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1-12.
    • (2007) SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing , pp. 1-12
    • Fu, S.1    Xu, C.-Z.2
  • 5
    • 78649487857
    • "Intelligent Platform Management Interface (IPMI)," http://www.intel.com/design/servers/ipmi/.
  • 6
    • 34548782109
    • A fault tolerance protocol with fast fault recovery
    • S. Chakravorty and L. V. Kale, "A fault tolerance protocol with fast fault recovery," in IPDPS 2003, 2003.
    • (2003) IPDPS 2003
    • Chakravorty, S.1    Kale, L.V.2
  • 10
    • 34548768671
    • A job pause service under LAM/MPI+BLCR for transparent fault tolerance
    • C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance," in IPDPS, 2007, pp. 1-10.
    • (2007) IPDPS , pp. 1-10
    • Wang, C.1    Mueller, F.2    Engelmann, C.3    Scott, S.L.4
  • 15
    • 77951447133
    • Accelerating checkpoint operation by node-level write aggregation on multicore systems
    • X. Ouyang, K. Gopalakrishnan, and D. K. Panda, "Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems," ICPP 2009, September 2009.
    • (2009) ICPP 2009
    • Ouyang, X.1    Gopalakrishnan, K.2    Panda, D.K.3
  • 16
    • 77952145003
    • Fast checkpointing by write aggregation with dynamic buffer and interleaving on multicore architecture
    • X. Ouyang, K. Gopalakrishnan, T. Gangadharappa, and D. K. Panda, "Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture," HiPC 2009, December 2009.
    • (2009) HiPC 2009
    • Ouyang, X.1    Gopalakrishnan, K.2    Gangadharappa, T.3    Panda, D.K.4
  • 17
    • 12344277946
    • The design and implementation of Berkeley Lab's Linux checkpoint/restart
    • J. Duell, P. Hargrove, and E. Roman, "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart," Lawrence Berkeley National Laboratory, Berkeley, CA 94720, Tech. Rep. LBNL-54941, 2002. [Online]. Available: https://ftg.lbl.gov/CheckpointRestart/Pubs/LBNL-54941.pdf
    • (2002) Tech. Rep. LBNL-54941
    • Duell, J.1    Hargrove, P.2    Roman, E.3
  • 18
    • 53349109260
    • "CIFTS Web Page," http://www.mcs.anl.gov/research/cifts.
    • CIFTS Web Page
  • 20
    • 78649480678
    • "Top 500 Supercomputers," http://www.top500.org.
  • 21
    • 74049121711
    • Berkeley Lab checkpoint/restart (BLCR) for Linux clusters
    • P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," in SciDAC, June 2006.
    • (2006) SciDAC
    • Hargrove, P.H.1    Duell, J.C.2
  • 22
    • 58449084165
    • ScELA: Scalable and extensible launching architecture for clusters
    • J. K. Sridhar, M. J. Koop, J. L. Perkins, and D. K. Panda, "ScELA: Scalable and Extensible Launching Architecture for Clusters," in HiPC, 2008, pp. 323-335.
    • (2008) HiPC , pp. 323-335
    • Sridhar, J.K.1    Koop, M.J.2    Perkins, J.L.3    Panda, D.K.4
  • 24
    • 85014969248
    • Architectural requirements and scalability of the NAS parallel benchmarks
    • F. C. Wong, R. P. Martin, et al., "Architectural requirements and scalability of the NAS parallel benchmarks," in Supercomputing '99, 1999, p. 41.
    • (1999) Supercomputing '99 , p. 41
    • Wong, F.C.1    Martin, R.P.2
  • 25
    • 78649471209
    • "PVFS2," http://www.pvfs.org/.
  • 34
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, 2002.
    • (2002) ACM Comput. Surv. , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.M.1    Alvisi, L.2    Wang, Y.-M.3    Johnson, D.B.4
  • 35
    • 34548042452
    • Proactive fault tolerance in MPI applications via task migration
    • S. Chakravorty, C. Mendes, and L. Kale, "Proactive fault tolerance in MPI applications via task migration," in HiPC, 2006.
    • (2006) HiPC
    • Chakravorty, S.1    Mendes, C.2    Kale, L.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.