SCOPUS Information Retrieval Platform

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Volume , Issue , 2010, Pages 116-125

RDMA-based job migration framework for MPI over InfiniBand

(4) Ouyang, Xiangyong (a); Marcarelli, Sonya (a); Rajachandrasekar, Raghunath (a); Panda, Dhabaleswar K. (a)

(a) Ohio State University (United States)

Author keywords

[No Author keywords available]

Indexed keywords

CHECK POINTING; CHECKPOINT/RESTART; COORDINATED CHECKPOINTS; HIGH PERFORMANCE COMMUNICATION; IMAGE TRANSMISSION; INFINIBAND; JOB MIGRATION; NODE FAILURE; OPEN-SOURCE; PROCESS APPLICATIONS; QUEUING DELAY; STABLE STORAGE; STORAGE AREA; STORAGE SUBSYSTEMS;

CLUSTER COMPUTING; FAULT TOLERANCE; QUEUEING NETWORKS;

QUALITY ASSURANCE;

EID: 78649483996 PISSN: 15525244 EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/CLUSTER.2010.20 Document Type: Conference Paper

Times cited : (21)

References (35)

1
- 78649475987
- "MPI over InfiniBand, 10GigE/iWARP and RDMAoE," in http://mvapich.cse.ohio-state.edu/.
- MPI over InfiniBand, 10GigE/iWARP and RDMAoE

2
- 84889903972
- "MPI 3.0 Standardization Effort," http://meetings.mpi-forum.org/MPI-3.0-main-page.php.
- MPI 3.0 Standardization Effort

3
- 33750936415
- Availability modeling and analysis on high performance cluster computing systems
- Washington, DC, USA: IEEE Computer Society
- H. Song, C. B. Leangsuksun, and R. Nassar, "Availability Modeling and Analysis on High Performance Cluster Computing Systems," in ARES '06: Proceedings of the First International Conference on Availability, Reliability and Security. Washington, DC, USA: IEEE Computer Society, 2006, pp. 305-313.
- (2006) ARES '06: Proceedings of the First International Conference on Availability, Reliability and Security , pp. 305-313
- Song, H.¹ Leangsuksun, C.B.² Nassar, R.³

4
- 56749178938
- Exploring event correlation for failure prediction in coalitions of clusters
- New York, NY, USA: ACM
- S. Fu and C.-Z. Xu, "Exploring event correlation for failure prediction in coalitions of clusters," in SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1-12.
- (2007) SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing , pp. 1-12
- Fu, S.¹ Xu, C.-Z.²

5
- 78649487857
- "Intelligent Platform Management Interface (IPMI)," http://www.intel.com/design/servers/ipmi/.

6
- 34548782109
- A fault tolerance protocol with fast fault recovery
- S. Chakravorty and L. V. Kale, "A fault tolerance protocol with fast fault recovery," in IPDPS 2003, 2003.
- (2003) IPDPS 2003
- Chakravorty, S.¹ Kale, L.V.²

7
- 77952378080
- Critical event prediction for proactive management in large-scale computer clusters
- R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam, "Critical event prediction for proactive management in large-scale computer clusters," in KDD '03, 2003, pp. 426-435.
- (2003) KDD '03 , pp. 426-435
- Sahoo, R.K.¹ Oliner, A.J.² Rish, I.³ Gupta, M.⁴ Moreira, J.E.⁵ Ma, S.⁶ Vilalta, R.⁷ Sivasubramaniam, A.⁸

8
- 34547474846
- InfiniBand Trade Association, "The InfiniBand Architecture," http://www.infinibandta.org.
- The InfiniBand Architecture

9
- 70350755748
- Proactive process-level live migration in HPC environments
- Chao Wang and Frank Mueller and Christian Engelmann and Stephen L. Scott, "Proactive process-level live migration in HPC environments," in SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008.
- (2008) SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

10
- 34548768671
- A job pause service under LAM/MPI+BLCR for transparent fault tolerance
- C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance," in IPDPS, 2007, pp. 1-10.
- (2007) IPDPS , pp. 1-10
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

11
- 34548046749
- Proactive fault tolerance for HPC with Xen virtualization
- A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott, "Proactive fault tolerance for HPC with Xen virtualization," in ICS '07: Proceedings of the 21st annual international conference on Supercomputing, 2007, pp. 23-32.
- (2007) ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing , pp. 23-32
- Nagarajan, A.B.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

12
- 53349098107
- High performance virtual machine migration with RDMA over modern interconnects
- W. Huang, Q. Gao, J. Liu, and D. K. Panda, "High performance virtual machine migration with rdma over modern interconnects," in CLUSTER '07: Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007.
- (2007) CLUSTER '07: Proceedings of the 2007 IEEE International Conference on Cluster Computing
- Huang, W.¹ Gao, Q.² Liu, J.³ Panda, D.K.⁴

13
- 47249116207
- Group-based coordinated checkpointing for MPI: A case study on InfiniBand
- Washington, DC, USA: IEEE Computer Society
- Q. Gao, W. Huang, M. J. Koop, and D. K. Panda, "Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand," in ICPP '07: Proceedings of the 2007 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2007, p. 47.
- (2007) ICPP '07: Proceedings of the 2007 International Conference on Parallel Processing , pp. 47
- Gao, Q.¹ Huang, W.² Koop, M.J.³ Panda, D.K.⁴

14
- 34547424834
- Application-transparent checkpoint/restart for MPI programs over InfiniBand
- Washington, DC, USA: IEEE Computer Society
- Q. Gao, W. Yu, W. Huang, and D. K. Panda, "Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand," in ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2006, pp. 471-478.
- (2006) ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing , pp. 471-478
- Gao, Q.¹ Yu, W.² Huang, W.³ Panda, D.K.⁴

15
- 77951447133
- Accelerating Checkpoint operation by node-level write aggregation on multicore systems
- September
- X. Ouyang, K. Gopalakrishnan, and D. K. Panda, "Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems," ICPP 2009, September 2009.
- (2009) ICPP 2009
- Ouyang, X.¹ Gopalakrishnan, K.² Panda, D.K.³

16
- 77952145003
- Fast checkpointing by write aggregation with dynamic buffer and interleaving on multicore architecture
- December
- X. Ouyang, K. Gopalakrishnan, T. Gangadharappa, and D. K. Panda, "Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture," HiPC 2009, December 2009.
- (2009) HiPC 2009
- Ouyang, X.¹ Gopalakrishnan, K.² Gangadharappa, T.³ Panda, D.K.⁴

17
- 12344277946
- The design and implementation of berkeley lab's linux checkpoint/restart
- Lawrence Berkeley National Laboratory, Berkeley, CA 94720. [Online]. Available
- Duell, J., Hargrove, P., and Roman, E., "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart," Lawrence Berkeley National Laboratory, Berkeley, CA 94720, Tech. Rep. LBNL-54941, 2002. [Online]. Available: https://ftg.lbl.gov/CheckpointRestart/Pubs/LBNL-54941.pdf
- (2002) Tech. Rep. LBNL-54941
- Duell, J.¹ Hargrove, P.² Roman, E.³

18
- 53349109260
- "CIFTS Web Page," http://www.mcs.anl.gov/research/cifts.
- CIFTS Web Page

19
- 77951481809
- CIFTS: A coordinated infrastructure for fault-tolerant systems
- R. Gupta, P. Beckman, B. Park, E. Lusk, P. Hargrove, A. Geist, D. Panda, A. Lumsdaine, and J. Dongarra, "CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems," in Int'l Conference on Parallel Processing (ICPP '09), 2009.
- (2009) Int'l Conference on Parallel Processing (ICPP '09)
- Gupta, R.¹ Beckman, P.² Park, B.³ Lusk, E.⁴ Hargrove, P.⁵ Geist, A.⁶ Panda, D.⁷ Lumsdaine, A.⁸ Dongarra, J.⁹

20
- 78649480678
- "Top 500 Supercomputers," http://www.top500.org.

21
- 74049121711
- Berkeley lab checkpoint/restart (BLCR) for linux clusters
- P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," in SciDAC, June 2006.
- (2006) SciDAC
- Hargrove, P.H.¹ Duell, J.C.²

22
- 58449084165
- ScELA: Scalable and extensible launching architecture for clusters
- J. K. Sridhar, M. J. Koop, J. L. Perkins, and D. K. Panda, "ScELA: Scalable and Extensible Launching Architecture for Clusters," in HiPC, 2008, pp. 323-335.
- (2008) HiPC , pp. 323-335
- Sridhar, J.K.¹ Koop, M.J.² Perkins, J.L.³ Panda, D.K.⁴

23
- 74049098606
- PLFS: A checkpoint filesystem for parallel applications
- J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, "PLFS: a checkpoint filesystem for parallel applications," in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
- (2009) SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
- Bent, J.¹ Gibson, G.² Grider, G.³ McClelland, B.⁴ Nowoczynski, P.⁵ Nunez, J.⁶ Polte, M.⁷ Wingate, M.⁸

24
- 85014969248
- Architectural requirements and scalability of the NAS parallel benchmarks
- F. C. Wong and R. P. M. etc., "Architectural requirements and scalability of the NAS parallel benchmarks," in Supercomputing '99, 1999, p. 41.
- (1999) Supercomputing '99 , pp. 41
- Wong, F.C.¹ R, P.M.²

25
- 78649471209
- "PVFS2," http://www.pvfs.org/.

26
- 77952163433
- June
- S. Al-Kiswany, M. Ripeanu, S. Vazhkudai, and A. Gharaibeh, "stdchk: A Checkpoint Storage System for Desktop Grid Computing," June 2008.
- (2008) Stdchk: A Checkpoint Storage System for Desktop Grid Computing
- Al-Kiswany, S.¹ Ripeanu, M.² Vazhkudai, S.³ Gharaibeh, A.⁴

27
- 27844542760
- The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- Winter
- S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479-493, Winter 2005.
- (2005) International Journal of High Performance Computing Applications , vol.19 , Issue.4 , pp. 479-493
- Sankaran, S.¹ Squyres, J.M.² Barrett, B.³ Lumsdaine, A.⁴ Duell, J.⁵ Hargrove, P.⁶ Roman, E.⁷

28
- 34548789748
- The design and implementation of checkpoint/restart process fault tolerance for open MPI
- March
- J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine, "The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI," in 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, March 2007.
- (2007) 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems
- Hursey, J.¹ Squyres, J.² Mattox, T.³ Lumsdaine, A.⁴

29
- 0003050634
- CoCheck: Checkpointing and process migration for MPI
- G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," in Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), 1996.
- (1996) Proceedings of the 10th International Parallel Processing Symposium (IPPS '96)
- Stellner, G.¹

30
- 0010538346
- Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations
- A. Agbaria and R. Friedman, "Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations," High-Performance Distributed Computing, International Symposium on, vol. 0, p. 31, 1999.
- (1999) High-Performance Distributed Computing, International Symposium on , pp. 31
- Agbaria, A.¹ Friedman, R.²

31
- 84900298636
- CLIP: A checkpointing tool for message-passing parallel programs
- Y. Chen, J. S. Plank, and K. Li, "CLIP: A Checkpointing Tool for Message-Passing Parallel Programs," in SC97: High Performance Networking and Computing, 1997, pp. 1-11.
- (1997) SC97: High Performance Networking and Computing , pp. 1-11
- Chen, Y.¹ Plank, J.S.² Li, K.³

32
- 0038194608
- MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
- G. Bosilca, A. Bouteiller, S. Djilali, G. Fedak, C. Germain, T. Herault, V. Neri, and A. Selikhov, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in Supercomputing, 2002, pp. 1-18.
- (2002) Supercomputing , pp. 1-18
- Bosilca, G.¹ Bouteiller, A.² Djilali, S.³ Fedak, G.⁴ Germain, C.⁵ Herault, T.⁶ Neri, V.⁷ Selikhov, A.⁸

33
- 0016829070
- System structure for software fault tolerance
- New York, NY, USA: ACM
- B. Randell, "System structure for software fault tolerance," in Proceedings of the international conference on Reliable software. New York, NY, USA: ACM, 1975, pp. 437-449.
- (1975) Proceedings of the International Conference on Reliable Software , pp. 437-449
- Randell, B.¹

34
- 0042078549
- A survey of rollback-recovery protocols in message-passing systems
- E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, 2002.
- (2002) ACM Comput. Surv. , vol.34 , Issue.3 , pp. 375-408
- Elnozahy, E.N.M.¹ Alvisi, L.² Wang, Y.-M.³ Johnson, D.B.⁴

35
- 34548042452
- Proactive fault tolerance in MPI applications via task migration
- S. Chakravorty, C. Mendes, and L. Kale, "Proactive fault tolerance in MPI applications via task migration," in HiPC, 2006.
- (2006) HiPC
- Chakravorty, S.¹ Mendes, C.² Kale, L.³

* This information was extracted and analyzed by KISTI from Elsevier's SCOPUS DB.