SCOPUS Information Search Platform

Journal of Parallel and Distributed Computing

Volume 72, Issue 2, 2012, Pages 254-267

Proactive process-level live migration and back migration in HPC environments

(4) Wang, Chao a; Mueller, Frank a; Engelmann, Christian b; Scott, Stephen L. b

a North Carolina State University (United States)

b Oak Ridge National Laboratory (United States)

Author keywords

Back migration; Fault tolerance; Health monitoring; High performance computing; Live migration

Indexed keywords

BACK MIGRATION; EXECUTION ENVIRONMENTS; HEALTH MONITORING; HIGH-PERFORMANCE COMPUTING; LIVE MIGRATIONS; LOAD IMBALANCE; NODE FAILURE; PROCESS LEVELS; PROCESS MIGRATION; SELF-HEALING; VIRTUALIZATIONS;

COMPUTER SOFTWARE SELECTION AND EVALUATION; EXPERIMENTS; FAULT TOLERANT COMPUTER SYSTEMS; HEALTH;

FAULT TOLERANCE;

EID: 84855350553 PISSN: 07437315 EISSN: None Source Type: Journal
DOI: 10.1016/j.jpdc.2011.10.009 Document Type: Article

Times cited : (31)

References (80)

1
- 84855353032
- Advanced configuration & power interface
- Advanced configuration & power interface, http://www.acpi.info.

2
- 84870548923
- An overview of the BlueGene/L supercomputer
- N. Adiga An overview of the BlueGene/L supercomputer Supercomputing 2002
- (2002) Supercomputing
- Adiga, N.¹

3
- 28044457320
- Monitoring hard disk with smart
- B. Allen, Monitoring hard disk with smart, Linux Journal, 2004.
- (2004) Linux Journal
- Allen, B.¹

4
- 70449844295
- Dmtcp: Transparent checkpointing for cluster computations and the desktop
- J. Ansel, K. Arya, G. Cooperman, Dmtcp: Transparent checkpointing for cluster computations and the desktop, in: 23rd IEEE International Parallel and Distributed Processing Symposium, 2009.
- (2009) 23rd IEEE International Parallel and Distributed Processing Symposium
- Ansel, J.¹ Arya, K.² Cooperman, G.³

5
- 84855356069
- I. T. Association, Infiniband
- I. T. Association, Infiniband, http://www.infinibandta.org/.

6
- 12444268370
- Architecture of LA-MPI, a network-fault-tolerant MPI
- R.T. Aulwes, D.J. Daniel, N.N. Desai, R.L. Graham, L.D. Risinger, M.A. Taylor, T. Woodall, M. Sukalski, Architecture of LA-MPI, a network-fault-tolerant MPI, in: IPDPS, 2004.
- (2004) IPDPS
- Aulwes, R.T.¹ Daniel, D.J.² Desai, N.N.³ Graham, R.L.⁴ Risinger, L.D.⁵ Taylor, M.A.⁶ Woodall, T.⁷ Sukalski, M.⁸

7
- 33845591573
- Performance assurance via software rejuvenation: Monitoring, statistics and algorithms
- A. Avritzer, A. Bondi, M. Grottke, K.S. Trivedi, E.J. Weyuker, Performance assurance via software rejuvenation: Monitoring, statistics and algorithms, in: Proc. International Conference on Dependable Systems and Networks, 2006, pp. 435-444.
- (2006) Proc. International Conference on Dependable Systems and Networks , pp. 435-444
- Avritzer, A.¹ Bondi, A.² Grottke, M.³ Trivedi, K.S.⁴ Weyuker, E.J.⁵

8
- 0344867889
- MOSIX: An integrated multiprocessor UNIX
- Berkeley, CA, USA
- A. Barak, R. Wheeler, MOSIX: An integrated multiprocessor UNIX, in: Proceedings of the Winter 1989 USENIX Conference, USENIX, Berkeley, CA, USA, 1989, pp. 101-112.
- (1989) Proceedings of the Winter 1989 USENIX Conference, USENIX , pp. 101-112
- Barak, A.¹ Wheeler, R.²

9
- 0038194608
- MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
- G. Bosilca, A. Bouteiller, and F. Cappello MPICH-V: toward a scalable fault tolerant MPI for volatile nodes Supercomputing 2002
- (2002) Supercomputing
- Bosilca, G.¹ Bouteiller, A.² Cappello, F.³

10
- 4544337911
- Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems
- J. Brevik, D. Nurmi, R. Wolski, Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems, in: IEEE International Symposium on Cluster Computing and the Grid, 2004, pp. 190-199.
- (2004) IEEE International Symposium on Cluster Computing and the Grid , pp. 190-199
- Brevik, J.¹ Nurmi, D.² Wolski, R.³

11
- 23944489879
- Process migration for MPI applications based on coordinated checkpoint
- J. Cao, Y. Li, M. Guo, Process migration for MPI applications based on coordinated checkpoint, in: ICPADS, 2005, pp. 306-312.
- (2005) ICPADS , pp. 306-312
- Cao, J.¹ Li, Y.² Guo, M.³

12
- 33847147616
- Proactive fault tolerance in large systems
- S. Chakravorty, C. Mendes, L. Kale, Proactive fault tolerance in large systems, in: HPCRI: 1st Workshop on High Performance Computing Reliability Issues, in: Proceedings of HPCA-11, 2005.
- (2005) HPCRI: 1st Workshop on High Performance Computing Reliability Issues, In: Proceedings of HPCA-11
- Chakravorty, S.¹ Mendes, C.² Kale, L.³

13
- 34548042452
- Proactive fault tolerance in MPI applications via task migration
- S. Chakravorty, C. Mendes, L. Kale, Proactive fault tolerance in MPI applications via task migration, in: HiPC, 2006.
- (2006) HiPC
- Chakravorty, S.¹ Mendes, C.² Kale, L.³

14
- 34548782109
- A fault tolerance protocol with fast fault recovery
- S. Chakravorty, C. Mendes, L. Kale, A fault tolerance protocol with fast fault recovery, in: IPDPS, 2007.
- (2007) IPDPS
- Chakravorty, S.¹ Mendes, C.² Kale, L.³

15
- 85059766484
- Live migration of virtual machines
- C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, A. Warfield, Live migration of virtual machines, in: NSDI, 2005.
- (2005) NSDI
- Clark, C.¹ Fraser, K.² Hand, S.³ Hansen, J.⁴ Jul, E.⁵ Limpach, C.⁶ Pratt, I.⁷ Warfield, A.⁸

16
- 0026205353
- Transparent process migration. Design alternatives and the Sprite implementation
- F. Douglis, and J.K. Ousterhout Transparent process migration: Design alternatives and the sprite implementation Softw. - Pract. Exp. 21 8 1991 757 785 (Pubitemid 21697317)
- (1991) Software - Practice and Experience , vol.21 , Issue.8 , pp. 757-785
- Douglis Fred¹ Ousterhout John²

17
- 12344277946
- The design and implementation of berkeley lab's linux checkpoint/restart
- Lawrence Berkeley National Laboratory
- J. Duell, The design and implementation of berkeley lab's linux checkpoint/restart, Tech. rep., Lawrence Berkeley National Laboratory (2000).
- (2000) Tech. Rep.
- Duell, J.¹

18
- 33751107476
- MPI-Mitten: Enabling migration technology in MPI
- C. Du, X.-H. Sun, MPI-Mitten: Enabling migration technology in MPI, in: IEEE CCGrid, 2006.
- (2006) IEEE CCGrid
- Du, C.¹ Sun, X.-H.²

19
- 84944901368
- HPCM: A pre-compiler aided middleware for the mobility of legacy code
- C. Du, X.-H. Sun, K. Chanchio, HPCM: A pre-compiler aided middleware for the mobility of legacy code, in: IEEE Cluster, 2003.
- (2003) IEEE Cluster
- Du, C.¹ Sun, X.-H.² Chanchio, K.³

20
- 34548361971
- Dynamic scheduling with process migration
- C. Du, X.-H. Sun, M. Wu, Dynamic scheduling with process migration, in: IEEE CCGrid, 2007.
- (2007) IEEE CCGrid
- Du, C.¹ Sun, X.-H.² Wu, M.³

21
- 0026867749
- Manetho: Transparent roll back-recovery with low overhead, limited rollback, and fast output commit
- E.N. Elnozahy, and W. Zwaenepoel Manetho: Transparent roll back-recovery with low overhead, limited rollback, and fast output commit IEEE Trans. Comput. 41 5 1992 526 531
- (1992) IEEE Trans. Comput. , vol.41 , Issue.5 , pp. 526-531
- Elnozahy, E.N.¹ Zwaenepoel, W.²

22
- 1542292472
- FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world
- G.E. Fagg, J.J. Dongarra, FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world, in: Euro PVM/MPI User's Group Meeting, vol. 1908, 2000, pp. 346-353.
- (2000) Euro PVM/MPI User's Group Meeting , vol.1908 , pp. 346-353
- Fagg, G.E.¹ Dongarra, J.J.²

23
- 33847171466
- Communication characteristics in the nas parallel benchmarks
- A. Faraj, X. Yuan, Communication characteristics in the nas parallel benchmarks, in: IASTED PDCS, 2002, pp. 724-729.
- (2002) IASTED PDCS , pp. 724-729
- Faraj, A.¹ Yuan, X.²

24
- 27644434963
- Lightweight monitoring of mpi programs in real time
- G. Florez, Z. Liu, S.M. Bridges, A. Skjellum, and R.B. Vaughn Lightweight monitoring of mpi programs in real time Concurr. Comput.: Pract. Exper. 2005
- (2005) Concurr. Comput.: Pract. Exper.
- Florez, G.¹ Liu, Z.² Bridges, S.M.³ Skjellum, A.⁴ Vaughn, R.B.⁵

25
- 84855345923
- Ganglia, http://ganglia.sourceforge.net/.
- Ganglia

26
- 65449136944
- The Google file system
- S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, in: SOSP'03, 2003, pp. 29-43.
- (2003) SOSP'03 , pp. 29-43
- Ghemawat, S.¹ Gobioff, H.² Leung, S.-T.³

27
- 33845434226
- Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
- R. Gioiosa, J.C. Sancho, S. Jiang, F. Petrini, Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers, in: Supercomputing, 2005.
- (2005) Supercomputing
- Gioiosa, R.¹ Sancho, J.C.² Jiang, S.³ Petrini, F.⁴

28
- 47249153592
- A meta-learning failure predictor for BlueGene/L systems
- P. Gujrati, Y. Li, Z. Lan, R. Thakur, J. White, A meta-learning failure predictor for BlueGene/L systems, in: ICPP, 2007.
- (2007) ICPP
- Gujrati, P.¹ Li, Y.² Lan, Z.³ Thakur, R.⁴ White, J.⁵

29
- 70350760088
- Toward predictive failure management for distributed stream processing systems
- X. Gu, S. Papadimitriou, P.S. Yu, S.-P. Chang, Toward predictive failure management for distributed stream processing systems, in: IEEE ICDCS, 2008.
- (2008) IEEE ICDCS
- Gu, X.¹ Papadimitriou, S.² Yu, P.S.³ Chang, S.-P.⁴

30
- 33845420448
- A power-aware run-time system for high-performance computing
- C.-H. Hsu, and W.-C. Feng A power-aware run-time system for high-performance computing Supercomputing 2005
- (2005) Supercomputing
- Hsu, C.-H.¹ Feng, W.-C.²

31
- 84855358802
- htop, http://htop.sourceforge.net/.

32
- 84855352602
- Performance evaluation of adaptive mpi
- C. Huang, G. Zheng, L. Kalé, S. Kumar, Performance evaluation of adaptive mpi, in: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006, pp. 12-21.
- (2006) ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , pp. 12-21
- Huang, C.¹

33
- 33846319754
- An agent oriented proactive fault-tolerant framework for grid computing
- M.T. Huda, H.W. Schmidt, I.D. Peake, An agent oriented proactive fault-tolerant framework for grid computing, in: International Conference on e-Science and Grid Computing, 2005.
- (2005) International Conference on E-Science and Grid Computing
- Huda, M.T.¹ Schmidt, H.W.² Peake, I.D.³

34
- 34548755483
- A checkpoint and restart service specification for Open MPI
- Indiana University, Computer Science Department
- J. Hursey, J.M. Squyres, A. Lumsdaine, A checkpoint and restart service specification for Open MPI, Technical report, Indiana University, Computer Science Department (2006).
- (2006) Technical Report
- Hursey, J.¹ Squyres, J.M.² Lumsdaine, A.³

35
- 34548789748
- The design and implementation of checkpoint/restart process fault tolerance for Open MPI
- J. Hursey, J.M. Squyres, T.I. Mattox, A. Lumsdaine, The design and implementation of checkpoint/restart process fault tolerance for Open MPI, in: DPDNS, 2007.
- (2007) DPDNS
- Hursey, J.¹ Squyres, J.M.² Mattox, T.I.³ Lumsdaine, A.⁴

36
- 0023960862
- Fine-grained mobility in the emerald system
- E. Jul, H.M. Levy, N.C. Hutchinson, and A.P. Black Fine-grained mobility in the emerald system ACM Trans. Comput. Syst. 6 1 1988 109 133
- (1988) ACM Trans. Comput. Syst. , vol.6 , Issue.1 , pp. 109-133
- Jul, E.¹ Levy, H.M.² Hutchinson, N.C.³ Black, A.P.⁴

37
- 84855353033
- O.R.N. Laboratory, Resources - national center for computational sciences (nccs), Jun. 2007
- O.R.N. Laboratory, Resources - national center for computational sciences (nccs), Jun. 2007. http://info.nccs.gov/resources/jaguar.

38
- 57049111494
- Adaptive fault management of parallel applications for high-performance computing
- Z. Lan, and Y. Li Adaptive fault management of parallel applications for high-performance computing IEEE Trans. Comput. 57 2008 1647 1660
- (2008) IEEE Trans. Comput. , vol.57 , pp. 1647-1660
- Lan, Z.¹ Li, Y.²

39
- 84855356066
- Volpexmpi: An mpi library for execution of parallel applications on volatile nodes
- T. LeBlanc, R. An, E. Gabriel, J. Subhlok, Volpexmpi: an mpi library for execution of parallel applications on volatile nodes, in: European PVM/MPI Users' Group Meeting, 2009, pp. 124-133.
- (2009) European PVM/MPI Users' Group Meeting , pp. 124-133
- Leblanc, T.¹ An, R.² Gabriel, E.³ Subhlok, J.⁴

40
- 47249092857
- Fault-driven re-scheduling for improving system-level fault resilience
- Y. Li, P. Gujrati, Z. Lan, X.-H. Sun, Fault-driven re-scheduling for improving system-level fault resilience, in: ICPP, 2007.
- (2007) ICPP
- Li, Y.¹ Gujrati, P.² Lan, Z.³ Sun, X.-H.⁴

41
- 67649883517
- Fault-aware runtime strategies for high-performance computing
- Y. Li, Z. Lan, P. Gujrati, and X.-H. Sun Fault-aware runtime strategies for high-performance computing IEEE Trans. Parallel Distrib. Syst. 20 2009 460 473
- (2009) IEEE Trans. Parallel Distrib. Syst. , vol.20 , pp. 460-473
- Li, Y.¹ Lan, Z.² Gujrati, P.³ Sun, X.-H.⁴

42
- 0002695959
- Remote unix - Turning idle workstations into cycle servers
- M. Litzkow, Remote unix - turning idle workstations into cycle servers, in: Usenix Summer Conference, 1987, pp. 381-384.
- (1987) Usenix Summer Conference , pp. 381-384
- Litzkow, M.¹

43
- 0003912256
- Checkpoint and migration of UNIX processes in the Condor distributed processing system
- University of Wisconsin - Madison Computer Sciences Department, April
- M. Litzkow, T. Tannenbaum, J. Basney, M. Livny, Checkpoint and migration of UNIX processes in the Condor distributed processing system, Tech. Rep. UW-CS-TR-1346, University of Wisconsin - Madison Computer Sciences Department, April 1997.
- (1997) Tech. Rep. UW-CS-TR-1346
- Litzkow, F.M.¹ Tannenbaum, T.² Basney, J.³ Livny, M.⁴

44
- 57349155964
- High performance vmm-bypass I/O in virtual machines
- J. Liu, W. Huang, B. Abali, D. Panda, High performance vmm-bypass I/O in virtual machines, in: USENIX Conference, 2006.
- (2006) USENIX Conference
- Liu, J.¹ Huang, W.² Abali, B.³ Panda, D.⁴

45
- 79951788489
- Incremental checkpointing for grids
- J. Mehnert-Spahn, E. Feller, M. Schoettner, Incremental checkpointing for grids, in: Linux Symposium, 2009.
- (2009) Linux Symposium
- Mehnert-Spahn, J.¹ Feller, E.² Schoettner, M.³

46
- 84883336377
- Optimizing network virtualization in Xen
- A. Menon, A. Cox, W. Zwaenepoel, Optimizing network virtualization in Xen, in: USENIX Conference, 2006.
- (2006) USENIX Conference
- Menon, A.¹ Cox, A.² Zwaenepoel, W.³

47
- 0345044000
- Process migration
- D.S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou Process migration ACM Comput. Surv. (CSUR) 32 3 2000 241 299
- (2000) ACM Comput. Surv. (CSUR) , vol.32 , Issue.3 , pp. 241-299
- Milojicic, D.S.¹ Douglis, F.² Paindaveine, Y.³ Wheeler, R.⁴ Zhou, S.⁵

48
- 84855342531
- mpip: Lightweight, scalable mpi profiling
- mpip: Lightweight, scalable mpi profiling, http://mpip.sourceforge.net/.

49
- 50649104305
- Proactive fault tolerance for HPC with Xen virtualization
- Dept. of Computer Science, North Carolina State University
- A.B. Nagarajan, F. Mueller, Proactive fault tolerance for HPC with Xen virtualization, Tech. Rep. TR 2007-1, Dept. of Computer Science, North Carolina State University (2007).
- (2007) Tech. Rep. TR 2007-1
- Nagarajan, A.B.¹ Mueller, F.²

50
- 34548046749
- Proactive fault tolerance for HPC with Xen virtualization
- A.B. Nagarajan, F. Mueller, Proactive fault tolerance for HPC with Xen virtualization, in: ICS, 2007.
- (2007) ICS
- Nagarajan, A.B.¹ Mueller, F.²

51
- 12444257746
- Fault-aware job scheduling for BlueGene/L systems
- A. Oliner, R. Sahoo, J. Moreira, M. Gupta, A. Sivasubramaniam, Fault-aware job scheduling for BlueGene/L systems, in: IPDPS, 2004.
- (2004) IPDPS
- Oliner, A.¹ Sahoo, R.² Moreira, J.³ Gupta, M.⁴ Sivasubramaniam, A.⁵

52
- 78649483996
- Rdma-based job migration framework for mpi over infiniband
- X. Ouyang, S. Marcarelli, R. Rajachandrasekar, D. Panda, Rdma-based job migration framework for mpi over infiniband, in: Cluster, 2010.
- (2010) Cluster
- Ouyang, X.¹ Marcarelli, S.² Rajachandrasekar, R.³ Panda, D.⁴

53
- 84855342532
- Performance application programming interface
- Performance application programming interface, http://icl.cs.utk.edu/papi/.

54
- 77951478277
- Software failures and the road to a petaflop machine
- Proceedings of HPCA-11 IEEE Computer Society
- I. Philp Software failures and the road to a petaflop machine HPCRI: 1st Workshop on High Performance Computing Reliability Issues Proceedings of HPCA-11 2005 IEEE Computer Society
- (2005) HPCRI: 1st Workshop on High Performance Computing Reliability Issues
- Philp, I.¹

55
- 84855356060
- Loop Profiling Tool for HPC Code Inspection as An Efficient Method of FPGA Based Acceleration
- M. Pietro, P. Russek, K. Wiatr, Loop Profiling Tool For HPC Code Inspection as An Efficient Method of FPGA Based Acceleration, Int. J. Appl. Math. Comput. Sci., 1010.
- Int. J. Appl. Math. Comput. Sci. , pp. 1010
- Pietro, M.¹ Russek, P.² Wiatr, K.³

56
- 85084159983
- Libckpt: Transparent checkpointing under Unix
- J.S. Plank, M. Beck, G. Kingsley, K. Li, Libckpt: Transparent checkpointing under Unix, in: Usenix Winter Technical Conference, 1995, pp. 213-223.
- (1995) Usenix Winter Technical Conference , pp. 213-223
- Plank, J.S.¹ Beck, M.² Kingsley, G.³ Li, K.⁴

57
- 84962012351
- Process migration in DEMOS/MP
- M.L. Powell, B.P. Miller, Process migration in DEMOS/MP, in: Symposium on Operating Systems Principles, 1983.
- (1983) Symposium on Operating Systems Principles
- Powell, M.L.¹ Miller, B.P.²

58
- 33746127333
- Terrestrial-based radiation upsets: A cautionary tale
- H. Quinn, P. Graham, Terrestrial-based radiation upsets: A cautionary tale, in: FCCM 05, 2005.
- (2005) FCCM 05
- Quinn, H.¹ Graham, P.²

59
- 70350769584
- Toward efficient failure detection and recovery in HPC
- S. Rani, C. Leangsuksun, A. Tikotekar, V. Rampure, S. Scott, Toward efficient failure detection and recovery in HPC, in: High Availability and Performance Computing Workshop, 2006.
- (2006) High Availability and Performance Computing Workshop
- Rani, S.¹ Leangsuksun, C.² Tikotekar, A.³ Rampure, V.⁴ Scott, S.⁵

60
- 84855353031
- Readable dirty-bits for IA64 linux
- Readable dirty-bits for IA64 linux, https://www.gelato.unsw.edu.au/archives/gelato-technical/2005-November/001080.html.

61
- 84855353030
- Transparent real-time monitoring in mpi
- S.H. Russ, R. Jean-Baptiste, T.S. Kumar, and M. Harmon Transparent real-time monitoring in mpi Springer 1999
- (1999) Springer
- Russ, S.H.¹ Jean-Baptiste, R.² Kumar, T.S.³ Harmon, M.⁴

62
- 77952378080
- Critical event prediction for proactive management in large-scale computer clusters
- R. Sahoo, A. Oliner, I. Rish, M. Gupta, J. Moreira, S. Ma, R. Vilalta, A. Sivasubramaniam, Critical event prediction for proactive management in large-scale computer clusters, in: KDD '03, 2003.
- (2003) KDD '03
- Sahoo, R.¹ Oliner, A.² Rish, I.³ Gupta, M.⁴ Moreira, J.⁵ Ma, S.⁶ Vilalta, R.⁷ Sivasubramaniam, A.⁸

63
- 20444444457
- The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- S. Sankaran, J.M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, E. Roman, The LAM/MPI checkpoint/restart framework: System-initiated checkpointing, in: LACSI, 2003.
- (2003) LACSI
- Sankaran, S.¹ Squyres, J.M.² Barrett, B.³ Lumsdaine, A.⁴ Duell, J.⁵ Hargrove, P.⁶ Roman, E.⁷

64
- 70449657893
- Dram errors in the wild: A large-scale field study
- B. Schroeder, E. Pinheiro, W.-D. Weber, Dram errors in the wild: a large-scale field study, SIGMETRICS'09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems.
- SIGMETRICS'09: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems
- Schroeder, B.¹ Pinheiro, E.² Weber, W.-D.³

65
- 33750936415
- Availability modeling and analysis on high performance cluster computing systems
- H. Song, C. Leangsuksun, R. Nassar, Availability modeling and analysis on high performance cluster computing systems, in: ARES, 2006, pp. 305-313.
- (2006) ARES , pp. 305-313
- Song, H.¹ Leangsuksun, C.² Nassar, R.³

66
- 35248827046
- Lecture Notes in Computer Science Springer-Verlag Venice, Italy
- J.M. Squyres, and A. Lumsdaine A Component Architecture for LAM/MPI Lecture Notes in Computer Science vol. 2840 2003 Springer-Verlag Venice, Italy 379 387
- (2003) A Component Architecture for LAM/MPI , vol.2840 , pp. 379-387
- Squyres, J.M.¹ Lumsdaine, A.²

67
- 0029713612
- CoCheck: Checkpointing and process migration for MPI
- G. Stellner, CoCheck: checkpointing and process migration for MPI, in: Proceedings of IPPS '96, 1996.
- (1996) Proceedings of IPPS '96
- Stellner, G.¹

68
- 70349736737
- Towards a fault-aware computing environment
- X.-H. Sun, Z. Lan, Y. Li, H. Jin, Z. Zheng, Towards a fault-aware computing environment, in: HAPCW, 2008.
- (2008) HAPCW
- Sun, X.-H.¹ Lan, Z.² Li, Y.³ Jin, H.⁴ Zheng, Z.⁵

69
- 0002801064
- Preemptable remote execution facilities for the V-System
- M. Theimer, K.A. Lantz, D.R. Cheriton, Preemptable remote execution facilities for the V-System, in: SOSP, 1985, pp. 2-12.
- (1985) SOSP , pp. 2-12
- Theimer, M.¹ Lantz, K.A.² Cheriton, D.R.³

70
- 34548175984
- On the survivability of standard MPI applications
- A. Tikotekar, C. Leangsuksun, S.L. Scott, On the survivability of standard MPI applications, in: LCI International Conference on Linux Clusters: The HPC Revolution, 2006.
- (2006) LCI International Conference on Linux Clusters: The HPC Revolution
- Tikotekar, A.¹ Leangsuksun, C.² Scott, S.L.³

71
- 53349098075
- Evaluation of fault-tolerant policies using simulation
- A. Tikotekar, G. Vallée, T. Naughton, S.L. Scott, C. Leangsuksun, Evaluation of fault-tolerant policies using simulation, in: IEEE Cluster, 2007.
- (2007) IEEE Cluster
- Tikotekar, A.¹

72
- 84906512472
- Towards fault resilient global arrays
- V. Tipparaju, M. Krishnan, B. Palmer, F. Petrini, J. Nieplocha, Towards fault resilient global arrays, in: Parallel computing: architectures, algorithms, and applications, 2008.
- (2008) Parallel Computing: Architectures, Algorithms, and Applications
- Tipparaju, V.¹ Krishnan, M.² Palmer, B.³ Petrini, F.⁴ Nieplocha, J.⁵

73
- 84855342530
- Top500 supercomputer sites
- Top500 supercomputer sites, http://www.top500.org/.

74
- 49049111154
- A framework for proactive fault tolerance
- G. Vallée, K. Charoenpornwattana, C. Engelmann, A. Tikotekar, C.B. Leangsuksun, T. Naughton, S.L. Scott, A framework for proactive fault tolerance, in: ARES, 2007, pp. 659-664.
- (2007) ARES , pp. 659-664
- Vallée, G.¹

75
- 33847733544
- Ghost process: A sound basis to implement process duplication, migration and checkpoint/restart in linux clusters
- G. Vallee, R. Lottiaux, D. Margery, C. Morin, J.-Y. Berthou, Ghost process: a sound basis to implement process duplication, migration and checkpoint/restart in linux clusters, in: ISPDC, 2005.
- (2005) ISPDC
- Vallee, G.¹ Lottiaux, R.² Margery, D.³ Morin, C.⁴ Berthou, J.-Y.⁵

76
- 84855356058
- Scalable, fault-tolerant membership for MPI tasks on hpc systems
- J. Varma, C. Wang, F. Mueller, C. Engelmann, S.L. Scott, Scalable, fault-tolerant membership for MPI tasks on hpc systems, in: International Conference on Supercomputing, 2006, pp. 219-228.
- (2006) International Conference on Supercomputing , pp. 219-228
- Varma, J.¹ Wang, C.² Mueller, F.³ Engelmann, C.⁴ Scott, S.L.⁵

77
- 84855356059
- Master's thesis, Dept. of CS, North Carolina State University, Aug.
- M. Vasavada, Innovative schemes to support incremental checkpointing, Master's thesis, Dept. of CS, North Carolina State University, Aug. 2010.
- (2010) Innovative Schemes to Support Incremental Checkpointing
- Vasavada, M.¹

78
- 34548768671
- A job pause service under LAM/MPI+BLCR for transparent fault tolerance
- C. Wang, F. Mueller, C. Engelmann, S. Scott, A job pause service under LAM/MPI+BLCR for transparent fault tolerance, in: IPDPS, 2007.
- (2007) IPDPS
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.⁴

79
- 85014969248
- Architectural requirements and scalability of the NAS parallel benchmarks
- F. Wong, R. Martin, R. Arpaci-Dusseau, D. Culler, Architectural requirements and scalability of the NAS parallel benchmarks, in: Supercomputing, 1999.
- (1999) Supercomputing
- Wong, F.¹ Martin, R.² Arpaci-Dusseau, R.³ Culler, D.⁴

80
- 84976846528
- A first order approximation to the optimum checkpoint interval
- 10.1145/361147.361115
- J.W. Young A first order approximation to the optimum checkpoint interval Commun. ACM 17 9 1974 530 531 10.1145/361147.361115
- (1974) Commun. ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.W.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.