Volume 72, Issue 2, 2012, Pages 254-267

Proactive process-level live migration and back migration in HPC environments

Author keywords

Back migration; Fault tolerance; Health monitoring; High performance computing; Live migration

Indexed keywords

BACK MIGRATION; EXECUTION ENVIRONMENTS; HEALTH MONITORING; HIGH-PERFORMANCE COMPUTING; LIVE MIGRATIONS; LOAD IMBALANCE; NODE FAILURE; PROCESS LEVELS; PROCESS MIGRATION; SELF-HEALING; VIRTUALIZATIONS;

EID: 84855350553     PISSN: 07437315     EISSN: None     Source Type: Journal    
DOI: 10.1016/j.jpdc.2011.10.009     Document Type: Article
Times cited: 31

References (80)
  • 1
    • 84855353032
    • Advanced configuration & power interface
    • Advanced configuration & power interface, http://www.acpi.info.
  • 2
    • 84870548923
    • An overview of the BlueGene/L supercomputer
    • N. Adiga An overview of the BlueGene/L supercomputer Supercomputing 2002
    • (2002) Supercomputing
    • Adiga, N.1
  • 3
    • 28044457320
    • Monitoring hard disk with smart
    • B. Allen, Monitoring hard disk with smart, Linux Journal, 2004.
    • (2004) Linux Journal
    • Allen, B.1
  • 5
    • 84855356069
    • I. T. Association, Infiniband
    • I. T. Association, Infiniband, http://www.infinibandta.org/.
  • 9
    • 0038194608
    • MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
    • G. Bosilca, A. Boutellier, and F. Cappello MPICH-V: toward a scalable fault tolerant MPI for volatile nodes Supercomputing 2002
    • (2002) Supercomputing
    • Bosilca, G.1    Boutellier, A.2    Cappello, F.3
  • 11
    • 23944489879
    • Process migration for MPI applications based on coordinated checkpoint
    • J. Cao, Y. Li, M. Guo, Process migration for MPI applications based on coordinated checkpoint, in: ICPADS, 2005, pp. 306-312.
    • (2005) ICPADS , pp. 306-312
    • Cao, J.1    Li, Y.2    Guo, M.3
  • 13
    • 34548042452
    • Proactive fault tolerance in MPI applications via task migration
    • S. Chakravorty, C. Mendes, L. Kale, Proactive fault tolerance in MPI applications via task migration, in: HiPC, 2006.
    • (2006) HiPC
    • Chakravorty, S.1    Mendes, C.2    Kale, L.3
  • 14
    • 34548782109
    • A fault tolerance protocol with fast fault recovery
    • S. Chakravorty, C. Mendes, L. Kale, A fault tolerance protocol with fast fault recovery, in: IPDPS, 2007.
    • (2007) IPDPS
    • Chakravorty, S.1    Mendes, C.2    Kale, L.3
  • 16
    • 0026205353
    • Transparent process migration. Design alternatives and the Sprite implementation
    • F. Douglis, and J.K. Ousterhout Transparent process migration: Design alternatives and the sprite implementation Softw. - Pract. Exp. 21 8 1991 757 785 (Pubitemid 21697317)
    • (1991) Software - Practice and Experience , vol.21 , Issue.8 , pp. 757-785
    • Douglis Fred1    Ousterhout John2
  • 17
    • 12344277946
    • The design and implementation of berkeley lab's linux checkpoint/restart
    • Lawrence Berkeley National Laboratory
    • J. Duell, The design and implementation of berkeley lab's linux checkpoint/restart, Tech. rep., Lawrence Berkeley National Laboratory (2000).
    • (2000) Tech. Rep.
    • Duell, J.1
  • 18
    • 33751107476
    • MPI-Mitten: Enabling migration technology in MPI
    • C. Du, X.-H. Sun, MPI-Mitten: Enabling migration technology in MPI, in: IEEE CCGrid, 2006.
    • (2006) IEEE CCGrid
    • Du, C.1    Sun, X.-H.2
  • 19
    • 84944901368
    • HPCM: A pre-compiler aided middleware for the mobility of legacy code
    • C. Du, X.-H. Sun, K. Chanchio, HPCM: A pre-compiler aided middleware for the mobility of legacy code, in: IEEE Cluster, 2003.
    • (2003) IEEE Cluster
    • Du, C.1    Sun, X.-H.2    Chanchio, K.3
  • 20
    • 34548361971
    • Dynamic scheduling with process migration
    • C. Du, X.-H. Sun, M. Wu, Dynamic scheduling with process migration, in: IEEE CCGrid, 2007.
    • (2007) IEEE CCGrid
    • Du, C.1    Sun, X.-H.2    Wu, M.3
  • 21
    • 0026867749
    • Manetho: Transparent roll back-recovery with low overhead, limited rollback, and fast output commit
    • E.N. Elnozahy, and W. Zwaenepoel Manetho: Transparent roll back-recovery with low overhead, limited rollback, and fast output commit IEEE Trans. Comput. 41 5 1992 526 531
    • (1992) IEEE Trans. Comput. , vol.41 , Issue.5 , pp. 526-531
    • Elnozahy, E.N.1    Zwaenepoel, W.2
  • 22
    • 1542292472
    • FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world
    • G.E. Fagg, J.J. Dongarra, FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world, in: Euro PVM/MPI User's Group Meeting, vol. 1908, 2000, pp. 346-353.
    • (2000) Euro PVM/MPI User's Group Meeting , vol.1908 , pp. 346-353
    • Fagg, G.E.1    Dongarra, J.J.2
  • 23
    • 33847171466
    • Communication characteristics in the nas parallel benchmarks
    • A. Faraj, X. Yuan, Communication characteristics in the nas parallel benchmarks, in: IASTED PDCS, 2002, pp. 724-729.
    • (2002) IASTED PDCS , pp. 724-729
    • Faraj, A.1    Yuan, X.2
  • 25
    • 84855345923
    • Ganglia, http://ganglia.sourceforge.net/.
    • Ganglia
  • 27
    • 33845434226
    • Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
    • R. Gioiosa, J.C. Sancho, S. Jiang, F. Petrini, Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers, in: Supercomputing, 2005.
    • (2005) Supercomputing
    • Gioiosa, R.1    Sancho, J.C.2    Jiang, S.3    Petrini, F.4
  • 28
    • 47249153592
    • A meta-learning failure predictor for BlueGene/L systems
    • P. Gujrati, Y. Li, Z. Lan, R. Thakur, J. White, A meta-learning failure predictor for BlueGene/L systems, in: ICPP, 2007.
    • (2007) ICPP
    • Gujrati, P.1    Li, Y.2    Lan, Z.3    Thakur, R.4    White, J.5
  • 29
    • 70350760088
    • Toward predictive failure management for distributed stream processing systems
    • X. Gu, S. Papadimitriou, P.S. Yu, S.-P. Chang, Toward predictive failure management for distributed stream processing systems, in: IEEE ICDCS, 2008.
    • (2008) IEEE ICDCS
    • Gu, X.1    Papadimitriou, S.2    Yu, P.S.3    Chang, S.-P.4
  • 30
    • 33845420448
    • A power-aware run-time system for high-performance computing
    • C.-H. Hsu, and W.-C. Feng A power-aware run-time system for high-performance computing Supercomputing 2005
    • (2005) Supercomputing
    • Hsu, C.-H.1    Feng, W.-C.2
  • 31
    • 84855358802
    • htop, http://htop.sourceforge.net/.
  • 34
    • 34548755483
    • A checkpoint and restart service specification for Open MPI
    • Indiana University, Computer Science Department
    • J. Hursey, J.M. Squyres, A. Lumsdaine, A checkpoint and restart service specification for Open MPI, Technical report, Indiana University, Computer Science Department (2006).
    • (2006) Technical Report
    • Hursey, J.1    Squyres, J.M.2    Lumsdaine, A.3
  • 35
    • 34548789748
    • The design and implementation of checkpoint/restart process fault tolerance for Open MPI
    • J. Hursey, J.M. Squyres, T.I. Mattox, A. Lumsdaine, The design and implementation of checkpoint/restart process fault tolerance for Open MPI, in: DPDNS, 2007.
    • (2007) DPDNS
    • Hursey, J.1    Squyres, J.M.2    Mattox, T.I.3    Lumsdaine, A.4
  • 37
    • 84855353033
    • O.R.N. Laboratory, Resources - national center for computational sciences (nccs), Jun. 2007
    • O.R.N. Laboratory, Resources - national center for computational sciences (nccs), Jun. 2007. http://info.nccs.gov/resources/jaguar.
  • 38
    • 57049111494
    • Adaptive fault management of parallel applications for high-performance computing
    • Z. Lan, and Y. Li Adaptive fault management of parallel applications for high-performance computing IEEE Trans. Comput. 57 2008 1647 1660
    • (2008) IEEE Trans. Comput. , vol.57 , pp. 1647-1660
    • Lan, Z.1    Li, Y.2
  • 39
    • 84855356066
    • Volpexmpi: An mpi library for execution of parallel applications on volatile nodes
    • T. LeBlanc, R. An, E. Gabriel, J. Subhlok, Volpexmpi: an mpi library for execution of parallel applications on volatile nodes, in: European PVM/MPI Users' Group Meeting, 2009, pp. 124-133.
    • (2009) European PVM/MPI Users' Group Meeting , pp. 124-133
    • Leblanc, T.1    An, R.2    Gabriel, E.3    Subhlok, J.4
  • 40
    • 47249092857
    • Fault-driven re-scheduling for improving system-level fault resilience
    • Y. Li, P. Gujrati, Z. Lan, X.-H. Sun, Fault-driven re-scheduling for improving system-level fault resilience, in: ICPP, 2007.
    • (2007) ICPP
    • Li, Y.1    Gujrati, P.2    Lan, Z.3    Sun, X.-H.4
  • 42
    • 0002695959
    • Remote unix - Turning idle workstations into cycle servers
    • M. Litzkow, Remote unix - turning idle workstations into cycle servers, in: Usenix Summer Conference, 1987, pp. 381-384.
    • (1987) Usenix Summer Conference , pp. 381-384
    • Litzkow, M.1
  • 43
    • 0003912256
    • Checkpoint and migration of UNIX processes in the Condor distributed processing system
    • University of Wisconsin - Madison Computer Sciences Department, April
    • M. Litzkow, T. Tannenbaum, J. Basney, M. Livny, Checkpoint and migration of UNIX processes in the Condor distributed processing system, Tech. Rep. UW-CS-TR-1346, University of Wisconsin - Madison Computer Sciences Department, April 1997.
    • (1997) Tech. Rep. UW-CS-TR-1346
    • Litzkow, F.M.1    Tannenbaum, T.2    Basney, J.3    Livny, M.4
  • 48
    • 84855342531
    • mpip: Lightweight, scalable mpi profiling
    • mpip: Lightweight, scalable mpi profiling, http://mpip.sourceforge.net/.
  • 49
    • 50649104305
    • Proactive fault tolerance for HPC with Xen virtualization
    • Dept. of Computer Science, North Carolina State University
    • A.B. Nagarajan, F. Mueller, Proactive fault tolerance for HPC with Xen virtualization, Tech. Rep. TR 2007-1, Dept. of Computer Science, North Carolina State University (2007).
    • (2007) Tech. Rep. TR 2007-1
    • Nagarajan, A.B.1    Mueller, F.2
  • 50
    • 34548046749
    • Proactive fault tolerance for HPC with Xen virtualization
    • A.B. Nagarajan, F. Mueller, Proactive fault tolerance for HPC with Xen virtualization, in: ICS, 2007.
    • (2007) ICS
    • Nagarajan, A.B.1    Mueller, F.2
  • 53
    • 84855342532
    • Performance application programming interface
    • Performance application programming interface, http://icl.cs.utk.edu/papi/.
  • 55
    • 84855356060
    • Loop Profiling Tool for HPC Code Inspection as An Efficient Method of FPGA Based Acceleration
    • M. Pietro, P. Russek, K. Wiatr, Loop Profiling Tool For HPC Code Inspection as An Efficient Method of FPGA Based Acceleration, Int. J. Appl. Math. Comput. Sci., 1010.
    • Int. J. Appl. Math. Comput. Sci. , pp. 1010
    • Pietro, M.1    Russek, P.2    Wiatr, K.3
  • 58
    • 33746127333
    • Terrestrial-based radiation upsets: A cautionary tale
    • H. Quinn, P. Graham, Terrestrial-based radiation upsets: A cautionary tale, in: FCCM 05, 2005.
    • (2005) FCCM 05
    • Quinn, H.1    Graham, P.2
  • 60
    • 84855353031
    • Readable dirty-bits for IA64 linux
    • Readable dirty-bits for IA64 linux, https://www.gelato.unsw.edu.au/archives/gelato-technical/2005-November/001080.html.
  • 65
    • 33750936415
    • Availability modeling and analysis on high performance cluster computing systems
    • H. Song, C. Leangsuksun, R. Nassar, Availability modeling and analysis on high performance cluster computing systems, in: ARES, 2006, pp. 305-313.
    • (2006) ARES , pp. 305-313
    • Song, H.1    Leangsuksun, C.2    Nassar, R.3
  • 66
    • 35248827046
    • A Component Architecture for LAM/MPI
    • J.M. Squyres, A. Lumsdaine, A Component Architecture for LAM/MPI, Lecture Notes in Computer Science, vol. 2840, Springer-Verlag, Venice, Italy, 2003, pp. 379-387.
    • (2003) Lecture Notes in Computer Science , vol.2840 , pp. 379-387
    • Squyres, J.M.1    Lumsdaine, A.2
  • 67
    • 0029713612
    • CoCheck: Checkpointing and process migration for MPI
    • G. Stellner, CoCheck: checkpointing and process migration for MPI, in: Proceedings of IPPS '96, 1996.
    • (1996) Proceedings of IPPS '96
    • Stellner, G.1
  • 69
    • 0002801064
    • Preemptable remote execution facilities for the V-System
    • M. Theimer, K.A. Lantz, D.R. Cheriton, Preemptable remote execution facilities for the V-System, in: SOSP, 1985, pp. 2-12.
    • (1985) SOSP , pp. 2-12
    • Theimer, M.1    Lantz, K.A.2    Cheriton, D.R.3
  • 71
    • 53349098075
    • Evaluation of fault-tolerant policies using simulation
    • A. Tikotekar, G. Vallée, T. Naughton, S.L. Scott, C. Leangsuksun, Evaluation of fault-tolerant policies using simulation, in: IEEE Cluster, 2007.
    • (2007) IEEE Cluster
    • Tikotekar, A.1
  • 73
    • 84855342530
    • Top500 supercomputer sites
    • Top500 supercomputer sites, http://www.top500.org/.
  • 74
    • 49049111154
    • A framework for proactive fault tolerance
    • G. Vallée, K. Charoenpornwattana, C. Engelmann, A. Tikotekar, C.B. Leangsuksun, T. Naughton, S.L. Scott, A framework for proactive fault tolerance, in: ARES, 2007, pp. 659-664.
    • (2007) ARES , pp. 659-664
    • Vallée, G.1
  • 75
    • 33847733544
    • Ghost process: A sound basis to implement process duplication, migration and checkpoint/restart in linux clusters
    • G. Vallee, R. Lottiaux, D. Margery, C. Morin, J.-Y. Berthou, Ghost process: a sound basis to implement process duplication, migration and checkpoint/restart in linux clusters, in: ISPDC, 2005.
    • (2005) ISPDC
    • Vallee, G.1    Lottiaux, R.2    Margery, D.3    Morin, C.4    Berthou, J.-Y.5
  • 78
    • 34548768671
    • A job pause service under LAM/MPI+BLCR for transparent fault tolerance
    • C. Wang, F. Mueller, C. Engelmann, S. Scott, A job pause service under LAM/MPI+BLCR for transparent fault tolerance, in: IPDPS, 2007.
    • (2007) IPDPS
    • Wang, C.1    Mueller, F.2    Engelmann, C.3    Scott, S.4
  • 79
    • 85014969248
    • Architectural requirements and scalability of the NAS parallel benchmarks
    • F. Wong, R. Martin, R. Arpaci-Dusseau, D. Culler, Architectural requirements and scalability of the NAS parallel benchmarks, in: Supercomputing, 1999.
    • (1999) Supercomputing
    • Wong, F.1    Martin, R.2    Arpaci-Dusseau, R.3    Culler, D.4
  • 80
    • 84976846528
    • A first order approximation to the optimum checkpoint interval
    • 10.1145/361147.361115
    • J.W. Young A first order approximation to the optimum checkpoint interval Commun. ACM 17 9 1974 530 531 10.1145/361147.361115
    • (1974) Commun. ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.W.1


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.