-
1
-
-
70449844295
-
DMTCP: Transparent checkpointing for cluster computations and the desktop
-
Rome, Italy
-
Ansel J, Arya K, Cooperman G (2009) DMTCP: transparent checkpointing for cluster computations and the desktop. In: 23rd IEEE international parallel and distributed processing symposium, Rome, Italy, pp 1-12
-
(2009)
23rd IEEE International Parallel and Distributed Processing Symposium
, pp. 1-12
-
-
Ansel, J.1
Arya, K.2
Cooperman, G.3
-
4
-
-
84881368496
-
-
[Online]
-
Blackham B (2005) [Online]. Available: http://cryopid.berlios.de/
-
(2005)
-
-
Blackham, B.1
-
5
-
-
0038194608
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
Bosilca G, Bouteiller A, Cappello F et al (2002) MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In: IEEE/ACM SIGARCH
-
(2002)
IEEE/ACM SIGARCH
-
-
Bosilca, G.1
Bouteiller, A.2
Cappello, F.3
-
8
-
-
68249127079
-
Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
-
10.1177/1094342009106189
-
Cappello F (2009) Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int J High Perform Comput Appl 23:212-226
-
(2009)
Int J High Perform Comput Appl
, vol.23
, pp. 212-226
-
-
Cappello, F.1
-
9
-
-
70450206305
-
Toward exascale resilience
-
10.1177/1094342009347767
-
Cappello F, Geist A, Gropp B, Kale L, Kramer B, Snir M (2009) Toward exascale resilience. Int J High Perform Comput Appl 23(4):378-388
-
(2009)
Int J High Perform Comput Appl
, vol.23
, Issue.4
, pp. 378-388
-
-
Cappello, F.1
Geist, A.2
Gropp, B.3
Kale, L.4
Kramer, B.5
Snir, M.6
-
10
-
-
84881375542
-
-
CFDR [Online]
-
CFDR (2012) [Online]. Available: http://cfdr.usenix.org/
-
(2012)
-
-
-
11
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
10.1145/214451.214456
-
Chandy KM, Lamport L (1985) Distributed snapshots: determining global states of distributed systems. ACM Trans Comput Syst 3(1):63-75
-
(1985)
ACM Trans Comput Syst
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
12
-
-
84881374699
-
-
Checkpointing.org [Online]
-
Checkpointing.org (2012) Checkpointing [Online]. Available: http://checkpointing.org
-
(2012)
Checkpointing
-
-
-
14
-
-
0003052123
-
-
June, Toulouse, France
-
Chen L, Avizienis A (1978) N-version programming: a fault-tolerance approach to reliability of software operation, June, Toulouse, France, pp 3-9
-
(1978)
N-version Programming: A Fault-tolerance Approach to Reliability of Software Operation
, pp. 3-9
-
-
Chen, L.1
Avizienis, A.2
-
16
-
-
85059766484
-
Live migration of virtual machines
-
vol 2, May 2005
-
Clark C, Fraser K, Hand S et al (2005) Live migration of virtual machines. In: Proceedings of the 2nd conference on symposium on networked systems design and implementation, vol 2, May 2005, pp 273-286
-
(2005)
Proceedings of the 2nd Conference on Symposium on Networked Systems Design and Implementation
, pp. 273-286
-
-
Clark, C.1
Fraser, K.2
Hand, S.3
-
18
-
-
0026104130
-
Understanding fault-tolerant distributed systems
-
10.1145/102792.102801
-
Cristian F (1991) Understanding fault-tolerant distributed systems. Commun ACM 34(2):56-88
-
(1991)
Commun ACM
, vol.34
, Issue.2
, pp. 56-88
-
-
Cristian, F.1
-
23
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
10.1145/568522.568525
-
Elnozahy ENM, Alvisi L, Wang YM, Johnson DB (2002) A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv 34(3):375-408
-
(2002)
ACM Comput Surv
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4
-
25
-
-
84881375783
-
-
Fault tolerance, wikipedia [Online]
-
Fault tolerance, wikipedia (2012) [Online]. Available: http://en.wikipedia.org/wiki/Fault-tolerant-system
-
(2012)
-
-
-
26
-
-
84881374465
-
-
Fusion-IO [Online]
-
Fusion-IO (2012) [Online]. Available: http://www.rpmgmbh.com/download/Whitepaper-Green.pdf
-
(2012)
-
-
-
28
-
-
84881368293
-
-
esky [Online]
-
Gibson D (2012) esky [Online]. Available: http://esky.sourceforge.net
-
(2012)
-
-
Gibson, D.1
-
30
-
-
0025505070
-
A census of tandem system availability between 1985 and 1990
-
10.1109/24.58719
-
Gray J (1990) A census of tandem system availability between 1985 and 1990. IEEE Trans Reliab 39(4):409-418
-
(1990)
IEEE Trans Reliab
, vol.39
, Issue.4
, pp. 409-418
-
-
Gray, J.1
-
31
-
-
85084162186
-
World-wide web cache consistency
-
San Diego, CA Jan 1996
-
Gwertzman J, Seltzer M (1996) World-wide web cache consistency. In: Proc 1996 USENIX tech conf, San Diego, CA, Jan 1996, pp 141-152
-
(1996)
Proc 1996 USENIX Tech Conf
, pp. 141-152
-
-
Gwertzman, J.1
Seltzer, M.2
-
33
-
-
84881374755
-
-
InfiniBand [Online]
-
InfiniBand (2012) [Online]. Available: http://www.infinibandta.org/
-
(2012)
-
-
-
38
-
-
0017996760
-
Time, clocks, and the ordering of events in a distributed system
-
0378.68027 10.1145/359545.359563
-
Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21:558-565
-
(1978)
Commun ACM
, vol.21
, pp. 558-565
-
-
Lamport, L.1
-
39
-
-
0025457846
-
Definition and analysis of hardware- and software-fault-tolerant architectures
-
10.1109/2.56851
-
Laprie JC, Arlat J, Beounes C, Kanoun K (1990) Definition and analysis of hardware- and software-fault-tolerant architectures. Computer 23(7):39-51
-
(1990)
Computer
, vol.23
, Issue.7
, pp. 39-51
-
-
Laprie, J.C.1
Arlat, J.2
Beounes, C.3
Kanoun, K.4
-
40
-
-
84881371103
-
-
Large software state [Online]
-
Large software state (2012) [Online]. Available: http://www.safeware-eng.com/White-Papers/Software%20Safety.htm
-
(2012)
-
-
-
41
-
-
0028485392
-
Low-latency, concurrent checkpointing for parallel programs
-
10.1109/71.298215
-
Li K, Naughton JF, Plank JS (1994) Low-latency, concurrent checkpointing for parallel programs. IEEE Trans Parallel Distrib Syst 5(8):874-879
-
(1994)
IEEE Trans Parallel Distrib Syst
, vol.5
, Issue.8
, pp. 874-879
-
-
Li, K.1
Naughton, J.F.2
Plank, J.S.3
-
45
-
-
4544296705
-
The use of triple-modular redundancy to improve computer reliability
-
0117.12001 10.1147/rd.62.0200
-
Lyons RE, Vanderkulk W (1962) The use of triple-modular redundancy to improve computer reliability. IBM J Res Dev 6(2):200-209
-
(1962)
IBM J Res Dev
, vol.6
, Issue.2
, pp. 200-209
-
-
Lyons, R.E.1
Vanderkulk, W.2
-
46
-
-
68849090178
-
A survey and review of the current state of rollback-recovery for cluster systems
-
Maloney A, Goscinski A (2009) A survey and review of the current state of rollback-recovery for cluster systems. Concurr Comput., 1632-1666
-
(2009)
Concurr Comput.
, pp. 1632-1666
-
-
Maloney, A.1
Goscinski, A.2
-
47
-
-
0345044000
-
Process migration
-
10.1145/367701.367728
-
Milojicic DS, Douglis F, Paindaveine Y, Wheeler R, Zhou S (2000) Process migration. ACM Comput Surv 32(3):241-299
-
(2000)
ACM Comput Surv
, vol.32
, Issue.3
, pp. 241-299
-
-
Milojicic, D.S.1
Douglis, F.2
Paindaveine, Y.3
Wheeler, R.4
Zhou, S.5
-
48
-
-
0001439335
-
MPI: A message-passing interface standard
-
MPI Forum
-
MPI Forum (1994) MPI: a message-passing interface standard. Int J Supercomput Appl High Perform Comput
-
(1994)
Int J Supercomput Appl High Perform Comput
-
-
-
50
-
-
0036755345
-
Architecture and dependability of large-scale Internet services
-
10.1109/MIC.2002.1036037
-
Oppenheimer D, Patterson D (2002) Architecture and dependability of large-scale Internet services. IEEE Internet Comput 6(5):41-49
-
(2002)
IEEE Internet Comput
, vol.6
, Issue.5
, pp. 41-49
-
-
Oppenheimer, D.1
Patterson, D.2
-
51
-
-
84978437417
-
The design and implementation of Zap: A system for migrating computing environments
-
10.1145/844128.844162
-
Osman S, Subhraveti D, Su G, Nieh J (2002) The design and implementation of Zap: a system for migrating computing environments. Oper Syst Rev 36(SI):361-376
-
(2002)
Oper Syst Rev
, vol.36
, pp. 361-376
-
-
Osman, S.1
Subhraveti, D.2
Su, G.3
Nieh, J.4
-
53
-
-
84881373144
-
-
PETSc [Online]
-
PETSc (2012) [Online]. Available: http://www.mcs.anl.gov/petsc/petsc-as/
-
(2012)
-
-
-
54
-
-
84881369313
-
-
Pinheiro E (2001) http://www.research.rutgers.edu/~edpin/epckpt/
-
(2001)
-
-
Pinheiro, E.1
-
59
-
-
0016522101
-
System structure for software fault tolerance
-
10.1109/TSE.1975.6312842
-
Randell B (1975) System structure for software fault tolerance. IEEE Trans Softw Eng SE-1(2):220-232
-
(1975)
IEEE Trans Softw Eng
, vol.1
, Issue.2
, pp. 220-232
-
-
Randell, B.1
-
61
-
-
34548771116
-
DejaVu: Transparent user-level checkpointing, migration, and recovery for distributed systems
-
Ruscio J, Heffner M, Varadarajan S (2007) DejaVu: transparent user-level checkpointing, migration, and recovery for distributed systems. In: IEEE international parallel and distributed processing symposium, pp 1-10
-
(2007)
IEEE International Parallel and Distributed Processing Symposium
, pp. 1-10
-
-
Ruscio, J.1
Heffner, M.2
Varadarajan, S.3
-
63
-
-
27844542760
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
10.1177/1094342005056139
-
Sankaran S, Squyres JM, Barrett B et al (2005) The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int J High Perform Comput Appl 19(4):479-493
-
(2005)
Int J High Perform Comput Appl
, vol.19
, Issue.4
, pp. 479-493
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
-
64
-
-
36148941068
-
Understanding failures in petascale computers
-
012022 10.1088/1742-6596/78/1/012022
-
Schroeder B, Gibson G (2007) Understanding failures in petascale computers. J Phys Conf Ser 78(1):012022
-
(2007)
J Phys Conf Ser
, vol.78
, Issue.1
-
-
Schroeder, B.1
Gibson, G.2
-
65
-
-
78149470110
-
A large-scale study of failures in high performance computing systems
-
10.1109/TDSC.2009.4
-
Schroeder B, Gibson GA (2010) A large-scale study of failures in high performance computing systems. IEEE Trans Dependable Secure Comput 7(4):337-350
-
(2010)
IEEE Trans Dependable Secure Comput
, vol.7
, Issue.4
, pp. 337-350
-
-
Schroeder, B.1
Gibson, G.A.2
-
66
-
-
84934312471
-
Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs
-
Pittsburgh, PA
-
Schulz M, Bronevetsky G, Fernandes R, Marques D, Pingali K, Stodghill P (2004) Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Supercomputing, Pittsburgh, PA
-
(2004)
Supercomputing
-
-
Schulz, M.1
Bronevetsky, G.2
Fernandes, R.3
Marques, D.4
Pingali, K.5
Stodghill, P.6
-
67
-
-
79952579787
-
Exascale computing technology challenges
-
LNCS 6449 Springer Berlin, Heidelberg
-
Shalf J, Dosanjh S, Morrison J (2011) Exascale computing technology challenges. In: VECPAR 2010, LNCS, vol 6449. Springer, Berlin, Heidelberg, pp 1-25
-
(2011)
VECPAR 2010
, pp. 1-25
-
-
Shalf, J.1
Dosanjh, S.2
Morrison, J.3
-
68
-
-
84881374633
-
-
NASA CR 172385, Langley Research Center, VA
-
Slivinski T, Broglio C, Wild C et al (1984) Study of fault-tolerant software technology. NASA CR 172385, Langley Research Center, VA
-
(1984)
Study of Fault-tolerant Software Technology
-
-
Slivinski, T.1
Broglio, C.2
Wild, C.3
-
69
-
-
0003050634
-
Cocheck: Checkpointing and process migration for MPI
-
Stellner G (1996) Cocheck: checkpointing and process migration for MPI. In: Proc IPPS
-
(1996)
Proc IPPS
-
-
Stellner, G.1
-
71
-
-
1442319232
-
PM2: High performance communication middleware for heterogeneous network environments, in supercomputing
-
IEEE Press New York
-
Takahashi T, Sumimoto S, Hori A, Harada H, Ishikawa Y (2000) PM2: high performance communication middleware for heterogeneous network environments, in supercomputing. In: ACM/IEEE 2000 conference. IEEE Press, New York, p 16
-
(2000)
ACM/IEEE 2000 Conference
, p. 16
-
-
Takahashi, T.1
Sumimoto, S.2
Hori, A.3
Harada, H.4
Ishikawa, Y.5
-
72
-
-
84881374986
-
-
Team Condor University of Wisconsin-Madison
-
Team Condor (2010) Condor version 7.5.3 manual. University of Wisconsin-Madison
-
(2010)
Condor Version 7.5.3 Manual
-
-
-
74
-
-
84881374553
-
-
Top500 [Online]
-
Top500 (2012) [Online]. Available: http://www.top500.org
-
(2012)
-
-
-
75
-
-
85101215109
-
Application-level checkpointing techniques for parallel programs
-
Walters J, Chaudhary V (2006) Application-level checkpointing techniques for parallel programs. In: Proc of the 3rd ICDCIT conf, pp 221-234
-
(2006)
Proc of the 3rd ICDCIT Conf
, pp. 221-234
-
-
Walters, J.1
Chaudhary, V.2
-
76
-
-
0029305383
-
Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems
-
10.1109/71.382324
-
Wang Y-M, Chung P-Y, Lin I-J, Fuchs WK (1995) Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Trans Parallel Distrib Syst 6(5):546-554
-
(1995)
IEEE Trans Parallel Distrib Syst
, vol.6
, Issue.5
, pp. 546-554
-
-
Wang, Y.-M.1
Chung, P.-Y.2
Lin, I.-J.3
Fuchs, W.K.4
-
78
-
-
84881377739
-
-
ckpt [Online]
-
Zandy V (2002) ckpt [Online]. Available: http://pages.cs.wisc.edu/~zandy/ckpt/
-
(2002)
-
-
Zandy, V.1