-
2
-
-
0032597670
-
An analysis of communication induced checkpointing
-
L. Alvisi, E.N. Elnozahy, S. Rao, S.A. Husain, and A. De Mel, "An analysis of communication induced checkpointing," Symp. on Fault-Tolerant Computing, pp.242-249, 1999.
-
(1999)
Symp. on Fault-Tolerant Computing
, pp. 242-249
-
-
Alvisi, L.1
Elnozahy, E.N.2
Rao, S.3
Husain, S.A.4
De Mel, A.5
-
3
-
-
77954003885
-
MPI/FT: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing
-
May
-
R. Batchu, A. Skjellum, Z. Cui, M. Beddhu, J.P. Neelamegam, Y. Dandass, and M. Apte, "MPI/FT: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing," Proc. 1st Int. Symp. on Cluster Computing and the Grid, May 2001.
-
(2001)
Proc. 1st Int. Symp. on Cluster Computing and the Grid
-
-
Batchu, R.1
Skjellum, A.2
Cui, Z.3
Beddhu, M.4
Neelamegam, J.P.5
Dandass, Y.6
Apte, M.7
-
4
-
-
0031570635
-
Application level fault tolerance in heterogeneous networks of workstations
-
A. Beguelin, E. Seligman, and P. Stephan, "Application level fault tolerance in heterogeneous networks of workstations," J. Parallel Distrib. Comput., vol.43, no.2, pp.147-155, 1997.
-
(1997)
J. Parallel Distrib. Comput.
, vol.43
, Issue.2
, pp. 147-155
-
-
Beguelin, A.1
Seligman, E.2
Stephan, P.3
-
5
-
-
84884662651
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G.F. Magniette, V. Néri, and A. Selikhov, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," SuperComputing 2002, pp. 1-18, 2002.
-
(2002)
SuperComputing 2002
, pp. 1-18
-
-
Bosilca, G.1
Bouteiller, A.2
Cappello, F.3
Djilali, S.4
Magniette, G.F.5
Néri, V.6
Selikhov, A.7
-
6
-
-
0001873476
-
LAM: An open cluster environment for MPI
-
Toronto, Canada
-
G. Burns, R. Daoud, and J. Vaigl, "LAM: An open cluster environment for MPI," Proc. Supercomputing Symp., pp.379-386, Toronto, Canada, 1994.
-
(1994)
Proc. Supercomputing Symp.
, pp. 379-386
-
-
Burns, G.1
Daoud, R.2
Vaigl, J.3
-
7
-
-
0028408242
-
Monitors, messages, and clusters: The p4 parallel programming system
-
R. Butler and E.L. Lusk, "Monitors, messages, and clusters: The p4 parallel programming system," Parallel Comput., vol.20, no.4, pp.547-564, 1994.
-
(1994)
Parallel Comput.
, vol.20
, Issue.4
, pp. 547-564
-
-
Butler, R.1
Lusk, E.L.2
-
8
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
Feb.
-
K.M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Trans. Comput. Syst., vol.3, no.1, pp.63-75, Feb. 1985.
-
(1985)
ACM Trans. Comput. Syst.
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
11
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol.34, no.3, pp.375-408, 2002.
-
(2002)
ACM Comput. Surv.
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
12
-
-
84940567900
-
FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
-
G.E. Fagg and J. Dongarra, "FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world," PVM/MPI 2000, pp.346-353, 2000.
-
(2000)
PVM/MPI 2000
, pp. 346-353
-
-
Fagg, G.E.1
Dongarra, J.2
-
16
-
-
0035455653
-
The anatomy of the grid: Enabling scalable virtual organizations
-
I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: Enabling scalable virtual organizations," J. Supercomput. Appl., vol.15, no.3, 2001.
-
(2001)
J. Supercomput. Appl.
, vol.15
, Issue.3
-
-
Foster, I.1
Kesselman, C.2
Tuecke, S.3
-
17
-
-
0034878266
-
Condor-G: A computation management agent for multi-institutional grids
-
Aug.
-
J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke, "Condor-G: A computation management agent for multi-institutional grids," Proc. 10th IEEE Symp. on High Performance Distributed Computing (HPDC10), pp.55-63, Aug. 2001.
-
(2001)
Proc. 10th IEEE Symp. on High Performance Distributed Computing (HPDC10)
, pp. 55-63
-
-
Frey, J.1
Tannenbaum, T.2
Foster, I.3
Livny, M.4
Tuecke, S.5
-
18
-
-
0030243005
-
A high-performance, portable implementation of the MPI message passing interface standard
-
W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Comput., vol.22, no.6, pp.789-828, 1996.
-
(1996)
Parallel Comput.
, vol.22
, Issue.6
, pp. 789-828
-
-
Gropp, W.1
Lusk, E.2
Doss, N.3
Skjellum, A.4
-
19
-
-
0742293840
-
MPICH-G2: A grid-enabled implementation of the message passing interface
-
May
-
N.T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A grid-enabled implementation of the message passing interface," J. Parallel Distrib. Comput., vol.63, no.5, pp.551-563, May 2003.
-
(2003)
J. Parallel Distrib. Comput.
, vol.63
, Issue.5
, pp. 551-563
-
-
Karonis, N.T.1
Toonen, B.2
Foster, I.3
-
20
-
-
0023090161
-
Checkpointing and rollback recovery for distributed systems
-
R. Koo and S. Toueg, "Checkpointing and rollback recovery for distributed systems," IEEE Trans. Softw. Eng., vol.SE-13, no.1, pp.23-31, 1987.
-
(1987)
IEEE Trans. Softw. Eng.
, vol.SE-13
, Issue.1
, pp. 23-31
-
-
Koo, R.1
Toueg, S.2
-
22
-
-
0002639531
-
Supporting checkpointing and process migration outside the UNIX kernel
-
San Francisco, CA, Jan.
-
M.J. Litzkow and M. Solomon, "Supporting checkpointing and process migration outside the UNIX kernel," USENIX Conference Proc., pp.283-290, San Francisco, CA, Jan. 1992.
-
(1992)
USENIX Conference Proc.
, pp. 283-290
-
-
Litzkow, M.J.1
Solomon, M.2
-
23
-
-
0034439137
-
Portable fault tolerance scheme for MPI
-
S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou, "Portable fault tolerance scheme for MPI," Parallel Process. Lett., vol.10, no.4, pp.371-382, 2000.
-
(2000)
Parallel Process. Lett.
, vol.10
, Issue.4
, pp. 371-382
-
-
Louca, S.1
Neophytou, N.2
Lachanas, A.3
Evripidou, P.4
-
25
-
-
3142699243
-
NAS parallel benchmarks
-
NASA Ames Research Center, "NAS parallel benchmarks," Technical Report, http://science.nas.nasa.gov/Software/NPB/, 1997.
-
(1997)
Technical Report
-
-
-
26
-
-
0029255243
-
Necessary and sufficient conditions for consistent global snapshots
-
R. Netzer and J. Xu, "Necessary and sufficient conditions for consistent global snapshots," IEEE Trans. Parallel Distrib. Syst., vol.6, no.2, pp.165-169, 1995.
-
(1995)
IEEE Trans. Parallel Distrib. Syst.
, vol.6
, Issue.2
, pp. 165-169
-
-
Netzer, R.1
Xu, J.2
-
27
-
-
84888898496
-
RENEW: A tool for fast and efficient implementation of checkpoint protocols
-
N. Neves and W.K. Fuchs, "RENEW: A tool for fast and efficient implementation of checkpoint protocols," Symp. on Fault-Tolerant Computing, pp.58-67, 1998.
-
(1998)
Symp. on Fault-Tolerant Computing
, pp. 58-67
-
-
Neves, N.1
Fuchs, W.K.2
-
28
-
-
23044532594
-
Application recovery in parallel programming environment
-
G.T. Nguyen, V.D. Tran, and M. Kotocová, "Application recovery in parallel programming environment," European PVM/MPI, pp.234-242, 2002.
-
(2002)
European PVM/MPI
, pp. 234-242
-
-
Nguyen, G.T.1
Tran, V.D.2
Kotocová, M.3
-
30
-
-
85084159983
-
Libckpt: Transparent checkpointing under Unix
-
Jan.
-
J.S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," USENIX Winter 1995 Technical Conference, pp.213-224, Jan. 1995.
-
(1995)
USENIX Winter 1995 Technical Conference
, pp. 213-224
-
-
Plank, J.S.1
Beck, M.2
Kingsley, G.3
Li, K.4
-
31
-
-
0032179680
-
Diskless checkpointing
-
J.S. Plank, K. Li, and M.A. Puening, "Diskless checkpointing," IEEE Trans. Parallel Distrib. Syst., vol.9, no.10, pp.972-986, 1998.
-
(1998)
IEEE Trans. Parallel Distrib. Syst.
, vol.9
, Issue.10
, pp. 972-986
-
-
Plank, J.S.1
Li, K.2
Puening, M.A.3
-
32
-
-
0032597696
-
Egida: An extensible toolkit for low-overhead fault-tolerance
-
S. Rao, L. Alvisi, and H.M. Vin, "Egida: An extensible toolkit for low-overhead fault-tolerance," Symp. on Fault-Tolerant Computing, pp.48-55, 1999.
-
(1999)
Symp. on Fault-Tolerant Computing
, pp. 48-55
-
-
Rao, S.1
Alvisi, L.2
Vin, H.M.3
-
33
-
-
0032202258
-
The Hector distributed run-time environment
-
Nov.
-
S.H. Russ, J. Robinson, B.K. Flachs, and B. Heckel, "The Hector distributed run-time environment," IEEE Trans. Parallel Distrib. Syst., vol.9, no.11, pp.1102-1114, Nov. 1998.
-
(1998)
IEEE Trans. Parallel Distrib. Syst.
, vol.9
, Issue.11
, pp. 1102-1114
-
-
Russ, S.H.1
Robinson, J.2
Flachs, B.K.3
Heckel, B.4
-
35
-
-
0029713612
-
CoCheck: Checkpointing and process migration for MPI
-
April
-
G. Stellner, "CoCheck: Checkpointing and process migration for MPI," Proc. Int. Parallel Processing Symp., pp.526-531, April 1996.
-
(1996)
Proc. Int. Parallel Processing Symp.
, pp. 526-531
-
-
Stellner, G.1
-
36
-
-
0032179679
-
Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability
-
J. Tsai, S.-Y. Kuo, and Y.-M. Wang, "Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability," IEEE Trans. Parallel Distrib. Syst., vol.9, no.10, pp.963-971, 1998.
-
(1998)
IEEE Trans. Parallel Distrib. Syst.
, vol.9
, Issue.10
, pp. 963-971
-
-
Tsai, J.1
Kuo, S.-Y.2
Wang, Y.-M.3
-
37
-
-
0033365704
-
Process hijacking
-
Aug.
-
V. Zandy, B. Miller, and M. Livny, "Process hijacking," Eighth Int. Symp. on High Performance Distributed Computing, pp.177-184, Aug. 1999.
-
(1999)
Eighth Int. Symp. on High Performance Distributed Computing
, pp. 177-184
-
-
Zandy, V.1
Miller, B.2
Livny, M.3