-
1
-
-
0033359224
-
Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations
-
Washington, DC, USA, IEEE Computer Society
-
A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In HPDC '99: Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, page 31, Washington, DC, USA, 1999. IEEE Computer Society.
-
(1999)
HPDC '99: Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing
, pp. 31
-
-
Agbaria, A.1
Friedman, R.2
-
2
-
-
0032000230
-
Message logging: Pessimistic, optimistic, causal, and optimal
-
L. Alvisi and K. Marzullo. Message logging: Pessimistic, optimistic, causal, and optimal. IEEE Trans. Softw. Eng., 24(2): 149-159, 1998.
-
(1998)
IEEE Trans. Softw. Eng
, vol.24
, Issue.2
, pp. 149-159
-
-
Alvisi, L.1
Marzullo, K.2
-
3
-
-
84884662651
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
Los Alamitos, CA, USA, IEEE Computer Society Press
-
G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Supercomputing '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing, pages 1-18, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
-
(2002)
Supercomputing '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing
, pp. 1-18
-
-
Bosilca, G.1
Bouteiller, A.2
Cappello, F.3
Djilali, S.4
Fedak, G.5
Germain, C.6
Herault, T.7
Lemarinier, P.8
Lodygensky, O.9
Magniette, F.10
Neri, V.11
Selikhov, A.12
-
4
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63-75, 1985.
-
(1985)
ACM Trans. Comput. Syst
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
5
-
-
12344277946
-
The design and implementation of Berkeley Lab's Linux checkpoint/restart
-
Technical Report LBNL-54941, Lawrence Berkeley National Lab, 2003
-
J. Duell, P. Hargrove, and E. Roman. The design and implementation of Berkeley Lab's Linux checkpoint/restart. Technical Report LBNL-54941, Lawrence Berkeley National Lab, 2003.
-
-
-
Duell, J.1
Hargrove, P.2
Roman, E.3
-
6
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375-408, 2002.
-
(2002)
ACM Comput. Surv
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
8
-
-
27844562921
-
Open MPI: Goals, concept, and design of a next generation MPI implementation
-
E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, 2004.
-
(2004)
Proceedings, 11th European PVM/MPI Users' Group Meeting
-
-
Gabriel, E.1
Fagg, G.E.2
Bosilca, G.3
Angskun, T.4
Dongarra, J.J.5
Squyres, J.M.6
Sahay, V.7
Kambadur, P.8
Barrett, B.9
Lumsdaine, A.10
Castain, R.H.11
Daniel, D.J.12
Graham, R.L.13
Woodall, T.S.14
-
9
-
-
84955599429
-
MPI-2: Extending the Message-Passing Interface
-
Springer Verlag
-
A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, W. Saphir, T. Skjellum, and M. Snir. MPI-2: Extending the Message-Passing Interface. In Euro-Par '96 Parallel Processing, pages 128-135. Springer Verlag, 1996.
-
(1996)
Euro-Par '96 Parallel Processing
, pp. 128-135
-
-
Geist, A.1
Gropp, W.2
Huss-Lederman, S.3
Lumsdaine, A.4
Lusk, E.5
Saphir, W.6
Skjellum, T.7
Snir, M.8
-
10
-
-
0347133226
-
A network-failure-tolerant message-passing system for terascale clusters
-
August
-
R. L. Graham, S.-E. Choi, D. J. Daniel, N. N. Desai, R. G. Minnich, C. E. Rasmussen, L. Risinger, and M. W. Sukalski. A network-failure-tolerant message-passing system for terascale clusters. In International Journal of Parallel Programming, volume 31, pages 285-303, August 2003.
-
(2003)
International Journal of Parallel Programming
, vol.31
, pp. 285-303
-
-
Graham, R.L.1
Choi, S.-E.2
Daniel, D.J.3
Desai, N.N.4
Minnich, R.G.5
Rasmussen, C.E.6
Risinger, L.7
Sukalski, M.W.8
-
11
-
-
34548755483
-
A checkpoint and restart service specification for Open MPI
-
Technical Report TR635, Indiana University, Bloomington, Indiana, USA, July
-
J. Hursey, J. M. Squyres, and A. Lumsdaine. A checkpoint and restart service specification for Open MPI. Technical Report TR635, Indiana University, Bloomington, Indiana, USA, July 2006.
-
(2006)
-
-
Hursey, J.1
Squyres, J.M.2
Lumsdaine, A.3
-
12
-
-
0003912256
-
Checkpoint and migration of UNIX processes in the Condor distributed processing system
-
Technical Report CS-TR-199701346, University of Wisconsin, Madison
-
M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report CS-TR-199701346, University of Wisconsin, Madison, 1997.
-
(1997)
-
-
Litzkow, M.1
Tannenbaum, T.2
Basney, J.3
Livny, M.4
-
13
-
-
0031162195
-
Finding consistent global checkpoints in a distributed computation
-
D. Manivannan, R. H. B. Netzer, and M. Singhal. Finding consistent global checkpoints in a distributed computation. IEEE Trans. Parallel Distrib. Syst., 8(6):623-627, 1997.
-
(1997)
IEEE Trans. Parallel Distrib. Syst
, vol.8
, Issue.6
, pp. 623-627
-
-
Manivannan, D.1
Netzer, R.H.B.2
Singhal, M.3
-
14
-
-
85143038582
-
-
Message Passing Interface Forum. MPI: A Message Passing Interface. In Proc. of Supercomputing '93, pages 878-883. IEEE Computer Society Press, November 1993.
-
-
-
-
15
-
-
0345044000
-
Process migration
-
D. S. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration. ACM Comput. Surv., 32(3):241-299, 2000.
-
(2000)
ACM Comput. Surv
, vol.32
, Issue.3
, pp. 241-299
-
-
Milojičić, D.S.1
Douglis, F.2
Paindaveine, Y.3
Wheeler, R.4
Zhou, S.5
-
16
-
-
34548792745
-
-
J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under Unix. Technical report, Knoxville, TN, USA, 1994.
-
-
-
-
17
-
-
0033077475
-
Memory exclusion: Optimizing the performance of checkpointing systems
-
J. S. Plank, Y. Chen, K. Li, M. Beck, and G. Kingsley. Memory exclusion: Optimizing the performance of checkpointing systems. In Software - Practice and Experience, volume 29, pages 125-142, 1999.
-
(1999)
Software - Practice and Experience
, vol.29
, pp. 125-142
-
-
Plank, J.S.1
Chen, Y.2
Li, K.3
Beck, M.4
Kingsley, G.5
-
18
-
-
0032597696
-
Egida: An extensible toolkit for low-overhead fault-tolerance
-
Washington, DC, USA, IEEE Computer Society
-
S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In FTCS '99: Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, page 48, Washington, DC, USA, 1999. IEEE Computer Society.
-
(1999)
FTCS '99: Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
, pp. 48
-
-
Rao, S.1
Alvisi, L.2
Vin, H.M.3
-
19
-
-
27844542760
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Winter
-
S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479-493, Winter 2005.
-
(2005)
International Journal of High Performance Computing Applications
, vol.19
, Issue.4
, pp. 479-493
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
Duell, J.5
Hargrove, P.6
Roman, E.7
-
21
-
-
0031124071
-
Consistent global checkpoints that contain a given set of local checkpoints
-
Y.-M. Wang. Consistent global checkpoints that contain a given set of local checkpoints. IEEE Trans. Comput., 46(4):456-468, 1997.
-
(1997)
IEEE Trans. Comput
, vol.46
, Issue.4
, pp. 456-468
-
-
Wang, Y.-M.1
-
22
-
-
33750234379
-
-
T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe. High performance RDMA protocols in HPC. In Proceedings of EuroPVM-MPI 2006, volume 4192/2006 of Lecture Notes in Computer Science, pages 76-85. Springer Berlin/Heidelberg, September 2006.
-
-
-