[1] Lorenzo Alvisi, Sriram Rao, Syed Amir Husain, Asanka de Mel and E.N. (Mootaz) Elnozahy. An Analysis of Communication-Induced Checkpointing, in FTCS '99: Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, IEEE Computer Society, pp. 242-249, 1999
[2] Aurélien Bouteiller, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier and Franck Cappello. MPICH-V: a Multiprotocol Fault Tolerant MPI, International Journal of High Performance Computing Applications, vol.20 no.3:319-333, 2006
[3] K. Mani Chandy and Leslie Lamport. Distributed Snapshots: Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, vol.3 no.1:63-75, 1985
[4] Camille Coti, Thomas Hérault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez and Franck Cappello. Blocking vs Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI, in Proceedings of the IEEE/ACM SC2006 Conference, 2006
[6] E.N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34 no.3:375-408, 2002
[7] N.P. Gopalan and K. Nagarajan. Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups, in Distributed Computing - IWDC 2005, pp. 153-158, 2005
[8] Pierre Lemarinier, Aurélien Bouteiller and Franck Cappello. MPICH-V3: Toward a High Performance Fault Tolerant MPI for Cluster of Clusters Grid, in Poster Section, High Performance Networking and Computing (SC2003), Phoenix, USA, 2003
[9] Wei-Jih Li and Jyh-Jong Tsay. Checkpointing Message-Passing Interface (MPI) Parallel Programs, in Proceedings of Pacific Rim International Symposium on Fault-Tolerant Systems, 1997, IEEE Computer Society, pp. 147-152, 1997
[11] Robert H.B. Netzer and Jian Xu. Necessary and Sufficient Conditions for Consistent Global Snapshots, IEEE Transactions on Parallel and Distributed Systems, vol.6 no.2:165-169, 1995
[12] Mamoru Ohara, Masayuki Arai, Satoshi Fukumoto and Kazuhiko Iwasaki. Finding a Recovery Line in Uncoordinated Checkpointing, in ICDCSW '04: Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04), IEEE Computer Society, pp. 628-633, 2004
[13] A.J. Oliner, Ramendra K. Sahoo, José E. Moreira and M. Gupta. Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems, in IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18, IEEE Computer Society, 2005
[14] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine, Jason Duell, Paul H. Hargrove and Eric Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, International Journal of High Performance Computing Applications, vol.19 no.4:479-493, 2005
[17] NAS Parallel Benchmarks: http://www.nas.nasa.gov/Resources/Software/npb.html