[1] MPICH-V Project. http://mpich-v.lri.fr.
[2] MPICH2. http://www-unix.mcs.anl.gov/mpi/mpich2/.
[3] MVAPICH: MPI for InfiniBand and iWARP. http://nowlab.cse.ohio-state.edu/projects/mpi-iba/.
[4] Parallel Virtual File System, Version 2. http://www.pvfs.org.
[5] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In IEEE/ACM SuperComputing 2002, Baltimore, MD, November 2002.
[6] A. Bouteiller, F. Cappello, T. Hérault, G. Krawezik, P. Lemarinier, and F. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In IEEE/ACM SuperComputing 2003, Phoenix, AZ, November 2003.
[7] A. Bouteiller, B. Collin, T. Hérault, P. Lemarinier, and F. Cappello. Impact of event logger on causal message logging protocols for fault tolerant MPI. In Proceedings of Int'l Parallel and Distributed Processing Symposium (IPDPS), Denver, CO, April 2005.
[8] A. Bouteiller, P. Lemarinier, T. Hérault, G. Krawezik, and F. Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In Proceedings of Cluster 2004, San Diego, CA, September 2004.
[9] M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Trans. Comput. Syst., 3(1), 1985.
[10] C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In ACM/IEEE SuperComputing (SC), 2006.
[11] J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Technical Report LBNL-54941, Berkeley Lab, 2002.
[12] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3), 2002.
[13] G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J. J. Dongarra. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems. In International Supercomputer Conference (ICS), 2003.
[14] Q. Gao, W. Yu, W. Huang, and D. K. Panda. Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand. In Int'l Conference on Parallel Processing (ICPP'06), Columbus, OH, August 2006.
[15] R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini. Transparent incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In ACM/IEEE SuperComputing 2005, Seattle, WA, November 2005.
[16] J. Hursey, T. Mattox, A. Lumsdaine, and J. M. Squyres. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), in conjunction with IPDPS, 2007.
[17] InfiniBand Trade Association. http://www.infinibandta.org.
[19] J. Liu, W. Jiang, P. Wyckoff, D. K. Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen. Design and Implementation of MPICH2 over InfiniBand with RDMA Support. In Int'l Parallel and Distributed Processing Symposium (IPDPS '04), April 2004.
[21] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. International Journal of High Performance Computing Applications, pages 479-493, 2005.
[23] C. Wang, F. Mueller, C. Engelmann, and S. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Int'l Parallel and Distributed Processing Symposium (IPDPS '07), 2007.