-
2
-
-
20444435911
-
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
-
IEEE CS Press
-
P. Lemarinier, A. Bouteiller, T. Herault, G. Krawezik, and F. Cappello, "Improved message logging versus improved coordinated checkpointing for fault tolerant MPI," in IEEE International Conference on Cluster Computing (Cluster 2004). IEEE CS Press, 2004.
-
(2004)
IEEE International Conference on Cluster Computing (Cluster 2004)
-
-
Lemarinier, P.1
Bouteiller, A.2
Herault, T.3
Krawezik, G.4
Cappello, F.5
-
3
-
-
20444444457
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Santa Fe, New Mexico, USA, October
-
S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in Proceedings, LACSI Symposium, Santa Fe, New Mexico, USA, October 2003.
-
(2003)
Proceedings, LACSI Symposium
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
Duell, J.5
Hargrove, P.6
Roman, E.7
-
4
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol.34, no.3, pp. 375-408, 2002.
-
(2002)
ACM Comput. Surv.
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
5
-
-
61449090689
-
Redesigning the message logging model for high performance
-
Dresden, Germany, June
-
A. Bouteiller, G. Bosilca, and J. Dongarra, "Redesigning the message logging model for high performance," in International Supercomputer Conference (ISC 2008), Dresden, Germany, June 2008.
-
(2008)
International Supercomputer Conference (ISC 2008)
-
-
Bouteiller, A.1
Bosilca, G.2
Dongarra, J.3
-
6
-
-
35048884271
-
Open MPI: Goals, concept, and design of a next generation MPI implementation
-
Budapest, Hungary, September
-
E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, concept, and design of a next generation MPI implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004, pp. 97-104.
-
(2004)
Proceedings, 11th European PVM/MPI Users' Group Meeting
, pp. 97-104
-
-
Gabriel, E.1
Fagg, G.E.2
Bosilca, G.3
Angskun, T.4
Dongarra, J.J.5
Squyres, J.M.6
Sahay, V.7
Kambadur, P.8
Barrett, B.9
Lumsdaine, A.10
Castain, R.H.11
Daniel, D.J.12
Graham, R.L.13
Woodall, T.S.14
-
7
-
-
0017996760
-
Time, clocks, and the ordering of events in a distributed system
-
L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Communications of the ACM, vol.21, no.7, pp. 558-565, 1978.
-
(1978)
Communications of the ACM
, vol.21
, Issue.7
, pp. 558-565
-
-
Lamport, L.1
-
8
-
-
50649083601
-
O2P: An extremely optimistic message logging protocol
-
November
-
T. Ropars and C. Morin, "O2P: An Extremely Optimistic Message Logging Protocol," INRIA Research Report 6357, November 2007.
-
(2007)
INRIA Research Report 6357
-
-
Ropars, T.1
Morin, C.2
-
9
-
-
56449096525
-
On the performance of transparent MPI piggyback messages
-
Berlin, Heidelberg: Springer-Verlag
-
M. Schulz, G. Bronevetsky, and B. R. Supinski, "On the Performance of Transparent MPI Piggyback Messages," in Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 194-201.
-
(2008)
Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
, pp. 194-201
-
-
Schulz, M.1
Bronevetsky, G.2
Supinski, B.R.3
-
10
-
-
0346903448
-
Distributed recovery with K-optimistic logging
-
O. P. Damani, Y.-M. Wang, and V. K. Garg, "Distributed Recovery with K-optimistic Logging," Journal of Parallel and Distributed Computing, vol.63, pp. 1193-1218, 2003.
-
(2003)
Journal of Parallel and Distributed Computing
, vol.63
, pp. 1193-1218
-
-
Damani, O.P.1
Wang, Y.-M.2
Garg, V.K.3
-
11
-
-
0026907967
-
An efficient implementation of vector clocks
-
M. Singhal and A. Kshemkalyani, "An Efficient Implementation of Vector Clocks," Information Processing Letters, vol.43, no.1, pp. 47-52, 1992.
-
(1992)
Information Processing Letters
, vol.43
, Issue.1
, pp. 47-52
-
-
Singhal, M.1
Kshemkalyani, A.2
-
13
-
-
0003605996
-
The NAS parallel benchmarks 2.0
-
D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, and M. Yarrow, "The NAS Parallel Benchmarks 2.0," NASA Ames Research Center, Tech. Rep. Report NAS-95-1020, 1995.
-
(1995)
NASA Ames Research Center, Tech. Rep. Report NAS-95-1020
-
-
Bailey, D.1
Harris, T.2
Saphir, W.3
Van Der Wijngaart, R.4
Woo, A.5
Yarrow, M.6
-
14
-
-
33750205459
-
Grid'5000: A large scale and highly reconfigurable experimental grid testbed
-
R. Bolze, F. Cappello, E. Caron, M. Daydé, F. Desprez, E. Jeannot, Y. Jégou, S. Lanteri, J. Leduc, N. Melab, G. Mornet, R. Namyst, P. Primet, B. Quetier, O. Richard, E.-G. Talbi, and I. Touche, "Grid'5000: A large scale and highly reconfigurable experimental grid testbed," International Journal of High Performance Computing Applications, vol.20, no.4, pp. 481-494, 2006.
-
(2006)
International Journal of High Performance Computing Applications
, vol.20
, Issue.4
, pp. 481-494
-
-
Bolze, R.1
Cappello, F.2
Caron, E.3
Daydé, M.4
Desprez, F.5
Jeannot, E.6
Jégou, Y.7
Lanteri, S.8
Leduc, J.9
Melab, N.10
Mornet, G.11
Namyst, R.12
Primet, P.13
Quetier, B.14
Richard, O.15
Talbi, E.-G.16
Touche, I.17
-
15
-
-
0030286802
-
Algorithm-based fault location and recovery for matrix computations on multiprocessor systems
-
A. Roy-Chowdhury and P. Banerjee, "Algorithm-based fault location and recovery for matrix computations on multiprocessor systems," IEEE Trans. Comput., vol.45, no.11, pp. 1239-1247, 1996.
-
(1996)
IEEE Trans. Comput.
, vol.45
, Issue.11
, pp. 1239-1247
-
-
Roy-Chowdhury, A.1
Banerjee, P.2
-
16
-
-
31844451082
-
Fault tolerant high performance computing by a coding approach
-
New York, NY, USA: ACM Press
-
Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, "Fault tolerant high performance computing by a coding approach," in PPoPP '05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming. New York, NY, USA: ACM Press, 2005, pp. 213-223.
-
(2005)
PPoPP '05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, pp. 213-223
-
-
Chen, Z.1
Fagg, G.E.2
Gabriel, E.3
Langou, J.4
Angskun, T.5
Bosilca, G.6
Dongarra, J.7
-
17
-
-
84940567900
-
FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
-
Balatonfüred, Hungary: Springer-Verlag Heidelberg, September
-
G. Fagg and J. Dongarra, "FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world," in 7th Euro PVM/MPI User's Group Meeting 2000, vol.1908 / 2000. Balatonfüred, Hungary: Springer-Verlag Heidelberg, September 2000.
-
(2000)
7th Euro PVM/MPI User's Group Meeting 2000
, vol.1908
-
-
Fagg, G.1
Dongarra, J.2
-
18
-
-
0035480335
-
HARNESS and fault tolerant MPI
-
October
-
G. E. Fagg, A. Bukovsky, and J. J. Dongarra, "HARNESS and fault tolerant MPI," Parallel Computing, vol.27, no.11, pp. 1479-1495, October 2001.
-
(2001)
Parallel Computing
, vol.27
, Issue.11
, pp. 1479-1495
-
-
Fagg, G.E.1
Bukovsky, A.2
Dongarra, J.J.3
-
19
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
ACM, February
-
K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," in Transactions on Computer Systems, vol.3(1). ACM, February 1985, pp. 63-75.
-
(1985)
Transactions on Computer Systems
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
21
-
-
0033184756
-
Communication-induced determination of consistent snapshots
-
J.-M. Hélary, A. Mostefaoui, and M. Raynal, "Communication-induced determination of consistent snapshots," IEEE Transactions on Parallel and Distributed Systems, vol.10, no.9, pp. 865-877, 1999.
-
(1999)
IEEE Transactions on Parallel and Distributed Systems
, vol.10
, Issue.9
, pp. 865-877
-
-
Hélary, J.-M.1
Mostefaoui, A.2
Raynal, M.3
-
22
-
-
34548274091
-
Dejavu: Transparent user-level checkpointing, migration and recovery for distributed systems
-
New York, NY, USA: ACM Press
-
J. F. Ruscio, M. A. Heffner, and S. Varadarajan, "Dejavu: transparent user-level checkpointing, migration and recovery for distributed systems," in SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM Press, 2006, p. 158.
-
(2006)
SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing
, p. 158
-
-
Ruscio, J.F.1
Heffner, M.A.2
Varadarajan, S.3
-
23
-
-
0026867749
-
Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output
-
May
-
E. N. Elnozahy and W. Zwaenepoel, "Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output," IEEE Transactions on Computers, vol.41, no.5, May 1992.
-
(1992)
IEEE Transactions on Computers
, vol.41
, Issue.5
-
-
Elnozahy, E.N.1
Zwaenepoel, W.2
-
24
-
-
0032597696
-
Egida: An extensible toolkit for low-overhead fault-tolerance
-
IEEE CS Press
-
S. Rao, L. Alvisi, and H. M. Vin, "Egida: An extensible toolkit for low-overhead fault-tolerance," in 29th Symposium on Fault-Tolerant Computing (FTCS'99). IEEE CS Press, 1999, pp. 48-55.
-
(1999)
29th Symposium on Fault-Tolerant Computing (FTCS'99)
, pp. 48-55
-
-
Rao, S.1
Alvisi, L.2
Vin, H.M.3
-
25
-
-
33746779994
-
-
SAGE Publications, Summer
-
A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, "MPICH-V project: a multiprotocol automatic fault tolerant MPI," vol.20. SAGE Publications, Summer 2006, pp. 319-333.
-
(2006)
MPICH-V Project: A Multiprotocol Automatic Fault Tolerant MPI
, vol.20
, pp. 319-333
-
-
Bouteiller, A.1
Herault, T.2
Krawezik, G.3
Lemarinier, P.4
Cappello, F.5
-
26
-
-
0022112420
-
Optimistic recovery in distributed systems
-
R. Strom and S. Yemini, "Optimistic Recovery in Distributed Systems," ACM Transactions on Computer Systems, vol.3, no.3, pp. 204-226, 1985.
-
(1985)
ACM Transactions on Computer Systems
, vol.3
, Issue.3
, pp. 204-226
-
-
Strom, R.1
Yemini, S.2
-
27
-
-
0024890316
-
Efficient distributed recovery using message logging
-
New York, NY, USA: ACM Press
-
A. P. Sistla and J. L. Welch, "Efficient Distributed Recovery Using Message Logging," in PODC '89: Proceedings of the eighth annual ACM Symposium on Principles of distributed computing. New York, NY, USA: ACM Press, 1989, pp. 223-238.
-
(1989)
PODC '89: Proceedings of the Eighth Annual ACM Symposium on Principles of Distributed Computing
, pp. 223-238
-
-
Sistla, A.P.1
Welch, J.L.2
-
28
-
-
0028994250
-
Completely asynchronous optimistic recovery with minimal rollbacks
-
Pasadena, California
-
S. W. Smith, D. B. Johnson, and J. D. Tygar, "Completely Asynchronous Optimistic Recovery with Minimal Rollbacks," in FTCS-25: 25th International Symposium on Fault Tolerant Computing Digest of Papers, Pasadena, California, 1995, pp. 361-371.
-
(1995)
FTCS-25: 25th International Symposium on Fault Tolerant Computing Digest of Papers
, pp. 361-371
-
-
Smith, S.W.1
Johnson, D.B.2
Tygar, J.D.3
-
29
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
September
-
M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Computing Surveys (CSUR), vol.34, no.3, pp. 375-408, September 2002.
-
(2002)
ACM Computing Surveys (CSUR)
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, M.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4