-
2
-
-
0032597670
-
An analysis of communication induced checkpointing
-
Los Alamitos, CA: IEEE CS Press
-
Alvisi, L., Elnozahy, E., Rao, S., Husain, S. A., and Mel, A. D. 1999. An analysis of communication induced checkpointing. 29th Symposium on Fault-Tolerant Computing (FTCS'99). Los Alamitos, CA: IEEE CS Press.
-
(1999)
29th Symposium on Fault-Tolerant Computing (FTCS'99)
-
-
Alvisi, L.1
Elnozahy, E.2
Rao, S.3
Husain, S.A.4
Mel, A.D.5
-
3
-
-
0029237761
-
Message logging: Pessimistic, optimistic, and causal
-
Los Alamitos, CA: IEEE CS Press
-
Alvisi, L. and Marzullo, K. 1995. Message logging: Pessimistic, optimistic, and causal. Proceedings of the 15th International Conference on Distributed Computing Systems (ICDCS 1995), pp. 229-236. Los Alamitos, CA: IEEE CS Press.
-
(1995)
Proceedings of the 15th International Conference on Distributed Computing Systems (ICDCS 1995)
, pp. 229-236
-
-
Alvisi, L.1
Marzullo, K.2
-
4
-
-
0003605996
-
-
Report NAS-95-020, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center
-
Bailey, D., Harris, T., Saphir, W., Wijngaart, R. V. D., Woo, A., and Yarrow, M. 1995. The NAS Parallel Benchmarks 2.0. Report NAS-95-020, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center.
-
(1995)
The NAS Parallel Benchmarks 2.0
-
-
Bailey, D.1
Harris, T.2
Saphir, W.3
Wijngaart, R.V.D.4
Woo, A.5
Yarrow, M.6
-
5
-
-
77954003885
-
MPI/FT™: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing
-
Melbourne, Australia. IEEE/ACM
-
Batchu, R., Neelamegam, J., Cui, Z., Beddhua, M., Skjellum, A., Dandass, Y., and Apte, M. 2001. MPI/FT™: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. Proceedings of the 1st International Symposium of Cluster Computing and the Grid (CCGRID2001), Melbourne, Australia. IEEE/ACM.
-
(2001)
Proceedings of the 1st International Symposium of Cluster Computing and the Grid (CCGRID2001)
-
-
Batchu, R.1
Neelamegam, J.2
Cui, Z.3
Beddhua, M.4
Skjellum, A.5
Dandass, Y.6
Apte, M.7
-
6
-
-
0032313590
-
The relative overhead of piggybacking in causal message logging protocols
-
Los Alamitos, CA: IEEE CS Press
-
Bhatia, K., Marzullo, K., and Alvisi, L. 1998. The relative overhead of piggybacking in causal message logging protocols. 17th Symposium on Reliable Distributed Systems (SRDS'98), pp. 348-353. Los Alamitos, CA: IEEE CS Press.
-
(1998)
17th Symposium on Reliable Distributed Systems (SRDS'98)
, pp. 348-353
-
-
Bhatia, K.1
Marzullo, K.2
Alvisi, L.3
-
7
-
-
84884662651
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
Baltimore USA, IEEE/ACM
-
Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fédak, G., Germain, C., Hérault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Néri, V., and Selikhov, A. 2002. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. High Performance Networking and Computing (SC2002), Baltimore USA, IEEE/ACM.
-
(2002)
High Performance Networking and Computing (SC2002)
-
-
Bosilca, G.1
Bouteiller, A.2
Cappello, F.3
Djilali, S.4
Fédak, G.5
Germain, C.6
Hérault, T.7
Lemarinier, P.8
Lodygensky, O.9
Magniette, F.10
Néri, V.11
Selikhov, A.12
-
8
-
-
60449096682
-
MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
-
Phoenix USA, IEEE/ACM
-
Bouteiller, A., Cappello, F., Hérault, T., Krawezik, G., Lemarinier, P., and Magniette, F. 2003a. MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. High Performance Networking and Computing (SC2003), Phoenix USA, IEEE/ACM.
-
(2003)
High Performance Networking and Computing (SC2003)
-
-
Bouteiller, A.1
Cappello, F.2
Hérault, T.3
Krawezik, G.4
Lemarinier, P.5
Magniette, F.6
-
9
-
-
33746681732
-
MPICH V3 preview: A hierarchical fault tolerant MPI for multi-cluster grids
-
poster session, Phoenix USA
-
Bouteiller, A., Lemarinier, P., and Cappello, F. 2003b. MPICH V3 preview: A hierarchical fault tolerant MPI for multi-cluster grids. IEEE/ACM High Performance Networking and Computing (SC 2003), poster session, Phoenix USA.
-
(2003)
IEEE/ACM High Performance Networking and Computing (SC 2003)
-
-
Bouteiller, A.1
Lemarinier, P.2
Cappello, F.3
-
10
-
-
84944901411
-
Coordinated checkpoint versus message log for fault tolerant MPI
-
Los Alamitos, CA: IEEE CS Press
-
Bouteiller, A., Lemarinier, P., Krawezik, G., and Cappello, F. 2003c. Coordinated checkpoint versus message log for fault tolerant MPI. IEEE International Conference on Cluster Computing (Cluster 2003). Los Alamitos, CA: IEEE CS Press.
-
(2003)
IEEE International Conference on Cluster Computing (Cluster 2003)
-
-
Bouteiller, A.1
Lemarinier, P.2
Krawezik, G.3
Cappello, F.4
-
11
-
-
0001873476
-
LAM: An Open Cluster Environment for MPI
-
Burns, G., Daoud, R., and Vaigl, J. 1994. LAM: An Open Cluster Environment for MPI. Proceedings of Supercomputing Symposium, pp. 379-386.
-
(1994)
Proceedings of Supercomputing Symposium
, pp. 379-386
-
-
Burns, G.1
Daoud, R.2
Vaigl, J.3
-
12
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
ACM
-
Chandy, K. M. and Lamport, L. 1985. Distributed snapshots: Determining global states of distributed systems. Transactions on Computer Systems 3(1):63-75. ACM.
-
(1985)
Transactions on Computer Systems
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
15
-
-
0026867749
-
Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output
-
Elnozahy, E. N. and Zwaenepoel, W. 1992b. Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers 41(5).
-
(1992)
IEEE Transactions on Computers
, vol.41
, Issue.5
-
-
Elnozahy, E.N.1
Zwaenepoel, W.2
-
16
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
Elnozahy, M., Alvisi, L., Wang, Y. M., and Johnson, D. B. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR) 34(3):375-408.
-
(2002)
ACM Computing Surveys (CSUR)
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, M.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4
-
17
-
-
84940567900
-
FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
-
Balatonfüred, Hungary. Heidelberg: Springer-Verlag
-
Fagg, G. and Dongarra, J. 2000. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. 7th Euro PVM/MPI User's Group Meeting 2000, vol. 1908, Balatonfüred, Hungary. Heidelberg: Springer-Verlag.
-
(2000)
7th Euro PVM/MPI User's Group Meeting 2000
, vol.1908
-
-
Fagg, G.1
Dongarra, J.2
-
18
-
-
0035480335
-
HARNESS and fault tolerant MPI
-
Fagg, G. E., Bukovsky, A., and Dongarra, J. J. 2001. HARNESS and fault tolerant MPI. Parallel Computing 27(11): 1479-1495.
-
(2001)
Parallel Computing
, vol.27
, Issue.11
, pp. 1479-1495
-
-
Fagg, G.E.1
Bukovsky, A.2
Dongarra, J.J.3
-
20
-
-
0030243005
-
High-performance, portable implementation of the MPI message passing interface standard
-
Gropp, W., Lusk, E., Doss, N., and Skjellum, A. 1996. High-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22(6): 789-828.
-
(1996)
Parallel Computing
, vol.22
, Issue.6
, pp. 789-828
-
-
Gropp, W.1
Lusk, E.2
Doss, N.3
Skjellum, A.4
-
23
-
-
0032311702
-
An efficient algorithm for causal message logging
-
Los Alamitos, CA: IEEE CS Press
-
Lee, B., Park, T., Yeom, H. Y., and Cho, Y. 1998. An efficient algorithm for causal message logging. 17th Symposium on Reliable Distributed Systems (SRDS 1998), pp. 19-25. Los Alamitos, CA: IEEE CS Press.
-
(1998)
17th Symposium on Reliable Distributed Systems (SRDS 1998)
, pp. 19-25
-
-
Lee, B.1
Park, T.2
Yeom, H.Y.3
Cho, Y.4
-
24
-
-
20444435911
-
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
-
Los Alamitos, CA: IEEE CS Press
-
Lemarinier, P., Bouteiller, A., Hérault, T., Krawezik, G., and Cappello, F. 2004. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. IEEE International Conference on Cluster Computing (Cluster 2004). Los Alamitos, CA: IEEE CS Press.
-
(2004)
IEEE International Conference on Cluster Computing (Cluster 2004)
-
-
Lemarinier, P.1
Bouteiller, A.2
Hérault, T.3
Krawezik, G.4
Cappello, F.5
-
25
-
-
0003912256
-
Checkpoint and migration of UNIX processes in the Condor distributed processing system
-
Technical Report 1346, University of Wisconsin-Madison
-
Litzkow, M., Tannenbaum, T., Basney, J., and Livny, M. 1997. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison.
-
(1997)
-
-
Litzkow, M.1
Tannenbaum, T.2
Basney, J.3
Livny, M.4
-
26
-
-
0034439137
-
MPI-FT: Portable fault tolerance scheme for MPI
-
World Scientific Publishing Company
-
Louca, S., Neophytou, N., Lachanas, A., and Evripidou, P. 2000. MPI-FT: Portable fault tolerance scheme for MPI. Parallel Processing Letters (PPL) 10(4). World Scientific Publishing Company.
-
(2000)
Parallel Processing Letters (PPL)
, vol.10
, Issue.4
-
-
Louca, S.1
Neophytou, N.2
Lachanas, A.3
Evripidou, P.4
-
27
-
-
0035201417
-
Processor allocation and checkpoint interval selection in cluster computing systems
-
Plank, J. S. and Thomason, M. G. 2001. Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing 61(11): 1570-1590.
-
(2001)
Journal of Parallel and Distributed Computing
, vol.61
, Issue.11
, pp. 1570-1590
-
-
Plank, J.S.1
Thomason, M.G.2
-
28
-
-
85014175705
-
Experimental assessment of workstation failures and their impact on checkpointing systems
-
Los Alamitos, CA: IEEE CS Press
-
Plank, J. S. and Elwasif, W. R. 1998. Experimental assessment of workstation failures and their impact on checkpointing systems. 28th Symposium on Fault-Tolerant Computing (FTCS'98), pp. 48-57. Los Alamitos, CA: IEEE CS Press.
-
(1998)
28th Symposium on Fault-Tolerant Computing (FTCS'98)
, pp. 48-57
-
-
Plank, J.S.1
Elwasif, W.R.2
-
30
-
-
0032317801
-
The cost of recovery in message logging protocols
-
Los Alamitos, CA: IEEE CS Press
-
Rao, S., Alvisi, L., and Vin, H. M. 1998. The cost of recovery in message logging protocols. 17th Symposium on Reliable Distributed Systems (SRDS), pp. 10-18. Los Alamitos, CA: IEEE CS Press.
-
(1998)
17th Symposium on Reliable Distributed Systems (SRDS)
, pp. 10-18
-
-
Rao, S.1
Alvisi, L.2
Vin, H.M.3
-
31
-
-
0032597696
-
Egida: An extensible toolkit for low-overhead fault-tolerance
-
Los Alamitos, CA: IEEE CS Press
-
Rao, S., Alvisi, L., and Vin, H. M. 1999. Egida: An extensible toolkit for low-overhead fault-tolerance. 29th Symposium on Fault-Tolerant Computing (FTCS'99), pp. 48-55. Los Alamitos, CA: IEEE CS Press.
-
(1999)
29th Symposium on Fault-Tolerant Computing (FTCS'99)
, pp. 48-55
-
-
Rao, S.1
Alvisi, L.2
Vin, H.M.3
-
32
-
-
20444444457
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Santa Fe, New Mexico, USA
-
Sankaran, S., Squyres, J. M., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., and Roman, E. 2003. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. Proceedings, LACSI Symposium, Santa Fe, New Mexico, USA.
-
(2003)
Proceedings, LACSI Symposium
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
Duell, J.5
Hargrove, P.6
Roman, E.7
-
34
-
-
0003710740
-
-
Cambridge, MA: MIT Press
-
Snir, M., Otto, S., Huss-Lederman, S., Walker, D., and Dongarra, J. 1996. MPI: The Complete Reference. Cambridge, MA: MIT Press.
-
(1996)
MPI: The Complete Reference
-
-
Snir, M.1
Otto, S.2
Huss-Lederman, S.3
Walker, D.4
Dongarra, J.5
-
36
-
-
0022112420
-
Optimistic recovery in distributed systems
-
ACM
-
Strom, R. and Yemini, S. 1985. Optimistic recovery in distributed systems. Transactions on Computer Systems 3(3):204-226. ACM.
-
(1985)
Transactions on Computer Systems
, vol.3
, Issue.3
, pp. 204-226
-
-
Strom, R.1
Yemini, S.2
-
37
-
-
0024128166
-
Volatile logging in n-fault-tolerant distributed systems
-
Los Alamitos, CA: IEEE CS Press
-
Strom, R. E., Bacon, D. F., and Yemini, S. A. 1988. Volatile logging in n-fault-tolerant distributed systems. 18th Annual International Symposium on Fault-Tolerant Computing (FTCS-18), pp. 44-49. Los Alamitos, CA: IEEE CS Press.
-
(1988)
18th Annual International Symposium on Fault-Tolerant Computing (FTCS-18)
, pp. 44-49
-
-
Strom, R.E.1
Bacon, D.F.2
Yemini, S.A.3