-
2
-
-
0038335808
-
Compiler-assisted checkpointing
-
Dept. of Computer Science, University of Tennessee
-
M. Beck, J. S. Plank, and G. Kingsley. Compiler-assisted checkpointing. Technical Report UT-CS-94-269, Dept. of Computer Science, University of Tennessee, 1994.
-
(1994)
Technical Report
, vol.UT-CS-94-269
-
-
Beck, M.1
Plank, J.S.2
Kingsley, G.3
-
3
-
-
0031570635
-
Application level fault tolerance in heterogeneous networks of workstations
-
A. Beguelin, E. Seligman, and P. Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, 43(2):147-155, 1997.
-
(1997)
Journal of Parallel and Distributed Computing
, vol.43
, Issue.2
, pp. 147-155
-
-
Beguelin, A.1
Seligman, E.2
Stephan, P.3
-
4
-
-
60449096682
-
MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
-
Phoenix, AZ, Nov.
-
A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In Supercomputing Conference SC'03, Phoenix, AZ, Nov. 2003.
-
(2003)
Supercomputing Conference SC'03
-
-
Bouteiller, A.1
Cappello, F.2
Herault, T.3
Krawezik, G.4
Lemarinier, P.5
Magniette, F.6
-
5
-
-
0038040085
-
Automated application-level checkpointing of MPI programs
-
San Diego, CA, June
-
G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Automated application-level checkpointing of MPI programs. In Principles and Practices of Parallel Programming, San Diego, CA, June 2003.
-
(2003)
Principles and Practices of Parallel Programming
-
-
Bronevetsky, G.1
Marques, D.2
Pingali, K.3
Stodghill, P.4
-
6
-
-
1142268808
-
Collective operations in an application-level fault tolerant MPI system
-
San Francisco, CA, June 23-26
-
G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Collective operations in an application-level fault tolerant MPI system. In International Conference on Supercomputing (ICS) 2003, San Francisco, CA, June 23-26 2003.
-
(2003)
International Conference on Supercomputing (ICS) 2003
-
-
Bronevetsky, G.1
Marques, D.2
Pingali, K.3
Stodghill, P.4
-
7
-
-
84934278304
-
-
September 19, 2001
-
B. Carnes. The smg2000 benchmark code. Available at http://www.llnl.gov/asci/purple/benchmarks/limited/smg/, September 19, 2001.
-
The Smg2000 Benchmark Code
-
-
Carnes, B.1
-
8
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1):63-75, 1985.
-
(1985)
ACM Transactions on Computer Systems
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, M.1
Lamport, L.2
-
9
-
-
84860989858
-
-
Condor, http://www.cs.wisc.edu/condor/manual.
-
-
-
-
10
-
-
0026867749
-
Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output
-
May
-
E. N. Elnozahy and W. Zwaenepoel. Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers, 41(5), May 1992.
-
(1992)
IEEE Transactions on Computers
, vol.41
, Issue.5
-
-
Elnozahy, E.N.1
Zwaenepoel, W.2
-
11
-
-
0004096191
-
A survey of rollback-recovery protocols in message passing systems
-
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Oct.
-
M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Oct. 1996.
-
(1996)
Technical Report
, vol.CMU-CS-96-181
-
-
Elnozahy, M.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4
-
12
-
-
84940567900
-
FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
-
Springer-Verlag
-
G. Fagg and J. J. Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In EuroPVM/MPI User's Group Meeting, pages 346-353. Springer-Verlag, 2000.
-
(2000)
EuroPVM/MPI User's Group Meeting
, pp. 346-353
-
-
Fagg, G.1
Dongarra, J.J.2
-
13
-
-
0010976041
-
Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification
-
Department of Computer Science, University of Virginia, 25
-
A. J. Ferrari, S. J. Chapin, and A. S. Grimshaw. Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification. Technical Report CS-97-05, Department of Computer Science, University of Virginia, 25, 1997.
-
(1997)
Technical Report
, vol.CS-97-05
-
-
Ferrari, A.J.1
Chapin, S.J.2
Grimshaw, A.S.3
-
16
-
-
0004215089
-
-
Morgan Kaufmann, San Francisco, California, first edition
-
N. Lynch. Distributed Algorithms. Morgan Kaufmann, San Francisco, California, first edition, 1996.
-
(1996)
Distributed Algorithms
-
-
Lynch, N.1
-
17
-
-
0038335808
-
Compiler-assisted checkpointing
-
Technical Report, University of Tennessee, Dec.
-
J. P. M. Beck and G. Kingsley. Compiler-Assisted Checkpointing. Technical Report CS-94-269, University of Tennessee, Dec. 1994.
-
(1994)
Technical Report
, vol.CS-94-269
-
-
Beck, J.P.M.1
Kingsley, G.2
-
18
-
-
0003912256
-
Checkpoint and migration of UNIX processes in the condor distributed processing system
-
University of Wisconsin-Madison
-
J. B. M. Litzkow, T. Tannenbaum and M. Livny. Checkpoint and migration of UNIX processes in the condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison, 1997.
-
(1997)
Technical Report
, vol.1346
-
-
Litzkow, J.B.M.1
Tannenbaum, T.2
Livny, M.3
-
19
-
-
0347102865
-
Source-code transformations for efficient reversibility
-
College of Computing, Georgia Tech, September
-
K. Perumalla and R. Fujimoto. Source-code transformations for efficient reversibility. Technical Report GIT-CC-99-21, College of Computing, Georgia Tech, September 1999.
-
(1999)
Technical Report
, vol.GIT-CC-99-21
-
-
Perumalla, K.1
Fujimoto, R.2
-
24
-
-
33645423303
-
A checkpoint and recovery system for the Pittsburgh Supercomputing Center Terascale Computing System
-
N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizino. A checkpoint and recovery system for the Pittsburgh Supercomputing Center Terascale Computing System. In Supercomputing, 2001. Available at http://www.psc.edu/publications/tech_reports/chkpt_rcvry/checkpoint-recovery-1.0.html.
-
(2001)
Supercomputing
-
-
Stone, N.1
Kochmar, J.2
Reddy, R.3
Scott, J.R.4
Sommerfield, J.5
Vizino, C.6
-
25
-
-
0141682129
-
Srs - A framework for developing malleable and migratable parallel software
-
June
-
S. Vadhiyar and J. Dongarra. Srs - a framework for developing malleable and migratable parallel software. Parallel Processing Letters, 13(2):291-312, June 2003.
-
(2003)
Parallel Processing Letters
, vol.13
, Issue.2
, pp. 291-312
-
-
Vadhiyar, S.1
Dongarra, J.2