-
2
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
Washington, DC, USA
-
Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. In: DSN 2006: Proceedings of the International Conference on Dependable Systems and Networks, Washington, DC, USA, pp. 249-258 (2006)
-
(2006)
DSN 2006: Proceedings of the International Conference on Dependable Systems and Networks
, pp. 249-258
-
-
Schroeder, B.1
Gibson, G.A.2
-
3
-
-
67349271621
-
An analysis of clustered failures on large supercomputing systems
-
Hacker, T.J., Romero, F., Carothers, C.D.: An analysis of clustered failures on large supercomputing systems. J. Parallel Distrib. Comput. 69(7), 652-665 (2009)
-
(2009)
J. Parallel Distrib. Comput.
, vol.69
, Issue.7
, pp. 652-665
-
-
Hacker, T.J.1
Romero, F.2
Carothers, C.D.3
-
4
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22(3), 303-312 (2006)
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
5
-
-
9144223280
-
Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
-
Elnozahy, E.N., Plank, J.S.: Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Trans. Dependable Secur. Comput. 1(2), 97-108 (2004)
-
(2004)
IEEE Trans. Dependable Secur. Comput.
, vol.1
, Issue.2
, pp. 97-108
-
-
Elnozahy, E.N.1
Plank, J.S.2
-
6
-
-
51049108820
-
An optimal checkpoint/restart model for a large scale high performance computing system
-
Liu, Y., Nassar, R., Leangsuksun, C., Naksinehaboon, N., Paun, M., Scott, S.: An optimal checkpoint/restart model for a large scale high performance computing system. In: IEEE International Symposium on Parallel and Distributed Processing, pp. 1-9 (2008)
-
(2008)
IEEE International Symposium on Parallel and Distributed Processing
, pp. 1-9
-
-
Liu, Y.1
Nassar, R.2
Leangsuksun, C.3
Naksinehaboon, N.4
Paun, M.5
Scott, S.6
-
7
-
-
34547424386
-
Cooperative checkpointing: A robust approach to large-scale systems reliability
-
ACM, New York
-
Oliner, A.J., Rudolph, L., Sahoo, R.K.: Cooperative checkpointing: a robust approach to large-scale systems reliability. In: Proceedings of the 20th Annual International Conference on Supercomputing, pp. 14-23. ACM, New York (2006)
-
(2006)
Proceedings of the 20th Annual International Conference on Supercomputing
, pp. 14-23
-
-
Oliner, A.J.1
Rudolph, L.2
Sahoo, R.K.3
-
8
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
Young, J.W.: A first order approximation to the optimum checkpoint interval. Commun. ACM 17(9), 530-531 (1974)
-
(1974)
Commun. ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1
-
9
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63-75 (1985)
-
(1985)
ACM Trans. Comput. Syst.
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
10
-
-
77955115718
-
A new flexible checkpoint/restart model
-
INRIA
-
Bouguerra, M.S., Gautier, T., Trystram, D., Vincent, J.M.: A new flexible checkpoint/restart model. Technical report, RR-6751, INRIA (2008)
-
(2008)
Technical Report RR-6751
-
-
Bouguerra, M.S.1
Gautier, T.2
Trystram, D.3
Vincent, J.M.4
-
11
-
-
0000652719
-
Selection of a checkpoint interval in a critical-task environment
-
Geist, R., Reynolds, R., Westall, J.: Selection of a checkpoint interval in a critical-task environment. IEEE Transactions on Reliability 37, 395-400 (1988)
-
(1988)
IEEE Transactions on Reliability
, vol.37
, pp. 395-400
-
-
Geist, R.1
Reynolds, R.2
Westall, J.3
-
12
-
-
0032597646
-
The average availability of parallel checkpointing systems and its importance in selecting runtime parameters
-
Plank, J.S., Thomason, M.G.: The average availability of parallel checkpointing systems and its importance in selecting runtime parameters. In: 29th International Symposium on Fault-Tolerant Computing, pp. 250-259 (1999)
-
(1999)
29th International Symposium on Fault-Tolerant Computing
, pp. 250-259
-
-
Plank, J.S.1
Thomason, M.G.2
-
13
-
-
50649087527
-
Reliability-aware approach: An incremental checkpoint/restart model in HPC environments
-
Naksinehaboon, N., Liu, Y., Leangsuksun, C., Nassar, R., Paun, M., Scott, S.: Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments. In: IEEE International Symposium on Cluster Computing and the Grid, pp. 783-788 (2008)
-
(2008)
IEEE International Symposium on Cluster Computing and the Grid
, pp. 783-788
-
-
Naksinehaboon, N.1
Liu, Y.2
Leangsuksun, C.3
Nassar, R.4
Paun, M.5
Scott, S.6