-
1
-
-
0028994247
-
Software rejuvenation: Analysis, module and applications
-
Washington, DC, USA: IEEE CS
-
N. Kolettis and N. D. Fulton, "Software rejuvenation: Analysis, module and applications," in FTCS '95. Washington, DC, USA: IEEE CS, 1995, p. 381.
-
(1995)
FTCS '95
, pp. 381
-
-
Kolettis, N.1
Fulton, N.D.2
-
4
-
-
60649109658
-
Supporting distributed application workflows in heterogeneous computing environments
-
IEEE Computer Society Press
-
Q. Wu and Y. Gu, "Supporting distributed application workflows in heterogeneous computing environments," in 14th Int. Conf. on Parallel and Distributed Systems (ICPADS). IEEE Computer Society Press, 2008.
-
(2008)
14th Int. Conf. on Parallel and Distributed Systems (ICPADS)
-
-
Wu, Q.1
Gu, Y.2
-
5
-
-
0020765766
-
Effects of checkpointing on program execution time
-
DOI 10.1016/0020-0190(83)90093-5
-
A. Duda, "The effects of checkpointing on program execution time," Inf. Processing Letters, Vol. 16, no. 5, pp. 221-229, 1983. (Pubitemid 13590444)
-
(1983)
Information Processing Letters
, vol.16
, Issue.5
, pp. 221-229
-
-
Duda, A.1
-
6
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
-
J. T. Daly, "A higher order estimate of the optimum checkpoint interval for restart dumps," Future Generation Computer Systems, Vol. 22, no. 3, pp. 303-312, 2006. (Pubitemid 41689812)
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
7
-
-
0036041277
-
Improving cluster availability using workstation validation
-
T. Heath, R. P. Martin, and T. D. Nguyen, "Improving cluster availability using workstation validation," SIGMETRICS Perf. Eval. Rev., Vol. 30, no. 1, pp. 217-227, 2002. (Pubitemid 35009524)
-
(2002)
Performance Evaluation Review
, vol.30
, Issue.1
, pp. 217-227
-
-
Heath, T.1
Martin, R.P.2
Nguyen, T.D.3
-
8
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
DOI 10.1109/DSN.2006.5, 1633514, Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks
-
B. Schroeder and G. A. Gibson, "A large-scale study of failures in high-performance computing systems," in Proc. of DSN, 2006, pp. 249-258. (Pubitemid 44930426)
-
(2006)
Proceedings of the International Conference on Dependable Systems and Networks
, vol.2006
, pp. 249-258
-
-
Schroeder, B.1
Gibson, G.A.2
-
9
-
-
51049108820
-
An optimal checkpoint/restart model for a large scale high performance computing system
-
IEEE
-
Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. Scott, "An optimal checkpoint/restart model for a large scale high performance computing system," in IPDPS 2008. IEEE, 2008, pp. 1-9.
-
(2008)
IPDPS 2008
, pp. 1-9
-
-
Liu, Y.1
Nassar, R.2
Leangsuksun, C.3
Naksinehaboon, N.4
Paun, M.5
Scott, S.6
-
10
-
-
83155160934
-
Modeling and tolerating heterogeneous failures in large parallel systems
-
ACM Press
-
E. Heien, D. Kondo, A. Gainaru, D. LaPine, B. Kramer, and F. Cappello, "Modeling and tolerating heterogeneous failures in large parallel systems," in Proc. SC'2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis. ACM Press, 2011.
-
(2011)
Proc. SC'2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis
-
-
Heien, E.1
Kondo, D.2
Gainaru, A.3
LaPine, D.4
Kramer, B.5
Cappello, F.6
-
11
-
-
77955097389
-
A flexible checkpoint/restart model in distributed systems
-
http://dx.doi.org/10.1007/978-3-642-14390-8_22
-
M.-S. Bouguerra, T. Gautier, D. Trystram, and J.-M. Vincent, "A flexible checkpoint/restart model in distributed systems," in PPAM, ser. LNCS, Vol. 6067, 2010, pp. 206-215. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-14390-8_22
-
(2010)
PPAM, ser. LNCS
, vol.6067
, pp. 206-215
-
-
Bouguerra, M.-S.1
Gautier, T.2
Trystram, D.3
Vincent, J.-M.4
-
12
-
-
83155184556
-
Checkpointing strategies for parallel jobs
-
ACM Press
-
M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, "Checkpointing strategies for parallel jobs," in Proc. SC'2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis. ACM Press, 2011.
-
(2011)
Proc. SC'2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis
-
-
Bougeret, M.1
Casanova, H.2
Rabie, M.3
Robert, Y.4
Vivien, F.5
-
13
-
-
85060036181
-
The validity of the single processor approach to achieving large scale computing capabilities
-
AFIPS Press
-
G. Amdahl, "The validity of the single processor approach to achieving large scale computing capabilities," in AFIPS Conference Proceedings, Vol. 30. AFIPS Press, 1967, pp. 483-485.
-
(1967)
AFIPS Conference Proceedings
, vol.30
, pp. 483-485
-
-
Amdahl, G.1
-
14
-
-
0003615167
-
-
SIAM
-
L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide. SIAM, 1997.
-
(1997)
ScaLAPACK Users' Guide
-
-
Blackford, L.S.1
Choi, J.2
Cleary, A.3
D'Azevedo, E.4
Demmel, J.5
Dhillon, I.6
Dongarra, J.7
Hammarling, S.8
Henry, G.9
Petitet, A.10
Stanley, K.11
Walker, D.12
Whaley, R.C.13
-
15
-
-
84867631517
-
Using group replication for resilience on exascale systems
-
Research report RR-7876, February
-
M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, "Using group replication for resilience on exascale systems," INRIA, Research report RR-7876, February 2012. [Online]. Available: http://hal.inria.fr/hal-00668016
-
(2012)
INRIA
-
-
Bougeret, M.1
Casanova, H.2
Robert, Y.3
Vivien, F.4
Zaidouni, D.5
-
19
-
-
84880864185
-
Complexity analysis of checkpoint scheduling with variable costs
-
IEEE Transactions On
-
M.-S. Bouguerra, D. Trystram, and F. Wagner, "Complexity analysis of checkpoint scheduling with variable costs," Computers, IEEE Transactions on, 2012.
-
(2012)
Computers
-
-
Bouguerra, M.-S.1
Trystram, D.2
Wagner, F.3
-
20
-
-
77954903245
-
The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems
-
IEEE International Symposium On
-
D. Kondo, B. Javadi, A. Iosup, and D. Epema, "The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems," Cluster Computing and the Grid, IEEE International Symposium on, Vol. 0, pp. 398-407, 2010.
-
(2010)
Cluster Computing and the Grid
, pp. 398-407
-
-
Kondo, D.1
Javadi, B.2
Iosup, A.3
Epema, D.4
-
21
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
J. W. Young, "A first order approximation to the optimum checkpoint interval," Communications of the ACM, Vol. 17, no. 9, pp. 530-531, 1974.
-
(1974)
Communications of the ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1
-
22
-
-
78650009816
-
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
-
ACM
-
W. Jones, J. Daly, and N. DeBardeleben, "Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters," in HPDC'10. ACM, 2010, pp. 276-279.
-
(2010)
HPDC'10
, pp. 276-279
-
-
Jones, W.1
Daly, J.2
DeBardeleben, N.3
-
23
-
-
83155195315
-
Analysis of dependencies of checkpoint cost and checkpoint interval of fault tolerant MPI applications
-
K. Venkatesh, "Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications," Analysis, Vol. 2, no. 08, pp. 2690-2697, 2010.
-
(2010)
Analysis
, vol.2
, Issue.8
, pp. 2690-2697
-
-
Venkatesh, K.1
-
24
-
-
84976696875
-
Performance analysis of checkpointing strategies
-
A. Tantawi and M. Ruschitzka, "Performance analysis of checkpointing strategies," ACM TOCS, Vol. 2, no. 2, pp. 123-144, 1984.
-
(1984)
ACM TOCS
, vol.2
, Issue.2
, pp. 123-144
-
-
Tantawi, A.1
Ruschitzka, M.2
-
25
-
-
35248884762
-
Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems
-
DOI 10.1145/1248377.1248423, SPAA'07: Proceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures
-
J. Dongarra, E. Jeannot, E. Saule, and Z. Shi, "Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems," in ACM Symposium on Parallel Algorithms and Architectures (SPAA). ACM Press, 2007, pp. 280-288. (Pubitemid 47568577)
-
(2007)
Annual ACM Symposium on Parallelism in Algorithms and Architectures
, pp. 280-288
-
-
Dongarra, J.J.1
Jeannot, E.2
Saule, E.3
Shi, Z.4
-
26
-
-
0036504529
-
Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing
-
DOI 10.1109/71.993209
-
A. Dogan and F. Özgüner, "Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing," IEEE Trans. Parallel Distributed Systems, Vol. 13, no. 3, pp. 308-323, 2002. (Pubitemid 34448783)
-
(2002)
IEEE Transactions on Parallel and Distributed Systems
, vol.13
, Issue.3
, pp. 308-323
-
-
Dogan, A.1
Ozguner, F.2
-
27
-
-
59149105005
-
Reliability versus performance for critical applications
-
A. Girault, E. Saule, and D. Trystram, "Reliability versus performance for critical applications," J. Parallel Distributed Computing, Vol. 69, no. 3, pp. 326-336, 2009.
-
(2009)
J. Parallel Distributed Computing
, vol.69
, Issue.3
, pp. 326-336
-
-
Girault, A.1
Saule, E.2
Trystram, D.3
-
28
-
-
77956978166
-
Using replication and checkpointing for reliable task management in computational grids
-
S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, "Using Replication and Checkpointing for Reliable Task Management in Computational Grids," in Proc. of the International Conference on High Performance Computing & Simulation, 2010.
-
(2010)
Proc. of the International Conference on High Performance Computing & Simulation
-
-
Yi, S.1
Kondo, D.2
Kim, B.3
Park, G.4
Cho, Y.5
-
29
-
-
83155188951
-
Evaluating the viability of process replication reliability for exascale systems
-
K. Ferreira, J. Stearley, J. H. I. Laros, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold, "Evaluating the Viability of Process Replication Reliability for Exascale Systems," in Proceedings of the 2011 ACM/IEEE Conference on Supercomputing, 2011.
-
(2011)
Proceedings of the 2011 ACM/IEEE Conference on Supercomputing
-
-
Ferreira, K.1
Stearley, J.2
Laros, J.H.I.3
Oldfield, R.4
Pedretti, K.5
Brightwell, R.6
Riesen, R.7
Bridges, P.G.8
Arnold, D.9