-
1
-
-
0024142081
-
A linear algebraic model of algorithm-based fault tolerance
-
December
-
C. J. Anfinson and F. T. Luk. A linear algebraic model of algorithm-based fault tolerance. IEEE Transactions on Computers, 37(12), December 1988.
-
(1988)
IEEE Transactions on Computers
, vol.37
, Issue.12
-
-
Anfinson, C.J.1
Luk, F.T.2
-
2
-
-
79959599577
-
Bounds on algorithm-based fault tolerance in multiple processor systems
-
P. Banerjee and J. Abraham. Bounds on algorithm-based fault tolerance in multiple processor systems. IEEE Transactions on Computers, 1986.
-
(1986)
IEEE Transactions on Computers
-
-
Banerjee, P.1
Abraham, J.2
-
3
-
-
0025489006
-
Algorithm-based fault tolerance on a hypercube multiprocessor
-
P. Banerjee, J. T. Rahmeh, C. B. Stunkel, V. S. S. Nair, K. Roy, V. Balasubramanian, and J. A. Abraham. Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Transactions on Computers, C-39:1132-1145, 1990.
-
(1990)
IEEE Transactions on Computers
, vol.C-39
, pp. 1132-1145
-
-
Banerjee, P.1
Rahmeh, J.T.2
Stunkel, C.B.3
Nair, V.S.S.4
Roy, K.5
Balasubramanian, V.6
Abraham, J.A.7
-
4
-
-
68249127079
-
Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
-
August
-
F. Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. International Journal of High Performance Computing Applications, 23(3), August 2009.
-
(2009)
International Journal of High Performance Computing Applications
, vol.23
, Issue.3
-
-
Cappello, F.1
-
6
-
-
74049164805
-
Optimal real number codes for fault tolerant matrix operations
-
Z. Chen. Optimal real number codes for fault tolerant matrix operations. In Proceedings of the ACM/IEEE SC2009 Conference on High Performance Networking, Computing, Storage, and Analysis, Portland, OR, USA, November 2009.
-
Proceedings of the ACM/IEEE SC2009 Conference on High Performance Networking, Computing, Storage, and Analysis, Portland, OR, USA, November 2009
-
-
Chen, Z.1
-
10
-
-
31844451082
-
Fault tolerant high performance computing by a coding approach
-
Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra. Fault tolerant high performance computing by a coding approach. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Chicago, IL, USA, June 2005.
-
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Chicago, IL, USA, June 2005
-
-
Chen, Z.1
Fagg, G.E.2
Gabriel, E.3
Langou, J.4
Angskun, T.5
Bosilca, G.6
Dongarra, J.7
-
12
-
-
36949009638
-
-
PhD thesis, University of Illinois at Urbana-Champaign
-
C. da Lu. Scalable diskless checkpointing for large parallel systems. PhD thesis, University of Illinois at Urbana-Champaign, 2005.
-
(2005)
Scalable Diskless Checkpointing for Large Parallel Systems
-
-
Da Lu, C.1
-
13
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
-
J. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303-312, 2006. (Pubitemid 41689812)
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
15
-
-
77954904463
-
Distributed diskless checkpoint for large scale systems
-
L. A. B. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka. Distributed diskless checkpoint for large scale systems. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
-
2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010
-
-
Gomez, L.A.B.1
Maruyama, N.2
Cappello, F.3
Matsuoka, S.4
-
17
-
-
77953995050
-
Algorithmic Cholesky factorization fault recovery
-
D. Hakkarinen and Z. Chen. Algorithmic Cholesky factorization fault recovery. In Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium, Atlanta, GA, USA, April 2010.
-
Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium, Atlanta, GA, USA, April 2010
-
-
Hakkarinen, D.1
Chen, Z.2
-
18
-
-
0021439162
-
Algorithm-based fault tolerance for matrix operations
-
K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33:518-528, 1984.
-
(1984)
IEEE Transactions on Computers
, vol.C-33
, pp. 518-528
-
-
Huang, K.-H.1
Abraham, J.A.2
-
19
-
-
0022721936
-
Fault-tolerant matrix arithmetic and signal processing on highly concurrent computing structures
-
May
-
J. Jou and J. Abraham. Fault-tolerant matrix arithmetic and signal processing on highly concurrent computing structures. In Proceedings of the IEEE, volume 74, May 1986.
-
(1986)
Proceedings of the IEEE
, vol.74
-
-
Jou, J.1
Abraham, J.2
-
21
-
-
0023995880
-
An analysis of algorithm-based fault tolerance techniques
-
DOI 10.1016/0743-7315(88)90027-5
-
F. T. Luk and H. Park. An analysis of algorithm-based fault tolerance techniques. Journal of Parallel and Distributed Computing, 5(2):172-184, 1988. (Pubitemid 18589858)
-
(1988)
Journal of Parallel and Distributed Computing
, vol.5
, Issue.2
, pp. 172-184
-
-
Luk, F.T.1
Park, H.2
-
22
-
-
78650831692
-
Design, modeling, and evaluation of a scalable multi-level checkpointing system
-
A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In IEEE/ACM Supercomputing Conference, November 2010.
-
IEEE/ACM Supercomputing Conference, November 2010
-
-
Moody, A.1
Bronevetsky, G.2
Mohror, K.3
De Supinski, B.R.4
-
24
-
-
0031570636
-
Fault-tolerant matrix operations for networks of workstations using diskless checkpointing
-
DOI 10.1006/jpdc.1997.1336, PII S0743731597913368
-
J. S. Plank, Y. Kim, and J. Dongarra. Fault-tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing, 43:125-138, 1997. (Pubitemid 127171409)
-
(1997)
Journal of Parallel and Distributed Computing
, vol.43
, Issue.2
, pp. 125-138
-
-
Plank, J.S.1
Kim, Y.2
Dongarra, J.J.3
-
25
-
-
0032179680
-
Diskless checkpointing
-
J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, 1998.
-
(1998)
IEEE Transactions on Parallel and Distributed Systems
, vol.9
, Issue.10
, pp. 972-986
-
-
Plank, J.S.1
Li, K.2
Puening, M.A.3
-
26
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, Philadelphia, PA, USA, June 2006.
-
Proceedings of the International Conference on Dependable Systems and Networks, Philadelphia, PA, USA, June 2006
-
-
Schroeder, B.1
Gibson, G.A.2
-
28
-
-
34548768671
-
Job pause service under LAM/MPI+BLCR for transparent fault tolerance
-
C. Wang, F. Mueller, C. Engelmann, and S. Scott. Job pause service under LAM/MPI+BLCR for transparent fault tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, USA, March 2007.
-
Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, USA, March 2007
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.4
-
29
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
September
-
J. W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM, 17:530-531, September 1974.
-
(1974)
Commun. ACM
, vol.17
, pp. 530-531
-
-
Young, J.W.1