-
3
-
-
84875193661
-
-
MPICH-V. http://mpich-v.lri.fr.
-
-
-
-
4
-
-
84875166533
-
Cooperative Application/OS DRAM Fault Recovery
-
P. Bridges, M. Hoemmen, K. Ferreira, M. Heroux, P. Soltero and R. Brightwell. Cooperative Application/OS DRAM Fault Recovery. In Proceedings of the 4th International Workshop on Resiliency in High Performance Computing (Resilience 2011), Bordeaux, France, August 29 - September 2, 2011.
-
Proceedings of the 4th International Workshop on Resiliency in High Performance Computing (Resilience 2011), Bordeaux, France, August 29 - September 2, 2011
-
-
Bridges, P.1
Hoemmen, M.2
Ferreira, K.3
Heroux, M.4
Soltero, P.5
Brightwell, R.6
-
5
-
-
0038040085
-
Automated Application-level Checkpointing of MPI Programs
-
G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Automated Application-level Checkpointing of MPI Programs. In Proceedings of the 2003 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP03), San Diego, California, June 11-13, 2003.
-
Proceedings of the 2003 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP03), San Diego, California, June 11-13, 2003
-
-
Bronevetsky, G.1
Marques, D.2
Pingali, K.3
Stodghill, P.4
-
6
-
-
57349156147
-
Soft Error Vulnerability of Iterative Linear Algebra Methods
-
G. Bronevetsky and B. R. Supinski. Soft Error Vulnerability of Iterative Linear Algebra Methods. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS2008), Island of Kos, Aegean Sea, Greece, June 7-12, 2008.
-
Proceedings of the 22nd Annual International Conference on Supercomputing (ICS2008), Island of Kos, Aegean Sea, Greece, June 7-12, 2008
-
-
Bronevetsky, G.1
Supinski, B.R.2
-
8
-
-
70450206305
-
Toward Exascale Resilience
-
F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer, and M. Snir. Toward Exascale Resilience. International Journal of High Performance Computing Applications, Vol. 23, No. 4, Page 374-388, 2009.
-
(2009)
International Journal of High Performance Computing Applications
, vol.23
, Issue.4
, pp. 374-388
-
-
Cappello, F.1
Geist, A.2
Gropp, B.3
Kalé, L.V.4
Kramer, B.5
Snir, M.6
-
9
-
-
25144456004
-
Numerically stable real number codes based on random matrices
-
Proceedings of the 5th International Conference on Computational Science (ICCS2005), Atlanta, Georgia, USA, May 22-25, 2005
-
Z. Chen and J. Dongarra. Numerically stable real number codes based on random matrices. In Proceedings of the 5th International Conference on Computational Science (ICCS2005), Atlanta, Georgia, USA, May 22-25, 2005. LNCS 3514,
-
LNCS
, vol.3514
-
-
Chen, Z.1
Dongarra, J.2
-
10
-
-
33746136466
-
Condition numbers of Gaussian random matrices
-
DOI 10.1137/040616413
-
Z. Chen and J. Dongarra. Condition Numbers of Gaussian Random Matrices. SIAM Journal on Matrix Analysis and Applications, Volume 27, Number 3, Page 603-620, 2005.
-
(2005)
SIAM Journal on Matrix Analysis and Applications
, vol.27
, Issue.3
, pp. 603-620
-
-
Chen, Z.1
Dongarra, J.J.2
-
11
-
-
33847240498
-
Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
-
Z. Chen and J. Dongarra. Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources. Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 25-29, 2006.
-
Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 25-29, 2006
-
-
Chen, Z.1
Dongarra, J.2
-
13
-
-
75449102762
-
Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
-
July
-
Z. Chen and J. Dongarra. Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing. IEEE Transactions on Computers, July, 2009.
-
(2009)
IEEE Transactions on Computers
-
-
Chen, Z.1
Dongarra, J.2
-
14
-
-
31844451082
-
Fault tolerant high performance computing by a coding approach
-
Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra. Fault tolerant high performance computing by a coding approach. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP 2005), June 14-17, 2005, Chicago, IL, USA.
-
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP 2005), June 14-17, 2005, Chicago, IL, USA
-
-
Chen, Z.1
Fagg, G.E.2
Gabriel, E.3
Langou, J.4
Angskun, T.5
Bosilca, G.6
Dongarra, J.7
-
16
-
-
79959586938
-
High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing
-
T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing. Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011), Tucson, Arizona, May 31 - June 4, 2011.
-
Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011), Tucson, Arizona, May 31 - June 4, 2011
-
-
Davies, T.1
Karlsson, C.2
Liu, H.3
Ding, C.4
Chen, Z.5
-
17
-
-
77953995050
-
Algorithmic Cholesky Factorization Fault Recovery
-
D. Hakkarinen and Z. Chen. Algorithmic Cholesky Factorization Fault Recovery. Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, Atlanta, GA, USA, April 19-23, 2010.
-
Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, Atlanta, GA, USA, April 19-23, 2010
-
-
Hakkarinen, D.1
Chen, Z.2
-
19
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
J. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Comp. Syst., 22(3): 303-312 (2006).
-
(2006)
Future Generation Comp. Syst.
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.1
-
20
-
-
84877716050
-
A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
-
D. Fiala, K. Ferreira, F. Mueller, and C. Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. In Proceedings of the 4th International Workshop on Resiliency in High Performance Computing (Resilience 2011), Bordeaux, France, August 29 - September 2, 2011.
-
Proceedings of the 4th International Workshop on Resiliency in High Performance Computing (Resilience 2011), Bordeaux, France, August 29 - September 2, 2011
-
-
Fiala, D.1
Ferreira, K.2
Mueller, F.3
Engelmann, C.4
-
21
-
-
84875140305
-
-
Open MPI: www.open-mpi.org/.
-
-
-
-
22
-
-
58349092078
-
Failure Tolerance in Petascale Computers
-
November
-
G. A. Gibson, B. Schroeder, and J. Digney. Failure Tolerance in Petascale Computers. CTWatch Quarterly, Volume 3, Number 4, November 2007.
-
(2007)
CTWatch Quarterly
, vol.3
, Issue.4
-
-
Gibson, G.A.1
Schroeder, B.2
Digney, J.3
-
24
-
-
0021439162
-
Algorithm-based fault tolerance for matrix operations
-
K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, vol. C-33:518-528, 1984.
-
(1984)
IEEE Transactions on Computers
, vol.C-33
, pp. 518-528
-
-
Huang, K.-H.1
Abraham, J.A.2
-
25
-
-
0032179680
-
Diskless checkpointing
-
J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst., 9(10):972-986, 1998.
-
(1998)
IEEE Transactions on Parallel and Distributed Systems
, vol.9
, Issue.10
, pp. 972-986
-
-
Plank, J.S.1
Li, K.2
Puening, M.A.3
-
26
-
-
77953987800
-
Analyzing the soft error resilience of linear solvers on multicore multiprocessors
-
K. Malkowski, P. Raghavan, and M. Kandemir. Analyzing the soft error resilience of linear solvers on multicore multiprocessors. Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, Atlanta, GA, USA, April 19-23, 2010.
-
Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, Atlanta, GA, USA, April 19-23, 2010
-
-
Malkowski, K.1
Raghavan, P.2
Kandemir, M.3
-
27
-
-
79959593688
-
Characterizing the impact of soft errors on iterative methods in scientific computing
-
M. Shantharam, S. Srinivasmurthy, and P. Raghavan. Characterizing the impact of soft errors on iterative methods in scientific computing. Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011), Tucson, Arizona, May 31 - June 4, 2011.
-
Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011), Tucson, Arizona, May 31 - June 4, 2011
-
-
Shantharam, M.1
Srinivasmurthy, S.2
Raghavan, P.3
-
28
-
-
84990637885
-
PVM: A framework for parallel distributed computing
-
V. S. Sunderam. PVM: a framework for parallel distributed computing. Concurrency: Pract. Exper., 2(4):315-339, 1990.
-
(1990)
Concurrency: Pract. Exper.
, vol.2
, Issue.4
, pp. 315-339
-
-
Sunderam, V.S.1
-
29
-
-
1842829625
-
-
Society for Industrial and Applied Mathematics. Second Edition. April 30
-
Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics. Second Edition. April 30, 2003.
-
(2003)
Iterative Methods for Sparse Linear Systems
-
-
Saad, Y.1
-
30
-
-
34548768671
-
Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
-
C. Wang, F. Mueller, C. Engelmann, and S. Scott. Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium, March, 2007.
-
Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium, March, 2007
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.4