-
1
-
-
29344472607
-
Radiation-induced soft errors in advanced semiconductor technologies
-
R. C. Baumann. Radiation-induced soft errors in advanced semiconductor technologies. Device and Materials Reliability, IEEE Transactions on, 5(3):305-316, 2005.
-
(2005)
Device and Materials Reliability, IEEE Transactions on
, vol.5
, Issue.3
, pp. 305-316
-
-
Baumann, R.C.1
-
2
-
-
83155160949
-
FTI: High performance fault tolerance interface for hybrid systems
-
Nov.
-
L. Bautista-Gomez, D. Komatitsch, N. Maruyama, S. Tsuboi, F. Cappello, and S. Matsuoka. FTI: High performance fault tolerance interface for hybrid systems. In Supercomputing, pages 1-12, Nov. 2011.
-
(2011)
Supercomputing
, pp. 1-12
-
-
Bautista-Gomez, L.1
Komatitsch, D.2
Maruyama, N.3
Tsuboi, S.4
Cappello, F.5
Matsuoka, S.6
-
3
-
-
61449223447
-
Algorithm-based fault tolerance applied to high performance computing
-
G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. JPDC, 69(4):410-416, 2009.
-
(2009)
JPDC
, vol.69
, Issue.4
, pp. 410-416
-
-
Bosilca, G.1
Delmas, R.2
Dongarra, J.3
Langou, J.4
-
4
-
-
83155184556
-
Checkpointing strategies for parallel jobs
-
New York, NY, USA. ACM
-
M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. Checkpointing strategies for parallel jobs. In Supercomputing, SC'11, pages 33:1-33:11, New York, NY, USA, 2011. ACM.
-
(2011)
Supercomputing, SC'11
, pp. 33:1-33:11
-
-
Bougeret, M.1
Casanova, H.2
Rabie, M.3
Robert, Y.4
Vivien, F.5
-
5
-
-
68249127079
-
Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
-
F. Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. IJHPCA, 23(3):212-226, 2009.
-
(2009)
IJHPCA
, vol.23
, Issue.3
, pp. 212-226
-
-
Cappello, F.1
-
6
-
-
84877708941
-
Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems
-
Los Alamitos, CA, USA. IEEE Computer Society Press
-
J. Chung, I. Lee, M. Sullivan, J. H. Ryoo, D. W. Kim, D. H. Yoon, L. Kaplan, and M. Erez. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems. In Supercomputing, SC'12, pages 58:1-58:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
-
(2012)
Supercomputing, SC'12
, pp. 58:1-58:11
-
-
Chung, J.1
Lee, I.2
Sullivan, M.3
Ryoo, J.H.4
Kim, D.W.5
Yoon, D.H.6
Kaplan, L.7
Erez, M.8
-
7
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Comp. Syst., 22(3):303-312, 2006.
-
(2006)
Future Generation Comp. Syst.
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
8
-
-
74549140832
-
The case for modular redundancy in large-scale high performance computing systems
-
ACTA Press, Calgary, AB, Canada, Feb.
-
C. Engelmann, H. H. Ong, and S. L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194. ACTA Press, Calgary, AB, Canada, Feb. 2009.
-
(2009)
International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009
, pp. 189-194
-
-
Engelmann, C.1
Ong, H.H.2
Scott, S.L.3
-
9
-
-
77952275692
-
Shoestring: Probabilistic soft error reliability on the cheap
-
New York, NY, USA. ACM
-
S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: probabilistic soft error reliability on the cheap. In Architectural support for programming languages and operating systems, ASPLOS XV, pages 385-396, New York, NY, USA, 2010. ACM.
-
(2010)
Architectural Support for Programming Languages and Operating Systems, ASPLOS XV
, pp. 385-396
-
-
Feng, S.1
Gupta, S.2
Ansari, A.3
Mahlke, S.4
-
10
-
-
83155188951
-
Evaluating the viability of process replication reliability for exascale systems
-
New York, NY, USA. ACM
-
K. Ferreira, J. Stearley, J. H. Laros, III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the viability of process replication reliability for exascale systems. In Supercomputing, pages 44:1-44:12, New York, NY, USA, 2011. ACM.
-
(2011)
Supercomputing
, pp. 44:1-44:12
-
-
Ferreira, K.1
Stearley, J.2
Laros III, J.H.3
Oldfield, R.4
Pedretti, K.5
Brightwell, R.6
Riesen, R.7
Bridges, P.G.8
Arnold, D.9
-
11
-
-
84877705582
-
Detection and correction of silent data corruption for large-scale high-performance computing
-
Los Alamitos, CA, USA. IEEE Computer Society Press
-
D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. Detection and correction of silent data corruption for large-scale high-performance computing. In Supercomputing, SC'12, pages 78:1-78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
-
(2012)
Supercomputing, SC'12
, pp. 78:1-78:12
-
-
Fiala, D.1
Mueller, F.2
Engelmann, C.3
Riesen, R.4
Ferreira, K.5
Brightwell, R.6
-
12
-
-
33646126514
-
A peer-to-peer framework for robust execution of message passing parallel programs
-
Springer-Verlag
-
S. Genaud, C. Rattanapoka, and U. L. Strasbourg. A peer-to-peer framework for robust execution of message passing parallel programs. In EuroPVM/MPI 2005, volume 3666 of LNCS, pages 276-284. Springer-Verlag, 2005.
-
(2005)
EuroPVM/MPI 2005, Volume 3666 of LNCS
, pp. 276-284
-
-
Genaud, S.1
Rattanapoka, C.2
Strasbourg, U.L.3
-
13
-
-
74049121711
-
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters
-
P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In SciDAC, 2006.
-
(2006)
SciDAC
-
-
Hargrove, P.H.1
Duell, J.C.2
-
14
-
-
80955138722
-
-
Technical report, Sandia National Laboratories. September
-
M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich. Improving performance via mini-applications. Technical report, Sandia National Laboratories, September 2009.
-
(2009)
Improving Performance Via Mini-applications
-
-
Heroux, M.A.1
Doerfler, D.W.2
Crozier, P.S.3
Willenbring, J.M.4
Edwards, H.C.5
Williams, A.6
Rajan, M.7
Keiter, E.R.8
Thornquist, H.K.9
Numrich, R.W.10
-
16
-
-
84866879380
-
-
Technical Report 11-49, Parallel Programming Laboratory November
-
L. Kale, A. Arya, A. Bhatele, A. Gupta, N. Jain, P. Jetley, J. Lifflander, P. Miller, Y. Sun, R. Venkataraman, L. Wesolowski, and G. Zheng. Charm++ for productivity and performance: A submission to the 2011 HPC class II challenge. Technical Report 11-49, Parallel Programming Laboratory, November 2011.
-
(2011)
Charm++ for Productivity and Performance: A Submission to the 2011 HPC Class II Challenge
-
-
Kale, L.1
Arya, A.2
Bhatele, A.3
Gupta, A.4
Jain, N.5
Jetley, P.6
Lifflander, J.7
Miller, P.8
Sun, Y.9
Venkataraman, R.10
Wesolowski, L.11
Zheng, G.12
-
17
-
-
66749092384
-
-
P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems, 2008.
-
(2008)
Exascale Computing Study: Technology Challenges in Achieving Exascale Systems
-
-
Kogge, P.1
Bergman, K.2
Borkar, S.3
Campbell, D.4
Carlson, W.5
Dally, W.6
Denneau, M.7
Franzon, P.8
Harrod, W.9
Hiller, J.10
Karp, S.11
Keckler, S.12
Klein, D.13
Lucas, R.14
Richards, M.15
Scarpelli, A.16
Scott, S.17
Snavely, A.18
Sterling, T.19
Williams, R.S.20
Yelick, K.21
-
18
-
-
77951205449
-
A study of dynamic meta-learning for failure prediction in large-scale systems
-
June
-
Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan. A study of dynamic meta-learning for failure prediction in large-scale systems. J. Parallel Distrib. Comput., 70(6):630-643, June 2010.
-
(2010)
J. Parallel Distrib. Comput
, vol.70
, Issue.6
, pp. 630-643
-
-
Lan, Z.1
Gu, J.2
Zheng, Z.3
Thakur, R.4
Coghlan, S.5
-
19
-
-
0035390088
-
A variational calculus approach to optimal checkpoint placement
-
Y. Ling, J. Mi, and X. Lin. A variational calculus approach to optimal checkpoint placement. Computers, IEEE Transactions on, 50(7):699-708, 2001.
-
(2001)
Computers, IEEE Transactions on
, vol.50
, Issue.7
, pp. 699-708
-
-
Ling, Y.1
Mi, J.2
Lin, X.3
-
20
-
-
84899680829
-
-
Lulesh
-
Lulesh. http://computation.llnl.gov/casc/ShockHydro/.
-
-
-
-
21
-
-
84885193593
-
A message-logging protocol for multicore systems
-
Boston, USA, June
-
E. Meneses, X. Ni, and L. V. Kale. A Message-Logging Protocol for Multicore Systems. In Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), Boston, USA, June 2012.
-
(2012)
Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
-
-
Meneses, E.1
Ni, X.2
Kale, L.V.3
-
22
-
-
29344473319
-
Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer
-
Sept.
-
S. Michalak, K. Harris, N. Hengartner, B. Takala, and S. Wender. Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer. Device and Materials Reliability, IEEE Transactions on, 5(3):329-335, Sept. 2005.
-
(2005)
Device and Materials Reliability, IEEE Transactions on
, vol.5
, Issue.3
, pp. 329-335
-
-
Michalak, S.1
Harris, K.2
Hengartner, N.3
Takala, B.4
Wender, S.5
-
23
-
-
78650831692
-
Design, modeling, and evaluation of a scalable multi-level checkpointing system
-
A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC, pages 1-11, 2010.
-
(2010)
SC
, pp. 1-11
-
-
Moody, A.1
Bronevetsky, G.2
Mohror, K.3
De Supinski, B.R.4
-
24
-
-
28444483117
-
The soft error problem: An architectural perspective
-
IEEE
-
S. S. Mukherjee, J. Emer, and S. K. Reinhardt. The soft error problem: An architectural perspective. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, pages 243-247. IEEE, 2005.
-
(2005)
High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on
, pp. 243-247
-
-
Mukherjee, S.S.1
Emer, J.2
Reinhardt, S.K.3
-
25
-
-
84870713710
-
Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm
-
Beijing, China, September
-
X. Ni, E. Meneses, and L. V. Kale. Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm. In IEEE Cluster 12, Beijing, China, September 2012.
-
(2012)
IEEE Cluster 12
-
-
Ni, X.1
Meneses, E.2
Kale, L.V.3
-
26
-
-
27344436659
-
Scalable molecular dynamics with NAMD
-
J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26(16):1781-1802, 2005.
-
(2005)
Journal of Computational Chemistry
, vol.26
, Issue.16
, pp. 1781-1802
-
-
Phillips, J.C.1
Braun, R.2
Wang, W.3
Gumbart, J.4
Tajkhorshid, E.5
Villa, E.6
Chipot, C.7
Skeel, R.D.8
Kale, L.9
Schulten, K.10
-
29
-
-
20444463494
-
FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
-
San Diego, CA, September
-
G. Zheng, L. Shi, and L. V. Kale. FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI. In 2004 IEEE Cluster, pages 93-103, San Diego, CA, September 2004.
-
(2004)
2004 IEEE Cluster
, pp. 93-103
-
-
Zheng, G.1
Shi, L.2
Kale, L.V.3