-
1
-
-
70449657893
-
DRAM errors in the wild: A large-scale field study
-
B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 2009, pp. 193-204.
-
SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 2009
, pp. 193-204
-
-
Schroeder, B.1
Pinheiro, E.2
Weber, W.-D.3
-
3
-
-
84858781341
-
Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design
-
Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems, ser.
-
A. A. Hwang, I. A. Stefanovici, and B. Schroeder, "Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design," in Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '12, 2012, pp. 111-122.
-
(2012)
ASPLOS '12
, pp. 111-122
-
-
Hwang, A.A.1
Stefanovici, I.A.2
Schroeder, B.3
-
4
-
-
77954589223
-
-
Los Alamos National Laboratory, Los Alamos, NM, USA, Tech. Rep. LALP-07-041, Jun.
-
J. T. Daly, "ADTSC nuclear weapons highlights: Facilitating high-throughput ASC calculations," Los Alamos National Laboratory, Los Alamos, NM, USA, Tech. Rep. LALP-07-041, Jun. 2007.
-
(2007)
ADTSC Nuclear Weapons Highlights: Facilitating High-throughput ASC Calculations
-
-
Daly, J.T.1
-
5
-
-
79951775997
-
Application MTTFE vs. platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale
-
J. T. Daly, L. A. Pritchett-Sheats, and S. E. Michalak, "Application MTTFE vs. platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale," in Proceedings of the Workshop on Resiliency in High Performance Computing (Resilience) 2008, May 2008, pp. 19-22.
-
Proceedings of the Workshop on Resiliency in High Performance Computing (Resilience) 2008, May 2008
, pp. 19-22
-
-
Daly, J.T.1
Pritchett-Sheats, L.A.2
Michalak, S.E.3
-
7
-
-
83155188951
-
Evaluating the viability of process replication reliability for exascale systems
-
Nov.
-
K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. Bridges, and D. Arnold, "Evaluating the viability of process replication reliability for exascale systems," in Supercomputing, Nov. 2011.
-
(2011)
Supercomputing
-
-
Ferreira, K.1
Stearley, J.2
Laros III, J.H.3
Oldfield, R.4
Pedretti, K.5
Brightwell, R.6
Riesen, R.7
Bridges, P.8
Arnold, D.9
-
10
-
-
0016874205
-
Redundancy management technique for space shuttle computers
-
J. R. Sklaroff, "Redundancy management technique for space shuttle computers," IBM Journal of Research and Development, vol. 20, no. 1, pp. 20-28, 1976.
-
(1976)
IBM Journal of Research and Development
, vol.20
, Issue.1
, pp. 20-28
-
-
Sklaroff, J.R.1
-
11
-
-
15044363155
-
Robust system design with built-in soft-error resilience
-
DOI 10.1109/MC.2005.70
-
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, "Robust system design with built-in soft-error resilience," Computer, vol. 38, no. 2, pp. 43-52, 2005. (Pubitemid 40377402)
-
(2005)
Computer
, vol.38
, Issue.2
, pp. 43-52
-
-
Mitra, S.1
Seifert, N.2
Zhang, M.3
Shi, Q.4
Kim, K.S.5
-
12
-
-
0038346239
-
Transient-fault recovery for chip multiprocessors
-
M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, "Transient-fault recovery for chip multiprocessors," in International Symposium on Computer Architecture, May 2003, pp. 98-109.
-
International Symposium on Computer Architecture, May 2003
, pp. 98-109
-
-
Gomaa, M.1
Scarbrough, C.2
Vijaykumar, T.N.3
Pomeranz, I.4
-
14
-
-
33746127333
-
Terrestrial-based radiation upsets: A cautionary tale
-
H. Quinn and P. Graham, "Terrestrial-based radiation upsets: A cautionary tale," in Symposium on Field-Programmable Custom Computing Machines (FCCM) 2005, Apr. 18-20, 2005, pp. 193-202.
-
Symposium on Field-Programmable Custom Computing Machines (FCCM) 2005, Apr. 18-20, 2005
, pp. 193-202
-
-
Quinn, H.1
Graham, P.2
-
15
-
-
84866942395
-
Combining partial redundancy and checkpointing for HPC
-
accepted
-
J. Elliott, K. Kharbas, D. Fiala, F. Mueller, C. Engelmann, and K. Ferreira, "Combining partial redundancy and checkpointing for HPC," in International Conference on Distributed Computing Systems, 2012, p. (accepted).
-
International Conference on Distributed Computing Systems, 2012
-
-
Elliott, J.1
Kharbas, K.2
Fiala, D.3
Mueller, F.4
Engelmann, C.5
Ferreira, K.6
-
16
-
-
84877712705
-
HPC landscape - Application accelerators: Deus ex machina?
-
Sep. invited Talk at
-
J. Vetter, "HPC landscape - application accelerators: Deus ex machina?" Sep. 2009, invited talk at High Performance Embedded Computing Workshop.
-
(2009)
High Performance Embedded Computing Workshop
-
-
Vetter, J.1
-
17
-
-
84877697134
-
Simulation challenge: Exascale planning overview
-
Aug. invited Talk at
-
J. Shalf, "Simulation challenge: Exascale planning overview," Aug. 2010, invited Talk at HEC FSIO R&D Workshop.
-
(2010)
HEC FSIO R&D Workshop
-
-
Shalf, J.1
-
18
-
-
79951595196
-
The international exascale software project roadmap
-
J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J. C. Andre, D. Barkai, J. Y. Berthou, T. Boku, B. Braunschweig, et al., "The international exascale software project roadmap," International Journal of High Performance Computing Applications, vol. 25, no. 1, pp. 3-60, 2011.
-
(2011)
International Journal of High Performance Computing Applications
, vol.25
, Issue.1
, pp. 3-60
-
-
Dongarra, J.1
Beckman, P.2
Moore, T.3
Aerts, P.4
Aloisio, G.5
Andre, J.C.6
Barkai, D.7
Berthou, J.Y.8
Boku, T.9
Braunschweig, B.10
-
19
-
-
84877705582
-
-
Dept. of Computer Science, North Carolina State University, Tech. Rep. TR 2012-5, May
-
D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, and R. Riesen, "Detection and correction of silent data corruption for large-scale high-performance computing," Dept. of Computer Science, North Carolina State University, Tech. Rep. TR 2012-5, May 2012.
-
(2012)
Detection and Correction of Silent Data Corruption for Large-scale High-performance Computing
-
-
Fiala, D.1
Mueller, F.2
Engelmann, C.3
Ferreira, K.4
Brightwell, R.5
Riesen, R.6
-
20
-
-
84862123385
-
File I/O for MPI applications in redundant execution scenarios
-
S. Böhm and C. Engelmann, "File I/O for MPI applications in redundant execution scenarios," in Euromicro International Conference on Parallel, Distributed, and Network-based Processing, Feb. 2012.
-
Euromicro International Conference on Parallel, Distributed, and Network-based Processing, Feb. 2012
-
-
Böhm, S.1
Engelmann, C.2
-
21
-
-
77955737995
-
-
Whitepaper, Dec. [Online]. Available
-
N. DeBardeleben, J. Laros, J. T. Daly, S. L. Scott, C. Engelmann, and B. Harrod, "High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development," Whitepaper, Dec. 2009. [Online]. Available: http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf
-
(2009)
High-end Computing Resilience: Analysis of Issues Facing the HEC Community and Path-forward for Research and Development
-
-
DeBardeleben, N.1
Laros, J.2
Daly, J.T.3
Scott, S.L.4
Engelmann, C.5
Harrod, B.6
-
22
-
-
33749067567
-
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters
-
Denver, CO, USA: Institute of Physics Publishing, Bristol, UK, Jun. 25-29, [Online]. Available
-
P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters," in Journal of Physics: Proceedings of the Scientific Discovery through Advanced Computing Program (SciDAC) Conference 2006, vol. 46. Denver, CO, USA: Institute of Physics Publishing, Bristol, UK, Jun. 25-29, 2006, pp. 494-499. [Online]. Available: http://www.iop.org/EJ/article/1742-6596/46/1/067/jpconf6-46-067.pdf
-
(2006)
Journal of Physics: Proceedings of the Scientific Discovery Through Advanced Computing Program (SciDAC) Conference 2006
, vol.46
, pp. 494-499
-
-
Hargrove, P.H.1
Duell, J.C.2
-
23
-
-
78650807026
-
-
Lawrence Livermore National Laboratory, Livermore, CA, USA, Tech. Rep. TR-JLPC-09-01, Aug. [Online]. Available
-
G. Bronevetsky and A. Moody, "Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O," Lawrence Livermore National Laboratory, Livermore, CA, USA, Tech. Rep. TR-JLPC-09-01, Aug. 2009. [Online]. Available: http://dx.doi.org/10.2172/964079
-
(2009)
Scalable I/O Systems Via Node-local Storage: Approaching 1 TB/sec File I/O
-
-
Bronevetsky, G.1
Moody, A.2
-
24
-
-
83155182888
-
System implications of memory reliability in exascale computing
-
S. Li, K. Chen, M.-Y. Hsieh, N. Muralimanohar, C. D. Kersey, J. B. Brockman, A. F. Rodrigues, and N. P. Jouppi, "System implications of memory reliability in exascale computing," in Supercomputing, 2011, pp. 46:1-46:12.
-
(2011)
Supercomputing
-
-
Li, S.1
Chen, K.2
Hsieh, M.-Y.3
Muralimanohar, N.4
Kersey, C.D.5
Brockman, J.B.6
Rodrigues, A.F.7
Jouppi, N.P.8
-
25
-
-
84877716050
-
A tunable, software-based DRAM error detection and correction library for HPC
-
D. Fiala, K. Ferreira, F. Mueller, and C. Engelmann, "A tunable, software-based DRAM error detection and correction library for HPC," in Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Sep. 2011, pp. 110-121.
-
Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Sep. 2011
, pp. 110-121
-
-
Fiala, D.1
Ferreira, K.2
Mueller, F.3
Engelmann, C.4
-
26
-
-
29344473319
-
Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer
-
[Online]. Available
-
S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, and S. A. Wender, "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer," IEEE Transactions on Device and Materials Reliability (TDMR), vol. 5, no. 3, pp. 329-335, 2005. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs-all.jsp?arnumber=1545893
-
(2005)
IEEE Transactions on Device and Materials Reliability (TDMR)
, vol.5
, Issue.3
, pp. 329-335
-
-
Michalak, S.E.1
Harris, K.W.2
Hengartner, N.W.3
Takala, B.E.4
Wender, S.A.5
-
27
-
-
57349156147
-
Soft error vulnerability of iterative linear algebra methods
-
Island of Kos, Greece: ACM Press, New York, NY, USA, Jun. 7-12, [Online]. Available
-
st ACM International Conference on Supercomputing (ICS) 2008. Island of Kos, Greece: ACM Press, New York, NY, USA, Jun. 7-12, 2007. [Online]. Available: http://greg.bronevetsky.com/papers/2008ICS.pdf
-
(2007)
st ACM International Conference on Supercomputing (ICS) 2008
-
-
Bronevetsky, G.1
De Supinski, B.R.2
-
28
-
-
0026404704
-
Architecture of fault-tolerant computers: An historical perspective
-
[Online]. Available
-
D. P. Siewiorek, "Architecture of fault-tolerant computers: An historical perspective," Proceedings of the IEEE, vol. 79, no. 12, pp. 1710-1734, 1991. [Online]. Available: http://dx.doi.org/10.1109/5.119549
-
(1991)
Proceedings of the IEEE
, vol.79
, Issue.12
, pp. 1710-1734
-
-
Siewiorek, D.P.1
-
29
-
-
58149131807
-
DDMR: Dynamic and scalable dual modular redundancy with short validation intervals
-
[Online]. Available
-
A. Golander, S. Weiss, and R. Ronen, "DDMR: Dynamic and scalable dual modular redundancy with short validation intervals," IEEE Computer Architecture Letters, vol. 7, no. 2, pp. 65-68, 2008. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.12
-
(2008)
IEEE Computer Architecture Letters
, vol.7
, Issue.2
, pp. 65-68
-
-
Golander, A.1
Weiss, S.2
Ronen, R.3
-
30
-
-
67649255075
-
PLR: A software approach to transient fault tolerance for multicore architectures
-
[Online]. Available
-
A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors, "PLR: A software approach to transient fault tolerance for multicore architectures," IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 6, no. 2, pp. 135-148, 2009. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TDSC.2008.62
-
(2009)
IEEE Transactions on Dependable and Secure Computing (TDSC)
, vol.6
, Issue.2
, pp. 135-148
-
-
Shye, A.1
Blomstedt, J.2
Moseley, T.3
Reddi, V.J.4
Connors, D.A.5
-
31
-
-
0036287327
-
Detailed design and evaluation of redundant multithreading alternatives
-
Anchorage, AK, USA: IEEE Computer Society, May 25-29, 2002, [Online]. Available
-
th Annual International Symposium on Computer Architecture (ISCA) 2002. Anchorage, AK, USA: IEEE Computer Society, May 25-29, 2002, pp. 99-110. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/ISCA.2002.1003566
-
th Annual International Symposium on Computer Architecture (ISCA) 2002
, pp. 99-110
-
-
Mukherjee, S.S.1
Kontz, M.2
Reinhardt, S.K.3
-
32
-
-
74549140832
-
The case for modular redundancy in large-scale high performance computing systems
-
Innsbruck, Austria: ACTA Press, Calgary, AB, Canada, Feb. 16-18, [Online]. Available
-
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009. Innsbruck, Austria: ACTA Press, Calgary, AB, Canada, Feb. 16-18, 2009, pp. 189-194. [Online]. Available: http://www.csm.ornl.gov/~engelman/publications/engelmann09case.pdf
-
(2009)
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009
, pp. 189-194
-
-
Engelmann, C.1
Ong, H.H.2
Scott, S.L.3
-
33
-
-
78149257903
-
Transparent redundant computing with MPI
-
EuroMPI, ser. R. Keller, E. Gabriel, M. M. Resch, and J. Dongarra, Eds., Springer
-
R. Brightwell, K. B. Ferreira, and R. Riesen, "Transparent redundant computing with MPI," in EuroMPI, ser. Lecture Notes in Computer Science, R. Keller, E. Gabriel, M. M. Resch, and J. Dongarra, Eds., vol. 6305. Springer, 2010, pp. 208-218.
-
(2010)
Lecture Notes in Computer Science
, vol.6305
, pp. 208-218
-
-
Brightwell, R.1
Ferreira, K.B.2
Riesen, R.3
-
34
-
-
79958180996
-
Redundant execution of HPC applications with MR-MPI
-
Innsbruck, Austria: ACTA Press, Calgary, AB, Canada, Feb. 15-17
-
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011. Innsbruck, Austria: ACTA Press, Calgary, AB, Canada, Feb. 15-17, 2011.
-
(2011)
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011
-
-
Engelmann, C.1
Böhm, S.2
-
35
-
-
70350469329
-
VolpexMPI: An MPI library for execution of parallel applications on volatile nodes
-
th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2009, Espoo, Finland: Springer Verlag, Berlin, Germany, Sep. 7-10, [Online]. Available
-
th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2009, vol. 5759. Espoo, Finland: Springer Verlag, Berlin, Germany, Sep. 7-10, 2009, pp. 124-133. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-03770-2-19
-
(2009)
Lecture Notes in Computer Science
, vol.5759
, pp. 124-133
-
-
LeBlanc, T.1
Anand, R.2
Gabriel, E.3
Subhlok, J.4
-
36
-
-
84877686778
-
Parallelizing heavyweight debugging tools with MPIecho
-
B. Rountree, G. Cobb, T. Gamblin, M. Schulz, B. de Supinski, and H. Tufo, "Parallelizing heavyweight debugging tools with MPIecho," in High-performance Infrastructure for Scalable Tools, WHIST 2011, Held as part of ICS '11, Tucson, Arizona, 2011, pp. 803-808.
-
High-performance Infrastructure for Scalable Tools, WHIST 2011, Held As Part of ICS '11, Tucson, Arizona, 2011
, pp. 803-808
-
-
Rountree, B.1
Cobb, G.2
Gamblin, T.3
Schulz, M.4
De Supinski, B.5
Tufo, H.6
-
37
-
-
84877704164
-
-
Dept. of Computer Science, University of Colorado at Boulder, Tech. Rep. CU-CS-1082-11, Jun.
-
G. Cobb, B. Rountree, H. Tufo, M. Schulz, T. Gamblin, and B. de Supinski, "MPIecho: A framework for transparent MPI task replication," Dept. of Computer Science, University of Colorado at Boulder, Tech. Rep. CU-CS-1082-11, Jun. 2011.
-
(2011)
MPIecho: A Framework for Transparent MPI Task Replication
-
-
Cobb, G.1
Rountree, B.2
Tufo, H.3
Schulz, M.4
Gamblin, T.5
De Supinski, B.6