-
4
-
-
0015631041
-
Arithmetic algorithms for error-coded operands
-
10022
-
Avizienis A. Arithmetic algorithms for error-coded operands. IEEE Transactions on Computers. 1973 ; C-22 (6). 567-572
-
(1973)
IEEE Transactions on Computers
, vol.C22
, Issue.6
, pp. 567-572
-
-
Avizienis, A.1
-
7
-
-
84900553917
-
-
(accessed 25 February 2014)
-
Bailey FR, Bell G, Blondin J. (2007) Petascale metrics panel report. Available at: http://research.microsoft.com/en-us/um/people/gbell/supers/ascac-petascale-metrics-panel-report-and-executive-summary-2007-02-12.pdf (accessed 25 February 2014)
-
(2007)
Petascale Metrics Panel Report
-
-
Bailey, F.R.1
Bell, G.2
Blondin, J.3
-
9
-
-
0022706330
-
Bounds on algorithm-based fault tolerance in multiple processor systems
-
10035 296-306
-
Banerjee P, Abraham J. Bounds on algorithm-based fault tolerance in multiple processor systems. IEEE Transactions on Computers. 1986 ; C-35 (4). 296-306
-
(1986)
IEEE Transactions on Computers
, vol.C35
, Issue.4
, pp. 296-306
-
-
Banerjee, P.1
Abraham, J.2
-
10
-
-
0025489006
-
Algorithm-based fault tolerance on a hypercube multiprocessor
-
Banerjee P, Rahmeh J, Stunkel C, et al. Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Transactions on Computers. 1990 ; 39 (9). 1132-1145
-
(1990)
IEEE Transactions on Computers
, vol.39
, Issue.9
, pp. 1132-1145
-
-
Banerjee, P.1
Rahmeh, J.2
Stunkel, C.3
-
11
-
-
83155160949
-
FTI: High performance fault tolerance interface for hybrid systems
-
Bautista-Gomez LA, Tsuboi S, Komatitsch D, et al. FTI: High performance fault tolerance interface for hybrid systems. International conference for high-performance computing, networking, storage and analysis (SC). 2011a ;:
-
(2011)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Bautista-Gomez, L.A.1
Tsuboi, S.2
Komatitsch, D.3
-
12
-
-
83155160949
-
FTI: High performance fault tolerance interface for hybrid systems
-
Bautista-Gomez L, Komatitsch D, Maruyama N, et al. FTI: High performance fault tolerance interface for hybrid systems. International conference for high-performance computing, networking, storage and analysis (SC). 2011b ;:
-
(2011)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Bautista-Gomez, L.1
Komatitsch, D.2
Maruyama, N.3
-
14
-
-
84867646266
-
-
Träff J, Benkner S, Dongarra J, eds. New York, NY: Springer
-
Bland W, Bouteiller A, Herault T, et al. Recent Advances in the Message Passing Interface. Träff J, Benkner S, Dongarra J, eds. New York, NY: Springer ; 2012: 193-203.
-
(2012)
Recent Advances in the Message Passing Interface
, pp. 193-203
-
-
Bland, W.1
Bouteiller, A.2
Herault, T.3
-
15
-
-
33846118079
-
Designing reliable systems from unreliable components: The challenges of transistor variability and degradation
-
Borkar S. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro. 2005 ; 25 (6). 10-16
-
(2005)
IEEE Micro
, vol.25
, Issue.6
, pp. 10-16
-
-
Borkar, S.1
-
17
-
-
80052306159
-
-
Jeannot E, Namyst R, Roman J, eds. New York, NY: Springer
-
Bouteiller A, Herault T, Bosilca G, et al. Euro-Par 2011: Parallel Processing Workshops. Jeannot E, Namyst R, Roman J, eds. New York, NY: Springer ; 2011: 51-64.
-
(2011)
Euro-Par 2011: Parallel Processing Workshops
, pp. 51-64
-
-
Bouteiller, A.1
Herault, T.2
Bosilca, G.3
-
27
-
-
83155160952
-
The IBM Blue Gene/Q interconnection network and message unit
-
Chen D, Eisley NA, Heidelberger P, et al. The IBM Blue Gene/Q interconnection network and message unit. International conference for high-performance computing, networking, storage and analysis (SC). 2011 ;:
-
(2011)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Chen, D.1
Eisley, N.A.2
Heidelberger, P.3
-
30
-
-
84877708941
-
Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems
-
Chung J, Lee I, Sullivan M, et al. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems. International conference for high-performance computing, networking, storage and analysis (SC). 2012 ;:
-
(2012)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Chung, J.1
Lee, I.2
Sullivan, M.3
-
33
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
Daly JT. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems. 2006 ; 22 (3). 303-312
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
34
-
-
37549003336
-
MapReduce: Simplified data processing on large clusters
-
Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM. 2008 ; 51 (1). 107-113
-
(2008)
Communications of the ACM
, vol.51
, Issue.1
, pp. 107-113
-
-
Dean, J.1
Ghemawat, S.2
-
35
-
-
77955737995
-
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
-
DARPA, VA
-
DeBardeleben N, Laros J, Daly J. (2010b) High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Technical Report LA-UR-10-00030, DARPA, VA. Available at: http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end (accessed 25 February 2014)
-
(2010)
Technical Report LA-UR-10-00030
-
-
DeBardeleben, N.1
Laros, J.2
Daly, J.3
-
38
-
-
78650016517
-
Trends from ten years of soft error experimentation
-
(accessed 25 February 2014)
-
Dixit A, Heald R, Wood A (2009) Trends from ten years of soft error experimentation. In: The workshop on silicon. Available at: http://softerrors.info/selse/images/selse-2009/Papers/selse5-submission-29.pdf (accessed 25 February 2014).
-
(2009)
The Workshop on Silicon
-
-
Dixit, A.1
Heald, R.2
Wood, A.3
-
42
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
Elnozahy ENM, Alvisi L, Wang YM, et al. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys. 2002 ; 34 (3). 375-408
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.M.3
-
43
-
-
84900548976
-
-
Elnozahy (editor) System Resilience at Extreme Scale White Paper (accessed 25 February 2014)
-
Elnozahy (editor) System Resilience at Extreme Scale White Paper. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.205.4240 (accessed 25 February 2014)
-
-
-
-
46
-
-
70349157325
-
-
(accessed 25 February 2014)
-
Fadden S (2012) An introduction to GPFS version 3.5. Available at: www-03.ibm.com/systems/jo/resources/introduction-to-gpfs-3-5.pdf (accessed 25 February 2014).
-
(2012)
An Introduction to GPFS Version 3.5
-
-
Fadden, S.1
-
49
-
-
83155188951
-
Evaluating the viability of process replication reliability for exascale systems
-
Ferreira KB, Stearley J, Laros JH, et al. Evaluating the viability of process replication reliability for exascale systems. International conference for high-performance computing, networking, storage and analysis (SC). 2011 ;:
-
(2011)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Ferreira, K.B.1
Stearley, J.2
Laros, J.H.3
-
50
-
-
84900531208
-
-
Constrained Optimization. New York, NY: John Wiley & Sons
-
Fletcher R. Constrained Optimization. New York, NY: John Wiley & Sons ; 1981 :
-
(1981)
-
-
Fletcher, R.1
-
54
-
-
84877693592
-
Fault prediction under the microscope: A closer look into HPC systems
-
Gainaru A, Cappello F, Snir M, et al. Fault prediction under the microscope: A closer look into HPC systems. International conference for high-performance computing, networking, storage and analysis (SC). 2012b ;:
-
(2012)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Gainaru, A.1
Cappello, F.2
Snir, M.3
-
56
-
-
79951947569
-
Modeling of retention failure behavior in bipolar oxide-based resistive switching memory
-
Gao B, Zhang H, Chen B, et al. Modeling of retention failure behavior in bipolar oxide-based resistive switching memory. IEEE Electron Device Letters. 2011 ; 32 (3). 276-278
-
(2011)
IEEE Electron Device Letters
, vol.32
, Issue.3
, pp. 276-278
-
-
Gao, B.1
Zhang, H.2
Chen, B.3
-
60
-
-
84900540703
-
-
Technical report, U.S. Department of Energy, DC
-
Geist A, Lucas B, Snir M, et al Technical report, U.S. Department of Energy, DC ; 2012 :
-
(2012)
-
-
Geist, A.1
Lucas, B.2
Snir, M.3
-
61
-
-
70449106113
-
Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node
-
Gill B, Seifert N, Zia V. Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node. IEEE international reliability physics symposium. 2009 ;: 199-205
-
(2009)
IEEE International Reliability Physics Symposium
, pp. 199-205
-
-
Gill, B.1
Seifert, N.2
Zia, V.3
-
64
-
-
33947495454
-
Fighting bugs: Remove, retry, replicate, and rejuvenate
-
Grottke M, Trivedi KS. Fighting bugs: Remove, retry, replicate, and rejuvenate. IEEE Computer. 2007 ; 40 (2). 107-109
-
(2007)
IEEE Computer
, vol.40
, Issue.2
, pp. 107-109
-
-
Grottke, M.1
Trivedi, K.S.2
-
76
-
-
0242443635
-
Measurements and analysis of SER tolerant latch in a 90 nm dual-Vt CMOS process
-
Hazucha P, Karnik T, Bloechel SWB, et al. Measurements and analysis of SER tolerant latch in a 90 nm dual-Vt CMOS process. IEEE custom integrated circuits conference. 2003 ;: 617-620
-
(2003)
IEEE Custom Integrated Circuits Conference
, pp. 617-620
-
-
Hazucha, P.1
Karnik, T.2
Bloechel, S.W.B.3
-
78
-
-
83155160934
-
Modeling and tolerating heterogeneous failures in large parallel systems
-
Heien E, Kondo D, Gainaru A, et al. Modeling and tolerating heterogeneous failures in large parallel systems. International conference for high-performance computing, networking, storage and analysis (SC). 2011 ;:
-
(2011)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Heien, E.1
Kondo, D.2
Gainaru, A.3
-
82
-
-
0021439162
-
Algorithm-based fault tolerance for matrix operations
-
Huang KH, Abraham J. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers. 1984 ; C-33 (6). 518-528
-
(1984)
IEEE Transactions on Computers
, vol.C33
, Issue.6
, pp. 518-528
-
-
Huang, K.H.1
Abraham, J.2
-
85
-
-
77954030094
-
Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule
-
Ibe E, Taniguchi H, Yahagi Y, et al. Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule. IEEE Transactions on Electron Devices. 2010 ; 57 (7). 1527-1538
-
(2010)
IEEE Transactions on Electron Devices
, vol.57
, Issue.7
, pp. 1527-1538
-
-
Ibe, E.1
Taniguchi, H.2
Yahagi, Y.3
-
86
-
-
0037253011
-
NASA advances robotic space exploration
-
Katz D, Some R. NASA advances robotic space exploration. Computer. 2003 ; 36 (1). 52-61
-
(2003)
Computer
, vol.36
, Issue.1
, pp. 52-61
-
-
Katz, D.1
Some, R.2
-
89
-
-
0026307546
-
Estimates of rounding errors with fast automatic differentiation and interval analysis
-
Kubota K, Iri M. Estimates of rounding errors with fast automatic differentiation and interval analysis. Journal of Information Processing. 1992 ; 14 (3). 508-515
-
(1992)
Journal of Information Processing
, vol.14
, Issue.3
, pp. 508-515
-
-
Kubota, K.1
Iri, M.2
-
92
-
-
83155193250
-
Large scale debugging of parallel tasks with AutomaDeD
-
Laguna I, Gamblin T, de Supinski BR, et al. Large scale debugging of parallel tasks with AutomaDeD. International conference for high-performance computing, networking, storage and analysis (SC). 2011 ;:
-
(2011)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Laguna, I.1
Gamblin, T.2
De Supinski, B.R.3
-
95
-
-
70350776678
-
Lessons learned at 208K: Towards debugging millions of cores
-
Lee GL, Ahn DH, Arnold DC, et al. Lessons learned at 208K: Towards debugging millions of cores. International conference for high-performance computing, networking, storage and analysis (SC). 2008 ;:
-
(2008)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Lee, G.L.1
Ahn, D.H.2
Arnold, D.C.3
-
99
-
-
0037319402
-
Decomposition algorithms for stochastic programming on a computational grid
-
Linderoth J, Wright S. Decomposition algorithms for stochastic programming on a computational grid. Computational Optimization and Applications. 2003 ; 24 (2). 207-250
-
(2003)
Computational Optimization and Applications
, vol.24
, Issue.2
, pp. 207-250
-
-
Linderoth, J.1
Wright, S.2
-
100
-
-
0028416906
-
Reliable floating-point arithmetic algorithms for error-coded operands
-
Lo JC. Reliable floating-point arithmetic algorithms for error-coded operands. IEEE Transactions on Computers. 1994 ; 43 (4). 400-412
-
(1994)
IEEE Transactions on Computers
, vol.43
, Issue.4
, pp. 400-412
-
-
Lo, J.C.1
-
110
-
-
84900526494
-
-
January (accessed 25 February 2014)
-
Mitchell R (1977) The Underground Grammarian, Vol. 1, No. 1, January. Available at: http://www.sourcetext.com/grammarian/ (accessed 25 February 2014).
-
(1977)
The Underground Grammarian
, vol.1
-
-
Mitchell, R.1
-
113
-
-
78650831692
-
Design, modeling, and evaluation of a scalable multi-level checkpointing system
-
Moody A, Bronevetsky G, Mohror K, et al. Design, modeling, and evaluation of a scalable multi-level checkpointing system. International conference for high-performance computing, networking, storage and analysis (SC). 2010 ;:
-
(2010)
International Conference for High-performance Computing, Networking, Storage and Analysis (SC)
-
-
Moody, A.1
Bronevetsky, G.2
Mohror, K.3
-
115
-
-
84900530512
-
-
MPIPlugIn (accessed 25 February 2014)
-
MPIPlugIn (2013) MPI plugin for KDevelop. Available at: http://sourceforge.net/projects/mpiplugin/ (accessed 25 February 2014).
-
(2013)
MPI Plugin for KDevelop
-
-
-
119
-
-
84900557827
-
-
NCAR (accessed 25 February 2014)
-
NCAR (2014) Community earth system model. Available at: http://www2.cesm.ucar.edu/ (accessed 25 February 2014).
-
(2014)
Community Earth System Model
-
-
-
120
-
-
76649113170
-
-
Network Working Group (accessed 25 February 2014)
-
Network Working Group (2009) The syslog protocol. Available at: http://tools.ietf.org/html/rfc5424 (accessed 25 February 2014).
-
(2009)
The Syslog Protocol
-
-
-
126
-
-
34547396006
-
Dynamic derivation of application-specific error detectors and their implementation in hardware
-
Pattabiraman K, Saggese GP, Chen D, et al. Dynamic derivation of application-specific error detectors and their implementation in hardware. European dependable computing conference. 2006 ;: 97-108
-
(2006)
European Dependable Computing Conference
, pp. 97-108
-
-
Pattabiraman, K.1
Saggese, G.P.2
Chen, D.3
-
130
-
-
84900549431
-
The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail - And didn't
-
Randall A V. The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail - and didn't. Computerworld. 2006 ; 40 (8). 18
-
(2006)
Computerworld
, vol.40
, Issue.8
, pp. 18
-
-
Randall, A.V.1
-
132
-
-
10044267465
-
Impact of negative bias temperature instability on digital circuit reliability
-
Reddy V, Krishnan A, Marshall A, et al. Impact of negative bias temperature instability on digital circuit reliability. Microelectronics Reliability. 2005 ; 45 (1). 31-38
-
(2005)
Microelectronics Reliability
, vol.45
, Issue.1
, pp. 31-38
-
-
Reddy, V.1
Krishnan, A.2
Marshall, A.3
-
135
-
-
84900529904
-
-
Rogue Wave Software (accessed 25 February 2014)
-
Rogue Wave Software (2013) TotalView Debugger. Available at: http://www.roguewave.com/products/totalview.aspx (accessed 25 February 2014).
-
(2013)
TotalView Debugger
-
-
-
136
-
-
80052380100
-
-
Jeannot E, Namyst R, Roman J, eds. New York, NY: Springer
-
Ropars T, Guermouche A, Uçar B, et al. Euro-Par 2011: Parallel Processing Workshops 17th International Euro-Par Conference. Jeannot E, Namyst R, Roman J, eds. New York, NY: Springer ; 2011: 567-578.
-
(2011)
Euro-Par 2011: Parallel Processing Workshops 17th International Euro-Par Conference
, pp. 567-578
-
-
Ropars, T.1
Guermouche, A.2
Uçar, B.3
-
138
-
-
0004320012
-
Algorithm-based error-detection schemes for iterative solution of partial differential equations
-
Roy-Chowdhury A, Bellas N, Banerjee P. Algorithm-based error-detection schemes for iterative solution of partial differential equations. IEEE Transactions on Computers. 1996 ; 45 (4). 394-407
-
(1996)
IEEE Transactions on Computers
, vol.45
, Issue.4
, pp. 394-407
-
-
Roy-Chowdhury, A.1
Bellas, N.2
Banerjee, P.3
-
145
-
-
24344500868
-
-
Seltborg P, Polanski A, Petrochenkov S, et al. Radiation shielding of high-energy neutrons in SAD. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment. 2005 ; 550 (1). 313-328
-
(2005)
Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment
, vol.550
, Issue.1
, pp. 313-328
-
-
Seltborg, P.1
Polanski, A.2
Petrochenkov, S.3
-
148
-
-
0032667728
-
IBM's S/390 G5 microprocessor design
-
Slegel TJ, Averill RM, Check MA, et al. IBM's S/390 G5 microprocessor design. IEEE Micro. 1999 ; 19 (2). 12-23
-
(1999)
IEEE Micro
, vol.19
, Issue.2
, pp. 12-23
-
-
Slegel, T.J.1
Averill, R.M.2
Check, M.A.3
-
151
-
-
0033314330
-
IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective
-
Spainhower L, Gregg T. IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development. 1999 ; 43 (5-6). 863-873
-
(1999)
IBM Journal of Research and Development
, vol.43
, Issue.5-6
, pp. 863-873
-
-
Spainhower, L.1
Gregg, T.2
-
153
-
-
36049028957
-
Defining and measuring supercomputer reliability, availability, and serviceability (RAS)
-
Stearley J. Defining and measuring supercomputer reliability, availability, and serviceability (RAS). Proceedings of the Linux clusters institute conference. 2005 ;:
-
(2005)
Proceedings of the Linux Clusters Institute Conference
-
-
Stearley, J.1
-
157
-
-
0012842250
-
Tests and tolerances for high-performance software-implemented fault detection
-
Turmon M, Granat R, Katz D, et al. Tests and tolerances for high-performance software-implemented fault detection. IEEE Transactions on Computers. 2003 ; 52 (5). 579-591
-
(2003)
IEEE Transactions on Computers
, vol.52
, Issue.5
, pp. 579-591
-
-
Turmon, M.1
Granat, R.2
Katz, D.3
-
158
-
-
33847095845
-
Towards achieving relentless reliability gains in a server marketplace of teraflops, laptops, kilowatts, and "cost, cost, cost"...: Making peace between a black art and the bottom line
-
Van Horn J. Towards achieving relentless reliability gains in a server marketplace of teraflops, laptops, kilowatts, and "cost, cost, cost"...: Making peace between a black art and the bottom line. Proceedings of the IEEE international test conference (ITC). 2005 ;: 8
-
(2005)
Proceedings of the IEEE International Test Conference (ITC)
, pp. 8
-
-
Van Horn, J.1
-
160
-
-
84900527903
-
-
New York: The Macmillan Company
-
Wittgenstein L New York: The Macmillan Company ; 1953 :
-
(1953)
-
-
Wittgenstein, L.1
-
161
-
-
78650349637
-
High switching endurance in TaOx memristive devices
-
Yang J, Zhang M, Strachan J, et al. High switching endurance in TaOx memristive devices. Applied Physics Letters. 2010 ; 97 (23). 232102
-
(2010)
Applied Physics Letters
, vol.97
, Issue.23
, pp. 232102
-
-
Yang, J.1
Zhang, M.2
Strachan, J.3
-
162
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
Young JW. A first order approximation to the optimum checkpoint interval. Communications of the ACM. 1974 ; 17 (9). 530-531
-
(1974)
Communications of the ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1
-
164
-
-
84856466439
-
A Monte Carlo study of the low resistance state retention of HfOx based resistive switching memory
-
Yu S, Yin Chen Y, Guan X, et al. A Monte Carlo study of the low resistance state retention of HfOx based resistive switching memory. Applied Physics Letters. 2012 ; 100 (4). 043507
-
(2012)
Applied Physics Letters
, vol.100
, Issue.4
, pp. 043507
-
-
Yu, S.1
Yin Chen, Y.2
Guan, X.3
-
168
-
-
77649192707
-
A data-driven approach for predicting failure scenarios in nuclear systems
-
Zio E, Maio FD, Stasi M. A data-driven approach for predicting failure scenarios in nuclear systems. Annals of Nuclear Energy. 2010 ; 37: 482-491
-
(2010)
Annals of Nuclear Energy
, vol.37
, pp. 482-491
-
-
Zio, E.1
Maio, F.D.2
Stasi, M.3