



Volume 28, Issue 2, 2014, Pages 129-173

Addressing failures in exascale computing

(28)  Snir, Marc a   Wisniewski, Robert W b   Abraham, Jacob A c   Adve, Sarita V d   Bagchi, Saurabh e   Balaji, Pavan a   Belak, Jim f   Bose, Pradip g   Cappello, Franck a   Carlson, Bill h   Chien, Andrew A i   Coteus, Paul g   Debardeleben, Nathan A j   Diniz, Pedro C k   Engelmann, Christian l   Erez, Mattan c   Fazzari, Saverio m   Geist, Al l   Gupta, Rinku a   Johnson, Fred n   more..


Author keywords

exascale; extreme scale computing; fault tolerance; high performance computing; resilience

Indexed keywords

APPLICATION PROGRAMS; FAULT TOLERANCE;

EID: 84900560822     PISSN: 10943420     EISSN: 17412846     Source Type: Journal    
DOI: 10.1177/1094342014522573     Document Type: Article
Times cited : (323)

References (168)
  • 4
    • 0015631041
    • Arithmetic algorithms for error-coded operands
    • 10022
    • Avizienis A. Arithmetic algorithms for error-coded operands. IEEE Transactions on Computers. 1973 ; C-22 (6). 567-572
    • (1973) IEEE Transactions on Computers , Issue.6 , pp. 567-572
    • Avizienis, A.1
  • 7
    • 84900553917
    • (accessed 25 February 2014)
    • Bailey FR, Bell G, Blondin J. (2007) Petascale metrics panel report. Available at: http://research.microsoft.com/en-us/um/people/gbell/supers/ascac-petascale-metrics-panel-report-and-executive-summary-2007-02-12.pdf (accessed 25 February 2014)
    • (2007) Petascale Metrics Panel Report
    • Bailey, F.R.1    Bell, G.2    Blondin, J.3
  • 9
    • 0022706330
    • Bounds on algorithm-based fault tolerance in multiple processor systems
    • 10035 296-306
    • Banerjee P, Abraham J. Bounds on algorithm-based fault tolerance in multiple processor systems. IEEE Transactions on Computers. 1986 ; C-35 (4). 296-306
    • (1986) IEEE Transactions on Computers , vol.C-35 , Issue.4 , pp. 296-306
    • Banerjee, P.1    Abraham, J.2
  • 10
    • 0025489006
    • Algorithm-based fault tolerance on a hypercube multiprocessor
    • Banerjee P, Rahmeh J, Stunkel C, et al. Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Transactions on Computers. 1990 ; 39 (9). 1132-1145
    • (1990) IEEE Transactions on Computers , vol.39 , Issue.9 , pp. 1132-1145
    • Banerjee, P.1    Rahmeh, J.2    Stunkel, C.3
  • 15
    • 33846118079
    • Designing reliable systems from unreliable components: The challenges of transistor variability and degradation
    • Borkar S. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro. 2005 ; 25 (6). 10-16
    • (2005) IEEE Micro , vol.25 , Issue.6 , pp. 10-16
    • Borkar, S.1
  • 31
  • 33
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • Daly JT. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems. 2006 ; 22 (3). 303-312
    • (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
    • Daly, J.T.1
  • 34
    • 37549003336
    • MapReduce: Simplified data processing on large clusters
    • Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM. 2008 ; 51 (1). 107-113
    • (2008) Communications of the ACM , vol.51 , Issue.1 , pp. 107-113
    • Dean, J.1    Ghemawat, S.2
  • 35
    • 77955737995
    • High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
    • DARPA, VA
    • DeBardeleben N, Laros J, Daly J. (2010b) High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Technical Report LA-UR-10-00030, DARPA, VA. available at http://www.csm.ornl.gov/∼engelman/publications/debardeleben09high-end 2/25/14
    • (2010) Technical Report LA-UR-10-00030
    • De Bardeleben, N.1    Laros, J.2    Daly, J.3
  • 38
    • 78650016517
    • Trends from ten years of soft error experimentation
    • (accessed 25 February 2014)
    • Dixit A, Heald R, Wood A (2009) Trends from ten years of soft error experimentation. In: The workshop on silicon Available at: http://softerrors.info/selse/images/selse-2009/Papers/selse5-submission-29.pdf (accessed 25 February 2014).
    • (2009) The Workshop on Silicon
    • Dixit, A.1    Heald, R.2    Wood, A.3
  • 42
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • Elnozahy ENM, Alvisi L, Wang YM, et al. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys. 2002 ; 34 (3). 375-408
    • (2002) ACM Computing Surveys , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.M.1    Alvisi, L.2    Wang, Y.M.3
  • 43
    • 84900548976
    • Elnozahy (editor) System Resilience at Extreme Scale White Paper accessed 2/25/14
    • Elnozahy (editor) System Resilience at Extreme Scale White Paper available at http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.205.4240 accessed 2/25/14
  • 45
    • 35348872682
    • The Daikon system for dynamic detection of likely invariants
    • Ernst MD, Perkins JH, Guo PJ, et al. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming. 2007 ; 69 (1). 35-45
    • (2007) Science of Computer Programming , vol.69 , Issue.1 , pp. 35-45
    • Ernst, M.D.1    Perkins, J.H.2    Guo, P.J.3
  • 46
    • 70349157325
    • (accessed 25 February 2014)
    • Fadden S (2012) An introduction to GPFS version 3.5. Available at: www-03.ibm.com/systems/jo/resources/introduction-to-gpfs-3-5.pdf (accessed 25 February 2014).
    • (2012) An Introduction to GPFS Version 3.5
    • Fadden, S.1
  • 50
    • 84900531208
    • Constrained Optimization New York, NY John Wiley & Sons
    • Fletcher R Constrained Optimization New York, NY John Wiley & Sons ; 1981 :
    • (1981)
    • Fletcher, R.1
  • 56
    • 79951947569
    • Modeling of retention failure behavior in bipolar oxide-based resistive switching memory
    • Gao B, Zhang H, Chen B, et al. Modeling of retention failure behavior in bipolar oxide-based resistive switching memory. IEEE Electron Device Letters. 2011 ; 32 (3). 276-278
    • (2011) IEEE Electron Device Letters , vol.32 , Issue.3 , pp. 276-278
    • Gao, B.1    Zhang, H.2    Chen, B.3
  • 60
    • 84900540703
    • Technical report, U.S. Department of Energy, DC
    • Geist A, Lucas B, Snir M, et al Technical report, U.S. Department of Energy, DC ; 2012 :
    • (2012)
    • Geist, A.1    Lucas, B.2    Snir, M.3
  • 61
    • 70449106113
    • Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node
    • Gill B, Seifert N, Zia V. Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node. IEEE international reliability physics symposium. 2009 ;: 199-205
    • (2009) IEEE International Reliability Physics Symposium , pp. 199-205
    • Gill, B.1    Seifert, N.2    Zia, V.3
  • 64
    • 33947495454
    • Fighting bugs: Remove, retry, replicate, and rejuvenate
    • Grottke M, Trivedi KS. Fighting bugs: Remove, retry, replicate, and rejuvenate. IEEE Computer. 2007 ; 40 (2). 107-109
    • (2007) IEEE Computer , vol.40 , Issue.2 , pp. 107-109
    • Grottke, M.1    Trivedi, K.S.2
  • 76
    • 0242443635
    • Measurements and analysis of SER tolerant latch in a 90 nm dual-Vt CMOS process
    • Hazucha P, Karnik T, Bloechel SWB, et al. Measurements and analysis of SER tolerant latch in a 90 nm dual-Vt CMOS process. IEEE custom integrated circuits conference. 2003 ;: 617-620
    • (2003) IEEE Custom Integrated Circuits Conference , pp. 617-620
    • Hazucha, P.1    Karnik, T.2    Bloechel, S.W.B.3
  • 82
    • 0021439162
    • Algorithm-based fault tolerance for matrix operations
    • Huang KH, Abraham J. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers. 1984 ; C-33 (6). 518-528
    • (1984) IEEE Transactions on Computers , vol.C-33 , Issue.6 , pp. 518-528
    • Huang, K.H.1    Abraham, J.2
  • 85
    • 77954030094
    • Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule
    • Ibe E, Taniguchi H, Yahagi Y, et al. Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule. IEEE Transactions on Electron Devices. 2010 ; 57 (7). 1527-1538
    • (2010) IEEE Transactions on Electron Devices , vol.57 , Issue.7 , pp. 1527-1538
    • Ibe, E.1    Taniguchi, H.2    Yahagi, Y.3
  • 86
    • 0037253011
    • NASA advances robotic space exploration
    • Katz D, Some R. NASA advances robotic space exploration. Computer. 2003 ; 36 (1). 52-61
    • (2003) Computer , vol.36 , Issue.1 , pp. 52-61
    • Katz, D.1    Some, R.2
  • 89
    • 0026307546
    • Estimates of rounding errors with fast automatic differentiation and interval analysis
    • Kubota K, Iri M. Estimates of rounding errors with fast automatic differentiation and interval analysis. Journal of Information Processing. 1992 ; 14 (3). 508-515
    • (1992) Journal of Information Processing , vol.14 , Issue.3 , pp. 508-515
    • Kubota, K.1    Iri, M.2
  • 99
    • 0037319402
    • Decomposition algorithms for stochastic programming on a computational grid
    • Linderoth J, Wright S. Decomposition algorithms for stochastic programming on a computational grid. Computational Optimization and Applications. 2003 ; 24 (2). 207-250
    • (2003) Computational Optimization and Applications , vol.24 , Issue.2 , pp. 207-250
    • Linderoth, J.1    Wright, S.2
  • 100
    • 0028416906
    • Reliable floating-point arithmetic algorithms for error-coded operands
    • Lo JC. Reliable floating-point arithmetic algorithms for error-coded operands. IEEE Transactions on Computers. 1994 ; 43 (4). 400-412
    • (1994) IEEE Transactions on Computers , vol.43 , Issue.4 , pp. 400-412
    • Lo, J.C.1
  • 110
    • 84900526494
    • January (accessed 25 February 2014)
    • Mitchell R (1977) The Underground Grammarian, Vol., No. 1, January. Available at http://www.sourcetext.com/grammarian/ (accessed 25 February 2014).
    • (1977) The Underground Grammarian , vol.1
    • Mitchell, R.1
  • 115
    • 84900530512
    • MPIPlugIn (accessed 25 February 2014)
    • MPIPlugIn (2013) MPI plugin for KDevelop. Available at: http://sourceforge.net/projects/mpiplugin/ (accessed 25 February 2014).
    • (2013) MPI Plugin for KDevelop
  • 119
    • 84900557827
    • NCAR (accessed 25 February 2014)
    • NCAR (2014) Community earth system model. Available at: http://www2.cesm.ucar.edu/ (accessed 25 February 2014).
    • (2014) Community Earth System Model
  • 120
    • 76649113170
    • Network Working Group (accessed 25 February 2014)
    • Network Working Group (2009) The syslog protocol. Available at: http://tools.ietf.org/html/rfc5424 (accessed 25 February 2014).
    • (2009) The Syslog Protocol
  • 122
    • 31044449725
    • Accident prediction model for railway-highway interfaces
    • Oh J, Washington SP, Nam D. Accident prediction model for railway-highway interfaces. Accident Analysis and Prevention. 2006 ; 38 (2). 346-356
    • (2006) Accident Analysis and Prevention , vol.38 , Issue.2 , pp. 346-356
    • Oh, J.1    Washington, S.P.2    Nam, D.3
  • 126
    • 34547396006
    • Dynamic derivation of application-specific error detectors and their implementation in hardware
    • Pattabiraman K, Saggese GP, Chen D, et al. Dynamic derivation of application-specific error detectors and their implementation in hardware. European dependable computing conference. 2006 ;: 97-108
    • (2006) European Dependable Computing Conference , pp. 97-108
    • Pattabiraman, K.1    Saggese, G.P.2    Chen, D.3
  • 130
    • 84900549431
    • The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail - And didn't
    • Randall A V. The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail - and didn't. Computerworld. 2006 ; 40 (8). 18
    • (2006) Computerworld , vol.40 , Issue.8 , pp. 18
    • Randall, A.V.1
  • 132
    • 10044267465
    • Impact of negative bias temperature instability on digital circuit reliability
    • Reddy V, Krishnan A, Marshall A, et al. Impact of negative bias temperature instability on digital circuit reliability. Microelectronics Reliability. 2005 ; 45 (1). 31-38
    • (2005) Microelectronics Reliability , vol.45 , Issue.1 , pp. 31-38
    • Reddy, V.1    Krishnan, A.2    Marshall, A.3
  • 135
    • 84900529904
    • Rogue Wave Software (accessed 25 February 2014)
    • Rogue Wave Software (2013) TotalView Debugger. Available at: http://www.roguewave.com/products/totalview.aspx (accessed 25 February 2014).
    • (2013) TotalView Debugger
  • 138
    • 0004320012
    • Algorithm-based error-detection schemes for iterative solution of partial differential equations
    • Roy-Chowdhury A, Bellas N, Banerjee P. Algorithm-based error-detection schemes for iterative solution of partial differential equations. IEEE Transactions on Computers. 1996 ; 45 (4). 394-407
    • (1996) IEEE Transactions on Computers , vol.45 , Issue.4 , pp. 394-407
    • Roy-Chowdhury, A.1    Bellas, N.2    Banerjee, P.3
  • 140
    • 77950267881
    • A survey of online failure prediction methods
    • Salfner F, Lenk M, Malek M. A survey of online failure prediction methods. ACM Computing Surveys. 2010 ; 42: 1-42
    • (2010) ACM Computing Surveys , vol.42 , pp. 1-42
    • Salfner, F.1    Lenk, M.2    Malek, M.3
  • 148
    • 0032667728
    • IBM's S/390 G5 microprocessor design
    • Slegel TJ, Averill RM, Check MA, et al. IBM's S/390 G5 microprocessor design. IEEE Micro. 1999 ; 19 (2). 12-23
    • (1999) IEEE Micro , vol.19 , Issue.2 , pp. 12-23
    • Slegel, T.J.1    Averill, R.M.2    Check, M.A.3
  • 151
    • 0033314330
    • IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective
    • Spainhower L, Gregg T. IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development. 1999 ; 43 (5-6). 863-873
    • (1999) IBM Journal of Research and Development , vol.43 , Issue.5-6 , pp. 863-873
    • Spainhower, L.1    Gregg, T.2
  • 153
    • 36049028957
    • Defining and measuring supercomputer reliability, availability, and serviceability (RAS)
    • Stearley J. Defining and measuring supercomputer reliability, availability, and serviceability (RAS). Proceedings of the Linux clusters institute conference. 2005 ;:
    • (2005) Proceedings of the Linux Clusters Institute Conference
    • Stearley, J.1
  • 157
    • 0012842250
    • Tests and tolerances for high-performance software-implemented fault detection
    • Turmon M, Granat R, Katz D, et al. Tests and tolerances for high-performance software-implemented fault detection. IEEE Transactions on Computers. 2003 ; 52 (5). 579-591
    • (2003) IEEE Transactions on Computers , vol.52 , Issue.5 , pp. 579-591
    • Turmon, M.1    Granat, R.2    Katz, D.3
  • 158
    • 33847095845
    • Towards achieving relentless reliability gains in a server marketplace of teraflops, laptops, kilowatts, and ''cost, cost, cost''... : Making peace between a black art and the bottom line
    • Van Horn J. Towards achieving relentless reliability gains in a server marketplace of teraflops, laptops, kilowatts, and ''cost, cost, cost''... : Making peace between a black art and the bottom line. Proceedings of the IEEE international test conference (ITC). 2005 ;: 8
    • (2005) Proceedings of the IEEE International Test Conference (ITC) , pp. 8
    • Van Horn, J.1
  • 160
    • 84900527903
    • New York: The Macmillan Company
    • Wittgenstein L New York: The Macmillan Company ; 1953 :
    • (1953)
    • Wittgenstein, L.1
  • 161
    • 78650349637
    • High switching endurance in TaOx memristive devices
    • Yang J, Zhang M, Strachan J, et al. High switching endurance in TaOx memristive devices. Applied Physics Letters. 2010 ; 97 (23). 232102
    • (2010) Applied Physics Letters , vol.97 , Issue.23 , pp. 232102
    • Yang, J.1    Zhang, M.2    Strachan, J.3
  • 162
    • 84976846528
    • A first order approximation to the optimum checkpoint interval
    • Young JW. A first order approximation to the optimum checkpoint interval. Communications of the ACM. 1974 ; 17 (9). 530-531
    • (1974) Communications of the ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.W.1
  • 164
    • 84856466439
    • A Monte Carlo study of the low resistance state retention of HfOx based resistive switching memory
    • Yu S, Yin Chen Y, Guan X, et al. A Monte Carlo study of the low resistance state retention of HfOx based resistive switching memory. Applied Physics Letters. 2012 ; 100 (4). 043507
    • (2012) Applied Physics Letters , vol.100 , Issue.4 , pp. 043507
    • Yu, S.1    Yin Chen, Y.2    Guan, X.3
  • 168
    • 77649192707
    • A data-driven approach for predicting failure scenarios in nuclear systems
    • Zio E, Maio FD, Stasi M. A data-driven approach for predicting failure scenarios in nuclear systems. Annals of Nuclear Energy. 2010 ; 37: 482-491
    • (2010) Annals of Nuclear Energy , vol.37 , pp. 482-491
    • Zio, E.1    Maio, F.D.2    Stasi, M.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.