SCOPUS information retrieval platform

Supercomputing Frontiers and Innovations

Volume 1, Issue 1, 2014, Pages 4-27

Toward exascale resilience: 2014 update

(6) Cappello, Franck a,b; Geist, Al c; Gropp, William b; Kale, Sanjay b; Kramer, Bill b; Snir, Marc a,b

a ARGONNE NATIONAL LABORATORY (United States)

b UNIVERSITY OF ILLINOIS AT URBANA CHAMPAIGN (United States)

c OAK RIDGE NATIONAL LABORATORY (United States)

Author keywords

Exascale; Fault tolerance techniques; Resilience

Indexed keywords

EXASCALE; FAULT TOLERANCE TECHNIQUES; PREDICT ERRORS; RESEARCH PROBLEMS; RESILIENCE; TECHNICAL PROGRESS; TECHNOLOGY EVOLUTION; UNSTABLE SYSTEM;

FAULT TOLERANCE;

EID: 85018017476 PISSN: 24096008 EISSN: 23138734 Source Type: Journal
DOI: 10.14529/jsfi140101 Document Type: Article

Times cited : (266)

References (102)

1
- 85033564681
- The Blue Waters super system for super science
- Jeffrey S. Vetter, editor, Chapman and Hall/CRC
- The Blue Waters super system for super science. Contemporary High Performance Computing From Petascale toward Exascale, Jeffrey S. Vetter, editor, Chapman and Hall/CRC, pages 339-366, ISBN: 978-1-4665-6834-1, 2013
- (2013) Contemporary High Performance Computing From Petascale toward Exascale , pp. 339-366

2
- 85038367620
- Optimal checkpointing period: Time vs. energy
- abs/1310.8456
- Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, and Jack Dongarra. Optimal checkpointing period: Time vs. energy. CoRR, abs/1310.8456, 2013
- (2013) CoRR
- Aupy, G.¹ Benoit, A.² Hérault, T.³ Robert, Y.⁴ Dongarra, J.⁵

3
- 12344308304
- Basic concepts and taxonomy of dependable and secure computing
- A. Avizienis, J.C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, 2004
- (2004) IEEE Transactions on Dependable and Secure Computing , vol.1 , Issue.1 , pp. 11-33
- Avizienis, A.¹ Laprie, J.C.² Randell, B.³ Landwehr, C.⁴

4
- 83155160949
- FTI: high performance fault tolerance interface for hybrid systems
- Networking, Storage and Analysis (SC11). ACM
- L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka. FTI: high performance fault tolerance interface for hybrid systems. In Proc. 2011 Int. Conf. High Performance Computing, Networking, Storage and Analysis (SC11). ACM, 2011
- (2011) In Proc. 2011 Int. Conf. High Performance Computing
- Bautista-Gomez, L.¹ Tsuboi, S.² Komatitsch, D.³ Cappello, F.⁴ Maruyama, N.⁵ Matsuoka, S.⁶

5
- 84968850677
- Silent error detection in numerical time-stepping schemes
- April
- Austin R Benson, Sven Schmit, and Robert Schreiber. Silent error detection in numerical time-stepping schemes. International Journal of High Performance Computing Applications, April, 2014
- (2014) International Journal of High Performance Computing Applications
- Benson, A.R.¹ Schmit, S.² Schreiber, R.³

6
- 80053259207
- Exploiting data similarity to reduce memory footprints
- Susmit Biswas, Bronis R. de Supinski, Martin Schulz, Diana Franklin, Timothy Sherwood, and Frederic T. Chong. Exploiting data similarity to reduce memory footprints. In Proceedings of IEEE IPDPS, pages 152-163, 2011
- (2011) Proceedings of IEEE IPDPS , pp. 152-163
- Biswas, S.¹ de Supinski, B.R.² Schulz, M.³ Franklin, D.⁴ Sherwood, T.⁵ Chong, F.T.⁶

7
- 85038393026
- Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI
- July
- W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Concurrency and Computation: Practice and Experience, special issue: Euro-Par 2012, July 2013
- (2013) Concurrency and Computation: Practice and Experience, special issue: Euro-Par 2012
- Bland, W.¹ Du, P.² Bouteiller, A.³ Herault, T.⁴ Bosilca, G.⁵ Dongarra, J.⁶

8
- 84874409590
- User level failure mitigation in MPI
- Springer
- Wesley Bland. User level failure mitigation in MPI. In Euro-Par 2012: Parallel Processing Workshops, pages 499-504. Springer, 2013
- (2013) In Euro-Par 2012: Parallel Processing Workshops , pp. 499-504
- Bland, W.¹

9
- 84918839033
- Unified model for assessing checkpointing protocols at extreme-scale
- November
- G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni. Unified model for assessing checkpointing protocols at extreme-scale. Concurrency and Computation: Practice and Experience, November 2013
- (2013) Concurrency and Computation: Practice and Experience
- Bosilca, G.¹ Bouteiller, A.² Brunet, E.³ Cappello, F.⁴ Dongarra, J.⁵ Guermouche, A.⁶ Herault, T.⁷ Robert, Y.⁸ Vivien, F.⁹ Zaidouni, D.¹⁰

10
- 84884837861
- Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing
- Mohamed-Slim Bouguerra, Ana Gainaru, Leonardo Arturo Bautista-Gomez, Franck Cappello, Satoshi Matsuoka, and Naoya Maruyama. Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing. In Proceedings of IEEE IPDPS, pages 501-512, 2013
- (2013) In Proceedings of IEEE IPDPS , pp. 501-512
- Bouguerra, M.-S.¹ Gainaru, A.² Bautista-Gomez, L.A.³ Cappello, F.⁴ Matsuoka, S.⁵ Maruyama, N.⁶

11
- 84874118584
- Correlated set coordination in fault tolerant message logging protocols
- A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Correlated set coordination in fault tolerant message logging protocols. Concurrency and Computation: Practice and Experience, 25(4):572-585, 2013
- (2013) Concurrency and Computation: Practice and Experience , vol.25 , Issue.4 , pp. 572-585
- Bouteiller, A.¹ Herault, T.² Bosilca, G.³ Dongarra, J.⁴

12
- 84883201136
- Multi-criteria checkpointing strategies: Response-time versus resource utilization
- Springer Berlin Heidelberg
- Aurelien Bouteiller, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Hérault, and Yves Robert. Multi-criteria checkpointing strategies: Response-time versus resource utilization. In Felix Wolf, Bernd Mohr, and Dieter Mey, editors, Euro-Par 2013 Parallel Processing, volume 8097 of Lecture Notes in Computer Science, pages 420-431. Springer Berlin Heidelberg, 2013
- (2013) In Felix Wolf, Bernd Mohr, and Dieter Mey, editors, Euro-Par 2013 Parallel Processing, volume 8097 of Lecture Notes in Computer Science , pp. 420-431
- Bouteiller, A.¹ Cappello, F.² Dongarra, J.³ Guermouche, A.⁴ Hérault, T.⁵ Robert, Y.⁶

13
- 84906672289
- Fault-tolerant linear solvers via selective reliability
- June
- P. G. Bridges, K. B. Ferreira, M. A. Heroux, and M. Hoemmen. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints, June 2012
- (2012) ArXiv e-prints
- Bridges, P.G.¹ Ferreira, K.B.² Heroux, M.A.³ Hoemmen, M.⁴

14
- 84863961922
- Cooperative application/OS DRAM fault recovery
- Michael Alexander, Pasqua D'Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, and Josef Weidendorfer, editors, Euro-Par 2011: Parallel Processing Workshops, volume 7156 of Lecture Notes in Computer Science. Springer Berlin Heidelberg
- Patrick G. Bridges, Mark Hoemmen, Kurt B. Ferreira, Michael A. Heroux, Philip Soltero, and Ron Brightwell. Cooperative application/OS DRAM fault recovery. In Michael Alexander, Pasqua D'Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, and Josef Weidendorfer, editors, Euro-Par 2011: Parallel Processing Workshops, volume 7156 of Lecture Notes in Computer Science, pages 241-250. Springer Berlin Heidelberg, 2012
- (2012) Euro-Par 2011: Parallel Processing Workshops, volume 7156 of Lecture Notes in Computer Science , pp. 241-250
- Bridges, P.G.¹ Hoemmen, M.² Ferreira, K.B.³ Heroux, M.A.⁴ Soltero, P.⁵ Brightwell, R.⁶

15
- 0005356617
- Charles Babbage's analytical engine, 1838
- Allan G Bromley. Charles Babbage's analytical engine, 1838. Annals of the History of Computing, 4(3):196-217, 1982
- (1982) Annals of the History of Computing , vol.4 , Issue.3 , pp. 196-217
- Bromley, A.G.¹

16
- 57349156147
- Soft error vulnerability of iterative linear algebra methods
- ACM
- Greg Bronevetsky and Bronis de Supinski. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the 22nd annual international conference on Supercomputing, pages 155-164. ACM, 2008
- (2008) In Proceedings of the 22nd annual international conference on Supercomputing , pp. 155-164
- Bronevetsky, G.¹ de Supinski, B.²

17
- 74049111423
- Compiler-enhanced incremental checkpointing for openmp applications
- New York, NY, USA. ACM
- Greg Bronevetsky, Daniel J. Marques, Keshav K. Pingali, Radu Rugina, and Sally A. McKee. Compiler-enhanced incremental checkpointing for openmp applications. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 275-276, New York, NY, USA, 2008. ACM
- (2008) In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08 , pp. 275-276
- Bronevetsky, G.¹ Marques, D.J.² Pingali, K.K.³ Rugina, R.⁴ McKee, S.A.⁵

18
- 68249127079
- Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
- Franck Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. International Journal of High Performance Computing Applications, 23(3):212-226, 2009
- (2009) International Journal of High Performance Computing Applications , vol.23 , Issue.3 , pp. 212-226
- Cappello, F.¹

19
- 70450206305
- Toward exascale resilience
- Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward exascale resilience. International Journal of High Performance Computing Applications, 23(4):374-388, 2009
- (2009) International Journal of High Performance Computing Applications , vol.23 , Issue.4 , pp. 374-388
- Cappello, F.¹ Geist, A.² Gropp, B.³ Kale, L.⁴ Kramer, B.⁵ Snir, M.⁶

20
- 77958506610
- On communication determinism in parallel hpc applications
- IEEE
- Franck Cappello, Amina Guermouche, and Marc Snir. On communication determinism in parallel hpc applications. In Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on, pages 1-8. IEEE, 2010
- (2010) In Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on , pp. 1-8
- Cappello, F.¹ Guermouche, A.² Snir, M.³

21
- 84864068316
- Fault resilience of the algebraic multi-grid solver
- New York, NY, USA. ACM
- Marc Casas, Bronis R. de Supinski, Greg Bronevetsky, and Martin Schulz. Fault resilience of the algebraic multi-grid solver. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pages 91-100, New York, NY, USA, 2012. ACM
- (2012) In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12 , pp. 91-100
- Casas, M.¹ de Supinski, B.R.² Bronevetsky, G.³ Schulz, M.⁴

22
- 34548782109
- A fault tolerance protocol with fast fault recovery
- IEEE Press
- Sayantan Chakravorty and L. V. Kale. A fault tolerance protocol with fast fault recovery. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium. IEEE Press, 2007
- (2007) In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium
- Chakravorty, S.¹ Kale, L.V.²

23
- 84875168534
- Online-abft: An online algorithm based fault tolerance scheme for soft error detection in iterative methods
- New York, NY, USA. ACM
- Zizhong Chen. Online-abft: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 167-176, New York, NY, USA, 2013. ACM
- (2013) In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13 , pp. 167-176
- Chen, Z.¹

24
- 33847240498
- Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources
- IEEE Computer Society
- Zizhong Chen and J. Dongarra. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources. In Parallel and Distributed Processing Symposium, International, page 76, Los Alamitos, CA, USA, 2006. IEEE Computer Society
- (2006) In Parallel and Distributed Processing Symposium, International, page 76, Los Alamitos, CA, USA
- Chen, Z.¹ Dongarra, J.²

25
- 57049092162
- Algorithm-based fault tolerance for fail-stop failures
- Dec
- Zizhong Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. Parallel and Distributed Systems, IEEE Transactions on, 19(12):1628-1641, Dec 2008
- (2008) Parallel and Distributed Systems, IEEE Transactions on , vol.19 , Issue.12 , pp. 1628-1641
- Chen, Z.¹ Dongarra, J.²

26
- 31844451082
- Fault tolerant high performance computing by a coding approach
- New York, NY, USA. ACM
- Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra. Fault tolerant high performance computing by a coding approach. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '05, pages 213-223, New York, NY, USA, 2005. ACM
- (2005) In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '05 , pp. 213-223
- Chen, Z.¹ Fagg, G.E.² Gabriel, E.³ Langou, J.⁴ Angskun, T.⁵ Bosilca, G.⁶ Dongarra, J.⁷

27
- 84879873377
- Quantitative evaluation of soft error injection techniques for robust system design
- IEEE
- Hyungmin Cho, Shahrzad Mirkhani, Chen-Yong Cher, Jacob A Abraham, and Subhasish Mitra. Quantitative evaluation of soft error injection techniques for robust system design. In Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pages 1-10. IEEE, 2013
- (2013) In Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE , pp. 1-10
- Cho, H.¹ Mirkhani, S.² Cher, C.-Y.³ Abraham, J.A.⁴ Mitra, S.⁵

28
- 84896887172
- Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems
- November
- Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon, Larry Kaplan, and Mattan Erez. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems. In the Proceedings of SC12, November 2012
- (2012) In the Proceedings of SC12
- Chung, J.¹ Lee, I.² Sullivan, M.³ Ryoo, J.H.⁴ Kim, D.W.⁵ Yoon, D.H.⁶ Kaplan, L.⁷ Erez, M.⁸

29
- 28044460018
- A higher order estimate of the optimum checkpoint interval for restart dumps
- John T Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303-312, 2006
- (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
- Daly, J.T.¹

30
- 84880071687
- Correcting soft errors online in lu factorization
- New York, NY, USA. ACM
- Teresa Davies and Zizhong Chen. Correcting soft errors online in lu factorization. In Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing, HPDC '13, pages 167-178, New York, NY, USA, 2013. ACM
- (2013) In Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing, HPDC '13 , pp. 167-178
- Davies, T.¹ Chen, Z.²

31
- 79959586938
- High performance linpack benchmark: A fault tolerant implementation without checkpointing
- New York, NY, USA. ACM
- Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen. High performance linpack benchmark: A fault tolerant implementation without checkpointing. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 162-171, New York, NY, USA, 2011. ACM
- (2011) In Proceedings of the International Conference on Supercomputing, ICS '11 , pp. 162-171
- Davies, T.¹ Karlsson, C.² Liu, H.³ Ding, C.⁴ Chen, Z.⁵

32
- 77955737995
- High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
- LA-UR-10-00030, DARPA, January
- N. DeBardeleben, J. Laros, J. Daly, S. Scott, C. Engelmann, and W. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Technical Report LA-UR-10-00030, DARPA, January 2010
- (2010) Technical Report
- DeBardeleben, N.¹ Laros, J.² Daly, J.³ Scott, S.⁴ Engelmann, C.⁵ Harrod, W.⁶

33
- 84882620187
- Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience
- Berlin, Heidelberg. Springer-Verlag
- Nathan DeBardeleben, Sean Blanchard, Qiang Guan, Ziming Zhang, and Song Fu. Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience. In Proceedings of the 2011 International Conference on Parallel Processing-Volume 2, Euro-Par'11, pages 282-291, Berlin, Heidelberg, 2012. Springer-Verlag
- (2012) In Proceedings of the 2011 International Conference on Parallel Processing-Volume 2, Euro-Par'11 , pp. 282-291
- DeBardeleben, N.¹ Blanchard, S.² Guan, Q.³ Zhang, Z.⁴ Fu, S.⁵

34
- 84964321739
- Lessons learned from the analysis of system failures at petascale: The case of Blue Waters
- Catello Di Martino, F Baccanico, W Kramer, J Fullop, Z Kalbarczyk, and R Iyer. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters. In The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014), 2014
- (2014) In The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014)
- Di Martino, C.¹ Baccanico, F.² Kramer, W.³ Fullop, J.⁴ Kalbarczyk, Z.⁵ Iyer, R.⁶

35
- 84944814783
- No. 1 ESS maintenance plan
- R.W. Downing, J.S. Nowak, and L.S. Tuomenoksa. No. 1 ESS maintenance plan. Bell System Technical Journal, 43(5):1961-2019, 1964
- (1964) Bell System Technical Journal , vol.43 , Issue.5 , pp. 1961-2019
- Downing, R.W.¹ Nowak, J.S.² Tuomenoksa, L.S.³

36
- 84858403667
- Algorithm-based fault tolerance for dense matrix factorizations
- New York, NY, USA. ACM
- Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault, and Jack Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 225-234, New York, NY, USA, 2012. ACM
- (2012) In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12 , pp. 225-234
- Du, P.¹ Bouteiller, A.² Bosilca, G.³ Herault, T.⁴ Dongarra, J.⁵

37
- 85038382627
- Low-cost concurrent error detection for floating-point unit (fpu) controllers
- 20130276, 372(2018)
- Peter D. Düben, Jaume Joven, Avinash Lingamneni, Hugh McNamara, Giovanni De Michel, Krishna V. Palem, and T. N. Palmer. Low-cost concurrent error detection for floating-point unit (fpu) controllers. Philosophical Transactions of the Royal Society A, 20130276, 372(2018), 2014
- (2014) Philosophical Transactions of the Royal Society A
- Düben, P.D.¹ Joven, J.² Lingamneni, A.³ McNamara, H.⁴ Michel, G.D.⁵ Palem, K.V.⁶ Palmer, T.N.⁷

38
- 85038364389
- February
- John Daly (editor), Bob Adolf, Shekhar Borkar, Nathan DeBardeleben, Mootaz Elnozahy, Mike Heroux, David Rogers, Rob Ross, Vivek Sarkar, Martin Schulz, Marc Snir, and Paul Woodward. Inter Agency Workshop on HPC Resilience at Extreme Scale. http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf, February 2012
- (2012) Inter Agency Workshop on HPC Resilience at Extreme Scale
- Daly, J.¹ Adolf, B.² Borkar, S.³ DeBardeleben, N.⁴ Elnozahy, M.⁵ Heroux, M.⁶ Rogers, D.⁷ Ross, R.⁸ Sarkar, V.⁹ Schulz, M.¹⁰ Snir, M.¹¹ Woodward, P.¹²

39
- 77954574789
- System resilience at extreme scale
- Defense Advanced Research Project Agency (DARPA)
- Mootaz Elnozahy (editor), Ricardo Bianchini, Tarek El-Ghazawi, Armando Fox, Forest Godfrey, Adolfy Hoisie, Kathryn McKinley, Rami Melhem, James Plank, Partha Ranganathan, and Josh Simons. System resilience at extreme scale. Technical report, Defense Advanced Research Project Agency (DARPA), 2009
- (2009) Technical report
- Elnozahy, M.¹ Bianchini, R.² El-Ghazawi, T.³ Fox, A.⁴ Godfrey, F.⁵ Hoisie, A.⁶ McKinley, K.⁷ Melhem, R.⁸ Plank, J.⁹ Ranganathan, P.¹⁰ Simons, J.¹¹

40
- 77649294316
- Reduced precision checking for a floating point adder
- IEEE
- Patrick J Eibl, Andrew D Cook, and Daniel J Sorin. Reduced precision checking for a floating point adder. In Defect and Fault Tolerance in VLSI Systems, 2009. DFT'09. 24th IEEE International Symposium on, pages 145-152. IEEE, 2009
- (2009) In Defect and Fault Tolerance in VLSI Systems, 2009. DFT'09. 24th IEEE International Symposium on , pp. 145-152
- Eibl, P.J.¹ Cook, A.D.² Sorin, D.J.³

41
- 84880906440
- Energy considerations in checkpointing and fault tolerance protocols
- Mohammed el Mehdi Diouri, Olivier Glück, Laurent Lefèvre, and Franck Cappello. Energy considerations in checkpointing and fault tolerance protocols. In Proceedings of FTXS workshop, IEEE/IFIP DSN'12, pages 1-6, 2012
- (2012) In Proceedings of FTXS workshop, IEEE/IFIP DSN'12 , pp. 1-6
- el Mehdi Diouri, M.¹ Glück, O.² Lefèvre, L.³ Cappello, F.⁴

42
- 84866942395
- Combining partial redundancy and checkpointing for hpc
- June
- J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Combining partial redundancy and checkpointing for hpc. In Proceedings of the IEEE 32nd International Conference on Distributed Computing Systems ICDCS, pages 615-626, June 2012
- (2012) In Proceedings of the IEEE 32nd International Conference on Distributed Computing Systems ICDCS , pp. 615-626
- Elliott, J.¹ Kharbas, K.² Fiala, D.³ Mueller, F.⁴ Ferreira, K.⁵ Engelmann, C.⁶

43
- 84906696554
- Evaluating the impact of SDC on the GMRES iterative solver
- IPDPS'14
- James Elliott, Mark Hoemmen, and Frank Mueller. Evaluating the impact of SDC on the GMRES iterative solver. In Proceedings of International Parallel and Distributed Processing Symposium, IPDPS'14, 2014
- (2014) In Proceedings of International Parallel and Distributed Processing Symposium
- Elliott, J.¹ Hoemmen, M.² Mueller, F.³

44
- 0042078549
- A survey of rollback-recovery protocols in message-passing systems
- Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375-408, 2002
- (2002) ACM Computing Surveys (CSUR) , vol.34 , Issue.3 , pp. 375-408
- Elnozahy, E.N.¹ Alvisi, L.² Wang, Y.-M.³ Johnson, D.B.⁴

45
- 84888310932
- Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale
- January
- Christian Engelmann. Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale. Future Generation Computer Systems (FGCS), 30(0):59-65, January 2014
- (2014) Future Generation Computer Systems (FGCS) , vol.30 , pp. 59-65
- Engelmann, C.¹

46
- 79958180996
- Redundant execution of HPC applications with MR-MPI
- Christian Engelmann and Swen Böhm. Redundant execution of HPC applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, 2011
- (2011) In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011 , pp. 31-38
- Engelmann, C.¹ Böhm, S.²

47
- 70349089035
- Proactive fault tolerance using preemptive migration
- February 18-20
- Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive fault tolerance using preemptive migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, February 18-20, 2009
- (2009) Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009 , pp. 252-257
- Engelmann, C.¹ Vallée, G.R.² Naughton, T.³ Scott, S.L.⁴

48
- 84989894776
- Pete Beckman et al. Argo: An exascale operating system. In http://www.mcs.anl.gov/project/argo-exascale-operating-system
- Argo: An exascale operating system
- Beckman, P.¹

49
- 85038366747
- Ron Brightwell et al. Hobbes-an operating system for extreme-scale systems. In http://xstack.sandia.gov/hobbes/
- Hobbes-an operating system for extreme-scale systems
- Brightwell, R.¹

50
- 84940567900
- FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
- London, UK. Springer-Verlag
- Graham E. Fagg and Jack Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 346-353, London, UK, 2000. Springer-Verlag
- (2000) In Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface , pp. 346-353
- Fagg, G.E.¹ Dongarra, J.²

51
- 83155195270
- Technical Report SAND2011-2488, Sandia National Laboratories, Albuquerque, NM
- Kurt Ferreira, Rolf Riesen, Ron Oldfield, Jon Stearley, James Laros, Kevin Pedretti, and Ron Brightwell. rMPI: increasing fault resiliency in a message-passing environment. Technical Report SAND2011-2488, Sandia National Laboratories, Albuquerque, NM, 2011
- (2011) rMPI: increasing fault resiliency in a message-passing environment
- Ferreira, K.¹ Riesen, R.² Oldfield, R.³ Stearley, J.⁴ Laros, J.⁵ Pedretti, K.⁶ Brightwell, R.⁷

52
- 84908663133
- Evaluating the viability of process replication reliability for exascale systems
- Nov
- Kurt B Ferreira, Rolf Riesen, Patrick Bridges, Dorian Arnold, Jon Stearley, H Laros III James, Ron Oldfield, Kevin Pedretti, and Ron Brightwell. Evaluating the viability of process replication reliability for exascale systems. In ACM/IEEE Conference on Supercomputing (SC11), Nov 2011
- (2011) In ACM/IEEE Conference on Supercomputing (SC11)
- Ferreira, K.B.¹ Riesen, R.² Bridges, P.³ Arnold, D.⁴ Stearley, J.⁵ Laros, J.H.⁶ Oldfield, R.⁷ Pedretti, K.⁸ Brightwell, R.⁹

53
- 84877705582
- Detection and correction of silent data corruption for large-scale high-performance computing
- Los Alamitos, CA, USA. IEEE Computer Society Press
- David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 78:1-78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press
- (2012) In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12 , pp. 78:1-78:12
- Fiala, D.¹ Mueller, F.² Engelmann, C.³ Riesen, R.⁴ Ferreira, K.⁵ Brightwell, R.⁶

54
- 84866885057
- Taming of the shrew: modeling the normal and faulty behaviour of large-scale hpc systems
- IEEE
- Ana Gainaru, Franck Cappello, and William Kramer. Taming of the shrew: modeling the normal and faulty behaviour of large-scale hpc systems. In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), pages 1168-1179. IEEE, 2012
- (2012) In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS) , pp. 1168-1179
- Gainaru, A.¹ Cappello, F.² Kramer, W.³

55
- 84877693592
- Fault prediction under the microscope: A closer look into HPC systems
- IEEE Computer Society Press
- Ana Gainaru, Franck Cappello, Marc Snir, and William Kramer. Fault prediction under the microscope: A closer look into HPC systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, IEEE/ACM SC'12, page 77. IEEE Computer Society Press, 2012
- (2012) In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, IEEE/ACM SC'12 , pp. 77
- Gainaru, A.¹ Cappello, F.² Snir, M.³ Kramer, W.⁴

56
- 84881050143
- Failure prediction for HPC systems and applications current situation and open issues
- Ana Gainaru, Franck Cappello, Marc Snir, and William Kramer. Failure prediction for HPC systems and applications current situation and open issues. International Journal of High Performance Computing Applications, 27(3):273-282, 2013
- (2013) International Journal of High Performance Computing Applications , vol.27 , Issue.3 , pp. 273-282
- Gainaru, A.¹ Cappello, F.² Snir, M.³ Kramer, W.⁴

57
- 85038402001
- Private communication
- Al Geist. Private communication, 2012
- (2012) Al Geist

58
- 85038379147
- U.S. Department of Energy fault management workshop
- DOE
- Al Geist, Bob Lucas, Marc Snir, Shekhar Borkar, Eric Roman, Mootaz Elnozahy, Bert Still, Andrew Chien, Robert Clay, John Wu, Christian Engelmann, Nathan DeBardeleben, Rob Ross, Larry Kaplan, Martin Schulz, Mike Heroux, Sriram Krishnamoorthy, Lucy Nowell, Abhinav Vishnu, and Lee-Ann Talley. U.S. Department of Energy fault management workshop. Technical report, DOE, 2012
- (2012) Technical report
- Geist, A.¹ Lucas, B.² Snir, M.³ Borkar, S.⁴ Roman, E.⁵ Elnozahy, M.⁶ Still, B.⁷ Chien, A.⁸ Clay, R.⁹ Wu, J.¹⁰ Engelmann, C.¹¹ DeBardeleben, N.¹² Ross, R.¹³ Kaplan, L.¹⁴ Schulz, M.¹⁵ Heroux, M.¹⁶ Krishnamoorthy, S.¹⁷ Nowell, L.¹⁸ Vishnu, A.¹⁹ Talley, L.-A.²⁰

59
- 33646144388
- Providing efficient I/O redundancy in MPI environments
- Dieter Kranzlmüller, Peter Kacsuk, and Jack Dongarra, editors, number LNCS3241 in Lecture Notes in Computer Science. Springer Verlag. 11th European PVM/MPI User's Group Meeting, Budapest, Hungary
- William Gropp, Robert Ross, and Neill Miller. Providing efficient I/O redundancy in MPI environments. In Dieter Kranzlmüller, Peter Kacsuk, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number LNCS3241 in Lecture Notes in Computer Science, pages 77-86. Springer Verlag, 2004. 11th European PVM/MPI User's Group Meeting, Budapest, Hungary
- (2004) Recent Advances in Parallel Virtual Machine and Message Passing Interface , pp. 77-86
- Gropp, W.¹ Ross, R.² Miller, N.³

60
- 4344695315
- Fault tolerance in MPI programs
- William D. Gropp and Ewing Lusk. Fault tolerance in MPI programs. International Journal of High Performance Computing Applications, 18(3):363-372, 2004
- (2004) International Journal of High Performance Computing Applications , vol.18 , Issue.3 , pp. 363-372
- Gropp, W.D.¹ Lusk, E.²

61
- 84866852589
- Hydee: Failure containment without event logging for large scale send-deterministic MPI applications
- Amina Guermouche, Thomas Ropars, Marc Snir, and Franck Cappello. Hydee: Failure containment without event logging for large scale send-deterministic MPI applications. In Proceedings of IEEE IPDPS, pages 1216-1227, 2012
- (2012) In Proceedings of IEEE IPDPS , pp. 1216-1227
- Guermouche, A.¹ Ropars, T.² Snir, M.³ Cappello, F.⁴

62
- 77951481809
- CIFTS: A coordinated infrastructure for fault-tolerant systems
- IEEE
- Rinku Gupta, Pete Beckman, B-H Park, Ewing Lusk, Paul Hargrove, Al Geist, Dhabaleswar K Panda, Andrew Lumsdaine, and Jack Dongarra. CIFTS: A coordinated infrastructure for fault-tolerant systems. In Parallel Processing, 2009. ICPP'09. International Conference on, pages 237-245. IEEE, 2009
- (2009) In Parallel Processing, 2009. ICPP'09. International Conference on , pp. 237-245
- Gupta, R.¹ Beckman, P.² Park, B.-H.³ Lusk, E.⁴ Hargrove, P.⁵ Geist, A.⁶ Panda, D.K.⁷ Lumsdaine, A.⁸ Dongarra, J.⁹

63
- 33749067567
- Berkeley lab checkpoint/restart (blcr) for Linux clusters
- IOP Publishing
- Paul H Hargrove and Jason C Duell. Berkeley lab checkpoint/restart (blcr) for Linux clusters. In Journal of Physics: Conference Series, volume 46, page 494. IOP Publishing, 2006
- (2006) In Journal of Physics: Conference Series , vol.46 , pp. 494
- Hargrove, P.H.¹ Duell, J.C.²

64
- 84871176503
- Mechanisms and evaluation of cross-layer fault-tolerance for supercomputing
- Chen-Han Ho, Marc de Kruijf, Karthikeyan Sankaralingam, Barry Rountree, Martin Schulz, and Bronis R. de Supinski. Mechanisms and evaluation of cross-layer fault-tolerance for supercomputing. In Proceedings of ICPP, pages 510-519, 2012
- (2012) Proceedings of ICPP , pp. 510-519
- Ho, C.-H.¹ de Kruijf, M.² Sankaralingam, K.³ Rountree, B.⁴ Schulz, M.⁵ de Supinski, B.R.⁶

65
- 0021439162
- Algorithm-based fault tolerance for matrix operations
- June
- Kuang-Hua Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., 33(6):518-528, June 1984
- (1984) IEEE Trans. Comput , vol.33 , Issue.6 , pp. 518-528
- Huang, K.-H.¹ Abraham, J.A.²

66
- 80053030072
- Run-through stabilization: An MPI proposal for process fault tolerance
- Joshua Hursey, Richard L. Graham, Greg Bronevetsky, Darius Buntinas, Howard Pritchard, and David G. Solt. Run-through stabilization: An MPI proposal for process fault tolerance. In Proceedings of EuroMPI, pages 329-332, 2011
- (2011) In Proceedings of EuroMPI , pp. 329-332
- Hursey, J.¹ Graham, R.L.² Bronevetsky, G.³ Buntinas, D.⁴ Pritchard, H.⁵ Solt, D.G.⁶

67
- 84898045408
- Mcrengine: A scalable checkpointing system using data-aware aggregation and compression
- Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, and Rudolf Eigenmann. Mcrengine: A scalable checkpointing system using data-aware aggregation and compression. Scientific Programming, 21(3-4):149-163, 2013
- (2013) Scientific Programming , vol.21 , Issue.3-4 , pp. 149-163
- Islam, T.Z.¹ Mohror, K.² Bagchi, S.³ Moody, A.⁴ de Supinski, B.R.⁵ Eigenmann, R.⁶

68
- 78650009816
- Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
- New York, NY, USA. ACM
- William M. Jones, John T. Daly, and Nathan DeBardeleben. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 276-279, New York, NY, USA, 2010. ACM
- (2010) In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10 , pp. 276-279
- Jones, W.M.¹ Daly, J.T.² DeBardeleben, N.³

69
- 85038354447
- 2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM-March 19-20, 2009
- Technical Memorandum ANL/MCS-TM-312, MCS, ANL, December 2009
- D. S. Katz, J. Daly, N. DeBardeleben, M. Elnozahy, B. Kramer, L. Lathrop, N. Nystrom, K. Milfeld, S. Sanielevici, S. Cott, and L. Votta. 2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM-March 19-20, 2009. Technical Report Technical Memorandum ANL/MCS-TM-312, MCS, ANL, December 2009
- (2009) Fault tolerance for extreme-scale computing workshop, Albuquerque, NM-March 19-20, 2009
- Katz, D.S.¹ Daly, J.² DeBardeleben, N.³ Elnozahy, M.⁴ Kramer, B.⁵ Lathrop, L.⁶ Nystrom, N.⁷ Milfeld, K.⁸ Sanielevici, S.⁹ Cott, S.¹⁰ Votta, L.¹¹

70
- 0037253011
- NASA advanced robotic space exploration
- D.S. Katz and R.R. Some. NASA advanced robotic space exploration. Computer, 36(1):52-61, 2003
- (2003) Computer , vol.36 , Issue.1 , pp. 52-61
- Katz, D.S.¹ Some, R.R.²

71
- 84863542945
- Survey of error and fault detection mechanisms v2
- Technical Report TR-LPH-2012-001, LPH Group, Department of Electrical and Computer Engineering, The University of Texas at Austin, December
- Ikhwan Lee, Michael Sullivan, Evgeni Krimer, DongWan Kim, Mehmet Basoglu, Doe Hyun Yoon, Larry Kaplan, and Mattan Erez. Survey of error and fault detection mechanisms v2. Technical Report TR-LPH-2012-001, LPH Group, Department of Electrical and Computer Engineering, The University of Texas at Austin, December 2012
- (2012) Survey of error and fault detection mechanisms v2
- Lee, I.¹ Sullivan, M.² Krimer, E.³ Kim, D.⁴ Basoglu, M.⁵ Yoon, D.H.⁶ Kaplan, L.⁷ Erez, M.⁸

72
- 84899682930
- Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach
- Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S Vetter. Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013
- (2013) In International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
- Li, D.¹ Chen, Z.² Wu, P.³ Vetter, J.S.⁴

73
- 84877692741
- Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
- Dong Li, Jeffrey S Vetter, and Weikuan Yu. Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, Salt Lake City, 11/2012
- (2012) In SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis
- Li, D.¹ Vetter, J.S.² Yu, W.³

74
- 77954589337
- A proactive fault tolerance framework for high-performance computing
- Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A proactive fault tolerance framework for high-performance computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, 2010
- (2010) In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010
- Litvinova, A.¹ Engelmann, C.² Scott, S.L.³

75
- 84878419277
- Low-cost concurrent error detection for floating-point unit (FPU) controllers
- July
- M. Maniatakos, P. Kudva, B.M. Fleischer, and Y. Makris. Low-cost concurrent error detection for floating-point unit (FPU) controllers. Computers, IEEE Transactions on, 62(7):1376-1388, July 2013
- (2013) Computers, IEEE Transactions on , vol.62 , Issue.7 , pp. 1376-1388
- Maniatakos, M.¹ Kudva, P.² Fleischer, B.M.³ Makris, Y.⁴

76
- 84955374563
- Energy profile of rollback-recovery strategies in high performance computing
- E. Meneses, O. Sarood, and L.V. Kalé. Energy profile of rollback-recovery strategies in high performance computing. Parallel Computing, 2014
- (2014) Parallel Computing
- Meneses, E.¹ Sarood, O.² Kalé, L.V.³

77
- 80955167907
- Dynamic load balance for optimized message logging in fault tolerant HPC applications
- Esteban Meneses, Laxmikant V. Kalé, and Greg Bronevetsky. Dynamic load balance for optimized message logging in fault tolerant HPC applications. In Proceedings of IEEE Cluster, pages 281-289, 2011
- (2011) Proceedings of IEEE Cluster , pp. 281-289
- Meneses, E.¹ Kalé, L.V.² Bronevetsky, G.³

78
- 78650831692
- Design, modeling, and evaluation of a scalable multi-level checkpointing system
- A. Moody, G. Bronevetsky, K. Mohror, and B.R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the 2010 International Conference on High Performance Computing, Networking, Storage and Analysis (SC10), pages 1-11, 2010
- (2010) Proceedings of the 2010 International Conference on High Performance Computing, Networking, Storage and Analysis (SC10) , pp. 1-11
- Moody, A.¹ Bronevetsky, G.² Mohror, K.³ de Supinski, B.R.⁴

79
- 84899671615
- ACR: Automatic checkpoint/restart for soft and hard error protection
- IEEE Computer Society, November
- Xiang Ni, Esteban Meneses, Nikhil Jain, and Laxmikant V. Kale. ACR: Automatic checkpoint/restart for soft and hard error protection. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13. IEEE Computer Society, November 2013
- (2013) In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13
- Ni, X.¹ Meneses, E.² Jain, N.³ Kale, L.V.⁴

80
- 84870713710
- Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm
- Beijing, China, September
- Xiang Ni, Esteban Meneses, and Laxmikant V. Kalé. Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm. In Proceedings of IEEE Cluster'12, Beijing, China, September 2012
- (2012) In Proceedings of IEEE Cluster'12
- Ni, X.¹ Meneses, E.² Kalé, L.V.³

81
- 83455166682
- NVCR: A transparent checkpoint-restart library for NVIDIA CUDA
- Akira Nukada, Hiroyuki Takizawa, and Satoshi Matsuoka. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA. In IPDPS Workshops, pages 104-113, 2011
- (2011) IPDPS Workshops , pp. 104-113
- Nukada, A.¹ Takizawa, H.² Matsuoka, S.³

82
- 84906706607
- Optimization of multi-level checkpoint model for large scale HPC applications
- Optimization of multi-level checkpoint model for large scale HPC applications. In Proceedings of IEEE IPDPS 2014, 2014
- (2014) In Proceedings of IEEE IPDPS 2014

83
- 84904409465
- Snapify: Capturing snapshots of offload applications on Xeon Phi manycore processors
- HPDC'14
- A. Rezaei, G. Coviello, CH. Li, S. Chakradhar, and F Mueller. Snapify: Capturing snapshots of offload applications on Xeon Phi manycore processors. In Proceedings of High-Performance Parallel and Distributed Computing, HPDC'14, 2014
- (2014) In Proceedings of High-Performance Parallel and Distributed Computing
- Rezaei, A.¹ Coviello, G.² Li, C.H.³ Chakradhar, S.⁴ Mueller, F.⁵

84
- 85038373125
- SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing
- Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, Andre Schiper, and Franck Cappello. SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing. In Proceedings of IEEE/ACM SC, page 8, 2013
- (2013) In Proceedings of IEEE/ACM SC , pp. 8
- Ropars, T.¹ Martsinkevich, T.V.² Guermouche, A.³ Schiper, A.⁴ Cappello, F.⁵

85
- 84880052335
- Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system
- Takafumi Saito, Kento Sato, Hitoshi Sato, and Satoshi Matsuoka. Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system. In Proceedings of the Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), pages 41-48, 2013
- (2013) In Proceedings of the Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS) , pp. 41-48
- Saito, T.¹ Sato, K.² Sato, H.³ Matsuoka, S.⁴

86
- 84899668006
- A cool way of improving the reliability of HPC machines
- Denver, CO, USA, November
- Osman Sarood, Esteban Meneses, and L. V. Kale. A cool way of improving the reliability of HPC machines. In Proceedings of The International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE/ACM SC'13, Denver, CO, USA, November 2013
- (2013) In Proceedings of The International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE/ACM SC'13
- Sarood, O.¹ Meneses, E.² Kale, L.V.³

87
- 84925019506
- Design and modeling of a non-blocking checkpointing system
- Kento Sato, Naoya Maruyama, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R. de Supinski, and Satoshi Matsuoka. Design and modeling of a non-blocking checkpointing system. In Proceedings of IEEE/ACM SC'12, page 19, 2012
- (2012) In Proceedings of IEEE/ACM SC'12 , pp. 19
- Sato, K.¹ Maruyama, N.² Mohror, K.³ Moody, A.⁴ Gamblin, T.⁵ de Supinski, B.R.⁶ Matsuoka, S.⁷

88
- 84866720696
- Algorithmic approaches to low overhead fault detection for sparse linear algebra
- Joseph Sloan, Rakesh Kumar, and Greg Bronevetsky. Algorithmic approaches to low overhead fault detection for sparse linear algebra. 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2012
- (2012) 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Sloan, J.¹ Kumar, R.² Bronevetsky, G.³

89
- 84883436062
- An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance
- Joseph Sloan, Rakesh Kumar, and Greg Bronevetsky. An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance. 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013
- (2013) 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Sloan, J.¹ Kumar, R.² Bronevetsky, G.³

90
- 84900560822
- Addressing failures in exascale computing
- May
- Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A Chien, Paul Coteus, Nathan A DeBardeleben, Pedro C Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing failures in exascale computing. International Journal of High Performance Computing Applications, 28(2):129-173, May 2014
- (2014) International Journal of High Performance Computing Applications , vol.28 , Issue.2 , pp. 129-173
- Snir, M.¹ Wisniewski, R.W.² Abraham, J.A.³ Adve, S.V.⁴ Bagchi, S.⁵ Balaji, P.⁶ Belak, J.⁷ Bose, P.⁸ Cappello, F.⁹ Carlson, B.¹⁰ Chien, A.A.¹¹ Coteus, P.¹² DeBardeleben, N.A.¹³ Diniz, P.C.¹⁴ Engelmann, C.¹⁵ Erez, M.¹⁶ Fazzari, S.¹⁷ Geist, A.¹⁸ Gupta, R.¹⁹ Johnson, F.²⁰ more..

91
- 0033314330
- IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective
- L. Spainhower and T.A. Gregg. IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development, 43(5-6):863-873, 1999
- (1999) IBM Journal of Research and Development , vol.43 , Issue.5-6 , pp. 863-873
- Spainhower, L.¹ Gregg, T.A.²

92
- 85038398070
- Leveraging a task-based asynchronous dataflow substrate for efficient and scalable resiliency
- Research report of Polytechnic University of Catalonia, Computer Architecture Department. Num: UPC-DAC-RR-CAP-2013-12
- Omer Subasi, Javier Arias, Jesus Labarta, Osman Unsal, and Adrian Cristal (Barcelona Supercomputing Center). Leveraging a task-based asynchronous dataflow substrate for efficient and scalable resiliency. Research report of Polytechnic University of Catalonia, Computer Architecture Department, num. UPC-DAC-RR-CAP-2013-12, 2014
- (2014) Research report UPC-DAC-RR-CAP-2013-12, Polytechnic University of Catalonia, Computer Architecture Department
- Subasi, O.¹ Arias, J.² Labarta, J.³ Unsal, O.⁴ Cristal, A.⁵

93
- 85038384726
- The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail-and didn't
- February
- Alexander Randall V. The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail-and didn't. Computerworld, 40(8), February 2006
- (2006) Computerworld , vol.40 , Issue.8
- Alexander Randall, V.¹

94
- 79951790076
- Hybrid checkpointing for MPI jobs in HPC environments
- Dec
- Chao Wang, F. Mueller, C. Engelmann, and S.L. Scott. Hybrid checkpointing for MPI jobs in HPC environments. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, pages 524-533, Dec 2010
- (2010) In Parallel and Distributed Systems (ICPADS) 2010 IEEE 16th International Conference on , pp. 524-533
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

95
- 84855350553
- Proactive process-level live migration and back migration in HPC environments
- February
- Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive process-level live migration and back migration in HPC environments. J. Parallel Distrib. Comput., 72(2):254-267, February 2012
- (2012) J. Parallel Distrib. Comput , vol.72 , Issue.2 , pp. 254-267
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

96
- 84879509439
- Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
- ACM
- Gulay Yalcin, Osman Sabri Unsal, and Adrian Cristal. Fault tolerance for multi-threaded applications by leveraging hardware transactional memory. In Proceedings of the ACM International Conference on Computing Frontiers, page 4. ACM, 2013
- (2013) In Proceedings of the ACM International Conference on Computing Frontiers , pp. 4
- Yalcin, G.¹ Unsal, O.S.² Cristal, A.³

97
- 84976846528
- A first order approximation to the optimum checkpoint interval
- John W Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530-531, 1974
- (1974) Communications of the ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.W.¹

98
- 33646425358
- Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++
- April
- Gengbin Zheng, Chao Huang, and Laxmikant V. Kalé. Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. SIGOPS Oper. Syst. Rev., 40(2):90-99, April 2006
- (2006) SIGOPS Oper. Syst. Rev , vol.40 , Issue.2 , pp. 90-99
- Zheng, G.¹ Huang, C.² Kalé, L.V.³

99
- 85038403638
- A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale
- Boston, USA, June
- Gengbin Zheng, Xiang Ni, and L. V. Kale. A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale. In Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), Boston, USA, June 2012
- (2012) In Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
- Zheng, G.¹ Ni, X.² Kale, L.V.³

100
- 84983422393
- Fault tolerance in an inner-outer solver: a GVR-enabled case study
- Lecture Notes in Computer Science
- Ziming Zheng, Andrew A. Chien, and Keita Teranishi. Fault tolerance in an inner-outer solver: a GVR-enabled case study. In Proceedings of VECPAR 2014, Lecture Notes in Computer Science, 2014
- (2014) In Proceedings of VECPAR 2014
- Zheng, Z.¹ Chien, A.A.² Teranishi, K.³

101
- 77956589566
- A practical failure prediction with location and lead time for Blue Gene/P
- IEEE
- Ziming Zheng, Zhiling Lan, Rinku Gupta, Susan Coghlan, and Peter Beckman. A practical failure prediction with location and lead time for Blue Gene/P. In Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on, pages 15-22. IEEE, 2010
- (2010) In Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on , pp. 15-22
- Zheng, Z.¹ Lan, Z.² Gupta, R.³ Coghlan, S.⁴ Beckman, P.⁵

102
- 80053278089
- Co-analysis of RAS log and job log on Blue Gene/P
- IEEE
- Ziming Zheng, Li Yu, Wei Tang, Zhiling Lan, Rinku Gupta, Narayan Desai, Susan Coghlan, and Daniel Buettner. Co-analysis of RAS log and job log on Blue Gene/P. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 840-851. IEEE, 2011
- (2011) In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International , pp. 840-851
- Zheng, Z.¹ Yu, L.² Tang, W.³ Lan, Z.⁴ Gupta, R.⁵ Desai, N.⁶ Coghlan, S.⁷ Buettner, D.⁸

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.