



Volume 1, Issue 1, 2014, Pages 4-27

Toward exascale resilience: 2014 update

Author keywords

Exascale; Fault tolerance techniques; Resilience

Indexed keywords

EXASCALE; FAULT TOLERANCE TECHNIQUES; PREDICT ERRORS; RESEARCH PROBLEMS; RESILIENCE; TECHNICAL PROGRESS; TECHNOLOGY EVOLUTION; UNSTABLE SYSTEM;

EID: 85018017476     PISSN: 24096008     EISSN: 23138734     Source Type: Journal    
DOI: 10.14529/jsfi140101     Document Type: Article
Times cited : (266)

References (102)
  • 1
    • 85033564681
    • The Blue Waters super system for super science
    • Jeffrey S. Vetter, editor, Chapman and Hall/CRC
    • The Blue Waters super system for super science. Contemporary High Performance Computing From Petascale toward Exascale, Jeffrey S. Vetter, editor, Chapman and Hall/CRC, pages 339-366, ISBN: 978-1-4665-6834-1, 2013
    • (2013) Contemporary High Performance Computing From Petascale toward Exascale , pp. 339-366
  • 7
    • 85038393026
    • Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI
    • July
    • W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Concurrency and Computation: Practice and Experience, special issue: Euro-Par 2012. July 2013
    • (2013) Concurrency and Computation: Practice and Experience, special issue: Euro-Par 2012
    • Bland, W.1    Du, P.2    Bouteiller, A.3    Herault, T.4    Bosilca, G.5    Dongarra, J.6
  • 11
    • 84874118584
    • Correlated set coordination in fault tolerant message logging protocols
    • A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Correlated set coordination in fault tolerant message logging protocols. Concurrency and Computation: Practice and Experience, 25(4):572-585, 2013
    • (2013) Concurrency and Computation: Practice and Experience , vol.25 , Issue.4 , pp. 572-585
    • Bouteiller, A.1    Herault, T.2    Bosilca, G.3    Dongarra, J.4
  • 14
    • 84863961922
    • Cooperative application/OS DRAM fault recovery
    • Michael Alexander, Pasqua D'Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Di Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, and Josef Weidendorfer, editors, Euro-Par 2011: Parallel Processing Workshops, volume 7156 of Lecture Notes in Computer Science, Springer Berlin Heidelberg
    • Patrick G. Bridges, Mark Hoemmen, Kurt B. Ferreira, Michael A. Heroux, Philip Soltero, and Ron Brightwell. Cooperative application/OS DRAM fault recovery. In Michael Alexander, Pasqua D'Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Di Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, and Josef Weidendorfer, editors, Euro-Par 2011: Parallel Processing Workshops, volume 7156 of Lecture Notes in Computer Science, pages 241-250. Springer Berlin Heidelberg, 2012
    • (2012) Euro-Par 2011: Parallel Processing Workshops, Lecture Notes in Computer Science , vol.7156 , pp. 241-250
    • Bridges, P.G.1    Hoemmen, M.2    Ferreira, K.B.3    Heroux, M.A.4    Soltero, P.5    Brightwell, R.6
  • 15
    • 0005356617
    • Charles Babbage's analytical engine, 1838
    • Allan G Bromley. Charles Babbage's analytical engine, 1838. Annals of the History of Computing, 4(3):196-217, 1982
    • (1982) Annals of the History of Computing , vol.4 , Issue.3 , pp. 196-217
    • Bromley, A.G.1
  • 18
    • 68249127079
    • Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
    • Franck Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. International Journal of High Performance Computing Applications, 23(3):212-226, 2009
    • (2009) International Journal of High Performance Computing Applications , vol.23 , Issue.3 , pp. 212-226
    • Cappello, F.1
  • 29
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • John T Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303-312, 2006
    • (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
    • Daly, J.T.1
  • 32
    • 77955737995
    • High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
    • LA-UR-10-00030, DARPA, January
    • N. DeBardeleben, J. Laros, J. Daly, S. Scott, C. Engelmann, and W. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Technical Report LA-UR-10-00030, DARPA, January 2010
    • (2010) Technical Report
    • DeBardeleben, N.1    Laros, J.2    Daly, J.3    Scott, S.4    Engelmann, C.5    Harrod, W.6
  • 44
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375-408, 2002
    • (2002) ACM Computing Surveys (CSUR) , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.1    Alvisi, L.2    Wang, Y.-M.3    Johnson, D.B.4
  • 45
    • 84888310932
    • Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale
    • January
    • Christian Engelmann. Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale. Future Generation Computer Systems (FGCS), 30(0):59-65, January 2014
    • (2014) Future Generation Computer Systems (FGCS) , vol.30 , pp. 59-65
    • Engelmann, C.1
  • 57
    • 85038402001
    • Private communication
    • Al Geist. Private communication, 2012
    • (2012) Al Geist
  • 59
    • 33646144388
    • Providing efficient I/O redundancy in MPI environments
    • Dieter Kranzlmüller, Peter Kacsuk, and Jack Dongarra, editors, number LNCS3241 in Lecture Notes in Computer Science, Springer Verlag 11th European PVM/MPI User's Group Meeting, Budapest, Hungary
    • William Gropp, Robert Ross, and Neill Miller. Providing efficient I/O redundancy in MPI environments. In Dieter Kranzlmüller, Peter Kacsuk, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number LNCS3241 in Lecture Notes in Computer Science, pages 77-86. Springer Verlag, 2004. 11th European PVM/MPI User's Group Meeting, Budapest, Hungary
    • (2004) Recent Advances in Parallel Virtual Machine and Message Passing Interface , pp. 77-86
    • Gropp, W.1    Ross, R.2    Miller, N.3
  • 61
    • 84866852589
    • Hydee: Failure containment without event logging for large scale send-deterministic MPI applications
    • Amina Guermouche, Thomas Ropars, Marc Snir, and Franck Cappello. Hydee: Failure containment without event logging for large scale send-deterministic MPI applications. In Proceedings of IEEE IPDPS, pages 1216-1227, 2012
    • (2012) In Proceedings of IEEE IPDPS , pp. 1216-1227
    • Guermouche, A.1    Ropars, T.2    Snir, M.3    Cappello, F.4
  • 63
    • 33749067567
    • Berkeley lab checkpoint/restart (blcr) for Linux clusters
    • IOP Publishing
    • Paul H Hargrove and Jason C Duell. Berkeley lab checkpoint/restart (blcr) for Linux clusters. In Journal of Physics: Conference Series, volume 46, page 494. IOP Publishing, 2006
    • (2006) In Journal of Physics: Conference Series , vol.46 , pp. 494
    • Hargrove, P.H.1    Duell, J.C.2
  • 65
    • 0021439162
    • Algorithm-based fault tolerance for matrix operations
    • June
    • Kuang-Hua Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., 33(6):518-528, June 1984
    • (1984) IEEE Trans. Comput , vol.33 , Issue.6 , pp. 518-528
    • Huang, K.-H.1    Abraham, J.A.2
  • 67
    • 84898045408
    • Mcrengine: A scalable checkpointing system using data-aware aggregation and compression
    • Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, and Rudolf Eigenmann. Mcrengine: A scalable checkpointing system using data-aware aggregation and compression. Scientific Programming, 21(3-4):149-163, 2013
    • (2013) Scientific Programming , vol.21 , Issue.3-4 , pp. 149-163
    • Islam, T.Z.1    Mohror, K.2    Bagchi, S.3    Moody, A.4    de Supinski, B.R.5    Eigenmann, R.6
  • 70
    • 0037253011
    • NASA advanced robotic space exploration
    • D.S. Katz and R.R. Some. NASA advanced robotic space exploration. Computer, 36(1):52-61, 2003
    • (2003) Computer , vol.36 , Issue.1 , pp. 52-61
    • Katz, D.S.1    Some, R.R.2
  • 72
    • 84899682930
    • Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach
    • Networking, Storage and Analysis (SC)
    • Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S Vetter. Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013
    • (2013) International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
    • Li, D.1    Chen, Z.2    Wu, P.3    Vetter, J.S.4
  • 73
    • 84877692741
    • Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
    • Networking, Storage, and Analysis, Salt Lake City, 11/2012
    • Dong Li, Jeffrey S Vetter, and Weikuan Yu. Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, Salt Lake City, 11/2012
    • (2012) SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis
    • Li, D.1    Vetter, J.S.2    Yu, W.3
  • 75
    • 84878419277
    • Low-cost concurrent error detection for floating-point unit (FPU) controllers
    • July
    • M. Maniatakos, P. Kudva, B.M. Fleischer, and Y. Makris. Low-cost concurrent error detection for floating-point unit (FPU) controllers. Computers, IEEE Transactions on, 62(7):1376-1388, July 2013
    • (2013) Computers, IEEE Transactions on , vol.62 , Issue.7 , pp. 1376-1388
    • Maniatakos, M.1    Kudva, P.2    Fleischer, B.M.3    Makris, Y.4
  • 76
    • 84955374563
    • Energy profile of rollback-recovery strategies in high performance computing
    • E. Meneses, O. Sarood, and L.V. Kalé. Energy profile of rollback-recovery strategies in high performance computing. Parallel Computing, 2014
    • (2014) Parallel Computing
    • Meneses, E.1    Sarood, O.2    Kalé, L.V.3
  • 77
    • 80955167907
    • Dynamic load balance for optimized message logging in fault tolerant HPC applications
    • Esteban Meneses, Laxmikant V. Kalé, and Greg Bronevetsky. Dynamic load balance for optimized message logging in fault tolerant HPC applications. In Proceedings of IEEE Cluster, pages 281-289, 2011
    • (2011) Proceedings of IEEE Cluster , pp. 281-289
    • Meneses, E.1    Kalé, L.V.2    Bronevetsky, G.3
  • 79
    • 84899671615
    • ACR: Automatic checkpoint/restart for soft and hard error protection
    • Networking, Storage and Analysis, SC '13. IEEE Computer Society, November
    • Xiang Ni, Esteban Meneses, Nikhil Jain, and Laxmikant V. Kale. ACR: Automatic checkpoint/restart for soft and hard error protection. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13. IEEE Computer Society, November 2013
    • (2013) ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13
    • Ni, X.1    Meneses, E.2    Jain, N.3    Kale, L.V.4
  • 80
    • 84870713710
    • Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm
    • Beijing, China, September
    • Xiang Ni, Esteban Meneses, and Laxmikant V. Kalé. Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm. In Proceedings of IEEE Cluster'12, Beijing, China, September 2012
    • (2012) In Proceedings of IEEE Cluster'12
    • Ni, X.1    Meneses, E.2    Kalé, L.V.3
  • 81
    • 83455166682
    • NVCR: A transparent checkpoint-restart library for NVIDIA CUDA
    • Akira Nukada, Hiroyuki Takizawa, and Satoshi Matsuoka. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA. In IPDPS Workshops, pages 104-113, 2011
    • (2011) IPDPS Workshops , pp. 104-113
    • Nukada, A.1    Takizawa, H.2    Matsuoka, S.3
  • 82
    • 84906706607
    • Optimization of multi-level checkpoint model for large scale HPC applications
    • Optimization of multi-level checkpoint model for large scale HPC applications. In Proceedings of IEEE IPDPS 2014, 2014
    • (2014) In Proceedings of IEEE IPDPS 2014
  • 84
    • 85038373125
    • SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
    • Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, Andre Schiper, and Franck Cappello. SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing. In Proceedings of IEEE/ACM SC, page 8, 2013
    • (2013) In Proceedings of IEEE/ACM SC , pp. 8
    • Ropars, T.1    Martsinkevich, T.V.2    Guermouche, A.3    Schiper, A.4    Cappello, F.5
  • 91
    • 0033314330
    • IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective
    • L. Spainhower and T.A. Gregg. IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development, 43(5-6):863-873, 1999
    • (1999) IBM Journal of Research and Development , vol.43 , Issue.5-6 , pp. 863-873
    • Spainhower, L.1    Gregg, T.A.2
  • 93
    • 85038384726
    • The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail-and didn't
    • February
    • Alexander Randall V. The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail-and didn't. Computerworld, 40(8), February 2006
    • (2006) Computerworld , vol.40 , Issue.8
    • Randall V, A.1
  • 95
    • 84855350553
    • Proactive process-level live migration and back migration in HPC environments
    • February
    • Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive process-level live migration and back migration in HPC environments. J. Parallel Distrib. Comput., 72(2):254-267, February 2012
    • (2012) J. Parallel Distrib. Comput , vol.72 , Issue.2 , pp. 254-267
    • Wang, C.1    Mueller, F.2    Engelmann, C.3    Scott, S.L.4
  • 97
    • 84976846528
    • A first order approximation to the optimum checkpoint interval
    • John W Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530-531, 1974
    • (1974) Communications of the ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.W.1
  • 98
    • 33646425358
    • Performance evaluation of automatic checkpoint-based fault tolerance for ampi and charm++
    • April
    • Gengbin Zheng, Chao Huang, and Laxmikant V. Kalé. Performance evaluation of automatic checkpoint-based fault tolerance for ampi and charm++. SIGOPS Oper. Syst. Rev., 40(2):90-99, April 2006
    • (2006) SIGOPS Oper. Syst. Rev , vol.40 , Issue.2 , pp. 90-99
    • Zheng, G.1    Huang, C.2    Kalé, L.V.3
  • 99
    • 85038403638
    • A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale
    • Boston, USA, June
    • Gengbin Zheng, Xiang Ni, and L. V. Kale. A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale. In Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), Boston, USA, June 2012
    • (2012) Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
    • Zheng, G.1    Ni, X.2    Kale, L.V.3
  • 100
    • 84983422393
    • Fault tolerance in an inner-outer solver: a GVR-enabled case study
    • Lecture Notes in Computer Science
    • Ziming Zheng, Andrew A. Chien, and Keita Teranishi. Fault tolerance in an inner-outer solver: a GVR-enabled case study. In Proceedings of VECPAR 2014, Lecture Notes in Computer Science, 2014
    • (2014) In Proceedings of VECPAR 2014
    • Zheng, Z.1    Chien, A.A.2    Teranishi, K.3


* This information was extracted and analyzed by KISTI from Elsevier's SCOPUS database.