



2013, Pages 501-512

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing

Author keywords

Failure prediction; large scale HPC systems; multilevel checkpointing; resilience

Indexed keywords

DISCRETE-EVENT SIMULATORS; FAILURE PREDICTION; MEAN TIME BETWEEN FAILURES; MODEL-BASED SIMULATIONS; MULTILEVEL CHECKPOINTING; PARALLEL APPLICATION; PREDICTION PRECISION; RESILIENCE

EID: 84884837861     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/IPDPS.2013.74     Document Type: Conference Paper
Times cited: 38

References (36)
  • 1
    • 84884867455
    • Top 500 most powerful supercomputers. http://www.top500.org/, 2012. [Online; accessed 1-October-2012].
    • (2012) Top 500 Most Powerful Supercomputers
  • 5
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, pages 303-312, 2006.
    • (2006) Future Generation Computer Systems , pp. 303-312
    • Daly, J.T.1
  • 7
    • 34548640111
    • Fundamental differences between SPH and grid methods
    • Agertz et al. Fundamental differences between SPH and grid methods. Monthly Notices of the Royal Astronomical Society, pages 963-978, 2007.
    • (2007) Monthly Notices of the Royal Astronomical Society , pp. 963-978
    • Agertz1
  • 9
    • 55849147399
    • Dynamic meta-learning for failure prediction in large-scale systems: A case study
    • IEEE press
    • J. Gu et al. Dynamic meta-learning for failure prediction in large-scale systems: A case study. In International Conference on Parallel Processing, pages 157-164. IEEE press, 2008.
    • (2008) International Conference on Parallel Processing , pp. 157-164
    • Gu, J.1
  • 11
    • 84877719832
    • Logmaster: Mining event correlations in logs of large-scale cluster systems
    • abs/1003.0951
    • R. Ren et al. Logmaster: Mining event correlations in logs of large-scale cluster systems. CoRR abs/1003.0951, 2010.
    • (2010) CoRR
    • Ren, R.1
  • 12
  • 14
    • 84866885057
    • Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems
    • IEEE press
    • Ana Gainaru, Franck Cappello, and William Kramer. Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems. In Proceedings of IEEE IPDPS 2012. IEEE press, 2012.
    • (2012) Proceedings of IEEE IPDPS 2012
    • Gainaru, A.1    Cappello, F.2    Kramer, W.3
  • 23
    • 83955164680
    • Weibull and gamma renewal approximation using generalized exponential functions
    • T. Jin and L. Gonigunta. Weibull and gamma renewal approximation using generalized exponential functions. Communications in Statistics-Simulation and Computation, 38(1):154-171, 2008.
    • (2008) Communications in Statistics-Simulation and Computation , vol.38 , Issue.1 , pp. 154-171
    • Jin, T.1    Gonigunta, L.2
  • 26
    • 31044449725
    • Accident prediction model for railway-highway interfaces
    • Jutaek Oh, Simon P Washington, and Doohee Nam. Accident prediction model for railway-highway interfaces. Accident analysis and prevention, 38(2):346-356, 2006.
    • (2006) Accident Analysis and Prevention , vol.38 , Issue.2 , pp. 346-356
    • Oh, J.1    Washington, S.P.2    Nam, D.3
  • 28
    • 54249121630
    • Modelling discontinuities and Kelvin-Helmholtz instabilities in SPH
    • Daniel J. Price. Modelling discontinuities and Kelvin-Helmholtz instabilities in SPH. Journal of Computational Physics, pages 10040-10057, 2008.
    • (2008) Journal of Computational Physics , pp. 10040-10057
    • Price, D.J.1
  • 29
    • 80052380100
    • On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications
    • Springer Berlin / Heidelberg
    • Thomas Ropars, Amina Guermouche, Bora Uçar, Esteban Meneses, Laxmikant Kalé, and Franck Cappello. On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications. In Euro-Par 2011 Parallel Processing, volume 6852, pages 567-578. Springer Berlin / Heidelberg, 2011.
    • (2011) Euro-Par 2011 Parallel Processing , vol.6852 , pp. 567-578
    • Ropars, T.1    Guermouche, A.2    Uçar, B.3    Meneses, E.4    Kalé, L.5    Cappello, F.6
  • 30
    • 77950267881
    • A survey of online failure prediction methods
    • Felix Salfner, Maren Lenk, and Miroslaw Malek. A survey of online failure prediction methods. ACM Computing Surveys, 42:1-42, 2010.
    • (2010) ACM Computing Surveys , vol.42 , pp. 1-42
    • Salfner, F.1    Lenk, M.2    Malek, M.3
  • 31
    • 80052777075
    • Making TSUBAME2.0, the world's greenest production supercomputer, even greener: challenges to the architects
    • IEEE Press Piscataway
    • Satoshi Matsuoka. Making TSUBAME2.0, the world's greenest production supercomputer, even greener: challenges to the architects. In International Symposium on Low Power Electronics and Design, pages 367-368. IEEE Press Piscataway, 2011.
    • (2011) International Symposium on Low Power Electronics and Design , pp. 367-368
    • Matsuoka, S.1
  • 32
    • 29144514328
    • The cosmological simulation code GADGET-2
    • Blackwell Science Ltd
    • Volker Springel. The cosmological simulation code GADGET-2. In Monthly Notices of the Royal Astronomical Society, volume 364, pages 1105-1134. Blackwell Science Ltd, 2005.
    • (2005) Monthly Notices of the Royal Astronomical Society , vol.364 , pp. 1105-1134
    • Springel, V.1
  • 33
    • 0035390088
    • A variational calculus approach to optimal checkpoint placement
    • July
    • X. Lin, Y. Ling, and J. Mi. A variational calculus approach to optimal checkpoint placement. IEEE Transactions on Computers, 50(7):699, July 2001.
    • (2001) IEEE Transactions on Computers , vol.50 , Issue.7 , pp. 699
    • Lin, X.1    Ling, Y.2    Mi, J.3
  • 34
    • 84976846528
    • A first order approximation to the optimum checkpoint interval
    • J. W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM, 17(9):530-531, 1974.
    • (1974) Commun. ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.W.1
  • 35
    • 20444463494
    • FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
    • IEEE Computer Society
    • Gengbin Zheng, Lixia Shi, and L. V. Kale. FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In Proceedings of the IEEE International Conference on Cluster Computing, pages 93-103. IEEE Computer Society, 2004.
    • (2004) Proceedings of the IEEE International Conference on Cluster Computing , pp. 93-103
    • Zheng, G.1    Shi, L.2    Kale, L.V.3
  • 36
    • 77649192707
    • A data-driven approach for predicting failure scenarios in nuclear systems
    • Enrico Zio, Francesco Di Maio, and Marco Stasi. A data-driven approach for predicting failure scenarios in nuclear systems. Annals of Nuclear Energy, 37:482-491, 2010.
    • (2010) Annals of Nuclear Energy , vol.37 , pp. 482-491
    • Zio, E.1    Di Maio, F.2    Stasi, M.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.