Volume , Issue , 2011, Pages 1557-1566

Predicting node failure in high performance computing systems from failure and usage logs

Author keywords

[No Author keywords available]

Indexed keywords

DECISION TREE CLASSIFIERS; FAILURE DATA; FAILURE INFORMATION; HIGH PERFORMANCE COMPUTERS; HIGH PERFORMANCE COMPUTING SYSTEMS; IDLE TIME; LOS ALAMOS NATIONAL LABORATORY; MINING CLASSIFICATION; NODE FAILURE; PREDICTION SYSTEMS; ROOT CAUSE; SEPARATE ANALYSIS; USAGE DATA;

EID: 83455262121     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/IPDPS.2011.310     Document Type: Conference Paper
Times cited : (31)

References (27)
  • 1
    • 83455261683
    • Experimental assessment of workstation failures and their impact on checkpointing systems
    • J. S. Plank and W. R. Elwasif. Experimental assessment of workstation failures and their impact on checkpointing systems. In Proceedings of FTCS-98.
    • Proceedings of FTCS-98
    • Plank, J.S.1    Elwasif, W.R.2
  • 3
    • 83455247703
    • Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems
    • N. Nakka, A. Choudhary, "Failure data-driven selective node-level duplication to improve MTTF in High Performance Computing Systems", In Proceedings of HPCS 2009, June 2009.
    • (2009) Proceedings of HPCS 2009
    • Nakka, N.1    Choudhary, A.2
  • 6
    • 45749113088
    • Modeling machine availability in enterprise and wide-area distributed computing environments
    • D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In Euro-Par'05, 2005.
    • (2005) Euro-Par'05
    • Nurmi, D.1    Brevik, J.2    Wolski, R.3
  • 9
    • 84958782417
    • Networked Windows NT system field failure data analysis
    • J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field failure data analysis. In Proc. of the PRDC, 1999.
    • (1999) Proc. of the PRDC
    • Xu, J.1    Kalbarczyk, Z.2    Iyer, R.K.3
  • 10
    • 33845593340
    • A large-scale study of failures in high-performance-computing systems
    • B. Schroeder and G. Gibson. A large-scale study of failures in high-performance-computing systems. In Proceedings of the DSN, June 2006.
    • (2006) Proceedings of the DSN
    • Schroeder, B.1    Gibson, G.2
  • 11
    • 84976815079
    • Measurement and modeling of computer reliability as affected by system activity
    • R. K. Iyer, D. J. Rossetti, and M. C. Hsueh. Measurement and modeling of computer reliability as affected by system activity. ACM Transactions on Computer Systems, Vol. 4, No. 3, 1986.
    • (1986) ACM Transactions on Computer Systems , vol.4 , Issue.3
    • Iyer, R.K.1    Rossetti, D.J.2    Hsueh, M.C.3
  • 13
    • 36049013419
    • What supercomputers say: A study of five system logs
    • Adam J. Oliner, Jon Stearley: What Supercomputers Say: A Study of Five System Logs. In Proceedings of the DSN, Edinburgh, UK, June 2007, pp. 575-584.
    • (2007) Proceedings of the DSN, Edinburgh , pp. 575-584
    • Oliner, A.J.1    Stearley, J.2
  • 15
    • 47249123819
    • Exploring meta-learning to improve failure prediction in supercomputing clusters
    • P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters", In Proceedings of ICPP, 2007.
    • (2007) Proceedings of ICPP
    • Gujrati, P.1    Li, Y.2    Lan, Z.3    Thakur, R.4    White, J.5
  • 16
    • 79952168926
    • Using adaptive fault tolerance to improve application robustness on the TeraGrid
    • Y. Li and Z. Lan, "Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid", In Proceedings of TeraGrid'07, 2007.
    • (2007) Proceedings of TeraGrid'07
    • Li, Y.1    Lan, Z.2
  • 17
    • 57049111494
    • Adaptive fault management of parallel applications for high performance computing
    • Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing", IEEE Transactions on Computers, Vol. 57, No. 12, pp. 1647-1660, 2008.
    • (2008) IEEE Transactions on Computers , vol.57 , Issue.12 , pp. 1647-1660
    • Lan, Z.1    Li, Y.2
  • 20
    • 34249832377
    • A Bayesian method for the induction of probabilistic networks from data
    • G. Cooper, E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 9(4):309-347.
    • (1992) Machine Learning , vol.9 , Issue.4 , pp. 309-347
    • Cooper, G.1    Herskovits, E.2
  • 24
    • 0035478854
    • Random forests
    • Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
    • (2001) Machine Learning , vol.45 , Issue.1 , pp. 5-32
    • Breiman, L.1
  • 27
    • 0000521473
    • Ridge estimators in logistic regression
    • le Cessie, S., van Houwelingen, J.C. (1992). Ridge Estimators in Logistic Regression. Applied Statistics. 41(1):191-201.
    • (1992) Applied Statistics , vol.41 , Issue.1 , pp. 191-201
    • Le Cessie, S.1    Van Houwelingen, J.C.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.