



Volume , Issue , 2012, Pages

Fault prediction under the microscope: A closer look into HPC systems

Author keywords

fault detection; fault tolerance; large scale HPC systems; signal analysis

Indexed keywords

CLASSICAL APPROACH; COMPONENT FAILURES; EFFICIENT PREDICTIONS; FUTURE IMPROVEMENTS; HIGH PERFORMANCE COMPUTING SYSTEMS; PRECISION AND RECALL; PREDICTION METHODS; PREVENTIVE MEASURES;

EID: 84877693592     PISSN: 21674329     EISSN: 21674337     Source Type: Conference Proceeding    
DOI: 10.1109/SC.2012.57     Document Type: Conference Paper
Times cited : (98)

References (34)
  • 6
    • 77952378080
    • Critical Event Prediction for Proactive Management In Large-scale Computer Clusters
    • R. K. Sahoo et al: Critical Event Prediction for Proactive Management In Large-scale Computer Clusters. International conference on Knowledge discovery and data mining, pp 426-435, 2003
    • (2003) International Conference on Knowledge Discovery and Data Mining , pp. 426-435
    • Sahoo, R.K.1
  • 7
    • 79951644113
    • Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems
    • N. Yigitbasi et al: Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems. IEEE/ACM International Conference on Grid Computing, pp 65-72, 2010
    • (2010) IEEE/ACM International Conference on Grid Computing , pp. 65-72
    • Yigitbasi, N.1
  • 8
    • 77958132122
    • Mining Dependency in Distributed Systems through Unstructured Logs Analysis
    • January
    • J. G. Lou et al: Mining Dependency in Distributed Systems through Unstructured Logs Analysis. ACM SIGOPS, Volume 44, Issue 1, January 2010
    • (2010) ACM SIGOPS , vol.44 , Issue.1
    • Lou, J.G.1
  • 9
    • 77951145583
    • Online System Problem Detection by Mining Patterns of Console Logs
    • W. Xu et al: Online System Problem Detection by Mining Patterns of Console Logs. IEEE International Conference on Data Mining, pp 588-597, 2009
    • (2009) IEEE International Conference on Data Mining , pp. 588-597
    • Xu, W.1
  • 11
    • 55849147399
    • Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A case Study
    • J. Gu et al: Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A case Study. International Conference on Parallel Processing, pp 157-164, 2008
    • (2008) International Conference on Parallel Processing , pp. 157-164
    • Gu, J.1
  • 13
    • 84877719832
    • LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
    • abs/1003.0951
    • R. Ren et al: LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems. CoRR abs/1003.0951, 2010
    • (2010) CoRR
    • Ren, R.1
  • 16
    • 0025416073
    • Automatic recognition of intermittent failures: An experimental study of field data
    • R. Iyer et al: Automatic recognition of intermittent failures: An experimental study of field data. IEEE Transactions on Computers, 39:525-537, 1990.
    • (1990) IEEE Transactions on Computers , vol.39 , pp. 525-537
    • Iyer, R.1
  • 22
    • 0001265595
    • An Extended Table of Critical Values for the Mann-Whitney (Wilcoxon) Two-Sample Statistic
    • R. C. Milton: An Extended Table of Critical Values for the Mann-Whitney (Wilcoxon) Two-Sample Statistic. Journal of the American Statistical Association, Volume 59, Issue 3, 1978
    • (1978) Journal of the American Statistical Association , vol.59 , Issue.3
    • Milton, R.C.1
  • 23
    • 84877687209
    • Accessed on 2010
    • National Center for Supercomputing Applications at the University of Illinois. www.ncsa.illinois.edu. Accessed on 2010.
  • 26
    • 20444463494
    • FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
    • G. Zheng et al: FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI. International Conference on Cluster Computing CLUSTER, pp 93-103, 2004
    • (2004) International Conference on Cluster Computing CLUSTER , pp. 93-103
    • Zheng, G.1
  • 27
    • 85032796232
    • Rebound: Scalable Checkpointing for Coherent Shared Memory
    • R. Agarwal et al: Rebound: Scalable Checkpointing for Coherent Shared Memory. ACM SIGARCH Computer Architecture News, Volume 39 Issue 3, 2011
    • (2011) ACM SIGARCH Computer Architecture News , vol.39 , Issue.3
    • Agarwal, R.1
  • 33
    • 4444380999
    • A survey of fault localization techniques in computer networks
    • M. Steinder and A. Sethi: A survey of fault localization techniques in computer networks. Science of Computer Programming, volume 53, issue 2, 2004
    • (2004) Science of Computer Programming , vol.53 , Issue.2
    • Steinder, M.1    Sethi, A.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.