Volume 7, Issue 4, 2010, Pages 337-350

A large-scale study of failures in high-performance computing systems

Author keywords

empirical study; failures; field study; high performance computing; Large scale systems; node outages; reliability; repair time; root cause; supercomputing; time between failures

Indexed keywords

FAILURE (MECHANICAL); FAILURE ANALYSIS; LARGE SCALE SYSTEMS; RELIABILITY; WEIBULL DISTRIBUTION

EID: 78149470110     PISSN: 15455971     EISSN: None     Source Type: Journal    
DOI: 10.1109/TDSC.2009.4     Document Type: Article
Times cited: 468

References (28)
  • 1
    • 78149465809
    • The raw data and more information is available at the following two URLs: http://www.pdl.cmu.edu/FailureData/ and http://www.lanl.gov/projects/computerscience/data/, 2006.
    • (2006)
  • 4
    • 0025505070
    • A census of tandem system availability between 1985 and 1990
    • Oct
    • J. Gray, "A Census of Tandem System Availability Between 1985 and 1990", IEEE Trans. Reliability, vol. 39, no. 4, pp. 409-418, Oct. 1990.
    • (1990) IEEE Trans. Reliability , vol.39 , Issue.4 , pp. 409-418
    • Gray, J.1
  • 6
    • 84976815079
    • Measurement and modeling of computer reliability as affected by system activity
    • R. K. Iyer, D. J. Rossetti, and M. C. Hsueh, "Measurement and Modeling of Computer Reliability as Affected by System Activity", ACM Trans. Computer Systems, vol. 4, no. 3, 1986.
    • (1986) ACM Trans. Computer Systems , vol.4 , Issue.3
    • Iyer, R.K.1    Rossetti, D.J.2    Hsueh, M.C.3
  • 9
    • 0025502686
    • Error log analysis: Statistical modeling and heuristic trend analysis
    • Oct
    • T.-T. Y. Lin and D. P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis", IEEE Trans. on Reliability, vol. 39, no. 4, pp. 419-432, Oct. 1990.
    • (1990) IEEE Trans. on Reliability , vol.39 , Issue.4 , pp. 419-432
    • Lin, Y.T.-T.1    Siewiorek, D.P.2
  • 13
    • 0029368189
    • Measuring system and software reliability using an automated data collection process
    • B. Murphy and T. Gent, "Measuring System and Software Reliability Using an Automated Data Collection Process", Quality and Reliability Eng. Int'l, vol. 11, no. 5, 1995.
    • (1995) Quality and Reliability Eng. Int'l , vol.11 , Issue.5
    • Murphy, B.1    Gent, T.2
  • 24
    • 84877699694
    • A case for two-level distributed recovery schemes
    • N. H. Vaidya, "A Case For Two-Level Distributed Recovery Schemes", Proc. ACM SIGMETRICS, 1995.
    • (1995) Proc. ACM SIGMETRICS
    • Vaidya, N.H.1
  • 25
    • 0031078972
    • Self-similarity through high-variability: Statistical analysis of ethernet LAN traffic at the source level
    • W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson, "Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level", IEEE/ACM Trans. Networking, vol. 5, no. 1, pp. 71-86, 1997.
    • (1997) IEEE/ACM Trans. Networking , vol.5 , Issue.1 , pp. 71-86
    • Willinger, W.1    Taqqu, M.S.2    Sherman, R.3    Wilson, D.V.4
  • 26
    • 0030600996
    • Checkpointing in distributed computing systems
    • May
    • K. F. Wong and M. Franklin, "Checkpointing in Distributed Computing Systems", J. Parallel and Distributed Computing, vol. 35, no. 1, pp. 67-75, May 1996.
    • (1996) J. Parallel and Distributed Computing , vol.35 , Issue.1 , pp. 67-75
    • Wong, K.F.1    Franklin, M.2


* This information was extracted and analyzed by KISTI from Elsevier's SCOPUS database.