



Volume , Issue , 2011, Pages 840-851

Co-analysis of RAS log and job log on Blue Gene/P

Author keywords

Blue Gene P; Co Analysis; Log Analysis; Reliability

Indexed keywords

BLUE GENE; CO-ANALYSIS; FAILURE CHARACTERISTICS; FAULT RESILIENCE; LOG ANALYSIS; PETASCALE; SYSTEM BEHAVIORS; SYSTEM SIZE;

EID: 80053278089     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/IPDPS.2011.83     Document Type: Conference Paper
Times cited: 67

References (30)
  • 1
    • 40749160036
    • Overview of the IBM Blue Gene/P project
    • Blue Gene Team
    • Blue Gene Team, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, 2008.
    • (2008) IBM Journal of Research and Development
  • 6
    • 80053236042
    • FTB-enabled failure prediction for Blue Gene/P systems. (research poster)
    • Z. Zheng, R. Gupta, Z. Lan, and S. Coghlan. FTB-enabled failure prediction for Blue Gene/P systems. In Proc. of SuperComputing (research poster), 2009.
    • (2009) Proc. of SuperComputing
    • Zheng, Z.1    Gupta, R.2    Lan, Z.3    Coghlan, S.4
  • 8
    • 33845593340
    • A large-scale study of failures in high-performance computing systems
    • B. Schroeder and G. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of DSN, 2006.
    • Proc. of DSN, 2006
    • Schroeder, B.1    Gibson, G.2
  • 9
    • 36049013419
    • What supercomputers say: A study of five system logs
    • A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proc. of DSN, 2007.
    • Proc. of DSN, 2007
    • Oliner, A.1    Stearley, J.2
  • 13
    • 84976846528
    • A first order approximation to the optimal checkpoint interval
    • J. Young. A first order approximation to the optimal checkpoint interval. Comm. ACM, 17(9): 530-531, 1974.
    • (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.1
  • 18
    • 12444268325
    • System-level fault tolerance in large-scale parallel machines with buffered coscheduling
    • F. Petrini, K. Davis, and J. Sancho. System-level fault tolerance in large-scale parallel machines with buffered coscheduling. In Proc. of IPDPS, 2004.
    • Proc. of IPDPS, 2004
    • Petrini, F.1    Davis, K.2    Sancho, J.3
  • 19
    • 80053252298
    • Reliability-aware scalability models for high performance computing
    • Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of Cluster, 2009.
    • Proc. of Cluster, 2009
    • Zheng, Z.1    Lan, Z.2
  • 21
    • 78650009816
    • Impact of suboptimal checkpoint intervals on application efficiency in computational clusters
    • W. Jones, J. Daly, and N. DeBardeleben. Impact of suboptimal checkpoint intervals on application efficiency in computational clusters. In Proc. of HPDC, 2010.
    • Proc. of HPDC, 2010
    • Jones, W.1    Daly, J.2    DeBardeleben, N.3
  • 23
    • 67649860233
    • Exploring event correlation for failure prediction in coalitions of clusters
    • S. Fu and C. Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proc. of Supercomputing, 2007.
    • Proc. of Supercomputing, 2007
    • Fu, S.1    Xu, C.2
  • 25
    • 72249121870
    • Detecting large-scale system problems by mining console logs
    • W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Detecting large-scale system problems by mining console logs. In SOSP, 2009.
    • (2009) SOSP
    • Xu, W.1    Huang, L.2    Fox, A.3    Patterson, D.4    Jordan, M.5
  • 26
    • 17044405923
    • Toward integrating feature selection algorithms for classification and clustering
    • DOI 10.1109/TKDE.2005.66
    • H. Liu and L. Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. on Knowledge and Data Engineering, 17(4):491-502, 2005. (Pubitemid 40495592)
    • (2005) IEEE Transactions on Knowledge and Data Engineering , vol.17 , Issue.4 , pp. 491-502
    • Liu, H.1    Yu, L.2
  • 27
    • 85084160707
    • Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
    • B. Schroeder and G. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proc. of FAST, 2007.
    • Proc. of FAST, 2007
    • Schroeder, B.1    Gibson, G.2
  • 30
    • 77949275829
    • Reliability of a System of k Nodes for High Performance Computing Applications
    • N. Gottumukkala, R. Nassar, M. Paun, and C. Leangsuksun. Reliability of a System of k Nodes for High Performance Computing Applications. IEEE Trans. on Reliability, 59(1):142-169, 2010.
    • (2010) IEEE Trans. on Reliability , vol.59 , Issue.1 , pp. 142-169
    • Gottumukkala, N.1    Nassar, R.2    Paun, M.3    Leangsuksun, C.4


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.