SCOPUS Information Retrieval Platform

Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011

Volume , Issue , 2011, Pages 840-851

Co-analysis of RAS log and job log on Blue Gene/P

(8) Zheng, Ziming (a); Yu, Li (a); Tang, Wei (a); Lan, Zhiling (a); Gupta, Rinku (b); Desai, Narayan (b); Coghlan, Susan (b); Buettner, Daniel (b)

a Illinois Institute of Technology (United States)

b Argonne National Laboratory (United States)

Author keywords

Blue Gene P; Co Analysis; Log Analysis; Reliability

Indexed keywords

BLUE GENE; CO-ANALYSIS; FAILURE CHARACTERISTICS; FAULT RESILIENCE; LOG ANALYSIS; PETASCALE; SYSTEM BEHAVIORS; SYSTEM SIZE; COBALT COMPOUNDS; DISTRIBUTED PARAMETER NETWORKS; PARALLEL PROCESSING SYSTEMS; RELIABILITY; RELIABILITY ANALYSIS; FAILURE ANALYSIS

EID: 80053278089 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/IPDPS.2011.83 Document Type: Conference Paper

Times cited : (67)

References (30)

1
- 40749160036
- Overview of the IBM Blue Gene/P project
- Blue Gene Team
- Blue Gene Team, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, 2008.
- (2008) IBM Journal of Research and Development

2
- 84870399830
- Top500 supercomputing sites http://top500.org/.
- Top500 Supercomputing Sites

3
- 77951481809
- CiFTS: A coordinated infrastructure for fault-tolerant systems
- R. Gupta, P. Beckman, B.-H. Park, E. Lusk, and P. Hargrove. CiFTS: A coordinated infrastructure for fault-tolerant systems. In Proc. of ICPP, 2009.
- Proc. of ICPP, 2009
- Gupta, R.¹ Beckman, P.² Park, B.-H.³ Lusk, E.⁴ Hargrove, P.⁵

4
- 70450055295
- An adaptive semantic filter for Blue Gene/L failure log analysis systems
- Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo. An adaptive semantic filter for Blue Gene/L failure log analysis systems. Workshop on SMTPS, 2007.
- Workshop on SMTPS, 2007
- Liang, Y.¹ Zhang, Y.² Xiong, H.³ Sahoo, R.⁴

5
- 12444257746
- Fault-aware job scheduling for Blue Gene/L systems
- A. Oliner, R. Sahoo, J. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware job scheduling for Blue Gene/L systems. In Proc. of IPDPS, 2004.
- Proc. of IPDPS, 2004
- Oliner, A.¹ Sahoo, R.² Moreira, J.³ Gupta, M.⁴ Sivasubramaniam, A.⁵

6
- 80053236042
- FTB-enabled failure prediction for Blue Gene/P systems. (research poster)
- Z. Zheng, R. Gupta, Z. Lan, and S. Coghlan. FTB-enabled failure prediction for Blue Gene/P systems. In Proc. of SuperComputing (research poster), 2009.
- (2009) Proc. of SuperComputing
- Zheng, Z.¹ Gupta, R.² Lan, Z.³ Coghlan, S.⁴

7
- 70449794134
- System log pre-processing to improve failure prediction
- Z. Zheng, Z. Lan, B. Park, and A. Geist. System log pre-processing to improve failure prediction. In Proc. of DSN, 2009.
- Proc. of DSN, 2009
- Zheng, Z.¹ Lan, Z.² Park, B.³ Geist, A.⁴

8
- 33845593340
- A large-scale study of failures in high-performance computing systems
- B. Schroeder and G. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of DSN, 2006.
- Proc. of DSN, 2006
- Schroeder, B.¹ Gibson, G.²

9
- 36049013419
- What supercomputers say: A study of five system logs
- A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proc. of DSN, 2007.
- Proc. of DSN, 2007
- Oliner, A.¹ Stearley, J.²

10
- 67349271621
- An analysis of clustered failures on large supercomputing systems
- T. Hacker, F. Romero, and C. Carothers. An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing, 69:652-665, 2009.
- (2009) Journal of Parallel and Distributed Computing , vol.69 , pp. 652-665
- Hacker, T.¹ Romero, F.² Carothers, C.³

11
- 70349657128
- Blue Gene/L log analysis and time to interrupt estimation
- N. Taerat, N. Naksinehaboon, C. Chandler, J. Elliott, C. Leangsuksun, G. Ostrouchov, S. Scott, and C. Engelmann. Blue Gene/L log analysis and time to interrupt estimation. In Proc. of ARES, 2009.
- Proc. of ARES, 2009
- Taerat, N.¹ Naksinehaboon, N.² Chandler, C.³ Elliott, J.⁴ Leangsuksun, C.⁵ Ostrouchov, G.⁶ Scott, S.⁷ Engelmann, C.⁸

12
- 27544497222
- Filtering failure logs for a Blue Gene/L prototype
- Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta. Filtering failure logs for a Blue Gene/L prototype. In Proc. of DSN, 2005.
- Proc. of DSN, 2005
- Liang, Y.¹ Zhang, Y.² Sivasubramaniam, A.³ Sahoo, R.⁴ Moreira, J.⁵ Gupta, M.⁶

13
- 84976846528
- A first order approximation to the optimal checkpoint interval
- J. Young. A first order approximation to the optimal checkpoint interval. Comm. ACM, 17(9): 530-531, 1974.
- (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.¹

14
- 72049093226
- Fault-aware utility-based job scheduling on Blue Gene/P systems
- W. Tang, Z. Lan, N. Desai, and D. Buettner. Fault-aware utility-based job scheduling on Blue Gene/P systems. In Proc. of Cluster, 2009.
- Proc. of Cluster, 2009
- Tang, W.¹ Lan, Z.² Desai, N.³ Buettner, D.⁴

15
- 83455240695
- Petascale system management experiences
- N. Desai, R. Bradshaw, C. Lueninghoener, A. Cherry, S. Coghlan, and W. Scullin. Petascale system management experiences. In Proc. of LISA, 2008.
- Proc. of LISA, 2008
- Desai, N.¹ Bradshaw, R.² Lueninghoener, C.³ Cherry, A.⁴ Coghlan, S.⁵ Scullin, W.⁶

16
- 0003740564
- London: Chapman and Hall
- M. Crowder, A. Kimber, T. Sweeting, and R. Smith. Statistical analysis of reliability data. London: Chapman and Hall, 1991.
- (1991) Statistical Analysis of Reliability Data
- Crowder, M.¹ Kimber, A.² Sweeting, T.³ Smith, R.⁴

17
- 4544382099
- Failure data analysis of a large-scale heterogeneous server environment
- R. Sahoo, A. Sivasubramaniam, M. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proc. of DSN, 2004.
- Proc. of DSN, 2004
- Sahoo, R.¹ Sivasubramaniam, A.² Squillante, M.³ Zhang, Y.⁴

18
- 12444268325
- System-level fault tolerance in large-scale parallel machines with buffered coscheduling
- F. Petrini, K. Davis, and J. Sancho. System-level fault tolerance in large-scale parallel machines with buffered coscheduling. In Proc. of IPDPS, 2004.
- Proc. of IPDPS, 2004
- Petrini, F.¹ Davis, K.² Sancho, J.³

19
- 80053252298
- Reliability-aware scalability models for high performance computing
- Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of Cluster, 2009.
- Proc. of Cluster, 2009
- Zheng, Z.¹ Lan, Z.²

20
- 77955737995
- Whitepaper
- N. DeBardeleben, J. Laros, J. Daly, S. Scott, C. Engelmann and B. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Whitepaper, 2009.
- (2009) High-end Computing Resilience: Analysis of Issues Facing the HEC Community and Path-forward for Research and Development
- DeBardeleben, N.¹ Laros, J.² Daly, J.³ Scott, S.⁴ Engelmann, C.⁵ Harrod, B.⁶

21
- 78650009816
- Impact of suboptimal checkpoint intervals on application efficiency in computational clusters
- W. Jones, J. Daly, and N. DeBardeleben. Impact of suboptimal checkpoint intervals on application efficiency in computational clusters. In Proc. of HPDC, 2010.
- Proc. of HPDC, 2010
- Jones, W.¹ Daly, J.² DeBardeleben, N.³

22
- 4544360243
- Technical Report CRHC 9808, UIUC
- J. Xu, Z. Kalbarczyk, and R. Iyer. Networked Windows NT system field failure data analysis. Technical Report CRHC 9808, UIUC, 1999.
- (1999) Networked Windows NT System Field Failure Data Analysis
- Xu, J.¹ Kalbarczyk, Z.² Iyer, R.³

23
- 67649860233
- Exploring event correlation for failure prediction in coalitions of clusters
- S. Fu and C. Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proc. of Supercomputing, 2007.
- Proc. of Supercomputing, 2007
- Fu, S.¹ Xu, C.²

24
- 85077345110
- Understanding customer problem troubleshooting from storage system logs
- W. Jiang, C. Hu, S. Pasupathy, A. Kanevsky, Z. Li, and Y. Zhou. Understanding customer problem troubleshooting from storage system logs. In Proc. of FAST, 2009.
- Proc. of FAST, 2009
- Jiang, W.¹ Hu, C.² Pasupathy, S.³ Kanevsky, A.⁴ Li, Z.⁵ Zhou, Y.⁶

25
- 72249121870
- Detecting large-scale system problems by mining console logs
- W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Detecting large-scale system problems by mining console logs. In SOSP, 2009.
- (2009) SOSP
- Xu, W.¹ Huang, L.² Fox, A.³ Patterson, D.⁴ Jordan, M.⁵

26
- 17044405923
- Toward integrating feature selection algorithms for classification and clustering
- DOI 10.1109/TKDE.2005.66
- H. Liu and L. Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. on Knowledge and Data Engineering, 17(4):491-502, 2005. (Pubitemid 40495592)
- (2005) IEEE Transactions on Knowledge and Data Engineering , vol.17 , Issue.4 , pp. 491-502
- Liu, H.¹ Yu, L.²

27
- 85084160707
- Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
- B. Schroeder and G. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proc. of FAST, 2007.
- Proc. of FAST, 2007
- Schroeder, B.¹ Gibson, G.²

28
- 0343644421
- Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
- Parallel Workloads Archive

29
- 77956588374
- USENIX Computer Failure Data Repository. http://cfdr.usenix.org/.
- USENIX Computer Failure Data Repository

30
- 77949275829
- Reliability of a System of k Nodes for High Performance Computing Applications
- N. Gottumukkala, R. Nassar, M. Paun, and C. Leangsuksun. Reliability of a System of k Nodes for High Performance Computing Applications. IEEE Trans. on Reliability, 59(1):142-169, 2010.
- (2010) IEEE Trans. on Reliability , vol.59 , Issue.1 , pp. 142-169
- Gottumukkala, N.¹ Nassar, R.² Paun, M.³ Leangsuksun, C.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.