SCOPUS Information Retrieval Platform

Proceedings - IEEE 9th International Conference on Dependable, Autonomic and Secure Computing, DASC 2011

Volume , Issue , 2011, Pages 15-22

Establishing hypothesis for recurrent system failures from cluster log files

(8) Chuah, Edward (a); Lee, Gary (a); Tjhi, William Chandra (a); Kuo, Shyh Hao (a); Hung, Terence (a); Hammond, John (b); Minyard, Tommy (b); Browne, James C (c)

a INSTITUTE OF HIGH PERFORMANCE COMPUTING (Singapore)

b Texas Advanced Computing Center (United States)

c UNIVERSITY OF TEXAS AT AUSTIN (United States)

Author keywords

Failure diagnosis; Hypothesis testing; Large cluster systems; Reliability; Syslogs

Indexed keywords

CAUSAL RELATIONSHIPS; CORRELATION ANALYSIS; DIAGNOSE SYSTEM; EVENT SEQUENCE; FAILURE DIAGNOSIS; FAILURE DIAGNOSTICS; FILE SYSTEMS; HIGH CONFIDENCE; HYPOTHESIS TESTING; LARGE CLUSTERS; LOG ANALYSIS; LOG FILE; OPEN SOURCE SOFTWARE; SECOND GENERATION; SYSLOGS; SYSTEM FAILURES; SYSTEMS ADMINISTRATOR; UNIVERSITY OF TEXAS;

CLUSTER COMPUTING; COMPUTER OPERATING SYSTEMS; EMBEDDED SOFTWARE; FAILURE ANALYSIS; FILE ORGANIZATION; RELIABILITY; SUPERCOMPUTERS; SYSTEMS ENGINEERING;

OPEN SYSTEMS;

EID: 84856109383 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/DASC.2011.27 Document Type: Conference Paper

Times cited : (9)

References (30)

1
- 0025502686
- Error log analysis: Statistical modeling and heuristic trend analysis
- T.-T. Y. Lin and D. P. Siewiorek, "Error log analysis: Statistical modeling and heuristic trend analysis," IEEE Transactions on Reliability, vol. 39, no. 4, 1990.
- (1990) IEEE Transactions on Reliability , vol.39 , Issue.4
- Lin, T.-T.Y.¹ Siewiorek, D.P.²

2
- 0020500772
- Trend analysis on system error files
- M. M. Tsao and D. P. Siewiorek, "Trend analysis on system error files," in Proceedings of FTCS '83, 1983.
- (1983) Proceedings of FTCS '83
- Tsao, M.M.¹ Siewiorek, D.P.²

3
- 33847328785
- Availability assessment of sunos/solaris unix systems based on syslogd and wtmpx log files: A case study
- C. Simache and M. Kaaniche, "Availability assessment of sunos/solaris unix systems based on syslogd and wtmpx log files: A case study," in Proceedings of IEEE PRDC, Dec 2005.
- Proceedings of IEEE PRDC, Dec 2005
- Simache, C.¹ Kaaniche, M.²

4
- 20444471122
- Towards informatic analysis of syslogs
- J. Stearley, "Towards informatic analysis of syslogs," in Proceedings of IEEE Cluster Computing, 2004, pp. 309-318.
- Proceedings of IEEE Cluster Computing, 2004 , pp. 309-318
- Stearley, J.¹

5
- 33845593340
- A large-scale study of failures in high-performance computing systems
- B. Schroeder and G. Gibson, "A large-scale study of failures in high-performance computing systems," in Proceedings of IEEE/IFIP DSN, 2006, pp. 249-258.
- Proceedings of IEEE/IFIP DSN, 2006 , pp. 249-258
- Schroeder, B.¹ Gibson, G.²

6
- 36049013419
- What supercomputers say: A study of five system logs
- A. Oliner and J. Stearley, "What supercomputers say: A study of five system logs," in Proceedings of IEEE/IFIP DSN, June 2007.
- Proceedings of IEEE/IFIP DSN, June 2007
- Oliner, A.¹ Stearley, J.²

7
- 67349271621
- An analysis of clustered failures on large supercomputing systems
- T. J. Hacker, F. Romero, and C. D. Carothers, "An analysis of clustered failures on large supercomputing systems," Journal of Parallel and Distributed Computing, vol. 69, no. 7, 2009.
- (2009) Journal of Parallel and Distributed Computing , vol.69 , Issue.7
- Hacker, T.J.¹ Romero, F.² Carothers, C.D.³

8
- 85092792131
- Analyzing system logs: A new view of what's important
- S. Sabato, E. Yom-Tov, A. Tsherniak, and S. Rosset, "Analyzing system logs: A new view of what's important," in 2nd USENIX workshop on Tackling Computer Systems Problems with Machine Learning Techniques, 2007.
- 2nd USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, 2007
- Sabato, S.¹ Yom-Tov, E.² Tsherniak, A.³ Rosset, S.⁴

9
- 84856079819
- One graph is worth a thousand logs: Uncovering hidden structures in massive system event logs
- M. Aharon, G. Barash, I. Cohen, and E. Mordechai, "One graph is worth a thousand logs: Uncovering hidden structures in massive system event logs," in Proceedings of ECML PKDD, 2009.
- Proceedings of ECML PKDD, 2009
- Aharon, M.¹ Barash, G.² Cohen, I.³ Mordechai, E.⁴

10
- 50649105078
- Bad words: Finding faults in spirit's syslogs
- J. Stearley and A. J. Oliner, "Bad words: Finding faults in spirit's syslogs," in Proceedings of IEEE CCGRID, 2008.
- Proceedings of IEEE CCGRID, 2008
- Stearley, J.¹ Oliner, A.J.²

11
- 75449097851
- Toward automated anomaly identification in large-scale systems
- Z. Lan, Z. Zheng, and Y. Li, "Toward automated anomaly identification in large-scale systems," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 2, 2010.
- (2010) IEEE Transactions on Parallel and Distributed Systems , vol.21 , Issue.2
- Lan, Z.¹ Zheng, Z.² Li, Y.³

12
- 4243934975
- PhD Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University
- M. M. Tsao, "Trend analysis and fault prediction," PhD Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, 1983.
- (1983) Trend Analysis and Fault Prediction
- Tsao, M.M.¹

13
- 33845589803
- Bluegene/l failure analysis and prediction models
- Y. Liang, Y. Zhang, M. Jette, A. Sivasubramaniam, and R. Sahoo, "Bluegene/l failure analysis and prediction models," in Proceedings of IEEE/IFIP DSN, 2006.
- Proceedings of IEEE/IFIP DSN, 2006
- Liang, Y.¹ Zhang, Y.² Jette, M.³ Sivasubramaniam, A.⁴ Sahoo, R.⁵

14
- 49749107565
- Failure prediction in ibm bluegene/l event logs
- Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo, "Failure prediction in ibm bluegene/l event logs," in Proceedings of IEEE ICDM, 2007.
- Proceedings of IEEE ICDM, 2007
- Liang, Y.¹ Zhang, Y.² Xiong, H.³ Sahoo, R.⁴

15
- 56749178938
- Exploring event correlation for failure prediction in coalitions of clusters
- S. Fu and C.-Z. Xu, "Exploring event correlation for failure prediction in coalitions of clusters," in Proceedings of ACM/IEEE Supercomputing, no. 41, 2007.
- (2007) Proceedings of ACM/IEEE Supercomputing , Issue.41
- Fu, S.¹ Xu, C.-Z.²

16
- 84856108300
- A practical failure prediction with location and lead time for blue gene/p
- Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, "A practical failure prediction with location and lead time for blue gene/p," in 1st Workshop on Fault-Tolerance for HPC at Extreme Scale (in conjunction with IEEE/IFIP DSN 2010), 2010.
- 1st Workshop on Fault-Tolerance for HPC at Extreme Scale (In Conjunction with IEEE/IFIP DSN 2010), 2010
- Zheng, Z.¹ Lan, Z.² Gupta, R.³ Coghlan, S.⁴ Beckman, P.⁵

17
- 77951205449
- A study of dynamic meta-learning for failure prediction in large-scale systems
- Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan, "A study of dynamic meta-learning for failure prediction in large-scale systems," Journal of Parallel and Distributed Computing (JPDC), vol. 70, no. 6, 2010.
- (2010) Journal of Parallel and Distributed Computing (JPDC) , vol.70 , Issue.6
- Lan, Z.¹ Gu, J.² Zheng, Z.³ Thakur, R.⁴ Coghlan, S.⁵

18
- 70449794134
- System log pre-processing to improve failure prediction
- Z. Zheng, Z. Lan, B. Park, and A. Geist, "System log pre-processing to improve failure prediction," in Proceedings of IEEE/IFIP DSN, 2009.
- Proceedings of IEEE/IFIP DSN, 2009
- Zheng, Z.¹ Lan, Z.² Park, B.³ Geist, A.⁴

19
- 84856115820
- A fault diagnosis and prognosis service for teragrid clusters
- Z. Lan, P. Gujrati, Y. Li, Z. Zheng, R. Thakur, and J. White, "A fault diagnosis and prognosis service for teragrid clusters," in Proceedings of ACM TeraGrid, 2007.
- Proceedings of ACM TeraGrid, 2007
- Lan, Z.¹ Gujrati, P.² Li, Y.³ Zheng, Z.⁴ Thakur, R.⁵ White, J.⁶

20
- 77956291503
- End-to-end framework for fault management for open source clusters: Ranger
- J. L. Hammond, T. Minyard, and J. Browne, "End-to-end framework for fault management for open source clusters: Ranger," in Proceedings of ACM TeraGrid, no. 9, 2010.
- (2010) Proceedings of ACM TeraGrid , Issue.9
- Hammond, J.L.¹ Minyard, T.² Browne, J.³

21
- 79952790201
- Diagnosing the root-causes of failures from cluster log files
- E. Chuah, S.-H. Kuo, P. Hiew, W.-C. Tjhi, G. Lee, J. Hammond, M. T. Michalewicz, T. Hung, and J. C. Browne, "Diagnosing the root-causes of failures from cluster log files," in Proceedings of IEEE HiPC, Dec 19-22 2010.
- Proceedings of IEEE HiPC, Dec 19-22 2010
- Chuah, E.¹ Kuo, S.-H.² Hiew, P.³ Tjhi, W.-C.⁴ Lee, G.⁵ Hammond, J.⁶ Michalewicz, M.T.⁷ Hung, T.⁸ Browne, J.C.⁹

22
- 0022906522
- Recognition of error symptoms in large systems
- R. K. Iyer, L. T. Young, and V. Sridhar, "Recognition of error symptoms in large systems," in 1986 ACM Fall joint computer conference, 1986.
- (1986) 1986 ACM Fall Joint Computer Conference
- Iyer, R.K.¹ Young, L.T.² Sridhar, V.³

23
- 80052167311
- Models for time coalescence in event logs
- J. P. Hansen and D. P. Siewiorek, "Models for time coalescence in event logs," in Proceedings of FTCS '92, 1992.
- (1992) Proceedings of FTCS '92
- Hansen, J.P.¹ Siewiorek, D.P.²

24
- 49049104267
- Automated system monitoring and notification with swatch
- S. E. Hansen and E. T. Atkins, "Automated system monitoring and notification with swatch," in USENIX LISA, 1993.
- (1993) USENIX LISA
- Hansen, S.E.¹ Atkins, E.T.²

25
- 84860042842
- Listening to your cluster with logs
- J. E. Prewett, "Listening to your cluster with logs," in 5th LCI International Conference on Linux Clusters, 2004.
- 5th LCI International Conference on Linux Clusters, 2004
- Prewett, J.E.¹

26
- 26844519400
- Path-based failure and evolution management
- M. Y. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer, "Path-based failure and evolution management," in Proceedings of NSDI, 2004.
- Proceedings of NSDI, 2004
- Chen, M.Y.¹ Accardi, A.² Kiciman, E.³ Lloyd, J.⁴ Patterson, D.⁵ Fox, A.⁶ Brewer, E.⁷

27
- 0036930823
- Pinpoint: Problem determination in large, dynamic internet services
- M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: Problem determination in large, dynamic internet services," in Proceedings of IEEE/IFIP DSN, 2002.
- Proceedings of IEEE/IFIP DSN, 2002
- Chen, M.Y.¹ Kiciman, E.² Fratkin, E.³ Fox, A.⁴ Brewer, E.⁵

28
- 77957761115
- Problem diagnosis for mapreduce-based cloud computing environments
- J. Tan, X. Pan, E. Marinelli, S. Kavulya, R. Gandhi, and P. Narasimhan, "Problem diagnosis for mapreduce-based cloud computing environments," in Proceedings of IEEE/IFIP NOMS, 2010.
- Proceedings of IEEE/IFIP NOMS, 2010
- Tan, J.¹ Pan, X.² Marinelli, E.³ Kavulya, S.⁴ Gandhi, R.⁵ Narasimhan, P.⁶

29
- 77955941295
- Visual, log-based causal tracing for performance debugging of mapreduce systems
- J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan, "Visual, log-based causal tracing for performance debugging of mapreduce systems," in Proceedings of IEEE ICDCS, 2010.
- Proceedings of IEEE ICDCS, 2010
- Tan, J.¹ Kavulya, S.² Gandhi, R.³ Narasimhan, P.⁴

30
- 77956573133
- Using correlated surprise to infer shared influence
- A. J. Oliner, A. V. Kulkarni, and A. Aiken, "Using correlated surprise to infer shared influence," in Proceedings of IEEE/IFIP DSN, 2010.
- Proceedings of IEEE/IFIP DSN, 2010
- Oliner, A.J.¹ Kulkarni, A.V.² Aiken, A.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.