SCOPUS Information Retrieval Platform

IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

2011, Pages 1557-1566

Predicting node failure in high performance computing systems from failure and usage logs

Authors (3): Nakka, Nithin (a); Agrawal, Ankit (b); Choudhary, Alok (b)

(a) University of Illinois at Urbana-Champaign (United States)

(b) Northwestern University (United States)

Author keywords

[No Author keywords available]

Indexed keywords

DECISION TREE CLASSIFIERS; FAILURE DATA; FAILURE INFORMATION; HIGH PERFORMANCE COMPUTERS; HIGH PERFORMANCE COMPUTING SYSTEMS; IDLE TIME; LOS ALAMOS NATIONAL LABORATORY; MINING CLASSIFICATION; NODE FAILURE; PREDICTION SYSTEMS; ROOT CAUSE; SEPARATE ANALYSIS; USAGE DATA;

COMPUTER SOFTWARE SELECTION AND EVALUATION; DATA MINING; DECISION TREES; DISTRIBUTED PARAMETER NETWORKS; FORECASTING; PARALLEL PROCESSING SYSTEMS; TREES (MATHEMATICS);

DISTRIBUTED COMPUTER SYSTEMS;

EID: 83455262121
PISSN: None
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1109/IPDPS.2011.310
Document Type: Conference Paper

Times cited: 31

References (27)

1
- 83455261683
- Experimental assessment of workstation failures and their impact on checkpointing systems
- J. S. Plank and W. R. Elwasif. Experimental assessment of workstation failures and their impact on checkpointing systems. In Proceedings of FTCS-98.
- Proceedings of FTCS-98
- Plank, J.S.¹ Elwasif, W.R.²

2
- 59249090746
- Subtleties in tolerating correlated failures
- S. Nath, H. Yu, P. B. Gibbons, and S. Seshan. Subtleties in tolerating correlated failures. In Proceedings of NSDI'06, 2006.
- (2006) Proceedings of NSDI'06
- Nath, S.¹ Yu, H.² Gibbons, P.B.³ Seshan, S.⁴

3
- 83455247703
- Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems
- N. Nakka, A. Choudhary, "Failure data-driven selective node-level duplication to improve MTTF in High Performance Computing Systems", In Proceedings of HPCS 2009, June 2009.
- (2009) Proceedings of HPCS 2009
- Nakka, N.¹ Choudhary, A.²

4
- 4544255683
- Improving cluster availability using workstation validation
- T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. In Proceedings of ACM SIGMETRICS, 2002.
- (2002) Proceedings of ACM SIGMETRICS
- Heath, T.¹ Martin, R.P.² Nguyen, T.D.³

5
- 0029204130
- A longitudinal survey of internet host reliability
- th SRDS, 1995.
- (1995) th SRDS
- Long, D.¹ Muir, A.² Golding, R.³

6
- 45749113088
- Modeling machine availability in enterprise and wide-area distributed computing environments
- D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In Euro-Par'05, 2005.
- (2005) Euro-par'05
- Nurmi, D.¹ Brevik, J.² Wolski, R.³

7
- 4544382099
- Failure data analysis of a large-scale heterogeneous server environment
- R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of DSN, June 2004.
- (2004) Proceedings of DSN
- Sahoo, R.K.¹ Sivasubramaniam, A.² Squillante, M.S.³ Zhang, Y.⁴

8
- 0025693296
- Failure analysis and modelling of a VAX cluster system
- D. Tang, R. K. Iyer, and S. S. Subramani. Failure analysis and modelling of a VAX cluster system. In Fault Tolerant Computing Systems, 1990.
- (1990) Fault Tolerant Computing Systems
- Tang, D.¹ Iyer, R.K.² Subramani, S.S.³

9
- 84958782417
- Networked Windows NT system field failure data analysis
- J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field failure data analysis. In Proc. of the PRDC, 1999.
- (1999) Proc. of the PRDC
- Xu, J.¹ Kalbarczyk, Z.² Iyer, R.K.³

10
- 33845593340
- A large-scale study of failures in high-performance-computing systems
- B. Schroeder and G. Gibson. A large-scale study of failures in high-performance-computing systems. In Proceedings of the DSN, June 2006.
- (2006) Proceedings of the DSN
- Schroeder, B.¹ Gibson, G.²

11
- 84976815079
- Measurement and modeling of computer reliability as affected by system activity
- R. K. Iyer, D. J. Rossetti, and M. C. Hsueh. Measurement and modeling of computer reliability as affected by system activity. ACM Transactions on Computer Systems, Vol. 4, No. 3, 1986.
- (1986) ACM Transactions on Computer Systems , vol.4 , Issue.3
- Iyer, R.K.¹ Rossetti, D.J.² Hsueh, M.C.³

12
- 0019661017
- Workload, performance, and reliability of digital computing systems
- th FTCS, 1981.
- (1981) th FTCS
- Castillo, X.¹ Siewiorek, D.²

13
- 36049013419
- What supercomputers say: A study of five system logs
- Adam J. Oliner, Jon Stearley: What Supercomputers Say: A Study of Five System Logs. In Proceedings of the DSN, Edinburgh, UK, June 2007, pp. 575-584.
- (2007) Proceedings of the DSN, Edinburgh , pp. 575-584
- Oliner, A.J.¹ Stearley, J.²

14
- 55849103487
- A fault diagnosis and prognosis service for TeraGrid clusters
- Z. Lan, Y. Li, P. Gujrati, Z. Zheng, R. Thakur, and J. White, "A Fault Diagnosis and Prognosis Service for TeraGrid Clusters", In Proceedings of TeraGrid'07, 2007.
- (2007) Proceedings of TeraGrid'07
- Lan, Z.¹ Li, Y.² Gujrati, P.³ Zheng, Z.⁴ Thakur, R.⁵ White, J.⁶

15
- 47249123819
- Exploring meta-learning to improve failure prediction in supercomputing clusters
- P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters", In Proceedings of ICPP, 2007.
- (2007) Proceedings of ICPP
- Gujrati, P.¹ Li, Y.² Lan, Z.³ Thakur, R.⁴ White, J.⁵

16
- 79952168926
- Using adaptive fault tolerance to improve application robustness on the TeraGrid
- Y. Li and Z. Lan, "Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid", In Proceedings of TeraGrid'07, 2007.
- (2007) Proceedings of TeraGrid'07
- Li, Y.¹ Lan, Z.²

17
- 57049111494
- Adaptive fault management of parallel applications for high performance computing
- Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing", IEEE Transactions on Computers, Vol. 57, No. 12, pp. 1647-1660, 2008.
- (2008) IEEE Transactions on Computers , vol.57 , Issue.12 , pp. 1647-1660
- Lan, Z.¹ Li, Y.²

18
- 12444257746
- Fault-aware job scheduling for BlueGene/L systems
- A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware job scheduling for BlueGene/L systems. In Proceedings of the 18th IPDPS, 2004.
- (2004) Proceedings of the 18th IPDPS
- Oliner, A.J.¹ Sahoo, R.K.² Moreira, J.E.³ Gupta, M.⁴ Sivasubramaniam, A.⁵

19
- 84948977233
- The power of decision tables
- Ron Kohavi: The Power of Decision Tables. In: 8th European Conference on Machine Learning, 174-189, 1995.
- (1995) 8th European Conference on Machine Learning , pp. 174-189
- Kohavi, R.¹

20
- 34249832377
- A Bayesian method for the induction of probabilistic networks from data
- G. Cooper, E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 9(4):309-347.
- (1992) Machine Learning , vol.9 , Issue.4 , pp. 309-347
- Cooper, G.¹ Herskovits, E.²

21
- 0000468432
- Estimating continuous distributions in Bayesian classifiers
- George H. John, Pat Langley: Estimating Continuous Distributions in Bayesian Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, 338-345, 1995.
- (1995) Eleventh Conference on Uncertainty in Artificial Intelligence , pp. 338-345
- John, G.H.¹ Langley, P.²

22
- 0003500248
- Morgan Kaufmann Publishers, San Mateo, CA
- Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
- (1993) C4.5: Programs for Machine Learning
- Quinlan, R.¹

23
- 0006452367
- The alternating decision tree learning algorithm
- Freund, Y., Mason, L.: The alternating decision tree learning algorithm. In: Proceedings of the Sixteenth International Conference on Machine Learning, Bled, Slovenia, 124-133, 1999.
- (1999) Proceedings of the Sixteenth International Conference on Machine Learning , pp. 124-133
- Freund, Y.¹ Mason, L.²

24
- 0035478854
- Random forests
- Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
- (2001) Machine Learning , vol.45 , Issue.1 , pp. 5-32
- Breiman, L.¹

25
- 0003957032
- Morgan Kaufmann Pub
- I. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Pub, 2005.
- (2005) Data Mining: Practical Machine Learning Tools and Techniques
- Witten, I.¹ Frank, E.²

26
- 0003508724
- John Wiley and Sons, Inc.
- David Hosmer and Stanley Lemeshow. 1989. Applied Logistic Regression. John Wiley and Sons, Inc.
- (1989) Applied Logistic Regression
- Hosmer, D.¹ Lemeshow, S.²

27
- 0000521473
- Ridge estimators in logistic regression
- le Cessie, S., van Houwelingen, J.C. (1992). Ridge Estimators in Logistic Regression. Applied Statistics. 41(1):191-201.
- (1992) Applied Statistics , vol.41 , Issue.1 , pp. 191-201
- Le Cessie, S.¹ Van Houwelingen, J.C.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.