SCOPUS Information Retrieval Platform

Proceedings of the 2011 6th International Conference on Availability, Reliability and Security, ARES 2011

Volume , Issue , 2011, Pages 83-90

Proactive failure management by integrated unsupervised and semi-supervised learning for dependable cloud systems

(3) Guan, Qiang (a); Zhang, Ziming (a); Fu, Song (a)

(a) UNIVERSITY OF NORTH TEXAS (United States)

Author keywords

Bayesian detector; Cloud systems; Decision tree; Dependable systems; Learning algorithms

Indexed keywords

ANOMALOUS BEHAVIOR; BAYESIAN DETECTORS; BAYESIAN MODEL; CLOUD SYSTEMS; COMPUTING SYSTEM; DECISION TREE CLASSIFIERS; DEPENDABLE SYSTEMS; EXECUTION ENVIRONMENTS; FAILURE DETECTION METHOD; FAILURE DYNAMICS; FAILURE MANAGEMENT; FAILURE PREDICTION; LABELED DATA; SEMI-SUPERVISED LEARNING; SYSTEM ADMINISTRATORS; SYSTEM COMPONENTS;

BAYESIAN NETWORKS; COMPUTER SYSTEMS; DECISION TREES; FORECASTING; LEARNING ALGORITHMS; PLANT EXTRACTS; SUPERVISED LEARNING;

CLOUD COMPUTING;

EID: 80455144683 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/ARES.2011.20 Document Type: Conference Paper

Times cited : (39)

References (40)

1
- 80455150839
- sysstat. Available at: http://sebastien.godard.pagesperso-orange.fr/.

2
- 40849089513
- Model-based performance evaluation of distributed checkpointing protocols
- DOI 10.1016/j.peva.2007.09.001, PII S0166531607001009
- A. Agbaria and R. Friedman. Model-based performance evaluation of distributed checkpointing protocols. Performance Evaluation, 65(5):345-365, 2008. (Pubitemid 351400683)
- (2008) Performance Evaluation , vol.65 , Issue.5 , pp. 345-365
- Agbaria, A.¹ Friedman, R.²

3
- 0003802343
- Wadsworth and Brooks
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
- (1984) Classification and Regression Trees
- Breiman, L.¹ Friedman, J.² Olshen, R.³ Stone, C.⁴

4
- 74049111423
- Compiler-enhanced incremental checkpointing for openmp applications
- G. Bronevetsky, D. J. Marques, K. K. Pingali, R. Rugina, and S. A. McKee. Compiler-enhanced incremental checkpointing for openmp applications. In Proc. of ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
- (2008) Proc. of ACM Symposium on Principles and Practice of Parallel Programming (PPoPP)
- Bronevetsky, G.¹ Marques, D.J.² Pingali, K.K.³ Rugina, R.⁴ McKee, S.A.⁵

5
- 34548042452
- Proactive fault tolerance in MPI applications via task migration
- S. Chakravorty, C. Mendes, and L. Kale. Proactive fault tolerance in MPI applications via task migration. In Proc. of IEEE International Conference on High Performance Computing, 2006.
- (2006) Proc. of IEEE International Conference on High Performance Computing
- Chakravorty, S.¹ Mendes, C.² Kale, L.³

6
- 84889281816
- John Wiley & Sons
- T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
- (1991) Elements of Information Theory
- Cover, T.¹ Thomas, J.²

7
- 0003922190
- Wiley-Interscience
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2001.
- (2001) Pattern Classification
- Duda, R.O.¹ Hart, P.E.² Stork, D.G.³

8
- 0042078549
- A survey of rollback-recovery protocols in message-passing systems
- E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375-408, 2002.
- (2002) ACM Computing Surveys , vol.34 , Issue.3 , pp. 375-408
- Elnozahy, E.N.M.¹ Alvisi, L.² Wang, Y.-M.³ Johnson, D.B.⁴

9
- 70349735985
- Failure-aware construction and reconfiguration of distributed virtual machines for high availability computing
- S. Fu. Failure-aware construction and reconfiguration of distributed virtual machines for high availability computing. In Proc. of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2009.
- (2009) Proc. of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid)
- Fu, S.¹

10
- 76849100508
- Failure-aware resource management for high-availability computing clusters with distributed virtual machines
- S. Fu. Failure-aware resource management for high-availability computing clusters with distributed virtual machines. Journal of Parallel and Distributed Computing, 70(4):384-393, 2010.
- (2010) Journal of Parallel and Distributed Computing , vol.70 , Issue.4 , pp. 384-393
- Fu, S.¹

11
- 56749178938
- Exploring event correlation for failure prediction in coalitions of clusters
- S. Fu and C. Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proceedings of ACM/IEEE Supercomputing Conference (SC), 2007.
- (2007) Proceedings of ACM/IEEE Supercomputing Conference (SC)
- Fu, S.¹ Xu, C.²

12
- 47249124464
- Quantifying temporal and spatial correlation of failure events for proactive management
- S. Fu and C. Xu. Quantifying temporal and spatial correlation of failure events for proactive management. In Proc. of IEEE International Symposium on Reliable Distributed Systems (SRDS), 2007.
- (2007) Proc. of IEEE International Symposium on Reliable Distributed Systems (SRDS)
- Fu, S.¹ Xu, C.²

13
- 70349679287
- Proactive resource management for failure resilient high performance computing clusters
- S. Fu and C. Xu. Proactive resource management for failure resilient high performance computing clusters. In Proceedings of IEEE International Conference on Availability, Reliability and Security (ARES), 2009.
- (2009) Proceedings of IEEE International Conference on Availability, Reliability and Security (ARES)
- Fu, S.¹ Xu, C.²

14
- 77956227790
- Quantifying event correlations for proactive failure management in networked computing systems
- S. Fu and C. Xu. Quantifying event correlations for proactive failure management in networked computing systems. Journal of Parallel and Distributed Computing, 70(11):1100-1109, 2010.
- (2010) Journal of Parallel and Distributed Computing , vol.70 , Issue.11 , pp. 1100-1109
- Fu, S.¹ Xu, C.²

15
- 33947184459
- Analytical models for architecture-based software reliability prediction: A unification framework
- DOI 10.1109/TR.2006.884587
- S. S. Gokhale and K. S. Trivedi. Analytical models for architecture-based software reliability prediction: A unification framework. IEEE Transactions on Reliability, 55(4):578-590, 2006. (Pubitemid 46405748)
- (2006) IEEE Transactions on Reliability , vol.55 , Issue.4 , pp. 578-590
- Gokhale, S.S.¹ Trivedi, K.S.²

16
- 79551518524
- Auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems
- Q. Guan and S. Fu. auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems. In Proceedings of IEEE International Performance Computing and Communications Conference (IPCCC), 2010.
- (2010) Proceedings of IEEE International Performance Computing and Communications Conference (IPCCC)
- Guan, Q.¹ Fu, S.²

17
- 79952786041
- Anomaly detection in large-scale coalition clusters for dependability assurance
- Q. Guan, D. Smith, and S. Fu. Anomaly detection in large-scale coalition clusters for dependability assurance. In Proceedings of IEEE International Conference on High Performance Computing (HiPC), 2010.
- (2010) Proceedings of IEEE International Conference on High Performance Computing (HiPC)
- Guan, Q.¹ Smith, D.² Fu, S.³

18
- 0003585297
- Morgan Kaufmann Publishers Inc.
- J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2005.
- (2005) Data Mining: Concepts and Techniques
- Han, J.¹

19
- 4544255683
- Improving cluster availability using workstation validation
- T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. In Proc. of ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2002.
- (2002) Proc. of ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS)
- Heath, T.¹ Martin, R.P.² Nguyen, T.D.³

20
- 85160681664
- Transparent checkpoint-restart of multiple processes on commodity operating systems
- O. Laadan and J. Nieh. Transparent checkpoint-restart of multiple processes on commodity operating systems. In Proc. of USENIX Annual Technical Conference (USENIX), 2007.
- (2007) Proc. of USENIX Annual Technical Conference (USENIX)
- Laadan, O.¹ Nieh, J.²

21
- 53349121135
- Fast restart mechanism for checkpoint/recovery protocols in networked environments
- Y. Li and Z. Lan. Fast restart mechanism for checkpoint/recovery protocols in networked environments. In Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2008.
- (2008) Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Li, Y.¹ Lan, Z.²

22
- 33845589803
- BlueGene/L failure analysis and prediction models
- Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. K. Sahoo. BlueGene/L failure analysis and prediction models. In Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2006.
- (2006) Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Liang, Y.¹ Zhang, Y.² Sivasubramaniam, A.³ Jette, M.⁴ Sahoo, R.K.⁵

23
- 27544497222
- Filtering failure logs for a BlueGene/L prototype
- Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta. Filtering failure logs for a BlueGene/L prototype. In Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2005.
- (2005) Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Liang, Y.¹ Zhang, Y.² Sivasubramaniam, A.³ Sahoo, R.⁴ Moreira, J.⁵ Gupta, M.⁶

24
- 47249131447
- Exploiting availability prediction in distributed systems
- J. W. Mickens and B. D. Noble. Exploiting availability prediction in distributed systems. In Proc. of USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2006.
- (2006) Proc. of USENIX Symposium on Networked Systems Design and Implementation (NSDI)
- Mickens, J.W.¹ Noble, B.D.²

25
- 34548046749
- Proactive fault tolerance for HPC with Xen virtualization
- A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive fault tolerance for HPC with Xen virtualization. In Proc. of ACM International Conference on Supercomputing (ICS), 2007.
- (2007) Proc. of ACM International Conference on Supercomputing (ICS)
- Nagarajan, A.B.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

26
- 12444257746
- Fault-aware job scheduling for BlueGene/L systems
- A. J. Oliner, R. K. Sahoo, J. E. Moreira, et al. Fault-aware job scheduling for BlueGene/L systems. In Proc. of the 18th International Parallel and Distributed Processing Symposium (IPDPS), 2004.
- (2004) Proc. of the 18th International Parallel and Distributed Processing Symposium (IPDPS)
- Oliner, A.J.¹ Sahoo, R.K.² Moreira, J.E.³

27
- 34548010919
- Software failures and the road to a petaflop machine
- I. Philp. Software failures and the road to a petaflop machine. In Proc. of Symposium on High Performance Computer Architecture Workshop, 2005.
- (2005) Proc. of Symposium on High Performance Computer Architecture Workshop
- Philp, I.¹

28
- 79957998466
- Timely virtual machine migration for pro-active fault tolerance
- A. Polze, P. Tröger, and F. Salfner. Timely virtual machine migration for pro-active fault tolerance. In Proc. of IEEE International Workshop on Object/component/service-oriented Real-time Networked Ultra-dependable Systems (WORNUS), at IEEE 14th International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC), 2011.
- (2011) Proc. of IEEE International Workshop on Object/component/service-oriented Real-time Networked Ultra-dependable Systems (WORNUS), at IEEE 14th International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC)
- Polze, A.¹ Tröger, P.² Salfner, F.³

29
- 77952378080
- Critical event prediction for proactive management in large-scale computer clusters
- R. K. Sahoo, A. J. Oliner, I. Rish, et al. Critical event prediction for proactive management in large-scale computer clusters. In Proc. of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003.
- (2003) Proc. of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD)
- Sahoo, R.K.¹ Oliner, A.J.² Rish, I.³

30
- 4544382099
- Failure data analysis of a large-scale heterogeneous server environment
- R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2004.
- (2004) Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Sahoo, R.K.¹ Sivasubramaniam, A.² Squillante, M.S.³ Zhang, Y.⁴

31
- 77950267881
- A survey of online failure prediction methods
- F. Salfner, M. Lenk, and M. Malek. A survey of online failure prediction methods. ACM Computing Surveys, 42:10:1-10:42, 2010.
- (2010) ACM Computing Surveys , vol.42 , pp. 10:1-10:42
- Salfner, F.¹ Lenk, M.² Malek, M.³

32
- 33746314557
- Proactive fault handling for system availability enhancement
- F. Salfner and M. Malek. Proactive fault handling for system availability enhancement. In Proc. of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Workshop on Dependable Parallel Distributed Network-centric Systems, 2005.
- (2005) Proc. of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Workshop on Dependable Parallel Distributed Network-centric Systems
- Salfner, F.¹ Malek, M.²

33
- 47249121233
- Using hidden semi-markov models for effective online failure prediction
- F. Salfner and M. Malek. Using hidden semi-markov models for effective online failure prediction. In Proc. of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS), 2007.
- (2007) Proc. of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS)
- Salfner, F.¹ Malek, M.²

34
- 70449479757
- Cross-core event monitoring for processor failure prediction
- F. Salfner, P. Tröger, and S. Tschirpke. Cross-core event monitoring for processor failure prediction. In Proc. of IEEE International Conference on High Performance Computing & Simulation, Workshop on Dependable Multi-Core Computing (DMCC), 2009.
- (2009) Proc. of IEEE International Conference on High Performance Computing & Simulation, Workshop on Dependable Multi-Core Computing (DMCC)
- Salfner, F.¹ Tröger, P.² Tschirpke, S.³

35
- 33845593340
- A large-scale study of failures in high-performance-computing systems
- B. Schroeder and G. Gibson. A large-scale study of failures in high-performance-computing systems. In Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2006.
- (2006) Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Schroeder, B.¹ Gibson, G.²

36
- 33750936415
- Availability modeling and analysis on high performance cluster computing systems
- H. Song, C. Leangsuksun, and R. Nassar. Availability modeling and analysis on high performance cluster computing systems. In Proc. of IEEE International Conference on Availability, Reliability and Security (ARES), 2006.
- (2006) Proc. of IEEE International Conference on Availability, Reliability and Security (ARES)
- Song, H.¹ Leangsuksun, C.² Nassar, R.³

37
- 70350755748
- Proactive process-level live migration in HPC environments
- C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive process-level live migration in HPC environments. In Proc. of ACM/IEEE Conference on Supercomputing (SC), 2008.
- (2008) Proc. of ACM/IEEE Conference on Supercomputing (SC)
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

38
- 67650672322
- Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems
- P. Yalagandula, S. Nath, H. Yu, P. B. Gibbons, and S. Sesha. Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems. In Proc. of USENIX WORLDS, 2004.
- (2004) Proc. of USENIX WORLDS
- Yalagandula, P.¹ Nath, S.² Yu, H.³ Gibbons, P.B.⁴ Sesha, S.⁵

39
- 77954054232
- Failure prediction for autonomic management of networked computer systems with availability assurance
- Z. Zhang and S. Fu. Failure prediction for autonomic management of networked computer systems with availability assurance. In Proceedings of IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.
- (2010) Proceedings of IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems in Conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Zhang, Z.¹ Fu, S.²

40
- 79551557730
- A hierarchical failure management framework for dependability assurance in compute clusters
- Z. Zhang and S. Fu. A hierarchical failure management framework for dependability assurance in compute clusters. International Journal of Computational Science, 4(4):313-326, 2010.
- (2010) International Journal of Computational Science , vol.4 , Issue.4 , pp. 313-326
- Zhang, Z.¹ Fu, S.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.