SCOPUS Information Search Platform

Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum, IPDPSW 2010

Volume , Issue , 2010, Pages

Failure prediction for autonomic management of networked computer systems with availability assurance

(2) Zhang, Ziming (a); Fu, Song (a)

(a) NEW MEXICO INSTITUTE OF MINING AND TECHNOLOGY (United States)

Author keywords

Autonomic systems; Failure management; Networked computer systems; System dependability

Indexed keywords

AUTONOMIC MANAGEMENT; AUTONOMIC SYSTEMS; COMPONENT FAILURES; COMPUTATIONAL GRIDS; FAILURE BEHAVIORS; FAILURE CORRELATION; FAILURE DYNAMICS; FAILURE MANAGEMENT; FAILURE PREDICTION; NETWORKED COMPUTER SYSTEMS; OFFLINE; ONLINE PREDICTION; OPERATION COST; PRODUCTION ENVIRONMENTS; SELF MANAGEMENT; SYSTEM DEPENDABILITY; SYSTEM DESIGNERS;

DISTRIBUTED PARAMETER NETWORKS; FORECASTING; HIERARCHICAL SYSTEMS;

COMPUTER SYSTEMS;

EID: 77954054232 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/IPDPSW.2010.5470868 Document Type: Conference Paper

Times cited : (14)

References (45)

1
- 84881083374
- Weka: The University of Waikato. Available at
- Weka: The University of Waikato. Machine learning software in Java. Available at: http://www.cs.waikato.ac.nz/ml/weka/.
- Machine Learning Software in Java

2
- 84870548923
- An overview of the BlueGene/L supercomputer
- N. Adiga, G. Almasi, et al. An overview of the BlueGene/L supercomputer. In Proceedings of ACM/IEEE Conference on Supercomputing (SC), 2002.
- (2002) Proceedings of ACM/IEEE Conference on Supercomputing (SC)
- Adiga, N.¹ Almasi, G.²

3
- 40849089513
- Model-based performance evaluation of distributed checkpointing protocols
- A. Agbaria and R. Friedman. Model-based performance evaluation of distributed checkpointing protocols. Performance Evaluation, 65(5):345-365, 2008.
- (2008) Performance Evaluation , vol.65 , Issue.5 , pp. 345-365
- Agbaria, A.¹ Friedman, R.²

4
- 1542679193
- Objective Bayesian analysis of spatially correlated data
- J. O. Berger, V. D. Oliveira, and B. Sansó. Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96(456):1361-1374, 2001.
- (2001) Journal of the American Statistical Association , vol.96 , Issue.456 , pp. 1361-1374
- Berger, J.O.¹ Oliveira, V.D.² Sansó, B.³

5
- 27544473955
- Nonstop advanced architecture
- D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen. Nonstop advanced architecture. In Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN), 2005.
- (2005) Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN)
- Bernick, D.¹ Bruckert, B.² Vigna, P.D.³ Garcia, D.⁴ Jardine, R.⁵ Klecka, J.⁶ Smullen, J.⁷

6
- 74049111423
- Compiler-enhanced incremental checkpointing for OpenMP applications
- G. Bronevetsky, D. J. Marques, K. K. Pingali, R. Rugina, and S. A. McKee. Compiler-enhanced incremental checkpointing for OpenMP applications. In Proceedings of ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
- (2008) Proceedings of ACM Symposium on Principles and Practice of Parallel Programming (PPoPP)
- Bronevetsky, G.¹ Marques, D.J.² Pingali, K.K.³ Rugina, R.⁴ McKee, S.A.⁵

7
- 70449914816
- Dynamic content web applications: Crash, failover, and recovery analysis
- L. E. Buzato, G. M. D. Vieira, and W. Zwaenepoel. Dynamic content web applications: Crash, failover, and recovery analysis. In Proceedings of International Conference on Dependable Systems and Networks (DSN), 2009.
- (2009) Proceedings of International Conference on Dependable Systems and Networks (DSN)
- Buzato, L.E.¹ Vieira, G.M.D.² Zwaenepoel, W.³

8
- 34548042452
- Proactive fault tolerance in MPI applications via task migration
- S. Chakravorty, C. Mendes, and L. Kale. Proactive fault tolerance in MPI applications via task migration. In Proceedings of IEEE International Conference on High Performance Computing, 2006.
- (2006) Proceedings of IEEE International Conference on High Performance Computing
- Chakravorty, S.¹ Mendes, C.² Kale, L.³

9
- 0036504529
- Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing
- A. Dogan and F. Özgüner. Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):308-323, 2002.
- (2002) IEEE Transactions on Parallel and Distributed Systems , vol.13 , Issue.3 , pp. 308-323
- Dogan, A.¹ Özgüner, F.²

10
- 0042078549
- A survey of rollback-recovery protocols in message-passing systems
- E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375-408, 2002.
- (2002) ACM Computing Surveys , vol.34 , Issue.3 , pp. 375-408
- Elnozahy, E.N.M.¹ Alvisi, L.² Wang, Y.-M.³ Johnson, D.B.⁴

11
- 4043157227
- Reliability, availability, and serviceability (RAS) of the IBM eServer z990
- M. L. Fair, C. R. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N. Modi, F. Freier, W. Fischer, and N. E. Weber. Reliability, availability, and serviceability (RAS) of the IBM eServer z990. IBM Journal of Research and Development, 48(3-4), 2004.
- (2004) IBM Journal of Research and Development , vol.48 , Issue.3-4
- Fair, M.L.¹ Conklin, C.R.² Swaney, S.B.³ Meaney, P.J.⁴ Clarke, W.J.⁵ Alves, L.C.⁶ Modi, I.N.⁷ Freier, F.⁸ Fischer, W.⁹ Weber, N.E.¹⁰

12
- 70349735985
- Failure-aware construction and reconfiguration of distributed virtual machines for high availability computing
- S. Fu. Failure-aware construction and reconfiguration of distributed virtual machines for high availability computing. In Proceedings of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2009.
- (2009) Proceedings of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid)
- Fu, S.¹

13
- 76849100508
- Failure-aware resource management for high-availability computing clusters with distributed virtual machines
- In Press
- S. Fu. Failure-aware resource management for high-availability computing clusters with distributed virtual machines. Journal of Parallel and Distributed Computing, In Press, 2010.
- (2010) Journal of Parallel and Distributed Computing
- Fu, S.¹

14
- 56749178938
- Exploring event correlation for failure prediction in coalitions of clusters
- November
- S. Fu and C.-Z. Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proceedings of ACM/IEEE Supercomputing Conference (SC), November 2007.
- (2007) Proceedings of ACM/IEEE Supercomputing Conference (SC)
- Fu, S.¹ Xu, C.-Z.²

15
- 47249124464
- Quantifying temporal and spatial correlation of failure events for proactive management
- S. Fu and C.-Z. Xu. Quantifying temporal and spatial correlation of failure events for proactive management. In Proceedings of IEEE International Symposium on Reliable Distributed Systems (SRDS), 2007.
- (2007) Proceedings of IEEE International Symposium on Reliable Distributed Systems (SRDS)
- Fu, S.¹ Xu, C.-Z.²

16
- 70349679287
- Proactive resource management for failure resilient high performance computing clusters
- March
- S. Fu and C.-Z. Xu. Proactive resource management for failure resilient high performance computing clusters. In Proceedings of IEEE International Conference on Availability, Reliability and Security (ARES), March 2009.
- (2009) Proceedings of IEEE International Conference on Availability, Reliability and Security (ARES)
- Fu, S.¹ Xu, C.-Z.²

17
- 0004169893
- Boston: Kluwer Academic Publishers
- R. G. Gallager. Discrete Stochastic Processes. Boston: Kluwer Academic Publishers, 1996.
- (1996) Discrete Stochastic Processes
- Gallager, R.G.¹

18
- 33947184459
- Analytical models for architecture-based software reliability prediction: A unification framework
- S. S. Gokhale and K. S. Trivedi. Analytical models for architecture-based software reliability prediction: A unification framework. IEEE Transactions on Reliability, 55(4):578-590, 2006.
- (2006) IEEE Transactions on Reliability , vol.55 , Issue.4 , pp. 578-590
- Gokhale, S.S.¹ Trivedi, K.S.²

19
- 55849147399
- Dynamic meta-learning for failure prediction in large-scale systems: A case study
- J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B.-H. Park. Dynamic meta-learning for failure prediction in large-scale systems: A case study. In Proceedings of IEEE International Conference on Parallel Processing (ICPP), 2008.
- (2008) Proceedings of IEEE International Conference on Parallel Processing (ICPP)
- Gu, J.¹ Zheng, Z.² Lan, Z.³ White, J.⁴ Hocks, E.⁵ Park, B.-H.⁶

20
- 47249137011
- Reliability and scheduling on systems subject to failures
- M. Hakem and F. Butelle. Reliability and scheduling on systems subject to failures. In Proceedings of IEEE Conference on Parallel Processing, 2007.
- (2007) Proceedings of IEEE Conference on Parallel Processing
- Hakem, M.¹ Butelle, F.²

21
- 4544255683
- Improving cluster availability using workstation validation
- T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. In Proceedings of the ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2002.
- (2002) Proceedings of the ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS)
- Heath, T.¹ Martin, R.P.² Nguyen, T.D.³

22
- 33845420448
- A power-aware run-time system for high-performance computing
- November
- C.-H. Hsu and W.-C. Feng. A power-aware run-time system for high-performance computing. In Proceedings of ACM/IEEE Conference on Supercomputing (SC), November 2005.
- (2005) Proceedings of ACM/IEEE Conference on Supercomputing (SC)
- Hsu, C.-H.¹ Feng, W.-C.²

23
- 85160681664
- Transparent checkpoint-restart of multiple processes on commodity operating systems
- O. Laadan and J. Nieh. Transparent checkpoint-restart of multiple processes on commodity operating systems. In Proceedings of USENIX Annual Technical Conference (USENIX), 2007.
- (2007) Proceedings of USENIX Annual Technical Conference (USENIX)
- Laadan, O.¹ Nieh, J.²

24
- 33751082401
- Exploit failure prediction for adaptive fault-tolerance in cluster computing
- Y. Li and Z. Lan. Exploit failure prediction for adaptive fault-tolerance in cluster computing. In Proceedings of IEEE Symposium on Cluster Computing and the Grid (CCGRID), 2006.
- (2006) Proceedings of IEEE Symposium on Cluster Computing and the Grid (CCGRID)
- Li, Y.¹ Lan, Z.²

25
- 53349121135
- Fast restart mechanism for checkpoint/recovery protocols in networked environments
- Y. Li and Z. Lan. Fast restart mechanism for checkpoint/recovery protocols in networked environments. In Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN), 2008.
- (2008) Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN)
- Li, Y.¹ Lan, Z.²

26
- 33845589803
- BlueGene/L failure analysis and prediction models
- Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. K. Sahoo. BlueGene/L failure analysis and prediction models. In Proceedings of International Conference on Dependable Systems and Networks (DSN), 2006.
- (2006) Proceedings of International Conference on Dependable Systems and Networks (DSN)
- Liang, Y.¹ Zhang, Y.² Sivasubramaniam, A.³ Jette, M.⁴ Sahoo, R.K.⁵

27
- 27544497222
- Filtering failure logs for a BlueGene/L prototype
- Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta. Filtering failure logs for a BlueGene/L prototype. In Proceedings of Conference on Dependable Systems and Networks (DSN), 2005.
- (2005) Proceedings of Conference on Dependable Systems and Networks (DSN)
- Liang, Y.¹ Zhang, Y.² Sivasubramaniam, A.³ Sahoo, R.⁴ Moreira, J.⁵ Gupta, M.⁶

28
- 47249131447
- Exploiting availability prediction in distributed systems
- J. W. Mickens and B. D. Noble. Exploiting availability prediction in distributed systems. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2006.
- (2006) Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI)
- Mickens, J.W.¹ Noble, B.D.²

29
- 1442309284
- On the reliability of the IBM MVS/XA operating system
- S. Mourad and D. Andrews. On the reliability of the IBM MVS/XA operating system. IEEE Transactions on Software Engineering, 13(10):1135-1139, 1987.
- (1987) IEEE Transactions on Software Engineering , vol.13 , Issue.10 , pp. 1135-1139
- Mourad, S.¹ Andrews, D.²

30
- 34548046749
- Proactive fault tolerance for HPC with Xen virtualization
- A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive fault tolerance for HPC with Xen virtualization. In Proceedings of ACM International Conference on Supercomputing (ICS), 2007.
- (2007) Proceedings of ACM International Conference on Supercomputing (ICS)
- Nagarajan, A.B.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

31
- 27544438709
- Probabilistic QoS guarantees for supercomputing systems
- A. J. Oliner and J. E. Moreira. Probabilistic QoS guarantees for supercomputing systems. In Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN), 2005.
- (2005) Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN)
- Oliner, A.J.¹ Moreira, J.E.²

32
- 36049013419
- What supercomputers say: A study of five system logs
- A. J. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proceedings of International Conference on Dependable Systems and Networks (DSN), 2007.
- (2007) Proceedings of International Conference on Dependable Systems and Networks (DSN)
- Oliner, A.J.¹ Stearley, J.²

33
- 34548010919
- Software failures and the road to a petaflop machine
- I. Philp. Software failures and the road to a petaflop machine. In Proceedings of Symposium on High Performance Computer Architecture Workshop, 2005.
- (2005) Proceedings of Symposium on High Performance Computer Architecture Workshop
- Philp, I.¹

34
- 20444463471
- A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters
- X. Qin and H. Jiang. A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters. Journal of Parallel and Distributed Computing, 65(8):885-900, 2005.
- (2005) Journal of Parallel and Distributed Computing , vol.65 , Issue.8 , pp. 885-900
- Qin, X.¹ Jiang, H.²

35
- 51049111944
- Big systems and big reliability challenges
- D. Reed, C. Lu, and C. Mendes. Big systems and big reliability challenges. In Proceedings of Parallel Computing, 2003.
- (2003) Proceedings of Parallel Computing
- Reed, D.¹ Lu, C.² Mendes, C.³

36
- 77952378080
- Critical event prediction for proactive management in large-scale computer clusters
- August
- R. K. Sahoo, A. J. Oliner, I. Rish, et al. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), August 2003.
- (2003) Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD)
- Sahoo, R.K.¹ Oliner, A.J.² Rish, I.³ et al.

37
- 4544382099
- Failure data analysis of a large-scale heterogeneous server environment
- R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN), 2004.
- (2004) Proceedings of IEEE International Conference on Dependable Systems and Networks (DSN)
- Sahoo, R.K.¹ Sivasubramaniam, A.² Squillante, M.S.³ Zhang, Y.⁴

38
- 33847157361
- Predicting failures of computer systems: A case study for a telecommunication system
- F. Salfner, M. Schieschke, and M. Malek. Predicting failures of computer systems: A case study for a telecommunication system. In Proceedings of IEEE Parallel and Distributed Processing Symposium (IPDPS), 2006.
- (2006) Proceedings of IEEE Parallel and Distributed Processing Symposium (IPDPS)
- Salfner, F.¹ Schieschke, M.² Malek, M.³

39
- 33845593340
- A large-scale study of failures in high-performance-computing systems
- B. Schroeder and G. Gibson. A large-scale study of failures in high-performance-computing systems. In Proceedings of International Conference on Dependable Systems and Networks (DSN), 2006.
- (2006) Proceedings of International Conference on Dependable Systems and Networks (DSN)
- Schroeder, B.¹ Gibson, G.²

40
- 84888292453
- Understanding failures in petascale computers
- B. Schroeder and G. Gibson. Understanding failures in petascale computers. In Proceedings of SciDAC, 2007.
- (2007) Proceedings of SciDAC
- Schroeder, B.¹ Gibson, G.²

41
- 70350755748
- Proactive process-level live migration in HPC environments
- C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive process-level live migration in HPC environments. In Proceedings of ACM/IEEE Conference on Supercomputing (SC), 2008.
- (2008) Proceedings of ACM/IEEE Conference on Supercomputing (SC)
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

42
- 33847127495
- A proactive fault-detection mechanism in large-scale cluster systems
- L. Wu, D. Meng, W. Gao, and J. Zhan. A proactive fault-detection mechanism in large-scale cluster systems. In Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.
- (2006) Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Wu, L.¹ Meng, D.² Gao, W.³ Zhan, J.⁴

43
- 56749158844
- Performance under failures of high-end computing
- M. Wu, X.-H. Sun, and H. Jin. Performance under failures of high-end computing. In Proceedings of ACM/IEEE Conference on Supercomputing, 2007.
- (2007) Proceedings of ACM/IEEE Conference on Supercomputing
- Wu, M.¹ Sun, X.-H.² Jin, H.³

44
- 67650672322
- Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems
- P. Yalagandula, S. Nath, H. Yu, P. B. Gibbons, and S. Seshan. Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems. In Proceedings of USENIX WORLDS, 2004.
- (2004) Proceedings of USENIX WORLDS
- Yalagandula, P.¹ Nath, S.² Yu, H.³ Gibbons, P.B.⁴ Seshan, S.⁵

45
- 33845595513
- Performance implications of failures in large-scale cluster scheduling
- Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004.
- (2004) Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing
- Zhang, Y.¹ Squillante, M.S.² Sivasubramaniam, A.³ Sahoo, R.K.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.