-
3
-
-
77950267881
-
A survey of online failure prediction methods
-
F. Salfner, M. Lenk, and M. Malek, A survey of online failure prediction methods, ACM Computing Surveys, vol. 42, pp. 10:1-10:42, 2010.
-
(2010)
ACM Computing Surveys
, vol.42
-
-
Salfner, F.1
Lenk, M.2
Malek, M.3
-
5
-
-
77956227790
-
Quantifying event correlations for proactive failure management in networked computing systems
-
S. Fu and C. Xu, Quantifying event correlations for proactive failure management in networked computing systems, Journal of Parallel and Distributed Computing, vol. 70, no. 11, pp. 1100-1109, 2010.
-
(2010)
Journal of Parallel and Distributed Computing
, vol.70
, Issue.11
, pp. 1100-1109
-
-
Fu, S.1
Xu, C.2
-
6
-
-
55849147399
-
Dynamic meta-learning for failure prediction in large-scale systems: A case study
-
J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B.-H. Park, Dynamic meta-learning for failure prediction in large-scale systems: A case study, in Proceedings of IEEE International Conference on Parallel Processing (ICPP), 2008.
-
(2008)
Proceedings of IEEE International Conference On Parallel Processing (ICPP)
-
-
Gu, J.1
Zheng, Z.2
Lan, Z.3
White, J.4
Hocks, E.5
Park, B.-H.6
-
7
-
-
33750936415
-
Availability modeling and analysis on high performance cluster computing systems
-
H. Song, C. Leangsuksun, and R. Nassar, Availability modeling and analysis on high performance cluster computing systems, in Proceedings of IEEE International Conference on Availability, Reliability and Security (ARES), 2006.
-
(2006)
Proceedings of IEEE International Conference On Availability, Reliability and Security (ARES)
-
-
Song, H.1
Leangsuksun, C.2
Nassar, R.3
-
12
-
-
84863053811
-
-
available at:
-
"sysstat," available at: http://sebastien.godard.pagesperso-orange.fr/.
-
Sysstat
-
-
-
13
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
16
-
-
40849089513
-
Model-based performance evaluation of distributed checkpointing protocols
-
A. Agbaria and R. Friedman, Model-based performance evaluation of distributed checkpointing protocols, Performance Evaluation, vol. 65, no. 5, pp. 345-365, 2008.
-
(2008)
Performance Evaluation
, vol.65
, Issue.5
, pp. 345-365
-
-
Agbaria, A.1
Friedman, R.2
-
17
-
-
74049111423
-
Compiler-enhanced incremental checkpointing for OpenMP applications
-
G. Bronevetsky, D. J. Marques, K. K. Pingali, R. Rugina, and S. A. McKee, Compiler-enhanced incremental checkpointing for OpenMP applications, in Proceedings of ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
-
(2008)
Proceedings of ACM Symposium On Principles and Practice of Parallel Programming (PPoPP)
-
-
Bronevetsky, G.1
Marques, D.J.2
Pingali, K.K.3
Rugina, R.4
McKee, S.A.5
-
19
-
-
33845589803
-
BlueGene/L failure analysis and prediction models
-
Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. K. Sahoo, BlueGene/L failure analysis and prediction models, in Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2006.
-
(2006)
Proceedings of IEEE/IFIP International Conference On Dependable Systems and Networks (DSN)
-
-
Liang, Y.1
Zhang, Y.2
Sivasubramaniam, A.3
Jette, M.4
Sahoo, R.K.5
-
20
-
-
70449479757
-
Cross-core event monitoring for processor failure prediction
-
F. Salfner, P. Tröger, and S. Tschirpke, Cross-core event monitoring for processor failure prediction, in Proceedings of IEEE International Conference on High Performance Computing & Simulation, Workshop on Dependable Multi-Core Computing (DMCC), 2009.
-
(2009)
Proceedings of IEEE International Conference On High Performance Computing & Simulation, Workshop On Dependable Multi-Core Computing (DMCC)
-
-
Salfner, F.1
Tröger, P.2
Tschirpke, S.3
-
23
-
-
33947184459
-
Analytical models for architecture-based software reliability prediction: A unification framework
-
S. S. Gokhale and K. S. Trivedi, Analytical models for architecture-based software reliability prediction: A unification framework, IEEE Transactions on Reliability, vol. 55, no. 4, pp. 578-590, 2006.
-
(2006)
IEEE Transactions On Reliability
, vol.55
, Issue.4
, pp. 578-590
-
-
Gokhale, S.S.1
Trivedi, K.S.2
-
27
-
-
79551557730
-
A hierarchical failure management framework for dependability assurance in compute clusters
-
Z. Zhang and S. Fu, A hierarchical failure management framework for dependability assurance in compute clusters, International Journal of Computational Science, vol. 4, no. 4, pp. 313-326, 2010.
-
(2010)
International Journal of Computational Science
, vol.4
, Issue.4
, pp. 313-326
-
-
Zhang, Z.1
Fu, S.2
-
30
-
-
34548046749
-
Proactive fault tolerance for HPC with Xen virtualization
-
A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott, Proactive fault tolerance for HPC with Xen virtualization, in Proceedings of ACM International Conference on Supercomputing (ICS), 2007.
-
(2007)
Proceedings of ACM International Conference On Supercomputing (ICS)
-
-
Nagarajan, A.B.1
Mueller, F.2
Engelmann, C.3
Scott, S.L.4
-
34
-
-
70350755748
-
Proactive process-level live migration in HPC environments
-
C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, Proactive process-level live migration in HPC environments, in Proceedings of ACM/IEEE Conference on Supercomputing (SC), 2008.
-
(2008)
Proceedings of ACM/IEEE Conference On Supercomputing (SC)
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.L.4
-
35
-
-
76849100508
-
Failure-aware resource management for high-availability computing clusters with distributed virtual machines
-
S. Fu, Failure-aware resource management for high-availability computing clusters with distributed virtual machines, Journal of Parallel and Distributed Computing, vol. 70, no. 4, pp. 384-393, 2010.
-
(2010)
Journal of Parallel and Distributed Computing
, vol.70
, Issue.4
, pp. 384-393
-
-
Fu, S.1
-
38
-
-
78649317228
-
Coordinated session-based admission control with statistical learning for multi-tier internet applications
-
S. Muppala and X. Zhou, Coordinated session-based admission control with statistical learning for multi-tier internet applications, Journal of Network and Computer Applications, Elsevier, vol. 34, no. 1, pp. 20-29, 2011.
-
(2011)
Journal of Network and Computer Applications, Elsevier
, vol.34
, Issue.1
, pp. 20-29
-
-
Muppala, S.1
Zhou, X.2
-
40
-
-
27544497222
-
Filtering failure logs for a BlueGene/L prototype
-
Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta, Filtering failure logs for a BlueGene/L prototype, in Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2005.
-
(2005)
Proceedings of IEEE/IFIP International Conference On Dependable Systems and Networks (DSN)
-
-
Liang, Y.1
Zhang, Y.2
Sivasubramaniam, A.3
Sahoo, R.4
Moreira, J.5
Gupta, M.6
-
41
-
-
4544382099
-
Failure data analysis of a large-scale heterogeneous server environment
-
R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang, Failure data analysis of a large-scale heterogeneous server environment, in Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2004.
-
(2004)
Proceedings of IEEE/IFIP International Conference On Dependable Systems and Networks (DSN)
-
-
Sahoo, R.K.1
Sivasubramaniam, A.2
Squillante, M.S.3
Zhang, Y.4
-
42
-
-
67650672322
-
Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems
-
P. Yalagandula, S. Nath, H. Yu, P. B. Gibbons, and S. Seshan, Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems, in Proceedings of USENIX WORLDS, 2004.
-
(2004)
Proceedings of USENIX WORLDS
-
-
Yalagandula, P.1
Nath, S.2
Yu, H.3
Gibbons, P.B.4
Seshan, S.5
-
44
-
-
33845595513
-
Performance implications of failures in large-scale cluster scheduling
-
Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo, Performance implications of failures in large-scale cluster scheduling, in Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004.
-
(2004)
Proceedings of the 10th Workshop On Job Scheduling Strategies For Parallel Processing
-
-
Zhang, Y.1
Squillante, M.S.2
Sivasubramaniam, A.3
Sahoo, R.K.4
-
45
-
-
33947184459
-
Analytical models for architecture-based software reliability prediction: A unification framework
-
S. S. Gokhale and K. S. Trivedi, Analytical models for architecture-based software reliability prediction: A unification framework, IEEE Transactions on Reliability, vol. 55, no. 4, pp. 578-590, 2006.
-
(2006)
IEEE Transactions On Reliability
, vol.55
, Issue.4
, pp. 578-590
-
-
Gokhale, S.S.1
Trivedi, K.S.2
-
48
-
-
77954752832
-
Correlating instrumentation data to system states: A building block for automated diagnosis and control
-
I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase, Correlating instrumentation data to system states: A building block for automated diagnosis and control, in Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
-
(2004)
Proceedings of USENIX Symposium On Operating Systems Design and Implementation (OSDI)
-
-
Cohen, I.1
Goldszmidt, M.2
Kelly, T.3
Symons, J.4
Chase, J.S.5