-
1
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
B. Schroeder and G. A. Gibson, "A large-scale study of failures in high-performance computing systems," in Proceedings of IEEE/IFIP DSN, 2006, pp. 249-258.
-
(2006)
Proceedings of IEEE/IFIP DSN
, pp. 249-258
-
-
Schroeder, B.1
Gibson, G.A.2
-
2
-
-
67349271621
-
An analysis of clustered failures on large supercomputing systems
-
T. J. Hacker, F. Romero, and C. D. Carothers, "An analysis of clustered failures on large supercomputing systems," Journal of Parallel and Distributed Computing, vol. 69, no. 7, pp. 652-665, 2009.
-
(2009)
Journal of Parallel and Distributed Computing
, vol.69
, Issue.7
, pp. 652-665
-
-
Hacker, T.J.1
Romero, F.2
Carothers, C.D.3
-
3
-
-
84866720084
-
Automatic fault characterization via abnormality-enhanced classification
-
G. Bronevetsky, I. Laguna, B. R. de Supinski, and S. Bagchi, "Automatic fault characterization via abnormality-enhanced classification," in Proceedings of IEEE/IFIP DSN, 2012, pp. 1-12.
-
(2012)
Proceedings of IEEE/IFIP DSN
, pp. 1-12
-
-
Bronevetsky, G.1
Laguna, I.2
De Supinski, B.R.3
Bagchi, S.4
-
4
-
-
84966284395
-
Probabilistic diagnosis of performance faults in large scale parallel applications
-
I. Laguna, D. H. Anh, B. R. de Supinski, S. Bagchi, and T. Gamblin, "Probabilistic diagnosis of performance faults in large scale parallel applications," in Proceedings of PACT, 2012, pp. 1-10.
-
(2012)
Proceedings of PACT
, pp. 1-10
-
-
Laguna, I.1
Anh, D.H.2
De Supinski, B.R.3
Bagchi, S.4
Gamblin, T.5
-
5
-
-
75449097851
-
Toward automated anomaly identification in large-scale systems
-
Z. Lan, Z. Zheng, and Y. Li, "Toward automated anomaly identification in large-scale systems," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 2, pp. 174-187, 2010.
-
(2010)
IEEE Transactions on Parallel and Distributed Systems
, vol.21
, Issue.2
, pp. 174-187
-
-
Lan, Z.1
Zheng, Z.2
Li, Y.3
-
6
-
-
79952786041
-
Anomaly detection in large-scale coalition clusters for dependability assurance
-
Q. Guan, D. Smith, and S. Fu, "Anomaly detection in large-scale coalition clusters for dependability assurance," in Proceedings of IEEE HiPC, 2010, pp. 1-10.
-
(2010)
Proceedings of IEEE HiPC
, pp. 1-10
-
-
Guan, Q.1
Smith, D.2
Fu, S.3
-
7
-
-
77951588446
-
Diagnosis of recurrent faults using log files
-
T. Reidemeister, M. A. Munawar, M. Jiang, and P. A. Ward, "Diagnosis of recurrent faults using log files," in Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009, pp. 12-23.
-
(2009)
Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research
, pp. 12-23
-
-
Reidemeister, T.1
Munawar, M.A.2
Jiang, M.3
Ward, P.A.4
-
8
-
-
77956573133
-
Using correlated surprise to infer shared influence
-
A. J. Oliner, A. V. Kulkarni, and A. Aiken, "Using correlated surprise to infer shared influence," in Proceedings of IEEE/IFIP DSN, 2010, pp. 191-200.
-
(2010)
Proceedings of IEEE/IFIP DSN
, pp. 191-200
-
-
Oliner, A.J.1
Kulkarni, A.V.2
Aiken, A.3
-
9
-
-
84867695274
-
3-dimensional root cause diagnosis via co-analysis
-
Z. Zheng, L. Yu, Z. Lan, and T. Jones, "3-dimensional root cause diagnosis via co-analysis," in Proceedings of ACM ICAC, 2012, pp. 181-190.
-
(2012)
Proceedings of ACM ICAC
, pp. 181-190
-
-
Zheng, Z.1
Yu, L.2
Lan, Z.3
Jones, T.4
-
10
-
-
49749107565
-
Failure prediction in IBM BlueGene/L event logs
-
Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo, "Failure prediction in IBM BlueGene/L event logs," in Proceedings of IEEE ICDM, 2007, pp. 583-588.
-
(2007)
Proceedings of IEEE ICDM
, pp. 583-588
-
-
Liang, Y.1
Zhang, Y.2
Xiong, H.3
Sahoo, R.4
-
11
-
-
70449479757
-
Cross-core event monitoring for processor failure prediction
-
F. Salfner, P. Troeger, and S. Tschirpke, "Cross-core event monitoring for processor failure prediction," in Proceedings of HPCS DMCC Workshop, 2009, pp. 67-73.
-
(2009)
Proceedings of HPCS DMCC Workshop
, pp. 67-73
-
-
Salfner, F.1
Troeger, P.2
Tschirpke, S.3
-
12
-
-
77951205449
-
A study of dynamic meta-learning for failure prediction in large-scale systems
-
Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan, "A study of dynamic meta-learning for failure prediction in large-scale systems," Journal of Parallel and Distributed Computing, vol. 70, no. 6, pp. 630-643, 2010.
-
(2010)
Journal of Parallel and Distributed Computing
, vol.70
, Issue.6
, pp. 630-643
-
-
Lan, Z.1
Gu, J.2
Zheng, Z.3
Thakur, R.4
Coghlan, S.5
-
13
-
-
84874306678
-
Logmaster: Mining event correlations in logs of large-scale cluster systems
-
X. Fu, R. Ren, J. Zhan, W. Zhou, Z. Jia, and G. Lu, "Logmaster: Mining event correlations in logs of large-scale cluster systems," in Proceedings of IEEE SRDS, 2012, pp. 1-10.
-
(2012)
Proceedings of IEEE SRDS
, pp. 1-10
-
-
Fu, X.1
Ren, R.2
Zhan, J.3
Zhou, W.4
Jia, Z.5
Lu, G.6
-
14
-
-
84891541902
-
TACC Stats: I/O performance monitoring for the intransigent
-
J. Hammond, "TACC Stats: I/O performance monitoring for the intransigent," in Invited Keynote for the 3rd IASDS Workshop, 2011, pp. 1-29.
-
(2011)
Invited Keynote for the 3rd IASDS Workshop
, pp. 1-29
-
-
Hammond, J.1
-
15
-
-
84856109383
-
Establishing hypothesis for recurrent system failures from cluster log files
-
Dec 12-14
-
E. Chuah, G. Lee, W.-C. Tjhi, S.-H. Kuo, T. Hung, J. Hammond, T. Minyard, and J. C. Browne, "Establishing hypothesis for recurrent system failures from cluster log files," in Proceedings of IEEE DASC, Dec 12-14 2011, pp. 1-8.
-
(2011)
Proceedings of IEEE DASC
, pp. 1-8
-
-
Chuah, E.1
Lee, G.2
Tjhi, W.-C.3
Kuo, S.-H.4
Hung, T.5
Hammond, J.6
Minyard, T.7
Browne, J.C.8
-
16
-
-
77956291503
-
End-to-end framework for fault management for open source clusters: Ranger
-
J. L. Hammond, T. Minyard, and J. Browne, "End-to-end framework for fault management for open source clusters: Ranger," in Proceedings of ACM TeraGrid, no. 9, 2010.
-
(2010)
Proceedings of ACM TeraGrid
, Issue.9
-
-
Hammond, J.L.1
Minyard, T.2
Browne, J.3
-
17
-
-
12344308304
-
Basic concepts and taxonomy of dependable and secure computing
-
A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, 2004.
-
(2004)
IEEE Transactions on Dependable and Secure Computing
, vol.1
, Issue.1
, pp. 11-33
-
-
Avizienis, A.1
Laprie, J.-C.2
Randell, B.3
Landwehr, C.4
-
18
-
-
84891524202
-
-
T. T. Project
-
T. T. Project, http://sebastien.godard.pagesperso-orange.fr/.
-
-
-
-
19
-
-
80053278089
-
Co-analysis of RAS log and job log on Blue Gene/P
-
Z. Zheng, L. Yu, W. Tang, and Z. Lan, "Co-analysis of RAS log and job log on Blue Gene/P," in Proceedings of IEEE IPDPS, 2011, pp. 840-851.
-
(2011)
Proceedings of IEEE IPDPS
, pp. 840-851
-
-
Zheng, Z.1
Yu, L.2
Tang, W.3
Lan, Z.4
-
20
-
-
36049013419
-
What supercomputers say: A study of five system logs
-
June
-
A. Oliner and J. Stearley, "What supercomputers say: A study of five system logs," in Proceedings of IEEE/IFIP DSN, June 2007, pp. 575-584.
-
(2007)
Proceedings of IEEE/IFIP DSN
, pp. 575-584
-
-
Oliner, A.1
Stearley, J.2
-
22
-
-
34248577801
-
Algorithms for projection-pursuit robust principal component analysis
-
C. Croux, P. Filzmoser, and M. R. Oliveira, "Algorithms for projection-pursuit robust principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 87, no. 2, pp. 218-225, 2007.
-
(2007)
Chemometrics and Intelligent Laboratory Systems
, vol.87
, Issue.2
, pp. 218-225
-
-
Croux, C.1
Filzmoser, P.2
Oliveira, M.R.3
-
23
-
-
0042826822
-
Independent component analysis: Algorithms and applications
-
A. Hyvarinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Networks, vol. 13, no. 4-5, pp. 411-430, 2000.
-
(2000)
Neural Networks
, vol.13
, Issue.4-5
, pp. 411-430
-
-
Hyvarinen, A.1
Oja, E.2
-
25
-
-
0000178613
-
On the reciprocal of the general algebraic matrix
-
E. H. Moore, "On the reciprocal of the general algebraic matrix," Bulletin of the AMS, vol. 26, no. 9, pp. 394-395, 1920.
-
(1920)
Bulletin of the AMS
, vol.26
, Issue.9
, pp. 394-395
-
-
Moore, E.H.1
-
26
-
-
70450257779
-
Towards automated performance diagnosis in a large IPTV network
-
A. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, and Q. Zhao, "Towards automated performance diagnosis in a large IPTV network," in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, 2009, pp. 231-242.
-
(2009)
Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication
, pp. 231-242
-
-
Mahimkar, A.A.1
Ge, Z.2
Shaikh, A.3
Wang, J.4
Yates, J.5
Zhang, Y.6
Zhao, Q.7
-
27
-
-
80051926966
-
Online detection of multi-component interactions in production systems
-
A. J. Oliner and A. Aiken, "Online detection of multi-component interactions in production systems," in Proceedings of IEEE/IFIP DSN, 2011, pp. 49-60.
-
(2011)
Proceedings of IEEE/IFIP DSN
, pp. 49-60
-
-
Oliner, A.J.1
Aiken, A.2
-
28
-
-
0022906522
-
Recognition of error symptoms in large systems
-
R. K. Iyer, L. T. Young, and V. Sridhar, "Recognition of error symptoms in large systems," in 1986 ACM Fall Joint Computer Conference, 1986, pp. 797-806.
-
(1986)
1986 ACM Fall Joint Computer Conference
, pp. 797-806
-
-
Iyer, R.K.1
Young, L.T.2
Sridhar, V.3
-
29
-
-
84891522525
-
-
Lustre
-
Lustre, http://wiki.lustre.org/manual/LustreManual18_HTML/index.html.
-
-
-
-
33
-
-
84891543224
-
A modular failure-aware resource allocation architecture for cloud computing
-
A. Chester, M. Leeke, M. Al-Ghamdi, S. A. Jarvis, and A. Jhumka, "A modular failure-aware resource allocation architecture for cloud computing," in Proceedings of UKPEW, 2011.
-
(2011)
Proceedings of UKPEW
-
-
Chester, A.1
Leeke, M.2
Al-Ghamdi, M.3
Jarvis, S.A.4
Jhumka, A.5
-
34
-
-
74949101845
-
A framework for distributed monitoring and root cause analysis for large IP networks
-
D. Banerjee, V. Madduri, and M. Srivatsa, "A framework for distributed monitoring and root cause analysis for large IP networks," in Proceedings of IEEE SRDS, 2009, pp. 246-255.
-
(2009)
Proceedings of IEEE SRDS
, pp. 246-255
-
-
Banerjee, D.1
Madduri, V.2
Srivatsa, M.3
-
35
-
-
78650560719
-
Shedding light on enterprise network failures using spotlight
-
J. Dipu, P. Prakash, R. R. Kompella, and R. Chandra, "Shedding light on enterprise network failures using spotlight," in Proceedings of IEEE SRDS, 2010, pp. 167-176.
-
(2010)
Proceedings of IEEE SRDS
, pp. 167-176
-
-
Dipu, J.1
Prakash, P.2
Kompella, R.R.3
Chandra, R.4
-
36
-
-
84866636396
-
Draco: Statistical diagnosis of chronic problems in large distributed systems
-
S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, and P. Narasimhan, "Draco: Statistical diagnosis of chronic problems in large distributed systems," in Proceedings of IEEE/IFIP DSN, 2012, pp. 1-12.
-
(2012)
Proceedings of IEEE/IFIP DSN
, pp. 1-12
-
-
Kavulya, S.P.1
Daniels, S.2
Joshi, K.3
Hiltunen, M.4
Gandhi, R.5
Narasimhan, P.6
-
37
-
-
84866677413
-
Adaptive algorithms for diagnosing large-scale failures in computer networks
-
S. Tati, B. J. Ko, G. Cao, A. Swami, and T. La Porta, "Adaptive algorithms for diagnosing large-scale failures in computer networks," in Proceedings of IEEE/IFIP DSN, 2012, pp. 1-12.
-
(2012)
Proceedings of IEEE/IFIP DSN
, pp. 1-12
-
-
Tati, S.1
Ko, B.J.2
Cao, G.3
Swami, A.4
La Porta, T.5
-
38
-
-
36248945561
-
Automated rule-based diagnosis through a distributed monitor system
-
G. Khanna, M. Y. Cheng, P. Varadharajan, S. Bagchi, M. P. Correia, and P. J. Verissimo, "Automated rule-based diagnosis through a distributed monitor system," IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 4, pp. 266-279, 2007.
-
(2007)
IEEE Transactions on Dependable and Secure Computing
, vol.4
, Issue.4
, pp. 266-279
-
-
Khanna, G.1
Cheng, M.Y.2
Varadharajan, P.3
Bagchi, S.4
Correia, M.P.5
Verissimo, P.J.6
-
39
-
-
77957761115
-
Problem diagnosis for MapReduce-based cloud computing environments
-
J. Tan, X. Pan, E. Marinelli, S. Kavulya, R. Gandhi, and P. Narasimhan, "Problem diagnosis for MapReduce-based cloud computing environments," in Proceedings of IEEE/IFIP NOMS, 2010.
-
(2010)
Proceedings of IEEE/IFIP NOMS
-
-
Tan, J.1
Pan, X.2
Marinelli, E.3
Kavulya, S.4
Gandhi, R.5
Narasimhan, P.6
-
40
-
-
77951439561
-
Mining console logs for large-scale system problem detection
-
December
-
W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan, "Mining console logs for large-scale system problem detection," in Proceedings of 3rd Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, December 2008.
-
(2008)
Proceedings of 3rd Workshop on Tackling Computer Systems Problems with Machine Learning Techniques
-
-
Xu, W.1
Huang, L.2
Fox, A.3
Patterson, D.4
Jordan, M.5
-
41
-
-
79953093128
-
Improving software diagnosability via log enhancement
-
D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage, "Improving software diagnosability via log enhancement," in Proceedings of ACM ASPLOS, 2011, pp. 3-14.
-
(2011)
Proceedings of ACM ASPLOS
, pp. 3-14
-
-
Yuan, D.1
Zheng, J.2
Park, S.3
Zhou, Y.4
Savage, S.5