-
1
-
-
33845589803
-
BlueGene/L failure analysis and prediction models
-
Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. Sahoo. BlueGene/L failure analysis and prediction models. DSN 2006: Int. Conference on Dependable Systems and Networks, pages 425-434, 2006.
-
(2006)
DSN 2006: Int. Conference on Dependable Systems and Networks
, pp. 425-434
-
-
Liang, Y.1
Zhang, Y.2
Sivasubramaniam, A.3
Jette, M.4
Sahoo, R.5
-
2
-
-
36049041275
-
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?
-
Bianca Schroeder and Garth A. Gibson. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage, 3(3), October 2007.
-
(2007)
ACM Trans. Storage
, vol.3
, Issue.3
-
-
Schroeder, B.1
Gibson, G.A.2
-
3
-
-
36049013419
-
What supercomputers say: A study of five system logs
-
A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. Dependable Systems and Networks, 2007. DSN '07. 37th Annual IEEE/IFIP Int. Conference on, pages 575-584, June 2007.
-
(2007)
Dependable Systems and Networks, 2007
, pp. 575-584
-
-
Oliner, A.1
Stearley, J.2
-
5
-
-
80052167311
-
Models for time coalescence in event logs
-
J.P. Hansen and D.P. Siewiorek. Models for time coalescence in event logs. Fault-Tolerant Computing, 1992. FTCS-22. Digest of Papers., Twenty-Second Int. Symp. on, pages 221-227, Jul 1992.
-
(1992)
Fault-tolerant Computing, 1992
, pp. 221-227
-
-
Hansen, J.P.1
Siewiorek, D.P.2
-
6
-
-
33646927438
-
Error and failure analysis of a UNIX server
-
R. Lal and G. Choi. Error and failure analysis of a UNIX server. High-Assurance Systems Engineering Symp., 1998. Proc. Third IEEE Int., pages 232-239, Nov 1998.
-
(1998)
High-assurance Systems Engineering Symp., 1998
, pp. 232-239
-
-
Lal, R.1
Choi, G.2
-
7
-
-
4544382099
-
Failure data analysis of a large-scale heterogeneous server environment
-
Washington, DC, USA IEEE Computer Society
-
R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In DSN '04: Proc. of the 2004 Int. Conference on Dependable Systems and Networks, page 772, Washington, DC, USA, 2004. IEEE Computer Society.
-
(2004)
DSN '04: Proc. of the 2004 Int. Conference on Dependable Systems and Networks
, pp. 772
-
-
Sahoo, R.K.1
Sivasubramaniam, A.2
Squillante, M.S.3
Zhang, Y.4
-
8
-
-
27544497222
-
Filtering failure logs for a BlueGene/L prototype
-
Washington, DC, USA IEEE Computer Society
-
Y. Liang, A. Sivasubramaniam, J. Moreira, Y. Zhang, R.K. Sahoo, and M. Jette. Filtering failure logs for a BlueGene/L prototype. In DSN '05: Proc. of the 2005 Int. Conference on Dependable Systems and Networks, pages 476-485, Washington, DC, USA, 2005. IEEE Computer Society.
-
(2005)
DSN '05: Proc. of the 2005 Int. Conference on Dependable Systems and Networks
, pp. 476-485
-
-
Liang, Y.1
Sivasubramaniam, A.2
Moreira, J.3
Zhang, Y.4
Sahoo, R.K.5
Jette, M.6
-
9
-
-
0030379933
-
Analyze-NOW - An environment for collection and analysis of failures in a network of workstations
-
A. Thakur and R.K. Iyer. Analyze-NOW - an environment for collection and analysis of failures in a network of workstations. Reliability, IEEE Transactions on, 45(4):561-570, Dec 1996.
-
(1996)
IEEE Transactions on Reliability
, vol.45
, Issue.4
, pp. 561-570
-
-
Thakur, A.1
Iyer, R.K.2
-
11
-
-
27544495732
-
Crash data collection: A Windows case study
-
Washington, DC, USA IEEE Computer Society
-
A. Ganapathi and D. Patterson. Crash data collection: A Windows case study. In DSN '05: Proc. of the 2005 Int. Conference on Dependable Systems and Networks, pages 280-285, Washington, DC, USA, 2005. IEEE Computer Society.
-
(2005)
DSN '05: Proc. of the 2005 Int. Conference on Dependable Systems and Networks
, pp. 280-285
-
-
Ganapathi, A.1
Patterson, D.2
-
12
-
-
78651588409
-
Event log based dependability analysis of Windows NT and 2K systems
-
Washington, DC, USA IEEE Computer Society
-
C. Simache, M. Kaâniche, and A. Saidane. Event log based dependability analysis of Windows NT and 2K systems. In PRDC '02: Proc. of the 2002 Pacific Rim Int. Symp. on Dependable Computing, page 311, Washington, DC, USA, 2002. IEEE Computer Society.
-
(2002)
PRDC '02: Proc. of the 2002 Pacific Rim Int. Symp. on Dependable Computing
, pp. 311
-
-
Simache, C.1
Kaâniche, M.2
Saidane, A.3
-
13
-
-
84958782417
-
Networked Windows NT system field failure data analysis
-
Washington, DC, USA IEEE Computer Society
-
J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field failure data analysis. In PRDC '99: Proc. of the 1999 Pacific Rim Int. Symp. on Dependable Computing, page 178, Washington, DC, USA, 1999. IEEE Computer Society.
-
(1999)
PRDC '99: Proc. of the 1999 Pacific Rim Int. Symp. on Dependable Computing
, pp. 178
-
-
Xu, J.1
Kalbarczyk, Z.2
Iyer, R.K.3
-
14
-
-
50649105078
-
Bad words: Finding faults in Spirit's syslogs
-
J. Stearley and A.J. Oliner. Bad words: Finding faults in Spirit's syslogs. In Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on, pages 765-770, May 2008.
-
(2008)
Cluster Computing and the Grid, 2008
, pp. 765-770
-
-
Stearley, J.1
Oliner, A.J.2
-
15
-
-
0036821893
-
The Möbius framework and its implementation
-
D.D. Deavours, G. Clark, T. Courtney, D. Daly, S. Derisavi, J.M. Doyle, W.H. Sanders, and P.G. Webster. The Möbius framework and its implementation. Software Engineering, IEEE Transactions on, 28(10):956-969, Oct 2002.
-
(2002)
IEEE Transactions on Software Engineering
, vol.28
, Issue.10
, pp. 956-969
-
-
Deavours, D.D.1
Clark, G.2
Courtney, T.3
Daly, D.4
Derisavi, S.5
Doyle, J.M.6
Sanders, W.H.7
Webster, P.G.8
-
16
-
-
53349174366
-
A log mining approach to failure analysis of enterprise telephony systems
-
Chinghway Lim, N. Singh, and S. Yajnik. A log mining approach to failure analysis of enterprise telephony systems. In Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on, pages 398-403, June 2008.
-
(2008)
Dependable Systems and Networks with FTCS and DCC, 2008
, pp. 398-403
-
-
Lim, C.1
Singh, N.2
Yajnik, S.3
-
17
-
-
84866681807
-
A framework for assessing the dependability of supercomputers via automated log analysis
-
Anchorage, AK
-
C. Di Martino, D. Cotroneo, Z. Kalbarczyk, and R. K. Iyer. A framework for assessing the dependability of supercomputers via automated log analysis. In Sup. volume of Proc. of the Int. Conference on Dependable Systems and Networks, Anchorage, AK, pages 383-384, 2008.
-
(2008)
Sup. Volume of Proc. of the Int. Conference on Dependable Systems and Networks
, pp. 383-384
-
-
Di Martino, C.1
Cotroneo, D.2
Kalbarczyk, Z.3
Iyer, R.K.4
-
18
-
-
67349271621
-
An analysis of clustered failures on large supercomputing systems
-
Thomas J. Hacker, Fabian Romero, and Christopher D. Carothers. An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing, 69(7):652-665, 2009.
-
(2009)
Journal of Parallel and Distributed Computing
, vol.69
, Issue.7
, pp. 652-665
-
-
Hacker, T.J.1
Romero, F.2
Carothers, C.D.3
-
19
-
-
4544337911
-
Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems
-
CCGRID '04 Washington, DC, USA IEEE Computer Society
-
J. Brevik, D. Nurmi, and R. Wolski. Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems. In Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGRID '04, pages 190-199, Washington, DC, USA, 2004. IEEE Computer Society.
-
(2004)
Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid
, pp. 190-199
-
-
Brevik, J.1
Nurmi, D.2
Wolski, R.3
-
20
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
Washington, DC, USA IEEE Computer Society
-
B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In DSN '06: Proc. of the Int. Conference on Dependable Systems and Networks, pages 249-258, Washington, DC, USA, 2006. IEEE Computer Society.
-
(2006)
DSN '06: Proc. of the Int. Conference on Dependable Systems and Networks
, pp. 249-258
-
-
Schroeder, B.1
Gibson, G.A.2
-
21
-
-
0036041277
-
Improving cluster availability using workstation validation
-
New York, NY, USA ACM
-
Taliver Heath, Richard P. Martin, and Thu D. Nguyen. Improving cluster availability using workstation validation. In Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, SIGMETRICS '02, pages 217-227, New York, NY, USA, 2002. ACM.
-
(2002)
Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '02
, pp. 217-227
-
-
Heath, T.1
Martin, R.P.2
Nguyen, T.D.3
-
22
-
-
34548092060
-
Using queue structures to improve job reliability
-
New York, NY, USA ACM
-
Thomas J. Hacker and Zdzislaw Meglicki. Using queue structures to improve job reliability. In Proceedings of the 16th international symposium on High performance distributed computing, HPDC '07, pages 43-54, New York, NY, USA, 2007. ACM.
-
(2007)
Proceedings of the 16th International Symposium on High Performance Distributed Computing, HPDC '07
, pp. 43-54
-
-
Hacker, T.J.1
Meglicki, Z.2
-
24
-
-
0026869241
-
Analysis and modeling of correlated failures in multicomputer systems
-
D. Tang and R.K. Iyer. Analysis and modeling of correlated failures in multicomputer systems. Computers, IEEE Transactions on, 41(5):567-577, May 1992.
-
(1992)
IEEE Transactions on Computers
, vol.41
, Issue.5
, pp. 567-577
-
-
Tang, D.1
Iyer, R.K.2
-
25
-
-
0344164868
-
Workload, performance, and reliability of digital computing systems
-
X. Castillo and D.P. Siewiorek. Workload, performance, and reliability of digital computing systems. Fault-Tolerant Computing, 1995, 'Highlights from Twenty-Five Years'., Twenty-Fifth Int. Symp. on, pages 367-, Jun 1995.
-
(1995)
Fault-tolerant Computing, 1995, 'Highlights from Twenty-five Years'
, pp. 367
-
-
Castillo, X.1
Siewiorek, D.P.2
-
26
-
-
0025502686
-
Error log analysis: Statistical modeling and heuristic trend analysis
-
T. T. Y. Lin and D. P. Siewiorek. Error log analysis: statistical modeling and heuristic trend analysis. Reliability, IEEE Transactions on, 39(4):419-432, 1990.
-
(1990)
IEEE Transactions on Reliability
, vol.39
, Issue.4
, pp. 419-432
-
-
Lin, T.T.Y.1
Siewiorek, D.P.2
-
27
-
-
4544382099
-
Failure data analysis of a large-scale heterogeneous server environment
-
Washington, DC, USA IEEE Computer Society
-
Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, and Yanyong Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks, pages 772-, Washington, DC, USA, 2004. IEEE Computer Society.
-
(2004)
Proceedings of the 2004 International Conference on Dependable Systems and Networks
, pp. 772
-
-
Sahoo, R.K.1
Sivasubramaniam, A.2
Squillante, M.S.3
Zhang, Y.4
-
28
-
-
80051915968
-
Improving log-based field failure data analysis of multi-node computing systems
-
Antonio Pecchia, Domenico Cotroneo, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. Improving log-based field failure data analysis of multi-node computing systems. Dependable Systems and Networks, International Conference on, pages 97-108, 2011.
-
(2011)
Dependable Systems and Networks
, pp. 97-108
-
-
Pecchia, A.1
Cotroneo, D.2
Kalbarczyk, Z.3
Iyer, R.K.4
-
29
-
-
51849106128
-
Mining event logs with SLCT and LogHound
-
R. Vaarandi. Mining event logs with SLCT and LogHound. In Network Operations and Management Symposium, 2008. NOMS 2008. IEEE, pages 1071-1074, April 2008.
-
(2008)
Network Operations and Management Symposium, 2008
, pp. 1071-1074
-
-
Vaarandi, R.1
-
30
-
-
70449794134
-
System log pre-processing to improve failure prediction
-
Ziming Zheng, Zhiling Lan, B.H. Park, and A. Geist. System log pre-processing to improve failure prediction. In Dependable Systems & Networks, 2009. DSN '09. IEEE/IFIP International Conference on, pages 572-577, June 29-July 2, 2009.
-
(2009)
Dependable Systems & Networks, 2009
, pp. 572-577
-
-
Zheng, Z.1
Lan, Z.2
Park, B.H.3
Geist, A.4
-
31
-
-
0029703899
-
A comparative analysis of event tupling schemes
-
Washington, DC, USA IEEE Computer Society
-
M. F. Buckley and D. P. Siewiorek. A comparative analysis of event tupling schemes. In FTCS '96: Proc. of the Twenty-Sixth Annual Int. Symp. on Fault-Tolerant Computing (FTCS '96), page 294, Washington, DC, USA, 1996. IEEE Computer Society.
-
(1996)
FTCS '96: Proc. of the Twenty-sixth Annual Int. Symp. on Fault-tolerant Computing (FTCS '96)
, pp. 294
-
-
Buckley, M.F.1
Siewiorek, D.P.2
-
32
-
-
80053278089
-
Co-analysis of RAS log and job log on Blue Gene/P
-
Washington, DC, USA IEEE Computer Society
-
Ziming Zheng, Li Yu, Wei Tang, Zhiling Lan, Rinku Gupta, Narayan Desai, Susan Coghlan, and Daniel Buettner. Co-analysis of RAS log and job log on Blue Gene/P. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS '11, pages 840-851, Washington, DC, USA, 2011. IEEE Computer Society.
-
(2011)
Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS '11
, pp. 840-851
-
-
Zheng, Z.1
Yu, L.2
Tang, W.3
Lan, Z.4
Gupta, R.5
Desai, N.6
Coghlan, S.7
Buettner, D.8
-
33
-
-
0001631180
-
Software dependability in the Tandem GUARDIAN system
-
I. Lee and R. K. Iyer. Software dependability in the Tandem GUARDIAN system. IEEE Trans. Softw. Eng., 21(5):455-467, 1995.
-
(1995)
IEEE Trans. Softw. Eng.
, vol.21
, Issue.5
, pp. 455-467
-
-
Lee, I.1
Iyer, R.K.2
-
34
-
-
85160740182
-
A memory soft error measurement on production systems
-
Berkeley, CA, USA USENIX Association
-
Xin Li, Kai Shen, Michael C. Huang, and Lingkun Chu. A memory soft error measurement on production systems. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, pages 21:1-21:6, Berkeley, CA, USA, 2007. USENIX Association.
-
(2007)
2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
, pp. 21:1-21:6
-
-
Li, X.1
Shen, K.2
Huang, M.C.3
Chu, L.4
-
35
-
-
77955511287
-
Towards a federated metropolitan area grid environment: The SCoPE network-aware infrastructure
-
Francesco Palmieri and Silvio Pardi. Towards a federated metropolitan area grid environment: The SCoPE network-aware infrastructure. Future Generation Computer Systems, 26(8):1241-1256, 2010.
-
(2010)
Future Generation Computer Systems
, vol.26
, Issue.8
, pp. 1241-1256
-
-
Palmieri, F.1
Pardi, S.2
-
36
-
-
78650855128
-
A fault avoidance strategy improving the reliability of the EGI production grid infrastructure
-
Berlin, Heidelberg Springer-Verlag
-
Francesco Palmieri, Silvio Pardi, and Paolo Veronesi. A fault avoidance strategy improving the reliability of the EGI production grid infrastructure. In Proceedings of the 14th international conference on Principles of distributed systems, OPODIS'10, pages 159-172, Berlin, Heidelberg, 2010. Springer-Verlag.
-
(2010)
Proceedings of the 14th International Conference on Principles of Distributed Systems, OPODIS'10
, pp. 159-172
-
-
Palmieri, F.1
Pardi, S.2
Veronesi, P.3