-
1
-
-
4544382099
-
Failure data analysis of a large-scale heterogeneous server environment
-
R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In DSN '04: Proc. of the 2004 Int. Conference on Dependable Systems and Networks, pages 772-781, 2004.
-
(2004)
DSN 04: Proc. of the 2004 Int. Conference on Dependable Systems and Networks
, pp. 772-781
-
-
Sahoo, R.K.1
Sivasubramaniam, A.2
Squillante, M.S.3
Zhang, Y.4
-
2
-
-
27544497222
-
Filtering failure logs for a bluegene/l prototype
-
Y. Liang, A. Sivasubramaniam, J. Moreira, Y. Zhang, R.K. Sahoo, and M. Jette. Filtering failure logs for a bluegene/l prototype. In DSN '05: Proc. of the 2005 Int. Conference on Dependable Systems and Networks, pages 476-485, 2005.
-
(2005)
DSN 05: Proc. of the 2005 Int. Conference on Dependable Systems and Networks
, pp. 476-485
-
-
Liang, Y.1
Sivasubramaniam, A.2
Moreira, J.3
Zhang, Y.4
Sahoo, R.K.5
Jette, M.6
-
3
-
-
33845589803
-
Bluegene/l failure analysis and prediction models
-
Y. Liang, Y. Zhang, M. Jette, Anand Sivasubramaniam, and R. Sahoo. Bluegene/l failure analysis and prediction models. In Dependable Systems and Networks, 2006. DSN 2006. International Conference on, pages 425-434, 2006.
-
(2006)
Dependable Systems and Networks, 2006. DSN 2006. International Conference on
, pp. 425-434
-
-
Liang, Y.1
Zhang, Y.2
Jette, M.3
Sivasubramaniam, A.4
Sahoo, R.5
-
5
-
-
85084160707
-
Disk failures in the real world: What does an mttf of 1,000,000 hours mean to you?
-
Berkeley, CA, USA. USENIX Association
-
B. Schroeder and G. A. Gibson. Disk failures in the real world: what does an mttf of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX conference on File and Storage Technologies, FAST '07, Berkeley, CA, USA, 2007. USENIX Association.
-
(2007)
Proceedings of the 5th USENIX Conference on File and Storage Technologies, FAST '07
-
-
Schroeder, B.1
Gibson, G.A.2
-
7
-
-
84866712387
-
Assessing time coalescence techniques for the analysis of supercomputer logs
-
C. Di Martino, M. Cinque, and D. Cotroneo. Assessing time coalescence techniques for the analysis of supercomputer logs. In Proc. of 42nd Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks (DSN), 2012, pages 1-12, 2012.
-
(2012)
Proc. of 42nd Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks (DSN), 2012
, pp. 1-12
-
-
Di Martino, C.1
Cinque, M.2
Cotroneo, D.3
-
8
-
-
80051915968
-
Improving log-based field failure data analysis of multi-node computing systems
-
Washington, DC, USA. IEEE Computer Society
-
A. Pecchia, D. Cotroneo, Z. Kalbarczyk, and R. K. Iyer. Improving log-based field failure data analysis of multi-node computing systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks, DSN '11, pages 97-108, Washington, DC, USA, 2011. IEEE Computer Society.
-
(2011)
Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks, DSN '11
, pp. 97-108
-
-
Pecchia, A.1
Cotroneo, D.2
Kalbarczyk, Z.3
Iyer, R.K.4
-
9
-
-
70449657893
-
Dram errors in the wild: A large-scale field study
-
June
-
B. Schroeder, E. Pinheiro, and W. Weber. Dram errors in the wild: a large-scale field study. SIGMETRICS Perform. Eval. Rev., 37(1):193-204, June 2009.
-
(2009)
SIGMETRICS Perform. Eval. Rev
, vol.37
, Issue.1
, pp. 193-204
-
-
Schroeder, B.1
Pinheiro, E.2
Weber, W.3
-
10
-
-
84877693592
-
Fault prediction under the microscope: A closer look into hpc systems
-
A. Gainaru, F. Cappello, M. Snir, and W. Kramer. Fault prediction under the microscope: A closer look into hpc systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1-11, 2012.
-
(2012)
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
, pp. 1-11
-
-
Gainaru, A.1
Cappello, F.2
Snir, M.3
Kramer, W.4
-
14
-
-
84912093660
-
-
http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-suite-enterprise-edition.
-
-
-
-
15
-
-
84912093659
-
-
http://www.cray.com/Products/Storage/Sonexion/Specifications.aspx.
-
-
-
-
16
-
-
84881063949
-
A state-machine approach to disambiguating supercomputer event logs
-
Berkeley, CA. USENIX
-
J. Stearley, R. Ballance, and L. Bauman. A state-machine approach to disambiguating supercomputer event logs. In Proc. of Workshop on Managing System Automatically and Dynamically 2, 155-192, Berkeley, CA, 2012. USENIX.
-
(2012)
Proc. of Workshop on Managing System Automatically and Dynamically
, vol.2
, pp. 155-192
-
-
Stearley, J.1
Ballance, R.2
Bauman, L.3
-
17
-
-
84899689608
-
Feng shui of supercomputer memory: Positional effects in dram and sram faults
-
New York, NY, USA. ACM
-
V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi. Feng shui of supercomputer memory: Positional effects in dram and sram faults. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 22:1-22:11, New York, NY, USA, 2013. ACM.
-
(2013)
Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13
, pp. 22:1-22:11
-
-
Sridharan, V.1
Stearley, J.2
Debardeleben, N.3
Blanchard, S.4
Gurumurthi, S.5
-
18
-
-
84912071239
-
-
http://www.olcf.ornl.gov/titan/, number 2 on top500.org.
-
-
-
-
20
-
-
83155160934
-
Modeling and tolerating heterogeneous failures in large parallel systems
-
New York, NY, USA
-
E. Heien, D. Kondo, A. Gainaru, A. LaPine, W. Kramer, and F. Cappello. Modeling and tolerating heterogeneous failures in large parallel systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 45:1-45:11, New York, NY, USA, 2011. ACM.
-
(2011)
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11
, pp. 45:1-45:11
-
-
Heien, E.1
Kondo, D.2
Gainaru, A.3
Lapine, A.4
Kramer, W.5
Cappello, F.6
-
21
-
-
84884479401
-
One size does not fit all: Clustering supercomputer failures using a multiple time window approach
-
In Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer, editors. Springer Berlin Heidelberg
-
C. Di Martino. One size does not fit all: Clustering supercomputer failures using a multiple time window approach. In Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer, editors, International Supercomputing Conference-Supercomputing, volume 7905 of Lecture Notes in Computer Science, pages 302-316. Springer Berlin Heidelberg, 2013.
-
(2013)
International Supercomputing Conference-Supercomputing, volume 7905 of Lecture Notes in Computer Science
, pp. 302-316
-
-
Di Martino, C.1
-
22
-
-
84885982390
-
Predicting job completion times using system logs in supercomputing clusters
-
, June
-
Xin Chen, Charng-Da Lu, and K. Pattabiraman. Predicting job completion times using system logs in supercomputing clusters. In Dependable Systems and Networks Workshop (DSN-W), 2013 43rd Annual IEEE/IFIP Conference on, pages 1-8, June 2013.
-
(2013)
Dependable Systems and Networks Workshop (DSN-W), 2013 43rd Annual IEEE/IFIP Conference on
, pp. 1-8
-
-
Chen, X.1
Lu, C.-D.2
Pattabiraman, K.3
-
24
-
-
77954467035
-
Studying and using failure data from large-scale internet services
-
New York, NY, USA. ACM
-
D. Oppenheimer and D. A. Patterson. Studying and using failure data from large-scale internet services. In Proceedings of the 10th workshop on ACM SIGOPS European workshop, EW 10, pages 255-258, New York, NY, USA, 2002. ACM.
-
(2002)
Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, EW 10
, pp. 255-258
-
-
Oppenheimer, D.1
Patterson, D.A.2