-
1
-
-
40749160036
-
Overview of the IBM Blue Gene/P project
-
Blue Gene Team
-
Blue Gene Team, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, 2008.
-
(2008)
IBM Journal of Research and Development
-
-
-
3
-
-
77951481809
-
CiFTS: A coordinated infrastructure for fault-tolerant systems
-
R. Gupta, P. Beckman, B.-H. Park, E. Lusk, and P. Hargrove. CiFTS: A coordinated infrastructure for fault-tolerant systems. In Proc. of ICPP, 2009.
-
Proc. of ICPP, 2009
-
-
Gupta, R.1
Beckman, P.2
Park, B.-H.3
Lusk, E.4
Hargrove, P.5
-
5
-
-
12444257746
-
Fault-aware job scheduling for Blue Gene/L systems
-
A. Oliner, R. Sahoo, J. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware job scheduling for Blue Gene/L systems. In Proc. of IPDPS, 2004.
-
Proc. of IPDPS, 2004
-
-
Oliner, A.1
Sahoo, R.2
Moreira, J.3
Gupta, M.4
Sivasubramaniam, A.5
-
8
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
B. Schroeder and G. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of DSN, 2006.
-
Proc. of DSN, 2006
-
-
Schroeder, B.1
Gibson, G.2
-
9
-
-
36049013419
-
What supercomputers say: A study of five system logs
-
A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proc. of DSN, 2007.
-
Proc. of DSN, 2007
-
-
Oliner, A.1
Stearley, J.2
-
11
-
-
70349657128
-
Blue Gene/L log analysis and time to interrupt estimation
-
N. Taerat, N. Naksinehaboon, C. Chandler, J. Elliott, C. Leangsuksun, G. Ostrouchov, S. Scott, and C. Engelmann. Blue Gene/L log analysis and time to interrupt estimation. In Proc. of ARES, 2009.
-
Proc. of ARES, 2009
-
-
Taerat, N.1
Naksinehaboon, N.2
Chandler, C.3
Elliott, J.4
Leangsuksun, C.5
Ostrouchov, G.6
Scott, S.7
Engelmann, C.8
-
12
-
-
27544497222
-
Filtering failure logs for a Blue Gene/L prototype
-
Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta. Filtering failure logs for a Blue Gene/L prototype. In Proc. of DSN, 2005.
-
Proc. of DSN, 2005
-
-
Liang, Y.1
Zhang, Y.2
Sivasubramaniam, A.3
Sahoo, R.4
Moreira, J.5
Gupta, M.6
-
13
-
-
84976846528
-
A first order approximation to the optimal checkpoint interval
-
J. Young. A first order approximation to the optimal checkpoint interval. Comm. ACM, 17(9): 530-531, 1974.
-
(1974)
Comm. ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.1
-
15
-
-
83455240695
-
Petascale system management experiences
-
N. Desai, R. Bradshaw, C. Lueninghoener, A. Cherry, S. Coghlan, and W. Scullin. Petascale system management experiences. In Proc. of LISA, 2008.
-
Proc. of LISA, 2008
-
-
Desai, N.1
Bradshaw, R.2
Lueninghoener, C.3
Cherry, A.4
Coghlan, S.5
Scullin, W.6
-
18
-
-
12444268325
-
System-level fault tolerance in large-scale parallel machines with buffered coscheduling
-
F. Petrini, K. Davis, and J. Sancho. System-level fault tolerance in large-scale parallel machines with buffered coscheduling. In Proc. of IPDPS, 2004.
-
Proc. of IPDPS, 2004
-
-
Petrini, F.1
Davis, K.2
Sancho, J.3
-
19
-
-
80053252298
-
Reliability-aware scalability models for high performance computing
-
Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of Cluster, 2009.
-
Proc. of Cluster, 2009
-
-
Zheng, Z.1
Lan, Z.2
-
20
-
-
77955737995
-
-
Whitepaper
-
N. DeBardeleben, J. Laros, J. Daly, S. Scott, C. Engelmann and B. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Whitepaper, 2009.
-
(2009)
High-end Computing Resilience: Analysis of Issues Facing the HEC Community and Path-forward for Research and Development
-
-
DeBardeleben, N.1
Laros, J.2
Daly, J.3
Scott, S.4
Engelmann, C.5
Harrod, B.6
-
21
-
-
78650009816
-
Impact of suboptimal checkpoint intervals on application efficiency in computational clusters
-
W. Jones, J. Daly, and N. DeBardeleben. Impact of suboptimal checkpoint intervals on application efficiency in computational clusters. In Proc. of HPDC, 2010.
-
Proc. of HPDC, 2010
-
-
Jones, W.1
Daly, J.2
DeBardeleben, N.3
-
23
-
-
67649860233
-
Exploring event correlation for failure prediction in coalitions of clusters
-
S. Fu and C. Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proc. of Supercomputing, 2007.
-
Proc. of Supercomputing, 2007
-
-
Fu, S.1
Xu, C.2
-
24
-
-
85077345110
-
Understanding customer problem troubleshooting from storage system logs
-
W. Jiang, C. Hu, S. Pasupathy, A. Kanevsky, Z. Li, and Y. Zhou. Understanding customer problem troubleshooting from storage system logs. In Proc. of FAST, 2009.
-
Proc. of FAST, 2009
-
-
Jiang, W.1
Hu, C.2
Pasupathy, S.3
Kanevsky, A.4
Li, Z.5
Zhou, Y.6
-
25
-
-
72249121870
-
Detecting large-scale system problems by mining console logs
-
W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Detecting large-scale system problems by mining console logs. In SOSP, 2009.
-
(2009)
SOSP
-
-
Xu, W.1
Huang, L.2
Fox, A.3
Patterson, D.4
Jordan, M.5
-
26
-
-
17044405923
-
Toward integrating feature selection algorithms for classification and clustering
-
DOI 10.1109/TKDE.2005.66
-
H. Liu and L. Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. on Knowledge and Data Engineering, 17(4):491-502, 2005. (Pubitemid 40495592)
-
(2005)
IEEE Transactions on Knowledge and Data Engineering
, vol.17
, Issue.4
, pp. 491-502
-
-
Liu, H.1
Yu, L.2
-
27
-
-
85084160707
-
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
-
B. Schroeder and G. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proc. of FAST, 2007.
-
Proc. of FAST, 2007
-
-
Schroeder, B.1
Gibson, G.2
-
30
-
-
77949275829
-
Reliability of a System of k Nodes for High Performance Computing Applications
-
N. Gottumukkala, R. Nassar, M. Paun, and C. Leangsuksun. Reliability of a System of k Nodes for High Performance Computing Applications. IEEE Trans. on Reliability, 59(1):142-169, 2010.
-
(2010)
IEEE Trans. on Reliability
, vol.59
, Issue.1
, pp. 142-169
-
-
Gottumukkala, N.1
Nassar, R.2
Paun, M.3
Leangsuksun, C.4