-
1
-
-
36049013419
-
What supercomputers say: A study of five system logs
-
A. Oliner and J. Stearley, "What Supercomputers Say: A Study of Five System Logs," Proc. of DSN, 2007.
-
(2007)
Proc. of DSN
-
-
Oliner, A.1
Stearley, J.2
-
2
-
-
33845593340
-
A large-scale study of failures in high-performance-computing systems
-
B. Schroeder and G. Gibson, "A Large-scale Study of Failures in High-performance-computing Systems," Proc. of DSN, 2006.
-
(2006)
Proc. of DSN
-
-
Schroeder, B.1
Gibson, G.2
-
3
-
-
85060036181
-
Validity of the single processor approach to achieving large-scale computing capabilities
-
G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. of AFIPS Spring Joint Computer Conference, 1967.
-
(1967)
Proc. of AFIPS Spring Joint Computer Conference
-
-
Amdahl, G.1
-
4
-
-
0024012163
-
Reevaluating Amdahl's law
-
J. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, 31(5):532-533, 1988.
-
(1988)
Communications of the ACM
, vol.31
, Issue.5
, pp. 532-533
-
-
Gustafson, J.1
-
6
-
-
72049101354
-
Adaptive grid-enabled SIMOX simulation on Japan-US grid testbed
-
Y. Tanaka, H. Takemiya, S. Sekiguchi, S. Ogata, A. Nakano, R. Kalia, and P. Vashishta, "Adaptive Grid-enabled SIMOX Simulation on Japan-US Grid Testbed," Proc. of TeraGrid, 2006.
-
(2006)
Proc. of TeraGrid
-
-
Tanaka, Y.1
Takemiya, H.2
Sekiguchi, S.3
Ogata, S.4
Nakano, A.5
Kalia, R.6
Vashishta, P.7
-
9
-
-
0025502686
-
Error log analysis: Statistical modeling and heuristic trend analysis
-
T. Lin and D. Siewiorek, "Error log analysis: statistical modeling and heuristic trend analysis," IEEE Trans. on Reliability, 39(4):419-432, 1990.
-
(1990)
IEEE Trans. on Reliability
, vol.39
, Issue.4
, pp. 419-432
-
-
Lin, T.1
Siewiorek, D.2
-
11
-
-
52949107193
-
Algorithm-system scalability of heterogeneous computing
-
Y. Chen, X. Sun, and M. Wu, "Algorithm-System Scalability of Heterogeneous Computing," Journal of Parallel and Distributed Computing, 68(11):1403-1412, 2008.
-
(2008)
Journal of Parallel and Distributed Computing
, vol.68
, Issue.11
, pp. 1403-1412
-
-
Chen, Y.1
Sun, X.2
Wu, M.3
-
12
-
-
33745170068
-
Scalability of heterogeneous computing
-
X. Sun, Y. Chen, and M. Wu, "Scalability of Heterogeneous Computing," Proc. of ICPP, 2005.
-
(2005)
Proc. of ICPP
-
-
Sun, X.1
Chen, Y.2
Wu, M.3
-
15
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
J. Daly, "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps," Future Generation Computer Systems, 22(3): 303-312, 2006.
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.1
-
16
-
-
0012237782
-
Minimizing completion time of a program by checkpointing and rejuvenation
-
S. Garg, Y. Huang, C. Kintala, and K. Trivedi, "Minimizing Completion Time of a Program by Checkpointing and Rejuvenation," Proc. of SIGMETRICS, 1996.
-
(1996)
Proc. of SIGMETRICS
-
-
Garg, S.1
Huang, Y.2
Kintala, C.3
Trivedi, K.4
-
17
-
-
0035201417
-
Processor allocation and checkpoint interval selection in cluster computing systems
-
J. Plank and M. Thomason, "Processor allocation and checkpoint interval selection in cluster computing systems," Journal of Parallel and Distributed Computing, 61(11): 1570-1590, 2001.
-
(2001)
Journal of Parallel and Distributed Computing
, vol.61
, Issue.11
, pp. 1570-1590
-
-
Plank, J.1
Thomason, M.2
-
18
-
-
85014175705
-
Experimental assessment of workstation failures and their impact on checkpointing systems
-
J. Plank and W. Elwasif, "Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems," Proc. of FTCS, 1998.
-
(1998)
Proc. of FTCS
-
-
Plank, J.1
Elwasif, W.2
-
19
-
-
9144223280
-
Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
-
E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Trans. on Dependable and Secure Computing, 1(2):97-108, 2004.
-
(2004)
IEEE Trans. on Dependable and Secure Computing
, vol.1
, Issue.2
, pp. 97-108
-
-
Elnozahy, E.1
Plank, J.2
-
20
-
-
27544513113
-
Modeling coordinated checkpointing for large-scale supercomputers
-
L. Wang, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," Proc. of DSN, 2005.
-
(2005)
Proc. of DSN
-
-
Wang, L.1
Pattabiraman, K.2
Kalbarczyk, Z.3
Iyer, R.4
-
21
-
-
57049111494
-
Adaptive fault management of parallel applications for high performance computing
-
Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, 57(12): 1647-1660, 2008.
-
(2008)
IEEE Trans. Computers
, vol.57
, Issue.12
, pp. 1647-1660
-
-
Lan, Z.1
Li, Y.2
-
22
-
-
55849147399
-
Dynamic meta-learning for failure prediction in large-scale systems: A case study
-
J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B.-H. Park, "Dynamic Meta-Learning for Failure Prediction in Large-scale Systems: A Case Study," Proc. of ICPP, 2008.
-
(2008)
Proc. of ICPP
-
-
Gu, J.1
Zheng, Z.2
Lan, Z.3
White, J.4
Hocks, E.5
Park, B.-H.6
-
23
-
-
72049113723
-
Reliability aware optimal K node of parallel applications in large scale HPC systems
-
N. Gottumukkala, C. Leangsuksun, R. Nassar, M. Paun, D. Sule, and S. Scott, "Reliability Aware Optimal K Node of Parallel applications in Large Scale HPC Systems," Proc. of High Availability and Performance Computing Workshop, 2008.
-
(2008)
Proc. of High Availability and Performance Computing Workshop
-
-
Gottumukkala, N.1
Leangsuksun, C.2
Nassar, R.3
Paun, M.4
Sule, D.5
Scott, S.6
-
24
-
-
84976846528
-
A first order approximation to the optimal checkpoint interval
-
J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, 17(9): 530-531, 1974.
-
(1974)
Comm. ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.1
-
25
-
-
33845595513
-
Performance implications of failures in large-scale cluster scheduling
-
Y. Zhang, M. Squillante, A. Sivasubramaniam, and R. Sahoo, "Performance implications of failures in large-scale cluster scheduling," Proc. of Workshop on JSSPP, SIGMETRICS, 2004.
-
(2004)
Proc. of Workshop on JSSPP, SIGMETRICS
-
-
Zhang, Y.1
Squillante, M.2
Sivasubramaniam, A.3
Sahoo, R.4
-
26
-
-
33746286070
-
Performance implications of periodic checkpointing on large-scale cluster systems
-
A. Oliner, R. Sahoo, J. Moreira, and M. Gupta, "Performance Implications of Periodic Checkpointing on Large-scale Cluster Systems," Proc. of IPDPS, 2005.
-
(2005)
Proc. of IPDPS
-
-
Oliner, A.1
Sahoo, R.2
Moreira, J.3
Gupta, M.4
-
27
-
-
72049130706
-
Opportunistic checkpoint intervals to improve system performance
-
S. Arunagiri, J. Daly, P. Teller, S. Seelam, R. Oldfield, M. Varela, and R. Riesen, "Opportunistic Checkpoint Intervals to Improve System Performance," Technical Report UTEP-CS-08-24, 2008.
-
(2008)
Technical Report UTEP-CS-08-24
-
-
Arunagiri, S.1
Daly, J.2
Teller, P.3
Seelam, S.4
Oldfield, R.5
Varela, M.6
Riesen, R.7
-
28
-
-
72049129021
-
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
-
A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, "Improved message logging versus improved coordinated checkpointing for fault tolerant MPI," Proc. of Cluster, 2003.
-
(2003)
Proc. of Cluster
-
-
Bouteiller, A.1
Lemarinier, P.2
Krawezik, G.3
Cappello, F.4
-
29
-
-
85027617648
-
Analysis of scalability of parallel algorithms and architectures: A survey
-
V. Kumar and A. Gupta, "Analysis of scalability of parallel algorithms and architectures: a survey," Proc. of ICS, 1991.
-
(1991)
Proc. of ICS
-
-
Kumar, V.1
Gupta, A.2
-
30
-
-
64049097304
-
Extending Amdahl's law for energy-efficient computing in the many-core era
-
D. Woo and H. Lee, "Extending Amdahl's law for energy-efficient computing in the many-core era," IEEE Computer, 41(12):24-31, 2008.
-
(2008)
IEEE Computer
, vol.41
, Issue.12
, pp. 24-31
-
-
Woo, D.1
Lee, H.2
-
31
-
-
34547424386
-
Cooperative checkpointing: A robust approach to large-scale systems reliability
-
A. Oliner, L. Rudolph, and R. Sahoo, "Cooperative checkpointing: A robust approach to large-scale systems reliability," Proc. of ICS, 2006.
-
(2006)
Proc. of ICS
-
-
Oliner, A.1
Rudolph, L.2
Sahoo, R.3
-
33
-
-
12444268325
-
System-level fault-tolerance in large-scale parallel machines with buffered coscheduling
-
F. Petrini, K. Davis, and J. Sancho, "System-level fault-tolerance in large-scale parallel machines with buffered coscheduling," Proc. of IPDPS, 2004.
-
(2004)
Proc. of IPDPS
-
-
Petrini, F.1
Davis, K.2
Sancho, J.3
-
34
-
-
0004244684
-
Checkpointing and modelling of program execution time
-
John Wiley and Sons
-
V. Nicola, "Checkpointing and modelling of program execution time," in Software Fault Tolerance, John Wiley and Sons, 1995.
-
(1995)
Software Fault Tolerance
-
-
Nicola, V.1
-
35
-
-
78649627101
-
A fast recovery mechanism for checkpointing in networked environments
-
Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments," Proc. of DSN, 2008.
-
(2008)
Proc. of DSN
-
-
Li, Y.1
Lan, Z.2
-
36
-
-
78449285638
-
Proactive process-level live migration in HPC environments
-
C. Wang, F. Mueller, C. Engelmann, and S. Scott, "Proactive process-level live migration in HPC environments," Proc. of Supercomputing, 2008.
-
(2008)
Proc. of Supercomputing
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.4
-
38
-
-
50649107313
-
Application MTFE vs platform MTBF: A fresh perspective on system reliability and application throughput for computations at scale
-
J. Daly, L. Pritchett-Sheats, and S. Michala, "Application MTFE vs Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale," Proc. of CCGRID, 2008.
-
(2008)
Proc. of CCGRID
-
-
Daly, J.1
Pritchett-Sheats, L.2
Michala, S.3