-
4
-
-
0042078549
-
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
-
E. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, 2002.
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
-
-
Elnozahy, E.1
Alvisi, L.2
Wang, Y.3
Johnson, D.4
-
5
-
-
9144223280
-
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
-
Apr.-June
-
E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 2, Apr.-June 2004.
-
(2004)
IEEE Trans. Dependable and Secure Computing
, vol.1
, Issue.2
-
-
Elnozahy, E.1
Plank, J.2
-
6
-
-
0035266102
-
Proactive Management of Software Aging
-
V. Castelli, R. Harper, P. Heidelberger, S. Hunter, K. Trivedi, K. Vaidyanathan, and W. Zeggert, "Proactive Management of Software Aging," IBM J. Research and Development, vol. 45, no. 2, 2001.
-
(2001)
IBM J. Research and Development
, vol.45
, Issue.2
-
-
Castelli, V.1
Harper, R.2
Heidelberger, P.3
Hunter, S.4
Trivedi, K.5
Vaidyanathan, K.6
Zeggert, W.7
-
9
-
-
77952378080
-
Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters
-
R. Sahoo, A. Oliner, I. Rish, M. Gupta, J. Moreira, and S. Ma, "Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters," Proc. ACM SIGKDD, 2003.
-
(2003)
Proc. ACM SIGKDD
-
-
Sahoo, R.1
Oliner, A.2
Rish, I.3
Gupta, M.4
Moreira, J.5
Ma, S.6
-
10
-
-
33845589803
-
Blue Gene/L Failure Analysis and Prediction Models
-
Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. Sahoo, "Blue Gene/L Failure Analysis and Prediction Models," Proc. Int'l Conf. Dependable Systems and Networks (DSN), 2006.
-
(2006)
Proc. Int'l Conf. Dependable Systems and Networks (DSN)
-
-
Liang, Y.1
Zhang, Y.2
Sivasubramaniam, A.3
Jette, M.4
Sahoo, R.5
-
11
-
-
47249153592
-
A Meta-Learning Failure Predictor for Blue Gene/L Systems
-
P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "A Meta-Learning Failure Predictor for Blue Gene/L Systems," Proc. Int'l Conf. Parallel Processing (ICPP), 2007.
-
(2007)
Proc. Int'l Conf. Parallel Processing (ICPP)
-
-
Gujrati, P.1
Li, Y.2
Lan, Z.3
Thakur, R.4
White, J.5
-
13
-
-
51049108066
-
Mpich-V: A Multiprotocol Automatic Fault Tolerant MPI
-
A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, "Mpich-V: A Multiprotocol Automatic Fault Tolerant MPI," Int'l J. High Performance Computing and Applications, 2005.
-
(2005)
Int'l J. High Performance Computing and Applications
-
-
Bouteiller, A.1
Herault, T.2
Krawezik, G.3
Lemarinier, P.4
Cappello, F.5
-
15
-
-
23944521034
-
Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs
-
M. Schulz, G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali, and P. Stodghill, "Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs," Proc. ACM/IEEE Conf. Supercomputing (SC), 2004.
-
(2004)
Proc. ACM/IEEE Conf. Supercomputing (SC)
-
-
Schulz, M.1
Bronevetsky, G.2
Fernandes, R.3
Marques, D.4
Pingali, K.5
Stodghill, P.6
-
16
-
-
85084159983
-
Libckpt: Transparent Checkpointing under Unix
-
J. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent Checkpointing under Unix," Proc. Usenix Winter Technical Conf., 1995.
-
(1995)
Proc. Usenix Winter Technical Conf
-
-
Plank, J.1
Beck, M.2
Kingsley, G.3
Li, K.4
-
17
-
-
33749061217
-
Requirements for Linux Checkpoint/Restart
-
Technical Report LBNL-49659, Berkeley Lab, May 2002
-
J. Duell, P. Hargrove, and E. Roman, "Requirements for Linux Checkpoint/Restart," Technical Report LBNL-49659, Berkeley Lab, May 2002.
-
-
-
Duell, J.1
Hargrove, P.2
Roman, E.3
-
18
-
-
27844562921
-
Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
-
E. Gabriel et al., "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," Proc. 11th European PVM/MPI Users' Group Meeting, 2004.
-
(2004)
Proc. 11th European PVM/MPI Users' Group Meeting
-
-
Gabriel, E.1
-
20
-
-
34548768671
-
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
-
C. Wang, F. Mueller, C. Engelmann, and S. Scott, "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance," Proc. 21st Int'l Parallel and Distributed Processing Symp. (IPDPS), 2007.
-
(2007)
Proc. 21st Int'l Parallel and Distributed Processing Symp. (IPDPS)
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.4
-
21
-
-
84976846528
-
A First Order Approximation to the Optimal Checkpoint Interval
-
J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, vol. 17, no. 9, 1974.
-
(1974)
Comm. ACM
, vol.17
, Issue.9
-
-
Young, J.1
-
22
-
-
4544342875
-
Min-Max Checkpoint Placement under Incomplete Failure Information
-
T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, "Min-Max Checkpoint Placement under Incomplete Failure Information," Proc. Int'l Conf. Dependable Systems and Networks (DSN), 2004.
-
(2004)
Proc. Int'l Conf. Dependable Systems and Networks (DSN)
-
-
Ozaki, T.1
Dohi, T.2
Okamura, H.3
Kaio, N.4
-
23
-
-
0021473687
-
On the Optimum Checkpoint Selection Problem
-
S. Toueg and O. Babaoglu, "On the Optimum Checkpoint Selection Problem," SIAM J. Computing, vol. 13, no. 3, 1984.
-
(1984)
SIAM J. Computing
, vol.13
, Issue.3
-
-
Toueg, S.1
Babaoglu, O.2
-
25
-
-
12444268355
-
On the Feasibility of Incremental Checkpointing for Scientific Computing
-
J. Sancho, F. Petrini, G. Johnson, J. Fernandez, and E. Frachtenberg, "On the Feasibility of Incremental Checkpointing for Scientific Computing," Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS), 2004.
-
(2004)
Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS)
-
-
Sancho, J.1
Petrini, F.2
Johnson, G.3
Fernandez, J.4
Frachtenberg, E.5
-
26
-
-
0032179680
-
Diskless Checkpointing
-
Oct
-
J. Plank, K. Li, and M. Puening, "Diskless Checkpointing," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, Oct. 1998.
-
(1998)
IEEE Trans. Parallel and Distributed Systems
, vol.9
, Issue.10
-
-
Plank, J.1
Li, K.2
Puening, M.3
-
27
-
-
36949009638
-
Scalable Diskless Checkpointing for Large Parallel Systems
-
PhD dissertation, Univ. of Illinois at Urbana-Champaign
-
C.-D. Lu, "Scalable Diskless Checkpointing for Large Parallel Systems," PhD dissertation, Univ. of Illinois at Urbana-Champaign, 2005.
-
(2005)
-
-
Lu, C.-D.1
-
29
-
-
28044457320
-
Monitoring Hard Disks with Smart
-
Jan
-
B. Allen, "Monitoring Hard Disks with Smart," Linux J., Jan. 2004.
-
(2004)
Linux J
-
-
Allen, B.1
-
30
-
-
57049084232
-
-
Hardware Monitoring by LM Sensors
-
Hardware Monitoring by LM Sensors, http://secure.netroedge.com/~lm78/info.html, 2007.
-
(2007)
-
-
Sensors, L.M.1
-
34
-
-
0002168249
-
Learning to Predict Rare Events in Event Sequences
-
G. Weiss and H. Hirsh, "Learning to Predict Rare Events in Event Sequences," Proc. ACM SIGKDD, 1998.
-
(1998)
Proc. ACM SIGKDD
-
-
Weiss, G.1
Hirsh, H.2
-
38
-
-
21044437801
-
Overview of the Blue Gene/L System Architecture
-
A. Gara et al., "Overview of the Blue Gene/L System Architecture," IBM J. Research and Development, vol. 49, nos. 2/3, 2005.
-
(2005)
IBM J. Research and Development
, vol.49
, Issue.2-3
-
-
Gara, A.1
-
40
-
-
33749680779
-
A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster
-
C. Leangsuksun, T. Liu, T. Raol, S. Scott, and R. Libby, "A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster," Proc. Fifth LCI Int'l Conf. Linux Clusters, 2004.
-
(2004)
Proc. Fifth LCI Int'l Conf. Linux Clusters
-
-
Leangsuksun, C.1
Liu, T.2
Raol, T.3
Scott, S.4
Libby, R.5
-
41
-
-
12444257746
-
Fault-Aware Job Scheduling for Blue Gene/L Systems
-
A. Oliner, R. Sahoo, J. Moreira, M. Gupta, and A. Sivasubramaniam, "Fault-Aware Job Scheduling for Blue Gene/L Systems," Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS), 2004.
-
(2004)
Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS)
-
-
Oliner, A.1
Sahoo, R.2
Moreira, J.3
Gupta, M.4
Sivasubramaniam, A.5
-
43
-
-
16244422723
-
Checkpointing and Migration of Unix Processes in the Condor Distributed Processing System
-
Feb
-
T. Tannenbaum and M. Litzkow, "Checkpointing and Migration of Unix Processes in the Condor Distributed Processing System," Dr. Dobb's J., Feb. 1995.
-
(1995)
Dr. Dobb's J
-
-
Tannenbaum, T.1
Litzkow, M.2
-
47
-
-
77955897418
-
Total Recall: System Support for Automated Availability Management
-
R. Bhagwan, K. Tati, Y. Cheng, S. Savage, and G. Voelker, "Total Recall: System Support for Automated Availability Management," Proc. First Symp. Networked Systems Design and Implementation (NSDI), 2004.
-
(2004)
Proc. First Symp. Networked Systems Design and Implementation (NSDI)
-
-
Bhagwan, R.1
Tati, K.2
Cheng, Y.3
Savage, S.4
Voelker, G.5
-
48
-
-
51049111075
-
A Fault Diagnosis and Prognosis Service for Teragrid Clusters
-
Z. Lan, P. Gujrati, Y. Li, Z. Zheng, R. Thakur, and J. White, "A Fault Diagnosis and Prognosis Service for Teragrid Clusters," Proc. Second TeraGrid Conf., 2007.
-
(2007)
Proc. Second TeraGrid Conf
-
-
Lan, Z.1
Gujrati, P.2
Li, Y.3
Zheng, Z.4
Thakur, R.5
White, J.6
-
51
-
-
0035201417
-
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
-
J. Plank and M. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol. 61, no. 11, 2001.
-
(2001)
J. Parallel and Distributed Computing
, vol.61
, Issue.11
-
-
Plank, J.1
Thomason, M.2
-
54
-
-
27544513113
-
Modeling Coordinated Checkpointing for Large-Scale Supercomputers
-
L. Wang, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," Proc. Int'l Conf. Dependable Systems and Networks (DSN), 2005.
-
(2005)
Proc. Int'l Conf. Dependable Systems and Networks (DSN)
-
-
Wang, L.1
Pattabiraman, K.2
Kalbarczyk, Z.3
Iyer, R.4
-
56
-
-
84897988044
-
Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation
-
G. Bryan, T. Abel, and M. Norman, "Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation," Proc. ACM/IEEE Conf. Supercomputing (SC), 2001.
-
(2001)
Proc. ACM/IEEE Conf. Supercomputing (SC)
-
-
Bryan, G.1
Abel, T.2
Norman, M.3
-
57
-
-
0029633168
-
Gromacs: A Message-Passing Parallel Molecular Dynamics Implementation
-
H. Berendsen, D. van der Spoel, and R. van Drunen, "Gromacs: A Message-Passing Parallel Molecular Dynamics Implementation," Computer Physics Comm., vol. 91, pp. 43-56, 1995.
-
(1995)
Computer Physics Comm
, vol.91
, pp. 43-56
-
-
Berendsen, H.1
van der Spoel, D.2
van Drunen, R.3
-
59
-
-
79952168926
-
Using Adaptive Fault Tolerance to Improve Application Robustness on the Teragrid
-
Y. Li and Z. Lan, "Using Adaptive Fault Tolerance to Improve Application Robustness on the Teragrid," Proc. Second TeraGrid Conf., 2007.
-
(2007)
Proc. Second TeraGrid Conf
-
-
Li, Y.1
Lan, Z.2