-
2
-
-
27844542760
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, and A. Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479-493, 2005.
-
(2005)
International Journal of High Performance Computing Applications
, vol.19
, Issue.4
, pp. 479-493
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Sahay, V.4
Lumsdaine, A.5
-
3
-
-
12444292740
-
Computation-at-risk: Assessing job portfolio management risk on clusters
-
IEEE Computer Society
-
S. D. Kleban and S. H. Clearwater, "Computation-at-risk: Assessing job portfolio management risk on clusters," in IPDPS. IEEE Computer Society, 2004.
-
(2004)
IPDPS
-
-
Kleban, S.D.1
Clearwater, S.H.2
-
5
-
-
0036041277
-
Improving cluster availability using workstation validation
-
ACM
-
T. Heath, R. P. Martin, and T. D. Nguyen, "Improving cluster availability using workstation validation," in SIGMETRICS. ACM, 2002, pp. 217-227.
-
(2002)
SIGMETRICS
, pp. 217-227
-
-
Heath, T.1
Martin, R.P.2
Nguyen, T.D.3
-
6
-
-
34548056878
-
-
D. Nurmi, J. Brevik, and R. Wolski, Quantifying machine availability in networked and desktop grid systems, University of California, Santa Barbara, Computer Science, Tech. Rep. ucsb.cs:TR-2003-37, Nov. 2003.
-
D. Nurmi, J. Brevik, and R. Wolski, "Quantifying machine availability in networked and desktop grid systems," University of California, Santa Barbara, Computer Science, Tech. Rep. ucsb.cs:TR-2003-37, Nov. 2003.
-
-
-
-
7
-
-
4544337911
-
Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems
-
IEEE Computer Society
-
J. Brevik, D. Nurmi, and R. Wolski, "Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems," in CCGRID. IEEE Computer Society, 2004, pp. 190-199.
-
(2004)
CCGRID
, pp. 190-199
-
-
Brevik, J.1
Nurmi, D.2
Wolski, R.3
-
8
-
-
27144534020
-
Modeling machine availability in enterprise and wide-area distributed computing environments
-
Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30 - September 2, 2005, Proceedings, Springer
-
D. Nurmi, J. Brevik, and R. Wolski, "Modeling machine availability in enterprise and wide-area distributed computing environments," in Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30 - September 2, 2005, Proceedings, ser. Lecture Notes in Computer Science, vol. 3648. Springer, 2005, pp. 432-441.
-
(2005)
ser. Lecture Notes in Computer Science
, vol.3648
, pp. 432-441
-
-
Nurmi, D.1
Brevik, J.2
Wolski, R.3
-
9
-
-
23944448107
-
Performance implications of failures in large-scale cluster scheduling
-
JSSPP, Springer
-
Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo, "Performance implications of failures in large-scale cluster scheduling," in JSSPP, ser. Lecture Notes in Computer Science, vol. 3277. Springer, 2004, pp. 233-252.
-
(2004)
ser. Lecture Notes in Computer Science
, vol.3277
, pp. 233-252
-
-
Zhang, Y.1
Squillante, M.S.2
Sivasubramaniam, A.3
Sahoo, R.K.4
-
13
-
-
34548108278
-
-
Los Alamos National Laboratory, data on system failures, Online, Available
-
Los Alamos National Laboratory. (2006) Raw operational data on system failures. [Online]. Available: http://www.lanl.gov/projects/computerscience/data/
-
(2006)
Raw operational
-
-
-
15
-
-
34548088531
-
Reliability analysis in HPC clusters
-
N. R. Gottumukkala, Y. Liu, C. B. Leangsuksun, R. Nassar, and S. Scott, "Reliability analysis in HPC clusters," Proceedings of the High Availability and Performance Computing Workshop, 2006.
-
(2006)
Proceedings of the High Availability and Performance Computing Workshop
-
-
Gottumukkala, N.R.1
Liu, Y.2
Leangsuksun, C.B.3
Nassar, R.4
Scott, S.5
-
16
-
-
0033344278
-
Failure data analysis of a LAN of Windows NT based computers
-
Washington, Brussels, Tokyo: IEEE, Oct
-
M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, "Failure data analysis of a LAN of Windows NT based computers," in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99). Washington - Brussels - Tokyo: IEEE, Oct. 1999, pp. 178-189.
-
(1999)
Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99)
, pp. 178-189
-
-
Kalyanakrishnam, M.1
Kalbarczyk, Z.2
Iyer, R.3
-
21
-
-
84898046897
-
Scaling to Thousands of Processors with Buffer Coscheduling
-
Pittsburgh, PA, Aug
-
F. Petrini, "Scaling to Thousands of Processors with Buffer Coscheduling," in Scaling to New Heights Workshop, Pittsburgh, PA, Aug. 2002.
-
(2002)
Scaling to New Heights Workshop
-
-
Petrini, F.1
-
22
-
-
0345446547
-
The workload on parallel supercomputers: Modeling the characteristics of rigid jobs
-
U. Lublin and D. G. Feitelson, "The workload on parallel supercomputers: Modeling the characteristics of rigid jobs," JPDC: Journal of Parallel and Distributed Computing, vol. 63, 2003.
-
(2003)
JPDC: Journal of Parallel and Distributed Computing
, vol.63
-
-
Lublin, U.1
Feitelson, D.G.2
-
23
-
-
0031388399
-
Impact of checkpoint latency on overhead ratio of a checkpointing scheme
-
N. H. Vaidya, "Impact of checkpoint latency on overhead ratio of a checkpointing scheme," IEEE Trans. Computers, vol. 46, no. 8, pp. 942-947, 1997.
-
(1997)
IEEE Trans. Computers
, vol.46
, Issue.8
, pp. 942-947
-
-
Vaidya, N.H.1
-
24
-
-
33847764225
-
-
University of California, Santa Barbara, Computer Science, Tech. Rep. TR, Nov. 6
-
D. Nurmi, R. Wolski, and J. Brevik, "Model-based checkpoint scheduling for volatile resource environments," University of California, Santa Barbara, Computer Science, Tech. Rep. TR-2004-25, Nov. 6 2004.
-
(2004)
Model-based checkpoint scheduling for volatile resource environments
-
-
Nurmi, D.1
Wolski, R.2
Brevik, J.3
-
25
-
-
34548105831
-
-
N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizinok, A checkpoint and recovery system for the Pittsburgh supercomputing center terascale computing system, Pittsburgh Supercomputer Center, Tech. Rep. CMU-PSC-TR-2001-0002, 2001.
-
N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizinok, "A checkpoint and recovery system for the Pittsburgh supercomputing center terascale computing system," Pittsburgh Supercomputer Center, Tech. Rep. CMU-PSC-TR-2001-0002, 2001.
-
-
-
-
26
-
-
84978437474
-
Pastiche: Making backup cheap and easy
-
Proceedings of the 5th ACM Symposium on Operating System Design and Implementation OSDI-02, New York: ACM Press, Dec. 9-11
-
L. P. Cox, C. D. Murray, and B. Noble, "Pastiche: Making backup cheap and easy," in Proceedings of the 5th ACM Symposium on Operating System Design and Implementation (OSDI-02), ser. Operating Systems Review. New York: ACM Press, Dec. 9-11 2002, pp. 285-298.
-
(2002)
ser. Operating Systems Review
, pp. 285-298
-
-
Cox, L.P.1
Murray, C.D.2
Noble, B.3