-
1
-
-
12444257746
-
-
A. J. Oliner, R. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware Job Scheduling For BlueGene/L Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
-
-
-
-
2
-
-
0026923304
-
S. M. Shatz, J.-P. Wang, and M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems," IEEE Transactions on Computers, vol. 41, no. 9, pp. 1156-1168, Sept. 1992.
-
-
3
-
-
33646432010
-
C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller, A. G. Shet, and P. Sadayappan. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems. ACM SIGOPS Operating Systems Review (OSR), 40(2), pages 63-72, 2006.
-
-
4
-
-
0036041277
-
-
Heath, T., Martin, R. P., and Nguyen, T. D. 2002. Improving cluster availability using workstation validation. SIGMETRICS Perform. Eval. Rev. 30, 1 (Jun. 2002), 217-227.
-
-
-
-
5
-
-
53349143922
-
-
Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In Proc. 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004.
-
-
-
-
6
-
-
0025693296
-
-
D. Tang, R. K. Iyer, and S. S. Subramani. Failure analysis and modelling of a VAXcluster system. In Proceedings of the 20th Intl. Symposium on Fault-Tolerant Computing, pages 244-251, 1990.
-
-
-
-
7
-
-
33845593340
-
-
Schroeder, B. and Gibson, G. A. 2006. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, June 2006.
-
-
-
-
8
-
-
4544382099
-
-
R. Sahoo, A. Sivasubramaniam, M. Squillante, and Y. Zhang. Failure Data Analysis of a Large-Scale Heterogeneous Server Environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks, pages 389-398, 2004.
-
-
-
-
9
-
-
33746286070
-
-
Oliner, A. J., Sahoo, R. K., Moreira, J. E., and Gupta, M. 2005. Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19 (April 04 - 08, 2005).
-
-
-
-
10
-
-
33845589803
-
-
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, and Ramendra Sahoo, "BlueGene/L Failure Analysis and Prediction Models," International Conference on Dependable Systems and Networks (DSN'06), pp. 425-434, 2006.
-
-
-
-
11
-
-
77952378080
-
-
R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 426-435, August 2003.
-
-
-
-
12
-
-
33751076285
-
-
Wu Linping, Meng Dan, Jianfeng Zhan, Wang Lei, and Tu Bibo, "A Failure-Aware Scheduling Strategy in Large-Scale Cluster System," Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp. 645-648, 2006.
-
-
-
-
15
-
-
53349143918
-
-
Narasimha Raju Gottumukkala, Chockchai Box Leangsuksun, and S. Scott, "Reliability-aware Approach to Improve Job Completion Time for Large-Scale Parallel Applications," the 2nd Workshop on HPCRI, held in conjunction with the IEEE 12th Intl. Symp. on HPCA, Austin, Texas, Feb. 11-15, 2006.
-
-
-
-
16
-
-
0035201417
-
James S. Plank and Michael G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, Volume 61, Issue 11, November 2001, Pages 1570-1590.
-
-
17
-
-
53349177156
-
-
Soong, T. T. "Model Verification," in Fundamentals of Probability and Statistics for Engineers. John Wiley & Sons Ltd., Chichester, UK, 2004, p. 327.
-
-
-
-
18
-
-
84955613215
-
D. G. Feitelson and L. Rudolph. Toward convergence in job schedulers for parallel supercomputers. In Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 1-26. Springer-Verlag, 1996.
-