SCOPUS Information Search Platform

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Volume , Issue , 2007, Pages 312-321

Reliability-aware resource allocation in HPC systems

Authors (5): Gottumukkala, Narasimha Raju (a); Leangsuksun, Chokchai Box (a); Taerat, Narate (a); Nassar, Raja (a); Scott, Stephen L. (b)

(a) Louisiana Tech University (United States)

(b) Oak Ridge National Laboratory (United States)

Author keywords

[No Author keywords available]

Indexed keywords

COMPUTER NETWORKS; IMAGE SEGMENTATION; PARALLEL ALGORITHMS; PLANNING; RELIABILITY; RESOURCE ALLOCATION; SOFTWARE RELIABILITY; WEIBULL DISTRIBUTION;

CHALLENGING PROBLEM; CLUSTER COMPUTING; COMPUTING SYSTEMS; EXPONENTIAL DISTRIBUTIONS; FAILURE BEHAVIORS; GOODNESS OF FITS; HARDWARE AND SOFTWARE COMPONENTS; HIGH-PERFORMANCE COMPUTING; HPC SYSTEMS; INTERNATIONAL CONFERENCES; LOGNORMAL; PARALLEL JOBS; PARALLEL PROGRAMS; PARALLEL WORKLOADS; PERFORMANCE LOSSES; RELIABILITY PREDICTIONS; RESEARCH EFFORTS; RESOURCE ALLOCATION ALGORITHMS; RESOURCE ALLOCATION MODEL; RESOURCE MANAGERS; RUN-TIME SYSTEMS; SIMULATION RESULTS; TIME VARYING; WEIBULL;

COMPUTER SYSTEMS;

EID: 53349172400
PISSN: 15525244
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1109/CLUSTR.2007.4629245
Document Type: Conference Paper

Times cited : (20)

References (18)

1
- 12444257746
- A. J. Oliner, R. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware Job Scheduling For BlueGene/L Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.

2
- 0026923304
- Task Allocation for Maximizing Reliability of Distributed Computer Systems
- S.M. Shatz, J.-P. Wang, M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems," IEEE Transactions on Computers, vol. 41, no. 9, pp. 1156-1168, Sept., 1992.
- (1992) IEEE Transactions on Computers , vol.41 , Issue.9 , pp. 1156-1168
- Shatz, S.M.¹ Wang, J.-P.² Goto, M.³

3
- 33646432010
- MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller, A. G. Shet, and P. Sadayappan. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems. ACM SIGOPS Operating Systems Review (OSR), 40(2), pages 63-72, 2006.
- (2006) ACM SIGOPS Operating Systems Review (OSR) , vol.40 , Issue.2 , pp. 63-72
- Engelmann, C.¹ Scott, S.L.² Bernholdt, D.E.³ Gottumukkala, N.R.⁴ Leangsuksun, C.⁵ Varma, J.⁶ Wang, C.⁷ Mueller, F.⁸ Shet, A.G.⁹ Sadayappan, P.¹⁰

4
- 0036041277
- Heath, T., Martin, R. P., and Nguyen, T. D. 2002. Improving cluster availability using workstation validation. SIGMETRICS Perform. Eval. Rev. 30, 1 (Jun. 2002), 217-227.

5
- 53349143922
- Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In Proc. 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004.

6
- 0025693296
- D. Tang, R. K. Iyer, and S. S. Subramani. Failure analysis and modelling of a VAXcluster system. In Proceedings of 20th Intl. Symposium on Fault-tolerant Computing, pages 244-251, 1990.

7
- 33845593340
- Schroeder, B. and Gibson, G. A. 2006. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, June 2006.

8
- 4544382099
- R. Sahoo, A. Sivasubramaniam, M. Squillante, and Y. Zhang. Failure Data Analysis of a Large-Scale Heterogeneous Server Environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks, pages 389-398, 2004.

9
- 33746286070
- Oliner, A. J., Sahoo, R. K., Moreira, J. E., and Gupta, M. 2005. Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19 (April 04 - 08, 2005).

10
- 33845589803
- Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra Sahoo, "BlueGene/L Failure Analysis and Prediction Models," dsn, pp. 425-434, International Conference on Dependable Systems and Networks (DSN'06), 2006

11
- 77952378080
- R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery Data Mining, pages 426-435, August 2003.

12
- 33751076285
- Wu Linping, Meng Dan, Jianfeng Zhan, Wang Lei, Tu Bibo, "A Failure-Aware Scheduling Strategy in Large-Scale Cluster System," ccgrid, pp. 645-648, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006.

13
- 0004261043
- Tobias, P.A. and Trindade, D.C., 1995. Applied Reliability, 2nd edition, Chapman & Hall/CRC
- (1995) Applied Reliability
- Tobias, P.A.¹ Trindade, D.C.²

14
- 0343644421
- D. G. Feitelson. Parallel workloads archive, http://cs.huji.ac.il/labs/parallel/workload/index.html, 2001.
- (2001) Parallel workloads archive
- Feitelson, D.G.¹

15
- 53349143918
- Narasimha Raju Gottumukkala, Chokchai Box Leangsuksun, and S. Scott, "Reliability-aware Approach to Improve Job Completion Time for Large-Scale Parallel Applications", the 2nd workshop on HPCRI, held in conjunction with the IEEE 12th Intl Symp on HPCA, Austin, Texas, Feb 11-15, 2006.

16
- 0035201417
- Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
- James S. Plank and Michael G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, Volume 61, Issue 11, November 2001, Pages 1570-1590.
- (2001) Journal of Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
- Plank, J.S.¹ Thomason, M.G.²

17
- 53349177156
- Soong, T. T. "Model Verification", in Fundamentals of Probability and Statistics for Engineers. John Wiley & Sons Ltd., Chichester, UK, 2004, p. 327.

18
- 84955613215
- Toward convergence in job schedulers for parallel supercomputers, Job Scheduling Strategies for Parallel Processing
- D. G. Feitelson and L. Rudolph, Toward convergence in job schedulers for parallel supercomputers, Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 1-26. Springer-Verlag, 1996
- (1996) Lecture Notes in Computer Science , vol.1162 , pp. 1-26
- Feitelson, D.G.¹ Rudolph, L.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.