Volume , Issue , 2007, Pages 312-321

Reliability-aware resource allocation in HPC systems

Author keywords

[No Author keywords available]

Indexed keywords

COMPUTER NETWORKS; IMAGE SEGMENTATION; PARALLEL ALGORITHMS; PLANNING; RELIABILITY; RESOURCE ALLOCATION; SOFTWARE RELIABILITY; WEIBULL DISTRIBUTION;

EID: 53349172400     PISSN: 15525244     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/CLUSTR.2007.4629245     Document Type: Conference Paper
Times cited: 20

References (18)
  • 1
    • 12444257746
    • A. J. Oliner, R. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware Job Scheduling For BlueGene/L Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
  • 2
    • 0026923304
    • S.M. Shatz, J.-P. Wang, M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems," IEEE Transactions on Computers, vol. 41, no. 9, pp. 1156-1168, Sept. 1992.
  • 4
    • 0036041277
    • Heath, T., Martin, R. P., and Nguyen, T. D. 2002. Improving cluster availability using workstation validation. SIGMETRICS Perform. Eval. Rev. 30, 1 (Jun. 2002), 217-227.
  • 5
    • 53349143922
    • Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In Proc. 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004.
  • 6
    • 0025693296
    • D. Tang, R. K. Iyer, and S. S. Subramani. Failure analysis and modelling of a VAXcluster system. In Proceedings of the 20th Intl. Symposium on Fault-Tolerant Computing, pages 244-251, 1990.
  • 7
    • 33845593340
    • Schroeder, B. and Gibson, G. A. 2006. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, June 2006.
  • 8
    • 4544382099
    • R. Sahoo, A. Sivasubramaniam, M. Squillante, and Y. Zhang. Failure Data Analysis of a Large-Scale Heterogeneous Server Environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks, pages 389-398, 2004.
  • 9
    • 33746286070
    • Oliner, A. J., Sahoo, R. K., Moreira, J. E., and Gupta, M. 2005. Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19 (April 04 - 08, 2005).
  • 10
    • 33845589803
    • Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra Sahoo, "BlueGene/L Failure Analysis and Prediction Models," dsn, pp. 425-434, International Conference on Dependable Systems and Networks (DSN'06), 2006.
  • 11
    • 77952378080
    • R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery Data Mining, pages 426-435, August 2003.
  • 12
    • 33751076285
    • Wu Linping, Meng Dan, Jianfeng Zhan, Wang Lei, Tu Bibo, "A Failure-Aware Scheduling Strategy in Large-Scale Cluster System," ccgrid, pp. 645-648, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006.
  • 15
    • 53349143918
    • Narasimha Raju Gottumukkala, Chockchai Box Leangsuksun, and S. Scott, "Reliability-aware Approach to Improve Job Completion Time for Large-Scale Parallel Applications," the 2nd workshop on HPCRI, held in conjunction with the IEEE 12th Intl Symp on HPCA, Austin, Texas, Feb 11-15, 2006.
  • 16
    • 0035201417
    • James S. Plank and Michael G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, Volume 61, Issue 11, November 2001, Pages 1570-1590.
  • 17
    • 53349177156
    • Soong T T. "Model Verification", in Fundamentals of Probability and Statistics for Engineers. John Wiley & Sons Ltd., Chichester, UK, 2004, p. 327.
  • 18
    • 84955613215
    • D. G. Feitelson and L. Rudolph, Toward convergence in job schedulers for parallel supercomputers, Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 1-26. Springer-Verlag, 1996.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.