



Volume , Issue , 2009, Pages

Reliability-aware scalability models for high performance computing

Author keywords

[No Author keywords available]

Indexed keywords

ANALYTICAL TOOL; APPLICATION PERFORMANCE; APPLICATION SCALABILITY; DEVELOPED MODEL; FAULT TOLERANCE TECHNIQUES; HIGH PERFORMANCE COMPUTING; PARALLEL APPLICATION; TRACE-BASED SIMULATION;

EID: 72049124295     PISSN: 15525244     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/CLUSTR.2009.5289177     Document Type: Conference Paper
Times cited : (27)

References (38)
  • 1
    • 36049013419
    • What supercomputers say: A study of five system logs
    • A. Oliner and J. Stearley, "What Supercomputers Say: A Study of Five System Logs," Proc. of DSN, 2007.
    • (2007) Proc. of DSN
    • Oliner, A.1    Stearley, J.2
  • 2
    • 33845593340
    • A large-scale study of failures in high-performance-computing systems
    • B. Schroeder and G. Gibson, "A Large-scale Study of Failures in High-performance-computing Systems," Proc. of DSN, 2006.
    • (2006) Proc. of DSN
    • Schroeder, B.1    Gibson, G.2
  • 3
    • 85060036181
    • Validity of the single processor approach to achieving large-scale computing capabilities
    • G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. of AFIPS Spring Joint Computer Conference, 1967.
    • (1967) Proc. of AFIPS Spring Joint Computer Conference
    • Amdahl, G.1
  • 4
    • 0024012163
    • Reevaluating Amdahl's law
    • J. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, 31(5):532-533, 1988.
    • (1988) Communications of the ACM , vol.31 , Issue.5 , pp. 532-533
    • Gustafson, J.1
  • 9
    • 0025502686
    • Error log analysis: Statistical modeling and heuristic trend analysis
    • T. Lin and D. Siewiorek, "Error log analysis: statistical modeling and heuristic trend analysis," IEEE Trans. on Reliability, 39(4):419-432, 1990.
    • (1990) IEEE Trans. on Reliability , vol.39 , Issue.4 , pp. 419-432
    • Lin, T.1    Siewiorek, D.2
  • 11
    • 52949107193
    • Algorithm-system scalability of heterogeneous computing
    • Y. Chen, X. Sun, and M. Wu, "Algorithm-System Scalability of Heterogeneous Computing," Journal of Parallel and Distributed Computing, 68(11):1403-1412, 2008.
    • (2008) Journal of Parallel and Distributed Computing , vol.68 , Issue.11 , pp. 1403-1412
    • Chen, Y.1    Sun, X.2    Wu, M.3
  • 12
    • 33745170068
    • Scalability of heterogeneous computing
    • X. Sun, Y. Chen, and M. Wu, "Scalability of Heterogeneous Computing," Proc. of ICPP, 2005.
    • (2005) Proc. of ICPP
    • Sun, X.1    Chen, Y.2    Wu, M.3
  • 14
    • 56749158844
    • Performance under failure of high-end computing
    • M. Wu, X. Sun, and H. Jin, "Performance under Failure of High-End Computing," Proc. of SuperComputing, 2007.
    • (2007) Proc. of SuperComputing
    • Wu, M.1    Sun, X.2    Jin, H.3
  • 15
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • J. Daly, "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps," Future Generation Computer Systems, 22(3): 303-312, 2006.
    • (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
    • Daly, J.1
  • 16
    • 0012237782
    • Minimizing completion time of a program by checkpointing and rejuvenation
    • S. Garg, Y. Huang, C. Kintala, and K. Trivedi, "Minimizing Completion Time of a Program by Checkpointing and Rejuvenation," Proc. of SIGMETRICS, 1996.
    • (1996) Proc. of SIGMETRICS
    • Garg, S.1    Huang, Y.2    Kintala, C.3    Trivedi, K.4
  • 17
    • 0035201417
    • Processor allocation and checkpoint interval selection in cluster computing systems
    • J. Plank and M. Thomason, "Processor allocation and checkpoint interval selection in cluster computing systems," Journal of Parallel and Distributed Computing, 61(11): 1570-1590, 2001.
    • (2001) Journal of Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
    • Plank, J.1    Thomason, M.2
  • 18
    • 85014175705
    • Experimental assessment of workstation failures and their impact on checkpointing systems
    • J. Plank and W. Elwasif, "Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems," Proc. of FTCS, 1998.
    • (1998) Proc. of FTCS
    • Plank, J.1    Elwasif, W.2
  • 19
    • 9144223280
    • Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
    • E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Trans. on Dependable and Secure Computing, 1(2):97-108, 2004.
    • (2004) IEEE Trans. on Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
    • Elnozahy, E.1    Plank, J.2
  • 20
  • 21
    • 57049111494
    • Adaptive fault management of parallel applications for high performance computing
    • Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, 57(12): 1647-1660, 2008.
    • (2008) IEEE Trans. Computers , vol.57 , Issue.12 , pp. 1647-1660
    • Lan, Z.1    Li, Y.2
  • 22
    • 55849147399
    • Dynamic meta-learning for failure prediction in large-scale systems: A case study
    • J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B-H. Park, "Dynamic Meta-Learning for Failure Prediction in Large-scale Systems: A Case Study," Proc. of ICPP, 2008.
    • (2008) Proc. of ICPP
    • Gu, J.1    Zheng, Z.2    Lan, Z.3    White, J.4    Hocks, E.5    Park, B.-H.6
  • 24
    • 84976846528
    • A first order approximation to the optimal checkpoint interval
    • J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, 17(9): 530-531, 1974.
    • (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.1
  • 26
    • 33746286070
    • Performance implications of periodic checkpointing on large-scale cluster systems
    • A. Oliner, R. Sahoo, J. Moreira, and M. Gupta, "Performance Implications of Periodic Checkpointing on Large-scale Cluster Systems," Proc. of IPDPS, 2005.
    • (2005) Proc. of IPDPS
    • Oliner, A.1    Sahoo, R.2    Moreira, J.3    Gupta, M.4
  • 28
    • 72049129021
    • Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
    • A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, "Improved message logging versus improved coordinated checkpointing for fault tolerant MPI," Proc. of Cluster, 2003.
    • (2003) Proc. of Cluster
    • Bouteiller, A.1    Lemarinier, P.2    Krawezik, G.3    Cappello, F.4
  • 29
    • 85027617648
    • Analysis of scalability of parallel algorithms and architectures: A survey
    • V. Kumar and A. Gupta, "Analysis of scalability of parallel algorithms and architectures: a survey," Proc. of ICS, 1991.
    • (1991) Proc. of ICS
    • Kumar, V.1    Gupta, A.2
  • 30
    • 64049097304
    • Extending Amdahl's law for energy-efficient computing in the many-core era
    • D. Woo and H. Lee, "Extending Amdahl's law for energy-efficient computing in the many-core era," IEEE Computer, 41(12):24-31, 2008.
    • (2008) IEEE Computer , vol.41 , Issue.12 , pp. 24-31
    • Woo, D.1    Lee, H.2
  • 31
    • 34547424386
    • Cooperative checkpointing: A robust approach to large-scale systems reliability
    • A. Oliner, L. Rudolph, and R. Sahoo, "Cooperative checkpointing: A robust approach to large-scale systems reliability," Proc. of ICS, 2006.
    • (2006) Proc. of ICS
    • Oliner, A.1    Rudolph, L.2    Sahoo, R.3
  • 33
    • 12444268325
    • System-level fault-tolerance in large-scale parallel machines with buffered coscheduling
    • F. Petrini, K. Davis, and J. Sancho, "System-level fault-tolerance in large-scale parallel machines with buffered coscheduling," Proc. of IPDPS, 2004.
    • (2004) Proc. of IPDPS
    • Petrini, F.1    Davis, K.2    Sancho, J.3
  • 34
    • 0004244684
    • Checkpointing and modelling of program execution time
    • John Wiley and Sons
    • V. Nicola, "Checkpointing and modelling of program execution time," Software Fault Tolerance, John Wiley and Sons, 1995.
    • (1995) Software Fault Tolerance
    • Nicola, V.1
  • 35
    • 78649627101
    • A fast recovery mechanism for checkpointing in networked environments
    • Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments," Proc. of DSN, 2008.
    • (2008) Proc. of DSN
    • Li, Y.1    Lan, Z.2
  • 38
    • 50649107313
    • Application MTFE vs platform MTBF: A fresh perspective on system reliability and application throughput for computations at scale
    • J. Daly, L. Pritchett-Sheats, and S. Michala, "Application MTFE vs Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale," Proc. of CCGRID, 2008.
    • (2008) Proc. of CCGRID
    • Daly, J.1    Pritchett-Sheats, L.2    Michala, S.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.