



Volume 20, Issue 4, 2009, Pages 460-473

Fault-aware runtime strategies for high-performance computing

Author keywords

0-1 knapsack; Fault tolerance; High performance computing; Performance; Reliability; Runtime strategies

Indexed keywords

0-1 KNAPSACK; CRITICAL CHALLENGES; FAILURE PREDICTION; FAULT MANAGEMENT; FAULT TOLERANCE TECHNIQUES; HIGH-PERFORMANCE COMPUTING; KEY ISSUES; KNAPSACK MODEL; PARALLEL SYSTEM; PERFORMANCE; PRODUCTION SYSTEM; REAL TRACE; RUNTIME; RUNTIME STRATEGIES; RUNTIME SYSTEMS; SYNTHETIC DATA; SYSTEM PRODUCTIVITY

EID: 67649883517     PISSN: 10459219     EISSN: None     Source Type: Journal    
DOI: 10.1109/TPDS.2008.128     Document Type: Article
Times cited: 25

References (48)
  • 1
    • 0035877334
    • Scheduling with Unexpected Machine Breakdowns
    • S. Albers and G. Schmidt, "Scheduling with Unexpected Machine Breakdowns," Discrete Applied Math., vol. 110, nos. 2-3, pp. 85-99, 2001.
    • (2001) Discrete Applied Math , vol.110 , Issue.2-3 , pp. 85-99
    • Albers, S.1    Schmidt, G.2
  • 2
    • 23944436115
    • New Grid Scheduling and Rescheduling Methods in the GrADS Project
    • F. Berman et al., "New Grid Scheduling and Rescheduling Methods in the GrADS Project," Int'l J. Parallel Programming, vol. 33, nos. 2-3, pp. 209-229, 2005.
    • (2005) Int'l J. Parallel Programming , vol.33 , Issue.2-3 , pp. 209-229
    • Berman, F.1
  • 7
    • 1542383568
    • Reliable Matching and Scheduling of Precedence-Constrained Tasks in Heterogeneous Distributed Computing
    • A. Dogan and F. Ozguner, "Reliable Matching and Scheduling of Precedence-Constrained Tasks in Heterogeneous Distributed Computing," Proc. Int'l Conf. Parallel Processing (ICPP '00), pp. 307-314, 2000.
    • (2000) Proc. Int'l Conf. Parallel Processing (ICPP '00) , pp. 307-314
    • Dogan, A.1    Ozguner, F.2
  • 9
    • 0042078549
    • A Survey of Rollback Recovery Protocols in Message-Passing Systems
    • E. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson, "A Survey of Rollback Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
    • (2002) ACM Computing Surveys , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.1    Alvisi, L.2    Wang, Y.3    Johnson, D.4
  • 10
    • 0022891004
    • Distributed Functions Allocation for Reliability and Delay Optimization
    • S. Hariri and C. Raghavendra, "Distributed Functions Allocation for Reliability and Delay Optimization," Proc. ACM Fall Joint Computer Conf. (FJCC '86), pp. 344-352, 1986.
    • (1986) Proc. ACM Fall Joint Computer Conf. (FJCC '86) , pp. 344-352
    • Hariri, S.1    Raghavendra, C.2
  • 13
  • 14
    • 0000412757
    • Task Allocation Algorithms for Maximizing Reliability of Distributed Computing Systems
    • S. Kartik and C. Murthy, "Task Allocation Algorithms for Maximizing Reliability of Distributed Computing Systems," IEEE Trans. Computer Systems, vol. 46, pp. 719-724, 1997.
    • (1997) IEEE Trans. Computer Systems , vol.46 , pp. 719-724
    • Kartik, S.1    Murthy, C.2
  • 17
    • 57049111494
    • Adaptive Fault Management of Parallel Applications for High Performance Computing
    • Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, vol. 57, no. 12, pp. 1647-1660, Dec. 2008.
    • (2008) IEEE Trans. Computers , vol.57 , Issue.12 , pp. 1647-1660
    • Lan, Z.1    Li, Y.2
  • 18
    • 0003912256
    • Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System
    • Technical Report 1346, Univ. of Wisconsin-Madison Computer Science
    • M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System," Technical Report 1346, Univ. of Wisconsin-Madison Computer Science, 1997.
    • (1997)
    • Litzkow, M.1    Tannenbaum, T.2    Basney, J.3    Livny, M.4
  • 19
    • 36949009638
    • Scalable Diskless Checkpointing for Large Parallel Systems
    • PhD dissertation, Univ. of Illinois at Urbana-Champaign
    • C. Lu, "Scalable Diskless Checkpointing for Large Parallel Systems," PhD dissertation, Univ. of Illinois at Urbana-Champaign, 2005.
    • (2005)
    • Lu, C.1
  • 20
    • 0035363047
    • Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling
    • A. Mu'alem and D. Feitelson, "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 6, pp. 529-543, June 2001.
    • (2001) IEEE Trans. Parallel and Distributed Systems , vol.12 , Issue.6 , pp. 529-543
    • Mu'alem, A.1    Feitelson, D.2
  • 26
    • 84898046897
    • Scaling to Thousands of Processors with Buffered Coscheduling
    • F. Petrini, "Scaling to Thousands of Processors with Buffered Coscheduling," Proc. Scaling to New Height Workshop, 2002.
    • (2002) Proc. Scaling to New Height Workshop
    • Petrini, F.1
  • 29
    • 0035201417
    • Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
    • J. Plank and M. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, 2001.
    • (2001) J. Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
    • Plank, J.1    Thomason, M.2
  • 31
    • 77952378080
    • Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters
    • R. Sahoo et al., "Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDDM '03), pp. 426-435, 2003.
    • (2003) Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDDM '03) , pp. 426-435
    • Sahoo, R.1
  • 34
    • 0026923304
    • Task Allocation for Maximizing Reliability of Distributed Computer Systems
    • S. Shatz, J. Wang, and M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems," IEEE Trans. Computers, vol. 41, no. 9, pp. 1156-1168, Sept. 1992.
    • (1992) IEEE Trans. Computers , vol.41 , Issue.9 , pp. 1156-1168
    • Shatz, S.1    Wang, J.2    Goto, M.3
  • 36
    • 84859478556
    • A Survey of Process Migration Mechanisms
    • J. Smith, "A Survey of Process Migration Mechanisms," Operating Systems Rev., vol. 22, no. 3, pp. 102-106, 1988.
    • (1988) Operating Systems Rev , vol.22 , Issue.3 , pp. 102-106
    • Smith, J.1
  • 38
    • 0032683084
    • Safety and Reliability Driven Task Allocation in Distributed Systems
    • S. Srinivasan and N. Jha, "Safety and Reliability Driven Task Allocation in Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 3, Mar. 1999.
    • (1999) IEEE Trans. Parallel and Distributed Systems , vol.10 , Issue.3
    • Srinivasan, S.1    Jha, N.2
  • 39
    • 34248674898
    • Backfilling Using System-Generated Predictions Rather than User Runtime Estimates
    • D. Tsafrir, Y. Etsion, and D. Feitelson, "Backfilling Using System-Generated Predictions Rather than User Runtime Estimates," IEEE Trans. Parallel and Distributed Systems, vol. 18, no. 6, June 2007.
    • (2007) IEEE Trans. Parallel and Distributed Systems , vol.18 , Issue.6
    • Tsafrir, D.1    Etsion, Y.2    Feitelson, D.3
  • 42
    • 84976846528
    • A First Order Approximation to the Optimal Checkpoint Interval
    • J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," ACM Comm., vol. 17, no. 9, pp. 530-531, 1974.
    • (1974) ACM Comm , vol.17 , Issue.9 , pp. 530-531
    • Young, J.1
  • 46
    • 0004429467
    • Kiviat Graphs: Conventions and Figures of Merit
    • M. Morris, "Kiviat Graphs: Conventions and Figures of Merit," ACM SIGMETRICS Performance Evaluation Rev., vol. 3, no. 3, 1974.
    • (1974) ACM SIGMETRICS Performance Evaluation Rev , vol.3 , Issue.3
    • Morris, M.1
  • 47
    • 56749178938
    • Exploring Event Correlation for Failure Prediction in Coalitions of Clusters
    • S. Fu and C.Z. Xu, "Exploring Event Correlation for Failure Prediction in Coalitions of Clusters," Proc. ACM/IEEE Conf. Supercomputing (SC), 2007.
    • (2007) Proc. ACM/IEEE Conf. Supercomputing (SC)
    • Fu, S.1    Xu, C.Z.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.