Volume , Issue , 2007, Pages 39-46

Fault-driven re-scheduling for improving system-level fault resilience

Author keywords

[No Author keywords available]

Indexed keywords

FAULT TOLERANCE; FORECASTING; PRODUCTIVITY;

EID: 47249092857     PISSN: 01903918     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/ICPP.2007.42     Document Type: Conference Paper
Times cited : (19)

References (30)
  • 1
    • 28044457320
    • Monitoring Hard Disk with SMART
    • January
    • B. Allen, "Monitoring Hard Disk with SMART", Linux Journal, January, 2004.
    • (2004) Linux Journal
    • Allen, B.1
  • 2
    • 4544337911
    • Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-Peer Systems
    • IEEE Computer Society, Chicago, IL
    • J. Brevik, D. Nurmi, and R. Wolski, "Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-Peer Systems", Proc. of IEEE CCGrid, IEEE Computer Society, Chicago, IL, 2004, pp. 190-199.
    • (2004) Proc. of IEEE CCGrid , pp. 190-199
    • Brevik, J.1    Nurmi, D.2    Wolski, R.3
  • 3
    • 23944436115
    • New Grid Scheduling and Rescheduling Methods in the GrADS Project
    • F. Berman, H. Casanova, et al., "New Grid Scheduling and Rescheduling Methods in the GrADS Project", Intl. Journal of Parallel Programming, 2005, pp. 209-229
    • (2005) Intl. Journal of Parallel Programming , pp. 209-229
    • Berman, F.1    Casanova, H.2
  • 4
    • 1542383568
    • Reliable matching and scheduling of precedence-constrained tasks in heterogeneous distributed computing
    • IEEE Computer Society, Toronto, Canada
    • A. Dogan, F. Ozguner, "Reliable matching and scheduling of precedence-constrained tasks in heterogeneous distributed computing," in Proc. of the ICPP, IEEE Computer Society, Toronto, Canada, 2000, pp. 307
    • (2000) Proc. of the ICPP , pp. 307
    • Dogan, A.1    Ozguner, F.2
  • 5
    • 33751107476
    • MPI-Mitten: Enabling Migration Technology in MPI
    • IEEE Computer Society, Singapore
    • Cong Du and Xian-He Sun, "MPI-Mitten: Enabling Migration Technology in MPI", in Proc. of CCGRID, IEEE Computer Society, Singapore, 2006, pp. 11-18
    • (2006) Proc. of CCGRID , pp. 11-18
    • Du, C.1    Sun, X.-H.2
  • 6
    • 9144223280
    • Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
    • Elmootazbellah N. Elnozahy and James S. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery", IEEE Transactions on Dependable and Secure Computing, Volume 1, No 2, 2004, pp. 97-108.
    • (2004) IEEE Transactions on Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
    • Elnozahy, E.N.1    Plank, J.S.2
  • 8
    • 47249123819
    • Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters
    • P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters", in Proc. of ICPP07, 2007
    • (2007) Proc. of ICPP07
    • Gujrati, P.1    Li, Y.2    Lan, Z.3    Thakur, R.4    White, J.5
  • 9
    • 0037342701
    • A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults
    • C.-C. Han, K.G. Shin, J. Wu, "A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults," IEEE Trans. Computers, Vol. 52, No. 3, pp. 362-372, 2003
    • (2003) IEEE Trans. Computers , vol.52 , Issue.3 , pp. 362-372
    • Han, C.-C.1    Shin, K.G.2    Wu, J.3
  • 10
    • 47249157799
    • Advanced Failure Prediction in Complex Software Systems
    • G. Hoffmann, F. Salfner, M. Malek, "Advanced Failure Prediction in Complex Software Systems", in Proc. of SRDS, 2004
    • (2004) Proc. of SRDS
    • Hoffmann, G.1    Salfner, F.2    Malek, M.3
  • 14
    • 47249121426
    • IBM LoadLeveler for AIX 5L, available at http://publib.boulder.ibm.com
  • 15
    • 33751082401
    • Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing
    • Singapore
    • Yawei Li, Zhiling Lan, "Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing", in Proc. of IEEE CCGrid'06, Singapore, 2006, pp. 531-538
    • (2006) Proc. of IEEE CCGrid'06 , pp. 531-538
    • Li, Y.1    Lan, Z.2
  • 16
    • 47249160413
    • Hardware monitoring by lm sensors, available at http://secure.netroedge.com/-lm78/info.html.
  • 18
    • 0003912256
    • Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System
    • University of Wisconsin-Madison Computer Science Technical Report #1346
    • M. Litzkow, T. Tannenbaum, et al., "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Science Technical Report #1346, 1997.
    • (1997)
    • Litzkow, M.1    Tannenbaum, T.2
  • 19
    • 36949009638
    • Scalable Diskless Checkpointing for Large Parallel Systems
    • Ph.D. thesis, University of Illinois at Urbana-Champaign
    • Charng-Da Lu, "Scalable Diskless Checkpointing for Large Parallel Systems", Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005
    • (2005)
    • Lu, C.-D.1
  • 20
    • 84872514589
    • Moab Workload Manager
    • Moab Workload Manager, available at http://www.clusterresources.com
  • 21
    • 0035363047
    • A. Mu'alem and D. Feitelson, Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, in IEEE Trans. Parallel and Distributed Systems, 12(6), 2001, pp. 529-543
    • A. Mu'alem and D. Feitelson, "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling", in IEEE Trans. Parallel and Distributed Systems, Vol. 12(6), 2001, pp. 529-543
  • 22
    • 12444257746
    • A. Oliner, Ramendra K. Sahoo, José E. Moreira, Manish Gupta, Anand Sivasubramaniam, Fault-Aware Job Scheduling for BlueGene/L Systems, in Proc. of IPDPS, 2004
    • A. Oliner, Ramendra K. Sahoo, José E. Moreira, Manish Gupta, Anand Sivasubramaniam, "Fault-Aware Job Scheduling for BlueGene/L Systems", in Proc. of IPDPS, 2004
  • 23
    • 0343644421
    • Parallel Workloads Archive
    • Parallel Workloads Archive, available at http://www.cs.huji.ac.il/labs/parallel/workload/
  • 24
    • 47249098059
    • System-level fault-tolerance in large-scale parallel machines with buffered coscheduling
    • Petrini, F.; Davis, K.; Sancho, J.C., "System-level fault-tolerance in large-scale parallel machines with buffered coscheduling", in Proc. of IPDPS, 2004, pp. 209
    • (2004) Proc. of IPDPS , pp. 209
    • Petrini, F.1    Davis, K.2    Sancho, J.C.3
  • 25
    • 0032683084
    • Safety and Reliability Driven Task Allocation in Distributed Systems
    • S. Srinivasan, and N.K. Jha, "Safety and Reliability Driven Task Allocation in Distributed Systems," in IEEE Trans. Parallel and Distributed Systems, Vol 10(3), 1999, pp. 238-251
    • (1999) IEEE Trans. Parallel and Distributed Systems , vol.10 , Issue.3 , pp. 238-251
    • Srinivasan, S.1    Jha, N.K.2
  • 26
    • 20444463471
    • A Dynamic and Reliability-driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters
    • X. Qin and H. Jiang, "A Dynamic and Reliability-driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters," in Journal of Parallel and Distributed Computing, vol. 65, no. 8, 2005, pp. 885-900.
    • (2005) Journal of Parallel and Distributed Computing , vol.65 , Issue.8 , pp. 885-900
    • Qin, X.1    Jiang, H.2
  • 27
    • 77952378080
    • Critical Event Prediction for Proactive Management in Large-scale Computer Clusters
    • Washington DC, USA
    • Ramendra K. Sahoo, A. Oliner, et al., "Critical Event Prediction for Proactive Management in Large-scale Computer Clusters", in Proc. of KDD, Washington DC, USA, 2003, pp. 426-435
    • (2003) Proc. of KDD , pp. 426-435
    • Sahoo, R.K.1    Oliner, A.2
  • 28
    • 0026923304
    • Task Allocation for Maximizing Reliability of Distributed Computer Systems
    • S. Shatz, J. Wang, and M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems", in IEEE Trans. on Computers, Vol 41(9), 1992, pp. 1156-1168
    • (1992) IEEE Trans. on Computers , vol.41 , Issue.9 , pp. 1156-1168
    • Shatz, S.1    Wang, J.2    Goto, M.3
  • 29
    • 78149354391
    • Predicting Rare Events in Temporal Domains
    • R. Vilalta and S. Ma, "Predicting Rare Events in Temporal Domains", in Proc. of IEEE ICDM, 2002, pp. 474-481
    • (2002) Proc. of IEEE ICDM , pp. 474-481
    • Vilalta, R.1    Ma, S.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.