SCOPUS Information Retrieval Platform

Proceedings of the International Conference on Parallel Processing

Volume , Issue , 2007, Pages 39-46

Fault-driven re-scheduling for improving system-level fault resilience

(4) Li, Yawei a; Gujrati, Prashasta a; Lan, Zhiling a; Sun, Xian He a,b

a Illinois Institute of Technology (United States)

b Fermi National Accelerator Laboratory (United States)

Author keywords

[No Author keywords available]

Indexed keywords

FAULT TOLERANCE; FORECASTING; PRODUCTIVITY;

CONVENTIONAL METHODS; FAILURE PREDICTION; IMPROVING SYSTEMS; PERFORMANCE IMPACT; POTENTIAL FAILURES; SYSTEM PRODUCTIVITY; SYSTEM RESILIENCES; TOLERANCE APPROACH;

SCHEDULING;

EID: 47249092857 PISSN: 01903918 EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/ICPP.2007.42 Document Type: Conference Paper

Times cited : (19)

References (30)

1
- 28044457320
- Monitoring Hard Disk with SMART
- January
- B. Allen, "Monitoring Hard Disk with SMART", Linux Journal, January, 2004.
- (2004) Linux Journal
- Allen, B.¹

2
- 4544337911
- Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-Peer Systems
- IEEE Computer Society, Chicago, IL
- J. Brevik, D. Nurmi, and R. Wolski, "Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-Peer Systems", Proc. of IEEE CCGrid, IEEE Computer Society, Chicago, IL, 2004, pp. 190-199.
- (2004) Proc. of IEEE CCGrid , pp. 190-199
- Brevik, J.¹ Nurmi, D.² Wolski, R.³

3
- 23944436115
- New Grid Scheduling and Rescheduling Methods in the GrADS Project
- F. Berman, H. Casanova, et al., "New Grid Scheduling and Rescheduling Methods in the GrADS Project", Intl. Journal of Parallel Programming, 2005, pp. 209-229
- (2005) Intl. Journal of Parallel Programming , pp. 209-229
- Berman, F.¹ Casanova, H.²

4
- 1542383568
- Reliable matching and scheduling of precedence-constrained tasks in heterogeneous distributed computing
- IEEE Computer Society, Toronto, Canada
- A. Dogan, F. Ozguner, "Reliable matching and scheduling of precedence-constrained tasks in heterogeneous distributed computing," in Proc. of the ICPP, IEEE Computer Society, Toronto, Canada, 2000, pp. 307
- (2000) Proc. of the ICPP , pp. 307
- Dogan, A.¹ Ozguner, F.²

5
- 33751107476
- MPI-Mitten: Enabling Migration Technology in MPI
- IEEE Computer Society, Singapore
- Cong Du and Xian-He Sun, "MPI-Mitten: Enabling Migration Technology in MPI", in Proc. of CCGRID, IEEE Computer Society, Singapore, 2006, pp. 11-18
- (2006) Proc. of CCGRID , pp. 11-18
- Du, C.¹ Sun, X.-H.²

6
- 9144223280
- Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
- Elmootazbellah N. Elnozahy and James S. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery", IEEE Transactions on Dependable and Secure Computing, Volume 1, No 2, 2004, pp. 97-108.
- (2004) IEEE Transactions on Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
- Elnozahy, E.N.¹ Plank, J.S.²

7
- 0343644421
- D. Feitelson. Parallel Workloads Archive http://cs.huji.ac.il/labs/parallel/workload/index.html
- Parallel Workloads Archive
- Feitelson, D.¹

8
- 47249123819
- Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters
- P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters", in Proc. of ICPP07, 2007
- (2007) Proc. of ICPP07
- Gujrati, P.¹ Li, Y.² Lan, Z.³ Thakur, R.⁴ White, J.⁵

9
- 0037342701
- A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults
- C.-C. Han, K.G. Shin, J. Wu, "A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults," IEEE Trans. Computers, Vol. 52, No. 3, pp. 362-372, 2003
- (2003) IEEE Trans. Computers , vol.52 , Issue.3 , pp. 362-372
- Han, C.-C.¹ Shin, K.G.² Wu, J.³

10
- 47249157799
- Advanced Failure Prediction in Complex Software Systems
- G. Hoffmann, F. Salfner, M. Malek, "Advanced Failure Prediction in Complex Software Systems", in Proc. of SRDS, 2004
- (2004) Proc. of SRDS
- Hoffmann, G.¹ Salfner, F.² Malek, M.³

11
- 1242329663
- Application Of A Model-Based Fault Detection System To Nuclear Plant Signals
- Seoul, Korea
- K. C. Gross, R. M. Singer, S. W. Wegerich, J. P. Herzog, R. VanAlstine, and F. Bockhorst, "Application Of A Model-Based Fault Detection System To Nuclear Plant Signals", in Proc. of ISAP, Seoul, Korea, 1997, pp. 66-70
- (1997) Proc. of ISAP , pp. 66-70
- Gross, K.C.¹ Singer, R.M.² Wegerich, S.W.³ Herzog, J.P.⁴ VanAlstine, R.⁵ Bockhorst, F.⁶

12
- 47249140942
- Health Application Programming Interface, http://www.renci.org
- Health Application Programming Interface

13
- 33749680779
- A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster
- Austin, TX
- C. Leangsuksun et al, "A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster", in Proc. of LCI International Conference on Linux Clusters: The HPC Revolution 2004, Austin, TX, 2004
- (2004) Proc. of LCI International Conference on Linux Clusters: The HPC Revolution 2004
- Leangsuksun, C.¹

14
- 47249121426
- IBM LoadLeveler for AIX 5L, available at http://publib.boulder.ibm.com
- IBM LoadLeveler for AIX 5L

15
- 33751082401
- Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing
- Singapore
- Yawei Li, Zhiling Lan, "Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing", in Proc. of IEEE CCGrid'06, Singapore, 2006, pp. 531-538
- (2006) Proc. of IEEE CCGrid'06 , pp. 531-538
- Li, Y.¹ Lan, Z.²

16
- 47249160413
- Hardware monitoring by lm sensors, available at http://secure.netroedge.com/-lm78/info.html.
- Hardware monitoring by lm sensors

17
- 31844445405
- R. Lawrence, "A Survey of Process Migration Mechanisms", http://www.cs.uiowa.edu/~rlawrenc/research/Papers/proc_mig.pdf
- A Survey of Process Migration Mechanisms
- Lawrence, R.¹

18
- 0003912256
- Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System
- University of Wisconsin-Madison Computer Science Technical Report #1346
- M. Litzkow, T. Tannenbaum, et al., "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Science Technical Report #1346, 1997.
- (1997)
- Litzkow, M.¹ Tannenbaum, T.²

19
- 36949009638
- Scalable Diskless Checkpointing for Large Parallel Systems
- Ph.D. thesis, University of Illinois at Urbana-Champaign
- Charng-Da Lu, "Scalable Diskless Checkpointing for Large Parallel Systems", Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005
- (2005)
- Lu, C.-D.¹

20
- 84872514589
- Moab Workload Manager, available at http://www.clusterresources.com
- Moab Workload Manager

21
- 0035363047
- Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling
- A. Mu'alem and D. Feitelson, "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling", in IEEE Trans. Parallel and Distributed Systems, Vol. 12(6), 2001, pp. 529-543
- (2001) IEEE Trans. Parallel and Distributed Systems , vol.12 , Issue.6 , pp. 529-543
- Mu'alem, A.¹ Feitelson, D.²

22
- 12444257746
- Fault-Aware Job Scheduling for BlueGene/L Systems
- A. Oliner, Ramendra K. Sahoo, José E. Moreira, Manish Gupta, Anand Sivasubramaniam, "Fault-Aware Job Scheduling for BlueGene/L Systems", in Proc. of IPDPS, 2004
- (2004) Proc. of IPDPS
- Oliner, A.¹ Sahoo, R.K.² Moreira, J.E.³ Gupta, M.⁴ Sivasubramaniam, A.⁵

23
- 0343644421
- Parallel Workloads Archive, available at http://www.cs.huji.ac.il/labs/parallel/workload/
- Parallel Workloads Archive

24
- 47249098059
- System-level fault-tolerance in large-scale parallel machines with buffered coscheduling
- Petrini, F.; Davis, K.; Sancho, J.C., "System-level fault-tolerance in large-scale parallel machines with buffered coscheduling", in Proc. of IPDPS, 2004, pp. 209
- (2004) Proc. of IPDPS , pp. 209
- Petrini, F.¹ Davis, K.² Sancho, J.C.³

25
- 0032683084
- Safety and Reliability Driven Task Allocation in Distributed Systems
- S. Srinivasan, and N.K. Jha, "Safety and Reliability Driven Task Allocation in Distributed Systems," in IEEE Trans. Parallel and Distributed Systems, Vol 10(3), 1999, pp. 238-251
- (1999) IEEE Trans. Parallel and Distributed Systems , vol.10 , Issue.3 , pp. 238-251
- Srinivasan, S.¹ Jha, N.K.²

26
- 20444463471
- A Dynamic and Reliability-driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters
- X. Qin and H. Jiang, "A Dynamic and Reliability-driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters," in Journal of Parallel and Distributed Computing, vol. 65, no. 8, 2005, pp. 885-900.
- (2005) Journal of Parallel and Distributed Computing , vol.65 , Issue.8 , pp. 885-900
- Qin, X.¹ Jiang, H.²

27
- 77952378080
- Critical Event Prediction for Proactive Management in Large-scale Computer Clusters
- Washington DC, USA
- Ramendra K. Sahoo, A. Oliner, et al., "Critical Event Prediction for Proactive Management in Large-scale Computer Clusters", in Proc. of KDD, Washington DC, USA, 2003, pp. 426-435
- (2003) Proc. of KDD , pp. 426-435
- Sahoo, R.K.¹ Oliner, A.²

28
- 0026923304
- Task Allocation for Maximizing Reliability of Distributed Computer Systems
- S. Shatz, J. Wang, and M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems", in IEEE Trans. on Computers, Vol 41(9), 1992, pp. 1156-1168
- (1992) IEEE Trans. on Computers , vol.41 , Issue.9 , pp. 1156-1168
- Shatz, S.¹ Wang, J.² Goto, M.³

29
- 78149354391
- Predicting Rare Events in Temporal Domains
- R. Vilalta and S. Ma, "Predicting Rare Events in Temporal Domains", in Proc. of IEEE ICDM, 2002, pp. 474-481
- (2002) Proc. of IEEE ICDM , pp. 474-481
- Vilalta, R.¹ Ma, S.²

30
- 33845595513
- Performance Implications of Failures in Large-Scale Cluster Scheduling
- New York, USA
- Y. Zhang et al., "Performance Implications of Failures in Large-Scale Cluster Scheduling", Proc. of 10th Workshop on JSSPP, held in conjunction with SIGMETRICS, New York, USA, 2004.
- (2004) Proc. of 10th Workshop on JSSPP, held in conjunction with SIGMETRICS
- Zhang, Y.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.