-
1
-
-
0035877334
-
Scheduling with Unexpected Machine Breakdowns
-
S. Albers and G. Schmidt, "Scheduling with Unexpected Machine Breakdowns," Discrete Applied Math., vol. 110, nos. 2-3, pp. 85-99, 2001.
-
(2001)
Discrete Applied Math
, vol.110
, Issue.2-3
, pp. 85-99
-
-
Albers, S.1
Schmidt, G.2
-
2
-
-
23944436115
-
New Grid Scheduling and Rescheduling Methods in the GrADS Project
-
F. Berman et al., "New Grid Scheduling and Rescheduling Methods in the GrADS Project," Int'l J. Parallel Programming, vol. 33, nos. 2-3, pp. 209-229, 2005.
-
(2005)
Int'l J. Parallel Programming
, vol.33
, Issue.2-3
, pp. 209-229
-
-
Berman, F.1
-
3
-
-
33746779994
-
MPICH-V: A Multiprotocol Automatic Fault Tolerant MPI
-
A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, "MPICH-V: A Multiprotocol Automatic Fault Tolerant MPI," Int'l J. High Performance Computing and Applications, vol. 20, no. 3, pp. 319-333, 2006.
-
(2006)
Int'l J. High Performance Computing and Applications
, vol.20
, Issue.3
, pp. 319-333
-
-
Bouteiller, A.1
Herault, T.2
Krawezik, G.3
Lemarinier, P.4
Cappello, F.5
-
5
-
-
50649108554
-
Proactive Fault Tolerance in MPI Applications via Task Migration
-
S. Chakravorty, C. Mendes, and L. Kale, "Proactive Fault Tolerance in MPI Applications via Task Migration," Proc. Int'l Conf. High Performance Computing (HiPC '06), p. 485, 2006.
-
(2006)
Proc. Int'l Conf. High Performance Computing (HiPC '06)
, p. 485
-
-
Chakravorty, S.1
Mendes, C.2
Kale, L.3
-
6
-
-
0004116989
-
-
second ed. The MIT Press and McGraw-Hill Book
-
T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, second ed. The MIT Press and McGraw-Hill Book, 2001.
-
(2001)
Introduction to Algorithms
-
-
Cormen, T.1
Leiserson, C.2
Rivest, R.3
Stein, C.4
-
7
-
-
1542383568
-
Reliable Matching and Scheduling of Precedence-Constrained Tasks in Heterogeneous Distributed Computing
-
A. Dogan and F. Ozguner, "Reliable Matching and Scheduling of Precedence-Constrained Tasks in Heterogeneous Distributed Computing," Proc. Int'l Conf. Parallel Processing (ICPP '00), pp. 307-314, 2000.
-
(2000)
Proc. Int'l Conf. Parallel Processing (ICPP '00)
, pp. 307-314
-
-
Dogan, A.1
Ozguner, F.2
-
9
-
-
0042078549
-
A Survey of Rollback Recovery Protocols in Message-Passing Systems
-
E. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson, "A Survey of Rollback Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.1
Alvisi, L.2
Wang, Y.3
Johnson, D.4
-
10
-
-
0022891004
-
Distributed Functions Allocation for Reliability and Delay Optimization
-
S. Hariri and C. Raghavendra, "Distributed Functions Allocation for Reliability and Delay Optimization," Proc. ACM Fall Joint Computer Conf. (FJCC '86), pp. 344-352, 1986.
-
(1986)
Proc. ACM Fall Joint Computer Conf. (FJCC '86)
, pp. 344-352
-
-
Hariri, S.1
Raghavendra, C.2
-
12
-
-
0003454649
-
-
Wiley-Interscience
-
R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley-Interscience, 1991.
-
(1991)
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling
-
-
Jain, R.1
-
13
-
-
67349247286
-
-
Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/, 2008.
-
(2008)
Parallel Workloads Archive
-
-
-
14
-
-
0000412757
-
Task Allocation Algorithms for Maximizing Reliability of Distributed Computing Systems
-
S. Kartik and C. Murthy, "Task Allocation Algorithms for Maximizing Reliability of Distributed Computing Systems," IEEE Trans. Computer Systems, vol. 46, pp. 719-724, 1997.
-
(1997)
IEEE Trans. Computer Systems
, vol.46
, pp. 719-724
-
-
Kartik, S.1
Murthy, C.2
-
16
-
-
47249092857
-
Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience
-
Y. Li, P. Gujrati, Z. Lan, and X. Sun, "Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience," Proc. Int'l Conf. Parallel Processing (ICPP), 2007.
-
(2007)
Proc. Int'l Conf. Parallel Processing (ICPP)
-
-
Li, Y.1
Gujrati, P.2
Lan, Z.3
Sun, X.4
-
17
-
-
57049111494
-
Adaptive Fault Management of Parallel Applications for High Performance Computing
-
Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, vol. 57, no. 12, pp. 1647-1660, Dec. 2008.
-
(2008)
IEEE Trans. Computers
, vol.57
, Issue.12
, pp. 1647-1660
-
-
Lan, Z.1
Li, Y.2
-
18
-
-
0003912256
-
Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System
-
Technical Report 1346, Univ. of Wisconsin-Madison Computer Science
-
M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System," Technical Report 1346, Univ. of Wisconsin-Madison Computer Science, 1997.
-
(1997)
-
-
Litzkow, M.1
Tannenbaum, T.2
Basney, J.3
Livny, M.4
-
19
-
-
36949009638
-
Scalable Diskless Checkpointing for Large Parallel Systems
-
PhD dissertation, Univ. of Illinois at Urbana-Champaign
-
C. Lu, "Scalable Diskless Checkpointing for Large Parallel Systems," PhD dissertation, Univ. of Illinois at Urbana-Champaign, 2005.
-
(2005)
-
-
Lu, C.1
-
20
-
-
0035363047
-
Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling
-
A. Mu'alem and D. Feitelson, "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 6, pp. 529-543, June 2001.
-
(2001)
IEEE Trans. Parallel and Distributed Systems
, vol.12
, Issue.6
, pp. 529-543
-
-
Mu'alem, A.1
Feitelson, D.2
-
21
-
-
34548046749
-
Proactive Fault Tolerance for HPC with Xen Virtualization
-
A. Nagarajan, F. Mueller, C. Engelmann, and S. Scott, "Proactive Fault Tolerance for HPC with Xen Virtualization," Proc. Int'l Conf. Supercomputing (ICS '07), pp. 23-32, 2007.
-
(2007)
Proc. Int'l Conf. Supercomputing (ICS '07)
, pp. 23-32
-
-
Nagarajan, A.1
Mueller, F.2
Engelmann, C.3
Scott, S.4
-
22
-
-
34548293317
-
Evaluation of a Workflow Scheduler Using Integrated Performance Modeling and Batch Queue Wait Time Prediction
-
D. Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, and K. Kennedy, "Evaluation of a Workflow Scheduler Using Integrated Performance Modeling and Batch Queue Wait Time Prediction," Proc. ACM/IEEE Conf. Supercomputing (SC), 2006.
-
(2006)
Proc. ACM/IEEE Conf. Supercomputing (SC)
-
-
Nurmi, D.1
Mandal, A.2
Brevik, J.3
Koelbel, C.4
Wolski, R.5
Kennedy, K.6
-
23
-
-
34547424386
-
Cooperative Checkpointing: A Robust Approach to Large-Scale Systems Reliability
-
A.J. Oliner, L. Rudolph, and R.K. Sahoo, "Cooperative Checkpointing: A Robust Approach to Large-Scale Systems Reliability," Proc. Int'l Conf. Supercomputing (ICS '06), pp. 14-23, 2006.
-
(2006)
Proc. Int'l Conf. Supercomputing (ICS '06)
, pp. 14-23
-
-
Oliner, A.J.1
Rudolph, L.2
Sahoo, R.K.3
-
24
-
-
12444257746
-
Fault-Aware Job Scheduling for BlueGene/L Systems
-
A. Oliner, R. Sahoo, J. Moreira, and M. Gupta, "Fault-Aware Job Scheduling for BlueGene/L Systems," Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS '04), p. 64, 2004.
-
(2004)
Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS '04)
, p. 64
-
-
Oliner, A.1
Sahoo, R.2
Moreira, J.3
Gupta, M.4
-
25
-
-
4544342875
-
Min-max Checkpoint Placement under Incomplete Failure Information
-
T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, "Min-max Checkpoint Placement under Incomplete Failure Information," Proc. Int'l Conf. Dependable Systems and Networks (DSN '04), p. 721, 2004.
-
(2004)
Proc. Int'l Conf. Dependable Systems and Networks (DSN '04)
, p. 721
-
-
Ozaki, T.1
Dohi, T.2
Okamura, H.3
Kaio, N.4
-
26
-
-
84898046897
-
Scaling to Thousands of Processors with Buffered Coscheduling
-
F. Petrini, "Scaling to Thousands of Processors with Buffered Coscheduling," Proc. Scaling to New Height Workshop, 2002.
-
(2002)
Proc. Scaling to New Height Workshop
-
-
Petrini, F.1
-
27
-
-
0002067202
-
Libckpt: Transparent Checkpointing under Unix
-
J. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent Checkpointing under Unix," Proc. Usenix, 1995.
-
(1995)
Proc. Usenix
-
-
Plank, J.1
Beck, M.2
Kingsley, G.3
Li, K.4
-
28
-
-
0032179680
-
Diskless Checkpointing
-
J. Plank, K. Li, and M. Puening, "Diskless Checkpointing," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
-
(1998)
IEEE Trans. Parallel and Distributed Systems
, vol.9
, Issue.10
, pp. 972-986
-
-
Plank, J.1
Li, K.2
Puening, M.3
-
29
-
-
0035201417
-
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
-
J. Plank and M. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, 2001.
-
(2001)
J. Parallel and Distributed Computing
, vol.61
, Issue.11
, pp. 1570-1590
-
-
Plank, J.1
Thomason, M.2
-
30
-
-
77957023910
-
Big Systems and Big Reliability Challenges
-
D. Reed, C. Lu, and C. Mendes, "Big Systems and Big Reliability Challenges," Proc. Parallel Computing (ParCo '03), pp. 729-736, 2003.
-
(2003)
Proc. Parallel Computing (ParCo '03)
, pp. 729-736
-
-
Reed, D.1
Lu, C.2
Mendes, C.3
-
31
-
-
77952378080
-
Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters
-
R. Sahoo et al., "Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDDM '03), pp. 426-435, 2003.
-
(2003)
Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDDM '03)
, pp. 426-435
-
-
Sahoo, R.1
-
33
-
-
23944521034
-
Implementation and Evaluation of a Scalable Application Level Checkpoint-Recovery Scheme for MPI Programs
-
M. Schulz, G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali, and P. Stodghill, "Implementation and Evaluation of a Scalable Application Level Checkpoint-Recovery Scheme for MPI Programs," Proc. ACM/IEEE Conf. Supercomputing (SC '04), p. 38, 2004.
-
(2004)
Proc. ACM/IEEE Conf. Supercomputing (SC '04)
, p. 38
-
-
Schulz, M.1
Bronevetsky, G.2
Fernandes, R.3
Marques, D.4
Pingali, K.5
Stodghill, P.6
-
34
-
-
0026923304
-
Task Allocation for Maximizing Reliability of Distributed Computer Systems
-
S. Shatz, J. Wang, and M. Goto, "Task Allocation for Maximizing Reliability of Distributed Computer Systems," IEEE Trans. Computers, vol. 41, no. 9, pp. 1156-1168, Sept. 1992.
-
(1992)
IEEE Trans. Computers
, vol.41
, Issue.9
, pp. 1156-1168
-
-
Shatz, S.1
Wang, J.2
Goto, M.3
-
36
-
-
84859478556
-
A Survey of Process Migration Mechanisms
-
J. Smith, "A Survey of Process Migration Mechanisms," Operating Systems Rev., vol. 22, no. 3, pp. 102-106, 1988.
-
(1988)
Operating Systems Rev
, vol.22
, Issue.3
, pp. 102-106
-
-
Smith, J.1
-
38
-
-
0032683084
-
Safety and Reliability Driven Task Allocation in Distributed Systems
-
S. Srinivasan and N. Jha, "Safety and Reliability Driven Task Allocation in Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 3, Mar. 1999.
-
(1999)
IEEE Trans. Parallel and Distributed Systems
, vol.10
, Issue.3
-
-
Srinivasan, S.1
Jha, N.2
-
39
-
-
34248674898
-
Backfilling Using System-Generated Predictions Rather than User Runtime Estimates
-
D. Tsafrir, Y. Etsion, and D. Feitelson, "Backfilling Using System-Generated Predictions Rather than User Runtime Estimates," IEEE Trans. Parallel and Distributed Systems, vol. 18, no. 6, June 2007.
-
(2007)
IEEE Trans. Parallel and Distributed Systems
, vol.18
, Issue.6
-
-
Tsafrir, D.1
Etsion, Y.2
Feitelson, D.3
-
41
-
-
27544513113
-
Modeling Coordinated Checkpointing for Large-Scale Supercomputers
-
L. Wang, K. Pattabiraman, L. Votta, A.C. Vick, Z. Wood, and R. Kalbarczyk, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," Proc. Int'l Conf. Dependable Systems and Networks (DSN '05), pp. 812-821, 2005.
-
(2005)
Proc. Int'l Conf. Dependable Systems and Networks (DSN '05)
, pp. 812-821
-
-
Wang, L.1
Pattabiraman, K.2
Votta, L.3
Vick, A.C.4
Wood, Z.5
Kalbarczyk, R.6
-
42
-
-
84976846528
-
A First Order Approximation to the Optimal Checkpoint Interval
-
J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, vol. 17, no. 9, pp. 530-531, 1974.
-
(1974)
Comm. ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.1
-
43
-
-
23944448107
-
Performance Implications of Failures in Large-Scale Cluster Scheduling
-
Y. Zhang, M. Squillante, A. Sivasubramaniam, and R. Sahoo, "Performance Implications of Failures in Large-Scale Cluster Scheduling," Proc. Workshop Job Scheduling Strategies for Parallel Processing (JSSPP '04), pp. 233-252, 2004.
-
(2004)
Proc. Workshop Job Scheduling Strategies for Parallel Processing (JSSPP '04)
, pp. 233-252
-
-
Zhang, Y.1
Squillante, M.2
Sivasubramaniam, A.3
Sahoo, R.4
-
45
-
-
47249153592
-
A Meta-Learning Failure Predictor for Blue Gene/L Systems
-
P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "A Meta-Learning Failure Predictor for Blue Gene/L Systems," Proc. Int'l Conf. Parallel Processing (ICPP), 2007.
-
(2007)
Proc. Int'l Conf. Parallel Processing (ICPP)
-
-
Gujrati, P.1
Li, Y.2
Lan, Z.3
Thakur, R.4
White, J.5
-
46
-
-
0004429467
-
Kiviat Graphs: Conventions and Figures of Merit
-
M. Morris, "Kiviat Graphs: Conventions and Figures of Merit," ACM SIGMETRICS Performance Evaluation Rev., vol. 3, no. 3, 1974.
-
(1974)
ACM SIGMETRICS Performance Evaluation Rev
, vol.3
, Issue.3
-
-
Morris, M.1
-
47
-
-
56749178938
-
Exploring Event Correlation for Failure Prediction in Coalitions of Clusters
-
S. Fu and C.Z. Xu, "Exploring Event Correlation for Failure Prediction in Coalitions of Clusters," Proc. ACM/IEEE Conf. Supercomputing (SC), 2007.
-
(2007)
Proc. ACM/IEEE Conf. Supercomputing (SC)
-
-
Fu, S.1
Xu, C.Z.2
-
48
-
-
55849147399
-
Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study
-
J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B. Park, "Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study," Proc. Int'l Conf. Parallel Processing (ICPP), 2008.
-
(2008)
Proc. Int'l Conf. Parallel Processing (ICPP)
-
-
Gu, J.1
Zheng, Z.2
Lan, Z.3
White, J.4
Hocks, E.5
Park, B.6