-
1
-
-
11144287593
-
An overview of the BlueGene/L supercomputer
-
November
-
N. Adiga et al. An overview of the BlueGene/L supercomputer. In Supercomputing (SC2002) Technical Papers, November 2002.
-
(2002)
Supercomputing (SC2002) Technical Papers
-
-
Adiga, N.1
-
2
-
-
8344232253
-
Adaptive incremental checkpointing for massively parallel systems
-
S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira. Adaptive incremental checkpointing for massively parallel systems. In ICS 2004, pages 277-286, 2004.
-
(2004)
ICS 2004
, pp. 277-286
-
-
Agarwal, S.1
Garg, R.2
Gupta, M.S.3
Moreira, J.E.4
-
3
-
-
0035877334
-
Scheduling with unexpected machine breakdowns
-
S. Albers and G. Schmidt. Scheduling with unexpected machine breakdowns. Discrete Applied Mathematics, 110(2-3):85-99, 2001.
-
(2001)
Discrete Applied Mathematics
, vol.110
, Issue.2-3
, pp. 85-99
-
-
Albers, S.1
Schmidt, G.2
-
8
-
-
12444329151
-
An evaluation of parallel job scheduling for ASCI Blue-Pacific
-
November
-
H. Franke, J. Jann, J. E. Moreira, and P. Pattnaik. An evaluation of parallel job scheduling for ASCI Blue-Pacific. In Proc. of SC'99. Portland OR, IBM Research Report RC 21559, IBM TJ Watson Research Center, November 1999.
-
(1999)
Proc. of SC'99. Portland OR, IBM Research Report RC 21559, IBM TJ Watson Research Center
-
-
Franke, H.1
Jann, J.2
Moreira, J.E.3
Pattnaik, P.4
-
9
-
-
0011625222
-
Time sharing massively parallel machines
-
August
-
B. Gorda and R. Wolski. Time sharing massively parallel machines. In Proc. of ICPP'95. Portland OR, pages 214-217, August 1995.
-
(1995)
Proc. of ICPP'95. Portland OR
, pp. 214-217
-
-
Gorda, B.1
Wolski, R.2
-
10
-
-
84974701617
-
Job scheduling for the BlueGene/L system
-
E. Krevat, J. G. Castanos, and J. E. Moreira. Job scheduling for the BlueGene/L system. In JSSPP, pages 38-54, 2002.
-
(2002)
JSSPP
, pp. 38-54
-
-
Krevat, E.1
Castanos, J.G.2
Moreira, J.E.3
-
12
-
-
27544497222
-
Filtering failure logs for a BlueGene/L prototype
-
Y. Liang, Y. Zhang, R. K. Sahoo, J. E. Moreira, and M. Gupta. Filtering failure logs for a BlueGene/L prototype. In Intl. Conf. on Dependable Systems and Networks (DSN-2005) (submitted), 2005.
-
(2005)
Intl. Conf. on Dependable Systems and Networks (DSN-2005) (submitted)
-
-
Liang, Y.1
Zhang, Y.2
Sahoo, R.K.3
Moreira, J.E.4
Gupta, M.5
-
13
-
-
12444257746
-
Fault-aware job scheduling for BlueGene/L systems
-
Apr.
-
A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware job scheduling for BlueGene/L systems. In IEEE IPDPS, Intl. Parallel and Distributed Processing Symposium, Apr. 2004.
-
(2004)
IEEE IPDPS, Intl. Parallel and Distributed Processing Symposium
-
-
Oliner, A.J.1
Sahoo, R.K.2
Moreira, J.E.3
Gupta, M.4
Sivasubramaniam, A.5
-
15
-
-
0035201417
-
Processor allocation and checkpoint interval selection in cluster computing systems
-
November
-
J. S. Plank and M. G. Thomason. Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing, 61(11):1570-1590, November 2001.
-
(2001)
Journal of Parallel and Distributed Computing
, vol.61
, Issue.11
, pp. 1570-1590
-
-
Plank, J.S.1
Thomason, M.G.2
-
16
-
-
84948470299
-
An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems
-
August
-
X. Qin, H. Jiang, and D. R. Swanson. An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 30th International Conference on Parallel Processing, pages 360-368, August 2002.
-
(2002)
Proceedings of the 30th International Conference on Parallel Processing
, pp. 360-368
-
-
Qin, X.1
Jiang, H.2
Swanson, D.R.3
-
17
-
-
77952378080
-
Critical event prediction for proactive management in large-scale computer clusters
-
August
-
R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery and Data Mining, pages 426-435, August 2003.
-
(2003)
Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery and Data Mining
, pp. 426-435
-
-
Sahoo, R.K.1
Oliner, A.J.2
Rish, I.3
Gupta, M.4
Moreira, J.E.5
Ma, S.6
Vilalta, R.7
Sivasubramaniam, A.8
-
18
-
-
4544382099
-
Failure data analysis of a large-scale heterogeneous server environment
-
June
-
R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN), pages 772-781, June 2004.
-
(2004)
Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN)
, pp. 772-781
-
-
Sahoo, R.K.1
Sivasubramaniam, A.2
Squillante, M.S.3
Zhang, Y.4
-
19
-
-
20444444457
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Santa Fe, New Mexico, USA, October
-
S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. In Proceedings, LACSI Symposium, Santa Fe, New Mexico, USA, October 2003.
-
(2003)
Proceedings, LACSI Symposium
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
Duell, J.5
Hargrove, P.6
Roman, E.7
-
23
-
-
33845595513
-
Performance implications of failures in large-scale cluster scheduling
-
Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In 10th Workshop on JSSPP, SIGMETRICS, 2004.
-
(2004)
10th Workshop on JSSPP, SIGMETRICS
-
-
Zhang, Y.1
Squillante, M.S.2
Sivasubramaniam, A.3
Sahoo, R.K.4