-
1
-
-
34547478654
-
-
N. Adiga and T. B. Team. An overview of the BlueGene/L supercomputer. In Supercomputing, Technical Papers, Nov. 2002.
-
-
-
-
2
-
-
8344232253
-
Adaptive incremental checkpointing for massively parallel systems
-
S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira. Adaptive incremental checkpointing for massively parallel systems. In Proceedings of the Intl. Conf. on Supercomputing (ICS), pages 277-286, 2004.
-
(2004)
Proceedings of the Intl. Conf. on Supercomputing (ICS)
, pp. 277-286
-
-
Agarwal, S.1
Garg, R.2
Gupta, M.S.3
Moreira, J.E.4
-
4
-
-
0022012278
-
Discovering patterns in sequence of events
-
T. Dietterich and R. Michalski. Discovering patterns in sequence of events. In Artificial Intelligence, volume 25, pages 187-232, 1985.
-
(1985)
Artificial Intelligence
, vol.25
, pp. 187-232
-
-
Dietterich, T.1
Michalski, R.2
-
5
-
-
84871146551
-
The performance of consistent checkpointing
-
Houston, TX, Oct
-
E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In 11th Symposium on Reliable Distributed Systems, Houston, TX, Oct. 1992.
-
(1992)
11th Symposium on Reliable Distributed Systems
-
-
Elnozahy, E.N.1
Johnson, D.B.2
Zwaenepoel, W.3
-
6
-
-
9144223280
-
Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
-
E. N. Elnozahy and J. S. Plank. Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Trans. Dependable Secur. Comput., 1(2):97-108, 2004.
-
(2004)
IEEE Trans. Dependable Secur. Comput
, vol.1
, Issue.2
, pp. 97-108
-
-
Elnozahy, E.N.1
Plank, J.S.2
-
8
-
-
0003454649
-
-
Wiley-Interscience, New York
-
R. K. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley-Interscience, New York, 1991.
-
(1991)
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling
-
-
Jain, R.K.1
-
10
-
-
33845589803
-
Blue Gene/L failure analysis and prediction models
-
Y. Liang, Y. Zhang, M. Jette, A. Sivasubramaniam, and R. K. Sahoo. Blue Gene/L failure analysis and prediction models. In Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN), 2006.
-
(2006)
Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN)
-
-
Liang, Y.1
Zhang, Y.2
Jette, M.3
Sivasubramaniam, A.4
Sahoo, R.K.5
-
11
-
-
27544497222
-
Filtering failure logs for a BlueGene/L prototype
-
June
-
Y. Liang, Y. Zhang, A. Sivasubramaniam, R. K. Sahoo, J. Moreira, and M. Gupta. Filtering failure logs for a BlueGene/L prototype. In Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN), June 2005.
-
(2005)
Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN)
-
-
Liang, Y.1
Zhang, Y.2
Sivasubramaniam, A.3
Sahoo, R.K.4
Moreira, J.5
Gupta, M.6
-
12
-
-
34547428381
-
Compiler-generated staggered checkpointing
-
New York, NY, USA, ACM Press
-
A. N. Norman, S.-E. Choi, and C. Lin. Compiler-generated staggered checkpointing. In Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems (LCR), pages 1-8, New York, NY, USA, 2004. ACM Press.
-
(2004)
Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems (LCR)
, pp. 1-8
-
-
Norman, A.N.1
Choi, S.-E.2
Lin, C.3
-
15
-
-
27544438709
-
Probabilistic QoS guarantees for supercomputing systems
-
June
-
A. J. Oliner, L. Rudolph, R. K. Sahoo, J. Moreira, and M. Gupta. Probabilistic QoS guarantees for supercomputing systems. In Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN), June 2005.
-
(2005)
Proceedings of the Intl. Conf. on Dependable Systems and Networks (DSN)
-
-
Oliner, A.J.1
Rudolph, L.2
Sahoo, R.K.3
Moreira, J.4
Gupta, M.5
-
17
-
-
33746286070
-
Performance implications of periodic checkpointing on large-scale cluster systems
-
Apr
-
A. J. Oliner, R. K. Sahoo, J. E. Moreira, and M. Gupta. Performance implications of periodic checkpointing on large-scale cluster systems. In IEEE IPDPS, Workshop on System Management Tools for Large-scale Parallel Systems, Apr. 2005.
-
(2005)
IEEE IPDPS, Workshop on System Management Tools for Large-scale Parallel Systems
-
-
Oliner, A.J.1
Sahoo, R.K.2
Moreira, J.E.3
Gupta, M.4
-
18
-
-
34547444432
-
-
J. S. Plank and W. R. Elwasif. Experimental assessment of workstation failures and their impact on checkpointing systems. In Proceedings of the 28th Intl. Symposium on Fault-tolerant Computing, June 1998.
-
-
-
-
20
-
-
0035201417
-
Processor allocation and checkpoint interval selection in cluster computing systems
-
November
-
J. S. Plank and M. G. Thomason. Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing, 61(11):1570-1590, November 2001.
-
(2001)
Journal of Parallel and Distributed Computing
, vol.61
, Issue.11
, pp. 1570-1590
-
-
Plank, J.S.1
Thomason, M.G.2
-
21
-
-
77952378080
-
Critical event prediction for proactive management in large-scale computer clusters
-
August
-
R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery and Data Mining, pages 426-435, August 2003.
-
(2003)
Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery and Data Mining
, pp. 426-435
-
-
Sahoo, R.K.1
Oliner, A.J.2
Rish, I.3
Gupta, M.4
Moreira, J.E.5
Ma, S.6
Vilalta, R.7
Sivasubramaniam, A.8
-
22
-
-
84934312471
-
Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs
-
M. Schultz, G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali, and P. Stodghill. Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In Supercomputing, 2004.
-
(2004)
Supercomputing
-
-
Schultz, M.1
Bronevetsky, G.2
Fernandes, R.3
Marques, D.4
Pingali, K.5
Stodghill, P.6
-
24
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
J. W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM, 17(9):530-531, 1974.
-
(1974)
Commun. ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1