-
1
-
-
70450159193
-
The international exascale software project: A call to cooperative action by the global high-performance community
-
Dongarra, J., Beckman, P., Aerts, P., Cappello, F., Lippert, T., Matsuoka, S., Messina, P., Moore, T., Stevens, R., Trefethen, A., Valero, M.: The international exascale software project: a call to cooperative action by the global high-performance community. IJHPCA 23(4), 309-322 (2009)
-
(2009)
IJHPCA
, vol.23
, Issue.4
, pp. 309-322
-
-
Dongarra, J.1
Beckman, P.2
Aerts, P.3
Cappello, F.4
Lippert, T.5
Matsuoka, S.6
Messina, P.7
Moore, T.8
Stevens, R.9
Trefethen, A.10
Valero, M.11
-
2
-
-
84858430349
-
Failure tolerance in petascale computers
-
Gibson, G.: Failure tolerance in petascale computers. Journal of Physics: Conference Series 78, 012022 (2007)
-
(2007)
Journal of Physics: Conference Series
, vol.78
, pp. 012022
-
-
Gibson, G.1
-
3
-
-
83155188951
-
Evaluating the Viability of Process Replication Reliability for Exascale Systems
-
ACM/IEEE
-
Ferreira, K., Stearley, J., Laros, J.H.I., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P.G., Arnold, D.: Evaluating the Viability of Process Replication Reliability for Exascale Systems. In: Proc. of SC 2011. ACM/IEEE (2011)
-
(2011)
Proc. of SC 2011
-
-
Ferreira, K.1
Stearley, J.2
Laros, J.H.I.3
Oldfield, R.4
Pedretti, K.5
Brightwell, R.6
Riesen, R.7
Bridges, P.G.8
Arnold, D.9
-
4
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34, 375-408 (2002)
-
(2002)
ACM Computing Surveys
, vol.34
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4
-
5
-
-
80052306159
-
Correlated set coordination in fault tolerant message logging protocols
-
Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011, Part II. Springer, Heidelberg
-
Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.J.: Correlated set coordination in fault tolerant message logging protocols. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011, Part II. LNCS, vol. 6853, pp. 51-64. Springer, Heidelberg (2011)
-
(2011)
LNCS
, vol.6853
, pp. 51-64
-
-
Bouteiller, A.1
Herault, T.2
Bosilca, G.3
Dongarra, J.J.4
-
6
-
-
84866852589
-
HydEE: Failure containment without event logging for large scale send-deterministic MPI applications
-
IEEE
-
Guermouche, A., Ropars, T., Snir, M., Cappello, F.: HydEE: Failure containment without event logging for large scale send-deterministic MPI applications. In: Proc. 26th IPDPS, pp. 1216-1227. IEEE (May 2012)
-
(2012)
Proc. 26th IPDPS
, pp. 1216-1227
-
-
Guermouche, A.1
Ropars, T.2
Snir, M.3
Cappello, F.4
-
7
-
-
84867640976
-
-
Research report RR-7950, INRIA
-
Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y., Vivien, F., Zaidouni, D.: Unified model for assessing checkpointing protocols at extreme-scale. Research report RR-7950, INRIA (2012)
-
(2012)
Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
-
-
Bosilca, G.1
Bouteiller, A.2
Brunet, E.3
Cappello, F.4
Dongarra, J.5
Guermouche, A.6
Herault, T.7
Robert, Y.8
Vivien, F.9
Zaidouni, D.10
-
8
-
-
0021439162
-
Algorithm-based fault tolerance for matrix operations
-
Huang, K., Abraham, J.: Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33(6), 518-528 (1984)
-
(1984)
IEEE Transactions on Computers
, vol.C-33
, Issue.6
, pp. 518-528
-
-
Huang, K.1
Abraham, J.2
-
9
-
-
31844451082
-
Fault tolerant high performance computing by a coding approach
-
ACM
-
Chen, Z., Fagg, G.E., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J.: Fault tolerant high performance computing by a coding approach. In: Proc. 10th ACM SIGPLAN PPoPP, pp. 213-223. ACM (2005)
-
(2005)
Proc. 10th ACM SIGPLAN PPoPP
, pp. 213-223
-
-
Chen, Z.1
Fagg, G.E.2
Gabriel, E.3
Langou, J.4
Angskun, T.5
Bosilca, G.6
Dongarra, J.7
-
10
-
-
33746779994
-
MPICH-V: A multiprotocol fault tolerant MPI
-
Bouteiller, A., Herault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V: a multiprotocol fault tolerant MPI. IJHPCA 20(3), 319-333 (2006)
-
(2006)
IJHPCA
, vol.20
, Issue.3
, pp. 319-333
-
-
Bouteiller, A.1
Herault, T.2
Krawezik, G.3
Lemarinier, P.4
Cappello, F.5
-
11
-
-
84883136362
-
-
Research report ICL-UT-1301, University of Tennessee
-
Bouteiller, A., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y.: Multi-criteria checkpointing strategies: Optimizing response-time versus resource utilization. Research report ICL-UT-1301, University of Tennessee (February 2013)
-
(2013)
Multi-Criteria Checkpointing Strategies: Optimizing Response-Time Versus Resource Utilization
-
-
Bouteiller, A.1
Cappello, F.2
Dongarra, J.3
Guermouche, A.4
Herault, T.5
Robert, Y.6
-
12
-
-
84860692966
-
K computer: 8.162 petaflops massively parallel scalar supercomputer built with over 548k cores
-
IEEE
-
Miyazaki, H., Kusano, Y., Okano, H., Nakada, T., Seki, K., Shimizu, T., Shinjo, N., Shoji, F., Uno, A., Kurokawa, M.: K computer: 8.162 petaflops massively parallel scalar supercomputer built with over 548k cores. In: ISSCC, pp. 192-194. IEEE (2012)
-
(2012)
ISSCC
, pp. 192-194
-
-
Miyazaki, H.1
Kusano, Y.2
Okano, H.3
Nakada, T.4
Seki, K.5
Shimizu, T.6
Shinjo, N.7
Shoji, F.8
Uno, A.9
Kurokawa, M.10
-
13
-
-
34548782109
-
A fault tolerance protocol with fast fault recovery
-
IEEE
-
Chakravorty, S., Kale, L.: A fault tolerance protocol with fast fault recovery. In: Proc. 21st IPDPS, pp. 1-10. IEEE (March 2007)
-
(2007)
Proc. 21st IPDPS
, pp. 1-10
-
-
Chakravorty, S.1
Kale, L.2
-
14
-
-
70349129932
-
FTPA: Supporting fault-tolerant parallel computing through parallel recomputing
-
Yang, X., Du, Y., Wang, P., Fu, H., Jia, J.: FTPA: Supporting fault-tolerant parallel computing through parallel recomputing. IEEE Transactions on Parallel and Distributed Systems 20(10), 1471-1486 (2009)
-
(2009)
IEEE Transactions on Parallel and Distributed Systems
, vol.20
, Issue.10
, pp. 1471-1486
-
-
Yang, X.1
Du, Y.2
Wang, P.3
Fu, H.4
Jia, J.5
-
16
-
-
84976769480
-
The effectiveness of multiple hardware contexts
-
ACM
-
Thekkath, R., Eggers, S.J.: The effectiveness of multiple hardware contexts. In: Proc. of the 6th ASPLOS, pp. 328-337. ACM (1994)
-
(1994)
Proc. of the 6th ASPLOS
, pp. 328-337
-
-
Thekkath, R.1
Eggers, S.J.2
-
17
-
-
33751039336
-
Performance evaluation of Adaptive MPI
-
ACM
-
Huang, C., Zheng, G., Kalé, L., Kumar, S.: Performance evaluation of Adaptive MPI. In: Proc. 11th ACM SIGPLAN PPoPP, pp. 12-21. ACM (2006)
-
(2006)
Proc. 11th ACM SIGPLAN PPoPP
, pp. 12-21
-
-
Huang, C.1
Zheng, G.2
Kalé, L.3
Kumar, S.4
-
18
-
-
33646940970
-
Hybrid preemptive scheduling of message passing interface applications on grids
-
Bouteiller, A., Bouziane, H.L., Herault, T., Lemarinier, P., Cappello, F.: Hybrid preemptive scheduling of message passing interface applications on grids. IJHPCA 20(1), 77-90 (2006)
-
(2006)
IJHPCA
, vol.20
, Issue.1
, pp. 77-90
-
-
Bouteiller, A.1
Bouziane, H.L.2
Herault, T.3
Lemarinier, P.4
Cappello, F.5
-
19
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. FGCS 22(3), 303-312 (2006)
-
(2006)
FGCS
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1