-
4
-
-
33749669598
-
MPI/FT: A model-based approach to low-overhead fault tolerant message-passing middleware
-
Jan.
-
BATCHU, R., DANDASS, Y. S., SKJELLUM, A., AND BEDDHU, M. MPI/FT: A model-based approach to low-overhead fault tolerant message-passing middleware. Cluster Computing 7, 4 (Jan. 2004), 303-315.
-
(2004)
Cluster Computing
, vol.7
, Issue.4
, pp. 303-315
-
-
Batchu, R.1
Dandass, Y.S.2
Skjellum, A.3
Beddhu, M.4
-
5
-
-
66749092384
-
-
Sept.
-
BERGMAN, K., BORKAR, S., CAMPBELL, D., CARLSON, W., DALLY, W., DENNEAU, M., FRANZON, P., HARROD, W., HILL, K., HILLER, J., KARP, S., KECKLER, S., KLEIN, D., KOGGE, P., LUCAS, R., RICHARDS, M., SCARPELLI, A., SCOTT, S., SNAVELY, A., STERLING, T., WILLIAMS, R. S., AND YELICK, K. Exascale computing study: Technology challenges in achieving exascale systems. http://www.science.energy.gov/ascr/Research/CS/DARPAexascale-hardware(2008).pdf, Sept. 2008.
-
(2008)
Exascale Computing Study: Technology Challenges in Achieving Exascale Systems
-
-
Bergman, K.1
Borkar, S.2
Campbell, D.3
Carlson, W.4
Dally, W.5
Denneau, M.6
Franzon, P.7
Harrod, W.8
Hill, K.9
Hiller, J.10
Karp, S.11
Keckler, S.12
Klein, D.13
Kogge, P.14
Lucas, R.15
Richards, M.16
Scarpelli, A.17
Scott, S.18
Snavely, A.19
Sterling, T.20
Williams, R.S.21
Yelick, K.22
-
6
-
-
85084161916
-
Magazines and vmem: Extending the slab allocator to many CPUs and arbitrary resources
-
USENIX Association
-
BONWICK, J., AND ADAMS, J. Magazines and vmem: Extending the slab allocator to many CPUs and arbitrary resources. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference (Berkeley, CA, USA, 2001), USENIX Association, pp. 15-33.
-
Proceedings of the General Track: 2002 USENIX Annual Technical Conference (Berkeley, CA, USA, 2001)
, pp. 15-33
-
-
Bonwick, J.1
Adams, J.2
-
7
-
-
84864777225
-
Cooperative application/OS DRAM fault recovery
-
Lecture Notes in Computer Science
-
BRIDGES, P., HOEMMEN, M., FERREIRA, K. B., HEROUX, M., SOLTERO, P., AND BRIGHTWELL, R. Cooperative application/OS DRAM fault recovery. Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the Euro-Par Conference, Lecture Notes in Computer Science (2011).
-
(2011)
Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in Conjunction with the Euro-Par Conference
-
-
Bridges, P.1
Hoemmen, M.2
Ferreira, K.B.3
Heroux, M.4
Soltero, P.5
Brightwell, R.6
-
8
-
-
57349156147
-
Soft error vulnerability of iterative linear algebra methods
-
ACM
-
BRONEVETSKY, G., AND DE SUPINSKI, B. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the 22nd Annual International Conference on Supercomputing (New York, NY, USA, 2008), ICS '08, ACM, pp. 155-164.
-
Proceedings of the 22nd Annual International Conference on Supercomputing (New York, NY, USA, 2008), ICS '08
, pp. 155-164
-
-
Bronevetsky, G.1
De Supinski, B.2
-
9
-
-
1142268808
-
Collective operations in application-level fault-tolerant MPI
-
ACM
-
BRONEVETSKY, G., MARQUES, D., PINGALI, K., AND STODGHILL, P. Collective operations in application-level fault-tolerant MPI. In Proceedings of the 17th annual international conference on Supercomputing (New York, NY, USA, 2003), ICS '03, ACM, pp. 234-243.
-
Proceedings of the 17th Annual International Conference on Supercomputing (New York, NY, USA, 2003), ICS '03
, pp. 234-243
-
-
Bronevetsky, G.1
Marques, D.2
Pingali, K.3
Stodghill, P.4
-
10
-
-
50649108554
-
Proactive fault tolerance in MPI applications via task migration
-
CHAKRAVORTY, S., MENDES, C., AND KALÉ, L. Proactive fault tolerance in MPI applications via task migration. Strategy 4297 (2006), 485-496.
-
(2006)
Strategy
, vol.4297
, pp. 485-496
-
-
Chakravorty, S.1
Mendes, C.2
Kalé, L.3
-
12
-
-
84864743916
-
-
March 1
-
WHEELER, D. A. Sloccount. http://www.dwheeler.com/sloccount, March 1 2012.
-
(2012)
Sloccount
-
-
Wheeler, D.A.1
-
15
-
-
84866942395
-
Combining partial redundancy and checkpointing for HPC
-
IEEE Computer Society Press, to appear
-
ELLIOT, J., KHARBAS, K., FIALA, D., MUELLER, F., FERREIRA, K., AND ENGELMANN, C. Combining partial redundancy and checkpointing for HPC. In International Conference on Distributed Computing Systems (Los Alamitos, CA, USA, June 2012), IEEE Computer Society Press, pp. 1-11. [to appear].
-
International Conference on Distributed Computing Systems (Los Alamitos, CA, USA, June 2012)
, pp. 1-11
-
-
Elliot, J.1
Kharbas, K.2
Fiala, D.3
Mueller, F.4
Ferreira, K.5
Engelmann, C.6
-
16
-
-
25144486687
-
Super-scalable algorithms for computing on 100,000 processors
-
Springer Verlag, Berlin, Germany
-
th International Conference on Computational Science (ICCS) 2005, Part I (Atlanta, GA, USA, May 22-25, 2005), vol. 3514, Springer Verlag, Berlin, Germany, pp. 313-320.
-
th International Conference on Computational Science (ICCS) 2005, Part I (Atlanta, GA, USA, May 22-25, 2005)
, vol.3514
, pp. 313-320
-
-
Engelmann, C.1
Geist, G.A.A.2
-
17
-
-
74549140832
-
The case for modular redundancy in large-scale high performance computing systems
-
ACTA Press, Calgary, AB, Canada
-
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009 (Innsbruck, Austria, Feb. 16-18, 2009), ACTA Press, Calgary, AB, Canada, pp. 189-194.
-
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009 (Innsbruck, Austria, Feb. 16-18, 2009)
, pp. 189-194
-
-
Engelmann, C.1
Ong, H.H.2
Scott, S.L.3
-
18
-
-
33646126295
-
Scalable fault tolerant MPI: Extending the recovery algorithm
-
PVM/MPI (2005), B. D. Martino, D. Kranzlmüller, and J. Dongarra, Eds., Springer
-
FAGG, G. E., ANGSKUN, T., BOSILCA, G., PJESIVAC-GRBOVIC, J., AND DONGARRA, J. Scalable fault tolerant MPI: Extending the recovery algorithm. In PVM/MPI (2005), B. D. Martino, D. Kranzlmüller, and J. Dongarra, Eds., vol. 3666 of Lecture Notes in Computer Science, Springer, pp. 67-75.
-
Lecture Notes in Computer Science
, vol.3666
, pp. 67-75
-
-
Fagg, G.E.1
Angskun, T.2
Bosilca, G.3
Pjesivac-Grbovic, J.4
Dongarra, J.5
-
19
-
-
83155188951
-
Evaluating the viability of process replication reliability for exascale systems
-
Nov
-
FERREIRA, K., RIESEN, R., STEARLEY, J., III, J. H. L., OLDFIELD, R., PEDRETTI, K., BRIDGES, P., ARNOLD, D., AND BRIGHTWELL, R. Evaluating the viability of process replication reliability for exascale systems. In Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage, and Analysis, (SC'11) (Nov 2011).
-
(2011)
Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage, and Analysis, (SC'11)
-
-
Ferreira, K.1
Riesen, R.2
Stearley, J.3
Iii, J.H.L.4
Oldfield, R.5
Pedretti, K.6
Bridges, P.7
Arnold, D.8
Brightwell, R.9
-
20
-
-
84877716050
-
A tunable, software-based DRAM error detection and correction library for HPC
-
Springer Verlag, Berlin, Germany
-
FIALA, D., FERREIRA, K. B., MUELLER, F., AND ENGELMANN, C. A tunable, software-based DRAM error detection and correction library for HPC. In Lecture Notes in Computer Science: Proceedings of the European Conference on Parallel and Distributed Computing (Euro-Par) 2011: Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (Bordeaux, France, Aug 2011), Springer Verlag, Berlin, Germany.
-
Lecture Notes in Computer Science: Proceedings of the European Conference on Parallel and Distributed Computing (Euro-Par) 2011: Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (Bordeaux, France, Aug 2011)
-
-
Fiala, D.1
Ferreira, K.B.2
Mueller, F.3
Engelmann, C.4
-
21
-
-
80053223509
-
Uncoordinated checkpointing without domino effect for send-deterministic message passing applications
-
May
-
GUERMOUCHE, A., ROPARS, T., BRUNET, E., SNIR, M., AND CAPPELLO, F. Uncoordinated checkpointing without domino effect for send-deterministic message passing applications. In Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium (May 2011).
-
(2011)
Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium
-
-
Guermouche, A.1
Ropars, T.2
Brunet, E.3
Snir, M.4
Cappello, F.5
-
22
-
-
0021439162
-
Algorithm-based fault tolerance for matrix operations
-
June
-
HUANG, K.-H., AND ABRAHAM, J. A. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33, 6 (June 1984).
-
(1984)
IEEE Transactions on Computers
, vol.C-33
, Issue.6
-
-
Huang, K.-H.1
Abraham, J.A.2
-
23
-
-
84858781341
-
Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design
-
ACM
-
HWANG, A. A., STEFANOVICI, I. A., AND SCHROEDER, B. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design. In Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS '12, ACM, pp. 111-122.
-
Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS '12
, pp. 111-122
-
-
Hwang, A.A.1
Stefanovici, I.A.2
Schroeder, B.3
-
24
-
-
84864739773
-
-
March 1
-
INNOVATIVE COMPUTING LABORATORY. FT-MPI. http://icl.cs.utk.edu/ftmpi, March 1 2012.
-
(2012)
FT-MPI
-
-
-
28
-
-
77954005825
-
Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing
-
LANGE, J. R., PEDRETTI, K. T., HUDSON, T., DINDA, P. A., CUI, Z., XIA, L., BRIDGES, P. G., GOCKE, A., JACONETTE, S., LEVENHAGEN, M., AND BRIGHTWELL, R. Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing. In IPDPS'10 (2010), pp. 1-12.
-
(2010)
IPDPS'10
, pp. 1-12
-
-
Lange, J.R.1
Pedretti, K.T.2
Hudson, T.3
Dinda, P.A.4
Cui, Z.5
Xia, L.6
Bridges, P.G.7
Gocke, A.8
Jaconette, S.9
Levenhagen, M.10
Brightwell, R.11
-
29
-
-
83155182888
-
System implications of memory reliability in exascale computing
-
ACM
-
LI, S., CHEN, K., HSIEH, M.-Y., MURALIMANOHAR, N., KERSEY, C. D., BROCKMAN, J. B., RODRIGUES, A. F., AND JOUPPI, N. P. System implications of memory reliability in exascale computing. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2011), SC '11, ACM, pp. 46:1-46:12.
-
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2011), SC '11
-
-
Li, S.1
Chen, K.2
Hsieh, M.-Y.3
Muralimanohar, N.4
Kersey, C.D.5
Brockman, J.B.6
Rodrigues, A.F.7
Jouppi, N.P.8
-
30
-
-
77954020082
-
A high-performance fault-tolerant software framework for memory on commodity GPUs
-
April
-
MARUYAMA, N., NUKADA, A., AND MATSUOKA, S. A high-performance fault-tolerant software framework for memory on commodity GPUs. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on (April 2010), pp. 1-12.
-
(2010)
Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on
, pp. 1-12
-
-
Maruyama, N.1
Nukada, A.2
Matsuoka, S.3
-
31
-
-
78650831692
-
Design, modeling, and evaluation of a scalable multi-level checkpointing system
-
IEEE Computer Society
-
MOODY, A., BRONEVETSKY, G., MOHROR, K., AND DE SUPINSKI, B. R. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Washington, DC, USA, 2010), SC '10, IEEE Computer Society, pp. 1-11.
-
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Washington, DC, USA, 2010), SC '10
, pp. 1-11
-
-
Moody, A.1
Bronevetsky, G.2
Mohror, K.3
De Supinski, B.R.4
-
33
-
-
0036507891
-
Control-flow checking by software signatures
-
Mar.
-
OH, N., SHIRVANI, P., AND MCCLUSKEY, E. Control-flow checking by software signatures. Reliability, IEEE Transactions on 51, 1 (Mar. 2002), 111-122.
-
(2002)
Reliability, IEEE Transactions on
, vol.51
, Issue.1
, pp. 111-122
-
-
Oh, N.1
Shirvani, P.2
McCluskey, E.3
-
34
-
-
0036507790
-
Error detection by duplicated instructions in super-scalar processors
-
Mar.
-
OH, N., SHIRVANI, P., AND MCCLUSKEY, E. J. Error detection by duplicated instructions in super-scalar processors. Reliability, IEEE Transactions on 51, 1 (Mar. 2002), 63-75.
-
(2002)
Reliability, IEEE Transactions on
, vol.51
, Issue.1
, pp. 63-75
-
-
Oh, N.1
Shirvani, P.2
McCluskey, E.J.3
-
35
-
-
0028994249
-
Algorithm-based diskless checkpointing for fault tolerant matrix operations
-
Los Alamitos, CA, USA : IEEE Comput. Soc. Press
-
PLANK, J. S., KIM, Y. B., AND DONGARRA, J. J. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers (Pasadena, CA, USA, June 1995), Los Alamitos, CA, USA : IEEE Comput. Soc. Press, 1995, pp. 351-360.
-
(1995)
Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers (Pasadena, CA, USA, June 1995)
, pp. 351-360
-
-
Plank, J.S.1
Kim, Y.B.2
Dongarra, J.J.3
-
36
-
-
84963800757
-
A source-to-source compiler for generating dependable software
-
REBAUDENGO, M., REORDA, M., VIOLANTE, M., AND TORCHIANO, M. A source-to-source compiler for generating dependable software. In Source Code Analysis and Manipulation, 2001. Proceedings. First IEEE International Workshop on (2001), pp. 33-42.
-
(2001)
Source Code Analysis and Manipulation, 2001. Proceedings. First IEEE International Workshop on
, pp. 33-42
-
-
Rebaudengo, M.1
Reorda, M.2
Violante, M.3
Torchiano, M.4
-
37
-
-
33646829087
-
SWIFt: Software implemented fault tolerance
-
IEEE Computer Society
-
REIS, G. A., CHANG, J., VACHHARAJANI, N., RANGAN, R., AND AUGUST, D. I. SWIFt: Software implemented fault tolerance. In Proceedings of the international symposium on Code generation and optimization (Washington, DC, USA, 2005), CGO'05, IEEE Computer Society, pp. 243-254.
-
Proceedings of the International Symposium on Code Generation and Optimization (Washington, DC, USA, 2005), CGO'05
, pp. 243-254
-
-
Reis, G.A.1
Chang, J.2
Vachharajani, N.3
Rangan, R.4
August, D.I.5
-
39
-
-
84864766011
-
-
March 10
-
SANDIA NATIONAL LABORATORY. Kitten lightweight kernel. https://software.sandia.gov/trac/kitten, March 10 2012.
-
(2012)
Kitten Lightweight Kernel
-
-
-
41
-
-
36148941068
-
Understanding failures in petascale computers
-
SCHROEDER, B., AND GIBSON, G. A. Understanding failures in petascale computers. Journal of Physics: Conference Series 78, 1 (2007), 012022.
-
(2007)
Journal of Physics: Conference Series
, vol.78
, Issue.1
, pp. 012022
-
-
Schroeder, B.1
Gibson, G.A.2
-
42
-
-
79551703768
-
DRAM errors in the wild: A large-scale field study
-
February
-
SCHROEDER, B., PINHEIRO, E., AND WEBER, W.-D. DRAM errors in the wild: a large-scale field study. Communications of the ACM 54 (February 2011), 100-107.
-
(2011)
Communications of the ACM
, vol.54
, pp. 100-107
-
-
Schroeder, B.1
Pinheiro, E.2
Weber, W.-D.3
-
43
-
-
0034260103
-
Software-implemented EDAC protection against SEUs
-
Sept.
-
SHIRVANI, P., SAXENA, N., AND MCCLUSKEY, E. Software-implemented EDAC protection against SEUs. Reliability, IEEE Transactions on 49, 3 (Sept. 2000), 273-284.
-
(2000)
Reliability, IEEE Transactions on
, vol.49
, Issue.3
, pp. 273-284
-
-
Shirvani, P.1
Saxena, N.2
McCluskey, E.3
-
44
-
-
84864756973
-
An experimental study about diskless checkpointing
-
IEEE Computer Society Press
-
SILVA, L. M., AND SILVA, J. G. An experimental study about diskless checkpointing. In 24th EUROMICRO Conference (Vasteras, Sweden, August 1998), IEEE Computer Society Press, pp. 395-402.
-
24th EUROMICRO Conference (Vasteras, Sweden, August 1998)
, pp. 395-402
-
-
Silva, L.M.1
Silva, J.G.2
-
46
-
-
84864777232
-
-
March 1
-
SMEM. Memory reporting tool. http://www.selenic.com/smem/, March 1 2012.
-
(2012)
Memory Reporting Tool
-
-