-
4
-
-
66749092384
-
-
Sept.
-
BERGMAN, K., BORKAR, S., CAMPBELL, D., CARLSON, W., DALLY, W., DENNEAU, M., FRANZON, P., HARROD, W., HILL, K., HILLER, J., KARP, S., KECKLER, S., KLEIN, D., KOGGE, P., LUCAS, R., RICHARDS, M., SCARPELLI, A., SCOTT, S., SNAVELY, A., STERLING, T., WILLIAMS, R. S., AND YELICK, K. Exascale computing study: Technology challenges in achieving exascale systems. http://www.science.energy.gov/ascr/Research/CS/DARPAexascale-hardware(2008).pdf, Sept. 2008.
-
(2008)
Exascale Computing Study: Technology Challenges in Achieving Exascale Systems
-
-
Bergman, K.1
Borkar, S.2
Campbell, D.3
Carlson, W.4
Dally, W.5
Denneau, M.6
Franzon, P.7
Harrod, W.8
Hill, K.9
Hiller, J.10
Karp, S.11
Keckler, S.12
Klein, D.13
Kogge, P.14
Lucas, R.15
Richards, M.16
Scarpelli, A.17
Scott, S.18
Snavely, A.19
Sterling, T.20
Williams, R.S.21
Yelick, K.22
-
5
-
-
78149257903
-
Transparent redundant computing with MPI
-
R. Keller, E. Gabriel, M. M. Resch, and J. Dongarra, Eds. vol. 6305 of Lecture Notes in Computer Science, Springer
-
BRIGHTWELL, R., FERREIRA, K. B., AND RIESEN, R. Transparent redundant computing with MPI. In EuroMPI (2010), R. Keller, E. Gabriel, M. M. Resch, and J. Dongarra, Eds., vol. 6305 of Lecture Notes in Computer Science, Springer, pp. 208-218.
-
(2010)
EuroMPI
, pp. 208-218
-
-
Brightwell, R.1
Ferreira, K.B.2
Riesen, R.3
-
6
-
-
68249127079
-
Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
-
CAPPELLO, F. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. IJHPCA 23, 3 (2009), 212-226.
-
(2009)
IJHPCA
, vol.23
, Issue.3
, pp. 212-226
-
-
Cappello, F.1
-
7
-
-
0345757358
-
Practical Byzantine Fault Tolerance and Proactive Recovery
-
DOI 10.1145/571637.571640
-
CASTRO, M., AND LISKOV, B. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS) 20, 4 (Nov. 2002), 398-461. (Pubitemid 135702591)
-
(2002)
ACM Transactions on Computer Systems
, vol.20
, Issue.4
, pp. 398-461
-
-
Castro, M.1
Liskov, B.2
-
8
-
-
12444281734
-
A fault tolerant protocol for massively parallel systems
-
Santa Fe, NM USA, April, IEEE Computer Society Press
-
CHAKRAVORTY, S., AND KALÉ, L. V. A fault tolerant protocol for massively parallel systems. In Proceedings of the International Parallel and Distributed Processing Symposium (Santa Fe, NM USA, April 2004), IEEE Computer Society Press.
-
(2004)
Proceedings of the International Parallel and Distributed Processing Symposium
-
-
Chakravorty, S.1
Kalé, L.V.2
-
9
-
-
84883502243
-
Hive: Fault containment for shared-memory multiprocessors
-
New York, NY, USA, ACM
-
CHAPIN, J., ROSENBLUM, M., DEVINE, S., LAHIRI, T., TEODOSIU, D., AND GUPTA, A. Hive: fault containment for shared-memory multiprocessors. In SOSP'95: Proceedings of the fifteenth ACM symposium on Operating systems principles (New York, NY, USA, 1995), ACM, pp. 12-25.
-
(1995)
SOSP'95: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles
, pp. 12-25
-
-
Chapin, J.1
Rosenblum, M.2
Devine, S.3
Lahiri, T.4
Teodosiu, D.5
Gupta, A.6
-
10
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
-
DALY, J. T. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener. Comput. Syst. 22, 3 (2006), 303-312. (Pubitemid 41689812)
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
11
-
-
0003260044
-
CTH: A software family for multi-dimensional shock physics analysis
-
July
-
HERTEL, JR., E. S., BELL, R. L., ELRICK, M. G., FARNSWORTH, A. V., KERLEY, G. I., MCGLAUN, J. M., PETNEY, S. V., SILLING, S. A., TAYLOR, P. A., AND YARRINGTON, L. CTH: A software family for multi-dimensional shock physics analysis. In Proceedings of the 19th International Symposium on Shock Waves (July 1993), pp. 377-382.
-
(1993)
Proceedings of the 19th International Symposium on Shock Waves
, pp. 377-382
-
-
Hertel Jr., E.S.1
Bell, R.L.2
Elrick, M.G.3
Farnsworth, A.V.4
Kerley, G.I.5
McGlaun, J.M.6
Petney, S.V.7
Silling, S.A.8
Taylor, P.A.9
Yarrington, L.10
-
12
-
-
9144223280
-
Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
-
Apr.
-
ELNOZAHY, E., AND PLANK, J. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. Dependable and Secure Computing, IEEE Transactions on 1, 2 (Apr. 2004), 97-108.
-
(2004)
Dependable and Secure Computing, IEEE Transactions on
, vol.1
, Issue.2
, pp. 97-108
-
-
Elnozahy, E.1
Plank, J.2
-
13
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
ELNOZAHY, E. N. M., ALVISI, L., WANG, Y.-M., AND JOHNSON, D. B. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3 (2002), 375-408.
-
(2002)
ACM Comput. Surv.
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
14
-
-
74549140832
-
The case for modular redundancy in large-scale high performance computing systems
-
Innsbruck, Austria, Feb. 16-18, ACTA Press, Calgary, AB, Canada
-
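ENGELMANN, C., ONG, H. H., AND SCOTT, S. L. The case for modular redundancy in large-scale high performance computing systems. In th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009 (Innsbruck, Austria, Feb. 16-18, 2009), ACTA Press, Calgary, AB, Canada, pp. 189-194.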
-
(2009)
th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009
, pp. 189-194
-
-
Engelmann, C.1
Ong, H.H.2
Scott, S.L.3
-
15
-
-
83155195270
-
rMPI: Increasing fault resiliency in a message-passing environment
-
FERREIRA, K., RIESEN, R., OLDFIELD, R., STEARLEY, J., LAROS III, J. H., PEDRETTI, K., AND BRIGHTWELL, R. rMPI: Increasing fault resiliency in a message-passing environment. Technical Report SAND2011-2488, Sandia National Laboratories, 2011.
-
(2011)
Technical Report SAND2011-2488, Sandia National Laboratories
-
-
Ferreira, K.1
Riesen, R.2
Oldfield, R.3
Stearley, J.4
Laros III, J.H.5
Pedretti, K.6
Brightwell, R.7
-
16
-
-
0345415768
-
Fundamentals of fault-tolerant distributed computing in asynchronous environments
-
March
-
GÄRTNER, F. C. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys 31, 1 (March 1999), 1-26.
-
(1999)
ACM Computing Surveys
, vol.31
, Issue.1
, pp. 1-26
-
-
Gärtner, F.C.1
-
17
-
-
80053223509
-
Uncoordinated checkpointing without domino effect for send-deterministic message passing applications
-
May
-
GUERMOUCHE, A., ROPARS, T., BRUNET, E., SNIR, M., AND CAPPELLO, F. Uncoordinated checkpointing without domino effect for send-deterministic message passing applications. In Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium (May 2011).
-
(2011)
Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium
-
-
Guermouche, A.1
Ropars, T.2
Brunet, E.3
Snir, M.4
Cappello, F.5
-
18
-
-
67349271621
-
An analysis of clustered failures on large supercomputing systems
-
July
-
HACKER, T. J., ROMERO, F., AND CAROTHERS, C. D. An analysis of clustered failures on large supercomputing systems. J. Parallel Distrib. Comput. 69 (July 2009), 652-665.
-
(2009)
J. Parallel Distrib. Comput.
, vol.69
, pp. 652-665
-
-
Hacker, T.J.1
Romero, F.2
Carothers, C.D.3
-
19
-
-
84990709890
-
The general birthday problem
-
New York, NY, USA, John Wiley & Sons, Inc.
-
HOLST, L. The general birthday problem. In Random Graphs 93: Proceedings of the sixth international seminar on Random graphs and probabilistic methods in combinatorics and computer science (New York, NY, USA, 1995), John Wiley & Sons, Inc., pp. 201-208.
-
(1995)
Random Graphs 93: Proceedings of the Sixth International Seminar on Random Graphs and Probabilistic Methods in Combinatorics and Computer Science
, pp. 201-208
-
-
Holst, L.1
-
22
-
-
78149347218
-
Predictive performance and scalability modeling of a large-scale application
-
KERBYSON, D. J., ALME, H. J., HOISIE, A., PETRINI, F., WASSERMAN, H. J., AND GITTINGS, M. Predictive performance and scalability modeling of a large-scale application. In Proceedings of the ACM/IEEE conference on Supercomputing (2001), pp. 37-48.
-
(2001)
Proceedings of the ACM/IEEE Conference on Supercomputing
, pp. 37-48
-
-
Kerbyson, D.J.1
Alme, H.J.2
Hoisie, A.3
Petrini, F.4
Wasserman, H.J.5
Gittings, M.6
-
23
-
-
0017996760
-
Time, clocks, and the ordering of events in a distributed system
-
DOI 10.1145/359545.359563
-
LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978), 558-565. (Pubitemid 8615486)
-
(1978)
Communications of the ACM
, vol.21
, Issue.7
, pp. 558-565
-
-
Lamport, L.1
-
24
-
-
0344850282
-
A generalized birthday problem
-
MATHIS, F. H. A generalized birthday problem. SIAM Review 33, 2 (1991), 265-270.
-
(1991)
SIAM Review
, vol.33
, Issue.2
, pp. 265-270
-
-
Mathis, F.H.1
-
25
-
-
0003321148
-
An overview of the Intel TFLOPS supercomputer
-
MATTSON, T. G., AND HENRY, G. An overview of the Intel TFLOPS supercomputer. Intel Technology Journal, Q1 (1998), 12.
-
(1998)
Intel Technology Journal
, vol.Q1
, pp. 12
-
-
Mattson, T.G.1
Henry, G.2
-
26
-
-
15044360879
-
The architecture of Tandem's NonStop system
-
New York, NY, USA, ACM
-
MCEVOY, D. The architecture of Tandem's NonStop system. In ACM'81: Proceedings of the ACM'81 conference (New York, NY, USA, 1981), ACM, p. 245.
-
(1981)
ACM'81: Proceedings of the ACM'81 Conference
, pp. 245
-
-
McEvoy, D.1
-
27
-
-
78650831692
-
Design, modeling, and evaluation of a scalable multi-level checkpointing system
-
Washington, DC, USA, SC'10, IEEE Computer Society
-
MOODY, A., BRONEVETSKY, G., MOHROR, K., AND DE SUPINSKI, B. R. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Washington, DC, USA, 2010), SC'10, IEEE Computer Society, pp. 1-11.
-
(2010)
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
, pp. 1-11
-
-
Moody, A.1
Bronevetsky, G.2
Mohror, K.3
de Supinski, B.R.4
-
28
-
-
47249142074
-
Modeling the impact of checkpoints on next-generation systems
-
Sept.
-
OLDFIELD, R. A., ARUNAGIRI, S., TELLER, P. J., SEELAM, S., VARELA, M. R., RIESEN, R., AND ROTH, P. C. Modeling the impact of checkpoints on next-generation systems. In 24th IEEE Conference on Mass Storage Systems and Technologies (Sept. 2007), pp. 30-46.
-
(2007)
24th IEEE Conference on Mass Storage Systems and Technologies
, pp. 30-46
-
-
Oldfield, R.A.1
Arunagiri, S.2
Teller, P.J.3
Seelam, S.4
Varela, M.R.5
Riesen, R.6
Roth, P.C.7
-
29
-
-
33746286070
-
Performance implications of periodic checkpointing on large-scale cluster systems
-
OLINER, A. J., SAHOO, R. K., MOREIRA, J. E., AND GUPTA, M. Performance implications of periodic checkpointing on large-scale cluster systems. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 (2005), p. 299-2.
-
(2005)
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18
, pp. 299-302
-
-
Oliner, A.J.1
Sahoo, R.K.2
Moreira, J.E.3
Gupta, M.4
-
31
-
-
78649472064
-
Application sensitivity to link and injection bandwidth on a Cray XT4 system
-
Helsinki, Finland, May
-
PEDRETTI, K. T., VAUGHAN, C., HEMMERT, K. S., AND BARRETT, B. Application sensitivity to link and injection bandwidth on a Cray XT4 system. In Proceedings of the 2008 Cray User Group Annual Technical Conference (Helsinki, Finland, May 2008).
-
(2008)
Proceedings of the 2008 Cray User Group Annual Technical Conference
-
-
Pedretti, K.T.1
Vaughan, C.2
Hemmert, K.S.3
Barrett, B.4
-
32
-
-
0028994249
-
Algorithm-based diskless checkpointing for fault tolerant matrix operations
-
Pasadena, CA, USA, June 1995, Los Alamitos, CA, USA: IEEE Comput. Soc. Press
-
PLANK, J. S., KIM, Y. B., AND DONGARRA, J. J. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers (Pasadena, CA, USA, June 1995), Los Alamitos, CA, USA: IEEE Comput. Soc. Press, 1995, pp. 351-360.
-
(1995)
Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
, pp. 351-360
-
-
Plank, J.S.1
Kim, Y.B.2
Dongarra, J.J.3
-
33
-
-
0002467378
-
Fast parallel algorithms for short-range molecular dynamics
-
PLIMPTON, S. J. Fast parallel algorithms for short-range molecular dynamics. J Comp Phys 117, 1 (1995), 1-19.
-
(1995)
J Comp Phys
, vol.117
, Issue.1
, pp. 1-19
-
-
Plimpton, S.J.1
-
35
-
-
83155177911
-
-
home page, Apr. 10
-
Sandia National Laboratories. Mantevo project home page. https://software.sandia.gov/mantevo, Apr. 10 2010.
-
(2010)
Mantevo Project
-
-
-
36
-
-
0025564050
-
Implementing fault-tolerant services using the state machine approach: A tutorial
-
SCHNEIDER, F. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22, 4 (1990), 299-319.
-
(1990)
ACM Computing Surveys
, vol.22
, Issue.4
, pp. 299-319
-
-
Schneider, F.1
-
38
-
-
36148941068
-
Understanding failures in petascale computers
-
SCHROEDER, B., AND GIBSON, G. A. Understanding failures in petascale computers. Journal of Physics: Conference Series 78, 1 (2007), 012022.
-
(2007)
Journal of Physics: Conference Series
, vol.78
, Issue.1
, pp. 012022
-
-
Schroeder, B.1
Gibson, G.A.2
-
39
-
-
84864756973
-
An experimental study about diskless checkpointing
-
Vasteras, Sweden, August, IEEE Computer Society Press
-
SILVA, L. M., AND SILVA, J. G. An experimental study about diskless checkpointing. In 24th EUROMICRO Conference (Vasteras, Sweden, August 1998), IEEE Computer Society Press, pp. 395-402.
-
(1998)
24th EUROMICRO Conference
, pp. 395-402
-
-
Silva, L.M.1
Silva, J.G.2
-
40
-
-
46049083585
-
Joshua: Symmetric active/active replication for highly available HPC job and resource management
-
Los Alamitos, CA, USA, IEEE Computer Society
-
UHLEMANN, K., ENGELMANN, C., AND SCOTT, S. Joshua: Symmetric active/active replication for highly available HPC job and resource management. In Proceedings of the 2006 IEEE International Conference on Cluster Computing (Los Alamitos, CA, USA, 2006), IEEE Computer Society.
-
(2006)
Proceedings of the 2006 IEEE International Conference on Cluster Computing
-
-
Uhlemann, K.1
Engelmann, C.2
Scott, S.3