[2] R. Baumann, "Soft errors in advanced computer systems," IEEE Design & Test of Computers, vol. 22, no. 3, pp. 258-266, 2005. DOI 10.1109/MDT.2005.69
[3] "Roadrunner," http://www.lanl.gov/roadrunner.
[4] "Jaguar," http://www.nccs.gov/jaguar.
[6] E. N. Elnozahy and J. S. Plank, "Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 97-108, 2004.
[7] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[8] A. Beguelin, E. Seligman, and P. Stephan, "Application level fault tolerance in heterogeneous networks of workstations," Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, 1997. DOI 10.1006/jpdc.1997.1338
[9] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," in Usenix Winter Technical Conference, Jan. 1995, pp. 213-223.
[11] P. H. Hargrove and J. C. Duell, "Berkeley lab checkpoint/restart (BLCR) for Linux clusters," Journal of Physics: Conference Series, vol. 46, no. 1, pp. 494-499, 2006. DOI 10.1088/1742-6596/46/1/067
[13] A. Bouteiller, T. Ropars, G. Bosilca, C. Morin, and J. Dongarra, "Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure recovery," in IEEE International Conference on Cluster Computing, Aug. 2009, pp. 1-9.
[14] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63-75, 1985. DOI 10.1145/214451.214456
[16] J. S. Plank, K. Li, and M. A. Puening, "Diskless checkpointing," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[17] K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. 33, no. 6, pp. 518-528, 1984.
[18] G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, "Algorithm-based fault tolerance applied to high performance computing," Journal of Parallel and Distributed Computing, vol. 69, no. 4, pp. 410-416, 2009.
[19] Z. Chen and J. Dongarra, "Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources," in IEEE International Parallel & Distributed Processing Symposium, Apr. 2006.
[21] G. Zheng, L. Shi, and L. V. Kale, "FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI," in IEEE International Conference on Cluster Computing, Sep. 2004, pp. 93-103.
[22] V. Tipparaju, M. Krishnan, B. Palmer, F. Petrini, and J. Nieplocha, "Towards fault resilient global arrays," in International Conference on Parallel Computing, vol. 15, 2007, pp. 339-345.
[23] R. Gupta et al., "CIFTS: A coordinated infrastructure for fault-tolerant systems," in International Conference on Parallel Processing, 2009, pp. 237-245.
[24] C. Engelmann, G. Vallée, T. Naughton, and S. L. Scott, "Proactive fault tolerance using preemptive migration," in Euromicro International Conference on Parallel, Distributed and Network-based Processing, Feb. 2009, pp. 252-257.
[25] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, "Proactive process-level live migration in HPC environments," in Proceedings of the ACM/IEEE Conference on Supercomputing, Nov. 2008, pp. 1-12.
[26] J. Nieplocha, B. Palmer, V. Tipparaju, M. Krishnan, H. Trease, and E. Aprà, "Advances, applications and performance of the global arrays shared memory programming toolkit," International Journal of High Performance Computing Applications, vol. 20, pp. 203-231, May 2006.
[27] "Global Arrays," http://www.emsl.pnl.gov/docs/global.
[28] M. Valiev et al., "NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations," Computer Physics Communications, vol. 181, no. 9, pp. 1477-1489, 2010.
[29] W. A. Jong et al., "Utilizing high performance computing for chemistry: Parallel computational chemistry," Physical Chemistry Chemical Physics, vol. 12, no. 26, pp. 6896-6920, 2010.
[30] D. Cociorva, G. Baumgartner, C.-C. Lam, P. Sadayappan, J. Ramanujam, M. Nooijen, D. E. Bernholdt, and R. Harrison, "Space-time trade-off optimization for a class of electronic structure calculations," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2002, pp. 177-186.
[31] C. Oehmen and J. Nieplocha, "ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, pp. 740-749, 2006. DOI 10.1109/TPDS.2006.112