-
1
-
-
12344308304
-
Basic concepts and taxonomy of dependable and secure computing
-
Avizienis, A., Laprie, J.-C., Randell, B. and Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1: 11-33.
-
(2004)
IEEE Trans. Dependable Secure Comput.
, vol.1
, pp. 11-33
-
-
Avizienis, A.1
Laprie, J.-C.2
Randell, B.3
Landwehr, C.4
-
2
-
-
61449090689
-
Redesigning the message logging model for high performance
-
In Dresden, Germany, June
-
Bouteiller, A., Bosilca, G. and Dongarra, J. (2008). Redesigning the message logging model for high performance. In Proceedings of the International Supercomputing Conference (ISC 2008), Dresden, Germany, June.
-
(2008)
Proceedings of the International Supercomputing Conference (ISC 2008)
-
-
Bouteiller, A.1
Bosilca, G.2
Dongarra, J.3
-
3
-
-
0029214558
-
Designing programs that check their work
-
Blum, M. and Kannan, S. (1995). Designing programs that check their work. J. ACM 42(1): 269-291.
-
(1995)
J. ACM
, vol.42
, Issue.1
, pp. 269-291
-
-
Blum, M.1
Kannan, S.2
-
4
-
-
70450200139
-
-
BLCR. (Accessed: September 2)
-
BLCR. http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml (Accessed: September 2, 2009)
-
(2009)
-
-
-
5
-
-
34548748000
-
C3: A system for automating application-level checkpointing of MPI programs
-
In October
-
Bronevetsky, G., Marques, D., Pingali, K. and Stodghill, P. (2003). C3: A system for automating application-level checkpointing of MPI programs. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), October.
-
(2003)
Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003)
-
-
Bronevetsky, G.1
Marques, D.2
Pingali, K.3
Stodghill, P.4
-
6
-
-
51049083541
-
Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
-
In April
-
Chen, Z. (2008). Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments. In Proceedings of the IEEE Parallel and Distributed Processing Symposium, April, pp. 1-8.
-
(2008)
Proceedings of the IEEE Parallel and Distributed Processing Symposium
, pp. 1-8
-
-
Chen, Z.1
-
7
-
-
78449285638
-
Proactive process-level live migration in HPC environments
-
In Tampa
-
Wang, C., Mueller, F., Engelmann, C. and Scott, S.L. (2008). Proactive process-level live migration in HPC environments. In Proceedings of Supercomputing 2008, Tampa.
-
(2008)
Proceedings of Supercomputing 2008
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.L.4
-
8
-
-
70450210363
-
-
CIFT. (Accessed: September 2)
-
CIFT. http://www.mcs.anl.gov/research/cifts/index.php (Accessed: September 2, 2009)
-
(2009)
-
-
-
9
-
-
0022020346
-
Distributed snapshots: Determining global states of distributed systems
-
Chandy, K.M. and Lamport, L. (1985). Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1): 63-75.
-
(1985)
ACM Trans. Comput. Syst.
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.M.1
Lamport, L.2
-
12
-
-
70450197271
-
-
CSCL. (Accessed: September 2)
-
CSCL. http://www.cs.wisc.edu/condor/manual/v6.8/4_2Condor_s_Checkpoint.html (Accessed: September 2, 2009)
-
(2009)
-
-
-
13
-
-
84976834622
-
Self-stabilizing systems in spite of distributed control
-
Dijkstra, E.W. (1974). Self-stabilizing systems in spite of distributed control. Commun. ACM 17(11): 643-644.
-
(1974)
Commun. ACM
, vol.17
, Issue.11
, pp. 643-644
-
-
Dijkstra, E.W.1
-
14
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
Elnozahy, E.N., Alvisi, L., Wang, Y.-M. and Johnson, D.B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3): 375-408.
-
(2002)
ACM Comput. Surv.
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
15
-
-
0029004440
-
Toward a theory of situation awareness in dynamic systems
-
Endsley, M.R. (1995). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1): 32-64.
-
(1995)
Human Factors
, vol.37
, Issue.1
, pp. 32-64
-
-
Endsley, M.R.1
-
17
-
-
70450197272
-
-
FT-MPI
-
FT-MPI. http://icl.cs.utk.edu/ftmpi/ (2009)
-
-
-
-
18
-
-
70450211231
-
Extending stability beyond CPU millennium: A micron-scale atomistic simulation of Kelvin-Helmholtz instability
-
In Reno
-
Glosli, J.N., Richards, D.F., Caspersen, K.J., Rudd, R.E., Gunnels, J.A. and Streitz, F.H. (2007). Extending stability beyond CPU millennium: A micron-scale atomistic simulation of Kelvin-Helmholtz instability. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Reno.
-
(2007)
Proceedings of the 2007 ACM/IEEE Conference on Supercomputing
-
-
Glosli, J.N.1
Richards, D.F.2
Caspersen, K.J.3
Rudd, R.E.4
Gunnels, J.A.5
Streitz, F.H.6
-
20
-
-
0021439162
-
Algorithm-based fault tolerance for matrix operations
-
Huang, K. and Abraham, J. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. C-33(6): 518-528.
-
(1984)
IEEE Trans. Comput.
, vol.C-33
, Issue.6
, pp. 518-528
-
-
Huang, K.1
Abraham, J.2
-
22
-
-
70450182237
-
PERCU: A holistic method for evaluating high performance computing systems
-
Dissertation, University of California Berkeley
-
Kramer, W. (2008). PERCU: A holistic method for evaluating high performance computing systems. Dissertation, University of California Berkeley.
-
(2008)
-
-
Kramer, W.1
-
23
-
-
70450200137
-
-
LAM/MPI. (Accessed: September 2)
-
LAM/MPI. http://www.lam-mpi.org/ (Accessed: September 2, 2009)
-
(2009)
-
-
-
25
-
-
70450200135
-
-
Libckpt. (Accessed: September 2)
-
Libckpt. http://www.cs.utk.edu/~plank/plank/www./libckpt.html (Accessed: September 2, 2009)
-
(2009)
-
-
-
26
-
-
36949009638
-
Scalable diskless checkpointing for large parallel systems
-
Ph.D. dissertation, University of Illinois at Urbana-Champaign
-
Lu, C.D. (2005). Scalable diskless checkpointing for large parallel systems. Ph.D. dissertation, University of Illinois at Urbana-Champaign.
-
(2005)
-
-
Lu, C.D.1
-
27
-
-
84884662651
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
In IEEE, November
-
Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fédak, G., Germain, C., Hérault, T. and Lemarinier, P. (2002). MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Proceedings of SuperComputing 2002. IEEE, November, http://mpich-v.lri.fr/.
-
(2002)
Proceedings of SuperComputing 2002
-
-
Bosilca, G.1
Bouteiller, A.2
Cappello, F.3
Djilali, S.4
Fédak, G.5
Germain, C.6
Hérault, T.7
Lemarinier, P.8
-
28
-
-
70450211235
-
-
MVAPICH. (Accessed: September 2)
-
MVAPICH. http://mvapich.cse.ohio-state.edu/overview/mvapich/ (Accessed: September 2, 2009)
-
(2009)
-
-
-
29
-
-
16244380627
-
Using fault injection and modeling to evaluate the performability of cluster-based services
-
In Seattle, WA, March
-
Nagaraja, K., Li, X., Bianchini, R., Martin, R. and Nguyen, T.D. (2003). Using fault injection and modeling to evaluate the performability of cluster-based services. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, Seattle, WA, March.
-
(2003)
Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems
-
-
Nagaraja, K.1
Li, X.2
Bianchini, R.3
Martin, R.4
Nguyen, T.D.5
-
30
-
-
70450211233
-
-
OpenMPI. (Accessed: September 2)
-
OpenMPI. http://www.open-mpi.org/ (Accessed: September 2, 2009)
-
(2009)
-
-
-
32
-
-
70450211234
-
-
PDSI. (Accessed: September 2)
-
PDSI. http://pdsi.nersc.gov (Accessed: September 2, 2009)
-
(2009)
-
-
-
33
-
-
0032179680
-
Diskless checkpointing
-
Plank, J., Li, K. and Puening, M. (1998). Diskless checkpointing. IEEE Trans. Parallel Distr. Syst. 9(10): 972-986.
-
(1998)
IEEE Trans. Parallel Distr. Syst.
, vol.9
, Issue.10
, pp. 972-986
-
-
Plank, J.1
Li, K.2
Puening, M.3
-
34
-
-
0033077475
-
Memory exclusion: Optimizing the performance of checkpointing systems
-
Plank, J.S., Chen, Y., Li, K., Beck, M. and Kingsley, G. (1999). Memory exclusion: Optimizing the performance of checkpointing systems. Software Pract. Ex. 29(2): 125-142.
-
(1999)
Software Pract. Ex.
, vol.29
, Issue.2
, pp. 125-142
-
-
Plank, J.S.1
Chen, Y.2
Li, K.3
Beck, M.4
Kingsley, G.5
-
35
-
-
84885578759
-
Rx: Treating bugs as allergies-a safe method to survive software failure
-
In October
-
Qin, F., Tucek, J., Sundaresan, J. and Zhou, Y. (2005). Rx: Treating bugs as allergies-a safe method to survive software failure. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP'05), October.
-
(2005)
Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP'05)
-
-
Qin, F.1
Tucek, J.2
Sundaresan, J.3
Zhou, Y.4
-
36
-
-
68249122526
-
Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems
-
Sahoo, R.K., Bae, M., Vilalta, R., Moreira, J., Ma, S. and Gupta, M. (2002). Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems. In Proceedings of IEEE/ACM Supercomputing 2002.
-
(2002)
Proceedings of IEEE/ACM Supercomputing 2002
-
-
Sahoo, R.K.1
Bae, M.2
Vilalta, R.3
Moreira, J.4
Ma, S.5
Gupta, M.6
-
38
-
-
50649108554
-
Proactive fault tolerance in MPI applications via task migration
-
In LNCS
-
Chakravorty, S., Mendes, C.L. and Kale, L.V. (2006). Proactive fault tolerance in MPI applications via task migration. In Proceedings of HIPC 2006, LNCS, volume 4297, p. 485.
-
(2006)
Proceedings of HIPC 2006
, vol.4297
, pp. 485
-
-
Chakravorty, S.1
Mendes, C.L.2
Kale, L.V.3
-
39
-
-
67650091156
-
A tunable holistic resiliency approach for high-performance computing systems
-
Raleigh, NC, USA
-
Scott, S., Engelmann, C., Vallee, G., Naughton, T., Tikotekar, A., Ostrouchov, G., Leangsuksun, C., Naksinehaboon, N., Nassar, R., Paun, M., Mueller, F., Wang, C., Nagarajan, A. and Varma, J. (2009). A tunable holistic resiliency approach for high-performance computing systems. Poster in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Raleigh, NC, USA.
-
(2009)
Poster in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)
-
-
Scott, S.1
Engelmann, C.2
Vallee, G.3
Naughton, T.4
Tikotekar, A.5
Ostrouchov, G.6
Leangsuksun, C.7
Naksinehaboon, N.8
Nassar, R.9
Paun, M.10
Mueller, F.11
Wang, C.12
Nagarajan, A.13
Varma, J.14
-
40
-
-
70450210361
-
-
SCR. (Accessed: September 2)
-
SCR. http://scalablecr.sourceforge.net/ (Accessed: September 2, 2009)
-
(2009)
-
-
-
41
-
-
36148941068
-
Understanding failures in petascale computers
-
Teodorescu, R., Nakano, J. and Torrellas, J. (2006). SWICH: a prototype for efficient cache-level checkpointing and rollback. IEEE Micro. 26(5): 28-40
-
Schroeder, B. and Gibson, G. (2007). Understanding failures in petascale computers. J Phys. Conf. 78: 012022.
Teodorescu, R., Nakano, J. and Torrellas, J. (2006). SWICH: A prototype for efficient cache-level checkpointing and rollback. IEEE Micro. 26(5): 28-40.
-
(2007)
J Phys. Conf.
, vol.78
, pp. 012022
-
-
Schroeder, B.1
Gibson, G.2
-
42
-
-
0003133883
-
Probabilistic logics and the synthesis of reliable organisms from unreliable components
-
In edited by C. E. Shannon and J. McCarthy. New Jersey: Princeton University Press
-
Von Neumann, J. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, edited by C. E. Shannon and J. McCarthy. New Jersey: Princeton University Press, pp. 43-98.
-
(1956)
Automata Studies
, pp. 43-98
-
-
Von Neumann, J.1
-
44
-
-
20444463494
-
FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
-
In San Diego, CA, September
-
Zheng, G., Shi, L. and Kale, L.V. (2004). FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, San Diego, CA, September, pp. 93-103.
-
(2004)
Proceedings of the 2004 IEEE International Conference on Cluster Computing
, pp. 93-103
-
-
Zheng, G.1
Shi, L.2
Kale, L.V.3