-
1
-
-
0032000230
-
Message Logging: Pessimistic, Optimistic, Causal, and Optimal
-
L. Alvisi and K. Marzullo. Message Logging: Pessimistic, Optimistic, Causal, and Optimal. IEEE Transactions on Software Engineering, 24(2):149-159, 1998.
-
(1998)
IEEE Transactions on Software Engineering
, vol.24
, Issue.2
, pp. 149-159
-
-
Alvisi, L.1
Marzullo, K.2
-
2
-
-
0032597670
-
An Analysis of Communication-Induced Checkpointing
-
Washington, DC, USA, IEEE Computer Society
-
L. Alvisi, S. Rao, S. A. Husain, A. de Mel, and E. Elnozahy. An Analysis of Communication-Induced Checkpointing. In Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, FTCS '99, pages 242-, Washington, DC, USA, 1999. IEEE Computer Society.
-
(1999)
Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, FTCS '99
-
-
Alvisi, L.1
Rao, S.2
Husain, S.A.3
De Mel, A.4
Elnozahy, E.5
-
3
-
-
0003605996
-
-
The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center
-
D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, 1995.
-
(1995)
The NAS Parallel Benchmarks
-
-
Bailey, D.1
Harris, T.2
Saphir, W.3
Van Der Wijngaart, R.4
Woo, A.5
Yarrow, M.6
-
4
-
-
0032305992
-
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints
-
Washington, DC, USA, IEEE Computer Society
-
R. Baldoni, F. Quaglia, and B. Ciciani. A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints. In Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, SRDS '98, pages 61-, Washington, DC, USA, 1998. IEEE Computer Society.
-
(1998)
Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, SRDS '98
-
-
Baldoni, R.1
Quaglia, F.2
Ciciani, B.3
-
5
-
-
0024123530
-
Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems-an Optimistic Approach
-
Columbus, OH, USA
-
B. Bhargava and L. Shu-Renn. Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems-an Optimistic Approach. In Seventh Symposium on Reliable Distributed Systems, pages 3-12, Columbus, OH, USA, 1988.
-
(1988)
Seventh Symposium on Reliable Distributed Systems
, pp. 3-12
-
-
Bhargava, B.1
Shu-Renn, L.2
-
6
-
-
78149231438
-
Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
-
Berlin, Heidelberg, Springer-Verlag
-
G. Bosilca, A. Bouteiller, T. Herault, P. Lemarinier, and J. J. Dongarra. Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols. In Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface, EuroMPI'10, pages 189-197, Berlin, Heidelberg, 2010. Springer-Verlag.
-
(2010)
Proceedings of the 17th European MPI Users' Group Meeting Conference on Recent Advances in the Message Passing Interface, EuroMPI'10
, pp. 189-197
-
-
Bosilca, G.1
Bouteiller, A.2
Herault, T.3
Lemarinier, P.4
Dongarra, J.J.5
-
8
-
-
80052306159
-
Correlated Set Coordination in Fault Tolerant Message Logging Protocols
-
Euro-Par 2011, Springer Berlin / Heidelberg
-
A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Correlated Set Coordination in Fault Tolerant Message Logging Protocols. In Euro-Par 2011, volume 6853 of Lecture Notes in Computer Science, pages 51-64. Springer Berlin / Heidelberg, 2011.
-
(2011)
Lecture Notes in Computer Science
, vol.6853
, pp. 51-64
-
-
Bouteiller, A.1
Herault, T.2
Bosilca, G.3
Dongarra, J.4
-
9
-
-
72149132074
-
Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery
-
A. Bouteiller, T. Ropars, G. Bosilca, C. Morin, and J. Dongarra. Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery. In IEEE International Conference on Cluster Computing (Cluster 2009), New Orleans, USA, 2009.
-
IEEE International Conference on Cluster Computing (Cluster 2009), New Orleans, USA, 2009
-
-
Bouteiller, A.1
Ropars, T.2
Bosilca, G.3
Morin, C.4
Dongarra, J.5
-
11
-
-
0022020346
-
Distributed Snapshots: Determining Global States of Distributed Systems
-
K. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63-75, 1985.
-
(1985)
ACM Transactions on Computer Systems
, vol.3
, Issue.1
, pp. 63-75
-
-
Chandy, K.1
Lamport, L.2
-
12
-
-
79951595196
-
The international exascale software project roadmap
-
February
-
J. Dongarra, P. Beckman, T. Moore, et al. The international exascale software project roadmap. International Journal of High Performance Computing Applications, 25:3-60, February 2011.
-
(2011)
International Journal of High Performance Computing Applications
, vol.25
, pp. 3-60
-
-
Dongarra, J.1
Beckman, P.2
Moore, T.3
-
14
-
-
0042078549
-
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
-
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, 34(3):375-408, 2002.
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
15
-
-
84866884771
-
FTI: High performance Fault Tolerance Interface for hybrid systems
-
L. B. Gomez, N. Maruyama, D. Komatitsch, S. Tsuboi, F. Cappello, S. Matsuoka, and T. Nakamura. FTI: High performance Fault Tolerance Interface for hybrid systems. In IEEE/ACM SuperComputing 2011, Seattle, USA, November 2011.
-
IEEE/ACM SuperComputing 2011, Seattle, USA, November 2011
-
-
Gomez, L.B.1
Maruyama, N.2
Komatitsch, D.3
Tsuboi, S.4
Cappello, F.5
Matsuoka, S.6
Nakamura, T.7
-
16
-
-
80053223509
-
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications
-
to appear
-
A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications. In 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), Anchorage, USA, 2011. to appear.
-
25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), Anchorage, USA, 2011
-
-
Guermouche, A.1
Ropars, T.2
Brunet, E.3
Snir, M.4
Cappello, F.5
-
18
-
-
51049086184
-
Scalable Group-Based Checkpoint/Restart for Large-Scale Message-Passing Systems
-
J. C. Y. Ho, C.-L. Wang, and F. C. M. Lau. Scalable Group-Based Checkpoint/Restart for Large-Scale Message-Passing Systems. In 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), Miami, USA, 2008.
-
22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), Miami, USA, 2008
-
-
Ho, J.C.Y.1
Wang, C.-L.2
Lau, F.C.M.3
-
21
-
-
0017996760
-
Time, Clocks, and the Ordering of Events in a Distributed System
-
L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558-565, 1978.
-
(1978)
Communications of the ACM
, vol.21
, Issue.7
, pp. 558-565
-
-
Lamport, L.1
-
22
-
-
77954923590
-
Team-based Message Logging: Preliminary Results
-
E. Meneses, C. L. Mendes, and L. V. Kale. Team-based Message Logging: Preliminary Results. In 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (CCGRID 2010), May 2010.
-
3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (CCGRID 2010), May 2010
-
-
Meneses, E.1
Mendes, C.L.2
Kale, L.V.3
-
24
-
-
4544315524
-
Hybrid Checkpointing for Parallel Applications in Cluster Federations
-
Washington, DC, USA, IEEE Computer Society
-
S. Monnet, C. Morin, and R. Badrinath. Hybrid Checkpointing for Parallel Applications in Cluster Federations. In Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGRID'04), pages 773-782, Washington, DC, USA, 2004. IEEE Computer Society.
-
(2004)
Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGRID'04)
, pp. 773-782
-
-
Monnet, S.1
Morin, C.2
Badrinath, R.3
-
25
-
-
47249142074
-
Modeling the Impact of Checkpoints on Next-Generation Systems
-
Washington, DC, USA, IEEE Computer Society
-
R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela, R. Riesen, and P. C. Roth. Modeling the Impact of Checkpoints on Next-Generation Systems. In MSST '07: Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, pages 30-46, Washington, DC, USA, 2007. IEEE Computer Society.
-
(2007)
MSST '07: Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
, pp. 30-46
-
-
Oldfield, R.A.1
Arunagiri, S.2
Teller, P.J.3
Seelam, S.4
Varela, M.R.5
Riesen, R.6
Roth, P.C.7
-
27
-
-
77956584397
-
See Applications Run and Throughput Jump: The Case for Redundant Computing in HPC
-
Washington, DC, USA, IEEE Computer Society
-
R. Riesen, K. Ferreira, and J. Stearley. See Applications Run and Throughput Jump: The Case for Redundant Computing in HPC. In Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), DSNW '10, pages 29-34, Washington, DC, USA, 2010. IEEE Computer Society.
-
(2010)
Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), DSNW '10
, pp. 29-34
-
-
Riesen, R.1
Ferreira, K.2
Stearley, J.3
-
28
-
-
80052380100
-
On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications
-
T. Ropars, A. Guermouche, B. Uçar, E. Meneses, L. V. Kalé, and F. Cappello. On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications. In Euro-Par 2011, pages 567-578, 2011.
-
(2011)
Euro-Par 2011
, pp. 567-578
-
-
Ropars, T.1
Guermouche, A.2
Uçar, B.3
Meneses, E.4
Kalé, L.V.5
Cappello, F.6
-
29
-
-
80054900610
-
Active optimistic and distributed message logging for message-passing applications
-
T. Ropars and C. Morin. Active optimistic and distributed message logging for message-passing applications. Concurrency and Computation: Practice and Experience, 23(17):2167-2178, 2011.
-
(2011)
Concurrency and Computation: Practice and Experience
, vol.23
, Issue.17
, pp. 2167-2178
-
-
Ropars, T.1
Morin, C.2
-
32
-
-
65349094006
-
Trading off Logging Overhead and Coordinating Overhead to Achieve Efficient Rollback Recovery
-
April
-
J.-M. Yang, K. F. Li, W.-W. Li, and D.-F. Zhang. Trading Off Logging Overhead and Coordinating Overhead to Achieve Efficient Rollback Recovery. Concurrency and Computation: Practice and Experience, 21:819-853, April 2009.
-
(2009)
Concurrency and Computation: Practice and Experience
, vol.21
, pp. 819-853
-
-
Yang, J.-M.1
Li, K.F.2
Li, W.-W.3
Zhang, D.-F.4