-
1
-
-
33845420448
-
A power-aware run-time system for high-performance computing
-
C.-H. Hsu and W.-C. Feng, "A power-aware run-time system for high-performance computing," in Supercomputing, 2005.
-
(2005)
Supercomputing
-
-
Hsu, C.-H.1
Feng, W.-C.2
-
4
-
-
79951775997
-
Application MTTFE vs. Platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale
-
May
-
J. T. Daly, L. A. Pritchett-Sheats, and S. E. Michalak, "Application MTTFE vs. platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale," in Workshop on Resiliency in High Performance Computing, May 2008, pp. 19-22.
-
(2008)
Workshop on Resiliency in High Performance Computing
, pp. 19-22
-
-
Daly, J.T.1
Pritchett-Sheats, L.A.2
Michalak, S.E.3
-
5
-
-
34548768671
-
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
-
Apr.
-
C. Wang, F. Mueller, C. Engelmann, and S. Scott, "A job pause service under LAM/MPI+BLCR for transparent fault tolerance," in International Parallel and Distributed Processing Symposium, Apr. 2007.
-
(2007)
International Parallel and Distributed Processing Symposium
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.4
-
6
-
-
34548771116
-
Dejavu: Transparent user-level checkpointing, migration, and recovery for distributed systems
-
J. Ruscio, M. Heffner, and S. Varadarajan, "Dejavu: Transparent user-level checkpointing, migration, and recovery for distributed systems," in International Parallel and Distributed Processing Symposium, 2007.
-
(2007)
International Parallel and Distributed Processing Symposium
-
-
Ruscio, J.1
Heffner, M.2
Varadarajan, S.3
-
7
-
-
35248827046
-
A Component architecture for LAM/MPI
-
ser. Lecture Notes in Computer Science, no. 2840, Sep.
-
J. M. Squyres and A. Lumsdaine, "A Component Architecture for LAM/MPI," in European PVM/MPI Users' Group Meeting, ser. Lecture Notes in Computer Science, no. 2840, Sep. 2003, pp. 379-387.
-
(2003)
European PVM/MPI Users' Group Meeting
, pp. 379-387
-
-
Squyres, J.M.1
Lumsdaine, A.2
-
9
-
-
20444444457
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Oct.
-
S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in LACSI Symposium, Oct. 2003.
-
(2003)
LACSI Symposium
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
Duell, J.5
Hargrove, P.6
Roman, E.7
-
10
-
-
34548789748
-
The design and implementation of checkpoint/restart process fault tolerance for Open MPI
-
J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in Workshop on Dependable Parallel, Distributed and Network-Centric Systems, Mar. 2007.
-
(2007)
Workshop on Dependable Parallel, Distributed and Network-Centric Systems
-
-
Hursey, J.1
Squyres, J.M.2
Mattox, T.I.3
Lumsdaine, A.4
-
11
-
-
0026812659
-
The design and implementation of a log-structured file system
-
Feb.
-
M. Rosenblum and J. K. Ousterhout, "The design and implementation of a log-structured file system," ACM Trans. on Computer Systems, vol. 10, no. 1, Feb. 1992.
-
(1992)
ACM Trans. on Computer Systems
, vol.10
, Issue.1
-
-
Rosenblum, M.1
Ousterhout, J.K.2
-
12
-
-
70350755748
-
Proactive process-level live migration in HPC environments
-
C. Wang, F. Mueller, C. Engelmann, and S. Scott, "Proactive process-level live migration in HPC environments," in Supercomputing, 2008.
-
(2008)
Supercomputing
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.4
-
13
-
-
85014969248
-
Architectural requirements and scalability of the NAS parallel benchmarks
-
F. Wong, R. Martin, R. Arpaci-Dusseau, and D. Culler, "Architectural requirements and scalability of the NAS parallel benchmarks," in Supercomputing, 1999.
-
(1999)
Supercomputing
-
-
Wong, F.1
Martin, R.2
Arpaci-Dusseau, R.3
Culler, D.4
-
14
-
-
33746047855
-
The design, implementation, and evaluation of mpiBLAST
-
A. Darling, L. Carey, and W. Feng, "The design, implementation, and evaluation of mpiBLAST," in ClusterWorld Conference and Expo, 2003.
-
(2003)
ClusterWorld Conference and Expo
-
-
Darling, A.1
Carey, L.2
Feng, W.3
-
15
-
-
50649087527
-
Reliability-aware approach: An incremental checkpoint/restart model in HPC environments
-
N. Naksinehaboon, Y. Liu, C. B. Leangsuksun, R. Nassar, M. Paun, and S. Scott, "Reliability-aware approach: An incremental checkpoint/restart model in HPC environments," in Symposium on Cluster Computing and the Grid, 2008, pp. 783-788.
-
(2008)
Symposium on Cluster Computing and the Grid
, pp. 783-788
-
-
Naksinehaboon, N.1
Liu, Y.2
Leangsuksun, C.B.3
Nassar, R.4
Paun, M.5
Scott, S.6
-
16
-
-
0029713612
-
CoCheck: Checkpointing and process migration for MPI
-
Honolulu, HI, USA, 15-19 April, IEEE, Ed. 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA: IEEE Computer Society Press, 1996
-
G. Stellner, "CoCheck: Checkpointing and process migration for MPI," in Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, Honolulu, HI, USA, Apr. 1996, pp. 526-531.
-
(1996)
Proceedings of IPPS'96. the 10th International Parallel Processing Symposium
, pp. 526-531
-
-
Stellner, G.1
-
17
-
-
0038194608
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
Nov.
-
G. Bosilca, A. Bouteiller, and F. Cappello, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in Supercomputing, Nov. 2002.
-
(2002)
Supercomputing
-
-
Bosilca, G.1
Bouteiller, A.2
Cappello, F.3
-
18
-
-
60449096682
-
MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
-
A. Bouteiller, F. Cappello, T. Herault, K. Krawezik, P. Lemarinier, and M. Magniette, "MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging," in Supercomputing, 2003.
-
(2003)
Supercomputing
-
-
Bouteiller, A.1
Cappello, F.2
Herault, T.3
Krawezik, K.4
Lemarinier, P.5
Magniette, M.6
-
19
-
-
34250708320
-
Analysis of the component architecture overhead in Open MPI
-
September
-
B. Barrett, J. M. Squyres, A. Lumsdaine, R. L. Graham, and G. Bosilca, "Analysis of the component architecture overhead in Open MPI," in European PVM/MPI Users' Group Meeting, September 2005.
-
(2005)
European PVM/MPI Users' Group Meeting
-
-
Barrett, B.1
Squyres, J.M.2
Lumsdaine, A.3
Graham, R.L.4
Bosilca, G.5
-
20
-
-
33845434226
-
Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
-
R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini, "Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers," in Supercomputing, 2005.
-
(2005)
Supercomputing
-
-
Gioiosa, R.1
Sancho, J.C.2
Jiang, S.3
Petrini, F.4
-
21
-
-
33644504479
-
Space-efficient page-level incremental checkpointing
-
J. Heo, S. Yi, Y. Cho, J. Hong, and S. Y. Shin, "Space-efficient page-level incremental checkpointing," in ACM Symposium on Applied Computing, 2005, pp. 1558-1562.
-
(2005)
ACM Symposium on Applied Computing
, pp. 1558-1562
-
-
Heo, J.1
Yi, S.2
Cho, Y.3
Hong, J.4
Shin, S.Y.5
-
22
-
-
0031224013
-
Continuous checkpointing: Joining the checkpointing with virtual memory paging
-
S.-T. Hsu and R.-C. Chang, "Continuous checkpointing: Joining the checkpointing with virtual memory paging," Softw. Pract. Exper., vol. 27, no. 9, pp. 1103-1120, 1997.
-
(1997)
Software - Practice and Experience
, vol.27
, Issue.9
, pp. 1103-1120
-
-
Hsu, S.-T.1
Chang, R.-C.2
-
23
-
-
33751065156
-
Adaptive page-level incremental checkpointing based on expected recovery time
-
Applied Computing 2006 - The 21st Annual ACM Symposium on Applied Computing - Proceedings of the 2006 ACM Symposium on Applied Computing
-
S. Yi, J. Heo, Y. Cho, and J. Hong, "Adaptive page-level incremental checkpointing based on expected recovery time," in ACM Symposium on Applied Computing, 2006, pp. 1472-1476.
-
(2006)
Proceedings of the ACM Symposium on Applied Computing
, vol.2
, pp. 1472-1476
-
-
Yi, S.1
Heo, J.2
Cho, Y.3
Hong, J.4
-
24
-
-
8344232253
-
Adaptive incremental checkpointing for massively parallel systems
-
New York, NY, USA: ACM
-
S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira, "Adaptive incremental checkpointing for massively parallel systems," in International Conference on Supercomputing. New York, NY, USA: ACM, 2004, pp. 277-286.
-
(2004)
International Conference on Supercomputing
, pp. 277-286
-
-
Agarwal, S.1
Garg, R.2
Gupta, M.S.3
Moreira, J.E.4
-
26
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
J. W. Young, "A first order approximation to the optimum checkpoint interval," Commun. ACM, vol. 17, no. 9, pp. 530-531, 1974.
-
(1974)
Commun. ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1
-
27
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
-
J. T. Daly, "A higher order estimate of the optimum checkpoint interval for restart dumps," Future Gener. Comput. Syst., vol. 22, no. 3, pp. 303-312, 2006.
-
(2006)
Future Generation Computer Systems
, vol.22
, Issue.3
, pp. 303-312
-
-
Daly, J.T.1
-
28
-
-
51049108820
-
An optimal checkpoint/restart model for a large scale high performance computing system
-
Apr.
-
Y. Liu, R. Nassar, C. B. Leangsuksun, N. Naksinehaboon, M. Paun, and S. Scott, "An optimal checkpoint/restart model for a large scale high performance computing system," in International Parallel and Distributed Processing Symposium, Apr. 2008.
-
(2008)
International Parallel and Distributed Processing Symposium
-
-
Liu, Y.1
Nassar, R.2
Leangsuksun, C.B.3
Naksinehaboon, N.4
Paun, M.5
Scott, S.6