-
1
-
-
77952225154
-
-
MPI over InfiniBand Project
-
MPI over InfiniBand Project. In http://nowlab.cse.ohiostate.edu/projects/mpi-iba/.
-
-
-
-
5
-
-
0032592492
-
Harness: A next generation distributed virtual machine
-
Micah Beck, Jack J. Dongarra, Graham E. Fagg, et al. Harness: a next generation distributed virtual machine. Future Gener. Comput. Syst., 15(5-6):571-582, 1999.
-
(1999)
Future Gener. Comput. Syst.
, vol.15
, Issue.5-6
, pp. 571-582
-
-
Beck, M.1
Dongarra, J.J.2
Fagg, G.E.3
-
6
-
-
0038194608
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
George Bosilca, Aurelien Bouteiller, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Vincent Neri, and Anton Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Supercomputing, pages 1-18, 2002.
-
(2002)
Supercomputing
, pp. 1-18
-
-
Bosilca, G.1
Bouteiller, A.2
Djilali, S.3
Fedak, G.4
Germain, C.5
Herault, T.6
Neri, V.7
Selikhov, A.8
-
9
-
-
33749061217
-
Requirements for Linux Checkpoint/Restart
-
Berkeley, CA 94720
-
Duell, J., Hargrove, P., and Roman, E. Requirements for Linux Checkpoint/Restart. Technical Report LBNL-49659, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, 2002.
-
(2002)
Technical Report LBNL-49659, Lawrence Berkeley National Laboratory
-
-
Duell, J.1
Hargrove, P.2
Roman, E.3
-
11
-
-
34548789748
-
The design and implementation of checkpoint/restart process fault tolerance for Open MPI
-
J. Hursey, J.M. Squyres, T.I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, March 2007.
-
12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, March 2007
-
-
Hursey, J.1
Squyres, J.M.2
Mattox, T.I.3
Lumsdaine, A.4
-
14
-
-
0003912256
-
Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System
-
April
-
Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report UW-CSTR-1346, University of Wisconsin-Madison, Computer Sciences Department, April 1997.
-
(1997)
Technical Report UW-CSTR-1346, University of Wisconsin-Madison, Computer Sciences Department
-
-
Litzkow, M.1
Tannenbaum, T.2
Basney, J.3
Livny, M.4
-
15
-
-
74049121711
-
Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
-
6
-
Paul H. Hargrove and Jason C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters. In SciDAC, June 2006.
-
(2006)
SciDAC
-
-
Hargrove, P.H.1
Duell, J.C.2
-
16
-
-
0141599174
-
-
Technical report, Knoxville, TN, USA
-
James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. Technical report, Knoxville, TN, USA, 1994.
-
(1994)
Libckpt: Transparent Checkpointing under Unix
-
-
Plank, J.S.1
Beck, M.2
Kingsley, G.3
Li, K.4
-
17
-
-
47249116207
-
Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand
-
Q. Gao, W. Huang, M. Koop, and D. K. Panda. Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand. In Int'l Conference on Parallel Processing (ICPP), Xi'an, China, September 2007.
-
Int'l Conference on Parallel Processing (ICPP), Xi'an, China, September 2007
-
-
Gao, Q.1
Huang, W.2
Koop, M.3
Panda, D.K.4
-
19
-
-
20444444457
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Oct.
-
S. Sankaran, J. M. Squyres, B. Barrett, et al. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. LACSI, Oct. 2003.
-
(2003)
LACSI
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
-
21
-
-
34548768671
-
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
-
Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A job pause service under LAM/MPI+BLCR for transparent fault tolerance. In IPDPS, pages 1-10, 2007.
-
(2007)
IPDPS
, pp. 1-10
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.L.4
-
23
-
-
85014969248
-
Architectural requirements and scalability of the NAS Parallel Benchmarks
-
Frederick C. Wong, Richard P. Martin, et al. Architectural requirements and scalability of the NAS Parallel Benchmarks. In Supercomputing '99, page 41, 1999.
-
(1999)
Supercomputing '99
, p. 41
-
-
Wong, F.C.1
Martin, R.P.2
-
24
-
-
77951447133
-
Accelerating checkpoint operation by node-level write aggregation on multicore systems
-
To appear in September
-
Xiangyong Ouyang, Karthik Gopalakrishnan, and Dhabaleswar K. Panda. Accelerating checkpoint operation by node-level write aggregation on multicore systems. To appear in ICPP 2009, September 2009.
-
(2009)
ICPP 2009
-
-
Ouyang, X.1
Gopalakrishnan, K.2
Panda, D.K.3