-
3
-
-
33750936415
-
Availability modeling and analysis on high performance cluster computing systems
-
Washington, DC, USA: IEEE Computer Society
-
H. Song, C. b. Leangsuksun, and R. Nassar, "Availability Modeling and Analysis on High Performance Cluster Computing Systems," in ARES '06: Proceedings of the First International Conference on Availability, Reliability and Security. Washington, DC, USA: IEEE Computer Society, 2006, pp. 305-313.
-
(2006)
ARES '06: Proceedings of the First International Conference on Availability, Reliability and Security
, pp. 305-313
-
-
Song, H.1
Leangsuksun, C.B.2
Nassar, R.3
-
4
-
-
56749178938
-
Exploring event correlation for failure prediction in coalitions of clusters
-
New York, NY, USA: ACM
-
S. Fu and C.-Z. Xu, "Exploring event correlation for failure prediction in coalitions of clusters," in SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1-12.
-
(2007)
SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing
, pp. 1-12
-
-
Fu, S.1
Xu, C.-Z.2
-
5
-
-
78649487857
-
-
"Intelligent Platform Management Interface (IPMI)," http://www.intel.com/design/servers/ipmi/.
-
-
-
-
6
-
-
34548782109
-
A fault tolerance protocol with fast fault recovery
-
S. Chakravorty and L. V. Kale, "A fault tolerance protocol with fast fault recovery," in IPDPS 2003, 2003.
-
(2003)
IPDPS 2003
-
-
Chakravorty, S.1
Kale, L.V.2
-
7
-
-
77952378080
-
Critical event prediction for proactive management in large-scale computer clusters
-
R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam, "Critical event prediction for proactive management in large-scale computer clusters," in KDD '03, 2003, pp. 426-435.
-
(2003)
KDD '03
, pp. 426-435
-
-
Sahoo, R.K.1
Oliner, A.J.2
Rish, I.3
Gupta, M.4
Moreira, J.E.5
Ma, S.6
Vilalta, R.7
Sivasubramaniam, A.8
-
10
-
-
34548768671
-
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
-
C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance," in IPDPS, 2007, pp. 1-10.
-
(2007)
IPDPS
, pp. 1-10
-
-
Wang, C.1
Mueller, F.2
Engelmann, C.3
Scott, S.L.4
-
11
-
-
34548046749
-
Proactive fault tolerance for HPC with Xen virtualization
-
A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott, "Proactive fault tolerance for HPC with Xen virtualization," in ICS '07: Proceedings of the 21st annual international conference on Supercomputing, 2007, pp. 23-32.
-
(2007)
ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing
, pp. 23-32
-
-
Nagarajan, A.B.1
Mueller, F.2
Engelmann, C.3
Scott, S.L.4
-
12
-
-
53349098107
-
High performance virtual machine migration with RDMA over modern interconnects
-
W. Huang, Q. Gao, J. Liu, and D. K. Panda, "High performance virtual machine migration with RDMA over modern interconnects," in CLUSTER '07: Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007.
-
(2007)
CLUSTER '07: Proceedings of the 2007 IEEE International Conference on Cluster Computing
-
-
Huang, W.1
Gao, Q.2
Liu, J.3
Panda, D.K.4
-
13
-
-
47249116207
-
Group-based coordinated checkpointing for MPI: A case study on InfiniBand
-
Washington, DC, USA: IEEE Computer Society
-
Q. Gao, W. Huang, M. J. Koop, and D. K. Panda, "Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand," in ICPP '07: Proceedings of the 2007 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2007, p. 47.
-
(2007)
ICPP '07: Proceedings of the 2007 International Conference on Parallel Processing
, p. 47
-
-
Gao, Q.1
Huang, W.2
Koop, M.J.3
Panda, D.K.4
-
14
-
-
34547424834
-
Application-transparent checkpoint/restart for MPI programs over InfiniBand
-
Washington, DC, USA: IEEE Computer Society
-
Q. Gao, W. Yu, W. Huang, and D. K. Panda, "Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand," in ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2006, pp. 471-478.
-
(2006)
ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing
, pp. 471-478
-
-
Gao, Q.1
Yu, W.2
Huang, W.3
Panda, D.K.4
-
15
-
-
77951447133
-
Accelerating checkpoint operation by node-level write aggregation on multicore systems
-
September
-
X. Ouyang, K. Gopalakrishnan, and D. K. Panda, "Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems," ICPP 2009, September 2009.
-
(2009)
ICPP 2009
-
-
Ouyang, X.1
Gopalakrishnan, K.2
Panda, D.K.3
-
16
-
-
77952145003
-
Fast checkpointing by write aggregation with dynamic buffer and interleaving on multicore architecture
-
December
-
X. Ouyang, K. Gopalakrishnan, T. Gangadharappa, and D. K. Panda, "Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture," HiPC 2009, December 2009.
-
(2009)
HiPC 2009
-
-
Ouyang, X.1
Gopalakrishnan, K.2
Gangadharappa, T.3
Panda, D.K.4
-
17
-
-
12344277946
-
The design and implementation of Berkeley Lab's Linux checkpoint/restart
-
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
-
J. Duell, P. Hargrove, and E. Roman, "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart," Lawrence Berkeley National Laboratory, Berkeley, CA 94720, Tech. Rep. LBNL-54941, 2002. [Online]. Available: https://ftg.lbl.gov/CheckpointRestart/Pubs/LBNL-54941.pdf
-
(2002)
Tech. Rep. LBNL-54941
-
-
Duell, J.1
Hargrove, P.2
Roman, E.3
-
18
-
-
53349109260
-
-
"CIFTS Web Page," http://www.mcs.anl.gov/research/cifts.
-
CIFTS Web Page
-
-
-
19
-
-
77951481809
-
CIFTS: A coordinated infrastructure for fault-tolerant systems
-
R. Gupta, P. Beckman, B. Park, E. Lusk, P. Hargrove, A. Geist, D. Panda, A. Lumsdaine, and J. Dongarra, "CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems," in Int'l Conference on Parallel Processing (ICPP '09), 2009.
-
(2009)
Int'l Conference on Parallel Processing (ICPP '09)
-
-
Gupta, R.1
Beckman, P.2
Park, B.3
Lusk, E.4
Hargrove, P.5
Geist, A.6
Panda, D.7
Lumsdaine, A.8
Dongarra, J.9
-
20
-
-
78649480678
-
-
"Top 500 Supercomputers," http://www.top500.org.
-
-
-
-
21
-
-
74049121711
-
Berkeley Lab checkpoint/restart (BLCR) for Linux clusters
-
P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," in SciDAC, June 2006.
-
(2006)
SciDAC
, vol.6
-
-
Hargrove, P.H.1
Duell, J.C.2
-
22
-
-
58449084165
-
ScELA: Scalable and extensible launching architecture for clusters
-
J. K. Sridhar, M. J. Koop, J. L. Perkins, and D. K. Panda, "ScELA: Scalable and Extensible Launching Architecture for Clusters," in HiPC, 2008, pp. 323-335.
-
(2008)
HiPC
, pp. 323-335
-
-
Sridhar, J.K.1
Koop, M.J.2
Perkins, J.L.3
Panda, D.K.4
-
23
-
-
74049098606
-
PLFS: A checkpoint filesystem for parallel applications
-
J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, "PLFS: a checkpoint filesystem for parallel applications," in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
-
(2009)
SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
-
-
Bent, J.1
Gibson, G.2
Grider, G.3
McClelland, B.4
Nowoczynski, P.5
Nunez, J.6
Polte, M.7
Wingate, M.8
-
24
-
-
85014969248
-
Architectural requirements and scalability of the NAS parallel benchmarks
-
F. C. Wong, R. P. Martin, et al., "Architectural requirements and scalability of the NAS parallel benchmarks," in Supercomputing '99, 1999, p. 41.
-
(1999)
Supercomputing '99
, p. 41
-
-
Wong, F.C.1
Martin, R.P.2
-
25
-
-
78649471209
-
-
"PVFS2," http://www.pvfs.org/.
-
-
-
-
27
-
-
27844542760
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
Winter
-
S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479-493, Winter 2005.
-
(2005)
International Journal of High Performance Computing Applications
, vol.19
, Issue.4
, pp. 479-493
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
Duell, J.5
Hargrove, P.6
Roman, E.7
-
28
-
-
34548789748
-
The design and implementation of checkpoint/restart process fault tolerance for Open MPI
-
March
-
J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine, "The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI," in 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, March 2007.
-
(2007)
12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems
-
-
Hursey, J.1
Squyres, J.2
Mattox, T.3
Lumsdaine, A.4
-
30
-
-
0010538346
-
Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations
-
A. Agbaria and R. Friedman, "Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations," High-Performance Distributed Computing, International Symposium on, vol. 0, p. 31, 1999.
-
(1999)
High-Performance Distributed Computing, International Symposium on
, p. 31
-
-
Agbaria, A.1
Friedman, R.2
-
31
-
-
84900298636
-
CLIP: A checkpointing tool for message-passing parallel programs
-
Y. Chen, J. S. Plank, and K. Li, "CLIP: A Checkpointing Tool for Message-Passing Parallel Programs," in SC97: High Performance Networking and Computing, 1997, pp. 1-11.
-
(1997)
SC97: High Performance Networking and Computing
, pp. 1-11
-
-
Chen, Y.1
Plank, J.S.2
Li, K.3
-
32
-
-
0038194608
-
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
-
G. Bosilca, A. Bouteiller, S. Djilali, G. Fedak, C. Germain, T. Herault, V. Neri, and A. Selikhov, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in Supercomputing, 2002, pp. 1-18.
-
(2002)
Supercomputing
, pp. 1-18
-
-
Bosilca, G.1
Bouteiller, A.2
Djilali, S.3
Fedak, G.4
Germain, C.5
Herault, T.6
Neri, V.7
Selikhov, A.8
-
33
-
-
0016829070
-
System structure for software fault tolerance
-
New York, NY, USA: ACM
-
B. Randell, "System structure for software fault tolerance," in Proceedings of the international conference on Reliable software. New York, NY, USA: ACM, 1975, pp. 437-449.
-
(1975)
Proceedings of the International Conference on Reliable Software
, pp. 437-449
-
-
Randell, B.1
-
34
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, 2002.
-
(2002)
ACM Comput. Surv.
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.M.1
Alvisi, L.2
Wang, Y.-M.3
Johnson, D.B.4
-
35
-
-
34548042452
-
Proactive fault tolerance in MPI applications via task migration
-
S. Chakravorty, C. Mendes, and L. Kale, "Proactive fault tolerance in MPI applications via task migration," in HiPC, 2006.
-
(2006)
HiPC
-
-
Chakravorty, S.1
Mendes, C.2
Kale, L.3