-
1
-
-
84881042190
-
Post-failure recovery of MPI communication capability: Design and rationale
-
W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Post-failure recovery of MPI communication capability: Design and rationale. International Journal of High Performance Computing Applications, 27(3):244-254, 2013.
-
(2013)
International Journal of High Performance Computing Applications
, vol.27
, Issue.3
, pp. 244-254
-
-
Bland, W.1
Bouteiller, A.2
Herault, T.3
Bosilca, G.4
Dongarra, J.5
-
2
-
-
61449223447
-
Algorithm-based fault tolerance applied to high performance computing
-
G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. Journal of Parallel and Distributed Computing, 69(4):410-416, 2009.
-
(2009)
Journal of Parallel and Distributed Computing
, vol.69
, Issue.4
, pp. 410-416
-
-
Bosilca, G.1
Delmas, R.2
Dongarra, J.3
Langou, J.4
-
3
-
-
60449096682
-
MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
-
Nov
-
A. Bouteiller, F. Cappello, T. Herault, K. Krawezik, P. Lemarinier, and M. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In ACM/IEEE Supercomputing Conference, Nov. 2003.
-
(2003)
ACM/IEEE Supercomputing Conference
-
-
Bouteiller, A.1
Cappello, F.2
Herault, T.3
Krawezik, K.4
Lemarinier, P.5
Magniette, M.6
-
4
-
-
84908663379
-
On the use of remote GPUs and low-power processors for the acceleration of scientific applications
-
A. Castello, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Orti, V. Roca, and F. Silla. On the use of remote GPUs and low-power processors for the acceleration of scientific applications. In International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY), 2014.
-
(2014)
International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY)
-
-
Castello, A.1
Duato, J.2
Mayo, R.3
Peña, A.J.4
Quintana-Orti, E.S.5
Roca, V.6
Silla, F.7
-
5
-
-
84898026203
-
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
-
J. Chung, I. Lee, M. Sullivan, J. H. Ryoo, D. W. Kim, D. H. Yoon, L. Kaplan, and M. Erez. Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems. Scientific Programming, 21(3-4):197-212, 2013.
-
(2013)
Scientific Programming
, vol.21
, Issue.3-4
, pp. 197-212
-
-
Chung, J.1
Lee, I.2
Sullivan, M.3
Ryoo, J.H.4
Kim, D.W.5
Yoon, D.H.6
Kaplan, L.7
Erez, M.8
-
6
-
-
84903827331
-
GPGPUs: How to combine high computational power with high reliability
-
L. B. Gomez, F. Cappello, L. Carro, N. DeBardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, P. Rech, and M. S. Reorda. GPGPUs: how to combine high computational power with high reliability. In Design, Automation & Test in Europe (DATE), 2014.
-
(2014)
Design, Automation & Test in Europe (DATE)
-
-
Gomez, L.B.1
Cappello, F.2
Carro, L.3
DeBardeleben, N.4
Fang, B.5
Gurumurthi, S.6
Pattabiraman, K.7
Rech, P.8
Reorda, M.S.9
-
7
-
-
33749067567
-
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters
-
P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 46(1):494, 2006.
-
(2006)
Journal of Physics: Conference Series
, vol.46
, Issue.1
, pp. 494
-
-
Hargrove, P.H.1
Duell, J.C.2
-
8
-
-
84965049420
-
-
Sandia National Laboratories
-
M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich. Improving performance via mini-applications. Technical report, Sandia National Laboratories, 2009.
-
(2009)
Improving Performance Via Mini-applications. Technical Report
-
-
Heroux, M.A.1
Doerfler, D.W.2
Crozier, P.S.3
Willenbring, J.M.4
Edwards, H.C.5
Williams, A.6
Rajan, M.7
Keiter, E.R.8
Thornquist, H.K.9
Numrich, R.W.10
-
11
-
-
84864054886
-
SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters
-
ACM
-
J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee. SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In 26th International Conference on Supercomputing, pages 341-352. ACM, 2012.
-
(2012)
26th International Conference on Supercomputing
, pp. 341-352
-
-
Kim, J.1
Seo, S.2
Lee, J.3
Nah, J.4
Jo, G.5
Lee, J.6
-
12
-
-
84893324240
-
pVOCL: Power-aware dynamic placement and migration in virtualized GPU environments
-
IEEE
-
P. Lama, Y. Li, A. M. Aji, P. Balaji, J. Dinan, S. Xiao, Y. Zhang, W. Feng, R. Thakur, and X. Zhou. pVOCL: Power-aware dynamic placement and migration in virtualized GPU environments. In 33rd International Conference on Distributed Computing Systems (ICDCS), pages 145-154. IEEE, 2013.
-
(2013)
33rd International Conference on Distributed Computing Systems (ICDCS)
, pp. 145-154
-
-
Lama, P.1
Li, Y.2
Aji, A.M.3
Balaji, P.4
Dinan, J.5
Xiao, S.6
Zhang, Y.7
Feng, W.8
Thakur, R.9
Zhou, X.10
-
13
-
-
84901493407
-
-
Technical report PRACE, Dec
-
P. Lavallee, G. C. de Verdiere, P. Wautelet, D. Lecas, and J. Dupays. Porting and optimizing HYDRO to new platforms and programming paradigms-lessons learnt. Technical report, PRACE, Dec. 2012.
-
(2012)
Porting and Optimizing HYDRO to New Platforms and Programming Paradigms-lessons Learnt
-
-
Lavallee, P.1
De Verdiere, G.C.2
Wautelet, P.3
Lecas, D.4
Dupays, J.5
-
15
-
-
58049086636
-
Parallel double error correcting code design to mitigate multi-bit upsets in SRAMs
-
Sept
-
R. Naseer and J. Draper. Parallel double error correcting code design to mitigate multi-bit upsets in SRAMs. In 34th European Solid-State Circuits Conference (ESSCIRC), pages 222-225, Sept. 2008.
-
(2008)
34th European Solid-State Circuits Conference (ESSCIRC)
, pp. 222-225
-
-
Naseer, R.1
Draper, J.2
-
16
-
-
84966611065
-
-
NVIDIA Corporation
-
NVIDIA Corporation. NVIDIA Management Library (NVML). http://developer.nvidia.com/nvidia-management-library-nvml, 2015.
-
(2015)
NVIDIA Management Library (NVML)
-
-
-
19
-
-
84908669300
-
A complete and efficient CUDA-sharing solution for HPC clusters
-
A. J. Peña, C. Reaño, F. Silla, R. Mayo, E. S. Quintana-Orti, and J. Duato. A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing, 40(10):574-588, 2014.
-
(2014)
Parallel Computing
, vol.40
, Issue.10
, pp. 574-588
-
-
Peña, A.J.1
Reaño, C.2
Silla, F.3
Mayo, R.4
Quintana-Orti, E.S.5
Duato, J.6
-
20
-
-
0033349322
-
Soft-error detection through software fault-tolerance techniques
-
Nov
-
M. Rebaudengo, M. Reorda, M. Torchiano, and M. Violante. Soft-error detection through software fault-tolerance techniques. In International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), pages 210-218, Nov 1999.
-
(1999)
International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT)
, pp. 210-218
-
-
Rebaudengo, M.1
Reorda, M.2
Torchiano, M.3
Violante, M.4
-
21
-
-
84904409465
-
Snapify: Capturing snapshots of offload applications on Xeon Phi manycore processors
-
ACM
-
A. Rezaei, G. Coviello, C. Li, S. Chakradhar, and F. Mueller. Snapify: capturing snapshots of offload applications on Xeon Phi manycore processors. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2014.
-
(2014)
International Symposium on High-Performance Parallel and Distributed Computing
-
-
Rezaei, A.1
Coviello, G.2
Li, C.3
Chakradhar, S.4
Mueller, F.5
-
22
-
-
27844542760
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479-493, 2005.
-
(2005)
International Journal of High Performance Computing Applications
, vol.19
, Issue.4
, pp. 479-493
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Sahay, V.4
Lumsdaine, A.5
Duell, J.6
Hargrove, P.7
Roman, E.8
-
24
-
-
80053270870
-
CheCL: Transparent checkpointing and process migration of OpenCL applications
-
IEEE
-
H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi. CheCL: Transparent checkpointing and process migration of OpenCL applications. In International Parallel & Distributed Processing Symposium (IPDPS), pages 864-876. IEEE, 2011.
-
(2011)
International Parallel & Distributed Processing Symposium (IPDPS)
, pp. 864-876
-
-
Takizawa, H.1
Koyama, K.2
Sato, K.3
Komatsu, K.4
Kobayashi, H.5
-
25
-
-
77950975351
-
CheCUDA: A checkpoint/restart tool for CUDA applications
-
IEEE
-
H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi. CheCUDA: A checkpoint/restart tool for CUDA applications. In International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 408-413. IEEE, 2009.
-
(2009)
International Conference on Parallel and Distributed Computing, Applications and Technologies
, pp. 408-413
-
-
Takizawa, H.1
Sato, K.2
Komatsu, K.3
Kobayashi, H.4
-
26
-
-
84966609695
-
-
TSUBAME Computing Services
-
TSUBAME Computing Services. Failure history of TSUBAME2.5. http://mon.g.gsic.titech.ac.jp/trouble-list, 2015.
-
(2015)
Failure History of TSUBAME2.5
-
-
-
27
-
-
84966468814
-
-
Virginia Tech
-
Virginia Tech. HokieSpeed (Seneca CPU-GPU). http://www.arc.vt.edu/resources/hpc/hokiespeed.php, 2015.
-
(2015)
HokieSpeed (Seneca CPU-GPU)
-
-
-
28
-
-
84905504258
-
Real-world design and evaluation of compiler-managed GPU redundant multithreading
-
J. Wadden, A. Lyashevsky, S. Gurumurthi, V. Sridharan, and K. Skadron. Real-world design and evaluation of compiler-managed GPU redundant multithreading. In International Symposium on Computer Architecture (ISCA), pages 73-84, 2014.
-
(2014)
International Symposium on Computer Architecture (ISCA)
, pp. 73-84
-
-
Wadden, J.1
Lyashevsky, A.2
Gurumurthi, S.3
Sridharan, V.4
Skadron, K.5
-
29
-
-
77954995377
-
Reducing cache power with low-cost, multi-bit error-correcting codes
-
C. Wilkerson, A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S. Lu. Reducing cache power with low-cost, multi-bit error-correcting codes. SIGARCH Comput. Archit. News, 38(3):83-93, 2010.
-
(2010)
SIGARCH Comput. Archit. News
, vol.38
, Issue.3
, pp. 83-93
-
-
Wilkerson, C.1
Alameldeen, A.R.2
Chishti, Z.3
Wu, W.4
Somasekhar, D.5
Lu, S.6
-
30
-
-
84870656041
-
VOCL: An optimized environment for transparent virtualization of graphics processing units
-
IEEE
-
S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong, and W. Feng. VOCL: An optimized environment for transparent virtualization of graphics processing units. In Innovative Parallel Computing (InPar). IEEE, 2012.
-
(2012)
Innovative Parallel Computing (InPar)
-
-
Xiao, S.1
Balaji, P.2
Zhu, Q.3
Thakur, R.4
Coghlan, S.5
Lin, H.6
Wen, G.7
Hong, J.8
Feng, W.9
-
31
-
-
80053254113
-
Hauberk: Lightweight silent data corruption error detector for GPGPU
-
K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk, and R. Iyer. Hauberk: Lightweight silent data corruption error detector for GPGPU. In International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 287-300, 2011.
-
(2011)
International Conference on Parallel and Distributed Computing, Applications and Technologies
, pp. 287-300
-
-
Yim, K.S.1
Pham, C.2
Saleheen, M.3
Kalbarczyk, Z.4
Iyer, R.5