-
1
-
-
84859145371
-
-
CUDA PTX ISA, http://www.nvidia.com/content/CUDAptxisa1.4.pdf
-
CUDA PTX ISA
-
-
-
5
-
-
34547309668
-
-
Version 3.0. NVIDIA
-
CUDA Programming Guide, Version 3.0. NVIDIA (2010)
-
(2010)
CUDA Programming Guide
-
-
-
6
-
-
77950611743
-
HPCToolkit: Tools for performance analysis of optimized parallel programs
-
Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: HPCToolkit: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22(6), 685-701 (2010)
-
(2010)
Concurrency and Computation: Practice and Experience
, vol.22
, Issue.6
, pp. 685-701
-
-
Adhianto, L.1
Banerjee, S.2
Fagan, M.3
Krentel, M.4
Marin, G.5
Mellor-Crummey, J.6
Tallent, N.R.7
-
7
-
-
84925509670
-
Optimizing Compilers for Modern Architectures
-
Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures. Morgan Kaufmann (2002)
-
(2002)
Morgan Kaufmann
-
-
Allen, R.1
Kennedy, K.2
-
8
-
-
77951572335
-
Automatic C-to-CUDA Code Generation for Affine Programs
-
Gupta, R. (ed.) CC 2010. Springer, Heidelberg
-
Baskaran, M.M., Ramanujam, J., Sadayappan, P.: Automatic C-to-CUDA Code Generation for Affine Programs. In: Gupta, R. (ed.) CC 2010. LNCS, vol. 6011, pp. 244-263. Springer, Heidelberg (2010)
-
(2010)
LNCS
, vol.6011
, pp. 244-263
-
-
Baskaran, M.M.1
Ramanujam, J.2
Sadayappan, P.3
-
10
-
-
84877042382
-
A scalable cross-platform infrastructure for application performance tuning using hardware counters
-
Browne, S., Dongarra, J., Garner, N., London, K., Mucci, P.: A scalable cross-platform infrastructure for application performance tuning using hardware counters. In: ACM/IEEE 2000 Conference, Supercomputing (November 2000)
-
ACM/IEEE 2000 Conference, Supercomputing (November 2000)
-
-
Browne, S.1
Dongarra, J.2
Garner, N.3
London, K.4
Mucci, P.5
-
11
-
-
0028549474
-
Improving the ratio of memory operations to floating-point operations in loops
-
Carr, S., Kennedy, K.: Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems 16(6), 1768-1810 (1994)
-
(1994)
ACM Transactions on Programming Languages and Systems
, vol.16
, Issue.6
, pp. 1768-1810
-
-
Carr, S.1
Kennedy, K.2
-
12
-
-
77749340082
-
Model-driven autotuning of sparse matrix-vector multiply on GPUs
-
ACM, New York
-
Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on GPUs. In: PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 115-126. ACM, New York (2010)
-
(2010)
PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, pp. 115-126
-
-
Choi, J.W.1
Singh, A.2
Vuduc, R.W.3
-
13
-
-
0023565191
-
What's in a name? -or- The value of renaming for parallelism detection and storage allocation
-
Cytron, R., Ferrante, J.: What's in a name? -or- the value of renaming for parallelism detection and storage allocation. In: ICPP 1987, pp. 19-27 (1987)
-
(1987)
ICPP 1987
, pp. 19-27
-
-
Cytron, R.1
Ferrante, J.2
-
14
-
-
70350771127
-
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
-
IEEE Press, Piscataway
-
Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1-12. IEEE Press, Piscataway (2008)
-
(2008)
SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing
, pp. 1-12
-
-
Datta, K.1
Murphy, M.2
Volkov, V.3
Williams, S.4
Carter, J.5
Oliker, L.6
Patterson, D.7
Shalf, J.8
Yelick, K.9
-
15
-
-
77954013048
-
Optimal loop unrolling for GPGPU programs
-
Murthy, G., Ravishankar, M., Baskaran, M.M., Sadayappan, P.: Optimal loop unrolling for GPGPU programs. In: IEEE International Symposium on Parallel & Distributed Processing (2010)
-
IEEE International Symposium on Parallel & Distributed Processing (2010)
-
-
Murthy, G.1
Ravishankar, M.2
Baskaran, M.M.3
Sadayappan, P.4
-
16
-
-
34548292052
-
A memory model for scientific algorithms on graphics processors
-
ACM, New York
-
Govindaraju, N.K., Larsen, S., Gray, J., Manocha, D.: A memory model for scientific algorithms on graphics processors. In: SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 89. ACM, New York (2006)
-
(2006)
SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing
, p. 89
-
-
Govindaraju, N.K.1
Larsen, S.2
Gray, J.3
Manocha, D.4
-
17
-
-
70350754502
-
High performance discrete Fourier transforms on graphics processors
-
IEEE Press, Piscataway
-
Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High performance discrete Fourier transforms on graphics processors. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1-12. IEEE Press, Piscataway (2008)
-
(2008)
SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing
, pp. 1-12
-
-
Govindaraju, N.K.1
Lloyd, B.2
Dotsenko, Y.3
Smith, B.4
Manferdelli, J.5
-
18
-
-
79952608669
-
Optimizing and Auto-tuning Belief Propagation on the GPU
-
Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds.) LCPC 2010. Springer, Heidelberg
-
Grauer-Gray, S., Cavazos, J.: Optimizing and Auto-tuning Belief Propagation on the GPU. In: Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds.) LCPC 2010. LNCS, vol. 6548, pp. 121-135. Springer, Heidelberg (2011)
-
(2011)
LNCS
, vol.6548
, pp. 121-135
-
-
Grauer-Gray, S.1
Cavazos, J.2
-
20
-
-
79952583455
-
Accelerating GPU Kernels for Dense Linear Algebra
-
Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. Springer, Heidelberg
-
Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU Kernels for Dense Linear Algebra. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 83-92. Springer, Heidelberg (2011)
-
(2011)
LNCS
, vol.6449
, pp. 83-92
-
-
Nath, R.1
Tomov, S.2
Dongarra, J.3
-
21
-
-
74049114159
-
Auto-tuning 3-d FFT library for CUDA GPUs
-
ACM, New York
-
Nukada, A., Matsuoka, S.: Auto-tuning 3-d FFT library for CUDA GPUs. In: SC 2009: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-10. ACM, New York (2009)
-
(2009)
SC 2009: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
, pp. 1-10
-
-
Nukada, A.1
Matsuoka, S.2
-
22
-
-
78650814738
-
Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures
-
Rahimian, A., Lashuk, I., Veerapaneni, S., Chandramowlishwaran, A., Malhotra, D., Moon, L., Sampath, R., Shringarpure, A., Vetter, J., Vuduc, R., Zorin, D., Biros, G.: Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010)
-
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010)
-
-
Rahimian, A.1
Lashuk, I.2
Veerapaneni, S.3
Chandramowlishwaran, A.4
Malhotra, D.5
Moon, L.6
Sampath, R.7
Shringarpure, A.8
Vetter, J.9
Vuduc, R.10
Zorin, D.11
Biros, G.12
-
23
-
-
79959466764
-
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
-
Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2008)
-
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2008)
-
-
Ryoo, S.1
Rodrigues, C.I.2
Baghsorkhi, S.S.3
Stone, S.S.4
Kirk, D.B.5
Hwu, W.M.W.6
-
25
-
-
60949098907
-
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
-
Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35(3), 178-194 (2009)
-
(2009)
Parallel Comput
, vol.35
, Issue.3
, pp. 178-194
-
-
Williams, S.1
Oliker, L.2
Vuduc, R.3
Shalf, J.4
Yelick, K.5
Demmel, J.6
-
26
-
-
58449092097
-
Exploring the Optimization Space of Dense Linear Algebra Kernels
-
Amaral, J.N. (ed.) LCPC 2008. Springer, Heidelberg
-
Yi, Q., Qasem, A.: Exploring the Optimization Space of Dense Linear Algebra Kernels. In: Amaral, J.N. (ed.) LCPC 2008. LNCS, vol. 5335, pp. 343-355. Springer, Heidelberg (2008)
-
(2008)
LNCS
, vol.5335
, pp. 343-355
-
-
Yi, Q.1
Qasem, A.2
-
28
-
-
77955184165
-
Accelerating iterative field-compensated MR image reconstruction on GPUs
-
Zhuo, Y., Wu, X.L., Haldar, J.P., Hwu, W.M., Liang, Z.P., Sutton, B.P.: Accelerating iterative field-compensated MR image reconstruction on GPUs. In: Proceedings of the 2010 IEEE International Conference on Biomedical Imaging: From Nano to Macro, ISBI 2010 (2010)
-
(2010)
Proceedings of the 2010 IEEE International Conference on Biomedical Imaging: From Nano to Macro, ISBI 2010
-
-
Zhuo, Y.1
Wu, X.L.2
Haldar, J.P.3
Hwu, W.M.4
Liang, Z.P.5
Sutton, B.P.6