-
1
-
-
60449097203
-
The design of OpenMP tasks
-
E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. In TPDS, 2009.
-
(2009)
TPDS
-
-
Ayguadé, E.1
Copty, N.2
Duran, A.3
Hoeflinger, J.4
Lin, Y.5
Massaioli, F.6
Teruel, X.7
Unnikrishnan, P.8
Zhang, G.9
-
3
-
-
57349180412
-
A compiler framework for optimization of affine loop nests for GPGPUs
-
M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. In ICS, 2008.
-
(2008)
ICS
-
-
Baskaran, M.M.1
Bondhugula, U.2
Krishnamoorthy, S.3
Ramanujam, J.4
Rountev, A.5
Sadayappan, P.6
-
4
-
-
70450059008
-
Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors
-
M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors. In IPDPS, 2009.
-
(2009)
IPDPS
-
-
Boyer, M.1
Tarjan, D.2
Acton, S.T.3
Skadron, K.4
-
6
-
-
70649092154
-
Rodinia: A benchmark suite for heterogeneous computing
-
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009.
-
(2009)
IISWC
-
-
Che, S.1
Boyer, M.2
Meng, J.3
Tarjan, D.4
Sheaffer, J.W.5
Lee, S.H.6
Skadron, K.7
-
7
-
-
84856559490
-
Dynamic detection of uniform and affine vectors in GPGPU computations
-
S. Collange, D. Defour, and Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. In ICPP, 2009.
-
(2009)
ICPP
-
-
Collange, S.1
Defour, D.2
Zhang, Y.3
-
11
-
-
32844460093
-
Automatic thread distribution for nested parallelism in OpenMP
-
A. Duran, M. Gonzàlez, and J. Corbalán. Automatic thread distribution for nested parallelism in OpenMP. In ICS, 2005.
-
(2005)
ICS
-
-
Duran, A.1
Gonzàlez, M.2
Corbalán, J.3
-
12
-
-
60849099135
-
High performance discrete Fourier transforms on graphics processors
-
N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In Proc. Supercomputing, 2008.
-
(2008)
Proc. Supercomputing
-
-
Govindaraju, N.1
Lloyd, B.2
Dotsenko, Y.3
Smith, B.4
Manferdelli, J.5
-
14
-
-
70450231944
-
An analytical model for GPU architecture with memory-level and thread-level parallelism awareness
-
S. Hong and H. Kim. An analytical model for GPU architecture with memory-level and thread-level parallelism awareness. In Proc. International Symposium on Computer Architecture, 2009.
-
(2009)
Proc. International Symposium on Computer Architecture
-
-
Hong, S.1
Kim, H.2
-
16
-
-
84896891672
-
-
http://moss.csc.ncsu.edu/~mueller/cluster/arc/
-
-
-
-
17
-
-
84871147865
-
Exploiting memory access patterns to improve memory performance in data-parallel architectures
-
B. Jang, D. Schaa, P. Mistry, and D. Kaeli. Exploiting memory access patterns to improve memory performance in data-parallel architectures. In IEEE TPDS, 2010.
-
(2010)
IEEE TPDS
-
-
Jang, B.1
Schaa, D.2
Mistry, P.3
Kaeli, D.4
-
18
-
-
84887477265
-
Neither more nor less: Optimizing thread-level parallelism for GPGPUs
-
O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs. In PACT, 2013.
-
(2013)
PACT
-
-
Kayiran, O.1
Jog, A.2
Kandemir, M.T.3
Das, C.R.4
-
19
-
-
79952801699
-
Achieving a single compute device image in OpenCL for multiple GPUs
-
J. Kim, H. Kim, J. Lee, and J. Lee. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs. In PPoPP, 2011.
-
(2011)
PPoPP
-
-
Kim, J.1
Kim, H.2
Lee, J.3
Lee, J.4
-
20
-
-
26444437628
-
Cetus-an extensible compiler infrastructure for source-to-source transformation
-
S. I. Lee, T. Johnson, and R. Eigenmann. Cetus-an extensible compiler infrastructure for source-to-source transformation. In LCPC, 2003.
-
(2003)
LCPC
-
-
Lee, S.I.1
Johnson, T.2
Eigenmann, R.3
-
21
-
-
67650081010
-
OpenMP to GPGPU: A compiler framework for automatic translation and optimization
-
S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. PPoPP, 2009.
-
(2009)
Proc. PPoPP
-
-
Lee, S.1
Min, S.-J.2
Eigenmann, R.3
-
22
-
-
48849109021
-
OpenUH: An optimizing, portable OpenMP compiler
-
C. Liao, O. Hernandez, B. Chapman, W. Chen, and W. Zheng. OpenUH: An Optimizing, Portable OpenMP Compiler. In the 12th Workshop on Compilers for Parallel Computers, Spain, 2006.
-
(2006)
12th Workshop on Compilers for Parallel Computers, Spain
-
-
Liao, C.1
Hernandez, O.2
Chapman, B.3
Chen, W.4
Zheng, W.5
-
23
-
-
70450103746
-
A cross-input adaptive framework for GPU programs optimization
-
Y. Liu, E. Z. Zhang, and X. Shen. A Cross-Input Adaptive Framework for GPU Programs Optimization. In IPDPS, 2009.
-
(2009)
IPDPS
-
-
Liu, Y.1
Zhang, E.Z.2
Shen, X.3
-
24
-
-
84863342255
-
Improving GPU performance via large warps and two-level warp scheduling
-
V. Narasiman, C. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, and Y. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In MICRO, 2011.
-
(2011)
MICRO
-
-
Narasiman, V.1
Lee, C.2
Shebanow, M.3
Miftakhutdinov, R.4
Mutlu, O.5
Patt, Y.6
-
29
-
-
84876909157
-
SIMD parallelization of applications that traverse irregular data structures
-
B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. SIMD Parallelization of Applications that Traverse Irregular Data Structures. In CGO, 2013.
-
(2013)
CGO
-
-
Ren, B.1
Agrawal, G.2
Larus, J.R.3
Mytkowicz, T.4
Poutanen, T.5
Schulte, W.6
-
31
-
-
43449094719
-
Optimization space pruning for a multi-threaded GPU
-
S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and W. W. Hwu. Optimization space pruning for a multi-threaded GPU. In CGO, 2008.
-
(2008)
CGO
-
-
Ryoo, S.1
Rodrigues, C.I.2
Stone, S.S.3
Baghsorkhi, S.S.4
Ueng, S.5
Stratton, J.A.6
Hwu, W.W.7
-
32
-
-
79959466764
-
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
-
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP, 2008.
-
(2008)
PPoPP
-
-
Ryoo, S.1
Rodrigues, C.I.2
Baghsorkhi, S.S.3
Stone, S.S.4
Kirk, D.B.5
Hwu, W.W.6
-
34
-
-
84863347222
-
A performance analysis framework for identifying performance benefits in GPGPU applications
-
J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A Performance Analysis Framework for Identifying Performance Benefits in GPGPU Applications. In PPoPP, 2012.
-
(2012)
PPoPP
-
-
Sim, J.1
Dasgupta, A.2
Kim, H.3
Vuduc, R.4
-
35
-
-
84896837430
-
Dynamic thread creation for improving processor utilization on SIMT streaming processor architectures
-
M. Steffen and J. Zambreno. Dynamic Thread Creation for Improving Processor Utilization on SIMT Streaming Processor Architectures. In MICRO, 2010.
-
(2010)
MICRO
-
-
Steffen, M.1
Zambreno, J.2
-
36
-
-
23044523992
-
Performance evaluation of OpenMP applications with nested parallelism
-
Y. Tanaka, K. Taura, M. Sato, and A. Yonezawa. Performance evaluation of OpenMP applications with nested parallelism. In Languages, Compilers, and Run-Time Systems for Scalable Computers, 2000.
-
(2000)
Languages, Compilers, and Run-Time Systems for Scalable Computers
-
-
Tanaka, Y.1
Taura, K.2
Sato, M.3
Yonezawa, A.4
-
37
-
-
27844486561
-
A compiler for exploiting nested parallelism in OpenMP programs
-
X. Tian, J. P. Hoeflinger, G. Haab, Y. K. Chen, M. Girkar, and S. Shah. A compiler for exploiting nested parallelism in OpenMP programs. Parallel Computing, 2005.
-
(2005)
Parallel Computing
-
-
Tian, X.1
Hoeflinger, J.P.2
Haab, G.3
Chen, Y.K.4
Girkar, M.5
Shah, S.6
-
39
-
-
67349149521
-
Benchmarking GPUs to tune dense linear algebra
-
V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. Supercomputing, 2008.
-
(2008)
Proc. Supercomputing
-
-
Volkov, V.1
-
40
-
-
84875195366
-
Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced GPU memory accesses
-
B. Wu, Z. Zhao, E. Zhang, Y. Jiang, and X. Shen. Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced GPU Memory Accesses. In PPoPP, 2013.
-
(2013)
PPoPP
-
-
Wu, B.1
Zhao, Z.2
Zhang, E.3
Jiang, Y.4
Shen, X.5
-
41
-
-
77954691442
-
A GPGPU compiler for memory optimization and parallelism management
-
Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU Compiler for Memory Optimization and Parallelism Management. In PLDI, 2010.
-
(2010)
PLDI
-
-
Yang, Y.1
Xiang, P.2
Kong, J.3
Zhou, H.4
-
42
-
-
84867509598
-
Shared memory multiplexing: A novel way to improve GPGPU throughput
-
Y. Yang, P. Xiang, M. Mantor, N. Rubin, and H. Zhou. Shared Memory Multiplexing: A Novel Way to Improve GPGPU Throughput. In PACT, 2012.
-
(2012)
PACT
-
-
Yang, Y.1
Xiang, P.2
Mantor, M.3
Rubin, N.4
Zhou, H.5
-
43
-
-
77749337487
-
Fast tridiagonal solvers on the GPU
-
Y. Zhang, J. Cohen, and J. D. Owens. Fast Tridiagonal Solvers on the GPU. In PPoPP, 2010.
-
(2010)
PPoPP
-
-
Zhang, Y.1
Cohen, J.2
Owens, J.D.3
-
44
-
-
79953126288
-
On-the-fly elimination of dynamic irregularities for GPU computing
-
E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen. On-the-fly elimination of dynamic irregularities for GPU computing. In ASPLOS, 2011.
-
(2011)
ASPLOS
-
-
Zhang, E.Z.1
Jiang, Y.2
Guo, Z.3
Tian, K.4
Shen, X.5