-
2
-
-
70349169075
-
Analyzing CUDA workloads using a detailed GPU simulator
-
Apr
-
A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, IEEE International Symposium on, pages 163-174, Apr. 2009.
-
(2009)
Performance Analysis of Systems and Software, IEEE International Symposium on
, pp. 163-174
-
-
Bakhoda, A.1
Yuan, G.L.2
Fung, W.W.L.3
Wong, H.4
Aamodt, T.M.5
-
3
-
-
57349139452
-
A practical automatic polyhedral parallelizer and locality optimizer
-
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 101-113, 2008.
-
(2008)
Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation
, pp. 101-113
-
-
Bondhugula, U.1
Hartono, A.2
Ramanujam, J.3
Sadayappan, P.4
-
4
-
-
84877702106
-
A scalable, numerically stable, high-performance tridiagonal solver using GPUs
-
L. Chang, J. A. Stratton, H. Kim, and W. W. Hwu. A Scalable, Numerically Stable, High-performance Tridiagonal Solver Using GPUs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 27:1-27:11, 2012.
-
(2012)
Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
, pp. 27:1-27:11
-
-
Chang, L.1
Stratton, J.A.2
Kim, H.3
Hwu, W.W.4
-
5
-
-
70649092154
-
Rodinia: A benchmark suite for heterogeneous computing
-
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, IEEE International Symposium on, pages 44-54, 2009.
-
(2009)
Workload Characterization, IEEE International Symposium on
, pp. 44-54
-
-
Che, S.1
Boyer, M.2
Meng, J.3
Tarjan, D.4
Sheaffer, J.W.5
Lee, S.6
Skadron, K.7
-
6
-
-
78049512154
-
Barra: A parallel functional simulator for GPGPU
-
Aug.
-
S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: A Parallel Functional Simulator for GPGPU. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems, IEEE International Symposium on, pages 351-360, Aug. 2010.
-
(2010)
Modeling, Analysis & Simulation of Computer and Telecommunication Systems, IEEE International Symposium on
, pp. 351-360
-
-
Collange, S.1
Daumas, M.2
Defour, D.3
Parello, D.4
-
8
-
-
84856530584
-
Divergence analysis and optimizations
-
Oct.
-
B. Coutinho, D. Sampaio, F. M. Q. Pereira, and W. Meira. Divergence Analysis and Optimizations. In Parallel Architectures and Compilation Techniques, 2011 International Conference on, pages 320-329, Oct. 2011.
-
(2011)
Parallel Architectures and Compilation Techniques, 2011 International Conference on
, pp. 320-329
-
-
Coutinho, B.1
Sampaio, D.2
Pereira, F.M.Q.3
Meira, W.4
-
9
-
-
0002806690
-
OpenMP: An industry standard API for shared-memory programming
-
L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46-55, 1998.
-
(1998)
Computational Science & Engineering, IEEE
, vol.5
, Issue.1
, pp. 46-55
-
-
Dagum, L.1
Menon, R.2
-
10
-
-
78149233155
-
Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
-
G. F. Diamos, N. Clark, A. R. Kerr, and S. Yalamanchili. Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 353-364, 2010.
-
(2010)
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques
, pp. 353-364
-
-
Diamos, G.F.1
Clark, N.2
Kerr, A.R.3
Yalamanchili, S.4
-
12
-
-
49949106993
-
Perfmon2: A flexible performance monitoring interface for Linux
-
Citeseer
-
S. Eranian. Perfmon2: a flexible performance monitoring interface for Linux. In Proc. of the 2006 Ottawa Linux Symposium, pages 269-288. Citeseer, 2006.
-
(2006)
Proc. of the 2006 Ottawa Linux Symposium
, pp. 269-288
-
-
Eranian, S.1
-
13
-
-
78149276036
-
Twin Peaks: A software platform for heterogeneous computing on general-purpose and graphics processors
-
J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster, and B. Zheng. Twin Peaks: A Software Platform for Heterogeneous Computing on General-purpose and Graphics Processors. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 205-216, 2010.
-
(2010)
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques
, pp. 205-216
-
-
Gummaraju, J.1
Morichetti, L.2
Houston, M.3
Sander, B.4
Gaster, B.R.5
Zheng, B.6
-
14
-
-
33645444470
-
Interprocedural parallelization analysis in SUIF
-
July
-
M. W. Hall, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, and M. S. Lam. Interprocedural parallelization analysis in SUIF. ACM Trans. Program. Lang. Syst., 27(4):662-731, July 2005.
-
(2005)
ACM Trans. Program. Lang. Syst.
, vol.27
, Issue.4
, pp. 662-731
-
-
Hall, M.W.1
Amarasinghe, S.P.2
Murphy, B.R.3
Liao, S.-W.4
Lam, M.S.5
-
15
-
-
84961295225
-
-
P. Jaaskelainen, C. S. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg. pocl: A performance-portable OpenCL implementation, 2014.
-
(2014)
Pocl: A Performance-portable OpenCL Implementation
-
-
Jaaskelainen, P.1
De La Lama, C.S.2
Schnetter, E.3
Raiskila, K.4
Takala, J.5
Berg, H.6
-
16
-
-
84899719703
-
OpenCL framework for ARM processors with NEON support
-
G. Jo, W. Jeon, W. Jung, G. Taft, and J. Lee. OpenCL Framework for ARM Processors with NEON Support. In Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, pages 33-40, 2014.
-
(2014)
Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing
, pp. 33-40
-
-
Jo, G.1
Jeon, W.2
Jung, W.3
Taft, G.4
Lee, J.5
-
21
-
-
84961318208
-
-
Khronos OpenCL Working Group and others. The OpenCL Specification. A. Munshi, Ed, 2008
-
Khronos OpenCL Working Group and others. The OpenCL Specification. A. Munshi, Ed, 2008.
-
-
-
-
23
-
-
84864054886
-
SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters
-
J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee. SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters. In Proceedings of the 26th ACM International Conference on Supercomputing, pages 341-352, 2012.
-
(2012)
Proceedings of the 26th ACM International Conference on Supercomputing
, pp. 341-352
-
-
Kim, J.1
Seo, S.2
Lee, J.3
Nah, J.4
Jo, G.5
Lee, J.6
-
24
-
-
84883089997
-
When polyhedral transformations meet SIMD code generation
-
June
-
M. Kong, R. Veras, K. Stock, F. Franchetti, L. Pouchet, and P. Sadayappan. When Polyhedral Transformations Meet SIMD Code Generation. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, number 6, pages 127-138, June 2013.
-
(2013)
Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
, Issue.6
, pp. 127-138
-
-
Kong, M.1
Veras, R.2
Stock, K.3
Franchetti, F.4
Pouchet, L.5
Sadayappan, P.6
-
25
-
-
78149255519
-
An OpenCL framework for heterogeneous multicores with local memory
-
J. Lee, J. Kim, S. Seo, S. Kim, J. Park, H. Kim, T. Dao, Y. Cho, S. Seo, S. Lee, S. Cho, H. Song, S. Suh, and J. Choi. An OpenCL Framework for Heterogeneous Multicores with Local Memory. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 193-204, 2010.
-
(2010)
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques
, pp. 193-204
-
-
Lee, J.1
Kim, J.2
Seo, S.3
Kim, S.4
Park, J.5
Kim, H.6
Dao, T.7
Cho, Y.8
Seo, S.9
Lee, S.10
Cho, S.11
Song, H.12
Suh, S.13
Choi, J.14
-
26
-
-
84899746576
-
OpenCL performance evaluation on modern multi core CPUs
-
May
-
J. Lee, K. Patel, N. Nigania, H. Kim, and H. Kim. OpenCL Performance Evaluation on Modern Multi Core CPUs. In Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2013 IEEE 27th International, pages 1177-1185, May 2013.
-
(2013)
Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2013 IEEE 27th International
, pp. 1177-1185
-
-
Lee, J.1
Patel, K.2
Nigania, N.3
Kim, H.4
Kim, H.5
-
27
-
-
84876943307
-
Convergence and scalarization for data-parallel architectures
-
Feb
-
Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanovic. Convergence and scalarization for data-parallel architectures. In Code Generation and Optimization, 2013 IEEE/ACM International Symposium on, pages 1-11, Feb 2013.
-
(2013)
Code Generation and Optimization, 2013 IEEE/ACM International Symposium on
, pp. 1-11
-
-
Lee, Y.1
Krashinsky, R.2
Grover, V.3
Keckler, S.4
Asanovic, K.5
-
28
-
-
84899692998
-
A large-scale cross-architecture evaluation of thread-coarsening
-
A. Magni, C. Dubach, and M. F. P. O'Boyle. A Large-scale Cross-architecture Evaluation of Thread-coarsening. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 11:1-11:11, 2013.
-
(2013)
Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis
, pp. 11:1-11:11
-
-
Magni, A.1
Dubach, C.2
O'Boyle, M.F.P.3
-
29
-
-
0029202471
-
A comparison of full and partial predicated execution support for ILP processors
-
June
-
S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu. A comparison of full and partial predicated execution support for ILP processors. In Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, pages 138-149, June 1995.
-
(1995)
Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on
, pp. 138-149
-
-
Mahlke, S.A.1
Hank, R.E.2
McCormick, J.E.3
August, D.I.4
Hwu, W.W.5
-
30
-
-
0030190854
-
Improving data locality with loop transformations
-
July
-
K. S. McKinley, S. Carr, and C. Tseng. Improving Data Locality with Loop Transformations. ACM Trans. Program. Lang. Syst., 18(4):424-453, July 1996.
-
(1996)
ACM Trans. Program. Lang. Syst.
, vol.18
, Issue.4
, pp. 424-453
-
-
McKinley, K.S.1
Carr, S.2
Tseng, C.3
-
31
-
-
65649105504
-
Intel threading building blocks
-
Apr.
-
C. Pheatt. Intel Threading Building Blocks. J. Comput. Sci. Coll., 23(4):298, Apr. 2008.
-
(2008)
J. Comput. Sci. Coll.
, vol.23
, Issue.4
, pp. 298
-
-
Pheatt, C.1
-
32
-
-
10444289646
-
Code generation in the polyhedral model is easier than you think
-
C. Bastoul. Code Generation in the Polyhedral Model Is Easier Than You Think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 7-16, 2004.
-
(2004)
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
, pp. 7-16
-
-
Bastoul, C.1
-
34
-
-
84887446142
-
Automatic OpenCL workgroup size selection for multicore CPUs
-
Sept
-
S. Seo, J. Lee, G. Jo, and J. Lee. Automatic OpenCL workgroup size selection for multicore CPUs. In Parallel Architectures and Compilation Techniques, 2013 22nd International Conference on, pages 387-397, Sept 2013.
-
(2013)
Parallel Architectures and Compilation Techniques, 2013 22nd International Conference on
, pp. 387-397
-
-
Seo, S.1
Lee, J.2
Jo, G.3
Lee, J.4
-
35
-
-
84877647998
-
Performance traps in OpenCL for CPUs
-
Feb
-
J. Shen, J. Fang, H. Sips, and A. L. Varbanescu. Performance Traps in OpenCL for CPUs. In Parallel, Distributed and Network-Based Processing, 2013 21st Euromicro International Conference on, pages 38-45, Feb. 2013.
-
(2013)
Parallel, Distributed and Network-Based Processing, 2013 21st Euromicro International Conference on
, pp. 38-45
-
-
Shen, J.1
Fang, J.2
Sips, H.3
Varbanescu, A.L.4
-
36
-
-
58449109179
-
MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs
-
J. A. Stratton, S. S. Stone, and W. W. Hwu. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In J. N. Amaral, editor, Languages and Compilers for Parallel Computing, pages 16-30. 2008.
-
(2008)
J. N. Amaral, Editor, Languages and Compilers for Parallel Computing
, pp. 16-30
-
-
Stratton, J.A.1
Stone, S.S.2
Hwu, W.W.3
-
37
-
-
77953978573
-
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
-
J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W. W. Hwu. Efficient Compilation of Fine-grained SPMD-threaded Programs for Multicore CPUs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 111-119, 2010.
-
(2010)
Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization
, pp. 111-119
-
-
Stratton, J.A.1
Grover, V.2
Marathe, J.3
Aarts, B.4
Murphy, M.5
Hu, Z.6
Hwu, W.W.7
-
38
-
-
84873470137
-
-
IMPACT Technical Report
-
J. A. Stratton, C. Rodrigues, I. Sung, N. Obeid, L. Chang, N. Anssari, G. D. Liu, and W. W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. IMPACT Technical Report, 2012.
-
(2012)
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing
-
-
Stratton, J.A.1
Rodrigues, C.2
Sung, I.3
Obeid, N.4
Chang, L.5
Anssari, N.6
Liu, G.D.7
Hwu, W.W.8
-
41
-
-
77954691442
-
A GPGPU compiler for memory optimization and parallelism management
-
Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU Compiler for Memory Optimization and Parallelism Management. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 86-97, 2010.
-
(2010)
Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation
, pp. 86-97
-
-
Yang, Y.1
Xiang, P.2
Kong, J.3
Zhou, H.4