-
2
-
-
78149276036
-
Twin peaks: A software platform for heterogeneous computing on general-purpose and graphics processors
-
J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster, and B. Zheng, "Twin peaks: A software platform for heterogeneous computing on general-purpose and graphics processors," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pp. 205-216, 2010.
-
(2010)
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques
, pp. 205-216
-
-
Gummaraju, J.1
Morichetti, L.2
Houston, M.3
Sander, B.4
Gaster, B.R.5
Zheng, B.6
-
4
-
-
84938982672
-
pocl: A performance-portable OpenCL implementation
-
P. Jääskeläinen, C. S. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg, "pocl: A performance-portable OpenCL implementation," International Journal of Parallel Programming, vol. 43, no. 5, pp. 752-785, 2015.
-
(2015)
International Journal of Parallel Programming
, vol.43
, Issue.5
, pp. 752-785
-
-
Jääskeläinen, P.1
De La Lama, C.S.2
Schnetter, E.3
Raiskila, K.4
Takala, J.5
Berg, H.6
-
5
-
-
84961314978
-
Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures
-
H.-S. Kim, I. El Hajj, J. Stratton, S. Lumetta, and W.-M. Hwu, "Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures," in Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 257-268, 2015.
-
(2015)
Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization
, pp. 257-268
-
-
Kim, H.-S.1
El Hajj, I.2
Stratton, J.3
Lumetta, S.4
Hwu, W.-M.5
-
6
-
-
84937693610
-
PORPLE: An extensible optimizer for portable data placement on GPU
-
G. Chen, B. Wu, D. Li, and X. Shen, "PORPLE: An extensible optimizer for portable data placement on GPU," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 88-100, 2014.
-
(2014)
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
, pp. 88-100
-
-
Chen, G.1
Wu, B.2
Li, D.3
Shen, X.4
-
7
-
-
78649824847
-
Exploiting memory access patterns to improve memory performance in data-parallel architectures
-
B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 105-118, 2011.
-
(2011)
IEEE Trans. Parallel Distrib. Syst
, vol.22
, Issue.1
, pp. 105-118
-
-
Jang, B.1
Schaa, D.2
Mistry, P.3
Kaeli, D.4
-
8
-
-
0343462141
-
Automated empirical optimizations of software and the ATLAS project
-
R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, vol. 27, no. 1, pp. 3-35, 2001.
-
(2001)
Parallel Computing
, vol.27
, Issue.1
, pp. 3-35
-
-
Whaley, R.C.1
Petitet, A.2
Dongarra, J.J.3
-
9
-
-
1542396679
-
Spiral: A generator for platform-adapted libraries of signal processing algorithms
-
M. Püschel, J. M. Moura, B. Singer, J. Xiong, J. Johnson, D. Padua, M. Veloso, and R. W. Johnson, "Spiral: A generator for platform-adapted libraries of signal processing algorithms," International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 21-45, 2004.
-
(2004)
International Journal of High Performance Computing Applications
, vol.18
, Issue.1
, pp. 21-45
-
-
Püschel, M.1
Moura, J.M.2
Singer, B.3
Xiong, J.4
Johnson, J.5
Padua, D.6
Veloso, M.7
Johnson, R.W.8
-
10
-
-
84870725376
-
Policy-based tuning for performance portability and library co-optimization
-
D. Merrill, M. Garland, and A. Grimshaw, "Policy-based tuning for performance portability and library co-optimization," in Innovative Parallel Computing, pp. 1-10, 2012.
-
(2012)
Innovative Parallel Computing
, pp. 1-10
-
-
Merrill, D.1
Garland, M.2
Grimshaw, A.3
-
11
-
-
84883116448
-
Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
-
J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519-530, 2013.
-
(2013)
ACM SIGPLAN Notices
, vol.48
, Issue.6
, pp. 519-530
-
-
Ragan-Kelley, J.1
Barnes, C.2
Adams, A.3
Paris, S.4
Durand, F.5
Amarasinghe, S.6
-
13
-
-
34548207355
-
Sequoia: Programming the memory hierarchy
-
K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, "Sequoia: Programming the memory hierarchy," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM, 2006.
-
(2006)
Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM
-
-
Fatahalian, K.1
Horn, D.R.2
Knight, T.J.3
Leem, L.4
Houston, M.5
Park, J.Y.6
Erez, M.7
Ren, M.8
Aiken, A.9
Dally, W.J.10
Hanrahan, P.11
-
14
-
-
70450227331
-
PetaBricks: A language and compiler for algorithmic choice
-
J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe, "PetaBricks: A language and compiler for algorithmic choice," in Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 38-49, 2009.
-
Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation
, vol.2009
, pp. 38-49
-
-
Ansel, J.1
Chan, C.2
Wong, Y.L.3
Olszewski, M.4
Zhao, Q.5
Edelman, A.6
Amarasinghe, S.7
-
15
-
-
80053955412
-
Accelerating CUDA graph algorithms at maximum warp
-
S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun, "Accelerating CUDA graph algorithms at maximum warp," in ACM SIGPLAN Notices, vol. 46, pp. 267-276, 2011.
-
(2011)
ACM SIGPLAN Notices
, vol.46
, pp. 267-276
-
-
Hong, S.1
Kim, S.K.2
Oguntebi, T.3
Olukotun, K.4
-
16
-
-
85009382810
-
KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism
-
in press
-
I. El Hajj, J. Gómez-Luna, C. Li, L.-W. Chang, D. Milojicic, and W.-M. Hwu, "KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism," in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016 (in press).
-
(2016)
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture
-
-
El Hajj, I.1
Gómez-Luna, J.2
Li, C.3
Chang, L.-W.4
Milojicic, D.5
Hwu, W.-M.6
-
17
-
-
85009366731
-
-
NVIDIA, CUDA C best practices guide, v. 7.0
-
NVIDIA, "CUDA C best practices guide v. 7.0," 2015.
-
(2015)
-
-
-
18
-
-
84975230376
-
DySel: Lightweight dynamic selection for kernel-based data-parallel programming model
-
ACM
-
L.-W. Chang, H.-S. Kim, and W.-M. Hwu, "DySel: Lightweight dynamic selection for kernel-based data-parallel programming model," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 667-680, ACM, 2016.
-
(2016)
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
, pp. 667-680
-
-
Chang, L.-W.1
Kim, H.-S.2
Hwu, W.-M.3
-
19
-
-
20344394051
-
-
"The Matrix Market." http://math.nist.gov/MatrixMarket/.
-
The Matrix Market
-
-
-
22
-
-
85009381347
-
-
Intel Math Kernel Library
-
"Intel Math Kernel Library." http://software.intel.com/en-us/articles/intel-mkl/.
-
-
-
-
23
-
-
84977938542
-
-
CUBLAS Library User Guide
-
NVIDIA, CUBLAS Library User Guide. NVIDIA, v7.0 ed., Oct. 2015.
-
(2015)
CUBLAS Library User Guide
-
-
-
25
-
-
70649092154
-
Rodinia: A benchmark suite for heterogeneous computing
-
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Workload Characterization, 2009, IEEE International Symposium on, pp. 44-54, 2009.
-
Workload Characterization 2009 IEEE International Symposium on
, vol.2009
, pp. 44-54
-
-
Che, S.1
Boyer, M.2
Meng, J.3
Tarjan, D.4
Sheaffer, J.W.5
Lee, S.-H.6
Skadron, K.7
-
26
-
-
57349184047
-
Fast scan algorithms on graphics processors
-
Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli, "Fast scan algorithms on graphics processors," in Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 205-213, 2008.
-
(2008)
Proceedings of the 22nd Annual International Conference on Supercomputing
, pp. 205-213
-
-
Dotsenko, Y.1
Govindaraju, N.K.2
Sloan, P.-P.3
Boyd, C.4
Manferdelli, J.5
-
27
-
-
84875175606
-
StreamScan: Fast scan algorithms for GPUs without global barrier synchronization
-
S. Yan, G. Long, and Y. Zhang, "StreamScan: Fast scan algorithms for GPUs without global barrier synchronization," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 229-238, 2013.
-
(2013)
Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, pp. 229-238
-
-
Yan, S.1
Long, G.2
Zhang, Y.3
-
28
-
-
84976501593
-
In-place data sliding algorithms for many-core architectures
-
IEEE
-
J. Gómez-Luna, L.-W. Chang, I.-J. Sung, W.-M. Hwu, and N. Guil, "In-place data sliding algorithms for many-core architectures," in Parallel Processing, 2015 44th International Conference on, pp. 210-219, IEEE, 2015.
-
(2015)
Parallel Processing 2015 44th International Conference on
, pp. 210-219
-
-
Gómez-Luna, J.1
Chang, L.-W.2
Sung, I.-J.3
Hwu, W.-M.4
Guil, N.5
-
31
-
-
77952273045
-
The scalable heterogeneous computing (SHOC) benchmark suite
-
A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 63-74, 2010.
-
(2010)
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
, pp. 63-74
-
-
Danalis, A.1
Marin, G.2
McCurdy, C.3
Meredith, J.S.4
Roth, P.C.5
Spafford, K.6
Tipparaju, V.7
Vetter, J.S.8
-
32
-
-
84936931250
-
Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format
-
IEEE
-
J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 769-780, IEEE, 2014.
-
(2014)
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
, pp. 769-780
-
-
Greathouse, J.L.1
Daga, M.2
-
33
-
-
84939147992
-
A collection-oriented programming model for performance portability
-
S. Muralidharan, M. Garland, B. Catanzaro, A. Sidelnik, and M. Hall, "A collection-oriented programming model for performance portability," in Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 263-264, 2015.
-
(2015)
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, pp. 263-264
-
-
Muralidharan, S.1
Garland, M.2
Catanzaro, B.3
Sidelnik, A.4
Hall, M.5
-
34
-
-
84957710915
-
Generating performance portable code using rewrite rules: From high-level functional expressions to high-performance OpenCL code
-
M. Steuwer, C. Fensch, S. Lindley, and C. Dubach, "Generating performance portable code using rewrite rules: From high-level functional expressions to high-performance OpenCL code," in Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming, pp. 205-217, 2015.
-
(2015)
Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming
, pp. 205-217
-
-
Steuwer, M.1
Fensch, C.2
Lindley, S.3
Dubach, C.4
-
35
-
-
80054864401
-
PEPPHER: Efficient and productive usage of hybrid computing systems
-
S. Benkner, S. Pllana, J. L. Träff, P. Tsigas, U. Dolinsky, C. Augonnet, B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov, "PEPPHER: Efficient and productive usage of hybrid computing systems," IEEE Micro, vol. 31, no. 5, pp. 28-41, 2011.
-
(2011)
IEEE Micro
, vol.31
, Issue.5
, pp. 28-41
-
-
Benkner, S.1
Pllana, S.2
Träff, J.L.3
Tsigas, P.4
Dolinsky, U.5
Augonnet, C.6
Bachmayer, B.7
Kessler, C.8
Moloney, D.9
Osipov, V.10
-
36
-
-
84876535618
-
The PEPPHER composition tool: Performance-aware dynamic composition of applications for GPU-based systems
-
U. Dastgeer, L. Li, and C. Kessler, "The PEPPHER composition tool: Performance-aware dynamic composition of applications for GPU-based systems," in High Performance Computing, Networking, Storage and Analysis, 2012 SC Companion:, pp. 711-720, 2012.
-
(2012)
High Performance Computing, Networking, Storage and Analysis 2012 SC Companion
, pp. 711-720
-
-
Dastgeer, U.1
Li, L.2
Kessler, C.3
-
37
-
-
84937692188
-
Locality-aware mapping of nested parallel patterns on GPUs
-
IEEE Computer Society
-
H. Lee, K. J. Brown, A. K. Sujeeth, T. Rompf, and K. Olukotun, "Locality-aware mapping of nested parallel patterns on GPUs," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 63-74, IEEE Computer Society, 2014.
-
(2014)
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
, pp. 63-74
-
-
Lee, H.1
Brown, K.J.2
Sujeeth, A.K.3
Rompf, T.4
Olukotun, K.5
-
38
-
-
70449959487
-
CHiLL: A framework for composing high-level loop transformations
-
C. Chen, J. Chame, and M. Hall, "CHiLL: A framework for composing high-level loop transformations," tech. rep., 2008.
-
(2008)
Tech. Rep
-
-
Chen, C.1
Chame, J.2
Hall, M.3
-
41
-
-
84905980170
-
Delite: A compiler architecture for performance-oriented embedded domain-specific languages
-
A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun, "Delite: A compiler architecture for performance-oriented embedded domain-specific languages," ACM Trans. Embed. Comput. Syst., vol. 13, no. 4s, pp. 134:1-134:25, 2014.
-
(2014)
ACM Trans. Embed. Comput. Syst
, vol.13
, Issue.4s
, pp. 134:1-134:25
-
-
Sujeeth, A.K.1
Brown, K.J.2
Lee, H.3
Rompf, T.4
Chafi, H.5
Odersky, M.6
Olukotun, K.7
-
43
-
-
84875671819
-
Portable performance on heterogeneous architectures
-
ACM
-
P. M. Phothilimthana, J. Ansel, J. Ragan-Kelley, and S. Amarasinghe, "Portable performance on heterogeneous architectures," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, vol. 48, pp. 431-444, ACM, 2013.
-
(2013)
Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
, vol.48
, pp. 431-444
-
-
Phothilimthana, P.M.1
Ansel, J.2
Ragan-Kelley, J.3
Amarasinghe, S.4