-
3
-
-
84967201494
-
Dynamic parallelism for simple and efficient GPU graph algorithms
-
ACM
-
P. Zhang, E. Holk, J. Matty, S. Misurda, M. Zalewski, J. Chu, S. McMillan, and A. Lumsdaine, "Dynamic parallelism for simple and efficient GPU graph algorithms," in Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 15, pp. 11:1-11:4, ACM, 2015.
-
(2015)
Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 15
, pp. 11:1-11:4
-
-
Zhang, P.1
Holk, E.2
Matty, J.3
Misurda, S.4
Zalewski, M.5
Chu, J.6
McMillan, S.7
Lumsdaine, A.8
-
4
-
-
84923668340
-
Efficient GPU-implementation of adaptive mesh refinement for the shallow-water equations
-
M. L. Sætra, A. R. Brodtkorb, and K.-A. Lie, "Efficient GPU-implementation of adaptive mesh refinement for the shallow-water equations," Journal of Scientific Computing, vol. 63, no. 1, pp. 23-48, 2015.
-
(2015)
Journal of Scientific Computing
, vol.63
, Issue.1
, pp. 23-48
-
-
Sætra, M.L.1
Brodtkorb, A.R.2
Lie, K.-A.3
-
5
-
-
84976501593
-
In-place data sliding algorithms for many-core architectures
-
IEEE
-
J. Gómez-Luna, L.-W. Chang, I.-J. Sung, W.-M. Hwu, and N. Guil, "In-place data sliding algorithms for many-core architectures," in Parallel Processing (ICPP), 2015 44th International Conference on, pp. 210-219, IEEE, 2015.
-
(2015)
Parallel Processing (ICPP) 2015 44th International Conference on
, pp. 210-219
-
-
Gómez-Luna, J.1
Chang, L.-W.2
Sung, I.-J.3
Hwu, W.-M.4
Guil, N.5
-
6
-
-
84946029581
-
Characterization and analysis of dynamic parallelism in unstructured GPU applications
-
J. Wang and S. Yalamanchili, "Characterization and analysis of dynamic parallelism in unstructured GPU applications," in Workload Characterization (IISWC), 2014 IEEE International Symposium on, pp. 51-60, IEEE, 2014.
-
(2014)
Workload Characterization (IISWC) 2014 IEEE International Symposium On, IEEE
, pp. 51-60
-
-
Wang, J.1
Yalamanchili, S.2
-
7
-
-
84896893237
-
CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications
-
ACM
-
Y. Yang and H. Zhou, "CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications," in ACM SIGPLAN Notices, vol. 49, pp. 93-106, ACM, 2014.
-
(2014)
ACM SIGPLAN Notices
, vol.49
, pp. 93-106
-
-
Yang, Y.1
Zhou, H.2
-
8
-
-
85009354023
-
-
A CUDA dynamic parallelism case study: PANDA Accessed 2016-04-01
-
"A CUDA dynamic parallelism case study: PANDA." https://devblogs.nvidia.com/parallelforall/a-CUDA-dynamic-parallelismcase-study-panda/. Accessed: 2016-04-01.
-
-
-
-
9
-
-
84960076275
-
Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs
-
ACM
-
J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, "Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 528-540, ACM, 2015.
-
(2015)
Proceedings of the 42nd Annual International Symposium on Computer Architecture
, pp. 528-540
-
-
Wang, J.1
Rubin, N.2
Sidelnik, A.3
Yalamanchili, S.4
-
11
-
-
84886375362
-
Parallel search on video cards
-
USENIX Association
-
T. Kaldewey, J. Hagen, A. Di Blas, and E. Sedlar, "Parallel search on video cards," in Proceedings of the First USENIX Conference on Hot Topics in Parallelism, HotPar09, p. 9, USENIX Association, 2009.
-
(2009)
Proceedings of the First USENIX Conference on Hot Topics in Parallelism, HotPar09
, p. 9
-
-
Kaldewey, T.1
Hagen, J.2
Di Blas, A.3
Sedlar, E.4
-
13
-
-
84938982672
-
Pocl: A performance-portable OpenCL implementation
-
P. Jääskeläinen, C. S. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg, "pocl: A performance-portable OpenCL implementation," International Journal of Parallel Programming, vol. 43, no. 5, pp. 752-785, 2015.
-
(2015)
International Journal of Parallel Programming
, vol.43
, Issue.5
, pp. 752-785
-
-
Jääskeläinen, P.1
De La Lama, C.S.2
Schnetter, E.3
Raiskila, K.4
Takala, J.5
Berg, H.6
-
14
-
-
84876943307
-
Convergence and scalarization for data-parallel architectures
-
IEEE Computer Society
-
K. Asanovic, S. W. Keckler, Y. Lee, R. Krashinsky, and V. Grover, "Convergence and scalarization for data-parallel architectures," in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-11, IEEE Computer Society, 2013.
-
(2013)
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
, pp. 1-11
-
-
Asanovic, K.1
Keckler, S.W.2
Lee, Y.3
Krashinsky, R.4
Grover, V.5
-
15
-
-
70450189096
-
Efficient stream compaction on wide SIMD many-core architectures
-
ACM
-
M. Billeter, O. Olsson, and U. Assarsson, "Efficient stream compaction on wide SIMD many-core architectures," in Proceedings of the Conference on High Performance Graphics 2009, HPG 09, pp. 159-166, ACM, 2009.
-
Proceedings of the Conference on High Performance Graphics 2009, HPG 09
, vol.2009
, pp. 159-166
-
-
Billeter, M.1
Olsson, O.2
Assarsson, U.3
-
16
-
-
85049937265
-
The OpenCL specification, version 2.0
-
L. Howes and A. Munshi, "The OpenCL specification, version 2.0," Khronos Group, 2015.
-
(2015)
Khronos Group
-
-
Howes, L.1
Munshi, A.2
-
17
-
-
84875175606
-
StreamScan: Fast scan algorithms for GPUs without global barrier synchronization
-
ACM
-
S. Yan, G. Long, and Y. Zhang, "StreamScan: Fast scan algorithms for GPUs without global barrier synchronization," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 13, pp. 229-238, ACM, 2013.
-
(2013)
Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 13
, pp. 229-238
-
-
Yan, S.1
Long, G.2
Zhang, Y.3
-
18
-
-
77952273045
-
The scalable heterogeneous computing (SHOC) benchmark suite
-
ACM
-
A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pp. 63-74, ACM, 2010.
-
(2010)
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3
, pp. 63-74
-
-
Danalis, A.1
Marin, G.2
McCurdy, C.3
Meredith, J.S.4
Roth, P.C.5
Spafford, K.6
Tipparaju, V.7
Vetter, J.S.8
-
19
-
-
84873458159
-
A quantitative study of irregular programs on GPUs
-
Nov 2012
-
M. Burtscher, R. Nasre, and K. Pingali, "A quantitative study of irregular programs on GPUs," in Workload Characterization (IISWC), 2012 IEEE International Symposium on, pp. 141-151, Nov 2012.
-
Workload Characterization (IISWC) 2012 IEEE International Symposium on
, pp. 141-151
-
-
Burtscher, M.1
Nasre, R.2
Pingali, K.3
-
20
-
-
85009348622
-
-
NVIDIA, CUDA samples v. 7.5
-
NVIDIA, "CUDA samples v. 7.5," 2015.
-
(2015)
-
-
-
21
-
-
84923879310
-
NUPAR: A benchmark suite for modern GPU architectures
-
ACM
-
Y. Ukidave, F. N. Paravecino, L. Yu, C. Kalra, A. Momeni, Z. Chen, N. Materise, B. Daley, P. Mistry, and D. Kaeli, "NUPAR: A benchmark suite for modern GPU architectures," in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE 15, pp. 253-264, ACM, 2015.
-
(2015)
Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE 15
, pp. 253-264
-
-
Ukidave, Y.1
Paravecino, F.N.2
Yu, L.3
Kalra, C.4
Momeni, A.5
Chen, Z.6
Materise, N.7
Daley, B.8
Mistry, P.9
Kaeli, D.10
-
22
-
-
79952796611
-
Evaluating graph coloring on GPUs
-
A. V. P. Grosset, P. Zhu, S. Liu, S. Venkatasubramanian, and M. Hall, "Evaluating graph coloring on GPUs," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP 11, pp. 297-298, ACM, 2011.
-
(2011)
Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, ACM
, vol.11
, pp. 297-298
-
-
Grosset, A.V.P.1
Zhu, P.2
Liu, S.3
Venkatasubramanian, S.4
Hall, M.5
-
23
-
-
20344394051
-
-
Accessed 2016-04-01
-
"Matrix market." http://math.nist.gov/MatrixMarket/. Accessed: 2016-04-01.
-
Matrix Market
-
-
-
25
-
-
85009348666
-
-
GPU pro tip: CUDA 7 streams simplify concurrency Accessed 2016-04-01
-
"GPU pro tip: CUDA 7 streams simplify concurrency." http://devblogs.nvidia.com/parallelforall/GPU-pro-Tip-CUDA-7-streamssimplify-concurrency/. Accessed: 2016-04-01.
-
-
-
-
26
-
-
84940066769
-
-
Accessed 2016-04-10
-
"CUDA dynamic parallelism API and principles." https://devblogs.nvidia.com/parallelforall/CUDA-dynamic-parallelism-Api-principles/. Accessed: 2016-04-10.
-
CUDA Dynamic Parallelism API and Principles
-
-
-
27
-
-
85009416339
-
-
NVIDIA, Profiler user's guide v. 7.5
-
NVIDIA, "Profiler user's guide v. 7.5," 2015.
-
(2015)
-
-
-
28
-
-
84988443467
-
LaPerm: Locality aware scheduler for dynamic parallelism on GPUs
-
June
-
J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, "LaPerm: Locality aware scheduler for dynamic parallelism on GPUs," in The 43rd International Symposium on Computer Architecture (ISCA), June 2016.
-
(2016)
The 43rd International Symposium on Computer Architecture (ISCA)
-
-
Wang, J.1
Rubin, N.2
Sidelnik, A.3
Yalamanchili, S.4
-
29
-
-
84905454859
-
Fine-grain task aggregation and coordination on GPUs
-
IEEE Press
-
M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood, "Fine-grain task aggregation and coordination on GPUs," in ACM SIGARCH Computer Architecture News, vol. 42, pp. 181-192, IEEE Press, 2014.
-
(2014)
ACM SIGARCH Computer Architecture News
, vol.42
, pp. 181-192
-
-
Orr, M.S.1
Beckmann, B.M.2
Reinhardt, S.K.3
Wood, D.A.4
-
30
-
-
84870690379
-
A study of persistent threads style GPU programming for GPGPU workloads
-
IEEE
-
K. Gupta, J. A. Stuart, and J. D. Owens, "A study of persistent threads style GPU programming for GPGPU workloads," in Innovative Parallel Computing (InPar), 2012, pp. 1-14, IEEE, 2012.
-
(2012)
Innovative Parallel Computing (InPar) 2012
, pp. 1-14
-
-
Gupta, K.1
Stuart, J.A.2
Owens, J.D.3
-
31
-
-
85009416419
-
-
Private communication
-
G. Chen and X. Shen, private communication.
-
-
-
Chen, G.1
Shen, X.2
-
32
-
-
79960526623
-
Enabling task parallelism in the CUDA scheduler
-
M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron, "Enabling task parallelism in the CUDA scheduler," in Workshop on Programming Models for Emerging Architectures, vol. 9, 2009.
-
(2009)
Workshop on Programming Models for Emerging Architectures
, vol.9
-
-
Guevara, M.1
Gregg, C.2
Hazelwood, K.3
Skadron, K.4
-
33
-
-
84976510144
-
Nested parallelism on GPU: Exploring parallelization templates for irregular loops and recursive computations
-
IEEE
-
D. Li, H. Wu, and M. Becchi, "Nested parallelism on GPU: Exploring parallelization templates for irregular loops and recursive computations," in Parallel Processing (ICPP), 2015 44th International Conference on, pp. 979-988, IEEE, 2015.
-
(2015)
Parallel Processing (ICPP) 2015 44th International Conference on
, pp. 979-988
-
-
Li, D.1
Wu, H.2
Becchi, M.3
-
34
-
-
84951798257
-
Efficient execution of recursive programs on commodity vector hardware
-
ACM
-
B. Ren, Y. Jo, S. Krishnamoorthy, K. Agrawal, and M. Kulkarni, "Efficient execution of recursive programs on commodity vector hardware," in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 509-520, ACM, 2015.
-
(2015)
Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
, pp. 509-520
-
-
Ren, B.1
Jo, Y.2
Krishnamoorthy, S.3
Agrawal, K.4
Kulkarni, M.5
-
35
-
-
10444243253
-
Decoupled software pipelining with the synchronization array
-
IEEE Computer Society
-
R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August, "Decoupled software pipelining with the synchronization array," in Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pp. 177-188, IEEE Computer Society, 2004.
-
(2004)
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
, pp. 177-188
-
-
Rangan, R.1
Vachharajani, N.2
Vachharajani, M.3
August, D.I.4
-
36
-
-
84875184822
-
Kernel weaver: Automatically fusing database primitives for efficient GPU computation
-
IEEE Computer Society
-
H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili, "Kernel weaver: Automatically fusing database primitives for efficient GPU computation," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 107-118, IEEE Computer Society, 2012.
-
(2012)
Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
, pp. 107-118
-
-
Wu, H.1
Diamos, G.2
Cadambi, S.3
Yalamanchili, S.4