SCOPUS Information Search Platform

Proceedings of the Annual International Symposium on Microarchitecture, MICRO

Volume 2016-December, 2016

KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism

(6) Hajj, Izzat El a,c; Gomez Luna, Juan b; Li, Cheng a; Chang, Li Wen a; Milojicic, Dejan b; Hwu, Wen Mei a

a UNIVERSITY OF ILLINOIS AT URBANA CHAMPAIGN (United States)

b UNIVERSITY OF CÓRDOBA (Spain)

c HEWLETT PACKARD LABORATORIES (United States)

Author keywords

[No Author keywords available]

Indexed keywords

PROGRAM COMPILERS; PROGRAM PROCESSORS;

COMPILER TECHNIQUES; GEOMETRIC MEAN;

COMPUTER ARCHITECTURE;

EID: 85009382810 PISSN: 10724451 EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/MICRO.2016.7783716 Document Type: Conference Paper

Times cited : (33)

References (36)

1
- 84898796621
- Introduction to dynamic parallelism
- S. Jones, "Introduction to dynamic parallelism," in GPU Technology Conference Presentation, 2012.
- (2012) GPU Technology Conference Presentation
- Jones, S.¹

2
- 84966891675
- Morgan Kaufmann
- W. H. Wen-mei, Heterogeneous System Architecture: A new compute platform infrastructure. Morgan Kaufmann, 2015.
- (2015) Heterogeneous System Architecture: A New Compute Platform Infrastructure
- Wen-Mei, W.H.¹

3
- 84967201494
- Dynamic parallelism for simple and efficient GPU graph algorithms
- ACM
- P. Zhang, E. Holk, J. Matty, S. Misurda, M. Zalewski, J. Chu, S. McMillan, and A. Lumsdaine, "Dynamic parallelism for simple and efficient GPU graph algorithms," in Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 15, pp. 11:1-11:4, ACM, 2015.
- (2015) Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 15 , pp. 11:1-11:4
- Zhang, P.¹ Holk, E.² Matty, J.³ Misurda, S.⁴ Zalewski, M.⁵ Chu, J.⁶ McMillan, S.⁷ Lumsdaine, A.⁸

4
- 84923668340
- Efficient GPU implementation of adaptive mesh refinement for the shallow-water equations
- M. L. Sætra, A. R. Brodtkorb, and K.-A. Lie, "Efficient GPU implementation of adaptive mesh refinement for the shallow-water equations," Journal of Scientific Computing, vol. 63, no. 1, pp. 23-48, 2015.
- (2015) Journal of Scientific Computing , vol.63 , Issue.1 , pp. 23-48
- Sætra, M.L.¹ Brodtkorb, A.R.² Lie, K.-A.³

5
- 84976501593
- In-place data sliding algorithms for many-core architectures
- IEEE
- J. Gómez-Luna, L.-W. Chang, I.-J. Sung, W.-M. Hwu, and N. Guil, "In-place data sliding algorithms for many-core architectures," in Parallel Processing (ICPP), 2015 44th International Conference on, pp. 210-219, IEEE, 2015.
- (2015) Parallel Processing (ICPP) 2015 44th International Conference on , pp. 210-219
- Gómez-Luna, J.¹ Chang, L.-W.² Sung, I.-J.³ Hwu, W.-M.⁴ Guil, N.⁵

6
- 84946029581
- Characterization and analysis of dynamic parallelism in unstructured GPU applications
- J. Wang and S. Yalamanchili, "Characterization and analysis of dynamic parallelism in unstructured GPU applications," in Workload Characterization (IISWC), 2014 IEEE International Symposium on, pp. 51-60, IEEE, 2014.
- (2014) Workload Characterization (IISWC) 2014 IEEE International Symposium On, IEEE , pp. 51-60
- Wang, J.¹ Yalamanchili, S.²

7
- 84896893237
- CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications
- ACM
- Y. Yang and H. Zhou, "CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications," in ACM SIGPLAN Notices, vol. 49, pp. 93-106, ACM, 2014.
- (2014) ACM SIGPLAN Notices , vol.49 , pp. 93-106
- Yang, Y.¹ Zhou, H.²

8
- 85009354023
- A CUDA dynamic parallelism case study: PANDA Accessed 2016-04-01
- "A CUDA dynamic parallelism case study: PANDA." https://devblogs.nvidia.com/parallelforall/a-CUDA-dynamic-parallelismcase-study-panda/. Accessed: 2016-04-01.

9
- 84960076275
- Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs
- ACM
- J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, "Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 528-540, ACM, 2015.
- (2015) Proceedings of the 42nd Annual International Symposium on Computer Architecture , pp. 528-540
- Wang, J.¹ Rubin, N.² Sidelnik, A.³ Yalamanchili, S.⁴

10
- 84959927541
- Free launch: Optimizing GPU dynamic kernel launches through thread reuse
- ACM
- G. Chen and X. Shen, "Free launch: Optimizing GPU dynamic kernel launches through thread reuse," in Proceedings of the 48th International Symposium on Microarchitecture, pp. 407-419, ACM, 2015.
- (2015) Proceedings of the 48th International Symposium on Microarchitecture , pp. 407-419
- Chen, G.¹ Shen, X.²

11
- 84886375362
- Parallel search on video cards
- USENIX Association
- T. Kaldewey, J. Hagen, A. Di Blas, and E. Sedlar, "Parallel search on video cards," in Proceedings of the First USENIX Conference on Hot Topics in Parallelism, HotPar09, p. 9, USENIX Association, 2009.
- (2009) Proceedings of the First USENIX Conference on Hot Topics in Parallelism, HotPar09 , pp. 9
- Kaldewey, T.¹ Hagen, J.² Di Blas, A.³ Sedlar, E.⁴

12
- 85009397782
- CUB: kernel-level software reuse and library design
- D. Merrill, "CUB: kernel-level software reuse and library design," in GPU Technology Conference Presentation, 2013.
- (2013) GPU Technology Conference Presentation
- Merrill, D.¹

13
- 84938982672
- Pocl: A performance-portable OpenCL implementation
- P. Jääskeläinen, C. S. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg, "pocl: A performance-portable OpenCL implementation," International Journal of Parallel Programming, vol. 43, no. 5, pp. 752-785, 2015.
- (2015) International Journal of Parallel Programming , vol.43 , Issue.5 , pp. 752-785
- Jääskeläinen, P.¹ De La Lama, C.S.² Schnetter, E.³ Raiskila, K.⁴ Takala, J.⁵ Berg, H.⁶

14
- 84876943307
- Convergence and scalarization for data-parallel architectures
- IEEE Computer Society
- K. Asanovic, S. W. Keckler, Y. Lee, R. Krashinsky, and V. Grover, "Convergence and scalarization for data-parallel architectures," in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-11, IEEE Computer Society, 2013.
- (2013) Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) , pp. 1-11
- Asanovic, K.¹ Keckler, S.W.² Lee, Y.³ Krashinsky, R.⁴ Grover, V.⁵

15
- 70450189096
- Efficient stream compaction on wide SIMD many-core architectures
- ACM
- M. Billeter, O. Olsson, and U. Assarsson, "Efficient stream compaction on wide SIMD many-core architectures," in Proceedings of the Conference on High Performance Graphics 2009, HPG 09, pp. 159-166, ACM, 2009.
- Proceedings of the Conference on High Performance Graphics 2009, HPG 09 , vol.2009 , pp. 159-166
- Billeter, M.¹ Olsson, O.² Assarsson, U.³

16
- 85049937265
- The OpenCL specification, version 2.0
- L. Howes and A. Munshi, "The OpenCL specification, version 2.0," Khronos Group, 2015.
- (2015) Khronos Group
- Howes, L.¹ Munshi, A.²

17
- 84875175606
- StreamScan: Fast scan algorithms for GPUs without global barrier synchronization
- ACM
- S. Yan, G. Long, and Y. Zhang, "StreamScan: Fast scan algorithms for GPUs without global barrier synchronization," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 13, pp. 229-238, ACM, 2013.
- (2013) Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 13 , pp. 229-238
- Yan, S.¹ Long, G.² Zhang, Y.³

18
- 77952273045
- The scalable heterogeneous computing (SHOC) benchmark suite
- ACM
- A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pp. 63-74, ACM, 2010.
- (2010) Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3 , pp. 63-74
- Danalis, A.¹ Marin, G.² McCurdy, C.³ Meredith, J.S.⁴ Roth, P.C.⁵ Spafford, K.⁶ Tipparaju, V.⁷ Vetter, J.S.⁸

19
- 84873458159
- A quantitative study of irregular programs on GPUs
- Nov 2012
- M. Burtscher, R. Nasre, and K. Pingali, "A quantitative study of irregular programs on GPUs," in Workload Characterization (IISWC), 2012 IEEE International Symposium on, pp. 141-151, Nov 2012.
- Workload Characterization (IISWC) 2012 IEEE International Symposium on , pp. 141-151
- Burtscher, M.¹ Nasre, R.² Pingali, K.³

20
- 85009348622
- NVIDIA, CUDA samples v. 7.5
- NVIDIA, "CUDA samples v. 7.5," 2015.
- (2015)

21
- 84923879310
- NUPAR: A benchmark suite for modern GPU architectures
- ACM
- Y. Ukidave, F. N. Paravecino, L. Yu, C. Kalra, A. Momeni, Z. Chen, N. Materise, B. Daley, P. Mistry, and D. Kaeli, "NUPAR: A benchmark suite for modern GPU architectures," in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE 15, pp. 253-264, ACM, 2015.
- (2015) Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE 15 , pp. 253-264
- Ukidave, Y.¹ Paravecino, F.N.² Yu, L.³ Kalra, C.⁴ Momeni, A.⁵ Chen, Z.⁶ Materise, N.⁷ Daley, B.⁸ Mistry, P.⁹ Kaeli, D.¹⁰

22
- 79952796611
- Evaluating graph coloring on GPUs
- A. V. P. Grosset, P. Zhu, S. Liu, S. Venkatasubramanian, and M. Hall, "Evaluating graph coloring on GPUs," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP 11, pp. 297-298, ACM, 2011.
- (2011) Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP 11 , pp. 297-298
- Grosset, A.V.P.¹ Zhu, P.² Liu, S.³ Venkatasubramanian, S.⁴ Hall, M.⁵

23
- 20344394051
- Accessed 2016-04-01
- "Matrix market." http://math.nist.gov/MatrixMarket/. Accessed: 2016-04-01.
- Matrix Market

24
- 84882564541
- Thrust: A productivity-oriented library for CUDA
- N. Bell and J. Hoberock, "Thrust: A productivity-oriented library for CUDA," GPU Computing Gems: Jade Edition, 2012.
- (2012) GPU Computing Gems: Jade Edition
- Bell, N.¹ Hoberock, J.²

25
- 85009348666
- GPU pro tip: CUDA 7 streams simplify concurrency Accessed 2016-04-01
- "GPU pro tip: CUDA 7 streams simplify concurrency." http://devblogs.nvidia.com/parallelforall/GPU-pro-Tip-CUDA-7-streamssimplify-concurrency/. Accessed: 2016-04-01.

26
- 84940066769
- Accessed 2016-04-10
- "CUDA dynamic parallelism API and principles." https://devblogs.nvidia.com/parallelforall/CUDA-dynamic-parallelism-Api-principles/. Accessed: 2016-04-10.
- CUDA Dynamic Parallelism API and Principles

27
- 85009416339
- NVIDIA, Profiler users guide v. 7.5
- NVIDIA, "Profiler users guide v. 7.5," 2015.
- (2015)

28
- 84988443467
- Laperm: Locality aware scheduler for dynamic parallelism on GPUs
- June
- J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, "Laperm: Locality aware scheduler for dynamic parallelism on GPUs," in The 43rd International Symposium on Computer Architecture (ISCA), June 2016.
- (2016) The 43rd International Symposium on Computer Architecture (ISCA)
- Wang, J.¹ Rubin, N.² Sidelnik, A.³ Yalamanchili, S.⁴

29
- 84905454859
- Fine-grain task aggregation and coordination on GPUs
- IEEE Press
- M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood, "Fine-grain task aggregation and coordination on GPUs," in ACM SIGARCH Computer Architecture News, vol. 42, pp. 181-192, IEEE Press, 2014.
- (2014) ACM SIGARCH Computer Architecture News , vol.42 , pp. 181-192
- Orr, M.S.¹ Beckmann, B.M.² Reinhardt, S.K.³ Wood, D.A.⁴

30
- 84870690379
- A study of persistent threads style GPU programming for GPGPU workloads
- IEEE
- K. Gupta, J. A. Stuart, and J. D. Owens, "A study of persistent threads style GPU programming for GPGPU workloads," in Innovative Parallel Computing (InPar), 2012, pp. 1-14, IEEE, 2012.
- (2012) Innovative Parallel Computing (InPar) 2012 , pp. 1-14
- Gupta, K.¹ Stuart, J.A.² Owens, J.D.³

31
- 85009416419
- Private communication
- G. Chen and X. Shen. private communication.
- Chen, G.¹ Shen, X.²

32
- 79960526623
- Enabling task parallelism in the CUDA scheduler
- M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron, "Enabling task parallelism in the CUDA scheduler," in Workshop on Programming Models for Emerging Architectures, vol. 9, 2009.
- (2009) Workshop on Programming Models for Emerging Architectures , vol.9
- Guevara, M.¹ Gregg, C.² Hazelwood, K.³ Skadron, K.⁴

33
- 84976510144
- Nested parallelism on GPU: Exploring parallelization templates for irregular loops and recursive computations
- IEEE
- D. Li, H. Wu, and M. Becchi, "Nested parallelism on GPU: Exploring parallelization templates for irregular loops and recursive computations," in Parallel Processing (ICPP), 2015 44th International Conference on, pp. 979-988, IEEE, 2015.
- (2015) Parallel Processing (ICPP) 2015 44th International Conference on , pp. 979-988
- Li, D.¹ Wu, H.² Becchi, M.³

34
- 84951798257
- Efficient execution of recursive programs on commodity vector hardware
- ACM
- B. Ren, Y. Jo, S. Krishnamoorthy, K. Agrawal, and M. Kulkarni, "Efficient execution of recursive programs on commodity vector hardware," in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 509-520, ACM, 2015.
- (2015) Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation , pp. 509-520
- Ren, B.¹ Jo, Y.² Krishnamoorthy, S.³ Agrawal, K.⁴ Kulkarni, M.⁵

35
- 10444243253
- Decoupled software pipelining with the synchronization array
- IEEE Computer Society
- R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August, "Decoupled software pipelining with the synchronization array," in Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pp. 177-188, IEEE Computer Society, 2004.
- (2004) Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques , pp. 177-188
- Rangan, R.¹ Vachharajani, N.² Vachharajani, M.³ August, D.I.⁴

36
- 84875184822
- Kernel weaver: Automatically fusing database primitives for efficient GPU computation
- IEEE Computer Society
- H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili, "Kernel weaver: Automatically fusing database primitives for efficient GPU computation," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 107-118, IEEE Computer Society, 2012.
- (2012) Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture , pp. 107-118
- Wu, H.¹ Diamos, G.² Cadambi, S.³ Yalamanchili, S.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.