SCOPUS Information Retrieval Platform

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume 7210 LNCS, 2012, Pages 21-40

Automatic restructuring of GPU kernels for exploiting inter-thread data locality

(3) Unkule, Swapneela (a); Shaltz, Christopher (a); Qasem, Apan (a)

(a) Texas State University (United States)

Author keywords

[No Author keywords available]

Indexed keywords

DATA LOCALITY; DIRECT IMPACT; MEMORY PERFORMANCE; MULTI-THREADING; ON CURRENTS; REGISTER PRESSURE; SHARED MEMORIES; SOFTWARE FRAMEWORKS;

AUTOMATION; COMPUTER PROGRAMMING; CONTROL; MULTITASKING; PROFITABILITY; PROGRAM COMPILERS;

COARSENING;

EID: 84859153100 PISSN: 03029743 EISSN: 16113349 Source Type: Book Series
DOI: 10.1007/978-3-642-28652-0_2 Document Type: Conference Paper

Times cited : (25)

References (28)

1
- 84859145371
- CUDA PTX ISA, http://www.nvidia.com/content/CUDAptxisa1.4.pdf
- CUDA PTX ISA

2
- 84889682070
- GPU Computing SDK, http://developer.nvidia.com
- GPU Computing SDK

3
- 84859150897
- Kernel for min-max and reduction, http://supercomputingblog.com/cuda/cuda-tutorial-3-thread-communication/
- Kernel for Min-max and Reduction

4
- 32844469834
- Top 500 Supercomputer Sites, http://www.top500.org
- Top 500 Supercomputer Sites

5
- 34547309668
- Version 3.0. NVIDIA
- CUDA Programming Guide, Version 3.0. NVIDIA (2010)
- (2010) CUDA Programming Guide

6
- 77950611743
- HPCToolkit: Tools for performance analysis of optimized parallel programs
- Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: HPCToolkit: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22(6), 685-701 (2010)
- (2010) Concurrency and Computation: Practice and Experience , vol.22 , Issue.6 , pp. 685-701
- Adhianto, L.¹ Banerjee, S.² Fagan, M.³ Krentel, M.⁴ Marin, G.⁵ Mellor-Crummey, J.⁶ Tallent, N.R.⁷

7
- 84925509670
- Optimizing Compilers for Modern Architectures
- Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures. Morgan Kaufmann (2002)
- (2002) Morgan Kaufmann
- Allen, R.¹ Kennedy, K.²

8
- 77951572335
- Automatic C-to-CUDA Code Generation for Affine Programs
- Gupta, R. (ed.) CC 2010. Springer, Heidelberg
- Baskaran, M.M., Ramanujam, J., Sadayappan, P.: Automatic C-to-CUDA Code Generation for Affine Programs. In: Gupta, R. (ed.) CC 2010. LNCS, vol. 6011, pp. 244-263. Springer, Heidelberg (2010)
- (2010) LNCS , vol.6011 , pp. 244-263
- Baskaran, M.M.¹ Ramanujam, J.² Sadayappan, P.³

9
- 0027983356
- Effective partial redundancy elimination
- Briggs, P., Cooper, K.D.: Effective partial redundancy elimination. In: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, PLDI 1994 (1994)
- (1994) Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, PLDI 1994
- Briggs, P.¹ Cooper, K.D.²

10
- 84877042382
- A scalable cross-platform infrastructure for application performance tuning using hardware counters
- Browne, S., Dongarra, J., Garner, N., London, K., Mucci, P.: A scalable cross-platform infrastructure for application performance tuning using hardware counters. In: ACM/IEEE 2000 Conference, Supercomputing (November 2000)
- ACM/IEEE 2000 Conference, Supercomputing (November 2000)
- Browne, S.¹ Dongarra, J.² Garner, N.³ London, K.⁴ Mucci, P.⁵

11
- 0028549474
- Improving the ratio of memory operations to floating-point operations in loops
- Carr, S., Kennedy, K.: Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems 16(6), 1768-1810 (1994)
- (1994) ACM Transactions on Programming Languages and Systems , vol.16 , Issue.6 , pp. 1768-1810
- Carr, S.¹ Kennedy, K.²

12
- 77749340082
- Model-driven autotuning of sparse matrix-vector multiply on GPUs
- ACM, New York
- Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on GPUs. In: PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 115-126. ACM, New York (2010)
- (2010) PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , pp. 115-126
- Choi, J.W.¹ Singh, A.² Vuduc, R.W.³

13
- 0023565191
- What's in a name? -or- The value of renaming for parallelism detection and storage allocation
- Cytron, R., Ferrante, J.: What's in a name? -or- the value of renaming for parallelism detection and storage allocation. In: ICPP 1987, pp. 19-27 (1987)
- (1987) ICPP 1987 , pp. 19-27
- Cytron, R.¹ Ferrante, J.²

14
- 70350771127
- Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
- IEEE Press, Piscataway
- Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1-12. IEEE Press, Piscataway (2008)
- (2008) SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing , pp. 1-12
- Datta, K.¹ Murphy, M.² Volkov, V.³ Williams, S.⁴ Carter, J.⁵ Oliker, L.⁶ Patterson, D.⁷ Shalf, J.⁸ Yelick, K.⁹

15
- 77954013048
- Optimal loop unrolling for GPGPU programs
- Murthy, G., Ravishankar, M., Baskaran, M.M., Sadayappan, P.: Optimal loop unrolling for GPGPU programs. In: IEEE International Symposium on Parallel Distributed Processing (2010)
- IEEE International Symposium on Parallel Distributed Processing (2010)
- Murthy, G.¹ Ravishankar, M.² Baskaran, M.M.³ Sadayappan, P.⁴

16
- 34548292052
- A memory model for scientific algorithms on graphics processors
- ACM, New York
- Govindaraju, N.K., Larsen, S., Gray, J., Manocha, D.: A memory model for scientific algorithms on graphics processors. In: SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 89. ACM, New York (2006)
- (2006) SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing , p. 89
- Govindaraju, N.K.¹ Larsen, S.² Gray, J.³ Manocha, D.⁴

17
- 70350754502
- High performance discrete Fourier transforms on graphics processors
- IEEE Press, Piscataway
- Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High performance discrete Fourier transforms on graphics processors. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1-12. IEEE Press, Piscataway (2008)
- (2008) SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing , pp. 1-12
- Govindaraju, N.K.¹ Lloyd, B.² Dotsenko, Y.³ Smith, B.⁴ Manferdelli, J.⁵

18
- 79952608669
- Optimizing and Auto-tuning Belief Propagation on the GPU
- Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds.) LCPC 2010. Springer, Heidelberg
- Grauer-Gray, S., Cavazos, J.: Optimizing and Auto-tuning Belief Propagation on the GPU. In: Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds.) LCPC 2010. LNCS, vol. 6548, pp. 121-135. Springer, Heidelberg (2011)
- (2011) LNCS , vol.6548 , pp. 121-135
- Grauer-Gray, S.¹ Cavazos, J.²

19
- 67650081010
- OpenMP to GPGPU: A compiler framework for automatic translation and optimization
- Lee, S., Min, S.J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2009)
- Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2009)
- Lee, S.¹ Min, S.J.² Eigenmann, R.³

20
- 79952583455
- Accelerating GPU Kernels for Dense Linear Algebra
- Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. Springer, Heidelberg
- Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU Kernels for Dense Linear Algebra. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 83-92. Springer, Heidelberg (2011)
- (2011) LNCS , vol.6449 , pp. 83-92
- Nath, R.¹ Tomov, S.² Dongarra, J.³

21
- 74049114159
- Auto-tuning 3-d FFT library for CUDA GPUs
- ACM, New York
- Nukada, A., Matsuoka, S.: Auto-tuning 3-d FFT library for CUDA GPUs. In: SC 2009: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-10. ACM, New York (2009)
- (2009) SC 2009: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis , pp. 1-10
- Nukada, A.¹ Matsuoka, S.²

22
- 78650814738
- Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures
- Rahimian, A., Lashuk, I., Veerapaneni, S., Chandramowlishwaran, A., Malhotra, D., Moon, L., Sampath, R., Shringarpure, A., Vetter, J., Vuduc, R., Zorin, D., Biros, G.: Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010)
- Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010)
- Rahimian, A.¹ Lashuk, I.² Veerapaneni, S.³ Chandramowlishwaran, A.⁴ Malhotra, D.⁵ Moon, L.⁶ Sampath, R.⁷ Shringarpure, A.⁸ Vetter, J.⁹ Vuduc, R.¹⁰ Zorin, D.¹¹ Biros, G.¹²

23
- 79959466764
- Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
- Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2008)
- Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2008)
- Ryoo, S.¹ Rodrigues, C.I.² Baghsorkhi, S.S.³ Stone, S.S.⁴ Kirk, D.B.⁵ Hwu, W.M.W.⁶

24
- 70350771131
- Benchmarking GPUs to tune dense linear algebra
- Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (2008)
- SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (2008)
- Volkov, V.¹ Demmel, J.W.²

25
- 60949098907
- Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35(3), 178-194 (2009)
- (2009) Parallel Comput , vol.35 , Issue.3 , pp. 178-194
- Williams, S.¹ Oliker, L.² Vuduc, R.³ Shalf, J.⁴ Yelick, K.⁵ Demmel, J.⁶

26
- 58449092097
- Exploring the Optimization Space of Dense Linear Algebra Kernels
- Amaral, J.N. (ed.) LCPC 2008. Springer, Heidelberg
- Yi, Q., Qasem, A.: Exploring the Optimization Space of Dense Linear Algebra Kernels. In: Amaral, J.N. (ed.) LCPC 2008. LNCS, vol. 5335, pp. 343-355. Springer, Heidelberg (2008)
- (2008) LNCS , vol.5335 , pp. 343-355
- Yi, Q.¹ Qasem, A.²

27
- 70450103746
- A cross-input adaptive framework for GPU program optimizations
- Yixun, L., Zhang, E.Z., Shen, X.: A cross-input adaptive framework for GPU program optimizations. In: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (2009)
- Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (2009)
- Yixun, L.¹ Zhang, E.Z.² Shen, X.³

28
- 77955184165
- Accelerating iterative field-compensated MR image reconstruction on GPUs
- Zhuo, Y., Wu, X.L., Haldar, J.P., Hwu, W.M., Liang, Z.P., Sutton, B.P.: Accelerating iterative field-compensated MR image reconstruction on GPUs. In: Proceedings of the 2010 IEEE International Conference on Biomedical Imaging: From Nano to Macro, ISBI 2010 (2010)
- (2010) Proceedings of the 2010 IEEE International Conference on Biomedical Imaging: From Nano to Macro, ISBI 2010
- Zhuo, Y.¹ Wu, X.L.² Haldar, J.P.³ Hwu, W.M.⁴ Liang, Z.P.⁵ Sutton, B.P.⁶

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.