SCOPUS Information Retrieval Platform

Proceedings of the International Conference on Supercomputing

2010, Pages 115-125

Streamlining GPU applications on the fly - Thread divergence elimination through runtime thread-data remapping

(4) Zhang, Eddy Z. (a); Jiang, Yunlian (a); Guo, Ziyu (a); Shen, Xipeng (a)

(a) The College of William and Mary (United States)

Author keywords

CPU GPU pipelining; data transformation; GPGPU; thread divergence; thread data remapping

Indexed keywords

COMPUTING POWER; CONDITIONAL BRANCH; COST EFFICIENCY; DATA LAYOUTS; DATA TRANSFORMATION; GPGPU; GRAPHIC PROCESSING UNITS; HIGH PERFORMANCE COMPUTING; MASSIVE DATA; NON-TRIVIAL; ON THE FLIES; PERFORMANCE DEGRADATION; PERFORMANCE IMPROVEMENTS; REMAPPING; RUN-TIME THREADS; RUNTIMES; SYSTEMATIC INVESTIGATIONS;

INTELLIGENT CONTROL; PARALLEL ARCHITECTURES; PROGRAM PROCESSORS;

METADATA;

EID: 77954724148
PISSN: None
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1145/1810085.1810104
Document Type: Conference Paper

Times cited: 72

References (18)

1
- 84870629709
- NVIDIA CUDA. http://www.nvidia.com/cuda.
- NVIDIA CUDA

2
- 57349180412
- A compiler framework for optimization of affine loop nests for GPGPUs
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS'08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 225-234, 2008.
- (2008) ICS'08: Proceedings of the 22nd Annual International Conference on Supercomputing , pp. 225-234
- Baskaran, M.M.¹ Bondhugula, U.² Krishnamoorthy, S.³ Ramanujam, J.⁴ Rountev, A.⁵ Sadayappan, P.⁶

3
- 77951159230
- A control-structure splitting optimization for GPGPU
- S. Carrillo, J. Siegel, and X. Li. A control-structure splitting optimization for GPGPU. In Proceedings of ACM Computing Frontiers, 2009.
- Proceedings of ACM Computing Frontiers, 2009
- Carrillo, S.¹ Siegel, J.² Li, X.³

4
- 33746070806
- Cache-conscious coallocation of hot data streams
- T. M. Chilimbi and R. Shaham. Cache-conscious coallocation of hot data streams. In Proceedings of ACM SIGPLAN Conference on Programming Languages Design and Implementation, 2006.
- Proceedings of ACM SIGPLAN Conference on Programming Languages Design and Implementation, 2006
- Chilimbi, T.M.¹ Shaham, R.²

5
- 0032667957
- Improving cache performance in dynamic applications through data and computation reorganization at run time
- C. Ding and K. Kennedy. Improving cache performance in dynamic applications through data and computation reorganization at run time. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999.
- Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999
- Ding, C.¹ Kennedy, K.²

6
- 1642502420
- Improving effective bandwidth through compiler enhancement of global cache reuse
- C. Ding and K. Kennedy. Improving effective bandwidth through compiler enhancement of global cache reuse. Journal of Parallel and Distributed Computing, 64(1):108-134, 2004.
- (2004) Journal of Parallel and Distributed Computing , vol.64 , Issue.1 , pp. 108-134
- Ding, C.¹ Kennedy, K.²

7
- 57349184047
- Fast scan algorithms on graphics processors
- Y. Dotsenko, N. K. Govindaraju, P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In ICS'08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205-213, 2008.
- (2008) ICS'08: Proceedings of the 22nd Annual International Conference on Supercomputing , pp. 205-213
- Dotsenko, Y.¹ Govindaraju, N.K.² Sloan, P.³ Boyd, C.⁴ Manferdelli, J.⁵

8
- 47349104432
- Dynamic warp formation and scheduling for efficient GPU control flow
- W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407-420, Washington, DC, USA, 2007. IEEE Computer Society.
- (2007) MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture , pp. 407-420
- Fung, W.¹ Sham, I.² Yuan, G.³ Aamodt, T.⁴

9
- 33745715056
- Exploiting locality for irregular scientific codes
- H. Han and C.-W. Tseng. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems, 17(7):606-618, 2006.
- (2006) IEEE Transactions on Parallel and Distributed Systems , vol.17 , Issue.7 , pp. 606-618
- Han, H.¹ Tseng, C.-W.²

10
- 67650081010
- OpenMP to GPGPU: A compiler framework for automatic translation and optimization
- S. Lee, S. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
- Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009
- Lee, S.¹ Min, S.² Eigenmann, R.³

11
- 70450103746
- A cross-input adaptive framework for GPU programs optimization
- Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU programs optimization. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS), pages 1-10, 2009.
- (2009) Proceedings of International Parallel and Distributed Processing Symposium (IPDPS) , pp. 1-10
- Liu, Y.¹ Zhang, E.Z.² Shen, X.³

12
- 70350759823
- Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
- A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-11, 2008.
- (2008) SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing , pp. 1-11
- Nukada, A.¹ Ogata, Y.² Endo, T.³ Matsuoka, S.⁴

13
- 79959466764
- Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73-82, 2008.
- (2008) PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , pp. 73-82
- Ryoo, S.¹ Rodrigues, C.I.² Baghsorkhi, S.S.³ Stone, S.S.⁴ Kirk, D.B.⁵ Hwu, W.W.⁶

14
- 43449094719
- Program optimization space pruning for a multithreaded GPU
- S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and W. W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO'08: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 195-204, 2008.
- (2008) CGO'08: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization , pp. 195-204
- Ryoo, S.¹ Rodrigues, C.I.² Stone, S.S.³ Baghsorkhi, S.S.⁴ Ueng, S.⁵ Stratton, J.A.⁶ Hwu, W.W.⁷

15
- 56849102474
- Efficient computation of sum-products on GPUs through software-managed cache
- M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens. Efficient computation of sum-products on GPUs through software-managed cache. In Proceedings of the 22nd ACM International Conference on Supercomputing, pages 309-318, June 2008.
- (2008) Proceedings of the 22nd ACM International Conference on Supercomputing , pp. 309-318
- Silberstein, M.¹ Schuster, A.² Geiger, D.³ Patney, A.⁴ Owens, J.D.⁵

16
- 58449127539
- CUDA-lite: Reducing GPU programming complexity
- S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu. CUDA-lite: Reducing GPU programming complexity. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, 2008.
- Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, 2008
- Ueng, S.-Z.¹ Lathara, M.² Baghsorkhi, S.S.³ Hwu, W.-M.W.⁴

17
- 70350771131
- Benchmarking GPUs to tune dense linear algebra
- V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
- SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008
- Volkov, V.¹ Demmel, J.W.²

18
- 41249094477
- Lattice Boltzmann based PDE solver on the GPU
- Y. Zhao. Lattice Boltzmann based PDE solver on the GPU. The Visual Computer, (5):323-333, 2008.
- (2008) The Visual Computer , Issue.5 , pp. 323-333
- Zhao, Y.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.