SCOPUS Information Retrieval Platform

Proceedings of the International Conference on Supercomputing

2011, Pages 2-11

An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

(4) Huo, Xin ᵃ; Ravi, Vignesh ᵃ; Ma, Wenjing ᵃ; Agrawal, Gagan ᵃ

ᵃ Ohio State University (United States)

Author keywords

cuda; gpu; irregular reduction; partitioning

Indexed keywords

CACHE PERFORMANCE; CUDA; DATA ACCESS; DISTRIBUTED MEMORY MACHINES; DISTRIBUTED SHARED MEMORY; ENGINEERING CODES; EXECUTION STRATEGIES; GPU; HIGH PERFORMANCE COMPUTING; IRREGULAR REDUCTIONS; NUMBER OF THREADS; PARALLELIZATIONS; PARALLELIZING; PARTITIONING; PARTITIONING METHODS; RUNTIME MODULES; RUNTIME SUPPORT; RUNTIMES; SHARED MEMORIES; SHARED MEMORY MACHINES; SYSTEMATIC STUDY; UNIPROCESSORS; UNSTRUCTURED GRID;

CACHE MEMORY; COMPUTER SOFTWARE SELECTION AND EVALUATION; INTELLIGENT CONTROL; OPTIMIZATION;

PROGRAM PROCESSORS;

EID: 79959575872 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/1995896.1995900 Document Type: Conference Paper

Times cited: (12)

References (40)

1
- 0031139728
- Interprocedural data flow based optimizations for distributed memory compilation
- May
- G. Agrawal and J. Saltz. Interprocedural data flow based optimizations for distributed memory compilation. Software Practice and Experience, 27(5):519-546, May 1997.
- (1997) Software Practice and Experience , vol.27 , Issue.5 , pp. 519-546
- Agrawal, G.¹ Saltz, J.²

2
- 77954076573
- CUDA-lite: Reducing GPU Programming Complexity
- S. Baghsorkhi, M. Lathara, and W.-m. Hwu. CUDA-lite: Reducing GPU Programming Complexity. In LCPC 2008, 2008.
- (2008) LCPC 2008
- Baghsorkhi, S.¹ Lathara, M.² Hwu, W.-M.³

3
- 63549135938
- Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories
- NY, USA, ACM
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In PPoPP, pages 1-10, NY, USA, 2008. ACM.
- (2008) PPoPP , pp. 1-10
- Baskaran, M.M.¹ Bondhugula, U.² Krishnamoorthy, S.³ Ramanujam, J.⁴ Rountev, A.⁵ Sadayappan, P.⁶

4
- 0030382364
- Parallel programming with Polaris
- Dec.
- W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, 29(12):78-82, Dec. 1996.
- (1996) IEEE Computer , vol.29 , Issue.12 , pp. 78-82
- Blume, W.¹ Doallo, R.² Eigenmann, R.³ Grout, J.⁴ Hoeflinger, J.⁵ Lawrence, T.⁶ Lee, J.⁷ Padua, D.⁸ Paek, Y.⁹ Pottenger, B.¹⁰ Rauchwerger, L.¹¹ Tu, P.¹²

5
- 77749340082
- Model-driven Autotuning of Sparse Matrix-vector Multiply on GPUs
- Feb.
- J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven Autotuning of Sparse Matrix-vector Multiply on GPUs. In PPoPP, Feb. 2010.
- (2010) PPoPP
- Choi, J.W.¹ Singh, A.² Vuduc, R.W.³

6
- 77958483977
- Running unstructured grid CFD solvers on modern graphics hardware
- number AIAA 2009-4001, June
- A. Corrigan, F. Camelli, R. Löhner, and J. Wallin. Running unstructured grid CFD solvers on modern graphics hardware. In 19th AIAA Computational Fluid Dynamics Conference, number AIAA 2009-4001, June 2009.
- (2009) 19th AIAA Computational Fluid Dynamics Conference
- Corrigan, A.¹ Camelli, F.² Löhner, R.³ Wallin, J.⁴

7
- 0029430697
- Index array flattening through program transformation
- IEEE Computer Society Press, Dec.
- R. Das, P. Havlak, J. Saltz, and K. Kennedy. Index array flattening through program transformation. In SC95. IEEE Computer Society Press, Dec. 1995.
- (1995) SC95
- Das, R.¹ Havlak, P.² Saltz, J.³ Kennedy, K.⁴

8
- 0028386843
- The design and implementation of a parallel unstructured Euler solver using software primitives
- Mar.
- R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy. The design and implementation of a parallel unstructured Euler solver using software primitives. AIAA Journal, 32(3):489-496, Mar. 1994.
- (1994) AIAA Journal , vol.32 , Issue.3 , pp. 489-496
- Das, R.¹ Mavriplis, D.J.² Saltz, J.³ Gupta, S.⁴ Ponnusamy, R.⁵

9
- 79954630742
- Improving cache performance of dynamic applications with computation and data layout transformations
- May
- C. Ding and K. Kennedy. Improving cache performance of dynamic applications with computation and data layout transformations. In PLDI99, May 1999.
- (1999) PLDI99
- Ding, C.¹ Kennedy, K.²

10
- 64649105762
- Accelerating molecular dynamic simulation on graphics processing units
- M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande. Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry, 30(6):864-872, 2009.
- (2009) Journal of Computational Chemistry , vol.30 , pp. 864-872
- Friedrichs, M.S.¹ Eastman, P.² Vaidyanathan, V.³ Houston, M.⁴ Legrand, S.⁵ Beberg, A.L.⁶ Ensign, D.L.⁷ Bruns, C.M.⁸ Pande, V.S.⁹

11
- 51549093017
- Sparse matrix computations on manycore GPUs
- M. Garland. Sparse matrix computations on manycore GPUs. In DAC, 2008.
- (2008) DAC
- Garland, M.¹

12
- 0033707876
- A compiler method for the parallel execution of irregular reductions in scalable shared memory multiprocessors
- ACM Press, May
- E. Gutierrez, O. Plata, and E. L. Zapata. A compiler method for the parallel execution of irregular reductions in scalable shared memory multiprocessors. In ICS00, pages 78-87. ACM Press, May 2000.
- (2000) ICS00 , pp. 78-87
- Gutierrez, E.¹ Plata, O.² Zapata, E.L.³

13
- 0030380793
- Maximizing multiprocessor performance with the SUIF compiler
- Dec.
- M. Hall, S. Amarasinghe, B. Murphy, S. Liao, and M. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, (12), Dec. 1996.
- (1996) IEEE Computer , Issue.12
- Hall, M.¹ Amarasinghe, S.² Murphy, B.³ Liao, S.⁴ Lam, M.⁵

14
- 79959622062
- Improving compiler and runtime support for irregular reductions
- Aug.
- H. Han and C.-W. Tseng. Improving compiler and runtime support for irregular reductions. In LCPC98, Aug. 1998.
- (1998) LCPC98
- Han, H.¹ Tseng, C.-W.²

15
- 0003630067
- A comparison of locality transformations for irregular codes
- May
- H. Han and C.-W. Tseng. A comparison of locality transformations for irregular codes. In Proceedings of Fifth Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers, pages 31-36, May 2000.
- (2000) Proceedings of Fifth Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers , pp. 31-36
- Han, H.¹ Tseng, C.-W.²

16
- 0342622933
- Handling irregular problems with Fortran D - A preliminary report
- Also available as CRPC Technical Report CRPC-TR93339-S
- R. v. Hanxleden. Handling irregular problems with Fortran D - a preliminary report. In CPC, Delft, The Netherlands, Dec. 1993. Also available as CRPC Technical Report CRPC-TR93339-S.
- CPC, Delft, the Netherlands, Dec. 1993
- Hanxleden, R.V.¹

17
- 0029322399
- Parallelizing molecular dynamics programs for distributed memory machines
- Summer Also available as University of Maryland Technical Report CS-TR-3374 and UMIACS-TR-94-125
- Y.-S. Hwang, R. Das, J. H. Saltz, M. Hodoscek, and B. R. Brooks. Parallelizing molecular dynamics programs for distributed memory machines. IEEE Computational Science & Engineering, 2(2):18-29, Summer 1995. Also available as University of Maryland Technical Report CS-TR-3374 and UMIACS-TR-94-125.
- (1995) IEEE Computational Science & Engineering , vol.2 , Issue.2 , pp. 18-29
- Hwang, Y.-S.¹ Das, R.² Saltz, J.H.³ Hodoscek, M.⁴ Brooks, B.R.⁵

18
- 0029375750
- Partitioning unstructured computational graphs for nonuniform and adaptive environments
- Fall
- M. Kaddoura, C.-W. Ou, and S. Ranka. Partitioning unstructured computational graphs for nonuniform and adaptive environments. IEEE Parallel & Distributed Technology, 3(3):63-69, Fall 1995.
- (1995) IEEE Parallel & Distributed Technology , vol.3 , Issue.3 , pp. 63-69
- Kaddoura, M.¹ Ou, C.-W.² Ranka, S.³

19
- 84990479742
- An efficient heuristic procedure for partitioning graphs
- Feb.
- B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291-307, Feb. 1970.
- (1970) Bell System Technical Journal , vol.49 , Issue.2 , pp. 291-307
- Kernighan, B.¹ Lin, S.²

20
- 79959588500
- Using Compiler-Based Autotuning to Generate High-Performance GPU Libraries
- M. Khan, G. Rudy, C. Chen, M. Hall, and J. Chame. Using Compiler-Based Autotuning to Generate High-Performance GPU Libraries. SC 2010 Poster Session, 2010.
- SC 2010 Poster Session, 2010
- Khan, M.¹ Rudy, G.² Chen, C.³ Hall, M.⁴ Chame, J.⁵

21
- 0026231040
- Compiling global name-space parallel loops for distributed execution
- DOI 10.1109/71.97901
- C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. TPDS, 2(4):440-451, Oct. 1991. (Pubitemid 23624758)
- (1991) IEEE Transactions on Parallel and Distributed Systems , vol.2 , Issue.4 , pp. 440-451
- Koelbel, C.¹ Mehrotra, P.²

22
- 0029229672
- Exploiting spatial regularity in irregular iterative applications
- IEEE Computer Society Press, Apr.
- A. Lain and P. Banerjee. Exploiting spatial regularity in irregular iterative applications. In IPPS95, pages 820-826. IEEE Computer Society Press, Apr. 1995.
- (1995) IPPS95 , pp. 820-826
- Lain, A.¹ Banerjee, P.²

23
- 78650802947
- OpenMPC: Extended OpenMP Programming and Tuning for GPUs
- S. Lee and R. Eigenmann. OpenMPC: Extended OpenMP Programming and Tuning for GPUs. In SC, Nov 2010.
- SC, Nov 2010
- Lee, S.¹ Eigenmann, R.²

24
- 67650081010
- OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization
- S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. In PPoPP'09, 2009.
- (2009) PPoPP'09
- Lee, S.¹ Min, S.-J.² Eigenmann, R.³

25
- 0005006119
- On the automatic parallelization of sparse and irregular Fortran programs
- Y. Lin and D. Padua. On the automatic parallelization of sparse and irregular Fortran programs. In Proceedings of the Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers (LCR - 98), May 1998.
- Proceedings of the Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers (LCR - 98), May 1998
- Lin, Y.¹ Padua, D.²

26
- 38349105400
- Molecular dynamics simulations on commodity GPUs with CUDA
- W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig. Molecular dynamics simulations on commodity GPUs with CUDA. In HiPC, pages 185-196, 2007.
- (2007) HiPC , pp. 185-196
- Liu, W.¹ Schmidt, B.² Voss, G.³ Müller-Wittig, W.⁴

27
- 70449707774
- A Translation System for Enabling Data Mining Applications on GPUs
- June
- W. Ma and G. Agrawal. A Translation System for Enabling Data Mining Applications on GPUs. In ICS, June 2009.
- (2009) ICS
- Ma, W.¹ Agrawal, G.²

28
- 79952788812
- An Integer Programming Framework for Optimizing Shared Memory Use on GPUs
- Dec.
- W. Ma and G. Agrawal. An Integer Programming Framework for Optimizing Shared Memory Use on GPUs. In HiPC, Dec. 2010.
- (2010) HiPC
- Ma, W.¹ Agrawal, G.²

29
- 0032684978
- Improving memory hierarchy performance of irregular applications
- June
- J. Mellor-Crummey, D. Whalley, and K. Kennedy. Improving memory hierarchy performance of irregular applications. In ICS, June 1999.
- (1999) ICS
- Mellor-Crummey, J.¹ Whalley, D.² Kennedy, K.³

30
- 0033362479
- Localizing non-affine array references
- Oct.
- N. Mitchell, L. Carter, and J. Ferrante. Localizing non-affine array references. In PACT, Oct. 1999.
- (1999) PACT
- Mitchell, N.¹ Carter, L.² Ferrante, J.³

31
- 79952796871
- Jul
- M. Moazeni, A. Bui, and M. Sarrafzadeh. A Memory Optimization Technique for Software-Managed Scratchpad Memory in GPUs. http://www.sasp-conference.org/index.html, Jul 2009.
- (2009) A Memory Optimization Technique for Software-Managed Scratchpad Memory in GPUs
- Moazeni, M.¹ Bui, A.² Sarrafzadeh, M.³

32
- 0029192463
- Efficient support for irregular applications on distributed-memory machines
- ACM Press, July
- S. Mukherjee, S. Sharma, M. Hill, J. Larus, A. Rogers, and J. Saltz. Efficient support for irregular applications on distributed-memory machines. In PPOPP, pages 68-79. ACM Press, July 1995.
- (1995) PPOPP , pp. 68-79
- Mukherjee, S.¹ Sharma, S.² Hill, M.³ Larus, J.⁴ Rogers, A.⁵ Saltz, J.⁶

33
- 79959627160
- ACM SIGPLAN Notices, Vol. 30, No. 8.
- ACM SIGPLAN Notices , vol.30 , Issue.8

34
- 0029356841
- Runtime support and compilation methods for user-specified irregular data distributions
- Aug.
- R. Ponnusamy, J. Saltz, A. Choudhary, Y.-S. Hwang, and G. Fox. Runtime support and compilation methods for user-specified irregular data distributions. TPDS, 6(8):815-831, Aug. 1995.
- (1995) TPDS , vol.6 , Issue.8 , pp. 815-831
- Ponnusamy, R.¹ Saltz, J.² Choudhary, A.³ Hwang, Y.-S.⁴ Fox, G.⁵

35
- 0036505103
- Parallel static and dynamic multi-constraint graph partitioning
- DOI 10.1002/cpe.605
- K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219-240, 2002. (Pubitemid 34460007)
- (2002) Concurrency Computation Practice and Experience , vol.14 , Issue.3 , pp. 219-240
- Schloegel, K.¹ Karypis, G.² Kumar, V.³

36
- 70450029523
- A framework for efficient and scalable execution of domain-specific templates on GPUs
- N. Sundaram, A. Raghunathan, and S. Chakradhar. A framework for efficient and scalable execution of domain-specific templates on GPUs. In IPDPS, 2009.
- (2009) IPDPS
- Sundaram, N.¹ Raghunathan, A.² Chakradhar, S.³

37
- 84963626200
- Accelerating molecular dynamics simulations with GPUs
- J. P. Walters, V. Balu, V. Chaudhary, D. Kofke, and A. Schultz. Accelerating molecular dynamics simulations with GPUs. In ISCA PDCCS, pages 44-49, 2008.
- (2008) ISCA PDCCS , pp. 44-49
- Walters, J.P.¹ Balu, V.² Chaudhary, V.³ Kofke, D.⁴ Schultz, A.⁵

38
- 0029322543
- Distributed memory compiler design for sparse problems
- June
- J. Wu, R. Das, J. Saltz, H. Berryman, and S. Hiranandani. Distributed memory compiler design for sparse problems. IEEE Transactions on Computers, 44(6):737-753, June 1995.
- (1995) IEEE Transactions on Computers , vol.44 , Issue.6 , pp. 737-753
- Wu, J.¹ Das, R.² Saltz, J.³ Berryman, H.⁴ Hiranandani, S.⁵

39
- 77954691442
- A GPGPU compiler for memory optimization and parallelism management
- Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In PLDI, 2010.
- (2010) PLDI
- Yang, Y.¹ Xiang, P.² Kong, J.³ Zhou, H.⁴

40
- 0033703286
- Adaptive reduction parallelization techniques
- ACM Press, May
- H. Yu and L. Rauchwerger. Adaptive reduction parallelization techniques. In ICS00, pages 66-75. ACM Press, May 2000.
- (2000) ICS00 , pp. 66-75
- Yu, H.¹ Rauchwerger, L.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.