2014, Pages 93-105

CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications

Author keywords

Compiler; GPGPU; Local memory; Nested parallelism

Indexed keywords

APPLICATION DEVELOPERS; COMPILER; GPGPU; LOCAL MEMORY; NESTED PARALLELISM; NUMBER OF THREADS; PARALLEL PROGRAM; THREAD-LEVEL PARALLELISM

EID: 84896893237     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/2555243.2555254     Document Type: Conference Paper
Times cited: 29

References (44)
  • 2
  • 4
    • 70450059008
    • Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors
    • M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors. In IPDPS 2009.
    • (2009) IPDPS
    • Boyer, M.1    Tarjan, D.2    Acton, S.T.3    Skadron, K.4
  • 7
    • 84856559490
    • Dynamic detection of uniform and affine vectors in GPGPU computations
    • S. Collange, D. Defour, and Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. In ICPP, 2009.
    • (2009) ICPP
    • Collange, S.1    Defour, D.2    Zhang, Y.3
  • 11
    • 32844460093
    • Automatic thread distribution for nested parallelism in OpenMP
    • A. Duran, M. Gonzàlez, and J. Corbalán. Automatic thread distribution for nested parallelism in OpenMP. In ICS, 2005.
    • (2005) ICS
    • Duran, A.1    Gonzàlez, M.2    Corbalán, J.3
  • 14
    • 70450231944
    • An analytical model for GPU architecture with memory-level and thread-level parallelism awareness
    • S. Hong and H. Kim. An analytical model for GPU architecture with memory-level and thread-level parallelism awareness. In Proc. International Symposium on Computer Architecture, 2009.
    • (2009) Proc. International Symposium on Computer Architecture
    • Hong, S.1    Kim, H.2
  • 15
    • 79952811127
    • Accelerating CUDA graph algorithms at maximum warp
    • S. Hong, S.K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA graph algorithms at maximum warp. In PPoPP 2011.
    • (2011) PPoPP
    • Hong, S.1    Kim, S.K.2    Oguntebi, T.3    Olukotun, K.4
  • 16
    • 84896891672
    • http://moss.csc.ncsu.edu/~mueller/cluster/arc/
  • 17
    • 84871147865
    • Exploiting memory access patterns to improve memory performance in data-parallel architectures
    • B. Jang, D. Schaa, P. Mistry and D. Kaeli. Exploiting memory access patterns to improve memory performance in data-parallel architectures. In IEEE TPDS, 2010.
    • (2010) IEEE TPDS
    • Jang, B.1    Schaa, D.2    Mistry, P.3    Kaeli, D.4
  • 18
    • 84887477265
    • Neither more nor less: Optimizing thread-level parallelism for GPGPUs
    • O. Kayiran, A. Jog, M. T. Kandemir, C. R. Das. Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs. In PACT, 2013.
    • (2013) PACT
    • Kayiran, O.1    Jog, A.2    Kandemir, M.T.3    Das, C.R.4
  • 19
    • 79952801699
    • Achieving a single compute device image in openCL for multiple GPUs
    • J. Kim, H. Kim, J. Lee, and J. Lee. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs. In PPoPP, 2011.
    • (2011) PPoPP
    • Kim, J.1    Kim, H.2    Lee, J.3    Lee, J.4
  • 20
    • 26444437628
    • Cetus-an extensible compiler infrastructure for source-to-source transformation
    • S. I. Lee, T. Johnson, and R. Eigenmann. Cetus-an extensible compiler infrastructure for source-to-source transformation. In LCPC, 2003.
    • (2003) LCPC
    • Lee, S.I.1    Johnson, T.2    Eigenmann, R.3
  • 21
    • 67650081010
    • OpenMP to GPGPU: A compiler framework for automatic translation and optimization
    • S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In PPoPP, 2009.
    • (2009) PPoPP
    • Lee, S.1    Min, S.-J.2    Eigenmann, R.3
  • 23
    • 70450103746
    • A cross-input adaptive framework for GPU programs optimization
    • Y. Liu, E. Z. Zhang, and X. Shen. A Cross-Input Adaptive Framework for GPU Programs Optimization. In IPDPS, 2009.
    • (2009) IPDPS
    • Liu, Y.1    Zhang, E.Z.2    Shen, X.3
  • 32
    • 79959466764
    • Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
    • S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP, 2008.
    • (2008) PPoPP
    • Ryoo, S.1    Rodrigues, C.I.2    Baghsorkhi, S.S.3    Stone, S.S.4    Kirk, D.B.5    Hwu, W.W.6
  • 34
    • 84863347222
    • A performance analysis framework for identifying performance benefits in GPGPU applications
    • J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A Performance Analysis Framework for Identifying Performance Benefits in GPGPU Applications. In PPoPP, 2012.
    • (2012) PPoPP
    • Sim, J.1    Dasgupta, A.2    Kim, H.3    Vuduc, R.4
  • 35
    • 84896837430
    • Dynamic thread creation for improving processor utilization on SIMT streaming processor architectures
    • M. Steffen and J. Zambreno. Dynamic Thread Creation for Improving Processor Utilization on SIMT Streaming Processor Architectures. In MICRO, 2010.
    • (2010) MICRO
    • Steffen, M.1    Zambreno, J.2
  • 39
    • 67349149521
    • Benchmarking GPUs to tune dense linear algebra
    • V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. Supercomputing, 2008.
    • (2008) Proc. Supercomputing
    • Volkov, V.1    Demmel, J.W.2
  • 40
    • 84875195366
    • Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced GPU memory accesses
    • B. Wu, Z. Zhao, E. Zhang, Y. Jiang, and X. Shen. Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced GPU Memory Accesses. In PPoPP, 2013.
    • (2013) PPoPP
    • Wu, B.1    Zhao, Z.2    Zhang, E.3    Jiang, Y.4    Shen, X.5
  • 41
    • 77954691442
    • A GPGPU compiler for memory optimization and parallelism management
    • Y. Yang, P. Xiang, J. Kong and H. Zhou. A GPGPU Compiler for Memory Optimization and Parallelism Management. In PLDI, 2010.
    • (2010) PLDI
    • Yang, Y.1    Xiang, P.2    Kong, J.3    Zhou, H.4
  • 42
    • 84867509598
    • Shared memory multiplexing: A novel way to improve GPGPU throughput
    • Y. Yang, P. Xiang, M. Mantor, N. Rubin, and H. Zhou. Shared Memory Multiplexing: A Novel Way to Improve GPGPU Throughput. In PACT, 2012.
    • (2012) PACT
    • Yang, Y.1    Xiang, P.2    Mantor, M.3    Rubin, N.4    Zhou, H.5
  • 43
    • 77749337487
    • Fast tridiagonal solvers on the GPU
    • Y. Zhang, J. Cohen, and J. D. Owens. Fast Tridiagonal Solvers on the GPU. In PPoPP, 2010.
    • (2010) PPoPP
    • Zhang, Y.1    Cohen, J.2    Owens, J.D.3
  • 44
    • 79953126288
    • On-the-fly elimination of dynamic irregularities for GPU computing
    • E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen. On-the-fly elimination of dynamic irregularities for GPU computing. In ASPLOS, 2011.
    • (2011) ASPLOS
    • Zhang, E.Z.1    Jiang, Y.2    Guo, Z.3    Tian, K.4    Shen, X.5


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.