



Volume , Issue , 2011, Pages 308-317

Improving GPU performance via large warps and two-level warp scheduling

Author keywords

divergence; GPGPU; SIMD; warp scheduling

Indexed keywords

COMPUTATIONAL POWER; COMPUTATIONAL RESOURCES; CONDITIONAL BRANCH; DIVERGENCE; GENERAL PURPOSE; GPGPU; GPU PROGRAMMING; GRAPHICS PROCESSING UNITS; MICRO ARCHITECTURES; PARALLEL APPLICATION; POPULAR PLATFORM; SIMD;

EID: 84863342255     PISSN: 10724451     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/2155620.2155656     Document Type: Conference Paper
Times cited: 338

References (27)
  • 1
    • 84882609297
    • Advanced Micro Devices, Inc. ATI Stream Technology. http://www.amd.com/stream.
    • ATI Stream Technology
  • 2
    • 0025431380
    • April: A processor architecture for multiprocessing
    • A. Agarwal et al. April: a processor architecture for multiprocessing. In ISCA-17, 1990.
    • (1990) ISCA-17
    • Agarwal, A.
  • 3
    • 0033895964
    • Speed and power scaling of SRAMs
    • B. Amrutur and M. Horowitz. Speed and power scaling of SRAMs. IEEE JSSC, 35(2):175-185, Feb. 2000.
    • (2000) IEEE JSSC, vol. 35, issue 2, pp. 175-185
    • Amrutur, B.; Horowitz, M.
  • 4
    • 0015330108
    • The Illiac IV system
    • W. J. Bouknight et al. The Illiac IV system. Proceedings of the IEEE, 60(4):369-388, Apr. 1972.
    • (1972) Proceedings of the IEEE, vol. 60, issue 4, pp. 369-388
    • Bouknight, W.J.
  • 5
    • 70649092154
    • Rodinia: A benchmark suite for heterogeneous computing
    • S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009.
    • (2009) IISWC
    • Che, S.
  • 6
    • 79955923056
    • Thread block compaction for efficient SIMT control flow
    • W. W. L. Fung and T. Aamodt. Thread block compaction for efficient SIMT control flow. In HPCA-17, 2011.
    • (2011) HPCA-17
    • Fung, W.W.L.; Aamodt, T.
  • 7
    • 47349104432
    • Dynamic warp formation and scheduling for efficient GPU control flow
    • W. W. L. Fung et al. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO-40, 2007.
    • (2007) MICRO-40
    • Fung, W.W.L.
  • 8
    • 68549096107
    • Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
    • W. W. L. Fung et al. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM TACO, 6(2):1-37, June 2009.
    • (2009) ACM TACO, vol. 6, issue 2, pp. 1-37
    • Fung, W.W.L.
  • 9
    • 65349159175
    • Compute unified device architecture application suitability
    • W.-M. Hwu et al. Compute unified device architecture application suitability. Computing in Science & Engineering, May-June 2009.
    • (2009) Computing in Science & Engineering
    • Hwu, W.-M.
  • 10
    • 2342652812
    • Stream register files with indexed access
    • N. Jayasena et al. Stream register files with indexed access. In HPCA-10, 2004.
    • (2004) HPCA-10
    • Jayasena, N.
  • 11
    • 77954999879
    • Efficient conditional operations for data-parallel architectures
    • U. Kapasi et al. Efficient conditional operations for data-parallel architectures. In MICRO-33, 2000.
    • (2000) MICRO-33
    • Kapasi, U.
  • 12
    • 0036398375
    • VLSI design and verification of the Imagine processor
    • B. Khailany et al. VLSI design and verification of the Imagine processor. In ICCD, 2002.
    • (2002) ICCD
    • Khailany, B.
  • 13
    • 84863372818
    • Khronos Group. OpenCL. http://www.khronos.org/opencl.
    • OpenCL
  • 15
    • 4644337990
    • The vector-thread architecture
    • R. Krashinsky et al. The vector-thread architecture. In ISCA-31, 2004.
    • (2004) ISCA-31
    • Krashinsky, R.
  • 17
    • 77954976292
    • Dynamic warp subdivision for integrated branch and memory divergence tolerance
    • J. Meng et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In ISCA-37, 2010.
    • (2010) ISCA-37
    • Meng, J.
  • 18
    • 47349098275
    • MineBench: A benchmark suite for data mining workloads
    • R. Narayanan et al. MineBench: A benchmark suite for data mining workloads. In IISWC, 2006.
    • (2006) IISWC
    • Narayanan, R.
  • 19
    • 84863390635
    • NVIDIA. CUDA GPU Computing SDK. http://developer.nvidia.com/gpu-computing-sdk.
    • CUDA GPU Computing SDK
  • 22
    • 0017922490
    • The CRAY-1 computer system
    • R. M. Russell. The CRAY-1 computer system. Communications of the ACM, 21(1):63-72, Jan. 1978.
    • (1978) Communications of the ACM, vol. 21, issue 1, pp. 63-72
    • Russell, R.M.
  • 23
    • 79959466764
    • Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
    • S. Ryoo et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP, 2008.
    • (2008) PPoPP
    • Ryoo, S.
  • 24
    • 0018282603
    • A pipelined shared resource MIMD computer
    • B. J. Smith. A pipelined shared resource MIMD computer. In ICPP, 1978.
    • (1978) ICPP
    • Smith, B.J.
  • 25
    • 0033727057
    • Vector instruction set support for conditional operations
    • J. E. Smith et al. Vector instruction set support for conditional operations. In ISCA-27, 2000.
    • (2000) ISCA-27
    • Smith, J.E.
  • 26
    • 84863352139
    • Parallel operation in the Control Data 6600
    • J. E. Thornton. Parallel operation in the Control Data 6600. In AFIPS, 1965.
    • (1965) AFIPS
    • Thornton, J.E.
  • 27
    • 0035696665
    • Handling long-latency loads in a simultaneous multithreading processor
    • D. M. Tullsen and J. A. Brown. Handling long-latency loads in a simultaneous multithreading processor. In MICRO-34, 2001.
    • (2001) MICRO-34
    • Tullsen, D.M.; Brown, J.A.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.