SCOPUS Information Search Platform

MICRO 2013 - Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Volume , Issue , 2013, Pages 99-110

Divergence-aware warp scheduling

(3) Rogers, Timothy G. (a); O'Connor, Mike (b); Aamodt, Tor M. (a)

a UNIVERSITY OF BRITISH COLUMBIA (Canada)

b NVIDIA (United States)

Author keywords

caches; divergence; GPU; scheduling

Indexed keywords

CACHES; DIVERGENCE; GPU; HARDWARE THREAD SCHEDULING; ON-LINE CHARACTERIZATION; PROACTIVE SCHEDULING; RUN-TIME INFORMATION; SPARSE MATRIX-VECTOR MULTIPLY;

CACHE MEMORY; COMPUTER ARCHITECTURE; ENERGY EFFICIENCY; FORECASTING; HARDWARE; PROGRAM PROCESSORS; SCHEDULING; WAVEFRONTS;

WEAVING;

EID: 84892547586 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2540708.2540718 Document Type: Conference Paper

Times cited: 127

References (37)

1
- 80053046736
- NVIDIA CUDA C Programming Guide v4.2, 2012.
- (2012) NVIDIA CUDA C Programming Guide

2
- 84892549898
- T. M. Aamodt et al. GPGPU-Sim 3.x Manual. http://gpgpu-sim.org/manual/index.php5/GPGPU-Sim-3.x-Manual, 2012.
- (2012) GPGPU-Sim 3.X Manual
- Aamodt, T.M.¹

3
- 70349169075
- Analyzing CUDA Workloads Using a Detailed GPU Simulator
- A. Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS 2009, pages 163-174.
- (2009) ISPASS , pp. 163-174
- Bakhoda, A.¹

4
- 77954705607
- Tracing Garbage Collection on Highly Parallel Platforms
- K. Barabash and E. Petrank. Tracing Garbage Collection on Highly Parallel Platforms. In ISMM 2010, pages 1-10.
- (2010) ISMM , pp. 1-10
- Barabash, K.¹ Petrank, E.²

5
- 0003473816
- R. Barrett et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, 1994.
- (1994) Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
- Barrett, R.¹

6
- 74049143158
- Implementing sparse matrix-vector multiplication on throughput-oriented processors
- N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC 2009.
- (2009) SC
- Bell, N.¹ Garland, M.²

7
- 70649092154
- Rodinia: A Benchmark Suite for Heterogeneous Computing
- S. Che et al. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC 2009, pages 44-54.
- (2009) IISWC , pp. 44-54
- Che, S.¹

8
- 79951707102
- Memory Latency Reduction via Thread Throttling
- H.-Y. Cheng et al. Memory Latency Reduction via Thread Throttling. In MICRO-43, pages 53-64, 2010.
- (2010) MICRO-43 , pp. 53-64
- Cheng, H.-Y.¹

9
- 77954719557
- The Scalable Heterogeneous Computing (SHOC) benchmark suite
- A. Danalis et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In GPGPU 2010.
- (2010) GPGPU
- Danalis, A.¹

10
- 80052528714
- Dark Silicon and the End of Multicore Scaling
- H. Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. In ISCA 2011, pages 365-376.
- (2011) ISCA , pp. 365-376
- Esmaeilzadeh, H.¹

11
- 79955923056
- Thread Block Compaction for Efficient SIMT Control Flow
- W. Fung and T. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In HPCA 2011, pages 25-36.
- (2011) HPCA , pp. 25-36
- Fung, W.¹ Aamodt, T.²

12
- 47349104432
- Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
- W. W. L. Fung et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO-40.
- MICRO-40
- Fung, W.W.L.¹

13
- 80052533471
- Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors
- M. Gebhart and D. R. Johnson et al. Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors. In ISCA 2011, pages 235-246.
- (2011) ISCA , pp. 235-246
- Gebhart, M.¹ Johnson, D.R.²

14
- 67650635164
- Many-Core vs. Many-Thread Machines: Stay Away From the Valley
- Z. Guz et al. Many-Core vs. Many-Thread Machines: Stay Away From the Valley. Computer Architecture Letters, pages 25-28, Jan. 2009.
- (2009) Computer Architecture Letters , pp. 25-28
- Guz, Z.¹

15
- 84862107632
- Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems
- T. H. Hetherington et al. Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems. In ISPASS 2012, pages 88-98.
- (2012) ISPASS , pp. 88-98
- Hetherington, T.H.¹

16
- 79952811127
- Accelerating CUDA Graph Algorithms at Maximum Warp
- S. Hong et al. Accelerating CUDA Graph Algorithms at Maximum Warp. In PPoPP 2011, pages 267-276.
- (2011) PPoPP , pp. 267-276
- Hong, S.¹

17
- 84858767531
- CRUISE: Cache Replacement and Utility-Aware Scheduling
- A. Jaleel et al. CRUISE: Cache Replacement and Utility-Aware Scheduling. In ASPLOS 2012, pages 249-260.
- (2012) ASPLOS , pp. 249-260
- Jaleel, A.¹

18
- 77954998134
- High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
- A. Jaleel et al. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). In ISCA 2010, pages 60-71.
- (2010) ISCA , pp. 60-71
- Jaleel, A.¹

19
- 84864068497
- Characterizing and Improving the use of Demand-Fetched Caches in GPUs
- W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving the use of Demand-Fetched Caches in GPUs. In ICS 2012, pages 15-24.
- (2012) ICS , pp. 15-24
- Jia, W.¹ Shaw, K.A.² Martonosi, M.³

20
- 84875640178
- OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
- A. Jog et al. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In ASPLOS 2013.
- (2013) ASPLOS
- Jog, A.¹

21
- 84881126240
- Orchestrated Scheduling and Prefetching for GPGPUs
- A. Jog et al. Orchestrated Scheduling and Prefetching for GPGPUs. In ISCA, 2013.
- (2013) ISCA
- Jog, A.¹

22
- 84887477265
- Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs
- O. Kayiran et al. Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs. In PACT 2013.
- (2013) PACT
- Kayiran, O.¹

23
- 84892519366
- Khronos Group. OpenCL. http://www.khronos.org/opencl/.
- OpenCL

24
- 84862910894
- Effect of Instruction Fetch and Memory Scheduling on GPU Performance
- N. B. Lakshminarayana and H. Kim. Effect of Instruction Fetch and Memory Scheduling on GPU Performance. In Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
- (2010) Workshop on Language, Compiler, and Architecture Support for GPGPU
- Lakshminarayana, N.B.¹ Kim, H.²

25
- 79951719035
- Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
- J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. In MICRO-43, pages 213-224, 2010.
- (2010) MICRO-43 , pp. 213-224
- Lee, J.¹ Lakshminarayana, N.B.² Kim, H.³ Vuduc, R.⁴

26
- 84881151222
- GPUWattch: Enabling Energy Optimizations in GPGPUs
- J. Leng et al. GPUWattch: Enabling Energy Optimizations in GPGPUs. In ISCA 2013.
- (2013) ISCA
- Leng, J.¹

27
- 44849137198
- NVIDIA Tesla: A Unified Graphics and Computing Architecture
- E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE, 28(2):39-55, March-April 2008.
- (2008) Micro, IEEE , vol.28 , Issue.2 , pp. 39-55
- Lindholm, E.¹

28
- 84881440334
- How a Single Chip Causes Massive Power Bills GPUSimPow: A GPGPU Power Simulator
- M. Maas et al. How a Single Chip Causes Massive Power Bills GPUSimPow: A GPGPU Power Simulator. In ISPASS 2013.
- (2013) ISPASS
- Maas, M.¹

29
- 77954976292
- Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance
- J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In ISCA 2010, pages 235-246.
- (2010) ISCA , pp. 235-246
- Meng, J.¹ Tarjan, D.² Skadron, K.³

30
- 84863342255
- Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
- V. Narasiman et al. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In MICRO-44, pages 308-317, 2011.
- (2011) MICRO-44 , pp. 308-317
- Narasiman, V.¹

31
- 35348920021
- Adaptive Insertion Policies for High Performance Caching
- M. K. Qureshi et al. Adaptive Insertion Policies for High Performance Caching. In ISCA 2007, pages 381-391.
- (2007) ISCA , pp. 381-391
- Qureshi, M.K.¹

32
- 84892521689
- T. G. Rogers. CCWS Simulation Infrastructure. http://www.ece.ubc.ca/~tgrogers/ccws.html, 2013.
- (2013) CCWS Simulation Infrastructure
- Rogers, T.G.¹

33
- 84876590572
- Cache-Conscious Wavefront Scheduling
- T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-Conscious Wavefront Scheduling. In MICRO-45, 2012.
- (2012) MICRO-45
- Rogers, T.G.¹ O'Connor, M.² Aamodt, T.M.³

34
- 84879544253
- An Experimental Study on Performance Portability of OpenCL Kernels
- S. Rul et al. An Experimental Study on Performance Portability of OpenCL Kernels. In Application Accelerators in High Performance Computing, 2010.
- (2010) Application Accelerators in High Performance Computing
- Rul, S.¹

35
- 0029178210
- Multiscalar Processors
- G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar Processors. In ISCA 1995.
- (1995) ISCA
- Sohi, G.S.¹ Breach, S.E.² Vijaykumar, T.N.³

36
- 32044455748
- Using Page Residency to Balance Tradeoffs in Tracing Garbage Collection
- D. Spoonhower, G. Blelloch, and R. Harper. Using Page Residency to Balance Tradeoffs in Tracing Garbage Collection. In Proc. of Int'l Conf. on Virtual Execution Environments (VEE 2005), pages 57-67.
- Proc. of Int'l Conf. on Virtual Execution Environments (VEE 2005) , pp. 57-67
- Spoonhower, D.¹ Blelloch, G.² Harper, R.³

37
- 0030149507
- CACTI: An Enhanced Cache Access and Cycle Time Model
- S. Wilton and N. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model. Solid-State Circuits, IEEE Journal of, 31(5):677-688, May 1996.
- (1996) Solid-State Circuits, IEEE Journal of , vol.31 , Issue.5 , pp. 677-688
- Wilton, S.¹ Jouppi, N.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.