-
3
-
-
70349169075
-
Analyzing CUDA Workloads Using a Detailed GPU Simulator
-
A. Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS 2009, pages 163-174.
-
(2009)
ISPASS
, pp. 163-174
-
-
Bakhoda, A.1
-
4
-
-
77954705607
-
Tracing Garbage Collection on Highly Parallel Platforms
-
K. Barabash and E. Petrank. Tracing Garbage Collection on Highly Parallel Platforms. In ISMM 2010, pages 1-10.
-
(2010)
ISMM
, pp. 1-10
-
-
Barabash, K.1
Petrank, E.2
-
6
-
-
74049143158
-
Implementing sparse matrix-vector multiplication on throughput-oriented processors
-
N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC 2009.
-
(2009)
SC
-
-
Bell, N.1
Garland, M.2
-
7
-
-
70649092154
-
Rodinia: A Benchmark Suite for Heterogeneous Computing
-
S. Che et al. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC 2009, pages 44-54.
-
(2009)
IISWC
, pp. 44-54
-
-
Che, S.1
-
8
-
-
79951707102
-
Memory Latency Reduction via Thread Throttling
-
H.-Y. Cheng et al. Memory Latency Reduction via Thread Throttling. In MICRO-43, pages 53-64, 2010.
-
(2010)
MICRO-43
, pp. 53-64
-
-
Cheng, H.-Y.1
-
9
-
-
77954719557
-
The Scalable Heterogeneous Computing (SHOC) benchmark suite
-
A. Danalis et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In GPGPU 2010.
-
(2010)
GPGPU
-
-
Danalis, A.1
-
10
-
-
80052528714
-
Dark Silicon and the End of Multicore Scaling
-
H. Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. In ISCA 2011, pages 365-376.
-
(2011)
ISCA
, pp. 365-376
-
-
Esmaeilzadeh, H.1
-
11
-
-
79955923056
-
Thread Block Compaction for Efficient SIMT Control Flow
-
W. Fung and T. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In HPCA 2011, pages 25-36.
-
(2011)
HPCA
, pp. 25-36
-
-
Fung, W.1
Aamodt, T.2
-
12
-
-
47349104432
-
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
-
W. W. L. Fung et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO-40.
-
MICRO-40
-
-
Fung, W.W.L.1
-
13
-
-
80052533471
-
Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors
-
M. Gebhart, D. R. Johnson, et al. Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors. In ISCA 2011, pages 235-246.
-
(2011)
ISCA
, pp. 235-246
-
-
Gebhart, M.1
Johnson, D.R.2
-
14
-
-
67650635164
-
Many-Core vs. Many-Thread Machines: Stay Away From the Valley
-
Z. Guz et al. Many-Core vs. Many-Thread Machines: Stay Away From the Valley. Computer Architecture Letters, pages 25-28, Jan. 2009.
-
(2009)
Computer Architecture Letters
, pp. 25-28
-
-
Guz, Z.1
-
15
-
-
84862107632
-
Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems
-
T. H. Hetherington et al. Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems. In ISPASS 2012, pages 88-98.
-
(2012)
ISPASS
, pp. 88-98
-
-
Hetherington, T.H.1
-
16
-
-
79952811127
-
Accelerating CUDA Graph Algorithms at Maximum Warp
-
S. Hong et al. Accelerating CUDA Graph Algorithms at Maximum Warp. In PPoPP 2011, pages 267-276.
-
(2011)
PPoPP
, pp. 267-276
-
-
Hong, S.1
-
17
-
-
84858767531
-
CRUISE: Cache Replacement and Utility-Aware Scheduling
-
A. Jaleel et al. CRUISE: Cache Replacement and Utility-Aware Scheduling. In ASPLOS 2012, pages 249-260.
-
(2012)
ASPLOS
, pp. 249-260
-
-
Jaleel, A.1
-
18
-
-
77954998134
-
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
-
A. Jaleel et al. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). In ISCA 2010, pages 60-71.
-
(2010)
ISCA
, pp. 60-71
-
-
Jaleel, A.1
-
19
-
-
84864068497
-
Characterizing and Improving the use of Demand-Fetched Caches in GPUs
-
W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving the use of Demand-Fetched Caches in GPUs. In ICS 2012, pages 15-24.
-
(2012)
ICS
, pp. 15-24
-
-
Jia, W.1
Shaw, K.A.2
Martonosi, M.3
-
20
-
-
84875640178
-
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
-
A. Jog et al. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In ASPLOS 2013.
-
(2013)
ASPLOS
-
-
Jog, A.1
-
21
-
-
84881126240
-
Orchestrated Scheduling and Prefetching for GPGPUs
-
A. Jog et al. Orchestrated Scheduling and Prefetching for GPGPUs. In ISCA 2013.
-
(2013)
ISCA
-
-
Jog, A.1
-
22
-
-
84887477265
-
Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs
-
O. Kayiran et al. Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs. In PACT 2013.
-
(2013)
PACT
-
-
Kayiran, O.1
-
23
-
-
84892519366
-
-
Khronos Group. OpenCL. http://www.khronos.org/opencl/.
-
OpenCL
-
-
-
25
-
-
79951719035
-
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
-
J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. In MICRO-43, pages 213-224, 2010.
-
(2010)
MICRO-43
, pp. 213-224
-
-
Lee, J.1
Lakshminarayana, N.B.2
Kim, H.3
Vuduc, R.4
-
26
-
-
84881151222
-
GPUWattch: Enabling Energy Optimizations in GPGPUs
-
J. Leng et al. GPUWattch: Enabling Energy Optimizations in GPGPUs. In ISCA 2013.
-
(2013)
ISCA
-
-
Leng, J.1
-
27
-
-
44849137198
-
NVIDIA Tesla: A Unified Graphics and Computing Architecture
-
E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE, 28(2):39-55, March-April 2008.
-
(2008)
Micro, IEEE
, vol.28
, Issue.2
, pp. 39-55
-
-
Lindholm, E.1
-
28
-
-
84881440334
-
How a Single Chip Causes Massive Power Bills GPUSimPow: A GPGPU Power Simulator
-
M. Maas et al. How a Single Chip Causes Massive Power Bills GPUSimPow: A GPGPU Power Simulator. In ISPASS 2013.
-
(2013)
ISPASS
-
-
Maas, M.1
-
29
-
-
77954976292
-
Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance
-
J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In ISCA 2010, pages 235-246.
-
(2010)
ISCA
, pp. 235-246
-
-
Meng, J.1
Tarjan, D.2
Skadron, K.3
-
30
-
-
84863342255
-
Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
-
V. Narasiman et al. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In MICRO-44, pages 308-317, 2011.
-
(2011)
MICRO-44
, pp. 308-317
-
-
Narasiman, V.1
-
31
-
-
35348920021
-
Adaptive Insertion Policies for High Performance Caching
-
M. K. Qureshi et al. Adaptive Insertion Policies for High Performance Caching. In ISCA 2007, pages 381-391.
-
(2007)
ISCA
, pp. 381-391
-
-
Qureshi, M.K.1
-
37
-
-
0030149507
-
CACTI: An Enhanced Cache Access and Cycle Time Model
-
S. Wilton and N. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model. Solid-State Circuits, IEEE Journal of, 31(5):677-688, May 1996.
-
(1996)
Solid-State Circuits, IEEE Journal of
, vol.31
, Issue.5
, pp. 677-688
-
-
Wilton, S.1
Jouppi, N.2