-
3
-
-
84864843567
-
Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
-
R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In ISCA, 2012.
-
(2012)
ISCA
-
-
Ausavarungnirun, R.1
Chang, K.K.-W.2
Subramanian, L.3
Loh, G.H.4
Mutlu, O.5
-
4
-
-
79951702398
-
Throughput-effective on-chip networks for manycore accelerators
-
A. Bakhoda, J. Kim, and T. Aamodt. Throughput-effective On-chip Networks for Manycore Accelerators. In MICRO, 2010.
-
(2010)
MICRO
-
-
Bakhoda, A.1
Kim, J.2
Aamodt, T.3
-
5
-
-
70349169075
-
Analyzing CUDA workloads using a detailed GPU simulator
-
A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS, 2009.
-
(2009)
ISPASS
-
-
Bakhoda, A.1
Yuan, G.2
Fung, W.3
Wong, H.4
Aamodt, T.5
-
6
-
-
57349180412
-
A compiler framework for optimization of affine loop nests for GPGPUs
-
M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. In ICS, 2008.
-
(2008)
ICS
-
-
Baskaran, M.M.1
Bondhugula, U.2
Krishnamoorthy, S.3
Ramanujam, J.4
Rountev, A.5
Sadayappan, P.6
-
8
-
-
83155188972
-
CudaDMA: Optimizing GPU memory bandwidth via warp specialization
-
M. Bauer, H. Cook, and B. Khailany. CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization. In SC, 2011.
-
(2011)
SC
-
-
Bauer, M.1
Cook, H.2
Khailany, B.3
-
9
-
-
0032761638
-
Impulse: Building a smarter memory controller
-
J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. In HPCA, 1999.
-
(1999)
HPCA
-
-
Carter, J.1
Hsieh, W.2
Stoller, L.3
Swanson, M.4
Zhang, L.5
Brunvand, E.6
Davis, A.7
Kuo, C.-C.8
Kuramkote, R.9
Parker, M.10
Schaelicke, L.11
Tateyama, T.12
-
10
-
-
70649092154
-
Rodinia: A benchmark suite for heterogeneous computing
-
S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC, 2009.
-
(2009)
IISWC
-
-
Che, S.1
Boyer, M.2
Meng, J.3
Tarjan, D.4
Sheaffer, J.5
Lee, S.-H.6
Skadron, K.7
-
11
-
-
84861811396
-
Modeling cache contention and throughput of multiprogrammed manycore processors
-
X. E. Chen and T. Aamodt. Modeling Cache Contention and Throughput of Multiprogrammed Manycore Processors. IEEE Trans. Comput., 2012.
-
(2012)
IEEE Trans. Comput.
-
-
Chen, X.E.1
Aamodt, T.2
-
12
-
-
84863348772
-
Parallel application memory scheduling
-
E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt. Parallel Application Memory Scheduling. In MICRO, 2011.
-
(2011)
MICRO
-
-
Ebrahimi, E.1
Miftakhutdinov, R.2
Fallin, C.3
Lee, C.J.4
Joao, J.A.5
Mutlu, O.6
Patt, Y.N.7
-
13
-
-
64949179220
-
Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems
-
E. Ebrahimi, O. Mutlu, and Y. N. Patt. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In HPCA, 2009.
-
(2009)
HPCA
-
-
Ebrahimi, E.1
Mutlu, O.2
Patt, Y.N.3
-
14
-
-
47349104432
-
Dynamic warp formation and scheduling for efficient GPU control flow
-
W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO, 2007.
-
(2007)
MICRO
-
-
Fung, W.1
Sham, I.2
Yuan, G.3
Aamodt, T.4
-
15
-
-
79955923056
-
Thread block compaction for efficient SIMT control flow
-
W. W. L. Fung and T. M. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In HPCA, 2011.
-
(2011)
HPCA
-
-
Fung, W.W.L.1
Aamodt, T.M.2
-
17
-
-
80052533471
-
Energy-efficient mechanisms for managing thread context in throughput processors
-
M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors. In ISCA, 2011.
-
(2011)
ISCA
-
-
Gebhart, M.1
Johnson, D.R.2
Tarjan, D.3
Keckler, S.W.4
Dally, W.J.5
Lindholm, E.6
Skadron, K.7
-
18
-
-
84856511841
-
Regulating locality vs. parallelism tradeoffs in multiple memory controller environments
-
S. Hassan, D. Choudhary, M. Rasquinha, and S. Yalamanchili. Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments. In PACT, 2011.
-
(2011)
PACT
-
-
Hassan, S.1
Choudhary, D.2
Rasquinha, M.3
Yalamanchili, S.4
-
19
-
-
63549097654
-
Mars: A MapReduce framework on graphics processors
-
B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce Framework on Graphics Processors. In PACT, 2008.
-
(2008)
PACT
-
-
He, B.1
Fang, W.2
Luo, Q.3
Govindaraju, N.K.4
Wang, T.5
-
20
-
-
84860328391
-
Balancing DRAM locality and parallelism in shared memory CMP systems
-
M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. In HPCA, 2012.
-
(2012)
HPCA
-
-
Jeong, M.K.1
Yoon, D.H.2
Sunwoo, D.3
Sullivan, M.4
Lee, I.5
Erez, M.6
-
21
-
-
84864068497
-
Characterizing and improving the use of demand-fetched caches in GPUs
-
W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving the Use of Demand-fetched Caches in GPUs. In ICS, 2012.
-
(2012)
ICS
-
-
Jia, W.1
Shaw, K.A.2
Martonosi, M.3
-
22
-
-
84863554441
-
Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs
-
A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das. Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs. In DAC, 2012.
-
(2012)
DAC
-
-
Jog, A.1
Mishra, A.K.2
Xu, C.3
Xie, Y.4
Narayanan, V.5
Iyer, R.6
Das, C.R.7
-
24
-
-
84875641823
-
Neither more nor less: Optimizing thread-level parallelism for GPGPUs
-
O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs. In CSE Penn State Tech Report, TR-CSE-2012-006, 2012.
-
(2012)
CSE Penn State Tech Report, TR-CSE-2012-006
-
-
Kayiran, O.1
Jog, A.2
Kandemir, M.T.3
Das, C.R.4
-
25
-
-
80054875176
-
GPUs and the future of parallel computing
-
S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the Future of Parallel Computing. IEEE Micro, 2011.
-
(2011)
IEEE Micro
-
-
Keckler, S.1
Dally, W.2
Khailany, B.3
Garland, M.4
Glasco, D.5
-
26
-
-
77952558442
-
ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
-
Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA, 2010.
-
(2010)
HPCA
-
-
Kim, Y.1
Han, D.2
Mutlu, O.3
Harchol-Balter, M.4
-
27
-
-
79951718838
-
Thread cluster memory scheduling: Exploiting differences in memory access behavior
-
Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO, 2010.
-
(2010)
MICRO
-
-
Kim, Y.1
Papamichael, M.2
Mutlu, O.3
Harchol-Balter, M.4
-
35
-
-
76749092678
-
Improving memory bank-level parallelism in the presence of prefetching
-
C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. Improving Memory Bank-Level Parallelism in the Presence of Prefetching. In MICRO, 2009.
-
(2009)
MICRO
-
-
Lee, C.J.1
Narasiman, V.2
Mutlu, O.3
Patt, Y.N.4
-
36
-
-
79951719035
-
Many-thread aware prefetching mechanisms for GPGPU applications
-
J. Lee, N. Lakshminarayana, H. Kim, and R. Vuduc. Many-thread Aware Prefetching Mechanisms for GPGPU Applications. In MICRO, 2010.
-
(2010)
MICRO
-
-
Lee, J.1
Lakshminarayana, N.2
Kim, H.3
Vuduc, R.4
-
38
-
-
52649128991
-
Memory performance attacks: Denial of memory service in multi-core systems
-
T. Moscibroda and O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In USENIX Security, 2007.
-
(2007)
Usenix Security
-
-
Moscibroda, T.1
Mutlu, O.2
-
40
-
-
84858771269
-
Reducing memory interference in multicore systems via application-aware memory channel partitioning
-
S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In MICRO, 2011.
-
(2011)
MICRO
-
-
Muralidhara, S.P.1
Subramanian, L.2
Mutlu, O.3
Kandemir, M.4
Moscibroda, T.5
-
41
-
-
52649119398
-
Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems
-
O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA, 2008.
-
(2008)
ISCA
-
-
Mutlu, O.1
Moscibroda, T.2
-
42
-
-
47349122373
-
Stall-time fair memory access scheduling for chip multiprocessors
-
O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO, 2007.
-
(2007)
MICRO
-
-
Mutlu, O.1
Moscibroda, T.2
-
43
-
-
84867553128
-
Application-aware prefetch prioritization in on-chip networks
-
N. Chidambaram Nachiappan, A. K. Mishra, M. Kandemir, A. Sivasubramaniam, O. Mutlu, and C. R. Das. Application-aware Prefetch Prioritization in On-chip Networks. In PACT, 2012.
-
(2012)
PACT
-
-
Nachiappan, N.C.1
Mishra, A.K.2
Kandemir, M.3
Sivasubramaniam, A.4
Mutlu, O.5
Das, C.R.6
-
44
-
-
84863342255
-
Improving GPU performance via large warps and two-level warp scheduling
-
V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU Performance Via Large Warps and Two-level Warp Scheduling. In MICRO, 2011.
-
(2011)
MICRO
-
-
Narasiman, V.1
Shebanow, M.2
Lee, C.J.3
Miftakhutdinov, R.4
Mutlu, O.5
Patt, Y.N.6
-
45
-
-
2342644731
-
Data cache prefetching using a global history buffer
-
K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. In HPCA, 2004.
-
(2004)
HPCA
-
-
Nesbit, K.J.1
Smith, J.E.2
-
49
-
-
84864855982
-
CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
-
M. Rhu and M. Erez. CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures. In ISCA, 2012.
-
(2012)
ISCA
-
-
Rhu, M.1
Erez, M.2
-
50
-
-
0033691565
-
Memory access scheduling
-
S. Rixner, W. J. Dally, U. J. Kapasi, P. R. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA, 2000.
-
(2000)
ISCA
-
-
Rixner, S.1
Dally, W.J.2
Kapasi, U.J.3
Mattson, P.R.4
Owens, J.D.5
-
52
-
-
34547655822
-
Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers
-
S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA, 2007.
-
(2007)
HPCA
-
-
Srinath, S.1
Mutlu, O.2
Kim, H.3
Patt, Y.N.4
-
54
-
-
78149251414
-
Data layout transformation exploiting memory-level parallelism in structured grid many-core applications
-
I. J. Sung, J. A. Stratton, and W.-M. W. Hwu. Data Layout Transformation Exploiting Memory-level Parallelism in Structured Grid Many-core Applications. In PACT, 2010.
-
(2010)
PACT
-
-
Sung, I.J.1
Stratton, J.A.2
Hwu, W.-M.W.3
-
55
-
-
21644451858
-
The effectiveness of multiple hardware contexts
-
R. Thekkath and S. J. Eggers. The Effectiveness of Multiple Hardware Contexts. In ASPLOS, 1994.
-
(1994)
ASPLOS
-
-
Thekkath, R.1
Eggers, S.J.2
-
57
-
-
84872056636
-
Row buffer locality aware caching policies for hybrid memories
-
H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, and O. Mutlu. Row Buffer Locality Aware Caching Policies for Hybrid Memories. In ICCD, 2012.
-
(2012)
ICCD
-
-
Yoon, H.1
Meza, J.2
Ausavarungnirun, R.3
Harding, R.4
Mutlu, O.5
-
58
-
-
76749123978
-
Complexity effective memory access scheduling for many-core accelerator architectures
-
G. Yuan, A. Bakhoda, and T. Aamodt. Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures. In MICRO, 2009.
-
(2009)
MICRO
-
-
Yuan, G.1
Bakhoda, A.2
Aamodt, T.3