SCOPUS Information Retrieval Platform

International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS

Volume , Issue , 2013, Pages 395-406

OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance

(8) Jog, Adwait (a); Kayiran, Onur (a); Nachiappan, Nachiappan Chidambaram (a); Mishra, Asit K. (c); Kandemir, Mahmut T. (a); Mutlu, Onur (a); Iyer, Ravishankar (b); Das, Chita R. (c)

a PENNSYLVANIA STATE UNIVERSITY (United States)

b CARNEGIE MELLON UNIVERSITY (United States)

c INTEL CORPORATION (United States)

Author keywords

GPGPUs; Latency tolerance; Prefetching; Scheduling

Indexed keywords

GPGPUS; IMPROVING PERFORMANCE; LATENCY TOLERANCE; PERFORMANCE IMPROVEMENTS; PREFETCHING; SCHEDULING DECISIONS; SCHEDULING TECHNIQUES; THREAD LEVEL PARALLELISM;

DYNAMIC RANDOM ACCESS STORAGE; PROGRAM PROCESSORS; SCHEDULING;

WEAVING;

EID: 84875640178; PISSN: None; EISSN: None; Source Type: Conference Proceeding
DOI: 10.1145/2451116.2451158; Document Type: Conference Paper

Times cited: (139)

References (59)

1
- 84875678965
- AMD. Radeon and FirePro Graphics Cards, Nov. 2011.
- (2011) Radeon and FirePro Graphics Cards

2
- 84875648573
- AMD. Heterogeneous Computing: OpenCL and the ATI Radeon HD 5870 (Evergreen) Architecture, Oct. 2012.
- (2012) Heterogeneous Computing: OpenCL and the ATI Radeon HD 5870 (Evergreen) Architecture

3
- 84864843567
- Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
- R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In ISCA, 2012.
- (2012) ISCA
- Ausavarungnirun, R.¹ Chang, K.K.-W.² Subramanian, L.³ Loh, G.H.⁴ Mutlu, O.⁵

4
- 79951702398
- Throughput-effective on-chip networks for manycore accelerators
- A. Bakhoda, J. Kim, and T. Aamodt. Throughput-effective On-chip Networks for Manycore Accelerators. In MICRO, 2010.
- (2010) MICRO
- Bakhoda, A.¹ Kim, J.² Aamodt, T.³

5
- 70349169075
- Analyzing CUDA workloads using a detailed GPU simulator
- A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS, 2009.
- (2009) ISPASS
- Bakhoda, A.¹ Yuan, G.² Fung, W.³ Wong, H.⁴ Aamodt, T.⁵

6
- 57349180412
- A compiler framework for optimization of affine loop nests for GPGPUs
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. In ICS, 2008.
- (2008) ICS
- Baskaran, M.M.¹ Bondhugula, U.² Krishnamoorthy, S.³ Ramanujam, J.⁴ Rountev, A.⁵ Sadayappan, P.⁶

7
- 79959581990
- Automatic C-to-CUDA code generation for affine programs
- M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA Code Generation for Affine Programs. In CC/ETAPS, 2010.
- (2010) CC/ETAPS
- Baskaran, M.M.¹ Ramanujam, J.² Sadayappan, P.³

8
- 83155188972
- CudaDMA: Optimizing GPU memory bandwidth via warp specialization
- M. Bauer, H. Cook, and B. Khailany. CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization. In SC, 2011.
- (2011) SC
- Bauer, M.¹ Cook, H.² Khailany, B.³

9
- 0032761638
- Impulse: Building a smarter memory controller
- J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. In HPCA, 1999.
- (1999) HPCA
- Carter, J.¹ Hsieh, W.² Stoller, L.³ Swanson, M.⁴ Zhang, L.⁵ Brunvand, E.⁶ Davis, A.⁷ Kuo, C.-C.⁸ Kuramkote, R.⁹ Parker, M.¹⁰ Schaelicke, L.¹¹ Tateyama, T.¹²

10
- 70649092154
- Rodinia: A benchmark suite for heterogeneous computing
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC, 2009.
- (2009) IISWC
- Che, S.¹ Boyer, M.² Meng, J.³ Tarjan, D.⁴ Sheaffer, J.⁵ Lee, S.-H.⁶ Skadron, K.⁷

11
- 84861811396
- Modeling cache contention and throughput of multiprogrammed manycore processors
- X. E. Chen and T. Aamodt. Modeling Cache Contention and Throughput of Multiprogrammed Manycore Processors. IEEE Trans. Comput., 2012.
- (2012) IEEE Trans. Comput.
- Chen, X.E.¹ Aamodt, T.²

12
- 84863348772
- Parallel application memory scheduling
- E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt. Parallel Application Memory Scheduling. MICRO, 2011.
- (2011) MICRO
- Ebrahimi, E.¹ Miftakhutdinov, R.² Fallin, C.³ Lee, C.J.⁴ Joao, J.A.⁵ Mutlu, O.⁶ Patt, Y.N.⁷

13
- 64949179220
- Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems
- E. Ebrahimi, O. Mutlu, and Y. N. Patt. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In HPCA, 2009.
- (2009) HPCA
- Ebrahimi, E.¹ Mutlu, O.² Patt, Y.N.³

14
- 47349104432
- Dynamic warp formation and scheduling for efficient GPU control flow
- W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO, 2007.
- (2007) MICRO
- Fung, W.¹ Sham, I.² Yuan, G.³ Aamodt, T.⁴

15
- 79955923056
- Thread block compaction for efficient SIMT control flow
- W. W. L. Fung and T. M. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In HPCA, 2011.
- (2011) HPCA
- Fung, W.W.L.¹ Aamodt, T.M.²

16
- 84858761190
- Hardware transactional memory for GPU architectures
- W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt. Hardware Transactional Memory for GPU Architectures. In MICRO, 2011.
- (2011) MICRO
- Fung, W.W.L.¹ Singh, I.² Brownsword, A.³ Aamodt, T.M.⁴

17
- 80052533471
- Energy-efficient mechanisms for managing thread context in throughput processors
- M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors. In ISCA, 2011.
- (2011) ISCA
- Gebhart, M.¹ Johnson, D.R.² Tarjan, D.³ Keckler, S.W.⁴ Dally, W.J.⁵ Lindholm, E.⁶ Skadron, K.⁷

18
- 84856511841
- Regulating locality vs. parallelism tradeoffs in multiple memory controller environments
- S. Hassan, D. Choudhary, M. Rasquinha, and S. Yalamanchili. Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments. In PACT, 2011.
- (2011) PACT
- Hassan, S.¹ Choudhary, D.² Rasquinha, M.³ Yalamanchili, S.⁴

19
- 63549097654
- Mars: A MapReduce framework on graphics processors
- B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce Framework on Graphics Processors. In PACT, 2008.
- (2008) PACT
- He, B.¹ Fang, W.² Luo, Q.³ Govindaraju, N.K.⁴ Wang, T.⁵

20
- 84860328391
- Balancing DRAM locality and parallelism in shared memory CMP systems
- M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. In HPCA, 2012.
- (2012) HPCA
- Jeong, M.K.¹ Yoon, D.H.² Sunwoo, D.³ Sullivan, M.⁴ Lee, I.⁵ Erez, M.⁶

21
- 84864068497
- Characterizing and improving the use of demand-fetched caches in GPUs
- W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving the Use of Demand-fetched Caches in GPUs. In ICS, 2012.
- (2012) ICS
- Jia, W.¹ Shaw, K.A.² Martonosi, M.³

22
- 84863554441
- Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs
- A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das. Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs. In DAC, 2012.
- (2012) DAC
- Jog, A.¹ Mishra, A.K.² Xu, C.³ Xie, Y.⁴ Narayanan, V.⁵ Iyer, R.⁶ Das, C.R.⁷

23
- 0033075109
- Prefetching using Markov predictors
- D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. IEEE Trans. Comput., 1999.
- (1999) IEEE Trans. Comput.
- Joseph, D.¹ Grunwald, D.²

24
- 84875641823
- Neither more nor less: Optimizing thread-level parallelism for GPGPUs
- O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs. In CSE Penn State Tech Report, TR-CSE-2012-006, 2012.
- (2012) CSE Penn State Tech Report, TR-CSE-2012-006
- Kayiran, O.¹ Jog, A.² Kandemir, M.T.³ Das, C.R.⁴

25
- 80054875176
- GPUs and the future of parallel computing
- S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the Future of Parallel Computing. IEEE Micro, 2011.
- (2011) IEEE Micro
- Keckler, S.¹ Dally, W.² Khailany, B.³ Garland, M.⁴ Glasco, D.⁵

26
- 77952558442
- ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
- Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA, 2010.
- (2010) HPCA
- Kim, Y.¹ Han, D.² Mutlu, O.³ Harchol-Balter, M.⁴

27
- 79951718838
- Thread cluster memory scheduling: Exploiting differences in memory access behavior
- Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO, 2010.
- (2010) MICRO
- Kim, Y.¹ Papamichael, M.² Mutlu, O.³ Harchol-Balter, M.⁴

28
- 77951157944
- D. Kirk and W.-M. W. Hwu. Programming Massively Parallel Processors. 2010.
- (2010) Programming Massively Parallel Processors
- Kirk, D.¹ Hwu, W.-M.W.²

29
- 84875639833
- K. Krewell. AMD's Fusion Finally Arrives. MPR, 2011.
- (2011) AMD's Fusion Finally Arrives
- Krewell, K.¹

30
- 84875680006
- K. Krewell. Ivy Bridge Improves Graphics. MPR, 2011.
- (2011) Ivy Bridge Improves Graphics
- Krewell, K.¹

31
- 84875670207
- K. Krewell. Most Significant Bits. MPR, 2011.
- (2011) Most Significant Bits
- Krewell, K.¹

32
- 84875674432
- K. Krewell. Nvidia Lowers the Heat on Kepler. MPR, 2012.
- (2012) Nvidia Lowers the Heat on Kepler
- Krewell, K.¹

33
- 84870990602
- DRAM scheduling policy for GPGPU architectures based on a potential function
- N. B. Lakshminarayana, J. Lee, H. Kim, and J. Shin. DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function. Computer Architecture Letters, 2012.
- (2012) Computer Architecture Letters
- Lakshminarayana, N.B.¹ Lee, J.² Kim, H.³ Shin, J.⁴

34
- 66749189125
- Prefetch-aware DRAM controllers
- C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt. Prefetch-Aware DRAM Controllers. In MICRO, 2008.
- (2008) MICRO
- Lee, C.J.¹ Mutlu, O.² Narasiman, V.³ Patt, Y.N.⁴

35
- 76749092678
- Improving memory bank-level parallelism in the presence of prefetching
- C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. Improving Memory Bank-Level Parallelism in the Presence of Prefetching. In MICRO, 2009.
- (2009) MICRO
- Lee, C.J.¹ Narasiman, V.² Mutlu, O.³ Patt, Y.N.⁴

36
- 79951719035
- Many-thread aware prefetching mechanisms for GPGPU applications
- J. Lee, N. Lakshminarayana, H. Kim, and R. Vuduc. Many-thread Aware Prefetching Mechanisms for GPGPU Applications. In MICRO, 2010.
- (2010) MICRO
- Lee, J.¹ Lakshminarayana, N.² Kim, H.³ Vuduc, R.⁴

37
- 44849137198
- NVIDIA Tesla: A unified graphics and computing architecture
- E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 2008.
- (2008) IEEE Micro
- Lindholm, E.¹ Nickolls, J.² Oberman, S.³ Montrym, J.⁴

38
- 52649128991
- Memory performance attacks: Denial of memory service in multi-core systems
- T. Moscibroda and O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In USENIX SECURITY, 2007.
- (2007) Usenix Security
- Moscibroda, T.¹ Mutlu, O.²

39
- 70349100958
- A. Munshi. The OpenCL Specification, June 2011.
- (2011) The OpenCL Specification
- Munshi, A.¹

40
- 84858771269
- Reducing memory interference in multicore systems via application-aware memory channel partitioning
- S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In MICRO, 2011.
- (2011) MICRO
- Muralidhara, S.P.¹ Subramanian, L.² Mutlu, O.³ Kandemir, M.⁴ Moscibroda, T.⁵

41
- 52649119398
- Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems
- O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA, 2008.
- (2008) ISCA
- Mutlu, O.¹ Moscibroda, T.²

42
- 47349122373
- Stall-time fair memory access scheduling for chip multiprocessors
- O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO, 2007.
- (2007) MICRO
- Mutlu, O.¹ Moscibroda, T.²

43
- 84867553128
- Application-aware prefetch prioritization in on-chip networks
- N. Chidambaram Nachiappan, A. K. Mishra, M. Kandemir, A. Sivasubramaniam, O. Mutlu, and C. R. Das. Application-aware Prefetch Prioritization in On-chip Networks. In PACT, 2012.
- (2012) PACT
- Nachiappan, N.C.¹ Mishra, A.K.² Kandemir, M.³ Sivasubramaniam, A.⁴ Mutlu, O.⁵ Das, C.R.⁶

44
- 84863342255
- Improving GPU performance via large warps and two-level warp scheduling
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU Performance Via Large Warps and Two-level Warp Scheduling. In MICRO, 2011.
- (2011) MICRO
- Narasiman, V.¹ Shebanow, M.² Lee, C.J.³ Miftakhutdinov, R.⁴ Mutlu, O.⁵ Patt, Y.N.⁶

45
- 2342644731
- Data cache prefetching using a global history buffer
- K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. In HPCA, 2004.
- (2004) HPCA
- Nesbit, K.J.¹ Smith, J.E.²

46
- 82955212653
- NVIDIA. CUDA C Programming Guide, Oct. 2010.
- (2010) CUDA C Programming Guide

47
- 84864861336
- NVIDIA. CUDA C/C++ SDK code samples, 2011.
- (2011) CUDA C/C++ SDK Code Samples

48
- 84875636098
- NVIDIA. Fermi: NVIDIA's Next Generation CUDA Compute Architecture, Nov. 2011.
- (2011) Fermi: NVIDIA's Next Generation CUDA Compute Architecture

49
- 84864855982
- CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
- M. Rhu and M. Erez. CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures. In ISCA, 2012.
- (2012) ISCA
- Rhu, M.¹ Erez, M.²

50
- 0033691565
- Memory access scheduling
- S. Rixner, W. J. Dally, U. J. Kapasi, P. R. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA, 2000.
- (2000) ISCA
- Rixner, S.¹ Dally, W.J.² Kapasi, U.J.³ Mattson, P.R.⁴ Owens, J.D.⁵

51
- 84876590572
- Cache-conscious wavefront scheduling
- T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-conscious Wavefront Scheduling. In MICRO, 2012.
- (2012) MICRO
- Rogers, T.G.¹ O'Connor, M.² Aamodt, T.M.³

52
- 34547655822
- Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers
- S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA, 2007.
- (2007) HPCA
- Srinath, S.¹ Mutlu, O.² Kim, H.³ Patt, Y.N.⁴

53
- 84873470137
- J. A. Stratton et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. 2012.
- (2012) Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing
- Stratton, J.A.¹

54
- 78149251414
- Data layout transformation exploiting memory-level parallelism in structured grid many-core applications
- I. J. Sung, J. A. Stratton, and W.-M. W. Hwu. Data Layout Transformation Exploiting Memory-level Parallelism in Structured Grid Many-core Applications. In PACT, 2010.
- (2010) PACT
- Sung, I.J.¹ Stratton, J.A.² Hwu, W.-M.W.³

55
- 21644451858
- The effectiveness of multiple hardware contexts
- R. Thekkath and S. J. Eggers. The Effectiveness of Multiple Hardware Contexts. In ASPLOS, 1994.
- (1994) ASPLOS
- Thekkath, R.¹ Eggers, S.J.²

56
- 77952579552
- Demystifying GPU microarchitecture through microbenchmarking
- H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU Microarchitecture Through Microbenchmarking. In ISPASS, 2010.
- (2010) ISPASS
- Wong, H.¹ Papadopoulou, M.-M.² Sadooghi-Alvandi, M.³ Moshovos, A.⁴

57
- 84872056636
- Row buffer locality aware caching policies for hybrid memories
- H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, and O. Mutlu. Row Buffer Locality Aware Caching Policies for Hybrid Memories. In ICCD, 2012.
- (2012) ICCD
- Yoon, H.¹ Meza, J.² Ausavarungnirun, R.³ Harding, R.⁴ Mutlu, O.⁵

58
- 76749123978
- Complexity effective memory access scheduling for many-core accelerator architectures
- G. Yuan, A. Bakhoda, and T. Aamodt. Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures. In MICRO, 2009.
- (2009) MICRO
- Yuan, G.¹ Bakhoda, A.² Aamodt, T.³

59
- 52649113530
- W. K. Zuravleff and T. Robinson. Controller for a Synchronous DRAM that Maximizes Throughput by Allowing Memory Requests and Commands to Be Issued Out of Order. U.S. Patent Number 5,630,096, 1997.
- (1997) Controller for a Synchronous DRAM that Maximizes Throughput by Allowing Memory Requests and Commands to Be Issued Out of Order
- Zuravleff, W.K.¹ Robinson, T.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.