Volume , Issue , 2013, Pages 395-406

OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance

Author keywords

GPGPUs; Latency tolerance; Prefetching; Scheduling

Indexed keywords

GPGPUS; IMPROVING PERFORMANCE; LATENCY TOLERANCE; PERFORMANCE IMPROVEMENTS; PREFETCHING; SCHEDULING DECISIONS; SCHEDULING TECHNIQUES; THREAD LEVEL PARALLELISM

EID: 84875640178     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/2451116.2451158     Document Type: Conference Paper
Times cited: 139

References (59)
  • 3
    • 84864843567
    • Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
    • R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In ISCA, 2012.
    • (2012) ISCA
    • Ausavarungnirun, R.1    Chang, K.K.-W.2    Subramanian, L.3    Loh, G.H.4    Mutlu, O.5
  • 4
    • 79951702398
    • Throughput-effective on-chip networks for manycore accelerators
    • A. Bakhoda, J. Kim, and T. Aamodt. Throughput-effective On-chip Networks for Manycore Accelerators. In MICRO, 2010.
    • (2010) MICRO
    • Bakhoda, A.1    Kim, J.2    Aamodt, T.3
  • 5
    • 70349169075
    • Analyzing CUDA workloads using a detailed GPU simulator
    • A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS, 2009.
    • (2009) ISPASS
    • Bakhoda, A.1    Yuan, G.2    Fung, W.3    Wong, H.4    Aamodt, T.5
  • 7
    • 79959581990
    • Automatic C-to-CUDA code generation for affine programs
    • M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA Code Generation for Affine Programs. In CC/ETAPS, 2010.
    • (2010) CC/ETAPS
    • Baskaran, M.M.1    Ramanujam, J.2    Sadayappan, P.3
  • 8
    • 83155188972
    • CudaDMA: Optimizing GPU memory bandwidth via warp specialization
    • M. Bauer, H. Cook, and B. Khailany. CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization. In SC, 2011.
    • (2011) SC
    • Bauer, M.1    Cook, H.2    Khailany, B.3
  • 11
    • 84861811396
    • Modeling cache contention and throughput of multiprogrammed manycore processors
    • X. E. Chen and T. Aamodt. Modeling Cache Contention and Throughput of Multiprogrammed Manycore Processors. IEEE Trans. Comput., 2012.
    • (2012) IEEE Trans. Comput.
    • Chen, X.E.1    Aamodt, T.2
  • 13
    • 64949179220
    • Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems
    • E. Ebrahimi, O. Mutlu, and Y. N. Patt. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In HPCA, 2009.
    • (2009) HPCA
    • Ebrahimi, E.1    Mutlu, O.2    Patt, Y.N.3
  • 14
    • 47349104432
    • Dynamic warp formation and scheduling for efficient GPU control flow
    • W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO, 2007.
    • (2007) MICRO
    • Fung, W.1    Sham, I.2    Yuan, G.3    Aamodt, T.4
  • 15
    • 79955923056
    • Thread block compaction for efficient SIMT control flow
    • W. W. L. Fung and T. M. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In HPCA, 2011.
    • (2011) HPCA
    • Fung, W.W.L.1    Aamodt, T.M.2
  • 18
    • 84856511841
    • Regulating locality vs. parallelism tradeoffs in multiple memory controller environments
    • S. Hassan, D. Choudhary, M. Rasquinha, and S. Yalamanchili. Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments. In PACT, 2011.
    • (2011) PACT
    • Hassan, S.1    Choudhary, D.2    Rasquinha, M.3    Yalamanchili, S.4
  • 20
    • 84860328391
    • Balancing DRAM locality and parallelism in shared memory CMP systems
    • M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. In HPCA, 2012.
    • (2012) HPCA
    • Jeong, M.K.1    Yoon, D.H.2    Sunwoo, D.3    Sullivan, M.4    Lee, I.5    Erez, M.6
  • 21
    • 84864068497
    • Characterizing and improving the use of demand-fetched caches in GPUs
    • W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving the Use of Demand-fetched Caches in GPUs. In ICS, 2012.
    • (2012) ICS
    • Jia, W.1    Shaw, K.A.2    Martonosi, M.3
  • 22
    • 84863554441
    • Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs
    • A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das. Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs. In DAC, 2012.
    • (2012) DAC
    • Jog, A.1    Mishra, A.K.2    Xu, C.3    Xie, Y.4    Narayanan, V.5    Iyer, R.6    Das, C.R.7
  • 26
    • 77952558442
    • ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
    • Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA, 2010.
    • (2010) HPCA
    • Kim, Y.1    Han, D.2    Mutlu, O.3    Harchol-Balter, M.4
  • 27
    • 79951718838
    • Thread cluster memory scheduling: Exploiting differences in memory access behavior
    • Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO, 2010.
    • (2010) MICRO
    • Kim, Y.1    Papamichael, M.2    Mutlu, O.3    Harchol-Balter, M.4
  • 35
    • 76749092678
    • Improving memory bank-level parallelism in the presence of prefetching
    • C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. Improving Memory Bank-Level Parallelism in the Presence of Prefetching. In MICRO, 2009.
    • (2009) MICRO
    • Lee, C.J.1    Narasiman, V.2    Mutlu, O.3    Patt, Y.N.4
  • 36
    • 79951719035
    • Many-thread aware prefetching mechanisms for GPGPU applications
    • J. Lee, N. Lakshminarayana, H. Kim, and R. Vuduc. Many-thread Aware Prefetching Mechanisms for GPGPU Applications. In MICRO, 2010.
    • (2010) MICRO
    • Lee, J.1    Lakshminarayana, N.2    Kim, H.3    Vuduc, R.4
  • 38
    • 52649128991
    • Memory performance attacks: Denial of memory service in multi-core systems
    • T. Moscibroda and O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In USENIX Security, 2007.
    • (2007) USENIX Security
    • Moscibroda, T.1    Mutlu, O.2
  • 40
    • 84858771269
    • Reducing memory interference in multicore systems via application-aware memory channel partitioning
    • S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In MICRO, 2011.
    • (2011) MICRO
    • Muralidhara, S.P.1    Subramanian, L.2    Mutlu, O.3    Kandemir, M.4    Moscibroda, T.5
  • 41
    • 52649119398
    • Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems
    • O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA, 2008.
    • (2008) ISCA
    • Mutlu, O.1    Moscibroda, T.2
  • 42
    • 47349122373
    • Stall-time fair memory access scheduling for chip multiprocessors
    • O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO, 2007.
    • (2007) MICRO
    • Mutlu, O.1    Moscibroda, T.2
  • 45
    • 2342644731
    • Data cache prefetching using a global history buffer
    • K. J. Nesbit, and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. In HPCA, 2004.
    • (2004) HPCA
    • Nesbit, K.J.1    Smith, J.E.2
  • 49
    • 84864855982
    • CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
    • M. Rhu and M. Erez. CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures. In ISCA, 2012.
    • (2012) ISCA
    • Rhu, M.1    Erez, M.2
  • 52
    • 34547655822
    • Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers
    • S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA, 2007.
    • (2007) HPCA
    • Srinath, S.1    Mutlu, O.2    Kim, H.3    Patt, Y.N.4
  • 54
    • 78149251414
    • Data layout transformation exploiting memory-level parallelism in structured grid many-core applications
    • I. J. Sung, J. A. Stratton, and W.-M. W. Hwu. Data Layout Transformation Exploiting Memory-level Parallelism in Structured Grid Many-core Applications. In PACT, 2010.
    • (2010) PACT
    • Sung, I.J.1    Stratton, J.A.2    Hwu, W.-M.W.3
  • 55
    • 21644451858
    • The effectiveness of multiple hardware contexts
    • R. Thekkath, and S. J. Eggers. The Effectiveness of Multiple Hardware Contexts. In ASPLOS, 1994.
    • (1994) ASPLOS
    • Thekkath, R.1    Eggers, S.J.2
  • 58
    • 76749123978
    • Complexity effective memory access scheduling for many-core accelerator architectures
    • G. Yuan, A. Bakhoda, and T. Aamodt. Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures. In MICRO, 2009.
    • (2009) MICRO
    • Yuan, G.1    Bakhoda, A.2    Aamodt, T.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.