SCOPUS Information Search Platform

MICRO 2013 - Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Volume , Issue , 2013, Pages 86-98

A locality-aware memory hierarchy for energy-efficient GPU architectures

(4) Rhu, Minsoo (a); Sullivan, Michael (a); Leng, Jingwen (a); Erez, Mattan (a)

(a) The University of Texas at Austin (United States)

Author keywords

adaptive granularity memory; fine grained memory access; GPU; irregular memory access patterns; SIMD; SIMT

Indexed keywords

GPU; MEMORY ACCESS; MEMORY ACCESS PATTERNS; SIMD; SIMT; COMPUTER ARCHITECTURE; ENERGY EFFICIENCY; PROGRAM PROCESSORS; CACHE MEMORY

EID: 84892519096; PISSN: None; EISSN: None; Source Type: Conference Proceeding
DOI: 10.1145/2540708.2540717; Document Type: Conference Paper

Times cited: 104

References (62)

1
- 70649092154
- Rodinia: A benchmark suite for heterogeneous computing
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IEEE International Symposium on Workload Characterization (IISWC-2009), October 2009.
- IEEE International Symposium on Workload Characterization (IISWC-2009), October 2009
- Che, S.¹ Boyer, M.² Meng, J.³ Tarjan, D.⁴ Sheaffer, J.⁵ Lee, S.-H.⁶ Skadron, K.⁷

2
- 63549097654
- A MapReduce Framework on Graphics Processors
- B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, "A MapReduce Framework on Graphics Processors," in 17th International Conference on Parallel Architecture and Compilation Techniques (PACT-17), 2008.
- 17th International Conference on Parallel Architecture and Compilation Techniques (PACT-17), 2008
- He, B.¹ Fang, W.² Luo, Q.³ Govindaraju, N.⁴ Wang, T.⁵

3
- 84873458159
- A Quantitative Study of Irregular Programs on GPUs
- M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in IEEE International Symposium on Workload Characterization (IISWC-2012), 2012.
- IEEE International Symposium on Workload Characterization (IISWC-2012), 2012
- Burtscher, M.¹ Nasre, R.² Pingali, K.³

4
- 47349104432
- Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
- W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in 40th International Symposium on Microarchitecture (MICRO-40), December 2007.
- 40th International Symposium on Microarchitecture (MICRO-40), December 2007
- Fung, W.W.¹ Sham, I.² Yuan, G.³ Aamodt, T.M.⁴

5
- 74049151553
- Increasing memory miss tolerance for SIMD cores
- D. Tarjan, J. Meng, and K. Skadron, "Increasing memory miss tolerance for SIMD cores," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC-09), 2009.
- Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC-09), 2009
- Tarjan, D.¹ Meng, J.² Skadron, K.³

6
- 77954976292
- Dynamic warp subdivision for integrated branch and memory divergence tolerance
- J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in 37th International Symposium on Computer Architecture (ISCA-37), 2010.
- 37th International Symposium on Computer Architecture (ISCA-37), 2010
- Meng, J.¹ Tarjan, D.² Skadron, K.³

7
- 79955923056
- Thread Block Compaction for Efficient SIMT Control Flow
- W. W. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in 17th International Symposium on High Performance Computer Architecture (HPCA-17), February 2011.
- 17th International Symposium on High Performance Computer Architecture (HPCA-17), February 2011
- Fung, W.W.¹ Aamodt, T.M.²

8
- 84863342255
- Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
- V. Narasiman et al., "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in 44th International Symposium on Microarchitecture (MICRO-44), December 2011.
- 44th International Symposium on Microarchitecture (MICRO-44), December 2011
- Narasiman, V.¹

9
- 84864855982
- CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
- M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," in 39th International Symposium on Computer Architecture (ISCA-39), June 2012.
- 39th International Symposium on Computer Architecture (ISCA-39), June 2012
- Rhu, M.¹ Erez, M.²

10
- 84876590572
- Cache-Conscious Wavefront Scheduling
- T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," in 45th International Symposium on Microarchitecture (MICRO-45), December 2012.
- 45th International Symposium on Microarchitecture (MICRO-45), December 2012
- Rogers, T.¹ O'Connor, M.² Aamodt, T.³

11
- 84880298026
- The Dual-Path Execution Model for Efficient GPU Control Flow
- M. Rhu and M. Erez, "The Dual-Path Execution Model for Efficient GPU Control Flow," in 19th International Symposium on High-Performance Computer Architecture (HPCA-19), February 2013.
- 19th International Symposium on High-Performance Computer Architecture (HPCA-19), February 2013
- Rhu, M.¹ Erez, M.²

12
- 84875640178
- OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
- A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-13), 2013.
- 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-13), 2013
- Jog, A.¹

13
- 84881171083
- Maximizing SIMD Resource Utilization in GPGPUs with SIMD Lane Permutation
- M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization in GPGPUs with SIMD Lane Permutation," in 40th International Symposium on Computer Architecture (ISCA-40), June 2013.
- 40th International Symposium on Computer Architecture (ISCA-40), June 2013
- Rhu, M.¹ Erez, M.²

14
- 84881183039
- SIMD Divergence Optimization through Intra-Warp Compaction
- A. Vaidya et al., "SIMD Divergence Optimization through Intra-Warp Compaction," in 40th International Symposium on Computer Architecture (ISCA-40), June 2013.
- 40th International Symposium on Computer Architecture (ISCA-40), June 2013
- Vaidya, A.¹

15
- 84881126240
- Orchestrated Scheduling and Prefetching for GPGPUs
- A. Jog et al., "Orchestrated Scheduling and Prefetching for GPGPUs," in 40th International Symposium on Computer Architecture (ISCA-40), 2013.
- 40th International Symposium on Computer Architecture (ISCA-40), 2013
- Jog, A.¹

16
- 80054875176
- GPUs and the Future of Parallel Computing
- October
- S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco, "GPUs and the Future of Parallel Computing," in IEEE Micro, October 2011.
- (2011) IEEE Micro
- Keckler, S.¹ Dally, W.² Khailany, B.³ Garland, M.⁴ Glasco, D.⁵

17
- 80052542940
- Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput
- D. H. Yoon, M. K. Jeong, and M. Erez, "Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput," in 38th International Symposium on Computer Architecture (ISCA-38), 2011.
- 38th International Symposium on Computer Architecture (ISCA-38), 2011
- Yoon, D.H.¹ Jeong, M.K.² Erez, M.³

18
- 84864862775
- The dynamic granularity memory system
- D. H. Yoon, M. Sullivan, M. K. Jeong, and M. Erez, "The dynamic granularity memory system," in 39th International Symposium on Computer Architecture (ISCA-39), 2012.
- 39th International Symposium on Computer Architecture (ISCA-39), 2012
- Yoon, D.H.¹ Sullivan, M.² Jeong, M.K.³ Erez, M.⁴

19
- 77951900491
- NVIDIA Corporation
- NVIDIA Corporation, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," 2009.
- (2009) NVIDIA's Next Generation CUDA Compute Architecture: Fermi

20
- 84867648940
- NVIDIA Corporation, "Whitepaper: NVIDIA GeForce GTX 680," 2012.
- (2012) Whitepaper: NVIDIA GeForce GTX 680

21
- 84864843978
- AMD Corporation
- AMD Corporation, "AMD Radeon HD 6900M Series Specifications," 2010.
- (2010) AMD Radeon HD 6900M Series Specifications

22
- 35948991669
- NVIDIA Corporation
- NVIDIA Corporation, "NVIDIA CUDA Programming Guide," 2011.
- (2011) NVIDIA CUDA Programming Guide

23
- 78149329064
- AMD Corporation, August
- AMD Corporation, "ATI Stream Computing OpenCL Programming Guide," August 2010.
- (2010) ATI Stream Computing OpenCL Programming Guide

24
- 84892518346
- Hynix
- 1Gb (32Mx32) GDDR5 SGRAM, H5GQ1H24AFR, Hynix, 2009.
- (2009) 1Gb (32Mx32) GDDR5 SGRAM, H5GQ1H24AFR

25
- 74049087888
- Future scaling of processor-memory interfaces
- J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, "Future scaling of processor-memory interfaces," in Proc. the Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2009.
- Proc. the Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2009
- Ahn, J.H.¹ Jouppi, N.P.² Kozyrakis, C.³ Leverich, J.⁴ Schreiber, R.S.⁵

26
- 67650604446
- Multicore DIMM: An energy efficient memory module with independently controlled DRAMs
- Jan.-Jun.
- J. H. Ahn, J. Leverich, R. Schreiber, and N. P. Jouppi, "Multicore DIMM: An energy efficient memory module with independently controlled DRAMs," IEEE Computer Architecture Letters, vol. 8, no. 1, pp. 5-8, Jan.-Jun. 2009.
- (2009) IEEE Computer Architecture Letters , vol.8 , Issue.1 , pp. 5-8
- Ahn, J.H.¹ Leverich, J.² Schreiber, R.³ Jouppi, N.P.⁴

27
- 84892533366
- Micro-threaded row and column operations in a DRAM core
- F. A. Ware and C. Hampel, "Micro-threaded row and column operations in a DRAM core," in Proc. the first Workshop on Unique Chips and Systems (UCAS), Mar. 2005.
- Proc. the First Workshop on Unique Chips and Systems (UCAS), Mar. 2005
- Ware, F.A.¹ Hampel, C.²

28
- 49749122679
- Improving power and data efficiency with threaded memory modules
- F. A. Ware and C. Hampel, "Improving power and data efficiency with threaded memory modules," in Proceedings of the International Conference on Computer Design (ICCD), 2006.
- Proceedings of the International Conference on Computer Design (ICCD), 2006
- Ware, F.A.¹ Hampel, C.²

29
- 66749162556
- Mini-rank: Adaptive DRAM architecture for improving memory power efficiency
- H. Zheng et al., "Mini-rank: Adaptive DRAM architecture for improving memory power efficiency," in 41st International Symposium on Microarchitecture (MICRO-41), Nov. 2008.
- 41st International Symposium on Microarchitecture (MICRO-41), Nov. 2008
- Zheng, H.¹

30
- 77951180817
- Instruction set innovations for the Convey HC-1 computer
- T. M. Brewer, "Instruction set innovations for the Convey HC-1 computer," IEEE Micro, vol. 30, no. 2, pp. 70-79, 2010.
- (2010) IEEE Micro , vol.30 , Issue.2 , pp. 70-79
- Brewer, T.M.¹

31
- 0002388384
- Structural aspects of the system/360 model 85, part II: The cache
- J. S. Liptay, "Structural aspects of the system/360 model 85, part II: The cache," IBM Systems Journal, vol. 7, pp. 15-21, 1968.
- (1968) IBM Systems Journal , vol.7 , pp. 15-21
- Liptay, J.S.¹

32
- 84877700379
- MAGE: Adaptive granularity and ECC for resilient and power efficient memory systems
- IEEE
- S. Li et al., "MAGE: Adaptive granularity and ECC for resilient and power efficient memory systems," in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012, pp. 1-11.
- (2012) High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for , pp. 1-11
- Li, S.¹

33
- 84892558405
- The Cray Black Widow: A highly scalable vector multiprocessor
- D. Abts et al., "The Cray Black Widow: A highly scalable vector multiprocessor," in Proc. the Int'l Conf. High Performance Computing, Networking, Storage, and Analysis (SC), Nov. 2007.
- Proc. the Int'l Conf. High Performance Computing, Networking, Storage, and Analysis (SC), Nov. 2007
- Abts, D.¹

34
- 0031593995
- Exploiting spatial locality in data caches using spatial footprints
- S. Kumar and C. Wilkerson, "Exploiting spatial locality in data caches using spatial footprints," in 25th International Symposium on Computer Architecture (ISCA-25), 1998.
- 25th International Symposium on Computer Architecture (ISCA-25), 1998
- Kumar, S.¹ Wilkerson, C.²

35
- 2342482320
- Accurate and complexity-effective spatial pattern prediction
- C. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos, "Accurate and complexity-effective spatial pattern prediction," in 10th International Symposium on High Performance Computer Architecture (HPCA-10), 2004.
- 10th International Symposium on High Performance Computer Architecture (HPCA-10), 2004
- Chen, C.¹ Yang, S.-H.² Falsafi, B.³ Moshovos, A.⁴

36
- 84864068497
- Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- W. Jia, K. Shaw, and M. Martonosi, "Characterizing and Improving the Use of Demand-Fetched Caches in GPUs," in 26th International Conference on Supercomputing (ICS'26), 2012.
- 26th International Conference on Supercomputing (ICS'26), 2012
- Jia, W.¹ Shaw, K.² Martonosi, M.³

37
- 0014814325
- Space/Time Trade-Offs in Hash Coding with Allowable Errors
- B. Bloom, "Space/Time Trade-Offs in Hash Coding with Allowable Errors," in Communications of the ACM, 1970.
- (1970) Communications of the ACM
- Bloom, B.¹

38
- 0034206002
- Summary cache: A scalable wide-area web cache sharing protocol
- L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: a scalable wide-area web cache sharing protocol," IEEE/ACM Transactions on Networking (TON), vol. 8, no. 3, pp. 281-293, 2000.
- (2000) IEEE/ACM Transactions on Networking (TON) , vol.8 , Issue.3 , pp. 281-293
- Fan, L.¹ Cao, P.² Almeida, J.³ Broder, A.Z.⁴

39
- 0031366315
- Efficient Hardware Hashing Functions for High Performance Computers
- M. Ramakrishna et al., "Efficient Hardware Hashing Functions for High Performance Computers," in IEEE Transactions on Computers, 1997.
- (1997) IEEE Transactions on Computers
- Ramakrishna, M.¹

40
- 0030672489
- The agree predictor: A mechanism for reducing negative branch history interference
- E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt, "The agree predictor: A mechanism for reducing negative branch history interference," in 24th International Symposium on Computer Architecture (ISCA-24), 1997.
- 24th International Symposium on Computer Architecture (ISCA-24), 1997
- Sprangle, E.¹ Chappell, R.S.² Alsup, M.³ Patt, Y.N.⁴

41
- 70349169075
- Analyzing CUDA workloads using a detailed GPU simulator
- A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), April 2009.
- IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), April 2009
- Bakhoda, A.¹ Yuan, G.² Fung, W.³ Wong, H.⁴ Aamodt, T.⁵

42
- 84892547793
- "GPGPU-Sim," http://www.gpgpu-sim.org.
- GPGPU-Sim

43
- 84892560748
- "DrSim," http://lph.ece.utexas.edu/public/DrSim.
- DrSim

44
- 84860328391
- Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems
- M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez, "Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems," in 18th International Symposium on High Performance Computer Architecture (HPCA-18), February 2012.
- 18th International Symposium on High Performance Computer Architecture (HPCA-18), February 2012
- Jeong, M.K.¹ Yoon, D.H.² Sunwoo, D.³ Sullivan, M.⁴ Lee, I.⁵ Erez, M.⁶

45
- 78650867466
- A 7Gb/s/pin 1 Gbit GDDR5 SDRAM With 2.5 ns Bank to Bank Active Time and No Bank Group Restriction
- T.-Y. Oh et al., "A 7Gb/s/pin 1 Gbit GDDR5 SDRAM With 2.5 ns Bank to Bank Active Time and No Bank Group Restriction," in IEEE Journal of Solid-State Circuits, 2011.
- (2011) IEEE Journal of Solid-State Circuits
- Oh, T.-Y.¹

46
- 84881141803
- "GPGPU-Sim Manual," http://www.gpgpu-sim.org/manual.
- GPGPU-Sim Manual

47
- 84881151222
- GPUWattch: Enabling Energy Optimizations in GPGPUs
- J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," in 40th International Symposium on Computer Architecture (ISCA-40), June 2013.
- 40th International Symposium on Computer Architecture (ISCA-40), June 2013
- Leng, J.¹

48
- 84864861336
- NVIDIA Corporation
- NVIDIA Corporation, "CUDA C/C++ SDK CODE Samples," 2011.
- (2011) CUDA C/C++ SDK CODE Samples

49
- 38849131252
- High-throughput sequence alignment using graphics processing units
- M. Schatz, C. Trapnell, A. Delcher, and A. Varshney, "High-throughput sequence alignment using graphics processing units," BMC Bioinformatics, vol. 8, no. 1, p. 474, 2007.
- (2007) BMC Bioinformatics , vol.8 , Issue.1 , p. 474
- Schatz, M.¹ Trapnell, C.² Delcher, A.³ Varshney, A.⁴

50
- 80052533471
- Energy-efficient mechanisms for managing thread context in throughput processors
- M. Gebhart, D. Johnson, D. Tarjan, S. Keckler, W. Dally, E. Lindholm, and K. Skadron, "Energy-efficient mechanisms for managing thread context in throughput processors," in 38th International Symposium on Computer Architecture (ISCA-38), 2011.
- 38th International Symposium on Computer Architecture (ISCA-38), 2011
- Gebhart, M.¹ Johnson, D.² Tarjan, D.³ Keckler, S.⁴ Dally, W.⁵ Lindholm, E.⁶ Skadron, K.⁷

51
- 84887430750
- HMC, Hybrid Memory Cube Consortium
- HMC, "Hybrid memory cube specification 1.0," Hybrid Memory Cube Consortium, 2013.
- (2013) Hybrid Memory Cube Specification 1.0

52
- 84892491153
- Hynix, Hynix Semiconductor, Inc.
- Hynix, "Blazing a trail to high performance graphics," Hynix Semiconductor, Inc., 2011.
- (2011) Blazing a Trail to High Performance Graphics

53
- 84892506420
- JEDEC
- JEDEC, "JESD 229 Wide I/O SDR," 2011.
- (2011) JESD 229 Wide I/O SDR

54
- 77952257218
- Virtualized and flexible ECC for main memory
- D. H. Yoon and M. Erez, "Virtualized and flexible ECC for main memory," in Proc. the 15th Int'l. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2010.
- Proc. the 15th Int'l. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2010
- Yoon, D.H.¹ Erez, M.²

55
- 79958107212
- Decoupled sectored caches: Conciliating low tag implementation cost
- A. Seznec, "Decoupled sectored caches: Conciliating low tag implementation cost," in Proc. the 21st Ann. Int'l Symp. Computer Architecture (ISCA), Apr. 1994.
- Proc. the 21st Ann. Int'l Symp. Computer Architecture (ISCA), Apr. 1994
- Seznec, A.¹

56
- 0032644675
- The pool of subsectors cache design
- J. B. Rothman and A. J. Smith, "The pool of subsectors cache design," in Proc. the 13th Int'l Conf. Supercomputing (ICS), Jun. 1999.
- Proc. the 13th Int'l Conf. Supercomputing (ICS), Jun. 1999
- Rothman, J.B.¹ Smith, A.J.²

57
- 0029204095
- A data cache with multiple caching strategies tuned to different types of locality
- A. Gonzalez, C. Aliagas, and M. Valero, "A data cache with multiple caching strategies tuned to different types of locality," in Proc. the Int'l Conf. Supercomputing (ICS), Jul. 1995.
- Proc. the Int'l Conf. Supercomputing (ICS), Jul. 1995
- Gonzalez, A.¹ Aliagas, C.² Valero, M.³

58
- 34250668863
- Approximately detecting duplicates for streaming data using stable bloom filters
- ACM
- F. Deng and D. Rafiei, "Approximately detecting duplicates for streaming data using stable bloom filters," in Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006, pp. 25-36.
- (2006) Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data , pp. 25-36
- Deng, F.¹ Rafiei, D.²

59
- 84881327807
- "bcache: A Linux kernel block layer cache," http://bcache.evilpiepirate.org/.
- Bcache: A Linux Kernel Block Layer Cache

60
- 8344271981
- Approximate caches for packet classification
- IEEE
- F. Chang, W.-c. Feng, and K. Li, "Approximate caches for packet classification," in INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 4. IEEE, 2004, pp. 2196-2207.
- (2004) INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies , vol.4 , pp. 2196-2207
- Chang, F.¹ Feng, W.-C.² Li, K.³

61
- 84881364062
- TBF: A memory-efficient replacement policy for flash-based caches
- C. Ungureanu et al., "TBF: A memory-efficient replacement policy for flash-based caches," in Data Engineering (ICDE), 2013 IEEE 29th International Conference on, 2013, pp. 1117-1128.
- (2013) Data Engineering (ICDE), 2013 IEEE 29th International Conference on , pp. 1117-1128
- Ungureanu, C.¹

62
- 72949105570
- Aging bloom filter with two active buffers for dynamic sets
- M. Yoon, "Aging bloom filter with two active buffers for dynamic sets," Knowledge and Data Engineering, IEEE Transactions on, vol. 22, no. 1, pp. 134-138, 2010.
- (2010) Knowledge and Data Engineering, IEEE Transactions on , vol.22 , Issue.1 , pp. 134-138
- Yoon, M.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.