SCOPUS Information Retrieval Platform

Proceedings - International Symposium on Computer Architecture

2010, Pages 235-246

Dynamic warp subdivision for integrated branch and memory divergence tolerance

Authors (3): Meng, Jiayuan (a); Tarjan, David (a); Skadron, Kevin (a)

(a) University of Virginia (United States)

Author keywords

Branch divergence; Latency hiding; Memory divergence; SIMD; Warp

Indexed keywords

AREA OVERHEAD; CACHE HIERARCHIES; DATA PARALLEL; IDLE CYCLES; L2 CACHE; LATENCY HIDING; MAXIMIZE THROUGHPUT; MEMORY ACCESS; MEMORY LEVEL PARALLELISMS; MULTI-THREADING; MULTIPLE PROCESSING; REGISTER FILES;

CACHE MEMORY; COMPUTER ARCHITECTURE; COMPUTERS; PROGRAM PROCESSORS; SCHEDULING;

WEAVING;

EID: 77954976292; PISSN: 1063-6897; EISSN: None; Source Type: Conference Proceeding
DOI: 10.1145/1815961.1815992; Document Type: Conference Paper

Times cited: 207

References (32)

1
- 77955005234
- NVIDIA's next generation CUDA compute architecture: Fermi
- NVIDIA's next generation CUDA compute architecture: Fermi. NVIDIA Corporation, 2009.
- (2009) NVIDIA Corporation

2
- 77954969653
- ATI. Radeon 9700 Pro. http://mirror.ati.com/products/pc/radeon9700pro, 2002.
- (2002) Radeon 9700 Pro

3
- 33846535493
- The M5 simulator: Modeling networked systems
- N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4), 2006.
- (2006) IEEE Micro, vol. 26, issue 4
- Binkert, N.L.¹ Dreslinski, R.G.² Hsu, L.R.³ Lim, K.T.⁴ Saidi, A.G.⁵ Reinhardt, S.K.⁶

4
- 51449118065
- A performance study of general purpose applications on graphics processors using CUDA
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general purpose applications on graphics processors using CUDA. JPDC, 2008.
- (2008) JPDC
- Che, S.¹ Boyer, M.² Meng, J.³ Tarjan, D.⁴ Sheaffer, J.W.⁵ Skadron, K.⁶

5
- 33845901233
- Learning-based SMT processor resource distribution via hill-climbing
- S. Choi and D. Yeung. Learning-based SMT processor resource distribution via hill-climbing. In ISCA, 2006.
- (2006) ISCA
- Choi, S.¹ Yeung, D.²

6
- 0035691709
- Dynamic speculative precomputation
- J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen. Dynamic speculative precomputation. In MICRO 34, 2001.
- (2001) MICRO 34
- Collins, J.D.¹ Tullsen, D.M.² Wang, H.³ Shen, J.P.⁴

7
- 70449722984
- Intel AVX: New frontiers in performance improvements and energy efficiency
- Intel Corporation. Intel AVX: New frontiers in performance improvements and energy efficiency, 2009.
- (2009) Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency

8
- 70449699817
- GeForce GTX 280 specifications
- NVIDIA Corporation. GeForce GTX 280 specifications. 2008.
- (2008) GeForce GTX 280 Specifications

9
- 0036292604
- Tarantula: A vector extension to the Alpha architecture
- R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hern, T. Juan, G. Lowney, M. Mattina, and A. Seznec. Tarantula: A vector extension to the Alpha architecture. In ISCA, 2002.
- (2002) ISCA
- Espasa, R.¹ Ardanaz, F.² Emer, J.³ Felix, S.⁴ Gago, J.⁵ Gramunt, R.⁶ Hern, I.⁷ Juan, T.⁸ Lowney, G.⁹ Mattina, M.¹⁰ Seznec, A.¹¹

10
- 47349104432
- Dynamic warp formation and scheduling for efficient GPU control flow
- W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO, 2007.
- (2007) MICRO
- Fung, W.W.L.¹ Sham, I.² Yuan, G.³ Aamodt, T.M.⁴

11
- 34247376580
- Chip multiprocessing and the Cell Broadband Engine
- M. Gschwind. Chip multiprocessing and the Cell Broadband Engine. In CF, 2006.
- (2006) CF
- Gschwind, M.¹

12
- 0034459255
- Efficient conditional operations for data-parallel architectures
- U. J. Kapasi, W. J. Dally, S. Rixner, P. R. Mattson, J. D. Owens, and B. Khailany. Efficient conditional operations for data-parallel architectures. In MICRO 33, 2000.
- (2000) MICRO 33
- Kapasi, U.J.¹ Dally, W.J.² Rixner, S.³ Mattson, P.R.⁴ Owens, J.D.⁵ Khailany, B.⁶

13
- 3242661621
- A media-enhanced vector architecture for embedded memory systems
- C. Kozyrakis. A media-enhanced vector architecture for embedded memory systems. Technical report, University of California, Berkeley, 1999.
- (1999) A Media-enhanced Vector Architecture for Embedded Memory Systems
- Kozyrakis, C.¹

14
- 4644337990
- The Vector-Thread architecture
- R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector-Thread architecture. In ISCA, 2004.
- (2004) ISCA
- Krashinsky, R.¹ Batten, C.² Hampton, M.³ Gerding, S.⁴ Pharris, B.⁵ Casper, J.⁶ Asanovic, K.⁷

15
- 77955001720
- Method for conditional branch execution in SIMD vector processors
- R. A. Lorie and H. R. Strong. Method for conditional branch execution in SIMD vector processors. US Patent 4,435,758, 1984.
- (1984) Method for conditional branch execution in SIMD vector processors
- Lorie, R.A.¹ Strong, H.R.²

16
- 77954020709
- Exploiting inter-thread temporal locality for chip multithreading
- J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting inter-thread temporal locality for chip multithreading. In IPDPS, 2010.
- (2010) IPDPS
- Meng, J.¹ Sheaffer, J.W.² Skadron, K.³

17
- 77955007736
- Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling
- J. Meng and K. Skadron. Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling. In ICCD, 2007.
- (2007) ICCD
- Meng, J.¹ Skadron, K.²

18
- 77954994930
- Dynamic warp subdivision for integrated branch and memory divergence tolerance: Extended results
- J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance: Extended results. U.Va. Tech. Report CS-2010-5, 2010.
- (2010) U.Va. Tech. Report CS-2010-5
- Meng, J.¹ Tarjan, D.² Skadron, K.³

19
- 47349098275
- MineBench: A benchmark suite for data mining workloads
- R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. MineBench: A benchmark suite for data mining workloads. In IISWC, 2006.
- (2006) IISWC
- Narayanan, R.¹ Ozisikyilmaz, B.² Zambreno, J.³ Memik, G.⁴ Choudhary, A.⁵

20
- 47249164386
- Performance improvement methodology for ClearSpeed's CSX600
- Y. Nishikawa, M. Koibuchi, M. Yoshimi, K. Miura, and H. Amano. Performance improvement methodology for ClearSpeed's CSX600. In ICPP, 2007.
- (2007) ICPP
- Nishikawa, Y.¹ Koibuchi, M.² Yoshimi, M.³ Miura, K.⁴ Amano, H.⁵

21
- 0016994364
- Implementation of permutation functions in ILLIAC IV-type computers
- S. E. Orcutt. Implementation of permutation functions in ILLIAC IV-type computers. IEEE Trans. Comput., 25(9), 1976.
- (1976) IEEE Trans. Comput., vol. 25, issue 9
- Orcutt, S.E.¹

22
- 33845874613
- A case for MLP-aware cache replacement
- M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A case for MLP-aware cache replacement. In ISCA, 2006.
- (2006) ISCA
- Qureshi, M.K.¹ Lynch, D.N.² Mutlu, O.³ Patt, Y.N.⁴

23
- 57749185053
- Runahead threads to improve SMT performance
- T. Ramirez, A. Pajuelo, O.J. Santana, and M. Valero. Runahead threads to improve SMT performance. HPCA, 2008.
- (2008) HPCA
- Ramirez, T.¹ Pajuelo, A.² Santana, O.J.³ Valero, M.⁴

24
- 34547456450
- Vector lane threading
- S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis. Vector lane threading. In ICPP, 2006.
- (2006) ICPP
- Rivoire, S.¹ Schultz, R.² Okuda, T.³ Kozyrakis, C.⁴

25
- 0032312385
- A bandwidth-efficient architecture for media processing
- S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. López-Lagunas, P. R. Mattson, and J. D. Owens. A bandwidth-efficient architecture for media processing. In MICRO 31, 1998.
- (1998) MICRO 31
- Rixner, S.¹ Dally, W.J.² Kapasi, U.J.³ Khailany, B.⁴ López-Lagunas, A.⁵ Mattson, P.R.⁶ Owens, J.D.⁷

26
- 0017922490
- The CRAY-1 computer system
- R. M. Russell. The CRAY-1 computer system. Commun. ACM, 21(1), 1978.
- (1978) Commun. ACM, vol. 21, issue 1
- Russell, R.M.¹

27
- 49249086142
- Larrabee: A many-core x86 architecture for visual computing
- L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3), 2008.
- (2008) ACM Trans. Graph., vol. 27, issue 3
- Seiler, L.¹ Carmean, D.² Sprangle, E.³ Forsyth, T.⁴ Abrash, M.⁵ Dubey, P.⁶ Junkins, S.⁷ Lake, A.⁸ Sugerman, J.⁹ Cavin, R.¹⁰ Espasa, R.¹¹ Grochowski, E.¹² Juan, T.¹³ Hanrahan, P.¹⁴

28
- 0030644231
- A mechanism for SIMD execution of SPMD programs
- Y. Takahashi. A mechanism for SIMD execution of SPMD programs. In HPC-ASIA, 1997.
- (1997) HPC-ASIA
- Takahashi, Y.¹

29
- 0035178105
- Cost-effective hardware acceleration of multimedia applications
- D. Talla and L. K. John. Cost-effective hardware acceleration of multimedia applications. In ICCD, 2001.
- (2001) ICCD
- Talla, D.¹ John, L.K.²

30
- 74049151553
- Increasing memory miss tolerance for SIMD cores
- D. Tarjan, J. Meng, and K. Skadron. Increasing memory miss tolerance for SIMD cores. In SC, 2009.
- (2009) SC
- Tarjan, D.¹ Meng, J.² Skadron, K.³

31
- 0035696665
- Handling long-latency loads in a simultaneous multithreading processor
- D. M. Tullsen and J. A. Brown. Handling long-latency loads in a simultaneous multithreading processor. In MICRO 34, 2001.
- (2001) MICRO 34
- Tullsen, D.M.¹ Brown, J.A.²

32
- 0029194459
- The SPLASH-2 programs: Characterization and methodological considerations
- S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. ISCA, 1995.
- (1995) ISCA
- Woo, S.C.¹ Ohara, M.² Torrie, E.³ Singh, J.P.⁴ Gupta, A.⁵

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.