SCOPUS Information Retrieval Platform

Proceedings - International Symposium on Computer Architecture

Volume , Issue , 2012, Pages 49-60

Simultaneous branch and warp interweaving for sustained GPU performance

(3) Brunie, Nicolas (a); Collange, Sylvain (b); Diamos, Gregory (c)

(a) UNIVERSITÉ DE LYON (France)

(b) FEDERAL UNIVERSITY OF MINAS GERAIS (Brazil)

(c) NVIDIA (United States)

Author keywords

[No Author keywords available]

Indexed keywords

COMPUTER ARCHITECTURE; COMPUTER GRAPHICS; LOCKS (FASTENERS); PROGRAM PROCESSORS;

CONTROL LOGIC; DIVERGENTS; EXECUTION PATHS; EXECUTION UNITS; FINE GRAINED; INSTRUCTION FETCH; MICRO ARCHITECTURES; MULTIPLE THREADS; PERFORMANCE; RE CONVERGENCES;

GRAPHICS PROCESSING UNIT;

EID: 84864834311 PISSN: 10636897 EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2366231.2337166 Document Type: Conference Paper

Times cited : (68)

References (31)

1
- 35348872507
- Transparent control independence (TCI)
- June
- A. S. Al-Zawawi, V. K. Reddy, E. Rotenberg, and H. H. Akkary. Transparent control independence (TCI). SIGARCH Comput. Archit. News, 35:448-459, June 2007.
- (2007) SIGARCH Comput. Archit. News , vol.35 , pp. 448-459
- Al-Zawawi, A.S.¹ Reddy, V.K.² Rotenberg, E.³ Akkary, H.H.⁴

2
- 70649092154
- Rodinia: A benchmark suite for heterogeneous computing.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. IEEE Workload Characterization Symposium, 0:44-54, 2009.
- (2009) IEEE Workload Characterization Symposium , pp. 44-54
- Che, S.¹ Boyer, M.² Meng, J.³ Tarjan, D.⁴ Sheaffer, J.W.⁵ Lee, S.-H.⁶ Skadron, K.⁷

3
- 78049512154
- Barra: A parallel functional simulator for GPGPU
- S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: a parallel functional simulator for GPGPU. In IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 351-360, 2010.
- (2010) IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) , pp. 351-360
- Collange, S.¹ Daumas, M.² Defour, D.³ Parello, D.⁴

4
- 84856559490
- Dynamic detection of uniform and affine vectors in GPGPU computations
- volume LNCS 6043
- S. Collange, D. Defour, and Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. In Europar 3rd Workshop on Highly Parallel Processing on a Chip (HPPC), volume LNCS 6043, pages 46-55, 2009.
- (2009) Europar 3rd Workshop on Highly Parallel Processing on A Chip (HPPC) , pp. 46-55
- Collange, S.¹ Defour, D.² Zhang, Y.³

5
- 84864863647
- Affine Vector Cache for memory bandwidth savings
- ENS, Lyon, Dec.
- S. Collange and A. Kouyoumdjian. Affine Vector Cache for memory bandwidth savings. Technical Report ensl-00649200, ENS Lyon, Dec. 2011.
- (2011) Technical Report ensl-00649200
- Collange, S.¹ Kouyoumdjian, A.²

6
- 21644487687
- Control flow optimization via dynamic reconvergence prediction
- IEEE Computer Society
- J. D. Collins, D. M. Tullsen, and H. Wang. Control flow optimization via dynamic reconvergence prediction. In IEEE/ACM International Symposium on Microarchitecture, pages 129-140. IEEE Computer Society, 2004.
- (2004) IEEE /ACM International Symposium on Microarchitecture , pp. 129-140
- Collins, J.D.¹ Tullsen, D.M.² Wang, H.³

7
- 84864829532
- US Patent 7434032, October
- B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu. Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators. US Patent 7434032, October 2008.
- (2008) Tracking Register Usage during Multithreaded Processing Using A Scoreboard Having Separate Memory Regions and Storing Sequential Register Size Indicators
- Coon, B.W.¹ Mills, P.C.² Oberman, S.F.³ Siu, M.Y.⁴

8
- 84856515692
- PEPSC: A power-efficient processor for scientific computing
- G. Dasika, A. Sethia, T. Mudge, and S. Mahlke. PEPSC: A power-efficient processor for scientific computing. In PACT, 2011.
- (2011) PACT
- Dasika, G.¹ Sethia, A.² Mudge, T.³ Mahlke, S.⁴

9
- 84864834319
- Multithreaded instruction sharing
- M. Dechene, E. Forbes, and E. Rotenberg. Multithreaded instruction sharing. Technical report, North Carolina State University, 2010.
- (2010) Technical Report, North Carolina State University
- Dechene, M.¹ Forbes, E.² Rotenberg, E.³

10
- 84863351470
- SIMD re-convergence at thread frontiers
- December
- G. Diamos, A. Kerr, H. Wu, S. Yalamanchili, B. Ashbaugh, and S. Maiyuran. SIMD re-convergence at thread frontiers. In MICRO 44: Proceedings of the 44th annual IEEE/ACM International Symposium on Microarchitecture, December 2011.
- (2011) MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
- Diamos, G.¹ Kerr, A.² Wu, H.³ Yalamanchili, S.⁴ Ashbaugh, B.⁵ Maiyuran, S.⁶

11
- 70449647744
- CASH: Revisiting hardware sharing in single-chip parallel processor
- R. Dolbeau and A. Seznec. CASH: Revisiting hardware sharing in single-chip parallel processor. Journal of Instruction-Level Parallelism, 6:1-16, 2004.
- (2004) Journal of Instruction-Level Parallelism , vol.6 , pp. 1-16
- Dolbeau, R.¹ Seznec, A.²

12
- 0030784080
- Multithreaded vector architectures
- R. Espasa and M. Valero. Multithreaded vector architectures. In Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, HPCA'97, pages 237-244, 1997.
- (1997) Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, HPCA'97 , pp. 237-244
- Espasa, R.¹ Valero, M.²

13
- 84862142534
- Towards solving the table maker dilemma on GPU
- P. Fortin, M. Gouicem, and S. Graillat. Towards solving the table maker dilemma on GPU. In 20th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP 12), 2012.
- (2012) 20th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP 12)
- Fortin, P.¹ Gouicem, M.² Graillat, S.³

14
- 79955923056
- Thread block compaction for efficient SIMT control flow
- February
- W. Fung and T. Aamodt. Thread block compaction for efficient SIMT control flow. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pages 25-36, February 2011.
- (2011) 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA) , pp. 25-36
- Fung, W.¹ Aamodt, T.²

15
- 68549096107
- Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
- July
- W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Trans. Archit. Code Optim., 6:7:1-7:37, July 2009.
- (2009) ACM Trans. Archit. Code Optim. , vol.6 , pp. 7:1-7:37
- Fung, W.W.L.¹ Sham, I.² Yuan, G.³ Aamodt, T.M.⁴

16
- 80052533471
- Energy-efficient mechanisms for managing thread context in throughput processors
- M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceeding of the 38th annual international symposium on Computer architecture, pages 235-246, 2011.
- (2011) Proceeding of the 38th Annual International Symposium on Computer Architecture , pp. 235-246
- Gebhart, M.¹ Johnson, D.R.² Tarjan, D.³ Keckler, S.W.⁴ Dally, W.J.⁵ Lindholm, E.⁶ Skadron, K.⁷

17
- 85184640233
- Coherent vector lane threading
- A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.
- (2009) Berkeley ParLab Seminar
- Glew, A.¹

18
- 15044343841
- The Vector-Thread architecture
- R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector-Thread architecture. IEEE MICRO, 24(6):84-90, 2004.
- (2004) IEEE MICRO , vol.24 , Issue.6 , pp. 84-90
- Krashinsky, R.¹ Batten, C.² Hampton, M.³ Gerding, S.⁴ Pharris, B.⁵ Casper, J.⁶ Asanovic, K.⁷

19
- 21644440721
- Conjoined-core chip multiprocessing
- R. Kumar, N. P. Jouppi, and D. M. Tullsen. Conjoined-core chip multiprocessing. In IEEE/ACM International Symposium on Microarchitecture, pages 195-206, 2004.
- (2004) IEEE /ACM International Symposium on Microarchitecture , pp. 195-206
- Kumar, R.¹ Jouppi, N.P.² Tullsen, D.M.³

20
- 84871656628
- Performance in GPU architectures: Potentials and distances
- A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and distances. In 9th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD11), in conjunction with ISCA-38, 2011.
- (2011) 9th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD11), in Conjunction with ISCA-38
- Lashgar, A.¹ Baniasadi, A.²

21
- 77954995885
- Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU
- V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture, pages 451-460, 2010.
- (2010) ISCA '10: Proceedings of the 37th Annual International Symposium on Computer Architecture , pp. 451-460
- Lee, V.W.¹ Kim, C.² Chhugani, J.³ Deisher, M.⁴ Kim, D.⁵ Nguyen, A.D.⁶ Satish, N.⁷ Smelyanskiy, M.⁸ Chennupaty, S.⁹ Hammarlund, P.¹⁰ Singhal, R.¹¹ Dubey, P.¹²

22
- 44849137198
- NVIDIA Tesla: A unified graphics and computing architecture
- J. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, 2008.
- (2008) IEEE Micro , vol.28 , Issue.2 , pp. 39-55
- Lindholm, J.E.¹ Nickolls, J.² Oberman, S.³ Montrym, J.⁴

23
- 79951689916
- Minimal multi-threading: Finding and removing redundant instructions in multithreaded processors
- G. Long, D. Franklin, S. Biswas, P. Ortiz, J. Oberg, D. Fan, and F. T. Chong. Minimal multi-threading: Finding and removing redundant instructions in multithreaded processors. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 337-348, 2010.
- (2010) Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43 , pp. 337-348
- Long, G.¹ Franklin, D.² Biswas, S.³ Ortiz, P.⁴ Oberg, J.⁵ Fan, D.⁶ Chong, F.T.⁷

24
- 77954976292
- Dynamic warp subdivision for integrated branch and memory divergence tolerance
- J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News, 38(3):235-246, 2010.
- (2010) SIGARCH Comput. Archit. News , vol.38 , Issue.3 , pp. 235-246
- Meng, J.¹ Tarjan, D.² Skadron, K.³

25
- 84864829539
- Scheduler in multi-threaded processor prioritizing instructions passing qualification rule
- US Patent 7949855, May
- P. C. Mills, J. E. Lindholm, B. W. Coon, G. M. Tarolli, and J. M. Burgess. Scheduler in multi-threaded processor prioritizing instructions passing qualification rule. US Patent 7949855, May 2011.
- (2011)
- Mills, P.C.¹ Lindholm, J.E.² Coon, B.W.³ Tarolli, G.M.⁴ Burgess, J.M.⁵

26
- 84863342255
- Improving GPU performance via large warps and two-level warp scheduling
- December
- V. Narasiman, C. J. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. In MICRO 44: Proceedings of the 44th annual IEEE/ACM International Symposium on Microarchitecture, December 2011.
- (2011) MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
- Narasiman, V.¹ Lee, C.J.² Shebanow, M.³ Miftakhutdinov, R.⁴ Mutlu, O.⁵ Patt, Y.N.⁶

27
- 77951154340
- The GPU computing era
- March
- J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30:56-69, March 2010.
- (2010) IEEE Micro , vol.30 , pp. 56-69
- Nickolls, J.¹ Dally, W.J.²

28
- 85184640695
- NVIDIA CUDA SDK, 2010. http://www.nvidia.com/cuda/.
- (2010)

29
- 33644661238
- Content-addressable memory (CAM) circuits and architectures: A tutorial and survey
- March
- K. Pagiamtzis and A. Sheikholeslami. Content-addressable memory (CAM) circuits and architectures: a tutorial and survey. IEEE Journal of Solid-State Circuits, 41(3):712-727, March 2006.
- (2006) IEEE Journal of Solid-State Circuits , vol.41 , Issue.3 , pp. 712-727
- Pagiamtzis, K.¹ Sheikholeslami, A.²

30
- 34547456450
- Vector lane threading
- S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis. Vector lane threading. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 55-64, 2006.
- (2006) Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06 , pp. 55-64
- Rivoire, S.¹ Schultz, R.² Okuda, T.³ Kozyrakis, C.⁴

31
- 0029183524
- Simultaneous multithreading: Maximizing on-chip parallelism
- May
- D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. SIGARCH Comput. Archit. News, 23:392-403, May 1995.
- (1995) SIGARCH Comput. Archit. News , vol.23 , pp. 392-403
- Tullsen, D.M.¹ Eggers, S.J.² Levy, H.M.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.