SCOPUS Information Retrieval Platform

Proceedings - International Symposium on Computer Architecture

Volume 13-17-June-2015, 2015, Pages 515-527

CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads

(3) Lee, Shin Ying ᵃ; Arunkumar, Akhil ᵃ; Wu, Carole Jean ᵃ

ᵃ Arizona State University (United States)

Author keywords

[No Author keywords available]

Indexed keywords

COMPUTER ARCHITECTURE; COMPUTER GRAPHICS; COMPUTER GRAPHICS EQUIPMENT; CRITICALITY (NUCLEAR FISSION); MEMORY ARCHITECTURE; PROGRAM PROCESSORS; SCHEDULING;

CHIP MULTIPROCESSOR; EVALUATION RESULTS; GENERAL PURPOSE GPU; GRAPHICS PROCESSING UNIT; IMPROVE PERFORMANCE; MEMORY ACCESS LATENCY; PARALLEL WORKLOADS; RESOURCE UTILIZATIONS;

WEAVING;

EID: 84960122845 PISSN: 10636897 EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2749469.2750418 Document Type: Conference Paper

Times cited : (67)

References (39)

1
- 70349169075
- Analyzing CUDA workloads using a detailed GPU simulator
- Boston, MA, USA, April
- A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09), Boston, MA, USA, April 2009.
- (2009) Proc. of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09)
- Bakhoda, A.¹ Yuan, G.² Fung, W.W.L.³ Wong, H.⁴ Aamodt, T.M.⁵

2
- 70450245578
- Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors
- Austin, TX, USA, June
- A. Bhattacharjee and M. Martonosi, "Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors," in Proc. of the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA'09), Austin, TX, USA, June 2009.
- (2009) Proc. of the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA'09)
- Bhattacharjee, A.¹ Martonosi, M.²

3
- 84881143721
- Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior
- Tel Aviv, Israel, June
- K. D. Bois, S. Eyerman, J. B. Sartor, and L. Eeckhout, "Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior," in Proc. of the 40th IEEE/ACM International Symposium on Computer Architecture (ISCA'13), Tel Aviv, Israel, June 2013.
- (2013) Proc. of the 40th IEEE/ACM International Symposium on Computer Architecture (ISCA'13)
- Bois, K.D.¹ Eyerman, S.² Sartor, J.B.³ Eeckhout, L.⁴

4
- 84873458159
- A quantitative study of irregular programs on GPUs
- San Diego, CA, USA, November
- M. Burtscher, R. Nasre, and K. Pingali, "A quantitative study of irregular programs on GPUs," in Proc. of the 2012 IEEE International Symposium on Workload Characterization (IISWC'12), San Diego, CA, USA, November 2012.
- (2012) Proc. of the 2012 IEEE International Symposium on Workload Characterization (IISWC'12)
- Burtscher, M.¹ Nasre, R.² Pingali, K.³

5
- 70649092154
- Rodinia: A benchmark suite for heterogeneous computing
- Austin, TX, USA, October
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Proc. of the 2009 IEEE International Symposium on Workload Characterization (IISWC'09), Austin, TX, USA, October 2009.
- (2009) Proc. of the 2009 IEEE International Symposium on Workload Characterization (IISWC'09)
- Che, S.¹ Boyer, M.² Meng, J.³ Tarjan, D.⁴ Sheaffer, J.W.⁵ Lee, S.-H.⁶ Skadron, K.⁷

6
- 78751505898
- A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
- Atlanta, GA, USA, December
- S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in Proc. of the 2010 IEEE International Symposium on Workload Characterization (IISWC'10), Atlanta, GA, USA, December 2010.
- (2010) Proc. of the 2010 IEEE International Symposium on Workload Characterization (IISWC'10)
- Che, S.¹ Sheaffer, J.W.² Boyer, M.³ Szafaryn, L.G.⁴ Wang, L.⁵ Skadron, K.⁶

7
- 84957545426
- Adaptive cache management for energy-efficient GPU computing
- Cambridge, UK, December
- X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, and W.-M. Hwu, "Adaptive cache management for energy-efficient GPU computing," in Proc. of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'14), Cambridge, UK, December 2014.
- (2014) Proc. of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'14)
- Chen, X.¹ Chang, L.-W.² Rodrigues, C.I.³ Lv, J.⁴ Hwu, W.-M.⁵

8
- 84863348772
- Parallel application memory scheduling
- Porto Alegre, Brazil, December
- E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt, "Parallel application memory scheduling," in Proc. of the 44th International Symposium on Microarchitecture (MICRO'11), Porto Alegre, Brazil, December 2011.
- (2011) Proc. of the 44th International Symposium on Microarchitecture (MICRO'11)
- Ebrahimi, E.¹ Miftakhutdinov, R.² Fallin, C.³ Lee, C.J.⁴ Joao, J.A.⁵ Mutlu, O.⁶ Patt, Y.N.⁷

9
- 79955923056
- Thread block compaction for efficient SIMT control flow
- San Antonio, TX, USA, February
- W. W. L. Fung and T. M. Aamodt, "Thread block compaction for efficient SIMT control flow," in Proc. of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA'11), San Antonio, TX, USA, February 2011.
- (2011) Proc. of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA'11)
- Fung, W.W.L.¹ Aamodt, T.M.²

10
- 84960184351
- Dynamic warp formation and scheduling for efficient GPU control flow
- Saint-Malo, France, June
- W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient GPU control flow," in Proc. of the 37th IEEE/ACM International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France, June 2010.
- (2010) Proc. of the 37th IEEE/ACM International Symposium on Computer Architecture (ISCA'10)
- Fung, W.W.L.¹ Sham, I.² Yuan, G.³ Aamodt, T.M.⁴

11
- 80052533471
- Energy-efficient mechanisms for managing thread context in throughput processors
- San Jose, CA, USA, June
- M. Gebhart, R. D. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "Energy-efficient mechanisms for managing thread context in throughput processors," in Proc. of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA'11), San Jose, CA, USA, June 2011.
- (2011) Proc. of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA'11)
- Gebhart, M.¹ Johnson, R.D.² Tarjan, D.³ Keckler, S.W.⁴ Dally, W.J.⁵ Lindholm, E.⁶ Skadron, K.⁷

12
- 77954998134
- High performance cache replacement using re-reference interval prediction (RRIP)
- Saint-Malo, France, June
- A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, "High performance cache replacement using re-reference interval prediction (RRIP)," in Proc. of the 37th IEEE/ACM International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France, June 2010.
- (2010) Proc. of the 37th IEEE/ACM International Symposium on Computer Architecture (ISCA'10)
- Jaleel, A.¹ Theobald, K.B.² Steely, S.C.³ Emer, J.⁴

13
- 84864068497
- Characterizing and improving the use of demand-fetched caches in GPUs
- Venice, Italy, June
- W. Jia, K. A. Shaw, and M. Martonosi, "Characterizing and improving the use of demand-fetched caches in GPUs," in Proc. of the 20th ACM International Conference on Supercomputing (ICS'12), Venice, Italy, June 2012.
- (2012) Proc. of the 20th ACM International Conference on Supercomputing (ICS'12)
- Jia, W.¹ Shaw, K.A.² Martonosi, M.³

14
- 84903985058
- MRPB: Memory request prioritization for massively parallel processors
- Orlando, FL, USA, February
- W. Jia, K. A. Shaw, and M. Martonosi, "MRPB: memory request prioritization for massively parallel processors," in Proc. of the 20th IEEE International Symposium on High Performance Computer Architecture (HPCA'14), Orlando, FL, USA, February 2014.
- (2014) Proc. of the 20th IEEE International Symposium on High Performance Computer Architecture (HPCA'14)
- Jia, W.¹ Shaw, K.A.² Martonosi, M.³

15
- 84881126240
- Orchestrated scheduling and prefetching for GPGPUs
- Tel-Aviv, Israel, June
- A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "Orchestrated scheduling and prefetching for GPGPUs," in Proc. of the 40th IEEE/ACM International Symposium on Computer Architecture (ISCA'13), Tel-Aviv, Israel, June 2013.
- (2013) Proc. of the 40th IEEE/ACM International Symposium on Computer Architecture (ISCA'13)
- Jog, A.¹ Kayiran, O.² Mishra, A.K.³ Kandemir, M.T.⁴ Mutlu, O.⁵ Iyer, R.⁶ Das, C.R.⁷

16
- 84875640178
- OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance
- Houston, TX, USA, March
- A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance," in Proc. of the 18th IEEE/ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'13), Houston, TX, USA, March 2013.
- (2013) Proc. of the 18th IEEE/ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'13)
- Jog, A.¹ Kayiran, O.² Nachiappan, N.C.³ Mishra, A.K.⁴ Kandemir, M.T.⁵ Mutlu, O.⁶ Iyer, R.⁷ Das, C.R.⁸

17
- 52949085794
- Cache replacement based on reuse-distance prediction
- Lake Tahoe, CA, USA, October
- G. Keramidas, P. Petoumenos, and S. Kaxiras, "Cache replacement based on reuse-distance prediction," in Proc. of the 25th IEEE International Conference on Computer Design (ICCD'07), Lake Tahoe, CA, USA, October 2007.
- (2007) Proc. of the 25th IEEE International Conference on Computer Design (ICCD'07)
- Keramidas, G.¹ Petoumenos, P.² Kaxiras, S.³

18
- 79951697650
- Sampling dead block prediction for last-level caches
- Atlanta, GA, USA, December
- S. Khan, Y. Tian, and D. Jimenez, "Sampling dead block prediction for last-level caches," in Proc. of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO'10), Atlanta, GA, USA, December 2010.
- (2010) Proc. of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO'10)
- Khan, S.¹ Tian, Y.² Jimenez, D.³

19
- 0034851536
- Dead-block prediction & dead-block correlating prefetchers
- A.-C. Lai, C. Fide, and B. Falsafi, "Dead-block prediction & dead-block correlating prefetchers," in Proc. of the 28th IEEE/ACM International Symposium on Computer Architecture (ISCA'01), 2001.
- (2001) Proc. of the 28th IEEE/ACM International Symposium on Computer Architecture (ISCA'01)
- Lai, A.-C.¹ Fide, C.² Falsafi, B.³

20
- 84907073162
- CAWS: Criticality-aware warp scheduling for GPGPU workloads
- Edmonton, AB, Canada, August
- S.-Y. Lee and C.-J. Wu, "CAWS: Criticality-aware warp scheduling for GPGPU workloads," in Proc. of the 23rd IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT'14), Edmonton, AB, Canada, August 2014.
- (2014) Proc. of the 23rd IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT'14)
- Lee, S.-Y.¹ Wu, C.-J.²

21
- 84904472216
- Characterizing the latency hiding ability of GPUs
- Monterey, CA, USA, March
- S.-Y. Lee and C.-J. Wu, "Characterizing the latency hiding ability of GPUs," in Proc. of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14) as Poster Abstract, Monterey, CA, USA, March 2014.
- (2014) Proc. of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14) As Poster Abstract
- Lee, S.-Y.¹ Wu, C.-J.²

22
- 44849137198
- NVIDIA Tesla: A unified graphics and computing architecture
- March
- E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, pp. 39-55, March 2008.
- (2008) IEEE Micro , vol.28 , pp. 39-55
- Lindholm, E.¹ Nickolls, J.² Oberman, S.³ Montrym, J.⁴

23
- 77954976292
- Dynamic warp subdivision for integrated branch and memory divergence tolerance
- Saint-Malo, France, June
- J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in Proc. of the 37th IEEE/ACM International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France, June 2010.
- (2010) Proc. of the 37th IEEE/ACM International Symposium on Computer Architecture (ISCA'10)
- Meng, J.¹ Tarjan, D.² Skadron, K.³

24
- 84863342255
- Improving GPU performance via large warps and two-level warp scheduling
- Porto Alegre, Brazil, December
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in Proc. of the 44th International Symposium on Microarchitecture (MICRO'11), Porto Alegre, Brazil, December 2011.
- (2011) Proc. of the 44th International Symposium on Microarchitecture (MICRO'11)
- Narasiman, V.¹ Shebanow, M.² Lee, C.J.³ Miftakhutdinov, R.⁴ Mutlu, O.⁵ Patt, Y.N.⁶

25
- 84960126117
- NVIDIA
- NVIDIA, "PTX ISA," 2009. Available: http://www.nvidia.com/content/CUDA-ptx-isa-1.4.pdf
- (2009) PTX ISA

26
- 84960184365
- NVIDIA
- NVIDIA, "NVIDIA CUDA C programming guide v4.2," 2012. Available: http://developer.nvidia.com/nvidia-gpu-computing-documentation
- (2012) NVIDIA CUDA C Programming Guide , vol.4 , Issue.2

27
- 84999126318
- NVIDIA
- September
- NVIDIA, "NVIDIA GeForce GTX 980: Featuring Maxwell, the most advanced GPU ever made," September 2014.
- (2014) NVIDIA GeForce GTX 980: Featuring Maxwell, the Most Advanced GPU Ever Made

28
- 84946053358
- Microarchitectural performance characterization of irregular GPU kernels
- Raleigh, NC, USA, October
- M. A. O'Neil and M. Burtscher, "Microarchitectural performance characterization of irregular GPU kernels," in Proc. of the 2014 IEEE International Symposium on Workload Characterization (IISWC'14), Raleigh, NC, USA, October 2014.
- (2014) Proc. of the 2014 IEEE International Symposium on Workload Characterization (IISWC'14)
- O'Neil, M.A.¹ Burtscher, M.²

29
- 35348920021
- Adaptive insertion policies for high performance caching
- San Diego, CA, USA, June
- M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer, "Adaptive insertion policies for high performance caching," in Proc. of the 34th IEEE/ACM International Symposium on Computer Architecture (ISCA'07), San Diego, CA, USA, June 2007.
- (2007) Proc. of the 34th IEEE/ACM International Symposium on Computer Architecture (ISCA'07)
- Qureshi, M.K.¹ Jaleel, A.² Patt, Y.N.³ Steely, S.C.⁴ Emer, J.⁵

30
- 41349120769
- Set-dueling-controlled adaptive insertion for high-performance caching
- January
- M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer, "Set-dueling-controlled adaptive insertion for high-performance caching," IEEE Micro, vol. 28, no. 1, pp. 91-98, January 2008.
- (2008) IEEE Micro , vol.28 , Issue.1 , pp. 91-98
- Qureshi, M.K.¹ Jaleel, A.² Patt, Y.N.³ Steely, S.C.⁴ Emer, J.⁵

31
- 34548042910
- Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches
- Orlando, FL, USA, December
- M. K. Qureshi and Y. N. Patt, "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in Proc. of the 39th IEEE/ACM International Symposium on Microarchitecture (MICRO'06), Orlando, FL, USA, December 2006.
- (2006) Proc. of the 39th IEEE/ACM International Symposium on Microarchitecture (MICRO'06)
- Qureshi, M.K.¹ Patt, Y.N.²

32
- 84880298026
- The dual-path execution model for efficient GPU control flow
- Shenzhen, China, February
- M. Rhu and M. Erez, "The dual-path execution model for efficient GPU control flow," in Proc. of the 19th IEEE International Symposium on High Performance Computer Architecture (HPCA'13), Shenzhen, China, February 2013.
- (2013) Proc. of the 19th IEEE International Symposium on High Performance Computer Architecture (HPCA'13)
- Rhu, M.¹ Erez, M.²

33
- 84892519096
- A locality-aware memory hierarchy for energy-efficient GPU architecture
- Davis, CA, USA, December
- M. Rhu, M. Sullivan, J. Leng, and M. Erez, "A locality-aware memory hierarchy for energy-efficient GPU architecture," in Proc. of the 46th International Symposium on Microarchitecture (MICRO'13), Davis, CA, USA, December 2013.
- (2013) Proc. of the 46th International Symposium on Microarchitecture (MICRO'13)
- Rhu, M.¹ Sullivan, M.² Leng, J.³ Erez, M.⁴

34
- 84876590572
- Cache-conscious wavefront scheduling
- Vancouver, BC, Canada, December
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proc. of the 45th IEEE/ACM International Symposium on Microarchitecture (MICRO'12), Vancouver, BC, Canada, December 2012.
- (2012) Proc. of the 45th IEEE/ACM International Symposium on Microarchitecture (MICRO'12)
- Rogers, T.G.¹ O'Connor, M.² Aamodt, T.M.³

35
- 84892547586
- Divergence-aware warp scheduling
- Davis, CA, USA, December
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware warp scheduling," in Proc. of the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO'13), Davis, CA, USA, December 2013.
- (2013) Proc. of the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO'13)
- Rogers, T.G.¹ O'Connor, M.² Aamodt, T.M.³

36
- 84921758691
- The Parboil technical report
- University of Illinois Urbana-Champaign, Champaign, IL, USA, March
- J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu, "The Parboil technical report," in IMPACT Technical Report (IMPACT-12-01), University of Illinois Urbana-Champaign, Champaign, IL, USA, March 2012.
- (2012) IMPACT Technical Report (IMPACT-12-01)
- Stratton, J.A.¹ Rodrigues, C.² Sung, I.-J.³ Obeid, N.⁴ Chang, L.-W.⁵ Anssari, N.⁶ Liu, G.D.⁷ Hwu, W.-M.W.⁸

37
- 84881183039
- SIMD divergence optimization through intra-warp compaction
- Tel Aviv, Israel, June
- A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, and M. Azimi, "SIMD divergence optimization through intra-warp compaction," in Proc. of the IEEE/ACM 40th International Symposium on Computer Architecture (ISCA'13), Tel Aviv, Israel, June 2013.
- (2013) Proc. of the IEEE/ACM 40th International Symposium on Computer Architecture (ISCA'13)
- Vaidya, A.S.¹ Shayesteh, A.² Woo, D.H.³ Saharoy, R.⁴ Azimi, M.⁵

38
- 84863389330
- SHiP: Signature-based hit predictor for high performance caching
- Porto Alegre, Brazil, December
- C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, "SHiP: signature-based hit predictor for high performance caching," in Proc. of the 44th IEEE/ACM International Symposium on Microarchitecture (MICRO'11), Porto Alegre, Brazil, December 2011.
- (2011) Proc. of the 44th IEEE/ACM International Symposium on Microarchitecture (MICRO'11)
- Wu, C.-J.¹ Jaleel, A.² Hasenplaugh, W.³ Martonosi, M.⁴ Steely, S.C.⁵ Emer, J.⁶

39
- 84893396474
- An efficient compiler framework for cache bypassing on GPUs
- San Jose, CA, USA, November
- X. Xie, Y. Liang, G. Sun, and D. Chen, "An efficient compiler framework for cache bypassing on GPUs," in Proc. of the 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD'13), San Jose, CA, USA, November 2013.
- (2013) Proc. of the 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD'13)
- Xie, X.¹ Liang, Y.² Sun, G.³ Chen, D.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.