SCOPUS Information Retrieval Platform

Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016

Volume , Issue , 2016, Pages 583-595

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

(4) Wang, Jin (a); Rubin, Norm (b); Sidelnik, Albert (b); Yalamanchili, Sudhakar (a)

a GEORGIA INSTITUTE OF TECHNOLOGY (United States)

b NVIDIA (United States)

Author keywords

dynamic parallelism; GPU; irregular applications; memory locality; thread block scheduler

Indexed keywords

COMPUTER ARCHITECTURE; PROGRAM PROCESSORS; ROUTERS; SCHEDULING;

BULK SYNCHRONOUS PARALLEL MODELS; IRREGULAR APPLICATIONS; MEMORY HIERARCHY; MEMORY LOCALITY; REFERENCE LOCALITIES; SCHEDULING DECISIONS; SCHEDULING STRATEGIES; THREAD BLOCK SCHEDULER;

MEMORY ARCHITECTURE;

EID: 84988443467 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/ISCA.2016.57 Document Type: Conference Paper

Times cited : (49)

References (41)

1
- 41249087856
- General purpose molecular dynamics simulations fully implemented on graphics processing units
- J. A. Anderson, C. D. Lorenz, and A. Travesset, "General purpose molecular dynamics simulations fully implemented on graphics processing units," Journal of Computational Physics, vol. 227, no. 10, 2008.
- (2008) Journal of Computational Physics , vol.227 , Issue.10
- Anderson, J.A.¹ Lorenz, C.D.² Travesset, A.³

2
- 36849056785
- Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the GPU
- Eurographics Association
- J. Mosegaard and T. S. Sorensen, "Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the GPU," in Proceedings of the 11th Eurographics Conference on Virtual Environments, pp. 105-111, Eurographics Association, 2005.
- (2005) Proceedings of the 11th Eurographics Conference on Virtual Environments , pp. 105-111
- Mosegaard, J.¹ Sorensen, T.S.²

3
- 77953140299
- V. Podlozhnyuk, "Black-Scholes option pricing," 2007.
- (2007) Black-Scholes Option Pricing
- Podlozhnyuk, V.¹

4
- 77956373685
- Optix: A general purpose ray tracing engine
- ACM
- S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, et al., "Optix: a general purpose ray tracing engine," in ACM Transactions on Graphics (TOG), vol. 29, p. 66, ACM, 2010.
- (2010) ACM Transactions on Graphics (TOG) , vol.29 , pp. 66
- Parker, S.G.¹ Bigler, J.² Dietrich, A.³ Friedrich, H.⁴ Hoberock, J.⁵ Luebke, D.⁶ McAllister, D.⁷ McGuire, M.⁸ Morley, K.⁹ Robison, A.¹⁰

5
- 84988434327
- NVIDIA
- NVIDIA, "CUDA dynamic parallelism programming guide," 2015.
- (2015) CUDA Dynamic Parallelism Programming Guide

6
- 79951728783
- Khronos
- Khronos, "The OpenCL specification version 2.0," 2014.
- (2014) The OpenCL Specification Version 2.0

7
- 84876590572
- Cache-conscious wavefront scheduling
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45), 2012.
- (2012) Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)
- Rogers, T.G.¹ O'Connor, M.² Aamodt, T.M.³

8
- 84892547586
- Divergence-aware warp scheduling
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware warp scheduling," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), pp. 99-110, 2013.
- (2013) Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46) , pp. 99-110
- Rogers, T.G.¹ O'Connor, M.² Aamodt, T.M.³

9
- 84863342255
- Improving GPU performance via large warps and two-level warp scheduling
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, (MICRO-44), 2011.
- (2011) Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, (MICRO-44)
- Narasiman, V.¹ Shebanow, M.² Lee, C.J.³ Miftakhutdinov, R.⁴ Mutlu, O.⁵ Patt, Y.N.⁶

10
- 84875640178
- OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance
- A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance," in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'13), 2013.
- (2013) Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'13)
- Jog, A.¹ Kayiran, O.² Chidambaram Nachiappan, N.³ Mishra, A.K.⁴ Kandemir, M.T.⁵ Mutlu, O.⁶ Iyer, R.⁷ Das, C.R.⁸

11
- 84903999614
- Warp-level divergence in GPUs: Characterization, impact, and mitigation
- P. Xiang, Y. Yang, and H. Zhou, "Warp-level divergence in GPUs: Characterization, impact, and mitigation," in Proceedings of 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA-20), 2014.
- (2014) Proceedings of 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA-20)
- Xiang, P.¹ Yang, Y.² Zhou, H.³

12
- 84887477265
- Neither more nor less: Optimizing thread-level parallelism for GPGPUs
- O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither more nor less: Optimizing thread-level parallelism for GPGPUs," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT'13), 2013.
- (2013) Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT'13)
- Kayiran, O.¹ Jog, A.² Kandemir, M.T.³ Das, C.R.⁴

13
- 84903951085
- Improving GPGPU resource utilization through alternative thread block scheduling
- M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu, "Improving GPGPU resource utilization through alternative thread block scheduling," in Proceedings of 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA-20), 2014.
- (2014) Proceedings of 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA-20)
- Lee, M.¹ Song, S.² Moon, J.³ Kim, J.⁴ Seo, W.⁵ Cho, Y.⁶ Ryu, S.⁷

14
- 84960122845
- CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads
- S.-Y. Lee, A. Arunkumar, and C.-J. Wu, "CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA-42), 2015.
- (2015) Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA-42)
- Lee, S.-Y.¹ Arunkumar, A.² Wu, C.-J.³

15
- 84946029581
- Characterization and analysis of dynamic parallelism in unstructured GPU applications
- J. Wang and S. Yalamanchili, "Characterization and analysis of dynamic parallelism in unstructured GPU applications," in Proceedings of 2014 IEEE International Symposium on Workload Characterization (IISWC'14), 2014.
- (2014) Proceedings of 2014 IEEE International Symposium on Workload Characterization (IISWC'14)
- Wang, J.¹ Yalamanchili, S.²

16
- 84960076275
- Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs
- J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, "Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA-42), 2015.
- (2015) Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA-42)
- Wang, J.¹ Rubin, N.² Sidelnik, A.³ Yalamanchili, S.⁴

17
- 84959927541
- Free launch: Optimizing GPU dynamic kernel launches through thread reuse
- G. Chen and X. Shen, "Free launch: Optimizing GPU dynamic kernel launches through thread reuse," in Proceedings of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-48), 2015.
- (2015) Proceedings of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-48)
- Chen, G.¹ Shen, X.²

18
- 84896893237
- CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications
- Y. Yang and H. Zhou, "CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'14), 2014.
- (2014) Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'14)
- Yang, Y.¹ Zhou, H.²

19
- 84991702740
- Accelerating irregular algorithms on GPGPUs using fine-grain hardware worklists
- J. Kim and C. Batten, "Accelerating irregular algorithms on GPGPUs using fine-grain hardware worklists," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), 2014.
- (2014) Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47)
- Kim, J.¹ Batten, C.²

20
- 84962232247
- Locality exists in graph processing: Workload characterization on an Ivy Bridge server
- S. Beamer, K. Asanovic, and D. Patterson, "Locality exists in graph processing: Workload characterization on an Ivy Bridge server," in Proceedings of the 2015 IEEE International Symposium on Workload Characterization (IISWC'15), 2015.
- (2015) Proceedings of the 2015 IEEE International Symposium on Workload Characterization (IISWC'15)
- Beamer, S.¹ Asanovic, K.² Patterson, D.³

21
- 84892519096
- A locality-aware memory hierarchy for energy-efficient GPU architectures
- M. Rhu, M. Sullivan, J. Leng, and M. Erez, "A locality-aware memory hierarchy for energy-efficient GPU architectures," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 2013.
- (2013) Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46)
- Rhu, M.¹ Sullivan, M.² Leng, J.³ Erez, M.⁴

22
- 78751505898
- A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
- S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in Proceedings of 2010 IEEE International Symposium on Workload Characterization (IISWC'10), 2010.
- (2010) Proceedings of 2010 IEEE International Symposium on Workload Characterization (IISWC'10)
- Che, S.¹ Sheaffer, J.W.² Boyer, M.³ Szafaryn, L.G.⁴ Wang, L.⁵ Skadron, K.⁶

23
- 84969827416
- NVIDIA
- NVIDIA, "CUDA C programming guide," 2015.
- (2015) CUDA C Programming Guide

24
- 84905509992
- Enabling preemptive multiprogramming on GPUs
- I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero, "Enabling preemptive multiprogramming on GPUs," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA-41), 2014.
- (2014) Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA-41)
- Tanasic, I.¹ Gelado, I.² Cabezas, J.³ Ramirez, A.⁴ Navarro, N.⁵ Valero, M.⁶

25
- 84999126318
- NVIDIA
- NVIDIA, "NVIDIA GeForce GTX 980 whitepaper," 2014.
- (2014) NVIDIA GeForce GTX 980 Whitepaper

26
- 70349169075
- Analyzing CUDA workloads using a detailed GPU simulator
- A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proceedings of 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09), 2009.
- (2009) Proceedings of 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09)
- Bakhoda, A.¹ Yuan, G.² Fung, W.³ Wong, H.⁴ Aamodt, T.⁵

27
- 84960162410
- Thermodynamic states in explosion fields
- Coeur d'Alene Resort, ID, USA
- A. Kuhl, "Thermodynamic states in explosion fields," in 14th International Symposium on Detonation, Coeur d'Alene Resort, ID, USA, 2010.
- (2010) 14th International Symposium on Detonation
- Kuhl, A.¹

28
- 84858427151
- An efficient CUDA implementation of the tree-based Barnes Hut n-body algorithm
- M. Burtscher and K. Pingali, "An efficient CUDA implementation of the tree-based Barnes Hut n-body algorithm," GPU Computing Gems Emerald Edition, p. 75, 2011.
- (2011) GPU Computing Gems Emerald Edition , p. 75
- Burtscher, M.¹ Pingali, K.²

29
- 84858391043
- Scalable GPU graph traversal
- D. Merrill, M. Garland, and A. Grimshaw, "Scalable GPU graph traversal," in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12), 2012.
- (2012) Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12)
- Merrill, D.¹ Garland, M.² Grimshaw, A.³

30
- 84864332206
- D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, "10th DIMACS implementation challenge: Graph partitioning and graph clustering," 2011.
- (2011) 10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering
- Bader, D.A.¹ Meyerhenke, H.² Sanders, P.³ Wagner, D.⁴

31
- 84939428494
- Efficient graph matching and coloring on the GPU
- J. Cohen and P. Castonguay, "Efficient graph matching and coloring on the GPU," in GPU Technology Conference, 2012.
- (2012) GPU Technology Conference
- Cohen, J.¹ Castonguay, P.²

32
- 80052350460
- Gregex: GPU based high speed regular expression matching engine
- IEEE
- L. Wang, S. Chen, Y. Tang, and J. Su, "Gregex: GPU based high speed regular expression matching engine," in Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on, pp. 366-370, IEEE, 2011.
- (2011) Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on , pp. 366-370
- Wang, L.¹ Chen, S.² Tang, Y.³ Su, J.⁴

33
- 85019691440
- Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory
- J. McHugh, "Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory," ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262-294, 2000.
- (2000) ACM Transactions on Information and System Security , vol.3 , Issue.4 , pp. 262-294
- McHugh, J.¹

34
- 84893303174
- GPU accelerated item-based collaborative filtering for bigdata applications
- C. H. Nadungodage, Y. Xia, J. J. Lee, M. Lee, and C. S. Park, "GPU accelerated item-based collaborative filtering for bigdata applications," in Proceedings of 2013 IEEE International Conference on Big Data, 2013.
- (2013) Proceedings of 2013 IEEE International Conference on Big Data
- Nadungodage, C.H.¹ Xia, Y.² Lee, J.J.³ Lee, M.⁴ Park, C.S.⁵

35
- 85015559680
- An algorithmic framework for performing collaborative filtering
- J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl, "An algorithmic framework for performing collaborative filtering," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
- (1999) Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
- Herlocker, J.L.¹ Konstan, J.A.² Borchers, A.³ Riedl, J.⁴

36
- 84875193084
- Relational algorithms for multi-bulk-synchronous processors
- G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili, "Relational algorithms for multi-bulk-synchronous processors," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'13), 2013.
- (2013) Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'13)
- Diamos, G.¹ Wu, H.² Wang, J.³ Lele, A.⁴ Yalamanchili, S.⁵

37
- 70349191933
- Lonestar: A suite of parallel irregular programs
- M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali, "Lonestar: A suite of parallel irregular programs," in Proceedings of 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09), 2009.
- (2009) Proceedings of 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09)
- Kulkarni, M.¹ Burtscher, M.² Cascaval, C.³ Pingali, K.⁴

38
- 79951700098
- Improving SIMT efficiency of global rendering algorithms with architectural support for dynamic micro-kernels
- M. Steffen and J. Zambreno, "Improving SIMT efficiency of global rendering algorithms with architectural support for dynamic micro-kernels," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), 2010.
- (2010) Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43)
- Steffen, M.¹ Zambreno, J.²

39
- 84870410502
- Nested data-parallelism on the GPU
- ACM
- L. Bergstrom and J. Reppy, "Nested data-parallelism on the GPU," in ACM SIGPLAN Notices, vol. 47, pp. 247-258, ACM, 2012.
- (2012) ACM SIGPLAN Notices , vol.47 , pp. 247-258
- Bergstrom, L.¹ Reppy, J.²

40
- 84905454859
- Fine-grain task aggregation and coordination on GPUs
- M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood, "Fine-grain task aggregation and coordination on GPUs," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA-41), 2014.
- (2014) Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA-41)
- Orr, M.S.¹ Beckmann, B.M.² Reinhardt, S.K.³ Wood, D.A.⁴

41
- 85047004205
- Locality-aware mapping of nested parallel patterns on GPUs
- H. Lee, K. Brown, A. Sujeeth, T. Rompf, and K. Olukotun, "Locality-aware mapping of nested parallel patterns on GPUs," in Proceedings of the 47th International Symposium on Microarchitecture (MICRO-47), 2014.
- (2014) Proceedings of the 47th International Symposium on Microarchitecture (MICRO-47)
- Lee, H.¹ Brown, K.² Sujeeth, A.³ Rompf, T.⁴ Olukotun, K.⁵

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.