[1] T. Aila and S. Laine. Understanding the Efficiency of Ray Traversal on GPUs. In HPG'09, pages 145-149, 2009.
[3] G. Barthe, J. M. Crespo, S. Gulwani, C. Kunz, and M. Marron. From Relational Verification to SIMD Loop Synthesis. In PPoPP'13, pages 123-134, 2013.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In PPoPP'95, pages 207-216, 1995.
[5] J. Chhugani, C. Kim, H. Shukla, J. Park, P. Dubey, J. Shalf, and H. D. Simon. Billion-particle SIMD-friendly Two-point Correlation on Large-scale HPC Cluster Systems. In SC'12, pages 1:1-1:11, 2012.
[6] Cilk. Cilk. http://supertech.csail.mit.edu/cilk/.
[7] H. Dammertz, J. Hanika, and A. Keller. Shallow Bounding Volume Hierarchies for Fast SIMD Ray Tracing of Incoherent Rays. In EGSR'08, pages 1225-1233, 2008.
[8] J. S. Danaher, I.-T. A. Lee, and C. E. Leiserson. Programming with Exceptions in JCilk. Sci. Comput. Program., 63(2):147-171, Dec. 2006.
[9] J. O. Eklundh. A Fast Computer Method for Matrix Transposing. IEEE Trans. Comput., 21(7):801-803, July 1972.
[10] M. Frigo, C. E. Leiserson, and K. H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In PLDI'98, pages 212-223, 1998.
[11] M. Frigo, P. Halpern, C. E. Leiserson, and S. Lewin-Berlin. Reducers and Other Cilk++ Hyperobjects. In SPAA'09, pages 79-90, 2009.
[12] B. Gaster and L. Howes. Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck? Computer, 45(8):42-52, August 2012.
[13] Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and Help-first Scheduling Policies for Async-finish Task Parallelism. In IPDPS'09, pages 1-12, 2009.
[15] L. Hernquist. Vectorization of Tree Traversals. J. Comput. Phys., 87(1):137-147, Mar. 1990.
[16] P. Hudak and E. Mohr. Graphinators and the Duality of SIMD and MIMD. In LFP'88, pages 224-234, 1988.
[17] X. Huo, S. Krishnamoorthy, and G. Agrawal. Efficient Scheduling of Recursive Control Flow on GPUs. In ICS'13, pages 409-420, 2013.
[18] Y. Jo and M. Kulkarni. Enhancing Locality for Recursive Traversals of Recursive Structures. In OOPSLA'11, pages 463-482, 2011.
[19] Y. Jo, M. Goldfarb, and M. Kulkarni. Automatic Vectorization of Tree Traversals. In PACT'13, pages 363-374, 2013.
[20] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In SIGMOD'10, pages 339-350, 2010.
[21] S. Kim and H. Han. Efficient SIMD Code Generation for Irregular Kernels. In PPoPP'12, pages 55-64, 2012.
[22] S. Krishnamoorthy, G. Baumgartner, D. Cociorva, C.-C. Lam, and P. Sadayappan. Efficient Parallel Out-of-core Matrix Transposition. International Journal of High Performance Computing and Networking, 2(2):110-119, 2004.
[23] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua. An Evaluation of Vectorizing Compilers. In PACT'11, pages 372-382, 2011.
[25] D. Nuzman and A. Zaks. Outer-loop Vectorization: Revisited for Short SIMD Architectures. In PACT'08, pages 2-11, 2008.
[26] NVIDIA. CUDA. http://www.nvidia.com/object/cuda-home-new.html.
[27] S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An Unbalanced Tree Search Benchmark. In LCPC'06, pages 235-250, 2007.
[29] M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood. Fine-grain Task Aggregation and Coordination on GPUs. In ISCA'14, pages 181-192, 2014.
[30] M. Puschel, J. M. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, 93(2):232-275, 2005.
[32] B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. SIMD Parallelization of Applications that Traverse Irregular Data Structures. In CGO'13, pages 1-10, 2013.
[33] M. Steffen and J. Zambreno. Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels. In MICRO'43, pages 237-248, 2010.
[34] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. IEEE Des. Test, 12(3):66-73, May 2010.
[35] TPL. The Task Parallel Library. http://msdn.microsoft.com/en-us/magazine/cc163340.aspx, Oct. 2007.
[36] S. Tzeng, A. Patney, and J. D. Owens. Task Management for Irregular-Parallel Workloads on the GPU. In HPG'10, pages 29-37, 2010.