SCOPUS Information Retrieval Platform

Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Volume , Issue , 2011, Pages

Fast implementation of DGEMM on Fermi GPU

(6) Tan, Guangming a; Li, Linchuan a; Triechle, Sean b; Phillips, Everett b; Bao, Yungang a; Sun, Ninghui a

a INSTITUTE OF GEOLOGY AND GEOPHYSICS (China)

b NVIDIA (United States)

Author keywords

CUDA; GPU; High performance computing; Matrix matrix multiplication

Indexed keywords

CUDA; FAST IMPLEMENTATION; GPU; HIGH PERFORMANCE COMPUTING; INSTRUCTION SCHEDULING; MACHINE LANGUAGES; MATRIX-MATRIX MULTIPLICATION; MEMORY HIERARCHY; MEMORY OPERATIONS; MICRO ARCHITECTURES; OPTIMAL ALGORITHM; OPTIMIZATION STRATEGY; PEAK PERFORMANCE; PERFORMANCE MODELING; SHARED MEMORIES; SOFTWARE PIPELINING;

ALGORITHMS; COMPUTER SOFTWARE SELECTION AND EVALUATION; MATRIX ALGEBRA; MEMORY ARCHITECTURE; MULTITASKING; OPTIMIZATION;

BENCHMARKING;

EID: 83155160943 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2063384.2063431 Document Type: Conference Paper

Times cited : (102)

References (17)

1
- 78149329064
- AMD. ATI Stream Computing OpenCL Programming Guide, rev1.05, August 2010.
- (2010) ATI Stream Computing OpenCL Programming Guide, Rev1.05

2
- 0003666392
- LAPACK: A portable linear algebra library for high-performance computers
- E. Anderson, Z. Bai, C. Bischof, J. W. Demmel, J. J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. Technical Report 20, LAPACK Working Note, May 1990.
- (1990) Technical Report 20, LAPACK Working Note
- Anderson, E.¹ Bai, Z.² Bischof, C.³ Demmel, J.W.⁴ Dongarra, J.J.⁵ Du Croz, J.⁶ Greenbaum, A.⁷ Hammarling, S.⁸ McKenney, A.⁹ Sorensen, D.C.¹⁰

3
- 20744452904
- Self-adapting linear algebra algorithms and software
- J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. Whaley, and K. Yelick. Self-adapting linear algebra algorithms and software. In Proceedings of the IEEE, volume 93, pages 293-312, 2005.
- (2005) Proceedings of the IEEE , vol.93 , pp. 293-312
- Demmel, J.¹ Dongarra, J.² Eijkhout, V.³ Fuentes, E.⁴ Petitet, A.⁵ Vuduc, R.⁶ Whaley, R.⁷ Yelick, K.⁸

4
- 0025402476
- A set of level 3 basic linear algebra subprograms
- DOI 10.1145/77626.79170
- J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16:1-17, March 1990. (Pubitemid 20684794)
- (1990) ACM Transactions on Mathematical Software , vol.16 , Issue.1 , pp. 1-17
- Dongarra, J.J.¹ Du Croz, J.² Hammarling, S.³ Duff, I.S.⁴

5
- 44249094647
- Anatomy of high-performance matrix multiplication
- K. Goto and R. A. v. d. Geijn. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34:12:1-12:25, May 2008.
- (2008) ACM Trans. Math. Softw.
- Goto, K.¹ van de Geijn, R.A.²

6
- 79958258447
- C. Jang. Gatlas gpu automatically tuned linear algebra software. http://golem5.org/gatlas/.
- Gatlas Gpu Automatically Tuned Linear Algebra Software
- Jang, C.¹

7
- 68849128792
- A note on auto-tuning GEMM for GPUs
- Y. Li, J. Dongarra, and S. Tomov. A note on auto-tuning gemm for gpus. In Proceedings of the 9th International Conference on Computational Science: Part I, ICCS'09, pages 884-892, Berlin, Heidelberg, 2009. Springer-Verlag.
- (2009) Proceedings of the 9th International Conference on Computational Science: Part I, ICCS'09 , pp. 884-892
- Li, Y.¹ Dongarra, J.² Tomov, S.³

8
- 81555213505
- A fast GEMM implementation on the Cypress GPU
- N. Nakasato. A fast gemm implementation on the cypress gpu. SIGMETRICS Perform. Eval. Rev., 38:50-55, March 2011.
- (2011) Sigmetrics Perform. Eval. Rev. , vol.38 , pp. 50-55
- Nakasato, N.¹

9
- 79958284905
- An improved MAGMA GEMM for Fermi GPUs
- R. Nath, S. Tomov, and J. Dongarra. An improved magma gemm for fermi gpus. Technical Report 227, LAPACK Working Note, July 2010.
- (2010) Technical Report 227, LAPACK Working Note
- Nath, R.¹ Tomov, S.² Dongarra, J.³

10
- 84886934561
- NVIDIA. Cuda Community Showcase. http://www.nvidia.com/object/cuda_apps_flash_new.html.
- Cuda Community Showcase

11
- 77951900491
- NVIDIA. Nvidia's next generation cuda compute architecture: Fermi. http://www.nvidia.com/object/fermi_architecture.html, 2009.
- (2009) Nvidia's Next Generation Cuda Compute Architecture: Fermi

12
- 82955212653
- NVIDIA. CUDA C Programming Guide, Version 3.2, 2010.
- (2010) CUDA C Programming Guide, Version 3.2

13
- 79959466764
- Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP'08, pages 73-82, New York, NY, USA, 2008. ACM.
- (2008) Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'08 , pp. 73-82
- Ryoo, S.¹ Rodrigues, C.I.² Baghsorkhi, S.S.³ Stone, S.S.⁴ Kirk, D.B.⁵ Hwu, W.-M.W.⁶

14
- 43449094719
- Program optimization space pruning for a multithreaded GPU
- DOI 10.1145/1356058.1356084, Proceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization
- S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu. Program optimization space pruning for a multithreaded gpu. In Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, CGO'08, pages 195-204, New York, NY, USA, 2008. ACM. (Pubitemid 351667266)
- (2008) Proceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization , pp. 195-204
- Ryoo, S.¹ Rodrigues, C.I.² Stone, S.S.³ Baghsorkhi, S.S.⁴ Ueng, S.-Z.⁵ Stratton, J.A.⁶ Hwu, W.-M.W.⁷

15
- 70350771131
- Benchmarking GPUs to tune dense linear algebra
- V. Volkov and J. W. Demmel. Benchmarking gpus to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC'08, pages 31:1-31:11, Piscataway, NJ, USA, 2008. IEEE Press.
- (2008) Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC'08
- Volkov, V.¹ Demmel, J.W.²

16
- 77952579552
- Demystifying GPU microarchitecture through microbenchmarking
- H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying gpu microarchitecture through microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software, ISPASS'10, pages 235-246, 2010.
- (2010) 2010 IEEE International Symposium on Performance Analysis of Systems & Software, ISPASS'10 , pp. 235-246
- Wong, H.¹ Papadopoulou, M.² Sadooghi-Alvandi, M.³ Moshovos, A.⁴

17
- 20744459570
- Is search really necessary to generate high-performance BLAS?
- K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance blas? In Proceedings of the IEEE, volume 93, pages 358-386, 2005.
- (2005) Proceedings of the IEEE , vol.93 , pp. 358-386
- Yotov, K.¹ Li, X.² Ren, G.³ Garzaran, M.⁴ Padua, D.⁵ Pingali, K.⁶ Stodghill, P.⁷

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.