SCOPUS Information Retrieval Platform

Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors

2011, Pages 35-42

A high-performance, low-power linear algebra core

Authors (3): Pedram, Ardavan a; Gerstlauer, Andreas a; Van De Geijn, Robert A. b

a The University of Texas at Austin (United States)

b University of Texas at Austin (United States)

Author keywords

[No Author keywords available]

Indexed keywords

45NM TECHNOLOGY; CURRENT COMPONENT; CUSTOM HARDWARES; FEASIBILITY STUDIES; LOW POWER; MATRIX COMPUTATION; ORDERS OF MAGNITUDE; REDUCING POWER; TECHNOLOGY SCALING;

ALGEBRA; COMPUTER ARCHITECTURE; EFFICIENCY;

MATRIX ALGEBRA;

EID: 80055100054 PISSN: 10636862 EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/ASAP.2011.6043234 Document Type: Conference Paper

Times cited : (22)

References (46)

1
- 77954995378
- Understanding sources of inefficiency in general-purpose chips
- R. Hameed et al., "Understanding sources of inefficiency in general-purpose chips," ISCA, 2010.
- (2010) ISCA
- Hameed, R.¹

2
- 11944272996
- The cost of flexibility in systems on a chip design for signal processing applications
- N. Zhang and R. W. Brodersen, "The cost of flexibility in systems on a chip design for signal processing applications," University of California, Berkeley, Tech. Rep., 2002.
- (2002) University of California, Berkeley, Tech. Rep.
- Zhang, N.¹ Brodersen, R.W.²

3
- 40149087224
- AsAP: An asynchronous array of simple processors
- Z. Yu et al., "AsAP: An asynchronous array of simple processors," IEEE Journal of Solid-State Circuits, vol. 43, no. 3, pp. 695-705, March 2008.
- (2008) IEEE Journal of Solid-State Circuits , vol.43 , Issue.3 , pp. 695-705
- Yu, Z.¹

4
- 0025402476
- A set of level 3 basic linear algebra subprograms
- J. Dongarra et al., "A set of level 3 basic linear algebra subprograms," ACM Trans. Math. Soft., vol. 16, no. 1, 1990.
- (1990) ACM Trans. Math. Soft. , vol.16 , Issue.1
- Dongarra, J.¹

5
- 44249094647
- Anatomy of a high-performance matrix multiplication
- K. Goto and R. van de Geijn, "Anatomy of a high-performance matrix multiplication," ACM Trans. Math. Soft., vol. 34, no. 3, p. 12, May 2008.
- (2008) ACM Trans. Math. Soft. , vol.34 , Issue.3 , pp. 12
- Goto, K.¹ Van De Geijn, R.²

6
- 48849089104
- High-performance implementation of the level-3 BLAS
- K. Goto and R. van de Geijn, "High-performance implementation of the level-3 BLAS," ACM Trans. Math. Softw., vol. 35, no. 1, pp. 1-14, 2008.
- (2008) ACM Trans. Math. Softw. , vol.35 , Issue.1 , pp. 1-14
- Goto, K.¹ Van De Geijn, R.²

7
- 80055078946
- Intel math kernel library
- "Intel Math Kernel Library," Intel, User's Guide 314774-009US, 2009.
- (2009) Intel, User's Guide 314774-009US

8
- 0003278639
- Automatically tuned linear algebra software
- R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in SC, 1998.
- (1998) SC
- Whaley, R.C.¹ Dongarra, J.J.²

9
- 78651269052
- Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
- K. Fatahalian et al., "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," ACM SIGGRAPH/EUROGRAPHICS, 2004.
- (2004) ACM SIGGRAPH/Eurographics
- Fatahalian, K.¹

10
- 72049102909
- Performance analysis of memory transfers and GEMM subroutines on NVIDIA tesla GPU cluster
- V. Allada et al., "Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster," CLUSTER'09, 2009.
- (2009) CLUSTER'09
- Allada, V.¹

11
- 70350771131
- Benchmarking GPUs to tune dense linear algebra
- V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra," SC, 2008.
- (2008) SC
- Volkov, V.¹ Demmel, J.²

12
- 79958284905
- An improved MAGMA GEMM for Fermi GPUs
- R. Nath et al., "An improved MAGMA GEMM for Fermi GPUs," LAPACK WN #227, Tech. Rep., 2010.
- (2010) LAPACK WN #227, Tech. Rep.
- Nath, R.¹

13
- 0035341920
- Hyper-systolic matrix multiplication
- T. Lippert et al., "Hyper-systolic matrix multiplication," Parallel Computing, 2001.
- (2001) Parallel Computing
- Lippert, T.¹

14
- 79551552121
- FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods
- K. Sano et al., "FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods," ACM Trans. Reconfigurable Technol. Syst., vol. 3, pp. 21:1-21:35, November 2010.
- (2010) ACM Trans. Reconfigurable Technol. Syst. , vol.3 , pp. 21:1-21:35
- Sano, K.¹

15
- 80055080909
- Design and power performance evaluation of on-chip memory processor with arithmetic accelerators
- C. Takahashi et al., "Design and power performance evaluation of on-chip memory processor with arithmetic accelerators," IWIA, 2008.
- (2008) IWIA
- Takahashi, C.¹

16
- 70450237431
- Rigel: An architecture and scalable programming interface for a 1000-core accelerator
- J. Kelm et al., "Rigel: an architecture and scalable programming interface for a 1000-core accelerator," ISCA, 2009.
- (2009) ISCA
- Kelm, J.¹

17
- 85008053864
- An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS
- S. Vangal et al., "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE J. of Solid-State Circuits, vol. 43, no. 1, 2008.
- (2008) IEEE J. of Solid-state Circuits , vol.43 , Issue.1
- Vangal, S.¹

18
- 80055072024
- CSX700 floating point processor
- "CSX700 Floating Point Processor," ClearSpeed Technology Ltd, Datasheet 06-PD-1425 Rev 1, 2011.
- (2011) ClearSpeed Technology Ltd, Datasheet 06-PD-1425 Rev 1

19
- 43249098087
- A matrix product accelerator for field programmable systems on chip
- P. Zicari et al., "A matrix product accelerator for field programmable systems on chip," Microprocessors and Microsystems 32, 2008.
- (2008) Microprocessors and Microsystems 32
- Zicari, P.¹

20
- 34047144377
- Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems
- L. Zhuo and V. Prasanna, "Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems," IEEE Trans. on Parallel and Distributed Systems, vol. 18, no. 4, 2007.
- (2007) IEEE Trans. on Parallel and Distributed Systems , vol.18 , Issue.4
- Zhuo, L.¹ Prasanna, V.²

21
- 50149101744
- Floating-point matrix multiplication in a polymorphic processor
- G. Kuzmanov and W. van Oijen, "Floating-point matrix multiplication in a polymorphic processor," ICFPT, pp. 249 - 252, 2007.
- (2007) ICFPT , pp. 249-252
- Kuzmanov, G.¹ Van Oijen, W.²

22
- 0000667923
- The torus-wrap mapping for dense matrix calculations on massively parallel computers
- B. A. Hendrickson and D. E. Womble, "The Torus-Wrap mapping for dense matrix calculations on massively parallel computers," SIAM J. Sci. Stat. Comput., vol. 15, no. 5, 1994.
- (1994) SIAM J. Sci. Stat. Comput , vol.15 , Issue.5
- Hendrickson, B.A.¹ Womble, D.E.²

23
- 0002924772
- ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers
- J. Choi et al., "ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers," in Proc. of the 4th Symp. on the Frontiers of Massively Parallel Computation. IEEE, 1992.
- (1992) Proc. of the 4th Symp. on the Frontiers of Massively Parallel Computation
- Choi, J.¹

24
- 33749527748
- A 6.2-GFlops floating-point multiply-accumulator with conditional normalization
- S. Vangal et al., "A 6.2-GFlops floating-point multiply-accumulator with conditional normalization," IEEE J. of Solid-State Circuits, vol. 41, no. 10, 2006.
- (2006) IEEE J. of Solid-state Circuits , vol.41 , Issue.10
- Vangal, S.¹

25
- 80055070134
- Towards a high-performance, low-power linear algebra processor
- A. Pedram et al., "Towards a high-performance, low-power linear algebra processor," Computer Engineering Research Center, The University of Texas at Austin, Tech. Rep. UT-CERC-10-03, September 2010.
- (2010) The University of Texas at Austin, Tech. Rep. UT-CERC-10-03
- Pedram, A.¹

26
- 50249180329
- Floating-point fused multiply-add architectures
- E. Quinnell et al., "Floating-point fused multiply-add architectures," ACSSC, pp. 331 - 337, 2007.
- (2007) ACSSC , pp. 331-337
- Quinnell, E.¹

27
- 77949949584
- A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm
- S. Jain et al., "A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm," VLSID, 2010.
- (2010) VLSID
- Jain, S.¹

28
- 75449106575
- Low-power multiple-precision iterative floating-point multiplier with SIMD support
- D. Tan et al., "Low-power multiple-precision iterative floating-point multiplier with SIMD support," IEEE Trans. on Computers, vol. 58, no. 2, 2009.
- (2009) IEEE Trans. on Computers , vol.58 , Issue.2
- Tan, D.¹

29
- 80055039127
- Energy-efficient floating point unit design
- S. Galal and M. Horowitz, "Energy-efficient floating point unit design," IEEE Trans. on Computers, vol. PP, no. 99, 2010.
- (2010) IEEE Trans. on Computers , vol.PP , Issue.99
- Galal, S.¹ Horowitz, M.²

30
- 80055073323
- CACTI:5.0 an integrated cache timing, power, and area model
- T. Shyamkumar et al., "CACTI:5.0 an integrated cache timing, power, and area model," HP Laboratories Palo Alto, Technical Report HPL- 2007-167, 2007.
- (2007) HP Laboratories Palo Alto, Technical Report HPL-2007-167
- Shyamkumar, T.¹

31
- 33749033522
- Energy model of networks-on-chip and a bus
- P. Wolkotte, "Energy model of networks-on-chip and a bus," System-on- Chip, pp. 82 - 85, 2005.
- (2005) System-on-chip , pp. 82-85
- Wolkotte, P.¹

32
- 16244376510
- Power analysis of system-level on-chip communication architectures
- K. Lahiri and A. Raghunathan, "Power analysis of system-level on-chip communication architectures," CODES+ISSS, 2004.
- (2004) CODES+ISSS
- Lahiri, K.¹ Raghunathan, A.²

33
- 76749146060
- McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures
- S. Li et al., "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," MICRO, 2009.
- (2009) Micro
- Li, S.¹

34
- 0033719421
- Wattch: A framework for architectural-level power analysis and optimizations
- D. Brooks et al., "Wattch: a framework for architectural-level power analysis and optimizations," ISCA, pp. 83 - 94, 2000.
- (2000) ISCA , pp. 83-94
- Brooks, D.¹

35
- 77954994853
- An integrated GPU power and performance model
- S. Hong and H. Kim, "An integrated GPU power and performance model," ISCA, Jun 2010.
- (2010) ISCA
- Hong, S.¹ Kim, H.²

36
- 77952579552
- Demystifying GPU microarchitecture through microbenchmarking
- H. Wong et al., "Demystifying GPU microarchitecture through microbenchmarking," ISPASS, pp. 235 - 246, 2010.
- (2010) ISPASS , pp. 235-246
- Wong, H.¹

37
- 80055087211
- Samsung DDR3 SDRAM: High-performance, energy-efficient memory for todays green computing platforms
- "Samsung DDR3 SDRAM: High-Performance, Energy-Efficient Memory for Todays Green Computing Platforms," SAMSUNG Green Memory, Tech. Rep., March 2009.
- (2009) Samsung Green Memory, Tech. Rep.

38
- 78649470097
- Fermi computer architecture white paper
- "Fermi computer architecture white paper," NVIDIA, Technical Report, 2009.
- (2009) NVIDIA, Technical Report

39
- 80052539683
- Inside fermi: Nvidia's HPC push
- D. Kanter, "Inside Fermi: Nvidia's HPC push," Real World Technologies, Tech. Rep., September 2009.
- (2009) Real World Technologies, Tech. Rep.
- Kanter, D.¹

40
- 51349166333
- Penryn: 45-nm next generation intel® core™ 2 processor
- V George et al., "Penryn: 45-nm next generation Intel® core™ 2 processor," A-SSCC, Jan 2008.
- (2008) A-SSCC
- George, V.¹

41
- 77951476028
- High-performance floating-point implementation using FPGAs
- M. Parker, "High-performance floating-point implementation using FPGAs," in MILCOM, 2009.
- (2009) MILCOM
- Parker, M.¹

42
- 34247349114
- The potential of the cell processor for scientific computing
- S. Williams et al., "The potential of the Cell processor for scientific computing," in CF, 2006.
- (2006) CF
- Williams, S.¹

43
- 80055099661
- Performance of a multicore matrix multiplication library
- F. Lauginiger et al., "Performance of a multicore matrix multiplication library," STMCS, Jan 2007.
- (2007) STMCS
- Lauginiger, F.¹

44
- 0032155271
- GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark
- B. Kagstrom et al., "GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark," ACM Trans. Math. Soft, vol. 24, no. 3, 1998.
- (1998) ACM Trans. Math. Soft , vol.24 , Issue.3
- Kagstrom, B.¹

45
- 0003706460
- LAPACK Users' Guide (third ed.)
- E. Anderson et al., LAPACK Users' Guide (third ed.). Philadelphia, PA, USA: SIAM, 1999.
- (1999) Philadelphia, PA, USA: SIAM
- Anderson, E.¹

46
- 80055093472
- F. Van Zee, libflame: The Complete Reference. www.lulu.com, 2009.
- (2009) libflame: The Complete Reference
- Van Zee, F.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.