Volume , Issue , 2011, Pages 35-42

A high-performance, low-power linear algebra core

Author keywords

[No Author keywords available]

Indexed keywords

45NM TECHNOLOGY; CURRENT COMPONENT; CUSTOM HARDWARES; FEASIBILITY STUDIES; LOW POWER; MATRIX COMPUTATION; ORDERS OF MAGNITUDE; REDUCING POWER; TECHNOLOGY SCALING;

EID: 80055100054     PISSN: 10636862     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/ASAP.2011.6043234     Document Type: Conference Paper
Times cited : (22)

References (46)
  • 1
    • 77954995378
    • Understanding sources of inefficiency in general-purpose chips
    • R. Hameed et al., "Understanding sources of inefficiency in general-purpose chips," ISCA, 2010.
    • (2010) ISCA
    • Hameed, R.1
  • 3
    • 40149087224
    • AsAP: An asynchronous array of simple processors
    • Z. Yu et al., "AsAP: An asynchronous array of simple processors," IEEE Journal of Solid-State Circuits, vol. 43, no. 3, pp. 695-705, March 2008.
    • (2008) IEEE Journal of Solid-State Circuits , vol.43 , Issue.3 , pp. 695-705
    • Yu, Z.1
  • 4
    • 0025402476
    • A set of level 3 basic linear algebra subprograms
    • J. Dongarra et al., "A set of level 3 basic linear algebra subprograms," ACM Trans. Math. Soft., vol. 16, no. 1, 1990.
    • (1990) ACM Trans. Math. Soft. , vol.16 , Issue.1
    • Dongarra, J.1
  • 5
    • 44249094647
    • Anatomy of a high-performance matrix multiplication
    • K. Goto and R. van de Geijn, "Anatomy of a high-performance matrix multiplication," ACM Trans. Math. Soft., vol. 34, no. 3, p. 12, May 2008.
    • (2008) ACM Trans. Math. Soft. , vol.34 , Issue.3 , p. 12
    • Goto, K.1    Van De Geijn, R.2
  • 6
    • 48849089104
    • High-performance implementation of the level-3 BLAS
    • K. Goto and R. van de Geijn, "High-performance implementation of the level-3 BLAS," ACM Trans. Math. Softw., vol. 35, no. 1, pp. 1-14, 2008.
    • (2008) ACM Trans. Math. Softw. , vol.35 , Issue.1 , pp. 1-14
    • Goto, K.1    Van De Geijn, R.2
  • 8
    • 0003278639
    • Automatically tuned linear algebra software
    • R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in SC, 1998.
    • (1998) SC
    • Whaley, R.C.1    Dongarra, J.J.2
  • 9
    • 78651269052
    • Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
    • K. Fatahalian et al., "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," ACM SIGGRAPH/EUROGRAPHICS, 2004.
    • (2004) ACM SIGGRAPH/Eurographics
    • Fatahalian, K.1
  • 10
    • 72049102909
    • Performance analysis of memory transfers and GEMM subroutines on NVIDIA tesla GPU cluster
    • V. Allada et al., "Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster," CLUSTER'09, 2009.
    • (2009) CLUSTER'09
    • Allada, V.1
  • 11
    • 70350771131
    • Benchmarking GPUs to tune dense linear algebra
    • V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra," SC, 2008.
    • (2008) SC
    • Volkov, V.1    Demmel, J.2
  • 13
    • 0035341920
    • Hyper-systolic matrix multiplication
    • T. Lippert et al., "Hyper-systolic matrix multiplication," Parallel Computing, 2001.
    • (2001) Parallel Computing
    • Lippert, T.1
  • 14
    • 79551552121
    • FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods
    • K. Sano et al., "FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods," ACM Trans. Reconfigurable Technol. Syst., vol. 3, pp. 21:1-21:35, November 2010.
    • (2010) ACM Trans. Reconfigurable Technol. Syst. , vol.3 , pp. 21:1-21:35
    • Sano, K.1
  • 15
    • 80055080909
    • Design and power performance evaluation of on-chip memory processor with arithmetic accelerators
    • C. Takahashi et al., "Design and power performance evaluation of on-chip memory processor with arithmetic accelerators," IWIA, 2008.
    • (2008) IWIA
    • Takahashi, C.1
  • 16
    • 70450237431
    • Rigel: An architecture and scalable programming interface for a 1000-core accelerator
    • J. Kelm et al., "Rigel: an architecture and scalable programming interface for a 1000-core accelerator," ISCA, 2009.
    • (2009) ISCA
    • Kelm, J.1
  • 17
    • 85008053864
    • An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS
    • S. Vangal et al., "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE J. of Solid-State Circuits, vol. 43, no. 1, 2008.
    • (2008) IEEE J. of Solid-state Circuits , vol.43 , Issue.1
    • Vangal, S.1
  • 18
    • 80055072024
    • CSX700 Floating Point Processor
    • "CSX700 Floating Point Processor," ClearSpeed Technology Ltd, Datasheet 06-PD-1425 Rev 1, 2011.
    • (2011) CSX700 Floating Point Processor
  • 19
    • 43249098087
    • A matrix product accelerator for field programmable systems on chip
    • P. Zicari et al., "A matrix product accelerator for field programmable systems on chip," Microprocessors and Microsystems 32, 2008.
    • (2008) Microprocessors and Microsystems 32
    • Zicari, P.1
  • 20
    • 34047144377
    • Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems
    • L. Zhuo and V. Prasanna, "Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems," IEEE Trans. on Parallel and Distributed Systems, vol. 18, no. 4, 2007.
    • (2007) IEEE Trans. on Parallel and Distributed Systems , vol.18 , Issue.4
    • Zhuo, L.1    Prasanna, V.2
  • 21
    • 50149101744
    • Floating-point matrix multiplication in a polymorphic processor
    • G. Kuzmanov and W. van Oijen, "Floating-point matrix multiplication in a polymorphic processor," ICFPT, pp. 249 - 252, 2007.
    • (2007) ICFPT , pp. 249-252
    • Kuzmanov, G.1    Van Oijen, W.2
  • 22
    • 0000667923
    • The torus-wrap mapping for dense matrix calculations on massively parallel computers
    • B. A. Hendrickson and D. E. Womble, "The Torus-Wrap mapping for dense matrix calculations on massively parallel computers," SIAM J. Sci. Stat. Comput., vol. 15, no. 5, 1994.
    • (1994) SIAM J. Sci. Stat. Comput , vol.15 , Issue.5
    • Hendrickson, B.A.1    Womble, D.E.2
  • 24
    • 33749527748
    • A 6.2-GFlops floating-point multiply-accumulator with conditional normalization
    • S. Vangal et al., "A 6.2-GFlops floating-point multiply-accumulator with conditional normalization," IEEE J. of Solid-State Circuits, vol. 41, no. 10, 2006.
    • (2006) IEEE J. of Solid-state Circuits , vol.41 , Issue.10
    • Vangal, S.1
  • 25
    • 80055070134
    • Towards a high-performance, low-power linear algebra processor
    • A. Pedram et al., "Towards a high-performance, low-power linear algebra processor," Computer Engineering Research Center, The University of Texas at Austin, Tech. Rep. UT-CERC-10-03, September 2010.
    • (2010) The University of Texas at Austin, Tech. Rep. UT-CERC-10-03
    • Pedram, A.1
  • 26
    • 50249180329
    • Floating-point fused multiply-add architectures
    • E. Quinnell et al., "Floating-point fused multiply-add architectures," ACSSC, pp. 331 - 337, 2007.
    • (2007) ACSSC , pp. 331-337
    • Quinnell, E.1
  • 27
    • 77949949584
    • A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm
    • S. Jain et al., "A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm," VLSID, 2010.
    • (2010) VLSID
    • Jain, S.1
  • 28
    • 75449106575
    • Low-power multiple-precision iterative floating-point multiplier with SIMD support
    • D. Tan et al., "Low-power multiple-precision iterative floating-point multiplier with SIMD support," IEEE Trans. on Computers, vol. 58, no. 2, 2009.
    • (2009) IEEE Trans. on Computers , vol.58 , Issue.2
    • Tan, D.1
  • 29
    • 80055039127
    • Energy-efficient floating point unit design
    • S. Galal and M. Horowitz, "Energy-efficient floating point unit design," IEEE Trans. on Computers, vol. PP, no. 99, 2010.
    • (2010) IEEE Trans. on Computers , vol.PP , Issue.99
    • Galal, S.1    Horowitz, M.2
  • 31
    • 33749033522
    • Energy model of networks-on-chip and a bus
    • P. Wolkotte, "Energy model of networks-on-chip and a bus," System-on-Chip, pp. 82 - 85, 2005.
    • (2005) System-on-chip , pp. 82-85
    • Wolkotte, P.1
  • 32
    • 16244376510
    • Power analysis of system-level on-chip communication architectures
    • K. Lahiri and A. Raghunathan, "Power analysis of system-level on-chip communication architectures," CODES+ISSS, 2004.
    • (2004) CODES+ISSS
    • Lahiri, K.1    Raghunathan, A.2
  • 33
    • 76749146060
    • McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures
    • S. Li et al., "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," MICRO, 2009.
    • (2009) Micro
    • Li, S.1
  • 34
    • 0033719421
    • Wattch: A framework for architectural-level power analysis and optimizations
    • D. Brooks et al., "Wattch: a framework for architectural-level power analysis and optimizations," ISCA, pp. 83 - 94, 2000.
    • (2000) ISCA , pp. 83-94
    • Brooks, D.1
  • 35
    • 77954994853
    • An integrated GPU power and performance model
    • S. Hong and H. Kim, "An integrated GPU power and performance model," ISCA, Jun 2010.
    • (2010) ISCA
    • Hong, S.1    Kim, H.2
  • 36
    • 77952579552
    • Demystifying GPU microarchitecture through microbenchmarking
    • H. Wong et al., "Demystifying GPU microarchitecture through microbenchmarking," ISPASS, pp. 235 - 246, 2010.
    • (2010) ISPASS , pp. 235-246
    • Wong, H.1
  • 37
    • 80055087211
    • Samsung DDR3 SDRAM: High-performance, energy-efficient memory for todays green computing platforms
    • "Samsung DDR3 SDRAM: High-Performance, Energy-Efficient Memory for Todays Green Computing Platforms," SAMSUNG Green Memory, Tech. Rep., March 2009.
    • (2009) Samsung Green Memory, Tech. Rep.
  • 38
    • 78649470097
    • Fermi computer architecture white paper
    • "Fermi computer architecture white paper," NVIDIA, Technical Report, 2009.
    • (2009) NVIDIA, Technical Report
  • 40
    • 51349166333
    • Penryn: 45-nm next generation Intel® Core™ 2 processor
    • V. George et al., "Penryn: 45-nm next generation Intel® Core™ 2 processor," A-SSCC, Jan 2008.
    • (2008) A-SSCC
    • George, V.1
  • 41
    • 77951476028
    • High-performance floating-point implementation using FPGAs
    • M. Parker, "High-performance floating-point implementation using FPGAs," in MILCOM, 2009.
    • (2009) MILCOM
    • Parker, M.1
  • 42
    • 34247349114
    • The potential of the cell processor for scientific computing
    • S. Williams et al., "The potential of the Cell processor for scientific computing," in CF, 2006.
    • (2006) CF
    • Williams, S.1
  • 43
    • 80055099661
    • Performance of a multicore matrix multiplication library
    • F. Lauginiger et al., "Performance of a multicore matrix multiplication library," STMCS, Jan 2007.
    • (2007) STMCS
    • Lauginiger, F.1
  • 44
    • 0032155271
    • GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark
    • B. Kagstrom et al., "GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark," ACM Trans. Math. Soft, vol. 24, no. 3, 1998.
    • (1998) ACM Trans. Math. Soft , vol.24 , Issue.3
    • Kagstrom, B.1
  • 45
    • 0003706460
    • LAPACK Users' guide (third ed.)
    • E. Anderson et al., LAPACK Users' guide (third ed.). Philadelphia, PA, USA: SIAM, 1999.
    • (1999) LAPACK Users' Guide
    • Anderson, E.1


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.