-
1
-
-
77954995378
-
Understanding sources of inefficiency in general-purpose chips
-
R. Hameed et al., "Understanding sources of inefficiency in general-purpose chips," ISCA, 2010.
-
(2010)
ISCA
-
-
Hameed, R.1
-
3
-
-
40149087224
-
AsAP: An asynchronous array of simple processors
-
Z. Yu et al., "AsAP: An asynchronous array of simple processors," IEEE Journal of Solid-State Circuits, vol. 43, no. 3, pp. 695-705, Mar. 2008.
-
(2008)
Solid-state Circuits
, vol.43
, Issue.3
, pp. 695-705
-
-
Yu, Z.1
-
4
-
-
0025402476
-
A set of level 3 basic linear algebra subprograms
-
J. Dongarra et al., "A set of level 3 basic linear algebra subprograms," ACM Trans. Math. Soft., vol. 16, no. 1, 1990.
-
(1990)
ACM Trans. Math. Soft.
, vol.16
, Issue.1
-
-
Dongarra, J.1
-
5
-
-
44249094647
-
Anatomy of a high-performance matrix multiplication
-
K. Goto and R. van de Geijn, "Anatomy of a high-performance matrix multiplication," ACM Trans. Math. Soft., vol. 34, no. 3, p. 12, May 2008.
-
(2008)
ACM Trans. Math. Soft.
, vol.34
, Issue.3
, pp. 12
-
-
Goto, K.1
Van De Geijn, R.2
-
6
-
-
48849089104
-
High-performance implementation of the level-3 BLAS
-
K. Goto and R. van de Geijn, "High-performance implementation of the level-3 BLAS," ACM Trans. Math. Softw., vol. 35, no. 1, pp. 1-14, 2008.
-
(2008)
ACM Trans. Math. Softw.
, vol.35
, Issue.1
, pp. 1-14
-
-
Goto, K.1
Van De Geijn, R.2
-
8
-
-
0003278639
-
Automatically tuned linear algebra software
-
R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in SC, 1998.
-
(1998)
SC
-
-
Whaley, R.C.1
Dongarra, J.J.2
-
9
-
-
78651269052
-
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
-
K. Fatahalian et al., "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," ACM SIGGRAPH/EUROGRAPHICS, 2004.
-
(2004)
ACM SIGGRAPH/Eurographics
-
-
Fatahalian, K.1
-
10
-
-
72049102909
-
Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster
-
V. Allada et al., "Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster," CLUSTER'09, 2009.
-
(2009)
CLUSTER'09
-
-
Allada, V.1
-
11
-
-
70350771131
-
Benchmarking GPUs to tune dense linear algebra
-
V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra," SC, 2008.
-
(2008)
SC
-
-
Volkov, V.1
Demmel, J.2
-
13
-
-
0035341920
-
Hyper-systolic matrix multiplication
-
T. Lippert et al., "Hyper-systolic matrix multiplication," Parallel Computing, 2001.
-
(2001)
Parallel Computing
-
-
Lippert, T.1
-
14
-
-
79551552121
-
FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods
-
K. Sano et al., "FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods," ACM Trans. Reconfigurable Technol. Syst., vol. 3, pp. 21:1-21:35, November 2010.
-
(2010)
ACM Trans. Reconfigurable Technol. Syst.
, vol.3
, pp. 21:1-21:35
-
-
Sano, K.1
-
15
-
-
80055080909
-
Design and power performance evaluation of on-chip memory processor with arithmetic accelerators
-
C. Takahashi et al., "Design and power performance evaluation of on-chip memory processor with arithmetic accelerators," IWIA, 2008.
-
(2008)
IWIA
-
-
Takahashi, C.1
-
16
-
-
70450237431
-
Rigel: An architecture and scalable programming interface for a 1000-core accelerator
-
J. Kelm et al., "Rigel: an architecture and scalable programming interface for a 1000-core accelerator," ISCA, 2009.
-
(2009)
ISCA
-
-
Kelm, J.1
-
17
-
-
85008053864
-
An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS
-
S. Vangal et al., "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE J. of Solid-State Circuits, vol. 43, no. 1, 2008.
-
(2008)
IEEE J. of Solid-state Circuits
, vol.43
, Issue.1
-
-
Vangal, S.1
-
18
-
-
80055072024
-
-
ClearSpeed Technology Ltd, Datasheet 06-PD-1425 Rev 1
-
"CSX700 Floating Point Processor," ClearSpeed Technology Ltd, Datasheet 06-PD-1425 Rev 1, 2011.
-
(2011)
CSX700 Floating Point Processor
-
-
-
19
-
-
43249098087
-
A matrix product accelerator for field programmable systems on chip
-
P. Zicari et al., "A matrix product accelerator for field programmable systems on chip," Microprocessors and Microsystems 32, 2008.
-
(2008)
Microprocessors and Microsystems 32
-
-
Zicari, P.1
-
20
-
-
34047144377
-
Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems
-
L. Zhuo and V. Prasanna, "Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems," IEEE Trans. on Parallel and Distributed Systems, vol. 18, no. 4, 2007.
-
(2007)
IEEE Trans. on Parallel and Distributed Systems
, vol.18
, Issue.4
-
-
Zhuo, L.1
Prasanna, V.2
-
21
-
-
50149101744
-
Floating-point matrix multiplication in a polymorphic processor
-
G. Kuzmanov and W. van Oijen, "Floating-point matrix multiplication in a polymorphic processor," ICFPT, pp. 249-252, 2007.
-
(2007)
ICFPT
, pp. 249-252
-
-
Kuzmanov, G.1
Van Oijen, W.2
-
22
-
-
0000667923
-
The torus-wrap mapping for dense matrix calculations on massively parallel computers
-
B. A. Hendrickson and D. E. Womble, "The Torus-Wrap mapping for dense matrix calculations on massively parallel computers," SIAM J. Sci. Stat. Comput., vol. 15, no. 5, 1994.
-
(1994)
SIAM J. Sci. Stat. Comput
, vol.15
, Issue.5
-
-
Hendrickson, B.A.1
Womble, D.E.2
-
24
-
-
33749527748
-
A 6.2-GFlops floating-point multiply-accumulator with conditional normalization
-
S. Vangal et al., "A 6.2-GFlops floating-point multiply-accumulator with conditional normalization," IEEE J. of Solid-State Circuits, vol. 41, no. 10, 2006.
-
(2006)
IEEE J. of Solid-state Circuits
, vol.41
, Issue.10
-
-
Vangal, S.1
-
25
-
-
80055070134
-
Towards a high-performance, low-power linear algebra processor
-
A. Pedram et al., "Towards a high-performance, low-power linear algebra processor," Computer Engineering Research Center, The University of Texas at Austin, Tech. Rep. UT-CERC-10-03, September 2010.
-
(2010)
The University of Texas at Austin, Tech. Rep. UT-CERC-10-03
-
-
Pedram, A.1
-
26
-
-
50249180329
-
Floating-point fused multiply-add architectures
-
E. Quinnell et al., "Floating-point fused multiply-add architectures," ACSSC, pp. 331-337, 2007.
-
(2007)
ACSSC
, pp. 331-337
-
-
Quinnell, E.1
-
27
-
-
77949949584
-
A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm
-
S. Jain et al., "A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm," VLSID, 2010.
-
(2010)
VLSID
-
-
Jain, S.1
-
28
-
-
75449106575
-
Low-power multiple-precision iterative floating-point multiplier with SIMD support
-
D. Tan et al., "Low-power multiple-precision iterative floating-point multiplier with SIMD support," IEEE Trans. on Computers, vol. 58, no. 2, 2009.
-
(2009)
IEEE Trans. on Computers
, vol.58
, Issue.2
-
-
Tan, D.1
-
29
-
-
80055039127
-
Energy-efficient floating point unit design
-
S. Galal and M. Horowitz, "Energy-efficient floating point unit design," IEEE Trans. on Computers, vol. PP, no. 99, 2010.
-
(2010)
IEEE Trans. on Computers
, vol.PP
, Issue.99
-
-
Galal, S.1
Horowitz, M.2
-
30
-
-
80055073323
-
CACTI:5.0 an integrated cache timing, power, and area model
-
T. Shyamkumar et al., "CACTI:5.0 an integrated cache timing, power, and area model," HP Laboratories Palo Alto, Technical Report HPL-2007-167, 2007.
-
(2007)
HP Laboratories Palo Alto, Technical Report HPL-2007-167
-
-
Shyamkumar, T.1
-
31
-
-
33749033522
-
Energy model of networks-on-chip and a bus
-
P. Wolkotte, "Energy model of networks-on-chip and a bus," System-on-Chip, pp. 82-85, 2005.
-
(2005)
System-on-chip
, pp. 82-85
-
-
Wolkotte, P.1
-
32
-
-
16244376510
-
Power analysis of system-level on-chip communication architectures
-
K. Lahiri and A. Raghunathan, "Power analysis of system-level on-chip communication architectures," CODES+ISSS, 2004.
-
(2004)
CODES+ISSS
-
-
Lahiri, K.1
Raghunathan, A.2
-
33
-
-
76749146060
-
McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures
-
S. Li et al., "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," MICRO, 2009.
-
(2009)
Micro
-
-
Li, S.1
-
34
-
-
0033719421
-
Wattch: A framework for architectural-level power analysis and optimizations
-
D. Brooks et al., "Wattch: A framework for architectural-level power analysis and optimizations," ISCA, pp. 83-94, 2000.
-
(2000)
ISCA
, pp. 83-94
-
-
Brooks, D.1
-
35
-
-
77954994853
-
An integrated GPU power and performance model
-
S. Hong and H. Kim, "An integrated GPU power and performance model," ISCA, Jun 2010.
-
(2010)
ISCA
-
-
Hong, S.1
Kim, H.2
-
36
-
-
77952579552
-
Demystifying GPU microarchitecture through microbenchmarking
-
H. Wong et al., "Demystifying GPU microarchitecture through microbenchmarking," ISPASS, pp. 235-246, 2010.
-
(2010)
ISPASS
, pp. 235-246
-
-
Wong, H.1
-
37
-
-
80055087211
-
Samsung DDR3 SDRAM: High-performance, energy-efficient memory for today's green computing platforms
-
"Samsung DDR3 SDRAM: High-Performance, Energy-Efficient Memory for Today's Green Computing Platforms," SAMSUNG Green Memory, Tech. Rep., March 2009.
-
(2009)
Samsung Green Memory, Tech. Rep.
-
-
-
38
-
-
78649470097
-
Fermi computer architecture white paper
-
"Fermi computer architecture white paper," NVIDIA, Technical Report, 2009.
-
(2009)
NVIDIA, Technical Report
-
-
-
40
-
-
51349166333
-
Penryn: 45-nm next generation Intel® Core™ 2 processor
-
V. George et al., "Penryn: 45-nm next generation Intel® Core™ 2 processor," A-SSCC, Jan 2008.
-
(2008)
A-SSCC
-
-
George, V.1
-
41
-
-
77951476028
-
High-performance floating-point implementation using FPGAs
-
M. Parker, "High-performance floating-point implementation using FPGAs," in MILCOM, 2009.
-
(2009)
MILCOM
-
-
Parker, M.1
-
42
-
-
34247349114
-
The potential of the Cell processor for scientific computing
-
S. Williams et al., "The potential of the Cell processor for scientific computing," in CF, 2006.
-
(2006)
CF
-
-
Williams, S.1
-
43
-
-
80055099661
-
Performance of a multicore matrix multiplication library
-
F. Lauginiger et al., "Performance of a multicore matrix multiplication library," STMCS, Jan 2007.
-
(2007)
STMCS
-
-
Lauginiger, F.1
-
44
-
-
0032155271
-
GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark
-
B. Kagstrom et al., "GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark," ACM Trans. Math. Soft, vol. 24, no. 3, 1998.
-
(1998)
ACM Trans. Math. Soft
, vol.24
, Issue.3
-
-
Kagstrom, B.1
-
45
-
-
0003706460
-
-
(third ed.). Philadelphia, PA, USA: SIAM
-
E. Anderson et al., LAPACK Users' Guide (third ed.). Philadelphia, PA, USA: SIAM, 1999.
-
(1999)
LAPACK Users' Guide
-
-
Anderson, E.1