SCOPUS Information Retrieval Platform

Volume 93, Issue 2, 2005, Pages 358-385

Is search really necessary to generate high-performance BLAS?

(7) Yotov, Kamen a Li, Xiaoming b Ren, Gang b Garzarán, María Jesús b Padua, David b Pingali, Keshav a Stodghill, Paul a

a School of Operations Research and Information Engineering (United States)

b University of Illinois at Urbana Champaign (United States)

Author keywords

Basic Linear Algebra Subprograms (BLAS); Compilers; Empirical optimization; High performance computing; Library generators; Model driven optimization; Program optimization

Indexed keywords

COMPUTER OPERATING SYSTEMS; COMPUTER SOFTWARE; MATHEMATICAL TRANSFORMATIONS; OPTIMIZATION; PARAMETER ESTIMATION; PROGRAM COMPILERS;

BASIC LINEAR ALGEBRA SUBPROGRAMS; EMPIRICAL OPTIMIZATION; HIGH PERFORMANCE COMPUTING; LIBRARY GENERATORS; MODEL-DRIVEN OPTIMIZATION; PROGRAM OPTIMIZATION;

SEARCH ENGINES;

EID: 20744459570 PISSN: 00189219 EISSN: None Source Type: Journal
DOI: 10.1109/JPROC.2004.840444 Document Type: Conference Paper

Times cited : (108)

References (43)

1
- 20744439191
- [Online]
- ATLAS home page [Online]. Available: http://math-atlas.sourceforge.net/
- ATLAS Home Page

2
- 20744450712
- [Online]
- PHiPAC home page [Online]. Available: http://www.icsi.berkeley.edu/~bilmes/phipac
- PHiPAC Home Page

3
- 0028427170
- Improving performance of linear algebra algorithms for dense matrices using algorithmic prefetch
- R. C. Agarwal, F. G. Gustavson, and M. Zubair, "Improving performance of linear algebra algorithms for dense matrices using algorithmic prefetch," IBM J. Res. Develop., vol. 38, no. 3, pp. 265-275, 1994.
- (1994) IBM J. Res. Develop. , vol.38 , Issue.3 , pp. 265-275
- Agarwal, R.C.¹ Gustavson, F.G.² Zubair, M.³

4
- 0037952146
- San Francisco, CA: Morgan Kaufmann
- R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures. San Francisco, CA: Morgan Kaufmann, 2002.
- (2002) Optimizing Compilers for Modern Architectures
- Allen, R.¹ Kennedy, K.²

5
- 0003207812
- Unimodular transformations of double loops
- Cambridge, MA: MIT Press, ch. 10
- U. Banerjee, "Unimodular transformations of double loops," in Advances in Languages and Compilers for Parallel Processing. Cambridge, MA: MIT Press, 1991, ch. 10, pp. 192-219.
- (1991) Advances in Languages and Compilers for Parallel Processing , pp. 192-219
- Banerjee, U.¹

6
- 0030661485
- Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology
- Vienna, Austria
- J. Bilmes, K. Asanović, C.-w. Chin, and J. Demmel, "Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology," presented at the Int. Conf. Supercomputing, Vienna, Austria, 1997.
- (1997) Int. Conf. Supercomputing
- Bilmes, J.¹ Asanović, K.² Chin, C.-W.³ Demmel, J.⁴

7
- 0028482686
- (Pen)-ultimate tiling?
- P. Boulet, A. Darte, T. Risset, and Y. Robert, "(Pen)-ultimate tiling?," Integration VLSI J., vol. 17, pp. 33-51, 1994.
- (1994) Integration VLSI J. , vol.17 , pp. 33-51
- Boulet, P.¹ Darte, A.² Risset, T.³ Robert, Y.⁴

8
- 0000493064
- Estimating interlock and improving balance for pipelined architectures
- D. Callahan, J. Cocke, and K. Kennedy, "Estimating interlock and improving balance for pipelined architectures," J. Parallel Distrib. Comput., vol. 5, no. 4, pp. 334-358, 1988.
- (1988) J. Parallel Distrib. Comput. , vol.5 , Issue.4 , pp. 334-358
- Callahan, D.¹ Cocke, J.² Kennedy, K.³

9
- 0025447908
- Improving register allocation for subscripted variables
- D. Callahan, S. Carr, and K. Kennedy, "Improving register allocation for subscripted variables," in Proc. SIGPLAN Conf. Programming Language Design and Implementation, 1990, pp. 53-65.
- (1990) Proc. SIGPLAN Conf. Programming Language Design and Implementation , pp. 53-65
- Callahan, D.¹ Carr, S.² Kennedy, K.³

10
- 18844387390
- Exact analysis of the cache behavior of nested loops
- S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck, "Exact analysis of the cache behavior of nested loops," in Proc. ACM SIGPLAN 2001 Conf. Programming Language Design and Implementation, 2001, pp. 286-297.
- (2001) Proc. ACM SIGPLAN 2001 Conf. Programming Language Design and Implementation , pp. 286-297
- Chatterjee, S.¹ Parker, E.² Hanlon, P.J.³ Lebeck, A.R.⁴

11
- 84976859799
- Unifying data and control transformations for distributed shared memory machines
- M. Cierniak and W. Li, "Unifying data and control transformations for distributed shared memory machines," in SIGPLAN 1995 Conf. Programming Languages Design and Implementation, pp. 205-217.
- SIGPLAN 1995 Conf. Programming Languages Design and Implementation , pp. 205-217
- Cierniak, M.¹ Li, W.²

12
- 84976745804
- Tile size selection using cache organization and data layout
- S. Coleman and K. S. McKinley, "Tile size selection using cache organization and data layout," in Proc. SIGPLAN Conf. Programming Language Design and Implementation, 1995, pp. 279-290.
- (1995) Proc. SIGPLAN Conf. Programming Language Design and Implementation , pp. 279-290
- Coleman, S.¹ McKinley, K.S.²

13
- 0026933251
- Some efficient solutions to the affine scheduling problem - Part 1: One dimensional time
- Oct.
- P. Feautrier, "Some efficient solutions to the affine scheduling problem - Part 1: One dimensional time," Int. J. Parallel Program., vol. 21, no. 5, pp. 313-348, Oct. 1992.
- (1992) Int. J. Parallel Program. , vol.21 , Issue.5 , pp. 313-348
- Feautrier, P.¹

14
- 0036575993
- Yet another optimization article
- May/Jun.
- M. Fowler, "Yet another optimization article," IEEE Softw., vol. 19, no. 3, pp. 20-21, May/Jun. 2002.
- (2002) IEEE Softw. , vol.19 , Issue.3 , pp. 20-21
- Fowler, M.¹

15
- 0033358624
- Automatic analytical modeling for the estimation of cache misses
- B. B. Fraguela, R. Doallo, and E. Zapata, "Automatic analytical modeling for the estimation of cache misses," in Proc. Int. Conf. Parallel Architectures and Compilation Techniques (PACT), 1999, pp. 221-231.
- (1999) Proc. Int. Conf. Parallel Architectures and Compilation Techniques (PACT) , pp. 221-231
- Fraguela, B.B.¹ Doallo, R.² Zapata, E.³

16
- 0031636309
- FFTW: An adaptive software architecture for the FFT
- M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the FFT," in Proc. IEEE Intl. Conf. Acoustics, Speech, and Signal Processing, vol. 3, 1998, pp. 1381-1384.
- (1998) Proc. IEEE Intl. Conf. Acoustics, Speech, and Signal Processing , vol.3 , pp. 1381-1384
- Frigo, M.¹ Johnson, S.G.²

17
- 20744449792
- The design and implementation of FFTW3
- Feb.
- _, "The design and implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216-231, Feb. 2005.
- (2005) Proc. IEEE , vol.93 , Issue.2 , pp. 216-231

18
- 0003528278
- Philadelphia, PA: SIAM
- S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes. Philadelphia, PA: SIAM, 2001.
- (2001) Performance Optimization of Numerically Intensive Codes
- Goedecker, S.¹ Hoisie, A.²

19
- 1542392269
- On reducing TLB misses in matrix multiplication
- Dept. Comput. Sci., Univ. Texas, Austin
- K. Goto and R. van de Geijn, "On reducing TLB misses in matrix multiplication," Dept. Comput. Sci., Univ. Texas, Austin, Tech. Rep. TR-2002-55, 2002.
- (2002) Tech. Rep. , vol.TR-2002-55
- Goto, K.¹ Van De Geijn, R.²

20
- 84949665448
- A family of high-performance matrix algorithms
- J. A. Gunnels, G. M. Henry, and R. A. van de Geijn, "A family of high-performance matrix algorithms," in Proc. Int. Conf. Computational Science (ICCS 2001), pp. 51-60.
- Proc. Int. Conf. Computational Science (ICCS 2001) , pp. 51-60
- Gunnels, J.A.¹ Henry, G.M.² Van De Geijn, R.A.³

21
- 20744436023
- private communication
- F. Gustavson, private communication, 2004.
- (2004)
- Gustavson, F.¹

22
- 0004302191
- San Francisco, CA: Morgan Kaufmann
- J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 1990.
- (1990) Computer Architecture: A Quantitative Approach
- Hennessy, J.L.¹ Patterson, D.A.²

23
- 20744456697
- Flexible High-Performance Matrix Multiply via Self-Modifying Runtime Code
- Dept. Comput. Sci., Univ. Texas, Austin, Dec.
- "Flexible High-Performance Matrix Multiply via Self-Modifying Runtime Code," Dept. Comput. Sci., Univ. Texas, Austin, Tech. Rep. CS-TR-01-44, Dec. 2001.
- (2001) Tech. Rep. , vol.CS-TR-01-44

24
- 20744459543
- [Online]
- K. Goto. High-performance BLAS. [Online]. Available: http://www.cs.utexas.edu/users/flame/goto/
- High-performance BLAS
- Goto, K.¹

25
- 20744435859
- Searching for the best FFT formulas with the SPL compiler
- J. Johnson, R. W. Johnson, D. A. Padua, and J. Xiong, "Searching for the best FFT formulas with the SPL compiler," in Proc. 13th Int. Workshop on Languages and Compilers for Parallel Computing, 2000, pp. 109-124.
- (2000) Proc. 13th Int. Workshop on Languages and Compilers for Parallel Computing , pp. 109-124
- Johnson, J.¹ Johnson, R.W.² Padua, D.A.³ Xiong, J.⁴

26
- 0347304618
- Data-centric multi-level blocking
- I. Kodukula, N. Ahmed, and K. Pingali, "Data-centric multi-level blocking," in Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, 1997, pp. 346-357.
- (1997) Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation , pp. 346-357
- Kodukula, I.¹ Ahmed, N.² Pingali, K.³

27
- 10844294800
- Imperfectly nested loop transformations for memory hierarchy management
- Rhodes, Greece, June
- I. Kodukula and K. Pingali, "Imperfectly nested loop transformations for memory hierarchy management," presented at the Int. Conf. Supercomputing, Rhodes, Greece, June 1999.
- (1999) Int. Conf. Supercomputing
- Kodukula, I.¹ Pingali, K.²

28
- 0027694019
- Access normalization: Loop restructuring for NUMA compilers
- W. Li and K. Pingali, "Access normalization: Loop restructuring for NUMA compilers," ACM Trans. Comput. Syst., 1993.
- (1993) ACM Trans. Comput. Syst.
- Li, W.¹ Pingali, K.²

29
- 0014701246
- Evaluation techniques for storage hierarchies
- R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Syst. J., vol. 9, no. 2, pp. 78-92, 1970.
- (1970) IBM Syst. J. , vol.9 , Issue.2 , pp. 78-92
- Mattson, R.L.¹ Gecsei, J.² Slutz, D.R.³ Traiger, I.L.⁴

30
- 84945709131
- Organizing matrices and matrix operations for paged memory systems
- A. C. McKellar and E. G. Coffman Jr., "Organizing matrices and matrix operations for paged memory systems," Commun. ACM, vol. 12, no. 3, pp. 153-165, 1969.
- (1969) Commun. ACM , vol.12 , Issue.3 , pp. 153-165
- McKellar, A.C.¹ Coffman Jr., E.G.²

31
- 0022874874
- Advanced compiler optimization for supercomputers
- Dec.
- D. Padua and M. Wolfe, "Advanced compiler optimization for supercomputers," Commun. ACM, vol. 29, no. 12, pp. 1184-1201, Dec. 1986.
- (1986) Commun. ACM , vol.29 , Issue.12 , pp. 1184-1201
- Padua, D.¹ Wolfe, M.²

32
- 19344368072
- SPIRAL: Code generation for DSP transforms
- Feb.
- M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proc. IEEE, vol. 93, no. 2, pp. 232-275, Feb. 2005.
- (2005) Proc. IEEE , vol.93 , Issue.2 , pp. 232-275
- Püschel, M.¹ Moura, J.M.F.² Johnson, J.³ Padua, D.⁴ Veloso, M.⁵ Singer, B.W.⁶ Xiong, J.⁷ Franchetti, F.⁸ Gačić, A.⁹ Voronenko, Y.¹⁰ Chen, K.¹¹ Johnson, R.W.¹² Rizzolo, N.¹³

33
- 0024898517
- Engineering and scientific subroutine library release 3 for IBM ES/3090 vector multiprocessors
- J. McComb, R. C. Agarwal, F. G. Gustavson, and S. Schmidt, "Engineering and scientific subroutine library release 3 for IBM ES/3090 vector multiprocessors," IBM Syst. J., vol. 28, no. 2, pp. 345-350, 1989.
- (1989) IBM Syst. J. , vol.28 , Issue.2 , pp. 345-350
- McComb, J.¹ Agarwal, R.C.² Gustavson, F.G.³ Schmidt, S.⁴

34
- 0009755242
- Iterative modulo scheduling
- B. R. Rau, "Iterative modulo scheduling," Hewlett-Packard Res. Lab., Tech. Rep. HPL-94-115, 1995.
- (1995) Hewlett-Packard Res. Lab., Tech. Rep. , vol.HPL-94-115
- Rau, B.R.¹

35
- 0003929457
- Automatic blocking of nested loops
- Univ. Tennessee, Knoxville
- R. Schreiber and J. Dongarra, "Automatic blocking of nested loops," Univ. Tennessee, Knoxville, Tech. Rep. CS-90-108, 1990.
- (1990) Tech. Rep. , vol.CS-90-108
- Schreiber, R.¹ Dongarra, J.²

36
- 20744440107
- private communication
- R. C. Whaley, private communication, 2004.
- (2004)
- Whaley, R.C.¹

37
- 20744443273
- [Online]
- _, x86 optimizations, part 1. [Online]. Available: http://sourceforge.net/mailarchive/forum.php?thread_id=1569256&forum_id=426
- X86 Optimizations, Part 1

38
- 13244261416
- [Online]
- _, User contribution to ATLAS. [Online]. Available: http://math-atlas.sourceforge.net/devel/atlas_contrib
- User Contribution to ATLAS

39
- 13244279577
- Minimizing development and maintenance costs in supporting persistently optimized BLAS
- to be published
- R. C. Whaley and A. Petitet, "Minimizing development and maintenance costs in supporting persistently optimized BLAS," Softw. Pract. Exper., to be published.
- Softw. Pract. Exper.
- Whaley, R.C.¹ Petitet, A.²

40
- 0343462141
- Automated empirical optimization of software and the ATLAS project
- R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimization of software and the ATLAS project," Parallel Comput., vol. 27, no. 1-2, pp. 3-35, 2001.
- (2001) Parallel Comput. , vol.27 , Issue.1-2 , pp. 3-35
- Whaley, R.C.¹ Petitet, A.² Dongarra, J.J.³

41
- 0002924272
- An algorithmic approach to compound loop transformations
- Cambridge, MA: MIT Press
- M. E. Wolf and M. S. Lam, "An algorithmic approach to compound loop transformations," in Advances in Languages and Compilers for Parallel Computing. Cambridge, MA: MIT Press, 1991, pp. 243-259.
- (1991) Advances in Languages and Compilers for Parallel Computing , pp. 243-259
- Wolf, M.E.¹ Lam, M.S.²

42
- 0002433589
- Iteration space tiling for memory hierarchies
- M. Wolfe, "Iteration space tiling for memory hierarchies," in Proc. 3rd SIAM Conf. Parallel Processing for Scientific Computing, 1987, pp. 357-361.
- (1987) Proc. 3rd SIAM Conf. Parallel Processing for Scientific Computing , pp. 357-361
- Wolfe, M.¹

43
- 20744433639
- X-Ray: A tool for automatic measurement of architectural parameters
- K. Yotov, K. Pingali, and P. Stodghill, "X-Ray: A tool for automatic measurement of architectural parameters," Comput. Sci., Cornell Univ., Tech. Rep. TR2004-1966, 2004.
- (2004) Comput. Sci., Cornell Univ., Tech. Rep. , vol.TR2004-1966
- Yotov, K.¹ Pingali, K.² Stodghill, P.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.