SCOPUS Information Retrieval Platform

Proceedings - International Symposium on Computer Architecture

Volume , Issue , 2012, Pages 440-451

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

(8) Satish, Nadathur a Kim, Changkyu a Chhugani, Jatin a Saito, Hideki b Krishnaiyer, Rakesh b Smelyanskiy, Mikhail a Girkar, Milind b Dubey, Pradeep a

a INTEL CORPORATION (United States)

b INTEL RESEARCH (United States)

Author keywords

[No Author keywords available]

Indexed keywords

C++ (PROGRAMMING LANGUAGE); CODES (SYMBOLS); MEMORY ARCHITECTURE; PROGRAM COMPILERS;

COMPUTING APPLICATIONS; CURRENT PROCESSORS; MANY-CORE ARCHITECTURE; MEMORY HIERARCHY; MODERN PROCESSORS; MULTI-CORE PROCESSOR; PARALLEL COMPUTING; PERFORMANCE; PERFORMANCE GAPS; TRADITIONAL APPROACHES;

PARALLEL PROCESSING SYSTEMS;

EID: 84864831385 PISSN: 10636897 EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2366231.2337210 Document Type: Conference Paper

Times cited : (72)

References (48)

1
- 1842870918
- S. J. Aarseth. Gravitational N-body Simulations Tools and Algorithms. 2003.
- (2003) Gravitational N-body Simulations Tools and Algorithms
- Aarseth, S.J.¹

2
- 77951472684
- Direct N-body kernels for multicore platforms
- N. Arora, A. Shringarpure, and R. W. Vuduc. Direct N-body Kernels for Multicore Platforms. In ICPP, pages 379-387, 2009.
- (2009) ICPP , pp. 379-387
- Arora, N.¹ Shringarpure, A.² Vuduc, R.W.³

3
- 35648995516
- The landscape of parallel computing research: A view from berkeley
- K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-183, 2006.
- (2006) Technical Report UCB/EECS-183
- Asanovic, K.¹ Bodik, R.² Catanzaro, B.C.³ Gebis, J.J.⁴ Husbands, P.⁵ Keutzer, K.⁶ Patterson, D.A.⁷ Plishker, W.L.⁸ Shalf, J.⁹ Williams, S.W.¹⁰ Yelick, K.A.¹¹

4
- 63549095070
- The PARSEC benchmark suite: Characterization and architectural implications
- C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. In PACT, pages 72-81, 2008.
- (2008) PACT , pp. 72-81
- Bienia, C.¹ Kumar, S.² Singh, J.P.³ Li, K.⁴

5
- 85015692260
- The pricing of options and corporate liabilities
- F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.
- (1973) Journal of Political Economy , vol.81 , Issue.3 , pp. 637-654
- Black, F.¹ Scholes, M.²

6
- 77954942935
- Low depth cache-oblivious algorithms
- G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low depth cache-oblivious algorithms. In SPAA, pages 189-199, 2010.
- (2010) SPAA , pp. 189-199
- Blelloch, G.E.¹ Gibbons, P.B.² Simhadri, H.V.³

7
- 79960806724
- Can CPUs match GPUs on performance with productivity?: Experiences with optimizing a FLOP-intensive application on CPUs and GPU
- August
- R. Bordawekar, U. Bondhugula, and R. Rao. Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU. IBM Research Report, RC25033, August 2010.
- (2010) IBM Research Report, RC25033
- Bordawekar, R.¹ Bondhugula, U.² Rao, R.³

8
- 0031489544
- The market model of interest rate dynamics
- A. Brace, D. Gatarek, and M. Musiela. The Market Model of Interest Rate Dynamics. Mathematical Finance, 7(2):127-155, 1997.
- (1997) Mathematical Finance , vol.7 , Issue.2 , pp. 127-155
- Brace, A.¹ Gatarek, D.² Musiela, M.³

9
- 85184636553
- blogs.intel.com/research/2011/09/hmc.php, Research@Intel
- B. Casper. Reinventing DRAM with the Hybrid Memory Cube. blogs.intel.com/research/2011/09/hmc.php, 2010. Research@Intel.
- (2010) Reinventing DRAM with the Hybrid Memory Cube
- Casper, B.¹

10
- 85184648002
- R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald. Parallel Programming in OpenMP, 2010.
- (2010) Parallel Programming in OpenMP
- Chandra, R.¹ Menon, R.² Dagum, L.³ Kohr, D.⁴ Maydan, D.⁵ McDonald, J.⁶

11
- 49249135216
- Convergence of recognition, mining, and synthesis workloads and its implications
- Y. K. Chen, J. Chhugani, P. Dubey, C. J. Hughes, D. Kim, S. Kumar, et al. Convergence of recognition, mining, and synthesis workloads and its implications. Proceedings of the IEEE, 96(5):790-807, 2008.
- (2008) Proceedings of the IEEE , vol.96 , Issue.5 , pp. 790-807
- Chen, Y.K.¹ Chhugani, J.² Dubey, P.³ Hughes, C.J.⁴ Kim, D.⁵ Kumar, S.⁶

12
- 84865096511
- Efficient implementation of sorting on multi-core simd cpu architecture
- J. Chhugani, A. D. Nguyen, et al. Efficient implementation of sorting on multi-core simd cpu architecture. PVLDB, 1(2):1313-1324, 2008.
- (2008) PVLDB , vol.1 , Issue.2 , pp. 1313-1324
- Chhugani, J.¹ Nguyen, A.D.²

13
- 84864839392
- The end of denial architecture and the rise of throughput computing
- W. J. Dally. The End of Denial Architecture and the Rise of Throughput Computing. Keynote speech at Design Automation Conference, 2010.
- (2010) Keynote Speech at Design Automation Conference
- Dally, W.J.¹

14
- 77953972043
- PhD thesis, EECS Department, University of California, Berkeley, Dec
- K. Datta. Auto-tuning Stencil Codes for Cache-Based Multicore Platforms. PhD thesis, EECS Department, University of California, Berkeley, Dec 2009.
- (2009) Auto-tuning Stencil Codes for Cache-Based Multicore Platforms
- Datta, K.¹

15
- 84946734201
- Volume rendering
- R. A. Drebin, L. C. Carpenter, and P. Hanrahan. Volume rendering. In SIGGRAPH, pages 65-74, 1988.
- (1988) SIGGRAPH , pp. 65-74
- Drebin, R.A.¹ Carpenter, L.C.² Hanrahan, P.³

16
- 36949031604
- A platform 2015 workload model: Recognition, mining and synthesis moves computers to the era of tera
- P. Dubey. A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera. Intel, 2005.
- (2005) Intel
- Dubey, P.¹

17
- 8344245462
- Vectorization for simd architectures with alignment constraints
- A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for simd architectures with alignment constraints. In PLDI, pages 82-93, 2004.
- (2004) PLDI , pp. 82-93
- Eichenberger, A.E.¹ Wu, P.² O'brien, K.³

18
- 78650646788
- Joint forces: From multithreaded programming to GPU computing
- January
- F. Feinbube, P. Troger, and A. Polze. Joint Forces: From Multithreaded Programming to GPU Computing. IEEE Softw., 28:51-57, January 2011.
- (2011) IEEE Softw. , vol.28 , pp. 51-57
- Feinbube, F.¹ Troger, P.² Polze, A.³

19
- 84864831251
- Monte carlo evaluation of sensitivities in computational finance
- M. B. Giles. Monte carlo evaluation of sensitivities in computational finance. Technical report, Oxford University Computing Laboratory, 2007.
- (2007) Technical Report Oxford University Computing Laboratory
- Giles, M.B.¹

20
- 0042482650
- 'N-body' problems in statistical learning
- A. G. Gray and A. W. Moore. 'N-Body' Problems in Statistical Learning. In NIPS, pages 521-527, 2000.
- (2000) NIPS , pp. 521-527
- Gray, A.G.¹ Moore, A.W.²

21
- 56849108794
- A portable runtime interface for multi-level memory hierarchies
- M. Houston, J.-Y. Park, M. Ren, T. Knight, K. Fatahalian, A. Aiken, W. Dally, and P. Hanrahan. A portable runtime interface for multi-level memory hierarchies. In PPoPP, pages 143-152, 2008.
- (2008) PPoPP , pp. 143-152
- Houston, M.¹ Park, J.-Y.² Ren, M.³ Knight, T.⁴ Fatahalian, K.⁵ Aiken, A.⁶ Dally, W.⁷ Hanrahan, P.⁸

22
- 84864839396
- Intel. A quick, easy and reliable way to improve threaded performance. http://software.intel.com/en-us/articles/intel-cilk-plus/, 2010.
- (2010) A Quick, Easy and Reliable Way to Improve Threaded Performance

23
- 79551492089
- Intel, White paper, June
- Intel. Intel Advanced Vector Extensions Programming Reference. White paper, June 2011.
- (2011) Intel Advanced Vector Extensions Programming Reference

24
- 85184646781
- Intel. Optimization Notice. http://software.intel.com/en-us/articles/optimization-notice/, 2012.
- (2012) Optimization Notice

25
- 78650874239
- Performance evaluation of convolution on the cell broadband engine processor
- L. Ismail and D. Guerchi. Performance Evaluation of Convolution on the Cell Broadband Engine Processor. IEEE PDS, 22(2):337-351, 2011.
- (2011) IEEE PDS , vol.22 , Issue.2 , pp. 337-351
- Ismail, L.¹ Guerchi, D.²

26
- 38649087090
- Hyperfast perspective cone-beam backprojection
- M. Kachelrieß, M. Knaup, and O. Bockenbach. Hyperfast perspective cone-beam backprojection. IEEE Nuclear Science, pages 1679-1683, 2006.
- (2006) IEEE Nuclear Science , pp. 1679-1683
- Kachelrieß, M.¹ Knaup, M.² Bockenbach, O.³

27
- 77954696758
- Cache topology aware computation mapping for multicores
- M. Kandemir, T. Yemliha, S. Muralidhara, S. Srikantaiah, M. Irwin, et al. Cache topology aware computation mapping for multicores. In PLDI, 2010.
- (2010) PLDI
- Kandemir, M.¹ Yemliha, T.² Muralidhara, S.³ Srikantaiah, S.⁴ Irwin, M.⁵

28
- 77954701719
- FAST: Fast architecture sensitive tree search on modern CPUs and GPUs
- C. Kim, J. Chhugani, N. Satish, et al. FAST: Fast Architecture Sensitive Tree search on modern CPUs and GPUs. In SIGMOD, pages 339-350, 2010.
- (2010) SIGMOD , pp. 339-350
- Kim, C.¹ Chhugani, J.² Satish, N.³

29
- 84864839397
- Closing the ninja performance gap through traditional programming and compiler technology
- C. Kim, N. Satish, J. Chhugani, et al. Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology. Technical report, Intel Labs, 2011.
- (2011) Technical Report Intel Labs
- Kim, C.¹ Satish, N.² Chhugani, J.³

30
- 77954995885
- Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU
- V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. ISCA, pages 451-460, 2010.
- (2010) ISCA , pp. 451-460
- Lee, V.W.¹ Kim, C.² Chhugani, J.³ Deisher, M.⁴ Kim, D.⁵ Nguyen, A.D.⁶ Satish, N.⁷ Smelyanskiy, M.⁸ Chennupaty, S.⁹ Hammarlund, P.¹⁰ Singhal, R.¹¹ Dubey, P.¹²

31
- 78650666949
- A synergetic approach to throughput computing on x86-based multicore desktops
- C.-K. Luk, R. Newton, et al. A synergetic approach to throughput computing on x86-based multicore desktops. IEEE Software, 28:39-50, 2011.
- (2011) IEEE Software , vol.28 , pp. 39-50
- Luk, C.-K.¹ Newton, R.²

32
- 0035311079
- Power: A first-class architectural design constraint
- T. N. Mudge. Power: A first-class architectural design constraint. IEEE Computer, 34(4):52-58, 2001.
- (2001) IEEE Computer , vol.34 , Issue.4 , pp. 52-58
- Mudge, T.N.¹

33
- 78650806116
- 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs
- A. Nguyen, N. Satish, et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs. In SC10, pages 1-13, 2010.
- (2010) SC10 , pp. 1-13
- Nguyen, A.¹ Satish, N.²

34
- 79953275887
- Multi-platform auto-vectorization
- D. Nuzman and R. Henderson. Multi-platform auto-vectorization. In CGO, pages 281-294, 2006.
- (2006) CGO , pp. 281-294
- Nuzman, D.¹ Henderson, R.²

35
- 63549093768
- Outer-loop vectorization: Revisited for short simd architectures
- D. Nuzman and A. Zaks. Outer-loop vectorization: revisited for short simd architectures. In PACT, pages 2-11, 2008.
- (2008) PACT , pp. 2-11
- Nuzman, D.¹ Zaks, A.²

36
- 79955066309
- Nvidia
- Nvidia. CUDA C Best Practices Guide 3.2, 2010.
- (2010) CUDA C Best Practices Guide 3.2

37
- 85184643058
- Oracle
- Oracle. Oracle TimesTen In-Memory Database Technical FAQ, 2007.
- (2007) Oracle TimesTen In-Memory Database Technical FAQ

38
- 85184635665
- Black-Scholes option pricing
- V. Podlozhnyuk. Black-Scholes option pricing. Nvidia, 2007.
- (2007) Nvidia
- Podlozhnyuk, V.¹

39
- 79959466764
- Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-M. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP, pages 73-82, 2008.
- (2008) PPoPP , pp. 73-82
- Ryoo, S.¹ Rodrigues, C.I.² Baghsorkhi, S.S.³ Stone, S.S.⁴ Kirk, D.B.⁵ Hwu, W.-M.W.⁶

40
- 77954743119
- Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort
- N. Satish, C. Kim, J. Chhugani, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, pages 351-362, 2010.
- (2010) SIGMOD , pp. 351-362
- Satish, N.¹ Kim, C.² Chhugani, J.³

41
- 49249086142
- Larrabee: A many-core x86 architecture for visual computing
- L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. SIGGRAPH, 27(3), 2008.
- (2008) SIGGRAPH , vol.27 , Issue.3
- Seiler, L.¹ Carmean, D.² Sprangle, E.³ Forsyth, T.⁴ Abrash, M.⁵ Dubey, P.⁶ Junkins, S.⁷ Lake, A.⁸ Sugerman, J.⁹ Cavin, R.¹⁰ Espasa, R.¹¹ Grochowski, E.¹² Juan, T.¹³ Hanrahan, P.¹⁴

42
- 85184637635
- Lecture2go.uni-hamburg.de/konferenzen/-/k/10940 ISC10 keynote
- K. B. Skaugen. HPC Technology-Scale-Up and Scale-Out. lecture2go.uni-hamburg.de/konferenzen/-/k/10940. ISC10 Keynote.
- HPC Technology-Scale-Up and Scale-Out
- Skaugen, K.B.¹

43
- 70350681243
- Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures
- M. Smelyanskiy, D. Holmes, et al. Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures. IEEE Trans. Vis. Comput. Graph., 15(6):1563-1570, 2009.
- (2009) IEEE Trans. Vis. Comput. Graph. , vol.15 , Issue.6 , pp. 1563-1570
- Smelyanskiy, M.¹ Holmes, D.²

44
- 84892298358
- M. C. Sukop and D. T. Thorne, Jr. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers. 2006.
- (2006) Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers
- Sukop, M.C.¹ Thorne Jr., D.T.²

45
- 67650998701
- Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms
- S. Williams, J. Carter, L. Oliker, J. Shalf, and K. A. Yelick. Optimization of a lattice boltzmann computation on state-of-the-art multicore platforms. J. Parallel Distrib. Comput., 69(9):762-777, 2009.
- (2009) J. Parallel Distrib. Comput. , vol.69 , Issue.9 , pp. 762-777
- Williams, S.¹ Carter, J.² Oliker, L.³ Shalf, J.⁴ Yelick, K.A.⁵

46
- 77952554764
- An optimized 3d-stacked memory architecture by exploiting excessive, high-density tsv bandwidth
- D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee. An optimized 3d-stacked memory architecture by exploiting excessive, high-density tsv bandwidth. In HPCA, pages 1-12, 2010.
- (2010) HPCA , pp. 1-12
- Woo, D.H.¹ Seong, N.H.² Lewis, D.L.³ Lee, H.-H.S.⁴

47
- 77954691442
- A GPGPU compiler for memory optimization and parallelism management
- Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In PLDI, pages 86-97, 2010.
- (2010) PLDI , pp. 86-97
- Yang, Y.¹ Xiang, P.² Kong, J.³ Zhou, H.⁴

48
- 77954699806
- Bamboo: A data-centric, object-oriented approach to many-core software
- J. Zhou and B. Demsky. Bamboo: a data-centric, object-oriented approach to many-core software. In PLDI, pages 388-399, 2010.
- (2010) PLDI , pp. 388-399
- Zhou, J.¹ Demsky, B.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.