Can traditional programming bridge the ninja performance gap for parallel computing applications?

Authors:

Pradeep Dubey

Communications of the ACM, Volume 58, Issue 5

Pages 77 - 86

https://doi.org/10.1145/2742910

Published: 23 April 2015

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

Communications of the ACM Volume 58, Issue 5

May 2015

80 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/2766485

Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed
