Can traditional programming bridge the ninja performance gap for parallel computing applications?

Published: 23 April 2015

      Published In

      Communications of the ACM  Volume 58, Issue 5
      May 2015
      80 pages
      ISSN:0001-0782
      EISSN:1557-7317
      DOI:10.1145/2766485
      Editor: Moshe Y. Vardi
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Article Metrics

      • Downloads (last 12 months): 234
      • Downloads (last 6 weeks): 12
      Reflects downloads up to 12 Nov 2024

      Cited By

      • (2021) How to Extend Single-Processor Approach to Explicitly Many-Processor Approach. Advances in Software Engineering, Education, and e-Learning, 435-458. DOI: 10.1007/978-3-030-70873-3_31. Online publication date: 10-Mar-2021
      • (2020) Benchmarking Julia’s Communication Performance: Is Julia HPC ready or Full HPC? 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 20-25. DOI: 10.1109/PMBS51919.2020.00008. Online publication date: Nov-2020
      • (2019) An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation. ACM Transactions on Architecture and Code Optimization 15:4, 1-23. DOI: 10.1145/3293449. Online publication date: 8-Jan-2019
      • (2019) The Need for Modern Computing Paradigm: Science Applied to Computing. 2019 International Conference on Computational Science and Computational Intelligence (CSCI), 1523-1532. DOI: 10.1109/CSCI49370.2019.00283. Online publication date: Dec-2019
      • (2018) Efficient Realization of Householder Transform Through Algorithm-Architecture Co-Design for Acceleration of QR Factorization. IEEE Transactions on Parallel and Distributed Systems 29:8, 1707-1720. DOI: 10.1109/TPDS.2018.2803820. Online publication date: 1-Aug-2018
      • (2018) Software Technology That Deals with Deeper Memory Hierarchy in Post-petascale Era. Advanced Software Technologies for Post-Peta Scale Computing, 227-248. DOI: 10.1007/978-981-13-1924-2_12. Online publication date: 7-Dec-2018
      • (2017) An Accurate Simulator of Cache-Line Conflicts to Exploit the Underlying Cache Performance. Euro-Par 2017: Parallel Processing, 119-133. DOI: 10.1007/978-3-319-64203-1_9. Online publication date: 1-Aug-2017
      • (2015) An empirical study on parallelism in modern open-source projects. Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems, 35-44. DOI: 10.1145/2837476.2837481. Online publication date: 27-Oct-2015
      • (2015) Polyhedral user mapping and assistant visualizer tool for the r-stream auto-parallelizing compiler. 2015 IEEE 3rd Working Conference on Software Visualization (VISSOFT), 180-184. DOI: 10.1109/VISSOFT.2015.7332433. Online publication date: Sep-2015
