DOI: 10.1145/1375527.1375571
research-article

CUBA: an architecture for efficient CPU/co-processor data communication

Published: 07 June 2008 Publication History

Abstract

Data-parallel co-processors have the potential to improve performance in highly parallel regions of code when coupled to a general-purpose CPU. However, applications often have to be modified in non-intuitive and complicated ways to mitigate the cost of data marshalling between the CPU and the co-processor. In some applications the overheads cannot be amortized and co-processors are unable to provide benefit. The additional effort and complexity of incorporating co-processors make it difficult, if not impossible, to effectively utilize co-processors in large applications.
This paper presents CUBA, an architecture model where co-processors encapsulated as function calls can efficiently access their input and output data structures through pointer parameters. The key idea is to map the data structures required by the co-processor to the co-processor local memory as opposed to the CPU's main memory. The mapping in CUBA preserves the original layout of the shared data structures hosted in the co-processor local memory. The mapping renders the data marshalling process unnecessary and reduces the need for code changes in order to use the co-processors. CUBA allows the CPU to cache hosted data structures with a selective write-through cache policy, allowing the CPU to access hosted data structures while supporting efficient communication with the co-processors. Benchmark simulation results show that a CUBA-based system can approach optimal transfer rates while requiring few changes to the code that executes on the CPU.
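
The idea of hosting shared data in co-processor local memory and caching it on the CPU with a selective write-through policy can be pictured with a toy software model. The sketch below is not the paper's design: the class `ToySystem`, its method names, the address ranges, and the choice to invalidate CPU-cached copies when the co-processor writes are all assumptions made for illustration. It only shows the general behaviour the abstract describes: CPU writes to hosted data reach the co-processor memory immediately, other writes stay in the cache, and the co-processor call receives a pointer instead of a marshalled copy.

```python
class ToySystem:
    """Toy model (hypothetical) of a CPU cache with selective write-through."""

    def __init__(self, hosted_range):
        self.main_memory = {}             # CPU main memory: address -> value
        self.coproc_memory = {}           # co-processor local memory: address -> value
        self.cache = {}                   # CPU cache: address -> (value, dirty flag)
        self.hosted_range = hosted_range  # addresses mapped to co-processor memory

    def is_hosted(self, addr):
        return addr in self.hosted_range

    def cpu_write(self, addr, value):
        if self.is_hosted(addr):
            # Hosted data: write through, so the co-processor copy is current.
            self.cache[addr] = (value, False)
            self.coproc_memory[addr] = value
        else:
            # Ordinary data: write-back, the line only becomes dirty in the cache.
            self.cache[addr] = (value, True)

    def cpu_read(self, addr):
        if addr in self.cache:
            return self.cache[addr][0]
        backing = self.coproc_memory if self.is_hosted(addr) else self.main_memory
        value = backing.get(addr, 0)
        self.cache[addr] = (value, False)
        return value

    def coproc_scale(self, base, length, factor):
        """Co-processor 'function call': takes a pointer and works in place."""
        for addr in range(base, base + length):
            self.coproc_memory[addr] = self.coproc_memory.get(addr, 0) * factor
            # Assumption of this sketch: stale CPU-cached copies are dropped.
            self.cache.pop(addr, None)


# The CPU fills a hosted array in its original layout, with no staging buffer.
system = ToySystem(hosted_range=range(100, 104))
for offset, value in enumerate([1, 2, 3, 4]):
    system.cpu_write(100 + offset, value)
system.cpu_write(0, 7)  # non-hosted write stays dirty in the CPU cache

# The co-processor is invoked with a pointer (base address) and a length.
system.coproc_scale(100, 4, 10)
result = [system.cpu_read(100 + offset) for offset in range(4)]
```

In this model the input array is never copied into a separate transfer buffer: the pointer passed to `coproc_scale` already refers to memory the co-processor owns, which is the property the abstract credits with making marshalling unnecessary.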

Cited By

  • (2023) TCADer: A Tightly Coupled Accelerator Design framework for heterogeneous system with hardware/software co-design. Journal of Systems Architecture 136 (102822). DOI: 10.1016/j.sysarc.2023.102822. Online publication date: Mar-2023
  • (2017) Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (109-122). DOI: 10.1145/3123939.3123950. Online publication date: 14-Oct-2017
  • (2017) A Pipeline-Based Ray-Tracing Runtime System for HSA-Compliant Frameworks. IEEE Transactions on Multimedia 19:11 (2450-2462). DOI: 10.1109/TMM.2017.2697825. Online publication date: Nov-2017

    Published In

    ICS '08: Proceedings of the 22nd annual international conference on Supercomputing
    June 2008
    390 pages
    ISBN:9781605581583
    DOI:10.1145/1375527
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tag

    1. co-processors

    Qualifiers

    • Research-article

    Conference

    ICS08
    ICS08: International Conference on Supercomputing
    June 7 - 12, 2008
    Island of Kos, Greece

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (Last 12 months): 11
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 21 Sep 2024

    Citations

    Cited By

    • (2015) An evaluation of unified memory technology on NVIDIA GPUs. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (1092-1098). DOI: 10.1109/CCGrid.2015.105. Online publication date: 4-May-2015
    • (2014) Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (55-65). DOI: 10.1145/2581122.2544156. Online publication date: 15-Feb-2014
    • (2014) Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (55-65). DOI: 10.1145/2544137.2544156. Online publication date: 15-Feb-2014
    • (2014) Design Space Exploration of Memory Model for Heterogeneous Computing. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (160-167). DOI: 10.1109/SBAC-PAD.2014.9. Online publication date: 22-Oct-2014
    • (2013) A hybrid shared memory heterogeneous execution platform for PCIe-based GPGPUs. 20th Annual International Conference on High Performance Computing (343-352). DOI: 10.1109/HiPC.2013.6799140. Online publication date: Dec-2013
    • (2013) High Performance Code Generation for Stencil Computation on Heterogeneous Multi-device Architectures. 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (1512-1518). DOI: 10.1109/HPCC.and.EUC.2013.213. Online publication date: Nov-2013
    • (2013) Shared memory heterogeneous computation on PCIe-supported platforms. 2013 23rd International Conference on Field programmable Logic and Applications (1-4). DOI: 10.1109/FPL.2013.6645580. Online publication date: Sep-2013
