CUBA: an architecture for efficient CPU/co-processor data communication

Authors:

Steven S. Lumetta,

Wen-mei W. Hwu

ICS '08: Proceedings of the 22nd annual international conference on Supercomputing

Pages 299 - 308

https://doi.org/10.1145/1375527.1375571

Published: 07 June 2008

Abstract

Data-parallel co-processors have the potential to improve performance in highly parallel regions of code when coupled to a general-purpose CPU. However, applications often have to be modified in non-intuitive and complicated ways to mitigate the cost of data marshalling between the CPU and the co-processor. In some applications the overheads cannot be amortized, and co-processors provide no benefit. The additional effort and complexity of incorporating co-processors make it difficult, if not impossible, to use them effectively in large applications.

This paper presents CUBA, an architecture model in which co-processors, encapsulated as function calls, can efficiently access their input and output data structures through pointer parameters. The key idea is to map the data structures required by the co-processor to the co-processor's local memory rather than to the CPU's main memory. This mapping preserves the original layout of the shared data structures hosted in co-processor local memory, which renders data marshalling unnecessary and reduces the code changes needed to use the co-processors. CUBA lets the CPU cache hosted data structures under a selective write-through cache policy, so the CPU can still access them while communication with the co-processors remains efficient. Benchmark simulation results show that a CUBA-based system can approach optimal transfer rates while requiring few changes to the code that executes on the CPU.
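To make the contrast in the abstract concrete, the following is a minimal sketch in C, not the paper's implementation. It simulates co-processor local memory with an ordinary static array and compares a marshalling-style call (copy in, compute, copy out) with a call that passes a pointer to data already hosted in that memory. All names here (`coproc_local_mem`, `hosted_alloc`, `coproc_scale`, `scale_with_marshalling`, `scale_hosted`) are hypothetical and chosen only for illustration. The sketch does not model the hardware mapping or the selective write-through cache policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: a static array stands in for co-processor local memory.
 * In a real system this would be device memory mapped into the CPU's
 * address space; here it is ordinary host memory. */
#define COPROC_MEM_FLOATS 1024
static float coproc_local_mem[COPROC_MEM_FLOATS];
static size_t coproc_used = 0;

/* Hypothetical bump allocator that places a buffer in the simulated
 * co-processor local memory. */
float *hosted_alloc(size_t n) {
    assert(coproc_used + n <= COPROC_MEM_FLOATS);
    float *p = &coproc_local_mem[coproc_used];
    coproc_used += n;
    return p;
}

/* Stand-in for a co-processor encapsulated as a function call:
 * scales a vector in place. It runs on the CPU in this sketch. */
void coproc_scale(float *data, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        data[i] *= k;
    }
}

/* Conventional style: the data lives in CPU main memory, so it is
 * copied into co-processor memory before the call and copied back
 * afterwards. Returns the number of bytes moved by marshalling. */
size_t scale_with_marshalling(float *host, size_t n, float k) {
    float *staging = hosted_alloc(n);
    memcpy(staging, host, n * sizeof *host);   /* marshal in  */
    coproc_scale(staging, n, k);
    memcpy(host, staging, n * sizeof *host);   /* marshal out */
    return 2 * n * sizeof *host;
}

/* Hosted style: the data structure was allocated in co-processor
 * local memory from the start, so the call only passes a pointer.
 * Returns the number of bytes moved by marshalling, which is zero. */
size_t scale_hosted(float *hosted, size_t n, float k) {
    coproc_scale(hosted, n, k);
    return 0;
}
```

In this sketch the only difference visible to the caller is where the buffer is allocated: a buffer obtained from `hosted_alloc` can be filled by the CPU and handed to `scale_hosted` with no copies, whereas an ordinary array goes through `scale_with_marshalling` and pays for two copies per call.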

Index Terms

Computer systems organization

Published In

ICS '08: Proceedings of the 22nd annual international conference on Supercomputing

June 2008

390 pages

ISBN:9781605581583

DOI:10.1145/1375527

General Chairs:
Theo Papatheodorou, University of Patras, Greece
Utpal Banerjee, Intel (retired), USA

Program Chairs:
Avi Mendelson, Intel, Israel
Kyle Gallivan, Florida State University, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

co-processors

Qualifiers

Research-article

Conference

ICS08

ICS08: International Conference on Supercomputing

June 7 - 12, 2008

Island of Kos, Greece

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Total Citations: 32
Total Downloads: 694
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 0

Reflects downloads up to 21 Sep 2024

Citations

Cited By

Li W, Liu T, Xiao Z, Qi H, Zhu W, Wang J (2023). TCADer: A Tightly Coupled Accelerator Design framework for heterogeneous system with hardware/software co-design. Journal of Systems Architecture 136 (102822). Online publication date: Mar-2023.
https://doi.org/10.1016/j.sysarc.2023.102822
Tanasic I, Gelado I, Jorda M, Ayguade E, Navarro N, Hunter H, Moreno J, Emer J, Sanchez D (2017). Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (109-122). Online publication date: 14-Oct-2017.
https://dl.acm.org/doi/10.1145/3123939.3123950
Kao C, Miao Y, Hsu W (2017). A Pipeline-Based Ray-Tracing Runtime System for HSA-Compliant Frameworks. IEEE Transactions on Multimedia 19:11 (2450-2462). Online publication date: Nov-2017.
https://doi.org/10.1109/TMM.2017.2697825
Li W, Jin G, Cui X, See S, Balaji P, Xu C (2015). An evaluation of unified memory technology on NVIDIA GPUs. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (1092-1098). Online publication date: 4-May-2015.
https://dl.acm.org/doi/10.1109/CCGrid.2015.105
Margiolas C, O'Boyle M (2014). Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (55-65). Online publication date: 15-Feb-2014.
https://dl.acm.org/doi/10.1145/2581122.2544156
Margiolas C, O'Boyle M (2014). Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (55-65). Online publication date: 15-Feb-2014.
https://dl.acm.org/doi/10.1145/2544137.2544156
Lim J, Kim H (2014). Design Space Exploration of Memory Model for Heterogeneous Computing. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (160-167). Online publication date: 22-Oct-2014.
https://dl.acm.org/doi/10.1109/SBAC-PAD.2014.9
Shukla S, Bhuyan L (2013). A hybrid shared memory heterogeneous execution platform for PCIe-based GPGPUs. 20th Annual International Conference on High Performance Computing (343-352). Online publication date: Dec-2013.
https://doi.org/10.1109/HiPC.2013.6799140
Li P, Brunet E, Namyst R (2013). High Performance Code Generation for Stencil Computation on Heterogeneous Multi-device Architectures. 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (1512-1518). Online publication date: Nov-2013.
https://doi.org/10.1109/HPCC.and.EUC.2013.213
Shukla S, Yang Y, Bhuyan L, Brisk P (2013). Shared memory heterogeneous computation on PCIe-supported platforms. 2013 23rd International Conference on Field programmable Logic and Applications (1-4). Online publication date: Sep-2013.
https://doi.org/10.1109/FPL.2013.6645580
