research-article

An asymmetric distributed shared memory model for heterogeneous parallel systems

Published: 13 March 2010

Abstract

Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both the sequential, control-intensive phases and the data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.
This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space through which CPUs can access objects in the accelerator physical memory, but not vice versa. The asymmetry allows lightweight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and is transparently accessible by the methods executed on CPUs.
We argue that ADSM reduces programming effort for heterogeneous computing systems and enhances application portability. We present GMAC, a software implementation of ADSM built on top of CUDA in a GNU/Linux environment, and evaluate different design choices. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.
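
The asymmetry described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions: the class and function names (`AcceleratorMemory`, `SharedObject`, `Accelerator`, `shared_alloc`, `run`) are hypothetical and are not the GMAC or CUDA API, and plain Python lists stand in for accelerator physical memory. It shows only the access rule: shared objects are hosted in accelerator memory, the CPU reads and writes them without explicit copies, and the accelerator cannot reach CPU-only data.

```python
class AcceleratorMemory:
    """Toy stand-in for accelerator physical memory (hypothetical)."""

    def __init__(self):
        self._buffers = {}
        self._next_addr = 0x1000

    def allocate(self, size):
        # Hand out a fake address and back it with a zeroed buffer.
        addr = self._next_addr
        self._buffers[addr] = [0.0] * size
        self._next_addr += size
        return addr

    def buffer(self, addr):
        return self._buffers[addr]


class SharedObject:
    """CPU-side view of an object hosted in accelerator memory."""

    def __init__(self, memory, addr):
        self._memory = memory
        self.addr = addr

    def __len__(self):
        return len(self._memory.buffer(self.addr))

    def __getitem__(self, index):
        # CPU load goes straight to the accelerator-hosted buffer.
        return self._memory.buffer(self.addr)[index]

    def __setitem__(self, index, value):
        # CPU store goes straight to the accelerator-hosted buffer.
        self._memory.buffer(self.addr)[index] = value


class Accelerator:
    """Toy accelerator that can only touch its own memory."""

    def __init__(self):
        self.memory = AcceleratorMemory()

    def shared_alloc(self, size):
        # Objects in the shared logical space live in accelerator memory.
        return SharedObject(self.memory, self.memory.allocate(size))

    def run(self, kernel, *objects):
        # Asymmetry: reject anything that is not hosted in accelerator memory.
        buffers = []
        for obj in objects:
            if not isinstance(obj, SharedObject):
                raise TypeError("accelerator cannot access CPU-only memory")
            buffers.append(self.memory.buffer(obj.addr))
        kernel(*buffers)


def scale_kernel(data):
    # Stand-in for a data-parallel method selected for accelerator execution.
    for i in range(len(data)):
        data[i] *= 2.0


acc = Accelerator()
vec = acc.shared_alloc(4)
for i in range(len(vec)):
    vec[i] = float(i)          # CPU writes with no explicit transfer call

acc.run(scale_kernel, vec)     # accelerator works on its own memory
result = [vec[i] for i in range(len(vec))]  # CPU reads results in place

host_only = [1.0, 2.0]         # ordinary CPU data, outside the shared space
try:
    acc.run(scale_kernel, host_only)
    rejected = False
except TypeError:
    rejected = True
```

In this sketch the CPU never issues a copy, which mirrors the claim that ADSM removes programmer-managed transfers. How a real implementation keeps the CPU view coherent is outside what the abstract states and is not modeled here.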


Information

Published In

ACM SIGARCH Computer Architecture News, Volume 38, Issue 1
ASPLOS '10
March 2010
399 pages
ISSN: 0163-5964
DOI: 10.1145/1735970

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010
422 pages
ISBN: 9781605588391
DOI: 10.1145/1736020
General Chair: James C. Hoe
Program Chair: Vikram S. Adve

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 March 2010
Published in SIGARCH Volume 38, Issue 1

Author Tags

  1. asymmetric distributed shared memory
  2. data-centric programming models
  3. heterogeneous systems

Qualifiers

  • Research-article

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 57
  • Downloads (last 6 weeks): 3

Reflects downloads up to 13 Feb 2025.

Cited By

  • (2019) GAIA. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, pp. 661-674. DOI: 10.5555/3358807.3358864. Online publication date: 10-Jul-2019
  • (2019) NVQuery: Efficient Query Processing in Nonvolatile Memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38:4, pp. 628-639. DOI: 10.1109/TCAD.2018.2819080. Online publication date: Apr-2019
  • (2019) Empowering Extreme Automation via Zero-Touch Operations and GPU Parallelization. IT Professional, 21:2, pp. 27-32. DOI: 10.1109/MITP.2019.2892162. Online publication date: 1-Mar-2019
  • (2019) Data-flow analysis and optimization for data coherence in heterogeneous architectures. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2019.04.004. Online publication date: Apr-2019
  • (2018) ActivePointers. ACM SIGOPS Operating Systems Review, 52:1, pp. 84-95. DOI: 10.1145/3273982.3273990. Online publication date: 28-Aug-2018
  • (2017) VectorPU. Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, pp. 7-12. DOI: 10.1145/3029580.3029582. Online publication date: 25-Jan-2017
  • (2017) Efficient query processing in crossbar memory. 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1-6. DOI: 10.1109/ISLPED.2017.8009204. Online publication date: Jul-2017
  • (2017) Directive-Based Partitioning and Pipelining for Graphics Processing Units. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 575-584. DOI: 10.1109/IPDPS.2017.96. Online publication date: May-2017
  • (2017) Efficient Data Sharing on Heterogeneous Systems. 2017 46th International Conference on Parallel Processing (ICPP), pp. 121-130. DOI: 10.1109/ICPP.2017.21. Online publication date: Aug-2017
  • (2016) ActivePointers. ACM SIGARCH Computer Architecture News, 44:3, pp. 596-608. DOI: 10.1145/3007787.3001200. Online publication date: 18-Jun-2016
