research-article

An asymmetric distributed shared memory model for heterogeneous parallel systems

Authors:

Javier Cabezas,

Wen-mei W. Hwu

ACM SIGARCH Computer Architecture News, Volume 38, Issue 1

Pages 347 - 358

https://doi.org/10.1145/1735970.1736059

Published: 13 March 2010

Abstract

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.
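To make the programmer-managed transfer pattern concrete, the following Python sketch models an accelerator that has its own private memory. `ToyAccelerator` and all of its methods are hypothetical names invented for this illustration; they do not correspond to CUDA, OpenCL, or any other real API. The sketch only shows that allocation, copy-in, execution, and copy-out each have to be written out by the programmer.

```python
class ToyAccelerator:
    """Hypothetical accelerator with private memory (illustrative, not a real API)."""

    def __init__(self):
        self.memory = {}        # device buffers, keyed by an integer handle
        self._next_handle = 0

    def alloc(self, length):
        # Reserve a buffer in accelerator memory and return its handle.
        handle = self._next_handle
        self._next_handle += 1
        self.memory[handle] = [0.0] * length
        return handle

    def copy_in(self, handle, host_data):
        # Explicit CPU -> accelerator transfer.
        self.memory[handle][:] = host_data

    def copy_out(self, handle):
        # Explicit accelerator -> CPU transfer.
        return list(self.memory[handle])

    def run_scale(self, handle, factor):
        # Stand-in for a data-parallel kernel; it only touches device memory.
        buffer = self.memory[handle]
        for i in range(len(buffer)):
            buffer[i] *= factor


def scale_with_explicit_transfers(host_data, factor):
    accelerator = ToyAccelerator()
    device_buffer = accelerator.alloc(len(host_data))
    accelerator.copy_in(device_buffer, host_data)   # programmer-managed
    accelerator.run_scale(device_buffer, factor)
    return accelerator.copy_out(device_buffer)      # programmer-managed


scaled = scale_with_explicit_transfers([1.0, 2.0, 3.0], 2.0)
```

If the final `copy_out` call were omitted, the CPU-side list would never see the scaled values, which is the kind of bookkeeping burden this paragraph refers to.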

This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs.
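The asymmetry described above can be sketched with a toy Python model. This is not the GMAC interface: `ToyADSM`, `SharedArray`, `shared_alloc`, and `call` are hypothetical names chosen for illustration, and the sketch assumes a single accelerator and ignores any coherence or transfer mechanism a real implementation would need. It shows only the two properties stated in the abstract: shared objects are hosted in accelerator memory yet readable and writable from CPU code, while accelerator code cannot reach CPU-only data.

```python
class SharedArray:
    """CPU-side view of an object whose storage lives in accelerator memory."""

    def __init__(self, runtime, handle, length):
        self._runtime = runtime
        self.handle = handle
        self._length = length

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # CPU reads go through the shared logical address space.
        return self._runtime.accelerator_memory[self.handle][index]

    def __setitem__(self, index, value):
        # CPU writes land directly in the accelerator-hosted object.
        self._runtime.accelerator_memory[self.handle][index] = value


class ToyADSM:
    """Hypothetical single-accelerator ADSM runtime (illustrative only)."""

    def __init__(self):
        self.accelerator_memory = {}   # physical home of every shared object
        self._next_handle = 0

    def shared_alloc(self, length):
        handle = self._next_handle
        self._next_handle += 1
        self.accelerator_memory[handle] = [0.0] * length
        return SharedArray(self, handle, length)

    def call(self, kernel, *args):
        # Asymmetry: accelerator code may only receive shared objects.
        for arg in args:
            if not isinstance(arg, SharedArray):
                raise TypeError("accelerator code cannot reach CPU-only memory")
        kernel(*(self.accelerator_memory[arg.handle] for arg in args))


def scale_by_two(buffer):
    # Stand-in for a performance-critical method run on the accelerator.
    for i in range(len(buffer)):
        buffer[i] *= 2.0


runtime = ToyADSM()
data = runtime.shared_alloc(3)
for i, value in enumerate([1.0, 2.0, 3.0]):
    data[i] = value                      # CPU writes, no explicit copy
runtime.call(scale_by_two, data)
result = [data[i] for i in range(len(data))]   # CPU reads, no explicit copy
```

Compared with a copy-in/copy-out style, the calling code contains no transfer calls at all; in this toy model that is trivially true because the CPU view indexes the accelerator buffer directly.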

We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.


Index Terms

An asymmetric distributed shared memory model for heterogeneous parallel systems
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Distributed memory

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 38, Issue 1

ASPLOS '10

March 2010

399 pages

ISSN:0163-5964

DOI:10.1145/1735970

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010
422 pages
ISBN:9781605588391
DOI:10.1145/1736020
General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 March 2010

Published in SIGARCH Volume 38, Issue 1

Citations

Cited By

Brokhman T, Lifshits P, Silberstein M, Dan T, Dahlia M (2019). GAIA. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, pp. 661-674. Online publication date: 10-Jul-2019.
https://dl.acm.org/doi/10.5555/3358807.3358864
Imani M, Gupta S, Sharma S, Rosing T (2019). NVQuery: Efficient Query Processing in Nonvolatile Memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38:4, pp. 628-639. Online publication date: Apr-2019.
https://doi.org/10.1109/TCAD.2018.2819080
Fiaidhi J, Mohammed S, Fiaidhi J (2019). Empowering Extreme Automation via Zero-Touch Operations and GPU Parallelization. IT Professional, 21:2, pp. 27-32. Online publication date: 1-Mar-2019.
https://doi.org/10.1109/MITP.2019.2892162
Sousa R, Pereira M, Pereira F, Araujo G (2019). Data-flow analysis and optimization for data coherence in heterogeneous architectures. Journal of Parallel and Distributed Computing. Online publication date: Apr-2019.
https://doi.org/10.1016/j.jpdc.2019.04.004
Shahar S, Bergman S, Silberstein M (2018). ActivePointers. ACM SIGOPS Operating Systems Review, 52:1, pp. 84-95. Online publication date: 28-Aug-2018.
https://dl.acm.org/doi/10.1145/3273982.3273990
Li L, Kessler C (2017). VectorPU. Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, pp. 7-12. Online publication date: 25-Jan-2017.
https://dl.acm.org/doi/10.1145/3029580.3029582
Imani M, Gupta S, Arredondo A, Rosing T (2017). Efficient query processing in crossbar memory. 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1-6. Online publication date: Jul-2017.
https://doi.org/10.1109/ISLPED.2017.8009204
Cui X, Scogland T, Supinski B, Feng W (2017). Directive-Based Partitioning and Pipelining for Graphics Processing Units. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 575-584. Online publication date: May-2017.
https://doi.org/10.1109/IPDPS.2017.96
Garcia-Flores V, Ayguade E, Pena A (2017). Efficient Data Sharing on Heterogeneous Systems. 2017 46th International Conference on Parallel Processing (ICPP), pp. 121-130. Online publication date: Aug-2017.
https://doi.org/10.1109/ICPP.2017.21
Shahar S, Bergman S, Silberstein M (2016). ActivePointers. ACM SIGARCH Computer Architecture News, 44:3, pp. 596-608. Online publication date: 18-Jun-2016.
https://dl.acm.org/doi/10.1145/3007787.3001200
