research-article

Introducing mNUMA: an extended PGAS architecture

Authors:

Peter M. Kogge

PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model

Article No.: 6, Pages 1 - 10

https://doi.org/10.1145/2020373.2020379

Published: 12 October 2010

Abstract

We describe design details of a Light Weight Processing migration-NUMA (LWP-mNUMA) architecture, a novel high-performance system design that provides hardware support for a partitioned global address space, migrating subjects, and word-level synchronization primitives. Using the architectural definition, we show how combinations of structures work together to carry out basic actions such as address translation, migration, in-memory synchronization, and work management. We present results from simulation of microkernels showing that LWP-mNUMA compensates for latency with far greater memory access concurrency than is possible on conventional systems. In particular, several microkernels model tough, irregular access patterns that, in certain problem areas, have limited speedups to dozens of conventional processors. On these, results show speedup increasing up to 1024 multicore mNUMA processing nodes, running over 1 million threadlets.
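
The claim that concurrency can compensate for latency can be illustrated with Little's Law (sustained throughput equals requests in flight divided by latency). The following is a minimal sketch, not the paper's model: the function names and the 1000 ns remote-access latency are assumptions chosen for illustration, and only the node count (1024) and the threadlet count (taken here as a round 1 million) come from the abstract.

```python
def sustained_rate(in_flight: int, latency_ns: float) -> float:
    """Little's Law rearranged: throughput = concurrency / latency.

    Returns memory accesses completed per nanosecond.
    """
    return in_flight / latency_ns


def threadlets_per_node(total_threadlets: int, nodes: int) -> float:
    """Average number of threadlets resident on each node."""
    return total_threadlets / nodes


# Hypothetical remote-access latency; NOT a figure from the paper.
ASSUMED_LATENCY_NS = 1000.0

# From the abstract: up to 1024 nodes and "over 1 million" threadlets.
NODES = 1024
THREADLETS = 1_000_000

per_node = threadlets_per_node(THREADLETS, NODES)
print(f"average threadlets per node: {per_node}")

# With a fixed latency, throughput scales with the accesses kept in flight.
for in_flight in (1, 16, 1024):
    rate = sustained_rate(in_flight, ASSUMED_LATENCY_NS)
    print(f"{in_flight:5d} in flight: {rate} accesses/ns")
```

Under these assumed numbers, a single outstanding access is bound by the full latency, while roughly a thousand outstanding accesses per node raise the sustained rate by the same factor, provided the memory system can actually service them concurrently.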

Cited By

Cason M, Kogge P (2011) Recomposing an Irregular Algorithm Using a Novel Low-Level PGAS Model. Proceedings of the 2011 40th International Conference on Parallel Processing Workshops, pp. 238-248. DOI: 10.1109/ICPPW.2011.55. Online publication date: 13-Sep-2011
https://dl.acm.org/doi/10.1109/ICPPW.2011.55

Published In

PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model

October 2010

134 pages

ISBN:9781450304610

DOI:10.1145/2020373

General Chair: José E. Moreira (IBM T.J. Watson Research Center)

Program Chairs: Costin Iancu (Lawrence Berkeley Laboratory), Vijay Saraswat (IBM T.J. Watson Research Center)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PGAS '10

PGAS '10: Fourth Conference on Partitioned Global Address Space Programming Model

October 12 - 15, 2010

New York, New York, USA
