DOI: 10.1145/2020373.2020379
research-article

Introducing mNUMA: an extended PGAS architecture

Published: 12 October 2010

Abstract

We describe design details of a Light Weight Processing migration-NUMA (LWP-mNUMA) architecture, a novel high-performance system design that provides hardware support for a partitioned global address space, migrating subjects, and word-level synchronization primitives. Using the architectural definition, we show how combinations of structures work together to carry out basic actions such as address translation, migration, in-memory synchronization, and work management. We present results from simulation of microkernels showing that LWP-mNUMA compensates for latency with far greater memory access concurrency than is possible on conventional systems. In particular, several microkernels model tough, irregular access patterns that, in certain problem areas, have limited speedups to dozens of conventional processors. On these, results show speedup increasing up to 1024 multicore mNUMA processing nodes, running over 1 million threadlets.
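
As a rough illustration of two ideas named in the abstract, address translation in a partitioned global address space and migration of execution toward data, the following toy sketch may help. It is not the mNUMA design: the abstract does not give the address encoding or the migration mechanism, so the bit split `OFFSET_BITS`, the `Threadlet` class, and the `access` helper are all hypothetical names chosen for this example.

```python
from typing import Dict, Tuple

# Assumption for illustration only: the low 20 bits of a global address are a
# node-local offset and the remaining high bits select the owning node.
OFFSET_BITS = 20
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_address(node_id: int, local_offset: int) -> int:
    """Pack a (node, offset) pair into one global address."""
    return (node_id << OFFSET_BITS) | (local_offset & OFFSET_MASK)


def split_address(global_addr: int) -> Tuple[int, int]:
    """Translate a global address into (owning node, local offset)."""
    return global_addr >> OFFSET_BITS, global_addr & OFFSET_MASK


class Threadlet:
    """Hypothetical lightweight thread that tracks where it currently runs."""

    def __init__(self, node: int) -> None:
        self.node = node
        self.migrations = 0


def access(t: Threadlet, global_addr: int,
           memory: Dict[int, Dict[int, int]]) -> int:
    """Read a word, moving the threadlet to the owning node if it is remote."""
    node, offset = split_address(global_addr)
    if node != t.node:
        # Toy stand-in for migration: move the computation, not the data.
        t.node = node
        t.migrations += 1
    return memory[node][offset]
```

In this sketch, a threadlet that starts on node 0 and reads an address owned by node 3 ends up on node 3 after one migration; later reads of node 3 addresses are then local and cause no further migration.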

Cited By

  • (2011) Recomposing an Irregular Algorithm Using a Novel Low-Level PGAS Model. Proceedings of the 2011 40th International Conference on Parallel Processing Workshops, pages 238-248. DOI: 10.1109/ICPPW.2011.55. Online publication date: 13-Sep-2011

Published In

PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
October 2010
134 pages
ISBN: 9781450304610
DOI: 10.1145/2020373
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. PGAS
  2. distributed shared memory (DSM)
  3. multi-threaded architecture

Qualifiers

  • Research-article

Conference

PGAS '10
