research-article
Open access

Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures

Published: 17 September 2016

Abstract

The performance and energy efficiency of modern architectures depend on memory locality, which can be improved by thread and data mappings considering the memory access behavior of parallel applications. In this article, we propose intense pages mapping, a mechanism that analyzes the memory access behavior using information about the time the entry of each page resides in the translation lookaside buffer. It provides accurate information with a very low overhead. We present experimental results with simulation and real machines, with average performance improvements of 13.7% and energy savings of 4.4%, which come from reductions in cache misses and interconnection traffic.
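The idea of using TLB residency time as a locality signal can be illustrated with a small sketch. The abstract does not describe how the mechanism works internally, so everything below is an assumption made for illustration: the trace format `(core, page, inserted_at, evicted_at)`, the `core_to_node` table, and the rule "map each page to the NUMA node whose cores kept it in the TLB longest" are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def residency_by_node(events, core_to_node):
    """Sum TLB residency time per (page, NUMA node).

    events: iterable of (core, page, inserted_at, evicted_at) tuples.
    This trace format is hypothetical, invented for this sketch.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for core, page, inserted_at, evicted_at in events:
        totals[page][core_to_node[core]] += evicted_at - inserted_at
    return totals


def map_pages(events, core_to_node):
    """Assign each page to the node with the longest total residency.

    Assumed policy, not the article's: ties go to the lowest node id
    so that the result is deterministic.
    """
    totals = residency_by_node(events, core_to_node)
    return {
        page: min(nodes, key=lambda n: (-nodes[n], n))
        for page, nodes in totals.items()
    }


# Toy machine: cores 0-1 belong to node 0, cores 2-3 to node 1.
core_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
events = [
    (0, 0x10, 0, 50),   # page 0x10 stays 50 time units in core 0's TLB
    (2, 0x10, 10, 30),  # and 20 time units in core 2's TLB
    (3, 0x20, 0, 40),
    (1, 0x20, 5, 15),
    (2, 0x20, 50, 60),
]
mapping = map_pages(events, core_to_node)
```

In this toy trace, page `0x10` accumulates more residency on node 0 and page `0x20` on node 1, so the sketch maps them to those nodes. A real mechanism would also have to decide when to migrate and how to place threads, which this sketch does not attempt.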




Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 3
September 2016
207 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2988523
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 September 2016
Accepted: 01 July 2016
Revised: 01 July 2016
Received: 01 February 2016
Published in TACO Volume 13, Issue 3


Author Tags

  1. NUMA
  2. Thread mapping
  3. cache memory
  4. communication
  5. data mapping
  6. data sharing

Qualifiers

  • Research-article
  • Research
  • Refereed


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 90
  • Downloads (Last 6 weeks): 15
Reflects downloads up to 14 Nov 2024

Citations

Cited By

  • (2023) A Low-Level Virtual Machine Just-In-Time Prototype for Running an Energy-Saving Hardware-Aware Mapping Algorithm on C/C++ Applications That Use Pthreads. Energies 16:19 (6781). DOI: 10.3390/en16196781. Online publication date: 23-Sep-2023
  • (2023) HATS: HetTask Scheduling. IEEE Transactions on Cloud Computing 11:2 (2071-2083). DOI: 10.1109/TCC.2022.3184081. Online publication date: 1-Apr-2023
  • (2022) On the performance limits of thread placement for array databases in non-uniform memory architectures. Computing 105:5 (1059-1075). DOI: 10.1007/s00607-021-01043-4. Online publication date: 17-Jan-2022
  • (2021) Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 5:4 (1-28). DOI: 10.1145/3433687. Online publication date: 21-Jan-2021
  • (2021) Performance Analysis of Array Database Systems in Non-Uniform Memory Architecture. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (169-176). DOI: 10.1109/PDP52278.2021.00034. Online publication date: Mar-2021
  • (2021) A Performance-Stable NUMA Management Scheme for Linux-Based HPC Systems. IEEE Access 9 (52987-53002). DOI: 10.1109/ACCESS.2021.3069991. Online publication date: 2021
  • (2019) Optimization strategies for geophysics models on manycore systems. The International Journal of High Performance Computing Applications (109434201882415). DOI: 10.1177/1094342018824150. Online publication date: 17-Jan-2019
  • (2019) EagerMap. ACM Transactions on Parallel Computing 5:4 (1-24). DOI: 10.1145/3309711. Online publication date: 8-Mar-2019
  • (2018) NumaMMA. Proceedings of the 47th International Conference on Parallel Processing (1-10). DOI: 10.1145/3225058.3225094. Online publication date: 13-Aug-2018
  • (2018) Improving Communication and Load Balancing with Thread Mapping in Manycore Systems. 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP) (93-100). DOI: 10.1109/PDP2018.2018.00021. Online publication date: Mar-2018
