Maximizing system utilization via parallelism management for co-located parallel applications

Authors: Camilo A. Celis Guzman, Bernhard Egger

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Article No.: 14, Pages 1 - 14

https://doi.org/10.1145/3243176.3243199

Published: 01 November 2018

Abstract

With an increasing number of cores and memory controllers in multiprocessor platforms, co-location of parallel applications is gaining in importance. Key to achieving good performance is allocating the proper number of threads to co-located applications. This paper presents NuPoCo, a framework for automatically managing the parallelism of co-located parallel applications on NUMA multi-socket multi-core systems. NuPoCo maximizes the utilization of CPU cores and memory controllers by dynamically adjusting the number of threads for co-located parallel applications. Evaluated with various scenarios of co-located OpenMP applications on a 64-core AMD and a 72-core Intel machine, NuPoCo reduces the total turnaround time by 10-20% compared to the default Linux scheduler and an existing parallelism management policy focusing on CPU utilization only.
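To make the idea of balancing cores against memory bandwidth concrete, the following toy sketch hands out cores to co-located applications while respecting a shared bandwidth budget. It is not NuPoCo's model or algorithm, which the abstract does not describe. The `App` class, the `allocate_threads` function, the fixed per-thread bandwidth demand, and all numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    bw_per_thread: float  # assumed memory bandwidth demand per thread (GB/s)


def allocate_threads(apps, total_cores, bw_capacity):
    """Toy greedy policy (illustrative only, not the paper's method).

    Cores are handed out one at a time in round-robin order. An application
    is skipped whenever one more of its threads would push the summed
    bandwidth demand above the memory-controller capacity.
    """
    alloc = {a.name: 0 for a in apps}
    used_bw = 0.0
    free = total_cores
    progress = True
    while free > 0 and progress:
        progress = False
        for a in apps:
            if free == 0:
                break
            if used_bw + a.bw_per_thread <= bw_capacity:
                alloc[a.name] += 1
                used_bw += a.bw_per_thread
                free -= 1
                progress = True
    return alloc, used_bw


# Hypothetical scenario: one compute-bound and one memory-bound application
# sharing 16 cores and 20 GB/s of memory bandwidth.
apps = [App("compute", 0.5), App("memory", 4.0)]
alloc, used_bw = allocate_threads(apps, total_cores=16, bw_capacity=20.0)
print(alloc, used_bw)  # prints {'compute': 8, 'memory': 4} 20.0
```

In this made-up scenario the bandwidth budget is exhausted after 12 threads, so four cores stay idle. A policy that looks only at CPU utilization would fill those cores and oversubscribe the memory controllers, which is the kind of trade-off the abstract alludes to.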

Index Terms

Maximizing system utilization via parallelism management for co-located parallel applications
1. Computing methodologies
  1. Parallel computing methodologies

Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

November 2018

494 pages

ISBN:9781450359863

DOI:10.1145/3243176

General Chair: Skevos Evripidou (University of Cyprus, Cyprus)

Program Chairs: Per Stenström (Chalmers University of Technology, Sweden), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IFIP WG 10.3: IFIP WG 10.3
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Qualifiers

Research-article

Funding Sources

National Research Foundation of Korea
Seoul National University

Conference

PACT '18

Sponsor:

SIGARCH

PACT '18: International conference on Parallel Architectures and Compilation Techniques

November 1 - 4, 2018

Limassol, Cyprus

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics

Total Citations: 17
Total Downloads: 362
Downloads (Last 12 months): 111
Downloads (Last 6 weeks): 16

Reflects downloads up to 23 Nov 2024

Cited By

Liu J, Yu Z (2024). Global-State Aware Automatic NUMA Balancing. Proceedings of the 15th Asia-Pacific Symposium on Internetware, pp. 317-326. DOI: 10.1145/3671016.3671380. Online publication date: 24-Jul-2024.
https://dl.acm.org/doi/10.1145/3671016.3671380
Rubio J, Bilbao C, Saez J, Prieto-Matias M (2024). Exploiting Elasticity via OS-Runtime Cooperation to Improve CPU Utilization in Multicore Systems. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 35-43. DOI: 10.1109/PDP62718.2024.00014. Online publication date: 20-Mar-2024.
https://doi.org/10.1109/PDP62718.2024.00014
Mastoras A, Yzelman A, Chen Q, Huang Z, Si M (2023). Studying the expressiveness and performance of parallelization abstractions for linear pipelines. Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 29-38. DOI: 10.1145/3582514.3582522. Online publication date: 25-Feb-2023.
https://dl.acm.org/doi/10.1145/3582514.3582522
Custódio R, Paulino H, Rito G, Agrawal K, Shun J (2023). Efficient Synchronization-Light Work Stealing. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 39-49. DOI: 10.1145/3558481.3591099. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3558481.3591099
Cheng D, Wang Y, Dai D (2023). Dynamic Resource Provisioning for Iterative Workloads on Apache Spark. IEEE Transactions on Cloud Computing 11:1, pp. 639-652. DOI: 10.1109/TCC.2021.3108043. Online publication date: 1-Jan-2023.
https://doi.org/10.1109/TCC.2021.3108043
Huang H, Zhao Y, Rao J, Wu S, Jin H, Wang D, Kun S, Pan L (2023). Adapt Burstable Containers to Variable CPU Resources. IEEE Transactions on Computers 72:3, pp. 614-626. DOI: 10.1109/TC.2022.3174480. Online publication date: 1-Mar-2023.
https://doi.org/10.1109/TC.2022.3174480
Bilbao C, Saez J, Prieto-Matias M (2023). Flexible system software scheduling for asymmetric multicore systems with PMCSched: A case for Intel Alder Lake. Concurrency and Computation: Practice and Experience 35:25. DOI: 10.1002/cpe.7814. Online publication date: 6-Jun-2023.
https://doi.org/10.1002/cpe.7814
Cho Y, Park J, Negele F, Jo C, Gross T, Egger B, Lee J, Agrawal K, Spear M (2022). Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022.
https://dl.acm.org/doi/10.1145/3503221.3508421
Srikanthan S, Chakraborti S, Ferro P, Dwarkadas S (2022). MAPPER: Managing Application Performance via Parallel Efficiency Regulation*. ACM Transactions on Architecture and Code Optimization 19:2, pp. 1-26. DOI: 10.1145/3501767. Online publication date: 24-Mar-2022.
https://dl.acm.org/doi/10.1145/3501767
Zhang Y, Yin L, Li D, Peng Y, Lu K (2022). ParaX: Bandwidth-Efficient Instance Assignment for DL on Multi-NUMA Many-Core CPUs. IEEE Transactions on Computers 71:11, pp. 3032-3046. DOI: 10.1109/TC.2022.3145164. Online publication date: 1-Nov-2022.
https://doi.org/10.1109/TC.2022.3145164