DOI: 10.1145/3243176.3243199
research-article

Maximizing system utilization via parallelism management for co-located parallel applications

Published: 01 November 2018

Abstract

With an increasing number of cores and memory controllers in multiprocessor platforms, co-location of parallel applications is gaining in importance. Key to achieving good performance is allocating the proper number of threads to each co-located application. This paper presents NuPoCo, a framework for automatically managing the parallelism of co-located parallel applications on NUMA multi-socket multi-core systems. NuPoCo maximizes the utilization of CPU cores and memory controllers by dynamically adjusting the number of threads for co-located parallel applications. Evaluated with various scenarios of co-located OpenMP applications on a 64-core AMD and a 72-core Intel machine, NuPoCo reduces the total turnaround time by 10-20% compared to both the default Linux scheduler and an existing parallelism management policy that focuses on CPU utilization only.
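To make the idea of balancing cores against memory bandwidth concrete, the following is a minimal illustrative sketch and not NuPoCo's actual algorithm, which the abstract does not describe. It assumes a hypothetical model in which each application demands a fixed memory bandwidth per thread, and it hands out cores round-robin until either the core budget or the bandwidth budget is exhausted. The names `App`, `allocate_threads`, `bw_per_thread`, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    bw_per_thread: float  # hypothetical bandwidth demand per thread (GB/s)
    max_threads: int      # hypothetical upper bound on useful threads


def allocate_threads(apps, total_cores, total_bw):
    """Greedy round-robin allocation (illustrative only).

    Gives each application one more thread per round as long as a core is
    free and the application's per-thread bandwidth demand still fits into
    the remaining memory bandwidth budget.
    """
    alloc = {a.name: 0 for a in apps}
    cores_left = total_cores
    bw_left = total_bw
    progress = True
    while cores_left > 0 and progress:
        progress = False
        for a in apps:
            if cores_left == 0:
                break
            fits_bw = a.bw_per_thread <= bw_left
            if alloc[a.name] < a.max_threads and fits_bw:
                alloc[a.name] += 1
                cores_left -= 1
                bw_left -= a.bw_per_thread
                progress = True
    return alloc


if __name__ == "__main__":
    # A compute-bound and a memory-bound application sharing one machine.
    apps = [App("compute", 0.5, 64), App("memory", 4.0, 64)]
    print(allocate_threads(apps, total_cores=16, total_bw=20.0))
```

In this toy model the memory-bound application stops receiving threads once the bandwidth budget is nearly used up, while the compute-bound one keeps growing. A policy that looks at CPU utilization only would have no reason to stop adding threads to the memory-bound application.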

    Published In

    PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
    November 2018
    494 pages
    ISBN: 9781450359863
    DOI: 10.1145/3243176
    In-Cooperation

    • IFIP WG 10.3
    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication Notes

    Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

    Author Tags

    1. OpenMP
    2. parallelism management
    3. resource utilization

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea
    • Seoul National University

    Conference

    PACT '18
    Acceptance Rates

    Overall Acceptance Rate 121 of 471 submissions, 26%

    Cited By

    • (2024) Global-State Aware Automatic NUMA Balancing. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 317-326. DOI: 10.1145/3671016.3671380. Online publication date: 24-Jul-2024
    • (2024) Exploiting Elasticity via OS-Runtime Cooperation to Improve CPU Utilization in Multicore Systems. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 35-43. DOI: 10.1109/PDP62718.2024.00014. Online publication date: 20-Mar-2024
    • (2023) Studying the expressiveness and performance of parallelization abstractions for linear pipelines. Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, 29-38. DOI: 10.1145/3582514.3582522. Online publication date: 25-Feb-2023
    • (2023) Efficient Synchronization-Light Work Stealing. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, 39-49. DOI: 10.1145/3558481.3591099. Online publication date: 17-Jun-2023
    • (2023) Dynamic Resource Provisioning for Iterative Workloads on Apache Spark. IEEE Transactions on Cloud Computing 11:1, 639-652. DOI: 10.1109/TCC.2021.3108043. Online publication date: 1-Jan-2023
    • (2023) Adapt Burstable Containers to Variable CPU Resources. IEEE Transactions on Computers 72:3, 614-626. DOI: 10.1109/TC.2022.3174480. Online publication date: 1-Mar-2023
    • (2023) Flexible system software scheduling for asymmetric multicore systems with PMCSched: A case for Intel Alder Lake. Concurrency and Computation: Practice and Experience 35:25. DOI: 10.1002/cpe.7814. Online publication date: 6-Jun-2023
    • (2022) Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022
    • (2022) MAPPER: Managing Application Performance via Parallel Efficiency Regulation. ACM Transactions on Architecture and Code Optimization 19:2, 1-26. DOI: 10.1145/3501767. Online publication date: 24-Mar-2022
    • (2022) ParaX: Bandwidth-Efficient Instance Assignment for DL on Multi-NUMA Many-Core CPUs. IEEE Transactions on Computers 71:11, 3032-3046. DOI: 10.1109/TC.2022.3145164. Online publication date: 1-Nov-2022
