Research Article
Open Access

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures

Published: 08 July 2015

Abstract

Modern mainstream high-performance computers adopt a multisocket multicore CPU architecture and a NUMA-based memory architecture. Because traditional work-stealing schedulers are designed for single-socket architectures, they incur severe shared-cache misses and remote memory accesses on these computers. To solve this problem, we propose a locality-aware work-stealing (LAWS) scheduler, which better utilizes both the shared cache and the memory system. In LAWS, a load-balanced task allocator evenly splits and stores the dataset of a program across all the memory nodes and allocates each task to the socket whose local memory node stores the task's data, thereby reducing remote memory accesses. An adaptive DAG packer then uses an auto-tuning approach to optimally pack the execution DAG into cache-friendly subtrees. Once the subtrees are created, every socket executes its cache-friendly subtrees sequentially to optimize shared-cache usage, while a triple-level work-stealing scheduler schedules the subtrees and the tasks within each subtree. Through theoretical analysis, we show that LAWS has time and space bounds comparable to those of traditional work-stealing schedulers. Experimental results show that LAWS improves the performance of memory-bound programs by up to 54.2% on AMD-based experimental platforms and up to 48.6% on Intel-based experimental platforms compared with traditional work-stealing schedulers.
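
The abstract describes three cooperating mechanisms: a load-balanced task allocator, an adaptive DAG packer, and a triple-level work-stealing scheduler. The C++ sketch below illustrates only the last idea in a heavily simplified form: each worker owns a deque, and an idle worker looks for work in its own deque first, then steals from cores on its own socket (which share a last-level cache), and only then from remote sockets. This is a minimal sketch of the general socket-aware stealing order, not the LAWS implementation from the paper; it omits the task allocator, the DAG packer, and the auto-tuning step, uses locked deques instead of the non-blocking deques real runtimes use, and all identifiers (WorkerDeque, Runtime, next_task, socket_of) are hypothetical.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

// One double-ended queue per worker. A plain mutex keeps the sketch short;
// production work-stealing runtimes typically use non-blocking deques.
struct WorkerDeque {
    std::deque<Task> tasks;
    std::mutex lock;

    void push(Task t) {                        // owner works at the back
        std::lock_guard<std::mutex> g(lock);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> pop() {                // owner pops its newest task
        std::lock_guard<std::mutex> g(lock);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<Task> steal() {              // thieves take the oldest task
        std::lock_guard<std::mutex> g(lock);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

// Hypothetical runtime skeleton: next_task encodes a socket-aware steal
// order (own deque -> same-socket victims -> remote-socket victims).
struct Runtime {
    int sockets, cores_per_socket;
    std::vector<WorkerDeque> deques;           // one deque per core

    Runtime(int s, int c) : sockets(s), cores_per_socket(c), deques(s * c) {}

    int socket_of(int core) const { return core / cores_per_socket; }

    std::optional<Task> next_task(int self) {
        if (auto t = deques[self].pop()) return t;          // 1. own deque

        const int my_socket = socket_of(self);
        const int n = sockets * cores_per_socket;
        for (int v = 0; v < n; ++v)                         // 2. same socket
            if (v != self && socket_of(v) == my_socket)
                if (auto t = deques[v].steal()) return t;

        for (int v = 0; v < n; ++v)                         // 3. remote sockets
            if (socket_of(v) != my_socket)
                if (auto t = deques[v].steal()) return t;

        return std::nullopt;                                // nothing runnable
    }
};

The point of the ordering is simply that a thief exhausts victims that share its socket's last-level cache and local memory node before resorting to a steal that crosses the interconnect, which is the intuition behind the locality-aware stealing described above.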

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 2
    July 2015, 410 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/2775085

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 08 July 2015
    Accepted: 01 April 2015
    Received: 01 March 2015
    Published in TACO Volume 12, Issue 2


    Author Tags

    1. History-based auto-tuning
    2. Memory subsystem
    3. Task scheduling

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2022) Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing. IEEE Transactions on Parallel and Distributed Systems 33, 12 (1 Dec. 2022), 4530–4546. DOI: 10.1109/TPDS.2022.3196192
    • (2018) NumaMMA. Proceedings of the 47th International Conference on Parallel Processing (13 Aug. 2018), 1–10. DOI: 10.1145/3225058.3225094
    • (2018) Scheduling Parallel Computations by Work Stealing. International Journal of Parallel Programming 46, 2 (1 Apr. 2018), 173–197. DOI: 10.1007/s10766-016-0484-8
    • (2017) Congra: Towards Efficient Processing of Concurrent Graph Queries on Shared-Memory Machines. 2017 IEEE International Conference on Computer Design (ICCD) (Nov. 2017), 217–224. DOI: 10.1109/ICCD.2017.40
    • (2017) Work-Stealing for NUMA-enabled Architecture. Task Scheduling for Multi-core and Parallel Architectures (25 Nov. 2017), 73–111. DOI: 10.1007/978-981-10-6238-4_4
