Research Article
Open Access

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures

Published: 08 July 2015

Abstract

Modern mainstream high-performance computers adopt a multisocket multicore CPU architecture and a NUMA-based memory architecture. Because traditional work-stealing schedulers are designed for single-socket architectures, they incur severe shared-cache misses and remote memory accesses on these computers. To solve this problem, we propose a locality-aware work-stealing (LAWS) scheduler, which better utilizes both the shared cache and the memory system. In LAWS, a load-balanced task allocator evenly splits and stores the dataset of a program across all the memory nodes and allocates each task to the socket whose local memory node stores the task's data, thereby reducing remote memory accesses. An adaptive DAG packer then uses an auto-tuning approach to optimally pack the execution DAG into cache-friendly subtrees. Once the subtrees are created, every socket executes its cache-friendly subtrees sequentially to optimize shared-cache usage, while a triple-level work-stealing scheduler schedules the subtrees and the tasks within each subtree. Through theoretical analysis, we show that LAWS has time and space bounds comparable to those of traditional work-stealing schedulers. Experimental results show that LAWS improves the performance of memory-bound programs by up to 54.2% on AMD-based experimental platforms and up to 48.6% on Intel-based experimental platforms compared with traditional work-stealing schedulers.
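
The abstract describes three cooperating mechanisms: a load-balanced task allocator, an adaptive DAG packer, and a triple-level work-stealing scheduler. The C++ sketch below illustrates only the last idea in a heavily simplified form: each worker owns a deque, and an idle worker looks for work in its own deque first, then steals from cores on its own socket (which share a last-level cache), and only then from remote sockets. This is a minimal sketch of the general socket-aware stealing order, not the LAWS implementation from the paper; it omits the task allocator, the DAG packer, and the auto-tuning step, uses locked deques instead of the non-blocking deques real runtimes use, and all identifiers (WorkerDeque, Runtime, next_task, socket_of) are hypothetical.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

// One double-ended queue per worker. A plain mutex keeps the sketch short;
// production work-stealing runtimes typically use non-blocking deques.
struct WorkerDeque {
    std::deque<Task> tasks;
    std::mutex lock;

    void push(Task t) {                        // owner works at the back
        std::lock_guard<std::mutex> g(lock);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> pop() {                // owner pops its newest task
        std::lock_guard<std::mutex> g(lock);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<Task> steal() {              // thieves take the oldest task
        std::lock_guard<std::mutex> g(lock);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

// Hypothetical runtime skeleton: next_task encodes a socket-aware steal
// order (own deque -> same-socket victims -> remote-socket victims).
struct Runtime {
    int sockets, cores_per_socket;
    std::vector<WorkerDeque> deques;           // one deque per core

    Runtime(int s, int c) : sockets(s), cores_per_socket(c), deques(s * c) {}

    int socket_of(int core) const { return core / cores_per_socket; }

    std::optional<Task> next_task(int self) {
        if (auto t = deques[self].pop()) return t;          // 1. own deque

        const int my_socket = socket_of(self);
        const int n = sockets * cores_per_socket;
        for (int v = 0; v < n; ++v)                         // 2. same socket
            if (v != self && socket_of(v) == my_socket)
                if (auto t = deques[v].steal()) return t;

        for (int v = 0; v < n; ++v)                         // 3. remote sockets
            if (socket_of(v) != my_socket)
                if (auto t = deques[v].steal()) return t;

        return std::nullopt;                                // nothing runnable
    }
};

The point of the ordering is simply that a thief exhausts victims that share its socket's last-level cache and local memory node before resorting to a steal that crosses the interconnect, which is the intuition behind the locality-aware stealing described above.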

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 2
    July 2015, 410 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/2775085

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 08 July 2015
    Accepted: 01 April 2015
    Received: 01 March 2015
    Published in TACO Volume 12, Issue 2


    Author Tags

    1. History-based auto-tuning
    2. Memory subsystem
    3. Task scheduling

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2022) Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing. IEEE Transactions on Parallel and Distributed Systems 33, 12 (1 Dec. 2022), 4530–4546. DOI: 10.1109/TPDS.2022.3196192
    • (2018) NumaMMA. Proceedings of the 47th International Conference on Parallel Processing (13 Aug. 2018), 1–10. DOI: 10.1145/3225058.3225094
    • (2018) Scheduling Parallel Computations by Work Stealing. International Journal of Parallel Programming 46, 2 (1 Apr. 2018), 173–197. DOI: 10.1007/s10766-016-0484-8
    • (2017) Congra: Towards Efficient Processing of Concurrent Graph Queries on Shared-Memory Machines. 2017 IEEE International Conference on Computer Design (ICCD) (Nov. 2017), 217–224. DOI: 10.1109/ICCD.2017.40
    • (2017) Work-Stealing for NUMA-enabled Architecture. Task Scheduling for Multi-core and Parallel Architectures (25 Nov. 2017), 73–111. DOI: 10.1007/978-981-10-6238-4_4
