research-article

A stealing mechanism for delegation methods

Authors:

Yiping YaoAuthors Info & Claims

The Journal of Supercomputing, Volume 77, Issue 10

Pages 10827 - 10849

https://doi.org/10.1007/s11227-021-03719-2

Published: 01 October 2021 Publication History

Abstract

Modern multi-core architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a NUMA node is much faster than access to data cached on a remote node. Prior work has shown that on the NUMA multi-core system, delegation is the highly efficient solution to address the generally poor performance of locking methods on highly contended locks due to delegating the execution of critical section to one thread, reducing the movement of shared data between cores. However, we observe that delegation methods exhibit sub-par single-thread performance due mainly to the overheads of the communication between the server and client threads and request array traversal. To address this problem, this paper proposes a stealing mechanism that under no contention the clients can directly enter the critical section by enabling lock stealing. Meanwhile, under contention it employs delegation protocol by disabling lock stealing. Furthermore, we apply stealing mechanism to the state-of-the-art delegation methods: FFWD and RCL. The evaluation shows that delegation augmented with lock stealing can significantly improve the performance in the non-contended case, while matching the performance of their original counterparts in the contended circumstances.

References

[1]

Anderson TE The performance of spin lock alternatives for shared memory multiprocessors IEEE Trans Parallel Distrib Syst 1990 1 1 6-16

[2]

Baumann A, Barham P, Dagand PE, Tyshasta H, Rebecca I, Simon P, Timothy R, Adrian S, Akhilesh S (2009) The multi-kernel: a new OS architecture for scalable multicore systems. The ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP 09), pp. 29-44

[3]

Baumann A, Peter S, Adrian S, Akhilesh S, Timothy R, Paul B, Rebecca I (2009) Your computer is already a distributed system. Why Isn’t Your OS? Workshop on Workstation Operating Systems /workshop on Hot Topics in Operating Systems (HotOS 09), pp. 12-12

[4]

Bienia C (2011) Benchmarking modern multiprocessors. Ph.D. Dissertation. Princeton University, Princeton, NJ

[5]

Boyd-Wickizer S, Clements AT, Mao Y, Pesterev A, Kaashoek MF, Morris RT, Zeldovich N (2010) An analysis of Linux scalability to many cores. In: Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Vancouver, Canada, pp. 1-16

[6]

Boyd-Wickizer S, Kaashoek MF, Morris R, Zeldovich N (2012) Non-scalable locks are dangerous. In: Proceedings of the Linux Symposium. Ottawa, Canada

[7]

Chabbi M, Fagan M, Mellor-Crummey J (2015) High performance locks for multi-level NUMA systems. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 215-226

[8]

Chabbi M, Mellor-Crummey J (2016) Contention-conscious, locality-preserving locks. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

[9]

Craig T (1993) Building FIFO and priority queuing spin locks from atomic swap, Univ. Washington, Seattle, WA, USA, Tech. Rep. 93-02-02

[10]

Dave Dice, Virendra J. Marathe, and Nir Shavit. Flat-combining NUMA locks. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 65-74, 2011

[11]

David T, Guerraoui R, Trigonakis V (2013) Everything you always wanted to know about synchronization but were afraid to ask. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles(SOSP 13), pp. 33-48

[12]

Dice D (2017) Malthusian locks. In: Proceedings of the Twelfth European Conference on Computer Systems (EuroSys), pp. 314-327

[13]

Dice D, Marathe VJ, and Shavit N Lock cohorting: a general technique for designing NUMA locks ACM Trans Parallel Comput 2015 1 2 13:1-13:42

[14]

Dice D, Kogan A (2019) Compact NUMA-aware Locks. In: Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys’19). ACM, New York, NY, USA, Article 12, 15 pages

[15]

Dogan H, Hijaz F, Ahmad M, Kahne B, Wilson P, Khan O (2017). Accelerating graph and machine learning workloads using a shared memory multicore architecture with auxiliary support for in-hardware explicit messaging. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, pp. 254-264

[16]

Eyerman S, Eeckhout L (2010) Modeling critical sections in Amdahl’s law and its implications for multicore design. In: Proceedings of the Annual International Symposium on Computer Architecture (ISCA), pp. 362-370

[17]

Fatourou P, Kallimanis ND (2012) Revisiting the combining synchronization technique. In: Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), pp. 257-266, New Orleans, LA

[18]

Guerraoui R, Guiroux H, Lachaize R, Quma V, and Trigonakis V Lock-unlock: Is that all? a pragmatic analysis of locking in software systems ACM Trans Comput Syst 2019 36 1 1:1-1:149

[19]

Guerraoui R, Trigonakis V (2016) Optimistic concurrency with OPTIK. In: Proceedings of the 21st ACM Symposium on Principles and Practice of Parallel Programming (PPoPP). ACM, Barcelona, Spain, pp. 18:1-18:12

[20]

Guiroux H, Lachaize R, Quéma V (2016) Multicore locks: the case is not closed yet. In: Proceedings of the USENIX Annual Technical Conference (ATC), pp. 649-662

[21]

Hackenberg D, Molka D, Nagel W (2009) Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. ACM International Symposium on Microarchitecture(MICRO 09), pp. 413- 422

[22]

Hendler D, Incze I, Shavit N, Tzafrir M (2010) Flat combining and the synchronization-parallelism tradeoff. In: Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures, pp. 355-364. ACM

[23]

He B, Scherer WN, Scott ML (2005) Preemption Adaptivity in Time-published Queue-based Spin Locks. In: Proceedings of the 12th International Conference on High Performance Computing (HiPC’05) (2005), Springer-Verlag

[24]

Kashyap S, Calciu I, Cheng X, Min C, Kim T (2019) Scalable and practical locking with shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pp. 586-599

[25]

Kashyap S, Min C, Kim T (2017) Scalable NUMA-aware Blocking Synchronization Primitives. In: Proceedings of the (2017) USENIX Annual Technical Conference (ATC). USENIX Association, Santa Clara, CA

[26]

Kim J, Mathew A, Kashyap S, Ramanathan MK, Min C (2019) MV-RLU: scaling read-log-update with multi-versioning. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 779-792

[27]

Klaftenegger D, Sagonas K, and Winblad K Queue delegation locking IEEE Trans Parallel Distrib Syst 2018 29 3 687-704

[28]

Liu T and Berger ED Sheriff: precise detection and automatic mitigation of false sharing ACM Sigplan Notices 2011 46 3-18

[29]

Lozi JP, David F, Thomas G, Lawall J, Muller G (2016) Fast and portable locking for multicore architectures. ACM Trans. Comput. Syst., 33(4):13:1-13:62

[30]

Lozi JP, David F, Thomas G, Lawall JL, Muller G et al. (2012) Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In: USENIX Annual Technical Conference, pp. 65-76

[31]

Magnusson PS, Landin A, Hagersten E (1994) Queue locks on cache coherent multiprocessors. In: Proceedings of the 8th International Parallel Processing Symposium, pages pp. 165-171

[32]

Mellor-Crummey JM and Scott ML Algorithms for scalable synchronization on shared-memory multiprocessors ACM Trans Comput Syst 1991

[33]

Memcached, . https://memcached.org

[34]

Memslap, . http://docs.libmemcached.org/bin/ memaslap.html

[35]

Nanavati M, Spear M, Taylor N, Rajagopalan S, Meyer DT, Aiello W, Warfield A (2013) Whose cache line is it anyway?: operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, pp. 141-154. ACM, 2013

[36]

Oyama Y, Taura K, Yonezawa A (1999) Executing parallel programs with synchronization bottlenecks efficiently, In: Proc. Int. Workshop Parallel Distrib. Comput. Symbolic Irregular Appl., pp. 182-204

[37]

Radovic Z, Hagersten E (2003) Hierarchical backoff locks for nonuniform communication architectures. In International Symposium on High- Performance Computer Architecture (HPCA), pp. 241-252

[38]

Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C (2007) Evaluating mapreduce for multi-core and multiprocessor systems. In High Performance Computer Architecture. HPCA (2007) IEEE 13th International Symposium on, pp. 13-24. Ieee, 2007

[39]

Roghanchi S, Eriksson J, Basu N (2017) ffwd: delegation is (much) faster than you think. In: Proceedings of the 26th Symposium on Operating Systems Principles. ACM, pp. 342-358

[40]

Splash-2. http://www.capsl.udel.edu/splash

[41]

Tang X, Zhai J, Qian X, Chen W (2019) pLock: a fast lock for architectures with explicit inter-core message passing. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 765-778

[42]

Victor Luchangco, Dan Nussbaum, and Nir Shavit. A hierarchical CLH queue lock. In Proceedings of the 12th International Conference on Parallel Processing (EuroPar), pages 801-810, 2006

[43]

Zhang M, Chen H, Cheng L, Lau FCM, and Wang CL Scalable Adaptive NUMA-Aware Lock IEEE Trans Parallel Distrib Syst 2017 28 6 1754-1769

Index Terms

A stealing mechanism for delegation methods
1. Computing methodologies
  1. Parallel computing methodologies
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems

Index terms have been assigned to the content through auto-classification.

Recommendations

On the Performance of Delegation over Cache-Coherent Shared Memory
ICDCN '15: Proceedings of the 16th International Conference on Distributed Computing and Networking

Delegation is a thread synchronization technique where access to shared data is performed through a dedicated server thread. When a client thread requires shared data access, it makes a request to a server and waits for a response. This paper studies ...
Compact NUMA-aware Locks
EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware ...
Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors

Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The Journal of Supercomputing

The Journal of Supercomputing Volume 77, Issue 10

Oct 2021

1466 pages

ISSN:0920-8542

Issue’s Table of Contents

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 October 2021

Accepted: 25 February 2021

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 02 Oct 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents