On the Performance of Delegation over Cache-Coherent Shared Memory

Authors:

Darko Petrović,

André Schiper

ICDCN '15: Proceedings of the 16th International Conference on Distributed Computing and Networking

Article No.: 17, Pages 1 - 10

https://doi.org/10.1145/2684464.2684476

Published: 04 January 2015

Abstract

Delegation is a thread synchronization technique where access to shared data is performed through a dedicated server thread. When a client thread requires shared data access, it makes a request to a server and waits for a response. This paper studies delegation implementation over cache-coherent shared memory, with the goal of optimizing it for high throughput. Whereas client-server communication naturally fits message-passing systems, efficient implementation over cache-coherent shared memory requires careful optimization. We demonstrate optimizations that significantly improve delegation performance on two modern x86 processors (the Intel Xeon Westmere and the AMD Opteron Magny-Cours), enabling us to come up with counter, stack and queue implementations that outperform the best known alternatives in a large number of cases. Our optimized delegation solution achieves 1.4x (resp. 2x) higher throughput compared to the most efficient state-of-the-art delegation solution on the Intel Xeon (resp. AMD Opteron).
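
The client-server pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and includes none of the optimizations the paper studies: Python's thread-safe `queue.Queue` stands in for the shared-memory communication channel, and the function name `run_delegated_counter` and its parameters are hypothetical. The sketch only shows the shape of delegation: a single server thread owns the counter, and each client sends a request and waits for the response.

```python
import queue
import threading


def run_delegated_counter(num_clients, ops_per_client):
    """Increment a shared counter only through a dedicated server thread.

    Hypothetical illustration of delegation; queue.Queue is an assumed
    stand-in for the request/response channel, not the paper's mechanism.
    """
    requests = queue.Queue()  # client -> server channel
    counter = 0               # touched only by the server thread

    def server():
        nonlocal counter
        while True:
            item = requests.get()
            if item is None:      # shutdown sentinel sent by the main thread
                return
            delta, reply = item
            counter += delta      # no lock here: one thread owns the data
            reply.put(counter)    # send the response back to the client

    def client():
        reply = queue.Queue(maxsize=1)   # per-client response channel
        for _ in range(ops_per_client):
            requests.put((1, reply))     # make a request ...
            reply.get()                  # ... and wait for the response

    server_thread = threading.Thread(target=server)
    server_thread.start()
    clients = [threading.Thread(target=client) for _ in range(num_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    requests.put(None)        # all clients are done; stop the server
    server_thread.join()
    return counter
```

Calling `run_delegated_counter(4, 1000)` returns 4000, since every increment is applied exactly once by the server. The counter itself needs no lock because only the server thread ever reads or writes it; all synchronization cost is moved into the request and response channels, which is the part the paper sets out to optimize on cache-coherent hardware.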

Cited By

M. Hemmatpour, R. Ferrero, F. Gandino, B. Montrucchio, and M. Rebaudengo. Cost Evaluation of Synchronization Algorithms for Multicore Architectures. In Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics, pages 697-713, 2019.
https://doi.org/10.4018/978-1-5225-7598-6.ch051

M. Hemmatpour, R. Ferrero, F. Gandino, B. Montrucchio, and M. Rebaudengo. Cost Evaluation of Synchronization Algorithms for Multicore Architectures. In Encyclopedia of Information Science and Technology, Fourth Edition, pages 3989-4003, 2018.
https://doi.org/10.4018/978-1-5225-2255-3.ch346

M. Zhang, H. Chen, L. Cheng, F. Lau, and C. Wang. Scalable Adaptive NUMA-Aware Lock. IEEE Transactions on Parallel and Distributed Systems, 28(6):1754-1769, June 2017.
https://dl.acm.org/doi/10.1109/TPDS.2016.2630695

Index Terms

On the Performance of Delegation over Cache-Coherent Shared Memory
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published In

ICDCN '15: Proceedings of the 16th International Conference on Distributed Computing and Networking

January 2015

360 pages

ISBN:9781450329286

DOI:10.1145/2684464

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICDCN '15

ICDCN '15: International Conference on Distributed Computing and Networking

January 4 - 7, 2015

Goa, India
