research-article

Investigating the Performance of Hardware Transactions on a Multi-Socket Machine

Authors:

Victor Luchangco

SPAA '16: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures

Pages 121 - 132

https://doi.org/10.1145/2935764.2935796

Published: 11 July 2016

Abstract

The introduction of hardware transactional memory (HTM) into commercial processors opens the door to designing and implementing scalable synchronization mechanisms. One example of such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Through an extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.
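To make the TLE idea in the abstract concrete, here is a minimal sketch of its control flow: try the critical section speculatively a few times, then fall back to the real lock. It is an illustration under stated assumptions, not the paper's implementation. Python has no hardware transactions, so `try_transaction` is a hypothetical stand-in for a hardware transaction attempt, and the names `ElidedLock`, `max_attempts`, and `pick_mode` are invented for this example. The abstract does not say how the adaptive throttling technique decides; `pick_mode` only shows one plausible shape for such a decision (choose the thread placement with the best sampled throughput).

```python
import threading


class ElidedLock:
    """Sketch of the transactional lock elision (TLE) control flow.

    Assumption: `try_transaction(body)` is a hypothetical stand-in for a
    hardware transaction. It returns True if the speculative run of `body`
    committed, and False if it aborted (leaving no visible effects).
    """

    def __init__(self, max_attempts=5):
        self._lock = threading.Lock()
        self.max_attempts = max_attempts

    def run(self, critical_section, try_transaction):
        # Speculative path: attempt the critical section without the lock.
        for _ in range(self.max_attempts):
            if self._lock.locked():
                # Someone holds the fallback lock; speculation would conflict.
                continue
            if try_transaction(critical_section):
                return "htm"
        # Fallback path: too many aborts, so take the lock for real.
        with self._lock:
            critical_section()
        return "lock"


def pick_mode(throughput):
    """Hypothetical throttling policy, not taken from the paper.

    `throughput` maps a thread-placement mode (for example "both_sockets"
    or "socket0_only") to operations per second sampled in a short
    profiling phase. The mode with the highest sample wins.
    """
    return max(throughput, key=throughput.get)
```

In a real system the speculative path would use hardware transaction instructions, and the fallback lock would be read inside the transaction so that a lock acquisition aborts concurrent speculators.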

Cited By

Khalaji M, Brown T, Daudjee K, Aksenov V, Lee I, Chabbi M, Steuwer M (2024). Practical Hardware Transactional vEB Trees. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 215-228. DOI: 10.1145/3627535.3638504. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1145/3627535.3638504
Giannoula C, Peppas A, Goumas G, Koziris N (2022). High-performance and balanced parallel graph coloring on multicore platforms. The Journal of Supercomputing, 79:6, pages 6373-6421. DOI: 10.1007/s11227-022-04894-6. Online publication date: 7-Nov-2022.
https://doi.org/10.1007/s11227-022-04894-6
Bang T, May N, Petrov I, Binnig C (2022). The full story of 1000 cores. The VLDB Journal, 31:6, pages 1185-1213. DOI: 10.1007/s00778-022-00742-4. Online publication date: 29-Apr-2022.
https://doi.org/10.1007/s00778-022-00742-4

Index Terms

Investigating the Performance of Hardware Transactions on a Multi-Socket Machine
1. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Shared memory algorithms
  2. Models of computation
    1. Concurrency

Information & Contributors

Information

Published In

SPAA '16: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures

July 2016

492 pages

ISBN:9781450342100

DOI:10.1145/2935764

General Chair: Christian Scheideler (University of Paderborn)
Program Chair: Seth Gilbert (National University of Singapore)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SPAA '16: 28th ACM Symposium on Parallelism in Algorithms and Architectures

July 11 - 13, 2016

Pacific Grove, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics & Citations

Bibliometrics

Total Citations: 26
Total Downloads: 300
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 0

Reflects downloads up to 23 Sep 2024

Citations

Al Badawi A, Chen L, Vig S (2022). Fast homomorphic SVM inference on encrypted data. Neural Computing and Applications, 34:18, pages 15555-15573. DOI: 10.1007/s00521-022-07202-8. Online publication date: 1-Sep-2022.
https://dl.acm.org/doi/10.1007/s00521-022-07202-8
Zou X, Wang F, Feng D, Yang F, Lei M, Liu C (2022). SPHT: A scalable and high-performance hashing scheme for persistent memory. Software: Practice and Experience, 52:7, pages 1679-1697. DOI: 10.1002/spe.3083. Online publication date: 21-Mar-2022.
https://doi.org/10.1002/spe.3083
Yi Z, Yao Y, Chen K (2021). A Universal Construction to implement Concurrent Data Structure for NUMA-muticore. Proceedings of the 50th International Conference on Parallel Processing, pages 1-11. DOI: 10.1145/3472456.3472475. Online publication date: 9-Aug-2021.
https://dl.acm.org/doi/10.1145/3472456.3472475
Navarro-Torres A, Alastruey-Benede J, Ibanez-Marin P, Carpen-Amarie M (2021). Synchronization Strategies on Many-Core SMT Systems. 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 54-63. DOI: 10.1109/SBAC-PAD53543.2021.00017. Online publication date: Oct-2021.
https://doi.org/10.1109/SBAC-PAD53543.2021.00017
Bang T, May N, Petrov I, Binnig C, Porobic D, Neumann T (2020). The tale of 1000 Cores. Proceedings of the 16th International Workshop on Data Management on New Hardware, pages 1-9. DOI: 10.1145/3399666.3399910. Online publication date: 15-Jun-2020.
https://dl.acm.org/doi/10.1145/3399666.3399910
Park S, McKenney P, Dufour L, Yeom H, Bilas A, Magoutis K, Markatos E, Kostic D, Seltzer M (2020). An HTM-based update-side synchronization for RCU on NUMA systems. Proceedings of the Fifteenth European Conference on Computer Systems, pages 1-15. DOI: 10.1145/3342195.3387527. Online publication date: 15-Apr-2020.
https://dl.acm.org/doi/10.1145/3342195.3387527
Bang T, Oukid I, May N, Petrov I, Binnig C, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Robust Performance of Main Memory Data Structures by Configuration. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1651-1666. DOI: 10.1145/3318464.3389725. Online publication date: 11-Jun-2020.
https://dl.acm.org/doi/10.1145/3318464.3389725
