DOI: 10.1145/2935764.2935796
Research article

Investigating the Performance of Hardware Transactions on a Multi-Socket Machine

Published: 11 July 2016

Abstract

The introduction of hardware transactional memory (HTM) into commercial processors opens the door to designing and implementing scalable synchronization mechanisms. One example of such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon and present a simple adaptive technique that overcomes the problem by throttling threads as necessary to optimize system performance. Through extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.
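The abstract does not spell out how the adaptive technique decides when to throttle. As a rough illustration only, and not the authors' algorithm, the general idea of "profile each thread placement, then keep the best one" might be sketched as follows. The mode names, the `pick_mode` function, and the `measure` callback are all hypothetical, and the throughput numbers are made-up placeholders rather than results from the paper.

```python
from typing import Callable, Dict, Sequence

# Hypothetical placement modes for a two-socket machine (names are assumptions).
MODES = ("all_sockets", "socket_0_only", "socket_1_only")


def pick_mode(measure: Callable[[str], float], modes: Sequence[str] = MODES) -> str:
    """Sample throughput once per mode and return the best-performing mode.

    `measure` is a caller-supplied profiler (hypothetical): it would run the
    workload briefly with threads restricted as the mode says and return
    operations per second.
    """
    samples: Dict[str, float] = {mode: measure(mode) for mode in modes}
    # max() keeps the earliest mode on ties, so "all_sockets" wins a draw.
    return max(samples, key=lambda mode: samples[mode])


# Made-up illustrative numbers, not measurements from the paper.
does_not_scale = {"all_sockets": 2.0, "socket_0_only": 9.0, "socket_1_only": 8.5}
scales = {"all_sockets": 16.0, "socket_0_only": 9.0, "socket_1_only": 8.5}

print(pick_mode(does_not_scale.__getitem__))  # socket_0_only
print(pick_mode(scales.__getitem__))          # all_sockets
```

In this toy model, a workload whose throughput collapses when threads span both sockets is confined to one socket, while a workload that scales keeps every thread running. A real system would need to re-profile periodically, since the best placement can change as the workload changes.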

Published In

SPAA '16: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures
July 2016
492 pages
ISBN: 9781450342100
DOI: 10.1145/2935764
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. concurrent data structures
2. hardware transactional memory
3. lock elision
4. locks
5. non-uniform memory access

Qualifiers

• Research-article

Conference

SPAA '16

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
