Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/2342821.2342827guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications

Published: 13 June 2012 Publication History

Abstract

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. In this paper, we propose a new lock algorithm, Remote Core Locking (RCL), that aims to improve the performance of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server core. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the core acquiring the lock because such data can typically remain in the server core's cache.
We have developed a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX locks into RCL locks. We have evaluated our approach on 18 applications: Memcached, Berkeley DB, the 9 applications of the SPLASH-2 benchmark suite and the 7 applications of the Phoenix2 benchmark suite. 10 of these applications, including Memcached and Berkeley DB, are unable to scale because of locks, and benefit from RCL. Using RCL locks, we get performance improvements of up to 2.6 times with respect to POSIX locks on Memcached, and up to 14 times with respect to Berkeley DB.

References

[1]
J. L. Abellán, J. Fernández, and M. E. Acacio. GLocks: Efficient support for highly-contended locks in many-core CMPs. In 25th IPDPS, 2011.
[2]
A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In ISCA'89, pages 396-406, 1989.
[3]
D. F. Bacon, R. Konuru, C. Murthy, and M. Serrano. Thin locks: featherweight synchronization for Java. In PLDI'98, pages 258-268, 1998.
[4]
A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schuepbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP'09, pages 29-44, 2009.
[5]
L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshtein. Optimal strategies for spinning and blocking. Journal of Parallel and Distributed Computing, 1993.
[6]
A. Brodsky, F. Ellen, and P. Woelfel. Fully-adaptive algorithms for long-lived renaming. In DISC '06, pages 413-427, 2006.
[7]
E. M. Chaves Jr., P. Das, T. J. LeBlanc, B. D. Marsh, and M. L. Scott. Kernel-kernel communication in a shared-memory multiprocessor. Concurrency - Practice and Experience, 5(3):171-191, 1993.
[8]
T. S. Craig. Building FIFO and priority-queueing spin locks from atomic swap. Technical Report TR 93-02-02, Department of Computer Science, University of Washington, Feb. 2003.
[9]
U. Drepper and I. Molnar. Native POSIX thread library for Linux. Technical report, RedHat, 2003.
[10]
M. Fowler. Refactoring: Improving the Design of Existing Code. Addison Wesley, 1999.
[11]
B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In OSDI'99, pages 87-100.
[12]
B. He, W. N. Scherer III, and M. L. Scott. Preemption adaptivity in time-published queue-based spin locks. In 11th International Conference on High Performance Computing, 2005.
[13]
D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In SPAA'10, pages 355-364, 2010.
[14]
M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Elsevier, 2008.
[15]
F. R. Johnson, R. Stoica, A. Ailamaki, and T. C. Mowry. Decoupling contention management from scheduling. In ASPLOS'10, pages 117-128, 2010.
[16]
E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. Distributed Computing, 20(5):323-341, 2008.
[17]
J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21-65, 1991.
[18]
J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In ASPLOS, pages 269-278. ACM, 1991.
[19]
E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: heterogeneous multiprocessing with satellite kernels. In SOSP'09, pages 221-234, 2009.
[20]
J. K. Ousterhout. Scheduling techniques for concurrent systems. In ICDCS '82, pages 22-30, 1982.
[21]
Y. Padioleau, J. Lawall, R. R. Hansen, and G. Muller. Documenting and automating collateral evolutions in Linux device drivers. In EuroSys, pages 247-260, 2008.
[22]
I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. Proceedings of the VLDB Endowment (PVLDB), 3(1):928-939, 2010.
[23]
A. Roy, S. Hand, and T. Harris. Exploring the limits of disjoint access parallelism. In HotPar'09, pages 8-8, 2009.
[24]
M. L. Scott and W. N. Scherer. Scalable queue-based spin locks with timeout. In PPoPP'01, pages 44-52, 2001.
[25]
O. Shalev and N. Shavit. Split-ordered lists: Lock-free extensible hash tables. JACM, 53(3):379-405, May 2006.
[26]
S. Sridharan, B. Keck, R. Murphy, S. Chandra, and P. Kogge. Thread migration to improve synchronization performance. In In Workshop on Operating System Interference in High Performance Applications, 2006.
[27]
M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. In ASPLOS, pages 253-264, 2009.
[28]
S. B. Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: an operating system for many cores. In OSDI'08.
[29]
W. Xiong, S. Park, J. Zhang, Y. Zhou, and Z. Ma. Ad hoc synchronization considered harmful. In OSDI'10, pages 1-8, 2010.
[30]
K. Yotov, K. Pingali, and P. Stodghill. Automatic measurement of memory hierarchy parameters. In SIGMETRICS '05, pages 181-192, 2005.

Cited By

View all
  1. Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Guide Proceedings
      USENIX ATC'12: Proceedings of the 2012 USENIX conference on Annual Technical Conference
      June 2012
      41 pages

      Publisher

      USENIX Association

      United States

      Publication History

      Published: 13 June 2012

      Qualifiers

      • Article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 23 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2021)FTSDProceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems10.1145/3476886.3477518(123-130)Online publication date: 24-Aug-2021
      • (2020)Scalable Coordination of Hierarchical ParallelismProceedings of the 49th International Conference on Parallel Processing10.1145/3404397.3404398(1-11)Online publication date: 17-Aug-2020
      • (2020)Efficient Abortable-locking Protocol for Multi-level NUMA SystemsACM Transactions on Parallel Computing10.1145/33997287:3(1-32)Online publication date: 10-Jul-2020
      • (2020)Avoiding scheduler subversion using scheduler-cooperative locksProceedings of the Fifteenth European Conference on Computer Systems10.1145/3342195.3387521(1-17)Online publication date: 15-Apr-2020
      • (2019)Lock–UnlockACM Transactions on Computer Systems10.1145/330150136:1(1-149)Online publication date: 14-Mar-2019
      • (2019)Accelerating Synchronization Using Moving Compute to Data Model at 1,000-core Multicore ScaleACM Transactions on Architecture and Code Optimization10.1145/330020816:1(1-27)Online publication date: 13-Feb-2019
      • (2019)Fast Fine-Grained Global Synchronization on GPUsProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304055(793-806)Online publication date: 4-Apr-2019
      • (2019)MV-RLUProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304040(779-792)Online publication date: 4-Apr-2019
      • (2019)pLockProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304030(765-778)Online publication date: 4-Apr-2019
      • (2018)A predictable synchronisation algorithmACM SIGPLAN Notices10.1145/3200691.317853353:1(415-416)Online publication date: 10-Feb-2018
      • Show More Cited By

      View Options

      View options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media