
A Family of Relaxed Concurrent Queues for Low-Latency Operations and Item Transfers

Published: 16 December 2022

Abstract

Producer-consumer communication over shared memory is a critical function of current scalable systems. Queues that provide low latency and high throughput on highly utilized systems can improve the overall performance perceived by end users. To address this demand, we prioritize achieving both high operation performance and high item-transfer speed. The Relaxed Concurrent Queues (RCQs) are a family of queues that we designed and implemented for that purpose. Our key idea is a relaxed ordering model that splits the enqueue and dequeue operations into a stage of sequential assignment to a queue slot and a stage of concurrent execution across the slots. At each slot, we impose no order restrictions among operations of the same type. We define several variants of the RCQ algorithms with respect to offered concurrency, required hardware instructions, supported operations, occupied memory space, and precondition handling. For specific RCQ algorithms, we provide pseudocode definitions and reason about their correctness and progress properties. Additionally, we theoretically estimate and experimentally validate the worst-case distance between an RCQ algorithm and a strict first-in-first-out (FIFO) queue. We developed prototype implementations of the RCQ algorithms and experimentally compared them with several representative strict FIFO and relaxed data structures over a range of workload and system settings. The RCQS algorithm is a provably linearizable lock-free member of the RCQ family. We show experimentally that RCQS achieves a factors-to-orders-of-magnitude advantage over state-of-the-art strict and relaxed queue algorithms across several latency and throughput statistics of queue operations and item transfers.
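The two-stage idea described in the abstract (sequential assignment of each operation to a slot, then concurrent execution across slots) can be illustrated with a minimal ticket-based slot queue. This is only an illustrative sketch of the general scheme, not the paper's RCQS algorithm: it is blocking rather than lock-free, its capacity is bounded by the slot count, and all names (`SlotQueue`, `SLOTS`, `turn`) are hypothetical. A fetch-and-add on a ticket counter performs the sequential assignment; the per-slot handoff then proceeds independently on each slot.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Illustrative sketch only: a bounded, blocking slot queue in the spirit of
// the two-stage scheme (sequential slot assignment, concurrent per-slot
// execution). Names and structure are hypothetical, not from the paper.
constexpr std::size_t SLOTS = 8;  // capacity bound of this sketch

struct Slot {
    // Per-slot epoch: value 2*round means "empty for round", 2*round+1 "full".
    std::atomic<uint64_t> turn{0};
    int item{0};
};

class SlotQueue {
    Slot slots_[SLOTS];
    std::atomic<uint64_t> enq_ticket_{0};  // sequential assignment for enqueues
    std::atomic<uint64_t> deq_ticket_{0};  // sequential assignment for dequeues

public:
    void enqueue(int v) {
        // Stage 1: fetch-and-add assigns this operation a unique ticket,
        // which maps to a slot and a round.
        uint64_t t = enq_ticket_.fetch_add(1, std::memory_order_relaxed);
        Slot& s = slots_[t % SLOTS];
        uint64_t round = t / SLOTS;
        // Stage 2: operate on the assigned slot, independently of other slots.
        // Wait until the slot is empty for our round (blocking in this sketch).
        while (s.turn.load(std::memory_order_acquire) != 2 * round)
            std::this_thread::yield();
        s.item = v;
        s.turn.store(2 * round + 1, std::memory_order_release);  // mark full
    }

    int dequeue() {
        uint64_t t = deq_ticket_.fetch_add(1, std::memory_order_relaxed);
        Slot& s = slots_[t % SLOTS];
        uint64_t round = t / SLOTS;
        // Wait until the slot holds the item for our round.
        while (s.turn.load(std::memory_order_acquire) != 2 * round + 1)
            std::this_thread::yield();
        int v = s.item;
        s.turn.store(2 * round + 2, std::memory_order_release);  // empty, next round
        return v;
    }
};
```

In this sketch, operations on distinct slots never contend after the initial fetch-and-add, which is the intuition behind splitting assignment from execution; the lock-free precondition handling and linearizability argument of the actual RCQ algorithms are beyond its scope.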


Cited By

  • (2024) Diciclo: Flexible User-level Services for Efficient Multitenant Isolation. ACM Transactions on Computer Systems 42, 1-2, 1-47. https://doi.org/10.1145/3639404. Online publication date: 13-Feb-2024.
  • (2024) How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures. Euro-Par 2024: Parallel Processing, 119-133. https://doi.org/10.1007/978-3-031-69583-4_9. Online publication date: 26-Aug-2024.


Published In

ACM Transactions on Parallel Computing, Volume 9, Issue 4
December 2022, 102 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3572851

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2022
Online AM: 04 October 2022
Accepted: 29 September 2022
Revised: 22 September 2022
Received: 10 May 2021
Published in TOPC Volume 9, Issue 4


Author Tags

  1. Producer-consumer
  2. data structures
  3. lock-free
  4. blocking
  5. linearizable
  6. array-based queues
  7. relaxed ordering

Qualifiers

  • Research-article
  • Refereed

