research-article

FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Authors:

John Giacomoni,

Manish VachharajaniAuthors Info & Claims

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming

Pages 43 - 52

https://doi.org/10.1145/1345206.1345215

Published: 20 February 2008 Publication History

Abstract

Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.

References

[1]

S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66--76, 1996.

Digital Library

[2]

S. Amarasinghe. Multicores from the compiler's perspective: A blessing or a curse? In CGO '05: Proceedings of the International Symposium on Code Generation and Optimization, pages 137--137, Washington, DC, USA, March 2005. IEEE Computer Society.

Digital Library

[3]

T. E. Anderson. The performance of spin lock alternatives for shared-money multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6--16, 1990.

Digital Library

[4]

R. Cooksey, S. Jourdan, and D. Grunwald. A stateless, contentdirected data prefetching mechanism. In ASPLOS-X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--290, New York, NY, USA, 2002. ACM Press.

Digital Library

[5]

J. Dai, B. Huang, L. Li, and L. Harrison. Automatically partitioning packet processing applications for pipelined architectures. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 237--248, New York, NY, USA, 2005. ACM.

Digital Library

[6]

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the Sixth USENIX Symposium on Operating Systems Design and Implementation, pages 137--150, December 2004.

Digital Library

[7]

K. Gharachorloo and P. B. Gibbons. Detecting violations of sequential consistency. In SPAA '91: Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, pages 316--326, New York, NY, USA, 1991. ACM Press.

Digital Library

[8]

M. Girkar and C. D. Polychronopoulos. Automatic extraction of functional parallelism from ordinary programs. IEEE Transactions on Parallel and Distributed Systems, 3(2):166--178, 1992.

Digital Library

[9]

M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, 1990.

Digital Library

[10]

C. A. R. Hoare. Communicating sequential processes. http://usingcsp.com/cspbook.pdf, 2004. Electronic edition edited by Jim Davies.

[11]

S. Kumar, J. Maschmeyer, and P. Crowley. Exploiting locality to ameliorate packet queue contention and serialization. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 279--290, New York, NY, USA, 2006. ACM Press.

Digital Library

[12]

E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In DISC '04: Proceedings of the 18th International Conference on Distributed Computing, volume 3274, pages 117--131. Springer, 2004.

[13]

L. Lamport. Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems, 5(2):190--222, 1983.

Digital Library

[14]

M. M. K. Martin, D. J. Sorin, H. W. Cain, M. D. Hill, and M. H. Lipasti. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing. In MICRO 34: Proceedings of the 34th annual ACM/IEEE International Symposium on Microarchitecture, pages 328--337, Washington, DC, USA, 2001. IEEE Computer Society.

Digital Library

[15]

H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In SOSP '89: Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 191--201, New York, NY, USA, 1989. ACM Press.

Digital Library

[16]

M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 51(1):1--26, 1998.

Digital Library

[17]

M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using elimination to implement scalable and lock-free fifo queues. In SPAA '05: Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 253--262, New York, NY, USA, 2005. ACM Press.

Digital Library

[18]

J. Mudigonda, H. M. Vin, and R. Yavatkar. Overcoming the memory wall in packet processing: hammers or ladders? In ANCS '05: Proceedings of the 2005 Symposium on Architecture for Networking and Communications Systems, pages 1--10, New York, NY, USA, 2005. ACM Press.

Digital Library

[19]

G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In MICRO '05: 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 105--118, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

Digital Library

[20]

G. Ottoni, R. Rangan, A. Stoler, M. Bridges, and D. August. From sequential programs to concurrent threads. IEEE Computer Architecture Letters, 5:6--9, 2006.

Digital Library

[21]

M. S. Papamarcos and J. H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In ISCA '84: Proceedings of the 11th Annual International Symposium on Computer Architecture, pages 348--354, New York, NY, USA, 1984. ACM Press.

Digital Library

[22]

S. Prakash, Y. H. Lee, and T. Johnson. A nonblocking algorithm for shared queues using compare-and-swap. IEEE Transactions on Computers, 43(5):548--559, 1994.

Digital Library

[23]

R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In PACT '04: Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques, pages 177--188, Washington, DC, USA, 2004. IEEE Computer Society.

Digital Library

[24]

G. Regnier, D. Minturn, G. McAlpine, V. A. Saletore, and A. Foong. ETA: Experience with an Intel Xeon processor as a packet processing engine. IEEE Micro, 24(1):24--31, 2004.

Digital Library

[25]

W. N. Scherer III, D. Lea, and M. L. Scott. Scalable synchronous queues. In PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 147--156, New York, NY, USA, 2006. ACM Press.

Digital Library

[26]

J. E. Smith. Decoupled access/execute computer architectures. ACM Transactions on Computer Systems, 2(4):289--308, 1984.

Digital Library

[27]

W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Computational Complexity, pages 179--196, 2002.

Digital Library

[28]

P. Tsigas and Y. Zhang. A simple, fast and scalable non-blocking concurrent fifo queue for shared memory multiprocessor systems. In SPAA '01: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 134--143, New York, NY, USA, 2001. ACM Press.

Digital Library

[29]

J. D. Valois. Lock-free data structures. PhD

Digital Library

Cited By

Du ZLi SHuang ZXue JHuang KWang TShao ZDe V(2024)PipeSSD: A Lock-free Pipelined SSD Firmware Design for Multi-core ArchitectureProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3657384(1-6)Online publication date: 23-Jun-2024
https://dl.acm.org/doi/10.1145/3649329.3657384
Qin MSun CHofmann JVucinic D(2024)DISCO: Distributed Inference with Sparse Communications2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)10.1109/WACV57701.2024.00242(2421-2429)Online publication date: 3-Jan-2024
https://doi.org/10.1109/WACV57701.2024.00242
Iyer RUnal MKogias MCandea GDruschel PKaufmann AMace JFlinn JSeltzer M(2023)Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal SchedulingProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613136(466-481)Online publication date: 23-Oct-2023
https://dl.acm.org/doi/10.1145/3600006.3613136
Show More Cited By

Index Terms

FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

DCAS is not a silver bullet for nonblocking algorithm design
SPAA '04: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures

Despite years of research, the design of efficient nonblocking algorithms remains difficult. A key reason is that current shared-memory multiprocessor architectures support only single-location synchronisation primitives such as compare-and-swap (CAS) ...
FastForward for Efficient Pipeline Parallelism
PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques

High-rate core-to-core communication is critical for efficient pipeline-parallel software architectures. This paper introduces FastForward, a software-only low-overhead high-rate queue algorithm for pipeline parallelism on multicore architectures. ...
Bringing practical lock-free synchronization to 64-bit applications
PODC '04: Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing

Many lock-free data structures in the literature exploit techniques that are possible only because state-of-the-art 64-bit processors are still running 32-bit operating systems and applications. As software catches up to hardware, "64-bit-clean" lock-...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming

February 2008

308 pages

ISBN:9781595937957

DOI:10.1145/1345206

General Chair:
Siddhartha Chatterjee
IBM Research USA
,
Program Chair:
Michael L. Scott
University of Rochester USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2008

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PPoPP08

Sponsor:

PPoPP08: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 20 - 23, 2008

UT, Salt Lake City, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

111
Total Citations
View Citations
1,748
Total Downloads

Downloads (Last 12 months)48
Downloads (Last 6 weeks)9

Reflects downloads up to 18 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Du ZLi SHuang ZXue JHuang KWang TShao ZDe V(2024)PipeSSD: A Lock-free Pipelined SSD Firmware Design for Multi-core ArchitectureProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3657384(1-6)Online publication date: 23-Jun-2024
https://dl.acm.org/doi/10.1145/3649329.3657384
Qin MSun CHofmann JVucinic D(2024)DISCO: Distributed Inference with Sparse Communications2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)10.1109/WACV57701.2024.00242(2421-2429)Online publication date: 3-Jan-2024
https://doi.org/10.1109/WACV57701.2024.00242
Iyer RUnal MKogias MCandea GDruschel PKaufmann AMace JFlinn JSeltzer M(2023)Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal SchedulingProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613136(466-481)Online publication date: 23-Oct-2023
https://dl.acm.org/doi/10.1145/3600006.3613136
Narayanan VDetweiler DHuang TBurtsev AFedorova ANarayanan DDi Luna GQuerzoni L(2023)DRAMHiT: A Hash Table Architected for the Speed of DRAMProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587457(817-834)Online publication date: 8-May-2023
https://dl.acm.org/doi/10.1145/3552326.3587457
Nguyen QSanchez D(2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10071026
Chen YYu XKoutris PArpaci-Dusseau AArpaci-Dusseau RShu JIves ZBonifati AEl Abbadi A(2022)Plor: General Transactions with Predictable, Low Tail LatencyProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517879(19-33)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517879
Milman-Sela GKogan ALev YLuchangco VPetrank E(2022)BQ: A Lock-Free Queue with BatchingACM Transactions on Parallel Computing10.1145/35127579:1(1-49)Online publication date: 23-Mar-2022
https://dl.acm.org/doi/10.1145/3512757
Adas DFriedman R(2022)A Fast Wait-Free Multi-Producers Single-Consumer QueueProceedings of the 23rd International Conference on Distributed Computing and Networking10.1145/3491003.3491004(77-86)Online publication date: 4-Jan-2022
https://dl.acm.org/doi/10.1145/3491003.3491004
Hou XGuan YHan TZhang N(2022)DistrEdge: Speeding up Convolutional Neural Network Inference on Distributed Edge Devices2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00110(1097-1107)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00110
Lee MAhn HHong CNikolopoulos D(2022)gShareFuture Generation Computer Systems10.1016/j.future.2021.12.016130:C(181-192)Online publication date: 1-May-2022
https://dl.acm.org/doi/10.1016/j.future.2021.12.016
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents