Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1345206.1345215acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Published: 20 February 2008 Publication History

Abstract

Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.

References

[1]
S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66--76, 1996.
[2]
S. Amarasinghe. Multicores from the compiler's perspective: A blessing or a curse? In CGO '05: Proceedings of the International Symposium on Code Generation and Optimization, pages 137--137, Washington, DC, USA, March 2005. IEEE Computer Society.
[3]
T. E. Anderson. The performance of spin lock alternatives for shared-money multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6--16, 1990.
[4]
R. Cooksey, S. Jourdan, and D. Grunwald. A stateless, contentdirected data prefetching mechanism. In ASPLOS-X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--290, New York, NY, USA, 2002. ACM Press.
[5]
J. Dai, B. Huang, L. Li, and L. Harrison. Automatically partitioning packet processing applications for pipelined architectures. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 237--248, New York, NY, USA, 2005. ACM.
[6]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the Sixth USENIX Symposium on Operating Systems Design and Implementation, pages 137--150, December 2004.
[7]
K. Gharachorloo and P. B. Gibbons. Detecting violations of sequential consistency. In SPAA '91: Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, pages 316--326, New York, NY, USA, 1991. ACM Press.
[8]
M. Girkar and C. D. Polychronopoulos. Automatic extraction of functional parallelism from ordinary programs. IEEE Transactions on Parallel and Distributed Systems, 3(2):166--178, 1992.
[9]
M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, 1990.
[10]
C. A. R. Hoare. Communicating sequential processes. http://usingcsp.com/cspbook.pdf, 2004. Electronic edition edited by Jim Davies.
[11]
S. Kumar, J. Maschmeyer, and P. Crowley. Exploiting locality to ameliorate packet queue contention and serialization. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 279--290, New York, NY, USA, 2006. ACM Press.
[12]
E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In DISC '04: Proceedings of the 18th International Conference on Distributed Computing, volume 3274, pages 117--131. Springer, 2004.
[13]
L. Lamport. Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems, 5(2):190--222, 1983.
[14]
M. M. K. Martin, D. J. Sorin, H. W. Cain, M. D. Hill, and M. H. Lipasti. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing. In MICRO 34: Proceedings of the 34th annual ACM/IEEE International Symposium on Microarchitecture, pages 328--337, Washington, DC, USA, 2001. IEEE Computer Society.
[15]
H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In SOSP '89: Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 191--201, New York, NY, USA, 1989. ACM Press.
[16]
M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 51(1):1--26, 1998.
[17]
M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using elimination to implement scalable and lock-free fifo queues. In SPAA '05: Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 253--262, New York, NY, USA, 2005. ACM Press.
[18]
J. Mudigonda, H. M. Vin, and R. Yavatkar. Overcoming the memory wall in packet processing: hammers or ladders? In ANCS '05: Proceedings of the 2005 Symposium on Architecture for Networking and Communications Systems, pages 1--10, New York, NY, USA, 2005. ACM Press.
[19]
G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In MICRO '05: 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 105--118, Los Alamitos, CA, USA, 2005. IEEE Computer Society.
[20]
G. Ottoni, R. Rangan, A. Stoler, M. Bridges, and D. August. From sequential programs to concurrent threads. IEEE Computer Architecture Letters, 5:6--9, 2006.
[21]
M. S. Papamarcos and J. H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In ISCA '84: Proceedings of the 11th Annual International Symposium on Computer Architecture, pages 348--354, New York, NY, USA, 1984. ACM Press.
[22]
S. Prakash, Y. H. Lee, and T. Johnson. A nonblocking algorithm for shared queues using compare-and-swap. IEEE Transactions on Computers, 43(5):548--559, 1994.
[23]
R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In PACT '04: Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques, pages 177--188, Washington, DC, USA, 2004. IEEE Computer Society.
[24]
G. Regnier, D. Minturn, G. McAlpine, V. A. Saletore, and A. Foong. ETA: Experience with an Intel Xeon processor as a packet processing engine. IEEE Micro, 24(1):24--31, 2004.
[25]
W. N. Scherer III, D. Lea, and M. L. Scott. Scalable synchronous queues. In PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 147--156, New York, NY, USA, 2006. ACM Press.
[26]
J. E. Smith. Decoupled access/execute computer architectures. ACM Transactions on Computer Systems, 2(4):289--308, 1984.
[27]
W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Computational Complexity, pages 179--196, 2002.
[28]
P. Tsigas and Y. Zhang. A simple, fast and scalable non-blocking concurrent fifo queue for shared memory multiprocessor systems. In SPAA '01: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 134--143, New York, NY, USA, 2001. ACM Press.
[29]
J. D. Valois. Lock-free data structures. PhD

Cited By

View all
  • (2024)PipeSSD: A Lock-free Pipelined SSD Firmware Design for Multi-core ArchitectureProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3657384(1-6)Online publication date: 23-Jun-2024
  • (2024)DISCO: Distributed Inference with Sparse Communications2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)10.1109/WACV57701.2024.00242(2421-2429)Online publication date: 3-Jan-2024
  • (2023)Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal SchedulingProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613136(466-481)Online publication date: 23-Oct-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
February 2008
308 pages
ISBN:9781595937957
DOI:10.1145/1345206
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2008

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. fastforward
  2. linearizability
  3. lock-free
  4. multicore
  5. multiprocessors
  6. nonblocking synchronization
  7. pipeline parallel
  8. queue

Qualifiers

  • Research-article

Conference

PPoPP08
Sponsor:

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)48
  • Downloads (Last 6 weeks)9
Reflects downloads up to 18 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)PipeSSD: A Lock-free Pipelined SSD Firmware Design for Multi-core ArchitectureProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3657384(1-6)Online publication date: 23-Jun-2024
  • (2024)DISCO: Distributed Inference with Sparse Communications2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)10.1109/WACV57701.2024.00242(2421-2429)Online publication date: 3-Jan-2024
  • (2023)Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal SchedulingProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613136(466-481)Online publication date: 23-Oct-2023
  • (2023)DRAMHiT: A Hash Table Architected for the Speed of DRAMProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587457(817-834)Online publication date: 8-May-2023
  • (2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
  • (2022)Plor: General Transactions with Predictable, Low Tail LatencyProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517879(19-33)Online publication date: 10-Jun-2022
  • (2022)BQ: A Lock-Free Queue with BatchingACM Transactions on Parallel Computing10.1145/35127579:1(1-49)Online publication date: 23-Mar-2022
  • (2022)A Fast Wait-Free Multi-Producers Single-Consumer QueueProceedings of the 23rd International Conference on Distributed Computing and Networking10.1145/3491003.3491004(77-86)Online publication date: 4-Jan-2022
  • (2022)DistrEdge: Speeding up Convolutional Neural Network Inference on Distributed Edge Devices2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00110(1097-1107)Online publication date: May-2022
  • (2022)gShareFuture Generation Computer Systems10.1016/j.future.2021.12.016130:C(181-192)Online publication date: 1-May-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media