DOI: 10.1145/2967938.2967954
research-article

CAF: Core to Core Communication Acceleration Framework

Published: 11 September 2016

Abstract

As the number of cores in a multicore system grows, core-to-core (C2C) communication increasingly limits the performance scaling of workloads that share data frequently. Cores traditionally communicate through a shared memory space. However, shared-memory communication fundamentally involves coherence invalidations and cache misses, which cause large performance overheads and generate heavy network traffic. Many important workloads incur significant C2C communication and suffer from these costs, including pipelined packet processing, which is widely used in software-based networking solutions. In these workloads, threads run on different cores and use software queues to pass packets from one core to another for successive stages of processing.
In this paper, we analyze the behavior and overheads of software queue management. Based on this analysis, we propose a novel C2C Communication Acceleration Framework (CAF) to optimize C2C communication. CAF offloads substantial communication burdens from cores and memory to a designated, efficient hardware device attached to the Network on Chip, which we refer to as the Queue Management Device (QMD). CAF combines hardware and software optimizations to reduce queue-induced communication overheads and improves overall system performance by 2-12x over traditional software queue implementations.
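The software queues described in the abstract can be pictured with a minimal sketch. It is not taken from the paper: the names `SpscRing` and `run_pipeline` are hypothetical, the ring is a generic single-producer/single-consumer buffer, and a single-threaded round-robin loop stands in for stages running on separate cores. The comments mark the shared indices that, in a native shared-memory implementation, would move between the producer's and consumer's caches on each hand-off.

```python
class SpscRing:
    """Single-producer/single-consumer ring buffer (hypothetical sketch)."""

    def __init__(self, capacity):
        # One slot stays empty so that "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_enqueue(self, item):
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:  # producer reads the consumer's index
            return False       # queue is full
        self._slots[self._tail] = item
        self._tail = nxt       # publish the slot to the consumer
        return True

    def try_dequeue(self):
        if self._head == self._tail:  # consumer reads the producer's index
            return False, None        # queue is empty
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return True, item


def run_pipeline(packets, stages, capacity=2):
    """Pass each packet through every stage, with one ring between neighbours."""
    rings = [SpscRing(capacity) for _ in range(len(stages) + 1)]
    held = [None] * len(stages)       # processed packet waiting for downstream space
    has_held = [False] * len(stages)
    pending = list(packets)
    next_in = 0
    results = []
    while len(results) < len(pending):
        # Source: feed the next packet into the first ring if there is room.
        if next_in < len(pending) and rings[0].try_enqueue(pending[next_in]):
            next_in += 1
        # Each stage plays the role of a thread pinned to its own core.
        for i, stage in enumerate(stages):
            if not has_held[i]:
                ok, pkt = rings[i].try_dequeue()
                if not ok:
                    continue
                held[i], has_held[i] = stage(pkt), True
            if rings[i + 1].try_enqueue(held[i]):
                has_held[i] = False
        # Sink: drain one packet from the last ring.
        ok, pkt = rings[-1].try_dequeue()
        if ok:
            results.append(pkt)
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], [lambda p: p + 1, lambda p: p * 2]))
```

Every enqueue and dequeue here reads an index owned by the other side and writes one of its own. On real hardware those accesses are where the coherence invalidations and cache misses named in the abstract would arise, which is the kind of queue bookkeeping the abstract says CAF offloads to the QMD.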



Information & Contributors

Information

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN: 9781450341219
DOI: 10.1145/2967938
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. hardware accelerator
  2. hardware queue
  3. multicore communication

Qualifiers

  • Research-article

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%
Overall Acceptance Rate: 121 of 471 submissions, 26%


Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 28
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 02 Oct 2024

Citations

Cited By

  • (2024) HASIIL: Hardware-Assisted Scheduling to Improve IPC Latency in Linux. Proceedings of the 21st ACM International Conference on Computing Frontiers, pp. 80-87. DOI: 10.1145/3649153.3649197. Online publication date: 7-May-2024
  • (2024) BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, pp. 100-112. DOI: 10.1145/3640537.3641568. Online publication date: 17-Feb-2024
  • (2024) HW-FUTEX: Hardware-Assisted Futex Syscall. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 32:1, pp. 16-29. DOI: 10.1109/TVLSI.2023.3317926. Online publication date: Jan-2024
  • (2024) Intel Accelerators Ecosystem: An SoC-Oriented Perspective : Industry Product. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 848-862. DOI: 10.1109/ISCA59077.2024.00066. Online publication date: 29-Jun-2024
  • (2023) HAWEN: Hardware Accelerator for Thread Wake-Ups in Linux Event Notification. 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. DOI: 10.1109/DAC56929.2023.10247823. Online publication date: 9-Jul-2023
  • (2022) SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing, pp. 1-12. DOI: 10.1145/3545008.3545044. Online publication date: 29-Aug-2022
  • (2022) IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 480-493. DOI: 10.1109/MICRO56248.2022.00042. Online publication date: Oct-2022
  • (2021) Virtual-Link: A Scalable Multi-Producer Multi-Consumer Message Queue Architecture for Cross-Core Communication. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 182-191. DOI: 10.1109/IPDPS49936.2021.00027. Online publication date: May-2021
  • (2021) Tackling the MPSoC Data Locality Challenge. Multi-Processor System-on-Chip 1, pp. 85-117. DOI: 10.1002/9781119818298.ch5. Online publication date: 26-Mar-2021
  • (2020) Efficient Nearest-Neighbor Data Sharing in GPUs. ACM Transactions on Architecture and Code Optimization, 18:1, pp. 1-26. DOI: 10.1145/3429981. Online publication date: 30-Dec-2020
