CAF: Core to Core Communication Acceleration Framework

Authors:

Andrew Herdrich,

Yan Solihin

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 351 - 362

https://doi.org/10.1145/2967938.2967954

Published: 11 September 2016

Abstract

As the number of cores in a multicore system increases, core-to-core (C2C) communication is increasingly limiting the performance scaling of workloads that share data frequently. The traditional way cores communicate is by using shared memory space between them. However, shared memory communication fundamentally involves coherence invalidations and cache misses, which cause large performance overheads and incur a high amount of network traffic. Many important workloads incur significant C2C communication and are affected significantly by the costs, including pipelined packet processing which is widely used in software-based networking solutions. In these workloads, threads run on different cores and pass packets from one core to another for different stages of processing using software queues.

In this paper, we analyze the behavior and overheads of software queue management. Based on this analysis, we propose a novel C2C Communication Acceleration Framework (CAF) to optimize C2C communication. CAF offloads substantial communication burdens from cores and memory to a designated, efficient hardware device, which we refer to as the Queue Management Device (QMD), attached to the Network on Chip. CAF combines hardware and software optimizations to effectively reduce queue-induced communication overheads and improve overall system performance by up to 2-12x over traditional software queue implementations.
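
To make the software-queue pattern described in the abstract concrete, the following is a minimal sketch of a two-stage pipeline connected by a single-producer, single-consumer ring buffer. It is not the paper's implementation, and it does not model CAF or the QMD. The names `SpscRing` and `run_pipeline` and the two stages are hypothetical and chosen only for illustration. The sketch relies on CPython's global interpreter lock for memory ordering; a native version would presumably need atomic operations, and the comments mark where the shared indices would typically generate coherence traffic.

```python
import threading
import time
from typing import Any, List, Optional


class SpscRing:
    """Illustrative single-producer/single-consumer ring buffer.

    A generic software queue sketch, not the paper's implementation.
    None signals "queue empty", so None itself cannot be enqueued.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # One extra slot stays unused so that full and empty differ.
        self._slots: List[Any] = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def enqueue(self, item: Any) -> bool:
        nxt = (self._tail + 1) % len(self._slots)
        # The producer reads the consumer's index here. In a native
        # shared-memory queue, this read can miss in the producer's
        # cache whenever the consumer has just advanced head.
        if nxt == self._head:
            return False  # queue is full
        self._slots[self._tail] = item
        # Publishing the new tail is what the consumer later observes;
        # natively, this write invalidates the consumer's cached copy.
        self._tail = nxt
        return True

    def dequeue(self) -> Optional[Any]:
        if self._head == self._tail:
            return None  # queue is empty
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return item


def run_pipeline(packets: List[bytes], capacity: int = 4) -> List[str]:
    """Run two hypothetical processing stages joined by one queue."""
    queue = SpscRing(capacity)
    done = object()  # sentinel that tells stage two to stop
    results: List[str] = []

    def stage_one() -> None:
        # Stage one tags each packet with its length and passes it on.
        for pkt in packets:
            while not queue.enqueue((len(pkt), pkt)):
                time.sleep(0)  # queue full: yield and retry
        while not queue.enqueue(done):
            time.sleep(0)

    def stage_two() -> None:
        # Stage two formats whatever stage one handed over.
        while True:
            item = queue.dequeue()
            if item is None:
                time.sleep(0)  # queue empty: yield and retry
                continue
            if item is done:
                return
            length, pkt = item
            results.append(f"{length}:{pkt.hex()}")

    threads = [threading.Thread(target=stage_one),
               threading.Thread(target=stage_two)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([b"ab", b"cde"])` returns `["2:6162", "3:636465"]`. Every handoff touches state that both threads read or write (the two indices and the slot itself), which is the kind of sharing the abstract identifies as the source of invalidations and cache misses.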

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN:9781450341219

DOI:10.1145/2967938

General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)

Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '16

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;

Overall Acceptance Rate 121 of 471 submissions, 26%

Cited By

Twardzik T, Nolte L, Jalier C, Shi J, Wild T, Herkersdorf A (2024). HASIIL: Hardware-Assisted Scheduling to Improve IPC Latency in Linux. Proceedings of the 21st ACM International Conference on Computing Frontiers, pp. 80-87. DOI: 10.1145/3649153.3649197. Online publication date: 7-May-2024.
https://dl.acm.org/doi/10.1145/3649153.3649197
Wu Q, Li R, Beard J, John L, Rodríguez G, Sadayappan P, Sukumaran-Rajam A (2024). BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, pp. 100-112. DOI: 10.1145/3640537.3641568. Online publication date: 17-Feb-2024.
https://dl.acm.org/doi/10.1145/3640537.3641568
Nolte L, Twardzik T, Jalier C, Huang Z, Shi J, Wild T, Herkersdorf A (2024). HW-FUTEX: Hardware-Assisted Futex Syscall. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 32:1, pp. 16-29. DOI: 10.1109/TVLSI.2023.3317926. Online publication date: Jan-2024.
https://doi.org/10.1109/TVLSI.2023.3317926
Yuan Y, Wang R, Ranganathan N, Rao N, Kumar S, Lantz P, Sanjeepan V, Cabrera J, Kwatra A, Sankaran R, Jeong I, Kim N (2024). Intel Accelerators Ecosystem: An SoC-Oriented Perspective: Industry Product. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 848-862. DOI: 10.1109/ISCA59077.2024.00066. Online publication date: 29-Jun-2024.
https://doi.org/10.1109/ISCA59077.2024.00066
Nolte L, Twardzik T, Jalier C, Huang Z, Shi J, Kowalsky C, Wild T, Herkersdorf A (2023). HAWEN: Hardware Accelerator for Thread Wake-Ups in Linux Event Notification. 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. DOI: 10.1109/DAC56929.2023.10247823. Online publication date: 9-Jul-2023.
https://doi.org/10.1109/DAC56929.2023.10247823
Wu Q, Ekanayake A, Li R, Beard J, John L (2022). SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing, pp. 1-12. DOI: 10.1145/3545008.3545044. Online publication date: 29-Aug-2022.
https://dl.acm.org/doi/10.1145/3545008.3545044
Alian M, Agarwal S, Shin J, Patel N, Yuan Y, Kim D, Wang R, Kim N (2022). IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 480-493. DOI: 10.1109/MICRO56248.2022.00042. Online publication date: Oct-2022.
https://doi.org/10.1109/MICRO56248.2022.00042
Wu Q, Beard J, Ekanayake A, Gerstlauer A, John L (2021). Virtual-Link: A Scalable Multi-Producer Multi-Consumer Message Queue Architecture for Cross-Core Communication. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 182-191. DOI: 10.1109/IPDPS49936.2021.00027. Online publication date: May-2021.
https://doi.org/10.1109/IPDPS49936.2021.00027
Rheindt S, Srivatsa A, Lenke O, Nolte L, Wild T, Herkersdorf A (2021). Tackling the MPSoC Data Locality Challenge. Multi-Processor System-on-Chip 1, pp. 85-117. DOI: 10.1002/9781119818298.ch5. Online publication date: 26-Mar-2021.
https://doi.org/10.1002/9781119818298.ch5
Nematollahi N, Sadrosadati M, Falahati H, Barkhordar M, Drumond M, Sarbazi-Azad H, Falsafi B (2020). Efficient Nearest-Neighbor Data Sharing in GPUs. ACM Transactions on Architecture and Code Optimization 18:1, pp. 1-26. DOI: 10.1145/3429981. Online publication date: 30-Dec-2020.
https://dl.acm.org/doi/10.1145/3429981