Handling large data sets for high-performance embedded applications in heterogeneous systems-on-chip

Authors: Paolo Mantovani, Emilio G. Cota, Christian Pilato, Giuseppe Di Guglielmo, Luca P. Carloni

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Article No.: 3, Pages 1 - 10

https://doi.org/10.1145/2968455.2968509

Published: 01 October 2016

Abstract

Local memory is a key factor for the performance of accelerators in SoCs. Despite technology scaling, the gap between on-chip storage and memory footprint of embedded applications keeps widening. We present a solution to preserve the speedup of accelerators when scaling from small to large data sets. Combining specialized DMA and address translation with a software layer in Linux, our design is transparent to user applications and broadly applicable to any class of SoCs hosting high-throughput accelerators. We demonstrate the robustness of our design across many heterogeneous workload scenarios and memory allocation policies with FPGA-based SoC prototypes featuring twelve concurrent accelerators accessing up to 768MB out of 1GB-addressable DRAM.

Published In

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems

October 2016, 187 pages

ISBN: 9781450344821

DOI: 10.1145/2968455

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ESWEEK'16: Twelfth Embedded System Week

October 1 - 7, 2016

Pittsburgh, Pennsylvania

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Bibliometrics

Total Citations: 19

Total Downloads: 500

Downloads (last 12 months): 134

Downloads (last 6 weeks): 16

Reflects downloads up to 21 Nov 2024.

Cited By

Xu Y, Li A, Sorensen T (2023). Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 201-213. Online publication date: Apr-2023. https://doi.org/10.1109/ISPASS57527.2023.00028

Y aguang y, Zhao Y (2022). A review of auxiliary hardware architectures supporting dynamic taint analysis. International Conference on Cloud Computing, Internet of Things, and Computer Applications (CICA 2022), 109. Online publication date: 28-Jul-2022. https://doi.org/10.1117/12.2642713

Kamaleldin A, Gohringer D (2022). A Hybrid Memory/Accelerator Tile Architecture for FPGA-based RISC-V Manycore Systems. 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), 300-306. Online publication date: Aug-2022. https://doi.org/10.1109/FPL57034.2022.00053

Pilato C, Soldavini S (2022). Accelerator Design with High-Level Synthesis. Handbook of Computer Architecture, 1-33. Online publication date: 27-Jan-2022. https://doi.org/10.1007/978-981-15-6401-7_19-1

Zuckerman J, Giri D, Kwon J, Mantovani P, Carloni L (2021). Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 350-365. Online publication date: 18-Oct-2021. https://dl.acm.org/doi/10.1145/3466752.3480065

Tibaldi M, Pilato C, Ferrandi F (2021). Automatic Generation of Heterogeneous SoC Architectures With Secure Communications. IEEE Embedded Systems Letters 13:2, 61-64. Online publication date: Jun-2021. https://doi.org/10.1109/LES.2020.3003974

Piccolboni L, Di Guglielmo G, Sethumadhavan S, Carloni L (2021). HARDROID: Transparent Integration of Crypto Accelerators in Android. 2021 IEEE High Performance Extreme Computing Conference (HPEC), 1-8. Online publication date: 20-Sep-2021. https://doi.org/10.1109/HPEC49654.2021.9622875

Giri D, Chiu K, Di Guglielmo G, Mantovani P, Carloni L (2020). ESP4ML: Platform-Based Design of Systems-on-Chip for Embedded Machine Learning. 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1049-1054. Online publication date: Mar-2020. https://doi.org/10.23919/DATE48585.2020.9116317

Balkind J, Lim K, Schaffner M, Gao F, Chirkov G, Li A, Lavrov A, Nguyen T, Fu Y, Zaruba F, Gulati K, Benini L, Wentzlaff D, Larus J, Ceze L, Strauss K (2020). BYOC. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 699-714. Online publication date: 9-Mar-2020. https://dl.acm.org/doi/10.1145/3373376.3378479

Carloni L (2020). Scalable Open-Source System-on-Chip Design: (Invited Talk - Extended Abstract). 2020 IFIP/IEEE 28th International Conference on Very Large Scale Integration (VLSI-SOC), 7-9. Online publication date: 5-Oct-2020. https://doi.org/10.1109/VLSI-SOC46417.2020.9344077
