research-article

Origami: A Convolutional Network Accelerator

Authors:

Lukas Cavigelli,

David Gschwend,

Christoph Mayer,

Luca Benini

GLSVLSI '15: Proceedings of the 25th edition on Great Lakes Symposium on VLSI

Pages 199 - 204

https://doi.org/10.1145/2742060.2743766

Published: 20 May 2015

Abstract

Today, advanced computer vision (CV) systems of ever-increasing complexity are being deployed in a growing number of application scenarios with strong real-time and power constraints. Current trends in CV clearly show a rise of neural network-based algorithms, which have recently broken many object detection and localization records. These approaches are very flexible and can be used to tackle many different challenges by changing only their parameters. In this paper, we present the first convolutional network accelerator that scales to network sizes currently handled only by workstation GPUs, yet remains within the power envelope of embedded systems. The architecture has been implemented on 3.09 mm² of core area in UMC 65 nm technology and is capable of a throughput of 274 GOp/s at 369 GOp/s/W with an external memory bandwidth of just 525 MB/s full-duplex, a decrease of more than 90% from previous work.
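The headline figures in the abstract can be combined into a few derived quantities. The sketch below is only back-of-the-envelope arithmetic on the quoted numbers, not a result reported by the authors. It assumes that the throughput and energy-efficiency figures refer to the same operating point, and that the 525 MB/s bandwidth is available in each direction. The function names are hypothetical and chosen for illustration.

```python
# Figures quoted in the abstract.
CORE_AREA_MM2 = 3.09            # core area in UMC 65 nm
THROUGHPUT_GOPS = 274.0         # throughput in GOp/s
EFFICIENCY_GOPS_PER_W = 369.0   # energy efficiency in GOp/s/W
EXT_BANDWIDTH_MB_S = 525.0      # external memory bandwidth in MB/s

def implied_power_w(throughput_gops, efficiency_gops_per_w):
    """Power implied by throughput / efficiency.

    Assumption: both figures were measured at the same operating point.
    """
    return throughput_gops / efficiency_gops_per_w

def area_efficiency_gops_per_mm2(throughput_gops, area_mm2):
    """Throughput per unit of core area."""
    return throughput_gops / area_mm2

def ops_per_byte(throughput_gops, bandwidth_mb_s):
    """Operations performed per byte of external traffic in one direction.

    Assumption: the quoted bandwidth applies to each direction separately.
    """
    return (throughput_gops * 1e9) / (bandwidth_mb_s * 1e6)

if __name__ == "__main__":
    power = implied_power_w(THROUGHPUT_GOPS, EFFICIENCY_GOPS_PER_W)
    density = area_efficiency_gops_per_mm2(THROUGHPUT_GOPS, CORE_AREA_MM2)
    intensity = ops_per_byte(THROUGHPUT_GOPS, EXT_BANDWIDTH_MB_S)
    print(f"implied power:   {power:.3f} W")
    print(f"area efficiency: {density:.1f} GOp/s/mm^2")
    print(f"ops per byte:    {intensity:.0f} Op/B")
```

Under these assumptions the figures work out to roughly 0.74 W of power, about 88.7 GOp/s per mm² of core area, and on the order of 520 operations per byte of external memory traffic in each direction. That high ratio of compute to off-chip traffic is what the low bandwidth figure reflects.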

Index Terms

Origami: A Convolutional Network Accelerator
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Hardware
  1. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific processors

Published In

GLSVLSI '15: Proceedings of the 25th edition on Great Lakes Symposium on VLSI

May 2015

418 pages

ISBN: 9781450334747

DOI: 10.1145/2742060

General Chairs: Alex K. Jones (University of Pittsburgh, USA), Hai (Helen) Li (University of Pittsburgh, USA)

Program Chairs: Ayse K. Coskun (Boston University, USA), Martin Margala (University of Massachusetts, Lowell, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

In-Cooperation

IEEE CEDA
IEEE CASS

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

armasuisse Science & Technology
European Research Council

Conference

GLSVLSI '15

Sponsor:

SIGDA

GLSVLSI '15: Great Lakes Symposium on VLSI 2015

May 20 - 22, 2015

Pittsburgh, Pennsylvania, USA

Acceptance Rates

GLSVLSI '15 Paper Acceptance Rate: 41 of 148 submissions, 28%

Overall Acceptance Rate: 312 of 1,156 submissions, 27%
