A General-Purpose Many-Accelerator Architecture Based on Dataflow Graph Clustering of Applications

Peng Chen¹,², Lei Zhang¹, Yin-He Han¹ & Yun-Ji Chen¹

Abstract

Growing transistor counts combined with a limited power budget per silicon die lead to the utilization wall problem (a.k.a. “Dark Silicon”): only a small fraction of a chip can run at full speed at any given time. Designing accelerators for specific applications or algorithms is considered one of the most promising approaches to improving energy efficiency. However, most current accelerator design methods are dedicated to particular applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. First, we propose to cluster the dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications and to use the resulting DFG clusters for accelerator design, because a DFG is the largest program unit that is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, extract over 300 hotspot DFGs using the LLVM compiler tools, and divide them into 15 clusters based on graph similarity. Second, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how applications can use them. Our results show that the proposed DFG clustering and FISC design speed up the SPEC benchmarks by 6.2x on average.
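
To make the idea of grouping DFGs by graph similarity concrete, the following is a minimal sketch and not the authors' method. The abstract does not specify the similarity measure or the clustering algorithm, so this example assumes a weighted Jaccard similarity over (producer opcode, consumer opcode) edge pairs and a simple greedy threshold clustering. The DFGs, the names `edge_profile`, `similarity`, and `cluster`, and the threshold value are all hypothetical.

```python
from collections import Counter

def edge_profile(dfg):
    """Count (producer opcode, consumer opcode) pairs over the edges of one DFG."""
    ops, edges = dfg
    return Counter((ops[src], ops[dst]) for src, dst in edges)

def similarity(a, b):
    """Weighted Jaccard similarity of two edge profiles, in [0, 1] (an assumed metric)."""
    pa, pb = edge_profile(a), edge_profile(b)
    union = sum((pa | pb).values())
    return sum((pa & pb).values()) / union if union else 1.0

def cluster(dfgs, threshold=0.5):
    """Greedy clustering: a DFG joins the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list of DFG names; the first member is the representative
    for name, dfg in dfgs.items():
        for members in clusters:
            if similarity(dfgs[members[0]], dfg) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])  # no similar cluster found, so start a new one
    return clusters

# Hypothetical toy DFGs: (opcode of each node, list of data-dependence edges).
dfgs = {
    "bb_a": (["load", "load", "mul", "add", "store"], [(0, 2), (1, 2), (2, 3), (3, 4)]),
    "bb_b": (["load", "load", "mul", "add"], [(0, 2), (1, 2), (2, 3)]),
    "bb_c": (["load", "icmp", "select"], [(0, 1), (1, 2)]),
}
print(cluster(dfgs))
```

In this toy input, the two multiply-accumulate blocks share most of their edge pairs and land in one cluster, while the compare-and-select block forms its own. Under this sketch's assumptions, each resulting cluster would correspond to one candidate accelerator.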

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Peng Chen, Lei Zhang, Yin-He Han & Yun-Ji Chen
University of Chinese Academy of Sciences, Beijing, 100049, China
Peng Chen

Corresponding author

Correspondence to Peng Chen.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 601173006 and 61221062, and by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010403.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1 (DOC 28 kb)

About this article

Cite this article

Chen, P., Zhang, L., Han, YH. et al. A General-Purpose Many-Accelerator Architecture Based on Dataflow Graph Clustering of Applications. J. Comput. Sci. Technol. 29, 239–246 (2014). https://doi.org/10.1007/s11390-014-1426-9

Received: 30 October 2013
Revised: 03 January 2014
Published: 23 March 2014
Issue Date: March 2014
DOI: https://doi.org/10.1007/s11390-014-1426-9
