The CURE: Cluster Communication Using Registers

Authors:

Vignyan Reddy Kothinti Naresh,

Mikko H. Lipasti

ACM Transactions on Embedded Computing Systems (TECS), Volume 16, Issue 5s

Article No.: 124, Pages 1 - 19

https://doi.org/10.1145/3126527

Published: 27 September 2017

Abstract

VLIW processors typically deliver high performance on a limited budget, making them ideal for a variety of communication and signal processing solutions. These processors typically need large multi-ported register files, which can increase cycle time and power consumption. The access delay and energy of these register files can also become prohibitive as the register count or the number of access ports grows, limiting the overall performance of the processor. Most prior work circumvents this problem by using multiple clusters with private register files to lower access delay and reduce energy consumption. However, clustering artifacts, such as additional inter-cluster communication operations and spill-recovery code, result in a performance penalty.
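As a rough illustration of the inter-cluster communication overhead mentioned above, the following toy sketch counts the explicit copy operations that a conventional clustered machine would need for a given cluster assignment. It is not taken from the paper: the function name, the edge-list representation of the dataflow graph, and the rule of one copy per (producer, destination cluster) pair are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple


def count_intercluster_copies(
    edges: Iterable[Tuple[Hashable, Hashable]],
    cluster_of: Dict[Hashable, int],
) -> int:
    """Count copies needed when a value crosses a cluster boundary.

    Assumption (hypothetical model): a value produced in one cluster must be
    copied once into each other cluster that consumes it.
    """
    needed = set()
    for producer, consumer in edges:
        src = cluster_of[producer]
        dst = cluster_of[consumer]
        if src != dst:
            # One copy serves every consumer of this value in cluster dst.
            needed.add((producer, dst))
    return len(needed)


# Hypothetical dataflow graph: each pair is (producer op, consumer op).
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("a", "e")]

# Hypothetical assignment of operations to two clusters.
cluster_of = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 1}

copies = count_intercluster_copies(edges, cluster_of)
print(copies)
```

In this model a single-cluster assignment needs no copies at all, which is the trade-off the abstract describes: clustering shrinks each register file but adds communication work.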

This paper proposes CURE, a novel technique that considerably reduces the negative effects of clustering. CURE augments the ISA to expose the communication registers to the compiler, increasing the availability of architectural register state to all functional units. The inter-cluster communication operations are integrated into regular ALU and memory operations to improve instruction encoding efficiency. We also propose a new code scheduling heuristic to handle the ISA changes and to realize the improvements in the processor's performance and energy consumption. Our quantitative analysis estimates that CURE, when compared to the baseline 8-issue uni-cluster processor, boosts average performance by 61% while reducing the average register dynamic energy by 77%.

Index Terms

The CURE: Cluster Communication Using Registers
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Very long instruction word

Published In

ACM Transactions on Embedded Computing Systems Volume 16, Issue 5s

Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017

October 2017

1448 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3145508

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2017 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 27 September 2017

Accepted: 01 June 2017

Revised: 01 June 2017

Received: 01 March 2017

Published in TECS Volume 16, Issue 5s

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
