
The CURE: Cluster Communication Using Registers

Published: 27 September 2017

Abstract

VLIW processors typically deliver high performance on a limited budget, making them ideal for a variety of communication and signal processing solutions. These processors typically need large multi-ported register files, which can increase cycle time and power consumption. The access delay and energy of these register files can also become prohibitive as the register count or the number of access ports grows, thus limiting the overall performance of the processor. Most prior work circumvents this problem by using multiple clusters with private register files to lower the access delay and reduce energy consumption. However, clustering artifacts, such as additional inter-cluster communication operations and spill-recovery code, result in a performance penalty.
This paper proposes CURE, a novel technique to considerably reduce the negative effects of clustering. CURE augments the ISA to expose the communication registers to the compiler, increasing the availability of architectural register state to all functional units. The inter-cluster communication operations are integrated into regular ALU and memory operations to improve instruction encoding efficiency. We also propose a new code scheduling heuristic to handle the ISA changes and to realize the improvements in the processor's performance and energy consumption. Our quantitative analysis estimates that CURE, when compared to the baseline 8-issue uni-cluster processor, boosts average performance by 61% while reducing the average register dynamic energy by 77%.



    Published In

    ACM Transactions on Embedded Computing Systems  Volume 16, Issue 5s
    Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
    October 2017
    1448 pages
    ISSN: 1539-9087
    EISSN: 1558-3465
    DOI: 10.1145/3145508

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 September 2017
    Accepted: 01 June 2017
    Revised: 01 June 2017
    Received: 01 March 2017
    Published in TECS Volume 16, Issue 5s

    Author Tags

    1. Inter-cluster communication
    2. register file architecture

    Qualifiers

    • Research-article
    • Research
    • Refereed
