Article

Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Authors:

Jesús Sánchez,

Antonio González

MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture

Page 315

Published: 03 December 2003

Abstract

Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache will increase, leading to an important performance impact. In this paper we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible for deciding which memory instructions make use of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the MediaBench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.
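
The abstract says the compiler decides which memory instructions use the L0 buffers, but it does not say how. The Python sketch below shows one way such a decision could look, purely as an illustration. The `MemOp` type, its `slack` field, the `select_l0_ops` function and the slack-based criterion are all assumptions introduced here; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    """A memory instruction as seen by a toy scheduler (all fields hypothetical)."""
    name: str
    slack: int  # cycles of latency the schedule can absorb without lengthening


def select_l0_ops(ops, l1_latency, max_marked):
    """Return the names of memory ops to mark as L0 buffer users.

    Assumed heuristic (not the paper's): an op is a candidate when its slack
    is smaller than the L1 latency, i.e. an L1 access would stretch the
    schedule. Candidates are taken most-critical first, up to max_marked.
    """
    candidates = [op for op in ops if op.slack < l1_latency]
    candidates.sort(key=lambda op: op.slack)  # stable sort: ties keep input order
    return [op.name for op in candidates[:max_marked]]
```

In this toy model, `max_marked` stands in for the small number of entries in each buffer: only the loads that are most sensitive to L1 latency are steered to the L0 buffer, and the rest keep using the unified L1 cache.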

Index Terms

Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors
1. Hardware
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information & Contributors

Information

Published In

MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture

December 2003

412 pages

ISBN:076952043X

Copyright © 2003 Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Conference

MICRO-36

Sponsor:

SIGMICRO

MICRO-36: The 36th Annual International Symposium on Microarchitecture

December 3 - 5, 2003

Acceptance Rates

MICRO 36 Paper Acceptance Rate 35 of 134 submissions, 26%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics

Total Citations: 5
Total Downloads: 275
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 21 Nov 2024

Citations

Cited By

Sharifian A, Kumar S, Guha A, Shriraman A, Hsu W, Yang C, Lipasti M, Lee H (2016). CHAINSAW. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-14. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195698
Terechko A, Corporaal H (2007). Inter-cluster communication in VLIW architectures. ACM Transactions on Architecture and Code Optimization, 4:2 (11-es). Online publication date: 1-Jun-2007.
https://dl.acm.org/doi/10.1145/1250727.1250731
Chu M, Mahlke S (2006). Compiler-directed Data Partitioning for Multicluster Processors. Proceedings of the International Symposium on Code Generation and Optimization, pp. 208-220. Online publication date: 26-Mar-2006.
https://dl.acm.org/doi/10.1109/CGO.2006.9
Gibert E, Sanchez J, Gonzalez A (2005). Distributed Data Cache Designs for Clustered VLIW Processors. IEEE Transactions on Computers, 54:10 (1227-1241). Online publication date: 1-Oct-2005.
https://dl.acm.org/doi/10.1109/TC.2005.163
Balasubramonian R, Feautrier P, Goodman J, Seznec A (2004). Cluster prefetch. Proceedings of the 18th annual international conference on Supercomputing, pp. 326-335. Online publication date: 26-Jun-2004.
https://dl.acm.org/doi/10.1145/1006209.1006255
