DOI: 10.5555/956417.956547
Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Published: 03 December 2003

Abstract

Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache will increase, leading to an important performance impact. In this paper we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible for deciding which memory instructions make use of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and a clustered processor with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.
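
To make the latency argument in the abstract concrete, the following is a minimal sketch of an average load latency model, not the paper's evaluation method. The function name, its parameters, and the assumption that an L0 miss pays the L0 lookup and then the full L1 latency are all hypothetical and chosen only for illustration. Loads that the compiler does not steer to an L0 buffer are assumed to pay the L1 latency directly.

```python
def average_load_latency(l0_fraction, l0_hit_rate, l0_latency, l1_latency):
    """Average load latency in cycles under a toy two-level model.

    l0_fraction: share of loads the compiler marks to use an L0 buffer (hypothetical).
    l0_hit_rate: hit rate of those marked loads in the L0 buffer (hypothetical).
    l0_latency, l1_latency: access latencies in cycles (hypothetical).
    """
    if not (0.0 <= l0_fraction <= 1.0 and 0.0 <= l0_hit_rate <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    # Assumption: a marked load that misses in L0 pays the L0 lookup plus the L1 access.
    via_l0 = l0_hit_rate * l0_latency + (1.0 - l0_hit_rate) * (l0_latency + l1_latency)
    # Assumption: an unmarked load bypasses L0 and goes straight to the L1 cache.
    bypass = l1_latency
    return l0_fraction * via_l0 + (1.0 - l0_fraction) * bypass
```

With no loads marked for L0, the model reduces to the L1 latency. Under these assumptions, marking a load only pays off when its expected L0 hit rate is high enough to offset the extra lookup on a miss, which is one way to see why the choice of which memory instructions use the buffers matters.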


Published In

MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
December 2003
412 pages
ISBN:076952043X

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Conference

MICRO-36

Acceptance Rates

MICRO 36 Paper Acceptance Rate 35 of 134 submissions, 26%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2016) CHAINSAW. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-14. DOI: 10.5555/3195638.3195698. Online publication date: 15-Oct-2016
  • (2007) Inter-cluster communication in VLIW architectures. ACM Transactions on Architecture and Code Optimization, 4:2, pp. 11-es. DOI: 10.1145/1250727.1250731. Online publication date: 1-Jun-2007
  • (2006) Compiler-directed Data Partitioning for Multicluster Processors. Proceedings of the International Symposium on Code Generation and Optimization, pp. 208-220. DOI: 10.1109/CGO.2006.9. Online publication date: 26-Mar-2006
  • (2005) Distributed Data Cache Designs for Clustered VLIW Processors. IEEE Transactions on Computers, 54:10, pp. 1227-1241. DOI: 10.1109/TC.2005.163. Online publication date: 1-Oct-2005
  • (2004) Cluster prefetch. Proceedings of the 18th annual international conference on Supercomputing, pp. 326-335. DOI: 10.1145/1006209.1006255. Online publication date: 26-Jun-2004
