Article

An interleaved cache clustered VLIW processor

Authors:

Jesús Sánchez,

Antonio González

ICS '02: Proceedings of the 16th international conference on Supercomputing

Pages 210 - 219

https://doi.org/10.1145/514191.514222

Published: 22 June 2002

Abstract

Clustered microarchitectures are becoming a common organization because they can reduce the penalties caused by wire delays and power consumption. Fully-distributed architectures are particularly effective at dealing with these constraints and are also very scalable. However, distributing the data cache poses a significant challenge and may be critical for performance. This work proposes a distributed data cache VLIW architecture based on an interleaved cache organization, together with cyclic scheduling techniques for it. It also introduces Attraction Buffers for such an architecture: a novel hardware mechanism that increases the percentage of local accesses by allowing some data to move towards the clusters that need it.

Performance results for 9 Mediabench benchmarks show that our scheduling techniques are able to hide the increased memory latency of accessing data mapped to a remote cluster. In addition, with the same scheduling techniques, adding Attraction Buffers to the interleaved cache clustered processor increases the local hit ratio by 15% and reduces stall time by 30%. Finally, the proposed architecture is compared with a state-of-the-art distributed architecture, the multiVLIW. Results show that an interleaved cache clustered VLIW processor with Attraction Buffers performs similarly to the multiVLIW architecture while having lower hardware complexity.
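To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' design. It assumes 4 clusters, a 4-byte interleaving factor, and a small LRU buffer per cluster that keeps copies of remotely fetched words. None of these parameters come from the abstract, and the names `home_cluster`, `AttractionBuffer`, and `local_access_ratio` are hypothetical. Writes and coherence are ignored.

```python
from collections import OrderedDict

N_CLUSTERS = 4          # assumed cluster count (not stated in the abstract)
INTERLEAVE_BYTES = 4    # assumed interleaving factor: one 4-byte word


def home_cluster(addr, n_clusters=N_CLUSTERS, interleave=INTERLEAVE_BYTES):
    """Cluster whose cache module holds `addr` under interleaving."""
    return (addr // interleave) % n_clusters


class AttractionBuffer:
    """Hypothetical small LRU buffer of remote words pulled in by one cluster."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.words = OrderedDict()  # word index -> True, kept in LRU order

    def lookup(self, word):
        if word in self.words:
            self.words.move_to_end(word)  # refresh LRU position on a hit
            return True
        return False

    def insert(self, word):
        self.words[word] = True
        self.words.move_to_end(word)
        if len(self.words) > self.capacity:
            self.words.popitem(last=False)  # evict the least recently used word


def local_access_ratio(trace, use_buffers):
    """Fraction of (cluster, addr) reads served without leaving the cluster."""
    buffers = [AttractionBuffer() for _ in range(N_CLUSTERS)]
    local = 0
    for cluster, addr in trace:
        word = addr // INTERLEAVE_BYTES
        if home_cluster(addr) == cluster:
            local += 1                      # data is mapped to this cluster
        elif use_buffers and buffers[cluster].lookup(word):
            local += 1                      # remote data already attracted
        elif use_buffers:
            buffers[cluster].insert(word)   # remote access attracts a copy
    return local / len(trace)


# Cluster 0 reads the same 8 consecutive words twice.
trace = [(0, 4 * i) for i in range(8)] * 2
print(local_access_ratio(trace, use_buffers=False))  # 0.25
print(local_access_ratio(trace, use_buffers=True))   # 0.625
```

In this toy trace only every fourth word is mapped to cluster 0, so without buffers a quarter of the accesses are local. With buffers, the second pass finds the remote words already held locally. The numbers illustrate the mechanism only and are unrelated to the results reported in the paper.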

Index Terms

An interleaved cache clustered VLIW processor
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Very long instruction word
    2. Serial architectures
      1. Complex instruction set computing
      2. Reduced instruction set computing

Information & Contributors

Information

Published In

ICS '02: Proceedings of the 16th international conference on Supercomputing

June 2002

338 pages

ISBN:1581134835

DOI:10.1145/514191

General Chair: Kemal Ebcioglu (IBM T.J. Watson Research Center)

Program Chairs: Keshav Pingali (Cornell University), Alex Nicolau (University of California)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ICS02

Sponsor:

SIGARCH

ICS02: International Conference on Supercomputing

June 22 - 26, 2002

New York, New York, USA

Acceptance Rates

ICS '02 Paper Acceptance Rate 31 of 144 submissions, 22%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 14
Total Downloads: 523
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 14 Feb 2025

Cited By

Dreslinski R, Manville T, Sewell K, Das R, Pinckney N, Satpathy S, Blaauw D, Sylvester D, Mudge T, Yew P, Cho S, DeRose L, Lilja D (2012). XPoint cache. Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pp. 75-86. DOI: 10.1145/2370816.2370829. Online publication date: 19-Sep-2012
https://dl.acm.org/doi/10.1145/2370816.2370829
Machanick P, Barnard L, Botha R (2007). Design principles for a virtual multiprocessor. Proceedings of the 2007 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries, pp. 76-82. DOI: 10.1145/1292491.1292500. Online publication date: 2-Oct-2007
https://dl.acm.org/doi/10.1145/1292491.1292500
Chu M, Mahlke S (2006). Compiler-directed Data Partitioning for Multicluster Processors. Proceedings of the International Symposium on Code Generation and Optimization, pp. 208-220. DOI: 10.1109/CGO.2006.9. Online publication date: 26-Mar-2006
https://dl.acm.org/doi/10.1109/CGO.2006.9
Gibert E, Sánchez J, González A (2006). Instruction scheduling for a clustered VLIW processor with a word-interleaved cache. Concurrency and Computation: Practice and Experience, 18:11, pp. 1391-1411. DOI: 10.1002/cpe.1013. Online publication date: 12-Jan-2006
https://doi.org/10.1002/cpe.1013
Gibert E, Sanchez J, Gonzalez A (2005). Distributed Data Cache Designs for Clustered VLIW Processors. IEEE Transactions on Computers, 54:10, pp. 1227-1241. DOI: 10.1109/TC.2005.163. Online publication date: 1-Oct-2005
https://dl.acm.org/doi/10.1109/TC.2005.163
Zhong H, Fan K, Mahlke S, Schlansker M (2005). A Distributed Control Path Architecture for VLIW Processors. Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pp. 197-206. DOI: 10.1109/PACT.2005.5. Online publication date: 17-Sep-2005
https://dl.acm.org/doi/10.1109/PACT.2005.5
González J, Latorre F, González A, Carter J, Zhang L (2004). Cache organizations for clustered microarchitectures. Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture, pp. 46-55. DOI: 10.1145/1054943.1054950. Online publication date: 20-Jun-2004
https://dl.acm.org/doi/10.1145/1054943.1054950
Chu M, Fan K, Ravindran R, Mahlke S (2004). Cost-Sensitive Partitioning in an Architecture Synthesis System for Multicluster Processors. IEEE Micro, 24:3, pp. 10-20. DOI: 10.1109/MM.2004.7. Online publication date: 1-May-2004
https://dl.acm.org/doi/10.1109/MM.2004.7
Gibert E, Sánchez J, González A, Johnson R, Conte T, Hwu W (2003). Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache. Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, pp. 193-203. DOI: 10.5555/776261.776283. Online publication date: 23-Mar-2003
https://dl.acm.org/doi/10.5555/776261.776283
Gibert E, Sanchez J, Gonzalez A (2003). Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache. International Symposium on Code Generation and Optimization, 2003. CGO 2003., pp. 193-203. DOI: 10.1109/CGO.2003.1191545. Online publication date: 2003
https://doi.org/10.1109/CGO.2003.1191545
