Article

Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache

Authors:

Jesús Sánchez,

Antonio González

CGO '03: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization

Pages 193 - 203

Published: 23 March 2003

Abstract

Clustering is a common technique for dealing with wire delays. Fully-distributed architectures, where the register file, the functional units and the cache memory are all partitioned, are particularly effective under these constraints and are also very scalable. However, distributing the data cache introduces a new problem: memory instructions may reach the cache in an order different from the sequential program order, which may corrupt its contents. This paper proposes and evaluates two local scheduling mechanisms that guarantee the serialization of aliased memory instructions: the construction of memory dependent chains (MDC solution), and two transformations (store replication and load-store synchronization) applied to the original Data Dependence Graph (DDGT solution). Neither solution requires any extra hardware.

The proposed scheduling techniques are evaluated for a word-interleaved cache clustered VLIW processor, although they can also be used with any other distributed cache configuration. Results for the Mediabench benchmark suite demonstrate the effectiveness of the techniques. In particular, the DDGT solution increases the proportion of local accesses by 16% compared to MDC and reduces stall time by 32%, since load instructions can be freely scheduled in any cluster. However, the MDC solution reduces compute time and often outperforms the DDGT solution. Finally, the impact of both techniques on an architecture with Attraction Buffers is studied and evaluated.
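The abstract does not spell out either algorithm, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes that a word-interleaved cache maps consecutive words to consecutive clusters, and that a memory dependent chain is a group of memory instructions linked by may-alias dependences, which a scheduler could then keep in one cluster so they reach the cache in program order. The names `home_cluster` and `memory_dependent_chains`, the word size and the cluster count are all hypothetical.

```python
from collections import defaultdict

WORD_BYTES = 4      # assumed word size, not taken from the paper
NUM_CLUSTERS = 4    # assumed number of clusters, not taken from the paper


def home_cluster(address, word_bytes=WORD_BYTES, num_clusters=NUM_CLUSTERS):
    """Cluster whose cache module would hold the word at `address`,
    assuming consecutive words are interleaved across clusters."""
    return (address // word_bytes) % num_clusters


def memory_dependent_chains(num_ops, alias_edges):
    """Group memory ops (numbered in program order) that are connected,
    directly or transitively, by may-alias dependences (union-find)."""
    parent = list(range(num_ops))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in alias_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the earliest op as root

    chains = defaultdict(list)
    for op in range(num_ops):
        chains[find(op)].append(op)
    return sorted(chains.values())


# Five memory ops; ops 0, 2 and 3 may alias, ops 1 and 4 are independent.
chains = memory_dependent_chains(5, [(0, 2), (2, 3)])
print(chains)                                          # [[0, 2, 3], [1], [4]]
print([home_cluster(a) for a in (0, 4, 8, 12, 16)])    # [0, 1, 2, 3, 0]
```

Under these assumptions, ops 0, 2 and 3 form one chain and would be serialized in a single cluster, while ops 1 and 4 remain free to be placed in whichever cluster their accessed word maps to.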

Cited By

Gibert E, Sanchez J, Gonzalez A (2005). Distributed Data Cache Designs for Clustered VLIW Processors. IEEE Transactions on Computers 54:10 (1227-1241). DOI: 10.1109/TC.2005.163. Online publication date: 1-Oct-2005
https://dl.acm.org/doi/10.1109/TC.2005.163
Gibert E, Sánchez J, González A (2003). Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors. Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. DOI: 10.5555/956417.956547. Online publication date: 3-Dec-2003
https://dl.acm.org/doi/10.5555/956417.956547

Published In

CGO '03: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization

March 2003

349 pages

ISBN: 076951913X

General Chairs: Richard Johnson (Transmeta), Tom Conte (NC State University)
Program Chair: Wen-mei Hwu (University of Illinois at Urbana-Champaign)

Copyright © 2003 Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Sponsors

IEEE Computer Society TC-uARCH
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Conference

CGO03: First Annual International IEEE/ACM Symposium on Code Generation and Optimization 2003

Sponsor: SIGMICRO

March 23 - 26, 2003

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%
