
Inherently Lower-Power High-Performance Superscalar Architectures


Paper Review

Inherently Lower-Power High-Performance Superscalar Architectures
Rami Abielmona
Prof. Maitham Shams
95.575
March 4, 2002
Flynn’s Classifications (1966) [1]
SISD – Single Instruction stream, Single Data stream
Conventional sequential machines
The program executed is the instruction stream, and the data operated on is the data stream

SIMD – Single Instruction stream, Multiple Data streams
Vector machines
Processors execute the same program, but operate on different data streams

MIMD – Multiple Instruction streams, Multiple Data streams
Parallel machines
Independent processors execute different programs, using unique data streams

MISD – Multiple Instruction streams, Single Data stream
Systolic array machines
A common data structure is manipulated by separate processors, each executing a different instruction stream (program)
Pipelined Execution
An effective way of organizing concurrent activity in a computer system
Makes it possible to execute instructions concurrently
The maximum throughput of a pipelined processor is one instruction per clock cycle
Figure 1 shows a two-stage pipeline, with buffer B1 receiving new information at the end of each clock cycle (a toy simulation follows)

Figure 1, courtesy [2]
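To make the throughput claim concrete, here is a toy simulation of the two-stage pipeline. The buffer name B1 follows Figure 1; the instruction labels and queue representation are illustrative assumptions, not details from the slides.

```python
# Toy two-stage pipeline (fetch -> execute) showing why steady-state
# throughput approaches one instruction per clock cycle: stage 1 works on
# instruction i+1 while stage 2 finishes instruction i. The buffer name
# (B1) follows Figure 1; everything else is illustrative.
instructions = ["i1", "i2", "i3", "i4"]
b1 = None                           # inter-stage buffer (B1 in Figure 1)
fetch_queue = list(instructions)

cycle = 0
completed = []
while fetch_queue or b1:
    cycle += 1
    if b1:                          # stage 2: execute what B1 holds
        completed.append(b1)
    # End of cycle: B1 latches the newly fetched instruction (stage 1).
    b1 = fetch_queue.pop(0) if fetch_queue else None
    print(f"cycle {cycle}: completed={completed}")
# Four instructions finish in five cycles; as the instruction count
# grows, throughput tends to the one-instruction-per-cycle maximum.
```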


Superscalar Processors
Processors equipped with multiple execution units, in order to handle several instructions in parallel [11]
Maximum throughput is greater than one instruction per cycle (multiple-issue)
The baseline architecture is shown in figure 2 [3]

Figure 2, courtesy [3]


Important Terminology [2] [4]
Issue Width – the metric designating how many instructions are issued per cycle
Issue Window – comprises the last n entries of the instruction buffer
Register File – a set of n-byte, dual-read, single-write banks of registers
Register Renaming – technique used to prevent stalling the processor on false data dependencies between instructions
Instruction Steering – technique used to send decoded instructions to the appropriate memory banks
Memory Disambiguation Unit – mechanism for enforcing data dependencies through memory
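Register renaming recurs throughout the rest of the deck, so here is a minimal sketch of a rename map table in action. The register names, free-list size, and Python representation are illustrative assumptions, not details from the paper.

```python
# Minimal register-renaming sketch: a map table translates architectural
# register names to physical registers, so each write to an architectural
# register gets a fresh physical destination, removing false (WAR/WAW)
# dependencies while preserving true (RAW) ones. All names and sizes here
# are illustrative.
free_list = [f"p{i}" for i in range(8)]          # free physical registers
map_table = {"r1": "p_init1", "r2": "p_init2"}   # architectural -> physical

def rename(dest, srcs):
    """Rename one decoded instruction: read sources, allocate a new dest."""
    phys_srcs = [map_table[s] for s in srcs]  # true (RAW) deps preserved
    phys_dest = free_list.pop(0)              # fresh register breaks WAW/WAR
    map_table[dest] = phys_dest
    return phys_dest, phys_srcs

# Two writes to r1 no longer conflict: each gets its own physical register.
print(rename("r1", ["r2"]))   # ('p0', ['p_init2'])
print(rename("r1", ["r1"]))   # ('p1', ['p0']) -- reads the first write's result
```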
Motivations and Objectives
To analyze the high-end processor market for BOTH power and area-performance trade-offs (not previously done)
To propose a superscalar architecture which achieves a power reduction without compromising performance
The analysis is carried out on the structures whose energy dissipation grows with increasing issue width:
Register rename logic
Instruction issue window
Memory disambiguation unit
Data bypass mechanism
Multiported register file
Instruction and data caches
Energy Models [5]
Model 1 – Multiported RAM
Access energy (R or W): $E = E_{decode} + E_{array} + E_{SA} + E_{ctl,SA} + E_{pre} + E_{control}$
Word-line energy: $E_{wl} = V_{dd}^{2} N_{bits} \left( C_{gate} W_{pass,r} + (2N_{write} + N_{read}) W_{pitch} C_{metal} \right)$
Bit-line energy: $E_{bl} = V_{dd} M_{margin} V_{sense} C_{bl,read} N_{bits}$

Model 2 – CAM (Content-Addressable Memory)
Uses IW write word lines and IW write bit-line pairs

Model 3 – Pipeline latches and clocking tree
Assumes a balanced clocking tree (less power dissipation than a grid)
Assumes a lower-power single-phase clocking scheme
Near-minimum transistor sizes used in latches

Model 4 – Functional Units
$E_{avg} = E_{const} + N_{change} \times E_{change}$
Energy complexity is independent of issue width
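A minimal sketch that evaluates the Model 1 word-line and bit-line expressions above. Every numeric value below is an illustrative placeholder, not one of the paper's technology constants; note how the $(2N_{write} + N_{read})$ term ties energy to port count, and hence to issue width.

```python
# Sketch of the Model 1 (multiported RAM) word-line and bit-line energy
# expressions from the slide. All numeric values are illustrative
# placeholders, NOT technology constants from the paper.

def wordline_energy(vdd, n_bits, c_gate, w_pass_r, n_write, n_read,
                    w_pitch, c_metal):
    # E_wl = Vdd^2 * Nbits * (Cgate*Wpass,r + (2*Nwrite + Nread)*Wpitch*Cmetal)
    return vdd**2 * n_bits * (c_gate * w_pass_r
                              + (2 * n_write + n_read) * w_pitch * c_metal)

def bitline_energy(vdd, m_margin, v_sense, c_bl_read, n_bits):
    # E_bl = Vdd * Mmargin * Vsense * Cbl,read * Nbits
    return vdd * m_margin * v_sense * c_bl_read * n_bits

# Illustrative 3.3 V, 64-bit entry with 4 read and 2 write ports:
e_wl = wordline_energy(vdd=3.3, n_bits=64, c_gate=2e-15, w_pass_r=1.0,
                       n_write=2, n_read=4, w_pitch=1.2, c_metal=0.5e-15)
e_bl = bitline_energy(vdd=3.3, m_margin=1.5, v_sense=0.3, c_bl_read=1e-13,
                      n_bits=64)
print(f"word-line: {e_wl:.3e} J, bit-line: {e_bl:.3e} J")
```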
Preliminary Simulation Results
$E \sim (IW)^{\gamma}$
Wrote own simulator, incorporating all the developed energy models (based on the SimpleScalar tool set)
Ran simulations for 5 superscalar designs, with IW ranging from 4 to 16
Results show that total committed energy increases with IW, as wider processors rely on deeper speculation to exploit ILP
Energy/instruction grows linearly for all structures except functional units (FUs)
Results were obtained using a 0.35-micron, Vdd = 3.3 V technology, with which FUs scale well. However, RAMs, CAMs and long wires do not scale well, and thus have to be LOW-POWER structures

Table 1 – Energy growth parameter per structure
Structure                     Energy growth parameter
Register rename logic         γ = 1.1
Instruction issue window      γ = 1.9
Memory disambiguation unit    γ = 1.5
Multiported register file     γ = 1.8
Data bypass mechanism         γ = 1.6
Functional units              γ = 0.1
All caches                    γ = 0.7
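For intuition on where Table 1's numbers come from, here is how a growth parameter γ in $E \sim (IW)^{\gamma}$ can be extracted from simulated (IW, energy) pairs by a least-squares fit in log-log space. The sample energies are fabricated to mimic the issue-window trend (γ ≈ 1.9); they are not data from the paper.

```python
# Extracting the energy-growth parameter gamma in E ~ IW^gamma from
# simulated (IW, energy) pairs via a least-squares fit in log-log space.
# The sample energies below are fabricated to resemble the issue-window
# row of Table 1; they are not paper data.
import math

issue_widths = [4, 6, 8, 12, 16]
energies     = [1.0, 2.2, 3.8, 8.2, 14.1]   # arbitrary relative units

xs = [math.log(iw) for iw in issue_widths]
ys = [math.log(e) for e in energies]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
gamma = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
print(f"fitted gamma ~ {gamma:.2f}")   # close to 1.9 for these samples
```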
Problem Formulation
Energy-Delay Product
$E \times D = \text{energy/operation} \times \text{cycles/operation}$
$E \times D = (\text{energy/cycle}) / \mathrm{IPC}^{2}$
$E \times D = \left[ \mathrm{IPC} \times (IW)^{\gamma} \right] / \mathrm{IPC}^{2} \sim (IW)^{\gamma} / \mathrm{IPC}$
With $\mathrm{IPC} \sim (IW)^{\alpha}$, where α captures how well a wider machine converts issue width into ILP:
$E \times D \sim (IW)^{\gamma - \alpha} \sim \mathrm{IPC}^{(\gamma - \alpha)/\alpha}$
Problem Definition
If α = 1, then $E \times D \sim \mathrm{IPC}^{\gamma - 1} \sim (IW)^{\gamma - 1}$
If α = 0.5, then $E \times D \sim \mathrm{IPC}^{2\gamma - 1} \sim (IW)^{\gamma - 1/2}$
New techniques are needed to achieve more ILP than conventional superscalar designs provide
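Plugging Table 1's γ values into the scaling rule $E \times D \sim (IW)^{\gamma - \alpha}$ shows which structures dominate. The γ values come from Table 1 and the two α cases from this slide; the tabulation itself is just a quick self-contained check.

```python
# Evaluating the slide's scaling rule E*D ~ IW^(gamma - alpha) for the
# gamma values in Table 1, at the two alpha cases considered above.
# A positive exponent means the structure's energy-delay product worsens
# as issue width grows; functional units and caches actually improve.
gammas = {
    "register rename logic":     1.1,
    "instruction issue window":  1.9,
    "memory disambiguation":     1.5,
    "multiported register file": 1.8,
    "data bypass mechanism":     1.6,
    "functional units":          0.1,
    "all caches":                0.7,
}
for alpha in (1.0, 0.5):   # IPC ~ IW^alpha
    print(f"alpha = {alpha}:")
    for name, g in gammas.items():
        print(f"  {name:26s} E*D ~ IW^{g - alpha:+.1f}")
```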
Intermediary Recap
We have discussed
Superscalar processor design and terminology
Energy modeling of microarchitecture structures
Analysis of energy-delay metric
Preliminary simulation results

We will introduce
General solution methodology
Previous decentralization schemes
Proposed strategy
Simulation results of multicluster architecture
Conclusions
General Design Solution

Decentralization of the microarchitecture
Replace the tightly coupled CPU with a set of clusters, each capable of superscalar processing
Can ideally reduce γ to zero, with good cluster partitioning techniques

The solution introduces the following issues:
1. Additional paths for intercluster communication
2. Need for cluster assignment algorithms
3. Interaction of the clusters with a common memory system
Previous Decentralized Solutions
Particular Solution                      Main Features
Limited Connectivity VLIWs [6]           RF is partitioned into banks; every operation specifies a destination bank
Multiscalar Architecture [7]             PEs organized in a circular chain; RF is decentralized
Trace Window Architecture [8]            RF and issue window are partitioned; all instructions must be buffered
Multicluster Architecture [9]            RF, issue window and FUs are decentralized; a special instruction is used for intercluster communication
Dependence-Based Architecture [10]       Contains instruction dispatch intelligence
2-cluster Alpha 21264 Processor [11]     Both clusters contain a copy of the RF

Proposed Multicluster Architecture (1)
Instead of a tightly coupled CPU, the proposed architecture consists of a set of clusters, each containing:
an instruction issue window
a local physical register file
a set of execution units
a local memory disambiguation unit
one bank of the interleaved data cache

Refer to figure 3 on the next slide


Proposed Multicluster Arch. (2)

Figure 3
Multicluster Architecture Details
Register Renaming and Instruction Steering
Each cluster is provided with a local physical RF
Global Map Table maintains mapping between architectural registers and
physical registers

Cluster Assignment Algorithm
Tries to minimize
intercluster register dependencies
delay through the cluster assignment logic
The whole-graph solution is NP-complete, so near-optimal solutions are devised by a divide & conquer method (a greedy steering sketch follows at the end of this slide)

Intercluster Communication
Remote Access Window (RAW) used for remote RF requests
Remote Access Buffer (RAB) used to keep the remote source operand
A one-cycle penalty is incurred for a remote RF access
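The divide & conquer heuristic itself is not spelled out on the slides, so the following is only one plausible greedy steering rule, written to illustrate what cluster assignment tries to minimize. The cluster count, data structures, and tie-breaking policy are all assumptions for the sketch.

```python
# One plausible greedy steering rule (illustrative only): send each
# instruction to the cluster that already produces most of its source
# operands, minimizing intercluster register traffic, breaking ties
# toward the least-loaded cluster. This is not the paper's algorithm.
from collections import defaultdict

N_CLUSTERS = 2
producer = {}                      # register name -> cluster that writes it
load = defaultdict(int)            # instructions steered to each cluster

def steer(dest, srcs):
    votes = defaultdict(int)
    for s in srcs:
        if s in producer:
            votes[producer[s]] += 1
    # Prefer the cluster with the most local sources, then the lighter one.
    cluster = min(range(N_CLUSTERS), key=lambda c: (-votes[c], load[c]))
    producer[dest] = cluster
    load[cluster] += 1
    return cluster

print(steer("r1", []))          # no deps -> lightest cluster (0)
print(steer("r2", ["r1"]))      # follows its producer -> cluster 0
print(steer("r3", []))          # balances load -> cluster 1
```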
Multicluster Architecture Details
(Cont’d)
Memory Dataflow
A centralized memory disambiguation unit does not scale with increasing issue width and bigger load/store windows
Proposed scheme: every cluster is provided with a local load/store window that is hardwired to a particular data cache bank
A bank predictor was developed because, at the decode stage, it is not yet known which cluster an instruction will be routed to (a minimal predictor sketch follows)
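The slides do not describe the bank predictor's internals, so here is a minimal "last bank" predictor as a sketch of the idea: a small table, indexed by a hash of the load/store's PC, remembering which cache bank that instruction touched last time. The table size and hashing are illustrative choices.

```python
# Minimal "last bank" predictor sketch (internals assumed, not from the
# paper): a PC-indexed table remembers which data cache bank each
# load/store touched last; a misprediction would cost an intercluster
# data transfer over the bus mentioned in the results.
TABLE_SIZE = 256
table = [0] * TABLE_SIZE           # predicted bank per table entry

def predict_bank(pc):
    return table[pc % TABLE_SIZE]

def update_bank(pc, actual_bank):
    table[pc % TABLE_SIZE] = actual_bank

pc = 0x40_01_24
print(predict_bank(pc))        # cold prediction: bank 0
update_bank(pc, 3)             # the access actually hit bank 3
print(predict_bank(pc))        # next time: bank 3
```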

Stack Pointer (SP) References


Realized an eager mechanism for handling SP references (sketched below)
 On a new reference to SP, an entry is allocated in the RAB
 Upon instruction completion, results are written into the RF and the RAB
 The RAB entry is not freed after an instruction reads its contents
 The RAB entry is freed only when a new SP reference commits
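A sketch of the eager RAB policy described in the bullets above, keeping an SP value alive so later SP readers hit locally. The class and method names are illustrative, not the paper's.

```python
# Sketch of the eager stack-pointer policy above: an SP value is kept
# alive in the Remote Access Buffer so repeated SP readers hit locally,
# and the entry is recycled only when a NEW SP reference commits.
# Class and method names are illustrative assumptions.
class EagerSPBuffer:
    def __init__(self):
        self.entry = None          # the one live SP value, if any

    def on_new_sp_write(self):
        # A new SP reference commits: only now is the old entry freed,
        # and a fresh entry allocated for the new value.
        self.entry = {"value": None, "ready": False}

    def on_complete(self, value):
        # Instruction completion writes the result into RF *and* the RAB.
        self.entry.update(value=value, ready=True)

    def read(self):
        # Readers consume the value but do NOT free the entry (eagerness).
        return self.entry["value"] if self.entry and self.entry["ready"] else None

rab = EagerSPBuffer()
rab.on_new_sp_write()
rab.on_complete(0x7FFF_F000)
print(hex(rab.read()))             # repeated reads all hit the same entry
print(hex(rab.read()))
```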
Results and Analysis
A single address-transfer bus is sufficient for handling intercluster address transfers
A single bus is used to handle intercluster data transfers arising from bank mispredictions
4-6 entries in the RAB suffice for low-power operation
2 extra RAB entries are sufficient for SP references
Intercluster traffic is reduced by 20% and performance improved by 3% using the SP eager mechanism
The multicluster architecture showed 20% better performance than the best centralized configurations, with a 50% reduction in power dissipation
Conclusions
Main Result of Work
Using this architecture allows the development of high-performance processors while keeping the microarchitecture energy-efficient, as demonstrated by the energy-delay product analysis
Main Contribution of Work
A methodology for energy-efficiency analysis was derived for use with the next generation of high-performance decentralized superscalar processors

Other Major Contributions
 Opened analysts' eyes to the 3-D IPC-area-energy design space
 Proposed a roadmap for future high-performance, low-power microprocessor development
 Coined the energy-efficient family concept, composed of equally optimal energy-efficient configurations
References (1)
[1] M.J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, pp. 1901-1909, December 1966.
[2] C. Hamacher, Z. Vranesic, and S. Zaky, "Computer Organization," fifth edition, McGraw-Hill: New York, 2002.
[3] V. Zyuban and P. Kogge, "Inherently Lower-Power High-Performance Superscalar Architectures," IEEE Transactions on Computers, vol. 50, no. 3, pp. 268-285, March 2001.
[4] E. Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proceedings of the 29th Fault-Tolerant Computing Symposium, June 1999.
[5] V. Zyuban, "Inherently Lower-Power High-Performance Superscalar Architectures," PhD thesis, Univ. of Notre Dame, March 2000.
[6] R. Colwell et al., "A VLIW Architecture for a Trace Scheduling Compiler," IEEE Transactions on Computers, vol. 37, no. 8, pp. 967-979, August 1988.
[7] M. Franklin and G.S. Sohi, "The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism," Proc. 19th Annual International Symposium on Computer Architecture, May 1992.
References (2)
[8] S. Vajapeyam and T. Mitra, "Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences," Proc. 24th Annual International Symposium on Computer Architecture, June 1997.
[9] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic, "The Multicluster Architecture: Reducing Cycle Time through Partitioning," Proc. 30th Annual International Symposium on Microarchitecture, December 1997.
[10] S. Palacharla, N. Jouppi, and J. Smith, "Complexity-Effective Superscalar Processors," Proc. 24th Annual International Symposium on Computer Architecture, pp. 206-218, June 1997.
[11] K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability," McGraw-Hill: New York, 1993.
Questions/Comments
