VLSI Module 4
VLSI Module 4
VLSI Module 4
ng Blocks .
11.1 Introduction
Aler the of the basic digital gates, it is
in-depth study of the
design optimization
and time to
test
our acquired skills on a somewhat larger scale and put them in a more system-oriente
perspective.
We will apply the techniques of the previous chapters to design a number of circuitsol.
used in the
datapaths of microprocessors and signal processors. More specifically, we disee often
the design of a cuss
representative set of modules such as adders, multipliers, and shifters. The sDend
and power of these elements often
dominates the overall system performance. Hence, a careful
speed
design optimization is required. It rapidly becomes obvious that the design task is not
forward. For each module, straight.
multiple equivalent logic and circuit topologies exist, each of which
has its own
positives and negatives in terms of area, speed, or power.
Although far from complete, the analysis presented helps focus on the essential
that must be made in the course trade-offs
of the digital design process. You will see that
only one design level-for instance, through transistor sizing optimization at
only-leads to inferior designs. A
global picture is therefore of crucial importance. Digital designers focus their
attention on the
gates, circuits, or transistors that have the largest
impact on their goal function. The noncritical
parts of the circuit can be developed routinely. We also
may develop first-order performance
models that foster
understanding of the fundamental behavior of a module. The discussion also
clarifies which computer aids can
help to simplify and automate this phase of the design process.
Before analyzing the design of the arithmetic modules, a short
discussion of the role of
the
datapath in the digital-processor picture is appropriate. This not only
design requirements for the datapath, but also puts the rest of this bookhighlights
the specific
in perspective. Other
processor modules-the input/output, controller, and memory modules--have
different require-
ments and were discussed in
Chapter 8. After an analysis of the area-power-delay trade-offs in
the design of adders,
multipliers, and shifters, we will use the same structures to illustrate some
of the
power-minimization approaches introduced in Chapter 6. The chapter concludes with a
short perspective on datapath design and its trade-offs.
datapath, memory, control, and input/output blocks. The datapath is the core of the
this is where all computations are performed. 1The other blocks in processor
the processor are support units
that either store the results produced by the
datapath or help to determine what will happen in
the next cycle. A typical datapath consists of an interconnection
of basic combinational
tions, such as arithmetic operators (addition, multüplication, func
comparison, and shift) or logic
(AND, OR, and XOR). The design of the arithmetic operators is the topic of this chapter. The
11.3 The Adder
561
Control
B Bit 3
Bit2
Bit
0 1 delete
0 0 0 propagate
0 propagate
0 0 0 propagate
0 0 propagate
0 0
generate/propagate
1
generate/propagate
S ABC;
=
ABC+ + ABC; + ABC; (11.1)
C, = AB +BC; + AC
G AB
D AB
(11.2)
P A SB
We can rewrite S and C, as functions of P and G (or D):
C(G, P) = G+ PC;
Notice that G and P are only functions of A and B and are not
dependent upon C. In a similar
way, we can also derive expressions for S(D, P) and
C,(D, P).
Ao Bo A B
Cio
FA
Co Co Co2 Co3
FA FA FA
So S1 S2
Figure 11-2 Four-bit ripple-carry adder: topology.
An N-bit adder can be constructed by cascading N full-adder (FA) circuits in series, con-
necting Co,k-1 to
Cikfor k = 1 toN-1, and the first carry-in Ci0 to 0 (Figure 11-2). This configu-
ration is called a ripple-carry adder, since the carry bit "ripples" from one stage to the other. The
delay through the circuit depends upon the number of logic stages that must be traversed and is a
function of the applied input signals. For some
input signals, no rippling effect occurs at all,
while for others, the carry has to ripple all the way from the least significant bit (lsb) to the most
significant bit (msb). The propagation delay of such a structure (also called the critical path) is
defined as the worst case delay over all possible input
patterns.
In the case of the
ripple-carry adder, the worst case delay happens when a carry generated
at the least significant bit
position propagates all the way to the most significant bit position.
This carry is finally consumed in the last
stage to produce the sum. The delay is then propor-
tional to the number of bits in the input words N and is
approximated by
adder (N- 1)tcary+sum (11.4)
where cary and un Cqual the propagation delays from C to C. and S, respectively2
Example 11.1 Propagation Delay of Ripple-Carry Adder
Derive the values of A, and B, (k = 0..N- 1) so that the worst case delay is obtained for
the ripple-cary adder.
The worst case condition requires that a carry be generated at the lsb position. Since
the input cary of the first full adder Co is always 0, this both Ap and B, must equal 1. All
the other stages must be in propagate mode. Hence, either A; or B, must be high. Finally,
we would like to physically measure the delay of a transition on the msb sum bit. Assum-
ing an initial value of 0 for Sy-1, we must arrange a 0>1 transition. This is achieved by
setting both Ay-j and By-1 to 0 (or 1), which yields a high sum bit given the incoming
carry of 1.
For example, the following values for A and B trigger the worst case delay for an 8-
bit addition:
A: 0000001; B: 01111111
uation (11.4) assumes that both the delay from the input signals Ag (or B,) to Coo for the lsb, and the Cto-C, delay
for all other
bitsequal
to fear
s64
Chapter 11 Designing Arithmetic Building Blocks
o set-up the worst case delay transition, all the inputs can e
stant with A
Kept constant wint
The propagation delay of the ripple-cary adder is linearly proportional to N. This propen
ecomes increasingly important when designing adders for the wide data paths
(N= 16... 128) that are desirable in curent and future computers.
When designing the ful-adder cell for a fast ripple-cary adder. it is tar more important to
optimize a r than since the latter has only a minor influence on the total value of
Before starting an in-depth discussion on the circuit design of full-adder cells, the folloa
ing additional logic property of the full adder is worth mentioning:
Inverting all inputs to a full adder results in inverted values for all outputs.
This property, also called the imverting property. is expressed in the pair of equations
C FA FA C
VpD
VpD
A
B
A
-B VpD
A- X-
C d
s
AL
VDD
C% E.B
Tall PMOS transistor stacks are present in both carry- and sum-generation circuits.
The intrinsic load capacitance of the C signal is large and consists of two diffusion and
six gate capacitances, plus the wiring
capacitance.
The signal propagates through two inverting stages in the cary-generation circuit. As
mentioned earlier, minimizing the carry-path delay is the prime goal of the designer of
high-speed adder circuits. Given the small load (fan-out) at the output of the carry chain.
having two logic stages is too high a number, and leads to extra delay.
The sum generation requires one extra logic stage, but that is not that important, sincea
factor appears only once in the propagation delay of the ripple-carry adder of Eq. (11.4).
566
Chapter 11 otic Bulldin
Designing Arithmetic Bullding Block
Ap BO
Ci Co2 Co
FA FA
Co FA' FA
So S
Figure 11-5 Inverter elimination in carry path. FA' stands for a full
adder without the inverter in the carry path.
Alhough slow, the circuit includes some smart design tricks. Notice that the first as
the cary-generation circuit is of
with the C, signal on the smaller PMOS stack,
designed
its logical effort to 2. Also, the NMOS and PMOS transistors connected to are lowerina
C, placed as clo
as possible to the
output of the gate. This is a direct application of a
circuit-optimization tech
nique discussed in Section 4.2-transistors on the critical path should be placed as close as
sible to the output of the
pos
gate. For instance, in stage k of the adder, signals A, and B, are available
and stable long before
Cik (= Cok-) arrives after rippling through the previous stages. In this
way, the capacitances of the internal nodes in the transistor chain are precharged or
advance. On arrival of Cik. only the
discharged in
capacitance of node X has to be (dis)charged. Putting the
CA transistors closer to Vpp and GND would require not only the (dis)charging of the capaci-
tance of node X, but also of the internal
capacitances.
The speed of this circuit can now be
improved gradually by using some of the adder prop-
erties discussed in the previous section. First, the number of
inverting stages in the carry path
can be reduced by
exploiting inverting property-inverting all the inputs of a full-adder cell
the
also inverts all the outputs. This rule allows us to eliminate an inverter in a
carry chain, as dem-
onstrated in Figure 11-5.
Mirror Adder Design
An improved adder circuit, also called the mirror adder, is shown in Figure 11-6 [Weste93]. Is
operation is based on Eq. (11.3). The carry-generation circuitry is worth analyzing. First, the
cary-inverting gate is eliminated, as suggested in the previous section. Secondly, the PDN and
PUN networks of the gate are not dual. Instead, they forma clever
implementation of the propa-
gate/generate/delete function-when either D or G is high, C, is set to VpD or GND, respec
tively. When the conditions for a Propagate are valid (or P is 1), the incoming camy 15
propagated (in inverted format) to C. This results in a considerable reduction in both area and
delay. The analysis of the sum circuitry is left to the reader. The following observations are
worth considering:
This full-adder cell requires only 24 transistors.
in Figure 11-12b.
This information can speed up the operation of the adder, as shown
be used to
to the next block
When BP =
PoP1P2P3= 1, the incoming carry is forwarded immediately
adder or carry-skip adder
through the bypass transistor M,-hence the name carry-bypass
Lehman62]. If this is not the case, the carry is obtained by way of the normal route
Po Go 2G2 Pa G3
Ci0 FA
Co,0
FA
Co1 FA FA Co
(a) Carry propagation
Po Go P1 G1 P3 G3
BP PoP1P2P3
Ci0 FA FA FA
Co,3
Figure 11-13 shows the possible camy-propagation paths when the full-adder circuit is
implemented in Manchester-carny style. This picture demonstrates how the bypass speeds
up the addition: The cary propagates cither through the bypass path, or a carry is gener
ated somewhere in the chain. In both cases, the delay is smaller than the normal ripple
configuration. The area overhead incurred by adding the bypass path is small and typically
the bypass path breaks the regular bit-slice
ranges between 10 and 20%. However, adding
structure (as was present in Figure 11-10)
P2 Pa BP
Po P
Cio
Go G G2
BP
Let us now compute the delay of an N-bit adder. At first, we assume that the total adder
is divided in (N/M) equal-length bypass stages, each of which contains M bits. An approxi-
mating expression for the total propagation time can be derived from Figure 11-14a and is
givenin Eq. (11.9). Namely,
M bits
Figure 11-14 (N= 16) carry-bypass adder: composition. The worst case delay path
is shaded in gray.
11.3 The Adder 573
Consumed at the last bit position without generating an output carry. The optimal number of bits
per skip block is determined by technological parameters such as the extra delay of the bypass-
eiecthng multüplexer, the buffering requirements in the carry chain, and the ratio of the delay
through the ripple and the bypass paths.
Although still linear, the slope of the delay function increases in a more gradual fashion
than for the ripple-carry adder, as pictured in Figure 11-15. This difference is substantial for
larger adders. Notice that the ripple adder is actually faster for small values of N, for which the
overhead of the extra bypass multiplexer makes the
bypass structure not interesting. The cross
Over point depends upon technology considerations and is normally situated between four and
eight bits.
adder.
Ripple adder
--
Bypass adder
N
4..8
Bedrijó2]. is demonstrated in Figure 11-16. Consider the block of adders, which is carry-selecte
tok +3. Instead of waiting on the arrival of the addine
output cary of bit k -1, both the 0 and bits
bilities are analyzed. From a circuit
point of view, this means that two carry paths arepossi
mented. When Cak
finally settles, either the result of the 0 or the 1
path is selected imple
k
multiplexer, which can be performed with a minimal
delay. As is by t
hardware overhead of the carry-select adder is restricted to an evident from Figure 11-16
additional carry path and a the
plexer, and equals about 30% with respect to a ripple-carry structure. multi m
Setup
BG
0 0Carrypropagation
1 14Carry propagation
Cok-
Multiplexer Cok3
Carry vector
Sum Generation
(5)L
1-Carry 1-Carry | 1-Carry
(6) (8)
0 Multiplexer Multiplexer Multiplexer Multiplexer
Sum Generation Sum Generation Sum Generation
Sum Generation
03 11 S12-15 (10)
(a) Linear configuration
Bit 0-1 Bit 2-4 Bit 5-8 Bit 9-13 Bit 14-19
Setup Setup Setup Setup
IfM << N (e.g., M 2, and N = 64), the first term dominates, and Eq. (11.11) can be
simplified to
N (11.12)
11.3 The Adder
577
50
40 Ripple adder
30
20 Linear select
10
Square root select
20 40 60
N
Figure 11-19
ripple and selectPropagation delay of square-root
adders. The unit carry-select adder versus linear
delay model is used to model the cell delays.
or
P 2NV (11.13)
Equation (11.13) can be used to
express tad function
as a
of N by rewriting Eq. (11.10):
add setup + Micarry + (W2N) mux + 'sum (11.14)
The delay is proportional to N for large adders (N>> M), or
relation has a major impact, which is illustrated tadd OWN). This square-root
=
X x Y-=Xy (11.21)
i = 0 j= 0
defined as follows:
with X, Ye {0, 1}. The multiplication operation is then
M+N-1
Z XxY'= Z,2
k = 0
(11.22)
M-1 V-1 M- N-1
i = 0 j= 0 i=0j=0
The simplest way to perform a multiplication is to use a single two-input adder. For inputs
that are M and N bits wide, the multiplication takes M cycles, using an N-bit adder. This shif
and-add algorithm for multiplication adds together M partial products. Each partial product is
generated by multiplying the multiplicand with a bit of the multiplier-which, essentially, is an
AND operation-and by shifting the result on the basis of the multiplier bit's position.
A faster way to implement multiplication is to resort to an approach similar to manually
computing a multiplication. All the partial products are generated at the same time and orga-
nized in an aray. A multioperand addition is applied to compute the final product. The approach
is illstrated in Figure 11-28. This set of operations can be mapped directly into hardware. The
587
11.4 The Multiplier
101010 Multiplicand
011 Multiplier
101 0 10
101 0 1 0
Partial products
0 00 0 00
+101010
1 1 1 0 0 1 1 1 0 Result
Xs X X X
X
(N-1)/2
000 0
001 +Multiplicand
010 +Multiplicand
101 - Multiplicand
110 - Multiplicand
111 0
11.4 The Multiplier
589
In
simple terms, the modified
ones from msb to Booth's recoding essentially of
lsb and examines the multiplier tor suCor
replaces them with a leading For
understood as the beginning of a string1, and
xample, 011 1s a- I at the end of the
of
strimg
and is therefore replaced by
leading I (or 100), while 110 is ones a
seen as the end of
significant position (or 010). a string and is replaced by a-1 at tne ieast
X X
Y
YZ
X
HA FA FA HA
Y z
FA FA FA HA
FA FA FA HA
Tz Z
Figure 11-30 A 4x4 bit-array multiplier for unsigned numbers-composition. HA stands for a
half adder, or an adder cell with only two inputs. The hardware for the generation and addition of
one partial product is shaded in gray.
look at those critical paths yields an approximate expression for the propagation delay (derived
here for critical path 2). We write this as
BA FAFA-HA
. Critical Path 1
Critical Path 2
FA FA FA-HA
EA-EAEA-A
Figure 11-31 Ripple-carry based 4x 4 multiplier (simplified diagram). Two of the
possible critical paths are highlighted.