VLSI Module 4
560 Chapter 11 Designing Arithmetic Building Blocks

11.8 Perspective: Design as a Trade-off
11.9 Summary
11.10 To Probe Further

11.1 Introduction

After the in-depth study of the design and optimization of the basic digital gates, it is time to test our acquired skills on a somewhat larger scale and put them in a more system-oriented perspective. We will apply the techniques of the previous chapters to design a number of circuits often used in the datapaths of microprocessors and signal processors. More specifically, we discuss the design of a representative set of modules such as adders, multipliers, and shifters. The speed and power of these elements often dominate the overall system performance. Hence, a careful design optimization is required. It rapidly becomes obvious that the design task is not straightforward. For each module, multiple equivalent logic and circuit topologies exist, each of which has its own positives and negatives in terms of area, speed, or power.

Although far from complete, the analysis presented helps focus on the essential trade-offs that must be made in the course of the digital design process. You will see that optimization at only one design level (for instance, through transistor sizing only) leads to inferior designs. A global picture is therefore of crucial importance. Digital designers focus their attention on the gates, circuits, or transistors that have the largest impact on their goal function. The noncritical parts of the circuit can be developed routinely. We may also develop first-order performance models that foster understanding of the fundamental behavior of a module. The discussion also clarifies which computer aids can help to simplify and automate this phase of the design process.
Before analyzing the design of the arithmetic modules, a short discussion of the role of the datapath in the digital-processor picture is appropriate. This not only highlights the specific design requirements for the datapath, but also puts the rest of this book in perspective. Other processor modules (the input/output, controller, and memory modules) have different requirements and were discussed in Chapter 8. After an analysis of the area-power-delay trade-offs in the design of adders, multipliers, and shifters, we will use the same structures to illustrate some of the power-minimization approaches introduced in Chapter 6. The chapter concludes with a short perspective on datapath design and its trade-offs.

11.2 Datapaths in Digital Processor Architectures

We introduced the concept of a digital processor in Chapter 8. Its components consist of the datapath, memory, control, and input/output blocks. The datapath is the core of the processor; this is where all computations are performed. The other blocks in the processor are support units that either store the results produced by the datapath or help to determine what will happen in the next cycle. A typical datapath consists of an interconnection of basic combinational functions, such as arithmetic operators (addition, multiplication, comparison, and shift) or logic (AND, OR, and XOR). The design of the arithmetic operators is the topic of this chapter.
Figure 11-1 Bit-sliced datapath organization.

The intended application sets constraints on the datapath design. In some cases, such as in personal computers, processing speed is everything. In most other applications, there is a maximum amount of power that is allowed to be dissipated, or there is a maximum energy available for computation while maintaining the desired throughput.

Datapaths often are arranged in a bit-sliced organization, as shown in Figure 11-1. Instead of operating on single-bit digital signals, the data in a processor are arranged in a word-based fashion. Typical microprocessor datapaths are 32 or 64 bits wide, while dedicated signal-processing datapaths, such as those in DSL modems, magnetic disk drives, or compact-disc players, are of arbitrary width, typically 5 to 24 bits. For instance, a 32-bit processor operates on data words that are 32 bits wide. This is reflected in the organization of the datapath. Since the same operation frequently has to be performed on each bit of the data word, the datapath consists of 32 bit slices, each operating on a single bit; hence the term bit sliced. Bit slices are either identical or resemble a similar structure for all bits. The datapath designer can concentrate on the design of a single slice that is repeated 32 times.

11.3 The Adder

Addition is the most commonly used arithmetic operation. It often is the speed-limiting element as well. Therefore, careful optimization of the adder is of the utmost importance. This optimization can proceed either at the logic or the circuit level. Typical logic-level optimizations try to rearrange the Boolean equations so that a faster or smaller circuit is obtained. An example of such a logic optimization is the carry-lookahead adder discussed later in the chapter. Circuit optimizations, on the other hand, manipulate transistor sizes and circuit topology to optimize the speed. Before considering both optimization processes, we provide a short summary of the basic definitions of an adder circuit (as defined in any book on logic design [e.g., Katz94]).
11.3.1 The Binary Adder: Definitions

Table 11-1 shows the truth table of a binary full adder. A and B are the adder inputs, Ci is the carry input, S is the sum output, and Co is the carry output. The Boolean expressions for S and Co are given in Eq. (11.1).

Table 11-1 Truth table for full adder.

A  B  Ci | S  Co | Carry Status
0  0  0  | 0  0  | delete
0  0  1  | 1  0  | delete
0  1  0  | 1  0  | propagate
0  1  1  | 0  1  | propagate
1  0  0  | 1  0  | propagate
1  0  1  | 0  1  | propagate
1  1  0  | 0  1  | generate/propagate
1  1  1  | 1  1  | generate/propagate

S = A ⊕ B ⊕ Ci = A·B·Ci + Ā·B̄·Ci + Ā·B·C̄i + A·B̄·C̄i          (11.1)
Co = A·B + B·Ci + A·Ci

It is often useful from an implementation perspective to define S and Co as functions of some intermediate signals G (generate), D (delete), and P (propagate). G = 1 (D = 1) ensures that a carry bit will be generated (deleted) at Co independent of Ci, while P = 1 guarantees that an incoming carry will propagate to Co. Expressions for these signals can be derived from an inspection of the truth table:

G = A·B
D = Ā·B̄          (11.2)
P = A ⊕ B

We can rewrite S and Co as functions of P and G (or D):

Co(G, P) = G + P·Ci
S(G, P) = P ⊕ Ci          (11.3)

Notice that G and P are functions only of A and B and do not depend upon Ci. In a similar way, we can also derive expressions for S(D, P) and Co(D, P).

¹Note that the propagate signal is sometimes also defined as the OR function of the inputs A and B; this condition guarantees that an incoming carry propagates to the output when A = B = 1, too. We will provide appropriate warning whenever this definition is used.
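These definitions are easy to check exhaustively. The sketch below (Python, illustrative only; function names are ours) models the full adder both directly from Eq. (11.1) and through the generate/propagate signals of Eqs. (11.2) and (11.3), and confirms that the two formulations agree for all eight input combinations.

```python
from itertools import product

def full_adder_direct(a, b, ci):
    """Sum and carry-out straight from the truth table (Eq. 11.1)."""
    s = a ^ b ^ ci
    co = (a & b) | (b & ci) | (a & ci)
    return s, co

def full_adder_gp(a, b, ci):
    """The same function expressed via generate/propagate (Eqs. 11.2-11.3)."""
    g = a & b          # G = A*B: carry generated regardless of Ci
    p = a ^ b          # P = A xor B: incoming carry propagates
    co = g | (p & ci)  # Co = G + P*Ci
    s = p ^ ci         # S = P xor Ci
    return s, co

# Both formulations must match on all eight input combinations.
for a, b, ci in product((0, 1), repeat=3):
    assert full_adder_direct(a, b, ci) == full_adder_gp(a, b, ci)
print("Eq. (11.3) matches Eq. (11.1) for all inputs")
```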
Figure 11-2 Four-bit ripple-carry adder: topology.
An N-bit adder can be constructed by cascading N full-adder (FA) circuits in series, connecting Co,k−1 to Ci,k for k = 1 to N−1, and connecting the first carry-in Ci,0 to 0 (Figure 11-2). This configuration is called a ripple-carry adder, since the carry bit "ripples" from one stage to the next. The delay through the circuit depends upon the number of logic stages that must be traversed and is a function of the applied input signals. For some input signals, no rippling effect occurs at all, while for others, the carry has to ripple all the way from the least significant bit (lsb) to the most significant bit (msb). The propagation delay of such a structure (also called the critical path) is defined as the worst case delay over all possible input patterns.

In the case of the ripple-carry adder, the worst case delay happens when a carry generated at the least significant bit position propagates all the way to the most significant bit position. This carry is finally consumed in the last stage to produce the sum. The delay is then proportional to the number of bits N in the input words and is approximated by

t_adder ≈ (N − 1)·t_carry + t_sum          (11.4)

where t_carry and t_sum equal the propagation delays from Ci to Co and S, respectively.²
Example 11.1 Propagation Delay of Ripple-Carry Adder

Derive the values of Ak and Bk (k = 0...N−1) so that the worst case delay is obtained for the ripple-carry adder.

The worst case condition requires that a carry be generated at the lsb position. Since the input carry of the first full adder, Ci,0, is always 0, this means both A0 and B0 must equal 1. All the other stages must be in propagate mode. Hence, either Ak or Bk must be high. Finally, we would like to physically measure the delay of a transition on the msb sum bit. Assuming an initial value of 0 for SN−1, we must arrange a 0→1 transition. This is achieved by setting both AN−1 and BN−1 to 0 (or 1), which yields a high sum bit given the incoming carry of 1.

For example, the following values for A and B trigger the worst case delay for an 8-bit addition:

A: 00000001; B: 01111111

²Equation (11.4) assumes that both the delay from the input signals A0 (or B0) to Co,0 for the lsb, and the Ci-to-Co delay for all other bits, equal t_carry.
To set up the worst case delay transition, all the inputs can be kept constant, with A0 undergoing a 0→1 transition.

The left-most bit represents the msb in this binary representation. Observe that this is only one of the many worst case patterns. This case exercises the 0→1 delay of the msb of the final sum. Derive several other cases that exercise the 0→1 and 1→0 transitions.
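The example can be checked with a small behavioral model. The sketch below (Python, not part of the text; the function name is ours) lists the carry-out of every stage of an 8-bit ripple adder for the pattern derived above, showing the carry generated at the lsb propagating through every intermediate stage.

```python
def ripple_carries(a, b, n):
    """Carry-out of each stage of an n-bit ripple-carry adder (lsb first)."""
    carry, carries = 0, []
    for k in range(n):
        ak, bk = (a >> k) & 1, (b >> k) & 1
        # Co = G + P*Ci per stage (Eq. 11.3)
        carry = (ak & bk) | ((ak ^ bk) & carry)
        carries.append(carry)
    return carries

# Worst case pattern from the example (the text writes the msb on the left):
A, B = 0b00000001, 0b01111111
print(ripple_carries(A, B, 8))  # -> [1, 1, 1, 1, 1, 1, 1, 0]
```

The carry is generated at bit 0, propagates through bits 1 to 6, and is consumed at the msb, where the sum bit makes the desired 0→1 transition.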

Two important conclusions can be drawn from Eq. (11.4):

• The propagation delay of the ripple-carry adder is linearly proportional to N. This property becomes increasingly important when designing adders for the wide datapaths (N = 16...128) that are desirable in current and future computers.
• When designing the full-adder cell for a fast ripple-carry adder, it is far more important to optimize t_carry than t_sum, since the latter has only a minor influence on the total value of t_adder.

Before starting an in-depth discussion on the circuit design of full-adder cells, the following additional logic property of the full adder is worth mentioning:

Inverting all inputs to a full adder results in inverted values for all outputs.

This property, also called the inverting property, is expressed in the pair of equations

S̄(A, B, Ci) = S(Ā, B̄, C̄i)          (11.5)
C̄o(A, B, Ci) = Co(Ā, B̄, C̄i)

and will be extremely useful when optimizing the speed of the ripple-carry adder. It states that the circuits of Figure 11-3 are identical.
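The inverting property can also be verified exhaustively. The short check below (Python, ours, not from the text) complements all inputs of the Eq. (11.1) full adder and confirms that both outputs come out complemented, per Eq. (11.5).

```python
from itertools import product

def fa(a, b, ci):
    """Full adder per Eq. (11.1)."""
    return a ^ b ^ ci, (a & b) | (b & ci) | (a & ci)

for a, b, ci in product((0, 1), repeat=3):
    s, co = fa(a, b, ci)
    s_inv, co_inv = fa(1 - a, 1 - b, 1 - ci)
    # Inverting all inputs inverts all outputs (Eq. 11.5)
    assert (s_inv, co_inv) == (1 - s, 1 - co)
print("inverting property holds")
```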

11.3.2 The Full Adder: Circuit Design Considerations

Static Adder Circuit
One way to implement the full-adder circuit is to take the logic equations of Eq. (11.1) and translate them directly into complementary CMOS circuitry. Some logic manipulations can help to reduce the transistor count. For instance, it is advantageous to share some logic between the

Figure 11-3 Inverting property of the full adder. The circles in the schematics represent inverters.
Figure 11-4 Complementary static CMOS implementation of full adder.


sum- and carry-generation subcircuits, as long as this does not slow down the carry generation, which is the most critical part, as stated previously. The following is an example of such a reorganized equation set:

Co = A·B + B·Ci + A·Ci
and          (11.6)
S = A·B·Ci + C̄o·(A + B + Ci)

The equivalence with the original equations is easily verified. The corresponding adder design, using complementary static CMOS, is shown in Figure 11-4 and requires 28 transistors. In addition to consuming a large area, this circuit is slow:

• Tall PMOS transistor stacks are present in both carry- and sum-generation circuits.
• The intrinsic load capacitance of the C̄o signal is large and consists of two diffusion and six gate capacitances, plus the wiring capacitance.
• The signal propagates through two inverting stages in the carry-generation circuit. As mentioned earlier, minimizing the carry-path delay is the prime goal of the designer of high-speed adder circuits. Given the small load (fan-out) at the output of the carry chain, having two logic stages is too high a number, and leads to extra delay.
• The sum generation requires one extra logic stage, but that is not that important, since this factor appears only once in the propagation delay of the ripple-carry adder of Eq. (11.4).
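The claimed equivalence of the reorganized set of Eq. (11.6) with Eq. (11.1) is easy to confirm by brute force. The fragment below (Python, illustrative only) checks all eight input combinations.

```python
from itertools import product

for a, b, ci in product((0, 1), repeat=3):
    # Original expressions, Eq. (11.1)
    co = (a & b) | (b & ci) | (a & ci)
    s = a ^ b ^ ci
    # Reorganized set, Eq. (11.6): S = A*B*Ci + not(Co)*(A + B + Ci)
    co_shared = (a & b) | (b & ci) | (a & ci)
    s_shared = (a & b & ci) | ((1 - co_shared) & (a | b | ci))
    assert (s_shared, co_shared) == (s, co)
print("Eq. (11.6) is equivalent to Eq. (11.1)")
```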
Figure 11-5 Inverter elimination in the carry path. FA' stands for a full adder without the inverter in the carry path.

Although slow, the circuit includes some smart design tricks. Notice that the first gate of the carry-generation circuit is designed with the Ci signal on the smaller PMOS stack, lowering its logical effort to 2. Also, the NMOS and PMOS transistors connected to Ci are placed as close as possible to the output of the gate. This is a direct application of a circuit-optimization technique discussed in Section 4.2: transistors on the critical path should be placed as close as possible to the output of the gate. For instance, in stage k of the adder, signals Ak and Bk are available and stable long before Ci,k (= Co,k−1) arrives after rippling through the previous stages. In this way, the capacitances of the internal nodes in the transistor chain are precharged or discharged in advance. On arrival of Ci,k, only the capacitance of node X has to be (dis)charged. Putting the Ci,k transistors closer to VDD and GND would require not only the (dis)charging of the capacitance of node X, but also of the internal capacitances.

The speed of this circuit can now be improved gradually by using some of the adder properties discussed in the previous section. First, the number of inverting stages in the carry path can be reduced by exploiting the inverting property: inverting all the inputs of a full-adder cell also inverts all the outputs. This rule allows us to eliminate an inverter in a carry chain, as demonstrated in Figure 11-5.
Mirror Adder Design
An improved adder circuit, also called the mirror adder, is shown in Figure 11-6 [Weste93]. Its operation is based on Eq. (11.3). The carry-generation circuitry is worth analyzing. First, the carry-inverting gate is eliminated, as suggested in the previous section. Second, the PDN and PUN networks of the gate are not dual. Instead, they form a clever implementation of the propagate/generate/delete function: when either D or G is high, C̄o is set to VDD or GND, respectively. When the conditions for a propagate are valid (or P is 1), the incoming carry is propagated (in inverted format) to C̄o. This results in a considerable reduction in both area and delay. The analysis of the sum circuitry is left to the reader. The following observations are worth considering:

• This full-adder cell requires only 24 transistors.
• The P = A + B definition of the propagate signal is used here.


11.3.3 The Binary Adder: Logic Design Considerations

The ripple-carry adder is only practical for the implementation of additions with a relatively small word length. Most desktop computers use word lengths of 32 bits, while servers require 64; larger computers, such as mainframes, supercomputers, or multimedia processors (e.g., the Sony Playstation 2 [Suzuoki99]), require word lengths of up to 128 bits. The linear dependence of the adder speed on the number of bits makes the usage of ripple adders rather impractical. Logic optimizations are therefore necessary, resulting in adders with t_p < O(N). We briefly discuss a number of those in the sections that follow. We concentrate on the circuit design implications, since most of the presented structures are well known from the traditional logic design literature.
The Carry-Bypass Adder

Consider the four-bit adder block of Figure 11-12a. Suppose that the values of Ak and Bk (k = 0...3) are such that all propagate signals Pk (k = 0...3) are high. An incoming carry Ci,0 = 1 then propagates under those conditions through the complete adder chain and causes an outgoing carry Co,3 = 1. In other words,

if (P0·P1·P2·P3 = 1) then Co,3 = Ci,0          (11.8)
else either DELETE or GENERATE occurred

This information can be used to speed up the operation of the adder, as shown in Figure 11-12b. When BP = P0·P1·P2·P3 = 1, the incoming carry is forwarded immediately to the next block through the bypass transistor Mb; hence the names carry-bypass adder and carry-skip adder [Lehman62]. If this is not the case, the carry is obtained by way of the normal route.

Figure 11-12 Carry-bypass structure: basic concept. (a) Carry propagation. (b) Adding a bypass.



Example 11.3 Carry Bypass in a Manchester Carry-Chain Adder

Figure 11-13 shows the possible carry-propagation paths when the full-adder circuit is implemented in Manchester-carry style. The picture demonstrates how the bypass speeds up the addition: The carry propagates either through the bypass path, or a carry is generated somewhere in the chain. In both cases, the delay is smaller than in the normal ripple configuration. The area overhead incurred by adding the bypass path is small and typically ranges between 10 and 20%. However, adding the bypass path breaks the regular bit-slice structure (as was present in Figure 11-10).

Figure 11-13 Manchester carry-chain implementation of the bypass adder.

Let us now compute the delay of an N-bit adder. At first, we assume that the total adder is divided into (N/M) equal-length bypass stages, each of which contains M bits. An approximating expression for the total propagation time can be derived from Figure 11-14 and is given in Eq. (11.9). Namely,

t_p = t_setup + M·t_carry + (N/M − 1)·t_bypass + (M − 1)·t_carry + t_sum          (11.9)

Figure 11-14 (N = 16) carry-bypass adder: composition. The worst case delay path is shaded in gray.

with the composing parameters defined as follows:

• t_setup: the fixed overhead time to create the generate and propagate signals.
• t_carry: the propagation delay through a single bit. The worst case carry-propagation delay through a single stage of M bits is approximately M times larger.
• t_bypass: the propagation delay through the bypass multiplexer of a single stage.
• t_sum: the time to generate the sum of the final stage.

The critical path is shaded in gray on the block diagram of Figure 11-14. From Eq. (11.9), it follows that t_p is still linear in the number of bits N, since in the worst case, the carry is generated at the first bit position, ripples through the first block, skips around (N/M − 2) bypass stages, and is consumed at the last bit position without generating an output carry. The optimal number of bits per skip block is determined by technological parameters such as the extra delay of the bypass-selecting multiplexer, the buffering requirements in the carry chain, and the ratio of the delay through the ripple and the bypass paths.

Although still linear, the slope of the delay function increases in a more gradual fashion than for the ripple-carry adder, as pictured in Figure 11-15. This difference is substantial for larger adders. Notice that the ripple adder is actually faster for small values of N, for which the overhead of the extra bypass multiplexer makes the bypass structure uninteresting. The crossover point depends upon technology considerations and is normally situated between four and eight bits.
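The first-order models of Eqs. (11.4) and (11.9) can be put side by side numerically. The sketch below (Python; the unit delays are our assumption, in the spirit of Problem 11.1) evaluates both expressions for a few word lengths with M = 4 bits per bypass stage.

```python
def t_ripple(n, t_carry=1, t_sum=1):
    """Eq. (11.4): t_adder ~= (N - 1) * t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum

def t_carry_bypass(n, m, t_setup=1, t_carry=1, t_byp=1, t_sum=1):
    """Eq. (11.9): setup, ripple through the first block, (N/M - 1) skips,
    ripple through the last block, and the final sum."""
    return (t_setup + m * t_carry + (n // m - 1) * t_byp
            + (m - 1) * t_carry + t_sum)

for n in (8, 16, 32, 64):
    print(n, t_ripple(n), t_carry_bypass(n, 4))
```

With these assumed unit delays, the bypass adder overtakes the ripple adder between N = 8 and N = 16; the exact crossover in silicon depends on the technology parameters mentioned above.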

Problem 11.1 Delay of Carry-Skip Adder

Determine an input pattern that triggers the worst case delay in a 16-bit (4 × 4) carry-bypass adder. Assuming that t_carry = t_setup = t_skip = t_sum = 1, determine the delay and compare it with that of a normal ripple adder.

Figure 11-15 Propagation delay of ripple-carry versus carry-bypass adder.
The Linear Carry-Select Adder

In a ripple-carry adder, every full-adder cell has to wait for the incoming carry before an outgoing carry can be generated. One way to get around this linear dependency is to anticipate both possible values of the carry input and evaluate the result for both possibilities in advance. Once the real value of the incoming carry is known, the correct result is easily selected with a simple multiplexer stage. An implementation of this idea, appropriately called the carry-select adder [Bedrij62], is demonstrated in Figure 11-16. Consider the block of adders, which is adding bits k to k + 3. Instead of waiting on the arrival of the output carry of bit k − 1, both the 0 and 1 possibilities are analyzed. From a circuit point of view, this means that two carry paths are implemented. When Co,k−1 finally settles, either the result of the 0 path or the 1 path is selected by the multiplexer, which can be performed with a minimal delay. As is evident from Figure 11-16, the hardware overhead of the carry-select adder is restricted to an additional carry path and a multiplexer, and equals about 30% with respect to a ripple-carry structure.

A full carry-select adder is now constructed by chaining a number of equal-length adder stages, as in the carry-bypass approach (see Figure 11-17). The critical path is shaded in gray. From inspection of the circuit, we can derive a first-order model of the worst case propagation delay of the module, written as

t_add = t_setup + M·t_carry + (N/M)·t_mux + t_sum          (11.10)

where t_setup, t_sum, and t_mux are fixed delays, and N and M represent the total number of bits and the number of bits per stage, respectively. t_carry is the delay of the carry through a single full-adder cell. The carry delay through a single block is proportional to the length of that stage, or equals M·t_carry.

The propagation delay of the adder is, again, linearly proportional to N (Eq. (11.10)). The reason for this linear behavior is that the block-select signal that selects between the 0 and 1 solutions still has to ripple through all stages in the worst case.
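Eq. (11.10) is simple to evaluate. The fragment below (Python; unit delays assumed, as in Problem 11.2) computes the model delay for the 16-bit, 4-bit-per-stage configuration of Figure 11-17.

```python
def t_select_linear(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Eq. (11.10): t_add = t_setup + M*t_carry + (N/M)*t_mux + t_sum."""
    return t_setup + m * t_carry + (n // m) * t_mux + t_sum

print(t_select_linear(16, 4))  # -> 10
```

With unit delays the model gives 1 + 4 + 4 + 1 = 10, matching the worst case arrival time annotated at the final sum in Figure 11-18a.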

Figure 11-16 Four-bit carry-select module: topology.
Figure 11-17 Sixteen-bit, linear carry-select adder. The critical path is shaded in gray.
Problem 11.2 Linear Carry-Select Delay

Determine the delay of a 16-bit linear carry-select adder by using unit delays for all cells. Compare the result with that of Problem 11.1. Compare various block configurations as well.
The Square-Root Carry-Select Adder

The next structure illustrates how an alert designer can make a major impact. To optimize a design, it is essential to locate the critical timing path first. Consider the case of a 16-bit linear carry-select adder. To simplify the discussion, assume that the full-adder and multiplexer cells have identical propagation delays equal to a normalized value of 1. The worst case arrival times of the signals at the different network nodes, with respect to the time the input is applied, are marked and annotated on Figure 11-18a. This analysis demonstrates that the critical path of the adder ripples through the multiplexer networks of the subsequent stages.

One striking opportunity is readily apparent. Consider the multiplexer gate in the last adder stage. The inputs to this multiplexer are the two carry chains of the block and the block-multiplexer signal from the previous stage. A major mismatch between the arrival times of these signals can be observed: The results of the carry chains are stable long before the multiplexer signal arrives. It makes sense to equalize the delay through both paths. This can be achieved by progressively adding more bits to the subsequent stages in the adder, requiring more time for the generation of the carry signals. For example, the first stage can add 2 bits, the second contains 3, the third has 4, and so forth, as demonstrated in Figure 11-18b. The annotated arrival times show that this adder topology is faster than the linear organization, even though an extra stage is needed. In fact, the same propagation delay is also valid for a 20-bit adder. Observe that the discrepancy in arrival times at the multiplexer nodes has been eliminated.

In effect, the simple trick of making the adder stages progressively longer results in an adder structure with sublinear delay characteristics. This is illustrated by the following analysis:
Figure 11-18 Worst case signal arrival times in carry-select adders. The signal arrival times are marked in parentheses. (a) Linear configuration. (b) Square-root configuration.
Assume that an N-bit adder contains P stages, and that the first stage adds M bits. An additional bit is added to each subsequent stage. The following relation then holds:

N = M + (M + 1) + (M + 2) + ... + (M + P − 1) = M·P + P·(P − 1)/2          (11.11)

If M << N (e.g., M = 2 and N = 64), the quadratic term dominates, and Eq. (11.11) can be simplified to

N ≈ P²/2          (11.12)
Figure 11-19 Propagation delay of square-root carry-select adder versus linear ripple and select adders. The unit delay model is used to model the cell delays.

or

P ≈ √(2N)          (11.13)

Equation (11.13) can be used to express t_add as a function of N by rewriting Eq. (11.10):

t_add = t_setup + M·t_carry + √(2N)·t_mux + t_sum          (11.14)

The delay is proportional to √N for large adders (N >> M), or t_add = O(√N). This square-root relation has a major impact, which is illustrated in Figure 11-19, where the delays of both the linear and square-root select adders are plotted as a function of N. It can be observed that for large values of N, t_add becomes almost a constant.
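The square-root organization can also be sketched numerically. The fragment below (Python, our construction) builds the progressive stage profile of Figure 11-18b and evaluates the Eq. (11.14) delay model next to the linear model of Eq. (11.10) for a few word lengths, again with unit delays assumed.

```python
import math

def sqrt_select_stages(n, m=2):
    """Stage sizes M, M+1, M+2, ... until n bits are covered (last stage trimmed)."""
    stages, size = [], m
    while sum(stages) < n:
        stages.append(min(size, n - sum(stages)))
        size += 1
    return stages

def t_select_sqrt(n, m=2, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Eq. (11.14): t_add ~= t_setup + M*t_carry + sqrt(2N)*t_mux + t_sum."""
    return t_setup + m * t_carry + math.sqrt(2 * n) * t_mux + t_sum

def t_select_lin(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Eq. (11.10), for comparison."""
    return t_setup + m * t_carry + (n // m) * t_mux + t_sum

print(sqrt_select_stages(20))  # -> [2, 3, 4, 5, 6]: the 20-bit profile of Figure 11-18b
for n in (16, 64, 128):
    print(n, t_select_lin(n, 4), round(t_select_sqrt(n), 1))
```

The gap between the two models widens quickly with N, mirroring the curves of Figure 11-19.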

Problem 11.3 Unequal Bypass Groups in Carry-Bypass Adder


A careful reader
might be interested in applying the previous technique to carry-bypass adders. We saw ear-
lier that their delay is a linear function of a number of bits. Can they be modified to achieve better than lin-
ear delay by using variable group sizes?
It does make sense to make the consecutive
groups gradually larger. However, the
cary-select adders does not directly apply to this case, and a progressive increase in stagetechnique
used in
sizes eventually
increases the delay. Consider a carry-bypass adder in which the last stage is the
that
largest: The carry signal
propagates through that and
stage consumed at the msb
gets position (with no chance of bypassing it) is
on the critical path for the sum generation. Increasing the size of the last group does not
help the problem.
Based this discussion and assuming constant delays
on
tor carry and bypass gates, sketch the
of the cary bypass network that achieves a delay that is better than linear. profile
These adders require number-encoding and decoding steps, whose delay is a function of N. Therefore, they are only interesting when embedded in larger structures, such as multipliers or high-speed signal processors [Swartzlander90].

11.4 The Multiplier

Multiplications are expensive and slow operations. The performance of many computational problems often is dominated by the speed at which a multiplication operation can be executed. This observation has, for instance, prompted the integration of complete multiplication units in state-of-the-art digital signal processors and microprocessors.

Multipliers are, in effect, complex adder arrays. Therefore, the majority of the topics discussed in the preceding section are of value in this context as well. The analysis of the multiplier gives us some further insight into how to optimize the performance (or the area) of complex circuit topologies. After a short discussion of the multiply operation, we discuss the basic array multiplier. We also discuss different approaches to partial-product generation, partial-product accumulation, and their final summation.

11.4.1 The Multiplier: Definitions

Consider two unsigned binary numbers X and Y that are M and N bits wide, respectively. To introduce the multiplication operation, it is useful to express X and Y in the binary representation

X = Σ(i = 0 to M−1) Xi·2^i,    Y = Σ(j = 0 to N−1) Yj·2^j          (11.21)

with Xi, Yj ∈ {0, 1}. The multiplication operation is then defined as follows:

Z = X × Y = Σ(k = 0 to M+N−1) Zk·2^k
          = (Σ(i = 0 to M−1) Xi·2^i)·(Σ(j = 0 to N−1) Yj·2^j) = Σ(i = 0 to M−1) Σ(j = 0 to N−1) Xi·Yj·2^(i+j)          (11.22)
The simplest way to perform a multiplication is to use a single two-input adder. For inputs that are M and N bits wide, the multiplication takes M cycles, using an N-bit adder. This shift-and-add algorithm for multiplication adds together M partial products. Each partial product is generated by multiplying the multiplicand with a bit of the multiplier (which, essentially, is an AND operation) and by shifting the result on the basis of the multiplier bit's position.
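The shift-and-add algorithm maps directly onto a few lines of code. The sketch below (Python, ours, not from the text) generates each partial product with an AND-style select and accumulates it with the appropriate shift, reproducing the 42 × 11 = 462 result worked out in Figure 11-28.

```python
def shift_and_add_multiply(x, y, n_bits):
    """Unsigned shift-and-add multiplication: one partial product per multiplier bit."""
    product = 0
    for j in range(n_bits):
        bit = (y >> j) & 1
        partial = x if bit else 0   # AND of the multiplicand with multiplier bit j
        product += partial << j     # shift by the bit position, then accumulate
    return product

print(shift_and_add_multiply(0b101010, 0b1011, 4))  # -> 462
```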
A faster way to implement multiplication is to resort to an approach similar to manually computing a multiplication. All the partial products are generated at the same time and organized in an array. A multioperand addition is applied to compute the final product. The approach is illustrated in Figure 11-28. This set of operations can be mapped directly into hardware. The
          1 0 1 0 1 0    Multiplicand
        ×       1 0 1 1    Multiplier
          1 0 1 0 1 0
        1 0 1 0 1 0        Partial products
      0 0 0 0 0 0
  + 1 0 1 0 1 0
    1 1 1 0 0 1 1 1 0    Result

Figure 11-28 Binary multiplication: an example.

resulting structure is called an array multiplier and combines the following three functions: partial-product generation, partial-product accumulation, and final addition.

Problem 11.6 Multiplication by a Constant

It is often necessary to implement multiplication of an operand by a constant. Describe how you would proceed with this task.

11.4.2 Partial-Product Generation

Partial products result from the logical AND of multiplicand X with a multiplier bit Y_i (see
Figure 11-29). Each row in the partial-product array is either a copy of the multiplicand or a
row of zeroes. Careful optimization of the partial-product generation can lead to some substan-
tial delay and area reductions. Note that in most cases the partial-product array has many zero
rows that have no impact on the result and thus represent a waste of effort when added. In the
case of a multiplier consisting of all ones, all the partial products exist, while in the case of all
zeros, there are none. This observation allows us to reduce the number of generated partial prod-
ucts by half.

Assume, for example, an eight-bit multiplier of the form 01111110, which produces six
nonzero partial-product rows. One can substantially reduce the number of nonzero rows by
recoding this number (2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1) into a different format. The reader can
verify that the form 1000001̄0, with 1̄ a shorthand notation for -1 (that is, 2^7 - 2^1), represents the
same number.
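This equivalence is easy to check mechanically. The short sketch below (an illustration of ours, not from the text) evaluates a signed-digit string positionally, exactly like ordinary binary evaluation but allowing the digit -1:

```python
# Evaluate a signed-digit string (msb first) with digits in {1, 0, -1}.
def signed_digit_value(digits):
    value = 0
    for d in digits:
        value = 2 * value + d   # shift left, then add the signed digit
    return value

# 01111110 and its recoded form 1 0 0 0 0 0 -1 0 denote the same number.
assert signed_digit_value([0, 1, 1, 1, 1, 1, 1, 0]) == 126
assert signed_digit_value([1, 0, 0, 0, 0, 0, -1, 0]) == 126
```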

[Figure: AND of multiplicand bits X_3 ... X_0 with multiplier bit Y produces partial-product bits PP_3 ... PP_0]

Figure 11-29 Partial-product generation logic.
Using this format, we have to add only two partial products, but the final adder has to be able to per-
form subtraction as well. This type of transformation is called Booth's recoding [Booth51], and
it reduces the number of partial products to, at most, one half. It ensures that for every two con-
secutive bits, at most one bit will be 1 or -1. Reducing the number of partial products is equiva-
lent to reducing the number of additions, which leads to a speedup as well as an area reduction.
Formally, this transformation is equivalent to formatting the multiplier word into a base-4
scheme, instead of the usual binary format:

    Y = Σ_{j=0}^{(N-1)/2} Y_j · 4^j    with Y_j ∈ {-2, -1, 0, 1, 2}        (11.23)

Note that 1010...10 represents the worst case multiplier input, because it generates the most par-
tial products (one half). While the multiplication with {0, 1} is equivalent to an AND operation,
multiplying with {-2, -1, 0, 1, 2} requires a combination of inversion and shift logic. The encod-
ing can be performed on the fly and requires some simple logic gates.
Having a variable-size partial-product array is not practical for multiplier design, and a
modified Booth's recoding is most often used instead [MacSorley61]. The multiplier is parti-
tioned into three-bit groups that overlap by one bit. Each group of three is recoded, as shown in
Table 11-2, and forms one partial product. The resulting number of partial products equals half
of the multiplier width. The input bits to the recoding process are the two current bits, combined
with the upper bit from the next group, moving from msb to lsb.

Table 11-2 Modified Booth's recoding: partial-product selection table

    Multiplier Bits    Recoded Bits
    000                0
    001                +Multiplicand
    010                +Multiplicand
    011                +2 × Multiplicand
    100                -2 × Multiplicand
    101                -Multiplicand
    110                -Multiplicand
    111                0
In simple terms, the modified Booth's recoding essentially examines the multiplier for strings of
consecutive ones from msb to lsb and replaces them with a leading 1 and a -1 at the end of the
string. For example, 011 is understood as the beginning of a string of ones and is therefore
replaced by a leading 1 (or 100), while 110 is seen as the end of a string and is replaced by a -1
at the least significant position (or 01̄0).

Example 11.5 Modified Booth's Recoding

Consider the eight-bit binary number 01111110 shown earlier. This can be divided into
four overlapping groups of three bits, going from msb to lsb: 01 (1), 11 (1), 11 (1), 10 (0),
where the bit in parentheses is the upper bit borrowed from the next group.
Recoding by using Table 11-2 yields: 10 (2 ×), 00 (0 ×), 00 (0 ×), 1̄0 (-2 ×), or, in com-
bined format, 1000001̄0. This is equivalent to the result we obtained before.
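The grouping and table lookup of this example can be sketched in a few lines (an illustration of ours, with hypothetical helper names; the multiplier is treated as an n-bit two's-complement word, which coincides with the unsigned reading here because the msb is 0):

```python
# Radix-4 modified Booth recoding, following Table 11-2.
# Key: (y_{2j+1}, y_{2j}, y_{2j-1}) -> recoded digit (x Multiplicand).
TABLE_11_2 = {
    (0, 0, 0):  0, (0, 0, 1): +1, (0, 1, 0): +1, (0, 1, 1): +2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1):  0,
}

def booth_digits(y, n_bits):
    """Recode an n_bits-wide multiplier into n_bits/2 radix-4 digits,
    least significant digit first. A zero is appended below the lsb."""
    bits = [0] + [(y >> i) & 1 for i in range(n_bits)]   # bits[0] = y_{-1}
    return [TABLE_11_2[(bits[2*j + 2], bits[2*j + 1], bits[2*j])]
            for j in range(n_bits // 2)]

digits = booth_digits(0b01111110, 8)
assert digits == [-2, 0, 0, 2]                  # lsb-first: 10 00 00 (-2)0
# Eq. (11.23): the digits reconstruct the multiplier in base 4.
assert sum(d * 4**j for j, d in enumerate(digits)) == 0b01111110
```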

Problem 11.7 Booth's Recoder

Design the combinational logic that implements a modified Booth's recoding for a parallel multiplier, using
Table 11-2. Compare its implementation in complementary and pass-transistor CMOS.

11.4.3 Partial-Product Accumulation


After the partial products are generated, they must be summed. This accumulation is essentially
a multioperand addition. A straightforward way to accumulate partial products is by using a
number of adders that will form an array-hence, the name, array multiplier. A more sophisti-
cated procedure performs the addition in a tree format.

The Array Multiplier


The composition of an array multiplier is shown in Figure 11-30. There is a one-to-one topolog-
ical correspondence between this hardware structure and the manual multiplication shown in
Figure 11-28. The generation of N partial products requires N × M two-bit AND gates (in the
style of Figure 11-29). Most of the area of the multiplier is devoted to the adding of the N par-
tial products, which requires N - 1 M-bit adders. The shifting of the partial products for their
proper alignment is performed by simple routing and does not require any logic. The overall
structure can easily be compacted into a rectangle, resulting in a very efficient layout.
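A behavioral sketch of this organization is given below (our simplification, not the exact cell placement of Figure 11-30): each partial-product row is formed by AND gates, shifted by wiring, and folded into the running sum by a ripple-carry row of full-adder cells.

```python
# Bit-level array multiplier built from AND gates and full-adder cells.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(x_bits, y_bits):
    """Ripple-carry addition of two equal-length lsb-first bit vectors."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

def array_multiplier(x, y, m, n):
    """Multiply an m-bit x by an n-bit y, one adder row per multiplier bit."""
    x_bits = [(x >> i) & 1 for i in range(m)]
    acc = [0] * (m + n)                          # running partial sum
    for i in range(n):
        yi = (y >> i) & 1                        # AND gates form the row;
        pp = [0] * i + [b & yi for b in x_bits] + [0] * (n - i)  # routing shifts it
        acc = ripple_add(acc, pp)[:m + n]
    return sum(b << i for i, b in enumerate(acc))

assert array_multiplier(0b1010, 0b1011, 4, 4) == 10 * 11
```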
Due to the array organization, determining the propagation delay of this circuit is not
straightforward. Consider the implementation of Figure 11-30.⁵ The partial-sum adders are
implemented as ripple-carry structures. Performance optimization requires that the critical tim-
ing path be identified first. This turns out to be nontrivial. In fact, a large number of paths of
almost identical length can be identified. Two of those are highlighted in Figure 11-31. A closer

⁵This particular implementation does not employ Booth's recoding. Adding recoding does not substantially change the
implementation. The number of adders is reduced by half, and the partial-product generation is slightly more complex.

[Figure: rows of AND gates on the X and Y inputs feeding an array of HA and FA cells, producing the product bits Z]

Figure 11-30 A 4 × 4 bit-array multiplier for unsigned numbers-composition. HA stands for a
half adder, or an adder cell with only two inputs. The hardware for the generation and addition of
one partial product is shaded in gray.

look at those critical paths yields an approximate expression for the propagation delay (derived
here for critical path 2). We write this as

    t_mult = [(M - 1) + (N - 2)] t_carry + (N - 1) t_sum + t_and        (11.24)

where t_carry is the propagation delay between input and output carry, t_sum is the delay between the
input carry and sum bit of the full adder, and t_and is the delay of the AND gate.
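Equation (11.24) gives a quick first-order estimate of how the delay scales with operand width. In the sketch below, the unit gate delays are illustrative placeholders, not values from the text:

```python
# First-order array-multiplier delay model of Eq. (11.24).
def t_mult(M, N, t_carry, t_sum, t_and):
    return ((M - 1) + (N - 2)) * t_carry + (N - 1) * t_sum + t_and

# A 4 x 4 array with assumed unit gate delays:
assert t_mult(4, 4, 1.0, 1.0, 1.0) == 9.0       # (3 + 2) + 3 + 1
# Doubling both operand widths roughly doubles the delay (linear in M + N):
assert t_mult(8, 8, 1.0, 1.0, 1.0) == 21.0      # (7 + 6) + 7 + 1
```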
Since all critical paths have the same length, speeding up just one of them-for instance,
by replacing one adder by a faster one such as a carry-select adder-does not make much sense
from a design standpoint. All critical paths have to be attacked at the same time. From
Eq. (11.24), it can be deduced that the minimization of t_mult requires the minimization of both t_carry

[Figure: the adder array with Critical Path 1 and Critical Path 2 marked]

Figure 11-31 Ripple-carry based 4 × 4 multiplier (simplified diagram). Two of the
possible critical paths are highlighted.
