CMOS Cell Libraries For Minimal Leakage Power
Master’s Thesis
by
Project number: 55
Informatics and Mathematical Modelling
Computer Science and Engineering
Technical University of Denmark
Preface
This report is part of the results from the master’s thesis project ’Design of CMOS Cell
Libraries for Minimal Leakage Currents’ conducted at Informatics and Mathematical Mod-
elling (IMM), Computer Science and Engineering division (CSE), Technical University of
Denmark (DTU) from February to August 2004.
This project was conducted as one of three independent, but collaborative, master’s
theses. The original idea for this work was conceived by Peter Østergaard Nielsen from
Vitesse Semiconductor Corporation, Denmark.
I would like to thank my colleagues Martin Hans and Michael Kristensen for inspiring
cooperation. Further I would like to thank Alberto Nannarelli for valuable insights and the
administrative staff of IMM for helping me speed up the project work.
Abstract
Leakage due to scaling down CMOS device sizes will be the major power consumption
source in cell based IC design in a few years. This work addresses the problem of this
leakage, investigating the possibilities of utilizing alternative logic families instead of static
CMOS for the creation of a low leakage cell library. For this purpose, MTCMOS, CPL and
Domino logic are investigated for leakage characteristics and are found unusable for low
leakage design.
Using cell libraries of small logic cells for IC design is found to be the major reason for
much of the leakage. Synthesizing without cell boundaries by building larger cells reduces
the leakage problem greatly. A new synthesis flow and cell library is proposed.
Keywords: Low leakage CMOS, CPL, Domino, MTCMOS, MacroCMOS, Synthesis for
low leakage design.
Resumé
1 Introduction
1.1 Invention of MOSFET transistors
1.2 Synthesis of cell based designs
1.3 The problem of leakage currents
1.4 Possible solutions
1.5 Objectives for this work
1.6 Overview of the report
6 Evaluation of Logic Families
6.1 Static CMOS
6.2 Cutting off power supply
6.3 Complementary pass-transistor logic
6.4 Domino logic
6.5 MacroCMOS
7 Discussion of Results
7.1 Results
7.2 The chosen candidate for cell library implementation
A Project Description
Bibliography
CHAPTER 1
INTRODUCTION
Contents
1.1 Invention of MOSFET transistors
1.2 Synthesis of cell based designs
1.2.1 Cell libraries
1.3 The problem of leakage currents
1.4 Possible solutions
1.5 Objectives for this work
1.6 Overview of the report
The aim of this chapter is to describe the problem that this work intends to solve. The development
of MOS transistors, synthesis tools and cell libraries is described to introduce the origin of the
leakage current problem. Possible solutions to the leakage current problem are presented forming the
basis for the objectives set in this work. Last, an overview of this report is given.
With this picture in mind, the problem of current day synthesis tools and small static
CMOS cells becomes clear. Using large numbers of small gates containing very few
transistors each is the cause of the problem. This is the manner in which the number of
leaking resistors (transistors) is maximized and the resistance on each path is reduced to
the minimum. The resistance can be increased by using high-Vth low leakage transistors,
but these transistors reduce the speed of the circuitry.
A solution to this problem could be to go back to the decision of selecting static CMOS
as logic family. If devices had been leaking as much two decades ago as they will do within
a few years, small cell static CMOS might not have been selected as the logic family of the
future. Instead other interesting logic families might have prevailed. In this work different
logic families will be discussed and two, Domino logic and Complementary Pass-transistor
Logic, have been selected for closer low leakage evaluation.
Another solution is found in a characteristic of leakage currents: As the leakage power
dissipation is not dependent on activity, but is an ever present power dissipation source,
cutting off power to inactive regions may save quite large fractions of the total power
dissipation. This is especially interesting for applications that operate only a small
fraction of the time. Therefore, this concept is taken under evaluation in this work.
The third solution presented here came through a study of transistor characteristics.
Connecting transistors in series (stacking), which will be shown to decrease leakage con-
siderably, will be proven to be a good solution to the problem. Building larger logic blocks
on-the-fly in the synthesis process and including optimization algorithms for leakage re-
ductions can yield very large savings in the power budget. This approach will require
changes in the synthesis process and complete redesign of current cell libraries. Changes to
the synthesis process of today and a new cell library are proposed in this work.
A fourth, and very well explored, possible solution is to replace all transistors with
high-Vth (low leakage) transistors, which will postpone the leakage problem for quite some
years. This is, however, only possible when adequate time slack is available, since
low-leakage transistors are slower by nature. Therefore, this work is based in the area
where timing requirements are just met, or met by only a fraction of the paths in the
design. This is the setting for this work: Reducing leakage currents where slow, high-Vth
transistors cannot be used, or can only be used to some extent.
[Figure 1.2 diagram, partly recoverable: Presentation of Logic Families (Chapter 4); Logic
Family Evaluation Methods (Chapter 5).]
first three chapters of this report to form the basis for the evaluation work described in the
following chapters.
The flow of this report is depicted in Figure 1.2. This figure will be repeated at the
beginning of each chapter with markings showing the placement of the specific chapter in
the entire report flow. Here follows a short description of the contents of the eight following
chapters.
Chapter 2 introduces the design of cell libraries and the synthesis flow of today and dis-
cusses the future of cell libraries taking the rising problem of leakage current into account.
The contents of cell libraries and the process of cell library design are explored.
Cell library design requires accurate simulation of electrical characteristics of logic cells.
For this purpose chapter 3 gives an investigation of how the power consumption of
integrated circuits is simulated, with special focus on the leakage currents in CMOS designs
as device sizes grow smaller. This chapter also introduces the transistor models and
simulation approaches.
Alternative logic families are presented in chapter 4 which investigates logic families
through a short survey of the characteristics of each logic family in terms of power and
ease of design. Based on this discussion a number of target logic families are selected for
evaluation. How the logic families are evaluated is presented in chapter 5, which also de-
scribes how fair comparisons are achieved between logic families.
Chapter 6 presents the simulation work based on techniques described in chapters 3
and 5, and the results from the work.
Chapter 7 evaluates the results from all simulations and describes why static CMOS is
chosen for the creation of the low leakage cell library. The new cell library is presented in
chapter 8, which also describes the changes in the synthesis flow that are required in
order to use the library. Chapter 9 concludes the work and presents topics for
future work and projects.
The appendices follow hereafter. The contents and numbering of the appendices will
be clarified when referred to in the report.
CHAPTER 2
DESIGN OF CELL LIBRARIES
Contents
2.1 The role of cell libraries
2.2 The contents of cell libraries
2.2.1 Modelling propagation delay in cell libraries
2.2.2 Modelling power dissipation in cell libraries
2.3 Synthesis of cell based designs
2.3.1 The cell library/synthesis tool interface
2.4 Implicit cell library contents
2.4.1 The static CMOS cell library
[Figure 2.1 diagram: an HDL description (process controlB: if a >= 2 then b <= "0010";
c <= "1100"; end if; ...) passes through Synthesis, Placement and Routing.]
Figure 2.1: Synthesis, placement and routing using data from a cell library.
From this description of the roles of the cell library it is evident that the cell library needs
to include the following:
• A compilation of cells including information of: Logic function, area, timing, dynamic
and leakage power consumption
• Wireload models for both synthesis and place & route
• The physical layout of the cells for the place & route tool
• A library of symbols and other graphics for the graphic interfaces of all tools etc.
Since this work is about characterization of logic cells in terms of power consumption
and timing, the term ’cell library’ here refers to the first two points in unison. The STM
180nm DKHCMOS8D[4] cell library available at IMM/DTU will serve as an example of a cell
library.
• Global values such as temperature, unit declarations, and settings for the synthesis
• Wire load models for wires formulated by resistance, capacitance, slope, area and
fanout length
• Wire load selection criteria defining which wire load model to use depending on area
• Templates for propagation delay lookup tables with input net transition and output
capacitance as parameters
• Templates for power dissipation lookup tables with input net transition and output
capacitance as parameters
These values are printed for the synthesis tool to inform the tool under which
assumptions the simulations of the cells have been done, and how the following electrical
specifications of the cells are to be read. The cells follow hereafter. The description of
the cells contains these data:
2.1 All information in this sample has been manipulated in structure and values for copyright protection pur-
poses.
[Figure 2.2 diagram: the total delay Dtotal is the sum of the cell delay Dcell and the wire
delay Dwire.]
Figure 2.2: Total gate delay split into cell and wire delay.
Scalar values:
• Area
• Logic function
• Maximum capacitance
Lookup tables:
With these values the synthesis tool is able to calculate the total area consumption, tim-
ing of the circuit with statistical wire loads, and the power dissipation with random inputs.
Doing place & route and backannotating the design with input value information produces
a realistic picture of whether the timing requirements of the circuit are met, and a reason-
ably good power dissipation prediction.
The delay is modelled as the sum of the cell delay and wire delay (Figure 2.2). The cell
delay is the time from when an input transition reaches 50% of its final value until the
output of the cell has changed to 50% of its final value. This is depicted in the left hand
side of Figure 2.3.
The propagation delay depends on the slope of the input value transition and the total
capacitance on the output. The lookup table for gate delay is therefore a table with
capacitance and input value transition slope as parameters.
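As a minimal sketch of how such a two-parameter table might be read, the following Python fragment interpolates bilinearly between the four nearest table entries. The axis values, the delay values and the names `TRANSITION_NS`, `LOAD_PF`, `DELAY_NS` and `cell_delay` are hypothetical and chosen for illustration only; they are not taken from the DKHCMOS8D library, and a real synthesis tool may interpolate or extrapolate differently.

```python
from bisect import bisect_right

# Hypothetical table axes and values, for illustration only.
TRANSITION_NS = [0.1, 0.5, 1.0]    # input transition time (ns)
LOAD_PF = [0.01, 0.05, 0.10]       # total output capacitance (pF)
DELAY_NS = [                       # DELAY_NS[i][j] belongs to (TRANSITION_NS[i], LOAD_PF[j])
    [0.10, 0.20, 0.30],
    [0.15, 0.25, 0.35],
    [0.20, 0.30, 0.40],
]

def _bracket(axis, value):
    """Return (lo, hi, t): neighbouring indices and the fractional position between them."""
    value = min(max(value, axis[0]), axis[-1])          # clamp to the table range
    hi = min(max(bisect_right(axis, value), 1), len(axis) - 1)
    lo = hi - 1
    return lo, hi, (value - axis[lo]) / (axis[hi] - axis[lo])

def cell_delay(transition_ns, load_pf):
    """Bilinear interpolation of the cell delay between the four nearest table entries."""
    i0, i1, tx = _bracket(TRANSITION_NS, transition_ns)
    j0, j1, ty = _bracket(LOAD_PF, load_pf)
    low = DELAY_NS[i0][j0] * (1 - ty) + DELAY_NS[i0][j1] * ty
    high = DELAY_NS[i1][j0] * (1 - ty) + DELAY_NS[i1][j1] * ty
    return low * (1 - tx) + high * tx
```

In this sketch, values outside the table range are clamped to the nearest edge, which is one possible design choice; the total delay would then be this cell delay plus the wire delay.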
The delay of wires is read from lookup tables with resistance and capacitance as param-
eters, to model what delay that wire causes. A number of wire load models are available
modelling a variety of wire lengths and capacitive loads on these. Statistical area dependent
models are used to evaluate which wire load model is to be used for each wire. Backanno-
tating the real wire length improves the accuracy of the model, and until it is done the delay
models rely only on statistical, and possibly very conservative, wire delay models.
[Figure 2.3 plot: input and output voltage waveforms versus time with the 10%, 50%, 90% and
100% levels marked; caption not recoverable.]
[Figure 2.4 plot: power components Pdyn,int, Pdyn,cap and Pleak versus time.]
Figure 2.4: Power consumption before, during and after an output transition. A cell library
representation.
1. Calculate total output capacitance: Wire capacitance + total gate input capacitance
2. Look up the rise/fall-time of the cell using the calculated output capacitance and the
input transition time as parameters
3. Add wire delay. This is calculated from adding the wire load model to the output
transition
circuit power consumption. Pdyn,cap naturally depends on the capacitive load on the out-
put, which totals the wire load capacitance and the total input capacitance of connected
logic gates.
Denoting the frequency of output signal transitions (the toggle rate) by TR the entire
power model can be expressed in one relation:
h is the input state dependent leakage power function, where vi,j is the j’th input value
to the i’th cell. This value is read from the input state dependent leakage power lookup
table (leakage_power). If input values are unknown the default leakage power value is used.
Eswitch is the switching energy required to change output state due to a transition from
one to another input state. This value is looked up in the rise_power or fall_power lookup
tables. The internal power consumption depends on the input transition time and the total
output capacitance, which are the parameters for the lookup tables.
The last component is Pdyn,cap which depends only on the output capacitance. This
factor is summed into Eswitch for practical reasons.
The calculation of the power consumption follows in three steps for each cell:
1. When an input transition occurs: Determine what output transition the input transition
causes and look up the rise or fall power consumption for that transition
2. Then, look up the leakage power consumption caused by both input vectors and add
an average of these values to the total power consumption
3. If no input transitions occur, just look up the leakage of the cell and add it to the total
power consumption
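The three steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the synthesis tool's implementation: the function name `cell_energy`, the fixed time step `dt` used to turn leakage power into energy, and all numerical values are assumptions made for this example.

```python
def cell_energy(vectors, dt, leakage_table, default_leakage, switch_energy, logic):
    """Accumulate the energy (J) of one cell over a sequence of input vectors.

    vectors         -- list of input tuples, one per time step of length dt (s)
    leakage_table   -- input tuple -> leakage power (W); missing entries use default_leakage
    switch_energy   -- {"rise": J, "fall": J}, standing in for rise_power / fall_power lookups
    logic           -- function mapping an input tuple to the output value (0 or 1)
    """
    def leak(v):
        return leakage_table.get(v, default_leakage)

    total = leak(vectors[0]) * dt            # first step has no earlier vector
    for prev, cur in zip(vectors, vectors[1:]):
        if cur != prev:
            # Step 1: an input transition; add switching energy if the output changes.
            out_prev, out_cur = logic(prev), logic(cur)
            if out_cur != out_prev:
                total += switch_energy["rise" if out_cur else "fall"]
            # Step 2: average the leakage of the old and the new input vector.
            total += 0.5 * (leak(prev) + leak(cur)) * dt
        else:
            # Step 3: no transition, only leakage for the current input state.
            total += leak(cur) * dt
    return total

# Hypothetical two-input NAND cell; the values are placeholders, not library data.
nand2 = lambda v: 0 if all(v) else 1
nand2_leakage = {(0, 0): 1e-9, (0, 1): 2e-9, (1, 0): 3e-9, (1, 1): 4e-9}
nand2_switch = {"rise": 5e-15, "fall": 4e-15}
example = cell_energy([(0, 0), (0, 0), (1, 1), (1, 0)], 1e-9,
                      nand2_leakage, 2.5e-9, nand2_switch, nand2)
```

Passing an empty `leakage_table` makes every lookup fall back to `default_leakage`, which mirrors the case where input states are unknown.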
Leakage power dissipation as a function of input states requires the leakage to be ex-
pressed in lookup tables with input vectors as parameter. The leakage at any moment can
then be expressed as the total sum of leaking gates according to their respective input states.
If input states are unknown, an average value read from the cell library is used.
[Figure 2.5 diagram: the problem Z=A+B is split into sub-problems (A1 + B1, A0 + B0, with
carries C1, C0 and outputs Sum1, Sum0) and then mapped to gates.]
Figure 2.5: A synthesis flow of mapping an abstract problem into logic cells.
Optimization then follows in several steps. Possibly some of the paths through the
increased levels of logic depth are not fast enough and must be compensated by increasing
the drive strength of the gate. If this is still not enough to meet the timing requirements,
logic optimizations must be done to improve speed. Since NAND-gates are typically faster
than AND/OR-gates, NAND-gates replace the AND/OR-gates (3) in the right hand
side of Figure 2.5.
This interface of supplying the synthesis tool with only a limited number of cells clearly
has some disadvantages. First, it cannot contain all logic functions, so smaller cells have
to be cascaded. Secondly, as all cells are not available with inverted/non-inverted inputs,
inverters have to be put in numerous places. This becomes increasingly important as the
number of cells and the logic depth increase.
Thirdly, if a cell is just a bit too slow or too fast, no improvement can be made, and the
synthesis tool has to redesign the logic expression if no slightly faster cell is available. A
fourth reason is that, for low leakage applications, a cell library with a fixed number of
cells is not good either. The possibilities of reducing leakage are hereby limited to
replacing high-speed, high-leakage cells with reduced-speed, low-leakage cells. In many cases
there is not enough time slack for this replacement, and high-leakage cells are therefore
necessary. The limitations of using cell libraries will be further discussed in Chapter 8.
[Figure: histogram of the number of cells (axis 0 to 50) versus the number of inputs (1 to 9);
caption not recoverable.]
Connection of cells: Can combinational logic be built simply by connecting logic cells like
Lego blocks, only taking the timing (sum of propagation delays) into account? Or
do cells alter their electrical characteristics dependent on the characteristics of the
previous logic stage?
Value stability: Can signals be assumed to remain stable in value as long as the cells are fed
with supply power and input values are stable? Or are there dynamic characteristics
of the logic family that prevent this assumption? A notion of drive strength and drive
limitations has to be formulated for each logic family.
Clocking issues: Are cells simple logic functions or do they need a clock signal, requiring
the synthesis tool to build logic considering the timing of the clock for each cell?
These considerations define the way the synthesis tool has to synthesize a given
design to a cell library built on a given logic family. Other considerations are:
Leakage current: Do cells leak the same amount of current with all possible input
combinations, or can power be saved by building the logic utilizing statistical information
in order to keep as many cells as possible in their low leakage state for as long as possible?
Power versus speed: What are the tradeoffs for the given logic family when it comes to
power versus speed? Is high speed and low power impossible to achieve at the same
time? And what does it cost in terms of area to pursue?
These considerations have to be made for the given library of logic cells and the results
built into the synthesis tool cost functions and synthesis operation style.
CHAPTER 3
LEAKAGE CURRENT SIMULATION AND THEORY OF POWER CONSUMPTION
Contents
3.1 Scaling device dimensions
3.2 The effect of device dimension scaling on leakage currents
3.2.1 p-n junction reverse bias current
3.2.2 Subthreshold leakage
3.2.3 Gate leakage
3.3 Leakage current modelling using HSPICE
3.3.1 The Berkeley Predictive Technology Model
3.3.2 Predicting the future with BPTM model cards
3.3.3 Assumptions
3.3.4 Device sizes
3.4 The leakage of logic gates
3.4.1 Stacking of transistors
3.4.2 Leakage as function of input combinations
3.5 Designing for low leakage
The aim of this chapter is to describe the effect of scaling down MOS devices on
the dynamic and leakage power consumption. Projections of the future in terms
of device sizes, supply voltages and power estimations are presented and used to
estimate the magnitude of the leakage problem in the future.
Evaluating the leakage of logic gates is done through simulation with HSPICE.
Transistor model cards used for these simulations are presented, and an intro-
ductory study of the effect of stacking transistors is given. Since stacking will
be shown to have a great effect on the leakage, considerations for utilizing this
and other facts for the design of low leakage gates are presented in the end of this
chapter.
[Figure 3.1 plots: device length (nm, logarithmic axis) and supply voltage (V) versus year,
1990 to 2010.]
Figure 3.1: Projected development in device sizes and supply voltage [9]
have been achieved every two years [6]. To keep power consumption down, supply
voltages have been lowered. Hence, the transistor threshold voltage (Vth) has to be scaled
accordingly to maintain the high drive current and the performance improvement of 30% per
technology generation dictated by Moore’s Observation (footnote 3.1).
Power consumption of integrated circuits has become the major technical problem of
the semiconductor industry. This problem has to be dealt with at all levels to make the
exponential growth in device density possible in the future. So far, large achievements in
reducing the power dissipation have come from voltage scaling and from parallelizing designs
to preserve computational speed. Voltage scaling is very effective due to the quadratic
dependency of the power dissipation on the supply voltage. Total power consumption can be
expressed in this equation[7]:
This equation expresses that the total power dissipation originates from two main sources:
1) Dynamic power dissipation, that includes the charging and discharging of capacitances
and 2) Static power dissipation produced by leaking devices. Dynamic power also includes
switching power dissipation, which is often expressed[8]:
Taking a look at the computational speed versus voltage supply this equation comes in
handy[7]:
\[ f \sim \frac{(V - V_{th})^{\alpha}}{V} \tag{3.3} \]
The term α is an experimentally derived constant, that for current technology is approx-
imately 1.3.
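To get a feeling for relation (3.3), the following small Python sketch evaluates the speed at a scaled supply voltage relative to a reference supply voltage. The function name `relative_speed` and the example voltages (1.8 V supply, 0.45 V threshold) are illustrative assumptions and not values from the simulations in this work.

```python
def relative_speed(v, v_ref, v_th, alpha=1.3):
    """Speed at supply v relative to supply v_ref, using f ~ (V - Vth)^alpha / V."""
    if v <= v_th or v_ref <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def f(x):
        return (x - v_th) ** alpha / x

    return f(v) / f(v_ref)

# Assumed example: halving a 1.8 V supply with an unchanged 0.45 V threshold.
halved = relative_speed(0.9, 1.8, 0.45)
```

With these assumed numbers the relative speed comes out at about 0.48, so the speed is roughly halved when the supply voltage is halved and the threshold voltage is left unchanged.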
Combining equations 3.1 and 3.3, it is evident why voltage scaling is so effective. The
computational speed of a circuit decreases approximately linearly with decreasing voltage,
but the power consumption drops quadratically with decreasing supply voltage. Therefore,
halving the supply voltage and doubling hardware in parallel preserves computational speed
and decreases dynamic power consumption by around 50%. Projected supply voltages and
device sizes are depicted in figure 3.1.
However, leaking devices causing static power consumption have become just as power
hungry as the dynamic sources of power dissipation. Equation 3.1 states that the static
power dissipation depends linearly on the supply voltage, which may lead to the
interpretation that static power consumption puts an end to voltage scaling. This is not
entirely correct, since the term Ileak is exponentially dependent on the supply voltage as
well, which is why voltage scaling and hardware doubling will still work in many cases in the
future for lowering total power consumption[2].
3.1 Moore’s Law is an inaccurate name for the law since it is not a mathematical (or legislative) law at all.
Moore’s Observation, as it is more accurately called in many sources, is based on a survey of the development
of integrated circuits versus time. As this relationship cannot hold forever, Moore’s Law is best called Moore’s
Observation.
[Figure 3.2 plot: normalized total chip power dissipation (logarithmic axis), including the
dynamic power, versus year, 1990 to 2020.]
Figure 3.2: Total chip dynamic and static power dissipation trends assuming doubling of on-chip
devices every two years. Based on the International Technology Roadmap for Semiconductors[10]
and [7]
Yet, as hardware is doubled and devices are leaking, the leakage power dissipation
grows to be the major fraction of the total power dissipation. Figure 3.2 shows projected
dynamic and leakage power dissipation together with projected device sizes. The leakage
component is broken into two contributors:
• Subthreshold leakage, Isubth , which is the drain-source current when the transistor is
in its non-conducting state.
• Gate-oxide leakage, Igate , is the total amount of leakage currents through the gate
oxide due to tunnelling etc.
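[Figures: leakage current components (Isubth, IDG, IGB, IGS) of a transistor between VDD and
VSS, and a transistor cross-section (gate, source, drain, n+ regions, p-well) showing IGate,
I1 and I2; captions not recoverable.]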
get primarily the dynamic power consumption though. The scope here is mainly leakage
currents and only the leakage part of the total power consumption will be discussed. Yet,
when a solution is presented it is discussed whether the solution causes increased dynamic
power consumption.
Considering an n-channel transistor with source connected to ground, Vg < Vth and
drain-source voltage |Vds| ≥ 0.1V, almost the entire voltage drop occurs over the
reverse-biased substrate-drain p-n junction. Under these conditions the electrostatic
potential variations are very small and the electric field formed by the gate is negligible,
causing the number of mobile carriers to be small. In this case the drift component of the
subthreshold drain-to-source current is negligible and subthreshold conduction is dominated
by the diffusion current. The carriers move by diffusion along the surface, causing a current
which is exponentially dependent on the gate voltage.
[Figure 3.5 plot: log ID versus VG (V) for high and low VDD, with St marked.]
Figure 3.5: Drain current versus gate voltage at two different drain voltages.
The weak inversion current can be expressed by:
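\[ I_{ds} = \mu_0 C_{ox} \frac{W}{L} (m - 1)(v_T)^2 \times e^{\frac{V_g - V_{th}}{m v_T}} \times \left(1 - e^{\frac{-v_{DS}}{v_T}}\right) \tag{3.4} \]
where
\[ m = 1 + \frac{C_{dm}}{C_{ox}} = 1 + \frac{\varepsilon_{si}/W_{dm}}{\varepsilon_{ox}/t_{ox}} = 1 + \frac{3 t_{ox}}{W_{dm}} \tag{3.5} \]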
The threshold voltage of the transistor is denoted Vth and the thermal voltage vT =
KT/q. Cox is the gate oxide capacitance and µ0 is the zero bias mobility. K is the Boltzmann
constant, T is the temperature in Kelvin and q is the electron charge. m is the subthreshold
swing coefficient or body effect coefficient for the transistor. Wdm is the maximum width of
the depletion layer and tox is the thickness of the gate oxide. Cdm and Cox are the depletion
layer capacitance and the capacitance of the insulator layer.
From equation (3.4) it can be seen that the subthreshold current is independent of the
drain-source voltage for VDS larger than just a few vT. This seems counter-intuitive, since
one would expect the drain-source voltage to have great impact on the leakage current.
Equation (3.4) does not hold for small devices due to effects such as drain-induced barrier
lowering and body effect, and is merely printed here to show the leakage current's
dependency on gate width, length and gate voltage in longer devices. It confirms that the
leakage grows exponentially with Vg. This dependency is expressed in the subthreshold slope
(St), which describes the inverse slope of the linear part of the log(ID)/VG graph (figure 3.5).
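The long-channel relations (3.4) and (3.5) can be explored numerically with the Python sketch below. It is only an illustration under assumed parameters: the function names, the lumped prefactor argument `mu0_cox_w_over_l` and the example dimensions (2 nm oxide, 20 nm depletion width) are hypothetical. The expression used for the subthreshold slope, St = ln(10) · m · vT, follows directly from the exponential gate-voltage term in (3.4).

```python
import math

K_BOLTZMANN = 1.380649e-23     # J/K
Q_ELECTRON = 1.602176634e-19   # C

def thermal_voltage(temp_k=300.0):
    """Thermal voltage v_T = K*T/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON

def body_coefficient(t_ox, w_dm):
    """m = 1 + 3*t_ox/W_dm, the last form of equation (3.5)."""
    return 1.0 + 3.0 * t_ox / w_dm

def weak_inversion_current(v_g, v_ds, v_th, m, mu0_cox_w_over_l, temp_k=300.0):
    """Equation (3.4); mu0_cox_w_over_l lumps mu0 * Cox * W / L into one assumed prefactor."""
    v_t = thermal_voltage(temp_k)
    return (mu0_cox_w_over_l * (m - 1.0) * v_t ** 2
            * math.exp((v_g - v_th) / (m * v_t))
            * (1.0 - math.exp(-v_ds / v_t)))

def subthreshold_slope(m, temp_k=300.0):
    """Gate voltage change per decade of current, St = ln(10) * m * v_T (V/decade)."""
    return math.log(10.0) * m * thermal_voltage(temp_k)

# Assumed example dimensions, not taken from the model cards used in this work.
m_example = body_coefficient(t_ox=2e-9, w_dm=20e-9)
slope_example = subthreshold_slope(m_example)
```

For the assumed dimensions m is 1.3 and the slope is about 77 mV per decade at 300 K: lowering the gate voltage by that amount reduces the current of this long-channel model by a factor of ten.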
[Figure 3.6 plot: horizontal axis y/L from 0.1 to 0.9.]
Figure 3.6: Energy bands at the surface versus distance normalized to the channel length L from
source to drain. Curve A depicts a long-channel device, curve B a short-channel device. Curve C
represents a short-channel device with high drain bias.
’Vth roll-off’. These effects will not be discussed here, as they are determined by the device
sizes alone and are not altered by reconfiguring transistors.
In long devices the drain and source regions are far enough apart that the electrical field
and depletion regions induced into the device by these regions have no significant impact on
the threshold voltage. Hence the threshold voltage is almost independent of the channel length
and drain bias. In a short-channel device, on the other hand, source and drain depletion
width and source-drain potential have great effect on the energy band bending over a
considerable portion of the device. Threshold voltage and thereby subthreshold currents of
short-channel devices vary with the drain bias. This effect is called drain-induced barrier
lowering (DIBL).
Figure 3.6 depicts three different energy bands near the surface of a long device (A)
and two short devices (B and C). All are driven by a relatively low drain-source voltage
except (C), which is driven by a higher voltage. The threshold voltage equals the maximum
energy level a charge carrier has to achieve to move between the source and drain terminals.
It is evident that decreasing channel lengths reduces the threshold voltage. Increasing
drain voltage causes further Vth lowering in the short-channel device, but does not affect
the long-channel device. This is due to the flatness of the curve in the middle (or the high
slopes near drain and source), which originates from the extension of the non-affected area
under the gate.
Ideally DIBL does not change the St-slope, but it reduces Vth. Higher surface and
channel doping can reduce the DIBL effect [6]. DIBL certainly has to be taken into account when
designing new technologies, as lowering the supply voltage not only slows the circuits down,
but also counters the DIBL effect and raises the threshold voltage, which slows circuits
down further. This is especially important when considering multi-Vth designs [12]. The
effects of DIBL are shown in Figure 3.7.
[Figure 3.7 plot: log ID versus VG (V) for high and low VDD, with the DIBL and GIDL regions
marked; caption not recoverable.]
Devices built from numerous MOS transistors are typically made on a common substrate.
All MOS transistors therefore share the same substrate and hence the same substrate
potential Vsubstrate. Yet, as transistors are connected in series to form gating functions, it
is no longer possible to guarantee the same source potential for all transistors. The
source-to-substrate voltage (VSB) may increase along the chain of transistors when moving
away from VSS. This increase in VSB widens the bulk depletion region and increases
the threshold voltage. This effect is known as the body effect. The threshold voltage can be
expressed as [1, 6]:
\[ V_{th} = V_{fb} + 2\psi_B + \frac{\sqrt{2\varepsilon_{si} q N_a (2\psi_B + V_{sb})}}{C_{ox}} \tag{3.7} \]
where Vfb is the flat band voltage, Na is the doping density in the substrate, and ψB =
(KT/q) ln(Na/ni) is the difference between the Fermi potential and the intrinsic potential
in the substrate. Looking at the dependency of Vth on the source-bulk potential, it is evident
that Vth is more sensitive to Vsb with high bulk doping concentrations. The substrate
sensitivity can be expressed as [6]:
\[ \frac{dV_{th}}{dV_{sb}} = \frac{\sqrt{\varepsilon_{si} q N_a / \left(2(2\psi_B + V_{sb})\right)}}{C_{ox}} \tag{3.8} \]
At zero Vsb the substrate sensitivity is Cdm/Cox, equal to m − 1, which explains why m
is also referred to as the body effect coefficient.
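The body effect relations (3.7) and (3.8) can be checked numerically with a short Python sketch. All parameter values here are assumptions for illustration (silicon and oxide permittivities of 11.7 and 3.9 times the vacuum permittivity, an intrinsic carrier density of about 1e16 per cubic metre at 300 K, a doping of 1e24 per cubic metre and a 2 nm oxide); they are not taken from the model cards used in this work, and the function names are hypothetical.

```python
import math

Q = 1.602176634e-19                  # electron charge (C)
EPS0 = 8.8541878128e-12              # vacuum permittivity (F/m)
EPS_SI = 11.7 * EPS0                 # assumed silicon permittivity
EPS_OX = 3.9 * EPS0                  # assumed oxide permittivity
V_T = 1.380649e-23 * 300.0 / Q       # thermal voltage at 300 K (V)

def psi_b(n_a, n_i=1.0e16):
    """psi_B = (KT/q) ln(Na/ni); n_i is an assumed intrinsic density in m^-3."""
    return V_T * math.log(n_a / n_i)

def threshold_voltage(v_sb, v_fb, n_a, t_ox):
    """Equation (3.7) with C_ox = eps_ox / t_ox per unit area."""
    c_ox = EPS_OX / t_ox
    p = psi_b(n_a)
    return v_fb + 2 * p + math.sqrt(2 * EPS_SI * Q * n_a * (2 * p + v_sb)) / c_ox

def substrate_sensitivity(v_sb, n_a, t_ox):
    """Equation (3.8): dVth/dVsb."""
    c_ox = EPS_OX / t_ox
    p = psi_b(n_a)
    return math.sqrt(EPS_SI * Q * n_a / (2 * (2 * p + v_sb))) / c_ox

def cdm_over_cox(n_a, t_ox):
    """C_dm / C_ox with the maximum depletion width W_dm = sqrt(4 eps_si psi_B / (q Na))."""
    w_dm = math.sqrt(4 * EPS_SI * psi_b(n_a) / (Q * n_a))
    return (EPS_SI / w_dm) / (EPS_OX / t_ox)
```

Under these assumptions, `substrate_sensitivity(0.0, 1e24, 2e-9)` and `cdm_over_cox(1e24, 2e-9)` agree, which is the statement that the sensitivity at zero Vsb equals m − 1.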
The entire subthreshold leakage current including weak inversion, DIBL and body effect
can be expressed by the following equation. [6, 13]
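\[ I_{subth} = A \times e^{\frac{1}{m v_T}\left(V_G - V_S - V_{th0} - \gamma' V_S + \eta V_{DS}\right)} \times \left(1 - e^{\frac{-V_{DS}}{v_T}}\right) \tag{3.9} \]
where
\[ A = \mu_0 C'_{ox} \frac{W}{L_{eff}} (v_T)^2 e^{1.8} e^{\frac{-\Delta V_{th}}{\eta v_T}} \tag{3.10} \]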
Vth0 is the zero-bias threshold voltage. For small values of Vsb the body effect is nearly
linear with respect to VS, so the body effect is represented here as γ'VS. The DIBL coefficient
is denoted η, C'ox is the gate oxide capacitance, µ0 is the zero-bias mobility and m is
the subthreshold swing coefficient of the transistor. The term ∆Vth is introduced here to
account for transistor-to-transistor leakage variations [6].
Equations (3.9) and (3.10) are the equations used in this project to model subthreshold
leakage. The same equations are used in the transistor models [13], which will be used in
the simulation work.
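
As an illustration only, equation (3.9) can be evaluated numerically with a few lines of
Python. This is a minimal sketch and not the BSIM implementation: the prefactor A and the
values chosen for Vth0, γ', η and m are hypothetical placeholders picked for readability,
not values taken from the model cards of this work.

```python
import math

def subthreshold_current(vg, vs, vd, *, a=1e-6, vth0=0.2, gamma=0.2,
                         eta=0.1, m=1.3, v_t=0.0259):
    """Evaluate equation (3.9) for one nMOS transistor.

    All default parameter values are hypothetical placeholders:
    a is the prefactor A of (3.10) in amperes, vth0 the zero-bias
    threshold voltage, gamma the linearised body effect coefficient,
    eta the DIBL coefficient, m the subthreshold swing coefficient
    and v_t the thermal voltage at room temperature.
    """
    vds = vd - vs
    exponent = (vg - vs - vth0 - gamma * vs + eta * vds) / (m * v_t)
    return a * math.exp(exponent) * (1.0 - math.exp(-vds / v_t))

if __name__ == "__main__":
    # A non-conducting transistor (gate at 0 V) with its source at ground.
    print("VDS = 1.0 V:", subthreshold_current(0.0, 0.0, 1.0))
    # A lower drain voltage gives less DIBL and therefore less leakage.
    print("VDS = 0.5 V:", subthreshold_current(0.0, 0.0, 0.5))
    # A raised source adds body effect and a negative VGS.
    print("VS  = 0.1 V:", subthreshold_current(0.0, 0.1, 1.0))
```

With these placeholder values the sketch reproduces the qualitative behaviour described in
the text: the current grows with VDS through the DIBL term and drops sharply when the source
potential is raised.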
30 Leakage Current Simulation and Theory of Power Consumption
3.2 Synopsys HSPICE version 2004.03 with AvanWaves 2004.03 as graphical interface
3.3. LEAKAGE CURRENT MODELLING USING HSPICE 31
Figure 3.8: Model card parameters for 70nm and 180nm LL and HS transistors
frequently used as basis for circuit simulation and is widely used by most semiconductor
manufacturers worldwide [18]. For the BSIM model, a range of BPTM transistor model
cards is available in device sizes from 180nm down to 70nm. The BPTM site also offers a
model card generator that can produce model cards with user-specified parameters. The
parameters are:
Estimating these four parameters enables the generation of nMOS and pMOS model
cards for any process within some limits specified by the generator.
In this work four nMOS/pMOS pairs of transistor model cards have been generated this
way: a high-speed (low-Vth) pair and a low-leakage (high-Vth) pair, each in 180nm and 70nm
versions. For the 180nm high-speed process the value of Vth was copied from the STM
DKHCMOS8 cell library, and for the 70nm high-speed (HS) process it was taken from [19].
For the low-leakage (LL) versions the maximum Vth allowed by the BPTM model card
generator was selected. Values recommended by BPTM for Tox and Rdsw were used. Table 3.8
shows selected model parameters.
To enable sufficient current drive, Vth is often set to VDD/4 [19]. In the low-leakage
transistors in Table 3.8 this design rule of thumb has been altered to VDD/3 to further
enhance the low-leakage performance of the LL transistors. All model cards created for this
project are attached in Appendix C.
[Figure 3.9 panels: (a) Leakage vs. gate length of nMOS transistors; (b) Leakage vs. gate
length of pMOS transistors]
Figure 3.9: Leakage in pico-Amps (pA) of nMOS and pMOS transistors. Both 180nm and 70nm
transistors versus device length in nm. The top two lines represent HS transistors, and LL
transistors below.
Figure 3.9(a) shows the leakage of the 180nm (blue and purple) and 70nm (red and green)
nMOS transistors in HS and LL versions.
The difference in leakage is very clear. The minimum-sized 70nm nMOS transistor leaks
5871pA in the HS version and 13pA in the LL version. In the 180nm case the leakages are
70pA and 2.5pA respectively. The pMOS 70nm transistor leaks 3956pA and 39pA in HS and LL
versions respectively. For the 180nm pMOS transistors the leakages are 145pA and 4.5pA in
HS and LL versions respectively.
The leakage of a 70nmHS nMOS transistor is a factor of 84 higher than that of the 180nmHS
nMOS transistor. The difference between HS and LL transistors is even more pronounced in
70nm technology than in 180nm technology.
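
The ratios behind these statements can be checked directly from the leakage values quoted
above. The short sketch below only restates the reported numbers (in pA); the dictionary
layout and function names are chosen here for illustration.

```python
# Leakage of minimum-sized transistors in pA, as quoted in the text.
LEAKAGE_PA = {
    ("nMOS", "70nm"): {"HS": 5871.0, "LL": 13.0},
    ("nMOS", "180nm"): {"HS": 70.0, "LL": 2.5},
    ("pMOS", "70nm"): {"HS": 3956.0, "LL": 39.0},
    ("pMOS", "180nm"): {"HS": 145.0, "LL": 4.5},
}

def hs_to_ll_ratio(device, node):
    """How many times more the HS version leaks than the LL version."""
    entry = LEAKAGE_PA[(device, node)]
    return entry["HS"] / entry["LL"]

def node_scaling_ratio(device, version):
    """How many times more the 70nm device leaks than the 180nm device."""
    return LEAKAGE_PA[(device, "70nm")][version] / LEAKAGE_PA[(device, "180nm")][version]

if __name__ == "__main__":
    for device, node in LEAKAGE_PA:
        print(f"{device} {node}: HS/LL = {hs_to_ll_ratio(device, node):.1f}")
    print(f"nMOS HS, 70nm/180nm = {node_scaling_ratio('nMOS', 'HS'):.1f}")
```

The HS/LL ratio is considerably larger at 70nm than at 180nm for both device types, and the
70nm/180nm ratio for the HS nMOS device rounds to the factor of 84 stated above.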
Surprisingly, simulation showed that the pMOS transistor leaks more than the corresponding
nMOS transistor (except in the 70nmHS case). The literature states that the opposite
should be the case. All BPTM models seem to have this behavior.
The leakage currents do not keep decreasing exponentially as the devices get longer. Beyond
2·Lmin the leakage seems to increase a bit and then flatten out at a certain level. This is
due to the derivation of Vth, which depends on a number of factors approximated either
experimentally or by calculation [20]. The model cards therefore have maximum accuracy near
minimum device sizes.
3.3.3 Assumptions
To enable fair comparison between logic families, the surrounding circuitry behaves ac-
cording to a set of assumptions given here:
• Input values reach perfect (0V or VDD ) value and are noise free.
• Voltage supply lines are perfect in voltage values and do not swing when power is
drawn from them.
• Outputs of the circuit under test drive a capacitor equal to ten times the gate capaci-
tance for the given technology.
The first two assumptions ensure that logic families that cope poorly with low-quality
input values and voltage supplies are not penalized for this in the comparison. Clearly,
when designing circuitry utilizing these logic families, steps would be taken to improve
the stability of input and supply voltage levels. All simulations are done assuming room
temperature (25 degrees Celsius).
3.3 This figure is approximated from \sqrt{\mu_n / \mu_p}, which is the typical way to balance
the widths [22]. In this work there is no clear reason to alter this relation.
3.4. THE LEAKAGE OF LOGIC GATES 33
Figure 3.10: Feature sizes of transistors with device sizes (DS) 180nm and 70nm.
Connecting transistors in series in a stack with more than one non-conducting device
reduces the leakage by at least an order of magnitude [9]. As device sizes diminish and the
DIBL effect thereby increases, the stacking effect increases. Therefore the stacking factor,
defined as the ratio between the leakage of a single transistor and that of a stack of
transistors, will increase in the future. Stacking transistors is a promising way of
reducing leakage, firstly due to the stacking effect itself, and secondly because the
non-conducting transistors in a stack have increasing source voltages the closer they are
placed to the output (at VDD), which increases the body effect and reduces leakage further.
Estimating the leakage of a stack of transistors is rather simple if the transistors are
homogeneous and in one unbroken line. But when using different transistor sizes and
connecting other paths midway in the stack, the task becomes quite difficult. In [13] a
promising pseudo-algorithm is given for estimating leakage. Using HSPICE and the BSIM3
model, the stacking effect is modelled by an iterative approach.
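
To give a feeling for what such an iterative estimate involves, the sketch below solves the
simplest case: two identical non-conducting nMOS transistors in series between VDD and
ground. It is not the pseudo-algorithm of [13] and not an HSPICE run; it simply bisects on
the intermediate node voltage until the currents of equation (3.9) through both
transistors match. All parameter values are hypothetical placeholders.

```python
import math

V_T = 0.0259  # thermal voltage at room temperature (V)

def i_sub(vg, vs, vd, a=1e-6, vth0=0.2, gamma=0.2, eta=0.1, m=1.3):
    """Equation (3.9) with hypothetical placeholder parameters."""
    vds = vd - vs
    exponent = (vg - vs - vth0 - gamma * vs + eta * vds) / (m * V_T)
    return a * math.exp(exponent) * (1.0 - math.exp(-vds / V_T))

def stack_node_voltage(vdd, tol=1e-12):
    """Bisect for the mid-node voltage of a two-transistor OFF stack.

    The bottom current rises with the node voltage while the top
    current falls, so the two curves cross exactly once in [0, vdd].
    """
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        bottom = i_sub(0.0, 0.0, mid)   # source at ground, drain at node
        top = i_sub(0.0, mid, vdd)      # source at node, drain at VDD
        if bottom < top:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def stacking_factor(vdd):
    """Leakage of a single OFF transistor divided by that of the stack."""
    vx = stack_node_voltage(vdd)
    return i_sub(0.0, 0.0, vdd) / i_sub(0.0, 0.0, vx)

if __name__ == "__main__":
    print("mid-node voltage:", stack_node_voltage(1.0))
    print("stacking factor :", stacking_factor(1.0))
```

In this toy setting the intermediate node settles at a small positive voltage, which is
what gives the upper transistor a negative gate-source voltage and added body effect, and
the lower transistor a reduced DIBL contribution.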
Figure 3.11 shows different transistor configurations leaking into ground. The difference
between a single transistor and two transistors in series is evident: nearly a factor of
nine. As expected, the right-most configuration leaks twice the amount of the second
configuration. The two configurations with three transistors show the importance of placing
transistors correctly in a stack.
Due to the body effect it is better to place the single transistor near ground (near VDD
for pMOS transistors) than near the output, which might seem counterintuitive. The voltages
at the midpoints of the stacks are written on the figure. These voltages show the great
impact of the body effect, since they indicate a much higher resistance in the upper
transistors. These low voltages cause the rightmost of the three-transistor configurations
to be superior in terms of leakage compared with the leftmost.
Figure 3.11: Leaking stacks of 70nm HS transistors. The voltage denotes the voltage measured in
the middle of the stack when full VDD =1V is supplied to the stack.
1. Transistors near the output are most affected by the DIBL effect, lowering their Vth .
2. Transistors that are not directly connected to VSS are affected by the body-effect in-
creasing their Vth .
3. The leakage of a transistor depends heavily upon the gate length and Vth .
4. Stacking of transistors reduces the leakage greatly.
5. The leakage of a gate depends on the input state.
When designing a region of circuitry (a logic gate, for example), placing as few transistors
near the output (1) and VSS (2) as possible will reduce the leakage. This can be achieved by
reconfiguring the transistors in the stacks.
The leakage can also be reduced by increasing the gate length (3) or the Vth of the
transistors. This reduces the drive strength of the transistor, so time slack must be
available.
Leakage can also be saved by building logic gates with an increased number of
transistors in series (in stacks) (4). A larger cell with high stacks of transistors will
therefore leak less than a cascade of smaller cells in many cases.
3.5. DESIGNING FOR LOW LEAKAGE 35
Since the leakage is input dependent (5), a gate can be supplied with a low leakage input
vector when it is inactive.
These considerations help in selecting logic families for leakage evaluation, which is the
topic of the following chapter.
CHAPTER 4
Contents
4.1 Logic selection criteria
4.2 Survey of logic families
4.2.1 Static logic styles
4.2.2 Differential logic styles
4.2.3 Clocked and dynamic logic styles
4.3 Static CMOS logic
4.4 MTCMOS
4.5 CMOS Domino logic
4.5.1 Trading speed for low leakage
4.6 Complementary Pass-Transistor logic
4.6.1 Possible problems with CPL
4.7 MacroCMOS
4.7.1 Larger cells for transistor stacking
4.7.2 Logic optimizations for low leakage
4.7.3 Utilizing speed for leakage reduction
The aim of this chapter is to give a short survey of logic families. Based on
this survey the logic families for further evaluation are selected. These fami-
lies are CPL and Domino logic. These logic families together with MTCMOS
and static CMOS are further introduced and their benefits and drawbacks in
terms of power consumption, ease of design, and characteristic features such as
robustness to voltage swings and process variations are discussed.
38 Presentation of Logic Families
For a long time static CMOS was the logic family of choice in overall terms of power,
area and speed. But, since the leakage problem will only grow in the future, this choice may
have to be reconsidered. One topic of this work is selecting and evaluating alternative logic
families that may experience a come-back in the main industry due to the rising leakage
problem.
As described in the introduction and explained more closely in Chapter 3 the leakage
problem can be narrowed down to a coarse relation:

    P_{leak} = V_{DD} \sum_n I_{path(n)} = V_{DD}^2 \sum_n \frac{1}{R_{path(n)}}    (4.1)

In this equation the sum runs over the n paths from VDD to VSS, and Rpath(n), referred to
as Rleak in the following, is the steady-state equivalent average resistance of a leaking
path considering all possible input combinations.
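
A small numerical sketch of relation (4.1) is given below. The path resistances are
hypothetical values chosen only to show how the number of paths and their resistance enter
the result.

```python
import math

def leakage_power(vdd, path_resistances):
    """Relation (4.1): P_leak = VDD^2 * sum over paths of 1 / R_path."""
    return vdd ** 2 * sum(1.0 / r for r in path_resistances)

if __name__ == "__main__":
    vdd = 1.0  # volts
    # Hypothetical equivalent resistances of two leaking paths, in ohms.
    two_paths = [1e9, 2e9]
    print("two paths       :", leakage_power(vdd, two_paths), "W")
    # Removing a path or raising its resistance both lower the leakage.
    print("one path        :", leakage_power(vdd, two_paths[:1]), "W")
    print("doubled R_path  :", leakage_power(vdd, [2 * r for r in two_paths]), "W")
```

Doubling every path resistance halves the leakage power, and so does halving the number of
identical paths, which are the two adjustments discussed in the text.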
Since VDD is predefined when designing cell libraries, only two factors remain to adjust.
These factors are n and Rleak , the number of paths between the voltage sources and the
resistance on these paths. Therefore, the logic families in question in this work either:
In this chapter a short survey of logic families is given, discussing which of them are
interesting with respect to the three topics above. Thereafter static CMOS logic and the
three selected design styles, Complementary Pass-Transistor Logic, Domino and MacroCMOS,
are presented and evaluated for leakage current characteristics.
MTCMOS is interpreted both as the power routing scheme used in this work and as the idea of
altering the threshold voltage by changing the bulk potential.
4.2. SURVEY OF LOGIC FAMILIES 39
[Figure 4.1 panels: Pseudo-nMOS, CPL, Lector logic, CVSL, CCMOS]
Figure 4.1: Logic AND-gates designed with: Pseudo-nMOS, Lector logic, CVSL and CCMOS
logic.
logic.
without changing the logic function of the block. But, as the leakage of the following block
is dependent on the quality of the input value, this will cause the following logic block to
leak severely.
Complementary Pass-Transistor Logic (CPL) is very interesting since it reduces the need
for connections to the voltage sources and maximizes the number of transistors in series.
CPL is selected for evaluation.
Lector logic [24] is an enhancement to static CMOS where a cross-coupled pair of pMOS
and nMOS transistors is put between the pull-up and pull-down networks (Figure 4.1). This
puts an extra non-conducting transistor in series with all paths when signals are at a steady
value. Yet, this method increases the propagation delay. Scaling up transistors to overcome
this overhead will increase the leakage, and it is doubtful how big the benefit of adding a
single transistor in series could be.
As described, low leakage design is obtained by increasing the resistance of the paths in
the design and generally reducing the number of them. This can be achieved by replacing
small logic blocks by larger more complex blocks. Chapter 2 describes the limitation of cell
libraries of blocks consisting of a limited number of logic gates. Breaking this boundary
by designing cells on the fly can reduce leakage by enabling the design of logic cells fully
customized taking leakage current considerations into account. This is investigated. The
improved static CMOS ’logic family’ is here called MacroCMOS.
Cascade Voltage Switch Logic (CVSL) [1] is a logic style where two complementary nMOS
switch structures are constructed and then connected to a pair of cross-coupled pMOS
pull-up transistors (Figure 4.1). All inputs are needed in both inverted and non-inverted
form, and the pair of pull-down networks doubles the hardware needed. Yet, CVSL gates can
be built to be very fast as the pull-down load on the nMOS network is minimized when
leaving out the complementary pull-up network. The pull-up network can also be clocked
to reduce power consumption, making this gate a dynamic gate. In this way the CVSL gate
practically becomes two complementary Domino gates with double hardware and double
power consumption, which is why Domino must be more efficient than CVSL.
MOS Current Mode Logic (MCML) [25] is, like CVSL, built from a pair of pull-down networks,
with a resistor replacing the pMOS pull-up transistors. A constant current is drawn
through the pair of nMOS networks [26], flowing through one of the networks depending
on the inputs and forming a very fast gate.
In general differential logic is not interesting in this work. It requires doubling of
outputs for inverted and non-inverted inputs/outputs, which will double the number of
leaking paths. One could argue that the currents flowing are not leakage but dynamic
current, since the current is used for the fast changing of output values. Yet, reducing
the total power consumption is the goal here, so reducing the leakage at the cost of an
increase in dynamic power consumption beyond the reduction in leakage power consumption is
not useful.
Figure 4.2: A static CMOS AND-gate. Logic symbol, transistor netlist and leakage.
[Leakage paths shown for input state (A, B) = (0, 1)]
4.4 MTCMOS
MTCMOS, which in this work refers to the concept of cutting off the power supply, will be
evaluated for possible incorporation in current cell libraries. Adding power routing
transistors inside every cell and adding an 'on' signal to the cell will allow for designs
where the synthesis tool can derive a controller to turn specific cells off and on during
operation. Cells can also be connected to the same 'on' signal to enable turning regions of
logic on and off.
In theory, this could reduce the leakage problem in inactive periods of operation
considerably. The leakage power savings must be so large that a controller can be built
consuming less power than the power saved. Further, adding transistors in series
with the power rails may increase the propagation delay of the cells. This delay overhead
must not exceed the delay overhead of using a low-leakage cell, or else a low-leakage cell
would be preferred due to ease of design, no controller overhead etc.
MTCMOS will be further explored in Section 5.2.1.
Figure 4.3: A Domino AND-gate. Basic gate, bleeder transistor added and leakage.
[Leakage paths shown for input state (A, B, Clk) = (1, 1, 1)]
logic 1. Any number of domino logic gates can therefore be cascaded, provided that all
gates can evaluate within the evaluate phase of the clock.
Domino logic is designed for speed. The precharging and discharging of internal nodes,
which might seem a waste of power, is the price of achieving very fast cascaded logic. After
the precharge period the pMOS precharge transistor stops conducting, reducing the load on
the nMOS discharge transistor to only pulling down the capacitance of the nMOS network.
The primary stage of domino logic will draw some short-circuit current as both clocked
transistors change state at the same time, but the following stages will have their
precharge transistor fully non-conducting when the (conditionally only-rising-edge) input
values arrive. Hereby the pull-down load on the nMOS network and clocking transistor is
minimized, causing input values to propagate very fast through the stages driven by fast
inverters (see footnote 4.2).
4.2 The analogy here is that the capacitances in the cascaded Domino logic blocks are
discharged like falling Domino bricks, which can only fall, and can only be raised
(precharged) by hand.
4.6. COMPLEMENTARY PASS-TRANSISTOR LOGIC 43
Figure 4.4: CPL XOR gates in three versions: Yano’s 2-input, Wang’s 2-input and a 3-input [31].
Advantages:     Fast
                Leakage current reductions easy to obtain
Disadvantages:  Increased dynamic power consumption
                Sensitive to process variations
                Difficult to design due to clocking issues
Furthermore, when driving logic values one must use an appropriate (nMOS or pMOS)
transistor to be able to drive the value well enough, which raises the need for inverted
input values. Hence, a choice must be made between producing the inverted input values
needed for an implementation that is not connected to the voltage supplies, or saving the
inverters by making connections to VDD and VSS to drive logic values.
Figure 4.5: Power characteristics of traversing input values. Static CMOS to the left with
switching current peaks. To the right a CPL implementation with fewer connections to the
voltage sources.
Figure 4.6: 4-input NAND-gates in two versions: cascaded small gates and one larger gate.
4.7 MacroCMOS
MacroCMOS as presented in this work is not a classic logic family in itself, but rather a
proposed improvement for lower leakage in static CMOS cell-based designs. As found earlier,
the boundaries between logic cells in cell libraries limit the possible optimizations that
can be done in terms of leakage power. Synthesizing hardware without these boundaries
enables the construction of larger customized logic blocks that will have greatly improved
leakage current characteristics.
In the beginning of this chapter it was found that leakage current reduction is achieved
by reducing the number of paths and/or increasing the resistance on these paths. Both can
be done by replacing a number of smaller gates with larger, more complex gates forming
the same logic function, which is the concept of MacroCMOS. Larger gates can be designed
to be much more efficient in terms of leakage power because of the stacking effect, improved
logic optimization possibilities and the utilization of gained speed for low leakage.
more elaborate and very real example must be devised to prove the benefits of logic
optimizations when ignoring cell logic boundaries.
Logic optimizations are not always possible though. During the work of this project it
became apparent that some logic gates perform badly when built into a larger cell. Chapter
8 elaborates more on this subject.
Contents
5.1 Logic families comparison
5.1.1 A static CMOS basis for comparison
5.1.2 Logic family comparison steps
5.2 Logic family specific simulation approaches
5.2.1 Cutting off power supply
5.2.2 Complementary pass-transistor logic
5.2.3 Domino Logic
5.2.4 MacroCMOS gates
Evaluation of the logic families requires great care when devising fair and
comparable simulation cases. The same care has to be taken when designing a
fair and average-case set of static CMOS logic gates to enable fair comparison.
Furthermore, the results from the simulation cases need further treatment in
order to give comparable values. This chapter describes the considerations made
when designing simulation cases and generating a static CMOS set of gates for
comparison. Then, the steps of building and optimizing the logic blocks built
with the selected logic families are described. After these general remarks,
specific implementation remarks are given for each logic family.
48 Logic Family Evaluation Methods
[Figure 5.1 axes: Leak versus Tpd; regions marked: optimum of minimum sized transistors,
non-optimal solutions, optimal solutions, void design space]
Figure 5.1: Speed/power design space with the optimal curve as a boundary between non-optimal
and impossible solutions.
[Figure 5.2 axes: Leak versus Tpd; points marked: optimum of minimum sized transistors,
static CMOS]
Figure 5.2: Finding an optimal solution in the speed/power design space for a given logic family
compared with static CMOS.
of the gate. If one defines that the maximum propagation delay and leakage current
of the gate must equal those of the gate built from minimum-sized transistors, no
improvements can be made. Therefore this solution is on the optimal solution curve.
One could argue that both the transistor width and length could be increased to improve
the solution, but that would be the same as returning to an older technology with
larger device sizes. Therefore the minimum-sized static CMOS gates are used as the basis of
logic family comparison.
Clearly not all possible logic gates can be simulated in reasonable time, nor would that
be necessary. As stated in [33], 20 logic gates are necessary to form a valid comparison
set of gates. The work done in [33] is based upon minimizing a cell library by excluding the
logic gates that were used least often when synthesizing a set of simulation circuits. The
resulting library of only 20 logic cells, including flip-flops, forms a complete cell
library with minimum delay and power dissipation overhead. The 20 cells described
in that paper are used here as the comparison basis representing the static CMOS logic
family.
Figure 5.3: An inactive-mode low leakage controller, controlling the power supply to the
virtual supply voltage rails (Virtual-VDD and Virtual-VSS) powering logic blocks (A, B and C).
built in the logic family under evaluation. If the achieved solution is better than the static
CMOS solution in terms of speed, an iterative approach is taken to utilize the gained time
slack slowing down the logic block and achieving less leakage power consumption.
On the other hand, if the solution should prove to be worse, steps are taken to improve
the speed taking the leakage current into consideration. Hence, the propagation delay of
the logic block is the critical parameter, and the leakage current is derived as the product of
improvements done under the strict timing limit. The final solutions can then be compared
directly in terms of leakage current.
Leaking devices cause power consumption not only when the circuitry is working, but also
in inactive periods, as described in Chapter 3. The fraction of the total power consumption
that is caused by leakage current is dependent upon the utilization of the circuit, i.e. the
percentage of the time a given circuitry is working. For a system with parts with low uti-
lization, this fraction grows quite high.
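
The dependence on utilization can be made concrete with a small sketch. It rests on a
simplifying assumption that is not taken from this work: dynamic power is drawn only while
the circuit is working, whereas leakage power is drawn all the time. The power figures are
hypothetical.

```python
import math

def leakage_fraction(p_dynamic_active, p_leak, utilization):
    """Fraction of the average power that is caused by leakage.

    Assumes (as a simplification) that dynamic power is consumed only
    during the active fraction of time and leakage power continuously.
    """
    average_dynamic = utilization * p_dynamic_active
    return p_leak / (average_dynamic + p_leak)

if __name__ == "__main__":
    p_dyn = 1.0    # hypothetical dynamic power while active (mW)
    p_leak = 0.05  # hypothetical leakage power (mW)
    for u in (1.0, 0.1, 0.01):
        share = leakage_fraction(p_dyn, p_leak, u)
        print(f"utilization {u:5.2f}: leakage share {share:.3f}")
```

Under these assumptions a leakage contribution that is small at full utilization becomes
the dominant share of the average power once the circuit is idle most of the time.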
An obvious method of leakage current reduction is to cut off the power to inactive re-
gions of logic by routing the voltage supplies through transistors controlled by a designated
’inactive-mode low leakage controller’. This is often called MTCMOS in the literature[23].
Using transistors to cut off power to large areas of logic causes large voltage swings and
possible failure when re-activating the circuitry. To avoid this, small regions of logic can be
cut off and reactivated independently using a more complex controller and several cut off
transistor stages.
Figure 5.3 shows the concept. The controller can listen to inputs or other controllers,
such as a controller of an input queue, to be able to decide when to put the logic blocks into
sleep mode.
5.2. LOGIC FAMILY SPECIFIC SIMULATION APPROACHES 51
The virtual supply voltage rails, from which the logic block will be drawing power, will
naturally be affected by the voltage drop over the two transistors. This voltage drop is
dependent on the current drawn by the logic block, which leads to swings in the virtual
supply voltage when the logic block is working. These swings increase the propagation
delay of the logic block. Increasing the drive strength by sizing up the width of the
supply voltage transistors reduces the voltage swings and the voltage drop,
but inherently causes increased leakage when power is to be cut off from the logic block.
Increasing the length of the power supply transistors reduces the leakage, but also reduces
the drive strength of the transistors, further increasing the virtual supply voltage swings.
Using high-Vth transistors may reduce the leakage without causing too severe an increase in
propagation delay; the remaining increase may be remedied by sizing up the width of the
transistors while still saving leakage power overall.
It is clear that a study of the effects of adding circuitry for cutting off power supply
must be conducted. The width, length and threshold voltage of the transistors feeding the
virtual supply voltage rails are the parameters for this study. The outputs are propagation
delays and leakage current measurements for a set of logic blocks designed for simulation
purpose.
The set consists of two simulation cases, which will be shown to be sufficient for this
study. In the first case the logic block is represented by a resistor simulating a leaking cir-
cuit. This rather simple case enables comparisons of the effectiveness of adding the supply
voltage transistors in different sizings without taking the dynamic characteristics of the
logic block into consideration.
This forms the basis for the second case where a logic block consisting of a ’NAND-
NOR’ structure is simulated for propagation delay and leakage current. Comparing the
results from this case to the simplified resistor case helps locate characteristics originating
from this specific (NAND-NOR) simulation case that might invalidate the general conclu-
sion derived from this simulation.
In CPL quite a few ways of designing XOR gates are possible. Figure 5.4 depicts three
different implementations. The implementation to the left is a mix of static CMOS and
pass-gates. This design includes two inverters, which are expensive in terms of leakage.
Wang’s XOR gate in the middle of Figure 5.4 is a true CPL gate including only one
inverter for driving the output value. Eliminating the inverter yields an XNOR gate with
weak pull-up for inputs A=1 and B=1 [32].
The third XOR gate is a 3-input true CPL XOR gate [31]. This gate is built of only nMOS
transistors, and as the pull-up of these transistors grows more limited as the number of
transistors in series increases, a better implementation can be built by adding pMOS
transistors to form pass-gates instead of single nMOS pull-up/pull-down transistors.

                      Left    Middle (Wang)    Right (3-input)
Connects to VSS:        2           1                 2
Connects to VDD:        2           2                 2
Total transistors:      8           6                12
Figure 5.4: Three different implementations of XOR gates in complementary pass-transistor logic
with different number of connections to the power rails.
In the evaluation of CPL all three types of XOR implementations were simulated, and both
speed and leakage power characteristics were explored. The speed gain from changing static
CMOS to CPL would have been used to further decrease the leakage of the CPL gates. However,
due to very poor results from this analysis, no further simulation was done in CPL.
From the XOR case it was determined that the weakly driven signals cause more leakage
power consumption than what was saved due to the reduced number of supply connections.
This will be described further in Section 6.3.
Scaling a Domino block down in speed to match static CMOS by scaling the clocking
transistors to save leakage is done in a series of iterative steps. First the pMOS
transistor is scaled to be exactly strong enough to pull up the stage. This is shown in
Figure 5.5 in the Precharge phase of the clock. The arcs a, b and c represent a too strong,
an appropriate and a too weak pull-up respectively. It is, of course, not possible to pull
up entirely to VDD within a given clock phase, so another required minimum value must be
set. Here, the pull-up is required to pull up to Vth/8, which was found to be achievable
without too severe an impact on leakage through the pMOS device. The clock frequency is set
to 1GHz.
Secondly, the nMOS pull-down transistor is sized to be exactly strong enough. This is
shown in the Evaluate phase of Figure 5.5. Here arcs d, e and f represent a too weak,
an appropriate and a too strong pull-down respectively. The pull-down nMOS transistor
does not have the entire Evaluate clock phase to pull down, as following Domino blocks
are waiting for the output. The maximum pull-down time including the propagation delay
of the output inverter is set to be the propagation time of the corresponding static CMOS
gate. The maximum propagation delay without the inverter delay is shown in Figure 5.5
as tb − ta.
Figure 5.5: A Domino block with clocking transistors. The pull-up and pull-down of the block.
After the nMOS transistor has been sized for minimum leakage, the pull-up and pull-down
times are checked again to verify operation with the added capacitive load caused
by the larger nMOS device. An optimal solution is found by iteratively sizing the two
transistors. The design chosen for simulation is again the full-adder, which will enable
easy comparison with CPL and static CMOS implementations.
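
The alternating sizing procedure can be sketched as a fixed-point iteration. The sketch
below is only a toy: instead of HSPICE simulations it uses a hypothetical first-order delay
model in which the pull time is proportional to the node capacitance divided by the
transistor width, and in which both clock transistors add capacitance in proportion to
their width. All constants are in arbitrary normalized units and are assumptions made for
illustration.

```python
def min_width(k, c_other, c_self, deadline, w_min=1.0):
    """Smallest width W satisfying k * (c_other + c_self * W) / W <= deadline."""
    margin = deadline - k * c_self
    if margin <= 0:
        raise ValueError("deadline cannot be met at any width in this toy model")
    return max(w_min, k * c_other / margin)

def size_clock_transistors(c_fixed=10.0, c_p=1.0, c_n=1.0, k_p=2.0, k_n=1.0,
                           t_precharge=20.0, t_evaluate=10.0,
                           tol=1e-9, max_iter=100):
    """Alternate between sizing the pMOS and the nMOS clock transistor.

    Each transistor is made exactly strong enough for its deadline given
    the load added by the other one; the loop stops when neither width
    changes any more.
    """
    w_p = w_n = 1.0  # start from minimum-sized devices
    for _ in range(max_iter):
        new_p = min_width(k_p, c_fixed + c_n * w_n, c_p, t_precharge)
        new_n = min_width(k_n, c_fixed + c_p * new_p, c_n, t_evaluate)
        converged = abs(new_p - w_p) < tol and abs(new_n - w_n) < tol
        w_p, w_n = new_p, new_n
        if converged:
            return w_p, w_n
    raise RuntimeError("sizing did not converge")

if __name__ == "__main__":
    w_p, w_n = size_clock_transistors()
    print("pMOS width:", w_p, "nMOS width:", w_n)
```

Because leakage grows with transistor width, the smallest widths that still meet both
deadlines correspond to the minimum-leakage sizing in this simplified picture.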
Gate leakage, although generally not included in this work, will have a very bad influence
on the performance of dynamic logic. When the gates in the output inverter start leaking,
the dynamically held input to the inverter must be helped by a bleeder transistor to keep
the high signal value. Designing a MOS device to keep an internal node's voltage very
near VDD is a tradeoff between keeping a high-quality signal value using a large pMOS
device, causing little subthreshold leakage in the inverter but large amounts of leakage
through the rest of the Domino block, and using a smaller device causing the opposite
effects.
Clearly, gate leakage causes further leakage when trying to remedy the effects of gate
leakage. In this study the gate leakage will be approximated by a resistor connected from
ground to the dynamically held nodes. The analysis described in the previous section is
then repeated with the added current source.
The concept of MacroCMOS is to form larger gates, either from smaller gates or by direct
synthesis of boolean expressions, to reduce leakage through logic optimizations and
stacking of transistors. The full-adder design is not optimal for showing the benefits of
MacroCMOS due to the parallelism of the full-adder design, which includes two entirely
disjoint components. It will still be designed for comparison purposes.
54 Logic Family Evaluation Methods
Figure 5.6: A 2-bit standard full-adder and a leakage improved MacroCMOS 2-bit full-adder.
In the standard version each stage computes its carry as AB+AC+BC. In the MacroCMOS
version the carry is computed as C2 = A1B1+(A1+B1)*(A0B0+A0C0+B0C0).
Figure 5.6 shows a possible way to construct a larger block from the smaller ones. Two
full-adders have been joined into one block by including the carry-computation in the sum-
computation in the next stage. The carry C1 does not exist anymore, but the corresponding
evaluating networks have been incorporated in the 3-input XOR gate in the next stage.
The component calculating C2 becomes somewhat larger since C1 does not exist, so the
carry C2 must be determined from the four input values and C0. Comparing the C2 carry
generator to the original one, it is evident that an AND function (*) is introduced, which is
good in terms of leakage because it implies chaining transistors in series. The full-adder
is selected to be used in the evaluation of MacroCMOS for comparison.
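The merged carry expression of Figure 5.6 can be checked against the ordinary ripple-carry
formulation by exhaustive enumeration of the five inputs. The following is a minimal sketch
in Python; the function names are chosen here for illustration only and do not stem from
the simulation files of this work.

```python
from itertools import product


def majority(a, b, c):
    """Carry of a 1-bit full-adder: AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


def ripple_c2(a0, b0, a1, b1, c0):
    """Standard 2-bit adder: C1 exists as a node and feeds the second stage."""
    c1 = majority(a0, b0, c0)
    return majority(a1, b1, c1)


def macro_c2(a0, b0, a1, b1, c0):
    """Merged carry: A1B1 + (A1+B1)*(A0B0 + A0C0 + B0C0); C1 is never formed."""
    return (a1 & b1) | ((a1 | b1) & ((a0 & b0) | (a0 & c0) | (b0 & c0)))


def carries_match():
    """True if both formulations agree for all 32 input combinations."""
    return all(ripple_c2(*v) == macro_c2(*v) for v in product((0, 1), repeat=5))


print(carries_match())
```

The check only covers logic equivalence. It says nothing about leakage or timing, which
depend on the transistor level implementation.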
To investigate what logic optimizations can be done to a circuit when the limitations of cell
libraries are ignored, a logic block is devised for simulation. The STM cell library available
at IMM/DTU offers a modest number of larger cells with a maximum of five or six inputs.
These cells are inherently not optimal in terms of leakage because they have been
designed for general purpose usage. Therefore a logic block matching a cell in a common
cell library is applied with inputs that enable a logic optimization in MacroCMOS.
This optimization is not possible with current cell
libraries, but only when manufacturing cells on-the-fly.
A larger logic block in which, for example, the same signal is connected to more
than one input terminal could be rebuilt to reduce leakage. This is done by reconfiguring
the transistors, removing superfluous transistors and resizing other transistors to reduce
leakage while still keeping the original timing of the gate.
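As an illustration of such a rebuild, consider a hypothetical library cell
Z = NOT(A*B + C*D) whose terminal C happens to be tied to the same signal as A. The cell
and the signal names are assumptions made for this example and are not the block simulated
in this work. The function then collapses to Z = NOT(A*(B+D)), which in static CMOS needs
six transistors instead of eight. A small Python sketch confirms the equivalence:

```python
from itertools import product


def aoi22(a, b, c, d):
    """Hypothetical library cell: Z = NOT(A*B + C*D)."""
    return 1 - ((a & b) | (c & d))


def rebuilt(a, b, d):
    """Rebuilt cell when terminal C is tied to A: Z = NOT(A*(B + D))."""
    return 1 - (a & (b | d))


def equivalent_when_shared():
    """Compare both cells for every input vector with C forced equal to A."""
    return all(aoi22(a, b, a, d) == rebuilt(a, b, d)
               for a, b, d in product((0, 1), repeat=3))


print(equivalent_when_shared())
```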
Evaluating the benefits from building larger cells for stacking is a delicate matter since the
results will depend heavily on the particular simulation case. As will be shown in section
6.5, the leakage per input of an XOR-gate decreases when replacing a larger XOR-gate with
smaller ones in cascade, while the opposite is true for a NAND-gate.
5.2. LOGIC FAMILY SPECIFIC SIMULATION APPROACHES 55
This implies that a randomized case consisting of a variety of different gates in cascade
is needed where optimization can be done, and from which general optimization methods
can be derived. Here, an 11-input gate with a total of 9 distinct inputs will be examined for
this purpose.
CHAPTER 6
EVALUATION OF LOGIC FAMILIES
Contents
6.1 Static CMOS
6.2 Cutting off power supply
    6.2.1 The resistor case
    6.2.2 The Nand-Nor case
    6.2.3 Discussion of results
6.3 Complementary pass-transistor logic
    6.3.1 Wang’s XOR gate
    6.3.2 Yano’s XOR Gate
    6.3.3 Discussion of results
6.4 Domino logic
    6.4.1 The Domino XOR block
    6.4.2 The Domino And-Or block
    6.4.3 Gate leakage
    6.4.4 Discussion of results
6.5 MacroCMOS
    6.5.1 The full-adder
    6.5.2 Logic optimizations
    6.5.3 Larger cells for MacroCMOS
    6.5.4 Limitations of MacroCMOS
    6.5.5 Discussion of results
This chapter describes the evaluation of target logic families through the sim-
ulation cases presented in Chapter 5. The results from each evaluation will be
discussed. The following chapter contains a comparative discussion of all the
simulation results. The files used for simulation are included in the attached
disk. Appendix F gives a short outline of the contents of the disk.
58 Evaluation of Logic Families
Figure 6.1: Leakage current of a 36.5 MΩ resistor driven by virtual voltage supply transistors,
as function of the gate length and gate width of the supply transistors.
Leakage currents were measured as steady state leakage current drawn from the voltage
supply through the circuit for every possible input. The average leakage current was then
calculated under the assumption that every input value combination is equally frequent.
The input vectors causing minimum and maximum leakage current were recorded to-
gether with the corresponding leakage current to enable derivation of low leakage input
vectors at a later stage. To measure propagation delays of the circuits the worst case shift
from one to another input vector causing maximum output delay was predicted by hand
and investigated by simulation.
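The bookkeeping described above can be sketched as follows. This is a minimal Python
illustration, assuming the per-vector leakage currents have already been extracted from
the simulator. The function name and the example currents are hypothetical and are not
results from this work.

```python
def leakage_summary(leakage_by_vector):
    """Average leakage plus the input vectors with minimum and maximum leakage.

    leakage_by_vector maps an input vector (e.g. "01") to its steady state
    leakage current. All vectors are assumed to be equally frequent.
    """
    average = sum(leakage_by_vector.values()) / len(leakage_by_vector)
    v_min = min(leakage_by_vector, key=leakage_by_vector.get)
    v_max = max(leakage_by_vector, key=leakage_by_vector.get)
    return average, (v_min, leakage_by_vector[v_min]), (v_max, leakage_by_vector[v_max])


# Hypothetical leakage currents in nA for a 2-input gate
example = {"00": 2.0, "01": 6.0, "10": 8.0, "11": 16.0}
average, lowest, highest = leakage_summary(example)
print(average, lowest, highest)
```

Weighting all vectors equally is the simplification stated in the text. A design with known
signal statistics could replace the plain mean with a weighted one.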
Further descriptions of the 20 cells, including logic functionality, transistor netlists and
simulation results, are given in Appendix D.
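\[
R(V) = \frac{U}{I_{\text{Leak}}}
     = \frac{1\,\mathrm{V}}{2 I_{\text{Leak-nand}} + I_{\text{Leak-nor}}}
     = \frac{1\,\mathrm{V}}{2 \cdot 7.9\,\mathrm{nA} + 11.65\,\mathrm{nA}}
     = 36.5\,\mathrm{M\Omega}, \qquad V = V_{DD}
\tag{6.1}
\]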
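The arithmetic of equation 6.1 can be reproduced with a few lines of Python. This is only
a sketch of the calculation, with a helper name chosen here. The computed value is about
36.4 MΩ, which is close to the 36.5 MΩ used in the text.

```python
def equivalent_resistance(v_supply, leakage_currents):
    """Resistance that draws the summed leakage current at the given supply voltage."""
    return v_supply / sum(leakage_currents)


i_leak_nand = 7.9e-9    # A, leakage of one NAND gate (from equation 6.1)
i_leak_nor = 11.65e-9   # A, leakage of the NOR gate (from equation 6.1)

# Two NAND gates and one NOR gate at VDD = 1 V
r_equivalent = equivalent_resistance(1.0, [i_leak_nand, i_leak_nand, i_leak_nor])
print(f"{r_equivalent / 1e6:.1f} MOhm")
```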
Figure 6.1 depicts the leakage current as function of the length and width of the voltage
supply transistors. The width is varied in the range 1 to 16 times Wmin and the length steps
through 1, 2, 4, 8, and 16 times Lmin. The leakage current grows approximately linearly
with transistor width as expected and is reduced non-linearly with increasing gate length.
For a given gate width it is easily seen that the leakage drops
off sharply when increasing the gate length to 2 times Lmin, and it is expected that the
leakage drops further with increased gate length. This change is not visible in figure
6.1, though, due to the characteristics of the transistor model cards as described in section
3.3.2. Instead the curve bends slightly upwards at medium gate lengths, as
shown in figure 3.9 on page 31.
Figure 6.2: Leakage current (nA) and rising edge propagation delay of a nand-nor structure
driven by virtual voltage supply transistors, as function of transistor length and width.
The minimum propagation delay is achieved with the maximum width of 16 times Wmin.
The propagation delay is then 102 to 108 ps for gate lengths varying between 1 and 16 times
Lmin. Scaling down the gate width to Wmin increases the propagation delay to the range
140 to 646 ps. The NAND-NOR stage without power routing transistors has a total delay
of 72 ps according to Table D.1 in Appendix D (see footnote 6.2).
Routing power through transistors is evidently not without cost in terms of timing.
Hence, saving leakage power utilizing power routing transistors comes down to a tradeoff
between speed and power in the end.
Footnote 6.1: Please note that the graph is rotated in comparison with the leakage graph to
show the curvature of the graph.
Footnote 6.2: Falling delay of a nand2-gate, 28 ps, plus the rising delay of a nor-gate, 44 ps.
This has been verified by simulation.
Comparing data from figure 6.2(b) and 6.2(a) a good tradeoff is selected at the point
(W,L) = (6,4). Here the timing is in the flat area of the timing graph and the leakage current
is in the low range of the leakage graph. Table 6.3 presents this situation with propagation
delay in ’on’ mode and leakage current in ’off’ mode. The same simulation was conducted
with low-leakage (LL) power supply transistors. The general conclusions are the same, but
the specific results are inherently different.
It is clear that the great reduction in leakage current comes at the cost of circuit speed.
As described, this is due to swings in the virtual voltage sources as the logic block draws
power. Figure 6.4 shows the extent of the voltage swing for power routing transistors of
minimum dimensions. As the output is driven high, the voltage difference between the
virtual Vdd and virtual ground gets as low as 0.67V, which is a 33% decrease in supply. Sizing the
width of the supply transistor up to 4 times Wmin decreases this swing to 9%.
The regular Vdd can be raised to compensate for this voltage swing, and this was done
in the tradeoff situation mentioned in Table 6.3. The supply voltage was increased until
the timing of the circuit was back to its original level. At Vdd = 1.24V the timing was
comparable to original timing of the circuit. This increase in supply voltage leads to an
increase of the leakage of the circuit in ’on’ mode from 27.4nA to 53.4nA and to increased
dynamic power consumption, which will not be pursued here further.
Figure 6.4: Virtual ground and virtual Vdd voltage swings under rising edge shift of a nand-nor
structure. Traces: virtual VDD v(vdd2), output v(z) and virtual GND v(gnd2), voltages on a linear scale.
this way when the timing allows for it, but achieving a solution with minimum timing
overhead is difficult without scaling the power cutoff transistors into the very large.
One can attempt to remedy the timing overhead by increasing the supply voltage. Yet
it was shown that a very high increase (approximately 24%) was needed to restore the circuit
to full speed. This voltage increase causes the entire circuit to leak more and consume
considerably more dynamic power. If enough time slack were available in the design, a more
feasible way of reducing the leakage current would be to utilize the time slack for replacing
high-speed with low-leakage transistors. Using only one of the two power routing transistors
to reduce the effects on the propagation delay only solves the problem partially and reduces
the low leakage benefits.
The leakage power consumption savings only apply in inactive mode. This greatly lim-
its the usage of this method, but as it introduces no power overhead (other than the power
consumed by the controller) nothing can be lost (other than computational speed). Care
has to be taken when designing this way, though. Supply voltage swings may reduce the
noise margins of the gate resulting in failure in the worst case. [27, 34]
Summing up, adding power cutoff transistors reduces the leakage current in the cir-
cuitry, but has drawbacks:
• Added delay for all circuits with reasonably sized power supply transistors
Family        Logic gate     Tpd rise   Tpd fall   Leakage    Leakage w/o inverters
Static CMOS   2-input XOR    58 ps      121 ps     17.7 nA    7.9 nA
Static CMOS   3-input XOR    320 ps     160 ps     25.37 nA   10.6 nA
Table 6.1: 2- and 3-input XOR gates in static CMOS cells simulated with 70 nm HS BPTM model
cards.
Family   Logic gate                  Tpd rise   Tpd fall   Leakage w/o output inv.
CPL      Wang’s 2-input XOR          52 ps      91 ps      9.9 nA
CPL      Impr. Wang’s 2-input XOR    72 ps      121 ps     2.56 nA
CPL      Wang’s 3-input XOR          72 ps      102 ps     26.2 nA
CPL      Impr. Wang’s 3-input XOR    108 ps     159 ps     5.05 nA
Table 6.2: 2-input and 3-input XOR gates in CPL simulated with 70 nm High Speed
BPTM model cards.
Cascading CPL XOR gates will enhance the overall speed of the circuit in comparison to
static logic families, as described in Section 4.6. This theory is tested by cascading two
2-input Wang’s XOR gates to form the 3-input XOR gate needed in the full-adder. The gained time
slack was utilized by sizing up the inverters to reduce leakage power dissipation. Results
from these simulations are also shown in Table 6.2.
As theorized, the speed penalty of cascading XOR gates is very low and only raises the
falling-edge propagation delay by a little more than 10% in the worst case. The ’improved’
version trades speed for a reduction in leakage power dissipation by a factor of more than five.
A reduction in leakage by a factor of four to five is a considerable achievement. Yet the
leakage in these circuits originates from ’voltage source to input’-leakage, which must be
considered more thoroughly.
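The factors quoted above follow directly from the figures in Table 6.2. The short Python
sketch below recomputes them. The variable names are chosen here for illustration.

```python
# Values taken from Table 6.2 (falling-edge delay in ps, leakage in nA)
wang_2in = {"tpd_fall": 91.0, "leak": 9.9}
wang_3in = {"tpd_fall": 102.0, "leak": 26.2}
improved_2in = {"tpd_fall": 121.0, "leak": 2.56}
improved_3in = {"tpd_fall": 159.0, "leak": 5.05}

# Relative falling-edge delay increase when going from the 2-input to the cascaded 3-input gate
cascade_delay_increase = wang_3in["tpd_fall"] / wang_2in["tpd_fall"] - 1.0

# Leakage reduction factors of the 'improved' versions
leak_factor_2in = wang_2in["leak"] / improved_2in["leak"]
leak_factor_3in = wang_3in["leak"] / improved_3in["leak"]

print(f"{cascade_delay_increase:.1%}, {leak_factor_2in:.2f}x, {leak_factor_3in:.2f}x")
```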
Since inputs are ideal voltage sources in the simulations, the input values are always
guaranteed to be at the specified value and stable. This assumption is not valid for a real
input, since such an input is driven by other logic circuits. The assumption is valid, though,
as long as no power is drawn from the input source. No current is drawn when:
• Inputs are only connected to transistor gates that do not draw currents to drive signal
values and
Figure 6.6: Leakage current (nA) of a 2-input Wang’s XOR gate with self-induced, alternated input
value (mV) on one input.
Since Wang’s XOR gate uses the inputs to drive the input to the inverter high, leakage
current can flow out through the inputs when they are at their low value. Therefore an ideal
input voltage source is not a realistic assumption in this case. Input values will be driven
by logic on the input, which has a certain IDS/Vout-characteristic, i.e. the voltage of the
input value changes depending on the current the driving logic has to conduct.
Substituting the ideal input voltage source by a voltage source with a resistor in series
simulates a more realistic input source in the steady state. By altering the resistance value
the input value is varied. Figure 6.6 depicts the leakage current of a 2-input Wang’s XOR gate
as function of input value. The leakage increases with the input value. The main reason for
this increase is that the input both drives the nMOS transistor that disconnects it from the
pMOS pull-up network and acts as pull-down voltage source. Adding some resistance to the
input raises the input value above VSS (due to the leakage current through the resistor),
which increases the conductance of the nMOS transistor, leading again to increased leakage
and so forth. This happens for both inputs.
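The self-reinforcing mechanism can be illustrated with a toy model. This is not the
simulation setup of this work. It assumes a generic exponential subthreshold
characteristic I = I0 * exp(V / S) for the leaking nMOS transistor, where the
zero-bias leakage I0 = 10 nA and the slope parameter S = 39 mV are hypothetical values
chosen only for illustration. The input voltage then has to satisfy V = R * I(V), which is
solved by fixed-point iteration in the Python sketch below.

```python
import math


def settled_input_voltage(r_ohm, i0=10e-9, slope_v=0.039, iterations=200):
    """Solve V = R * I0 * exp(V / slope_v) by fixed-point iteration.

    Toy model with hypothetical parameters. The iteration only converges for
    small resistances (R * I0 well below slope_v); larger values are outside
    the range where this simple model is meaningful.
    """
    v = 0.0
    for _ in range(iterations):
        v = r_ohm * i0 * math.exp(v / slope_v)
    return v


def leakage(r_ohm, i0=10e-9, slope_v=0.039):
    """Leakage current once the input voltage has settled."""
    v = settled_input_voltage(r_ohm, i0, slope_v)
    return i0 * math.exp(v / slope_v)


for r in (0.0, 0.5e6, 1.0e6):
    print(f"R = {r / 1e6:.1f} MOhm: V = {settled_input_voltage(r) * 1e3:.1f} mV, "
          f"I = {leakage(r) * 1e9:.1f} nA")
```

In this model the settled input voltage is higher than the simple product R * I0, because
the raised input voltage itself increases the current. This is the feedback described in
the paragraph above.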
The Yano 3-input XOR gate is a gate driven only by input values and has no connec-
tions to either VDD or VSS . Output values are inverted to form the correct values and to
drive following logic circuits. This gate was implemented and simulated, and leakage cur-
rent values are depicted in Table 6.3. The three columns contain leakage current measure-
ments from input inverters (II), output inverters (OI) and current drawn from inputs (In)
in the steady state. The high leakage current originates from the weak pull-up of nMOS-
transistors causing the output inverters to operate close to the boundary to the cutoff re-
gion. At around 800mV on the gates of the output inverters, these inverters leak consider-
ably.
Introducing pass-gates instead by adding pMOS transistors remedies this situation. The
main problem is then the current drawn from the inputs. This might have been reduced by
sizing up the transistor length, but since the gate was found to be slower than the static
CMOS implementation, there is no time slack left to trade for lower leakage.
In the Yano XOR gate there are no connections to voltage sources and the output is
connected to the gates of inverters, so no leakage current should be possible internally in the
gate. Yet on closer inspection of the circuitry it becomes apparent that there are paths from
the node B to its complement (B inverted) containing only one nMOS transistor. No matter what
value A might assume, one nMOS transistor will be conducting and one not. The same applies to
the value C. Picking two arbitrary discrete values for the inputs A and C and disregarding the
conducting transistors, the Yano XOR gate becomes 4 nMOS-transistors in parallel driven by
inverters of opposite output values. This is the explanation of why the CPL gate leaks more
than the equivalent static CMOS implementation. Adding pMOS-transistors to form pass-gates
just increases the problem, as the leakage through these transistors adds to the sum of
leakage.
Wang’s XOR gate suffers from somewhat the same problem. Using input values to drive
outputs directly causes the leakage from the few, but present, voltage sources to affect input
values, causing further leakage. So in general it can be concluded that removing voltage
sources and using input values to drive internal nodes affects the internal signal value
stability and causes inverters and drive buffers to leak considerably. CPL reduces the need
for transistors due to the passing of input values, but this in turn increases the leakage
through paths that are not so easily identified. Leaking paths are even harder to predict if
CPL were to be used for cell based design, as the leakage source and drain in many cases
will be placed in two different cells.
In general it must be concluded that:
• Signal value variations due to passing of input values are hard to control causing
considerable leakage
• Increasing the number of transistors in series further alters signal values causing in-
creased leakage
• The speed gained by CPL does not give enough time slack to improve the logic gates’
sensitivity to signal variations
Figure 6.7: Transistor netlists of the AND-OR (left) and 3-input XOR Domino logic gates.
The XOR gate was implemented in an nMOS pull-down Domino block and simulated.
Figure 6.7 depicts this block. First the basic gate was investigated. The leakage of the block
was clearly much reduced and the pull-down propagation time was less than the equivalent
time of the static CMOS gate. Table 6.5 shows results from this simulation.
The time slack was utilized to size up the length of the pMOS device to 4.5 ∗ Lmin . The
nMOS device could, within time bounds, be scaled to 1.4 ∗ Lmin . This design is called the
improved design. The leakage hereby was reduced by up to a factor of 16.
Further, by instead using low-leakage clocking transistors the leakage could further be
reduced. The leakage of this optimized gate is around 4.9pA which is a factor of 3,000 less
than the static CMOS implementation.
Using an nMOS Domino logic block for the AND-OR case (Figure 6.7) yields equally good
results. The simulation run is essentially the same as for the XOR logic block. First a basic
implementation proved to be superior in time, so the time slack was used to build the
improved lower leakage design by transistor sizing. Thereafter low-leakage transistors replaced
the high-speed clocking transistors, and even better results were achieved. The results are
shown in Table 6.6.
As described in section 3.2.3 many different models for gate leakage can be found in the
literature. The models are typically based on a statistical study from a given process, from
which a model has been formulated by exponential regression. These models contain fac-
tors specific for the given process. Hence, since no analysis could be found with the same
model parameters and supply voltages as used here in this work, these models do not
apply in this case.
Instead, a rather crude model can be formed from the knowledge that around the year
of introduction of 70nm processes, the total gate leakage will be as large as the total
subthreshold leakage. A design built with the simulated CyHP library with maybe 40%
registers will have an average subthreshold leakage of around 4nA per transistor (in the
non-conducting state) in the design.
One study, though, is very interesting in this respect. The paper [36] incorporates gate
leakage models for a 70nm process into BPTM transistor models and simulates for gate
leakage. The gate leakage printed in the paper is 50nA per nMOS transistor with VDD = 1V
and Tox = 10Å. The gate leakage decreases by an order of magnitude for each added 2Å of
gate-oxide thickness or for each 0.3V by which VDD is lowered.
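The scaling rule can be written as a simple decade model around the reference point of
50nA at Tox = 10Å and VDD = 1V. The Python sketch below is an illustration only. It
assumes that the leakage falls by one decade per 2Å of added oxide thickness and per 0.3V
of supply voltage reduction, and that the rule may be extrapolated log-linearly, which [36]
is not claimed to guarantee. The function and parameter names are chosen here.

```python
def gate_leakage_nA(tox_angstrom, vdd, i_ref_nA=50.0, tox_ref=10.0, vdd_ref=1.0,
                    tox_per_decade=2.0, vdd_per_decade=0.3):
    """Decade model of gate leakage per nMOS transistor, in nA.

    Assumption: one decade less leakage per tox_per_decade of added oxide
    thickness and per vdd_per_decade of supply voltage reduction.
    """
    decades = (-(tox_angstrom - tox_ref) / tox_per_decade
               + (vdd - vdd_ref) / vdd_per_decade)
    return i_ref_nA * 10.0 ** decades


print(gate_leakage_nA(10.0, 1.0))   # reference point
print(gate_leakage_nA(12.0, 1.0))   # 2 Angstrom thicker oxide
print(gate_leakage_nA(16.0, 1.0))   # oxide thickness of the model cards used in this work
```

Under these assumptions a 16Å oxide gives about 50pA at 1V, which illustrates why the gate
leakage is hardly visible with the transistor models used here.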
The transistor models in this work have 16Å gate-oxide thickness. This oxide thickness
does not allow for much voltage scaling. In a real process, the Tox must be assumed to
be thinner. Furthermore, process variations can easily cause variations of several Ångström
in the oxide thickness, causing an increase of up to several orders of magnitude [37] in the
gate leakage.
A process with 1V supply voltage and a gate-oxide thickness of 10Å will then leak 50nA
per transistor. Process variations can increase this problem by more than an order of
magnitude, since a process variation in the gate-oxide of only 2Å is required to cause this.
Even assuming no process variations, the gate leakage still causes major problems for dynamic
logic.
To evaluate the impact of leaking gates on the total leakage of Domino gates, a resistor
(Rleak) was connected between the output and ground, as in Figure 6.9. Assuming a gate
leakage of either the 4nA estimated in this work or the 50nA from [36], the resistor value
becomes 125MΩ and 10MΩ respectively, when two transistor gates are driven by the
dynamically held output.
Figure 6.8: Voltage change (mV) of a dynamically held output as function of clock frequency (MHz).
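The two resistor values follow from Ohm's law with VDD = 1V across the resistor and two
leaking transistor gates on the node. A minimal Python sketch of this calculation, with a
helper name chosen here:

```python
def leak_resistor(vdd, gate_leakage_a, n_gates):
    """Resistor that draws the same current as n_gates leaking gates at vdd."""
    return vdd / (gate_leakage_a * n_gates)


r_estimate = leak_resistor(1.0, 4e-9, 2)     # 4 nA per gate, estimated in this work
r_reference = leak_resistor(1.0, 50e-9, 2)   # 50 nA per gate, from [36]
print(f"{r_estimate / 1e6:.0f} MOhm, {r_reference / 1e6:.0f} MOhm")
```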
Precharging the dynamically held output to VDD and applying a non-pulldown input
vector, the effect of leakage can be measured at the end of the evaluate phase. Figure 6.8
shows the voltage drop that the dynamically held node experiences as function of clock
frequency. Naturally, as the clock frequency is increased, the voltage drop decreases due to
the shortened time the output has to be held high dynamically.
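The trend can be reproduced with a first-order RC model of the dynamic node. This is a
rough sketch and not the simulation behind Figure 6.8. It assumes that the node is
discharged only through Rleak, that the evaluate phase lasts half a clock period, and that
the node capacitance is 1fF. The capacitance value in particular is hypothetical.

```python
import math


def droop_v(f_clk_hz, r_leak_ohm=10e6, c_node_f=1e-15, vdd=1.0):
    """Voltage lost by a precharged node at the end of the evaluate phase.

    First-order model: V(t) = vdd * exp(-t / (R * C)), with the evaluate phase
    assumed to last half a clock period. c_node_f is a hypothetical value.
    """
    t_evaluate = 0.5 / f_clk_hz
    return vdd * (1.0 - math.exp(-t_evaluate / (r_leak_ohm * c_node_f)))


for f_mhz in (200, 500, 1000):
    print(f"{f_mhz} MHz: {droop_v(f_mhz * 1e6) * 1e3:.0f} mV")
```

The model only shows the qualitative behaviour, namely that a shorter evaluate phase leaves
less time for the node to discharge.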
The leakage currents of 8nA and 100nA are simulated by the Rleak resistor, and the
3-input XOR with the achieved leakage improvements is examined again. The dynamically
held node can be kept high by using a bleeder transistor or simply by a resistor. The resistor
can be sized very precisely to match the leakage.
Calculating the resistance value follows these steps: The leakage is set to 100nA and the
maximum voltage drop is relaxed from VDD /8 to VDD /4 to ease the design of the bleeder
device. With these values the resistor becomes:
\[
R_{\text{pull-up}} = \frac{V_{DD}/4}{I_{\text{leak}}}
                   = \frac{0.04\,\mathrm{V}}{100\,\mathrm{nA}}
                   = 4 \cdot 10^{5}\,\Omega
\tag{6.2}
\]
In the case where the nMOS network is supposed to pull down, the resistor will then
leak 1V / 400,000Ω = 2.5µA, which is unacceptable. An alternative way is to use a bleeder
transistor that can be turned off when the output is at certain levels. This is depicted in
Figure 6.9. This turns the logic family into a semi-static family, though.
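The sizing of the pull-up resistor and its cost can be recomputed as follows. The Python
sketch uses the 40mV voltage drop and the 100nA leakage from the text. The helper name is
chosen here for illustration.

```python
def pull_up_resistance(max_drop_v, leakage_a):
    """Largest resistor that supplies leakage_a while dropping at most max_drop_v."""
    return max_drop_v / leakage_a


r_pull_up = pull_up_resistance(0.04, 100e-9)   # 40 mV drop at 100 nA

# Static current through the resistor when the nMOS network pulls the node to ground
vdd = 1.0
i_static = vdd / r_pull_up

print(f"R = {r_pull_up:.0f} Ohm, static current = {i_static * 1e6:.1f} uA")
```

The static current of 2.5µA is 25 times the 100nA gate leakage that the resistor was meant
to compensate, which is why the resistor solution is rejected.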
The design of this transistor is rather difficult. The transistor has to be able to deliver
100nA at a drain-source voltage of 40mV, so it has to be quite strong.
However, the transistor must not be so strong that it prevents the nMOS network from
pulling down. Either a very wide transistor or an ultra-low Vth transistor is
needed. Both will leak considerably.
Here, a simulation setup was made consisting of the 3-input XOR gate with 1GHz clock
frequency and a maximum voltage drop of 40mV. A low-Vth transistor was used as
bleeder transistor and sized to match the required drive strength of 100nA at 40mV
drain-source voltage.
First, the bleeder transistor was measured to be pulling high adequately at the device
sizes L = 9 ∗ Lmin, W = 1 ∗ Wmin. Adding this transistor makes the pull-down nMOS
transistor inadequate, so it was sized up to W = 4 ∗ Wmin. The pull-up
pMOS transistor was then no longer capable of pulling high, so the length of that transistor
had to be reduced to L = 3.5 ∗ Lmin.
Figure 6.9: Adding a resistor (Rleak) to the output simulates leaking gates. Possible solutions
could be to add a pull-up transistor or resistor.
With the resistor connected the gate leaks around 100nA, naturally. This is, however,
the worst case leakage, which not all Domino gates will experience. With the resistor
removed, the leakage remains around 50nA. This is partly due to the altered clocking
transistors, and also due to the output inverter. As the bleeder and clocking pMOS devices
leak into the gate region of the output inverter, the voltage on the gates increases and
causes high leakage.
6.5 MacroCMOS
The evaluation of MacroCMOS follows in three steps. First, the full-adder is implemented
in MacroCMOS fashion. Secondly, a circuit is examined that shows the logic optimizations
possible when building cells on-the-fly. The third evaluation consists of a large block that is
optimized to match the best case Synopsys implementation.
Before these analyses were done, a proof-of-concept simulation run was completed. The
results from this survey are described at the end of this section.
Figure 6.11: 1: The small gate. 2: CyHP implementation. 3: Transistor netlist of the 6-input cell
library cell. 4: Optimized transistor netlist.
Figure 6.13 (leakage in nA): 1: The small gate. 2: CyHP implementation. 3: Transistor netlist of
the 6-input cell library cell. 4: Optimized transistor netlist.
the leakage from the inverters and comparing the leakages of the two implementations.
This way only the optimized bits are compared. The results from this comparison are shown
in figure 6.15.
Gate / Inputs   2         3        4         8
NAND            3.9 nA    4.5 nA   2.1 nA    0.68 nA
XOR             15.4 nA   -        91.4 nA   -
Figure 6.16: Average leakage of a NAND and XOR gate with minimum sized transistors.
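The numbers in Figure 6.16 can be normalized per input to compare gate sizes. The Python
sketch below does this, and additionally compares one 4-input XOR with three 2-input XOR
gates in cascade. The cascade comparison rests on two simplifying assumptions made here:
that three 2-input gates implement the 4-input function, and that the leakage of the
individual gates simply adds up.

```python
# Average leakage in nA, from Figure 6.16 (keys are the number of inputs)
nand_leakage = {2: 3.9, 3: 4.5, 4: 2.1, 8: 0.68}
xor_leakage = {2: 15.4, 4: 91.4}


def per_input(table):
    """Leakage per input for every gate size in the table."""
    return {inputs: leak / inputs for inputs, leak in table.items()}


nand_per_input = per_input(nand_leakage)
xor_per_input = per_input(xor_leakage)

# Assumption: a 4-input XOR built from three cascaded 2-input XOR gates
xor4_cascade = 3 * xor_leakage[2]

print(nand_per_input)
print(xor_per_input)
print(xor4_cascade, xor_leakage[4])
```

With these figures the per-input leakage of the NAND gate falls with gate size, while that
of the XOR gate rises.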
other transistors to be scaled for lower leakage. It is believed by the author that a
considerably better result could have been achieved given enough time to derive an automated
process.
MacroCMOS will be discussed further in the following chapters.
CHAPTER 7
DISCUSSION OF RESULTS
Contents
7.1 Results
    7.1.1 MTCMOS
    7.1.2 Complementary pass-transistor logic
    7.1.3 Domino logic
    7.1.4 MacroCMOS
7.2 The chosen candidate for cell library implementation
This chapter will present the key reasons for the selection of the logic family for
implementation of a cell library. Results from the previous simulations will be
discussed briefly to determine whether or not general conclusions can be drawn
from the example simulation cases.
7.1 Results
The results from the simulations are presented here in short and the candidate for a cell
library implementation is selected based on these considerations.
7.1.1 MTCMOS
Cutting off power to a region in periods of no activity proved a good solution to reduce
leakage. A factor of 2000 and even more can be saved depending on the amount of speed
one would be willing to sacrifice. The factor of 2000 came with a delay penalty of 87.5%.
An implementation built with low-leakage transistors can be sized to match this per-
formance. This implementation would not need a controller, that cannot be switched off,
or extra hardware. Therefore, MTCMOS did not prove to be better than an existing LL/HS
cell based implementation.
7.1.2 Complementary pass-transistor logic
The concept of having multiple stages after each other without voltage rail connections
reduces the leakage due to the left out connections, but degrades the quality of the signal
voltage levels and thereby causes leakage.
Furthermore, as the same signal is used as input value and voltage source, the circuitry
becomes very sensitive to process variations and long wires, both causing non-ideal con-
nections between the logic blocks.
An XOR gate that matched the speed of the equivalent CMOS gate was built with a leak-
age reduction of around 50%, but this gate was proven to be very sensitive to variations in
input value levels. Introducing gate leakage would further have increased these problems.
These problems will apply to any circuit built with CPL logic.
In general, it is not possible to design for low leakage using the CPL logic family. In this
work it is not explored whether CPL can be utilized to further decrease the leakage of a
design built from low-leakage transistors. It can be speculated that LL transistors are not
so sensitive to the described effects. Yet again, one would probably choose to increase Vth
even further for this purpose instead.
7.1.4 MacroCMOS
The design style proposed in this work is MacroCMOS. The analysis here totals three ex-
ample implementations. It is usually difficult to prove something in general from a few
examples. Yet, the examples show general optimizations that are not possible with current
cell libraries.
The full-adder example showed that this design, which is very parallel and not very
well suited for MacroCMOS, could be built with around 33% leakage reduction. The six-input
AND-OR gate showed that logic optimization without a static cell library enables
optimizations for low leakage. Further, it showed that larger cells leak less than smaller ones.
The nine-input MacroCMOS cell design showed that in many cases a randomly generated
logic block can be built with the same delays and with far better leakage reductions
than using current synthesis tools and cell libraries. It was also shown that this is not
always true. The XOR gate is better left out of a larger block in many cases. The synthesis
tool must explore the design space in every case to search for the best solution.
Footnote 7.1: The case with 8nA gate leakage was simulated for verification. Equally bad results
were encountered.
7.2. THE CHOSEN CANDIDATE FOR CELL LIBRARY IMPLEMENTATION 75
Contents
8.1 Synthesis of MacroCMOS
    8.1.1 Current synthesis
    8.1.2 Proposed synthesis flow
8.2 The MacroCMOS cell library
    8.2.1 Data required by the cell estimator
    8.2.2 Data required by the cell generator
    8.2.3 The total of new requirements to cell libraries
    8.2.4 Modelling propagation delay in a MacroCMOS cell library
    8.2.5 Modelling power consumption in a MacroCMOS cell library
    8.2.6 Layout of MacroCMOS cells
8.3 Optimizing a design for low leakage with MacroCMOS
    8.3.1 Input optimizations
    8.3.2 Internal scaling
    8.3.3 Structural considerations
    8.3.4 Trading time slack for low leakage
    8.3.5 Gaining time slack
8.4 Further issues
    8.4.1 Physical synthesis
    8.4.2 Gate leakage
    8.4.3 Dynamic power consumption
    8.4.4 MOS device degradation
This chapter describes the new synthesis flow proposed in this work. This new
synthesis flow places new requirements on the cell libraries. A cell library designed
to match the requirements set by the synthesis tool is proposed.
With the MacroCMOS cell library and synthesis tool, a range of optimization
algorithms can be devised to take advantage of this new synthesis paradigm. A
number of possible optimizations enabled by MacroCMOS style synthesis
are presented at the end of this chapter.
78 A Cell Library for Low Leakage
[Figure 8.1: synthesis flow diagram, two flows side by side. Recoverable labels: Retiming;
Boolean expressions; Cell library; Cell estimator; Cell generator; Layout / Wireload + Check;
Final check; feedback paths A, B, C, D and E.]
In a current synthesis flow the abstract problem is broken down into sub-problems until
an RTL level of boolean expressions is reached. The full-adder example given in Figure 2.5
demonstrates this. At the boolean expression step the design is represented by graphs of
logic. Breaking these graphs into sub-graphs that can be mapped onto logic cells is
done iteratively using the logic functions supplied by the cell library.
Since cells cannot be scaled during synthesis, but only replaced by one of the limited
number of equivalent cells with other drive strengths, this step can be quite time consuming.
One could speculate that a cell library with an infinite number of cells would be perfect for
this task. Yet, as the design space grows dramatically when the number of cells is increased,
this might not be beneficial in terms of synthesis time.
If no solution can be found at this level, the synthesis tool must go back and reorder
the logic (Figure 8.1, A). When a solution is found, the cells are laid out by a place & route
tool and wire loads are determined. If this solution does not meet the timing requirements,
a single cell might be changed (B). Otherwise, one has to go back and either rewrite
the design code or change the design parameters given to the synthesis tool.
The main boundary here is the interface between the synthesis tool and the cell library. If
no cell can be found that matches the requirements, a complete reordering of the logic is
required. Further, the possibilities for logic optimizations for low leakage are, as discussed,
very limited.
At all levels it must be ensured that infinite loops do not occur. Such a loop arises if
two conflicting adjustments are made alternately forever, so that no convergence towards a
solution is possible.
Caching is a powerful tool to fight this problem. By caching earlier solutions, the tools
can be prevented from retrying them. Another way is to limit the number of retries
that every level in the synthesis flow is allowed to make.
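The two safeguards described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the proposed tool: the function name `search_with_cache`, the idea that each level receives a stream of candidate solutions, and the `meets_requirements` callback are all hypothetical.

```python
from typing import Callable, Hashable, Iterable, Optional, Set


def search_with_cache(
    candidates: Iterable[Hashable],
    meets_requirements: Callable[[Hashable], bool],
    max_retries: int,
) -> Optional[Hashable]:
    """Return the first acceptable candidate, or None if the budget runs out.

    candidates         -- stream of proposed solutions (hypothetical interface)
    meets_requirements -- stand-in for the timing/leakage check of one level
    max_retries        -- upper bound on proposals this level may consume
    """
    tried: Set[Hashable] = set()  # cache of solutions already evaluated
    for proposals, candidate in enumerate(candidates):
        if proposals >= max_retries:
            return None  # retry budget exhausted: hand control back upwards
        if candidate in tried:
            continue  # cached: never evaluate the same solution twice
        tried.add(candidate)
        if meets_requirements(candidate):
            return candidate
    return None
```

With two adjustments that keep undoing each other, the cache ensures each one is evaluated only once, and the retry limit guarantees that the level terminates even if the proposals alternate forever.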
Caching the transistor lists from the cell estimator after a cell has been finished can speed
up the cell generation when the cell can be reused in a future design. The cell generator
still needs to be invoked though, to get the optimum low-leakage cell and to take new
circumstances into account.
The proposed synthesis flow requires a complete restructuring of current synthesis flows.
An alternative is possible, though. Synthesizing a design with a current synthesis
tool produces a netlist of logic gates for the place & route tool.
This netlist could be read by a post-synthesis tool that locates areas of logic for the
design of larger low-leakage cells. This tool can read the area, timing etc. of the cells in
question from the cell library, and then generate a lower-leaking cell matching these
characteristics. Clearly, this does not produce as good a solution as the proposed flow, but it
requires fewer alterations to a current synthesis flow.
solutions.
A possible way is to include predefined pull-up and pull-down networks in a new cell
library, from which larger cells can be built. It is infeasible to design a static cell library
with all possible logic functions. But by allowing the synthesis tool to combine predefined
networks, either serially or in parallel, all possible logic functions can be evaluated for
leakage, propagation delay etc. This forms a cell library with a virtually infinite number of
cells. A few of the structures are depicted in Figure 8.2.
The number of structures needed in this library is only a small fraction of all possible
logic functions, as these can be built from combining networks. How these networks must
be described in the cell library for use in this synthesis process is explored through the
requirements set to the cell library by the cell estimator and the cell generator.
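As a rough illustration of combining networks serially or in parallel, the following Python sketch represents a pull-down network as a tree and derives the matching pull-up network as its series/parallel dual. The class and function names are hypothetical, and the model is purely logical: it ignores sizing, leakage and delay, which the cell estimator and cell generator would have to add.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Device:
    gate: str  # name of the input signal driving this transistor


@dataclass(frozen=True)
class Series:
    parts: Tuple["Network", ...]


@dataclass(frozen=True)
class Parallel:
    parts: Tuple["Network", ...]


Network = Union[Device, Series, Parallel]


def conducts(net: Network, inputs: Dict[str, bool]) -> bool:
    """True if an nMOS network conducts for the given input values."""
    if isinstance(net, Device):
        return inputs[net.gate]
    results = [conducts(part, inputs) for part in net.parts]
    return all(results) if isinstance(net, Series) else any(results)


def dual(net: Network) -> Network:
    """Pull-up topology for a pull-down network: swap series and parallel."""
    if isinstance(net, Device):
        return net
    parts = tuple(dual(part) for part in net.parts)
    return Parallel(parts) if isinstance(net, Series) else Series(parts)


def pullup_conducts(net: Network, inputs: Dict[str, bool]) -> bool:
    """pMOS devices conduct when their gate is low, so invert every input."""
    return conducts(net, {name: not value for name, value in inputs.items()})


def cell_output(pulldown: Network, inputs: Dict[str, bool]) -> bool:
    """Static CMOS output: low whenever the pull-down network conducts."""
    return not conducts(pulldown, inputs)
```

For example, `Parallel((Series((Device("A"), Device("B"))), Device("C")))` describes the pull-down network of an AND-OR-invert cell, and `dual` of that network gives a pull-up network that conducts exactly when the pull-down network does not.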
Furthermore, the cell estimator needs wire load models for external wires. Internal wire
loads are included statistically in the values described above. This can be accepted as most
of the wire load lies in the external wires. The external wire load models could well be
implemented as in the Liberty format.
Internal wires have to be included in this step. During simulation of a cell, wire models
of internal wires are needed. External wire loads can be added in two ways. Either the
wire loads are added as statistical models in this step and the cell is simulated. Or the wire
loads are first added after all cells have been laid out, and then the circuit is fitted through
simulation.
A good solution would be to use the statistical wire load models to size the transistors
and then lay them out. When all have been laid out final adjustments can be made by
retrieving the actual wire loads from the design. No matter where in the process wire loads
are added, the cell library is required to include wire load models.
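A statistical wire load model of the Liberty kind can be evaluated as sketched below. The numbers are copied from the `wire_load(maxarea_000980)` group in Appendix B; the function names are hypothetical, the units are whatever the library declares, and the handling of fanouts that fall between table entries is deliberately left out.

```python
from typing import Dict, Tuple

# Values copied from the wire_load(maxarea_000980) group in Appendix B.
RESISTANCE_PER_LENGTH = 0.00023   # library resistance units per unit length
CAPACITANCE_PER_LENGTH = 0.00018  # library capacitance units per unit length
SLOPE = 9.35                      # extra length per fanout beyond the table
FANOUT_LENGTH: Dict[int, float] = {1: 9.35}  # fanout_length(1, 9.35)


def estimated_length(fanout: int) -> float:
    """Statistical net length: table lookup, then extrapolation by SLOPE."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if fanout in FANOUT_LENGTH:
        return FANOUT_LENGTH[fanout]
    largest = max(FANOUT_LENGTH)
    if fanout > largest:
        return FANOUT_LENGTH[largest] + SLOPE * (fanout - largest)
    # Simplification of this sketch: no interpolation between table entries.
    raise ValueError("fanout falls in a gap of the table")


def wire_load(fanout: int) -> Tuple[float, float]:
    """Return (resistance, capacitance) of the estimated external wire."""
    length = estimated_length(fanout)
    return length * RESISTANCE_PER_LENGTH, length * CAPACITANCE_PER_LENGTH
```

Once the cells have been laid out, the estimate returned by `wire_load` would be replaced by the wire load extracted from the actual layout, as described above.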
Further considerations, such as noise sources, rules about layout etc. which are not cov-
ered here, need to be incorporated in the cell library as well.
From this description of the task of the cell generator it is evident that the cell library is
further required to include the following:
[Figure 8.3: diagram not recoverable from the extraction. Surviving labels: R, Tpd, B, C.]
Figure 8.3 gives the general idea. For each structure with different transistor sizings a
table of propagation delays is required. Interpolation between equal structures with different
transistor sizings is possible if the difference between sizings is reasonably small.
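The interpolation between two characterized sizings of the same structure could look as follows. This is a minimal sketch assuming plain linear interpolation and delay tables of identical shape; the function name, the use of a single scalar sizing parameter and the table indexing are assumptions made for illustration only.

```python
from typing import List

Table = List[List[float]]  # propagation delays, indexed as the library defines


def interpolate_delay_tables(
    size_a: float, table_a: Table,
    size_b: float, table_b: Table,
    size: float,
) -> Table:
    """Linearly interpolate between the delay tables of two sizings."""
    if size_a == size_b:
        raise ValueError("the two characterized sizings must differ")
    low, high = min(size_a, size_b), max(size_a, size_b)
    if not low <= size <= high:
        raise ValueError("requested sizing lies outside the characterized range")
    weight = (size - size_a) / (size_b - size_a)
    return [
        [a + weight * (b - a) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]
```

The range check reflects the condition stated above: the result is only trusted between two characterized sizings that are reasonably close, never extrapolated beyond them.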
When the cell has been assembled in the cell estimator, wire loads are added, and
propagation delays are checked again. More factors, such as input transition time, are included
in the cell generator step, which is done by real simulation. The tables need to be
conservative enough to ensure that the cell generator can generate the cell according to the
specifications from the cell estimator.
[Figure: transistor schematic between VDD and VSS with input A; not recoverable from the extraction.]
removed, minimizing the paths between the voltage rails. Secondly, the increased speed
allowed for transistors to be resized reducing the leakage.
Knowledge of input values can also be useful in the synthesis for low leakage. If a
working model of a design in an HDL is available, statistical information about signal values
can be extracted from it. The synthesis tool at the boolean expression level hands this
statistical information over to the cell estimator. If no model is available, statistical
information can be extracted directly from the design. An eight-input AND gate, for example,
can be assumed to produce a logic zero on the output for the major part of the time.
This statistical information can be used to help structure the logic blocks beneficially.
If a signal is typically ’0’, the nMOS transistor that is fed by this signal should be placed
closest to VSS, because the leakage depends on the structure of the transistor stack (Figure 3.11).
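Both statements can be made concrete with a small Python sketch. It assumes statistically independent inputs, which the text does not claim, and the function names and probability values are hypothetical.

```python
from math import prod
from typing import Dict, List, Sequence


def and_output_one_probability(p_one: Sequence[float]) -> float:
    """Probability that an AND gate outputs '1', assuming independent inputs."""
    return prod(p_one)


def order_nmos_stack(p_zero: Dict[str, float]) -> List[str]:
    """Order the inputs of an nMOS stack from VSS upwards.

    The input that is most often '0' comes first, i.e. closest to VSS.
    """
    return sorted(p_zero, key=lambda name: p_zero[name], reverse=True)
```

With eight independent inputs that are each '1' half of the time, `and_output_one_probability([0.5] * 8)` evaluates to 1/256, so the output is a logic zero for more than 99% of the time.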
[Figure 8.5 diagram labels: I, A, X, Y, Z, Pu, Pd, Pu*, Pd*.]
Figure 8.5: Building a larger block from any logic block and a smaller block.
With a static cell library of high-speed (HS) and low-leakage (LL) cells, leakage power op-
timizations are typically done by replacing HS cells with the corresponding LL cells. Since
LL cells have higher propagation delays, this procedure is only possible when enough time
slack is available.
Figure 8.6 presents this. On the left hand side of the figure the distribution of all path
delays is shown in black. The red line indicates the percentage of low-leakage cells. As the
path delay increases, the possibilities of replacing high-speed cells with the corresponding
low-leakage versions diminish due to timing issues. Therefore, most LL cells will be placed on
low delay paths.
The right hand side of Figure 8.6 represents the same concept. Here the paths have been
sorted by path delay and are presented with the largest delays horizontally at the top of
the figure. When timing allows for it, a cell is replaced with an LL cell. If more time slack is
available this procedure is repeated until all cells are low-leakage cells. In the figure two
replacements are depicted.
Three regions are of interest here. Region A represents paths which are somewhat
too fast for their timing requirement, but not fast enough to use LL cells. Regions B and
C represent paths where all, or a maximum number of, cells have been replaced by LL
cells, but which still have some time slack available for optimization.
When the difference in propagation delay between HS and LL cells exceeds the available
time slack, no optimizations can be done when using a cell library of static cells. If cells
could be scaled to match the time slack, a great deal of leakage would be saved. Since
the drive of a transistor scales linearly with gate length (equation 3.4 on page 27) and the
leakage scales exponentially (equation 3.5 on page 27), scaling up the transistor gate length
to match the timing requirement causes the leakage to decrease exponentially.
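The trade-off can be illustrated numerically. The sketch below does not reproduce equations 3.4 and 3.5; it assumes a stand-in model in which delay grows linearly with gate length and leakage falls off exponentially with a hypothetical decay length. All names and numbers are illustrative assumptions.

```python
import math


def length_for_slack(delay: float, slack: float, l_min: float) -> float:
    """Largest gate length whose linearly scaled delay still fits delay + slack.

    Assumes delay is proportional to gate length (illustrative model only).
    """
    return l_min * (delay + slack) / delay


def leakage_ratio(length: float, l_min: float, decay_length: float) -> float:
    """Leakage relative to the minimum-length device.

    Assumes an exponential dependence with a hypothetical decay length.
    """
    return math.exp(-(length - l_min) / decay_length)


# Example: a 100 ps path with 10 ps of slack in a 70 nm process.
scaled = length_for_slack(delay=100.0, slack=10.0, l_min=70.0)   # about 77 nm
saving = leakage_ratio(scaled, l_min=70.0, decay_length=10.0)    # about 0.50
```

Under these assumed numbers a slack of only 10%, too small for a swap to an LL cell, would still roughly halve the leakage of the scaled devices. The actual figures depend on the process and on the real form of the equations.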
The full-adder example from Chapter 3 showed that building logic blocks together in
MacroCMOS blocks can produce logic blocks that are faster than the original one. This
time slack can be used to lower the leakage of the cell by using LL transistors or by scaling
up the length of some or all transistors.
[Figure 8.6 plot labels: Percentage (up to 100%); Percentage LL cells; Distribution of paths;
Delay of a HS path; Delay of a LL path; regions A and C; horizontal axes tpd with tmax marked.]
Figure 8.6: Distribution of path delays with percentage of LL cells (in red). On the right hand side:
All paths sorted by path delay. The delay overhead of using LL cells (in red).
[Figure 8.7 plot labels: Stage 1; Stage 2; All paths sorted by delay in Stage 1; axis t with tmax marked.]
Figure 8.7: Retiming to meet timing requirements (A). Further retiming to balance delays in two
pipeline stages (B).
Another way of gaining time slack is by retiming. Retiming serves many purposes, such
as dividing logic between pipeline stages to meet timing requirements, and it can also be
used to equalize the time slack on both sides of a pipeline register, for example.
By moving a part of the delay from a stage that barely meets the timing requirements
to a stage with available time slack the time slack on both sides of the register is balanced
(Figure 8.7). This enables further leakage reductions by trading time slack for low leakage.
This can only be done if the structure of the logic allows for it.
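The balancing step can be sketched as a simple greedy loop. The model is a strong simplification made for illustration: each stage is reduced to a single critical-path delay, and the gates that the logic structure allows to be moved across the register are given as a list of delays. The function name and the example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def balance_stages(
    stage1: float, stage2: float, movable: Sequence[float]
) -> Tuple[float, float, int]:
    """Move gates from stage 1 into stage 2 while that reduces the imbalance.

    movable -- delays of the stage 1 gates next to the register, nearest first.
    Returns the new stage delays and the number of gates moved.
    """
    moved = 0
    for delay in movable:
        if abs((stage1 - delay) - (stage2 + delay)) < abs(stage1 - stage2):
            stage1 -= delay
            stage2 += delay
            moved += 1
        else:
            break  # moving the next gate would make the balance worse
    return stage1, stage2, moved
```

For stage delays of 900 ps and 500 ps and movable gates of 100 ps, 150 ps and 200 ps, the loop moves the first two gates and ends at 650 ps and 750 ps. With a 1000 ps clock period the slack changes from 100 ps and 500 ps to 350 ps and 250 ps, which leaves both stages room for trading slack for leakage.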
The output switching activity of a larger cell must be expected to be lower than the total
switching activity of a cascade of smaller cells. This is because the larger cell lacks the
internal nodes that, for an input vector transition, may switch numerous times before all
previous levels of logic have stabilized at their final levels. Further, the increased
propagation delay of a larger cell in comparison with a single smaller cell dampens glitches in
the circuit. This further reduces the switching activity.
Even though a switch in output state is bound to be more expensive in power, the
reduced switching activity and robustness to glitches will counter this effect.
[Figure 8.8: HSPICE plot of the current drawn from VDD. Axes: Voltages (lin), 0 to 1000m;
Currents (lin), -14u to 2u. Caption not recovered.]
[Figure 8.9 plot. Axes: Voltages (lin), 0 to 1000m; Currents (lin), -14u to 2u.]
Figure 8.9: IVdd for 1-, 2-, 3- and 4-device stacked inverter. The device pair closest to the output
has a 150ps time-shifted input.
Low input slopes do not necessarily cause massive short circuit currents though. A
70nm HS inverter was simulated with an input transition slope of 1V / 500ps = 2V/ns. The
same experiment was done building an inverter with two, three and four devices in series
in both the pull-up and pull-down networks. The currents drawn from VDD are depicted in
Figure 8.8 (see footnote 8.1).
It is evident that the more devices are placed in series, the lower the peak current
flowing through the stack. Furthermore, as input signals arrive at different points in time,
the devices will be in different conducting states at all times. Figure 8.9 shows the same
four stacks with the device pair nearest to the output driven by the same input, just
delayed by 150ps. The single inverter is driven by the normal, non-time-shifted input.
The results from this analysis gave no indication that larger cells increase switching
currents. Therefore, in combination with the fact that larger cells reduce the switching
activity, MOS device degradation is no more of a problem in MacroCMOS than in regular
CMOS.
8.1 The currents are negative in value since HSPICE measures them as ’current into the node VDD’.
Chapter 9
Contents
9.1 Conclusion
9.2 Future work
9.1 Conclusion
The main objective of this work was to evaluate possible logic families other than static
CMOS for low leakage design. This task was completed in a series of analyses.
The effects on leakage of scaling down device sizes were explored and rules of thumb for
low leakage design of gates were presented. Based on these leakage considerations, a
survey of logic families was conducted and MTCMOS, CPL and Domino logic were selected
for closer leakage evaluation.
MTCMOS proved to be unusable since the delay overhead of adding the power rout-
ing transistors matches the overhead of using low-leakage transistors instead, which is an
equally good and more design friendly approach. CPL failed due to reduced signal qual-
ity on internal nodes causing more leakage than gained by removing connections to the
voltage supply rails.
Domino logic proved to be very good at reducing the subthreshold leakage. Yet, when
gate leakage was taken into consideration, the benefits were lost as a keeper device would
have to be added to maintain the dynamically held node.
The proposed design style, MacroCMOS, was investigated through three example
simulation cases. MacroCMOS was found to be more efficient in reducing the leakage than
a current synthesis tool with a current cell library. This was proven for both smaller and
larger cells that are not present in the cell library.
Through the study of transistor characteristics it was found that the main reason for the
magnitude of the leakage problem is the usage of static cell libraries and current synthesis
tools. The cell libraries offer only a limited number of cells, and typically these cells only
have a small number of inputs. Assuming that larger logic functions can be built from these
small cells without much overhead is not correct when including leakage considerations.
Furthermore, the limited interface between synthesis tool and cell library consisting of a
limited list of logic functions prevents many of the optimizations needed for low leakage
design. Small cell synthesis for low leakage is not feasible in the future.
For the synthesis of MacroCMOS logic a new synthesis flow and cell library were
proposed. Optimizations for low leakage such as logic optimizations, internal scaling,
structural considerations and the efficient utilization of time slack for low leakage were
presented and proven to work through the examples given. Retiming for low leakage was
also presented. Furthermore, it was discussed how the entire time slack available can
be used for lowering the leakage of a circuit.
Although a logic family could not be found to replace static CMOS and change the
way low leakage design is done, this work demonstrated a new way of using static CMOS
for low leakage. Incorporating more logic in every (larger) cell and benefiting from the
optimizations now made possible proved to be a viable way to reduce the leakage problem
in the future.
Changing the design flow towards a logic family other than static CMOS would have
had great costs. Not only would synthesis tools and cell libraries have needed to be
changed, but designers would also have had to adjust their work flow and their
architectural knowledge of IC design. Therefore, continuing the design flow in static CMOS
in the MacroCMOS style preserves much of the work that has been done in the areas of
optimizations on the architectural and synthesis levels.
The static CMOS logic family is generally recognized as the best overall performing
logic family in terms of power consumption, area and timing. This work has concluded
that static CMOS still will be the best performing logic family in the future even when the
leakage problem is taken into consideration. Yet, the small cell based synthesis flow will
have to be rethought incorporating aggressive leakage current reduction schemes, such as
the MacroCMOS design style.
• The logic optimizations for low leakage presented in this work are not an exhaustive
analysis of the research area. Further logic optimizations and even more elaborate
improvements or transistor reconfigurations can be done. This area can be explored to
improve the efficiency of an automated full custom synthesis flow like MacroCMOS.
• The implementation of a synthesis tool, or post-synthesis tool, and cell library are
future work topics.
• Optimization algorithms for fast full custom synthesis taking leakage into
consideration are an area of research for the future.
• Fast layout and accurate simulation of gates built on-the-fly is also an interesting
topic.
• When high-k dielectric materials have been fully implemented in production,
dynamic logic styles could be reevaluated for low leakage applications.
A Project Description
Number: 55
Master’s Thesis Project:
Title: Design of CMOS cell libraries for minimal leakage currents
Student: Jacob Gregers Hansen
Period: 17.02.2004 - 13.08.2004
Project description:
Objectives
The objective of this MSc thesis work is to investigate optimal design under the presence
of static gate leakages, and to devise how design rules and trade-offs are altered.
Description
A main concern during the design of Systems-on-Chip (SoCs) is the power budget,
especially when battery supplied systems are considered. In general, dynamic and static
contributions constitute total power dissipation.
Dynamic power is primarily consumed by the information processing in the charging
and discharging of internal capacitances. As such, dynamic power consumption is
proportional to these capacitances, the switching frequency and the square of the supply voltage.
Static power consumption, on the other hand, is caused by leakage currents while the
circuit is idle, i.e. not performing computations.
One key attraction of CMOS is negligible static power consumption. However, with
decreasing device sizes this property is no longer satisfied due to subthreshold conduction.
The reason for this is that for smaller devices, supply voltages are reduced. For speed,
this in turn forces a reduction in threshold voltages. As a consequence, transistors are
no longer turned off satisfactorily, i.e. drain currents contribute significantly to power
losses in the transistor's non-conductive state. For a 0.13µm process, the static losses may
constitute almost 50% of the total power consumption.
The issue has been addressed by offering libraries of gates and cells in both low-VT
and high-VT versions. This offers the option of fast, low-VT cells with high static power
losses where timing is critical, and a slower, high-VT design for other parts. Traditional
synthesis tools do not offer the means to optimize for multiple-VT libraries to reduce
static power consumption. The solution, using such known synthesis tools, consists
of synthesizing a design using a low-VT library, under the constraint that timing and
performance requirements are met. Then, in a post-synthesis phase, the back-annotated
circuit is analyzed with respect to power consumption and the circuit modified, replacing
low-VT by high-VT library cells wherever possible. The update process does not involve
any re-synthesis steps.
This thesis work addresses the design of logic families using different transistor config-
urations to realize libraries representing alternatives in the speed-power design space
and under various technologies. This includes the generation of a 90nm library or better
from an existing library, and the characterization of this library for use in a synthesis
tool. The thesis work will be performed in parallel with two other MSc thesis works in
a collaborative but independent effort. One work focusses on the incorporation of static
power consumption metrics in the synthesis process, while the other work concentrates
on the architectural aspects of multiple-VT libraries.
B A Cell Library in the Liberty Format
/**********************************************************/
wire_load(maxarea_000980) {
resistance : 0.00023
capacitance : 0.00018
slope : 9.35
area : 0
fanout_length( 1 , 9.35)
}
...
wire_load_selection(default_by_area){
wire_load_from_area( 0 , 980 , maxarea_000980)
wire_load_from_area( 980 , 4540 , maxarea_004540)
... }
lu_table_template( table_1 ) {
variable_1 : input_net_transition ;
variable_2 : total_output_net_capacitance ;
index_1 (" 0.01, 0.06, 0.3, 1.2, 2.4 ");
index_2 (" 0.003, 0.048, 0.24, 0.72, 1.44 ");
}
power_lut_template( power_table_1 ) {
variable_1 : input_transition_time ;
variable_2 : total_output_net_capacitance ;
index_1 (" 0.01, 0.06, 0.3, 1.2, 2.4 ");
***--------------------------------------------------------------------------
*** BPTM 0.18, 0.13, 0.10 and 0.07 micron technologies
***--------------------------------------------------------------------------
***
*** This library of model cards was
*** created by Jacob Gregers Hansen on 17 April 2004 ***
*** This library of transistor models contains MOSFETs based on the
*** Berkeley Predictive Technology Model parameters / technology cards.
*** No responsibility is assumed for the use of the information stated
*** ***
96 Model Cards For Simulation
+Ptp= 0 JS=1.50E-08 JSW=2.50E-13
+N=1.0 Xti=3.0 Cgdo=2.786E-10
+Cgso=2.786E-10 Cgbo=0.0E+00 Capmod= 2
+NQSMOD= 0 Elm= 5 Xpart= 1
+Cgsl= 1.6E-10 Cgdl= 1.6E-10 Ckappa= 2.886
+Cf= 1.069e-10 Clc= 0.0000001 Cle= 0.6
+Dlc= 4E-08 Dwc= 0 Vfbcv= -1
.ENDS
.ENDL

Prt= 0.00
+ACM= 0 ldif=0.00 hdif=0.00
+rsh= 7 rd= 0 rs= 0
+rsc= 0 rdc= 0
+Cj= 0.0015 Mj= 0.72 Pb= 1.25 Cjsw= 2E-10 Mjsw= 0.37
+Php= 0.77 Cjgate= 2E-14 Cta= 0 Ctp= 0 Pta= 0 Ptp= 0
+JS=1.50E-08 JSW=2.50E-13 N=1.0 Xti=3.0
+Cgdo=4.094E-10 Cgso=4.094E-10 Cgbo=0.0E+00 Capmod= 2
+NQSMOD= 0 Elm= 5 Xpart= 1 cgsl= 1E-10 cgdl= 1E-10
+ckappa= 0.08 cf= 1.266e-10 clc= 1.0000000E-07 cle= 0.6000000
+Dlc= 1.6E-08 Dwc= 0
.ENDS
.ENDL

+Cj= 0.0015 Mj= 0.72 Pb= 1.25 Cjsw= 2E-10 Mjsw= 0.37
+Php= 0.77 Cjgate= 2E-14 Cta= 0 Ctp= 0 Pta= 0 Ptp= 0
+JS=1.50E-08 JSW=2.50E-13 N=1.0 Xti=3.0
+Cgdo=3.853E-10 Cgso=3.853E-10 Cgbo=0.0E+00 Capmod= 2
+NQSMOD= 0 Elm= 5 Xpart= 1 cgsl= 0.6422E-10 cgdl= 0.6422E-10
+ckappa= 0.08 cf= 1.266e-10 clc= 1.0000000E-07 cle= 0.6000000
+Dlc= 1.5E-08 Dwc= 0
.ENDS
.ENDL
C.6 70nm Low-Leakage BPTM Model Cards
+Cj= 0.0015 Mj= 0.72 Pb= 1.25 Cjsw= 2E-10 Mjsw= 0.37
+Php= 0.77 Cjgate= 2E-14 Cta= 0 Ctp= 0 Pta= 0 Ptp= 0
+JS=1.50E-08 JSW=2.50E-13 N=1.0 Xti=3.0
+Cgdo=4.094E-10 Cgso=4.094E-10 Cgbo=0.0E+00 Capmod= 2
+NQSMOD= 0 Elm= 5 Xpart= 1 cgsl= 1E-10 cgdl= 1E-10
+ckappa= 0.08 cf= 1.266e-10 clc= 1.0000000E-07 cle= 0.6000000
+Dlc= 1.6E-08 Dwc= 0
.ENDS
.ENDL
+Cj= 0.0015 Mj= 0.72 Pb= 1.25 Cjsw= 2E-10 Mjsw= 0.37
+Php= 0.77 Cjgate= 2E-14 Cta= 0 Ctp= 0 Pta= 0 Ptp= 0
+JS=1.50E-08 JSW=2.50E-13 N=1.0 Xti=3.0
+Cgdo=3.853E-10 Cgso=3.853E-10 Cgbo=0.0E+00 Capmod= 2
+NQSMOD= 0 Elm= 5 Xpart= 1 cgsl= 0.6422E-10 cgdl= 0.6422E-10
+ckappa= 0.08 cf= 1.266e-10 clc= 1.0000000E-07 cle= 0.6000000
+Dlc= 1.5E-08 Dwc= 0
.ENDS
.ENDL
D Minimal Static CMOS Cell Library
This appendix contains the description of the minimized (CyHP[33]) Static CMOS Cell Library
used for comparisons in this work. The 20 cells are presented here, followed by the simulation results
from the HSPICE simulations, and finally the transistor netlists used for simulation.
The propagation delays in Table D.1 are measured as the worst case propagation delay
from the time the input reaches 90% of its final value until the output has reached 90% of its
final value. This is shown in Figure D.2. The input vectors contain the inputs (A, B, C, ...) in that
order, where ’A’ is the input connected to the transistor closest to the output in a transistor
stack. For flip-flops the two bits given in this table represent the current output state of the
register and the next state half-way propagated through the register.
Circuit      Model card  Avg. leak  Max. leak  Input  Min. leak  Input  Tpd fall  Tpd rise
                         (nA)       (nA)              (nA)              (ps)      (ps)
and-nor3     70nm HS     10.15      16.23      (010)  4.32       (111)  70        190
or-nand3     70nm HS     5.32       11.6       (100)  3.98       (010)  58        90
inv-nand x2  70nm HS     19.01      27.48      (10)   9.2        (01)   52        26
inv-nor x2   70nm HS     18.7       31.2       (01)   12.02      (10)   20        80
nand3        70nm HS     3.51       11.82      (111)  0.346      (000)  78        58
mux2         70nm HS     16.4       19.6       (001)  13.9       (000)  70        61
nand2        70nm HS     4.6        7.9        (11)   0.668      (00)   28        29
nand2 x2     70nm HS     9.2        15.8       (11)   1.34       (00)   24        24
nor3         70nm HS     3.2        17.48      (000)  0.085      (111)  25        90
nor x1       70nm HS     4.48       11.65      (00)   0.182      (11)   17        44
nor x2       70nm HS     8.9        23.3       (00)   0.363      (11)   14        40
xnor         70nm HS     17.7       21.47      (00)   15.8       (11)   58        121
inv          70nm HS     4.85       5.8        (0)    3.95       (1)    20        30
inv x2       70nm HS     9.78       11.66      (0)    7.9        (1)    18        27
inv x4       70nm HS     19.55      23.3       (0)    15.8       (1)    18        25
inv x8       70nm HS     39.13      46.6       (0)    31.6       (1)    17        23
inv x16      70nm HS     78.43      93.6       (0)    63.26      (1)    16        20
dff +        70nm HS     24.45      25.4       (01)   23.5       (00)   70        68
dff + x2     70nm HS     29         27.5       (01)   31.2       (10)   61        70
dff -        70nm HS     24.45      25.4       (00)   23.5       (01)   70        68
Table D.1: Minimum, maximum and average leakage current and propagation delays for 20 static
CMOS cells simulated with 70 nm High Speed BPTM model cards. All currents are in nanoamperes
and all times in picoseconds.
[Figure D.2: timing diagram. Labels: In, Out, VDD, VSS, 10% of VDD, 90% of VDD, Trise, Tfall.]
[Figure D.3 schematics: transistor netlists with inputs A, B, C and S, an INV cell and a D_FF cell
clocked by Clk; not recoverable from the extraction.]
Figure D.3: Transistor netlists for the CyHP 20 cell library. Different drive strengths are modelled
by multiplying the width of the output driving transistors by the drive strength factor. Negative
edge triggered flip-flops were built by replacing Clk with its complement.
E A MacroCMOS Cell
[Figure E.1: schematic of the basic design with inputs A to I and output Z; not recoverable from the extraction.]
To explore what optimizations are possible with current synthesis tools and cell libraries,
Synopsys Design Compiler was used to optimize the basic design logically.
The result was a design in which two large inverting gates were combined with four
smaller gates. This design, depicted in Figure E.2, is the best solution under relaxed timing
constraints. If the timing is set much tighter, Synopsys comes up with the design in
Figure E.3. Here, almost only NAND gates have been used, as they are the fastest
multi-input logic gates in the cell library.
[Figure E.2 schematic; surviving labels: A, B, D, H, I, Z.]
Figure E.2: The basic design optimized under loose timing bounds
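[Figure E.3: schematic of the design optimized under tight timing constraints; surviving labels:
C, D, E, G, H, I, Z. Caption not recovered.]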
[Figure E.4 schematic: transistor netlist between VDD and VSS with gate inputs A to I.]
Figure E.4: The entire logic block as one MacroCMOS transistor netlist
[Figure E.5 schematic; inputs A to I, output Z.]
Figure E.5: Dividing the logic into larger areas in a beneficial way.
[Figure E.6 schematic: two transistor networks between VDD and VSS with gate inputs A to I,
internal node T and output Z.]
Figure E.6: The transistor netlist of the logic divided into two large blocks.
Figure E.7 shows simulation results. The top line (the dotted) represents the distribution
of leakage currents at all 512 input states for the Synopsys logic optimized version (not the
speed-optimized version). The much darker curve represents the results from the MacroC-
MOS version optimized to match the same speed as the Synopsys optimized version.
A variety of leakage optimizations can be done to the MacroCMOS block. By inspection
of Figure E.6, transistors that can be sized for low leakage are located. The three D, F and
G transistors in parallel have a relatively fast pull-down in comparison with the serialized
transistor. By increasing the length of these three transistors the total leakage is reduced
considerably without affecting the delay of the cell beyond the equivalent delay of the
Synopsys optimized version.
By iteratively adjusting the sizings of the transistors to match the timing of the
Synopsys optimized version, optimizations were gradually achieved. Only the most obvious
optimization was done since the workload of manually simulating and sizing transistors is
very high. Clearly this has to be automated in a synthesis tool. The last and lowest curve
represents results from simulating the MacroCMOS gate after leakage optimizations.
Table E.8 summarizes the results from these simulations.
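The manual procedure described above, lengthening transistors step by step while the cell still matches the reference timing, could be automated along the following lines. This is a hedged sketch and not the method used for the results in this appendix: `simulate` stands in for a circuit simulation (for example an HSPICE run) returning delay and leakage, and the function name, step size and length limit are hypothetical.

```python
from typing import Callable, Dict, Tuple

# simulate(lengths) -> (delay, leakage); stand-in for a real circuit simulation.
Simulator = Callable[[Dict[str, float]], Tuple[float, float]]


def size_for_leakage(
    lengths: Dict[str, float],
    simulate: Simulator,
    target_delay: float,
    step: float,
    max_length: float,
) -> Tuple[Dict[str, float], float]:
    """Greedily lengthen transistors while the timing target is still met."""
    lengths = dict(lengths)          # do not modify the caller's sizing
    _, leakage = simulate(lengths)
    while True:
        best = None                  # (transistor name, leakage after the step)
        for name in lengths:
            if lengths[name] + step > max_length:
                continue
            trial = dict(lengths)
            trial[name] += step
            delay, trial_leakage = simulate(trial)
            if delay <= target_delay and trial_leakage < leakage:
                if best is None or trial_leakage < best[1]:
                    best = (name, trial_leakage)
        if best is None:
            return lengths, leakage  # no further step helps within timing
        lengths[best[0]] += step
        leakage = best[1]
```

Each accepted step lengthens one transistor and lengths are bounded by `max_length`, so the loop terminates. The number of simulations grows with the number of transistors times the number of accepted steps, which is why fast simulation of gates built on the fly matters for such a flow.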
[Figure E.7 plot: "Leakage versus input". Axes: Input (0 to 500) and Leakage in nA (15 to 65).
Curves: Synopsys Design Compiler, Basic MacroCMOS gate, Improved MacroCMOS gate.]
Figure E.7: Leakage current distribution for all 512 input states (9 inputs), sorted by leakage value.
If problems should be encountered reading the disk, the files will also be available on
the following URL: http://www.izaq.dk/pep/disk.tar.gz
122 Contents of Included Disk
Bibliography
[2] Martin Hans, “Architectural Aspects of Design for Low Static Power Consumption,”
2004.
[5] Synopsys Corporation, Library Compiler: Modelling Timing and Power. 2003.
[8] C. Svensson and A. Alvandpour, “Low Power and Low Voltage CMOS Digital Circuit
Techniques,” ISLPED, 1998.
[9] G. Sery, S. Borkar, and V. De, “Life Is CMOS: Why Chase the Life After?,” Intel Corpo-
ration, 2001.
[12] C. K. And, “Dynamic VTH Scaling Scheme for Active Leakage Power Reduction,”
vol. citeseer.ist.psu.edu/572435.html, 1998.
[13] Z. Chen, L. Wei, M. Johnson, and K. Roy, “Estimation of standby leakage power in
CMOS circuits considering accurate modeling of transistor stacks,” International Sym-
posium on Low Power Electronics and Design, vol. Proceedings of the 1998 international
symposium on Low power electronics and design, 1998.
[16] D. Lee and W. K. et al., “Simultaneous Subthreshold and Gate-Oxide Tunnelling
Leakage Current Analysis in Nanometer CMOS Design,” 2003.
[17] X. Xiand, J. He, M. Dunga, and B. Heydari, “The Berkeley Predictive Technology
Model3 ver. 3.0 Homepage,” http://www-device.eecs.berkeley.edu/ bsim3.html, 1998.
[18] C. Hu, “BSIM Model for Circuit Design Using Advanced Technologies,” 2001 Sympo-
sium on VLSI Circuits Digest of Technical Papers, 2001.
[19] D. Sylvester and K. Keutzer, “Rethinking Deep-Submicron Circuit Design,” Computer,
vol. 32, no. 11, pp. 25–33, 1999.
[20] W. Liu, X. Jin, and J. C. et.al., “BSIM 3v3.2.2 MOSFET Model,” Department of Electrical
Engineering and Computer Sciences, vol. University of California, Berkeley, 1999.
[21] F. Stassen, “Design Rules and Electrical Parameters for a 0.18 micron CMOS Process,”
Informatics and Mathematical Modelling, Computer Science and Engineering, Technical
University of Denmark, 2004.
[22] J. M. Rabaey and M. Pedram, Low Power Design Methodologies. 1996.
[23] J. Kao and A. Chandrakasan, “MTCMOS sequential circuits,” Proceedings of the 27th
European Solid-State Circuits Conference, pp. 332–335, 2001.
[24] N. Hanchate and N. Ranganathan, “LECTOR: A Technique for Leakage Reduction in
CMOS Circuits,” 2004.
[25] A. Ghani, “High-speed low-power design in CMOS,” Master’s Thesis at Department of
Informatics and Mathematical Modelling, Technical University of Denmark, 2002.
[26] F. Stassen, Practical Aspects of CMOS Layout. Technical University of
Denmark/Department of Information Technology, 1996.
[27] Q. Wang and S. B. K. Vrudhula, “Static Power Optimization of Deep Submicron CMOS
Circuits for Dual Vt Technology,” 1998.
[28] J. P. Halter and F. N. Najm, “A Gate-Level Power Reduction Method for Ultra-Low-
Power CMOS Circuits,” 1997.
[29] D. V. Campenhout, T. Mudge, and K. A. Sakallah, “Timing Verification of Sequential
Dynamic Circuits,” 1999.
[30] S. K. Karandikar and S. S. Sapatnekar, “Technology Mapping for SOI Domino Logic
Incorporating Solutions for the Parasitic Bipolar Effect,” 2001.
[31] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-Power CMOS
Digital Design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, 1992.
[32] R. Zimmermann and W. Fichtner, “Low-Power Logic Styles: CMOS Versus Pass-
Transistor Logic,” IEEE Journal of Solid-State Circuits, vol. 32, no. 7, pp. 1–12, 1997.
[33] N. M. Duc and T. Sakurai, “Compact yet high performance (CyHP) library for short
time-to-market with new technologies,” in Proceedings of the 2000 conference on Asia
South Pacific design automation, pp. 475–480, ACM Press, 2000.
[34] J. Kao, A. Chandrakasan, and D. Antoniadis, “Transistor Sizing Issues and Tool For
Multi-Threshold CMOS Technology,” Proc. of DAC’97, June 1997.
[35] K. Yano et al., “A 3.8 ns CMOS 16 x 16 multiplier using complementary pass-transistor
logic,” IEEE Journal of Solid-State Circuits, vol. 25, pp. 388–395, 1990.
[36] F. Hamzaoglu and M. R. Stan, “Circuit-level techniques to control gate leakage for
sub-100nm CMOS,” in Proceedings of the 2002 international symposium on Low power
electronics and design, pp. 60–63, ACM Press, 2002.
[37] Y. Xu, Z. Luo, and X. Li, “A Maximum Total Leakage Current Estimation Method,”
2004.
[38] S. K. Karandikar and S. S. Sapatnekar, “Technology Mapping for SOI Domino Logic -
Incorporating Solutions for the Parasitic Bipolar Effect,” DAC, June 18-22, 2001.
[39] M. Lefebvre, D. Marple, and C. Sechen, “The Future of Custom Cell Generation in
Physical Synthesis,” 34th Design Automation Conference, 1997.
[40] T. Ekebrand and N. Funke, “A Parameterizable Standard Cell Generator,” 2003.
[41] D. Bhattacharya and V. Boppana, “Design Optimization with Automated Flex-Cell
Creation,” 2002.
[42] Various Authors from Intel Corporation, “Library Architecture Challenges for Cell-
Based Design,” Intel Technology Journal, vol. 08, no. 01, pp. 61–67, 2004.