Variable-Latency Adder (VL-Adder) : New Arithmetic Circuit Design Practice To Overcome NBTI
Yiran Chen
Seagate Technology 7801 Computer Ave. Bloomington, MN 55435 +1(952)402-7481
Hai Li
Seagate Technology 7801 Computer Ave Bloomington, MN 55435 +1(952)402-7493
Jing Li
Purdue University 465 Northwestern Ave West Lafayette, IN 47906 +1(765) 494-0759
Cheng-Kok Koh
Purdue University 465 Northwestern Ave West Lafayette, IN 47906 +1(765) 496-3683
yiran.chen@seagate.com
helen.li@seagate.com
Jingli@purdue.edu
chengkok@purdue.edu
ABSTRACT
Negative bias temperature instability (NBTI) has become a dominant reliability concern for nanoscale PMOS transistors. In this paper, we propose a variable-latency adder (VL-adder) technique for NBTI tolerance. By detecting circuit failure on the fly, the proposed VL-adder automatically shifts the data-capturing clock edge to tolerate NBTI-induced delay degradation on critical timing paths. The VL-adder operates with a fixed supply voltage and clock period, avoiding the high design and manufacturing costs incurred by existing NBTI-tolerant techniques. Compared to related low-power adder designs, the VL-adder technique always provides better energy efficiency throughout the chip lifetime, with very limited performance degradation (4.6% or less).
transistor [5], as these techniques incur an extremely unbalanced power consumption profile or switching activity distribution.
A common practice to counter the effects of NBTI-induced transistor aging is to over-design: the circuit is analyzed at a design corner denoting the maximum performance degradation of transistors over the lifetime of the chip. This technique is called guardbanding [6]. Guardbanding can be very pessimistic and power-inefficient because (1) the profile of the parameters affecting NBTI (temperature, supply voltage and duty cycle of the input signal) can be very unbalanced and (2) NBTI-induced aging effects have statistical components due to process variations.
To avoid these pitfalls, many adaptive NBTI-tolerant methodologies have been proposed. They include clock frequency tuning, adaptive body biasing and adaptive supply voltage. These techniques, however, all require complicated control circuitry, large extra power and area overheads, and significant additional manufacturing cost. To improve power efficiency, some sensors of NBTI effects have been designed to guide these adaptive NBTI-tolerant methodologies [6].
In this paper, we propose an adder design concept named variable-latency adder (VL-adder) for NBTI tolerance. The VL-adder technique builds on the idea of differentiating operation latency in the Ripple-Carry Adder of [7] and the Cascaded Carry-Select Adder of [8]. For example, in a 32-bit unsigned Ripple-Carry Adder (RCA), the longest carry propagation delay occurs only when the carry-out signal (CO) of the adder of the least-significant bit (LSB) propagates through the adder of the most-significant bit (MSB), e.g., A<31:0> = 0xFFFFFFFF and B<31:0> = 0x00000001 [7]. The occurrence probability of operands that result in the longest carry propagation delay is very low, i.e., $2^{32}/2^{64} \approx 2.3 \times 10^{-10}$ for random inputs. In [7], the authors used the input vectors to predict the possible longest carry propagation delay of the RCA.
Operations are classified as long- or short-latency ones. When a long-latency operation occurs, VDD is raised to a higher level to satisfy the timing requirement. In [8], the proposed Cascaded Carry-Select Adder (C2SA) can detect the operation latency on the fly. When a long-latency operation comes in, the capturing clock edge is shifted to catch the output correctly.
The key to the VL-adder design is an operation latency detector with an adjustable latency-detection threshold: operations that are classified as short-latency operations at the beginning of the chip lifetime may be classified as long-latency operations towards the end of the chip lifetime. Compared to the traditional adaptive NBTI-tolerant techniques, the proposed VL-adders have three main advantages: 1) The working frequency and supply voltage are fixed throughout the lifetime of a chip;
General Terms: Performance, Design, Reliability
Keywords: Negative Bias Temperature Instability (NBTI), Variable-Latency adder (VL-adder)
1. INTRODUCTION
The continual scaling of semiconductor process technology [1] has caused variability and reliability issues to emerge as primary concerns in modern VLSI design. NBTI occurs under negative gate voltage (e.g., Vgs = -VDD) and is measured by the shift in threshold voltage (Vth). The increase in PMOS transistor threshold voltage over time degrades device drive current, extends circuit delay [2], and significantly reduces the lifetime of a chip [3]. The extent of the NBTI-induced Vth shift of a PMOS transistor is heavily determined by its operating history: temperature, supply voltage and the duty cycle of the input signal (i.e., the portion of time when the PMOS transistor is on). Therefore, transistors at different locations on the same chip may suffer varying degrees of NBTI-induced delay degradation. This situation is exacerbated by modern power management techniques, e.g., clock-gating [4], sleep
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'07, August 27-29, 2007, Portland, Oregon, USA. Copyright 2007 ACM 978-1-59593-709-4/07/0008...$5.00.
2) The NBTI-tolerance mechanism is automatic and local: it does not have the negative global effects that clock frequency tuning or supply voltage tuning may have; 3) The power overhead and area penalty are minimal when compared to other tuning techniques targeting transistor speed, e.g., body biasing.
The remaining sections are organized as follows: Section 2 introduces the necessary background of VL-adder design; Section 3 presents the details of the VL-adder, including the practices of a 32-bit RCA and a 64-bit Carry-Select Adder (CSA); Section 4 provides experimental results and analysis; Section 5 concludes our work.
Fig. 2 C2SA and standard CSA: (a) C2SA; (b) standard CSA. (Figure labels: carry-select stages (CSS) with Setup, MUX and SUM blocks; the critical delay of the standard CSA is marked.)

Table 1. Carry-select stages in various CSA designs (number of input bits per stage; "/" marks a stage that is not present)

Stage      1   2   3   4   5   6   7   8   9   10  11  12  13
Std. CSA   2   2   3   4   6   7   8   9   11  12  /   /   /
C2SA       2   2   3   4   6   7   7   3   4   5   6   7   8
VL-C2SA    2   2   3   4   6   7   7   5   2   5   6   7   8
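As a quick consistency check on Table 1, the following minimal Python sketch converts each row of stage widths into bit ranges. The widths are transcribed from the table as read here, and the helper name `stage_bit_ranges` is hypothetical (it does not come from [8]). Under these assumptions every design covers 64 bits, and stages 8 and 9 together span bits 31 to 37 in both the C2SA and the VL-C2SA.

```python
# Stage widths as read from Table 1, least significant stage first.
# Assumption: "/" entries mean the stage does not exist in that design.
STAGE_WIDTHS = {
    "Std. CSA": [2, 2, 3, 4, 6, 7, 8, 9, 11, 12],
    "C2SA": [2, 2, 3, 4, 6, 7, 7, 3, 4, 5, 6, 7, 8],
    "VL-C2SA": [2, 2, 3, 4, 6, 7, 7, 5, 2, 5, 6, 7, 8],
}


def stage_bit_ranges(widths):
    """Return a list of (lsb, msb) bit indices, one pair per stage."""
    ranges = []
    lsb = 0
    for width in widths:
        ranges.append((lsb, lsb + width - 1))
        lsb += width
    return ranges


for name, widths in STAGE_WIDTHS.items():
    ranges = stage_bit_ranges(widths)
    print(f"{name}: {sum(widths)} bits, {len(widths)} stages, ranges {ranges}")
```

With this reading, moving the boundary between stages 8 and 9 (from bits 31-33 and 34-37 to bits 31-35 and 36-37) leaves all other stages unchanged.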
Fig. 1 Carry propagation length (CPL) in a 32-bit RCA. Annotations: maximum CPL of a 32-bit RCA = 32; longest CPL = 17 for A<31:0> = 0xFFFF7FFF and B<31:0> = 0x00000001; longest CPL of a short-latency operation = 19 in [7].
In contrast, there are 2 M CSSs in the Cascaded CSA (C2SA) [8] (see Fig. 2(a) and the row labeled C2SA in Table 1). The 2 M CSSs in a C2SA are divided into two groups, each with M CSSs. In each group, starting with a small number of inputs in the least significant CSS, the number of input bits in each CSS increases linearly from the least significant CSS to the most significant CSS (see Table 1). As in [7], the long- and short-latency operations are differentiated by checking the carry propagation in a few CSSs in the middle of the C2SA (see Fig. 2(a)).
A carry length detection circuit (CLDC) was proposed in [8] to detect whether the carry propagation is killed among some $N_C$ consecutive bits starting from the Lth bit. With $P_k$ denoting the propagate signal of bit k, the logic of the CLDC is:

$$P\langle L+N_C-1 : L\rangle = \prod_{k=L}^{L+N_C-1} P_k \qquad (1)$$
When $P\langle L+N_C-1 : L\rangle = 1$, the operation is a long-latency one with a CPL that could reach M, the maximum possible CPL for an M-bit RCA. Otherwise, the operation is categorized as a short-latency one, as the carry propagation is killed by some bit(s) covered by the CLDC. For random inputs, the probability of a long-latency operation is $Pr_L = 1/2^{N_C}$. The maximum possible CPL of a short-latency operation is $\max(L+N_C, M-L)$. A larger $N_C$ reduces the probability of long-latency operations and increases the maximum possible CPL of short-latency operations. Experimental results in [7] showed that, for power efficiency, the combination of L = 13 and $N_C$ = 6 gives the optimal configuration of the CLDC for a 32-bit RCA.
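The classification rule of Eq. (1) can be illustrated with a small behavioural model. This is a minimal sketch, not the circuit of [7] or [8]: it assumes the common definition of the propagate signal, P_k = A_k XOR B_k, and all function names are hypothetical.

```python
import random


def is_long_latency(a, b, start, n_c):
    """Eq. (1): true when P_k = 1 for every k in [start, start + n_c - 1].

    Assumption: P_k = A_k XOR B_k (per-bit propagate signal).
    """
    mask = ((1 << n_c) - 1) << start
    return ((a ^ b) & mask) == mask


def max_short_latency_cpl(width, start, n_c):
    """Maximum possible CPL of a short-latency operation: max(L + N_C, M - L)."""
    return max(start + n_c, width - start)


def estimate_long_latency_rate(width, start, n_c, samples=100_000, seed=0):
    """Monte Carlo estimate of Pr_L for uniformly random operands."""
    rng = random.Random(seed)
    hits = sum(
        is_long_latency(rng.getrandbits(width), rng.getrandbits(width), start, n_c)
        for _ in range(samples)
    )
    return hits / samples


if __name__ == "__main__":
    # Configuration reported for the 32-bit RCA of [7]: L = 13, N_C = 6.
    print(is_long_latency(0xFFFFFFFF, 0x00000001, 13, 6))
    print(max_short_latency_cpl(32, 13, 6))
    print(estimate_long_latency_rate(32, 13, 6), 1 / 2**6)
```

For L = 13 and N_C = 6 the two arguments of the maximum coincide (13 + 6 = 32 - 13 = 19), which agrees with the longest short-latency CPL of 19 quoted for [7]; the sampled rate should lie close to 1/2^6.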
Compared to the standard CSA, the critical delay of long-latency operations of the C2SA is longer. However, the critical delay of short-latency operations, which occur more frequently, is shorter. The 64-bit C2SA of [8], for example, has a CLDC of logic P<37:31>. Long-latency operations occur with a probability of $Pr_L = 1/2^7 \approx 0.78\%$. The critical delays of short-latency operations and long-latency operations are Dsetup + 11Dcarry + Dsum and Dsetup + 15Dcarry + Dsum, respectively. Here Dsetup denotes the setup time of the CSA, i.e., the delay of creating the intermediate signals G (Generation) and P (Propagation) [9]. Dcarry and Dsum denote the delays of the carry generation circuit and the sum generation circuit, respectively. We assume that the delay of carry generation Dcarry equals the delay of the multiplexer (MUX) Dmux [9]. For comparison, the critical delay of the standard CSA equals Dsetup + 13Dcarry + Dsum.
design overhead. The relationships between Vcrt(t) and t for the 32-bit RCA of [7] with PTM 70nm technology at Vdes = 1.0V are shown in Fig. 6. Here, Vdes is selected such that short-latency operations are still timing-correct at the end of the 7-year chip lifetime. As the duty cycle increases, the gap between Vcrt(t) and Vdes increases; for a duty cycle of 0.75, this gap can be more than 10% of Vdes.
Fig. 4 shows a guardband violation sensor proposed in [6]. This sensor includes three components, namely, a stability checker (comparator), a delay element, and an output latch. By comparing the signal at the beginning of the guardband with the signal at the capturing clock edge, any occurrence of signal switching within the guardband can be detected. The area and power overheads incurred by this guardband violation sensor are negligible, as the delay element and the output latch can be shared by multiple sensors, and guardband violation detection is executed quite infrequently, e.g., with an interval of 15 days.
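The comparison performed by the stability checker can be sketched behaviourally as follows. This is only an illustration of the idea described above, not the circuit of [6]: the waveform representation (an initial value plus a list of toggle times), the function names, and the numeric values in the usage lines are all assumptions.

```python
def value_at(initial, toggle_times, t):
    """Logic value at time t of a signal that starts at `initial`
    and inverts at each time listed in `toggle_times`."""
    toggles = sum(1 for x in toggle_times if x <= t)
    return initial ^ (toggles & 1)


def guardband_violation(initial, toggle_times, t_clk, guardband):
    """Compare the signal at the start of the guardband with the signal
    at the capturing clock edge; a mismatch flags a violation."""
    at_guardband_start = value_at(initial, toggle_times, t_clk - guardband)
    at_clock_edge = value_at(initial, toggle_times, t_clk)
    return at_guardband_start != at_clock_edge


if __name__ == "__main__":
    # Hypothetical numbers: 1.215 ns clock period, 0.05 ns guardband.
    print(guardband_violation(0, [0.30, 1.10], 1.215, 0.05))  # settles early
    print(guardband_violation(0, [0.30, 1.19], 1.215, 0.05))  # switches late
```

Note that this two-point comparison misses an even number of toggles inside the guardband; the sketch keeps that limitation to stay close to the comparison described in the text.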
Building on the concept of operation-latency differentiation in [7][8], we propose a variable-latency adder (VL-adder) technique to overcome the NBTI effect. In the proposed VL-adder, the detection threshold of long-latency operations can be adaptively adjusted to account for NBTI-degraded delay; operations with an NBTI-degraded delay that is longer than one clock cycle are automatically categorized as long-latency operations and are executed within two clock cycles. Consequently, a VL-adder can continuously work at a low supply voltage level, eliminating the need for clock period tuning or supply voltage tuning. A guardband violation sensor that can predict NBTI-induced timing violations and generate the signal to adjust the detection threshold is needed.
Fig. 6 Relation between critical supply voltage and chip operation time (curves for duty cycles of 0.25, 0.5 and 0.75; the 7-year lifetime is marked)
Fig. 8 shows an example of the new adjustable CLDC circuit design for our 32-bit RCA-based VL-adder. A multiplexer is controlled by signal TH_ADJ to choose between two logic functions, P<18:13> and P<17:14>. When a long-latency operation is detected, a clock-gating signal GATING is generated to shift the capturing clock edge to the end of the next cycle.
[Fig. 8 schematic labels: inputs P13 to P18, multiplexer select TH_ADJ, clock CLK, flip-flop output GATING]
supply voltage Vcrt(0) at the beginning of the chip lifetime is only 0.91V for a duty cycle of 0.75. Simulation results show that at 0.94V, a 32-bit RCA design with CLDC logic P<17:14> would still be timing-correct at the end of the chip lifetime. In other words, for a 32-bit VL-RCA with PTM 70nm technology, a lower supply voltage of 0.94V, denoted $V_{VL,des}$, is sufficient to ensure timing correctness throughout the lifetime of the chip.
The timing correctness of long-latency operations requires that the latency of a long-latency operation always be shorter than two clock cycles [8]. In Section 4 we shall show that this requirement is always met in our VL-adder designs, both before and after the detection threshold of long-latency operation changes.
Here, we extend the VL-adder concept to the Cascaded CSA (C2SA) design to tolerate NBTI-induced circuit delay degradation. Fig. 10 shows the structure of our proposed 64-bit C2SA-based VL-adder (VL-C2SA). The core adder of a VL-C2SA is a C2SA with two modified carry-select stages covered by the CLDC: the bit widths of the modified carry-select stages 8 and 9 are five and two, respectively, as shown in the row VL-C2SA in Table 1.
Recall that reducing the number of input bits covered by the CLDC shortens the critical delay of short-latency operations but increases the occurrence frequency of long-latency operations. The timing diagrams before and after changing the detection threshold of long-latency operation are shown in Fig. 9. The output of an operation that fails with the original detection threshold is successfully captured by the shifted clock edge with the new detection threshold. For example, after adjusting the detection threshold by changing the CLDC logic from P<18:13> to P<17:14>, the longest delay of short-latency operations reduces from 19Dcarry to 18Dcarry, allowing some tolerance for the NBTI-induced circuit delay. (As defined in Section 2.2, Dcarry is the carry propagation delay of a single-bit adder.) The probability that a long-latency operation occurs increases from $1/2^6$ to $1/2^4$.
Fig. 9 Timing diagrams of VL-adder before and after the detection threshold of long-latency operation changes. (Panels: original threshold with guardband violation, showing Clock, Output and TH_ADJ; new threshold eliminating the guardband violation, showing Clock, Output and GATING; the guardband is marked.)
The proposed C2SA-based VL-adder (or VL-C2SA) in Fig. 10 keeps the same longest delays of long- and short-latency operations as the 64-bit C2SA of [8]. Originally, the threshold of long-latency operations is determined by the logic P<37:31>. As mentioned in Section 2.2, the longest delay of short-latency operations is Dsetup + 11Dcarry + Dsum and the longest delay of long-latency operations is Dsetup + 15Dcarry + Dsum under the assumption that Dcarry = Dmux.
When the guardband violation sensor detects a violation, signal TH_ADJ is generated to adjust the detection threshold of long-latency operations by changing the CLDC logic from P<37:31> to P<35:31>. As a result, the occurrence probability of long-latency operations PrL increases from $1/2^7 \approx 0.78\%$ to $1/2^5 \approx 3.13\%$. The new critical path of short-latency operations runs from the inputs of carry-select stage 1 to the outputs of carry-select stage 8. Hence, the longest delay of short-latency operations changes from Dsetup + 11Dcarry + Dsum to Dsetup + 10Dcarry + Dsum.
The adjustable CLDC of the C2SA-based VL-adder is shown in Fig. 11. The logic functions P<37:31> and P<35:31> are selected by control signal TH_ADJ, according to the detection result of the guardband violation sensors. We note that the multiplexer MUX is between NAND3 and OR2. As its operation overlaps with that of OR1, no performance penalty is incurred.
As the critical timing path of the short-latency operation of our proposed VL-C2SA under the original detection threshold of long-latency operation is from the inputs of carry-select group 1 through
As a long-latency operation requires two clock cycles to complete [8], the increase in long-latency operations introduces performance overhead relative to the original RCA design proposed in [7]. In our 32-bit RCA-based VL-adder design, this extra performance overhead is $(1/2^4 - 1/2^6)/(1 + 1/2^6) \approx 4.6\%$.
Since the critical timing path under the original detection threshold of long-latency operation can only be from bit 0 to bit 18 and from bit 13 to bit 31, two guardband violation sensors are needed, at bit 31 and bit 18. As mentioned in [6], the sensors are activated very infrequently. Hence, the power overhead is negligible.
Interestingly, VL-RCA may present an opportunity to reduce power. Recall that Vdes, the supply voltage of a circuit, has to be selected to ensure timing correctness of the circuit throughout its lifetime. Consider the 32-bit RCA design (with CLDC logic P<18:13>) presented in [7]. Suppose that Vdes = 1.0V allows the short-latency operations to be timing-correct throughout the lifetime of the chip with PTM 70nm technology. The critical
the outputs of carry-select stage 9, only two guardband violation sensors are required at the two output bits of carry-select stage 9.
[Fig. 11 schematic labels: inputs P31 to P37; gates NAND1, NAND2, NAND3, OR1, OR2; multiplexer MUX with select TH_ADJ and a GND input; clock CLK; flip-flop output GATING]
Fig. 14 compares the normalized PDP (power-delay product) of our 32-bit VL-RCA with that of the 32-bit RCA of [7] over the whole 7-year chip operation time. We define the average adder delay, which accounts for the variable adder latency, as Tclk(1+PrL), where PrL is the occurrence probability of long-latency operations. To be conservative, we assume that all adders work with a duty cycle of 0.75. A total of $2^{14} = 16384$ random inputs are simulated. The power dissipation of the CLDC adjustment circuitry of VL-RCA is accounted for in the plot for VL-RCA.
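The average-delay definition Tclk(1+PrL) can be evaluated directly. The sketch below assumes Tclk = 1.215 ns, the clock period quoted elsewhere in the paper for the 32-bit VL-RCA, and the two long-latency probabilities 1/2^6 and 1/2^4; the function name is hypothetical and no power figures are assumed.

```python
def average_adder_delay(t_clk, pr_long):
    """Average delay when long-latency operations take one extra cycle:
    (1 - Pr_L) * Tclk + Pr_L * 2 * Tclk = Tclk * (1 + Pr_L)."""
    return t_clk * (1 + pr_long)


T_CLK_NS = 1.215  # assumed clock period of the 32-bit VL-RCA

before = average_adder_delay(T_CLK_NS, 1 / 2**6)  # CLDC logic P<18:13>
after = average_adder_delay(T_CLK_NS, 1 / 2**4)   # CLDC logic P<17:14>
print(f"average delay before adjustment: {before:.4f} ns")
print(f"average delay after adjustment:  {after:.4f} ns")
```

Multiplying these delays by the measured average power would give the PDP values that the plot normalizes; the power values themselves are not reproduced here.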
[Plot: longest delay of short-latency operation (ns) versus operation time, for duty cycles of 0.25, 0.5 and 0.75; Tclk and the turning point are marked.]
Adjusting the detection threshold of long-latency operations results in 4.6% degradation in clock-cycle-based performance (see Section 3.1) when compared to the RCA of [7]. However, Fig. 14 shows that by working at a lower supply voltage, VL-RCA always provides higher energy-efficiency (about 10% less PDP) than the RCA of [7] does, throughout the entire 7-year chip lifetime.
[Plot: normalized PDP versus operation time for the RCA and the VL-RCA.]
Fig. 12 Time-varying longest delays of long- and short-latency operations of the 32-bit RCA of [7]
When a guardband violation is detected, our proposed 32-bit RCA-based VL-adder (VL-RCA) adjusts the detection threshold of long-latency operation by using the CLDC logic function P<17:14> (see Section 3.1). To ensure that all short-latency operations of our 32-bit VL-RCA can complete within Tclk = 1.215ns throughout the 7-year chip lifetime, the supply voltage must be at least 0.94V. Under a supply voltage of 0.94V, the relation between the longest delay of the short-latency operations of a 32-bit VL-RCA and the chip operational lifetime is shown in Fig. 13. For a duty cycle of 0.75, the detection threshold of long-latency operation of VL-RCA has to be adjusted after the chip has been working for around $2 \times 10^7$ s (around 8 months). With a duty cycle of 0.5 or 0.25, the detection threshold has to be adjusted after the chip has been in operation for $6 \times 10^7$ s or $2 \times 10^8$ s, respectively.
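To put the operating times above in perspective, the following short sketch converts them to months and to a fraction of the 7-year lifetime. The adjustment times are taken as 2e7 s, 6e7 s and 2e8 s as read from the text; the 365.25-day year is an assumption made only for this conversion.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length
SECONDS_PER_MONTH = DAYS_PER_YEAR * SECONDS_PER_DAY / 12
LIFETIME_S = 7 * DAYS_PER_YEAR * SECONDS_PER_DAY

# Threshold adjustment time (s) per duty cycle, as read from the text.
ADJUSTMENT_TIMES_S = {0.75: 2e7, 0.5: 6e7, 0.25: 2e8}


def seconds_to_months(seconds):
    """Convert an operating time in seconds to average-length months."""
    return seconds / SECONDS_PER_MONTH


for duty_cycle, t in ADJUSTMENT_TIMES_S.items():
    print(
        f"duty cycle {duty_cycle}: {seconds_to_months(t):.1f} months, "
        f"{100 * t / LIFETIME_S:.0f}% of the 7-year lifetime"
    )
```

Under these assumptions all three adjustment times fall within the 7-year lifetime, with the lowest duty cycle reaching the adjustment point only late in the chip's life.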
Table 2 shows the transistor counts of the different components of the RCA of [7] and the VL-RCA. Compared to the 32-bit RCA of [7], VL-RCA introduces a 7.80% area overhead (in terms of transistor count). Because the guardband violation sensor is activated infrequently, its power overhead is negligible [6].
Table 2. Transistor counts of different circuitries

Component    RCA    VL-RCA   C2SA   VL-C2SA
Core adder   896    896      5304   5304
Sensor       -      61       -      61
CLDC         40     52       42     54
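The area overheads quoted in the text follow from the totals in Table 2. The sketch below assumes the column assignment as read from the table (designs without a sensor are given a count of 0) and uses a hypothetical helper name.

```python
# Transistor counts as read from Table 2.
# Assumption: a missing sensor entry means the design has no sensor.
TRANSISTORS = {
    "RCA": {"core": 896, "sensor": 0, "cldc": 40},
    "VL-RCA": {"core": 896, "sensor": 61, "cldc": 52},
    "C2SA": {"core": 5304, "sensor": 0, "cldc": 42},
    "VL-C2SA": {"core": 5304, "sensor": 61, "cldc": 54},
}


def area_overhead(baseline, variant):
    """Relative increase in total transistor count of variant vs. baseline."""
    base_total = sum(TRANSISTORS[baseline].values())
    variant_total = sum(TRANSISTORS[variant].values())
    return (variant_total - base_total) / base_total


print(f"VL-RCA vs RCA:   {100 * area_overhead('RCA', 'VL-RCA'):.2f}%")
print(f"VL-C2SA vs C2SA: {100 * area_overhead('C2SA', 'VL-C2SA'):.2f}%")
```

Both designs add the same 73 transistors (sensor plus enlarged CLDC), so the percentage is smaller for the C2SA simply because its core adder is larger.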
CLDC logic from P<37:31> to P<35:31>. The minimal supply voltage level to ensure a 7-year lifetime of VL-C2SA is 0.91V. At that voltage, for a duty cycle of 0.75, the detection threshold of long-latency operation has to be adjusted after the chip has been in operation for only around $1 \times 10^6$ s (about 12 days). Therefore, we choose a higher supply voltage of 0.94V for the proposed VL-C2SA. The degradation of the longest delay of the short-latency operations of a VL-C2SA is shown in Fig. 16. For a duty cycle of 0.75, the detection threshold of long-latency operation has to be adjusted after about 8 months ($2 \times 10^7$ s) of operation. In practice, designers can increase the supply voltage (and hence, power) to delay the adjustment of the detection threshold of long-latency operation.
[Plot legend: short- and long-latency operations at duty cycles (DC) of 0.25, 0.5 and 0.75; x-axis: operation time (s).]
performance by $(1/2^5 - 1/2^7)/(1 + 1/2^7) \approx 2.33\%$, when compared to the 64-bit C2SA of [8]. Nonetheless, the proposed VL-C2SA is still more energy-efficient for the entire 7-year chip lifetime. The transistor counts of every component of the proposed VL-C2SA design and the C2SA of [8] are also shown in Table 2. Owing to the size of the core adder, the area overhead incurred by the VL-C2SA design is only 1.37%.
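The performance overhead figures used in this paper follow from the average-delay definition Tclk(1+PrL): the relative increase in average delay is (Pr_new - Pr_old)/(1 + Pr_old), independent of Tclk. A minimal sketch, with a hypothetical function name, that evaluates it for both adders:

```python
def performance_overhead(pr_old, pr_new):
    """Relative increase of the average delay Tclk * (1 + Pr_L) when the
    long-latency probability rises from pr_old to pr_new."""
    return (pr_new - pr_old) / (1 + pr_old)


# 32-bit VL-RCA: CLDC logic P<18:13> -> P<17:14>
rca_overhead = performance_overhead(1 / 2**6, 1 / 2**4)
# 64-bit VL-C2SA: CLDC logic P<37:31> -> P<35:31>
c2sa_overhead = performance_overhead(1 / 2**7, 1 / 2**5)

print(f"VL-RCA:  {100 * rca_overhead:.2f}%")
print(f"VL-C2SA: {100 * c2sa_overhead:.2f}%")
```

The clock period cancels in the ratio, which is why the overhead can be stated without reference to the supply voltage or technology node.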
5. CONCLUSION
In this paper, we present a new adder design concept called the Variable-Latency Adder (VL-adder) for NBTI tolerance. The operations of the adder are differentiated according to their latencies: short-latency and long-latency. When a long-latency operation occurs, the data-capturing clock edge is shifted by one more cycle to allow more computation time (and to latch the output data correctly). The detection threshold of long-latency operation can be dynamically adjusted in a VL-adder. If the delay of an originally short-latency operation exceeds one clock cycle due to NBTI degradation, this operation is re-categorized as a long-latency one by adjusting the detection threshold of long-latency operation, eliminating the need for supply voltage tuning or clock frequency tuning. While providing better energy efficiency throughout the chip lifetime, the proposed VL-adder design incurs minimal area and performance penalties.
Fig. 15 Time-varying longest delays of long- and short-latency operations of the 64-bit C2SA of [8]
6. REFERENCES
[1] International Technology Roadmap for Semiconductors, 2005.
[2] N. Kimizuka, et al., The impact of bias temperature instability for direct-tunneling ultra-thin gate oxide on MOSFET scaling, VLSI Symp. on Tech., 1999, pp. 73-74.
[3] V. Reddy, et al., Impact of Negative Bias Temperature Instability on Digital Circuit Reliability, International Reliability Physics Symposium, 2002, pp. 248-254.
[4] H. Li, et al., Deterministic clock gating for microprocessor power reduction, the 9th Int'l Symp. on High Performance Computer Arch., Feb. 2003, pp. 113-124.
[5] A. Agarwal, et al., A Single-Vt Low-Leakage Gated-Ground Cache for Deep Submicron, IEEE Jour. of Solid-State Circuits, Vol. 38-2, pp. 319-328, Feb. 2003.
The normalized PDP of our 64-bit VL-C2SA and that of the 64-bit C2SA of [8] over a 7-year chip lifetime are shown in Fig. 17. Again, a total of $2^{14} = 16384$ random inputs are simulated. The power dissipation of the CLDC adjustment circuitry of VL-C2SA has also been accounted for.
[Plot: longest delay of short-latency operation (ns) versus operation time; Tclk and the turning point are marked.]
[6] M. Agarwal, et al., On-line Failure Prediction and Its Application to Transistor Aging, ACM/IEEE Int'l Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (TAU), 2007.
[7] H. Suzuki, et al., Low Power Adder with Adaptive Supply Voltage, the 21st Int'l Conf. on Computer Design, San Jose, Oct. 2003, pp. 103-106.
[8] Y. Chen, et al., Cascaded Carry-Select Adder (C2SA): A New Structure for Low-Power CSA Design, Int'l Symp. on Low Power Electronics Design, 2005, pp. 115-118.
[9] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Englewood Cliffs, NJ: Prentice Hall, 1996.
[10] K. Kang, et al., Efficient Transistor-Level Sizing Technique under Temporal Performance Degradation due to NBTI, IEEE International Conference on Computer Design, 2006, pp. 216-221.
[11] Predictive Technology Model, http://www.eas.asu.edu/~ptm/
As mentioned in Section 3.2, adjusting the detection threshold of long-latency operations degrades the clock-cycle-based