

PDF issue: 2024-11-16

## A 28-nm FD-SOI 8T Dual-Port SRAM for Low-Energy Image Processor With Selective Sourceline Drive Scheme

Mori, Haruki ; Nakagawa, Tomoki ; Kitahara, Yuki ; Kawamoto, Yuta ; Takagi, Kenta ; Yoshimoto, Shusuke ; Izumi, Shintaro ; Kawaguchi,...

## (Citation)

IEEE Transactions on Circuits and Systems I: Regular Papers, 66(4):1442-1453

(Issue Date) 2018-12-20

(Resource Type) journal article

(Version)

Accepted Manuscript

### (Rights)

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or… (URL)

https://hdl.handle.net/20.500.14094/90008144



# A 28-nm FD-SOI 8T Dual-Port SRAM for Low-Energy Image Processor with Selective Sourceline Drive Scheme

Haruki Mori, *Student Member*, *IEEE*, Tomoki Nakagawa, Yuki Kitahara, Yuta Kawamoto, Kenta Takagi, Shusuke Yoshimoto, *Member*, *IEEE*, Shintaro Izumi, *Member*, *IEEE*, Hiroshi Kawaguchi, *Member*, *IEEE*, and Masahiko Yoshimoto, *Member*, *IEEE* 

Abstract— This paper presents a low-energy 64-Kb 8-transistor (8T) one-read/one-write dual-port image memory with a 28-nm fully depleted SOI (FD-SOI) process technology. Our proposed SRAM adopts a selective sourceline drive (SSD) scheme and a consecutive data write technique for improving active energy efficiency at low voltage. The novel SSD scheme controls sourceline voltage and eliminates leakage energy at unselected columns in read operations. We fabricated a 64-Kb 8T dual-port SRAM in the 28-nm FD-SOI process technology. The 8T SRAM cell size is  $0.291 \times 1.457 \ \mu m^2$ . The test chip exhibits 0.48-V operation at access time of 135 ns. The energy minimum point is at a supply voltage of 0.56 V and an access time of 35 ns, where 265.0 fJ/cycle in write operations and 389.6 fJ/cycle in read operations are achieved. These factors are, respectively, 30% and 26% smaller than those of the 8T dual-port SRAM with the conventional scheme.

Index Terms—8T SRAM, 28-nm SRAM, Consecutive Access, FD-SOI, Image Memory, Low Power, Multi-Port SRAM

### I. INTRODUCTION

Low-energy image recognition is demanded for internet of things (IoT) devices in various fields such as automatic driving systems, robot vision, and augmented reality systems with fine resolution. Image resolution enhancement requires large memory capacity and large area. It also entails high power consumption because of the increased amounts of image data that must be processed. In fact, memory (global memory, caches, and register files) accounts for more than 40% of the power of an image processor (GTX 580; Nvidia Corp.) [1]. For IoT devices handling image information, more energy-efficient memory technology is anticipated.

We would like to thank STMicroelectronics for chip implementation. This work was supported by STARC, VLSI Design and Education Center (VDEC), The University of Tokyo with the collaboration with Cadence Corp., Mentor Graphics Corp., Synopsys Inc., CMP Inc., and JSPS KAKENHI Grant Number 18J11572 and Grant Number 18H01500.

- H. Mori, T. Nakagawa, Y. Kitahara, Y. Kawamoto, K. Takagi, S. Izumi, H. Kawaguchi, and M. Yoshimoto are with The Graduate School of System Informatics, Kobe University, Rokkodai, Kobe, Hyogo, Japan (e-mail: mori.haruki@cs28.cs.kobe-u.ac.jp).
- S. Yoshimoto is now with The Institute of Scientific and Industrial Research, Osaka University, Suita, Osaka, Japan.
- Color versions of one or more of the figures in this paper are available online at http://ieeeexplore.ieee.org




Fig. 1. SRAM in an image processor.

Fully depleted silicon-on-insulator (FD-SOI) technology is promising for high-speed, low-voltage SRAM [2, 3]. A 28-nm FD-SOI transistor is fully depleted, with an ultra-thin silicon body on a buried oxide (BOX) layer, which gives it excellent electrostatic control and therefore stable characteristics at low-voltage operation. The BOX layer reduces the leakage current by controlling the electrical flow from a source node to a drain node in a transistor. Moreover, the BOX layer reduces the parasitic capacitance between the source node and the drain node. These features of the 28-nm FD-SOI enable the production of ultra-low-power SRAMs [4–8].

Energy efficiency is improved in the near-threshold region because dynamic energy and leakage energy are well balanced [9]. The combination of a low threshold voltage and low supply voltage is good for high-activation logic circuitry, whereas a high threshold voltage and a high supply voltage are suitable for memory operations. For memory, the activation is low because only a selected wordline and certain bitlines are activated. In such cases, the high threshold voltage suppresses the leakage current and total energy. Process technologies such as Fin-FET and FD-SOI have a smaller S factor. Moderate threshold voltage and moderate supply voltage achieve the best scenario, especially for memory [10–13].

Input data for image processing are stored temporarily in SRAM. In an image processor, many processing cores access the SRAM for multi-thread processing. Figure 1 portrays the memory system in an image processing unit. The SRAM array stores data such as image maps, feature maps, and various parameters for processing on the many processing elements (PEs). Demand for multi-port SRAMs has increased to accommodate high-speed, low-energy image processing. The multi-port SRAM is suitable for parallel operation. It improves the total chip performance and/or memory bandwidth by enabling multiple simultaneous operations in the same bank [14]. Parallel processing is a key technology for real-time image applications that require embedded memories with multiple access ports [15–17]. To date, multi-port SRAMs that support simultaneous write and read operations have been proposed for use in image processors [18–20].

Several important earlier works have examined dual-port SRAM architectures. A 40-nm 512-Kb pipelined 8T SRAM was proposed for a high-speed image processor [21]; its pipeline design enables high-frequency yet low-power operation. A dual-port 8T SRAM with a differential reference-based sense amplifier (SA) was reported in [22]. That work targeted the benefits of small-signal sensing in the context of a single-ended read path and also addressed the half-select problem. It achieved a lower operating voltage of 360 mV using the differential reference-based SA. An SRAM design in a 28-nm bulk CMOS technology, which adopted a differential sense amplifier for one-read/one-write (1R1W) dual-port SRAM bitcells, was presented in another report [23]. The differential SA scheme divides a memory array into two (upper and lower) memory matrices (MATs). The differential voltage between a read bitline (RBL) of the selected MAT and that of an unselected MAT is amplified in readout operation at higher frequency. A 1R1W dual-port SRAM in a 16-nm Fin-FET technology was realized with 6T single-port SRAM bitcells and a double-pumping internal clock for high speed and high density [24]. The double-pumping scheme for the internal clock generator achieves robust timing design without severe setup/hold margins, and achieves a lower operating voltage of 680 mV using a negative-level write driver. Nevertheless, these designs all require dedicated signal timing and entail a greater area cost [21, 23, 24].

As described in an earlier report, an 8T 1R1W dual-port SRAM is typically used to leverage disturb-free access because of its dedicated read port [25]. Consequently, 8T dual-port SRAMs with lower active/standby power have become more important than ever. A conventional 8T dual-port SRAM cell consists of the six transistors of a 6T SRAM cell and a decoupled two-transistor read port. This structure eliminates the well-known read disturb problem by preventing charge sharing with internal storage nodes when a read wordline is activated. A read port of an 8T dual-port SRAM employs a sourceline as a footer line, which is shared in the same row address to perform low-energy operations. This 8T 1R1W dual-port SRAM reduces leakage current through unselected read bitlines. However, some read bitlines in unselected columns are still discharged slightly because of the floating sourceline in the conventional scheme [26], which degrades energy efficiency.

For a low-energy image processor, we did the following.

- We designed a 28-nm FD-SOI 8T dual-port SRAM. High energy efficiency of sub-picojoule/cycle in the proposed SRAM is demonstrated.
- To cut off the read bitline discharge completely, we propose novel footer circuitry: the selective sourceline drive (SSD) scheme.

The remainder of this paper is organized as follows. Section II presents the proposed 8T dual-port SRAM design with the selective sourceline drive (SSD) scheme. Implementation and measurement results are presented in Section III. The final section summarizes the relevant findings.

### II. PROPOSED 8T DUAL-PORT SRAM DESIGN

### A. Selective sourceline drive (SSD) scheme

Fig. 2 illustrates a memory matrix of the conventional 8T 1R1W dual-port SRAM with the selective sourceline control (SSLC) scheme [26]. The memory matrix commonly employs an interleaving structure. The SL in this scheme acts as a "virtual ground line" for a single column. The SL has two states: a grounded state and a floating state. A selected read bitline (RBL) is connected to the ground through a transfer gate in the conventional structure, whereas SLs at unselected read ports become floating. The floating node of the unselected SL is charged up when a readout datum is "0" on an RBL. Because the SL is cut off, the RBL voltage does not swing fully; nevertheless, this partial discharge is unnecessary and consumes energy in the conventional scheme.

Fig. 3 illustrates the concept of the proposed 1R1W dual-port SRAM with the selective sourceline drive (SSD) scheme. It has a pair of nMOS switches (M1 and M2) and an inverter as a footer circuit in every column. These switches keep the SL at ground for the selected column, or drive the SL to VDD-Vth (Vth is the threshold voltage of M1) when the column is not selected. In the standby mode, all the SLs are grounded to prepare for upcoming random read access. The right panel also depicts the proposed SSD scheme behavior. In the read operation, the SL discharge enable (SDE) signal is set to "0" for all columns. The column select signal (COL\_E) from the column decoder activates a target column switch. In the selected column, the output of the OR gate becomes "1." The M2 transistor is activated to read out the stored data. At this time, the charge and discharge power are consumed only at the selected RBL because only the SL at the selected column is grounded.

Fig. 2. Conventional 8T memory matrix with the SSLC scheme. Unselected SLs are floating because of an nMOS switch, which leads to unnecessary energy consumption.

Fig. 3. Concept of the proposed selective sourceline drive (SSD) scheme in the read operation.

Fig. 4. Column control in the proposed SSD scheme in read operation.

In the unselected column, the OR gate output is low. The M1 transistor drives the SL voltage to VDD-Vth. Under these circumstances, no read current flows through the RBL. The nMOS transistors and the inverter circuit in the SSD scheme consume switching power to maintain the SL voltage (= VDD – Vth). However, this switching power is much smaller than that of the RBL discharge, because the RBL is connected to 256 cells per column.

Fig. 4 shows the concept of column control under the proposed SSD scheme. An OR gate is inserted as a column controller in every 16 columns to enable the SSD scheme. An inverter and two nMOS transistors are needed for every column. By contrast, in the conventional SSLC scheme, an nMOS transistor as a sourceline (SL) switch and an OR gate as a column selector are deployed in every column to control the SL connectivity. The proposed SSD scheme has less area overhead than the SSLC scheme. It is noteworthy that the column address inputs and the column decoder are shared with the conventional column addressing circuitry. Therefore, no additional circuitry is necessary for column addressing in the SSD scheme. Each OR gate takes the SDE signal and a column address decoder output, the column enable (COL\_E) signal, as its inputs. The SL is discharged to the ground by activation of the M2 transistor if the output of the OR gate is set to "1".
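The column control described above can be summarized as a small behavioral model. The following is a minimal sketch, not the authors' circuit: it only encodes the logic stated in the text (OR of SDE and COL\_E, M2 grounding the SL, M1 driving it to VDD-Vth). The function names and the numeric values of VDD and Vth are assumptions chosen for illustration.

```python
def ssd_sourceline_voltage(sde: int, col_e: int, vdd: float, vth: float) -> float:
    """Behavioral model of one SSD footer: returns the SL voltage of a column.

    sde   -- SL discharge enable ("1" in standby, "0" during a read)
    col_e -- column enable from the column decoder ("1" for the selected column)
    vth   -- threshold voltage of M1 (hypothetical value supplied by the caller)
    """
    or_out = sde | col_e          # OR gate output
    if or_out:
        return 0.0                # M2 is on: SL is pulled to ground
    return vdd - vth              # M1 is on: SL is driven to VDD - Vth


def read_cycle_sl_voltages(selected_col: int, n_cols: int = 16,
                           vdd: float = 0.5, vth: float = 0.3) -> list:
    """SL voltages of a 16-column group during a read (SDE = 0)."""
    return [ssd_sourceline_voltage(0, int(c == selected_col), vdd, vth)
            for c in range(n_cols)]


if __name__ == "__main__":
    # Standby: SDE = 1 grounds every SL regardless of COL_E.
    print([ssd_sourceline_voltage(1, 0, 0.5, 0.3) for _ in range(4)])
    # Read: only the selected column (index 3 here) has a grounded SL.
    print(read_cycle_sl_voltages(selected_col=3))
```

In this model only the selected column presents a grounded SL during a read, which is why only one RBL per group can discharge.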

Figs. 5(a)–5(c) compare the signal timing flow of the conventional SSLC scheme and the proposed SSD scheme in read operation. Fig. 5(a) shows the SL control signals; an OR gate that controls a target column is commonly used in both the conventional SSLC and the proposed SSD. The SDE signal, an input of the OR gate, is initially set to "1" to disable the switch control, but it is turned to "0" as the first step of the read operation. Then, the column select (COL\_E) signal chooses a target column, and the switches are activated. In the selected column, the switches pull down the SL voltage to the ground in both SSLC and SSD. In this cycle, the RBL voltage falls to the ground after the read wordline (RWL) is enabled in the "0"-read operation. In an unselected column, however, the single nMOS switch separates the SL from the ground in the SSLC, and the SL becomes a floating node. As a result, the RBL voltage is discharged slightly at every cycle because of leakage current, as presented in Fig. 5(b). By contrast, the proposed SSD scheme drives the SL voltage to VDD-Vth. The charge/discharge energy on the RBLs is eliminated even in unselected columns, as presented in Fig. 5(c).

Fig. 5. Signal timing control flow in the conventional SSLC and proposed SSD scheme in read operation: (a) SSD control signals, (b) read port control signals with conventional SSLC scheme, and (c) read port control signals with the proposed SSD scheme.



Fig. 6. Simulated "0"-read operating waveforms: (a) RWL waveform commonly used in the conventional SSLC scheme and the proposed SSD scheme, (b) RBL and SL waveforms in the conventional SSLC scheme, and (c) RBL and SL waveforms in the proposed SSD scheme.

Figs. 6(a)–6(c) compare the simulated waveforms for "0"-read operation in the conventional SSLC scheme and the proposed SSD scheme. Fig. 6(a) depicts the RWL waveform, which is common to both schemes.

Fig. 6(b) shows the RBL and the SL waveforms in the "0"-read operation simulated with the conventional SSLC scheme. In the selected column, the SL is connected to the ground. The RBL voltage is discharged to the ground voltage after input of the RWL pulse. In the unselected column, the SL is separated by the nMOS switch from ground. The dedicated read port is activated by the RWL pulse. Then, the RBL voltage is pulled down slightly because of leakage current. This discharge consumes unnecessary power at every cycle.

Fig. 6(c) depicts the RBL and the SL waveforms with the proposed SSD scheme in the "0"-read operation. In the selected column, the SSD scheme connects the SL to the ground. Then the RBL voltage is discharged to the ground voltage after the RWL pulse input. In unselected columns, however, the SSD scheme drives the SL voltage to VDD-Vth. The RBL voltage stays at VDD, which saves energy. In terms of power consumption, the SSD scheme is therefore better than the conventional SSLC scheme: it has no voltage swing on the local RBL of unselected columns. The output voltage amplitude of the SL is restricted to VDD-Vth, which minimizes the dynamic power consumption of the driver switches (M1 and M2) and the leakage current flowing through M2. The proposed SSD scheme therefore eliminates the unnecessary read current in the unselected columns. Dynamic power is consumed in the two driver nMOS transistors of the SSD scheme only when the selected column is changed. Therefore, read operation with vertical memory addressing, in which the column address rarely changes, is particularly effective for the proposed SRAM.

### B. 8T 1R1W dual-port SRAM cell

Figs. 7(a)–7(c) show the proposed 8T 1R1W dual-port SRAM cell schematic and custom layout design with a separated SL architecture in the 28-nm FD-SOI process technology. Fig. 7(a) portrays a schematic of the proposed SRAM cell design. It has additional pass gate transistors PG3 and PG4 with additional RWL and RBL metals to draw the read current flowing through the dedicated read port.

The dedicated read port enables simultaneous but separate read and write operations. The gate node of the PG4 transistor is connected to the cell internal node V2. Conventionally, the source node of the PG4 transistor is connected to ground. In our design, the source node of the PG4 transistor is connected to the SL, which runs vertically to the SSD circuit so that the SL voltage is controlled on a column basis.

Fig. 7(b) shows the FEOL layout of the proposed SRAM cell. The bit area is  $0.423~\mu m^2$ , designed under logic design rules. The read port comprising the PG3 and PG4 transistors is arranged at the right side of the 6T SRAM cell. The PG4 transistor shares the same poly gate with the PU2 and PD2 transistors as the V2 node.

Fig. 7(c) depicts the BEOL layout design of the proposed SRAM cell. Conventionally, the source node of the dedicated read port can be shared with an adjacent cell because all source nodes are grounded. In the proposed SRAM cell, the source nodes are connected vertically to the SSD scheme. Therefore, adjacent source nodes must be separated. In our design, the additional SL metal is located at the right end of the figure in the Metal3 layer instead of the conventional ground line. The SL metal can be accommodated within the cell area, so the additional SL metal incurs no area overhead in the cell design.

Fig. 7. Proposed 8T 1R1W dual-port SRAM cell in a 28-nm FD-SOI: (a) circuit design, (b) FEOL layout design, and (c) BEOL layout design.

### C. RBL delay and area optimization in SSD scheme

The single-ended 1R1W dual-port SRAM generally uses an inverter circuit as a sense amplifier (SA). This type of SA is beneficial for high-density single-ended SRAM design by virtue of its simple structure. In a "0"-read operation, the RBLs are discharged by the activated read ports. In the single-ended read port, RBL voltage must be fully discharged to the ground to sense the stored datum. The RBL delay performance with local/global variations affects the readout timing setup. The RBL delay depends mainly on the SRAM cell transistor sizing and the pull down nMOS switch size in the SSD scheme.

Fig. 8 illustrates the RBL delay versus the width of the SL pull-down nMOS switch, simulated with 1M Monte Carlo iterations at the SS corner and  $-40^{\circ}$ C. A smaller pull-down nMOS switch in the SSD scheme slows the RBL delay, whereas a larger size increases the capacitance on the RBL. Fig. 8 shows the  $5\sigma$  point of the RBL delay at each nMOS size. The slowest RBL delay is shorter by 24.9% at 640 nm width, compared with the case of 80 nm width. The RBL delay improvement saturates even if the switch size is increased further.

Fig. 8. Read bitline (RBL) delay  $t_{delay}$  and area overhead versus the gate width option of the SL pull-down nMOS transistor M2 (SS corner, -40°C). The slowest RBL delays at each gate width size are shown.

The area overhead is another factor; it increases linearly according to the nMOS width. In our design, we choose the pulldown nMOS switch width of 640 nm for the SSD scheme implementation with only 0.9% area overhead.

Fig. 9 presents the  $-5\sigma$  RBL delays simulated at the FF corner and 125°C. The delay is improved by 20.7% at the switch width of 640 nm, with an area overhead of only 0.9% in the whole memory macro, compared with the case of 80-nm width. Herein, the nMOS switch width is optimized by considering the tradeoff between the RBL delay and the area overhead.

Figs. 10(a)–10(b) present readout waveforms of the slowest cells in the conventional grounded SL scheme, the SSLC scheme, and the SSD scheme when Monte Carlo analyses are executed at TT, 25°C. Fig. 10(a) shows simulated waveforms of the SA output signal V(SAout). Fig. 10(b) shows the simulated RBL waveforms in the read operation in the SSD scheme with 1M iterations of Monte Carlo analyses. In this paper,  $t_{delay}$  is defined as the time at which V(SAout) rises to 0.45 V at a supply voltage of 0.5 V. Although the fastest RBL is fully discharged, the slowest RBL is still at about half of VDD. The readout time in the proposed SSD scheme is affected by the transistor size of the nMOS switches in the SSD scheme because the SL in the SSD is grounded immediately after a column is selected, which is regarded as a setup time. In any case, the SL discharge delay in the proposed SSD scheme is expected to be much shorter than the RBL discharge delay time. The  $t_{delay}$  in the grounded SL scheme, which is 43.45 ns, is most strongly affected by the Vth variations in the SRAM cell transistors. In the SSLC scheme,  $t_{delay}$  is 36.46 ns; in the SSD scheme,  $t_{delay}$  is 38.95 ns. These values are 16.1% and 10.3% shorter, respectively, than that of the grounded SL at TT, 25°C.

Fig. 9. Read bitline (RBL) delay  $t_{delay}$  and area overhead versus the gate width option of the SL pull-down nMOS transistor M2 (FF corner, 125°C): The fastest RBL delays at each gate width size are shown.
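The relative improvements quoted for the slowest cells at TT, 25°C follow directly from the three delay values. A short sketch of the arithmetic is given below; the helper name `relative_reduction` is an assumption introduced here for illustration.

```python
def relative_reduction(baseline_ns: float, value_ns: float) -> float:
    """Percentage by which value_ns is shorter than baseline_ns."""
    return 100.0 * (baseline_ns - value_ns) / baseline_ns


# Slowest-cell t_delay values quoted in the text (TT, 25 degC, 0.5 V).
T_GROUNDED_NS = 43.45
T_SSLC_NS = 36.46
T_SSD_NS = 38.95

if __name__ == "__main__":
    for name, t in (("SSLC", T_SSLC_NS), ("SSD", T_SSD_NS)):
        pct = relative_reduction(T_GROUNDED_NS, t)
        print(f"{name}: {pct:.2f}% shorter than the grounded SL")
```

The printed values agree with the quoted 16.1% and 10.3% to within rounding of the last digit.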

Additionally, one must consider the hold time until V(SAout) becomes 50% of the operating voltage. The  $t_{delay}$  variations should be evaluated for a hold time margin in readout operation. TABLE I presents a summary and a statistical evaluation of  $t_{delay}$  on the RBL for the three types of SL control schemes, obtained from 1M Monte Carlo analyses at SS, -40°C, with the operating voltage varied from 0.4 V to 0.7 V. An SL circuit that adopts a voltage control scheme with transistor stacking is expected to increase the average or median value of  $t_{delay}$  because of the capacitance increase, as presented in TABLE I. At 0.5 V operation in the proposed scheme, the average  $t_{delay}$  increases by 4.2% and the median value increases by 7.5%. However, statistics such as the skewness and the standard deviation of the  $t_{delay}$  distribution are slightly lower because of the nMOS transistor stack forcing condition [27]. At 0.5 V operation in the proposed scheme, the standard deviation decreases by 0.3% and the skewness decreases by 9.2%. The  $-5\sigma$  and  $5\sigma$  values are shown as the  $t_{delay}$  Min and the  $t_{delay}$  Max in TABLE I. At 0.5 V with SSLC,  $t_{delay}$  Min increases by 0.25 ns and  $t_{delay}$  Max decreases by 53.70 ns compared with the grounded SL. With the SSD,  $t_{delay}$  Min increases by 0.31 ns and  $t_{delay}$  Max decreases by 29.90 ns compared with the conventional grounded SL.
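The relative changes discussed above can be recomputed from the 0.5 V rows of TABLE I. The sketch below copies those rows verbatim; the dictionary layout and helper names are assumptions made for illustration only.

```python
# 0.5 V, SS, -40 degC rows of TABLE I (all values in ns except skewness).
TABLE_0V5 = {
    "GND":  {"mean": 17.26, "median": 13.26, "std": 12.58, "skew": 5.04,
             "min": 2.59, "max": 681.20},
    "SSLC": {"mean": 17.96, "median": 14.03, "std": 12.51, "skew": 4.63,
             "min": 2.84, "max": 627.50},
    "SSD":  {"mean": 17.99, "median": 14.26, "std": 12.54, "skew": 4.57,
             "min": 2.90, "max": 651.30},
}


def pct_change(scheme: str, key: str) -> float:
    """Change of a statistic relative to the grounded SL, in percent."""
    ref = TABLE_0V5["GND"][key]
    return 100.0 * (TABLE_0V5[scheme][key] - ref) / ref


def abs_change(scheme: str, key: str) -> float:
    """Absolute change of a statistic relative to the grounded SL."""
    return TABLE_0V5[scheme][key] - TABLE_0V5["GND"][key]


if __name__ == "__main__":
    for key in ("mean", "median", "std", "skew"):
        print(f"SSD {key}: {pct_change('SSD', key):+.2f}%")
    for scheme in ("SSLC", "SSD"):
        print(f"{scheme}: min {abs_change(scheme, 'min'):+.2f} ns, "
              f"max {abs_change(scheme, 'max'):+.2f} ns")
```

Because the table entries are rounded to two decimals, the recomputed percentages can differ from the quoted ones in the last digit.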

Fig. 11 shows the normalized delay on the RBL at supply voltages of 0.4 V to 0.7 V for hold margin evaluation. Each point is extracted from the slowest cell within  $5\sigma$  coverage at SS,  $-40^{\circ}$ C. The  $t_{delay}$  on the RBL in the proposed SSD scheme is much lower than that of the grounded SL: 30.2% lower at a voltage of 0.4 V. The  $t_{delay}$  does not follow a normal distribution; it is skewed. Fig. 12 shows  $t_{delay}$  distributions at 0.4 V, SS, and -40°C, in the conventional grounded SL, the conventional SL with the SSLC scheme, and the proposed SL with the SSD scheme on a quantile-quantile plot. To analyze the  $t_{delay}$  distribution statistically, the horizontal  $t_{delay}$  axis is converted by the logarithmic function,  $\log_{10}(t_{delay})$ , which fits the skewed distribution to a straight line. This implies that  $t_{delay}$  is determined by near-subthreshold current under this condition. In the logarithmic domain, the respective mean values of the conventional grounded SL and the proposed SSD scheme are -6.533 and -6.513. Results show that the proposed SSD is slower than the grounded SL, on average. However, the standard deviations for the conventional grounded SL and the proposed SSD are, respectively, 0.328 and 0.313.

| Voltage | Corner, Temp. | SL structure | Mean [ns] | Median [ns] | Std. [ns] | Skewness | Kurtosis | $t_{delay}$ Min [ns] | $t_{delay}$ Max [ns] |
|---------|---------------|--------------|-----------|-------------|-----------|----------|----------|----------------------|----------------------|
| 0.4 V   | SS, -40°C     | SL w/ GND    | 394.40    | 258.30      | 385.80    | 5.68     | 117.90   | 15.84                | 28,760.00            |
|         |               | SL w/ SSLC   | 404.20    | 276.40      | 380.60    | 5.03     | 75.01    | 20.02                | 20,660.00            |
|         |               | SL w/ SSD    | 405.20    | 285.40      | 382.40    | 4.92     | 70.00    | 21.24                | 20,070.00            |
| 0.5 V   | SS, -40°C     | SL w/ GND    | 17.26     | 13.26       | 12.58     | 5.04     | 81.02    | 2.59                 | 681.20               |
|         |               | SL w/ SSLC   | 17.96     | 14.03       | 12.51     | 4.63     | 62.88    | 2.84                 | 627.50               |
|         |               | SL w/ SSD    | 17.99     | 14.26       | 12.54     | 4.57     | 59.38    | 2.90                 | 651.30               |
| 0.6 V   | SS, -40°C     | SL w/ GND    | 2.62      | 2.45        | 0.79      | 1.94     | 14.18    | 0.96                 | 24.33                |
|         |               | SL w/ SSLC   | 2.75      | 2.58        | 0.81      | 1.85     | 12.40    | 1.01                 | 23.49                |
|         |               | SL w/ SSD    | 2.76      | 2.59        | 0.81      | 1.86     | 12.14    | 1.06                 | 24.17                |
| 0.7 V   | SS, -40°C     | SL w/ GND    | 0.99      | 0.97        | 0.17      | 0.86     | 4.60     | 0.52                 | 3.20                 |
|         |               | SL w/ SSLC   | 1.04      | 1.02        | 0.17      | 0.85     | 4.53     | 0.53                 | 3.03                 |
|         |               | SL w/ SSD    | 1.04      | 1.02        | 0.17      | 0.85     | 4.53     | 0.55                 | 2.92                 |

TABLE I STATISTICAL DATA COMPARISONS BETWEEN DIFFERENT SL STRUCTURES IN READ OPERATION

Fig. 10. Simulated readout delay comparison between conventional grounded SL, conventional SL with SSLC scheme, and proposed SL with SSD scheme.

The standard deviation of the proposed SSD is smaller. As depicted in Fig. 8, the standard deviation is significantly affected by the size of the pull-down transistor M2 connected to the SL. If M2 is sufficiently large, then the mean value of  $t_{delay}$  is close to that of the conventional grounded SL, while its variation is suppressed in the proposed SSD. According to probability theory, the standard deviation is suppressed by an increase in the number of transistors [27, 28, 29]. The proposed SSD has one more transistor on the SL. For these reasons,  $t_{delay}$  of the proposed SSD is slightly smaller than that of the conventional grounded SL at the  $5\sigma$  percentile, as the  $t_{delay}$  Max values in TABLE I show.
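To see how a slightly larger mean can still give a shorter worst-case delay, one can extrapolate the two fitted lines of the quantile-quantile plot. The sketch below assumes an exactly log-normal  $t_{delay}$  with the quoted mean and standard deviation of  $\log_{10}(t_{delay})$ , and assumes  $t_{delay}$  is expressed in seconds (which is consistent with medians of a few hundred nanoseconds at 0.4 V). The simulated distribution is only approximately log-normal, so these extrapolated values are not expected to reproduce the Min/Max entries of TABLE I.

```python
def lognormal_sigma_point(mu_log10: float, sigma_log10: float,
                          n_sigma: float) -> float:
    """Delay (in seconds) at n_sigma under an assumed log-normal model."""
    return 10.0 ** (mu_log10 + n_sigma * sigma_log10)


# Mean and standard deviation of log10(t_delay) quoted for 0.4 V, SS, -40 degC.
GROUNDED = (-6.533, 0.328)
SSD = (-6.513, 0.313)

if __name__ == "__main__":
    for n in (-5, 0, 5):
        g = lognormal_sigma_point(*GROUNDED, n)
        s = lognormal_sigma_point(*SSD, n)
        print(f"{n:+d} sigma: grounded {g * 1e9:10.2f} ns, SSD {s * 1e9:10.2f} ns")
```

Under this model the SSD line lies above the grounded-SL line at the centre of the distribution and at the fast tail, but crosses below it at the slow tail because of its smaller slope.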

### D. Operating speed evaluation in the write cycle

The proposed SRAM employs a precharge-less write circuit to reduce the energy consumption on the WBL. Fig. 13 shows the precharge-less write waveforms of the "0"-write operation in 1M Monte Carlo analyses. In this figure, the "1" data are initially stored in the SRAM cell. In the write operation, input data are transferred to the WBL/WBLB without a precharge sequence. The WBLs have no equalization because of the precharge-less write scheme. Therefore, writing the same input data consecutively consumes less energy on the WBLs. The write delay to flip the internal node voltage is shown with the waveforms of the internal nodes V1 and V2 on the fastest and slowest cells; Fig. 13(b) focuses on these internal nodes. Inversion of the internal node starts when the WWL pulse is input. In this figure, the inversion time  $t_{write}$  is defined as the time at which node V1 rises to 0.45 V. The  $t_{write}$  for the fastest cell at  $-5\sigma$  is only 160 ps, whereas the  $t_{write}$  for the slowest cell at  $5\sigma$  is 9.68 ns. In the whole write sequence, a write access time of 11.56 ns is achieved at TT, 25°C, which is much shorter than that of the read operation.

Fig. 11. Simulated readout delay comparison on slowest cells at varied supply voltage between conventional grounded SL, conventional SL with SSLC scheme, and proposed SL with SSD scheme.

Fig. 12.  $t_{delay}$  distributions in the conventional grounded SL, the conventional SL with the SSLC scheme, and the proposed SL with the SSD scheme on a quantile-quantile plot.

Fig. 13. Simulated waveforms of the proposed SRAM: (a) write waveforms on the write bitlines, and internal node V1 and V2, (b) focused on the internal node V1 and V2, and delay evaluation on the fastest/slowest cell at TT, 25°C.

### E. Consecutive memory access in video processing

Image data such as people or landscapes reflect luminance information. They have similar brightness in adjacent pixels. Figure 14 presents the switching possibility of a readout bit between a present pixel and a next pixel. The averaged switching possibility obtained from the three sample images is 49.8% on the least-significant bit (LSB = 1st digit), meaning that the value of the LSB is random, which is reasonable. The most-significant bit (MSB = 8th digit) has a switching possibility of 7.8% on average because it has much stronger correlation between adjacent pixels.
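The per-bit switching statistic described above can be computed for any pixel stream. The following is a minimal sketch for a one-dimensional sequence of 8-bit pixel values; the function name and the sample data are assumptions for illustration and are not the images used for Fig. 14.

```python
def bit_switching_probability(pixels, bits: int = 8):
    """Fraction of adjacent-pixel pairs in which each bit toggles.

    Index 0 of the returned list is the LSB and index bits-1 is the MSB.
    """
    pairs = list(zip(pixels, pixels[1:]))
    if not pairs:
        raise ValueError("need at least two pixels")
    probabilities = []
    for b in range(bits):
        toggles = sum(((p ^ q) >> b) & 1 for p, q in pairs)
        probabilities.append(toggles / len(pairs))
    return probabilities


if __name__ == "__main__":
    # Hypothetical smooth luminance ramp: neighbours differ by one level.
    ramp = [100, 101, 102, 103, 104, 105, 106, 107]
    probs = bit_switching_probability(ramp)
    print("LSB:", probs[0], "MSB:", probs[7])
```

For a smooth ramp such as this one the LSB toggles on every step while the MSB never toggles, which is the extreme case of the correlation that the measured averages reflect.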

Memory mapping for image data in the proposed video processing is performed on a channel-by-channel basis. Correlation between adjacent pixels is retained on RGB channels. Our consecutive memory access is beneficial for optimizing power consumption even if image data have multiple dimensions of channels. Therefore, in the consecutive accesses, it is better to map their addresses along the row direction, as presented in Fig. 15(a), where a column address is not changed often. Fig. 15(b) depicts the waveform of the proposed SRAM in write operation. By virtue of the precharge-less and incremental write operation, the proposed SRAM reduces the write energy; the charge/discharge on a pair of write bitlines (WBL and WBLB) is consumed only when a write datum is changed. The consecutive writing of the same datum consumes less energy because the proposed dual-port SRAM has a dedicated write port and needs no precharge scheme on the WBLs.
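A rough way to see the benefit of the precharge-less write is to count how many write-bitline pairs have to swing for a sequence of written words. The sketch below is a simplified counting model under stated assumptions, not a circuit simulation: in the precharged baseline every bit position swings on every write, whereas in the precharge-less case a WBL/WBLB pair swings only when the written bit differs from the previous one. The first write is counted pessimistically as a full swing. All names are hypothetical.

```python
def wbl_pair_swings(words, bits: int = 8, precharge_less: bool = True) -> int:
    """Count WBL/WBLB pair swings for consecutive writes to one column group."""
    mask = (1 << bits) - 1
    swings = 0
    previous = None
    for word in words:
        if not precharge_less or previous is None:
            swings += bits                                   # every pair swings
        else:
            swings += bin((word ^ previous) & mask).count("1")  # changed bits only
        previous = word
    return swings


if __name__ == "__main__":
    # Hypothetical pixel stream with strongly correlated neighbours.
    stream = [0x80, 0x80, 0x81, 0x81]
    print("precharged baseline:", wbl_pair_swings(stream, precharge_less=False))
    print("precharge-less     :", wbl_pair_swings(stream, precharge_less=True))
```

With correlated image data most upper bits stay unchanged between consecutive writes, so the count for the precharge-less case is dominated by the low-order bits.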

However, the bit-interleaved array is adversely affected by the well-known half-select problem along the write wordline (WWL). The divided wordline structure is therefore adopted only for the write port to avoid the half-select problem [30]. The read port has the common interleaving structure with no divided RWLs because an image processor often requires a greater number of read ports, so dividing the RWLs would strongly affect the area.



Fig. 14. Switching possibility in image data.



Fig. 15. Consecutive memory access in video processing: (a) block diagram and (b) waveforms in write operation.

### III. CHIP IMPLEMENTATION AND MEASUREMENT RESULTS

Fig. 16 shows a chip layout of the proposed 64-Kb SRAM macro configured with 32 Kb  $\times$  2 banks with X/Y decoders, read/write and I/O circuit. The macro size is 242  $\times$  189  $\mu m^2$  (= 0.045 mm²). Each 32-Kb bank consists of 4 Kb  $\times$  8 subarrays, which are configured with 256 rows and 16 columns. The area size of the 4-Kb subarray is 2,157  $\mu m^2$  (= 1962.38  $\mu m^2$  memory array + 194.77  $\mu m^2$  peripheral circuit). The circuit for the SSD scheme is located under the Y decoder. Its area is 27.75  $\mu m^2$ .
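The macro organization can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are assumptions for illustration.

```python
# Array organization as stated in the text.
ROWS_PER_SUBARRAY = 256
COLS_PER_SUBARRAY = 16
SUBARRAYS_PER_BANK = 8
BANKS = 2

bits_per_subarray = ROWS_PER_SUBARRAY * COLS_PER_SUBARRAY   # 4 Kb
bits_per_bank = bits_per_subarray * SUBARRAYS_PER_BANK      # 32 Kb
macro_bits = bits_per_bank * BANKS                          # 64 Kb

# Areas in square micrometres.
macro_area_um2 = 242 * 189
subarray_area_um2 = 1962.38 + 194.77   # memory array + peripheral circuit

if __name__ == "__main__":
    print(f"subarray: {bits_per_subarray // 1024} Kb, "
          f"bank: {bits_per_bank // 1024} Kb, macro: {macro_bits // 1024} Kb")
    print(f"macro area: {macro_area_um2} um^2 = {macro_area_um2 / 1e6:.4f} mm^2")
    print(f"subarray area: {subarray_area_um2:.2f} um^2")
```

The capacity figures multiply out to 64 Kb, and the two subarray area components sum to the quoted total.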

In our design, the area overhead of the proposed SSD circuit is only 0.9% of the entire macro. Fig. 17 presents a test chip micrograph. We fabricated a 64-Kb 8T dual-port SRAM macro in a 28-nm FD-SOI process technology.



Fig. 17. Chip micrograph of the test chip.



Fig. 16. A 64-Kb SRAM (32 Kb × 2 banks) macro layout comprising 16 × 4-Kb subarray blocks (= 256 × 16 cells each).



Fig. 18. Measured Shmoo plot in read operation.



Fig. 19. Measured Shmoo plot in write operation.

Fig. 20. Schematic of the proposed 8T dual-port SRAM with the SSD scheme.

Fig. 18 presents a measured Shmoo plot in read operation. We verified that the test chip can operate at a supply voltage of 0.48 V with an access time of 135 ns (= 7.4 MHz) at room temperature ($25^{\circ}$C). The operating point that achieves the minimum energy per cycle is at a supply voltage of 0.56 V and a cycle time of 35 ns (= 28.6 MHz). In addition, Fig. 19 presents a measured Shmoo plot in write operation. The test chip can operate at a supply voltage of 0.46 V with a write pulse width of 56 ns. The shortest write pulse width is 4 ns at a supply voltage of 0.62 V.
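The frequencies quoted alongside the access and cycle times follow from f = 1/T. A minimal sketch of the conversion, with `frequency_mhz` as a hypothetical helper name:

```python
def frequency_mhz(cycle_time_ns):
    """Convert a cycle (or access) time in nanoseconds to a frequency in MHz."""
    return 1e3 / cycle_time_ns


# Operating points quoted in the text.
print(round(frequency_mhz(135), 1))  # 7.4  (0.48 V)
print(round(frequency_mhz(35), 1))   # 28.6 (0.56 V, minimum-energy point)
```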

### TABLE III DUAL-PORT 8T SRAM COMPARISON

| Source                       | [26]                    | [21]                      | [22]                      | [23]                     | [24]                     | This Work                |
|------------------------------|-------------------------|---------------------------|---------------------------|--------------------------|--------------------------|--------------------------|
| Technology                   | 40nm bulk<br>CMOS       | 40nm LP<br>CMOS           | 65nm bulk<br>CMOS         | 28nm bulk<br>CMOS        | 16nm<br>FinFET           | 28nm<br>FD-SOI           |
| Memory Size                  | 16 Kb                   | 512 Kb                    | 96 Kb                     | 512 Kb                   | 256 Kb                   | 64 Kb                    |
| Cell Size [μm²]              | 1.01*                   | 0.8496                    | -                         | -                        | -                        | 0.423*                   |
| Bit density [Mb/mm²]         | 0.457                   | 1.17                      | -                         | 3.16                     | 6.05                     | 2.35                     |
| Sensing<br>Scheme            | Large<br>Signal         | Small<br>Signal           | Large<br>Signal           | Small<br>Signal          | Small<br>Signal          | Large<br>Signal          |
| Cell Type                    | 8T-1R1W                 | 8T-CP-1R1W                | 8T-1R1W                   | 8T-1R1W                  | 6T-1R1W                  | 8T-1R1W                  |
| IO Size                      | 16                      | 64                        | 8                         | 32                       | 64                       | 16                       |
| # Bits/BL                    | 128                     | 32                        | 128                       | 256                      | 128                      | 256                      |
| Performance                  | 10.0 MHz<br>(0.5V/25°C) | 800.0 MHz<br>(1.1V/25°C)  | 1.8 MHz<br>(0.55V/25°C)   | 1.69 GHz<br>(1.0V/125°C) | 1.2 GHz<br>(0.88V/-)     | 66.7 MHz<br>(0.7V/25°C)  |
| Functional<br>Frequency      | 10.0 MHz<br>(0.5V/25°C) | 200.0 MHz<br>(0.65V/25°C) | 125.0 kHz<br>(0.36V/25°C) | 1.69 GHz<br>(1.0V/125°C) | 1.2 GHz<br>(0.88V/-)     | 28.0 MHz<br>(0.56V/25°C) |
| Power<br>[μW/Access]         | 17.8<br>(0.5V/25°C)     | 902.0<br>(0.65V/25°C)     | 5.1<br>(0.36V/25°C)       | 16375.7<br>(1.0V/125°C)  | 14503.2<br>(0.88V/125°C) | 9.16<br>(0.56V/25°C)     |
| FoM [fJ/bit]                 | 2.43                    | 0.19                      | 11.44                     | 0.15                     | 0.26                     | 0.08                     |

\*Logic rule based SRAM cell design

Fig. 20 portrays a schematic of the proposed 8T dual-port SRAM array with the SSD scheme. The proposed 8T SRAM has a precharge-less write circuit. Consequently, successive writes of the same data consume less energy. However, bit interleaving incurs the well-known half-select problem along the write wordline. The divided wordline structure is therefore adopted to avoid the half-select problem [30] in our design. The OR gate of the SSD scheme is connected to 16 readout circuits; it selects the target column for SL discharge and charges the SLs to VDD – Vth in the unselected columns.
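The column selection of the SSD scheme can be sketched behaviorally: during a read, the SL of the selected column is discharged, whereas the SLs of all unselected columns are held at VDD – Vth. The following is a minimal model of that rule only, not of the actual circuit. The function name, the assumption that a discharged SL sits at 0 V, and the example Vth value are hypothetical.

```python
def sourceline_levels(selected_col, vdd, vth, n_cols=16):
    """Return the assumed SL voltage of each column during a read access."""
    if not 0 <= selected_col < n_cols:
        raise ValueError("selected_col is outside the column range")
    # Selected column: SL discharged (assumed 0 V) so the read port can pull down the RBL.
    # Unselected columns: SL driven to VDD - Vth, as described in the text.
    return [0.0 if col == selected_col else vdd - vth for col in range(n_cols)]


# Example at the minimum-energy supply voltage; Vth = 0.2 V is an illustrative guess.
levels = sourceline_levels(selected_col=3, vdd=0.56, vth=0.2)
print(levels[3], round(levels[0], 2))
```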

Fig. 21 shows the simulated and measured active/leakage energy comparisons between the conventional SSLC scheme and the proposed SSD scheme for read operation. It is noteworthy that both read circuits must have the RBL precharge scheme because of their single-ended read ports. In the read operation, the "ALL 0" and "ALL 1" test patterns respectively denote consecutive "0" and "1" read operations with incremental row address accesses. The checkerboard pattern using the incremental row address (CKB X+) has 50.0% "0" and 50.0% "1" data, with energies intermediate between those of "ALL 0" and "ALL 1". In the CKB using the incremental column address (CKB Y+), the column address is changed at every cycle. The energy comparison demonstrates that the proposed SSD scheme reduces the read energy by 26.0% on average, to 389.6 fJ/cycle.

Fig. 22 shows the simulated energy breakdown and a comparison between the conventional SSLC scheme and the proposed SSD scheme in "ALL 1" and "ALL 0" read operations. Although an RWL must be enabled at every cycle, RBL charge/discharge does not occur in the "ALL 1" read operation because the read port is cut off with PG4 in the 8T cell. Unnecessary current is reduced in unselected columns. However, Fig. 22(a) shows that the energy saving is small because no RBL is discharged in this case. In contrast, Fig. 22(b) shows the "ALL 0" read operation, in which RBL charge/discharge takes place at every cycle and the RBL and selected-SL energies increase drastically.

Floating SLs in the unselected columns are discharged in the SSLC scheme, which consumes unnecessary readout energy.



Fig. 21. Simulated and measured energy comparisons between the conventional SSLC scheme and the proposed SSD scheme in read operation.

The SSD scheme reduces the energy wasted in the unselected columns by 96.5% compared with the SSLC scheme.

In the write operation, the proposed 8T dual-port SRAM requires no precharge on the WBLs. Additionally, its WWL has a divided structure. Therefore, the proposed SRAM can reduce the unneeded write energy caused by charge/discharge in the half-selected columns. The measured "0" and "1" write energies are, respectively, 196.2 fJ/cycle and 215.2 fJ/cycle. Fig. 23 portrays the impact of the incremental write operation on write energy saving. By contrast, the write energy becomes 382.0 fJ/cycle in the consecutive column access with the CKB test pattern. Write energies are reduced according to the spatial frequency of the images.

Fig. 23. Write energy saving in incremental accesses.

Fig. 22. Simulated energy breakdown comparison between the conventional SSLC scheme and the proposed SSD scheme in (a) "ALL 1" and (b) "ALL 0" read operations.

In the color Image 1 and the monochrome Image 2, the write energy is reduced by 29%, whereas in the monochrome Image 9 and the color Image 10, the energy savings reach 34% and 35%, respectively. The saving is larger when an image has high similarity among its pixels. In Image 10, the write energy is 250.4 fJ/cycle. Across the images, the write energy is 264.0 fJ/cycle on average, which is 30% lower than that in the consecutive column access.
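The quoted savings can be roughly reproduced from the energies given in the text, taking the 382.0 fJ/cycle of the consecutive column access (CKB pattern) as the baseline. This is a sketch of the arithmetic only; `saving_percent` is a hypothetical helper, and the pairing of each energy value with the baseline is an interpretation of the text.

```python
def saving_percent(baseline_fj, measured_fj):
    """Relative energy saving of measured_fj with respect to baseline_fj, in percent."""
    return 100.0 * (baseline_fj - measured_fj) / baseline_fj


CKB_COLUMN_ACCESS_FJ = 382.0  # consecutive column access, CKB pattern

print(round(saving_percent(CKB_COLUMN_ACCESS_FJ, 264.0), 1))  # 30.9 (average over images)
print(round(saving_percent(CKB_COLUMN_ACCESS_FJ, 250.4), 1))  # 34.5 (Image 10)
```

Both results agree with the reported figures of about 30% and 35% to within rounding.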

Therefore, our proposed 8T dual-port SRAM with the SSD scheme can reduce both read and write energies. Moreover, it is suitable for low-power image processing devices. TABLE II presents a summary of the characteristics of the proposed SRAM test chip implemented in 28-nm FD-SOI technology.

TABLE III presents a performance comparison among the state-of-the-art 1R1W dual-port SRAMs taken from recent conference and journal papers, as introduced in Section I.

For our test chip, we designed the SRAM bitcell on a logic rule basis. Therefore, the bitcell density is lower than that of [16], which uses a similar technology node. As described previously, we use an inverter as a large-signal sensing scheme. It is generally used for its lower area cost by virtue of its simple structure. However, the inverter incurs a full-swing signal when "1" is read out. A small-signal sensing scheme using a differential sense amplifier is often adopted [21, 23, 24], which consequently achieves a much higher operating frequency.

### TABLE II TEST CHIP FEATURES

| Feature             | Value                                          |
|---------------------|------------------------------------------------|
| Technology          | 28-nm FD-SOI                                   |
| Supply voltage      | 0.48–0.7 V (memory macro)<br>1.8 V (I/O)       |
| Chip area           | 1.0 × 1.0 mm²                                  |
| Macro size          | 189 × 242 μm²                                  |
| Macro configuration | 64 Kb (32 Kb × 2), 16 bits/word                |
| Cell size           | 0.291 × 1.457 μm²                              |
| Frequency           | 7.4 MHz @ 0.48 V, 66.7 MHz @ 0.7 V             |
| Read active energy  | 389.6 fJ @ 0.56 V, 28.6 MHz, RT                |
| Write active energy | 265.0 fJ @ 0.56 V, 28.6 MHz, RT                |

However, such circuitry requires dedicated signal timing and a greater area cost. The figure of merit (FoM) represents the energy per bit that includes standby and active energy, which is scaled by technology node scale factor k, operating frequency freq, the number of cells on a bitline  $l_{bl}$ , I/O bit width  $w_{io}$ , and the entire memory capacity Cap. In this case, FoM is expressed as shown below.

$$\mathit{FoM} = \frac{\mathit{Power}}{\mathit{Cap} \cdot l_{bl} \cdot w_{io} \cdot \mathit{freq} \cdot k}$$

Because of the unnecessary read energy reduction by the proposed SL voltage control scheme and because of the WBL charge and discharge energy saving with consecutive memory access, the *FoM* of the proposed SRAM is lower than those of the other state-of-the-art designs listed in TABLE III. These results demonstrate the utility of the SL voltage control with the SSD scheme for low-power and low-voltage performance on 1R1W dual-port SRAMs.
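The power entry for this work in TABLE III (9.16 μW at 28.0 MHz and 0.56 V) can be cross-checked against the measured read and write energies. The sketch below assumes an equal mix of read and write accesses. That assumption is ours and is not stated in the text; it is shown only because it reproduces the tabulated value.

```python
# Measured active energies at 0.56 V (from the text), in fJ/cycle.
READ_FJ = 389.6
WRITE_FJ = 265.0
FREQ_MHZ = 28.0  # functional frequency listed in TABLE III

# Assumption: reads and writes are equally frequent.
mean_energy_fj = (READ_FJ + WRITE_FJ) / 2

# P = E * f, converted from fJ and MHz to microwatts.
power_uw = mean_energy_fj * 1e-15 * FREQ_MHZ * 1e6 * 1e6

print(round(mean_energy_fj, 1), round(power_uw, 2))
```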

### IV. SUMMARY

We presented an 8T dual-port SRAM with the selective sourceline drive (SSD) scheme for an image processor. Our proposed SRAM drives the sourceline (SL) to VDD-Vth at unselected columns in read operation and exploits the consecutive row accesses in write operation for improving energy efficiency at low voltage. We fabricated a 64-Kb 8T dual-port SRAM using 28-nm FD-SOI process technology. The test chip exhibits 0.48 V operation with 135 ns access time. The energy minimum point is a supply voltage of 0.56 V at a 28.6 MHz frequency, at which 265.0 fJ/cycle in the write operation and 389.6 fJ/cycle in the read operation were achieved.

### REFERENCES

- [1] J. Lim, N. B. Lakshminarayana, H. Kim, W. Song, S. Yalamanchili, W. Sung, "Power Modeling for GPU Architecture using McPAT," ACM Trans. on Design Automation of Electronic Systems (TODAES), vol. 19, no. 3, pp. 1–17, June 2014.
- [2] N. Planes, O. Weber, V. Barral, S. Haendler, D. Noblet, D. Croain, M. Bocat, P. Sassoulas, X. Federspiel, A. Cros, A. Bajolet, E. Richard, B. Dumont, P. Perreau, D. Petit, D. Golanski, C. Fenouillet-Beranger, N. Guillot, M. Rafik, V. Huard, S. Puget, X. Montagner, M.-A. Jaud, O. Rozeau, O. Saxod, F. Wacquant, F. Monsieur, D. Barge, L. Pinzelli, M. Mellier, F. Boeuf, F. Arnaud and M. Haond, "28-nm FDSOI Technology Platform for High-Speed Low-Voltage Digital Applications" in *IEEE Symposium on VLSI Tech. Dig.*, pp. 133–134, June 2012.
- [3] P. Flatresse, B. Giraud, J. Noel, et al., "Ultra-Wide Body-Bias Range LDPC Decoder in 28-nm UTBB FDSOI Technology" in *IEEE Int. Solid-State Circuits Conference (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 424–425.
- [4] L. Hutin, C. Le Royer, F. Andrieu, O. Weber, M. Casse, J.-M. Hartmann, D. Cooper, A. Béché, L. Brevard, L. Brunet, J. Cluzel, P. Batude, M. Vinet and O. Faynot, "Dual Strained Channel Co-Integration into CMOS, RO and SRAM cells on FDSOI down to 17nm Gate Length" in *IEDM Dig. Tech. Papers*, Dec. 2010, pp. 11.1.1-11.1.4.
- [5] J. E. Husseini, X. Garros, J. Cluzel, A. Subirats, A. Makosiej, O. Weber, O. Thomas, V. Huard, X. Federspiel, G. Reimbold, "A complete Characterization and Modeling of the BTI-Induced Dynamic Variability of SRAM Arrays in 28-nm FD-SOI Technology" *IEEE Trans. on Electron Devices*, Vol. 61, no. 12, pp. 3991–3999, Dec. 2014.
- [6] C. Fenouillet-Beranger, S. Denorme, B. Icard et al., "Fully-Depleted SOI Technology using High-K and Single-Metal Gate for 32nm Node LSTP Applications featuring 0.179μm<sup>2</sup> 6T-SRAM bitcell" in *IEDM Dig. Tech. Papers*, Dec. 2007, pp. 267–270.
- [7] O. Thomas, B. Zimmer, B. Pelloux-Prayer, N. Planes, K-C. Akyel, L. Ciampolini, P. Flatresse and B. Nikolić, "6T SRAM Design for Wide Voltage Range in 28-nm FDSOI" in *IEEE Int. SOI Conf. Papers*, Oct. 2012, pp. 1–2.
- [8] H. Pilo, C. A. Adams, et al., "A 64Mb SRAM in 22nm SOI Technology Featuring Fine-Granularity Power Gating and Low-Energy Power-Supply-Partition Techniques for 37% Leakage Reduction" in *IEEE Int. Solid-State Circuits Conference (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 322–323.
- [9] T. Sakurai, "Low power digital circuit design", Proc. of IEEE European Solid-State Conference (ESSCIRC), Sep. 2004, pp. 11–18.
- [10] M. Nomura, A. Muramatsu, H. Takeno, S. Hattri, D Ogawa, M. Nasu, K Hirairi, S. Kumashiro, S. Moriwaki, Y. Yamamoto, S. Miyano, Y. Hiraku, I. Hayashi, K. Yoshioka, A. Shikata, H. Ishikuro, M. Ahn, Y. Okuma, X. Zhang, Y. Ryu, K. Ishida, M. Takamiya, T. Kuroda, H. Shinohara, and T. Sakurai, "0.5V Image Processor with 563 GOPS/W SIMD and 32bit CPU Using High Voltage Clock Distribution (HVCD) and Adaptive Frequency Scaling (AFS) with 40nm CMOS" in *IEEE Symposium on VLSI Tech. Dig. Papers*, June 2013, pp. 118–119.
- [11] K. Kang, H. Jeong, Y. Yang, J. Park, K. Kim, and S.-O. Jung, "Full-swing local bitline SRAM architecture based on the 22-nm FinFET technology for low-voltage operation," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 24, no. 4, pp. 1342–1350, Apr. 2016.
- [12] K.-H. Koo, L. Wei, J. Keane, U. Bhattacharya, E. A. Karl, and K. Zhang, "A 0.094μm² high density and aging resilient 8T SRAM with 14nm FinFET technology featuring 560mV VMIN with read and write assist," in VLSI Technol. Symp., Dig. Tech. Papers, Jun. 2015, pp. C266–C267.
- [13] H. Mori, T. Nakagawa, Y. Kitahara, Y. Kawamoto, K. Takagi, S. Yoshimoto, S. Izumi, K. Nii, H. Kawaguchi, and M. Yoshimoto, "A 298-fJ/writecycle 650-fJ/readcycle 8T Three-Port SRAM in 28-nm FD-SOI Process Technology for Image Processor," *Proc. of IEEE Custom Integrated Circuits Conference (CICC)*, Sep. 2015, pp. 1–4.
- [14] J. P. Kulkarni, J. Keane, K. H. Koo, S. Nalam, Z. Guo, E. Karl, and K. Zhang, "5.6Mb/mm² 1R1W 8T SRAM Arrays Operating Down to 560 mV Utilizing Small-Signal Sensing with Charge Shared Bitline and Asymmetric Sense Amplifier in 14 nm FinFET CMOS Technology" *IEEE J. Solid-State Circuits*, vol. 52, no. 1, pp. 229–239, Jan. 2017.

- [15] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," *IEEE J. Solid-State Circuits*, vol. 52, no. 1, pp. 127–138, Jan. 2017.
- [16] Stephen W. Keckler, W. Dally, B. Khailany, M. Garland and D. Glasco, "GPUs AND THE FUTURE OF PARALLEL COMPUTING" IEEE Micro, vol. 31, no. 5, pp. 7–17, Sept. 2011.
- [17] B.-C.C. Lai, K. Hsien-Kai, Jou Jing-Yang, "A Cache Hierarchy Aware Thread Mapping Methodology for GPGPUs" *Computers, IEEE Trans.* on, vol. 64, no. 4, pp. 884–898, April 2015.
- [18] K. Nii, M. Yabuuchi, Y. Yokoyama, Y. Ishii, T. Okagaki, M. Morimoto, Y. Tsukamoto, K. Tanaka, M. Tanaka, S. Tanaka, "2RW dual-port SRAM design challenges in advanced technology nodes" in *IEDM Dig. Tech. Papers*, Dec. 2015, pp. 11.1.1–11.1.4
- [19] M. Yabuuchi, Y. Tsukamoto, M. Morimoto, M. Tanaka, and Koji Nii, "20nm high-density single-port and dual-port SRAMs with wordline-voltage-adjustment system for read/write assists," in *IEEE Int. Solid State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 234–235.
- [20] J. Kulkarni, Muhammed Khellah, J. Tschanz, B. Geuskens, R. Jain, S. Kim, and V. De, "8T-bitcell SRAM Array in 22nm Tri-Gate CMOS for Energy-Efficient Operation across Wide Dynamic Voltage Range," in *IEEE Symposium on VLSI Tech. Dig. Papers*, June 2013, pp. C126–127.
- [21] N. Lien, L. Chen, C. Chen, H. Yang, M. Tu, P. Kan, Y. Hu, C Chung, S. Jou, W Hung, "A 40 nm 512 kb Cross-Point 8 T Pipeline SRAM With Binary Word-Line Boosting Control, Ripple Bit-Line and Adaptive Data-Aware Write-Assist," Circuit and Systems-I, IEEE Trans. on, vol. 64, no. 4, pp. 643–647, Dec. 2014.
- [22] L. Wen, X. Cheng, K. Zhou, S. Tian, and X Zeng, "Bit-Interleaving-Enabled 8T SRAM With Shared Data-Aware Write and Reference-Based Sense Amplifier," Circuit and Systems–II, IEEE Trans. on, vol. 64, no. 4, pp. 643–647, July 2016.
- [23] M. Yabuuchi, H. Fujiwara, Y. Tsukamoto, M. Tanaka, S. Tanaka, and K. Nii, "A 28nm High Density 1R/1W 8T SRAM Macro with Screening Circuitry against Read Disturb Failure," *Proc. of IEEE Custom Integrated Circuits Conference (CICC)*, Sep. 2013, pp. 1–4.
- [24] M. Yabuuchi, Y. Sawada, T. Sano, Y. Ishii, S. Tanaka, M. Tanaka and K. Nii, "A 6.05-Mb/mm<sup>2</sup> 16-nm FinFET Double Pumping 1W1R 2-port SRAM with 313-ps Read Access Time," in *IEEE Symposium* on VLSI Tech. Dig. Papers, Feb. 2016, pp. 14–15.
- [25] Y. Ishii, H. Fujiwara, et al., "A 28 nm Dual-Port SRAM Macro with Screening Circuitry Against Write-Read Disturb Failure Issues" *IEEE J. Solid-State Circuits*, vol. 46, no. 11, pp. 2535–2544, Nov. 2011.
- [26] S. Yoshimoto, S. Miyano et al., "A 40-nm 8T SRAM with Selective Source Line Control of Read Bitlines and Address Preset Structure," Proc. of IEEE Custom Integrated Circuits Conference (CICC), Sep. 2013, pp. 1–4.
- [27] D. N. Silva, A. I. Reis, R. P. Ribas, "CMOS logic gate performance variability related to transistor network arrangements," in *Microelectronics Reliability*, vol. 49, no. 9–11, pp. 977–981, Sep. 2009.
- [28] M. Alioto, G. Palumbo, and M. Pennisi, "Understanding the Effect of Process Variations on the Delay of Static and Domino Logic," *Very Large Scale Integration (VLSI), IEEE Trans. on*, vol. 18, no. 5, pp. 697–710, May 2010.
- [29] D. N. Silva, A. I. Reis, R. P. Ribas, "Gate delay variability estimation method for parametric yield improvement in nanometer CMOS technology," in *Microelectronics Reliability*, vol. 50, no. 9–11, pp. 1223–1229, Aug. 2010.
- [30] H. Fujiwara, M. Yabuuchi, M. Morimoto and K. Tanaka, "A 20nm 0.6V 2.1μW/MHz 128-kb SRAM with no half select issue by interleave wordline and hierarchical bitline scheme" in *IEEE Symposium on VLSI Tech. Dig. Papers*, June 2013, pp. 118–119.



Haruki Mori received B.E. and M.E. degrees in Computer Science and Systems Engineering from Kobe University, Kobe, Japan in 2014 and 2016, respectively. He is currently working toward the Ph.D. degree at Kobe University. He joined TSMC, Hsinchu, Taiwan, in 2017, as an Intern. He has been a JSPS research fellow at Kobe University since 2018. His current research interests include low-power SRAM design, low-voltage MRAM design, and energy-efficient deep learning systems. He was a recipient of the IEEE CICC 2015 Intel/IBM/Catalyst Foundation Student Scholarship Award, IEEE ICECS 2016 Best Paper Award, and IEEE MLSP 2017 Best Student Paper Award. He is a student member of IEEE and IEICE.



Tomoki Nakagawa was born on May 2, 1990. He earned B.E. and M.E. degrees in Computer and Systems Engineering from Kobe University, Hyogo, Japan, in 2013 and 2015, respectively. His current research is related to low-power and low-voltage memory designs.



Yuki Kitahara was born on March 3, 1990. He earned B.E. and M.E. degrees in Computer and Systems Engineering from Kobe University, Hyogo, Japan, in 2012 and 2014, respectively. His current research is related to low-power and low-voltage memory designs.



Yuta Kawamoto received B.E. and M.E. degrees in Computer Science and Systems Engineering from Kobe University, Kobe, Japan in 2014 and 2016, respectively. His current research is related to low-power and low-voltage analog circuit designs.



Kenta Takagi received the B.E. and M.E. degrees in Computer Science and Systems Engineering from Kobe University, Kobe, Japan in 2012 and 2014, respectively. He is currently on the master course at Kobe University. His current research is on low-power image recognition VLSI design. He was a recipient of the IEEE SSCS 2013 Japan Chapter Academic Research Award.



Shusuke Yoshimoto received B.E. and M.E. degrees in Computer and Systems Engineering from Kobe University, Hyogo, Japan, in 2009 and 2011, respectively. He earned a Ph.D. degree in Engineering from the university in 2013. He was a JSPS research fellow from 2013 to 2014. He worked in the Department of Electrical Engineering at Stanford University as a postdoctoral researcher from 2013 to 2015. Since 2015, he has been an Assistant Professor in The Institute of Scientific and Industrial Research at Osaka University. His current research interests include biomedical signal processing, flexible electronics, organic circuit design, nano-electronics, soft error, and low-power and robust memory design. He was a recipient of the 2011 and 2012 IEEE SSCS Japan Chapter Academic Research Awards, the 2013 IEEE SSCS Kansai Chapter IMFEDK Student Paper Award, and the 2013 Intel/Analog Devices/Catalyst Foundation/Cirrus Logic CICC Student Scholarship Award. He has served as a program committee student member in IEICE Integrated Circuit Design.



Shintaro Izumi received his B.Eng. and M.Eng. degrees in Computer Science and Systems Engineering from Kobe University, Hyogo, Japan, in 2007 and 2008, respectively. He received his Ph.D. degree in Engineering from Kobe University in 2011. He was a JSPS research fellow at Kobe University from 2009 to 2011. Since 2011, he has been an Assistant Professor in the Organization of Advanced Science and Technology at Kobe University. His current research interests include biomedical signal processing, communication protocols, low-power VLSI design, and sensor networks.

He has served as a Vice Chair of IEEE Kansai Section Young Professional Affinity Group, as a Student Activity Committee Member for IEEE Kansai Section, as a Program Committee Member for IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips), and as a Guest Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. He was a recipient of 2010 IEEE SSCS Japan Chapter Young Researchers Award.



Hiroshi Kawaguchi received B.Eng. and M.Eng. degrees in electronic engineering from Chiba University, Chiba, Japan, in 1991 and 1993, respectively, and earned a Ph.D. degree in electronic engineering from The University of Tokyo, Tokyo, Japan, in 2006. He joined Konami Corporation, Kobe, Japan, in 1993, where he developed arcade entertainment systems.

He moved to The Institute of Industrial Science, The University of Tokyo, as a Technical Associate in 1996, and was appointed as a Research Associate in 2003. In 2005, he moved to Kobe University, Kobe, Japan. Since 2007, he has been an Associate Professor with The Department of Information Science at that university. He is also a Collaborative Researcher with The Institute of Industrial Science, The University of Tokyo. His current research interests include low-voltage SRAM, RF circuits, and ubiquitous sensor networks. Dr. Kawaguchi was a recipient of the IEEE ISSCC 2004 Takuo Sugano Outstanding Paper Award and the IEEE Kansai Section 2006 Gold Award. He has served as a Design and Implementation of Signal Processing Systems (DISPS) Technical Committee Member for IEEE Signal Processing Society, as a Program Committee Member for IEEE Custom Integrated Circuits Conference (CICC) and IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips), and as an Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences and IPSJ Transactions on System LSI Design Methodology (TSLDM). He is a member of the IEEE, ACM, IEICE, and IPSJ.



Masahiko Yoshimoto joined the LSI Laboratory, Mitsubishi Electric Corporation, Itami, Japan, in 1977. From 1978 to 1983 he was engaged in the design of NMOS and CMOS static RAM. From 1984 he was involved in the research and development of multimedia ULSI systems. He earned a Ph.D. degree in Electrical Engineering from Nagoya University, Nagoya, Japan in 1998. From 2000, he was a professor in the Dept. of Electrical & Electronic System Engineering at Kanazawa University, Japan. Since 2004, he has been a professor in the Dept. of Computer and Systems Engineering at Kobe University, Japan. His current activity is focused on the research and development of ultra-low-power multimedia and ubiquitous media VLSI systems and dependable SRAM circuits. He holds 70 registered patents. He served on the program committee of the IEEE International Solid State Circuit Conference from 1991 to 1993. He also served as Guest Editor for special issues on Low-Power System LSI, IP and Related Technologies of IEICE Transactions in 2004. He was chair of the IEEE SSCS (Solid State Circuits Society) Kansai Chapter from 2009 to 2010. He was also chair of The IEICE Electronics Society Technical Committee on Integrated Circuits and Devices from 2011 to 2012. He received the R&D100 awards from the R&D magazine for the development of the DISP and the development of the realtime MPEG2 video encoder chipset in 1990 and 1996, respectively. He also received the 21st TELECOM System Technology Award in 2006.