Optimizing Performance and Energy Overheads Due to Fanout in In-Memory Computing Systems

Md Adnan Zaman²⁰,
Rajeev Joshi²⁰ &
Srinivas Katkoori²⁰

Part of the book series: IFIP Advances in Information and Communication Technology ((IFIPAICT,volume 561))

Included in the following conference series:

IFIP/IEEE International Conference on Very Large Scale Integration - System on a Chip

723 Accesses
5 Citations

Abstract

For NOR-NOT based memristor crossbar architectures, we propose a novel approach to address the fanout overhead problem. Instead of copying the logic value as inputs to the driven memristors, we propose that the controller reads the logic value and then applies it in parallel to the driven memristors. We consider two different cases based on the initialization of the memristors to logic-1 at the locations where we want keep the first input memristor of the driven gates. If the memristors are initialized, it falls under case 1, otherwise case 2. In comparison to recently published works, experimental evaluation on ISCAS’85 benchmarks resulted in average performance improvements of 51.08%, 38.66%, and 63.18% for case 1 and 50.94%, 42.08%, and 60.65% for case 2 considering three different mapping scenarios (average, best, and worst). In regards to energy dissipation, we have also obtained average improvements of 91.30%, 88.53%, and 74.04% for case 1 and 86.03%, 78.97%, and 51.89% for case 2 considering the aforementioned scenarios.

You have full access to this open access chapter, Download conference paper PDF

Modern System Design Using Memristors

Synthesis and Technology Mapping for In-Memory Computing

RETRACTED ARTICLE: Memristor based high speed and low power consumption memory design using deep search method

Article 04 March 2020

Keywords

1 Introduction

Von Neumann architecture suffers from memory wall problem due to bandwidth mismatch between slower memory and faster CPU [1]. To overcome memory wall problem, non von-Neumann architecture is being actively considered where storage and computing can be performed in the same location. This computing inside memory is known as in-memory computing. Emerging non-volatile resistive memory technology such as memristor can enable such non von-Neumann computing paradigm. A Memory Processing Unit (MPU) has been proposed [2], where memristive memory is used as storage in conjunction with logical operations.

Due to high speed, low power consumption, scalability, data retention, endurance, and compatibility with conventional CMOS, many memristor based logic families and circuits have been proposed [3]. Based on logic state variable, memristor based logic families can be classified into stateful (logic value represented with memristor resistance) and non-stateful logic families [4]. In this work, we employ a stateful logic family, known as Memristor-Aided loGIC (MAGIC) [5]. In this logic style, for a given logic gate, input values and output value(s) are stored as memristor states. Memristors can be fabricated on a crossbar array, which offers high storage density and low power consumption [6]. With MAGIC, only NOR and NOT gates can be directly mapped to crossbar array.

In recent years, researchers have proposed a few in-memory logic synthesis approaches based on MAGIC logic style, where a given circuit netlist consisting of only NOR and NOT gates is mapped to a memristor crossbar. In [5], a detailed procedure to map NOR/NOT logic gates to crossbar has been discussed and also transpose crossbar concept has been introduced to allow gates to be mapped along the rows as well as columns in a crossbar architecture. In [7], a synthesis tool has been proposed that maps arbitrary logical functions within the memristive memory in an optimal manner. In [8], a scalable design flow for in-memory computing has been proposed that allows a given circuit netlist to be implemented in transpose crossbar. In both of these approaches, a given gate netlist is first converted into a netlist of NOR/NOT gates using an existing logic synthesis tool [9] and mapped to the crossbar architecture. While mapping, we come across fanout where a single output (driving) memristor of a logic gate has to be used as input (driven) memristors of multiple gates connected to it. For a fanout of two or more leaf memristors, current methods (previous approaches) perform the copy operation for a number of times equal to the number of driven memristors that are not on the same row or column as of the driving memristor. Such copy style requires two NOT operations which in turn requires two extra cycles. With the increment of fanouts in a given netlist, the number of extra cycles increases hence energy as well. To the best of our knowledge, no other previous works attempted to reduce the additional cycle count of a copy operation inherent to a fanout.

In this work, we propose a novel approach that will reduce the performance and energy overheads originated from fanout in a given circuit netlist. Instead of copying the value, the proposed controller can read the value and apply in parallel to the driven memristors. We consider two different cases based on the initialization of the memristors at the copy locations. The locations on the crossbar where we want to keep the first input memristors of the driven gates can be defined as the copy locations. In our prior work [10] which discusses case 1, like the previous works [7, 8], we consider that the memristors at the copy locations are initilaized to logic-1 which provides an added advantage to decide whether to write the read value or not to the driven memristors. As proposed in MAGIC logic style, the output memristors are initialized to logic-1 prior to logic execution and it allows the controller to skip write cycle if the read value is one. We have compared our work with a recently published work [8] for three different mapping scenarios. We obtain average improvements of 51.08%, 38.66%, and 63.18% and 91.30%, 88.53%, and 74.04% in performance and energy dissipation respectively. For case 2, we consider that the memristors at the copy locations are not initialized to logic-1 which dictates the use of initialization cycles with the previous approach. We obtain average improvements of 50.94%, 42.08%, and 60.65% and 86.03%, 78.97%, and 51.89% in performance and energy dissipation respectively.

The rest of the chapter is organized as follows: Sect. 2 presents background and related work. Section 3 describes the proposed approach to reduce the number of cycles and energy dissipations related to fanout. Section 4 reports experimental results. Section 5 draws conclusions.

2 Background and Related Work

In this section, we discuss the basic working principle of a memristor, relevant logic design styles, crossbar architecture, and in-memory computing. We also review some of the recent works on memristor based in-memory logic synthesis for a given gate-level netlist with particular concentration on fanout.

2.1 Memristor

Memristor is a two terminal device that can remember its previous state and change its resistance based on a given potential across the device. Chua [11] first proposed memristor that links flux ($\phi $) and charge (q) to its memristance (M) according to,

$$\begin{aligned} d{\phi } = M dq \end{aligned}$$

(1)

Structurally, memristor can be thought of as a thin semiconductor film with thickness (D) sandwiched between two contacts. We can change the overall resistance by changing its doped region width, w [12] (Fig. 1).

2.2 Memristor Aided LoGIC (MAGIC)

Our proposed methodology is based on MAGIC logic style [5]. As shown in Fig. 2, memristors’ resistances of the $IN_{1}$ and $IN_{2}$ are considered as input values and we can determine the output logic value by measuring OUT memristor resistance. An execution voltage $V_{G}$ is applied to both input memristors and the output value is stored in the output memristor.

2.3 In-Memory Computation Using Memristor Crossbar

A Memory Processing Unit (MPU) based on in-memory computing architecture is as shown in Fig. 3 [2]. It consists of a controller, crossbar memory, and analog multiplexers. The analog multiplexers’ outputs are connected to the bitlines and wordlines of the memristive memory and the voltage select lines of the multiplexers are connected to the controller. To carry out a regular read or write operation, controller sends suitable signals to the addressed memristor cells through multiplexers. We have also shown two signals, $V_{SET}$ and $V_{RESET}$ that can enable writing logic-1 and logic-0 to the memristor cells respectively.

The suitability of the crossbar architecture can be better explained by NOR operation as shown in Fig. 2. We, first, initialize the output memristor (OUT) to logic-1 and connect it to the ground (0V) and then we apply execution voltage $V_{G}$ to the input memristors ($IN_{1}$ and $IN_{2}$). This voltage may corrupt data on other memristors on the same row/column, to avoid this, an isolation voltage $V_{iso}$ needs to be applied to the columns and rows that we want to unselect. Parallel execution of NOR/NOT gates on the crossbar requires the alignment of the inputs and outputs of the respective gates. Since we are considering transpose memory here, gates can be aligned either by rows or columns.

2.4 Fanout

Fanout occurs when an output memristor at any circuit depth (excluding primary output) has to drive multiple memristors. It can degrade performance as circuit depth increases and incurs additional energy overheads by introduction of extra cycles. For a given gate-level netlist, a single memristor cell can only be used either as an input or output of a gate. Therefore, the value that is stored in the output memristor if needed can work as input to multiple following memristors. A naïve approach can perform this fanout operation by multiple copies of the logic value as equal to number of the driven memristors (on different rows or columns than driving memristor). This copy operation introduces additional cycles, as proposed in MAGIC logic style where each copy operation requires cascade of two NOT operations. Therefore, it will require two extra memristors as well as two extra cycles for a single copy operation. As the number of fanout increases for a specific gate output, the number of copy operations, hence additional cycles also increases linearly. Moreover, the energy dissipation also increases due to the extra NOT operations. Works [7, 8] based in-memory logic synthesis did not address these performance related issues.

3 Proposed Approach

In this work, we propose a novel approach that will reduce the performance and energy overheads due to fanout in memristor based in-memory computing. The controller will apply relevant signals in proper sequence to the rows and columns in a crossbar architecture. Here, we consider two different cases based on the initialization of the output memristors at the copy locations.

Case 1: Here, we maintain the same assumption as of works [8] and [12]. We consider that the output memristors at the copy locations are initialized to logic-1 which allows us to use a multiplexer through which controller can skip write cycle if the read value is one. With our approach, we have assumed on average 50% of the time, the sensed logic value is 1 which allows us to skip the write cycle. The equations for total cycle counts and energy consumption are formed considering this factor.
Case 2: Here, we consider that the output memristors at the copy locations are not initialized to logic-1. With our proposed approach, we exclude the multiplexer circuit hence removing the ability to skip write cycle (cycle 2) on average 50% of the time. We calculate the total cycle counts and energy requirement and compare with the previous work [8]. Previous works rely on copy operation which requires cascade of two NOT executions which in turn requires initialization of the output memristors of NOT gates to logic-1. Whereas with our approach, we do not need to initialize the output memristors at the copy locations to logic-1. This comparison demonstrates the trade-off between copy operation and direct read-write operation.

3.1 Overall Approach: Case 1

It will carry out the proposed method according to the following steps:

Cycle 1: Controller reads the logic value of the output memristor which has fanout of two or more.
Cycle 2: Controller writes the logic value as inputs to the driven memristors only if the sensed value is zero, otherwise, it can skip the cycle 2.

Previous works [7, 8] utilized a controller that produces a similar state diagram as shown in Fig. 4(a). For regular memory read and write operations, it goes to the read and write states respectively and for executing a logic function, it goes to the execution state. Execution state consists of micro operations where the controller executes the logic function in multiple cycles.

With our proposed approach, the number of state transitions will increase as shown in the Fig. 4(b) but the number of states will remain the same. While executing the micro operations in the execution state, whenever there is a fanout event, the controller will go to the read state and depending on the read value, it will decide whether to go to the write state or not. We can achieve this by modifying the controller circuit. The sensed bit can act as selector for a 2:1 multiplexer and depending on the select bit, controller will determine whether to write to the memory or not. If the read value is 1, the multiplexer will pass the input signal $r_{e1}$ to the controller to go the execution state skipping the write state, otherwise, it will go to the write state and then to the execution state. This allows us to avoid writing logic-1 if the sensed logic value is 1.

3.2 Mapping Scenario Analysis: Case 1

We have made the following assumptions for three different mapping scenarios of a logic function on a memristor crossbar.

We only assume that a single copy of primary input is stored in a single memristor cell and following a fanout scenario from this cell, one can do the copy operation on-demand. Previous works assume that multiple copies of primary inputs are already available depending on the number of gates they are driving. Hence, their assumptions underestimate the cycles required as they are able to eliminate the copy cycles produced from the fanouts of primary inputs.

Like the previous works [8, 13], we have considered the output memristors are initialized to logic value one at the beginning of the execution process which is a requirement for MAGIC logic style.

The work [8] has also initialized the output memristors to logic-1 required for copy operations (which eventually become the first input memristors of the driven gates). With our proposed approach, we are only considering fanout event related metrics not the whole synthesis procedure. Therefore, in our result analyses, we are not accounting the initialization cycles and energy consumption generated from initializing the output memristors. For fair comparison, we have maintained the same assumption while comparing with work [8]. In our work, the added consideration is, we can avoid writing logic value one. After reading the logic value of the driving memristor, if the controller finds the value as one, it can skip the write operation as the memristors are already initialized to one.

To demonstrate the efficacy of the approach, we have considered three different mapping scenarios (average, best, and worst) depending on the location of the second input memristors of the driven gates. To explain three scenarios, we consider the gate-level netlist shown in Fig. 5(a). We observe that there is a fanout of 4 from the node h. We also consider the logic value of h to be mapped on the location (1, 1) in the memristor crossbar architecture.

Scenario 1. Here, we consider that all the second input memristors are aligned along the same row or same column. The mapped variable h in Fig. 5(b) needs to be copied 3 times to use as first input memristors of the driven gates and the variables a, i, j, and g are mapped as second input memristors on the second column of the crossbar. With the previous approach, it requires 4 cycles to copy logic value of h and now with all the input memristors of the driven gates in aligned position, it is possible to execute the gates in one cycle. The following equation can be used to estimate the total cycles required for copy operations with previous approach:

$$\begin{aligned} \sum _{n=2}^{N} [n*freq(n)] \end{aligned}$$

(2)

where, n denotes different fanouts (i.e., number of gates it is driving), N is maximum fanout degree, and freq(n) is the number of times, a fanout of n is found in a netlist.

With the proposed approach, we need 1 cycle to read the logic value and one cycle to write multiple copies of h along the same row or column (given that the read value is logic-0). We consider that on average 50% of the time the read value is 0 which allows us to consider only half of the write cycles to the total cycle count. Therefore, we can estimate the total cycles required with the equation given below:

$$\begin{aligned} 1.5\sum _{n=2}^{N} [freq(n)] \end{aligned}$$

(3)

It should be noted that with our proposed approach, we require at most two cycles (read and write) for each fanout event and this is true for all scenarios considered here. Therefore, for all three scenarios, we can reuse Eq. 13 to estimate the total cycle counts.

We are providing a general formula to estimate the energy dissipation with both the previous and proposed approach due to the fanout only. According to the work [8], 52.49 fJ is the energy required for one NOT execution. Therefore, with previous approach, we can estimate the total energy dissipation by multiplying 52.49 fJ with the total cycle count (same as total NOT executions required for copy operations). The following equation gives us the total energy dissipation with this approach:

$$\begin{aligned} 52.49\sum _{n=2}^{N} [n*freq(n)] \end{aligned}$$

(4)

Whereas, for the proposed approach, the total energy required for a specific benchmark circuit can be estimated as:

$$\begin{aligned} \sum _{n=2}^{N} [\{(RE_m + (n-1) * WE )\}*freq(n)] \end{aligned}$$

(5)

Where, $RE_m$ denotes the total energy needed to read the variable and to decide (using multiplexer) whether to go to the write state or not. For fanout of n, we need to write $(n-1)$ memristors in a single cycle. Here, WE denotes the write energy required for writing a single memristor.

Scenario 2. We consider here some of the second input memristors are aligned on the same row as of the first input memristor (h) and some are scattered throughout the crossbar. Here, we divide different fanouts (n) according to the number of second input memristors residing on the same row. For example, for fanout of 3, 4, 5, and 6, we have considered 2 second input memristors are on the same row as first input memristor. From Fig. 6(a), it can be seen that the logic variables a and i that are mapped to the second input memristors of the gates 4 and 5 are aligned on the same row and variables j and g are in random locations. With the previous approach, h needs to be copied two times (3 NOT executions) to use it as an inputs of gates 6 and 7. We can evaluate the following equation for the previous approach:

$$\begin{aligned} 2 + \sum _{i=1}^{M} \sum _{n=4i-1}^{4i+2} [\{n -(2i-1)\} freq(n)] \end{aligned}$$

(6)

where, 2 cycles are required for fanout of 2. Here, i denotes the number of sub-divisions made on the different fanout and M denotes maximum degree of sub-division. For example, fanout of 4 falls under first sub-division (i = 1) which requires 3 cycles (given by the term $\{n-(2i-1)\}$) to make two copies of the driving memristor. In this way, we find cycle count for each division and add together to get the total cycle counts. As discussed earlier, with the proposed approach, we will get the cycle counts from Eq. 13.

With the previous approach, as in Scenario 1, by simply multiplying the total cycle count with energy requirement of one NOT execution, we can estimate the total energy dissipation:

$$\begin{aligned} 52.49 * [2 + \sum _{i=1}^{M} \sum _{n=4i-1}^{4i+2} [\{n -(2i-1)\} freq(n)]] \end{aligned}$$

(7)

and for the proposed approach:

$$\begin{aligned} (RE_m + WE) + \sum _{i=1}^{M} \sum _{n=4i-1}^{4i+2} [\{RE_m + (n -2i) * WE\} freq(n)] \end{aligned}$$

(8)

Where, first term ($RE_m$ + WE) is the energy required for fanout of 2. The term $(n-2i)$ gives us the number of memristors that needs to be written in a single write cycle.

Scenario 3. For the third scenario, we consider that the second input memristors are scattered throughout the crossbar i.e., aligned neither horizontally nor vertically. From Fig. 6(c), we can see that variables a, i, j, and g are mapped on the memristors that are scattered on the crossbar. With the previous approach, it requires $(n+1)$ copy cycles for a specific fanout of n and the total cycles required can be estimated as:

$$\begin{aligned} \sum _{n=2}^{N} [(n + 1) freq(n)] \end{aligned}$$

(9)

For the proposed approach, we use Eq. 13 which gives us the same cycle counts as of scenario 1 and 2.

By maintaining the same procedure as of Scenarios 1 and 2, the total energy dissipation for the previous approach can be obtained. For the proposed approach, we need to write n memristors in a single write cycle for a fanout of n and the term $(n*WE)$ gives us the write energy required for this operation. For the previous and proposed approach, the energy dissipation can be estimated by the following two equations respectively:

$$\begin{aligned} 52.49 * \sum _{n=2}^{N} [(n + 1) freq(n)] \end{aligned}$$

(10)

$$\begin{aligned} \sum _{n=2}^{N} [\{RE_m + (n * WE)\} freq(n)] \end{aligned}$$

(11)

3.3 Overall Approach: Case 2

It will carry out the proposed method according to the following steps:

Cycle 1: Controller reads the logic value of the output memristor which has fanout of two or more.
Cycle 2: Controller writes the logic value as inputs to the driven memristors without considering whether the sensed logic value is one or not.

Previous works [7, 8] utilized a controller that produces a similar state diagram as shown in Fig. 7(a) and maintain the same operation as explained in Sect. 3.1.

With our proposed approach, While executing the micro operations in the execution state, whenever there is a fanout event, the controller will go to the read state and then go to the write state to write the value to the driven memristors without considering whether the read logic value is one or not. Here, we do not use the 2:1 multiplexer, so, controller does not have the capacity to determine whether to skip cycle 2 or not. Controller will always go to the read state and then to the write state irrespective of the read logic value. We do not initialize the output memristors at the copy locations to logic-1 which saves us the initialization cycles and energy required for this operation.

3.4 Mapping Scenario Analysis: Case 2

We maintain the same assumption as discussed in the Sect. 3.2 except the third assumption that assume the output memristors at the copy locations are initialized to logic-1. With the exclusion of this assumption, controller now does not have the ability to skip the write cycle as considered in case 1. Here, we discuss three different mapping scenarios of a logic function on a memristor crossbar.

Scenario 1. Here, we consider that all the second input memristors are aligned along the same row or same column. The mapped variable h in Fig. 5(b) needs to be copied 3 times to use as first input memristors of the driven gates and the variables a, i, j, and g are mapped as second input memristors on the second column of the crossbar. In case 2, we consider that output memristors at the copy locations are not initialized to logic-1. Hence, with the previous approach, it requires at least 1 cycle to initialize the memristors at the copy locations and 4 cycles to copy logic value of h. The following equation can be used to estimate the total cycles required for copy operations with previous approach:

$$\begin{aligned} \sum _{n=2}^{N} [(n + 1)freq(n)] \end{aligned}$$

(12)

where, n denotes different fanouts (i.e., number of gates it is driving), N is maximum fanout degree, and freq(n) is the number of times, a fanout of n is found in a netlist.

With the proposed approach, we need one cycle to read the logic value and one cycle to write multiple copies of h along the same row or column. Therefore, we can estimate the total cycles required with the equation given below:

$$\begin{aligned} 2\sum _{n=2}^{N} [freq(n)] \end{aligned}$$

(13)

It should be noted that with our proposed approach, we require two cycles (read and write) for each fanout event and this is true for all scenarios considered here. Therefore, for all three scenarios, we can reuse Eq. 13 to estimate the total cycle counts.

We are providing a general formula to estimate the energy dissipation with both the previous and proposed approach due to the fanout only. For case 2, we have to consider both the energy dissipation due to NOT execution and initialization of the output memristors at the copy locations. According to the work [8], 52.49 fJ is the energy required for one NOT execution, we can denote it by NEx and the energy required to set a memristor to logic 1 is 66.09 fJ, we can denote it by $init\_1$. For fanout of n, we need to initialize n number of memristors. The following equation gives us the total energy dissipation with this approach:

$$\begin{aligned} (NEx + init\_1)\sum _{n=2}^{N} [n*freq(n)] \end{aligned}$$

(14)

Whereas, for the proposed approach, the total energy required for a specific benchmark circuit can be estimated as:

$$\begin{aligned} \sum _{n=2}^{N} [\{(RE + (n-1) * WE )\}*freq(n)] \end{aligned}$$

(15)

Where, RE denotes the energy needed to read the variable. For fanout of n, we need to write $(n-1)$ memristors in a single cycle. Here, WE denotes the write energy required for writing a single memristor.

Scenario 2. We consider here the same mapping scenario as discussed in Sect. 3.2, i.e. some of the second input memristors are aligned on the same row as of the first input memristor (h) and some are scattered throughout the crossbar. For details of this mapping scenario, we can refer to Sect. 3.2 (scenario 2). Referring to Fig. 6(a), with the previous approach, first, an initialization cycle is needed to initialize the three output memristors at the copy locations and then h needs to be copied two times (3 NOT executions) to use it as an inputs of gates 6 and 7. We can evaluate the following equation for the previous approach:

$$\begin{aligned} 3 + \sum _{i=1}^{M} \sum _{n=4i-1}^{4i+2} [\{n -2(i-1)\} freq(n)] \end{aligned}$$

(16)

where, 3 cycles are required for fanout of 2. Here, i denotes the number of sub-divisions made on the different fanout and M denotes maximum degree of sub-division. For a specific fanout of n, the term $\{n-2(i-1)\}$ gives us the total cycle count which includes the copy cycles and the initialization cycle. In this way, we find cycle count for each division and add together to get the total cycle counts. As discussed earlier, with the proposed approach, we will get the cycle counts from Eq. 13.

With the previous approach, as in Scenario 1, we can estimate the total energy dissipation by the following equation:

$$\begin{aligned} (NEx + init\_1) [2 + \sum _{i=1}^{M} \sum _{n=4i-1}^{4i+2} [\{n -(2i-1)\} freq(n)]] \end{aligned}$$

(17)

and for the proposed approach:

$$\begin{aligned} (RE + WE) + \sum _{i=1}^{M} \sum _{n=4i-1}^{4i+2} [\{RE + (n -2i) * WE\} freq(n)] \end{aligned}$$

(18)

where, first term (RE + WE) is the energy required for fanout of 2. The term $(n-2i)$ gives us the number of memristors that needs to be written in a single write cycle.

Scenario 3. For the third scenario, we consider that the second input memristors are scattered throughout the crossbar i.e., aligned neither horizontally nor vertically. From Fig. 6(c), we can see that variables a, i, j, and g are mapped on the memristors that are scattered on the crossbar. With the previous approach, it requires 1 initialization cycle and $(n+1)$ copy cycles for a specific fanout of n and the total cycles required can be estimated as:

$$\begin{aligned} \sum _{n=2}^{N} [(n + 2) freq(n)] \end{aligned}$$

(19)

For the proposed approach, we use Eq. 13 which gives us the same cycle counts as of scenario 1 and 2.

$$\begin{aligned} (NEx + init\_1) \sum _{n=2}^{N} [(n + 1) freq(n)] \end{aligned}$$

(20)

$$\begin{aligned} \sum _{n=2}^{N} [\{RE + (n * WE)\} freq(n)] \end{aligned}$$

(21)

4 Experimental Results

To validate the proposed approach, we performed a set of experiments on ISCAS’85 benchmark circuits and compared our results with a recently published work [8]. While doing so, we maintained the assumptions as stated in Sect. 3. In a given NOR/NOT based synthesized netlist, total number of gate executions consist of regular execution of gates pertaining only to the netlist and the extra NOT executions required for copy operations produced by fanout. We estimated the total number of extra NOT executions over the total number of gate executions for a specific benchmark circuit and then averaged over all the benchmark circuits. An average of 47.41%, 41.82%, and 54.40% additional cycle count due to NOT operations has been measured for scenario 1, scenario 2, and scenario 3 respectively. With our proposed approach, we are able to reduce performance and energy overheads produced by these aforementioned additional cycles.

Case 1: Previous works such as [7, 8] have only considered energy dissipation due to gate executions, but as energy consumptions from read cycle, write cycle, and 2:1 multiplexer are inherent in our proposed approach, we have included these in our analysis. Performance and energy overheads for the multiplexer are obtained for a 45 nm CMOS process technology from PTM using HSPICE and our circuit level evaluation shows that the delay introduced by a multiplexer is in the picosecond range. The work [8] has used the VTEAM model [14] to find the maximum latency for MAGIC NOR operations (when either of the inputs is at logic-1) and also to find the read time. We consider these times as the cycle time and read time of the system respectively. According to [15], the latency introduced by MAGIC operations are much higher than that of read and write operations. Hence, the effective delay produced by the sense circuit and the multiplexer still stays below 1 cycle time. Therefore, in our analysis, we will not consider any performance overhead due to this added multiplexer.
Case 2: In case 2, we do not initialize the output memristors at the copy locations to logic-1. This removes the 2:1 multiplexer from the system, as the controller is not aware of the logic value stored in the output memristors at the copy locations. In our result analysis, for the proposed approach, we exclude the energy consumed by the multiplexer and in case of previous approach, we add the energy required for initializing the memristor to logic-1. From [8], the energy required to write logic-1 to a memristor is 66.09 fJ. Therefore, with the previous approach, the total cycle count comes from the addition of initialization cycle and the copy cycles. With the proposed approach, we always need two cycles, one to read the logic value of the driving memristor and other to write the logic value to the driven memristors.

4.1 Experimental Setup

The experimental methodology we follow to validate our approach is as follows:

1.
A given gate-level netlist is first synthesized with ABC tool [9] and target library of NOR and NOT gates.
2.
For three different scenarios, number of fanout events has been found and cycle counts and energy dissipations estimation have been performed with the proposed method and compared with [8].

Table 1. Comparison of number of cycles (Average Scenario) with the regular and proposed approach.

Full size table

4.2 Results and Analysis

Tables 1, 2, and 3 report the improvements over recently published work with respect to the number of cycles due to fanout in average, best, and worst scenarios respectively. For case 1, we observe average improvements of 51.08%, 38.66%, and 63.18% in cycle reduction considering three mapping scenarios.

In case 2, we consider that the memristors at the copy locations are not initialized to logic-1 which dictates the need of adding initialization cycle to the total cycle counts. With the proposed approach, we need two cycles to read the driving memristor and then write the logic variable to the driven memristors in parallel. We observe average improvements of 50.94%, 42.08%, and 60.65% in cycle reduction.

Table 2. Comparison of number of cycles (Best Scenario) with the regular and proposed approach.

Full size table

Table 3. Comparison of number of cycles (Worst Scenario) with the regular and proposed approach.

Full size table

Table 4. Comparison of energy dissipation (Average Scenario) with the regular and proposed approach.

Full size table

Table 5. Comparison of energy dissipation (Best Scenario) with the regular and proposed approach.

Full size table

Table 6. Comparison of energy dissipation (Worst Scenario) with the regular and proposed approach.

Full size table

Tables 4, 5, and 6 report the improvements in energy consumption. For estimating the energy requirement, we have taken the same parameter values as discussed in [8]. The energy required for NOT execution is 52.49 fJ. The energy required to read logic-0 is 0.03 fJ, while that for logic-1 is 3.34 fJ. For our estimation purpose, we have averaged these two values and considered the read energy as 1.685 fJ. The energy required to write logic-1 to a memristor which is termed as SET operation is 66.09 fJ and through RESET operation, we can write logic-0 to a memristor and the energy required is 17.60 fJ. For case 1, as discussed in Sect. 3, we do not need to write logic-1 with our proposed approach, thus, no energy is consumed. Therefore, the average energy considered for SET and RESET operation is 8.8 fJ. For a specific benchmark circuit, we find all different fanout occurrences and add the energies required to estimate the total energy dissipation. We observe average improvements of 91.30%, 88.53%, and 74.04% by experimental evaluation on ISCAS’85 benchmark circuits. This improvement can be primarily attributed to the fact that we were able to eliminate the need for writing logic-1.

For case 2, with the previous approach, the memristors at the copy locations need to be initialized to logic-1. Therefore, this initialization energy needs to be accounted for in the total energy calculation. With the proposed approach, we exclude energy consumption produced by the multiplexer that is required in case 1. The only energies required are the energy to read the driving memristor and energy to write the driven memristors. We observe average improvements of 86.03%, 78.97%, and 51.89% in energy saving.

5 Conclusions

In this work, we outline an effective approach that would significantly reduce the performance and energy overheads due to fanout while mapping a logic function in memristor based crossbar architecture. Comparison has been made with a recently published work and shows significant average improvements in average, best, and worst mapping scenarios in performance and energy dissipation. Future research direction would be to implement the entire logic synthesis process utilizing the fanout optimization discussed here.

References

Wulf, W.A., McKee, S.A.: Hitting the memory wall: implications of the obvious. ACM SIGARCH Comput. Arch. News 23(1), 20–24 (1995)
Article Google Scholar
Hur, R.B., Kvatinsky, S.: Memristive memory processing unit (MPU) controller for in-memory processing. In: 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE), pp. 1–5, November 2016
Google Scholar
Kvatinsky, S., Friedman, E.G., Kolodny, A., Weiser, U.C.: The desired memristor for circuit designers. IEEE Circuits Syst. Mag. 12(2), 17–22 (2013). Secondquarter
Article Google Scholar
Lehtonen, E., Poikonen, J.H., Laiho, M.: Memristive stateful logic. In: Adamatzky, A., Chua, L. (eds.) Memristor Networks, pp. 603–623. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-02630-5_27
Chapter Google Scholar
Talati, N., Gupta, S., Mane, P., Kvatinsky, S.: Logic design within memristive memories using memristor-aided loGIC (MAGIC). IEEE Trans. Nanotechnol. 15(4), 635–650 (2016)
Article Google Scholar
Nair, R.: Evolution of Memory Architecture. Proc. IEEE 103(8), 1331–1345 (2015)
Article Google Scholar
Hur, R.B., Wald, N., Talati, N., Kvatinsky, S.: Simple magic: synthesis and in-memory mapping of logic execution for memristor-aided logic. In: 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 225–232, November 2017
Google Scholar
Gharpinde, R., Thangkhiew, P.L., Datta, K., Sengupta, I.: A scalable in-memory logic synthesis approach using memristor crossbar. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(2), 355–366 (2018)
Article Google Scholar
Berkeley Logic Synthesis and Verification Group, ABC: A System for Sequential Synthesis and Verification, Release 90215. http://www.eecs.berkeley.edu/~alanmi/abc/
Zaman, M.A., Katkoori, S.: Minimizing performance and energy overheads due to fanout in memristor based logic implementations. In: 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), pp. 7–12, October 2018
Google Scholar
Chua, L.: Memristor-The missing circuit element. IEEE Trans. Circuit Theory 18(5), 507–519 (1971)
Article Google Scholar
Strukov, D.B., Snider, G.S., Stewart, D.R., Williams, R.S.: The missing memristor found. Nature 453(7191), 80–83 (2008)
Article Google Scholar
Ali, A.H., Hur, R.B., Wald, N., Kvatinsky, S.: Efficient algorithms for in-memory fixed point multiplication using MAGIC. In: 2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, May 2018
Google Scholar
Kvatinsky, S., Ramadan, M., Friedman, E.G., Kolodny, A.: VTEAM: a general model for voltage-controlled memristors. IEEE Trans. Circuits Syst. II: Express Briefs 62(8), 786–790 (2015)
Article Google Scholar
Talati, N., et al.: Practical challenges in delivering the promises of real processing-in-memory machines. In: 2018 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1628–1633, March 2018
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of South Florida, Tampa, FL, USA
Md Adnan Zaman, Rajeev Joshi & Srinivas Katkoori

Authors

Md Adnan Zaman
View author publications
You can also search for this author in PubMed Google Scholar
Rajeev Joshi
View author publications
You can also search for this author in PubMed Google Scholar
Srinivas Katkoori
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Srinivas Katkoori .

Editor information

Editors and Affiliations

University of Verona, Verona, Italy
Nicola Bombieri
University of Verona, Verona, Italy
Graziano Pravadelli
University of Tokyo, Tokyo, Japan
Masahiro Fujita
University of Michigan, Ann Arbor, MI, USA
Todd Austin
Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Ricardo Reis

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zaman, M.A., Joshi, R., Katkoori, S. (2019). Optimizing Performance and Energy Overheads Due to Fanout in In-Memory Computing Systems. In: Bombieri, N., Pravadelli, G., Fujita, M., Austin, T., Reis, R. (eds) VLSI-SoC: Design and Engineering of Electronics Systems Based on New Computing Paradigms. VLSI-SoC 2018. IFIP Advances in Information and Communication Technology, vol 561. Springer, Cham. https://doi.org/10.1007/978-3-030-23425-6_8

Download citation

DOI: https://doi.org/10.1007/978-3-030-23425-6_8
Published: 26 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23424-9
Online ISBN: 978-3-030-23425-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Federation for Information Processing (opens in a new tab)

Optimizing Performance and Energy Overheads Due to Fanout in In-Memory Computing Systems

Abstract