Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing

Authors:

Satya Jaswanth Badri,

Mukesh Saini,

Neeraj Goel

ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 4

Article No.: 58, Pages 1 - 25

https://doi.org/10.1145/3629524

Published: 14 December 2023


Abstract

Battery-less technology evolved to replace battery usage in space, deep mines, and other environments to reduce cost and pollution. Non-volatile memory (NVM) based processors were explored for saving the system state during a power failure. Such devices have a small SRAM and large non-volatile memory. To make the system energy efficient, we need to use SRAM efficiently. So we must select some portions of the application and map them to either SRAM or FRAM. This paper proposes an ILP-based memory mapping technique for intermittently powered IoT devices. Our proposed technique gives an optimal mapping choice that reduces the system’s Energy-Delay Product (EDP). We validated our system using TI-based MSP430FR6989 and MSP430F5529 development boards. Our proposed memory configuration consumes 38.10% less EDP than the baseline configuration and 9.30% less EDP than the existing work under stable power. Our proposed configuration achieves 20.15% less EDP than the baseline configuration and 26.87% less EDP than the existing work under unstable power. This work supports intermittent computing and works efficiently during frequent power failures.

1 Introduction

The Internet of Things (IoT) refers to a network of sensors and nodes that can easily communicate and collaborate with each other. Batteries are the most common source of power for IoT devices. Due to the limited capacity of the battery and the short useful life [24], replacement is costly. IoT may consist of billions of sensors and systems by the end of 2050 [18]. The replacement and disposal of billions of battery-operated devices are expensive and hazardous to the environment. As a result, we need battery-free IoT devices.

Energy harvesters are a promising alternative to battery-powered devices. The energy harvester collects energy from the environment and stores it in capacitors. Because energy harvesting is unreliable, power failures are unavoidable and the application’s execution is irregular. This type of computing is known as intermittent computing [23, 37, 50].

For intermittently powered IoT devices, energy harvesting is the primary energy source. Energy-harvesting sources such as piezoelectric materials and radio-frequency devices extract a small amount of energy from their surroundings. We must use energy efficiently in both stable and unstable power supply scenarios.

To make the system energy efficient, we primarily have two choices. The first is to reduce energy consumption by proposing new techniques that use energy efficiently. The second is to increase the number of energy harvesters, which accumulates more energy but also increases maintenance costs. Maintaining many energy harvesters is not a feasible solution. Thus, our main concern is reducing energy consumption through new techniques that help to design an energy-efficient system.

Gonzalez and Horowitz [19] noted that energy alone is not an ideal metric for evaluating the efficiency of a system, since energy can be reduced simply by lowering the supply voltage or load capacitance. Instead, they suggested using the Energy-Delay Product (EDP) as the metric for energy-efficient design. The EDP considers performance and energy simultaneously, and a design that minimizes the EDP can be called energy-efficient. We define EDP in equation 1.

\begin{equation} EDP = E_{system} \times Num\_cycles \end{equation}

(1)

Here, $E_{system}$ is the energy consumption of the system and $Num\_cycles$ is the number of CPU cycles. Under stable power, $Num\_cycles$ is the total number of cycles required to execute the application. Under unstable power, $Num\_cycles$ is the total number of cycles required to back up, restore, and execute the application. In our unstable power scenario, a power failure occurs every 1 second over a fixed execution period, i.e., 300 seconds.
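As a minimal sketch of equation 1, the following Python snippet computes the EDP for a stable run and for an unstable run. The function names and every numeric value are hypothetical placeholders rather than measurements from this work, and the sketch assumes one backup and one restore per power failure.

```python
def edp(energy_joules: float, num_cycles: int) -> float:
    """Energy-Delay Product as in equation 1: E_system x Num_cycles."""
    return energy_joules * num_cycles


def unstable_cycles(exec_cycles: int, failures: int,
                    backup_cycles: int, restore_cycles: int) -> int:
    """Cycles under unstable power: execution plus one backup and one
    restore per power failure (an assumption of this sketch)."""
    return exec_cycles + failures * (backup_cycles + restore_cycles)


# Hypothetical numbers for illustration only.
stable_edp = edp(energy_joules=0.020, num_cycles=1_000_000)

# 300 failures corresponds to one failure per second over 300 seconds.
total_cycles = unstable_cycles(exec_cycles=1_000_000, failures=300,
                               backup_cycles=500, restore_cycles=400)
unstable_edp = edp(energy_joules=0.035, num_cycles=total_cycles)
```

Because both factors grow under unstable power (extra energy for backups and extra cycles for backup and restore), the EDP penalizes frequent power failures more strongly than energy alone would.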

During these frequent power failures, executing IoT applications becomes more difficult because all computed data may be lost, and the application’s execution must restart from the beginning. During power failures, we need an additional procedure to backup/checkpoint the volatile memory contents to non-volatile memory (NVM).

Flash memory was the prior NVM technology used by modern microcontrollers at the main memory level, such as MSP430F5529 [34]. Flash is ineffective for frequent backups and checkpointing because its erase/write operations require a lot of energy. Emerging NVMs outperform flash, including spin-transfer-torque RAM (STT-RAM) [9, 41], phase-change memory (PCM) [35], resistive RAM (ReRAM), and ferroelectric RAM (FRAM) [25]. Previous work has incorporated these emerging NVMs into low-power microcontrollers (MCUs) [25, 27, 34]. Recent NVM-based MCUs, such as the flash-based MSP430F5529 and the FRAM-based MSP430FR6989, encourage the use of hybrid main memory. The flash-based MCU, MSP430F5529, is made up of SRAM and flash, while the FRAM-based MCU, MSP430FR6989, is made up of SRAM and FRAM at the main memory level. The challenges associated with hybrid main memory-based architectures, such as MSP430FR6989, are as follows.

(1) FRAM consumes about 2x more energy and has about 2x higher latency than SRAM. A design that uses only FRAM therefore degrades system performance and consumes extra energy even during normal operations.

(2) SRAM loses its contents during a power failure, so the application must be executed again from the beginning, which consumes extra energy and time. A design that uses only SRAM is therefore not helpful for large applications, although it performs better during regular operations.

(3) We can design a hybrid main memory to get the benefits of both SRAM and FRAM. The following questions need to be answered and analyzed to use the hybrid main memory design.

(a) How do we choose the appropriate sections of a program and map them to either SRAM or FRAM regions? A significant challenge is mapping a program’s stack, code, and data sections to either SRAM or FRAM.

(b) How and where should volatile contents be backed up to the NVM region during frequent power failures?

The main question is which section of an application should be placed in which memory region; this is essentially a memory mapping problem. This article makes the following contributions to address the challenges mentioned above.

– To the best of our knowledge, this is the first work on an Integer Linear Programming (ILP) based memory mapping technique for intermittently powered IoT devices.

– We incorporated the energy-harvesting scenarios into the ILP model such that the number of power failures is considered as an input for our ILP model.

– We formulated the memory mapping problem to cover all the possible design choices. We also formulated our problem in such a way that it supports large-size applications.

– We proposed a framework that consumes low energy during both regular operation and frequent power failures. Our proposed framework supports intermittent computing.

– We evaluated the proposed techniques and frameworks on actual hardware boards.

Our proposed ILP model recommends placing each section in either SRAM or FRAM. We compared the proposed memory configuration and techniques with the baseline memory configurations under both stable and unstable power scenarios. Our proposed memory configuration consumes 38.10% less EDP than the FRAM-only configuration and 9.30% less EDP than the existing work under stable power. Our proposed configuration achieves 20.15% less EDP than the FRAM-only configuration and 26.87% less EDP than the existing work under unstable power.

Paper organization: Section 2 discusses the background and related works. Section 3 explains the motivation behind the proposed framework. Section 4 explains the system model and gives an overview of the problem definition. Section 5 explains the proposed ILP-based memory mapping technique and the framework that supports intermittent computing during frequent power failures. The experimental setup and results are described in Section 6. We conclude this work in Section 7.

2 Related Works

SRAM and DRAM are used to design registers, caches, and main memory in traditional processors. We replace a regular processor’s volatile memory model with an NVM for an intermittently aware design. STT-RAM, PCM, flash, and FRAM are all relatively new NVM technologies [9, 12, 13, 20, 26, 35, 41, 42, 45, 57]. These NVM technologies motivated researchers due to their appealing characteristics, such as non-volatility, low cost, and high density [2, 3, 7, 8, 55].

To develop an intermittent aware design, we should also change the execution model of a conventional processor by incorporating additional backup/restore procedures [23, 37, 50, 51]. Choi et al. [10] propose a speculative execution for intermittent aware processors by changing the pipelining stages. Thirumala et al. [52] propose an energy-efficient in-memory computing engine for intermittently powered systems. Thirumala et al. introduce ferroelectric transistor (FEFET) based memory architecture to support NVM-based storage that gives benefits during frequent power failures.

Researchers started using real-time NVM-based MCUs for intermittent computing [17, 25, 34, 43, 46, 47, 48]. Sliper et al. [47, 48] explore a memory management scheme that loads pages whenever required and saves only the modified pages to NVM. Researchers observed that using only NVMs at the cache or main memory level degrades the system’s performance and consumes more energy, which gives an idea of exploring hybrid memories. Recent NVM-based MCUs such as MSP430FR6989 [25] consist of both SRAM and FRAM. We need to utilize the SRAM and FRAM efficiently and correctly; otherwise, we may degrade the system performance and consume extra energy. To make the system more efficient, we must map the application contents to SRAM or FRAM. This is actually a memory mapping problem, similar to scratch-pad memories.

Recent works focused on incorporating NVMs as virtual memory during frequent power failures. Maioli and Mottola [40] propose ALFRED, a mapping technique that maps virtual memory to volatile memory or NVM. Maioli and Mottola perform these mappings at the machine-code level and achieve a 2x improvement compared to existing techniques and systems. However, their technique does not discuss the complete set of design choices or consider real-time power scenarios.

Including NVMs in systems needs to answer the following research questions: when to checkpoint and where to checkpoint the volatile data. Researchers proposed efficient checkpointing techniques [1, 44, 56] incorporating user-defined function calls that help determine how much energy is still available in the capacitor. Based on that analysis, the system invokes the checkpoint. These techniques even predict power failures and support intermittent computing.

Researchers explored a similar mapping problem in scratch-pad memories (SPMs) [21, 36, 49]. Chakraborty et al. [6] documented the existing and standard memory mapping techniques on SPMs. In earlier works, memory mapping was done mainly between SPMs and main memory. Memory mapping can be done statically and dynamically [30, 31]. In static memory mapping, either ILP or the compiler can help to determine the best placement [16, 21, 36, 49]. ILP-solver takes inputs obtained from profilers and memory sizes as constraints in ILP-based memory mapping works. The ILP-solver provides the best placement option based on the objective function. In dynamic memory allocation [14, 15, 53, 54], either the user-defined program or the compiler will decide on an optimal placement choice at run-time. A run-time strategy that aims to optimize performance under power constraints must be careful not to degrade the performance of load-balanced programs [4].

However, our problem differs from the memory mapping techniques in SPMs because intermittent computing brings new constraints. During intermittent computation, the challenges were the forward progress of an application, data consistency, environmental consistency, and concurrency between the tasks. Due to these challenges, the execution model and development environment differ from SPM-based memory mapping techniques. As a result, we require a memory mapping technique that supports intermittent computation.

Researchers have explored memory mapping techniques and analysis for the MSP430FR6989 MCU. Yıldırım et al. [59] proposed a task-based mapping mechanism considering all event-driven paradigms that support intermittent computing and battery-less sensing devices. In FRAM-based MCUs, Jayakumar et al. [27] implement a checkpointing policy. They save the system state to FRAM during a power failure. Jayakumar et al. [28, 29] propose an energy-efficient memory mapping technique for TI-based applications in FRAM-based MCUs. Kim et al. [33] present a detailed analysis of energy consumption for all memory sections in FRAM-based MCUs with different memory mappings.

Recent works on NVM-based processors (NVP) have been introduced [32, 38, 39, 58]. Xu et al. [58] propose an energy-efficient backup procedure for NVPs that includes re-computation analysis and excludes some recomputed data to reduce the size of the backup content. Khorguani et al. [32] propose a checkpointing technique to support multi-threaded programs and fault tolerance by periodically evicting persistent data structures to NVM-based main memory. A recent checkpointing technique [56] has also been proposed to reduce the size of the volatile contents that must be backed up and restored during frequent power failures for NVPs. Energy-harvesting-based NVPs have also been designed in the literature [11, 60]. Write-Light Cache (WL-Cache) [11] proposes a cache architecture that incorporates a new write policy for energy harvesting systems, and ReplayCache [60] is proposed to exploit the volatile caches.

Earlier works investigated this problem by analyzing the system’s efficiency possibilities. The authors [28, 29, 33] have not covered all the possibilities and design choices. In addition, there is significantly less contribution toward memory mappings in FRAM-based MCUs that support intermittent computation. Our work proposes an energy-efficient memory mapping technique for intermittently powered IoT devices that experience frequent power failures.

3 Motivation

This section discusses the advantages of using hybrid SRAM and FRAM for these MSP430-based MCUs over SRAM-only or FRAM-only designs and the importance of efficient memory allocation.

SRAM is 2KB, and FRAM is 128KB in a FRAM-based MCU, MSP430FR6989. The first naive approach is to use the entire 128KB of FRAM in both stable and unstable power scenarios, resulting in longer execution cycles and higher energy consumption. Similarly, we have a second naive approach to use the entire 2KB SRAM for small applications (whichever fits within the SRAM size), which has advantages during regular operation. Unfortunately, it loses all 2KB SRAM data during a power failure and takes more time to back up 2KB contents to FRAM during a power failure. As shown in Figure 1, for the FRAM-only configuration, we map all three sections to FRAM and all three sections to SRAM for the SRAM-only configuration.

Fig. 1.

We compared the FRAM-only configuration and the SRAM-only configuration in both stable and unstable power scenarios. The FRAM-only configuration performs better during frequent power failures, whereas the SRAM-only configuration performs better during regular operations (without any power failures), as shown in Figure 2. On average, the FRAM-only configuration consumes 47.9% more energy than the SRAM-only configuration under stable power, as shown in Figure 2(a). On average, the SRAM-only configuration consumes 32.7% more energy than the FRAM-only configuration under unstable power, as shown in Figure 2(b). We also observed that the MCU throws an error asking to increase the SRAM space or to use FRAM space for the computations. Large applications that cannot run using only SRAM require FRAM as well. Thus, large applications consume more energy in the SRAM-only configuration during a stable power scenario.

Fig. 2.

These two designs motivate us to propose a hybrid memory design that effectively uses both SRAM and FRAM. We also found that the SRAM-only configuration is ineffective for larger applications. As a result, we had to use a hybrid memory and figure out how and where to place the sections. To the best of our knowledge, only one work explored the memory mapping issue for these MCUs [29]. We analyzed the mapping decisions using their empirical model. Jayakumar et al. [29] calculated the energy consumption values for each configuration. Based on the energy values, the authors suggested mapping the text, data, and stack sections to either SRAM or FRAM.

The empirical method used by the authors is as follows. The authors considered functions as the basic unit. They explored all configurations and calculated the energy values, as shown in Table 1. The authors have eight configurations because they have two memory regions (SRAM or FRAM) and need to map three sections (stack, data, text). Using the authors’ model, we calculated the energy values for the qsort_small application. For instance, the {SSS} configuration performs better during a stable power supply, and during a power failure, {SFS} consumes less energy than all other configurations. As a result, the authors allocate text and stack sections to SRAM and data sections to FRAM.

Table 1.

Configuration	Text	Data	Stack	$Energy_{stable} (mJ)$	$Energy_{unstable} (mJ)$
1. {SSS}	SRAM	SRAM	SRAM	16.70	79.56
2. {SSF}	SRAM	SRAM	FRAM	21.08	66.34
3. {SFS}	SRAM	FRAM	SRAM	28.75	33.79
4. {SFF}	SRAM	FRAM	FRAM	35.97	52.10
5. {FSS}	FRAM	SRAM	SRAM	39.48	68.24
6. {FSF}	FRAM	SRAM	FRAM	57.64	54.75
7. {FFS}	FRAM	FRAM	SRAM	64.14	45.83
8. {FFF}	FRAM	FRAM	FRAM	92.09	36.07

Table 1. Analysis of the Empirical Methods used by Jayakumar et al. [29] for qsort_small Under Stable and Unstable Power Scenarios
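The selection step of this empirical method can be sketched in a few lines of Python. The energy values below are copied from Table 1, while the dictionary layout and the helper name `best_config` are illustrative assumptions rather than part of the original model.

```python
# Energy values (mJ) copied from Table 1 for qsort_small.
# Key letters give the region of (text, data, stack): S = SRAM, F = FRAM.
# Each value is (energy under stable power, energy under unstable power).
TABLE_1 = {
    "SSS": (16.70, 79.56),
    "SSF": (21.08, 66.34),
    "SFS": (28.75, 33.79),
    "SFF": (35.97, 52.10),
    "FSS": (39.48, 68.24),
    "FSF": (57.64, 54.75),
    "FFS": (64.14, 45.83),
    "FFF": (92.09, 36.07),
}

STABLE, UNSTABLE = 0, 1


def best_config(scenario: int) -> str:
    """Return the configuration with the lowest energy for a scenario."""
    return min(TABLE_1, key=lambda cfg: TABLE_1[cfg][scenario])


best_stable = best_config(STABLE)
best_unstable = best_config(UNSTABLE)
print(best_stable, best_unstable)  # prints "SSS SFS"
```

With only eight rows, an exhaustive table lookup like this is cheap; the difficulty discussed next arises when the number of rows grows.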

We observed that this empirical method becomes ineffective as the number of configurations increases. The authors considered all global variables, arrays, and constants as data sections. Instead, why can’t we map each global variable or array to either SRAM or FRAM? This increases the number of configurations, and calculating/tracking energy values is challenging. Our design space grows enormously and makes our mapping problem challenging.
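To illustrate how quickly the design space grows, the sketch below counts the possible mappings when every unit can go to one of two memory regions. The application size used here (10 global variables and 5 functions with three sections each) is a hypothetical example, not a benchmark from this work.

```python
def num_mappings(num_units: int, num_regions: int = 2) -> int:
    """Number of ways to assign each unit to exactly one memory region."""
    return num_regions ** num_units


# Section-level mapping as in Table 1: text, data, and stack.
section_level = num_mappings(3)

# Hypothetical finer-grained mapping: 10 global variables plus
# 5 functions, each with text, data, and stack sections.
fine_grained = num_mappings(10 + 5 * 3)

print(section_level, fine_grained)  # prints "8 33554432"
```

Measuring energy for each of these mappings by hand is not practical, which is why an optimization-based formulation becomes attractive.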

This new set of challenges motivated us to propose an energy-efficient memory mapping technique. Our proposed memory mapping framework supports large-size applications and covers all possible configurations.

4 System Model and Problem Definition

This section discusses the system model for embedded MCUs and defines the mapping problem for these MCUs.

4.1 System Model

We consider a simple, customized RISC instruction set with a Von-Neumann architecture, where the instructions and data share the same address space, which supports at least 16-bit addressing. The base architecture does not have a cache, which avoids uncertainty. To keep things simple, we assume single-cycle execution of the processor. The base architecture has a small SRAM memory and a larger NVM.

The MSP430 is an example of such a processor. Non-volatile memory sizes range from 1 kilobyte (KB) to 256 KB, while volatile RAM sizes range from 256 bytes to 2KB. Both SRAM and NVM can be accessed by instructions using a compiler/linker script. We can modify the linker script to map memory according to the memory ranges specified by the user. MSP430 does not have any operating system.

4.2 Problem Definition

Definition 4.1 (Optimal Memory Mapping Problem).

Given a program that consists of various functions and global variables, the sizes of SRAM and FRAM, the number of reads and writes for each function/variable, the number and duration of power failures, and the energy required per read/write to the SRAM/FRAM, what is the optimal memory mapping for these functions/variables that reduces the system’s EDP?

The inputs are: The number of functions; the number of global variables; energy per write to SRAM and FRAM; energy per read to SRAM and FRAM; SRAM and FRAM sizes; Number of CPU cycles per function; the number of reads; the number of writes; number and duration of power failures.

The output is: Mapping information for all functions and global variables, under which the system’s EDP is minimized.

Definition 4.2 (Support for Intermittent Computing).

During power failures, we must safely back up the volatile contents to NVM. As stated previously, we must use SRAM efficiently for energy savings; but again, how can we save the contents of SRAM? There are two significant issues with intermittent computation. First, during a power failure, all SRAM’s mapping information and register contents are lost, causing the system to become inconsistent. Second, how do we back up/restore the mapping information and register contents to ensure system consistency?

Our first objective is to minimize the overall system’s energy and EDP. The proposed system must also be supported during frequent power failures. Our second objective is to maximize the execution progress of the application during frequent power failures. Application progress depends on two parameters. The first is the time taken to execute the application during frequent power failures, including the costs of the backup and restore operations. The second is $\eta$, a factor that describes how often and for how long intermittent behavior occurs during the application’s execution.

Essentially, application progress is the approximate number of cycles that the application is required to execute during frequent power failures, assuming that we can execute the same application at the same time during a regular operation. Application progress is a function of both execution time and the number of power failures ($\eta$), as shown in equation 2.

\begin{equation} \text{Progress} = F(NC_{Execute} \times \eta) \end{equation}

(2)

We define the power-failure factor $\eta$ in equation 3. It is the ratio of the time consumed during regular execution without any power failures to the time consumed during execution with power failures, where $NC_{Execute}$ is the number of cycles required to execute the application during regular operation. For a given intermittent scenario and environmental conditions, the average $\eta$ value is usually known. For example, piezoelectric-based energy harvesting systems will have a higher power failure rate than solar energy harvesting systems.

\begin{equation} \begin{split} \eta &= \frac{NC_{Execute}}{NC_{Intermittent}} \end{split} \end{equation}

(3)

We define the number of cycles required during power failures in equation 4. We need to perform the backup and restore operations during a power failure.

\begin{equation} NC_{Intermittent} = NC_{Backup} + NC_{Execute} + NC_{Restore} \end{equation}

(4)

Where $NC_{Backup}$ is the number of cycles required for the backup operation and $NC_{Restore}$ is the number of cycles required for the restore operation.
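A small worked example of equations 3 and 4 is sketched below in Python. The cycle counts are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def nc_intermittent(nc_backup: int, nc_execute: int, nc_restore: int) -> int:
    """Equation 4: cycles needed when power failures occur."""
    return nc_backup + nc_execute + nc_restore


def eta(nc_execute: int, nc_intermittent_cycles: int) -> float:
    """Equation 3: ratio of regular cycles to intermittent cycles."""
    return nc_execute / nc_intermittent_cycles


# Hypothetical cycle counts for illustration only.
nc_exec = 1_000_000
nc_total = nc_intermittent(nc_backup=150_000, nc_execute=nc_exec,
                           nc_restore=100_000)
ratio = eta(nc_exec, nc_total)
print(nc_total, ratio)  # prints "1250000 0.8"
```

In this reading, a ratio close to 1 means that backup and restore add little overhead, while a smaller ratio means that more of the cycles are spent on backup and restore rather than on forward progress.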

5 Mapi-Pro: an Energy Efficient Memory Mapping for Intermittent Computing

In this section, we discuss the details of the proposed mapping technique. Our main objective is to pick the optimal mapping choice among all the design choices, which reduces the system’s EDP. To achieve this, we proposed an ILP-based mapping technique. The overview of the proposed mapping technique is shown in Figure 3. We also discuss how we support intermittent computing for these MCUs.

Fig. 3.

5.1 Inputs for ILP Model

We have shown the overview block diagram of the proposed ILP framework in Figure 4.

Fig. 4.

The first step is to compile each benchmark on the TI-based compiler to generate the assembly code. We begin the profiling process once we have an .asm file (assembly file) for each benchmark. Profiling can be done manually or with online profilers such as VTune, Valgrind, gperf, and gprof. We performed manual profiling using an .asm file as input and checking for branch, load, and store instructions. We manually calculate the total number of loads and stores for each function and global variable by considering the jump instructions. During the profiling and characterization process, we consider the branch instructions and their behavior from the generated assembly code. By manually parsing through each benchmark, we can also determine how many functions and variables are present in a given application.

Using the datasheet of these MSP430FR6989/MSP430F5529 boards, we can get to know the sizes of SRAM, flash, and FRAM. We recorded the number of CPU cycles required to execute each application while running it on MSP430FR6989/MSP430F5529 boards. The CPU cycles were recorded in both stable and unstable power scenarios.

Experiments were also conducted to determine the energy required for each read and write request to SRAM and FRAM. We refer to this as a one-time characterization, meaning that these experiments need to be conducted only once. This information is required for a specific MSP430-based microcontroller to calculate the total system energy. We created a software procedure that uses loops to read the contents of the SRAM address space and write them to the FRAM address space, i.e., reading from SRAM and writing to FRAM. A second software procedure uses loops to read the contents of the FRAM address space and write them to the SRAM address space, i.e., reading from FRAM and writing to SRAM. Using this analysis, we calculate the energy per read/write to SRAM/FRAM. As shown in Figure 4, we currently have the following information that must be provided to the ILP solver:

– Memory address ranges for SRAM, flash, and FRAM.

– Number of CPU cycles required to run a specific application.

– Number of loads and stores for each function and variable (as determined by the .asm file).

– Number of functions and variables in a given application.

– Energy required for each read/write to SRAM/FRAM.

– Frequency of power failures that occur during the execution of an application and the duration of each power failure.

The inputs mentioned above are the ILP solver’s inputs. We used ILPSolve 5.5 [5] as our ILP solver throughout this work. The ILPSolve IDE is described in [5], which explains in detail how to enter the ILP formulation into the IDE to obtain the optimized solution. With all of these inputs in hand, we formulate the proposed ILP model that supports intermittent computing.

5.2 ILP Formulation for Data Mapping for Intermittent Computing

We present the ILP formulation for the memory mapping problem mentioned in Definition 4.1. We divide this ILP formulation into two parts, one is for global variables, and the second is for the functions.

For Global Variables: Let the number of global variables in a program be ‘G’. Let the number of reads and writes to the variable ‘i’ be $r_i$ and $w_i$. We divided FRAM’s 128 KB into two regions, $FRAM_n$ and $FRAM_b$. The $FRAM_n$ memory region has 125 KB, and the $FRAM_b$ memory region has 3 KB.

We have two memory regions represented as $Mem_j$ as shown in the equation 5; When j = 1, we select the memory region as SRAM, and we use $FRAM_{n}$ for j = 2.

\begin{equation} Mem_j = {\left\lbrace \begin{array}{ll}\text{SRAM} & ; j = 1 \\ FRAM_{n} & ; j = 2 \end{array}\right.} \end{equation}

(5)

Let the SRAM / FRAM sizes be $Size(Mem_j)$ as shown in equation 6; When j = 1, we refer to the SRAM memory size in bytes, and when j = 2, we refer to the memory size $FRAM_{n}$ in bytes.

\begin{equation} Size(Mem_j) = {\left\lbrace \begin{array}{ll}Size(\text{SRAM}) & ; j = 1 \\ Size(FRAM_{n}) & ; j = 2 \end{array}\right.} \end{equation}

(6)

Let the energy required for each read/write to $Mem_j$ be $E_{r\_j}$ and $E_{w\_j}$. Let the number of CPU cycles required to execute a global variable $v_i$ be $NC_{v_i}$, where $\forall i \in [1, G]$. Using one-time characterization and static profiling, we gathered data such as the per read/write energy to SRAM/FRAM and the number of cycles.

We define a binary variable (BV), $I_{j}\left(v_{i}\right)$, which indicates whether a variable $v_i$ is allocated to the memory region $j$. $I_{j}\left(v_{i}\right)$ = 1 indicates that the variable $v_i$ is allocated to region $j$, and $I_{j}\left(v_{i}\right)$ = 0 indicates that it is not. $I_{j}\left(v_{i}\right)$, where $(\forall j \in [1, Mem_j], \forall i \in [1, G])$, is defined as shown in equation 7.

\begin{equation} I_{j}\left(v_{i}\right)= {\left\lbrace \begin{array}{ll}1 & v_{i} \text{ is allocated to memory region } j \\ 0 & \text{ otherwise} \end{array}\right.} \end{equation}

(7)

Constraints: There are two constraints: one on the BV $I_{j}\left(v_{i}\right)$ and one on the memory size. A variable $v_i$ is always allocated to exactly one memory region, which means that $v_i$ is allocated to either SRAM or FRAM but not both. This constraint is defined in equation 8.

\begin{equation} \sum _{j=1}^{Mem_j} I_{j}\left(v_{i}\right)=1 \quad (\forall i \in [1, G]) \end{equation}

(8)

The other constraint is related to memory sizes. The total size of the variables $v_i$ allocated to a memory region, given by $Size(v_i)$, where $\forall i \in [1, G]$, should not be greater than $Size(Mem_j)$. This constraint is defined in equation 9.

\begin{equation} \sum _{i=1}^{G} I_{j}\left(v_{i}\right) * Size(v_i) \le Size(Mem_j) \quad (\forall j \in [1, Mem_j]) \end{equation}

(9)

Objective 4.1: The objective of mapping global variables in a program to either SRAM or FRAM is to reduce EDP and improve system performance. $E_{global}$, defined in equation 10, is the energy required to allocate global variables to either SRAM or FRAM and execute those from their respective memory regions.

\begin{equation} E_{global} = \sum _{j=1}^{Mem_j} \sum _{i=1}^{G} [E_{r\_j} \times r_{i} +E_{w\_j} \times w_{i}] \end{equation}

(10)

$EDP_{global}$, defined in equation 11, is the energy-delay product required to allocate global variables to either SRAM or FRAM.

\begin{equation} EDP_{global} = \sum _{j=1}^{Mem_j} \sum _{i=1}^{G} I_{j}\left(v_{i}\right) [ E_{global} \times NC_{v_i} ] \end{equation}

(11)
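The global-variable part of the formulation can be illustrated with a small exhaustive search in Python. This is a sketch under stated assumptions, not the ILP solver flow used in this work: it enumerates every assignment instead of solving an ILP, it reads equations 10 and 11 as charging each variable the access energy of the region it is mapped to multiplied by its cycle count, and all names and numeric values are hypothetical.

```python
from itertools import product

SRAM, FRAM_N = 0, 1  # memory regions j = 1 and j = 2 of equation 5


def best_global_mapping(variables, energy, capacity):
    """Exhaustively search for the minimum-EDP mapping of global variables.

    variables: list of dicts with keys size, reads, writes, cycles
    energy:    per-region (energy per read, energy per write)
    capacity:  per-region size in bytes
    Returns (best_cost, best_assignment), or (None, None) if infeasible.
    """
    best_cost, best_assignment = None, None
    # Each tuple assigns exactly one region per variable (equation 8).
    for assignment in product((SRAM, FRAM_N), repeat=len(variables)):
        used = [0, 0]
        for var, region in zip(variables, assignment):
            used[region] += var["size"]
        # Memory size constraint (equation 9).
        if any(used[j] > capacity[j] for j in (SRAM, FRAM_N)):
            continue
        cost = 0.0
        for var, region in zip(variables, assignment):
            e_read, e_write = energy[region]
            access_energy = e_read * var["reads"] + e_write * var["writes"]
            cost += access_energy * var["cycles"]
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


# Hypothetical inputs: FRAM accesses cost twice as much as SRAM accesses.
energy = {SRAM: (1.0, 1.0), FRAM_N: (2.0, 2.0)}      # arbitrary units
capacity = {SRAM: 2 * 1024, FRAM_N: 125 * 1024}      # bytes
variables = [
    {"size": 1536, "reads": 100, "writes": 50, "cycles": 10},
    {"size": 1024, "reads": 1000, "writes": 500, "cycles": 10},
    {"size": 256, "reads": 10, "writes": 10, "cycles": 10},
]

cost, mapping = best_global_mapping(variables, energy, capacity)
print(cost, mapping)  # prints "18200.0 (1, 0, 0)"
```

In this example the three variables do not all fit into the 2 KB SRAM, so the search keeps the two variables whose placement in SRAM saves the most and moves the first one to $FRAM_n$. Exhaustive enumeration is only practical for a handful of variables, which is the reason to hand the same constraints and objective to an ILP solver for realistic applications.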

For Functions: Let the number of functions in a program be ‘$N_f$’. Let the number of reads and writes to the $i{\rm {th}}$ function be $r(F_i)$ and $w(F_i)$, where $\forall i \in [1, N_f]$. The functions consist of procedural parameters, local variables, and return variables. Internally, the code/data of functions are divided into text, data, and stack sections. We map at least one section among these three sections to either SRAM or FRAM regions, i.e., $Mem_j$. $Sec_k(i)$ defines section ‘k’ of the $i{\rm {th}}$ function as shown in equation 12; when k = 1, we refer to the text section of the $i{\rm {th}}$ function, when k = 2, we refer to the data section of the $i{\rm {th}}$ function, and when k = 3, we refer to the stack section of the $i{\rm {th}}$ function.

\begin{equation} Sec_k(i) = {\left\lbrace \begin{array}{ll}k = 1 & \text{; Text } \\ k = 2 & \text{; Data } \\ k = 3 & \text{; Stack } \end{array}\right.} ; \forall i \in [1, N_f] \end{equation}

(12)

We define a BV, $I_{j}\left(Sec_{k}(i) \right)$, which indicates whether section $Sec_k$ of the $i{\rm {th}}$ function is allocated to memory region $j$. $I_{j}\left(Sec_{k}(i) \right)$ = 1 indicates that the section $Sec_k$ is allocated to region $j$, and $I_{j}\left(Sec_{k}(i) \right)$ = 0 indicates that it is not. $I_{j}\left(Sec_{k}(i) \right)$, where $\forall j \in [1, Mem_j]$, $\forall i \in [1, N_f]$, and $\forall k \in [1, Sec_k(i)]$, is defined as shown in equation 13.

\begin{equation} I_{j}\left(Sec_{k}(i) \right) = {\left\lbrace \begin{array}{ll}1 & Sec_k \text{ of } i{\rm {th}} \text{ function is allocated to } j \\ 0 & \text{ otherwise} \end{array}\right.} \end{equation}

(13)

Constraints: There are two constraints: one on the BV $I_{j}\left(Sec_{k}(i) \right)$ and one on memory size. A section $Sec_k$ of the $i{\rm {th}}$ function is allocated to only one memory region, i.e., to either SRAM or FRAM but not both. This constraint is defined in equation 14.

\begin{equation} \sum _{k=1}^{3} \sum _{j=1}^{Mem_j} I_{j}\left(Sec_{k}(i) \right)=1 \quad (\forall i \in [1, N_f]) \end{equation}

(14)

The other constraint is related to memory sizes. For every memory region $j \in [1, Mem_j]$, the total size of the global variables, $Size(v_i)$, and of the function sections $Sec_{k}(i)$, $Size(F_i)$ ($\forall k \in [1, 3]$, $\forall i \in [1, N_f]$), allocated to that region should not be greater than $Size(Mem_j)$. This constraint is defined in equation 15.

\begin{equation} \sum _{i=1}^{G} I_{j}\left(v_{i}\right) * Size(v_i) + \sum _{k=1}^{3} \sum _{i=1}^{N_f} I_{j} \left(Sec_{k}(i) \right) * Size(F_i) \le Size(Mem_j) \end{equation}

(15)
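The two constraints can be checked mechanically for a candidate placement. The sketch below follows the prose reading of the constraints (each item sits in exactly one region, and the items in a region fit within its capacity); the memory sizes come from Table 2, while the item names, sizes, and the `check_placement` helper itself are hypothetical.

```python
SIZE_BYTES = {"SRAM": 2 * 1024, "FRAM": 128 * 1024}  # memory sizes from Table 2


def check_placement(sizes, placement):
    """Return a list of violated constraints (an empty list means feasible).

    sizes:     item -> size in bytes (a global variable or a function section)
    placement: item -> set of regions whose indicator I_j(item) is 1
    """
    problems = []
    used = {mem: 0 for mem in SIZE_BYTES}
    for item, size in sizes.items():
        regions = placement.get(item, set())
        if len(regions) != 1:  # exactly one region: never both, never none
            problems.append(f"{item}: placed in {len(regions)} regions")
            continue
        (mem,) = regions
        used[mem] += size
    for mem, total in used.items():  # capacity check, in the spirit of equation 15
        if total > SIZE_BYTES[mem]:
            problems.append(f"{mem}: {total} > {SIZE_BYTES[mem]} bytes")
    return problems


# Hypothetical items: one global variable and two sections of one function.
sizes = {"g1": 512, "f1.text": 1800, "f1.stack": 256}
feasible = check_placement(
    sizes, {"g1": {"SRAM"}, "f1.text": {"FRAM"}, "f1.stack": {"SRAM"}}
)
infeasible = check_placement(
    sizes, {"g1": {"SRAM"}, "f1.text": {"SRAM"}, "f1.stack": {"SRAM", "FRAM"}}
)
```

In the second call, the stack section is assigned to both regions and the remaining SRAM items exceed 2 KB, so both constraints are reported as violated.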

Objective 4.2: The goal of mapping the sections of these functions to either SRAM or FRAM is to minimize EDP and improve system performance. $E_{func}$ is defined in equation 16, where $M_{c_i}$ is the number of times the $i{\rm {th}}$ function is called.

\begin{equation} E_{func} = \sum _{j=1}^{Mem_j} \sum _{i=1}^{N_f} [E_{r\_j} \times r(F_i) +E_{w\_j} \times w(F_i)] \times M_{c_i} \end{equation}

(16)

$EDP_{func}$, defined in equation 17, is the energy-delay product required to allocate all functions to either SRAM or FRAM. Here, $E_{func}$ is the energy required to allocate functions to either SRAM or FRAM, and $NC_{F_i}$ is the number of CPU cycles required to execute a function $F_i$.

\begin{equation} EDP_{func} = \sum _{k=1}^{3} \sum _{j=1}^{Mem_j} \sum _{i=1}^{N_f} I_{j} \left(Sec_{k}(i) \right) [ E_{func} \times NC_{F_i} ] \end{equation}

(17)

The total system EDP, $EDP_{system}$, is the sum of $EDP_{global}$ and $EDP_{func}$ scaled by the factor $\eta$, as shown in equation 18.

\begin{equation} EDP_{system} = \eta (EDP_{global} + EDP_{func}) \end{equation}

(18)

Our objective function is shown in equation 19. Our main objective is to minimize the system’s EDP by choosing the optimal placement.

\begin{equation} \textbf{Objective Function: Minimize}\ EDP_{system} \end{equation}

(19)
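The objective in equation 19 can be illustrated on a toy instance. The sketch below is not the ILP solver used in this work: it replaces the solver with an exhaustive search over all SRAM/FRAM assignments, which is only practical for a handful of items. Energies and memory sizes come from Tables 3 and 2; the item names, sizes, access counts, and cycle counts are hypothetical, and the factor $\eta$ of equation 18 is omitted on the assumption that it is a positive constant that does not change which mapping is minimal.

```python
from itertools import product

ENERGY_NJ = {"SRAM": (5500, 5600), "FRAM": (10325, 13125)}  # (read, write), Table 3
CAPACITY = {"SRAM": 2 * 1024, "FRAM": 128 * 1024}           # bytes, Table 2


def item_edp(item, mem):
    """Energy of the item's accesses in `mem` times its cycle count there."""
    e_read, e_write = ENERGY_NJ[mem]
    energy = e_read * item["reads"] + e_write * item["writes"]
    return energy * item["cycles"][mem]


def best_mapping(items):
    """Try every SRAM/FRAM assignment; keep the feasible one with least EDP."""
    names = list(items)
    best = None
    for choice in product(("SRAM", "FRAM"), repeat=len(names)):
        used = {"SRAM": 0, "FRAM": 0}
        for name, mem in zip(names, choice):
            used[mem] += items[name]["size"]
        if any(used[m] > CAPACITY[m] for m in used):
            continue  # violates the memory-size constraint
        total = sum(item_edp(items[n], m) for n, m in zip(names, choice))
        if best is None or total < best[0]:
            best = (total, dict(zip(names, choice)))
    return best


# Hypothetical workload: one global table and two sections of one function.
items = {
    "g_table": {"size": 1536, "reads": 200, "writes": 20,
                "cycles": {"SRAM": 220, "FRAM": 440}},
    "f.text": {"size": 1024, "reads": 500, "writes": 0,
               "cycles": {"SRAM": 500, "FRAM": 1000}},
    "f.stack": {"size": 256, "reads": 50, "writes": 50,
                "cycles": {"SRAM": 100, "FRAM": 200}},
}
total, mapping = best_mapping(items)
```

Because the three items together exceed the 2 KB of SRAM, the search has to leave one of them in FRAM; with these numbers it keeps the read-heavy text section and the stack in SRAM.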

5.3 Implementing Mapping Technique in MSP430FR6989

Once we obtain the placement information from $ILP\_solver$, we map the respective variables and the sections of a function to either SRAM or FRAM. We modify the linker script accordingly to map the sections or variables to either SRAM or FRAM. In our proposed mapping policy, placing global variables is straightforward, i.e., mapping the respective variable to either SRAM or FRAM based on the ILP decision.

We observed that, through the linker script, we could map the whole stack section of each function to either SRAM or FRAM. We analyzed the mappings of the stack section for each function by modifying the linker script. We used built-in attributes to differentiate the mappings between SRAM and FRAM; for example, the built-in attribute $\_\_attribute\_\_((ramfunc))$ maps a function to SRAM. To place the stack section in SRAM, we replace the default setting in the linker script with “.stack: {} > RAM (HIGH)”. To place the stack section in FRAM, we replace the default setting with “.stack: {} > FRAM”.

Similarly, placing the text section in either SRAM or FRAM affects EDP. This is because the majority of accesses to the text section are reads, and the energy consumed per read access differs between SRAM and FRAM. Table 3 shows that FRAM consumes approximately 2x the read energy of SRAM. Thus, we analyzed each application to map the text section based on the free space available: if enough space is available in SRAM, we place the text section in SRAM; otherwise, we place it in FRAM. We included the following four lines in our linker script to check the above condition and map the text section.

Table 2.

Component	Description
Target Board	TI MSP430FR6989 Launchpad
Core	MSP430 (1.8-3.6 V; 16 MHz)
Memory	2KB SRAM and 128KB FRAM
IDE	Code Composer Studio
Energy Profiling	Energy Trace++
ILP Solver	LPSolve_IDE [5]
Benchmarks	Mixed benchmarks (MiBench and TI-based)

Table 2. Experimental Setup

Table 3.

Memory	Per Read Energy (nJ)	Per Write Energy (nJ)
SRAM	5500	5600
FRAM	10325	13125
Flash	23876	31198

Table 3. Energy Values for each Read/Write to SRAM, FRAM, and Flash

(1) #ifndef __LARGE_CODE_MODEL__
(2) .text : {} > FRAM
(3) #else
(4) .text : {} >> SRAM

We modified the linker script to map the data section by using the built-in compiler directives. We followed the three steps below.

(1) Allocate a new memory block, for instance, $NEW\_DATASECTION$. We can declare the start address and size of the data section in the linker script.

(2) Define a segment (.Localvars) that is stored in this memory block ($NEW\_DATASECTION$).

(3) Use #pragma $DATA\_SECTION (funct\_name, seg\_name)$ in the program to define functions in this segment, where $funct\_name$ is the function name and $seg\_name$ is the name of the created segment. For instance, #pragma $DATA\_SECTION (func\_1, .Localvars)$.

Once the different sections are created, we can allocate them to either SRAM or FRAM based on the ILP decisions. For instance, placing “$NEW\_DATASECTION$: {} > FRAM” in the linker script maps $NEW\_DATASECTION$ to FRAM.
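The stack and text placements described in this subsection amount to a lookup from an ILP decision to a linker-script line. The sketch below shows one way such a lookup could be automated; the directive strings are the ones quoted in this section, whereas the `linker_lines` helper and the decision dictionary are hypothetical stand-ins for the solver output. The data section is left out because only its FRAM form is quoted in the text.

```python
# Directive strings as quoted in this section; the helper itself is hypothetical.
DIRECTIVES = {
    ("stack", "SRAM"): ".stack: {} > RAM (HIGH)",
    ("stack", "FRAM"): ".stack: {} > FRAM",
    ("text", "SRAM"): ".text : {} >> SRAM",
    ("text", "FRAM"): ".text : {} > FRAM",
}


def linker_lines(decision):
    """Translate {'stack': 'SRAM', 'text': 'FRAM'} into linker-script lines."""
    return [DIRECTIVES[(section, region)] for section, region in decision.items()]


# Hypothetical ILP decision for one application.
lines = linker_lines({"stack": "SRAM", "text": "FRAM"})
```

An unknown section or region raises a `KeyError`, so a decision the linker script cannot express is rejected instead of being silently ignored.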

5.4 Support for Intermittent Computing

When the power is stable, everything works properly. Because the allocation scheme is static, we map all functions/variables to SRAM/FRAM only once, before execution. During a power failure, SRAM and registers lose all of their contents, including mapping information. When power is restored, we don’t know which functions/variables were allocated to SRAM before the failure. As a result, we must either restart the execution from the beginning or end up with incorrect results. Restarting the application consumes extra energy and time, making our system inefficient in terms of energy consumption and performance.

We propose a backup strategy to handle frequent power failures. FRAM is divided into $FRAM_n$ and $FRAM_b$, as shown in Figure 3. $FRAM_n$ has a size of 125 KB and is used for regular mappings. $FRAM_b$ has a size of 3 KB and serves as a backup region (BR) during power failures. During a power failure, we back up all register and SRAM contents to $FRAM_b$. Whenever power is restored, we restore the register and SRAM contents from $FRAM_b$ and resume the application execution. The proposed backup strategy avoids the extra energy of restarting and makes the system more energy efficient.
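The backup and restore flow can be pictured with a small simulation. The sketch below is a toy model, not the on-device implementation: the SRAM and $FRAM_b$ sizes follow the text, while the `Device` class and the 64-byte register footprint are assumptions made only for illustration.

```python
SRAM_BYTES = 2 * 1024     # SRAM size from Table 2
BACKUP_BYTES = 3 * 1024   # size of FRAM_b given in the text
REGISTER_BYTES = 64       # assumed register-file footprint, not from the paper


class Device:
    """Toy model of volatile state plus a non-volatile backup region."""

    def __init__(self):
        self.sram = bytearray(SRAM_BYTES)
        self.regs = bytearray(REGISTER_BYTES)
        self.fram_b = None  # backup region, empty until the first backup

    def backup(self):
        """Copy registers and SRAM into FRAM_b just before power is lost."""
        snapshot = bytes(self.regs) + bytes(self.sram)
        if len(snapshot) > BACKUP_BYTES:
            raise ValueError("backup region too small")
        self.fram_b = snapshot

    def power_fail(self):
        """Volatile state is lost; FRAM_b keeps its contents."""
        self.sram = bytearray(SRAM_BYTES)
        self.regs = bytearray(REGISTER_BYTES)

    def restore(self):
        """Copy FRAM_b back into registers and SRAM after power returns."""
        if self.fram_b is None:
            return False  # nothing saved: execution would have to restart
        self.regs = bytearray(self.fram_b[:REGISTER_BYTES])
        self.sram = bytearray(self.fram_b[REGISTER_BYTES:])
        return True


# One backup / failure / restore cycle on hypothetical contents.
dev = Device()
dev.sram[0] = 0xAB
dev.regs[1] = 7
dev.backup()
dev.power_fail()
lost = dev.sram[0]        # volatile contents are gone after the failure
resumed = dev.restore()   # state comes back from FRAM_b
```

Under these assumptions the snapshot is 2,112 bytes, which fits in the 3 KB backup region with room to spare.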

5.4.1 Implementation Details of Flash-based Programming for Intermittent Computing

The MSP430F5529 has SRAM and Flash as main memory. During Flash programming, SRAM is the only on-chip memory from which the CPU can read code to execute the application. We need to copy the Flash program function onto the stack whenever we want to use only SRAM for mapping the application. Whenever we want to switch from SRAM to Flash, we need to restore the stack pointer and point the program counter register to the Flash memory region.

During a power failure, we must perform the backup operation to copy the SRAM data to the Flash memory region. For the backup operation, we made some changes to the inbuilt MSP430 functions void Flash_wb(char *Data_ptr, char byte) and void Flash_ww(int *Data_ptr, int word), where Flash_wb() writes a byte and Flash_ww() writes a word to the Flash memory region.

Whenever power comes back, we must restore the contents from the Flash-based backup region to SRAM. We used the inbuilt ctpl() functions to copy from Flash to SRAM. After restoring, we need to clear the Flash-based backup region; for this, we made changes to the inbuilt function void Flash_clr(int *Data_ptr).

6 Experimental Setup and Results

6.1 Experimental Setup

We used TI’s MSP430FR6989 for all experiments. We experimented with mixed benchmarks, drawn from both MiBench [22] and TI-based benchmarks. The experimental setup is summarized in Table 2, and the development platform is shown in Figure 5. We performed experiments to determine the energy required for a single read/write to SRAM/FRAM, as shown in Table 3. We collected the number of reads/writes for each global variable and function as part of a one-time characterization. We also used TI’s MSP430F5529 to compare flash with FRAM, and measured the energy required for a single read/write to flash, also shown in Table 3.

Fig. 5.

The MCU we experimented with has the MSP430 architecture, which is well suited to IoT devices. The majority of MSP430 software is written in C and compiled with one of TI’s recommended compilers (IAR Embedded Workbench, Code Composer Studio (CCS), or msp430-gcc). The IAR Embedded Workbench and CCS compilers are part of integrated development environments (IDEs). For all experiments in this article, we opt for CCS, which is a widely used, freely available, and easily extended tool. EnergyTrace++ technology allows us to measure energy and power consumption directly. According to the datasheet for the MSP430FR6989, the number of cycles required to read/write in FRAM is twice that of SRAM, which means the access penalty of FRAM is twice that of SRAM at this specific operating point of 16 MHz. The latency penalty disappears when operating at/below 8 MHz and gets worse above 16 MHz.

6.2 Evaluation Benchmarks

We chose benchmarks from both the MiBench suite and TI benchmarks. One of the primary motivations for using the MiBench suite is that most of the TI-based benchmarks were small in size and easily fit into either SRAM or FRAM. In these cases, we don’t require any hybrid memory design. Most TI-based benchmarks have only one or two functions and 3-4 global variables, which is not useful for the hybrid main memory design. Thus, we used mixed benchmarks consisting of 4 TI-based benchmarks and 12 from the MiBench suite.

For the MiBench suite, we first make the benchmarks MCU-compatible by adding MCU-related header files and watchdog timers. Not all benchmarks are compatible with the MCU, so we choose those from the MiBench suite that are compatible with the MSP430 boards. Once we have the benchmarks, we execute them on the board to obtain the machine code. Using the .asm file, we calculate the inputs that are required by the ILP solver, as shown in Figure 4.

6.3 Baseline Configurations

We chose five different memory configurations to compare with the proposed memory configuration.

We directly map all the functions/variables to FRAM in the FRAM-only configuration, as shown in Figure 1. We use the FRAM-only configuration to compare our proposed memory configuration during stable and unstable power scenarios.

We directly map all the functions/variables to SRAM in the SRAM-only configuration, as shown in Figure 1. We use the SRAM-only configuration to compare our proposed memory configuration during stable and unstable power scenarios.

We used the empirical method of Jayakumar et al. [29]. We compare this configuration with our proposed configuration during stable and unstable power scenarios to show the benefit of the proposed work over the existing work.

In the SRAM+Flash with ILP configuration, we used the proposed ILP technique on the flash-based MSP430 board [34]. We compare this configuration with our proposed configuration during stable and unstable power scenarios to observe the difference between FRAM and Flash technologies.

In the SRAM+FRAM with ILP configuration, we have the proposed memory mapping technique that does not support BR. We compare this configuration with our proposed configuration during frequent power failures to observe the importance of BR. The overview of all baseline configurations is shown in Table 4. The experimental setup for all the above five configurations is the same as the one proposed.

Table 4.

Configuration	FRAM	SRAM	Flash	Backup Region (BR)	ILP
FRAM-only	✓	✗	✗	✗	✗
SRAM-only	✗	✓	✗	✗	✗
Jayakumar et al. [29]	✓	✓	✗	✗	✗
SRAM+Flash with ILP	✗	✓	✓	✗	✓
SRAM+FRAM with ILP	✓	✓	✗	✗	✓
Proposed	✓	✓	✓	✓	✓
✓- Supported, ✗- Not Supported

Table 4. Overview of the Different Memory Configurations for Comparing with the Proposed Memory Configuration

6.4 Results

The proposed memory configuration is evaluated in this section under stable and unstable power. The proposed memory configuration is compared with five different memory configurations as discussed in Section 6.3.

6.4.1 Under Stable Power

The main objective of the proposed memory configuration is to minimize the system’s EDP. All values shown in Figure 6 are normalized to the FRAM-only configuration. Compared to the FRAM-only configuration, the proposed configuration achieves 38.10% less EDP, as shown in Figure 6. Because there are no power interruptions in this scenario, this improvement is entirely due to the proposed ILP model. In the FRAM-only configuration, we place everything in FRAM, which consumes more energy per access than SRAM (Table 3) and requires more cycles. Our proposed memory configuration incorporates the placement recommendation from the proposed ILP model and suggests utilizing both SRAM and FRAM.

Fig. 6.

Under stable power, the proposed configuration achieves 9.30% less EDP than Jayakumar et al., as shown in Figure 6. We discussed the authors’ empirical model and assumptions in Section 3. The authors assumed that the data section included all global variables, constants, and arrays. As a result, our proposed ILP-based mapping differs from the authors’ mapping and outperforms the existing work. Under stable power, Jayakumar et al. achieve 24.57% less EDP than the FRAM-only configuration, as shown in Figure 6. This advantage is primarily due to the hybrid memory used by Jayakumar et al.

Compared to the SRAM+Flash with ILP configuration, the proposed configuration reduces EDP by 18.55%, as shown in Figure 6. In this configuration, we used flash + SRAM with our proposed ILP framework. As shown in Table 3, the above benefit is primarily due to FRAM, because flash consumes more energy. Jayakumar et al. outperform the SRAM+Flash with ILP configuration during stable power: because Jayakumar et al. use FRAM, even our proposed ILP model cannot compensate for flash in this case. We observed that Jayakumar et al. achieve 9.19% less EDP than the SRAM+Flash with ILP configuration, and this benefit comes from the smaller applications. From Figure 6, the SRAM+Flash with ILP configuration performs better than Jayakumar et al. for large applications. The empirical method of Jayakumar et al. suggests placing more content in SRAM because SRAM is sufficient to hold an entire small application. As a result, the performance of Jayakumar et al. depends on the application size; for large applications, even FRAM does not outperform flash.

Comparing the proposed memory configuration to the empirical method of Jayakumar et al. helps in understanding the role of the ILP model. This comparison also clarifies whether these advantages stem from the mapping granularity or the ILP. The proposed memory configuration outperforms the existing one, demonstrating that it benefits from mapping granularity and the ILP model.

The SRAM-only configuration outperforms the proposed and all other memory configurations under stable power conditions. We noticed that this benefit is primarily due to SRAM, but it only applies to smaller applications. SRAM-only achieves 36.19% less EDP than the proposed configuration for smaller applications, as shown in Figure 6. For large applications, the proposed configuration outperforms the SRAM-only configuration by a small margin: when the SRAM is full, the MCU must wait for space to be released, which consumes extra energy and cycles. For these larger applications, the SRAM-only configuration consumes 2.94% more EDP than the proposed configuration.

We also evaluated our proposed framework with another MSP430F5529 MCU with flash and SRAM for completeness. This comparison assists the user in selecting the most appropriate NVM technology, such as FRAM or Flash, as needed. To be fair, we used the same sizes of SRAM (2 KB) and Flash (128 KB) in this comparison. We compared FRAM-based and flash-based MCUs under stable power conditions. We used the proposed frameworks and techniques in both MCUs. We discovered that the proposed FRAM-based configuration outperforms the flash-based configuration. Flash-based configurations consume 26.03% more EDP than FRAM-based configurations, as shown in Figure 7. Flash consumes more energy, as shown in Table 3.

Fig. 7.

6.4.2 Under Unstable Power

We used TI’s default compute through power loss (ctpl) tool for migration. During a power failure, we need to migrate the SRAM contents to a FRAM-based backup region ($FRAM_b$), i.e., the backup process. Whenever the power comes back, we need to migrate the $FRAM_b$ contents to SRAM, i.e., the restoration process. All these migrations are done using ctpl() functions. We introduce a power failure by changing the low-power modes mentioned in the MSP430FR6989 design document, using ctpl() to create the power failures. We assume that the power failures are spread equally within the fixed execution period of 300 seconds. For instance, with 300 power failures in an execution period of 300 seconds, we experience one power failure every second.

Energy-harvesting sources such as piezoelectric and vibration-based sources extract very little energy from their surroundings. In such cases, the capacitor is unable to store sufficient energy, leading to frequent power failures. Our proposed architecture is designed to handle these worst-case scenarios. Existing works by Xie et al. [57] and Badri et al. [2, 3] made similar assumptions, with a power failure occurring every 200 ms or 500 ms for a 1-core CPU running at a frequency of 480 MHz.

We performed experiments under unstable power to compare the proposed memory configuration with five memory configurations, as shown in Table 4. All values shown in Figure 8 are normalized to the FRAM-only configuration. Compared to the FRAM-only configuration, the proposed configuration achieves 20.15% less EDP, as shown in Figure 8. We observe that the migration overhead is less than the energy consumed to execute the application from FRAM, and this migration overhead depends on the number of power failures. For example, in the qsort application, a backup migration consumes approximately 16.88 mJ of energy, and a restore migration consumes approximately 11.606 mJ. The benefit of our proposed configuration comes from using a hybrid memory.
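To make the dependence on the number of power failures concrete, the sketch below totals the migration energy from the qsort figures quoted above. It assumes, as a simplification not stated in the text, that every power failure costs exactly one backup and one restore at those energies; the `migration_energy_mj` helper is hypothetical.

```python
BACKUP_MJ = 16.88     # one backup migration for qsort (from the text)
RESTORE_MJ = 11.606   # one restore migration for qsort (from the text)


def migration_energy_mj(num_failures):
    """Total backup plus restore energy if each failure costs one of each."""
    return num_failures * (BACKUP_MJ + RESTORE_MJ)


per_failure = migration_energy_mj(1)   # about 28.486 mJ per failure
over_run = migration_energy_mj(300)    # one failure per second over 300 s
```

Under this assumption the overhead grows linearly: 300 failures cost about 8.5 J of migration energy in total.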

Fig. 8.

Under unstable power, the proposed configuration achieves 26.87% less EDP than Jayakumar et al., as shown in Figure 8. We discussed the authors’ empirical model and assumptions in Section 3. As already stated, the empirical method of Jayakumar et al. is more beneficial for small applications, because it suggests placing more content in SRAM, which is sufficient to hold an entire small application. Thus, for Jayakumar et al., backup/restore operations take more energy during a power failure, and our proposed mapping outperforms the existing work. During frequent power failures, Jayakumar et al. achieve 8.25% less EDP than the FRAM-only configuration, as shown in Figure 8. This advantage is primarily due to the hybrid memory used by Jayakumar et al.

Compared to the SRAM+Flash with ILP configuration, the proposed configuration reduces EDP by 27.10%, as shown in Figure 8. As shown in Table 3, the above benefit is primarily due to FRAM, because flash consumes more energy. Jayakumar et al. also outperform the SRAM+Flash with ILP configuration during unstable power for smaller applications: because Jayakumar et al. use FRAM, even our proposed ILP model cannot compensate for flash in this comparison. We observed that Jayakumar et al. achieve 8.19% less EDP than the SRAM+Flash with ILP configuration for smaller applications. This benefit is small because the size of backup/restore increases, which even cancels the advantage over flash for some applications, as shown in Figure 8. The SRAM+Flash with ILP configuration achieves 5.10% less EDP than Jayakumar et al. for large applications, as shown in Figure 8. As a result, the performance of Jayakumar et al. depends on the application size; for large applications, even FRAM does not outperform flash.

The proposed technique outperforms all memory configurations under unstable power conditions. This benefit is primarily due to hybrid memory and the proposed mapping technique. The proposed configuration achieves 46.59% less EDP than SRAM-only, as shown in Figure 8.

When we remove BR, all the mapping information of SRAM is lost because our model is static. We introduce a BR in the FRAM memory region to save this mapping information. During a power failure, we migrate the SRAM contents to $FRAM_b$; whenever power comes back, we restore the $FRAM_b$ contents to the SRAM.

We ran an experiment to assess the importance of BR by comparing the proposed memory configuration with the SRAM+FRAM with ILP configuration. Compared to the SRAM+FRAM with ILP configuration, the proposed configuration achieves 26.44% less EDP, as shown in Figure 8. This benefit arises because, without BR, we need to re-execute the application four times from the beginning, which consumes extra time and energy. The number of re-executions equals the number of power failures.

We also evaluated our proposed framework with another MSP430F5529 MCU consisting of flash and SRAM for completeness. This comparison assists the user in selecting the most appropriate NVM technology, such as FRAM or Flash, as needed. To be fair, we used the same sizes of SRAM (2 KB) and Flash (128 KB) in this comparison. We also used BR in these experiments; the only difference is that we replaced the FRAM with flash in the proposed configuration, and everything is the same. We compared FRAM-based and flash-based MCUs under unstable power conditions. We used the proposed frameworks and techniques in both MCUs. We discovered that the proposed FRAM-based configuration outperforms the flash-based configuration. Flash-based configurations consume 20.11% more EDP than FRAM-based configurations, as shown in Figure 9. Flash consumes more energy, as shown in Table 3.

Fig. 9.

6.4.3 Analysis with Different Number of Power Failures

The proposed ILP solver supports different frequencies and durations of power failures. We experimented with five different power-failure frequencies to analyze this behavior. We fixed the execution period (forward progress) at 300 seconds across all benchmarks, which means that all benchmarks execute repeatedly until this period ends. We introduce a power failure every 250 msec, 500 msec, 1 sec, 5 sec, and 15 sec. These comparisons make the evaluation more realistic by covering both frequent and nominal power failure scenarios.
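The five scenarios translate into very different numbers of power failures over the fixed 300-second period. The sketch below computes those counts, assuming equally spaced failures as in Section 6.4.2; the scenario labels follow the PF_ naming used in this section, and the `failures_in_period` helper is hypothetical.

```python
EXECUTION_PERIOD_MS = 300 * 1000  # fixed execution period from the text

# Interval between consecutive power failures for each scenario, in ms.
INTERVALS_MS = {
    "PF_250 msec": 250,
    "PF_500 msec": 500,
    "PF_1 sec": 1000,
    "PF_5 sec": 5000,
    "PF_15 sec": 15000,
}


def failures_in_period(interval_ms, period_ms=EXECUTION_PERIOD_MS):
    """Number of equally spaced power failures within the execution period."""
    return period_ms // interval_ms


counts = {label: failures_in_period(ms) for label, ms in INTERVALS_MS.items()}
```

The most frequent scenario therefore sees 60 times as many failures as the least frequent one (1200 versus 20), and each failure triggers a backup and a restore.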

We noticed that increasing the number of power failures affects the system’s EDP, increasing it overall. To make the same assumptions as Xie et al. [57], we introduce a power failure every 500 msec in the above experiments. We used the proposed mapping technique for all of the experiments shown in Figure 10 but with varying numbers of power failures. We observed that when the system experiences a power failure every 250 msec (PF_250 msec), the proposed mapping technique outperforms the existing mapping technique by 8.89%.

Fig. 10.

The PF_500 msec scenario has 11.62% less EDP than the PF_250 msec scenario. Compared to the PF_250 msec scenario, the PF_1 sec scenario has 13.19% less EDP, the PF_5 sec scenario has 18.55% less EDP, and the PF_15 sec scenario has 22.91% less EDP.

We see these improvements between the infrequent power failure scenario (PF_15 sec) and the frequent power failure scenario (PF_250 msec) because of backup overhead. According to the proposed backup policy, whenever there is a power failure, we back up the SRAM contents and the register file contents to the backup region. So, increasing the number of power failures raises the total amount of backed-up content, which means the number of writes to the NVM-based main memory (flash or FRAM) increases. Table 3 shows that FRAM/flash consumes more write energy than SRAM. As a result, during frequent power failures, the overall energy and execution time increase, affecting the overall EDP. Therefore, the number and duration of power interruptions have an impact on the system’s EDP, and our proposed ILP model includes this as an input metric.

6.4.4 Analysis for ILP Formulation

We performed experiments to measure the time the ILP solver takes to formulate and solve the proposed ILP model. Table 5 shows, for each benchmark, the time the ILP solver takes to solve the mapping problem. On average, the ILP solver takes 61.37 msec to formulate and solve the proposed mapping model.

Table 5.

Benchmarks	Number of Global Variables	Number of Functions	ILP Solver Execution Time (milli-seconds)
16bit_2dim	1	2	43.33
aes	6	10	57.91
basicmath_small	3	12	59.91
basicmath_large	3	12	67.89
bf	2	6	47.33
crc	3	8	49.13
dhrystone	14	8	63.01
dijkstra	10	11	70.99
fft	2	7	56.24
fir	1	4	53.30
matrix_mult	3	1	47.33
patricia	7	6	76.26
qsort_small	5	3	64.11
qsort_large	5	3	70.10
sha	10	8	73.55
susan	9	21	81.56

Table 5. The Time Taken by the ILP Solver to Formulate and Solve the Proposed Mapping Model

Table 5 also lists the number of functions and global variables for each benchmark. These details help explain why the ILP solver takes longer for some benchmarks than for others. According to the proposed ILP model and the equations in Section 5.2, the solving time depends directly on the number of global variables and functions. The susan application, for example, has 9 global variables and 21 functions and takes 81.56 msec, whereas the fft application has 2 global variables and 7 functions and takes 56.24 msec. Because fft has fewer global variables and functions than susan, the ILP solver takes longer to formulate and solve the model for susan than for fft.
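The reported average can be reproduced directly from Table 5. The sketch below transcribes the table and recomputes the mean solver time, along with the fastest and slowest benchmarks; the variable names are illustrative only.

```python
# Table 5: benchmark -> (global variables, functions, solver time in ms)
TABLE5 = {
    "16bit_2dim": (1, 2, 43.33),
    "aes": (6, 10, 57.91),
    "basicmath_small": (3, 12, 59.91),
    "basicmath_large": (3, 12, 67.89),
    "bf": (2, 6, 47.33),
    "crc": (3, 8, 49.13),
    "dhrystone": (14, 8, 63.01),
    "dijkstra": (10, 11, 70.99),
    "fft": (2, 7, 56.24),
    "fir": (1, 4, 53.30),
    "matrix_mult": (3, 1, 47.33),
    "patricia": (7, 6, 76.26),
    "qsort_small": (5, 3, 64.11),
    "qsort_large": (5, 3, 70.10),
    "sha": (10, 8, 73.55),
    "susan": (9, 21, 81.56),
}

# Mean solver time over the 16 benchmarks, then the two extremes.
mean_ms = sum(time for _, _, time in TABLE5.values()) / len(TABLE5)
slowest = max(TABLE5, key=lambda name: TABLE5[name][2])
fastest = min(TABLE5, key=lambda name: TABLE5[name][2])
```

Rounded to two decimals, the mean over the 16 benchmarks is 61.37 msec, matching the figure quoted in the text.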

6.5 Summary of the Proposed Mapping Technique

In this section, we summarize the proposed ILP-based memory mapping technique. Following all of these analyses, we observed that the mappings shown in Table 6 consume less EDP than other design choices. To keep things simple, we only show the final mapping configurations for each application’s stack, data, and text sections, leaving out the final mappings for global variables.

Table 6 shows that, with the exception of the dhrystone application, the remaining three TI benchmark applications (fir, matrix_mult, and 16bit_2dim) are very small and can easily be placed in SRAM. We don’t need FRAM for these types of smaller applications, but there is a disadvantage during frequent power failures: the backup and restore sizes in FRAM are larger for these applications. As a result, our proposed backup/restore strategy should be intelligent enough to reduce EDP. The dhrystone application, on the other hand, has a larger stack section that requires FRAM to accommodate the entire stack section.

Table 6.

Benchmarks	Stack	Text	Data
16bit_2dim	SRAM	SRAM	SRAM
aes	SRAM	FRAM	FRAM
basicmath_small	SRAM	SRAM	FRAM
basicmath_large	SRAM	FRAM	FRAM
bf	SRAM	SRAM	FRAM
crc	SRAM	FRAM	SRAM
dhrystone	FRAM	SRAM	FRAM
dijkstra	SRAM	FRAM	SRAM
fft	SRAM	SRAM	FRAM
fir	SRAM	SRAM	FRAM
matrix_mult	SRAM	SRAM	SRAM
patricia	SRAM	FRAM	SRAM
qsort_small	SRAM	SRAM	FRAM
qsort_large	SRAM	FRAM	FRAM
sha	SRAM	FRAM	FRAM
susan	SRAM	FRAM	FRAM

Table 6. Optimal Placement for Different Applications in MSP430FR6989

As Table 6 shows, many applications, in particular the MiBench applications, use both SRAM and FRAM. We can therefore conclude that a hybrid main memory design is required for many applications. Using a hybrid main memory design helps to reduce EDP during stable power scenarios. Even so, determining how and where to back up the volatile contents can be difficult during frequent power outages. Our proposed memory mapping technique and framework address this by using a hybrid main memory design that supports intermittent computing.
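The share of applications that rely on both memories can be read off Table 6 programmatically. The sketch below transcribes the section placements from the table and counts the benchmarks whose stack, text, and data sections span both SRAM and FRAM; the variable names are illustrative only.

```python
# Table 6: benchmark -> (stack, text, data) placement
TABLE6 = {
    "16bit_2dim": ("SRAM", "SRAM", "SRAM"),
    "aes": ("SRAM", "FRAM", "FRAM"),
    "basicmath_small": ("SRAM", "SRAM", "FRAM"),
    "basicmath_large": ("SRAM", "FRAM", "FRAM"),
    "bf": ("SRAM", "SRAM", "FRAM"),
    "crc": ("SRAM", "FRAM", "SRAM"),
    "dhrystone": ("FRAM", "SRAM", "FRAM"),
    "dijkstra": ("SRAM", "FRAM", "SRAM"),
    "fft": ("SRAM", "SRAM", "FRAM"),
    "fir": ("SRAM", "SRAM", "FRAM"),
    "matrix_mult": ("SRAM", "SRAM", "SRAM"),
    "patricia": ("SRAM", "FRAM", "SRAM"),
    "qsort_small": ("SRAM", "SRAM", "FRAM"),
    "qsort_large": ("SRAM", "FRAM", "FRAM"),
    "sha": ("SRAM", "FRAM", "FRAM"),
    "susan": ("SRAM", "FRAM", "FRAM"),
}

# Benchmarks whose three sections are split across both memories.
hybrid = [name for name, regions in TABLE6.items() if len(set(regions)) == 2]
# Benchmarks placed entirely in SRAM.
sram_only = [name for name, regions in TABLE6.items() if set(regions) == {"SRAM"}]
```

Of the 16 benchmarks, 14 place their sections across both memories, and only 16bit_2dim and matrix_mult fit entirely in SRAM.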

MSP430 is used only as a prototyping example; the problem applies to any embedded architecture with a hybrid main memory. The ILP formulation places no constraints on the NVM technology or processor architecture. Any architecture with the following characteristics can benefit from our formulation: (1) no cache, (2) hybrid main memory (volatile + any NVM technology), (3) volatile memory much smaller than the NVM (here, 2 KB of SRAM versus 128 KB of FRAM), and (4) no support for multi-programming.

7 Conclusions

This paper proposed an ILP-based memory mapping technique that reduces the system’s energy-delay product. For both global variables and functions, we formulate an ILP model. The functions consist of data, stack, and code sections. Our ILP model suggests placing each section on either SRAM or FRAM. Under both stable and unstable power scenarios, we compared the proposed memory configuration to the baseline memory configurations. We evaluated our proposed frameworks and techniques on actual boards. We added a backup region in FRAM to support intermittent computing. We compared the proposed framework with the recent related work.

Under stable power, our proposed memory configuration consumes 38.10% less EDP than the FRAM-only configuration and 9.30% less EDP than the existing work. Under unstable power, our proposed configuration achieves 20.15% less EDP than the FRAM-only configuration and 26.87% less EDP than the existing work. FRAM-based design (MSP430FR6989) consumes 26.03% less EDP than flash-based design (MSP430F5529) during stable power and 20.11% less EDP than flash-based design during frequent power failures.

References

[1] Mohammad Alshboul, Hussein Elnawawy, Reem Elkhouly, Keiji Kimura, James Tuck, and Yan Solihin. 2019. Efficient checkpointing with recompute scheme for non-volatile main memory. ACM Transactions on Architecture and Code Optimization (TACO) 16, 2 (2019), 1–27.
