1. Introduction
As large-scale application domains like scientific computing, social media, and financial analytics continue to expand, the computational and storage requirements of modern systems have surpassed the available resources. In the upcoming decade, it is anticipated that the amount of data managed by global data centers will increase by fifty times, while the number of processors will only grow by a factor of ten [1]. This indicates that the demand for performance will soon outstrip resource allocations.
Furthermore, Information and Communication Technology (ICT) devices and services currently contribute significantly to the world’s overall energy consumption, with projections indicating that their energy demand will rise to nearly 21% by 2030 [2]. Consequently, it becomes evident that relying solely on over-provisioning resources will not suffice to address the impending challenges facing the computing industry. These constraints on both computational resources and energy consumption create a growing urgency for new approaches that improve efficiency while maintaining acceptable levels of performance.
In recent decades, significant technological advancements and increasing computational demands have driven a remarkable reduction in the size of integrated circuits and computing systems. This downscaling of CMOS technology has resulted in several key benefits, such as enhanced computational performance, improved energy efficiency, and the ability to increase the number of cores per chip. Smaller transistors allow for faster switching speeds, enabling higher clock frequencies, which translates to quicker data processing and more powerful computing systems. Additionally, as transistors shrink, the power required to switch them can be reduced, leading to lower overall energy consumption, which is crucial for mobile and battery-operated devices.
However, CMOS downscaling is not without its drawbacks. As transistors continue to shrink, the benefits of reduced supply voltage become less significant, and the leakage current (unwanted current that flows even when the transistor is off) becomes more pronounced, leading to higher static power consumption. Moreover, the exponential increase in power consumption due to higher clock frequencies has introduced thermal challenges, as more energy is dissipated as heat, which can damage the chip and reduce its lifespan. The combination of these factors means that the traditional benefits of CMOS scaling are diminishing, and the ability to further increase the number of cores per chip is constrained by power and thermal limits. Consequently, as CMOS technology reaches its scaling limits, it becomes imperative to explore alternative approaches, such as new materials, 3D stacking, or novel architectures, to continue improving computing efficiency without exacerbating these power and thermal issues [3].
In addition to the trends mentioned above, the nature of the tasks fueling the demand for computing has evolved across the computing spectrum, spanning from mobile devices to the cloud. Within data centers and the cloud, the impetus for computing stems from the necessity to efficiently manage, organize, search, and derive conclusions from vast datasets. In contrast, the predominant computing demand for mobile and embedded devices arises from the desire for more immersive media experiences and more natural, intelligent interactions with users and the surrounding environment. Although computational errors are generally undesirable, a common thread runs through this spectrum: these applications are not primarily concerned with computing precise numerical outputs. Instead, “correctness” is defined as generating results that are sufficiently accurate to deliver an acceptable user experience [4].
These applications inherently possess resilience to errors, meaning they can produce satisfactory outputs even when some of their computations are carried out in an approximate manner [5]. For instance, in search and recommendation systems, there is not always a single definitive or “golden” result; instead, multiple answers falling within a specific range are considered acceptable. Additionally, iterative applications processing extensive data sets may terminate before full convergence or employ heuristics [6]. In many Machine Learning (ML) applications, even if a golden result exists, the most advanced algorithms may not be able to achieve it; consequently, users often have to settle for reasonably inaccurate but still adequate results. Furthermore, applications such as multimedia, wireless communication, speech recognition, and data mining exhibit a degree of error tolerance. Human perceptual limitations mean that such errors may not significantly affect image, audio, and video processing applications. Another example pertains to applications dealing with noisy input data (e.g., image and sensor data processing, and speech recognition): the noise in the input naturally leads to imprecise results, and approximations have a similar impact. In simpler terms, applications that can handle noisy inputs can also withstand approximations [7,8,9]. Finally, some applications utilize computational patterns like aggregation or iterative refinement, which can mitigate or compensate for the effects of approximations.
By intentionally introducing controlled approximations, Approximate Computing (AxC) leverages the inherent resilience of these applications to improve energy efficiency and performance while aligning well with the evolving demands of diverse application domains. This makes AxC an encouraging approach to enhancing computing efficiency.
The concept of AxC encompasses a wide array of techniques that capitalize on the inherent error resilience of applications, ultimately leading to improved efficiency across all computing stack layers, ranging from the fundamental transistor-level design to software implementations. These techniques can have varying impacts on both the hardware and the output quality. AxC capitalizes on the existence of data and algorithms that can tolerate errors, as well as the limitations in the perception of end-users. It strategically balances accuracy against the potential for performance improvements or energy savings. In essence, it takes advantage of the gap between the level of accuracy that computer systems can provide and the level of accuracy required by the specific application or the end-users. This required accuracy is typically much lower than what the computer systems can deliver. The selective relaxation of accuracy allows for considerable gains in key parameters like power and performance, particularly within applications where exact correctness is secondary to operational efficiency.
Leveraging AxC involves addressing a few aspects and challenges. The first challenge is identifying the segments within the targeted software or hardware component that are candidates for approximation. Identifying segments of code or data that can be approximated may require a comprehensive understanding of the application on the part of the designer.
The second challenge is implementing the AxC technique to introduce approximations. On the one hand, there is a limit to the accuracy degradation that can be introduced if the output is to remain acceptable. On the other hand, the level of accuracy degradation and the performance improvements or energy savings vary depending on the selected AxC technique. Hence, available AxC techniques should be evaluated and compared to find the most suitable AxC technique tailored for a target application or design.
The next challenge is choosing suitable error measurement criteria, often tailored to the particular application, and executing the actual error assessment process to ensure that the output adheres to the predefined quality standards [5]. The error assessment usually involves simulating the precise and approximate versions of applications. However, alternative methods like Bayesian inference [10,11] or ML-based approaches [12] have been put forth in the scientific literature.
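For instance, a minimal sketch of such a simulation-based assessment might look as follows, assuming a hypothetical truncation-based approximate multiplier (the error model of any real design would differ); the Monte Carlo loop estimates the Mean Relative Error Distance (MRED) by comparing exact and approximate outputs over random inputs:

```python
import random

def exact_mul(x, y):
    return x * y

def approx_mul(x, y, cut=8):
    # Hypothetical approximation: drop the `cut` low-order bits of each
    # operand before multiplying, then shift the result back into range.
    return ((x >> cut) * (y >> cut)) << (2 * cut)

def mred(trials=100_000, bits=16):
    # Monte Carlo estimate of the Mean Relative Error Distance (MRED).
    total = 0.0
    for _ in range(trials):
        x = random.randrange(1, 1 << bits)
        y = random.randrange(1, 1 << bits)
        exact = exact_mul(x, y)
        total += abs(approx_mul(x, y) - exact) / exact
    return total / trials

print(f"Estimated MRED: {mred():.4%}")
```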
A Design Space Exploration (DSE) can be performed to address all the previously mentioned challenges. The goal of performing a DSE is to determine the optimal approximate configurations from those generated by applying a given set of approximation techniques to the design. Hence, DSE approaches can help systematically evaluate different approximate designs to choose the most suitable AxC techniques and, consequently, the best configurations for any given combination of AxC techniques. Early DSE approaches either combine multiple design objectives into a single-objective optimization problem or optimize a single parameter while keeping the remaining variables constant. More recent research has tackled circuit design issues by considering a Multi-objective Optimization Problem (MOP) to seek out Pareto-optimal approximate circuit configurations [13]. Regrettably, these approaches predominantly concentrate on simple systems, specifically arithmetic components like adders and multipliers, as they form the foundational components for more intricate designs [14].
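As a minimal illustration of what such a MOP-oriented DSE ultimately produces, the sketch below (with made-up objective values) filters a set of candidate configurations, each scored on error and energy, down to its Pareto front of non-dominated solutions:

```python
from typing import List, Tuple

# Each candidate approximate configuration is scored on two objectives,
# both to be minimized: output error and energy consumption.
Point = Tuple[float, float]

def dominates(a: Point, b: Point) -> bool:
    # a dominates b if it is no worse in every objective and strictly
    # better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates: List[Point]) -> List[Point]:
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Hypothetical (error, energy) evaluations of five approximate designs.
designs = [(0.01, 9.0), (0.02, 7.5), (0.05, 7.8), (0.08, 5.0), (0.20, 4.9)]
print(pareto_front(designs))  # keeps only the non-dominated trade-offs
```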
While several surveys have comprehensively explored DSE methods across domains like embedded systems and general-purpose computing, they often overlook the distinct challenges and considerations imposed by AxC. Approximate designs introduce new dimensions into the DSE process, as they require balancing accuracy with efficiency gains in energy and performance, tailored to the error resilience of specific applications. This survey focuses on DSE methodologies uniquely suited to the AxC paradigm, where selecting the optimal trade-offs involves not only performance and resource efficiency but also the assessment of acceptable error margins. By providing a dedicated review of DSE approaches applicable to approximate designs, this work fills a critical gap, offering insights not addressed in existing surveys and thereby supporting the design of next-generation systems that meet stringent energy and performance demands.
This paper aims to cover different DSE approaches leveraged in comparing approximate versions of a target application or design. The structure of this paper is as follows: Firstly, Section 2 provides a background on AxC techniques and DSE approaches. Then, Section 3 explains the search methodology used to find related studies and categorize them. In Section 4, DSE approaches to compare and choose suitable AxC techniques are reviewed and compared. Finally, a conclusion is provided in Section 5.
3. Literature Search Methodology
The objective of this survey was to systematically identify and classify existing literature on DSE methodologies proposed for finding the most suitable AxC techniques to be applied to a program or hardware design.
The keywords used for the search were carefully chosen based on common terminology found in influential works in the field. These included terms such as “Approximate Computing”, “Design Space Exploration”, “Multi-objective Optimization”, and “Approximate Hardware/Software”. Boolean operators and advanced filtering techniques were utilized to refine the search results in academic databases. This ensured a comprehensive yet focused set of papers covering various abstraction levels in both hardware and software implementations.
The selection criteria were established to maintain the relevance and quality of the review. We prioritized papers that provided detailed descriptions of DSE methodologies and excluded those that relied solely on exhaustive search methods. This approach aimed to highlight more sophisticated and efficient DSE approaches.
Once the relevant papers were identified, they were categorized based on the type of search algorithm employed for conducting the DSE. These categories included search algorithms such as ML, Evolutionary Algorithms (EAs), and custom algorithms. Papers were further sorted based on the target hardware for which the DSE was conducted, such as Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Central Processing Units (CPUs), and Graphics Processing Units (GPUs). Additionally, we considered the application domains, including image processing, scientific computing, and signal processing. The categorization process is shown in Figure 2.
For instance, some studies focus on FPGAs and ASICs, particularly in the context of designing accelerators using AxC techniques. Some other studies do not specify a particular target hardware, indicating their proposed methods can be applied universally across different platforms. Additionally, some papers address hardware- or software-level approximations for GPUs.
This survey also considers the application domains of the programs targeted for approximation. The application domains usually considered in most studies for selecting benchmarks include image processing, signal processing, scientific computing, financial analysis, Natural Language Processing (NLP), 3D gaming, and robotics.
Additionally, information about employed AxC techniques was extracted from each study to better compare different studies based on the AxC techniques applied at software, architectural, or hardware levels.
In a nutshell, this survey aimed to identify and evaluate the proposed DSE methods employed to explore the extensive design space of approximate versions of a design. The focus was on understanding whether these methods were well-known search algorithms or custom approaches. Following this structured and systematic methodology, the survey provides a comprehensive overview of the current state-of-the-art proposed DSE methodologies for finding the most suitable AxC techniques, highlighting the diversity of approaches and their applicability to various hardware platforms and application domains.
Figure 3 shows the aforementioned process of categorizing different studies.
4. Comparison and Analysis
This section provides an overview and comparison of the DSE approaches proposed in the literature for applying AxC techniques to programs or hardware designs. Though many different search algorithms have been proposed to explore the vast design space of approximate programs or hardware designs, two categories of algorithms are commonly leveraged: ML algorithms and EAs. ML approaches often leverage data-driven techniques to predict and explore optimal design configurations, while EAs use bio-inspired strategies such as Genetic Algorithms (GAs) to navigate the design space.
Table 2 provides information about the research works that took an ML approach to perform the DSE, while Table 3 includes information about the research works that leveraged EAs to perform the DSE. All the remaining research works that perform the DSE using other heuristic algorithms or combining different optimization algorithms are listed in Table 4. While Table 2, Table 3, Table 4 and Table 5 provide an overview to allow comparison among different studies based on the employed search algorithm, target hardware, and use case domain, Table 6, Table 7, Table 8 and Table 9 provide an overview of the same sets of studies to allow comparison based on the AxC techniques applied in each study.
4.1. DSE Using ML Algorithms
As reported in Table 2, the most popular ML algorithm is RL [37,50,54,55]. While authors in [51,52] use MBO and modified MCTS, respectively, authors in [12,53] mention using an ML-based search algorithm. Among these research works, though the target hardware varies from FPGAs and ASICs to general-purpose CPUs, the use-case domain always includes image and signal processing benchmarks, ranging from traditional image processing to image classification using NNs. Moving the comparison to the AxC techniques applied at different levels, as reported in Table 6, replacing exact adders and multipliers with approximate counterparts is the most common hardware-level approximation investigated [12,37,51,52,53]. However, the investigated software-level AxC techniques are noticeably application-specific: in [50,51], algorithm parameters (such as the number of iterations of a code basic block or the size of the inputs processed at each iteration) are decreased to reduce execution time or program memory while sacrificing output accuracy. A similar approach of loop perforation is applied in [54] alongside changing the input data structure (see the sketch below). Interestingly, in [55], an ML algorithm is employed to search the design space of an ML application, proposing a DSE framework to find the optimal quantization level for each layer of a DNN.
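As a hedged illustration of the loop perforation technique mentioned above (a generic sketch, not the implementation of any cited work), skipping a fraction of a reduction loop's iterations trades output accuracy for execution time:

```python
def perforated_mean(data, stride=1):
    # stride=1 executes the exact loop; stride=k visits only every k-th
    # iteration, so the loop body runs roughly k times less often.
    total, count = 0.0, 0
    for i in range(0, len(data), stride):
        total += data[i]
        count += 1
    return total / count

data = [float(i % 97) for i in range(100_000)]
exact = perforated_mean(data, stride=1)
approx = perforated_mean(data, stride=4)   # ~4x fewer iterations
print(abs(approx - exact) / exact)         # small relative error
```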
4.2. DSE Using EAs
Table 3 lists the research works that leveraged EAs to perform the DSE. Among the prominently used subsets of EAs, an ES algorithm is used only in [58]; otherwise, all approaches listed here use a GA or its multi-objective variant NSGA-II to explore the design space. More precisely, authors in [56,57,60] employ GA, authors in [58,59] use NSGA-II, and authors in [61] developed a NAS algorithm based on NSGA-II. Comparing the target hardware of the reviewed research works, most works consider optimizing an accelerator design for FPGA and ASIC implementation, as expected, while the research in [61] targets GPUs for optimizing CNN designs.
Comparing the benchmarks in Table 3 to those listed in Table 2, most of the benchmarks fall under the image processing category, though the types of benchmarks are slightly different. Comparing the applied AxC techniques, as reported in Table 7, in [56,57] authors investigate employing sparse LUTs, precision scaling, and approximate adders for a pixel-streaming pipeline application accelerated on an FPGA. Similarly, in [58,59] authors explore using approximate adders and multipliers for optimizing video and image compression accelerators. In [60], authors try to optimize benchmarks from different domains, such as scientific computing, 3D gaming, 3D image rendering, signal processing, and image processing, with the approximation applied at the software level by altering the program’s static instructions. Distinctively, in [61], authors propose approximating multipliers using LUTs and a customized approximate convolutional layer to support quantization-aware training of CNNs and dynamically explore the design space. It is noteworthy that the aim is to optimize a CNN design usually trained on a GPU; hence, approximation at the hardware level is not an option, while such an AxC technique can be emulated at the software level.
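To make the EA-based flow concrete, the following is a minimal, hedged GA sketch over a design space in which each of N_OPS approximable operators is assigned an approximation level; the evaluate() function is a synthetic stand-in, whereas a real flow would simulate or synthesize each configuration:

```python
import random

N_OPS, POP, GENS = 8, 20, 30   # operators per design, population size, generations

def evaluate(cfg):
    # Synthetic stand-in objectives: more aggressive approximation levels
    # add more error but save more energy.
    error = sum(level * level for level in cfg) / (9 * N_OPS)
    energy = 1.0 - sum(cfg) / (3 * N_OPS)
    return error, energy

def fitness(cfg, max_error=0.3):
    # Minimize energy subject to an accuracy constraint.
    error, energy = evaluate(cfg)
    return energy if error <= max_error else float("inf")

pop = [[random.randint(0, 3) for _ in range(N_OPS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness)
    parents = pop[: POP // 2]                      # elitist selection
    children = []
    while len(parents) + len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, N_OPS)           # one-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.2:                  # point mutation
            child[random.randrange(N_OPS)] = random.randint(0, 3)
        children.append(child)
    pop = parents + children

best = min(pop, key=fitness)
print(best, evaluate(best))
```

A multi-objective variant such as NSGA-II would replace the scalar fitness with non-dominated sorting and crowding distance, returning a whole Pareto front rather than a single constrained optimum.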
4.3. DSE Using Custom Algorithms
Table 4 reports a list of reviewed papers that rely on neither ML nor EAs to explore the design space. In [68], authors mention using a TS algorithm, with potential integration of GAs into the DSE framework. Notably, TS focuses on iteratively improving a single solution, whereas GAs work with a population of solutions and evolve them over generations using crossover, mutation, or other genetic operators. Hence, a TS approach might not be the best choice when the MOP does not have a single optimum solution and a Pareto front of non-dominated solutions represents the optima better (see the sketch below). In [70], authors select a GD approach to search the design space. Although GD is a widely used optimization technique in ML, it is not employed as part of an ML search algorithm in that work. All the remaining works in Table 4 employ custom algorithms. In some cases, the DSE includes multiple stages of exploration, where pruning techniques are used before applying the search algorithm to reduce the design space size, or after applying the search algorithm to refine the obtained solution sets.
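To make the contrast with GAs concrete, here is a minimal, hedged TS sketch over the same kind of configuration space as in the GA example above; the evaluate() stand-in is again synthetic and not taken from any cited work:

```python
from collections import deque

N_OPS = 8  # number of approximable operators in the design

def evaluate(cfg):
    # Synthetic stand-in: higher approximation levels add error, save energy.
    error = sum(level * level for level in cfg) / (9 * N_OPS)
    energy = 1.0 - sum(cfg) / (3 * N_OPS)
    return error, energy

def tabu_search(iters=200, tabu_len=15, max_error=0.3):
    current = tuple([0] * N_OPS)                 # start from the exact design
    best, best_energy = current, evaluate(current)[1]
    tabu = deque([current], maxlen=tabu_len)     # recently visited solutions
    for _ in range(iters):
        # Neighborhood: nudge one operator's approximation level by +/- 1.
        neighbors = set()
        for i in range(N_OPS):
            for delta in (-1, 1):
                cfg = list(current)
                cfg[i] = min(3, max(0, cfg[i] + delta))
                neighbors.add(tuple(cfg))
        feasible = [c for c in neighbors
                    if c not in tabu and evaluate(c)[0] <= max_error]
        if not feasible:
            break
        current = min(feasible, key=lambda c: evaluate(c)[1])
        tabu.append(current)
        if evaluate(current)[1] < best_energy:
            best, best_energy = current, evaluate(current)[1]
    return best, best_energy

print(tabu_search())
```

Because the search carries a single current solution, each run yields one point of the trade-off curve; obtaining a front would require repeated runs with different error budgets.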
Table 4 categorizes studies by target hardware, starting with those focused on FPGAs and ASICs and continuing with studies on optimized accelerator design.
Table 8 reports the AxC techniques applied in each study enlisted in Table 4. Similar to the other sets of studies presented in Table 2 and Table 3, only a few works reported in Table 4 are hardware-independent or target general-purpose CPUs and GPUs. The target hardware in [62] includes both FPGAs and ASICs; there, the DSE is performed to optimize a hardware accelerator design for a video processing application using approximate adders and logic blocks. Similarly, the target hardware in [63] includes both FPGAs and ASICs. In this study, the DSE is performed to optimize the design of DNNs accelerated using FPGAs and ASICs, while the applied AxC techniques are quantization techniques aimed at approximating the DNN design at the software level. In [64], authors perform the DSE with a heuristic search algorithm to optimize the hardware implementation of different functions used in a DNN vector accelerator. To apply approximation through logic isolation, the portions of logic in the circuit that consume significant power but contribute only minimally to output accuracy are identified; then, the DSE is performed to find the best trade-off between DNN classification accuracy and energy savings. It can be inferred that the target hardware falls into the ASIC category. In [65,66], the DSE is performed with custom algorithms, applying hardware approximations to hardware implementations of video and image processing benchmarks; the target hardware in these studies can also be categorized as ASIC. In [67,68], the authors propose to modify HLS tools to study the approximation effects.
Continuing through Table 4, in [69] the DSE is performed to optimize a hardware accelerator design, investigating both hardware-level and software-level approximation techniques. Three other works also perform the DSE to optimize accelerator designs, specifically for ML applications [70,71,72]. In [73], authors target a very different type of acceleration using NPUs. While using NPUs for acceleration can be categorized as applying approximation at the architectural level, the target hardware can be classified in the ASIC category. In [74], the target hardware is not explicitly mentioned; the proposed methodology applies to any DSE performed with general-purpose CPUs as the target hardware. In [75], authors perform the DSE to find the best configuration for their proposed hardware-level approximation technique, which is specific to GPUs. However, the approximation technique is also applied to some benchmarks executed on general-purpose CPUs to provide a fair comparison between the results obtained by performing the DSE for both hardware targets.
Comparing the use case domains across Table 4, image and signal processing are the prevalent categories of applications. Moreover, ML applications for image and text classification, pattern and speech recognition, and NLP tasks are considered in many works. Some works also target image compression tasks. Many works include matrix multiplication, DCT, FIR, and Sobel filters in their studies, as these functions are crucial for many image-processing tasks. Some works also consider benchmarks from financial analysis, robotics, 3D gaming, and scientific computing domains.
Considering the AxC techniques mentioned in Table 8, studies in [62,65,68,69,74] investigate using approximate adders and multipliers. In [64], authors propose applying a hardware-level AxC technique called logic isolation, using latches or AND/OR gates at the inputs, MUXes at the outputs, and power gating. In [66], authors propose applying another hardware-level AxC technique, clock gating, alongside precision reduction of primary inputs at the RTL level. Similarly, authors in [70] propose to apply a clock overgating technique. In [69], authors propose to use VOS alongside approximate adders and multipliers at the hardware level while also approximating the additions and multiplications at the software level. In [67], authors propose very different AxC techniques: Internal Signal Substitution and Bit-Level Optimization at the RTL level, Functional Unit Substitution (additions and multiplications) at the HLS level, and Source-Code Pruning Based on Profiling at the software level. Also, in [71,72], authors propose applying AxC techniques at multiple levels while designing an AI accelerator: they apply precision reduction to DNN data and use approximated versions of fundamental DNN functions such as activation functions, pooling, normalization, and data shuffling in the network accelerator design. Though also aiming at optimizing DNN designs, authors in [63] propose to apply AxC techniques at the software level to enable dynamic quantization of the DNN during the training phase. Interestingly, in [73], authors propose an AxC technique very different from those of all the other reviewed studies: approximating an entire program region by offloading it to an NPU accelerator. Another interesting AxC technique is proposed in [75] to tackle memory bottlenecks that arise when executing a program on a GPU and transferring data from the CPU to the GPU and vice versa.
4.4. DSE of Approximate Functions Design
Some studies in the literature propose approaches to efficiently explore the design space for approximate logic synthesis and consider approximate versions of circuits generated by approximating selected portions (or sub-functions) of Boolean networks. These studies are reported separately in Table 5, while the AxC techniques they apply are reported in Table 9. The approximation is applied at the hardware level and involves logic falsification in [76,78]. The approximation technique in [77] is based on Boolean network simplifications allowed by EXDCs, the approximation in [79] is based on BMF for truth tables, and in [80], a customized approximation of Boolean networks is applied. The search algorithm used to explore the design space is NSGA-II in [76,78,80], while in [77,79] authors employ customized heuristic algorithms. While the benchmarks for all of these studies include well-known approximate adders and multipliers from the literature, other circuits such as ALUs, decoders, shifters, and multiple combinational circuits have also been employed as benchmarks. Interestingly, the study in [78] targets safety-critical applications: a Quadruple Approximate Modular Redundancy (QAMR) approach is proposed as an alternative to Triple Modular Redundancy (TMR), in which all modules are exact circuits.
4.5. Evaluated Parameters in DSE
While Table 2, Table 3, Table 4 and Table 5 provide an overview to allow comparison among different studies based on the employed search algorithm, target hardware, and use case domain, Table 10, Table 11, Table 12 and Table 13 provide an overview of the same sets of studies to allow comparison based on the evaluated parameters involved in the trade-off imposed by approximation.
Since AxC trades off accuracy for performance and energy efficiency, the first important parameter to evaluate during DSE is accuracy. Depending on the approximation goals, parameters measured during DSE in different studies may vary.
Predictably, power consumption is a key parameter frequently targeted in the reviewed studies, as it directly impacts energy efficiency. However, many studies choose to measure energy consumption instead of power consumption. This choice is reasonable because energy is the product of power and time: by measuring energy directly, these studies capture the combined impact of power reduction and execution time, providing a comprehensive view of the efficiency gains achieved through AxC techniques.
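As a simple worked example of this point (the numbers are illustrative, not taken from any reviewed study), a design whose approximation cuts average power by 25% and execution time by 20% achieves a 40% energy saving:

\[
E = P \cdot t:\qquad E_{\mathrm{exact}} = 2\,\mathrm{W} \times 10\,\mathrm{s} = 20\,\mathrm{J}, \qquad E_{\mathrm{approx}} = 1.5\,\mathrm{W} \times 8\,\mathrm{s} = 12\,\mathrm{J},
\]

a gain that neither the power figure nor the timing figure captures on its own.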
The second most in-demand parameter, especially when designing accelerators, is circuit area. Understandably, when approximations are applied to optimize a design, particularly one deployed on an FPGA, reducing area utilization (e.g., LUT count) is one of the approximation goals.
After area, performance and execution time are the most commonly measured parameters. In applications such as Artificial Neural Networks (ANNs), where execution time is inherently high, one of the primary goals of applying approximation is to reduce this execution time, particularly for inference and, when feasible, training. The lengthy execution times of these applications also directly impact the DSE time, as evaluating even a few approximate instances can become highly time-consuming. While typical application execution times may range from seconds to minutes, the DSE time needed to explore and evaluate possible approximations often extends to hours or days. In the case of ANNs, the execution time for inference alone can take hours, and the DSE time required to assess even a limited number of approximate instances can span several days. Therefore, in applications where the execution time is already considerable and hence a primary target to trade-off with accuracy, proposing DSE methodologies that can assess more approximate instances in a reasonable time becomes crucial.
Memory utilization is often the least frequently evaluated parameter in the reviewed studies. Many AxC techniques are primarily applied to optimize execution time, energy, or performance rather than specifically targeting memory utilization. However, these techniques can still impact memory utilization: techniques aimed at reducing execution time or energy consumption, or at improving performance, may also affect memory usage as a secondary outcome. This indirect influence on memory is an important consideration, even though it is not the primary focus of these techniques. For example, in [50], authors explore a design space comprised of approximate versions of an iris scanning pipeline. The approximation includes reducing the search window size and the region of interest in iris images, reducing the parameters of iris segmentation, and reducing the kernel size of the filter. Though the main target is to reduce program execution time, the memory needed to store the intermediate and final output images and program parameters is also reduced. In [75], authors propose an AxC technique to mitigate the bottlenecks of limited off-chip bandwidth and long access latency when data are transferred from the CPU to the GPU and back: when a cache miss happens, RFVP predicts the requested values. In this case, the main goal of approximation is to reduce off-chip memory bandwidth consumption, while speedups and energy reductions are also reported.
Throughout Table 10, Table 11, Table 12 and Table 13, besides the accuracy column, there is an error metric(s) column that reports the error metric(s) each study uses to measure the accuracy degradation caused by approximation. Among all the parameters mentioned (power consumption, execution time, performance, memory utilization, and circuit area), accuracy is unique because the metrics used to measure its degradation are often more complex and application-specific. For example, while power consumption differences are reported simply as the ED between the measurements from the approximate and exact versions, accuracy degradation is quantified with a variety of sophisticated measures tailored to the specific application domain. Table 1 lists the most popular error metrics.
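For reference, three of the most widely used of these metrics are defined below, where \(y_i\) denotes the exact output, \(\hat{y}_i\) the approximate one, \(n\) the number of samples, MSE the mean squared error, and MAX the peak signal value:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad
\mathrm{MRED} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{\left|y_i\right|}, \qquad
\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}.
\]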
Finally, every proposed DSE approach results in a solution or a set of optimal solutions for the MOP. In some cases, a single optimal solution exists; in many others, no global optimum can be found, and a Pareto front (a set of non-dominated solutions) is presented. The last column in Table 10, Table 11, Table 12 and Table 13 indicates the studies that reported a Pareto front as the result of the DSE performed, or at least compared a set of solutions resulting from the proposed DSE approach with a Pareto front obtained by exhaustive search or other methods. In most cases, the obtained Pareto front shows a trade-off between accuracy on the one hand and an evaluated parameter, such as energy efficiency, on the other [12,50,51,52,54,56,57,58,59,61,65,66,67,68,69,76,77,78,79,80].
The rest of the reviewed studies that did not obtain a Pareto front but provided other analysis methods for comparing the DSE results are considered hereafter.
In some studies, one or more thresholds for the acceptable accuracy degradation were set, and the DSE was then performed for each accuracy threshold. For example, in [53], a solution is provided for each accuracy threshold. In [62], performance is plotted for different accelerator designs, but no Pareto front is provided. Also, in [64], three different DNN accuracy thresholds were set and the DSE was performed for each threshold; hence, the plots show the energy reductions for each DNN accuracy threshold instead of a Pareto front. Similarly, in [70], the plots show the energy reductions for each DNN accuracy threshold instead of a Pareto front.
In [55], two application-specific error metrics, called State of Relative Accuracy and State of Quantization, were proposed to evaluate the accuracy of DNN quantization, and the quantization space Pareto frontier is plotted for these two error metrics.
In [37], an RL approach was selected for performing the DSE, and the exploration steps are plotted for the evaluated parameters, including accuracy; however, a Pareto front is not obtained. In [60], plots show the accuracy and energy against multiple thresholds for the number of program instructions to be approximated, but a comparison to the Pareto front is not provided.
In [63], quantization is applied dynamically during training and inference of the DNN. Therefore, the plots show the changes in DNN accuracy with respect to the number of MACs used in the computations. In the same plot, the results are compared with other quantization-aware approaches in the literature instead of with a Pareto front obtained by other DSE methods. Since, in dynamic approximation of DNNs, the changes in accuracy during training or inference are more representative of the approach’s effectiveness, plotting a Pareto front seems unnecessary.
In [71] and the previous studies with the same framework [72], no Pareto front is presented; instead, for each DNN, compute efficiency, training throughput, and inference latency are reported. In [73], an NPU is employed as an accelerator for a frequently executed region of code or function, approximating that function by replacing it with a neural network. Since an ANN is employed, similar to other works on ANNs, multiple thresholds for function quality loss (in other words, different ANN accuracy levels) were investigated; hence, the speedup and energy reduction were plotted for multiple thresholds of function quality loss, and consequently no Pareto front was demonstrated.
4.6. DSE Methodologies Comparison by Use Case Domain
In this subsection, we categorize the reviewed studies based on their general use case domains, such as image processing, ML, signal processing, and scientific computing. For each use case domain, the studies are compared based on metrics used in DSE, such as accuracy, power savings, execution time, and area utilization.
4.6.1. Image Processing Applications
Several studies applied DSE methodologies to optimize image processing applications, balancing accuracy with power and area savings.
Hashemi et al. [50] focused on iris scanning applications, using RL to reduce filter kernel sizes. This study achieved notable energy savings and area utilization improvements while maintaining acceptable accuracy, measured by the HD between images.
Ullah et al. [51] used MBO to approximate Gaussian blur filters. The study reported a reduced LUT count and power savings, with output accuracy evaluated using MAE.
Mrazek et al. [12] applied approximate multipliers and adders to Sobel and Gaussian blur filters, achieving significant energy savings with minimal SSIM loss.
Rajput et al. [52] employed AI-based heuristics to optimize image processing tasks such as RGB2gray and Gaussian blur filters. The study reported accuracy levels evaluated with MRED and PSNR while achieving area and energy savings.
Awais et al. [53] optimized image processing applications, including RGB2gray and FIR filters, by applying approximate adders and multipliers, achieving power savings while maintaining acceptable accuracy levels.
Manuel and Kreddig [56,57] used EAs to optimize a pixel-streaming pipeline for image processing, achieving power savings with limited color differences, measured using the CIELAB metric.
Barbareschi et al. [59] applied NSGA-II for JPEG compression optimization, balancing power, area, and image quality, evaluated using the MSSIM and DSSIM metrics.
Savino et al. [74] used custom algorithms for optimizing image processing tasks (matrix multiplication and FIR filters), reporting reductions in area and power with minimal accuracy degradation.
Overall, these studies reveal that different DSE approaches have been successfully applied to various image processing tasks. Most studies show that approximate hardware components, such as adders and multipliers, provide significant power savings with minimal accuracy loss. While some methods, like those used by Hashemi et al. [50] and Mrazek et al. [12], focus on energy efficiency, others, like those of Rajput et al. [52] and Barbareschi et al. [59], emphasize balancing area utilization with accuracy and energy consumption. However, the wide range of objectives and varying use cases across these studies makes it difficult to perform a unified performance comparison beyond these metrics.
4.6.2. ML Applications
The reviewed studies also applied DSE methods to optimize Machine Learning models, particularly Deep Neural Networks (DNNs).
Elthakeb et al. [55] used RL for the quantization of CNN layers, achieving accuracy improvements and energy savings by adjusting quantization levels dynamically.
Pinos et al. [61] applied NAS algorithms for optimizing CNN models like MobileNetV2 and ResNet50V2, reporting energy savings during inference with minimal accuracy degradation.
Fu et al. [63] focused on DNN training, using dynamic fractional quantization to optimize training energy and latency while maintaining classification accuracy across multiple ResNet models.
Venkataramani et al. [71,72] applied precision reduction techniques in AI accelerators for DNNs like VGG16 and BERT, achieving inference latency and energy reductions with minor accuracy trade-offs.
All studies in this category target improving energy efficiency while maintaining acceptable accuracy levels for Machine Learning applications. While Elthakeb et al. [55] and Pinos et al. [61] focus on optimizing DNN quantization and inference energy, Fu et al. [63] specifically optimize training energy and latency, and Venkataramani et al. [71,72] focus on overall system-level reductions in inference latency and compute efficiency. The primary challenge in comparing these methods lies in their varying objectives, such as training versus inference optimization. However, all studies demonstrate that precision reduction and quantization techniques are highly effective in balancing accuracy and energy consumption in Machine Learning applications.
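As a hedged sketch of the quantization knob these works tune (a generic emulation, not the exact schemes of the cited studies), uniform symmetric quantization of a layer's weights can be modeled in software, with the per-layer bit-width being the parameter a DSE would explore:

```python
import numpy as np

def quantize(weights, bits):
    # Uniform symmetric quantization: map weights onto the signed grid of
    # 2^(bits-1) - 1 levels, then de-quantize back for error analysis.
    scale = np.abs(weights).max() / (2 ** (bits - 1) - 1)
    return np.round(weights / scale) * scale

rng = np.random.default_rng(0)
layer = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weights
for bits in (8, 6, 4, 2):
    mae = np.abs(quantize(layer, bits) - layer).mean()
    print(f"{bits}-bit weights: MAE = {mae:.5f}")
```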
4.6.3. Signal Processing Applications
Signal processing tasks require optimization in terms of power consumption, performance, and accuracy, which several studies achieved by applying hardware AxC techniques.
Mrazek et al. [12] used approximate multipliers and adders in Sobel and Gaussian blur filters, achieving significant power savings and minimal SSIM loss.
Rajput et al. [52] employed AI-based heuristics to optimize signal processing tasks like FIR filters, showing a trade-off between area savings and output accuracy, measured using MRED and PSNR.
Saeedi et al. [37] applied RL to optimize FIR filters and matrix multiplication, achieving power savings with acceptable MAE levels.
Alan et al. [66] used custom algorithms to optimize Sobel and Gaussian blur filters, achieving power and area savings by applying clock gating and precision reduction techniques.
In signal processing applications, the main focus is on achieving power and area savings while maintaining output accuracy. While Mrazek et al. [12] and Alan et al. [66] emphasize hardware-level optimizations, such as approximate multipliers and clock gating, Rajput et al. [52] and Saeedi et al. [37] focus on trade-offs between accuracy and energy savings. Despite these differences, all studies demonstrate that approximate computing techniques are effective in optimizing power consumption with minimal impact on accuracy.
4.6.4. Scientific Computing Applications
Scientific computing tasks often involve balancing execution time, energy efficiency, and accuracy. Several studies focused on optimizing scientific computing applications using DSE techniques.
Park et al. [60] applied genetic algorithms to scientific computing tasks such as FFT and Successive Over-Relaxation (SOR), achieving energy reductions with slight accuracy trade-offs, measured using normalized difference.
Yazdanbakhsh et al. [75] applied custom DSE techniques to optimize tasks on GPUs, showing improvements in speedup and energy savings while balancing accuracy and memory bandwidth.
Mahajan et al. [73] applied NPUs as accelerators to scientific tasks such as FFT, reporting significant energy reductions with acceptable accuracy, measured using miss rate and image difference.
In scientific computing applications, the primary goal is to reduce execution time and energy consumption while maintaining accuracy. Although the studies employ different techniques, such as genetic algorithms in Park et al. [60], GPU optimization in Yazdanbakhsh et al. [75], and NPU acceleration in Mahajan et al. [73], they all report improvements in speedup and energy savings. Due to the variety of techniques and the specific application focus of each study, further direct comparison between the studies is challenging.
4.6.5. Video and Audio Processing Applications
Several studies targeted video processing applications, with a focus on optimizing power, area, and accuracy.
Hoffmann et al. [54] applied RL for video encoding tasks like x264, achieving energy savings through techniques like loop perforation while maintaining acceptable accuracy levels.
Prabakaran et al. [58] used NSGA-II to optimize HEVC video processing tasks, showing trade-offs between area, power, and accuracy, evaluated using PSNR.
Shafique et al. [62] applied custom DSE methods to optimize HEVC processing, achieving power savings through approximate adders with minimal quality loss, measured by BER.
The reviewed studies on video and audio processing show that DSE methodologies focus on balancing power and area savings with maintaining acceptable accuracy or quality metrics like PSNR and BER. Hoffmann et al. [54] applied more general techniques like loop perforation, whereas Prabakaran et al. [58] and Shafique et al. [62] used hardware-specific optimizations. Though they all focus on video encoding and compression tasks, the diversity in techniques and objectives limits the potential for direct comparison beyond energy savings and accuracy metrics.
4.6.6. Robotics, Financial Analysis, and Other Applications
A few studies focused on specialized domains such as robotics and financial analysis.
Mahajan et al. [73] applied NPU accelerators in robotics and financial analysis applications, achieving energy reductions while maintaining acceptable accuracy levels, as measured by metrics such as MRED and miss rate.
Hoffmann et al. [54] applied RL to financial analysis tasks, such as Swaptions, showing significant energy savings while balancing accuracy, measured by Swaption price.
For robotics and financial analysis applications, the reviewed studies focus primarily on energy savings, using techniques such as NPU acceleration (Mahajan et al. [73]) and RL-based optimization (Hoffmann et al. [54]). While these studies achieve notable results, the specificity of their application domains and techniques limits further comparison.
In conclusion, while each study employs DSE methodologies to optimize approximate designs based on metrics like output accuracy, power savings, and execution time, comparing them even within the same use case domain is challenging due to differences in target hardware, benchmarks, AxC techniques, and optimization goals. For example, some of the reviewed studies focused on image processing applications, such as Sobel filters and Gaussian blur, and employed approximate hardware to achieve power savings. In contrast, some studies targeting Machine Learning applications, such as Neural Networks for image classification (e.g., ResNet or MobileNet models), employ approximate computing techniques like quantization to balance energy consumption and accuracy. Despite shared metrics, the diversity in techniques and objectives, such as reducing neural network inference latency on FPGAs versus optimizing power consumption on ASICs, makes direct comparisons difficult. Instead, most studies highlight how their proposed DSE methods can help identify the most suitable approximate designs within specific domains, optimizing metrics according to the unique needs of each target software or hardware.
4.7. Comparison of the Reviewed DSE Approaches by Frequent Categories
Figure 4 presents the distribution of the reviewed studies based on the search algorithms employed to perform DSE for approximate designs. The results reveal that 48.48% of the studies opted for custom algorithms, underscoring that nearly half of the approaches rely on application-specific methods, potentially due to the need for pruning large design spaces and addressing particular hardware or software constraints. This significant portion suggests that custom algorithms remain popular for solving highly specialized DSE problems in the AxC domain.
On the other hand, 27.27% of the studies employed EAs, and 24.24% utilized ML-based methods. While the data indicate that roughly a quarter of the surveyed works leveraged ML techniques such as RL, EAs are slightly more popular, likely due to their proven flexibility and scalability when dealing with a wide variety of design spaces and their robustness in handling complex MOPs.
Figure 5 shows the distribution of the reviewed studies based on the target hardware of each study. The results indicate that a significant proportion of the studies, 36.36%, targeted ASICs. This reflects the preference of the AxC domain for ASICs, as they allow for highly customized and efficient hardware designs, making them a favorable target for employing AxC techniques. The second largest group of studies, comprising 21.21%, targeted both FPGAs and ASICs, suggesting that many DSE methodologies are designed to be versatile enough to optimize for both reconfigurable and dedicated hardware, which enables flexibility depending on the specific design requirements.
On the other hand, 18.18% of the studies focused solely on FPGAs. This percentage highlights the importance of FPGAs in approximate computing, especially in cases where reconfigurability is critical, such as during iterative design processes or for applications requiring adaptable precision levels. Furthermore, 12.12% of the studies targeted general-purpose CPUs, indicating that even though some studies did not include hardware-specific optimizations, general-purpose processors are still valuable in certain contexts, particularly for applying software-level AxC techniques.
Interestingly, only 6.06% of the studies targeted GPUs, showing that while GPUs are effective for parallel processing, they are not used as frequently as ASICs or FPGAs for applying AxC techniques. The last three columns reflect multi-platform approaches, where the target hardware spans multiple categories, such as FPGAs, ASICs, and general-purpose CPUs, each making up 3.03% of the reviewed studies. This indicates that some DSE approaches aim to be flexible and adaptable to various hardware platforms, which may be driven by the need for multi-objective optimization across different computing systems.
Figure 6 presents the distribution of the reviewed studies based on the use case domains of the case study benchmarks used in each reviewed paper for conducting experiments. It is important to note that most studies examined benchmarks from multiple domains; hence, each such study is counted under every use case domain it covers.
As the total number of studies for each column shows, image processing tasks, such as Sobel and Gaussian blur filters, are the most popular among the reviewed studies. ML applications are also popular. The three columns labeled “DNNs for Image Classification and Pattern Recognition”, “Machine Learning”, and “Natural Language Processing” are all considered ML applications; the column labeled “Machine Learning” counts the case studies that cannot be categorized under the other two columns. Signal Processing, especially Digital Signal Processing (DSP), and Video Processing (especially HEVC) applications are, in that order, the next most commonly used applications in the reviewed studies.
To conclude, the distribution of use case domains demonstrates a clear preference for image processing and Machine Learning tasks as key benchmarks for applying AxC techniques. This trend can be attributed to the inherent tolerance of these domains to approximation, where minor accuracy losses are often acceptable in exchange for significant gains in power and resource savings. Furthermore, signal processing and video processing applications also emerge as prominent areas, reflecting their suitability for applying approximations in embedded systems and real-time processing environments. The diverse set of use case domains illustrates the adaptability of AxC techniques across various computational tasks, emphasizing the versatility and growing importance of DSE methodologies in optimizing a wide range of applications.
Figure 6 also illustrates the distribution of search algorithms employed for performing the DSE for various use case domains. These search algorithms are categorized into three main classes: “Machine Learning”, “Evolutionary Algorithm”, and “Custom”. Custom algorithms are the most commonly used, especially for image processing, signal processing, and video processing tasks. This popularity may be due to the need for application-specific optimizations and pruning techniques in large design spaces. The same trend is observable for image classification and pattern recognition tasks.
On the other hand, EAs are applied across domains like image processing, video processing, and scientific computing, owing to their robustness in solving MOPs. ML-based DSE approaches are mostly utilized in image processing and signal processing tasks. Interestingly, in these domains, ML and Custom algorithms were employed almost equally to perform the DSE, while EAs were less employed. This trend may be attributed to the fact that ML techniques, such as RL and NAS, excel at dynamically optimizing design trade-offs in structured, data-rich environments like image and signal processing. Additionally, these domains often require real-time processing or adaptive techniques, which ML algorithms are well-suited for. Custom algorithms, on the other hand, are favored for their flexibility in handling domain-specific constraints and heuristics that are tailored to the problem. However, EAs, despite their proven multi-objective optimization capabilities, might be less efficient in environments where specific, fast-converging solutions are necessary due to time or resource constraints.
Overall, while custom algorithms dominate most tasks, ML and EAs show strong adaptability, each offering unique advantages based on the complexity and nature of the design space being explored.
5. Conclusions
This survey systematically reviewed and classified existing literature on DSE methodologies aimed at identifying suitable AxC techniques for various applications and hardware designs. The search strategy focused on papers that provided detailed descriptions of their DSE algorithms, deliberately excluding those utilizing exhaustive search methods to highlight more sophisticated and efficient approaches.
Two dominant categories of DSE methods emerged from the reviewed studies: ML approaches and EA methods. While both methodologies have strengths, the relative advantages depend largely on the complexity of the design space and the target application.
The ML approaches, especially those leveraging RL, are highly effective in domains like image processing and DNNs, where the design space is structured and substantial data from previous executions are available. These approaches excel at dynamically fine-tuning trade-offs between power and accuracy, making them suitable for applications requiring frequent reconfigurations or those with domain-specific training data [50,55]. However, ML-based DSE methods tend to be more application-specific and less adaptable to broader hardware platforms or large, complex design spaces [51,52].
In contrast, EA-based methods, particularly those utilizing GAs and NSGA-II, offer greater flexibility and scalability. These methods are especially suited for complex design spaces like FPGAs, ASICs, and other hardware-specific optimizations. GAs have been proven to deliver robust, Pareto-optimal solutions across a variety of applications, including image and video processing, where they consistently outperform other approaches in balancing circuit area, power consumption, and execution time [56,59]. Moreover, NSGA-II has shown particular promise in optimizing DNNs and delivering effective performance-power trade-offs in NN accelerator designs [61].
In conclusion, while ML-based DSE approaches offer fine-tuned control and are highly effective in specialized applications, EA-based DSE methods, especially GAs, are better suited for complex and large-scale design spaces that require general-purpose optimization. Future research may combine the adaptability of EA-based DSE approaches with the fine-grained control of ML-based methods, creating hybrid models that can optimize across diverse hardware environments and application domains.
Furthermore, a persistent challenge is the increasing complexity posed by heterogeneous systems, where different hardware platforms such as FPGAs, ASICs, and GPUs introduce unique constraints. Addressing this challenge will require the development of generalizable DSE methodologies that can operate across multiple hardware platforms, an area that necessitates further research.
Additionally, as discussed in Section 4, several studies have successfully integrated pruning techniques to reduce the complexity and size of the design space, leading to more efficient exploration. Nonetheless, future work must focus on reducing the computational overhead of performing DSE itself, particularly for large-scale applications such as Deep Neural Networks, where DSE can be time-intensive and resource-demanding. By developing more efficient exploration techniques and expanding generalizability across hardware platforms, future research can help overcome the current limitations of DSE methodologies employed for AxC systems.
In addition to discussing the strengths of various ML- and EA-based DSE methods, it is important to discuss the computational challenges and resource demands associated with performing DSE for approximate designs. When performing DSE for approximate designs, computational complexity and hardware resource requirements vary significantly, depending on the scope of the exploration and the chosen methods. Some studies face challenges related to time, memory utilization, and processing power, especially when exploring large design spaces with numerous approximate versions of hardware or software designs. However, not all reviewed studies provide detailed reports on these parameters, making direct comparisons difficult.
Where exploration time or resource usage is discussed, the computational burden is influenced by factors such as the type of search algorithm, the design space size, and the target hardware. For instance, DSE targeting FPGA designs may differ in complexity and time requirements compared with DSE targeting ASICs or GPUs, even when similar methodologies are used.
Several studies employ pruning techniques to reduce the design space size, which helps facilitate the exploration process and lowers the computational demands. Such strategies are particularly prominent in studies using custom algorithms, as discussed in Section 4.3. Pruning can significantly reduce exploration time by focusing on the most promising candidate designs.
However, due to inconsistent reporting on exploration time and hardware resource consumption across different studies, drawing broad conclusions about the computational costs of different DSE methods remains challenging. This gap presents an opportunity for future work to provide more detailed comparisons of resource requirements across DSE methodologies.