Search Results (122)

Search Parameters:
Keywords = quantized CNN

33 pages, 3144 KiB  
Article
CNN-Based Optimization for Fish Species Classification: Tackling Environmental Variability, Class Imbalance, and Real-Time Constraints
by Amirhosein Mohammadisabet, Raza Hasan, Vishal Dattana, Salman Mahmood and Saqib Hussain
Information 2025, 16(2), 154; https://doi.org/10.3390/info16020154 - 19 Feb 2025
Abstract
Automated fish species classification is essential for marine biodiversity monitoring, fisheries management, and ecological research. However, challenges such as environmental variability, class imbalance, and computational demands hinder the development of robust classification models. This study investigates the effectiveness of convolutional neural network (CNN)-based models and hybrid approaches to address these challenges. Eight CNN architectures, including DenseNet121, MobileNetV2, and Xception, were compared alongside traditional classifiers like support vector machines (SVMs) and random forest. DenseNet121 achieved the highest accuracy (90.2%), leveraging its superior feature extraction and generalization capabilities, while MobileNetV2 balanced accuracy (83.57%) with computational efficiency, processing images in 0.07 s, making it ideal for real-time deployment. Advanced preprocessing techniques, such as data augmentation, turbidity simulation, and transfer learning, were employed to enhance dataset robustness and address class imbalance. Hybrid models combining CNNs with traditional classifiers achieved intermediate accuracy with improved interpretability. Optimization techniques, including pruning and quantization, reduced model size by 73.7%, enabling real-time deployment on resource-constrained devices. Grad-CAM visualizations further enhanced interpretability by identifying key image regions influencing predictions. This study highlights the potential of CNN-based models for scalable, interpretable fish species classification, offering actionable insights for sustainable fisheries management and biodiversity conservation.
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
Show Figures

Graphical abstract
Figure 1. Framework for the research methodology in fish species classification.
Figure 2. Example of augmented images.
Figure 3. Confusion matrix for DenseNet121 performance.
Figure 4. Training and validation loss for DenseNet121.
Figure 5. Grad-CAM heatmap analysis for a single class.
Figure 6. Comparative Grad-CAM visualizations across multiple classes.
Figure 7. Turbidity simulation results and model predictions.
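To make the Grad-CAM step concrete, here is a minimal Keras sketch of the technique (not the authors' code); the DenseNet121 layer name, untrained weights, and zero-image input are illustrative assumptions:

```python
# Grad-CAM: weight each feature map by the pooled gradient of the class score.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Return a heatmap of the regions that drive `class_index`."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])  # add batch dimension
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)              # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # global-average the grads
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                            # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

model = tf.keras.applications.DenseNet121(weights=None)   # untrained stand-in
heatmap = grad_cam(model, np.zeros((224, 224, 3), np.float32),
                   "conv5_block16_concat", class_index=0)  # assumed layer name
```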
29 pages, 6669 KiB  
Article
Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection
by Vessela Krasteva, Todor Stoyanov and Irena Jekova
Appl. Sci. 2025, 15(4), 1965; https://doi.org/10.3390/app15041965 - 13 Feb 2025
Viewed by 345
Abstract
GPU-based deep neural networks (DNNs) are powerful for electrocardiogram (ECG) processing and rhythm classification. Although questions often arise about their practical application in embedded systems with low computational resources, few studies have investigated the associated challenges. This study aims to show a useful workflow for deploying a pre-trained DNN model from a GPU-based development platform to two popular ARM-based microcontrollers: Raspberry Pi 4 and ARM Cortex-M7. Specifically, a five-layer convolutional neural network pre-trained in TensorFlow (TF) for the detection of ventricular fibrillation is converted to Lite Runtime (LiteRT) format and subjected to post-training quantization to reduce model size and computational complexity. Using a test dataset of 7482 10 s cardiac arrest ECGs, the inference of the LiteRT DNN on Raspberry Pi 4 takes about 1 ms with a sensitivity of 98.6% and specificity of 99.5%, reproducing the TF DNN performance. An optimization study with 1300 representative datasets (RDSs), including 10 to 4000 calibration ECG signals selected by random, rhythm-based, or amplitude-based criteria, showed that choosing a random RDS with a relatively small size of 80 resulted in a quantized integer LiteRT DNN with minimal quantization error. The inference of both non-quantized and quantized LiteRT DNNs on a low-resource ARM Cortex-M7 microcontroller (STM32F7) shows a rhythm accuracy deviation of <0.4%. Quantization reduces internal computation latency from 4.8 s to 0.6 s, flash memory usage from 40 kB to 20 kB, and energy consumption by 7.85 times. This study ensures that DNN models retain their functionality while being optimized for real-time execution on resource-constrained hardware, demonstrating application in automated external defibrillators.
Show Figures

Figure 1. Flowchart of the methodological steps, showing the conversion and implementation of DNN models on microcontroller boards. OS: operating system; IDE: integrated development environment; DNN: deep neural network; TF: TensorFlow; LiteRT: Lite Runtime; ARM: Acorn RISC (reduced instruction set computer) machine; GPU: graphical processing unit.
Figure 2. Architecture of the trained TF DNN model for detection of ventricular fibrillation, as selected from the optimization study [67]. The input layer processes a 10 s single-lead ECG signal sampled at 125 Hz, while the output layer consists of a single neuron presenting the probability of ventricular fibrillation (pVF). The hidden-layer feature map dimensions are shaped by 1D convolution with valid padding, influenced by the kernel size. The model has 7521 parameters in total.
Figure 3. Flowchart of the conversion of LiteRT DNN models with and without post-training quantization. The quantization methods are described according to [77].
Figure 4. Test interface of ARM-based microcontroller boards (slave) connected to the GPU-based workstation test platform (master). ECG: electrocardiogram; I/O: input/output; IDE: integrated development environment; USB: Universal Serial Bus; USART: universal synchronous/asynchronous receiver transmitter; CPU: central processing unit; DMA: direct memory access; SDRAM: synchronous dynamic random-access memory; SRAM: static random-access memory; LiteRT DNN: Lite Runtime deep neural network.
Figure 5. Optimization and test workflow of the DNN models on the target platforms. DynQ: dynamic range quantization; IntQ: integer quantization; Full-IntQ: full-integer quantization; RDS: representative dataset; MAE(pVF): mean absolute error of the probability for detection of ventricular fibrillation.
Figure 6. Measured post-training quantization times for the integer (IntQ) and full-integer (Full-IntQ) quantized LiteRT models as a function of the representative dataset size (RDS = 10, 20, 40, 80, 120, 160, 200, 250, 300, 350, 400, 500, 1000, and 4306 ECG recordings) selected from the calibration dataset.
Figure 7. Examples of representative datasets (RDSs), each including 10 ECG recordings from the calibration dataset, selected by the three defined RDS selection strategies: random selection (1st column), rhythm-based selection (2nd column), and amplitude-based selection (3rd column). VF: ventricular fibrillation; NSR: normal sinus rhythm; ONR: other non-shockable rhythm; ASYS: asystole; RMS: root mean square value of the ECG amplitude.
Figure 8. Statistical analysis of MAE(pVF) as a function of the representative dataset (RDS) for two types of quantized LiteRT DNN models: IntQ (top) and Full-IntQ (bottom). MAE(pVF) density distributions (violin plots) are calculated on the validation database for 100 LiteRT DNN models quantized with each RDS size (13 sizes, from 10 to 1000 training ECG recordings) and three RDS selection methods (random, rhythm-based, and amplitude-based). The validation MAE(pVF) for a LiteRT DNN model quantized on the total calibration database (4306 ECG recordings) is provided as a reference (right).
Figure 9. Comparison of the quantized LiteRT DNN models (IntQ on the left, Full-IntQ on the right) presenting the lowest mean absolute error MAE(pVF) on the validation database in each violin plot of Figure 8. The selected best quantized models (RDS = 80 ECG recordings by random selection) are highlighted.
Figure 10. pVF outputs of the TF DNN (reference) vs. the quantized LiteRT DNN (IntQ on top, Full-IntQ on bottom) on the test dataset, illustrating the performance results in Table 4 for a pVF threshold of 0.5. Observation points are presented as scatter plots and histograms for four rhythms: normal sinus rhythm, other non-shockable rhythms, asystole, and ventricular fibrillation.
Figure 11. Examples of 10 s ECG strips with opposite classification outcomes for the TF DNN (reference) vs. the quantized LiteRT DNN (IntQ on the left, Full-IntQ on the right) based on a decision threshold of pVF = 0.5. The cases are highlighted in Figure 10.
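The conversion-and-calibration workflow the abstract describes maps onto the standard TensorFlow-to-LiteRT API. The sketch below is a hedged illustration, not the authors' code: the stand-in model and the random 80-signal representative dataset (the RDS size the study found sufficient) are assumptions.

```python
# TF -> LiteRT conversion with post-training full-integer quantization.
import numpy as np
import tensorflow as tf

# Stand-in for the five-layer VF detector (10 s ECG at 125 Hz = 1250 samples).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1250, 1)),
    tf.keras.layers.Conv1D(8, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

calib_ecgs = np.random.randn(80, 1250, 1).astype(np.float32)  # placeholder RDS

def representative_dataset():
    for sample in calib_ecgs:          # one calibration example per step
        yield [sample[None, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Full-integer quantization: weights, activations, inputs, and outputs in int8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("vf_detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```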
27 pages, 3199 KiB  
Article
Hybrid CNN–BiLSTM–DNN Approach for Detecting Cybersecurity Threats in IoT Networks
by Bright Agbor Agbor, Bliss Utibe-Abasi Stephen, Philip Asuquo, Uduak Onofiok Luke and Victor Anaga
Computers 2025, 14(2), 58; https://doi.org/10.3390/computers14020058 - 10 Feb 2025
Viewed by 460
Abstract
The Internet of Things (IoT) ecosystem is rapidly expanding. It is driven by continuous innovation but accompanied by increasingly sophisticated cybersecurity threats. Protecting IoT devices from these emerging vulnerabilities has become a critical priority. This study addresses the limitations of existing IoT threat detection methods, which often struggle with the dynamic nature of IoT environments and the growing complexity of cyberattacks. To overcome these challenges, a novel hybrid architecture combining Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Deep Neural Networks (DNN) is proposed for accurate and efficient IoT threat detection. The model’s performance is evaluated using the IoT-23 and Edge-IIoTset datasets, which encompass over ten distinct attack types. The proposed framework achieves a remarkable 99% accuracy on both datasets, outperforming existing state-of-the-art IoT cybersecurity solutions. Advanced optimization techniques, including model pruning and quantization, are applied to enhance deployment efficiency in resource-constrained IoT environments. The results highlight the model’s robustness and its adaptability to diverse IoT scenarios, which address key limitations of prior approaches. This research provides a robust and efficient solution for IoT threat detection, establishing a foundation for advancing IoT security and addressing the evolving landscape of cyber threats while driving future innovations in the field.
(This article belongs to the Special Issue Multimedia Data and Network Security)
Show Figures

Figure 1. Proposed architecture.
Figure 2. Hybrid model structure.
Figure 3. BiLSTM architecture.
Figure 4. Binary classification flowchart.
Figure 5. Workflow for optimizing the hybrid model through pruning and quantization.
Figure 6. Confusion matrix.
Figure 7. Training and validation metrics for the IoT-23 dataset over 30 epochs.
Figure 8. Binary classification confusion matrix generated using the IoT-23 dataset.
Figure 9. Confusion matrix for multi-class classification based on the IoT-23 dataset.
Figure 10. Confusion matrix for binary classification based on the Edge-IIoTset dataset.
Figure 11. Confusion matrix depicting the outcomes of multi-class classification on the Edge-IIoTset dataset.
Figure 12. Accuracy curve for multi-class classification based on the Edge-IIoTset dataset.
Figure 13. Loss curve for multi-class classification based on the Edge-IIoTset dataset.
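As a rough illustration of the hybrid CNN → BiLSTM → DNN topology the abstract describes, here is a minimal Keras sketch; the layer sizes, 78-feature input width, and ten-class head are illustrative assumptions rather than the authors' exact configuration:

```python
# Hybrid intrusion-detection model: Conv1D features -> BiLSTM context -> dense head.
import tensorflow as tf

def build_hybrid(n_features=78, n_classes=10):
    inp = tf.keras.layers.Input(shape=(n_features, 1))
    x = tf.keras.layers.Conv1D(64, 3, activation="relu", padding="same")(inp)
    x = tf.keras.layers.MaxPooling1D(2)(x)                     # CNN: local patterns
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64))(x)                           # BiLSTM: sequence context
    x = tf.keras.layers.Dense(128, activation="relu")(x)       # DNN classification head
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_hybrid()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```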
15 pages, 1363 KiB  
Article
MSQuant: Efficient Post-Training Quantization for Object Detection via Migration Scale Search
by Zhesheng Jiang, Chao Li, Tao Qu, Chu He and Dingwen Wang
Electronics 2025, 14(3), 504; https://doi.org/10.3390/electronics14030504 - 26 Jan 2025
Viewed by 394
Abstract
YOLO (You Only Look Once) has become the dominant paradigm in real-time object detection. However, deploying real-time object detectors on resource-constrained platforms faces challenges due to high computational and memory demands. Quantization addresses this by compressing and accelerating CNN models through the representation of weights and activations with low-precision values. Nevertheless, the quantization difficulty between weights and activations is often imbalanced. In this work, we propose MSQuant, an efficient post-training quantization (PTQ) method for CNN-based object detectors, which balances the quantization difficulty between activations and weights through migration scale. MSQuant introduces the concept of migration scales to mitigate this disparity, thereby improving overall model accuracy. An alternating search method is employed to optimize the migration scales, avoiding local optima and reducing quantization error. We select YOLOv5 and YOLOv8 models as the PTQ baseline, followed by extensive experiments on the PASCAL VOC, COCO, and DOTA datasets to explore various combinations of quantization methods. The results demonstrate the effectiveness and robustness of MSQuant. Our approach consistently outperforms other methods, showing significant improvements in quantization performance and model accuracy.
(This article belongs to the Special Issue High-Performance Computing and AI Compression)
Show Figures

Figure 1. Violin plots of the model.6.cv2.conv layer in YOLOv5s and YOLOv8s. Significant differences are observed in the range of weights and activations within the same channel, indicating that the quantization difficulty for weights and activations is markedly different. (a) YOLOv5s weights distribution. (b) YOLOv8s weights distribution. (c) YOLOv5s activations distribution. (d) YOLOv8s activations distribution over the same channels. Longer outliers result in a more uneven data distribution, causing greater errors and making quantization more challenging.
Figure 2. Architecture of MSQuant. MSQuant first calculates the distributions of the original weights and activations, which are used to initialize the migration scales. The migrated weights and activations then undergo pseudo-quantization, and an iterative process optimizes the migration scales. Once the optimal migration scales are determined, the final quantization is performed on the new weights and activations, ensuring efficient and accurate model performance.
Figure 3. A 3D plot visualizing the impact of migration scales on quantization error: the migration scales for two channels lie on the x and y axes, with the mean squared error (MSE) loss on the z axis. (a) Loss for the first convolutional layer. (b) Loss for convolutional layers in the C3 block. The visualization shows that the search for optimal migration scales tends to be attracted to local minima.
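The migration-scale idea can be sketched in a few lines of numpy: per-channel scales s move quantization difficulty between activations (X / s) and weights (W · s) without changing the float product. The channel-wise grid search below is a simplified stand-in for the paper's alternating search, and all shapes and scale candidates are assumptions:

```python
# Search per-channel migration scales that minimize post-quantization MSE.
import numpy as np

def fake_quant(x, n_bits=8):
    """Symmetric uniform quantize-dequantize (per-tensor)."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1) + 1e-12
    q = np.round(x / scale).clip(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale

def search_migration_scales(X, W, candidates=np.linspace(0.2, 5.0, 49)):
    """X: (N, C) activations, W: (C, K) weights. Returns per-channel scales."""
    ref = X @ W                                  # float reference output
    s = np.ones(X.shape[1])
    for c in range(X.shape[1]):                  # one channel at a time
        errs = []
        for cand in candidates:
            t = s.copy()
            t[c] = cand
            out = fake_quant(X / t) @ fake_quant(W * t[:, None])
            errs.append(np.mean((out - ref) ** 2))
        s[c] = candidates[int(np.argmin(errs))]
    return s

# Toy data: every fourth channel has a 10x larger activation range (outliers).
X = np.random.randn(128, 16) * np.where(np.arange(16) % 4 == 0, 10.0, 1.0)
W = np.random.randn(16, 8)
scales = search_migration_scales(X, W)
```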
10 pages, 638 KiB  
Article
Efficient Quantization and Data Access for Accelerating Homomorphic Encrypted CNNs
by Kai Chen, Xinyu Wang, Yuxiang Fu and Li Li
Electronics 2025, 14(3), 464; https://doi.org/10.3390/electronics14030464 - 23 Jan 2025
Viewed by 492
Abstract
Due to the ability to perform computations directly on encrypted data, homomorphic encryption (HE) has recently become an important branch of privacy-preserving machine learning (PPML) implementation. Nevertheless, existing implementations of HE-based convolutional neural network (HCNN) applications are not satisfactory in inference latency and area efficiency compared to the unencrypted version. In this work, we first improve the additive powers-of-two (APoT) quantization method for HCNN to achieve a better tradeoff between the complexity of modular multiplication and the network accuracy. An efficient multiplicationless modular multiplier–accumulator (M-MAC) unit is accordingly designed. Furthermore, a batch-processing HCNN accelerator with M-MACs is implemented, in which we propose an advanced data partition scheme to avoid multiple moves of the large-size ciphertext polynomials. Compared to the latest FPGA design, our accelerator can achieve 11× resource reduction of an M-MAC and 2.36× speedup in inference latency for a widely used CNN-11 network to process 8K images. The speedup of our design is also significant compared to the latest CPU and GPU implementations of the batch-processing HCNN models.
Show Figures

Figure 1. Comparison between different quantization schemes for the third convolution layer in CNN-11.
Figure 2. M-MAC architecture.
Figure 3. Overall architecture of the CNN accelerator.
Figure 4. Dataflow of the convolution layer.
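The appeal of APoT quantization is that each level is a sum of power-of-two terms, so a multiply decomposes into shifts and adds, which is what a multiplicationless MAC exploits. The numpy sketch below illustrates the idea; the particular exponent sets are illustrative assumptions, not the paper's configuration:

```python
# Additive powers-of-two (APoT) quantization: levels of the form ±(2^a + 2^b).
import numpy as np
from itertools import product

def apot_levels(exp_set1=(0, -2, -4), exp_set2=(-1, -3, -5)):
    """Build the level set: zero plus ±(2^a + 2^b) for a, b in the two sets."""
    mags = sorted({2.0 ** a + 2.0 ** b for a, b in product(exp_set1, exp_set2)})
    return np.array(sorted([0.0] + mags + [-m for m in mags]))

def apot_quantize(w, levels):
    """Round each weight to its nearest APoT level."""
    w = np.clip(w, levels.min(), levels.max())
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

levels = apot_levels()
w = np.random.randn(64) * 0.5
w_q = apot_quantize(w / np.abs(w).max(), levels)   # normalize, then quantize
```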
21 pages, 5845 KiB  
Article
FPGA-QNN: Quantized Neural Network Hardware Acceleration on FPGAs
by Mustafa Tasci, Ayhan Istanbullu, Vedat Tumen and Selahattin Kosunalp
Appl. Sci. 2025, 15(2), 688; https://doi.org/10.3390/app15020688 - 12 Jan 2025
Viewed by 867
Abstract
Recently, convolutional neural networks (CNNs) have received a massive amount of interest due to their ability to achieve high accuracy in various artificial intelligence tasks. With the development of complex CNN models, a significant drawback is their high computational burden and memory requirements. The performance of a typical CNN model can be enhanced by the improvement of hardware accelerators. Practical implementations on field-programmable gate arrays (FPGAs) have the potential to reduce resource utilization while maintaining low power consumption. Nevertheless, complex CNN models may require computational and memory capacities exceeding those available on many current FPGAs. An effective solution to this issue is to use quantized neural network (QNN) models to remove the burden of full-precision weights and activations. This article proposes an accelerator design framework for FPGAs, called FPGA-QNN, with particular value in reducing the high computational burden and memory requirements of CNN implementations. To approach this goal, FPGA-QNN exploits the basics of QNN models by converting the high burden of full-precision weights and activations into integer operations. The FPGA-QNN framework comprises 12 accelerators based on multi-layer perceptron (MLP) and LeNet CNN models, each associated with a specific combination of quantization and folding. Performance evaluations on the Xilinx PYNQ Z1 development board demonstrated the superiority of FPGA-QNN in terms of resource utilization and energy efficiency in comparison to several recent approaches. The proposed MLP model classified the FashionMNIST dataset at a speed of 953 kFPS with 1019 GOPs while consuming 2.05 W.
(This article belongs to the Special Issue Advancements in Deep Learning and Its Applications)
Show Figures

Figure 1. A view of the entire system.
Figure 2. (a) Single-layer perceptron vs. (b) multi-layer perceptron.
Figure 3. A demonstration of the LeNet-5 architecture.
Figure 4. The running mechanism of the QAT and PTQ quantization directions.
Figure 5. A demonstration of 8-bit quantization in Brevitas.
Figure 6. An example scenario: (a) standard quantization with 32-bit weights; (b) BNN with 1-bit weights.
Figure 7. The main body of the FINN framework with 4 steps.
Figure 8. (a) Single processing element (PE) and (b) matrix–vector–threshold unit in the FINN framework.
Figure 9. Illustration of four separate matrix multiplications resulting from folding.
Figure 10. The MLP model implemented for acceleration.
Figure 11. The LeNet model implemented for acceleration.
Figure 12. Training-phase accuracy and loss plots for the quantized MLP and LeNet models.
Figure 13. Transformations applied to models in FINN.
Figure 14. The blocks of the developed accelerator hardware.
Figure 15. FPGA and CPU accuracy graph for the MLP and LeNet models.
Figure 16. Accuracy and timing analysis of FPGA and CPU platforms to observe the effects of precision and folding configurations.
Figure 17. FPGA resource consumption by model, precision, and folding.
Figure 18. Xilinx Vivado power estimation tool and actual power measurement based on quantization levels.
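A QNN of the kind FPGA-QNN feeds into the FINN flow is typically expressed with Brevitas quantized layers. The sketch below is a minimal assumed example (the 2-bit widths and layer sizes are illustrative; the paper explores several quantization/folding combinations):

```python
# A small quantization-aware MLP in Brevitas, exportable to FINN-ONNX.
import torch.nn as nn
from brevitas.nn import QuantIdentity, QuantLinear, QuantReLU

class QuantMLP(nn.Module):
    def __init__(self, bit_width=2, hidden=64, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            QuantIdentity(bit_width=bit_width),            # quantize the input
            nn.Flatten(),
            QuantLinear(28 * 28, hidden, bias=True,
                        weight_bit_width=bit_width),       # low-bit weights
            QuantReLU(bit_width=bit_width),                # low-bit activations
            QuantLinear(hidden, n_classes, bias=True,
                        weight_bit_width=bit_width),
        )

    def forward(self, x):
        return self.net(x)

model = QuantMLP()   # train with QAT, then export to the FINN compiler flow
```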
13 pages, 1853 KiB  
Article
Optimizing Deep Learning Acceleration on FPGA for Real-Time and Resource-Efficient Image Classification
by Ahmad Mouri Zadeh Khaki and Ahyoung Choi
Appl. Sci. 2025, 15(1), 422; https://doi.org/10.3390/app15010422 - 5 Jan 2025
Cited by 1 | Viewed by 1117
Abstract
Deep learning (DL) has revolutionized image classification, yet deploying convolutional neural networks (CNNs) on edge devices for real-time applications remains a significant challenge due to constraints in computation, memory, and power efficiency. This work presents an optimized implementation of VGG16 and VGG19, two widely used CNN architectures, for classifying the CIFAR-10 dataset using transfer learning on field-programmable gate arrays (FPGAs). Utilizing the Xilinx Vitis-AI and TensorFlow2 frameworks, we adapt VGG16 and VGG19 for FPGA deployment through quantization, compression, and hardware-specific optimizations. Our implementation achieves high classification accuracy, with Top-1 accuracy of 89.54% and 87.47% for VGG16 and VGG19, respectively, while delivering significant reductions in inference latency (7.29× and 6.6× compared to CPU-based alternatives). These results highlight the suitability of our approach for resource-efficient, real-time edge applications. Key contributions include a detailed methodology for combining transfer learning with FPGA acceleration, an analysis of hardware resource utilization, and performance benchmarks. This work underscores the potential of FPGA-based solutions to enable scalable, low-latency DL deployments in domains such as autonomous systems, IoT, and mobile devices.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Show Figures

Figure 1. Workflow of this study for implementing VGG16 and VGG19 on FPGA using the Xilinx Vitis-AI framework.
Figure 2. VGG16 and VGG19 model architecture using transfer learning in this work.
Figure 3. Conceptual pipeline of the Xilinx Vitis-AI.
Figure 4. Confusion matrices for CIFAR-10 test-set image classification by our FPGA-based models: (a) VGG16; (b) VGG19.
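The transfer-learning setup the abstract describes corresponds to a standard Keras pattern: a frozen ImageNet backbone with a small CIFAR-10 head, producing the float model one would then hand to Vitis-AI for quantization. The head sizes below are assumptions:

```python
# VGG16 transfer learning for CIFAR-10: freeze the backbone, train a new head.
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(32, 32, 3))
base.trainable = False                    # keep the pretrained features fixed
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),   # CIFAR-10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```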
14 pages, 2382 KiB  
Article
Edge-AI Enabled Wearable Device for Non-Invasive Type 1 Diabetes Detection Using ECG Signals
by Maria Gragnaniello, Vincenzo Romano Marrazzo, Alessandro Borghese, Luca Maresca, Giovanni Breglio and Michele Riccio
Bioengineering 2025, 12(1), 4; https://doi.org/10.3390/bioengineering12010004 - 24 Dec 2024
Cited by 1 | Viewed by 801
Abstract
Diabetes is a chronic condition, and traditional monitoring methods are invasive, significantly reducing the quality of life of the patients. This study proposes the design of an innovative system based on a microcontroller that performs real-time ECG acquisition and evaluates the presence of diabetes using an Edge-AI solution. A spectrogram-based preprocessing method is combined with a 1-Dimensional Convolutional Neural Network (1D-CNN) to analyze the ECG signals directly on the device. By applying quantization as an optimization technique, the model effectively balances memory usage and accuracy, achieving an accuracy of 89.52% with an average precision and recall of 0.91 and 0.90, respectively. These results were obtained with a minimal memory footprint of 347 kB flash and 23 kB RAM, showcasing the system’s suitability for wearable embedded devices. Furthermore, a custom PCB was developed to validate the system in a real-world scenario. The hardware integrates high-performance electronics with low power consumption, demonstrating the feasibility of deploying Edge-AI for non-invasive, real-time diabetes detection in resource-constrained environments. This design represents a significant step forward in improving the accessibility and practicality of diabetes monitoring.
(This article belongs to the Special Issue Monitoring and Analysis of Human Biosignals, Volume II)
Show Figures

Figure 1. Schematic block diagram of the PCB design. The main components include the Analog Front-End (AFE) section using the MAX30003, the Microcontroller Unit (MCU) based on the STM32F401, and the output section using Bluetooth Low Energy (BLE) for communication.
Figure 2. Rendering of the top and bottom layers of the custom PCB. The electrodes are highlighted on the left side.
Figure 3. The overall procedure began with the creation of a dataset from D1NAMO. These data were then processed through spectrogram analysis, followed by CNN inference, which was used to display the results.
Figure 4. Example waveforms of (a) a diabetic ECG signal and (b) a healthy ECG.
Figure 5. Schematic representation of the neural network design. The structure consists of input layers for ECG data, followed by key processing layers, leading to the final classification output.
Figure 6. Confusion matrix.
Figure 7. Graphical results from the ST Edge AI Developer Cloud following the benchmark test.
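A spectrogram-plus-1D-CNN pipeline of the kind outlined above can be sketched as follows; the sampling rate, window parameters, and layer sizes are assumptions, not the authors' configuration:

```python
# ECG window -> log spectrogram -> small 1D-CNN classifier (untrained stand-in).
import numpy as np
import tensorflow as tf
from scipy.signal import spectrogram

fs = 250                                            # assumed ECG sampling rate (Hz)
ecg = np.random.randn(10 * fs).astype(np.float32)   # placeholder 10 s strip
f, t, Sxx = spectrogram(ecg, fs=fs, nperseg=64, noverlap=32)
x = np.log1p(Sxx.T)[None, ...]                      # (1, time_bins, freq_bins)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=x.shape[1:]),
    tf.keras.layers.Conv1D(16, 3, activation="relu"),   # 1D conv over time
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # diabetic vs. healthy
])
prob = model(x)   # quantize with LiteRT to fit the MCU's flash/RAM budget
```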
18 pages, 1732 KiB  
Article
A One-Dimensional Depthwise Separable Convolutional Neural Network for Bearing Fault Diagnosis Implemented on FPGA
by Yu-Pei Liang, Hao Chen and Ching-Che Chung
Sensors 2024, 24(23), 7831; https://doi.org/10.3390/s24237831 - 7 Dec 2024
Viewed by 867
Abstract
This paper presents a hardware implementation of a one-dimensional convolutional neural network using depthwise separable convolution (DSC) on the VC707 FPGA development board. The design processes the one-dimensional rolling bearing current signal dataset provided by Paderborn University (PU), employing minimal preprocessing to maximize the comprehensiveness of feature extraction. To address the high parameter demands commonly associated with convolutional neural networks (CNNs), the model incorporates DSC, significantly reducing computational complexity and parameter load. Additionally, the DoReFa-Net quantization method is applied to compress network parameters and activation function outputs, thereby minimizing memory usage. The quantized DSC model requires approximately 22 KB of storage and performs 1,203,128 floating-point operations in total. The implementation achieves a power consumption of 527 mW at a clock frequency of 50 MHz, while delivering a fault diagnosis accuracy of 96.12%.
(This article belongs to the Special Issue Feature Papers in Physical Sensors 2024)
Show Figures

Figure 1. The down-sampling operation, with the down-sampling factor set to 3.
Figure 2. Overview of the proposed 1-D DSC architecture.
Figure 3. The operation of DSC in the first layer.
Figure 4. Overall architecture of the proposed DSC hardware design.
Figure 5. Power analysis of the proposed DSC hardware design at 50 MHz.
Figure 6. Confusion matrix of the software training result.
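The DoReFa-Net weight quantization named above follows a simple recipe: squash the weights with tanh, map to [0, 1], quantize uniformly to k bits, then map back to [-1, 1]. A minimal numpy sketch (shapes and bit width are illustrative):

```python
# DoReFa-Net style k-bit weight quantization.
import numpy as np

def quantize_k(x, k):
    """Uniform k-bit quantizer on [0, 1]."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def dorefa_weights(w, k=8):
    t = np.tanh(w)
    wn = t / (2 * np.abs(t).max()) + 0.5    # squash, then map to [0, 1]
    return 2 * quantize_k(wn, k) - 1        # k-bit levels in [-1, 1]

w = np.random.randn(3, 3, 16) * 0.3
w_q = dorefa_weights(w, k=8)
```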
21 pages, 12287 KiB  
Article
An Optimised CNN Hardware Accelerator Applicable to IoT End Nodes for Disruptive Healthcare
by Arfan Ghani, Akinyemi Aina and Chan Hwang See
IoT 2024, 5(4), 901-921; https://doi.org/10.3390/iot5040041 - 6 Dec 2024
Viewed by 1029
Abstract
In the evolving landscape of computer vision, the integration of machine learning algorithms with cutting-edge hardware platforms is increasingly pivotal, especially in the context of disruptive healthcare systems. This study introduces an optimized implementation of a Convolutional Neural Network (CNN) on the Basys3 FPGA, designed specifically for accelerating the classification of cytotoxicity in human kidney cells. Addressing the challenges posed by constrained dataset sizes, compute-intensive AI algorithms, and hardware limitations, the approach presented in this paper leverages efficient image augmentation and pre-processing techniques to enhance both prediction accuracy and training efficiency. The CNN, quantized to 8-bit precision and tailored to the FPGA’s resource constraints, accelerates training by a factor of three while consuming only 1.33% of the power of a traditional software-based CNN running on an NVIDIA K80 GPU. The network architecture, composed of seven layers with excessive hyperparameters, processes downscaled grayscale images, achieving notable gains in speed and energy efficiency. A cornerstone of our methodology is the emphasis on parallel processing, data type optimization, and reduced logic space usage through 8-bit integer operations. We conducted extensive image pre-processing, including histogram equalization and artefact removal, to maximize feature extraction from the augmented dataset. Achieving an accuracy of approximately 91% on unseen images, this FPGA-implemented CNN demonstrates the potential for rapid, low-power medical diagnostics within a broader IoT ecosystem where data could be assessed online. This work underscores the feasibility of deploying resource-efficient AI models in environments where traditional high-performance computing resources are unavailable, typically in healthcare settings, paving the way for and contributing to advanced computer vision techniques in embedded systems.
(This article belongs to the Topic Machine Learning in Internet of Things II)
Show Figures

Figure 1. Augmented highly stressed image.
Figure 2. Adjusted augmented image with histogram.
Figure 3. Pre-processing and FPGA training pipeline.
Figure 4. Custom CNN layers.
Figure 5. PC-to-FPGA training flowchart.
Figure 6. (a) Control software on the PC connected to the FPGA. (b) PMOD SD card reader connected to the Basys3 FPGA. (c) PC connected to the Basys3 FPGA via micro-USB.
Figure 7. Highly stressed image classified correctly by the FPGA CNN.
Figure 8. Highly stressed sample classification performance.
Figure 9. Moderately stressed sample classification performance.
Figure 10. Normal sample classification performance.
Figure 11. High-level parallelization block diagram.
Figure 12. HDL output logic block diagram.
Figure 13. HDL output logic block diagram with extended detail.
Figure 14. Visual representation of the bounded ReLU activation-based softmax.
Figure 15. MATLAB custom CNN training accuracy and loss.
Figure 16. MATLAB validation accuracy and associated parameters.
Figure 17. Multiclass confusion matrix in MATLAB.
Figure 18. Multiclass classification performance of the FPGA CNN model on unseen images.
Figure 19. Accuracy of highly stressed samples against moderately stressed samples.
Figure 20. Loss rate of highly stressed samples against moderately stressed samples.
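The 8-bit integer arithmetic the abstract emphasizes boils down to affine quantization: map floats to int8 with a scale and zero point, accumulate in integers, and apply one float rescale at the end. A generic numpy sketch of that pattern (not the authors' FPGA logic):

```python
# Affine int8 quantization and an integer multiply-accumulate.
import numpy as np

def quant_params(x, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin) + 1e-12
    zp = int(round(qmin - x.min() / scale))     # zero point aligns x.min to qmin
    return scale, zp

def quantize(x, scale, zp):
    return np.clip(np.round(x / scale) + zp, -128, 127).astype(np.int8)

a, w = np.random.randn(64), np.random.randn(64)
(sa, za), (sw, zw) = quant_params(a), quant_params(w)
qa, qw = quantize(a, sa, za), quantize(w, sw, zw)

# Integer MAC in int32, then a single float rescale recovers the dot product.
acc = np.sum((qa.astype(np.int32) - za) * (qw.astype(np.int32) - zw))
approx = sa * sw * acc
print(approx, float(a @ w))   # close agreement
```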
17 pages, 6810 KiB  
Article
Breast Tumor Detection and Diagnosis Using an Improved Faster R-CNN in DCE-MRI
by Haitian Gui, Han Jiao, Li Li, Xinhua Jiang, Tao Su and Zhiyong Pang
Bioengineering 2024, 11(12), 1217; https://doi.org/10.3390/bioengineering11121217 - 1 Dec 2024
Cited by 1 | Viewed by 1068
Abstract
AI-based breast cancer detection can improve the sensitivity and specificity of detection, especially for small lesions, which has clinical value in realizing early detection and treatment so as to reduce mortality. The two-stage detection network performs well; however, it adopts an imprecise ROI during classification, which can easily include surrounding tumor tissues. Additionally, fuzzy noise is a significant contributor to false positives. We adopted Faster R-CNN as the architecture, introduced ROI Align to minimize quantization errors and a feature pyramid network (FPN) to extract features at different resolutions, added a bounding box quadratic regression feature map extraction network and three convolutional layers to reduce interference from tissue surrounding the tumor, and extracted more accurate and deeper feature maps. Our approach outperformed Faster R-CNN, Mask R-CNN, and YOLOv9 in breast cancer detection across 485 internal cases. We achieved superior performance in mAP, sensitivity, and false positive rate ((0.752, 0.950, 0.133) vs. (0.711, 0.950, 0.200) vs. (0.718, 0.880, 0.120) vs. (0.658, 0.680, 0.405)), which represents a 38.5% reduction in false positives compared to manual detection. Additionally, in a public dataset of 220 cases, our model also demonstrated the best performance. It showed improved sensitivity and specificity, effectively assisting doctors in diagnosing cancer.
Show Figures

Figure 1. Flowchart of the study procedure.
Figure 2. The architecture of our proposed model, BC R-CNN.
Figure 3. PDN structure.
Figure 4. Four-quadrant location.
Figure 5. Background noise reduction: (a) MRI before noise reduction; (b) MRI after noise reduction; (c) segmented breast of the original MRI; (d) segmented breast of the noise-reduced MRI.
Figure 6. U-Net++ breast edge segmentation: (a) sagittal breast MRI and single breast MRI at the axial plane; (b) masks; (c) segmented breasts.
Figure 7. AUC performance comparison of different models: (a) internal dataset; (b) public dataset.
Figure 8. Breast tumor location and diagnosis comparison of Faster R-CNN and our proposed model: (a1,b1,c1) detected by Faster R-CNN; (a2,b2,c2) detected by our proposed model; (a1,b1) false positives; (c1) diagnosed with a lower score; (c2) diagnosed with a higher score.
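The ROI Align step the abstract credits with minimizing quantization errors is available off the shelf in torchvision; here is a short usage sketch with illustrative tensor sizes (the feature stride and box are assumptions):

```python
# ROI Align: bilinear sampling avoids the coordinate rounding of ROI pooling.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 64, 64)        # one FPN level, stride 4 (assumed)
# Boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
rois = torch.tensor([[0.0, 40.0, 40.0, 120.0, 160.0]])
pooled = roi_align(features, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 4, sampling_ratio=2, aligned=True)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
```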
10 pages, 337 KiB  
Communication
InMemQK: A Product Quantization Based MatMul Module for Compute-in-Memory Attention Macro
by Pengcheng Feng, Yihao Chen, Jinke Yu, Hao Yue, Zhelong Jiang, Yi Xiao, Wan’ang Xiao, Huaxiang Lu and Gang Chen
Appl. Sci. 2024, 14(23), 11198; https://doi.org/10.3390/app142311198 - 1 Dec 2024
Viewed by 758
Abstract
Large Language Models (LLMs), based on transformer architecture, have demonstrated remarkable capabilities in natural language processing tasks, enabling machines to generate human-like text and engage in meaningful dialogues. However, the exponential increase in model parameters has led to limitations in inference speed and energy efficiency. Compute-in-memory (CIM) technology offers a promising solution to accelerate AI inference by performing analog computations directly within memory, potentially reducing latency and power consumption. At the same time, CIM has been successfully applied to accelerate Convolutional Neural Networks (CNNs); however, the matrix–matrix multiplication (MatMul) operations inherent in the scaled dot-product attention of the transformer present unique challenges for direct CIM implementation. In this work, we propose InMemQK, a compute-in-memory-based attention accelerator that focuses on optimizing MatMul operations through software and hardware co-design. At the software level, InMemQK employs product quantization (PQ) to eliminate data dependencies. At the hardware level, InMemQK integrates energy-efficient time-domain MAC macros for ADC-free computations. Experimental results show InMemQK achieves 13.2×–13.9× lower power consumption than existing CIM-based accelerators.
(This article belongs to the Section Computing and Artificial Intelligence)
Show Figures

Figure 1. Transformer model structure and MatMul methods in CIM: (a) multi-head self-attention module; (b) MatMul calculations in ISAAC, Timely [15], etc.; (c) MatMul calculations in ReTransformer; (d) the proposed MatMul method.
Figure 2. (a) Voltage-domain MAC based on Ohm's law. (b) Time-domain MAC based on current integration. (c) The normalized energy of different interfaces, where e_DAC, e_ADC, e_DTC, and e_TDC denote the energy of one DAC, ADC, DTC, and TDC, respectively.
Figure 3. Schematic diagram of product quantization: (a) vector splitting; (b) clustering and representation; (c) the two steps of approximate MatMul.
Figure 4. (a) Time-domain multiplication–accumulation macro. (b) Two-stage MatMul calculation pipeline.
Figure 5. (a) Performance comparison with D-MAC. (b) InMemQK power saving over JSSC'2024 and JSSC'2022 (mapping only the relevance computing).
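Product quantization, the software half of InMemQK's co-design, is easy to sketch: split each vector into sub-vectors, cluster each sub-space with k-means, and approximate dot products by lookup tables. The dimensions and centroid counts below are illustrative:

```python
# Product quantization (PQ): approximate dot products via per-subspace lookups.
import numpy as np
from sklearn.cluster import KMeans

d, m, k = 64, 4, 16                  # dim, sub-vectors, centroids per sub-space
keys = np.random.randn(1000, d).astype(np.float32)
sub = keys.reshape(1000, m, d // m)

codebooks, codes = [], []
for j in range(m):                   # one k-means per sub-space
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub[:, j])
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_)
codes = np.stack(codes, axis=1)      # (1000, m) compact codes per key vector

query = np.random.randn(d).astype(np.float32)
q_sub = query.reshape(m, d // m)
tables = np.stack([codebooks[j] @ q_sub[j] for j in range(m)])  # (m, k) LUTs
approx = tables[np.arange(m), codes].sum(axis=1)   # table lookup + add per key
exact = keys @ query
print(np.corrcoef(approx, exact)[0, 1])            # high correlation
```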
20 pages, 691 KiB  
Article
DiscHAR: A Discrete Approach to Enhance Human Activity Recognition in Cyber Physical Systems: Smart Homes
by Ishrat Fatima, Asma Ahmad Farhan, Maria Tamoor, Shafiq ur Rehman, Hisham Abdulrahman Alhulayyil and Fawaz Tariq
Computers 2024, 13(11), 300; https://doi.org/10.3390/computers13110300 - 19 Nov 2024
Viewed by 832
Abstract
The main challenges in smart home systems and cyber-physical systems come from not having enough data and unclear interpretation; thus, there is still a lot to be done in this field. In this work, we propose a practical approach called Discrete Human Activity Recognition (DiscHAR) based on prior research to enhance Human Activity Recognition (HAR). Our goal is to generate diverse data to build better models for activity classification. To tackle overfitting, which often occurs with small datasets, we generate data and convert them into discrete forms, improving classification accuracy. Our methodology includes advanced techniques like the R-Frame method for sampling and the Mixed-up approach for data generation. We apply K-means vector quantization to categorize the data, and through the elbow method, we determine the optimal number of clusters. The discrete sequences are converted into one-hot encoded vectors and fed into a CNN model to ensure precise recognition of human activities. Evaluations on the OPP79, PAMAP2, and WISDM datasets show that our approach outperforms existing models, achieving 89% accuracy for OPP79, 93.24% for PAMAP2, and 100% for WISDM. These results demonstrate the model’s effectiveness in identifying complex activities captured by wearable devices. Our work combines theory and practice to address ongoing challenges in this field, aiming to improve the reliability and performance of activity recognition systems in dynamic environments.
Show Figures

Figure 1. High-level architecture of DiscHAR.
Figure 2. Elbow method to determine clusters within each activity class; the x axis shows the number of clusters and the y axis represents distortion.
Figure 3. Detailed overview of the CNN model used in DiscHAR.
Figure 4. F1 score for the OPP79 [34] dataset; the x axis shows epochs and the y axis represents the F1 score.
Figure 5. Loss curve for the OPP79 [34] dataset; the x axis shows epochs and the y axis represents the loss.
Figure 6. F1 score for the PAMAP2 [30] dataset; the x axis shows epochs and the y axis represents the F1 score.
Figure 7. Loss curve for the PAMAP2 [30] dataset; the x axis shows epochs and the y axis represents the loss.
Figure 8. Accuracy for the WISDM [38] dataset; the x axis shows epochs and the y axis represents accuracy for different learning rates.
Figure 9. Loss curve for the WISDM [38] dataset; the x axis shows epochs and the y axis represents the loss for different learning rates.
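The discretization step DiscHAR describes (K-means vector quantization, cluster count picked by the elbow method, then one-hot encoding for the CNN) reduces to a few lines; the window shape and k below are illustrative assumptions:

```python
# Vector-quantize sensor frames into discrete symbols, then one-hot encode.
import numpy as np
from sklearn.cluster import KMeans

windows = np.random.randn(500, 30)        # 500 frames of 30 features (assumed)
k = 8                                     # assumed value chosen via the elbow method
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(windows)
symbols = km.predict(windows)             # discrete sequence, values in [0, k)
one_hot = np.eye(k, dtype=np.float32)[symbols]   # (500, k) CNN-ready input
```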
18 pages, 1757 KiB  
Article
End-to-End Deployment of Winograd-Based DNNs on Edge GPU
by Pierpaolo Mori, Mohammad Shanur Rahman, Lukas Frickenstein, Shambhavi Balamuthu Sampath, Moritz Thoma, Nael Fasfous, Manoj Rohit Vemparala, Alexander Frickenstein, Walter Stechele and Claudio Passerone
Electronics 2024, 13(22), 4538; https://doi.org/10.3390/electronics13224538 - 19 Nov 2024
Viewed by 942
Abstract
The Winograd algorithm reduces the computational complexity of convolutional neural networks (CNNs) by minimizing the number of multiplications required for convolutions, making it particularly suitable for resource-constrained edge devices. Concurrently, most edge hardware accelerators utilize 8-bit integer arithmetic to enhance energy efficiency and reduce inference latency, requiring the quantization of CNNs before deployment. Combining Winograd-based convolution with quantization offers the potential for both performance acceleration and reduced energy consumption. However, prior research has identified significant challenges in this combination, particularly due to numerical instability and substantial accuracy degradation caused by the transformations required in the Winograd domain, making the two techniques incompatible on edge hardware. In this work, we describe our latest training scheme, which addresses these challenges, enabling the successful integration of Winograd-accelerated convolution with low-precision quantization while maintaining high task-related accuracy. Our approach mitigates the numerical instability typically introduced during the transformation, ensuring compatibility between the two techniques. Additionally, we extend our work by presenting a custom-optimized CUDA implementation of quantized Winograd convolution for NVIDIA edge GPUs. This implementation takes full advantage of the proposed training scheme, achieving both high computational efficiency and accuracy, making it a compelling solution for edge-based AI applications. Our training approach enables significant MAC reduction with minimal impact on prediction quality. Furthermore, our hardware results demonstrate up to a 3.4× latency reduction for specific layers, and a 1.44× overall reduction in latency for the entire DeepLabV3 model, compared to the standard implementation.
(This article belongs to the Section Artificial Intelligence)
Show Figures

Figure 1. The three steps of the F(4,3) Winograd algorithm: (1) input and weight transformation, (2) element-wise matrix multiplication (EWMM) of the transformed matrices, and (3) inverse transformation to produce the spatial output feature maps. The numerical instability due to quantization is highlighted.
Figure 2. Comparison of (a) the standard quantized Winograd transformation against (b) the quantized Winograd transformation that leverages trainable clipping factors to better exploit the quantized range.
Figure 3. Overview of the proposed Winograd-aware quantized training. A straight-through estimator (STE) approximates the gradient of the quantization function. The trainable clipping factors c, α_ta, and α_tw are highlighted in red.
Figure 4. Input transformation kernel overview. The input volume is divided into sub-volumes, and each thread block is responsible for transforming one sub-volume.
Figure 5. Element-wise matrix multiplication kernel overview. The computation is organized as 6 × 6 GEMMs, each responsible for computing N_tiles × C_o output pixels in the Winograd domain.
Figure 6. Inverse transformation kernel overview. The Winograd tiles produced by the EWMM kernel are transformed back to the spatial domain. Each thread block is responsible for computing a 4 × 4 × P_oc block of output pixels.
Figure 7. Numerical distributions of example layers for transformed weights and activations of ResNet-20 on CIFAR-10. The values in the clipped range (green) contain sufficient information to maintain high-accuracy full 8-bit Winograd.
Figure 8. Latency speedup of the custom Winograd F(4,3) kernels compared to cuDNN convolution on Tensor Cores (int8x32).
Figure 9. The latency contribution of each of the three steps in the Winograd F(4,3) algorithm. In each sub-figure, the spatial dimensions are fixed while the channel dimensions are varied.
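To ground the three-step structure above, here is a worked numpy example of 1-D Winograd F(2,3), the small-tile relative of the F(4,3) variant the paper accelerates: two outputs of a 3-tap convolution are computed from four inputs with four multiplications instead of six. Quantizing the transformed values (kept in float here) is exactly where the numerical-instability issue arises.

```python
# Winograd F(2,3): y = A^T [ (G g) ⊙ (B^T d) ], verified against direct correlation.
import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]], dtype=np.float64)   # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)    # inverse transform

d = np.random.randn(4)                     # input tile
g = np.random.randn(3)                     # 3-tap filter
y_winograd = AT @ ((G @ g) * (BT @ d))     # only 4 element-wise multiplies
y_direct = np.convolve(d, g[::-1], mode="valid")     # reference correlation
print(np.allclose(y_winograd, y_direct))   # True
```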
19 pages, 4771 KiB  
Article
Intelligent Fault Diagnosis Method Based on Neural Network Compression for Rolling Bearings
by Xinren Wang, Dongming Hu, Xueqi Fan, Huiyi Liu and Chenbin Yang
Symmetry 2024, 16(11), 1461; https://doi.org/10.3390/sym16111461 - 4 Nov 2024
Viewed by 1162
Abstract
Rolling bearings are often exposed to high speeds and pressures, leading to the symmetry in their rotating structure being disrupted, which can lead to serious failures. Intelligent rolling bearing fault diagnosis is a critical part of ensuring operation of machinery, and it has been facilitated by the growing popularity of convolutional neural networks (CNNs). The outstanding performance of fault diagnosis CNNs results from complex and redundant network structures and parameters, resulting in huge storage and computational requirements, which makes it challenging to implement these models in resource-limited industrial devices. This study aims to address this problem by proposing a comprehensive compression method for CNNs that is applied to intelligent fault diagnosis. It involves several different compression methods, including tensor train decomposition, parameter quantization, and knowledge distillation for deep network compression. This results in a significant decrease in redundancy and speeding up the training of CNN models. Firstly, tensor train decomposition is applied to reduce redundant connections in both convolutional and fully connected layers. The next step is to perform parameter quantization to minimize the bits needed for parameter representation and storage. Finally, knowledge distillation is used to restore accuracy to the compressed model. The effectiveness of the proposed approach is confirmed by an experiment and ablation study with different models on several datasets. The results show that it can significantly reduce redundant information and floating-point operations with little degradation in accuracy. Notably, on the CWRU dataset, with about 60% parameter reduction, there is no degradation in our model’s accuracy. The proposed approach is a new attempt at the intelligent fault diagnosis of rolling bearings in industrial equipment.
Show Figures

Figure 1. Tensor train decomposition (TTD) of a 4-order tensor, where the dimensions r_0 and r_4 are always set to 1.
Figure 2. The process of quantization by weight-sharing clustering. Assuming the weights in a convolutional layer form a 4 × 4 matrix, the weights are quantized into four categories, each represented by a different color, and all weights in the same category share the same value. Each weight therefore requires only a tiny 2-bit cluster index.
Figure 3. Flowchart of the compression method for bearing fault diagnosis in this work.
Figure 4. The experimental platforms of the two bearing datasets: (a) experimental equipment of the XJTU-SY dataset [40]; (b) experimental equipment of the CWRU dataset [41].
Figure 5. Comparison of response time with different methods on the XJTU-SY dataset.
Figure 6. Accuracy curves before and after knowledge distillation with different compression rates on the XJTU-SY dataset: (a) accuracy curve of CNN-1; (b) accuracy curve of CNN-2.
Figure 7. Accuracy and loss curves with different retraining methods on the XJTU-SY dataset: (a) accuracy curve of CNN-1; (b) loss curve of CNN-1; (c) accuracy curve of CNN-2; (d) loss curve of CNN-2.
Figure 8. Confusion matrices of CNN-1 and CNN-2 on the XJTU-SY dataset: (a) original CNN-1 before compression; (b) compressed CNN-1 before knowledge distillation; (c) compressed CNN-1 after knowledge distillation; (d) original CNN-2 before compression; (e) compressed CNN-2 before knowledge distillation; (f) compressed CNN-2 after knowledge distillation.
Figure 9. Accuracy curves and parameter counts at different compression stages of the CNN-1 model on the XJTU-SY dataset: (a) the whole compression process; (b) the compression process without parameter quantization; (c) the compression process without tensor train decomposition.
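The knowledge-distillation step used to restore accuracy after decomposition and quantization follows the standard soft-target recipe; a PyTorch sketch is below, with T and alpha as typical values rather than the paper's settings:

```python
# Distillation loss: KL to the teacher's softened logits + hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # soft-target term
    hard = F.cross_entropy(student_logits, labels)     # ground-truth term
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)   # compressed student logits (batch of 8)
t = torch.randn(8, 10)                       # original teacher logits
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
```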