Search Results (122)

Search Parameters:
Keywords = quantized CNN

33 pages, 3144 KiB  
Article
CNN-Based Optimization for Fish Species Classification: Tackling Environmental Variability, Class Imbalance, and Real-Time Constraints
by Amirhosein Mohammadisabet, Raza Hasan, Vishal Dattana, Salman Mahmood and Saqib Hussain
Information 2025, 16(2), 154; https://doi.org/10.3390/info16020154 - 19 Feb 2025
Abstract
Automated fish species classification is essential for marine biodiversity monitoring, fisheries management, and ecological research. However, challenges such as environmental variability, class imbalance, and computational demands hinder the development of robust classification models. This study investigates the effectiveness of convolutional neural network (CNN)-based models and hybrid approaches to address these challenges. Eight CNN architectures, including DenseNet121, MobileNetV2, and Xception, were compared alongside traditional classifiers like support vector machines (SVMs) and random forest. DenseNet121 achieved the highest accuracy (90.2%), leveraging its superior feature extraction and generalization capabilities, while MobileNetV2 balanced accuracy (83.57%) with computational efficiency, processing images in 0.07 s, making it ideal for real-time deployment. Advanced preprocessing techniques, such as data augmentation, turbidity simulation, and transfer learning, were employed to enhance dataset robustness and address class imbalance. Hybrid models combining CNNs with traditional classifiers achieved intermediate accuracy with improved interpretability. Optimization techniques, including pruning and quantization, reduced model size by 73.7%, enabling real-time deployment on resource-constrained devices. Grad-CAM visualizations further enhanced interpretability by identifying key image regions influencing predictions. This study highlights the potential of CNN-based models for scalable, interpretable fish species classification, offering actionable insights for sustainable fisheries management and biodiversity conservation.
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
Show Figures

Graphical abstract
Figure 1. Framework for the research methodology in fish species classification.
Figure 2. Example of augmented images.
Figure 3. Confusion matrix for DenseNet121 performance.
Figure 4. Training and validation loss for DenseNet121.
Figure 5. Grad-CAM heatmap analysis for a single class.
Figure 6. Comparative Grad-CAM visualizations across multiple classes.
Figure 7. Turbidity simulation results and model predictions.
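To make the Grad-CAM step concrete, here is a minimal Keras sketch of the technique (not the authors' code); the DenseNet121 layer name, untrained weights, and zero-image input are illustrative assumptions:

```python
# Grad-CAM: weight each feature map by the pooled gradient of the class score.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Return a heatmap of the regions that drive `class_index`."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])  # add batch dimension
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)              # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # global-average the grads
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                            # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

model = tf.keras.applications.DenseNet121(weights=None)   # untrained stand-in
heatmap = grad_cam(model, np.zeros((224, 224, 3), np.float32),
                   "conv5_block16_concat", class_index=0)  # assumed layer name
```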
29 pages, 6669 KiB  
Article
Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection
by Vessela Krasteva, Todor Stoyanov and Irena Jekova
Appl. Sci. 2025, 15(4), 1965; https://doi.org/10.3390/app15041965 - 13 Feb 2025
Viewed by 345
Abstract
GPU-based deep neural networks (DNNs) are powerful for electrocardiogram (ECG) processing and rhythm classification. Although questions often arise about their practical application in embedded systems with low computational resources, few studies have investigated the associated challenges. This study aims to show a useful workflow for deploying a pre-trained DNN model from a GPU-based development platform to two popular ARM-based microcontrollers: Raspberry Pi 4 and ARM Cortex-M7. Specifically, a five-layer convolutional neural network pre-trained in TensorFlow (TF) for the detection of ventricular fibrillation is converted to Lite Runtime (LiteRT) format and subjected to post-training quantization to reduce model size and computational complexity. Using a test dataset of 7482 10 s cardiac arrest ECGs, the inference of the LiteRT DNN on Raspberry Pi 4 takes about 1 ms with a sensitivity of 98.6% and specificity of 99.5%, reproducing the TF DNN performance. An optimization study with 1300 representative datasets (RDSs), including 10 to 4000 calibration ECG signals selected by random, rhythm-based, or amplitude-based criteria, showed that choosing a random RDS with a relatively small size of 80 resulted in a quantized integer LiteRT DNN with minimal quantization error. The inference of both non-quantized and quantized LiteRT DNNs on a low-resource ARM Cortex-M7 microcontroller (STM32F7) shows a rhythm accuracy deviation of <0.4%. Quantization reduces internal computation latency from 4.8 s to 0.6 s, flash memory usage from 40 kB to 20 kB, and energy consumption by 7.85 times. This study ensures that DNN models retain their functionality while being optimized for real-time execution on resource-constrained hardware, demonstrating application in automated external defibrillators.
Show Figures

Figure 1. Flowchart of the methodological steps, showing the conversion and implementation of DNN models on microcontroller boards. OS: operating system; IDE: integrated development environment; DNN: deep neural network; TF: TensorFlow; LiteRT: Lite Runtime; ARM: Acorn RISC (reduced instruction set computer) machine; GPU: graphical processing unit.
Figure 2. Architecture of the trained TF DNN model for detection of ventricular fibrillation, as selected from the optimization study [67]. The input layer processes a 10 s single-lead ECG signal sampled at 125 Hz, while the output layer consists of a single neuron presenting the probability of ventricular fibrillation (pVF). The hidden-layer feature map dimensions are shaped by 1D convolution with valid padding, influenced by the kernel size. The model has 7521 parameters in total.
Figure 3. Flowchart of the conversion of LiteRT DNN models with and without post-training quantization. The quantization methods are described according to [77].
Figure 4. Test interface of ARM-based microcontroller boards (slave) connected to the GPU-based workstation test platform (master). ECG: electrocardiogram; I/O: input/output; IDE: integrated development environment; USB: Universal Serial Bus; USART: universal synchronous/asynchronous receiver transmitter; CPU: central processing unit; DMA: direct memory access; SDRAM: synchronous dynamic random-access memory; SRAM: static random-access memory; LiteRT DNN: Lite Runtime deep neural network.
Figure 5. Optimization and test workflow of the DNN models on the target platforms. DynQ: dynamic range quantization; IntQ: integer quantization; Full-IntQ: full-integer quantization; RDS: representative dataset; MAE(pVF): mean absolute error of the probability for detection of ventricular fibrillation.
Figure 6. Measured post-training quantization times for the integer (IntQ) and full-integer (Full-IntQ) quantized LiteRT models as a function of the representative dataset size (RDS = 10, 20, 40, 80, 120, 160, 200, 250, 300, 350, 400, 500, 1000, and 4306 ECG recordings) selected from the calibration dataset.
Figure 7. Examples of representative datasets (RDSs), each including 10 ECG recordings from the calibration dataset, selected by the three defined RDS selection strategies: random selection (1st column), rhythm-based selection (2nd column), and amplitude-based selection (3rd column). VF: ventricular fibrillation; NSR: normal sinus rhythm; ONR: other non-shockable rhythm; ASYS: asystole; RMS: root mean square value of the ECG amplitude.
Figure 8. Statistical analysis of MAE(pVF) as a function of the representative dataset (RDS) for two types of quantized LiteRT DNN models: IntQ (top) and Full-IntQ (bottom). MAE(pVF) density distributions (violin plots) are calculated on the validation database for 100 LiteRT DNN models quantized with each RDS size (13 sizes, from 10 to 1000 training ECG recordings) and three RDS selection methods (random, rhythm-based, and amplitude-based). The validation MAE(pVF) for a LiteRT DNN model quantized on the total calibration database (4306 ECG recordings) is provided as a reference (right).
Figure 9. Comparison of the quantized LiteRT DNN models (IntQ on the left, Full-IntQ on the right) presenting the lowest mean absolute error MAE(pVF) on the validation database in each violin plot of Figure 8. The selected best quantized models (RDS = 80 ECG recordings by random selection) are highlighted.
Figure 10. pVF outputs of the TF DNN (reference) vs. the quantized LiteRT DNN (IntQ on top, Full-IntQ on bottom) on the test dataset, illustrating the performance results in Table 4 for a pVF threshold of 0.5. Observation points are presented as scatter plots and histograms for four rhythms: normal sinus rhythm, other non-shockable rhythms, asystole, and ventricular fibrillation.
Figure 11. Examples of 10 s ECG strips with opposite classification outcomes for the TF DNN (reference) vs. the quantized LiteRT DNN (IntQ on the left, Full-IntQ on the right) based on a decision threshold of pVF = 0.5. The cases are highlighted in Figure 10.
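The conversion-and-calibration workflow the abstract describes maps onto the standard TensorFlow-to-LiteRT API. The sketch below is a hedged illustration, not the authors' code: the stand-in model and the random 80-signal representative dataset (the RDS size the study found sufficient) are assumptions.

```python
# TF -> LiteRT conversion with post-training full-integer quantization.
import numpy as np
import tensorflow as tf

# Stand-in for the five-layer VF detector (10 s ECG at 125 Hz = 1250 samples).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1250, 1)),
    tf.keras.layers.Conv1D(8, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

calib_ecgs = np.random.randn(80, 1250, 1).astype(np.float32)  # placeholder RDS

def representative_dataset():
    for sample in calib_ecgs:          # one calibration example per step
        yield [sample[None, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Full-integer quantization: weights, activations, inputs, and outputs in int8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("vf_detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```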
27 pages, 3199 KiB  
Article
Hybrid CNN–BiLSTM–DNN Approach for Detecting Cybersecurity Threats in IoT Networks
by Bright Agbor Agbor, Bliss Utibe-Abasi Stephen, Philip Asuquo, Uduak Onofiok Luke and Victor Anaga
Computers 2025, 14(2), 58; https://doi.org/10.3390/computers14020058 - 10 Feb 2025
Viewed by 460
Abstract
The Internet of Things (IoT) ecosystem is rapidly expanding. It is driven by continuous innovation but accompanied by increasingly sophisticated cybersecurity threats. Protecting IoT devices from these emerging vulnerabilities has become a critical priority. This study addresses the limitations of existing IoT threat detection methods, which often struggle with the dynamic nature of IoT environments and the growing complexity of cyberattacks. To overcome these challenges, a novel hybrid architecture combining Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Deep Neural Networks (DNN) is proposed for accurate and efficient IoT threat detection. The model’s performance is evaluated using the IoT-23 and Edge-IIoTset datasets, which encompass over ten distinct attack types. The proposed framework achieves a remarkable 99% accuracy on both datasets, outperforming existing state-of-the-art IoT cybersecurity solutions. Advanced optimization techniques, including model pruning and quantization, are applied to enhance deployment efficiency in resource-constrained IoT environments. The results highlight the model’s robustness and its adaptability to diverse IoT scenarios, which address key limitations of prior approaches. This research provides a robust and efficient solution for IoT threat detection, establishing a foundation for advancing IoT security and addressing the evolving landscape of cyber threats while driving future innovations in the field.
(This article belongs to the Special Issue Multimedia Data and Network Security)
Show Figures

Figure 1. Proposed architecture.
Figure 2. Hybrid model structure.
Figure 3. BiLSTM architecture.
Figure 4. Binary classification flowchart.
Figure 5. Workflow for optimizing the hybrid model through pruning and quantization.
Figure 6. Confusion matrix.
Figure 7. Training and validation metrics for the IoT-23 dataset over 30 epochs.
Figure 8. Binary classification confusion matrix generated using the IoT-23 dataset.
Figure 9. Confusion matrix for multi-class classification based on the IoT-23 dataset.
Figure 10. Confusion matrix for binary classification based on the Edge-IIoTset dataset.
Figure 11. Confusion matrix depicting the outcomes of multi-class classification on the Edge-IIoTset dataset.
Figure 12. Accuracy curve for multi-class classification based on the Edge-IIoTset dataset.
Figure 13. Loss curve for multi-class classification based on the Edge-IIoTset dataset.
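As a rough illustration of the hybrid CNN → BiLSTM → DNN topology the abstract describes, here is a minimal Keras sketch; the layer sizes, 78-feature input width, and ten-class head are illustrative assumptions rather than the authors' exact configuration:

```python
# Hybrid intrusion-detection model: Conv1D features -> BiLSTM context -> dense head.
import tensorflow as tf

def build_hybrid(n_features=78, n_classes=10):
    inp = tf.keras.layers.Input(shape=(n_features, 1))
    x = tf.keras.layers.Conv1D(64, 3, activation="relu", padding="same")(inp)
    x = tf.keras.layers.MaxPooling1D(2)(x)                     # CNN: local patterns
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64))(x)                           # BiLSTM: sequence context
    x = tf.keras.layers.Dense(128, activation="relu")(x)       # DNN classification head
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_hybrid()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```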
15 pages, 1363 KiB  
Article
MSQuant: Efficient Post-Training Quantization for Object Detection via Migration Scale Search
by Zhesheng Jiang, Chao Li, Tao Qu, Chu He and Dingwen Wang
Electronics 2025, 14(3), 504; https://doi.org/10.3390/electronics14030504 - 26 Jan 2025
Viewed by 394
Abstract
YOLO (You Only Look Once) has become the dominant paradigm in real-time object detection. However, deploying real-time object detectors on resource-constrained platforms faces challenges due to high computational and memory demands. Quantization addresses this by compressing and accelerating CNN models through the representation of weights and activations with low-precision values. Nevertheless, the quantization difficulty between weights and activations is often imbalanced. In this work, we propose MSQuant, an efficient post-training quantization (PTQ) method for CNN-based object detectors, which balances the quantization difficulty between activations and weights through migration scale. MSQuant introduces the concept of migration scales to mitigate this disparity, thereby improving overall model accuracy. An alternating search method is employed to optimize the migration scales, avoiding local optima and reducing quantization error. We select YOLOv5 and YOLOv8 models as the PTQ baseline, followed by extensive experiments on the PASCAL VOC, COCO, and DOTA datasets to explore various combinations of quantization methods. The results demonstrate the effectiveness and robustness of MSQuant. Our approach consistently outperforms other methods, showing significant improvements in quantization performance and model accuracy.
(This article belongs to the Special Issue High-Performance Computing and AI Compression)
Show Figures

Figure 1. Violin plots of the model.6.cv2.conv layer in YOLOv5s and YOLOv8s. Significant differences are observed in the range of weights and activations within the same channel, indicating that the quantization difficulty for weights and activations is markedly different. (a) YOLOv5s weights distribution. (b) YOLOv8s weights distribution. (c) YOLOv5s activations distribution. (d) YOLOv8s activations distribution over the same channels. Longer outliers result in a more uneven data distribution, causing greater errors and making quantization more challenging.
Figure 2. Architecture of MSQuant. MSQuant first calculates the distributions of the original weights and activations, which are used to initialize the migration scales. The migrated weights and activations then undergo pseudo-quantization, and an iterative process optimizes the migration scales. Once the optimal migration scales are determined, the final quantization is performed on the new weights and activations, ensuring efficient and accurate model performance.
Figure 3. A 3D plot visualizing the impact of migration scales on quantization error: the migration scales for two channels lie on the x and y axes, with the mean squared error (MSE) loss on the z axis. (a) Loss for the first convolutional layer. (b) Loss for convolutional layers in the C3 block. The visualization shows that the search for optimal migration scales tends to be attracted to local minima.
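The migration-scale idea can be sketched in a few lines of numpy: per-channel scales s move quantization difficulty between activations (X / s) and weights (W · s) without changing the float product. The channel-wise grid search below is a simplified stand-in for the paper's alternating search, and all shapes and scale candidates are assumptions:

```python
# Search per-channel migration scales that minimize post-quantization MSE.
import numpy as np

def fake_quant(x, n_bits=8):
    """Symmetric uniform quantize-dequantize (per-tensor)."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1) + 1e-12
    q = np.round(x / scale).clip(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale

def search_migration_scales(X, W, candidates=np.linspace(0.2, 5.0, 49)):
    """X: (N, C) activations, W: (C, K) weights. Returns per-channel scales."""
    ref = X @ W                                  # float reference output
    s = np.ones(X.shape[1])
    for c in range(X.shape[1]):                  # one channel at a time
        errs = []
        for cand in candidates:
            t = s.copy()
            t[c] = cand
            out = fake_quant(X / t) @ fake_quant(W * t[:, None])
            errs.append(np.mean((out - ref) ** 2))
        s[c] = candidates[int(np.argmin(errs))]
    return s

# Toy data: every fourth channel has a 10x larger activation range (outliers).
X = np.random.randn(128, 16) * np.where(np.arange(16) % 4 == 0, 10.0, 1.0)
W = np.random.randn(16, 8)
scales = search_migration_scales(X, W)
```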
10 pages, 638 KiB  
Article
Efficient Quantization and Data Access for Accelerating Homomorphic Encrypted CNNs
by Kai Chen, Xinyu Wang, Yuxiang Fu and Li Li
Electronics 2025, 14(3), 464; https://doi.org/10.3390/electronics14030464 - 23 Jan 2025
Viewed by 492
Abstract
Due to the ability to perform computations directly on encrypted data, homomorphic encryption (HE) has recently become an important branch of privacy-preserving machine learning (PPML) implementation. Nevertheless, existing implementations of HE-based convolutional neural network (HCNN) applications are not satisfactory in inference latency and area efficiency compared to the unencrypted version. In this work, we first improve the additive powers-of-two (APoT) quantization method for HCNN to achieve a better tradeoff between the complexity of modular multiplication and the network accuracy. An efficient multiplicationless modular multiplier–accumulator (M-MAC) unit is accordingly designed. Furthermore, a batch-processing HCNN accelerator with M-MACs is implemented, in which we propose an advanced data partition scheme to avoid multiple moves of the large-size ciphertext polynomials. Compared to the latest FPGA design, our accelerator can achieve 11× resource reduction of an M-MAC and 2.36× speedup in inference latency for a widely used CNN-11 network to process 8K images. The speedup of our design is also significant compared to the latest CPU and GPU implementations of the batch-processing HCNN models.
Show Figures

Figure 1. Comparison between different quantization schemes for the third convolution layer in CNN-11.
Figure 2. M-MAC architecture.
Figure 3. Overall architecture of the CNN accelerator.
Figure 4. Dataflow of the convolution layer.
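The appeal of APoT quantization is that each level is a sum of power-of-two terms, so a multiply decomposes into shifts and adds, which is what a multiplicationless MAC exploits. The numpy sketch below illustrates the idea; the particular exponent sets are illustrative assumptions, not the paper's configuration:

```python
# Additive powers-of-two (APoT) quantization: levels of the form ±(2^a + 2^b).
import numpy as np
from itertools import product

def apot_levels(exp_set1=(0, -2, -4), exp_set2=(-1, -3, -5)):
    """Build the level set: zero plus ±(2^a + 2^b) for a, b in the two sets."""
    mags = sorted({2.0 ** a + 2.0 ** b for a, b in product(exp_set1, exp_set2)})
    return np.array(sorted([0.0] + mags + [-m for m in mags]))

def apot_quantize(w, levels):
    """Round each weight to its nearest APoT level."""
    w = np.clip(w, levels.min(), levels.max())
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

levels = apot_levels()
w = np.random.randn(64) * 0.5
w_q = apot_quantize(w / np.abs(w).max(), levels)   # normalize, then quantize
```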
21 pages, 5845 KiB  
Article
FPGA-QNN: Quantized Neural Network Hardware Acceleration on FPGAs
by Mustafa Tasci, Ayhan Istanbullu, Vedat Tumen and Selahattin Kosunalp
Appl. Sci. 2025, 15(2), 688; https://doi.org/10.3390/app15020688 - 12 Jan 2025
Viewed by 867
Abstract
Recently, convolutional neural networks (CNNs) have received a massive amount of interest due to their ability to achieve high accuracy in various artificial intelligence tasks. With the development of complex CNN models, a significant drawback is their high computational burden and memory requirements. The performance of a typical CNN model can be enhanced by the improvement of hardware accelerators. Practical implementations on field-programmable gate arrays (FPGAs) have the potential to reduce resource utilization while maintaining low power consumption. Nevertheless, complex CNN models may require computational and memory capacities exceeding those available on many current FPGAs. An effective solution to this issue is to use quantized neural network (QNN) models to remove the burden of full-precision weights and activations. This article proposes an accelerator design framework for FPGAs, called FPGA-QNN, with particular value in reducing the high computational burden and memory requirements of CNN implementations. To approach this goal, FPGA-QNN exploits the basics of QNN models by converting the high burden of full-precision weights and activations into integer operations. The FPGA-QNN framework comprises 12 accelerators based on multi-layer perceptron (MLP) and LeNet CNN models, each associated with a specific combination of quantization and folding. Performance evaluations on the Xilinx PYNQ Z1 development board demonstrated the superiority of FPGA-QNN in terms of resource utilization and energy efficiency in comparison to several recent approaches. The proposed MLP model classified the FashionMNIST dataset at a speed of 953 kFPS with 1019 GOPs while consuming 2.05 W.
(This article belongs to the Special Issue Advancements in Deep Learning and Its Applications)
Show Figures

Figure 1. A view of the entire system.
Figure 2. (a) Single-layer perceptron vs. (b) multi-layer perceptron.
Figure 3. A demonstration of the LeNet-5 architecture.
Figure 4. The running mechanism of the QAT and PTQ quantization directions.
Figure 5. A demonstration of 8-bit quantization in Brevitas.
Figure 6. An example scenario: (a) standard quantization with 32-bit weights; (b) BNN with 1-bit weights.
Figure 7. The main body of the FINN framework with 4 steps.
Figure 8. (a) Single processing element (PE) and (b) matrix–vector–threshold unit in the FINN framework.
Figure 9. Illustration of four separate matrix multiplications resulting from folding.
Figure 10. The MLP model implemented for acceleration.
Figure 11. The LeNet model implemented for acceleration.
Figure 12. Training-phase accuracy and loss plots for the quantized MLP and LeNet models.
Figure 13. Transformations applied to models in FINN.
Figure 14. The blocks of the developed accelerator hardware.
Figure 15. FPGA and CPU accuracy graph for the MLP and LeNet models.
Figure 16. Accuracy and timing analysis of FPGA and CPU platforms to observe the effects of precision and folding configurations.
Figure 17. FPGA resource consumption by model, precision, and folding.
Figure 18. Xilinx Vivado power estimation tool and actual power measurement based on quantization levels.
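A QNN of the kind FPGA-QNN feeds into the FINN flow is typically expressed with Brevitas quantized layers. The sketch below is a minimal assumed example (the 2-bit widths and layer sizes are illustrative; the paper explores several quantization/folding combinations):

```python
# A small quantization-aware MLP in Brevitas, exportable to FINN-ONNX.
import torch.nn as nn
from brevitas.nn import QuantIdentity, QuantLinear, QuantReLU

class QuantMLP(nn.Module):
    def __init__(self, bit_width=2, hidden=64, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            QuantIdentity(bit_width=bit_width),            # quantize the input
            nn.Flatten(),
            QuantLinear(28 * 28, hidden, bias=True,
                        weight_bit_width=bit_width),       # low-bit weights
            QuantReLU(bit_width=bit_width),                # low-bit activations
            QuantLinear(hidden, n_classes, bias=True,
                        weight_bit_width=bit_width),
        )

    def forward(self, x):
        return self.net(x)

model = QuantMLP()   # train with QAT, then export to the FINN compiler flow
```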
13 pages, 1853 KiB  
Article
Optimizing Deep Learning Acceleration on FPGA for Real-Time and Resource-Efficient Image Classification
by Ahmad Mouri Zadeh Khaki and Ahyoung Choi
Appl. Sci. 2025, 15(1), 422; https://doi.org/10.3390/app15010422 - 5 Jan 2025
Cited by 1 | Viewed by 1117
Abstract
Deep learning (DL) has revolutionized image classification, yet deploying convolutional neural networks (CNNs) on edge devices for real-time applications remains a significant challenge due to constraints in computation, memory, and power efficiency. This work presents an optimized implementation of VGG16 and VGG19, two widely used CNN architectures, for classifying the CIFAR-10 dataset using transfer learning on field-programmable gate arrays (FPGAs). Utilizing the Xilinx Vitis-AI and TensorFlow2 frameworks, we adapt VGG16 and VGG19 for FPGA deployment through quantization, compression, and hardware-specific optimizations. Our implementation achieves high classification accuracy, with Top-1 accuracy of 89.54% and 87.47% for VGG16 and VGG19, respectively, while delivering significant reductions in inference latency (7.29× and 6.6× compared to CPU-based alternatives). These results highlight the suitability of our approach for resource-efficient, real-time edge applications. Key contributions include a detailed methodology for combining transfer learning with FPGA acceleration, an analysis of hardware resource utilization, and performance benchmarks. This work underscores the potential of FPGA-based solutions to enable scalable, low-latency DL deployments in domains such as autonomous systems, IoT, and mobile devices.
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Show Figures

Figure 1. Workflow of this study for implementing VGG16 and VGG19 on FPGA using the Xilinx Vitis-AI framework.
Figure 2. VGG16 and VGG19 model architecture using transfer learning in this work.
Figure 3. Conceptual pipeline of the Xilinx Vitis-AI.
Figure 4. Confusion matrices for CIFAR-10 test-set image classification by our FPGA-based models: (a) VGG16; (b) VGG19.
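The transfer-learning setup the abstract describes corresponds to a standard Keras pattern: a frozen ImageNet backbone with a small CIFAR-10 head, producing the float model one would then hand to Vitis-AI for quantization. The head sizes below are assumptions:

```python
# VGG16 transfer learning for CIFAR-10: freeze the backbone, train a new head.
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(32, 32, 3))
base.trainable = False                    # keep the pretrained features fixed
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),   # CIFAR-10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```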
14 pages, 2382 KiB  
Article
Edge-AI Enabled Wearable Device for Non-Invasive Type 1 Diabetes Detection Using ECG Signals
by Maria Gragnaniello, Vincenzo Romano Marrazzo, Alessandro Borghese, Luca Maresca, Giovanni Breglio and Michele Riccio
Bioengineering 2025, 12(1), 4; https://doi.org/10.3390/bioengineering12010004 - 24 Dec 2024
Cited by 1 | Viewed by 801
Abstract
Diabetes is a chronic condition, and traditional monitoring methods are invasive, significantly reducing the quality of life of the patients. This study proposes the design of an innovative system based on a microcontroller that performs real-time ECG acquisition and evaluates the presence of diabetes using an Edge-AI solution. A spectrogram-based preprocessing method is combined with a 1-Dimensional Convolutional Neural Network (1D-CNN) to analyze the ECG signals directly on the device. By applying quantization as an optimization technique, the model effectively balances memory usage and accuracy, achieving an accuracy of 89.52% with an average precision and recall of 0.91 and 0.90, respectively. These results were obtained with a minimal memory footprint of 347 kB flash and 23 kB RAM, showcasing the system’s suitability for wearable embedded devices. Furthermore, a custom PCB was developed to validate the system in a real-world scenario. The hardware integrates high-performance electronics with low power consumption, demonstrating the feasibility of deploying Edge-AI for non-invasive, real-time diabetes detection in resource-constrained environments. This design represents a significant step forward in improving the accessibility and practicality of diabetes monitoring.
(This article belongs to the Special Issue Monitoring and Analysis of Human Biosignals, Volume II)
Show Figures

Figure 1. Schematic block diagram of the PCB design. The main components include the Analog Front-End (AFE) section using the MAX30003, the Microcontroller Unit (MCU) based on the STM32F401, and the output section using Bluetooth Low Energy (BLE) for communication.
Figure 2. Rendering of the top and bottom layers of the custom PCB. The electrodes are highlighted on the left side.
Figure 3. The overall procedure began with the creation of a dataset from D1NAMO. These data were then processed through spectrogram analysis, followed by CNN inference, which was used to display the results.
Figure 4. Example waveforms of (a) a diabetic ECG signal and (b) a healthy ECG.
Figure 5. Schematic representation of the neural network design. The structure consists of input layers for ECG data, followed by key processing layers, leading to the final classification output.
Figure 6. Confusion matrix.
Figure 7. Graphical results from the ST Edge AI Developer Cloud following the benchmark test.
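A spectrogram-plus-1D-CNN pipeline of the kind outlined above can be sketched as follows; the sampling rate, window parameters, and layer sizes are assumptions, not the authors' configuration:

```python
# ECG window -> log spectrogram -> small 1D-CNN classifier (untrained stand-in).
import numpy as np
import tensorflow as tf
from scipy.signal import spectrogram

fs = 250                                            # assumed ECG sampling rate (Hz)
ecg = np.random.randn(10 * fs).astype(np.float32)   # placeholder 10 s strip
f, t, Sxx = spectrogram(ecg, fs=fs, nperseg=64, noverlap=32)
x = np.log1p(Sxx.T)[None, ...]                      # (1, time_bins, freq_bins)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=x.shape[1:]),
    tf.keras.layers.Conv1D(16, 3, activation="relu"),   # 1D conv over time
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # diabetic vs. healthy
])
prob = model(x)   # quantize with LiteRT to fit the MCU's flash/RAM budget
```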
18 pages, 1732 KiB  
Article
A One-Dimensional Depthwise Separable Convolutional Neural Network for Bearing Fault Diagnosis Implemented on FPGA
by Yu-Pei Liang, Hao Chen and Ching-Che Chung
Sensors 2024, 24(23), 7831; https://doi.org/10.3390/s24237831 - 7 Dec 2024
Viewed by 867
Abstract
This paper presents a hardware implementation of a one-dimensional convolutional neural network using depthwise separable convolution (DSC) on the VC707 FPGA development board. The design processes the one-dimensional rolling bearing current signal dataset provided by Paderborn University (PU), employing minimal preprocessing to maximize the comprehensiveness of feature extraction. To address the high parameter demands commonly associated with convolutional neural networks (CNNs), the model incorporates DSC, significantly reducing computational complexity and parameter load. Additionally, the DoReFa-Net quantization method is applied to compress network parameters and activation function outputs, thereby minimizing memory usage. The quantized DSC model requires approximately 22 KB of storage and performs 1,203,128 floating-point operations in total. The implementation achieves a power consumption of 527 mW at a clock frequency of 50 MHz, while delivering a fault diagnosis accuracy of 96.12%.
(This article belongs to the Special Issue Feature Papers in Physical Sensors 2024)
Show Figures

Figure 1. The down-sampling operation, with the down-sampling factor set to 3.
Figure 2. Overview of the proposed 1-D DSC architecture.
Figure 3. The operation of DSC in the first layer.
Figure 4. Overall architecture of the proposed DSC hardware design.
Figure 5. Power analysis of the proposed DSC hardware design at 50 MHz.
Figure 6. Confusion matrix of the software training result.
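The DoReFa-Net weight quantization named above follows a simple recipe: squash the weights with tanh, map to [0, 1], quantize uniformly to k bits, then map back to [-1, 1]. A minimal numpy sketch (shapes and bit width are illustrative):

```python
# DoReFa-Net style k-bit weight quantization.
import numpy as np

def quantize_k(x, k):
    """Uniform k-bit quantizer on [0, 1]."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def dorefa_weights(w, k=8):
    t = np.tanh(w)
    wn = t / (2 * np.abs(t).max()) + 0.5    # squash, then map to [0, 1]
    return 2 * quantize_k(wn, k) - 1        # k-bit levels in [-1, 1]

w = np.random.randn(3, 3, 16) * 0.3
w_q = dorefa_weights(w, k=8)
```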
21 pages, 12287 KiB  
Article
An Optimised CNN Hardware Accelerator Applicable to IoT End Nodes for Disruptive Healthcare
by Arfan Ghani, Akinyemi Aina and Chan Hwang See
IoT 2024, 5(4), 901-921; https://doi.org/10.3390/iot5040041 - 6 Dec 2024
Viewed by 1029
Abstract
In the evolving landscape of computer vision, the integration of machine learning algorithms with cutting-edge hardware platforms is increasingly pivotal, especially in the context of disruptive healthcare systems. This study introduces an optimized implementation of a Convolutional Neural Network (CNN) on the Basys3 FPGA, designed specifically for accelerating the classification of cytotoxicity in human kidney cells. Addressing the challenges posed by constrained dataset sizes, compute-intensive AI algorithms, and hardware limitations, the approach presented in this paper leverages efficient image augmentation and pre-processing techniques to enhance both prediction accuracy and training efficiency. The CNN, quantized to 8-bit precision and tailored to the FPGA’s resource constraints, accelerates training by a factor of three while consuming only 1.33% of the power of a traditional software-based CNN running on an NVIDIA K80 GPU. The network architecture, composed of seven layers with excessive hyperparameters, processes downscaled grayscale images, achieving notable gains in speed and energy efficiency. A cornerstone of our methodology is the emphasis on parallel processing, data type optimization, and reduced logic space usage through 8-bit integer operations. We conducted extensive image pre-processing, including histogram equalization and artefact removal, to maximize feature extraction from the augmented dataset. Achieving an accuracy of approximately 91% on unseen images, this FPGA-implemented CNN demonstrates the potential for rapid, low-power medical diagnostics within a broader IoT ecosystem where data could be assessed online. This work underscores the feasibility of deploying resource-efficient AI models in environments where traditional high-performance computing resources are unavailable, typically in healthcare settings, paving the way for and contributing to advanced computer vision techniques in embedded systems.
(This article belongs to the Topic Machine Learning in Internet of Things II)
Show Figures

Figure 1. Augmented highly stressed image.
Figure 2. Adjusted augmented image with histogram.
Figure 3. Pre-processing and FPGA training pipeline.
Figure 4. Custom CNN layers.
Figure 5. PC-to-FPGA training flowchart.
Figure 6. (a) Control software on the PC connected to the FPGA. (b) PMOD SD card reader connected to the Basys3 FPGA. (c) PC connected to the Basys3 FPGA via micro-USB.
Figure 7. Highly stressed image classified correctly by the FPGA CNN.
Figure 8. Highly stressed sample classification performance.
Figure 9. Moderately stressed sample classification performance.
Figure 10. Normal sample classification performance.
Figure 11. High-level parallelization block diagram.
Figure 12. HDL output logic block diagram.
Figure 13. HDL output logic block diagram with extended detail.
Figure 14. Visual representation of the bounded ReLU activation-based softmax.
Figure 15. MATLAB custom CNN training accuracy and loss.
Figure 16. MATLAB validation accuracy and associated parameters.
Figure 17. Multiclass confusion matrix in MATLAB.
Figure 18. Multiclass classification performance of the FPGA CNN model on unseen images.
Figure 19. Accuracy of highly stressed samples against moderately stressed samples.
Figure 20. Loss rate of highly stressed samples against moderately stressed samples.
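The 8-bit integer arithmetic the abstract emphasizes boils down to affine quantization: map floats to int8 with a scale and zero point, accumulate in integers, and apply one float rescale at the end. A generic numpy sketch of that pattern (not the authors' FPGA logic):

```python
# Affine int8 quantization and an integer multiply-accumulate.
import numpy as np

def quant_params(x, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin) + 1e-12
    zp = int(round(qmin - x.min() / scale))     # zero point aligns x.min to qmin
    return scale, zp

def quantize(x, scale, zp):
    return np.clip(np.round(x / scale) + zp, -128, 127).astype(np.int8)

a, w = np.random.randn(64), np.random.randn(64)
(sa, za), (sw, zw) = quant_params(a), quant_params(w)
qa, qw = quantize(a, sa, za), quantize(w, sw, zw)

# Integer MAC in int32, then a single float rescale recovers the dot product.
acc = np.sum((qa.astype(np.int32) - za) * (qw.astype(np.int32) - zw))
approx = sa * sw * acc
print(approx, float(a @ w))   # close agreement
```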
17 pages, 6810 KiB  
Article
Breast Tumor Detection and Diagnosis Using an Improved Faster R-CNN in DCE-MRI
by Haitian Gui, Han Jiao, Li Li, Xinhua Jiang, Tao Su and Zhiyong Pang
Bioengineering 2024, 11(12), 1217; https://doi.org/10.3390/bioengineering11121217 - 1 Dec 2024
Cited by 1 | Viewed by 1068
Abstract
AI-based breast cancer detection can improve the sensitivity and specificity of detection, especially for small lesions, which has clinical value in realizing early detection and treatment so as to reduce mortality. The two-stage detection network performs well; however, it adopts an imprecise ROI during classification, which can easily include surrounding tumor tissues. Additionally, fuzzy noise is a significant contributor to false positives. We adopted Faster R-CNN as the architecture, introduced ROI Align to minimize quantization errors and a feature pyramid network (FPN) to extract features at different resolutions, added a bounding box quadratic regression feature map extraction network and three convolutional layers to reduce interference from tissue surrounding the tumor, and extracted more accurate and deeper feature maps. Our approach outperformed Faster R-CNN, Mask R-CNN, and YOLOv9 in breast cancer detection across 485 internal cases. We achieved superior performance in mAP, sensitivity, and false positive rate ((0.752, 0.950, 0.133) vs. (0.711, 0.950, 0.200) vs. (0.718, 0.880, 0.120) vs. (0.658, 0.680, 0.405)), which represents a 38.5% reduction in false positives compared to manual detection. Additionally, in a public dataset of 220 cases, our model also demonstrated the best performance. It showed improved sensitivity and specificity, effectively assisting doctors in diagnosing cancer.
Show Figures

Figure 1. Flowchart of the study procedure.
Figure 2. The architecture of our proposed model, BC R-CNN.
Figure 3. PDN structure.
Figure 4. Four-quadrant location.
Figure 5. Background noise reduction: (a) MRI before noise reduction; (b) MRI after noise reduction; (c) segmented breast of the original MRI; (d) segmented breast of the noise-reduced MRI.
Figure 6. U-Net++ breast edge segmentation: (a) sagittal breast MRI and single breast MRI at the axial plane; (b) masks; (c) segmented breasts.
Figure 7. AUC performance comparison of different models: (a) internal dataset; (b) public dataset.
Figure 8. Breast tumor location and diagnosis comparison of Faster R-CNN and our proposed model: (a1,b1,c1) detected by Faster R-CNN; (a2,b2,c2) detected by our proposed model; (a1,b1) false positives; (c1) diagnosed with a lower score; (c2) diagnosed with a higher score.
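The ROI Align step the abstract credits with minimizing quantization errors is available off the shelf in torchvision; here is a short usage sketch with illustrative tensor sizes (the feature stride and box are assumptions):

```python
# ROI Align: bilinear sampling avoids the coordinate rounding of ROI pooling.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 64, 64)        # one FPN level, stride 4 (assumed)
# Boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
rois = torch.tensor([[0.0, 40.0, 40.0, 120.0, 160.0]])
pooled = roi_align(features, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 4, sampling_ratio=2, aligned=True)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
```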
10 pages, 337 KiB  
Communication
InMemQK: A Product Quantization Based MatMul Module for Compute-in-Memory Attention Macro
by Pengcheng Feng, Yihao Chen, Jinke Yu, Hao Yue, Zhelong Jiang, Yi Xiao, Wan’ang Xiao, Huaxiang Lu and Gang Chen
Appl. Sci. 2024, 14(23), 11198; https://doi.org/10.3390/app142311198 - 1 Dec 2024
Viewed by 758
Abstract
Large Language Models (LLMs), based on transformer architecture, have demonstrated remarkable capabilities in natural language processing tasks, enabling machines to generate human-like text and engage in meaningful dialogues. However, the exponential increase in model parameters has led to limitations in inference speed and energy efficiency. Compute-in-memory (CIM) technology offers a promising solution to accelerate AI inference by performing analog computations directly within memory, potentially reducing latency and power consumption. At the same time, CIM has been successfully applied to accelerate Convolutional Neural Networks (CNNs); however, the matrix–matrix multiplication (MatMul) operations inherent in the scaled dot-product attention of the transformer present unique challenges for direct CIM implementation. In this work, we propose InMemQK, a compute-in-memory-based attention accelerator that focuses on optimizing MatMul operations through software and hardware co-design. At the software level, InMemQK employs product quantization (PQ) to eliminate data dependencies. At the hardware level, InMemQK integrates energy-efficient time-domain MAC macros for ADC-free computations. Experimental results show InMemQK achieves 13.2×–13.9× lower power consumption than existing CIM-based accelerators.
(This article belongs to the Section Computing and Artificial Intelligence)
Show Figures

Figure 1. Transformer model structure and MatMul methods in CIM: (a) multi-head self-attention module; (b) MatMul calculations in ISAAC, Timely [15], etc.; (c) MatMul calculations in ReTransformer; (d) the proposed MatMul method.
Figure 2. (a) Voltage-domain MAC based on Ohm's law. (b) Time-domain MAC based on current integration. (c) The normalized energy of different interfaces, where e_DAC, e_ADC, e_DTC, and e_TDC denote the energy of one DAC, ADC, DTC, and TDC, respectively.
Figure 3. Schematic diagram of product quantization: (a) vector splitting; (b) clustering and representation; (c) the two steps of approximate MatMul.
Figure 4. (a) Time-domain multiplication–accumulation macro. (b) Two-stage MatMul calculation pipeline.
Figure 5. (a) Performance comparison with D-MAC. (b) InMemQK power saving over JSSC'2024 and JSSC'2022 (mapping only the relevance computing).
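Product quantization, the software half of InMemQK's co-design, is easy to sketch: split each vector into sub-vectors, cluster each sub-space with k-means, and approximate dot products by lookup tables. The dimensions and centroid counts below are illustrative:

```python
# Product quantization (PQ): approximate dot products via per-subspace lookups.
import numpy as np
from sklearn.cluster import KMeans

d, m, k = 64, 4, 16                  # dim, sub-vectors, centroids per sub-space
keys = np.random.randn(1000, d).astype(np.float32)
sub = keys.reshape(1000, m, d // m)

codebooks, codes = [], []
for j in range(m):                   # one k-means per sub-space
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub[:, j])
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_)
codes = np.stack(codes, axis=1)      # (1000, m) compact codes per key vector

query = np.random.randn(d).astype(np.float32)
q_sub = query.reshape(m, d // m)
tables = np.stack([codebooks[j] @ q_sub[j] for j in range(m)])  # (m, k) LUTs
approx = tables[np.arange(m), codes].sum(axis=1)   # table lookup + add per key
exact = keys @ query
print(np.corrcoef(approx, exact)[0, 1])            # high correlation
```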
20 pages, 691 KiB  
Article
DiscHAR: A Discrete Approach to Enhance Human Activity Recognition in Cyber Physical Systems: Smart Homes
by Ishrat Fatima, Asma Ahmad Farhan, Maria Tamoor, Shafiq ur Rehman, Hisham Abdulrahman Alhulayyil and Fawaz Tariq
Computers 2024, 13(11), 300; https://doi.org/10.3390/computers13110300 - 19 Nov 2024
Viewed by 832
Abstract
The main challenges in smart home systems and cyber-physical systems come from not having enough data and unclear interpretation; thus, there is still a lot to be done in this field. In this work, we propose a practical approach called Discrete Human Activity Recognition (DiscHAR) based on prior research to enhance Human Activity Recognition (HAR). Our goal is to generate diverse data to build better models for activity classification. To tackle overfitting, which often occurs with small datasets, we generate data and convert them into discrete forms, improving classification accuracy. Our methodology includes advanced techniques like the R-Frame method for sampling and the Mixed-up approach for data generation. We apply K-means vector quantization to categorize the data, and through the elbow method, we determine the optimal number of clusters. The discrete sequences are converted into one-hot encoded vectors and fed into a CNN model to ensure precise recognition of human activities. Evaluations on the OPP79, PAMAP2, and WISDM datasets show that our approach outperforms existing models, achieving 89% accuracy for OPP79, 93.24% for PAMAP2, and 100% for WISDM. These results demonstrate the model’s effectiveness in identifying complex activities captured by wearable devices. Our work combines theory and practice to address ongoing challenges in this field, aiming to improve the reliability and performance of activity recognition systems in dynamic environments.
Show Figures

Figure 1. High-level architecture of DiscHAR.
Figure 2. Elbow method to determine clusters within each activity class; the x axis shows the number of clusters and the y axis represents distortion.
Figure 3. Detailed overview of the CNN model used in DiscHAR.
Figure 4. F1 score for the OPP79 [34] dataset; the x axis shows epochs and the y axis represents the F1 score.
Figure 5. Loss curve for the OPP79 [34] dataset; the x axis shows epochs and the y axis represents the loss.
Figure 6. F1 score for the PAMAP2 [30] dataset; the x axis shows epochs and the y axis represents the F1 score.
Figure 7. Loss curve for the PAMAP2 [30] dataset; the x axis shows epochs and the y axis represents the loss.
Figure 8. Accuracy for the WISDM [38] dataset; the x axis shows epochs and the y axis represents accuracy for different learning rates.
Figure 9. Loss curve for the WISDM [38] dataset; the x axis shows epochs and the y axis represents the loss for different learning rates.
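The discretization step DiscHAR describes (K-means vector quantization, cluster count picked by the elbow method, then one-hot encoding for the CNN) reduces to a few lines; the window shape and k below are illustrative assumptions:

```python
# Vector-quantize sensor frames into discrete symbols, then one-hot encode.
import numpy as np
from sklearn.cluster import KMeans

windows = np.random.randn(500, 30)        # 500 frames of 30 features (assumed)
k = 8                                     # assumed value chosen via the elbow method
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(windows)
symbols = km.predict(windows)             # discrete sequence, values in [0, k)
one_hot = np.eye(k, dtype=np.float32)[symbols]   # (500, k) CNN-ready input
```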
18 pages, 1757 KiB  
Article
End-to-End Deployment of Winograd-Based DNNs on Edge GPU
by Pierpaolo Mori, Mohammad Shanur Rahman, Lukas Frickenstein, Shambhavi Balamuthu Sampath, Moritz Thoma, Nael Fasfous, Manoj Rohit Vemparala, Alexander Frickenstein, Walter Stechele and Claudio Passerone
Electronics 2024, 13(22), 4538; https://doi.org/10.3390/electronics13224538 - 19 Nov 2024
Viewed by 942
Abstract
The Winograd algorithm reduces the computational complexity of convolutional neural networks (CNNs) by minimizing the number of multiplications required for convolutions, making it particularly suitable for resource-constrained edge devices. Concurrently, most edge hardware accelerators utilize 8-bit integer arithmetic to enhance energy efficiency and reduce inference latency, requiring the quantization of CNNs before deployment. Combining Winograd-based convolution with quantization offers the potential for both performance acceleration and reduced energy consumption. However, prior research has identified significant challenges in this combination, particularly due to numerical instability and substantial accuracy degradation caused by the transformations required in the Winograd domain, making the two techniques incompatible on edge hardware. In this work, we describe our latest training scheme, which addresses these challenges, enabling the successful integration of Winograd-accelerated convolution with low-precision quantization while maintaining high task-related accuracy. Our approach mitigates the numerical instability typically introduced during the transformation, ensuring compatibility between the two techniques. Additionally, we extend our work by presenting a custom-optimized CUDA implementation of quantized Winograd convolution for NVIDIA edge GPUs. This implementation takes full advantage of the proposed training scheme, achieving both high computational efficiency and accuracy, making it a compelling solution for edge-based AI applications. Our training approach enables significant MAC reduction with minimal impact on prediction quality. Furthermore, our hardware results demonstrate up to a 3.4× latency reduction for specific layers, and a 1.44× overall reduction in latency for the entire DeepLabV3 model, compared to the standard implementation.
(This article belongs to the Section Artificial Intelligence)
Show Figures

Figure 1. The three steps of the F(4,3) Winograd algorithm: (1) input and weight transformation, (2) element-wise matrix multiplication (EWMM) of the transformed matrices, and (3) inverse transformation to produce the spatial output feature maps. The numerical instability due to quantization is highlighted.
Figure 2. Comparison of (a) the standard quantized Winograd transformation against (b) the quantized Winograd transformation that leverages trainable clipping factors to better exploit the quantized range.
Figure 3. Overview of the proposed Winograd-aware quantized training. A straight-through estimator (STE) approximates the gradient of the quantization function. The trainable clipping factors c, α_ta, and α_tw are highlighted in red.
Figure 4. Input transformation kernel overview. The input volume is divided into sub-volumes, and each thread block is responsible for transforming one sub-volume.
Figure 5. Element-wise matrix multiplication kernel overview. The computation is organized as 6 × 6 GEMMs, each responsible for computing N_tiles × C_o output pixels in the Winograd domain.
Figure 6. Inverse transformation kernel overview. The Winograd tiles produced by the EWMM kernel are transformed back to the spatial domain. Each thread block is responsible for computing a 4 × 4 × P_oc block of output pixels.
Figure 7. Numerical distributions of example layers for transformed weights and activations of ResNet-20 on CIFAR-10. The values in the clipped range (green) contain sufficient information to maintain high-accuracy full 8-bit Winograd.
Figure 8. Latency speedup of the custom Winograd F(4,3) kernels compared to cuDNN convolution on Tensor Cores (int8x32).
Figure 9. The latency contribution of each of the three steps in the Winograd F(4,3) algorithm. In each sub-figure, the spatial dimensions are fixed while the channel dimensions are varied.
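To ground the three-step structure above, here is a worked numpy example of 1-D Winograd F(2,3), the small-tile relative of the F(4,3) variant the paper accelerates: two outputs of a 3-tap convolution are computed from four inputs with four multiplications instead of six. Quantizing the transformed values (kept in float here) is exactly where the numerical-instability issue arises.

```python
# Winograd F(2,3): y = A^T [ (G g) ⊙ (B^T d) ], verified against direct correlation.
import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]], dtype=np.float64)   # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)    # inverse transform

d = np.random.randn(4)                     # input tile
g = np.random.randn(3)                     # 3-tap filter
y_winograd = AT @ ((G @ g) * (BT @ d))     # only 4 element-wise multiplies
y_direct = np.convolve(d, g[::-1], mode="valid")     # reference correlation
print(np.allclose(y_winograd, y_direct))   # True
```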
19 pages, 4771 KiB  
Article
Intelligent Fault Diagnosis Method Based on Neural Network Compression for Rolling Bearings
by Xinren Wang, Dongming Hu, Xueqi Fan, Huiyi Liu and Chenbin Yang
Symmetry 2024, 16(11), 1461; https://doi.org/10.3390/sym16111461 - 4 Nov 2024
Viewed by 1162
Abstract
Rolling bearings are often exposed to high speeds and pressures, leading to the symmetry in their rotating structure being disrupted, which can lead to serious failures. Intelligent rolling bearing fault diagnosis is a critical part of ensuring operation of machinery, and it has been facilitated by the growing popularity of convolutional neural networks (CNNs). The outstanding performance of fault diagnosis CNNs results from complex and redundant network structures and parameters, resulting in huge storage and computational requirements, which makes it challenging to implement these models in resource-limited industrial devices. This study aims to address this problem by proposing a comprehensive compression method for CNNs that is applied to intelligent fault diagnosis. It involves several different compression methods, including tensor train decomposition, parameter quantization, and knowledge distillation for deep network compression. This results in a significant decrease in redundancy and speeding up the training of CNN models. Firstly, tensor train decomposition is applied to reduce redundant connections in both convolutional and fully connected layers. The next step is to perform parameter quantization to minimize the bits needed for parameter representation and storage. Finally, knowledge distillation is used to restore accuracy to the compressed model. The effectiveness of the proposed approach is confirmed by an experiment and ablation study with different models on several datasets. The results show that it can significantly reduce redundant information and floating-point operations with little degradation in accuracy. Notably, on the CWRU dataset, with about 60% parameter reduction, there is no degradation in our model’s accuracy. The proposed approach is a new attempt at the intelligent fault diagnosis of rolling bearings in industrial equipment.
Show Figures

Figure 1. Tensor train decomposition (TTD) of a 4-order tensor, where the dimensions r_0 and r_4 are always set to 1.
Figure 2. The process of quantization by weight-sharing clustering. Assuming the weights in a convolutional layer form a 4 × 4 matrix, the weights are quantized into four categories, each represented by a different color, and all weights in the same category share the same value. Each weight therefore requires only a tiny 2-bit cluster index.
Figure 3. Flowchart of the compression method for bearing fault diagnosis in this work.
Figure 4. The experimental platforms of the two bearing datasets: (a) experimental equipment of the XJTU-SY dataset [40]; (b) experimental equipment of the CWRU dataset [41].
Figure 5. Comparison of response time with different methods on the XJTU-SY dataset.
Figure 6. Accuracy curves before and after knowledge distillation with different compression rates on the XJTU-SY dataset: (a) accuracy curve of CNN-1; (b) accuracy curve of CNN-2.
Figure 7. Accuracy and loss curves with different retraining methods on the XJTU-SY dataset: (a) accuracy curve of CNN-1; (b) loss curve of CNN-1; (c) accuracy curve of CNN-2; (d) loss curve of CNN-2.
Figure 8. Confusion matrices of CNN-1 and CNN-2 on the XJTU-SY dataset: (a) original CNN-1 before compression; (b) compressed CNN-1 before knowledge distillation; (c) compressed CNN-1 after knowledge distillation; (d) original CNN-2 before compression; (e) compressed CNN-2 before knowledge distillation; (f) compressed CNN-2 after knowledge distillation.
Figure 9. Accuracy curves and parameter counts at different compression stages of the CNN-1 model on the XJTU-SY dataset: (a) the whole compression process; (b) the compression process without parameter quantization; (c) the compression process without tensor train decomposition.
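The knowledge-distillation step used to restore accuracy after decomposition and quantization follows the standard soft-target recipe; a PyTorch sketch is below, with T and alpha as typical values rather than the paper's settings:

```python
# Distillation loss: KL to the teacher's softened logits + hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # soft-target term
    hard = F.cross_entropy(student_logits, labels)     # ground-truth term
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)   # compressed student logits (batch of 8)
t = torch.randn(8, 10)                       # original teacher logits
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
```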