Search Results (9,973)

Search Parameters:
Keywords = Convolutional Neural Networks (CNN)

12 pages, 933 KiB  
Article
Deep Learning-Based Periapical Lesion Detection on Panoramic Radiographs
by Viktor Szabó, Kaan Orhan, Csaba Dobó-Nagy, Dániel Sándor Veres, David Manulis, Matvey Ezhov, Alex Sanders and Bence Tamás Szabó
Diagnostics 2025, 15(4), 510; https://doi.org/10.3390/diagnostics15040510 - 19 Feb 2025
Abstract
Background/Objectives: Our study aimed to determine the accuracy of the artificial intelligence-based Diagnocat system (DC) in detecting periapical lesions (PL) on panoramic radiographs (PRs). Methods: 616 teeth were selected from 357 panoramic radiographs, including 308 teeth with clearly visible periapical radiolucency and 308 without any periapical lesion. Three groups were generated: teeth with radiographic signs of caries (Group 1), teeth with coronal restoration (Group 2), and teeth with root canal filling (Group 3). The PRs were uploaded to the Diagnocat system for evaluation. The performance of the convolutional neural network in detecting PLs was assessed by its sensitivity, specificity, and positive and negative predictive values, as well as the diagnostic accuracy value. We investigated the possible effect of the palatoglossal air space (PGAS) on the evaluation of the AI tool. Results: DC identified periapical lesions in 240 (77.9%) cases out of the 308 teeth with PL and detected no PL in 68 (22.1%) teeth with PL. The AI-based system detected no PL in any of the groups without PL. The overall sensitivity, specificity, and diagnostic accuracy of DC were 0.78, 1.00, and 0.89, respectively. Considering these parameters for each group, Group 2 showed the highest values at 0.84, 1.00, and 0.95, respectively. Fisher’s Exact test showed that PGAS does not significantly affect (p = 1) the detection of PL in the upper teeth. The AI-based system showed lower probability values for detecting PL in the case of central incisors, wisdom teeth, and canines. The sensitivity and diagnostic accuracy of DC for detecting PL on canines showed lower values at 0.27 and 0.64, respectively. Conclusions: The CNN-based Diagnocat system can support the diagnosis of PL on PRs and serves as a decision-support tool during radiographic assessments. Full article
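The abstract above quotes sensitivity, specificity, and diagnostic accuracy derived from the counts of lesion and lesion-free teeth. As a quick companion, here is a minimal sketch of how those 2x2 confusion-matrix metrics are computed; the function is generic illustration code, not the authors' pipeline, and the counts plugged in simply mirror the 240/68/308/0 split quoted in the abstract.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics used in diagnostic accuracy studies."""
    sensitivity = tp / (tp + fn)                 # recall on teeth with a lesion
    specificity = tn / (tn + fp)                 # recall on lesion-free teeth
    ppv = tp / (tp + fp)                         # positive predictive value
    npv = tn / (tn + fn)                         # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, ppv, npv, accuracy

# Counts mirroring the abstract: 240 of 308 lesion teeth flagged, no false positives.
print(diagnostic_metrics(tp=240, fp=0, tn=308, fn=68))
# -> roughly (0.78, 1.00, 1.00, 0.82, 0.89), matching the reported values
```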
29 pages, 34115 KiB  
Article
Sliding-Window CNN + Channel-Time Attention Transformer Network Trained with Inertial Measurement Units and Surface Electromyography Data for the Prediction of Muscle Activation and Motion Dynamics Leveraging IMU-Only Wearables for Home-Based Shoulder Rehabilitation
by Aoyang Bai, Hongyun Song, Yan Wu, Shurong Dong, Gang Feng and Hao Jin
Sensors 2025, 25(4), 1275; https://doi.org/10.3390/s25041275 - 19 Feb 2025
Abstract
Inertial Measurement Units (IMUs) are widely utilized in shoulder rehabilitation due to their portability and cost-effectiveness, but their reliance on spatial motion data restricts their use in comprehensive musculoskeletal analyses. To overcome this limitation, we propose SWCTNet (Sliding Window CNN + Channel-Time Attention Transformer Network), an advanced neural network specifically tailored for multichannel temporal tasks. SWCTNet integrates IMU and surface electromyography (sEMG) data through sliding window convolution and channel-time attention mechanisms, enabling the efficient extraction of temporal features. This model enables the prediction of muscle activation patterns and kinematics using exclusively IMU data. The experimental results demonstrate that the SWCTNet model achieves recognition accuracies ranging from 87.93% to 91.03% on public temporal datasets and an impressive 98% on self-collected datasets. Additionally, SWCTNet exhibits remarkable precision and stability in generative tasks: the normalized DTW distance was 0.12 for the normal group and 0.25 for the patient group when using the self-collected dataset. This study positions SWCTNet as an advanced tool for extracting musculoskeletal features from IMU data, paving the way for innovative applications in real-time monitoring and personalized rehabilitation at home. This approach demonstrates significant potential for long-term musculoskeletal function monitoring in non-clinical or home settings, advancing the capabilities of IMU-based wearable devices. Full article
(This article belongs to the Special Issue Wearable Devices for Physical Activity and Healthcare Monitoring)
Figures:
Figure 1: Experimental setup for IMU and sEMG data acquisition: (a) placement of the IMU sensor; (b) data acquisition checkpoints for IMU and sEMG sensors; (c) a subset of raw signals captured by the system.
Figure 2: Data process result. The Euler angles obtained from IMU transformation and the processed multi-channel EMG signals are plotted for three preset SJ movements.
Figure 3: Dataset organization structure.
Figure 4: SWCTNet model architecture. The model consists of the SW-CNN Block, CTAT Block, and Downstream Task Block.
Figure 5: Structure of the SW-CNN Block.
Figure 6: Structure of the CTAT Block.
Figure 7: Results of the ablation study conducted on four public datasets.
Figure 8: Radar chart comparing the performance of different models on the feature prediction task, normalized to the range [0, 1]: (a) results for the SWIFTIES dataset; (b) results for the personal dataset.
Figure 9: Visualization of the sequence generation task, showing the actual and predicted EMG feature time-series signals, where (a,b) are the results of the RMS feature, (c,d) are the results of the MPF feature, (a,c) are the results for healthy individuals, and (b,d) are the results for patients.
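For readers who want a concrete picture of the SWCTNet idea, the toy module below combines a strided (sliding-window) 1D convolution over multichannel IMU input with attention across the resulting windows. It is only a rough sketch under assumed sizes: the class name, channel counts, kernel size, and output dimension are invented here, and the published SWCTNet architecture with its channel-time attention blocks is described in the paper itself.

```python
import torch
import torch.nn as nn

class SlidingWindowAttentionSketch(nn.Module):
    """Toy sliding-window CNN followed by attention over window embeddings."""
    def __init__(self, in_channels=6, hidden=64, heads=4, out_dim=2):
        super().__init__()
        # A strided 1D convolution acts as a learnable window sliding over the time axis.
        self.window_conv = nn.Conv1d(in_channels, hidden, kernel_size=32, stride=8)
        # Attention lets the window embeddings exchange information across time.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)   # e.g., predicted muscle-activation features

    def forward(self, x):                         # x: (batch, channels, time)
        h = self.window_conv(x).transpose(1, 2)   # (batch, windows, hidden)
        h, _ = self.attn(h, h, h)
        return self.head(h.mean(dim=1))           # pool over windows and predict

model = SlidingWindowAttentionSketch()
imu = torch.randn(4, 6, 512)                      # 4 clips, 6 IMU channels, 512 samples
print(model(imu).shape)                           # torch.Size([4, 2])
```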
33 pages, 3144 KiB  
Article
CNN-Based Optimization for Fish Species Classification: Tackling Environmental Variability, Class Imbalance, and Real-Time Constraints
by Amirhosein Mohammadisabet, Raza Hasan, Vishal Dattana, Salman Mahmood and Saqib Hussain
Information 2025, 16(2), 154; https://doi.org/10.3390/info16020154 - 19 Feb 2025
Abstract
Automated fish species classification is essential for marine biodiversity monitoring, fisheries management, and ecological research. However, challenges such as environmental variability, class imbalance, and computational demands hinder the development of robust classification models. This study investigates the effectiveness of convolutional neural network (CNN)-based models and hybrid approaches to address these challenges. Eight CNN architectures, including DenseNet121, MobileNetV2, and Xception, were compared alongside traditional classifiers like support vector machines (SVMs) and random forest. DenseNet121 achieved the highest accuracy (90.2%), leveraging its superior feature extraction and generalization capabilities, while MobileNetV2 balanced accuracy (83.57%) with computational efficiency, processing images in 0.07 s, making it ideal for real-time deployment. Advanced preprocessing techniques, such as data augmentation, turbidity simulation, and transfer learning, were employed to enhance dataset robustness and address class imbalance. Hybrid models combining CNNs with traditional classifiers achieved intermediate accuracy with improved interpretability. Optimization techniques, including pruning and quantization, reduced model size by 73.7%, enabling real-time deployment on resource-constrained devices. Grad-CAM visualizations further enhanced interpretability by identifying key image regions influencing predictions. This study highlights the potential of CNN-based models for scalable, interpretable fish species classification, offering actionable insights for sustainable fisheries management and biodiversity conservation. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
Figures:
Graphical abstract
Figure 1: Framework for the research methodology in fish species classification.
Figure 2: Example of augmented images.
Figure 3: Confusion matrix for DenseNet121 performance.
Figure 4: Training and validation loss for DenseNet121.
Figure 5: Grad-CAM heatmap analysis for a single class.
Figure 6: Comparative Grad-CAM visualizations across multiple classes.
Figure 7: Turbidity simulation results and model predictions.
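The fish-classification study fine-tunes pretrained backbones such as MobileNetV2 with transfer learning. The snippet below sketches the usual torchvision recipe (freeze the pretrained features, swap in a new classification head); NUM_SPECIES is a placeholder, the weight enum assumes a reasonably recent torchvision, and nothing here reproduces the paper's exact training setup.

```python
import torch.nn as nn
from torchvision import models

NUM_SPECIES = 9   # placeholder; use the actual class count of the fish dataset

# Load an ImageNet-pretrained MobileNetV2 backbone.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# Freeze the convolutional feature extractor so only the new head is trained.
for p in model.features.parameters():
    p.requires_grad = False

# Replace the final classifier layer with one sized for the fish dataset.
model.classifier[1] = nn.Linear(model.last_channel, NUM_SPECIES)

# Only the new head's parameters would be handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
```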
32 pages, 10807 KiB  
Article
A Comprehensive Evaluation of Monocular Depth Estimation Methods in Low-Altitude Forest Environment
by Jiwen Jia, Junhua Kang, Lin Chen, Xiang Gao, Borui Zhang and Guijun Yang
Remote Sens. 2025, 17(4), 717; https://doi.org/10.3390/rs17040717 - 19 Feb 2025
Abstract
Monocular depth estimation (MDE) is a critical computer vision task that enhances environmental perception in fields such as autonomous driving and robot navigation. In recent years, deep learning-based MDE methods have achieved notable progress in these fields. However, achieving robust monocular depth estimation in low-altitude forest environments remains challenging, particularly in scenes with dense and cluttered foliage, which complicates applications in environmental monitoring, agriculture, and search and rescue operations. This paper presents a comprehensive evaluation of state-of-the-art deep learning-based MDE methods on low-altitude forest datasets. The evaluated models include both self-supervised and supervised approaches, employing different network structures such as convolutional neural networks (CNNs) and Vision Transformers (ViTs). We assessed the generalization of these approaches across diverse low-altitude scenarios, specifically focusing on forested environments. A systematic set of evaluation criteria is employed, comprising traditional image-based global statistical metrics as well as geometry-aware metrics, to provide a more comprehensive evaluation of depth estimation performance. The results indicate that most Transformer-based models, such as DepthAnything and Metric3D, outperform traditional CNN-based models in complex forest environments by capturing detailed tree structures and depth discontinuities. Conversely, CNN-based models like MiDas and Adabins struggle with handling depth discontinuities and complex occlusions, yielding less detailed predictions. On the Mid-Air dataset, the Transformer-based DepthAnything demonstrates a 54.2% improvement in RMSE for the global error metric compared to the CNN-based Adabins. On the LOBDM dataset, the CNN-based MiDas has the depth edge completeness error of 93.361, while the Transformer-based Metric3D demonstrates the significantly lower error of only 5.494. These findings highlight the potential of Transformer-based approaches for monocular depth estimation in low-altitude forest environments, with implications for high-throughput plant phenotyping, environmental monitoring, and other forest-specific applications. Full article
(This article belongs to the Special Issue Image Analysis for Forest Environmental Monitoring)
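The evaluation above leans on global statistical metrics such as RMSE. For orientation, here is a small generic sketch of error measures commonly used in monocular depth benchmarks (absolute relative error, RMSE, and the delta < 1.25 accuracy); it illustrates the definitions only, on synthetic depth maps, and is not the authors' evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common global depth-estimation errors over valid (positive-depth) pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)        # fraction of pixels within 25% of ground truth
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta1}

# Toy example with random depth maps.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 50.0, size=(240, 320))
pred = gt * rng.normal(1.0, 0.1, size=gt.shape)
print(depth_metrics(pred, gt))
```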
22 pages, 3475 KiB  
Article
Uncertainty-Aware Adaptive Multiscale U-Net for Low-Contrast Cardiac Image Segmentation
by A. S. M. Sharifuzzaman Sagar, Muhammad Zubair Islam, Jawad Tanveer and Hyung Seok Kim
Appl. Sci. 2025, 15(4), 2222; https://doi.org/10.3390/app15042222 - 19 Feb 2025
Abstract
Medical image analysis is critical for diagnosing and planning treatments, particularly in addressing heart disease, a leading cause of mortality worldwide. Precise segmentation of the left atrium, a key structure in cardiac imaging, is essential for detecting conditions such as atrial fibrillation, heart failure, and stroke. However, its complex anatomy, subtle boundaries, and inter-patient variations make accurate segmentation challenging for traditional methods. Recent advancements in deep learning, especially semantic segmentation, have shown promise in addressing these limitations by enabling detailed, pixel-wise classification. This study proposes a novel segmentation framework Adaptive Multiscale U-Net (AMU-Net) combining Convolutional Neural Networks (CNNs) and transformer-based encoder–decoder architectures. The framework introduces a Contextual Dynamic Encoder (CDE) for extracting multi-scale features and capturing long-range dependencies. An Adaptive Feature Decoder Block (AFDB), leveraging an Adaptive Feature Attention Block (AFAB) improves boundary delineation. Additionally, a Spectral Synthesis Fusion Head (SFFH) synthesizes spectral and spatial features, enhancing segmentation performance in low-contrast regions. To ensure robustness, data augmentation techniques such as rotation, scaling, and flipping are applied. Laplacian approximation is employed for uncertainty estimation, enabling interpretability and identifying regions of low confidence. Our proposed model achieves a Dice score of 93.35, a Precision of 94.12, and a Recall of 92.78, outperforming existing methods. Full article
Figures:
Figure 1: Overall structure of AMU-Net for medical image analysis.
Figure 2: Overall structure of the CDE encoder block, along with the Modulated Predictive Coding Module (MPCM), used in our model.
Figure 3: The overall structure of the proposed DMSA module used in the encoder block.
Figure 4: Overall structure of the AFDB used in our proposed AMU-Net.
Figure 5: Illustration of the Adaptive Fusion Attention Block.
Figure 6: An illustration of the overall framework of the SFFH.
Figure 7: Acquired loss and Dice score during the training process of AMU-Net.
Figure 8: The visualization results of AMU-Net to evaluate the performance of the model.
Figure 9: The visualization results of FPs and FNs on challenging images.
Figure 10: The visualization results of different models along with FPs and FNs.
Figure 11: Uncertainty estimation of the predicted results using Laplacian approximation.
Figure 12: Calibration error at different data shift intensities for the baseline and Bayesian models. Diamonds represent outliers.
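AMU-Net is scored with Dice, Precision, and Recall on left-atrium masks. The short sketch below shows how those overlap metrics fall out of the true/false positive and negative pixel counts of two binary masks; the toy masks and the epsilon guard are invented for illustration and are not the authors' evaluation code.

```python
import numpy as np

def segmentation_scores(pred, gt, eps=1e-8):
    """Dice, precision, and recall for binary segmentation masks (0/1 arrays)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return dice, precision, recall

# Toy 2D masks: prediction overlaps most but not all of the ground-truth square.
gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1
pred = np.zeros_like(gt); pred[3:7, 2:6] = 1
print(segmentation_scores(pred, gt))   # (0.75, 0.75, 0.75)
```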
15 pages, 10730 KiB  
Article
An Efficient Forest Smoke Detection Approach Using Convolutional Neural Networks and Attention Mechanisms
by Quy-Quyen Hoang, Quy-Lam Hoang and Hoon Oh
J. Imaging 2025, 11(2), 67; https://doi.org/10.3390/jimaging11020067 - 19 Feb 2025
Abstract
This study explores a method of detecting smoke plumes effectively as the early sign of a forest fire. Convolutional neural networks (CNNs) have been widely used for forest fire detection; however, they have not been customized or optimized for smoke characteristics. This paper proposes a CNN-based forest smoke detection model featuring novel backbone architecture that can increase detection accuracy and reduce computational load. Since the proposed backbone detects the plume of smoke through different views using kernels of varying sizes, it can better detect smoke plumes of different sizes. By decomposing the traditional square kernel convolution into a depth-wise convolution of the coordinate kernel, it can not only better extract the features of the smoke plume spreading along the vertical dimension but also reduce the computational load. An attention mechanism was applied to allow the model to focus on important information while suppressing less relevant information. The experimental results show that our model outperforms other popular ones by achieving detection accuracy of up to 52.9 average precision (AP) and significantly reduces the number of parameters and giga floating-point operations (GFLOPs) compared to the popular models. Full article
Figures:
Figure 1: The architecture of the forest fire detection model.
Figure 2: Reduction in the number of parameters by using different sized kernels.
Figure 3: The smoke features tend to be vertically distributed through the layers.
Figure 4: The proposed Backbone structure for forest fire detection.
Figure 5: CBAM architecture.
Figure 6: The Neck architecture.
Figure 7: The Head architecture.
Figure 8: Qualitative test results for 15 forest fire images numbered 1 to 15, with the class name and confidence value given at the top of each bounding box.
Figure 9: Heat maps of the images to which different models are applied.
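A key point of the abstract is replacing a square-kernel convolution with depth-wise coordinate (kx1 and 1xk) convolutions to capture vertically spreading smoke with far fewer parameters. The PyTorch fragment below is an illustrative reconstruction of that general decomposition, with made-up channel and kernel sizes, and simply compares parameter counts; it is not the paper's actual backbone.

```python
import torch.nn as nn

C, K = 64, 7   # illustrative channel count and kernel size

# Standard square-kernel convolution.
square = nn.Conv2d(C, C, kernel_size=K, padding=K // 2)

# Depth-wise decomposition into a vertical (Kx1) and horizontal (1xK) pass,
# followed by a 1x1 point-wise convolution to mix channels.
decomposed = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(K, 1), padding=(K // 2, 0), groups=C),
    nn.Conv2d(C, C, kernel_size=(1, K), padding=(0, K // 2), groups=C),
    nn.Conv2d(C, C, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(square), count(decomposed))   # roughly 200.8k vs. 5.2k parameters
```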
32 pages, 124914 KiB  
Article
CNN–Transformer Hybrid Architecture for Underwater Sonar Image Segmentation
by Juan Lei, Huigang Wang, Zelin Lei, Jiayuan Li and Shaowei Rong
Remote Sens. 2025, 17(4), 707; https://doi.org/10.3390/rs17040707 - 19 Feb 2025
Abstract
The salient object detection (SOD) of forward-looking sonar images plays a crucial role in underwater detection and rescue tasks. However, the existing SOD algorithms find it difficult to effectively extract salient features and spatial structure information from images with scarce semantic information, uneven intensity distribution, and high noise. Convolutional neural networks (CNNs) have strong local feature extraction capabilities, but they are easily constrained by the receptive field and lack the ability to model long-range dependencies. Transformers, with their powerful self-attention mechanism, are capable of modeling the global features of a target, but they tend to lose a significant amount of local detail. Mamba effectively models long-range dependencies in long sequence inputs through a selection mechanism, offering a novel approach to capturing long-range correlations between pixels. However, since the saliency of image pixels does not exhibit sequential dependencies, this somewhat limits Mamba’s ability to fully capture global contextual information during the forward pass. Inspired by multimodal feature fusion learning, we propose a hybrid CNN–Transformer–Mamba architecture, termed FLSSNet. FLSSNet is built upon a CNN and Transformer backbone network, integrating four core submodules to address various technical challenges: (1) The asymmetric dual encoder–decoder (ADED) is capable of simultaneously extracting features from different modalities and systematically modeling both local contextual information and global spatial structure. (2) The Transformer feature converter (TFC) module optimizes the multimodal feature fusion process through feature transformation and channel compression. (3) The long-range correlation attention (LRCA) module enhances CNN’s ability to model long-range dependencies through the collaborative use of convolutional kernels, selective sequential scanning, and attention mechanisms, while effectively suppressing noise interference. (4) The recursive contour refinement (RCR) model refines edge contour information through a layer-by-layer recursive mechanism, achieving greater precision in boundary details. The experimental results show that FLSSNet exhibits outstanding competitiveness among 25 state-of-the-art SOD methods, achieving MAE and Eξ values of 0.04 and 0.973, respectively. Full article
(This article belongs to the Special Issue Ocean Remote Sensing Based on Radar, Sonar and Optical Techniques)
Figures:
Figure 1: Examples of FLS images containing various noise sources. The red region indicates the ground truth of the salient target, the purple region represents areas with intensity inconsistency, the blue region indicates multipath noise, and the yellow region represents shadow noise.
Figure 2: The overall structure of the proposed FLSSNet. The method employs a two-stage strategy: the first stage utilizes an asymmetric dual encoder–decoder structure for saliency feature extraction, while the second stage further refines the feature maps using the recursive refine module. In the first stage, the image is input into two encoders to obtain feature information from different modalities. Simultaneously, the Transformer feature converter module is used to transform and compress information from these modalities. Next, the long-range correlation attention module integrates multi-level features and reduces feature redundancy. Finally, the recursive refine module is employed to further enhance the precision of feature prediction.
Figure 3: The overall structure of the proposed Transformer feature converter (TFC) module. The TFC module mainly consists of the residual channel attention module (RCAM) and the multi-scale dual self-attention mechanism module. MHSA stands for multi-head attention mechanism.
Figure 4: The overall structure of the proposed long-range correlation attention (LRCA) module. The LRCA module primarily consists of a multi-directional convolution module (MDC), an omnidirectional selective scan module (OSSM), and an attention module.
Figure 5: The overall structure of the proposed recursive block (a) and recursive contour extraction (RECM) module (b). N denotes the number of RECMs, and m represents the sequence number of recursive blocks in the hierarchy of RCR.
Figure 6: Different types of samples: (a) bottle; (b) can; (c) tire; (d) chain; (e) hook; (f) standing bottle; (g) drink carton; (h) shampoo bottle; (i) valve; (j) propeller; (k) wall.
Figure 7: Examples of noise in FLS images. The red areas indicate the ground truth. The blue areas represent small targets that are easily lost due to spatial positioning. The yellow areas show shadow noise caused by occlusions. The cyan areas depict scattering noise caused by sound waves encountering suspended particles, bubbles, and other media. The purple areas illustrate pseudo-target noise caused by reflection noise from water waves.
Figure 8: Comparison between FLSSNet and the comparison models on the PR curve (a) and F-measure curve (b).
Figure 9: Visual display of FLSSNet and comparison models. The red box indicates significant differences.
Figure 10: Visualization results of side outputs at different levels of the recursive block.
Figure 11: (a,b) Quantitative comparison of the variant models within the CNN–Transformer hybrid backbone architecture in the PR (precision–recall) and F-measure curves.
Figure 12: Visualization results of variant models in the CNN–Transformer hybrid backbone architecture.
Figure 13: The visualization results of a single module in a pure CNN backbone architecture.
Figure 14: Visualization of ablation experiments using MDC and OSSM in the CNN–Transformer hybrid architecture and the pure CNN architecture.
Figure 15: At different levels of RCR, the visualization results of X^l_(n-1) and X^c_(n-1) in the second layer of RCEM are presented. Here, (s1, s2, ..., s5) represent the hierarchical sequence of RCEM, while L and C, respectively, denote the specific visualization results of X^l_(n-1) and X^c_(n-1).
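FLSSNet is reported with an MAE of 0.04, i.e., the mean absolute error between the predicted saliency map and the binary ground truth. The tiny snippet below illustrates just that metric on synthetic maps (the E-measure reported alongside it is more involved and is omitted); it is generic example code, not the authors' evaluation script.

```python
import numpy as np

def saliency_mae(pred, gt):
    """Mean absolute error between a predicted saliency map and its ground truth.

    Both inputs are expected in [0, 1] with the same spatial size.
    """
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))

rng = np.random.default_rng(1)
gt = (rng.random((128, 128)) > 0.9).astype(float)          # sparse salient pixels
pred = np.clip(gt + rng.normal(0, 0.05, gt.shape), 0, 1)   # noisy prediction
print(round(saliency_mae(pred, gt), 4))
```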
29 pages, 11018 KiB  
Article
Impact on Classification Process Generated by Corrupted Features
by Simona Moldovanu, Dan Munteanu and Carmen Sîrbu
Big Data Cogn. Comput. 2025, 9(2), 45; https://doi.org/10.3390/bdcc9020045 - 18 Feb 2025
Abstract
The topic of this study is the testing of the robustness of machine learning (ML) and neural network (NN) models with a new idea based on corrupted data. Typically, ML and NN classifiers are trained on real feature data; however, a portion of the features may be false, with noise, or incorrect. The undesired content was analyzed in eight experiments with false data, six with feature noise, and six with label noise. These tests were all conducted on the public Breast Cancer Wisconsin Dataset (BCWD). Throughout this, the false and noise data were gradually corrupted in a random way, generating new data and replacing raw features that belonged to the BCWD. Artificial Intelligence (AI) should be properly selected while categorizing different diseases using medical data. The Pearson correlation coefficient (PCC) applied between features monitored their correlation in each experiment, and a correlation matrix between both true and false features was used. Four machine learning (ML) algorithms—Random Forest (RF), XGBClassifier (XGB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM)—were used, as well as for the analysis of important features (IF) and the binary classification. The study was completed using three deep neural networks—a simple Deep Neural Network (DNN), a Convolutional Neural Network (CNN), and a Transformer Neural Network (TNN). In the context of a binary classification, the accuracy, F1-score, Area Under the Curve (AUC), and Matthews correlation coefficient (MCC) metrics of the performance of classification in malignant versus benign breast cancer (BC) was computed. The results demonstrated the robustness of some methods and the sensitivity of other machine learning algorithms in the context of corrupted data, computational cost, and hyperparameters optimization. Full article
Figures:
Figure 1: The flowchart of the proposed method.
Figure 2: BCWD in academic publishing platforms.
Figures 3-10: Experiments #1-#8: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 11: Graph of ACC, F1-score, and AUC along eight experiments with false data for KNN, XGB, RF, SVM, DNN, CNN, and Transformer.
Figure 12: Graph of ACC, F1-score, and AUC along six experiments with noise on feature data for KNN, XGB, RF, SVM, DNN, CNN, and Transformer.
Figure 13: Graph of ACC, F1-score, and AUC along six experiments with noise on target data for KNN, XGB, RF, SVM, DNN, CNN, and Transformer.
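The robustness tests above gradually corrupt BCWD features and labels with false values and noise. The sketch below shows one simple way such feature noise and label flipping can be injected before training; the corruption fractions, noise scale, and synthetic data are arbitrary placeholders rather than the study's settings.

```python
import numpy as np

rng = np.random.default_rng(42)

def corrupt_features(X, fraction=0.2, noise_scale=0.5):
    """Add Gaussian noise to a randomly chosen fraction of feature columns."""
    X = X.copy()
    n_cols = max(1, int(fraction * X.shape[1]))
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    X[:, cols] += rng.normal(0.0, noise_scale * X[:, cols].std(axis=0), X[:, cols].shape)
    return X

def flip_labels(y, fraction=0.1):
    """Flip a random fraction of binary labels (label noise)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

X = rng.normal(size=(569, 30))          # shaped like the 30-feature BCWD table
y = rng.integers(0, 2, size=569)
X_noisy, y_noisy = corrupt_features(X), flip_labels(y)
```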
18 pages, 5677 KiB  
Article
Computer Vision-Based Concrete Crack Identification Using MobileNetV2 Neural Network and Adaptive Thresholding
by Li Hui, Ahmed Ibrahim and Riyadh Hindi
Infrastructures 2025, 10(2), 42; https://doi.org/10.3390/infrastructures10020042 - 18 Feb 2025
Abstract
Concrete is widely used in different types of buildings and bridges; however, one of the major issues for concrete structures is crack formation and propagation during its service life. These cracks can potentially introduce harmful agents into concrete, resulting in a reduction in the overall lifespan of concrete structures. Traditional methods for crack detection primarily hinge on manual visual inspection, which relies on the experience and expertise of inspectors using tools such as magnifying glasses and microscopes. To address this issue, computer vision is one of the most innovative solutions for concrete cracking evaluation, and its application has been an area of research interest in the past few years. This study focuses on the utilization of the lightweight MobileNetV2 neural network for concrete crack detection. A dataset including 40,000 images was adopted and preprocessed using various thresholding techniques, of which adaptive thresholding was selected for developing the crack evaluation algorithm. While both the convolutional neural network (CNN) and MobileNetV2 indicated comparable accuracy levels in crack detection, the MobileNetV2 model’s significantly smaller size makes it a more efficient selection for crack detection using mobile devices. In addition, an advanced algorithm was developed to detect cracks and evaluate crack widths in high-resolution images. The effectiveness and reliability of both the selected method and the developed algorithm were subsequently assessed through experimental validation. Full article
(This article belongs to the Special Issue Advances in Artificial Intelligence for Infrastructures)
Figures:
Figure 1: Sample dataset of concrete surfaces with crack (a–c) and without crack (d–f).
Figure 2: Issues in the dataset.
Figure 3: Grayscale Processing for the Dataset.
Figure 4: Thresholding using Different Approaches: (a) Global; (b) OTSU; (c) Adaptive; (d) Triangle; (e) Isodata; (f) Gaussian.
Figure 5: Adaptive Thresholding with Different Block Sizes and Constant C.
Figure 6: Example of Normalization Process for Image after Thresholding.
Figure 7: MobileNetV2 Neural Network Architecture.
Figure 8: CNN Neural Network.
Figure 9: Comparison of Training Accuracy.
Figure 10: Comparison of Validation Accuracy.
Figure 11: Framework for Crack Identification.
Figure 12: Sliding Windows Technique for High-Resolution Image.
Figure 13: NMS Process to Eliminate Redundancy.
Figure 14: Crack Radius during the Calculation of Crack Width.
Figure 15: Relationship between Image Width, Camera Sensor, and Working Distance.
Figure 16: Original Image Used for Experimental Validation.
Figure 17: Image after Denoise and Lighting Adjustment.
Figure 18: Crack Tracking during Sliding Windows Process and Mask Image.
Figure 19: Crack Width Measurement using MobileNetV2 and Validation.
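Adaptive thresholding was the preprocessing step selected for the crack pipeline above. The OpenCV fragment below is a minimal sketch of mean-based adaptive thresholding on a grayscale crack image; the file paths, block size, and constant C are placeholders, since the study tunes block size and C separately (Figure 5), and this is not the authors' implementation.

```python
import cv2

# Placeholder path; any grayscale concrete-surface image works for the sketch.
gray = cv2.imread("crack_sample.jpg", cv2.IMREAD_GRAYSCALE)

# Each pixel is compared against the mean of its local blockSize x blockSize
# neighborhood minus a constant C, which copes with uneven lighting far better
# than a single global threshold; dark crack pixels become white foreground.
binary = cv2.adaptiveThreshold(gray, 255,
                               cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY_INV,
                               11,   # placeholder block size (must be odd)
                               2)    # placeholder constant C

cv2.imwrite("crack_binary.png", binary)
```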
21 pages, 1850 KiB  
Review
Deep Learning for Automatic Detection of Volcanic and Earthquake-Related InSAR Deformation
by Xu Liu, Yingfeng Zhang, Xinjian Shan, Zhenjie Wang, Wenyu Gong and Guohong Zhang
Remote Sens. 2025, 17(4), 686; https://doi.org/10.3390/rs17040686 - 18 Feb 2025
Abstract
Interferometric synthetic aperture radar (InSAR) technology plays a crucial role in monitoring surface deformation and has become widely used in volcanic and earthquake research. With the rapid advancement of satellite technology, InSAR now generates vast volumes of deformation data. Deep learning has revolutionized data analysis, offering exceptional capabilities for processing large datasets. Leveraging these advancements, automatic detection of volcanic and earthquake deformation from extensive InSAR datasets has emerged as a major research focus. In this paper, we first introduce several representative deep learning architectures commonly used in InSAR data analysis, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), and Transformer networks. Each architecture offers unique advantages for addressing the challenges of InSAR data. We then systematically review recent progress in the automatic detection and identification of volcanic and earthquake deformation signals from InSAR images using deep learning techniques. This review highlights two key aspects: the design of network architectures and the methodologies for constructing datasets. Finally, we discuss the challenges in automatic detection and propose potential solutions. This study aims to provide a comprehensive overview of the current applications of deep learning for extracting InSAR deformation features, with a particular focus on earthquake and volcanic monitoring. Full article
Figures:
Figure 1: InSAR data processing based on deep learning. (a) The primary deep learning architectures utilized in InSAR data processing, including CNNs, RNNs, GANs, and Transformers. (b) DL is applied to various stages of InSAR data processing, including deformation detection, atmospheric correction, phase filtering, and phase unwrapping.
Figure 2: The main architectures of CNNs, RNNs, GANs, and Transformer networks. (a) CNNs primarily consist of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. (b) RNNs consist of input layers, recurrent hidden layers, and an output layer for sequence tasks. (c) GANs consist of a generator and a discriminator, which are trained together in a competitive manner. (d) Transformers consist of an encoder and a decoder, both using self-attention and feed-forward layers.
Figure 3: Different learning processes between traditional machine learning and transfer learning. (a) Traditional machine learning approaches learn each task independently, starting from scratch. (b) Transfer learning utilizes knowledge gained from previous tasks and applies it to a target task.
Figure 4: Data augmentation methods. (a) Geometric transformation-based data augmentation involves techniques like zoom, rotation, mirroring, and flipping to expand the training datasets. (b) Pixel-level transformation-based data augmentation modifies individual pixel values, such as brightness, contrast, and color, to enhance the datasets. (c) Filtering-based data augmentation involves applying filters like blurring, sharpening, and noise to diversify the training datasets. The original InSAR interferogram data were downloaded from the COMET-LiCS Sentinel-1 InSAR portal (https://comet.nerc.ac.uk/comet-lics-portal/ (accessed on 1 December 2024)).
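Figure 4 of this review groups augmentation into geometric, pixel-level, and filtering-based transformations of InSAR patches. The snippet below illustrates only the geometric family (flips and right-angle rotations) with NumPy on a synthetic patch; it is generic example code, not drawn from any reviewed method.

```python
import numpy as np

def geometric_augmentations(patch):
    """Yield simple geometric variants of a 2D interferogram patch."""
    yield np.fliplr(patch)          # horizontal mirror
    yield np.flipud(patch)          # vertical flip
    for k in (1, 2, 3):             # 90/180/270-degree rotations
        yield np.rot90(patch, k)

patch = np.random.rand(128, 128)    # stand-in for a wrapped-phase patch
augmented = list(geometric_augmentations(patch))
print(len(augmented), augmented[0].shape)   # 5 (128, 128)
```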
27 pages, 5597 KiB  
Article
Smart Organization of Imbalanced Traffic Datasets for Long-Term Traffic Forecasting
by Mustafa M. Kara, H. Irem Turkmen and M. Amac Guvensan
Sensors 2025, 25(4), 1225; https://doi.org/10.3390/s25041225 - 18 Feb 2025
Abstract
Predicting traffic speed is an important issue, especially in urban regions. Precise long-term forecasts would enable individuals to conserve time and financial resources while diminishing air pollution. Despite extensive research on this subject, to our knowledge, no publications investigate or tackle the issue of imbalanced datasets in traffic speed prediction. Traffic speed data are often biased toward high numbers because low traffic speeds are infrequent. The temporal aspect of traffic carries two important factors for low-speed value. The daily population movement, captured by the time of day, and the weather data, recorded by month, are both considered in this study. Hour-wise Pattern Organization and Month-wise Pattern Organization techniques were devised, which organize the speed data using these two factors as a metric with a view to providing a superior representation of data characteristics that are in the minority. In addition to these two methods, a Speed-wise Pattern Organization strategy is proposed, which arranges train and test samples by setting boundaries on speed while taking the volatile nature of traffic into consideration. We evaluated these strategies using four popular model types: long short-term memory (LSTM), gated recurrent unit networks (GRUs), bi-directional LSTM, and convolutional neural networks (CNNs). GRU had the best performance, achieving a MAPE (Mean Absolute Percentage Error) of 13.51%, whereas LSTM demonstrated the lowest performance, with a MAPE of 13.74%. We validated their robustness through our studies and observed improvements in model accuracy across all categories. While the average improvement was approximately 4%, our methodologies demonstrated superior performance in low-traffic speed scenarios, augmenting model prediction accuracy by 11.2%. The presented methodologies in this study are applied in the pre-processing steps, allowing their application with various models and additional pre-processing procedures to attain comparable performance improvements. Full article
(This article belongs to the Section Navigation and Positioning)
Figures:
Figure 1: An overview of the general flow of the proposed methodology.
Figure 2: Average speed of 441 segments separated into 5 different groups based on time of day.
Figure 3: Percentage of data belonging to each time group.
Figure 4: An example of how model creation using Hour-wise Pattern Organization works.
Figure 5: The upper graphics present the MAPE scores of the LSTM model with and without Hour-wise Pattern Organization. The lower part displays the MAPE difference between the two models. Green segments correspond to time regions where the applied method performs better, while red segments belong to regions where the base LSTM model is superior.
Figure 6: Distribution of speed values within the dataset.
Figure 7: Percentage of data regarding traffic density.
Figure 8: An example of how model creation using the Speed-wise Pattern Organization method works.
Figure 9: Grouping speed values with the help of Speed-wise Pattern Organization. The upper side of the figure shows the arrangement of the test samples, while the lower side shows the arrangement of the training samples.
Figure 10: Comparison of the base LSTM model with the model applying Speed-wise Pattern Organization. The upper figure presents the MAPE values of the two methods, whereas the lower figure illustrates the MAPE difference between them.
Figure 11: Hourly average speed data of 5 consecutive months belonging to all 441 road segments in Istanbul.
Figure 12: The average monthly MAPE values obtained from a base LSTM model trained without any pattern organization scheme. Each color represents a different season.
Figure 13: An illustration of the model development process with Month-wise Pattern Organization.
Figure 14: Speed histograms of April and August, presenting the speed values of the respective months in bins.
Figure 15: Comparison of month similarities obtained from the proposed methodology. In this heat map, a row depicts the similarity between the month it represents and the twelve previous months shown in the columns. The column designated February indicates the resemblance of the current year's January and February to February of the preceding year, while March of this year has similarities to February of this year. The color gradient progresses from dark red, representing greater distance values, to bright green, denoting lesser distance values.
Figure 16: MAPE ratings obtained from four base models. The base models exhibit the MAPE scores obtained without the application of additional methods; Hour-wise, Speed-wise, and Month-wise Pattern Organization present the outcomes obtained using the methodologies outlined in Section 3-A, Section 3-B, and Section 3-C, respectively; the Multiple Method presents the outcomes obtained using the technique outlined in Section 4-E.
Figure 17: Hourly MAPE difference between models trained with no additional process and models trained with the Hour-wise Pattern Organization method.
Figure 18: Hourly MAPE difference between models trained with no additional process and models trained using the Speed-wise Pattern Organization method.
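Model quality above is compared by MAPE, broken down by hour of day. Here is a short generic illustration of computing an overall and an hour-wise MAPE; the synthetic speeds, hours, and error level are invented for the example and unrelated to the Istanbul data.

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (assumes non-zero actual speeds)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))

# Toy hour-wise breakdown: group prediction errors by the hour each sample was taken.
rng = np.random.default_rng(7)
hours = rng.integers(0, 24, size=1000)
actual = rng.uniform(10, 90, size=1000)            # speeds in km/h
predicted = actual * rng.normal(1.0, 0.12, 1000)   # predictions with ~12% noise

overall = mape(actual, predicted)
per_hour = {h: mape(actual[hours == h], predicted[hours == h]) for h in range(24)}
print(round(overall, 2), round(per_hour[8], 2))    # overall vs. morning-rush MAPE
```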
11 pages, 939 KiB  
Proceeding Paper
CNN-Based Image Segmentation Approach in Brain Tumor Classification: A Review
by Nurul Huda and Ku Ruhana Ku-Mahamud
Eng. Proc. 2025, 84(1), 66; https://doi.org/10.3390/engproc2025084066 - 17 Feb 2025
Abstract
This study explores the application of Convolutional Neural Networks (CNNs) for brain tumor segmentation, leveraging their ability to automatically extract hierarchical features from medical images. CNN architectures like U-Net, V-Net, and ResNet have shown significant promise in brain tumor classification, offering high precision in detecting tumor boundaries and classifying tumor types. Various benchmark datasets, such as BraTS, TCIA, Harvard, and Kaggle, provide annotated MRI images to evaluate these models. Performance metrics including Dice Similarity Coefficient (DSC), Intersection over Union, and accuracy are employed to assess the models’ effectiveness. The results demonstrate that CNN-based models, particularly U-Net, perform exceptionally well, with DSC scores exceeding 0.90 in most cases. However, challenges such as data imbalance, the need for large datasets, and high computational demands persist. Despite these limitations, CNNs, when combined with advanced techniques like transfer learning and data augmentation, offer robust solutions for brain tumor segmentation, showing promise for real-time clinical deployment. Further advancements are necessary to address generalization issues and enhance model efficiency, ensuring broader applicability in clinical settings. Full article
Figures:
Figure 1: Segmentation Techniques from 2013–2024.
Figure 2: Recent advances and future trend of image segmentation for brain tumor classification.
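The review evaluates models with the Dice Similarity Coefficient (DSC) and Intersection over Union side by side; for binary masks the two are interconvertible via IoU = DSC / (2 - DSC). The small worked sketch below checks that identity on toy masks; it is a generic example, not code from any surveyed model.

```python
import numpy as np

def dsc_and_iou(pred, gt):
    """Dice and IoU for binary masks, plus a check of IoU = DSC / (2 - DSC)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dsc = 2 * inter / (pred.sum() + gt.sum())
    iou = inter / union
    assert np.isclose(iou, dsc / (2 - dsc))   # the two metrics carry the same information
    return dsc, iou

gt = np.zeros((64, 64), dtype=int); gt[10:40, 10:40] = 1
pred = np.zeros_like(gt); pred[15:45, 10:40] = 1
print(dsc_and_iou(pred, gt))   # roughly (0.83, 0.71)
```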
32 pages, 4102 KiB  
Article
A Multimodal Pain Sentiment Analysis System Using Ensembled Deep Learning Approaches for IoT-Enabled Healthcare Framework
by Anay Ghosh, Saiyed Umer, Bibhas Chandra Dhara and G. G. Md. Nawaz Ali
Sensors 2025, 25(4), 1223; https://doi.org/10.3390/s25041223 - 17 Feb 2025
Abstract
This study introduces a multimodal sentiment analysis system to assess and recognize human pain sentiments within an Internet of Things (IoT)-enabled healthcare framework. This system integrates facial expressions and speech-audio recordings to evaluate human pain intensity levels. This integration aims to enhance the recognition system’s performance and enable a more accurate assessment of pain intensity. Such a multimodal approach supports improved decision making in real-time patient care, addressing limitations inherent in unimodal systems for measuring pain sentiment. So, the primary contribution of this work lies in developing a multimodal pain sentiment analysis system that integrates the outcomes of image-based and audio-based pain sentiment analysis models. The system implementation contains five key phases. The first phase focuses on detecting the facial region from a video sequence, a crucial step for extracting facial patterns indicative of pain. In the second phase, the system extracts discriminant and divergent features from the facial region using deep learning techniques, utilizing some convolutional neural network (CNN) architectures, which are further refined through transfer learning and fine-tuning of parameters, alongside fusion techniques aimed at optimizing the model’s performance. The third phase performs the speech-audio recording preprocessing; the extraction of significant features is then performed through conventional methods followed by using the deep learning model to generate divergent features to recognize audio-based pain sentiments in the fourth phase. The final phase combines the outcomes from both image-based and audio-based pain sentiment analysis systems, improving the overall performance of the multimodal system. This fusion enables the system to accurately predict pain levels, including ‘high pain’, ‘mild pain’, and ‘no pain’. The performance of the proposed system is tested with the three image-based databases such as a 2D Face Set Database with Pain Expression, the UNBC-McMaster database (based on shoulder pain), and the BioVid database (based on heat pain), along with the VIVAE database for the audio-based dataset. Extensive experiments were performed using these datasets. Finally, the proposed system achieved accuracies of 76.23%, 84.27%, and 38.04% for two, three, and five pain classes, respectively, on the 2D Face Set Database with Pain Expression, UNBC, and BioVid datasets. The VIVAE audio-based system recorded a peak performance of 97.56% and 98.32% accuracy for varying training–testing protocols. These performances were compared with some state-of-the-art methods that show the superiority of the proposed system. By combining the outputs of both deep learning frameworks on image and audio datasets, the proposed multimodal pain sentiment analysis system achieves accuracies of 99.31% for the two-class, 99.54% for the three-class, and 87.41% for the five-class pain problems. Full article
(This article belongs to the Section Physical Sensors)
Figures:
Figure 1: Pictorial representation of the proposed multimodal pain sentiment analysis system (PSAS) for the smart healthcare framework.
Figure 2: Detecting facial regions in input images for the image-based PSAS.
Figure 3: Demonstration of the CNN_A architecture for the image-based PSAS.
Figure 4: Illustration of the CNN_B architecture.
Figure 5: Executed CNN_1 framework.
Figure 6: Examples of some image samples from the UNBC-McMaster database [60].
Figure 7: Examples of some image samples from the 2DFPE database [61].
Figure 8: Samples of some image specimens from the BioVid Heat Pain Database [62].
Figure 9: Demonstration of the Scheme_1 experiments, exploring the effect of batch size vs. epochs on the proposed system's performance.
Figure 10: Demonstration of the Scheme_1 experiments performing multi-resolution image analysis on the performance of the proposed system.
Figure 11: Demonstration of some image samples of the AffectNet dataset [64] with ethnic diversity and variations in age among the subjects to validate the robustness of the proposed methodology.
Figure 12: The performance outcome of the proposed pain SAS using audio features with (a) 50–50% training–testing and (b) 75–25% training–testing sets.
Figure 13: Performance of the proposed pain sentiment analysis system using the performance reported in Table 11 and Figure 12.
Figure 14: Performance of the proposed multimodal pain SAS (MSAS_1) using the 2-class 2DFPE and VIVAE databases.
Figure 15: Performance of the proposed multimodal pain SAS (MSAS_2) using the 3-class UNBC-McMaster and VIVAE databases.
Figure 16: Performance of the proposed multimodal pain SAS (MSAS_3) using the 4-class BioVid and VIVAE databases.
16 pages, 2242 KiB  
Article
Effective Data Augmentation Techniques for Arabic Speech Emotion Recognition Using Convolutional Neural Networks
by Wided Bouchelligua, Reham Al-Dayil and Areej Algaith
Appl. Sci. 2025, 15(4), 2114; https://doi.org/10.3390/app15042114 - 17 Feb 2025
Viewed by 248
Abstract
This paper investigates the effectiveness of various data augmentation techniques for enhancing Arabic speech emotion recognition (SER) using convolutional neural networks (CNNs). Utilizing the Saudi Dialect and BAVED datasets, we address the challenges of limited and imbalanced data commonly found in Arabic SER. To improve model performance, we apply augmentation techniques such as noise addition, time shifting, increasing volume, and reducing volume. Additionally, we examine the optimal number of augmentations required to achieve the best results. Our experiments reveal that these augmentations significantly enhance the CNN’s ability to recognize emotions, with certain techniques proving more effective than others. Furthermore, the number of augmentations plays a critical role in model accuracy. On the Saudi Dialect dataset, the best results were achieved with two augmentations (increasing volume and decreasing volume), reaching an accuracy of 96.81%. Similarly, the BAVED dataset showed optimal performance with a combination of three augmentations (noise addition, increasing volume, and reducing volume), achieving an accuracy of 92.60%. These findings indicate that carefully selected augmentation strategies can greatly improve the performance of CNN-based SER systems, particularly in the context of Arabic speech. This research underscores the importance of tailored augmentation techniques for enhancing SER performance and sets a foundation for future advancements in this field. Full article
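The four augmentation techniques evaluated in this abstract (noise addition, time shifting, increasing volume, and reducing volume) can be approximated with simple waveform operations. The sketch below uses NumPy only; the noise level, shift range, and volume factors are illustrative assumptions and not the values reported by the authors.

```python
import numpy as np

def add_noise(y, noise_factor=0.005):
    """Add white Gaussian noise to the waveform (assumed noise level)."""
    return y + noise_factor * np.random.randn(len(y))

def time_shift(y, shift_max=0.2, sr=16000):
    """Circularly shift the waveform by a random offset of up to shift_max seconds."""
    shift = np.random.randint(-int(shift_max * sr), int(shift_max * sr))
    return np.roll(y, shift)

def change_volume(y, factor):
    """Scale amplitude; factor > 1 increases volume, factor < 1 reduces it."""
    return np.clip(y * factor, -1.0, 1.0)

# Example with a synthetic 1-second tone at 16 kHz standing in for a real recording.
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
augmented = [add_noise(y), time_shift(y, sr=sr), change_volume(y, 1.5), change_volume(y, 0.5)]
```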
(This article belongs to the Special Issue Natural Language Processing: Novel Methods and Applications)
Show Figures

Figure 1: Original Saudi Dialect dataset distribution.
Figure 2: Original BAVED dataset distribution.
Figure 3: The flow of data preparation for the SER.
Figure 4: Examples of the audio files with data augmentation: (a) original audio for an angry emotion (01), (b) noise addition, (c) time shift, (d) increasing volume, and (e) reducing volume.
Figure 5: The block diagram of the MFCC computation (a minimal extraction sketch follows this figure list).
Figure 6: The proposed SER architecture.
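Figure 5 refers to the MFCC computation that produces the input features for the CNN. A minimal extraction sketch with librosa is given below; the sampling rate, the number of coefficients, and the fixed frame count used for padding are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=40, max_frames=200):
    """Load an audio file and return a fixed-size MFCC matrix (n_mfcc x max_frames)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every clip yields the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc
```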
24 pages, 2289 KiB  
Article
A Non-Invasive Approach for Facial Action Unit Extraction and Its Application in Pain Detection
by Mondher Bouazizi, Kevin Feghoul, Shengze Wang, Yue Yin and Tomoaki Ohtsuki
Bioengineering 2025, 12(2), 195; https://doi.org/10.3390/bioengineering12020195 - 17 Feb 2025
Viewed by 152
Abstract
A significant challenge that hinders advancements in medical research is the sensitive and confidential nature of patient data in available datasets. In particular, sharing patients’ facial images poses considerable privacy risks, especially with the rise of generative artificial intelligence (AI), which could misuse such data if accessed by unauthorized parties. However, facial expressions are a valuable source of information for doctors and researchers, which creates a need for methods to derive them without compromising patient privacy or safety by exposing identifiable facial images. To address this, we present a quick, computationally efficient method for detecting action units (AUs) and their intensities, key indicators of health and emotion, using only 3D facial landmarks. Our proposed framework extracts 3D face landmarks from video recordings and employs a lightweight neural network (NN) to identify AUs and estimate AU intensities based on these landmarks. Our proposed method reaches a 79.25% F1-score in detecting the main AUs and a Root Mean Square Error (RMSE) of 0.66 in AU intensity estimation. This performance shows that researchers can share 3D landmarks, which are far less intrusive, instead of facial images while maintaining high accuracy in AU detection. Moreover, to showcase the usefulness of our AU detection model, we trained state-of-the-art Deep Learning (DL) models to detect pain from the detected AUs and estimated intensities. Our method reaches 91.16% accuracy in pain detection, which is not far behind the 93.14% accuracy obtained when employing a convolutional neural network (CNN) with residual blocks trained on actual images, or the 92.11% accuracy obtained when employing all the ground-truth AUs. Full article
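The lightweight neural network described in this abstract maps 3D face landmarks to AU detections and intensity estimates. The PyTorch sketch below shows one possible layout under stated assumptions: the 478-point landmark mesh, the 12 main AUs, the hidden sizes, and the two output heads (sigmoid presence, linear intensity) are illustrative guesses, not the authors' configuration.

```python
import torch
import torch.nn as nn

class LandmarkAUNet(nn.Module):
    """Two-layer fully connected network: flattened 3D landmarks -> AU outputs."""

    def __init__(self, n_landmarks=478, n_aus=12, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_landmarks * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.presence = nn.Linear(hidden, n_aus)   # logits for AU detection
        self.intensity = nn.Linear(hidden, n_aus)  # regression head for AU intensity

    def forward(self, landmarks):                  # landmarks: (batch, n_landmarks, 3)
        h = self.backbone(landmarks.flatten(1))
        return torch.sigmoid(self.presence(h)), self.intensity(h)

# Example forward pass on random landmarks standing in for one video frame.
model = LandmarkAUNet()
probs, intensities = model(torch.randn(1, 478, 3))
```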
(This article belongs to the Section Biosignal Processing)
Show Figures

Figure 1: An example of the human face mesh superimposed on the human face itself. Areas around the eyes, the nose, and the mouth have higher landmark density than the remaining parts of the face.
Figure 2: A flowchart of the proposed framework. The framework is composed of three main components: an anonymizer, an AU detector, and a pain detector.
Figure 3: A block diagram of the proposed framework: upon generating the 3D face landmarks, a 2-layer FCNN with multiple outputs is used to detect the AUs. The sequence of detected AUs is then processed through a Transformer encoder to identify the class (pain); a sketch of this encoder stage follows the figure list.
Figure 4: The structure of the Transformer encoder used in our work.
Figure 5: Example of consecutive frames from the dataset (a few seconds apart) along with their detected face landmarks.
Figure 6: Distribution in percent of the different AUs in our dataset.
Figure 7: Precision, recall, and F1-scores of the detection of the secondary AUs.
Figure 8: Distribution of the intensity level for each action unit in our dataset.
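Figure 3 above notes that the per-frame AU vectors are passed as a sequence to a Transformer encoder for pain classification. The sketch below assembles that stage from the standard PyTorch encoder modules; the model dimension, head count, layer count, mean pooling, and sequence length are assumptions, not the configuration used in the article.

```python
import torch
import torch.nn as nn

class AUSequencePainClassifier(nn.Module):
    """Transformer encoder over a sequence of per-frame AU vectors -> pain / no pain."""

    def __init__(self, n_aus=12, d_model=64, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(n_aus, d_model)  # project AU vectors to the model dimension
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, au_seq):                  # au_seq: (batch, frames, n_aus)
        h = self.encoder(self.embed(au_seq))
        return self.head(h.mean(dim=1))         # mean-pool over time, then classify

# Example: a batch of one clip with 64 frames of 12 AU intensities.
logits = AUSequencePainClassifier()(torch.randn(1, 64, 12))
```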