Search Results (420)

Search Parameters:
Keywords = SHAP method

33 pages, 5826 KiB  
Article
Improving Churn Detection in the Banking Sector: A Machine Learning Approach with Probability Calibration Techniques
by Alin-Gabriel Văduva, Simona-Vasilica Oprea, Andreea-Mihaela Niculae, Adela Bâra and Anca-Ioana Andreescu
Electronics 2024, 13(22), 4527; https://doi.org/10.3390/electronics13224527 - 18 Nov 2024
Viewed by 255
Abstract
Identifying and reducing customer churn have become a priority for financial institutions seeking to retain clients. Our research focuses on customer churn rate analysis using advanced machine learning (ML) techniques, leveraging a synthetic dataset sourced from the Kaggle platform. The dataset undergoes a preprocessing phase to select variables directly impacting customer churn behavior. SMOTETomek, a hybrid technique that combines oversampling of the minority class (churn) with SMOTE and the removal of noisy or borderline instances through Tomek links, is applied to balance the dataset and improve class separability. Two cutting-edge ML models are applied: random forest (RF) and the Light Gradient-Boosting Machine (LGBM) classifier. To evaluate the effectiveness of these models, several key performance metrics are utilized, including precision, sensitivity, F1 score, accuracy, and the Brier score, which helps assess the calibration of the predicted probabilities. A particular contribution of our research is the calibration of classification probabilities, as many ML models tend to produce uncalibrated probabilities due to the complexity of their internal mechanisms. Probability calibration techniques are employed to adjust the predicted probabilities, enhancing their reliability and interpretability. Furthermore, the Shapley Additive Explanations (SHAP) method, an explainable artificial intelligence (XAI) technique, is implemented to increase the transparency and credibility of the model’s decision-making process. SHAP provides insights into the importance of individual features in predicting churn, giving banking institutions knowledge for the development of personalized customer retention strategies.
(This article belongs to the Special Issue Applied Machine Learning in Intelligent Systems)
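The SHAP method named in the search keywords is grounded in the Shapley value from cooperative game theory. As an illustrative sketch only (not the authors' code; the `shapley_values` function, the toy `effects` model, and the feature names are invented for this example), exact Shapley values can be computed for a small feature set by enumerating every coalition:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values for a small feature set by enumerating all
    coalitions. value_fn maps a frozenset of features to the model's
    expected output when only those features are 'present'."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # marginal contribution of f to coalition s
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

# Toy additive "model": a coalition's value is the sum of fixed effects.
effects = {"Age": 2.0, "Balance": -1.0, "NumOfProducts": 0.5}
v = lambda s: sum(effects[f] for f in s)
phi = shapley_values(list(effects), v)
```

For an additive model like this toy, each feature's Shapley value equals its individual effect, and the values sum to the full model output (the efficiency property). Practical SHAP libraries approximate this enumeration, which is exponential in the number of features.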
Show Figures

Figure 1. Flowchart of the research methodology framework.
Figure 2. Variables in the dataset, their non-null values and each one’s type.
Figure 3. Distributions of the CreditScore, Age, and Balance variables.
Figure 4. Distributions of the NumOfProducts and Exited variables.
Figure 5. Distribution of the number of customers by categories based on country and gender.
Figure 6. Correlation matrix between the numerical variables and the target Exited variable.
Figure 7. Confusion matrices for the uncalibrated and calibrated RF model.
Figure 8. Ranking of variable importance in the RF model.
Figure 9. Confusion matrices for the uncalibrated and calibrated LGBM.
Figure 10. Feature importance in the LGBM model.
Figure 11. ROC curves associated with the two models.
Figure 12. SHAP summary plot for the LGBM classifier.
Figure 13. SHAP summary plot for the RF classifier.
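The churn study above uses the Brier score to judge how well predicted probabilities are calibrated. A minimal plain-Python sketch of the idea (the function names and example data are invented for illustration; this is not the paper's implementation):

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and binary
    outcomes; lower is better, 0.0 is a perfect forecast."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)

def reliability_table(y_true, y_prob, n_bins=5):
    """Bin predictions by probability and compare each bin's mean
    predicted probability with its observed event rate; a calibrated
    model keeps the two close in every bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(y_prob, y_true):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b))
            for b in bins if b]

# Invented example outcomes and predicted churn probabilities.
y = [0, 0, 1, 0, 1, 1]
p = [0.1, 0.2, 0.9, 0.3, 0.7, 0.8]
```

With these invented numbers, the low-probability bin and the high-probability bin should each show a mean prediction near its observed churn rate; large gaps in any bin are the signal that calibration (e.g., Platt scaling or isotonic regression, as offered by common ML libraries) is needed.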
23 pages, 3970 KiB  
Article
Using Machine Learning and Feature Importance to Identify Risk Factors for Mortality in Pediatric Heart Surgery
by Lorenz A. Kapsner, Manuel Feißt, Ariawan Purbojo, Hans-Ulrich Prokosch, Thomas Ganslandt, Sven Dittrich, Jonathan M. Mang and Wolfgang Wällisch
Diagnostics 2024, 14(22), 2587; https://doi.org/10.3390/diagnostics14222587 - 18 Nov 2024
Viewed by 265
Abstract
Background: The objective of this IRB-approved retrospective monocentric study was to identify risk factors for mortality after surgery for congenital heart defects (CHDs) in pediatric patients using machine learning (ML). CHD is among the most common congenital malformations and remains the leading cause of mortality from birth defects. Methods: The most recent available hospital encounter for each patient under 18 years of age hospitalized for CHD-related cardiac surgery between 2011 and 2020 was included in this study. The cohort consisted of 1302 eligible patients (mean age [SD]: 402.92 [±562.31] days), who were categorized into four disease groups. A random survival forest (RSF) and the ‘eXtreme Gradient Boosting’ algorithm (XGB) were applied to model mortality (incidence: 5.6% [n = 73 events]). All models were then applied to predict the outcome in an independent holdout test dataset (40% of the cohort). Results: RSF and XGB achieved average C-indices of 0.85 (±0.01) and 0.79 (±0.03), respectively. Feature importance was assessed with ‘SHapley Additive exPlanations’ (SHAP) and ‘Time-dependent explanations of machine learning survival models’ (SurvSHAP(t)), both of which revealed the high importance of the maximum serum creatinine values observed within 72 h post-surgery for both ML methods. Conclusions: ML methods, along with model explainability tools, can reveal interesting insights into mortality risk after surgery for CHD. The proposed analytical workflow can serve as a blueprint for translating the analysis into a federated setting that builds upon the infrastructure of the German Medical Informatics Initiative.
(This article belongs to the Special Issue Machine-Learning-Based Disease Diagnosis and Prediction)
Show Figures

Figure 1. Performance: boxplots visualizing the performance of the applied models during validation and when predicting the outcome in the independent holdout test dataset. The underlying data for each boxplot are the performance of the 100 models from the repeated CV during validation and when applying these 100 models to predict the outcome in the holdout test dataset. XGB: xgboost; RSF: random survival forest; CPH: Cox proportional hazards regression.
Figure 2. Feature importance: SHAP beeswarm plots. XGB: xgboost; RSF: random survival forest.
Figure 3. Feature importance: global SurvSHAP(t) values showing feature importance as a function of the survival time for the random survival forest (RSF). The SurvSHAP(t) values of the seven most important features were averaged by feature and evaluation time point across all 100 repeated CV models. On the x-axis, red ticks mark event time points and black ticks mark censored time points.
Figure 4. Feature importance ranks: counts of the occurrence of variables within the five most important features as defined by their mean absolute SHAP values and SurvSHAP(t) values, respectively, for each repeated CV model. XGB: xgboost; RSF: random survival forest.
Figure 5. Feature importance: SHAP force plots for (a) XGB and (b) RSF. UVHD I: disease group univentricular heart failure (HF) 1; UVHD II: disease group univentricular HF 2; BVHD cmplx.: disease group biventricular HF complex; BVHD smpl.: disease group biventricular HF simple; XGB: xgboost; RSF: random survival forest.
Figure 6. Distribution of the maximum values of serum creatinine observed within 72 h post-surgery for the heart disease groups, stratified by the training dataset and the test dataset. UVHD I: disease group univentricular heart failure (HF) 1; UVHD II: disease group univentricular HF 2; BVHD cmplx.: disease group biventricular HF complex; BVHD smpl.: disease group biventricular HF simple.
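The C-indices reported above (0.85 for RSF, 0.79 for XGB) measure how well a survival model ranks patients by risk. A plain-Python sketch of Harrell's concordance index (illustrative only; the function name and example data are invented, and production survival libraries handle ties and censoring more carefully):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: among comparable pairs (the patient with the
    shorter observed time actually had the event), count how often the
    model assigns that patient the higher risk score. Tied risk scores
    count as 0.5. Censored patients (event = 0) never anchor a pair."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i had the event before j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

A C-index of 1.0 means risk scores perfectly order the observed event times, 0.5 is no better than chance, so the 0.85 reported for the RSF indicates a strong ranking of mortality risk.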
26 pages, 9107 KiB  
Article
A Method for Prediction and Analysis of Student Performance That Combines Multi-Dimensional Features of Time and Space
by Zheng Luo, Jiahao Mai, Caihong Feng, Deyao Kong, Jingyu Liu, Yunhong Ding, Bo Qi and Zhanbo Zhu
Mathematics 2024, 12(22), 3597; https://doi.org/10.3390/math12223597 - 17 Nov 2024
Viewed by 452
Abstract
The prediction and analysis of students’ academic performance are essential tools for educators and learners to improve teaching and learning methods. Effective predictive methods assist learners in targeted studying based on forecast results, while effective analytical methods help educators design appropriate educational content. However, in actual educational environments, the factors influencing student performance are multidimensional across both time and space. Therefore, this study proposes a student performance prediction and analysis method incorporating multidimensional spatiotemporal features. Because learning behaviors in the educational process are complex and nonlinear, predicting students’ academic performance effectively is challenging; machine learning algorithms, however, possess significant advantages in handling data complexity and nonlinearity. Initially, a multidimensional spatiotemporal feature dataset was constructed by combining three categories of features: students’ basic information, performance at various stages of the semester (the temporal aspect), and educational indicators from their places of origin (the spatial aspect). Subsequently, six machine learning models were trained on this dataset to predict student performance, and experimental results confirmed their accuracy. SHAP analysis was then used to extract the factors with the greatest impact on the experimental outcomes, and data ablation experiments confirmed the soundness of the feature selection. Finally, this study proposes a feasible solution for guiding teaching strategies by integrating spatiotemporal multidimensional features into the analysis of student performance prediction in actual teaching processes.
Show Figures

Figure 1. Schematic diagram of the LightGBM algorithm.
Figure 2. Schematic diagram of the Leaf-Wise algorithm.
Figure 3. Random forest algorithm flow.
Figure 4. Overall design drawing.
Figure 5. The accuracy of each model on the dataset with three types of features.
Figure 6. The recall of each model on the dataset with three types of features.
Figure 7. The precision of each model on the dataset with three types of features.
Figure 8. The F1 scores of each model on the dataset with three types of features.
Figure 9. Comparison of the true and predicted values of each model.
Figure 10. XGBoost.
Figure 11. Random Forest.
Figure 12. LightGBM.
Figure 13. AdaBoost.
Figure 14. Decision Tree.
Figure 15. SVM.
Figure 16. XGBoost summary SHAP plot.
Figure 17. The weights for each model.
Figure 18. The average of the four metrics across all models across seven datasets.
Figure 19. The operation process of the model in the teaching process.
22 pages, 5816 KiB  
Article
Causality-Driven Feature Selection for Calibrating Low-Cost Airborne Particulate Sensors Using Machine Learning
by Vinu Sooriyaarachchi, David J. Lary, Lakitha O. H. Wijeratne and John Waczak
Sensors 2024, 24(22), 7304; https://doi.org/10.3390/s24227304 - 15 Nov 2024
Viewed by 368
Abstract
With escalating global environmental challenges and worsening air quality, there is an urgent need for enhanced environmental monitoring capabilities. Low-cost sensor networks are emerging as a vital solution, enabling widespread and affordable deployment at fine spatial resolutions. In this context, machine learning for the calibration of low-cost sensors is particularly valuable. However, traditional machine learning models often lack interpretability and generalizability when applied to complex, dynamic environmental data. To address this, we propose a causal feature selection approach based on convergent cross mapping within the machine learning pipeline to build more robustly calibrated sensor networks. This approach is applied to the calibration of a low-cost optical particle counter, the OPC-N3, effectively reproducing the PM1 and PM2.5 measurements recorded by research-grade spectrometers. We evaluated the predictive performance and generalizability of these causally optimized models, observing improvements in both while reducing the number of input features, in keeping with Occam’s razor. For the PM1 calibration model, the proposed feature selection reduced the mean squared error on the test set by 43.2% compared to the model with all input features, while SHAP value-based selection achieved a reduction of only 29.6%. Similarly, for the PM2.5 model, the proposed feature selection led to a 33.2% reduction in the mean squared error, outperforming the 30.2% reduction achieved by SHAP value-based selection. By integrating sensors with advanced machine learning techniques, this approach advances urban air quality monitoring, fostering a deeper scientific understanding of microenvironments. Beyond the current test cases, this feature selection method holds potential for broader use in other environmental monitoring applications, contributing to the development of interpretable and robust environmental models.
(This article belongs to the Section Sensor Networks)
Show Figures

Figure 1. (a) Attractor manifold of the canonical Lorenz system (M) plotted in 3D space, showing the trajectory of the original system in the state space with variables X, Y, and Z. (b) Reconstructed manifold M_X using delay-coordinate embedding of the X variable. The coordinates X(t), X(t − τ), and X(t − 2τ) approximate the original attractor dynamics, capturing the structure of the system based only on the X time series. (c) Reconstructed manifold M_Y using delay-coordinate embedding of the Y variable. The coordinates Y(t), Y(t − τ), and Y(t − 2τ) again form an attractor diffeomorphic to the original manifold, illustrating how the Y time series alone, through lagged coordinates, captures the dynamics of the system.
Figure 2. Proposed causality-driven feature selection pipeline.
Figure 3. Input features to the PM1 calibration model ranked in descending order of mean absolute SHAP values. The 10 highest-ranked features are highlighted in red.
Figure 4. Potential input features to the PM1 calibration model ranked in descending order of strength of causal influence after eliminating features with p-value ≥ 0.05. The 10 highest-ranked features are highlighted in red.
Figure 5. Input features to the PM2.5 calibration model ranked in descending order of mean absolute SHAP values. The 10 highest-ranked features are highlighted in red.
Figure 6. Potential input features to the PM2.5 calibration model ranked in descending order of strength of causal influence after eliminating features with p-value ≥ 0.05. The 10 highest-ranked features are highlighted in red.
Figure 7. Scatter diagram comparing the PM1 measurements from the reference instrument on the x-axis against the PM1 estimates from the OPC-N3 on the y-axis prior to calibration.
Figure 8. Density plots of the residuals for the PM1 calibration models derived from each approach.
Figure 9. Scatter diagrams for the calibration models, with the x-axis showing the PM1 count from the reference instrument and the y-axis showing the PM1 count provided by calibrating the LCS: (a) without any feature selection; (b) SHAP value-based feature selection; (c) causality-based feature selection; (d) comparison of true vs. predicted values for the test set across models.
Figure 10. Scatter diagram comparing the PM2.5 measurements from the reference instrument on the x-axis against the PM2.5 estimates from the OPC-N3 on the y-axis prior to calibration.
Figure 11. Density plots of the residuals for the PM2.5 calibration models derived from each approach.
Figure 12. Scatter diagrams for the calibration models, with the x-axis showing the PM2.5 count from the reference instrument and the y-axis showing the PM2.5 count provided by calibrating the LCS: (a) without any feature selection; (b) SHAP value-based feature selection; (c) causality-based feature selection; (d) comparison of true vs. predicted values for the test set across models.
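Convergent cross mapping, the basis of the causal feature selection above, starts from delay-coordinate (Takens) embedding: reconstructing a shadow manifold from lags of a single time series, as illustrated in the paper's Figure 1. A minimal sketch (the `delay_embed` function and example series are invented for illustration; real CCM pipelines add nearest-neighbor cross-prediction on top of this):

```python
def delay_embed(series, dim, tau):
    """Reconstruct a shadow manifold from one time series using
    delay-coordinate embedding: each point is the lag vector
    (x(t), x(t - tau), ..., x(t - (dim - 1) * tau))."""
    start = (dim - 1) * tau  # earliest t with all lags available
    return [tuple(series[t - k * tau] for k in range(dim))
            for t in range(start, len(series))]
```

Roughly, CCM then asks whether nearest neighbors on one variable's shadow manifold can reliably recover the values of another variable; when they can, a causal connection between the two series is inferred, which is what ranks the candidate features in Figures 4 and 6.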
14 pages, 2103 KiB  
Article
Initial Development of Automated Machine Learning-Assisted Prediction Tools for Aryl Hydrocarbon Receptor Activators
by Paulina Anna Wojtyło, Natalia Łapińska, Lucia Bellagamba, Emidio Camaioni, Aleksander Mendyk and Stefano Giovagnoli
Pharmaceutics 2024, 16(11), 1456; https://doi.org/10.3390/pharmaceutics16111456 - 15 Nov 2024
Viewed by 356
Abstract
Background: The aryl hydrocarbon receptor (AhR) plays a crucial role in immune and metabolic processes. The large molecular diversity of ligands capable of activating AhR makes it impossible to determine the structural features useful for the design of new potent modulators. Thus, in the field of drug discovery, the intricate nature of AhR activation necessitates the development of novel tools to address related challenges. Methods: In this study, quantitative structure–activity relationship (QSAR) classification and regression models were developed with the objective of identifying the most effective method for predicting AhR activity. The initial dataset was obtained by combining the ChEMBL and WIPO databases, which together contained 978 molecules with EC50 values. The predictive models were developed using the automated machine learning platform mljar according to a 10-fold cross-validation (10-CV) testing procedure. Results: The classification model demonstrated an accuracy of 0.760 and an F1 score of 0.789 for the test set. The root-mean-squared error (RMSE) was 5444, and the coefficient of determination (R2) was 0.208 for the regression model. The Shapley Additive Explanations (SHAP) method was then employed for a deeper comprehension of the impact of the variables on the model’s predictions. As a practical application, the best-performing classification model was used to develop an AhR web application, which is accessible online and implemented in Streamlit. Conclusions: The findings may serve as a foundation for further research into the development of QSAR models, which could enhance comprehension of the influence of ligand structure on the modulation of AhR activity.
Show Figures

Figure 1. Scheme of dataset composition.
Figure 2. Frequency of molecular weight (a) and logP (b) values in the curated database.
Figure 3. Distribution of EC50 values in different ranges in the classification model.
Figure 4. Distribution of EC50 values in different ranges in the regression model.
Figure 5. Confusion matrices for the training (left) and testing (right) sets, showing the classification performance for active and inactive molecules.
Figure 6. A summary of the SHAP analysis for the most important features. The color bar represents the range of the feature values: low (blue) and high (red).
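The accuracy (0.760) and F1 (0.789) figures reported above come from standard confusion-matrix arithmetic. A small sketch (the function name and the example counts are invented; this is not the paper's code):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy and F1 from confusion-matrix counts.
    F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, f1
```

Because F1 ignores true negatives, it can exceed accuracy when the positive class is predicted well relative to the negative class, which is consistent with the paper reporting F1 (0.789) above accuracy (0.760).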
18 pages, 4192 KiB  
Article
Application of Isokinetic Dynamometry Data in Predicting Gait Deviation Index Using Machine Learning in Stroke Patients: A Cross-Sectional Study
by Xiaolei Lu, Chenye Qiao, Hujun Wang, Yingqi Li, Jingxuan Wang, Congxiao Wang, Yingpeng Wang and Shuyan Qie
Sensors 2024, 24(22), 7258; https://doi.org/10.3390/s24227258 - 13 Nov 2024
Viewed by 310
Abstract
Background: Three-dimensional gait analysis, supported by advanced sensor systems, is a crucial component in the rehabilitation assessment of post-stroke hemiplegic patients. However, the sensor data generated from such analyses are often complex and challenging to interpret in clinical practice, requiring significant time and complicated procedures. The Gait Deviation Index (GDI) serves as a simplified metric for quantifying the severity of pathological gait. Although isokinetic dynamometry, utilizing sophisticated sensors, is widely employed in muscle function assessment and rehabilitation, its application in gait analysis remains underexplored. Objective: This study aims to investigate the use of sensor-acquired isokinetic muscle strength data, combined with machine learning techniques, to predict the GDI in hemiplegic patients. It uses data captured from sensors embedded in the Biodex dynamometry system and the Vicon 3D motion capture system, highlighting the integration of sensor technology in clinical gait analysis. Methods: This was a cross-sectional, observational study that included a cohort of 150 post-stroke hemiplegic patients. The sensor data included measurements such as peak torque, peak torque/body weight, maximum work of repeated actions, coefficient of variation, average power, total work, acceleration time, deceleration time, range of motion, and average peak torque for both the flexor and extensor muscles on the affected side at three angular velocities (60°/s, 90°/s, and 120°/s) using the Biodex System 4 Pro. The GDI was calculated using data from a Vicon 3D motion capture system. Four machine learning models, Lasso Regression, Random Forest (RF), Support Vector Regression (SVR), and a BP Neural Network, were used to model and validate the sensor data. Model performance was evaluated using the mean squared error (MSE), the coefficient of determination (R2), and the mean absolute error (MAE). SHapley Additive exPlanations (SHAP) analysis was used to enhance model interpretability. Results: The RF model outperformed the others in predicting GDI, with an MSE of 16.18, an R2 of 0.89, and an MAE of 2.99. In contrast, the Lasso Regression model yielded an MSE of 22.29, an R2 of 0.85, and an MAE of 3.71. The SVR model had an MSE of 31.58, an R2 of 0.82, and an MAE of 7.68, while the BP Neural Network model exhibited the poorest performance, with an MSE of 50.38, an R2 of 0.79, and an MAE of 9.59. SHAP analysis identified the maximum work of repeated actions of the extensor muscles at 60°/s and 120°/s as the most critical sensor-derived features for predicting GDI, underscoring the importance of muscle strength metrics at varying speeds in rehabilitation assessments. Conclusions: This study highlights the potential of integrating advanced sensor technology with machine learning techniques in the analysis of complex clinical data. The developed GDI prediction model, based on sensor-acquired isokinetic dynamometry data, offers a novel, streamlined, and effective tool for assessing rehabilitation progress in post-stroke hemiplegic patients, with promising implications for broader clinical application.
Show Figures

Figure 1. Overview of the proposed pipeline. A set of 20 features selected through RFE, including test speed (3 features), muscle tests (2 features), test side (2 features), and outcome metrics (10 features), were used as inputs to four distinct data modeling techniques: Lasso Regression, BP Neural Network, RF Regression, and SVR. The objective was to predict the GDI score obtained via the Vicon motion capture system. Hyperparameter optimization was performed using 10-fold cross-validation. Finally, to provide clear interpretability of the results, the SHAP technique was employed and compared.
Figure 2. (A) Isokinetic dynamometry procedure; (B) Vicon 3D gait analysis static calibration; (C) software calculation procedure.
Figure 3. Distribution of selected features.
Figure 4. Histogram of dataset distribution.
Figure 5. Correlation matrix of selected features. Note: the colors in the matrix range from blue to red, indicating the direction and strength of the correlations between features. Blue represents negative correlations and red positive correlations, with the intensity of the color reflecting the strength of the correlation.
Figure 6. RF model prediction scatterplot. Note: the dashed line represents the reference line for perfect prediction (i.e., where the predicted values equal the actual values).
Figure 7. SHAP summary plots of the different models: (A) RF; (B) SVR; (C) BP Neural Network; (D) Lasso Regression. Note: the color gradient from blue (low value) to red (high value) represents the size of the feature values, and the x-axis represents the SHAP values, with larger values indicating a more significant impact of the feature on the model’s output.
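The MSE, R2, and MAE comparisons reported above follow the usual definitions; a compact plain-Python sketch (the function name and example values are invented for illustration):

```python
def regression_metrics(y_true, y_pred):
    """MSE, MAE, and the coefficient of determination R^2,
    where R^2 = 1 - SS_res / SS_tot and SS_tot measures the spread
    of y_true around its own mean."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    mse = ss_res / n
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2
```

R2 = 1.0 means perfect prediction and R2 = 0.0 means no better than always predicting the mean, so the RF model's 0.89 versus the BP Neural Network's 0.79 marks a substantial gap in explained variance of the GDI.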
41 pages, 6420 KiB  
Article
Analyzing Autonomous Vehicle Collision Types to Support Sustainable Transportation Systems: A Machine Learning and Association Rules Approach
by Ehsan Kohanpour, Seyed Rasoul Davoodi and Khaled Shaaban
Sustainability 2024, 16(22), 9893; https://doi.org/10.3390/su16229893 - 13 Nov 2024
Viewed by 460
Abstract
The increasing presence of autonomous vehicles (AVs) in transportation, driven by advances in AI and robotics, requires a strong focus on safety in mixed-traffic environments to promote sustainable transportation systems. This study analyzes AV crashes in California using advanced machine learning to identify [...] Read more.
The increasing presence of autonomous vehicles (AVs) in transportation, driven by advances in AI and robotics, requires a strong focus on safety in mixed-traffic environments to promote sustainable transportation systems. This study analyzes AV crashes in California using advanced machine learning to identify patterns among various crash factors. The main objective is to explore AV crash mechanisms by extracting association rules and developing a decision tree model to understand interactions between pre-crash conditions, driving states, crash types, severity, locations, and other variables. A multi-faceted approach, including statistical analysis, data mining, and machine learning, was used to model crash types. The SMOTE method addressed data imbalance; CART, RF, and XGB were applied for modeling, Apriori for association rule mining, and SHAP and Pearson’s test for analysis. Findings reveal that rear-end crashes are the most common, making up over 50% of incidents. Side crashes at night are also frequent, while angular and head-on crashes tend to be more severe. The study identifies high-risk locations, such as complex unsignalized intersections, and highlights the need for improved AV sensor technology, AV–infrastructure coordination, and driver training. Technological advancements like V2V and V2I communication are suggested to significantly reduce the number and severity of specific types of crashes, thereby enhancing the overall safety and sustainability of transportation systems. Full article
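The association-rule step can be illustrated with a tiny Apriori sketch over invented crash records; the attribute tokens, support threshold, and confidence threshold below are hypothetical and not taken from the CA DMV data.

```python
from itertools import combinations

def apriori_rules(transactions, min_support=0.4, min_conf=0.7):
    """Tiny Apriori: find frequent itemsets by support, then emit
    rules antecedent -> consequent filtered by confidence."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    support = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            s = sum(set(cand) <= t for t in transactions) / n
            if s >= min_support:
                support[cand] = s
                found = True
        if not found:
            break  # downward closure: no larger frequent itemsets exist
    rules = []
    for iset, s in support.items():
        if len(iset) < 2:
            continue
        for r in range(1, len(iset)):
            for ante in combinations(iset, r):
                conf = s / support[ante]
                if conf >= min_conf:
                    cons = tuple(i for i in iset if i not in ante)
                    rules.append((ante, cons, round(conf, 2)))
    return rules

# Hypothetical AV crash records (attribute tokens per crash)
crashes = [
    {"rear-end", "autonomous-mode", "intersection"},
    {"rear-end", "autonomous-mode", "daylight"},
    {"side", "night", "intersection"},
    {"rear-end", "autonomous-mode", "intersection"},
    {"side", "night", "unsignalized"},
]
rules = apriori_rules(crashes)
```

On these toy records the rule "rear-end ⇒ autonomous-mode" comes out with confidence 1.0, mirroring the kind of pattern the study extracts at scale.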
Figure 1
<p>Conceptual framework: the process from crash data extraction to modeling.</p>
Full article ">Figure 2
<p>The heat map of AV crashes in the test areas.</p>
Full article ">Figure 3
<p>The sample OL-316 form for the AV collision report provided by the CA DMV is presented. (<b>a</b>) First page of form OL-316; (<b>b</b>) Second page of form OL-316; (<b>c</b>) Third page of form OL-316.</p>
Full article ">Figure 4
<p>Word cloud of points of interest with the highest number of crashes.</p>
Full article ">Figure 5
<p>Descriptive statistics of CA DMV data as of 31 December 2023.</p>
Full article ">Figure 6
<p>Descriptive statistics of CA DMV data. (<b>a</b>) Types of ADS disengagement; (<b>b</b>) Type of intersection at the collision site; (<b>c</b>) Intersection with traffic signals; (<b>d</b>) Types of AV collisions; (<b>e</b>) AV driving mode; (<b>f</b>) Collision severity.</p>
Full article ">Figure 7
<p>Decision tree for classification and regression for the variable of collision type.</p>
Full article ">Figure 8
<p>Association rules bubble chart.</p>
Full article ">Figure 9
<p>Variable importance for collision type using XGB, CART, and RF algorithms.</p>
Full article ">Figure 10
<p>Feature importance with SHAP. (<b>a</b>) Impact on model output; (<b>b</b>) Average impact on model output.</p>
Full article ">
17 pages, 20963 KiB  
Article
Landslide Susceptibility Mapping Based on Ensemble Learning in the Jiuzhaigou Region, Sichuan, China
by Bangsheng An, Zhijie Zhang, Shenqing Xiong, Wanchang Zhang, Yaning Yi, Zhixin Liu and Chuanqi Liu
Remote Sens. 2024, 16(22), 4218; https://doi.org/10.3390/rs16224218 - 12 Nov 2024
Viewed by 392
Abstract
Accurate landslide susceptibility mapping is vital for disaster forecasting and risk management. To address the problem of limited accuracy of individual classifiers and lack of model interpretability in machine learning-based models, a coupled multi-model framework for landslide susceptibility mapping is proposed. Using Jiuzhaigou [...] Read more.
Accurate landslide susceptibility mapping is vital for disaster forecasting and risk management. To address the problem of limited accuracy of individual classifiers and lack of model interpretability in machine learning-based models, a coupled multi-model framework for landslide susceptibility mapping is proposed. Using Jiuzhaigou County, Sichuan Province, as a case study, we developed an evaluation index system incorporating 14 factors. We employed three base models—logistic regression, support vector machine, and Gaussian Naive Bayes—assessed through four ensemble methods: Stacking, Voting, Bagging, and Boosting. The decision mechanisms of these models were explained via a SHAP (SHapley Additive exPlanations) analysis. Results demonstrate that integrating machine learning with ensemble learning and SHAP yields more reliable landslide susceptibility mapping and enhances model interpretability. This approach effectively addresses the challenges of unreliable landslide susceptibility mapping in complex environments. Full article
(This article belongs to the Special Issue Remote Sensing Data for Modeling and Managing Natural Disasters)
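Of the four ensemble methods compared, soft Voting is the simplest to sketch. Below, three stand-in probability functions play the roles of the trained LR, SVM, and GNB base models; their rules and the slope input are invented for illustration, not fitted models.

```python
def soft_vote(prob_fns, x):
    """Average the base models' predicted class-1 probabilities and
    threshold at 0.5 (a soft Voting ensemble)."""
    p = sum(f(x) for f in prob_fns) / len(prob_fns)
    return int(p >= 0.5), p

# Hypothetical base learners returning P(landslide) from a slope value
lr  = lambda x: min(1.0, max(0.0, 0.02 * x))  # stand-in for logistic regression
svm = lambda x: 1.0 if x > 30 else 0.2        # stand-in for SVM
gnb = lambda x: 0.9 if x > 25 else 0.1        # stand-in for Gaussian Naive Bayes

label_steep, p_steep = soft_vote([lr, svm, gnb], 40)   # steep slope
label_flat, p_flat = soft_vote([lr, svm, gnb], 10)     # gentle slope
```

Stacking differs only in replacing the fixed average with a trained meta-learner over the base models' outputs.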
Figure 1
<p>(<b>a</b>,<b>b</b>) are the geographical location of the study area; (<b>c</b>) is the elevation of the study area.</p>
Full article ">Figure 2
<p>Data layers of landslide conditioning factors: (<b>a</b>) slope, (<b>b</b>) aspect, (<b>c</b>) plan curvature, (<b>d</b>) profile curvature, (<b>e</b>) TWI, (<b>f</b>) SPI, (<b>g</b>) lithology, (<b>h</b>) distance to road, (<b>i</b>) distance to history earthquake, (<b>j</b>) distance to human active point, (<b>k</b>) distance to faults, (<b>l</b>) LULC, (<b>m</b>) NDVI, and (<b>n</b>) annual precipitation.</p>
Figure 2 Cont.">
Full article ">Figure 3
<p>Flowchart of the method.</p>
Full article ">Figure 4
<p>Multicollinearity of the 14 factors.</p>
Full article ">Figure 5
<p>ROC of landslide models: (<b>a</b>) on the training dataset, (<b>b</b>) on the validation dataset.</p>
Full article ">Figure 6
<p>Landslide susceptibility zoning results from LSM models: (<b>a</b>) SVM, (<b>b</b>) GNB, (<b>c</b>) LR, (<b>d</b>) Bagging, (<b>e</b>) Boosting, (<b>f</b>) Voting, and (<b>g</b>) Stacking.</p>
Full article ">Figure 7
<p>(<b>a</b>) The importance of factors and (<b>b</b>) the summary plot of SHAP values.</p>
Full article ">Figure 8
<p>Local interpretation of landslide unit.</p>
Full article ">
20 pages, 2819 KiB  
Article
Mortality Prediction Modeling for Patients with Breast Cancer Based on Explainable Machine Learning
by Sang Won Park, Ye-Lin Park, Eun-Gyeong Lee, Heejung Chae, Phillip Park, Dong-Woo Choi, Yeon Ho Choi, Juyeon Hwang, Seohyun Ahn, Keunkyun Kim, Woo Jin Kim, Sun-Young Kong, So-Youn Jung and Hyun-Jin Kim
Cancers 2024, 16(22), 3799; https://doi.org/10.3390/cancers16223799 - 12 Nov 2024
Viewed by 527
Abstract
Background/Objectives: Breast cancer is the most common cancer in women worldwide, requiring strategic efforts to reduce its mortality. This study aimed to develop a predictive classification model for breast cancer mortality using real-world data, including various clinical features. Methods: A total [...] Read more.
Background/Objectives: Breast cancer is the most common cancer in women worldwide, requiring strategic efforts to reduce its mortality. This study aimed to develop a predictive classification model for breast cancer mortality using real-world data, including various clinical features. Methods: A total of 11,286 patients with breast cancer from the National Cancer Center were included in this study. The mortality rate of the total sample was approximately 6.2%. Propensity score matching was used to reduce bias. Several machine learning models, including extreme gradient boosting, were applied to 31 clinical features. To enhance model interpretability, we used the SHapley Additive exPlanations method. ML analyses were also performed on the samples, excluding patients who developed other cancers after breast cancer. Results: Among the ML models, the XGB model exhibited the highest discriminatory power, with an area under the curve of 0.8722 and a specificity of 0.9472. Key predictors of the mortality classification model included occurrence in other organs, age at diagnosis, N stage, T stage, curative radiation treatment, and Ki-67(%). Even after excluding patients who developed other cancers after breast cancer, the XGB model remained the best-performing, with an AUC of 0.8518 and a specificity of 0.9766. Additionally, the top predictors from SHAP were similar to the results for the overall sample. Conclusions: Our models provided excellent predictions of breast cancer mortality using real-world data from South Korea. Explainable artificial intelligence, such as SHAP, validated the clinical applicability and interpretability of these models. Full article
(This article belongs to the Section Clinical Research of Cancer)
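The propensity score matching step can be sketched as greedy 1:1 nearest-neighbour matching within a caliper. The paper does not state its exact matching scheme, so this variant, the caliper of 0.05, and all scores below are illustrative assumptions.

```python
def greedy_match(treated, controls, caliper=0.05):
    """1:1 nearest-neighbour propensity-score matching without
    replacement: each treated case takes the closest unused control
    whose score lies within the caliper."""
    pool = dict(controls)  # control id -> propensity score
    pairs = []
    for tid, ts in sorted(treated, key=lambda p: p[1]):
        if not pool:
            break
        cid = min(pool, key=lambda c: abs(pool[c] - ts))
        if abs(pool[cid] - ts) <= caliper:
            pairs.append((tid, cid))
            del pool[cid]  # without replacement
    return pairs

# Hypothetical propensity scores as (patient id, score)
treated = [("t1", 0.30), ("t2", 0.62)]
controls = [("c1", 0.28), ("c2", 0.61), ("c3", 0.90)]
pairs = greedy_match(treated, controls)
```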
Figure 1
<p>The scheme of the study flow.</p>
Full article ">Figure 2
<p>Flow chart of clinical variables with participants.</p>
Full article ">Figure 3
<p>(<b>A</b>) ROC curve for the overall breast cancer group, (<b>B</b>) ROC curve for the breast-cancer-only group. ROC, Receiver operating characteristic.</p>
Full article ">Figure 4
<p>The SHAP summary plots of the XGB: (<b>A</b>) overall breast cancer group and (<b>B</b>) breast-cancer-only group. SHAP, Shapley additive explanations; XGB, extreme gradient boosting.</p>
Full article ">
28 pages, 6321 KiB  
Article
Explainable and Interpretable Model for the Early Detection of Brain Stroke Using Optimized Boosting Algorithms
by Yogita Dubey, Yashraj Tarte, Nikhil Talatule, Khushal Damahe, Prachi Palsodkar and Punit Fulzele
Diagnostics 2024, 14(22), 2514; https://doi.org/10.3390/diagnostics14222514 - 9 Nov 2024
Viewed by 681
Abstract
Stroke stands as a prominent global health issue, causing considerable mortality and debilitation. It arises when cerebral blood flow is compromised, leading to irreversible brain cell damage or death. Leveraging the power of machine learning, this paper presents a systematic approach to predict [...] Read more.
Stroke stands as a prominent global health issue, causing considerable mortality and debilitation. It arises when cerebral blood flow is compromised, leading to irreversible brain cell damage or death. Leveraging the power of machine learning, this paper presents a systematic approach to predict stroke-patient survival based on a comprehensive set of factors. These factors include demographic attributes, medical history, lifestyle elements, and physiological metrics. An effective random sampling method is proposed to handle the highly biased data on strokes. Stroke prediction using optimized boosting machine-learning algorithms is supported with explainable AI using LIME and SHAP. This enables the models to discern intricate data patterns and establish correlations between selected features and patient survival. Through this approach, the study seeks to uncover actionable insights to guide healthcare practitioners in devising personalized treatment strategies for stroke patients. Full article
(This article belongs to the Special Issue A New Era in Diagnosis: From Biomarkers to Artificial Intelligence)
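The random sampling used to rebalance the highly skewed stroke data can be conveyed by a minimal random-oversampling sketch; the rows, labels, and column meanings are invented, and the paper's exact sampling scheme may differ.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class has
    as many samples as the largest class."""
    rng = random.Random(seed)
    by_cls = {}
    for r, y in zip(rows, labels):
        by_cls.setdefault(y, []).append(r)
    target = max(len(v) for v in by_cls.values())
    out_rows, out_labels = [], []
    for y, rs in by_cls.items():
        extra = [rng.choice(rs) for _ in range(target - len(rs))]
        for r in rs + extra:
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels

# Toy records: [age, hypertension flag], label 1 = stroke (minority)
rows = [[60, 1], [70, 0], [55, 0], [80, 0], [65, 0]]
labels = [1, 0, 0, 0, 0]
X, y = random_oversample(rows, labels)
```

After oversampling, both classes contribute equally to training, which is the property the boosting models need before LIME and SHAP are applied to explain them.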
Figure 1
<p>Proposed framework for stroke prediction using boosting algorithms with LIME and SHAP explanation.</p>
Full article ">Figure 2
<p>(<b>a</b>) Histogram plot and (<b>b</b>) probability density function of age with respect to stroke.</p>
Full article ">Figure 3
<p>Gender-wise distribution of stroke data.</p>
Full article ">Figure 4
<p>Histogram plot and probability density function of average glucose level and body mass index.</p>
Full article ">Figure 5
<p>Distribution of features: (<b>a</b>) Age; (<b>b</b>) Average glucose level; (<b>c</b>) Body mass index for stroke prediction.</p>
Full article ">Figure 6
<p>Distribution of patients’ data in the stroke and not-stroke categories.</p>
Full article ">Figure 7
<p>ROC and AUC values obtained using three ML algorithms for stroke predictions.</p>
Full article ">Figure 8
<p>Model log loss obtained using GB, XGB, and ADB.</p>
Full article ">Figure 9
<p>Precision in classifying stroke and non-stroke patients using GB, XGB, and ADB.</p>
Full article ">Figure 10
<p>Recall in classifying stroke and non-stroke patients using GB, XGB, and ADB.</p>
Full article ">Figure 11
<p>F1-score in classifying stroke and non-stroke patients using GB, XGB, and ADB.</p>
Full article ">Figure 12
<p>SHAP values for features contributing stroke.</p>
Full article ">Figure 13
<p>Impact of SHAP values on predicting stroke.</p>
Full article ">Figure 14
<p>(<b>a</b>) LIME explanation for prediction of having a stroke for the 8th instance; (<b>b</b>) Range of feature values.</p>
Figure 14 Cont.">
Full article ">Figure 15
<p>(<b>a</b>) LIME explanation for prediction of not having a stroke for the 25th instance; (<b>b</b>) Range of feature values.</p>
Full article ">
20 pages, 10237 KiB  
Article
A Leaf Chlorophyll Content Estimation Method for Populus deltoides (Populus deltoides Marshall) Using Ensembled Feature Selection Framework and Unmanned Aerial Vehicle Hyperspectral Data
by Zhulin Chen, Xuefeng Wang, Shijiao Qiao, Hao Liu, Mengmeng Shi, Xingjing Chen, Haiying Jiang and Huimin Zou
Forests 2024, 15(11), 1971; https://doi.org/10.3390/f15111971 - 8 Nov 2024
Viewed by 384
Abstract
Leaf chlorophyll content (LCC) is a key indicator in representing the photosynthetic capacity of Populus deltoides (Populus deltoides Marshall). Unmanned aerial vehicle (UAV) hyperspectral imagery provides an effective approach for LCC estimation, but the issue of band redundancy significantly impacts model accuracy [...] Read more.
Leaf chlorophyll content (LCC) is a key indicator of the photosynthetic capacity of Populus deltoides (Populus deltoides Marshall). Unmanned aerial vehicle (UAV) hyperspectral imagery provides an effective approach for LCC estimation, but the issue of band redundancy significantly impacts model accuracy and computational efficiency. Commonly used single feature selection algorithms not only fail to balance computational efficiency with optimal set search but also struggle to combine different regression algorithms under dynamic set conditions. This study proposes an ensemble feature selection framework to enhance LCC estimation accuracy using UAV hyperspectral data. Firstly, the embedded algorithm was improved by introducing the SHapley Additive exPlanations (SHAP) algorithm into the ranking system. A dynamic ranking strategy was then employed to remove bands in steps of 10, with LCC models developed at each step to identify the initial band subset based on estimation accuracy. Finally, the wrapper algorithm was applied using the initial band subset to search for the optimal band subset and develop the corresponding model. Three regression algorithms including gradient boosting regression trees (GBRT), support vector regression (SVR), and Gaussian process regression (GPR) were combined with this framework for LCC estimation. The results indicated that the GBRT-Optimal model developed using 28 bands achieved the best performance with an R2 of 0.848, an RMSE of 1.454 μg/cm2, and an MAE of 1.121 μg/cm2. Compared with the model that used all bands as inputs, the optimal model reduced the RMSE by 24.37%. In addition to estimating biophysical and biochemical parameters, this method is also applicable to other hyperspectral imaging tasks. Full article
(This article belongs to the Special Issue Panoptic Segmentation of Tree Scenes from Mobile LiDAR Data)
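The dynamic ranking strategy (drop the lowest-ranked bands each step, refit, and keep the best-scoring subset) can be sketched as follows; the per-band importances, the scoring function, and a step of 2 instead of the paper's 10 are toy stand-ins.

```python
def stepwise_elimination(features, importance, evaluate, step=2):
    """Drop the `step` least-important features per round, re-score
    the remaining subset, and return the best-scoring subset seen."""
    best, best_score = list(features), evaluate(features)
    current = list(features)
    while len(current) > step:
        current = sorted(current, key=importance, reverse=True)[:-step]
        score = evaluate(current)
        if score > best_score:
            best, best_score = list(current), score
    return best, best_score

# Toy setup: hypothetical per-band importances; the scorer rewards
# informative bands and penalizes subset size (a stand-in for
# cross-validated accuracy).
imp = {"b450": 0.9, "b550": 0.8, "b680": 0.7, "b720": 0.1,
       "b800": 0.05, "b900": 0.02}
score = lambda fs: sum(imp[f] for f in fs) - 0.1 * len(fs)
subset, s = stepwise_elimination(list(imp), imp.get, score)
```

The framework's final wrapper stage would then search within this retained subset for the optimal band combination.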
Figure 1
<p>The location of the study area.</p>
Full article ">Figure 2
<p>The DJI 350M UAV with 300TC hyperspectral camera.</p>
Full article ">Figure 3
<p><span class="html-italic">Populus deltoides</span> leaves collection, storage and LCC extraction.</p>
Full article ">Figure 4
<p>Framework of the ensembled feature selection method.</p>
Full article ">Figure 5
<p>(<b>a</b>) RMSE value of the GBRT model during the stepwise dimensionality reduction process; (<b>b</b>) RMSE value of the SVR model during the stepwise dimensionality reduction process; (<b>c</b>) RMSE value of the GPR model during the stepwise dimensionality reduction process.</p>
Figure 5 Cont.">
Full article ">Figure 6
<p>Performances of the GBRT-50V model, SVR-50V model, and GPR-50V model.</p>
Full article ">Figure 7
<p>Residual distribution in different LCC range of the GBRT-50V model, SVR-50V model and GPR-50V model.</p>
Full article ">Figure 8
<p>Performances of the GBRT-Optimal model, SVR-Optimal model, and GPR-Optimal model.</p>
Full article ">Figure 9
<p>Residual distribution in different LCC range of the GBRT-Optimal model, SVR-Optimal model, and GPR-Optimal model.</p>
Full article ">Figure 10
<p>(<b>a</b>) Hyperspectral bands used in the GBRT-50V and GBRT-Optimal models; (<b>b</b>) Hyperspectral bands used in the SVR-50V and SVR-Optimal models; (<b>c</b>) Hyperspectral bands used in the GPR-50V and GPR-Optimal models.</p>
Full article ">Figure 11
<p>Bands selected according to APCC value.</p>
Full article ">Figure 12
<p>Performances of the C-GBRT-49V model, C-SVR-49V model, and C-GPR-49V model.</p>
Full article ">Figure 13
<p>(<b>a</b>) RMSE value of GBRT model during the stepwise dimensionality reduction process based on GBRT ranking; (<b>b</b>) RMSE value of SVR model during the stepwise dimensionality reduction process based on GBRT ranking; (<b>c</b>) RMSE value of GPR model during the stepwise dimensionality reduction process based on GBRT ranking.</p>
Full article ">Figure 14
<p>Performances of the C-GBRT-Optimal model, C-SVR-Optimal model, and C-GPR-Optimal model.</p>
Full article ">
19 pages, 688 KiB  
Article
Advancing Pulmonary Nodule Detection with ARSGNet: EfficientNet and Transformer Synergy
by Maroua Oumlaz, Yassine Oumlaz, Aziz Oukaira, Amrou Zyad Benelhaouare and Ahmed Lakhssassi
Electronics 2024, 13(22), 4369; https://doi.org/10.3390/electronics13224369 - 7 Nov 2024
Viewed by 556
Abstract
Lung cancer, the leading cause of cancer-related deaths globally, presents significant challenges in early detection and diagnosis. The effective analysis of pulmonary medical imaging, particularly computed tomography (CT) scans, is critical in this endeavor. Traditional diagnostic methods, which are manual and time-intensive, underscore [...] Read more.
Lung cancer, the leading cause of cancer-related deaths globally, presents significant challenges in early detection and diagnosis. The effective analysis of pulmonary medical imaging, particularly computed tomography (CT) scans, is critical in this endeavor. Traditional diagnostic methods, which are manual and time-intensive, underscore the need for innovative, efficient, and accurate detection approaches. To address this need, we introduce the Adaptive Range Slice Grouping Network (ARSGNet), a novel deep learning framework that enhances early lung cancer diagnosis through advanced segmentation and classification techniques in CT imaging. ARSGNet synergistically integrates the strengths of EfficientNet and Transformer architectures, leveraging their superior feature extraction and contextual processing capabilities. This hybrid model proficiently handles the complexities of 3D CT images, ensuring precise and reliable lung nodule detection. The algorithm processes CT scans using short slice grouping (SSG) and long slice grouping (LSG) techniques to extract critical features from each slice, culminating in the generation of nodule probabilities and the identification of potential nodular regions. Incorporating Shapley Additive exPlanations (SHAP) analysis further enhances model interpretability by highlighting the contributory features. Our extensive experimentation demonstrated a significant improvement in diagnostic accuracy, with training accuracy increasing from 0.9126 to 0.9817. This advancement not only reflects the model’s efficient learning curve but also its high proficiency in accurately classifying a majority of training samples. Given its high accuracy, interpretability, and consistent reduction in training loss, ARSGNet holds substantial potential as a groundbreaking tool for early lung cancer detection and diagnosis. Full article
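The abstract does not specify how per-slice nodule probabilities are aggregated within short and long slice groups; a max-pooling sketch over contiguous groups is one plausible reading (the probabilities and group sizes below are invented).

```python
def group_probs(slice_probs, group_size):
    """Aggregate per-slice nodule probabilities over contiguous slice
    groups: each group's score is its maximum slice probability."""
    return [max(slice_probs[i:i + group_size])
            for i in range(0, len(slice_probs), group_size)]

# Hypothetical per-slice nodule probabilities from the classifier head
probs = [0.1, 0.2, 0.9, 0.3, 0.15, 0.05]
short = group_probs(probs, 2)   # SSG-like: fine-grained localization
long_ = group_probs(probs, 3)   # LSG-like: coarser volumetric context
```

The contrast is the point: short groups localize the suspicious slice range narrowly, while long groups give a coarser volumetric signal.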
Figure 1
<p>Pixel intensity distribution of the training dataset after preprocessing.</p>
Full article ">Figure 2
<p>Pixel intensity distribution of the testing dataset after preprocessing.</p>
Full article ">Figure 3
<p>Workflow of ARSGNet for Pulmonary nodule detection.</p>
Full article ">Figure 4
<p>Loss, Accuracy, Dice Score, and IOU score for ARGSnet in training vs. validation.</p>
Full article ">
22 pages, 11094 KiB  
Article
State of Health Estimation for Lithium-Ion Batteries Using an Explainable XGBoost Model with Parameter Optimization
by Zhenghao Xiao, Bo Jiang, Jiangong Zhu, Xuezhe Wei and Haifeng Dai
Batteries 2024, 10(11), 394; https://doi.org/10.3390/batteries10110394 - 7 Nov 2024
Viewed by 533
Abstract
Accurate and reliable estimation of the state of health (SOH) of lithium-ion batteries is crucial for ensuring safety and preventing potential failures of power sources in electric vehicles. However, current data-driven SOH estimation methods face challenges related to adaptiveness and interpretability. This paper [...] Read more.
Accurate and reliable estimation of the state of health (SOH) of lithium-ion batteries is crucial for ensuring safety and preventing potential failures of power sources in electric vehicles. However, current data-driven SOH estimation methods face challenges related to adaptiveness and interpretability. This paper investigates an adaptive and explainable battery SOH estimation approach using the eXtreme Gradient Boosting (XGBoost) model. First, several battery health features extracted from various charging and relaxation processes are identified, and their correlation with battery aging is analyzed. Then, a SOH estimation method based on the XGBoost algorithm is established, and the model’s hyper-parameters are tuned using the Bayesian optimization algorithm (BOA) to enhance the adaptiveness of the proposed estimation model. Additionally, the Tree SHapley Additive exPlanation (TreeSHAP) technique is employed to analyze the explainability of the estimation model and reveal the influence of different features on SOH evaluation. Experiments involving two types of batteries under various aging conditions are conducted to obtain battery cycling aging data for model training and validation. The quantitative results demonstrate that the proposed method achieves an estimation accuracy with a mean absolute error of less than 2.7% and a root mean squared error of less than 3.2%. Moreover, the proposed method shows superior estimation accuracy and performance compared to existing machine learning models. Full article
(This article belongs to the Special Issue State-of-Health Estimation of Batteries)
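The BOA tuning step searches the hyper-parameter space for the lowest validation error. As a deterministic stand-in for Bayesian optimization, the exhaustive grid search below makes the same select-the-best loop concrete; the search space and toy objective are invented, and a real BOA would propose candidates via a surrogate model instead of enumerating them.

```python
from itertools import product

def grid_tune(objective, space):
    """Exhaustive grid-search stand-in for the Bayesian optimization
    step: evaluate every configuration, keep the one with the lowest
    validation error."""
    keys = list(space)
    best_cfg, best_err = None, float("inf")
    for combo in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, combo))
        err = objective(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err

# Toy objective: pretend validation RMSE is minimized at
# learning_rate=0.1 and max_depth=4 (hypothetical optimum).
space = {"learning_rate": [0.01, 0.05, 0.1, 0.3],
         "max_depth": [2, 4, 6, 8]}
obj = lambda c: abs(c["learning_rate"] - 0.1) + 0.05 * abs(c["max_depth"] - 4)
cfg, err = grid_tune(obj, space)
```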
Figure 1
<p>The battery test schedule: (<b>a</b>) general schedule; (<b>b</b>) aging tests.</p>
Full article ">Figure 2
<p>Capacity fading: (<b>a</b>) Dataset I; (<b>b</b>) Dataset II.</p>
Full article ">Figure 3
<p>The variation of the charging voltage–time curves during battery aging.</p>
Full article ">Figure 4
<p>TEVI variation during battery aging: (<b>a</b>–<b>c</b>) cell 1-9; (<b>d</b>–<b>f</b>) cell 2-2.</p>
Full article ">Figure 5
<p>The variation of the charging electric quantity–voltage curves during battery aging.</p>
Full article ">Figure 6
<p>QEVI variation during battery aging: (<b>a</b>–<b>c</b>) cell 1-9; (<b>d</b>–<b>f</b>) cell 2-2.</p>
Full article ">Figure 7
<p>The variation in the rest voltage–time curve during battery aging.</p>
Full article ">Figure 8
<p>VDR variation during battery aging: (<b>a</b>) cell 1-9; (<b>b</b>) cell 2-2.</p>
Full article ">Figure 9
<p>The variation of the IC curves during the aging of the battery.</p>
Full article ">Figure 10
<p>PVIC and PPIC variation during battery aging: (<b>a</b>) PVIC of cell 1-9; (<b>b</b>) PVIC of cell 2-2; (<b>c</b>) PPIC of cell 1-9; (<b>d</b>) PPIC of cell 2-2.</p>
Full article ">Figure 11
<p>AQV variation during battery aging: (<b>a</b>–<b>c</b>) cell 1-9; (<b>d</b>–<b>f</b>) cell 2-2.</p>
Full article ">Figure 12
<p>HIs extracted from battery charging data.</p>
Full article ">Figure 13
<p>PCCs between HIs and battery aging.</p>
Full article ">Figure 14
<p>The diagram of the XGBoost.</p>
Full article ">Figure 15
<p>The general framework of the proposed SOH estimation method.</p>
Full article ">Figure 16
<p>Training–test split of the dataset.</p>
Full article ">Figure 17
<p>SOH estimation results using BOA-XGBoost: (<b>a</b>) cell 1-9; (<b>b</b>) cell 2-2.</p>
Full article ">Figure 18
<p>SOH estimation error: (<b>a</b>) RMSE; (<b>b</b>) MAE.</p>
Full article ">Figure 19
<p>SOH estimation error comparison of XGBoost, SVM, RVM, and GPR.</p>
Full article ">Figure 20
<p>SHAP summary plots: (<b>a</b>) Dataset I; (<b>b</b>) Dataset II.</p>
Full article ">Figure 21
<p>SHAP analysis for abnormal estimation value: (<b>a</b>) outlier in cell 1-1; (<b>b</b>–<b>d</b>): SHAP waterfall plot of cycles 156, 155, and 157.</p>
Full article ">Figure 22
<p>F8 and SOH estimation result of cell 1-1.</p>
Full article ">
25 pages, 1926 KiB  
Article
Enhancing Structured Query Language Injection Detection with Trustworthy Ensemble Learning and Boosting Models Using Local Explanation Techniques
by Thi-Thu-Huong Le, Yeonjeong Hwang, Changwoo Choi, Rini Wisnu Wardhani, Dedy Septono Catur Putranto and Howon Kim 
Electronics 2024, 13(22), 4350; https://doi.org/10.3390/electronics13224350 - 6 Nov 2024
Viewed by 432
Abstract
This paper presents a comparative analysis of several decision models for detecting Structured Query Language (SQL) injection attacks, which remain one of the most prevalent and serious security threats to web applications. SQL injection enables attackers to exploit databases, gain unauthorized access, and [...] Read more.
This paper presents a comparative analysis of several decision models for detecting Structured Query Language (SQL) injection attacks, which remain one of the most prevalent and serious security threats to web applications. SQL injection enables attackers to exploit databases, gain unauthorized access, and manipulate data. Traditional detection methods often struggle due to the constantly evolving nature of these attacks, the increasing complexity of modern web applications, and the lack of transparency in the decision-making processes of machine learning models. To address these challenges, we evaluated the performance of various models, including decision tree, random forest, XGBoost, AdaBoost, Gradient Boosting Decision Tree (GBDT), and Histogram Gradient Boosting Decision Tree (HGBDT), using a comprehensive SQL injection dataset. The primary motivation behind our approach is to leverage the strengths of ensemble learning and boosting techniques to enhance detection accuracy and robustness against SQL injection attacks. By systematically comparing these models, we aim to identify the most effective algorithms for SQL injection detection systems. Our experiments show that decision tree, random forest, and AdaBoost achieved the highest performance, with an accuracy of 99.50% and an F1 score of 99.33%. Additionally, we applied SHapley Additive exPlanations (SHAPs) and Local Interpretable Model-agnostic Explanations (LIMEs) for local explainability, illustrating how each model classifies normal and attack cases. This transparency enhances the trustworthiness of our approach to detecting SQL injection attacks. These findings highlight the potential of ensemble methods to provide reliable and efficient solutions for detecting SQL injection attacks, thereby improving the security of web applications. Full article
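The model comparison reduces to computing precision, recall, and F1 per classifier. The sketch below scores two toy rule-based detectors on invented queries; the detection rules merely stand in for the trained ensemble and boosting models and are not the paper's method.

```python
def f1(y_true, y_pred):
    """F1 score for the attack (positive) class."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Hypothetical labelled queries: 1 = injection, 0 = benign
queries = [
    ("SELECT name FROM users WHERE id = 7", 0),
    ("' OR '1'='1' --", 1),
    ("SELECT * FROM t; DROP TABLE t", 1),
    ("UPDATE users SET email = 'a@b.c'", 0),
    ("admin' UNION SELECT password FROM users --", 1),
]
# Two toy rule-based "models" standing in for the trained classifiers
keyword_model = lambda q: int(any(k in q.upper() for k in ("DROP", "UNION", "--")))
quote_model   = lambda q: int("'" in q)

y = [lab for _, lab in queries]
scores = {name: f1(y, [m(q) for q, _ in queries])
          for name, m in [("keyword", keyword_model), ("quote", quote_model)]}
```

On this toy set the keyword rule scores a perfect F1 while the naive quote rule does not, which is the shape of comparison the paper carries out across its six models.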
Figure 1
<p>The proposed method for detecting SQL injection from ensemble learning and boosting models.</p>
Full article ">Figure 2
<p>Distribution sample data.</p>
Full article ">Figure 3
<p>Confusion matrix results of ensemble learning and boosting models.</p>
Full article ">Figure 4
<p>ROC curve results of ensemble learning and boosting models.</p>
Full article ">Figure 5
<p>Interpreting SHAP values in ensemble models for normal and attack samples.</p>
Full article ">Figure 6
<p>Interpreting SHAP values in boosting models for normal and attack samples.</p>
Full article ">Figure 7
<p>Interpreting SHAP values in decision tree model for normal and attack samples.</p>
Full article ">Figure 8
<p>Interpreting SHAP values in random forest model for normal and attack samples.</p>
Full article ">Figure 9
<p>Interpreting SHAP values in XGBoost model for normal and attack samples.</p>
Full article ">Figure 10
<p>Interpreting SHAP values in AdaBoost model for normal and attack samples.</p>
Full article ">Figure 11
<p>Interpreting SHAP values in GBDT model for normal and attack samples.</p>
Full article ">Figure 12
<p>Interpreting SHAP values in HGBDT model for normal and attack samples.</p>
Full article ">
13 pages, 3717 KiB  
Article
Multi-Modal Vision Transformer with Explainable Shapley Additive Explanations Value Embedding for Cymbidium goeringii Quality Grading
by Zhen Wang, Xiangnan He, Yuting Wang and Xian Li
Appl. Sci. 2024, 14(22), 10157; https://doi.org/10.3390/app142210157 - 6 Nov 2024
Viewed by 327
Abstract
Cymbidium goeringii (Rchb. f.) is a traditional Chinese flower with highly valued biological, cultural, and artistic properties. However, the valuation of Rchb. f. mainly relies on subjective judgment, lacking a standardized digital evaluation and grading methods. Traditional grading methods solely rely [...] Read more.
Cymbidium goeringii (Rchb. f.) is a traditional Chinese flower with highly valued biological, cultural, and artistic properties. However, the valuation of Rchb. f. relies mainly on subjective judgment, and standardized digital evaluation and grading methods are lacking. Traditional grading methods rely solely on unimodal data and fuzzy grading standards, and the key features that determine value remain unexplained. Accurately evaluating Rchb. f. quality through multi-modal algorithms and clarifying how key features affect Rchb. f. value are essential for providing scientific references for online orchid trading. A multi-modal Transformer for Rchb. f. quality grading combined with the Shapley Additive Explanations (SHAP) algorithm was proposed, consisting of an embedding layer, a UNet, a Vision Transformer (ViT), and an Encoder layer. A multi-modal orchid dataset including images and text was obtained from Orchid Trading Website, and seven key features were extracted. Based on petal RGB values segmented by the UNet and global fine-grained features extracted by the ViT, text features and image features were fused in the Transformer Encoders through a concatenation operation, achieving an accuracy of 93.13%. Furthermore, the SHAP algorithm was used to quantify and rank the importance of the seven features, clarifying how key features affect Rchb. f. quality and value. This multi-modal Transformer with the SHAP algorithm provides a novel way to represent explainable features accurately and shows good potential for establishing a reliable digital evaluation method for high-value agricultural products. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
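Early fusion by concatenation, as used here to join image and text features before the Transformer Encoders, can be reduced to a minimal sketch; the feature values and head weights below are invented, and a single linear head stands in for the Encoder stack.

```python
def fuse_and_score(img_feats, txt_feats, weights, bias=0.0):
    """Early fusion by concatenation: join the image and text feature
    vectors, then apply one linear scoring head to the fused vector."""
    fused = list(img_feats) + list(txt_feats)
    assert len(fused) == len(weights)
    return sum(w * f for w, f in zip(weights, fused)) + bias

# Hypothetical features: petal RGB means (image) and graded traits (text)
img = [0.42, 0.55, 0.31]        # mean R, G, B of segmented petals
txt = [3.0, 1.0]                # e.g., petal-count grade, symmetry flag
w = [1.0, -0.5, 0.2, 0.1, 2.0]  # toy head weights
score = fuse_and_score(img, txt, w)
```

In the actual model the fused vector passes through self-attention layers rather than a single dot product, but the concatenation step that mixes the modalities is the same.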
Figure 1
<p>(<b>a</b>) Construction of Transformer Encoder and its self-attention. (<b>b</b>) UNet construction for petal segmentation.</p>
Full article ">Figure 2
<p>Architecture of multi-modal Transformers for <span class="html-italic">Rchb. f.</span> quality grading. Red numbers 9 and 0 are padding values of orchid feature indices and orchid feature values, respectively.</p>
Full article ">Figure 3
<p>Computing process of SHAP algorithm for explainable features representation.</p>
Full article ">Figure 4
<p>(<b>a</b>) Illustration of the model architecture with numerical features extracted from textual data and accuracy of the model in the test set under different numbers of Transformer Encoder layers (insert). (<b>b</b>) The value curves of Cross Entropy Cost and accuracy of the model. (<b>c</b>) Radar chart of Feature importance based on SHAP values.</p>
Full article ">Figure 5
<p>(<b>a</b>) Illustration of model architecture with UNet and numerical features extracted from textual data. (<b>b</b>) The value curves of Cross Entropy Cost and accuracy of the model. (<b>c</b>) Radar chart of feature importance based on SHAP values.</p>
Full article ">Figure 6
<p>(<b>a</b>) Illustration of model architecture with UNet, ViT, and numerical features extracted from textual data (Exp 3) and concatenation operation for connecting ViT’s output to Transformer Encoder (Exp 4). The value curves of Cross Entropy Cost and accuracy of (<b>b</b>) Exp 3 and (<b>c</b>) Exp 4. (<b>d</b>) Radar chart of feature importance based on SHAP values.</p>
Full article ">