
Article

Predictive Analytics for Thyroid Cancer Recurrence: A Machine Learning Approach

Computing and Security, Slippery Rock University, Slippery Rock, PA 16066, USA
*
Author to whom correspondence should be addressed.
Knowledge 2024, 4(4), 557-570; https://doi.org/10.3390/knowledge4040029
Submission received: 13 August 2024 / Revised: 31 October 2024 / Accepted: 7 November 2024 / Published: 18 November 2024

Abstract
Differentiated thyroid cancer (DTC), comprising papillary and follicular thyroid cancers, is the most prevalent type of thyroid malignancy. Accurate prediction of DTC is crucial for improving patient outcomes. Machine learning (ML) offers a promising approach to analyze risk factors and predict cancer recurrence. In this study, we aimed to develop predictive models to identify patients at an elevated risk of DTC recurrence based on 16 risk factors. We developed six ML models and applied them to a DTC dataset. We evaluated the ML models using Synthetic Minority Over-Sampling Technique (SMOTE) and with hyperparameter tuning. We measured the models’ performance using precision, recall, F1 score, and accuracy. Results showed that Random Forest consistently outperformed the other investigated models (KNN, SVM, Decision Tree, AdaBoost, and XGBoost) across all scenarios, demonstrating high accuracy and balanced precision and recall. The application of SMOTE improved model performance, and hyperparameter tuning enhanced overall model effectiveness.

1. Introduction

Differentiated thyroid cancer (DTC), which encompasses both papillary and follicular thyroid cancers, represents the most prevalent type of thyroid malignancy [1]. DTC recurrence risk is influenced by many factors, necessitating accurate predictive models to enhance patient outcomes. Over the past two decades, the incidence of thyroid cancer has increased internationally; however, the rate of mortality has stayed consistent [2]. This stability in relative mortality rates has been made possible by advancements in diagnostic techniques and treatment approaches. These advancements have improved the ability to accurately assess risk, predict recurrence, and implement early, tailored interventions, especially when all of these are based on accepted risk factors [3].
DTC accounts for over 90% of thyroid cancer cases. High levels of thyroid-stimulating hormone (TSH) are traditionally associated with thyroid malignancies, but recent studies indicate that elevated thyroxine (T4) and triiodothyronine (T3) levels may also contribute to thyroid cancer development [4]. Moreover, DTC recurrence affects about 30% of patients within 10 years of diagnosis, with papillary thyroid cancer (PTC) being the most common subtype. Follicular thyroid cancer (FTC) represents about 10–15% of cases and is distinguished by a high propensity for metastasis through blood vessel invasion [5,6].
DTC frequently recurs in the cervical lymph nodes, contributing to increased metastasis risk. Monitoring methods for DTC include classification systems such as the American Thyroid Association (ATA) risk classification and the tumor node metastasis (TNM) staging. The ATA system classifies patients into high, intermediate, or low risk of recurrence based on several factors [7]. TNM staging considers the extent of the tumor, lymph node involvement, and metastasis. However, TNM has been criticized for its lack of integration of biological characteristics that might influence malignancy [8].
In recent years, the use of machine learning techniques has become a promising approach for assessing complex datasets consisting of patient risk factors and predicting the risk for cancer occurrence and recurrence.
Although these methods are promising, they are not yet widely used in the field of thyroid cancer recurrence. This study aims to advance existing research by developing predictive models that identify patients at increased risk of DTC recurrence based on 16 established risk factors [3], including patient data, treatment types, and personal histories. Specifically, we aim to evaluate six machine learning algorithms to determine which one provides the most accurate predictions of thyroid cancer recurrence.

2. Background and Related Work

2.1. Overview of DTC and Thyroid Cancer

The thyroid is a gland located in the neck that produces hormones essential for bodily functions, chief among them thyroid hormone. Thyroid hormone, consisting of thyroxine (T4) and triiodothyronine (T3), plays a crucial role in cell proliferation, differentiation, and cellular metabolic processes [5]. High levels of TSH are associated with increased risk of thyroid malignancy. However, recent studies have shown that T3 and T4 may also be involved in thyroid cancer development. For example, Sasson et al. [4] found that high levels of FT4 were directly associated with DTC malignancy and that a high FT4/FT3 ratio also significantly increased the risk of malignancy. Additionally, increased levels of TSH are known to be associated with an increased likelihood of nodule malignancy [4].
DTC encompasses both papillary and follicular cancers and accounts for approximately 90% of all thyroid cancer diagnoses [5]. Disease recurrence associated with DTC affects roughly 30% of patients within 10 years of initial diagnosis, making DTC dangerous in the long term as well as the short [1]. PTC is the most prevalent form of thyroid cancer and is characterized by slow growth, but it often spreads to surrounding lymph nodes in the neck [6]. FTC, which represents approximately 10–15% of all thyroid cancers, is also characterized by slow growth; however, it is associated with a higher rate of metastasis to distant parts of the body due to invasion of the blood vessels [5]. A closer look at these two subcategories makes it apparent that although DTC is very treatable, it poses significant concerns for recurrence because of its subtypes’ proclivity for metastasis.

2.2. Current Methods for Monitoring Recurrence

DTC commonly recurs in regional or cervical lymph nodes, leaving patients prone to increased metastasis and mortality [1]. Accurate classification is essential for both treating DTC and predicting its likelihood of recurrence. The most widely used classification systems in clinics today are the ATA risk classification system and TNM staging. The ATA risk classification categorizes DTC into high, intermediate, and low risk of recurrence based on biological factors such as histological subtype, size and extent of the primary tumor, and presence of distant metastasis [7]. Staging, on the other hand, is the process of predicting the prognosis of cancer patients. This has traditionally been performed through TNM staging, which takes into consideration tumor characteristics, lymph node involvement, and metastasis. Staging is therefore useful for planning current treatments and tracking progression over time. However, TNM staging has been criticized for omitting important biological characteristics of malignant tumors [8].

2.3. Current Research on Machine Learning in Thyroid Cancer

In recent years, artificial intelligence (AI) has gained popularity for its potential uses in medicine, with cancer being a particularly promising area of application. Cancers in general are highly multifactorial, in that various genetic, epigenetic, proteomic, and transcriptomic changes can affect an individual’s likelihood of developing any number of different cancers. Therefore, multiple complex factors must be considered simultaneously to predict outcomes and devise the best treatments [5]. This kind of complexity holds for thyroid cancer, which is likely associated with both genetic and environmental factors. Considering multiple risk factors for each patient individually can be challenging without technological assistance. However, AI and machine learning (ML) can serve as valuable tools to streamline this process and provide accurate prediction and personalized assessments. By analyzing complex multi-omics data efficiently, such tools have the potential to help diagnose current cancers, predict prognosis, discover new cancer biomarkers, identify underlying mechanisms, and develop personalized treatments to more effectively eliminate thyroid cancer [5,9].
One of the most promising and currently implemented uses for ML in thyroid cancer healthcare is diagnostics and screening. Classic ML algorithms are currently used in computer-aided diagnosis (CAD) systems to improve diagnostic accuracy and to reduce the time required for image interpretation, though ultimately the diagnosis is still determined solely by the medical professional [10].
The Thyroid Imaging Reporting and Data System (TI-RADS) is a commonly used method for categorizing biopsied thyroid nodules today [11]. Its categories are labeled TR1–TR5, with TR1 being benign and TR5 highly suspicious, and are based on a point scale determined by nodule factors such as composition, echogenicity, shape, margin, and echogenic foci. Gu et al. [9] constructed a classification model for thyroid cancer based on risk factors, as well as a prediction model for metastasis. The authors argue that TI-RADS, though a useful screening tool, is not as accurate at determining malignancy and metastasis as it could be. They utilized ML to predict both malignancy and metastasis in 1735 patients, aiming to improve early diagnosis and treatment by analyzing risk factors. Results showed that XGBoost achieved the highest performance, suggesting that ML can significantly aid early diagnosis and treatment decisions compared to TI-RADS.
Fine-needle aspiration biopsy (FNAB) is often used for suspicious thyroid nodules found via ultrasound. However, up to 30% of these may be classified as indeterminate thyroid nodules (ITN), often necessitating further surgery to determine malignancy [12]. To develop a cost-effective, non-invasive ML model, Luong et al. [12] analyzed electronic health record data from 355 nodules classified as indeterminate by FNAB. They found that a random forest classifier had the best performance. Findings demonstrated the potential of ML models to aid in early clinical decision-making and reduce unnecessary procedures.
Ballester et al. [13] conducted a retrospective analysis of 5351 thyroid tumors, investigating pathologic upstaging, where the final pathologic stage exceeds the initial clinical stage. They reported upstaging rates of 17.5% for tumor stage, 18% for nodal stage, and 10.9% for summary stage, identifying factors like Asian race, older age, and lymph vascular invasion as contributors to upstaging. This study underscores the importance of recognizing factors that contribute to upstaging for improved management and counseling of thyroid cancer patients.
Distant metastasis often indicates poor prognosis, as metastasis to other body parts can be more challenging to find and treat. Mao et al. [14] studied 5809 patients to evaluate ML models for predicting distant metastasis in follicular thyroid carcinoma (FTC). They found that the XGBoost model had the best performance, with diagnosis age, race, extrathyroidal extension, and lymph node invasion being significant risk factors.
In a retrospective analysis, Medas et al. [15] investigated factors influencing recurrence in 579 DTC patients. They found a recurrence rate of 6.2% and a five-year disease-free survival rate of 94.1%. Multivariate analysis identified lymph node metastasis as a strong predictor of recurrence, with multifocality and extrathyroidal extension also associated with increased risk. Conversely, microcarcinoma (tumor size ≤ 1 cm) was an independent protective factor, emphasizing the need for risk stratification in personalizing treatment plans. Findings suggest that high-risk patients may benefit from more aggressive follow-up and treatment to better prevent recurrence.
Jin et al. [16] developed an overall survival (OS) prognostic model for participants with differentiated thyroid cancer with distant metastasis. Nine variables were introduced to build a machine learning model, receiver operating characteristic (ROC) was used to evaluate the recognition ability of the model, calibration plots were used to obtain prediction accuracy, and decision curve analysis (DCA) was used to estimate clinical benefit. The proposed model was found to have good discriminative ability and high clinical value in its 10-year survival predictions.
Tang et al. [17] developed a nomogram to predict cancer-specific survival (CSS) in patients with PTC. They utilized the Surveillance, Epidemiology, and End Results (SEER) database to procure participants for the study. COX regression analysis demonstrated that age, gender, marriage, tumor grade, TNM stage, surgery, radiotherapy, chemotherapy, and tumor size were significantly associated with CSS in middle-aged patients with PTC. These nine variables were then used to develop a prediction model that could predict and affect the CSS of middle-aged PTC patients. This tool was found to have good accuracy and discrimination and better overall clinical value than traditional TNM staging for this population. Park and Lee [18] utilized five ML models to determine which best predicted recurrence of PTC in a cohort of 1040 patients. Results showed that the Decision Tree (DT) model achieved the best accuracy, at 95%, and the lightGBM and stacking models together achieved 93% accuracy.
Wang et al. [19] used five ML models to predict structural recurrence in papillary thyroid cancer (PTC) patients, analyzing electronic medical records from 2244 patients. The authors utilized the least absolute shrinkage and selection operator (LASSO) method to select nine variables for developing the prediction models, which included thyroglobulin (TG), lymph node (LN) variables (LN dissection, number of LNs dissected, lymph node metastasis ratio (LNR), and N stage), comorbidities and metabolic-related variables (comorbidity of hypertension, comorbidity of diabetes, BMI, and low-density lipoprotein (LDL)). Variable importance analysis showed that the most important variables across all models were TG, LNR, and N stage. The top performing models were SVM, XGBoost, and Random Forest (RF) models, all of which showed better discrimination than the ATA risk stratification according to the AUC values and corresponding indices. Furthermore, their RF model was found to have the most consistent calibration, as well as good discrimination and interpretability. Findings suggest that patients with recurrent disease are more likely to be older, male, cigarette smokers, alcohol drinkers, and have various comorbidities, highlighting the potential of ML in enhancing current risk stratification methods and assisting in personalized patient management.
Finally, Borzooei et al. [20] conducted a prospective study using the Differentiated Thyroid Cancer Recurrence dataset that is also used in our study. They trained ML models on three distinct combinations of features: a dataset with all features excluding the ATA risk score (12 features), another with ATA risk alone, and a third with all features combined (13 features). The authors found that the model combining clinicopathologic features with the ATA risk score outperformed the other two. SVM was found to be the best-performing ML model.
The current study attempts to fill the gaps in the literature by evaluating six machine learning models to determine the most effective for predicting cancer recurrence. Comparing these models, we aim to identify the optimal predictive tool. This research will contribute significantly to personalized cancer care by enhancing risk stratification and improving long-term patient outcomes.

3. Design and Methodology

The Differentiated Thyroid Cancer Recurrence dataset from the University of California, Irvine Machine Learning Repository was used in this study [3]. This dataset consists of retrospective clinical data for 383 patients diagnosed with DTC, each followed for a minimum of 10 years. The collected clinical data included 16 features, among them age at diagnosis, gender, current smoking status, prior smoking history, history of head and neck radiation therapy, thyroid function, presence of goiter, presence of adenopathy on physical examination, pathological subtype of cancer, focality, ATA risk assessment, TNM staging, initial treatment response, and recurrence status. The dataset contains 312 females (81%) and 71 males (19%). The average age at diagnosis was 41. The pathological subtype breakdown was 287 Papillary (75%), 48 Micropapillary (13%), 28 Follicular (7%), and 20 Hürthle Cell (5%). According to the ATA risk classification, patients were classified as follows: 249 low risk (65%), 102 intermediate risk (27%), and 32 high risk (8%). Most cases (333) were classified as Stage 1 (87%). A total of 208 patients had an excellent initial treatment response (54%). The remaining cases had the following initial treatment responses: 91 structural incomplete (24%), 61 indeterminate (16%), and 23 biochemical incomplete (6%). In terms of recurrence, 108 patients (28%) experienced recurrence. Notably, this dataset contained no missing values.
ML models were applied in this study to analyze and predict DTC recurrence. The ML models used were KNN, SVM, Decision Tree, Random Forest, AdaBoost, and XGBoost. KNN is a simple, instance-based learning algorithm that classifies data points based on the majority class among their k-nearest neighbors in the feature space, using a chosen distance metric. KNN is particularly useful in scenarios with well-defined clusters in the feature space, enabling it to provide quick predictions for DTC recurrence based on similar historical cases [21]. Leveraging the kernel trick, SVM maps inputs into high-dimensional feature spaces, allowing clear margins to be drawn between classes. Its robustness against outliers was beneficial in our study, particularly with the high-dimensional, small dataset we analyzed. This model helped identify subtle patterns that differentiate between recurring and non-recurring cases [22]. Decision Tree represents choices and their results in a tree-shaped graph, allowing straightforward interpretation of the decision rules concerning DTC recurrence [21,22]. The Random Forest method mitigated the limitations of individual decision trees by aggregating multiple trees to improve accuracy and robustness. It proved effective in handling high-dimensional data and addressing class imbalances, which are common in DTC datasets, ultimately leading to better predictive performance [23]. AdaBoost enhanced weaker classifiers by focusing on errors from prior iterations. Its ability to improve accuracy, particularly in complex cases, allowed the refinement of predictions for DTC recurrences [21,23]. XGBoost handles missing values well and, with careful hyperparameter tuning, can effectively model complex interactions among features related to DTC recurrence [23].
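As a minimal sketch, the six classifiers above map directly onto scikit-learn estimators (XGBoost ships as a separate package); the instances below use library defaults rather than the study's tuned settings, and the `random_state` values are illustrative:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Default-parameter instances; the tuned hyperparameter grids are described later.
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),  # probability=True enables AUC scoring
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# XGBoost lives outside scikit-learn; add it only when the package is installed.
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=42)
except ImportError:
    pass
```

Each estimator exposes the same `fit`/`predict` interface, which is what makes a uniform six-model comparison straightforward.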
Different models were implemented to account for differences in handling class imbalances and predictive accuracy.
Python 3.13 was used to run the different models. Python has a rich variety of libraries and tools that make it an excellent choice for implementing machine learning classification techniques. From data preparation and model training to evaluation and deployment, Python provides comprehensive support for every step of the machine learning workflow. The dataset was partitioned into a training set (80%) and a test set (20%). The data were first run once with no modifications. Given the class imbalance in this dataset (only 28% of patients experienced recurrence), a second experiment was performed using SMOTE. SMOTE generates synthetic samples for the minority class based on feature similarity with existing minority instances. This technique alleviates the bias towards the majority class by increasing the representation of the minority class, thereby improving the performance of classifiers trained on imbalanced datasets. In hopes of further improving ML model performance, a final run was conducted using hyperparameter tuning, which involves selecting the best values for each model's hyperparameters to achieve optimal performance.
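SMOTE's core interpolation step can be sketched in plain NumPy: pick a minority instance, pick one of its k nearest minority neighbors, and place a synthetic point a random fraction of the way between them. This is an illustrative toy, not the imbalanced-learn implementation the study presumably used:

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority-class samples by interpolating between
    minority points and their k nearest minority neighbors (SMOTE idea)."""
    rng = np.random.default_rng(rng)
    # Pairwise Euclidean distances among minority samples only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest minority neighbors

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_min))            # random minority sample
        nb = X_min[rng.choice(neighbors[j])]    # one of its neighbors
        gap = rng.random()                      # interpolation fraction in [0, 1)
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority data already occupy, which is what distinguishes SMOTE from naive duplication.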
For all tests, data preprocessing was performed. Categorical variables were transformed into numerical values using LabelEncoder. StandardScaler was applied to standardize columns by removing the mean and scaling to unit variance. SMOTE was used to handle class imbalance by oversampling the minority class; for the SMOTE test, MinMaxScaler was used to scale specific columns to a range between 0 and 2. Grid search was the method used for hyperparameter tuning, as it systematically evaluates every hyperparameter combination to optimize model performance. Hyperparameter optimization techniques in machine learning range from manual trial-and-error tuning to exhaustive grid search; random search explores broader spaces efficiently by random sampling [24]. The dataset was split into training and testing sets using train_test_split with a test size of 20%. The final models were evaluated using accuracy, precision, recall, F1 score, and AUC score.
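The preprocessing-and-tuning pipeline above can be sketched end to end with scikit-learn. The data here are a hypothetical stand-in (one categorical and one numeric feature with an invented outcome), not the clinical dataset; the KNN grid mirrors the one described later in the text:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the clinical data: one categorical and one numeric feature.
gender = np.array(["F", "F", "M", "F", "M", "F", "F", "M"] * 10)
age = np.linspace(20, 70, 80)
y = (age > 45).astype(int)                         # hypothetical binary outcome

gender_enc = LabelEncoder().fit_transform(gender)  # categories -> integer codes
X = np.column_stack([gender_enc, age])
X = StandardScaler().fit_transform(X)              # zero mean, unit variance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)          # 80/20 split, as in the study

# Exhaustive grid search over a small KNN grid with cross-validation.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 7],
                     "weights": ["uniform", "distance"]},
                    cv=5)
grid.fit(X_train, y_train)
```

Note one caveat the prose glosses over: to avoid leakage, scalers and SMOTE should be fit on the training split only and then applied to the test split; the sketch above fits the scaler before splitting purely for brevity.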
In the fine-tuning scenario, each model was fine-tuned in different ways depending on the model parameters. For the KNN model, the hyperparameters tuned were n_neighbors and weights. The n_neighbors parameter controls how many neighbors influence the classification decision, and was tested with values of 3, 5, and 7. The weights parameter defines how neighbors contribute to the final classification and was tested with two options: uniform and distance. Such combination of tuning parameters ensures that the KNN model is capable of balancing local sensitivity and generalization, providing robust performance on the dataset [25]. For the SVM model, the key hyperparameters tuned were C and kernel. The C parameter controls the trade-off between achieving a large margin and minimizing classification errors. The C parameter was tested using small and large values. The kernel parameter influences how the algorithm transforms the input data into a higher-dimensional space. With these hyperparameters, the SVM model can achieve an optimal balance between margin maximization and classification accuracy for the dataset [26].
For the Decision Tree model, the key hyperparameters tuned were criterion and max_depth. The criterion parameter was set to either gini or entropy. The max_depth parameter was evaluated with values of 10, 20, and 30, as well as the default setting of None, which allows the tree to grow without limit. This tuning strategy for the Decision Tree model ensures a balance between model complexity and generalizability [27]. For the Random Forest model, the tuning strategy involved adjusting n_estimators, criterion, and max_depth. The n_estimators parameter, which controls the number of trees in the forest, was tested with values of 10, 50, and 100. The criterion parameter was set to either gini or entropy. The max_depth parameter was evaluated with values of 10, 20, and 30, as well as None to allow unlimited tree growth. This combination of hyperparameter tuning ensures that the Random Forest model finds the optimal balance between predictive accuracy and computational efficiency [28,29].
For the AdaBoost model, the primary hyperparameters tuned were n_estimators and learning_rate. The n_estimators parameter defines the number of weak learners, and was evaluated using 50, 100, and 200. The learning_rate parameter controls the contribution of each weak learner, and was evaluated using 0.01, 0.1, and 1. This tuning strategy ensures that the AdaBoost model achieves a balance between learning speed and accuracy [30]. Finally, for the XGBoost model, key hyperparameters tuned were n_estimators, learning_rate, and max_depth. The n_estimators parameter represents the number of boosting rounds, and was evaluated with values of 50, 100, and 200. The learning_rate parameter controls how much each boosting round contributes to the final model, and was evaluated using 0.01, 0.1, and 0.2. The max_depth parameter controls the depth of each tree, and was evaluated using 3, 6, and 9. This comprehensive tuning strategy allows XGBoost to achieve both high accuracy and generalizability across the dataset [31].
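The grids described in the preceding paragraphs can be collected into plain dictionaries ready to pass to a grid search. Values are taken from the text; `None` means unlimited tree depth. The exact `C` values and kernel options for SVM are not given in the text, so the ones below are assumptions for illustration:

```python
param_grids = {
    "KNN": {"n_neighbors": [3, 5, 7],
            "weights": ["uniform", "distance"]},
    "SVM": {"C": [0.1, 1, 10],             # "small and large values" (exact range assumed)
            "kernel": ["linear", "rbf"]},  # kernel options assumed
    "Decision Tree": {"criterion": ["gini", "entropy"],
                      "max_depth": [10, 20, 30, None]},
    "Random Forest": {"n_estimators": [10, 50, 100],
                      "criterion": ["gini", "entropy"],
                      "max_depth": [10, 20, 30, None]},
    "AdaBoost": {"n_estimators": [50, 100, 200],
                 "learning_rate": [0.01, 0.1, 1]},
    "XGBoost": {"n_estimators": [50, 100, 200],
                "learning_rate": [0.01, 0.1, 0.2],
                "max_depth": [3, 6, 9]},
}
```

Each dictionary could then drive a `GridSearchCV(model, param_grids[name])` call per model, keeping the tuning setup declarative and easy to audit.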
The performance of the models was assessed using accuracy, precision, recall, and F1 score. These metrics provide a comprehensive view of each model’s effectiveness. Accuracy shows how often the model correctly predicted whether DTC recurred. Precision indicates how many of the patients predicted to have DTC recurrence actually experienced recurrence; high precision means that when the model predicts recurrence, it is likely to be correct, which is critical for minimizing false alarms that may lead to unnecessary treatment. Recall shows how effectively the model identifies patients who will experience recurrence; high recall means the model catches most recurrences, minimizing false negatives and helping ensure that at-risk patients receive the treatment needed to reduce their chances of recurrence. Finally, a high F1 score shows that the model achieves a good balance of precision and recall, which is especially informative in cases of class imbalance [21].
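The four metrics above reduce to simple ratios over the confusion-matrix counts; a self-contained sketch (equivalent to what `sklearn.metrics` computes for binary labels) makes the definitions concrete:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from raw confusion-matrix counts
    for binary labels (1 = recurrence, 0 = no recurrence)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Of predicted recurrences, how many were real (false-alarm control).
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Of real recurrences, how many were caught (false-negative control).
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because F1 is a harmonic mean, it can never exceed either precision or recall, which is a useful sanity check when reading reported results.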

4. Results

For the KNN classifier with and without SMOTE, a value of 3 for k was chosen because experimental evaluation showed that it keeps the model from being overly sensitive to noise while retaining strong performance. Such a value helps avoid overfitting while still capturing local patterns in the data. With SMOTE, a value of 3 for k remains valid because the data become more balanced, and a value of 3 ensures that minority classes are not dominated by the majority class. When it comes to fine-tuning, testing k with values of 3, 5, and 7 helps explore how increasing the neighborhood size improves model generalization without compromising accuracy, and helps find the best balance between capturing local patterns and making accurate predictions.
In the initial run with no modifications (Table 1), KNN demonstrated strong overall performance with high accuracy (0.90) and precision (0.91). It did have a lower recall (0.81), which could mean that the model would miss some positive cases of recurrence.
The SVM model had a lower accuracy (0.83), but still had good precision (0.91). The recall was low (0.66), which meant the model struggled to identify recurrence. The Decision Tree model had strong performance with high accuracy (0.92), precision (0.89), recall (0.91), F1 score (0.90), and AUC score (0.912). A training score of 1.0 indicates there may have been overfitting to the training data; the test score of 0.92 indicates that the model still performed well on new data.
The Random Forest model had the best overall performance, with near-perfect accuracy (0.99), precision (0.99), F1 score (0.98), and AUC score (0.996). The recall (0.97) was only slightly lower, which indicates that this model was reliable for predicting DTC recurrence. The high training and test scores also indicated good model generalization and minimal overfitting.
AdaBoost also had high performance, with high accuracy (0.97), precision (0.98), recall (0.95), F1 score (0.96), and AUC score (0.986). The close training and test scores indicate that it is a robust choice. Finally, XGBoost also had consistently high metrics for accuracy (0.97), precision (0.97), recall (0.97), F1 score (0.97), and AUC score (0.995). A training score of 1.0 does show the possibility of overfitting with this model.
The AUC scores showed that Random Forest performed the best with an AUC score of 0.996, indicating near-perfect performance in distinguishing between classes. XGBoost followed closely with an AUC of 0.995, which demonstrates strong predictive power. AdaBoost had a slightly lower but still excellent AUC of 0.986, showing strong performance as well. Although SVM had the highest AUC among the non-ensemble models, at 0.937, its overall accuracy and recall were lower compared to other models. KNN and Decision Tree performed reasonably well, with AUC scores of 0.863 and 0.912, respectively.
Table 2 shows a summary of which features were most significant for predicting DTC recurrence in the initial run (without SMOTE). As shown in the table, response and risk were important features in this scenario, especially for the Random Forest and XGBoost models. Age, T, and N showed varying importance across different models. Finally, several features such as adenopathy, focality, and Hx smoking had low or negative importance in the SVC and KNN models.
The data were run again with the application of SMOTE (Table 3) to address class imbalances. Overall, the models showed good improvement with the application of SMOTE. KNN showed significant improvement, achieving high accuracy (0.97), precision (0.97), recall (0.97), F1 score (0.97), and AUC score (0.990).
SVM greatly benefited from SMOTE, indicating that the model was likely influenced by the class imbalance. SVM had balanced performance with good accuracy (0.94), precision (0.94), recall (0.94), F1 score (0.94), and AUC score (0.994). The Decision Tree model also showed improvements with consistent performance for accuracy (0.94), precision (0.94), recall (0.94), F1 score (0.94), and AUC score (0.929). Once again, the perfect training score indicated that there could be potential of overfitting with this model.
Random Forest did see a slight decrease in performance when SMOTE was applied. This could indicate that there had been some overfitting on the initial model. Since Random Forest is an ensemble method, it can typically handle some degree of class imbalance. Introducing the synthetic samples to the data can add noise to the data, which could also explain the decreased performance.
AdaBoost saw a slight decrease in both accuracy (0.95) and precision (0.95). Recall (0.96) improved slightly, and the model became less apt to favor one class disproportionately. XGBoost produced results similar to those of the initial run and once again showed the potential for some overfitting.
The AUC scores showed that Random Forest stood out as the best performer with an AUC score of 0.999, indicating nearly perfect discriminatory power. XGBoost followed closely with an AUC of 0.996, showing a strong predictive capability. SVM and AdaBoost both showed excellent performance with AUC scores of 0.994, making them competitive but slightly behind Random Forest. KNN also performed very well with an AUC of 0.990, but it was slightly outperformed by the ensemble methods. The AUC score of Decision Tree indicated it may be less reliable in distinguishing between classes compared to the other models.
Table 4 shows a summary of which features are most significant for predicting DTC recurrence in the models using SMOTE. As shown in the table, the Random Forest and XGBoost highlighted risk and response as the most important factors, while AdaBoost and Decision Tree also emphasized age and physical examination. SVC and KNN showed that age and response were consistently important, with additional contributions from adenopathy and T in KNN. Finally, low-impact features such as Hx smoking, Hx radiotherapy, and focality generally had very low or negative importance across models.
The final test was conducted using hyperparameter tuning (Table 5). The purpose of hyperparameter tuning was to find, for each algorithm, the parameter values that maximize its performance. KNN's performance was consistent, with no significant changes.
Despite the hyperparameter tuning, recall remained lower (0.81), suggesting that the model could miss some cases of potential recurrence. SVM showed good improvement over the initial results, indicating that hyperparameter tuning enhanced the model's ability to predict DTC recurrence. Decision Tree showed marked improvement, with accuracy (0.97), precision (0.98), recall (0.95), F1 score (0.96), and AUC score (0.912), showing that hyperparameter tuning helped the model reduce overfitting and improve generalization.
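Hyperparameter tuning of this kind amounts to searching a grid of candidate parameter values and keeping the combination with the best (typically cross-validated) score; in practice a utility such as scikit-learn's `GridSearchCV` automates this. A bare-bones sketch, with a stand-in scoring function in place of real cross-validation:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination in the grid; return the best."""
    keys = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# hypothetical Random Forest grid; fake_cv_score stands in for a
# cross-validated accuracy that would be computed on real data
grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
def fake_cv_score(p):
    return p["n_estimators"] / 200 + (0.1 if p["max_depth"] == 5 else 0.0)

best, score = grid_search(grid, fake_cv_score)
print(best)  # {'n_estimators': 200, 'max_depth': 5}
```

The grid and scoring function here are illustrative only; the actual parameter ranges tuned for each model in the study are those reflected in Table 5.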
Random Forest once again performed very well across the board, with accuracy (0.99), precision (0.99), recall (0.97), F1 score (0.98), and AUC score (0.992). Hyperparameter tuning helped solidify its performance, making this model very reliable for predicting DTC recurrence. AdaBoost saw an improvement in precision (0.98) but a slight drop in recall (0.92).
Finally, XGBoost showed consistent performance with accuracy (0.97), precision (0.97), recall (0.97), F1 score (0.97), and AUC score (0.996). This demonstrates that hyperparameter tuning helps enhance this model, allowing it to maintain high performance across all metrics.
The AUC scores showed that XGBoost demonstrated the best performance, with an AUC of 0.996 and an exceptional ability to distinguish between classes. AdaBoost followed with an AUC of 0.994, and Random Forest came in close behind at 0.992. Decision Tree had a good AUC of 0.912, showing reliable performance but not as strong as the ensemble methods. SVM achieved an AUC of 0.904, solid but somewhat lower than the other models. KNN, with an AUC of 0.867, had the lowest score, indicating the weakest ability to distinguish between classes among the models tested.
Table 6 summarizes which features were most significant for predicting DTC recurrence in the models using hyperparameter tuning. As shown in the table, response and risk continued to be key predictors across models after tuning, particularly in XGBoost and Random Forest. Age and T were also consistently significant across multiple models. Finally, adenopathy and focality showed varying importance across models, with some negative values in KNN.

5. Discussion

The goal of this study was to determine the best method for predicting the likelihood of DTC recurrence. Our study involved six machine learning algorithms, each tested under three different settings. To address class imbalance in the training data, we used the SMOTE oversampling technique, evaluating each algorithm's performance both with and without SMOTE enabled. A third test was conducted using hyperparameter tuning, which selects the best possible parameter values for each machine learning model to achieve optimal performance. Tests and results showed that the best machine learning model was Random Forest, which consistently outperformed or matched the other models across all three scenarios.
Overall, the application of SMOTE improved the performance of most models, particularly those that struggled with imbalanced data, such as KNN and SVM. Hyperparameter tuning further enhanced performance, especially for models like SVM, Decision Tree, and AdaBoost. Random Forest, AdaBoost, and XGBoost demonstrated strong performance in all scenarios, making them reliable choices for handling imbalanced datasets. Across all three scenarios, Random Forest was the best and most reliable algorithm: it consistently scored higher than the other algorithms and would yield the most reliable results when determining the likelihood of thyroid cancer recurrence.
According to the results, complex models like Random Forest and XGBoost performed better because they could capture the intricate patterns in the data, and they performed reliably across both training and test sets. These findings suggest that such models generalize better than the others, especially in the presence of noisy synthetic data, although they are also more prone to overfitting, particularly on small datasets. The SVM and KNN models showed improved results with SMOTE, indicating that their initial performance was impacted by the class imbalance; this highlights the importance of preprocessing techniques like SMOTE in improving model predictions. The Decision Tree showed some overfitting, which was mitigated through hyperparameter tuning, demonstrating the need for regularization techniques in more complex models to enhance generalizability. Overall, fine-tuning helped improve some models' performance, which emphasizes the need for selecting optimal hyperparameters in machine learning. In conclusion, different models perform differently under certain conditions, and such performance varies with data characteristics and preprocessing techniques. These findings underscore the importance of addressing data imbalance and optimizing model parameters to achieve the best predictive performance in medical diagnosis tasks.
Analysis of the importance of features in predicting cancer recurrence showed that the response attribute consistently dominated feature importance in most models, particularly Gradient Boosting, where it had exceptionally high importance in all scenarios. Risk also showed strong importance, especially in the Random Forest, Gradient Boosting, and Decision Tree models. AdaBoost displayed variability in feature importance across modifications, but age and risk emerged as prominent features with SMOTE and hyperparameter tuning. SVC and KNN generally assigned very low or negative importance to most features, including key factors like risk, suggesting these models may not be well-suited for this specific prediction task.
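The near-zero and negative importances reported for SVC and KNN are characteristic of permutation importance, in which a feature's score is the drop in model performance after randomly shuffling that feature's column; an uninformative feature can then score at or slightly below zero. A minimal sketch on toy data (the "model" and values below are hypothetical stand-ins):

```python
import random

def permutation_importance(score_fn, X, y, col, n_repeats=10, seed=0):
    """Importance of one column = baseline score minus the mean score
    after randomly shuffling that column's values."""
    rng = random.Random(seed)
    baseline = score_fn(X, y)
    drops = []
    for _ in range(n_repeats):
        Xp = [row[:] for row in X]               # copy so X stays intact
        shuffled = [row[col] for row in Xp]
        rng.shuffle(shuffled)
        for row, v in zip(Xp, shuffled):
            row[col] = v
        drops.append(baseline - score_fn(Xp, y))
    return sum(drops) / n_repeats

# toy "model": predict recurrence whenever feature 0 exceeds 0.5
def score_fn(X, y):
    return sum(int(row[0] > 0.5) == t for row, t in zip(X, y)) / len(y)

X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3], [0.7, 0.5], [0.3, 0.6]]
y = [1, 1, 0, 0, 1, 0]
imp0 = permutation_importance(score_fn, X, y, col=0)
imp1 = permutation_importance(score_fn, X, y, col=1)
print(imp0, imp1)
```

Here feature 0 drives every prediction, so shuffling it degrades the score, while shuffling the unused feature 1 changes nothing and yields an importance of exactly zero.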
As shown in the analysis, multiple metrics were used to evaluate the models quantitatively, including accuracy, recall, precision, and F1 score. These metrics are important for assessing a model's performance while also helping interpret the findings in a clinical context. For instance, high recall indicates a model's effectiveness in identifying positive cases, which is vital in clinical scenarios where missing a diagnosis could be detrimental. In addition, the feature importance analysis showed that, for example, age and response contribute significantly to model predictions, aligning with existing literature that emphasizes their role in patient outcomes. Accordingly, integrating machine learning models with clinical risk factors can enhance the decision-making process and help improve patient management and treatment.
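All four metrics reduce to the confusion-matrix counts; a quick sketch with hypothetical predictions shows the computation:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from raw confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)           # of predicted recurrences, how many were real
    recall = tp / (tp + fn)              # of true recurrences, how many were caught
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# hypothetical labels: 1 = recurrence, 0 = no recurrence
m = classification_metrics([1, 1, 1, 0, 0, 1], [1, 0, 1, 0, 1, 1])
print(m["recall"])  # 0.75
```

The clinical asymmetry discussed above lives in the `fn` term: a false negative (a missed recurrence) lowers recall, which is why recall is the metric to watch when a missed diagnosis is the costlier error.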
Results of this study imply interesting clinical applications, particularly advancements in personalized treatment development. This tool, in combination with oversight from clinical experts, could be utilized to better predict the chances of recurrent DTC in patients following primary treatment. Understanding risk levels for recurrence allows clinicians to create more effective screening and monitoring programs at regular intervals, detecting recurrence at earlier stages and monitoring more closely for the distant metastases that often increase the risk of mortality in recurrent DTC. Overall, this study demonstrates some promising implications for clinical use, with the caveat that classic ML algorithms such as those discussed herein require ongoing oversight from experts in the field to ascertain continued accuracy, particularly when applied to increasingly large datasets.

6. Conclusions, Limitations, and Future Works

In conclusion, this study demonstrates that Random Forest, followed closely by XGBoost and AdaBoost, offers the most reliable and robust performance for predicting DTC recurrence across all tested scenarios. The application of SMOTE effectively addressed class imbalances, significantly enhancing the performance of models like KNN and SVM, which were initially affected by the imbalance. Hyperparameter tuning further improved the models’ generalization capabilities, particularly in SVM, Decision Tree, and AdaBoost, highlighting the importance of model optimization in predictive tasks.
While this study identified the best algorithm for predicting thyroid cancer recurrence, several limitations remain. One key limitation is the dataset size; having a larger and more diverse dataset could lead to more accurate predictions. Given that DTC recurrence occurs more frequently in women than men, class imbalance may still exist even with more data, but a larger dataset could yield stronger, more generalizable conclusions. Another important limitation is the absence of clinical validation for predictive models. Although the models demonstrate high accuracy and recall, their practical utility in real-world clinical settings has not yet been evaluated. Clinical testing is necessary to confirm their effectiveness and reliability in patient care.
Future studies should explore these models in larger and more diverse datasets to validate and generalize these findings. There is also a need to explore the balance between model complexity and performance to optimize predictions for DTC recurrence, and to validate the tested models against clinical outcome data to ensure their predictions align with actual patient experiences and outcomes, confirming their real-world applicability. It would likewise be valuable to explore how the inclusion of genetic and omics data could enhance the predictive performance of the machine learning models; such data may help capture complex patterns and interactions that are not reflected in clinical and demographic data alone. Finally, future research should focus on developing models that account for inherent imbalances in a way that reflects real-world clinical scenarios.

Author Contributions

Conceptualization, E.C., S.P., T.L., B.H. and A.W.; methodology, E.C., S.P., T.L. and B.H.; software, E.C. and S.P.; validation, T.L. and B.H.; formal analysis, E.C., S.P., T.L. and B.H.; data curation, E.C., S.P., T.L. and B.H.; writing—original draft preparation, E.C., S.P., T.L., B.H., A.W. and R.S.; writing—review and editing, A.W. and R.S.; supervision, A.W. and R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is available at https://archive.ics.uci.edu/ (accessed on 12 August 2024).

Conflicts of Interest

The authors declare no conflict of interest.

Table 1. Summary of Results Without Modification.

| Model | Accuracy | Precision | Recall | F1 | AUC Score |
| --- | --- | --- | --- | --- | --- |
| KNN | 0.90 | 0.91 | 0.81 | 0.84 | 0.863 |
| SVM | 0.83 | 0.91 | 0.66 | 0.69 | 0.937 |
| Decision Tree | 0.92 | 0.89 | 0.91 | 0.90 | 0.912 |
| Random Forest | 0.99 | 0.99 | 0.97 | 0.98 | 0.996 |
| AdaBoost | 0.97 | 0.98 | 0.95 | 0.96 | 0.986 |
| XGBoost | 0.97 | 0.97 | 0.97 | 0.97 | 0.995 |
Table 2. Feature Importance for Each Model (Without Modification).

| Feature | Random Forest | Gradient Boosting | AdaBoost | Decision Tree | SVC | KNN |
| --- | --- | --- | --- | --- | --- | --- |
| Age | 0.069 | 0.052 | 0.280 | 0.070 | 0.009 | −0.001 |
| Gender | 0.017 | 0.007 | 0.060 | 0.017 | 0.000 | −0.001 |
| Smoking | 0.009 | 0.002 | 0.020 | 0.014 | 0.000 | 0.001 |
| Hx Smoking | 0.002 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 |
| Hx Radiotherapy | 0.001 | 0.004 | 0.000 | 0.000 | 0.012 | 0.000 |
| Thyroid Function | 0.017 | 0.008 | 0.080 | 0.017 | 0.007 | 0.014 |
| Physical Examination | 0.019 | 0.002 | 0.040 | 0.008 | 0.005 | −0.005 |
| Adenopathy | 0.050 | 0.007 | 0.000 | 0.012 | −0.008 | −0.013 |
| Pathology | 0.013 | 0.003 | 0.100 | 0.000 | 0.001 | 0.003 |
| Focality | 0.017 | 0.001 | 0.000 | 0.000 | 0.000 | −0.012 |
| Risk | 0.149 | 0.067 | 0.040 | 0.054 | −0.001 | −0.001 |
| T | 0.083 | 0.014 | 0.080 | 0.003 | 0.008 | 0.003 |
| N | 0.109 | 0.010 | 0.060 | 0.012 | −0.010 | 0.048 |
| M | 0.005 | 0.000 | 0.000 | 0.000 | 0.048 | 0.003 |
| Stage | 0.038 | 0.004 | 0.040 | 0.000 | 0.040 | −0.001 |
| Response | 0.403 | 0.817 | 0.200 | 0.794 | 0.177 | 0.044 |
Table 3. Summary of Results Using SMOTE.

| Model | Accuracy | Precision | Recall | F1 | AUC Score |
| --- | --- | --- | --- | --- | --- |
| KNN | 0.97 | 0.97 | 0.97 | 0.97 | 0.990 |
| SVM | 0.94 | 0.94 | 0.94 | 0.94 | 0.994 |
| Decision Tree | 0.94 | 0.94 | 0.94 | 0.94 | 0.929 |
| Random Forest | 0.95 | 0.95 | 0.96 | 0.95 | 0.999 |
| AdaBoost | 0.95 | 0.95 | 0.96 | 0.95 | 0.994 |
| XGBoost | 0.96 | 0.96 | 0.96 | 0.96 | 0.996 |
Table 4. Feature Importance for Each Model (SMOTE).

| Feature | Random Forest | Gradient Boosting | AdaBoost | Decision Tree | SVC | KNN |
| --- | --- | --- | --- | --- | --- | --- |
| Age | 0.086 | 0.055 | 0.420 | 0.069 | 0.095 | 0.098 |
| Gender | 0.007 | 0.002 | 0.020 | 0.000 | −0.001 | 0.008 |
| Smoking | 0.004 | 0.004 | 0.000 | 0.000 | 0.000 | 0.008 |
| Hx Smoking | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Hx Radiotherapy | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Thyroid Function | 0.021 | 0.013 | 0.060 | 0.025 | −0.001 | 0.002 |
| Physical Examination | 0.021 | 0.010 | 0.080 | 0.029 | −0.006 | 0.018 |
| Adenopathy | 0.038 | 0.012 | 0.040 | 0.026 | −0.003 | 0.032 |
| Pathology | 0.013 | 0.002 | 0.080 | 0.000 | 0.000 | −0.008 |
| Focality | 0.046 | 0.002 | 0.000 | 0.000 | −0.002 | −0.002 |
| Risk | 0.253 | 0.709 | 0.040 | 0.681 | 0.019 | 0.019 |
| T | 0.046 | 0.006 | 0.060 | 0.001 | 0.005 | 0.027 |
| N | 0.111 | 0.005 | 0.020 | 0.011 | 0.034 | 0.006 |
| M | 0.001 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 |
| Stage | 0.015 | 0.000 | 0.020 | 0.000 | −0.001 | 0.001 |
| Response | 0.337 | 0.180 | 0.160 | 0.159 | 0.066 | 0.063 |
Table 5. Summary of Results Using Hyperparameter Tuning.

| Model | Accuracy | Precision | Recall | F1 | AUC Score |
| --- | --- | --- | --- | --- | --- |
| KNN | 0.90 | 0.91 | 0.81 | 0.84 | 0.867 |
| SVM | 0.94 | 0.94 | 0.89 | 0.91 | 0.904 |
| Decision Tree | 0.97 | 0.98 | 0.95 | 0.96 | 0.912 |
| Random Forest | 0.99 | 0.99 | 0.97 | 0.98 | 0.992 |
| AdaBoost | 0.96 | 0.98 | 0.92 | 0.94 | 0.994 |
| XGBoost | 0.97 | 0.97 | 0.97 | 0.97 | 0.996 |
Table 6. Feature Importance for Each Model (Hyperparameter Tuning).

| Feature | Random Forest | Gradient Boosting | AdaBoost | Decision Tree | SVC | KNN |
| --- | --- | --- | --- | --- | --- | --- |
| Age | 0.064 | 0.052 | 0.080 | 0.070 | 0.005 | −0.005 |
| Gender | 0.013 | 0.009 | 0.040 | 0.011 | 0.003 | −0.003 |
| Smoking | 0.010 | 0.001 | 0.000 | 0.000 | −0.014 | 0.000 |
| Hx Smoking | 0.003 | 0.005 | 0.000 | 0.000 | −0.001 | 0.000 |
| Hx Radiotherapy | 0.001 | 0.000 | 0.000 | 0.000 | 0.012 | 0.000 |
| Thyroid Function | 0.012 | 0.008 | 0.000 | 0.019 | 0.001 | 0.000 |
| Physical Examination | 0.018 | 0.002 | 0.000 | 0.025 | 0.008 | −0.010 |
| Adenopathy | 0.045 | 0.008 | 0.000 | 0.020 | 0.001 | −0.033 |
| Pathology | 0.012 | 0.003 | 0.000 | 0.020 | 0.003 | 0.003 |
| Focality | 0.021 | 0.001 | 0.000 | 0.008 | 0.003 | −0.007 |
| Risk | 0.198 | 0.067 | 0.200 | 0.104 | −0.001 | −0.012 |
| T | 0.082 | 0.015 | 0.000 | 0.021 | 0.001 | −0.005 |
| N | 0.062 | 0.009 | 0.060 | 0.010 | 0.044 | 0.033 |
| M | 0.006 | 0.000 | 0.000 | 0.000 | 0.048 | 0.001 |
| Stage | 0.035 | 0.003 | 0.020 | 0.000 | 0.040 | 0.000 |
| Response | 0.419 | 0.818 | 0.600 | 0.693 | 0.225 | 0.005 |
