1. Introduction
Differentiated thyroid cancer (DTC), which encompasses both papillary and follicular thyroid cancers, represents the most prevalent type of thyroid malignancy [
1]. DTC recurrence risk is influenced by many factors, necessitating accurate predictive models to improve patient outcomes. Over the past two decades, the incidence of thyroid cancer has risen internationally; however, mortality has remained stable [
2]. This stability in relative mortality rates has been made possible by advances in diagnostic techniques and treatment approaches, which have improved the ability to accurately assess risk, predict recurrence, and implement early, tailored interventions, especially when these are grounded in accepted risk factors [
3].
DTC accounts for over 90% of thyroid cancer cases. High levels of thyroid-stimulating hormone (TSH) are traditionally associated with thyroid malignancies, but recent studies indicate that elevated thyroxine (T4) and triiodothyronine (T3) levels may also contribute to thyroid cancer development [
4]. Moreover, DTC recurrence affects about 30% of patients within 10 years of diagnosis, with papillary thyroid cancer (PTC) being the most common subtype. Follicular thyroid cancer (FTC) accounts for about 10–15% of cases and is distinguished by a high propensity for metastasis through blood vessel invasion [
5,
6].
DTC frequently recurs in the cervical lymph nodes, contributing to increased metastasis risk. Monitoring methods for DTC include classification systems such as the American Thyroid Association (ATA) risk classification and the tumor node metastasis (TNM) staging. The ATA system classifies patients into high, intermediate, or low risk of recurrence based on several factors [
7]. TNM staging considers the extent of the tumor, lymph node involvement, and metastasis. However, TNM has been criticized for its lack of integration of biological characteristics that might influence malignancy [
8].
In recent years, machine learning techniques have emerged as a promising approach for analyzing complex datasets of patient risk factors and predicting the risk of cancer occurrence and recurrence.
Although these methods are promising, they are not yet widely used in the field of thyroid cancer recurrence. This study aims to advance existing research by developing predictive models that identify patients at increased risk of DTC recurrence based on 16 established risk factors [
3], including patient data, treatment types, and personal histories. Specifically, we aim to evaluate six machine learning algorithms to determine which one provides the most accurate predictions of thyroid cancer recurrence.
3. Design and Methodology
The Differentiated Thyroid Cancer Recurrence dataset from the University of California at Irvine Machine Learning Repository was used in this study [
3]. This dataset consists of retrospective clinical data for 383 patients diagnosed with DTC, each followed for a minimum of 10 years. The collected clinical data included 16 features: age at diagnosis, gender, current smoking status, prior smoking history, history of head and neck radiation therapy, thyroid function, physical examination findings (including presence of goiter), adenopathy, pathological subtype of cancer, focality, ATA risk assessment, the T, N, and M classifications, overall stage, and initial treatment response, together with the recurrence outcome. The dataset contains 312 females (81%) and 71 males (19%). The average age at diagnosis was 41. The pathological subtype breakdown was 287 papillary (75%), 48 micropapillary (13%), 28 follicular (7%), and 20 Hürthle cell (5%). According to the ATA risk classification, patients were classified as follows: 249 low risk (65%), 102 intermediate risk (27%), and 32 high risk (8%). Most cases (333) were classified as Stage 1 (87%). A total of 208 patients had an excellent initial treatment response (54%); the remaining cases had the following initial treatment responses: 91 structural incomplete (24%), 61 indeterminate (16%), and 23 biochemical incomplete (6%). In terms of recurrence, 108 patients (28%) experienced recurrence. Notably, this dataset contained no missing values.
ML models were applied in this study to analyze and predict DTC recurrence: k-nearest neighbors (KNN), support vector machine (SVM), Decision Tree, Random Forest, AdaBoost, and XGBoost. KNN is a simple, instance-based learning algorithm that classifies data points by the majority class among their k nearest neighbors in the feature space, using a chosen distance metric. KNN is particularly useful when the feature space contains well-defined clusters, enabling quick predictions of DTC recurrence from similar historical cases [
21]. Leveraging the kernel trick, SVM maps inputs into high-dimensional feature spaces, allowing us to draw clear margins between classes. Its robustness against outliers was beneficial in our study, particularly with the high-dimensional, small dataset we analyzed. This model helped identify subtle patterns that differentiate between recurring and non-recurring cases [
22]. Decision Tree represents choices and their results in a tree-shaped graph. The Decision Tree model allowed for straightforward interpretation of decision rules concerning DTC recurrence [
21,
22]. The Random Forest method mitigated the limitations of individual decision trees by aggregating multiple trees to improve accuracy and robustness. It proved effective in handling high-dimensional data and addressing class imbalances, which are common in DTC datasets, ultimately leading to better predictive performance [
23]. AdaBoost enhanced weaker classifiers by focusing on errors from prior iterations. Its ability to improve accuracy, particularly in complex cases, allowed the refinement of predictions for DTC recurrences [
21,
23]. XGBoost handles missing values natively and, given careful hyperparameter tuning, can model complex interactions among features related to DTC recurrence effectively [
23]. These models were chosen to compare how differently they handle class imbalance and how their predictive accuracy varies.
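As a compact illustration of the modeling setup, the six classifier families can be trained and compared side by side in scikit-learn, as sketched below. The data here are synthetic, the parameters are library defaults rather than the study's settings, and scikit-learn's GradientBoostingClassifier stands in for XGBoost (which lives in the separate `xgboost` package):

```python
# Sketch: fitting the six model families on a synthetic, imbalanced binary
# dataset sized like the study's (383 rows, 16 features, ~28% positives).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=383, n_features=16,
                           weights=[0.72, 0.28], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(probability=True, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    # Stand-in for XGBoost, so the sketch needs no extra dependency.
    "XGBoost (stand-in)": GradientBoostingClassifier(random_state=42),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```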
Python 3.13 was used to run the models. Python's rich ecosystem of libraries and tools makes it an excellent choice for implementing machine learning classification techniques, supporting every step of the workflow from data preparation and model training to evaluation and deployment. The dataset was partitioned into a training set (80%) and a test set (20%). The data were first run once with no modifications. Given the class imbalance in recurrence outcomes, a second experiment applied SMOTE. SMOTE generates synthetic samples for the minority class based on feature similarity with existing minority instances, reducing the bias towards the majority class and thereby improving the performance of classifiers trained on imbalanced datasets. In hopes of further improving model performance, a final run used hyperparameter tuning, which selects the parameter values that yield the best model performance.
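SMOTE's core mechanism — synthesizing a minority-class point by interpolating between a minority sample and one of its minority-class neighbors — can be sketched in a few lines. In practice one would use `imblearn.over_sampling.SMOTE`; this hand-rolled version, on toy data, only illustrates the idea:

```python
# Minimal sketch of SMOTE's interpolation step (illustration only; the
# imbalanced-learn library implements the full algorithm).
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=3):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample (self included).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # k nearest, excluding itself
        j = rng.choice(neighbors)
        lam = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = rng.normal(loc=2.0, size=(20, 5))   # toy minority class
X_new = smote_like(X_minority, n_new=30)
```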
For all tests, data preprocessing was performed. Categorical variables were converted into numerical values using LabelEncoder, and StandardScaler was applied to standardize columns by removing the mean and scaling to unit variance. SMOTE was used to handle class imbalance by oversampling the minority class; for the SMOTE test, MinMaxScaler was used to scale specific columns to the range 0 to 2. Grid search was used for hyperparameter tuning, systematically evaluating combinations of hyperparameter values to optimize model performance. Hyperparameter optimization techniques range from manual trial-and-error tuning to exhaustive grid search, while random search explores broader spaces efficiently by random sampling [
24]. The dataset was split into training and testing sets using train_test_split with a test size of 20%. The final models were evaluated using accuracy, precision, recall, F1 score, and AUC score.
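The preprocessing steps above can be sketched as follows on a toy frame; the column names here are illustrative, not the dataset's exact headers:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Toy frame; the real dataset has 16 clinical features and 383 rows.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M", "F"],
    "Age": [27, 41, 60, 34, 55, 48],
    "Recurred": ["No", "Yes", "No", "No", "Yes", "No"],
})

# LabelEncoder maps each categorical column to integer codes.
for col in ["Gender", "Recurred"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns="Recurred")
y = df["Recurred"]

# StandardScaler removes the mean and scales each column to unit variance.
X_scaled = StandardScaler().fit_transform(X)

# For the SMOTE run, specific columns were instead scaled to the range 0-2.
X_mm = MinMaxScaler(feature_range=(0, 2)).fit_transform(X)

# 80/20 train-test split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, test_size=0.2,
                                          random_state=42)
```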
In the fine-tuning scenario, each model was fine-tuned in different ways depending on the model parameters. For the KNN model, the hyperparameters tuned were n_neighbors and weights. The n_neighbors parameter controls how many neighbors influence the classification decision, and was tested with values of 3, 5, and 7. The weights parameter defines how neighbors contribute to the final classification and was tested with two options: uniform and distance. Such combination of tuning parameters ensures that the KNN model is capable of balancing local sensitivity and generalization, providing robust performance on the dataset [
25]. For the SVM model, the key hyperparameters tuned were C and kernel. The C parameter controls the trade-off between achieving a large margin and minimizing classification errors. The C parameter was tested using small and large values. The kernel parameter influences how the algorithm transforms the input data into a higher-dimensional space. With these hyperparameters, the SVM model can achieve an optimal balance between margin maximization and classification accuracy for the dataset [
26].
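Assuming scikit-learn's GridSearchCV (the text does not name an exact implementation), the KNN and SVM searches might be set up as below; the data are synthetic, and the C values are illustrative since the text only specifies "small and large":

```python
# Sketch: grid search over the KNN and SVM hyperparameters described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

# n_neighbors and weights, exactly as listed in the text.
knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
).fit(X, y)

# Illustrative "small and large" C values and two common kernels.
svm_grid = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
).fit(X, y)
```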
For the Decision Tree model, the key hyperparameters tuned were criterion and max_depth. The criterion parameter was set to either gini or entropy. The max_depth parameter was evaluated with values of 10, 20, and 30, as well as the default setting of None, which allows the tree to grow without limit. This tuning strategy for the Decision Tree model ensures a balance between model complexity and generalizability [
27]. For the Random Forest model, the tuning strategy involved adjusting n_estimators, criterion, and max_depth. The n_estimators parameter, which controls the number of trees in the forest, was tested with values of 10, 50, and 100. The criterion parameter was set to either gini or entropy. The max_depth parameter was evaluated with values of 10, 20, and 30, as well as None to allow unlimited tree growth. This combination of hyperparameter tuning ensures that the Random Forest model finds the optimal balance between predictive accuracy and computational efficiency [
28,
29].
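The Decision Tree and Random Forest grids described above map directly onto scikit-learn parameter grids; a sketch on synthetic data (GridSearchCV is assumed, as the text only says grid search was used):

```python
# Sketch: the Decision Tree and Random Forest grids from the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

dt_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"],
     "max_depth": [10, 20, 30, None]},   # None lets the tree grow unbounded
    cv=3,
).fit(X, y)

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50, 100],
     "criterion": ["gini", "entropy"],
     "max_depth": [10, 20, 30, None]},
    cv=3,
).fit(X, y)
```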
For the AdaBoost model, the primary hyperparameters tuned were n_estimators and learning_rate. The n_estimators parameter defines the number of weak learners, and was evaluated using 50, 100, and 200. The learning_rate parameter controls the contribution of each weak learner, and was evaluated using 0.01, 0.1, and 1. This tuning strategy ensures that the AdaBoost model achieves a balance between learning speed and accuracy [
30]. Finally, for the XGBoost model, key hyperparameters tuned were n_estimators, learning_rate, and max_depth. The n_estimators parameter represents the number of boosting rounds, and was evaluated with values of 50, 100, and 200. The learning_rate parameter controls how much each boosting round contributes to the final model, and was evaluated using 0.01, 0.1, and 0.2. The max_depth parameter controls the depth of each tree, and was evaluated using 3, 6, and 9. This comprehensive tuning strategy allows XGBoost to achieve both high accuracy and generalizability across the dataset [
31].
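The two boosting grids can be sketched the same way. scikit-learn's GradientBoostingClassifier is used here in place of `xgboost.XGBClassifier` (which accepts the same three parameters) so the sketch carries no extra dependency:

```python
# Sketch: the AdaBoost and XGBoost-style grids from the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

ada_grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    {"n_estimators": [50, 100, 200], "learning_rate": [0.01, 0.1, 1]},
    cv=3,
).fit(X, y)

# Stand-in for xgboost.XGBClassifier, searched over the same parameters.
xgb_like_grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [50, 100, 200],
     "learning_rate": [0.01, 0.1, 0.2],
     "max_depth": [3, 6, 9]},
    cv=3,
).fit(X, y)
```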
The performance of the models was assessed using accuracy, precision, recall, and F1 score. Together, these metrics provide a comprehensive view of each model’s effectiveness. Accuracy shows how often the model correctly predicted whether DTC recurred. Precision indicates how many of the patients predicted to recur actually did; high precision means that when the model predicts recurrence, it is likely to be correct, which is critical for minimizing false alarms that may cause unnecessary treatment. Recall shows how effectively the model identifies patients who will experience recurrence; high recall means the model catches most recurrences, minimizing false negatives and helping ensure that at-risk patients receive the treatment needed to reduce their chances of recurrence. Finally, a high F1 score shows that the model balances precision and recall well, which is especially helpful under class imbalance [
21].
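On a small hypothetical set of predictions, the reported metrics can be computed directly with scikit-learn:

```python
# Sketch: computing the five reported metrics for a toy set of predictions.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]                       # 1 = recurred
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]                       # hard labels
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2, 0.7, 0.1]  # probabilities

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of predicted recurrences, how many were real
rec  = recall_score(y_true, y_pred)     # of real recurrences, how many were caught
f1   = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
auc  = roc_auc_score(y_true, y_score)   # ranking quality of the probabilities
```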
4. Results
For the KNN classifier with and without SMOTE, k = 3 was chosen because experimental evaluation showed it retained strong performance while capturing local patterns in the data without overfitting. With SMOTE, k = 3 remains a reasonable choice because the data become more balanced, ensuring that minority-class neighborhoods are not dominated by the majority class. In the fine-tuning run, testing k at 3, 5, and 7 explores how increasing the neighborhood size affects generalization: larger values of k smooth out noise at the cost of local sensitivity, helping identify the best balance between capturing local patterns and making accurate predictions.
In the initial run with no modifications (
Table 1), KNN demonstrated strong overall performance with high accuracy (0.90) and precision (0.91). It did have a lower recall (0.81), which could mean that the model would miss some positive cases of recurrence.
The SVM model had a lower accuracy (0.83) but still good precision (0.91). Its recall was low (0.66), meaning the model struggled to identify recurrence. The Decision Tree model performed strongly, with high accuracy (0.92), precision (0.89), recall (0.91), F1 score (0.90), and AUC score (0.912). A training score of 1.0 indicates possible overfitting to the training data, though the test score of 0.92 shows the model still performed well on new data.
The Random Forest model had the best overall performance, with near-perfect accuracy (0.99), precision (0.99), F1 score (0.98), and AUC score (0.996). The recall (0.97) was only slightly lower, which indicates that this model was reliable for predicting DTC recurrence. The high training and test scores also indicated good model generalization and minimal overfitting.
AdaBoost also had high performance, with high accuracy (0.97), precision (0.98), recall (0.95), F1 score (0.96), and AUC score (0.986). The close training and test scores indicate that it is a robust choice. Finally, XGBoost also had consistently high metrics for accuracy (0.97), precision (0.97), recall (0.97), F1 score (0.97), and AUC score (0.995). A training score of 1.0 does show the possibility of overfitting with this model.
The AUC scores showed that Random Forest performed the best with an AUC score of 0.996, indicating near-perfect performance in distinguishing between classes. XGBoost followed closely with an AUC of 0.995, which demonstrates strong predictive power. AdaBoost had a slightly lower but still excellent AUC of 0.986, showing strong performance as well. Although SVM had the highest AUC among the non-ensemble models, at 0.937, its overall accuracy and recall were lower compared to other models. KNN and Decision Tree performed reasonably well, with AUC scores of 0.863 and 0.912, respectively.
Table 2 summarizes which features were most significant for predicting DTC recurrence in the initial run without modifications. As shown in the table, response and risk were important features in this scenario, especially for the Random Forest and XGBoost models. Age, T, and N showed varying importance across models. Finally, several features, such as adenopathy, focality, and Hx smoking, had low or negative importance in the SVC and KNN models.
The data were run again with the application of SMOTE (
Table 3) to address class imbalances. Overall, the models showed good improvement with the application of SMOTE. KNN showed significant improvement, achieving high accuracy (0.97), precision (0.97), recall (0.97), F1 score (0.97), and AUC score (0.990).
SVM greatly benefited from SMOTE, indicating that the model had likely been influenced by the class imbalance; it achieved balanced performance with good accuracy (0.94), precision (0.94), recall (0.94), F1 score (0.94), and AUC score (0.994). The Decision Tree model also improved, with consistent accuracy (0.94), precision (0.94), recall (0.94), F1 score (0.94), and AUC score (0.929). Once again, the perfect training score indicated potential for overfitting with this model.
Random Forest saw a slight decrease in performance when SMOTE was applied, which could indicate some overfitting in the initial model. As an ensemble method, Random Forest can typically handle some degree of class imbalance, and introducing synthetic samples can add noise, which could also explain the decreased performance.
AdaBoost saw a slight decrease in both accuracy (0.95) and precision (0.95). Recall (0.96) improved slightly, and the model became less apt to favor one class disproportionately. XGBoost produced results similar to those of the initial run and once again showed the potential for some overfitting.
The AUC scores showed that Random Forest stood out as the best performer with an AUC score of 0.999, indicating nearly perfect discriminatory power. XGBoost followed closely with an AUC of 0.996, showing a strong predictive capability. SVM and AdaBoost both showed excellent performance with AUC scores of 0.994, making them competitive but slightly behind Random Forest. KNN also performed very well with an AUC of 0.990, but it was slightly outperformed by the ensemble methods. The AUC score of Decision Tree indicated it may be less reliable in distinguishing between classes compared to the other models.
Table 4 shows a summary of which features are most significant for predicting DTC recurrence in the models using SMOTE. As shown in the table, the Random Forest and XGBoost highlighted risk and response as the most important factors, while AdaBoost and Decision Tree also emphasized age and physical examination. SVC and KNN showed that age and response were consistently important, with additional contributions from adenopathy and T in KNN. Finally, low-impact features such as Hx smoking, Hx radiotherapy, and focality generally had very low or negative importance across models.
The final test was conducted using hyperparameter tuning (
Table 5). The purpose of running hyperparameter tuning was to improve each algorithm’s performance by finding the parameters that would ultimately maximize the performance of each algorithm. KNN had a consistent performance with no significant changes.
Despite the hyperparameter tuning, recall remained lower (0.81), suggesting that the model could miss some cases of potential recurrence. SVM showed good improvement from the initial results, indicating hyperparameter tuning enhanced the model’s ability to predict DTC recurrence. Decision Tree showed marked improvement with accuracy (0.97), precision (0.98), recall (0.95), F1 score (0.96), and AUC score (0.912). This shows that hyperparameter tuning helped the Decision Tree model reduce overfitting and improve generalization.
Random Forest once again had very high performance across the board for accuracy (0.99), precision (0.99), recall (0.97), F1 score (0.98), and AUC score (0.992). Hyperparameter tuning helped to solidify its performance and make this model very reliable for predicting DTC recurrence. AdaBoost saw an improvement with precision (0.98) but did see a slight drop with recall (0.92).
Finally, XGBoost showed consistent performance with accuracy (0.97), precision (0.97), recall (0.97), F1 score (0.97), and AUC score (0.996). This demonstrates that hyperparameter tuning helps enhance this model, allowing it to maintain high performance across all metrics.
The AUC scores showed that XGBoost performed best with an AUC of 0.996, demonstrating an exceptional ability to distinguish between classes. AdaBoost followed with an AUC of 0.994, slightly outperforming Random Forest (0.992) but still falling behind XGBoost. Decision Tree had a good AUC of 0.912, showing reliable performance but not as strong as the ensemble methods. SVM achieved an AUC of 0.904, solid but somewhat lower than the other models. KNN, with an AUC of 0.867, had the lowest score, indicating the weakest class discrimination among the models.
Table 6 shows a summary of which features are most significant for predicting DTC recurrence in the models using fine-tuning. As shown in the table, response and risk continued to be key predictors across models with fine-tuning, particularly in XGBoost and Random Forest. Age and T were also consistently significant across multiple models. Finally, adenopathy and focality showed varying importance across models, with some negative values in KNN.
5. Discussion
The goal of this study was to determine the best method for predicting the likelihood of DTC recurrence. Our study involved six machine learning algorithms, each tested under three different settings. To address class imbalance in the training data, we used the SMOTE oversampling technique, evaluating each algorithm both with and without SMOTE enabled. A third test applied hyperparameter tuning, selecting the best possible parameter values for each model to achieve optimal performance. The results showed that the best machine learning model was Random Forest, which consistently outperformed or matched the other models across all three scenarios.
Overall, the application of SMOTE improved the performance of most models, particularly those that struggled with imbalanced data, such as KNN and SVM. Hyperparameter tuning further enhanced performance, especially for SVM, Decision Tree, and AdaBoost. Random Forest, AdaBoost, and XGBoost demonstrated strong performance in all scenarios, making them reliable choices for handling imbalanced datasets. Across all three scenarios, Random Forest was the best and most reliable algorithm: it consistently scored higher than the others in each scenario and would yield the most reliable results when determining the likelihood of thyroid cancer recurrence.
According to the results, complex models like Random Forest and XGBoost performed better because they could capture intricate patterns in the data, and they showed reliable performance across both training and test sets. These findings suggest that such models generalize better than the others, even in the presence of noisy synthetic data, although they are also more prone to overfitting, especially on small datasets. The SVM and KNN models showed improved results with SMOTE, indicating that their initial performance was hurt by the class imbalance; this highlights the importance of preprocessing techniques like SMOTE in improving model predictions. The Decision Tree showed some overfitting, which was mitigated through hyperparameter tuning, demonstrating the need for regularization in more complex models to enhance generalizability. Overall, fine-tuning improved the performance of some models, emphasizing the need to select optimal hyperparameters. In conclusion, different models perform differently under different conditions, and such performance varies with data characteristics and preprocessing techniques. These findings underscore the importance of addressing data imbalance and optimizing model parameters to achieve the best predictive performance in medical diagnosis tasks.
Analysis of feature importance in predicting cancer recurrence showed that the response attribute consistently dominated in most models, particularly in XGBoost, where it had exceptionally high importance in all scenarios. Risk also showed strong importance, especially in the Random Forest, XGBoost, and Decision Tree models. AdaBoost displayed variability in feature importance across modifications, but age and risk emerged as prominent features with SMOTE and hyperparameter tuning. SVC and KNN generally assigned very low or negative importance to most features, including key factors like risk, suggesting these models may not be well suited to this specific prediction task.
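The kind of importance values discussed here can be read directly from a fitted tree ensemble via `feature_importances_`, while permutation importance yields the analogous (possibly negative) values for models like KNN and SVC; a sketch on synthetic data:

```python
# Sketch: extracting feature importances from a tree ensemble versus
# permutation importance for a distance-based model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)
rf_importance = rf.feature_importances_        # non-negative, sums to 1

knn = KNeighborsClassifier().fit(X, y)
perm = permutation_importance(knn, X, y, random_state=0)
knn_importance = perm.importances_mean         # can be negative for unhelpful features
```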
As shown in the analysis, multiple metrics have been used to evaluate the models quantitatively, including accuracy, recall, precision, and F1 score. Such metrics are considered important to assess a model’s performance, while at the same time helping interpret findings with respect to the clinical context. For instance, a high recall in the KNN model indicates its effectiveness in identifying positive cases, which is vital in clinical scenarios where missing a diagnosis could be detrimental. In addition, feature importance analysis showed that, for example, age and response significantly contribute to the model predictions, aligning with existing literature that emphasizes their role in patient outcomes. Accordingly, we can enhance the decision-making process by integrating machine learning models with clinical risk factors to help improve patient management and treatment.
The results of this study have promising clinical applications, particularly for the development of personalized treatment. In combination with oversight from clinical experts, this tool could be used to better predict the chances of recurrent DTC in patients following primary treatment. Understanding recurrence risk allows clinicians to create more effective screening and monitoring programs at regular intervals, detecting recurrence at earlier stages and watching more closely for the distant metastases that often increase mortality risk in recurrent DTC. Overall, this study demonstrates promising implications for clinical use, with the caveat that classic ML algorithms such as those discussed herein require ongoing expert oversight to maintain accuracy, particularly as they are applied to increasingly large datasets.