1. Introduction
Predicting the aftermarket performance of Initial Public Offerings (IPOs) is a critical area of research in finance, as it can provide valuable insights for investors, companies, and policymakers [
1]. The ability to forecast IPO outcomes is essential for data-driven decision-making in complex financial systems, enabling stakeholders to navigate uncertainties and optimize strategies. The performance of IPOs post-listing serves as a significant indicator of market dynamics and investor sentiment, impacting decisions around investment, issuance strategies, and regulatory policies [
2]. Accurate prediction models help mitigate risks and inform better decision-making, making this an area of high importance and continuous development.
Traditionally, IPO performance prediction has relied on various statistical methods, such as regression analysis, which offer foundational insights into the factors influencing IPO outcomes [
3]. However, these methods often fall short of capturing the complex, non-linear relationships inherent in financial markets. Despite advancements in machine-learning models, existing approaches still face limitations in handling small, imbalanced IPO datasets, particularly in emerging markets, where data scarcity and class imbalance present significant challenges. Many studies have either overlooked these issues or relied on limited features, reducing their generalizability and predictive robustness. As financial systems grow more intricate, there is an increasing need for advanced data-driven methodologies, such as machine learning, to uncover hidden patterns and improve predictive accuracy [
4].
Despite the progress made, predicting IPO performance remains challenging due to issues such as limited data availability and class imbalance. This issue is particularly pronounced in emerging markets, where IPO datasets are typically small, making it difficult to train robust models. Additionally, in these markets, the number of successful IPOs often significantly outnumbers those that underperform, leading to skewed predictions. Addressing these challenges is critical for developing reliable, data-driven models capable of managing the complexities and imbalances of financial systems, particularly in emerging markets.
This study addresses key gaps in IPO underperformance prediction through a comprehensive data-driven framework specifically designed for challenging financial environments. Unlike previous approaches, the framework integrates feature selection, data balancing techniques, and risk-based model evaluation to improve generalizability and decision-making. Leveraging publicly available financial and prospectus data reduces reliance on proprietary datasets, enhancing the practical applicability of the model, especially in emerging markets where data scarcity is prevalent. The methodology employs various ensemble methods, including the Bagging Classifier (BC), Random Forest (RF), AdaBoost (Ada), Gradient Boosting (GB), XGBoost (XG), Extra Trees (ET), and the Stacking Classifier (SC), with Decision Trees (DT) as the base estimator. It incorporates the ANOVA F-value for feature selection, Randomized Search for hyperparameter optimization, and the Synthetic Minority Over-sampling Technique (SMOTE) for class balancing to optimize predictive accuracy. Additionally, the study introduces a dynamic methodology that tailors evaluation metrics based on investor risk preferences, ensuring adaptability to different risk profiles. This comprehensive approach aims to provide a robust, versatile tool for IPO performance prediction, offering valuable insights into the field of financial forecasting. Thus, the major contributions of this paper are summarized as follows:
- (1)
It proposes a unique data-driven framework that is tailored to handle IPO underperformance predictions in complex financial systems, focusing on small and imbalanced datasets relevant to emerging markets by conducting the following:
Relying solely on publicly accessible pre-listing prospectus and firm-specific financial data.
Utilizing SMOTE to handle class imbalances and ANOVA to manage feature selection.
Incorporating various tuned ensemble classifiers to handle small datasets in the context of IPO underperformance predictions.
- (2)
It proposes a risk-optimized methodology for classifier selection based on investors’ risk preferences.
The remainder of the paper is organized as follows:
Section 2 provides a review of the relevant literature,
Section 3 describes the dataset used,
Section 4 outlines the proposed framework,
Section 5 presents the results obtained from the framework, and
Section 6 offers conclusions.
2. Literature Review
Predicting the aftermarket performance of IPOs is a critical area of research in finance, as it can provide valuable insights for investors, companies, and policymakers. A wide range of data-driven methodologies has been explored to understand and forecast the behavior of IPO stocks post-listing, spanning from traditional regression analysis to advanced machine-learning techniques tailored for complex financial systems. This literature review examines the key approaches and findings in this domain, emphasizing the evolution of predictive modeling and the role of data-driven decision-making in optimizing IPO performance predictions.
2.1. Regression Approaches
The performance of IPOs has been the subject of extensive research, with various studies employing regression analysis to uncover the determinants and implications of IPO performance across different markets, as shown in
Table 1.
Using multiple regression with industry and listing year as dummy variables, Ferdous et al., [
5] analyzed 211 IPOs on the Australian Stock Exchange from 2011 to 2015. They found underpricing in the total market return and overpricing in the secondary market, influenced by the year of listing and industry settings. Rafique et al., [
6] also used multiple regression to investigate 51 IPOs on the Pakistan Stock Exchange over ten years and concluded that prior IPO demand, firm size, issue size, and leverage do not significantly impact financial and operating performance. Mutai [
7] employed regression and traditional statistical tests to examine 12 IPOs on the Nairobi Securities Exchange (NSE) from 1996 to 2013, discovering an average underpricing of 55.36% and a significant post-IPO decline in Cumulative Abnormal Returns (CAR) and Return on Equity (ROE), suggesting the need for investors to consider more financial determinants beyond ROA and ROE. Michel et al., [
8] applied regression analysis to explore the relationship between IPO underpricing and public float, the portion of a company’s shares that are available for trading by the public, with data from 1996 to 2008, finding a non-linear relationship that supports the hypothesis that firms allocate a fixed amount for underpricing. Mittal and Verma [
9] used ordinary least squares (OLS) and stepwise regression methods to analyze 335 book-built IPOs in the Indian capital market from 2006 to 2015, finding that natural logarithm transformations significantly improved model explanatory power. Lastly, Ong et al., [
10] utilized univariate and multiple OLS regression analyses to study 467 Malaysian IPOs listed from 2000 to 2017, discovering a positive relationship between IPO price-multiples and those of comparable firms, with lower-valued firms underpricing their IPOs to attract investors and book-built IPOs generating higher initial returns, highlighting the book-building mechanism’s role in mitigating misvaluation.
These studies underscore both the strengths and limitations of using regression analysis to predict IPO aftermarket performance. The advantages include identifying key determinants of IPO performance, assessing the impact of various factors such as industry, listing year, and investor demand, and improving model predictability through variable transformation. However, the traditional regression method’s limitations include the assumption of a linear relationship between predictors and the outcome, which may not always hold true, leading to potential model misspecification. It also struggles with capturing complex, non-linear interactions compared to advanced techniques like machine learning. Regression models can also be sensitive to outliers, which can disproportionately affect the model estimates and reduce predictive accuracy. Nevertheless, regression analysis remains a pivotal tool in the empirical examination of IPO dynamics, offering valuable insights for investors, underwriters, and regulators.
2.2. Machine-Learning Approaches
The prediction of IPO aftermarket performance has recently been extensively explored through machine-learning (ML) models, as shown in
Table 2. This literature review examines several key studies that employ different machine-learning techniques to enhance the prediction accuracy of IPO outcomes.
The random forest algorithm emerges as a prominent method in several studies due to its robustness and high predictive accuracy. Baba and Sevil [
1] emphasize the superiority of the random forest algorithm over traditional linear regression models in predicting IPO initial returns on Borsa Istanbul. They attribute this to the algorithm’s ability to handle outliers effectively, with key predictors including IPO proceeds and trading volume. Similarly, Quintana et al. [
11] benchmark random forests against eight traditional machine-learning algorithms, finding that random forests outperform others in terms of mean and median predictive accuracy and exhibit the second smallest error variance. Emidi and Galán [
12] also find that random forests achieve the highest performance (accuracy of 71%) in predicting IPO outcomes based on prospectus content. In contrast, Dhini and Sondakh [
13] and Munshi et al., [
3] compare random forests with other ensemble learning methods, such as gradient-boosted trees and XGBoost. While Dhini and Sondakh find no significant performance differences between random forest and gradient-boosted trees, Munshi et al. report that XGBoost achieves the highest average accuracy of 87.89%, compared to 80.25% for random forest.
Methods such as logistic regression and random forest algorithms were explored by Supsermpol et al. [
14], who concluded that random forest outperforms logistic regression in predicting post-IPO financial performance. This finding aligns with the results of Emidi and Galán [
12] and Ni [
15], who also observed superior performance of random forests over logistic regression in various contexts. However, Ni [
15] notes that logistic regression produces superior outcomes for predicting IPO performance in the Hong Kong stock market, with random forest also performing well, particularly for longer prediction horizons (Days 10, 20, and 30).
The exploration of advanced ML models reveals varied results. Sonsare et al. [
16] find that artificial neural networks (ANN) outperform other models, including random forest, with an accuracy of 68.11% in predicting IPO underperformance. In a similar vein, Neghab et al. [
17] demonstrate that tree-based models, particularly LightGBM, outperform other models in both regression and classification tasks, achieving an average F1 score of 82.3%. Neghab et al. [
18] introduce a non-linear approach using deep neural networks (DNN) and stochastic frontier analysis to estimate IPO pricing efficiency. Their DNN-based method identifies significant premarket underpricing, with IPO offer prices being, on average, 12.43% lower than the estimated maximum offer prices.
Machine-learning models offer several advantages in predicting IPO outcomes. They exhibit robustness to outliers, with models like random forests handling these anomalies more effectively than traditional linear regression models. Additionally, machine-learning models, including random forests, XGBoost, and artificial neural networks (ANNs), consistently demonstrate higher predictive accuracy compared to traditional statistical methods, with reported accuracy ranging from 68.11% for ANNs to 87.89% for XGBoost for IPO performance predictions. These advanced models are also capable of capturing complex, non-linear relationships, providing deeper insights into the determinants of IPO performance. However, machine-learning models come with certain disadvantages. Their complexity and interpretability can be challenging, especially with advanced models like deep neural networks (DNNs), which are often less transparent than traditional models, complicating their practical application. Moreover, some machine-learning models, such as logistic regression and DNNs, require large datasets and extensive assumptions, which may not always be feasible. There is also a potential risk of overfitting, particularly with complex models like ANNs and ensemble methods, necessitating careful model validation and tuning to avoid this issue.
2.3. Determinants of IPO Underpricing and Performance
The literature on IPO underpricing identifies various determinants influencing initial and long-term performance, as shown in
Table 3. Lubis et al. [
19] focus on the Indonesian market, finding that inflation and interest rates significantly boost initial returns, while firm-specific factors like ROA, size, and age do not. Oliveira et al., [
4] emphasize informational asymmetry as a primary theory, highlighting underwriter and issuer reputations, corporate governance, and offering size as key determinants. Arora and Singh [
20] explore Indian SME IPOs, noting that factors such as issue size and oversubscription negatively affect long-run performance, whereas auditor reputation, underwriter reputation, and market conditions positively influence it. Kumar and Sahoo [
2] analyze the impact of anchor investor regulations in India, finding that anchor-backed IPOs underperform less severely in the long run, with offer size, grade, and promoter holding being significant variables. Hussein et al. [
21] investigate ChiNext IPOs, revealing that risk factors like ongoing litigation, policy changes, and capital expenditures significantly affect initial returns, indicating the importance of disclosed risk factors. Collectively, these studies underscore the multifaceted nature of IPO performance, shaped by macroeconomic conditions, firm characteristics, market perceptions, and regulatory environments.
2.4. Sentiment and Textual Analysis Approaches
Recent research has extensively explored the utility of sentiment and textual analysis in predicting IPO aftermarket performance, as shown in
Table 4. Ly and Nguyen [
22] demonstrate that sentiment analysis of IPO prospectuses can predict stock price movements with up to 9.6% greater accuracy than random chance, highlighting the significance of sentiment in short-term IPO performance. Chi and Li [
23] find that the readability of IPO prospectuses, assessed through a gradient boost decision tree model, significantly predicts IPO underpricing, suggesting that clearer prospectuses lead to less underpricing due to reduced information asymmetry. Katsafado et al. [
24] extend this analysis by incorporating both textual and financial data from S-1 filings, using various machine-learning algorithms to predict IPO underpricing. Their models show a 6.1% improvement in accuracy over financial-only models, with sophisticated approaches outperforming traditional methods. Zou et al. [
25] examine the impact of media coverage on IPOs in China, finding a negative relationship between media coverage and IPO underpricing, as well as heightened investor sensitivity to negative news. Fedorova et al. [
26] use advanced techniques like Latent Dirichlet Allocation (LDA) and BERT to analyze news sentiment and key topics, finding that media sentiment and specific themes significantly influence IPO underpricing.
Utilizing sentiment and textual analysis offers several key advantages. Enhanced predictive accuracy is achieved by incorporating textual data alongside financial metrics, providing a more comprehensive understanding of IPO dynamics. Additionally, improved readability and nuanced media coverage analysis help reduce information asymmetry, leading to more efficient market outcomes. Furthermore, techniques such as sentiment analysis and topic modeling provide advanced insights into investor sentiment and market trends, capturing elements that traditional financial metrics may overlook. These advantages underscore the growing importance of sentiment and textual analysis in financial research, particularly in the context of IPO performance prediction. Despite their advantages, utilizing sentiment and textual analysis for predicting IPO aftermarket performance faces challenges like inconsistent data quality, subjective text interpretation, complex and resource-intensive algorithms, overfitting risks, unpredictable investor behavior, varying media influence, practical integration difficulties, and regulatory and ethical concerns.
2.5. Data-Driven Approaches
Recent studies have explored various data-driven predictive models to forecast IPO aftermarket performance, as shown in
Table 5. Kang et al. [
27] examine the relationship between online search volumes and post-IPO stock returns, finding that lower pre-IPO search volumes correlate with higher post-IPO returns, suggesting that less pre-IPO attention may indicate undervaluation. Sorkhi and Paradi [
28] introduce a methodology combining Bayesian inference and Data Envelopment Analysis (DEA) to estimate the probability density function (PDF) of IPO stock prices in the short-term, addressing the challenge of predicting price uncertainty for firms with limited market history. Their approach iteratively updates prior beliefs using DEA to find comparable IPOs and Bayesian inference to refine the IPO’s prior PDF, validated through backtesting. Turpanov [
29] investigates the impact of optimistic analyst forecasts on long-run abnormal returns for South Korean IPOs, revealing an upward bias in earnings forecasts and a positive correlation between risk-adjusted returns and earnings forecast revisions, which supports the long-run underperformance hypothesis when controlling for risk.
Overall, while data-driven predictive models offer significant benefits in terms of accuracy and innovative data usage, they also present challenges related to data quality, model complexity, and inherent biases. These must be carefully managed to ensure reliable and effective IPO performance predictions.
After reviewing current methods, our approach will utilize ensemble learning classifiers, known for their superior accuracy and ability to handle non-linear interactions. Selection will focus on classifiers inherently capable of managing small datasets. Furthermore, we will apply several optimization techniques such as SMOTE, ANOVA F-value, and hyperparameter tuning. Lastly, determinants will be based on pre-listing prospectus characteristics and financial ratios, as these features are accessible to investors and decision-makers in advance.
4. Methodology
The framework aims to predict IPO underperformance within the first month post-listing based on investor risk preference.
Figure 1 illustrates that following typical data preprocessing steps such as cleansing, the framework first implements SMOTE, a widely used technique in data-driven decision-making for complex systems [
30]. SMOTE creates synthetic instances for the minority class by interpolating between existing examples rather than merely duplicating them. This process balances the class distribution, thereby mitigating model bias towards the majority class and enhancing overall performance metrics, which are essential for robust decision-making in imbalanced and data-scarce environments. Additionally, by introducing variability through synthetic samples, SMOTE reduces overfitting, leading to more generalizable and robust models capable of accurately predicting IPO underperformance in data-scarce environments.
Utilizing SMOTE can also enhance the effectiveness of the ANOVA F-value for feature selection, which is the next major step in our framework, by helping to attain the normality assumption required for ANOVA. By generating synthetic samples to balance the class distribution, SMOTE reduces the skewness and imbalance in the dataset, which in turn promotes a more normal distribution of data. To further validate feature importance, ANOVA F-value analysis is employed to assess variance among groups and determine statistical significance. If no significant variance is detected, it confirms the absence of a meaningful relationship between factors, preventing the inclusion of redundant or non-informative features and enhancing the credibility of our findings. A threshold of 0.2 is set to exclude less predictive features, improving model interpretability and preventing overfitting. This threshold was selected via a grid search over values from 0.1 to 0.9 in steps of 0.1, with 0.2 yielding the best accuracy. Consequently, the combination of SMOTE and the ANOVA F-value facilitates the identification of the most significant features, thereby improving the predictive performance and robustness of models, especially in scenarios involving small, imbalanced datasets such as those found in emerging financial markets.
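As an illustrative sketch of this step (assuming scikit-learn and imbalanced-learn, a NumPy feature matrix X, a binary target y, and one possible reading of the 0.2 cutoff as a threshold on min–max-normalized F-scores):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import f_classif

def balance_and_select(X, y, threshold=0.2, random_state=42):
    """Oversample the minority class with SMOTE, then keep features whose
    normalized ANOVA F-score exceeds the threshold (0.2 assumed here)."""
    X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X, y)

    # ANOVA F-value of each feature against the (now balanced) binary target.
    f_scores, _ = f_classif(X_bal, y_bal)

    # Min-max normalization so the cutoff is scale-free (an interpretive assumption).
    norm_scores = (f_scores - f_scores.min()) / (f_scores.max() - f_scores.min() + 1e-12)
    keep = norm_scores >= threshold

    return X_bal[:, keep], y_bal, keep
```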
Moreover, outlier analysis using the interquartile range (IQR) method confirmed that no significant outliers were present in the dataset. Additionally, Principal Component Analysis (PCA) was applied to manage feature correlations, reducing redundancy and enhancing the model’s stability.
Next, the framework splits the dataset into 80% for training and 20% for testing (single-split validation), providing a straightforward and efficient means to evaluate model performance on unseen data. The same training and test datasets are used across all models to ensure consistency in evaluation. This method allocates a substantial part of the data for training purposes yet retains sufficient data to effectively evaluate the model’s generalizability. Feature scaling is performed only on the training set after splitting and then applied to the test set to prevent data leakage. To increase the reliability of the assessment, the framework utilizes k-fold cross-validation (k = 10), designating nine folds for training and one fold for testing in each iteration. k-fold cross-validation is a standard practice in data-driven modeling, reducing overfitting risks and ensuring comprehensive performance evaluation across diverse data subsets. By averaging the results from all folds, k-fold validation offers a comprehensive and reliable measure of the model’s performance, which is particularly crucial for small datasets where variability can significantly impact outcomes.
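A minimal sketch of the splitting, post-split scaling, and cross-validation routine (scikit-learn assumed; X_sel and y_bal denote the balanced, feature-selected data from the previous sketch):

```python
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# 80/20 single split; the same split is reused for every model.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y_bal, test_size=0.2, random_state=42, stratify=y_bal)

# Fit the scaler on the training set only, then apply it to the test set.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 10-fold cross-validation on the training data (nine folds train, one validates).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(RandomForestClassifier(random_state=42),
                            X_train_s, y_train, cv=cv, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```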
The framework then starts training data on a set of thoroughly selected ensemble models known to handle small datasets more efficiently. Each model is trained on the same training dataset and subsequently evaluated on the test dataset, ensuring a uniform assessment protocol. The set includes Bagging Classifier (BC), Random Forest (RF), AdaBoost (Ada), Gradient Boosting (GB), XGBoost (XG), Stacking Classifier (SC), and Extra Trees (ET), with Decision Tree (DT) as a base estimator. Ensemble models are particularly effective for small datasets as they combine the strengths of multiple learning algorithms to improve predictive performance and robustness. By aggregating the outputs of various base learners, ensemble methods reduce the risk of overfitting and enhance generalization, which is crucial when dealing with limited data. In addition to ensemble learning models, we also evaluate deep-learning-based approaches such as Multi-Layer Perceptron (MLP), TabNet, and Artificial Neural Networks (ANN). Deep-learning models offer the advantage of automatically learning complex representations from raw data, potentially capturing intricate patterns that traditional ensemble models might overlook. However, deep-learning models typically require larger datasets for optimal performance and are more prone to overfitting when trained on limited data. To mitigate this issue, we employ regularization techniques, dropout layers, and hyperparameter tuning to enhance their generalization capability. Hyperparameter optimization plays a crucial role in enhancing model performance by fine-tuning parameters to achieve optimal results. Randomized Search is particularly effective for this task, as it efficiently navigates a broad range of hyperparameter values, identifying the best configurations without the prohibitive computational expense of grid search [
31]. This approach exemplifies a strategic balance between efficiency and accuracy, making it well-suited for computationally constrained environments. After selecting the optimal hyperparameters, the models were trained using these configurations to ensure peak performance.
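The candidate ensemble set can be sketched as follows (scikit-learn ≥ 1.2 and xgboost assumed; the hyperparameter values shown are placeholders that Randomized Search would subsequently tune, not the study’s final settings; the deep-learning baselines MLP, TabNet, and ANN are omitted for brevity):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier,
                              ExtraTreesClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Decision Tree serves as the base estimator for the bagging/boosting ensembles.
base_dt = DecisionTreeClassifier(max_depth=5, random_state=42)

models = {
    "BC":  BaggingClassifier(estimator=base_dt, n_estimators=100, random_state=42),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=42),
    "Ada": AdaBoostClassifier(estimator=base_dt, n_estimators=100, random_state=42),
    "GB":  GradientBoostingClassifier(random_state=42),
    "XG":  XGBClassifier(eval_metric="logloss", random_state=42),
    "ET":  ExtraTreesClassifier(n_estimators=200, random_state=42),
    "SC":  StackingClassifier(
        estimators=[("dt", base_dt),
                    ("rf", RandomForestClassifier(n_estimators=100, random_state=42))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
```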
Metrics such as accuracy, precision, recall, F1 scores, and area under the receiver operating characteristic curve (AUC) are used to evaluate and compare the efficacy of different models. Accuracy determines the overall correctness of the model by calculating the proportion of correct predictions out of the total number of predictions (i.e., accuracy in predicting underperformance). In situations where datasets are imbalanced, relying solely on accuracy might not provide a clear picture. Precision and recall provide deeper insights. Precision measures the fraction of true positive predictions within all positive predictions, showing how well the model avoids false positives, while recall quantifies the fraction of true positives detected among all actual positives, emphasizing the model’s capacity to identify true positives.
The F1 score, as the harmonic mean of precision and recall, mitigates the trade-offs between these two metrics, making it a crucial measure when both false negatives and false positives carry significant consequences. The AUC offers a comprehensive evaluation of the model’s performance at various classification thresholds, reflecting its ability to effectively differentiate between classes. These metrics collectively ensure a robust and data-driven evaluation framework for decision-making under uncertainty.
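For completeness, these metrics can be computed on the held-out test set as in the following sketch (scikit-learn assumed; model denotes any fitted classifier from the set above, and X_test_s, y_test come from the earlier splitting sketch):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = model.predict(X_test_s)
y_prob = model.predict_proba(X_test_s)[:, 1]   # predicted probability of underperformance

metrics = {
    "accuracy":  accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),  # avoids false alarms (fp)
    "recall":    recall_score(y_test, y_pred),     # avoids missed underperformers (fn)
    "f1":        f1_score(y_test, y_pred),
    "auc":       roc_auc_score(y_test, y_prob),
}
```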
At the heart of our proposed framework, we strategically propose a dynamic methodology—named Investor Preference Prediction Framework (IPPF)—to improve the decision-making process for IPO investments. Recognizing the underlying link between investing decisions and risk preferences, the framework acknowledges a wide range of investors, from risk-averse to risk-tolerant. It ranks the evaluated ensemble models based on investor risk preference, thereby tailoring investment strategies to individual risk profiles.
This dynamic methodology is particularly important in the context of IPO short-term underperformance prediction because the outcomes of such predictions can vary significantly based on different risk metrics. For example, a risk-tolerant investor seeks higher returns and wants to avoid false alarms about underperforming stocks. Hence, they value precision, which ensures most predicted underperformers are truly underperforming. On the other hand, a risk-averse investor prioritizes safety and wants to avoid missing any underperforming stocks. Hence, they value recall, which ensures most actual underperformers are correctly identified.
By employing IPPF, the framework dynamically adjusts the model evaluation criteria based on these risk preferences, ensuring that the selected model aligns with the investor’s risk tolerance. This adaptability enhances the relevance and applicability of the model’s predictions, as it allows investors to make more informed decisions that align with their risk appetite. Furthermore, this approach improves the overall robustness and flexibility of the prediction framework, making it more responsive to the diverse needs of different investors. In the volatile and unpredictable environment of IPO investments, such tailored decision-making support is crucial for optimizing returns and managing risks effectively.
The subsequent section will discuss each component of the proposed framework in further detail.
4.1. Class Imbalance, Feature Selection, and Hyperparameter Tuning
Various strategies have been implemented to enhance the proposed framework. One of the key approaches is the use of SMOTE, which effectively tackles the problem of class imbalance within our datasets. This technique generates synthetic samples for the underrepresented minority class, helping to balance the dataset [
30]. SMOTE creates synthetic examples along the line segments joining existing minority class instances. This strategy reduces bias toward the majority class and increases prediction accuracy in scenarios where instances belong to a minority class. The class distribution after applying SMOTE is shown in
Table 7.
The proposed framework employs the ANOVA F-value, a statistical measure that assesses the significance of the differences in means among multiple groups [
32]. In the machine-learning domain, the ANOVA F-value reveals important variables for predictive modeling by examining the variances of distinct classes. A higher ANOVA F-value indicates a more substantial impact of a factor on the target. Setting a threshold, such as 0.2 in this framework, eliminates less predictive characteristics, improving model interpretability and lowering the danger of overfitting.
Hyperparameter tuning is a crucial stage in enhancing machine-learning model performance. A hyperparameter tuning approach, Randomized Search, offers effective parameter space search techniques [
31]. In contrast to other hyperparameter tuning techniques, Randomized Search selects a random subset of configurations, reducing processing costs and time [
31]. This method offers a comprehensive search in the hyperparameter space for optimal settings that strike a reasonable balance between model variance and bias.
4.2. Random Search for Hyperparameter Optimization
In order to improve the performance of tree-based ensemble models and deep-learning models, we use Randomized Search Cross-Validation (RandomizedSearchCV) to perform hyperparameter tuning. Since random search explores a wide range of hyperparameter values to find optimal configurations, it is computationally efficient compared to exhaustive grid search. The algorithm optimizes hyperparameters based on multiple metrics, including accuracy, recall, precision, and F1 score, ensuring a balanced evaluation of model performance, particularly for handling imbalanced datasets. Each model is tuned based on the key parameters, such as the number of estimators, maximum depth, feature selection strategies, learning rate, etc., as shown in
Table 8.
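A hedged sketch of this tuning step is given below (scikit-learn and SciPy assumed; the search space is illustrative rather than the exact grid of Table 8, a single F1 scoring criterion is used for brevity, and X_train_s, y_train are as in the earlier splitting sketch):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative distributions over the key parameters named above.
param_distributions = {
    "n_estimators":  randint(50, 500),
    "max_depth":     randint(2, 10),
    "learning_rate": uniform(0.01, 0.3),
    "max_features":  ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50, cv=10, scoring="f1", n_jobs=-1, random_state=42)

search.fit(X_train_s, y_train)
best_gb = search.best_estimator_  # refit on the full training set (refit=True by default)
```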
4.3. Base Estimator
In machine learning, base estimators are essential as they serve as the foundational models for advanced ensemble methods. These initial models process the data first, and their outputs are typically integrated or improved using different ensemble strategies to boost the accuracy of predictions. The choice of base estimator significantly influences the overall effectiveness of the ensemble model. A well-chosen base estimator can capture essential patterns in the data, providing a robust foundation upon which ensemble methods can build. Among the various types of base estimators, decision trees have emerged as a popular and powerful choice due to their unique characteristics and adaptability.
The appeal of decision trees lies in their simplicity and interpretability [
33]. They provide a clear and intuitive representation of how decisions are made, which is valuable for understanding the underlying patterns in the data. Additionally, decision trees can handle both numerical and categorical data, making them versatile tools for diverse datasets. Their ability to capture non-linear relationships and interactions between features further enhances their utility in complex domains such as finance. As base estimators, decision trees are not only effective on their own but also serve as the cornerstone for more advanced ensemble methods like Random Forests and boosting algorithms. These ensemble techniques leverage the strengths of individual decision trees, combining them to produce models with improved accuracy and robustness. To summarize, DT is selected as the base estimator for several reasons:
Simplicity and interpretability: Decision trees are straightforward to understand and interpret, making them ideal for initial model building and understanding feature importance.
Handling non-linearity: DTs can capture non-linear relationships between features, which are common in financial datasets. This allows them to model complex interactions without requiring extensive data preprocessing or transformation.
Versatility: Decision trees can handle both numerical and categorical data, making them versatile tools in diverse datasets. This versatility is particularly valuable in financial data, where features can vary widely in type and scale.
Foundation for ensembles: As a base estimator, decision trees form the foundation for more complex ensemble methods like Random Forests and Boosting algorithms. Their ability to be combined into ensembles allows for improved predictive performance and robustness.
Efficiency: DTs are relatively fast to train and evaluate, which is crucial when conducting multiple iterations of model training and hyperparameter tuning, such as in Randomized Search and cross-validation processes.
4.4. Selection of Ensemble and Deep-Learning Classifiers
The choice of specific ensemble classifiers in this research is driven by their proven effectiveness in handling small and imbalanced datasets, as well as their ability to model complex relationships in financial data. The ensemble methods selected—BC, RF, Ada, GB, XG, SC, and ET—each bring unique strengths to the framework, enhancing predictive reliability and robustness in various market conditions.
Bagging Classifier: Bagging helps to lower variance and prevent overfitting by creating multiple models, each trained on distinct subsets of the dataset. This technique is especially beneficial in environments with small datasets, where the risk of overfitting is typically higher.
Random Forest: RF, which builds on the concept of bagging, constructs several decision trees and combines their outputs. It is renowned for its strong performance and capability to manage data with many dimensions, making it a suitable option for predicting IPO performance, particularly when dealing with complex interactions among numerous features.
AdaBoost: AdaBoost, a type of boosting approach, trains models sequentially, each time concentrating on the instances that were previously misclassified. Its capability to learn from mistakes and adjust accordingly makes it effective for enhancing results on imbalanced datasets, a key factor in accurately predicting IPO underperformance.
Gradient Boosting and XGBoost: Both GB and XG are powerful boosting methods that build models sequentially, each new model correcting the errors of its predecessor. XGBoost, in particular, is optimized for speed and performance, making it highly effective for large-scale data analysis. These methods are chosen for their ability to capture subtle patterns and interactions within the data.
Stacking Classifier: Stacking leverages the strengths of multiple base models by combining their predictions using a metamodel. This approach is selected for its ability to synthesize diverse model insights, thereby improving prediction accuracy and robustness.
Extra Trees: ET is similar to RF but differs in the way it splits nodes. It uses random splits rather than the best splits, which can lead to lower variance and improved generalization on small datasets. ET is chosen for its efficiency and effectiveness in handling diverse feature sets.
Multi-Layer Perceptron (MLP): MLP is a deep-learning-based artificial neural network that consists of multiple hidden layers with non-linear activation functions. It is particularly effective in capturing intricate relationships within data and learning hierarchical representations. MLP is well-suited for IPO performance prediction as it can uncover deep feature interactions that traditional machine-learning models may overlook. However, careful tuning of hyperparameters, such as the number of layers, neurons, and learning rate, is required to achieve optimal performance.
TabNet: TabNet is a deep-learning model specifically designed for tabular data, leveraging attention mechanisms to perform feature selection dynamically during training. Unlike traditional tree-based methods, TabNet allows for interpretability by identifying which features contribute the most to predictions. Its ability to focus on relevant aspects of the dataset makes it a promising candidate for IPO forecasting, especially when feature importance is a crucial aspect of decision-making.
Artificial Neural Network (ANN): ANN is a versatile deep-learning architecture composed of interconnected layers of neurons that learn patterns through backpropagation. It is particularly powerful for modeling complex and non-linear relationships within financial data. When applied to IPO performance prediction, ANN can effectively capture interactions between financial indicators and company-specific attributes, providing a robust alternative to conventional machine-learning approaches. Its effectiveness depends on factors such as network depth, activation functions, and regularization techniques.
The selected models were chosen for their ability to handle structured data, robustness against overfitting, and strong generalization capabilities, as shown in
Table 9. Unlike heuristic methods like Binary Ant Colony Optimization (BACO), which focuses on feature selection and optimization [
34], ensemble models provide superior classification accuracy and stability by leveraging multiple learners. Additionally, boosting techniques such as AdaBoost, Gradient Boosting, and XGBoost offer advantages in improving weak learners, while stacking enhances predictive performance by combining multiple models.
4.5. Validation and Evaluation Metrics
In this study, we utilize both single train/test split and k-fold cross-validation methods. The effectiveness of our classifiers is evaluated using metrics such as accuracy, precision, recall, and F1 score, which are commonly employed in classification tasks to provide a detailed assessment of model performance. These metrics are calculated based on the confusion matrix, as shown in
Table 10. The confusion matrix is built on the following values:
True Positives (tp)—correctly predicted underperforming stocks. In our context, a true positive occurs when the model predicts that a stock will underperform (positive prediction), and the actual outcome is that the stock underperformed (positive label). This means that the model correctly identified and predicted stocks that experienced underperformance.
True Negatives (tn)—correctly predicted non-underperforming stocks. In this paper, a true negative occurs when the model predicts that a stock will not underperform (negative prediction), and the actual outcome is that the stock did not underperform (negative label). This means that the model correctly identified and predicted stocks that did not underperform (or performed well).
False Positives (fp)—incorrectly predicted underperforming stocks that did not actually underperform; these are also known as Type I errors. A false positive occurs when the model predicts that a stock will underperform (positive prediction) when the actual outcome is that the stock did not underperform (negative label). This means that the model made an incorrect positive prediction, indicating that the stock would experience underperformance when this was not actually the case. For risk-tolerant investors who are making decisions that should maximize profits, this represents a possible missed opportunity.
False Negatives (fn)—incorrectly predicted non-underperforming stocks that actually underperformed; these are also known as Type II errors. A false negative occurs when the model predicts that a stock will not underperform (negative prediction) when the actual outcome is that the stock underperforms (positive label). In our context, this means that the model made an incorrect negative prediction, failing to identify that the stock would experience underperformance when this was actually the case. For risk-averse investors who are making decisions that should limit loss, this represents a higher risk of actual losses.
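As a small illustration (scikit-learn assumed, with y_test and y_pred as in the earlier metric sketch), these four quantities can be read directly from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual labels, columns are predictions; label 1 = underperformed.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"tp={tp}  tn={tn}  fp={fp} (Type I)  fn={fn} (Type II)")
```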
In order to efficiently assess and compare the performance of IPO underperformance prediction models, this study employs a number of key performance metrics that capture different dimensions of model robustness and suitability to different investor risk preferences. Since these metrics concern trade-offs between false positives and false negatives, they allow us to understand how well a model balances these two types of error while generating accurate predictions.
Accuracy (Ac): Accuracy measures the overall correctness of the predictions, or the proportion of total correct predictions. It is calculated as follows:
Ac = (tp + tn) / (tp + tn + fp + fn)
While a high accuracy score means that a model correctly predicts many instances, it fails to specify the types of prediction errors (false positives vs. false negatives) or account for how classes are distributed (such as underperforming vs. non-underperforming stocks). In situations where there is a prediction of IPO underperformance, there is often a class imbalance, with one class (like underperforming stocks) being much rarer than the other (non-underperforming stocks). Under these conditions, a model could reach a high accuracy simply by predominantly predicting the more frequent class. For instance, in a scenario where 5% of stocks underperform, and 95% perform well, a model predicting “non-underperforming” for all cases would achieve 95% accuracy, yet it would be utterly ineffective in recognizing any underperforming stocks, resulting in no true positives.
Accuracy can be misleading because it ignores class proportions, hides critical errors, and lacks specificity. It does not account for the imbalance between classes, leading to a false sense of performance when the minority class is rarely predicted. For risk-averse investors who prioritize avoiding losses, high accuracy does not necessarily mean low false negatives. A model can achieve high accuracy by correctly identifying a large number of non-underperforming stocks (true negatives) while still missing many underperforming ones (false negatives). Moreover, accuracy alone does not provide insight into how well the model identifies underperforming stocks (recall) or the reliability of its positive predictions (precision). Metrics like recall and precision should be considered to obtain a true picture of the model’s performance in predicting IPO underperformance.
Recall (Re): Recall, also known as sensitivity, measures the model’s ability to correctly identify actual positives (stocks that fell below their IPO price). It is calculated as follows:
Re = tp / (tp + fn)
Risk-averse investors, who aim to limit losses, would prefer recall. Recall focuses on the proportion of true positives among all actual positives, minimizing false negatives. This ensures that most underperforming stocks are correctly identified, reducing the risk of missing out on stocks that would lead to losses. This aligns with their preference for avoiding the risk of actual losses by ensuring that underperformance is detected whenever it occurs.
Precision (Pr): Precision measures the proportion of positive identifications that were actually correct, or the proportion of correctly predicted positive instances out of all instances predicted as positive by the model. It is calculated as follows:
Pr = tp / (tp + fp)
Risk-tolerant investors who aim to maximize profits and can accept more risk would prefer precision. Precision measures the proportion of true positives among all positive predictions, minimizing false positives. This ensures that when the model predicts underperformance, it is highly likely to be correct, thus avoiding missed opportunities due to incorrect predictions of underperformance. This aligns with their preference for maximizing returns by accurately identifying stocks that are unlikely to underperform.
F1 Score (F1): The F1 score is the harmonic mean of precision and recall, providing a balance between them. It is calculated and simplified as follows:
F1 = 2 × (Pr × Re) / (Pr + Re) = 2tp / (2tp + fp + fn)
The F1 score differs from other statistical measures, such as the F-statistic, which is used in hypothesis testing. The F1 score is instrumental in balancing precision and recall. For risk-averse investors, a high F1 score indicates a model that effectively identifies stocks likely to decrease in value (thus should be avoided) while minimizing the incorrect labeling of stocks as risky (thus not missing out on potential gains).
A general or broader form of the F1 score is the so-called Fβ score [35], and it is formulated as follows:
Fβ = (1 + β²) × Pr × Re / (β² × Pr + Re)
where β is a positive real constant that allows unequal weighting between precision and recall. When β = 1, precision and recall are evenly balanced, leading to the regular F1 score. β < 1 favors precision, while β > 1 favors recall.
The decision-making process about investments in IPOs is fundamentally tied to an investor’s risk preference. Investors typically span a spectrum from risk-averse, favoring the minimization of the probability of loss, to risk-tolerant, often willing to accept higher probabilities of loss for the potential of greater returns. Thus, in our study of predicting IPO underperformance based on the preference of the investor, the infinite space for β would make it more difficult to assess or quantify an investor’s risk. We, therefore, define a new parameter r, called the risk preference factor, which takes any real number between 0 (favoring recall, i.e., risk-averse investors) and 1 (favoring precision, i.e., risk-tolerant investors). The real constant β can then be redefined as a function of r, ensuring consistency with the following interpretation:
The investor’s risk preference now determines β. For a risk-averse investor, values of r close to 0 (β > 1) would favor recall, whereas values of r close to 1 (β < 1) would emphasize precision. Finally, r = 0.5 (β = 1) represents completely balanced precision and recall, leading back to the F1 score. We can also use any real number in between as per the investor’s risk bias. As a result, the model selection guarantees that the score appropriately reflects the trade-off between fn and fp, as evaluated by various investor profiles. We call this method IPPF, a systematic strategy for balancing Pr (the significance of avoiding fp predictions) and Re (the value of avoiding fn predictions) to evaluate class-based machine-learning models for modeling investor risk in investment decisions.
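To illustrate, the following sketch computes a risk-adjusted Fβ score from a given risk preference factor r. The mapping β = (1 − r) / r is an assumed parameterization consistent with the interpretation above (r close to 0 yields β > 1, and r = 0.5 yields β = 1); it is not necessarily the exact functional form adopted in the original formulation:

```python
def f_beta_from_risk(precision, recall, r, eps=1e-9):
    """Risk-adjusted F-beta score.

    r close to 0 -> beta > 1 -> recall-oriented (risk-averse investor);
    r close to 1 -> beta < 1 -> precision-oriented (risk-tolerant investor).
    The mapping beta = (1 - r) / r is an illustrative assumption.
    """
    r = min(max(r, eps), 1 - eps)            # keep beta finite and positive
    beta_sq = ((1 - r) / r) ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + eps)

# Example: a model with precision 0.80 and recall 0.60, evaluated for a
# risk-averse investor (r = 0.2) and a risk-tolerant investor (r = 0.8).
print(f_beta_from_risk(0.80, 0.60, r=0.2))   # weights recall more heavily (~0.61)
print(f_beta_from_risk(0.80, 0.60, r=0.8))   # weights precision more heavily (~0.78)
```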
We can also assess the discrepancy of a certain model in response to investor risk preference by drawing a straight line between Pr and Re utilizing the risk preference factor r as follows:
f(r) = (1 − r) × Re + r × Pr
The slope of this straight line is mathematically the difference between Pr and Re, derived as follows:
slope = Pr − Re
The absolute value of this slope (Equation (8)) would range between 0, indicating a robust model, or a model that is insensitive to investor risk preference, and 1, indicating a fragile model that is highly sensitive to investor risk preference. This measures the discrepancy between precision and recall.
If we do not know the investor risk preference, we would prefer a model that minimizes the absolute difference between Pr and Re, leading us to the following objective:
min |Pr − Re|
Thus, we would want a model that has a maximum Fβ score, but we can penalize models that have a high absolute difference |Pr − Re|, leading to the following measure of robustness:
Fβ / |Pr − Re|
This measure, however, becomes problematic when Pr = Re, leading to an undefined outcome. Thus, we would simply use the geometric mean [36] to measure the discrepancy between precision and recall as follows:
√(Pr × Re)
This measure naturally penalizes large discrepancies between precision and recall. The model selection criterion, named the robustness ratio, becomes the following:
We can now assess the best models across investors’ preference levels to find the best model when exact risk preference is unknown with the following measure:
where the robustness ratio is computed for model i at risk level j, n is the number of models, and m is the number of risk levels. The binary decision variable is equal to 1 if model i is selected for risk level j; otherwise, it is 0.
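As a schematic illustration of this selection step, the sketch below scores each candidate model at a grid of risk levels and records the best model per level. Because the exact robustness-ratio formula is not reproduced here, the score combines the risk-adjusted Fβ (from the earlier sketch) with the geometric mean √(Pr × Re) purely as an assumed stand-in:

```python
import numpy as np

def select_models_by_risk(model_metrics, risk_levels=np.linspace(0.1, 0.9, 9)):
    """model_metrics maps a model name to its (precision, recall) on the test set.
    Returns the best model at each risk level under an assumed robustness-style score."""
    selections = {}
    for r in risk_levels:
        scores = {}
        for name, (pr, re) in model_metrics.items():
            fb = f_beta_from_risk(pr, re, r)   # risk-adjusted F-beta (earlier sketch)
            gm = np.sqrt(pr * re)              # geometric mean penalizes Pr/Re discrepancy
            scores[name] = fb * gm             # illustrative combination (assumption)
        selections[round(float(r), 2)] = max(scores, key=scores.get)
    return selections

# Example with hypothetical test-set precision/recall values.
print(select_models_by_risk({"BC": (0.86, 0.83), "ET": (0.84, 0.90), "XG": (0.78, 0.75)}))
```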
In summary, to find the best tree-based ensemble model for predicting IPO underperformance, the IPPF technique evaluates models through the lens of the investor’s risk preferences. We identify the most appropriate model by analyzing each model’s recall and precision over the risk preference continuum, balancing these metrics based on the investor’s risk tolerance.
Receiver Operating Characteristic (ROC): The ROC curve serves as a method to assess the performance of binary classification systems by displaying the compromise between the true positive rate and the false positive rate at different threshold levels. The area under the curve (AUC) summarizes these data into a single number. A perfect prediction is denoted by an AUC of 1, while an AUC of 0.5 indicates a performance no better than random guessing. Models with AUC values approaching 1 are considered more accurate, and a higher AUC value signifies better overall model performance.
4.6. Investor Preference Prediction Framework (IPPF)
IPPF is built around risk sensitivity, which is critical in financial decision-making. By quantitatively measuring risk preference from 0 (risk-averse) to 1 (risk-tolerant), our methodology provides a systematic strategy for balancing precision (the significance of avoiding false positives) and recall (the value of avoiding false negatives) in our prediction models.
The framework combines these ensemble methods with decision trees as base estimators, leveraging the strengths of each method to better predict outcomes on a small, imbalanced dataset. This strategic selection of models aims to address the particular challenges of IPO underperformance prediction in emerging markets, ensuring that the models fit the individual preferences of investors and help them make appropriate investment decisions.
4.7. Algorithm of Proposed Framework
Let X ∈ ℝ^(n×p) denote the feature matrix for n samples and p features, and let y be the target vector. The set Θ comprises a finite collection of learning algorithms considered for model training. The scalar α signifies the predetermined threshold for feature selection based on performance metrics. We define M as the collection of models obtained from training the algorithms in Θ on the dataset (X, y). The model yielding the highest performance according to predefined criteria is represented as Mbest. SMOTE is then applied to y for data balancing. The functions S, F, R, and P correspond to standard scaling, feature selection (using the ANOVA F-value), hyperparameter optimization (using Randomized Search), and performance evaluation, respectively. Single-split and k-fold cross-validation strategies are encapsulated by SS and CV, respectively. A sensitivity analysis over the investor’s risk level (r) is conducted utilizing IPPF by calculating the adjusted Fβ score.
The algorithm of the proposed framework is shown in Algorithm 1:
Algorithm 1 Proposed Framework with IPPF
Input: Feature matrix X, target vector y, set of algorithms Θ, threshold α, investor’s risk preference r
Output: Best performing model Mbest, performance metrics
procedure EvaluateModels(X, y, Θ, α, r)
  Xscaled ← S(X)
  Apply SMOTE to balance classes in (Xscaled, y)
  Split (Xscaled, y) into training (Xtrain, ytrain) and testing (Xtest, ytest) using a single split (SS)
  Xselected ← F(Xtrain) ▷ ANOVA F-value feature selection with threshold α
  Initialize an empty list Results
  for each θ in Θ do
    Perform k-fold cross-validation (CV) on Xselected, ytrain
    M ← hyperparameter tuning by Randomized Search R using θ on Xselected, ytrain with SS/CV
    Metrics ← evaluate M on Xtest, ytest after final model training on the entire Xselected, ytrain
    Append (M, Metrics) to Results
  end for
  Mbest ← P(M) ▷ select the model with the best Metrics from Results
  Calculate fβ for each model using the investor’s risk preference r
  Update Mbest based on fβ and r using IPPF
  return Mbest and its Metrics
end procedure
5. Results and Discussion
The Results and Discussion section provides an overall evaluation of the models’ predictive ability, using both single-split training and 10-fold cross-validation approaches. The analysis includes an in-depth examination of the confusion matrices and corresponding results obtained during both the training and testing phases, providing essential insights into each model’s efficacy in correctly identifying true positives, true negatives, false positives, and false negatives. The examination of single-split training results explains the models’ behavior on the training set and also evaluates their generalization to new, unseen data. Furthermore, this section explores the findings of 10-fold cross-validation, which indicate how consistent and resilient the models remain across different regions of the dataset.
5.1. Models Training
In the training phase, the ensemble models performed well across various metrics. The DT model showed an exceptional true negative rate of 53.57% and achieved a solid balance in counts for false negatives (3.57%) and true positive outcomes (42.86%). RF follows the same trend, with high true negative and positive rates of 51.79% and 42.86%, respectively. The true negative rates for BC, AdaBoost Classifier, GB Classifier, and XGBoost are around 53.57%. Most notably, the ET Classifier demonstrated a significantly different trend, achieving 46.43% true positive and 48.21% true negative rates simultaneously. The SC presented a balanced performance, achieving a true negative rate of 50.00% while effectively managing false positives and false negatives.
Deep-learning models exhibited varied performance, with MLP, TabNet, and ANN showing different strengths and weaknesses. MLP achieved a moderate true positive rate of 37.5% but struggled with a higher false positive rate of 10.7%. TabNet, leveraging attention-based learning, showed a lower true negative rate (35.7%) and the highest false negative rate (21.4%), indicating difficulty in correctly identifying positive instances. ANN performed similarly to ET in terms of true negatives (48.21%) but had the highest false negative rate (26.79%), indicating a struggle in correctly classifying positive cases. The confusion matrix details of trained models are provided in
Table 11.
The training results show that ensemble models perform well across all evaluation metrics. The BC, AdaBoost Classifier, and GB Classifier all achieve 100% accuracy, precision, recall, and F1 score. It shows that these models made flawless predictions on the training data, accurately capturing both positive and negative cases. The DT, RF, XGBoost Classifier, and ET Classifier have somewhat lower accuracy, but their performance is exceptional, particularly regarding recall for detecting true positive cases. The SC’s 93% accuracy, balanced recall, precision, and F1 scores show that it can effectively combine diverse base models.
In contrast, deep-learning models exhibited varied performances. MLP achieved 80% accuracy with balanced recall (80%) and F1 scores (79%), demonstrating its ability to capture underlying patterns but with some misclassifications. TabNet performed the worst among all models, achieving only 61% accuracy with a recall of 54%, indicating its difficulty in distinguishing true positive cases. ANN also struggled, with an accuracy of 67% and the lowest recall of 42%, suggesting challenges in capturing positive cases effectively despite a relatively high precision of 79%. The detailed evaluation metrics of trained models are provided in
Table 12.
5.2. Models Testing
The testing confusion matrix provides insights into the models’ generalization abilities by demonstrating how well they function on untested data. Notably, the BC, AdaBoost Classifier, and GB Classifier exhibit consistent and impressive results across TN%, FP%, FN%, and TP%. These models strike an outstanding balance between avoiding false positives and false negatives, which is crucial for accurate predictions. The highest TP% (57%) and TN% (35.71%) are found in the ET Classifier, indicating its proficiency in correctly predicting instances. The DT and XGBoost Classifier, although having lower TN% and TP%, display a very balanced performance. The SC, representing the union of models, demonstrates robustness with 28.57% TN% and 50.00% TP%, highlighting its ability to synthesize predictions effectively on the testing data.
Deep-learning models exhibit mixed generalization performance. MLP achieves a TN% of 35.7% with a balanced TP% of 35.7%, demonstrating its ability to generalize moderately well but with a higher FN% (28.6%), indicating misclassifications in positive cases. TabNet struggles with a low TP% of 21.4% and the highest FN% (42.9%), suggesting difficulty in correctly predicting positive instances. ANN performs similarly, with a TP% of 21.43% and an FN% of 42.86%, highlighting its limitations in capturing true positives. The confusion matrix details of tested models are provided in
Table 13.
The models’ evaluation is critical for determining their practical usefulness in forecasting whether a newly listed stock will trade below its IPO price within one month. The BC and ET Classifiers emerge as standout performers with high accuracy (86%) and balanced recall, precision, and F1 scores. The SC, representing a combination of diverse models, achieves a commendable 79% accuracy and exhibits balanced recall, precision, and F1 scores. While the DT and AdaBoost Classifier achieve moderate accuracy (71%), their recall, precision, and F1 scores suggest room for improvement, especially in avoiding false negatives and maintaining precision.
Deep-learning models show mixed performance in this evaluation. MLP achieves 71% accuracy with a high precision of 100%, but its recall (56%) suggests that it struggles with capturing true positive cases, leading to a less balanced performance. TabNet performs the weakest, with only 50% accuracy and a recall of 33%, highlighting its difficulty in correctly predicting stocks trading below their IPO price. ANN performs slightly better than TabNet, with 57% accuracy; its 100% precision means the positive cases it does flag are correct, but its 33% recall shows that it misses most of them. The detailed evaluation metrics of tested models are provided in
Table 14.
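For completeness, the sketch below illustrates how the confusion-matrix percentages in Table 13 and the evaluation metrics in Table 14 can be derived. It assumes a standard scikit-learn workflow with placeholder arrays `y_test` and `y_pred`; it is an illustrative reconstruction, not the authors' original code.

```python
# Illustrative sketch (not the authors' code): deriving the TN%/FP%/FN%/TP%
# shares and the accuracy/precision/recall/F1 metrics reported in the tables
# from a fitted classifier's test-set predictions.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

def summarize(y_test, y_pred):
    # Confusion matrix expressed as percentages of all test samples,
    # matching the TN%/FP%/FN%/TP% convention used in the text.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    total = tn + fp + fn + tp
    shares = {name: 100.0 * count / total
              for name, count in zip(("TN%", "FP%", "FN%", "TP%"),
                                      (tn, fp, fn, tp))}
    metrics = {
        "accuracy":  accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall":    recall_score(y_test, y_pred),
        "f1":        f1_score(y_test, y_pred),
    }
    return shares, metrics
```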
The above discussion shows that BC is the best performer, with high accuracy and precision demonstrating its ability to correctly identify positive cases while keeping false positives to a minimum. The ET Classifier also performs admirably, with excellent accuracy, recall, precision, and F1 scores. These two models stand apart from the others in their ability to predict whether a newly listed stock will trade below its IPO price within one month of listing. The XGBoost Classifier displays competitive numbers but falls somewhat short of the top performers. In contrast, the DT model underperforms across evaluation metrics, suggesting limits in its capacity to represent the complexity of IPO price fluctuations.
BC and SC perform well because bagging trains individual models on slightly varied bootstrap samples, creating a diverse ensemble. This strategy proved valuable in mitigating the risk of overfitting the limited dataset associated with IPO predictions. The ensemble’s predictions, aggregated through averaging or majority voting, contribute to more stable and reliable outcomes, which is crucial in scenarios with small-scale data. The ET Classifier, another ensemble method in the suite, also excels in addressing challenges related to variance and overfitting, which are common concerns in the context of limited data. Its randomized split selection and use of the entire learning sample result in de-correlated decision trees, effectively reducing variance and improving the model’s ability to capture underlying patterns in the data.
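As a minimal configuration sketch of the two ensembles discussed above, the snippet below uses scikit-learn with illustrative settings; the actual hyperparameters in the study were tuned via Randomized Search and are not reproduced here.

```python
# A minimal sketch, assuming scikit-learn; n_estimators and random_state
# are illustrative values, not the tuned ones from the study.
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each tree sees a bootstrap sample, and predictions are aggregated
# by majority voting, which stabilizes results on small IPO datasets.
# Note: in scikit-learn < 1.2 the argument is named base_estimator.
bc = BaggingClassifier(estimator=DecisionTreeClassifier(),
                       n_estimators=100, random_state=42)

# Extra Trees: split thresholds are chosen at random over the whole learning
# sample, producing de-correlated trees that reduce variance.
et = ExtraTreesClassifier(n_estimators=100, random_state=42)
```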
In contrast, deep-learning models exhibit varying levels of effectiveness in handling IPO predictions. MLP achieves moderate accuracy with high precision but struggles with recall, indicating its difficulty in identifying all relevant positive cases. This suggests that while MLP is effective at making confident predictions, it may not generalize well in capturing all variations in IPO performance. TabNet, despite its architectural advantages for tabular data, underperforms significantly, demonstrating the lowest accuracy and recall. This suggests that its feature representation may not align well with the stock market data structure. ANN, while slightly better than TabNet, also struggles to balance precision and recall, indicating challenges in effectively learning from the limited dataset.
5.3. 10-Fold Cross-Validation
In predicting stock performance following IPOs, applying a 10-fold cross-validation technique adds a crucial level of consistency to the assessment procedure. In this method, the dataset is divided into ten folds, nine of which are used to train the models and one to test them. Each fold is used as the validation data exactly once over the ten repetitions of this process. This approach allows for an overall evaluation of the models’ generalization abilities across distinct data portions, reducing the bias that may arise from a single split. For stakeholders in the financial sector, 10-fold cross-validation is a dependable indicator of the consistency and reliability of models for IPO stock prediction. The following indicators have been used to evaluate the models under 10-fold cross-validation: Mean, Median, Interquartile Range (IQR), First Quartile (Q1), Third Quartile (Q3), Whisker Low (WhisLo), Whisker High (WhisHi), and Fliers (outliers beyond Q1 − 1.5 × IQR or Q3 + 1.5 × IQR).
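As an illustration of this procedure, the following sketch computes the 10-fold scores and the boxplot-style summary statistics listed above. It assumes a scikit-learn-compatible model and placeholder arrays `X` and `y`; it is not the authors' original code.

```python
# Illustrative sketch of the 10-fold cross-validation summary statistics
# (mean, median, Q1, Q3, IQR, whiskers, fliers) described in the text.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def fold_summary(model, X, y, metric="accuracy"):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring=metric)

    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    whis_lo, whis_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Fliers: fold scores falling outside the whisker bounds.
    fliers = scores[(scores < whis_lo) | (scores > whis_hi)]

    return {"mean": scores.mean(), "median": median, "Q1": q1, "Q3": q3,
            "IQR": iqr, "WhisLo": whis_lo, "WhisHi": whis_hi,
            "fliers": fliers.tolist()}
```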
The models perform consistently across measures, with mean accuracy ranging from 57% to 70%. The BC and ET models demonstrate outstanding accuracy, with means of 70% and 69%, respectively. The recall values range from 63% to 78%, reflecting balanced sensitivity in forecasting stocks that fall below their IPO prices. Precision for positive predictions is typically good, ranging from 61% to 76%, with the bagging model achieving an impressive mean precision of 76%, emphasizing its ability to minimize false positives. Lastly, the F1 scores range between 60% and 69%, indicating well-rounded performance, with BC achieving the highest value.
Beyond accuracy, recall, precision, and F1 score, the analysis of additional metrics provides a more comprehensive understanding of each model’s performance in predicting stock movements in the IPO market. The DT has a consistent and equal distribution across metrics with a moderate IQR for accuracy, recall, precision, and F1 score. The RF model has a balanced distribution across all measures, indicating dependability in multiple aspects of prediction. AdaBoost Classifier, despite a lower IQR in accuracy, demonstrates robust performance in recall, precision, and F1 score. The GB Classifier, XGBoost Classifier, and SC all have varied degrees of IQR, demonstrating trade-offs between different measures. While the ET Classifier has an accuracy IQR of 0%, other metrics show significant variability, indicating distinct strengths and shortcomings.
Deep-learning models exhibit mixed results in 10-fold cross-validation. MLP performs competitively, achieving a mean accuracy of 67%, along with strong precision (77%) and well-balanced F1 scores (68%), making it a viable alternative for IPO predictions. However, TabNet struggles significantly, yielding the lowest accuracy (44%) and recall (27%), suggesting that its feature representation may not effectively capture the patterns necessary for stock movement prediction. ANN also underperforms with a 53% accuracy and a recall of 42%, indicating difficulties in learning meaningful patterns from the dataset. The lower performance of deep-learning models, particularly ANN and TabNet, suggests that traditional ensemble-based models may still be more suited for IPO prediction tasks where data availability is limited and interpretability is essential. The detailed results of the 10-fold cross-validation are shown in
Table 15 and
Table 16.
In the 10-fold cross-validation results, the BC is a good choice among the classifiers. It achieves the highest accuracy of 70%, indicating effectiveness in identifying stocks likely to decline while maintaining a balanced trade-off between recall and precision. The ET Classifier demonstrates the second-highest accuracy of 69%; however, the presence of fliers and a non-zero IQR implies that its efficacy may vary between folds. BC also records the highest mean F1 score, indicating a solid balance between precision and recall despite a slightly lower recall score.
On the other hand, the RF model performs poorly compared with ET and BC, particularly in recall, where it shows many fliers. Based on these results, risk-averse investors seeking a model that consistently identifies stocks at risk of falling below their IPO price with minimal false negatives should pick the AdaBoost Classifier. The BC and ET Classifier remain viable alternatives, especially for investors willing to accept somewhat lower recall in exchange for higher precision and fewer false positives.
Deep-learning models present a mixed picture in IPO stock prediction. The MLP model achieves a competitive accuracy of 67% and the highest precision of 77%, making it a strong contender for investors prioritizing precise predictions with minimal false positives. However, its slightly lower recall (64%) suggests some limitations in capturing all potential declining stocks. In contrast, TabNet significantly underperforms, with an accuracy of only 44% and a recall of 27%, indicating its struggle to extract meaningful patterns from IPO data. Similarly, ANN falls behind with a 53% accuracy and 42% recall, suggesting that deep-learning models may require more extensive feature engineering or larger datasets to perform optimally in this context. While MLP remains a viable choice for precision-focused investors, traditional ensemble models like BC and ET still offer better overall reliability, especially in handling small-scale IPO prediction tasks.
5.4. Receiver Operating Characteristic Curve (ROC)
In the context of this study, ROC analysis is an essential technique for assessing the binary classification performance of various machine-learning and deep-learning models. ROC curves are visual representations of each model’s ability to balance true positive rates and false positive rates across different classification thresholds, providing insights into their discriminative power in predicting whether a newly listed stock will fall below its Initial Public Offering (IPO) price within a month of trading.
The BC has the highest AUC (0.90), demonstrating the strongest ability to balance true positive and false positive rates across classification thresholds. The RF and ET Classifiers follow closely, with AUC values of 0.89, indicating strong performance in the study’s context. The SC shows consistent discriminative capability with an AUC of 0.83, whereas the GB Classifier obtains an AUC of 0.78, indicating noticeably weaker discriminating power. The AdaBoost and XGBoost Classifiers have lower AUC scores of 0.73 each, indicating weaker performance. Finally, the DT Classifier falls behind with a score of 0.69, indicating limits in its capacity to distinguish between positive and negative cases in this prediction task. Among deep-learning models, MLP achieved an AUC of 0.87, showing competitive performance close to RF and ET, while ANN obtained an AUC of 0.80, demonstrating moderate discriminative power. However, TabNet performed poorly with an AUC of 0.69, aligning with DT in its weaker ability to differentiate between cases. The ROC curve of all the classifiers is shown in
Figure 2.
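The sketch below shows how ROC curves such as those in Figure 2 can be produced for any fitted classifier that exposes predicted probabilities. The dictionary of fitted models and the test arrays are placeholders, not the authors' code.

```python
# Illustrative ROC/AUC plotting for a set of fitted classifiers.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(models, X_test, y_test):
    # models: e.g. {"BC": bc, "ET": et, "RF": rf, ...} (hypothetical names)
    for name, model in models.items():
        proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, proba)
        auc = roc_auc_score(y_test, proba)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```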
5.5. Comparison
In this study, we propose a new framework to predict whether a recently listed stock will trade below its IPO price within a month of trading using tree-based ensemble learning techniques. An important motivation for our research is the limited use of tree-based ensemble methods in existing research on stock market prediction. We validate our suggested framework by comparing the results of our best model with those of Ampomah and Nyame [
37], who also used tree-based ensemble approaches. Although their research used a different dataset, a comparative analysis of results remains meaningful because they applied the same tree-based ensemble methods. Further, this comparison also shows that our framework is not only efficient but also distinctive in delivering better results, notably for small datasets, an area underexplored in most studies.
As illustrated in
Figure 3, the Extra Trees (ET) Classifier in our study outperformed the ET Classifiers from Ampomah and Nyame [
37] across all evaluation metrics. Our ET Classifier achieved an accuracy of 86%, compared to 83.75% in their study. Notably, the recall improved from 81.25% to 88%, precision increased from 86.25% to 89%, and the F1 score rose from 83.69% to 88.48%. These results underscore the robustness and reliability of our framework, especially in predicting IPO underperformance in the short term.
Several factors may explain this performance gap. First, our framework incorporates advanced feature selection through ANOVA F-value and hyperparameter optimization using Randomized Search, which likely enhanced model accuracy and generalization. Additionally, the use of SMOTE to address class imbalance likely contributed to the improved recall, reflecting better sensitivity in identifying underperforming IPOs.
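A hedged sketch of this preprocessing-and-tuning pipeline is shown below, combining ANOVA F-value feature selection, SMOTE oversampling, and Randomized Search. The parameter ranges, the value of k, and the choice of Extra Trees as the tuned model are illustrative assumptions, not the values used in the study.

```python
# Illustrative pipeline: ANOVA F-value selection -> SMOTE -> ensemble model,
# tuned with Randomized Search. The imblearn pipeline applies SMOTE only
# when fitting on training folds, never on validation data.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),  # ANOVA F-value
    ("smote",  SMOTE(random_state=42)),                   # balance classes
    ("model",  ExtraTreesClassifier(random_state=42)),
])

# Hypothetical search space for illustration only.
param_dist = {
    "select__k": [5, 10, 15],
    "model__n_estimators": [100, 200, 500],
    "model__max_depth": [None, 5, 10],
}

search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=10,
                            scoring="f1", random_state=42)
# search.fit(X_train, y_train)  # X_train / y_train are placeholders
```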
5.6. Risk Sensitivity and Score Calculation Using IPPF
This study introduces the IPPF to explore the effect of different investors’ risk preferences on the outcome of the framework. It is applied in both single-split and 10-fold cross-validation to evaluate the risk sensitivity and robustness of the proposed approach. By adding the IPPF, we aim to enhance the decision-making process in complex financial systems, offering a more thorough understanding of the model’s performance across multiple data splits and ensuring its usefulness in capturing the dataset’s underlying patterns.
5.6.1. IPPF with Single-Split Validation
The results in
Table 17 and
Figure 4 indicate a clear shift in the chosen model when the risk level rises. The following is a discussion and interpretation based on the findings of single-split validation.
The ET Classifier outperforms the other models in the risk range of 0 to 0.5, as determined by the score. The score remains constant at 0.89, indicating that the model’s precision–recall balance does not change considerably within this risk range. Starting at a risk level of 0.6, the BC becomes the preferable model, with its score increasing with each rise in risk up to the maximum risk level of 1.0. The shift suggests that the Bagging Classifier is better suited to scenarios in which precision is increasingly essential, for example, when a false alarm (predicting a drop for a stock that subsequently rises above its IPO price) carries a high opportunity cost, a common concern for risk-tolerant investors. The transition from the ET Classifier to the BC at a risk threshold of 0.6 is an essential milestone in the application of the IPPF method: as the risk level grows, the BC is likely to outperform the ET Classifier in terms of precision, which becomes increasingly important as the focus shifts toward minimizing false positives.
5.6.2. IPPF with 10-Fold Cross-Validation
For risk levels 0 to 0.3, the AdaBoost Classifier is consistently selected. At lower risk levels, when misclassifying a negative event (a price drop below the IPO level) is penalized more heavily (lower r), this model performs best according to the IPPF score, as shown in
Figure 5. AdaBoost may be suitable for risk-averse investors, as its score decreases with increased risk. At risk level 0.4, the ET Classifier is preferred, indicating a fair trade-off at this stage.
This preference, however, is fleeting, as the BC takes the lead from risk level 0.5 onward, showing that it handles the increasing focus on precision better than the ET and AdaBoost Classifiers. From a risk level of 0.5 to 1.0, the BC’s IPPF score improves and it remains the preferred model, suggesting its relative strength in cases where a false alarm (and the resulting missed gain beyond the IPO level) is penalized more heavily. This steady choice demonstrates the model’s resilience in high-risk settings where precision becomes critical. The detailed risk levels with the calculated measure scores for 10-fold validation are shown in
Table 18.
Figure 6 shows the robustness ratios for both single-split and 10-fold validations for analysis.
The diagram in
Figure 6 demonstrates how the different models perform across investor risk profiles, from risk-averse (focused on recall only at r = 0) to risk-tolerant (focused on precision only at r = 1). The robustness ratio ρ shows how strongly a model’s score reacts as the emphasis shifts between complete detection (recall) and precise detection (precision). Lower ρ values mean a model performs consistently under all conditions, whereas higher ρ values indicate strong sensitivity to changes. ET proves to be the most dependable for all risk levels below 0.5, while BC takes over as the best model for risk-tolerant investors. Another interpretation is that AdaBoost should be selected only when the investor is known to be risk-averse, since it is less robust, with a higher ρ value. Also, if the risk preference of an investor is unknown, it is best to select either ET or BC for balanced and more robust results.
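The exact IPPF score and robustness ratio are defined in the methodology; purely as a minimal sketch, the snippet below assumes one plausible form in which the score interpolates linearly between recall at r = 0 and precision at r = 1, and the robustness ratio is taken as the relative spread of the score across risk levels. Both formulas are illustrative assumptions, not the paper's definitions.

```python
# A minimal sketch, NOT the paper's exact IPPF definition: a simple blend of
# recall and precision that reduces to recall at r = 0 and precision at r = 1,
# plus a spread-based robustness measure across risk levels (assumed forms).
import numpy as np

def risk_weighted_score(precision, recall, r):
    # r = 0 -> recall only (risk-averse); r = 1 -> precision only (risk-tolerant)
    return (1 - r) * recall + r * precision

def robustness_ratio(precision, recall, risk_levels=np.linspace(0, 1, 11)):
    scores = np.array([risk_weighted_score(precision, recall, r)
                       for r in risk_levels])
    # Smaller spread relative to the mean -> more robust across risk profiles.
    return (scores.max() - scores.min()) / scores.mean()
```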
6. Conclusions and Future Directions
This paper explores the challenging area of forecasting IPO results, particularly in overcoming the obstacles associated with insufficient data and class imbalance. The study significantly improves predictive accuracy by applying a tailored framework that integrates SMOTE for class balancing with ensemble learning methods. The ensemble suite contains a variety of classifiers, including DT, RF, BC, AdaBoost, GB, XGBoost, ET, and SC. The results show that the ET Classifier performs better than the other models in terms of accuracy and well-balanced recall, precision, and F1 scores in the single-split evaluation, while the BC achieves the highest accuracy of 70% with well-balanced recall, precision, and F1 scores in 10-fold validation. Among the deep-learning models, MLP performed best, achieving an accuracy of 67% and a precision of 77%, indicating its relative effectiveness in this context.
The proposed framework outperforms existing tree-based ensemble learning techniques in the single-split evaluation. This validation illustrates the impact of data-driven decision-making in complex systems and the improvements that ensemble approaches bring to predictive modeling for stock market prediction within our proposed framework.
Furthermore, the IPPF reveals insights into the dynamic nature of decision-making based on varying investor risk preferences. In single-split validation, the ET Classifier is resilient for investors with low to moderate risk tolerance, whereas the BC becomes preferable as risk tolerance increases because of its greater precision. This demonstrates the framework’s flexibility to varying investor preferences. Lastly, the AdaBoost Classifier performs well in 10-fold cross-validation for risk-averse investors but loses efficacy as risk tolerance increases. The ET Classifier strikes a balance at a moderate risk level, while the BC performs best for moderate to high risk tolerance. The dynamic change between these classifiers highlights the need to understand model behavior across a wide range of risk preferences and the relevance of cross-validation in creating stable and generalizable models.
While the current study yields noteworthy results and a promising conceptual framework, future research could explore additional dimensions. Firstly, analyzing the influence of feature engineering and including domain-specific financial metrics might improve the model’s predictive abilities. More advanced ensemble techniques, hybrid models integrating deep-learning algorithms, or different base estimators may yield further improvements. Additionally, future work will involve using a larger dataset to further validate the robustness and generalizability of the proposed framework. Lastly, exploring interpretability and explainability in the context of IPO prediction models would help foster trust and understanding among stakeholders, promoting data-driven decision-making in complex systems.