Abstract
Tumors are among the deadliest diseases, and their incidence is rising rapidly. Researchers worldwide are conducting extensive work on the diagnosis and discernment of tumors by employing machine learning algorithms and performing experiments on observations stored as datasets. Tumor-related datasets are high-dimensional and contain many genes, most of which are not prognostic; some are irrelevant or redundant. Here we propose a methodology, IG based on IWSSr with Random Forest (RF), which ranks genes by Information Gain, adds them incrementally inside a wrapper, and evaluates the importance of each candidate gene subset with RF; RF is also used as the final classifier. Experiments are performed on nine publicly available tumor-related datasets, with accuracy, confusion matrix, precision, recall, and F-measure as performance evaluators. The proposed methodology selects 3 of 2000 genes, 5 of 7129 genes, 3 of 7129 genes, 5 of 24,481 genes, 7 of 12,601 genes, 5 of 15,154 genes, 2 of 4026 genes, 5 of 12,582 genes, and 4 of 2308 genes, yielding accuracies of 88.71%, 71.67%, 98.61%, 79.38%, 93.60%, 99.60%, 92.42%, 95.83%, and 92.77% on the Colon, Central Nervous System, Leukemia, Breast Cancer, Lung Cancer, Ovarian Cancer, Lymphoma, MLL, and SRBCT datasets, respectively. Experimental results show that IG based on IWSSr(RF) outperforms state-of-the-art algorithms such as RF, Naïve Bayes, KNN, and Decision Tree, while incurring lower running time than these classification algorithms.
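To make the selection procedure concrete, the sketch below reconstructs the IG based on IWSSr(RF) loop in Python. It is an illustration under stated assumptions, not the authors' implementation: the function name ig_iwssr_rf, the 5-fold cross-validation protocol, the 100-tree forest, and the acceptance rule (keep a change only when it strictly improves cross-validated accuracy) are all assumptions, and Information Gain is approximated here by scikit-learn's mutual information estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score


def ig_iwssr_rf(X, y, cv=5, random_state=0):
    """Select a small gene subset: IG filter ranking, then an incremental
    wrapper with replacement, scored by RF cross-validation accuracy."""
    # Filter step: rank all genes by Information Gain (approximated by
    # estimated mutual information between each gene and the class label).
    ig = mutual_info_classif(X, y, random_state=random_state)
    ranking = np.argsort(ig)[::-1]

    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)

    def score(genes):
        # Wrapper criterion: mean CV accuracy of RF on the candidate subset.
        return cross_val_score(rf, X[:, genes], y, cv=cv).mean()

    selected = [ranking[0]]
    best = score(selected)

    # Wrapper step: scan the remaining genes in IG order. For each gene,
    # try (a) appending it and (b) swapping it in for each selected gene
    # (the "replacement" move of IWSSr); keep the best strictly improving move.
    for g in ranking[1:]:
        candidates = [selected + [g]] + [
            selected[:i] + [g] + selected[i + 1:] for i in range(len(selected))
        ]
        scores = [score(c) for c in candidates]
        i_best = int(np.argmax(scores))
        if scores[i_best] > best:
            best, selected = scores[i_best], candidates[i_best]

    return selected, best
```

Called as selected, acc = ig_iwssr_rf(X, y) on a gene-expression matrix X of shape (samples, genes), a loop of this form tends to terminate with very small subsets, consistent with the 2-to-7-gene subsets reported above; because only strictly improving moves are accepted, the selected set stays compact even when thousands of genes are scanned.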
Data availability
All data are publicly available from (Zhu et al. 2007).
References
Abe S (2005) In: 13th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, pp 27–29
Aghdam MH, Ghasem-Aghaee N, Basiri ME (2009) Text feature selection using ant colony optimization. Expert Syst Appl 36(3):6843–6853
Almuallim H, Dietterich TG (1994) Learning boolean concepts in the presence of many irrelevant features. Artif Intell 69(1–2):279–305
Bermejo P, de la Ossa L, Gámez JA, Puerta JM (2012) Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowl-Based Syst 25(1):35–44
Bomze IM, De Klerk E (2002) Solving standard quadratic optimization problems via linear, semidefinite and copositive programming. J Global Optim 24(2):163–185
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cernuda C, Lughofer E, Hintenaus P, Märzinger W (2014) Enhanced waveband selection in NIR spectra using enhanced genetic operators. J Chemometr 28(3):123–136
Chen X-W (2003) An improved branch and bound algorithm for feature selection. Pattern Recogn Lett 24(12):1925–1933
Cotter SF, Kreutz-Delgado K, Rao BD (2001) Backward sequential elimination for sparse vector subset selection. Signal Process 81(9):1849–1864
Debuse JC, Rayward-Smith VJ (1997) Feature subset selection within a simulated annealing data mining algorithm. J Intell Inf Syst 9(1):57–81
Prachi HM, Sharma P (2019) Intrusion detection using machine learning and feature selection. Int J Comput Netw Inf Secur 11(4):43–52
Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97(457):77–87
George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J Am Stat Assoc 88(423):881–889
Gheyas IA, Smith LS (2010) Feature subset selection in large dimensionality domains. Pattern Recogn 43(1):5–13
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hall MA, Holmes G (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 15(6):1437–1447
He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. Adv Neural Inf Process Syst 18
Kabir M, Shahjahan M, Murase K (2009) An efficient feature selection using ant colony optimization algorithm. In: International Conference on Neural Information Processing
Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the AAAI Conference on Artificial Intelligence
Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, de Schaetzen V, Duque R, Bersini H, Nowe A (2012) A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinf 9(4):1106–1119
Leardi R, Boggia R, Terrile M (1992) Genetic algorithms as a strategy for feature selection. J Chemom 6(5):267–281
Liu H, Zhou M, Liu Q (2019) An embedded feature selection method for imbalanced data classification. IEEE/CAA J Autom Sin 6(3):703–715
Lughofer E (2011) On-line incremental feature weighting in evolving fuzzy classifiers. Fuzzy Sets Syst 163(1):1–23
Mbaabu O (2022) Introduction to Random Forest in Machine Learning. https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
Mitchell TJ, Beauchamp JJ (1988) Bayesian variable selection in linear regression. J Am Stat Assoc 83(404):1023–1032
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
Ruiz R, Riquelme JC, Aguilar-Ruiz JS (2006) Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recogn 39(12):2383–2392
Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Sivagaminathan RK, Ramakrishnan S (2007) A hybrid approach for feature subset selection using neural networks and ant colony optimization. Expert Syst Appl 33(1):49–60
Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Pearson Education, New Delhi
Van’t Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, Van Der Kooy K, Marton MJ, Witteveen AT (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536
Wang J, Wu L, Kong J, Li Y, Zhang B (2013) Maximum weight and minimum redundancy: a novel framework for feature subset selection. Pattern Recogn 46(6):1616–1627
Yang J, Honavar V (1998) Feature subset selection using a genetic algorithm. IEEE Intell Syst Appl 13(2):44–49
Zhu Z, Ong Y-S, Dash M (2007) Markov blanket-embedded genetic algorithm for gene selection. Pattern Recogn 40(11):3236–3248
Cite this article
Fatima, A., Nazir, T., Nazir, A.K. et al. An efficient Incremental Wrapper-based Information Gain Gene Subset Selection (IG based on IWSSr) method for Tumor Discernment. Multimed Tools Appl 83, 64741–64766 (2024). https://doi.org/10.1007/s11042-023-18046-2