Abstract
The high dimensionality of data is a common problem in classification. This work investigates whether a small number of significant features is enough to classify data from two sample groups. Various feature selection and classification techniques are applied to a collection of four high-throughput DNA methylation microarray data sets. With accuracy as the performance metric, repeated 10-fold cross-validation is used to evaluate the different techniques. Combining the Signal-to-Noise Ratio (SNR) and Wilcoxon rank-sum test filter methods with Support Vector Machine-Recursive Feature Elimination (SVM-RFE) as an embedded method resulted in perfect performance. In addition, linear classifiers showed excellent results compared to other classifiers when applied to such data sets.
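To make the SNR filter step concrete, the following is a minimal sketch of per-feature SNR scoring for a two-group data set. It assumes the commonly used definition SNR = |mean1 - mean2| / (sd1 + sd2); the abstract does not state the exact formula, and the function names and toy values below are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def snr_scores(samples, labels):
    """Return one SNR score per feature for a two-class data set.

    samples: list of rows (one row per sample, one column per feature).
    labels:  list of 0/1 class labels, same length as samples.
    Assumed definition: |mean0 - mean1| / (sd0 + sd1).
    Each class needs at least two samples for the standard deviation.
    """
    group0 = [row for row, y in zip(samples, labels) if y == 0]
    group1 = [row for row, y in zip(samples, labels) if y == 1]
    scores = []
    for j in range(len(samples[0])):
        col0 = [row[j] for row in group0]
        col1 = [row[j] for row in group1]
        spread = stdev(col0) + stdev(col1)
        diff = abs(mean(col0) - mean(col1))
        # Guard against features that are constant within both groups.
        scores.append(diff / spread if spread > 0 else 0.0)
    return scores


def top_k_features(samples, labels, k):
    """Return the indices of the k features with the highest SNR score."""
    scores = snr_scores(samples, labels)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Hypothetical toy data: 6 samples, 3 features with methylation-like values.
# Only feature 0 separates the two groups clearly.
toy_X = [
    [0.10, 0.50, 0.30],
    [0.12, 0.40, 0.80],
    [0.11, 0.60, 0.20],
    [0.90, 0.50, 0.70],
    [0.88, 0.45, 0.25],
    [0.92, 0.55, 0.60],
]
toy_y = [0, 0, 0, 1, 1, 1]

toy_scores = snr_scores(toy_X, toy_y)
selected = top_k_features(toy_X, toy_y, k=1)
```

In a pipeline like the one the abstract describes, a filter of this kind would first reduce the feature set, and an embedded method such as SVM-RFE would then refine the reduced set. To avoid optimistic accuracy estimates, the ranking should be computed on the training portion of each cross-validation fold only.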
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Alkuhlani, A., Nassef, M., Farag, I. (2017). A Comparative Study of Feature Selection and Classification Techniques for High-Throughput DNA Methylation Data. In: Hassanien, A., Shaalan, K., Gaber, T., Azar, A., Tolba, M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016. AISI 2016. Advances in Intelligent Systems and Computing, vol 533. Springer, Cham. https://doi.org/10.1007/978-3-319-48308-5_76
DOI: https://doi.org/10.1007/978-3-319-48308-5_76
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48307-8
Online ISBN: 978-3-319-48308-5
eBook Packages: Engineering, Engineering (R0)