Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy
Figure 1. Performance analysis with five classifiers on 100 q parameters of three benchmark datasets (evaluation metric: ACC). (a) Benchmark D1—ACC; (b) Benchmark D2—ACC; (c) Benchmark D3—ACC.
Figure 2. Performance analysis with generalized entropies on 100 q parameters of three benchmark datasets (evaluation metric: ACC). (a) Benchmark D1—ACC; (b) Benchmark D2—ACC; (c) Benchmark D3—ACC.
Abstract
1. Introduction
- Question 1 (Q1): Are Tsallis entropy-based features robust for extracting information from biological sequences in classification problems?
- Question 2 (Q2): Does the entropic index affect the classification performance?
- Question 3 (Q3): Is Tsallis entropy as robust as Shannon entropy for extracting information from biological sequences?
2. Literature Review
3. Information Theory and Entropy
- Superextensive entropy (q < 1): for two independent systems A and B, S_q(A + B) > S_q(A) + S_q(B);
- Extensive entropy (q = 1): S_q(A + B) = S_q(A) + S_q(B), recovering the additivity of the Shannon entropy;
- Subextensive entropy (q > 1): S_q(A + B) < S_q(A) + S_q(B). These three regimes are checked numerically in the sketch below.
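To make these regimes concrete, the following minimal sketch (not part of the original article) computes Shannon, Tsallis, and Rényi entropies for discrete distributions and numerically checks the pseudo-additivity relation S_q(A + B) = S_q(A) + S_q(B) + (1 - q)S_q(A)S_q(B) for two independent systems. The natural logarithm and the toy distributions are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def shannon(p):
    """Shannon entropy H = -sum(p * log p); natural log, Boltzmann constant set to 1."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def tsallis(p, q):
    """Tsallis entropy S_q = (1 - sum(p^q)) / (q - 1); recovers Shannon as q -> 1."""
    if np.isclose(q, 1.0):
        return shannon(p)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def renyi(p, q):
    """Renyi entropy R_q = log(sum(p^q)) / (1 - q); also recovers Shannon as q -> 1."""
    if np.isclose(q, 1.0):
        return shannon(p)
    return np.log(np.sum(p ** q)) / (1.0 - q)

# Pseudo-additivity of Tsallis entropy for two independent distributions A and B:
#   S_q(A + B) = S_q(A) + S_q(B) + (1 - q) * S_q(A) * S_q(B)
A = np.array([0.5, 0.3, 0.2])
B = np.array([0.6, 0.4])
AB = np.outer(A, B).ravel()  # joint distribution of the independent pair (A, B)
for q in (0.5, 1.0, 2.0):    # superextensive, extensive, subextensive regimes
    joint = tsallis(AB, q)
    additive = tsallis(A, q) + tsallis(B, q)
    pseudo = additive + (1 - q) * tsallis(A, q) * tsallis(B, q)
    print(f"q={q}: S_q(A+B)={joint:.4f}  S_q(A)+S_q(B)={additive:.4f}  pseudo-additive={pseudo:.4f}")
```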
4. Materials and Methods
4.1. A Novel Feature Extraction Technique
Algorithm 1: Pseudocode of the Proposed Technique
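The pseudocode of Algorithm 1 is not reproduced here; the sketch below is one plausible reading of the proposed technique, assuming that each feature is the Tsallis entropy of a k-mer frequency distribution (k-mer frequency is the descriptor mentioned in Section 4.2). The function names, the choice of kmax, and the toy sequence are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
import numpy as np

def kmer_distribution(seq, k):
    """Relative frequencies of the overlapping k-mers observed in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return np.array([c / total for c in counts.values()])

def tsallis_entropy(p, q):
    """Tsallis entropy of a discrete distribution p for entropic index q."""
    if np.isclose(q, 1.0):          # the q -> 1 limit recovers Shannon entropy
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def tsallis_features(seq, q, kmax=6):
    """Feature vector with one entry per k: Tsallis entropy of the k-mer distribution."""
    return np.array([tsallis_entropy(kmer_distribution(seq, k), q)
                     for k in range(1, kmax + 1)])

# Toy example: map a short DNA sequence to a 6-dimensional feature vector for q = 2.3.
print(tsallis_features("ATGCGATACGCTTGAGCATGCA", q=2.3, kmax=6))
```

Under this reading, each sequence is mapped to a fixed-length numeric vector (one entry per k), which can be fed directly to the classifiers evaluated in the case studies below.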
4.2. Benchmark Dataset and Experimental Setting
- Case Study I: Assessment of the Tsallis entropy and the effect of the entropic index q, generating 100 feature vectors for each benchmark dataset with 100 different q parameters (entropic index). The features were extracted by Algorithm 1, with q varying from 0.1 to 10.0 in steps of 0.1 (except 1.0, which leads to the Shannon entropy). The goal was to find the best values of the parameter q to be used in the remaining experiments (a code sketch of this sweep is given after this list). For this, three benchmark datasets from previous studies were used [5,43,44]. For the first dataset (D1), the selected task was distinguishing long non-coding RNAs (lncRNA) from protein-coding genes (mRNA), as in [45], using a set with 500 mRNA and 500 lncRNA sequences (benchmark dataset from [5]). For the second dataset (D2), also a benchmark set from [5], the task was to induce a classifier that distinguishes circular RNAs (circRNAs) from other lncRNAs, using 1000 sequences (500 for each label). The third dataset (D3), for Phage Virion Protein (PVP) classification, comes from [44] and contains 129 PVP and 272 non-PVP sequences.
- Case Study II: We used the best parameters (entropic index q, found in Case Study I) to evaluate their performance on new datasets: D4—Sigma70 Promoters [46] (2141 sequences), D5—Anticancer Peptides [47] (344 sequences), and D6—Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2; 24,815 sequences) [13].
- Case Study III—Comparing Tsallis with Shannon Entropy: As a baseline for comparison, we used Shannon entropy, since we did not find any previous article studying the proposed form of classification with Tsallis entropy and the effect of the entropic parameter across different classifiers. In this experiment, we used D1, D2, D3, D4, D5, and D6.
- Case Study IV—Comparing Generalized Entropies: To better understand the effectiveness of generalized entropies for feature extraction, we compared Tsallis entropy with Rényi entropy. Both approaches were evaluated by repeating the experiments from Case Study I, varying the entropic index used to generate the datasets from 0.1 to 10.0 in steps of 0.1 and inducing the CatBoost classifier. The datasets used were D1, D2, and D3.
- Case Study V—Dimensionality Reduction Analysis: Finally, we compared our proposal against other well-known feature extraction and dimensionality reduction techniques, e.g., Singular Value Decomposition (SVD) [48] and Uniform Manifold Approximation and Projection (UMAP) [49], using datasets D1, D2, D3, and D5. We also added three new benchmark datasets: one provided by [50] for predicting recombination spots (D7), with 1050 sequences (478 positive and 572 negative); one for classifying HIV-1 M pure subtypes against circulating recombinant forms (CRFs) (D8), with 200 sequences (100 positive and 100 negative) [51]; and a multiclass dataset (D9) covering seven bacterial phyla, with 488 small RNA (sRNA), 595 transfer RNA (tRNA), and 247 ribosomal RNA (rRNA) sequences, from [52]. To apply SVD and UMAP, we kept the same k-mer frequency feature descriptor.
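As referenced in Case Study I, the sketch below outlines the entropic-index sweep under the same assumptions as the previous snippet: it reuses the hypothetical tsallis_features function, builds one feature matrix per q, and scores each with a cross-validated Random Forest from scikit-learn. The data loading, classifier settings, and cross-validation scheme are placeholders, not the paper's exact protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sweep_entropic_index(sequences, labels, kmax=6):
    """Score one Tsallis feature matrix per entropic index q and return the best q."""
    # 0.1, 0.2, ..., 10.0 in steps of 0.1, skipping q = 1.0 (the Shannon limit).
    q_grid = [round(q, 1) for q in np.arange(0.1, 10.05, 0.1) if not np.isclose(q, 1.0)]
    scores = {}
    for q in q_grid:
        # tsallis_features is the extraction sketch shown after Algorithm 1.
        X = np.array([tsallis_features(seq, q, kmax=kmax) for seq in sequences])
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        scores[q] = cross_val_score(clf, X, labels, cv=10, scoring="accuracy").mean()
    best_q = max(scores, key=scores.get)
    return best_q, scores
```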
5. Results and Discussion
5.1. Case Study I
5.2. Case Study II
5.3. Case Study III—Comparing Tsallis with Shannon Entropy
5.4. Case Study IV—Comparing Generalized Entropies
5.5. Case Study V—Dimensionality Reduction
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hashemi, F.S.G.; Ismail, M.R.; Yusop, M.R.; Hashemi, M.S.G.; Shahraki, M.H.N.; Rastegari, H.; Miah, G.; Aslani, F. Intelligent mining of large-scale bio-data: Bioinformatics applications. Biotechnol. Biotechnol. Equip. 2018, 32, 10–29. [Google Scholar] [CrossRef] [Green Version]
- Silva, J.C.F.; Teixeira, R.M.; Silva, F.F.; Brommonschenkel, S.H.; Fontes, E.P. Machine learning approaches and their current application in plant molecular biology: A systematic review. Plant Sci. 2019, 284, 37–47. [Google Scholar] [CrossRef]
- Greener, J.G.; Kandathil, S.M.; Moffat, L.; Jones, D.T. A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol. 2022, 23, 40–55. [Google Scholar] [CrossRef]
- Lou, H.; Schwartz, M.; Bruck, J.; Farnoud, F. Evolution of k-mer frequencies and entropy in duplication and substitution mutation systems. IEEE Trans. Inform. Theor. 2019, arXiv:1812.02250. [Google Scholar] [CrossRef] [Green Version]
- Bonidia, R.P.; Sampaio, L.D.H.; Domingues, D.S.; Paschoal, A.R.; Lopes, F.M.; de Carvalho, A.C.P.L.F.; Sanches, D.S. Feature extraction approaches for biological sequences: A comparative study of mathematical features. Brief. Bioinform. 2021, 22, bbab011. [Google Scholar] [CrossRef]
- Maros, M.E.; Capper, D.; Jones, D.T.; Hovestadt, V.; von Deimling, A.; Pfister, S.M.; Benner, A.; Zucknick, M.; Sill, M. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nat. Protoc. 2020, 15, 479–512. [Google Scholar] [CrossRef]
- Eitzinger, S.; Asif, A.; Watters, K.E.; Iavarone, A.T.; Knott, G.J.; Doudna, J.A.; Minhas, F. Machine learning predicts new anti-CRISPR proteins. Nucl. Acids Res. 2020, 48, 4698–4708. [Google Scholar] [CrossRef]
- Vamathevan, J.; Clark, D.; Czodrowski, P.; Dunham, I.; Ferran, E.; Lee, G.; Li, B.; Madabhushi, A.; Shah, P.; Spitzer, M.; et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019, 18, 463–477. [Google Scholar] [CrossRef]
- Abubaker Bagabir, S.; Ibrahim, N.K.; Abubaker Bagabir, H.; Hashem Ateeq, R. Covid-19 and Artificial Intelligence: Genome sequencing, drug development and vaccine discovery. J. Infect. Public Health 2022, 15, 289–296. [Google Scholar] [CrossRef]
- Storcheus, D.; Rostamizadeh, A.; Kumar, S. A survey of modern questions and challenges in feature extraction. In Proceedings of the Feature Extraction: Modern Questions and Challenges, Montreal, QC, Canada, 11 December 2015; pp. 1–18. [Google Scholar]
- Iuchi, H.; Matsutani, T.; Yamada, K.; Iwano, N.; Sumi, S.; Hosoda, S.; Zhao, S.; Fukunaga, T.; Hamada, M. Representation learning applications in biological sequence analysis. Comput. Struct. Biotechnol. J. 2021, 19, 3198–3208. [Google Scholar] [CrossRef]
- Cui, F.; Zhang, Z.; Zou, Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Brief. Funct. Genom. 2021, 20, 61–73. [Google Scholar] [CrossRef]
- Bonidia, R.P.; Domingues, D.S.; Sanches, D.S.; de Carvalho, A.C. MathFeature: Feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief. Bioinform. 2022, 23, bbab434. [Google Scholar] [CrossRef]
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef] [Green Version]
- Vinga, S. Information theory applications for biological sequence analysis. Brief. Bioinform. 2013, 15, 376–389. [Google Scholar] [CrossRef] [Green Version]
- Pritišanac, I.; Vernon, R.M.; Moses, A.M.; Forman Kay, J.D. Entropy and information within intrinsically disordered protein regions. Entropy 2019, 21, 662. [Google Scholar] [CrossRef] [Green Version]
- Vopson, M.M.; Robson, S.C. A new method to study genome mutations using the information entropy. Phys. A Statist. Mech. Appl. 2021, 584, 126383. [Google Scholar] [CrossRef]
- Ré, M.A.; Azad, R.K. Generalization of entropy based divergence measures for symbolic sequence analysis. PLoS ONE 2014, 9, e0093532. [Google Scholar] [CrossRef] [Green Version]
- Akhter, S.; Bailey, B.A.; Salamon, P.; Aziz, R.K.; Edwards, R.A. Applying Shannon’s information theory to bacterial and phage genomes and metagenomes. Sci. Rep. 2013, 3, 1033. [Google Scholar] [CrossRef] [Green Version]
- Machado, J.T.; Costa, A.C.; Quelhas, M.D. Shannon, Rényie and Tsallis entropy analysis of DNA using phase plane. Nonlinear Anal. Real World Appl. 2011, 12, 3135–3144. [Google Scholar] [CrossRef]
- Tripathi, R.; Patel, S.; Kumari, V.; Chakraborty, P.; Varadwaj, P.K. Deeplnc, a long non-coding rna prediction tool using deep neural network. Netw. Model. Anal. Health Inform. Bioinform. 2016, 5, 21. [Google Scholar] [CrossRef]
- Yamano, T. Information theory based on nonadditive information content. Phys. Rev. E 2001, 63, 046105. [Google Scholar] [CrossRef] [Green Version]
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
- Tsallis, C.; Mendes, R.; Plastino, A.R. The role of constraints within generalized nonextensive statistics. Phys. A Stat. Mech. Appl. 1998, 261, 534–554. [Google Scholar] [CrossRef]
- De Albuquerque, M.P.; Esquef, I.A.; Mello, A.G. Image thresholding using Tsallis entropy. Pattern Recognit. Lett. 2004, 25, 1059–1065. [Google Scholar] [CrossRef]
- Ramírez-Reyes, A.; Hernández-Montoya, A.R.; Herrera-Corral, G.; Domínguez-Jiménez, I. Determining the entropic index q of Tsallis entropy in images through redundancy. Entropy 2016, 18, 299. [Google Scholar] [CrossRef] [Green Version]
- Lopes, F.M.; de Oliveira, E.A.; Cesar, R.M. Inference of gene regulatory networks from time series by Tsallis entropy. BMC Syst. Biol. 2011, 5, 61. [Google Scholar] [CrossRef] [Green Version]
- De la Cruz-García, J.S.; Bory-Reyes, J.; Ramirez-Arellano, A. A Two-Parameter Fractional Tsallis Decision Tree. Entropy 2022, 24, 572. [Google Scholar] [CrossRef]
- Thilagaraj, M.; Rajasekaran, M.P.; Kumar, N.A. Tsallis entropy: As a new single feature with the least computation time for classification of epileptic seizures. Clust. Comput. 2019, 22, 15213–15221. [Google Scholar] [CrossRef] [Green Version]
- Keele, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Version 2.3 EBSE Technical Report; EBSE-2007-01; University of Durham: Durham, UK, 2007. [Google Scholar]
- Brereton, P.; Kitchenham, B.A.; Budgen, D.; Turner, M.; Khalil, M. Lessons from applying the systematic literature review process within the software engineering domain. J. Syst. Softw. 2007, 80, 571–583. [Google Scholar] [CrossRef] [Green Version]
- Kitchenham, B.; Brereton, O.P.; Budgen, D.; Turner, M.; Bailey, J.; Linkman, S. Systematic literature reviews in software engineering—A systematic literature review. Inform. Softw. Technol. 2009, 51, 7–15. [Google Scholar] [CrossRef]
- Karimi, S.; Pohl, S.; Scholer, F.; Cavedon, L.; Zobel, J. Boolean versus ranked querying for biomedical systematic reviews. BMC Med. Inform. Decis. Mak. 2010, 10, 58. [Google Scholar] [CrossRef] [Green Version]
- Martignon, L. Information Theory. In International Encyclopedia of the Social & Behavioral Sciences; Smelser, N.J., Baltes, P.B., Eds.; Pergamon: Oxford, UK, 2001; pp. 7476–7480. [Google Scholar] [CrossRef]
- Adami, C. The use of information theory in evolutionary biology. Ann. N. Y. Acad. Sci. 2012, 1256, 49–65. [Google Scholar] [CrossRef] [Green Version]
- Lesne, A. Shannon entropy: A rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics. Math. Struct. Comput. Sci. 2014, 24, e240311. [Google Scholar] [CrossRef] [Green Version]
- Zhang, Y.; Wu, L. Optimal multi-level thresholding based on maximum Tsallis entropy via an artificial bee colony approach. Entropy 2011, 13, 841–859. [Google Scholar] [CrossRef] [Green Version]
- Maszczyk, T.; Duch, W. Comparison of Shannon, Renyi and Tsallis entropy used in decision trees. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland, 22–26 June 2008; pp. 643–651. [Google Scholar]
- Tsallis, C. Nonextensive statistics: Theoretical, experimental and computational evidences and connections. Braz. J. Phys. 1999, 29, 1–35. [Google Scholar] [CrossRef]
- Dérian, N.; Pham, H.P.; Nehar-Belaid, D.; Tchitchek, N.; Klatzmann, D.; Eric, V.; Six, A. The Tsallis generalized entropy enhances the interpretation of transcriptomics datasets. PLoS ONE 2022, 17, e0266618. [Google Scholar] [CrossRef]
- Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inform. Theor. 2014, 60, 6801–6810. [Google Scholar] [CrossRef]
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20–30 July 1960; Volume 1. [Google Scholar]
- Chu, Q.; Zhang, X.; Zhu, X.; Liu, C.; Mao, L.; Ye, C.; Zhu, Q.H.; Fan, L. PlantcircBase: A database for plant circular RNAs. Mol. Plant 2017, 10, 1126–1128. [Google Scholar] [CrossRef]
- Manavalan, B.; Shin, T.H.; Lee, G. PVP-SVM: Sequence-based prediction of phage virion proteins using a support vector machine. Front. Microbiol. 2018, 9, 476. [Google Scholar] [CrossRef]
- Klapproth, C.; Sen, R.; Stadler, P.F.; Findeiß, S.; Fallmann, J. Common features in lncRNA annotation and classification: A survey. Non-Coding RNA 2021, 7, 77. [Google Scholar] [CrossRef]
- Lin, H.; Liang, Z.Y.; Tang, H.; Chen, W. Identifying sigma70 promoters with novel pseudo nucleotide composition. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 16, 1316–1321. [Google Scholar] [CrossRef]
- Li, Q.; Zhou, W.; Wang, D.; Wang, S.; Li, Q. Prediction of anticancer peptides using a low-dimensional feature model. Front. Bioeng. Biotechnol. 2020, 8, 892. [Google Scholar] [CrossRef]
- Halko, N.; Martinsson, P.G.; Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011, 53, 217–288. [Google Scholar] [CrossRef]
- McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Sour. Softw. 2018, 3, 861. [Google Scholar] [CrossRef]
- Khan, F.; Khan, M.; Iqbal, N.; Khan, S.; Muhammad Khan, D.; Khan, A.; Wei, D.Q. Prediction of Recombination Spots Using Novel Hybrid Feature Extraction Method via Deep Learning Approach. Front. Genet. 2020, 11, 1052. [Google Scholar] [CrossRef]
- Remita, M.A.; Halioui, A.; Malick Diouara, A.A.; Daigle, B.; Kiani, G.; Diallo, A.B. A machine learning approach for viral genome classification. BMC Bioinform. 2017, 18, 1–11. [Google Scholar] [CrossRef] [Green Version]
- Bonidia, R.P.; Santos, A.P.A.; de Almeida, B.L.; Stadler, P.F.; da Rocha, U.N.; Sanches, D.S.; de Carvalho, A.C. BioAutoML: Automated feature engineering and metalearning to predict noncoding RNAs in bacteria. Brief. Bioinform. 2022, 23, bbac218. [Google Scholar] [CrossRef]
- Randhawa, G.S.; Soltysiak, M.P.; El Roz, H.; de Souza, C.P.; Hill, K.A.; Kari, L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS ONE 2020, 15, e0232391. [Google Scholar] [CrossRef] [Green Version]
- Naeem, S.M.; Mabrouk, M.S.; Marzouk, S.Y.; Eldosoky, M.A. A diagnostic genomic signal processing (GSP)-based system for automatic feature analysis and detection of COVID-19. Brief. Bioinform. 2021, 22, 1197–1205. [Google Scholar] [CrossRef]
- Arslan, H. Machine Learning Methods for COVID-19 Prediction Using Human Genomic Data. In Proceedings of the Multidisciplinary Digital Publishing Institute Proceedings, Online, 9–11 December 2020; Volume 74, p. 20. [Google Scholar]
- Berry, M.W. Large-scale sparse singular value computations. Int. J. Supercomput. Appl. 1992, 6, 13–49. [Google Scholar] [CrossRef]
- Rajamanickam, S. Efficient Algorithms for Sparse Singular Value Decomposition; University of Florida: Gainesville, FL, USA, 2009. [Google Scholar]
- McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
- Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2019, 37, 38–44. [Google Scholar] [CrossRef]
- Dorrity, M.W.; Saunders, L.M.; Queitsch, C.; Fields, S.; Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 2020, 11, 1537. [Google Scholar] [CrossRef] [Green Version]
- Li, M.; Si, Y.; Yang, W.; Yu, Y. ET-UMAP integration feature for ECG biometrics using Stacking. Biomed. Sig. Proc. Control 2022, 71, 103159. [Google Scholar] [CrossRef]
| Group | Descriptor |
|---|---|
| Nucleic Acid Composition | Nucleotide composition |
| | Dinucleotide composition |
| | Trinucleotide composition |
| | Tetranucleotide composition |
| | Pentanucleotide composition |
| | Hexanucleotide composition |
| | Basic k-mer |
| | Reverse complementary k-mer |
| | Increment in diversity |
| | Mismatch |
| | Subsequence |
| | GC-content |
| | AT/GT Ratio |
| | Cumulative skew |
| | kGap |
| | Position-specific nucleotide frequency |
| | Nucleotide content |
| | Conformational properties |
| | Enhanced nucleic acid composition |
| | Composition of k-spaced Nucleic Acid Pairs |
| TD | Topological descriptors |
| K-Nearest Neighbor | K-nearest neighbor for proteins |
| Autocorrelation | Normalized Moreau–Broto |
| | Moran |
| | Geary |
| | Dinucleotide-based auto-covariance |
| | Dinucleotide-based cross-covariance |
| | Dinucleotide-based auto-cross-covariance |
| | Trinucleotide-based auto-covariance |
| | Trinucleotide-based cross-covariance |
| | Trinucleotide-based auto-cross-covariance |
| Pseudo Nucleic Acid Composition | Type 1 Pseudo k-tuple nucleotide composition |
| | Type 2 Pseudo k-tuple nucleotide composition |
| | Pseudo k-tuple nucleotide composition |
| | Pseudo dinucleotide composition |
| Numerical Mapping | Z-curve theory |
| | Nucleotide Chemical Property |
| | Accumulated Nucleotide Frequency |
| | Electron–ion interaction pseudopotential |
| | Pseudo electron–ion interaction pseudopotential |
| | Binary |
| | Orthonormal encoding |
| | Basic one-hot |
| | 6-dimension one-hot method |
| Amino Acid Composition | Amino acid composition |
| | Dipeptide composition |
| | Tripeptide composition |
| | Terminal end amino acid count |
| | Amino acid pair |
| | Secondary structure composition |
| | Secondary structure—amino acid composition |
| | Solvent accessibility composition |
| | Solvent accessibility—amino acid composition |
| | Codon composition |
| | Protein length |
| | Overlapping k-mers |
| | Information-based statistics |
| | Basic k-mer |
| | Distance-based residue |
| | Distance pair |
| | Residue-Couple Model |
| | Composition moment vector |
| | Enhanced amino acid composition |
| | Composition of k-spaced amino acid pairs |
| | Dipeptide deviation from expected mean |
| | Grouped amino acid composition |
| | Enhanced grouped amino acid composition |
| | Composition of k-spaced amino acid group pairs |
| | Grouped dipeptide composition |
| | Grouped tripeptide composition |
| | kGap |
| | Position-specific nucleotide frequency |
| Pseudo-Amino Acid Composition | Type 1 PseAAC |
| | Type 2 PseAAC |
| | Dipeptide (or Type 3) PseAAC |
| | General parallel correlation PseAAC |
| | General series correlation PseAAC |
| | Pseudo k-tuple reduced AAC (type 1 to type 16) |
| CTD | Composition |
| | Transition |
| | Distribution |
| Sequence-Order | Sequence-order-coupling number |
| | Quasi-sequence-order |
| Profile-based Features | Signal average |
| | Signal peak area |
| | PSSM (Position-Specific Scoring Matrix) profile |
| | Profile-based physicochemical distance |
| | Distance-based top-n-gram |
| | Top-n-gram |
| | Sequence conservation score |
| | Frequency profile matrix |
| Conjoint Triad | Conjoint Triad |
| | Conjoint k-spaced triad |
| Proteochemometric Descriptors | Principal component analysis |
| | Principal component analysis (2D and 3D) |
| | Factor analysis |
| | Factor analysis (2D and 3D) |
| | Multidimensional scaling |
| | Multidimensional scaling (2D and 3D) |
| | BLOSUM and PAM matrix-derived |
| | Biophysical quantitative properties |
| | Amino acid properties |
| | Molecular descriptors |
| Sequence Similarity | Gene Ontology (GO) similarity |
| | Sequence Alignment |
| | BLAST matrix |
| Structure Composition | Secondary structure |
| | Solvent accessible surface area |
| | Secondary structure binary |
| | Disorder |
| | Disorder content |
| | Disorder binary |
| | Torsional angles |
| | DNA shape features |
| Physicochemical Property | AAindex |
| | Z-scale |
| | Physicochemical n-Grams |
| | Dinucleotide physicochemical |
| | Trinucleotide physicochemical |
| Dataset | GaussianNB q | GaussianNB ACC | RF q | RF ACC | Bagging q | Bagging ACC | MLP q | MLP ACC | CatBoost q | CatBoost ACC |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 2.7 | 0.9370 | 0.4 | 0.9430 | 2.7 | 0.9400 | 2.2 | 0.9380 | 2.3 | 0.9440 |
| | 9.2 | 0.4760 | 9.6 | 0.7360 | 9.6 | 0.7270 | 10.0 | 0.5060 | 9.6 | 0.747 |
| D2 | 1.5 | 0.7980 | 5.3 | 0.8220 | 5.7 | 0.8080 | 0.9 | 0.7800 | 4.0 | 0.8300 |
| | 9.6 | 0.5210 | 10.0 | 0.6510 | 10.0 | 0.6170 | 9.9 | 0.5060 | 9.2 | 0.6800 |
| D3 | 8.7 | 0.7008 | 7.8 | 0.6910 | 2.0 | 0.7157 | 1.5 | 0.7184 | 1.1 | 0.7282 |
| | 1.3 | 0.6062 | 9.8 | 0.5985 | 9.5 | 0.5962 | 0.1 | 0.6860 | 5.7 | 0.6610 |
| Dataset | q | Classifier | ACC | Recall | F1 Score | AUC | BACC |
|---|---|---|---|---|---|---|---|
| D4 | 0.5 | RF | 0.6594 | 0.2556 | 0.3423 | 0.6279 | 0.5647 |
| | | CatBoost | 0.6563 | 0.1973 | 0.2848 | 0.6233 | 0.5487 |
| | 2.0 | RF | 0.6687 | 0.3094 | 0.3932 | 0.6108 | 0.5845 |
| | | CatBoost | 0.6641 | 0.2063 | 0.2987 | 0.6301 | 0.5567 |
| | 3.0 | RF | 0.6672 | 0.3049 | 0.3886 | 0.6150 | 0.5822 |
| | | CatBoost | 0.6625 | 0.2377 | 0.3282 | 0.6319 | 0.5629 |
| | 4.0 | RF | 0.6641 | 0.2825 | 0.3684 | 0.6163 | 0.5746 |
| | | CatBoost | 0.6656 | 0.2466 | 0.3385 | 0.6415 | 0.5674 |
| | 5.0 | RF | 0.6641 | 0.2825 | 0.3684 | 0.6348 | 0.5746 |
| | | CatBoost | 0.6734 | 0.2646 | 0.3598 | 0.6375 | 0.5775 |
| Dataset | q | Classifier | ACC | Recall | F1 Score | AUC | BACC |
|---|---|---|---|---|---|---|---|
| D5 | 0.5 | RF | 0.7019 | 0.5952 | 0.6173 | 0.7437 | 0.6847 |
| | | CatBoost | 0.6923 | 0.3810 | 0.5000 | 0.7488 | 0.6421 |
| | 2.0 | RF | 0.7019 | 0.5476 | 0.5974 | 0.7454 | 0.6770 |
| | | CatBoost | 0.6538 | 0.4286 | 0.5000 | 0.7500 | 0.6175 |
| | 3.0 | RF | 0.7212 | 0.5714 | 0.6234 | 0.7748 | 0.6970 |
| | | CatBoost | 0.6827 | 0.4286 | 0.5217 | 0.7385 | 0.6417 |
| | 4.0 | RF | 0.7019 | 0.5238 | 0.5867 | 0.7823 | 0.6732 |
| | | CatBoost | 0.6923 | 0.4762 | 0.5556 | 0.7642 | 0.6575 |
| | 5.0 | RF | 0.7211 | 0.5476 | 0.6133 | 0.7813 | 0.6932 |
| | | CatBoost | 0.6923 | 0.4762 | 0.5556 | 0.7600 | 0.6575 |
| Dataset | q | Classifier | ACC | Recall | F1 Score | AUC | BACC |
|---|---|---|---|---|---|---|---|
| D6 | 0.5 | RF | 0.9989 | 0.9992 | 0.9994 | 1.0000 | 0.9985 |
| | | CatBoost | 0.9982 | 1.0000 | 0.9990 | 0.9999 | 0.9947 |
| | 2.0 | RF | 0.9996 | 1.0000 | 0.9998 | 1.0000 | 0.9990 |
| | | CatBoost | 0.9951 | 0.9996 | 0.9971 | 1.0000 | 0.9862 |
| | 3.0 | RF | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | | CatBoost | 0.9996 | 1.0000 | 0.9998 | 1.0000 | 0.9990 |
| | 4.0 | RF | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | | CatBoost | 0.9996 | 1.0000 | 0.9998 | 1.0000 | 0.9990 |
| | 5.0 | RF | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | | CatBoost | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Dataset | Classifier | Entropy | q | ACC | Recall | F1 Score | BACC |
|---|---|---|---|---|---|---|---|
| D1 | CatBoost | Tsallis | 2.3 | 0.9420 | 0.9673 | 0.9437 | 0.9421 |
| | | Shannon | - | 0.9420 | 0.9651 | 0.9435 | 0.9421 |
| D2 | CatBoost | Tsallis | 4.0 | 0.8140 | 0.7760 | 0.8053 | 0.8153 |
| | | Shannon | - | 0.8080 | 0.7582 | 0.7970 | 0.8115 |
| D3 | CatBoost | Tsallis | 1.1 | 0.7231 | 0.3869 | 0.4724 | 0.6342 |
| | | Shannon | - | 0.7207 | 0.3886 | 0.4708 | 0.6334 |
| D4 | RF | Tsallis | 2.0 | 0.6687 | 0.3094 | 0.3932 | 0.5845 |
| | | Shannon | - | 0.6563 | 0.2556 | 0.3403 | 0.5623 |
| D5 | RF | Tsallis | 3.0 | 0.7212 | 0.5714 | 0.6234 | 0.6970 |
| | | Shannon | - | 0.7115 | 0.5476 | 0.6053 | 0.6851 |
| D6 | RF | Tsallis | 5.0 | 0.9984 | 0.9846 | 0.9915 | 0.9922 |
| | | Shannon | - | 0.9985 | 0.9888 | 0.9922 | 0.9942 |
| Mean | - | Tsallis | - | 0.8112 | 0.6659 | 0.7049 | 0.7776 |
| | | Shannon | - | 0.8061 | 0.6507 | 0.6915 | 0.7714 |
| Gain | - | - | - | 0.51% | 1.52% | 1.34% | 0.62% |
| Wins | - | Tsallis | - | 5 | 4 | 5 | 5 |
| | | Shannon | - | 2 | 2 | 1 | 2 |
| Dataset | Reduction | ACC | Recall | F1 Score | BACC |
|---|---|---|---|---|---|
| D1 | Tsallis (q = 2.3) | 0.9430 | 0.9650 | 0.9438 | 0.9434 |
| | SVD | 0.4980 | 0.0000 | 0.0000 | 0.4982 |
| | UMAP | 0.4980 | 0.9963 | 0.6632 | 0.4981 |
| D2 | Tsallis (q = 4.0) | 0.8120 | 0.7718 | 0.8030 | 0.8114 |
| | SVD | 0.5004 | 0.0016 | 0.0032 | 0.5008 |
| | UMAP | 0.4994 | 0.0000 | 0.0000 | 0.5000 |
| D3 | Tsallis (q = 1.1) | 0.7307 | 0.3538 | 0.4541 | 0.6310 |
| | SVD | 0.5389 | 0.7132 | 0.4942 | 0.5834 |
| | UMAP | 0.3191 | 0.9933 | 0.4825 | 0.4967 |
| D5 | Tsallis (q = 3.0) | 0.6720 | 0.5181 | 0.5515 | 0.6508 |
| | SVD | 0.7403 | 0.7630 | 0.7752 | 0.7261 |
| | UMAP | 0.4021 | 0.0000 | 0.0000 | 0.5000 |
| D7 | Tsallis (q = 3.0) | 0.7371 | 0.6711 | 0.6947 | 0.7337 |
| | SVD | 0.5438 | 0.0000 | 0.0000 | 0.4992 |
| | UMAP | 0.5143 | 0.1824 | 0.1147 | 0.4963 |
| D8 | Tsallis (q = 1.1) | 0.6500 | 0.6111 | 0.6277 | 0.6525 |
| | SVD | 0.8023 | 0.8575 | 0.7843 | 0.8171 |
| | UMAP | 0.6326 | 0.7728 | 0.6544 | 0.6511 |
| D9 | Tsallis (q = 9.2) | 0.9489 | 0.9481 | 0.9507 | 0.9481 |
| | SVD | 0.5586 | 0.6433 | 0.5517 | 0.6433 |
| | UMAP | 0.5992 | 0.6528 | 0.6167 | 0.6528 |
| Mean | Tsallis | 0.7848 | 0.6913 | 0.7179 | 0.7673 |
| | SVD | 0.5975 | 0.4255 | 0.3727 | 0.6097 |
| | UMAP | 0.4950 | 0.5139 | 0.3616 | 0.5421 |
| Wins | Tsallis | 5 | 3 | 4 | 5 |
| | SVD | 2 | 2 | 3 | 2 |
| | UMAP | 0 | 2 | 0 | 0 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).