Abstract
Protein fold recognition is a central problem in bioinformatics. It is usually addressed with sequence comparison methods, but these fail when proteins with similar structures share little sequence homology; in such cases, machine learning methods are used to predict the structure of the protein. Imbalanced data sets, numerous outliers, and a large number of classes make the task very complex. We explain the methodology for building classifiers for protein fold recognition and survey the major results in this field.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Stapor, K., Roterman-Konieczna, I., Fabian, P. (2019). Machine Learning Methods for the Protein Fold Recognition Problem. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds) Machine Learning Paradigms. Intelligent Systems Reference Library, vol 149. Springer, Cham. https://doi.org/10.1007/978-3-319-94030-4_5
DOI: https://doi.org/10.1007/978-3-319-94030-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94029-8
Online ISBN: 978-3-319-94030-4
eBook Packages: Intelligent Technologies and Robotics (R0)