Abstract
The protein secondary structure prediction problem is one of the most widely studied problems in bioinformatics. Predicting the secondary structure of a protein is an important step toward determining its tertiary structure and thus its function. This paper explores the protein secondary structure problem using a novel feature selection algorithm combined with a machine learning approach based on random forests. For feature reduction, we propose a graph-theoretical algorithm that finds cliques in the non-position-specific evolutionary profiles of proteins obtained from BLOSUM62. The features selected by this algorithm are then used to condense the position-specific evolutionary information obtained from PSI-BLAST. Our results show that we save a significant amount of space and time while still achieving high accuracy, even when the number of features is reduced by 25%.
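The abstract does not say how the graph is built or how cliques are turned into features, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that two residues are joined by an edge when their substitution score meets a threshold, that maximal cliques are enumerated with the basic Bron-Kerbosch recursion, and that each clique condenses the matching profile columns by averaging. The score table, the threshold, and the names `TOY_SCORES`, `build_graph`, `maximal_cliques`, and `condense_row` are hypothetical, and the scores are toy values rather than real BLOSUM62 entries.

```python
from itertools import combinations

# Hypothetical toy substitution scores for six residues.
# These are illustrative values, NOT real BLOSUM62 entries.
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("K", "R"): 2,
    ("I", "K"): -3, ("I", "R"): -3, ("L", "K"): -2, ("L", "R"): -2,
    ("V", "K"): -2, ("V", "R"): -3,
    ("D", "K"): -1, ("D", "R"): -2, ("D", "I"): -3, ("D", "L"): -4,
    ("D", "V"): -3,
}
RESIDUES = ["I", "L", "V", "K", "R", "D"]


def build_graph(residues, scores, threshold):
    """Connect two residues when their substitution score meets the threshold."""
    adj = {r: set() for r in residues}
    for a, b in combinations(residues, 2):
        s = scores.get((a, b), scores.get((b, a)))
        if s is not None and s >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices.
        if not p and not x:
            found.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found)


def condense_row(row, residues, groups):
    """Average the profile scores of the residues in each group (assumed rule)."""
    index = {r: i for i, r in enumerate(residues)}
    return [sum(row[index[r]] for r in g) / len(g) for g in groups]


if __name__ == "__main__":
    groups = maximal_cliques(build_graph(RESIDUES, TOY_SCORES, threshold=1))
    print(groups)  # [['D'], ['I', 'L', 'V'], ['K', 'R']]
    # One hypothetical profile row, with columns in RESIDUES order.
    print(condense_row([4, 2, 3, -1, -3, 0], RESIDUES, groups))  # [0.0, 3.0, -2.0]
```

In this toy case six profile columns condense to three. Maximal cliques can overlap in general, in which case this sketch would simply produce overlapping averaged features; how the paper handles that case is not stated in the abstract.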
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Altun, G., Hu, HJ., Gremalschi, S., Harrison, R.W., Pan, Y. (2007). A Feature Selection Algorithm Based on Graph Theory and Random Forests for Protein Secondary Structure Prediction. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_54
DOI: https://doi.org/10.1007/978-3-540-72031-7_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72030-0
Online ISBN: 978-3-540-72031-7
eBook Packages: Computer Science, Computer Science (R0)