A Feature Selection Algorithm Based on Graph Theory and Random Forests for Protein Secondary Structure Prediction

Gulsah Altun¹, Hae-Jin Hu¹, Stefan Gremalschi¹, Robert W. Harrison¹,² & Yi Pan¹

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4463)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

The protein secondary structure prediction problem is one of the most widely studied problems in bioinformatics. Predicting the secondary structure of a protein is an important step toward determining its tertiary structure and thus its function. This paper explores the problem using a novel feature selection algorithm combined with a machine learning approach based on random forests. For feature reduction, we propose a graph-theoretical algorithm that finds cliques in the non-position-specific evolutionary profiles of proteins obtained from BLOSUM62. The features selected by this algorithm are then used to condense the position-specific evolutionary information obtained from PSI-BLAST. Our results show that this approach saves a significant amount of space and time while still achieving high accuracy, even when the number of features is reduced by 25%.
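
The clique-based condensing idea can be illustrated with a minimal sketch. The abstract does not give the similarity threshold, the rule for choosing among cliques, or the exact condensing step, so everything below is an assumption made for illustration: a toy six-letter alphabet with placeholder similarity scores (not real BLOSUM62 entries), a basic Bron-Kerbosch enumeration of maximal cliques, a greedy choice of disjoint cliques, and averaging of profile columns within each group. The names `build_graph`, `maximal_cliques`, `disjoint_groups`, and `condense_row` are hypothetical.

```python
from itertools import combinations

# Toy similarity scores for a 6-letter alphabet. These are illustrative
# placeholders, NOT real BLOSUM62 entries; unlisted pairs count as dissimilar.
ALPHABET = ["I", "L", "V", "K", "R", "G"]
TOY_SCORES = {("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1, ("K", "R"): 2}


def build_graph(alphabet, scores, threshold):
    """Connect two residues when their similarity score is >= threshold."""
    adj = {a: set() for a in alphabet}
    for a, b in combinations(alphabet, 2):
        s = scores.get((a, b), scores.get((b, a)))
        if s is not None and s >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))  # r cannot be extended: maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v is fully explored
            x = x | {v}  # exclude v from later cliques at this level

    expand(set(), set(adj), set())
    return found


def disjoint_groups(alphabet, cliques):
    """Greedily keep the largest non-overlapping cliques (assumed rule);
    residues left over become singleton groups."""
    used, groups = set(), []
    for c in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        if len(c) > 1 and not (c & used):
            groups.append(sorted(c))
            used |= c
    groups += [[a] for a in alphabet if a not in used]
    return groups


def condense_row(row, groups):
    """Replace the columns of each group by their mean (assumed condensing step)."""
    return [sum(row[a] for a in g) / len(g) for g in groups]


graph = build_graph(ALPHABET, TOY_SCORES, threshold=1)
groups = disjoint_groups(ALPHABET, maximal_cliques(graph))

# One hypothetical position-specific profile row (one score per residue).
profile_row = {"I": 3, "L": 0, "V": 3, "K": -2, "R": -4, "G": -1}
condensed = condense_row(profile_row, groups)

print(groups)     # [['I', 'L', 'V'], ['K', 'R'], ['G']]
print(condensed)  # [2.0, -3.0, -1.0]
```

In this toy case six profile columns shrink to three. With a real 20-letter substitution matrix the amount of reduction would depend on the chosen threshold and clique-selection rule, and the condensed rows would then serve as inputs to a classifier such as a random forest.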

Author information

Authors and Affiliations

Department of Computer Science,
Gulsah Altun, Hae-Jin Hu, Stefan Gremalschi, Robert W. Harrison & Yi Pan
Department of Biology, Georgia State University, 30303, Atlanta, GA, USA
Robert W. Harrison

Editor information

Ion Măndoiu, Alexander Zelikovsky

About this paper

Cite this paper

Altun, G., Hu, HJ., Gremalschi, S., Harrison, R.W., Pan, Y. (2007). A Feature Selection Algorithm Based on Graph Theory and Random Forests for Protein Secondary Structure Prediction. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_54

DOI: https://doi.org/10.1007/978-3-540-72031-7_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72030-0
Online ISBN: 978-3-540-72031-7
eBook Packages: Computer Science, Computer Science (R0)