Abstract
Detecting the boundaries of protein domains is an important and challenging problem in experimental and computational structural biology. In this paper, domain detection is first treated as an imbalanced data learning problem. A novel undersampling method that uses distance-based maximal entropy in the feature space of SVMs is proposed. Several measures are defined on multiple sequence alignments derived from a database search to quantify the domain information content of each position along the sequence. The method achieves an overall accuracy of about 87%, together with high sensitivity and specificity. Simulation results indicate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning on general imbalanced datasets.
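Two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the RBF kernel, the `gamma` value, and all function names are assumptions chosen for illustration. The first function uses the standard kernel identity for distance in an SVM feature space. The second shows why sensitivity and specificity are reported alongside accuracy when one class (here, presumably boundary positions) is rare.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # Assumed kernel; the abstract does not say which kernel is used.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def feature_space_distance(x, y, kernel=rbf_kernel):
    # Standard identity: ||phi(x) - phi(y)||^2 = K(x,x) - 2*K(x,y) + K(y,y),
    # so the distance is computed without forming phi explicitly.
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors

def sensitivity_specificity_accuracy(tp, fn, tn, fp):
    # tp/fn count the rare (positive) class, tn/fp the majority class.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

if __name__ == "__main__":
    print(feature_space_distance((0.0, 0.0), (1.0, 1.0)))
    # A classifier that labels every position as the majority class:
    print(sensitivity_specificity_accuracy(tp=0, fn=10, tn=90, fp=0))
```

With 10 positive and 90 negative examples, the trivial majority-class classifier in the last call has an accuracy of 0.9 but a sensitivity of 0, which is the failure mode that undersampling the majority class is meant to counter.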
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Zou, S., Huang, Y., Wang, Y., Hu, C., Liang, Y., Zhou, C. (2007). A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds) Advances in Neural Networks – ISNN 2007. ISNN 2007. Lecture Notes in Computer Science, vol 4492. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72393-6_149
DOI: https://doi.org/10.1007/978-3-540-72393-6_149
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72392-9
Online ISBN: 978-3-540-72393-6
eBook Packages: Computer Science, Computer Science (R0)