Abstract
Since most knowledge about protein-protein interactions is still buried in biological publications, there is increasing interest in automatically extracting information from the vast biological literature. Existing approaches can be broadly categorized as rule-based or statistical. Rule-based approaches require heavy manual effort. Statistical approaches, on the other hand, require large-scale, richly annotated corpora in order to estimate model parameters reliably, and such corpora are normally difficult to obtain in practical applications. We have proposed a hidden vector state (HVS) model for protein-protein interaction extraction. The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. Each state transition is factored into a stack shift operation, similar to that of a push-down automaton, followed by the push of a new preterminal category label. In this paper, we propose a novel approach based on the k-nearest-neighbors classifier to automatically train the HVS model from un-annotated data. Experimental results show improved performance over the baseline system, in which the HVS model is trained on a small amount of annotated data.
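The factored state transition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hvs_transition`, the `max_depth` limit, and the labels `SS`, `PROTEIN`, and `ACTIVATE` are hypothetical names chosen for illustration, and the sketch covers only the deterministic stack update, not the probabilities attached to each operation.

```python
from typing import Tuple

# A vector state is a stack of category labels; index 0 is the top of the
# stack, i.e. the preterminal label for the current word.
VectorState = Tuple[str, ...]


def hvs_transition(
    state: VectorState,
    n_pop: int,
    new_label: str,
    max_depth: int = 4,  # assumed depth limit, not taken from the paper
) -> VectorState:
    """Apply one transition: shift the stack by popping n_pop labels,
    then push a single new preterminal label."""
    if not 0 <= n_pop <= len(state):
        raise ValueError("n_pop must be between 0 and the current stack depth")
    shifted = state[n_pop:]              # stack shift (pop n_pop labels)
    new_state = (new_label,) + shifted   # push the new preterminal label
    if len(new_state) > max_depth:
        raise ValueError("resulting stack exceeds max_depth")
    return new_state


# Hypothetical walk over three words, starting from a sentence-start label.
s0: VectorState = ("SS",)
s1 = hvs_transition(s0, 0, "PROTEIN")    # push without popping
s2 = hvs_transition(s1, 1, "ACTIVATE")   # pop PROTEIN, then push ACTIVATE
s3 = hvs_transition(s2, 0, "PROTEIN")    # nest a PROTEIN under ACTIVATE
```

Because exactly one label is pushed per transition, the stack grows by at most one label per word. In this sketch, it is that restriction that separates such a transition from a general push-down automaton move.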
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Zhou, D., He, Y., Kwoh, C.K. (2007). Training the Hidden Vector State Model from Un-annotated Corpus. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2007. ICCS 2007. Lecture Notes in Computer Science, vol 4488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72586-2_54
DOI: https://doi.org/10.1007/978-3-540-72586-2_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72585-5
Online ISBN: 978-3-540-72586-2
eBook Packages: Computer Science, Computer Science (R0)