Training the Hidden Vector State Model from Un-annotated Corpus

Deyu Zhou¹,
Yulan He¹ &
Chee Keong Kwoh¹

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4488)

Included in the following conference series:

International Conference on Computational Science

Abstract

Because most knowledge about protein-protein interactions remains buried in biological publications, there is increasing interest in automatically extracting this information from the vast biological literature. Existing approaches can be broadly categorized as rule-based or statistical. Rule-based approaches require heavy manual effort, whereas statistical approaches require large-scale, richly annotated corpora to estimate model parameters reliably, and such corpora are normally difficult to obtain in practice. We have proposed a hidden vector state (HVS) model for protein-protein interaction extraction. The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. State transitions are factored into a stack shift operation, similar to that of a push-down automaton, followed by the push of a new preterminal category label. In this paper, we propose a novel approach based on the k-nearest-neighbors classifier to train the HVS model automatically from un-annotated data. Experimental results show improved performance over a baseline system in which the HVS model is trained on only a small amount of annotated data.
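
The factored state transition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hvs_transition`, the tuple representation of the vector state, and the labels `SS`, `PROTEIN` and `ACTIVATE` are all hypothetical choices made for illustration, and the probabilities attached to each step in the real model are left out.

```python
from typing import Tuple

# A vector state is modelled here as a tuple of category labels.
# Index 0 is the top of the stack, i.e. the current preterminal label.
# (Representation chosen for illustration only.)
VectorState = Tuple[str, ...]


def hvs_transition(state: VectorState, n_pop: int, new_label: str) -> VectorState:
    """Apply one factored transition: a stack shift, then a push.

    Step 1 (stack shift): pop n_pop labels off the top of the stack.
    Step 2 (push): push exactly one new preterminal category label.
    """
    if not 0 <= n_pop <= len(state):
        raise ValueError("n_pop must be between 0 and the current stack depth")
    remaining = state[n_pop:]          # stack shift
    return (new_label,) + remaining    # push of the new preterminal


# Hypothetical label sequence for a sentence about one protein activating another.
state: VectorState = ("SS",)                   # sentence-start label (assumed)
state = hvs_transition(state, 0, "PROTEIN")    # push only
state = hvs_transition(state, 1, "ACTIVATE")   # pop PROTEIN, push ACTIVATE
state = hvs_transition(state, 0, "PROTEIN")    # nest PROTEIN under ACTIVATE
print(state)
```

Because every transition pushes exactly one label after the shift, the sequence of pop counts and pushed labels fully determines the stack at each word, which is what lets the stack contents stand in for hierarchical context.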

Author information

Authors and Affiliations

School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore
Deyu Zhou, Yulan He & Chee Keong Kwoh

Editor information

Yong Shi, Geert Dick van Albada, Jack Dongarra, Peter M. A. Sloot

Cite this paper

Zhou, D., He, Y., Kwoh, C.K. (2007). Training the Hidden Vector State Model from Un-annotated Corpus. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2007. ICCS 2007. Lecture Notes in Computer Science, vol 4488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72586-2_54

DOI: https://doi.org/10.1007/978-3-540-72586-2_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72585-5
Online ISBN: 978-3-540-72586-2
eBook Packages: Computer Science, Computer Science (R0)
