SVM Based Learning System for Information Extraction

Yaoyong Li, Kalina Bontcheva & Hamish Cunningham

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3635)

Included in the following conference series:

International Workshop on Deterministic and Statistical Methods in Machine Learning

Abstract

This paper presents an SVM-based learning system for information extraction (IE). One distinctive feature of our system is the use of a variant of the SVM, the SVM with uneven margins, which is particularly helpful for small training datasets. In addition, our approach needs fewer SVM classifiers to be trained than other recent SVM-based systems. The paper also compares our approach to several state-of-the-art systems (including rule learning and statistical learning algorithms) on three IE benchmark datasets: CoNLL-2003, CMU seminars, and the software jobs corpus. The experimental results show that our system outperforms a recent SVM-based system on CoNLL-2003, achieves the highest score on eight out of 17 categories on the jobs corpus, and is second best on the remaining nine.
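The abstract names the SVM with uneven margins but does not state its formulation. The sketch below assumes the common reading of the idea: a positive example must score at least 1, while a negative example only needs to score at most -tau, with 0 < tau <= 1, so a small positive class gets a wider margin. The trainer (plain subgradient descent on a regularised hinge loss), the function names, the value of tau, and the toy data are all illustrative assumptions and are not the authors' implementation.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_uneven_margin_svm(X, y, tau=0.4, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Linear classifier with margin 1 for positives and tau for negatives.

    Assumed objective: L2-regularised hinge loss, minimised by stochastic
    subgradient descent. Labels in y must be +1 or -1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Positives must reach score >= 1, negatives score <= -tau.
            required = 1.0 if y[i] > 0 else tau
            violated = y[i] * (dot(w, X[i]) + b) < required
            # Regularisation step: shrink the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if violated:
                # Hinge subgradient step for an example inside its margin.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the sign of the decision function."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical toy data: few positives, more negatives.
X_train = [[2.0, 2.0], [2.5, 1.5],
           [-1.0, -1.0], [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0]]
y_train = [1, 1, -1, -1, -1, -1]

w, b = train_uneven_margin_svm(X_train, y_train, tau=0.4)
train_predictions = [predict(w, b, x) for x in X_train]
```

Setting tau to 1 recovers the ordinary symmetric-margin objective in this sketch, which makes it easy to compare the two settings on the same data.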

Author information

Authors and Affiliations

Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK
Yaoyong Li, Kalina Bontcheva & Hamish Cunningham

Editor information

Editors and Affiliations

Department of Computer Science, The University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Joab Winkler
Department of Computer Science, The University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Mahesan Niranjan
University of Manchester, UK
Neil Lawrence

Cite this paper

Li, Y., Bontcheva, K., Cunningham, H. (2005). SVM Based Learning System for Information Extraction. In: Winkler, J., Niranjan, M., Lawrence, N. (eds) Deterministic and Statistical Methods in Machine Learning. DSMML 2004. Lecture Notes in Computer Science, vol 3635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11559887_19

DOI: https://doi.org/10.1007/11559887_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29073-5
Online ISBN: 978-3-540-31728-9
eBook Packages: Computer Science, Computer Science (R0)
