DOI: 10.1145/1076034.1076091

Learning to extract information from semi-structured text using a discriminative context free grammar

Published: 15 August 2005

Abstract

In recent work, conditional Markov chain models (CMMs) have been used to extract information from semi-structured text (one example is the Conditional Random Field [10]). Applications range from finding the author and title in research papers to finding the phone number and street address in a web page. The CMM framework combines a priori knowledge encoded as features with a set of labeled training data to learn an efficient extraction process. We show that similar problems can be solved more effectively by learning a discriminative context free grammar from training data. The grammar has several distinct advantages: long-range, even global, constraints can be used to disambiguate entity labels; training data is used more efficiently; and a set of new, more powerful features can be introduced. The grammar-based approach also yields semantic information (encoded as a parse tree) that could be used for IR applications such as question answering. The specific problem we consider is that of extracting personal contact, or address, information from unstructured sources such as documents and emails. While linear-chain CMMs perform reasonably well on this task, we show that a statistical parsing approach results in a 50% reduction in error rate. This system also has the advantage of being interactive, similar to the system described in [9]: in cases where there are multiple errors, a single user correction can be propagated to correct multiple errors automatically. Using a discriminatively trained grammar, 93.71% of all tokens are labeled correctly (compared to 88.43% for a CMM) and 72.87% of records have all tokens labeled correctly (compared to 45.29% for the CMM).
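
The "50% reduction in error rate" can be related to the accuracies quoted at the end of the abstract. The short Python sketch below converts each accuracy into an error rate and computes the relative reduction. The function name `relative_error_reduction` is a hypothetical helper introduced here for illustration, and reading the 50% figure against these particular numbers is an assumption based only on the abstract, not a statement from the paper.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors removed by the new system.

    Both arguments are accuracies expressed in percent (0-100).
    """
    baseline_err = 100.0 - baseline_acc  # error rate of the baseline
    new_err = 100.0 - new_acc            # error rate of the new system
    return (baseline_err - new_err) / baseline_err


# Accuracies as reported in the abstract (CMM baseline vs. grammar).
token = relative_error_reduction(88.43, 93.71)   # per-token labeling
record = relative_error_reduction(45.29, 72.87)  # whole-record labeling

print(f"token-level error reduction:  {token:.1%}")
print(f"record-level error reduction: {record:.1%}")
```

Under this reading, the token-level error falls from 11.57% to 6.29% (a relative reduction of roughly 46%) and the record-level error falls from 54.71% to 27.13% (roughly 50%), so the headline figure appears to match the record-level result most closely.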

References

[1] Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatically extracting structure from free text addresses. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. IEEE, 2000.
[2] Remco Bouckaert. Low level information extraction: A Bayesian network based approach. In Proc. TextML 2002, Sydney, Australia, 2002.
[3] Claire Cardie and David Pierce. Proposal for an interactive environment for information extraction. Technical Report TR98-1702, 2, 1998.
[4] Rich Caruana, Paul Hodor, and John Rosenberg. High precision information extraction. In KDD-2000 Workshop on Text Mining, August 2000.
[5] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP02), 2002.
[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[7] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
[8] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148–156, 1996.
[9] T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random fields. In Proceedings of the 19th international conference on artificial intelligence, AAAI, pages 412–418, 2004.
[10] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.
[11] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[12] M. Marcus, G. Kim, M. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The Penn Treebank: Annotating predicate argument structure, 1994.
[13] Andrew McCallum. Efficiently inducing features of conditional random fields. In Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03), 2003.
[14] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Marti Hearst and Mari Ostendorf, editors, HLT-NAACL, Edmonton, Alberta, Canada, 2003. Association for Computational Linguistics.
[15] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In IJCAI'99 Workshop on Information Filtering, 1999.
[16] David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. Table extraction using conditional random fields. In Proceedings of the ACM SIGIR, 2003.
[17] L. R. Rabiner. A tutorial on hidden Markov models. In Proc. of the IEEE, volume 77, pages 257–286, 1989.
[18] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001, 2001.
[19] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Marti Hearst and Mari Ostendorf, editors, HLT-NAACL: Main Proceedings, pages 213–220, Edmonton, Alberta, Canada, 2003. Association for Computational Linguistics.
[20] J. Stylos, B. A. Myers, and A. Faulring. Citrine: providing intelligent copy-and-paste. In Proceedings of ACM Symposium on User Interface Software and Technology (UIST 2004), pages 185–188, 2005.
[21] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Empirical Methods in Natural Language Processing (EMNLP04), 2004.

Published In

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN: 1595930345
DOI: 10.1145/1076034

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. conditional random fields
  2. discriminative grammars
  3. discriminative models
  4. information retrieval
  5. perceptron training
  6. text tagging

Qualifiers

  • Article

Conference

SIGIR05

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2023) DocToModel: Automated Authoring of Models from Diverse Requirements Specification Documents. Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice, pages 199–210. DOI: 10.1109/ICSE-SEIP58684.2023.00024. Online publication date: 17-May-2023.
  • (2021) Artificial Intelligence, Social Media and Supply Chain Management: The Way Forward. Electronics, 10:19 (2348). DOI: 10.3390/electronics10192348. Online publication date: 25-Sep-2021.
  • (2021) Fast Attention-based Learning-To-Rank Model for Structured Map Search. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 942–951. DOI: 10.1145/3404835.3462904. Online publication date: 11-Jul-2021.
  • (2021) Generation of Oracles using Natural Language Processing. 2021 28th Asia-Pacific Software Engineering Conference Workshops (APSEC Workshops), pages 25–31. DOI: 10.1109/APSECW53869.2021.00016. Online publication date: Dec-2021.
  • (2019) Automatic Security Classification Based on Incremental Learning and Similarity Comparison. 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), pages 812–817. DOI: 10.1109/ITAIC.2019.8785798. Online publication date: May-2019.
  • (2017) Context-aware Service Input Ranking by Learning from Historical Information. IEEE Transactions on Services Computing, pages 1–1. DOI: 10.1109/TSC.2017.2777487. Online publication date: 2017.
  • (2017) Automated U.S diplomatic cables security classification: Topic model pruning vs. classification based on clusters. 2017 IEEE International Symposium on Technologies for Homeland Security (HST), pages 1–6. DOI: 10.1109/THS.2017.7943471. Online publication date: Apr-2017.
  • (2016) NEREA: Named Entity Recognition and Disambiguation Exploiting Local Document Repositories. 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), pages 1035–1042. DOI: 10.1109/ICTAI.2016.0159. Online publication date: Nov-2016.
  • (2016) Information Extraction on Weather Forecasts with Semantic Technologies. Natural Language Processing and Information Systems, pages 140–151. DOI: 10.1007/978-3-319-41754-7_12. Online publication date: 17-Jun-2016.
  • (2015) A new approach to geocoding. Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 1–10. DOI: 10.1145/2820783.2820827. Online publication date: 3-Nov-2015.