Learning to extract information from semi-structured text using a discriminative context free grammar

Authors:

Mukund Narasimhan

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 330 - 337

https://doi.org/10.1145/1076034.1076091

Published: 15 August 2005

Abstract

In recent work, conditional Markov chain models (CMMs) have been used to extract information from semi-structured text (one example is the Conditional Random Field [10]). Applications range from finding the author and title in research papers to finding the phone number and street address in a web page. The CMM framework combines a priori knowledge encoded as features with a set of labeled training data to learn an efficient extraction process. We show that similar problems can be solved more effectively by learning a discriminative context free grammar from training data. The grammar has several distinct advantages: long-range, even global, constraints can be used to disambiguate entity labels; training data is used more efficiently; and a set of new, more powerful features can be introduced. The grammar-based approach also yields semantic information (encoded in the form of a parse tree) that could be used for IR applications such as question answering. The specific problem we consider is that of extracting personal contact, or address, information from unstructured sources such as documents and emails. While linear-chain CMMs perform reasonably well on this task, we show that a statistical parsing approach results in a 50% reduction in error rate. This system also has the advantage of being interactive, similar to the system described in [9]. In cases where there are multiple errors, a single user correction can be propagated to correct multiple errors automatically. Using a discriminatively trained grammar, 93.71% of all tokens are labeled correctly (compared to 88.43% for a CMM) and 72.87% of records have all tokens labeled correctly (compared to 45.29% for the CMM).
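
The accuracy figures quoted at the end of the abstract can be related to the stated 50% reduction in error rate with a little arithmetic. The following is a minimal sketch, assuming that "error rate" means 100 minus the reported accuracy; the function name `relative_error_reduction` is illustrative and does not come from the paper.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors removed, given accuracies in percent.

    Assumption: error rate = 100 - accuracy (not stated explicitly in the abstract).
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Accuracies (percent) as reported in the abstract: CMM first, grammar second.
token_reduction = relative_error_reduction(88.43, 93.71)
record_reduction = relative_error_reduction(45.29, 72.87)

print(f"token-level error reduction:  {token_reduction:.1%}")
print(f"record-level error reduction: {record_reduction:.1%}")
```

Under that assumption, record-level errors fall by roughly half and token-level errors by a little under half, which suggests the 50% figure refers to records in which every token is labeled correctly.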

References

[1]

Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatically extracting structure from free text addresses. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. IEEE, 2000.

[2]

Remco Bouckaert. Low level information extraction: A bayesian network based approach. In Proc. TextML 2002, Sydney, Australia, 2002.

[3]

Claire Cardie and David Pierce. Proposal for an interactive environment for information extraction. Technical Report TR98-1702, 2, 1998.

[4]

Rich Caruana, Paul Hodor, and John Rosenberg. High precision information extraction. In KDD-2000 Workshop on Text Mining, August 2000.

[5]

M. Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP02), 2002.

[6]

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273--297, 1995.

[7]

Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277--296, 1999.

[8]

Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148--156, 1996.

[9]

T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random fields. In Proceedings of the 19th international conference on artificial intelligence, AAAI, pages 412--418, 2004.

[10]

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282--289. Morgan Kaufmann, San Francisco, CA, 2001.

[11]

C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.

[12]

M. Marcus, G. Kim, M. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The penn treebank: Annotating predicate argument structure, 1994.

[13]

Andrew McCallum. Efficiently inducing features of conditional random fields. In Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03), 2003.

[14]

Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Marti Hearst and Mari Ostendorf, editors, HLT-NAACL, Edmonton, Alberta, Canada, 2003. Association for Computational Linguistics.

[15]

Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In IJCAI'99 Workshop on Information Filtering, 1999.

[16]

David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. Table extraction using conditional random fields. In Proceedings of the ACM SIGIR, 2003.

[17]

L.R. Rabiner. A tutorial on hidden markov models. In Proc. of the IEEE, volume 77, pages 257--286, 1989.

[18]

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden markov models for information extraction. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001, 2001.

[19]

Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Marti Hearst and Mari Ostendorf, editors, HLT-NAACL: Main Proceedings, pages 213--220, Edmonton, Alberta, Canada, 2003. Association for Computational Linguistics.

[20]

J. Stylos, B. A. Myers, and A. Faulring. Citrine: providing intelligent copy-and-paste. In Proceedings of ACM Symposium on User Interface Software and Technology (UIST 2004), pages 185--188, 2005.

[21]

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Empirical Methods in Natural Language Processing (EMNLP04), 2004.

Cited By

Rajbhoj A, Nistala P, Kulkarni V, Soni S, Pathan A, Liu A, Muccini H (2023). DocToModel: Automated Authoring of Models from Diverse Requirements Specification Documents. Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice, pp. 199-210. Online publication date: 17-May-2023.
https://dl.acm.org/doi/10.1109/ICSE-SEIP58684.2023.00024
Khatua A, Khatua A, Chi X, Cambria E (2021). Artificial Intelligence, Social Media and Supply Chain Management: The Way Forward. Electronics 10:19 (2348). Online publication date: 25-Sep-2021.
https://doi.org/10.3390/electronics10192348
Zhang C, Evans M, Lepikhin M, Yankov D, Diaz F, Shah C, Suel T, Castells P, Jones R, Sakai T (2021). Fast Attention-based Learning-To-Rank Model for Structured Map Search. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 942-951. Online publication date: 11-Jul-2021.
https://dl.acm.org/doi/10.1145/3404835.3462904

Index Terms

Learning to extract information from semi-structured text using a discriminative context free grammar
1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing systems and tools
2. Information systems

Information & Contributors

Information

Published In

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

August 2005

708 pages

ISBN:1595930345

DOI:10.1145/1076034

General Chairs: Ricardo Baeza-Yates (University of Chile, Chile), Nivio Ziviani (Federal University of Minas Gerais, Brazil)

Program Chairs: Gary Marchionini (University of North Carolina, USA), Alistair Moffat (University of Melbourne, Australia), John Tait (University of Sunderland, UK)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers: Article

Conference: SIGIR05: The 28th ACM/SIGIR International Symposium on Information Retrieval 2005, August 15 - 19, 2005, Salvador, Brazil

Sponsor: SIGIR

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 42
Total Downloads: 1,482
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 1

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Leong I, Barbosa R (2021). Generation of Oracles using Natural Language Processing. 2021 28th Asia-Pacific Software Engineering Conference Workshops (APSEC Workshops), pp. 25-31. Online publication date: Dec-2021.
https://doi.org/10.1109/APSECW53869.2021.00016
Liang Y, Wen Z, Tao Y, Li G, Guo B (2019). Automatic Security Classification Based on Incremental Learning and Similarity Comparison. 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), pp. 812-817. Online publication date: May-2019.
https://doi.org/10.1109/ITAIC.2019.8785798
Wang S, Zou Y, Ng J, Ng T (2017). Context-aware Service Input Ranking by Learning from Historical Information. IEEE Transactions on Services Computing, pp. 1-1. Online publication date: 2017.
https://doi.org/10.1109/TSC.2017.2777487
Alzhrani K, Rudd E, Chow C, Boult T (2017). Automated U.S diplomatic cables security classification: Topic model pruning vs. classification based on clusters. 2017 IEEE International Symposium on Technologies for Homeland Security (HST), pp. 1-6. Online publication date: Apr-2017.
https://doi.org/10.1109/THS.2017.7943471
Garrido A, Ilarri S, Sangiao S, Ganan A, Bean A, Cardiel O (2016). NEREA: Named Entity Recognition and Disambiguation Exploiting Local Document Repositories. 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1035-1042. Online publication date: Nov-2016.
https://doi.org/10.1109/ICTAI.2016.0159
Garrido A, Buey M, Muñoz G, Casado-Rubio J (2016). Information Extraction on Weather Forecasts with Semantic Technologies. Natural Language Processing and Information Systems, pp. 140-151. Online publication date: 17-Jun-2016.
https://doi.org/10.1007/978-3-319-41754-7_12
Berkhin P, Evans M, Teodorescu F, Wu W, Yankov D, Ali M, Huang Y, Gertz M, Renz M, Sankaranarayanan J (2015). A new approach to geocoding. Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 1-10. Online publication date: 3-Nov-2015.
https://dl.acm.org/doi/10.1145/2820783.2820827
