DOI: 10.5555/1613715.1613719

Regular expression learning for information extraction

Published: 25 October 2008

Abstract

Regular expressions have served as the dominant workhorse of practical information extraction for several years. However, there has been little work on reducing the manual effort involved in building high-quality, complex regular expressions for information extraction tasks. In this paper, we propose ReLIE, a novel transformation-based algorithm for learning such complex regular expressions. We evaluate the performance of our algorithm on multiple datasets and compare it against the CRF algorithm. We show that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data. Finally, we show how the accuracy of CRF can be improved by using features extracted by ReLIE.
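
To make the idea of refining an extraction regex against labeled data concrete, here is a minimal sketch. It is not the ReLIE algorithm, which the abstract does not specify beyond being "transformation-based". It shows only one greedy selection step: score an initial regex and a few hand-written candidate refinements by F1 on labeled examples and keep the best. The function names `f1` and `refine`, the sample texts, and the candidate patterns are all hypothetical.

```python
import re

def f1(pattern, texts, gold):
    """F1 of the distinct strings matched by `pattern` against the gold set."""
    found = set()
    for text in texts:
        found.update(m.group(0) for m in re.finditer(pattern, text))
    true_positives = len(found & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def refine(initial, candidates, texts, gold):
    """One greedy step: return the best-scoring regex, keeping `initial` on ties."""
    best, best_score = initial, f1(initial, texts, gold)
    for candidate in candidates:
        score = f1(candidate, texts, gold)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Hypothetical labeled data: the gold set holds the phone numbers to extract.
texts = ["Call 555-1234 before 2008-10-25.", "Fax: 555-9876, room 12-3456."]
gold = {"555-1234", "555-9876"}

# An over-general starting regex and hand-written, more restrictive variants.
initial = r"\d+-\d+"
candidates = [r"\d{3}-\d+", r"\d{3}-\d{4}", r"\b\d{3}-\d{4}\b"]

best, score = refine(initial, candidates, texts, gold)
print(best, score)
```

On this toy data the initial regex also matches date and room fragments, so restricting the quantifiers raises precision without losing recall. A learner in the spirit of the abstract would generate such candidates automatically and repeat the step until no candidate improves the score.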

Published In

EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing
October 2008
1129 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 73 of 234 submissions, 31%

Cited By

  • (2024) Towards Automated Infographic Authoring From Natural Language Statement With Multiple Proportional Facts. IEEE Transactions on Multimedia 26, 7101-7113. DOI: 10.1109/TMM.2024.3360722. Online publication date: 31-Jan-2024
  • (2024) One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 193-206. DOI: 10.1109/CGO57630.2024.10444810. Online publication date: 2-Mar-2024
  • (2023) Learning from Uncurated Regular Expressions for Semantic Type Classification. Proceedings of the 1st Workshop on Simplicity in Management of Data, 1-5. DOI: 10.1145/3596225.3596226. Online publication date: 23-Jun-2023
  • (2021) TransRegex. Proceedings of the 43rd International Conference on Software Engineering, 1210-1222. DOI: 10.1109/ICSE43902.2021.00111. Online publication date: 22-May-2021
  • (2019) How to Invest my Time. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2305-2313. DOI: 10.1145/3292500.3330773. Online publication date: 25-Jul-2019
  • (2019) A novel approach for Web page modeling in personal information extraction. World Wide Web 22:2, 603-620. DOI: 10.1007/s11280-018-0631-9. Online publication date: 1-Mar-2019
  • (2018) FlashProfile: a framework for synthesizing data profiles. Proceedings of the ACM on Programming Languages 2:OOPSLA, 1-28. DOI: 10.1145/3276520. Online publication date: 24-Oct-2018
  • (2018) How well are regular expressions tested in the wild? Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 668-678. DOI: 10.1145/3236024.3236072. Online publication date: 26-Oct-2018
  • (2018) FAHES. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2100-2109. DOI: 10.1145/3219819.3220109. Online publication date: 19-Jul-2018
  • (2018) API Learning. Companion Proceedings of the The Web Conference 2018, 151-154. DOI: 10.1145/3184558.3186966. Online publication date: 23-Apr-2018
