DOI:10.1145/1718487.1718501
research-article

Coupled semi-supervised learning for information extraction

Published: 04 February 2010

Abstract

We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled examples is typically unreliable because the learning task is underconstrained. This paper pursues the thesis that much greater accuracy can be achieved by further constraining the learning task, by coupling the semi-supervised training of many extractors for different categories and relations. We characterize several ways in which the training of category and relation extractors can be coupled, and present experimental results demonstrating significantly improved accuracy as a result.
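
To make the idea of coupling concrete, the following is a minimal Python sketch, not the system described in the paper. It assumes one particular coupling constraint, mutual exclusion between categories, which the abstract itself does not spell out. The `bootstrap` function, its parameters, the toy corpus, and the names in it are all hypothetical and chosen only for illustration.

```python
def bootstrap(seeds, corpus, exclusive, rounds=2, coupled=True):
    """Toy bootstrapper over (context, noun_phrase) pairs (illustrative only).

    seeds:     {category: set of labeled seed noun phrases}
    corpus:    list of (context_pattern, noun_phrase) co-occurrences
    exclusive: {category: categories assumed to be disjoint from it}
    """
    instances = {cat: set(s) for cat, s in seeds.items()}
    for _ in range(rounds):
        proposals = {}
        for cat, known in instances.items():
            # Contexts in which a currently trusted instance appears.
            contexts = {ctx for ctx, np in corpus if np in known}
            # Every noun phrase seen in one of those contexts is a candidate.
            proposals[cat] = {np for ctx, np in corpus if ctx in contexts}
        accepted = {}
        for cat, candidates in proposals.items():
            new = candidates - instances[cat]
            if coupled:
                # Assumed coupling constraint: reject a candidate that a
                # disjoint category also proposes or already holds.
                for other in exclusive.get(cat, ()):
                    new -= proposals[other] | instances[other]
            accepted[cat] = new
        # Commit all categories together so the result is order-independent.
        for cat, new in accepted.items():
            instances[cat] |= new
    return instances


# Hypothetical toy data: one seed per category, a handful of unlabeled pairs.
corpus = [
    ("_ scored a goal", "Alice Moreno"),
    ("_ scored a goal", "Ben Okafor"),
    ("fans of _", "Alice Moreno"),
    ("fans of _", "soccer"),
    ("played _ professionally", "soccer"),
    ("played _ professionally", "tennis"),
]
seeds = {"athlete": {"Alice Moreno"}, "sport": {"soccer"}}
exclusive = {"athlete": {"sport"}, "sport": {"athlete"}}

coupled = bootstrap(seeds, corpus, exclusive, coupled=True)
uncoupled = bootstrap(seeds, corpus, exclusive, coupled=False)
print("coupled:  ", coupled)
print("uncoupled:", uncoupled)
```

On this toy data the ambiguous context "fans of _" lets the uncoupled run pull sports into the athlete category and athletes into the sport category, while the coupled run keeps the two categories apart. This is the sense in which an extra constraint can make an underconstrained learning task more reliable.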

      Published In

      WSDM '10: Proceedings of the third ACM international conference on Web search and data mining
      February 2010
      468 pages
      ISBN:9781605588896
      DOI:10.1145/1718487
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. bootstrap learning
      2. information extraction
      3. semi-supervised learning
      4. web mining

      Qualifiers

      • Research-article

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 33
      • Downloads (Last 6 weeks): 7
      Reflects downloads up to 04 Oct 2024

      Cited By

      • (2024) A system for automatic construction of knowledge graphs of mathematical documents. Uchenye Zapiski Kazanskogo Universiteta. Seriya Fiziko-Matematicheskie Nauki, 165:3 (264-281). DOI:10.26907/2541-7746.2023.3.264-281. Online publication date: 12-Jan-2024
      • (2024) Doc‐KG: Unstructured documents to knowledge graph construction, identification and validation with Wikidata. Expert Systems. DOI:10.1111/exsy.13617. Online publication date: 8-May-2024
      • (2024) Recent Developments in Recommender Systems: A Survey [Review Article]. IEEE Computational Intelligence Magazine, 19:2 (78-95). DOI:10.1109/MCI.2024.3363984. Online publication date: May-2024
      • (2024) Lifelong Hierarchical Topic Modeling via Non-negative Matrix Factorization. Web and Big Data (155-170). DOI:10.1007/978-981-97-2421-5_11. Online publication date: 12-May-2024
      • (2023) Tolerance-based granular methods: Foundations and applications in natural language processing. Intelligent Decision Technologies, 17:1 (139-158). DOI:10.3233/IDT-220214. Online publication date: 20-Apr-2023
      • (2023) Review of Knowledge Graph and Its Vertical Applications in Industry. 2023 42nd Chinese Control Conference (CCC) (5151-5157). DOI:10.23919/CCC58697.2023.10240572. Online publication date: 24-Jul-2023
      • (2023) KGFlex: Efficient Recommendation with Sparse Feature Factorization and Knowledge Graphs. ACM Transactions on Recommender Systems, 1:4 (1-30). DOI:10.1145/3588901. Online publication date: 3-Apr-2023
      • (2023) Deep Learning-Based Joint Extraction Model of Entity Relationships for Cloud Operations Knowledge Graph. 2023 5th International Academic Exchange Conference on Science and Technology Innovation (IAECST) (775-786). DOI:10.1109/IAECST60924.2023.10502732. Online publication date: 8-Dec-2023
      • (2023) Recommending on graphs: a comprehensive review from a data perspective. User Modeling and User-Adapted Interaction, 33:4 (803-888). DOI:10.1007/s11257-023-09359-w. Online publication date: 13-Mar-2023
      • (2023) Knowledge Representation Learning and Knowledge-Guided NLP. Representation Learning for Natural Language Processing (273-349). DOI:10.1007/978-981-99-1600-9_9. Online publication date: 24-Aug-2023