DOI: 10.1145/2124295.2124327
research-article

WebSets: extracting sets of entities from the web using unsupervised information extraction

Published: 08 February 2012

Abstract

We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms with concept-instance pairs obtained using Hearst patterns. In contrast, our method relies on a novel approach to clustering terms found in HTML tables, and then assigns concept names to these clusters using Hearst patterns. The method can be applied efficiently to a large corpus, and experimental results on several datasets show that it can accurately extract large numbers of concept-instance pairs.
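
The second step named in the abstract, assigning a concept name to a cluster of terms using Hearst patterns, can be illustrated with a minimal sketch. This is not the paper's implementation: the single "such as" pattern, the majority-vote rule, and the names `hearst_pairs` and `label_cluster` are all assumptions made for illustration, and the cluster of table terms is taken as given.

```python
import re
from collections import Counter

# One classic Hearst pattern: "<concept> such as <inst>, <inst> and <inst>".
# Assumption: single-word concepts and instances, to keep the sketch short.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def hearst_pairs(text):
    """Return (concept, instance) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text.lower()):
        concept = match.group(1)
        instances = re.split(r", | and | or ", match.group(2))
        pairs.extend((concept, inst) for inst in instances if inst)
    return pairs


def label_cluster(cluster, pairs):
    """Pick the concept that co-occurs with the most cluster members.

    Hypothetical majority-vote rule; returns None if no member was seen
    as an instance in any extracted pair.
    """
    votes = Counter(concept for concept, inst in pairs if inst in cluster)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    text = "Countries such as India, China and Japan export tea."
    pairs = hearst_pairs(text)
    # A cluster of terms, e.g. taken from one column of an HTML table.
    print(label_cluster({"india", "japan", "brazil"}, pairs))
```

In this toy setup, a term such as "brazil" that never appears in a Hearst match still receives the cluster's label, which is the practical benefit of naming clusters rather than individual terms.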

Supplementary Material

JPG File (wsdm_day2_session1_1.jpg)
MP4 File (wsdm_day2_session1_1.mp4)


    Published In

    WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining
    February 2012
    792 pages
    ISBN:9781450307475
    DOI:10.1145/2124295
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. clustering
    2. hyponymy relation acquisition
    3. web mining

    Qualifiers

    • Research-article

    Conference

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%
