WebSets: extracting sets of entities from the web using unsupervised information extraction

Authors: Bhavana Bharat Dalvi, William W. Cohen, Jamie Callan

WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining

Pages 243 - 252

https://doi.org/10.1145/2124295.2124327

Published: 08 February 2012

Abstract

We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms with concept-instance pairs obtained using Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigns concept names to these clusters using Hearst patterns. The method can be applied efficiently to a large corpus, and experimental results on several datasets show that it can accurately extract large numbers of concept-instance pairs.
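The labeling step mentioned in the abstract (assigning a concept name to a cluster of terms using Hearst patterns) can be illustrated with a minimal sketch. This is not the authors' implementation: the single "such as" regex, the helper names `hearst_pairs` and `label_cluster`, the majority-vote rule, and the toy corpus are all assumptions made for illustration, and the table-clustering step is taken as already done.

```python
import re
from collections import Counter

# One Hearst pattern: "<concept> such as <instance>, <instance> and <instance>".
# The regex is a simplification chosen for this sketch, not the paper's pattern set.
SUCH_AS = re.compile(r"(\w+) such as ([^.;]+)", re.IGNORECASE)


def hearst_pairs(text):
    """Return (concept, instance) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1).lower()
        # Split the enumeration on commas and on "and" / "or".
        for item in re.split(r",|\band\b|\bor\b", match.group(2)):
            instance = item.strip().lower()
            if instance:
                pairs.append((concept, instance))
    return pairs


def label_cluster(cluster, pairs):
    """Pick the concept that co-occurs with the most cluster members (hypothetical rule)."""
    members = {term.lower() for term in cluster}
    votes = Counter(concept for concept, instance in pairs if instance in members)
    return votes.most_common(1)[0][0] if votes else None


# Toy corpus and cluster, invented for this example.
corpus = (
    "He visited cities such as Pittsburgh, Seattle and Boston. "
    "Large cities such as Seattle attract startups. "
    "She likes fruits such as apples and pears."
)
cluster = ["Pittsburgh", "Seattle", "Boston"]  # e.g. terms taken from one table column
label = label_cluster(cluster, hearst_pairs(corpus))
```

In this sketch a cluster with no matching pattern evidence simply receives no label; a real system would need a richer pattern set and some handling of noisy matches.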

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning settings

Published In

WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining

February 2012

792 pages

ISBN: 9781450307475

DOI: 10.1145/2124295

General Chairs: Eytan Adar (University of Michigan, USA), Jaime Teevan (Microsoft Research, USA)

Program Chairs: Eugene Agichtein (Emory University, USA), Yoelle Maarek (Yahoo! Research, Israel)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM'12: Fifth ACM International Conference on Web Search and Data Mining

February 8 - 12, 2012

Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
