Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning

Authors:

Tak-Lam Wong

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

Pages 567 - 576

https://doi.org/10.1145/2433396.2433468

Published: 04 February 2013

Abstract

We develop a new framework for Wikipedia entity expansion and attribute extraction from the Web. The framework takes as seed input a few existing entities that are automatically collected from a particular Wikipedia category, and it explores their attribute infoboxes for clues that help discover more entities for this category together with the attribute content of the newly discovered entities. One characteristic of the framework is that discovery and extraction are conducted on desirable semi-structured data record sets that are automatically collected from the Web. A semi-supervised learning model based on Conditional Random Fields is developed to handle extraction learning when only a limited number of labeled examples can be derived from the seed entities. A proximate record graph, which captures alignment similarity among data records, guides the semi-supervised learning process: the learner leverages the unlabeled data in the record set by controlling label regularization under the guidance of this graph. Extensive experiments on different domains demonstrate the framework's superiority in discovering new entities and extracting attribute content.
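To make the graph-guided idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each data record is reduced to a sequence of tokens (for example HTML tag names), scores record pairs with a normalized Needleman-Wunsch style global alignment, links each record to its most similar neighbors, and then spreads seed labels over that graph. Plain label propagation stands in for the paper's CRF-based label regularization, and every name and parameter (`alignment_similarity`, `build_record_graph`, `propagate_labels`, `k`, the scoring weights) is hypothetical.

```python
def alignment_similarity(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score of two token sequences, scaled into [0, 1]."""
    n, m = len(a), len(b)
    if max(n, m) == 0:
        return 1.0
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Identical sequences score match * length; negative scores clamp to 0.
    return max(0.0, prev[m] / (match * max(n, m)))


def build_record_graph(records, k=2):
    """Link each record to its k most similar records (assumed graph shape)."""
    graph = {i: {} for i in range(len(records))}
    for i in range(len(records)):
        sims = [(alignment_similarity(records[i], records[j]), j)
                for j in range(len(records)) if j != i]
        sims.sort(reverse=True)
        for score, j in sims[:k]:
            if score > 0:
                graph[i][j] = score
                graph[j][i] = score
    return graph


def propagate_labels(graph, seeds, labels, iterations=20):
    """Spread seed label distributions to unlabeled records over the graph."""
    uniform = {label: 1.0 / len(labels) for label in labels}
    dist = {node: dict(seeds.get(node, uniform)) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            if node in seeds or total == 0:
                # Seeds stay clamped; isolated records keep their distribution.
                updated[node] = dict(dist[node])
                continue
            updated[node] = {
                label: sum(w * dist[nb][label] for nb, w in neighbors.items()) / total
                for label in labels
            }
        dist = updated
    return dist


# Toy record set: token sequences standing in for semi-structured records.
records = [
    ["td", "a", "td", "text"],        # seed: entity record
    ["td", "a", "td", "text"],
    ["td", "a", "td", "text", "br"],
    ["div", "img"],                   # seed: non-entity record
    ["div", "img", "span"],
]
seeds = {0: {"entity": 1.0, "other": 0.0}, 3: {"entity": 0.0, "other": 1.0}}
graph = build_record_graph(records, k=2)
result = propagate_labels(graph, seeds, ["entity", "other"])
for node in sorted(result):
    print(node, result[node])
```

In this toy run the unlabeled records that align well with a seed end up with that seed's label, which is the intuition behind letting alignment similarity control how strongly labels are smoothed across records.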

Cited By

Zheng Y, Shi C, Cao X, Li X, Wu B (2022) A Meta Path Based Method for Entity Set Expansion in Knowledge Graph. IEEE Transactions on Big Data 8:3 (616-629). Online publication date: 1-Jun-2022
https://doi.org/10.1109/TBDATA.2018.2805366
Wang Y, Cao C, Chen Z, Wang S (2022) CKGAC: A Commonsense Knowledge Graph About Attributes of Concepts. Knowledge Science, Engineering and Management (585-601). Online publication date: 6-Aug-2022
https://dl.acm.org/doi/10.1007/978-3-031-10983-6_45
Shi C, Ding J, Cao X, Hu L, Wu B, Li X (2021) Entity set expansion in knowledge graph: a heterogeneous information network perspective. Frontiers of Computer Science: Selected Publications from Chinese Universities 15:1. Online publication date: 1-Feb-2021
https://dl.acm.org/doi/10.1007/s11704-020-9240-8
Index Terms

Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning
1. Information systems
  1. Information systems applications

Information & Contributors

Information

Published In

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

February 2013

816 pages

ISBN: 9781450318693

DOI: 10.1145/2433396

General Chairs: Stefano Leonardi (Sapienza University of Rome, Italy), Alessandro Panconesi (Sapienza University of Rome, Italy)

Program Chairs: Paolo Ferragina (University of Pisa, Italy), Aristides Gionis (Yahoo! Research, Barcelona, Spain)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM 2013

WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining

February 4 - 8, 2013

Rome, Italy

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics & Citations

Article Metrics

Total Citations: 24
Total Downloads: 521
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Feb 2025

Cited By

Wang T, Guo J, Wu Z, Xu T (2021) IFTA: Iterative filtering by using TF-AICL algorithm for Chinese encyclopedia knowledge refinement. Applied Intelligence 51:8 (6265-6293). Online publication date: 1-Aug-2021
https://dl.acm.org/doi/10.1007/s10489-021-02220-w
Yuliana O, Chang C (2020) DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence 50:2 (271-295). Online publication date: 1-Feb-2020
https://dl.acm.org/doi/10.1007/s10489-019-01499-0
Sun X, Jiang L, Zhang M, Wang C, Chen Y (2019) Unsupervised Learning for Product Ontology from Textual Reviews on E-Commerce Sites. Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence (260-264). Online publication date: 20-Dec-2019
https://dl.acm.org/doi/10.1145/3377713.3377755
Dargahi Nobari A, Askari A, Hasibi F, Neshati M, Cuzzocrea A, Allan J, Paton N, Srivastava D, Agrawal R, Broder A, Zaki M, Candan S, Labrinidis A, Schuster A, Wang H (2018) Query Understanding via Entity Attribute Identification. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (1759-1762). Online publication date: 17-Oct-2018
https://dl.acm.org/doi/10.1145/3269206.3269245
Yuliana O, Chang C (2018) A novel alignment algorithm for effective web data extraction from singleton-item pages. Applied Intelligence 48:11 (4355-4370). Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.1007/s10489-018-1208-0
Liu S, Li Y, Fan B (2018) Hierarchical RNN for Few-Shot Information Extraction Learning. Data Science (227-239). Online publication date: 9-Sep-2018
https://doi.org/10.1007/978-981-13-2206-8_20
Er N, Ba M, Abdessalem T, Bressan S (2018) Tuple Reconstruction. Database Systems for Advanced Applications (239-254). Online publication date: 12-May-2018
https://doi.org/10.1007/978-3-319-91455-8_21