DOI: 10.1145/2433396.2433468

Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning

Published: 04 February 2013

Abstract

We develop a framework for expanding Wikipedia entities and extracting their attributes from the Web. The framework takes as seed input a few existing entities collected automatically from a particular Wikipedia category and mines their infoboxes for clues that help discover more entities for that category, along with the attribute content of the newly discovered entities. A distinguishing characteristic of the framework is that discovery and extraction operate on semi-structured data record sets that are collected automatically from the Web. To learn an extractor despite the limited number of labeled examples derivable from the seed entities, we develop a semi-supervised learning model based on Conditional Random Fields. A proximate record graph, which captures alignment similarity among data records, guides the semi-supervised learning process: label regularization is controlled under the guidance of this graph, so the learner can leverage the unlabeled data in the record set. Extensive experiments on different domains demonstrate the framework's superiority in discovering new entities and extracting attribute content.
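
To make the idea of a proximate record graph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each data record is a sequence of tokens, scores record pairs with a standard Needleman-Wunsch global alignment, and links each record to its k most similar neighbours. The scoring constants, the normalisation, and the names alignment_score, alignment_similarity and proximate_record_graph are all illustrative assumptions; the abstract does not specify them.

```python
from typing import Dict, List, Sequence, Tuple


def alignment_score(a: Sequence[str], b: Sequence[str],
                    match: float = 1.0, mismatch: float = -1.0,
                    gap: float = -1.0) -> float:
    """Global alignment score of two token sequences (Needleman-Wunsch DP).

    The scoring constants are illustrative assumptions.
    """
    m = len(b)
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]


def alignment_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Map the alignment score into [0, 1] (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    # With match = 1.0 the score never exceeds the shorter length,
    # so dividing by the longer length keeps the value at most 1.
    return max(0.0, alignment_score(a, b)) / longest


def proximate_record_graph(records: List[Sequence[str]],
                           k: int = 2) -> Dict[int, List[Tuple[int, float]]]:
    """Link every record to its k most similar records (hypothetical k-NN rule)."""
    graph: Dict[int, List[Tuple[int, float]]] = {}
    for i, rec in enumerate(records):
        sims = [(alignment_similarity(rec, other), j)
                for j, other in enumerate(records) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # most similar first, ties by index
        graph[i] = [(j, s) for s, j in sims[:k] if s > 0.0]
    return graph


if __name__ == "__main__":
    # Toy records: token sequences standing in for semi-structured data records.
    toy_records = [
        ["<li>", "NAME", "<b>", "YEAR", "</li>"],
        ["<li>", "NAME", "<b>", "YEAR", "</li>"],
        ["<li>", "NAME", "YEAR", "</li>"],
        ["<p>", "TEXT"],
    ]
    for node, edges in proximate_record_graph(toy_records, k=2).items():
        print(node, edges)
```

In a semi-supervised setting of the kind the abstract describes, the edge weights of such a graph could be used to encourage similar records to receive similar label distributions. How the paper actually controls that regularization is not stated in the abstract.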

    Published In

    WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
    February 2013
    816 pages
    ISBN:9781450318693
    DOI:10.1145/2433396

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. entity expansion
    2. information extraction
    3. proximate record graph
    4. semi-supervised learning

    Qualifiers

    • Research-article

    Conference

    WSDM 2013

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%


    Cited By

    • (2022) A Meta Path Based Method for Entity Set Expansion in Knowledge Graph. IEEE Transactions on Big Data 8:3 (616-629). DOI: 10.1109/TBDATA.2018.2805366. Online publication date: 1-Jun-2022
    • (2022) CKGAC: A Commonsense Knowledge Graph About Attributes of Concepts. Knowledge Science, Engineering and Management (585-601). DOI: 10.1007/978-3-031-10983-6_45. Online publication date: 6-Aug-2022
    • (2021) Entity set expansion in knowledge graph: a heterogeneous information network perspective. Frontiers of Computer Science: Selected Publications from Chinese Universities 15:1. DOI: 10.1007/s11704-020-9240-8. Online publication date: 1-Feb-2021
    • (2021) IFTA: Iterative filtering by using TF-AICL algorithm for Chinese encyclopedia knowledge refinement. Applied Intelligence 51:8 (6265-6293). DOI: 10.1007/s10489-021-02220-w. Online publication date: 1-Aug-2021
    • (2020) DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence 50:2 (271-295). DOI: 10.1007/s10489-019-01499-0. Online publication date: 1-Feb-2020
    • (2019) Unsupervised Learning for Product Ontology from Textual Reviews on E-Commerce Sites. Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence (260-264). DOI: 10.1145/3377713.3377755. Online publication date: 20-Dec-2019
    • (2018) Query Understanding via Entity Attribute Identification. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (1759-1762). DOI: 10.1145/3269206.3269245. Online publication date: 17-Oct-2018
    • (2018) A novel alignment algorithm for effective web data extraction from singleton-item pages. Applied Intelligence 48:11 (4355-4370). DOI: 10.1007/s10489-018-1208-0. Online publication date: 1-Nov-2018
    • (2018) Hierarchical RNN for Few-Shot Information Extraction Learning. Data Science (227-239). DOI: 10.1007/978-981-13-2206-8_20. Online publication date: 9-Sep-2018
    • (2018) Tuple Reconstruction. Database Systems for Advanced Applications (239-254). DOI: 10.1007/978-3-319-91455-8_21. Online publication date: 12-May-2018
