research-article

Employing Semantic Context for Sparse Information Extraction Assessment

Authors:

Xindong Wu

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 12, Issue 5

Article No.: 54, Pages 1 - 36

https://doi.org/10.1145/3201407

Published: 27 June 2018

Abstract

The huge amount of text available on the World Wide Web presents an unprecedented opportunity for information extraction (IE). One important assumption in IE is that frequent extractions are more likely to be correct. Sparse IE is hence a challenging task: no matter how big a corpus is, some extractions are supported by only a small amount of evidence in it. However, research on sparse IE is limited, especially on assessing the validity of sparse extractions. Motivated by this, we introduce a lightweight, explicit semantic approach for assessing sparse IE. First, we use a large semantic network consisting of millions of concepts, entities, and attributes to explicitly model the context of any semantic relationship. Second, we learn from three semantic contexts using different base classifiers to select an optimal classification model for assessing sparse extractions. Finally, experiments show that, compared with several state-of-the-art approaches, our approach can significantly improve the F-score in the assessment of sparse extractions while maintaining efficiency.
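As a rough illustration of the model-selection and evaluation steps named in the abstract, the sketch below scores a few candidate classifiers by F-score on labelled validation extractions and keeps the best one. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-element feature tuples (one hypothetical score per semantic context), the toy threshold rules standing in for base classifiers, and the names `f_score`, `select_best_model`, `CANDIDATES`, `X_VAL`, and `Y_VAL` are all invented for this example.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical feature layout: one score per semantic context.
Features = Tuple[float, float, float]
Classifier = Callable[[Features], bool]


def f_score(y_true: Sequence[bool], y_pred: Sequence[bool], beta: float = 1.0) -> float:
    """F-beta score of predicted labels against gold labels (F1 when beta == 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: report 0 rather than divide by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def select_best_model(
    candidates: Dict[str, Classifier],
    x_val: Sequence[Features],
    y_val: Sequence[bool],
) -> Tuple[str, float]:
    """Return the name and validation F1 of the highest-scoring candidate."""
    scores = {
        name: f_score(y_val, [clf(x) for x in x_val])
        for name, clf in candidates.items()
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


# Toy stand-ins for base classifiers (assumptions, not the paper's models).
CANDIDATES: Dict[str, Classifier] = {
    "mean_threshold": lambda x: sum(x) / len(x) >= 0.5,
    "max_threshold": lambda x: max(x) >= 0.5,
    "first_only": lambda x: x[0] >= 0.8,
}

# Toy validation set: feature tuples and whether each extraction is correct.
X_VAL = [(0.9, 0.8, 0.7), (0.2, 0.1, 0.3), (0.6, 0.7, 0.5), (0.1, 0.9, 0.1)]
Y_VAL = [True, False, True, False]

if __name__ == "__main__":
    print(select_best_model(CANDIDATES, X_VAL, Y_VAL))
```

In practice the candidates would be trained classifiers evaluated with cross-validation rather than fixed rules on a single split; the selection logic stays the same.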



Index Terms

Employing Semantic Context for Sparse Information Extraction Assessment
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results


Information & Contributors

Information

Published In


ACM Transactions on Knowledge Discovery from Data Volume 12, Issue 5

October 2018

354 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3234931

Editors: Charu Aggarwal (IBM T. J. Watson Research, USA) and Xindong Wu (University of Louisiana at Lafayette, USA)


Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2018

Accepted: 01 March 2018

Revised: 01 February 2018

Received: 01 September 2017

Published in TKDD Volume 12, Issue 5


Qualifiers

Research-article
Research
Refereed

Funding Sources

the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education
the Natural Science Foundation of China
the US National Science Foundation
National Key Research and Development Program of China
the Natural Science Foundation of Anhui province
