Employing Semantic Context for Sparse Information Extraction Assessment

Published: 27 June 2018

Abstract

The huge amount of text available on the World Wide Web presents an unprecedented opportunity for information extraction (IE). One important assumption in IE is that frequent extractions are more likely to be correct. Sparse IE is hence a challenging task: no matter how big a corpus is, some extractions are supported by only a small amount of evidence in it. However, research on sparse IE is limited, especially on assessing the validity of sparse extractions. Motivated by this, we introduce a lightweight, explicit semantic approach for assessing sparse IE. First, we use a large semantic network consisting of millions of concepts, entities, and attributes to explicitly model the context of any semantic relationship. Second, we learn from three semantic contexts using different base classifiers to select an optimal classification model for assessing sparse extractions. Finally, experiments show that, compared with several state-of-the-art approaches, our approach significantly improves the F-score in the assessment of sparse extractions while maintaining efficiency.
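
As a rough illustration of the model-selection step described in the abstract, the sketch below scores several candidate classifiers by F-score on labeled extractions and keeps the best one. It is not the authors' implementation: the three-value feature tuples (standing in for scores from three semantic contexts), the threshold rules `mean_rule` and `max_rule`, and the helper names `f_score` and `select_best` are all hypothetical, since the abstract does not specify the contexts or the base classifiers.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical feature vector: one score per semantic context (three in total).
Features = Tuple[float, float, float]
Classifier = Callable[[Features], int]


def f_score(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 1.0) -> float:
    """F-beta score for binary labels, where 1 marks a correct extraction."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def select_best(
    candidates: Dict[str, Classifier],
    features: Sequence[Features],
    labels: Sequence[int],
) -> Tuple[str, float]:
    """Return the name and F-score of the best-scoring candidate."""
    scores = {
        name: f_score(labels, [clf(x) for x in features])
        for name, clf in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Assumed stand-ins for base classifiers: simple threshold rules.
def mean_rule(x: Features) -> int:
    return 1 if sum(x) / len(x) >= 0.5 else 0


def max_rule(x: Features) -> int:
    return 1 if max(x) >= 0.5 else 0


# Toy, made-up data: five extractions with labels (1 = correct).
features = [
    (0.9, 0.8, 0.7),
    (0.2, 0.1, 0.3),
    (0.6, 0.1, 0.2),
    (0.1, 0.2, 0.1),
    (0.7, 0.6, 0.9),
]
labels = [1, 0, 1, 0, 1]
candidates = {"mean_rule": mean_rule, "max_rule": max_rule}

best_name, best_f = select_best(candidates, features, labels)
print(best_name, round(best_f, 3))
```

In practice the candidates would be trained classifiers evaluated on held-out data; the fixed rules here only keep the sketch self-contained.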

Information & Contributors

Information

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 12, Issue 5
October 2018
354 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3234931
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2018
Accepted: 01 March 2018
Revised: 01 February 2018
Received: 01 September 2017
Published in TKDD Volume 12, Issue 5

Author Tags

  1. Sparse information extraction
  2. classification
  3. isA relationship
  4. semantic network

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education
  • the Natural Science Foundation of China
  • the US National Science Foundation
  • National Key Research and Development Program of China
  • the Natural Science Foundation of Anhui province

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 2

Reflects downloads up to 04 Oct 2024

Citations

Cited By

  • (2023) TechPat: Technical Phrase Extraction for Patent Mining. ACM Transactions on Knowledge Discovery from Data 17:9 (1-31). DOI: 10.1145/3596603. Online publication date: 15-Jun-2023
  • (2022) Nested Named Entity Recognition: A Survey. ACM Transactions on Knowledge Discovery from Data 16:6 (1-29). DOI: 10.1145/3522593. Online publication date: 30-Jul-2022
  • (2022) Twitter Accounts Suggestion: Pipeline Technique SpaCy Entity Recognition. 2022 IEEE International Conference on Big Data (Big Data) (5121-5125). DOI: 10.1109/BigData55660.2022.10020570. Online publication date: 17-Dec-2022
  • (2021) DACHA: A Dual Graph Convolution Based Temporal Knowledge Graph Representation Learning Method Using Historical Relation. ACM Transactions on Knowledge Discovery from Data 16:3 (1-18). DOI: 10.1145/3477051. Online publication date: 22-Oct-2021
  • (2021) Text Recognition in the Wild. ACM Computing Surveys 54:2 (1-35). DOI: 10.1145/3440756. Online publication date: 5-Mar-2021
  • (2021) Data, measurement, and causal inferences in machine learning: opportunities and challenges for marketing. Journal of Marketing Theory and Practice (1-13). DOI: 10.1080/10696679.2020.1860683. Online publication date: 11-Jan-2021
  • (2020) Relation Extraction with Proactive Domain Adaptation Strategy. 2020 IEEE International Conference on Knowledge Graph (ICKG) (441-448). DOI: 10.1109/ICBK50248.2020.00069. Online publication date: Aug-2020
  • (2019) FarsBase. Semantic Web 10:6 (1169-1196). DOI: 10.3233/SW-190369. Online publication date: 1-Jan-2019
  • (2019) Limitations of information extraction methods and techniques for heterogeneous unstructured big data. International Journal of Engineering Business Management 11. DOI: 10.1177/1847979019890771. Online publication date: 9-Dec-2019
  • (2018) Automatic Semantic Network Generation from Unstructured Documents – The Options. 2018 5th International Conference on Soft Computing & Machine Intelligence (ISCMI) (72-78). DOI: 10.1109/ISCMI.2018.8703225. Online publication date: Nov-2018
