Mitigating linked data quality issues in knowledge-intense information extraction methods

Authors:

Albert Weichselbraun,

Philipp Kuntschik

WIMS '17: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics

Article No.: 17, Pages 1 - 12

https://doi.org/10.1145/3102254.3102272

Published: 19 June 2017

Abstract

Advances in research areas such as named entity linking and sentiment analysis have triggered the emergence of knowledge-intensive information extraction methods that combine classical information extraction with background knowledge from the Web. Despite data quality concerns, linked data sources such as DBpedia, GeoNames and Wikidata, which encode facts in a standardized structured format, are particularly attractive for such applications.

This paper addresses the problem of data quality by introducing a framework that elaborates on linked data quality issues relevant to different stages of the background knowledge acquisition process, their impact on information extraction performance and applicable mitigation strategies. Applying this framework to named entity linking and data enrichment demonstrates the potential of the introduced mitigation strategies to lessen the impact of different kinds of data quality problems. An industrial use case that aims at the automatic generation of image metadata from image descriptions illustrates the successful deployment of knowledge-intensive information extraction in real-world applications and constraints introduced by data quality concerns.

Cited By

Weichselbraun A, Hörler S, Hauser C, Havelka A, Chbeir R, Manolopoulos Y, Akerkar R, Mizera-Pietraszko J (2020). Classifying News Media Coverage for Corruption Risks Management with Deep Learning and Web Intelligence. Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, 54-62. DOI: 10.1145/3405962.3405988. Online publication date: 30-Jun-2020.
https://dl.acm.org/doi/10.1145/3405962.3405988
Weichselbraun A, Kuntschik P, Braşoveanu A (2018). Mining and Leveraging Background Knowledge for Improving Named Entity Linking. Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 1-11. DOI: 10.1145/3227609.3227670. Online publication date: 25-Jun-2018.
https://dl.acm.org/doi/10.1145/3227609.3227670

Index Terms

Mitigating linked data quality issues in knowledge-intense information extraction methods
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Incomplete data
        Inconsistent data
    2. Information integration

Published In

WIMS '17: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics

June 2017

268 pages

ISBN:9781450352253

DOI:10.1145/3102254

Conference Chair: Rajendra Akerkar (Western Norway Research Institute, Norway)

General Chair: Alfredo Cuzzocrea (University of Trieste and ICAR-CNR, Italy)

Program Chairs: Jannong Cao (Hong Kong Polytechnic University, Hong Kong), Mohand-Said Hacid (University of Claude Bernard Lyon 1, France)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

COSPECS: Dipartimento di Scienze Cognitive Psicologiche Pedagogiche e degli Studi Culturali, Universita degli Studi di Messina
Arescon: Advanced Research & Solution Consulting
DISA-MIS: Dipartimento di Scienze Aziendali - Management e Innovation Systems, Universita Degli Studi di Salerno
Xenia Progetti: Xenia Progetti

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WIMS '17

WIMS '17: 7th International Conference on Web Intelligence, Mining and Semantics

June 19 - 22, 2017

Amantea, Italy

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%

Article Metrics

Total Citations: 2
Total Downloads: 137
Downloads (last 12 months): 6
Downloads (last 6 weeks): 0

Reflects downloads up to 05 Mar 2025
