DOI: 10.1145/3102254.3102272
research-article

Mitigating linked data quality issues in knowledge-intense information extraction methods

Published: 19 June 2017

Abstract

Advances in research areas such as named entity linking and sentiment analysis have triggered the emergence of knowledge-intensive information extraction methods that combine classical information extraction with background knowledge from the Web. Despite data quality concerns, linked data sources such as DBpedia, GeoNames and Wikidata, which encode facts in a standardized structured format, are particularly attractive for such applications.
This paper addresses the problem of data quality by introducing a framework that elaborates on linked data quality issues relevant to different stages of the background knowledge acquisition process, their impact on information extraction performance and applicable mitigation strategies. Applying this framework to named entity linking and data enrichment demonstrates the potential of the introduced mitigation strategies to lessen the impact of different kinds of data quality problems. An industrial use case that aims at the automatic generation of image metadata from image descriptions illustrates the successful deployment of knowledge-intensive information extraction in real-world applications and constraints introduced by data quality concerns.

Cited By

  • (2020) Classifying News Media Coverage for Corruption Risks Management with Deep Learning and Web Intelligence. Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, 54-62. DOI: 10.1145/3405962.3405988. Online publication date: 30-Jun-2020
  • (2018) Mining and Leveraging Background Knowledge for Improving Named Entity Linking. Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 1-11. DOI: 10.1145/3227609.3227670. Online publication date: 25-Jun-2018

Published In

WIMS '17: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics
June 2017
268 pages
ISBN:9781450352253
DOI:10.1145/3102254
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • COSPECS: Dipartimento di Scienze Cognitive Psicologiche Pedagogiche e degli Studi Culturali, Universita degli Studi di Messina
  • Arescon: Advanced Research & Solution Consulting
  • DISA-MIS: Dipartimento di Scienze Aziendali - Management e Innovation Systems, Universita Degli Studi di Salerno
  • Xenia Progetti: Xenia Progetti

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. applications
  2. information extraction
  3. linked data quality
  4. mitigation strategies
  5. named entity linking
  6. semantic technologies

Qualifiers

  • Research-article

Conference

WIMS '17

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%
