Multi-document semantic relation extraction for news analytics

Yongpan Sheng¹,
Zenglin Xu¹,
Yafang Wang² &
…
Gerard de Melo³

Abstract

Given the overwhelming amount of information in today's 24/7 stream of incoming articles, including news articles as well as business reports and social media, new techniques are needed that let users focus on just the key entities and concepts along with their relationships. Because relevant information may be distributed across diverse sources, identifying relevant connections is particularly challenging. In this paper, we propose a system called MuReX that helps users quickly discern salient connections and facts from a set of related documents and view the resulting information as a graph-based visualization. Our approach involves open information extraction, followed by careful transformation and filtering. We rely on integer linear programming to ensure that we retain only the most confident and compatible facts with regard to a user query, and finally apply a graph ranking approach to obtain a coherent graph that represents meaningful and salient relationships, which users may explore visually. Experimental results corroborate the effectiveness of our proposed approaches, and the local system we developed has been running for more than one year.
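
The final stage described in the abstract, ranking a graph built from extracted facts, can be illustrated with a minimal sketch. It assumes that facts are (subject, relation, object) triples and swaps in a plain PageRank-style power iteration as the ranking method; the abstract does not specify the actual ranking algorithm. The names `build_graph` and `rank_nodes` and the sample triples are hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map from (subject, relation, object) triples."""
    graph = defaultdict(set)
    for subj, _rel, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """Score nodes with a PageRank-style power iteration (an assumed stand-in)."""
    nodes = list(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor passes on an equal share of its current score.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / n + damping * incoming
        scores = updated
    return scores


# Hypothetical facts, for illustration only.
triples = [
    ("central bank", "raised", "interest rates"),
    ("central bank", "warned about", "inflation"),
    ("inflation", "affects", "consumers"),
]
scores = rank_nodes(build_graph(triples))
top = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the two better-connected concepts, "central bank" and "inflation", receive higher scores than the two endpoints, which is the kind of salience signal a graph ranking step could use to decide what to show first.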

Notes

https://safetyapp.shinyapps.io/GoWvis/
http://maggie.lt.informatik.tu-darmstadt.de/thesis/master/NetworksOfNames
http://tagesnetzwerk.de
http://www.newsleak.io/
A relational triple extracted by ClausIE, OLLIE, or Open IE 4 is judged a suitable extraction only if its confidence is greater than 0.85.
https://github.com/pilehvar/ADW
https://duc.nist.gov/
http://research.signalmedia.co/newsir16/signal-dataset.html
https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/ambiverse-nlu/clausie/
http://knowitall.github.io/ollie/
https://nlp.stanford.edu/software/openie.shtml
https://github.com/knowitall/openie
https://github.com/uma-pi1/minie
An entity or concept is regarded as a topic concept if it occurs in the topic words list as described in Section 4.1.
For popular OpenIE systems such as ClausIE, OLLIE, and Open IE 4, we rely on the confidence value computed by each system itself as the confidence score of each fact.
http://tomcat.apache.org/
http://www.mysql.com/
http://avalonjs.coding.me/
http://jquery.com/
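
The confidence rule in the notes above (an extraction is kept only if its system-reported confidence exceeds 0.85) can be sketched as follows. The `Fact` record, the `suitable_facts` helper, and the sample values are hypothetical; real ClausIE, OLLIE, or Open IE 4 output would first need to be parsed into this shape.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # threshold stated in the notes


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    confidence: float  # value reported by the OpenIE system itself


def suitable_facts(facts, threshold=CONFIDENCE_THRESHOLD):
    """Keep only facts whose confidence is strictly greater than the threshold."""
    return [f for f in facts if f.confidence > threshold]


# Hypothetical extractions, for illustration only.
facts = [
    Fact("company", "acquired", "startup", 0.93),
    Fact("company", "is", "large", 0.40),
    Fact("startup", "builds", "search tools", 0.85),
]
kept = suitable_facts(facts)
```

The comparison is strict, so a fact with confidence exactly 0.85 is dropped, matching the "greater than 0.85" wording of the note.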

Acknowledgments

This paper was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034). Yafang Wang’s research was supported by the National Natural Science Foundation of China (No. 61503217).

Author information

Authors and Affiliations

SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Yongpan Sheng & Zenglin Xu
Ant Financial Services Co., Hangzhou, Zhejiang, China
Yafang Wang
Department of Computer Science, Rutgers University–New Brunswick, Piscataway, NJ, USA
Gerard de Melo

Corresponding author

Correspondence to Gerard de Melo.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web and Big Data 2019

Guest Editors: Jie Shao, Man Lung Yiu, and Toyoda Masashi

About this article

Cite this article

Sheng, Y., Xu, Z., Wang, Y. et al. Multi-document semantic relation extraction for news analytics. World Wide Web 23, 2043–2077 (2020). https://doi.org/10.1007/s11280-020-00790-2

Received: 08 November 2019
Revised: 07 January 2020
Accepted: 15 January 2020
Published: 18 May 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11280-020-00790-2
