Story Forest: Extracting Events and Telling Stories from Breaking News

Published: 13 May 2020

Abstract

Accurately extracting events from vast news corpora and organizing them logically is critical for news apps and search engines, which aim to organize news information collected from the Internet and present it to users in the most sensible forms. Intuitively speaking, an event is a group of news documents that report the same news incident, possibly in different ways. In this article, we describe our experience of implementing a news content organization system at Tencent to discover events from vast streams of breaking news and to evolve news story structures in an online fashion. Our real-world system faces unique challenges in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we (1) need to accurately and quickly extract distinguishable events from massive streams of long text documents, and (2) must develop the structures of event stories in an online manner, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically cluster streaming documents into events, while connecting related events in growing trees to tell evolving stories. A core novelty of our Story Forest system is EventX, a semi-supervised scheme to extract events from massive Internet news corpora. EventX relies on a two-layered, graph-based clustering procedure to group documents into fine-grained events. We conducted extensive evaluations based on (1) 60 GB of real-world Chinese news data, (2) a large Chinese Internet news dataset that contains 11,748 news articles with ground-truth event labels, and (3) the 20 Newsgroups English dataset, as well as through detailed pilot user experience studies. The results demonstrate the superior capabilities of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers.
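The abstract does not spell out the internals of EventX, so the following is only a toy sketch of what a two-layered, graph-based clustering could look like, not the authors' algorithm. It assumes each document is already reduced to a set of keywords. Layer 1 groups keywords into coarse topics using a co-occurrence graph, and layer 2 splits the documents of each topic into fine-grained events using a document similarity graph. Connected components stand in for community detection, Jaccard overlap stands in for document similarity, and the names `two_layer_cluster`, `min_cooccur`, and `min_sim` are hypothetical. Unlike EventX, the sketch is fully unsupervised.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Return the connected components (as sets) of an undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def jaccard(a, b):
    """Jaccard overlap of two keyword sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_layer_cluster(docs, min_cooccur=2, min_sim=0.5):
    """Cluster docs (doc_id -> non-empty keyword set) into events.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Layer 1: keyword co-occurrence graph -> coarse topics.
    cooccur = defaultdict(int)
    for keywords in docs.values():
        for pair in combinations(sorted(keywords), 2):
            cooccur[pair] += 1
    keyword_edges = [pair for pair, n in cooccur.items() if n >= min_cooccur]
    all_keywords = sorted(set().union(*docs.values()))
    topics = connected_components(all_keywords, keyword_edges)

    # Assign each document to the topic sharing the most keywords with it.
    by_topic = defaultdict(list)
    for doc_id, keywords in docs.items():
        best = max(range(len(topics)), key=lambda i: len(keywords & topics[i]))
        by_topic[best].append(doc_id)

    # Layer 2: per-topic document similarity graph -> fine-grained events.
    events = []
    for topic_id in sorted(by_topic):
        members = by_topic[topic_id]
        doc_edges = [
            (a, b)
            for a, b in combinations(members, 2)
            if jaccard(docs[a], docs[b]) >= min_sim
        ]
        events.extend(connected_components(members, doc_edges))
    return events


docs = {
    "d1": {"earthquake", "japan", "tsunami"},
    "d2": {"earthquake", "japan", "rescue", "donation"},
    "d3": {"earthquake", "japan", "tsunami", "warning"},
    "d4": {"election", "senate", "vote"},
    "d5": {"election", "senate", "vote", "recount"},
}
events = two_layer_cluster(docs)
print(sorted(sorted(event) for event in events))
```

In this made-up example, d1, d2, and d3 fall under the same coarse topic, but d2 shares too few keywords with the other two and becomes its own event. This mirrors the distinction the abstract draws between a broad topic and a fine-grained event.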

Information

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 3
June 2020
381 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3388473

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 May 2020
Online AM: 07 May 2020
Accepted: 01 January 2020
Revised: 01 November 2019
Received: 01 December 2018
Published in TKDD Volume 14, Issue 3

Author Tags

  1. EventX
  2. Story Forest
  3. community detection
  4. document clustering
  5. news articles organization

Qualifiers

  • Research-article
  • Research
  • Refereed
