Story Forest: Extracting Events and Telling Stories from Breaking News

Authors:

Yu Xu

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 14, Issue 3

Article No.: 31, Pages 1 - 28

https://doi.org/10.1145/3377939

Published: 13 May 2020

Abstract

Extracting events accurately from vast news corpora and organizing them logically is critical for news apps and search engines, which aim to organize news information collected from the Internet and present it to users in the most sensible forms. Intuitively speaking, an event is a group of news documents that report the same news incident, possibly in different ways. In this article, we describe our experience of implementing a news content organization system at Tencent to discover events from vast streams of breaking news and to evolve news story structures in an online fashion. Our real-world system faces unique challenges in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we (1) need to accurately and quickly extract distinguishable events from massive streams of long text documents, and (2) must develop the structures of event stories in an online manner, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically cluster streaming documents into events, while connecting related events in growing trees to tell evolving stories. A core novelty of our Story Forest system is EventX, a semi-supervised scheme to extract events from massive Internet news corpora. EventX relies on a two-layered, graph-based clustering procedure to group documents into fine-grained events. We conducted extensive evaluations based on (1) 60 GB of real-world Chinese news data, (2) a large Chinese Internet news dataset that contains 11,748 news articles with ground-truth event labels, and (3) the 20 News Groups English dataset, as well as detailed pilot user experience studies. The results demonstrate the superior capabilities of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers.
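To make the idea of a two-layered, graph-based clustering procedure more concrete, the following is a minimal Python sketch, not the EventX implementation. The abstract does not specify the graph construction, the similarity measure, or the community detection step, so everything here is an assumption chosen for illustration: documents are represented as keyword sets, the first layer groups keywords into coarse topics via connected components of a keyword co-occurrence graph, and the second layer splits each topic into finer events via connected components of a document graph built with Jaccard similarity. The names `two_layer_cluster`, `min_cooccur`, and `doc_sim` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Return the connected components (as sets) of an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for a learned similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_layer_cluster(docs, min_cooccur=2, doc_sim=0.6):
    """docs maps doc_id -> set of keywords; returns a list of events (sets of doc ids)."""
    # Layer 1: keyword co-occurrence graph -> coarse topics.
    cooccur = defaultdict(int)
    for kws in docs.values():
        for a, b in combinations(sorted(kws), 2):
            cooccur[(a, b)] += 1
    keywords = sorted({k for kws in docs.values() for k in kws})
    kw_edges = [pair for pair, count in cooccur.items() if count >= min_cooccur]
    topics = connected_components(keywords, kw_edges)

    # Assign each document to the topic sharing the most keywords with it.
    by_topic = defaultdict(list)
    for doc_id, kws in docs.items():
        best = max(range(len(topics)), key=lambda i: len(kws & topics[i]))
        by_topic[best].append(doc_id)

    # Layer 2: document similarity graph inside each topic -> fine-grained events.
    events = []
    for ids in by_topic.values():
        edges = [(a, b) for a, b in combinations(ids, 2)
                 if jaccard(docs[a], docs[b]) >= doc_sim]
        events.extend(connected_components(ids, edges))
    return events


# Toy input (invented for illustration only).
docs = {
    "d1": {"quake", "tokyo", "rescue"},
    "d2": {"quake", "tokyo", "rescue"},
    "d3": {"quake", "tokyo", "aftershock"},
    "d4": {"election", "vote", "senate"},
    "d5": {"election", "vote", "senate"},
}
events = two_layer_cluster(docs)
```

In this toy input, the first layer places `d1`, `d2`, and `d3` in one earthquake topic, and the second layer then separates `d3` from the other two because its keyword overlap with them falls below the assumed `doc_sim` threshold. A production system would likely replace the fixed thresholds and connected components with learned similarity and a proper community detection algorithm.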

Index Terms

Story Forest: Extracting Events and Telling Stories from Breaking News
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning settings
      1. Semi-supervised learning settings
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document topic models
  2. Information systems applications
    1. Data mining
      1. Clustering
      2. Data stream mining

Information & Contributors

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 14, Issue 3

June 2020

381 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3388473

Editors:
Charu Aggarwal (IBM T. J. Watson Research, USA)
Xindong Wu (Mininglamp Academy of Sciences, China)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 May 2020

Online AM: 07 May 2020

Accepted: 01 January 2020

Revised: 01 November 2019

Received: 01 December 2018

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 41
Total Downloads: 947
Downloads (Last 12 months): 181
Downloads (Last 6 weeks): 19

Reflects downloads up to 01 Dec 2024
