
DOI: 10.1145/3486622.3493952

Improving Topic Modeling Performance through N-gram Removal

Published: 13 April 2022

Abstract

In recent years, topic modeling has been increasingly adopted for finding conceptual patterns in large corpora of digital documents and organizing them accordingly. To enhance the performance of topic modeling algorithms such as Latent Dirichlet Allocation (LDA), multiple preprocessing steps have been proposed. In this paper, we introduce N-gram Removal, a novel preprocessing procedure based on the systematic elimination of a dynamic number of repeated words in text documents. We evaluated the effects of N-gram Removal using four different performance metrics and concluded that it is effective at improving the performance of LDA and enhances the human interpretation of topic models.
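
As a rough illustration of the idea, the sketch below removes tokens that occur inside frequently repeated n-grams before fitting LDA with gensim, then scores the result with two of the evaluation angles named in the author tags (perplexity and coherence). The abstract does not specify how the dynamic number of n-grams is selected, so the fixed bigram size and count threshold here are hypothetical placeholders, not the authors' exact procedure.

```python
# Minimal sketch of the N-gram Removal idea: drop tokens that occur inside
# frequently repeated n-grams, then fit LDA and score it with gensim.
# The bigram size (n=2) and count threshold (min_count=2) are hypothetical
# placeholders; the paper's dynamic selection rule is not given in the abstract.
from collections import Counter

from gensim import corpora
from gensim.models import CoherenceModel, LdaModel


def ngrams(tokens, n):
    """Yield all contiguous n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def remove_frequent_ngrams(docs, n=2, min_count=2):
    """Delete every token that falls inside an n-gram repeated >= min_count times."""
    counts = Counter(g for doc in docs for g in ngrams(doc, n))
    frequent = {g for g, c in counts.items() if c >= min_count}
    cleaned = []
    for doc in docs:
        drop = set()
        for i, g in enumerate(ngrams(doc, n)):
            if g in frequent:
                drop.update(range(i, i + n))
        cleaned.append([t for i, t in enumerate(doc) if i not in drop])
    return cleaned


# Toy corpus of pre-tokenized documents; "topic modeling" repeats in every one.
docs = [
    ["topic", "modeling", "with", "lda", "on", "news"],
    ["topic", "modeling", "of", "climate", "tweets"],
    ["topic", "modeling", "for", "policy", "research"],
]
cleaned = remove_frequent_ngrams(docs)  # strips the repeated bigram

dictionary = corpora.Dictionary(cleaned)
corpus = [dictionary.doc2bow(d) for d in cleaned]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# Two of the metrics named in the author tags: perplexity and coherence.
print("log perplexity:", lda.log_perplexity(corpus))
print("u_mass coherence:",
      CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                     coherence="u_mass").get_coherence())
```

In a full implementation, the n-gram size and threshold would presumably be the "dynamic" part of the procedure, e.g. tuned per corpus against held-out coherence rather than fixed as above.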


Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 2021
698 pages

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Big Data
  2. Coherence
  3. Data Preprocessing
  4. LDA
  5. Perplexity
  6. Topic Modeling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 14-17, 2021
Melbourne, VIC, Australia
