
METHODS
published: 14 July 2020
doi: 10.3389/frai.2020.00042

Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis

Rania Albalawi 1*, Tet Hin Yeap 1* and Morad Benyoucef 2*

1 School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada
2 Telfer School of Management, University of Ottawa, Ottawa, ON, Canada

Edited by: Anis Yazidi, OsloMet—Oslo Metropolitan University, Norway

Reviewed by: Lei Jiao, University of Agder, Norway; Ashish Rauniyar, University of Oslo, Norway, in collaboration with reviewer LJ; Imen Ben Sassi, Tallinn University of Technology, Estonia; Desta Haileselassie Hagos, Oslo Metropolitan University, Norway, in collaboration with reviewer IS

*Correspondence: Rania Albalawi, ralba028@uottawa.ca; Tet Hin Yeap, tet@eecs.uottawa.ca; Morad Benyoucef, benyoucef@telfer.uottawa.ca

Specialty section: This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence

Received: 28 February 2020; Accepted: 14 May 2020; Published: 14 July 2020

Citation: Albalawi R, Yeap TH and Benyoucef M (2020) Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis. Front. Artif. Intell. 3:42. doi: 10.3389/frai.2020.00042

With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information or learn more about the topic being discussed from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques that have gained popularity in recent years. This paper investigates the topic modeling subject and its common application areas, methods, and tools. Also, we examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their benefits practically in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of the included topic modeling methods based on topic quality and some standard statistical evaluation metrics, like recall, precision, F-score, and topic coherence. As a result, the latent Dirichlet allocation and non-negative matrix factorization methods delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.

Keywords: natural language processing, topic modeling, short text, user-generated content, online social networks

INTRODUCTION

People nowadays tend to rely heavily on the internet in their daily social and commercial activities. Indeed, the internet has increased demand for the development of commercial applications and services to provide better shopping experiences and commercial activities for customers around the world. The internet is full of information and sources of knowledge that may confuse readers and cause them to spend additional time and effort in finding relevant information about specific topics of interest. Consequently, there is a need for more efficient methods and tools that can aid in detecting and analyzing content in online social networks (OSNs), particularly for those using user-generated content (UGC) as a source of data. Furthermore, there is a need to extract more useful and hidden information from numerous online sources that are stored as text and written in natural language within the social network landscape (e.g., Twitter, LinkedIn, and Facebook).


It is convenient to employ a natural approach, similar to a human–human interaction, where users can specify their preferences over an extended dialogue.

Natural language processing (NLP) is a field that combines the power of computational linguistics, computer science, and artificial intelligence to enable machines to understand, analyze, and generate the meaning of natural human speech. The first actual example of the use of NLP techniques was in the 1950s, in a translation from Russian to English that contained numerous literal translation misunderstandings (Hutchins, 2004). Essentially, keyword extraction is the most fundamental task in several fields, such as information retrieval, text mining, and NLP applications, namely, topic detection and tracking (Kamalrudin et al., 2010). In this paper, we focused on the topic modeling (TM) task, which was described by Miriam (2012) as a method to find groups of words (topics) in a corpus of text. In general, the procedure of exploring data to collect valuable information is known as text mining. Text mining includes data mining algorithms, NLP, machine learning, and statistical operations to derive useful content from unstructured formats such as social media textual data. Hence, text mining can improve commercial trends and activities by extracting information from UGC.

TM methods have been established for text mining because identifying topics manually is neither efficient nor scalable given the immense size of the data. Various TM methods can automatically extract topics from short texts (Cheng et al., 2014) and standard long-text data (Xie and Xing, 2013). Such methods provide reliable results in numerous text analysis domains; examples include probabilistic latent semantic analysis (PLSA) (Hofmann, 1999), latent semantic analysis (LSA) (Deerwester et al., 1990), and latent Dirichlet allocation (LDA) (Blei et al., 2003). However, many existing TM methods are incapable of learning from short texts. Also, many issues exist in TM approaches with short textual data within OSN platforms, like slang, data sparsity, spelling and grammatical errors, unstructured data, insufficient word co-occurrence information, and non-meaningful and noisy words. For example, Gao et al. (2019) discussed the problem of word sense disambiguation by using local and global semantic correlations, achieved by a word embedding model. Yan et al. (2013) developed a short-text TM method called biterm topic model (BTM) that uses word correlations or embeddings to advance TM. The fundamental steps involved in text mining are shown in Figure 1, which we explain later in our data preprocessing step.

In general, TM has proven to be successful in summarizing long documents like news, articles, and books. Conversely, the need to analyze short texts became significantly relevant as the popularity of microblogs, such as Twitter, grew. The challenge with inferring topics from short text is that it often suffers from noisy data, so it can be difficult to detect topics in a smaller corpus (Phan et al., 2011).

This paper makes the following contributions:

• We review scholarly articles related to TM from 2015 to 2020, including its common application areas, methods, and tools.
• We investigate select TM methods that are commonly used in text mining, namely, LDA, LSA, non-negative matrix factorization (NMF), principal component analysis (PCA), and random projection (RP). As there are many TM methods in the field of short-text data, and all definitely cannot be mentioned, we selected the most significant methods for our work.
• We evaluate all included TM methods on two dimensions: the understandability of extracted topics (topic quality), and topic performance and accuracy, by applying common standard metrics for the TM domain such as recall, precision, F-score, and topic coherence. In addition, we consider two textual datasets: the 20-newsgroup data, common for evaluations in social media text application tasks, and 20 short conversations from Facebook, a popular social network site.
• We aim to compare and evaluate many TM methods to define their effectiveness in analyzing short textual social UGC.

The paper is organized as follows. Section Literature Review contains a comprehensive summary of some recent TM surveys as well as a brief description of related subjects in NLP, specifically the TM applications and toolkits used in social network sites. In Section Proposed Topic Modeling Methodology, we focus on the five TM methods proposed in our study. Section Evaluation presents our evaluation process and its results, and the conclusion follows, along with an outlook on future work.

FIGURE 1 | The steps involved in a text mining process (Kaur and Singh, 2019).


LITERATURE REVIEW

To obtain a comprehensive summary of recent surveys, we started by exploring existing studies related to the area of TM for long and short texts. Additionally, we reviewed the most common TM applications, tools, and algorithms as applied to OSNs. For example, Jelisavčić et al. (2012) provided an overview of the most popular probabilistic models used in the TM area. Hong and Davison (2010) compared the performance of the LDA method and author–topic models on the Twitter platform. Alghamdi and Alfalqi (2015) proposed an empirical study of TM by categorizing the reviewed works into two popular approaches: topic evolution models and standard topic models with a time factor. Song et al. (2014) presented a survey about short-text characteristics, challenges, and classification, divided into four basic types, namely, the usage of semantic analysis, classification using semi-supervised methods, fusion-based ensemble techniques, and real-time classification. Jaffali et al. (2020) presented a summary of social network data analysis, including its essential methods and applications in the context of structural social media data analysis. They structured the social network analysis methods into two types, namely, structural analysis methods (which study the structure of the social network, like friendships) and added-content methods (which study the content added by users). Likhitha et al. (2019) presented a detailed survey covering the various TM techniques in social media text and summarized many applications, quantitative evaluations of various methods, and many datasets that are used with various challenges in short content and documents. Table 1 presents several related works that reviewed TM methods in long/short textual social media data. Different from existing reviewed works, our paper not only focuses on the review of TM tools, applications, and methods but also includes several evaluations applying many techniques over short textual social media data, to determine which method is the best for our future proposed system, which aims to detect real-time topics from online user-generated content.

In recent years, most of the data in every sphere of our lives have become digitized, and as a result, there is a need for powerful tools and methods to deal with this increase in digital data in order to understand it. Indeed, there have been many developments in the NLP domain, including rule-based systems and statistical NLP approaches that are based on machine learning algorithms for text mining, information extraction, sentiment analysis, etc. Some typical NLP real-world applications currently in use include automatic document summarization, named entity recognition, topic extraction, relationship extraction, spam filters, TM, and more (Farzindar and Inkpen, 2015). In the areas of information retrieval and text mining, several methods, such as TM, perform keyword and topic extraction (Hussey et al., 2012). TM is a machine learning method that is used to discover hidden thematic structures in extensive collections of documents (Gerrish and Blei, 2011).

TM is a challenging research task for short texts, and several methods and techniques have been proposed to address the lack of contextual information. Numerous proposed methods build on generative probabilistic models such as LDA. In this paper, we aim to understand the real meaning of a given text, not just to extract a list of related keywords. To achieve this, we first need to understand and have a general idea about many TM methods as they can be applied to short UGC (e.g., abstracts, dialogue, and Twitter text). Several TM methods are used to obtain topics from text, such as emails, documents, and blogs. The choice of technique to extract topics is based on the length of the text. For example, counting word frequencies is an appropriate method to use with a single document or a small number of documents. Liu et al. (2016) reviewed TM techniques for sentiment analysis. Meanwhile, Zihuan et al. (2018) proposed a news-topic recommendation system based on extracting topic keywords from internet news for a specific time. They applied different keyword extraction algorithms, such as term frequency–inverse document frequency (TF-IDF) and rapid algorithm for keyword extraction (RAKE), to extract the most descriptive terms in a document. This system was efficient in obtaining a particular topic at any specific time. However, they only focused on one dataset, about the political domain, and on the words that appear repeatedly; this is considered to be an issue in this recommendation system. Similarly, Shi et al. (2017) developed a semantics-assisted non-negative matrix factorization (SeaNMF) model by using a baseline of LDA and the author–topic model to integrate semantic relations between word and context.

To date, the LDA model is the most popular and highly studied model in many domains and numerous toolkits, such as Machine Learning for Language Toolkit (MALLET), Gensim,1 and the Stanford TM toolbox (TMT),2 because it is able to address other models' limitations, such as those of latent semantic indexing (LSI) (Deerwester et al., 1990) and probabilistic latent semantic indexing (PLSI) (Hofmann, 2001). The LDA method can produce a set of topics that describe the entire corpus, which are individually understandable, and it handles a large-scale document–word corpus without the need to label any text. Keerthana (2017) developed a document recommendation system, working from text converted by an ASR system, that used both cosine similarity (word co-occurrence) and semantic methods, as well as the LDA TM method implemented in the MALLET toolkit environment, to extract the most significant terms for short conversation fragments. Initially, the topic model was used to define weights for the abstract topics. After extracting the keywords, TM similarity methods were applied. In this work, the researchers compared keywords extracted with different techniques, namely, cosine similarity, word co-occurrence, and semantic distance. They found that keywords extracted with word co-occurrence and semantic distance can be more relevant than those from the cosine similarity technique.

1 https://pypi.org/project/gensim/
2 https://nlp.stanford.edu/software/tmt/tmt-0.4/


TABLE 1 | Some of the existing related works that reviewed the topic modeling method.

Chakkarwar and Tamane (2020)
  Topic modeling method: Latent Dirichlet allocation (LDA) with bag of words (BoW)
  Evaluation method: Visual overview of extracted topics
  Outcome:
  - Aimed to discover the current trends, topics, or patterns from research documents to overview different research trends.
  - The result shows that the LDA is an effective topic modeling method for creating the context of a document collection.

Ray et al. (2019)
  Topic modeling method: Latent semantic indexing (LSI); LDA; non-negative matrix factorization (NMF)
  Evaluation method: Perplexity; topic coherence
  Outcome:
  - Aimed to introduce methods and tools of topic modeling to the Hindi language.
  - Discussed many techniques and tools used for topic modeling.
  - The coherence result of the NMF model was a little better than the LDA model.
  - The perplexity of the LDA model on the Hindi dataset is better compared to other evaluated topic modeling methods.

Xu et al. (2019)
  Topic modeling method: LDA
  Evaluation method: Perplexity
  Outcome:
  - Aimed to help Chinese movie creators to get the psychological needs of movie viewers and provide suggestions to improve the quality of Chinese movies.
  - Used the word cloud as a visual display of high-frequency keywords in a text, which gives a basic understanding of the core ideas of text data.
  - The LDA model provides topics that deliver a good analysis of the Douban online reviews.
  - Used the perplexity method to determine the best number of extracted topics; as a result, 20 extracted topics were set.

Alghamdi and Alfalqi (2015)
  Topic modeling method: Latent semantic analysis (LSA); probabilistic latent semantic analysis (PLSA); LDA; correlated topic model (CTM)
  Outcome:
  - Reviewed many topic modeling methods in terms of characteristics, limitations, and theoretical background.
  - Reviewed many topic modeling application areas and evaluation methods.

Chen et al. (2017)
  Topic modeling method: NMF; principal component analysis (PCA); LDA; KATE1
  Evaluation method: t-Distributed stochastic neighbor embedding (t-SNE) dimensionality-reduction method
  Outcome:
  - Aimed to compare and evaluate many topic modeling approaches in analyzing a large set of the US Securities and Exchange Commission (SEC) filings made by US public banks.
  - Both NMF and LDA methods provide very good document representation, while the K-Competitive Autoencoder for Text (KATE) delivered more meaningful documents and high-accuracy topics.
  - The LDA provided the best result regarding the classification of topic representation.

Mazarura and de Waal (2016)
  Topic modeling method: LDA; GSDMM
  Evaluation method: Topic stability; topic coherence
  Outcome:
  - Tested many numbers of topics (10, 20, 30, 40, 50, and 100 topics).
  - Topic coherence decreases for both the LDA and the Dirichlet multinomial mixture model (GSDMM) as the number of topics increases in a long text, which indicates an overall decline in the quality of topics uncovered by both models as the number of topics increases.
  - The LDA's performance on the coherence values is slightly better than the GSDMM.
  - The GSDMM is more stable than LDA.
  - The GSDMM is indeed a viable option on short text as it displays the potential to produce better results than LDA.

Sisodia et al. (2020)
  Topic modeling method: BoW; term frequency–inverse document frequency (TF-IDF); naive Bayes; SVM; decision trees; Nu-SVC
  Evaluation method: Accuracy; precision; recall; F-measures
  Outcome:
  - The Nu-support vector classification (Nu-SVC) classifier outperforms all other included classifiers in the set of individual classifiers.
  - The random forest classifier outperforms all other included classifiers in the set of ensemble classifiers.
  - The support vector machine (SVM) classifier outperforms all other classifiers in the set of individual classifiers.
  - The random forest classifier outperforms the remaining ones.
  - Considered only two datasets; other datasets of different sizes need to be studied for better results.

Shi et al. (2017)
  Topic modeling method: Vector space model (VSM); LSI; PLSA; LDA
  Outcome:
  - Reviewed all of the following methods: VSM, LSI, PLSA, and LDA.
  - Reviewed the essential concept of topic modeling using a bag-of-words approach.
  - Discussed the basic idea of topic modeling, including the bag-of-words approach, training of the model, and output.
  - Discussed topic modeling applications, features, limitations, and tools such as Gensim, the standard topic modeling toolbox, Machine Learning for Language Toolkit (MALLET), and BigARTM.

Nugroho et al. (2020)
  Topic modeling method: LDA; NMF; task-driven NMF; Plink-LDA; NMijF
  Evaluation method: Purity; normalized mutual information (NMI); pairwise F-measure
  Outcome:
  - Focuses on the review of the approaches and discusses the features that are exploited to deal with the extreme sparsity and dynamics of the online social network (OSN) environment.
  - Ran the algorithms over both datasets 30 times and noted the average value of each evaluation metric for comparison.
  - Most methods can achieve a high purity value.
  - The NMF and non-negative matrix inter-joint factorization (NMijF) have the best performance over the other methods.
  - F-measure evaluation results for all methods were good and similar.
  - NMijF provides the best results according to all the evaluation metrics.
  - Both LDA and NMF focus on the simple content exploitation of social media posts (main features: content, social interactions, and temporal).

Ahmed Taloba et al. (2018)
  Topic modeling method: PCA model; standard SVM; J-48 decision tree; KNN methods
  Evaluation method: Precision; accuracy; sensitivity; F-measure
  Outcome:
  - The aim was to compare the performance of these methods before and after using PCA.
  - The RF gives acceptable and higher accuracy when compared to the rest of the classifiers.
  - The RF algorithm gives higher performance, and its performance is improved after using PCA.

Chen et al. (2019)
  Topic modeling method: LDA; NMF; KGNMF
  Evaluation method: PMI score; human judgments
  Outcome:
  - Tested many numbers of topics (20, 40, 60, 80, and 100).
  - The NMF has overwhelming advantages over LDA.
  - The knowledge-guided NMF (KGNMF) model performs better than NMF and LDA.
  - The NMF provides better topics than LDA with topic numbers ranging from 20 to 100.

Anantharaman et al. (2019)
  Topic modeling method: LDA; LSA; NMF
  Evaluation method: Precision; recall; F-measure; accuracy; Cohen's kappa score; Matthews correlation coefficient; time taken
  Outcome:
  - Evaluated all topic modeling algorithms with both BoW and TF-IDF representations.
  - Used the naïve Bayes classifier for the 20-newsgroup dataset and the random forest classifier for the BBC news and PubMed datasets.
  - On the 20-newsgroup dataset, the results of LDA with BoW outperform those of the other topic algorithms.
  - The LDA model does not perform well with TF-IDF when compared to BoW.
  - The LDA takes a lot of time when compared to the LSA and NMF models.

1 https://github.com/hugochan/KATE.

TM Application

TM can be applied to numerous areas like NLP, information retrieval, text classification and clustering, machine learning, and recommendation systems. TM methods may be supervised, unsupervised, or semi-supervised; may use structured or unstructured data; and may be applied in several application fields such as health, agriculture, education, e-commerce, social network opinion analysis, and transport/data networks. TM can be used to discover latent abstract topics in a collection of text such as documents, short text, chats, Twitter and Facebook posts, user comments on news pages, blogs, and emails. Weng et al. (2010) and Hong and Davison (2010) addressed the application of topic models to short texts. Some major application areas where researchers have used TM methods include the following:

• Recommendation systems: in many real-time systems, for example, job recommendation by mapping the right job to interested candidates based on their information, history, sociology, location, media theory, and other contexts.
• Financial analysis: in many commercial activities like structuring of the stock market exchange, using stock value information to induce subjects over diverse trades on a market organization, and other activities.
• Bioinformatics: to identify the knowledge structure of the field, e.g., studying patient-related texts constructed from their clinical records.
• Manufacturing applications: used in numerous search engines, online advertising systems, and social media blogs.
• Computer science: extracting valuable information from data, image processing, and annotating images with words.
• Social network analysis (SNA): mining information about the real world in social web platforms, such as inferring significant aspects about users and services.
• Software engineering: mining unstructured repositories in the software industry, such as source code, tests, and bugs, to support many engineering tasks like program comprehension and location (Panichella et al., 2013).


Toolkits for Topic Models

Many TM methods and analyses are available nowadays. Below are selected toolkits that are considered standard toolkits for TM testing and evaluation.

• Stanford TMT, presented by Daniel et al. (2009), was implemented by the Stanford NLP group. It is designed to help social scientists or other researchers who wish to analyze voluminous textual material and track word usage. It includes many topic algorithms such as LDA, labeled LDA, and partially labeled Dirichlet allocation (PLDA); besides, the input can be text in Excel or other spreadsheets.
• VISTopic is a hierarchical topic tool for visual analytics of text collections that can adopt numerous TM algorithms such as hierarchical latent tree models (Yang et al., 2017).
• KEA is open-source software distributed under the GNU Public License and used for keyphrase extraction from the entire text of a document; it can be applied for free indexing or controlled vocabulary indexing in the supervised approach. KEA was developed based on the work of Turney (2002) and was programmed in the Java language; it is a simple and efficient two-step algorithm that can be used across numerous platforms (Frank et al., 1999).
• MALLET, first released in 2002 (McCallum, 2002), is a topic model tool written in the Java language for machine learning applications like NLP, document classification, TM, and information extraction, used to analyze large unlabeled text. The MALLET topic model includes different algorithms to extract topics from a corpus, such as the pachinko allocation model (PAM) and hierarchical LDA.
• FiveFilters is a free software tool to obtain terms from text through a web service. This tool will create a list of the most relevant terms from any given text in JSON format.
• Gensim, presented by Rehurek (2010), is an open-source vector space modeling and topic modeling toolkit implemented in Python to leverage large unstructured digital texts and to automatically extract the semantic topics from documents by using data streaming and efficient incremental algorithms, unlike other software packages that only focus on batch and in-memory processing. Also, Gensim includes several kinds of algorithms such as LDA, RP, LSA, TF-IDF, hierarchical Dirichlet processes (HDPs), LSI, and singular value decomposition (SVD). All the mentioned algorithms are unsupervised, so there is no need for human input or a training corpus. In addition, Gensim is considered to be faster and more scalable than other topic modeling tools such as MALLET.
• Fathom provides TM with graphical visualization and calls of topic distributions (Dinakar et al., 2015).
• R TM packages include three packages that are capable of doing topic modeling analysis, which are MALLET, topicmodels, and lda. Also, the R language has many packages and libraries for effective topic modeling like LSA, LSAfun (Wild, 2015), topicmodels (Chang, 2015), and textmineR (Thomas Jones, 2019).
• For other open-source toolkits besides those mentioned above, David Blei's lab provides much TM open-source software that is available on GitHub, such as online inference for HDP in the Python language, and TopicNets (Gretarsson et al., 2012).

PROPOSED TOPIC MODELING METHODOLOGY

TM is a methodology for processing the massive volume of data generated in OSNs and extracting the veiled concepts, protruding features, and latent variables from data that depend on the context of the application (Kherwa and Bansal, 2018). Several methods can operate in the areas of information retrieval and text mining to perform keyword and topic extraction, such as MAUI, Gensim, and KEA. In the following, we give a brief description of the TM methods included in this comparison review. In this paper, we focused on five frequently used TM methods that are built using diverse representation forms and statistical models. A standard process for topic generation is shown in Figure 2. We define the main advantages and disadvantages of all involved topic methods as shown in Table 2, and we evaluate the topic quality and performance of the TM methods; the fundamental difference among all involved methods is in how they capture the structures and in which parts of the structures they exploit. However, there are numerous TM methods used in the field of social media textual data, and as we definitely cannot mention all of them, we selected the most popular methods to compare; we then define which method is suitable to integrate into our future proposed real-time social recommendation system called ChatWithRec (Albalawi and Yeap, 2019; Albalawi et al., 2019).

TM Methods

• LSA: It is a method in NLP proposed by Deerwester et al. (1990), particularly in distributional semantics, that can be used in several areas, such as topic detection; it has become a baseline for the performance of many advanced methods. Distributional hypotheses make up the theoretical foundation of the LSA method, which states that terms with similar meaning are closer in terms of their contextual usage, assuming that words that are near in their meaning show up in related parts of texts (Dudoit et al., 2002). Also, it analyzes large amounts of raw text into words and separates them into meaningful sentences or paragraphs. LSA considers both the similar terms of a text and related terms to generate more insights into the topic. Besides, the LSA model can generate a vector-based representation for texts, which aids the grouping of related words. A mathematical approach called SVD is used in the LSA model to outline a base for a shared semantic vector space that captures the maximum variance across the corpus. Neogi et al. (2020) stated that the LSA method, as shown in Figure 3, learns latent topics by performing matrix decomposition on the term–document matrix; let's say X is a term-by-document matrix that is decomposed into three other matrices, S, W, and P; multiplying together those matrices, we get back the matrix X with {X} = {S}{W}{P}; each paragraph is characterized by the columns, and the rows characterize the unique words. Figure 3 presents the SVD of the LSA TM method.
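As a rough illustration of this decomposition, the following sketch applies truncated SVD (the usual LSA implementation) to a TF-IDF matrix with scikit-learn; the toy documents, topic count, and variable names are illustrative, not taken from the paper's experiments:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["flight and hotel booking for a trip",
            "restaurant reviews and food",
            "university courses and study plans"]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)                  # document-term TF-IDF matrix

    lsa = TruncatedSVD(n_components=2, random_state=0)  # truncated SVD of X
    doc_topics = lsa.fit_transform(X)                   # documents in the latent semantic space

    terms = vectorizer.get_feature_names_out()
    for k, component in enumerate(lsa.components_):     # term loadings per latent dimension
        top = component.argsort()[-3:][::-1]
        print("topic", k, [terms[i] for i in top])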


FIGURE 2 | Topic modeling for text data.

TABLE 2 | Main advantages and disadvantages of the TM methods.

LSA
  Advantages:
  - Solves the data sparsity problem and captures synonyms of words.
  - Reduces the dimensionality of TF-IDF by using singular value decomposition.
  - It does not require a robust statistical background and probability theory.
  - Exploits unique structure as factors.
  Disadvantages:
  - Difficult to label a topic in some cases and to establish a number of topics.
  - The determination of topic numbers depends upon human judgment and cannot be determined statistically.
  - It does not capture the correlation between multiple topics.

LDA
  Advantages:
  - It does not require any previous training data.
  - Provides more semantically interpretable data and performs well if there is no time constraint.
  - Handles long documents and is able to show adjectives and nouns in topics.
  - Handles mixed-length documents.
  - Able to enhance transitive relations between topics and obtain high-order co-occurrence in small documents like paragraphs and sentences.
  Disadvantages:
  - Needs aggregation of short messages to avoid data sparsity in short documents.
  - Unable to model relations among topics that help to understand deep structures of documents.
  - A slow process algorithm.
  - Requires a predefined number of topics (T). If T is too small, topics are more general; if T is too large, topics will overlap with each other.

NMF
  Advantages:
  - Fast processing of a large amount of real-time data.
  - Able to extract meaningful topics without prior information or knowledge of the underlying meaning in the original data.
  - Appropriate for word and vocabulary recognition tasks.
  Disadvantages:
  - Sometimes provides semantically incorrect results.

PCA
  Advantages:
  - Low noise sensitivity and decreased need for capacity.
  - Maintains the best possible estimate and works well on moderately low-dimensional data.
  - It decreases the noise in the data because the maximum variation source is chosen and the small variations are ignored automatically.
  - Recommended in work that aims to introduce new features by losing original features in the procedure of transforming high-dimensional data into low dimensions.
  - Delivers an output that can be visualized as a solid version of the main dataset.
  Disadvantages:
  - The covariance matrix is difficult to evaluate in an accurate manner (Phillips et al., 2005).
  - Sometimes cannot detect the simplest invariances in data, unless the training data explicitly offer this information (Li et al., 2008).
  - Expensive to compute, particularly for high-dimensional datasets.

RP
  Advantages:
  - Robust.
  - Provides good results in data streaming tasks and if data are very high dimensional.
  - Valid to use on imbalanced datasets.
  - Advances linear separability.
  - Good at discovering discriminative features.
  Disadvantages:
  - Data sparsity.
  - Slow predictions.
  - Sensitive to noisy data.
  - Bad at fitting complex features.
  - Applicable to only a few datasets.

• LDA, introduced by Blei et al. (2003), is a probabilistic model that is considered to be the most popular TM algorithm in real-life applications to extract topics from document collections, since it provides accurate results and can be trained online. In the LDA model, the corpus is organized as a random mixture of latent topics, and a topic refers to a word distribution. Also, LDA is a generative unsupervised statistical algorithm for extracting thematic information (topics) from a collection of documents within the Bayesian statistical paradigm. The LDA model assumes that each document is made up of various topics, where each topic is a probability distribution over words. A significant advantage of using the LDA model is that topics can be inferred from a given collection without input from any prior knowledge.
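Before turning to the plate diagram in Figure 4, here is a minimal sketch of training such a model with Gensim, the toolkit used later in this paper's experiments; the toy corpus and the parameter settings are illustrative assumptions, not the paper's actual configuration:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [["flight", "hotel", "booking"],
             ["restaurant", "food", "review"],
             ["university", "study", "course"]]

    dictionary = Dictionary(texts)                    # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2,      # T must be fixed in advance
                   alpha="auto",      # learn the Dirichlet prior on document-topic mixtures
                   passes=10, random_state=0)

    for k, topic in lda.print_topics(num_words=3):    # top words per inferred topic
        print(k, topic)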


FIGURE 3 | SVD of the LSA topic modeling method (Neogi et al., 2020).

FIGURE 4 | The original structure of the LDA topic model.

A schematic diagram of the LDA topic model is shown in Figure 4. In Figure 4, α is a parameter that represents the Dirichlet prior for the document topic distribution, β is a parameter that represents the Dirichlet prior for the word distribution, θ is a vector for the topic distribution over a document d, z is a topic for a chosen word in a document, w refers to specific words in N, plate D is the number of documents, and plate N is the number of words in the document.

• NMF is an unsupervised matrix factorization (linear algebraic) method that is able to perform both dimension reduction and clustering simultaneously (Berry and Browne, 2005; Kim et al., 2014). It can be applied to numerous TM tasks; however, only a few works were reported to determine topics for short texts. Yan et al. (2013) presented an NMF model that aims to obtain topics for short-text data by factorizing the asymmetric term correlation matrix, the term–document matrix, and the bag-of-words matrix representation of a text corpus. Chen et al. (2019) defined the NMF method as decomposing a non-negative matrix D into non-negative factors U and V, V ≥ 0 and U ≥ 0, as shown in Figure 5. The NMF model can extract relevant information about topics without any previous insight into the original data. NMF provides good results in several tasks such as image processing, text analysis, and transcription processes. In addition, it can handle the decomposition of non-understandable data like videos.

In Figure 5, D ≈ UV, where U and V are elementwise non-negative and, for a given text corpus, D is decomposed into two matrices, the term–topic matrix U and the topic–document matrix V, corresponding to K coordinate axes and N points in a new semantic space, respectively (each point represents one document).

• PCA is an essential tool for text processing tasks, and it has been used since the early 1990s (Jolliffe, 1986; Slonim and Tishby, 2000; Gomez et al., 2012). The PCA method has been used to decrease a feature vector to a lower dimension while retaining the most informative features in several experimental and theoretical studies. However, it is expensive to compute for high-dimensional text datasets. The PCA TM method finds a d-dimensional subspace of R^n that captures as much of the dataset's variation as possible; specifically, given data S = {x1, . . . , xm}, we would find the linear projection to R^d that maximizes Equation (1), proposed by Dasgupta (2000):

    Σ_{i=1..m} ‖χi* − µ*‖²    (1)

where χi* is the projection of a point χi and µ* is the mean of the projected data.

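A brief sketch of the projection in Equation (1) using scikit-learn's PCA (the synthetic data are purely illustrative; note that PCA needs a dense matrix, so a sparse TF-IDF matrix would first have to be densified):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))          # e.g., 100 dense document vectors

    pca = PCA(n_components=5)               # d = 5 dimensional subspace
    Z = pca.fit_transform(X)                # projection maximizing retained variance
    print(pca.explained_variance_ratio_)    # variance captured per component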

FIGURE 5 | The original structure of the NMF topic model (Chen et al., 2019).
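As a rough illustration of the factorization sketched in Figure 5, the following snippet uses scikit-learn's NMF; because scikit-learn factorizes a document–term matrix, its two factors are the transposes of U and V in the figure (documents and parameters are illustrative):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["flight and hotel booking", "food and restaurant reviews",
            "study plans at the university"]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # non-negative matrix
    nmf = NMF(n_components=2, init="nndsvd", random_state=0)       # K = 2 topics
    W = nmf.fit_transform(X)    # document-topic weights
    H = nmf.components_         # topic-term weights; X is approximately W @ H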

• RP has attracted attention and has been employed in many machine learning scenarios recently, such as classification, clustering, and regression (Wang and McCallum, 2006; Ramage et al., 2011). The RP TM method uses a random matrix to map the original high-dimensional data onto a lower-dimensional subspace with a reduced time cost (Dasgupta, 2000). The main idea behind the RP method stems from Johnson and Lindenstrauss (1984), who state that "a set of n points in a high-dimensional vector space can be embedded into k = Θ(ε⁻² log n) dimensions, with the distances between these points preserved up to a factor of 1 + ε. This limit can be realized with a linear projection Ã = AR, for a carefully designed random matrix R ∈ R^(d×k) (k ≪ d), where A ∈ R^(n×d) denotes a data matrix consisting of n data points in R^d" (Wójcik and Kurdziel, 2019). RP's accuracy for dimensionality reduction of high-dimensional datasets is notable, and its directions of projection are independent of the data (they do not depend on training data). Still, RP delivers sparse results because it does not consider the fundamental structure of the original data and frequently leads to high distortion.
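A small sketch of this idea with scikit-learn, which implements both the Johnson–Lindenstrauss bound and Gaussian random projection (array sizes and the target dimension are illustrative):

    import numpy as np
    from sklearn.random_projection import (GaussianRandomProjection,
                                           johnson_lindenstrauss_min_dim)

    # Minimal k preserving pairwise distances up to a factor of 1 +/- eps
    print(johnson_lindenstrauss_min_dim(n_samples=500, eps=0.25))

    X = np.random.default_rng(0).random((100, 10000))          # high-dimensional data A
    rp = GaussianRandomProjection(n_components=500, random_state=0)  # draws the matrix R
    X_low = rp.fit_transform(X)                                 # A~ = AR
    print(X_low.shape)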
Data Preprocessing

In our experiment, all input data were text data with English language properties. As shown in Figure 1, the first steps in the text mining process were to collect unstructured and semi-structured data from multiple data sources like microblogs and news web pages. Next, the preprocessing step was applied to clean up the data and then convert the extracted information into a structured format to analyze the patterns (visible and hidden) within the data. Extracted valuable information can be stored in a database, for example, to assist the decision-making process of an organization. Corpus preparation and cleaning were done using a series of packages running on top of Python, such as the Natural Language Toolkit (NLTK) (Bird et al., 2009), which provides stop-word removal (Bird and Loper, 2004), stemming, lemmatizing, tokenization, identifying n-gram procedures, and other data cleanings like lowercase transformation and punctuation removal. The preprocessing steps are supported in the NLTK library (Kolini and Janczewski, 2017; Phand and Chakkarwar, 2018) and contain the following patterns (a pipeline sketch follows the list):

• Stop-word elimination: removal of the most common words in a language that are not helpful and in general unusable in text mining, like prepositions, numbers, and words that do not contain applicable information for the study. In fact, in NLP, there is no particular general list of stop words used by all developers, who choose their list based on their goal to improve the recommendation system performance.
• Stemming: the conversion of words into their root, using stemming algorithms such as the Snowball stemmer.
• Lemmatizing: used to enhance the system's accuracy by returning the base or dictionary form of a word.
• Tokenizing: dividing a text input into tokens like phrases, words, or other meaningful elements (tokens). The outcome of tokenization is a sequence of tokens.
• Identifying n-grams, such as bigrams (phrases containing two words) and trigrams (phrases containing three words), and considering each as one word.
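A minimal sketch of such a cleaning pipeline with NLTK (the function name and resource choices are illustrative, not the authors' code; the required NLTK corpora, e.g., punkt, stopwords, and wordnet, are assumed to have been downloaded via nltk.download):

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize

    STOPS = set(stopwords.words("english"))
    stemmer = SnowballStemmer("english")
    lemmatizer = WordNetLemmatizer()

    def preprocess(text, stem=False):
        tokens = word_tokenize(text.lower())              # tokenize + lowercase
        tokens = [t for t in tokens if t.isalpha()]       # drop punctuation and numbers
        tokens = [t for t in tokens if t not in STOPS]    # stop-word elimination
        if stem:
            return [stemmer.stem(t) for t in tokens]      # reduce words to their root
        return [lemmatizer.lemmatize(t) for t in tokens]  # dictionary (base) form

    print(preprocess("We booked two flights and a hotel for our trip!"))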
After the preprocessing step, we applied a commonly used term-weighting method called TF-IDF as a pre-filtering stage with all the included TM methods. TF-IDF is a numerical statistic used to score the importance of a word (term) in any content from a collection of documents, based on the occurrences of each word, and it checks how relevant the keyword is in the corpus. It not only considers the frequency but also induces discriminative information for each term. Term frequency represents how many times a word appears in a document, divided by the total number of words in that document, while inverse document frequency is computed from the number of documents the term appears in relative to the number of documents in the corpus. Furthermore, calculating the TF-IDF weight of a term in a particular document requires calculating the term frequency [TF(t, d)], which is the number of times that the word t occurred in document d; the document frequency [DF(t)], which is the number of documents in which term t occurs at least once; and the inverse document frequency (IDF), which can be calculated from DF using the formula below. The IDF of a word is considered high if it occurred in few documents and low if it occurred in many documents (Ahmed Taloba et al., 2018). The TF-IDF model is defined in Equations (2) and (3):

    TF = (number of occurrences of the word in a document) / (number of words in that document)    (2)

    IDF = log((number of documents) / (number of documents in which the word occurs))    (3)
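A literal transcription of Equations (2) and (3), assuming tokenized documents (in the experiments a library implementation such as Gensim's TfidfModel would normally be used; the toy documents are illustrative):

    import math

    def tf_idf(term, doc, docs):
        tf = doc.count(term) / len(doc)            # Equation (2)
        df = sum(1 for d in docs if term in d)     # documents containing the term
        idf = math.log(len(docs) / df)             # Equation (3)
        return tf * idf

    docs = [["flight", "booking", "travel"],
            ["hotel", "booking"],
            ["travel", "food"]]
    # "booking" appears in two of the three documents, so its IDF (and weight) is low
    print(tf_idf("booking", docs[0], docs))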


TABLE 3 | Statistics of our involved datasets.

Dataset: 20-newsgroup data1
  Description: 20,000 documents; average document length: 28; topics: computer, recreation, science, miscellaneous, politics, and religion as distinct classes.

Dataset: Facebook conversations2
  Description: 20 text conversations; approximately 87 sentences and 7,250 words; topics: travel, food, restaurant, hotel booking, flight booking, study, and university.

1 http://people.csail.mit.edu/jrennie/20Newsgroups/.
2 https://github.com/Rania2016/20-FACEBOOK-CONVERSATIONS.

TABLE 4 | Performance of involved topic modeling methods with different extracted topics t, t = 5 and 10 (average value of recall, precision, and F-score).

             t = 5                                t = 10
TM method    R          P          F              R          P          F
LSA          0.1546419  0.1501913  0.1523841      0.1825729  0.1838501  0.1881104
LDA          0.1500000  0.1533333  0.1511765      0.1238715  0.1067887  0.1146975
NMF          0.2577005  0.2522465  0.2549443      0.4734466  0.4791113  0.4762621
PCA          0.3860860  0.3878723  0.3869771      0.5546999  0.5616488  0.5581528
RP           0.1137931  0.1105053  0.1121251      0.1156499  0.1123152  0.1139581

Bold values represent the highest performance results.
measures the fraction of retrieved recommended items to the
actual relevant items.
EVALUATION

Evaluation Procedure

OSNs include a huge amount of UGC with many irrelevant and noisy data, such as non-meaningful, inappropriate data and symbols, that need to be filtered before applying any text analysis techniques. In our work, we deal with text mining subjects. This is quite difficult to achieve since the objective is to analyze unstructured and semi-structured text data. Without a doubt, employing methods that are similar to human–human interaction is more convenient, where users can specify their preferences over an extended dialogue. Also, there is a need for further effective methods and tools that can aid in detecting and analyzing online social media content, particularly for those using online UGC as a source of data in their systems. We implemented the Gensim toolkit due to its ease of use and because it gives more accurate results. Gensim was the most popular tool used in many recent studies, and it offers more functionality; it also contains an NLP package that has effective implementations of several well-known functionalities for the TM methods, such as TF-IDF, LDA, and LSA.

In our experiment, we tested the TM methods on a commonly used public text dataset for text application tasks, called the 20-newsgroup data, and on short conversation data from the Facebook social network site, as shown in Table 3.

We evaluate the topic quality and performance of five frequently used TM methods. In addition, we calculate the statistical measures precision, recall, and F-score to assess the accuracy within a different number of features f, f = 10, 100, 1,000, 10,000. Besides, it is important to consider how many topics we want to extract and find in the corpus, and this step must be decided by a human user. We ran the experiment with four numbers of extracted topics t, t = 5, 10, 20, and 50. Recall, precision, and F-score calculations are presented in Equations (4)–(6), respectively.

• Recall (R) is a common information retrieval metric that measures the fraction of relevant items among the recommended items.
• Precision (P) is a common information retrieval metric that measures the fraction of retrieved recommended items to the actual relevant items.
• The F-score (F) measures the effectiveness of the retrieval and is calculated by combining the two standard measures in text mining, namely, recall and precision.

    Recall = TP / (TP + FN)    (4)

    Precision = TP / (TP + FP)    (5)

    F-score = 2 · (Precision · Recall) / (Precision + Recall)    (6)

Note that the true positive (TP) is the number of keywords detected as a topic, the false positive (FP) is the number of non-keywords detected as a topic, the true negative (TN) is the number of non-keywords detected as non-topics, and the false negative (FN) is the number of topics detected as non-topics.
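These metrics follow directly from the four counts above; a small helper mirroring Equations (4)–(6) (the counts passed in are illustrative):

    def recall_precision_f(tp, fp, fn):
        recall = tp / (tp + fn)                                   # Equation (4)
        precision = tp / (tp + fp)                                # Equation (5)
        f_score = 2 * precision * recall / (precision + recall)  # Equation (6)
        return recall, precision, f_score

    print(recall_precision_f(tp=12, fp=30, fn=58))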
Data Extraction and Experiment Results

In our data extraction stage, we aim to extract topics from clusters of input data. As mentioned before, we ran our second evaluation several times, applying a different number of features f and topics t, f = 10, 100, 1,000, and 10,000 and t = 5, 10, 20, and 50. Tables 4–6 present our initial results for topic performance and accuracy, obtained by applying some common standard metrics that are applicable to TM methods, on the 20-newsgroup data.


TABLE 5 | Performance of involved topic modeling methods with different extracted topics t, t = 20 and 50 (average value of recall, precision, and F-score).

             t = 20                               t = 50
TM method    R          P          F              R          P          F
LSA          0.2198939  0.2128799  0.2163301      0.2210345  0.2279532  0.2294835
LDA          0.3446734  0.3435585  0.3489088      0.2312177  0.2174433  0.2336483
NMF          0.5918747  0.5977849  0.5948151      0.6915826  0.6952324  0.6934027
PCA          0.6339618  0.6392421  0.6365910      0.7044610  0.7086668  0.7065576
RP           0.1132626  0.1106185  0.1119249      0.1084881  0.1052548  0.1068470

Bold values represent the highest performance results.

TABLE 6 | Performance of involved topic modeling methods with a different number of features f, f = 10, 100, 1,000, and 10,000 (average F-score).

             Number of features (F-score)
TM method    10           100          1,000        10,000
LSA          0.108238987  0.177539633  0.196284973  0.187135878
LDA          0.118222579  0.313004427  0.596767795  0.616768865
NMF          0.124100619  0.246607097  0.384831984  0.478534632
PCA          0.118742505  0.273855576  0.459150019  0.553060479
RP           0.123841731  0.101052719  0.126599635  0.114772128

Bold values represent the highest performance results.

We observe that each TM method we used has its own strengths and weaknesses, and during our evaluation the results of the methods were broadly similar. Briefly, by comparing the outcomes of the extracted topics, PCA produced the highest term–topic probability; the NMF, LDA, and LSA models provided similar performance; and RP's statistical scores were the worst compared to the other methods. The probabilities range from 0 to 1 in all evaluated TM methods. However, PCA provided a selection of non-meaningful words, like domain-specific stop words, that are not suitable for further processing. Also, we notice that the LDA method provides the best learned descriptive topics compared to the other methods, aside from some methods that failed to create topics that aggregate related words, like the LSA TM method, which usually performs best at creating a compact semantic illustration of words in a corpus. In addition, in Tables 4–6, the PCA and RP methods had the best and worst statistical measures, respectively, when compared to the other TM methods with similar performance results. However, the PCA and RP methods distributed random topics that made it hard to obtain the main topics of a text from them.

Moreover, the LDA and NMF methods produce higher-quality and more coherent topics than the other methods on our evaluated Facebook conversation dataset, but the LDA method was more flexible and provided more meaningful and logical extracted topics, especially with fewer numbers of topics, which matches our final aim of defining a TM method that can understand online UGC. Also, when comparing the LDA and NMF methods based on their runtime, LDA was slower, and it would be a better choice to apply NMF specifically in a real-time system. However, if runtime is not a constraint, LDA outperforms the NMF method. NMF and LDA have similar performances, but LDA is more consistent. Our experiment tested the datasets over a certain number of topics and features, though additional investigation would be essential to make conclusive statements. Also, we ran all the topic methods with several feature numbers, as well as calculating the average of the recall, precision, and F-scores. As a result, the LDA method outperforms other TM methods with most features, while the RP model receives the lowest F-score in most runs in our experiments. The graphs in Figure 6 present the average F-score results with a different number of features f on the 20-newsgroup dataset. Aside from the TM method comparison, the graphs show that a higher F-score was obtained with the LDA model. In addition, over the Facebook conversation data, the LDA method defines the best and clearest meaning compared to other examined TM methods.

Moreover, we measured the topic coherence score, and we observed that extracting fewer keywords led to a high coherence score for the LDA and NMF TM methods. As a result, obtaining fewer keywords can help define the topic in less time, which is useful for our future real-time social recommendation system, which aims to analyze the user's online conversation and deliver a suitable task such as advertainment. Based on our experiments, we decided to focus on the LDA and NMF topic methods as an approach to analyze short social textual data. Indeed, LDA TM is a widely used method in real-time social recommendation systems and one of the most classical state-of-the-art unsupervised probabilistic topic models that can be found in various applications in diverse fields such as text mining, computer vision, social network analysis, and bioinformatics (Vulić et al., 2015; Liu et al., 2016).
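For the coherence measurements, Gensim provides a CoherenceModel; a minimal sketch, reusing the lda, texts, and dictionary objects from the earlier LDA sketch (the "c_v" measure is one common choice; the paper does not state which coherence variant was used):

    from gensim.models import CoherenceModel

    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())   # higher scores indicate more interpretable topics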
CONCLUSION

The internet assists in increasing the demand for the development of business applications and services that can provide better shopping experiences and commercial activities for customers around the world. However, the internet is also full of information and knowledge sources that might confuse users and cause them to spend additional time and effort trying to find applicable information about specific topics or objects. Conversely, the need to analyze short texts has become significantly relevant as the popularity of microblogs such as Twitter grows. The challenge with inferring topics from short text is due to the fact that it contains relatively small amounts of noisy data that might result in inferring an inaccurate topic. TM can overcome such a problem since it is considered a powerful method that can aid in detecting and analyzing content in OSNs, particularly for those using UGC as a source of data. TM has been applied to numerous areas of study such as information retrieval, computational linguistics, and NLP. Also, it has been effectively applied to clustering, querying, and retrieval tasks for data sources such as text, images, video, and genetics. TM approaches still have challenges related to methods used to solve real-world tasks, like scalability problems.

This paper delved into a detailed description of some significant applications, methods, and tools of topic models, focusing on understanding the status of TM in the digital era. In our evaluation, we used two textual datasets: the 20-newsgroup data and short conversation data from the Facebook social network site. The performances achieved by the TM methods were compared using the most important and common standard metrics in similar studies, namely, recall, precision, F-score, and coherence.



FIGURE 6 | The F-score average results with different numbers of features f = 10, 100, 1,000, 10,000 (20-newsgroup dataset).

We also defined which methods can deliver the most well-organized and meaningful topics. As a result, we found that all of the included TM methods have much in common, like transforming text corpora into term–document frequency matrices, using the TF-IDF model as a prefiltering model, producing topic content weights for each document, and other processes. Despite these similarities, the two TM methods that generated the most valuable outputs with diverse ranges and meanings were the LDA and NMF TM methods. The work presented in this paper can be a vital reference for researchers on short-text TM.


DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found here: http://people.csail.mit.edu/jrennie/20Newsgroups/.

AUTHOR CONTRIBUTIONS

TY and MB contributed to the design of the research and to the writing of the journal article. All authors contributed to the article and approved the submitted version.

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Albalawi, Yeap and Benyoucef. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
