Information Processing and Management xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Information Processing and Management
journal homepage: www.elsevier.com/locate/infoproman
An evaluation of document clustering and topic modelling in two
online social networks: Twitter and Reddit
Stephan A. Curiskis⁎, Barry Drake, Thomas R. Osborn, Paul J. Kennedy
Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
ARTICLE INFO

Keywords:
Document clustering
Topic modelling
Topic discovery
Embedding models
Online social networks

ABSTRACT
Methods for document clustering and topic modelling in online social networks (OSNs) offer a
means of categorising, annotating and making sense of large volumes of user generated content.
Many techniques have been developed over the years, ranging from text mining and clustering
methods to latent topic models and neural embedding approaches. However, many of these
methods deliver poor results when applied to OSN data as such text is notoriously short and
noisy, and often results are not comparable across studies. In this study we evaluate several
techniques for document clustering and topic modelling on three datasets from Twitter and
Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four
clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison.
Several different evaluation measures are used in the literature, so we provide a discussion and
recommendation for the most appropriate extrinsic measures for this task. We also demonstrate
the performance of the methods over data sets with different document lengths. Our results show
that clustering techniques applied to neural embedding feature representations delivered the best
performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf
weights combined with embedding distance measures.
1. Introduction
In January 2018 there were estimated to be around 4.021 billion people around the world who use the internet. Of these, 3.196
billion people use social media in some form, generating a staggering amount of content.1 Online platforms and social networks have
become a key source of information for nearly half of the world’s population. These platforms are increasingly being used to disseminate information regarding news, brands, political discussion, global events and more (Bakshy, Rosenn, Marlow, & Adamic,
2012). However, much of the data generated is unstructured and not annotated. This means that it is difficult to understand how
topics of information are diffused through online social networks (OSNs), and how users engage with different topics (Guille, Hacid,
Favre, & Zighed, 2013). Automatically annotating topics within OSNs may facilitate analysis of information diffusion and user
preferences by enriching the data available from these platforms, in a way that is readily analysed. With the rise of phenomena like
echo chambers and filter bubbles, which lead to individuals receiving biased and narrowly focused content, the challenge of
⁎ Corresponding author.
E-mail address: stephan.a.curiskis@student.uts.edu.au (S.A. Curiskis).
1 https://wearesocial.com/uk/blog/2018/01/global-digital-report-2018, accessed Sep. 2018.
https://doi.org/10.1016/j.ipm.2019.04.002
Received 22 September 2018; Received in revised form 11 April 2019; Accepted 11 April 2019
0306-4573/ © 2019 Elsevier Ltd. All rights reserved.
Please cite this article as: Stephan A. Curiskis, et al., Information Processing and Management,
https://doi.org/10.1016/j.ipm.2019.04.002
automatically annotating OSN data has become important.
Document clustering is a set of machine learning techniques that aim to automatically organise documents into clusters such that
documents within clusters are similar when compared to documents in other clusters. Many methods for clustering documents have
been proposed (Bisht & Paul, 2013; Naik, Prajapati, & Dabhi, 2015). These techniques typically involve the use of a feature matrix,
such as a term-frequency inverse-document-frequency matrix (tf-idf matrix) to represent a corpus, with a clustering method applied to
this matrix. More recently, representations derived from neural word embeddings have seen applications on social media data as they
can produce dense representations with semantic properties and require less manual preprocessing than traditional methods
(Li, Shah, Liu, & Nourbakhsh, 2017). Common clustering methods applied in this context build hierarchies or partitions (Irfan et al.,
2015). Example hierarchical methods are agglomerative clustering and divisive clustering. Example partitioning methods are k-means and k-medoids clustering.
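The distinction between the two families can be illustrated with a brief scikit-learn sketch (an assumed implementation choice, with invented toy points): both recover the same two groups on well-separated data, but k-means assigns points to the nearest of k centroids, while agglomerative clustering merges the closest pairs bottom-up.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy 2-D points forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Partitioning: k-means iteratively assigns points to k centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: agglomerative clustering merges the closest clusters bottom-up.
ag_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Both methods separate the two groups; the label values themselves may differ.
```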
Topic modelling involves methods to discover patterns of word use within documents, and is an active research area with several
techniques recently applied to OSN data (Chinnov, Kerschke, Meske, Stieglitz, & Trautmann, 2015). Topics are typically defined as a
distribution of words, with documents modelled as mixtures of topics. Like document clustering, topic modelling can be used to
cluster documents by giving a probability distribution over a range of topics for each document. This can be viewed as a form of soft
partition clustering, where the data points have a probabilistic degree of ownership to each cluster. The topic representation also
provides the word distribution for each topic which aids in interpretation. Commonly used topic models with applications on OSN
text data include Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), the Author-Topic model (Hong & Davison, 2010), and more
recently Dynamic Topic Models which discover topics over time (Alghamdi & Alfalqi, 2015).
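The soft-partition view of a topic model can be sketched with scikit-learn's LDA implementation (an assumed library choice; the four-document corpus is invented for illustration). Each document receives a probability distribution over topics rather than a hard cluster label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold shares today"]

# LDA operates on raw term counts, not tf-idf weights.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row of doc_topics sums to 1: a soft assignment of the
# document across the two topics.
```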
Document clustering and topic modelling are increasingly important research areas as these methods can be applied to large
amounts of readily available OSN text data, yielding homogeneous groups of documents. These document groups may then align to
relevant topics and trends. Clustering is particularly suited to OSN data as platforms like Twitter and Facebook use hashtags as a form
of topic annotation (Steinskog, Therkelsen, & Gambäck, 2017), which may be used for evaluation of document clustering and topic
modelling methods. Large scale clustering can help make sense of the huge amount of content being created online every day, and can
subsequently be used in further machine learning tasks. Additional features derived from OSN data (such as user demographic,
geographic and network data) have also been clustered to find groups of online posts or comments that are semantically similar
(Alnajran, Crockett, McLean, & Latham, 2017). However, OSN data presents many challenges when applying topic modelling and
document clustering methods. For example, such text is typically short and contains noise such as misspellings and grammatical
errors (Chinnov et al., 2015).
There are two key challenges with topic modelling and document clustering research on OSN data sets. Firstly, results are often
not reproducible since the data used in the studies frequently cannot be published. For instance, Twitter’s terms of service do not
allow for tweets to be published. Instead, researchers can publish a list of tweet identifiers that were used and retrieved via the API.
Unfortunately, over time the associated tweets are removed from the platform, which degrades the underlying data. The data sets
used are also often small or biased towards particular contexts. These issues result from the complex data collection and preparation
that is often required to extract large data sets from an OSN platform, as well as restrictions on the platforms themselves
(Stieglitz, Mirbabaie, Ross, & Neuberger, 2018).
Secondly, different studies often use different methods for evaluating the performance of clustered documents. Evaluation
methods on Twitter data vary from extrinsic measures which compare clusters against labelled data, to manual assessments of cluster
performance and interpretability (Alnajran et al., 2017). It is therefore difficult to compare empirical results. With the fast pace of
research in this area, there is little guidance on what method or family of methods will perform best in specific circumstances, such as
on short Twitter data or relatively longer Reddit comments.
In this paper we provide an analysis of the performance of several methods for document clustering and topic modelling of OSN
content on three data sets: two Twitter data sets and a publicly available Reddit data set. We evaluate four feature representation
methods derived from tf-idf and embedding matrices combined with four clustering techniques, and include a Latent Dirichlet
Allocation (LDA) topic model for comparison. We also provide a discussion of the properties and appropriateness of document
clustering evaluation measures commonly used in the literature. We evaluate performance with three such measures, namely the
Normalised Mutual Information (NMI), the Adjusted Mutual Information (AMI), and the Adjusted Rand Index (ARI). Furthermore, we
have made our data sets available so that our results can be reproduced. To comply with Twitter’s terms of use, we have made
available the tweet identifiers used along with the topic label. We have also made available the full Reddit data set used
(Curiskis, Drake, Osborn, & Kennedy, submitted).
Further to this, by tuning key hyper-parameters we demonstrate how embedding models can be used to generate feature sets for
document clustering that delivered good performance and captured latent structure in the data. We also show how word embedding
distances can aid in the interpretation of the clusters by ranking the top words, forming a topic vector of words. This contribution is
significant since data sets from OSNs are often short and contain noise such as misspellings, abbreviations, acronyms, special
characters, emojis, URLs and hashtags. These issues can result in poor performance for many commonly used techniques.
Furthermore, a clear consensus is lacking in the literature regarding methods that work effectively on OSN data. The results of this
paper provide guidance on methods giving good performance over different types of OSN data. These results show that traditional
topic modelling and document clustering approaches do not work well on short and noisy social media posts. Instead, clustering
approaches applied to more recent neural network embedding representations can deliver improved performance.
The structure of this paper is as follows. In Section 2 we review the current literature in this research area. In Section 3 we present
the detail of our methods, including a description of the data extraction, the preparation process, the feature representations, the
clustering methods, and the evaluation measures. In Section 4 we present our results with a discussion. In Section 5 we provide a
discussion followed by our conclusion in Section 6.
2. Literature review
We organise the literature on document clustering and topic modelling of OSNs into three areas. Firstly, many studies have
centred on identifying and interpreting memes in this domain, incorporating textual, network and user data. Secondly, identifying
topics through topic models and clustering approaches has received much attention as a means of understanding and categorising
online content. Thirdly, recent advances in neural word embedding models have been used to provide dense feature representations
of documents from OSNs.
2.1. Meme identification
The term “meme” is commonly used to represent an element of culture or system of behaviour that spreads from one individual to
another by imitation. In the context of OSNs, for this paper we define a “meme” as a semantic unit expressed as electronic text where
the semantics are transferred across multiple individuals even though the text may be different. This specific definition of “meme” is
sometimes called “ememe” (Shabunina & Pasi, 2018). A topic in OSN applications can be defined as a coherent set of semantically
related terms which express a single argument (Guille et al., 2013). In comparison to this definition of a topic, a meme does not
necessarily need to be derived from a set or distribution of words, but instead captures significant semantic content. Often in
practice, however, there is an overlap between the two concepts. The concept of a meme is useful for OSN applications as it can be
thought of as a latent representation of textual content, but can also be discovered through analysis of OSN user and network data.
A study by Ferrara et al. (2013) aimed to identify memes within large social media data. In that study, several similarity measures
were defined for Twitter data which leverage content, metadata and network features. The authors defined the concept of a ‘protomeme’ which was used to refer to hashtags, user mentions, URLs and phrases. Data was aggregated by creating protomeme projections onto spaces based on tweet, user and content features. For each protomeme pair, common user, tweet, content and diffusion
similarity measures were calculated. These similarity matrices were then aggregated in several different ways, such as the element-wise mean and maximum. Finally, the aggregated similarity matrix was clustered with hierarchical clustering. The resulting clusters
were taken to represent memes within the data. The data set used was a collection of 5523 tweets related to the US presidential
primaries in April 2012. Twenty-six topics were manually identified and assigned as labels to each tweet. Since the memes and topics
can overlap per tweet, performance was evaluated using a variation of Normalised Mutual Information designated as LFK-NMI. Given
the optimal parameters for this approach, the protomeme clustering method delivered average 5-fold cross-validation LFK-NMI scores
of around 0.13. JafariAsbagh, Ferrara, Varol, Menczer, and Flammini (2014) later extended the algorithm to work on streaming data.
More recently, Shabunina and Pasi (2018) developed a method to identify and characterise memes, considered as a set of frequently occurring related words propagating through a network over time. The relationships between terms in a social media stream
were modelled using a graph of words. To identify memes, a k-core degeneracy process was applied to the graph to generate
subgraphs, which constituted meme bases. A meme was defined as the fuzzy subset of terms in a meme basis. The method was applied
to over 800,000 tweets from the search queries #economy, #politics and #finance. Although useful to characterise and interpret
topics in social media streams, memes were not attributed to individual social media documents or users. Evaluation of the method
was limited to subjective interpretation and intrinsic measures.
2.2. Document clustering and topic modelling
In contrast to methods for meme identification, many studies have focused on detecting topics in OSNs. Topic models typically refer to methods that group both documents that share similar words and words that occur in a similar set of documents. Document clustering refers to methods that group documents according to some feature matrix, such that documents within a cluster are more similar to each other than to documents in other clusters. Due to the short document size and high degree of noise inherent in OSN data, such as
Twitter data, clustering based methods are often applied in favour of more traditional topic models (Chinnov et al., 2015). Nevertheless, topic models applied to OSN data are still an active area of research (Alghamdi & Alfalqi, 2015). Indeed, the term ‘topic
discovery’ may refer to either topic modelling or document clustering.
Document clustering methods have typically used vector space representations of word occurrence by document. Commonly, bag-of-words methods model each document as a point in the space of words. Each word is a feature or dimension of this space, with element values assigned in one of several ways: one-hot encodings, where the value is set to 1 if the word exists in the document and 0 otherwise; term frequency; or term-frequency inverse-document-frequency calculations. Given that the total dimension size is the number of unique words, often there is a threshold cut-off to use only those words with high values (Patki & Khot, 2017). A range of clustering algorithms may then be applied to the feature matrix, such as k-means, hierarchical clustering, self-organising maps, and so on (Naik et al., 2015).
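The standard tf-idf pipeline described above can be sketched with scikit-learn (an assumed implementation choice; the four-document corpus is invented for illustration): documents are vectorised into a sparse tf-idf matrix and then partitioned by k-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["election results announced today",
        "voters turn out for the election",
        "new smartphone released this week",
        "the smartphone camera is impressive"]

# tf-idf down-weights terms that appear in many documents; max_features
# applies the threshold cut-off on the vocabulary size.
vec = TfidfVectorizer(stop_words="english", max_features=1000)
X = vec.fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Documents sharing a theme ("election" vs "smartphone") share a cluster label.
```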
For instance, Godfrey, Johns, Meyer, Race, and Sadek (2014) developed an algorithm to identify topics within a specific Twitter
data set, a collection of about 30,000 tweets extracted using the query term ‘world cup’. Non-negative Matrix Factorisation (NMF)
and k-means clustering were applied to the tf-idf representation of tweets to create topic clusters. Due to the noisiness of Twitter data,
Godfrey et al. (2014) developed a preliminary filtering step using multiple runs of the DBSCAN clustering algorithm combined with
consensus clustering. The rationale was that tweets which are not close to any particular cluster may be treated as noise and removed
from an analysis. The results when using this approach showed that both k-means clustering and NMF produced similar results.
However, when analysing the clusters using a subjective evaluation of tweet network diagrams and word clouds, NMF seemed to
produce more interpretable clusters.
Fang, Zhang, Ye, and Li (2014) approached detecting topics in Twitter using additional information about the tweet. Recognising
that the textual content of tweets can be quite limited, a ‘multi-view’ topic detection framework was developed based on more
granular ‘multi-relations’. These multi-relations were defined as useful relations from the Twitter social network and included
hashtags, user mentions, retweets, meaningful words and similar posting times. To measure these multi-relations, a document similarity measure was developed. Multi-relation similarity scores were then combined into a multi-view and clustered using three
different methods. These clusters were taken to represent topics and a keyword extraction method, based on suffix trees and tf-idf
weights, was applied to derive representative keywords for each cluster. This method was evaluated using a dataset of 12,000 tweets
with 60 ‘hot’ topics extracted from the Twitter API. Three evaluation measures were used, namely the F-measure, NMI, and entropy.
The results showed that including more multi-views improved performance, with results above 0.928 on the F-measure and 0.935
NMI. However, the authors did not remove any of the hot topic key words from the text. These key words are generally short phrases
or hashtags, and can be discovered easily by tf-idf approaches.
Another study compared the efficacy of different clustering methods to detect topics in Twitter data centered around recent
earthquakes in Nepal (Klinczak & Kaestner, 2016). In this study, tweets were represented by their tf-idf vectors. Four clustering
methods applied to this representation were compared, namely k-means, k-medoids, DBSCAN and NMF. By evaluating each clustering method with measures for cohesion and separation of clusters (i.e. intrinsic evaluation measures), it was clear that NMF
produced superior clusters which were simpler and easier to interpret. More recently, Suri and Roy (2017) applied LDA and NMF to
detect topics on a Twitter data set, as well as a RSS news feed. Both methods were found to have similar performance. LDA was
deemed to be more interpretable, but NMF was faster to calculate. However, performance was evaluated by manual inspection of the
key terms for topics.
Many studies have applied topic modelling techniques to OSN data. For instance, Paul and Dredze (2014) developed a topic
modelling framework for discovering self-reported health topics using Twitter data. A set of 5128 tweets was annotated with a positive
status if they related to the user’s health, and negative if not. A logistic regression model was trained to predict the positive labels in
the annotated data, and applied to a Twitter stream filtered with a large number of health related keywords. This provided a set of
144 million health tweets which was used to run the Ailment Topic Aspect Model. While this study is useful in filtering and interpreting large amounts of relevant tweets, validation of the discovered topics focused on correlation measures against external health
trend data.
Further to topic models applied to a static data set, dynamic topic models, which incorporate the temporal nature of OSN data, are
gaining attention (Alghamdi & Alfalqi, 2015). Ha, Beijnon, Kim, Lee, and Kim (2017) applied dynamic topic models to Reddit data to
understand user perceptions of smart watches. While these results are interesting to gauge public opinion in this area, no ground truth
label was used and likewise no extrinsic evaluation measures were applied. Recently, Klein, Clutton, and Polito (2018) applied topic
modelling to reveal distinct interests in the Reddit conspiracy page (a subreddit page). NMF was used to create topic loadings for each
user contributing to the page. These topic loadings were then clustered using k-means to reveal user subgroups. Again, this study is
useful in understanding the user population within OSN discussion threads, but no extrinsic evaluation was made to validate the
quality of the topic modelling or the clustering.
2.3. Neural network embedding models
Much of the literature on clustering OSN text data used tf-idf matrix representations of tweets at some level. These matrices treat
terms as one-hot encoded vectors, where each term is represented by a binary vector with exactly one non-zero element. This means
that relationships between words, such as synonyms, are not incorporated and the resulting document matrix representation is sparse
and high dimensional. The concept of dense, distributional representations of words, or word embeddings, provides an alternative
approach (Bengio, Ducharme, Vincent, & Janvin, 2003). In these methods, each word is represented by a real valued vector of fixed
dimension. Word embeddings are commonly trained using neural network language models, such as word2vec (Mikolov, Chen,
Corrado, & Dean, 2013). However, when using word embedding models to create document level representations, the word vectors
need to be aggregated in some way. Common approaches in the literature are to simply take the mean of the word vectors for all
terms in the document, or to concatenate the vectors to a document vector of fixed size (Yang, Macdonald, & Ounis, 2017). Document
representations derived from tf-idf weighted word vector averages have also been proposed (Corrêa Júnior, Marinho, & dos Santos,
2017; Zhao, Lan, & Tian, 2015). Another method trains document level dense vector representations at the same time as the word
vectors (Le & Mikolov, 2014). We refer to this latter method as doc2vec.
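The two simplest aggregations, the unweighted mean and a tf-idf weighted mean (analogous to the FR2 and FR3 representations evaluated in this paper), can be sketched as follows. The embedding vectors and tf-idf weights here are invented toy numbers, not trained values; in practice they would come from a word2vec model and a fitted tf-idf vectoriser.

```python
import numpy as np

# Hypothetical pretrained word vectors (in practice: word2vec lookups).
emb = {"good": np.array([1.0, 0.0]),
       "movie": np.array([0.0, 1.0]),
       "film": np.array([0.1, 0.9])}

# Hypothetical tf-idf weights for the same vocabulary.
tfidf = {"good": 0.2, "movie": 0.5, "film": 0.7}

def mean_vector(tokens, emb):
    """Unweighted mean of the word vectors (FR2-style aggregation)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def tfidf_weighted_vector(tokens, emb, tfidf):
    """tf-idf weighted mean of the word vectors (FR3-style aggregation)."""
    vecs = [tfidf[t] * emb[t] for t in tokens if t in emb]
    weights = [tfidf[t] for t in tokens if t in emb]
    return np.sum(vecs, axis=0) / np.sum(weights)

doc = ["good", "movie"]
v_mean = mean_vector(doc, emb)                        # [0.5, 0.5]
v_weighted = tfidf_weighted_vector(doc, emb, tfidf)
# The weighted vector leans toward "movie", the term with the larger weight.
```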
Much research has applied neural word embeddings to classification and semantic evaluation tasks. For instance,
Billah Nagoudi, Ferrero, and Schwab (2017) applied word embeddings to model semantic similarity between Arabic sentences. Three
different sentence level aggregations were proposed, namely the sum of the word vectors for all words in a sentence, an inverse-document-frequency weighted sum of the word vectors, and a part-of-speech weighted sum. The authors found that the weighted sum
representations delivered more accurate sentence similarities. In another study, Corrêa Júnior et al. (2017) developed a classification
method for sentiment analysis using an ensemble of classifiers with different feature representations, namely a tf-idf matrix, a mean
word vector representation, and a tf-idf weighted mean of the word vectors. Recently, Li et al. (2017) published a number of pretrained word2vec models on a Twitter data set of 390 million English tweets with a range of pre-processing steps. Embedding
representations are becoming more widely used in NLP tasks involving OSN data.
Further to word and document embeddings, character level embedding models have been proposed and applied to Twitter data,
creating tweet2vec (Dhingra, Zhou, Fitzpatrick, Muehl, & Cohen, 2016). The motivation for tweet2vec is that social media data are
noisy, suffering from spelling errors, abbreviations, acronyms and special characters, which can lead to prohibitively large vocabulary
sizes. Tweet2vec takes as input sequences of characters for each tweet and passes them through a bidirectional GRU neural network
encoder to create a fixed dimensional tweet embedding vector. This tweet embedding is then passed through a linear softmax layer to
predict the hashtags of a tweet. The algorithm was evaluated on hashtag classification performance. While this method may promise
to create useful tweet embeddings, it assumes that hashtags are valid labels for tweets. This assumption may not hold as other text,
user mentions and URLs can also be important in defining the topic of the tweet, and tweets can have multiple hashtags.
Recently, contextualised extensions to word embeddings have been proposed. One challenge for traditional word embeddings is
polysemy, where a word has multiple meanings dependent on the context. Peters et al. (2018) introduced a deep contextualised word
embedding model, which models both the syntactic and semantic characteristics of word use, and how these uses vary across
linguistic contexts. This method involves coupling embedding vectors trained from a bidirectional LSTM with a language model
objective. Named ELMo (Embeddings from Language Models), the method assigns an embedding vector to each token that is a
function of the entire input sentence. This technique may be useful for clustering social media documents.
In addition to the document clustering and topic modelling approaches discussed so far, a new series of deep learning based
clustering methods have been developed (Min et al., 2018). Many of these techniques use deep neural networks to learn feature
representations trained at the same time as clustering. Examples include several deep autoencoder networks with a clustering layer,
where the loss function is a combination of reconstruction loss and clustering loss. Clustering methods based on generative models
such as Variational Autoencoders and Generative Adversarial Networks look promising from a document clustering perspective since
they can also generate representative samples from the clusters. However, the focus for these techniques to date has been on image
data sets.
Many approaches to document clustering and topic modelling have been proposed for OSN text data. These methods typically involve
creating document level feature representations with tf-idf matrices or other techniques, followed by clustering methods to group
documents into semantically related clusters. However, there are many variations on these methods and word embedding representations have not yet been effectively applied and benchmarked on document clustering tasks in OSN data, to the best of our
knowledge.
3. Methods
In this section we describe the three data sets used and the processing steps, the feature representations and clustering algorithms,
and the evaluation measures used with a discussion of their properties.
Document clustering and topic modelling methods applied to OSN data typically involve several processing steps, as outlined in Fig. 1. Data is first extracted from a source. From the raw data set or OSN platform API, documents are extracted which consist of the text data from an individual user; a tweet and a Reddit parent comment are examples of a document. The textual elements are then processed to remove common punctuation and stop words, and tokenised. Feature representations of each document are created, followed by a clustering method. Extrinsic clustering evaluation measures are then calculated using ground truth labels. The variations at each step of the process are outlined in Table 1. In the rest of this section we detail our approach to each step of Fig. 1.

Fig. 1. Process pipeline for document clustering. The contribution of this paper is an evaluation of four methods for feature representation and four clustering methods using three evaluation measures over three data sets.

Table 1
Outline of the data sets, methods for feature representations and clustering, and extrinsic evaluation measures used in this study. For the three data sets, we evaluate the feature representation and clustering method combinations and the LDA topic model (17 combinations) with the three evaluation measures.

Data sets
Twitter stream filtered by #Auspol, 29,283 tweets
RepLab 2013 competition Twitter data, 2657 tweets
Reddit data from May 2015, 40,000 parent comments

Methods
Feature representations:
FR1  tf-idf matrix with the top 1000 terms per document
FR2  Mean word2vec matrix
FR3  Mean word2vec matrix weighted by the top 1000 tf-idf scores
FR4  doc2vec matrix for each document

Clustering methods:
CM1  k-means clustering
CM2  k-medoids clustering
CM3  Hierarchical agglomerative clustering
CM4  Non-negative matrix factorisation (NMF)

Topic model:
LDA  Latent Dirichlet Allocation topic model

Evaluation measures
NMI  Normalised Mutual Information
AMI  Adjusted Mutual Information
ARI  Adjusted Rand Index
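All three extrinsic measures are available in standard libraries; a minimal scikit-learn sketch (an assumed implementation choice, with invented toy labels) illustrates how they compare cluster assignments against ground truth:

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score,
                             adjusted_rand_score)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # ground-truth topic labels
pred  = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # cluster assignments (one error)

nmi = normalized_mutual_info_score(truth, pred)
ami = adjusted_mutual_info_score(truth, pred)
ari = adjusted_rand_score(truth, pred)

# A perfect clustering scores 1.0 on all three measures; AMI and ARI
# correct for chance, so random assignments score near zero.
```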
3.1. Data extraction
We used three OSN data sets for evaluation: two Twitter data sets and a Reddit data set. We have used Twitter data since it has
been widely used in the literature regarding topic modelling and document clustering. While there appear to be fewer studies which
have used Reddit data, Reddit still represents a valuable source of OSN data to use for topic modelling and document clustering.
Reddit is also used more as a discussion forum, and comments have a wider range of document lengths than Twitter data. All three
data sets have been made available (Curiskis et al., submitted).
Twitter data provides a readily accessible data source for short and topical user driven content. It is also widely used for research
purposes, but has many challenges due to the short tweet length and use of hashtags, acronyms, user mentions and URLs
(Stieglitz et al., 2018). The first Twitter data set was collected through Twitter’s public API. It was constructed by filtering the Twitter
stream for the hashtag #Auspol, which is frequently used in Australia for political discussion. A common application for document
clustering on OSN data is to take a set of documents related to a particular theme and discover topics, such as the study of health
topics in Twitter data (Paul & Dredze, 2014). The #Auspol Twitter data set is suitable for comparing document clustering methods
since the hashtag is widely used to link a large number of disparate discussions, often with additional hashtags, related to public
opinion in Australia. Data was collected between 13 June and 2 September 2017 and consisted of 1,364,326 tweets. We filtered this
data set by selecting English language tweets only and removed retweets based on the retweeted_status field and a text filter. This
resulted in 205,895 tweets.
No ground truth topic labels exist for this data set so we used a set of high count hashtags as ground truth labels. We further
removed the search hashtag (#Auspol) from the data set, since all tweets had this token. It is common for there to be multiple
hashtags on a tweet, so to avoid having overlapping topics we removed tweets which contained more than one of the top hashtags.
We also manually removed some related hashtags, such as #ssm (same sex marriage) which is closely related to #marriageequality;
we kept the latter as it was used in more tweets. Lastly, we filtered by hashtags with at least 1000 tweets to keep the topics relatively
balanced. This resulted in 29,283 tweets with 13 hashtags denoting topic labels, as given in Table 2.
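As a concrete illustration, the labelling procedure described above can be sketched as follows. The hashtag set and tweets are toy stand-ins; the study used the 13 hashtags of Table 2 and additionally required at least 1000 tweets per hashtag.

```python
# Toy stand-in for the top hashtags used as ground truth labels.
TOP_HASHTAGS = {"#qldpol", "#qanda", "#insiders"}

def label_tweets(tweets):
    """Keep a tweet only if it contains exactly one top hashtag,
    and use that hashtag as its ground truth topic label."""
    labelled = []
    for text in tweets:
        tags = {tok for tok in text.lower().split() if tok in TOP_HASHTAGS}
        if len(tags) == 1:
            labelled.append((text, tags.pop()))
    return labelled

tweets = [
    "#qanda panel discusses the nbn",
    "#qanda #insiders double up",  # dropped: more than one top hashtag
    "no top hashtag here",         # dropped: no ground truth label
]
print(label_tweets(tweets))  # [('#qanda panel discusses the nbn', '#qanda')]
```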
The second Twitter data set was taken from the RepLab 2013 competition (Amigó et al., 2013). This competition focused on
monitoring the reputation of entities (companies and individuals), and involved tasks such as named entity recognition, polarity
classification and topic detection. The tweets used in this competition were annotated with topic labels by several trained annotators
supervised and monitored by reputation experts. For the purposes of this paper, the topics annotated in these tweets were taken as a
gold standard. We have used this data set because it has gold standard labels already annotated and has been used for topic detection
tasks.
We downloaded the list of Twitter identifiers from the training and testing data sets for the topic detection task made available
through the RepLab 2013 competition and retrieved the details through the Twitter API on 19 January, 2019. Out of 110,344
published tweet identifiers with labelled topics, we could only retrieve the tweet text and other information for 23,684 tweets. This is
Table 2
Count of tweets per hashtag in the #Auspol Twitter data set.

Topic number   Hashtag             Tweets
1              #qldpol               3845
2              #qanda                3592
3              #insiders             3495
4              #lnp                  3434
5              #politas              2618
6              #marriageequality     2562
7              #springst             1708
8              #nbn                  1626
9              #trump                1547
10             #uspoli               1498
11             #stopadani            1186
12             #climatechange        1148
13             #turnbull             1024
likely due to tweets and users being deleted since the tweets were published. Furthermore, there is a long tail of topics labelled in this
data. In fact, for the 23,684 tweets there were a total of 3432 distinct topics, with 1263 topics containing a single tweet. To ensure
that there were sufficient data points for our methods to detect, we required a minimum frequency of 100 tweets per topic. We also removed
the label denoted ‘other topics’ as this does not represent an internally consistent topic. After this filtering we had a data set of 2657
tweets with 13 topic labels from the competition. The list of topic labels used is given in Table 3.
We originally included the RepLab 2013 data set primarily because comparative results for topic discovery are available from the
competition. However, due to the large volume of tweets which could not be retrieved from Twitter’s API, accurate comparisons are
no longer possible. Nevertheless, the ground truth topic labels still allow for the performance of the methods to be benchmarked.
The third data set was from the Reddit platform and consisted of parent comments and their related comments, organised by subreddit page, from May 2015. The Reddit platform is widely used for discussion on specific topics or themes, grouped by subreddit page, so it is ideal for this study. Furthermore, Reddit comments can be longer than tweets. Reddit parent comments refer to
the top comment which may or may not have responses from other users. This data was made public on the Reddit website
(Reddit, 2015). The full data set contained around 54.5 million comments on 50,138 subreddit pages. We chose this data set since it is
freely available in full and contains discussion on multiple themes. It is therefore an ideal data set to use for benchmarking methods.
We chose five subreddit pages which represent disjoint themes for analysis. These five subreddit pages were also used in a previous
study benchmarking classification models (Gutman & Nam, 2015). Since parent comments and responses are inherently related, we
pooled all the user posts into documents grouped by the parent comment identifier. Table 4 shows the count of parent comments per
subreddit page. We randomly sampled 40,000 parent comment identifiers from across the five subreddit pages, then used these pages
to denote the ground truth labels.
Reddit data is especially useful in this study since it contains a wider range of character lengths per document than Twitter data,
since Twitter has a limit on the number of characters. An evaluation of the performance of the document clustering methods by
document length can provide guidance for future studies on the optimal method for a particular data set. To examine this performance, we partitioned the Reddit data into four distinct subsets based on the number of characters per document. Details for the four
data partitions are given in Table 5. For comparison with the Twitter data sets, a tweet has a maximum of 240 characters. For the
#Auspol Twitter data, the mean character length was 117 with 25th percentile of 103 and 75th percentile of 138. Most tweets
therefore fall into the 101 to 200 character length document group.
Table 3
Count of tweets per topic label in the RepLab 2013 Twitter data set.

Topic number   Topic                                           Tweets
1              For sale                                           329
2              Suzuki cup                                         296
3              User comments                                      262
4              Money laundering / terrorism finance               199
5              Record of views on YouTube                         195
6              Fan Craze - Beliebers                              154
7              Princeton offense                                  131
8              For Sale - Nissan Cars, Parts and Accessories      127
9              Jokes                                              127
10             Sports sponsors                                    127
11             Spam                                               114
12             Ironic criticism                                   111
13             MotoGP - User comments                             103
Table 4
Count of parent comments per subreddit page.

Topic number   Subreddit page   Parent comments
1              NFL                       10,563
2              News                       9488
3              pcmasterrace               9186
4              Movies                     6263
5              Relationships              4500
Table 5
Reddit data was partitioned into four sets based on document character length. Documents are grouped by the parent comment. The mean character length and mean number of tokens per document are given.

Character length range   Number of documents   Mean character length   Mean number of tokens
1–100                    15,273                 46.1                     4.5
101–200                   8360                 144.9                    13.3
201–500                   9310                 317.4                    28.6
501 or greater            7057                1,584.5                  141.1
3.2. Data preparation
Data preparation and analysis in this study was conducted using python 3.6.1. For text preprocessing, we removed the list of
stopwords from the nltk 3.2.4 package and punctuation from string. A customised tokeniser function was created for tweets which
retained hashtags and user mentions, and removed URLs. To tokenise the Reddit data, we simply removed punctuation and standard
stopwords. We did not apply any stemming or lemmatisation. We also used the TfidfVectorizer function from sklearn 0.19.1 for the tf-idf method and the weighted word2vec method.
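The customised tweet tokeniser can be sketched along the following lines. The function name and the tiny inline stopword set are illustrative assumptions; the study used the full nltk 3.2.4 stopword list.

```python
import re
import string

# Small illustrative stopword set; the study used the nltk 3.2.4 list.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}

URL_RE = re.compile(r"https?://\S+")

def tokenise_tweet(text):
    """Sketch of the customised tweet tokeniser: removes URLs,
    stopwords and punctuation, but keeps #hashtags and @mentions."""
    text = URL_RE.sub("", text.lower())
    tokens = []
    for tok in text.split():
        if tok.startswith(("#", "@")):
            tokens.append(tok)  # keep hashtags and user mentions
            continue
        tok = tok.strip(string.punctuation)
        if tok and tok not in STOPWORDS:
            tokens.append(tok)
    return tokens

print(tokenise_tweet("The #nbn rollout is slow @user https://t.co/abc"))
```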
For the #Auspol Twitter data, we removed the list of 13 hashtags taken as ground truth labels from the text, in addition to the
#Auspol Twitter API search query. The RepLab 2013 Twitter data set had annotated topic labels that were not based directly on any
individual tokens, so no modification was required. For the Reddit data, as the subreddit page was used as the ground truth label we
did not need to modify the text.
3.3. Feature representations
In this study we evaluated the performance of four methods to construct feature representations for documents combined with
four commonly used clustering algorithms. We also included an LDA topic model in a separate topic models category since the
technique only takes as input a bag-of-words matrix. These methods are outlined in Table 1, where each method component is given a
code for ease of reference. The four feature representations are coded FR1-FR4, the four clustering methods CM1-CM4, and the LDA topic model simply LDA. While many other techniques have been proposed in the literature, such as the
meme identification studies (JafariAsbagh et al., 2014; Shabunina & Pasi, 2018), we did not implement them for evaluation as they
are specific to data from Twitter. However, we provide comparison results in our discussion where they were available from other
studies.
For FR1, the tf-idf matrix was limited to the top 1000 terms per document by frequency since no performance improvement was
gained by including more terms. This is likely due to the short nature of social media text which produces sparse tf-idf feature vectors;
terms with lower frequency would not generally be useful in clustering.
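A minimal sketch of FR1 with sklearn's TfidfVectorizer, assuming the 1000-term cap is applied as a corpus-level max_features limit (the toy corpus here is far smaller than that cap).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the nbn rollout is slow",
    "parliament debated the nbn today",
    "climate policy debate in parliament",
]

# FR1 sketch: tf-idf matrix limited to the highest-frequency terms.
# max_features keeps the top terms by corpus frequency; the study
# capped the vocabulary at 1000 terms.
vectoriser = TfidfVectorizer(max_features=1000)
fr1 = vectoriser.fit_transform(docs)
print(fr1.shape)  # (number of documents, vocabulary size)
```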
A word2vec model is a neural network trained to create a dense vector with fixed dimension for each token in a corpus. While a
pre-trained word2vec model is available for Twitter data (Godin, Vandersmissen, De Neve, & Van de Walle, 2015), we found that it
did not perform well on the Twitter data sets used in this study. One issue was that many tokens in the data were out of the trained
model’s vocabulary, and also the semantic relationships between words may be very different on different data sets. Additionally, a
pre-trained model on a large amount of Reddit data was not available. Furthermore, there are many hyper-parameters in these
models so finding an ideal set of values for different data sets is a useful contribution. For these reasons, we trained our own word
embedding and document embedding models.
The word2vec models used in FR2 and FR3 were trained with the continuous bag of words (CBOW) method (Mikolov et al.,
2013), 100 dimensions, a context window of size 5 and minimum word count of 1. We tested variations of these hyper-parameters,
including context window sizes ranging from 3 to 15, higher dimensions and minimum word counts. We found that the variation in
performance using the three clustering evaluation measures was minimal and the chosen hyper-parameters were optimal. Some of
these results make sense given the short document length of social media text. We concluded that 100 dimensions for word2vec was
sufficient to represent words for short documents. The mean number of tokens per tweet was 9, and the 75th percentile was 11, so a
context window of size 5 captured all the tokens of most tweets. However, we did find significant variation in the number of training
epochs used for the three data sets. We report on this analysis in Section 4.1. For all other hyper-parameters, we have used default
values provided by the gensim 3.4.0 python package (Řehůřek & Sojka, 2010).
FR2 was constructed by taking the element-wise mean of the word vectors for each token in each document, returning a dense
feature vector of 100 dimensions. FR3 was constructed by taking the tf-idf weighted mean of the word vectors for each word of a
document. The tf-idf matrix used was the top 1000 term matrix by frequency constructed in FR1. This process excluded any word
vectors that were not in the top 1000 tf-idf terms, although again this was tried with larger numbers of top terms for which the
evaluation measures used were found to decrease. We discuss the evaluation measures used in Section 3.5.
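FR2 and FR3 can be sketched with numpy as follows; the 4-dimensional word vectors and the tf-idf weights are hypothetical stand-ins for the trained 100-dimensional embeddings and the FR1 matrix.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors standing in for the trained
# 100-dimensional word2vec embeddings.
wv = {
    "nbn":     np.array([1.0, 0.0, 0.0, 0.0]),
    "rollout": np.array([0.0, 1.0, 0.0, 0.0]),
    "slow":    np.array([0.0, 0.0, 1.0, 0.0]),
}
# Hypothetical tf-idf weights for the tokens of one document.
tfidf = {"nbn": 0.2, "rollout": 0.5, "slow": 0.3}

doc = ["nbn", "rollout", "slow"]

# FR2: unweighted element-wise mean of the word vectors.
fr2 = np.mean([wv[t] for t in doc if t in wv], axis=0)

# FR3: tf-idf weighted mean; tokens outside the tf-idf vocabulary
# (the top 1000 terms in the study) are skipped.
weights = np.array([tfidf[t] for t in doc if t in tfidf])
vectors = np.array([wv[t] for t in doc if t in tfidf])
fr3 = (weights[:, None] * vectors).sum(axis=0) / weights.sum()

print(fr2)  # ≈ [0.33, 0.33, 0.33, 0.0]
print(fr3)  # ≈ [0.2, 0.5, 0.3, 0.0]
```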
A doc2vec model is a neural network trained to create a dense vector with fixed dimension for each document in a corpus. The
doc2vec models in FR4 were trained with 100 dimensions using the distributed bag of words method (dbow), a context window of
size 5 and a minimum word count of 1. The distributed bag of words method was used since it can train both word vectors and
document vectors in the same embedding space (Le & Mikolov, 2014), which was useful for interpreting the document embedding. As
with the word2vec model, we tested variations of the hyper-parameters and found that the evaluation measures varied significantly
for the number of training epochs, and different data sets had different optimal epochs. This is similar to the results of Lau and
Baldwin (2016) where a dbow doc2vec model trained on 4.3 million words had an optimal number of epochs of 20, while the optimal
number was 400 for a data set of size 0.5 million words. Lau and Baldwin (2016) also found that the optimal number of dimensions
was 300 and window size was 15. The lower optimal values for our method are likely due to the short document lengths of OSN data,
as well as the lower word count of our data sets, especially the Twitter data.
3.4. Clustering methods
For the clustering methods, we have selected four techniques commonly used in the literature (Klinczak & Kaestner, 2016; Naik
et al., 2015) which also gave comparable results on our data sets. Firstly, we applied a k-means clustering algorithm (CM1) using the
Euclidean metric and a maximum of 100 iterations. The algorithm was run multiple times over the data with varying random seeds.
CM2 refers to the k-medoids algorithm. For this we used the pyclustering 0.8.2 python package with starting centroids sampled
according to a uniform distribution. Both k-means and k-medoids clustering were used in Klinczak and Kaestner (2016). For CM3 we
applied an hierarchical agglomerative clustering algorithm with the Euclidean metric and Ward linkage. Hierarchical agglomerative
clustering was used in Ferrara et al. (2013) to cluster a similarity matrix. For CM4 we used a Non-negative Matrix Factorisation
(NMF) algorithm, for which we used the default parameters in the sklearn 0.19.1 package. NMF has seen multiple applications for
topic modelling in OSN data (Godfrey et al., 2014; Klein et al., 2018). For the clustering methods and the LDA model, we set the
number of clusters or components to be equal to the number of unique labels in the evaluation data. In line with Klinczak and
Kaestner (2016), we tested the DBSCAN clustering algorithm with a range of hyper-parameters but found that it delivered poor
performance for all feature representations. The documents would either be grouped into an outlier cluster, or a large number of very
small clusters. A possible reason for this is that the feature representations are high dimensional and sparse, so may not cluster well
using density based approaches.
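A minimal sketch of three of the clustering methods with sklearn, run on a random stand-in feature matrix; CM2 (k-medoids via pyclustering) is omitted here, and the parameter values mirror those stated above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
features = rng.rand(60, 10)  # stand-in for a document feature matrix
n_clusters = 5               # set to the number of unique ground truth labels

# CM1: k-means with the Euclidean metric, capped at 100 iterations.
cm1 = KMeans(n_clusters=n_clusters, max_iter=100, n_init=10,
             random_state=0).fit_predict(features)

# CM3: hierarchical agglomerative clustering, Euclidean metric, Ward linkage.
cm3 = AgglomerativeClustering(n_clusters=n_clusters,
                              linkage="ward").fit_predict(features)

# CM4: NMF on a non-negative matrix (e.g. tf-idf); each document is
# assigned the component with the largest weight.
cm4 = NMF(n_components=n_clusters, random_state=0,
          max_iter=500).fit_transform(features).argmax(axis=1)

print(cm1[:10], cm3[:10], cm4[:10])
```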
The LDA topic model was trained with 10 passes, chunk size of 10,000 and updated every record. We again used the default
values for other hyper-parameters in the gensim 3.4.0 package. We included this method since it is commonly used in document
clustering and topic modelling. To assign a topic label to each document, we chose the topic with the highest probability.
3.5. Evaluation measures
Measures used for evaluating document clustering methods typically fall into two categories, intrinsic and extrinsic measures.
Intrinsic measures, such as measures of cluster separation and cohesion, do not require a ground truth label. Such measures describe
the variation within clusters and between clusters. However, they are dependent on the feature representations used, so do not give
comparable results for methods which use different feature sets. Extrinsic measures require a ground truth label, but can be compared
across methods. Common extrinsic measures include precision, recall and F1 (Naik et al., 2015), but these are dependent on the
ordering of cluster labels to ground truth labels which is a problem with a large number of labels. Measures such as the mutual
information and Rand index are more appropriate in this case as they are independent of the absolute values of the labels.
Mutual information is a measure of the mutual dependence between two discrete random variables. It quantifies the reduction in
uncertainty about one discrete random variable given knowledge of another. High mutual information indicates a large reduction in
uncertainty. For two discrete random variables X and Y with joint probability distribution p(x, y), the mutual information, MI(X, Y), is
given by
$$\mathrm{MI}(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$
A commonly used measure is the normalised mutual information (NMI), which normalises the MI to take values between 0 and 1
with 0 representing no mutual information and 1 being agreement. This is useful to compare results across methods and studies. NMI
is given as follows.
$$\mathrm{NMI}(X, Y) = \frac{\mathrm{MI}(X, Y)}{\sqrt{H(X)\,H(Y)}},$$

where H(X) and H(Y) denote the marginal entropies, given by

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i).$$
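The MI and NMI can be computed directly from two label sequences with the standard library; this sketch uses the geometric mean of the entropies as the normaliser, which is one of several conventions in use.

```python
from collections import Counter
from math import log, sqrt

def mutual_information(x, y):
    """MI between two equal-length label sequences, in nats."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def entropy(x):
    n = len(x)
    return -sum((c / n) * log(c / n) for c in Counter(x).values())

def nmi(x, y):
    """NMI with the geometric mean of the entropies as normaliser."""
    h = sqrt(entropy(x) * entropy(y))
    return mutual_information(x, y) / h if h > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # ≈ 1.0: identical partitions
```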
The Rand index is a pair-counting measure of similarity between the cluster assignment and the ground truth labels. It also takes values between 0 and 1, with 1 representing identical partitions. Given a set of elements S = {o1, …, on} and two partitions of S to compare, X = {X1, …, Xr} and Y = {Y1, …, Ys}, the Rand index is the proportion of observation pairs on which the partitions X and Y agree. Mathematically the Rand index, RI, is given by
$$\mathrm{RI}(X, Y) = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}},$$
where a represents the number of pairs of elements in S that are in the same subset in X and the same subset in Y, and b represents the
number of pairs of elements in S that are in different subsets of X and different subsets of Y. Values a and b together give the number
of times the partitions are in agreement. The value c represents the number of pairs of elements in S that are in the same subset of X
and different subsets of Y, and d gives the number of pairs of elements in S that are in different subsets of X and the same subset of Y.
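A direct pair-counting implementation of the Rand index, following the definitions of a, b, c and d above:

```python
from itertools import combinations

def rand_index(x, y):
    """Pair-counting Rand index between two label sequences."""
    a = b = pairs = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        pairs += 1
        same_x, same_y = xi == xj, yi == yj
        if same_x and same_y:
            a += 1               # together in both partitions
        elif not same_x and not same_y:
            b += 1               # separated in both partitions
    return (a + b) / pairs       # pairs = C(n, 2) = a + b + c + d

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical partitions
```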
For extrinsic clustering evaluation measures to be useful for comparison across methods and studies, such measures need a fixed
bound and a constant baseline value. Both the NMI and the RI are scaled to have values between 0 and 1, so satisfy the first condition.
However, it has been shown that both measures increase monotonically with the number of labels, even with an arbitrary cluster
assignment (Vinh, Epps, & Bailey, 2010). This is because both the mutual information and Rand index do not have a constant
baseline, implying that these measures are not comparable across clustering methods with different numbers of clusters. To account
for this, adjusted versions of the MI and RI have been proposed. The adjusted Rand index, ARI, adjusts the RI by its expected value:

$$\mathrm{ARI}(X, Y) = \frac{\mathrm{RI}(X, Y) - E\{\mathrm{RI}(X, Y)\}}{\max\{\mathrm{RI}(X, Y)\} - E\{\mathrm{RI}(X, Y)\}},$$
where E{RI(X, Y)} denotes the expected value of RI(X, Y). The ARI has a maximum value of 1, representing identical partitions, is close to zero for a random assignment, and is adjusted for the number of partitions in X and Y. In a similar way, the adjusted mutual information, AMI, is given by
$$\mathrm{AMI}(X, Y) = \frac{\mathrm{MI}(X, Y) - E\{\mathrm{MI}(X, Y)\}}{\max\{H(X), H(Y)\} - E\{\mathrm{MI}(X, Y)\}},$$
where E{MI(X, Y)} represents the expected value of the MI (Vinh et al., 2010). The AMI likewise has a maximum value of 1, representing identical partitions, is close to zero for a random assignment, and is adjusted for the number of partitions used. The best measures to ensure a comparable
evaluation are then the AMI and the ARI. The next question is around how these two measures compare to each other. By developing
theory regarding generalised information theoretic measures, Romano, Vinh, Bailey, and Verspoor (2016) concluded that the AMI is
the preferable measure when the labels are unbalanced and there are small clusters, while the ARI should be used when the labels
have large and similarly sized volumes.
In this paper, we report the AMI, ARI and the NMI measures. Many previous studies have reported the NMI measure, so for
comparison purposes we include it in our evaluation. Given the data and methods of this study, it is likely that the ARI is more
appropriate than the AMI, as Tables 2 and 4 show that the distribution of documents across labels is relatively balanced. We still
include the AMI since it is interesting to see how much the results may differ from the NMI.
Due to the short and noisy nature of the data sets used in this study, we examined the effect of different random seeds on
performance. We ran each method 20 times with different random seeds, calculated the mean of the NMI, AMI and ARI, and plotted
the distributions of these measures.
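The evaluation protocol, 20 runs with different random seeds followed by mean NMI, AMI and ARI, can be sketched with sklearn's implementations of the three measures; the features and ground truth labels below are random stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

rng = np.random.RandomState(0)
features = rng.rand(60, 10)         # stand-in document features
truth = rng.randint(0, 5, size=60)  # stand-in ground truth labels

scores = {"NMI": [], "AMI": [], "ARI": []}
for seed in range(20):              # 20 runs with different random seeds
    pred = KMeans(n_clusters=5, n_init=10,
                  random_state=seed).fit_predict(features)
    scores["NMI"].append(normalized_mutual_info_score(truth, pred))
    scores["AMI"].append(adjusted_mutual_info_score(truth, pred))
    scores["ARI"].append(adjusted_rand_score(truth, pred))

means = {m: float(np.mean(v)) for m, v in scores.items()}
print(means)
```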
4. Results
In this section we present the results of our analysis. We first describe the results on the optimal number of epochs for the
word2vec and doc2vec embedding representations, applied to all three data sets. We then evaluate the performance of all the
methods. Lastly, we discuss methods for the interpretation of the topics using the doc2vec feature representation.
4.1. Optimal training epochs for embedding models
A key hyper-parameter for training neural network models is the number of epochs. Too many epochs and the model may overfit
to the data, too few and performance may be poor. We first explored the performance change of the mean word2vec models (FR2 and
FR3) and the doc2vec model (FR4) with the number of epochs. These results provide guidance for studies where a ground truth topic
label is not present. We used k-means clustering (CM1) for the clustering method as it gave the best results for the embedding
representations. For each epoch value between 25 and 300, with increments of 25, we trained the models 20 times using different
random seeds and evaluated against the ground truth labels. This was done for all three data sets. Table 6 summarises the optimal
epoch results by method and data set. The plots for this analysis on the #Auspol Twitter data are shown in Fig. 2(a) and on the
RepLab 2013 data in Fig. 2(b). The results for the Reddit data are shown in Fig. 3. To save space we only evaluated the AMI and the ARI measures on the Reddit data, since the AMI typically gives similar results to the NMI but is chance adjusted.
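The epoch-sweep protocol can be sketched as follows; train_and_score is a placeholder for one run of the real pipeline (train the embedding model for the given number of epochs, cluster with k-means, score against the ground truth) and returns dummy values here.

```python
import numpy as np

def train_and_score(epochs, seed):
    """Placeholder for one run of the real pipeline: train the embedding
    model for `epochs` epochs with this seed, cluster with k-means and
    return (AMI, ARI) against the ground truth. Dummy values here."""
    rng = np.random.RandomState(seed * 1000 + epochs)
    return rng.rand(), rng.rand()

results = {}
for epochs in range(25, 301, 25):        # epoch values 25, 50, ..., 300
    runs = np.array([train_and_score(epochs, s) for s in range(20)])
    results[epochs] = runs.mean(axis=0)  # mean AMI and mean ARI
best_epochs = max(results, key=lambda e: results[e][1])
print(best_epochs)
```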
For the #Auspol data in Fig. 2(a), it is clear that doc2vec gave the best results and had a peak in performance at around 75 epochs.
Table 6
Optimal number of training epochs for word2vec and doc2vec methods on the three data sets.

Data set              doc2vec   wtd. word2vec   unwtd. word2vec
Twitter #Auspol           75        250             250
Twitter RepLab 2013      300        200             200
Reddit: 1–100            175         75              50
Reddit: 101–200          150        100             200
Reddit: 201–500          100         50              50
Reddit: 501+              50         25              25
Fig. 2. Plot of the three evaluation measures (vertical axes) by training epoch (horizontal axes) for 20 runs of the word2vec and doc2vec representations on Twitter data using k-means clustering. (a) shows the results on the #Auspol Twitter data and (b) shows the results on the RepLab
2013 Twitter data. 95% confidence bands based on varying random seeds are shown.
Fig. 3. Plots of the AMI and ARI evaluation measures (vertical axes) by training epoch (horizontal axes) for 20 runs of the word2vec and doc2vec
representations on the Reddit data sets using k-means clustering. Different Reddit data sets by size range are given along the rows. Column (a) shows
the AMI results and (b) shows the ARI results. 95% confidence bands based on varying random seeds are shown.
The word2vec methods generally delivered better performance with more epochs, with a maximum value around 250. The tf-idf
weighted mean word2vec method performed better than the unweighted mean word2vec method, and its performance increased
more smoothly than the unweighted method. There was also not much variation over seeds as the 95% confidence bands are narrow.
On the RepLab 2013 data in Fig. 2(b) the results were quite different. The unweighted mean word2vec method gave the best
performance on the NMI and AMI measures. However, on the ARI measure both word2vec methods suffered drops in performance
after 100 epochs while the doc2vec method improved. This could be caused by some over-fitting of the word2vec models on the data,
which is likely since the RepLab 2013 data was much smaller than the #Auspol data. The ARI measure is also the preferred measure
where the labels have large volumes and are balanced (Romano et al., 2016). This data set was relatively balanced (given in Table 3),
so the ARI is a more appropriate performance measure than the NMI and AMI. Overall on the RepLab 2013 data, the optimal
number of epochs for the word2vec methods was 200, while the doc2vec method had an optimal value of 300. The higher number of
optimal epochs for the doc2vec method is not surprising given that it is also training document vectors, so has more parameters than
word2vec.
Turning to the results on the four Reddit data sets in Fig. 3, the doc2vec method again gave the best performance. In addition,
there is an evident pattern with doc2vec where shorter documents required more training epochs to reach optimal performance. For
documents with less than 100 characters, the performance of doc2vec with k-means clustering improved up to around 250 epochs.
This dropped to 150 epochs for documents with 101–200 characters, then to 100 and 50 for the larger document length data sets in increasing order. This observed pattern aligns with the results of Lau and Baldwin (2016), confirming that doc2vec models require fewer training epochs on longer documents.
For the word2vec methods, the tf-idf weighted mean word vector method gave better performance than the unweighted mean
method. This aligns with results in previous studies (Billah Nagoudi et al., 2017). On the shortest document range, both methods showed little performance improvement with more training, followed by a drop in both measures at 75 epochs for the weighted word2vec method and 50 for the unweighted method. One possible explanation for this drop is that averaging word vectors may only
make sense above a threshold of words. For this size range, the average number of words per document is 4.5, which might be too
low. On the 101 to 200 character length documents, the weighted word2vec method gave better performance but also required fewer
training epochs. These results also look similar to the results on the Twitter data sets, which typically have a similar character length
range. On the largest documents, both methods required 25 or less epochs to reach optimal performance.
Through this analysis, it is clear that the doc2vec method consistently gave improved performance over averaged word2vec
methods, except in the case where the data set had a low number of documents. Furthermore, the number of training epochs for
doc2vec in general was inversely proportional to the document size, with more epochs required to reach optimal performance on
smaller document sizes. Doc2vec also required more training epochs than word2vec, in general. However, these relations were not
observed for the #Auspol Twitter data where the doc2vec optimal epoch number was 75, below the word2vec optimal epochs of 200.
The optimal number of doc2vec epochs on the RepLab 2013 data was much higher at 300. An explanation might be that while the
doc2vec model improved on its internal loss function with more training epochs on the #Auspol data, these improvements did not
lead to better performance on the clustering task. This is likely because of the hashtag labels used, which may have some overlapping
contributing terms. For the word2vec methods, in general weighting by tf-idf scores gave a performance lift and required fewer
training epochs. However, care should be taken with the number of epochs given the low peak on the shortest Reddit documents.
4.2. Performance evaluation with clustering measures
In this section we provide the mean evaluation measures for the four feature representations with the four clustering methods, and
Table 7
Performance evaluation of the feature representation and clustering methods on #Auspol Twitter data with the Normalised Mutual Information (NMI), Adjusted Mutual Information (AMI), and Adjusted Rand Index (ARI) measures.

Feature representation   Clustering     NMI     AMI     ARI
doc2vec                  Hierarchical   0.165   0.154   0.059
doc2vec                  k-means        0.193   0.191   0.120
doc2vec                  k-medoids      0.107   0.105   0.064
doc2vec                  NMF            0.102   0.100   0.056
wtd word2vec             Hierarchical   0.088   0.079   0.021
wtd word2vec             k-means        0.105   0.102   0.047
wtd word2vec             k-medoids      0.043   0.016   0.001
wtd word2vec             NMF            0.062   0.058   0.030
unwtd word2vec           Hierarchical   0.085   0.076   0.020
unwtd word2vec           k-means        0.094   0.090   0.041
unwtd word2vec           k-medoids      0.043   0.019   0.001
unwtd word2vec           NMF            0.058   0.054   0.025
TF-IDF                   Hierarchical   0.163   0.085   0.013
TF-IDF                   k-means        0.114   0.070   0.014
TF-IDF                   k-medoids      0.079   0.028   0.004
TF-IDF                   NMF            0.132   0.110   0.032
LDA                      LDA            0.043   0.041   0.021
Table 8
Performance evaluation of the feature representation and clustering methods on RepLab 2013 Twitter data with the NMI, AMI and ARI measures.

Feature representation   Clustering     NMI     AMI     ARI
doc2vec                  Hierarchical   0.449   0.437   0.313
doc2vec                  k-means        0.488   0.478   0.379
doc2vec                  k-medoids      0.290   0.278   0.215
doc2vec                  NMF            0.261   0.249   0.152
wtd word2vec             Hierarchical   0.506   0.491   0.330
wtd word2vec             k-means        0.488   0.478   0.352
wtd word2vec             k-medoids      0.421   0.404   0.274
wtd word2vec             NMF            0.401   0.384   0.266
unwtd word2vec           Hierarchical   0.519   0.507   0.347
unwtd word2vec           k-means        0.508   0.499   0.360
unwtd word2vec           k-medoids      0.435   0.414   0.278
unwtd word2vec           NMF            0.425   0.407   0.286
TF-IDF                   Hierarchical   0.466   0.417   0.203
TF-IDF                   k-means        0.450   0.379   0.179
TF-IDF                   k-medoids      0.192   0.075   0.011
TF-IDF                   NMF            0.437   0.427   0.348
LDA                      LDA            0.180   0.169   0.140
the LDA model, for each method with 20 different seeds on each data set. We also include distribution plots to illustrate the variability
in performance.
Table 7 provides the mean for each of the three evaluation measures for each method on the #Auspol Twitter data set. We set the
optimal number of epochs to be 75 for the doc2vec methods and 250 for the word2vec methods. It is clear from this table that the
doc2vec feature representation with k-means clustering outperformed the other methods on all three evaluation measures, particularly on the ARI. Hierarchical clustering gave close scores for NMI and AMI, but much lower ARI. For both doc2vec and word2vec
feature representations, NMF performed poorly. The performance of k-medoids clustering was similar to NMF. For the word2vec
representations, k-means clustering also gave the best performance.
An interesting observation is that some methods had a relatively large drop in score between the NMI and AMI measures,
indicating that the chance adjustment of the AMI is important. The tf-idf representation is the most affected by this. For instance, the
tf-idf matrix with hierarchical clustering gave a high NMI of 0.163, well ahead of the word2vec methods, but an AMI of 0.085.
Comparatively, doc2vec and the word2vec methods had smaller drops. As discussed earlier, the AMI and ARI are more appropriate
evaluation measures than NMI due to their adjustment for chance. On this data set, the ARI is more appropriate as the volume of
tweets per hashtag label are relatively similar. The doc2vec representation with k-means clustering therefore far outperformed the
other methods.
Table 8 shows the mean results for the RepLab 2013 Twitter data set with the doc2vec model trained with 300 epochs and the
word2vec methods trained with 200 epochs. Overall the performance is much higher than in the #Auspol data, which is explained by
the RepLab 2013 data having expertly annotated topics which are more distinct. On the ARI score, the doc2vec method with k-means
clustering performed best but the unweighted word2vec method with hierarchical clustering gave higher performance for the NMI
and AMI measures. One explanation for this is that the small size of this data set is insufficient for the embedding representations to
accurately be trained, so further training does not necessarily lead to higher clustering performance. This is reflected in the sharp
drops evident in Fig. 2(b.i) and (b.ii).
To examine the variability from the mean measurements, we plot the distributions for the feature representation methods with the
best performing clustering algorithm and the LDA topic model. Fig. 4 shows the distributions for the three evaluation measures over
the #Auspol (a) and RepLab 2013 (b) Twitter data sets. In Fig. 4(a), the doc2vec method with k-means clustering was distinctly ahead
of the other methods on all three measures. There was also significant overlap between the results for the two word2vec methods,
indicating that multiple runs are required when scores are close. Note that the tf-idf method with hierarchical clustering does not
appear on the plot because both the representation and the clustering algorithm are deterministic, so every run gave the same result.
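The multiple-runs protocol behind these distributions can be sketched as follows; `make_blobs` stands in for a real document-embedding matrix with topic labels, and the 20-seed loop mirrors the repeated runs behind the density plots (all names and sizes here are illustrative).

```python
# Hypothetical sketch of the multiple-runs protocol: repeat k-means over
# random seeds and report the score distribution. make_blobs stands in
# for a real document-embedding matrix with ground-truth topic labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

features, labels = make_blobs(n_samples=500, centers=13, random_state=0)

scores = []
for seed in range(20):
    km = KMeans(n_clusters=13, n_init=10, random_state=seed).fit(features)
    scores.append(adjusted_mutual_info_score(labels, km.labels_))

# When methods score closely, compare the whole spread, not a single run.
print(f"AMI over 20 seeds: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

Reporting the mean and spread over seeds, rather than a single run, is what makes the overlapping distributions in Fig. 4 visible.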
For the RepLab 2013 data set in Fig. 4(b), the word2vec methods again showed significant overlap, with the doc2vec method
performing in a lower range. Interestingly, the doc2vec method showed two close peaks, most pronounced for the NMI and AMI
measures but also present for ARI. This likely indicates that the doc2vec method optimised to local
minima during training, resulting in poor performance for some of the runs over random seeds. Given that doc2vec far outperformed
the word2vec methods on the #Auspol data but only matched them on RepLab 2013, the word2vec methods handled the smaller
RepLab 2013 data set better than doc2vec. This
may be because there weren’t enough data points in the RepLab 2013 data set to optimally train the doc2vec representation.
Nevertheless, doc2vec still gave the best performance on the ARI measure for both Twitter data sets.
Lastly, we provide results from running the methods over the Reddit data. Fig. 5 shows the NMI (a), AMI (b) and ARI (c) values for
the methods on the Reddit data sets. The horizontal axis compares the results for the document length data partitions. Only the best
performing clustering method is displayed for each feature representation. The mean scores of the evaluation measures for each of the
methods are given in Table 9 for document length ranges 1–100 and 101–200, and in Table 10 for document length ranges 201–500
and 501 or greater. It is clear from these plots and mean results that the doc2vec method delivered the best performance on all four
data sets by size range. This finding corroborates the results from the #Auspol Twitter data set.

Fig. 4. Density plots of the three evaluation measures (horizontal axes) over random seeds for the four feature representations with the best
performing clustering algorithm, with LDA for comparison. (a) shows the results on the #Auspol Twitter data and (b) shows the results on the
RepLab 2013 Twitter data.

The tf-idf weighted mean word2vec
method consistently delivered a performance lift compared to the unweighted mean word2vec method. Interestingly, the tf-idf
methods and the LDA model only gave comparable performance to the word2vec methods on the last size range, with number of
characters greater than 500.
4.3. Topic interpretation
It is clear that the doc2vec model with k-means clustering delivered the best performance on the #Auspol Twitter data set and the
Reddit data sets, as well as the RepLab 2013 Twitter data set based on the ARI measure only. However, the usefulness of a topic
discovery model depends on how interpretable the resulting topics are. In this section we aim to address this question through a
deeper analysis of the resulting clusters from the doc2vec representation with k-means clustering.
Fig. 5. Plot of the three evaluation measures over random seeds for the methods with the best performing clustering method on Reddit data with
varying document lengths in characters. (a) plots the NMI, (b) plots the AMI and (c) plots the ARI.

Table 9
Performance evaluation on the Reddit data for each method for document length ranges 1–100 and 101–200 characters.

Document character length   Feature representation   Clustering     NMI     AMI     ARI
1–100                       doc2vec                  Hierarchical   0.029   0.027   0.017
                                                     k-means        0.034   0.034   0.026
                                                     k-medoids      0.012   0.011   0.004
                                                     NMF            0.030   0.023   0.015
                            wtd word2vec             Hierarchical   0.013   0.012   0.010
                                                     k-means        0.014   0.013   0.011
                                                     k-medoids      0.007   0.003   0.000
                                                     NMF            0.010   0.009   0.000
                            unwtd word2vec           Hierarchical   0.011   0.011   0.011
                                                     k-means        0.012   0.012   0.011
                                                     k-medoids      0.007   0.006   0.000
                                                     NMF            0.012   0.011   0.010
                            TF-IDF                   Hierarchical   0.009   0.003   0.000
                                                     k-means        0.005   0.002   0.000
                                                     k-medoids      0.005   0.001   0.000
                                                     NMF            0.014   0.011   0.012
                            LDA                      –              0.009   0.009   0.003
101–200                     doc2vec                  Hierarchical   0.115   0.111   0.067
                                                     k-means        0.262   0.257   0.262
                                                     k-medoids      0.018   0.006   0.001
                                                     NMF            0.127   0.096   0.027
                            wtd word2vec             Hierarchical   0.112   0.101   0.032
                                                     k-means        0.176   0.174   0.144
                                                     k-medoids      0.036   0.016   0.001
                                                     NMF            0.116   0.100   0.033
                            unwtd word2vec           Hierarchical   0.086   0.079   0.027
                                                     k-means        0.144   0.142   0.114
                                                     k-medoids      0.020   0.013   0.008
                                                     NMF            0.089   0.071   0.015
                            TF-IDF                   Hierarchical   0.009   0.003   0.000
                                                     k-means        0.005   0.004   0.002
                                                     k-medoids      0.008   0.000   0.000
                                                     NMF            0.006   0.005   0.000
                            LDA                      –              0.008   0.007   0.007

We consider firstly the results on the #Auspol data, where we analysed the extent to which the document clusters aligned to the
label hashtags. On the #Auspol data, our ground truth topic labels were the top 13 distinct hashtags, which were removed from the
text prior to feature generation and clustering. These hashtags can therefore be considered as latent tokens. We first identified the top
three topic labels (hashtags) by frequency for each cluster. For comparison, we created a tf-idf matrix from the original data using all
the hashtags, including the topic hashtags, and excluded all other tokens. We then extracted the three hashtags with the highest tf-idf
scores for each cluster and compared them to the top three topic label hashtags. Table 11 outlines the results. The top topic matched
the top hashtag for every cluster. Out of 39 top hashtags across the 13 clusters, only 7 differed from the topic labels (marked in
Table 11). There were also two clusters where only the order differed. We conclude that the doc2vec clustering has accurately captured the
structure of the latent label hashtags.
Another way of looking at the quality of the clustering is to analyse the overlap between ground truth labels and clusters. In the
interest of space, we considered the Reddit data sets which contained only 5 topics and chose the data set with document size
between 101 and 200 characters for consistency with the Twitter data sets. We then analysed the confusion matrix for the doc2vec
features with k-means clustering against the ground truth labels, the subreddit pages. The results are shown in Table 12. It is apparent
that the first cluster grouped most of the parent comments from the subreddit page ‘NFL’ and the second cluster grouped strongly
around ‘pcmasterrace’. These pages clearly represent distinct topics. Clusters 3 and 4 grouped well around ‘news’ and ‘movies’
respectively, but cluster 5 is divided primarily between ‘relationships’ and ‘news’.
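Such a confusion matrix can be produced with a simple cross-tabulation of cluster assignments against page labels; the toy frame below is illustrative, not the paper's data.

```python
# Minimal sketch of the cluster-vs-subreddit cross-tabulation; the rows
# below are toy examples, not the paper's data.
import pandas as pd

assignments = pd.DataFrame({
    "subreddit": ["NFL", "NFL", "news", "movies", "news", "NFL"],
    "cluster":   [1, 1, 3, 4, 3, 2],
})
confusion = pd.crosstab(assignments["subreddit"], assignments["cluster"])
print(confusion)
```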
To further interpret the topics on this Reddit data set, we analysed the top words by cluster. For each cluster we calculated the
centroid as the mean of the doc2vec representations of each document in the cluster. Since the trained doc2vec model produced
document embeddings in the same space as word embeddings, we calculated the cosine similarity between the cluster centroids and
the words. The idea was that words closer to the cluster centroid may be representative of the cluster. However, this
approach does not account for the frequency of words appearing in each cluster, or the relative frequency of the words across the
clusters. To incorporate this information, we pooled all the documents in each cluster and calculated a tf-idf matrix. We then created a
combined score for each word and cluster from the sum of the cosine similarity and the tf-idf score. Table 13 shows the top 10 words
per cluster ordered by this method. It is clear that this method extracts very specific terms related to the main subreddit pages,
particularly for clusters 1 and 2.
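The combined scoring can be sketched as follows; the vectors and tf-idf values are random or hand-set stand-ins for trained doc2vec embeddings and pooled-cluster tf-idf scores.

```python
# Sketch of the combined top-words score: cosine similarity between a
# cluster centroid and each word vector, plus the word's tf-idf score in
# the pooled cluster document. Embeddings here are random stand-ins for
# trained doc2vec document/word vectors.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["flacco", "quarterback", "gpu", "install"]
word_vecs = {w: rng.normal(size=50) for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_words(cluster_doc_vecs, tfidf_scores, k=2):
    """Rank vocabulary words by cosine similarity to the cluster centroid
    plus the word's tf-idf score in the pooled cluster document."""
    centroid = np.mean(cluster_doc_vecs, axis=0)
    combined = {w: cosine(centroid, v) + tfidf_scores.get(w, 0.0)
                for w, v in word_vecs.items()}
    return sorted(combined, key=combined.get, reverse=True)[:k]

doc_vecs = [rng.normal(size=50) for _ in range(5)]  # stand-in doc2vec vectors
tfidf_scores = {"flacco": 2.0, "quarterback": 1.5}  # stand-in pooled tf-idf
print(top_words(doc_vecs, tfidf_scores))
```

Summing the two terms requires that the embeddings and tf-idf scores be on comparable scales; in practice one might normalise or weight the two components before combining them.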
Table 10
Performance evaluation on the Reddit data for each method for document length ranges 201–500 and 501 or greater.

Document character length   Feature representation   Clustering     NMI     AMI     ARI
201–500                     doc2vec                  Hierarchical   0.261   0.254   0.212
                                                     k-means        0.487   0.483   0.496
                                                     k-medoids      0.037   0.010   0.002
                                                     NMF            0.194   0.128   0.044
                            wtd word2vec             Hierarchical   0.265   0.246   0.142
                                                     k-means        0.333   0.331   0.276
                                                     k-medoids      0.174   0.172   0.150
                                                     NMF            0.247   0.226   0.133
                            unwtd word2vec           Hierarchical   0.227   0.200   0.084
                                                     k-means        0.303   0.301   0.247
                                                     k-medoids      0.106   0.103   0.081
                                                     NMF            0.208   0.183   0.092
                            TF-IDF                   Hierarchical   0.103   0.061   0.015
                                                     k-means        0.095   0.085   0.044
                                                     k-medoids      0.014   0.013   0.007
                                                     NMF            0.062   0.057   0.046
                            LDA                      –              0.080   0.079   0.071
501+                        doc2vec                  Hierarchical   0.532   0.518   0.499
                                                     k-means        0.686   0.684   0.708
                                                     k-medoids      0.094   0.037   0.007
                                                     NMF            0.331   0.255   0.154
                            wtd word2vec             Hierarchical   0.465   0.400   0.327
                                                     k-means        0.461   0.433   0.403
                                                     k-medoids      0.353   0.330   0.283
                                                     NMF            0.366   0.325   0.229
                            unwtd word2vec           Hierarchical   0.416   0.367   0.306
                                                     k-means        0.433   0.405   0.385
                                                     k-medoids      0.336   0.322   0.290
                                                     NMF            0.290   0.242   0.159
                            TF-IDF                   Hierarchical   0.304   0.244   0.199
                                                     k-means        0.431   0.382   0.323
                                                     k-medoids      0.042   0.007   0.001
                                                     NMF            0.396   0.344   0.299
                            LDA                      –              0.341   0.326   0.291
Table 11
Top three topic labels and top three hashtags for each cluster. Note that the topic labels did not appear in the clustering data, but were mostly
recovered in order when we created a tf-idf matrix for tweets pooled by cluster and selected the three hashtags with the highest scores.
Differences between the top three topic labels and top three tf-idf hashtags are marked with an asterisk (*).

Cluster   Top three topic labels                 Top three tf-idf score hashtags
1         #nbn, #lnp, #insiders                  #nbn, #lnp, #insiders
2         #uspoli, #insiders, #turnbull          #uspoli, #insiders, #trump*
3         #insiders, #lnp, #qldpol               #insiders, #lnp, #qldpol
4         #qldpol, #insiders, #lnp               #qldpol, #insiders, #lnp
5         #politas, #qldpol, #lnp                #politas, #utas*, #discover*
6         #qldpol, #qanda, #trump                #qldpol, #politas*, #qanda
7         #insiders, #lnp, #qldpol               #insiders, #lnp, #qldpol
8         #qldpol, #stopadani, #springst         #qldpol, #stopadani, #springst
9         #lnp, #trump, #uspoli                  #lnp, #trump, #insiders*
10        #qanda, #insiders, #qldpol             #qanda, #insiders, #sayitwithstickers*
11        #marriageequality, #politas, #lnp      #marriageequality, #equalitycampaign*, #politas
12        #qldpol, #stopadani, #qanda            #qldpol, #qanda, #stopadani
13        #climatechange, #qldpol, #stopadani    #climatechange, #qldpol, #stopadani
Table 12
Confusion matrix for the doc2vec representation with k-means clustering method on Reddit data with size range between 101 and 200 characters.

Subreddit page   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
NFL                   1351          60         298         273         395
pcmasterrace            78        1295         215         185         260
News                    93          89         952         204         538
Movies                  89          50         152         767         226
Relationships           32          37         116          48         557
Table 13
Top 10 words per cluster based on combined embedding similarity score in embedding space and tf-idf score.

Cluster   Top topic       Top 10 words
1         NFL             talent, flacco, quarterback, tds, sb, wrs, roster, dolphins, tackle, foles
2         pcmasterrace    install, ps4, r9, mobo, gpus, i5, os, msi, processor, asus
3         News            federal, manslaughter, district, homicide, economic, isis, china, labor, upper, toke
4         Movies          avengers, joss, horror, arnold, cinematography, rewatch, australian, doof, boobs, mcx
5         Relationships   abusive, mentality, react, rdj, xanax, marriage, heaven, meeting, section, subjective
5. Discussion
Throughout this study it has become clear that, for clustering OSN text data into topics, doc2vec feature representations
combined with k-means clustering generally outperformed the other methods. However, the cases where this
method did not perform as well require discussion. On the RepLab 2013 Twitter data set, the doc2vec method gave performance
below that of the mean word2vec methods for the NMI and AMI measures, but gave the best performance for ARI after 100 epochs of
training. Further to this, the unweighted mean word2vec method performed better than the tf-idf weighted mean word2vec method
on this data. Both of these results are different from the results on the other two data sets. The results on the #Auspol and the Reddit
data with document length between 101 and 200 characters indicate that it is not the size of each document that is the issue on the
RepLab data, but it is most likely that the volume of data used was not sufficient to accurately train the doc2vec model. The
implication is that doc2vec models should be trained on volumes greater than roughly three thousand documents. Interestingly, the Reddit
data with length between 101 and 200 characters only consisted of 8360 documents and doc2vec performed very well, although
Reddit comments may be quite different to tweets in the terms used.
Another interesting observation is that on the #Auspol Twitter data, the tf-idf matrix with NMF gave better performance on the
NMI and AMI measures than the best clustering for both word2vec methods, although a lower score for the ARI. On the RepLab 2013
data, the word2vec methods performed better on NMI and AMI, but the tf-idf method was very close on ARI. However, on the Reddit
data the tf-idf method gave very low performance until the document size was greater than 200 characters. This indicates that topics
from Twitter text may rely heavily on keywords since the tf-idf clustering performs comparatively well, which is not surprising given
the use of user mentions and hashtags. The doc2vec method represented this information more effectively on the #Auspol data than
the other feature methods. Assigning a heavier weighting for hashtags and user mentions for the doc2vec model might give improved
performance on Twitter data.
Two useful results stand out from this study based on the Reddit data. The first was that the optimal number of training epochs for
doc2vec is inversely proportional to the average length of the documents. This result provides some guidance for future studies using
OSN data. Unfortunately this result was not consistent with the results on the #Auspol data, which may be due to the topic labels
themselves not being clearly distinct. There is an ongoing challenge with using Twitter data as manually labelling topics is time
consuming and prone to error, and the number of retrievable tweets diminishes over time. The result is consistent with the RepLab
2013 Twitter data, but as discussed already the data volume was small. The second result is that the performance of the doc2vec
method increased with the length of the documents. The method gave high performance for the longest Reddit comments, so should
give good results applied to text data from OSN platforms in general.
Improving embedding representations of OSN documents can be useful for several natural language processing tasks. Such representations at the document level can provide high quality feature matrices to be used by other machine learning systems. An
example application is for sentiment analysis (Lee, Jin, & Kim, 2016). In addition, it has been shown previously that pre-training the
word vectors used by doc2vec can provide a performance lift in several natural language processing tasks (Lau & Baldwin, 2016). Pre-training both word vectors and document vectors on large volumes of OSN data could then provide a performance lift in applications
focused on specific samples of data. For instance, pre-trained document vectors could be used in streaming document classification or
clustering applications. In addition, such methods could be applied in other domains where data can be modelled as documents with
a small number of tokens. For example, embedding models are seeing applications on electronic health record data (Choi et al.,
2016). In this instance, medical codes are treated as tokens and embedding models can then be used to capture information about
relationships between diseases and treatments, and be used in subsequent prediction or clustering tasks.
6. Conclusion and future work
In this study we compared the performance of several document clustering and topic modelling methods on social media
text data. Our results have demonstrated that document and word embedding representations of online social network data may be
used effectively as a basis for document clustering. These methods outperformed traditional tf-idf based approaches and topic
modelling techniques. Furthermore, doc2vec and tf-idf weighted mean word embedding representations delivered better results than
simple averages of word embedding vectors in document clustering tasks. We also demonstrated that k-means clustering provided the
best performance with doc2vec embeddings.
Through applying these methods over the Reddit data set split by document length ranges, we outlined two key results for
clustering doc2vec embeddings. Firstly, the optimal number of training epochs is in general inversely proportional to the character
length range of the documents. Secondly, doc2vec embeddings with k-means clustering provide good performance over all the
document length ranges in the Reddit data used. These results indicate that this method should perform well on most OSN text data.
To interpret the resulting clusters from these methods, we developed a top term analysis based on combining tf-idf scores and
word vector similarities. We demonstrated that this method can provide a representative set of keywords for a topic cluster. We also
showed that the doc2vec embedding with k-means clustering may successfully recover latent hashtag structure in Twitter data.
We plan several extensions to this work. Firstly, the doc2vec embeddings combined with k-means clustering can be applied
readily to any social media text data. In further applications we intend to demonstrate the usefulness of this method in defining and
interpreting dynamic topics in a streaming fashion. Secondly, this method may be extended to incorporate additional data available
in social networks, and specifically from Twitter user and network data. Thirdly, recent developments in the applications of neural
embedding and deep learning techniques, such as contextualised embedding models (Peters et al., 2018), Latent LSTM Allocation
(Zaheer, Ahmed, & Smola, 2017) and deep learning based clustering models (Min et al., 2018) may be applied to deliver improved
feature representations or document clusterings. Word and document embeddings may also be used as pre-trained initial layers in
deep clustering and topic modelling techniques.
Acknowledgements and declarations
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Supplementary material
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.ipm.2019.04.002.
References
Alghamdi, R., & Alfalqi, K. (2015). A survey of topic modeling in text mining. International Journal of Advanced Computer Science and Applications, 3(7), 774–777.
Alnajran, N., Crockett, K., McLean, D., & Latham, A. (2017). Cluster analysis of twitter data: A review of algorithms. Proceedings of the 9th international conference on agents and artificial intelligence – volume 2: ICAART, INSTICC. SciTePress, 239–249.
Amigó, E., Carrillo de Albornoz, J., Chugur, I., Corujo, A., Gonzalo, J., Martín, T., et al. (2013). Overview of RepLab 2013: Evaluating online reputation monitoring systems. Proceedings of the fourth international conference of the CLEF initiative, 333–352.
Bakshy, E., Rosenn, I., Marlow, C., & Adamic, L. (2012). The role of social networks in information diffusion. Proceedings of the 21st international conference on world wide web, WWW ’12. New York, NY, USA: ACM, 519–528.
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Billah Nagoudi, E. M., Ferrero, J., & Schwab, D. (2017). LIM-LIG at SemEval-2017 Task 1: Enhancing the semantic similarity for arabic sentences with vectors weighting. Proceedings of the 11th international workshop on semantic evaluations (SemEval-2017), Vancouver, Canada, 125–129.
Bisht, S., & Paul, A. (2013). Document clustering: A review. International Journal of Computer Applications, 73, 26–33.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chinnov, A., Kerschke, P., Meske, C., Stieglitz, S., & Trautmann, H. (2015). An overview of topic discovery in twitter communication through social media analytics. Proceedings of the Americas conference on information systems, 1–10.
Choi, E., Bahadori, M. T., Searles, E., Coffey, C., Thompson, M., Bost, J., et al. (2016). Multi-layer representation learning for medical concepts. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. New York, NY, USA: ACM, 1495–1504.
Corrêa Júnior, E. A., Marinho, V. Q., & dos Santos, L. B. (2017). NILC-USP at SemEval-2017 task 4: A multi-view ensemble for twitter sentiment analysis. Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017). Association for Computational Linguistics, 611–615.
Curiskis, S., Drake, B., Osborn, T., & Kennedy, P. (submitted). Topic labelled online social network data sets from twitter and reddit. Data in Brief.
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., & Cohen, W. (2016). Tweet2vec: Character-based distributed representations for social media. Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers). Association for Computational Linguistics, 269–274.
Fang, Y., Zhang, H., Ye, Y., & Li, X. (2014). Detecting hot topics from twitter: A multiview approach. Journal of Information Science, 40(5), 578–593.
Ferrara, E., JafariAsbagh, M., Varol, O., Qazvinian, V., Menczer, F., & Flammini, A. (2013). Clustering memes in social media. Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining. New York, NY, USA: ACM, 548–555.
Godfrey, D., Johns, C., Meyer, C. D., Race, S., & Sadek, C. (2014). A case study in text mining: Interpreting twitter data from world cup tweets. CoRR, abs/1408.5427, 1–11.
Godin, F., Vandersmissen, B., De Neve, W., & Van de Walle, R. (2015). Multimedia lab @ ACL WNUT NER shared task: Named entity recognition for twitter microposts using distributed word representations. Proceedings of the workshop on noisy user-generated text. Association for Computational Linguistics, 146–153.
Guille, A., Hacid, H., Favre, C., & Zighed, D. (2013). Information diffusion in online social networks: A survey. ACM SIGMOD Record, 42, 17–28.
Gutman, J., & Nam, R. (2015). Text classification of reddit posts. Technical Report, New York University.
Ha, T., Beijnon, B., Kim, S., Lee, S., & Kim, J. H. (2017). Examining user perceptions of smartwatch through dynamic topic modeling. Telematics and Informatics, 34(7), 1262–1273.
Hong, L., & Davison, B. D. (2010). Empirical study of topic modeling in twitter. Proceedings of the first workshop on social media analytics. New York, NY, USA: ACM, 80–88.
Irfan, R., King, C. K., Grages, D., Ewen, S., Khan, S. U., Madani, S. A., et al. (2015). A survey on text mining in social networks. The Knowledge Engineering Review, 30(2), 157–170.
JafariAsbagh, M., Ferrara, E., Varol, O., Menczer, F., & Flammini, A. (2014). Clustering memes in social media streams. Social Network Analysis and Mining, 4(1), 237.
Klein, C., Clutton, P., & Polito, V. (2018). Topic modeling reveals distinct interests within an online conspiracy forum. Frontiers in Psychology, 9, 1–12.
Klinczak, M., & Kaestner, C. (2016). Comparison of clustering algorithms for the identification of topics on twitter. Latin American Journal of Computing - LAJC, 3, 19–26.
Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. Proceedings of the 1st workshop on representation learning for NLP. Association for Computational Linguistics, 78–86.
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. Proceedings of the 31st international conference on machine learning, ICML 2014, Beijing, China, 1188–1196.
Lee, S., Jin, X., & Kim, W. (2016). Sentiment classification for unlabeled dataset using doc2vec with JST. Proceedings of the 18th annual international conference on electronic commerce: E-commerce in smart connected world. New York, NY, USA: ACM, 28:1–28:5.
Li, Q., Shah, S., Liu, X., & Nourbakhsh, A. (2017). Data sets: Word embeddings learned from tweets and general data. Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, 428–436.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Min, E., Guo, X., Liu, Q., Zhang, G., Cui, J., & Long, J. (2018). A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, 6, 39501–39514.
Naik, M. P., Prajapati, H. B., & Dabhi, V. K. (2015). A survey on semantic document clustering. 2015 IEEE international conference on electrical, computer and communication technologies (ICECCT), 1–10.
Patki, U., & Khot, D. P. (2017). A literature review on text document clustering algorithms used in text mining. Journal of Engineering Computers and Applied Sciences, 6(10), 16–20.
Paul, M. J., & Dredze, M. (2014). Discovering health topics in social media using topic models. PLoS One, 9(8), 1–11.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers). Association for Computational Linguistics, 2227–2237.
Reddit (2015). r/datasets - I have every publicly available reddit comment for research. 1.7 billion comments at 250 GB compressed. Any interest in this? https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment (Accessed 19 January 2019).
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. Valletta, Malta: ELRA, 45–50.
Romano, S., Vinh, N. X., Bailey, J., & Verspoor, K. (2016). Adjusting for chance clustering comparison measures. Journal of Machine Learning Research, 17(1), 4635–4666.
Shabunina, E., & Pasi, G. (2018). A graph-based approach to ememes identification and tracking in social media streams. Knowledge-Based Systems, 139, 108–118.
Steinskog, A., Therkelsen, J., & Gambäck, B. (2017). Twitter topic modeling by tweet aggregation. Proceedings of the 21st nordic conference on computational linguistics. Association for Computational Linguistics, 77–86.
Stieglitz, S., Mirbabaie, M., Ross, B., & Neuberger, C. (2018). Social media analytics challenges in topic discovery, data collection, and data preparation. International Journal of Information Management, 39, 156–168.
Suri, P., & Roy, N. R. (2017). Comparison between LDA & NMF for event-detection from large text stream data. 2017 3rd international conference on computational intelligence & communication technology (CICT), 1–5.
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.
Yang, X., Macdonald, C., & Ounis, I. (2017). Using word embeddings in twitter election classification. Information Retrieval, 21(2–3), 183–207.
Zaheer, M., Ahmed, A., & Smola, A. J. (2017). Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. Proceedings of the 34th international conference on machine learning. Sydney, Australia: PMLR, 3967–3976.
Zhao, J., Lan, M., & Tian, J. F. (2015). Using traditional similarity measurements and word embedding for semantic textual similarity estimation. 9th international workshop on semantic evaluation (SemEval 2015), 117.