Artificial and Natural Topic Detection in Online Social
Networks
Sylvio Barbon Jr1 , Guilherme Sakaji Kido1 , Gabriel Marques Tavares1
1
Computer Science Department – State Univerisity of Londrina (UEL)
86057-970 – Paraná – PR – Brazil
barbon@uel.br, guilhermekido@gmail.com, gabrielmrqstvrs@gmail.com
Abstract. Online Social Networks (OSNs), such as Twitter, offer attractive
means of social interactions and communications, but also raise privacy and
security issues. The OSNs provide valuable information to marketing and competitiveness based on users posts and opinions stored inside a huge volume of
data from several themes, topics and subjects. In order to mining the topics discussed on an OSN we present a novel application of Louvain method for Topic
Modeling based on communities detection in graphs by modularity. The proposed approach succeeded in finding topics in five different datasets composed
of textual content from Twitter and Youtube. Another important contribution
achieved was about the presence of texts posted by spammers. In this case,
a particular behavior observed by graph community architecture (density and
degree) allows the indication of a topic strength and the classification of it as
natural or artificial. The later created by the spammers on OSNs.
1. Introduction
Currently, people have instant access to massive amounts of data. Learning and discovering becomes easier when the data format is more logical and structured. Unfortunately,
90% of the available information are in unstructured forms [Verma et al. 2015]. Extracting knowledge in large amounts of unstructured data is difficult and problematic. For
companies, being able to do so might improve marketing strategy and business planning
to attract more consumers.
Recent studies indicate that 80% of companies’ information are text documents
[Akilan 2015]. The production of digital data is increasing in volume because the accessibility of this media has become something easy, fast and useful. People write articles on websites, forums, social networks, blogs and e-mails. These sources of information are a rich base of knowledge for organizations such as banks, universities, government, and marketing. The interests, concerns and criticism from users are stored in
databases and can improve the products and services of organizations [Chen et al. 2013]
[Chitra and Subashini 2013] [Choi et al. 2014]; in politics, this data can adjust political
placements in respect of sentiment analysis of your target audience [Tan et al. 2014].
Online Social Networks (OSNs) are services that have been emerging as a new
communication between individuals and organizations [Li et al. 2014]. These services
provide an essential platform for users to share thoughts, ideas, status, and experiences [Zappavigna 2011]. In this sense, the OSNs offer attractive means of online
social interactions and communications, but also raise privacy and security concerns
[Zhang et al. 2010]. Due to the large number of texts, the OSNs have been extremely
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
valuable to marketing companies and public organizations to find opinions about particular topics [Igawa et al. 2015]. The standard methods used for Text Mining are usually
applied to traditional texts in the Web, such as articles, news, and reports [Tsai 2011].
Micro-blog texts are composed of more casual and informal language than traditional texts but are a source of similar relevant information in comparison to other textual
sources. Due to the number of characters limit, users publish in a simplified way, using the colloquial language, abbreviations, slangs and generally links, emoticons, photos,
videos, and others [Huang et al. 2014a]. Colloquial and informal language usually creates
specifics words and terms that are considered noise. Noise is an undesirable textual feature
and specific measures are necessary in order to eliminate or reduce it. One example is the
Adaptive Distribution of Vocabulary Frequencies (ADVF) [Igawa et al. 2014], capable of
highlighting terms detected as textual noise. Roth et al. [Roth et al. 2013] presented a
survey of noise reduction methods for textual databases. In their work, three different denoising principles are highlighted: At-least-one, Topic-based models and, Pattern Correlations. The advantages of this kind of approach to Sentiment Analysis considering statistical models are exposed, such as Pattern Correlations in [Roth et al. 2013]. More details
concerning statistical models employed in noise analysis are given in [Aggarwal 2015],
where the text sampling from a Zipf distribution is discussed. This strategy is the kernel
of [Igawa et al. 2014]; the technique applied in our propose.
The recent concern about textual noise to obtain knowledge from OSNs is getting more attention due to spamming activities. One of the problems with knowledge
discovery from OSNs when compared to traditional texts is the presence of spamming
activities. These activities are practiced by fake accounts or compromised accounts.
The first one, also called bot account, is an account used for spreading malicious contents only [Igawa et al. 2016][Barbon et al. 2016]. A compromised account is a legitimate account which has been taken over by an attacker to publish fake or harmful content
[Igawa et al. 2015]. Directly or indirectly, both aims to spread content that need to be
avoided in knowledge discovery because they do not contain real information. In OSNs,
a bot repeatedly publishes texts with a subject that can compromise the topic trends. The
presence of artificial content made by a spammer could lead to bias the result precision
and application’s goals.
The malicious behavior was discussed by several authors. Jin et al. in
[Jin et al. 2013] exposed several security and privacy threats. The authors highlighted the
Sybil attack which means the register of multiple accounts maliciously. In this sense, a
Sybil attack could create a biased topic due to undue influence. Sybil attack was compared
with other malicious activities (information leak, de-anonymizing, phishing, malware and
spamming) in OSNs by Gao et al. [Gao et al. 2011]. Their results reveal an association
within attack difficulty, defense effectiveness and threat level to the users.
Diverse works propose solutions to mitigate the malicious activities, textual-based
ones being a significant part of them. For example, the spam classification engine extracts
the text from the post and runs it through a support vector machine (SVM) classifier,
which assigns a score to the text [Abu-Nimeh et al. 2011]. More recently, Vanetti et al.
[Vanetti et al. 2013] proposed a content-based filtering (textual) of unwanted content by
the use of Machine Learning. More precisely, the use of Artificial Neural Networks built
with a radial function (RBFN) was applied.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Independently of textual-based application, an important task is to identify the
textual content, which can be determined by one or more topics/keywords that describe
the main subject. There are several methods to find those keywords an Topic Modeling
is the main area of this activity. For Zeng et al. [Zeng et al. 2012], topic is considered an
aggregate of words and their frequency, which can be extracted from a document and is
an important unit of the Topic Modeling Process.
Topic Modeling, as Latent Semantic Analysis (LSA) [Landauer et al. 1998] and
Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 2001], is popular in traditional
text documents that need a vast amount of data, i.e., thousands of documents with thousands of words to generate coherent topics [Chen and Liu 2014]. Many micro-blogs have
limited number of characters (e.g., Twitter 140 character limit) per post, but present a high
frequency of submissions that implies in a huge amount of data. This scenario would be
the expected for Topic Modeling on OSNs. However, the presence of noise and malicious
activities increases the difficulty of extracting topics on blogs [Li et al. 2014] and OSNs.
About traditional techniques of Topic Modeling, Huang et al.
[Huang et al. 2014b] evaluated methodologies based on Vector Space Model and
LSA. First, the tool developed by the authors performed the pre-processes data, eliminating stopwords and extracting scores. The TF-IDF, a weighting algorithm, was
applied to each term resulted from pre-processing obtaining a value which measures the
importance of a term concerning all dataset. This importance was based on frequency
distribution of the terms and clustered by K-means. The dataset used in Huang et
al. [Huang et al. 2014b] was composed of Sina Weibo’s texts, a Chinese micro-blog.
Compared to the LSA, the work’s conclusion determined that the methodology presented
a better performance on indexing topics. However, the proposed approach does not
handle noisy terms and malicious content.
In Tsai’s work [Tsai 2011], the author analyze the Author-Topic method (AT) (a
LDA extension) and compare which topics were similar to others, using the Isometric
feature mapping (Isomap). The AT was applied on the Nielson Buzz-Metrics’s dataset, a
blog about security threats and incident reports of cyber crime and computer virus. The
author has succeeded in Topic Modeling, but the methodology of noise detection was
based on manual labeling by users. This strategy can be non-trivial on a huge dataset and
was effective just in a small dataset with few topics.
In this work we focus on the possible strategy of topics as community
centroids in a graph of terms. In this way, finding communities in a textual
data is concerned as a data clustering problem. Several techniques to investigate the community structure of networks have been proposed in literature during
last years [De Meo et al. 2011, Bhowmick and Srinivasan 2013, Campigotto et al. 2014,
Traag 2015]. In [De Meo et al. 2011], the authors proposed a feasible solution considering a large network with a low computational cost by a Generalized Louvain method. In
the same way, Traag in [Traag 2015] shows some improvements to the original Louvain
algorithm able to reach a logarithmic runtime complexity in a clear community structure. On the other hand, in [De Meo et al. 2011] is presented a solution able to carry
out over synthetic and real-world (noisy), showing efficiency and robustness. The urge
for techniques capable of handling large datasets based on accurate and fast solutions
was highlighted by [Bhowmick and Srinivasan 2013]. These authors proposed a sharedBARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
memory parallel algorithm and exposed the advantages of a scalable solution in Louvain
algorithm.
This present work aims to a new approach for Topic Modeling based on a feasible solution to OSNs. Our solution is capable of handling problems of noise and spam,
discovering the topics in an OSNs scenario highlighting the natural or artificial ones. The
kernel of proposed approach was based on Louvain method and the concept of modularity
that provides a quality measure of the communities in a graph [Tang et al. 2012]. In other
words, we interpreted a community in a graph such as a group of terms around a topic
and based on the modularity it is possible to detect the existence of different topics in the
same dataset. To detect the noisy terms we applied ADVF [Igawa et al. 2014], and to treat
the causes of artificial topics, like spam, we suggest an architecture analysis of the graph,
mainly the density value.
The dataset used in the experiments was Twitter1 , considered one of the largest existing micro-blogging services today; and the Youtube2 , one of the biggest video-sharing
website. Their contents were extracted, filtered, processed and visualized in graphs in
order to form word’s networks. These graphs were analysed based on its structure and
the topics found were classified in natural or artificial . The first one means topics that
have some semantic context with the base’s theme, and the other means topics created by
spammer activity.
This paper is an extended version of a previous work [Kido et al. 2016], including
the study of additional related work, an improved explanation of the proposed approach,
and new results discussion. It is organized as follows: the next section provides our
proposed approach for this paper and deals with the explanation of the ADVF and Louvain
method. Section 3 shows how the experiments were performed in this work. Section 4
presents our results and discussions about our method in OSNs datasets. Finally, Section
5 provides the implications and limitations found.
2. Proposed Approach
Our proposed approach, as in the Figure 1, can be summarized as: 1) pre-processing;
2) ADVF analysis; 3) co-occurrence extraction; 4) Louvain calculus and, finally, topics
identification. The datasets are formed by texts only, so no additional information besides
the content was necessary to perform the proposed approach.
Since the datasets are acquired from OSNs, the first step is dedicated to preprocessing. In this step, the traditional filtering and cleaning process is performed in
order to maintain the characteristics of text structure. Similar to conventional techniques
of Text Mining for general textual content, this step consists in stopwords filtering and
cleaning process (removal of links, special characters, and unnecessary spacing). Finally,
a process of tokenization is applied. Others pre-processing strategies, i.e., stemming were
not necessary since the ADVF will treat irrelevant terms as noise in the next step.
The ADVF method highlights the terms that might be considered as noise. These
noisy words are terms that appear a lot or rarely due to grammatical errors. After the
tokenization process for each term from the token’s list, it is checked the respective fre1
2
www.twitter.com
www.youtube.com
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Figure 1. Proposed approach for Topic Modeling.
quency is checked among all tweets collected. The next step is to apply the ADVF method
on the terms’ frequencies. This method, explained in Section 2.1, creates a probabilistic
frequency of terms, more sensitive to noise.
Then, the N terms with the smaller difference between real and probabilistic frequencies are selected. The N can be interpreted as the sensitiveness of noise detection
level.
Later, the co-occurrence of the selected terms is calculated, creating an adjacency
list. The N selected terms compose the graph nodes, and its co-occurrence frequency is
the edge weight. The adjacency list produces a graph.
The last step for the proposed approach consists of Louvain method application.
Each graph will be divided into communities by Louvain method, always favoring the
modularity optimization. The communities found by the graph structure analysis and
metrics, are classified as natural or artificial topics.
2.1. Adaptive Distribution of Vocabulary Frequencies
Frequently, the usage of techniques of pre-processing data on Text Mining has become
critical to improving better results accuracy. The mathematical model ADVF proposed
by Igawa et al. [Igawa et al. 2014] aims to evaluate the noise level of a dataset from
social media corpus. In this way, the traditional pre-processing techniques combined with
ADVF can improve the dataset quality by avoiding noise.
The ADVF model is based on the principle of Zipf’s Law. The Zipf’s Law
[Powers 1998] is a classical measure of literature that studies the frequency distribution
of terms in a dataset. It was developed by George Kingsley Zipf, it is a potency law
(Equation 1) that analyses the frequency distribution of terms concerning its ranking in
descending order, i.e., the first term is most frequent of the entire database and the last,
the least frequent. Let f ′ (t) a desirable frequency of a term t and r(t), the ranking term.
f ′ (t) ∼
1
r(t)
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
(1)
It means that the second term will be repeated with a frequency of approximately
half the first and third term, with a frequency of 1/3 and so on.
The Zipf’s Law is a standard probability distribution of the terms which the
straight line adapts to all the terms of distribution, but doesn’t treat the noise evidence
on the set.
Most of the highest frequencies correspond to terms that are prepositions, articles,
and pronouns. For Text Mining, depending on the application purpose, these words are
considered stopwords. To eliminate these stopwords, the ADVF considers the evidence
of these noises in the frequency histogram. From two points in the Cartesian plane, t1
(the most frequent term) and tn (the least frequent term), you can find a straight line and
its angular coefficient α (Equation 2 where f (t) is the real frequency of the term t). The
new line is not adapted so well to all terms, but the presence of noise will be evidenced.
The ADVF line is given by Equation 3.
α=
log(r(t1 )) − log(r(tn ))
log(f (t1 )) − log(f (tn ))
ADV F (t) = α(log(r(t))) + log(f (1))
(2)
(3)
As ADVF is a linear distribution based only on the frequency of the terms, it
avoids extra processing and keep a low complexity O(n).
2.2. Louvain Method
In the literature, the term “community” shows different meanings and connotations. In
social science, community refers to a group of people who share the same kind of interests
or activities. Once the networks are considered models for several real systems, the concept of community expands [Papadopoulos et al. 2012]. A new concept of community
appears after the growing of social media, showing a diversity of on-lines entities with
several relations and interactions among entities. The wide range of these networks on
social media attracts more attention to areas such as computer science, psychology, economics, marketing and science of behavior [Tang et al. 2012]. One of the main tasks is to
find communities whose members have more interaction with each other in the same community. The extracted communities can be used for analysis and visualization, marketing,
training and development groups, clustering.
In graph application, a community is a group of nodes where the connectivity
between them is dense. However, the connectivities between nodes of other communities
are sparse. The ability to find and analyse this groups can provide more knowledge about
the network’s structure [Newman and Girvan 2004].
The concept of modularity [Tang et al. 2012] provides a measure of the quality
of a community within a network, quantifying a value given by the comparison of the
fraction of edges within the community with edges between communities. The modularity Q receive a value between 0 and 1. When Q is closer to 1, the community
connectivity is strong. In networks with weights, Q is defined according to Equation
4 [Blondel et al. 2008]:
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
1 X
ki kj
Q=
Aij −
δ(ci , cj ),
2m i,j
2m
(4)
P
where Aij represents the weight between the nodes i and j, ki = j Aij is the weights’
sum between the edges that link the node i, ci is a community that node i belongs
to, the
P
function δ assign 1 if the communities are the same, otherwise 0 and m = ij Aij .
The Louvain method, developed by Blondel et al. [Blondel et al. 2008], consists
of two phases that are repeated iteratively. First, given a graph with N nodes, is assumed
that each node is a community. For each node i and its neighbours j, the gain of modularity is calculated among i withdrew and putting in j communities. The node i assumes
the new community where the modularity gain is maximum and positive, otherwise i still
in the same community. Phase 1 is complete when no improvement can be achieved for
all the nodes, i.e., the local maximum is reached when no moving can improve the modularity. The gain modularity ∆Q obtained from the movement of i to the community C is
demonstrated in Equation 5:
P
2 #
+k
+k
i
i,in
tot
in
−
∆Q =
2m
2m
"P
P
2
2 #
k
i
in
tot
−
−
−
,
2m
2m
2m
"P
(5)
P
P
where in is the sum of the weights of the edges in C, tot is the sum of the weights of
the edges incident to nodes of C, ki is the sum of weights of the edges incident to node
i, ki,in is the sum of the weights of the edges of i to the nodes of C and m is the sum
of weights of all edges in the graph. In practice, ∆Q evaluates the change of modularity
removing i from community and then moving it to the neighbour community.
Phase 2 is the new graph construction, where communities (grouped nodes) of
phase one become the new nodes. The weight edges between two new nodes are the sum
of weight edges between the node of two communities. After the conclusion of Phase
2, Phase 1 can be performed again. The phases are iterated until no gain modularity is
reachable. Figure 2 shows the operation of Louvain method.
Although the exact computational complexity of Louvain method could reach
O(nlog(k)), where k is the average community size for clear datasets [Traag 2015], our
implementation sometimes behaves as O(nlog(n)), where most effort is in the first phase
of the algorithm [Igawa et al. 2014].
3. Experimental Settings
The datasets used in our experiments were composed of texts from Twitter and Youtube
social medias, composing 5 datasets3 (Table 1) with different sizes and themes. The
“TwitterGot”, “TwitterNatal” and “TwitterGame” are datasets formed by a keyword. By
API services, it is possible to collect texts from social media using keywords, where
3
http://www.uel.br/grupo-pesquisa/remid/wp-content/uploads/DatasetiSys2016.zip
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Figure 2. Louvain method application with 2 phases [Blondel et al. 2008].
all theses texts contain the main term in its contents. The two others datasets, “TwitterTweets” and “Youtube”, are acquired without keywords, i.e., those sets are composed of texts about different themes and subjects. The “Youtube” dataset was published in [Thelwall et al. 2012], it was formed by English comments posted to videos
on the YouTube web site. This represents comments on resources and any associated
discussions. Another consideration is all datasets have different sizes, between 2.100 and
20.100 posts. This practice was performed to investigate the solution behavior in different
datasets scenarios.
Table 1. Online Social Network datasets characteristics used in the experiments
Dataset
TwitterGot
TwitterNatal
TwitterGame
TwitterTweets
Youtube
Size (posts)
2.100
8.600
20.100
4.200
3.400
Keyword
Yes
Yes
Yes
No
No
In pre-processing, all alphanumerics characters were transformed to lowercase.
By the use of regular expressions, URLs and links were removed. These kinds of data
do not represent an analysable term. Since all tweets have a maximum of 140 characters length, the usage of links’ shortcuts are popular, resulting in random links, without
textual information. The messages from “Youtube” dataset presents a small number of
words, similar to Twitter datasets. Most of the posts have colloquial language, with the
presence of slangs, no-alphanumerics characters, and unnecessary spacing. The tokenization process was performed to transform each word from each post in a term.
It is possible to check that there are presence of multiple languages, predominantly
the English language, after Spanish and Portuguese. So, we used stopwords addressing
these three languages. Articles, pronouns, and prepositions were removed because they
are considered noises for topic’s formation. In our experiments we did not treat each
language separately, indeed the topics were found mixing the idioms.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
To calculate the difference between the real frequency and the ADVF’s frequency
for each term, it was used the Euclidean Distance (ED). Terms presenting smaller ED
are selected. The smaller is ED’s terms, the greater is the probability of this term be a
topic. The Equation 6 shows the calculus to obtain the Euclidean Distance:
ED(Vi , Vj ) =
N
X
(Vim − Vjm )1/2 ,
(6)
m=1
where Vi is the real frequency, Vj is ADVF frequency and m is the term of N .
So, the co-occurrence was applied. For the co-occurrence verification, the adjacency’ list was formed by edges with weights greater than 1, due to a large number of
edges with weight equal to 1. This cutting justifies the elimination of a dense graph,
which is hard to analyze.
For the calculation of modularity and community determination, this study used
the igraph4 library from the R platform. This library has functions able to import, view,
explore, filter, manipulate and export any network. The Louvain method was implemented
in this library. The result of this work is based on the formation of community and the
complexity of each graph.
From the communities divided by Louvain method, each community was analysed
by metrics about the graph structure:
• Degree. In a weighted graph, the node degree is the sum of the weight adjacency
edges of a node. The weight edge in two nodes means the number of tweets that
both terms appear simultaneously.
• Density. A dense graph is a graph in which the number of edges is close to the
maximal number of edges. The opposite, a graph with only a few edges, is a sparse
graph. In undirected graph, the graph density D is defined in Equation 7 as:
2|E|
|V |(|V | − 1)
where E is the number of edges and V , the number of nodes in the graph.
D=
(7)
Due to this kind of datasets were formed by texts from OSNs, the presence of
spans was expressive. The spammer can post repeatedly texts with the same content,
increasing the terms frequency. Although these terms do not represent the natural base
theme, the great presence indicates that there are possible terms.
For this work, the found topics will be classified into two classes: natural and
artificial topics. The first one represents topics that have a link with the base theme and
the other, represents topics created with spans by bots.
4. Results and Discussion
The results show important achievements, for each community created by Louvain
method, it was selected one or more terms, with significant degrees, to become topics.
This aspects are exposed in Subsection 4.1. Another important contribution is related to
malicious behaviour of bots on OSN is avaliable in Subsetion 4.2.
4
http://igraph.org/r/
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
4.1. Topic Modeling
An important tool used to observe the communities is the graphs. Graphs form the foundation representation of several types of data, ranging from Internet connectivity to social
networks. In or case, we want to understand “what the graph looks like;” we want to know
which vertex and edges are important and what are the significant features of the graph,
as well as identifying the representative nodes, communities and links. In our results,
we expose the communities with different colours in order to highlight the several communities found. About the graphs from the “TwitterNatal” (Figure 3) and “TwitterGot”
(Figure 4) datasets, it is possible to verify that the communities are quite distributed with
the presence of centroid terms.
Figure 3. TwitterNatal communities
Figure 4. TwitterGot communities
About the graphs from the “TwitterNatal” (Figure 3) and “TwitterGot” (Figure 4)
datasets, it is possible to verify that the communities are quite distributed with the presence of centroid terms. Centroids are nodes for which the sum of distances to other nodes
is minimal. In the results, the main centroid usually is the own base keyword, but there are
presence of others unknown centroids discovered. For example, in “TwitterNatal” (Figure
5), the keyword “natal”, which means “Christmas”, is the main centroid of the graph, and
it is considered as a topic. Using the proposed approach, the application discovered other
centroids like “festa” (“party”), considered as topics more specifics too, as in Figure 5 is
shown.
It is important to observe that community created by centroid “compra” (blue
graph in Figure 5) has a different graph topology, where we cannot identify a clear centroid. In this case, we can observe the terms that composed the graph had a similar
co-occurrence. Thus it is not possible to specify a centroid, but treat this community with
the presence of expressions to exposes the community sense, in other words, there are
several terms for closely related concepts.
By observing the different communities identified from TwitterNatal dataset, we
obtained 10 different senses and themes in a collection created by keyword “natal”. These
communities are described in Table 2. For example, it was possible to obtain a disambiguation of terms with the same writing (but with different sense) as “natal” as Christmas’s meaning and “natal” the Brazilian city. Another important discovery was that some
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Figure 5. Communities of TwitterNatal dataset: centroid “natal” (yellow), centroid
“festa” (cyan), without centroid “dezembro” (cyan with three terms) and without
“compra” (blue).
topics found were hyponyms of the keyword used to build the dataset. Hyponym is a
word of more specific meaning that a term applicable to it. Example, Hyponym of Natal
city is a tourist attraction at the city of Natal.
Table 2.
dataset
List of communities and terms sense interpretation of TwitterNatal
Term (centroid)
natal
festa
compra, quero, mais and centro
estar
nao
foto, tirol, rn, publicar
amigo, secreto, oculto
ferias, mes and dezembro
futuro
melhor and seria
Sense
Christmas’ day
graduation parties
go shop / go to the mall
be in a country region
“I don’t like”
a tourist attraction at Natal city
Kriss Kringle play
vacation on December
past and future wishes
could be better
Size
84
56
20
7
7
5
5
3
3
2
Colour
Green
Gray
Blue
Red
Pink
Cyan
Light Orange
Pink
Gray
Light Blue
A similar result was obtained in TwitterGot dataset. We identified 18 different
communities, represented by centroids (or mor frequent terms): {listen,jeffery,stone},
{teamyankee3},{thought, dude}, {dragons}, {gotexhibit}, {gift, s05s07}, {kings, landing}, {happy, take}, {winterfell, theon}, {else, series}, {people}, {dorne}, {trndnl},
{totti9, chapel10, ancelotti8, galliani, grevia}, {mamahablaespanol5, madres3, kanquimania4, exatemporadabarbarella}, {santos4, junio5, llidertenecesitamosvivo2} and
{tolerancia03}. The keyword used to build this dataset is related to a TV Series. It
was observed the most of the communities are assessed to series character. Thus, it was
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
possible to identify some different stories and scenarios from the tv show.
Another important observation is the presence of malicious activities. In the TwitterGot dataset spamming activities by a bot user named as trndnl were identified. A more
precise description of this aspect is done in Sub-section 4.2.
Despite the “TwitterGame”, the graph structure of this dataset is more complex
due to the wide co-occurrence between its terms selected, as Figure 6 illustrate. The term
“game” used as a keyword can have several semantics, depending on the context. Due to
this diversity, the proposed model still can find and determine topics, but they are more
generics.
Figure 6. TwitterGame communities obtained
The communities obtained in TwitterGame were 10. Considering the wide sense
of the keyword used to compose this dataset, “game”, we distinguish important topics. The communities extracted were: {amazing}, {lose}, {trying, remember}, {white},
{flyers, thing, show}, {bring, sunday}, {bout, league}, {espn, finals}, {luck, girls, want}.
For example, the community with centroid “lose” presents terms as “cavs” (Cleveland
Cavaliers from National Basketball Association - NBA), “george” (athlete of Indiana
Pacers NBA team), “score” and “shot”, there are related to a NBA game (Indiana Pacers versus Cleveland Cavaliers) that happened in the time of dataset acquisition. Another
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
NBA game happened during the dataset acquisition. The community {trying, remeber}
contains the terms “wade” (Dwyane Wade an athlete of Miami Heat) and other terms
related to sports. Thus we could interpret as a community related to a NBA game.
In this way, the dataset TwitterGame was split into communities that represent
different sports events during the dataset acquisition. Not just NBA games, the community
{flyers, thing, show} presents the term “flyers” and “rangers”, both are hockey teams of
National Hockey League (NHL): Philadelphia Flyers and New York Rangers.
Figure 7. TwitterTweets communities
Figure 8. Youtube communities
A scenario of a term with several meanings or several terms used to build a dataset
has a more complex graph structure, e.g. Figure 6, Figure 7 and Figure 8. On the other
hand, a more specific term can be represented as a less complex graph, as Figure 3 and
Figure 4 show. This
This phenomenon is not related to the number of communities (i.e. topics). The
complex graphs as obtained by Youtube, TwitterTweet and TwitterGame were composed
by 8, 12 and 10 communities, respectively. The less complex graphs of TwitterNatal and
TwitterGot were formed by 10 and 18 communities. In this way, the graph complexity obtained by performing Louvain is related to equality of communities density. The
community density is the symmetry between the number of terms and co-occurrence that
compose it. A density equal 1.00 means the total co-occurrence between the terms. On
the other hand, low density (near to 0.00) is related to less co-occurrence between all
terms. The low density is a sign of diversity of sentences establishing a community and
the evidence of a centroid presence.
Another important metric is the average degree. We can assess this metric as the
strength of the topic found. Thus a high degree indicates a more popular topic. In the
Table 3 it is possible to observe the average degree 42.30 of “gameofthrones” community. This dataset was built with a keyword related to Game of Thrones TV series. The
other communities obtained with this dataset have lower degree in comparison to “gameofthrones”, designed through a centroid with the same name of TV series. Another example is the communities “natal” and “festa” in Figure 3 (yellow and cyan chart), they have
the highest average degree of TwitterNatal dataset. Thus, are the most strength topics.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Considering the presence of a centroid as the implication of a distinct topic, more
precisely, an occurrence of a single term that represents a collection of terms (community), we can affirm that a complex graph has more unique words that describe a topic,
leading to semantical difference between them. A less complex graph presents high density and the indication of repetitive sentences.
The repetitive sentences in a community is a sign of artificial sentences presence.
This artificial sentences could be the consequence of malicious action, as discussed in
Subsection 4.2.
4.2. Malicious Detection
Not only humans that submit the posts to the OSNs. In OSNs, there are problems about
spamming activities by bots. These type of activities can produce malicious contents to
deceive users or to promote something. In Twitter, if a bot produces many tweets with
the same content, for many applications on topic models, these terms can be considered
as main topics. In “TwitterGot” dataset, there are tweets made by bots. In this case, the
proposed model adopted natural or artificial topics specification. Natural are the topics
that have links with the theme base and artificial are topics created by spamming activities.
The Table 3 presents the communities of “TwitterGot”.
In the Community 1 from Table 3, the keyword “gameofthrones” is one of the
centroids discovered, as described above. This graph has a lower density which means
that the number of nodes (terms) is higher than the number of edges (co-occurrence). The
selected terms have one or few links in the graph, which means that they are more specific.
They can be subtopics from the natural topic “gameofthrones”. The Communities 3, 4 and
5 don’t have a centroid term because all nodes linked all nodes. In this case, the density
is 1.00, because the number of possible co-occurrence is maximal. The selected terms are
more generic and probably appear in same tweets. Theses topics are considered artificial.
In the Community 2, the term “trndnl” is another centroid term discovered. Since
this graph presented the same characteristics of the Community 1, it was classified as a
natural topic. The “trndnl” means Trendinalia5 and it is an account service that analyses
the classification of Trend Topics on Twitter. Trend Topics are terms delimited by “#”
which is used to highlight a topic. The Table 4 presents examples of tweets made by
Trendinalia. These tweets have a similar structure to each other. Due to a large amount of
this kind of tweets in the dataset, these bots’ terms are considered artificial topics.
5
https://twitter.com/trndnl?lang=pt
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Table 3. The natural and artificial communities of TwitterGot dataset
Id
Topic
Class
1
gameoftthrones
Degree: 42.30
Density: 0.02
Natural
2
trndnl
Degree: 19.68
Density: 0.08
Natural
/Artificial
3
Without topic
Degree: Density: 1.00
Artificial
4
Without topic
Degree: Density: 1.00
Artificial
5
Without topic
Degree: Density: 1.00
Artificial
Graph
Table 4. Examples of noised tweets from Trendinalia.
Tweets
6. #GameOfThrones7. Leon Larregui8. Santos
y Queretaro9. Chivas10. John Nash 2015/5/25
04:14 CDT #trndnl http://t.co/IN7801UqsL
6. Leopoldo Lopez7. Ceballos8. Pastor Maldonado9.
Cersei10. Exxon 2015/5/25 04:44 VET #trndnl
http://t.co/TZZWvFfY1p
1. 1. #charliecharliechallenge2. #MeCaesMalSi3.
Therefore, the term “trndnl” can be classified in two different classes. Considering
its density, the graph behaves as a natural topic like the term “gameofthrones”, but due
to the standardized tweets and its huge amount, it is classified as an artificial topic. It is
necessary another metric for graphs to correctly classify theses cases.
To compare with traditional techniques from the other papers, we applied the
TD-IDF, method used in Huang’s paper [Huang et al. 2014b], in our datasets. Selecting the top 5 terms of TF-IFD values in “TwitterGot”, the result was: {9000x}, {astoria},
{babaye}, {bentley} and {brutaaal}. Theses terms have the same TF-IDF value. It can be
explained by algorithm idea. The TD-IDF method is based on the words frequencies in
each document and total collection, to calculate how important a word is to a document.
Considering that a document is a tweet and the tweets length are short, a word that appears
more than once in the same tweet is very unusual. So the TF-IDF in texts from OSNs is
based on only words frequencies. Comparing these words with the selected terms and
communities by our approach, it is possible to verify that the words with biggest TF-IDF
values do not have links to the theme and topics found. Thus, this type of traditional technique for Topic Modeling cannot be applied to Topic Modeling in Online Social Networks
due to the lexicon size of a tweet. In other words, only high frequency is not enough to
set a topic of OSNs datasets.
5. Conclusion
Louvain method was the base of our OSN topic extraction proposal. Different from the
traditional non-structured text (news, whitepapers, and reports), the OSNs’ content has
particular characteristics. Abbreviation, slangs and grammatical errors are very common,
besides the use of few terms (small lexicon).
Five datasets from Twitter and Youtube were used to evaluate the proposal. Using
the ADVF terms selection, terms co-occurrence, and Louvain method, it was possible to
perform the Topic Modeling in OSNs.
The datasets used in the experiments were composed of texts from different authors, and the presence of noise was inevitable. However, there were discovered several
topics. By analyzing the found communities, we can infer if a topic was artificial or
natural, due to the graph’s density.
In other words, our approach can highlight the topics avoiding noise and obtaining
an accurate Topic Modeling. When the dataset was obtained by the use of a keyword, the
topics found were in the most cases hyponyms of the keyword. In addition, our proposed
method can identify an artificial or natural topic in the dataset. Artificial topics need to be
avoided once they were created by the malicious action of a bot.
However, there are cases where the community has characteristics of both artificial
and natural topics. The use of alternative metrics to distinguish this aspect is the object of
our future work.
6. Acknowledgements
We are grateful to CNPQ that made this paper possible by sponsoring our work, process
479821/2013-5.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
References
Abu-Nimeh, S., Chen, T. M., and Alzubi, O. (2011). Malicious and spam posts in online
social networks. Computer, 44(9):23–28.
Aggarwal, C. C. (2015). Outlier analysis. In Data Mining, pages 237–263. Springer.
Akilan, A. (2015). Text mining: Challenges and future directions. In Electronics and
Communication Systems (ICECS), 2015 2nd International Conference on, pages 1679–
1684. IEEE.
Barbon, S., Igawa, R. A., and Bogaz Zarpelão, B. (2016). Authorship verification applied
to detection of compromised accounts on online social networks. Multimedia Tools
and Applications, pages 1–21.
Bhowmick, S. and Srinivasan, S. (2013). A template for parallelizing the louvain method
for modularity maximization. In Dynamics On and Of Complex Networks, Volume 2,
pages 111–124. Springer.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and
Experiment, 2008(10):P10008+.
Campigotto, R., Céspedes, P. C., and Guillaume, J.-L. (2014). A generalized and adaptive
method for community detection. arXiv preprint arXiv:1406.2518.
Chen, Y., Amiri, H., Li, Z., and Chua, T.-S. (2013). Emerging topic detection for organizations from microblogs. In Proceedings of the 36th International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR ’13, pages
43–52, New York, NY, USA. ACM.
Chen, Z. and Liu, B. (2014). Mining topics in documents: Standing on the shoulders
of big data. In Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’14, pages 1116–1125, New York, NY,
USA. ACM.
Chitra, K. and Subashini, B. (2013). Data mining techniques and its applications in banking sector. International Journal of Emerging Technology and Advanced Engineering,
3(8):219–226.
Choi, D., Ko, B., Kim, H., and Kim, P. (2014). Text analysis for detecting terrorismrelated articles on the web. Journal of Network and Computer Applications, 38:16–21.
De Meo, P., Ferrara, E., Fiumara, G., and Provetti, A. (2011). Generalized louvain method
for community detection in large networks. In Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on, pages 88–93. IEEE.
Gao, H., Hu, J., Huang, T., Wang, J., and Chen, Y. (2011). Security issues in online social
networks. IEEE Internet Computing, 15(4):56–63.
Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis.
Machine learning, 42(1-2):177–196.
Huang, S., Yang, Y., Li, H., and Sun, G. (2014a). Topic detection from microblog based on
text clustering and topic model analysis. In Services Computing Conference (APSCC),
2014 Asia-Pacific, pages 88–92. IEEE.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Huang, S., Yang, Y., Li, H., and Sun, G. (2014b). Topic detection from microblog based
on text clustering and topic model analysis. In Services Computing Conference (APSCC), 2014 Asia-Pacific, pages 88–92.
Igawa, R., Sakaji Kido, G., Seixas, J., and Barbon, S. (2014). Adaptive distribution
of vocabulary frequencies: A novel estimation suitable for social media corpus. In
Intelligent Systems (BRACIS), 2014 Brazilian Conference on, pages 282–287.
Igawa, R. A., Barbon Jr, S., Paulo, K. C. S., Kido, G. S., Guido, R. C., Júnior, M. L. P.,
and da Silva, I. N. (2016). Account classification in online social networks with lbca
and wavelets. Information Sciences, 332:72–83.
Igawa, R. A., de Almeida, A. M. G., Zarpelão, B. B., and Barbon Jr, S. (2015). Recognition of compromised accounts on twitter. In Proceedings of the Annual Conference
on Brazilian Symposium on Information Systems: Information Systems: A Computer
Socio-Technical Perspective, volume 1, pages 9–14.
Jin, L., Chen, Y., Wang, T., Hui, P., and Vasilakos, A. V. (2013). Understanding user
behavior in online social networks: A survey. IEEE Communications Magazine,
51(9):144–150.
Kido, G. S., Igawa, R. A., and Barbon Jr, S. (2016). Topic modeling based on louvain
method in online social networks. In XII Brazilian Symposium on Information Systems
- Information Systems in the Cloud Computing Era, pages 353–360.
Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic
analysis. Discourse processes, 25(2-3):259–284.
Li, H., Yan, J., Weihong, H., and Zhaoyun, D. (2014). Mining user interest in microblogs
with a user-topic model. Communications, China, 11(8):131–144.
Newman, M. E. and Girvan, M. (2004). Finding and evaluating community structure in
networks. Physical review E, 69(2):026113.
Papadopoulos, S., Kompatsiaris, Y., Vakali, A., and Spyridonos, P. (2012). Community
detection in social media. Data Mining and Knowledge Discovery, 24(3):515–554.
Powers, D. M. (1998). Applications and explanations of zipf’s law. In Proceedings of the
joint conferences on new methods in language processing and computational natural
language learning, pages 151–160. Association for Computational Linguistics.
Roth, B., Barth, T., Wiegand, M., and Klakow, D. (2013). A survey of noise reduction
methods for distant supervision. In Proceedings of the 2013 workshop on Automated
knowledge base construction, pages 73–78. ACM.
Tan, S., Li, Y., Sun, H., Guan, Z., Yan, X., Bu, J., Chen, C., and He, X. (2014). Interpreting
the public sentiment variations on twitter. Knowledge and Data Engineering, IEEE
Transactions on, 26(5):1158–1170.
Tang, L., Wang, X., and Liu, H. (2012). Community detection via heterogeneous interaction analysis. Data Mining and Knowledge Discovery, 25(1):1–33.
Thelwall, M., Buckley, K., and Paltoglou, G. (2012). Sentiment strength detection for the
social web. Journal of the American Society for Information Science and Technology,
63(1):163–173.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017
Traag, V. A. (2015). Faster unfolding of communities: Speeding up the louvain algorithm.
Physical Review E, 92(3):032801.
Tsai, F. S. (2011). A tag-topic model for blog mining. Expert Systems with Applications,
38(5):5330 – 5335.
Vanetti, M., Binaghi, E., Ferrari, E., Carminati, B., and Carullo, M. (2013). A system to
filter unwanted messages from osn user walls. IEEE Transactions on Knowledge and
data Engineering, 25(2):285–297.
Verma, V., Ranjan, M., and Mishra, P. (2015). Text mining and information professionals:
Role, issues and challenges. In Emerging Trends and Technologies in Libraries and
Information Services (ETTLIS), 2015 4th International Symposium on, pages 133–137.
Zappavigna, M. (2011). Ambient affiliation: A linguistic perspective on twitter. New
media & society, 13(5):788–806.
Zeng, J., Duan, J., Cao, W., and Wu, C. (2012). Topics modeling based on selective zipf
distribution. Expert Systems with Applications, 39(7):6541 – 6546.
Zhang, C., Sun, J., Zhu, X., and Fang, Y. (2010). Privacy and security for online social
networks: challenges and opportunities. IEEE Network, 24(4):13–18.
BARBON JR, S.; KIDO, G. S.; TAVARES, G. M.
Artificial and Natural Topic Detection in Online Social Networks
iSys | Revista Brasileira de Sistemas de Informação, Rio de Janeiro, vol. 10, No. 1, pp. 80-98, 2017