TwitterNews: Real Time Event Detection from the Twitter Data Stream

Mahmud Hasan, Mehmet Orgun, Rolf Schwitter
Department of Computing, Macquarie University, Sydney, Australia
Email: mahmud.hasan@students.mq.edu.au, mehmet.orgun@mq.edu.au, rolf.schwitter@mq.edu.au

Abstract—Research in event detection from the Twitter streaming data has been gaining momentum in the last couple of years. Although such data is noisy and often contains misleading information, Twitter can be a rich source of information if harnessed properly. In this paper, we propose a scalable event detection system, TwitterNews, to detect and track newsworthy events in real time from Twitter. TwitterNews provides a novel approach, by combining a random indexing based term vector model with locality sensitive hashing, that aids in performing incremental clustering of tweets related to various events within a fixed time. TwitterNews also incorporates an effective strategy to deal with the cluster fragmentation issue prevalent in incremental clustering. The set of candidate events generated by TwitterNews is then filtered to report the newsworthy events, along with an automatically selected representative tweet from each event cluster. Finally, we evaluate the effectiveness of TwitterNews, in terms of recall and precision, using a publicly available corpus.

1. Introduction

Social media sites are continuously gaining popularity as effective means of communicating information. Millions of users share information on different aspects of everyday life through these social networking services. Information shared on these platforms ranges from personal status (i.e., opinions, emotions, pointless babble) to newsworthy events, as well as updates and discussions on these events. Twitter is a popular social media site providing a microblogging service, where users can post and read short text messages, known as tweets. Due to the informal nature of tweets, and the ease with which they can be posted, Twitter users can be faster in covering an event than the traditional news media. As Twitter users do not have to be concerned with how to structure a write-up on news events or how the news will be perceived by readers, they have the advantage of spreading event related updates before these appear in the traditional news. Also, local events that have low news coverage are spread through Twitter.

In this paper we use the definition of an event, in a social media context, provided by Dou et al. [1]: "An occurrence causing change in the volume of text data that discusses the associated topic at a specific time. This occurrence is characterized by topic and time, and often associated with entities such as people and location". In accordance with this definition, the series of coordinated terrorist attacks in Paris on 13th November 2015 is an event in the context of social media that prompted a high volume, or burst, of tweets as soon as the attacks took place. The various aspects of this event were tweeted by general users and news agencies all over the world, e.g., "BREAKING. This is what we know: 35 dead, 100 hostages taken at a concert venue. Various drive by shootings. Explosions at a #Paris stadium.".

Although the previous example is a high profile event of global consequence, an event detection system should also be able to detect newsworthy events at a smaller scale with a lower burst of tweets at any given time. There are, however, a number of challenges in real time event detection from the Twitter data stream. The message length restriction of 140 characters greatly reduces the amount of textual information obtained from a tweet, and the messages themselves are noisy: containing typos, grammatical errors, abbreviations, etc. In addition, most of the content found on Twitter is not related to any event, and the high volume of data on a diverse number of topics poses a big challenge in terms of scalability. Despite these challenges, it would be beneficial to develop a system that allows real time detection and tracking of events, as Twitter does not provide a tool to see event summaries except for searching on trending topics or using hashtags.

Different approaches have been taken by researchers to deal with the event detection task. The approaches based on term interestingness [2]–[6] and topic modeling [7]–[11] suffer from high computational cost, among other things. However, incremental clustering based approaches [12]–[14] usually provide a low computational cost solution. Taking this into consideration, we propose an incremental clustering based end-to-end solution to detect newsworthy events from a stream of time ordered tweets.

The problem of event detection from the Twitter data stream in an incremental clustering context can be divided into two major stages. The first stage involves detecting a burst in the number of tweets discussing a topic/event, and the second stage involves grouping/clustering the tweets

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2297v1 | CC BY 4.0 Open Access | rec: 18 Jul 2016, publ: 18 Jul 2016
that discuss the same event. The operation of our proposed system, TwitterNews, is therefore divided into two major stages. After the preprocessing of a tweet, the first stage is responsible for determining whether the current input tweet to the system is discussing a previously encountered topic. If the input tweet is discussing a previously seen topic, then a soft tweet burst related to a topic/event has occurred, and the output of the first stage declares the input tweet to be bursty ("not unique"). The first stage of TwitterNews is implemented by combining a Random Indexing (RI) based term vector model [15] with the Locality Sensitive Hashing (LSH) scheme proposed by Petrovic et al. [12], to determine whether a tweet is "not unique". To the best of our knowledge, this is a novel approach that combines RI with LSH to reduce the time needed to determine the novelty of a tweet, and it performs a fast text similarity comparison between the current input tweet and the most recent tweets while maintaining constant time and space.

Subsequently, the second stage, implemented using a novel approach by adapting the generic incremental clustering algorithm, deals with generating the candidate event clusters by incrementally clustering the tweets that were determined as bursty ("not unique") during the first stage. The second stage also incorporates a defragmentation strategy to deal with the fragmented events that are generated when a particular event is detected multiple times as a new event.

To ensure scalability in a true streaming setting, each cluster generated in the second stage has a dynamic expiry time, dependent on the subsequent tweet arrival times in a cluster. Finally, a set of filters is applied after the second stage to report the newsworthy events from the candidate event clusters. Newsworthy events are reported along with a representative tweet for each event cluster by employing a Longest Common Subsequence (LCS) based scheme that works on the word level.

The rest of the paper is organized as follows: Section 2 contains a brief discussion of the related work; we then introduce our proposed system, TwitterNews, in Section 3 and expand on the various aspects of the system in Sections 4 and 5. Finally, we discuss the results of our experiments and the evaluation of TwitterNews in Section 6.

2. Related Work

We discuss the related work by organizing it into approaches that share common traits (i.e., identifying interesting properties in tweet terms, using probabilistic topic modeling, and incremental clustering). For more details we refer to the survey on event detection techniques conducted by Atefeh and Khreich [13].

Term Interestingness Based Approaches. The event detection system in Twevent [2] initially extracts continuous and non-overlapping word segments from each tweet. Statistical information obtained from the Microsoft Web N-gram Service1 and Wikipedia is used to detect nontrivial word segments. The top-k bursty event segments within a fixed time window are then calculated from the frequency of bursty segments, in conjunction with the user frequency of the bursty segments. Finally, related event segments are clustered by exploiting the content of their associated tweets and the frequency pattern of the segments within the specified time window. The events are then filtered based on a newsworthiness score, which is calculated using Wikipedia as a knowledge base.

TwitInfo [3] allows a user to input event related keywords to track an event. The system starts logging the tweets that match the user specified keywords, and uses a streaming algorithm to detect spikes in tweet data and automatically labels them with frequently occurring meaningful terms from the tweets. The peaks generated by a high volume of Twitter posts are considered as sub-events.

TwitterMonitor [4] detects emergent topics by first identifying the bursty terms from the tweets within a small time window. If the detected high frequency terms co-occur in a large number of tweets in the given time window, they are placed in the same group. A greedy strategy is used to generate groupings, in order to avoid the high computational cost of enumerating all possible groups. Similarly, enBlogue [16] computes statistical values for tag pairs within a given time window and monitors unusual shifts in the tag correlations to detect emergent topics.

Topic Modeling Based Approaches. TwiCal [9] is an open-domain event detection system that provides a structured representation of the significant events extracted from the Twitter data stream. The proposed system extracts named entities, along with associated event phrases and dates, from each of the streaming tweets. A latent variable based model is used to discover important event types from a large tweet corpus. The event types that were found to be coherent during the inspection of the discovered event types were retained, and manually annotated with informative labels. Once appropriate event types were identified, they were used to categorize event phrases extracted from subsequent new data without requiring any further manual annotation. Events are ranked by measuring the association strength between an entity and a specific date, based on the G2 log likelihood ratio statistic.

Latent Event and Category Model (LECM) [10] is a latent variable model similar to TwiCal [9]. However, LECM incorporates semantic concepts to categorize events of different types.

Incremental Clustering Based Approaches. Becker et al. [17] have used an incremental clustering algorithm to detect events from the Twitter stream. For each tweet, its similarity is computed against each of the existing clusters. If the similarity of a tweet is not higher than a specific threshold in any of the existing clusters, then a new cluster is created. Otherwise the tweet is assigned to the cluster with the highest similarity. Once the clusters are formulated, a Support Vector Machine based classifier is used to distinguish between the newsworthy events and the non-events.

McMinn et al. [14] have utilized an inverted index for each named entity with its associated near neighbors to

1. http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx

cluster the bursty named entities for event detection and tracking. The effectiveness of this approach, however, is dependent on the accuracy of the underlying Named Entity Recognizer [18] used by the system.

Petrovic et al. [12] have adapted a variant of the locality sensitive hashing (LSH) technique to determine the novelty of a tweet by comparing it with a fixed number of previously encountered tweets. A novel tweet represents a new story, which is assigned to a newly created cluster. On the other hand, a tweet determined as 'not novel' is assigned to an existing cluster containing the nearest neighbor. Event clusters are ranked based on a combination of the entropy information in a cluster and the number of unique user posts.

All of the aforementioned general approaches to event detection from Twitter have their own set of drawbacks. The approaches based on term interestingness [4]–[6] can often capture misleading term correlations, and measuring the term correlations can be computationally prohibitive in an online setting. The topic modeling based approaches [7], [8] incur too high a computational cost to be effective in a streaming setting, and are not quite effective in handling events that are reported in parallel [19]. Incremental clustering based approaches are prone to fragmentation, and usually are unable to distinguish between two similar events taking place around the same time [12], [14], [20]. However, despite the challenges, we believe an incremental clustering approach is the way to go to avoid the high computational cost associated with most of the state-of-the-art approaches.

3. TwitterNews Architecture

Continuing our discussion in Section 1 regarding the two major stages of operation in TwitterNews, here we discuss the two major components in our system, each of which deals with the operation of a specific stage (Figure 1). The decision making process on the novelty of an input tweet during the first stage is handled by the Search Module, and generating the candidate event clusters during the second stage is handled by the EventCluster Module.

To facilitate the decision on novelty, the Search Module allows fast retrieval of the neighboring tweets of the input tweet for text similarity comparison. This is achieved by using an adapted variant of the Locality Sensitive Hashing (LSH) approach employed by Petrovic et al. [12], where a set of most recent tweets is stored in a fixed number of hash tables. However, Petrovic et al. [12] have used a variable length tf−idf based term vector model to calculate the hash keys for each input tweet, which is computationally expensive. The hash keys are used to retrieve the neighboring tweets of the input tweet from the hash tables.

To reduce the computational cost of calculating the hash keys, TwitterNews uses a random indexing based term vector model. The advantage of using random indexing is the capability to have a fixed length vector generated for each input tweet to the system, regardless of the number of tweets encountered. This allows fast calculation of the hash keys for an input tweet, and thus fast retrieval of the neighboring tweets from the hash tables.

If the Search Module finds a neighboring tweet of the input tweet with a cosine similarity higher than a specific threshold value, then the input tweet is decided to be bursty ("not unique") and sent to the EventCluster Module. For each tweet sent to this module, a candidate event cluster to which the tweet can be assigned is searched for. If the cosine similarity between a tweet and the centroid of an event cluster is above a certain threshold, then the tweet is assigned to that cluster. When no such cluster is found, a new event cluster is created and the tweet is assigned to the new cluster. The EventCluster Module contains a defragmentation sub-module that merges together fragmented event clusters. The defragmentation sub-module is also helpful for merging clusters that are sub-events of an event. Finally, TwitterNews uses a novel LCS based scheme, along with a set of different filters, to filter out some of the trivial events from the candidate event clusters, and identifies a representative tweet for each event.

4. The Search Module

Using the Search Module we aim to detect a soft burst; that is, we intend to find at least one previous tweet that discusses the same topic as the input tweet. To do that we need to store the previously encountered tweets and access them in a fast way to perform a text similarity comparison with the input tweet. TwitterNews maintains a fixed number of most recent tweets that are stored and continuously updated in a set of hash tables. Each hash table can be thought of as a collection of buckets, where a hash key maps to a bucket in a hash table and each bucket contains a fixed number of similar tweets. The contents of a bucket are updated regularly by replacing the oldest tweet in the bucket with the new tweet. A fixed number of hash tables is maintained to increase the chance of finding the nearest neighbor of the input tweet. The number of hash keys that need to be calculated for an input tweet is equal to the number of hash tables maintained. In our approach, the calculation of the hash keys for the input tweet involves combining a Random Indexing (RI) based term vector model and the Locality Sensitive Hashing (LSH) based scheme of Petrovic et al. [12] with the parameter settings used by the authors.

Before discussing how and why RI is combined with LSH, we briefly go through each of these concepts in subsections 4.1 and 4.2. Finally, in subsection 4.3 we discuss combining RI with LSH to find previously encountered tweets similar to the input tweet.

4.1. Random Indexing (RI)

Random indexing [15] is a term vector based model for semantic similarity. It deals with the high dimensionality issue faced by other word co-occurrence based models by performing random projection for dimensionality reduction.
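As an illustration of why the vector length stays fixed, the following sketch builds fixed-dimensional tweet vectors by averaging sparse random index vectors. It is only a toy stand-in for the RI model, not the S-Space implementation used in TwitterNews: the dimension, the sparsity, and the direct averaging of index vectors (omitting the context-vector accumulation of full random indexing) are simplifications of our own.

```python
import numpy as np

DIM = 300        # fixed vector dimension, arbitrary for this sketch
NONZERO = 6      # non-zero entries per sparse index vector, also arbitrary

rng = np.random.default_rng(0)
index_vectors = {}   # one fixed-length random index vector per term

def index_vector(term):
    """Lazily assign each new term a sparse +/-1 random vector of length DIM."""
    if term not in index_vectors:
        vec = np.zeros(DIM)
        pos = rng.choice(DIM, size=NONZERO, replace=False)
        vec[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
        index_vectors[term] = vec
    return index_vectors[term]

def tweet_vector(tokens):
    """Average of the term vectors: always length DIM, however many terms exist."""
    return np.mean([index_vector(t) for t in tokens], axis=0)

v1 = tweet_vector(["explosion", "paris", "stadium"])
v2 = tweet_vector(["breaking", "explosion", "reported", "paris"])
assert v1.shape == v2.shape == (DIM,)   # dimensionality never grows
```

New vocabulary only adds entries to the term table; the tweet vectors, and hence the random hyperplanes used for hashing, never need to be resized.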

Figure 1. TwitterNews Architecture

This makes it a scalable approach. RI works by associating each term with two vectors: an index vector and a context vector. The index vector is a sparse, high dimensional, and unique random vector. In addition, the index vectors for each term have the same dimension. The context vector is initialized with zeroes and its size is equivalent to that of the index vector. The context vector of a term is updated by adding the index vector of another term when it co-occurs with the other term in a sliding context window. For each tweet in the system, an RI based incremental term vector model can be created by processing each term of the tweet. Finally, an RI based vector representation for a tweet can be produced from an average of the vectors of each term. Having an RI based term vector model ensures that every tweet vector will have the same dimension, regardless of the number of new terms encountered in any subsequent tweets.

4.2. Locality Sensitive Hashing (LSH)

The basic idea behind the LSH approach used by [12] is that if two high dimensional points are close in some metric space X, a random projection of those two points on a randomly drawn hyperplane will allow them to remain close in a low dimensional subspace. The probability of two points i and j colliding in a single hyperplane in this hashing scheme with random projection is:

Pr[h(i) = h(j)] = 1 − θ(i, j)/π

where θ(i, j) is the angle between i and j. To simplify, the probability of two points colliding is proportional to the cosine of the angle between them. Increasing the number of hyperplanes k decreases the probability of collision with points that are not close together. Using this concept, similar tweets are hashed into the same bucket of a hash table. Each bucket in a hash table represents the subspace formed by intersecting X with k independently drawn random hyperplanes. The hash function based on which the tweets are placed in a bucket can be defined as:

h_u(q) = sgn(u · q)

where u is a random vector drawn from a Gaussian distribution N(0, 1) and q is a query point, i.e., a tweet vector. The output of the previously defined hash function is:

h_u(q) = 1 if u · q ≥ 0, and 0 if u · q < 0    (1)

Each hyperplane m ∈ [1...k] represents a single bit in the m-th position of the hash key for a bucket, and each bit position of the hash key is calculated by equation (1). So, the k bits of the hash key for a bucket are formed by gluing all the values of the corresponding bit positions together. Although increasing the number of hyperplanes k decreases the probability of collision with non-similar points, it also decreases the probability of collision with the nearest neighbors. Thus, L hash tables are created, each having k independently chosen random hyperplanes, in order to obtain a high probability of collision with the nearest neighbors.

4.3. Combining RI with LSH

The calculation of L hash keys is computation intensive even with parallel processing. Generating a single hash key requires k independent random vectors whose dimensions need to be updated as more unique terms are encountered. This is done to match the dimensions of the input tweet vector, which is based on an incremental tf−idf weighting. To alleviate this problem we use the term vector model created with RI. This allows us to have a fixed length for every tweet vector and also the same length for the random vectors. Hence, the random vectors are created once at the initialization phase of the system, without requiring any further updates, as the dimensions of the term vectors are unchanged regardless of the number of tweets encountered. Algorithm 1 shows the pseudocode for the Search Module using LSH and RI.
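Before turning to the pseudocode, the hash-key step alone can be sketched as follows. This is an illustrative reading of equation (1), not the authors' implementation; the values of DIM, K, and L are arbitrary choices for the example.

```python
import numpy as np

DIM = 300   # fixed tweet-vector dimension (matches the RI vectors; arbitrary here)
K = 13      # hyperplanes per table, i.e., bits per hash key (assumed value)
L = 10      # number of hash tables (assumed value)

rng = np.random.default_rng(1)
# Drawn once at initialization and never resized, since the vector
# dimension is fixed: K Gaussian hyperplanes for each of the L tables.
hyperplanes = [rng.standard_normal((K, DIM)) for _ in range(L)]

def hash_key(q, planes):
    """Equation (1): one bit per hyperplane, glued into a K-bit integer key."""
    bits = (planes @ q >= 0).astype(int)        # sgn(u . q) mapped to {0, 1}
    return int("".join(map(str, bits)), 2)

q = rng.standard_normal(DIM)                    # stand-in for a tweet vector
keys = [hash_key(q, planes) for planes in hyperplanes]
assert len(keys) == L and all(0 <= key < 2 ** K for key in keys)
```

Because the hyperplanes never change once drawn, nearby vectors agree on most sign bits and therefore tend to collide in at least one of the L tables, which is what the bucket lookup in the Search Module relies on.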

Algorithm 1. TwitterNews Search Module
Require: threshold value
1: for each tweet d in twitter-stream D do
2:   preprocess(d)
3:   generate vector for d with RI
4:   generate vector for d with tf−idf
5:   S ← set of documents that collide with d in L hash tables   ▷ hash calculated using the RI based tweet vector
6:   simmax ← 0
7:   for each tweet d′ in S do   ▷ parallel processing
8:     tempSim = CosineSimilarity(d, d′)   ▷ calculated using the tf−idf based tweet vectors
9:     if tempSim > simmax then
10:      simmax = tempSim
11:    end if
12:   end for
13:   if simmax > threshold then
14:     d is "not unique"
15:   else
16:     d is "unique"
17:   end if
18:   add d to each colliding bucket of the L hash tables
19: end for
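Steps 4 and 8 of Algorithm 1 build a tf−idf vector for the tweet and compare colliding tweets by cosine similarity. The following self-contained sketch implements those two formulas over a toy three-tweet corpus; the vocabulary handling is our own simplification rather than the authors' incremental implementation.

```python
import math
from collections import Counter

# toy "stream so far": D, the tweets processed up to this point
corpus = [
    ["explosion", "reported", "paris", "stadium"],
    ["explosion", "at", "paris", "stadium"],
    ["football", "match", "tonight"],
]

def tf_idf_vector(doc, corpus, vocab):
    """tf-idf(t, d, D) = tf(t, d) * log(N / |{d in D : t in d}|)."""
    N = len(corpus)
    tf = Counter(doc)
    vec = []
    for t in vocab:
        df = sum(1 for d in corpus if t in d)      # tweets containing term t
        vec.append(tf[t] * (math.log(N / df) if df else 0.0))
    return vec

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0

vocab = sorted({t for d in corpus for t in d})
v0, v1, v2 = (tf_idf_vector(d, corpus, vocab) for d in corpus)
# the two event tweets are far more similar to each other than to the third
assert cosine(v0, v1) > cosine(v0, v2)
```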

We have used the S-Space Package [21] to create the RI based vector representations for the tweets, and a threshold value in the range of [0.5 − 0.6] for cosine similarity is empirically set for the Search Module to determine the novelty of the input tweet. If the cosine similarity of the approximate nearest neighbor of the input tweet is below the threshold value, the input tweet is considered "unique", otherwise "not unique".

Text similarity between two tweet vectors d1 and d2 in Algorithm 1 is computed using the cosine similarity measure:

cos(θ) = ( Σ_{i=1}^{n} d1_i × d2_i ) / ( √(Σ_{i=1}^{n} (d1_i)²) × √(Σ_{i=1}^{n} (d2_i)²) )

In our experiments, we have found that the cosine similarity calculation is better suited to the tf−idf based term vector model than to the RI based vectors. Therefore we use the tf−idf based term vectors for the cosine similarity calculation, and the RI based vectors are only used for the hash key calculation. To generate the tf−idf based term vector for each tweet in Algorithm 1, the following formula is used:

tf−idf(t, d, D) = tf(t, d) × idf(t, D)

where t is a term in the input tweet d, D is the corpus representing the tweets processed so far, tf(t, d) is simply the number of times t is found in d, and

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N is the number of tweets processed so far and |{d ∈ D : t ∈ d}| is the total number of tweets in D in which the term t appears.

5. The EventCluster Module

The EventCluster Module incrementally clusters the tweets discussing the same topic and produces a set of candidate events. Algorithm 2 shows the pseudocode for the EventCluster Module, where the event threshold (tev) value for a tweet to be assigned to a cluster is empirically set to 0.6 and the defragmentation granularity (gev) value to merge fragmented events is empirically set in the range of [0.05 − 0.07].

The reason for sending only those tweets that are decided as "not unique" as input to the EventCluster Module is to reduce noise. If a tweet decided as "unique" were sent to the EventCluster Module, there is a chance that it might never have any similar tweets in the future, thereby unnecessarily creating a new event cluster to which no new members might be assigned and increasing the overall number of active clusters to be searched.

Although this strategy reduces the total number of clusters created, there are still a lot of clusters to search to decide to which cluster a tweet belongs. To mitigate this problem, each cluster has an expiry time associated with it. When a cluster is created, an initial expiry time tsi for the cluster is set. Each time a new tweet is added to the cluster c, the expiry time is updated based on the average timestamp difference between the arrivals of successive tweets in c. Once an event cluster has expired, it is marked as inactive to avoid further similarity comparison with any tweet that arrives in the EventCluster Module.

The time complexity of Algorithm 2 is O(m), where m is the number of clusters. However, from the point in

Algorithm 2. TwitterNews EventCluster Module
Require: tweet d decided as "not unique" by Algorithm 1, event threshold value tev, and defragmentation granularity value gev
1: simmax ← 0
2: for each active event cluster c in C do   ▷ parallel processing
3:   tempSim = findClosestCentroid(c, dtermVector)   ▷ measures the cosine similarity between the centroid of the cluster and the tweet term vector
4:   if tempSim > simmax then
5:     simmax = tempSim
6:   end if
7:   if tempSim ≥ (tev + gev) then
8:     Sc ← assign c to the set of fragmented clusters to be merged later
9:   end if
10: end for
11: if simmax > tev then
12:   assign d to the closest matching cluster csimMax
13:   updateCentroid(csimMax, dtermVector)   ▷ cluster centroid updated by averaging with the tweet vector
14:   merge the clusters from the set Sc with csimMax
15:   update the centroid of csimMax   ▷ cluster centroid updated by averaging with the event centroids in Sc
16:   update the expiry time of csimMax
17: else
18:   create a new cluster cnew and assign d to it
19:   assign dtermVector as the centroid of cnew
20:   assign an initial expiry time to cnew
21: end if
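The clustering loop of Algorithm 2 can be condensed into the following sketch. It is a simplified stand-in, not the TwitterNews implementation: the expiry-time bookkeeping is omitted, the centroid updates are naive averages, and tev = 0.6 and gev = 0.05 are taken from the values reported in the text.

```python
import numpy as np

T_EV = 0.6    # event threshold t_ev (value from the text)
G_EV = 0.05   # defragmentation granularity g_ev (within the reported range)

clusters = []   # each cluster: {"centroid": vector, "n": tweet count, "active": flag}

def cosine(a, b):
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

def event_cluster(vec):
    """Assign one bursty tweet vector to a cluster (Algorithm 2, simplified)."""
    sim_max, best, fragments = 0.0, None, []
    for c in clusters:
        if not c["active"]:
            continue
        sim = cosine(c["centroid"], vec)
        if sim > sim_max:
            sim_max, best = sim, c
        if sim >= T_EV + G_EV:
            fragments.append(c)            # likely fragments of the same event
    if sim_max > T_EV:
        best["centroid"] = (best["centroid"] * best["n"] + vec) / (best["n"] + 1)
        best["n"] += 1
        for c in fragments:                # defragmentation: merge into `best`
            if c is not best:
                best["centroid"] = (best["centroid"] + c["centroid"]) / 2
                best["n"] += c["n"]
                c["active"] = False
        return best
    new = {"centroid": vec, "n": 1, "active": True}
    clusters.append(new)
    return new

c1 = event_cluster(np.array([1.0, 0.0, 1.0]))   # no match: creates a cluster
c2 = event_cluster(np.array([1.0, 0.1, 1.0]))   # near-duplicate: joins it
assert c1 is c2 and len(clusters) == 1
```

Note how a tweet that is simultaneously very close to several clusters (similarity ≥ tev + gev) triggers the merge path, which is what repairs fragmented events.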

time of system execution where the clusters start becoming inactive, the total number of active clusters remains fairly constant. As we search only the active clusters, the cost of the EventCluster Module effectively becomes O(1). Based on our experiments, we found that setting the tsi value in the range of [8 − 15] minutes provides good results while reasonably limiting the total number of active clusters.

Any incremental algorithm such as ours for the EventCluster Module suffers from fragmentation. We have employed a defragmentation strategy to avoid cluster fragmentation as much as possible. The defragmentation strategy is also helpful for merging clusters that are sub-events of an event. While searching for the cluster that is closest in similarity to the input tweet, we also keep track of the clusters in a set Sc whose cosine similarity with the input tweet is > tev + gev, as shown in Algorithm 2. After we assign the tweet to the closest matching cluster (given that the similarity is > tev), all the clusters in Sc are merged to achieve defragmentation.

6. Experiment Results and Evaluation

Corpus. In our experiments, we have used the Events2012 corpus provided by McMinn et al. [22]. The corpus contains 120 million tweets from the 9th of October to the 7th of November, 2012. Along with this corpus, 506 events, with relevant tweets for these events, are provided as ground truth. Initially, the authors [22] have generated a set of candidate events using three different methods. The first two methods are the LSH approach of Petrovic et al. [12], and the Cluster Summarization approach of Aggarwal and Subbian [23]. The third approach is to simply extract the events found in the Wikipedia current events portal2 which are within the dates covered by the corpus. Afterwards, for each event found on Wikipedia, relevant tweets of that event are retrieved from a Lucene3 indexed version of the corpus. The candidate events produced by each of these three methods are combined using clustering, and then the authors [22] have used crowdsourcing to generate the final set of 506 ground truth events for the time duration covered by the corpus.

Preprocessing. We have performed preprocessing on each tweet before it is sent to the Search Module. Each tweet is tokenized using Twokenize [24]. Tokens containing usernames/mentions, stop words, and URLs are removed in the preprocessing phase as they do not contribute to clustering the event related tweets. Tokens containing hashtags are retained, as hashtags often contain important information.

Reporting newsworthy events. We have conducted experiments on the first 3 days (9th to 11th of October, 2012) of approximately 17 million tweets from the Events2012 corpus. The candidate events generated by TwitterNews are then filtered by applying a combination of different filters to retain only the newsworthy events that will be reported by our system. The first level of filters retains the candidate events with an entropy > 2.5 and a user diversity > 0.0. The entropy threshold ensures that a minimum amount of information is contained in an event cluster, and a positive user diversity value ensures that the candidate event contains tweets from more than one user. The Entropy [12] and User

2. http://en.wikipedia.org/wiki/Portal:Current events
3. https://lucene.apache.org/

The Entropy [12] and User Diversity [20] values of an event cluster c are computed as:

Entropy = − Σ_i (n_i / N) log(n_i / N)

where n_i is the number of times word i appears in a cluster and N is the total number of words in that cluster.

UserDiversity = − Σ_i (u_i / T) log(u_i / T)

where u_i is the number of tweets published by user i in a cluster, and T is the total number of tweets in that cluster.

For the second level of filtering, we employ a word-level Longest Common Subsequence (LCS) based filter. The idea is based on empirical evidence found from inspecting the candidate event set. We have noticed that the news propagated by general users or news agencies usually follows a similar sentence structure. If we apply the traditional word-level LCS algorithm to the relevant tweets of an event cluster c, the length of the LCS can be used as a determining factor to decide whether c is about a newsworthy event. If an event cluster c has an LCS whose length is below a certain threshold, meaning there are no tweets in c with an appropriate level of similarity in their sentence structure, then c is not likely to be a newsworthy event.

The LCS based scheme also selects a representative tweet from the event cluster, by emitting the tweet having the maximum LCS in an event. Before applying the LCS based scheme to the set of candidate events, all the tweets of each event cluster that do not contain at least one proper noun or possessive noun are discarded. Doing so reduces the total number of tweets in a cluster by discarding the tweets that do not contain any useful information. Table 1 shows representative tweets of the newsworthy events reported by TwitterNews. Note that only one event representative tweet from each day is shown to keep Table 1 concise.

TABLE 1. REPRESENTATIVE TWEETS OF SELECTED NEWSWORTHY EVENTS REPORTED BY TWITTERNEWS

Date          Event Representative Tweet
09 Oct 2012   Retweet If You’re Watching The “BET Hip Hop Awards 2012”
10 Oct 2012   BAE-EADS merger plans are ‘off’: Aerospace and defence firms BAE and EADS have cancelled their planned merger, t... http://bbc.in/Tvu31e
11 Oct 2012   Nobel Prize for literature awarded - Mo Yan of China won the prize for his novel “Frog”, which explores the traditio... http://ow.ly/2sCWey

Evaluation. Due to the restriction imposed by Twitter, the Events2012 corpus only contains unique tweet IDs, using which the tweets belonging to the corpus need to be downloaded. After downloading the tweets, we inspected the corpus and discovered that a large number of tweets (around 30%) belonging to the corpus were not downloaded as they are no longer available. The effect of a partially incomplete corpus, due to the unavailability of the tweets, is going to negatively impact the results produced by our system. To remedy this problem we decided to manually reconfirm the ground truth events provided by the authors [22]. However, there are a total of 506 ground truth events spanning from the 9th of October to the 7th of November, 2012. As this can take a substantially long time, we reconfirmed only the first three days (9th to 11th of October, 2012) of the ground truth events and manually selected a total of 41 events that belong to our selected time window. Further inspection of these 41 events was required to identify and remove the events which contained a large number of unavailable tweets. Doing so led us to a final set of 31 events to be used as the ground truth.

As we have the reconfirmed ground truth events only for a specific time period, we ran our experiments on the tweets spanning the same time period from the corpus. We used the LSH scheme proposed by Petrovic et al. [12] as our baseline, which achieved a recall of 0.52 by identifying 16 events out of the 31 ground truth events. McMinn et al., in their later work [14], used the same baseline over the Events2012 corpus and reported that the baseline system achieves a very low precision. As calculating the precision is a time consuming task, we have skipped this step for our baseline, which has been reported to have a low precision on the same corpus.

TwitterNews achieved a recall of 0.87 by identifying 27 events out of the 31 ground truth events. A total of 1619 events were reported by TwitterNews within the time window of three days. This high number of events obviously amounts to a very low precision with respect to the ground truth. However, McMinn et al. [22] have noted in their later work [14] that a lot of additional events can be detected from the Events2012 corpus that were not originally in the ground truth data set they provided along with the corpus. Hence, instead of calculating the precision with respect to the ground truth, we used two human annotators to determine the precision of 100 randomly chosen events out of the reported 1619 events. The precision is calculated as the fraction of the 100 randomly chosen events that are related to realistic events. The agreement between the two annotators, measured using Cohen’s kappa coefficient, was 0.84, and the precision of TwitterNews reported by the annotators was 0.72 (72 out of 100 events were agreed to be newsworthy events by both annotators). The results of our evaluation are shown in Table 2.

TABLE 2. SUMMARY OF THE EVALUATION RESULTS

Methods         Recall   Precision
Baseline [12]   0.52     -
TwitterNews     0.87     0.72

7. Conclusion

TwitterNews incorporates random indexing based term vectors with locality sensitive hashing to make a fast decision on the novelty of an input tweet.
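The LCS-based second-level filter and the representative-tweet selection described above can be sketched as follows. The minimum-length threshold of four words and the choice of the representative from the best-matching pair are assumptions for this sketch; the text specifies only that clusters below a certain LCS length are discarded and that the tweet with the maximum LCS is emitted.

```python
def lcs_length(a, b):
    """Word-level longest common subsequence length of two tweets,
    computed with the standard dynamic-programming recurrence."""
    ta, tb = a.split(), b.split()
    dp = [[0] * (len(tb) + 1) for _ in range(len(ta) + 1)]
    for i, wa in enumerate(ta, 1):
        for j, wb in enumerate(tb, 1):
            if wa == wb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(ta)][len(tb)]

def second_level_filter(tweets, min_lcs=4):
    """Return a representative tweet when some pair of tweets in the cluster
    shares an LCS of at least min_lcs words; otherwise return None, meaning
    the cluster is filtered out as unlikely to be newsworthy."""
    best, representative = 0, None
    for i in range(len(tweets)):
        for j in range(i + 1, len(tweets)):
            length = lcs_length(tweets[i], tweets[j])
            if length > best:
                best, representative = length, tweets[i]
    return representative if best >= min_lcs else None
```

The pairwise scan is quadratic in the cluster size, which is workable here because the preceding proper-noun filter has already shrunk each cluster.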
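That novelty decision can be illustrated with random hyperplane hashing over term vectors. This is a simplified single-table sketch under assumed parameters (vector dimensionality, number of hash bits, and cosine threshold are all illustrative); a production system would typically use several hash tables and bounded buckets to keep space constant.

```python
import math
import random

random.seed(0)
DIM, BITS = 100, 12  # assumed sizes for this sketch
HYPERPLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def signature(vec):
    """LSH bucket key: the sign pattern of vec against the random hyperplanes.
    Vectors with high cosine similarity tend to land in the same bucket."""
    return tuple(1 if dot(h, vec) > 0 else 0 for h in HYPERPLANES)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

buckets = {}  # signature -> term vectors of previously seen tweets

def is_novel(vec, threshold=0.5):
    """Treat a tweet as novel when no earlier tweet in its bucket is
    sufficiently similar; either way, remember it for later checks."""
    neighbours = buckets.setdefault(signature(vec), [])
    novel = all(cosine(vec, n) < threshold for n in neighbours)
    neighbours.append(vec)
    return novel
```

Only the tweets that hash into the same bucket are compared, which is what keeps the per-tweet cost roughly constant in a streaming setting.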

The incremental clustering based approach adopted in our system, along with the defragmentation sub-module, provides an efficient way to cluster event related tweets. Both of the main components of our system maintain a constant space and processing time, which is an important requirement in a true streaming setting. We have shown that the result achieved by TwitterNews outperforms the state-of-the-art baseline by a big margin. Note that our exploration of the Events2012 corpus gave us a clear idea of why our system missed some of the events from the ground truth event set. TwitterNews missed only those events which are not bursty in nature, meaning the event related tweets were spread out in the corpus with big time gaps between them. In our future work, we will focus on detecting events that are less bursty in nature (e.g., events with a burst of only a few tweets that is too small to measure, or events having a small number of tweets with big gaps between successive tweet arrivals). In addition, we intend to devise an approach that will allow us to discard spam and neutral event clusters in order to achieve a higher precision. Finally, we plan to perform a detailed sensitivity analysis on the different parameter settings used in our system.

References

[1] W. Dou, K. Wang, W. Ribarsky, and M. Zhou, “Event detection in social media data,” in Proceedings of the IEEE VisWeek Workshop on Interactive Visual Text Analytics - Task Driven Analytics of Social Media Content, 2012, pp. 971–980.
[2] C. Li, A. Sun, and A. Datta, “Twevent: Segment-based event detection from tweets,” in Proceedings of the ACM International Conference on Information and Knowledge Management, ser. CIKM ’12. ACM, 2012, pp. 155–164.
[3] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C. Miller, “TwitInfo: Aggregating and visualizing microblogs for event exploration,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’11. New York, NY, USA: ACM, 2011, pp. 227–236.
[4] M. Mathioudakis and N. Koudas, “TwitterMonitor: Trend detection over the Twitter stream,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’10. New York, NY, USA: ACM, 2010, pp. 1155–1158.
[5] G. Stilo and P. Velardi, “Efficient temporal mining of micro-blog texts and its application to event discovery,” Data Mining and Knowledge Discovery, pp. 1–31, 2015, Springer.
[6] R. Parikh and K. Karlapalem, “ET: Events from tweets,” in Proceedings of the 22nd International Conference on World Wide Web, ser. WWW ’13 Companion. New York, NY, USA: ACM, 2013, pp. 613–620.
[7] J. H. Lau, N. Collier, and T. Baldwin, “On-line trend analysis with topic models: #Twitter trends detection topic model online,” in Proceedings of the International Conference on Computational Linguistics, COLING, 2012, pp. 1519–1534.
[8] Y. Hu, A. John, D. D. Seligmann, and F. Wang, “What were the tweets about? Topical associations between public events and Twitter feeds,” in Proceedings of the International AAAI Conference on Web and Social Media, ICWSM, 2012, pp. 154–161.
[9] A. Ritter, Mausam, O. Etzioni, and S. Clark, “Open domain event extraction from Twitter,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’12. New York, NY, USA: ACM, 2012, pp. 1104–1112.
[10] D. Zhou, L. Chen, and Y. He, “An unsupervised framework of exploring events on Twitter: Filtering, extraction and categorization,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2015, pp. 2468–2475.
[11] R. Xie, F. Zhu, H. Ma, W. Xie, and C. Lin, “CLEar: A real-time online observatory for bursty and viral events,” Proceedings of the VLDB Endowment, vol. 7, no. 13, pp. 1637–1640, 2014, VLDB Endowment.
[12] S. Petrović, M. Osborne, and V. Lavrenko, “Streaming first story detection with application to Twitter,” in Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ser. HLT ’10. Stroudsburg, PA, USA: ACL, 2010, pp. 181–189.
[13] F. Atefeh and W. Khreich, “A survey of techniques for event detection in Twitter,” Computational Intelligence, vol. 31, no. 1, pp. 132–164, 2015, Wiley Online Library.
[14] A. J. McMinn and J. M. Jose, “Real-time entity-based event detection for Twitter,” in Proceedings of the Experimental IR Meets Multilinguality, Multimodality, and Interaction: 6th International Conference of the CLEF Association, ser. CLEF ’15. Springer, 2015, pp. 65–77.
[15] M. Sahlgren, “An introduction to random indexing,” in Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, vol. 5, 2005.
[16] F. Alvanaki, M. Sebastian, K. Ramamritham, and G. Weikum, “EnBlogue: Emergent topic detection in web 2.0 streams,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’11. New York, NY, USA: ACM, 2011, pp. 1271–1274.
[17] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics: Real-world event identification on Twitter,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, ICWSM, 2011.
[18] L. Derczynski, A. Ritter, S. Clark, and K. Bontcheva, “Twitter part-of-speech tagging for all: Overcoming sparse and noisy data,” in Proceedings of the Recent Advances in Natural Language Processing, RANLP, 2013, pp. 198–206.
[19] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, and A. Jaimes, “Sensing trending topics in Twitter,” IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1268–1282, 2013, IEEE.
[20] S. Kumar, H. Liu, S. Mehta, and L. V. Subramaniam, “From tweets to events: Exploring a scalable solution for Twitter streams,” arXiv preprint arXiv:1405.1392, 2014.
[21] D. Jurgens and K. Stevens, “The S-Space package: An open source package for word space models,” in Proceedings of the ACL System Demonstrations. ACL, 2010, pp. 30–35.
[22] A. J. McMinn, Y. Moshfeghi, and J. M. Jose, “Building a large-scale corpus for evaluating event detection on Twitter,” in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, ser. CIKM ’13. New York, NY, USA: ACM, 2013, pp. 409–418.
[23] C. C. Aggarwal and K. Subbian, “Event detection in social streams,” in Proceedings of the SIAM International Conference on Data Mining, SDM, vol. 12. SIAM, 2012, pp. 624–635.
[24] O. Owoputi, B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith, “Improved part-of-speech tagging for online conversational text with word clusters,” in Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ser. HLT ’13. ACL, 2013, pp. 380–391.
