Project Report Sentiment Analysis On Twitter Using Apache Spark

Technical Report · October 2017


DOI: 10.13140/RG.2.2.10737.79200

All content following this page was uploaded by Deepesh Khaneja on 26 October 2017.

SYSC 5807 - ADVANCED TOPICS IN COMPUTER SYSTEMS

PROJECT REPORT

SENTIMENT ANALYSIS ON TWITTER USING APACHE SPARK

Amandeep Kaur, Deepesh Khaneja, Khushboo Vyas, Ranjit Singh Saini

Submitted to Dr. Imran Ahmad

8th April, 2016
Carleton University

Abstract—Social media websites have emerged as one of the main platforms for users to voice opinions and to influence how businesses market themselves. People's opinions matter when analyzing how the propagation of information affects lives in a large-scale network like Twitter. Sentiment analysis of tweets determines the polarity and inclination of a large population towards a specific topic, item or entity. The applications of such analysis can be readily observed during public elections, movie promotions, brand endorsements and in many other fields. In this project, we used the fast, in-memory computation framework Apache Spark to extract live tweets and perform sentiment analysis. The primary aim is to provide a method for computing sentiment scores in noisy Twitter streams. This report describes the design of a sentiment analysis system that extracts a vast number of tweets. The results classify users' perceptions, as expressed in tweets, into positive and negative. We also discuss in detail various techniques for carrying out sentiment analysis on Twitter data.

Index Terms—Apache Spark, Sentiment Analysis, Twitter, Opinion mining

I. INTRODUCTION

As the internet grows, its horizons become wider. Social media and microblogging platforms like Facebook, Twitter and Tumblr dominate in spreading condensed news and trending topics across the globe at a rapid pace. A topic becomes trending when more and more users contribute their opinions and judgements, making it a valuable source of online perception [3]. These topics are generally intended to spread awareness or to promote public figures, political campaigns during elections, product endorsements and entertainment such as movies and award shows. Large organizations and firms take advantage of people's feedback to improve their products and services, which in turn helps improve their marketing strategies. One example is leaking pictures of an upcoming iPhone to create hype, gauge people's emotions and market the product before its release. Thus, there is huge potential in discovering and analysing interesting patterns in the endless stream of social media data for business-driven applications.

Sentiment analysis is the prediction of emotions in a word, sentence or corpus of documents. It is intended to serve as an application to understand the attitudes, opinions and emotions expressed within an online mention. The intention is to gain an overview of the wider public opinion behind certain topics. More precisely, it is a paradigm for categorizing conversations into positive, negative or neutral labels. Many people use social media sites to network with other people and to stay up to date with news and current events. These sites (Twitter, Facebook, Instagram, Google+) offer people a platform to voice their opinions. For example, people quickly post reviews online as soon as they watch a movie and then start a series of comments discussing the acting in it. This kind of information forms a basis for people to evaluate and rate the performance not only of movies but also of other products, and to judge whether they will be a success. The vast information on these sites can be used for marketing and social studies [1]. Sentiment analysis therefore has wide applications, including emotion mining, polarity classification and influence analysis.

Twitter is an online networking site driven by tweets, which are messages limited to 140 characters. The character limit encourages the use of hashtags for text classification. Currently around 6500 tweets are published per second, which amounts to approximately 561.6 million tweets per day [1]. These streams of tweets are generally noisy, reflecting multi-topic information and changing attitudes in an unfiltered and unstructured format. Twitter sentiment analysis involves the use of natural language processing to extract, identify and characterize the sentiment content. Sentiment analysis is often carried out at two levels: 1) coarse level and 2) fine level. At the coarse level, entire documents are analyzed, while at the fine level, attributes are analyzed [3]. The sentiments present in text are of two types: direct and comparative. Comparative sentiments involve a comparison of objects in the same sentence, while in direct sentiments the objects are independent of one another in the same sentence.

However, analyzing the sentiment expressed in tweets is not an easy job. Many challenges arise in terms of the tonality, polarity, lexicon and grammar of tweets. They tend to be highly unstructured and non-grammatical, which makes their meaning difficult to interpret. Moreover, slang words, acronyms and out-of-vocabulary words are quite common in tweets, and categorizing such words by polarity is difficult for the natural language processors involved. This project uses Apache Spark's fast processing capabilities to analyze sentiment from such high-velocity, real-time tweets.

The rest of this project report is structured as follows. Section II reviews related work and highlights its important features. Section III gives brief details about the technologies used. Section IV covers the methodology and implementation of the project. The problems we came across and the challenges we resolved during implementation are described in Section V. Section VI discusses future work. Finally, Section VII concludes the report.

II. LITERATURE REVIEW

This section summarizes some of the scholarly and research work in machine learning and data mining on analyzing sentiment on Twitter and building prediction models for various applications. As the number of social platforms grows, the available information is becoming vast and can be mined to support business objectives, social campaigns, marketing and other promotional strategies, as explained in [4]. The authors of [2] consider the benefit of social media for gauging public opinion and emotions, and explain how Twitter gives a political advantage during elections. They use hashtags for text classification, since a hashtag conveys emotion in a few words. They point out that previous research suffered from a lack of training data and missed some features of the target data. They adopted a two-stage approach for their framework: first preparing training data mined from Twitter that conveys the relevant features, and then proposing a supervised learning model to predict the results of the elections held in the USA in 2016. After collecting and preprocessing the tweets, the training data set was created first by manually labelling hashtags and forming clusters, and next by using the online sentiment analyzer VADER, which outputs the polarity as a percentage. This approach reduced the number of tweets in the training set; they then applied Support Vector Machine and Naive Bayes classification algorithms to determine the polarity of tweets. A multistage classification approach was used in which an entity classifier receives the general class of tweets and categorizes them with respect to individual candidates for comparison. The metric used to determine the winner was the “PvT ratio”, the number of positive tweets divided by the total number of tweets for the respective candidate.

Alhayyan and Ahmad [1] used Apache Spark for fast streaming of tweets and presented the StreamSensing approach to handle real-time data in unstructured and noisy form. They applied the approach to Twitter data to find useful and interesting trends, and it can be generalized to any real-time text stream. An unsupervised learning approach is used to locate interesting patterns and trends in tweets processed on Apache Spark. Inspired by the approaches described by Zhu et al. [7] and Li et al. [8] for mining data over a selected time window, the authors of [1] opted for a sliding window method to capture the live streams of tweets. The approach common to almost all relevant research work consists of data collection using the Twitter API, preprocessing and filtering of the data; the approaches to feature extraction, classification and pattern analysis are what distinguish them. The authors used a sliding window of 5 minutes during data collection and created a Term Document Matrix (TDM) for feature extraction. Pattern analysis was carried out by using TF-IDF scores to find the most important keywords, as explained in [9] by Wu et al. A trending topic or hashtag is fed in, tweets relevant to it are filtered to form the TDM, and TF-IDF weights are computed to find the most important words; this is the key idea of their analysis. Parallel computation of the TDM and TF-IDF scores, and determining the top 5 keywords from the TDM each minute as the sliding window moves, are among the highlights of this work. It thus leverages the fast computation power of Apache Spark.

In another work on sentiment analysis and influence tracking on Twitter [5], the authors also predicted the polarity (positive, negative or neutral) of tweets by building a classifier. In addition, they used multiple algorithms and methods to determine the influence of an active entity on the tweet patterns of users exhibiting certain emotions. They mined tweets only at the entity level, i.e. the brand, product or celebrity elements present in tweets, rather than the whole sentence posted by users. Their use of algorithms to extract features and to track impact and influence sets their work apart from the rest of the literature. The feature extraction process after preprocessing included constructing n-grams along with POS tagging, which handles negation and improves classification accuracy. For further analysis and for measuring influence, they chose two algorithms. The first is the PeopleRank algorithm [6], inspired by the PageRank algorithm used by Google. Its main idea is that the higher the PeopleRank value, the more central the node is in the graph, which reflects its importance on Twitter in terms of followers, retweets and mentions. The other is the TwitterRank algorithm, an extension of PageRank that determines the influence of users by considering the similarity between users and the link structure, i.e. the other users they are linked to. It was developed to address the shortcomings of PageRank. Influence is measured on the idea that popular or influential people who follow you act as a medium to broadcast a specific topic. Using ratios of parameters such as followers to following, retweets and mentions, they determined weights and finally derived a mathematical formula to track the influence of a specific entity. The proposed approach has the potential to identify influential personalities and entities on Twitter and can be used for promotional and branding purposes.

III. TWITTER SENTIMENT ANALYSIS

A) Introduction to Problem

Every day, social media users generate a massive amount of data that can be used to analyze their opinions about any event, movie, product or political issue. Stream processing tools such as Apache Storm handle a stream one record at a time, whereas Apache Spark's streaming module processes it in small micro-batches; both make analyzing and processing data in near real time possible.

B) Platform and Technologies

Several technologies and tools are used in the project. They are introduced below.

Apache Spark: It is an open-source, fast cluster computing platform that can retrieve streaming data and forward it to storage systems such as HDFS or a database server. It can run on top of Hadoop and integrates well with other Apache software. Apache Spark is an in-memory processing system used for large-scale data processing, and it has emerged as a successor to Hadoop MapReduce. It generalizes the MapReduce model and can process data up to 100 times faster in memory and up to 10 times faster on disk by partitioning data across different nodes. Its structure is based on Resilient Distributed Datasets (RDDs), which are read-only multisets of data partitioned and distributed across different nodes to provide fault tolerance and scalability. It overcomes a limitation of MapReduce, in which data was written to disk after each reduce step, and so suits iterative algorithms that fetch data from the same datasets repeatedly in a loop as well as repeated database-style querying of data. In this way, the latency involved is reduced, which makes it faster. An RDD is an abstraction: before data is processed, Spark lays down an execution plan and represents the computation as a Directed Acyclic Graph (DAG). The generated DAG acts as a framework for carrying out the analysis, the processing and the division into tasks. Further, it has an edge over other technologies because it is easy to use through its multiple available APIs. Other benefits include its built-in high-level libraries, which support SQL, machine learning, graph processing and streaming data. It can access data from different storage sources such as HDFS, Cassandra, HBase and S3.

Scala: It is a high-level language that supports both the functional and the object-oriented programming models. This gives it an edge over Java, which requires more code to be written for the same task. A major point in Scala's favour is that Apache Spark is itself implemented in Scala, and a vast number of Scala packages are available for Apache Spark. We therefore chose Scala for the implementation rather than Python or Java.

Twitter: It is an online social media platform that is suitable for our use case for a number of reasons. Firstly, the amount of relevant data is much larger for Twitter than for blogs or review websites. Secondly, responses on Twitter are general and prompt. Other social media giants like Facebook do not provide much data, so using their public APIs was not considered. Finally, most Twitter users voice their opinions about other people, such as actors, or about products, for example when they have bought a new phone or are unsatisfied with customer service, as opposed to other social media where users mostly post statuses and pictures of themselves. These factors make Twitter a logical choice for real-time data analysis.

IntelliJ Idea: It is an Integrated Development Environment used to build, run and test code. Its full edition is commercial, but the community edition of the software is provided free of cost. It supports the SBT plugin, which is used to import Apache Spark dependencies and build the project. We used the IntelliJ Idea professional edition along with the SBT plugin. SBT is a build tool and an alternative to Maven that makes it easy to define dependencies and import libraries.

IV. CASE STUDY

Our project uses Apache Spark to analyze real-time tweets. The objective of our case study is to find the polarity of the words in the retrieved tweets.

Fig. 2 Framework for Twitter Analysis

Each step in the framework involves several sub-tasks.

1. Data collection:
Data in the form of raw tweets is retrieved using the library “Twitter4j”, a Java library that can be called from Scala and that provides a package for the real-time Twitter streaming API. The API requires us to register a developer account with Twitter and fill in parameters such as consumerKey, consumerSecret, accessToken and accessTokenSecret. The API allows retrieving either a random sample of all tweets or tweets filtered by keywords. Filters retrieve tweets that match a specific criterion defined by the developer. We used them to retrieve tweets related to specific keywords taken as input from users. Initially, we set an application name and mode. We execute the program in local mode instead of on a cluster. Then, the input array of keywords is provided as an argument to the streaming context “ssc”, which is created using “sc”, the Spark context.

For example, on inputting multiple keywords such as 'Canada', 'Trump' and 'Toronto', the output obtained from a 15-second window was the live stream of tweets associated with these keywords. The only caveat of using filters is that popular keywords like “India” have more tweets than niche words like “Focusrite”, which makes it difficult to get data for niche keywords.

2. Data Processing:
Data processing involves tokenization, the process of splitting tweets into individual words called tokens. Tweets can be split on whitespace or punctuation characters. The tokens can be unigrams or bigrams, depending on the classification model used. The bag-of-words model is one of the most extensively used models for classification. It treats the text to be classified as a bag, or collection, of individual words with no link or interdependence between them. The simplest way to incorporate this model in our project is to use unigrams as features. A unigram model is just the collection of individual words in the text to be classified, so we split each tweet on whitespace.

For example, the tweet “Met aziz today !!” is split at each whitespace as follows.
{
Met
aziz
today
!!
}

The next step in data processing is normalization: each tweet is converted to lowercase, which makes its comparison with a dictionary easier. The function used is shown in Fig. 3.

Fig. 3 Code snippet for lowercase

3. Data Filtering:
A tweet obtained after data processing still contains raw information that may or may not be useful for our application. These tweets are therefore filtered further by removing stop words, numbers and punctuation.

Stop words: Tweets contain stop words, which are extremely common words like “is”, “am” and “are” that carry no additional information. These words serve no purpose for sentiment analysis, so they are removed using a list stored in stopfile.dat. We compare each word in a tweet with this list and delete the words that match, as shown in Fig. 4.

Fig. 4 Code snippet for stop words removal

Removing non-alphabetical characters: Symbols such as “#@” and numbers have no relevance for sentiment analysis and are removed using pattern matching. Regular expressions are used to match alphabetical characters only; the rest are ignored.

Fig. 5 Code snippet for removing non-alphabets

This helps to reduce the clutter in the Twitter stream.

Stemming: It is the process of reducing derived words to their roots. For example, the words “fishing” and “fishes” share the root “fish”. Stemming can be done with the Stanford NLP library, which also provides various algorithms such as Porter stemming. In our case, we have not employed any stemming algorithm due to time constraints.

4. Feature Extraction:
TF-IDF is a feature vectorization method used in text mining to reflect the importance of a term to a document in the corpus. Feature extraction can use the MLlib library of Apache Spark, for which the recommended API is the DataFrame-based API. This feature is useful when we need to find trending topics or to create word clouds. However, this project is focused on finding sentiment in Twitter streams, so TF-IDF is not implemented.

5. Sentiment Analysis
Sentiment analysis is done using a custom algorithm that finds polarity as described below.

Finding polarity: To discover the polarity, we used a simple algorithm that counts positive and negative words in a tweet. Separate lists were made for positive and negative words. Every word in a tweet is then compared against both lists. If the current word matches a word in the positive list, the score is incremented by 1, and if it matches a word in the negative list, the score is decremented by 1. More positive words lead to a higher sentiment score, as shown in Fig. 6. More accurate sentiment analysis could be obtained with Stanford NLP, which provides more complex algorithms for predicting sentiment.

Sentiment Analysis output: The output contains a list of tweets in real time, with the sentiment score of each on its left-hand side. The first tweet has a score of -2, which is due to two negative keywords. The next two tweets are positive, as they contain keywords like “good” and “great”; both these words are in the positive words list. Note that a tweet with a score of 0 is omitted from the final output, because neutral tweets serve no purpose: they do not convey any sentiment towards the product.
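
As a minimal sketch of steps 2, 3 and 5 above, the tokenization, lowercasing, filtering and word-counting logic can be expressed in a few lines. The example below is written in Python for brevity and is not the project's Scala code. The word lists, function names and sample tweets are assumptions made for illustration; the project read its stop words from stopfile.dat and used its own positive and negative lists.

```python
import re

# Assumed, illustrative word lists. The project loaded its stop words from
# stopfile.dat and maintained its own positive and negative word lists.
STOP_WORDS = {"is", "am", "are", "the", "a", "this"}
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "terrible", "hate"}

# Matches every character that is not a lowercase letter.
NON_ALPHA = re.compile(r"[^a-z]")


def preprocess(tweet):
    """Tokenize on whitespace, lowercase, strip non-letters, drop stop words."""
    tokens = tweet.split()                            # step 2: unigram tokens
    tokens = [t.lower() for t in tokens]              # step 2: normalization
    tokens = [NON_ALPHA.sub("", t) for t in tokens]   # step 3: letters only
    # Drop tokens that became empty (e.g. "!!") and stop words.
    return [t for t in tokens if t and t not in STOP_WORDS]


def sentiment_score(tokens):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    score = 0
    for token in tokens:
        if token in POSITIVE_WORDS:
            score += 1
        elif token in NEGATIVE_WORDS:
            score -= 1
    return score


def score_tweets(tweets):
    """Return (score, tweet) pairs, omitting neutral tweets (score 0)."""
    scored = [(sentiment_score(preprocess(t)), t) for t in tweets]
    return [(s, t) for s, t in scored if s != 0]


if __name__ == "__main__":
    sample = [
        "This phone is GREAT, good battery!!",
        "Met aziz today !!",
        "I hate this terrible service #fail",
    ]
    for score, tweet in score_tweets(sample):
        print(score, tweet)
```

Running the script prints only the first and third sample tweets with their scores, since the second contains no word from either list and is dropped as neutral. Stripping non-alphabetical characters before the stop-word lookup is a choice made in this sketch so that tokens such as “is,” still match the stop list.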

Fig. 6 Sentiment analysis result

The last tweet is the most negative tweet, with a sentiment score of -2; it contains an abusive word that is not shown. Negative tweets indicate hate and dislike towards a product or public figure. The results here suggest that people do not hate Donald Trump as portrayed in media and news, since the general sentiment regarding Trump in the results is positive.

V. DISCUSSION

Developing the project proved to be a lot more challenging than expected due to our relative inexperience with Apache Spark and Scala.

A) Project Limitation & challenges

The following challenges were faced during implementation.

Apache Spark memory error: Apache Spark has a setting for the memory allotted to the program, and the default value was less than what our application needed. The solution was to add the following parameters to the VM options in the IntelliJ Idea settings.

-Xms128m -Xmx512m -XX:MaxPermSize=300m -ea

Accessing country-specific tweets: There was no parameter in the Twitter API to restrict tweets to a specific country. This prevented us from retrieving tweets from a specific region for analysis, which could be future work.

Library dependencies: There were some initial challenges in building the application with SBT due to incompatible versions of Scala and the Scala SDK, as we had limited knowledge of the technologies we were using. Moreover, the given examples used outdated libraries, which we updated to the latest versions by comparing the given versions against the Maven repository.

VI. FUTURE WORK

In future, we would like to extend this project by implementing machine learning algorithms for applications such as predicting election results, product ratings and movie outcomes, and by running the project on clusters to expand its functionality. Moreover, we would like to build a web application in which users can input keywords and get analyzed results. In this project, we have worked only with unigram models, but we would like to extend it to bigrams and beyond, which will capture more linkage between words and provide more accurate sentiment analysis results. An overall tweet score can also be computed for a single keyword, which can give the overall public sentiment regarding a topic.

VII. CONCLUSION

Twitter is a source of vast, unstructured and noisy data sets that can be processed to locate interesting patterns and trends. Apache Spark proved effective in extracting live streams of data and can additionally store batches of data in HDFS and other major conventional storage systems. The processing capabilities of Spark make the project flexible enough to be extended to multiple nodes, thereby supporting distributed computing. Real-time data analysis makes it possible for business organizations to keep track of their services and creates opportunities to promote, advertise and improve from time to time.

Our heartfelt appreciation goes to Professor Imran Ahmad for his feedback over the course of the project, from the initial proposal to the conclusion, and for the valuable lessons learned along the way, including collaboration within a group and the challenges involved in large-scale software development efforts.

REFERENCES

[1] Dr. Khalid N. Alhayyan & Dr. Imran Ahmad, “Discovering and Analyzing Important Real-Time Trends in Noisy Twitter Stream”, n.p.
[2] J. Ramteke, S. Shah, D. Godhia, and A. Shaikh, “Election result prediction using Twitter sentiment analysis,” in Inventive Computation Technologies (ICICT), International Conference on, 2016, vol. 1, pp. 1–5.
[3] M. Desai and M. Mehta, “Techniques for sentiment analysis of Twitter data: A comprehensive survey”, 2016 International Conference on Computing, Communication and Automation (ICCCA), 2016.
[4] Alexander Pak and Patrick Paroubek, “Twitter as a corpus for sentiment analysis and opinion mining”, in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), May 2010.
[5] R. Mehta, D. Mehta, D. Chheda, C. Shah, and P. M. Chawan, “Sentiment analysis and influence tracking using twitter,” International Journal of Advanced Research in Computer Science and Electronics Engineering (IJARCSEE), vol. 1, no. 2, p. pp–72, 2012.
[6] Mtibaa, M. May, C. Diot and M. Ammar, “PeopleRank: Social Opportunistic Forwarding”, 2010 Proceedings IEEE INFOCOM, 2010.
[7] Zhu, Y. and Shasha, D. (2002). Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th Very Large Data Base Conference. 358–369.
[8] Li, H.-F. and Lee, S.-Y. (2009). Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Syst. Appl. 36, 2, 1466–1477.
[9] H. Wu, R. Luk, K. Wong and K. Kwok, “Interpreting TF-IDF term weights as making relevance decisions”, ACM Transactions on Information Systems, 26 (3), 2010.