Project Report Sentiment Analysis On Twitter Using Apache Spark
net/publication/320625064
All content following this page was uploaded by Deepesh Khaneja on 26 October 2017.
PROJECT REPORT
Stop words: Tweets contain stop words, which are extremely common words such as “is”, “am” and “are” that hold no additional information. These words serve no purpose for sentiment analysis, so they are removed using a list stored in stopfile.dat. We compare each word in a tweet with this list and delete the words that match it, as shown in the figure.
Fig. 4 Code snippet for stop words removal
Removing non-alphabetical characters: Symbols such as “#@” and numbers hold no relevance for sentiment analysis and are removed using pattern matching. Regular
Sentiment Analysis output: The output contains a list of tweets in real time, with each tweet's sentiment score on the left-hand side. The first tweet has a score of -2, which is due to two negative keywords. The next two tweets are positive, as they contain keywords like “good” and “great”; both of these words are in the positive words list. Note that a tweet with a score of 0 is excluded from the final output: neutral tweets serve no purpose, as they do not convey any sentiment towards the product.
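The preprocessing and scoring steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's Spark code: the word sets below are hypothetical stand-ins for stopfile.dat and for the positive and negative word lists, whose contents the report does not show, and the rule of +1 per positive keyword and -1 per negative keyword is an assumption inferred from the statement that a score of -2 is due to two negative keywords.

```python
import re

# Hypothetical stand-ins: the report loads stop words from stopfile.dat and
# keeps separate positive/negative word lists; their contents are not shown.
STOP_WORDS = {"is", "am", "are", "the", "a", "this"}
POSITIVE_WORDS = {"good", "great"}
NEGATIVE_WORDS = {"bad", "terrible"}

# Matches anything that is not a lowercase letter or whitespace.
NON_ALPHA = re.compile(r"[^a-z\s]")


def preprocess(tweet):
    """Lowercase, strip non-alphabetical characters, and drop stop words."""
    cleaned = NON_ALPHA.sub(" ", tweet.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def sentiment_score(tweet):
    """Unigram score (assumed rule): +1 per positive keyword, -1 per negative keyword."""
    words = preprocess(tweet)
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def score_tweets(tweets):
    """Return (score, tweet) pairs, leaving out neutral tweets (score 0)."""
    scored = ((sentiment_score(tweet), tweet) for tweet in tweets)
    return [(score, tweet) for score, tweet in scored if score != 0]


if __name__ == "__main__":
    sample = [
        "This phone is bad, the battery is terrible!",
        "The camera is good #happy",
        "I am at the store @ 5pm",
    ]
    for score, tweet in score_tweets(sample):
        print(score, tweet)
```

In a streaming setting, the same functions could in principle be applied to each incoming tweet; how the original project wires them into Spark is not shown in the report.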
Fig. 6 Sentiment analysis result
The last tweet is the most negative tweet, with a sentiment score of -2; it contains an abusive word that is not shown. Negative tweets indicate hate and dislike towards a product or public figure. The results here indicate that people do not hate Donald Trump as portrayed in the media and news, since the general sentiment regarding Trump in the results is positive.
and get analyzed results. In this project, we have worked only with unigram models, but we would like to extend it to bigrams and higher-order n-grams, which would capture more linkage between words and provide more accurate sentiment analysis results. An overall tweet score can also be computed for a single keyword, which would provide the overall public sentiment regarding a topic.
VII. CONCLUSION
Twitter is a source of vast, unstructured and noisy data sets that can be processed to locate interesting patterns and trends. Apache Spark proved effective in extracting live streams of data and can also store batches of data in HDFS and other major conventional storage systems. The processing capabilities of Spark make the project flexible enough to extend to multiple nodes, thereby supporting distributed computing. Real-time data analysis makes it possible for business organizations to keep track of their services and generates opportunities to promote, advertise and improve from time to time.