SSRN Id3972525
Abstract— The growing number of data sources on the web has led to a flood of data. Separating the pertinent and genuine data is another problem that web-based media now face. Mobile phones and other electronic devices have become commonplace channels through which people receive information. This paper discusses methodologies for designing a personalized data scraping tool to fetch texts written in the Malayalam language from Facebook.com. In addition, the paper outlines a new approach for assessing the genuineness of text and news content in English. The news prediction is done using techniques such as TF-IDF, bag of words, and natural language processing.

Keywords — Web scraping, Text Summarization, NER, Malayalam, social media, Natural language processing.

I. INTRODUCTION

With rapid advances in technology, people and devices are connected everywhere, and social media has become the largest, deepest, and richest source of data. People use these sites to upload and post their photographs, opinions, activities, and reviews about anything. From children to the elderly, social media has had a profound impact on daily life. The rapid adoption of social media for everything has therefore increased the sharing of information between users who do not know whether it is fake or genuine. People share news without checking its authenticity.

Fake news takes different forms: targeted misleading information shared through social media by exploiting someone's interests; fake headlines that depict bogus facts, created to attract additional attention; and viral posts shared across social media without checking their authenticity. Misinformation has become a daily phenomenon in the changing media landscape, where readers are becoming publishers. Anyone can create misinformation and share it on social platforms. The main objective of fake news creators is to mislead newsreaders [1]. Such news has a target, such as creating an issue in society or damaging the dignity of an individual or an organization. As a result, detecting fake news and assessing the quality of news has become an increasingly important skill. People should check the authenticity of news articles before sharing them, which is an important factor in reducing the spread of anonymous information.

One of the main areas of Natural Language Processing (NLP) is dealing with text content. Preprocessing must be completed carefully before developing the language model. Preprocessing of Malayalam text can be implemented with suitable language packages and tools such as NLTK, root-pack, and tokenization classes. Removal of punctuation and HTML tags, handling of code mixing, sentence tokenization, word tokenization, stemming, and root word identification are a few of the essential text preprocessing tasks. Text summarization (in abstractive or extractive mode), sentiment analysis, and named entity recognition are further steps that come under NLP.

II. LITERATURE SURVEY

This section discusses the algorithms, methodologies, and techniques conventionally used to implement fake news detection systems. Fake news detection has been a prominent research topic in recent years, and many investigators have conducted experiments on implementing fake news identification systems. Prior research supports the use of machine learning algorithms such as the support vector machine and the Naïve Bayes classifier, along with NLP methods, sentence similarity, and classification algorithms, which are the most widely used for detecting fake news.
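Sentence similarity, one of the techniques listed in the survey, is commonly computed on TF-IDF vectors, the weighting scheme named in the abstract. The following is a minimal sketch of that idea using only the Python standard library. The function names (`tokenize`, `tfidf_vectors`, `cosine`), the whitespace tokenizer, the smoothed IDF formula, and the sample headlines are illustrative assumptions, not the implementation used in this paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (n / len(tokens)) * idf[t] for t, n in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical headlines, for illustration only.
    docs = [
        "government announces new health policy",
        "new health policy announced by government",
        "aliens land in city park",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A high similarity between an input item and text scraped from trusted sources could serve as one signal of genuineness; how such a score is thresholded or combined with a classifier is a design choice this sketch leaves open.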
Hoaxy: A Fraud Online Tracking Platform [11] collects news from social platforms and news websites using web scraping and web syndication. It measures user activity by counting the number of tweets users post, and the popularity of a URL by counting the total number of people who have posted the tweet.

Fake news detection using deep learning techniques [12] draws attention to a fake news detection system based on classifiers such as logistic regression (LR), Naïve Bayes (NB), support vector machine (SVM), random forest (RF), and deep neural network (DNN).

A detailed study by Said. A et al., about techniques in text mining from social media sites like Facebook and Twitter, discussed the problem of converting irregular texts in social media into a regular form, the role

To deal with summarization of Malayalam text, Kanitha et al. proposed a graph-theoretic approach to produce an accurate summary of Malayalam texts and reported that the generated summaries were 51% similar to human-generated summaries [23]. Rajina and Sumam (2014) proposed a statistical sentence scoring technique and a semantic graph-based technique for text summarization. Preprocessing and POS tagging of Malayalam sentences, combined with the scoring and graph-based algorithms, produced effective summaries of Malayalam documents [24]. Krishnaprasad et al. proposed an algorithm for generic summarization of a single Malayalam document that generates a rank for each word in the document, then extracts the top N ranked sentences and arranges them in chronological order to form an extractive summary [25].
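The word-rank approach to extractive summarization attributed to Krishnaprasad et al. [25] can be illustrated with a short sketch: rank words, score sentences by their words, keep the top N sentences, and restore the original order. The details here are assumptions made for illustration: word frequency stands in for the word rank, a sentence score is the average rank of its words, and the splitter relies on `.`, `!`, and `?`. The algorithm in [25] may differ on all three points.

```python
import re
import string
from collections import Counter


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(sentence):
    """Whitespace tokens, lowercased, with surrounding punctuation stripped."""
    tokens = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [t for t in tokens if t]


def summarize(text, n):
    """Return the n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)

    # Assumed word rank: how often the word occurs in the whole document.
    word_rank = Counter(w for s in sentences for w in words(s))

    scored = []
    for idx, sentence in enumerate(sentences):
        tokens = words(sentence)
        score = sum(word_rank[w] for w in tokens) / len(tokens) if tokens else 0.0
        scored.append((score, idx))

    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]

    # Restore document (chronological) order for the summary.
    return [sentences[idx] for idx in sorted(idx for _, idx in top)]


if __name__ == "__main__":
    sample = (
        "The river floods the old town. Birds sing. "
        "The river floods the town and the river."
    )
    print(summarize(sample, 2))
```

Whitespace tokenization is used instead of `\w+` because combining vowel signs in Malayalam script are not matched by `\w`, which would split words incorrectly.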
The methodologies used in the proposed model are given below.

[Diagram blocks: input news to the system; web scraping (Google, Yahoo, Ask, Lycos, Bing); preprocessing; comparison; prediction]
Figure 1: System Architecture

Figure 1 depicts the architecture of the system. The user enters the news item whose trustworthiness is to be checked. The system then carries out web scraping.

Data should be pre-processed to obtain better results. Preprocessing comprises the removal of URLs, punctuation, and stop words, as well as stemming. At this stage, natural language processing methods are used. Natural Language Processing allows us to find the key information in the data.

[Diagram: preprocessing steps. Punctuation, HTML tags, and numbers removal; sentence tokenization; word tokenization; stop words removal; stemming & lemma]
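The preprocessing steps named above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pipeline used in the system: the stop word list and the suffix rules are tiny English placeholders, the helper names are hypothetical, and sentences are split before punctuation is removed so that sentence boundaries are still visible. Malayalam text would need a language-specific stop word list and stemmer.

```python
import re
import string

# Tiny illustrative stop word list (assumption, English only).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "at"}

# Toy suffix-stripping rules standing in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_markup(text):
    """Remove HTML tags and URLs."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"https?://\S+", " ", text)


def split_sentences(text):
    """Sentence tokenization on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_tokens(sentence):
    """Remove punctuation and numbers, then tokenize into lowercase words."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    no_digits = re.sub(r"\d+", " ", no_punct)
    return no_digits.lower().split()


def stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(strip_markup(text)):
        tokens = [stem(t) for t in clean_tokens(sentence) if t not in STOP_WORDS]
        if tokens:
            result.append(tokens)
    return result


if __name__ == "__main__":
    raw = ("<p>The minister visited 3 flooded towns.</p> "
           "Read more at https://example.com/news now!")
    print(preprocess(raw))
```

In practice a library such as NLTK would replace the toy tokenizer, stop word list, and stemmer; the sketch only shows how the stages chain together.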
getting live news updates. One more Python library, NumPy, is used for working with arrays. Flask, a Python web framework, is used in the system to build the web application.

V. EXPERIMENTAL RESULTS

A sample data set was used to measure the accuracy of the prediction. In the sample model, 0 represents fake news and 1 represents real news. A sample test is run on the system using the test set. The demonstration is implemented in Python with a machine learning algorithm.

The accuracy of predicting fake news, real news, and the mixed data of both fake and real news is given in the table below.

TABLE 2: SYSTEM ACCURACY

                        Fake    Original    Mixed
Total number of news    32      34          66
Prediction              29      31          60
Accuracy (%)            90.6    91.1        90.9

[Bar chart: accuracy (%) for MIXED (90.9), FAKE (90.6), and REAL (91.1)]
Figure 3: Accuracy of system using chart

Figure 3 shows that, for fake news (true fake), the system achieves an accuracy of 90.6%; the remainder is false fake. For genuine news (true real), the system achieves an accuracy of 91.1%; the remainder is false real. For the mixed set of both real and fake news, the model achieves an accuracy of 90.9%, which is the overall accuracy of the system.

These results offer vital evidence for the detection of news as fake or real.

VI. CONCLUSION