SSRN Id3972525

Natural Language Processing of Malayalam Text for predicting its Authenticity

Rizwana Kallooravi Thandil, Research Department of Computer Science, SS College, Areekode
Mohamed Basheer K.P., Research Department of Computer Science, SS College, Areekode
Muneer V.K, Research Department of Computer Science, SS College, Areekode

Abstract: The growing number of data sources on the web has created a problem of information overload. Separating pertinent and genuine information from the rest is another problem that web-based media now face. Mobile phones and other electronic devices have become commonplace, and people use them to get up-to-date information. This paper discusses methodologies for designing a personalized data scraping tool to fetch texts written in the Malayalam language from [...]. In addition, the paper outlines a new approach for assessing the genuineness of text and news content in English. The news prediction is done by implementing techniques such as TF-IDF, bag of words, and natural language processing.

Keywords: Web scraping, Text Summarization, NER, Malayalam, social media, Natural language processing.

I. INTRODUCTION

With rapid advances in technology, people and devices are connected everywhere, and social media has become the largest, deepest, and richest source of data. People use these sites to upload and post their photographs, opinions, activities, and reviews about anything. From children to the elderly, social media has had a profound impact on daily life. The rapid adoption of social media for everything has therefore increased the sharing of information between users who do not know whether it is fake or genuine. People share news without checking its authenticity.

Fake news takes different forms: targeted misleading information, shared through social media to exploit someone's interests; fake headlines depicting bogus facts, created to attract additional attention; and viral posts shared across social media without any check of their authenticity. Misinformation has become a daily phenomenon in the changing media landscape, where readers are becoming publishers. Anyone can create misinformation and share it on social platforms. The main objective of fake news creators is to mislead newsreaders [1]. Such news has a target, such as creating an issue in society or damaging the dignity of an individual or an organization. As a result, detecting fake news and assessing the quality of news becomes an ever more important skill. When people share news articles, they must first check their authenticity. This check is an important factor in reducing the spread of anonymous information.

One of the main areas of Natural Language Processing (NLP) is dealing with text content. Preprocessing must be completed thoroughly before the language model is developed. Preprocessing of Malayalam text can be implemented with suitable language packages and tools such as NLTK, root-pack, and tokenization classes. Removal of punctuation and HTML tags, handling of code mixing, sentence tokenization, word tokenization, stemming, and root word identification are a few of the essential tasks in text preprocessing. Text summarization, in its abstractive and extractive modes, sentiment analysis, and named entity recognition are further steps that come under NLP.

II. LITERATURE SURVEY

This section discusses the algorithms, methodologies, and techniques conventionally used to build fake news detection systems. Fake news detection has dominated research discussions in recent years, and many investigators have conducted experiments on implementing fake news identification systems. Prior research supports the use of machine learning algorithms such as the support vector machine and the Naïve Bayes classifier, together with NLP methods, sentence similarity, and classification algorithms, which are the most widely used techniques for detecting fake news.

An approach for detecting fake news on social media [2] reported a data mining technique in which news verification depends on factors such as the publisher, the content, the time of posting on social media websites, the number of engagements between different users, and the article itself. A smart system for fake news detection [3] showed that, by using supervised machine learning algorithms (Naïve Bayes classification and SVM) together with natural language processing, the suggested framework achieved an accuracy of up to 93.5% in detecting suspicious data.

Identification of fake news using machine learning [4] implements a system to classify fake news. FAMOUS, a fake news detection model [5], puts forward a new sentence matching model to identify suspicious news; it handles sentence matching efficiently by retrieving the key sentence based on a bidirectional LSTM model.

The work on fake news detection in [6] developed a simple detection method based on a machine learning algorithm, naïve Bayes, applied it to Facebook posts, and labeled them as fake or real. In fake news detection using a Naïve Bayes classifier [7], artificial intelligence methodologies are used to detect false news. The developed system was tested on a comparatively new dataset (BuzzFeed news), which allowed its performance to be evaluated on recent data. Weakly supervised learning for fake news detection on Twitter [8] classifies tweets as fake or non-fake; the classification is based purely on the source of the post or tweet. Improving spam detection in social networks [9] presents a method for detecting the spreaders of misinformation on Twitter, one of the most popular social media platforms.

Hoaxy, a platform for tracking online misinformation [11], collects news from social platforms and news websites using web scraping and web syndication. It measures user activity by counting the tweets a user has posted, and the popularity of a URL by counting the total number of people who have tweeted it.

Fake news detection using deep learning techniques [12] draws attention to a fake news detection system based on classifiers such as logistic regression (LR), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and deep neural network (DNN).

A detailed study by Said A. Salloum et al. of text mining techniques for social media sites such as Facebook and Twitter discussed the problem of converting irregular social media text into a regular form, and noted that texts written in natural languages and their sentiment analysis [13] play a significant role in proper data management for various purposes [16]. In 2016, Remmiya Devi et al. studied how to harvest useful information from Malayalam text on social media websites, extracting entities using the Structured Skip-Gram Model [19]. They used FIRE2015 as the dataset for entity extraction and claimed an overall accuracy of 89.83%.

Named entity recognition [31], morphological analysis, and parts-of-speech tagging [30] are a few of the growing research areas in Malayalam language processing that can produce significant outcomes in grammar management and linguistics. Text summarization can be described as the process of extracting the essence of a longer source text without losing its overall meaning and content. Hovy et al. classified it into extractive summarization and abstractive summarization [22]. Extractive summarization is the simpler of the two: the algorithm selects the important sentences from the source text and joins them into a shorter form. The abstractive method understands the overall content and reproduces it in fewer words. Text summarization in NLP has developed much faster since the introduction of BERT algorithms. Yang Liu et al. produced a document-level encoder based on BERT that can express the overall meaning of the document and obtain representations for its sentences, and they achieved excellent performance in both extractive and abstractive summarization [28]. Derek Miller and team conducted an extensive study on extractive summarization for a lecture summarization service, using BERT text embeddings and K-Means clustering to identify the sentences closest to the centroid for summary selection [29] and storing the result in the cloud; the study showed promising results.

For summarization of Malayalam text, Kanitha et al. proposed a graph-theoretic method to produce an accurate summary of Malayalam texts and claimed that the generated summaries were 51% similar to human-generated summaries [23]. Rajina and Sumam in 2014 proposed a statistical sentence scoring technique and a semantic graph-based technique for text summarization. Preprocessing and POS tagging of Malayalam sentences, together with the scoring and graph-based algorithms, resulted in effective summarization of Malayalam language documents [24]. Krishnaprasad et al. proposed an algorithm for generic summarization of a single Malayalam document that generates a rank for each word in the document and then extracts the top N ranked sentences, arranging them in chronological order to form an extractive summary [25].
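The rank-based extractive approach surveyed above can be illustrated with a minimal Python sketch. This is not the algorithm of [25]: it assumes that a word's rank is simply its frequency in the document and that a sentence's score is the average rank of its words. The function names (`split_sentences`, `tokenize`, `summarize`) and the sample text are hypothetical and serve only as illustration.

```python
import re
import string
from collections import Counter


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule assumed here)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence, stop_words):
    """Lowercase, split on whitespace, strip ASCII punctuation, drop stop words.

    Whitespace splitting is used instead of a word-character regex so that
    Malayalam words with combining vowel signs are not broken apart.
    """
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w and w not in stop_words]


def summarize(text, top_n=2, stop_words=frozenset()):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s, stop_words) for s in sentences]
    # Assumption: the rank of a word is its frequency in the whole document.
    rank = Counter(w for words in tokenized for w in words)
    # Assumption: a sentence is scored by the average rank of its words.
    scores = [sum(rank[w] for w in words) / len(words) if words else 0.0
              for words in tokenized]
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Restore chronological (document) order for the selected sentences.
    return [sentences[i] for i in sorted(best)]


sample = "Boats waited. Rain fell on the forest. The forest river rose in the rain."
summary = summarize(sample, top_n=2, stop_words={"the", "on", "in"})
```

Sorting the selected indices before returning keeps the summary in document order, which is the "chronological order" step mentioned in the survey.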

III. METHODOLOGY

The methodologies used in the proposed model are given below.

[Figure 1: System Architecture. Input news to the system → Web scraping (Google, Yahoo, Ask, Lycos, Bing) → Preprocessing → Comparison → Prediction]

Figure 1 depicts the architecture of the system. The user enters the news item whose trustworthiness is to be checked. The system then carries out web scraping.

4.1. Web scraping

Web scraping is the method used in this system for retrieving news content. The system needs news or articles from trusted websites that relate to the content posted by the user. Web scrapers can extract all the data on a website or only the specific data that a user wants. To scrape a site, the scraper is first given the URLs of the required sites. It then loads all the HTML code for those sites; a more advanced scraper might also extract the CSS and JavaScript elements. After that, the scraper captures the required data from this HTML code and outputs it in the format specified by the user.

This web scraping method finds the URL of the page to be scraped and extracts the data in the required format. Five search engines (Google, Bing, Ask, Yahoo, and Lycos) are used for scraping data from trusted websites.

To carry out the research work, all travel posts and related information had to be collected in a customized way, in more depth than the Graph API allows. In that situation, the standard Facebook API may not be of much help. A custom model was therefore developed to scrape the essential details from Facebook. JSON, Node JS, and other scripting tools were used to achieve the [...]

After web scraping, the next step is to preprocess the data.

4.2 Preprocessing

Data should be preprocessed to get better results. Preprocessing comprises the removal of URLs, punctuation, and stop words, as well as stemming. Natural language processing is used at this stage; it allows the key information to be found in the data.

[Figure 2: Steps in Preprocessing. Punctuation, HTML tags, and numbers removal → Sentence tokenization → Word tokenization → Stop words removal → Stemming & Lemma → POS]

At the first level, sentence splitting separates each sentence from the rest of the text so that sentences can be handled individually. At the second level, unimportant words (stop words) such as the, a, an, from, to, for, and of are removed from each sentence. In Malayalam, a few of the stop words are:

പ ോലെ ഞോന്‍ അവലെ ആ അവന് ആയിരുന്നു പവണ്ടി
പേല്‍ ഉണ്ട് കൂലെ അവര്‍ എന്നു ഒന്ന് ഞങ്ങൾ ഈ നിന്ന്
ലകോണ്ട് പേ എന്ത് കുലെ ആകുന്നു അത് നിങ്ങലെ
അഥവോ ഉണ്ടോയിരുന്നു ഒരു പെക്ക് ഒപ്പം etc.

The next level is stemming, where each word is reduced to its root, followed by conversion into a vector format (using bag of words). Stemming detaches suffixes or prefixes from a word.

• Bag of Words

Machine learning algorithms cannot work with raw text directly; the text must first be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all the words that occur in the text in a bucket. This approach is called a bag of words [18].

• Named Entity Recognition

Named Entity Recognition, the task of recognizing and classifying real-world entities such as people, places, and organizations in a given text, is a well-known NLP problem [15]. The morphological features of Malayalam have consistently made this problem challenging. When named entities appear inside an inflected or agglutinated complex word, the first step is to break such words down and arrive at the root words.

A Malayalam passage is passed through NER, and the output obtained with the NER tool is given below.

േൺസൂൺ യോത്രകൾക്ക് എന്നും ത് ിയലപ്പട്ട
ഇെേോണ് ൂയംകുട്ടിപയോട് പേര്‍ന്നുള്ള
വനത്രോേങ്ങൾ. േഴക്കോെും ഇരുെ ം
ഒരുേിലെത്തുന്ന കെുകെില്‍ ഇെിെ കുത്തി
േഴല യ്യ പപോൾ ുഴയില്‍ രിരക്കിട്ട്
രീരത്തെുക്കുന്ന പരോണികൾ. േഴനനഞ്ഞു
നില്‍ക്കുന്ന േൃരങ്ങൾ.േിെലകെില്‍ കുരിര്‍ന്ന്
നനലഞ്ഞോട്ടിയ േികൾ.
ഇെകൾക്കിെയിെൂലെ നൂണ്ട് ുെപത്തക്ക്
രെനീട്ട ന്ന ഇെപ്പോപുകൾ...

The output of NER, namely the names of persons, places, items, and objects, is given below:

േൺസൂൺ യോത്ര ഇെം വനം ത്രോേം േഴ ഇരുൾ
പരോണി േൃരം േി ഇെ ോപ്

Stemming of words can be implemented with the module named Root-pack, developed at ICFOSS Trivandrum. Extracting the root of a word is vital in the preprocessing stage of most language processing systems for Malayalam. The root extraction module can derive the root of any given word regardless of the number of suffixes or words attached to the stem [10].

Given a sample inflected Malayalam word

"അവരുടെടെല്ലാമാടെന്നാണ്"

the extractor finds its root word. The implementation uses the method root():

    import root_pack
    root_pack.root("അവരുടെടെല്ലാമാടെന്നാണ്")

4.3. Comparison

After preprocessing, the system compares the input news with the scraped data. The results of the preceding steps are input to the TF-IDF vectorizer.

The TF-IDF vectorizer computes the term frequency-inverse document frequency, a value that rises in proportion to the number of times a word appears in the document but is offset by the frequency of the word in the collection. Term frequency can be defined as the number of times a word appears in the document divided by the total number of words in the document. The system performs a text similarity check using TF-IDF. Natural language processing is also used for sentence matching.

4.4. Prediction

Once the above steps are complete, the next step is prediction. Based on the comparison, if similar news can be found on any of the trusted sites, the given news is real; otherwise the given news is fake. Trustworthy sites do not publish fake news as real, and the search engines Google, Yahoo, Ask, Bing, and Lycos are assumed to be trustworthy. Each of the search engines predicts the news as fake or real according to the news. The final prediction is made by counting the number of "real" and "fake" predictions across the search engines. If the number of "real" predictions is greater than the number of "fake" predictions, the final output of the system is real; otherwise the output is fake. This is shown in the table below.

              Google   Bing   Ask    Yahoo   Lycos
News Sample   Real     Real   Real   Real    Fake

In Table 1 above, the news sample is predicted as real by four search engines and as fake by only one. Since the total count of "real" predictions is greater than the count of "fake" predictions, the system predicts the sample news as real.

The system is implemented in Python because of its rich libraries and packages. It uses libraries such as requests, the regular expression module, Beautiful Soup, NumPy, and Flask.

The requests module is a Python module for sending HTTP requests; each HTTP request returns all the response data.
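The comparison and prediction steps of Sections 4.3 and 4.4 can be sketched in plain Python as follows. This is an illustrative sketch rather than the system's actual code: the inverse document frequency formula (log(N/df) + 1), the use of cosine similarity, the 0.5 similarity threshold, and all function names are assumptions introduced here, and the scraped snippets are passed in as plain strings instead of being fetched from the search engines.

```python
import math
import string
from collections import Counter

# A small sample of English stop words; a real system would use a fuller list.
STOP_WORDS = frozenset({"the", "a", "an", "from", "to", "of"})


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Build one {term: weight} dict per tokenized document.

    TF  = occurrences of the term / number of words in the document.
    IDF = log(N / df) + 1 (assumed variant; the paper does not give a formula).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def engine_verdict(news, snippets, threshold=0.5):
    """'Real' if any snippet from one search engine is similar enough to the news."""
    vectors = tfidf_vectors([tokenize(news)] + [tokenize(s) for s in snippets])
    best = max((cosine(vectors[0], v) for v in vectors[1:]), default=0.0)
    return "Real" if best >= threshold else "Fake"  # threshold is an assumption


def final_prediction(verdicts):
    """Majority vote: real only if 'Real' verdicts outnumber 'Fake' verdicts."""
    real = sum(1 for v in verdicts.values() if v == "Real")
    fake = sum(1 for v in verdicts.values() if v == "Fake")
    return "Real" if real > fake else "Fake"


# The per-engine verdicts of the sample news in Table 1.
table_1 = {"Google": "Real", "Bing": "Real", "Ask": "Real",
           "Yahoo": "Real", "Lycos": "Fake"}
result = final_prediction(table_1)
```

With the verdicts from Table 1, four "Real" votes outnumber one "Fake" vote, so `result` is `"Real"`, matching the outcome described in Section 4.4. A tie is resolved as "Fake" because the text requires the "real" count to be strictly greater.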

The requests module is used for scraping. In the scraping method, when the URL is given, all the HTML code for the required sites is loaded, and the system has to strip this code and fetch the required data. The regular expression module (re) [14] is used in the system for this extraction, separating the content from the HTML code. Beautiful Soup is another Python package used for web scraping; in this system it is used mainly for getting live news updates. The Python library NumPy is used for working with arrays. Flask is a Python web framework that is used in the system for building the web application.

A sample data set was used to measure the accuracy of the prediction. In this sample, 0 represents fake news and 1 represents real news. A sample test was run on the system using the test set. The demonstration was done using Python and a machine learning algorithm.

The accuracy of predicting fake news, real news, and mixed data containing both fake and real news is given in the table below.

TABLE 2: SYSTEM ACCURACY

                        Fake   Original   Mixed
Total number of news    32     34         66
Correct predictions     29     31         60
Accuracy (%)            90.6   91.1       90.9

Table 2 shows that 32 fake news items were used for prediction, of which 29 were predicted as fake. Further, 34 original news items were used, of which 31 were predicted as real. When real and fake news were mixed, 60 of the 66 news items were predicted correctly.

The accuracy of the system can be calculated using the formula:

$$\text{Accuracy} = \frac{\text{Total number of accurate predictions}}{\text{Total number of news predictions}} \times 100$$

As stated in the introduction, the main objective is to determine the credibility of news spreading on social media. The proposed system needs to predict the output as accurately as possible. The predictions can be classified into four categories:

True Fake: When the fake news is predicted as fake.
True Real: When the real news is predicted as real.
False Fake: When the fake news is predicted as real.
False Real: When the real news is predicted as fake.

[Figure 3: Accuracy of system using chart (chart title: news prediction)]

Figure 3 shows that, for fake news, the system predicts true fake with an accuracy of 90.6%, and the remainder is false fake. For genuine news, the system predicts true real with an accuracy of 91.1%, and the remainder is false real. For the mixed set of real and fake news, the model has an accuracy of 90.9%, which is the overall accuracy of the system. These results offer important evidence for the detection of news as fake or real.

VI. CONCLUSION

Before believing bogus news and sharing it through web-based media, it is essential to verify its genuineness. In this work, a text similarity algorithm is used to predict whether news is fake or real. Five search engines were used to ensure maximum accuracy and the best prediction. The work has led to the conclusion that including more search engines helps the system produce better results. The findings of this study indicate that the system achieved an accuracy of up to 90% in detecting news as fake or real. The study provides a new framework for detecting the authenticity of news. Future studies on this topic are needed to improve the accuracy of the system by adding more search engines for scraping, fetching more data from social media, and using another text similarity checking algorithm.

REFERENCES

[1] "", Accessed on May 22 2021. [online], media/
[2] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, 2017, ACM SIGKDD Explorations Newsletter,
[3] A. Jain, A. Shakya, H. Khatter and A. K. Gupta, "A smart System for Fake News Detection Using Machine Learning," 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), 2019, pp. 1-4, doi:
[4] R. R. Mandical, N. Mamatha, N. Shivakumar, R. Monica and A. N. Krishna, "Identification of Fake News Using Machine Learning," 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), 2020, pp. 1-6, doi: 10.1109/CONECCT50063.2020.9198610.
[5] N. Kim, D. Seo and C. Jeong, "FAMOUS: Fake News Detection Model Based on Unified Key Sentence Information," 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), 2018, pp. 617-620, doi:
[6] A. Jain and A. Kasbe, "Fake News Detection," 2018 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 2018, pp. 1-5, doi:
[7] M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier," 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering, 2017, pp. 900-
[8] S. Helmstetter and H. Paulheim, "Weakly Supervised Learning for Fake News Detection on Twitter," 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018, pp. 274-277, doi: 10.1109/ASONAM.2018.8508520.
[9] A. Gupta and R. Kaushal, "Improving spam detection in Online Social Networks," 2015 International Conference on Cognitive Computing and Information Processing (CCIP), 2015, pp. 1-6, doi: 10.1109/CCIP.2015.7100738.
[10]
[11] Shao, Chengcheng & Ciampaglia, Giovanni & Flammini, Alessandro & Menczer, Filippo. (2016). Hoaxy: A Platform for Tracking Online Misinformation. WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web. 10.1145/2872518.2890098.
[12] C. K. Hiramath and G. C. Deshpande, "Fake News Detection Using Deep Learning Techniques," 2019 1st International Conference on Advances in Information Technology (ICAIT), 2019, pp. 411-415, doi: 10.1109/ICAIT47043.2019.8987258.
[13] M. Rahul, R.R. Rajeev, S. Shine, Social Media Sentiment Analysis For Malayalam, International Journal of Computer Sciences and Engineering, E-ISSN: 2347-2693, 2018.
[14] expressions-and-data-visualization-doing-it-all-in-python-37a1aade7924, "towards data science", Accessed on: May 22, 2021.
[15] recognition-using-morphology-analyser/
[16] Said A. Salloum, Mostafa Al-Emran, Azza Abdel Monem, Khaled Shaalan, A Survey of Text Mining in Social Media: Facebook and Twitter Perspectives, Advances in Science, Technology and Engineering Systems Journal Vol. 2, No. 1, 127-133 (2017)
[17] Bo Zhao, Web Scraping, Springer International Publishing AG (outside the USA) 2017, L.A. Schintler, C.L. McNeely (eds.), Encyclopedia of Big Data, DOI 10.1007/978-3-319-32001-4_483-1.
[18] Deepu S, Pethuru Raj and S. Rajaraajeswari, A Framework for Text Analytics using the Bag of Words (BoW) Model for Prediction, International Journal of Advanced Networking & Applications (IJANA), ISSN: 0975-0282, 1st International Conference on Innovations in Computing & Networking (ICICN16), CSE, RRCE 320.
[19] Remmiya Devi G, Veena P V, Anand Kumar M, Soman K P, Entity Extraction for Malayalam Social Media Text using Structured Skip-gram based Embedding Features from Unlabeled Data, 2016, doi: 10.1016/j.procs.2016.07.276, Procedia Computer Science 93 (2016) 547-553.
[20] Sunita Tiwari, et al., Implicit preference Discovery for biography Recommender system Using Twitter, International Conference on Computational Intelligence and Data Science, 2019, DOI: 10.1016/j.procs.2020.03.352.
[21] Chunmei Zheng, A Study of Web Information Extraction Technology Based on Beautiful Soup, Journal of Computers, 2015, Volume 10, doi: 10.17706/jcp.10.6.381-387.
[22] Eduard Hovy, Chin-Yew Lin, Automated Text Summarization in SUMMARIST. In Advances in Automatic Text Summarization,
[23] Kanitha D K, et al., Malayalam Text Summarization Using Graph Based Method, International Journal of Computer Science and Information Technologies, Vol. 9 (2), 2018, 40-44, ISSN: 0975-9646.
[24] Rajina Kabeer and Sumam Mary Idicula, Text Summarization for Malayalam Documents - an Experience, 2014, International Conference on Data Science & Engineering (ICDSE), 978-1-4799-5461-2114/$31.00 ©2014 IEEE.
[25] Krishnaprasad P, Sooryanarayanan A, Ajeesh Ramanujan, Malayalam Text Summarization: An Extractive Approach, International Conference on Next Generation Intelligent Systems (ICNGIS), 2016, 978-1-5090-0870-4/16/$31.00 ©2016 IEEE.
[26] Jan-Willem van Dam, Michel van de Velden, Online profiling and clustering of Facebook users, Decision Support Systems 70 (2015) 60-72, 0167-9236/© 2014.
[27] Plamen Milev, Conceptual Approach for Development of Web Scraping Application for Tracking Information, Economic Alternatives, 2017, Issue 3, pp. 475-485.
[28] Yang Liu and Mirella Lapata, Text Summarization with Pretrained Encoders, arXiv:1908.08345v2 [cs.CL] 5 Sep 2019.
[29] Derek Miller, Leveraging BERT for Extractive Text Summarization on Lectures, 2019,
[30] Anisha Aziz T and Sunitha C, "A hybrid Parts Of Speech tagger for Malayalam language," 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, 2015, pp. 1502-1507, doi:
[31] Ajees A P, Sumam Mary Idicula, A Named Entity Recognition System for Malayalam using Neural Networks, 2018, Procedia Computer Science 143 (2018) 962-969.

