Comparison Performance of Naive Bayes
E-mail: rosita_kusumawati@uny.ac.id
Abstract. Tokopedia is one of the online shopping centers in Indonesia that follows the
marketplace business model. Positive and negative opinions posted on Twitter by Tokopedia
users about the company's services are a source of information for its management. Naive
Bayes Classification (NBC) and Support Vector Machine (SVM) are data-mining techniques
used to classify data such as user opinions. The NBC algorithm is very simple, since it only
uses term frequencies to compute the posterior probability of each class, while the SVM
algorithm is more complex: it constructs a hyperplane equation that separates the data into
classes. Because these two algorithms take different approaches with different levels of
difficulty, this study compares the performance of NBC and SVM and applies them to classify
user opinions on Tokopedia's services. The classification includes positive and negative
classes only. Accuracy, precision, and recall values are used to compare the performance of
both algorithms. The evaluation shows that SVM with a linear kernel outperforms NBC, with
an accuracy of 83.34%.
1. Introduction
Tokopedia is one of the e-commerce startups with large assets that enables sellers and buyers to
make transactions quickly and easily. Twitter is a microblog-based social network, with 24.34
million users as of May 2016, that allows users to send and read text-based messages known as
"tweets" [11]. Twitter can serve as a source of text data on public opinion and sentiment about
Tokopedia's services, which can be analysed for the purposes of an organization or a company.
Sentiment analysis, also called opinion mining, is a field of study that analyses not only people's
opinions but also their sentiments, evaluations, judgments, attitudes, and emotions toward entities
such as products, services, organizations, individuals, problems, events, topics, and other
attributes [5]. These unstructured opinion text data can be classified using the various text-processing
methods collectively called text mining [12].
Algorithms used to classify opinions include the Naive Bayes Classifier (NBC) and the Support Vector
Machine (SVM). NBC is a simple probabilistic method that classifies an opinion by the maximum
posterior probability, obtained from the probability of each class (the prior probability) and the
probability of each word occurring in that class (the conditional probability) in the training data.
Although very simple, NBC achieves a high degree of accuracy and performance in classifying text
data [10].
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
ISIMMED2018 IOP Publishing
IOP Conf. Series: Journal of Physics: Conf. Series 1320 (2019) 012016 doi:10.1088/1742-6596/1320/1/012016
The Support Vector Machine (SVM) algorithm is a supervised machine-learning method that
classifies opinions by searching for the best dividing field, or hyperplane, that separates
high-dimensional text data into classes [14]. The hyperplane is found by maximizing the
margin, the distance between the closest points of each class (the support vectors) and the hyperplane.
However, sample data are often not linearly separable, so SVM introduces the idea of mapping the
data into a higher-dimensional space. Working directly in a higher-dimensional space typically causes
computational problems and overfitting; this can be avoided by using a dot product in that space [3].
Although much research on text mining has been done, many issues about the performance of these
algorithms still need to be addressed. This article discusses the application of the Naive Bayes
Classifier and Support Vector Machine algorithms to classify tweets about Tokopedia's services and
compares the performance of the two algorithms. The remainder of this paper is organized as follows:
section 2 reviews work related to text mining using NBC and SVM; section 3 explains the concepts of
NBC and SVM in more depth; section 4 describes the accuracy, precision, and recall measures used to
compare the performance of each algorithm; and section 5 discusses the classification steps for
Tokopedia's services, from data collection, labelling, preprocessing, and data splitting through the
NBC and SVM classification processes.
2. Related Work
A number of studies on sentiment analysis using the NBC and SVM algorithms have been done.
Narayanan et al. conducted a sentiment analysis study to classify film review opinions of Indian
audiences using NBC [7]. The film reviews in this study came from the Internet Movie Database
(IMDb). The method classified 25,000 opinions quickly and accurately, with an accuracy of 88.80%.
Wahyuningtyas also used the NBC algorithm, to classify tweets as spam or non-spam [13]. The
classification accuracy of spam and non-spam tweets was 95.57%, and the results show that words
that often appear in the spam class include bahasa, follow, and inggris. SVM performance has been
analysed by classifying English opinions about self-driving cars and Apple products, using tweet data
divided into six classes in the WEKA program [1]. The accuracy, precision, and recall values were
59.91%, 70.8%, and 84.1% for the self-driving car topic and 71.2%, 70.2%, and 71.2% for the Apple
products topic. Pratama et al. also used the SVM algorithm for text mining of Speedy Telkomsel
subscribers' complaints on Twitter, using combinations of feature selections (term frequency,
document frequency, information gain, and chi-square) and a Gaussian RBF kernel function; the
resulting accuracy for the term-frequency feature was 82.50% [8].
appearance of other words [2]. In other words, the NBC algorithm assumes that the variables 𝑥1 , … , 𝑥𝑛
are mutually independent.
The Naïve Bayes Classifier algorithm can be divided into two types, multivariate Bernoulli and
multinomial Naïve Bayes [6]. This study used the multinomial Naïve Bayes model, which assumes
the mutual independence of the words in every class; since 𝑃(𝑥1 , … , 𝑥𝑛 ) is constant across classes,
equation (2) can be written as follows:
𝑃(𝑐𝑖 |𝑥1 , … , 𝑥𝑛 ) = 𝑃(𝑥1 , … , 𝑥𝑛 |𝑐𝑖 )𝑃(𝑐𝑖 ). (3)
The posterior probabilities values can be obtained based on the probability of each class (prior
probabilities) and the probability of each word (conditional probabilities) in the training data by
simplifying equation (3) as follows [9].
𝑃(𝑐𝑖 |𝑥1 , … , 𝑥𝑛 ) = 𝑃(𝑐𝑖 ) ∏_{𝑗=1}^{𝑛} 𝑃(𝑥𝑗 |𝑐𝑖 ). (4)
Equation (4) is the probability model of the Naïve Bayes theorem used in the classification process.
In the Naïve Bayes Classifier, a test datum is assigned to the class 𝑐𝑖 with the maximum a posteriori
probability (MAP), 𝑐𝑀𝐴𝑃 . The value of 𝑐𝑀𝐴𝑃 is defined as follows:
𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑖 ∈ 𝐶} 𝑃(𝑐𝑖 ) ∏_{𝑗=1}^{𝑛} 𝑃(𝑥𝑗 |𝑐𝑖 ). (5)
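As a concrete illustration of equations (3)–(5), a minimal multinomial Naïve Bayes classifier can be sketched in Python (the study itself used R); the tiny training set and the add-one smoothing used here are illustrative assumptions, not the paper's data:

```python
from collections import Counter

# Hypothetical labelled training data: (tokenized tweet, class) pairs.
train = [(["iklan", "bagus"], "pos"),
         (["pengiriman", "lambat"], "neg"),
         (["promo", "bagus"], "pos")]

classes = {c for _, c in train}
vocab = {w for tokens, _ in train for w in tokens}

# Prior probabilities P(c_i): relative class frequencies.
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}

# Word counts per class for the conditional probabilities P(x_j | c_i).
counts = {c: Counter() for c in classes}
for tokens, y in train:
    counts[y].update(tokens)

def cond(word, c):
    # Laplace (add-one) smoothing avoids zero probabilities.
    total = sum(counts[c].values())
    return (counts[c][word] + 1) / (total + len(vocab))

def c_map(tokens):
    """Pick the class maximizing P(c_i) * prod_j P(x_j | c_i), as in (5)."""
    best, best_p = None, -1.0
    for c in classes:
        p = prior[c]
        for w in tokens:
            p *= cond(w, c)
        if p > best_p:
            best, best_p = c, p
    return best

print(c_map(["iklan", "bagus"]))  # -> pos
```

In practice the product in (5) is usually computed as a sum of logarithms to avoid numerical underflow on long documents.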
A text or tweet can be expressed as a vector 𝑥𝑖 = (𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑝 ). Suppose a given set 𝑥 =
{𝑥1 , 𝑥2 , … , 𝑥𝑛 } with 𝑥𝑖 ∈ ℜ𝑝 has a certain pattern and can be grouped into a positive class and a
negative class. Each datum is paired with a class label 𝑦𝑖 ∈ {−1, +1}, so the data are the pairs
{(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 )}, where 𝑛 is the number of data. Assume the data are perfectly
(linearly) separable by a 𝑝-dimensional separating function called the hyperplane 𝐻0 , defined in
equation (9) as follows,
𝑤. 𝑥𝑖 + 𝑏 = 0 (9)
where 𝑤 is the weight vector and 𝑏 is a scalar. The hyperplane and support vectors are illustrated in
Figure 1.
with constraints,
𝑦𝑖 (𝑤 𝑇 . 𝑥𝑖 + 𝑏 ) ≥ 1 − 𝑠𝑖 , 𝑖 = 1, … , 𝑛 (12)
and 𝑠𝑖 ≥ 0 for ∀𝑖 . (13)
𝐶 plays a role in balancing the training error against the complexity of the model. Using the
Lagrange function, the constrained optimization problem in equations (11) – (13) can be stated as the
unconstrained optimization problem in equation (14) as follows,
min 𝐿𝑝 (𝑤, 𝑏, 𝑠, 𝛼) = ½‖𝑤‖² + 𝐶 ∑_{𝑖=1}^{𝑛} 𝑠𝑖 + ∑_{𝑖=1}^{𝑛} 𝛼𝑖 (1 − 𝑠𝑖 − 𝑦𝑖 (𝑤 𝑇 𝑥𝑖 + 𝑏)). (14)
The non-negative variables 𝛼𝑖 ≥ 0 are called Lagrange multipliers. The objective in equation (14) is
to minimize 𝐿𝑝 with respect to 𝑤 and 𝑏 while at the same time maximizing 𝐿𝑝 with respect to 𝛼.
Setting the partial derivatives of 𝐿𝑝 with respect to 𝑤, 𝑏, and 𝑠 to zero gives the dual problem of
equation (14) as follows,
max 𝐿𝑑 = ∑_{𝑖=1}^{𝑛} 𝛼𝑖 − ½ ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} 𝛼𝑖 𝛼𝑗 𝑦𝑖 𝑦𝑗 (𝑥𝑖 ⋅ 𝑥𝑗 ) (15)
with constraints,
0 ≤ 𝛼𝑖 ≤ 𝐶, 𝑖 = 1, … , 𝑛 (16)
and ∑_{𝑖=1}^{𝑛} 𝛼𝑖 𝑦𝑖 = 0. (17)
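The role of the dual variables can be illustrated with scikit-learn's SVC (a Python sketch, not the R `e1071` package used in this study): a fitted linear-kernel model exposes the support vectors and the products 𝛼𝑖 𝑦𝑖, so constraints (16)–(17) and the weight recovery 𝑤 = ∑𝑖 𝛼𝑖 𝑦𝑖 𝑥𝑖 can be checked numerically on toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data standing in for high-dimensional tweet vectors.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# Linear kernel; C bounds the multipliers as in constraint (16).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
alpha_y = clf.dual_coef_.ravel()
print("support vectors:\n", clf.support_vectors_)
print("sum of alpha_i * y_i:", alpha_y.sum())  # ~0, constraint (17)

# w = sum_i alpha_i y_i x_i recovers the hyperplane weights of (9).
w = alpha_y @ clf.support_vectors_
print("w:", w, "b:", clf.intercept_)
```

Only the support vectors have non-zero 𝛼𝑖, which is why the separating function depends on a small subset of the training data.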
secret), oauth token, and oauth token secret are required to send secure authorized requests to the
Twitter API; the R program can then be used to extract Twitter data.
5.3. Preprocessing
The labelled tweet data then go through a preprocessing stage, which consists of
five steps:
5.3.2. Tokenizing
Tokenizing is the step of breaking a text document into multiple tokens, or words. The tokenizing
stage also removes punctuation, numbers, mentions, hashtags, and URLs. The purpose of
tokenizing is to make it possible to calculate the frequency of each word that appears.
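The tokenizing step described above can be sketched in Python with simple regular expressions (an illustration; the study performed this step in R):

```python
import re

def tokenize(tweet):
    """Split a tweet into tokens, dropping URLs, mentions, hashtags,
    numbers, and punctuation, so word frequencies can be counted."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", " ", tweet)   # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)        # mentions and hashtags
    tweet = re.sub(r"\d+", " ", tweet)            # numbers
    tweet = re.sub(r"[^\w\s]", " ", tweet)        # punctuation
    return tweet.split()

print(tokenize("@tokopedia iklan bagus! cek https://t.co/x #promo 2019"))
# -> ['iklan', 'bagus', 'cek']
```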
5.3.3. Normalization
Normalization is the word-improvement step that converts word abbreviations into their standard
word forms. The list of word abbreviations is stored in a MySQL database. In this study the program
code was obtained from previous research by Khotimah [4].
5.3.5. Stemming
Stemming is the process of converting inflected and non-standard words into their standard base
words. In this study, stemming was done using the Nazief and Adriani algorithm (1996), with
program code following the Khotimah study [4]. Stemming is the last step of the preprocessing
stage.
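To give a flavour of stemming, a toy suffix-stripper is sketched below. This is deliberately simplified and is NOT the Nazief–Adriani algorithm used in the study, which also handles prefixes, infixes, repeated forms, and a root-word dictionary:

```python
# Illustrative suffix stripping only -- not the full Nazief-Adriani
# algorithm, which uses a root-word dictionary and prefix rules.
SUFFIXES = ("nya", "kan", "an", "i")

def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of >= 3 letters."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

print(strip_suffix("kirimkan"))  # -> kirim
print(strip_suffix("bagus"))     # -> bagus (unchanged)
```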
After the preprocessing phase, the resulting tokens become terms; a term is a unique token (word).
The next stage is the creation of a term-document matrix, which records the frequency of
occurrence of each word in each document. The word weighting used in this study is
term frequency (tf), the number of times each term appears in each tweet.
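A term-document matrix with tf weighting can be built directly from tokenized tweets; a minimal sketch with hypothetical documents:

```python
from collections import Counter

# Tokenized tweets after preprocessing (hypothetical examples).
docs = [["iklan", "bagus", "tokopedia"],
        ["pengiriman", "lambat", "tokopedia"],
        ["promo", "bagus", "bagus"]]

# Terms are the unique tokens across all documents.
terms = sorted({t for d in docs for t in d})

# Term-document matrix: rows are terms, columns are tweets,
# entries are raw term frequencies (tf).
tdm = [[Counter(d)[t] for d in docs] for t in terms]

for t, row in zip(terms, tdm):
    print(f"{t:12s} {row}")
```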
After preprocessing, the tweet data are divided into two parts, training data and testing data, using
the 80:20 rule. The amounts of training data and testing data are 96 and 24 tweets, respectively.
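The 80:20 split can be sketched as follows; the seeded random shuffle is an assumption for reproducibility, since the paper does not state how the split was drawn:

```python
import random

# 120 labelled tweets stand in for the study's dataset.
data = [f"tweet_{i}" for i in range(120)]

random.seed(0)  # reproducible shuffle (assumed, not from the paper)
random.shuffle(data)

split = int(0.8 * len(data))       # 80:20 rule
train, test = data[:split], data[split:]
print(len(train), len(test))       # -> 96 24
```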
After the data have been split into training and testing sets, wordcloud representations are made
from the term-document matrix. Words that appear in positive tweets and in negative tweets are
shown as wordclouds: the results for the positive class can be seen in Figure 3 and for the negative
class in Figure 4.
5.5. Classification
5.5.1. Classification using naive bayes algorithm.
This research classifies tweets using the Naïve Bayes algorithm. Based on the preprocessing results,
the probability of occurrence of each class in the sample (the prior probabilities) and the per-word
probabilities of belonging to the positive and negative classes (the conditional probabilities) are
calculated. The terms generated by the preprocessing, 388 in all, are used in the NBC algorithm. The
classification model classifies the test data using knowledge derived from the training data. The
probability of occurrence of each class (the prior probabilities) in the training data is as follows:
The word probabilities in each class (the conditional probabilities) calculated with the Naïve
Bayes Classifier algorithm can be seen in Table 1.
For example, for the tweet "Iklan bagus Tokopedia", the posterior probability of each class is
𝑃(𝑝𝑜𝑠|𝑡𝑤𝑒𝑒𝑡) = 𝑃(𝑖𝑘𝑙𝑎𝑛|𝑝𝑜𝑠)𝑃(𝑏𝑎𝑔𝑢𝑠|𝑝𝑜𝑠)𝑃(𝑡𝑜𝑘𝑜𝑝𝑒𝑑𝑖𝑎|𝑝𝑜𝑠)𝑃(𝑝𝑜𝑠)
= (0.250)(0.425)(0.450)(0.4167)
= 0.0199
and
𝑃(𝑛𝑒𝑔|𝑡𝑤𝑒𝑒𝑡) = 𝑃(𝑖𝑘𝑙𝑎𝑛|𝑛𝑒𝑔)𝑃(𝑏𝑎𝑔𝑢𝑠|𝑛𝑒𝑔)𝑃(𝑡𝑜𝑘𝑜𝑝𝑒𝑑𝑖𝑎|𝑛𝑒𝑔)𝑃(𝑛𝑒𝑔)
= (0.000)(0.000)(0.071)(0.583)
= 0.000
Because 𝑃(𝑝𝑜𝑠|𝑡𝑤𝑒𝑒𝑡) > 𝑃(𝑛𝑒𝑔|𝑡𝑤𝑒𝑒𝑡), the tweet is assigned to the positive class. In the next
stage, the classification is tested using the predetermined test data. The predictions of the Naïve
Bayes classification are then used to determine the accuracy of the model.
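The worked example above can be reproduced in a few lines of Python, using the probabilities stated in the example:

```python
# Probabilities taken from the worked example above.
p_pos, p_neg = 0.4167, 0.583
cond_pos = {"iklan": 0.250, "bagus": 0.425, "tokopedia": 0.450}
cond_neg = {"iklan": 0.000, "bagus": 0.000, "tokopedia": 0.071}

tweet = ["iklan", "bagus", "tokopedia"]

# Posterior as in equation (3): prior times the word conditionals.
post_pos, post_neg = p_pos, p_neg
for w in tweet:
    post_pos *= cond_pos[w]
    post_neg *= cond_neg[w]

print(round(post_pos, 4))  # -> 0.0199
print(round(post_neg, 4))  # -> 0.0
print("pos" if post_pos > post_neg else "neg")  # -> pos
```

Note that the zero conditional probabilities in the negative class force its posterior to zero; in practice this is why smoothing (e.g. Laplace) is usually applied.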
5.5.2. Classification using support vector machine algorithm. Classification modelling is also done
using the Support Vector Machine algorithm. The kernel function used is a linear kernel with one
parameter, 𝐶 (cost). 𝐶 is the penalty parameter for classification errors, and its value is determined
by the researchers; in this study the candidate values are 0.1, 1, and 10, used to model the training
data and later to search for the best 𝐶, the value that maximizes accuracy. The SVM model with a
linear kernel is built through the RTextTools and e1071 packages, using create_container and
train_model as the modelling functions on the training data. The train_model function reports the
number of support vectors in the training data; in this research there are 81 support vectors. The
support vectors determine the separating function, in the form of the values of 𝑤 and 𝑏, which
define the hyperplane. The SVM model constructed from the training data is then used to classify
the test data. The terms generated from the preprocessing, the same 388 terms used in the NBC
algorithm, are used in the SVM algorithm. The classification of the test data is carried out with the
classify_model function. In the evaluation and validation stage, a confusion matrix is used to
evaluate the classification results (predictions) on the test data. From the calculation of accuracy
values, the best accuracy for the SVM is obtained with 𝐶 = 0.1. Examples of tweets misclassified
by the SVM in this research can be seen in Table 2.
Table 2. Examples of tweets misclassified by the SVM
Tweets Actual Predicted
wakakak wajar temen ku depet flash sale untung na positive negative
maklum ada yg tidak dapet produk batas akses puluh positive negative
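The SVM workflow (tf features, linear kernel, a small grid over 𝐶) can be sketched in Python with scikit-learn; this is a stand-in for the R RTextTools/e1071 pipeline the study used, and the labelled tweets below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical labelled tweets; the study used 96 training tweets.
tweets = ["iklan bagus", "promo bagus sekali", "pengiriman lambat",
          "pembayaran gagal", "produk bagus", "respon lambat"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

vec = CountVectorizer()          # term-frequency (tf) weighting
X = vec.fit_transform(tweets)

# Try the same candidate costs as the study: C = 0.1, 1, 10.
for C in (0.1, 1, 10):
    clf = SVC(kernel="linear", C=C).fit(X, labels)
    acc = clf.score(X, labels)   # training accuracy, for illustration
    print(f"C={C}: accuracy={acc:.2f}")
```

In a real experiment the accuracy for each 𝐶 would of course be measured on held-out test data, as the study does, not on the training set.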
5.5.3. Evaluation. The accuracy, precision, and recall values of the SVM and
NBC algorithms can be seen in Table 3 below.
Table 3. Accuracy, precision and recall value with NBC and SVM
Algorithm   Accuracy   Precision (positive)   Precision (negative)   Recall (positive)   Recall (negative)
NBC         75%        100%                   33.33%                 62.50%              100%
SVM         83.34%     100%                   75%                    66.67%              100%
Table 3 provides evidence that the SVM does a better job than the NBC at classifying opinions on
Tokopedia's services: every criterion is at least as high for the SVM as for the NBC.
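The three evaluation measures follow directly from a 2×2 confusion matrix. A sketch in Python, using hypothetical counts for a 24-tweet test set (the paper does not report the raw confusion matrix; these counts are chosen so that the resulting values approximately match the SVM row of Table 3):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy and per-class precision/recall from a 2x2 confusion
    matrix (tp/fp/fn/tn counted with 'positive' as the target class)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp) if tp + fp else 0.0,
        "recall_pos": tp / (tp + fn) if tp + fn else 0.0,
        "precision_neg": tn / (tn + fn) if tn + fn else 0.0,
        "recall_neg": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical counts for a 24-tweet test set.
m = metrics(tp=8, fp=0, fn=4, tn=12)
print(m)  # accuracy ~0.8333, precision_pos 1.0, recall_pos ~0.6667, ...
```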
6. Conclusion
According to the accuracy, precision, and recall values, the SVM algorithm performs much better
than NBC in classifying opinions on Tokopedia's services. The wordclouds suggest positive
sentiment about Tokopedia's product quality and its discount or sale programs, while the negative
class corresponds to shipment and payment procedures.
References
[1] Ahmad, M., Aftab, S., & Ali, I. 2017. Sentiment analysis of tweets using SVM. International
Journal of Computer Applications, 25-29.
[2] Apriliyanti, A. 2015. Sentiment analysis with naive bayes to see people's perception of batik
on twitter social network. Proceedings of the National Seminar Mathematics and
Mathematics Education of Muhammadiyah Surakarta University, 836.
[3] Boswell, D. 2002, August 6. Introduction to support vector machines. Taken from
dustwell.com: dustwell.com/PastWork/IntroToSVM.pdf
[4] Khotimah, H. 2014. Pemodelan hybrid tourism recommendation menggunakan hidden
markov model dan text mining berbasis data sosial media. Thesis of Institute Pertanian Bogor
[5] Liu, B. 2012. Sentiment analysis and opinion mining. Morgan & Claypool Publishers.
[6] Manning, C.D., Raghavan, P., & Schutze, H. 2009. An introduction to information retrieval.
Cambridge University Press: New York.
[7] Narayanan, V., Arora, I., & Bhatia, A. 2013. Fast and accurate sentiment classification
using an enhanced naive bayes model. Varanasi, India: Indian Institute of Technology.
[8] Pratama, E., & Trilaksono, B. 2015. The classification of customer complaint topics based on
tweets using the combination of extracted features in the method of support vector machine
(SVM). Jurnal Edukasi dan Penelitian Informatika (JEPIN) 1(2), 53-59.
[9] Raschka, S. 2014. Naive bayes and text classification i - introduction and theory.
Birmingham: Packt Publishing.
[10] Routray, P., Swain, C. K., & Mishra, S. P. 2013. A survey on sentiment analysis.
International Journal Of Computer Applications , 2-4.
[11] Statista. July 2016. Number of active twitter users in leading markets as of may 2016 (in
millions). Taken from https://www.statista.com/statistics/242606/number-of-active-twitter-
users-in-selected-countries/ on 25 March 2018.
[12] Vijayarani, S., Ilamathi, J., & Nithya. 2015. Preprocessing techniques for text mining - an
overview. International Journal of Computer Science & Communication Networks, 5(1), 7-
16.
[13] Wahyuningtyas, A. 2016. Spam detection on twitter using naïve bayes algorithm. Bogor
Agricultural University.
[14] Zainuddin, N., & Selamat, A. 2014. Sentiment analysis using support vector machine. IEEE
2014 International Conference on Computer, Communication, and Control Technology, 333-
337.