Journal of Physics: Conference Series

PAPER • OPEN ACCESS

Comparison Performance of Naive Bayes Classifier and Support Vector Machine Algorithm for Twitter's Classification of Tokopedia Services

To cite this article: R Kusumawati et al 2019 J. Phys.: Conf. Ser. 1320 012016
ISIMMED2018 IOP Publishing
IOP Conf. Series: Journal of Physics: Conf. Series 1320 (2019) 012016 doi:10.1088/1742-6596/1320/1/012016

Comparison Performance of Naive Bayes Classifier and
Support Vector Machine Algorithm for Twitter's
Classification of Tokopedia Services

R Kusumawati1, A D 'arofah2 and P A Pramana3

1,2,3 Department of Mathematics, Faculty of Mathematics and Natural Science,
Universitas Negeri Yogyakarta, Yogyakarta, Indonesia

E-mail: rosita_kusumawati@uny.ac.id

Abstract. Tokopedia is one of the online shopping centers in Indonesia that follows the
marketplace business model. Positive and negative opinions on Twitter from Tokopedia users
about the company's services are a source of information for its management. The Naive Bayes
Classifier (NBC) and the Support Vector Machine (SVM) are data-mining techniques used to
classify data or user opinions. The NBC algorithm is very simple, since it only uses term
frequencies to compute the posterior probability of each class, while the SVM algorithm is more
complex: it constructs a hyperplane equation that separates the data into classes. Because
these two algorithms have different approaches and levels of difficulty, this research compares
the performance of the NBC and SVM algorithms and uses them to classify user opinions on
Tokopedia's services. The classification includes positive and negative classes only. Accuracy,
precision, and recall are used to compare the performance of both algorithms. The evaluation
shows that the SVM with a linear kernel outperforms the NBC technique, with an accuracy of
83.34%.

1. Introduction
Tokopedia is one of the e-commerce startups with large assets that brings sellers and buyers together
to make transactions quickly and easily. Twitter is a microblog-based social network with 24.34
million users as of May 2016 that allows users to send and read text-based messages known as
"tweets" [11]. Twitter can therefore be a source of textual opinion and community sentiment on
Tokopedia's services, which can be analysed for the purposes of an organization or a company.
Sentiment analysis, also called opinion mining, is the field of study that analyses not only people's
opinions but also their sentiments, evaluations, judgments, attitudes, and emotions toward entities
such as products, services, organizations, individuals, problems, events, topics, and other
attributes [5]. These unstructured opinion text data can be classified using various text-processing
methods collectively called text mining [12].
Algorithms used to classify opinions include the Naive Bayes Classifier (NBC) and the Support Vector
Machine (SVM). The NBC is a simple probabilistic method that classifies an opinion by the
maximum posterior probability, obtained from the probability of each class (the prior
probability) and the probability of each word occurring in that class (the conditional probability) in the
training data. Although very simple, the NBC achieves a high degree of accuracy and performance in
classifying text data [10].
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd

The Support Vector Machine (SVM) algorithm is a supervised machine-learning method that
classifies opinions by searching for the best dividing plane, or hyperplane, that separates
high-dimensional text data into classes [14]. The hyperplane is found by maximizing the
margin, i.e. the distance between the hyperplane and the closest points of each class (the support
vectors). However, sample data are often not linearly separable, so the SVM introduces the idea of
increasing the data dimension. Typically, the use of a higher-dimensional space causes computational
problems and overfitting. This problem can be solved by using a dot product (kernel) in that space [3].
Although much research on text mining has been done, there are still many issues about the
performance of these algorithms that need to be addressed further. This article discusses the application
of the Naive Bayes Classifier and the Support Vector Machine algorithm to classify tweets about
Tokopedia's services and compares the performance of the two algorithms. The remainder of this
paper is organized as follows: section 2 reviews work related to text mining using NBC and SVM.
The concepts of NBC and SVM are explained in more depth in section 3. The accuracy, precision,
and recall measures used to compare the performance of each algorithm are explained in section 4.
The classification of Tokopedia's services, from data collection through labelling, preprocessing, and
data splitting to the NBC and SVM classification process, is discussed in section 5.

2. Related Work
A number of studies on sentiment analysis using the NBC and SVM algorithms have been done.
Narayanan et al. conducted a sentiment analysis study of film reviews by Indian audiences using
NBC classification [7]. The film reviews in that study come from the Internet Movie Database
(IMDb). The classifier worked quickly and accurately, reaching an accuracy of 88.80% on 25,000
opinions. Wahyuningtyas also used the NBC algorithm, to classify tweets as spam or non-spam [13];
the classification accuracy was 95.57%, and the words found to appear most often in the spam class
were bahasa, follow, and inggris. Research analysing SVM performance has been done on
English-language opinions about self-driving cars and Apple products, using tweet data divided into
six classes and the WEKA program [1]. The accuracy, precision, and recall values are 59.91%, 70.8%,
and 84.1% for the self-driving-car topic and 71.2%, 70.2%, and 71.2% for the Apple-products topic.
Pratama et al. also used the SVM algorithm for text mining of Speedy Telkomsel subscribers'
complaints on Twitter, using combinations of feature selections (term frequency, document frequency,
information gain, and chi-square) and a Gaussian RBF kernel function; the resulting accuracy for the
term-frequency feature is 82.50% [8].

3. Problem Formulations and Methodology

3.1. Naïve Bayes Classifier (NBC)


The Naïve Bayes Classifier is a supervised learning technique, a probabilistic classifier based on
Bayes' theorem with a "naive" independence assumption. Bayes' rule can be stated as follows:

$$P(B_j|A) = \frac{P(B_j)P(A|B_j)}{P(A)} \quad (1)$$

Using Bayes' rule in equation (1), the conditional probability $P(c_i|x_1,\dots,x_n)$ of class $c_i$ given the
word features $(x_1,\dots,x_n)$ of a particular text can be expressed as follows:

$$P(c_i|x_1,\dots,x_n) = \frac{P(x_1,\dots,x_n|c_i)P(c_i)}{P(x_1,\dots,x_n)} \quad (2)$$

where $P(x_1,\dots,x_n|c_i)$ is the conditional probability of the word features $(x_1,\dots,x_n)$ in a particular
class $c_i$, and $x_i$ states the number of times word $i$ appears in a text. In the NBC algorithm for text
mining, the naive assumption guarantees that the appearance of a word in a text does not affect the
appearance of other words [2]. In other words, the NBC algorithm assumes that the variables $x_1,\dots,x_n$
are mutually independent.
The Naïve Bayes Classifier algorithm can be divided into two types, i.e. multivariate Bernoulli and
multinomial Naïve Bayes [6]. This study uses the multinomial Naïve Bayes model, which assumes
the mutual independence of each word for all classes; since $P(x_1,\dots,x_n)$ is constant across classes,
equation (2) can be written as follows:

$$P(c_i|x_1,\dots,x_n) = P(x_1,\dots,x_n|c_i)P(c_i). \quad (3)$$

The posterior probabilities can be obtained from the probability of each class (prior probabilities) and
the probability of each word (conditional probabilities) in the training data by simplifying
equation (3) as follows [9]:

$$P(c_i|x_1,\dots,x_n) = \left(\prod_{j=1}^{n} P(x_j|c_i)\right) P(c_i) \quad (4)$$

where
$i$ : positive class, negative class
$n$ : number of word features in the training data
$P(c_i|x_1,\dots,x_n)$ : probability of words $x_1,\dots,x_n$ belonging to class $c_i$ (posterior probability)
$P(x_j|c_i)$ : probability of occurrence of word $x_j$ in class $c_i$ (conditional probability)
$P(c_i)$ : probability of class $c_i$ in the training data (prior probability)
$(x_1,\dots,x_n)$ : word features

Equation (4) is the probability model of the Naïve Bayes theorem used in the classification
process. In the Naïve Bayes Classifier, a test datum is assigned to the class $c_i$ with the maximum
a posteriori probability (MAP), $c_{MAP}$, defined as follows:

$$c_{MAP} = \operatorname*{argmax}_{c_i \in C} \; P(c_i) \prod_{j=1}^{n} P(x_j|c_i) \quad (5)$$

with the prior probability values given by

$$P(c_i) = \frac{N_{c_i}}{N} \quad (6)$$

where $N_{c_i}$ is the amount of training data belonging to class $c_i$ and $N$ is the total number of
training data.

The conditional probabilities are

$$P(x_j|c_i) = \frac{n_j}{n} \quad (7)$$

where $n_j$ is the number of occurrences of the word $x_j$ in class $c_i$ and $n$ is the number of words
contained in class $c_i$.

Sometimes a word never appears in one of the classes, so the resulting $P(x_j|c_i)$ value is zero and
the whole product in equation (5) collapses to zero. To prevent this, Laplace (add-one) smoothing is
used: the frequency of each word is increased by 1, so the calculation of $P(x_j|c_i)$ becomes

$$P(x_j|c_i) = \frac{n_j + 1}{n + n_k} \quad (8)$$

where $n_k$ is the number of different (unique) words that appear in the training data.
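Equations (5)–(8) can be sketched in a few lines of Python (the paper implements NBC in R; the tiny tokenized "tweets" below are hypothetical, for illustration only):

```python
from collections import Counter
from math import prod

# Hypothetical training set: (tokenized tweet, class label).
train = [
    (["iklan", "bagus"], "pos"),
    (["diskon", "bagus"], "pos"),
    (["tidak", "kirim"], "neg"),
    (["transaksi", "tidak", "proses"], "neg"),
]

vocab = {w for doc, _ in train for w in doc}           # unique words -> n_k
classes = {c for _, c in train}
N = len(train)

def prior(c):                                          # equation (6): N_c / N
    return sum(1 for _, y in train if y == c) / N

counts = {c: Counter(w for doc, y in train if y == c for w in doc)
          for c in classes}

def cond(word, c):                                     # equation (8): (n_j + 1) / (n + n_k)
    n = sum(counts[c].values())                        # total words in class c
    return (counts[c][word] + 1) / (n + len(vocab))

def classify(doc):                                     # equation (5): c_MAP
    return max(classes,
               key=lambda c: prior(c) * prod(cond(w, c) for w in doc))

print(classify(["bagus", "diskon"]))   # -> pos
```

Thanks to the add-one term, a word seen in only one class still leaves the other class with a small non-zero probability instead of zeroing out the product.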

3.2. Support Vector Machine (SVM)


A text or tweet can be expressed as a vector $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$. Suppose a given set
$x = \{x_1, x_2, \dots, x_n\}$ with $x_i \in \mathbb{R}^p$ has a certain pattern that can be grouped into a positive
class and a negative class. Each datum is paired with a class label $y_i \in \{-1, +1\}$, so the data are the
pairs $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $n$ is the number of data. Assume the data are
perfectly (linearly) separable by a $p$-dimensional separating function called the hyperplane $H_0$,
defined in equation (9) as follows:

$$w \cdot x_i + b = 0 \quad (9)$$

where $w$ is the weight vector and $b$ is a scalar. The hyperplane and support vectors are illustrated in
Figure 1.

Figure 1. Hyperplane and support vector


Finding the hyperplane $H_0$ amounts to searching for the separating plane with the largest margin,
which can be formulated as equation (10):

$$\min \; \frac{1}{2}\|w\|^2 \quad (10)$$

subject to the constraints

$$y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, n.$$

This linear case assumes the data can be separated into two classes perfectly. For non-linearly
separable data, the SVM formulation must be modified, because the constraints cannot all be satisfied
and the optimization cannot be performed. Therefore, slack variables $s_i$ ($s_i \ge 0$ for all $i$; $s_i = 0$ if
$x_i$ is correctly classified) are added, giving $y_i(w^T x_i + b) \ge 1 - s_i$. The problem in equation (10)
then becomes equation (11):

$$\min \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} s_i \quad (11)$$

with constraints

$$y_i(w^T x_i + b) \ge 1 - s_i, \quad i = 1, \dots, n \quad (12)$$
$$s_i \ge 0 \text{ for all } i. \quad (13)$$

The parameter $C$ balances minimizing the training error against reducing the complexity of the
model. Using the Lagrange function, the constrained optimization problem in equations (11)–(13) can
be stated as the unconstrained problem in equation (14):

$$\min \; L_p(w, b, s, \alpha) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} s_i + \sum_{i=1}^{n} \alpha_i \left(1 - s_i - y_i(w^T x_i + b)\right). \quad (14)$$

The non-negative variables $\alpha_i \ge 0$ are called Lagrange multipliers. The objective in equation (14)
is to minimize $L_p$ with respect to $w$, $b$ and $s$, and at the same time maximize $L_p$ with respect to
$\alpha$. By setting the partial derivatives of $L_p$ with respect to $w$, $b$ and $s$ to zero, the dual problem of
equation (14) is obtained as follows:

$$\max \; L_d = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad (15)$$

with constraints

$$0 \le \alpha_i \le C, \quad i = 1, \dots, n \quad (16)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0. \quad (17)$$
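The soft-margin objective of equation (11) can be sketched from scratch with sub-gradient descent on the hinge loss (an assumption for illustration: the paper instead solves the dual (15)–(17) via R's e1071 package; the 2-D points below are hypothetical stand-ins for term-frequency vectors):

```python
import random

def svm_train(X, y, C=1.0, lr=0.01, epochs=2000, seed=0):
    """Minimize (1/2)||w||^2 + C * sum(hinge losses) by stochastic sub-gradient steps."""
    rng = random.Random(seed)
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        i = rng.randrange(len(X))
        margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
        for j in range(p):
            # Regularizer gradient w_j, plus hinge gradient when the
            # margin constraint y_i(w.x_i + b) >= 1 is violated.
            g = w[j] - (C * y[i] * X[i][j] if margin < 1 else 0.0)
            w[j] -= lr * g
        if margin < 1:
            b += lr * C * y[i]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable data with labels y_i in {-1, +1}.
X = [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]]
y = [-1, -1, 1, 1]
w, b = svm_train(X, y)
print(svm_predict(w, b, [1.8, 0.5]))   # -> 1
```

A production pipeline would use a dual solver (as e1071 or LIBSVM does), which also yields the multipliers $\alpha_i$ and hence the support vectors directly.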

4. The Research Method


The sentiment classification is evaluated by measuring the performance of the system that has been
built. The test parameters used for evaluation are accuracy, precision, and recall, calculated from the
confusion matrix. Accuracy, precision, and recall are obtained through the following formulas [10]:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \times 100\% \quad (18)$$
$$Precision_{positive} = \frac{TP}{TP + FP} \times 100\% \quad (19)$$
$$Recall_{positive} = \frac{TP}{TP + FN} \times 100\% \quad (20)$$
$$Precision_{negative} = \frac{TN}{TN + FN} \times 100\% \quad (21)$$
$$Recall_{negative} = \frac{TN}{TN + FP} \times 100\% \quad (22)$$

The data used in this study are Twitter data obtained using the Twitter API (Application
Programming Interface). There are 120 tweets used in the analysis. The first step of data collection is
to retrieve tweet data from the Twitter API using the "twitteR" package in R and the searchTwitter
function. All stages of the research can be seen in Figure 2.

Figure 2. The stages of research
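The formulas in equations (18)–(22) can be sketched in Python (the paper's pipeline is in R; the confusion-matrix counts below are hypothetical, chosen only to illustrate a 24-tweet test set):

```python
# Equations (18)-(22), computed from confusion-matrix counts.
def evaluate(tp, fp, tn, fn):
    return {
        "accuracy":           (tp + tn) / (tp + fp + tn + fn) * 100,
        "precision_positive": tp / (tp + fp) * 100,
        "recall_positive":    tp / (tp + fn) * 100,
        "precision_negative": tn / (tn + fn) * 100,
        "recall_negative":    tn / (tn + fp) * 100,
    }

# Hypothetical counts for a 24-tweet test set: TP=10, FP=0, TN=8, FN=6.
m = evaluate(tp=10, fp=0, tn=8, fn=6)
print(round(m["accuracy"], 2))   # -> 75.0
```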

5. Research, Analysis, and Discussions

5.1. Data Collection


The data collection stage retrieves, or crawls, tweet data from the Twitter API (Application
Programming Interface) using the "twitteR" package in the R program and the searchTwitter
function. There are many attributes in the retrieved Twitter data, but this study uses only the text
attribute. The Twitter data collection steps using the Twitter API are the following:

5.1.1. Connect R Program with Twitter API


Creating an application in Twitter's application management is required to connect the R program
with the Twitter API. From the application management page one obtains the consumer key (API
key), consumer secret (API secret), OAuth token, and OAuth token secret required to send secure,
authorized requests to the Twitter API, so that the R program can be used to extract Twitter data.

5.1.2. Data Crawling in Twitter


As many as 300 tweets are crawled from Twitter with the word Tokopedia as the keyword. Tweets
are retrieved using the searchTwitter() function with parameters (keyword, n). The keyword
parameter is filled with the word "@tokopedia", and the sample size n is filled with the desired
number of tweets, n = 300. After the retrieval process, the tweet data obtained are converted into a
data frame and stored in .csv format.

5.2. Data Labeling


After crawling, the data are filtered. Data filtering aims to keep only data containing users' opinions
of Tokopedia's service. Tweets containing ads or considered irrelevant to Tokopedia are removed,
leaving 120 of the 300 tweets. Data labeling is then done by giving a positive or negative label to
each tweet. Negative and positive tweets make up 53% and 47% of all tweets, respectively.

5.3. Preprocessing
The labeled tweet data then go through a preprocessing stage, which consists of five steps:

5.3.1. Case Folding


Case folding is the step of converting all the characters in the document into lowercase.

5.3.2. Tokenizing
Tokenizing is the step of breaking a text document into multiple tokens, or words. In the tokenizing
stage, punctuation, numbers, mentions, hashtags, and URLs are also removed. The purpose of
tokenizing is to make it possible to calculate the frequency of each word that appears.

5.3.3. Normalization
Normalization is the phase of word improvement process by converting the word abbreviation into
standard word form. The list of word abbreviations is stored in the form of a database using MySql. In
this study the program code obtained from previous research conducted by Khotimah (2014).

5.3.4. Stopwords Removing


Stopwords is a list of words that are considered unimportant or irrelevant to content on tweet.
Removal of stopwords for indonesia words using stopwords database (Tala, 2003) amounting to 759
words and other additional words that are not meaningful on tweet.

5.3.5. Stemming
The process of stemming is the process of converting words into basic and non-standard words into
standard words. In this study, stemming was done using the algorithm Nazief and Adriani (1996). The
program code used in the study is the program code followed by the Khotimah study (2014).
Stemming stage is the last stage in the pre-process stage.
After the pre-processing phase, the token result will be the term. Term is a unique token (word).
The next stage is the creation of term document matrix that aims to determine the frequency of
occurrence of words in the document. The weighting of words performed in this study is to use the
term frequency (tf) by looking at the number of terms that appear in each tweet.
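A term-document matrix with tf weighting can be sketched as follows (the paper builds it in R; the tokenized documents here are hypothetical):

```python
from collections import Counter

# Hypothetical tokenized tweets after preprocessing.
docs = [["iklan", "bagus", "bagus"], ["tidak", "kirim"], ["bagus", "diskon"]]

# Rows are terms, columns are tweets; entries are term frequencies (tf).
terms = sorted({w for d in docs for w in d})
tdm = [[Counter(d)[t] for d in docs] for t in terms]

print(terms)    # -> ['bagus', 'diskon', 'iklan', 'kirim', 'tidak']
print(tdm[0])   # tf row for 'bagus' -> [2, 0, 1]
```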

5.4. Data Splitting

After preprocessing, the tweet data are divided into two parts: training data and testing data. In this
study the training and testing data are split using the 80:20 rule, giving 96 training and 24 testing
tweets.

After the data have been split, a wordcloud representation is made from the term-document matrix.
Words appearing in positive and negative tweets are represented by wordclouds: the positive class
can be seen in Figure 3 and the negative class in Figure 4.

Figure 3. Wordcloud representation in positive class

Figure 4. Wordcloud representation in negative class


Figure 3 shows that words that often appear in the positive class are bagus, sale, ramadhan, and
diskon, while Figure 4 shows that words that often appear in the negative class are tidak, bayar,
transaksi, and kirim. Preprocessing generated 388 terms, which are used as features for classification
in both algorithms.
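The 80:20 split described above can be sketched as follows (random shuffling is an assumption for illustration; the paper does not state how the rows were chosen):

```python
import random

tweets = [f"tweet_{i}" for i in range(120)]   # the 120 labeled tweets
random.Random(42).shuffle(tweets)

cut = int(len(tweets) * 0.8)                  # 80:20 rule
train, test = tweets[:cut], tweets[cut:]

print(len(train), len(test))                  # -> 96 24
```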

5.5. Classification
5.5.1. Classification using the Naive Bayes algorithm.
The first classification method in this research is the Naïve Bayes algorithm. Based on the
preprocessing results, the probability of each class in the sample (prior probabilities) and the
per-word probabilities of belonging to the positive or negative class (conditional probabilities) are
calculated. The 388 terms generated by preprocessing are used as features for the NBC algorithm.
The classification model classifies the test data using knowledge derived from the training data. The
prior probabilities of each class in the training data are as follows:

P(c_positive) = 0.4166667 and P(c_negative) = 0.5833333

The conditional probabilities of words in each class, calculated with the Naïve Bayes Classifier
algorithm, can be seen in Table 1.


Table 1. Word probability of each class (conditional probabilities)


Words Negative Positive
Terimakasih 0.01785714 0.1250000
Tidak 0.39285710 0.1250000
Tokopedia 0.07142857 0.4500000
Proses 0.05357143 0.0250000
Bagus 0.00000000 0.4250000
Transaksi 0.08928571 0.0000000
Tolong 0.10714290 0.0000000
Iklan 0.0000000 0.2500000
Ramadhan 0.0000000 0.1250000
Flash 0.1785714 0.1250000
Sale 0.1964286 0.1250000

For example, for the tweet "Iklan bagus Tokopedia", the posterior probability of each class is

P(pos|tweet) = P(iklan|pos) P(bagus|pos) P(tokopedia|pos) P(pos)
             = (0.250)(0.425)(0.450)(0.4167)
             = 0.0199
and
P(neg|tweet) = P(iklan|neg) P(bagus|neg) P(tokopedia|neg) P(neg)
             = (0.000)(0.000)(0.071)(0.583)
             = 0.000

Because P(positive|tweet) > P(negative|tweet), the tweet is assigned to the positive class. In the next
stage, the classification is tested on the predetermined test data. The predictions of the Naïve Bayes
classification are then used to determine the accuracy of the model.
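The worked example above can be checked with a few lines of Python, using the conditional probabilities of Table 1 and the priors of section 5.5.1:

```python
# Conditional probabilities from Table 1 and priors from section 5.5.1.
cond = {"pos": {"iklan": 0.250, "bagus": 0.425, "tokopedia": 0.450},
        "neg": {"iklan": 0.000, "bagus": 0.000, "tokopedia": 0.071}}
prior = {"pos": 0.4167, "neg": 0.583}

def posterior(c, words):                 # equation (4)
    p = prior[c]
    for w in words:
        p *= cond[c][w]
    return p

words = ["iklan", "bagus", "tokopedia"]
print(round(posterior("pos", words), 4))   # -> 0.0199
print(round(posterior("neg", words), 4))   # -> 0.0
```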

5.5.2. Classification using the Support Vector Machine algorithm. Classification modeling is done
with the Support Vector Machine algorithm, using a linear kernel with one parameter, C (cost). C is
the penalty parameter for classification errors, and its value is set by the researchers. In this study
the values C = 0.1, 1, and 10 are used to model the training data, and the best C, i.e. the one that
gives the highest accuracy, is then selected. The SVM model with a linear kernel is built through the
RTextTools and e1071 packages, using create_container and train_model as the modeling functions
on the training data. The train_model function also reports the number of support vectors found in
the training data; in this research there are 81 support vectors. The support vectors determined by the
system define the values of $w$ and $b$, which in turn identify the hyperplane. The 388 terms generated
by preprocessing are used as features for the SVM algorithm, just as for the NBC algorithm, and the
model built on the training data is used to classify the test data with the classify_model function. In
the evaluation and validation stage, the confusion matrix is used to evaluate the classification results
(predictions) on the test data. From the accuracy calculations, the best SVM accuracy is obtained
with C = 0.1. Examples of misclassified tweets are shown in Table 2.
Table 2. Examples of false tweets predicted in SVM
Tweets Actual Predicted
wakakak wajar temen ku depet flash sale untung na positive negative
maklum ada yg tidak dapet produk batas akses puluh positive negative
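The selection of C over the grid {0.1, 1, 10} can be sketched as follows. Note the accuracy values other than the reported best (C = 0.1, 83.34%) are hypothetical stand-ins: in the real pipeline accuracy_for would train an SVM with the given C and score it on the test data, as the paper does with RTextTools/e1071:

```python
# Pick the cost parameter C with the highest test accuracy.
def accuracy_for(C):
    # Placeholder lookup; only C = 0.1 -> 83.34% comes from the paper,
    # the other values are invented for illustration.
    return {0.1: 83.34, 1: 79.17, 10: 79.17}[C]

grid = [0.1, 1, 10]
best_C = max(grid, key=accuracy_for)
print(best_C)   # -> 0.1
```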

5.5.3. Evaluation. The accuracy, precision, and recall values of the NBC and SVM algorithms are
shown in Table 3.

Table 3. Accuracy, precision and recall values with NBC and SVM

Algorithm  Accuracy  Precision (pos)  Precision (neg)  Recall (pos)  Recall (neg)
NBC        75%       100%             33.33%           62.50%        100%
SVM        83.34%    100%             75%              66.67%        100%

Table 3 provides evidence that the SVM does a better job than the NBC in classifying opinions on
Tokopedia's services: on every criterion the SVM scores at least as high as the NBC.

6. Conclusion
According to the accuracy, precision, and recall values, the performance of the SVM algorithm in
classifying opinions on Tokopedia's services is better than that of the NBC. The wordclouds suggest
positive sentiment about Tokopedia's product quality and its discount and sale programs, while the
negative class corresponds to the shipment and payment procedures.

References

[1] Ahmad, M., Aftab, S., & Ali, I. 2017. Sentiment analysis of tweets using SVM. International
Journal of Computer Applications, 25-29.
[2] Apriliyanti, A. 2015. Sentiment analysis with naive bayes to see people's perception of batik
on twitter social network. Proceedings of the National Seminar Mathematics and
Mathematics Education of Muhammadiyah Surakarta University, 836.
[3] Bowell, D. 2002, August 6. Introduction to support vector machines. Taken from
dustwell.com: dustwell.com/PastWork/IntroToSVM.pdf
[4] Khotimah, H. 2014. Pemodelan hybrid tourism recommendation menggunakan hidden
markov model dan text mining berbasis data sosial media. Thesis, Institut Pertanian Bogor.
[5] Liu, B. 2012. Sentiment analysis and opinion mining. Morgan & Claypool Publishers.
[6] Manning, C.D., Raghavan, P., & Schutze, H. 2009. An introduction to information retrieval.
Cambridge University Press: New York.
[7] Narayanan, V., Arora, I., & Bhatia, A. 2013. Fast and accurate sentiment classification
using an enhanced naive bayes model. Varanasi, India: Indian Institute of Technology.
[8] Pratama, E., & Trilaksono, B. 2015. The classification of customer complaint topics based on
tweets using the combination of extracted features in the method of support vector machine
(SVM). Jurnal Edukasi dan Penelitian Informatika (JEPIN) 1(2), 53-59.
[9] Raschka, S. 2014. Naive bayes and text classification i - introduction and theory.
Birmingham: Packt Publishing.
[10] Routray, P., Swain, C. K., & Mishra, S. P. 2013. A survey on sentiment analysis.
International Journal Of Computer Applications , 2-4.
[11] Statista. July 2016. Number of active twitter users in leading markets as of may 2016 (in
millions). Taken from https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/ on 25 March 2018.


[12] Vijayarani, S., Ilamathi, J., & Nithya. 2015. Preprocessing techniques for text mining - an
overview. International Journal of Computer Science & Communication Networks, 5(1), 7-
16.
[13] Wahyuningtyas, A. 2016. Spam detection on twitter using naïve bayes algorithm. Bogor
Agricultural University.
[14] Zainuddin, N., & Selamat, A. 2014. Sentiment analysis using support vector machine. IEEE
2014 International Conference on Computer, Communication, and Control Technology, 333-
337.
