Opinion mining using decision tree based feature selection through Manhattan
hierarchical cluster measure
Jeevanandam Jotheeswaran
Amity University
ABSTRACT
Opinion mining plays a major role in text mining applications such as consumer attitude detection, brand and product positioning, customer relationship management, and market research. These applications have led to a new generation of companies and products meant for online market perception, reputation management and online content monitoring. Subjectivity and sentiment analysis focus on the automatic identification of private states such as beliefs, opinions, sentiments, evaluations, emotions and speculations in natural language. Subjectivity classification labels data as either subjective or objective, whereas sentiment classification adds further granularity by classifying subjective data as positive, negative or neutral. Features are extracted from the data for classifying the sentiment. Feature selection has gained importance due to its contribution to reducing classification cost with regard to time and computational load. In this paper, the main focus is on feature selection for opinion mining using decision tree based feature selection. The proposed method is evaluated using the IMDb data set and is compared with Principal Component Analysis (PCA). The experimental results show that the proposed feature selection method is promising.
Keywords: Opinion Mining, IMDb, Inverse Document Frequency (IDF), Principal Component Analysis (PCA), Learning Vector Quantization (LVQ).
Journal of Theoretical and Applied Information Technology
10th December 2013. Vol. 58 No.1
© 2005 - 2013 JATIT & LLS. All rights reserved.
processing is based on the fact that limited differences can be identified between two text pieces which do not change the meaning much.

Some research fields are predominant in sentiment analysis: sentiment classification, feature-based sentiment classification and opinion summarization. Sentiment classification classifies whole documents according to the opinions they express towards specific objects, while feature-based sentiment classification considers opinions on certain features of subjects. Opinion summarization is different from traditional text summarization in that only the product features on which customers expressed opinions are mined. Opinion summarization does not summarize reviews by choosing a subset of the original sentences, or rewriting some of them, to capture the main points, as traditional text summarization does.

It is hard for a human reader to locate relevant sources, extract related sentences and opinions, read and summarize them, and organize them into usable forms. Thus, automated opinion discovery or summarization systems are needed. Sentiment analysis, also called opinion mining, grew from this need and is a challenging natural language processing/text mining problem. Its huge value for applications has led to its explosive growth in research, in academia and in industry. It focuses on the topics below [3]:

The problem of sentiment analysis: A scientific problem has to be defined before it is solved, to formalize it. The formulation introduces the basic definitions, core concepts/issues, sub-problems and target objectives. It is also a framework to unite different research directions. From an application point of view, it tells practitioners what the main tasks, inputs and outputs are, and how the resulting outputs are used in practice.

Feature-based sentiment analysis: This discovers the targets on which opinions are expressed in a sentence, and determines whether the opinions are positive, negative or neutral. The targets are objects and their components/attributes/features. An object could be a service, product, organization, individual, topic, event etc. For example, in a product review sentence, it identifies the product features commented on by the reviewer and determines whether the comments are positive or negative.

A frequently used data mining dimensionality reduction technique is feature selection, which selects a subset of the original features based on specific criteria. It reduces the number of features and removes irrelevant/redundant/noisy data, with effects on applications that include speeding up data mining algorithms and improving mining performance such as predictive accuracy and result comprehensibility. Feature selection has been an active research field in machine learning and data mining for years and is now applied to fields like text mining, genomic analysis, intrusion detection and image retrieval. As new applications emerged, many challenges arose, needing new theories/methods to address high-dimensional/complex data. Optimal redundancy removal, stable feature selection, and the exploitation of auxiliary data and prior knowledge in feature selection are among the fundamental and challenging problems in feature selection. To date, large volumes of literature have been published in the research direction of feature selection.

Inverse document frequency (IDF) is an important and widely used concept in information retrieval. When IDF is combined with term frequency (TF), it results in a robust and highly effective term weighting scheme applied across various application areas like databases, natural language processing, knowledge management, text classification and information retrieval. There have been few attempts to improve the limited number of "classical" IDF formulations, mainly due to the fact that it is nontrivial to change the standard IDF formulation in a theoretically meaningful way while improving effectiveness. There may be heuristic ways to alter the IDF formulation, but doing so leads to little understanding as to why things improved.

In this paper, it is proposed to compute the inverse document frequency and select features using the proposed feature selection, and to compare it with Principal Component Analysis. The effectiveness of the features thus selected is evaluated using an LVQ classifier. It is proposed to extract the feature set from the IMDb movie data set.

2. RELATED WORK

Online customer reviews are a significant informative resource useful for both potential customers and product manufacturers. Reviews are written in natural language and are unstructured free texts in web pages. The task of manually scanning huge amounts of reviews is computationally burdensome and not practically feasible from the business/customer perspective. Hence, it is
efficient to automatically process the various reviews, providing the necessary information in a correct manner. Opinion summarization addresses how to determine the sentiment, attitude or opinion an author expressed in natural language text regarding a specific feature. An approach to mine product features and opinions based on consideration of both syntactic and semantic information was proposed by Somprasertsri and Lalitrojwong [4]. The application of dependency relations and ontological knowledge with a probabilistic based model proved that this method was more flexible than others.

Opinion mining tasks extract from documents the opinions expressed by sources on a target. A comparative study on methods used for mining opinions from newspaper article quotations, its difficulty being motivated by the various possible targets and the variety that quotes have, was presented by Balahur, et al., [5]. This approach evaluated annotated quotations from news provided by the EMM news engine. Generic opinion mining requires the use of large lexicons and specialized training/testing data.

In the past, researchers developed a large number of feature selection algorithms designed for various purposes, and each model had its own advantages/disadvantages. Though there were efforts to survey existing feature selection algorithms, a repository collecting representative feature selection algorithms to facilitate comparison/joint study was yet to materialize. To offset this, Zhao, et al., [6] presented a feature selection repository designed to collect popular algorithms developed in feature selection research, to be a platform to facilitate application/comparison/joint study. The repository assists researchers to achieve reliable evaluation when developing new feature selection algorithms.

In schools/colleges, student comments about the courses are an informative resource to improve teaching effectiveness. El-Halees and Gaza [7] proposed a model to extract knowledge from students' opinions to improve and measure course performance. The task is to use student generated content to study a specific course's performance and to compare it with that of other courses. A model was suggested for this consisting of 2 components: feature extraction to extract features like teachers, exams and resources from user-generated content for a selected course, and a classifier to provide a sentiment for each feature. The features are then grouped and visualized graphically. This enables comparison of one or more courses.

Faster and accessible internet ensures that people search/learn from fragmented knowledge. Generally, huge volumes of documents and homepages or learning objects are returned by search engines without any specific order. Even if they are related, a user moves forward/backward in the material to figure out the page to be read first, as users usually have little or no experience in that domain. Though a user may have domain intuition, the materials are still to be linked. A learning path construction approach based on modified TF-IDF, ATF-IDF and Formal Concept Analysis algorithms was proposed by Hsieh, et al., [8]. The new approach first constructed a Concept Lattice with keywords extracted by ATF-IDF from documents to ensure a relationship hierarchy between keyword-represented concepts. Then FCA was used to compute intra-document relationships to decide on a correct learning path.

Data classification for cross domains has been researched and is a basic method to distinguish one domain from another, as it needs to know what belongs to which group. It can infer an unseen dataset with unknown classes through structural similarity analysis with a dataset of known classes. The reliability of classification results is crucial: the higher the accuracy of the generated classification results, the better the classifier. Researchers regularly seek to improve classification accuracy either through existing techniques or through developing new ones. Various procedures are used to improve classification accuracy performance. While most methods try to improve the accuracy of classifier techniques, Omar, et al., [9] reduced the number of dataset features by choosing only relevant features prior to handing the dataset over to the classifier, thereby motivating the need for methods capable of selecting relevant features with lowered information loss. The aim is to reduce classifier workload using feature selection. The review reveals that classification with feature selection produced impressive results in accuracy.
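Since TF-IDF variants recur throughout this survey, a minimal sketch may help fix the classical formulation. This is only an illustration; the toy review corpus and the unsmoothed formula idf(t) = log(N / df(t)) are assumptions of this sketch, not code from any of the cited works.

```python
import math

def idf(corpus):
    """Compute inverse document frequency for each term in a corpus.

    corpus: list of documents, each a list of tokens.
    Uses the classical formulation idf(t) = log(N / df(t)).
    """
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):           # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tf_idf(doc, idf_weights):
    """Weight each term of a document by raw term frequency times IDF."""
    scores = {}
    for term in doc:
        scores[term] = scores.get(term, 0.0) + idf_weights.get(term, 0.0)
    return scores

# Toy usage: "boring" appears in only one of three reviews,
# so it receives the highest IDF weight.
reviews = [["great", "movie", "great", "cast"],
           ["boring", "movie"],
           ["great", "cast"]]
weights = idf(reviews)
```

Rare terms get large weights and ubiquitous terms get weights near zero, which is why the TF-IDF combination discriminates well between documents.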
Feature selection has gained importance due to its contribution to saving classification cost with regard to time and computation load. In searching for essential features, one feature search method is through decision trees. The latter act as an intermediate feature space inducer to select essential features. Some studies used the decision tree as a feature ranker with a direct threshold measure in decision tree based feature selection, while others retain decision trees but use pruning, which acts as a threshold mechanism in feature selection. Yacob, et al., [10] suggested a threshold measure using Manhattan Hierarchical Cluster distance for use in feature ranking to select relevant features as part of the feature selection procedure. Results were promising and can be further improved by adding a higher number of attribute test cases.

Feature selection reduces the number of features in applications where data has 100's/1000's of features. Present feature selection focuses on locating relevant features. Yu and Liu [11] demonstrated that feature relevance alone is insufficient to ensure efficient feature selection for high dimensional data. Feature redundancy was defined and proposed to perform redundancy analysis in feature selection. A new framework decoupling relevance analysis and redundancy analysis was proposed. A correlation-based method for relevance/redundancy analysis was developed and its efficiency/effectiveness studied in comparison with representative methods.

Principal component analysis (PCA) is a mainstay of data analysis - a black box widely used but usually poorly understood. Shlens [12] dispelled this myth, as the manuscript aimed to build a solid intuition for how and why PCA works. It crystallized this knowledge by deriving the mathematics behind PCA from simple intuitions. It was felt that by addressing all aspects, readers would have an improved understanding of PCA and also of the when, how and why of this technique's application.

A new matrix learning scheme extending Relevance Learning Vector Quantization (RLVQ) to a general adaptive metric was proposed by Schneider, et al., [13]. By introducing a full matrix of relevance factors in the distance measure, correlations between features and their importance for the classification scheme are accounted for, and a general, automated metric adaptation takes place during training. Compared to the weighted Euclidean metric used in RLVQ and its variations, a full matrix represents the data's internal structure more correctly. Large margin generalization bounds are transferred to this scheme, leading to bounds independent of the input dimensionality. This includes local metrics attached to individual prototypes, corresponding to piecewise quadratic decision boundaries. The algorithm was tested and compared to alternative LVQ schemes using an artificial data set, a benchmark multi-class problem from the UCI repository, and a problem from bioinformatics, the recognition of splice sites for C.

Figure 1: Flowchart of Proposed method (Movie Data from IMDb -> Feature Extraction (IDF) -> Proposed Feature Selection -> Classification (LVQ) -> Benchmark with PCA)

3. METHODOLOGY

The flowchart of the proposed methodology is shown in Figure 1 and the following sections detail the steps in the proposed methodology.

3.1 IMDb Database

The IMDb is a large database with relevant and comprehensive information on movies - past, present and future [14]. It began as a set of shell scripts and data files. The latter was a collection of email messages between users of
specified at that node. And every test results in branches, representing the varied test outcomes. The basic decision tree induction algorithm is a greedy algorithm constructing decision trees in a top-down recursive divide-and-conquer manner [16].

The algorithm begins with the tuples in the training set, selecting the best attribute yielding maximum information for classification. It generates a test node for this, and then top-down decision tree induction divides the current tuple set according to the current test attribute values [17]. Classifier generation stops when all tuples of a subset belong to the same class, or if it is not worthwhile to proceed with additional separation into further subsets, i.e. if further attribute tests yield information for classification only below a pre-specified threshold. In this paper, it is proposed to base the threshold measure on information gain and the Manhattan hierarchical cluster.

In the proposed feature selection, decision tree induction selects the relevant features. Decision tree induction is the learning of decision tree classifiers, constructing a tree structure where each internal node (non-leaf node) denotes an attribute test. Each branch represents a test outcome and each external node (leaf node) denotes a class prediction. At every node, the algorithm selects the attribute that best partitions the data into individual classes. The best attribute for partitioning is selected by attribute selection with information gain; the attribute with the highest information gain becomes the splitting attribute. The information gain of the attribute is found from

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability that an arbitrary vector in D belongs to class c_i. A log function to base 2 is used, as information is encoded in bits. Info(D) is just the average amount of information required to identify the class label of a vector in D. The information gain is used to rank the features, and the ranked features are treated as features in hierarchical clusters. The proposed Manhattan distance for n number of clusters is given as follows:

MDist = \sum_{i=1}^{n} |a_i - b_i|

A cubic polynomial equation is derived from the Manhattan values and the threshold criterion is determined from the slope of the polynomial equation. The features are assumed to be irrelevant for classifying if the slope is zero or negative, and relevant when the slope is positive.

3.4 Principal Component Analysis

When the input dimensions are large and the components highly correlated, the dimensions are reduced using PCA [18]. For a set of variables, PCA calculates a smaller set of artificial variables representing the observed variables' variance. The artificial variables calculated are the principal components, used as predictor or criterion variables in the analysis. PCA orthogonalises the variables, retaining the principal components with large variation and eliminating the components with the least variation from the dataset. When applied on a dataset, PCA observes the following steps:

1. The mean is subtracted from each data dimension, producing a data set with zero mean.
2. The covariance matrix is calculated.
3. The eigenvectors and eigenvalues of the covariance matrix are calculated.
4. The highest eigenvalues are the principal components of the dataset. Eigenvalues of less significance are removed to form the feature vector.
5. A new dataset is derived.

3.5 Learning Vector Quantization (LVQ)

Learning Vector Quantization (LVQ) is a local classification algorithm, where classification boundaries are locally approximated, the difference being that instead of using all training dataset points, LVQ uses only a set of prototype vectors. This ensures efficient classification, as the number of vectors needing to be stored or compared is greatly reduced. Additionally, a carefully chosen prototype set also mitigates noise problems in the classification accuracy [18].

LVQ is an algorithm that learns appropriate prototype positions used for classification and is defined by a set of P prototypes {(m_j, c_j), j = 1…P}, where m_j is a K-dimensional vector in feature space, and c_j its class label. The number of prototypes is larger than the number of classes; thus, each class is represented by more than one prototype. Given an unlabeled data point x_u, its class label y_u is determined as the class c_q of the nearest prototype m_q:

y_u = c_q, \quad q = \arg\min_j d(x_u, m_j)
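The nearest-prototype rule y_u = c_q above can be sketched directly. This is a minimal illustration, assuming Euclidean distance for d and hand-picked prototypes; the training procedure that positions the prototypes is not shown.

```python
import math

def lvq_classify(x_u, prototypes):
    """Assign x_u the class label of its nearest prototype:
    y_u = c_q, q = argmin_j d(x_u, m_j).

    prototypes: list of (m_j, c_j) pairs, where m_j is a feature
    vector and c_j its class label. Euclidean distance is used
    for d, a common choice.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, label = min(prototypes, key=lambda p: dist(x_u, p[0]))
    return label

# Toy usage: two prototypes per class, since the text notes that
# each class is usually represented by more than one prototype.
protos = [([0.0, 0.0], "negative"), ([0.2, 0.1], "negative"),
          ([1.0, 1.0], "positive"), ([0.9, 0.8], "positive")]
```

Only the prototype set needs to be stored and compared at query time, which is the efficiency argument made above.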
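The entropy formula Info(D) used to rank features in the proposed method can likewise be sketched. This is only an illustration with a hypothetical two-class label list; the split-gain helper follows the standard textbook definition, which the text references but does not spell out, and the tree construction, Manhattan clustering and cubic-polynomial thresholding of the proposed method are not reproduced here.

```python
import math

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i): the average number of bits
    needed to identify the class label of a vector drawn from D."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        total -= p * math.log2(p)
    return total

def info_gain(labels, partitions):
    """Gain of an attribute splitting D into the given partitions:
    Info(D) minus the weighted average entropy of the partitions."""
    n = len(labels)
    remainder = sum(len(part) / n * info(part) for part in partitions)
    return info(labels) - remainder

# Toy usage: a balanced two-class set has entropy 1 bit; an attribute
# that separates the classes perfectly recovers the full bit, while an
# uninformative split gains nothing.
labels = ["pos", "pos", "neg", "neg"]
```

Ranking attributes by this gain is what makes the induction greedy: at each node the attribute with the highest gain is tested first.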
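The five PCA steps listed in Section 3.4 can be sketched for the two-dimensional case, where the eigendecomposition of the covariance matrix has a closed form. This is an illustrative toy, not the paper's implementation; real data would have far more dimensions and use a numerical eigensolver.

```python
import math

def pca_2d(data):
    """Follow the five PCA steps from the text for 2-D points.

    data: list of (x, y) pairs. Returns (mean, axis, projected), where
    axis is the unit eigenvector of the covariance matrix with the
    largest eigenvalue and projected is the new 1-D dataset kept
    after discarding the less significant component.
    """
    n = len(data)
    # Step 1: subtract the mean from each dimension (zero-mean data).
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Step 2: sample covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    # Step 3: eigenvalues of a symmetric 2x2 matrix via the quadratic formula.
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)   # largest eigenvalue
    # Step 4: keep only the eigenvector of the largest eigenvalue.
    if cxy:
        vx, vy = cxy, lam1 - cxx
    else:
        vx, vy = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    # Step 5: derive the new (1-D) dataset by projecting onto that axis.
    projected = [x * axis[0] + y * axis[1] for x, y in centered]
    return (mx, my), axis, projected

# Toy usage: points on the line y = x have all their variance along
# the direction (1/sqrt(2), 1/sqrt(2)).
mean, axis, projected = pca_2d([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
```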
Figure 2: Classification accuracy

It can be seen from Figure 2 that the classification accuracy obtained through Naïve Bayes with LVQ is better than that of Naïve Bayes with PCA by around 5%. Figure 3 shows the Root Mean Squared Error (RMSE).

Table 2: Root Mean Squared Error

Technique used          RMSE
CART with PCA           0.61
Naïve Bayes with PCA    0.54

Figure 4: Precision