

Journal of Theoretical and Applied Information Technology
10th December 2013. Vol. 58 No. 1
© 2005-2013 JATIT & LLS. All rights reserved.
ISSN: 1992-8645    E-ISSN: 1817-3195    www.jatit.org

OPINION MINING USING DECISION TREE BASED FEATURE SELECTION
THROUGH MANHATTAN HIERARCHICAL CLUSTER MEASURE

JEEVANANDAM JOTHEESWARAN¹, DR. Y. S. KUMARASWAMY²
¹Research Scholar, Vel Tech Dr. RR & Dr. SR Technical University, Chennai, INDIA.
²HOD & Sr. Prof., Dept. of MCA, Dayananda Sagar College of Engineering, Bangalore, INDIA.

E-mail: jsearch@zmail.com, yskldswamy2@yahoo.com

ABSTRACT

Opinion mining plays a major role in text mining applications such as consumer attitude detection, brand and product positioning, customer relationship management, and market research. These applications have led to a new generation of companies and products devoted to online market perception, reputation management and online content monitoring. Subjectivity and sentiment analysis focus on the automatic identification of private states, such as beliefs, opinions, sentiments, evaluations, emotions and speculations, in natural language. Subjectivity classification labels data as either subjective or objective, whereas sentiment classification adds further granularity by classifying subjective data as positive, negative or neutral. Features are extracted from the data to classify sentiment. Feature selection has gained importance due to its contribution to saving classification cost with regard to time and computational load. In this paper, the main focus is on feature selection for opinion mining using decision tree based feature selection. The proposed method is evaluated using the IMDb data set and is compared with Principal Component Analysis (PCA). The experimental results show that the proposed feature selection method is promising.

Keywords: Opinion Mining, IMDb, Inverse Document Frequency (IDF), Principal Component Analysis (PCA), Learning Vector Quantization (LVQ)

1. INTRODUCTION

As a text understanding technology, opinion mining helps people locate relevant opinions in a large volume of reviews. A search engine based on opinion mining technology shows potential to address this need. An opinion-mining tool pores over product reviews, extracts opinion units and saves them in opinion databases. When a user inputs an opinion-searching query, the search engine extracts the product names and attributes from the query, forwards a complicated SQL query to the opinion database, and displays the output on a Web interface. Opinion-search engines arrange information based on opinion and not on the document; hence, product review information is accessed quickly and easily [1].

State-of-the-art opinion mining techniques are divided into two camps: attribute-driven methods and sentiment-driven methods. Their basic idea is to use either attribute or sentiment keywords to locate opinion candidates, applying certain opinion patterns (involving attributes or sentiment keywords) to extract sentiment expressions and filter false opinion candidates. A drawback of these methods is that they yield higher precision at the cost of a large recall loss, as generalization ability is not implied. The problem is mainly caused by out-of-vocabulary (OOV) attributes and OOV sentiment keywords encountered in natural language review text.

Sentiment analysis is a type of natural language processing for tracking the public mood about a specific product or topic. Sentiment analysis, also called opinion mining, involves building a system to collect and examine opinions about a product expressed in comments, blog posts, tweets or reviews. Sentiment analysis is used in many ways. For example, in marketing it judges the success of an ad campaign or new product launch, determines which product versions or services are popular, and identifies which demographics like or dislike specific features [2].

There are many challenges in sentiment analysis. The first is that an opinion word considered positive in one situation may be negative in another. The second is that people express opinions in various ways, whereas conventional text processing relies on the fact that small differences between two pieces of text do not change the meaning much.

Some research directions are predominant in sentiment analysis: sentiment classification, feature-based sentiment classification and opinion summarization. Sentiment classification classifies whole documents according to the opinions they express toward specific objects, whereas feature-based sentiment classification considers opinions on the features of certain subjects. Opinion summarization differs from traditional text summarization in that only the product features on which customers expressed opinions are mined; it does not summarize reviews by choosing a subset of the original sentences, or by rewriting some of them, to capture the main points as traditional text summarization does.

It is hard for a human reader to locate relevant sources, extract related sentences and opinions, read and summarize them, and organize them into usable forms. Thus, automated opinion discovery or summarization systems are needed. Sentiment analysis, also called opinion mining, grew out of this need and is a challenging natural language processing and text mining problem. Its huge value for applications has led to its explosive growth in research, academia and industry. It focuses on the topics below [3]:

The problem of sentiment analysis: A scientific problem has to be defined before it is solved, in order to formalize it. The formulation introduces the basic definitions, core concepts and issues, sub-problems and target objectives. It also provides a framework to unite different research directions. From an application point of view, it tells practitioners what the main tasks, inputs and outputs are, and how the resulting outputs are used in practice.

Feature-based sentiment analysis: This discovers the targets on which opinions are expressed in a sentence, and determines whether the opinions are positive, negative or neutral. The targets are objects and their components, attributes and features. An object could be a service, product, organization, individual, topic, event, etc. For example, in a product review sentence it identifies the product features commented on by the reviewer and determines whether the comments are positive or negative.

A frequently used data mining dimensionality reduction technique is feature selection, which selects a subset of the original features based on specific criteria. It reduces the number of features and removes irrelevant, redundant or noisy data, with effects that include speeding up data mining algorithms and improving mining performance such as predictive accuracy and result comprehensibility. Feature selection has been an active research field, developed in machine learning and data mining for years, and is now applied to fields like text mining, genomic analysis, intrusion detection and image retrieval. As new applications emerged, many challenges also arose, requiring new theories and methods to address high-dimensional and complex data. Optimal redundancy removal, stable feature selection, and the exploitation of auxiliary data and prior knowledge in feature selection are among the fundamental and challenging problems in the field. To date, large volumes of literature have been published in this research direction.

Inverse document frequency (IDF) is an important and widely used concept in information retrieval. When IDF is combined with term frequency (TF), the result is a robust and highly effective term weighting scheme applied across various application areas such as databases, natural language processing, knowledge management, text classification and information retrieval. There have been few attempts to improve the limited number of "classical" IDF formulations, mainly because it is nontrivial to change the standard IDF formulation in a theoretically meaningful way while improving effectiveness. There may be heuristic ways to alter the IDF formulation, but doing so yields little understanding of why things improved.

In this paper, it is proposed to compute the inverse document frequency, select features using the proposed feature selection method, and compare it with Principal Component Analysis. The effectiveness of the selected features is evaluated using an LVQ classifier. It is proposed to extract the feature set from the IMDb movie data set.

2. RELATED WORK

Online customer reviews are a significant informative resource, useful for both potential customers and product manufacturers. Reviews are written in natural language as unstructured free text in web pages. The task of manually scanning huge amounts of reviews is burdensome and not practical from either a business or a customer perspective.
Hence, it is efficient to automatically process reviews and provide the necessary information in a correct form. Opinion summarization addresses how to determine the sentiment, attitude or opinion an author expressed in natural language text regarding a specific feature. An approach to mine product features and opinions based on both syntactic and semantic information was proposed by Somprasertsri and Lalitrojwong [4]. Applying dependency relations and ontological knowledge with a probabilistic based model, they showed that this method was more flexible than others.

Opinion mining tasks extract, from documents, the opinions expressed by sources on a target. A comparative study of methods for mining opinions from newspaper article quotations, a task made difficult by the variety of possible targets and the variety of forms quotes take, was presented by Balahur, et al. [5]. This approach was evaluated on annotated quotations from news provided by the EMM news engine. Generic opinion mining requires the use of large lexicons, as well as specialized training and testing data.

In the past, researchers developed a large number of feature selection algorithms designed for different purposes, and each model had its own advantages and disadvantages. Though there have been efforts to survey existing feature selection algorithms, a repository collecting representative algorithms to facilitate comparison and joint study had yet to materialize. To offset this, Zhao, et al. [6] presented a feature selection repository designed to collect popular algorithms developed in feature selection research, serving as a platform to facilitate their application, comparison and joint study. The repository assists researchers in achieving reliable evaluation when developing new feature selection algorithms.

In schools and colleges, student comments about courses are an informative resource for improving teaching effectiveness. El-Halees [7] proposed a model to extract knowledge from students' opinions in order to measure and improve course performance. The task is to use student-generated content to study a specific course's performance and to compare it with that of other courses. The suggested model consists of two components: feature extraction, which extracts features like teachers, exams and resources from user-generated content for a selected course, and a classifier, which provides a sentiment for each feature. The features are then grouped and visualized graphically, enabling comparison of one or more courses.

Faster and more accessible internet means that people search and learn from fragmented knowledge. Generally, huge volumes of documents, homepages or learning objects are returned by search engines without any specific order. Even if they are related, a user must move forward and backward through the material to figure out which page to read first, as users usually have little or no experience in the domain. Though a user may have domain intuition, the materials still have to be linked. A learning path construction approach based on modified TF-IDF (ATF-IDF) and Formal Concept Analysis (FCA) algorithms was proposed by Hsieh, et al. [8]. The approach first constructed a Concept Lattice with keywords extracted by ATF-IDF from documents, ensuring a relationship hierarchy between the concepts represented by the keywords. FCA was then used to compute intra-document relationships to decide on a correct learning path.

Data classification across domains has been researched and is a basic method to distinguish one group from another, as it needs to know what belongs to which group. It can infer an unseen dataset with unknown classes through structural similarity analysis against a dataset with known classes. The reliability of the classification results is crucial: the higher the accuracy of the generated classification results, the better the classifier. Researchers regularly seek to improve classification accuracy through existing techniques or by developing new ones, and various procedures are used for this. While most methods try to improve the accuracy of the classifier itself, Omar, et al. [9] reduced the number of dataset features by choosing only relevant features before handing the dataset over to the classifier, motivating the need for methods capable of selecting relevant features with low information loss. The aim is to reduce the classifier workload using feature selection. Their review reveals that classification with feature selection produced impressive accuracy results.
74
Feature selection has gained importance due to its contribution to saving classification cost with regard to time and computational load. One method of searching for essential features is through decision trees, used as an intermediate feature space inducer to select essential features. Some studies used a decision tree as a feature ranker with a direct threshold measure in decision tree based feature selection, while others retain the decision tree but use pruning as the threshold mechanism in feature selection. Yacob, et al. [10] suggested a threshold measure using the Manhattan Hierarchical Cluster distance for use in feature ranking, to select relevant features as part of the feature selection procedure. Results were promising and can be further improved by adding test cases with a larger number of attributes.

Feature selection reduces the number of features in applications where the data has hundreds or thousands of features. Much feature selection work focuses on locating relevant features. Yu and Liu [11] demonstrated that feature relevance alone is insufficient for efficient feature selection on high-dimensional data. Feature redundancy was defined, and redundancy analysis was proposed as part of feature selection. A new framework decoupling relevance analysis and redundancy analysis was proposed; a correlation-based method for relevance and redundancy analysis was developed, and its efficiency and effectiveness were studied against representative methods.

Principal component analysis (PCA) is a mainstay of data analysis - a black box that is widely used but often poorly understood. Shlens [12] dispelled this myth in a manuscript that aimed to build a solid intuition for how and why PCA works, crystallizing this knowledge by deriving the mathematics behind PCA from simple intuitions. It was felt that by addressing all aspects, readers would gain an improved understanding of PCA, as well as of the when, how and why of applying the technique.

A new matrix learning scheme extending Relevance Learning Vector Quantization (RLVQ) to a general adaptive metric was proposed by Schneider, et al. [13]. By introducing a full matrix of relevance factors in the distance measure, correlations between features and their importance for the classification scheme are taken into account, and a general metric adaptation takes place automatically during training. Compared to the weighted Euclidean metric used in RLVQ and its variations, a full matrix represents the data's internal structure more correctly. Large margin generalization bounds transfer to this setting, leading to bounds independent of the input dimensionality. The scheme includes local metrics attached to each prototype, corresponding to piecewise quadratic decision boundaries. The algorithm was tested and compared to alternative LVQ schemes using an artificial data set, a multi-class benchmark from the UCI repository, and a problem from bioinformatics, the recognition of splice sites for C. elegans.

[Figure 1: Flowchart of the proposed method - movie data from IMDb; feature extraction using IDF; proposed feature selection; classification using LVQ; benchmark with PCA]

3. METHODOLOGY

The flowchart of the proposed methodology is shown in Figure 1, and the following sections detail the steps of the proposed methodology.

3.1 IMDb Database

The IMDb is a large database with relevant and comprehensive information on movies - past, present and future [14]. It began as a set of shell scripts and data files.
The data files were a collection of email messages between users of the rec.arts.movies Usenet bulletin board, where movie fans exchanged information on actors, actresses and directors, as well as biographical information on moviemakers. At some point, the data files became searchable with commands built from shell scripts.

IMDb uses two methods to add information to the database: web forms and e-mail forms. Experience from the submission procedures indicates that it is simpler to use web forms rather than the e-mail format if the only addition is an update to existing information. If new information is to be submitted, users request or obtain format templates from IMDb through e-mail. The proposed information has to be formatted according to the templates and validated.

3.2 Inverse Document Frequency (IDF)

Inverse document frequency (IDF) is a numerical statistic showing the importance of a word to a document in a collection or corpus [15]. It is used as a weighting factor in information retrieval and text mining. The TF-IDF value increases with the repeated appearance of a word in a document, but is offset by the word's frequency across the corpus, which controls for the fact that some words are more common than others. Variations of the IDF weighting scheme are used by search engines as a central tool to score and rank a document's relevance given a user query. IDF is also used for stop-word filtering in fields including text summarization and classification. Text classification is a machine learning task that automatically assigns a document to a pre-defined set of categories based on textual content and extracted features.

IDF appears in many heuristic measures in information retrieval, but to date IDF has itself remained a heuristic. It is defined as the logarithm of the ratio of the total number of documents to the number of documents containing a given word. Rare words have high IDF, while common function words like "the" have low IDF; IDF thus measures a word's ability to discriminate between documents. Text classification automatically assigns a text document to a pre-defined set of classes based on significant words or key features of the document. As the classes are pre-defined, it is a supervised machine learning process.

The term-document frequency is computed as follows. For a document set $x$ and a set of terms $a$, a document is modelled as a vector $v$ in an $a$-dimensional space $R^a$. The term frequency, denoted by $freq(x, a)$, expresses the number of occurrences of term $a$ in document $x$. The term-frequency matrix $TF(x, a)$ measures the association of term $a$ with a given document $x$: $TF(x, a)$ is assigned zero when the document has no occurrence of the term, and $TF(x, a) = 1$ when term $a$ occurs in document $x$; alternatively, the relative term frequency is used, i.e. the term frequency relative to the total number of term occurrences in the document. The frequency is generally normalized as (Liu, et al., 2007):

$$TF(x, a) = \begin{cases} 0 & \text{if } freq(x, a) = 0 \\ 1 + \log(1 + \log(freq(x, a))) & \text{otherwise} \end{cases}$$

Inverse document frequency represents a scaling: when a term $a$ occurs frequently across documents, its importance is scaled down due to its lowered discriminative power. $IDF(a)$ is defined as follows:

$$IDF(a) = \log \frac{1 + |x|}{|x_a|}$$

where $x_a$ is the set of documents containing term $a$.

Though TF-IDF is a common metric in text categorisation, its use in sentiment analysis is not well known. It has been used as a unigram feature weight. TF-IDF has two components, term frequency and inverse document frequency. Term frequency counts the number of times a term occurs in a document, whereas inverse document frequency divides the total number of documents by the number of documents in which a specific word appears. Multiplying these values gives a high score to words appearing repeatedly in a limited number of documents, while terms appearing frequently in all documents receive a low score [21].
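
To make the weighting concrete, the following Python sketch (a minimal illustration of the two formulas above, not the authors' code; the toy corpus and all names are hypothetical) computes the normalized TF and the IDF and combines them into a TF-IDF weight:

```python
import math

def tf(freq):
    """TF(x, a): 0 if the term is absent, else 1 + log(1 + log(freq(x, a)))."""
    return 0.0 if freq == 0 else 1.0 + math.log(1.0 + math.log(freq))

def idf(docs, term):
    """IDF(a) = log((1 + |x|) / |x_a|), where x_a is the set of documents containing the term."""
    n_a = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / n_a) if n_a else 0.0

def tf_idf(docs, doc, term):
    return tf(doc.count(term)) * idf(docs, term)

# Toy corpus of tokenized movie reviews (hypothetical data).
docs = [["great", "movie", "great", "cast"],
        ["boring", "movie"],
        ["great", "acting", "weak", "movie"]]
print(tf_idf(docs, docs[0], "great"))  # higher: frequent here, rarer in the corpus
print(tf_idf(docs, docs[0], "movie"))  # lower: appears in every document
```

Note that the normalized TF equals 1 for a single occurrence (since log(1) = 0), matching the definition above.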

3.3 Proposed Feature Selection Based on Decision Trees

Decision trees are popular methods for inductive inference. They are robust to noisy data and can learn disjunctive expressions. A decision tree is a k-ary tree in which each internal node specifies a test on some attribute from the input feature set representing the data. Each branch from a node corresponds to a possible value of the feature specified at that node, and every test results in branches representing the different test outcomes. The basic decision tree induction algorithm is a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner [16].

The algorithm begins with the tuples in the training set and selects the best attribute, i.e. the one yielding maximum information for classification. It generates a test node for this attribute, and then top-down decision tree induction divides the current set of tuples according to the values of the current test attribute [17]. Classifier generation stops when all tuples of a subset belong to the same class, or when it is not worthwhile to proceed with additional separation into further subsets, i.e. when further attribute tests yield information for classification below a pre-specified threshold. In this paper, it is proposed to base the threshold measure on information gain and the Manhattan hierarchical cluster.

In the proposed feature selection, decision tree induction selects the relevant features. Decision tree induction is the learning of decision tree classifiers, constructing a tree structure where each internal node (non-leaf node) denotes an attribute test, each branch represents a test outcome, and each external node (leaf node) denotes a class prediction. At every node, the algorithm selects the attribute that best partitions the data into individual classes. The best attribute for partitioning is selected by attribute selection with information gain: the attribute with the highest information gain splits the node. The information gain of an attribute is found from

$$\text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where $p_i$ is the probability that an arbitrary vector in $D$ belongs to class $c_i$. A logarithm to base 2 is used, as the information is encoded in bits. Info(D) is the average amount of information required to identify the class label of a vector in $D$. The information gain is used to rank the features, and the ranked features are treated as features in hierarchical clusters. The proposed Manhattan distance for $n$ clusters is given as follows:

$$MDist = \sum_{i=1}^{n} |a_i - b_i|$$

A cubic polynomial equation is derived using the Manhattan values, and the threshold criterion is determined from the slope of the polynomial equation. Features are assumed to be irrelevant for classification if the slope is zero or negative, and relevant when the slope is positive.

3.4 Principal Component Analysis

When the input dimensions are large and the components highly correlated, the dimensionality is reduced using PCA [18]. For a set of variables, PCA calculates a smaller set of artificial variables representing the variance of the observed variables. The artificial variables calculated are the principal components, used as predictor or criterion variables in the analysis. PCA orthogonalises the variables, retains the resulting principal components with large variation, and eliminates the components with the least variation from the dataset. When applied to a dataset, PCA follows these steps:

1. The mean is subtracted from each data dimension, producing a data set with zero mean.
2. The covariance matrix is calculated.
3. The eigenvectors and eigenvalues of the covariance matrix are calculated.
4. The eigenvectors with the highest eigenvalues are the principal components of the dataset. Eigenvalues of less significance are removed to form the feature vector.
5. A new dataset is derived.

3.5 Learning Vector Quantization (LVQ)

Learning Vector Quantization (LVQ) is a local classification algorithm in which the classification boundaries are approximated locally; the difference is that instead of using all training dataset points, LVQ uses only a set of prototype vectors. This ensures efficient classification, as the number of vectors to be stored and compared is greatly reduced. Additionally, a carefully chosen prototype set also increases robustness to noise in the classification [19].

LVQ is an algorithm that learns appropriate prototype positions to be used for classification. It is defined by a set of $P$ prototypes $\{(m_j, c_j),\; j = 1, \ldots, P\}$, where $m_j$ is a $K$-dimensional vector in the feature space and $c_j$ is its class label. The number of prototypes is larger than the number of classes; thus, each class is represented by more than one prototype. Given an unlabeled data point $x_u$, its class label $y_u$ is determined as the class $c_q$ of the nearest prototype $m_q$:

$$y_u = c_q, \quad q = \arg\min_j d(x_u, m_j)$$

where $d$ is the Euclidean distance; other distance measures can be used depending on the problem.
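
To illustrate the feature selection of Section 3.3, the following Python sketch ranks features by information gain and applies a Manhattan-distance, cubic-slope threshold. This is a minimal reading of the procedure, not the authors' implementation: how the ranked gains map to cluster distances and how the slope test cuts the ranking are our assumptions, and all names and the toy data are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """Info(D) minus the weighted entropy of the subsets induced by the feature."""
    values, counts = np.unique(feature, return_counts=True)
    weighted = sum(c / len(labels) * entropy(labels[feature == v])
                   for v, c in zip(values, counts))
    return entropy(labels) - weighted

def rank_and_select(X, y):
    """Rank features by information gain, then keep features while the slope
    of a cubic fitted to the cumulative Manhattan distances stays positive."""
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(gains)[::-1]                  # best-ranked feature first
    ranked = gains[order]
    mdist = np.abs(np.diff(ranked))                  # MDist terms |a_i - b_i|
    cumulative = np.cumsum(mdist)
    coeffs = np.polyfit(np.arange(len(cumulative)), cumulative, deg=3)
    slopes = np.polyval(np.polyder(coeffs), np.arange(len(cumulative)))
    n_keep = max(1, int(np.sum(slopes > 0)))         # relevant while slope > 0
    return order[:n_keep]

# Hypothetical data: 6 binary features; labels mostly follow feature 0,
# so feature 0 should rank first and survive the threshold.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = np.where(rng.random(200) < 0.85, X[:, 0], 1 - X[:, 0])
print(rank_and_select(X, y))
```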

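The PCA baseline of Section 3.4 can be sketched directly from the five steps above (a minimal NumPy illustration under our own naming; the synthetic data is hypothetical):

```python
import numpy as np

def pca_reduce(data, n_components):
    """Follow the five steps of Section 3.4: center, covariance,
    eigen-decomposition, component selection, projection."""
    centered = data - data.mean(axis=0)              # 1. subtract the mean
    cov = np.cov(centered, rowvar=False)             # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigenvalues/eigenvectors
    top = np.argsort(eigvals)[::-1][:n_components]   # 4. keep the largest eigenvalues
    feature_vector = eigvecs[:, top]
    return centered @ feature_vector                 # 5. derive the new dataset

# Hypothetical feature matrix: 100 reviews x 10 correlated features
# with rank-3 structure, so 3 components capture the variance.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 3))
data = base @ rng.normal(size=(3, 10))
reduced = pca_reduce(data, n_components=3)
print(reduced.shape)  # (100, 3)
```
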
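Finally, a sketch of the LVQ decision rule of Section 3.5. The nearest-prototype prediction follows the equation above; the training update shown is the standard LVQ1 rule, which the paper does not specify, so the initialization, learning rate and epoch count are assumptions:

```python
import numpy as np

def lvq_predict(x, prototypes, proto_labels):
    """y_u = c_q with q = argmin_j d(x_u, m_j), d the Euclidean distance."""
    q = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    return proto_labels[q], q

def lvq1_train(X, y, prototypes, proto_labels, lr=0.05, epochs=30):
    """Standard LVQ1 update: attract the winning prototype on a correct
    classification, repel it on a mistake."""
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred, q = lvq_predict(x, prototypes, proto_labels)
            sign = 1.0 if pred == label else -1.0
            prototypes[q] += sign * lr * (x - prototypes[q])
    return prototypes

# Hypothetical 2-D data: two Gaussian classes, two prototypes per class.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
idx = np.concatenate([rng.choice(np.where(y == c)[0], 2, replace=False)
                      for c in (0, 1)])
prototypes, proto_labels = X[idx].copy(), y[idx]
prototypes = lvq1_train(X, y, prototypes, proto_labels)
print(lvq_predict(np.array([2.5, 2.8]), prototypes, proto_labels)[0])  # expect 1
```
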

4. RESULTS AND DISCUSSION

Features are extracted using IDF from the movie data. PCA and the proposed feature selection method were used to reduce the features. Table 1 and Figure 2 show the classification accuracy obtained with LVQ, compared with the Naïve Bayes classifier and Classification and Regression Trees (CART). Table 2 and Figure 3 give the Root Mean Squared Error (RMSE).

Table 1: Classification Accuracy

Technique used                                    Classification accuracy (%)
CART with PCA                                     57.5
Naïve Bayes with PCA                              70
Naïve Bayes with LVQ                              75
CART with proposed feature extraction             65.75
Naïve Bayes with proposed feature extraction      75.5
Naïve Bayes with proposed feature extraction      79.75

[Figure 2: Classification accuracy of the compared techniques (bar chart)]

It can be seen from Figure 2 that the classification accuracy obtained through Naïve Bayes with LVQ is better than Naïve Bayes with PCA by around 5%. Figure 3 shows the Root Mean Squared Error (RMSE).

Table 2: Root Mean Squared Error

Technique used                                    RMSE
CART with PCA                                     0.61
Naïve Bayes with PCA                              0.54
Naïve Bayes with LVQ                              0.54
CART with proposed feature extraction             0.438
Naïve Bayes with proposed feature extraction      0.412
Naïve Bayes with proposed feature extraction      0.362

[Figure 3: Root Mean Squared Error of the compared techniques (bar chart)]

Table 3: Precision and Recall

Technique used                                    Precision    Recall
CART with PCA                                     0.5763       0.5382
Naïve Bayes with PCA                              0.7003       0.7142
Naïve Bayes with LVQ                              0.5          0.773
CART with proposed feature extraction             0.5          0.6691
Naïve Bayes with proposed feature extraction      0.5          0.7149
Naïve Bayes with proposed feature extraction      0.5          0.7995

[Figure 4: Precision of the compared techniques (bar chart)]


[Figure 5: Recall of the compared techniques (bar chart)]

It can be seen that the precision and recall are low for the three classifiers.

5. CONCLUSION

Rapid advances in computer-based high-throughput techniques have provided unparalleled opportunities to expand production, services, communications, and research output. Meanwhile, immense quantities of high-dimensional data accumulate, challenging state-of-the-art data mining techniques. Feature selection is needed for successful data mining applications, as it lowers data dimensionality by removing irrelevant features. In this paper, a feature selection method for opinion mining using decision trees is proposed. LVQ-type learning models are popular learning algorithms due to their simple learning rule, their intuitive formulation of a classifier by means of prototypical locations in the data space, and their efficient applicability to any given number of classes. Movie review features obtained from IMDb were extracted using inverse document frequency, and the importance of each word was found. Principal component analysis was used for feature selection based on the importance of the word with respect to the entire document. The classification accuracy obtained by LVQ was 75%. However, it was observed that the precision for positive opinions was quite low. This phenomenon was observed not only with LVQ but also with other classifiers, including CART and Naïve Bayes.

REFERENCES

[1]. Xia, Y. Q., Xu, R. F., Wong, K. F., & Zheng, F. (2007, August). The unified collocation framework for opinion mining. In Machine Learning and Cybernetics, 2007 International Conference on (Vol. 2, pp. 844-850). IEEE.
[2]. Vinodhini, G., & Chandrasekaran, R. M. (2012). Sentiment analysis and opinion mining: A survey. International Journal, 2(6).
[3]. Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of Natural Language Processing, 2, 568.
[4]. Somprasertsri, G., & Lalitrojwong, P. (2010). Mining feature-opinion in online customer reviews for opinion summarization. Journal of Universal Computer Science, 16(6), 938-955.
[5]. Balahur, A., Steinberger, R., Goot, E. V. D., Pouliquen, B., & Kabadjov, M. (2009, September). Opinion mining on newspaper quotations. In Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT'09. IEEE/WIC/ACM International Joint Conferences on (Vol. 3, pp. 523-526). IET.
[6]. Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., & Liu, H. (2010). Advancing feature selection research. ASU Feature Selection Repository.
[7]. El-Halees, A. (2011). Mining feature-opinion in educational data for course improvement. International Journal of New Computer Architectures and their Applications (IJNCAA), 1(4), 1076-1085.
[8]. Hsieh, T. C., Chiu, T. K., & Wang, T. I. (2008, July). An approach for constructing suitable learning path for documents occasionally collected from internet. In Machine Learning and Cybernetics, 2008 International Conference on (Vol. 6, pp. 3138-3143). IEEE.
[9]. Omar, N., Jusoh, F., Ibrahim, R., & Othman, M. S. (2013). Review of feature selection for solving classification problems. JISRI, 3.
[10]. Yacob, Y. M., Sakim, H. M., & Isa, N. M. (2012). Decision tree-based feature ranking using Manhattan hierarchical cluster criterion. International Journal of Engineering and Physical Sciences, 6.
[11]. Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research, 5, 1205-1224.
[12]. Shlens, J. (2005). A tutorial on principal component analysis. Systems Neurobiology Laboratory, University of California at San Diego.
[13]. Schneider, P., Biehl, M., & Hammer, B. (2009). Adaptive relevance matrices in learning vector quantization. Neural Computation, 21(12), 3532-3561.


[14]. The Internet Movie Database Ltd. Internet Movie Database. http://www.imdb.com.
[15]. Metzler, D. (2008, October). Generalized inverse document frequency. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (pp. 399-408).
[16]. Ratanamahatana, C. A., & Gunopulos, D. (2002). Scaling up the naive Bayesian classifier: Using decision trees for feature selection.
[17]. Gayatri, N., Nickolas, S., & Reddy, A. V. (2010). Feature selection using decision tree induction in class level metrics dataset for software defect predictions. In Proceedings of the World Congress on Engineering and Computer Science (Vol. 1, pp. 124-129).
[18]. Friston, K. J., Frith, C. D., Liddle, P. F., & Frackowiak, R. S. J. (1993). Functional connectivity: the principal-component analysis of large (PET) data sets. Journal of Cerebral Blood Flow and Metabolism, 13, 5-5.
[19]. Grbovic, M., & Vucetic, S. (2009, June). Learning vector quantization with adaptive prototype addition and removal. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on (pp. 994-1001). IEEE.
[20]. Jeevanandam Jotheeswaran, et al. (2012). Feature reduction using principal component analysis for opinion mining. IJCST, 3(5), 118-121.
