Abstract
The aim of text document classification is to automatically assign a document to a predefined class. The main problems in document classification are the high dimensionality and sparsity of the data matrix. This article proposes a new feature selection technique based on the Google distance that obtains a feature subset which improves classification accuracy. The normalized Google distance can automatically extract the meaning of terms from the World Wide Web. It uses the number of hits returned by the Google search engine to compute the semantic relation between two terms. The proposed approach uses only the distance function of the Google distance to establish a relation between a feature and a class for document classification, and it is independent of Google search results. Each feature receives a score based on its relation with all the classes, and the features are then ranked by that score. Experimental results are presented using the kNN classifier on several TREC and Reuters data sets. Precision, recall, F-measure and classification accuracy are used to analyze the results. The proposed method is compared with four other feature selection methods for document classification: document frequency thresholding, information gain, mutual information and the χ² statistic. The empirical studies show that in most cases the proposed method selects features effectively, with classification accuracy either improved or unchanged.
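The abstract does not give the scoring formula, so the following Python sketch is only an illustration under stated assumptions, not the authors' method. It uses the standard normalized Google distance of Cilibrasi and Vitanyi, NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}), and replaces search-engine hit counts with document counts from a labeled training corpus. The choice of counts, the use of the minimum distance over classes as the feature score, and the names `ngd` and `rank_features` are all assumptions made for this example.

```python
import math
from collections import Counter


def ngd(fx, fy, fxy, n):
    """Normalized Google distance computed from raw counts.

    fx, fy: counts of the two events; fxy: their joint count; n: total count.
    """
    if fxy == 0:
        return math.inf  # the two events never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom == 0:
        return 0.0  # both events cover the whole corpus
    return (max(lx, ly) - lxy) / denom


def rank_features(docs, labels):
    """Rank terms by their smallest NGD-style distance to any class.

    ASSUMPTION: f(term) is the number of documents containing the term,
    f(class) is the number of documents in the class, and f(term, class)
    is the number of documents in the class containing the term.
    ASSUMPTION: a term's score is its minimum distance over all classes;
    the abstract only says the score depends on all classes.
    """
    n = len(docs)
    class_size = Counter(labels)
    term_df = Counter()
    joint_df = Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            joint_df[term, label] += 1
    scores = {
        term: min(ngd(df, class_size[c], joint_df[term, c], n) for c in class_size)
        for term, df in term_df.items()
    }
    # Smaller distance means a stronger tie to some class; break ties by name.
    return sorted(scores, key=lambda t: (scores[t], t))


if __name__ == "__main__":
    docs = [["ball", "game"], ["ball", "team"], ["vote", "law"], ["vote", "game"]]
    labels = ["sport", "sport", "politics", "politics"]
    print(rank_features(docs, labels))
```

In this toy corpus, terms that occur in every document of exactly one class rank first, while a term spread evenly across both classes ranks last. A feature subset would then be taken from the top of the ranking.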
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Basu, T., Murthy, C.A. (2012). A Feature Selection Method for Improved Document Classification. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science, vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_25
DOI: https://doi.org/10.1007/978-3-642-35527-1_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1
eBook Packages: Computer Science (R0)