Abstract
Many feature selection methods have been proposed for text categorization. However, their performance is usually assessed only empirically, so the conclusions depend on the corpora used and may not generalize. This paper proposes a novel feature selection framework, Distribution-Based Feature Selection (DBFS), based on the distribution difference of features. The framework generalizes most state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR, and its components allow the performance of many such methods to be estimated by theoretical analysis. DBFS also sheds light on the merits and drawbacks of existing methods and helps in selecting a suitable method for a specific domain. Moreover, a weighted model based on DBFS is given, from which feature selection methods suited to unbalanced datasets can be derived. Experimental results show that the derived methods are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.
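For concreteness, below is a minimal Python sketch (not from the paper) of two of the classical selection scores the abstract refers to, CHI (chi-square) and IG (information gain), computed from document counts. DBFS itself is defined in the full paper and is not reproduced here; the function names and count arguments are illustrative assumptions.

```python
import math

def chi2_score(n_tc, n_t, n_c, n):
    """CHI (chi-square) score of term t for class c, from a 2x2
    contingency table over documents:
      n_tc: docs of class c containing t
      n_t:  docs containing t
      n_c:  docs of class c
      n:    total docs
    """
    a = n_tc                   # t present, class c
    b = n_t - n_tc             # t present, other classes
    c = n_c - n_tc             # t absent, class c
    d = n - n_t - n_c + n_tc   # t absent, other classes
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def ig_score(n_tc_per_class, n_t, n_c_per_class, n):
    """IG (information gain) of term t: H(C) - H(C | presence of t)."""
    def entropy(counts, total):
        return -sum(k / total * math.log2(k / total) for k in counts if k > 0)

    present = list(n_tc_per_class)                       # docs with t, per class
    absent = [nc - ntc for nc, ntc in zip(n_c_per_class, present)]
    h_cond = 0.0
    if n_t > 0:
        h_cond += n_t / n * entropy(present, n_t)
    if n - n_t > 0:
        h_cond += (n - n_t) / n * entropy(absent, n - n_t)
    return entropy(n_c_per_class, n) - h_cond

# Example: 100 docs, 30 of them in class c; the term occurs in 25 docs,
# 20 of which belong to c -- a strongly class-indicative term.
print(chi2_score(20, 25, 30, 100))          # ~39.7
print(ig_score([20, 5], 25, [30, 70], 100)) # ~0.28
```

In practice such scores are computed for every term and the top-scoring terms are retained; per the abstract, DBFS subsumes measures of this kind as special cases of a distribution-difference criterion.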
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Jing, H., Wang, B., Yang, Y., Xu, Y. (2009). A General Framework of Feature Selection for Text Categorization. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol. 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_49
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3