Abstract
Feature selection is an important step in text preprocessing: a well-chosen feature subset can achieve the same performance as the full feature set while reducing learning time. To make our system fit for application and to embed the model in a gateway for real-time text filtering, we need to select more accurate features. In this paper, we propose a new feature selection method based on rough set theory. It generates several reducts, with the distinguishing property that no two reducts share a common attribute, so the selected attributes have a stronger capability to classify new objects, especially for real data sets in application. We evaluate the method on two data sets: a benchmark data set from the UCI machine learning archive and a data set captured from the Web. We classify the objects with statistical classification methods. On the benchmark test set, we get good precision with a single reduct; on the real data set, we get good precision with several reducts, and this data set is used in our system for topic-specific text filtering. We therefore conclude that our method is effective in application. We also conclude that the SVM and VSM methods perform better, while the Naïve Bayes method performs poorly, with the same selected features on an imbalanced data set.
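The idea of several reducts with no common attributes can be illustrated with a small sketch. The abstract does not specify how the reducts are computed, so the code below is only one plausible reading and not the authors' algorithm: it assumes a greedy, positive-region-based reduct search, and after each reduct is found it removes that reduct's attributes from the candidate pool before searching again. The function names (`positive_region_size`, `greedy_reduct`, `disjoint_reducts`) and the toy decision table are hypothetical.

```python
from collections import defaultdict


def positive_region_size(rows, attrs, decision):
    """Count rows whose equivalence class under `attrs` has a single decision value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].append(row[decision])
    return sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)


def greedy_reduct(rows, pool, decision, target):
    """Assumed heuristic: add the attribute that most enlarges the positive
    region until `target` is reached, then drop attributes that are redundant."""
    chosen, remaining = [], list(pool)
    while positive_region_size(rows, chosen, decision) < target:
        best = max(
            remaining,
            key=lambda a: positive_region_size(rows, chosen + [a], decision),
        )
        chosen.append(best)
        remaining.remove(best)
    for a in list(chosen):
        trial = [b for b in chosen if b != a]
        if positive_region_size(rows, trial, decision) >= target:
            chosen = trial
    return chosen


def disjoint_reducts(rows, attrs, decision):
    """Repeatedly extract a reduct, then exclude its attributes from the pool,
    so that no two returned reducts share an attribute."""
    target = positive_region_size(rows, attrs, decision)
    pool, reducts = list(attrs), []
    while pool and positive_region_size(rows, pool, decision) == target:
        reduct = greedy_reduct(rows, pool, decision, target)
        if not reduct:  # decision is constant; no attribute is needed
            break
        reducts.append(reduct)
        pool = [a for a in pool if a not in reduct]
    return reducts


# Hypothetical toy table: four binary term features and a topic label.
rows = [
    {"t1": 1, "t2": 0, "t3": 1, "t4": 0, "label": "on"},
    {"t1": 1, "t2": 1, "t3": 1, "t4": 1, "label": "on"},
    {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "label": "off"},
    {"t1": 0, "t2": 1, "t3": 0, "t4": 0, "label": "off"},
]
reducts = disjoint_reducts(rows, ["t1", "t2", "t3", "t4"], "label")
print(reducts)
```

In this toy table, `t1` and `t3` each determine the label on their own, while `t2` and `t4` do so only together, so the sketch yields three attribute-disjoint reducts. A classifier could then be trained on the union of several such reducts rather than on a single one, which is the contrast the abstract draws between the benchmark and the real data set.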
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, Q., Li, J. (2005). Topic-Specific Text Filtering Based on Multiple Reducts. In: Gorodetsky, V., Liu, J., Skormin, V.A. (eds) Autonomous Intelligent Systems: Agents and Data Mining. AIS-ADM 2005. Lecture Notes in Computer Science, vol 3505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11492870_14
DOI: https://doi.org/10.1007/11492870_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26164-3
Online ISBN: 978-3-540-31932-0
eBook Packages: Computer Science, Computer Science (R0)