Abstract
Text classification is an important task in Artificial Intelligence. It typically relies on large textual datasets, which can be represented feasibly only after normalization and term selection. The literature describes three normalization techniques: stemming, lemmatization, and nominalization. However, it is difficult to choose the most suitable of these for a given text classification task. In this paper, we investigate the question experimentally by applying five different classifiers to four textual datasets in Portuguese. We also evaluate the classification results using unigrams, bigrams, and the combination of unigrams and bigrams. The results indicate that, in general, two criteria can guide the choice of normalization technique: the number of terms each configuration produces and the comprehensibility required of the classification results.
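To make the term configurations mentioned in the abstract concrete, the following is a minimal sketch of how unigram, bigram, and combined term sets might be built after normalization. It is not the authors' pipeline: `toy_stem` is a hypothetical stand-in for a real Portuguese stemmer, lemmatizer, or nominalizer, and the function names and the `mode` parameter are assumptions introduced only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def toy_stem(token):
    """Hypothetical normalizer: strips a few common endings.

    A real experiment would plug in a proper stemmer, lemmatizer,
    or nominalizer here instead.
    """
    for suffix in ("as", "os", "a", "o", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return the list of n-grams, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_terms(text, mode="uni+bi", normalize=toy_stem):
    """Count terms for one document.

    mode is "uni", "bi", or "uni+bi" (assumed labels for the three
    configurations named in the abstract).
    """
    tokens = [normalize(t) for t in tokenize(text)]
    terms = []
    if mode in ("uni", "uni+bi"):
        terms += ngrams(tokens, 1)
    if mode in ("bi", "uni+bi"):
        terms += ngrams(tokens, 2)
    return Counter(terms)


if __name__ == "__main__":
    sentence = "os gatos pretos"
    for mode in ("uni", "bi", "uni+bi"):
        counts = extract_terms(sentence, mode)
        print(mode, len(counts), sorted(counts))
```

Comparing `len(counts)` across modes and normalizers gives a rough view of the first criterion named in the abstract, the number of terms each configuration produces. Passing `normalize=lambda t: t` disables normalization for a baseline.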
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
da Silva Conrado, M., Laguna Gutiérrez, V.A., Rezende, S.O. (2012). Evaluation of Normalization Techniques in Text Classification for Portuguese. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2012. ICCSA 2012. Lecture Notes in Computer Science, vol 7335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31137-6_47
DOI: https://doi.org/10.1007/978-3-642-31137-6_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31136-9
Online ISBN: 978-3-642-31137-6
eBook Packages: Computer Science (R0)