Evaluation of Normalization Techniques in Text Classification for Portuguese

Merley da Silva Conrado,
Víctor Antonio Laguna Gutiérrez &
Solange Oliveira Rezende

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7335)

Included in the following conference series:

International Conference on Computational Science and Its Applications


Abstract

Text classification is an important task in Artificial Intelligence. It typically relies on large textual datasets, whose representation is made feasible by normalization and selection techniques. The literature describes three normalization techniques: stemming, lemmatization, and nominalization. However, it is difficult to choose the most suitable technique for the text classification task. In this paper, we investigate this question experimentally by applying five different classifiers to four textual datasets in Portuguese. Additionally, the classification results are evaluated using unigrams, bigrams, and the combination of unigrams and bigrams. The results indicate that, in general, the number of terms obtained in each case and the comprehensibility required of the classification results can be used as criteria for choosing the most suitable technique for the text classification task.
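To make the "number of terms" criterion concrete, the following minimal sketch counts the distinct terms obtained from unigrams, bigrams, and their combination, with and without normalization. It is an illustration under stated assumptions, not the pipeline used in the paper: the lookup table `TOY_NORMALIZATION`, the two sample sentences, and the function names are all hypothetical, and the lookup only stands in for a real stemmer, lemmatizer, or nominalizer.

```python
from collections import Counter

# Hypothetical stand-in for a normalizer: a hand-written lemma-like lookup.
# It is NOT a real stemmer, lemmatizer, or nominalizer; entries are illustrative.
TOY_NORMALIZATION = {
    "os": "o",
    "classificadores": "classificador",
    "classificam": "classificar",
    "classifica": "classificar",
    "textos": "texto",
}


def normalize(token):
    """Map a token to its normalized form, or return it unchanged."""
    return TOY_NORMALIZATION.get(token, token)


def extract_terms(tokens, mode="unigram"):
    """Return a Counter of terms for mode 'unigram', 'bigram', or 'both'."""
    unigrams = list(tokens)
    # Adjacent token pairs, joined with "_" so they never collide with unigrams.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    if mode == "unigram":
        return Counter(unigrams)
    if mode == "bigram":
        return Counter(bigrams)
    if mode == "both":
        return Counter(unigrams + bigrams)
    raise ValueError(f"unknown mode: {mode}")


def vocabulary_size(documents, mode, use_normalization):
    """Count the distinct terms produced over a whole collection."""
    vocabulary = set()
    for document in documents:
        tokens = document.lower().split()
        if use_normalization:
            tokens = [normalize(token) for token in tokens]
        vocabulary.update(extract_terms(tokens, mode))
    return len(vocabulary)


# Hypothetical two-document collection, used only for this illustration.
DOCS = [
    "os classificadores classificam textos",
    "o classificador classifica o texto",
]

if __name__ == "__main__":
    for mode in ("unigram", "bigram", "both"):
        raw = vocabulary_size(DOCS, mode, use_normalization=False)
        norm = vocabulary_size(DOCS, mode, use_normalization=True)
        print(f"{mode}: {raw} terms raw, {norm} terms normalized")
```

In this toy collection, normalization merges inflected variants and so reduces the number of terms in every representation, while the combined unigram and bigram representation is always the largest. The trade-off between term count and how readable the normalized terms remain is the kind of criterion the abstract refers to.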



Author information

Authors and Affiliations

Sao Paulo University (USP), P.O. Box 668, 13561-970, Sao Carlos, SP, Brazil
Merley da Silva Conrado & Solange Oliveira Rezende
Pontifical Catholic University of Peru (PUCP), P.O. Box 1761, Lima, 32, Peru
Víctor Antonio Laguna Gutiérrez


Editor information

Editors and Affiliations

Laboratory of Urban and Territorial Systems, University of Basilicata, 10, Viale dell’Ateneo Lucano, 85100, Potenza, Italy
Beniamino Murgante
Department of Mathematics and Computer Science, University of Perugia, Via Vanvitelli 1, 06123, Perugia, Italy
Osvaldo Gervasi
Department of Cyber Security Science, Federal University of Technology, Gidan Kwano Campus, Minna, Nigeria
Sanjay Misra
Faculty of Engineering, Department of Electronics Engineering and Telecommunications, State University of Rio de Janeiro, Rua Sao Francisco Xavier, 524, 50. andar, sala 5145-F, Maracana, 20.550-013, Rio de Janeiro, RJ, Brazil
Nadia Nedjah
Department of Production and Systems, University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
Ana Maria A. C. Rocha
School of Business Systems, Monash University, 3800, Clayton, VIC, Australia
David Taniar
Department of Intelligent Informatics, Kyushu Sangyo University, 2-3-1 Matsukadai, Higashi-ku, 813-8503, Fukuoka, Japan
Bernady O. Apduhan


About this paper

Cite this paper

da Silva Conrado, M., Laguna Gutiérrez, V.A., Rezende, S.O. (2012). Evaluation of Normalization Techniques in Text Classification for Portuguese. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2012. ICCSA 2012. Lecture Notes in Computer Science, vol 7335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31137-6_47


DOI: https://doi.org/10.1007/978-3-642-31137-6_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31136-9
Online ISBN: 978-3-642-31137-6
eBook Packages: Computer Science, Computer Science (R0)
