Feature Extraction in Subject Classification of Text Documents in Polish

Tomasz Walkowiak¹⁸,
Szymon Datko¹⁸ &
Henryk Maciejewski¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10842))

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

2077 Accesses

Abstract

In this work we evaluate two different methods for deriving features for a subject classification of text documents. The first method uses the standard Bag-of-Words (BoW) approach, which represents the documents with vectors of frequencies of selected terms appearing in the documents. This method heavily relies on the natural language processing (NLP) tools to properly preprocess text in the grammar- and inflection-conscious way. The second approach is based on the word-embedding technique recently proposed by Mikolov and does not require any NLP preprocessing. In this method the words are represented as vectors in continuous space and this representation of words is used to construct the feature vectors of the documents. We evaluate these fundamentally different approaches in the task of classification of Polish language Wikipedia articles with 34 subject areas. Our study suggests that the word-embedding based features seem to outperform the standard NLP-based features providing sufficiently large training dataset is available.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents

Subject Classification of Texts in Polish - from TF-IDF to Transformers

References

Eder, M., Piasecki, M., Walkowiak, T.: An open stylometric system based on multilevel text analysis. Cogn. Stud.—Etudes Cogn. (17) (2017). https://doi.org/10.11649/cs.1430
Goodman, J.: Classes for fast maximum entropy training. In: Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, (Cat. No.01CH37221), vol. 1, pp. 561–564 (2001). https://doi.org/10.1109/ICASSP.2001.940893
Harris, Z.: Distributional structure. Word (1954)
Google Scholar
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, vol. 2, pp. 427–431. Association for Computational Linguistics (2017). http://aclweb.org/anthology/E17-2068
Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009)
Google Scholar
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781
Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Association for Computational Linguistics, Atlanta, June 2013. http://www.aclweb.org/anthology/N13-1090
Młynarczyk, K., Piasecki, M.: Wiki test - 34 categories (2015). http://hdl.handle.net/11321/217. CLARIN-PL digital repository
Młynarczyk, K., Piasecki, M.: Wiki train - 34 categories (2015). http://hdl.handle.net/11321/222. CLARIN-PL digital repository
Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16
Chapter Google Scholar
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)
Article Google Scholar
Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1986)
Google Scholar
Torkkola, K.: Discriminative features for text document classification. Formal Pattern Anal. Appl. 6(4), 301–308 (2004). https://doi.org/10.1007/s10044-003-0196-8
Walkowiak, T.: Language processing modelling notation - orchestration of NLP microservices. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) DepCoS-RELCOMEX 2017. AISC, pp. 464–473. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-59415-6_44
Google Scholar
Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, vol. 2, pp. 515–522. INSTICC, SciTePress (2018)
Google Scholar

Download references

Acknowledgement

This work was sponsored by National Science Centre, Poland (grant 2016/21/B/ST6/02159).

Author information

Authors and Affiliations

Faculty of Electronics, Wrocław University of Science and Technology, Wrocław, Poland
Tomasz Walkowiak, Szymon Datko & Henryk Maciejewski

Authors

Tomasz Walkowiak
View author publications
You can also search for this author in PubMed Google Scholar
Szymon Datko
View author publications
You can also search for this author in PubMed Google Scholar
Henryk Maciejewski
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tomasz Walkowiak .

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Walkowiak, T., Datko, S., Maciejewski, H. (2018). Feature Extraction in Subject Classification of Text Documents in Polish. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science(), vol 10842. Springer, Cham. https://doi.org/10.1007/978-3-319-91262-2_40

Download citation

DOI: https://doi.org/10.1007/978-3-319-91262-2_40
Published: 11 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91261-5
Online ISBN: 978-3-319-91262-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Feature Extraction in Subject Classification of Text Documents in Polish

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents

Subject Classification of Texts in Polish - from TF-IDF to Transformers

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Feature Extraction in Subject Classification of Text Documents in Polish

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents

Subject Classification of Texts in Polish - from TF-IDF to Transformers

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation