Abstract
Document representation is one of the foundations of natural language processing. The bag-of-words (BoW) model, a representative document representation model, is both simple and effective. However, the traditional BoW model suffers from sparsity and lacks latent semantic relations. In this paper, to address these problems, we propose two tolerance rough set-based BoW models, called TRBoW1 and TRBoW2, which differ in their weight calculation methods. Unlike popular supervised representation methods, they are unsupervised and require no prior knowledge. By extending each document to its upper approximation with TRBoW1 or TRBoW2, semantic relations among documents are mined and document vectors become denser. Comparative experiments on text classification against various document representation methods across different datasets verify that our methods achieve the best performance.
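The central operation, extending a document's BoW vector to its upper approximation under a tolerance relation built from term co-occurrence, can be illustrated with a short sketch. The Python below is a minimal illustration assuming the common co-occurrence-threshold tolerance relation for terms; the constant weight `new_term_weight` assigned to newly added terms is a placeholder assumption for illustration only, not the actual TRBoW1/TRBoW2 weighting schemes defined in the paper.

```python
import numpy as np

def tolerance_classes(doc_term: np.ndarray, theta: int) -> np.ndarray:
    """Boolean terms-by-terms matrix whose (i, j) entry is True when term j
    belongs to the tolerance class of term i, i.e. the two terms co-occur
    in at least `theta` documents; every term tolerates itself."""
    presence = (doc_term > 0).astype(int)   # docs x terms incidence matrix
    cooc = presence.T @ presence            # terms x terms co-occurrence counts
    classes = cooc >= theta
    np.fill_diagonal(classes, True)
    return classes

def upper_approximation(doc_term: np.ndarray, theta: int = 2,
                        new_term_weight: float = 0.5) -> np.ndarray:
    """Extend every document's BoW vector to its upper approximation:
    a term is added to a document when its tolerance class overlaps
    the document's own terms, which densifies the vector."""
    classes = tolerance_classes(doc_term, theta).astype(int)
    presence = (doc_term > 0).astype(int)
    # upper[d, i] is True iff some term of document d lies in the
    # tolerance class of term i
    upper = (presence @ classes) > 0
    extended = doc_term.astype(float)
    # hypothetical constant weight for the newly added terms; the paper's
    # TRBoW1/TRBoW2 instead define their own weight calculations
    extended[(doc_term == 0) & upper] = new_term_weight
    return extended

# toy corpus: 3 documents over a 4-term vocabulary
X = np.array([[2, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 2, 0, 1]])
print(upper_approximation(X, theta=2))
```

On this toy corpus, terms 0 and 1 co-occur in two documents, so the second document (which contains term 0 but not term 1) gains term 1 in its upper approximation with the placeholder weight, showing how the extension makes sparse BoW vectors denser.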
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Qiu, D., Jiang, H. & Yan, R. Tolerance Rough Set-Based Bag-of-Words Model for Document Representation. Int J Comput Intell Syst 13, 1218–1226 (2020). https://doi.org/10.2991/ijcis.d.200808.001