Abstract
Document representation is one of the foundations of natural language processing. The bag-of-words (BoW) model, a representative document representation model, is both simple and effective. However, the traditional BoW model suffers from sparsity and lacks latent semantic relations. In this paper, to address these problems, we propose two tolerance rough set-based BoW models, called TRBoW1 and TRBoW2, which differ in their weight calculation methods. Unlike popular supervised representation methods, they are unsupervised and require no prior knowledge. By extending each document to its upper approximation with TRBoW1 or TRBoW2, semantic relations among documents are mined and document vectors become denser. Comparative experiments on text classification against various document representation methods across different datasets verify that our methods achieve the best performance.
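The central operation, extending a document's BoW vector to its upper approximation under a tolerance relation built from term co-occurrence, can be illustrated with a short sketch. The Python below is a minimal illustration assuming the common co-occurrence-threshold tolerance relation for terms; the constant weight `new_term_weight` assigned to newly added terms is a placeholder assumption for illustration only, not the actual TRBoW1/TRBoW2 weighting schemes defined in the paper.

```python
import numpy as np

def tolerance_classes(doc_term: np.ndarray, theta: int) -> np.ndarray:
    """Boolean terms-by-terms matrix whose (i, j) entry is True when term j
    belongs to the tolerance class of term i, i.e. the two terms co-occur
    in at least `theta` documents; every term tolerates itself."""
    presence = (doc_term > 0).astype(int)   # docs x terms incidence matrix
    cooc = presence.T @ presence            # terms x terms co-occurrence counts
    classes = cooc >= theta
    np.fill_diagonal(classes, True)
    return classes

def upper_approximation(doc_term: np.ndarray, theta: int = 2,
                        new_term_weight: float = 0.5) -> np.ndarray:
    """Extend every document's BoW vector to its upper approximation:
    a term is added to a document when its tolerance class overlaps
    the document's own terms, which densifies the vector."""
    classes = tolerance_classes(doc_term, theta).astype(int)
    presence = (doc_term > 0).astype(int)
    # upper[d, i] is True iff some term of document d lies in the
    # tolerance class of term i
    upper = (presence @ classes) > 0
    extended = doc_term.astype(float)
    # hypothetical constant weight for the newly added terms; the paper's
    # TRBoW1/TRBoW2 instead define their own weight calculations
    extended[(doc_term == 0) & upper] = new_term_weight
    return extended

# toy corpus: 3 documents over a 4-term vocabulary
X = np.array([[2, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 2, 0, 1]])
print(upper_approximation(X, theta=2))
```

On this toy corpus, terms 0 and 1 co-occur in two documents, so the second document (which contains term 0 but not term 1) gains term 1 in its upper approximation with the placeholder weight, showing how the extension makes sparse BoW vectors denser.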
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Qiu, D., Jiang, H. & Yan, R. Tolerance Rough Set-Based Bag-of-Words Model for Document Representation. Int J Comput Intell Syst 13, 1218–1226 (2020). https://doi.org/10.2991/ijcis.d.200808.001