A new text representation method for clustering based on higher order Markov model

Authors:

Xiaoqiang Xie

ICISDM '18: Proceedings of the 2nd International Conference on Information System and Data Mining

Pages 1 - 6

https://doi.org/10.1145/3206098.3206099

Published: 09 April 2018

Abstract

The ordinal relations within word sequences and character sequences can reflect latent information about writing style, genre, and topic. These relations are therefore important and should be considered in text clustering, yet traditional text clustering methods often neglect them. Because ordinal relations can be characterized statistically by the transition probabilities of a higher-order Markov model, this paper proposes a new text representation method based on such a model. In the new method, all transition probabilities of a higher-order Markov model are used as the features of a text, and the order of the model is identified by maximizing the average Markov-Shannon entropy (MME). The experimental results suggest that the new text representation method performs better than the traditional method.
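As a rough illustration of the idea of using transition probabilities as text features, the following is a minimal sketch, not the paper's implementation. It assumes character-level symbols and maximum-likelihood estimates, and the names `transition_probabilities` and `to_vector` are hypothetical. The abstract does not define MME, so the order is simply passed in as a parameter here rather than selected automatically.

```python
from collections import Counter


def transition_probabilities(text, order):
    """Estimate P(next symbol | previous `order` symbols) from one text.

    Assumption: symbols are characters and probabilities are plain
    relative frequencies (no smoothing).
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(len(text) - order):
        context = text[i:i + order]      # the preceding `order` symbols
        nxt = text[i + order]            # the symbol that follows them
        context_counts[context] += 1
        transition_counts[(context, nxt)] += 1
    return {
        key: count / context_counts[key[0]]
        for key, count in transition_counts.items()
    }


def to_vector(probs, vocabulary):
    """Lay out one text's probabilities over a shared, ordered feature list."""
    return [probs.get(key, 0.0) for key in vocabulary]


# Toy corpus (illustrative data, not from the paper).
docs = ["abababab", "aabbaabb"]
features = [transition_probabilities(d, order=2) for d in docs]
vocabulary = sorted(set().union(*features))
vectors = [to_vector(f, vocabulary) for f in features]
```

The resulting `vectors` share one feature layout, so they could be passed to any standard clustering routine. Selecting the order by MME would require the definition given in the full paper.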

Cited By

Mohseni, M. and Maher, M. 2022. A Framework for Exploring Computational Models of Novelty in Unstructured Text. In: Proceedings of the 6th International Conference on Information System and Data Mining, 36--45. DOI: 10.1145/3546157.3546164. Online publication date: 27-May-2022.
https://dl.acm.org/doi/10.1145/3546157.3546164

Index Terms

A new text representation method for clustering based on higher order Markov model
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering

Published In

ICISDM '18: Proceedings of the 2nd International Conference on Information System and Data Mining

April 2018

169 pages

ISBN: 9781450363549

DOI: 10.1145/3206098

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Scientific Research Plan Item of Hunan Provincial Education Department of China
Hunan Provincial Natural Science Foundation of China
National Natural Science Foundation for Youth of China

Conference

ICISDM '18: 2018 2nd International Conference on Information System and Data Mining

April 9 - 11, 2018

Lakeland, FL, USA

Article Metrics

Total Citations: 1
Total Downloads: 85
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Reflects downloads up to 20 Sep 2024
