DOI: 10.1145/3206098.3206099
research-article

A new text representation method for clustering based on higher order Markov model

Published: 09 April 2018

Abstract

The ordinal relations in word sequences and character sequences can reflect latent information about writing style, genre features, and topic. These relations are therefore important and should be considered in text clustering, yet traditional text clustering methods often neglect them. Because ordinal relations can be statistically characterized by the transition probabilities of a higher order Markov model, this paper proposes a new text representation method based on such a model. In the new method, all transition probabilities of a higher order Markov model are used as features of a text, and the order is identified by maximizing the average Markov-Shannon entropy (MME). The experimental results suggest that the new text representation method performs better than the traditional method.
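The idea of using transition probabilities as text features can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character-level sequences and unsmoothed maximum-likelihood counts, and the function names (transition_probabilities, conditional_entropy, to_vector) are hypothetical. The abstract does not define MME, so conditional_entropy below is only the standard conditional Shannon entropy of a Markov chain, used as a stand-in for comparing candidate orders.

```python
from collections import Counter
import math


def transition_probabilities(text, order):
    """Estimate P(next char | previous `order` chars) from raw counts.

    Assumption: character-level model with no smoothing.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for i in range(len(text) - order):
        context = text[i:i + order]
        nxt = text[i + order]
        context_counts[context] += 1
        pair_counts[(context, nxt)] += 1
    return {
        (context, nxt): count / context_counts[context]
        for (context, nxt), count in pair_counts.items()
    }


def conditional_entropy(text, order):
    """Entropy (bits) of the next char given its context.

    Assumption: a stand-in for the paper's MME, which the abstract
    does not define.
    """
    n = len(text) - order
    if n <= 0:
        return 0.0
    probs = transition_probabilities(text, order)
    context_counts = Counter(text[i:i + order] for i in range(n))
    h = 0.0
    for (context, _nxt), p in probs.items():
        # Weight each transition by how often its context occurs.
        h -= (context_counts[context] / n) * p * math.log2(p)
    return h


def to_vector(probs, feature_index):
    """Align one document's probabilities to a shared feature order."""
    return [probs.get(key, 0.0) for key in feature_index]
```

To cluster several documents, one could build a shared feature_index from the union of all (context, next) pairs and pass each document's vector to any standard clustering routine; unseen transitions are simply given probability 0.0 here.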


Cited By

  • (2022) A Framework for Exploring Computational Models of Novelty in Unstructured Text. Proceedings of the 6th International Conference on Information System and Data Mining, pp. 36-45. DOI: 10.1145/3546157.3546164. Online publication date: 27-May-2022


    Published In

    ICISDM '18: Proceedings of the 2nd International Conference on Information System and Data Mining
    April 2018
    169 pages
    ISBN: 9781450363549
    DOI: 10.1145/3206098

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Higher order Markov model
    2. Text clustering
    3. Text representation
    4. Vector space model

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Scientific Research Plan Item of Hunan Provincial Education Department of China
    • Hunan Provincial Natural Science Foundation of China
    • National Natural Science Foundation for Youth of China

    Conference

    ICISDM '18


    Article Metrics

    • Downloads (Last 12 months): 5
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 20 Sep 2024
