DOI: 10.1145/2484028.2484140
short-paper

Document classification by topic labeling

Published: 28 July 2013

Abstract

In this paper, we propose a document classification algorithm based on Latent Dirichlet Allocation (LDA) [1] that does not require any labeled dataset. In our algorithm, we construct a topic model using LDA, assign each topic to one of the class labels, aggregate all topics with the same class label into a single topic using the aggregation property of the Dirichlet distribution, and then automatically assign a class label to each unlabeled document depending on its "closeness" to one of the aggregated topics.
We also present an extension to our algorithm based on the combination of the Expectation-Maximization (EM) algorithm and a naive Bayes classifier. We show the effectiveness of our algorithm on three real-world datasets.
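
The aggregation and labeling steps can be illustrated with a minimal sketch. It relies on the aggregation property of the Dirichlet distribution: if (θ_1, ..., θ_K) follows Dirichlet(α_1, ..., α_K), then merging θ_i and θ_j into their sum gives a Dirichlet distribution whose merged component has parameter α_i + α_j. The sketch is not the authors' implementation. It assumes the per-document topic proportions have already been obtained from some LDA implementation, it assumes "closeness" means the largest aggregated proportion (the abstract does not define the measure), and the function names `aggregate_topics` and `classify`, the topic-to-class mapping, and the toy numbers are all hypothetical.

```python
import numpy as np


def aggregate_topics(doc_topic, topic_to_class, n_classes):
    """Sum each document's topic proportions over topics sharing a class label.

    doc_topic:      (n_docs, n_topics) array, each row sums to 1.
    topic_to_class: length n_topics sequence, class index assigned to each topic.
    Returns an (n_docs, n_classes) array of aggregated proportions.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    topic_to_class = np.asarray(topic_to_class)
    doc_class = np.zeros((doc_topic.shape[0], n_classes))
    for c in range(n_classes):
        # Merging topics corresponds to adding their proportions.
        doc_class[:, c] = doc_topic[:, topic_to_class == c].sum(axis=1)
    return doc_class


def classify(doc_topic, topic_to_class, n_classes):
    """Assumed closeness rule: pick the class with the largest aggregated proportion."""
    return aggregate_topics(doc_topic, topic_to_class, n_classes).argmax(axis=1)


# Toy input (invented for illustration): 3 documents, 4 topics, 2 classes.
doc_topic = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.2, 0.2, 0.3, 0.3],
])
topic_to_class = [0, 0, 1, 1]  # topics 0 and 1 -> class 0, topics 2 and 3 -> class 1

labels = classify(doc_topic, topic_to_class, n_classes=2)
print(labels)
```

Because aggregation only adds proportions, each row of the aggregated matrix still sums to 1, so it can be read as a distribution over class labels for that document.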

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, March 2003.
[2] B. A. Frigyik, A. Kapila, and M. R. Gupta. Introduction to the Dirichlet distribution and related processes. Technical report, University of Washington, Seattle, 2012. https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf
[3] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228-5235, April 2004.
[4] A. McCallum and K. Nigam. Text classification by bootstrapping with keywords, EM and shrinkage. In ACL-99 Workshop for Unsupervised Learning in Natural Language Processing, pages 52-58, 1999.
[5] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning - Special issue on information retrieval, 39(2-3), May-June 2000.
[6] A. Chanen and J. Patrick. Measuring correlation between linguists' judgments and Latent Dirichlet Allocation topics. Proceedings of the Australasian Language Technology Workshop, pages 13-20, 2007.

Published In

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN:9781450320344
DOI:10.1145/2484028
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. expectation-maximization
  2. text classification
  3. topic modelling

Qualifiers

  • Short-paper

Conference

SIGIR '13

Acceptance Rates

SIGIR '13 Paper Acceptance Rate 73 of 366 submissions, 20%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 43
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 02 Feb 2025

Cited By

  • (2024) TRGNN: Text-Rich Graph Neural Network for Few-Shot Document Filtering. 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1-9. DOI: 10.1109/IJCNN60899.2024.10650066. Online publication date: 30-Jun-2024
  • (2024) Are we listening to every word? Using multiple analytic methods to examine qualitative data. Cogent Mental Health, 3:1, pp. 1-24. DOI: 10.1080/28324765.2024.2433791. Online publication date: 3-Dec-2024
  • (2023) Artificial intelligence applications in fake review detection: Bibliometric analysis and future avenues for research. Journal of Business Research, 158, article 113631. DOI: 10.1016/j.jbusres.2022.113631. Online publication date: Mar-2023
  • (2023) Automated scholarly paper review. Information Fusion, 98:C. DOI: 10.1016/j.inffus.2023.101830. Online publication date: 26-Jul-2023
  • (2023) Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals. Soft Computing, 27:9, pp. 5397-5410. DOI: 10.1007/s00500-022-07771-9. Online publication date: 6-Jan-2023
  • (2023) Comparison of Text Classification Methods Using Deep Learning Neural Networks. Computational Linguistics and Intelligent Text Processing, pp. 438-450. DOI: 10.1007/978-3-031-24340-0_33. Online publication date: 26-Feb-2023
  • (2023) More Sustainable Text Classification via Uncertainty Sampling and a Human-in-the-Loop. Agents and Artificial Intelligence, pp. 201-225. DOI: 10.1007/978-3-031-22953-4_9. Online publication date: 20-Jan-2023
  • (2022) SATLabel: A Framework for Sentiment and Aspect Terms Based Automatic Topic Labelling. Machine Intelligence and Data Science Applications, pp. 63-75. DOI: 10.1007/978-981-19-2347-0_6. Online publication date: 2-Aug-2022
  • (2021) An intent recognition model supporting the spoken expression mixed with Chinese and English. Journal of Intelligent & Fuzzy Systems, pp. 1-12. DOI: 10.3233/JIFS-202365. Online publication date: 27-Jan-2021
  • (2021) Improved Word Sense Determination in Malayalam using Latent Dirichlet Allocation and Semantic Features. ACM Transactions on Asian and Low-Resource Language Information Processing, 21:2, pp. 1-11. DOI: 10.1145/3476978. Online publication date: 3-Nov-2021
