short-paper

Document classification by topic labeling

Authors:

Swapnil Hingmire,

Sandeep Chougule,

Girish K. Palshikar,

Sutanu Chakraborti

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Pages 877 - 880

https://doi.org/10.1145/2484028.2484140

Published: 28 July 2013

Abstract

In this paper, we propose a Latent Dirichlet Allocation (LDA) [1] based document classification algorithm that does not require any labeled dataset. In our algorithm, we construct a topic model using LDA, assign each topic to one of the class labels, aggregate all topics with the same class label into a single topic using the aggregation property of the Dirichlet distribution, and then automatically assign a class label to each unlabeled document depending on its "closeness" to one of the aggregated topics.
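
The labeling and aggregation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the document-topic proportions `theta`, the mapping `topic_to_class`, and the function names are hypothetical, and "closeness" is assumed here to mean the largest aggregated proportion. The aggregation property of the Dirichlet distribution states that summing components of a Dirichlet-distributed vector yields another Dirichlet-distributed vector whose parameters are the corresponding sums, which is why the proportions of topics sharing a class label can simply be added.

```python
import numpy as np


def aggregate_topics(doc_topic, topic_to_class, n_classes):
    """Sum each document's topic proportions over topics that share a class label."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    topic_to_class = np.asarray(topic_to_class)
    doc_class = np.zeros((doc_topic.shape[0], n_classes))
    for c in range(n_classes):
        # Columns of doc_topic whose topic was labeled with class c
        doc_class[:, c] = doc_topic[:, topic_to_class == c].sum(axis=1)
    return doc_class


def classify(doc_topic, topic_to_class, n_classes):
    """Assumed notion of closeness: pick the class with the largest aggregated share."""
    return aggregate_topics(doc_topic, topic_to_class, n_classes).argmax(axis=1)


# Hypothetical document-topic proportions: 3 documents, 4 topics
theta = np.array([
    [0.50, 0.30, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.40, 0.05, 0.35, 0.20],
])
# Hypothetical topic labeling: topics 0 and 1 -> class 0, topics 2 and 3 -> class 1
topic_to_class = np.array([0, 0, 1, 1])

labels = classify(theta, topic_to_class, n_classes=2)
print(labels)  # [0 1 1]
```

In practice `theta` would come from an LDA model fitted to the unlabeled corpus, and `topic_to_class` from whoever labels the topics, for example by inspecting each topic's most probable words (an assumption; the abstract does not say how topics are labeled).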

We present an extension to our algorithm based on a combination of the Expectation-Maximization (EM) algorithm and a naive Bayes classifier. We show the effectiveness of our algorithm on three real-world datasets.
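
The abstract does not spell out how EM and naive Bayes are combined. One common pattern, in the spirit of Nigam et al. [5], is to treat the class assignments from the topic-labeling step as initial soft labels and then alternate between fitting a multinomial naive Bayes model and re-estimating the labels. The sketch below follows that pattern under stated assumptions: the function `nb_em`, the toy count matrix `X`, the initial posteriors `init_post`, and the smoothing constant `alpha` are all hypothetical.

```python
import numpy as np


def nb_em(X, init_post, n_iter=10, alpha=1.0):
    """EM for multinomial naive Bayes with soft labels.

    X: (D, V) term-count matrix.
    init_post: (D, C) initial class posteriors, e.g. from topic labeling.
    """
    X = np.asarray(X, dtype=float)
    post = np.asarray(init_post, dtype=float)
    n_classes = post.shape[1]
    vocab_size = X.shape[1]
    for _ in range(n_iter):
        # M-step: smoothed class priors and word probabilities from soft counts
        prior = (post.sum(axis=0) + alpha) / (post.sum() + alpha * n_classes)
        word_counts = post.T @ X  # shape (C, V)
        word_prob = (word_counts + alpha) / (
            word_counts.sum(axis=1, keepdims=True) + alpha * vocab_size
        )
        # E-step: class posteriors for every document, computed in log space
        log_post = np.log(prior) + X @ np.log(word_prob).T  # shape (D, C)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post


# Toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 2, 3],
])
# Hypothetical soft labels from the topic-labeling step (some are uncertain)
init_post = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.45, 0.55],
])

posteriors = nb_em(X, init_post)
print(posteriors.argmax(axis=1))
```

Starting from soft rather than hard labels lets uncertain documents contribute fractionally to both classes until the word statistics pull them toward one.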

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, March 2003.

[2] B. A. Frigyik, A. Kapila, and M. R. Gupta. Introduction to the Dirichlet distribution and related processes. Technical report, University of Washington, Seattle, 2012. https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf

[3] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228-5235, April 2004.

[4] A. McCallum and K. Nigam. Text classification by bootstrapping with keywords, EM and shrinkage. In ACL-99 Workshop for Unsupervised Learning in Natural Language Processing, pages 52-58, 1999.

[5] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning - Special issue on information retrieval, 39(2-3), May-June 2000.

[6] A. Chanen and J. Patrick. Measuring Correlation Between Linguists' Judgments and Latent Dirichlet Allocation Topics. In Proceedings of the Australasian Language Technology Workshop, pages 13-20, 2007.

Index Terms

Document classification by topic labeling
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Information & Contributors

Information

Published In

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

July 2013

1188 pages

ISBN: 9781450320344

DOI: 10.1145/2484028

General Chairs: Gareth J.F. Jones (Dublin City University, Ireland), Páraic Sheridan (Dublin City University, Ireland)

Program Chairs: Diane Kelly (University of North Carolina, Chapel Hill, USA), Maarten de Rijke (University of Amsterdam, The Netherlands), Tetsuya Sakai (Microsoft Research Asia, China)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

SIGIR '13

Sponsor: SIGIR

SIGIR '13: The 36th International ACM SIGIR conference on research and development in Information Retrieval

July 28 - August 1, 2013

Dublin, Ireland

Acceptance Rates

SIGIR '13 Paper Acceptance Rate: 73 of 366 submissions, 20%

Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 45
Total Downloads: 1,128
Downloads (Last 12 months): 43
Downloads (Last 6 weeks): 6

Reflects downloads up to 02 Feb 2025

Citations

Cited By

Mu H, Zhang S, Wang Y, Sun Y, Xu H (2024). TRGNN: Text-Rich Graph Neural Network for Few-Shot Document Filtering. 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1-9. Online publication date: 30-Jun-2024. https://doi.org/10.1109/IJCNN60899.2024.10650066
Boyd R, Morrison N, Horwitz S, Maciag R, Travers-Hill E, Kim Y (2024). Are we listening to every word? Using multiple analytic methods to examine qualitative data. Cogent Mental Health, 3:1, pp. 1-24. Online publication date: 3-Dec-2024. https://doi.org/10.1080/28324765.2024.2433791
Ben Jabeur S, Ballouk H, Ben Arfi W, Sahut J (2023). Artificial intelligence applications in fake review detection: Bibliometric analysis and future avenues for research. Journal of Business Research, 158 (113631). Online publication date: Mar-2023. https://doi.org/10.1016/j.jbusres.2022.113631
Lin J, Song J, Zhou Z, Chen Y, Shi X (2023). Automated scholarly paper review. Information Fusion, 98:C. Online publication date: 26-Jul-2023. https://dl.acm.org/doi/10.1016/j.inffus.2023.101830
Li X, Wang B, Wang Y, Ouyang J, Garg H, Thanh D (2023). Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals. Soft Computing, 27:9, pp. 5397-5410. Online publication date: 6-Jan-2023. https://doi.org/10.1007/s00500-022-07771-9
Amjad M, Gelbukh A, Voronkov I, Saenko A (2023). Comparison of Text Classification Methods Using Deep Learning Neural Networks. Computational Linguistics and Intelligent Text Processing, pp. 438-450. Online publication date: 26-Feb-2023. https://doi.org/10.1007/978-3-031-24340-0_33
Andersen J, Zukunft O (2023). More Sustainable Text Classification via Uncertainty Sampling and a Human-in-the-Loop. Agents and Artificial Intelligence, pp. 201-225. Online publication date: 20-Jan-2023. https://doi.org/10.1007/978-3-031-22953-4_9
Shahriar K, Moni M, Hoque M, Islam M, Sarker I (2022). SATLabel: A Framework for Sentiment and Aspect Terms Based Automatic Topic Labelling. Machine Intelligence and Data Science Applications, pp. 63-75. Online publication date: 2-Aug-2022. https://doi.org/10.1007/978-981-19-2347-0_6
Hu M, Peng J, Zhang W, Hu J, Qi L, Zhang H (2021). An intent recognition model supporting the spoken expression mixed with Chinese and English. Journal of Intelligent & Fuzzy Systems, pp. 1-12. Online publication date: 27-Jan-2021. https://doi.org/10.3233/JIFS-202365
S. S, Kannan B, Paul B (2021). Improved Word Sense Determination in Malayalam using Latent Dirichlet Allocation and Semantic Features. ACM Transactions on Asian and Low-Resource Language Information Processing, 21:2, pp. 1-11. Online publication date: 3-Nov-2021. https://dl.acm.org/doi/10.1145/3476978
