Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms

Authors: Mohammad Reza Keyvanpour, Maryam Bahojb Imani

Intelligent Data Analysis, Volume 17, Issue 3

Pages 367 - 385

Published: 01 May 2013

Abstract

Text categorization is one of the fundamental tasks in text mining. Classical supervised methods need a lot of labeled data to train a classifier. Since assigning labels to a large amount of data is costly and time-consuming, it is useful to exploit data sets without labels. Many different semi-supervised learning methods have therefore been studied recently. Among these, self-training is one of the important learning algorithms: it classifies unlabeled samples using a small number of labeled ones and adds the most confident samples to the training set. In this paper, dynamic weighting alongside a majority-vote approach is applied to classify the unlabeled data into reliable and unreliable classes. The reliable data are then added to the training set, and the remaining data, including the unreliable data, are classified in an iterative process. We tested this method on the extracted features of ten common Reuters-21578 classes. Experimental results indicate that the proposed method is effective and improves classification performance.
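
The self-training loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three toy classifiers, the weighting rule (each classifier's accuracy on a small held-out labeled set, recomputed every round), the `threshold` parameter, and the example documents are all assumptions chosen to keep the example self-contained. A document counts as "reliable" here when the weighted vote behind its predicted label reaches the threshold.

```python
import math
from collections import Counter


def tokens(doc):
    return doc.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokens(doc))
        self.vocab_size = len(set().union(*self.counts.values()))
        return self

    def predict(self, doc):
        def log_score(c):
            total = sum(self.counts[c].values()) + self.vocab_size
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total) for w in tokens(doc))
        return max(sorted(self.prior), key=log_score)


class NearestNeighbour:
    """1-NN using Jaccard similarity between token sets."""

    def fit(self, docs, labels):
        self.train = [(set(tokens(d)), y) for d, y in zip(docs, labels)]
        return self

    def predict(self, doc):
        query = set(tokens(doc))
        def jaccard(item):
            return len(query & item[0]) / (len(query | item[0]) or 1)
        return max(self.train, key=jaccard)[1]


class VocabularyOverlap:
    """Picks the class whose training vocabulary shares most words with the doc."""

    def fit(self, docs, labels):
        self.vocab = {}
        for doc, label in zip(docs, labels):
            self.vocab.setdefault(label, set()).update(tokens(doc))
        return self

    def predict(self, doc):
        query = set(tokens(doc))
        return max(sorted(self.vocab), key=lambda c: len(query & self.vocab[c]))


def weighted_vote(models, weights, doc):
    """Return the winning label and the share of total weight behind it."""
    tally = Counter()
    for model, weight in zip(models, weights):
        tally[model.predict(doc)] += weight
    label, support = max(sorted(tally.items()), key=lambda kv: kv[1])
    return label, support / sum(weights)


def self_train(labeled, validation, unlabeled, threshold=0.99, max_rounds=5):
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        models = [cls().fit(docs, labels)
                  for cls in (NaiveBayes, NearestNeighbour, VocabularyOverlap)]
        # Assumed weighting rule: held-out accuracy, recomputed each round.
        weights = [max(sum(m.predict(d) == y for d, y in validation)
                       / len(validation), 1e-6) for m in models]
        votes = [weighted_vote(models, weights, d) for d in pool]
        reliable = [(d, y) for d, (y, share) in zip(pool, votes)
                    if share >= threshold]
        if not reliable:
            break  # nothing confident enough to add; stop iterating
        docs += [d for d, _ in reliable]
        labels += [y for _, y in reliable]
        pool = [d for d, (_, share) in zip(pool, votes) if share < threshold]
    return list(zip(docs, labels)), pool


# Hypothetical toy data, loosely styled after newswire topics.
labeled = [("wheat harvest crop farm", "grain"), ("corn crop yield farm", "grain"),
           ("oil barrel crude price", "crude"), ("crude oil opec output", "crude")]
validation = [("wheat crop export", "grain"), ("opec crude price", "crude")]
unlabeled = ["farm wheat yield", "opec oil barrel", "harvest price"]

train, pool = self_train(labeled, validation, unlabeled)
```

With the default threshold, a document is added only when every classifier agrees on its label; lowering `threshold` lets a weighted majority suffice. Documents on which the ensemble stays split remain in `pool` after the loop ends.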

Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches

Published In

Intelligent Data Analysis Volume 17, Issue 3

May 2013

189 pages

ISSN:1088-467X

Publisher

IOS Press

Netherlands

Cited By

Keyvanpour M, Barani Shirzad M, Heydarian F (2022) Android malware detection applying feature selection techniques and machine learning. Multimedia Tools and Applications 82:6 (9517-9531). DOI: 10.1007/s11042-022-13767-2. Online publication date: 14-Sep-2022
https://dl.acm.org/doi/10.1007/s11042-022-13767-2
Yatsko V (2021) A New Method of Automatic Text Document Classification. Automatic Documentation and Mathematical Linguistics 55:3 (122-133). DOI: 10.3103/S0005105521030080. Online publication date: 1-May-2021
https://dl.acm.org/doi/10.3103/S0005105521030080
Zhao J, Liu N (2021) A Safe Semi-supervised Classification Algorithm Using Multiple Classifiers Ensemble. Neural Processing Letters 53:4 (2603-2616). DOI: 10.1007/s11063-020-10191-1. Online publication date: 1-Aug-2021
https://dl.acm.org/doi/10.1007/s11063-020-10191-1
Athira B, Idicula S, Jones J (2020) Multi-label Topic Classification of Patient Generated Content in a Breast-cancer Community Forum. Proceedings of the 4th International Conference on Medical and Health Informatics (266-274). DOI: 10.1145/3418094.3418132. Online publication date: 14-Aug-2020
https://dl.acm.org/doi/10.1145/3418094.3418132
Chen M, Tan X, Zhang L (2016) An iterative self-training support vector machine algorithm in brain-computer interfaces. Intelligent Data Analysis 20:1 (67-82). DOI: 10.3233/IDA-150794. Online publication date: 18-Jan-2016
https://dl.acm.org/doi/10.3233/IDA-150794
Tan X, Chen M, Gan J (2015) A co-training algorithm based on modified Fisher's linear discriminant analysis. Intelligent Data Analysis 19:2 (279-292). DOI: 10.5555/2768391.2768396. Online publication date: 1-Mar-2015
https://dl.acm.org/doi/10.5555/2768391.2768396
Wang J, Bensmail H, Yao N, Gao X (2013) Discriminative sparse coding on multi-manifolds. Knowledge-Based Systems 54:C (199-206). DOI: 10.5555/2770961.2771110. Online publication date: 1-Dec-2013
https://dl.acm.org/doi/10.5555/2770961.2771110
