Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms

Published: 01 May 2013

Abstract

Text categorization is one of the fundamental tasks in text mining. Classical supervised methods need a lot of labeled data to train a classifier. Because assigning labels to large amounts of data is costly and time-consuming, it is useful to exploit data sets without labels, and many different semi-supervised learning methods have therefore been studied recently. Among these methods, self-training is one of the important learning algorithms: it classifies unlabeled samples using a small number of labeled ones and adds the most confident samples to the training set. In this paper, dynamic weighting alongside a majority-vote approach is applied to classify the unlabeled data into reliable and unreliable classes. The reliable data are then added to the training set, and the remaining data, including the unreliable data, are classified in an iterative process. We tested this method on the extracted features of ten common Reuters-21578 classes. Experimental results indicate that the proposed method is effective and improves classification performance.
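
The iterative procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the base classifiers, the weighting formula, or the reliability criterion, so everything concrete here is an assumption made for illustration: the base learners are nearest-centroid classifiers with three different distance functions, each classifier's weight is its accuracy on the current labeled set (recomputed every round), and a sample counts as reliable when the winning label holds at least a `threshold` share of the total vote weight. The names `self_train`, `centroid_classifier`, and `threshold` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return 1.0 - dot / norm if norm else 1.0

def centroid_classifier(X, y, metric):
    """Fit a nearest-centroid classifier and return its predict function."""
    counts = Counter(y)
    sums = defaultdict(lambda: [0.0] * len(X[0]))
    for xi, yi in zip(X, y):
        sums[yi] = [s + v for s, v in zip(sums[yi], xi)]
    centroids = {c: [s / counts[c] for s in sums[c]] for c in counts}
    return lambda x: min(centroids, key=lambda c: metric(x, centroids[c]))

def self_train(X_lab, y_lab, X_unlab, metrics, threshold=0.99, max_iter=10):
    """Self-training with a weighted majority vote (illustrative assumptions only)."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        if not pool:
            break
        models = [centroid_classifier(X_lab, y_lab, m) for m in metrics]
        # Assumed dynamic weighting: accuracy on the current labeled set.
        weights = [sum(f(x) == t for x, t in zip(X_lab, y_lab)) / len(y_lab)
                   for f in models]
        total = sum(weights)
        reliable, unreliable = [], []
        for x in pool:
            votes = defaultdict(float)
            for f, w in zip(models, weights):
                votes[f(x)] += w
            label, score = max(votes.items(), key=lambda kv: kv[1])
            # Assumed reliability rule: winning share of the vote weight.
            if total > 0 and score / total >= threshold:
                reliable.append((x, label))
            else:
                unreliable.append(x)
        if not reliable:
            break  # nothing confident enough to add; stop early
        for x, label in reliable:
            X_lab.append(x)
            y_lab.append(label)
        pool = unreliable  # remaining samples are re-classified next round
    return X_lab, y_lab, pool

# Toy two-feature vectors standing in for extracted document features.
X_lab = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
y_lab = ["earn", "earn", "grain", "grain"]
X_unlab = [(0.8, 0.1), (0.1, 0.8)]
X_out, y_out, rest = self_train(X_lab, y_lab, X_unlab,
                                [euclidean, manhattan, cosine_distance])
print(y_out, rest)
```

With a threshold close to 1, only samples on which all weighted classifiers agree are added, which trades a slower growth of the training set for fewer mislabeled additions; a lower threshold would be the opposite trade-off.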

      Published In

      Intelligent Data Analysis, Volume 17, Issue 3
      May 2013
      189 pages

      Publisher

      IOS Press

      Netherlands

      Author Tags

      1. Dynamic Weighting
      2. Ensemble Learning
      3. Self Training
      4. Semi-Supervised Learning
      5. Text Categorization

      Qualifiers

      • Article

      Cited By

      • (2022) Android malware detection applying feature selection techniques and machine learning. Multimedia Tools and Applications 82:6 (9517-9531). DOI: 10.1007/s11042-022-13767-2. Online publication date: 14-Sep-2022
      • (2021) A New Method of Automatic Text Document Classification. Automatic Documentation and Mathematical Linguistics 55:3 (122-133). DOI: 10.3103/S0005105521030080. Online publication date: 1-May-2021
      • (2021) A Safe Semi-supervised Classification Algorithm Using Multiple Classifiers Ensemble. Neural Processing Letters 53:4 (2603-2616). DOI: 10.1007/s11063-020-10191-1. Online publication date: 1-Aug-2021
      • (2020) Multi-label Topic Classification of Patient Generated Content in a Breast-cancer Community Forum. Proceedings of the 4th International Conference on Medical and Health Informatics (266-274). DOI: 10.1145/3418094.3418132. Online publication date: 14-Aug-2020
      • (2016) An iterative self-training support vector machine algorithm in brain-computer interfaces. Intelligent Data Analysis 20:1 (67-82). DOI: 10.3233/IDA-150794. Online publication date: 18-Jan-2016
      • (2015) A co-training algorithm based on modified Fisher's linear discriminant analysis. Intelligent Data Analysis 19:2 (279-292). DOI: 10.5555/2768391.2768396. Online publication date: 1-Mar-2015
      • (2013) Discriminative sparse coding on multi-manifolds. Knowledge-Based Systems 54:C (199-206). DOI: 10.5555/2770961.2771110. Online publication date: 1-Dec-2013
