PU text classification enhanced by term frequency-inverse document frequency-improved weighting

Published: 10 March 2014

Abstract

Term frequency-inverse document frequency (TF-IDF) is one of the most popular feature (also called term or word) weighting methods for describing documents in the vector space model and in applications related to text mining and information retrieval. It can effectively reflect the importance of a term in a collection of documents in which all documents play the same role. However, TF-IDF does not account for differences in a term's IDF weighting when the documents play different roles in the collection, such as the positive and negative training sets in text classification. To address this, this paper presents a novel TF-IDF-improved feature weighting approach, which reflects the importance of a term in the positive and the negative training examples, respectively. We also build a weighted voting classifier (WVC) by iteratively applying the support vector machine (SVM) algorithm, and implement one-class SVM and Positive Example Based Learning (PEBL) methods for comparison. During classification, an improved 1-DNF algorithm, called 1-DNFC, is adopted to identify more reliable negative documents from the unlabeled example set. The experimental results show that the term frequency inverse positive-negative document frequency (TFIPNDF)-based classifier outperforms the TF-IDF-based one, and that the weighted voting classifier also outperforms the one-class SVM-based and PEBL-based classifiers. Copyright © 2013 John Wiley & Sons, Ltd.
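
To make the motivation concrete, the following minimal Python sketch contrasts a single collection-wide IDF with IDF values computed separately over the positive and the negative training sets. It is an illustration only: the abstract does not state the TFIPNDF formula, so the smoothed IDF used here, the helper names (`idf`, `tf`, `tfidf`, `class_idfs`), and the toy documents are all assumptions and not the authors' method.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed inverse document frequency of `term` over tokenized `docs`.

    Assumed form: log((1 + N) / (1 + df)) + 1, a common smoothing choice,
    not a formula taken from the paper.
    """
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def tf(term, doc):
    """Relative frequency of `term` in one tokenized document."""
    return Counter(doc)[term] / len(doc) if doc else 0.0


def tfidf(term, doc, docs):
    """Classic TF-IDF: every document in `docs` plays the same role."""
    return tf(term, doc) * idf(term, docs)


def class_idfs(term, positives, negatives):
    """Hypothetical helper: IDF computed separately on each training set."""
    return idf(term, positives), idf(term, negatives)


# Toy data (invented for illustration).
positives = [["svm", "kernel", "margin"], ["svm", "text", "kernel"]]
negatives = [["football", "text", "score"], ["text", "weather", "rain"]]
collection = positives + negatives

idf_all_svm = idf("svm", collection)
idf_pos_svm, idf_neg_svm = class_idfs("svm", positives, negatives)

print(f"collection-wide IDF('svm') = {idf_all_svm:.3f}")
print(f"positive-set IDF('svm')    = {idf_pos_svm:.3f}")
print(f"negative-set IDF('svm')    = {idf_neg_svm:.3f}")
```

In this toy data, "svm" appears in every positive document and in no negative one, so its two class-specific IDF values differ sharply while the collection-wide IDF falls between them. This is the kind of role-dependent information that a single IDF over the whole collection cannot express.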


Published In

Concurrency and Computation: Practice & Experience, Volume 26, Issue 3
March 2014
236 pages
ISSN: 1532-0626
EISSN: 1532-0634

Publisher

John Wiley and Sons Ltd.
United Kingdom

Author Tags

1. 1-DNFC
2. Classification
3. TF-IDF
4. TFIPNDF
5. WVC
