DOI: 10.1145/383952.383974
Article

A statistical learning model of text classification for support vector machines

Published: 01 September 2001

Abstract

This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of a SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?

      Published In

      SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
      September 2001
      454 pages
      ISBN:1581133316
      DOI:10.1145/383952

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SIGIR '01 Paper Acceptance Rate 47 of 201 submissions, 23%;
      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Cited By

      • (2024) Automatic Noise Generation and Reduction for Text Classification. IEEE/ACM Transactions on Audio, Speech and Language Processing 32 (139-150). DOI: 10.1109/TASLP.2023.3325135. Online publication date: 1-Jan-2024
      • (2024) Applied advanced analytics in marketing of mechanical products. Machine Intelligence in Mechanical Engineering (249-285). DOI: 10.1016/B978-0-443-18644-8.00002-2. Online publication date: 2024
      • (2024) Range constrained group query on attribute social graph. Distributed and Parallel Databases 42:3 (337-375). DOI: 10.1007/s10619-024-07439-3. Online publication date: 30-Mar-2024
      • (2024) Radical-attended and Pinyin-attended malicious long-tail keywords detection. Neural Computing and Applications 36:24 (14757-14773). DOI: 10.1007/s00521-024-09871-z. Online publication date: 10-May-2024
      • (2023) An Improved SVM with Earth Mover’s Distance Regularization and Its Application in Pattern Recognition. Electronics 12:3 (645). DOI: 10.3390/electronics12030645. Online publication date: 28-Jan-2023
      • (2023) DTFL: A Digital Twin-Assisted Graph Neural Network Approach for Service Function Chains Failure Localization. IEEE Transactions on Cloud Computing 11:4 (3573-3590). DOI: 10.1109/TCC.2023.3294506. Online publication date: Oct-2023
      • (2023) Contrastive knowledge integrated graph neural networks for Chinese medical text classification. Engineering Applications of Artificial Intelligence 122:C. DOI: 10.1016/j.engappai.2023.106057. Online publication date: 1-Jun-2023
      • (2022) Machine Learning for Text Classification in Building Management Systems. Journal of Civil Engineering and Management 28:5 (408-421). DOI: 10.3846/jcem.2022.16012. Online publication date: 12-May-2022
      • (2022) The semi-automatic classification of an open-ended question on panel survey motivation and its application in attrition analysis. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.880554. Online publication date: 11-Aug-2022
      • (2022) A study on the phenomenon of anaphoric correction in college students’ English conversation. Applied Mathematics and Nonlinear Sciences 8:2 (383-394). DOI: 10.2478/amns.2021.2.00289. Online publication date: 23-Dec-2022
