DOI: 10.3115/1072133.1072204

Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing

Published: 18 March 2001

Abstract

In this paper, we discuss experiments applying machine learning techniques to the task of confusion set disambiguation, using three orders of magnitude more training data than has previously been used for any disambiguation-in-string-context problem. In an attempt to determine when current learning methods will cease to benefit from additional training data, we analyze residual errors made by learners when issues of sparse data have been significantly mitigated. Finally, in the context of our results, we discuss possible directions for the empirical natural language research community.
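
As a rough illustration of the task described in the abstract, the sketch below treats confusion set disambiguation as classification over nearby context words and measures accuracy at increasing training sizes. Everything in it is an assumption made for illustration: the confusion set {than, then}, the toy sentences, the window size, and the naive Bayes-style scorer, which is swapped in here only because it is compact. The abstract does not name the learners or features the authors used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical confusion set chosen for illustration only.
CONFUSION_SET = ("than", "then")

# Toy data (invented); a real experiment would use a very large corpus.
TRAIN_SENTENCES = [
    "she is taller than her brother",
    "we ate dinner and then went home",
    "this is better than that one",
    "first mix the flour then add water",
    "more than ten people came",
    "he paused and then spoke again",
]
TEST_SENTENCES = [
    "it is bigger than a house",
    "stir the sauce and then serve",
]


def extract_examples(sentences, confusion_set=CONFUSION_SET, window=2):
    """Turn sentences into (context_features, label) pairs."""
    examples = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in confusion_set:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                feats = [f"L:{w}" for w in left] + [f"R:{w}" for w in right]
                examples.append((feats, token))
    return examples


def train(examples):
    """Count labels and per-label context features."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in examples:
        label_counts[label] += 1
        for feat in feats:
            feat_counts[label][feat] += 1
            vocab.add(feat)
    return label_counts, feat_counts, vocab


def predict(model, feats, confusion_set=CONFUSION_SET):
    """Pick the confusion-set member with the highest smoothed log score."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in confusion_set:
        # Add-one smoothing, with one extra slot reserved for unseen features.
        score = math.log((label_counts[label] + 1) / (total + len(confusion_set)))
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for feat in feats:
            score += math.log((feat_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learning_curve(train_examples, test_examples, sizes):
    """Return (training_size, test_accuracy) for each prefix size."""
    curve = []
    for n in sizes:
        model = train(train_examples[:n])
        correct = sum(predict(model, f) == y for f, y in test_examples)
        curve.append((n, correct / len(test_examples)))
    return curve


if __name__ == "__main__":
    train_examples = extract_examples(TRAIN_SENTENCES)
    test_examples = extract_examples(TEST_SENTENCES)
    for n, acc in learning_curve(train_examples, test_examples, [1, 3, 6]):
        print(f"{n} training examples: accuracy {acc:.2f}")
```

With data this small the curve is not meaningful in itself; the point is only the shape of the experiment, in which the same learner is retrained on progressively larger prefixes of the training data and scored on a fixed test set.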

      Published In

      HLT '01: Proceedings of the first international conference on Human language technology research
      March 2001
      375 pages

      Publisher

      Association for Computational Linguistics

      United States

      Author Tags

      1. data scaling
      2. learning curves
      3. natural language disambiguation
      4. very large corpora

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate 240 of 768 submissions, 31%

      Cited By

      • (2020) Human-in-the-loop AI in government. Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 488-497. DOI: 10.1145/3377325.3377489. Online publication date: 17-Mar-2020.
      • (2017) MiCOMP. ACM Transactions on Architecture and Code Optimization 14:3, pp. 1-28. DOI: 10.1145/3124452. Online publication date: 6-Sep-2017.
      • (2017) Data-Driven Shape Analysis and Processing. Computer Graphics Forum 36:1, pp. 101-132. DOI: 10.1111/cgf.12790. Online publication date: 1-Jan-2017.
      • (2016) Data-driven shape analysis and processing. SIGGRAPH ASIA 2016 Courses, pp. 1-38. DOI: 10.1145/2988458.2988473. Online publication date: 28-Nov-2016.
      • (2012) Predicting learner levels for online exercises of Hebrew. Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pp. 95-104. DOI: 10.5555/2390384.2390396. Online publication date: 7-Jun-2012.
      • (2011) The impact of language models and loss functions on repair disfluency detection. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 703-711. DOI: 10.5555/2002472.2002562. Online publication date: 19-Jun-2011.
      • (2011) Web scale NLP. Proceedings of the 20th international conference on World wide web, pp. 357-366. DOI: 10.1145/1963405.1963457. Online publication date: 28-Mar-2011.
      • (2011) Technical Section. Computers and Graphics 35:5, pp. 955-966. DOI: 10.1016/j.cag.2011.07.001. Online publication date: 1-Oct-2011.
      • (2010) Search right and thou shalt find... Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 37-44. DOI: 10.5555/1866795.1866800. Online publication date: 5-Jun-2010.
      • (2010) An overview of Microsoft web N-gram corpus and applications. Proceedings of the NAACL HLT 2010 Demonstration Session, pp. 45-48. DOI: 10.5555/1855450.1855462. Online publication date: 2-Jun-2010.
