Class-based n-gram models of natural language

Published: 01 December 1992

Abstract

We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
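
As a rough illustration of the class-based idea described above, the following minimal Python sketch estimates a bigram model over word classes. It uses the standard class-based factorization, in which the probability of a word given its predecessor is taken to be P(w_i | c_i) · P(c_i | c_{i-1}), where c_i is the class of w_i. The toy corpus, the hand-written word-to-class map, and the function names are all assumptions made for illustration. In particular, the classes here are assigned by hand, whereas the article is concerned with deriving them automatically from co-occurrence statistics, which this sketch does not attempt.

```python
from collections import Counter

def train_class_bigram(tokens, word_to_class):
    """Collect maximum-likelihood counts for a class-based bigram model."""
    classes = [word_to_class[w] for w in tokens]
    word_counts = Counter(tokens)                        # C(w)
    class_counts = Counter(classes)                      # C(c)
    class_bigrams = Counter(zip(classes, classes[1:]))   # C(c_prev, c)
    history_counts = Counter(classes[:-1])               # C(c_prev) as a bigram history
    return word_counts, class_counts, class_bigrams, history_counts

def class_bigram_prob(prev_word, word, word_to_class, model):
    """Return P(word | class(word)) * P(class(word) | class(prev_word))."""
    word_counts, class_counts, class_bigrams, history_counts = model
    c_prev, c = word_to_class[prev_word], word_to_class[word]
    if class_counts[c] == 0 or history_counts[c_prev] == 0:
        return 0.0  # unseen class or history: no estimate without smoothing
    p_word_given_class = word_counts[word] / class_counts[c]
    p_class_given_prev = class_bigrams[(c_prev, c)] / history_counts[c_prev]
    return p_word_given_class * p_class_given_prev

# Hypothetical toy data; the class labels are hand-assigned for this example.
tokens = "the dog runs the cat sleeps a dog sleeps".split()
word_to_class = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
}
model = train_class_bigram(tokens, word_to_class)

# "a cat" never occurs in the toy corpus, yet it receives nonzero probability
# because the class bigram (DET, NOUN) has been observed.
p = class_bigram_prob("a", "cat", word_to_class, model)
print(round(p, 3))  # 0.333
```

The point of the example is the last call: a word-level bigram model estimated from the same nine tokens would assign zero probability to "a cat", while the class-based model shares statistics across all determiner-noun pairs. No smoothing is applied, so unseen class pairs still get probability zero.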

Published In

Computational Linguistics  Volume 18, Issue 4
December 1992
175 pages
ISSN:0891-2017
EISSN:1530-9312

Publisher

MIT Press

Cambridge, MA, United States

Qualifiers

  • Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 413
  • Downloads (Last 6 weeks): 118
Reflects downloads up to 05 Mar 2025

Citations

Cited By

  • (2025) Segmented Sequence Prediction Using Variable-Order Markov Model Ensemble. IEEE Transactions on Knowledge and Data Engineering, 37:3 (1425-1438). DOI: 10.1109/TKDE.2024.3522975. Online publication date: 1-Mar-2025
  • (2025) Review of malicious code detection in data mining applications: challenges, algorithms, and future direction. Cluster Computing, 28:3. DOI: 10.1007/s10586-024-05017-x. Online publication date: 1-Jun-2025
  • (2024) Large language models for code analysis. Proceedings of the 33rd USENIX Conference on Security Symposium (829-846). DOI: 10.5555/3698900.3698947. Online publication date: 14-Aug-2024
  • (2024) Election Interference and Online Propaganda Campaigns: Dynamic Interdependencies on Facebook, Google Trends, and the New York Times. ACM Transactions on Management Information Systems, 15:4 (1-23). DOI: 10.1145/3690828. Online publication date: 9-Sep-2024
  • (2024) NGram-Bayes, A Joint Model for Long Distance Context Dependency in Speech Recognition. Proceedings of the 2024 7th International Conference on Signal Processing and Machine Learning (258-262). DOI: 10.1145/3686490.3686528. Online publication date: 12-Jul-2024
  • (2024) A Tale of Two Comprehensions? Analyzing Student Programmer Attention during Code Summarization. ACM Transactions on Software Engineering and Methodology, 33:7 (1-37). DOI: 10.1145/3664808. Online publication date: 26-Aug-2024
  • (2024) A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology, 15:3 (1-45). DOI: 10.1145/3641289. Online publication date: 29-Mar-2024
  • (2024) Automatic real-word error correction in Persian text. Neural Computing and Applications, 36:29 (18125-18149). DOI: 10.1007/s00521-024-10045-0. Online publication date: 1-Oct-2024
  • (2023) Automated Filipino Language Treebank Generator. Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval (119-123). DOI: 10.1145/3639233.3639238. Online publication date: 15-Dec-2023
  • (2023) Semantic Template-based Convolutional Neural Network for Text Classification. ACM Transactions on Asian and Low-Resource Language Information Processing, 22:11 (1-21). DOI: 10.1145/3627820. Online publication date: 16-Oct-2023
