DOI: 10.5555/974305.974325

Detecting errors within a corpus using anomaly detection

Published: 29 April 2000

Abstract

We present a method for automatically detecting errors in a manually marked corpus using anomaly detection. Anomaly detection is a method for determining which elements of a large data set do not conform to the whole. This method fits a probability distribution over the data and applies a statistical test to detect anomalous elements. In the corpus error detection problem, anomalous elements are typically marking errors. We present the results of applying this method to the tagged portion of the Penn Treebank corpus.
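
The general idea of fitting a probability distribution and testing each element against it can be illustrated with a minimal sketch. This is not the paper's actual model or statistical test: it assumes a much simpler stand-in, a smoothed estimate of P(tag | word) with a fixed probability cutoff. The function name `flag_anomalous_tags`, the parameters `threshold` and `alpha`, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_anomalous_tags(tagged_tokens, threshold=0.05, alpha=0.1):
    """Return indices of (word, tag) pairs whose smoothed P(tag | word)
    falls below `threshold`.

    Assumption: a per-word tag distribution with additive smoothing stands
    in for the learned probability distribution; the fixed cutoff stands in
    for the statistical test.
    """
    tag_counts = defaultdict(Counter)  # word -> Counter of tags seen with it
    tagset = set()
    for word, tag in tagged_tokens:
        tag_counts[word][tag] += 1
        tagset.add(tag)

    flagged = []
    for i, (word, tag) in enumerate(tagged_tokens):
        counts = tag_counts[word]
        total = sum(counts.values())
        # Additive (Lidstone) smoothing over the observed tag set.
        prob = (counts[tag] + alpha) / (total + alpha * len(tagset))
        if prob < threshold:
            flagged.append(i)
    return flagged


# Toy corpus (invented for illustration): one suspicious "the/NN" among 50 "the/DT".
corpus = (
    [("the", "DT")] * 50
    + [("the", "NN")]
    + [("dog", "NN")] * 20
    + [("runs", "VBZ")] * 20
)
print(flag_anomalous_tags(corpus))  # → [50]
```

In this sketch a flagged index is only a candidate for manual review; rare but correct taggings would be flagged too, so the cutoff trades recall against the amount of checking required.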


Published In

NAACL 2000: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
April 2000
344 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 21 of 29 submissions, 72%

Article Metrics

  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 7
Reflects downloads up to 27 Feb 2025

Cited By

  • (2024) Active Learning for Data Quality Control: A Survey. Journal of Data and Information Quality 16:2 (1-45). DOI: 10.1145/3663369. Online publication date: 11-May-2024
  • (2013) Improving Text Classification Accuracy by Training Label Cleaning. ACM Transactions on Information Systems 31:4 (1-28). DOI: 10.1145/2516889. Online publication date: 1-Nov-2013
  • (2011) Reducing the need for double annotation. Proceedings of the 5th Linguistic Annotation Workshop (65-73). DOI: 10.5555/2018966.2018974. Online publication date: 23-Jun-2011
  • (2011) Collaborative data cleaning for sentiment classification with noisy training corpus. Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I (326-337). DOI: 10.5555/2017863.2017894. Online publication date: 24-May-2011
  • (2010) Correcting errors in a treebank based on synchronous tree substitution grammar. Proceedings of the ACL 2010 Conference Short Papers (74-79). DOI: 10.5555/1858842.1858856. Online publication date: 11-Jul-2010
  • (2007) Morphological annotation of a large spontaneous speech corpus in Japanese. Proceedings of the 20th international joint conference on Artificial intelligence (1731-1737). DOI: 10.5555/1625275.1625556. Online publication date: 6-Jan-2007
  • (2004) Correcting category errors in text classification. Proceedings of the 20th international conference on Computational Linguistics (868-es). DOI: 10.3115/1220355.1220480. Online publication date: 23-Aug-2004
  • (2003) Evaluating classifiers by means of test data with noisy labels. Proceedings of the 18th international joint conference on Artificial intelligence (513-518). DOI: 10.5555/1630659.1630735. Online publication date: 9-Aug-2003
  • (2002) Detecting errors in corpora using support vector machines. Proceedings of the 19th international conference on Computational linguistics - Volume 1 (1-7). DOI: 10.3115/1072228.1072329. Online publication date: 24-Aug-2002
