Detecting errors within a corpus using anomaly detection

Author:

Eleazar Eskin

NAACL 2000: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference

Pages 148 - 153

Published: 29 April 2000

Abstract

We present a method for automatically detecting errors in a manually marked corpus using anomaly detection. Anomaly detection is a method for determining which elements of a large data set do not conform to the whole. This method fits a probability distribution over the data and applies a statistical test to detect anomalous elements. In the corpus error detection problem, anomalous elements are typically marking errors. We present the results of applying this method to the tagged portion of the Penn Treebank corpus.
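
The abstract describes the approach only at a high level: fit a probability distribution over the data, then apply a statistical test to find elements that do not conform. The following is a minimal sketch of that general idea, not the paper's actual model or test. It assumes a simple smoothed estimate of P(tag | word) as the distribution and a fixed probability cutoff as the test; the function name `find_suspect_tags`, the parameters `threshold` and `alpha`, and the toy corpus are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def find_suspect_tags(tagged_tokens, threshold=0.05, alpha=0.1):
    """Return (index, word, tag, probability) for tokens whose tag is improbable.

    tagged_tokens: list of (word, tag) pairs.
    threshold and alpha are illustrative values, not taken from the paper.
    """
    # Fit step: count how often each word carries each tag.
    tag_counts = defaultdict(Counter)
    tagset = set()
    for word, tag in tagged_tokens:
        tag_counts[word][tag] += 1
        tagset.add(tag)

    num_tags = len(tagset)
    suspects = []
    # Test step: flag tokens whose tag has low estimated probability.
    for i, (word, tag) in enumerate(tagged_tokens):
        counts = tag_counts[word]
        total = sum(counts.values())
        # Additive smoothing over the observed tagset.
        prob = (counts[tag] + alpha) / (total + alpha * num_tags)
        if prob < threshold:
            suspects.append((i, word, tag, prob))
    return suspects


# Toy corpus: "the" is tagged DT fifty times and NN once.
corpus = [("the", "DT")] * 50 + [("dog", "NN")] * 20 + [("the", "NN")]
for index, word, tag, prob in find_suspect_tags(corpus):
    print(index, word, tag, round(prob, 4))
```

On this toy corpus only the lone ("the", "NN") token falls below the cutoff. A flagged token is a candidate for manual review rather than a confirmed error, since rare but correct taggings also receive low probability under a model this simple.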

Published In

NAACL 2000: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference

April 2000

344 pages

Program Chair:
Janyce Wiebe
New Mexico State University

Publisher

Association for Computational Linguistics

United States

Acceptance Rates

Overall Acceptance Rate 21 of 29 submissions, 72%
