research-article

Distance-Based Detection and Prediction of Outliers

Authors:

Fabrizio Angiulli,

Clara PizzutiAuthors Info & Claims

IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 2

Pages 145 - 160

https://doi.org/10.1109/TKDE.2006.29

Published: 01 February 2006 Publication History

Abstract

A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

References

[1]

F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 2, no. 17, pp. 203-215, Feb. 2005.

Digital Library

[2]

A. Arning, C. Aggarwal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 164-169, 1996.

[3]

V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley&Sons, 1994.

[4]

S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), 2003.

Digital Library

[5]

M.M. Breunig, H. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), 2000.

Digital Library

[6]

Defense Advanced Research Projects Agency DARPA, “Intrusion Detection Evaluation,”

[7]

E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, Kluwer, 2002.

[8]

T. Fawcett and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, pp. 291-316, 1997.

Digital Library

[9]

C. Feng, A. Sutherland, S. King, S. Muggleton, and R. Henery, “Comparison of Machine Learning Classifiers to Statistics and Neural Networks,” Proc. AI&Stats Conf. 93, 1993.

[10]

S. Floyd and M. Warmuth, “Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension,” Machine Learning, vol. 211, no. 3, pp. 269-304, 1995.

Digital Library

[11]

V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, 1998.

Digital Library

[12]

M.R. Garey and D.S. Johnson, Computer and Intractability. New York: W.H. Freeman and Company, 1979.

Digital Library

[13]

J. Han and M. Kamber, Data Mining, Concepts and Technique. San Francisco: Morgan Kaufmann, 2001.

Digital Library

[14]

W. Jin, A.K.H. Tung, and J. Han, “Mining Top-n Local Outliers in Large Databases,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), 2001.

Digital Library

[15]

E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. Int'l Conf. Very Large Databases (VLDB '98), pp. 392-403, 1998.

Digital Library

[16]

E. Knorr and R. Ng, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. Int'l Conf. Very Large Databases (VLDB '99), pp. 211-222, 1999.

Digital Library

[17]

E. Knorr, R. Ng, and V. Tucakov, “Distance-Based Outlier: Algorithms and Applications,” VLDB J., vol. 8, nos. 3-4, pp. 237-253, 2000.

Digital Library

[18]

A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” Proc. SIAM Int'l Conf. Data Mining (SIAM-03), 2003.

[19]

W. Lee, S.J. Stolfo, and K.W. Mok, “Mining Audit Data to Build Intrusion Detection Models,” Proc. Int'l Conf Knowledge Discovery and Data Mining (KDD-98), pp. 66-72, 1998.

[20]

L. Mangasarian and W.H. Wolberg, “Cancer Diagnosis via Linear Programming,” SIAM News, vol. 25, no. 5, pp. 1-18, 1990.

[21]

D. Peleg, G. Schechtman, and A. Wool, “Randomized Approximation of Bounded Multicovering Problems,” Algorithmica, vol. 18, no. 1, pp. 44-66, 1997.

Digital Library

[22]

F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Int'l Conf. Machine Learning (ICML '98), 1998.

Digital Library

[23]

S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), pp. 427-438, 2000.

Digital Library

[24]

B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” Proc First Int'l Conf. Knowledge Discovery and Data Mining, pp. 252-257, 1995.

[25]

L. Torgo and R. Ribeiro, “Predicting Outliers,” Proc. Int'l Conf. Principles of Data Mining and Knowledge Discovery (PKDD '03), pp. 447-458, 2003.

[26]

K. Yamanishi and J. Takeuchi, “Discovering Outlier Filtering Rules from Unlabeled Data,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 389-394, 2001.

Digital Library

Cited By

Sánchez Vinces BSchubert EZimek ACordeiro R(2025)A comparative evaluation of clustering-based outlier detectionData Mining and Knowledge Discovery10.1007/s10618-024-01086-z39:2Online publication date: 1-Mar-2025
https://dl.acm.org/doi/10.1007/s10618-024-01086-z
Li XGuan YFu BLuo Z(2025)Anomaly-aware symmetric non-negative matrix factorization for short text clusteringKnowledge and Information Systems10.1007/s10115-024-02226-z67:2(1481-1506)Online publication date: 1-Feb-2025
https://dl.acm.org/doi/10.1007/s10115-024-02226-z
Adhikari DJiang WZhan JRawat DBhattarai A(2024)Recent advances in anomaly detection in Internet of ThingsComputer Science Review10.1016/j.cosrev.2024.10066554:COnline publication date: 1-Nov-2024
https://dl.acm.org/doi/10.1016/j.cosrev.2024.100665
Show More Cited By

Index Terms

Distance-Based Detection and Prediction of Outliers
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Detection and prediction of distance-based outliers
SAC '05: Proceedings of the 2005 ACM symposium on Applied computing

In this paper we present an unsupervised distance-based outlier detection method designed to learn a model over the objects contained in a data set. The learned model, called solving set, is a small subset of the data set that is used to classify new ...
Outlier detection based on rough sets theory

An outlier in a dataset is a point or a class of points that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of outliers is important for many applications and has always attracted attention among data mining ...
Correlation-based detection of attribute outliers
DASFAA'07: Proceedings of the 12th international conference on Database systems for advanced applications

An outlier is an object that does not conform to the normal behavior of the data set. In data cleaning, outliers are identified for data noise reduction. In applications such as fraud detection, and stock market analysis, outliers suggest abnormal ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering

IEEE Transactions on Knowledge and Data Engineering Volume 18, Issue 2

February 2006

140 pages

ISSN:1041-4347

Issue’s Table of Contents

Copyright © 2006.

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 February 2006

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

45
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Sánchez Vinces BSchubert EZimek ACordeiro R(2025)A comparative evaluation of clustering-based outlier detectionData Mining and Knowledge Discovery10.1007/s10618-024-01086-z39:2Online publication date: 1-Mar-2025
https://dl.acm.org/doi/10.1007/s10618-024-01086-z
Li XGuan YFu BLuo Z(2025)Anomaly-aware symmetric non-negative matrix factorization for short text clusteringKnowledge and Information Systems10.1007/s10115-024-02226-z67:2(1481-1506)Online publication date: 1-Feb-2025
https://dl.acm.org/doi/10.1007/s10115-024-02226-z
Adhikari DJiang WZhan JRawat DBhattarai A(2024)Recent advances in anomaly detection in Internet of ThingsComputer Science Review10.1016/j.cosrev.2024.10066554:COnline publication date: 1-Nov-2024
https://dl.acm.org/doi/10.1016/j.cosrev.2024.100665
He S(2024)An endogenous intelligent architecture for wireless communication networksWireless Networks10.1007/s11276-023-03545-930:2(1069-1084)Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1007/s11276-023-03545-9
Angiulli FFassetti FFerragina L(2024)Enhancing anomaly detectors with LatentOutJournal of Intelligent Information Systems10.1007/s10844-023-00829-662:4(905-923)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1007/s10844-023-00829-6
Xu HLiang ZHuang KHuang GHe Y(2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
https://dl.acm.org/doi/10.4018/IJGHPC.334125
Wang YChen CLai JFu LZhou YZheng Z(2023)A Self-Representation Method with Local Similarity Preserving for Fast Multi-View Outlier DetectionACM Transactions on Knowledge Discovery from Data10.1145/353219117:1(1-20)Online publication date: 15-Mar-2023
https://dl.acm.org/doi/10.1145/3532191
Zhu PZhang CLi XZhang JQin X(2023)A High-Dimensional Outlier Detection Approach Based on Local Coulomb ForceIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.317216735:6(5506-5520)Online publication date: 1-Jun-2023
https://dl.acm.org/doi/10.1109/TKDE.2022.3172167
Ukey NZhang GYang ZLi BLi WZhang W(2023)Efficient continuous kNN join over dynamic high-dimensional dataWorld Wide Web10.1007/s11280-023-01204-926:6(3759-3794)Online publication date: 11-Sep-2023
https://dl.acm.org/doi/10.1007/s11280-023-01204-9
Zhang SXie L(2023)Penalized Least Squares Classifier: Classification by Regression Via Iterative Cost-Sensitive LearningNeural Processing Letters10.1007/s11063-023-11178-455:7(8809-8828)Online publication date: 1-Dec-2023
https://dl.acm.org/doi/10.1007/s11063-023-11178-4
Show More Cited By

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents