Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Distance-Based Detection and Prediction of Outliers

Published: 01 February 2006 Publication History

Abstract

A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

References

[1]
F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 2, no. 17, pp. 203-215, Feb. 2005.
[2]
A. Arning, C. Aggarwal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 164-169, 1996.
[3]
V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley&Sons, 1994.
[4]
S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), 2003.
[5]
M.M. Breunig, H. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), 2000.
[6]
Defense Advanced Research Projects Agency DARPA, “Intrusion Detection Evaluation,”
[7]
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, Kluwer, 2002.
[8]
T. Fawcett and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, pp. 291-316, 1997.
[9]
C. Feng, A. Sutherland, S. King, S. Muggleton, and R. Henery, “Comparison of Machine Learning Classifiers to Statistics and Neural Networks,” Proc. AI&Stats Conf. 93, 1993.
[10]
S. Floyd and M. Warmuth, “Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension,” Machine Learning, vol. 211, no. 3, pp. 269-304, 1995.
[11]
V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, 1998.
[12]
M.R. Garey and D.S. Johnson, Computer and Intractability. New York: W.H. Freeman and Company, 1979.
[13]
J. Han and M. Kamber, Data Mining, Concepts and Technique. San Francisco: Morgan Kaufmann, 2001.
[14]
W. Jin, A.K.H. Tung, and J. Han, “Mining Top-n Local Outliers in Large Databases,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), 2001.
[15]
E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. Int'l Conf. Very Large Databases (VLDB '98), pp. 392-403, 1998.
[16]
E. Knorr and R. Ng, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. Int'l Conf. Very Large Databases (VLDB '99), pp. 211-222, 1999.
[17]
E. Knorr, R. Ng, and V. Tucakov, “Distance-Based Outlier: Algorithms and Applications,” VLDB J., vol. 8, nos. 3-4, pp. 237-253, 2000.
[18]
A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” Proc. SIAM Int'l Conf. Data Mining (SIAM-03), 2003.
[19]
W. Lee, S.J. Stolfo, and K.W. Mok, “Mining Audit Data to Build Intrusion Detection Models,” Proc. Int'l Conf Knowledge Discovery and Data Mining (KDD-98), pp. 66-72, 1998.
[20]
L. Mangasarian and W.H. Wolberg, “Cancer Diagnosis via Linear Programming,” SIAM News, vol. 25, no. 5, pp. 1-18, 1990.
[21]
D. Peleg, G. Schechtman, and A. Wool, “Randomized Approximation of Bounded Multicovering Problems,” Algorithmica, vol. 18, no. 1, pp. 44-66, 1997.
[22]
F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Int'l Conf. Machine Learning (ICML '98), 1998.
[23]
S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), pp. 427-438, 2000.
[24]
B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” Proc First Int'l Conf. Knowledge Discovery and Data Mining, pp. 252-257, 1995.
[25]
L. Torgo and R. Ribeiro, “Predicting Outliers,” Proc. Int'l Conf. Principles of Data Mining and Knowledge Discovery (PKDD '03), pp. 447-458, 2003.
[26]
K. Yamanishi and J. Takeuchi, “Discovering Outlier Filtering Rules from Unlabeled Data,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 389-394, 2001.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering  Volume 18, Issue 2
February 2006
140 pages

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 February 2006

Author Tags

  1. Index Terms- Distance-based outliers
  2. data mining.
  3. outlier detection
  4. outlier prediction

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)A comparative evaluation of clustering-based outlier detectionData Mining and Knowledge Discovery10.1007/s10618-024-01086-z39:2Online publication date: 1-Mar-2025
  • (2025)Anomaly-aware symmetric non-negative matrix factorization for short text clusteringKnowledge and Information Systems10.1007/s10115-024-02226-z67:2(1481-1506)Online publication date: 1-Feb-2025
  • (2024)Recent advances in anomaly detection in Internet of ThingsComputer Science Review10.1016/j.cosrev.2024.10066554:COnline publication date: 1-Nov-2024
  • (2024)An endogenous intelligent architecture for wireless communication networksWireless Networks10.1007/s11276-023-03545-930:2(1069-1084)Online publication date: 1-Feb-2024
  • (2024)Enhancing anomaly detectors with LatentOutJournal of Intelligent Information Systems10.1007/s10844-023-00829-662:4(905-923)Online publication date: 1-Aug-2024
  • (2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
  • (2023)A Self-Representation Method with Local Similarity Preserving for Fast Multi-View Outlier DetectionACM Transactions on Knowledge Discovery from Data10.1145/353219117:1(1-20)Online publication date: 15-Mar-2023
  • (2023)A High-Dimensional Outlier Detection Approach Based on Local Coulomb ForceIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.317216735:6(5506-5520)Online publication date: 1-Jun-2023
  • (2023)Efficient continuous kNN join over dynamic high-dimensional dataWorld Wide Web10.1007/s11280-023-01204-926:6(3759-3794)Online publication date: 11-Sep-2023
  • (2023)Penalized Least Squares Classifier: Classification by Regression Via Iterative Cost-Sensitive LearningNeural Processing Letters10.1007/s11063-023-11178-455:7(8809-8828)Online publication date: 1-Dec-2023
  • Show More Cited By

View Options

View options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media