Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Distance-Based Detection and Prediction of Outliers

Published: 01 February 2006 Publication History

Abstract

A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

References

[1]
F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 2, no. 17, pp. 203-215, Feb. 2005.
[2]
A. Arning, C. Aggarwal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 164-169, 1996.
[3]
V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley&Sons, 1994.
[4]
S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), 2003.
[5]
M.M. Breunig, H. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), 2000.
[6]
Defense Advanced Research Projects Agency DARPA, “Intrusion Detection Evaluation,”
[7]
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, Kluwer, 2002.
[8]
T. Fawcett and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, pp. 291-316, 1997.
[9]
C. Feng, A. Sutherland, S. King, S. Muggleton, and R. Henery, “Comparison of Machine Learning Classifiers to Statistics and Neural Networks,” Proc. AI&Stats Conf. 93, 1993.
[10]
S. Floyd and M. Warmuth, “Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension,” Machine Learning, vol. 211, no. 3, pp. 269-304, 1995.
[11]
V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, 1998.
[12]
M.R. Garey and D.S. Johnson, Computer and Intractability. New York: W.H. Freeman and Company, 1979.
[13]
J. Han and M. Kamber, Data Mining, Concepts and Technique. San Francisco: Morgan Kaufmann, 2001.
[14]
W. Jin, A.K.H. Tung, and J. Han, “Mining Top-n Local Outliers in Large Databases,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), 2001.
[15]
E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. Int'l Conf. Very Large Databases (VLDB '98), pp. 392-403, 1998.
[16]
E. Knorr and R. Ng, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. Int'l Conf. Very Large Databases (VLDB '99), pp. 211-222, 1999.
[17]
E. Knorr, R. Ng, and V. Tucakov, “Distance-Based Outlier: Algorithms and Applications,” VLDB J., vol. 8, nos. 3-4, pp. 237-253, 2000.
[18]
A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” Proc. SIAM Int'l Conf. Data Mining (SIAM-03), 2003.
[19]
W. Lee, S.J. Stolfo, and K.W. Mok, “Mining Audit Data to Build Intrusion Detection Models,” Proc. Int'l Conf Knowledge Discovery and Data Mining (KDD-98), pp. 66-72, 1998.
[20]
L. Mangasarian and W.H. Wolberg, “Cancer Diagnosis via Linear Programming,” SIAM News, vol. 25, no. 5, pp. 1-18, 1990.
[21]
D. Peleg, G. Schechtman, and A. Wool, “Randomized Approximation of Bounded Multicovering Problems,” Algorithmica, vol. 18, no. 1, pp. 44-66, 1997.
[22]
F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Int'l Conf. Machine Learning (ICML '98), 1998.
[23]
S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), pp. 427-438, 2000.
[24]
B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” Proc First Int'l Conf. Knowledge Discovery and Data Mining, pp. 252-257, 1995.
[25]
L. Torgo and R. Ribeiro, “Predicting Outliers,” Proc. Int'l Conf. Principles of Data Mining and Knowledge Discovery (PKDD '03), pp. 447-458, 2003.
[26]
K. Yamanishi and J. Takeuchi, “Discovering Outlier Filtering Rules from Unlabeled Data,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 389-394, 2001.

Cited By

View all

Recommendations

Reviews

Susan Bridges

Outlier detection is an important area of data analysis with many applications, including intrusion and fraud detection, medical diagnosis, drug discovery, and image analysis. Naïve implementations of distance-based outlier detection methods have quadratic time complexity, and are therefore not practical for very large data sets. Angiulli, Basta, and Pizzuti address this problem by defining and developing algorithms for finding a solving set, which is a subset of the data points that is sufficient for efficient and accurate outlier detection in large data sets. In addition, they demonstrate how their algorithm can be adapted for the outlier prediction problem: given a set of points D and a stream of query points, efficiently identify the query points that are outliers with respect to D . The authors prove that finding the minimal solving set is nondeterministic polynomial time (NP) complete, and also demonstrate that their approximation algorithms are efficient, scalable, and accurate with both synthetic and real data sets. This paper is very well written: the definitions, formal statements of the properties of solving sets, proofs, and algorithms are clear and complete. The new algorithms provide an elegant solution to an important problem. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering  Volume 18, Issue 2
February 2006
140 pages

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 February 2006

Author Tags

  1. Index Terms- Distance-based outliers
  2. data mining.
  3. outlier detection
  4. outlier prediction

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)An endogenous intelligent architecture for wireless communication networksWireless Networks10.1007/s11276-023-03545-930:2(1069-1084)Online publication date: 1-Feb-2024
  • (2024)Enhancing anomaly detectors with LatentOutJournal of Intelligent Information Systems10.1007/s10844-023-00829-662:4(905-923)Online publication date: 1-Aug-2024
  • (2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
  • (2023)A Self-Representation Method with Local Similarity Preserving for Fast Multi-View Outlier DetectionACM Transactions on Knowledge Discovery from Data10.1145/353219117:1(1-20)Online publication date: 15-Mar-2023
  • (2023)A High-Dimensional Outlier Detection Approach Based on Local Coulomb ForceIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.317216735:6(5506-5520)Online publication date: 1-Jun-2023
  • (2023)Efficient continuous kNN join over dynamic high-dimensional dataWorld Wide Web10.1007/s11280-023-01204-926:6(3759-3794)Online publication date: 11-Sep-2023
  • (2023)Penalized Least Squares Classifier: Classification by Regression Via Iterative Cost-Sensitive LearningNeural Processing Letters10.1007/s11063-023-11178-455:7(8809-8828)Online publication date: 1-Dec-2023
  • (2022): an unsupervised deep anomaly detection approach exploiting latent space distributionMachine Language10.1007/s10994-022-06153-4112:11(4323-4349)Online publication date: 24-May-2022
  • (2022)A density estimation approach for detecting and explaining exceptional values in categorical dataApplied Intelligence10.1007/s10489-022-03271-352:15(17534-17556)Online publication date: 1-Dec-2022
  • (2022)Cooperative Deep Unsupervised Anomaly DetectionDiscovery Science10.1007/978-3-031-18840-4_23(318-328)Online publication date: 10-Oct-2022
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media