research-article

Generating Artificial Outliers in the Absence of Genuine Ones — A Survey

Authors:

Georg Steinbuss,

Klemens Böhm

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 15, Issue 2

Article No.: 30, Pages 1 - 37

https://doi.org/10.1145/3447822

Published: 27 March 2021

Abstract

By definition, outliers are rarely observed in reality, making them difficult to detect or analyze. Artificial outliers approximate such genuine outliers and can, for instance, help with the detection of genuine outliers or with benchmarking outlier-detection algorithms. The literature features different approaches to generating artificial outliers. However, a systematic comparison of these approaches remains absent. This article surveys and compares these approaches. We start by clarifying the terminology in the field, which varies from publication to publication, and we propose a general problem formulation. Our description of how generating outliers connects to other research fields, such as experimental design or generative models, frames the field of artificial outliers. Along with offering a concise description, we group the approaches by their general concepts and by how they make use of genuine instances. An extensive experimental study reveals the differences between the generation approaches when they are ultimately used for outlier detection. This survey shows that the existing approaches already cover a wide range of concepts underlying the generation, but also that the field still has potential for further development. Our experimental study confirms the expectation that the quality of the generation approaches varies widely, for example, depending on the dataset they are used on. Ultimately, to guide the choice of the generation approach in a specific context, we propose an appropriate general decision process. In summary, this survey comprises, describes, and connects all relevant work regarding the generation of artificial outliers and may serve as a basis to guide further research in the field.
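
As a rough illustration of what generating artificial outliers from genuine instances can look like, the following minimal sketch samples points uniformly from the bounding box of the genuine data, slightly widened, and labels them as outliers. This is only one naive possibility and is not taken from the article; the helper name `uniform_artificial_outliers` and the parameters `margin` and `seed` are assumptions made for this example. Some sampled points may fall among the genuine instances, which such a simple scheme does not prevent.

```python
import random


def uniform_artificial_outliers(genuine, n, margin=0.1, seed=0):
    """Sample n points uniformly from the bounding box of `genuine`.

    Hypothetical helper: the box is widened on each side by `margin`
    times the range of the respective attribute.
    """
    rng = random.Random(seed)
    bounds = []
    for values in zip(*genuine):  # one tuple of values per attribute
        lo, hi = min(values), max(values)
        pad = margin * (hi - lo)
        bounds.append((lo - pad, hi + pad))
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]


# Toy genuine instances with two attributes (made-up values).
genuine = [(0.0, 1.0), (2.0, 3.0), (1.0, 2.0)]
artificial = uniform_artificial_outliers(genuine, n=5)

# Label genuine instances 0 and artificial outliers 1, e.g. to train a classifier.
labeled = [(x, 0) for x in genuine] + [(x, 1) for x in artificial]
```

The labeled set could then be passed to any binary classifier, so that a one-class problem is treated as a two-class one.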

Index Terms

Generating Artificial Outliers in the Absence of Genuine Ones — A Survey
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
      2. Unsupervised learning
        Anomaly detection

Information & Contributors

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 15, Issue 2

Survey Paper and Regular Papers

April 2021

524 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3446665

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 March 2021

Accepted: 01 October 2020

Revised: 01 August 2020

Received: 01 July 2018

Published in TKDD Volume 15, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Deutsche Forschungsgemeinschaft
