research-article

Generating Artificial Outliers in the Absence of Genuine Ones — A Survey

Published: 27 March 2021

Abstract

By definition, outliers are rarely observed in reality, making them difficult to detect or analyze. Artificial outliers approximate such genuine outliers and can, for instance, help with the detection of genuine outliers or with benchmarking outlier-detection algorithms. The literature features different approaches to generate artificial outliers. However, a systematic comparison of these approaches is still missing. This article surveys and compares these approaches. We start by clarifying the terminology in the field, which varies from publication to publication, and we propose a general problem formulation. We frame the field of artificial outliers by describing how generating outliers connects to other research fields, such as experimental design and generative models. Along with offering a concise description, we group the approaches by their general concepts and by how they make use of genuine instances. An extensive experimental study reveals the differences between the generation approaches when they are ultimately used for outlier detection. This survey shows that the existing approaches already cover a wide range of concepts underlying the generation, but also that the field still has potential for further development. Our experimental study confirms the expectation that the quality of the generation approaches varies widely, for example, depending on the dataset they are used on. Finally, to guide the choice of the generation approach in a specific context, we propose a general decision process. In summary, this survey collects, describes, and connects all relevant work regarding the generation of artificial outliers and may serve as a basis to guide further research in the field.

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 2
Survey Paper and Regular Papers
April 2021
524 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3446665

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 March 2021
Accepted: 01 October 2020
Revised: 01 August 2020
Received: 01 July 2018
Published in TKDD Volume 15, Issue 2

Author Tags

  1. Artificial outlier
  2. anomalies
  3. artificial data
  4. outlier detection

Qualifiers

  • Research-article
  • Research
  • Refereed
