DOI: 10.1145/3269206.3271721

Exploring a High-quality Outlying Feature Value Set for Noise-Resilient Outlier Detection in Categorical Data

Published: 17 October 2018

Abstract

Unavoidable noise in real-world categorical data poses significant challenges to existing outlier detection methods, which normally fail to separate noisy values from outlying values. Feature subspace-based methods inevitably mix in noisy values when they retain an entire feature, because a single feature may contain both outlying and noisy values. Pattern-based methods are normally based on frequency and are easily misled by noisy values, which results in many faulty patterns. This paper introduces a novel unsupervised framework termed OUVAS, and its parameter-free instantiation RHAC, to explore a high-quality outlying value set for detecting outliers in noisy categorical data. Based on the observation that the relations between values reflect their essence, OUVAS investigates value similarities to cluster values into different groups and combines cluster-level analysis with value-level refinement to identify an outlying value set. RHAC instantiates OUVAS with three successive modules: the combination of the Ochiai coefficient and the Louvain algorithm to cluster values, hierarchical value coupling learning to perform cluster-level analysis, and a threshold to separate fake from real outlying values in value-level refinement. We show that (i) an RHAC-based outlier detector significantly outperforms five state-of-the-art outlier detection methods; and (ii) an extended RHAC-based feature selection method successfully improves the performance of existing outlier detectors and performs better than the two latest outlying feature selection methods.
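As a rough illustration of the value-similarity step, the sketch below computes the standard Ochiai coefficient, |A ∩ B| / sqrt(|A| · |B|), between feature values, where A and B are the sets of objects that contain each value. The abstract does not give the exact definition used in RHAC, so the function names (`value_occurrences`, `ochiai`, `value_similarity_graph`), the toy data, and the choice to drop zero-weight edges are all assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def value_occurrences(rows):
    """Map each (feature_index, value) pair to the set of row indices containing it."""
    occ = {}
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            occ.setdefault((j, v), set()).add(i)
    return occ


def ochiai(a, b):
    """Standard Ochiai coefficient of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def value_similarity_graph(rows):
    """Return {(value_u, value_v): weight} for value pairs that co-occur at least once.

    Hypothetical helper: dropping zero-weight pairs is an illustrative choice,
    not something stated in the abstract.
    """
    occ = value_occurrences(rows)
    edges = {}
    for u, v in combinations(sorted(occ), 2):
        w = ochiai(occ[u], occ[v])
        if w > 0:
            edges[(u, v)] = w
    return edges


# Toy categorical data set (assumed): two features, four objects.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]
graph = value_similarity_graph(rows)
for (u, v), w in sorted(graph.items()):
    print(u, v, round(w, 3))
```

A weighted graph of this kind is the sort of input a community detection routine such as the Louvain algorithm expects; the clustering step itself is omitted here to keep the sketch dependency-free.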

Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
October 2018
2362 pages
ISBN: 9781450360142
DOI: 10.1145/3269206
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. categorical data
  2. feature selection
  3. outlier detection

Qualifiers

  • Research-article

Conference

CIKM '18
Acceptance Rates

CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 22 Nov 2024

Cited By

  • (2025) MEFDPN: Mixture exponential family distribution posterior networks for evaluating data uncertainty. Expert Systems with Applications 262 (125593). DOI: 10.1016/j.eswa.2024.125593. Online publication date: Mar-2025
  • (2024) Automated anomaly detection for categorical data by repurposing a form filling recommender system. Journal of Data and Information Quality 16:3 (1-28). DOI: 10.1145/3696110. Online publication date: 16-Sep-2024
  • (2023) Supervised Anomaly Detection via Conditional Generative Adversarial Network and Ensemble Active Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:6 (7781-7798). DOI: 10.1109/TPAMI.2022.3225476. Online publication date: 1-Jun-2023
  • (2023) Deep Optimal Isolation Forest with Genetic Algorithm for Anomaly Detection. 2023 IEEE International Conference on Data Mining (ICDM) (678-687). DOI: 10.1109/ICDM58522.2023.00077. Online publication date: 1-Dec-2023
  • (2023) Feature selection considering interaction, redundancy and complementarity for outlier detection in categorical data. Knowledge-Based Systems 275 (110678). DOI: 10.1016/j.knosys.2023.110678. Online publication date: Sep-2023
  • (2022) Weighted Kernel based Prediction and Detection of Outliers. 2022 3rd International Conference on Smart Electronics and Communication (ICOSEC) (219-224). DOI: 10.1109/ICOSEC54921.2022.9952112. Online publication date: 20-Oct-2022
  • (2022) An Anomaly Detection Method for Multiple Time Series Based on Similarity Measurement and Louvain Algorithm. Procedia Computer Science 200 (1857-1866). DOI: 10.1016/j.procs.2022.01.386. Online publication date: 2022
  • (2022) Edge computing empowered anomaly detection framework with dynamic insertion and deletion schemes on data streams. World Wide Web 25:5 (2163-2183). DOI: 10.1007/s11280-022-01052-z. Online publication date: 1-Sep-2022
  • (2022) A density estimation approach for detecting and explaining exceptional values in categorical data. Applied Intelligence 52:15 (17534-17556). DOI: 10.1007/s10489-022-03271-3. Online publication date: 1-Dec-2022
  • (2020) OPHiForest: Order Preserving Hashing Based Isolation Forest for Robust and Scalable Anomaly Detection. Proceedings of the 29th ACM International Conference on Information & Knowledge Management (1655-1664). DOI: 10.1145/3340531.3411988. Online publication date: 19-Oct-2020
