research-article

Catalysis Clustering with GAN by Incorporating Domain Knowledge

Published: 20 August 2020

Abstract

Clustering is an important unsupervised learning method, but it faces serious challenges when data is sparse and high-dimensional. Generated clusters are often evaluated with general-purpose measures, which may not be meaningful or useful for practical applications and domains. Using a distance metric, a clustering algorithm searches the data space, groups nearby items into one cluster, and assigns distant samples to different clusters. In many real-world applications, the number of dimensions is high and the data space becomes very sparse. Selecting a suitable distance metric is difficult, and becomes even harder when categorical data is involved. Moreover, existing distance metrics are mostly generic, and clusters built on them do not necessarily make sense for domain-specific applications. One way to address these challenges is to integrate domain-defined rules and guidelines into the clustering process. In this work we propose a GAN-based approach, called Catalysis Clustering, that incorporates domain knowledge into the clustering process. With GANs we generate catalysts: special synthetic points drawn from the original data distribution and verified to improve clustering quality as measured by a domain-specific metric. We then perform clustering analysis on both catalysts and real data, and produce the final clusters after removing the catalyst points. Experiments on two challenging real-world datasets show that our approach is effective and can generate clusters that are meaningful and useful for real-world applications.
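
The catalyst workflow described in the abstract (generate synthetic points, keep those that improve a domain-specific metric, cluster real data together with them, then remove them) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the candidate points are jittered copies of real points standing in for GAN samples, the clustering algorithm is a plain k-means, the greedy accept-if-better rule is one possible way to "verify" a catalyst, and `outcome_gap` is a hypothetical domain metric.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means with squared Euclidean distance; one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def outcome_gap(labels: List[int], outcomes: List[float]) -> float:
    """Hypothetical domain metric: gap between the largest and smallest cluster-mean outcome."""
    groups = {}
    for lab, y in zip(labels, outcomes):
        groups.setdefault(lab, []).append(y)
    means = [sum(v) / len(v) for v in groups.values()]
    return max(means) - min(means)


def catalysis_clustering(
    real: List[Point],
    candidates: List[Point],
    k: int,
    domain_score: Callable[[List[int]], float],
    seed: int = 0,
) -> Tuple[List[int], List[Point], float]:
    """Greedily keep candidates that raise domain_score computed on the real points only."""
    n = len(real)
    catalysts: List[Point] = []
    best_labels = kmeans(list(real), k, seed=seed)  # baseline without catalysts
    best = domain_score(best_labels)
    for cand in candidates:
        trial = catalysts + [cand]
        # Cluster real data plus trial catalysts, then drop the catalyst labels.
        labels = kmeans(list(real) + trial, k, seed=seed)[:n]
        score = domain_score(labels)
        if score > best:  # assumed acceptance rule: keep only strict improvements
            catalysts, best, best_labels = trial, score, labels
    return best_labels, catalysts, best


# Toy usage with made-up data; the outcome is loosely tied to the first feature.
rng = random.Random(42)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
outcomes = [p[0] + rng.gauss(0, 0.5) for p in real]
# Stand-in for GAN samples (NOT a GAN): jittered copies of randomly chosen real points.
candidates = [[x + rng.gauss(0, 0.3) for x in rng.choice(real)] for _ in range(15)]
labels, catalysts, score = catalysis_clustering(
    real, candidates, 2, lambda labs: outcome_gap(labs, outcomes)
)
print(len(catalysts), round(score, 3))
```

In this sketch the domain metric is only ever evaluated on the labels of the real points, so the returned clustering never contains a catalyst, and by construction its score is at least that of the catalyst-free baseline.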




Information & Contributors

Information

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN:9781450379984
DOI:10.1145/3394486

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GAN
  2. cancer subtyping
  3. clustering evaluation
  4. domain-informed clustering

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation

Conference

KDD '20

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 24
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 19 Nov 2024

Citations

Cited By

  • (2024) Generative model-assisted sample selection for interest-driven visual analytics. Visual Informatics. DOI: 10.1016/j.visinf.2024.10.004. Online publication date: Oct-2024
  • (2023) Using Catalyst Mass-Based Clustering Analysis to Identify Adverse Events during Approach. Aerospace 10:5 (483). DOI: 10.3390/aerospace10050483. Online publication date: 19-May-2023
  • (2023) Self-Supervised Augmentation of Quality Data Based on Classification-Reinforced GAN. 2023 17th International Conference on Ubiquitous Information Management and Communication (IMCOM) (1-7). DOI: 10.1109/IMCOM56909.2023.10035575. Online publication date: 3-Jan-2023
  • (2023) DeGTeC. Future Generation Computer Systems 141:C (81-95). DOI: 10.1016/j.future.2022.11.014. Online publication date: 15-Feb-2023
  • (2022) Prognostic mutational subtyping in de novo diffuse large B-cell lymphoma. BMC Cancer 22:1. DOI: 10.1186/s12885-022-09237-5. Online publication date: 3-Mar-2022
  • (2022) Hausdorff GAN: Improving GAN Generation Quality With Hausdorff Metric. IEEE Transactions on Cybernetics 52:10 (10407-10419). DOI: 10.1109/TCYB.2021.3062396. Online publication date: Oct-2022
  • (2021) NIA-Network: Towards improving lung CT infection detection for COVID-19 diagnosis. Artificial Intelligence in Medicine 117 (102082). DOI: 10.1016/j.artmed.2021.102082. Online publication date: Jul-2021
