Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1353343.1353398acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article
Free access

HISSCLU: a hierarchical density-based method for semi-supervised clustering

Published: 25 March 2008 Publication History

Abstract

In situations where class labels are known for a part of the objects, a cluster analysis respecting this information, i.e. semi-supervised clustering, can give insight into the class and cluster structure of a data set. Several semi-supervised clustering algorithms such as HMRF-K-Means [4], COP-K-Means [26] and the CCL-algorithm [18] have recently been proposed. Most of them extend well-known clustering methods (K-Means [22], Complete Link [17] by enforcing two types of constraints: must-links between objects of the same class and cannot-links between objects of different classes. In this paper, we propose HISSCLU, a hierarchical, density-based method for semi-supervised clustering. Instead of deriving explicit constraints from the labeled objects, HISSCLU expands the clusters starting at all labeled objects simultaneously. During the expansion, class labels are assigned to the unlabeled objects most consistently with the cluster structure. Using this information the hierarchical cluster structure is determined. The result is visualized in a semi-supervised cluster diagram showing both cluster structure as well as class assignment. Compared to methods based on must-links and cannot-links, our method allows a better preservation of the actual cluster structure, particularly if the data set contains several distinct clusters of the same class (i.e. the intra-class data distribution is multimodal). HISSCLU has a determinate result, is efficient and robust against noise. The performance of our algorithm is shown in an extensive experimental evaluation on synthetic and real-world data sets.

References

[1]
"WEKA machine learning package, http://www.cs.waikato.ac.nz/ml/weka". Universitity of Waikato.
[2]
M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. "OPTICS: Ordering Points to Identify the Clustering Structure". In SIGMOD Conference, 1999.
[3]
C. Baumgartner, C. Böhm, D. Baumgartner, G. Marini, K. Weinberger, B. Olgemöller, B. Liebl, and A. Roscher. "Supervised machine learning techniques for the classification of metabolic disorders in newborns". In Bioinformatics, pages 20(17):2985--2996, 2004.
[4]
M. Bilenko, S. Basu, and R. J. Mooney. "A Probabilistic Framework for Semi-Supervised Clustering". In KDD Conference, 2004.
[5]
M. Bilenko, S. Basu, and R. J. Mooney. "Integrating Constraints and Metric Learning in Semi-Supervised Clustering". In ICML Conference, 2004.
[6]
C. L. Blake and C. J. Merz. "UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html".
[7]
N. Cesa-Bianchi, C. Gentile, A. Tironi, and L. Zaniboni. Incremental algorithms for hierarchical classification. In NIPS, 2004.
[8]
N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Hierarchical classification: combining bayes with svm. In ICML, pages 177--184, 2006.
[9]
B.-R. Dai, C.-R. Lin, and M.-S. Chen. On the techniques for data clustering with numerical constraints. In SDM Conference, 2003.
[10]
O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In ICML, 2004.
[11]
B. E. Dom. "An Information-Theoretic External Cluster-Validity Measure". In Research Report RJ 10219, IBM, 2001.
[12]
L. Dong, E. Frank, and S. Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In Proc. of PKDD Conference, pages 84--95, 2005.
[13]
C. Eick, N. Zeidat, and Z. Zhao. "Supervised Clustering - Algorithms and Benefits". In Proc. of the International Conference on Tools with AI, 2004.
[14]
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". In KDD Conference, 1996.
[15]
J. Fürnkranz. "Round Robin Classification". Journal of Machine Learning Research, (2):721--747, 2002.
[16]
J. Gracia-Bustos, J. Heitman, and M. Hall. "Nuclear protein localization". In Biochimica et Biophysica Acta, pages 1071:83--101, 1991.
[17]
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[18]
D. Klein, D. Kamvar, and C. Manning. "From Instance-Level Constraints to Space-Level Constraints: Making Most of Prior Knowledge in Data Clustering.". In ICML Conference, 2002.
[19]
H. Li and M. Niranjan. "Outlier Detection in Benchmark Classification Tasks". In Proc. of International Conference on Accoustics, Speech ans Signal Processing, pages 557--560, 2006.
[20]
H. Liu and S. teng Huang. Evolutionary semi-supervised fuzzy clustering. Pattern Recognition Letters, 24(16):3105--3113, 2003.
[21]
Z. Lu and T. Leen. "Semi-supervised Learning with Penalized Probabilistic Clustering". In NIPS 17, pages 849--856, 2005.
[22]
J. MacQueen. "Some Methods for Classification and Analysis of Multivariate Observations". In 5th Berkeley Symp. Math. Statist. Prob., 1967.
[23]
K. Nakai and M. Kanehisa. "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells". Genomics, 14(897):897--911, 1991.
[24]
C. Plant, C. Böhm, C. Baumgartner, and B. Tilg. "Enhancing instance-based classification with local density.". In Bioinformatics, pages 22:981--988, 2006.
[25]
J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. "Automatic Extraction of Clusters from Hierarchical Clustering Representations.". In PAKDD, 2003.
[26]
K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedel. "Constrained K-Means Clustering with Background Knowledge". In ICML Conference, 2001.
[27]
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. technical report., 2002.
[28]
A. Zimek. Hierarchical classification using ensembles of nested dichotomies. Master's thesis, TU/LMU Munich, 2005.

Cited By

View all
  • (2024)Metricizing the Euclidean Space Toward Desired Distance Relations in Point CloudsIEEE Transactions on Information Forensics and Security10.1109/TIFS.2024.342024619(7304-7319)Online publication date: 1-Jan-2024
  • (2023)Constrained Density Peak ClusteringInternational Journal of Data Warehousing and Mining10.4018/IJDWM.32877619:1(1-19)Online publication date: 25-Aug-2023
  • (2023)A review on semi-supervised clusteringInformation Sciences10.1016/j.ins.2023.02.088632(164-200)Online publication date: Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
EDBT '08: Proceedings of the 11th international conference on Extending database technology: Advances in database technology
March 2008
762 pages
ISBN:9781595939265
DOI:10.1145/1353343
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2008

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

EDBT '08

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)103
  • Downloads (Last 6 weeks)15
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Metricizing the Euclidean Space Toward Desired Distance Relations in Point CloudsIEEE Transactions on Information Forensics and Security10.1109/TIFS.2024.342024619(7304-7319)Online publication date: 1-Jan-2024
  • (2023)Constrained Density Peak ClusteringInternational Journal of Data Warehousing and Mining10.4018/IJDWM.32877619:1(1-19)Online publication date: 25-Aug-2023
  • (2023)A review on semi-supervised clusteringInformation Sciences10.1016/j.ins.2023.02.088632(164-200)Online publication date: Jun-2023
  • (2021)Integrating Prior Knowledge in Mixed-Initiative Social Network ClusteringIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2020.303034727:2(1775-1785)Online publication date: Feb-2021
  • (2021)A hyperbolic routing scheme for information-centric internet of things with edge computingWireless Networks10.1007/s11276-021-02751-7Online publication date: 15-Aug-2021
  • (2019)An efficient density-based clustering with side information and active learning: A case study for facial expression recognition taskIntelligent Data Analysis10.3233/IDA-17378123:1(227-240)Online publication date: 20-Feb-2019
  • (2019)A unified view of density-based methods for semi-supervised clustering and classificationData Mining and Knowledge Discovery10.1007/s10618-019-00651-1Online publication date: 23-Aug-2019
  • (2019)Density‐based clusteringWIREs Data Mining and Knowledge Discovery10.1002/widm.134310:2Online publication date: 29-Oct-2019
  • (2018)An efficient semi-supervised graph based clusteringIntelligent Data Analysis10.3233/IDA-16329622:2(297-307)Online publication date: 14-Mar-2018
  • (2018)A unified framework of density-based clustering for semi-supervised classificationProceedings of the 30th International Conference on Scientific and Statistical Database Management10.1145/3221269.3223037(1-12)Online publication date: 9-Jul-2018
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media