Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

Qingying Yu^1,2,
Yonglong Luo^1,2,
Chuanming Chen² &
…
Xintao Ding²

1646 Accesses
Explore all metrics

Abstract

Individual privacy may be compromised during the process of mining for valuable information, and the potential for data mining is hindered by the need to preserve privacy. It is well known that k-means clustering algorithms based on differential privacy require preserving privacy while maintaining the availability of clustering. However, it is difficult to balance both aspects in traditional algorithms. In this paper, an outlier-eliminated differential privacy (OEDP) k-means algorithm is proposed that both preserves privacy and improves clustering efficiency. The proposed approach selects the initial centre points in accordance with the distribution density of data points, and adds Laplacian noise to the original data for privacy preservation. Both a theoretical analysis and comparative experiments were conducted. The theoretical analysis shows that the proposed algorithm satisfies ε-differential privacy. Furthermore, the experimental results show that, compared to other methods, the proposed algorithm effectively preserves data privacy and improves the clustering results in terms of accuracy, stability, and availability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Differential Privacy Protection Algorithm for Data Clustering Center

A High-Availability K-modes Clustering Method Based on Differential Privacy

Privacy Aware K-Means Clustering with High Utility

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Acs G., Castelluccia C., Chen R. (2012) Differentially private histogram publishing through lossy compression. In: proceedings of IEEE 12th International Conference on Data Mining, ICDM, pp 1–10
Agrawal R, Srikant R (2000) Privacy-preserving data mining. ACM Sigmod Record 29(2):439–450
Article Google Scholar
Angiulli F, Fassetti F (2009) DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data(TKDD) 3(1):4:1–57
Google Scholar
Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: proceedings of the 6th European Conference on the Principles of Data Mining and Knowledge Discovery, pp 15–27
Bhaskar R, Laxman S, Smith A, Thakurta A (2010) Discovering frequent patterns in sensitive data. In: proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD10). Washington, USA , pp 503–512
Blum A, Dwork C, Mcsherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: proceedings of the 24th ACM SIGMOD International Conference on Management of Data / Principles of Database Systems. New Yorks:ACM Press, pp 128–138
Chaudhuri K, Monteleoni C (2008) Privacy-preserving logistic regression. In: proceedings of the 22nd Annual Conference on Neural Information Processing Systems. Vancouver, Canada, pp 289–296
Chen R, Acs G, Castelluccia C (2012) Differentially private sequential data publication via variable-length n-grams. In: proceedings of the 2012 ACM Conference on Computer and Communications Security, pp 638–649
Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95
Article Google Scholar
Dwork C, McSherry F, Nissim K, Smith A (2006) Calibrating noise to sensitivity in private data analysis. In: proceedings of the 3rd Conference on Theory of Cryptography. New York, USA, pp 265–284
Dwork C (2006) Differential privacy. In: proceedings of the 33rd International Colloquium on Automata,languages and Programming. Springer, Berlin, pp 1–12
Dwork C (2010) Differential privacy in new settings. In: proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp 174–183
Dwork C (2008) Differential privacy: a survey of results. In: proceedings of the 5th International Conference on Theory and Application of Models of Computation. Berlin Heidelberg, pp 1–19
Dwork C (2009) The differential privacy frontier(extended abstract). In: proceedings of the 6th Theory of Cryptography Conference(TCC09). Springer, Berlin, pp 496–502
Friedman A, Schuster A (2010) Data mining with differential privacy. In: proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, USA, pp 493–502
Ganta SR, Kasiviswanathan S, Smith A (2008) Composition attacks and auxiliary information in data privacy. In: proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, pp 265–273
Hautamäki V, Cherednichenko S, Kärkkäinen I, Kinnunen T, Fränti P (2005) Improving k-means by outlier removal. Lect Notes Comput Sci 3540:978–987
Article Google Scholar
Hawkins S, He H, Williams G, Baxter R (2002) Outlier detection using replicator neural networks. In: proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery. Springer, Berlin Heidelberg, pp 170–180
Huang M, Ni W, Wang J, Sun F, Chong Z (2012) A logarithmic spiral based data perturbation method for clustering. Chin J Comput 35(11):2275–2282
Article Google Scholar
Jiang F, Chen Y-M (2015) Outlier detection based on granular computing and rough set theory. Appl Intell 42(2):303–322
Article Google Scholar
Jiang H, Yi S, Li J, Yang F, Hu X (2010) Ant clustering algorithm with k-harmonic means clustering. Expert Systems With Applications 37(12):8679–8684
Article Google Scholar
Kasiviswanathan SP, Lee HK, Nissim K, Raskhodnikova S, Smith A (2009) What can we learn privately?. Foundations of Computer Science Annual Symposium on 40(3):531–540
MathSciNet MATH Google Scholar
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. The VLDB Journal-The International Journal on Very Large Data Bases 8(3-4):237–253
Article Google Scholar
Li N, Qardaji W, Su D, Cao J (2012) PrivBasis: frequent itemset mining with differential privacy. In: proceedings of the 38th International Conference on Very Large Data Bases(VLDB12). New Yorks:ACM, pp 1340–1351
Li X-B, Sarkar S (2010) Data clustering and micro-perturbation for privacy-preserving data sharing and analysis. In: proceedings of the International Conference on Information Systems (ICIS). Yamagata, Japan, pp 58–73
Li Y, Hao Z, Wen W, Xie G (2013) Research on differential privacy preserving k-means clustering. Comput Sci 40(3):287–290
Google Scholar
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) l-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data 1(1):24
Article Google Scholar
Nguyen HH, Kim J, Kim Y (2013) Differential privacy in practice. Int J Comput Sci Eng 7(3):177–186
Google Scholar
Nissim K, Raskhodnikova S, Smith A (2007) Smooth sensitivity and sampling in private data analysis. In: proceedings of the 39th Annual ACM symposium on Theory of Computing – STOC 07, pp 75–84
Oliveira SRM, Zaïane OR (2004) Achieving privacy preservation when sharing data for clustering. In: proceedings of the International Workshop on Secure Data Management in a Connnected World. Toronto, Canada, pp 67–82
Parameswaran R, Blough DM (2008) Privacy preserving data obfuscation for inherently clustered data. Int J Inf Comput Secur 2(1):4–26
Google Scholar
Sweeney L (2002) K-anonymity: a model for protecting privacy. Int J Uncertainty Fuzziness Knowledge Based Syst 10(5):557–570
Article MathSciNet MATH Google Scholar
Tung AKH, Xu X, Ooi BC (2005) CURLER: finding and visualizing nonlinear correlation clusters. In: proceedings of the International Conference on Management of Data, pp 467–478
Visalakshi NK, Thangavel K (2009) Impact of normalization in distributed k-means clustering. Int J Soft Comput 4(4):168–172
Google Scholar
Xiong P, Zhu T, Wang X (2014) A survey on differential privacy and applications. Chin J Comput 37 (1):101–122
Google Scholar
Zhang X, Wang M, Meng X (2014) An accurate method for mining top-k frequent pattern under differential privacy. Journal of Computer Research and Development 51(1):104–114
Google Scholar

Download references

Acknowledgments

The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the National Natural Science Foundation of China (61370050) and the Natural Science Foundation of Anhui Province (1508085QF134).

Author information

Authors and Affiliations

School of Territorial Resources and Tourism, Anhui Normal University, Wuhu, 241003, China
Qingying Yu & Yonglong Luo
School of Mathematics and Computer Science, Anhui Normal University, Wuhu, 241003, China
Qingying Yu, Yonglong Luo, Chuanming Chen & Xintao Ding

Authors

Qingying Yu
View author publications
You can also search for this author in PubMed Google Scholar
Yonglong Luo
View author publications
You can also search for this author in PubMed Google Scholar
Chuanming Chen
View author publications
You can also search for this author in PubMed Google Scholar
Xintao Ding
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yonglong Luo.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yu, Q., Luo, Y., Chen, C. et al. Outlier-eliminated k-means clustering algorithm based on differential privacy preservation. Appl Intell 45, 1179–1191 (2016). https://doi.org/10.1007/s10489-016-0813-z

Download citation

Published: 11 July 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10489-016-0813-z

Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Differential Privacy Protection Algorithm for Data Clustering Center

A High-Availability K-modes Clustering Method Based on Differential Privacy

Privacy Aware K-Means Clustering with High Utility

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Differential Privacy Protection Algorithm for Data Clustering Center

A High-Availability K-modes Clustering Method Based on Differential Privacy

Privacy Aware K-Means Clustering with High Utility

Explore related subjects

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now