Abstract
Individual privacy may be compromised during the process of mining for valuable information, and the potential for data mining is hindered by the need to preserve privacy. It is well known that k-means clustering algorithms based on differential privacy require preserving privacy while maintaining the availability of clustering. However, it is difficult to balance both aspects in traditional algorithms. In this paper, an outlier-eliminated differential privacy (OEDP) k-means algorithm is proposed that both preserves privacy and improves clustering efficiency. The proposed approach selects the initial centre points in accordance with the distribution density of data points, and adds Laplacian noise to the original data for privacy preservation. Both a theoretical analysis and comparative experiments were conducted. The theoretical analysis shows that the proposed algorithm satisfies ε-differential privacy. Furthermore, the experimental results show that, compared to other methods, the proposed algorithm effectively preserves data privacy and improves the clustering results in terms of accuracy, stability, and availability.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Acs G., Castelluccia C., Chen R. (2012) Differentially private histogram publishing through lossy compression. In: proceedings of IEEE 12th International Conference on Data Mining, ICDM, pp 1–10
Agrawal R, Srikant R (2000) Privacy-preserving data mining. ACM Sigmod Record 29(2):439–450
Angiulli F, Fassetti F (2009) DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data(TKDD) 3(1):4:1–57
Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: proceedings of the 6th European Conference on the Principles of Data Mining and Knowledge Discovery, pp 15–27
Bhaskar R, Laxman S, Smith A, Thakurta A (2010) Discovering frequent patterns in sensitive data. In: proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD10). Washington, USA , pp 503–512
Blum A, Dwork C, Mcsherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: proceedings of the 24th ACM SIGMOD International Conference on Management of Data / Principles of Database Systems. New Yorks:ACM Press, pp 128–138
Chaudhuri K, Monteleoni C (2008) Privacy-preserving logistic regression. In: proceedings of the 22nd Annual Conference on Neural Information Processing Systems. Vancouver, Canada, pp 289–296
Chen R, Acs G, Castelluccia C (2012) Differentially private sequential data publication via variable-length n-grams. In: proceedings of the 2012 ACM Conference on Computer and Communications Security, pp 638–649
Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95
Dwork C, McSherry F, Nissim K, Smith A (2006) Calibrating noise to sensitivity in private data analysis. In: proceedings of the 3rd Conference on Theory of Cryptography. New York, USA, pp 265–284
Dwork C (2006) Differential privacy. In: proceedings of the 33rd International Colloquium on Automata,languages and Programming. Springer, Berlin, pp 1–12
Dwork C (2010) Differential privacy in new settings. In: proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp 174–183
Dwork C (2008) Differential privacy: a survey of results. In: proceedings of the 5th International Conference on Theory and Application of Models of Computation. Berlin Heidelberg, pp 1–19
Dwork C (2009) The differential privacy frontier(extended abstract). In: proceedings of the 6th Theory of Cryptography Conference(TCC09). Springer, Berlin, pp 496–502
Friedman A, Schuster A (2010) Data mining with differential privacy. In: proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, USA, pp 493–502
Ganta SR, Kasiviswanathan S, Smith A (2008) Composition attacks and auxiliary information in data privacy. In: proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, pp 265–273
Hautamäki V, Cherednichenko S, Kärkkäinen I, Kinnunen T, Fränti P (2005) Improving k-means by outlier removal. Lect Notes Comput Sci 3540:978–987
Hawkins S, He H, Williams G, Baxter R (2002) Outlier detection using replicator neural networks. In: proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery. Springer, Berlin Heidelberg, pp 170–180
Huang M, Ni W, Wang J, Sun F, Chong Z (2012) A logarithmic spiral based data perturbation method for clustering. Chin J Comput 35(11):2275–2282
Jiang F, Chen Y-M (2015) Outlier detection based on granular computing and rough set theory. Appl Intell 42(2):303–322
Jiang H, Yi S, Li J, Yang F, Hu X (2010) Ant clustering algorithm with k-harmonic means clustering. Expert Systems With Applications 37(12):8679–8684
Kasiviswanathan SP, Lee HK, Nissim K, Raskhodnikova S, Smith A (2009) What can we learn privately?. Foundations of Computer Science Annual Symposium on 40(3):531–540
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. The VLDB Journal-The International Journal on Very Large Data Bases 8(3-4):237–253
Li N, Qardaji W, Su D, Cao J (2012) PrivBasis: frequent itemset mining with differential privacy. In: proceedings of the 38th International Conference on Very Large Data Bases(VLDB12). New Yorks:ACM, pp 1340–1351
Li X-B, Sarkar S (2010) Data clustering and micro-perturbation for privacy-preserving data sharing and analysis. In: proceedings of the International Conference on Information Systems (ICIS). Yamagata, Japan, pp 58–73
Li Y, Hao Z, Wen W, Xie G (2013) Research on differential privacy preserving k-means clustering. Comput Sci 40(3):287–290
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) l-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data 1(1):24
Nguyen HH, Kim J, Kim Y (2013) Differential privacy in practice. Int J Comput Sci Eng 7(3):177–186
Nissim K, Raskhodnikova S, Smith A (2007) Smooth sensitivity and sampling in private data analysis. In: proceedings of the 39th Annual ACM symposium on Theory of Computing – STOC 07, pp 75–84
Oliveira SRM, Zaïane OR (2004) Achieving privacy preservation when sharing data for clustering. In: proceedings of the International Workshop on Secure Data Management in a Connnected World. Toronto, Canada, pp 67–82
Parameswaran R, Blough DM (2008) Privacy preserving data obfuscation for inherently clustered data. Int J Inf Comput Secur 2(1):4–26
Sweeney L (2002) K-anonymity: a model for protecting privacy. Int J Uncertainty Fuzziness Knowledge Based Syst 10(5):557–570
Tung AKH, Xu X, Ooi BC (2005) CURLER: finding and visualizing nonlinear correlation clusters. In: proceedings of the International Conference on Management of Data, pp 467–478
Visalakshi NK, Thangavel K (2009) Impact of normalization in distributed k-means clustering. Int J Soft Comput 4(4):168–172
Xiong P, Zhu T, Wang X (2014) A survey on differential privacy and applications. Chin J Comput 37 (1):101–122
Zhang X, Wang M, Meng X (2014) An accurate method for mining top-k frequent pattern under differential privacy. Journal of Computer Research and Development 51(1):104–114
Acknowledgments
The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the National Natural Science Foundation of China (61370050) and the Natural Science Foundation of Anhui Province (1508085QF134).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Yu, Q., Luo, Y., Chen, C. et al. Outlier-eliminated k-means clustering algorithm based on differential privacy preservation. Appl Intell 45, 1179–1191 (2016). https://doi.org/10.1007/s10489-016-0813-z
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-016-0813-z