Abstract
Individual privacy preservation has become an important issue with the development of big data technology. The definition of ρ-differential identifiability (DI) precisely matches the legal definitions of privacy, which can provide an easy parameterization approach for practitioners so that they can set privacy parameters based on the privacy concept of individual identifiability. However, differential identifiability is currently only applied to some simple queries and achieved by Laplace mechanism, which cannot satisfy complex privacy preservation issues in big data analysis. In this paper, we propose a new exponential mechanism and composition properties of differential identifiability, and then apply differential identifiability to k-means and k-prototypes algorithms on MapReduce framework. DI k-means algorithm uses the usual Laplace mechanism and composition properties for numerical databases, while DI k-prototypes algorithm uses the new exponential mechanism and composition properties for mixed databases. The experimental results show that both DI k-means and DI k-prototypes algorithms satisfy differential identifiability.
Similar content being viewed by others
References
Mendes R, Vilela J P. Privacy-preserving data mining: methods, metrics, and applications. IEEE Access, 2017, 5: 10562–10582
Yu S. Big privacy: challenges and opportunities of privacy study in the age of big data. IEEE Access, 2016, 4: 2751–2763
Dinur I, Nissim K. Revealing information while preserving privacy. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, 2003. 202–210
Dwork C, Nissim K. Privacy-preserving datamining on vertically partitioned databases. In: Advances in Cryptology - CRYPTO 2004. Berlin: Springer, 2004. 528–544
Blum A, Dwork C, McSherry F, et al. Practical privacy: the SuLQ framework. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, 2005. 128–138
Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography. Berlin: Springer, 2006. 265–284
Dwork C. Differential privacy. In: Automata, Languages, and Programming. Berlin: Springer, 2006. 1–12
Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of ACM SIGMOD International Conference on Management of Data, Athens, 2011. 193–204
Cormode J. Personal privacy vs population privacy: learning to attack anonymization. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, 2011. 1253–1261
Lee J, Clifton C. Differential identifiability. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, 2012. 1041–1049
Dwork C. A firm foundation for private data analysis. Commun ACM, 2011, 54: 86–95
Li L, Dong Y, Wang J. Differential privacy data protection method based on clustering. In: Proceedings of Cyber-Enabled Distributed Computing and Knowledge Discovery, Nanjing, 2017. 11–16
Zhao Z, Shang T, Liu J W, et al. Clustering algorithm for privacy preservation on MapReduce. In: Proceedings of the 4th International Conference on Cloud Computing and Security, Haikou, 2018. 622–632
Bugata P, Drotar P. On some aspects of minimum redundancy maximum relevance feature selection. Sci China Inf Sci, 2020, 63: 112103
Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of IEEE Symposium on Security and Privacy, Oakland, 1998. 1–19
Samarati P. Protecting respondents identities in microdata release. IEEE Trans Knowl Data Eng, 2001, 13: 1010–1027
Sweeney L. k-anonymity: a model for protecting privacy. Int J Unc Fuzz Knowl Based Syst, 2002, 10: 557–570
Machanavajjhala A, Gehrke J, Kifer D, et al. l-diversity: privacy beyond k-anonymity. In: Proceedings of International Conference on Data Engineering, Atlanta, 2006. 24–35
Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of International Conference on Data Engineering, Istanbul, 2007. 106–115
Li N H, Li T C, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Trans Knowl Data Eng, 2010, 22: 943–956
Li N H, Qardaji W, Su D, et al. Membership privacy: a unifying framework for privacy definitions. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, New York, 2013. 889–900
Backes M, Berrang P, Humbert M, et al. Membership privacy in microRNA-based studies. In: Proceedings of ACM SIGSAC Conference on Computer and Communication, Vienna, 2016. 319–330
Aggarwal G, Feder T, Kenthapadi K, et al. Achieving anonymity via clustering. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Chicago, 2006. 153–162
Byun J W, Kamra A, Bertino E, et al. Efficient k-anonymization using clustering techniques. In: Proceedings of International Conference on Database Systems for Advanced Applications, Bangkok, 2007. 188–200
Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis. In: Proceedings of ACM Symposium on Theory of Computing, San Diego, 2007. 75–84
Jafer Y, Matwin S, Sokolova M. Using feature selection to improve the utility of differentially private data publishing. Procedia Comput Sci, 2014, 37: 511–516
Huang Z. Clustering large data sets with mixed numeric and categorical values. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 1997. 21–34
Huang Y. Understanding of Big Data: Big Data Processing and Programming Practices. Beijing: Machinery Industry Press, 2014
McSherry F, Talwar K. Mechanism design via differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Providence, 2007. 94–103
Dwork C, Rothblum G N, Vadhan S. Boosting and differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Las Vegas, 2010. 51–60
Bkakria A, Cuppens-Boulahia N, Cuppens F. Linking differential identifiability with differential privacy. In: Proceedings of International Conference on Information and Communications Security, Cham, 2018. 232–247
McSherry F. Privacy integrated queries:an extensible platform for privacy-preserving data analysis. Commun ACM, 2010, 53: 89–97
Xiong J, Ren J, Chen L, et al. Enhancing privacy and availability for data clustering in intelligent electrical service of IoT. IEEE Internet Things J, 2019, 6: 1530–1540
Zhang Y, Liu N, Wang S. A differential privacy protecting k-means clustering algorithm based on contour coefficients. Plos One, 2018, 13: e0206832
Gao Z, Sun Y, Cui X, et al. Privacy-preserving hybrid k-means. Int J Data Warehous Min, 2018, 14: 1–17
Ren J, Xiong J, Cui X, et al. DPLk-means: a novel differential privacy k-means mechanism. In: Proceedings of IEEE International Conference on Data Science in Cyberspace, Shenzhen, 2017. 133–139
Su D, Cao J, Li N, et al. Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans Priv Secur, 2017, 20: 1–33
Acknowledgements
This work was supported by National Key Research and Development Program of China (Grant No. 2016YFC1000307) and National Natural Science Foundation of China (Grant Nos. 61971021, 61571024).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Shang, T., Zhao, Z., Ren, X. et al. Differential identifiability clustering algorithms for big data analysis. Sci. China Inf. Sci. 64, 152101 (2021). https://doi.org/10.1007/s11432-020-2910-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11432-020-2910-1