Co-clustering for Differentially Private Synthetic Data Generation

Tarek Benkhelif, Françoise Fessant, Fabrice Clérot & Guillaume Raschia

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10708)

Included in the following conference series:

International Workshop on Personal Analytics and Privacy


Abstract

We propose a methodology to anonymize microdata (i.e., a table of n individuals described by d attributes). The goal is to release an anonymized data table built from the original data while meeting the requirements of differential privacy. The proposed solution combines co-clustering with synthetic data generation to produce anonymized data. First, a data-independent partitioning of the domains is used to generate a perturbed multidimensional histogram; a multidimensional co-clustering is then performed on the noisy histogram, resulting in a partitioning scheme. This differentially private co-clustering phase aims to form clusters of attribute values and thus limits the impact of the noise added in the second phase. Finally, the obtained scheme is used to partition the original data in a differentially private fashion. Synthetic individuals can then be drawn from the partitions. We show through experiments that our solution outperforms existing approaches, and we demonstrate that the produced synthetic data preserve sufficient information and can be used for several data mining tasks.
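The perturbed-histogram and sampling steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the co-clustering phase entirely, assumes numeric attributes with publicly known bounds, assumes the Laplace mechanism with add/remove-one-record neighbouring datasets (so each cell count has sensitivity 1), and uses hypothetical names (`noisy_histogram`, `sample_synthetic`, `bounds`, `bins`) chosen only for illustration.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(records, bounds, bins, epsilon, rng):
    """Count records on a fixed equal-width grid, then perturb every cell.

    The grid depends only on `bounds` and `bins`, never on the data.
    Assumption: adding or removing one record changes one cell by 1,
    so Laplace noise of scale 1/epsilon is added to each cell count.
    """
    def cell_of(record):
        index = []
        for value, (lo, hi) in zip(record, bounds):
            i = int((value - lo) / (hi - lo) * bins)
            index.append(min(max(i, 0), bins - 1))  # clamp onto the grid
        return tuple(index)

    counts = Counter(cell_of(r) for r in records)
    cells = itertools.product(range(bins), repeat=len(bounds))
    # Noise is added to every cell, including empty ones.
    return {c: counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng) for c in cells}


def sample_synthetic(hist, bounds, bins, n, rng):
    """Draw n synthetic records from the noisy histogram.

    A cell is picked with probability proportional to its noisy count
    (negative counts are clamped to zero), then a point is drawn
    uniformly inside that cell. This is post-processing of the noisy
    counts and does not touch the original records.
    """
    cells = list(hist)
    weights = [max(hist[c], 0.0) for c in cells]
    if sum(weights) <= 0:
        weights = [1.0] * len(cells)  # degenerate case: fall back to uniform
    synthetic = []
    for cell in rng.choices(cells, weights=weights, k=n):
        record = []
        for i, (lo, hi) in zip(cell, bounds):
            width = (hi - lo) / bins
            record.append(lo + (i + rng.random()) * width)
        synthetic.append(tuple(record))
    return synthetic


# Toy usage on made-up data: two numeric attributes, a 4 x 4 grid.
rng = random.Random(0)
bounds = [(0.0, 100.0), (0.0, 50.0)]
data = [(rng.uniform(20, 40), rng.uniform(10, 20)) for _ in range(500)]
hist = noisy_histogram(data, bounds, bins=4, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(hist, bounds, bins=4, n=500, rng=rng)
```

With many attributes the number of grid cells grows exponentially while each cell receives the same amount of noise, which is the kind of effect the abstract's clustering of attribute values is meant to limit.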


Notes

1. https://archive.ics.uci.edu/ml/


Author information

Authors and Affiliations

Orange Labs, 2, Avenue Pierre Marzin, 22307, Lannion Cédex, France
Tarek Benkhelif, Françoise Fessant & Fabrice Clérot
LS2N - Polytech Nantes, Rue Christian Pauc, BP 50609, 44306, Nantes Cédex 3, France
Tarek Benkhelif & Guillaume Raschia

Corresponding author

Correspondence to Tarek Benkhelif.

Editor information

Editors and Affiliations

KDDLab, ISTI-CNR, Pisa, Italy
Riccardo Guidotti
KDDLab, University of Pisa, Pisa, Italy
Anna Monreale
KDDLab, University of Pisa, Pisa, Italy
Dino Pedreschi
Inria, École Normale Supérieure, Paris, France
Serge Abiteboul

About this paper

Cite this paper

Benkhelif, T., Fessant, F., Clérot, F., Raschia, G. (2017). Co-clustering for Differentially Private Synthetic Data Generation. In: Guidotti, R., Monreale, A., Pedreschi, D., Abiteboul, S. (eds) Personal Analytics and Privacy. An Individual and Collective Perspective. PAP 2017. Lecture Notes in Computer Science, vol 10708. Springer, Cham. https://doi.org/10.1007/978-3-319-71970-2_5

DOI: https://doi.org/10.1007/978-3-319-71970-2_5
Published: 25 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71969-6
Online ISBN: 978-3-319-71970-2
eBook Packages: Computer Science, Computer Science (R0)
