Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections

Published: 14 May 2024 Publication History

Abstract

Motivated by privacy concerns in long-term longitudinal studies in medical and social science research, we study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a model where, in every time step, each individual reports a new data element, and the goal of the synthesizer is to incrementally update a synthetic dataset in a consistent way to capture a rich class of statistical properties. We give continual synthetic data generation algorithms that preserve two basic types of queries: fixed time window queries and cumulative time queries. We show nearly tight upper bounds on the error rates of these algorithms and demonstrate their empirical performance on realistically sized datasets from the U.S. Census Bureau's Survey of Income and Program Participation.

References

[1]
John Abowd et al. 2021. An uncertainty principle is a price of privacy-preserving microdata. In Advances in Neural Information Processing Systems. M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, (Eds.) Vol. 34. Curran Associates, Inc., 11883--11895. https://proceedings.neurips.cc/paper_files/paper/2021/file/639d79cc857a6c76c2723b7e014fccb0-Paper.pdf.
[2]
Daniel Alabi, Omri Ben-Eliezer, and Anamay Chaturvedi. 2022. Bounded space differentially private quantiles. CoRR, abs/2201.03380.
[3]
Héber H. Arcolezi, Carlos Pinzón, Catuscia Palamidessi, and Sébastien Gambs. 2022. Frequency estimation of evolving data under local differential privacy. arXiv preprint arXiv:2210.00262.
[4]
Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. 2007. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '07). Association for Computing Machinery, New York, NY, USA, 273--282. isbn: 9781595936851.
[5]
Gary Benedetto, Stanley Jordan C., and Totty Evan. 2018. The creation and use of the {sipp} synthetic beta v7.0. https://www.census.gov/library/working-papers/2018/adrm/SIPP-Synthetic-Beta.html.
[6]
Gary Benedetto, Martha H. Stinson, and John M. Abowd. 2013. The creation and use of the {sipp} synthetic beta. https://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/SSBdescribe%7B%5C_%7Dnontechnical.pdf.
[7]
Avrim Blum, Katrina Ligett, and Aaron Roth. 2008. A learning theory approach to non-interactive database privacy. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, May 17--20, 2008. Cynthia Dwork, (Ed.) ACM, 609--618.
[8]
Jean Bolot, Nadia Fawaz, S. Muthukrishnan, Aleksandar Nikolov, and Nina Taft. 2013. Private decayed predicate sums on streams. In Proceedings of the 16th International Conference on Database Theory (ICDT '13). Association for Computing Machinery, Genoa, Italy, 284--295. isbn: 9781450315982.
[9]
Mark Bun and Thomas Steinke. 2016. Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference. Springer, 635--658.
[10]
U.S. Census Bureau. 2021. Longitudinal Business Database. Tech. rep. https://www.census.gov/programs-surveys/ces/data/restricted-use-data/longitudinal-business-database.html.
[11]
U.S. Census Bureau. 2021. Survey of Income and Program Participation Data. Tech. rep. https://www2.census.gov/programs-surveys/sipp/data/datasets/2021/pu2021_csv.zip.
[12]
U.S. Census Bureau. 2023. Synthetic Longitudinal Business Database. Tech. rep. https://www.census.gov/programs-surveys/ces/data/public-use-data/synthetic-longitudinal-business-database.html.
[13]
Clément L. Canonne, Gautam Kamath, and Thomas Steinke. 2020. The discrete gaussian for differential privacy. Advances in Neural Information Processing Systems, 33, 15676--15688.
[14]
Adrian Rivera Cardoso and Ryan Rogers. 2022. Differentially private histograms under continual observation: streaming selection into the unknown. In International Conference on Artificial Intelligence and Statistics, {AISTATS} 2022, 28--30 March 2022, Virtual Event (Proceedings of Machine Learning Research). Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, (Eds.) Vol. 151. {PMLR}, 2397--2419.
[15]
T-H Hubert Chan, Elaine Shi, and Dawn Song. 2011. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14, 3, 1--24.
[16]
Yan Chen, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. 2017. Pegasus: data-adaptive differentially private stream processing. In Proceedings of the 2017 {ACM} {SIGSAC} Conference on Computer and Communications Security, {CCS} 2017, Dallas, TX, USA, October 30 - November 03, 2017. Bhavani Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu, (Eds.) {ACM}, 1375--1388.
[17]
Serguei Denissov, Hugh Brendan McMahan, J. Keith Rush, Adam Smith, and Abhradeep Guha Thakurta. 2022. Improved differential privacy for {sgd} via optimal private linear operators on adaptive streams. In Advances in Neural Information Processing Systems. Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, (Eds.) https://openreview.net/forum?id=i9XrHJoyLqJ.
[18]
Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, {USA}. Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, (Eds.), 3571--3580.
[19]
Richard Doll and A. Bradford Hill. 1956. Lung cancer and other causes of death in relation to smoking. British Medical Journal, 2(5001), 1071--1081.
[20]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography (TCC '06), 265--284.
[21]
Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. 2010. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC '10), 715--724.
[22]
Cynthia Dwork and Guy N. Rothblum. 2016. Concentrated differential privacy. arXiv preprint arXiv:1603.01887.
[23]
Alessandro Epasto, Jieming Mao, Andres Muñoz Medina, Vahab Mirrokni, Sergei Vassilvitskii, and Peilin Zhong. 2023. Differentially private continual releases of streaming frequency moment estimations. In 14th Innovations in Theoretical Computer Science Conference, {ITCS} 2023, January 10--13, 2023, MIT, Cambridge, Massachusetts, {USA} (LIPIcs). Yael Tauman Kalai, (Ed.) Vol. 251. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 48:1--48:24.
[24]
Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2019. Amplification by shuffling: from local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2468--2479.
[25]
Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. {rappor:} randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 {ACM} {SIGSAC} Conference on Computer and Communications Security, Scottsdale, AZ, USA, November 3--7, 2014. Gail-Joon Ahn, Moti Yung, and Ninghui Li, (Eds.) {ACM}, 1054--1067.
[26]
Hendrik Fichtenberger, Monika Henzinger, and Jalaj Upadhyay. 2022. Constant matters: fine-grained complexity of differentially private continual observation. (2022).
[27]
Marco Gaboardi, Emilio Jesús Gallego Arias, Justin Hsu, Aaron Roth, and Zhiwei Steven Wu. 2014. Dual query: practical private query release for high dimensional data. In Proceedings of the 31th International Conference on Machine Learning, {ICML} 2014, Beijing, China, 21--26 June 2014 ({JMLR} Workshop and Conference Proceedings). Vol. 32. JMLR.org, 1170--1178. http://proceedings.mlr.press/v32/gaboardi14.html.
[28]
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, and Jelani Nelson. 2022. Private counting of distinct and k-occurring items in time windows. arXiv preprint arXiv:2211.11718.
[29]
Moritz Hardt and Guy N. Rothblum. 2010. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st annual symposium on foundations of computer science. IEEE, 61--70.
[30]
Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. 2010. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3, 1, 1021--1032.
[31]
Monika Henzinger, Jalaj Upadhyay, and Sarvagya Upadhyay. 2023. Almost tight error bounds on differentially private continual counting. In Proceedings of the 2023 {ACM-SIAM} Symposium on Discrete Algorithms, {SODA} 2023, Florence, Italy, January 22--25, 2023. Nikhil Bansal and Viswanath Nagarajan, (Eds.) {SIAM}, 5003--5039.
[32]
James Honaker. 2015. Efficient use of differentially private binary trees. In.
[33]
Palak Jain, Sofya Raskhodnikova, Satchit Sivakumar, and Adam D. Smith. 2021. The price of differential privacy under continual observation. CoRR, abs/2112.00828.
[34]
Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. 2012. Differentially private online learning. In {COLT} 2012 - The 25th Annual Conference on Learning Theory, June 25--27, 2012, Edinburgh, Scotland ({JMLR} Proceedings). Shie Mannor, Nathan Srebro, and Robert C. Williamson, (Eds.) Vol. 23. JMLR.org, 24.1--24.34.
[35]
Matthew Joseph, Aaron Roth, Jonathan Ullman, and Bo Waggoner. 2018. Local differential privacy for evolving data. Advances in Neural Information Processing Systems, 31.
[36]
W. B. Kannel, Dawber T. R., A. Kagan, N. Revotskie, and J. Stokes 3rd. 1961. Factors of risk in the development of coronary heart disease--six year follow-up experience. {t}he {f}ramingham study. Annals of Internal Medicine, 55, 33--50.
[37]
Georgios Kellaris, Stavros Papadopoulos, Xiaokui Xiao, and Dimitris Papadias. 2014. Differentially private event sequences over infinite streams. Proc. {VLDB} Endow., 7, 12, 1155--1166.
[38]
Terrance Liu, Jingwu Tang, Giuseppe Vietri, and Steven Wu. 2023. Generating private synthetic data with genetic algorithms. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research). Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, (Eds.) Vol. 202. PMLR, (23--29 Jul 2023), 22009--22027. https://proceedings.mlr.press/v202/liu23ag.html.
[39]
Ryan McKenna, Gerome Miklau, and Daniel Sheldon. 2021. Winning the {nist} contest: {a} scalable and general approach to differentially private synthetic data. J. Priv. Confidentiality, 11, 3.
[40]
Ryan McKenna, Brett Mullins, Daniel Sheldon, and Gerome Miklau. 2022. {aim:} an adaptive and iterative mechanism for differentially private synthetic data. Proc. {VLDB} Endow., 15, 11, 2599--2612.
[41]
Darakhshan J. Mir, S. Muthukrishnan, Aleksandar Nikolov, and Rebecca N. Wright. 2011. Pan-private algorithms via statistics on sketches. In Proceedings of the 30th {ACM} {SIGMOD-SIGACT-SIGART} Symposium on Principles of Database Systems, {PODS} 2011, June 12--16, 2011, Athens, Greece. Maurizio Lenzerini and Thomas Schwentick, (Eds.) {ACM}, 37--48.
[42]
Marcel Neunhoeffer, Steven Wu, and Cynthia Dwork. 2021. Private post-gan boosting. In International Conference on Learning Representations. https://openreview.net/forum?id=6isfR3JCbi.
[43]
Aleksandar Nikolov, Kunal Talwar, and Li Zhang. 2013. The geometry of differential privacy: the sparse and approximate cases. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1--4, 2013. Dan Boneh, Tim Roughgarden, and Joan Feigenbaum, (Eds.) ACM, 351--360.
[44]
Olga Ohrimenko, Anthony Wirth, and Hao Wu. 2022. Randomize the future: asymptotically optimal locally private frequency estimation protocol for longitudinal data. In {PODS} '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022. Leonid Libkin and Pablo Barceló, (Eds.) {ACM}, 237--249.
[45]
Omar Rivasplata. 2012. Subgaussian random variables: an expository note. Available at: http://www.stat.cmu.edu/arinaldo/36788/subgaussians.pdf.
[46]
Shuang Song, Susan Little, Sanjay Mehta, Staal A. Vinterbo, and Kamalika Chaudhuri. 2018. Differentially private continual release of graph statistics. CoRR, abs/1809.02575.
[47]
Jonathan Ullman and Salil Vadhan. 2011. Pcps and the hardness of generating private synthetic data. In Theory of Cryptography. Yuval Ishai, (Ed.) Springer Berlin Heidelberg, Berlin, Heidelberg, 400--416. isbn: 978--3--642--19571--6.
[48]
Giuseppe Vietri, Grace Tian, Mark Bun, Thomas Steinke, and Zhiwei Steven Wu. 2020. New oracle-efficient algorithms for private synthetic data release. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13--18 July 2020, Virtual Event (Proceedings of Machine Learning Research). Vol. 119. PMLR, 9765--9774. http://proceedings.mlr.press/v119/vietri20b.html.
[49]
Lars Vilhuber, John M. Abowd, and Jerome P. Reiter. 2016. Synthetic establishment microdata around the world. Statistical Journal of the International Association for Official Statistics, 32, 65--68.
[50]
Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, and Akash Srivastava. 2023. Post-processing private synthetic data for improving utility on selected measures. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=neu9JlNweE.
[51]
Qiao Xue, Qingqing Ye, Haibo Hu, Youwen Zhu, and Jian Wang. 2022. Ddrm: a continual frequency estimation mechanism with local differential privacy. IEEE Transactions on Knowledge and Data Engineering, 1--1.
[52]
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2017. Privbayes: private data release via bayesian networks. ACM Trans. Database Syst., 42, 4, Article 25, (Oct. 2017), 41 pages.

Cited By

View all
  • (2024)Online Differentially Private Synthetic Data GenerationIEEE Transactions on Privacy10.1109/TP.2024.34866871(19-30)Online publication date: 2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 2, Issue 2
PODS
May 2024
852 pages
EISSN:2836-6573
DOI:10.1145/3665155
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 May 2024
Published in PACMMOD Volume 2, Issue 2

Permissions

Request permissions for this article.

Author Tags

  1. continual release
  2. differential privacy
  3. longitudinal data
  4. synthetic data

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)92
  • Downloads (Last 6 weeks)14
Reflects downloads up to 28 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Online Differentially Private Synthetic Data GenerationIEEE Transactions on Privacy10.1109/TP.2024.34866871(19-30)Online publication date: 2024

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media