Abstract
We present a mathematical framework for understanding when successfully distinguishing a person from all other persons in a data set—a phenomenon which we call isolation—may enable identification, a notion which is central to deciding whether a release based on the data set is subject to data protection regulation. We show that a baseline degree of isolation is unavoidable in the sense that isolation can typically happen with high probability even before a release was made about the data set and hence identification is not enabled. We then describe settings where isolation resulting from a data release may enable identification.
Notes
- 1. Regulation (EU) 2016/679 (General Data Protection Regulation), Article 4.
- 2. For variations and sources, see https://csrc.nist.gov/glossary/term/identification.
- 3. For example: \((25\le \text{ Age } \le 28) \wedge (10,000 \le \text{ Salary } \le 50,000)\).
- 4. General Data Protection Regulation, Recital 26. See also: Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques.
- 5. See citations in [12] for more.
- 6. “If an individual is not present in a dataset, and is independent of all other individuals in the dataset, then the release of that dataset does not violate that individual’s privacy.”
- 7.
- 8. Query access to the release mechanism can also be used in other ways; see Remark 4 below.
- 9. Access to alternative sources of information may also be used for boosting the information receiver’s confidence that a guess B isolates.
- 10. The number of records n is included as part of the receiver’s prior knowledge since n can often be inferred (exactly or approximately) from public information. Examples include (i) a survey design specifying n that was made public prior to the collection of information, and (ii) an n that was made public in previous surveys (e.g., a census). For a reader who considers n as part of the release, this section should be understood as demonstrating that the release of n alone suffices for isolation.
- 11. We exclude \(P(B) \ge \frac{1}{n}\), as the chance that B uniquely distinguishes an individual in a population of size \(N = n/r\) is extremely small. Namely, \((1-P(B))^{N-n} \le e^{-(1-r)/r}\). Taking \(r = 0.01\), say, the probability is about \(10^{-43}\).
- 12. Accounting for the variance of P(B) yields only an insignificant improvement.
- 13. E.g., for Governor Weld’s attributes used by Sweeney, they estimated \((1-P(B))^{N-1} \approx 0.58\).
- 14. The proof of this fact is somewhat nuanced, as \(A_i\) can depend arbitrarily on the dataset \(\textbf{X}\).
- 15.
- 16. Chebyshev’s inequality: \(\Pr \left[ |X-\textbf{E}[X]|\ge k\right] \le \frac{\textbf{Var}[X]}{k^2}\).
References
Altman, M., Cohen, A., Nissim, K., Wood, A.: What a hybrid legal-technical analysis teaches us about privacy regulation: the case of singling out. BUJ Sci. Tech. L. 27, 1 (2021)
Barth-Jones, D.: The ‘re-identification’ of Governor William Weld’s medical information: a critical re-examination of health data identification risks and privacy protections, then and now (2012)
Bethlehem, J.G., Keller, W.J., Pannekoek, J.: Disclosure control of microdata. J. Am. Stat. Assoc. 85(409), 38–45 (1990)
Cohen, A.: Attacks on deidentification’s defenses. In: Butler, K.R.B., Thomas, K. (eds.) 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, 10–12 August 2022, pp. 1469–1486. USENIX Association (2022). https://www.usenix.org/conference/usenixsecurity22/presentation/cohen
Cohen, A., Nissim, K.: Towards formalizing the GDPR’s notion of singling out. Proc. Natl. Acad. Sci. USA 117(15), 8344–8352 (2020). https://doi.org/10.1073/pnas.1914598117
Dalenius, T.: Finding a needle in a haystack or identifying anonymous census records. J. Official Stat. 2(3), 329 (1986)
De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3(1), 1–5 (2013)
Fienberg, S.E., Makov, U.E.: Confidentiality, uniqueness, and disclosure limitation for categorical data. J. Official Stat. 14(4), 385 (1998)
Francis, P., Wagner, D.: Towards more accurate and useful data anonymity vulnerability measures. arXiv preprint arXiv:2403.06595 (2024)
Jarmin, R.S., et al.: An in-depth examination of requirements for disclosure risk assessment. Proc. Natl. Acad. Sci. 120(43), e2220558120 (2023)
Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18–21 May 2008, Oakland, California, USA, pp. 111–125. IEEE Computer Society (2008). https://doi.org/10.1109/SP.2008.33
Rocher, L., Hendrickx, J.M., De Montjoye, Y.A.: Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10(1), 1–9 (2019)
Ruggles, S., Van Riper, D.: The role of chance in the census bureau database reconstruction experiment. Popul. Res. Policy Rev. 41, 781–788 (2022). https://doi.org/10.1007/s11113-021-09674-3
Sánchez, D., Martínez, S., Domingo-Ferrer, J.: Comment on “unique in the shopping mall: on the reidentifiability of credit card metadata”. Science 351(6279), 1274–1274 (2016)
Sweeney, L.: Simple demographics often identify people uniquely. Health (San Francisco) 671(2000), 1–34 (2000)
Acknowledgments
Work of K.N. was supported by NSF Grant No. CCF2217678 “DASS: Co-design of law and computer science for privacy in sociotechnical software systems” and a gift to Georgetown University. Work completed while K.N. visited Bocconi University, Milan.
A Isolating with a Partial Knowledge of P
The analysis in Sects. 3.1 and 3.2 assumed that the information receiver has perfect knowledge of the underlying probability measure P (but not \(\textbf{X}\) sampled from P). We now discuss what the receiver may do when they do not know P in full.
Observation 1 and Theorem 2 teach that all the information receiver needs is a partition of the data space into sets of probability weight \(p^* =\frac{1}{\max (n,k)}\), and Remark 3 suggests that a partition whose elements are merely close to the optimal weight suffices for the information receiver to succeed in isolating. We now develop these ideas.
Let \(\mathcal{C}=\{C_i\}_{i=1}^\ell \) be a partition of D where \(\ell =\max (n,k)\). The information receiver may choose the partition \(\mathcal{C}\) heuristically, in combination with their partial knowledge about P and the data domain D. For example, \(\mathcal{C}\) may partition D into high-dimensional rectangles, each described as the conjunction of one or more attribute ranges (e.g., all combinations of 5-year Age by Sex by City). Denote by \(p_i = P(C_i)\) the probability of an individual falling into \(C_i\). We show that if \(\underline{p} =(p_1,\ldots ,p_\ell )\) is close enough to \(p^*=(\frac{1}{\ell },\ldots ,\frac{1}{\ell })\) then (even without knowing \(p_1,\ldots ,p_\ell \)) the information receiver succeeds in isolating.
As \(\mathcal{C}\) is a partition of the data domain D we have that \(\sum _{i=1}^\ell p_i=1\). Hence, if we pick a partition element at random then, in expectation, its probability weight is exactly \(p^*=\frac{1}{\ell }\):

$$\mathop{\textbf{E}}_{i\sim U_\ell }[p_i] = \frac{1}{\ell }\sum _{i=1}^\ell p_i = \frac{1}{\ell },$$

where we use \(i\sim U_\ell \) to denote that the expectation is over choosing an element of the partition \(i\in \{1,\dots ,\ell \}\) uniformly at random.
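As a toy illustration of such a heuristic rectangle partition (the attributes, their ranges, and the uniform stand-in for P are our own hypothetical choices), the sketch below estimates the cell weights \(p_i\) from samples and checks that they sum to 1, so that a uniformly random cell has weight exactly \(\frac{1}{\ell}\) in expectation:

```python
import itertools
import random
import statistics

random.seed(0)

# Toy stand-in for P: Age uniform in 20..59, Sex in {0, 1}, City in 0..4.
def sample_individual():
    return (random.randint(20, 59), random.randint(0, 1), random.randint(0, 4))

# Rectangle partition C: 5-year age bands x Sex x City, ell = 8 * 2 * 5 = 80 cells.
cells = list(itertools.product(range(20, 60, 5), range(2), range(5)))
ell = len(cells)

def cell_of(x):
    age, sex, city = x
    return (age - (age - 20) % 5, sex, city)  # round age down to its 5-year band

# Estimate the cell weights p_i = P(C_i) empirically.
N = 100_000
counts = dict.fromkeys(cells, 0)
for _ in range(N):
    counts[cell_of(sample_individual())] += 1
weights = [counts[c] / N for c in cells]

# The weights sum to 1, so a uniformly random cell has mean weight 1/ell.
assert abs(sum(weights) - 1) < 1e-9
assert abs(statistics.mean(weights) - 1 / ell) < 1e-9
print(statistics.pstdev(weights))  # small compared to 1/ell = 0.0125 here
```

Under this (deliberately benign) choice of P the empirical spread of the weights is tiny compared to \(\frac{1}{\ell}\); for a skewed P the receiver would instead tune the rectangle boundaries to even out the weights.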
An important parameter of the partition is its standard deviation \(\sigma \):

$$\sigma ^2 = \mathop{\textbf{Var}}_{i\sim U_\ell }[p_i] = \mathop{\textbf{E}}_{i\sim U_\ell }\left[ \left( p_i-\frac{1}{\ell }\right) ^2\right] .$$
If the standard deviation \(\sigma \) is small compared to \(\frac{1}{\ell }\), say \(\sigma \le \frac{c}{\ell }\) for \(c\ll 1\), then many of the elements of the partition have weight \(p_i \approx \frac{1}{\ell }\). More precisely, by Chebyshev’s inequality (Footnote 16) we have:

$$\Pr _{i\sim U_\ell }\left[ \left| p_i-\frac{1}{\ell }\right| \ge \frac{1}{2\ell }\right] \le \frac{\sigma ^2}{(1/2\ell )^2} \le 4c^2.$$
Hence, if the information receiver samples guesses \(B_1,\ldots ,B_k\) from the partition \(\mathcal{C}\) without replacement, then in expectation at least \((1-4c^2)k\) of them satisfy \(P(B_i) \in \left[ \frac{1}{2\ell },\frac{3}{2\ell }\right] \). Using Eq. 1, each of these guesses results in isolation probability \(p^\textsf{iso}(B_i) \ge n\cdot \min (\frac{1}{2\ell } \cdot (1-\frac{1}{2\ell })^{n-1}, \frac{3}{2\ell } \cdot (1-\frac{3}{2\ell })^{n-1})\), and in expectation the number of isolating guesses is at least

$$(1-4c^2)\,k\cdot n\cdot \min \left( \frac{1}{2\ell }\left( 1-\frac{1}{2\ell }\right) ^{n-1},\ \frac{3}{2\ell }\left( 1-\frac{3}{2\ell }\right) ^{n-1}\right) .$$
As an example, if \(c=\frac{1}{4}\) and \(k=n\) (hence \(\ell =k=n\)) we get that in expectation the number of successful isolations is at least

$$\frac{3}{4}\cdot n\cdot \min \left( \frac{1}{2}\left( 1-\frac{1}{2n}\right) ^{n-1},\ \frac{3}{2}\left( 1-\frac{3}{2n}\right) ^{n-1}\right) \approx \frac{3}{8}e^{-1/2}\, n \approx 0.23\, n.$$

I.e., in expectation almost a quarter of the guesses result in successful isolations, in spite of the \(B_i\) not being chosen optimally.
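The bound in this example can be checked numerically; a minimal sketch (the helper name is ours), instantiated with \(n = k = 1000\) and \(c = \frac{1}{4}\):

```python
def expected_isolations_lower_bound(n, k, c):
    """Lower bound on the expected number of isolating guesses when k
    guesses are drawn from a partition of ell = max(n, k) cells whose
    weights have standard deviation at most c/ell."""
    ell = max(n, k)
    chebyshev_fraction = 1 - 4 * c**2  # cells with p_i in [1/(2*ell), 3/(2*ell)]
    p_iso = n * min(
        (1 / (2 * ell)) * (1 - 1 / (2 * ell)) ** (n - 1),
        (3 / (2 * ell)) * (1 - 3 / (2 * ell)) ** (n - 1),
    )
    return chebyshev_fraction * k * p_iso

n = 1000
bound = expected_isolations_lower_bound(n, k=n, c=0.25)
print(bound / n)  # ~0.227: almost a quarter of the guesses isolate
```

Note that the bound degenerates to 0 once \(c \ge \frac{1}{2}\), i.e., Chebyshev gives nothing when the partition weights are too spread out.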
Remark 6
Cohen and Nissim [5] used hashing to create a structure that is equivalent to a partition \(\mathcal{C}\) where \(p_i\) is very close to \(\frac{1}{\ell }\) (assuming P has sufficient min-entropy). The main qualitative difference between the hashing approach and the one described in this work is that hashing destroys the structure of the data domain and makes it harder for the information receiver to make effective use of isolation (e.g., as a step towards a linkage attack) whereas in the approach described herein the information receiver may choose partitions \(\mathcal{C}\) that are more suitable for their purposes.
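A minimal sketch of the hashing idea (our own illustration, not the construction of [5]): assign each record to one of \(\ell\) cells by a cryptographic hash, so that if P has sufficient min-entropy each cell receives weight close to \(\frac{1}{\ell}\), at the cost of destroying the attribute structure of the domain.

```python
import hashlib

def hash_cell(record: str, ell: int) -> int:
    """Map a record to one of ell cells via SHA-256. With enough
    min-entropy in P, each cell has weight close to 1/ell, but the
    cells no longer correspond to interpretable attribute ranges."""
    digest = hashlib.sha256(record.encode()).digest()
    return int.from_bytes(digest[:8], "big") % ell

# The resulting "guess" is a hash condition rather than a conjunction of
# attribute ranges -- harder to reuse, e.g., as a step in a linkage attack.
cell = hash_cell("age=27,salary=42000", 100)
print(cell)
```

By contrast, a rectangle partition keeps each guess in the form of Footnote 3, which the receiver can carry over directly to auxiliary data sources.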
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
D’Acquisto, G., Cohen, A., Naldi, M., Nissim, K. (2024). From Isolation to Identification. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_1