Abstract
We present a mathematical framework for understanding when successfully distinguishing a person from all other persons in a data set—a phenomenon which we call isolation—may enable identification, a notion which is central to deciding whether a release based on the data set is subject to data protection regulation. We show that a baseline degree of isolation is unavoidable in the sense that isolation can typically happen with high probability even before a release was made about the data set and hence identification is not enabled. We then describe settings where isolation resulting from a data release may enable identification.
Notes
- 1. Regulation (EU) 2016/679 (General Data Protection Regulation), Article 4.
- 2. For variations and sources, see https://csrc.nist.gov/glossary/term/identification.
- 3. For example: \((25\le \text{ Age } \le 28) \wedge (10,000 \le \text{ Salary } \le 50,000)\).
- 4. General Data Protection Regulation, Recital 26. See also: Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques.
- 5. See citations in [12] for more.
- 6. “If an individual is not present in a dataset, and is independent of all other individuals in the dataset, then the release of that dataset does not violate that individual’s privacy.”
- 7.
- 8. Query access to the release mechanism can also be used in other ways; see Remark 4 below.
- 9. Access to alternative sources of information may also be used for boosting the information receiver’s confidence that a guess B isolates.
- 10. The number of records n is included as part of the receiver’s prior knowledge since n can often be inferred (exactly or approximately) from public information. Examples include (i) a survey design specifying n that was made public prior to the collection of information, and (ii) an n that was made public in previous surveys (e.g., a census). For a reader who considers n as part of the release, this section should be understood as demonstrating that the release of n alone suffices for isolation.
- 11. We exclude \(P(B) \ge \frac{1}{n}\), as the chance that B uniquely distinguishes an individual in a population of size \(N = n/r\) is extremely small. Namely, \((1-P(B))^{N-n} \le e^{-(1-r)/r}\). Taking \(r = 0.01\), say, the probability is about \(10^{-43}\).
- 12. Accounting for the variance of P(B) yields only an insignificant improvement.
- 13. E.g., for Governor Weld’s attributes used by Sweeney, they estimated \((1-P(B))^{N-1} \approx 0.58\).
- 14. The proof of this fact is somewhat nuanced, as \(A_i\) can depend arbitrarily on the dataset \(\textbf{X}\).
- 15.
- 16. Chebyshev’s inequality: \(\Pr \left[ |X-\textbf{E}[X]|\ge k\right] \le \frac{\textbf{Var}[X]}{k^2}\).
References
Altman, M., Cohen, A., Nissim, K., Wood, A.: What a hybrid legal-technical analysis teaches us about privacy regulation: the case of singling out. BUJ Sci. Tech. L. 27, 1 (2021)
Barth-Jones, D.: The ‘re-identification’ of Governor William Weld’s medical information: a critical re-examination of health data identification risks and privacy protections, then and now (2012)
Bethlehem, J.G., Keller, W.J., Pannekoek, J.: Disclosure control of microdata. J. Am. Stat. Assoc. 85(409), 38–45 (1990)
Cohen, A.: Attacks on deidentification’s defenses. In: Butler, K.R.B., Thomas, K. (eds.) 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, 10–12 August 2022, pp. 1469–1486. USENIX Association (2022). https://www.usenix.org/conference/usenixsecurity22/presentation/cohen
Cohen, A., Nissim, K.: Towards formalizing the GDPR’s notion of singling out. Proc. Natl. Acad. Sci. USA 117(15), 8344–8352 (2020). https://doi.org/10.1073/pnas.1914598117
Dalenius, T.: Finding a needle in a haystack or identifying anonymous census records. J. Official Stat. 2(3), 329 (1986)
De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3(1), 1–5 (2013)
Fienberg, S.E., Makov, U.E.: Confidentiality, uniqueness, and disclosure limitation for categorical data. J. Official Stat. 14(4), 385 (1998)
Francis, P., Wagner, D.: Towards more accurate and useful data anonymity vulnerability measures. arXiv preprint arXiv:2403.06595 (2024)
Jarmin, R.S., et al.: An in-depth examination of requirements for disclosure risk assessment. Proc. Natl. Acad. Sci. 120(43), e2220558120 (2023)
Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18–21 May 2008, Oakland, California, USA, pp. 111–125. IEEE Computer Society (2008). https://doi.org/10.1109/SP.2008.33
Rocher, L., Hendrickx, J.M., De Montjoye, Y.A.: Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10(1), 1–9 (2019)
Ruggles, S., Van Riper, D.: The role of chance in the census bureau database reconstruction experiment. Popul. Res. Policy Rev. 41, 781–788 (2022). https://doi.org/10.1007/s11113-021-09674-3
Sánchez, D., Martínez, S., Domingo-Ferrer, J.: Comment on “unique in the shopping mall: on the reidentifiability of credit card metadata”. Science 351(6279), 1274–1274 (2016)
Sweeney, L.: Simple demographics often identify people uniquely. Health (San Francisco) 671(2000), 1–34 (2000)
Acknowledgments
Work of K.N. was supported by NSF Grant No. CCF2217678 “DASS: Co-design of law and computer science for privacy in sociotechnical software systems” and a gift to Georgetown University. Work completed while K.N. visited Bocconi University, Milan.
A Isolating with a Partial Knowledge of P
The analysis in Sects. 3.1 and 3.2 assumed that the information receiver has perfect knowledge of the underlying probability measure P (but not \(\textbf{X}\) sampled from P). We now discuss what the receiver may do when they do not know P in full.
Observation 1 and Theorem 2 teach that all the information receiver needs is a partition of the data space into sets of probability weight \(p^* =\frac{1}{\max (n,k)}\), and Remark 3 suggests that a partition whose elements are merely close to the optimal weight suffices for the information receiver to succeed in isolating. We now develop these ideas.
Let \(\mathcal{C}=\{C_i\}_{i=1}^\ell \) be a partition of D where \(\ell =\max (n,k)\). The information receiver may choose the partition \(\mathcal{C}\) heuristically, in combination with their partial knowledge about P and the data domain D. For example, \(\mathcal{C}\) may partition D into high-dimensional rectangles, each described as the conjunction of one or more attribute ranges (e.g., all combinations of 5-year Age by Sex by City). Denote by \(p_i = P(C_i)\) the probability of an individual falling into \(C_i\). We show that if \(\underline{p} =(p_1,\ldots ,p_\ell )\) is close enough to \(p^*=(\frac{1}{\ell },\ldots ,\frac{1}{\ell })\) then (even without knowing \(p_1,\ldots ,p_\ell \)) the information receiver succeeds in isolating.
As \(\mathcal{C}\) is a partition of the data domain D we have that \(\sum _{i=1}^\ell p_i=1\). Hence, if we pick a partition element at random then, in expectation, its probability weight is exactly \(p^*=\frac{1}{\ell }\):

$$\mathop{\textbf{E}}_{i\sim U_\ell }[p_i] = \frac{1}{\ell }\sum _{i=1}^\ell p_i = \frac{1}{\ell },$$

where we use \(i\sim U_\ell \) to denote that the expectation is over choosing an element of the partition \(i\in \{1,\dots ,\ell \}\) uniformly at random.
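As a toy illustration of such a heuristic rectangle partition (the attributes, their ranges, and the uniform stand-in for P are our own hypothetical choices), the sketch below estimates the cell weights \(p_i\) from samples and checks that they sum to 1, so that a uniformly random cell has weight exactly \(\frac{1}{\ell}\) in expectation:

```python
import itertools
import random
import statistics

random.seed(0)

# Toy stand-in for P: Age uniform in 20..59, Sex in {0, 1}, City in 0..4.
def sample_individual():
    return (random.randint(20, 59), random.randint(0, 1), random.randint(0, 4))

# Rectangle partition C: 5-year age bands x Sex x City, ell = 8 * 2 * 5 = 80 cells.
cells = list(itertools.product(range(20, 60, 5), range(2), range(5)))
ell = len(cells)

def cell_of(x):
    age, sex, city = x
    return (age - (age - 20) % 5, sex, city)  # round age down to its 5-year band

# Estimate the cell weights p_i = P(C_i) empirically.
N = 100_000
counts = dict.fromkeys(cells, 0)
for _ in range(N):
    counts[cell_of(sample_individual())] += 1
weights = [counts[c] / N for c in cells]

# The weights sum to 1, so a uniformly random cell has mean weight 1/ell.
assert abs(sum(weights) - 1) < 1e-9
assert abs(statistics.mean(weights) - 1 / ell) < 1e-9
print(statistics.pstdev(weights))  # small compared to 1/ell = 0.0125 here
```

Under this (deliberately benign) choice of P the empirical spread of the weights is tiny compared to \(\frac{1}{\ell}\); for a skewed P the receiver would instead tune the rectangle boundaries to even out the weights.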
An important parameter of the partition is its standard deviation \(\sigma \):

$$\sigma ^2 = \mathop{\textbf{Var}}_{i\sim U_\ell }[p_i] = \mathop{\textbf{E}}_{i\sim U_\ell }\left[ \left( p_i-\frac{1}{\ell }\right) ^2\right] .$$
If the standard deviation \(\sigma \) is small compared to \(\frac{1}{\ell }\), say \(\sigma \le \frac{c}{\ell }\) for \(c\ll 1\), then many of the elements of the partition have weight \(p_i \approx \frac{1}{\ell }\). More precisely, by Chebyshev’s inequality (Footnote 16) we have:

$$\Pr _{i\sim U_\ell }\left[ \left| p_i-\frac{1}{\ell }\right| \ge \frac{1}{2\ell }\right] \le \frac{\sigma ^2}{(1/2\ell )^2} \le 4c^2.$$
Hence, if the information receiver samples guesses \(B_1,\ldots ,B_k\) from the partition \(\mathcal{C}\) without replacement, then in expectation at least \((1-4c^2)k\) of them satisfy \(P(B_i) \in \left[ \frac{1}{2\ell },\frac{3}{2\ell }\right] \). Using Eq. 1, each of these guesses results in isolation probability \(p^\textsf{iso}(B_i) \ge n\cdot \min (\frac{1}{2\ell } \cdot (1-\frac{1}{2\ell })^{n-1}, \frac{3}{2\ell } \cdot (1-\frac{3}{2\ell })^{n-1})\), and in expectation the number of isolating guesses is at least

$$(1-4c^2)\,k\cdot n\cdot \min \left( \frac{1}{2\ell }\left( 1-\frac{1}{2\ell }\right) ^{n-1},\ \frac{3}{2\ell }\left( 1-\frac{3}{2\ell }\right) ^{n-1}\right) .$$
As an example, if \(c=\frac{1}{4}\) and \(k=n\) (hence \(\ell =k=n\)) we get that in expectation the number of successful isolations is at least

$$\frac{3}{4}\cdot n\cdot \min \left( \frac{1}{2}\left( 1-\frac{1}{2n}\right) ^{n-1},\ \frac{3}{2}\left( 1-\frac{3}{2n}\right) ^{n-1}\right) \approx \frac{3}{8}e^{-1/2}\, n \approx 0.23\, n.$$

I.e., in expectation almost a quarter of the guesses result in successful isolations, in spite of the \(B_i\) not being chosen optimally.
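The bound in this example can be checked numerically; a minimal sketch (the helper name is ours), instantiated with \(n = k = 1000\) and \(c = \frac{1}{4}\):

```python
def expected_isolations_lower_bound(n, k, c):
    """Lower bound on the expected number of isolating guesses when k
    guesses are drawn from a partition of ell = max(n, k) cells whose
    weights have standard deviation at most c/ell."""
    ell = max(n, k)
    chebyshev_fraction = 1 - 4 * c**2  # cells with p_i in [1/(2*ell), 3/(2*ell)]
    p_iso = n * min(
        (1 / (2 * ell)) * (1 - 1 / (2 * ell)) ** (n - 1),
        (3 / (2 * ell)) * (1 - 3 / (2 * ell)) ** (n - 1),
    )
    return chebyshev_fraction * k * p_iso

n = 1000
bound = expected_isolations_lower_bound(n, k=n, c=0.25)
print(bound / n)  # ~0.227: almost a quarter of the guesses isolate
```

Note that the bound degenerates to 0 once \(c \ge \frac{1}{2}\), i.e., Chebyshev gives nothing when the partition weights are too spread out.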
Remark 6
Cohen and Nissim [5] used hashing to create a structure that is equivalent to a partition \(\mathcal{C}\) where \(p_i\) is very close to \(\frac{1}{\ell }\) (assuming P has sufficient min-entropy). The main qualitative difference between the hashing approach and the one described in this work is that hashing destroys the structure of the data domain and makes it harder for the information receiver to make effective use of isolation (e.g., as a step towards a linkage attack) whereas in the approach described herein the information receiver may choose partitions \(\mathcal{C}\) that are more suitable for their purposes.
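A minimal sketch of the hashing idea (our own illustration, not the construction of [5]): assign each record to one of \(\ell\) cells by a cryptographic hash, so that if P has sufficient min-entropy each cell receives weight close to \(\frac{1}{\ell}\), at the cost of destroying the attribute structure of the domain.

```python
import hashlib

def hash_cell(record: str, ell: int) -> int:
    """Map a record to one of ell cells via SHA-256. With enough
    min-entropy in P, each cell has weight close to 1/ell, but the
    cells no longer correspond to interpretable attribute ranges."""
    digest = hashlib.sha256(record.encode()).digest()
    return int.from_bytes(digest[:8], "big") % ell

# The resulting "guess" is a hash condition rather than a conjunction of
# attribute ranges -- harder to reuse, e.g., as a step in a linkage attack.
cell = hash_cell("age=27,salary=42000", 100)
print(cell)
```

By contrast, a rectangle partition keeps each guess in the form of Footnote 3, which the receiver can carry over directly to auxiliary data sources.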
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
D’Acquisto, G., Cohen, A., Naldi, M., Nissim, K. (2024). From Isolation to Identification. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_1