From Isolation to Identification

  • Conference paper
Privacy in Statistical Databases (PSD 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14915)


Abstract

We present a mathematical framework for understanding when successfully distinguishing a person from all other persons in a data set—a phenomenon we call isolation—may enable identification, a notion central to deciding whether a release based on the data set is subject to data protection regulation. We show that a baseline degree of isolation is unavoidable, in the sense that isolation can typically happen with high probability even before any release is made about the data set, and hence does not enable identification. We then describe settings where isolation resulting from a data release may enable identification.


Notes

  1. Regulation (EU) 2016/679 (General Data Protection Regulation), Article 4.

  2. For variations and sources, see https://csrc.nist.gov/glossary/term/identification.

  3. For example: \((25\le \text{ Age } \le 28) \wedge (10,000 \le \text{ Salary } \le 50,000)\).

  4. General Data Protection Regulation, Recital 26. See also: Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques.

  5. See the citations in [12] for more.

  6. “If an individual is not present in a dataset, and is independent of all other individuals in the dataset, then the release of that dataset does not violate that individual’s privacy.”

  7. Our results are robust to a substantial relaxation of these assumptions; in particular, knowledge of the distribution P and the number of records n may be approximate. See Remark 3 and Appendix A.

  8. Query access to the release mechanism can also be used in other ways; see Remark 4 below.

  9. Access to alternative sources of information may also be used to boost the information receiver’s confidence that a guess B isolates.

  10. The number of records n is included as part of the receiver’s prior knowledge since n can often be inferred (exactly or approximately) from public information. Examples include (i) a survey design specifying n that was made public prior to the collection of information, and (ii) an n that was made public in previous surveys (e.g., a census). For a reader who considers n as part of the release, this section should be understood as demonstrating that the release of n alone suffices for isolation.

  11. We exclude \(P(B) \ge \frac{1}{n}\), as the chance that B uniquely distinguishes an individual in a population of size \(N = n/r\) is extremely small. Namely, \((1-P(B))^{N-n} \le e^{-(1-r)/r}\). Taking \(r = 0.01\), say, the probability is about \(10^{-43}\).

  12. Accounting for the variance of P(B) yields only an insignificant improvement.

  13. E.g., for Governor Weld’s attributes used by Sweeney, they estimated \((1-P(B))^{N-1} \approx 0.58\).

  14. The proof of this fact is somewhat nuanced, as \(A_i\) can depend arbitrarily on the dataset \(\textbf{X}\).

  15. See https://www.slideserve.com/ordell/razi-mukatren-golan-salman and https://archive.is/W20kx.

  16. Chebyshev’s inequality: \(\Pr \left[ |X-\textbf{E}[X]|\ge k\right] \le \frac{\textbf{Var}[X]}{k^2}\).

References

  1. Altman, M., Cohen, A., Nissim, K., Wood, A.: What a hybrid legal-technical analysis teaches us about privacy regulation: the case of singling out. B.U. J. Sci. Tech. L. 27, 1 (2021)

  2. Barth-Jones, D.: The ‘re-identification’ of Governor William Weld’s medical information: a critical re-examination of health data identification risks and privacy protections, then and now (2012)

  3. Bethlehem, J.G., Keller, W.J., Pannekoek, J.: Disclosure control of microdata. J. Am. Stat. Assoc. 85(409), 38–45 (1990)

  4. Cohen, A.: Attacks on deidentification’s defenses. In: Butler, K.R.B., Thomas, K. (eds.) 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, 10–12 August 2022, pp. 1469–1486. USENIX Association (2022). https://www.usenix.org/conference/usenixsecurity22/presentation/cohen

  5. Cohen, A., Nissim, K.: Towards formalizing the GDPR’s notion of singling out. Proc. Natl. Acad. Sci. USA 117(15), 8344–8352 (2020). https://doi.org/10.1073/pnas.1914598117

  6. Dalenius, T.: Finding a needle in a haystack or identifying anonymous census records. J. Official Stat. 2(3), 329 (1986)

  7. De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3(1), 1–5 (2013)

  8. Fienberg, S.E., Makov, U.E.: Confidentiality, uniqueness, and disclosure limitation for categorical data. J. Official Stat. 14(4), 385 (1998)

  9. Francis, P., Wagner, D.: Towards more accurate and useful data anonymity vulnerability measures. arXiv preprint arXiv:2403.06595 (2024)

  10. Jarmin, R.S., et al.: An in-depth examination of requirements for disclosure risk assessment. Proc. Natl. Acad. Sci. 120(43), e2220558120 (2023)

  11. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), 18–21 May 2008, Oakland, California, USA, pp. 111–125. IEEE Computer Society (2008). https://doi.org/10.1109/SP.2008.33

  12. Rocher, L., Hendrickx, J.M., De Montjoye, Y.A.: Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10(1), 1–9 (2019)

  13. Ruggles, S., Van Riper, D.: The role of chance in the census bureau database reconstruction experiment. Popul. Res. Policy Rev. 41, 781–788 (2022). https://doi.org/10.1007/s11113-021-09674-3

  14. Sánchez, D., Martínez, S., Domingo-Ferrer, J.: Comment on “Unique in the shopping mall: on the reidentifiability of credit card metadata”. Science 351(6279), 1274 (2016)

  15. Sweeney, L.: Simple demographics often identify people uniquely. Health (San Francisco) 671(2000), 1–34 (2000)


Acknowledgments

Work of K.N. was supported by NSF Grant No. CCF2217678 “DASS: Co-design of law and computer science for privacy in sociotechnical software systems” and a gift to Georgetown University. Work completed while K.N. visited Bocconi University, Milan.

Author information

Correspondence to Aloni Cohen.

A Isolating with a Partial Knowledge of P

The analysis in Sects. 3.1 and 3.2 assumed that the information receiver has perfect knowledge of the underlying probability measure P (but not \(\textbf{X}\) sampled from P). We now discuss what the receiver may do when they do not know P in full.

Observation 1 and Theorem 2 show that all the information receiver needs is a partition of the data space into sets of probability weight \(p^* =\frac{1}{\max (n,k)}\), and Remark 3 suggests that for the information receiver to succeed in isolating, it suffices that the partition weights are close to this optimal weight. We now develop these ideas.

Let \(\mathcal{C}=\{C_i\}_{i=1}^\ell \) be a partition of D where \(\ell =\max (n,k)\). The information receiver may choose the partition \(\mathcal{C}\) heuristically, combining their partial knowledge of P with the structure of the data domain D. For example, \(\mathcal{C}\) may partition D into high-dimensional rectangles, each described as the conjunction of one or more attribute ranges (e.g., all combinations of 5-year Age by Sex by City). Denote by \(p_i = P(C_i)\) the probability of an individual falling into \(C_i\). We show that if \(\underline{p} =(p_1,\ldots ,p_\ell )\) is close enough to \(p^*=(\frac{1}{\ell },\ldots ,\frac{1}{\ell })\) then (even without knowing \(p_1,\ldots ,p_\ell \)) the information receiver succeeds in isolating.
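As a minimal sketch of such a heuristic partition: the snippet below cuts a single numeric attribute at empirical quantiles of a reference sample, so that each cell gets weight close to \(\frac{1}{\ell }\). The attribute name (Age), the Gaussian shape of the distribution standing in for P, and all numeric parameters are illustrative assumptions, not taken from the paper.

```python
import bisect
import random

random.seed(0)
ell = 100  # number of partition cells

# Proxy for the receiver's partial knowledge of P: a large reference
# sample of a single numeric attribute (Gaussian "Age" is an assumption).
reference = sorted(random.gauss(40, 12) for _ in range(200_000))

# Cut the attribute range at empirical quantiles so that each cell
# C_i = [cuts[i-1], cuts[i]) gets probability weight p_i close to 1/ell.
cuts = [reference[len(reference) * i // ell] for i in range(1, ell)]

def cell_index(x):
    """Index i of the cell C_i that x falls into."""
    return bisect.bisect_right(cuts, x)

# Sanity check: estimate the weights p_i on a fresh sample from the same
# distribution; they should all be close to 1/ell = 0.01.
counts = [0] * ell
for _ in range(200_000):
    counts[cell_index(random.gauss(40, 12))] += 1
p = [c / 200_000 for c in counts]
print(min(p), max(p))  # both near 0.01
```

With several attributes, the same quantile idea can be applied per attribute to form the high-dimensional rectangles mentioned above.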

As \(\mathcal{C}\) is a partition of the data domain D we have that \(\sum _{i=1}^\ell p_i=1\). Hence, if we pick a partition element at random, then, in expectation, its probability weight would be exactly \(p^*=\frac{1}{\ell }\):

$$\mathop {\textbf{E}}_{i\sim U_\ell } \left[ p_i\right] = \sum _{j=1}^\ell \mathop {\Pr }\limits _{i\sim U_\ell }[i=j]\cdot p_j = \frac{1}{\ell }\sum _{j=1}^\ell p_j = \frac{1}{\ell },$$

where we use \(i\sim U_\ell \) to denote that the expectation is over choosing an element of the partition \(i\in \{1,\dots ,\ell \}\) uniformly at random.

An important parameter of the partition is its standard deviation \(\sigma \):

$$\sigma ^2 := \mathop {\textbf{Var}}_{i\sim U_\ell }\left[ p_i\right] =\mathop {\textbf{E}}_{i\sim U_\ell }\left[ p_i^2\right] - \left( \mathop {\textbf{E}}_{i\sim U_\ell } \left[ p_i\right] \right) ^2 = \frac{1}{\ell }\sum _{i=1}^ \ell p_i^2 - \frac{1}{\ell ^2} = \frac{1}{\ell }\cdot \Vert \underline{p}\Vert _2^2 - \frac{1}{\ell ^2}. $$
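As a quick numeric check of this identity (the probability vector below is made up for illustration; any vector summing to 1 works):

```python
# Check that Var_{i ~ U_ell}[p_i] = (1/ell) * ||p||_2^2 - 1/ell^2
# for a probability vector p summing to 1.
ell = 4
p = [0.1, 0.2, 0.3, 0.4]
mean = sum(p) / ell                       # = 1/ell, since sum(p) = 1
var_direct = sum((pi - mean) ** 2 for pi in p) / ell
var_formula = sum(pi * pi for pi in p) / ell - 1 / ell ** 2
print(var_direct, var_formula)            # both ≈ 0.0125
```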

If the standard deviation \(\sigma \) is small compared to \(\frac{1}{\ell }\), say \(\sigma \le \frac{c}{\ell }\) for \(c\ll 1\), then many of the elements of the partition have weight \(p_i \approx \frac{1}{\ell }\). More precisely, by Chebyshev’s inequality (Footnote 16) we have:

$$\Pr \left[ p_i \not \in \left[ \frac{1}{2\ell },\frac{3}{2\ell }\right] \right] = \Pr \left[ \left| p_i-\mathop {\textbf{E}}_{i\sim U_\ell }\left[ p_i\right] \right| >\frac{1}{2\ell }\right] \le 4 c^2.$$

Hence, if the information receiver samples guesses \(B_1,\ldots ,B_k\) from the partition \(\mathcal{C}\) without replacement, then in expectation at least \((1-4c^2)k\) of them would satisfy \(p(B_i) \in \left[ \frac{1}{2\ell },\frac{3}{2\ell }\right] \). Using Eq. 1 each of these guesses would result in isolation probability \(p^\textsf{iso}(B_i) \ge n\cdot \min (\frac{1}{2\ell } \cdot (1-\frac{1}{2\ell })^{n-1}, \frac{3}{2\ell } \cdot (1-\frac{3}{2\ell })^{n-1})\), and in expectation the number of isolating guesses would be at least

$$(1-4c^2)\cdot \frac{kn}{2\ell } \cdot \min \left( (1-\frac{1}{2\ell })^{n-1}, 3\cdot (1-\frac{3}{2\ell })^{n-1}\right) .$$

As an example, if \(c=\frac{1}{4}\) and \(k=n\) (hence \(\ell =k=n\)) we get that in expectation the number of successful isolations is at least

$$\begin{aligned} & \left( 1-4\cdot \left( \frac{1}{4}\right) ^2\right) \cdot \frac{n^2}{2n} \cdot \min \left( \left( 1-\frac{1}{2n}\right) ^{n-1},3\cdot \left( 1-\frac{3}{2n}\right) ^{n-1}\right) \\ & \quad \approx \frac{3n}{8}\cdot \min \left( e^{-1/2},3e^{-3/2}\right) \approx 0.23n. \end{aligned}$$

That is, in expectation almost a quarter of the guesses would be successful isolations, despite \(B_i\) not being chosen optimally.
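The calculation above can be checked with a small simulation. The sketch below uses a toy setting (a uniform P over \(\ell = n\) equal-weight cells, i.e., the ideal case \(c=0\); the parameters n and the number of trials are illustrative assumptions): it repeatedly draws a dataset, treats each cell as a guess \(B_i\), and counts the guesses matched by exactly one record.

```python
import random

random.seed(1)

n = 500       # records; take k = ell = n as in the example above
trials = 200
total = 0
for _ in range(trials):
    # Draw a dataset of n records, each landing in one of ell = n
    # equal-weight cells (toy uniform distribution, c = 0).
    counts = [0] * n
    for _ in range(n):
        counts[random.randrange(n)] += 1
    # Each cell is one guess B_i; B_i isolates iff exactly one record
    # falls into it.
    total += sum(1 for c in counts if c == 1)

fraction = total / (trials * n)
print(fraction)  # close to (1 - 1/n)^(n-1) ≈ e^{-1} ≈ 0.37
```

With c = 0 the expected fraction of isolating guesses is \((1-\frac{1}{n})^{n-1} \approx e^{-1} \approx 0.37\), comfortably above the 0.23 lower bound derived above for \(c=\frac{1}{4}\).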

Remark 6

Cohen and Nissim [5] used hashing to create a structure that is equivalent to a partition \(\mathcal{C}\) where \(p_i\) is very close to \(\frac{1}{\ell }\) (assuming P has sufficient min-entropy). The main qualitative difference between the hashing approach and the one described in this work is that hashing destroys the structure of the data domain and makes it harder for the information receiver to make effective use of isolation (e.g., as a step towards a linkage attack) whereas in the approach described herein the information receiver may choose partitions \(\mathcal{C}\) that are more suitable for their purposes.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

D’Acquisto, G., Cohen, A., Naldi, M., Nissim, K. (2024). From Isolation to Identification. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_1


  • DOI: https://doi.org/10.1007/978-3-031-69651-0_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-69650-3

  • Online ISBN: 978-3-031-69651-0

