Bases de Datos
Bases de Datos
Bases de Datos
Alberto Speranzon
Abstract
In this paper, we cast the classic problem of achieving k-anonymity for a given database as
a problem in algebraic topology. Using techniques from this field of mathematics, we propose a
framework for k-anonymity that brings new insights and algorithms to anonymize a database.
We begin by addressing the simpler case when the data lies in a metric space. This case is
instrumental to introduce the main ideas and notation. Specifically, by mapping a database
to the Euclidean space and by considering the distance between datapoints, we introduce a
simplicial representation of the data and show how concepts from algebraic topology, such
as the nerve complex and persistent homology, can be applied to efficiently obtain the entire
spectrum of k-anonymity of the database for various values of k and levels of generalization.
For this representation, we provide an analytic characterization of conditions under which a
given representation of the dataset is k-anonymous. We introduce a weighted barcode diagram
which, in this context, becomes a computational tool to tradeoff data anonymity with data loss
expressed as level of generalization. Some simulations results are used to illustrate the main
idea of the paper. We conclude the paper with a discussion on how to extend this method to
address the general case of a mix of categorical and metric data.
Introduction
Recent times have seen a revolution in computing technologies. Third-party computing services,
such as the Cloud, have been creating new paradigms for both, data storage and computation. Such
technologies require one to repeatedly revisit the basic question of How do we protect the data
from a privacy perspective?. Although this problem originated in database theory and computer
science applications, it has recently expanded to domains such as systems and control. In control
systems, there is always an information flow between sensors/actuators and controllers/plants and
even between controllers in the case of distributed or cloud-based control. In critical cyber-physical
systems such as power or transportation networks, where information about users and utility companies are exchanged or information about multiple users is fused, privacy has become a critical
element to be considered [1]. Clearly, as the world is becoming more connected and loops are increasingly being closed over millions of sensors [2], privacy is becoming a top priority in the control
community.
In this paper, we consider static data collected within a database that, in the context of cyberphysical systems, could represent log/monitoring data that needs to be analyzed offline to determine
overall system performance, enterprise level fault detection and propagation, forensic analysis, etc.
Although several approaches have been proposed to address different aspects of the privatization
of data within a database, this paper provides a novel framework and perspective to address the
most classical version of the problem using concepts and tools from algebraic topology.
1.1
Literature
Among several methods that have been developed for database privacy, a classic and popular
approach is k-anonymity, which is a mechanism for protecting privacy of individuals represented
as entries in a database [3]. The idea consists of removing all attributes from a database that
can be used as unique identifiers but retain a set of attributes, called quasi-identifiers, for which
identification may be possible but these are necessary for analysis. To such quasi-identifiers ZIP
code, date of birth, GPS location, energy usage, data/time, etc. one applies a transformation
that generalizes their value so that data records become indistinguishable. More precisely, for a
given value of k, the original database is modified so that at least kindividuals in the database
have identical quasi-identifiers. This is achieved by generalizing numeric or text attributes: for
example, the ZIP code can be generalized so that a certain number of the least significant digits are
suppressed as 46532 465, age could be generalized to intervals, 35 [30, 40], and the gender
could be generalized as {M, F } P erson.
The problem of computing an optimal k-anonymous version of a database has been shown to
be NP-hard [4]. However, efficient algorithms such as Incognito [5] and its variants or greedy
clustering-based algorithms [6] have been proposed to achieve k-anonymity. A multi-dimensional
extension of the greedy approaches has been addressed in [7], which results into a representation of
the database that is reminiscent of classic grid-based paintings by Mondrian. In multi-dimensional
settings, a data aggregation scheme based on Hilbert curves has been proposed in [8].
Algebraic topology is a branch of mathematics that leverages tools and concepts from abstract
algebra to study topological spaces. For example, a simple model for sensor network is a set of
points in a (multidimensional) space in which two points (or sensors) are neighbours if they are
within a specified distance of each other. Then, concepts from algebraic topology, such as homology,
have been used to detect holes in sensor networks [9, 10]. Distributed algorithms to localize holes
in sensor networks using related concepts have been addressed in [11] and in [12]. Recently such
methods have been also used for filtering and position estimation in [13, 14].
One concept of privacy which has been applied to several control problems is that of differential
privacy [15]. Informally, this concept means that for a given database, if any single individual is
removed from the database, then no output of a computation run on the database would become
significantly more or less likely. This concept has been applied to achieve differential privacy of
Kalman filtering and estimation problems [16], to ensure a level of truthfulness in electric vehicle
charging applications [17], to achieve average consensus in a private manner [18], to name a few.
While the concept of differential privacy is very general and is applicable to dynamic databases, the
resulting mechanism relies very strongly on the type of function/query that needs to be computed
on the database. In contrast, k-anonymity is a static concept but is independent of any computation
to be carried out on the database and therefore suitable in the context of offline analysis. That said,
there are situations when k-anonymity is not sufficient and individuals can be re-identified despite
anonymization. Approaches to deal with such cases have been considered such as `-diversity [19]
and t-closeness [20]. Even though such advanced concepts have been proposed and privacy metrics
continue to be an active research area, k-anonymity is still widely used.
Extensions of the proposed approach to other metrics will be a subject of future study.
1.2
Contributions
This paper introduces a novel perspective to data privacy based on algebraic topology. In particular,
we address the case when the data lies in a metric space. By defining two datapoints that lie within
a specified radius as neighbours, we show that the representation falls within the natural setting
of a Cech
(or in general, a Nerve) complex [21]. By increasing the radius (generalization), we
1.3
This paper is organized as follows. The problem formulation is presented in Section 2. Background
results and concepts from algebraic topology are reviewed in Section 3. The proposed approach
is presented in Section 4 for numerical data along with some simulation results. Extension of the
method described in Section 4 to mixed categorical and numerical data is discussed in Section 5.
Problem formulation
Let us consider a database table T (A1 , A2 , . . . , Am ) consisting of N rows, where each Ai D are
various attributes that in general can take the form of numeric and/or categorical values, i.e., the
domain D can either be a set of discrete or continuous values. Without loss of generality, we can
identify with QT = {A1 , A2 , . . . , Ad } a set of d attributes that we define as quasi-identifiers, namely
attributes that can be joined with external information/databases so that private information can
be obtained. Typical examples of private data that could be obtained are names of individuals,
salaries, etc.
Another database table T(A1 , A2 , . . . , Am ) consisting of N rows is said to be a generalization
trivial method would completely destroy the information content in the original database. The
problem is how to minimize such as over-generalization of the quasi-identifiers. This notion will be
made precise in Section 4.
In this section, we provide a summary of some useful concepts from algebraic topology that will be
used in our approach. Interested readers may refer to [21] and [24] for additional details on these
concepts.
3.1
Simplicial Homology
3.2
Persistent Homology
Let us consider a sequence of complexes C with = {1 , 2 , . . . , M }, more specifically the sequence
of Cech
complexes {C i }N
i=1 , for increasing i R0 , i j for i 6= j. There are clearly inclusion
maps between such complexes
C 1 , C 2 , , C M 1 , C M .
Rather than studying the homology of each complex for each value of the parameter , one can then
study the homology of the inclusions : H (C i ) H (C j ) for i < j. Such maps are important as
they capture topological features that persist over the parameter space.
The dimension of the homology groups as a function of the single parameter can be plotted in
a diagram, called the barcodes diagram, see [24]. We show a simple example in Figure 1 where is
4
In this section, we describe the algebraic topological approach toward achieving k-anonymity. In
this section, we will restrict our attention to the case of the attributes in the quasi-identifier set QT
being all numeric/continuous variables, such as Age, Salaries, Taxes paid in a year, etc. In this
case, we can clearly represent the set QT as a |QT |-dimensional vector. We will assume that all the
vectors are elements of a real vector space.
Figure 2 shows an example of a table T with two identifiers (Name and Last Name), three quasiidentifiers QT = {Age, Total Scholarship, Money Borrowed} and the sensitive column Current
Salary. As we show at the bottom of Figure 2, we can map the quasi-identifiers to a threedimensional vector space (three dimensional as |QT | = 3), where each entry in the table corresponds
to a point in R3 .
In the following, we will often refer to entries of the table T as points due to the previously
described representation. Also, note that without any loss of generality, we can consider the data
Figure 2: Example of a table where the first two columns are identifiers, the last column is the
sensitive data and the middle three are the quasi-identifiers. Each entry in the table can be mapped,
through the quasi-identifiers, to a point in the three-dimensional vector space, R3 .
to take values within the hypercube M = [0, 1]|QT | , since one can always normalize the data
accordingly.
Figure 3: In (a), we show a anonymity 4-simplex representing the fact that there exists an such
that the 4 points can be anonymized. In (b), we show an example where there is no anonymity
4-simplex (indeed we have only two 2-simplices) as the selected is not large enough, or equivalently
the generalization is high enough, to ensure 4-anonymity.
Lemma 4.1 Given a set of points p = {pi }N
1 corresponding to N rows of the table T , given a global
generalization and the corresponding anonymity complex C(p), we have that the points {pi }N
1 have
the k-anonymity property if and only if
[
C(p) =
S`i
i
where S`i is the i-th anonymity `i -simplex with `i k for i N0 . We say in this case that the
anonymity complex achieves k-anonymity.
Proof: The points {pi }N
1 have the k-anonymity property if and only if we can sub-divide the set of
points into subsets such that
i1
i2
N
{pi }N
1 = {pi }1 {pi }i1 +1 {pi }iN 1 +1 ,
| {z } | {z }
| {z }
`1
`2
`N
and to each subset `i (p), we can associate an anonymity `i -simplex S`i with `i k for any i (for
given fixed ) and S`i S`j = for i 6= j.
Given the previous set relations, we have that complex C associated with the set of points {pi }
is then given by the union of the S`i .
This result then establishes a natural connection between the properties of the anonymity
complex, in terms of some of its subcomplexes, and the k-anonymity property.
We further explore how topological properties of the anonymity complex are related to the kanonymity property and how we can leverage that to find an optimal generalization. Let us first
establish some topological properties of C and then define explicitly what we mean with optimal
generalization.
Proposition 4.1 An anonymity complex C, for a given , has the k-anonymity property if and
only if its homology groups Hn (C) are trivial for any n > 0, and every connected components is an
`-simplex with ` k. Furthermore, when this is the case the number of equivalence classes generated
by using the generalization is given by the dimension of the zero-th homology, dim H0 (C).
Proof: (If) From Lemma 4.1, we know that we can decompose C into
L a finite number of disjoint
anonymity k-simplices. It is known [21] that in this case Hn (C) = i Hn (S`i ), namely the n-th
homology of C is given by the direct sum of the n-th homology of the anonymity k-simplices. As
the k-simplices are simply connected spaces and contractible, they have trivial high order (n > 0)
homology groups and H0 (S`i ) Z for every i.
(Only if) Let us assume that C has Hn (C) = {0} for n > 0, and H0 (C) is non-trivial, and
in particular let us assume that H0 (C) Rs . This means that the complex C has s connected
components. From the hypothesis that the connected components are `-simplices with ` k,
we know that each component is a anonymity simplex. Thus the anonymity complex C has the
k-anonymity property.
We are now able to connect topological properties of the anonymity complex with the kanonymity property. Of course, as it can be seen from Proposition 4.1, the result still depends
on , namely the generalization.
When anonymizing a dataset, one is typically interested in corrupting the data by the least
amount. Indeed, if one carefully thinks about the k-anonymity problem, it is always possible to find
a large enough k that makes the data anonymous, i.e., if one makes the extreme choice of k = n,
then the entire data will be in one equivalence class and thus, k-anonymity will be achieved. The
issue with this is that the information contained in the data will be completely lost.
In the context of this paper, as the generalization is parametrized by , we are interested for
find the smallest value of that gives k-anonymity. We then have the following definition.
Definition 4.3 (Minimal Anonymity Complex) Given an anonymity complex C(p) associated to a set of points, lets us denote with C the anonymity complex for a given generalization .
We define as minimal anonymity complex the following object
C = min C ,
Age
25
22
24
43
52
38
47
36
32
ZIP Code
47677
47602
47678
47905
47909
47906
47605
47673
47607
Salary
$47,000
$32,000
$52,000
$151,000
$145,000
$98,000
$110,000
$92,000
$115,000
Age
[22-25]
[22-25]
[22-25]
[38-52]
[38-52]
[38-52]
[32-47]
[32-47]
[32-47]
ZIP Code
[47602-47678]
[47602-47678]
[47602-47678]
[47905-47909]
[47905-47909]
[47905-47909]
[47605-47603]
[47605-47603]
[47605-47603]
Figure 4: Weighted barcode showing the full spectrum of anonymity regimes. Although the barcode is
only a single diagram, we split it here into three diagrams where we only show the simplices that meet the
requirements of the k-anonymity indicated underneath each figure. With red bars we show regimes that
cannot achieve 2,3,4-anonymity. These correspond to either situations where not all the k-simplices admit
trivial higher order homology groups, as is the case for 3-anonymity between [0.11, 0.13] and [0.17, 0.19], not
all the simplices are `-simplices with ` k. This happens for example in 2-anonymity in the interval [0, 0.08]
and between [0.19, 0.4] for 4-anonymity. We only show here dim H0 and dim H1 as higher order homology
groups are trivial for this simple example.
Remark 4.1 (Advantages of proposed approach) The main advantage of the method proposed is that it enables us not only to find a k-anonymization for a fixed k (if it exists), but also
to provide alternative regimes that can be especially useful when k-anonymity cannot be achieved
for the given k. This generally is not something that other algorithms, such as [5, 7, 8] directly
provide. Indeed, one would need to run the same algorithms for various values of k to obtain the
same tradeoff picture as we obtain. The persistent diagram we consider in this paper is instead
computed in one shot from the filtration {C }. Furthermore, we leverage very scalable algorithms
for such computation based on discrete Morse theory, see [25, 22]. Also, note that not only the
persistent diagram allows us to determine the right regime that gives the desired k-anonymity, but
we can also look at the other important tradeoff parameter such as the number of classes. For a
given k, it is indeed possible to find various generalizations that meet the k-anonymity requirement, however some might lead to equivalence classes with many more than k elements that is in
general not desirable.
The methodology we have described in the previous section has clearly nice properties but has
some limitations. First, it is restricted to numerical attributes where the notion of a radius is well
10
(a) Example of generalization for two set of labels. On the top for countries
and on the bottom for gender.
Figure 5: In (a) we show two generalization trees and in (b) we show the corresponding generalization lattice.
defined and thus it would not work in the case where attributes are categorical in nature, such
as, for example, strings, labels, social security numbers, etc. Second, growing balls (or polytopic
approximations) in high dimensional spaces and determining intersections can be computationally
challenging. In this section, we discuss an extension that leads to weighted persistent diagrams as
the one described in the previous section and that can be used to explore anonymity tradeoffs.
For anonymization of categorical data, one needs to specify generalization trees that enforce
a certain partial order between the various generalizations. Let us consider a simple example
where there are two attributes: Countries and Gender. Examples of generalization tree are shown
in Figure 5a where, as one moves from leaves to the root, the generalization becomes increasingly coarser, see for example [5]. We thus have that {P ortugal, Spain} {W est Europe} and
{W est Europe, East Europe} {Europe}, etc. or where {M ale, F emale} {P erson}. We may
capture the generalization of numerical values within trees as well.
Formally, we have that a generalization tree is a map T : N [0, r] A where N is the
set all nodes in the trees (including root and leaves) and [0, r] is the level of the generalization.
For example, T ({U SA, Canada}, 2) = America. For this setting, we can extend the notion of
anonymity complex as follows.
Definition 5.1 (Generalized Anonymity Complex) Given a table T with N rows and a set
of quasi-identifiers QT as well as a set of generalization trees Tk for k = 1, . . . , |QT |, we define the
generalization anonymity complex as the simplicial complex whose k-simplices are determined by
(k + 1) |QT |-dimensional tuples {Ai }ki=1 such that there exists a r such that Ti (Ai , r) = Tj (Aj , r).
The definition of a generalized anonymity k-simplex can be obtained as well. For example,
{P ortugal, Spain, Hungary} forms a generalized anonymity 3-simplex for the generalization level
2, considering the tree in Figure 5a.
Definition 5.2 (Generalization lattice) Given a |QT |-dimensional tuple representing attributes
and the associated trees Tk for each attribute, we can construct a directed graph (lattice) where the
11
C 10
C 11
C 00
C 01
C 12
C 13
(1)
C 02
C 03
13
12
11
10
00
01
02
03
C
| , C ,
{z C , C } C
{z C , C } .
| , C ,
C 003
C 103
Note that the middle map is not an inclusion. It is possible to still compute persistent homology
of the full chain by joining the two subsequences of spaces C 003 and C 103 through their
union as follows: In this case, by using Mayer-Vietoris pyramid principle [27], we can compute the
003
C 03 C 10
C 103
persistent homology of the chain and by passing through the union we obtain the barcodes for the
initial chain [27]. Given the barcodes we are now in the same situation as in the previous section
where we can search for different k-anonymity regimes.
Note however that the previous barcode would be different than the one built for the following
sequence, say C 00 , C 10 , C 01 , C 11 , C 02 , C 12 , C 03 , C 13 . In this case we would need to apply MayerVeitoris pyramid principle multiple times:
1
We denote with C 00 the generalized anonymity complex corresponding to the generalization < G0 , C0 >, as in
Figure 5b
12
C 10 C 01
00
10
C 01
C 11
C 13
C 03
C 12
C 11 C 02
C 02
C 03 C 12
An alternative approach to compute the barcodes for (1), is via an approximation. Specifically,
one can consider the persistence equivalence theorem [28, Section 2.4.2] where, given the commutative diagram in (1), one can compute the persistent homology for the lower and upper chains.
Because of the inclusion maps we have that the persistent homology modules of the two chains
respect the inclusions. We can then proceed as follows: we first compute the persistent homology
of the lower chain in the commutative diagram (1) and if it achieves the desired k-anonymity, then,
because of the inclusion maps, such k-anonymity can be achieved also by the upper chain in (1),
thus we do not need to compute the barcodes for the upper chain and stop after obtaining the
barcodes for the lower one.
However, if k-anonymity is not achieved considering the lower chain, then persistent homology and corresponding barcodes need to be computed for the upper chain. Note that as we are
approximating a multi-dimensional persistent homology by a sequence of (independent) persistent homology computation, if k-anonymity cannot be achieved following such process we cannot
conclude that there is not a way to k-anonymize the data. This is a consequence of the natural
complexity of the anonymity problem.
Conclusion
This paper introduced a new perspective to k-anonymity in data privacy based on algebraic topology. In particular, we addressed the case when the data lies in a metric space. We demonstrated
how tools such as persistence homology can be applied to efficiently obtain the entire spectrum of
k-anonymity of the database for various values of the radius of proximity. For this representation,
we provided an analytic characterization of conditions under which a given representation of the
dataset is k-anonymous. Finally, we discussed how this method can be extended to address the
general case of a mix of categorical and metric data.
In future, it would be interesting to investigate other notions of privacy using these tools. In
particular, applicability of such techniques to dynamic databases would be an interesting case which
naturally arise in control applications.
Acknowledgments
The authors would like to thank Vidit Nanda for discussions on zig-zag persistent homology and
the Perseus software used in this paper to compute persistent homology.
13
References
[1] Terraswarm Theme 3: Services, applications and cloud interactions, http://www.
terraswarm.org/services/, accessed 2015-09-27.
[2] Design technology for the trillion-device future, http://youtu.be/ViJ3SH5t4Ys, presented
at the DARPA Wait, What? conference, St. Louis, MO, 2015.
[3] L. Sweeney, k-anonymity: A model for protecting privacy, Int. J. of Uncertainty, Fuzziness
and Knowledge-Based Systems, vol. 10, no. 05, pp. 557570, 2002.
[4] A. Meyerson and R. Williams, On the complexity of optimal k-anonymity, in ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2004, pp. 223228.
[5] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, Incognito: Efficient full-domain kanonymity, in ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 4960.
[6] J. Byun, A. Kamra, E. Bertino, and N. Li, Efficient k-anonymization using clustering techniques, in Advances in Databases: Concepts, Systems and Applications, e. a. Kotagiri, R.,
Ed. Springer Berlin Heidelberg, 2007.
[7] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, Mondrian multidimensional k-anonymity,
in Int. Conf. on Data Engineering, 2006.
[8] Y.-K. Kim, H. Lee, M. Yoon, and J.-W. Chang, Hilbert-curve based data aggregation scheme
to enforce data privacy and data integrity for wireless sensor networks, Int. J. of Distr. Sensor
Networks, 2013.
[9] R. Ghrist and A. Muhammad, Coverage and hole-detection in sensor networks via homology,
in Int. Symp. on Information Processing in Sensor Networks, 2005.
[10] V. De Silva and R. Ghrist, Homological sensor networks, Notices of the American Mathematical Society, vol. 54, no. 01, pp. 1017, 2007.
[11] A. Muhammad and M. Egerstedt, Control using higher order Laplacians in network topologies, in Int. Symp. of Mathematical Theory of Networks and Systems, 2006, pp. 10241038.
[12] A. Tahbaz-Salehi and A. Jadbabaie, Distributed coverage verification in sensor networks
without location information, IEEE Trans. on Aut. Control, vol. 55, no. 8, pp. 18371849,
2010.
[13] J. Derenick, A. Speranzon, and R. Ghrist, Homological sensing for mobile robot localization,
in IEEE Int. Conf. on Robotics and Automation, 2013, pp. 572579.
[14] R. Ghrist, D. Lipsky, J. Derenick, and A. Speranzon, Topological landmark-based navigation
and mapping, 2012. [Online]. Available: http://www.math.upenn.edu/ghrist/preprints/
landmarkvisibility.pdf
[15] C. Dwork and A. Roth, The algorithmic foundations of differential privacy, Found. and
Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211407, 2013.
[16] J. Le Ny and G. J. Pappas, Differentially private filtering, IEEE Trans. on Aut. Control,
vol. 59, no. 2, pp. 341354, 2014.
14
[17] S. Han, U. Topcu, and G. J. Pappas, Approximately truthful mechanism for electric vehicle
charging via joint differential privacy, in American Control Conference, 2015.
[18] Y. Mo and R. Murray, Privacy preserving average consensus, in IEEE International Conference on Decision and Control, 2014.
[19] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, `-diversity: Privacy
beyond k-anonymity, ACM Trans. Knowl. Discov. Data, vol. 1, no. 1, 2007.
[20] N. Li, L. T., and S. Venkatasubramanian, t-closeness: Privacy beyond k-anonymity and
`-diversity, in IEEE Int. Conf. on Data Engineering, 2007, pp. 106115.
[21] R. Ghrist, Elementary Algebraic Topology, 1st ed.
Createspace.
15