
US20070239982A1 - Method and apparatus for variable privacy preservation in data mining - Google Patents

Method and apparatus for variable privacy preservation in data mining

Info

Publication number
US20070239982A1
Authority
US
United States
Prior art keywords
data
privacy
group
data records
records
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/249,647
Inventor
Charu Aggarwal
Philip Yu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US11/249,647 priority Critical patent/US20070239982A1/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: AGGARWAL, CHARU C., YU, PHILIP SHI-LUNG
Priority to PCT/EP2006/066858 priority patent/WO2007042403A1/en
Publication of US20070239982A1 publication Critical patent/US20070239982A1/en
Priority to US12/119,766 priority patent/US8627070B2/en
Priority to US14/051,530 priority patent/US8966648B2/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/604: Tools and structures for managing or administering access control systems
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/6254: Protecting personal data by anonymising data, e.g. decorrelating personal data from the owner's identification



Abstract

Improved privacy preservation techniques are disclosed for use in accordance with data mining. By way of example, a technique for preserving privacy of data records for use in a data mining application comprises the following steps/operations. Different privacy levels are assigned to the data records. Condensed groups are constructed from the data records based on the privacy levels, wherein summary statistics are maintained for each condensed group. Pseudo-data is generated from the summary statistics, wherein the pseudo-data is available for use in the data mining application. Principles of the invention are capable of handling both static and dynamic data sets.

Description

  • This invention was made with Government support under Contract No.: H98230-04-3-0001 awarded by the U.S. Department of Defense. The Government has certain rights in this invention.
  • FIELD OF THE INVENTION
  • The present invention relates to data mining techniques and, more particularly, to variable privacy-preserving data mining techniques.
  • BACKGROUND OF THE INVENTION
  • Privacy preserving data mining has become an important issue in recent years due to the large amount of consumer data tracked by automated systems on the Internet. The proliferation of electronic commerce on the World Wide Web has resulted in the storage of large amounts of transactional and personal information about users. In addition, advances in hardware technology have also made it more feasible to track information about individuals from transactions in everyday life.
  • For example, a simple transaction such as using a credit card results in automated storage of information about user buying behavior. In many cases, users are not willing to supply such personal data unless its privacy is guaranteed. Therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy.
  • However, while there has been a considerable amount of focus on privacy-preserving data collection and mining methods in recent years, such methods assume homogeneity in the privacy levels of different entities.
  • Accordingly, it would be highly desirable to provide improved techniques for use in accordance with privacy-preserving data mining.
  • SUMMARY OF THE INVENTION
  • Principles of the invention provide improved privacy preservation techniques for use in accordance with data mining.
  • By way of example, one aspect of the invention comprises a technique for preserving privacy of data records for use in a data mining application comprising the following steps/operations. Different privacy levels are assigned to the data records. Condensed groups are constructed from the data records based on the privacy levels, wherein summary statistics are maintained for each condensed group. Pseudo-data is generated from the summary statistics, wherein the pseudo-data is available for use in the data mining application.
  • Advantageously, principles of the invention provide a new framework for privacy preserving data mining, in which the privacy of all records is not the same, but can vary considerably. This is often the case in many real applications, in which different groups of individuals may have different privacy requirements. Further, principles of the invention are capable of handling both static and dynamic data sets.
  • These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 illustrates a server architecture and network environment in accordance with which variable privacy-preserving data mining techniques may be employed, according to an embodiment of the present invention;
  • FIG. 2 illustrates a process for performing variable privacy preservation, according to an embodiment of the present invention;
  • FIG. 3 illustrates a process for creating condensed groups for privacy preservation, according to an embodiment of the invention;
  • FIG. 4 illustrates a process for performing cannibalization for condensation, according to an embodiment of the invention;
  • FIG. 5 illustrates a process for performing attrition for condensation, according to an embodiment of the invention; and
  • FIG. 6 illustrates a process for creating pseudo-data from condensed groups, according to an embodiment of the invention.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • The following description will illustrate the invention using an exemplary data processing system architecture. It should be understood, however, that the invention is not limited to use with any particular system architecture. The invention is instead more generally applicable to any data processing system architecture in which it would be desirable to provide variable privacy preservation in accordance with data mining techniques.
  • As used herein, the phrase “data stream” may generally refer to a continuous sequence of data over a given time period. By way of example, such a sequence of data may be generated by a real-time process which uses continuous data storage. However, it is to be understood that principles of the invention are not limited to any particular type of data set or type of data stream.
  • Further, the phrase “data point” (or point) is used herein interchangeably with the phrase “data record” (or record). By way of example only, in a demographic data set, a data point or record could refer to one or more attributes of an individual. For example, it could refer to a record containing age, sex, and/or salary, etc. On the other hand, the term “group” refers to a set of records which are similar. The similarity may be defined by a distance function. Thus, a group could be a set of individuals with similar demographic characteristics. However, the invention is not limited to these particular types of data points, data records, or groups.
  • A recent approach to privacy preserving data mining has been a condensation-based technique, as disclosed in C. C. Aggarwal and P. S. Yu, “A Condensation Based Approach to Privacy Preserving Data Mining,” Proceedings of the EDBT Conference, pp. 183-199, 2004. This technique essentially creates condensed groups of records which are then utilized in one of two ways. First, the statistical information in the pseudo-groups can be utilized to generate a new set of pseudo-data which can be utilized with data mining algorithms. Second, the condensed pseudo-groups can be utilized directly with minor modifications of existing data mining algorithms.
  • This condensation approach is also referred to as the k-indistinguishability model. A record is said to be k-indistinguishable when there are at least k records in the data (including itself) from which it cannot be distinguished. Clearly, when a record is 1-indistinguishable, it has no privacy. The k-indistinguishability of a record is achieved by placing it in a group with at least (k-1) other records. The condensation-based approach does not rely on domain-specific hierarchies, and the k-indistinguishability model can also work effectively in a dynamic environment such as that created by data streams.
  • However, in the k-indistinguishability model approach, it is assumed that all records have the same privacy requirement. In most practical applications, this is not always a reasonable assumption. For example, when a data repository contains records from heterogeneous data sources, it is rarely the case that each repository has the same privacy requirement. Similarly, in an application tracking the data for brokerage customers, the privacy requirements of retail investors are likely to be different from those of institutional investors. Even among a particular class of customers, some customers (such as high net-worth individuals) may desire a higher level of privacy than others.
  • In general, principles of the invention realize that it may be desirable to associate a different privacy level with each record in a data set. Thus, an illustrative embodiment of the invention, to be described herein, provides for variable privacy levels in a condensation-based privacy-preserving data mining methodology.
  • Let us assume that we have a database D containing N records. The records are denoted by X1 . . . XN. We denote the desired privacy level of record Xi by p(i). The process of finding condensed groups with a varying level of point-specific privacy makes the problem significantly more difficult from a practical standpoint. This is because it may not be advisable to pre-segment the data into different privacy levels before performing the condensation separately for each segment. When some of the segments contain very few records, such a condensation may result in an inefficient representation of the data. In some cases, the number of records for a given level of privacy k′ may be lower than k′. Clearly, it is not even possible to create a group containing only records with privacy level k′, since the privacy level of the entire group would then be less than k′. Therefore, it is not possible to create an efficient (and feasible) system of group condensation without mixing records of different privacy levels. This leads to a number of interesting trade-offs between information loss and privacy preservation. Principles of the invention provide algorithms that optimize such trade-offs.
  • In many cases, the data may be available at one time or it may be available in a more dynamic and incremental fashion. Thus, principles of the invention provide a methodology for performing the condensation when the entire data is available at one time, and a methodology for the case when the data is available incrementally. The latter is a more difficult case because it is often not possible to design the most effective condensation at the moment the data becomes available. It will be evident that, in most cases, the algorithm for performing the dynamic group construction is able to achieve results which are comparable to the algorithm for static group construction.
  • Before describing details of a condensation-based data mining algorithm for providing variable privacy preservation, we will discuss some notations and definitions. We assume that we have a set of N records, each of which contains d dimensions. We also assume that associated with each data point i, we have a corresponding privacy level p(i). The overall database is denoted by D whereas the database corresponding to the privacy level p is denoted by Dp.
  • The privacy level for a given record is defined as the minimum number of records in the data, including itself, from which it cannot be distinguished.
  • In the condensation-based approach, the data is partitioned into groups of records. Records within a given group cannot be distinguished from one another. For each group, certain summary statistics about the records are maintained. These summary statistics make it possible to apply data mining algorithms directly to the condensed groups of records. This information also suffices to preserve information about the mean and correlations across the different dimensions of the data. The size of a group may vary, but it is at least equal to the desired privacy level of each record in that group. Thus, a record with privacy level equal to p(i) may be condensed with records of privacy levels different from p(i). However, the size of that group is at least equal to the maximum privacy level of any record in that group, as the sketch below illustrates.
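  • As a concrete illustration (not part of the patent), this size invariant can be written as a one-line check; the function name is ours, and Python is used for all sketches in this section:

```python
def group_size_is_valid(group_size, member_privacy_levels):
    # Invariant from the text: a group's size must be at least the maximum
    # desired privacy level p(i) of any record placed in the group.
    return group_size >= max(member_privacy_levels)
```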
  • Each group of records is referred to as a condensed unit. Let G be a condensed group containing the records X_1 . . . X_k. Let us also assume that each record X_i contains the d dimensions, which are denoted by x_i^1 . . . x_i^d. The following information is maintained about each group of records G:
  • (i) For each attribute j, the sum of the corresponding values is maintained. This sum is given by Σ_{i=1}^k x_i^j. We denote the corresponding first-order sum by Fs_j(G). The vector of first-order sums is denoted by Fs(G).
  • (ii) For each pair of attributes i and j, the sum of the products of the corresponding attribute values is maintained. This sum is given by Σ_{l=1}^k x_l^i · x_l^j. We denote the corresponding second-order sum by Sc_{ij}(G). The vector of second-order sums is denoted by Sc(G).
  • (iii) The sum of the privacy levels of the records in the group is maintained. This number is denoted by Ps(G).
  • (iv) The total number of records k in that group is maintained. This number is denoted by n(G).
  • We note that these summary statistics can be used to construct a covariance matrix for the group, which is also maintained as part of the summary statistics. The covariance matrix is simply a d×d matrix whose (i,j)th entry is the covariance between dimensions i and j; it follows directly from the maintained sums as Cov_{ij}(G) = Sc_{ij}(G)/n(G) − Fs_i(G)·Fs_j(G)/n(G)². The covariance matrix is used in turn to create the pseudo-records for the group. As will be further explained below, in one embodiment, the pseudo-records (pseudo-data) are generated independently along each eigenvector. That is, records are generated with variance proportional to the corresponding eigenvalue along each eigenvector. The bookkeeping for these statistics is illustrated in the sketch below.
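  • To make the bookkeeping concrete, the following hypothetical Python sketch (class and method names such as CondensedGroup are ours, not the patent's) maintains Fs(G), Sc(G), Ps(G), and n(G) additively and derives the centroid, covariance matrix, SSQ, and average privacy level from them:

```python
import numpy as np

class CondensedGroup:
    """Per-group summary statistics, maintained additively (a sketch, not the
    patent's code): Fs(G), Sc(G), Ps(G), and n(G) as defined in the text."""

    def __init__(self, d):
        self.fs = np.zeros(d)           # Fs(G): first-order sums, one per attribute
        self.sc = np.zeros((d, d))      # Sc(G): second-order (pairwise product) sums
        self.ps = 0                     # Ps(G): sum of members' privacy levels
        self.n = 0                      # n(G): number of records in the group

    def add(self, x, p):
        """Fold record x (a length-d vector) with privacy level p into the group."""
        x = np.asarray(x, dtype=float)
        self.fs += x
        self.sc += np.outer(x, x)
        self.ps += p
        self.n += 1

    def remove(self, x, p):
        """Inverse of add(); used when attrition moves a point elsewhere."""
        x = np.asarray(x, dtype=float)
        self.fs -= x
        self.sc -= np.outer(x, x)
        self.ps -= p
        self.n -= 1

    def centroid(self):
        return self.fs / self.n

    def covariance(self):
        """Cov_ij = Sc_ij/n - (Fs_i/n)(Fs_j/n), straight from the maintained sums."""
        mu = self.centroid()
        return self.sc / self.n - np.outer(mu, mu)

    def ssq(self):
        """Sum Squared Error about the centroid: trace(Sc) - n*||mu||^2."""
        mu = self.centroid()
        return float(np.trace(self.sc) - self.n * (mu @ mu))

    def avg_privacy(self):
        return self.ps / self.n         # average privacy level Ps(G)/n(G)
```

  • Because every field updates by addition, the same bookkeeping serves both the static construction described next and the incremental stream setting; attrition, discussed later, reduces to subtraction via remove().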
  • We note that the algorithm for group construction tries to put each record in a group whose size is at least equal to the maximum privacy level of any record in the group. A natural solution is to first classify the records based on their privacy levels and then independently create the groups for the varying privacy levels. Unfortunately, this does not lead to the most efficient method for packing the sets of records into different groups. This is because the most effective method for constructing the groups may require us to combine records from different privacy levels. For example, a record with a very low privacy requirement may sometimes naturally be combined with a group of high-privacy records in its locality. An attempt to construct a separate group of records with a low privacy requirement may lead to an even higher loss of information.
  • First, we need a measure to quantify the effectiveness of a given condensation-based approach. In general, this effectiveness is related to the level of compactness with which we can partition the data into different groups. However, there are several constraints on the cardinality of the data points in each group, as well as on which data points can be added to a group of a given cardinality. Thus, to quantify the condensation quality, in one embodiment, we use the squared-sum error of the data points in each group. While the privacy level of a group is determined by the number of records in it, the information loss is defined by the average variance of the records about their centroid. We will refer to this quantity as the Sum Squared Error (SSQ).
  • The method of group construction is different depending upon whether an entire database of records is available or whether the data records arrive in an incremental fashion. We will discuss two approaches for construction of class statistics. The first approach is utilized for the case when the entire database of records is available. The second approach is utilized in an incremental scheme in which the data points arrive one at a time. First, we will discuss the static case in which the entire database of records is available.
  • The essence of the static approach is to construct the groups using an iterative method in which the groups are processed with increasing privacy level. We assume that the segment of the database with a privacy level requirement of p is denoted by Dp. We also assume that the set of groups with a privacy level of p is denoted by Hp. We note that the database D1 consists of the set of points which have no privacy constraint at all. Therefore, the group H1 is composed of the singleton items from the database D1.
  • Next, the statistics of the groups in Hp are constructed using an iterative algorithm. In each iteration, the privacy level p is increased by one, and the condensed groups Hp which have privacy level p are constructed. The first step is to construct the group Hp by using a purely segmentation-based process. This segmentation process is a straightforward iterative approach. In each iteration, a record X is sampled from the database Dp. The closest (p-1) records to this individual record X are added to its group. Let us denote this group by G. The statistics of the p records in G are computed. Next, the p records in G are removed from Dp. The process is repeated iteratively, until the database Dp is empty. We note that at the end of the process, it is possible that between 1 and (p-1) records may remain. These records can be added to their nearest sub-group in the data. Thus, a small number of groups in the data may contain more than p data points. During the iterative process, it is possible that points from a group with a lower privacy level may fit better with groups of a higher privacy level. Such groups can be cannibalized by higher-level groups. The reverse is true in some cases, where some of the points can be fit to lower-level groups, when a group has more than the desired number of points for its particular privacy level. A sketch of the segmentation pass appears below.
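  • A minimal sketch of one segmentation pass, assuming the CondensedGroup class above; the helper name, the explicit rng, and the leftover-handling policy are our own simplifications:

```python
def segment_privacy_level(Dp, p, rng=None):
    """Greedily carve the points of D_p into groups of exactly p records
    (a sketch of the static segmentation pass, not the patent's code)."""
    rng = rng or np.random.default_rng(0)
    remaining = [np.asarray(x, dtype=float) for x in Dp]
    groups = []
    while len(remaining) >= p:
        # Sample a seed record X, then pull in its (p-1) nearest neighbors.
        seed = remaining.pop(int(rng.integers(len(remaining))))
        remaining.sort(key=lambda x: float(np.linalg.norm(x - seed)))
        members, remaining = [seed] + remaining[:p - 1], remaining[p - 1:]
        g = CondensedGroup(len(seed))
        for x in members:
            g.add(x, p)                 # every record in D_p requires privacy level p
        groups.append(g)
    # Between 1 and (p-1) records may remain; add each to its nearest group
    # (assumes at least one group was formed, i.e. |D_p| >= p initially).
    for x in remaining:
        nearest = min(groups, key=lambda g: float(np.linalg.norm(x - g.centroid())))
        nearest.add(x, p)
    return groups
```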
  • This procedure can also be extended to the dynamic case. The process of dynamic maintenance of groups is useful in a variety of settings such as that of data streams. In the process of dynamic maintenance, the points in the data stream are processed incrementally.
  • The incremental algorithm works by using a nearest-neighbor approach. When an incoming data point Xi is received, the closest cluster to the data point is found using the distance of Xi to the different centroids. While it is desirable to add Xi to its closest centroid, Xi cannot be added to a cluster which has fewer than p(i)-1 data points in it. Therefore, the data point Xi is added to the closest cluster which also happens to have at least p(i)-1 data points inside it. In general, it is not desirable to have groups which are large compared to the privacy levels of their constituent points. When such a situation arises, it effectively means that a higher level of representational inaccuracy is created than is really necessary for the privacy requirements of the points within the group. The average privacy level of the group G can be computed from the condensed statistics. This number is equal to Ps(G)/n(G). This is because Ps(G) is equal to the sum of the privacy levels of the data points in the group. A sketch of this assignment rule appears below.
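  • A minimal sketch of the dynamic assignment rule, again assuming the CondensedGroup class above; the fallback of opening a fresh group when no cluster qualifies is an assumption for illustration, since the text does not spell out that case:

```python
def assign_incoming(x, p, groups):
    """Place stream point x with privacy level p into the closest group that
    already holds at least p-1 points (a sketch, not the patent's code)."""
    x = np.asarray(x, dtype=float)
    eligible = [g for g in groups if g.n >= p - 1]
    if eligible:
        g = min(eligible, key=lambda g: float(np.linalg.norm(x - g.centroid())))
    else:
        g = CondensedGroup(len(x))      # assumed policy: start a new group
        groups.append(g)
    g.add(x, p)
    return g
```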
  • The split criterion used by an illustrative algorithm of the invention is that a group is divided when the number of items in the group is more than twice the average privacy level of the items in the group. For example, a group whose records have an average privacy level of 3 is split once it contains more than 6 records. Therefore, the group is split when the following holds true:
    n(G)≧2 Ps(G)/n(G)
  • We utilize a uniformity assumption in order to split the group statistics. In each case, the group is split along the eigenvector with the largest eigenvalue. This also corresponds to the direction with the greatest level of variance. This is done in order to reduce the overall variance of the resulting clusters and ensure the greatest compactness of representation. We assume without loss of generality that the eigenvector e1 with the lowest index is chosen as the direction of the split, as sketched below.
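  • Continuing the sketch, the split test and split geometry follow from the maintained statistics alone. Under the uniformity assumption, a uniform distribution on [-a, a] has variance a²/3, so the spread along e1 has half-range a = sqrt(3·λ1) and the two children are centered at mu ± a/2; rebuilding each child's first- and second-order sums is omitted here for brevity:

```python
def split_if_needed(g):
    """Split test from the condensed statistics (a sketch). Criterion:
    n(G) >= 2*Ps(G)/n(G). Returns None, or the split direction e1 together
    with the two child centroids mu -/+ a/2 under the uniformity assumption."""
    if g.n < 2 * g.avg_privacy():
        return None                     # split criterion not met
    evals, evecs = np.linalg.eigh(g.covariance())
    lam1, e1 = float(evals[-1]), evecs[:, -1]   # largest eigenvalue, eigenvector
    a = np.sqrt(3.0 * max(lam1, 0.0))   # uniform on [-a, a] has variance a^2/3
    mu = g.centroid()
    return e1, (mu - (a / 2.0) * e1, mu + (a / 2.0) * e1)
```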
  • Once the groups have been generated, we can also generate the pseudo-data from the different condensed groups. The pseudo-data from the condensed groups are generated using a data generation approach described herein below.
  • Referring initially to FIG. 1, a block diagram illustrates a server architecture and network environment in accordance with which variable privacy-preserving data mining techniques may be employed, according to an embodiment of the present invention.
  • As illustrated, an exemplary network environment 100 comprises a trusted server 102-1 and a non-trusted server 102-2. Each server (102-1, 102-2) may comprise a central processing unit or CPU (104-1, 104-2) coupled to a volatile main memory (106-1, 106-2) and a non-volatile disk (108-1, 108-2). The servers are connected over a communication network 110. It is to be appreciated that the network may be a public information network such as, for example, the Internet or World Wide Web, however, the servers may alternatively be connected via a private network, a local area network, or some other suitable network.
  • It is to be understood that a server may receive data to be processed from any source or sources. For example, one or more client devices (not shown) may supply data to be processed to a server. However, all or portions of the data to be processed may already be available at the server (e.g., on disk), or may be accessible by the server. The main memory may be used in order to store some or all of the intermediate results produced during the operations/computations.
  • Further, software components including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more memory devices described above with respect to the server and, when ready to be utilized, loaded in part or in whole and executed by the CPU.
  • In one preferred embodiment, the variable privacy preservation operations of the invention (i.e., condensation operations) are performed at trusted server 102-1. That is, CPU 104-1 of the trusted server is used in order to perform the privacy preservation operations on the original data. As mentioned above, the original data may be received from any source or sources (e.g., one or more client devices connected to the server over the network) and stored in disk 108-1. Once processed in accordance with the privacy preservation operations at the trusted server 102-1, the data (which is now considered “trusted data” after being processed) may be sent to non-trusted server 102-2 where it is stored on disk 108-2. Using CPU 104-2 and memory 106-2, data mining may be performed on the trusted data at the non-trusted server.
  • Referring now to FIG. 2, a flow diagram illustrates a process 200 for performing variable privacy preservation, according to an embodiment of the present invention. That is, FIG. 2 illustrates an overall approach for performing condensation-based privacy preservation.
  • The process starts at block 202. As mentioned above, the condensation-based approach is a two-step process. In the first step, the process generates the condensed groups from the data (step 204). The summary statistics of these condensed groups are stored. These summary statistics may include the covariance matrix, as well as the sums of the attributes and the number of records. Such statistics are explained in detail above. We note that this information is sufficient to determine the characteristics which are useful for privacy preservation. This step is further explained below in the context of FIG. 3.
  • Once the statistics of the condensed groups have been stored, they are used in the second step of the process to generate the pseudo-data for mining purposes (step 206). The pseudo-data are often available in the form of multi-dimensional records which are similar to the original data format. Such pseudo-data is the so-called “trusted data” that is sent to the non-trusted server (102-2 of FIG. 1) for use in data mining operations. The data is considered “trusted” since it provides a degree of indistinguishability, thus preserving privacy. The pseudo-data generation step is further explained below in the context of FIG. 6. The process ends at block 208.
  • Referring now to FIG. 3, a flow diagram illustrates a process 300 for creating condensed groups for privacy preservation, according to an embodiment of the invention. FIG. 3 illustrates details of step 204 of FIG. 2. That is, FIG. 3 illustrates an overall process of performing condensation for the privacy preservation process.
  • The process starts at block 302. The condensation of the groups works with an iterative approach in which groups with successively higher privacy levels are generated. In FIG. 3, this privacy level is denoted by p. The process starts with the privacy level p=1 (step 304).
  • In step 306, the process determines groups of privacy level p. This can be done by using any conventional clustering algorithm, see, e.g., Jain and Dubes, “Algorithms for Clustering Data,” Prentice Hall. The determination of such groups can be very useful for the privacy preservation process.
  • We note that groups with a lower privacy level can often be redistributed into groups with a higher privacy level using a cannibalization process. In order to perform cannibalization (step 308), the process examines all groups with privacy level (p-1) and redistributes the points (records) to groups with higher privacy levels, if such redistribution reduces the mean square errors of the data points. This step is further explained below in the context of FIG. 4.
  • We note that the process of cannibalization may often result in some groups having more points than their required privacy level. In such cases, the process can reassign the data points of the corresponding groups to lower privacy level groups. Thus, in step 310, the process performs attrition, which reassigns points from groups with more than p points to other groups. This reassignment is performed if it improves the errors of the corresponding data points. This step is further explained below in the context of FIG. 5.
  • In step 311, the privacy level p is incremented by one. Step 312 then checks whether p is equal to pmax. It is to be appreciated that pmax is the maximum privacy requirement of any record in the data set. If p does not yet equal pmax, then the process returns to step 306 and continues. Once pmax is reached, the process ends at block 314.
  • Referring now to FIG. 4, a flow diagram illustrates a process 400 for performing cannibalization for condensation, according to an embodiment of the invention. In cannibalization (e.g., step 308 of FIG. 3), the process reassigns the data points of a given group to higher-level groups. This is done in order to improve the errors of the group formation process. The process starts at block 402.
  • The cannibalization process is performed as follows. For each group whose privacy level is lower than the current value of p, the process determines whether reassigning all points in the group to their corresponding closest centroids improves the error values. This determination is made in step 404. If such a reassignment does indeed improve the group radius, then the reassignment is executed in step 406. Otherwise, that group is kept intact. The process ends at block 408.
  • We note that the process of cannibalization only reassigns data points to groups with a higher privacy level. Consequently, the privacy level of each group is maintained. This is because the sizes of all (remaining) groups are increased in the process. Thus, the cannibalization process increases privacy while reducing the error, as the sketch below illustrates.
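  • A sketch of the whole-group cannibalization test, assuming the points of lower-level candidate groups are still at hand during static construction; the (group, members) pairing and the function name are our own conventions:

```python
def cannibalize(low_groups, high_groups):
    """Dissolve a lower-level group when sending every member to its closest
    higher-level centroid lowers the squared error (a sketch). low_groups
    holds (CondensedGroup, members) pairs; members are (point, level) tuples."""
    if not high_groups:
        return low_groups
    kept = []
    for g, members in low_groups:
        mu = g.centroid()
        err_now = sum(float(np.linalg.norm(x - mu)) ** 2 for x, _ in members)
        err_moved = sum(
            min(float(np.linalg.norm(x - h.centroid())) ** 2 for h in high_groups)
            for x, _ in members
        )
        if err_moved < err_now:         # reassignment improves the error values
            for x, p in members:
                h = min(high_groups,
                        key=lambda h: float(np.linalg.norm(x - h.centroid())))
                h.add(x, p)
        else:
            kept.append((g, members))   # otherwise the group is kept intact
    return kept
```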
  • Referring now to FIG. 5, a flow diagram illustrates a process 500 for performing attrition for condensation, according to an embodiment of the invention. We note that the use of attrition (e.g., step 310 of FIG. 3) can be helpful in reassigning excess data points from over-full groups in a more effective way. The process of attrition can be useful in reducing the overall errors of the privacy preservation process. The process starts at block 502. In step 504, the process determines if moving an excess point from a given group to its next closest centroid reduces the average error of the condensation. If this is the case, then the process performs the move from one centroid to the other (step 506). This process maintains privacy while increasing the compactness of the groups. The process ends at block 508. A sketch of one attrition move follows.
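  • A corresponding sketch of a single attrition move, using the same conventions as the cannibalization sketch; the heuristic of testing the point that gains most from moving is our own choice:

```python
def attrition_step(g, members, others):
    """Move one excess point out of G when its next-closest centroid reduces
    the error (a sketch). members are (point, level) pairs for G; others are
    the candidate CondensedGroup objects."""
    if not others or g.n <= max(p for _, p in members):
        return False                    # no excess point, or nowhere to send one
    mu = g.centroid()

    def gain(item):                     # error saved by moving this point out
        x, _ = item
        d_here = float(np.linalg.norm(x - mu)) ** 2
        d_next = min(float(np.linalg.norm(x - h.centroid())) ** 2 for h in others)
        return d_here - d_next

    best = max(members, key=gain)
    if gain(best) <= 0:
        return False                    # no move reduces the average error
    x, p = best
    members.remove(best)
    g.remove(x, p)                      # subtract its contribution from G's sums
    target = min(others, key=lambda h: float(np.linalg.norm(x - h.centroid())))
    target.add(x, p)
    return True
```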
  • Referring lastly to FIG. 6, a flow diagram illustrates a process 600 for creating pseudo-data from condensed groups, according to an embodiment of the invention. FIG. 6 illustrates details of step 206 of FIG. 2. The process starts at block 602. The pseudo-data are generated by calculating the condensed statistics and generating the eigenvectors from each set of condensed statistics (step 604). The eigenvalues along these eigenvectors represent the corresponding variances. Then, the process generates the data independently along each eigenvector (step 606). More particularly, along each eigenvector, the process uses a uniform distribution with variance equal to the corresponding eigenvalue. The process ends at block 608.
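  • A sketch of this generation step for a single condensed group, assuming the CondensedGroup class above; the key arithmetic is that a uniform distribution on [-a, a] has variance a²/3, so a = sqrt(3·λ) matches each eigenvalue:

```python
def generate_pseudo_data(g, rng=None):
    """Draw n(G) pseudo-records independently along each eigenvector of the
    group covariance, using a uniform distribution whose variance equals the
    corresponding eigenvalue (a sketch, not the patent's code)."""
    rng = rng or np.random.default_rng(0)
    evals, evecs = np.linalg.eigh(g.covariance())
    half_widths = np.sqrt(3.0 * np.clip(evals, 0.0, None))  # guard tiny negatives
    # Sample coordinates in the eigenbasis, rotate back, recenter on the centroid.
    coords = rng.uniform(-half_widths, half_widths, size=(g.n, len(evals)))
    return g.centroid() + coords @ evecs.T
```

  • A data mining algorithm can then be run unchanged on the concatenated outputs across all groups, which is the pseudo-data route of FIG. 2.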
  • As mentioned above, the described method of privacy preservation can also be extended to data streams. Specifically, in such a case, the condensed statistics are updated incrementally as the data points are received. The incremental update of the condensed statistics is used in conjunction with a splitting step, which is invoked when the group size exceeds twice the average privacy level. The splitting process may include splitting the group along the eigenvector with the largest eigenvalue. The process reconstructs the aggregate statistics by assuming that the distribution along each eigenvector is uniform.
  • Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

Claims (20)

1. A method for preserving privacy of data records for use in a data mining application, comprising the steps of:
assigning different privacy levels to the data records;
constructing condensed groups from the data records based on the privacy levels, wherein summary statistics are maintained for each condensed group; and
generating pseudo-data from the summary statistics, wherein the pseudo-data is available for use in the data mining application.
2. The method of claim 1, wherein maintaining summary statistics further comprises:
an iterative step of segmentation wherein data records with the same privacy level are included in one group;
an iterative step of cannibalization wherein data records from one group are redistributed to other groups; and
a step of attrition wherein data records from one group are reassigned to a closer group.
3. The method of claim 1, wherein data records of a given privacy level are processed in increasing order of privacy.
4. The method of claim 1, wherein data records of a given privacy level are processed in decreasing order of privacy.
5. The method of claim 2, wherein the cannibalization step redistributes records of a given privacy level to groups with higher privacy levels.
6. The method of claim 5, wherein the cannibalization step is performed when the reassignment of all data records within the group results in a lower squared error.
7. The method of claim 2, wherein the attrition step reassigns excess records from a given group to other groups.
8. The method of claim 1, wherein the data records are static.
9. The method of claim 1, wherein the data records are dynamic.
10. Apparatus for preserving privacy of data records for use in a data mining application, comprising:
a memory; and
a processor coupled to the memory and operative to: (i) assign different privacy levels to the data records; (ii) construct condensed groups from the data records based on the privacy levels, wherein summary statistics are maintained for each condensed group; and (iii) generate pseudo-data from the summary statistics, wherein the pseudo-data is available for use in the data mining application.
11. The apparatus of claim 10, wherein maintaining summary statistics further comprises:
an iterative operation of segmentation wherein data records with the same privacy level are included in one group;
an iterative operation of cannibalization wherein data records from one group are redistributed to other groups; and
an operation of attrition wherein data records from one group are reassigned to a closer group.
12. The apparatus of claim 10, wherein data records of a given privacy level are processed in increasing order of privacy.
13. The apparatus of claim 10, wherein data records of a given privacy level are processed in decreasing order of privacy.
14. The apparatus of claim 11, wherein the cannibalization operation redistributes records of a given privacy level to groups with higher privacy levels.
15. The apparatus of claim 14, wherein the cannibalization operation is performed when the reassignment of all data records within the group results in a lower squared error.
16. The apparatus of claim 11, wherein the attrition operation reassigns excess records from a given group to other groups.
17. The apparatus of claim 10, wherein the data records are static.
18. The apparatus of claim 10, wherein the data records are dynamic.
19. An article of manufacture for use in preserving privacy of data records for use in a data mining application, the article comprising a machine readable medium containing one or more programs which when executed implement the steps of:
assigning different privacy levels to the data records;
constructing condensed groups from the data records based on the privacy levels, wherein summary statistics are maintained for each condensed group; and
generating pseudo-data from the summary statistics, wherein the pseudo-data is available for use in the data mining application.
20. The article of claim 19, wherein maintaining summary statistics further comprises:
an iterative step of segmentation wherein data records with the same privacy level are included in one group;
an iterative step of cannibalization wherein data records from one group are redistributed to other groups; and
a step of attrition wherein data records from one group are reassigned to a closer group.
US11/249,647 2005-10-13 2005-10-13 Method and apparatus for variable privacy preservation in data mining Abandoned US20070239982A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US11/249,647 US20070239982A1 (en) 2005-10-13 2005-10-13 Method and apparatus for variable privacy preservation in data mining
PCT/EP2006/066858 WO2007042403A1 (en) 2005-10-13 2006-09-28 Method and apparatus for variable privacy preservation in data mining
US12/119,766 US8627070B2 (en) 2005-10-13 2008-05-13 Method and apparatus for variable privacy preservation in data mining
US14/051,530 US8966648B2 (en) 2005-10-13 2013-10-11 Method and apparatus for variable privacy preservation in data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/249,647 US20070239982A1 (en) 2005-10-13 2005-10-13 Method and apparatus for variable privacy preservation in data mining

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/119,766 Continuation US8627070B2 (en) 2005-10-13 2008-05-13 Method and apparatus for variable privacy preservation in data mining

Publications (1)

Publication Number Publication Date
US20070239982A1 (en) 2007-10-11

Family ID: 37635630

Family Applications (3)

Application Number Title Priority Date Filing Date
US11/249,647 Abandoned US20070239982A1 (en) 2005-10-13 2005-10-13 Method and apparatus for variable privacy preservation in data mining
US12/119,766 Expired - Fee Related US8627070B2 (en) 2005-10-13 2008-05-13 Method and apparatus for variable privacy preservation in data mining
US14/051,530 Expired - Fee Related US8966648B2 (en) 2005-10-13 2013-10-11 Method and apparatus for variable privacy preservation in data mining

Family Applications After (2)

Application Number Title Priority Date Filing Date
US12/119,766 Expired - Fee Related US8627070B2 (en) 2005-10-13 2008-05-13 Method and apparatus for variable privacy preservation in data mining
US14/051,530 Expired - Fee Related US8966648B2 (en) 2005-10-13 2013-10-11 Method and apparatus for variable privacy preservation in data mining

Country Status (2)

Country Link
US (3) US20070239982A1 (en)
WO (1) WO2007042403A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070239982A1 (en) 2005-10-13 2007-10-11 International Business Machines Corporation Method and apparatus for variable privacy preservation in data mining
US9959285B2 (en) 2014-08-08 2018-05-01 International Business Machines Corporation Restricting sensitive query results in information management platforms
CN108141460B (en) * 2015-10-14 2020-12-04 三星电子株式会社 System and method for privacy management of unlimited data streams
US10755172B2 (en) 2016-06-22 2020-08-25 Massachusetts Institute Of Technology Secure training of multi-party deep neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040199781A1 (en) * 2001-08-30 2004-10-07 Erickson Lars Carl Data source privacy screening systems and methods
US7024409B2 (en) * 2002-04-16 2006-04-04 International Business Machines Corporation System and method for transforming data to preserve privacy where the data transform module suppresses the subset of the collection of data according to the privacy constraint
US20070239982A1 (en) 2005-10-13 2007-10-11 International Business Machines Corporation Method and apparatus for variable privacy preservation in data mining

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633872B2 (en) * 2000-12-18 2003-10-14 International Business Machines Corporation Extendible access control for lightweight directory access protocol

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140502B2 (en) 2008-06-27 2012-03-20 Microsoft Corporation Preserving individual information privacy by providing anonymized customer data
US20090327296A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Preserving individual information privacy by providing anonymized customer data
US10715529B2 (en) 2008-06-30 2020-07-14 Conversant Wireless Licensing S.A R.L. Method, apparatus, and computer program product for privacy management
US9202078B2 (en) 2011-05-27 2015-12-01 International Business Machines Corporation Data perturbation and anonymization using one way hash
US20130014270A1 (en) * 2011-07-08 2013-01-10 Sy Bon K Method of comparing private data without revealing the data
US8776250B2 (en) * 2011-07-08 2014-07-08 Research Foundation Of The City University Of New York Method of comparing private data without revealing the data
WO2014011633A2 (en) * 2012-07-09 2014-01-16 Research Foundation Of The City University Of New York Method of safeguarding private medical data without revealing the data
WO2014011633A3 (en) * 2012-07-09 2014-03-27 Research Foundation Of The City University Of New York Safeguarding private medical data
US10318503B1 (en) 2012-07-20 2019-06-11 Ool Llc Insight and algorithmic clustering for automated synthesis
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
US9607023B1 (en) 2012-07-20 2017-03-28 Ool Llc Insight and algorithmic clustering for automated synthesis
US11216428B1 (en) 2012-07-20 2022-01-04 Ool Llc Insight and algorithmic clustering for automated synthesis
US12072998B2 (en) 2015-11-02 2024-08-27 Snowflake Inc. Differentially private processing and database storage
US20180239925A1 (en) * 2015-11-02 2018-08-23 LeapYear Technologies, Inc. Differentially Private Density Plots
US11100247B2 (en) 2015-11-02 2021-08-24 LeapYear Technologies, Inc. Differentially private processing and database storage
US10467234B2 (en) 2015-11-02 2019-11-05 LeapYear Technologies, Inc. Differentially private database queries involving rank statistics
US10489605B2 (en) * 2015-11-02 2019-11-26 LeapYear Technologies, Inc. Differentially private density plots
US10733320B2 (en) 2015-11-02 2020-08-04 LeapYear Technologies, Inc. Differentially private processing and database storage
US10586068B2 (en) 2015-11-02 2020-03-10 LeapYear Technologies, Inc. Differentially private processing and database storage
US10726153B2 (en) 2015-11-02 2020-07-28 LeapYear Technologies, Inc. Differentially private machine learning using a random forest classifier
JP6148371B1 (en) * 2016-03-29 2017-06-14 西日本電信電話株式会社 Grouping device, grouping method, and computer program
JP6154933B1 (en) * 2016-03-29 2017-06-28 西日本電信電話株式会社 Grouping device, grouping method, and computer program
JP2017182341A (en) * 2016-03-29 2017-10-05 西日本電信電話株式会社 Grouping device, grouping method, and computer program
JP2017182342A (en) * 2016-03-29 2017-10-05 西日本電信電話株式会社 Grouping device, grouping method, and computer program
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10579828B2 (en) 2017-08-01 2020-03-03 International Business Machines Corporation Method and system to prevent inference of personal information using pattern neutralization techniques
US11893133B2 (en) 2018-04-14 2024-02-06 Snowflake Inc. Budget tracking in a differentially private database system
US11055432B2 (en) 2018-04-14 2021-07-06 LeapYear Technologies, Inc. Budget tracking in a differentially private database system
US12130942B2 (en) 2018-04-14 2024-10-29 Snowflake Inc. Budget tracking in a differentially private database system
US10789384B2 (en) 2018-11-29 2020-09-29 LeapYear Technologies, Inc. Differentially private database permissions system
US11755769B2 (en) 2019-02-01 2023-09-12 Snowflake Inc. Differentially private query budget refunding
US11188547B2 (en) 2019-05-09 2021-11-30 LeapYear Technologies, Inc. Differentially private budget tracking using Renyi divergence
US10642847B1 (en) 2019-05-09 2020-05-05 LeapYear Technologies, Inc. Differentially private budget tracking using Renyi divergence
US11328084B2 (en) 2020-02-11 2022-05-10 LeapYear Technologies, Inc. Adaptive differentially private count
US11861032B2 (en) 2020-02-11 2024-01-02 Snowflake Inc. Adaptive differentially private count
US12105832B2 (en) 2020-02-11 2024-10-01 Snowflake Inc. Adaptive differentially private count
CN116975897A (en) * 2023-09-22 2023-10-31 青岛国信城市信息科技有限公司 Smart community personnel privacy information safety management system

Also Published As

Publication number Publication date
US20090319526A1 (en) 2009-12-24
WO2007042403A1 (en) 2007-04-19
US8627070B2 (en) 2014-01-07
US20140041049A1 (en) 2014-02-06
US8966648B2 (en) 2015-02-24
WO2007042403A9 (en) 2007-09-13

Similar Documents

Publication Publication Date Title
US8627070B2 (en) Method and apparatus for variable privacy preservation in data mining
US7302420B2 (en) Methods and apparatus for privacy preserving data mining using statistical condensing approach
US7739284B2 (en) Method and apparatus for processing data streams
US7475085B2 (en) Method and apparatus for privacy preserving data mining by restricting attribute choice
US6959303B2 (en) Efficient searching techniques
US11636486B2 (en) Determining subsets of accounts using a model of transactions
EP2742446B1 (en) A system and method to store video fingerprints on distributed nodes in cloud systems
US7743058B2 (en) Co-clustering objects of heterogeneous types
US7890510B2 (en) Method and apparatus for analyzing community evolution in graph data streams
Zakerzadeh et al. Faanst: fast anonymizing algorithm for numerical streaming data
US20070061544A1 (en) System and method for compression in a distributed column chunk data store
JP5089854B2 (en) Method and apparatus for clustering of evolving data streams via online and offline components
Vijayalakshmi et al. Analysis on data deduplication techniques of storage of big data in cloud
US9830377B1 (en) Methods and systems for hierarchical blocking
US9734229B1 (en) Systems and methods for mining data in a data warehouse
US11227231B2 (en) Computational efficiency in symbolic sequence analytics using random sequence embeddings
US11892297B2 (en) Scalable graph SLAM for HD maps
de Oliveira et al. Scalable fast evolutionary k-means clustering
US20210056586A1 (en) Optimizing large scale data analysis
US9760654B2 (en) Method and system for focused multi-blocking to increase link identification rates in record comparison
Romero-Gainza et al. Memory mapping and parallelizing random forests for speed and cache efficiency
Cai et al. Dynamic programming based optimized product quantization for approximate nearest neighbor search
US10803102B1 (en) Methods and systems for comparing customer records
US11886385B2 (en) Scalable identification of duplicate datasets in heterogeneous datasets
US20210357453A1 (en) Query usage based organization for very large databases

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AGGARWAL, CHARU C.;YU, PHILIP SHI-LUNG;REEL/FRAME:016955/0784

Effective date: 20051013

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE