
CN117235665B - Self-adaptive privacy data synthesis method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN117235665B
Authority
CN
China
Prior art keywords
data
loss
feature
fusion
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311199403.9A
Other languages
Chinese (zh)
Other versions
CN117235665A (en)
Inventor
黄雨
张荣超
王捍贫
陈冬雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Peking University
Priority to CN202311199403.9A
Publication of CN117235665A
Application granted
Publication of CN117235665B
Status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to an adaptive privacy data synthesis method, apparatus, computer device, and storage medium. The method comprises the following steps: generating a fusion vector corresponding to the original data from its continuous and discrete features via a feature fusion network; inputting the target sequence in the fusion vector and the discrete features into a feature distribution extraction network to extract the corresponding feature distribution parameters; calculating a first loss based on those parameters; dividing the weight values of the fully connected layer of the feature distribution extraction network into positive and negative samples, and deriving a second loss from the similarity of the positive and negative samples; generating initial synthetic data through a data synthesis network based on the feature distribution parameters, and deriving a third loss from the degree of difference between the initial synthetic data and the original data; fusing the first, second, and third losses to obtain a fusion loss; and performing network training based on the fusion loss to obtain the target synthetic data. The accuracy of the synthetic data can thereby be improved.

Description

Self-adaptive privacy data synthesis method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for synthesizing self-adaptive privacy data, a computer device, and a storage medium.
Background
With the development of big data technology, data silos and data privacy problems often make integration and analysis across data sources difficult. Generating structured synthetic data that accurately reflects the dynamic characteristics, underlying distributions, and interaction relationships of privacy-sensitive phenotype data therefore remains an active research field for protecting data privacy.
In the prior art, Bayesian generative methods and the like are generally used to fit the feature distribution of the original data from the angle of statistical characteristics; the resulting synthetic data cannot accurately capture the interaction relationships among features and is therefore of low accuracy.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an adaptive privacy data synthesis method, apparatus, computer device, and storage medium that can effectively improve the accuracy of the generated synthetic data.
In a first aspect, the present application provides a method for synthesizing adaptive privacy data, including:
acquiring continuous features and discrete features of original data, and generating a fusion vector corresponding to the original data based on a feature fusion network, wherein the fusion vector is used to characterize the association relationships among the individual data items of the original data;
acquiring a target sequence from the fusion vector, inputting the target sequence and the semantic vectors corresponding to the discrete features into a feature distribution extraction network, respectively, and extracting corresponding first feature distribution parameters and second feature distribution parameters; and calculating a first loss based on the first feature distribution parameters and the second feature distribution parameters;
dividing the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive and negative samples, and constructing a second loss based on the similarity of the positive and negative samples;
generating initial synthetic data through a data synthesis network based on the first feature distribution parameters, and generating a third loss based on the degree of difference between the initial synthetic data and the original data, wherein the data synthesis network is used to encrypt input data; and
fusing the first loss, the second loss, and the third loss to obtain a fusion loss; and training the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss until the fusion loss is smaller than a threshold, then stopping training to obtain target synthetic data, wherein the target synthetic data is encrypted data corresponding to the original data.
In one embodiment, obtaining continuous features and discrete features of original data, generating a fusion vector corresponding to the original data based on a feature fusion network includes:
Inputting the discrete features and the continuous features of the original data into a semantic extraction network, and generating a first feature vector corresponding to the continuous features and a second feature vector corresponding to the discrete features;
Generating a token sequence corresponding to the original data based on the fusion of the first feature vector and the second feature vector, wherein the token sequence is used for representing the feature association relation among all data of the original data;
And inputting the token sequence into a feature fusion network to generate a fusion vector corresponding to the original data.
In one embodiment, obtaining a target sequence from the fusion vector, inputting the target sequence and semantic vectors corresponding to the discrete features into a feature distribution extraction network, and extracting corresponding first feature distribution parameters and second feature distribution parameters, including:
Inputting the target sequence to a Gaussian mixture encoder, and outputting a first Gaussian distribution parameter corresponding to the target sequence, wherein the first Gaussian distribution parameter comprises at least one first data pair consisting of a mean value and a standard deviation;
and inputting the semantic vector corresponding to the discrete feature into a Gaussian mixture encoder, and outputting a second Gaussian distribution parameter corresponding to the discrete feature, wherein the second distribution parameter comprises at least one second data pair consisting of a mean value and a standard deviation.
In one embodiment, calculating the first loss based on the first feature distribution parameter and the second feature distribution parameter includes:
Constructing and obtaining a corresponding first data distribution characteristic item based on the first data pair, wherein the first data distribution characteristic item is used for representing the data distribution condition of the target sequence;
constructing a corresponding second data distribution characteristic item based on the second data pair, wherein the second data distribution characteristic item is used for representing the data distribution condition of the discrete characteristic;
The degree of difference between the first data distribution characteristic item and the second data distribution characteristic item is taken as a first loss.
In one embodiment, generating initial composite data over a data composite network based on the first feature distribution parameter and generating a third penalty based on a degree of difference between the initial composite data and the original data, includes:
determining corresponding data characteristic distribution based on the first characteristic distribution parameters, and randomly sampling the data characteristic distribution to obtain a characteristic acquisition sequence;
Inputting the characteristic acquisition sequence into a data synthesis network to generate initial synthesis data;
a third penalty is generated based on the degree of difference between the initial synthetic data and the original data.
In one embodiment, generating the third loss based on a degree of difference between the initial synthetic data and the original data includes:
Calculating the cross entropy between the initial synthesized data and the original data to obtain a target difference degree;
Based on the target degree of difference, a third penalty is constructed.
In one embodiment, fusing the first, second, and third losses to obtain a fused loss includes:
obtaining a scale factor corresponding to the first loss, the second loss and the third loss;
and carrying out weighted fusion on the first loss, the second loss and the third loss based on the scale factors to obtain fusion loss.
In a second aspect, the present application also provides an adaptive privacy data synthesis apparatus, including:
The feature extraction module is configured to obtain continuous features and discrete features of the original data and to generate a fusion vector corresponding to the original data based on a feature fusion network, where the fusion vector is used to characterize the association relationships among the individual data items of the original data;
The first calculation module is configured to obtain a target sequence from the fusion vector, input the target sequence and the semantic vectors corresponding to the discrete features into the feature distribution extraction network, respectively, extract the corresponding first and second feature distribution parameters, and calculate a first loss based on them;
The second calculation module is configured to divide the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive and negative samples, and to construct a second loss based on the similarity of the positive and negative samples;
The third calculation module is configured to generate initial synthetic data through a data synthesis network based on the first feature distribution parameters and to generate a third loss based on the degree of difference between the initial synthetic data and the original data, where the data synthesis network is used to encrypt the input data;
The data generation module is configured to fuse the first loss, the second loss, and the third loss to obtain a fusion loss, and to train the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss until the fusion loss is smaller than a threshold, then stop training to obtain target synthetic data, where the target synthetic data is encrypted data corresponding to the original data.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring continuous features and discrete features of the original data, and generating a fusion vector corresponding to the original data based on a feature fusion network, wherein the fusion vector is used to characterize the association relationships among the individual data items of the original data;
acquiring a target sequence from the fusion vector, inputting the target sequence and the semantic vectors corresponding to the discrete features into a feature distribution extraction network, respectively, and extracting corresponding first feature distribution parameters and second feature distribution parameters; and calculating a first loss based on the first feature distribution parameters and the second feature distribution parameters;
dividing the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive and negative samples, and constructing a second loss based on the similarity of the positive and negative samples;
generating initial synthetic data through a data synthesis network based on the first feature distribution parameters, and generating a third loss based on the degree of difference between the initial synthetic data and the original data, wherein the data synthesis network is used to encrypt input data; and
fusing the first loss, the second loss, and the third loss to obtain a fusion loss; and training the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss until the fusion loss is smaller than a threshold, then stopping training to obtain target synthetic data, wherein the target synthetic data is encrypted data corresponding to the original data.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring continuous features and discrete features of the original data, and generating a fusion vector corresponding to the original data based on a feature fusion network, wherein the fusion vector is used to characterize the association relationships among the individual data items of the original data;
acquiring a target sequence from the fusion vector, inputting the target sequence and the semantic vectors corresponding to the discrete features into a feature distribution extraction network, respectively, and extracting corresponding first feature distribution parameters and second feature distribution parameters; and calculating a first loss based on the first feature distribution parameters and the second feature distribution parameters;
dividing the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive and negative samples, and constructing a second loss based on the similarity of the positive and negative samples;
generating initial synthetic data through a data synthesis network based on the first feature distribution parameters, and generating a third loss based on the degree of difference between the initial synthetic data and the original data, wherein the data synthesis network is used to encrypt input data; and
fusing the first loss, the second loss, and the third loss to obtain a fusion loss; and training the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss until the fusion loss is smaller than a threshold, then stopping training to obtain target synthetic data, wherein the target synthetic data is encrypted data corresponding to the original data.
According to the above adaptive privacy data synthesis method, apparatus, computer device, and storage medium, continuous and discrete features of the original data are obtained, and a fusion vector corresponding to the original data is generated from them by the feature fusion network. The target sequence in the fusion vector and the semantic vectors corresponding to the discrete features are input into the feature distribution extraction network, the corresponding first and second feature distribution parameters are extracted, and a first loss is constructed from them. A second loss is constructed from the similarity between the positive and negative samples obtained by dividing the weight values of the fully connected layer of the feature distribution extraction network. Initial synthetic data is generated through the data synthesis network based on the first feature distribution parameters, and a third loss is constructed from the degree of difference between the initial synthetic data and the original data. Finally, the first, second, and third losses are fused by weighting to obtain the fusion loss, and each network parameter is dynamically adjusted according to the fusion loss until the target synthetic data is obtained. By constructing this multi-level, multi-modal data loss function, large-scale data features can be fused and inferred efficiently in parallel, and the complex interaction relationships and context information in high-dimensional features can be captured, which effectively improves the accuracy of the generated multi-modal structured synthetic data.
Drawings
FIG. 1 is a flow diagram of a method of adaptive privacy data synthesis in one embodiment;
FIG. 2 is a flow chart of a fusion vector generation method in one embodiment;
FIG. 3 is a flow diagram of generating a first feature distribution parameter and a second feature distribution parameter in one embodiment;
FIG. 4 is a schematic flow diagram of constructing a first penalty in one embodiment;
FIG. 5 is a flow chart of a method of generating a third penalty in one embodiment;
FIG. 6 is a schematic flow diagram of a third penalty constructed in one embodiment;
FIG. 7 is a flow diagram of a fusion loss generation method in one embodiment;
FIG. 8 is a block diagram of an adaptive private data synthesizer according to one embodiment;
FIG. 9 is an internal block diagram of a computer device in one embodiment;
FIG. 10 is an internal block diagram of another computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, an adaptive privacy data synthesis method is provided. This embodiment is illustrated by applying the method to a terminal; it will be understood that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
Step S102, continuous features and discrete features of the original data are obtained, and fusion vectors corresponding to the original data are generated based on a feature fusion network.
The fusion vector is used for representing the association relation among various data of the original data.
Specifically, the computer device obtains the original data D for which data encryption is required and divides it, according to the continuous and discrete data features, into continuous features X_num and discrete features X_cat. Word embedding techniques are used to capture the semantic relationships and features among the data, and all features in the original data are converted into semantic vectors in a continuous vector space: all discrete and continuous features of the original data are input into a BERT embedding layer, the feature dependency relationships in the discrete features are extracted, a first feature vector corresponding to the continuous features of the original data and a second feature vector corresponding to the discrete features are generated, and the fusion vector corresponding to the original data is obtained based on the first feature vector and the second feature vector.
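As an illustration, the per-feature embedding step can be sketched as follows. This is a minimal PyTorch sketch, not the patent's implementation: the class name, dimensions, and the use of per-feature linear projections and embedding tables in place of a full BERT embedding layer are assumptions for exposition.

```python
import torch
import torch.nn as nn

class FeatureEmbedder(nn.Module):
    # Embeds each continuous feature with a linear projection and each
    # discrete feature with an embedding table, yielding one d_model
    # vector per feature.
    def __init__(self, n_cont, cat_cardinalities, d_model=64):
        super().__init__()
        self.cont_proj = nn.ModuleList(
            nn.Linear(1, d_model) for _ in range(n_cont))
        self.cat_emb = nn.ModuleList(
            nn.Embedding(card, d_model) for card in cat_cardinalities)

    def forward(self, x_cont, x_cat):
        # x_cont: (batch, n_cont) floats; x_cat: (batch, n_cat) integer codes.
        e_num = torch.stack([proj(x_cont[:, j:j + 1])
                             for j, proj in enumerate(self.cont_proj)], dim=1)
        e_cat = torch.stack([emb(x_cat[:, j])
                             for j, emb in enumerate(self.cat_emb)], dim=1)
        return e_num, e_cat    # first and second feature vectors
```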
Step S104, obtaining a target sequence from the fusion vector, respectively inputting the target sequence and semantic vectors corresponding to the discrete features into a feature distribution extraction network, and extracting corresponding first feature distribution parameters and second feature distribution parameters; and calculating to obtain the first loss based on the first characteristic distribution parameter and the second characteristic distribution parameter.
The fusion vector comprises a target sequence corresponding to the inserted special token, a continuous feature sequence corresponding to the continuous features of the original data, and a discrete feature sequence corresponding to the discrete features; the target sequence characterizes the data features of each modality's data in the original data and their internal association relationships. The first and second feature distribution parameters are used to characterize the distribution features of each modality's data in the target sequence and in the discrete features, respectively.
Specifically, the computer device obtains the target sequence [CLS](i) from the fusion vector T_out(i), inputs the target sequence into the feature distribution extraction network, and outputs the corresponding first feature distribution parameters; it then inputs the semantic vectors corresponding to the discrete features of the original data into the feature distribution extraction network and outputs the second feature distribution parameters corresponding to the discrete features; finally, the first loss is calculated from the generated first and second feature distribution parameters.
Step S106, dividing the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive and negative samples, and constructing a second loss based on the similarity of the positive and negative samples.
Specifically, the computer device adopts a domain-adaptive triplet metric adversarial strategy to implicitly constrain the mutual information between features and thereby enhance the semantic representation of the feature embeddings. Concretely, the weights of the fully connected layer of the feature distribution extraction network are divided into positive and negative samples, the data similarity between the positive and negative samples is calculated, and the second loss is computed from this similarity.
For example, the computer device may generate the corresponding second loss as follows: according to the particularity of the multimodal projection head output by the Transformer, the potential embedding of the features is extracted as the anchor sample z(i), and the positive discrete feature embedding e₊(i) and the negative discrete feature embedding e₋(i) are treated as a set of samples B. The feature mutual information is captured using a domain-adaptive triplet metric adversarial loss:

$$\mathcal{L}_2=\frac{1}{|B|}\sum_{i\in B}\max\Bigl(0,\;dis\bigl(z^{(i)},e_{+}^{(i)}\bigr)-dis\bigl(z^{(i)},e_{-}^{(i)}\bigr)+\delta\Bigr)$$

where dis(u, v) is a similarity function used to calculate the similarity of two feature vectors and δ is a margin.
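A minimal sketch of such a triplet-style loss in PyTorch is given below. The choice of Euclidean distance for dis(u, v) and the margin value are assumptions; the patent only requires some similarity function.

```python
import torch
import torch.nn.functional as F

def triplet_second_loss(anchor, pos, neg, margin=1.0):
    # Pull the anchor embedding toward the positive discrete-feature
    # embedding and push it away from the negative one, averaged over
    # the batch B.
    d_pos = F.pairwise_distance(anchor, pos)   # dis(z, e+)
    d_neg = F.pairwise_distance(anchor, neg)   # dis(z, e-)
    return F.relu(d_pos - d_neg + margin).mean()
```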
Step S108, generating initial synthesized data through a data synthesis network based on the first characteristic distribution parameters, and generating a third loss based on the difference degree between the initial synthesized data and the original data.
Wherein the data synthesis network is used for encrypting the input data.
Specifically, the computer device may determine a corresponding feature distribution curve according to the first feature distribution parameter, randomly sample the feature distribution curve to obtain a target sampling sequence, input the target sampling sequence to the data synthesis network, output corresponding initial synthesized data, perform difference comparison on feature distribution of the initial synthesized data and the original data, and construct a third loss based on the difference result.
Step S110, fusing the first loss, the second loss and the third loss to obtain fusion loss; training the feature fusion network, the feature distribution extraction network and the data synthesis network based on the fusion loss until the fusion loss is smaller than a threshold value, and stopping training to obtain target synthesis data.
The target synthesized data is encrypted data corresponding to the original data.
Specifically, the computer device obtains the scale factors, performs a weighted fusion of the first loss, the second loss, and the third loss to obtain the fusion loss, and trains the feature fusion network, the feature distribution extraction network, and the data synthesis network according to the fusion loss; once the fusion loss is smaller than the threshold, training stops and the synthetic data generated in the last iteration is taken as the target synthetic data.
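The overall training procedure might then be organized as in the following sketch (PyTorch). The optimizer, learning rate, threshold value, and epoch budget are illustrative assumptions, and `models`, `loader`, and `compute_losses` stand in for the three networks, the data pipeline, and the loss computation described above.

```python
import torch

def train(models, loader, compute_losses, scale=(1.0, 1.0, 1.0),
          threshold=0.05, max_epochs=100, lr=1e-3):
    # Optimize the feature fusion network, the feature distribution
    # extraction network, and the data synthesis network jointly.
    params = [p for m in models for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    fused = None
    for epoch in range(max_epochs):
        for batch in loader:
            l1, l2, l3 = compute_losses(batch)   # first/second/third losses
            fused = scale[0] * l1 + scale[1] * l2 + scale[2] * l3
            opt.zero_grad()
            fused.backward()
            opt.step()
        # Stop once the fusion loss (here: of the last batch) drops below
        # the threshold, as described in step S110.
        if fused is not None and fused.item() < threshold:
            break
```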
In this embodiment, continuous and discrete features of the original data are obtained, and a fusion vector corresponding to the original data is generated from them by the feature fusion network. The target sequence in the fusion vector and the semantic vectors corresponding to the discrete features are input into the feature distribution extraction network, the corresponding first and second feature distribution parameters are extracted, and a first loss is constructed from them. A second loss is constructed from the similarity between the positive and negative samples obtained by dividing the weight values of the fully connected layer of the feature distribution extraction network. Initial synthetic data is generated through the data synthesis network based on the first feature distribution parameters, and a third loss is constructed from the degree of difference between the initial synthetic data and the original data. Finally, the first, second, and third losses are fused by weighting to obtain the fusion loss, and each network parameter is dynamically adjusted according to the fusion loss until the target synthetic data is obtained. By constructing this multi-level, multi-modal data loss function, large-scale data features can be fused and inferred efficiently in parallel, and the complex interaction relationships and context information in high-dimensional features can be captured, which effectively improves the accuracy of the generated multi-modal structured synthetic data.
In one embodiment, as shown in fig. 2, acquiring continuous features and discrete features of original data, generating a fusion vector corresponding to the original data based on a feature fusion network includes:
step S202, inputting the discrete features and the continuous features of the original data into a semantic extraction network, and generating a first feature vector corresponding to the continuous features and a second feature vector corresponding to the discrete features.
The semantic extraction network is used to extract semantic feature information from the original data; a BERT (Bidirectional Encoder Representations from Transformers) network, or any other neural network or model capable of extracting semantic feature information, may be used, and it is not specifically limited here.
Step S204, based on the fusion of the first feature vector and the second feature vector, a token sequence corresponding to the original data is generated.
The token sequence is used for representing the characteristic association relation among all data of the original data.
Specifically, in order to ensure that the feature fusion encoder can learn information of the different modalities, the computer device inserts a special token/sequence at the head of all feature embeddings produced by the embedding layer to obtain the token sequence T_in(i), as shown in the following formula 2:

$$T_{in}^{(i)}=\bigl[\,[\mathrm{CLS}]^{(i)};\;E_{num}^{(i)};\;E_{cat}^{(i)}\,\bigr] \tag{2}$$

where [CLS](i) is the inserted special token/sequence, E_num(i) is the first feature vector, and E_cat(i) is the second feature vector.
Step S206, inputting the token sequence into the feature fusion network to generate fusion vectors corresponding to the original data.
The feature fusion network may be an Attention-based encoder (Attention network) or other commonly used encoder, among others.
Specifically, the computer device inputs the token sequence T_in(i) into the feature fusion network and outputs the fusion vector T_out(i) corresponding to the original data, as shown in the following formula 3:

$$T_{out}^{(i)}=\mathrm{Encoder}\bigl(T_{in}^{(i)}\bigr)=\bigl[\,[\mathrm{CLS}]'^{(i)};\;E_{num}'^{(i)};\;E_{cat}'^{(i)}\,\bigr] \tag{3}$$

where [CLS]'(i) is the target sequence after learning by the feature fusion network, which can characterize the distribution features of each modality's data in the original data, and E_num'(i) and E_cat'(i) are the target vectors output after the first feature vector and the second feature vector are processed by the feature fusion network.
In this embodiment, the discrete and continuous features of the original data are input into the semantic extraction network to generate a first feature vector corresponding to the continuous features and a second feature vector corresponding to the discrete features; a token sequence corresponding to the original data is generated by fusing the first and second feature vectors; and the token sequence is input into the feature fusion network to generate the fusion vector corresponding to the original data. The resulting fusion vector, obtained by fully extracting semantic features and fusing them, can accurately reflect the interdependencies and distribution features among the individual data items of the original data, thereby improving the accuracy of the subsequently generated synthetic data.
In one embodiment, as shown in fig. 3, the method for obtaining the target sequence from the fusion vector, respectively inputting the target sequence and the semantic vector corresponding to the discrete feature into the feature distribution extraction network, and extracting the corresponding first feature distribution parameter and second feature distribution parameter includes:
Step S302, the target sequence is input to a Gaussian mixture encoder, and a first Gaussian distribution parameter corresponding to the target sequence is output.
The first Gaussian distribution parameter comprises at least one first data pair consisting of a mean value and a standard deviation.
Step S304, inputting semantic vectors corresponding to the discrete features into a Gaussian mixture encoder, and outputting second Gaussian distribution parameters corresponding to the discrete features.
The second distribution parameter comprises at least one second data pair consisting of a mean value and a standard deviation.
In this embodiment, the target sequence and the semantic vectors corresponding to the discrete features of the original data are respectively input into the Gaussian mixture encoder to obtain several independent potential Gaussian distribution feature parameters (i.e., several sets of means and standard deviations are output, each mean-standard-deviation pair representing an independent Gaussian distribution). The distribution features of each modality's data in the original data can thus be extracted rapidly, improving the accuracy of the distribution feature analysis.
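One way such a Gaussian mixture encoder head could be realized is sketched below (PyTorch); the layer sizes and the softplus parameterization of the standard deviations are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianMixtureHead(nn.Module):
    # Maps an input vector to K independent Gaussian components, returning
    # K (mean, std) pairs per input.
    def __init__(self, d_in, d_latent, n_components):
        super().__init__()
        # Fully connected layer; in the patent, its weights are what get
        # divided into positive/negative samples for the second loss.
        self.fc = nn.Linear(d_in, d_in)
        self.mean = nn.Linear(d_in, n_components * d_latent)
        self.std_raw = nn.Linear(d_in, n_components * d_latent)
        self.k, self.d = n_components, d_latent

    def forward(self, h):
        h = torch.relu(self.fc(h))
        mu = self.mean(h).view(-1, self.k, self.d)
        sigma = F.softplus(self.std_raw(h)).view(-1, self.k, self.d)  # std > 0
        return mu, sigma
```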
In one embodiment, as shown in fig. 4, calculating the first loss based on the first feature distribution parameter and the second feature distribution parameter includes:
Step S402, based on the first data pair, constructing and obtaining a corresponding first data distribution characteristic item.
The first data distribution characteristic item is used for representing the data distribution condition of the target sequence.
And step S404, constructing and obtaining a corresponding second data distribution characteristic item based on the second data pair.
Wherein the second data distribution characteristic term is used to characterize the data distribution of the discrete feature.
For example, the computer device may construct the probability density function corresponding to the first data pair or the second data pair according to the following formula 4:

$$p\bigl(z^{(i)}\mid c\bigr)=\prod_{k}\mathcal{N}\bigl(z^{(i)}\mid \mu_{jk},\,\sigma_{jk}^{2}\bigr)^{\mathbb{1}(c=k)} \tag{4}$$

where z(i) is the discrete feature or target sequence of the original data, 𝟙(c = k) is an indicator function, and μ_jk and σ_jk are the outputs of the Gaussian mixture encoder: μ_jk is the mean and σ_jk the standard deviation of the data pair. Since the embeddings are dynamically updated during training, the learned prior knowledge is also dynamically updated.
In step S406, the degree of difference between the first data distribution feature item and the second data distribution feature item is used as the first loss.
Specifically, the computer device aligns the first data distribution feature item with the second data distribution feature item using the KL divergence, i.e., calculates the difference value between the two. For example, the computer device may construct the first loss according to the following formula 5:

$$\mathcal{L}_1=\mathrm{KL}\Bigl(q_{\phi}\bigl(z^{(i)}\mid[\mathrm{CLS}]^{(i)}\bigr)\;\Bigl\|\;p\bigl(z^{(i)}\mid X_{cat}^{(i)}\bigr)\Bigr) \tag{5}$$

where q_φ(z(i) | [CLS](i)) is the first data distribution feature item, [CLS](i) is the target sequence, z(i) is the potential representation generated for the i-th sample, and its probability density function follows formula 4; p(z(i) | X_cat(i)) is the second data distribution feature item, and X_cat(i) is the discrete feature of the original data.
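For diagonal Gaussians, the KL divergence in formula 5 has a closed form. The sketch below computes it term by term; treating both distribution feature items as single diagonal Gaussians is an assumption.

```python
import torch

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # KL(q || p) for diagonal Gaussians, summed over latent dimensions
    # and averaged over the batch.
    var_q, var_p = sigma_q.pow(2), sigma_p.pow(2)
    kl = (torch.log(sigma_p / sigma_q)
          + (var_q + (mu_q - mu_p).pow(2)) / (2 * var_p)
          - 0.5)
    return kl.sum(dim=-1).mean()
```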
In this embodiment, a corresponding first data distribution feature item is constructed from the first data pair, a corresponding second data distribution feature item is constructed from the second data pair, and the degree of difference between the two is taken as the first loss. The distribution characteristics of the original data are thereby tied to the first loss, so that during subsequent optimization each network parameter can be dynamically adjusted in the direction that reduces the first loss, improving the accuracy of the generated target synthetic data.
In one embodiment, generating initial composite data over a data composite network based on a first feature distribution parameter and generating a third penalty based on a degree of difference between the initial composite data and the original data, comprises:
Step S502, corresponding data characteristic distribution is determined based on the first characteristic distribution parameter, and the data characteristic distribution is randomly sampled to obtain a characteristic acquisition sequence.
The first characteristic distribution parameter is used for representing the data distribution characteristics of the original data reflected in the target sequence.
Specifically, the computer device constructs Gaussian distribution data from the mean and standard deviation in the first feature distribution parameters and then randomly samples it to obtain the feature acquisition sequence; the number of samples can be set flexibly by the technician according to actual needs and is not specifically limited here.
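One standard way to implement this sampling step is the reparameterization trick, which keeps the sampling differentiable during training; the patent does not specify the mechanism, so the sketch below is an assumption.

```python
import torch

def sample_latent(mu, sigma, n_samples=1):
    # Reparameterization: z = mu + sigma * eps keeps gradients flowing
    # through mu and sigma while eps carries the randomness.
    eps = torch.randn(n_samples, *mu.shape)
    return mu + sigma * eps    # the "feature acquisition sequence"
```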
Step S504, inputting the characteristic acquisition sequence into a data synthesis network to generate initial synthesis data.
Step S506, generating a third loss based on the degree of difference between the initial synthesized data and the original data.
Specifically, the computer device may calculate the difference value between the initial synthetic data and the original data using cross entropy and construct the third loss function from the difference value. For example, the computer device may calculate the third loss according to the following formula 6:

$$\mathcal{L}_3=\sum_{j}\operatorname{cross\_entropy}\bigl(T_{\gamma_j},X_{\gamma_j}\bigr)+\frac{\bigl\|T_{\alpha}-t(X_{\alpha})\bigr\|_2^{2}}{2\theta^{2}} \tag{6}$$

where cross_entropy(T_γj, X_γj) is the cross entropy between T_γj and X_γj, the one-hot encoded columns of the j-th discrete feature in the original data and in the reconstructed data, respectively; T_α and X_α represent the columns other than the one-hot encoded columns; θ is a random parameter output with the decoder; and t(·) is the tanh function.
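A sketch of such a reconstruction loss is given below (PyTorch). The split into one-hot discrete columns and tanh-transformed continuous columns follows the reading of formula 6 above and is an assumption; the helper's argument layout is hypothetical.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(orig_onehot, recon_logits, orig_cont, recon_cont):
    # orig_onehot / recon_logits: lists with one (batch, n_classes) tensor
    # per discrete column; orig_cont / recon_cont: (batch, n_cont) tensors.
    ce = sum(F.cross_entropy(logits, target.argmax(dim=-1))
             for logits, target in zip(recon_logits, orig_onehot))
    # Squared error on tanh-transformed continuous columns (theta omitted).
    mse = F.mse_loss(torch.tanh(recon_cont), orig_cont)
    return ce + mse
```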
In this embodiment, the corresponding data feature distribution is determined based on the first feature distribution parameters and randomly sampled to obtain the feature acquisition sequence, the feature acquisition sequence is input into the data synthesis network to generate the initial synthetic data, and the third loss is generated based on the degree of difference between the initial synthetic data and the original data. The distribution features of the generated synthetic data are thereby tied to the third loss, so that during subsequent optimization each network parameter can be dynamically adjusted in the direction that reduces the third loss, improving the accuracy of the generated target synthetic data.
In one embodiment, as shown in fig. 6, generating a third loss based on the degree of difference between the initial synthetic data and the original data, includes:
Step S602, calculating cross entropy between the initial synthesized data and the original data to obtain a target difference degree.
In step S604, a third loss is constructed based on the target difference degree.
In this embodiment, the degree of difference between the synthetic data and the original data is evaluated using cross entropy, so that the model learns to generate synthetic data as similar as possible to the original data, improving the accuracy of the generated synthetic data.
In one embodiment, as shown in fig. 7, fusing the first loss, the second loss, and the third loss to obtain a fused loss includes:
in step S702, the scaling factors corresponding to the first loss, the second loss, and the third loss are obtained.
In step S704, the first loss, the second loss and the third loss are weighted and fused based on the scale factors, so as to obtain fusion loss.
Specifically, the computer device obtains the scale factors corresponding to the first, second, and third losses in the above steps, multiplies each loss by its corresponding scale factor to obtain the fusion sub-items, and combines the sub-items to obtain the fusion loss. For example, the computer device may generate the fusion loss according to the following formula 7:

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{1}+\lambda_{2}\mathcal{L}_{2}+\lambda_{3}\mathcal{L}_{3} \tag{7}$$

where λ1, λ2, and λ3 are the scale factors corresponding to the first loss L1, the second loss L2, and the third loss L3, respectively.
In this embodiment, the first, second, and third losses are weighted by their corresponding scale factors and fused to obtain the fusion loss, which comprehensively steers the optimization direction of the network parameters. This helps improve the reliability and accuracy of the finally generated target synthetic data, so that it retains high accuracy while guaranteeing security and can accurately reflect the distribution characteristics, potential cross-correlations, and the like of the original data.
The application further provides an application scenario in which the above adaptive privacy data synthesis method is applied to encrypting user privacy data. Specifically, the method is applied in this scenario as follows:
After acquiring the user privacy data to be encrypted, the computer device generates the synthetic data corresponding to the user privacy data according to the following steps:
step S1, for a real phenotype dataset Divide it into continuous mode feature setsAnd discrete modal feature setCapturing semantic relations and features among the data by using a word embedding technology, and converting all the features in the data set into vector representations T out (i) in a continuous vector space;
Step S1.1, feeding all features x(i) to the BERT embedding layer, extracting the feature dependency relationships therein, and generating the corresponding feature vectors E_num(i) and E_cat(i);
Step S1.2, in order to ensure that the feature fusion encoder can learn information of different modalities, inserting a special token [CLS] at the head of all feature embeddings produced by the embedding layer to obtain the input token sequence T_in(i) = [[CLS](i); E_num(i); E_cat(i)];
Step S2, inputting the token sequence T_in(i) into an Attention-based encoder to obtain a set of fusion vectors T_out(i);
Step S3, using GMVAE as the basic model framework and promoting latent representations that are meaningful for reconstructing samples through multi-modal prior space generation and more complex representation learning;
Step S3.1, inputting the discrete feature embeddings represented as one-hot codes into the Gaussian mixture encoder and generating several independent potential Gaussian distributions, i.e., several means μ_jk and standard deviations σ_jk;
Step S3.2, using the weights of the first fully connected layer in the Gaussian mixture encoder as the embedding of each discrete feature;
step S3.3, activating positive Gaussian to form a Gaussian mixture subspace, wherein the probability density function of the subspace is expressed as:
Wherein, Is an indicator function, μ jk and σ jk are the outputs of the Gaussian Mixture encoder. Since the embedding is dynamically updated during training, the learned prior knowledge is also dynamically updated;
Step S3.4, aligning the prior with the posterior using the KL divergence:

$$\mathcal{L}_1=\mathrm{KL}\Bigl(q_{\phi}\bigl(z^{(i)}\mid[\mathrm{CLS}]^{(i)}\bigr)\;\Bigl\|\;p\bigl(z^{(i)}\mid X_{cat}^{(i)}\bigr)\Bigr)$$

where z(i) represents the potential representation generated for the i-th sample and the Transformer-based encoder is parameterized by φ;
s4, introducing mutual information among implicit constraint features of domain self-adaptive triplet measurement countermeasure strategies, and enhancing semantic representation of feature embedding;
step S4.1, extracting potential embedding of the features as anchor samples according to the particularity of the multimode projection head output by the transducer;
Step S4.2, embedding the positive discrete features And negative discrete feature embeddingRespectively treating as a positive sample and a negative sample;
Step S4.3, for a set of samples B, capturing the feature mutual information using the domain-adaptive triplet metric adversarial loss:

$$\mathcal{L}_2=\frac{1}{|B|}\sum_{i\in B}\max\Bigl(0,\;dis\bigl(z^{(i)},e_{+}^{(i)}\bigr)-dis\bigl(z^{(i)},e_{-}^{(i)}\bigr)+\delta\Bigr)$$

where dis(u, v) is a similarity function for calculating the similarity of two feature vectors and δ is a margin;
Step S5, converting T_out(i) into synthetic data X: let t be one record in the sample set B containing C continuous attributes; given that a token of t is available, labeled T, the synthetic data X is reconstructed.
Step S5.1, evaluating the degree of difference between the synthetic data and the real data using cross entropy, so as to learn to generate synthetic data similar to the real data; the corresponding reconstruction loss can be expressed as:

$$\mathcal{L}_3=\sum_{j}\operatorname{cross\_entropy}\bigl(T_{\gamma_j},X_{\gamma_j}\bigr)+\frac{\bigl\|T_{\alpha}-t(X_{\alpha})\bigr\|_2^{2}}{2\theta^{2}}$$

where T_γj and X_γj are the one-hot encoded columns of the j-th discrete feature in the original data and in the reconstructed data, T_α and X_α represent the columns other than the one-hot encoded columns, θ is a random parameter output with the decoder, and t(·) is the tanh function;
Step S5.2, further converting the potential representation into the synthetic sample X using a decoder, a process that can be expressed as:

$$X=\mathrm{Decoder}\bigl(z^{(i)}\bigr)$$
Step S5.3, using the above L1, L2, and L3, balanced with the scale factors λ_i, calculating the gradients and updating the model parameters of the Attention-based encoder, the Gaussian mixture encoder, and the decoder, so that the direction and magnitude of the parameter updates move the model in the direction that reduces the overall loss:

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{1}+\lambda_{2}\mathcal{L}_{2}+\lambda_{3}\mathcal{L}_{3}$$
In this embodiment, the computer device may tune the model for a specific task, gradually generate reconstructed tokens under random noise or specific conditions, and finally decode the reconstructed tokens in reverse to obtain the synthetic data, thereby reducing the risk of leaking private information in the data.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise several sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
In one embodiment, as shown in fig. 8, there is provided an adaptive privacy data synthesizing apparatus, which may employ software modules or hardware modules, or a combination of both, as part of a computer device, the apparatus specifically including: a feature extraction module 802, a first computation module 804, a second computation module 806, a third computation module 808, a data generation module 810, wherein:
the feature extraction module 802, configured to obtain continuous features and discrete features of the original data and to generate a fusion vector corresponding to the original data based on a feature fusion network, where the fusion vector is used to characterize the association relationships among the individual data items of the original data;
the first calculation module 804, configured to obtain a target sequence from the fusion vector, input the target sequence and the semantic vectors corresponding to the discrete features into the feature distribution extraction network, respectively, extract the corresponding first and second feature distribution parameters, and calculate a first loss based on the first and second feature distribution parameters;
the second calculation module 806, configured to divide the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive and negative samples, and to construct a second loss based on the similarity of the positive and negative samples;
the third calculation module 808, configured to generate initial synthetic data through a data synthesis network based on the first feature distribution parameters and to generate a third loss based on the degree of difference between the initial synthetic data and the original data, where the data synthesis network is used to encrypt the input data;
the data generation module 810, configured to fuse the first loss, the second loss, and the third loss to obtain a fusion loss, and to train the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss until the fusion loss is smaller than a threshold, then stop training to obtain target synthetic data, where the target synthetic data is encrypted data corresponding to the original data.
In one embodiment, the feature extraction module 802 is further configured to input the discrete feature and the continuous feature of the original data into the semantic extraction network, and generate a first feature vector corresponding to the continuous feature and a second feature vector corresponding to the discrete feature; generating a token sequence corresponding to the original data based on the fusion of the first feature vector and the second feature vector, wherein the token sequence is used for representing the feature association relation among all data of the original data; and inputting the token sequence into a feature fusion network to generate a fusion vector corresponding to the original data.
In one embodiment, the first calculation module 804 is further configured to input the target sequence to a gaussian mixture encoder, and output a first gaussian distribution parameter corresponding to the target sequence, where the first gaussian distribution parameter includes at least one first data pair consisting of a mean value and a standard deviation; and inputting the semantic vector corresponding to the discrete feature into a Gaussian mixture encoder, and outputting a second Gaussian distribution parameter corresponding to the discrete feature, wherein the second distribution parameter comprises at least one second data pair consisting of a mean value and a standard deviation.
In one embodiment, the first computing module 804 is further configured to construct a corresponding first data distribution feature item based on the first data pair, where the first data distribution feature item is used to characterize a data distribution situation of the target sequence; constructing a corresponding second data distribution characteristic item based on the second data pair, wherein the second data distribution characteristic item is used for representing the data distribution condition of the discrete characteristic; the degree of difference between the first data distribution characteristic item and the second data distribution characteristic item is taken as a first loss.
In one embodiment, the third computing module 808 is further configured to determine a corresponding data feature distribution based on the first feature distribution parameter, and randomly sample the data feature distribution to obtain a feature acquisition sequence; inputting the characteristic acquisition sequence into a data synthesis network to generate initial synthesis data; a third penalty is generated based on the degree of difference between the initial synthetic data and the original data.
In one embodiment, the third calculation module 808 is further configured to calculate a cross entropy between the initial synthesized data and the original data, to obtain a target difference degree; based on the target degree of difference, a third penalty is constructed.
In one embodiment, the data generating module 810 is further configured to obtain scaling factors corresponding to the first loss, the second loss, and the third loss; and carrying out weighted fusion on the first loss, the second loss and the third loss based on the scale factors to obtain fusion loss.
With the above adaptive privacy data synthesis apparatus, continuous and discrete features of the original data are obtained, and a fusion vector corresponding to the original data is generated from them by the feature fusion network. The target sequence in the fusion vector and the semantic vectors corresponding to the discrete features are input into the feature distribution extraction network, the corresponding first and second feature distribution parameters are extracted, and a first loss is constructed from them. A second loss is constructed from the similarity between the positive and negative samples obtained by dividing the weight values of the fully connected layer of the feature distribution extraction network. Initial synthetic data is generated through the data synthesis network based on the first feature distribution parameters, and a third loss is constructed from the degree of difference between the initial synthetic data and the original data. Finally, the first, second, and third losses are fused by weighting to obtain the fusion loss, and each network parameter is dynamically adjusted according to the fusion loss until the target synthetic data is obtained. By constructing this multi-level, multi-modal data loss function, large-scale data features can be fused and inferred efficiently in parallel, and the complex interaction relationships and context information in high-dimensional features can be captured, which effectively improves the accuracy of the generated multi-modal structured synthetic data.
For specific limitations on the adaptive privacy data synthesizing apparatus, reference may be made to the above limitation on the adaptive privacy data synthesizing method, and no further description is given here. The respective modules in the above-described adaptive privacy data synthesizing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store the original data to be encrypted. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an adaptive privacy data synthesis method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by a processor, implements an adaptive privacy data synthesis method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by persons skilled in the art that the structures shown in figs. 9 and 10 are merely block diagrams of portions of the structures associated with the aspects of the application and do not limit the computer devices to which the aspects of the application may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is also provided, comprising a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the method embodiments described above.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the flows of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (10)

1. A method of encrypting private data, the method comprising:
Acquiring continuous features and discrete features of original data, and generating a fusion vector corresponding to the original data based on a feature fusion network, which comprises: inputting the discrete features and the continuous features of the original data into a semantic extraction network to generate a first feature vector corresponding to the continuous features and a second feature vector corresponding to the discrete features; fusing the first feature vector and the second feature vector to generate a token sequence corresponding to the original data, wherein the token sequence characterizes the feature association relationships among the respective data of the original data; and inputting the token sequence into the feature fusion network to generate the fusion vector corresponding to the original data, wherein the fusion vector characterizes the association relationships among the respective data of the original data;
Acquiring a target sequence from the fusion vector, inputting the target sequence and the semantic vector corresponding to the discrete features into a feature distribution extraction network respectively, and extracting a corresponding first feature distribution parameter and second feature distribution parameter; and calculating a first loss based on the first feature distribution parameter and the second feature distribution parameter;
Dividing the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive samples and negative samples, and constructing a second loss based on the similarity between the positive samples and the negative samples;
Generating initial synthesized data through a data synthesis network based on the first feature distribution parameter, and generating a third loss based on the degree of difference between the initial synthesized data and the original data, wherein the data synthesis network is used to encrypt input data;
Fusing the first loss, the second loss, and the third loss to obtain a fusion loss; and training the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss, stopping training when the fusion loss is smaller than a threshold, and obtaining target synthesized data, wherein the target synthesized data is the encrypted data corresponding to the original data.
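(Illustration only, not part of the claims.) A minimal Python sketch of the second loss of claim 1: the weight matrix of the feature distribution extraction network's fully connected layer is split into positive and negative sample groups, and the groups are scored by similarity. The even split, the cosine similarity, and all layer sizes are assumptions; the claim fixes only that the weights are divided and that the loss is built from sample similarity.

import torch
import torch.nn.functional as F

def second_loss(fc_weight: torch.Tensor) -> torch.Tensor:
    # fc_weight: (out_features, in_features) weights of the fully connected
    # layer of the feature distribution extraction network.
    half = fc_weight.shape[0] // 2  # assumed even split into two groups
    positive, negative = fc_weight[:half], fc_weight[half:2 * half]
    # Mean pairwise cosine similarity between the two groups; minimizing it
    # pushes the positive and negative weight groups apart.
    sim = F.cosine_similarity(positive.unsqueeze(1), negative.unsqueeze(0), dim=-1)
    return sim.mean()

fc = torch.nn.Linear(64, 32)    # hypothetical layer sizes
loss2 = second_loss(fc.weight)  # scalar, differentiable tensor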
2. The method of claim 1, wherein the feature fusion network is an attention network.
3. The method according to claim 1, wherein acquiring the target sequence from the fusion vector, inputting the target sequence and the semantic vector corresponding to the discrete features into the feature distribution extraction network respectively, and extracting the corresponding first feature distribution parameter and second feature distribution parameter comprises:
Inputting the target sequence to a Gaussian mixture encoder, and outputting a first Gaussian distribution parameter corresponding to the target sequence, wherein the first Gaussian distribution parameter comprises at least one first data pair consisting of a mean value and a standard deviation;
And inputting the semantic vector corresponding to the discrete feature into the Gaussian mixture encoder, and outputting a second Gaussian distribution parameter corresponding to the discrete feature, wherein the second Gaussian distribution parameter comprises at least one second data pair consisting of a mean value and a standard deviation.
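(Illustration only, not part of the claims.) The Gaussian mixture encoder of claim 3 could be sketched as a shared trunk with per-component mean and standard deviation heads, so that each input yields at least one (mean, standard deviation) data pair. The component count, layer widths, and the softplus used to keep standard deviations positive are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianMixtureEncoder(nn.Module):
    def __init__(self, in_dim: int, n_components: int = 4, latent_dim: int = 16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, n_components * latent_dim)
        self.std_head = nn.Linear(64, n_components * latent_dim)
        self.shape = (n_components, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        mean = self.mean_head(h).view(-1, *self.shape)
        std = F.softplus(self.std_head(h)).view(-1, *self.shape)  # keep std > 0
        return mean, std  # one (mean, std) data pair per mixture component

encoder = GaussianMixtureEncoder(in_dim=128)
mu, sigma = encoder(torch.randn(8, 128))  # each of shape (8, 4, 16)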
4. The method according to claim 3, wherein calculating the first loss based on the first feature distribution parameter and the second feature distribution parameter comprises:
Constructing a corresponding first data distribution characteristic item based on the first data pair, wherein the first data distribution characteristic item characterizes the data distribution of the target sequence;
Constructing a corresponding second data distribution characteristic item based on the second data pair, wherein the second data distribution characteristic item characterizes the data distribution of the discrete features;
And taking the degree of difference between the first data distribution characteristic item and the second data distribution characteristic item as the first loss.
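(Illustration only, not part of the claims.) Claim 4 fixes only a "degree of difference" between the two data distribution characteristic items; one plausible instantiation, assumed here, is the closed-form KL divergence between diagonal Gaussians built from the (mean, standard deviation) pairs.

import torch

def first_loss(mu1, std1, mu2, std2, eps: float = 1e-6) -> torch.Tensor:
    # Closed-form KL(N(mu1, std1^2) || N(mu2, std2^2)) per dimension, averaged.
    var1, var2 = std1.pow(2) + eps, std2.pow(2) + eps
    kl = 0.5 * (torch.log(var2 / var1) + (var1 + (mu1 - mu2).pow(2)) / var2 - 1.0)
    return kl.mean()

mu_t, std_t = torch.zeros(8, 16), torch.ones(8, 16)        # target-sequence item
mu_d, std_d = torch.randn(8, 16), 0.5 * torch.ones(8, 16)  # discrete-feature item
loss1 = first_loss(mu_t, std_t, mu_d, std_d)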
5. The method of claim 1, wherein generating the initial synthesized data through the data synthesis network based on the first feature distribution parameter and generating the third loss based on the degree of difference between the initial synthesized data and the original data comprises:
Determining a corresponding data feature distribution based on the first feature distribution parameter, and randomly sampling the data feature distribution to obtain a feature acquisition sequence;
Inputting the feature acquisition sequence into the data synthesis network to generate the initial synthesized data;
And generating the third loss based on the degree of difference between the initial synthesized data and the original data.
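(Illustration only, not part of the claims.) For claim 5, reparameterized Gaussian sampling followed by a small decoder can stand in for the random sampling of the data feature distribution and the data synthesis network; both the sampling scheme and the decoder architecture are assumptions.

import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))  # hypothetical

def synthesize(mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    z = mean + std * torch.randn_like(std)  # random sample: feature acquisition sequence
    return decoder(z)                       # initial synthesized data

initial = synthesize(torch.zeros(8, 16), torch.ones(8, 16))  # shape (8, 128)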
6. The method of claim 5, wherein generating the third loss based on the degree of difference between the initial synthesized data and the original data comprises:
Calculating the cross entropy between the initial synthesized data and the original data to obtain a target degree of difference;
And constructing the third loss based on the target degree of difference.
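(Illustration only, not part of the claims.) The third loss of claim 6 is a cross entropy between the initial synthesized data and the original data; treating each attribute as a categorical prediction scored against the original class index is an assumption, as the claim fixes only the cross-entropy form.

import torch
import torch.nn.functional as F

def third_loss(synth_logits: torch.Tensor, original_labels: torch.Tensor) -> torch.Tensor:
    # synth_logits: (batch, n_classes) scores for the synthesized data;
    # original_labels: (batch,) class indices taken from the original data.
    return F.cross_entropy(synth_logits, original_labels)  # target degree of difference

loss3 = third_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)))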
7. The method of claim 1, wherein fusing the first loss, the second loss, and the third loss to obtain the fusion loss comprises:
Obtaining scale factors respectively corresponding to the first loss, the second loss, and the third loss;
And performing weighted fusion of the first loss, the second loss, and the third loss based on the scale factors to obtain the fusion loss.
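(Illustration only, not part of the claims.) The weighted fusion of claim 7, combined with the threshold-based stopping rule of claim 1, might be sketched as below; the scale factor values, the optimizer, the threshold, and the placeholder losses are all assumptions.

import torch

alpha, beta, gamma = 1.0, 0.1, 1.0  # assumed scale factors
threshold = 0.05                    # assumed stopping threshold

params = [torch.nn.Parameter(torch.randn(4, 4))]  # stand-in for the three networks
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(10_000):
    # Placeholders; the real first, second, and third losses would be
    # computed from the networks as in claims 1, 4, and 6.
    loss1 = params[0].pow(2).mean()
    loss2 = params[0].abs().mean()
    loss3 = (params[0] - 1.0).pow(2).mean()
    fusion_loss = alpha * loss1 + beta * loss2 + gamma * loss3
    if fusion_loss.item() < threshold:
        break  # stop training once the fusion loss falls below the threshold
    optimizer.zero_grad()
    fusion_loss.backward()
    optimizer.step()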
8. A private data encryption apparatus, the apparatus comprising:
The feature extraction module is configured to acquire continuous features and discrete features of original data and to generate a fusion vector corresponding to the original data based on a feature fusion network, which comprises: inputting the discrete features and the continuous features of the original data into a semantic extraction network to generate a first feature vector corresponding to the continuous features and a second feature vector corresponding to the discrete features; fusing the first feature vector and the second feature vector to generate a token sequence corresponding to the original data, wherein the token sequence characterizes the feature association relationships among the respective data of the original data; and inputting the token sequence into the feature fusion network to generate the fusion vector corresponding to the original data, wherein the fusion vector characterizes the association relationships among the respective data of the original data;
The first calculation module is configured to acquire a target sequence from the fusion vector, input the target sequence and the semantic vector corresponding to the discrete features into a feature distribution extraction network respectively, extract a corresponding first feature distribution parameter and second feature distribution parameter, and calculate a first loss based on the first feature distribution parameter and the second feature distribution parameter;
The second calculation module is configured to divide the weight values of the fully connected layer of the feature distribution extraction network to obtain corresponding positive samples and negative samples, and to construct a second loss based on the similarity between the positive samples and the negative samples;
The third calculation module is configured to generate initial synthesized data through a data synthesis network based on the first feature distribution parameter, and to generate a third loss based on the degree of difference between the initial synthesized data and the original data, wherein the data synthesis network is used to encrypt input data;
The data generation module is configured to fuse the first loss, the second loss, and the third loss to obtain a fusion loss, and to train the feature fusion network, the feature distribution extraction network, and the data synthesis network based on the fusion loss, stopping training when the fusion loss is smaller than a threshold and obtaining target synthesized data, wherein the target synthesized data is the encrypted data corresponding to the original data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202311199403.9A 2023-09-18 2023-09-18 Self-adaptive privacy data synthesis method, device, computer equipment and storage medium Active CN117235665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311199403.9A CN117235665B (en) 2023-09-18 2023-09-18 Self-adaptive privacy data synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117235665A CN117235665A (en) 2023-12-15
CN117235665B true CN117235665B (en) 2024-06-25

Family

ID=89096233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311199403.9A Active CN117235665B (en) 2023-09-18 2023-09-18 Self-adaptive privacy data synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117235665B (en)

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126175A (en) * 2019-12-05 2020-05-08 厦门大象东方科技有限公司 Facial image recognition algorithm based on deep convolutional neural network
US11481418B2 (en) * 2020-01-02 2022-10-25 International Business Machines Corporation Natural question generation via reinforcement learning based graph-to-sequence model
CN111261146B (en) * 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN113392359A (en) * 2021-08-18 2021-09-14 腾讯科技(深圳)有限公司 Multi-target prediction method, device, equipment and storage medium
CN114358109A (en) * 2021-10-26 2022-04-15 腾讯科技(深圳)有限公司 Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN116580257A (en) * 2022-01-24 2023-08-11 腾讯科技(深圳)有限公司 Feature fusion model training and sample retrieval method and device and computer equipment
CN114927162B (en) * 2022-05-19 2024-06-14 大连理工大学 Multi-mathematic association phenotype prediction method based on hypergraph characterization and dirichlet allocation
CN115426205B (en) * 2022-11-05 2023-02-10 北京淇瑀信息科技有限公司 Encrypted data generation method and device based on differential privacy
CN115687934A (en) * 2022-12-30 2023-02-03 智慧眼科技股份有限公司 Intention recognition method and device, computer equipment and storage medium
CN115982403B (en) * 2023-01-12 2024-02-02 之江实验室 Multi-mode hash retrieval method and device
CN116150620A (en) * 2023-02-17 2023-05-23 平安科技(深圳)有限公司 Training method, device, computer equipment and medium for multi-modal training model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A color image fusion method using discrete wavelet frames; Shaoqing Yang; 2011 International Conference on Electronic & Mechanical Engineering and Information Technology; 2011-09-19; pp. 2807-2810 *
BinaryFace: a face template protection model based on deep convolutional neural networks; Zhao Chenghui; Li Yong; Zhang Zhenjiang; Journal of Cyber Security; 2020-09-15 (No. 05); full text *

Similar Documents

Publication Publication Date Title
US10909419B2 (en) Abnormality detection device, learning device, abnormality detection method, and learning method
BR112020022270A2 (en) systems and methods for unifying statistical models for different data modalities
US20150254554A1 (en) Information processing device and learning method
CN112787971B (en) Construction method of side channel attack model, password attack equipment and computer storage medium
CN116415170A (en) Prompt learning small sample classification method, system, equipment and medium based on pre-training language model
CN116015932B (en) Intrusion detection network model generation method and data flow intrusion detection method
Efimov et al. Using generative adversarial networks to synthesize artificial financial datasets
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN117454495A (en) CAD vector model generation method and device based on building sketch outline sequence
Dapor et al. Challenges in recovering a consistent cosmology from the effective dynamics of loop quantum gravity
CN117235665B (en) Self-adaptive privacy data synthesis method, device, computer equipment and storage medium
Bannwarth et al. Probabilistic algorithm for computing the dimension of real algebraic sets
CN116975651A (en) Similarity determination model processing method, target object searching method and device
CN117909517A (en) Knowledge graph completion method, apparatus, device, storage medium, and program product
Kalatzis et al. Density estimation on smooth manifolds with normalizing flows
RU2435214C2 (en) Method for fast search in codebook with vector quantisation
US20230409374A1 (en) Information processing apparatus, information processing method, and information processing system
He et al. Crude Oil Price Prediction using Embedding Convolutional Neural Network Model
Zhang et al. Rare event simulation for large-scale structures with local nonlinearities
CN116881871B (en) Model watermark embedding method, device, computer equipment and storage medium
CN117670878B (en) VOCs gas detection method based on multi-mode data fusion
CN115994541B (en) Interface semantic data generation method, device, computer equipment and storage medium
CN118734091B (en) Mask-index modeling-based visual localization and index segmentation method, system, device, and storage medium
CN117009792A (en) Model data processing method, device, computer equipment and storage medium
Yang et al. Adaptive hyperball Kriging method for efficient reliability analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant