WO2023065632A1 - Data desensitization method, data desensitization apparatus, device, and storage medium - Google Patents
- Publication number
- WO2023065632A1 (PCT/CN2022/089872)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present application relates to the field of artificial intelligence, and in particular to a data desensitization method, a data desensitization device, computer equipment, and a storage medium.
- Data desensitization technology is an effective method to solve data security problems and risks.
- Data desensitization refers to transforming key information or personal information according to preset rules or transformations, so that personal identity cannot be identified or the key information is hidden.
- Common structured-data desensitization methods are based on anonymization techniques or scrambling techniques.
- the inventor realized that in structured-data desensitization methods based on anonymization or scrambling techniques, there is a one-to-one mapping between the desensitized data and the original data. This makes the desensitized data easy to reverse, so the original data can easily be restored, private information in the original data is leaked, and data security is poor.
- the present application provides a data desensitization method, data desensitization device, computer equipment, and storage medium, aiming to solve the problem that existing desensitization methods are easily reversed and private information is easily leaked.
- the present application provides a data desensitization method, the method comprising:
- Preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing includes data discretization or data normalization;
- Based on a conditional loss function, performing conditional random sampling processing on the discrete variables to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the splicing vector is input to the pre-trained generator for desensitization processing to obtain desensitized data.
- the present application also provides a data desensitization device, the data desensitization device includes:
- the key information extraction module is used to obtain user data, and based on the pre-trained key information identification model, perform information identification on the user data to obtain key information;
- An information processing module configured to preprocess the key information to obtain discrete variables corresponding to the key information, where the preprocessing includes data discretization or data normalization;
- a vector splicing module configured to perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the data desensitization module is used to input the splicing vector into the pre-trained generator for desensitization processing to obtain desensitized data.
- the present application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and implement the following steps when executing the computer program:
- Preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing includes data discretization or data normalization;
- Based on a conditional loss function, performing conditional random sampling processing on the discrete variables to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the splicing vector is input to the pre-trained generator for desensitization processing to obtain desensitized data.
- the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements the following steps:
- Preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing includes data discretization or data normalization;
- Based on a conditional loss function, performing conditional random sampling processing on the discrete variables to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the splicing vector is input to the pre-trained generator for desensitization processing to obtain desensitized data.
- the data desensitization method, data desensitization apparatus, device, and storage medium disclosed in the embodiments of the present application generate splicing vectors by extracting the key information of user data and the discrete variables of that key information, and use a pre-trained generator to desensitize the splicing vectors to obtain desensitized data, so that the desensitized data cannot easily be reversed, thereby ensuring that private data is not leaked and improving the security of the desensitized data.
- FIG. 1 is a schematic diagram of a scenario of a data desensitization method provided in an embodiment of the present application
- Fig. 2 is a schematic flow chart of a data desensitization method provided in the embodiment of the present application
- Fig. 3 is a schematic block diagram of a data desensitization device provided by an embodiment of the present application.
- Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
- Data desensitization technology is a data processing technology that can reduce and remove data sensitivity by processing data.
- the use of data desensitization technology can reduce the risk and harm of data leakage and effectively protect the privacy of user data.
- users can store, view and share personal medical and health data through their personal digital space, but personal medical data will face the risk of leaking sensitive medical information of users in the process of online medical treatment, online purchase of medicines, outpatient appointments, etc.
- user data has extremely high authenticity and sensitivity. Once the user's personal sensitive information is leaked, it may pose a potential threat to the user's life.
- the information in the personal digital space can be used for business-related analysis and processing, while avoiding the leakage of user data.
- Common structured data desensitization methods are desensitization methods based on anonymization technology or scrambling technology.
- Common anonymization techniques include k-anonymity, l-diversity, and t-closeness. They generalize the quasi-identifiers of individual records so that the records become indistinguishable within the entire data set, thereby achieving the desensitization effect. Scrambling-based techniques add noise to the records, such as additive or multiplicative noise on continuous values, to achieve the same effect.
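The scrambling approach described above can be illustrated with a minimal sketch in Python; the function name and noise level are illustrative assumptions, not part of the patent.

```python
import random

def scramble(values, sigma, rng=None):
    """Desensitize continuous values by adding zero-mean Gaussian noise,
    a simple scrambling-based masking scheme."""
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, sigma) for v in values]
```

This illustrates the weakness the patent targets: each scrambled value is a direct perturbation of one original value, so the mapping is one-to-one and statistically reversible.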
- this application provides a data desensitization method, which can be applied in the server, specifically in multiple fields such as finance and medical treatment.
- a pre-trained generator is obtained, the sensitive information of the user data is extracted, and the pre-trained generator desensitizes the sensitive information to obtain desensitized data, so that the desensitized data cannot easily be reversed, thereby ensuring that private data is not leaked and improving the security of the desensitized data.
- the server may be, for example, an independent server or a server cluster.
- the following embodiments will introduce in detail the data desensitization method applied to the server.
- the data desensitization method provided in the embodiment of the present application can be applied to the application environment shown in FIG. 1 .
- the application environment includes a terminal device 110 and a server 120, wherein the terminal device 110 can communicate with the server 120 through a network.
- the server 120 obtains the user data sent by the terminal device 110, performs key information extraction, information processing, and desensitization processing on the user data to generate desensitized data, and sends the desensitized data to the terminal device 110, thereby realizing data desensitization processing.
- the server 120 can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
- the terminal device 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
- the terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
- FIG. 2 is a schematic flowchart of a data desensitization method provided by an embodiment of the present application.
- the data desensitization method can be applied on the server, so that the desensitized data cannot easily be reverse-deciphered, thereby ensuring that private data is not leaked and improving the security of the desensitized data.
- the data desensitization method includes steps S101 to S104.
- S101 Acquire user data, and perform information identification on the user data based on a pre-trained key information identification model to obtain key information.
- the user data is data containing key information, and may specifically include medical data such as medical record data, financial data such as bank account data, and the like.
- the key information identification model may be a pre-trained BERT-CRF model based on an attention mechanism, which is used to extract key information in user data.
- the key information is the information that the user needs to desensitize, which is generally the user's private information.
- the key information can be the height and weight information in medical record data, or the account balance information and investment information in bank account data. It should be noted that any sensitive or private information can be used as key information.
- Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- word segmentation processing is performed on the user data to obtain multiple word segmentations; feature extraction is performed on each of the word segmentations to obtain the embedded features of each of the word segmentations;
- word meaning prediction is performed to obtain the meaning corresponding to each word segment; the multiple word segments are then screened according to their meanings to obtain the key information.
- the embedding features are word embedding features, position embedding features and segmentation embedding features.
- the word embedding feature is a vector representation of each word segment
- the position embedding is a vector representation of each word segment position
- the segmentation embedding feature is used to distinguish two different sentences.
- the user data can be segmented based on a word segmentation algorithm to obtain multiple word segments.
- the word segmentation algorithm may be the forward maximum matching method, the reverse maximum matching method, a word segmentation algorithm based on a hidden Markov model, a word segmentation algorithm based on conditional random fields, or other algorithms.
- for example, the word segmentation algorithm based on the hidden Markov model can be used to segment user data such as the medical record text "the patient has symptoms such as frequent urination, hunger, anxiety, and tremor, and diabetes is suspected" to obtain multiple corresponding word segments such as "frequent urination", "hunger", "anxiety", and "tremor".
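As an illustration of one of the listed algorithms, a forward maximum matching segmenter can be sketched as follows; the toy dictionary and function name are assumptions, and real systems would use a large lexicon or the HMM/CRF models named above.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

For instance, with the toy dictionary {"ab", "abc", "cd"}, the text "abcd" segments into ["abc", "d"] because the longest match is preferred at each position.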
- feature extraction can be performed on each word segment to obtain its embedded features; based on the word meaning prediction model, the meaning of each word segment is predicted from its embedded features to obtain the word meaning prediction result for each segment, and the multiple word segments are filtered based on those results to obtain the key information.
- in this way, text features can be mined to the greatest extent and richer word representations extracted, overcoming the shortcomings of traditional word vectors such as Word2vec and GloVe, which cannot dynamically represent context information or resolve polysemy. The similarity between each word segment and the preset standard sensitive words can therefore be obtained quickly, and the corresponding key information obtained quickly in turn.
- the word meaning prediction model is used to predict the similarity between each word segmentation and the preset standard sensitive word segmentation
- the word meaning prediction model is obtained by training a semantic matching model on a standard sensitive word-segment database
- the semantic matching model may be an LSTM matching model, an MV-DSSM model, an ESIM model, or another model
- the word meaning prediction result is the similarity between each participle and the standard sensitive participle in the standard sensitive participle database.
- for example, the word segments include words such as the account balance from the account information and segments such as the stock trend information
- feature extraction can be performed on each word segment to obtain its word embedding features, position embedding features, and segmentation embedding features; based on the LSTM matching model, word meaning prediction is performed for each word segment according to these features to obtain the word meaning prediction result for each segment, and based on those results the word segments corresponding to the stock trend information are filtered out to obtain the key information.
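The similarity-based screening of word segments against standard sensitive words can be sketched with cosine similarity over embedding vectors; the toy embeddings, threshold, and function names are illustrative assumptions standing in for the LSTM matching model.

```python
def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def filter_sensitive(segments, embeddings, sensitive_vecs, threshold=0.8):
    """Keep only segments whose embedding is close to some standard
    sensitive word vector."""
    keep = []
    for seg in segments:
        sim = max(cosine_similarity(embeddings[seg], s) for s in sensitive_vecs)
        if sim >= threshold:
            keep.append(seg)
    return keep
```

A segment like "account balance" whose vector matches a sensitive prototype is retained as key information, while unrelated segments fall below the threshold and are dropped.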
- since the key information is generally continuous data, it is necessary to convert between continuous and discrete representations, that is, to perform data preprocessing, which is a key step for neural network input and output.
- the key information is information such as height and weight
- the key information is continuous data
- the key information is information such as the number of investment companies
- the key information is discrete data
- the discrete variable refers to a variable whose value can be listed in a certain order, and usually takes an integer value, such as the number of employees, the number of factories, the number of machines, and the like.
- the data normalization processing may include maximum-minimum normalization processing and normalization processing according to a Gaussian mixture model; the data discretization processing may include K-bins discretization processing and regression tree discretization processing.
- the key information is subjected to maximum and minimum normalization processing to obtain the discrete variable corresponding to the key information; or, the key information is normalized through a Gaussian mixture model to obtain the key information A discrete variable corresponding to the information; or, K-bins discretization processing is performed on the key information to obtain a discrete variable corresponding to the key information; or, a regression tree discretization process is performed on the key information to obtain the key information Corresponding discrete variables.
- since the key information is continuous data, it can be mapped to the range [0,1] through a maximum-minimum linear transformation, so that the continuous value can be represented with a tanh activation function, obtaining the discrete variable corresponding to the key information.
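The maximum-minimum linear transformation to [0,1] can be sketched as follows (the zero-span guard is an illustrative assumption for constant columns):

```python
def min_max_normalize(values):
    """Map a list of continuous values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # guard against a constant column (assumption)
    return [(v - lo) / span for v in values]
```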
- the Gaussian mixture model can be used to fit the key information; a Gaussian component is sampled according to the probability of each Gaussian component of the key information in the mixture model, and the sampled Gaussian component is used to normalize the key information in the record. The key information can then be composed of the normalized representation and the one-hot encoding of the Gaussian component, so as to obtain the discrete variables corresponding to the key information.
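The mode-specific normalization described here can be sketched as follows, assuming known mixture components; picking the component with the highest responsibility and the 4-sigma normalization factor are illustrative assumptions, not stated in the patent.

```python
import math

def mode_specific_normalize(x, components):
    """Normalize a scalar within its most responsible Gaussian component.
    components: list of (weight, mu, sigma) tuples of a fitted mixture.
    Returns (normalized scalar, one-hot code of the chosen component)."""
    def pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    resp = [w * pdf(x, mu, s) for w, mu, s in components]
    k = resp.index(max(resp))
    _, mu, sigma = components[k]
    alpha = (x - mu) / (4.0 * sigma)                          # normalized value
    beta = [1 if i == k else 0 for i in range(len(components))]  # component one-hot
    return alpha, beta
```

The pair (alpha, beta) is exactly the "normalized representation plus one-hot encoding of the Gaussian component" described above.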
- K-bins discretization processing may be performed on the key information to obtain discrete variables corresponding to the key information.
- the discretization can also be called binning: the key information is divided into intervals according to certain rules, and each interval is represented by a one-hot encoding, so that the key information is fitted with a piecewise function containing four intervals to obtain the discrete variables corresponding to the key information.
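K-bins discretization with one-hot interval codes can be sketched as follows, assuming fixed cut points (three interior edges give the four intervals mentioned above):

```python
from bisect import bisect_right

def k_bins_one_hot(value, edges):
    """Map a continuous value to the one-hot code of its bin.
    edges are the interior cut points of K = len(edges) + 1 intervals."""
    k = bisect_right(edges, value)
    return [1 if i == k else 0 for i in range(len(edges) + 1)]
```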
- the key information may be discretized using a CART regression tree to obtain discrete variables corresponding to the key information.
- the CART regression tree can predict continuous data, and its leaf node represents a predicted value.
- the key information can be converted into discrete values by expressing a series of leaf nodes of the regression tree or regression tree set of key information through one-hot encoding.
- the conditional loss function is a conditional loss function based on a generative adversarial network, and the data terms of the loss function are generated based on conditional probabilities.
- the original intention is to enable data to be generated according to conditions, so that for the same type, the distributions of the data to be desensitized and the generated desensitized data are as consistent as possible.
- the training process can be constrained by predicting the condition variables, so that the values of the condition variables are consistent with the values of the corresponding variables in the generated data, and the effect of data generation can be further optimized.
- the conditional embedding vector is obtained by randomly selecting, with equal probability, a discrete variable that meets a preset condition from the multiple discrete variables corresponding to the key information, and the hidden vector is sampled from the white noise corresponding to the key information.
- the concatenated vector is obtained by concatenating the conditional embedding vector and the latent vector, and is used as an input of the generator.
- the one-to-one mapping relationship between the desensitized data and the original data is thereby changed, so that the desensitized data is not easy to reverse and the private information cannot be obtained.
- the distributed representation of the discrete variable can be obtained by constructing the probability mass distribution function of each value of the discrete variable, and the distributed representation of the discrete variable is subjected to conditional random sampling processing to obtain the conditional embedding vector and hidden vector.
- the white noise corresponding to the discrete variable may be converted by a deep neural network to generate a latent vector from the distributed representation of the discrete variable.
- the conditional embedding vector is converted to obtain a one-hot encoding; the one-hot encoding is concatenated with the hidden vector to obtain a concatenated vector.
- one-hot encoding, also known as one-effective encoding, uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one of them is valid. Converting the conditional embedding vector into a one-hot encoding solves the problem that the discriminator cannot handle attribute data well, and also expands the vector features to a certain extent.
- the conditional embedding vector can be transformed through a deep neural network to obtain a one-hot encoding, and the one-hot encoding is spliced with the latent vector to obtain the splicing vector. In this way, a splicing vector that meets the input requirements of the generator is obtained.
- the pre-trained generator is obtained through generative adversarial network training, and the desensitized data is the data obtained after desensitizing the key information in the data to be desensitized.
- the splicing vector corresponding to the training data is obtained, and the splicing vector is input to the first generator for desensitization processing to obtain desensitized data; the preset discriminator is trained based on the desensitized data and the training data to obtain a pre-trained discriminator; according to the preset learning rate and the parameters of the pre-trained discriminator, the parameters of the first generator are iteratively updated multiple times to obtain a second generator, and the second generator is used as the pre-trained generator. In this way, the parameters of the first generator can be updated iteratively through the pre-trained discriminator and the desensitized data, and realistic desensitized data can be generated.
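The alternating training loop above (a discriminator step followed by a generator step) can be sketched as a toy scalar GAN. This is purely an illustrative sketch, not the patent's network: the one-parameter generator g(z) = a + b·z, the logistic discriminator, and the non-saturating generator loss are common GAN choices assumed here.

```python
import math, random

def sigmoid(t):
    t = max(-60.0, min(60.0, t))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(real_data, steps=500, lr=0.05, seed=0):
    """Alternately update a scalar discriminator and generator.
    Generator: g(z) = a + b*z; discriminator: d(x) = sigmoid(w*x + c)."""
    rng = random.Random(seed)
    a, b = 0.0, 1.0   # generator parameters (the "first generator")
    w, c = 0.1, 0.0   # discriminator parameters
    for _ in range(steps):
        # discriminator step: push d(real) up and d(fake) down
        x = rng.choice(real_data)
        fake = a + b * rng.gauss(0.0, 1.0)
        dr, df = sigmoid(w * x + c), sigmoid(w * fake + c)
        w += lr * ((1.0 - dr) * x - df * fake)
        c += lr * ((1.0 - dr) - df)
        # generator step: non-saturating loss, push d(fake) up
        z = rng.gauss(0.0, 1.0)
        fake = a + b * z
        df = sigmoid(w * fake + c)
        a += lr * (1.0 - df) * w
        b += lr * (1.0 - df) * w * z
    return a, b
```

The structure mirrors the description: the discriminator is trained on real and generated samples, and the generator parameters are then updated against the trained discriminator for multiple iterations.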
- the training data is a data set to be desensitized for training generator parameters
- the first generator is a preset untrained generator
- the second generator is a generator obtained through multiple iterative updates.
- the parameters of the first generator and the second generator are different.
- the prior probability of the discrete variable can be obtained through the distributed representation of the discrete variable, and parameters are sampled from the prior probability as parameters of the first generator.
- the generator and the discriminator can be trained by the stochastic gradient Hamiltonian Monte Carlo method to obtain a pre-trained generator and a pre-trained discriminator.
- the preset discriminator is trained based on the desensitized data and the training data to obtain the pre-trained discriminator: the conditional embedding vector is spliced with the desensitized data and with the training data to obtain first spliced data and second spliced data; the similarity between the first and second spliced data is calculated; the loss function is optimized according to that similarity; and the discriminator is gradient-clipped through the loss function to obtain the pre-trained discriminator.
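The gradient clipping step mentioned above can be illustrated with a minimal sketch; clipping by global L2 norm is an assumption here, since the patent does not specify the clipping rule.

```python
def clip_gradients(grads, max_norm):
    """Scale the gradient vector down if its L2 norm exceeds max_norm,
    a standard way to stabilize discriminator training."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)
```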
- the discriminator parameters can be trained using the first generator and the preset discriminator parameters, so that the desensitized data is judged as false as much as possible, thereby adjusting the discriminator parameters and improving the discriminator's ability to distinguish desensitized data.
- the posterior probability of the second generator can be calculated from the prior probability of the parameters of the first generator and the pre-trained discriminator, so that the desensitized data makes the discriminator misjudge it as the data to be desensitized, thereby adjusting the generator parameters to generate realistic desensitized data.
- the second generator is subjected to noise-increasing processing based on a loss function of statistical information to obtain the pre-trained generator, wherein the parameters of the first generator, the parameter-updated generator, and the pre-trained generator all differ. In this way, the generation quality and degree of desensitization of the desensitized data can be controlled.
- the statistical information-based loss function may include a mean-based loss function, a variance-based loss function, and the like.
- Gaussian noise can be added to the parameters of the second generator, analogous to the noise assumed when fitting a polynomial to a sinusoidal curve; the Gaussian noise is an error conforming to a Gaussian normal distribution.
- the specific value of Gaussian noise can be obtained through experiments.
- an error term may be introduced into the parameters of the second generator, so that the parameters can be corrected to obtain the pre-trained generator. Because of the error term, there is a certain difference between the generated desensitized data and the original data, but the difference is not large; this avoids desensitized data that differs so much from the original data that it loses research value, while also ensuring the data cannot easily be reversed.
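Introducing the error term into the generator parameters might look like the following sketch; the function name and per-parameter Gaussian noise are illustrative assumptions.

```python
import random

def perturb_parameters(params, sigma, rng=None):
    """Add zero-mean Gaussian noise to each generator parameter, trading
    a small amount of fidelity for resistance to reverse-engineering."""
    rng = rng or random.Random()
    return [p + rng.gauss(0.0, sigma) for p in params]
```

A small sigma keeps the generated data close enough to the original distribution to retain research value while breaking exact recoverability.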
- the discrete variables of the desensitized data are randomly sampled to obtain a target discrete variable; based on a logistic regression model, the target discrete variable is predicted to obtain a prediction result; and the parameters of the pre-trained generator are adjusted based on the prediction result of the target discrete variable.
- the parameters of the generator can be adjusted by predicting the discrete variables to achieve a better desensitization effect.
- a better desensitization effect here means that the desensitized data cannot be reverse-engineered while it still maintains its association with the original data.
- the target discrete variable is randomly sampled from the multiple discrete variables of the desensitized data. To keep the discrete variables of the desensitized data associated with those of the original data, the target discrete variable is generally assumed not to change; when the difference between the desensitized data and the original data is small, the research value is preserved, so the consistency of the target discrete variable must be ensured.
- the logistic regression model was used to predict discrete variables.
- the cross-entropy loss function can be used to judge whether the generated prediction result of the target discrete variable is consistent with the target discrete variable, so as to determine the generation quality of the desensitized data. If the prediction result is consistent with the target discrete variable, there is no need to adjust the parameters of the pre-trained generator; if it is inconsistent, the difference between the prediction result and the target discrete variable is determined, and the parameters of the pre-trained generator are adjusted according to that difference. In this way, the accuracy of the target discrete variable can be checked, and the generated desensitized data is prevented from differing too much from the original data. Since most discrete variables of the desensitized data and the original data are the same, removing one discrete variable still allows it to be predicted accurately from the remaining discrete variables.
- for example, suppose the target discrete variable of the desensitized data is a shoe size of 43;
- the target discrete variable can then be predicted from the remaining discrete variables of the desensitized data, such as height and weight, and the predicted shoe size is compared with the shoe size in the desensitized data. For example, if the predicted shoe size is 40, the difference is determined to be 3 sizes, and the parameters of the pre-trained generator are iteratively updated according to that difference; if the predicted shoe size is 43, there is no need to adjust the parameters of the pre-trained generator.
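The consistency check in this example can be sketched as follows. This is an illustrative sketch only: the synthetic height/weight/shoe-size data and the use of scikit-learn's `LogisticRegression` are assumptions for demonstration, not the patent's actual model or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "desensitized" records: discretized height/weight bins, with a
# shoe-size class derived from them (illustrative data only).
height_bin = rng.integers(0, 3, size=200)
weight_bin = rng.integers(0, 3, size=200)
shoe_size = 40 + height_bin + weight_bin        # classes 40..44

# Hold out the target discrete variable (shoe size) and predict it
# from the remaining discrete variables (height and weight).
X = np.column_stack([height_bin, weight_bin])
clf = LogisticRegression(max_iter=1000).fit(X, shoe_size)

pred = int(clf.predict([[2, 1]])[0])   # record whose desensitized size is 43
difference = abs(pred - 43)            # a nonzero value drives the update
```

If `difference` is zero the pre-trained generator's parameters are left unchanged; otherwise the difference would be fed back into the iterative parameter update described above.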
- the server may also send prompt information for prompting the user that the desensitized data has been generated to the terminal device.
- the prompt information may specifically be delivered through an application program (APP), an email, a short message (SMS), or a chat tool such as WeChat or QQ.
- when the desensitized data has been generated, the server sends a prompt message to the terminal device to remind the user that the desensitized data has been generated.
- FIG. 3 is a schematic block diagram of a data desensitization device provided by an embodiment of the present application.
- the data desensitization device can be configured in a server to execute the aforementioned data desensitization method.
- the data desensitization device 200 includes: a key information extraction module 201, an information processing module 202, a vector splicing module 203, and a data desensitization module 204.
- the key information extraction module 201 is configured to acquire user data, and based on a pre-trained key information identification model, perform information identification on the user data to obtain key information;
- an information processing module 202, configured to preprocess the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;
- a vector splicing module 203, configured to perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and to splice the conditional embedding vector and the hidden vector to obtain a spliced vector;
- a data desensitization module 204, configured to input the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
- the key information extraction module 201 is also configured to perform word segmentation processing on the user data to obtain a plurality of word segments; perform feature extraction on each word segment to obtain its embedded features; predict the word meaning from the embedded features of each word segment to obtain the meaning corresponding to each word segment; and screen the plurality of word segments according to their meanings to obtain the key information.
- the information processing module 202 is further configured to perform maximum-minimum normalization processing on the key information to obtain the discrete variables corresponding to the key information; or to normalize the key information through a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or to perform K-bins discretization processing on the key information to obtain the discrete variables corresponding to the key information; or to perform regression-tree discretization processing on the key information to obtain the discrete variables corresponding to the key information.
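Two of these preprocessing options can be sketched in a few lines (a minimal NumPy sketch; the age values are made-up example data, and the equal-width bin edges stand in for the edges a Gaussian mixture or regression tree would derive):

```python
import numpy as np

def min_max_normalize(x):
    """Maximum-minimum normalization: scale key-information values into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def k_bins_discretize(x, k=4):
    """Equal-width K-bins discretization: map values to bin indices 0..k-1."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

ages = [18, 25, 31, 47, 52, 63, 70]      # made-up key information
normalized = min_max_normalize(ages)     # continuous values in [0, 1]
bins = k_bins_discretize(ages, k=4)      # discrete variables 0..3
```

Either output can serve as the discrete variables corresponding to the key information; the Gaussian-mixture and regression-tree variants differ only in how the bin boundaries are derived.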
- the vector splicing module 203 is further configured to convert the conditional embedding vector to obtain a one-hot encoding; splice the one-hot encoding and the latent vector to obtain a concatenated vector.
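The one-hot conversion and splicing can be sketched as follows (a minimal NumPy sketch; the category count and latent dimension are arbitrary example values):

```python
import numpy as np

def build_spliced_vector(category, n_categories, latent_dim=8, seed=None):
    """Concatenate a one-hot conditional vector with a sampled hidden vector."""
    rng = np.random.default_rng(seed)
    one_hot = np.zeros(n_categories)       # conditional embedding as one-hot
    one_hot[category] = 1.0
    latent = rng.normal(size=latent_dim)   # hidden (latent) vector z
    return np.concatenate([one_hot, latent])

vec = build_spliced_vector(category=2, n_categories=5, latent_dim=8, seed=0)
```

The first `n_categories` slots carry the condition and the remaining slots carry the random hidden vector; the combined vector is what gets fed to the generator.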
- the generator training module 205 is configured to acquire the spliced vector corresponding to the training data, and input the spliced vector to the first generator for desensitization processing to obtain desensitized data; train the preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to the preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, which is used as the pre-trained generator.
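The shape of this iterative update can be illustrated with a deliberately tiny one-dimensional stand-in. This is not the patent's network: the "generator" here is a linear map and the adversarial discriminator signal is replaced by simple moment matching, purely to show a learning-rate-driven parameter-update loop.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(5.0, 1.0, size=512)   # stands in for the training data
w, b = 1.0, 0.0                         # "first generator" parameters
lr = 0.05                               # the preset learning rate

for _ in range(2000):
    z = rng.normal(size=512)            # hidden vectors
    fake = w * z + b                    # candidate desensitized values
    # Feedback signal: mismatch between generated and real statistics
    # (a stand-in for the pre-trained discriminator's gradient).
    b -= lr * (fake.mean() - real.mean())
    w -= lr * (fake.std() - real.std())

# After many iterative updates, (w, b) define the "second generator",
# whose outputs match the training data's distribution.
```

The real scheme replaces the moment-matching feedback with the pre-trained discriminator's judgment, but the loop structure (generate, score, update parameters by the learning rate) is the same.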
- the generator training module 205 is further configured to perform noise-increasing processing on the second generator based on a loss function of statistical information to obtain the pre-trained generator, wherein the parameters of the first generator, the second generator, and the pre-trained generator are all different.
- the generator training module 205 is also configured to randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on the logistic regression model to obtain a prediction result; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- the methods and devices of the present application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
- the above-mentioned method and apparatus can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 4 .
- FIG. 4 is a schematic diagram of a computer device provided by an embodiment of the present application.
- the computer device may be a server.
- the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.
- Non-volatile storage media can store operating systems and computer programs.
- the computer program includes program instructions.
- when executed by the processor, the program instructions cause the processor to perform any of the data desensitization methods.
- the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
- the internal memory provides an environment for running the computer program in the non-volatile storage medium.
- the processor can execute any data desensitization method.
- the network interface is used for network communication, such as sending assigned tasks.
- Those skilled in the art can understand that the illustrated structure is only a block diagram of the partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figures, combine certain components, or arrange the components differently.
- the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the processor is configured to run a computer program stored in the memory to implement the following steps: acquire user data, and perform information identification on the user data based on a pre-trained key information identification model to obtain key information; preprocess the key information to obtain the discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing; perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splice the conditional embedding vector and the hidden vector to obtain a spliced vector; and input the spliced vector into the pre-trained generator for desensitization processing to obtain desensitized data.
- the processor is further configured to: perform word segmentation processing on the user data to obtain multiple word segments; perform feature extraction on each word segment to obtain its embedded features; predict the word meaning from the embedded features of each word segment to obtain the meaning corresponding to each word segment; and screen the multiple word segments according to their meanings to obtain the key information.
- the processor is further configured to: perform maximum-minimum normalization processing on the key information to obtain the discrete variables corresponding to the key information; or normalize the key information through a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or perform K-bins discretization processing on the key information to obtain the discrete variables corresponding to the key information; or perform regression-tree discretization processing on the key information to obtain the discrete variables corresponding to the key information.
- the processor is further configured to: convert the conditional embedding vector to obtain a one-hot encoding; concatenate the one-hot encoding and the latent vector to obtain a concatenated vector.
- the processor is further configured to: acquire the spliced vector corresponding to the training data, and input the spliced vector to the first generator for desensitization processing to obtain desensitized data; train the preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to the preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, which is used as the pre-trained generator.
- the processor is further configured to: perform noise-increasing processing on the second generator based on a loss function of statistical information to obtain the pre-trained generator, wherein the parameters of the first generator, the second generator, and the pre-trained generator are different.
- the processor is further configured to: randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on the logistic regression model to obtain a prediction result; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- the embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium may be non-volatile or volatile.
- a computer program is stored on the computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed, any data desensitization method provided in the embodiments of the present application is implemented.
- the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
- the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD ) card, flash memory card (Flash Card), etc.
- the computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of the blockchain node, and the like.
- Blockchain is essentially a decentralized database: a chain of data blocks linked to one another using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
Abstract
The present application relates to the field of artificial intelligence, in particular to a data desensitization method, a data desensitization apparatus, a device, and a storage medium. The method comprises: obtaining user data, and performing information identification on the user data on the basis of a pre-trained key information identification model to obtain key information; preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing comprising data discretization processing or data normalization processing; performing conditional random sampling processing on the discrete variables on the basis of a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector; and inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data. Therefore, desensitized data cannot be easily reversely cracked, thereby ensuring that privacy data will not be leaked, and improving the security of the desensitized data.
Description
This application claims priority to the Chinese patent application No. 202111229481.X, entitled "Data desensitization method, data desensitization device, equipment and storage medium" and filed with the China Patent Office on October 21, 2021, the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence, and in particular to a data desensitization method, a data desensitization device, computer equipment, and a storage medium.
In the era of big data, attacks on data are becoming more frequent and the attack methods more varied. Data desensitization is an effective way to address data security problems and risks: key information or personal information is transformed according to preset rules or transformations so that individuals cannot be identified or the key information is hidden. Currently, common structured data desensitization methods are based on anonymization or perturbation techniques.

However, the inventor realized that structured data desensitization methods based on anonymization or perturbation establish a one-to-one mapping between the desensitized data and the original data. As a result, the desensitized data can easily be reverse-engineered and the original data restored, leading to the leakage of private information in the original data and poor data security.
Summary of the invention
The present application provides a data desensitization method, a data desensitization device, computer equipment, and a storage medium, aiming to solve the problem that existing desensitization methods can easily be reverse-engineered, causing private information to be leaked.
To achieve the above purpose, the present application provides a data desensitization method, the method comprising:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

performing conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector;

inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
To achieve the above purpose, the present application also provides a data desensitization device, the data desensitization device comprising:

a key information extraction module, configured to acquire user data and, based on a pre-trained key information identification model, perform information identification on the user data to obtain key information;

an information processing module, configured to preprocess the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

a vector splicing module, configured to perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and to splice the conditional embedding vector and the hidden vector to obtain a spliced vector;

a data desensitization module, configured to input the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
In addition, to achieve the above purpose, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program and the processor executing the computer program to implement the following steps:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

performing conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector;

inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
In addition, to achieve the above purpose, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the following steps:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

performing conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector;

inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
Through the data desensitization method, data desensitization device, equipment, and storage medium disclosed in the embodiments of the present application, the key information of user data and the discrete variables of that key information are extracted to generate a spliced vector, which a pre-trained generator then desensitizes to produce desensitized data. The desensitized data cannot easily be reverse-engineered, ensuring that private data is not leaked and improving the security of the desensitized data.
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a scenario of a data desensitization method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a data desensitization method provided by an embodiment of the present application;

FIG. 3 is a schematic block diagram of a data desensitization device provided by an embodiment of the present application;

FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

The flowcharts shown in the drawings are merely illustrative; they need not include all contents and operations/steps, nor be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation. In addition, although functional modules are divided in the device schematic, in some cases they may be divided differently from the schematic.

The term "and/or" used in the specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Data desensitization technology processes data so as to reduce or remove its sensitivity, lowering the risk and harm of data leakage and effectively protecting the privacy of user data. In the Internet and medical fields, users can store, view, and share personal medical and health data through a personal digital space, but personal medical data faces the risk of leaking sensitive medical information during online consultations, online drug purchases, outpatient appointments, and similar processes. In the medical industry, user data is highly authentic and sensitive; once a user's personal sensitive information is leaked, it may pose a potential threat to the user's life. With data desensitization, the information in the personal digital space can be used for business-related analysis and processing while avoiding leakage of user data.

Currently, common structured data desensitization methods are based on anonymization or perturbation techniques. Common anonymization techniques include k-anonymity, l-diversity, and t-closeness, which generalize the quasi-identifiers of a single record so that the record cannot be distinguished within the entire data set, achieving desensitization. Perturbation-based techniques add noise to records, for example additive or multiplicative noise on continuous values, to achieve desensitization.

However, in structured data desensitization methods based on anonymization or perturbation, the desensitized data has a one-to-one mapping to the original data, so the desensitized data risks being reverse-engineered; moreover, the desensitized data often differs so much from the original data that it loses research value.
To solve the above problems, the present application provides a data desensitization method that can be applied in a server, in fields such as finance and healthcare. By continuously and iteratively updating the generator's parameters, a pre-trained generator is obtained; the sensitive information of the user data is extracted and desensitized by the pre-trained generator to obtain desensitized data. The desensitized data cannot easily be reverse-engineered, ensuring that private data is not leaked and improving the security of the desensitized data.

The server may be, for example, a standalone server or a server cluster. For ease of understanding, the following embodiments describe the data desensitization method as applied to a server.

Some implementations of the present application are described in detail below with reference to the accompanying drawings. The following embodiments, and the features within them, may be combined with each other provided there is no conflict.
As shown in FIG. 1, the data desensitization method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1. The application environment includes a terminal device 110 and a server 120, where the terminal device 110 can communicate with the server 120 through a network. Specifically, the server 120 obtains user data sent by the terminal device 110, performs key information extraction, information processing, and desensitization on the user data to generate desensitized data, and sends the desensitized data back to the terminal device 110, thereby implementing data desensitization. The server 120 may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. The terminal device 110 may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, or a smart watch. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
Please refer to FIG. 2, which is a schematic flowchart of a data desensitization method provided by an embodiment of the present application. The method can be applied in a server, so that the desensitized data cannot easily be reverse-engineered, private data is not leaked, and the security of the desensitized data is improved.
As shown in FIG. 2, the data desensitization method includes steps S101 to S104.
S101. Acquire user data, and perform information identification on the user data based on a pre-trained key information identification model to obtain key information.
The user data is data containing key information, and may specifically include medical data such as medical records, or financial data such as bank account data. The key information identification model may be a pre-trained attention-based BERT-CRF model used to extract key information from the user data. The key information is the information the user needs to desensitize, generally the user's private information; for example, the key information may be height and weight in medical records, or account balances and investment information in bank account data. It should be noted that any sensitive or private information can serve as key information.
The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
In some embodiments, word segmentation is performed on the user data to obtain multiple word segments; feature extraction is performed on each word segment to obtain its embedding features; the meaning of each word segment is predicted from its embedding features; and the word segments are filtered according to their predicted meanings to obtain the key information. In this way the key information can be extracted accurately, improving both the accuracy and the security of the generated desensitized data.
The embedding features are word embeddings, position embeddings, and segment embeddings. The word embedding is a vector representation of each word segment, the position embedding is a vector representation of each word segment's position, and the segment embedding is used to distinguish between two different sentences.
Specifically, the user data may be segmented with a word segmentation algorithm to obtain multiple word segments. The word segmentation algorithm may be the forward maximum matching method, the reverse maximum matching method, a segmentation algorithm based on a hidden Markov model, a segmentation algorithm based on conditional random fields, or the like.
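The forward maximum matching method mentioned above can be sketched in a few lines. This is an illustrative, dictionary-based toy (the lexicon and the concatenated input string are hypothetical), not the segmenter the application actually uses:

```python
def forward_max_match(text, lexicon, max_len=8):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon entry that prefixes the remaining text; fall back
    to a single character when nothing matches."""
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                segments.append(piece)
                i += size
                break
    return segments

# Hypothetical lexicon of domain terms.
lexicon = {"account", "balance", "info"}
print(forward_max_match("accountbalanceinfo", lexicon))
# → ['account', 'balance', 'info']
```

Reverse maximum matching is the same idea scanning from the end of the text backwards.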
For example, a segmentation algorithm based on a hidden Markov model may segment user data such as the medical record text "the patient has symptoms such as frequent urination, excessive hunger, anxiety, and tremor; diabetes is suspected" into corresponding word segments such as "frequent urination", "excessive hunger", "anxiety", and "tremor".
Specifically, feature extraction may be performed on each word segment to obtain its embedding features, and a word-meaning prediction model may predict the meaning of each word segment from those features, yielding a word-meaning prediction result for each segment; the word segments are then filtered based on these results to obtain the key information. This mines the text features as fully as possible and extracts richer word representations, overcoming the drawbacks of traditional word vectors such as Word2vec and GloVe, which cannot dynamically represent context or resolve polysemy. The similarity between each word segment and the preset standard sensitive word segments can thus be obtained quickly, and the corresponding key information with it.
The word-meaning prediction model is used to predict the similarity between each word segment and preset standard sensitive word segments. It is obtained by training a semantic matching model against a database of standard sensitive word segments; the semantic matching model may be an LSTM matching model, an MV-DSSM model, an ESIM model, or the like. The word-meaning prediction result is the similarity between each word segment and the standard sensitive word segments in the database.
For example, suppose the word segments include account-related segments such as "account balance" as well as segments about stock trends. Feature extraction may be performed on each segment to obtain its word, position, and segment embeddings; based on an LSTM matching model, the meaning of each segment is predicted from these embeddings, yielding a word-meaning prediction result per segment; and based on these results the segments corresponding to stock-trend information are filtered out, leaving the key information.
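The filtering step in this example — keep only segments whose predicted meaning is close to some standard sensitive term — can be sketched with cosine similarity over embedding vectors. The 3-d embeddings and the threshold below are hypothetical stand-ins for the matching model's learned representations:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_sensitive(segment_vecs, sensitive_vecs, threshold=0.8):
    """Keep a segment if it is similar enough to any standard sensitive term."""
    keep = []
    for seg, vec in segment_vecs.items():
        if max(cosine(vec, s) for s in sensitive_vecs) >= threshold:
            keep.append(seg)
    return keep

# Hypothetical embeddings: "account balance" lies near the sensitive
# prototype, "stock trend" does not.
segment_vecs = {"account balance": [0.9, 0.1, 0.0],
                "stock trend": [0.0, 0.2, 0.9]}
sensitive_vecs = [[1.0, 0.0, 0.0]]
print(filter_sensitive(segment_vecs, sensitive_vecs))
```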
S102. Preprocess the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization or data normalization.
Since the key information is generally continuous data, a representation conversion between continuous and discrete data is needed. This data preprocessing operation is a key step for the input and output of the neural networks.
For example, when the key information is height or weight, it is continuous data; when the key information is, say, the number of invested enterprises, it is discrete data.
A discrete variable is a variable whose values can be enumerated in a certain order and that usually takes integer values, such as the number of employees, the number of factories, or the number of machines. Specifically, the data normalization may include min-max normalization and normalization based on a Gaussian mixture model; the data discretization may include K-bins discretization and regression-tree discretization.
In some embodiments, min-max normalization is performed on the key information to obtain the corresponding discrete variable; or the key information is normalized with a Gaussian mixture model; or K-bins discretization is applied to the key information; or regression-tree discretization is applied to the key information, in each case obtaining the discrete variable corresponding to the key information.
Specifically, if the key information is continuous data, it can be mapped into the range [0,1] by a min-max linear transformation, so that the continuous value can be represented by a tanh activation function, giving the discrete variable corresponding to the key information.
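The min-max mapping into [0,1] described here is a one-line transform. The sample values below are hypothetical heights in centimetres:

```python
def min_max_normalize(values):
    """Linearly map values into [0, 1] so that a bounded activation
    such as tanh can represent them."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [150.0, 170.0, 190.0]
print(min_max_normalize(heights))  # → [0.0, 0.5, 1.0]
```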
Specifically, if the key information is continuous data, a Gaussian mixture model can be fitted to it; a Gaussian component is sampled according to the probability of the key information under each component of the mixture, and the sampled component is used to normalize the key information in the record. The key information is then jointly represented by this normalized value and the one-hot encoding of the Gaussian component, giving the discrete variable corresponding to the key information.
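The Gaussian-mixture step can be sketched as follows. To keep the example self-contained, the two one-dimensional mixture components are assumed to be already fitted (in practice they would come from fitting a GMM to the column), and the 4-sigma scaling of the within-mode offset is an illustrative choice, not a detail stated in the application:

```python
import math
import random

# Hypothetical pre-fitted 1-D mixture: (weight, mean, std) per component.
components = [(0.5, 160.0, 5.0), (0.5, 185.0, 5.0)]

def encode_with_gmm(x, components, rng):
    """Sample a component in proportion to its responsibility for x,
    then represent x as (normalized offset, one-hot component code)."""
    resp = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
            for w, m, s in components]
    total = sum(resp)
    probs = [r / total for r in resp]
    k = rng.choices(range(len(components)), weights=probs)[0]
    _, m, s = components[k]
    normalized = (x - m) / (4 * s)  # offset within the sampled mode
    one_hot = [1 if i == k else 0 for i in range(len(components))]
    return normalized, one_hot

rng = random.Random(0)
print(encode_with_gmm(162.0, components, rng))
```

A value of 162 lies almost entirely under the first component, so the sampled code is nearly always `[1, 0]` with a small normalized offset.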
Specifically, if the key information is continuous data, K-bins discretization may be applied to obtain the corresponding discrete variable. Discretization, also called binning, assigns the key information into intervals according to certain rules and represents each interval with a one-hot encoding, so that the key information is fitted with a piecewise function containing, for example, four intervals, giving the discrete variable corresponding to the key information.
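The binning rule described here — assign a value to one of a few intervals and represent the interval with a one-hot code — can be sketched directly. Four equal-width bins over a hypothetical weight range are assumed for illustration:

```python
def k_bins_one_hot(x, lo, hi, n_bins=4):
    """Equal-width binning with a one-hot encoding of the interval."""
    width = (hi - lo) / n_bins
    k = min(int((x - lo) / width), n_bins - 1)  # clamp the top edge
    return [1 if i == k else 0 for i in range(n_bins)]

# Hypothetical weight column spanning 40-120 kg.
print(k_bins_one_hot(75.0, lo=40.0, hi=120.0))  # → [0, 1, 0, 0]
```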
Specifically, if the key information is continuous data, a CART regression tree may be used to discretize it into the corresponding discrete variable. A CART regression tree can predict continuous data, with each leaf node representing a predicted value. By representing the series of leaf nodes reached in the regression tree (or tree ensemble) with a one-hot encoding, the key information is converted into discrete values.
It should be noted that if the key information is already discrete data, neither discretization nor normalization is needed.
S103. Based on a conditional loss function, perform conditional random sampling on the discrete variables to obtain a conditional embedding vector and a latent vector, and concatenate the conditional embedding vector with the latent vector to obtain a concatenated vector.
The conditional loss function is the conditional loss function of a generative adversarial network, whose data term is generated based on conditional probabilities. The intent is that data be generated conditionally, so that the distribution of the generated desensitized data matches, as closely as possible, that of the data to be desensitized of the same type. However, because the condition sampled each time may involve a different variable, the data under any given condition variable is hard to train sufficiently, and the value of the corresponding variable in the generated data can be observed to disagree with the value of the condition variable. By predicting the condition variable, the training process can be constrained so that the value of the condition variable agrees with that of the corresponding variable in the generated data, further improving the quality of data generation.
Specifically, the conditional embedding vector can be obtained by randomly selecting, with equal probability, one discrete variable satisfying a preset condition from the multiple discrete variables corresponding to the key information; the latent vector can be sampled from white noise corresponding to the key information; and the concatenated vector, obtained by concatenating the conditional embedding vector with the latent vector, serves as the input to the generator. Introducing the latent vector breaks the one-to-one mapping between the desensitized data and the original data, so the desensitized data cannot easily be reverse-engineered to recover private information.
Specifically, a distributed representation of a discrete variable can be obtained by constructing the probability mass function over its values, and conditional random sampling can then be performed on this distributed representation to obtain the conditional embedding vector and the latent vector.
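Constructing a probability mass function over a discrete variable's values and sampling a condition from it can be sketched as below. The column values are hypothetical, and the log-frequency weighting (which keeps rare categories represented during sampling) is one common choice, not a detail stated in the application:

```python
import math
import random
from collections import Counter

def build_pmf(column):
    """Log-frequency probability mass function over a discrete column."""
    counts = Counter(column)
    weights = {v: math.log(c + 1) for v, c in counts.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

def sample_condition(pmf, rng):
    """Randomly sample one value of the variable according to the PMF."""
    values = list(pmf)
    return rng.choices(values, weights=[pmf[v] for v in values])[0]

column = ["size_40"] * 6 + ["size_43"] * 3 + ["size_45"]
pmf = build_pmf(column)
print(sample_condition(pmf, random.Random(1)))
```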
For example, a deep neural network can transform the white noise corresponding to a discrete variable so as to generate the latent vector from the distributed representation of that variable.
In some embodiments, the conditional embedding vector is converted into a one-hot encoding, and the one-hot encoding is concatenated with the latent vector to obtain the concatenated vector. One-hot encoding uses an N-bit state register to encode N states, with each state having its own register bit and only one bit being valid at any time. Converting the conditional embedding vector into a one-hot encoding addresses the difficulty discriminators have in handling attribute data and, to some extent, also expands the vector's features.
Specifically, the conditional embedding vector can be converted through a deep neural network into a one-hot encoding, which is concatenated with the latent vector to obtain a concatenated vector that meets the generator's input requirements.
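The final assembly of the generator input — a one-hot condition code concatenated with a white-noise latent vector — reduces to the following; the condition index, number of conditions, and latent dimension are illustrative values:

```python
import random

def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]

def build_generator_input(cond_index, n_conditions, latent_dim, rng):
    """Concatenate the one-hot condition with sampled white-noise latents."""
    cond = one_hot(cond_index, n_conditions)
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return cond + latent

vec = build_generator_input(cond_index=2, n_conditions=4, latent_dim=8,
                            rng=random.Random(0))
print(len(vec), vec[:4])
```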
S104. Input the concatenated vector into a pre-trained generator for desensitization, obtaining desensitized data.
The pre-trained generator is trained within a generative adversarial network, and the desensitized data is the data obtained after desensitizing the key information in the data to be desensitized.
In some embodiments, the concatenated vector corresponding to training data is obtained and input into a first generator for desensitization, yielding desensitized data; a preset discriminator is trained on this desensitized data and the training data to obtain a pre-trained discriminator; and, according to a preset learning rate and the parameters of the pre-trained discriminator, the parameters of the first generator are iteratively updated multiple times to obtain a second generator, which serves as the pre-trained generator. In this way, the first generator's parameters are iteratively updated using the pre-trained discriminator and the desensitized data, so highly realistic desensitized data can be generated. The discriminator is pre-trained first, and the generator trained afterwards, because only once a good discriminator exists, one that can reliably distinguish the data to be desensitized from the generated desensitized data, can the generator's parameters be updated accurately.

The training data is a data set to be desensitized that is used to train the generator's parameters; the first generator is a preset, untrained generator; and the second generator is produced from the first generator through multiple iterative updates, so the parameters of the first and second generators differ. The prior probability of a discrete variable can be obtained from its distributed representation, and parameters sampled from this prior serve as the parameters of the first generator. Specifically, the generator and discriminator can be trained with the stochastic-gradient Hamiltonian Monte Carlo method to obtain the pre-trained generator and the pre-trained discriminator.
Specifically, training the preset discriminator on the desensitized data and the training data to obtain the pre-trained discriminator proceeds by concatenating the conditional embedding vector with the desensitized data and with the training data respectively, obtaining first concatenated data and second concatenated data; computing the similarity between the first and second concatenated data; optimizing the loss function according to this similarity; and applying gradient clipping to the discriminator through this loss function to obtain the pre-trained discriminator.
For example, the discriminator parameters can be trained using the first generator and the preset discriminator parameters so that the desensitized data is judged fake as often as possible, thereby adjusting the discriminator's parameters and improving its ability to discriminate the data to be desensitized.
For example, the posterior probability of the second generator can be computed from the prior probabilities of the first generator's and the pre-trained discriminator's parameters, so that the desensitized data causes the discriminator to misjudge it as data to be desensitized as often as possible; adjusting the generator's parameters in this way enables realistic desensitized data to be generated.
In some embodiments, after the second generator is obtained, noise is added to it based on a loss function over statistical information, giving the pre-trained generator; the parameters of the first generator, the parameter-updated generator, and the pre-trained generator all differ. This controls both the generation quality and the degree of desensitization of the desensitized data.
The loss function based on statistical information may include a mean-based loss function, a variance-based loss function, and the like.
Specifically, Gaussian noise, an error conforming to a Gaussian normal distribution, can be added to the parameters of the second generator, in the same way that noise can be added when fitting a polynomial to a sinusoid. The specific magnitude of the Gaussian noise can be determined experimentally.
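Adding Gaussian noise to the second generator's parameters can be sketched in one loop. The noise scale `sigma=0.01` and the parameter values are hypothetical; as noted above, the actual magnitude would be chosen experimentally:

```python
import random

def add_gaussian_noise(params, sigma, rng):
    """Perturb each parameter with zero-mean Gaussian noise of std sigma."""
    return [p + rng.gauss(0.0, sigma) for p in params]

params = [0.5, -1.2, 0.03]
noisy = add_gaussian_noise(params, sigma=0.01, rng=random.Random(42))
print(noisy)
```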
For example, an error term can be introduced into the parameters of the second generator, correcting them to obtain the pre-trained generator. Because of the error term, the generated desensitized data differs somewhat from the original data, but not greatly; this prevents the desensitized data from differing so much from the original data that it loses its research value, while also ensuring the data cannot easily be reversed.
In some embodiments, after the desensitized data is obtained, its discrete variables are randomly sampled to obtain a target discrete variable; based on a logistic regression model, the target discrete variable is predicted from the remaining discrete variables of the desensitized data, yielding a prediction result; and the parameters of the pre-trained generator are adjusted based on this prediction result. Predicting the discrete variable in this way allows the generator's parameters to be tuned for a better desensitization effect, where "better" means the desensitized data cannot be reverse-engineered while still preserving its association with the original data.
The target discrete variable is randomly sampled from the multiple discrete variables of the desensitized data. To keep the desensitized data associated with the original data, the target discrete variable can generally be assumed not to change; as long as the desensitized data differs only slightly from the original data, it retains its research value, so the consistency of the target discrete variable must be guaranteed. The logistic regression model is used to predict discrete variables.
Specifically, a cross-entropy loss function can be used to judge whether the prediction of the target discrete variable agrees with the target discrete variable itself, thereby assessing the generation quality of the desensitized data. If the prediction agrees with the target discrete variable, the parameters of the pre-trained generator need no adjustment; if it disagrees, the difference between the prediction and the target discrete variable is determined, and the generator's parameters are adjusted according to that difference. This verifies the accuracy of the target discrete variable and prevents the generated desensitized data from diverging too far from the original data. Because most discrete variables of the desensitized data match those of the original data, removing one discrete variable allows it to be predicted accurately from the rest.
For example, if the target discrete variable of the desensitized data is a shoe size of 43, then based on the logistic regression model the target variable can be predicted from the remaining discrete variables of the desensitized data, such as height and weight, and the predicted shoe size is checked for agreement with the shoe size in the desensitized data. If the predicted shoe size is 40, the difference is determined to be 3 sizes, and the parameters of the pre-trained generator are iteratively updated according to this difference; if the predicted shoe size is 43, no adjustment of the pre-trained generator's parameters is needed.
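The consistency check in this example — predict the held-out target variable from the rest and adjust only on disagreement — can be sketched as below. The class probabilities here are a hypothetical stand-in for the trained logistic regression model's output:

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy of a predicted distribution against a one-hot target."""
    return -math.log(probs[target_index])

def consistency_gap(probs, sizes, target_size):
    """Return 0 if the argmax prediction matches the target discrete
    variable, otherwise the absolute difference between them."""
    predicted = sizes[max(range(len(probs)), key=probs.__getitem__)]
    return abs(predicted - target_size)

sizes = [40, 43, 45]      # candidate shoe sizes
probs = [0.7, 0.2, 0.1]   # hypothetical predictor output
print(consistency_gap(probs, sizes, target_size=43))       # → 3
print(round(cross_entropy(probs, target_index=1), 3))
```

A gap of 0 means no parameter adjustment is needed; a nonzero gap drives the iterative update of the generator's parameters.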
In some embodiments, the server may also send the terminal device a prompt notifying the user that the desensitized data has been generated.
The prompt may be delivered through an application (APP), email, SMS, or a chat tool such as WeChat or QQ.
For example, when the desensitized data has been generated, the server sends a prompt to the terminal device to remind the user.
Please refer to FIG. 3, which is a schematic block diagram of a data desensitization apparatus provided by an embodiment of the present application. The apparatus can be deployed in a server to perform the foregoing data desensitization method.
As shown in FIG. 3, the data desensitization apparatus 200 includes a key information extraction module 201, an information processing module 202, a vector concatenation module 203, and a data desensitization module 204.
The key information extraction module 201 is configured to acquire user data and, based on a pre-trained key information identification model, perform information identification on the user data to obtain key information.
The information processing module 202 is configured to preprocess the key information to obtain the discrete variables corresponding to the key information, the preprocessing including data discretization or data normalization.
The vector concatenation module 203 is configured to perform, based on a conditional loss function, conditional random sampling on the discrete variables to obtain a conditional embedding vector and a latent vector, and to concatenate the conditional embedding vector with the latent vector to obtain a concatenated vector.
数据脱敏模块204,用于将所述拼接向量输入到预训练好的生成器进行脱敏处理,得到脱敏数据;A data desensitization module 204, configured to input the splicing vector into a pre-trained generator for desensitization processing to obtain desensitized data;
The key information extraction module 201 is further configured to: perform word segmentation on the user data to obtain a plurality of word segments; perform feature extraction on each word segment to obtain an embedding feature of each word segment; perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
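The recognition pipeline this module implements — segmentation, embedding, word-sense prediction, filtering — can be sketched as follows. The whitespace tokenizer, the toy hash-style embedding, and the rule-based sense predictor (`SENSITIVE_SENSES`, `predict_sense`) are hypothetical stand-ins for the pre-trained key information recognition model, not the patented implementation:

```python
import re

# Hypothetical label set: which predicted senses count as "key information".
SENSITIVE_SENSES = {"NAME", "PHONE", "ID"}

def segment(text):
    # Whitespace segmentation stands in for a real tokenizer.
    return [t for t in re.split(r"\s+", text) if t]

def embed(token):
    # Toy fixed-size embedding derived from character codes.
    vec = [0.0] * 8
    for i, ch in enumerate(token):
        vec[i % 8] += ord(ch) / 1000.0
    return vec

def predict_sense(token, vec):
    # Rule-based sense prediction standing in for the trained classifier
    # (vec is unused in this toy rule set).
    if token.isdigit() and len(token) >= 7:
        return "PHONE"
    if token.istitle():
        return "NAME"
    return "OTHER"

def extract_key_info(text):
    tokens = segment(text)
    senses = [predict_sense(t, embed(t)) for t in tokens]
    return [t for t, s in zip(tokens, senses) if s in SENSITIVE_SENSES]

print(extract_key_info("Alice called 13800138000 about the invoice"))
```

In the real apparatus, `embed` and `predict_sense` would be the trained neural model; only the overall segment-embed-predict-filter flow is taken from the text.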
The information processing module 202 is further configured to: perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
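Two of the preprocessing options listed above — max-min normalization and K-bins discretization — can be sketched in plain NumPy. Equal-width bin edges are an assumption; the text does not fix the binning strategy:

```python
import numpy as np

def min_max_normalize(x):
    # Max-min normalization: map values into [0, 1].
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def k_bins_discretize(x, k=4):
    # Equal-width K-bins discretization: each value becomes a bin index 0..k-1.
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

ages = [18, 25, 33, 47, 62, 80]
print(min_max_normalize(ages))      # values in [0, 1]
print(k_bins_discretize(ages, 4))   # bin indices in {0, 1, 2, 3}
```

The Gaussian-mixture and regression-tree variants follow the same pattern: fit the model to the column, then replace each value by its component or leaf index.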
The vector concatenation module 203 is further configured to convert the condition embedding vector into a one-hot code, and to concatenate the one-hot code with the latent vector to obtain the concatenated vector.
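A minimal sketch of this sampling-and-splicing step: draw a condition, one-hot encode it, and concatenate it with a latent vector. The log-frequency weighting of categories and the standard-normal latent are common conditional-GAN choices assumed here, not details fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_condition(counts, rng):
    # Conditional random sampling: weight categories by log-frequency so
    # rare categories are still drawn (an assumed, CTGAN-style choice).
    p = np.log(np.asarray(counts, dtype=float) + 1.0)
    p /= p.sum()
    return rng.choice(len(counts), p=p)

def one_hot(index, num_classes):
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

def build_generator_input(counts, latent_dim, rng):
    cond = sample_condition(counts, rng)
    z = rng.standard_normal(latent_dim)       # latent vector
    # Splicing: one-hot condition first, then the latent vector.
    return np.concatenate([one_hot(cond, len(counts)), z])

x = build_generator_input(counts=[900, 80, 20], latent_dim=8, rng=rng)
print(x.shape)  # one-hot part (3 entries) + latent part (8 entries)
```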
The generator training module 205 is configured to: acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data; train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
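The alternating scheme — a first generator, a discriminator trained to separate real from generated data, then repeated generator updates at a preset learning rate yielding a second generator — can be illustrated with a deliberately tiny 1-D GAN. The linear generator, logistic discriminator, cross-entropy losses, and learning rate of 0.05 are illustrative assumptions, not the networks described in the text:

```python
import numpy as np

rng = np.random.default_rng(42)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))

# Generator g(z) = a*z + b and discriminator d(x) = sigmoid(w*x + c):
# deliberately tiny models so the alternating update scheme stays visible.
a, b = 0.5, 0.0          # "first generator" parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.05                # preset learning rate

real = rng.normal(3.0, 1.0, size=2000)   # stands in for training data

for step in range(500):
    x_r = rng.choice(real, 32)
    z = rng.standard_normal(32)
    x_f = a * z + b
    # --- discriminator update: tell real data from generated data ---
    d_r, d_f = sig(w * x_r + c), sig(w * x_f + c)
    gw = -np.mean((1 - d_r) * x_r) + np.mean(d_f * x_f)
    gc = -np.mean(1 - d_r) + np.mean(d_f)
    w, c = w - lr * gw, c - lr * gc
    # --- generator update using the (partially trained) discriminator ---
    z = rng.standard_normal(32)
    x_f = a * z + b
    d_f = sig(w * x_f + c)
    ga = -np.mean((1 - d_f) * w * z)
    gb = -np.mean((1 - d_f) * w)
    a, b = a - lr * ga, b - lr * gb

print(round(b, 2))  # the updated generator should shift toward the data mean
```

Here `a, b` before the loop play the role of the first generator, and their updated values that of the second generator; in practice both models would be deep networks updated by an optimizer.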
The generator training module 205 is further configured to add noise to the second generator based on a loss function over statistical information to obtain the pre-trained generator, where the first generator, the second generator, and the pre-trained generator have different parameters.
The generator training module 205 is further configured to: randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
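The leakage check described here — pick one discrete column of the desensitized output at random and try to predict it from the remaining columns with logistic regression — could look like this on synthetic data. The gradient-descent fit, the 0.5 decision threshold, and the accuracy-near-chance criterion are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical desensitized table with 3 discrete columns; the randomly
# sampled "target" column is independent of the others, so prediction
# accuracy should stay near chance (low information leakage).
n = 1500
others = rng.integers(0, 3, size=(n, 2)).astype(float)
target = rng.integers(0, 2, size=n)

# Plain logistic regression fitted by gradient descent.
X = np.hstack([others, np.ones((n, 1))])        # bias column appended
wgt = np.zeros(3)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-X @ wgt))
    wgt -= 0.1 * X.T @ (p - target) / n

pred = (1.0 / (1.0 + np.exp(-X @ wgt))) > 0.5
acc = np.mean(pred == target)
print(round(float(acc), 2))  # near 0.5: the other columns do not reveal the target
```

If such a regression did predict the target column well, that would signal residual information leakage, and the generator's parameters would be adjusted before release.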
It should be noted that, for convenience and brevity of description, those skilled in the art may refer to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatus, modules, and units described above, which are not repeated here.
The method and apparatus of this application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer terminal devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Exemplarily, the above method and apparatus may be implemented in the form of a computer program that runs on a computer device as shown in FIG. 4.
Referring to FIG. 4, FIG. 4 is a schematic diagram of a computer device provided by an embodiment of this application. The computer device may be a server.
As shown in FIG. 4, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the data desensitization methods.
The processor provides computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any of the data desensitization methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the illustrated structure is only a block diagram of the parts relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In some implementations, the processor is configured to run a computer program stored in the memory to implement the following steps: acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information; preprocessing the key information to obtain discrete variables corresponding to the key information, where the preprocessing includes data discretization or data normalization; performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
In some embodiments, the processor is further configured to: perform word segmentation on the user data to obtain a plurality of word segments; perform feature extraction on each word segment to obtain an embedding feature of each word segment; perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
In some embodiments, the processor is further configured to: perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
In some embodiments, the processor is further configured to: convert the condition embedding vector into a one-hot code; and concatenate the one-hot code with the latent vector to obtain the concatenated vector.
In some embodiments, the processor is further configured to: acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data; train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
In some embodiments, the processor is further configured to: add noise to the second generator based on a loss function over statistical information to obtain the pre-trained generator, where the first generator, the second generator, and the pre-trained generator have different parameters.
In some embodiments, the processor is further configured to: randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
An embodiment of this application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that includes program instructions; when the program instructions are executed, any of the data desensitization methods provided by the embodiments of this application is implemented.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
Further, the computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (20)
- A data desensitization method, wherein the method comprises:
acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and
inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- The method according to claim 1, wherein performing information recognition on the user data based on the pre-trained key information recognition model to obtain key information comprises:
performing word segmentation on the user data to obtain a plurality of word segments;
performing feature extraction on each word segment to obtain an embedding feature of each word segment;
performing word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and
filtering the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
- The method according to claim 1, wherein preprocessing the key information to obtain the discrete variables corresponding to the key information comprises:
performing max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or
normalizing the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or
performing K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or
performing regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
- The method according to claim 1, wherein concatenating the condition embedding vector with the latent vector to obtain the concatenated vector comprises:
converting the condition embedding vector into a one-hot code; and
concatenating the one-hot code with the latent vector to obtain the concatenated vector.
- The method according to claim 1, wherein the method further comprises:
acquiring a concatenated vector corresponding to training data, and inputting the concatenated vector into a first generator for desensitization to obtain desensitized data;
training a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and
iteratively updating the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, and using the second generator as the pre-trained generator.
- The method according to claim 5, wherein after obtaining the second generator, the method further comprises:
adding noise to the second generator based on a loss function over statistical information to obtain the pre-trained generator, wherein the first generator, the second generator, and the pre-trained generator have different parameters.
- The method according to claim 1, wherein after obtaining the desensitized data, the method further comprises:
randomly sampling the discrete variables of the desensitized data to obtain a target discrete variable;
predicting the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and
adjusting the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- A data desensitization apparatus, wherein the apparatus comprises:
a key information extraction module, configured to acquire user data and perform information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
an information processing module, configured to preprocess the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
a vector concatenation module, configured to perform conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and to concatenate the condition embedding vector with the latent vector to obtain a concatenated vector; and
a data desensitization module, configured to input the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- A computer device, wherein the computer device comprises a memory and a processor;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and
inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- The computer device according to claim 9, wherein, in performing information recognition on the user data based on the pre-trained key information recognition model to obtain key information, the processor is configured to:
perform word segmentation on the user data to obtain a plurality of word segments;
perform feature extraction on each word segment to obtain an embedding feature of each word segment;
perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and
filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
- The computer device according to claim 9, wherein, in preprocessing the key information to obtain the discrete variables corresponding to the key information, the processor is configured to:
perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or
normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or
perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or
perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
- The computer device according to claim 9, wherein, in concatenating the condition embedding vector with the latent vector to obtain the concatenated vector, the processor is configured to:
convert the condition embedding vector into a one-hot code; and
concatenate the one-hot code with the latent vector to obtain the concatenated vector.
- The computer device according to claim 9, wherein the processor is further configured to:
acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data;
train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and
iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
- The computer device according to claim 9, wherein, after obtaining the desensitized data, the processor is further configured to:
randomly sample the discrete variables of the desensitized data to obtain a target discrete variable;
predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and
adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to implement the following steps:
acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and
inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- The computer-readable storage medium according to claim 15, wherein, in performing information recognition on the user data based on the pre-trained key information recognition model to obtain key information, the processor is configured to:
perform word segmentation on the user data to obtain a plurality of word segments;
perform feature extraction on each word segment to obtain an embedding feature of each word segment;
perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and
filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
- The computer-readable storage medium according to claim 15, wherein, in preprocessing the key information to obtain the discrete variables corresponding to the key information, the processor is configured to:
perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or
normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or
perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or
perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
- The computer-readable storage medium according to claim 15, wherein, in concatenating the condition embedding vector with the latent vector to obtain the concatenated vector, the processor is configured to:
convert the condition embedding vector into a one-hot code; and
concatenate the one-hot code with the latent vector to obtain the concatenated vector.
- The computer-readable storage medium according to claim 15, wherein the processor is further configured to:
acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data;
train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and
iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
- The computer-readable storage medium according to claim 15, wherein, after obtaining the desensitized data, the processor is further configured to:
randomly sample the discrete variables of the desensitized data to obtain a target discrete variable;
predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and
adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111229481.X | 2021-10-21 | ||
CN202111229481.XA CN113886885A (en) | 2021-10-21 | 2021-10-21 | Data desensitization method, data desensitization device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023065632A1 (en) | 2023-04-27 |
Family
ID=79004109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/089872 WO2023065632A1 (en) | 2021-10-21 | 2022-04-28 | Data desensitization method, data desensitization apparatus, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113886885A (en) |
WO (1) | WO2023065632A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113886885A (en) * | 2021-10-21 | 2022-01-04 | 平安科技(深圳)有限公司 | Data desensitization method, data desensitization device, equipment and storage medium |
CN115514564B (en) * | 2022-09-22 | 2023-06-16 | 成都坐联智城科技有限公司 | Data security processing method and system based on data sharing |
CN116361858B (en) * | 2023-04-10 | 2024-01-26 | 北京无限自在文化传媒股份有限公司 | User session resource data protection method and software product applying AI decision |
CN116629984B (en) * | 2023-07-24 | 2024-02-06 | 中信证券股份有限公司 | Product information recommendation method, device, equipment and medium based on embedded model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188202A (en) * | 2019-06-06 | 2019-08-30 | 北京百度网讯科技有限公司 | Training method, device and the terminal of semantic relation identification model |
CN111143884A (en) * | 2019-12-31 | 2020-05-12 | 北京懿医云科技有限公司 | Data desensitization method and device, electronic equipment and storage medium |
WO2021027533A1 (en) * | 2019-08-13 | 2021-02-18 | 平安国际智慧城市科技股份有限公司 | Text semantic recognition method and apparatus, computer device, and storage medium |
CN113254649A (en) * | 2021-06-22 | 2021-08-13 | 中国平安人寿保险股份有限公司 | Sensitive content recognition model training method, text recognition method and related device |
CN113886885A (en) * | 2021-10-21 | 2022-01-04 | 平安科技(深圳)有限公司 | Data desensitization method, data desensitization device, equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368752B (en) * | 2017-07-25 | 2019-06-28 | 北京工商大学 | A kind of depth difference method for secret protection based on production confrontation network |
CN110263152B (en) * | 2019-05-07 | 2024-04-09 | 平安科技(深圳)有限公司 | Text classification method, system and computer equipment based on neural network |
CN110135193A (en) * | 2019-05-15 | 2019-08-16 | 广东工业大学 | A kind of data desensitization method, device, equipment and computer readable storage medium |
CN110807207B (en) * | 2019-10-30 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111768325B (en) * | 2020-04-03 | 2023-07-25 | 南京信息工程大学 | Security improvement method based on generation of countermeasure sample in big data privacy protection |
CN111563275B (en) * | 2020-07-14 | 2020-10-20 | 中国人民解放军国防科技大学 | Data desensitization method based on generation countermeasure network |
CN113297573B (en) * | 2021-06-11 | 2022-06-10 | 浙江工业大学 | Vertical federal learning defense method and device based on GAN simulation data generation |
- 2021-10-21: CN application CN202111229481.XA, publication CN113886885A (en), status: active, Pending
- 2022-04-28: WO application PCT/CN2022/089872, publication WO2023065632A1 (en), status: active, Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116757748A (en) * | 2023-08-14 | 2023-09-15 | 广州钛动科技股份有限公司 | Advertisement click prediction method based on random gradient attack |
CN116757748B (en) * | 2023-08-14 | 2023-12-19 | 广州钛动科技股份有限公司 | Advertisement click prediction method based on random gradient attack |
CN117290888A (en) * | 2023-11-23 | 2023-12-26 | 江苏风云科技服务有限公司 | Information desensitization method for big data, storage medium and server |
CN117290888B (en) * | 2023-11-23 | 2024-02-09 | 江苏风云科技服务有限公司 | Information desensitization method for big data, storage medium and server |
CN117932676A (en) * | 2024-01-26 | 2024-04-26 | 湖北消费金融股份有限公司 | Data desensitization method and system based on network interface access control |
CN117744127A (en) * | 2024-02-20 | 2024-03-22 | 北京佳芯信息科技有限公司 | Data encryption authentication method and system based on data information protection |
CN117744127B (en) * | 2024-02-20 | 2024-05-07 | 北京佳芯信息科技有限公司 | Data encryption authentication method and system based on data information protection |
CN117912624A (en) * | 2024-03-15 | 2024-04-19 | 江西曼荼罗软件有限公司 | Electronic medical record sharing method and system |
CN118278051A (en) * | 2024-06-03 | 2024-07-02 | 广州青莲网络科技有限公司 | Data desensitization method and system based on artificial intelligence |
CN118748614A (en) * | 2024-07-12 | 2024-10-08 | 青岛海高设计制造有限公司 | Data processing method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113886885A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023065632A1 (en) | Data desensitization method, data desensitization apparatus, device, and storage medium | |
US11475143B2 (en) | Sensitive data classification | |
US10430610B2 (en) | Adaptive data obfuscation | |
CN113726784B (en) | Network data security monitoring method, device, equipment and storage medium | |
CN106874253A (en) | Recognize the method and device of sensitive information | |
CN111033506A (en) | Edit script verification with match and difference operations | |
NL2029110B1 (en) | Method and system for static analysis of executable files | |
WO2021189975A1 (en) | Machine behavior recognition method and apparatus, and device and computer-readable storage medium | |
WO2021151358A1 (en) | Triage information recommendation method and apparatus based on interpretation model, and device and medium | |
US12111933B2 (en) | System and method for dynamically updating existing threat models based on newly identified active threats | |
WO2022252638A1 (en) | Text matching method and apparatus, computer device and readable storage medium | |
US20240061952A1 (en) | Identifying sensitive data using redacted data | |
US11972023B2 (en) | Compatible anonymization of data sets of different sources | |
CN116821299A (en) | Intelligent question-answering method, intelligent question-answering device, equipment and storage medium | |
Tayyab et al. | Cryptographic based secure model on dataset for deep learning algorithms | |
Abaimov et al. | A survey on the application of deep learning for code injection detection | |
CN117609379A (en) | Model training method, system, equipment and medium based on vertical application of blockchain database | |
CN117313159A (en) | Data processing method, device, equipment and storage medium | |
CN116579798A (en) | User portrait construction method, device, equipment and medium based on data enhancement | |
US20200302017A1 (en) | Chat analysis using machine learning | |
CN117009832A (en) | Abnormal command detection method and device, electronic equipment and storage medium | |
US12105776B2 (en) | Dynamic feature names | |
CN113901821A (en) | Entity naming identification method, device, equipment and storage medium | |
CN113326699A (en) | Data detection method, electronic device and storage medium | |
CN116956356B (en) | Information transmission method and equipment based on data desensitization processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22882257; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22882257; Country of ref document: EP; Kind code of ref document: A1 |