WO2023065632A1 - Data desensitization method, data desensitization apparatus, device, and storage medium - Google Patents
- Publication number
- WO2023065632A1 (PCT/CN2022/089872)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present application relates to the field of artificial intelligence, and in particular to a data desensitization method, a data desensitization device, computer equipment, and a storage medium.
- Data desensitization technology is an effective method to solve data security problems and risks.
- Data desensitization refers to transforming key information or personal information according to preset rules or transformations, so that personal identity cannot be identified or the key information is hidden.
- Common structured-data desensitization methods are based on anonymization techniques or scrambling techniques.
- the inventor realized that in structured-data desensitization methods based on anonymization or scrambling techniques, there is a one-to-one mapping between the desensitized data and the original data. This makes the desensitized data easy to reverse, so the original data can easily be restored, private information in the original data is leaked, and data security is poor.
- the present application provides a data desensitization method, data desensitization device, computer equipment, and storage medium, aiming to solve the problem that existing desensitization methods are easily reversed and private information is easily leaked.
- the present application provides a data desensitization method, the method comprising:
- Preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing includes data discretization or data normalization;
- Based on a conditional loss function, performing conditional random sampling processing on the discrete variables to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the splicing vector is input to the pre-trained generator for desensitization processing to obtain desensitized data.
- the present application also provides a data desensitization device, the data desensitization device includes:
- the key information extraction module is used to obtain user data, and based on the pre-trained key information identification model, perform information identification on the user data to obtain key information;
- An information processing module configured to preprocess the key information to obtain discrete variables corresponding to the key information, where the preprocessing includes data discretization or data normalization;
- a vector splicing module configured to perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the data desensitization module is used to input the splicing vector into the pre-trained generator for desensitization processing to obtain desensitized data.
- the present application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and implement the following steps when executing the computer program:
- Preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing includes data discretization or data normalization;
- Based on a conditional loss function, performing conditional random sampling processing on the discrete variables to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the splicing vector is input to the pre-trained generator for desensitization processing to obtain desensitized data.
- the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements the following steps:
- Preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing includes data discretization or data normalization;
- Based on a conditional loss function, performing conditional random sampling processing on the discrete variables to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a splicing vector;
- the splicing vector is input to the pre-trained generator for desensitization processing to obtain desensitized data.
- the data desensitization method, data desensitization apparatus, device, and storage medium disclosed in the embodiments of the present application generate splicing vectors by extracting the key information of user data and the discrete variables of that key information, and use a pre-trained generator to desensitize the splicing vectors to obtain desensitized data, so that the desensitized data cannot easily be reversed, thereby ensuring that private data is not leaked and improving the security of the desensitized data.
- FIG. 1 is a schematic diagram of a scenario of a data desensitization method provided in an embodiment of the present application
- Fig. 2 is a schematic flow chart of a data desensitization method provided in the embodiment of the present application
- Fig. 3 is a schematic block diagram of a data desensitization device provided by an embodiment of the present application.
- Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
- Data desensitization technology is a data processing technology that can reduce and remove data sensitivity by processing data.
- the use of data desensitization technology can reduce the risk and harm of data leakage and effectively protect the privacy of user data.
- users can store, view and share personal medical and health data through their personal digital space, but personal medical data will face the risk of leaking sensitive medical information of users in the process of online medical treatment, online purchase of medicines, outpatient appointments, etc.
- user data has extremely high authenticity and sensitivity. Once the user's personal sensitive information is leaked, it may pose a potential threat to the user's life.
- the information in the personal digital space can be used for business-related analysis and processing, while avoiding the leakage of user data.
- Common structured data desensitization methods are desensitization methods based on anonymization technology or scrambling technology.
- Common anonymization techniques include k-anonymity, l-diversity, and t-closeness. They generalize the quasi-identifiers of individual records so that the records become indistinguishable within the entire data set, thereby achieving the desensitization effect. Scrambling-based techniques add noise to the records, such as additive or multiplicative noise on continuous values, to achieve the same effect.
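The scrambling approach described above can be illustrated with a minimal sketch in Python; the function name and noise level are illustrative assumptions, not part of the patent.

```python
import random

def scramble(values, sigma, rng=None):
    """Desensitize continuous values by adding zero-mean Gaussian noise,
    a simple scrambling-based masking scheme."""
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, sigma) for v in values]
```

This illustrates the weakness the patent targets: each scrambled value is a direct perturbation of one original value, so the mapping is one-to-one and statistically reversible.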
- this application provides a data desensitization method, which can be applied in the server, specifically in multiple fields such as finance and medical treatment.
- a pre-trained generator is obtained, the sensitive information of the user data is extracted, and the pre-trained generator desensitizes the sensitive information to obtain desensitized data, so that the desensitized data cannot easily be reversed, thereby ensuring that private data is not leaked and improving the security of the desensitized data.
- the server may be, for example, an independent server or a server cluster.
- the following embodiments will introduce in detail the data desensitization method applied to the server.
- the data desensitization method provided in the embodiment of the present application can be applied to the application environment shown in FIG. 1 .
- the application environment includes a terminal device 110 and a server 120, wherein the terminal device 110 can communicate with the server 120 through a network.
- the server 120 obtains the user data sent by the terminal device 110, performs key information extraction, information processing, and desensitization processing on the user data to generate desensitized data, and sends the desensitized data to the terminal device 110, thereby realizing data desensitization processing.
- the server 120 can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
- the terminal device 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
- the terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
- FIG. 2 is a schematic flowchart of a data desensitization method provided by an embodiment of the present application.
- the data desensitization method can be applied on the server, so that the desensitized data cannot easily be reverse-deciphered, thereby ensuring that private data is not leaked and improving the security of the desensitized data.
- the data desensitization method includes steps S101 to S104.
- S101 Acquire user data, and perform information identification on the user data based on a pre-trained key information identification model to obtain key information.
- the user data is data containing key information, and may specifically include medical data such as medical record data, financial data such as bank account data, and the like.
- the key information identification model may be a pre-trained BERT-CRF model based on an attention mechanism, which is used to extract key information in user data.
- the key information is the information that the user needs to desensitize, which is generally the user's private information.
- the key information can be the height and weight information in medical record data, or the account balance information and investment information in bank account data. It should be noted that any sensitive or private information can be used as key information.
- Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- word segmentation processing is performed on the user data to obtain multiple word segmentations; feature extraction is performed on each of the word segmentations to obtain the embedded features of each of the word segmentations;
- word meaning prediction is performed to obtain the meaning corresponding to each word segment; the multiple word segments are then screened according to their meanings to obtain the key information.
- the embedding features are word embedding features, position embedding features and segmentation embedding features.
- the word embedding feature is a vector representation of each word segment
- the position embedding is a vector representation of each word segment position
- the segmentation embedding feature is used to distinguish two different sentences.
- the user data can be segmented based on a word segmentation algorithm to obtain multiple word segments.
- the word segmentation algorithm may be the forward maximum matching method, the reverse maximum matching method, a word segmentation algorithm based on a hidden Markov model, a word segmentation algorithm based on conditional random fields, or other algorithms.
- for example, the word segmentation algorithm based on the hidden Markov model can be used to segment user data such as the medical record text "the patient has symptoms such as frequent urination, hunger, anxiety, and tremor, and diabetes is suspected" to obtain multiple corresponding word segments such as "frequent urination", "hunger", "anxiety", and "tremor".
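As an illustration of one of the listed algorithms, a forward maximum matching segmenter can be sketched as follows; the toy dictionary and function name are assumptions, and real systems would use a large lexicon or the HMM/CRF models named above.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

For instance, with the toy dictionary {"ab", "abc", "cd"}, the text "abcd" segments into ["abc", "d"] because the longest match is preferred at each position.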
- feature extraction can be performed on each word segment to obtain its embedded features; based on the word meaning prediction model, the meaning of each word segment is predicted from its embedded features to obtain the word meaning prediction result for each segment, and the multiple word segments are filtered based on those results to obtain the key information.
- in this way, text features can be mined to the greatest extent and richer word representations extracted, overcoming the shortcomings of traditional word vectors such as Word2vec and GloVe, which cannot dynamically represent context information or resolve polysemy. The similarity between each word segment and the preset standard sensitive words can therefore be obtained quickly, and the corresponding key information obtained quickly in turn.
- the word meaning prediction model is used to predict the similarity between each word segmentation and the preset standard sensitive word segmentation
- the word meaning prediction model is obtained by training a semantic matching model on a standard sensitive word-segment database
- the semantic matching model may be an LSTM matching model, an MV-DSSM model, an ESIM model, or another model
- the word meaning prediction result is the similarity between each participle and the standard sensitive participle in the standard sensitive participle database.
- for example, the word segments include words such as the account balance from the account information and segments such as the stock trend information
- feature extraction can be performed on each word segment to obtain its word embedding features, position embedding features, and segmentation embedding features; based on the LSTM matching model, word meaning prediction is performed for each word segment according to these features to obtain the word meaning prediction result for each segment, and based on those results the word segments corresponding to the stock trend information are filtered out to obtain the key information.
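The similarity-based screening of word segments against standard sensitive words can be sketched with cosine similarity over embedding vectors; the toy embeddings, threshold, and function names are illustrative assumptions standing in for the LSTM matching model.

```python
def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def filter_sensitive(segments, embeddings, sensitive_vecs, threshold=0.8):
    """Keep only segments whose embedding is close to some standard
    sensitive word vector."""
    keep = []
    for seg in segments:
        sim = max(cosine_similarity(embeddings[seg], s) for s in sensitive_vecs)
        if sim >= threshold:
            keep.append(seg)
    return keep
```

A segment like "account balance" whose vector matches a sensitive prototype is retained as key information, while unrelated segments fall below the threshold and are dropped.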
- since the key information is generally continuous data, it is necessary to convert between continuous and discrete representations, that is, to perform data preprocessing, which is a key step for neural network input and output.
- the key information is information such as height and weight
- the key information is continuous data
- the key information is information such as the number of investment companies
- the key information is discrete data
- the discrete variable refers to a variable whose value can be listed in a certain order, and usually takes an integer value, such as the number of employees, the number of factories, the number of machines, and the like.
- the data normalization processing may include maximum-minimum normalization processing and normalization processing according to a Gaussian mixture model; the data discretization processing may include K-bins discretization processing and regression tree discretization processing.
- the key information is subjected to maximum and minimum normalization processing to obtain the discrete variable corresponding to the key information; or, the key information is normalized through a Gaussian mixture model to obtain the key information A discrete variable corresponding to the information; or, K-bins discretization processing is performed on the key information to obtain a discrete variable corresponding to the key information; or, a regression tree discretization process is performed on the key information to obtain the key information Corresponding discrete variables.
- since the key information is continuous data, it can be mapped to the range [0,1] through a maximum-minimum linear transformation, so that the continuous value can be represented with a tanh activation function, obtaining the discrete variable corresponding to the key information.
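The maximum-minimum linear transformation to [0,1] can be sketched as follows (the zero-span guard is an illustrative assumption for constant columns):

```python
def min_max_normalize(values):
    """Map a list of continuous values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # guard against a constant column (assumption)
    return [(v - lo) / span for v in values]
```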
- the Gaussian mixture model can be used to fit the key information; a Gaussian component is sampled according to the probability of each Gaussian component of the key information in the mixture model, and the sampled Gaussian component is used to normalize the key information in the record. The key information can then be composed of the normalized representation and the one-hot encoding of the Gaussian component, so as to obtain the discrete variables corresponding to the key information.
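The mode-specific normalization described here can be sketched as follows, assuming known mixture components; picking the component with the highest responsibility and the 4-sigma normalization factor are illustrative assumptions, not stated in the patent.

```python
import math

def mode_specific_normalize(x, components):
    """Normalize a scalar within its most responsible Gaussian component.
    components: list of (weight, mu, sigma) tuples of a fitted mixture.
    Returns (normalized scalar, one-hot code of the chosen component)."""
    def pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    resp = [w * pdf(x, mu, s) for w, mu, s in components]
    k = resp.index(max(resp))
    _, mu, sigma = components[k]
    alpha = (x - mu) / (4.0 * sigma)                          # normalized value
    beta = [1 if i == k else 0 for i in range(len(components))]  # component one-hot
    return alpha, beta
```

The pair (alpha, beta) is exactly the "normalized representation plus one-hot encoding of the Gaussian component" described above.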
- K-bins discretization processing may be performed on the key information to obtain discrete variables corresponding to the key information.
- the discretization can also be called binning: the key information is divided into intervals according to certain rules, and each interval is represented by a one-hot encoding, so that the key information is fitted with a piecewise function containing four intervals to obtain the discrete variables corresponding to the key information.
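K-bins discretization with one-hot interval codes can be sketched as follows, assuming fixed cut points (three interior edges give the four intervals mentioned above):

```python
from bisect import bisect_right

def k_bins_one_hot(value, edges):
    """Map a continuous value to the one-hot code of its bin.
    edges are the interior cut points of K = len(edges) + 1 intervals."""
    k = bisect_right(edges, value)
    return [1 if i == k else 0 for i in range(len(edges) + 1)]
```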
- the key information may be discretized using a CART regression tree to obtain discrete variables corresponding to the key information.
- the CART regression tree can predict continuous data, and its leaf node represents a predicted value.
- the key information can be converted into discrete values by expressing a series of leaf nodes of the regression tree or regression tree set of key information through one-hot encoding.
- the conditional loss function is a conditional loss function based on a generative adversarial network, and the data terms of the loss function are generated based on conditional probabilities.
- the original intention is to enable data to be generated according to conditions, so that for the same type, the distributions of the data to be desensitized and the generated desensitized data are as consistent as possible.
- the training process can be constrained by predicting the condition variables, so that the values of the condition variables are consistent with the values of the corresponding variables in the generated data, and the effect of data generation can be further optimized.
- the conditional embedding vector is obtained by randomly selecting, with equal probability, a discrete variable that meets a preset condition from the multiple discrete variables corresponding to the key information, and the hidden vector is sampled from the white noise corresponding to the key information.
- the concatenated vector is obtained by concatenating the conditional embedding vector and the latent vector, and is used as an input of the generator.
- the one-to-one mapping relationship between the desensitized data and the original data is thereby changed, so that the desensitized data is not easy to reverse and the private information cannot be obtained.
- the distributed representation of the discrete variable can be obtained by constructing the probability mass distribution function of each value of the discrete variable, and the distributed representation of the discrete variable is subjected to conditional random sampling processing to obtain the conditional embedding vector and hidden vector.
- the white noise corresponding to the discrete variable may be converted by a deep neural network to generate a latent vector from the distributed representation of the discrete variable.
- the conditional embedding vector is converted to obtain a one-hot encoding; the one-hot encoding is concatenated with the hidden vector to obtain a concatenated vector.
- one-hot encoding, also known as one-effective encoding, uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one of them is valid. Converting the conditional embedding vector into a one-hot encoding solves the problem that the discriminator cannot handle attribute data well, and also expands the vector features to a certain extent.
- the conditional embedding vector can be transformed through a deep neural network to obtain a one-hot encoding, and the one-hot encoding is spliced with the latent vector to obtain the splicing vector. In this way, a splicing vector that meets the input requirements of the generator is obtained.
- the pre-trained generator is obtained through generative adversarial network training, and the desensitized data is the data obtained after desensitizing the key information in the data to be desensitized.
- the splicing vector corresponding to the training data is obtained, and the splicing vector is input to the first generator for desensitization processing to obtain desensitized data; the preset discriminator is trained based on the desensitized data and the training data to obtain a pre-trained discriminator; according to the preset learning rate and the parameters of the pre-trained discriminator, the parameters of the first generator are iteratively updated multiple times to obtain a second generator, and the second generator is used as the pre-trained generator. In this way, the parameters of the first generator can be updated iteratively through the pre-trained discriminator and the desensitized data, and realistic desensitized data can be generated.
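The alternating training loop above (a discriminator step followed by a generator step) can be sketched as a toy scalar GAN. This is purely an illustrative sketch, not the patent's network: the one-parameter generator g(z) = a + b·z, the logistic discriminator, and the non-saturating generator loss are common GAN choices assumed here.

```python
import math, random

def sigmoid(t):
    t = max(-60.0, min(60.0, t))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(real_data, steps=500, lr=0.05, seed=0):
    """Alternately update a scalar discriminator and generator.
    Generator: g(z) = a + b*z; discriminator: d(x) = sigmoid(w*x + c)."""
    rng = random.Random(seed)
    a, b = 0.0, 1.0   # generator parameters (the "first generator")
    w, c = 0.1, 0.0   # discriminator parameters
    for _ in range(steps):
        # discriminator step: push d(real) up and d(fake) down
        x = rng.choice(real_data)
        fake = a + b * rng.gauss(0.0, 1.0)
        dr, df = sigmoid(w * x + c), sigmoid(w * fake + c)
        w += lr * ((1.0 - dr) * x - df * fake)
        c += lr * ((1.0 - dr) - df)
        # generator step: non-saturating loss, push d(fake) up
        z = rng.gauss(0.0, 1.0)
        fake = a + b * z
        df = sigmoid(w * fake + c)
        a += lr * (1.0 - df) * w
        b += lr * (1.0 - df) * w * z
    return a, b
```

The structure mirrors the description: the discriminator is trained on real and generated samples, and the generator parameters are then updated against the trained discriminator for multiple iterations.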
- the training data is a data set to be desensitized for training generator parameters
- the first generator is a preset untrained generator
- the second generator is a generator obtained through multiple iterative updates.
- the parameters of the first generator and the second generator are different.
- the prior probability of the discrete variable can be obtained through the distributed representation of the discrete variable, and parameters are sampled from the prior probability as parameters of the first generator.
- the generator and the discriminator can be trained by the stochastic gradient Hamiltonian Monte Carlo method to obtain a pre-trained generator and a pre-trained discriminator.
- the preset discriminator is trained based on the desensitized data and the training data to obtain the pre-trained discriminator: the conditional embedding vector is spliced with the desensitized data and with the training data to obtain first spliced data and second spliced data; the similarity between the first and second spliced data is calculated; the loss function is optimized according to that similarity; and the discriminator is gradient-clipped through the loss function to obtain the pre-trained discriminator.
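The gradient clipping step mentioned above can be illustrated with a minimal sketch; clipping by global L2 norm is an assumption here, since the patent does not specify the clipping rule.

```python
def clip_gradients(grads, max_norm):
    """Scale the gradient vector down if its L2 norm exceeds max_norm,
    a standard way to stabilize discriminator training."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)
```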
- the discriminator parameters can be trained using the first generator and the preset discriminator parameters, so that the desensitized data is judged as false as much as possible, thereby adjusting the discriminator parameters and improving the discriminator's ability to distinguish desensitized data.
- the posterior probability of the second generator can be calculated from the prior probability of the parameters of the first generator and the pre-trained discriminator, so that the desensitized data makes the discriminator misjudge it as the data to be desensitized, thereby adjusting the generator parameters to generate realistic desensitized data.
- the second generator is subjected to noise-increasing processing based on a loss function of statistical information to obtain the pre-trained generator, wherein the parameters of the first generator, the parameter-updated generator, and the pre-trained generator all differ. In this way, the generation quality and degree of desensitization of the desensitized data can be controlled.
- the statistical information-based loss function may include a mean-based loss function, a variance-based loss function, and the like.
- Gaussian noise can be added to the parameters of the second generator, analogous to the noise assumed when fitting a polynomial to a sinusoidal curve; the Gaussian noise is an error conforming to a Gaussian normal distribution.
- the specific value of Gaussian noise can be obtained through experiments.
- an error term may be introduced into the parameters of the second generator, so that the parameters can be corrected to obtain the pre-trained generator. Because of the error term, there is a certain difference between the generated desensitized data and the original data, but the difference is not large; this avoids desensitized data that differs so much from the original data that it loses research value, while also ensuring the data cannot easily be reversed.
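Introducing the error term into the generator parameters might look like the following sketch; the function name and per-parameter Gaussian noise are illustrative assumptions.

```python
import random

def perturb_parameters(params, sigma, rng=None):
    """Add zero-mean Gaussian noise to each generator parameter, trading
    a small amount of fidelity for resistance to reverse-engineering."""
    rng = rng or random.Random()
    return [p + rng.gauss(0.0, sigma) for p in params]
```

A small sigma keeps the generated data close enough to the original distribution to retain research value while breaking exact recoverability.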
- the discrete variables of the desensitized data are randomly sampled to obtain a target discrete variable; based on a logistic regression model, the target discrete variable is predicted to obtain a prediction result; and the parameters of the pre-trained generator are adjusted based on the prediction result of the target discrete variable.
- the parameters of the generator can be adjusted by predicting the discrete variables to achieve a better desensitization effect.
- a better desensitization effect here means that the desensitized data cannot be reverse-engineered while it still maintains its association with the original data.
- the target discrete variable is randomly sampled from the multiple discrete variables of the desensitized data. To keep the discrete variables of the desensitized data associated with those of the original data, the target discrete variable is generally assumed not to change; when the difference between the desensitized data and the original data is small, the research value is preserved, so the consistency of the target discrete variable must be ensured.
- the logistic regression model was used to predict discrete variables.
- the cross-entropy loss function can be used to judge whether the generated prediction result of the target discrete variable is consistent with the target discrete variable, so as to determine the generation quality of the desensitized data. If the prediction result is consistent with the target discrete variable, there is no need to adjust the parameters of the pre-trained generator; if it is inconsistent, the difference between the prediction result and the target discrete variable is determined, and the parameters of the pre-trained generator are adjusted according to that difference. In this way, the accuracy of the target discrete variable can be checked, and the generated desensitized data is prevented from differing too much from the original data. Since most discrete variables of the desensitized data and the original data are the same, removing one discrete variable still allows it to be predicted accurately from the remaining discrete variables.
- for example, suppose the target discrete variable of the desensitized data is a shoe size of 43;
- the target discrete variable can then be predicted from the remaining discrete variables of the desensitized data, such as height and weight, and the predicted shoe size is compared with the shoe size in the desensitized data. For example, if the predicted shoe size is 40, the difference is determined to be 3 sizes, and the parameters of the pre-trained generator are iteratively updated according to that difference; if the predicted shoe size is 43, there is no need to adjust the parameters of the pre-trained generator.
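The consistency check in this example can be sketched as follows. This is an illustrative sketch only: the synthetic height/weight/shoe-size data and the use of scikit-learn's `LogisticRegression` are assumptions for demonstration, not the patent's actual model or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "desensitized" records: discretized height/weight bins, with a
# shoe-size class derived from them (illustrative data only).
height_bin = rng.integers(0, 3, size=200)
weight_bin = rng.integers(0, 3, size=200)
shoe_size = 40 + height_bin + weight_bin        # classes 40..44

# Hold out the target discrete variable (shoe size) and predict it
# from the remaining discrete variables (height and weight).
X = np.column_stack([height_bin, weight_bin])
clf = LogisticRegression(max_iter=1000).fit(X, shoe_size)

pred = int(clf.predict([[2, 1]])[0])   # record whose desensitized size is 43
difference = abs(pred - 43)            # a nonzero value drives the update
```

If `difference` is zero the pre-trained generator's parameters are left unchanged; otherwise the difference would be fed back into the iterative parameter update described above.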
- the server may also send prompt information for prompting the user that the desensitized data has been generated to the terminal device.
- the prompt information may specifically be delivered through an application program (APP), an email, a short message (SMS), or a chat tool such as WeChat or QQ.
- when the desensitized data has been generated, the server sends a prompt message to the terminal device to remind the user that the desensitized data has been generated.
- FIG. 3 is a schematic block diagram of a data desensitization device provided by an embodiment of the present application.
- the data desensitization device can be configured in a server to execute the aforementioned data desensitization method.
- the data desensitization device 200 includes: a key information extraction module 201, an information processing module 202, a vector splicing module 203, and a data desensitization module 204.
- the key information extraction module 201 is configured to acquire user data, and based on a pre-trained key information identification model, perform information identification on the user data to obtain key information;
- an information processing module 202, configured to preprocess the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;
- a vector splicing module 203, configured to perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and to splice the conditional embedding vector and the hidden vector to obtain a spliced vector;
- a data desensitization module 204, configured to input the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
- the key information extraction module 201 is also configured to perform word segmentation processing on the user data to obtain a plurality of word segments; perform feature extraction on each word segment to obtain its embedded features; predict the word meaning from the embedded features of each word segment to obtain the meaning corresponding to each word segment; and screen the plurality of word segments according to their meanings to obtain the key information.
- the information processing module 202 is further configured to perform maximum-minimum normalization processing on the key information to obtain the discrete variables corresponding to the key information; or to normalize the key information through a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or to perform K-bins discretization processing on the key information to obtain the discrete variables corresponding to the key information; or to perform regression-tree discretization processing on the key information to obtain the discrete variables corresponding to the key information.
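Two of these preprocessing options can be sketched in a few lines (a minimal NumPy sketch; the age values are made-up example data, and the equal-width bin edges stand in for the edges a Gaussian mixture or regression tree would derive):

```python
import numpy as np

def min_max_normalize(x):
    """Maximum-minimum normalization: scale key-information values into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def k_bins_discretize(x, k=4):
    """Equal-width K-bins discretization: map values to bin indices 0..k-1."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

ages = [18, 25, 31, 47, 52, 63, 70]      # made-up key information
normalized = min_max_normalize(ages)     # continuous values in [0, 1]
bins = k_bins_discretize(ages, k=4)      # discrete variables 0..3
```

Either output can serve as the discrete variables corresponding to the key information; the Gaussian-mixture and regression-tree variants differ only in how the bin boundaries are derived.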
- the vector splicing module 203 is further configured to convert the conditional embedding vector to obtain a one-hot encoding; splice the one-hot encoding and the latent vector to obtain a concatenated vector.
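The one-hot conversion and splicing can be sketched as follows (a minimal NumPy sketch; the category count and latent dimension are arbitrary example values):

```python
import numpy as np

def build_spliced_vector(category, n_categories, latent_dim=8, seed=None):
    """Concatenate a one-hot conditional vector with a sampled hidden vector."""
    rng = np.random.default_rng(seed)
    one_hot = np.zeros(n_categories)       # conditional embedding as one-hot
    one_hot[category] = 1.0
    latent = rng.normal(size=latent_dim)   # hidden (latent) vector z
    return np.concatenate([one_hot, latent])

vec = build_spliced_vector(category=2, n_categories=5, latent_dim=8, seed=0)
```

The first `n_categories` slots carry the condition and the remaining slots carry the random hidden vector; the combined vector is what gets fed to the generator.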
- the generator training module 205 is configured to acquire the spliced vector corresponding to the training data, and input the spliced vector to the first generator for desensitization processing to obtain desensitized data; train the preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to the preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, which is used as the pre-trained generator.
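The shape of this iterative update can be illustrated with a deliberately tiny one-dimensional stand-in. This is not the patent's network: the "generator" here is a linear map and the adversarial discriminator signal is replaced by simple moment matching, purely to show a learning-rate-driven parameter-update loop.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(5.0, 1.0, size=512)   # stands in for the training data
w, b = 1.0, 0.0                         # "first generator" parameters
lr = 0.05                               # the preset learning rate

for _ in range(2000):
    z = rng.normal(size=512)            # hidden vectors
    fake = w * z + b                    # candidate desensitized values
    # Feedback signal: mismatch between generated and real statistics
    # (a stand-in for the pre-trained discriminator's gradient).
    b -= lr * (fake.mean() - real.mean())
    w -= lr * (fake.std() - real.std())

# After many iterative updates, (w, b) define the "second generator",
# whose outputs match the training data's distribution.
```

The real scheme replaces the moment-matching feedback with the pre-trained discriminator's judgment, but the loop structure (generate, score, update parameters by the learning rate) is the same.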
- the generator training module 205 is further configured to perform noise-increasing processing on the second generator based on a loss function of statistical information to obtain the pre-trained generator, wherein the parameters of the first generator, the second generator, and the pre-trained generator are all different.
- the generator training module 205 is also configured to randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on the logistic regression model to obtain a prediction result; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- the methods and devices of the present application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
- the above-mentioned method and apparatus can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 4 .
- FIG. 4 is a schematic diagram of a computer device provided by an embodiment of the present application.
- the computer device may be a server.
- the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.
- Non-volatile storage media can store operating systems and computer programs.
- the computer program includes program instructions.
- when executed by the processor, the program instructions cause the processor to perform any of the data desensitization methods.
- the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
- the internal memory provides an environment for running the computer program in the non-volatile storage medium.
- the processor can execute any data desensitization method.
- the network interface is used for network communication, such as sending assigned tasks.
- Those skilled in the art can understand that the illustrated structure is only a block diagram of the partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figures, combine certain components, or arrange the components differently.
- the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the processor is configured to run a computer program stored in the memory to implement the following steps: acquire user data, and perform information identification on the user data based on a pre-trained key information identification model to obtain key information; preprocess the key information to obtain the discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing; perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splice the conditional embedding vector and the hidden vector to obtain a spliced vector; and input the spliced vector into the pre-trained generator for desensitization processing to obtain desensitized data.
- the processor is further configured to: perform word segmentation processing on the user data to obtain multiple word segments; perform feature extraction on each word segment to obtain its embedded features; predict the word meaning from the embedded features of each word segment to obtain the meaning corresponding to each word segment; and screen the multiple word segments according to their meanings to obtain the key information.
- the processor is further configured to: perform maximum-minimum normalization processing on the key information to obtain the discrete variables corresponding to the key information; or normalize the key information through a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or perform K-bins discretization processing on the key information to obtain the discrete variables corresponding to the key information; or perform regression-tree discretization processing on the key information to obtain the discrete variables corresponding to the key information.
- the processor is further configured to: convert the conditional embedding vector to obtain a one-hot encoding; concatenate the one-hot encoding and the latent vector to obtain a concatenated vector.
- the processor is further configured to: acquire the spliced vector corresponding to the training data, and input the spliced vector to the first generator for desensitization processing to obtain desensitized data; train the preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to the preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, which is used as the pre-trained generator.
- the processor is further configured to: perform noise-increasing processing on the second generator based on a loss function of statistical information to obtain the pre-trained generator, wherein the parameters of the first generator, the second generator, and the pre-trained generator are different.
- the processor is further configured to: randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on the logistic regression model to obtain a prediction result; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- the embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium may be non-volatile or volatile.
- a computer program is stored on the computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed, any data desensitization method provided in the embodiments of the present application is implemented.
- the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
- the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD ) card, flash memory card (Flash Card), etc.
- the computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of the blockchain node, and the like.
- Blockchain is essentially a decentralized database: a chain of data blocks linked to one another using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
Abstract
The present application relates to the field of artificial intelligence, in particular to a data desensitization method, a data desensitization apparatus, a device, and a storage medium. The method comprises: obtaining user data, and performing information identification on the user data on the basis of a pre-trained key information identification model to obtain key information; preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing comprising data discretization processing or data normalization processing; performing conditional random sampling processing on the discrete variables on the basis of a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector; and inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data. Therefore, desensitized data cannot be easily reversely cracked, thereby ensuring that privacy data will not be leaked, and improving the security of the desensitized data.
Description
This application claims priority to the Chinese patent application No. 202111229481.X, entitled "Data desensitization method, data desensitization device, equipment and storage medium" and filed with the China Patent Office on October 21, 2021, the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence, and in particular to a data desensitization method, a data desensitization device, computer equipment, and a storage medium.
In the era of big data, attacks on data are becoming more frequent and the attack methods more varied. Data desensitization is an effective way to address data security problems and risks: key information or personal information is transformed according to preset rules or transformations so that individuals cannot be identified or the key information is hidden. Currently, common structured data desensitization methods are based on anonymization or perturbation techniques.

However, the inventor realized that structured data desensitization methods based on anonymization or perturbation establish a one-to-one mapping between the desensitized data and the original data. As a result, the desensitized data can easily be reverse-engineered and the original data restored, leading to the leakage of private information in the original data and poor data security.
Summary of the invention
The present application provides a data desensitization method, a data desensitization device, computer equipment, and a storage medium, aiming to solve the problem that existing desensitization methods can easily be reverse-engineered, causing private information to be leaked.
To achieve the above purpose, the present application provides a data desensitization method, the method comprising:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

performing conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector;

inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
To achieve the above purpose, the present application also provides a data desensitization device, the data desensitization device comprising:

a key information extraction module, configured to acquire user data and, based on a pre-trained key information identification model, perform information identification on the user data to obtain key information;

an information processing module, configured to preprocess the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

a vector splicing module, configured to perform conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and to splice the conditional embedding vector and the hidden vector to obtain a spliced vector;

a data desensitization module, configured to input the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
In addition, to achieve the above purpose, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program and the processor executing the computer program to implement the following steps:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

performing conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector;

inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
In addition, to achieve the above purpose, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the following steps:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization processing or data normalization processing;

performing conditional random sampling processing on the discrete variables based on a conditional loss function to obtain a conditional embedding vector and a hidden vector, and splicing the conditional embedding vector and the hidden vector to obtain a spliced vector;

inputting the spliced vector into a pre-trained generator for desensitization processing to obtain desensitized data.
Through the data desensitization method, data desensitization device, equipment, and storage medium disclosed in the embodiments of the present application, the key information of user data and the discrete variables of that key information are extracted to generate a spliced vector, which a pre-trained generator then desensitizes to produce desensitized data. The desensitized data cannot easily be reverse-engineered, ensuring that private data is not leaked and improving the security of the desensitized data.
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a scenario of a data desensitization method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a data desensitization method provided by an embodiment of the present application;

FIG. 3 is a schematic block diagram of a data desensitization device provided by an embodiment of the present application;

FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

The flowcharts shown in the drawings are merely illustrative; they need not include all contents and operations/steps, nor be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation. In addition, although functional modules are divided in the device schematic, in some cases they may be divided differently from the schematic.

The term "and/or" used in the specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Data desensitization technology processes data so as to reduce or remove its sensitivity, lowering the risk and harm of data leakage and effectively protecting the privacy of user data. In the Internet and medical fields, users can store, view, and share personal medical and health data through a personal digital space, but personal medical data faces the risk of leaking sensitive medical information during online consultations, online drug purchases, outpatient appointments, and similar processes. In the medical industry, user data is highly authentic and sensitive; once a user's personal sensitive information is leaked, it may pose a potential threat to the user's life. With data desensitization, the information in the personal digital space can be used for business-related analysis and processing while avoiding leakage of user data.

Currently, common structured data desensitization methods are based on anonymization or perturbation techniques. Common anonymization techniques include k-anonymity, l-diversity, and t-closeness, which generalize the quasi-identifiers of a single record so that the record cannot be distinguished within the entire data set, achieving desensitization. Perturbation-based techniques add noise to records, for example additive or multiplicative noise on continuous values, to achieve desensitization.

However, in structured data desensitization methods based on anonymization or perturbation, the desensitized data has a one-to-one mapping to the original data, so the desensitized data risks being reverse-engineered; moreover, the desensitized data often differs so much from the original data that it loses research value.
To solve the above problems, the present application provides a data desensitization method that can be applied in a server, in fields such as finance and healthcare. By continuously and iteratively updating the generator's parameters, a pre-trained generator is obtained; the sensitive information of the user data is extracted and desensitized by the pre-trained generator to obtain desensitized data. The desensitized data cannot easily be reverse-engineered, ensuring that private data is not leaked and improving the security of the desensitized data.

The server may be, for example, a standalone server or a server cluster. For ease of understanding, the following embodiments describe the data desensitization method as applied to a server.

Some implementations of the present application are described in detail below with reference to the accompanying drawings. The following embodiments, and the features within them, may be combined with each other provided there is no conflict.
As shown in FIG. 1, the data desensitization method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1. The application environment includes a terminal device 110 and a server 120, where the terminal device 110 can communicate with the server 120 through a network. Specifically, the server 120 obtains user data sent by the terminal device 110, performs key information extraction, information processing, and desensitization on the user data to generate desensitized data, and sends the desensitized data back to the terminal device 110, thereby implementing data desensitization. The server 120 may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. The terminal device 110 may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, or a smart watch. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
Please refer to FIG. 2, which is a schematic flowchart of a data desensitization method provided by an embodiment of the present application. The method can be applied in a server, so that the desensitized data cannot easily be reverse-engineered, private data is not leaked, and the security of the desensitized data is improved.
As shown in FIG. 2, the data desensitization method includes steps S101 to S104.
S101. Acquire user data, and perform information identification on the user data based on a pre-trained key information identification model to obtain key information.
The user data is data containing key information, and may specifically include medical data such as medical records, or financial data such as bank account data. The key information identification model may be a pre-trained attention-based BERT-CRF model used to extract key information from the user data. The key information is the information the user needs to desensitize, generally the user's private information; for example, the key information may be height and weight in medical records, or account balances and investment information in bank account data. It should be noted that any sensitive or private information can serve as key information.
The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
In some embodiments, word segmentation is performed on the user data to obtain multiple word segments; feature extraction is performed on each word segment to obtain its embedding features; the meaning of each word segment is predicted from its embedding features; and the word segments are filtered according to their predicted meanings to obtain the key information. In this way the key information can be extracted accurately, improving both the accuracy and the security of the generated desensitized data.
The embedding features are word embeddings, position embeddings, and segment embeddings. The word embedding is a vector representation of each word segment, the position embedding is a vector representation of each word segment's position, and the segment embedding is used to distinguish between two different sentences.
Specifically, the user data may be segmented with a word segmentation algorithm to obtain multiple word segments. The word segmentation algorithm may be the forward maximum matching method, the reverse maximum matching method, a segmentation algorithm based on a hidden Markov model, a segmentation algorithm based on conditional random fields, or the like.
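The forward maximum matching method mentioned above can be sketched in a few lines. This is an illustrative, dictionary-based toy (the lexicon and the concatenated input string are hypothetical), not the segmenter the application actually uses:

```python
def forward_max_match(text, lexicon, max_len=8):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon entry that prefixes the remaining text; fall back
    to a single character when nothing matches."""
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                segments.append(piece)
                i += size
                break
    return segments

# Hypothetical lexicon of domain terms.
lexicon = {"account", "balance", "info"}
print(forward_max_match("accountbalanceinfo", lexicon))
# → ['account', 'balance', 'info']
```

Reverse maximum matching is the same idea scanning from the end of the text backwards.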
For example, a segmentation algorithm based on a hidden Markov model may segment user data such as the medical record text "the patient has symptoms such as frequent urination, excessive hunger, anxiety, and tremor; diabetes is suspected" into corresponding word segments such as "frequent urination", "excessive hunger", "anxiety", and "tremor".
Specifically, feature extraction may be performed on each word segment to obtain its embedding features, and a word-meaning prediction model may predict the meaning of each word segment from those features, yielding a word-meaning prediction result for each segment; the word segments are then filtered based on these results to obtain the key information. This mines the text features as fully as possible and extracts richer word representations, overcoming the drawbacks of traditional word vectors such as Word2vec and GloVe, which cannot dynamically represent context or resolve polysemy. The similarity between each word segment and the preset standard sensitive word segments can thus be obtained quickly, and the corresponding key information with it.
The word-meaning prediction model is used to predict the similarity between each word segment and preset standard sensitive word segments. It is obtained by training a semantic matching model against a database of standard sensitive word segments; the semantic matching model may be an LSTM matching model, an MV-DSSM model, an ESIM model, or the like. The word-meaning prediction result is the similarity between each word segment and the standard sensitive word segments in the database.
For example, suppose the word segments include account-related segments such as "account balance" as well as segments about stock trends. Feature extraction may be performed on each segment to obtain its word, position, and segment embeddings; based on an LSTM matching model, the meaning of each segment is predicted from these embeddings, yielding a word-meaning prediction result per segment; and based on these results the segments corresponding to stock-trend information are filtered out, leaving the key information.
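The filtering step in this example — keep only segments whose predicted meaning is close to some standard sensitive term — can be sketched with cosine similarity over embedding vectors. The 3-d embeddings and the threshold below are hypothetical stand-ins for the matching model's learned representations:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_sensitive(segment_vecs, sensitive_vecs, threshold=0.8):
    """Keep a segment if it is similar enough to any standard sensitive term."""
    keep = []
    for seg, vec in segment_vecs.items():
        if max(cosine(vec, s) for s in sensitive_vecs) >= threshold:
            keep.append(seg)
    return keep

# Hypothetical embeddings: "account balance" lies near the sensitive
# prototype, "stock trend" does not.
segment_vecs = {"account balance": [0.9, 0.1, 0.0],
                "stock trend": [0.0, 0.2, 0.9]}
sensitive_vecs = [[1.0, 0.0, 0.0]]
print(filter_sensitive(segment_vecs, sensitive_vecs))
```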
S102. Preprocess the key information to obtain discrete variables corresponding to the key information, the preprocessing including data discretization or data normalization.
Since the key information is generally continuous data, a representation conversion between continuous and discrete data is needed. This data preprocessing operation is a key step for the input and output of the neural networks.
For example, when the key information is height or weight, it is continuous data; when the key information is, say, the number of invested enterprises, it is discrete data.
A discrete variable is a variable whose values can be enumerated in a certain order and that usually takes integer values, such as the number of employees, the number of factories, or the number of machines. Specifically, the data normalization may include min-max normalization and normalization based on a Gaussian mixture model; the data discretization may include K-bins discretization and regression-tree discretization.
In some embodiments, min-max normalization is performed on the key information to obtain the corresponding discrete variable; or the key information is normalized with a Gaussian mixture model; or K-bins discretization is applied to the key information; or regression-tree discretization is applied to the key information, in each case obtaining the discrete variable corresponding to the key information.
Specifically, if the key information is continuous data, it can be mapped into the range [0,1] by a min-max linear transformation, so that the continuous value can be represented by a tanh activation function, giving the discrete variable corresponding to the key information.
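The min-max mapping into [0,1] described here is a one-line transform. The sample values below are hypothetical heights in centimetres:

```python
def min_max_normalize(values):
    """Linearly map values into [0, 1] so that a bounded activation
    such as tanh can represent them."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [150.0, 170.0, 190.0]
print(min_max_normalize(heights))  # → [0.0, 0.5, 1.0]
```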
Specifically, if the key information is continuous data, a Gaussian mixture model can be fitted to it; a Gaussian component is sampled according to the probability of the key information under each component of the mixture, and the sampled component is used to normalize the key information in the record. The key information is then jointly represented by this normalized value and the one-hot encoding of the Gaussian component, giving the discrete variable corresponding to the key information.
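The Gaussian-mixture step can be sketched as follows. To keep the example self-contained, the two one-dimensional mixture components are assumed to be already fitted (in practice they would come from fitting a GMM to the column), and the 4-sigma scaling of the within-mode offset is an illustrative choice, not a detail stated in the application:

```python
import math
import random

# Hypothetical pre-fitted 1-D mixture: (weight, mean, std) per component.
components = [(0.5, 160.0, 5.0), (0.5, 185.0, 5.0)]

def encode_with_gmm(x, components, rng):
    """Sample a component in proportion to its responsibility for x,
    then represent x as (normalized offset, one-hot component code)."""
    resp = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
            for w, m, s in components]
    total = sum(resp)
    probs = [r / total for r in resp]
    k = rng.choices(range(len(components)), weights=probs)[0]
    _, m, s = components[k]
    normalized = (x - m) / (4 * s)  # offset within the sampled mode
    one_hot = [1 if i == k else 0 for i in range(len(components))]
    return normalized, one_hot

rng = random.Random(0)
print(encode_with_gmm(162.0, components, rng))
```

A value of 162 lies almost entirely under the first component, so the sampled code is nearly always `[1, 0]` with a small normalized offset.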
Specifically, if the key information is continuous data, K-bins discretization may be applied to obtain the corresponding discrete variable. Discretization, also called binning, assigns the key information into intervals according to certain rules and represents each interval with a one-hot encoding, so that the key information is fitted with a piecewise function containing, for example, four intervals, giving the discrete variable corresponding to the key information.
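The binning rule described here — assign a value to one of a few intervals and represent the interval with a one-hot code — can be sketched directly. Four equal-width bins over a hypothetical weight range are assumed for illustration:

```python
def k_bins_one_hot(x, lo, hi, n_bins=4):
    """Equal-width binning with a one-hot encoding of the interval."""
    width = (hi - lo) / n_bins
    k = min(int((x - lo) / width), n_bins - 1)  # clamp the top edge
    return [1 if i == k else 0 for i in range(n_bins)]

# Hypothetical weight column spanning 40-120 kg.
print(k_bins_one_hot(75.0, lo=40.0, hi=120.0))  # → [0, 1, 0, 0]
```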
Specifically, if the key information is continuous data, a CART regression tree may be used to discretize it into the corresponding discrete variable. A CART regression tree can predict continuous data, with each leaf node representing a predicted value. By representing the series of leaf nodes reached in the regression tree (or tree ensemble) with a one-hot encoding, the key information is converted into discrete values.
It should be noted that if the key information is already discrete data, neither discretization nor normalization is needed.
S103. Based on a conditional loss function, perform conditional random sampling on the discrete variables to obtain a conditional embedding vector and a latent vector, and concatenate the conditional embedding vector with the latent vector to obtain a concatenated vector.
The conditional loss function is the conditional loss function of a generative adversarial network, whose data term is generated based on conditional probabilities. The intent is that data be generated conditionally, so that the distribution of the generated desensitized data matches, as closely as possible, that of the data to be desensitized of the same type. However, because the condition sampled each time may involve a different variable, the data under any given condition variable is hard to train sufficiently, and the value of the corresponding variable in the generated data can be observed to disagree with the value of the condition variable. By predicting the condition variable, the training process can be constrained so that the value of the condition variable agrees with that of the corresponding variable in the generated data, further improving the quality of data generation.
Specifically, the conditional embedding vector can be obtained by randomly selecting, with equal probability, one discrete variable satisfying a preset condition from the multiple discrete variables corresponding to the key information; the latent vector can be sampled from white noise corresponding to the key information; and the concatenated vector, obtained by concatenating the conditional embedding vector with the latent vector, serves as the input to the generator. Introducing the latent vector breaks the one-to-one mapping between the desensitized data and the original data, so the desensitized data cannot easily be reverse-engineered to recover private information.
Specifically, a distributed representation of a discrete variable can be obtained by constructing the probability mass function over its values, and conditional random sampling can then be performed on this distributed representation to obtain the conditional embedding vector and the latent vector.
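Constructing a probability mass function over a discrete variable's values and sampling a condition from it can be sketched as below. The column values are hypothetical, and the log-frequency weighting (which keeps rare categories represented during sampling) is one common choice, not a detail stated in the application:

```python
import math
import random
from collections import Counter

def build_pmf(column):
    """Log-frequency probability mass function over a discrete column."""
    counts = Counter(column)
    weights = {v: math.log(c + 1) for v, c in counts.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

def sample_condition(pmf, rng):
    """Randomly sample one value of the variable according to the PMF."""
    values = list(pmf)
    return rng.choices(values, weights=[pmf[v] for v in values])[0]

column = ["size_40"] * 6 + ["size_43"] * 3 + ["size_45"]
pmf = build_pmf(column)
print(sample_condition(pmf, random.Random(1)))
```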
For example, a deep neural network can transform the white noise corresponding to a discrete variable so as to generate the latent vector from the distributed representation of that variable.
In some embodiments, the conditional embedding vector is converted into a one-hot encoding, and the one-hot encoding is concatenated with the latent vector to obtain the concatenated vector. One-hot encoding uses an N-bit state register to encode N states, with each state having its own register bit and only one bit being valid at any time. Converting the conditional embedding vector into a one-hot encoding addresses the difficulty discriminators have in handling attribute data and, to some extent, also expands the vector's features.
Specifically, the conditional embedding vector can be converted through a deep neural network into a one-hot encoding, which is concatenated with the latent vector to obtain a concatenated vector that meets the generator's input requirements.
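The final assembly of the generator input — a one-hot condition code concatenated with a white-noise latent vector — reduces to the following; the condition index, number of conditions, and latent dimension are illustrative values:

```python
import random

def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]

def build_generator_input(cond_index, n_conditions, latent_dim, rng):
    """Concatenate the one-hot condition with sampled white-noise latents."""
    cond = one_hot(cond_index, n_conditions)
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return cond + latent

vec = build_generator_input(cond_index=2, n_conditions=4, latent_dim=8,
                            rng=random.Random(0))
print(len(vec), vec[:4])
```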
S104. Input the concatenated vector into a pre-trained generator for desensitization, obtaining desensitized data.
The pre-trained generator is trained within a generative adversarial network, and the desensitized data is the data obtained after desensitizing the key information in the data to be desensitized.
In some embodiments, the concatenated vector corresponding to training data is obtained and input into a first generator for desensitization, yielding desensitized data; a preset discriminator is trained on this desensitized data and the training data to obtain a pre-trained discriminator; and, according to a preset learning rate and the parameters of the pre-trained discriminator, the parameters of the first generator are iteratively updated multiple times to obtain a second generator, which serves as the pre-trained generator. In this way, the first generator's parameters are iteratively updated using the pre-trained discriminator and the desensitized data, so highly realistic desensitized data can be generated. The discriminator is pre-trained first, and the generator trained afterwards, because only once a good discriminator exists, one that can reliably distinguish the data to be desensitized from the generated desensitized data, can the generator's parameters be updated accurately.

The training data is a data set to be desensitized that is used to train the generator's parameters; the first generator is a preset, untrained generator; and the second generator is produced from the first generator through multiple iterative updates, so the parameters of the first and second generators differ. The prior probability of a discrete variable can be obtained from its distributed representation, and parameters sampled from this prior serve as the parameters of the first generator. Specifically, the generator and discriminator can be trained with the stochastic-gradient Hamiltonian Monte Carlo method to obtain the pre-trained generator and the pre-trained discriminator.
Specifically, training the preset discriminator on the desensitized data and the training data to obtain the pre-trained discriminator proceeds by concatenating the conditional embedding vector with the desensitized data and with the training data respectively, obtaining first concatenated data and second concatenated data; computing the similarity between the first and second concatenated data; optimizing the loss function according to this similarity; and applying gradient clipping to the discriminator through this loss function to obtain the pre-trained discriminator.
For example, the discriminator parameters can be trained using the first generator and the preset discriminator parameters so that the desensitized data is judged fake as often as possible, thereby adjusting the discriminator's parameters and improving its ability to discriminate the data to be desensitized.
For example, the posterior probability of the second generator can be computed from the prior probabilities of the first generator's and the pre-trained discriminator's parameters, so that the desensitized data causes the discriminator to misjudge it as data to be desensitized as often as possible; adjusting the generator's parameters in this way enables realistic desensitized data to be generated.
In some embodiments, after the second generator is obtained, noise is added to it based on a loss function over statistical information, giving the pre-trained generator; the parameters of the first generator, the parameter-updated generator, and the pre-trained generator all differ. This controls both the generation quality and the degree of desensitization of the desensitized data.
The loss function based on statistical information may include a mean-based loss function, a variance-based loss function, and the like.
Specifically, Gaussian noise, an error conforming to a Gaussian normal distribution, can be added to the parameters of the second generator, in the same way that noise can be added when fitting a polynomial to a sinusoid. The specific magnitude of the Gaussian noise can be determined experimentally.
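Adding Gaussian noise to the second generator's parameters can be sketched in one loop. The noise scale `sigma=0.01` and the parameter values are hypothetical; as noted above, the actual magnitude would be chosen experimentally:

```python
import random

def add_gaussian_noise(params, sigma, rng):
    """Perturb each parameter with zero-mean Gaussian noise of std sigma."""
    return [p + rng.gauss(0.0, sigma) for p in params]

params = [0.5, -1.2, 0.03]
noisy = add_gaussian_noise(params, sigma=0.01, rng=random.Random(42))
print(noisy)
```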
For example, an error term can be introduced into the parameters of the second generator, correcting them to obtain the pre-trained generator. Because of the error term, the generated desensitized data differs somewhat from the original data, but not greatly; this prevents the desensitized data from differing so much from the original data that it loses its research value, while also ensuring the data cannot easily be reversed.
In some embodiments, after the desensitized data is obtained, its discrete variables are randomly sampled to obtain a target discrete variable; based on a logistic regression model, the target discrete variable is predicted from the remaining discrete variables of the desensitized data, yielding a prediction result; and the parameters of the pre-trained generator are adjusted based on this prediction result. Predicting the discrete variable in this way allows the generator's parameters to be tuned for a better desensitization effect, where "better" means the desensitized data cannot be reverse-engineered while still preserving its association with the original data.
The target discrete variable is randomly sampled from the multiple discrete variables of the desensitized data. To keep the desensitized data associated with the original data, the target discrete variable can generally be assumed not to change; as long as the desensitized data differs only slightly from the original data, it retains its research value, so the consistency of the target discrete variable must be guaranteed. The logistic regression model is used to predict discrete variables.
Specifically, a cross-entropy loss function can be used to judge whether the prediction of the target discrete variable agrees with the target discrete variable itself, thereby assessing the generation quality of the desensitized data. If the prediction agrees with the target discrete variable, the parameters of the pre-trained generator need no adjustment; if it disagrees, the difference between the prediction and the target discrete variable is determined, and the generator's parameters are adjusted according to that difference. This verifies the accuracy of the target discrete variable and prevents the generated desensitized data from diverging too far from the original data. Because most discrete variables of the desensitized data match those of the original data, removing one discrete variable allows it to be predicted accurately from the rest.
For example, if the target discrete variable of the desensitized data is a shoe size of 43, then based on the logistic regression model the target variable can be predicted from the remaining discrete variables of the desensitized data, such as height and weight, and the predicted shoe size is checked for agreement with the shoe size in the desensitized data. If the predicted shoe size is 40, the difference is determined to be 3 sizes, and the parameters of the pre-trained generator are iteratively updated according to this difference; if the predicted shoe size is 43, no adjustment of the pre-trained generator's parameters is needed.
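The consistency check in this example — predict the held-out target variable from the rest and adjust only on disagreement — can be sketched as below. The class probabilities here are a hypothetical stand-in for the trained logistic regression model's output:

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy of a predicted distribution against a one-hot target."""
    return -math.log(probs[target_index])

def consistency_gap(probs, sizes, target_size):
    """Return 0 if the argmax prediction matches the target discrete
    variable, otherwise the absolute difference between them."""
    predicted = sizes[max(range(len(probs)), key=probs.__getitem__)]
    return abs(predicted - target_size)

sizes = [40, 43, 45]      # candidate shoe sizes
probs = [0.7, 0.2, 0.1]   # hypothetical predictor output
print(consistency_gap(probs, sizes, target_size=43))       # → 3
print(round(cross_entropy(probs, target_index=1), 3))
```

A gap of 0 means no parameter adjustment is needed; a nonzero gap drives the iterative update of the generator's parameters.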
In some embodiments, the server may also send the terminal device a prompt notifying the user that the desensitized data has been generated.
The prompt may be delivered through an application (APP), email, SMS, or a chat tool such as WeChat or QQ.
For example, when the desensitized data has been generated, the server sends a prompt to the terminal device to remind the user.
Please refer to FIG. 3, which is a schematic block diagram of a data desensitization apparatus provided by an embodiment of the present application. The apparatus can be deployed in a server to perform the foregoing data desensitization method.
As shown in FIG. 3, the data desensitization apparatus 200 includes a key information extraction module 201, an information processing module 202, a vector concatenation module 203, and a data desensitization module 204.
The key information extraction module 201 is configured to acquire user data and, based on a pre-trained key information identification model, perform information identification on the user data to obtain key information.
The information processing module 202 is configured to preprocess the key information to obtain the discrete variables corresponding to the key information, the preprocessing including data discretization or data normalization.
The vector concatenation module 203 is configured to perform, based on a conditional loss function, conditional random sampling on the discrete variables to obtain a conditional embedding vector and a latent vector, and to concatenate the conditional embedding vector with the latent vector to obtain a concatenated vector.
数据脱敏模块204,用于将所述拼接向量输入到预训练好的生成器进行脱敏处理,得到脱敏数据;A data desensitization module 204, configured to input the splicing vector into a pre-trained generator for desensitization processing to obtain desensitized data;
The key information extraction module 201 is further configured to: perform word segmentation on the user data to obtain a plurality of word segments; perform feature extraction on each word segment to obtain an embedding feature of each word segment; perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
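The recognition pipeline this module implements — segmentation, embedding, word-sense prediction, filtering — can be sketched as follows. The whitespace tokenizer, the toy hash-style embedding, and the rule-based sense predictor (`SENSITIVE_SENSES`, `predict_sense`) are hypothetical stand-ins for the pre-trained key information recognition model, not the patented implementation:

```python
import re

# Hypothetical label set: which predicted senses count as "key information".
SENSITIVE_SENSES = {"NAME", "PHONE", "ID"}

def segment(text):
    # Whitespace segmentation stands in for a real tokenizer.
    return [t for t in re.split(r"\s+", text) if t]

def embed(token):
    # Toy fixed-size embedding derived from character codes.
    vec = [0.0] * 8
    for i, ch in enumerate(token):
        vec[i % 8] += ord(ch) / 1000.0
    return vec

def predict_sense(token, vec):
    # Rule-based sense prediction standing in for the trained classifier
    # (vec is unused in this toy rule set).
    if token.isdigit() and len(token) >= 7:
        return "PHONE"
    if token.istitle():
        return "NAME"
    return "OTHER"

def extract_key_info(text):
    tokens = segment(text)
    senses = [predict_sense(t, embed(t)) for t in tokens]
    return [t for t, s in zip(tokens, senses) if s in SENSITIVE_SENSES]

print(extract_key_info("Alice called 13800138000 about the invoice"))
```

In the real apparatus, `embed` and `predict_sense` would be the trained neural model; only the overall segment-embed-predict-filter flow is taken from the text.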
The information processing module 202 is further configured to: perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
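Two of the preprocessing options listed above — max-min normalization and K-bins discretization — can be sketched in plain NumPy. Equal-width bin edges are an assumption; the text does not fix the binning strategy:

```python
import numpy as np

def min_max_normalize(x):
    # Max-min normalization: map values into [0, 1].
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def k_bins_discretize(x, k=4):
    # Equal-width K-bins discretization: each value becomes a bin index 0..k-1.
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

ages = [18, 25, 33, 47, 62, 80]
print(min_max_normalize(ages))      # values in [0, 1]
print(k_bins_discretize(ages, 4))   # bin indices in {0, 1, 2, 3}
```

The Gaussian-mixture and regression-tree variants follow the same pattern: fit the model to the column, then replace each value by its component or leaf index.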
The vector concatenation module 203 is further configured to convert the condition embedding vector into a one-hot code, and to concatenate the one-hot code with the latent vector to obtain the concatenated vector.
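A minimal sketch of this sampling-and-splicing step: draw a condition, one-hot encode it, and concatenate it with a latent vector. The log-frequency weighting of categories and the standard-normal latent are common conditional-GAN choices assumed here, not details fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_condition(counts, rng):
    # Conditional random sampling: weight categories by log-frequency so
    # rare categories are still drawn (an assumed, CTGAN-style choice).
    p = np.log(np.asarray(counts, dtype=float) + 1.0)
    p /= p.sum()
    return rng.choice(len(counts), p=p)

def one_hot(index, num_classes):
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

def build_generator_input(counts, latent_dim, rng):
    cond = sample_condition(counts, rng)
    z = rng.standard_normal(latent_dim)       # latent vector
    # Splicing: one-hot condition first, then the latent vector.
    return np.concatenate([one_hot(cond, len(counts)), z])

x = build_generator_input(counts=[900, 80, 20], latent_dim=8, rng=rng)
print(x.shape)  # one-hot part (3 entries) + latent part (8 entries)
```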
The generator training module 205 is configured to: acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data; train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
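The alternating scheme — a first generator, a discriminator trained to separate real from generated data, then repeated generator updates at a preset learning rate yielding a second generator — can be illustrated with a deliberately tiny 1-D GAN. The linear generator, logistic discriminator, cross-entropy losses, and learning rate of 0.05 are illustrative assumptions, not the networks described in the text:

```python
import numpy as np

rng = np.random.default_rng(42)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))

# Generator g(z) = a*z + b and discriminator d(x) = sigmoid(w*x + c):
# deliberately tiny models so the alternating update scheme stays visible.
a, b = 0.5, 0.0          # "first generator" parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.05                # preset learning rate

real = rng.normal(3.0, 1.0, size=2000)   # stands in for training data

for step in range(500):
    x_r = rng.choice(real, 32)
    z = rng.standard_normal(32)
    x_f = a * z + b
    # --- discriminator update: tell real data from generated data ---
    d_r, d_f = sig(w * x_r + c), sig(w * x_f + c)
    gw = -np.mean((1 - d_r) * x_r) + np.mean(d_f * x_f)
    gc = -np.mean(1 - d_r) + np.mean(d_f)
    w, c = w - lr * gw, c - lr * gc
    # --- generator update using the (partially trained) discriminator ---
    z = rng.standard_normal(32)
    x_f = a * z + b
    d_f = sig(w * x_f + c)
    ga = -np.mean((1 - d_f) * w * z)
    gb = -np.mean((1 - d_f) * w)
    a, b = a - lr * ga, b - lr * gb

print(round(b, 2))  # the updated generator should shift toward the data mean
```

Here `a, b` before the loop play the role of the first generator, and their updated values that of the second generator; in practice both models would be deep networks updated by an optimizer.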
The generator training module 205 is further configured to add noise to the second generator based on a loss function over statistical information to obtain the pre-trained generator, where the first generator, the second generator, and the pre-trained generator have different parameters.
The generator training module 205 is further configured to: randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
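The leakage check described here — pick one discrete column of the desensitized output at random and try to predict it from the remaining columns with logistic regression — could look like this on synthetic data. The gradient-descent fit, the 0.5 decision threshold, and the accuracy-near-chance criterion are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical desensitized table with 3 discrete columns; the randomly
# sampled "target" column is independent of the others, so prediction
# accuracy should stay near chance (low information leakage).
n = 1500
others = rng.integers(0, 3, size=(n, 2)).astype(float)
target = rng.integers(0, 2, size=n)

# Plain logistic regression fitted by gradient descent.
X = np.hstack([others, np.ones((n, 1))])        # bias column appended
wgt = np.zeros(3)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-X @ wgt))
    wgt -= 0.1 * X.T @ (p - target) / n

pred = (1.0 / (1.0 + np.exp(-X @ wgt))) > 0.5
acc = np.mean(pred == target)
print(round(float(acc), 2))  # near 0.5: the other columns do not reveal the target
```

If such a regression did predict the target column well, that would signal residual information leakage, and the generator's parameters would be adjusted before release.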
It should be noted that, for convenience and brevity of description, those skilled in the art may refer to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatus, modules, and units described above, which are not repeated here.
The method and apparatus of this application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer terminal devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Exemplarily, the above method and apparatus may be implemented in the form of a computer program that runs on a computer device as shown in FIG. 4.
Referring to FIG. 4, FIG. 4 is a schematic diagram of a computer device provided by an embodiment of this application. The computer device may be a server.
As shown in FIG. 4, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the data desensitization methods.
The processor provides computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any of the data desensitization methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the illustrated structure is only a block diagram of the parts relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In some implementations, the processor is configured to run a computer program stored in the memory to implement the following steps: acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information; preprocessing the key information to obtain discrete variables corresponding to the key information, where the preprocessing includes data discretization or data normalization; performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
In some embodiments, the processor is further configured to: perform word segmentation on the user data to obtain a plurality of word segments; perform feature extraction on each word segment to obtain an embedding feature of each word segment; perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
In some embodiments, the processor is further configured to: perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
In some embodiments, the processor is further configured to: convert the condition embedding vector into a one-hot code; and concatenate the one-hot code with the latent vector to obtain the concatenated vector.
In some embodiments, the processor is further configured to: acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data; train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
In some embodiments, the processor is further configured to: add noise to the second generator based on a loss function over statistical information to obtain the pre-trained generator, where the first generator, the second generator, and the pre-trained generator have different parameters.
In some embodiments, the processor is further configured to: randomly sample the discrete variables of the desensitized data to obtain a target discrete variable; predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
An embodiment of this application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that includes program instructions; when the program instructions are executed, any of the data desensitization methods provided by the embodiments of this application is implemented.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
Further, the computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (20)
- A data desensitization method, wherein the method comprises:
acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and
inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- The method according to claim 1, wherein performing information recognition on the user data based on the pre-trained key information recognition model to obtain key information comprises:
performing word segmentation on the user data to obtain a plurality of word segments;
performing feature extraction on each word segment to obtain an embedding feature of each word segment;
performing word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and
filtering the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
- The method according to claim 1, wherein preprocessing the key information to obtain the discrete variables corresponding to the key information comprises:
performing max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or
normalizing the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or
performing K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or
performing regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
- The method according to claim 1, wherein concatenating the condition embedding vector with the latent vector to obtain the concatenated vector comprises:
converting the condition embedding vector into a one-hot code; and
concatenating the one-hot code with the latent vector to obtain the concatenated vector.
- The method according to claim 1, wherein the method further comprises:
acquiring a concatenated vector corresponding to training data, and inputting the concatenated vector into a first generator for desensitization to obtain desensitized data;
training a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and
iteratively updating the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, and using the second generator as the pre-trained generator.
- The method according to claim 5, wherein after obtaining the second generator, the method further comprises:
adding noise to the second generator based on a loss function over statistical information to obtain the pre-trained generator, wherein the first generator, the second generator, and the pre-trained generator have different parameters.
- The method according to claim 1, wherein after obtaining the desensitized data, the method further comprises:
randomly sampling the discrete variables of the desensitized data to obtain a target discrete variable;
predicting the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and
adjusting the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- A data desensitization apparatus, wherein the apparatus comprises:
a key information extraction module, configured to acquire user data and perform information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
an information processing module, configured to preprocess the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
a vector concatenation module, configured to perform conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and to concatenate the condition embedding vector with the latent vector to obtain a concatenated vector; and
a data desensitization module, configured to input the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- A computer device, wherein the computer device comprises a memory and a processor;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and
inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- The computer device according to claim 9, wherein, in performing information recognition on the user data based on the pre-trained key information recognition model to obtain key information, the processor is configured to:
perform word segmentation on the user data to obtain a plurality of word segments;
perform feature extraction on each word segment to obtain an embedding feature of each word segment;
perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and
filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
- The computer device according to claim 9, wherein, in preprocessing the key information to obtain the discrete variables corresponding to the key information, the processor is configured to:
perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or
normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or
perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or
perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
- The computer device according to claim 9, wherein, in concatenating the condition embedding vector with the latent vector to obtain the concatenated vector, the processor is configured to:
convert the condition embedding vector into a one-hot code; and
concatenate the one-hot code with the latent vector to obtain the concatenated vector.
- The computer device according to claim 9, wherein the processor is further configured to:
acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data;
train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and
iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
- The computer device according to claim 9, wherein, after obtaining the desensitized data, the processor is further configured to:
randomly sample the discrete variables of the desensitized data to obtain a target discrete variable;
predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and
adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
- A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to implement the following steps:
acquiring user data, and performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information;
preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization or data normalization;
performing conditional random sampling on the discrete variables based on a conditional loss function to obtain a condition embedding vector and a latent vector, and concatenating the condition embedding vector with the latent vector to obtain a concatenated vector; and
inputting the concatenated vector into a pre-trained generator for desensitization to obtain desensitized data.
- The computer-readable storage medium according to claim 15, wherein, in performing information recognition on the user data based on the pre-trained key information recognition model to obtain key information, the processor is configured to:
perform word segmentation on the user data to obtain a plurality of word segments;
perform feature extraction on each word segment to obtain an embedding feature of each word segment;
perform word sense prediction according to the embedding feature of each word segment to obtain the word sense corresponding to each word segment; and
filter the plurality of word segments according to the word sense corresponding to each word segment to obtain the key information.
- The computer-readable storage medium according to claim 15, wherein, in preprocessing the key information to obtain the discrete variables corresponding to the key information, the processor is configured to:
perform max-min normalization on the key information to obtain the discrete variables corresponding to the key information; or
normalize the key information with a Gaussian mixture model to obtain the discrete variables corresponding to the key information; or
perform K-bins discretization on the key information to obtain the discrete variables corresponding to the key information; or
perform regression tree discretization on the key information to obtain the discrete variables corresponding to the key information.
- The computer-readable storage medium according to claim 15, wherein, in concatenating the condition embedding vector with the latent vector to obtain the concatenated vector, the processor is configured to:
convert the condition embedding vector into a one-hot code; and
concatenate the one-hot code with the latent vector to obtain the concatenated vector.
- The computer-readable storage medium according to claim 15, wherein the processor is further configured to:
acquire a concatenated vector corresponding to training data, and input the concatenated vector into a first generator for desensitization to obtain desensitized data;
train a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and
iteratively update the parameters of the first generator multiple times according to a preset learning rate and the parameters of the pre-trained discriminator to obtain a second generator, the second generator serving as the pre-trained generator.
- The computer-readable storage medium according to claim 15, wherein, after obtaining the desensitized data, the processor is further configured to:
randomly sample the discrete variables of the desensitized data to obtain a target discrete variable;
predict the target discrete variable from the remaining discrete variables of the desensitized data based on a logistic regression model to obtain a prediction result of the target discrete variable; and
adjust the parameters of the pre-trained generator based on the prediction result of the target discrete variable.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111229481.X | 2021-10-21 | ||
CN202111229481.XA CN113886885A (en) | 2021-10-21 | 2021-10-21 | Data desensitization method, data desensitization device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023065632A1 (en) | 2023-04-27 |
Family
ID=79004109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/089872 WO2023065632A1 (en) | 2021-10-21 | 2022-04-28 | Data desensitization method, data desensitization apparatus, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113886885A (en) |
WO (1) | WO2023065632A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113886885A (en) * | 2021-10-21 | 2022-01-04 | 平安科技(深圳)有限公司 | Data desensitization method, data desensitization device, equipment and storage medium |
CN115514564B (en) * | 2022-09-22 | 2023-06-16 | 成都坐联智城科技有限公司 | Data security processing method and system based on data sharing |
CN116361858B (en) * | 2023-04-10 | 2024-01-26 | 北京无限自在文化传媒股份有限公司 | User session resource data protection method and software product applying AI decision |
CN116629984B (en) * | 2023-07-24 | 2024-02-06 | 中信证券股份有限公司 | Product information recommendation method, device, equipment and medium based on embedded model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188202A (en) * | 2019-06-06 | 2019-08-30 | 北京百度网讯科技有限公司 | Training method, device and the terminal of semantic relation identification model |
CN111143884A (en) * | 2019-12-31 | 2020-05-12 | 北京懿医云科技有限公司 | Data desensitization method and device, electronic equipment and storage medium |
WO2021027533A1 (en) * | 2019-08-13 | 2021-02-18 | 平安国际智慧城市科技股份有限公司 | Text semantic recognition method and apparatus, computer device, and storage medium |
CN113254649A (en) * | 2021-06-22 | 2021-08-13 | 中国平安人寿保险股份有限公司 | Sensitive content recognition model training method, text recognition method and related device |
CN113886885A (en) * | 2021-10-21 | 2022-01-04 | 平安科技(深圳)有限公司 | Data desensitization method, data desensitization device, equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368752B (en) * | 2017-07-25 | 2019-06-28 | 北京工商大学 | A kind of depth difference method for secret protection based on production confrontation network |
CN110263152B (en) * | 2019-05-07 | 2024-04-09 | 平安科技(深圳)有限公司 | Text classification method, system and computer equipment based on neural network |
CN110135193A (en) * | 2019-05-15 | 2019-08-16 | 广东工业大学 | A kind of data desensitization method, device, equipment and computer readable storage medium |
CN110807207B (en) * | 2019-10-30 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111768325B (en) * | 2020-04-03 | 2023-07-25 | 南京信息工程大学 | Security improvement method based on generation of countermeasure sample in big data privacy protection |
CN111563275B (en) * | 2020-07-14 | 2020-10-20 | 中国人民解放军国防科技大学 | Data desensitization method based on generation countermeasure network |
CN113297573B (en) * | 2021-06-11 | 2022-06-10 | 浙江工业大学 | Vertical federal learning defense method and device based on GAN simulation data generation |
- 2021-10-21: CN application CN202111229481.XA, publication CN113886885A (en), status: active, Pending
- 2022-04-28: WO application PCT/CN2022/089872, publication WO2023065632A1 (en), status: active, Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116757748A (en) * | 2023-08-14 | 2023-09-15 | 广州钛动科技股份有限公司 | Advertisement click prediction method based on random gradient attack |
CN116757748B (en) * | 2023-08-14 | 2023-12-19 | 广州钛动科技股份有限公司 | Advertisement click prediction method based on random gradient attack |
CN117290888A (en) * | 2023-11-23 | 2023-12-26 | 江苏风云科技服务有限公司 | Information desensitization method for big data, storage medium and server |
CN117290888B (en) * | 2023-11-23 | 2024-02-09 | 江苏风云科技服务有限公司 | Information desensitization method for big data, storage medium and server |
CN117932676A (en) * | 2024-01-26 | 2024-04-26 | 湖北消费金融股份有限公司 | Data desensitization method and system based on network interface access control |
CN117744127A (en) * | 2024-02-20 | 2024-03-22 | 北京佳芯信息科技有限公司 | Data encryption authentication method and system based on data information protection |
CN117744127B (en) * | 2024-02-20 | 2024-05-07 | 北京佳芯信息科技有限公司 | Data encryption authentication method and system based on data information protection |
CN117912624A (en) * | 2024-03-15 | 2024-04-19 | 江西曼荼罗软件有限公司 | Electronic medical record sharing method and system |
CN118278051A (en) * | 2024-06-03 | 2024-07-02 | 广州青莲网络科技有限公司 | Data desensitization method and system based on artificial intelligence |
CN118748614A (en) * | 2024-07-12 | 2024-10-08 | 青岛海高设计制造有限公司 | Data processing method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113886885A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023065632A1 (en) | Data desensitization method, data desensitization apparatus, device, and storage medium | |
US11475143B2 (en) | Sensitive data classification | |
US10430610B2 (en) | Adaptive data obfuscation | |
CN113726784B (en) | Network data security monitoring method, device, equipment and storage medium | |
CN106874253A (en) | Recognize the method and device of sensitive information | |
CN111033506A (en) | Edit script verification with match and difference operations | |
NL2029110B1 (en) | Method and system for static analysis of executable files | |
WO2021189975A1 (en) | Machine behavior recognition method and apparatus, and device and computer-readable storage medium | |
WO2021151358A1 (en) | Triage information recommendation method and apparatus based on interpretation model, and device and medium | |
US12111933B2 (en) | System and method for dynamically updating existing threat models based on newly identified active threats | |
WO2022252638A1 (en) | Text matching method and apparatus, computer device and readable storage medium | |
US20240061952A1 (en) | Identifying sensitive data using redacted data | |
US11972023B2 (en) | Compatible anonymization of data sets of different sources | |
CN116821299A (en) | Intelligent question-answering method, intelligent question-answering device, equipment and storage medium | |
Tayyab et al. | Cryptographic based secure model on dataset for deep learning algorithms | |
Abaimov et al. | A survey on the application of deep learning for code injection detection | |
CN117609379A (en) | Model training method, system, equipment and medium based on vertical application of blockchain database | |
CN117313159A (en) | Data processing method, device, equipment and storage medium | |
CN116579798A (en) | User portrait construction method, device, equipment and medium based on data enhancement | |
US20200302017A1 (en) | Chat analysis using machine learning | |
CN117009832A (en) | Abnormal command detection method and device, electronic equipment and storage medium | |
US12105776B2 (en) | Dynamic feature names | |
CN113901821A (en) | Entity naming identification method, device, equipment and storage medium | |
CN113326699A (en) | Data detection method, electronic device and storage medium | |
CN116956356B (en) | Information transmission method and equipment based on data desensitization processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22882257; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22882257; Country of ref document: EP; Kind code of ref document: A1 |