
CN112086144A - Molecule generation method, molecule generation device, electronic device, and storage medium - Google Patents

Molecule generation method, molecule generation device, electronic device, and storage medium Download PDF

Info

Publication number
CN112086144A
CN112086144A · CN112086144B (application CN202010884581.5A)
Authority
CN
China
Prior art keywords
data
molecular
moment
molecule
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010884581.5A
Other languages
Chinese (zh)
Other versions
CN112086144B (en)
Inventor
郑奕嘉
吴红艳
蔡云鹏
纪超杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010884581.5A priority Critical patent/CN112086144B/en
Publication of CN112086144A publication Critical patent/CN112086144A/en
Application granted granted Critical
Publication of CN112086144B publication Critical patent/CN112086144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/50 Molecular design, e.g. of drugs
    • G16C 20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Physics & Mathematics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application belongs to the technical field of computers and provides a molecule generation method, a molecule generation device, an electronic device, and a storage medium. The molecule generation method comprises: acquiring source molecular data; and, at each of d moments, inputting the source molecular data and a sampling vector into a preset molecule generation model and outputting first molecular data corresponding to the source molecular data, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers. Because each sampling vector is conditioned on the previously generated molecule, a better sampling vector can be generated at each step, and generating the first molecular data from that better sampling vector yields molecules with better performance.

Description

Molecule generation method, molecule generation device, electronic device, and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating molecules, an electronic device, and a storage medium.
Background
Computer-aided drug molecule design is an emerging interdisciplinary field that combines computer science, artificial intelligence, pharmacy, and biology; computer-based molecular structure generation is an important research direction within it. Existing molecule generation methods generally first train a molecule generation model and then input source molecule data into the trained model to generate new molecules. To improve the diversity of the generated molecules, a sampling vector is usually input together with the source molecule data. This sampling vector is typically obtained by repeatedly sampling from a standard Gaussian distribution, so it is highly random; an optimal sampling vector therefore cannot be obtained, and molecules generated from such highly random sampling vectors often fail to meet performance requirements.
Disclosure of Invention
In view of this, embodiments of the present application provide a molecule generating method, apparatus, electronic device and storage medium, which can improve the performance of generated molecules.
A first aspect of an embodiment of the present application provides a method for generating molecules, including:
acquiring source molecular data;
and, at each of d moments, inputting the source molecular data and a sampling vector into a preset molecule generation model and outputting first molecular data corresponding to the source molecular data, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers.
In a possible implementation manner of the first aspect, the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment as follows:
inputting the first molecular data output at the (r-1)-th moment into a preset RNN model to obtain an observation state corresponding to the r-th moment;
and inputting the observation state corresponding to the r-th moment into a preset agent to obtain the sampling vector that is input into the molecule generation model at the r-th moment.
In a possible implementation manner of the first aspect, the sampling vector input at the 1 st time is determined according to the source molecular data, where the determination method of the sampling vector input at the 1 st time is:
inputting the source molecular data into the RNN model to obtain an observation state corresponding to the 1 st moment;
and inputting the observation state corresponding to the 1 st moment into the intelligent agent to obtain the sampling vector which is input into the molecular generation model at the 1 st moment.
In a possible implementation manner of the first aspect, before the inputting the observation state corresponding to the r-1 th time into a preset agent, the method further includes:
and training the intelligent agent according to the source molecular data and the molecular generation model.
In a possible implementation manner of the first aspect, the training the agent according to the source molecule data and the molecule generation model includes:
determining a hidden vector at each moment in d moments according to the input molecular data and the initial agent;
inputting the source molecular data and the hidden vector into the molecular generation model respectively, and outputting second molecular data corresponding to the source molecular data, wherein the input molecular data at the r-th moment is determined according to the second molecular data output at the r-1-th moment, and the input molecular data at the 1 st moment is the source molecular data;
and optimizing the initial agent according to the second molecular data to obtain the agent into which the observation state is input.
In a possible implementation manner of the first aspect, the optimizing the initial agent according to the second molecular data includes:
determining a reward value according to the attribute value, the similarity, and the difference of the second molecular data;
optimizing the initial agent according to the reward value.
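The reward combination above can be sketched as follows; the weighting scheme and the specific functional form are illustrative assumptions, since the patent only states that the reward depends on the attribute value, the similarity, and the difference of the second molecular data.

```python
# Hedged sketch of the reward used to optimize the agent. The weights and
# the linear combination are illustrative, not specified by the patent.

def reward(attribute_value, similarity, difference,
           w_attr=1.0, w_sim=0.5, w_diff=0.5):
    """Combine the three signals of the generated (second) molecular data
    into a scalar reward; here all three are rewarded positively."""
    return w_attr * attribute_value + w_sim * similarity + w_diff * difference

r = reward(attribute_value=0.8, similarity=0.6, difference=0.3)
```

Any monotone combination of the three signals would fit the description; the linear form above is only the simplest choice.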
In a possible implementation manner of the first aspect, before the inputting the source molecular data and the sampling vector into a preset molecular generation model, the method further includes:
acquiring target molecule data corresponding to the source molecule data, wherein the source molecule data and the target molecule data form a molecule pair data set;
and training the molecule generation model according to the molecule pair data set and a preset random vector.
A second aspect of an embodiment of the present application provides a molecule generating apparatus, including:
the acquisition module is used for acquiring source molecular data;
and the computing module is used for, at each of d moments, inputting the source molecular data and the sampling vector into a preset molecule generation model and outputting first molecular data corresponding to the source molecular data, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers.
In a possible implementation manner of the second aspect, the calculation module is further configured to:
inputting the first molecular data output at the (r-1)-th moment into a preset RNN model to obtain an observation state corresponding to the r-th moment;
and inputting the observation state corresponding to the r-th moment into a preset agent to obtain the sampling vector that is input into the molecule generation model at the r-th moment.
In a possible implementation manner of the second aspect, the sampling vector input at the 1 st time is determined according to the source molecular data, and the calculation module is further configured to:
inputting the source molecular data into the RNN model to obtain an observation state corresponding to the 1 st moment;
and inputting the observation state corresponding to the 1 st moment into the intelligent agent to obtain the sampling vector which is input into the molecular generation model at the 1 st moment.
In a possible implementation manner of the second aspect, the molecule generating apparatus further includes a training module, and the training module is configured to:
and training the intelligent agent according to the source molecular data and the molecular generation model.
In a possible implementation manner of the second aspect, the training module is specifically configured to:
determining a hidden vector at each moment in d moments according to the input molecular data and the initial agent;
inputting the source molecular data and the hidden vector into the molecular generation model respectively, and outputting second molecular data corresponding to the source molecular data, wherein the input molecular data at the r-th moment is determined according to the second molecular data output at the r-1-th moment, and the input molecular data at the 1 st moment is the source molecular data;
and optimizing the initial agent according to the second molecular data to obtain the agent into which the observation state is input.
In a possible implementation manner of the second aspect, the training module is further specifically configured to:
determining a reward value according to the attribute value, the similarity, and the difference of the second molecular data;
optimizing the initial agent according to the reward value.
In a possible implementation manner of the second aspect, the training module is further configured to:
acquiring target molecule data corresponding to the source molecule data, wherein the source molecule data and the target molecule data form a molecule pair data set;
and training the molecule generation model according to the molecule pair data set and a preset random vector.
A third aspect of embodiments of the present application provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, implements the method according to the first aspect as described above.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on an electronic device, causes the electronic device to perform the method of the first aspect.
Compared with the prior art, the embodiments of the application have the following advantage: source molecular data are obtained, and at each of d moments the source molecular data and a sampling vector are input into a preset molecule generation model, which outputs first molecular data corresponding to the source molecular data; the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, with d ≥ r > 1. Compared with inputting a random vector into the molecule generation model, determining the sampling vector at the r-th moment from the first molecular data output at the (r-1)-th moment produces a better sampling vector; generating the first molecular data from that better sampling vector then yields molecules with better performance.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below.
FIG. 1 is a schematic flow chart of an implementation of a molecule generation method provided in an embodiment of the present application;
FIG. 2 is a timing diagram of a molecule generation method provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating a method for training an agent according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training process of an agent provided by an embodiment of the present application;
FIG. 5 is a timing diagram illustrating training of agents provided by embodiments of the present application;
FIG. 6 is a schematic view of a molecule generating apparatus provided in an embodiment of the present application;
fig. 7 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In existing molecule generation methods, a source molecule and a sampling vector are generally input into a molecule generation model to generate a new molecule. The sampling vector is usually obtained by repeatedly sampling from a standard Gaussian distribution, so it is highly random; an optimal sampling vector therefore cannot be obtained, and molecules generated from such highly random sampling vectors often fail to meet performance requirements. The present application therefore provides a molecule generation method in which the sampling vector input into the molecule generation model at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment; that is, the sampling vector at the current moment is determined by the new molecule output at the previous moment. The sampling vector at the current moment is thus associated with the characteristics of the previously output molecule, so a better sampling vector can be generated; generating the first molecular data from that better sampling vector yields molecules with better performance.
The molecular generation methods provided herein are described below in conjunction with specific embodiments for illustrative purposes.
The molecule generation method provided by the embodiment of the application is applied to electronic equipment, and the electronic equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer and a cloud server.
Referring to fig. 1, a method for generating a molecule according to an embodiment of the present application includes:
s101: source molecular data is acquired.
Wherein the source molecular data is existing molecular data downloaded from a database of molecular compounds.
S102: at each of d moments, input the source molecular data and a sampling vector into a preset molecule generation model and output first molecular data corresponding to the source molecular data, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers.
Specifically, a sampling vector for the 1st moment is obtained first; it may be determined from the source molecule data or may be a random vector. The source molecular data and the 1st-moment sampling vector are input into the molecule generation model to obtain the first molecular data corresponding to the 1st moment. The sampling vector for the 2nd moment is then determined from the first molecular data of the 1st moment, the source molecular data and the 2nd-moment sampling vector are input into the molecule generation model to obtain the first molecular data corresponding to the 2nd moment, and so on, until the first molecular data corresponding to the preset moment are output.
In the above embodiment, compared with inputting a random vector into the molecule generation model, determining the sampling vector at the r-th moment according to the first molecular data output at the (r-1)-th moment produces a better sampling vector, and generating the first molecular data from that better sampling vector yields molecules with better performance.
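The iterative loop described in S102 can be sketched as runnable code. All components (`encode_state`, `agent_policy`, `generate_molecule`) are hypothetical stand-ins for the trained RNN, agent, and molecule generation model; only the feedback structure of the loop reflects the method.

```python
# Minimal sketch of the d-step generation loop: step r's sampling vector is
# derived from the molecule produced at step r-1 (step 1 uses the source).

def encode_state(molecular_data):
    # Stand-in for the RNN: map molecule data to an observation state.
    return [float(len(molecular_data))]

def agent_policy(state):
    # Stand-in for the agent: map an observation state to a sampling vector.
    return [s * 0.1 for s in state]

def generate_molecule(source, z):
    # Stand-in for the molecule generation model.
    return source + "-gen" + str(round(z[0], 2))

def generate_over_d_steps(source, d):
    outputs = []
    previous = source
    for _ in range(d):
        state = encode_state(previous)            # observation state s_r
        z = agent_policy(state)                   # sampling vector z_r
        first_molecular_data = generate_molecule(source, z)
        outputs.append(first_molecular_data)
        previous = first_molecular_data           # feeds step r+1
    return outputs

results = generate_over_d_steps("CCO", d=3)
```

Note that the source molecule is fed to the model at every step, while only the sampling vector carries information from the previous output, matching the description above.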
In one possible implementation, the sampling vector at each moment is determined by a recurrent neural network (RNN) and an agent, and the source molecule data and the sampling vector are then input into the molecule generation model. Specifically, as shown in fig. 2, the source molecule data X are input into the RNN model to obtain the observation state s_1 corresponding to the 1st moment; s_1 is input into the agent to obtain the sampling vector z_1 that is input into the molecule generation model at the 1st moment; and the source molecular data and z_1 are input into the molecule generation model to obtain the first molecular data Y_1 of the 1st moment. Y_1 is then input into the RNN model to obtain the observation state s_2 corresponding to the 2nd moment; s_2 is input into the agent to obtain the sampling vector z_2; and the source molecular data and z_2 are input into the molecule generation model to obtain the first molecular data Y_2 of the 2nd moment. By analogy, the first molecular data Y_{r-1} output at the (r-1)-th moment are input into the RNN model to obtain the observation state s_r corresponding to the r-th moment; s_r is input into the preset agent to obtain the sampling vector z_r; and the source molecular data and z_r are input into the molecule generation model to obtain the first molecular data Y_r of the r-th moment.
The RNN and the agent are obtained by training with a machine learning algorithm. Because the sampling vector input into the molecule generation model at the r-th moment is determined by the RNN, the agent, and the first molecular data output at the (r-1)-th moment, a better sampling vector can be generated.
In one possible implementation, the agent is obtained by training an initial agent according to source molecule data and a molecule generation model, where the molecule generation model may be trained in advance, or may be obtained by training a molecule pair data set composed of the source molecule data and target molecule data.
The following describes a training process of an agent provided in the embodiment of the present application, by taking an example of training a molecule generation model first and then training the agent.
As shown in fig. 3, the training process of the agent provided in the embodiment of the present application includes:
1. Acquire data. Specifically, molecular data are downloaded from a molecular compound database (e.g., the ZINC database), and the downloaded molecular data are represented as simplified molecular-input line-entry system (SMILES) strings.
2. Construct a molecule pair dataset. Specifically, as shown in fig. 4, the downloaded molecular data are paired and screened according to a preset similarity threshold and a preset attribute threshold to form a number of molecule pair datasets. Each molecule pair comprises two molecules, source molecular data X and target molecular data Y, such that the similarity of the source and target molecular data is greater than the preset similarity threshold, the attribute value of the source molecular data is less than the preset attribute threshold, and the attribute value of the target molecular data is greater than the preset attribute threshold. The Tanimoto distance between the chemical fingerprint (e.g., Morgan fingerprint) vectors of the source and target molecular data can be calculated with the RDKit toolkit, and the similarity between the two molecules is then determined from the Tanimoto distance. The attribute calculation tools in the RDKit toolkit may be used to compute the attribute values of the source and target molecular data; for example, the logarithm of the octanol-water partition coefficient (logP) may be calculated as the attribute value.
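The pair-screening rule of step 2 can be illustrated with a toy, pure-Python sketch. Fingerprints are plain bit sets and the similarity and attribute thresholds are arbitrary; a real pipeline would use RDKit chemical fingerprints and logP as described above, which this sketch deliberately does not depend on.

```python
# Toy pair screening: keep (X, Y) when similarity(X, Y) > sim_threshold,
# property(X) < prop_threshold, and property(Y) > prop_threshold.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprint bit sets: |A∩B| / |A∪B|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def screen_pairs(molecules, sim_threshold=0.4, prop_threshold=0.5):
    """molecules: list of (name, fingerprint_set, property_value)."""
    pairs = []
    for name_x, fp_x, prop_x in molecules:
        for name_y, fp_y, prop_y in molecules:
            if name_x == name_y:
                continue
            if (tanimoto(fp_x, fp_y) > sim_threshold
                    and prop_x < prop_threshold < prop_y):
                pairs.append((name_x, name_y))  # (source, target)
    return pairs

mols = [
    ("A", {1, 2, 3, 4}, 0.2),   # low property value: candidate source
    ("B", {2, 3, 4, 5}, 0.9),   # high property value: candidate target
    ("C", {7, 8}, 0.9),         # dissimilar to A, so never paired with it
]
pairs = screen_pairs(mols)
```

Here A and B share 3 of 5 fingerprint bits (Tanimoto 0.6), so only the pair (A, B) survives the screen.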
3. Construct a molecular graph structure representation. Specifically, the SMILES strings corresponding to the source and target molecular data are parsed with the RDKit toolkit to obtain the atom data and chemical bond data of each molecule; the atoms are used as nodes and the chemical bonds as edges connecting two nodes, and a molecular graph structure representation of each molecule is obtained from these nodes and edges. In the molecular graph representation, the features of a node can be represented by the one-hot encoded vector of its atom type, and the features of an edge by the one-hot encoded vector of its bond type. From the node features and edge features, a set corresponding to the molecular graph structure is obtained:

G = (V, E)

where V denotes the set of nodes and E the set of edges; each node in the node set is denoted x_i and each edge in the edge set x_{(i,j)}, where i and j are node identifiers and (i, j) identifies the edge connecting node i and node j.
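The graph construction in step 3 can be sketched as follows, assuming small illustrative atom and bond vocabularies; a real implementation would derive atoms and bonds from RDKit parsing of the SMILES strings rather than take them as input.

```python
# Sketch of step 3: build G = (V, E) with one-hot node features (atom type)
# and one-hot edge features (bond type). Vocabularies are illustrative.

ATOM_TYPES = ["C", "N", "O"]
BOND_TYPES = ["single", "double"]

def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

def build_molecular_graph(atoms, bonds):
    """atoms: list of atom symbols; bonds: list of (i, j, bond_type).
    Returns node features x_i and edge features x_(i,j)."""
    node_features = {i: one_hot(a, ATOM_TYPES) for i, a in enumerate(atoms)}
    edge_features = {(i, j): one_hot(t, BOND_TYPES) for i, j, t in bonds}
    return node_features, edge_features

# Ethanol-like toy molecule: C-C-O with single bonds.
nodes, edges = build_molecular_graph(
    ["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")])
```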
4. Pre-train the molecule generation model. Specifically, the molecule generation model comprises an encoder and a decoder: the encoder obtains a hidden-layer representation vector from the molecular data, and the decoder determines the molecular graph structure representation of a new molecule from that hidden-layer representation vector.
Specifically, the constructed molecular graph structure of the source molecular data is input into the encoder. For each edge in the molecular graph representation, the encoder first represents the edge as two directed edges (i → j) and (j → i) according to its two endpoint nodes, and then updates the hidden-layer representation vector of each directed edge with a graph neural network computation. Specifically, the hidden-layer representation vector of each directed edge is updated as

ν_{i→j}^{(t)} = f_1(x_i, x_{(i,j)}, Σ_{k ∈ N(i)\{j}} ν_{k→i}^{(t-1)})

where ν_{i→j}^{(t)} is the hidden-layer representation vector of the directed edge (i → j) after the t-th update, ν_{k→i}^{(t-1)} is the hidden-layer representation vector of the directed edge (k → i) after the (t-1)-th update, N(i) denotes the neighbors of node i, f_1(·) denotes a first multi-layer perceptron network, and t ∈ [1, T].
After T rounds of updates of the directed-edge hidden-layer representation vectors, the encoder updates the hidden-layer representation vector of each node as

h_i = f_2(x_i, Σ_{k ∈ N(i)} ν_{k→i}^{(T)})

where h_i is the hidden-layer representation vector of node i, ν_{k→i}^{(T)} is the hidden-layer representation vector of the directed edge (k → i) after the T-th round of updates, and f_2(·) denotes a second multi-layer perceptron network. This yields the set of hidden-layer representation vectors of all nodes, H = (h_1, h_2, ..., h_n), i.e. the representation vector of the source molecule, where n is the number of nodes.
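The encoder's message passing can be sketched as runnable code. Here f_1 and f_2, which are multi-layer perceptron networks in the patent, are replaced by plain sums, so only the update structure over directed edges, not the learned functions, is illustrated; features are scalars for brevity.

```python
# Hedged sketch of the directed-edge message passing described above.
# f_1 and f_2 are stand-in sums, not the patent's learned MLPs.

def encode_graph(node_x, edge_x, neighbors, T=2):
    """node_x: {i: float}, edge_x: {(i, j): float} over directed edges,
    neighbors: {i: set of adjacent nodes}. Returns hidden values h_i."""
    nu = {e: 0.0 for e in edge_x}          # nu[(i, j)]: edge i -> j hidden
    for _ in range(T):
        new_nu = {}
        for (i, j) in edge_x:
            # Aggregate incoming edges to i, excluding the reverse edge j -> i.
            incoming = sum(nu[(k, i)] for k in neighbors[i] if k != j)
            new_nu[(i, j)] = node_x[i] + edge_x[(i, j)] + incoming  # "f_1"
        nu = new_nu
    # Node readout aggregates all final incoming edge messages.          "f_2"
    return {i: node_x[i] + sum(nu[(k, i)] for k in neighbors[i])
            for i in neighbors}

# Path graph 0 - 1 - 2 with unit node and edge features.
nbrs = {0: {1}, 1: {0, 2}, 2: {1}}
edge_feats = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
h = encode_graph({0: 1.0, 1: 1.0, 2: 1.0}, edge_feats, nbrs, T=2)
```

Excluding the reverse edge in the aggregation (the N(i)\{j} term) prevents a message from immediately bouncing back along the edge it arrived on.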
And after the expression vector of the source molecule is obtained, introducing a preset random vector z, and splicing the random vector z and the expression vector of the source molecule to obtain a spliced molecule expression vector H'. The molecular generation process has more randomness due to the addition of random vectors, so that a one-to-many molecular generation process can be modeled.
The random vector z may be sampled from the estimated variational posterior distribution of the molecule pair, obtained by maximizing the objective function

L = E_{z ∼ Q(z|X,Y)}[log P(Y | z, X)] − λ_KL · D_KL(Q(z|X,Y) ‖ P(z))

where Q(z|X,Y) denotes the variational posterior distribution, P(Y | z, X) the likelihood of the target molecule given z and the source molecule, P(z) the prior distribution, D_KL(·‖·) the Kullback-Leibler divergence, and λ_KL a weighting constant.
In the embodiment of the present application, each ring and each edge of the new molecule to be decoded is regarded as a node, so the new molecule can be represented as a tree structure connected by such nodes, i.e., a junction tree. The decoder's process of decoding the new molecule is therefore the process of decoding the junction tree.
For each node in the junction tree, the feature vectors of the node and the edge are first computed. In one possible implementation, the edge features are first aggregated with the current node features by a gated recurrent neural network (GRU) to update the current edge feature vector. In particular, using the formula
Figure BDA0002655164100000102
is used to update the current edge feature vector, where
Figure BDA0002655164100000103
denotes the feature vector of the current node,
Figure BDA0002655164100000104
denotes the feature vectors of the edges connected to the current node, and
Figure BDA0002655164100000105
denotes the updated edge feature vector.
After the feature vector of the current edge is obtained, the information of the edges connected to the current node is aggregated to obtain the feature vector of the current node. Specifically, the formula
Figure BDA0002655164100000106
is used to update the feature vector of the current node, where ht denotes the feature vector of the current node and τ(·) denotes the linear rectification (ReLU) function.
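The aggregation formulas appear only as images in the original; as a hedged sketch of the node update (summing the features of the edges connected to the current node into the node feature and applying the linear rectification τ = ReLU; the exact parameterization in the patent may differ, e.g. it may include learned weights):

```python
def relu(v):
    """Linear rectification applied element-wise, i.e. tau(.)."""
    return [max(0.0, x) for x in v]

def aggregate_node(node_feat, edge_feats):
    """h_t: aggregate the features of the edges connected to the current
    node into the node feature, then apply the ReLU nonlinearity."""
    if not edge_feats:
        return relu(node_feat)
    summed = [sum(col) for col in zip(*edge_feats)]
    return relu([a + b for a, b in zip(node_feat, summed)])
```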
After the feature vector of the current node is obtained, topology prediction is used to determine whether the current node continues to expand a new child node downward. The topology-prediction formula is
Figure BDA0002655164100000111
Wherein,
Figure BDA0002655164100000112
σ(·) denotes the sigmoid function, attention(·) denotes a neural-network attention layer, f3(·) denotes a third multilayer perceptron network, and pt denotes the prediction probability.
After the topology-prediction probability is computed, whether to continue expanding downward is decided against a preset probability, e.g., 0.5. If the prediction probability is smaller than the preset probability, no downward expansion is performed and the process returns to the parent node of the current node, until the root node is reached and no further expansion occurs. If the prediction probability is greater than the preset probability, expansion continues downward, and the substructure to be expanded is determined by label prediction. The label-prediction formula is
Figure BDA0002655164100000113
Wherein,
Figure BDA0002655164100000114
f4(·) denotes a fourth multilayer perceptron network and qt denotes the prediction probability. After the label-prediction probability is computed, the substructure with the maximum prediction probability is selected as the structure corresponding to the current tree node; this substructure, composed of rings and edges, is a node of the junction tree.
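The two decisions above can be sketched together: a sigmoid score thresholded at the preset probability for topology prediction, and an argmax over substructure scores for label prediction. The MLP and attention layers f3, f4 are stood in here by raw scores, so this is an illustrative skeleton rather than the patent's exact networks:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def should_expand(topology_score, preset_prob=0.5):
    """Topology prediction: expand a new child node iff the predicted
    probability p_t exceeds the preset probability (e.g. 0.5)."""
    return sigmoid(topology_score) > preset_prob

def predict_substructure(label_scores):
    """Label prediction: pick the index of the substructure with the
    maximum predicted probability (argmax over raw scores suffices,
    since softmax is monotone)."""
    return max(range(len(label_scores)), key=lambda i: label_scores[i])
```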
After each substructure is predicted, the connection modes between the substructures are predicted, from which the molecular graph structure of the new molecule can be obtained. Specifically, the message-passing network of the encoder is first used to obtain the representation vectors of the subgraph under the different connection modes
Figure BDA0002655164100000115
and the score-function values under the different connection modes are computed from the representation vectors of the subgraphs
Figure BDA0002655164100000116
and the molecular graph structure of the new molecule is determined according to the score-function values.
In particular by maximizing the objective function
Figure BDA0002655164100000117
the molecular graph structure of the new molecule is determined, where f5(·) denotes a fifth multilayer perceptron network,
Figure BDA0002655164100000118
denotes the molecular graph structure corresponding to the correct connection mode, and
Figure BDA0002655164100000119
denotes the molecular graph structures corresponding to each possible connection mode; the molecular graph structure corresponding to the correct connection mode is that of the finally output new molecule.
After the molecular graph structure of the new molecule is obtained, the parameters of the encoder and the decoder are optimized according to the diversity and attribute values of the corresponding new-molecule data and the difference between the new-molecule data and the source-molecule data, and the molecule generation model is determined from the optimized encoder and decoder parameters.
5. Training the agent. Specifically, as shown in fig. 4, after the molecule generation model is obtained, the corresponding hidden vector zr is determined from the initial agent, and the hidden vector and the source molecule data are input into the molecule generation model. The encoder of the molecule generation model determines the representation vector H of the source molecule from the source molecule data and splices the hidden vector with H to obtain the spliced molecular representation vector H'; the decoder of the molecule generation model then determines second molecular data, i.e., the generated new molecule, from H'. After the second molecular data are obtained, the initial agent is trained on them to obtain the agent. The training process includes: at each of the d moments, determining a hidden vector according to the input molecular data and the initial agent; inputting the source molecular data and the hidden vector into the molecule generation model to obtain the second molecular data corresponding to the source molecular data, wherein the input molecular data at the r-th moment are determined according to the second molecular data output at the (r-1)-th moment, and the input molecular data at the 1st moment are the source molecular data; and finally, optimizing the initial agent according to the second molecular data to obtain the agent.
Specifically, as shown in fig. 5, at the 1st moment the source molecule data are encoded by the encoder to obtain the representation vector of the source molecule, and this representation vector is input into the RNN model to obtain the observation state s1 corresponding to the 1st moment; s1 is input into the initial agent to obtain the hidden vector z1 that is input into the molecule generation model at the 1st moment; and the source molecular data and z1 are input into the molecule generation model to obtain the second molecular data Y1 at the 1st moment. Then the representation vector of Y1 is input into the RNN model to obtain the observation state s2 corresponding to the 2nd moment; s2 is input into the initial agent to obtain the hidden vector z2 input into the molecule generation model at the 2nd moment; and the source molecular data and z2 are input into the molecule generation model to obtain the second molecular data Y2 at the 2nd moment. By analogy, the representation vector of the second molecular data Yr-1 output at the (r-1)-th moment is input into the RNN model to obtain the observation state sr corresponding to the r-th moment; sr is input into the agent to obtain the hidden vector zr input into the molecule generation model at the r-th moment; and the source molecular data and zr are input into the molecule generation model to obtain the second molecular data Yr at the r-th moment.
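The step-by-step loop described above can be sketched with stand-in callables for the trained modules; the names `encode`, `rnn`, `agent`, and `generate` are illustrative placeholders, not functions defined in the patent:

```python
def rollout(source, encode, rnn, agent, generate, d, h_init):
    """Generate d new molecules: at each moment r, the RNN produces the
    observation state s_r, the agent maps it to the hidden vector z_r,
    and the generation model produces Y_r conditioned on the source."""
    molecules = []
    h_rep = encode(source)       # representation vector of the source molecule
    h = h_init                   # RNN state
    for _ in range(d):
        s, h = rnn(h_rep, h)     # observation state at this moment
        z = agent(s)             # hidden vector chosen by the agent
        y = generate(source, z)  # second molecular data at this moment
        molecules.append(y)
        h_rep = encode(y)        # next moment conditions on the new molecule
    return molecules
```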
In one possible implementation, at each moment, in addition to optimizing the initial agent, the RNN model is optimized according to the molecular data input into the RNN model and the generated second molecular data. Specifically, at the 1st moment the formula s1, h1 = RNN(H, hinit) is used to calculate the observation state s1 corresponding to the 1st moment and the RNN parameters h1, and the RNN model optimized at the 1st moment is determined from these parameters, where hinit is the initial state of the recurrent neural network (an all-zero vector by default), RNN(·) denotes the recurrent neural network, and H denotes the source molecular data input into the RNN model at the 1st moment. At each subsequent moment, after the second molecular data are obtained, the formula sr, hr = RNN(Hr-1, hr-1) determines the observation state sr corresponding to the r-th moment and the RNN parameters hr, where Hr-1 is the representation vector of the second molecular data output at the (r-1)-th moment and hr-1 denotes the RNN parameters output at the (r-1)-th moment; the optimized RNN model is determined from the finally output RNN parameters.
In one possible implementation, the agent is expressed as zr = f6(sr), i.e., the hidden vector zr input into the molecule generation model at the r-th moment is computed by this formula, where f6(·) denotes a sixth multilayer perceptron network.
After the second molecular data are obtained, the attribute value of the second molecular data, the similarity between the second molecular data and the source molecular data, and the difference between the second molecular data and the other generated molecules are calculated, and the sum of the three parts is used as the reward value. The attribute value may be calculated from a property of the second molecule, such as the logarithm of the molecular lipid-water partition coefficient; the similarity may be determined by the Tanimoto similarity of the fingerprint vectors corresponding to the second molecular data and the source molecular data; and the difference may be obtained by subtracting the similarity between the molecules from 1. After the reward value is calculated, the initial agent may be trained by a policy-gradient optimization method, such as trust region policy optimization (TRPO), to obtain the final agent.
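Under the description above, the reward can be sketched as the sum of three parts, with Tanimoto similarity computed on binary fingerprint vectors; the attribute calculation itself (e.g. logP via RDKit) is left as an input, and the handling of the first molecule (no earlier molecules to differ from) is an assumption:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    inter = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    union = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return inter / union if union else 0.0

def reward(attr_value, fp_new, fp_source, fps_generated):
    """Reward = attribute value + similarity to the source molecule
    + difference (1 - similarity) to previously generated molecules."""
    sim_to_source = tanimoto(fp_new, fp_source)
    if fps_generated:
        difference = min(1.0 - tanimoto(fp_new, fp) for fp in fps_generated)
    else:
        difference = 1.0  # assumption: no earlier molecules, maximal difference
    return attr_value + sim_to_source + difference
```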
In the embodiment of the application, training the agent is a reinforcement-learning process, which requires several basic elements: state, action, reward, and state transition. In the above embodiment, the agent uses the trained molecule generation model as the interactive environment. At each step it determines the hidden vector for the next step (action) from the information about the source molecule and the new molecules generated so far (state); the molecule generation model then generates a new molecule from the hidden vector and updates the agent's current state (state transition), while the attribute value of the new molecule and the diversity of the generated molecules serve as the reward value (reward).
6. Evaluating the model. Specifically, after the molecule generation model and the agent are obtained, the effect of the agent is evaluated. The evaluation indexes include the success rate of the second molecular data output by the molecule generation model under a preset attribute threshold and a preset similarity threshold, and the diversity of all generated new molecules. If the evaluation indexes meet the preset requirements, training of the agent is complete.
The success rate is computed as follows: set a similarity threshold ε1 and an attribute threshold ε2. For each second molecular datum, compute the attribute value Pi of the molecule and the similarity sim(X, Yi) between the source molecule X and that molecule; if Pi ≥ ε2 and sim(X, Yi) ≥ ε1, the molecule is a successful molecule. The success rate is the number of successful molecules divided by the number of all generated molecules.
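A minimal sketch of this metric; the default thresholds 0.9 and 0.4 are taken from the experiment described later and are assumptions rather than fixed values:

```python
def success_rate(attr_values, similarities, eps_attr=0.9, eps_sim=0.4):
    """Fraction of generated molecules whose attribute value and
    similarity to the source both reach the preset thresholds."""
    if not attr_values:
        return 0.0
    successes = sum(1 for p, s in zip(attr_values, similarities)
                    if p >= eps_attr and s >= eps_sim)
    return successes / len(attr_values)
```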
The diversity is computed as follows: for each second molecular datum, determine the minimum difference to all the other generated new molecules and take it as the measure of that molecule's diversity; the average of the diversities of all generated molecules is the measure of the overall diversity.
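This can be sketched from a pairwise similarity matrix of the generated molecules (assuming at least two molecules, and difference = 1 - similarity as defined above):

```python
def overall_diversity(pairwise_sim):
    """Given the pairwise similarity matrix of all generated molecules,
    take each molecule's minimum difference (1 - similarity) to the
    others and average these per-molecule minima."""
    n = len(pairwise_sim)
    per_molecule = [min(1.0 - pairwise_sim[i][j] for j in range(n) if j != i)
                    for i in range(n)]
    return sum(per_molecule) / n
```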
The training process of the agent provided in the embodiments of the present application is described below with reference to specific applications.
First, SMILES sequences, i.e., molecular data, are downloaded from the ZINC database, and the similarity threshold ε1 is set to 0.4 and the attribute threshold ε2 to 0.9. All molecular data are traversed and screened, and the molecule-pair data sets that meet the conditions are selected, such that the similarity between the source molecular data and the target molecular data is greater than 0.4, the attribute value of the source molecular data is less than 0.9, and the attribute value of the target molecular data is greater than 0.9. The attribute values and similarities are calculated with the RDKit toolkit.
After the molecule-pair data set is obtained, one part is used as the training set and another part as the test set. The agent is first trained with the training set: the molecular data are parsed with the RDKit toolkit to obtain the atom data and chemical-bond data of the molecules; a molecular-graph structure representation is constructed with each atom as a node and each chemical bond as an edge; the features of all nodes are determined from the one-hot encoding vector of each atom's type, and the features of all edges are obtained from the one-hot encoding vector of each chemical bond's valence.
After the molecular-graph structure representation is obtained, the graph structure of the source molecule is used as the input of the molecule generation model and the graph structure of the target molecule as its target, and supervised training is performed to obtain the trained molecule generation model. The model can take a source molecule and a random vector as input and generate a new target molecule.
After the molecule generation model is trained, the representation vector of the source molecule is determined and input into the RNN model to obtain an observation state; the observation state is used as the input of the multilayer perceptron to obtain a hidden vector, and the hidden vector and the source molecule are input into the molecule generation model together to generate new molecule data. The new molecule data are then used as the input of the recurrent neural network, and the preceding process is repeated to generate multiple new molecules. After the new molecules are generated, the attribute value of each new molecule, its similarity to the source molecule data, and its difference from the other generated molecules are calculated and summed to obtain the reward value. After the reward value is calculated, the initial agent is trained by trust region policy optimization to obtain the final agent.
After the agent is trained, the molecule data in the test set are used as source molecule data, the observation state corresponding to each moment is calculated with the RNN model, the hidden vector is then obtained from the observation state and the agent, and the hidden vector is input into the molecule generation model to obtain 20 new molecules. The success rate and diversity of the 20 new molecules are then calculated. The success rate is computed as follows: new molecules with similarity greater than 0.4 and attribute value greater than 0.9 are counted as successful molecules, and the ratio of the number of successful molecules to the number of all generated new molecules is the success rate. The diversity is determined by the average difference of all generated new molecules. The success rate and diversity indicate the effect of the trained agent.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 6 is a block diagram showing a structure of a molecule generating apparatus according to an embodiment of the present application, and only a part related to the embodiment of the present application is shown for convenience of explanation.
As shown in fig. 6, the molecule generating apparatus provided in the embodiment of the present application includes:
an obtaining module 10, configured to obtain source molecule data;
the calculation module 20 is configured to input the source molecular data and the sampling vector into a preset molecule generation model at each of d moments and output first molecular data corresponding to the source molecular data, where the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers.
In one possible implementation, the calculation module 20 is further configured to:
inputting the first molecular data output at the (r-1)-th moment into a preset RNN model to obtain an observation state corresponding to the r-th moment;
and inputting the observation state corresponding to the r-th moment into a preset intelligent agent to obtain the sampling vector which is input into the molecule generation model at the r-th moment.
In a possible implementation manner, the sampling vector input at the 1 st time is determined according to the source molecular data, and the calculation module 20 is further configured to:
inputting the source molecular data into the RNN model to obtain an observation state corresponding to the 1 st moment;
and inputting the observation state corresponding to the 1 st moment into the intelligent agent to obtain the sampling vector which is input into the molecular generation model at the 1 st moment.
In one possible implementation, the molecule generating apparatus further includes a training module, and the training module is configured to:
and training the intelligent agent according to the source molecular data and the molecular generation model.
In a possible implementation manner, the training module is specifically configured to:
determining a hidden vector at each moment in d moments according to the input molecular data and the initial agent;
inputting the source molecular data and the hidden vector into the molecular generation model respectively, and outputting second molecular data corresponding to the source molecular data, wherein the input molecular data at the r-th moment is determined according to the second molecular data output at the r-1-th moment, and the input molecular data at the 1 st moment is the source molecular data;
and optimizing the initial agent according to the second molecular data to obtain the agent for inputting the observation state.
In a possible implementation manner, the training module is further specifically configured to:
determining a reward value according to the attribute value, the similarity and the difference of the second molecular data;
optimizing the initial agent according to the reward value.
In one possible implementation, the training module is further configured to:
acquiring target molecule data corresponding to the source molecule data, wherein the source molecule data and the target molecule data form a molecule pair data set;
and training the molecule generation model according to the molecule pair data set and a preset random vector.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 7 is a schematic diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 7, the electronic apparatus of this embodiment includes: a processor 11, a memory 12 and a computer program 13 stored in said memory 12 and executable on said processor 11. The processor 11, when executing the computer program 13, implements the steps in the above-described embodiment of the molecule generating method, such as the steps S101 to S102 shown in fig. 1. Alternatively, the processor 11, when executing the computer program 13, implements the functions of each module/unit in each device embodiment described above, for example, the functions of the modules 10 to 20 shown in fig. 6.
Illustratively, the computer program 13 may be partitioned into one or more modules/units, which are stored in the memory 12 and executed by the processor 11 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 13 in the electronic device.
Those skilled in the art will appreciate that fig. 7 is merely an example of an electronic device and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components, e.g., the electronic device may also include input-output devices, network access devices, buses, etc.
The Processor 11 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 12 may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory 12 may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device. The memory 12 is used for storing the computer program and other programs and data required by the electronic device. The memory 12 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. For example, the above-described apparatus/electronic device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of generating molecules, comprising:
acquiring source molecular data;
and respectively inputting the source molecular data and a sampling vector into a preset molecule generation model at each of d moments, and outputting first molecular data corresponding to the source molecular data, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers.
2. The method of claim 1, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment by:
inputting the first molecular data output at the (r-1)-th moment into a preset RNN model to obtain an observation state corresponding to the r-th moment;
and inputting the observation state corresponding to the r-th moment into a preset intelligent agent to obtain the sampling vector which is input into the molecule generation model at the r-th moment.
3. The method of generating molecules according to claim 2, wherein the sampling vector input at the 1 st time is determined according to the source molecule data, wherein the sampling vector input at the 1 st time is determined by:
inputting the source molecular data into the RNN model to obtain an observation state corresponding to the 1 st moment;
and inputting the observation state corresponding to the 1 st moment into the intelligent agent to obtain the sampling vector which is input into the molecular generation model at the 1 st moment.
4. The method of claim 2, wherein before the inputting of the observation state corresponding to the r-th moment into a preset agent, the method further comprises:
and training the intelligent agent according to the source molecular data and the molecular generation model.
5. The molecular generation method of claim 4, wherein the training of the agent according to the source molecular data and the molecular generation model comprises:
determining a hidden vector at each moment in d moments according to the input molecular data and the initial agent;
inputting the source molecular data and the hidden vector into the molecular generation model respectively, and outputting second molecular data corresponding to the source molecular data, wherein the input molecular data at the r-th moment is determined according to the second molecular data output at the r-1-th moment, and the input molecular data at the 1 st moment is the source molecular data;
and optimizing the initial agent according to the second molecular data to obtain the agent for inputting the observation state.
6. The molecular generation method of claim 5, wherein the optimizing the initial agent according to the second molecular data comprises:
determining a reward value according to the attribute value, the similarity and the difference of the second molecular data;
optimizing the initial agent according to the reward value.
7. The molecular generation method of claim 1, wherein prior to the inputting the source molecular data and sample vector into a pre-defined molecular generation model, the method further comprises:
acquiring target molecule data corresponding to the source molecule data, wherein the source molecule data and the target molecule data form a molecule pair data set;
and training the molecule generation model according to the molecule pair data set and a preset random vector.
8. A molecule generating apparatus, comprising:
the acquisition module is used for acquiring source molecular data;
and the computing module is used for respectively inputting the source molecular data and the sampling vector into a preset molecule generation model at each of d moments and outputting first molecular data corresponding to the source molecular data, wherein the sampling vector input at the r-th moment is determined according to the first molecular data output at the (r-1)-th moment, d ≥ r > 1, and r and d are integers.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the molecule generation method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the molecule generation method according to any one of claims 1 to 7.
CN202010884581.5A 2020-08-28 2020-08-28 Molecule generation method, device, electronic equipment and storage medium Active CN112086144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010884581.5A CN112086144B (en) 2020-08-28 2020-08-28 Molecule generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010884581.5A CN112086144B (en) 2020-08-28 2020-08-28 Molecule generation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112086144A true CN112086144A (en) 2020-12-15
CN112086144B CN112086144B (en) 2024-08-09

Family

ID=73728781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010884581.5A Active CN112086144B (en) 2020-08-28 2020-08-28 Molecule generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112086144B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735540A (en) * 2020-12-18 2021-04-30 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
WO2022127687A1 (en) * 2020-12-18 2022-06-23 深圳先进技术研究院 Metabolic pathway prediction method, system, terminal device and readable storage medium
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN115116553A (en) * 2021-03-19 2022-09-27 合肥本源量子计算科技有限责任公司 Method, device, medium, and electronic device for configuring parameters of molecule

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334889A (en) * 2017-11-30 2018-07-27 腾讯科技(深圳)有限公司 Abstract description generation method and device, abstract descriptive model training method and device
CN109933806A (en) * 2019-04-01 2019-06-25 长沙理工大学 A kind of repetition generation method, system, equipment and computer readable storage medium
CN110634539A (en) * 2019-09-12 2019-12-31 腾讯科技(深圳)有限公司 Artificial intelligence-based drug molecule processing method and device and storage medium
US20200168302A1 (en) * 2017-07-20 2020-05-28 The University Of North Carolina At Chapel Hill Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence
CN111553481A (en) * 2020-04-30 2020-08-18 深圳市迪米欧科技有限公司 RNN-based quantum computing method and device


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735540A (en) * 2020-12-18 2021-04-30 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
WO2022127688A1 (en) * 2020-12-18 2022-06-23 深圳先进技术研究院 Molecular optimization method and system, and terminal device and readable storage medium
WO2022127687A1 (en) * 2020-12-18 2022-06-23 深圳先进技术研究院 Metabolic pathway prediction method, system, terminal device and readable storage medium
CN112735540B (en) * 2020-12-18 2024-01-05 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
CN115116553A (en) * 2021-03-19 2022-09-27 合肥本源量子计算科技有限责任公司 Method, device, medium, and electronic device for configuring parameters of molecule
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules

Also Published As

Publication number Publication date
CN112086144B (en) 2024-08-09

Similar Documents

Publication Publication Date Title
Zhang et al. Deep learning on graphs: A survey
Lin et al. Deep learning for missing value imputation of continuous data and the effect of data discretization
Faez et al. Deep graph generators: A survey
CN112086144A (en) Molecule generation method, molecule generation device, electronic device, and storage medium
KR20230128492A (en) Explainable Transducers Transducers
US20200342953A1 (en) Target molecule-ligand binding mode prediction combining deep learning-based informatics with molecular docking
CN111461168A (en) Training sample expansion method and device, electronic equipment and storage medium
CN110138595A (en) Time link prediction technique, device, equipment and the medium of dynamic weighting network
Makantasis et al. Rank-r fnn: A tensor-based learning model for high-order data classification
CN117195731A (en) Complex system dynamics behavior modeling method, system and equipment
CN112199884A (en) Article molecule generation method, device, equipment and storage medium
Altares-López et al. AutoQML: Automatic generation and training of robust quantum-inspired classifiers by using evolutionary algorithms on grayscale images
Bharadi Qlattice environment and Feyn QGraph models—A new perspective toward deep learning
Trirat et al. Universal time-series representation learning: A survey
CN112420125A (en) Molecular attribute prediction method and device, intelligent equipment and terminal
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
Zhang et al. VESC: a new variational autoencoder based model for anomaly detection
Kumar et al. Pqklp: projected quantum kernel based link prediction in dynamic networks
CN115116557A (en) Method and related device for predicting molecular label
CN118298906A (en) Protein and small molecule docking method, device, electronic equipment and storage medium
Dash DECPNN: A hybrid stock predictor model using Differential Evolution and Chebyshev Polynomial neural network
Sidheekh et al. Building Expressive and Tractable Probabilistic Generative Models: A Review
Balakrishnan et al. Quantum Neural Network for Time Series Forecasting: Harnessing Quantum Computing's Potential in Predictive Modeling
CN112435715A (en) Metabolic path prediction method and device, terminal device and storage medium
CN111986740B (en) Method for classifying compounds and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant