GTransCYPs: an improved graph transformer neural network with attention pooling for reliably predicting CYP450 inhibitors
Journal of Cheminformatics volume 16, Article number: 119 (2024)
Abstract
State-of-the-art medical studies have shown that predicting CYP450 enzyme inhibitors is beneficial in the early stages of drug discovery. However, developing accurate machine learning (ML)-based in silico methods for predicting CYP450 inhibitors remains challenging. Here, we introduce GTransCYPs, an improved graph neural network (GNN) with a transformer mechanism for predicting CYP450 inhibitors. The model significantly enhances the discrimination between inhibitors and non-inhibitors of five major CYP450 isozymes: 1A2, 2C9, 2C19, 2D6, and 3A4. GTransCYPs learns information patterns from molecular graphs by aggregating node and edge representations with a transformer. Transformer convolution layers process the features, a global attention-pooling step synthesizes the graph-level information, and successive linear layers generate the final output. Experimental results demonstrate that GTransCYPs achieves high performance, outperforming other state-of-the-art methods for CYP450 prediction.
Scientific contribution
The prediction of CYP450 inhibition via computational techniques utilizing biological information has emerged as a cost-effective and highly efficient approach. Here, we present a deep learning (DL) architecture based on a GNN with a transformer mechanism and attention pooling (GTransCYPs) to predict CYP450 inhibitors. Four GTransCYPs variants with different pooling techniques were tested on the CYP450 prediction problem for the first time. The graph transformer with attention pooling achieved the best performance. Comparative and ablation experiments provide evidence of the efficacy of our proposed method in predicting CYP450 inhibitors. The source code is publicly available at https://github.com/zonwoo/GTransCYPs.
Introduction
In drug-drug interactions, the inhibition of cytochrome P450 (CYP450) enzymes plays a crucial role in drug efficacy, toxicity, and potential interactions [1,2,3,4]. These enzymes are responsible for metabolizing numerous drugs in the body [5]. If the activity of these enzymes is hindered by one drug, it can impact the metabolism of other drugs, potentially altering the drug's response and raising the risk of toxicity. Among the 57 commonly found CYP450 isozymes in the human liver [2, 3, 6], five of them—namely 1A2, 2C9, 2C19, 2D6, and 3A4—play critical roles in most drug metabolism processes in the human body [7].
In vitro high-throughput screening for CYP450 inhibition has generated data on CYP450 isozymes, notably through research initiatives such as PubChem BioAssay [8]. These data have enabled the computational prediction of potential inhibitory compounds against the five CYP450 isozymes using ML approaches. The in silico approach is appealing because it can be applied at the early stages of drug discovery, reducing the number of wet-lab experiments needed to select new drug candidates and thus minimizing costs. In addition to enhancing success rates, it also aids in predicting the activity of designed compounds before synthesis [9, 10].
In recent years, ML has been used as a computational method to predict CYP450 inhibition [11,12,13,14,15,16]. Various ML algorithms have been applied in this research, including random forest (RF) [15] and support vector machines (SVM) [17]. RF, an ensemble of decision trees (DTs), improves accuracy and reduces overfitting by combining the results of several trees, while SVM builds a model that separates data into classes by finding the optimal hyperplane in feature space.
In the biomedical domain and molecular property prediction, GNNs are attracting growing interest and now define the state of the art [18,19,20,21]. Unlike conventional ML models that predict CYP450 enzyme activity from chemical features and molecular descriptors, GNNs operate directly on graph representations of molecules. They integrate topological information from molecular graph structures into the model, thereby considering the spatial relationships among atoms within molecules. Their ability to model complex molecular structures and account for potential atom interactions can enhance prediction accuracy. Qiu et al. [22] proposed a GNN-based model for predicting CYP450 inhibition that takes two types of input: chemical representation features extracted from SMILES by the GNN, and features extracted from sequence alignments by a convolutional neural network (CNN); the two feature sets are concatenated at the end of the model. Their model was reported to outperform iCYP-MFE [16]. More recently, Ai et al. [23] proposed a two-pathway method for predicting CYP450 activity. The first pathway employs an artificial neural network (ANN) to learn from substructure-based molecular fingerprints and one pharmacophore-based fingerprint. The second pathway uses a GNN with attention mechanisms to extract structural information from molecular graphs, which is then combined with the first pathway's features in a fully connected layer. Their proposed model, named FPGNN, integrates these pathways to predict the inhibition of the five CYP450 isozymes. Despite promising results, combining fingerprints with a GNN has limitations. First, information tends to be duplicated, as fingerprint features may already encapsulate what the GNN can learn from the molecular structure; this redundancy can hinder the model's capacity to discern genuinely informative features. Second, merging multiple feature representations increases computational complexity, which can hinder effective prediction.
In this paper, we propose a DL model, an improved GNN with a transformer mechanism, for predicting CYP450 inhibitors (GTransCYPs). First, the drug's chemical structure is represented as a graph in which the vertices are atoms and the edges are chemical relationships within the molecule. Next, a graph transformer network computes the drug embedding vectors. In addition, attention pooling enables more efficient downsampling, improving model generalizability. Results from extensive experiments demonstrate that GTransCYPs improves the performance of CYP450 inhibitor prediction in comparison with state-of-the-art models. An ablation study shows the robustness of the proposed method to modeling parameters and its ability to predict potential inhibitory compounds against the five CYP450 isozymes. In summary, we believe that GTransCYPs is an effective tool for identifying potential inhibitory compounds against CYP450 for further wet-lab validation. A web server was developed to host the models from this study for public access.
Method
Overview of the GTransCYPs model
Graph learning models are garnering growing attention in the computational analysis of molecular data. The proposed model is instantiated as a GNN designed to extract molecular data and encapsulate insights from atomic node characteristics through a stratified transformation process. The network acquires hierarchically enriched node embeddings, considering both local neighbor interactions and comprehensive graph-level context. Key components of the architecture include transformer layers (TransformerConv), linear projection, batch normalization, and attention pooling, as illustrated in Fig. 1. These elements underscore the significance of contextual awareness in both local and global contexts, enabling the network to learn complex graph representations. Striking a harmonious balance between capturing delicate graph patterns and maintaining a thorough global understanding improves the predictive performance for CYP450 inhibitors.
Figure 1A shows an overview of the molecular graph construction: the pertinent features are extracted from the data of the five CYP450 isozymes through a featurization process. Each molecule is transformed into a graph wherein the atoms serve as nodes and the interconnecting bonds between atoms are depicted as edges, encoded as numerical vectors. To accomplish this transformation, we use the RDKit library, which converts the SMILES representation into the structural format required for training the model. RDKit is used for its capability to manipulate molecular structures, enabling comprehensive representation and processing of molecular data. Figure 1B shows the model pipeline: atomic input is transformed into an initial embedding representation using multi-head attention, followed by normalization. Aggregation is then performed through graph pooling operations, specifically attention-pooling layers. Finally, the data traverse multiple linear layers with Rectified Linear Unit (ReLU) activation to produce the final output.
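This pipeline can be summarized in code. The following is a minimal PyTorch Geometric sketch of the architecture, assuming two TransformerConv blocks, a hidden size of 128, 4 attention heads, and PyG's GlobalAttention readout (called AttentionalAggregation in newer PyG releases); the layer count and sizes here are illustrative assumptions, not the exact published configuration:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import TransformerConv, GlobalAttention

class GTransCYPsSketch(torch.nn.Module):
    """Illustrative sketch: TransformerConv -> projection/batch norm -> attention pooling -> MLP."""

    def __init__(self, node_dim, edge_dim, hidden=128, heads=4):
        super().__init__()
        self.conv1 = TransformerConv(node_dim, hidden, heads=heads, edge_dim=edge_dim)
        self.lin1 = torch.nn.Linear(hidden * heads, hidden)  # project concatenated heads
        self.bn1 = torch.nn.BatchNorm1d(hidden)
        self.conv2 = TransformerConv(hidden, hidden, heads=heads, edge_dim=edge_dim)
        self.lin2 = torch.nn.Linear(hidden * heads, hidden)
        self.bn2 = torch.nn.BatchNorm1d(hidden)
        # Attention pooling: a gate network scores each node; softmax weights the sum.
        self.pool = GlobalAttention(gate_nn=torch.nn.Linear(hidden, 1))
        self.head = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden // 2),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden // 2, 1),  # inhibitor / non-inhibitor logit
        )

    def forward(self, x, edge_index, edge_attr, batch):
        x = self.bn1(F.relu(self.lin1(self.conv1(x, edge_index, edge_attr))))
        x = self.bn2(F.relu(self.lin2(self.conv2(x, edge_index, edge_attr))))
        g = self.pool(x, batch)  # one graph-level embedding per molecule
        return self.head(g)
```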
Graph neural network (GNN)
GNNs have become increasingly popular and reliable in molecular analysis due to their ability to capture high-level graph information and relationships and propagate them through the network. A GNN is a type of DL technique created explicitly for processing and analyzing data structured as graphs [24]. The development of GNNs has adopted attention mechanisms, allowing models to assign dynamic weights to interactions between nodes in a graph, enabling more adaptive decision making and capturing long-range dependencies. A GNN can be applied to the graph representation of a compound molecule (\(G\)) by using an iterative mechanism that updates node and edge features based on their neighbors. This process can be described by the following generic update:

$$h_{v}^{(l+1)} = \phi \left( h_{v}^{(l)}, \underset{u \in N(v)}{\bigoplus} \psi \left( h_{v}^{(l)}, h_{u}^{(l)}, h_{e}^{(l)} \right) \right)$$

where \({h}_{v}^{(l)}\) is the representation of the \({v}_{a}\) node feature at the \(l\)-th iteration, \({h}_{e}^{(l)}\) is the representation of the edge feature \({(v}_{{a}_{i}},{v}_{{a}_{j}})\) at the \(l\)-th iteration, \(N(v)\) is the neighborhood of \(v\), \(\bigoplus\) is a permutation-invariant aggregation, and \(\psi\) and \(\phi\) are the message and update functions.
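As a concrete illustration of one such iteration, the toy snippet below performs a single round of mean-neighbor aggregation over a three-atom graph in plain PyTorch; the graph, feature size, and ReLU update are illustrative assumptions rather than the model's actual message and update functions:

```python
import torch

# Toy 3-node graph in COO form: edges 0-1 and 1-2 (both directions).
h = torch.randn(3, 8)                      # node features h_v^(l), dim 8 (illustrative)
edge_index = torch.tensor([[0, 1, 1, 2],   # source nodes u
                           [1, 0, 2, 1]])  # target nodes v

msgs = h[edge_index[0]]                                        # messages from each neighbor u
agg = torch.zeros_like(h).index_add_(0, edge_index[1], msgs)   # sum messages per target node
deg = torch.bincount(edge_index[1], minlength=3).clamp(min=1)
h_next = torch.relu(agg / deg.unsqueeze(1) + h)                # updated features h_v^(l+1)
```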
Molecular graph construction
The graph concept enables encoding the high-dimensional space of molecular structures into a lower-dimensional representation. The simplified molecular input line entry system (SMILES) of each compound was introduced as a two-dimensional molecular graph representing chemical atoms and bond token information. The molecular graph is represented as \(G=(V,E)\), where \(V\) denotes the set of nodes and \(E\) the set of edges. Each atom \(a \in {\rm A}\) is mapped to a node in the graph, so \(V = \left\{{v}_{a} | a \in {\rm A}\right\}\), where \({v}_{a}\) is the node representing atom \(a\). Furthermore, each bond \(b \in B\) is mapped to an edge between the nodes representing the atoms bound by that bond, therefore \(E = \left\{{(v}_{{a}_{i}} , {v}_{{a}_{j}})|b=( {a}_{i},{a}_{j} )\in B\right\}\). In addition, features are attached to each node and edge to represent atomic and bonding properties. Suppose \({F}_{v}\) is the feature space for nodes and \({F}_{e}\) is the feature space for edges. The features of node \({v}_{a}\) can represent the atomic type, charge, and other properties of atom \(a\), \({f(v}_{a})= {(F}_{atom}(a), {F}_{charge}(a),...)\). Likewise, the features of edge \({(v}_{{a}_{i}} , {v}_{{a}_{j}})\) can represent the bond type, bond length, or other properties of bond \(b=( {a}_{i},{a}_{j} )\), that is, \(f{(v}_{{a}_{i}} , {v}_{{a}_{j}})= {(F}_{bond}(b), {F}_{length}(b),...)\).
We perform molecular featurization based on the path-augmented graph transformer network [25], which builds a molecular graph connecting all pairs of atoms, thereby taking into account the interaction of each atom with every other atom in the molecule. This path-augmented featurization captures both local and global structural information, providing a context-aware representation of molecular features. The featurization process is carried out iteratively on each molecule in the dataset: its SMILES representation is converted to a molecular graph object using DeepChem [26] and the RDKit library [27], and the graph is then converted into a feature representation comprising node attributes and edge attributes.
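A minimal sketch of this step, assuming DeepChem exposes a PAGTN-style featurizer as `dc.feat.PagtnMolGraphFeaturizer` with a `to_pyg_graph()` conversion on the resulting graph objects; if that featurizer is unavailable in a given DeepChem build, `dc.feat.MolGraphConvFeaturizer(use_edges=True)` yields an analogous node/edge representation:

```python
import deepchem as dc

# Assumed API: DeepChem's path-augmented graph transformer featurizer.
featurizer = dc.feat.PagtnMolGraphFeaturizer(max_length=5)

smiles = ["CCO", "c1ccccc1O"]          # toy molecules, not drawn from the CYP450 datasets
graphs = featurizer.featurize(smiles)  # one GraphData object per molecule

# Convert to a PyTorch Geometric Data object with x, edge_index, edge_attr.
data = graphs[0].to_pyg_graph()
print(data.num_nodes, data.num_edges)
```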
Graph transformer for learning CYP450 molecules
The proposed model is designed to learn convolutional feature representations of molecules in graph structures, considering the information of each atom in the molecule. Given the node features \({H}^{l}=\left\{{h}_{1}^{(l)}, {h}_{2}^{(l)}, {...h}_{n}^{(l)}\right\}\), the multi-head attention for each edge is computed as follows:

$$q_{c,i}^{(l)} = W_{c,q}^{(l)} h_{i}^{(l)} + b_{c,q}^{(l)}$$

$$k_{c,j}^{(l)} = W_{c,k}^{(l)} h_{j}^{(l)} + b_{c,k}^{(l)}$$

$$e_{c,ij} = W_{c,e} e_{ij} + b_{c,e}$$

$$a_{c,ij}^{(l)} = \frac{\exp \left( \left\langle q_{c,i}^{(l)}, k_{c,j}^{(l)} + e_{c,ij} \right\rangle / \sqrt{d} \right)}{\sum_{u \in N(i)} \exp \left( \left\langle q_{c,i}^{(l)}, k_{c,u}^{(l)} + e_{c,iu} \right\rangle / \sqrt{d} \right)}$$

Here, \(q\) and \(k\) represent the query and key projections, respectively, at layer \(l\) for attention head \(c\), and \(d\) is the per-head dimension. These projections are computed using weights \({W}_{c,q}^{(l)}\), \({W}_{c,k}^{(l)}\) and biases \({b}_{c,q}^{(l)}\), \({b}_{c,k}^{(l)}\); they are linear transformations of the original input tokens' representations \({h}_{i}^{(l)}\) and \({h}_{j}^{(l)}\). Then, \({e}_{c,ij}\) encodes the interaction between the \(i\)-th and \(j\)-th nodes, computed by applying another linear transformation, with weight \({W}_{c,e}\) and bias \({b}_{c,e}\), to the edge features. \({a}_{c,ij}^{(l)}\) is the attention weight associated with the \(i\)-th query and \(j\)-th key at layer \(l\); it is computed by normalizing the dot products of \({q}_{c,i}^{(l)}\), \({k}_{c,j}^{(l)}\), and \({e}_{c,ij}\). The denominator sums the attention scores between the \(i\)-th query and the other keys \({k}_{c,u}^{(l)}\) for \(u\) in the neighborhood \(N(i)\).
In this context, each token in the sequence representing the molecular structure maps to a node in the graph. Key projections \({k}_{c,j}^{(l)}\) and query projections \({q}_{c,i}^{(l)}\) are calculated by multiplying the corresponding weights \({W}_{c,k}^{(l)}\) and \({W}_{c,q}^{(l)}\) by the representation of each atom or token in the molecule, which serve as key and query values, respectively. The edge features \({e}_{c,ij}\) then enter the inner product between the key projection and the query projection for each pair of atoms. The weights \({W}_{c,e}\) and \({b}_{c,e}\) and the exponential transformation emphasize, or increase, the attention score according to its relevance to molecular interactions, allowing the model to explore complex interactions between atoms in molecules. Furthermore, \({a}_{c,ij}^{(l)}\) is the attention weight indicating how important the interaction between atoms \(i\) and \(j\) is for informing enzyme activity. In a nutshell, this process allows the model to adaptively understand the complexity of the molecular structure and the interrelationships of its atoms, contributing to the model's predictive ability.
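To make the computation above concrete, the following few lines of PyTorch score one edge of a toy graph for a single attention head; the dimension \(d\) and the random features are illustrative assumptions:

```python
import torch

d = 64  # per-head dimension (illustrative)
W_q, W_k, W_e = (torch.nn.Linear(d, d) for _ in range(3))  # biases included by default

h_i, h_j, e_ij = torch.randn(d), torch.randn(d), torch.randn(d)

q = W_q(h_i)                       # query projection of node i
k = W_k(h_j) + W_e(e_ij)           # key projection of node j, shifted by edge features
score = (q * k).sum() / d ** 0.5   # scaled dot-product attention score for edge (i, j)
# a_{c,ij} follows by softmax-normalizing `score` over all neighbors u in N(i).
```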
Message aggregation with transformer
In this study, we employ a message-passing mechanism through the neural network to capture local information in the CYP450 molecular graph; the environmental information of each node is then aggregated to update the representation of the central node. Following the computation of multi-head attention over the graph, message aggregation with the transformer is executed:

$$\widehat{h}_{i}^{(l+1)} = \underset{c=1}{\overset{C}{\Big\Vert}} \left[ \sum_{j \in N(i)} a_{c,ij}^{(l)} \left( v_{c,j}^{(l)} + e_{c,ij} \right) \right]$$

where \({\widehat{h}}_{i}^{(l+1)}\) represents the updated representation of node \(i\) in the next layer \(l+1\), and \(\parallel\) is the concatenation operation over the \(C\) attention heads. This process aggregates messages originating from neighboring nodes; the contribution of each neighboring node, weighted by \({a}_{c,ij}^{(l)}\), is calculated from the value projection \({v}_{c,j}^{(l)}\) and the edge features.
Attention pooling
To reduce the size and complexity of the graph data structure while preserving significant information, we employ an attention-pooling technique that streamlines the graph representation while retaining what is essential, thereby improving the model's performance. The attention mechanism empowers the model to prioritize the input elements crucial for the task, enhancing its capacity for optimal performance. Using attention, a sequence of normalized weights is computed that reflects the relative significance attributed to each node. The graph-level readout is calculated as follows:

$$r = \sum_{i \in V} \mathrm{softmax}\left( h_{gate}\left( h_{i} \right) \right) \cdot h_{i}$$

Here, the softmax function normalizes the attention vector produced by the gate network \(h_{gate}\), ensuring that the attention coefficients are proportional across the nodes of each graph.
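A from-scratch sketch of this readout, assuming PyTorch Geometric's `softmax` utility for the per-graph normalization and a simple linear gate network:

```python
import torch
from torch_geometric.utils import softmax

def attention_pool(x, batch, gate):
    """Attention pooling: the gate scores each node, softmax normalizes the scores
    per graph, and the weighted node features are summed into one vector per molecule."""
    scores = softmax(gate(x).squeeze(-1), batch)  # per-graph softmax over nodes
    num_graphs = int(batch.max()) + 1
    out = torch.zeros(num_graphs, x.size(1), device=x.device)
    return out.index_add_(0, batch, scores.unsqueeze(-1) * x)

# Usage: x holds node embeddings; batch maps each node to its graph index.
x = torch.randn(5, 16)
batch = torch.tensor([0, 0, 0, 1, 1])    # two molecules: 3 atoms and 2 atoms
gate = torch.nn.Linear(16, 1)
pooled = attention_pool(x, batch, gate)  # shape: (2, 16)
```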
Experimental results
Datasets
To validate the performance of our proposed model, we utilize the same dataset employed by Veith et al. [28], collected from the Therapeutics Data Commons database [29] (https://tdcommons.ai, accessed January 2024). The dataset includes inhibitors targeting the five major CYP450 isozymes, namely 1A2, 2C9, 2C19, 2D6, and 3A4. We apply the scaffold method to split the training, validation, and testing sets in an 80:10:10 ratio. This scaffold split, which separates samples based on their two-dimensional structural frameworks, was chosen because it presents a greater challenge for learning algorithms than random splitting while ensuring a better representation of molecular diversity in each subset. The training set comprises 4564 inhibitors and 5499 non-inhibitors for 1A2; 3275 inhibitors and 6398 non-inhibitors for 2C9; 4591 inhibitors and 5541 non-inhibitors for 2C19; and 4028 inhibitors and 5834 non-inhibitors for 3A4. The validation set contains 639 inhibitors and 618 non-inhibitors for 1A2; 375 inhibitors and 834 non-inhibitors for 2C9; 648 inhibitors and 618 non-inhibitors for 2C19; and 567 inhibitors and 665 non-inhibitors for 3A4. Finally, the testing set comprises 626 inhibitors and 633 non-inhibitors for 1A2; 395 inhibitors and 815 non-inhibitors for 2C9; 580 inhibitors and 687 non-inhibitors for 2C19; and 515 inhibitors and 719 non-inhibitors for 3A4. Notably, the 2D6 dataset is significantly imbalanced, with only 5148 instances labeled as inhibitors against 10,616 non-inhibitors, a far greater disparity than in the 2C9, 2C19, and 3A4 datasets. We implemented a downsampling strategy to balance the class distribution in the 2D6 dataset. Oversampling was not used because it addresses class imbalance by generating additional data for the minority class, and such synthesized data often does not accurately represent real-world conditions. The dataset was first partitioned by label to isolate the majority (non-inhibitor) and minority (inhibitor) classes. We then downsampled the non-inhibitor data, randomly selecting samples without replacement to match the number of inhibitor-labeled instances, and reintegrated the result with the inhibitor data, achieving a more balanced class distribution and mitigating potential biases arising from the initial imbalance. The balanced 2D6 dataset was then divided into training, validation, and testing sets using the scaffold split technique: the training set contains 4092 inhibitors and 4084 non-inhibitors; the validation set, 530 inhibitors and 492 non-inhibitors; and the testing set, 526 inhibitors and 496 non-inhibitors. Figure 2 presents a comprehensive analysis of inhibitor and non-inhibitor counts within the training, validation, and testing sets, together with graphical representations of the distribution of SMILES lengths (the number of characters in the SMILES string representing a molecule) across each dataset.
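A sketch of the data preparation, assuming the Veith bioassay tasks are loaded through TDC's `ADME` interface (task names such as 'CYP2D6_Veith' follow TDC's naming); the `downsample` helper is a hypothetical illustration of the strategy described above, and note that in the study the 2D6 set was balanced before splitting, whereas it is applied to a single split here only for brevity:

```python
import pandas as pd
from tdc.single_pred import ADME

# Scaffold 80:10:10 split for one isozyme task.
data = ADME(name='CYP2D6_Veith')
split = data.get_split(method='scaffold', frac=[0.8, 0.1, 0.1], seed=42)
train, valid, test = split['train'], split['valid'], split['test']

def downsample(df: pd.DataFrame, label_col: str = 'Y', seed: int = 0) -> pd.DataFrame:
    """Randomly sample the majority class (without replacement) down to the minority count."""
    pos, neg = df[df[label_col] == 1], df[df[label_col] == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    major = major.sample(n=len(minor), replace=False, random_state=seed)
    return pd.concat([major, minor]).sample(frac=1, random_state=seed)  # shuffle

train_balanced = downsample(train)
```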
Figure 3 shows the chemical space represented by each dataset, ascertained and compared using related molecular descriptors such as molecular weight (MW) and \(LogP\), adopted from [11, 23]. CYP450 inhibitors are visually represented by orange dots, while cyan dots correspond to CYP450 non-inhibitors. The compounds within the CYP450 modeling dataset exhibit a wide distribution of MW, reflecting a diverse array of chemical structures. This spectrum of MW and \(LogP\) values underscores the dataset's inclusion of compounds spanning a broad chemical range. The chemical spaces of the various CYP450 enzymes, including CYP1A2 (MW: 33.03 to 1736.18, \(LogP\): − 17.08 to 20.75), CYP2C9 (MW: 33.03 to 1664.92, \(LogP\): − 17.08 to 20.75), CYP2C19 (MW: 33.03 to 1664.92, \(LogP\): − 17.08 to 20.75), CYP2D6 (MW: 42.39 to 1488.80, \(LogP\): − 14.01 to 15.34), and CYP3A4 (MW: 33.03 to 1736.18, \(LogP\): − 24.39 to 20.75), highlight the encompassing variety of chemical characteristics in the dataset. In addition, we analyzed the molecular weight distribution of compounds within each dataset, as shown in Fig. 4. For 1A2, molecular weights ranged from 33.03 to 1736.18 Dalton, with a distribution peak at 291.35 Dalton. The 2C9 and 2C19 datasets showed similar ranges, from 33.03 to 1664.92 Dalton, with modes at 280.33 and 291.35 Dalton, respectively. The 2D6 dataset ranged from 42.39 to 1488.81 Dalton, with a mode at 291.35 Dalton. Lastly, 3A4 ranged from 33.03 to 1736.18 Dalton, with a mode at 291.35 Dalton. This analysis indicates a broad molecular weight distribution across the datasets, with a consistent mode around 291.35 Dalton for most. Principal component analysis (PCA) was used to visualize and evaluate the chemical space coverage of the training and test sets for each of the five CYP450 isozymes. Figure 5 shows a scatterplot of the first two principal components, illustrating a similar distribution of chemical space between the training and test sets. This similarity helps define the applicability domain of the developed model, ensuring reliable assessment of new compounds chemically comparable to those in the training set. Ensuring that the test set is representative of the training set allows an accurate evaluation of the model's performance and avoids potential biases from dissimilar chemical spaces.
Evaluation metrics
The GTransCYPs model is evaluated using balanced accuracy (BA), Matthews correlation coefficient (MCC), precision (PRE), recall (REC), and F1-score. These metrics are calculated as follows:

$$BA = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$PRE = \frac{TP}{TP + FP}, \quad REC = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times PRE \times REC}{PRE + REC}$$

where \(TP\) is the number of positive samples correctly identified, \(TN\) the number of negative samples correctly identified, \(FP\) the number of negative samples incorrectly labeled as positive, and \(FN\) the number of positive samples incorrectly labeled as negative.
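These metrics are standard and can be computed directly with scikit-learn; the labels below are toy values for illustration only:

```python
from sklearn.metrics import (balanced_accuracy_score, matthews_corrcoef,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy labels: 1 = inhibitor, 0 = non-inhibitor
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy model predictions

print(f"BA:  {balanced_accuracy_score(y_true, y_pred):.3f}")
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"PRE: {precision_score(y_true, y_pred):.3f}")
print(f"REC: {recall_score(y_true, y_pred):.3f}")
print(f"F1:  {f1_score(y_true, y_pred):.3f}")
```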
Hyperparameters and environment setup
We optimized the hyperparameters of the GTransCYPs model, as outlined in Table 1, by searching for the best parameters during training. The experiments were executed on four NVIDIA GTX 1080 Ti GPUs using PyTorch. Details of the experimental environment are provided in Table 2; training was conducted for 20 epochs.
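For reference, a minimal sketch of such a training run, using the batch size, learning rate, and epoch count discussed in this section; the Adam optimizer and binary cross-entropy loss are assumptions (the optimizer is not specified here), and `model` and `dataset` are the GTransCYPs-style module and graph dataset from the sketches above:

```python
import torch
from torch_geometric.loader import DataLoader

train_loader = DataLoader(dataset, batch_size=100, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # lr per the sensitivity study
criterion = torch.nn.BCEWithLogitsLoss()

for epoch in range(20):  # 20 epochs, as reported
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch.x, batch.edge_index, batch.edge_attr, batch.batch)
        loss = criterion(logits.squeeze(-1), batch.y.float())
        loss.backward()
        optimizer.step()
```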
Performance of the proposed model
The performance of GTransCYPs is shown in Table 3. The model demonstrates accurate predictions, with BA scores ranging from 0.770 to 0.886, MCC scores from 0.534 to 0.770, REC scores from 0.823 to 0.916, PRE scores from 0.677 to 0.924, and F1-scores from 0.743 to 0.920.
Ablation study
We conducted an ablation study to assess the effectiveness of attention pooling in the GTransCYPs model. The variant models, labeled model-A, model-B, model-C, and model-D, use different pooling techniques: model-A is GTransCYPs with global mean pooling; model-B combines global mean pooling and global max pooling within the GTransCYPs framework; model-C uses top-k pooling; and model-D integrates an attention mechanism into the pooling layers.
From Table 4 it is evident that attention pooling in the GTransCYPs model outperforms all other pooling schemes on the 1A2 dataset in terms of BA, MCC, and F1-score. We conclude that GTransCYPs with attention pooling is effective owing to its superior performance, which can be attributed to the model's capability to extract the most significant information from each node and edge within the molecular data of the five CYP450 isozymes.
Hyperparameter sensitivity
In this study, we investigated the influence of various hyperparameters on the performance of the proposed model. The insights derived from our sensitivity analysis of key parameters throughout the training process, including the number of attention heads, batch size, learning rate, and embedding dimension, are illustrated in Fig. 6.
Effect of the number of attention heads
We explored the effect of the number of attention heads on GTransCYPs performance. The experimental results indicate that the area under the ROC curve (AUC) increased from 0.880 with 1 attention head to a peak of 0.897 with 4 attention heads. Beyond this peak, the AUC declined to 0.883 with 8 attention heads and 0.880 with 16 attention heads. The decrease in performance with larger numbers of attention heads may be attributed to excessive model complexity and information redundancy. The optimal number of attention heads is therefore 4, as it provides a balanced trade-off between relevant information and complexity.
Effect of the batch size number
We investigated the impact of batch size on model performance. The highest AUC, 0.897, was observed with a batch size of 100, whereas batch sizes of 64 and 128 yielded slightly lower AUC scores of 0.878 and 0.882, respectively. A further decrease was noted with a batch size of 256, where the AUC dropped to 0.879; this decline can be attributed to the smaller number of parameter updates per epoch at large batch sizes, which can disrupt model convergence and result in less accurate information extraction from the data. Thus, in this study, a batch size of 100 emerges as the optimal choice for enhancing model performance.
Effect of the learning rate
The analysis indicates that the learning rate affects the AUC, which peaks at 0.897 with a learning rate of 0.1 and drops to 0.823 at a learning rate of 0.0001. An overly small learning rate can diminish the model's ability to identify important patterns in the data, underscoring the importance of selecting an appropriate learning rate to optimize model performance.
Effect of the size of embedding dimension
The AUC tends to increase with the embedding dimension, peaking at 0.897 with a dimension of 128. This indicates an enhanced ability to capture crucial information from the data as the embedding dimension grows. However, increasing the embedding dimension also increases the number of model parameters, prolonging training time and requiring more computational resources.
Comparison of GTransCYPs with existing methods
To ensure a comprehensive assessment of our proposed model, we compared GTransCYPs with existing methods: SuperCYP [30], iCYP-MFE [16], DeepCYPs [23], GAT_GCN [22], GraphSAGE [31], GAT [32], GCN [33], and GIN [34]. Figure 7 shows a comparison between GTransCYPs and the other methods in predicting CYP450 inhibition across the five isozyme datasets; GTransCYPs outperforms the advanced methods on most datasets. Table 5 reports that the GraphSAGE model outperforms GAT_GCN, DeepCYPs, GAT, GCN, and GIN with a BA of 0.870 on the 1A2 dataset; however, GTransCYPs surpasses GraphSAGE with an increase of 1.8%. On the 2C9, 2C19, and 2D6 datasets, GTransCYPs outperformed all models, with performance improvements of 9%, 0.6%, and 0.6%, respectively. Compared to the other methods, GTransCYPs achieved the highest MCC scores, with improvements of 4.3%, 26.4%, and 0.2% on the 1A2, 2C9, and 2C19 datasets, respectively, as detailed in Table 6. The proposed model also performs well on the F1-score evaluation: Table 7 shows that GTransCYPs outperforms all other methods across most datasets, with improvements of 1.2%, 17.9%, 0.3%, and 2.9% on the 1A2, 2C9, 2C19, and 2D6 datasets, respectively. Table 8 presents representative molecular structures of both inhibitors and non-inhibitors of the CYP450 isozymes. Meanwhile, Table 9 displays the inhibition activity of various compounds against the CYP450 isozymes as predicted by GIN, GAT, GCN, GraphSAGE, and GTransCYPs. Although all models predicted correctly, the proposed model demonstrates high prediction confidence for inhibition: GTransCYPs achieved scores of 90.35 and 89.27 when predicting the inhibitory activities of SDI 4239706 and SDI 17385329, respectively, a significant margin over the other models.
Webserver
The GTransCYPs source code is available at https://github.com/zonwoo/GTransCYPs and can be hosted locally using the Streamlit platform (https://www.streamlit.io). The platform is designed to support researchers and practitioners in chemistry, biology, and pharmacology in their drug development research, especially for online prediction of CYP450 inhibition. The interface is designed for simplicity and user-friendliness, enhancing the experience for both novice and experienced users. Figure 8 presents an overview of the user interface along with examples of processing snippets.
Conclusion
Predicting CYP450 inhibition is one of the key challenges in drug research and holds significant implications across various clinical applications. This study introduced GTransCYPs, a novel graph representation learning model for CYP450 inhibition prediction. GTransCYPs first learns low-dimensional molecular representations and constructs topological graphs by integrating attention mechanisms with a transformer feature architecture. It additionally integrates graph pooling to simplify the complexity of graph structures by preserving a designated number of informative nodes within each subgraph; this improves efficiency and sharpens the focus on information pertinent to predicting CYP450 inhibitors. According to the experimental results, GTransCYPs achieves competitive performance compared with existing methods, and the ablation experiments clarify the key role of the pooling layers in boosting the predictive capability of the proposed method. Nevertheless, several improvements should be considered. Due to the limited size of the publicly available datasets, particularly the 3A4 and 2D6 datasets, there is a significant disparity between the positive and negative classes; this imbalance can lead to suboptimal model performance. Model interpretability also remains a limitation, as it is crucial to understand how the proposed model learns patterns in molecular substructures and identifies molecules with potential inhibitory activity during the molecule design process. In future work, we will explore strategies such as transfer learning and semi-supervised learning based on graph representations, and we will focus on model interpretability to provide deeper insights into how the model learns patterns in molecular substructures.
Availability of data and materials
The CYP450 datasets of this work can be found at https://tdcommons.ai (therapeutics data commons). The source codes are available at https://github.com/zonwoo/GTransCYPs.
References
Rendic SP, Guengerich FP (2021) Human Family 1–4 cytochrome P450 enzymes involved in the metabolic activation of xenobiotic and physiological chemicals: an update. Arch Toxicol 95:395–472
Peter GF (1994) Catalytic selectivity of human cytochrome P450 enzymes: relevance to drug metabolism and toxicity. Toxicol Lett 70:133–138
Zhao M, Ma J, Li M et al (2021) Cytochrome P450 enzymes and drug metabolism in humans. IJMS 22:12808
Song Y, Li C, Liu G et al (2021) Drug-metabolizing cytochrome P450 enzymes have multifarious influences on treatment outcomes. Clin Pharmacokinet 60:585–601
Yu M-S, Lee H-M, Park A et al (2018) In silico prediction of potential chemical reactions mediated by human enzymes. BMC Bioinform 19:207
Evans WE, Relling MV (1999) Pharmacogenomics: translating functional genomics into rational therapeutics. Science 286:487–491
Di L (2014) The role of drug metabolizing enzymes in clearance. Expert Opin Drug Metab Toxicol 10:379–393
Wang Y, Bryant SH, Cheng T et al (2017) PubChem BioAssay: 2017 update. Nucleic Acids Res 45:D955–D963
Lee JH, Basith S, Cui M et al (2017) In silico prediction of multiple-category classification model for cytochrome P450 inhibitors and non-inhibitors using machine-learning method. SAR QSAR Environ Res 28:863–874
Kato H (2020) Computational prediction of cytochrome P450 inhibition and induction. Drug Metab Pharmacokinet 35:30–44
Plonka W, Stork C, Šícho M et al (2021) CYPlebrity: machine learning models for the prediction of inhibitors of cytochrome P450 enzymes. Bioorg Med Chem 46:116388
Xu M, Lu Z, Wu Z et al (2023) Development of In silico models for predicting potential time-dependent inhibitors of cytochrome P450 3A4. Mol Pharmaceutics 20:194–205
Wu Z, Lei T, Shen C et al (2019) ADMET evaluation in drug discovery. 19. Reliable prediction of human cytochrome P450 inhibition using artificial intelligence approaches. J Chem Inf Model 59:4587–4601
Goldwaser E, Laurent C, Lagarde N et al (2022) Machine learning-driven identification of drugs inhibiting cytochrome P450 2C9. PLoS Comput Biol 18:e1009820
Wang N-N, Wang X-G, Xiong G-L et al (2022) Machine learning to predict metabolic drug interactions related to cytochrome P450 isozymes. J Cheminform 14:23
Nguyen-Vo T-H, Trinh QH, Nguyen L et al (2022) iCYP-MFE: identifying human cytochrome P450 inhibitors using multitask learning and molecular fingerprint-embedded encoding. J Chem Inf Model 62:5059–5068
Su B-H, Tu Y, Lin C et al (2015) Rule-based prediction models of cytochrome P450 inhibition. J Chem Inf Model 55:1426–1434
Tang M, Li B, Chen H (2023) Application of message passing neural networks for molecular property prediction. Curr Opin Struct Biol 81:102616
Buterez D, Janet JP, Kiddle SJ et al (2024) Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat Commun 15:1517
Wei Z, Zhao C, Zhang M et al (2024) Meta-DHGNN: method for CRS-related cytokines analysis in CAR-T therapy based on meta-learning directed heterogeneous graph neural network. Brief Bioinform 25:bbae104
Meller A, Ward M, Borowsky J et al (2023) Predicting locations of cryptic pockets from single protein structures using the PocketMiner graph neural network. Nat Commun 14:1177
Qiu M, Liang X, Deng S et al (2022) A unified GCNN model for predicting CYP450 inhibitors by using graph convolutional neural networks with attention mechanism. Comput Biol Med 150:106177
Ai D, Cai H, Wei J et al (2023) DEEPCYPs: a deep learning platform for enhanced cytochrome P450 activity prediction. Front Pharmacol 14:1099093
Gillioz A, Riesen K (2023) Graph-based pattern recognition on spectral reduced graphs. Pattern Recognition 144:109859
Chen B, Barzilay R, Jaakkola T (2019) Path-augmented graph transformer network. arXiv preprint
Ramsundar B, Eastman P, Walters P et al (2019) Deep learning for the life sciences. O’Reilly Media, Sebastopol
RDKit: Open-source cheminformatics. https://zenodo.org/record/3732262.
Veith H, Southall N, Huang R et al (2009) Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries. Nat Biotechnol 27:1050–1055
Huang K, Fu T, Gao W et al (2022) Artificial intelligence foundation for therapeutic science. Nat Chem Biol 18:1033–1036
Banerjee P, Dunkel M, Kemmler E et al (2020) SuperCYPsPred—a web server for the prediction of cytochrome activity. Nucleic Acids Res 48:W580–W585
Guo Z, Yu W, Zhang C et al (2020) GraSeq: graph and sequence fusion learning for molecular property prediction. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM). ACM, pp 435–443
Feng Y-Y, Yu H, Feng Y-H et al (2022) Directed graph attention networks for predicting asymmetric drug–drug interactions. Brief Bioinform 23:bbac151
Zhao T, Hu Y, Valsdottir LR et al (2021) Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief Bioinform 22:2141–2150
Zheng K, Zhao H, Zhao Q et al (2022) NASMDR: a framework for miRNA-drug resistance prediction using efficient neural architecture search and graph isomorphism networks. Brief Bioinform 23:bbac338
Acknowledgements
Not applicable.
Funding
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) in 2024 (2024-0-00023).
Author information
Authors and Affiliations
Contributions
Conceptualization: C.Z., J.D.K.; Methodology: C.Z.; Investigation: C.Z., S.N.N., S.M.; Formal Analysis: C.Z., J.D.K., S.N.N., S.M.; Visualization: C.Z., S.N.N., J.D.K., S.M.; Supervision: J.D.K.; Writing—original draft: C.Z., S.N.N.; Writing—review C.Z., J.D.K., S.N.N., S.M.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zonyfar, C., Ngnamsie Njimbouom, S., Mosalla, S. et al. GTransCYPs: an improved graph transformer neural network with attention pooling for reliably predicting CYP450 inhibitors. J Cheminform 16, 119 (2024). https://doi.org/10.1186/s13321-024-00915-z